PDFs have a font map (the ToUnicode CMap) that tells you which actual character corresponds to each rendered glyph, so you can copy/paste. Unfortunately, these maps can lie, so the character you copy is not the one you see. If you're unlucky, it's total gibberish.
[image]
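A minimal sketch of the lying-map problem, with hypothetical glyph codes (the CMap excerpt below is made up, not from a real PDF):

```python
# Sketch: a PDF's ToUnicode CMap maps rendered glyph codes to the
# characters you get on copy/paste. Nothing forces the map to be truthful.
import re

# Tiny excerpt in ToUnicode bfchar syntax: <glyph code> <Unicode code point>.
cmap_src = """
beginbfchar
<0041> <0041>
<0042> <0058>
endbfchar
"""

def parse_bfchar(src):
    """Return {glyph_code: unicode_char} from a bfchar block."""
    mapping = {}
    for g, u in re.findall(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>", src):
        mapping[int(g, 16)] = chr(int(u, 16))
    return mapping

cmap = parse_bfchar(cmap_src)
# The page renders glyphs 0x41 0x42, which draw as "AB" on screen, but
# copy/paste goes through the CMap, and the second entry lies.
copied = "".join(cmap[c] for c in [0x41, 0x42])
print(copied)  # "AX", not the "AB" you see
```

When every entry lies, you get the gibberish case: the only way out is to re-OCR the rendered page.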
PDFs can have invisible text that only shows up when you try to extract it. "Measurement in your home" is only here once...or is it?
[image]
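The mechanism here is the text rendering mode: `3 Tr` in the content stream draws nothing but still carries text. A toy sketch over a simplified operator list (not a real content-stream parser):

```python
# Sketch: PDF's Tr operator sets the text rendering mode; mode 3 is
# "invisible". An extractor that ignores the mode returns text the
# viewer never painted.
ops = [
    ("Tr", 0),                     # mode 0 = fill (visible)
    ("Tj", "Measurement in your home"),
    ("Tr", 3),                     # mode 3 = invisible
    ("Tj", "Measurement in your home"),
]

def extract(ops, include_invisible=True):
    mode, out = 0, []
    for op, arg in ops:
        if op == "Tr":
            mode = arg
        elif op == "Tj" and (include_invisible or mode != 3):
            out.append(arg)
    return out

print(len(extract(ops)))                           # 2: naive extraction sees it twice
print(len(extract(ops, include_invisible=False)))  # 1: what's actually visible
```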
Math is a whole can of worms. Remember the font map problem? Well, math is almost always random characters - here we get some strange Tamil/Amharic combo.
[image]
Math bounding boxes are always fun - see how each formula is broken up into lots of tiny sections? Putting them together is a great time!
[image]
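One common heuristic for stitching those tiny sections back together: merge boxes whose horizontal gap is small and whose vertical ranges overlap. The thresholds and coordinates below are made up for illustration:

```python
# Sketch: merge fragmented formula boxes. Boxes are (x0, y0, x1, y1).
def merge_boxes(boxes, max_gap=5):
    boxes = sorted(boxes)            # left-to-right by x0
    merged = [list(boxes[0])]
    for x0, y0, x1, y1 in boxes[1:]:
        last = merged[-1]
        overlaps_y = min(last[3], y1) - max(last[1], y0) > 0
        if overlaps_y and x0 - last[2] <= max_gap:
            last[2] = max(last[2], x1)   # extend right edge
            last[1] = min(last[1], y0)   # grow vertical extent
            last[3] = max(last[3], y1)
        else:
            merged.append([x0, y0, x1, y1])
    return [tuple(b) for b in merged]

# Three fragments of one formula (note the superscript poking above):
frags = [(10, 100, 20, 110), (22, 100, 35, 110), (37, 98, 50, 112)]
print(merge_boxes(frags))  # one box: [(10, 98, 50, 112)]
```

The hard part in practice is picking `max_gap`: too small and the formula stays shredded, too large and you swallow neighboring text.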
Once upon a time, someone decided that their favorite letters should be connected together into one character - like ffi or fl. Unfortunately, PDFs are inconsistent with this, and sometimes will totally skip ligatures - very ecient of them.
[image]
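When the PDF does emit a proper Unicode ligature code point, NFKC normalization folds it back to plain letters:

```python
# Sketch: fold Unicode ligature code points back to their component
# letters with NFKC compatibility normalization.
import unicodedata

text = "very e\ufb03cient of them"   # U+FB03 is the "ffi" ligature
fixed = unicodedata.normalize("NFKC", text)
print(fixed)  # "very efficient of them"
```

This only helps in the well-behaved case. When the PDF skips the ligature entirely and you extract "ecient", the letters are simply gone, and no normalization will bring them back.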
Not all text in a PDF is correct. Some PDFs are digital, and the text was added on creation. But others have had invisible OCR text added, sometimes based on pretty bad text detection. That's when you get this mess:
[image]
Overlapping text elements can get crazy - see how the watermark overlaps all the other text? Forget about finding good reading order here.
[image]
I've been showing you somewhat nice line bounding boxes. But PDFs just have character positions inside - you have to postprocess to join them into lines. In tables, this can get tricky, since it's hard to know when a new cell starts:
[image]
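A minimal sketch of that postprocessing step: group characters by similar baseline y, sort each group by x, and insert a space when the horizontal gap exceeds a threshold. The tolerances are made up, and real code has to account for font size and char width (which is exactly where table cells get tricky):

```python
# Sketch: join raw (x, y, char) positions into lines of text.
def chars_to_lines(chars, y_tol=2, space_gap=4):
    lines = {}
    for x, y, ch in chars:
        # Snap to an existing baseline within y_tol, else start a new one.
        key = next((k for k in lines if abs(k - y) <= y_tol), y)
        lines.setdefault(key, []).append((x, ch))
    out = []
    for y in sorted(lines, reverse=True):        # top of page first
        row = sorted(lines[y])                   # left to right
        text, prev_x = row[0][1], row[0][0]
        for x, ch in row[1:]:
            if x - prev_x > space_gap:
                text += " "                      # big gap = word break
            text += ch
            prev_x = x
        out.append(text)
    return out

chars = [(0, 100, "t"), (3, 100, "o"), (12, 100, "g"), (15, 100, "o"),
         (0, 80, "h"), (3, 80, "i")]
print(chars_to_lines(chars))  # ['to go', 'hi']
```

In a table, a gap of 30 points might be a cell boundary or just a sparse column, and a single threshold can't tell you which.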
LLMs work well for many cases, but they aren't the right tool when:
- You're latency/throughput sensitive
- You need layout/positional info
- Absolute accuracy and low hallucination risk matter
- You hit edge cases like complex tables, etc.
- You want deterministic customization
I commend you on your work; I've worked a lot with PDFs and know the struggle. I do have to point out, though: are we even using the term OCR anymore? LLMs can understand them pretty well using both the image and the text, so I ask: why torture yourself squeezing out the last
I am a huge fan of your work, Vik. I’ve followed Surya, Marker, etc. for as long as I’ve known of them. At one point, we were working on the same problem. I moved into databases, but not before I formed a solid understanding of the PDF issue. My question is: why not take a more
You forgot to mention that PDFs can even have JavaScript, and you might see different content unless you can execute the JavaScript. 😜
PDF parsing is done by assuming PDF is a file format like any other, but in fact it's an image format masquerading as a file format. How about using OpenCV to convert/parse PDFs into another format?
Impressive that you beat Docling! Does your library retain which page the markdown came from? BTW: I feel your pain. Legal documents are fun too: a margin of line numbers next to each line of text makes your page look like a table to markdown. I'm interested in your project!
Yeah, the line numbers are fun... Yes, you can paginate the output to keep the page numbers, or keep the page numbers from the document directly if you want.
PDFs are a wild ride. You go thinking you’re grabbing text, and end up with a puzzle instead. Each edge case is like a new level in a game—frustrating but kinda hilarious once you step back.
my biggest gripe is parsing image PDFs - no structured data, just an image. no way out except image to text APIs
This might be a naive thought, this is not my wheelhouse, but is there any value in printing to postscript and extracting the text there?
I nearly took a job at a PDF-parsing company… they described how complicated it is. Seems like a crappy format, all up… ! Glad I didn’t now.
I’m confused. Why is this a reverse engineering problem? It looks like you just have to figure out how PDFs are made.
tried to formally specify the PDF format, and their conclusion was that this is impossible! Sad for such a widespread format.
This was exactly my problem statement. I tried to figure out the hierarchical structure of the document (sections, subsections) and build a tree out of it. I used a custom DBSCAN algorithm where the coordinates of boxes are fed in and clustered in a hierarchical sense.
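A minimal sketch of that clustering idea (not the poster's actual code): a tiny DBSCAN over box centers, grouping blocks that sit close together on the page. The `eps`/`min_pts` values and the points are made up:

```python
# Sketch: plain DBSCAN over 2-D box centers; -1 labels noise points.
def dbscan(points, eps=15, min_pts=2):
    labels = [None] * len(points)          # None = unvisited
    def near(i):
        return [j for j, q in enumerate(points)
                if (points[i][0] - q[0]) ** 2 + (points[i][1] - q[1]) ** 2 <= eps ** 2]
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = near(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # noise (may later become a border point)
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] in (None, -1):
                was_noise = labels[j] == -1
                labels[j] = cluster
                if not was_noise:
                    nj = near(j)
                    if len(nj) >= min_pts:  # core point: keep expanding
                        queue.extend(nj)
        cluster += 1
    return labels

# Two visual groups of boxes (centers), plus one stray caption:
centers = [(10, 10), (15, 12), (12, 18), (200, 10), (205, 14), (400, 300)]
print(dbscan(centers))  # [0, 0, 0, 1, 1, -1]
```

Building the section/subsection *tree* on top of this presumably means re-clustering within each cluster at a tighter `eps`, which matches the "hierarchical sense" described above.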
Now try a couple of plain paragraphs of non-Latin characters. Just copy a random page of Dostoyevsky and print it to PDF. In most cases the retrieved text would be gibberish
Cool, I'm using Marker and a couple of others, because each one fails on different files. In the first image, the box doesn't always get parsed (depending on the OCR). In the second, the graph is text, so the result (resultado) I get is 30 or 40, not 39,48 ng/mL. You can guess the 3rd.
[image]
[image]
[image]
I’ve been building a PDF annotator, but every time I export, the header gets corrupted.
What would you say is the best way to extract complex tables from PDFs without losses?
Crazy that people are effectively obliged to pay to use this format (for anything beyond just reading it). It works, but you're severely limited by language and by their use of it, to the point that you're forced to work around foreign languages because they haven't bothered optimising for them.
My perspective (RichardKCollin2, The Internet Foundation): PDF has slowed the development of the Internet and open knowledge by decades. They simply built a deliberate monopoly and did not use that position to make improvements. They made so much profit without effort that they do not