PDFs have a font map that tells you what actual character is connected to each rendered character, so you can copy/paste. Unfortunately, these maps can lie, so the character you copy is not what you see. If you're unlucky, it's total gibberish.
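To make the failure mode concrete, here's a toy sketch (not a real PDF parser — glyph IDs and maps are invented): a PDF font's ToUnicode-style map takes internal glyph IDs to Unicode, and extraction has no choice but to trust it.

```python
# Illustrative sketch: extraction decodes glyph IDs through the font's
# Unicode map. If the map lies, the copied text diverges from the page.

def extract(glyph_ids, to_unicode):
    """Decode a run of glyph IDs using the font's glyph-to-Unicode map."""
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_ids)

# The renderer draws glyphs 5, 3, 8, 8, 1 -- which *look like* "hello".
glyph_ids = [5, 3, 8, 8, 1]

honest_map = {5: "h", 3: "e", 8: "l", 1: "o"}
lying_map = {5: "x", 3: "q", 8: "#", 1: "7"}  # same glyphs, bogus Unicode

print(extract(glyph_ids, honest_map))  # hello
print(extract(glyph_ids, lying_map))   # xq##7 -- what you actually copy
```

The page renders identically in both cases; only copy/paste (and text extraction) sees the difference.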
PDFs can have invisible text that only shows up when you try to extract it. "Measurement in your home" is only here once...or is it?
Math is a whole can of worms. Remember the font map problem? Well, math is almost always random characters - here we get some strange Tamil/Amharic combo.
Math bounding boxes are always fun - see how each formula is broken up into lots of tiny sections? Putting them together is a great time!
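One way to put the fragments back together is to greedily union boxes that sit close to each other. A minimal sketch (the gap threshold is an invented tuning parameter, and a single greedy pass won't catch chains of merges):

```python
# Sketch: merge nearby fragment boxes (x0, y0, x1, y1) into larger regions.
# Boxes whose gap-expanded rectangles overlap are unioned together.

def merge_boxes(boxes, gap=5.0):
    merged = []
    for box in sorted(boxes):
        for i, m in enumerate(merged):
            # Expand by `gap` on each side and test for rectangle overlap.
            if (box[0] <= m[2] + gap and m[0] <= box[2] + gap and
                    box[1] <= m[3] + gap and m[1] <= box[3] + gap):
                merged[i] = (min(m[0], box[0]), min(m[1], box[1]),
                             max(m[2], box[2]), max(m[3], box[3]))
                break
        else:
            merged.append(box)
    return merged

frags = [(10, 10, 20, 18), (22, 10, 35, 18), (80, 50, 95, 60)]
print(merge_boxes(frags))  # first two fragments join; the third stays apart
```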
Once upon a time, someone decided that their favorite letters should be connected together into one character - like ffi or fl. Unfortunately, PDFs are inconsistent with this, and sometimes will totally skip ligatures - very ecient of them.
Not all text in a PDF is correct. Some PDFs are digital, and the text was added on creation. But others have had invisible OCR text added, sometimes based on pretty bad text detection. That's when you get this mess:
Overlapping text elements can get crazy - see how the watermark overlaps all the other text? Forget about finding good reading order here.
I've been showing you somewhat nice line bounding boxes. But PDFs just have character positions inside - you have to postprocess to join them into lines. In tables, this can get tricky, since it's hard to know when a new cell starts:
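The naive version of that postprocessing looks something like this sketch: cluster characters on their vertical midpoint, then sort each cluster left-to-right. The tolerance is a made-up parameter, and real layouts (rotated text, multi-column pages, table cells) need far more care — which is exactly where it gets tricky.

```python
# Sketch: group per-character boxes into lines by vertical midpoint,
# then read each line left-to-right.

def chars_to_lines(chars, y_tol=2.0):
    """chars: list of (char, x0, y0, x1, y1). Returns a list of line strings."""
    lines = []  # each entry: (y_mid, [(x0, char), ...])
    for ch, x0, y0, x1, y1 in chars:
        y_mid = (y0 + y1) / 2
        for line in lines:
            if abs(line[0] - y_mid) <= y_tol:
                line[1].append((x0, ch))
                break
        else:
            lines.append((y_mid, [(x0, ch)]))
    lines.sort(key=lambda line: line[0])  # top-to-bottom
    return ["".join(c for _, c in sorted(chs)) for _, chs in lines]

# Characters arrive in arbitrary order, as they often do in the PDF stream.
chars = [("i", 12, 0, 14, 10), ("H", 0, 0, 10, 10),
         ("o", 6, 20, 12, 30), ("y", 0, 20, 5, 30)]
print(chars_to_lines(chars))  # ['Hi', 'yo']
```

Note there's no notion of a cell boundary here: two table columns on the same baseline would simply run together into one string.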
You might be wondering why you should even bother with the text inside PDFs. The answer is that a lot of PDFs have good text, and it's faster and more accurate to just pull it out.
This is what we do with marker - github.com/datalab-to/mar - we only OCR if the text is bad.
Anyways, back to fixing more crazy edge cases. Let me know if you've come across any other PDF weirdness.
LLMs work well for many cases, but in these cases, they aren't the right tool:
- Latency/throughput sensitive
- Need layout/positional info
- Absolute accuracy and low hallucination risk matter
- Edge cases like complex tables, etc.
- Want deterministic customization
Mostly for the github stars
(seriously though, because people need/want it)
I am a huge fan of your work, Vik. I’ve followed Surya, Marker, etc. for as long as I’ve known of them.
At one point, we were working on the same problem. I moved into databases, but not before I formed a solid understanding of the PDF issue.
My question is - why not take a more
I actually made github.com/datalab-to/pdf as a thin wrapper over pypdfium2 - the problem with determinism is it can't account for all the crazy edge cases!
You forgot to mention that PDFs can even contain JavaScript, and you might see different content unless you can execute it.
I think you're right in many cases, but pure digital PDFs (a pretty high %) actually have good text
Impressive that you beat docling!
Does your library retain what page the markdown came from?
Btw: I feel your pain. Legal documents are fun too: a margin of line numbers alongside each line of text makes the page look like a table to a markdown converter.
I'm interested in your project!
Yeah, the line numbers are fun...
Yes, you can paginate the output to keep the page numbers - or keep the page numbers from the document directly if you want
PDFs are a wild ride. You go thinking you’re grabbing text, and end up with a puzzle instead. Each edge case is like a new level in a game—frustrating but kinda hilarious once you step back.
Yeah, I thought I would stop finding insane edge cases after a while, but it never ends
I worked on a side project parsing pdfs for maybe 1.5 weekends and immediately concluded
batshit crazy
How many engineering hours have been collectively lost to this insanity
my biggest gripe is parsing image PDFs - no structured data, just an image.
no way out except image to text APIs
This might be a naive thought, this is not my wheelhouse, but is there any value in printing to postscript and extracting the text there?
I nearly took a job at a PDF-parsing company… they described how complicated it is.
Seems like a crappy format, all up… !
Glad I didn’t now.
Why isn't using vision and then discarding font map / bounding box data effective here?
I’m confused
Why is this a reverse engineering problem? It looks like you just have to figure out how PDFs are made.
tried to formally specify the PDF format, and their conclusion was that this is impossible! Sad for such a widespread format.
this was exactly my problem statement. I tried to figure out the hierarchical structure of the document (sections, subsections) and build a tree out of it. I used a custom DBSCAN algorithm where box coordinates are fed in and clustered in a hierarchical sense.
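The clustering idea above can be sketched with a minimal pure-Python DBSCAN over box centroids. This is an illustration of the general technique, not the commenter's actual implementation; the `eps` and `min_pts` values are invented.

```python
import math

# Minimal DBSCAN over 2-D points (e.g. centroids of layout bounding boxes).
# Points within `eps` of each other chain into clusters; isolated points
# are marked as noise (-1).

def dbscan(points, eps=15.0, min_pts=2):
    labels = [None] * len(points)  # None = unvisited, -1 = noise

    def neighbors(i):
        return [j for j, p in enumerate(points)
                if math.dist(points[i], p) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise (may later be claimed as a border point)
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: join, but don't expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:
                queue.extend(jn)  # core point: keep expanding the cluster
        cluster += 1
    return labels

# Three boxes stacked in one column, two boxes far away: two clusters.
centroids = [(0, 0), (0, 5), (0, 10), (100, 100), (100, 105)]
print(dbscan(centroids, eps=10, min_pts=2))  # [0, 0, 0, 1, 1]
```

Running it hierarchically (re-clustering each cluster with a smaller `eps`) is one way to get the section/subsection tree described above.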
Cool, I'm using marker and a couple of others, because each one fails on different files.
In the first image, the box doesn't always get parsed (depending on the OCR). In the second, the graph is text, so the result (resultado) I get is 30 or 40, not 39,48 ng/mL. You can guess the third.
What would you say is the best way to extract complex tables from PDF without losses ?
It is the only tool that works for my specific case. Let me know if you need some extra hand. I'd be glad to be part of your team. Cheers
Crazy that people are obliged to pay to do anything with this format beyond just reading it. It works, but you're severely limited by language and by their use of it, to the point that you're forced to work around foreign languages because they haven't bothered optimising for them.
My RichardKCollin2, The Internet Foundation, perspective:
PDF has slowed development of the Internet and open knowledge by decades. They simply built a deliberate monopoly and did not use that position to make improvements. They made so much profit without effort, they do not