PDFs have a font map that tells you what actual character is connected to each rendered character, so you can copy/paste. Unfortunately, these maps can lie, so the character you copy is not what you see. If you're unlucky, it's total gibberish.
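To make the failure mode concrete, here's a toy sketch (not a real PDF parser — glyph IDs and maps are invented): a PDF font's ToUnicode-style map takes internal glyph IDs to Unicode, and extraction has no choice but to trust it.

```python
# Illustrative sketch: extraction decodes glyph IDs through the font's
# Unicode map. If the map lies, the copied text diverges from the page.

def extract(glyph_ids, to_unicode):
    """Decode a run of glyph IDs using the font's glyph-to-Unicode map."""
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_ids)

# The renderer draws glyphs 5, 3, 8, 8, 1 -- which *look like* "hello".
glyph_ids = [5, 3, 8, 8, 1]

honest_map = {5: "h", 3: "e", 8: "l", 1: "o"}
lying_map = {5: "x", 3: "q", 8: "#", 1: "7"}  # same glyphs, bogus Unicode

print(extract(glyph_ids, honest_map))  # hello
print(extract(glyph_ids, lying_map))   # xq##7 -- what you actually copy
```

The page renders identically in both cases; only copy/paste (and text extraction) sees the difference.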
PDFs can have invisible text that only shows up when you try to extract it. "Measurement in your home" is only here once...or is it?
Math is a whole can of worms. Remember the font map problem? Well, math is almost always random characters - here we get some strange Tamil/Amharic combo.
Math bounding boxes are always fun - see how each formula is broken up into lots of tiny sections? Putting them together is a great time!
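One way to put the fragments back together is to greedily union boxes that sit close to each other. A minimal sketch (the gap threshold is an invented tuning parameter, and a single greedy pass won't catch chains of merges):

```python
# Sketch: merge nearby fragment boxes (x0, y0, x1, y1) into larger regions.
# Boxes whose gap-expanded rectangles overlap are unioned together.

def merge_boxes(boxes, gap=5.0):
    merged = []
    for box in sorted(boxes):
        for i, m in enumerate(merged):
            # Expand by `gap` on each side and test for rectangle overlap.
            if (box[0] <= m[2] + gap and m[0] <= box[2] + gap and
                    box[1] <= m[3] + gap and m[1] <= box[3] + gap):
                merged[i] = (min(m[0], box[0]), min(m[1], box[1]),
                             max(m[2], box[2]), max(m[3], box[3]))
                break
        else:
            merged.append(box)
    return merged

frags = [(10, 10, 20, 18), (22, 10, 35, 18), (80, 50, 95, 60)]
print(merge_boxes(frags))  # first two fragments join; the third stays apart
```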
Once upon a time, someone decided that their favorite letters should be connected together into one character - like ffi or fl. Unfortunately, PDFs are inconsistent with this, and sometimes will totally skip ligatures - very ecient of them.
Not all text in a PDF is correct. Some PDFs are digital, and the text was added on creation. But others have had invisible OCR text added, sometimes based on pretty bad text detection. That's when you get this mess:
Overlapping text elements can get crazy - see how the watermark overlaps all the other text? Forget about finding good reading order here.
I've been showing you somewhat nice line bounding boxes. But PDFs just have character positions inside - you have to postprocess to join them into lines. In tables, this can get tricky, since it's hard to know when a new cell starts:
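The naive version of that postprocessing looks something like this sketch: cluster characters on their vertical midpoint, then sort each cluster left-to-right. The tolerance is a made-up parameter, and real layouts (rotated text, multi-column pages, table cells) need far more care — which is exactly where it gets tricky.

```python
# Sketch: group per-character boxes into lines by vertical midpoint,
# then read each line left-to-right.

def chars_to_lines(chars, y_tol=2.0):
    """chars: list of (char, x0, y0, x1, y1). Returns a list of line strings."""
    lines = []  # each entry: (y_mid, [(x0, char), ...])
    for ch, x0, y0, x1, y1 in chars:
        y_mid = (y0 + y1) / 2
        for line in lines:
            if abs(line[0] - y_mid) <= y_tol:
                line[1].append((x0, ch))
                break
        else:
            lines.append((y_mid, [(x0, ch)]))
    lines.sort(key=lambda line: line[0])  # top-to-bottom
    return ["".join(c for _, c in sorted(chs)) for _, chs in lines]

# Characters arrive in arbitrary order, as they often do in the PDF stream.
chars = [("i", 12, 0, 14, 10), ("H", 0, 0, 10, 10),
         ("o", 6, 20, 12, 30), ("y", 0, 20, 5, 30)]
print(chars_to_lines(chars))  # ['Hi', 'yo']
```

Note there's no notion of a cell boundary here: two table columns on the same baseline would simply run together into one string.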
You might be wondering why you should even bother with the text inside PDFs. The answer is that a lot of PDFs have good text, and it's faster and more accurate to just pull it out.
This is what we do with marker - github.com/datalab-to/mar - we only OCR if the text is bad.
Anyways, back to fixing more crazy edge cases. Let me know if you've come across any other PDF weirdness.
LLMs work well for many cases, but in these cases, they aren't the right tool:
- Latency/throughput sensitive
- Need layout/positional info
- Absolute accuracy and low hallucination risk matter
- Edge cases like complex tables, etc.
- Want deterministic customization
Mostly for the github stars
(seriously though, because people need/want it)
I am a huge fan of your work, Vik. I’ve followed Surya, Marker, etc. for as long as I’ve known of them.
At one point, we were working on the same problem. I moved into databases, but not before I formed a solid understanding of the PDF issue.
My question is - why not take a more
I actually made github.com/datalab-to/pdf as a thin wrapper over pypdfium2 - the problem with determinism is it can't account for all the crazy edge cases!
You forgot to mention that PDFs can even contain JavaScript, and you might see different content unless you can execute it.
I think you're right in many cases, but pure digital PDFs (a pretty high %) actually have good text
Impressive that you beat docling!
Does your library retain what page the markdown came from?
Btw: I feel your pain. Legal documents are fun too: a margin of line numbers alongside each line of text makes the page look like a table to a markdown converter.
I'm interested in your project!
Yeah, the line numbers are fun...
Yes, you can paginate the output to keep the page numbers - or keep the page numbers from the document directly if you want
PDFs are a wild ride. You go thinking you’re grabbing text, and end up with a puzzle instead. Each edge case is like a new level in a game—frustrating but kinda hilarious once you step back.
Yeah, I thought I would stop finding insane edge cases after a while, but it never ends
I worked on a side project parsing pdfs for maybe 1.5 weekends and immediately concluded
batshit crazy
How many engineering hours have been collectively lost to this insanity
my biggest gripe is parsing image PDFs - no structured data, just an image.
no way out except image to text APIs
This might be a naive thought, this is not my wheelhouse, but is there any value in printing to postscript and extracting the text there?
I nearly took a job at a PDF-parsing company… they described how complicated it is.
Seems like a crappy format, all up… !
Glad I didn’t now.
Why isn't using vision and then discarding font map / bounding box data effective here?
I’m confused
Why is this a reverse engineering problem? It looks like you just have to figure out how PDFs are made.
tried to formally specify the PDF format, and their conclusion was that this is impossible! Sad for such a widespread format.
this was exactly my problem statement. I tried to figure out the hierarchical structure of the document (sections, subsections) and build a tree out of it. I used a custom DBSCAN algorithm where box coordinates are fed in and clustered in a hierarchical sense.
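The clustering idea above can be sketched with a minimal pure-Python DBSCAN over box centroids. This is an illustration of the general technique, not the commenter's actual implementation; the `eps` and `min_pts` values are invented.

```python
import math

# Minimal DBSCAN over 2-D points (e.g. centroids of layout bounding boxes).
# Points within `eps` of each other chain into clusters; isolated points
# are marked as noise (-1).

def dbscan(points, eps=15.0, min_pts=2):
    labels = [None] * len(points)  # None = unvisited, -1 = noise

    def neighbors(i):
        return [j for j, p in enumerate(points)
                if math.dist(points[i], p) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise (may later be claimed as a border point)
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: join, but don't expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:
                queue.extend(jn)  # core point: keep expanding the cluster
        cluster += 1
    return labels

# Three boxes stacked in one column, two boxes far away: two clusters.
centroids = [(0, 0), (0, 5), (0, 10), (100, 100), (100, 105)]
print(dbscan(centroids, eps=10, min_pts=2))  # [0, 0, 0, 1, 1]
```

Running it hierarchically (re-clustering each cluster with a smaller `eps`) is one way to get the section/subsection tree described above.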
Cool, I'm using marker and a couple of others, because each one fails on different files.
In the first image, the box doesn't always get parsed (depending on the OCR). In the second, the graph is text, so the result (resultado) I get is 30 or 40, not 39,48 ng/mL. You can guess the third.
What would you say is the best way to extract complex tables from PDF without losses ?
It is the only tool that works for my specific case. Let me know if you need some extra hand. I'd be glad to be part of your team. Cheers
Crazy that people are obliged to pay to do anything with this format beyond just reading it. It works, but you're severely limited by language and by their use of it, to the point that you're forced to work around foreign languages because they haven't bothered optimising for them.
My RichardKCollin2, The Internet Foundation, perspective:
PDF has slowed development of the Internet and open knowledge by decades. They simply built a deliberate monopoly and did not use that position to make improvements. They made so much profit without effort, they do not