Why do pre-o3 LLMs struggle with generalization tasks like ARC-AGI? It's not what you might think.
OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn’t stump it because of reasoning, and this has implications for the benchmark as a whole.
Analysis below
LLMs get dramatically worse at ARC tasks as the grids get bigger. Humans have no such issue: for them, ARC task difficulty is independent of grid size.
Most ARC tasks contain around 512-2048 pixels, and o3 is the first model capable of operating on these text grids reliably.
So even if a model is capable of the reasoning and generalization required, it can still fail just because it can't handle this many tokens.
When testing o1-mini on an enlarged version of ARC, we observe an 80% drop in solved tasks, even though the solutions are the same.
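To make "enlarged but with the same solution" concrete, here is a minimal sketch that scales a grid by an integer factor. It assumes the enlargement is plain cell repetition; the thread does not say how the enlarged tasks were built, and `upscale` is a hypothetical helper name.

```python
def upscale(grid, k):
    """Repeat every cell k times horizontally and vertically.

    Assumption: enlargement by plain cell repetition. The thread does not
    describe the method actually used for the enlarged ARC tasks.
    """
    return [
        [cell for cell in row for _ in range(k)]  # widen each row
        for row in grid
        for _ in range(k)                         # repeat each row k times
    ]


small = [[0, 1],
         [2, 3]]
big = upscale(small, 2)

# The pattern is unchanged, but the number of cells grows by k * k.
cells_small = sum(len(row) for row in small)
cells_big = sum(len(row) for row in big)
print(big)
print(cells_small, cells_big)
```

The rule a solver has to infer is identical at both scales; only the amount of text the model must read and write changes.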
When models can't understand the task format, the benchmark can mislead, introducing a hidden threshold effect.
And if there's always a larger version that humans can solve but an LLM can't, what does this say about scaling to AGI?
Read the article here: anokas.substack.com/p/llms-struggl
LLMs have to draw the grid manually, and the grid appears linear from their perspective. Those are probably the main reasons performance drops just from increasing grid size. Grid size doesn't matter to humans because we mostly ignore the grid unless we need it for alignment.
Yep, exactly. How much are we testing the LLM's ability to generalise from 3 examples, and how much are we testing its ability to de-linearise grids?
Are they feeding the raw images to the models or reformatting them as json?
Realistically, AI labs will just train on upsampled, resampled, or synthetically generated ARC data, farm reasoning traces, scale, and we all cross our fingers.
Can we first ask an LLM to compress the grid?
Some of them may be unnecessarily large.
If you can find the right representation that still allows the model to reason on it, yes. But I don’t know if we have such a representation (I can’t think of one off the top of my head).
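For illustration only, one candidate is per-row run-length encoding. This is a sketch of a lossless compression, not a representation anyone in this thread has shown a model can reason over; `rle_rows` and `decode` are hypothetical helper names.

```python
from itertools import groupby


def rle_rows(grid):
    """Encode each row as (value, run_length) pairs."""
    return [
        [(value, len(list(run))) for value, run in groupby(row)]
        for row in grid
    ]


def decode(encoded):
    """Invert rle_rows, recovering the original grid."""
    return [
        [value for value, count in row for _ in range(count)]
        for row in encoded
    ]


grid = [[0, 0, 0, 5],
        [5, 5, 0, 0]]
encoded = rle_rows(grid)
print(encoded)
print(decode(encoded) == grid)
```

The encoding shrinks rows with long uniform runs, but it also hides vertical alignment between rows, which is exactly the open question raised above.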
The plot is mine, from my blog, but the data is from the ARC team, who published o3 outputs: github.com/arcprizeorg/mo
It was not strictly json, but space/newline-delimited digits one after the other, which is likely a big source of the trouble (more details in the blog post)
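A rough sketch of that format, assuming one digit per cell, single spaces between cells and a newline between rows (the exact prompt layout is in the blog post, not this thread). `serialize` and `offset` are hypothetical helpers.

```python
def serialize(grid):
    """Space-delimited digits per row, newline between rows."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def offset(row, col, width):
    """Character index of cell (row, col) in the serialized text.

    Each row takes width digits + (width - 1) spaces + 1 newline,
    which is 2 * width characters.
    """
    return row * 2 * width + 2 * col


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
text = serialize(grid)
width = len(grid[0])

# Horizontal neighbours are 2 characters apart; vertical neighbours are
# 2 * width characters apart, so the gap grows with the grid.
print(offset(0, 1, width) - offset(0, 0, width))
print(offset(1, 0, width) - offset(0, 0, width))
```

This counts characters, not tokens. How the text splits into tokens depends on the tokenizer, but the distance between vertically adjacent cells still grows with grid width.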
That being said, current image-in modalities are even more tricky as it’s unclear how a model would output the answer
pre-o3 LLMs lack real-world context. They need better training on diverse data sets to improve generalization.
Is there any evidence o3 trains on more real-world context?
I'd say the pre-training mix contains a huge amount of it (for all models)
Great question!
Pre-o3 LLMs struggled with tasks like ARC-AGI because they often lacked the ability to combine abstract reasoning with generalization across unseen problems.
It’s not just about solving puzzles—it’s about learning the underlying rules and adapting them
Humans needed thousands of years to invent the number system. With this invention, it became possible to handle patterns of nature that had been impossible for human brains to understand before.
A similar internal "breakthrough" might be necessary for AI to solve certain tasks.
Having worked extensively with different LLMs, I've noticed that pre-o3 models often struggled not due to pure reasoning limitations, but because of their inability to maintain consistent pattern recognition across variations.
At jenova ai, we've observed this firsthand while
Discover more
Sourced from across X
Reading the report, this is such clean engineering under resource constraints. The DeepSeek team directly engineered solutions to known problems under hardware constraints. All of this looks so elegant -- no fancy "academic" solutions, just pure, solid engineering. Respect 
Peak engineering efficiency from the Whale.
Napkin math:
> DeepSeek-V3: 2048 H800s / 180K GPU-hours per trillion tokens
> Llama 3: 16000 H100s for 54 days so ~1.3M GPU-hours per trillion tokens
~7.5x raw GPU efficiency in DeepSeek's favor.
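The napkin math can be re-run in a few lines. The GPU counts, the 54 days and the 180K figure come from the post above; the Llama 3 token count of roughly 15 trillion is an assumption added here, not something stated in this thread.

```python
def gpu_hours(num_gpus, days):
    """Total GPU-hours for a run that keeps num_gpus busy for `days` days."""
    return num_gpus * days * 24


# Figures quoted in the post above.
LLAMA3_GPUS = 16_000
LLAMA3_DAYS = 54
DEEPSEEK_HOURS_PER_TRILLION = 180_000

# Assumption (not in the thread): Llama 3 was trained on ~15 trillion tokens.
LLAMA3_TOKENS_TRILLIONS = 15.0

llama_total = gpu_hours(LLAMA3_GPUS, LLAMA3_DAYS)  # 20,736,000 GPU-hours
llama_per_trillion = llama_total / LLAMA3_TOKENS_TRILLIONS
ratio = llama_per_trillion / DEEPSEEK_HOURS_PER_TRILLION

print(f"Llama 3: {llama_per_trillion:,.0f} GPU-hours per trillion tokens")
print(f"Ratio vs DeepSeek-V3: {ratio:.2f}x")
```

Under that assumption the ratio lands between 7x and 8x, in line with the rounded figure in the post. It compares H100-hours with H800-hours directly, which is why the post calls it "raw" efficiency.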
"and now we mog them".
And here… we… go.
So, that line in config. Yes it's about multi-token prediction. Just as a better training obj – though they leave the possibility of speculative decoding open.
Also, "muh 50K Hoppers":
> 2048 NVIDIA H800
> 2.788M H800-hours
2 months of training. 2x Llama 3 8B.
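As a quick check on the "2 months" figure, using only the two numbers quoted above and assuming all 2048 GPUs run for the whole period:

```python
H800_HOURS = 2_788_000  # total H800-hours quoted above
NUM_GPUS = 2048         # H800 count quoted above

# Assumption: every GPU is busy for the entire run, with no downtime.
wall_clock_hours = H800_HOURS / NUM_GPUS
wall_clock_days = wall_clock_hours / 24

print(f"{wall_clock_days:.1f} days")
```

That works out to a little under 57 days.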