Why do pre-o3 LLMs struggle with generalization tasks like ARC-AGI? It's not what you might think. OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn't stump it because of reasoning, and this has implications for the benchmark as a whole. Analysis below 🧵
Tokenization labelled on the ARC-AGI prompt used by OpenAI: Find the common rule that maps an input grid to an output grid, given the examples below. Example 1: Input: 0 0 0 ...
David Watson 🥑

LLMs get dramatically worse at ARC tasks as the tasks get bigger. Humans have no such issue: for them, ARC task difficulty is independent of size. Most ARC tasks contain around 512-2048 pixels, and o3 is the first model capable of operating on these text grids reliably.
Task solve rates (ARC-AGI-Pub) by various models, along with human performance. With all LLMs, the solve rate drops dramatically as the problem size increases (o3 works for much bigger tasks). Human performance stays roughly constant.
So even if a model is capable of the reasoning and generalization required, it can still fail just because it can't handle this many tokens. When testing o1-mini on an enlarged version of ARC, we observe an 80% drop in solved tasks, even though the solutions are the same.
Three ARC tasks and their enlarged versions (each column and row repeated). Visually, the tasks look the same to humans. But o1-mini is able to solve the former and gets the latter totally wrong in every case.
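A minimal sketch of the enlargement described in the caption above, assuming a grid is a plain list of lists of digits. The function name `enlarge` and the `factor` parameter are hypothetical and not taken from the original experiment.

```python
def enlarge(grid, factor=2):
    """Repeat every row and every column `factor` times (hypothetical helper)."""
    out = []
    for row in grid:
        # Repeat each cell horizontally.
        wide = [cell for cell in row for _ in range(factor)]
        # Repeat the widened row vertically, copying so rows stay independent.
        out.extend(list(wide) for _ in range(factor))
    return out


small = [[0, 1],
         [2, 0]]
big = enlarge(small, factor=2)

cells_small = sum(len(r) for r in small)
cells_big = sum(len(r) for r in big)
print(big)
print(cells_small, "->", cells_big)
```

With a factor of 2 the picture is visually unchanged, but the number of cells the model has to read and write grows fourfold.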
LLMs having to draw the grid manually, and the fact that the grid appears linear from their perspective, are probably the main factors in performance dropping from just increasing grid size. Grid size doesn't matter to humans because we mostly ignore the grid unless it's for alignment.
Yep, exactly. How much are we testing the LLM's ability to generalise from 3 examples, and how much are we testing its ability to de-linearise grids?
Realistically, AI labs will just train on upsampled, resampled, or synthetically generated ARC data, farm reasoning traces, scale, and we all cross our fingers.
If you can find the right representation that still allows the model to reason on it, yes. But I don’t know if we have such a representation (I can’t think of one off the top of my head).
It was not strictly JSON, but space/newline-delimited digits one after the other, which is likely a big source of the trouble (more details in the blog post). That being said, current image-in modalities are even trickier, as it's unclear how a model would output the answer.
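To make the "linear from the model's perspective" point concrete, here is a rough sketch of a space/newline-delimited serialization, assuming single-digit cells. The helpers `serialize` and `flat_offset` are hypothetical names, and character offsets are used only as a stand-in for token positions, which depend on the tokenizer.

```python
def serialize(grid):
    """Join cells with spaces and rows with newlines (assumed prompt format)."""
    return "\n".join(" ".join(str(c) for c in row) for row in grid)


def flat_offset(row, col, width):
    """Character offset of cell (row, col) in the serialized text.

    Each row takes `width` digits, `width - 1` spaces and one newline,
    i.e. 2 * width characters (single-digit cells assumed).
    """
    return row * 2 * width + 2 * col


grid = [[0, 0, 3],
        [0, 3, 0],
        [3, 0, 0]]
text = serialize(grid)
width = len(grid[0])

# Horizontal neighbours sit 2 characters apart; vertical neighbours sit
# 2 * width characters apart, so that gap grows with the grid width.
horizontal_gap = flat_offset(0, 1, width) - flat_offset(0, 0, width)
vertical_gap = flat_offset(1, 0, width) - flat_offset(0, 0, width)
print(repr(text), horizontal_gap, vertical_gap)
```

A cell directly below another is adjacent on screen but `2 * width` characters away in the text, which is one way to picture the de-linearisation burden discussed in this thread.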
Is there any evidence o3 trains on more real-world context? I'd say the pre-training mix contains a huge amount of it (for all models)
Great question! Pre-o3 LLMs struggled with tasks like ARC-AGI because they often lacked the ability to combine abstract reasoning with generalization across unseen problems. It's not just about solving puzzles; it's about learning the underlying rules and adapting them
Humans needed thousands of years to invent the number system. With this invention, patterns in nature that had been impossible for human brains to understand became possible to handle. A similar internal "breakthrough" might be necessary for AI to solve certain tasks.
Having worked extensively with different LLMs, I've noticed that pre-o3 models often struggled not due to pure reasoning limitations, but because of their inability to maintain consistent pattern recognition across variations. At jenova ai, we've observed this firsthand while

Discover more

Sourced from across X
Reading the report, this is such clean engineering under resource constraints. The DeepSeek team directly engineered solutions to known problems under hardware constraints. All of this looks so elegant -- no fancy "academic" solutions, just pure, solid engineering. Respect 👏
Quote
DeepSeek
@deepseek_ai
🚀 Introducing DeepSeek-V3! Biggest leap forward yet: ⚡ 60 tokens/second (3x faster than V2!) 💪 Enhanced capabilities 🛠 API compatibility intact 🌍 Fully open-source models & papers 🐋 1/n
Peak engineering efficiency from the Whale. Napkin math:
> DeepSeek-V3: 2048 H800s / 180K GPU-hours per trillion tokens
> Llama 3: 16000 H100s for 54 days so ~1.3M GPU-hours per trillion tokens
~7.5x raw GPU efficiency in DeepSeek's favor. "and now we mog them".
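The napkin math above can be re-run as a short sketch that uses only the figures quoted in the post. The token count is not stated there, so the script back-derives an implied value instead of assuming one, and the variable names are illustrative.

```python
# Figures as quoted in the post.
llama_gpus = 16_000
llama_days = 54
llama_hours_per_trillion = 1.3e6      # "~1.3M GPU-hours per trillion tokens"
deepseek_hours_per_trillion = 180e3   # "180K GPU-hours per trillion tokens"

# Total Llama 3 GPU-hours implied by "16000 H100s for 54 days".
llama_total_gpu_hours = llama_gpus * llama_days * 24

# Token count (in trillions) implied by the post's per-trillion figure;
# this is derived, not a number stated in the post.
implied_trillions = llama_total_gpu_hours / llama_hours_per_trillion

# Efficiency ratio from the two rounded per-trillion figures.
ratio = llama_hours_per_trillion / deepseek_hours_per_trillion

print(f"{llama_total_gpu_hours:,} GPU-hours total")
print(f"~{implied_trillions:.1f}T tokens implied")
print(f"~{ratio:.1f}x")
```

With these rounded inputs the ratio comes out closer to 7.2x than 7.5x, which is within the precision of a napkin estimate.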
Quote
Teortaxes▶️
@teortaxesTex
> $5.5M for Sonnet tier it's unsurprising that they're proud of it, but it sure feels like they're rubbing it in. «$100M runs, huh? 30.84M H100-hours on 405B, yeah? Half-witted Western hacks, your silicon is wasted on you, your thoughts wouldn't reduce loss of your own models»
And here… we… go. So, that line in config. Yes it's about multi-token prediction. Just as a better training obj – though they leave the possibility of speculative decoding open. Also, "muh 50K Hoppers":
> 2048 NVIDIA H800
> 2.788M H800-hours
2 months of training. 2x Llama 3 8B.
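As a quick sanity check on the "2 months of training" figure, the sketch below divides the quoted GPU-hours by the quoted GPU count. It assumes all 2048 GPUs ran continuously for the whole run, which is a simplification, not something the post states.

```python
total_gpu_hours = 2_788_000   # "2.788M H800-hours", as quoted
num_gpus = 2048               # "2048 NVIDIA H800", as quoted

# Wall-clock time if every GPU ran for the full duration (assumption).
hours_per_gpu = total_gpu_hours / num_gpus
days = hours_per_gpu / 24

print(f"{hours_per_gpu:.0f} hours per GPU, about {days:.1f} days")
```

Under that assumption the run lasts a little under 57 days, consistent with the "2 months" in the post.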