Good post with maybe big implications for interpreting o3’s ARC-AGI scores:
TLDR:
- LLMs are bad at ARC because they can’t perceive large text grids
- Solve rates fall as task size in pixels rises, but the drop comes much later for o3
- 80% of the tasks o1-mini solves at pass@1 fail when the grids are enlarged 2x
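To make the "enlarged 2x" point concrete, here is a minimal sketch of one way a grid could be scaled up. It assumes each cell is repeated as a 2x2 block and that grids are sent to the model as space-separated rows; the linked analysis may do either differently. `upscale_grid` and `grid_to_text` are hypothetical helper names, not taken from the post.

```python
def upscale_grid(grid, factor=2):
    """Repeat every cell `factor` times along both axes (assumed scaling method)."""
    out = []
    for row in grid:
        # Widen the row: each cell becomes `factor` adjacent copies.
        wide = [cell for cell in row for _ in range(factor)]
        # Stack `factor` independent copies of the widened row.
        out.extend(list(wide) for _ in range(factor))
    return out


def grid_to_text(grid):
    """Serialize a grid as space-separated rows (assumed prompt format)."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


small = [[0, 1], [2, 0]]
big = upscale_grid(small)

print(grid_to_text(small))
print(grid_to_text(big))
```

The puzzle logic is unchanged, but the number of cells the model has to read grows by 4x, which is the kind of size effect the TLDR describes.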
Quote
Mikel Bober-Irizar
@mikb0b
Why do pre-o3 LLMs struggle with generalization tasks like @arcprize? It's not what you might think.
OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn’t stump it because of reasoning, and this has implications for the benchmark as a whole.
Analysis below