Why do pre-o3 LLMs struggle with generalization tasks like ARC-AGI? It's not what you might think.
OpenAI o3 shattered the ARC-AGI benchmark. But reasoning isn't why the hardest puzzles stumped it, and this has implications for the benchmark as a whole.
Analysis below
LLMs have to draw the grid manually, cell by cell, and from their perspective the grid is a linear sequence. Those two things are probably the main reasons performance drops as grid size increases. Grid size doesn't matter much to humans because we mostly ignore the grid unless we need it for alignment.
Why does de-linearising need to happen? A grid's arrangement in space doesn't change its information content.
The de-linearising needs to happen implicitly within the model in order to perform many of the transformations ARC requires (vertical translation and rotation, for example).
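To make that concrete, here is a rough Python sketch of what "linear from the model's perspective" means. It assumes the grid is serialized row by row into one flat sequence; the function names are made up for illustration, and real prompts usually add row separators, which pushes vertically adjacent cells one position further apart. The point is only that a cell and the one below it sit `width` positions apart, so the gap grows with grid size, and that translation or rotation on the flat sequence needs index arithmetic a 2D view gives you for free.

```python
def flatten(grid):
    """Serialize a 2D grid row by row, roughly how a model sees it as a sequence."""
    return [cell for row in grid for cell in row]


def vertical_neighbor_distance(width):
    """Positions between a cell and the cell directly below it in the flat sequence."""
    return width


def shift_down_flat(flat, width, fill=0):
    """Move every cell one row down using only the flat sequence.

    The bottom row falls off and the top row is filled with `fill`.
    """
    return [fill] * width + flat[:-width]


def rotate_cw_flat(flat, height, width):
    """Rotate 90 degrees clockwise using only flat indices.

    The result is a flat sequence for a grid with `width` rows and `height` columns.
    """
    out = []
    for r in range(width):
        for c in range(height):
            # Output cell (r, c) comes from input cell (height - 1 - c, r).
            out.append(flat[(height - 1 - c) * width + r])
    return out


grid = [
    [1, 2, 3],
    [4, 5, 6],
]
flat = flatten(grid)
shifted = shift_down_flat(flat, width=3)
rotated = rotate_cw_flat(flat, height=2, width=3)
```

In the flat form, horizontal neighbors are always 1 apart, but vertical neighbors are 3 apart in a 3-wide grid and 30 apart in a 30-wide one, which is one way to read why bigger grids hurt even when the puzzle logic is the same.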