Why do pre-o3 LLMs struggle with generalization tasks like ARC-AGI? It's not what you might think. OpenAI o3 shattered the ARC-AGI benchmark. But reasoning isn't what stumped it on the hardest puzzles, and this has implications for the benchmark as a whole. Analysis below 🧵
Tokenization labelled on the ARC-AGI prompt used by OpenAI: Find the common rule that maps an input grid to an output grid, given the examples below. Example 1: Input: 0 0 0 ...
Having to draw the grid manually, and the fact that the grid appears linear from their perspective, are probably the main reasons LLM performance drops as grid size increases. Grid size doesn't matter to humans because we mostly ignore the grid unless we need it for alignment.
Yep, exactly. How much are we testing the LLM's ability to generalise from 3 examples, and how much are we testing its ability to de-linearise grids?
David Watson 🥑

The de-linearising needs to happen implicitly within the model for it to perform many of the transformations ARC requires (vertical translation and rotation, for example).
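
To make the de-linearising point concrete, here's a rough sketch of what vertical translation and rotation look like once a grid is flattened row by row, the way it appears in a text prompt. It assumes one sequence position per cell and ignores spaces and newlines, which is a simplification and not the actual tokenization. The function names are made up for illustration.

```python
def linearise(grid):
    """Flatten a grid row by row, the order a text prompt presents it in."""
    return [cell for row in grid for cell in row]


def vertical_neighbour_gap(width):
    """Distance in the flat sequence between a cell and the cell directly below it.

    Horizontal neighbours are always 1 apart; vertical neighbours drift
    further apart as the grid gets wider.
    """
    return width


def shift_down_flat(flat, width, fill=0):
    """Move every row down by one, working only on the flat sequence."""
    return [fill] * width + flat[:-width]


def rotate_cw_index(i, height, width):
    """Flat index of cell i after a 90-degree clockwise rotation.

    The rotated grid has `width` rows and `height` columns.
    """
    r, c = divmod(i, width)              # recover 2D coordinates first
    return c * height + (height - 1 - r)


def rotate_cw_flat(flat, height, width):
    """Rotate the flat sequence clockwise via the index mapping above."""
    out = [None] * len(flat)
    for i, value in enumerate(flat):
        out[rotate_cw_index(i, height, width)] = value
    return out


grid = [[1, 2, 3],
        [4, 5, 6]]
flat = linearise(grid)
print(flat)                        # [1, 2, 3, 4, 5, 6]
print(shift_down_flat(flat, 3))    # [0, 0, 0, 1, 2, 3]
print(rotate_cw_flat(flat, 2, 3))  # [4, 1, 5, 2, 6, 3]
```

Both operations need the grid width (and for rotation, the 2D coordinates) to be recovered from the flat sequence before anything else can happen, which is the step a human looking at the picture gets for free.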