Good post with maybe big implications for interpreting o3’s ARC-AGI scores:

TLDR:
- LLMs bad at ARC because they can’t perceive large text grids
- Solve rates fall as task size in pixels rises, but much later for o3
- 80% of pass@1 o1-mini tasks fail when grids enlarged 2x
Screenshot excerpt from https://anokas.substack.com/p/llms-struggle-with-perception-not-reasoning-arcagi

[Plot]

y label: Solve rate within length [l/2, 2l] (%, ARC-AGI-Pub)
x label: Task length (input pixel count, log scale)

caption: Task solve rates (ARC-AGI-Pub) by various models. Each point represents a sliding window of tasks with length in the range [l/2, l*2], with bootstrapped 75% CI shaded.

What we see is quite striking: LLMs are really good at solving the smallest ARC tasks, even prior to o3. Each subsequent model gets better and better at solving larger tasks, but everything before o3 (and especially pre-o1) struggles with solving tasks with >512 input pixels. With o3, we raise that threshold to perhaps 4096 input pixels.
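For readers who want to see what the plot's caption describes, here is a minimal sketch of a sliding-window solve rate with a bootstrapped 75% CI. It is not the post author's code: the function name `window_solve_rate`, the `(input_pixel_count, solved)` task format, and the resampling settings are all assumptions made for illustration.

```python
import random

def window_solve_rate(tasks, center, n_boot=1000, ci=0.75, seed=0):
    """Solve rate (%) for tasks whose length lies in [center/2, center*2].

    tasks: list of (input_pixel_count, solved) pairs (hypothetical format).
    Returns (rate, lo, hi) in percent, or None if the window is empty.
    """
    window = [solved for length, solved in tasks
              if center / 2 <= length <= center * 2]
    if not window:
        return None
    n = len(window)
    rate = 100.0 * sum(window) / n

    # Bootstrap: resample the window with replacement and recompute the rate.
    rng = random.Random(seed)
    boots = sorted(100.0 * sum(rng.choices(window, k=n)) / n
                   for _ in range(n_boot))

    # Take the central `ci` fraction of the bootstrap distribution.
    alpha = (1 - ci) / 2
    lo = boots[int(alpha * (n_boot - 1))]
    hi = boots[int((1 - alpha) * (n_boot - 1))]
    return rate, lo, hi
```

Evaluating this at many values of `center` along a log-spaced axis would give one curve per model, with `lo` and `hi` as the shaded band.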
Quote
Mikel Bober-Irizar
@mikb0b
Why do pre-o3 LLMs struggle with generalization tasks like @arcprize? It's not what you might think. OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn’t stump it because of reasoning, and this has implications for the benchmark as a whole. Analysis below🧵
Tokenization labelled on the ARC-AGI prompt used by OpenAI: Find the common rule that maps an input grid to an output grid, given the examples below. Example 1: Input: 0 0 0 ...
David Watson 🥑

See the response here:
Quote
François Chollet
@fchollet
If ARC-AGI required visual perception, you'd see VLMs outperform strict LLMs -- by a lot. Everyone tried VLMs during the 2024 competition -- no one got better results. Every single top entry used a strict LLM. As we've said many times: ARC-AGI is a 2D symbolic reasoning
I’m skeptical there’s much transfer learning back to text by default, but I don’t understand it well enough to disagree with anyone on it. Didn’t know that about qwq; sounds promising.
I wonder what happens if you explicitly prompt o3 to first try to find a compressed representation of the input, e.g. “I notice this is a horizontal line the full width of the grid, I notice this is a square of side length 5”. The human visual prior does this automatically.
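To make the "compressed representation" idea concrete, here is a rough sketch that turns a grid into short object-like notes: full-width horizontal lines are named as such, and other rows are run-length encoded. The function `describe_grid`, the note wording, and the choice of 0 as background are assumptions for illustration; detecting squares or other 2D shapes is left out.

```python
from itertools import groupby

def describe_grid(grid, background=0):
    """Describe each non-empty row of a grid (list of lists of ints) as text."""
    width = len(grid[0])
    notes = []
    for r, row in enumerate(grid):
        # Run-length encode the row as (colour, run_length) pairs.
        runs = [(colour, len(list(cells))) for colour, cells in groupby(row)]
        if len(runs) == 1:
            colour = runs[0][0]
            if colour != background:
                notes.append(
                    f"row {r}: horizontal line of colour {colour}, "
                    f"full width ({width})"
                )
            continue  # rows that are entirely background are skipped
        encoded = " ".join(f"{n}x{colour}" for colour, n in runs)
        notes.append(f"row {r}: {encoded}")
    return notes
```

For the grid `[[0, 0, 0, 0], [3, 3, 3, 3], [0, 5, 5, 0]]` this yields two notes, one naming the full-width line in row 1 and one encoding row 2 as `1x0 2x5 1x0`. Notes like these could be prepended to the raw grid in a prompt.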
Cool research. I'm sure a lot of people could have directionally predicted the result, but I would not have expected an 80% drop in performance for a 2x grid. There are drops in output quality when including ever-larger amounts of code in prompts, but not quite that bad.
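As a rough illustration of what a 2x grid means for the prompt, the sketch below upscales a grid and serializes it as space-separated digits. It assumes that "enlarged 2x" means each cell becomes a 2x2 block (so the pixel count quadruples); the thread does not spell this out. The helper names `upscale`, `to_text`, and `pixel_count` are hypothetical.

```python
def upscale(grid, factor=2):
    """Repeat every cell `factor` times horizontally and vertically."""
    return [
        [cell for cell in row for _ in range(factor)]
        for row in grid
        for _ in range(factor)
    ]

def to_text(grid):
    """Serialize a grid as lines of space-separated digits."""
    return "\n".join(" ".join(str(c) for c in row) for row in grid)

def pixel_count(grid):
    """Number of cells in the grid."""
    return sum(len(row) for row in grid)

small = [[1, 2], [3, 4]]
big = upscale(small)
print(to_text(big))
print(pixel_count(small), pixel_count(big))
```

The underlying rule of the task is unchanged by this transformation, but the text the model has to read is four times longer under this assumption.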
Do you think this makes sense?
Quote
Séb Krier
@sebkrier
Super cool piece, highly recommend reading it! Here's something I've noticed, though I'd love others' thoughts on this: I recently used Gemini on screen share mode so it could see my screen and comment/advise on what I was doing. I opened up Ableton Live 11 (a digital audio x.com/mikb0b/status/…
ARC-AGI is less a test of AGI than it is a test of an LLM's internal accumulator or iteration circuit size. Larger, deeper models can theoretically perform better.
Seems like the pixel scaling issue might hint at a fundamental limitation in current LLM architectures.
Yup, agreed.
Quote
neural narrates 🍍
@neural_narrates
Replying to @mikb0b
Converting the ARC images to text grids seems super suboptimal though. The tasks are easy for humans because of the visual structure that's easy to parse and extend. LLMs should have the same capabilities, but their visual backbones are currently weak-ish.
The argument shows that the secret sauce of OpenAI is the positional encoding. This has been so since the gpt3V model, and then with the long contexts, and then with the tagged system tokens.
One thing I'm curious about is whether the smaller grid size makes a wider class of (incorrect) rules produce the correct output. Smaller grids are also logically simpler, not just visually simpler, because they can express fewer rules and edge cases of them.
>The building blocks of self-awareness are broken? Nothing can exist if those that observe are gazing elsewhere.
The analysis highlights LLMs' limitations in generalization tasks, particularly with text grids, which has significant implications for benchmark design.
