Good post with maybe big implications for interpreting o3’s ARC-AGI scores:

TLDR:
- LLMs bad at ARC because they can’t perceive large text grids
- Solve rates fall as task size in pixels rises, but much later for o3
- 80% of pass@1 o1-mini tasks fail when grids enlarged 2x

Screenshot excerpt from https://anokas.substack.com/p/llms-struggle-with-perception-not-reasoning-arcagi

[Plot]

y label: Solve rate within length [l/2, 2l] (%, ARC-AGI-Pub)
x label: Task length (input pixel count, log scale)

caption: Task solve rates (ARC-AGI-Pub) by various models. Each point represents a sliding window of tasks with length in the range [l/2, l*2], with bootstrapped 75% CI shaded.
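A minimal sketch of how a single point on a plot like this could be computed, assuming each task is recorded as a `(pixel_count, solved)` pair. The function names, the percentile bootstrap, and the toy data are illustrative assumptions, not the original post's code or results.

```python
import random

def window_solve_rate(tasks, center):
    """Solve rate over tasks whose pixel count lies in [center/2, center*2].

    `tasks` is a list of (pixel_count, solved) pairs (hypothetical format).
    Returns (rate, outcomes); rate is None when the window is empty.
    """
    outcomes = [solved for pixels, solved in tasks
                if center / 2 <= pixels <= center * 2]
    if not outcomes:
        return None, outcomes
    return sum(outcomes) / len(outcomes), outcomes

def bootstrap_ci(outcomes, level=0.75, n_resamples=2000, seed=0):
    """Percentile bootstrap interval for the mean of boolean outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the solve rate of each resample.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n
                   for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return rates[lo_idx], rates[hi_idx]

# Made-up toy data, only to show the call pattern.
toy_tasks = [(100, True), (150, False), (400, True), (1000, False)]
rate, outcomes = window_solve_rate(toy_tasks, center=200)
low, high = bootstrap_ci(outcomes)
```

Sweeping `center` across a log-spaced range of pixel counts would give one curve per model, with `(low, high)` as the shaded band.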

What we see is quite striking: LLMs are really good at solving the smallest ARC tasks, even prior to o3. Each subsequent model gets better and better at solving larger tasks, but everything before o3 (and especially pre-o1) struggles with solving tasks with >512 input pixels. With o3, we raise that threshold to perhaps 4096 input pixels.


Quote

Mikel Bober-Irizar

@mikb0b

Dec 24

Why do pre-o3 LLMs struggle with generalization tasks like @arcprize? It's not what you might think. OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn’t stump it because of reasoning, and this has implications for the benchmark as a whole. Analysis below

Tokenization labelled on the ARC-AGI prompt used by OpenAI: Find the common rule that maps an input grid to an output grid, given the examples below. Example 1: Input: 0 0 0 ...
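To make the text-grid format concrete, here is a small sketch that serializes a grid as space-separated digits, as in the prompt excerpt above, and enlarges it 2x. The helper names are hypothetical, and repeating each cell (nearest-neighbour upscaling) is an assumption about what "enlarged 2x" means; the thread does not spell out the method.

```python
def grid_to_text(grid):
    """Serialize a grid as rows of space-separated cell values."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def upscale(grid, factor=2):
    """Enlarge a grid by repeating every cell `factor` times in both axes."""
    out = []
    for row in grid:
        wide = [cell for cell in row for _ in range(factor)]
        # Copy the widened row so the output rows are independent lists.
        out.extend(list(wide) for _ in range(factor))
    return out

def pixel_count(grid):
    """Number of cells in the grid (the 'task length' used in the plot)."""
    return sum(len(row) for row in grid)

small = [[0, 1],
         [2, 3]]
big = upscale(small)
```

Doubling each side quadruples the pixel count, so the model has to track four times as many cells in the text it reads, even though the underlying rule is unchanged.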


8:36 PM · Dec 24, 2024

See @fchollet’s response here:

Quote

François Chollet

@fchollet

Dec 24

If ARC-AGI required visual perception, you'd see VLMs outperform strict LLMs -- by a lot. Everyone tried VLMs during the 2024 competition -- no one got better results. Every single top entry used a strict LLM. As we've said many times: ARC-AGI is a 2D symbolic reasoning

you think multimodal oX would resolve this issue? qwen released a vision model equivalent of their qwq and it was outperforming qwq

I’m skeptical there’s much transfer learning back to text by default, but I don’t understand it well enough to disagree with anyone on it. Didn’t know that about qwq; sounds promising.

I wonder what happens if you explicitly prompt o3 to first try to find a compressed representation of the input, e.g. “I notice this is a horizontal line the full width of the grid, I notice this is a square of side length 5”. The human visual prior does this automatically.

the real test was the window size all along

Cool research. I'm sure a lot of people could have directionally predicted the result, but I would not have expected an 80% drop in performance for a 2x grid. There are drops in output quality when including ever-larger amounts of code in prompts, but not quite that bad.

Do you think this makes sense?

Quote

Séb Krier

@sebkrier

Dec 25

Super cool piece, highly recommend reading it! Here's something I've noticed, though I'd love others' thoughts on this: I recently used Gemini on screen share mode so it could see my screen and comment/advise on what I was doing. I opened up Ableton Live 11 (a digital audio x.com/mikb0b/status/…

ARC-AGI is less a test of AGI than it is a test of an LLM's internal accumulator or iteration circuit size. Larger, deeper models can theoretically perform better.

Seems like the pixel scaling issue might hint at a fundamental limitation in current LLM architectures.


neural narrates

@neural_narrates

Yup, agreed.

Quote

neural narrates

@neural_narrates

Dec 24

Replying to @mikb0b

Converting the ARC images to text grids seems super suboptimal though. The tasks are easy for humans because of the visual structure that's easy to parse and extend. LLMs should have the same capabilities, but their visual backbones are currently weak-ish.

The argument by

shows that the secret sauce of OpenAI is the positional encoding. This has been so since the gpt3V model, and then with the long contexts, and then with the tagged system tokens.

One thing I'm curious about is whether the smaller grid size makes a wider class of (incorrect) rules produce the correct output. Smaller grids are also logically simpler, not just visually simpler, because they can express fewer rules and edge cases of them.

idk maybe some kind of convolutions will be helpful for larger grids


@maxinnerly

Dec 24

>The building blocks of self-awareness are broken? Nothing can exist if those that observe are gazing elsewhere.

The analysis highlights LLMs' limitations in generalization tasks, particularly with text grids, which has significant implications for benchmark design.
