Why do pre-o3 LLMs struggle with generalization tasks like @arcprize? It's not what you might think. OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn’t stump it because of reasoning, and this has implications for the benchmark as a whole. Analysis below.

Tokenization labelled on the ARC-AGI prompt used by OpenAI: Find the common rule that maps an input grid to an output grid, given the examples below. Example 1: Input: 0 0 0 ...

7:07 AM · Dec 24, 2024

Mikel Bober-Irizar

@mikb0b

Dec 24

LLMs get dramatically worse at ARC tasks as the tasks get bigger. Humans have no such issue: for them, ARC task difficulty is independent of size. Most ARC tasks contain around 512-2048 pixels, and o3 is the first model capable of operating on these text grids reliably.

Task solve rates (ARC-AGI-Pub) by various models, along with human performance. With all LLMs, the solve rate drops dramatically as the problem size increases (o3 works for much bigger tasks). Human performance stays roughly constant.

So even if a model is capable of the reasoning and generalization required, it can still fail just because it can't handle this many tokens. When testing o1-mini on an enlarged version of ARC, we observe an 80% drop in solved tasks - even if the solutions are the same.

Three ARC tasks and their enlarged versions (each column and row repeated). Visually, the tasks look the same to humans. But o1-mini is able to solve the former and gets the latter totally wrong in every case.
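The enlargement described above (each row and column repeated) can be sketched as follows. This is only an illustration: the function name `enlarge` and the repeat factor of 2 are assumptions, since the thread does not state the exact factor used in the o1-mini experiment.

```python
def enlarge(grid, factor=2):
    """Repeat every row and every column `factor` times.

    `factor=2` is an assumed value for illustration only.
    """
    return [
        [cell for cell in row for _ in range(factor)]  # repeat each column
        for row in grid
        for _ in range(factor)                         # repeat each row
    ]


small = [[0, 1],
         [2, 0]]
big = enlarge(small, 2)

# The picture is the same, but the number of cells grows by factor ** 2.
cells_small = sum(len(row) for row in small)
cells_big = sum(len(row) for row in big)
```

The rule that solves the task is unchanged by this transformation, yet the text the model has to read and write grows quadratically with the factor.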

When models can't understand the task format, the benchmark can mislead, introducing a hidden threshold effect. And if there's always a larger version that humans can solve but an LLM can't, what does this say about scaling to AGI? Read the article here: anokas.substack.com/p/llms-struggl

LLMs having to draw the grid manually, and the fact that the grid appears linear from their perspective, are probably the main reasons performance drops when only the grid size increases. Grid size doesn't matter to humans because we mostly ignore the grid unless it's for alignment.

Yep, exactly. How much are we testing the LLM's ability to generalise from 3 examples, and how much are we testing its ability to de-linearise grids?
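A rough sketch of what "de-linearising" involves, assuming single-digit cells separated by spaces with one row per line. Character distance is used here only as a stand-in; how a real tokenizer splits the text is not something this sketch knows. The helper names are hypothetical.

```python
def char_offset(row, col, width):
    """Index of cell (row, col) in the flattened text.

    Each row occupies `width` digits, `width - 1` spaces and one newline,
    i.e. 2 * width characters (assuming single-digit cells).
    """
    return row * 2 * width + 2 * col


def neighbour_gaps(width):
    """Character distance to the right-hand and the lower neighbour."""
    origin = char_offset(0, 0, width)
    right = char_offset(0, 1, width) - origin
    below = char_offset(1, 0, width) - origin
    return right, below


# Sanity check against an actual flattened 3x3 grid holding digits 0..8.
text = "\n".join(" ".join(str(r * 3 + c) for c in range(3)) for r in range(3))
assert text[char_offset(1, 2, 3)] == "5"

for width in (3, 10, 30):
    print(width, neighbour_gaps(width))
```

The right-hand neighbour is always 2 characters away, but the neighbour directly below is `2 * width` characters away, so vertical structure that is obvious in a picture gets stretched further apart as the grid grows.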

Are they feeding the raw images to the models or reformatting them as JSON?

Mikel Bober-Irizar

@mikb0b

See the image on the tweet / article: it's text format

nice

Realistically, AI labs will just train on upsampled, resampled, or synthetically generated ARC data, farm reasoning traces, scale, and we all cross our fingers.

Can we first ask an LLM to compress the grid? Some of them may be unnecessarily large.

If you can find the right representation that still allows the model to reason on it, yes. But I don’t know if we have such a representation (I can’t think of one off the top of my head).

Where is this plot from? Who has access to o3?

The plot is mine, from my blog, but the data is from the ARC team, who published o3 outputs: github.com/arcprizeorg/mo

They fed o3 JSON, not images

It was not strictly JSON, but space/newline-delimited digits one after the other, which is likely a big source of the trouble (more details in the blog post). That being said, current image-in modalities are even more tricky, as it’s unclear how a model would output the answer.
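For readers unfamiliar with the format, here is a minimal sketch of a space/newline-delimited grid next to a JSON encoding of the same grid. The helper names `to_text` and `from_text` and the sample grid are made up for illustration; the exact prompt wording used by OpenAI is not reproduced here.

```python
import json


def to_text(grid):
    """Space-separated digits, one grid row per line."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def from_text(text):
    """Parse a space/newline-delimited answer back into a grid."""
    return [[int(tok) for tok in line.split()]
            for line in text.strip().splitlines()]


grid = [[0, 0, 3],
        [0, 3, 0],
        [3, 0, 0]]

as_text = to_text(grid)
as_json = json.dumps(grid)

print(as_text)
print(as_json)

# Both encodings round-trip to the same grid.
assert from_text(as_text) == grid
assert json.loads(as_json) == grid
```

Either way the model receives a one-dimensional character stream and has to emit its answer in the same form, cell by cell.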

Ray | AI marketer Next-Gen Social Media Marketing

@rayzhang123

Dec 25

pre-o3 LLMs lack real-world context. They need better training on diverse data sets to improve generalization.

Is there any evidence o3 trains on more real-world context? I'd say the pre-training mix contains a huge amount of it (for all models)

Great question! Pre-o3 LLMs struggled with tasks like @arcprize because they often lacked the ability to combine abstract reasoning with generalization across unseen problems. It’s not just about solving puzzles; it’s about learning the underlying rules and adapting them

@maxinnerly

Dec 24

What does this say about the nature of reasoning in AI?

so it's just a scale issue? Or what

Matthias Heger - AI acc

@modelsarereal

Dec 24

Humans needed thousands of years to invent the number system. With this invention, it became possible to handle patterns in nature that human brains could not understand before. A similar internal "breakthrough" might be necessary for AI to solve certain tasks.

Having worked extensively with different LLMs, I've noticed that pre-o3 models often struggled not due to pure reasoning limitations, but because of their inability to maintain consistent pattern recognition across variations. At jenova ai, we've observed this firsthand while
