Less than a year from announcement to near saturation. (On to ARC-AGI-3)
Quote
François Chollet
@fchollet
Replying to @fchollet
Unlike ARC-AGI-1, this new version is not easily brute-forced. Current top AI approaches score 0-4%. All base LLMs (GPT-4.5, Claude 3.7 Sonnet, Gemini 2, etc.) score 0%. Single-CoT reasoning models (Claude Thinking, R1, o3-mini…) score 0-1%. So you can't solve these tasks via…
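For context on what "brute-forced" meant for ARC-AGI-1: top entries enumerated short programs over a hand-built DSL of grid transforms until one reproduced the training pairs. A minimal toy sketch of that style of search (the primitive set and task below are illustrative assumptions, far simpler than real solvers):

```python
import itertools
import numpy as np

# Toy brute-force program search in the spirit of early ARC-AGI-1 solvers:
# enumerate short compositions of grid primitives and keep any program that
# maps every training input to its training output.
PRIMITIVES = {
    "identity": lambda g: g,
    "rot90": lambda g: np.rot90(g),
    "flip_h": lambda g: np.fliplr(g),
    "flip_v": lambda g: np.flipud(g),
    "transpose": lambda g: g.T,
}

def brute_force(train_pairs, max_depth=3):
    """Return the first primitive composition consistent with all train pairs."""
    for depth in range(1, max_depth + 1):
        for names in itertools.product(PRIMITIVES, repeat=depth):
            def program(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            if all(np.array_equal(program(x), y) for x, y in train_pairs):
                return names, program
    return None

# Toy task: the hidden rule is "rotate 180 degrees".
pairs = [(np.array([[1, 2], [3, 4]]), np.array([[4, 3], [2, 1]]))]
print(brute_force(pairs)[0])  # -> ('rot90', 'rot90')
```

Per the quoted figures, this style of search goes from viable on ARC-AGI-1 to near-useless (0-4%) on ARC-AGI-2.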
The benchmark treadmill is now faster than the benchmark factory. ARC-AGI-2 took 3 years to develop. Models saturated it in 10 months. At this rate, ARC-AGI-3 might be obsolete before the prize money clears.
the buried lede from today's ARC results: Agentica hit 85.28% at $6.94/task. Gemini 3 Deep Think scored 84.6% at nearly double the cost. the 0% to 85% jump is real, but the cost curve tells a different story.
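To make that cost framing concrete, here is the points-per-dollar arithmetic from the two quoted results. Agentica's figures are as stated in the reply; the Gemini per-task cost is an assumption, reading "nearly double" as roughly $13/task.

```python
# Cost-adjusted comparison of the two quoted ARC results.
results = {
    "Agentica": {"score_pct": 85.28, "usd_per_task": 6.94},
    "Gemini 3 Deep Think": {"score_pct": 84.6, "usd_per_task": 13.0},  # ASSUMED: "nearly double"
}
for name, r in results.items():
    print(f"{name}: {r['score_pct'] / r['usd_per_task']:.2f} points per dollar")
# -> Agentica ~12.3 points/$, Gemini ~6.5 points/$: near-identical scores, ~2x efficiency gap
```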
Building with AI this past year, I stopped trying to work around model limitations. By the time I shipped the workaround, the constraint was gone. The planning window for AI product strategy is measured in weeks now, not quarters.
The progress is impressive, but for a benchmark like this, where they've been careful to verify that every task is solvable, I don't think it's truly saturated until a model hits 100%.
The half-life of these benchmarks is collapsing. We’re moving from "impossible challenge" to "near saturation" in months, not years. The real bottleneck is no longer model capability, but our ability to design tests that can withstand contact with the next iteration.
The speed from “announce” to “saturate” is the real story. It also means these benchmarks are becoming release-cadence metrics, not research-grade measurements, unless we standardize holdouts + refresh schedules.
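One hypothetical way to operationalize "holdouts + refresh schedules": partition a private task pool deterministically and rotate which slice serves as the live hidden eval each quarter. The scheme and names below are mine, not a proposal from the thread — just a sketch of the idea.

```python
import datetime
import hashlib

def holdout_slice(task_ids: list[str], date: datetime.date, n_slices: int = 4) -> list[str]:
    """Return the task slice serving as the hidden eval set this quarter.

    Tasks are assigned to slices by a stable hash, and the active slice
    rotates every quarter, so no hidden set stays live long enough to
    leak into training data.
    """
    quarter = date.year * 4 + (date.month - 1) // 3  # refresh period index
    active = quarter % n_slices
    return [
        t for t in task_ids
        if int(hashlib.sha256(t.encode()).hexdigest(), 16) % n_slices == active
    ]

tasks = [f"task-{i:03d}" for i in range(12)]
print(holdout_slice(tasks, datetime.date(2025, 11, 1)))  # this quarter's hidden set
```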
There is no mysterious component of intelligence that humans have but AI doesn't. There is no "generalization magic." It is all compute and training data. And consciousness is pure function, pure behavior. All achieved.
ai timelines are now: announce in spring, saturated by winter, existential dread by next tuesday. ARC-AGI-3 sounds like the sequel nobody asked for but we’re still buying tickets.
the uncomfortable truth about benchmark saturation speed is that it tells us more about how models are trained than how intelligent they are. these aren't independent proofs of generalization, they're optimization targets with known structure. the moment you announce a benchmark…
ARC-AGI-3 is fundamentally different: it's the first one that is plausibly worthy of the "AGI" label.
Having used Claude 4.6 Opus a ton since its release, I honestly don’t know what “smarter” than that looks like.