Less than a year from announcement to near saturation. (On to ARC-AGI-3)
Quote
François Chollet
@fchollet
Replying to @fchollet
Unlike ARC-AGI-1, this new version is not easily brute-forced. Current top AI approaches score 0-4%. All base LLMs (GPT-4.5, Claude 3.7 Sonnet, Gemini 2, etc.) score 0%. Single-CoT reasoning models (Claude Thinking, R1, o3-mini…) score 0-1%. So you can't solve these tasks via…
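For context on what "brute-forced" meant for ARC-AGI-1: top entries enumerated short programs over a hand-built DSL of grid transforms until one reproduced the training pairs. A minimal toy sketch of that style of search (the primitive set and task below are illustrative assumptions, far simpler than real solvers):

```python
import itertools
import numpy as np

# Toy brute-force program search in the spirit of early ARC-AGI-1 solvers:
# enumerate short compositions of grid primitives and keep any program that
# maps every training input to its training output.
PRIMITIVES = {
    "identity": lambda g: g,
    "rot90": lambda g: np.rot90(g),
    "flip_h": lambda g: np.fliplr(g),
    "flip_v": lambda g: np.flipud(g),
    "transpose": lambda g: g.T,
}

def brute_force(train_pairs, max_depth=3):
    """Return the first primitive composition consistent with all train pairs."""
    for depth in range(1, max_depth + 1):
        for names in itertools.product(PRIMITIVES, repeat=depth):
            def program(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            if all(np.array_equal(program(x), y) for x, y in train_pairs):
                return names, program
    return None

# Toy task: the hidden rule is "rotate 180 degrees".
pairs = [(np.array([[1, 2], [3, 4]]), np.array([[4, 3], [2, 1]]))]
print(brute_force(pairs)[0])  # -> ('rot90', 'rot90')
```

Per the quoted figures, this style of search goes from viable on ARC-AGI-1 to near-useless (0-4%) on ARC-AGI-2.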
The benchmark treadmill is now faster than the benchmark factory. ARC-AGI-2 took 3 years to develop. Models saturated it in 10 months. At this rate, ARC-AGI-3 might be obsolete before the prize money clears.
the buried lede from today's ARC results: Agentica hit 85.28% at $6.94/task. Gemini 3 Deep Think scored 84.6% at nearly double the cost. the 0% to 85% jump is real, but the cost curve tells a different story.
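To make that cost framing concrete, here is the points-per-dollar arithmetic from the two quoted results. Agentica's figures are as stated in the reply; the Gemini per-task cost is an assumption, reading "nearly double" as roughly $13/task.

```python
# Cost-adjusted comparison of the two quoted ARC results.
results = {
    "Agentica": {"score_pct": 85.28, "usd_per_task": 6.94},
    "Gemini 3 Deep Think": {"score_pct": 84.6, "usd_per_task": 13.0},  # ASSUMED: "nearly double"
}
for name, r in results.items():
    print(f"{name}: {r['score_pct'] / r['usd_per_task']:.2f} points per dollar")
# -> Agentica ~12.3 points/$, Gemini ~6.5 points/$: near-identical scores, ~2x efficiency gap
```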
Building with AI this past year, I stopped trying to work around model limitations. By the time I shipped the workaround, the constraint was gone. The planning window for AI product strategy is measured in weeks now, not quarters.
The progress is impressive, but for a benchmark like this, where they've been careful to verify that every task is solvable, I don't think it's truly saturated until a model hits 100%.
The half-life of these benchmarks is collapsing. We’re moving from "impossible challenge" to "near saturation" in months, not years. The real bottleneck is no longer model capability, but our ability to design tests that can withstand contact with the next iteration.
The speed from “announce” to “saturate” is the real story. It also means these benchmarks are becoming release-cadence metrics, not research-grade measurements, unless we standardize holdouts + refresh schedules.
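One hypothetical way to operationalize "holdouts + refresh schedules": partition a private task pool deterministically and rotate which slice serves as the live hidden eval each quarter. The scheme and names below are mine, not a proposal from the thread — just a sketch of the idea.

```python
import datetime
import hashlib

def holdout_slice(task_ids: list[str], date: datetime.date, n_slices: int = 4) -> list[str]:
    """Return the task slice serving as the hidden eval set this quarter.

    Tasks are assigned to slices by a stable hash, and the active slice
    rotates every quarter, so no hidden set stays live long enough to
    leak into training data.
    """
    quarter = date.year * 4 + (date.month - 1) // 3  # refresh period index
    active = quarter % n_slices
    return [
        t for t in task_ids
        if int(hashlib.sha256(t.encode()).hexdigest(), 16) % n_slices == active
    ]

tasks = [f"task-{i:03d}" for i in range(12)]
print(holdout_slice(tasks, datetime.date(2025, 11, 1)))  # this quarter's hidden set
```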
There is no mysterious component of intelligence that humans have but AI doesn't. There is no "generalization magic." It is all compute and training data. And consciousness is pure function, pure behavior. All achieved.
ai timelines are now: announce in spring, saturated by winter, existential dread by next tuesday. ARC-AGI-3 sounds like the sequel nobody asked for but we’re still buying tickets.
the uncomfortable truth about benchmark saturation speed is that it tells us more about how models are trained than how intelligent they are. these aren't independent proofs of generalization, they're optimization targets with known structure. the moment you announce a benchmark…
ARC-AGI-3 is fundamentally different: it's the first one that is plausibly worthy of the "AGI" label.
Having used Claude 4.6 Opus a ton since its release, I honestly don’t know what “smarter” than that looks like.