GitHub - aw31/openai-imo-2025-proofs

2/N We evaluated our models on the 2025 IMO problems under the same rules as human contestants: two 4.5 hour exam sessions, no tools or internet, reading the official problem statements, and writing natural language proofs.

3/N Why is this a big deal? First, IMO problems demand a new level of sustained creative thinking compared to past benchmarks. In reasoning time horizon, we’ve now progressed from GSM8K (~0.1 min for top humans) → MATH benchmark (~1 min) → AIME (~10 mins) → IMO (~100 mins).

4/N Second, IMO submissions are hard-to-verify, multi-page proofs. Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians.

5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.

6/N In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model’s submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold!

7/N HUGE congratulations to the team (@SherylHsu02, @polynoamial, and the many giants whose shoulders we stood on) for turning this crazy dream into reality! I am lucky I get to spend late nights and early mornings working alongside the very best.

8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.

9/N Still, this underscores how fast AI has advanced in recent years. In 2021, my PhD advisor @JacobSteinhardt had me forecast AI math progress by July 2025. I predicted 30% on the MATH benchmark (and thought everyone else was too optimistic). Instead, we have IMO gold.

10/N If you want to take a look, here are the model’s solutions to the 2025 IMO problems! The model solved P1 through P5; it did not produce a solution for P6. (Apologies in advance for its … distinct style; it is very much an experimental model.)

11/N Lastly, we'd like to congratulate all the participants of the 2025 IMO on their achievement! We are proud to have many past IMO participants at @OpenAI and recognize that these are some of the brightest young minds of the future.

Congrats on this result! You know it's impressive when the graph looks like this:

Amazing work! To make things totally rigorous, you should next have a device truly take it live in the real testing room with other humans, taking a photo of the real physical test papers. Otherwise there will always be some lurking fear that the model was somehow contaminated.

AI Notkilleveryoneism Memes

@AISafetyMemes

So what's the next goalpost? What's the next thing LLMs will never be able to do?

Problem 5 solution: "So under his saturate response, he never loses. For her to win, must make him unable at some even -> would need Q_{even-1}>even, i.e. some a_j> sqrt2. but we just showed always a_j<=c< sqrt2. So she can never cause his loss. So against this fixed response of

congrats! i think you guys are ahead of the consensus timelines, which are already pretty wild.

Quote

Kal

@andromeda74356

Jul 16

Replying to @andresnds and @OpenAI

what was the scaffolding like for the model? is this the first general model that can meaningfully work on a task for 10 hours?

Except that IMO organizers asked you not to steal the spotlight from the kids and wait for a week before announcing your results.

Quote

Mikhail Samin

@Mihonarium

Jul 20

According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony. According to a Coordinator on Problem 6, the one problem OpenAI

lmao

So let me get this straight: their model basically competed live on the IMO, so all the mathematical tasks should be novel enough. All previous years' IMO tasks in benchmarks are fully saturated, in big part because of data contamination, as it doesn't generalize to these new ones.

Sweet Bitter lesson

"We developed new techniques that make LLMs a lot better at hard-to-verify tasks." A general method? Or just for mathematical proofs? Is Lean somehow used, maybe just in training?

Formulating the right question is solving it. As AI turns answering into a commodity, research shifts to crafting questions with clear definitions and criteria—the real creative step.

Soooo what is the breakthrough? >"Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians." >"We reach this capability level not

So public AI models are bad at IMO, while internal models are getting gold medals? Fascinating

Absolutely fantastic

I wonder if this is the same reasoning model that competed recently at the AtCoder World Finals. It seems to exhibit similar behaviors: working on long-horizon tasks, demonstrating effective long-term planning, and operating autonomously without external tools or human

Quote

Andre Saraiva

@andresnds

Jul 16

Replying to @andresnds

15/N We ran the model fully autonomously for the 10h window—no human intervention; same submission/data/tools/time budget as everyone else. Watching it iterate live beside elite humans was electric.

Brilliant work! Please get this model into the hands of the world's best mathematicians before the general release! We need to accelerate scientific progress ASAP! I'm sure you're already doing it, but their feedback could be invaluable! Hopefully this helps them!

This shifts the center of gravity from scaling laws to alignment of reasoning itself. Curious how you’re thinking about evals that capture creative failure modes, not just final answer correctness

Interesting, but here is a nuanced take from Terence Tao

Congratulations! That's an incredible result, and a great moment for AI progress. You guys should release the model

This model's reasoning has taken George Orwell’s six rules for efficient writing to a whole new level! This makes me think about the dimensions of efficient thinking in general - and the density of information per token. If an AI generates text that looks like gibberish but

Step towards mathematical superintelligence

this next!

Quote

Dan Loewenherz

@dwlz

Jul 19

I’d like to see this in reverse. That is, how hard is the IMO when AI writes the questions, and how well do the humans do? x.com/alexwei_/statu…

AlphaGo - The Movie | Full award-winning documentary

If just a pure general LLM without tools etc. won an IMO gold medal on previously unseen math tasks, I am once again struck by the bitter lesson, because I thought that we would need neurosymbolic systems for this.

Why did you not have it graded officially?

Hey, what are your thoughts about this?

congrats!!! what a beautiful strawberry

Congrats, these are incredible results! Quick question: did it use Lean, or just an LLM? If it’s just an LLM… that’s insane.

pretty impressive. is this the anonymous chatbot we're seeing on webdev arena by chance?

So we are 2 years away from my toaster being an IMO gold medalist. What a time to be alive.

This just in; computers are good at maths.

INSANE, but unsurprising given who was working on it ;)

I vaguely remember that just last year, AI was struggling at high-school-level math

Congrats! For an hour long thinking, 20 tokens per second give or take, 70k tokens in the CoT for each run? AGI systems are going to be so expensive in the future lol.

Yesterday we were discussing how all AI models except Gemini 2.5 Pro failed to solve this problem and immediately confused it with the Monty Hall problem; today it gets a gold medal on the IMO lol. Still fails on the combinatorics problem (P6) though

Aritra Roy Gosthipaty

@ariG23498

Jul 20

This calls for a YouTube documentary, similar to

youtube.com

With more board configurations than there are atoms in the universe, the ancient Chinese game of Go has long been considered a grand challenge for artificial...

Impressive achievement! Achieving gold medal-level on IMO tasks shows how far reasoning LLMs have come. How do you see this impacting real-world problem-solving and future models like GPT-6?

Amazing! Hope it comes to us soon. I love o3 but I feel it's way more narrow minded than gpt-4.5, so I often find myself working with the two of them together in the same conversation, to get the reasoning of o3 and the breadth of understanding and Overton window of 4.5.

its "distinct style" reminds me of someone....

I’m super curious to learn how the model has been evolved, although still essentially a token generating machine, to generate human level proofs. It implies that there is some level of sophistication that must be more than just training data.

wow so it can do math. can it get me a gf tho

This looks exciting. RL is back. Could lead to many new breakthroughs in AI/LLM. A new golden age. RL is what originally "made" ChatGPT (transformer alone wasn't a good user experience and never caught on).

“Math will fall first.” was correct.

Mirko Monti

@mirko_monti6

Congrats to you and the team! Impressive result, truly awesome! <3

Vaibhav (VB) Srivastav

@reach_vb

Massive feat! I love how concise and to the point the generations are, unlike the majority of LLMs, open and closed alike

Oh my god LFG

4.5 hours means nothing for measurement though. We need to know total flops spent or total joules spent.

This must be GPT-5

Maybe you just bought the answer?

Incredible work.

The attached photo feels like this… but jokes apart… congratulations and awaiting GPT-5 launch

Is this using formal verification tools internally?

dumb question, i know, but how much money (or gpu-hours) was used for the answers?:)

I just woke up and this post has 1M views after a few hours. AI does not sleep.

Wow, congratulations!

IKE ∞ ORIGIN⚙︎GODCORE

@ikechan2_15

Impressive. But here’s the real question: Can a model that conquers math… also decode meaning? Reasoning ≠ relevance. Solving IMO problems is mastery of structure. But true intelligence will solve what society doesn't yet know how to ask. We’re not just building LLMs. We’re

wave

@0xWave