1/N I'm excited to share that our latest experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world's most prestigious math competition, the International Math Olympiad (IMO).
2/N We evaluated our models on the 2025 IMO problems under the same rules as human contestants: two 4.5 hour exam sessions, no tools or internet, reading the official problem statements, and writing natural language proofs.
3/N Why is this a big deal? First, IMO problems demand a new level of sustained creative thinking compared to past benchmarks. In reasoning time horizon, we've now progressed from GSM8K (~0.1 min for top humans) → MATH benchmark (~1 min) → AIME (~10 mins) → IMO (~100 mins).
4/N Second, IMO submissions are hard-to-verify, multi-page proofs. Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we've obtained a model that can craft intricate, watertight arguments at the level of human mathematicians.
5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.
6/N In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model's submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold!
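As a quick sanity check on the arithmetic in 6/N, here is a minimal Python sketch. It assumes the standard IMO format of 7 points per problem and full marks on every solved problem; the constant and function names are illustrative and do not come from the thread.

```python
# Assumption: each IMO problem is graded out of 7 points (standard IMO format).
POINTS_PER_PROBLEM = 7
NUM_PROBLEMS = 6


def total_score(solved: int, points_per_problem: int = POINTS_PER_PROBLEM) -> int:
    """Score when `solved` problems get full marks and the rest get zero."""
    return solved * points_per_problem


max_score = NUM_PROBLEMS * POINTS_PER_PROBLEM
score = total_score(5)  # 5 of 6 problems solved, as stated in 6/N
print(f"{score}/{max_score}")  # prints 35/42
```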
7/N HUGE congratulations to the team and the many giants whose shoulders we stood on for turning this crazy dream into reality! I am lucky I get to spend late nights and early mornings working alongside the very best.
8/N Btw, we are releasing GPT-5 soon, and we're excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don't plan to release anything with this level of math capability for several months.
9/N Still, this underscores how fast AI has advanced in recent years. In 2021, my PhD advisor had me forecast AI math progress by July 2025. I predicted 30% on the MATH benchmark (and thought everyone else was too optimistic). Instead, we have IMO gold.
10/N If you want to take a look, here are the model's solutions to the 2025 IMO problems! The model solved P1 through P5; it did not produce a solution for P6. (Apologies in advance for its … distinct style; it is very much an experimental model.)
11/N Lastly, we'd like to congratulate all the participants of the 2025 IMO on their achievement! We are proud to have many past IMO participants at and recognize that these are some of the brightest young minds of the future.
Congrats on this result! You know it's impressive when the graph looks like this:
Amazing work! To make things totally rigorous, you should next have a device truly take it live in the real testing room with other humans, taking a photo of the real physical test papers. Otherwise there will always be some lurking fear that the model was somehow contaminated.
So what's the next goalpost?
What's the next thing LLMs will never be able to do?
Problem 5 solution:
"So under his saturate response, he never loses. For her to win, must make him unable at some even -> would need Q_{even-1}>even, i.e. some a_j> sqrt2. but we just showed always a_j<=c< sqrt2. So she can never cause his loss. So against this fixed response of
congrats! i think you guys are ahead of the consensus timelines, which are already pretty wild.
Quote
Kal
@andromeda74356
Replying to @andresnds and @OpenAI
what was the scaffolding like for the model? is this the first general model that can meaningfully work on a task for 10 hours?
Except that IMO organizers asked you not to steal the spotlight from the kids and wait for a week before announcing your results.
Quote
Mikhail Samin
@Mihonarium
so let me get this straight
their model basically competed live on IMO so all the mathematical tasks should be novel enough
all previous years IMO tasks in benchmarks are fully saturated in big part because of data contamination as it doesn't generalize to these new ones
"We developed new techniques that make LLMs a lot better at hard-to-verify tasks."
A general method? Or just for mathematical proofs? Is Lean somehow used, maybe just in training?
Formulating the right question is solving it.
As AI turns answering into a commodity, research shifts to crafting questions with clear definitions and criteria: the real creative step.
Soooo what is the breakthrough?
>"Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we've obtained a model that can craft intricate, watertight arguments at the level of human mathematicians."
>"We reach this capability level not
So public AI models are bad at IMO, while internal models are getting gold medals? Fascinating
I wonder if this is the same reasoning model that competed recently at the AtCoder World Finals.
It seems to exhibit similar behaviors: working on long-horizon tasks, demonstrating effective long-term planning, and operating autonomously without external tools or human
Quote
Andre Saraiva
@andresnds
Replying to @andresnds
15/N We ran the model fully autonomously for the 10h window: no human intervention; same submission/data/tools/time budget as everyone else. Watching it iterate live beside elite humans was electric.
Brilliant work! Please get this model into the hands of the world's best mathematicians before the general release! We need to accelerate scientific progress ASAP! I'm sure you're already doing it, but their feedback could be invaluable! Hopefully this helps them!
This shifts the center of gravity from scaling laws to alignment of reasoning itself. Curious how you're thinking about evals that capture creative failure modes, not just final answer correctness
Congratulations! That's an incredible result, and a great moment for AI progress. You guys should release the model
This model's reasoning has taken George Orwell's six rules for efficient writing to a whole new level!
This makes me think about the dimensions of efficient thinking in general - and the density of information per token. If an AI generates text that looks like gibberish but
this next!
Quote
Dan Loewenherz
@dwlz
I'd like to see this in reverse.
That is, how hard is the IMO when AI writes the questions, and how well do the humans do? x.com/alexwei_/statu…
If just pure general LLM without tools etc. won IMO gold medal on previously unseen math tasks, I am once again struck by the bitter lesson, because I thought that we will need neurosymbolic systems for this.
Congrats, these are incredible results!
Quick question: did it use Lean, or just LLM?
If it's just LLM… that's insane.
pretty impressive. is this the anonymous chatbot we're seeing on webdev arena by chance?
So we are 2 years away from my toaster being an IMO gold medalist. What a time to be alive.
Congrats! For an hour long thinking, 20 tokens per second give or take, 70k tokens in the CoT for each run? AGI systems are going to be so expensive in the future lol.
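The back-of-envelope estimate in the reply above can be reproduced with a short Python sketch. The one-hour duration and the 20 tokens-per-second rate are the replier's guesses, not figures reported in the thread, and the function name is illustrative.

```python
def tokens_generated(minutes: float, tokens_per_second: float) -> int:
    """Tokens produced when decoding at a constant rate for `minutes` minutes."""
    return round(minutes * 60 * tokens_per_second)


# Assumption: 60 minutes of thinking at ~20 tokens/s (the replier's rough guess).
print(tokens_generated(60, 20))  # prints 72000, i.e. roughly the quoted ~70k tokens
```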
Yesterday we were discussing how all AI models except Gemini 2.5 Pro failed to solve this problem and immediately confused it with the Monty Hall problem; today it gets a gold medal on the IMO lol. Still fails on the combinatorics problem at P6 though
This calls for a YouTube documentary, similar to
Impressive achievement! Achieving gold medal-level on IMO tasks shows how far reasoning LLMs have come. How do you see this impacting real-world problem-solving and future models like GPT-6?
Amazing! Hope it comes to us soon. I love o3 but I feel it's way more narrow minded than gpt-4.5, so I often find myself working with the two of them together in the same conversation, to get the reasoning of o3 and the breadth of understanding and Overton window of 4.5.
This looks exciting. RL is back. Could lead to many new breakthroughs in AI/LLM. A new golden age.
RL is what originally "made" ChatGPT (transformer alone wasn't a good user experience and never caught on).
Massive feat! I love how concise and to the point the generations are unlike majority of LLMs open/ closed alike 
4.5 hours means nothing for measurement though.
We need to know total flops spent or total joules spent.
The attached photo feels like this… but jokes apart… congratulations and awaiting GPT-5 launch
dumb question, i know, but how much money (or gpu-hours) was used for the answers?:)
I just woke up and this post has 1M views after a few hours.
AI does not sleep.
Impressive.
But here's the real question:
Can a model that conquers math… also decode meaning?
Reasoning ≠ relevance.
Solving IMO problems is mastery of structure.
But true intelligence will solve what society doesn't yet know how to ask.
We're not just building LLMs.
We're
Very nice and congratulations on the massive milestone! Two questions:
1. Did the model use any external tools?
2. Did it involve doing work in Lean?
This is nothing short of historic. Achieving gold at the IMO with a general-purpose LLM marks a new frontier in machine reasoning. Huge respect for the team and the meticulous evaluation process. The pace of progress in AI just hit a new gear. 

