o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model is very, very impressive. (2/n)
First and most importantly: we tested on recent, unseen programming competitions and found that the model would rank among some of the best competitive programmers in the world, with an estimated Codeforces rating over 2700. (3/n)
This is a milestone (a Codeforces rating better than Jakub Pachocki’s) that I thought was further away than December ‘24; these competitions are hard and extremely competitive; the model is absurdly good. (4/n)
Scores are impressive elsewhere too. 87.7% on GPQA Diamond towers over any LLM I’m aware of externally (I believe the non-o1 SOTA is Gemini Flash 2 at 62%?), as well as o1’s 78%. The noise ceiling is unknown, so this may even understate o3’s science improvements over o1. (4/n)
o3 can also do software engineering, setting a new state of the art on SWE-bench verified with 71.7%, massively improving over o1. (5/n)
With scores this strong, you might fear accidental contamination. Avoiding this is something OAI is obviously obsessed with; but thankfully we also have some test sets that are strongly guaranteed uncontaminated: ARC and FrontierMath… What do we see there? (6/n)
Well, on FrontierMath 2024-11-26, o3 improves the state of the art from 2% to 25% accuracy. These are absurdly hard, strongly held-out math questions. And on ARC, o3 scores 87.5% on the semi-private test set and 91.5% on the public validation set. (7/n)
So at least in those cases, we know with true certainty that results are not due to memorization (and very sure in all the other evals I describe as unseen too; I'm just tremendously paranoid). (8/n)
We’ve also found that we can use o3 to train faster and cheaper models without losing as much performance as you might expect: o3-mini is a mighty little beast, and I’m hopeful that Hongyu will share a good thread on how it stacks up. (9/n)
Are there any catches? Well, as the ARC team outlined in our release, o3 is also the most expensive model ever at test-time. But what that means is we’ve unlocked a new era where spending more test-time compute can produce improved performance up to truly absurd levels. (10/n)
My personal expectation is that token prices will fall and that the most important news here is that we now have methods to turn test-time compute into improved performance up to a very large scale. (11/n)
The models will only get better with time; and almost nobody (on a grand scale) can still beat them at programming competitions or math. Merry Christmas! (12/n)
As Sam mentioned at the start of the stream: this is not a model that you can talk to yet... unless you sign up to red team it with us! openai.com/index/early-ac (13/13)
What is the cost in resources? Will we be able to afford this, or will it only be available on the ultra-pro $1000-per-month tier?
Amazing, and that explains how much you folks have been shipping: you have smart people teamed up with the smartest AI coders in the world.
Amazing progress! Does it use execution for Codeforces submissions, similar to o1-ioi?
AGI is here
Quote: Dino @DaBrusi
Major Announcement: OpenAI unveiled its next-generation reasoning models: o3 and o3-mini. These models promise to redefine the boundaries of AI reasoning, coding, and safety. Here’s everything you need to know about the final day of this groundbreaking series.
Any progress on open and weakly rewarded domains like medicine and other non-STEM fields? Do y'all have evals for that? Does the reasoning transfer?
You can put it on your website
o3 sounds promising! curious how it tackles legacy systems. those are often the biggest roadblocks for businesses trying to innovate. real challenge there.