o3 represents enormous progress in general-domain reasoning with RL — excited that we were able to announce some results today! Here’s a summary of what we shared about o3 in the livestream (1/n)

o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model is very, very impressive. (2/n)
Firstly and most importantly: we tested on recent unseen programming competitions and found that the model would rank amongst some of the best competitive programmers in the world, with an estimated CodeForces rating over 2700. (3/n)
This is a milestone (a CodeForces rating better than Jakub Pachocki’s) that I thought was further away than December ‘24; these competitions are hard and extremely competitive; the model is absurdly good. (4/n)
Scores are impressive elsewhere too. 87.7% on GPQA Diamond towers over any LLM I’m aware of externally (I believe non-o1 SOTA is Gemini Flash 2 at 62%?), as well as o1’s 78%. Unknown noise ceiling, so this may even understate o3’s science improvements over o1. (4/n)
o3 can also do software engineering, setting a new state of the art on SWE-bench verified with 71.7%, massively improving over o1. (5/n)
With scores this strong, you might fear accidental contamination. Avoiding this is something OAI is obviously obsessed with; but thankfully we also have some test sets that are strongly guaranteed uncontaminated: ARC and FrontierMath… What do we see there? (6/n)
Well, on FrontierMath 2024-11-26 o3 improves the state of the art from 2% to 25% accuracy. These are absurdly hard, strongly held-out math questions. And on ARC, o3 scores 87.5% on the semi-private test set and 91.5% on the public validation set. (7/n)
So at least in those cases, we know with true certainty that results are not due to memorization (and very sure in all the other evals I describe as unseen too; I'm just tremendously paranoid). (8/n)
We’ve also found that we can use o3 to train faster and cheaper models without losing as much performance as you might expect: o3-mini is a mighty little beast, and I’m hopeful that Hongyu will share a good thread on how it stacks up. (9/n)
Are there any catches? Well, as the ARC team outlined in our release, o3 is also the most expensive model ever at test-time. But what that means is we’ve unlocked a new era where spending more test-time compute can produce improved performance up to truly absurd levels. (10/n)
My personal expectation is that token prices will fall and that the most important news here is that we now have methods to turn test-time compute into improved performance up to a very large scale. (11/n)
The models will only get better with time; and almost nobody (on a grand scale) can still beat them at programming competitions or math. Merry Christmas! (12/n)
AGI is here
Quote
Dino
@DaBrusi
1/ 🚨 Major Announcement: OpenAI unveiled its next-generation reasoning models: o3 and o3-mini. These models promise to redefine the boundaries of AI reasoning, coding, and safety. Here’s everything you need to know about the final day of this groundbreaking series 👇
Any progress on open and weak-reward domains — medicine, non-STEM? Do y'all have evals for that? Does the reasoning transfer?
Hi Nat, the ARC-AGI score is truly impressive! Congratulations on the achievement. However, I have a quick question. In the presentation, the graph indicates that the o3 model is labeled as (tuned). Are there any results for models that were not fine-tuned on the training data?
o3 sounds promising! curious how it tackles legacy systems. those are often the biggest roadblocks for businesses trying to innovate. real challenge there.
Just take into account the mostly empirical and intuition-based reasoning behind current progress — from alchemy to AI —
Quote
carc.ai
@carc_ai
medium.com/@gryant/8db9e5 After yesterday's OpenAI o3 announcement I summarized some of my thoughts in a short essay — "From Alchemy to AI".