o3 represents enormous progress in general-domain reasoning with RL — excited that we were able to announce some results today! Here’s a summary of what we shared about o3 in the livestream (1/n)

o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model is very, very impressive. (2/n)
Firstly and most importantly: we tested on recent unseen programming competitions and found that the model would rank amongst some of the best competitive programmers in the world, with an estimated CodeForces rating over 2700. (3/n)
This is a milestone (a CodeForces rating better than Jakub Pachocki’s) that I thought was further away than December ‘24; these competitions are hard and extremely competitive; the model is absurdly good. (4/n)
Scores are impressive elsewhere too. 87.7% on GPQA Diamond towers over any LLM I’m aware of externally (I believe non-o1 SOTA is Gemini Flash 2 at 62%?), as well as o1’s 78%. Unknown noise ceiling, so this may even understate o3’s science improvements over o1. (4/n)
o3 can also do software engineering, setting a new state of the art on SWE-bench verified with 71.7%, massively improving over o1. (5/n)
With scores this strong, you might fear accidental contamination. Avoiding this is something OAI is obviously obsessed with; but thankfully we also have some test sets that are strongly guaranteed uncontaminated: ARC and FrontierMath… What do we see there? (6/n)
Well, on FrontierMath 2024-11-26 o3 improves the state of the art from 2% to 25% accuracy. These are absurdly hard, strongly held-out math questions. And on ARC, o3 scores 87.5% on the semi-private test set and 91.5% on the public validation set. (7/n)
So at least in those cases, we know with true certainty that results are not due to memorization (and very sure in all the other evals I describe as unseen too; I'm just tremendously paranoid). (8/n)
We’ve also found that we can use o3 to train faster and cheaper models without losing as much performance as you might expect: o3-mini is a mighty little beast, and I’m hopeful that Hongyu will share a good thread on how it stacks up. (9/n)
Are there any catches? Well, as the ARC team outlined in our release, o3 is also the most expensive model ever at test-time. But what that means is we’ve unlocked a new era where spending more test-time compute can produce improved performance up to truly absurd levels. (10/n)
My personal expectation is that token prices will fall and that the most important news here is that we now have methods to turn test-time compute into improved performance up to a very large scale. (11/n)
The models will only get better with time; and almost nobody (on a grand scale) can still beat them at programming competitions or math. Merry Christmas! (12/n)
AGI is here
Quote
Dino
@DaBrusi
1/ 🚨 Major Announcement: OpenAI unveiled its next-generation reasoning models: o3 and o3-mini. These models promise to redefine the boundaries of AI reasoning, coding, and safety. Here’s everything you need to know about the final day of this groundbreaking series 👇
Any progress on open and weak-reward domains — medicine, non-STEM? Do y'all have evals for that? Does the reasoning transfer?
Hi Nat, the ARC-AGI score is truly impressive! Congratulations on the achievement. However, I have a quick question. In the presentation, the graph indicates that the o3 model is labeled as (tuned). Are there any results for models that were not fine-tuned on the training data?
o3 sounds promising! curious how it tackles legacy systems. those are often the biggest roadblocks for businesses trying to innovate. real challenge there.
Just take into account the mostly empirical and intuition-based reasoning behind current progress — from alchemy to AI —
Quote
carc.ai
@carc_ai
medium.com/@gryant/8db9e5 After yesterday's OpenAI o3 announcement I summarized some of my thoughts in a short essay — "From Alchemy to AI".