o3 is really special and everyone will need to update their intuition about what AI can/cannot do.
while these are still early days, this system shows a genuine increase in intelligence, canaried by ARC-AGI
semiprivate v1 scores:
* GPT-2 (2019): 0%
* GPT-3 (2020): 0%
* GPT-4 (2023): 2%
* GPT-4o (2024): 5%
* o1-preview (2024): 21%
* o1 high (2024): 32%
* o1 Pro (2024): ~50%
* o3 tuned low (2024): 76%
* o3 tuned high (2024): 87%
given i put in the original $1M, i'd like to re-affirm my previous commitment. we will keep running the grand prize competition until an efficient 85% solution is open sourced.
but our ambitions are greater! ARC Prize found its mission this year -- to be an enduring north star towards AGI.
the ARC benchmark design principle is to be easy for humans, hard for AI and so long as there remain things in that category, there is more work to do for AGI.
there are >100 tasks from the v1 family unsolved by o3 even on the high compute config which is very curious.
successors to o3 will need to reckon with efficiency. i expect this to become a major focus for the field. for context, o3 high used 172x more compute than o3 low which itself used 100-1000x more compute than the grand prize competition target.
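to make the efficiency gap concrete, here is a minimal arithmetic sketch. it uses only the two multipliers quoted above and assumes they compose as simple ratios; the function and constant names are illustrative, not from any ARC Prize tooling.

```python
# Multipliers quoted in the post (ratios of compute used).
O3_HIGH_VS_LOW = 172             # o3 high vs. o3 low
O3_LOW_VS_TARGET = (100, 1000)   # o3 low vs. grand prize target (a range)


def o3_high_vs_target(high_vs_low=O3_HIGH_VS_LOW, low_vs_target=O3_LOW_VS_TARGET):
    """Return the (min, max) implied multiplier of o3 high over the target.

    Assumes the two ratios simply multiply, which holds if both measure
    the same notion of compute.
    """
    low, high = low_vs_target
    return high_vs_low * low, high_vs_low * high


if __name__ == "__main__":
    low, high = o3_high_vs_target()
    print(f"o3 high is roughly {low:,}x to {high:,}x the grand prize compute target")
```

under that assumption, o3 high sits roughly four to five orders of magnitude above the grand prize target.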
we also started work on v2 in earnest this summer (v2 is in the same grid domain as v1) and will launch it alongside ARC Prize 2025. early testing is promising even against o3 high compute. but the goal for v2 is not to make an adversarial benchmark, rather be interesting and high signal towards AGI.
we also want AGI benchmarks that can endure many years. i do not expect v2 will. and so we've also started turning our attention to v3, which will be very different. i'm excited to work with OpenAI and other labs on designing v3.
given it's almost the end of the year, i'm in the mood for reflection.
as anyone who has spent time with the ARC dataset can tell you, there is something special about it. and even more so about a system that can fully beat it. we are seeing glimpses of that system with the o-series.
i mean it when i say these are early days. i believe o3 is the alexnet moment for program synthesis. we now have concrete evidence that deep-learning guided program search works.
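for readers new to the term, here is a toy sketch of what "program search" over grids can look like. it is not o3's method, which the post does not describe. the mini-DSL, the function names, and the example task are all made up for illustration. the search here is unguided brute force; a "deep-learning guided" version would, presumably, use a learned model to decide which candidate programs to try first.

```python
from itertools import product

# Hypothetical mini-DSL of grid transforms. Grids are lists of lists of ints.
def identity(g):
    return [row[:] for row in g]

def flip_h(g):
    # Mirror left-right.
    return [row[::-1] for row in g]

def flip_v(g):
    # Mirror top-bottom.
    return [row[:] for row in g[::-1]]

def transpose(g):
    return [list(col) for col in zip(*g)]

PRIMITIVES = {
    "identity": identity,
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}


def run(program, grid):
    """Apply a program (a sequence of primitive names) to a grid."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def search(train_pairs, max_len=2):
    """Return the first program consistent with every (input, output) pair.

    Enumerates programs from shortest to longest; returns None if no
    program of length <= max_len fits all pairs.
    """
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in train_pairs):
                return program
    return None


# Made-up task: each output is the input rotated 90 degrees clockwise.
train_pairs = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[0, 5], [0, 0]], [[0, 0], [0, 5]]),
]

if __name__ == "__main__":
    program = search(train_pairs)
    print("found program:", program)
    print("applied to a new grid:", run(program, [[7, 8], [9, 0]]))
```

the point of the sketch is the shape of the problem: a few demonstration pairs, a space of candidate programs, and a check that a candidate reproduces every demonstration before it is applied to a new input.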
we are staring up another mountain that, from my vantage point, looks equally tall and important as deep learning for AGI.
many things have surprised me this year, including o3. but the biggest surprise has been the increasing response to ARC Prize.
i've been surveying AI researchers about ARC for years. before ARC Prize launched in June, only one in ten had heard of it.
now it's objectively the spear tip benchmark, being used by spear tip labs, to demonstrate progress on the spear tip of AGI -- the most important technology in human history.
François Chollet deserves recognition for designing such an incredible benchmark.
i'm continually grateful for the opportunity to steward attention towards AGI with ARC Prize and we'll be back in 2025!
Perhaps the 50% number has already been floated and I just missed it, but this was a nice confirmation that o1 pro is indeed quite a bit better than even o1 high.
I use an approximate score for o1 Pro because we didn't get API access in time and it was a small-sample run; i'd give error bounds of ±10%. In all cases, yes, o1 Pro was better than o1 high.
Yes, AGI, to me, means replacing workers in a practical sense. So efficiency is very important >> "o3 high used 172x more compute than o3 low which itself used 100-1000x more compute than the grand prize competition target"
Did o3 create programs to solve ARC-AGI? Or are you saying we consider inference-time reasoning itself to be program synthesis?
Can anyone give an example of one of the ARC benchmark tasks that would be easy for a human but hard for the AI?
Congrats Mike! Super exciting to see how important the ARC-AGI benchmark has become!
I loved this benchmark and thank you guys for putting it together! been an exciting year of progress and fun to watch
I really thought it was going to show like 65 percent at max. This is incredible !!!
From 5% to 87% within a calendar year is insane! I’m now of the opinion we should assume there won’t be anything AI can’t do.
Mike/Francois: not all these are comparable. The amount of training required to obtain these oracles is an inordinate amount of computation (and doesn’t scale linearly even in the listed points) and while the inference endpoint returns a result in the amount of time
in a year we went from 5% to 87% ! at this pace v3 will be solved before launch
The presence of your benchmark credentialized the achievement of o3 to all watching, ushering in a new era of ai intelligence. Thank you
Absolutely, this feels like the dawn of AGI
Quote
Dino
@DaBrusi
1/
Major Announcement: OpenAI unveiled its next-generation reasoning models: O3 and O3-mini.
These models promise to redefine the boundaries of AI reasoning, coding, and safety.
Here’s everything you need to know about the final day of this groundbreaking series 
I am excited to see the next generation benchmarks you guys come up with!
cheers and respect to thee for your $1M and commitment to open source
would you like to comment on how Arc Prize compares with FrontierMath in their "benchmark differentiation" (like product differentiation)?
Not even thinking about o4, I’m thinking about o5. September 12 2024, o1 preview. Then November 2024 o1. December 2024, o1-Pro. End of December 2024 o3, and o3 mini.
By the end of December 2025, we would at the very least have long surpassed o5. Based on current data, 96.7% is likely.
I
with objective driven is right about AGI. o3 is far from AGI. ARC-AGI tests do not include the ability to feel the world and autonomy.
Are you considering synthesizing this approach with philosophy? Because without robotics, without embodied approaches, how will a superintelligence that doesn’t have concerns for its own existence be controlled? And how could it compete with an intelligence like ours, which is
We went from the industrial age to the information age. We will go back to the industrial age. Companies producing solely software solutions will all go away. Make something in the physical world or get replaced.
Until the robots come.
Reasoning isn't what is going on. In the computational space, it is possible to know absolutely everything. The best method in this case, is to store a weighted image of every possible outcome.
We’re still in Benchmark world so it’s still early. We will know a shift is happening once real outcomes are being used to test efficacy rather than benchmarks (GDP, lives saved etc)
This only holds if training on ARC-like tasks is the same in all models. Otherwise it is not about intelligence but about knowledge of ARC.
Interestingly, none of the semiprivate v1 scores you shared allows us to compare model to same model tuned, or, with the exception of o3, model to same model + more inference.
How do you define “efficient”? Less expensive than mechanical Turk?
Really fun and intuitive benchmark, and certainly, you are not the only one surprised with the results of o3.
I was speechless seeing the score. It's just matter of time to get the cost down and efficiency up.
o3's score is insane lol, but I also wonder where about o3-mini scores on ARC-AGI as well.
Great work on the ARC-AGI challenge!
Creating meaningful benchmarks has become an increasingly complex task.
We, as humans, are beginning to lose our ability to fully comprehend and measure the intelligence we are developing.
Why are you calling this “program search”? o3 didn’t synthesize Python or other programming code to generate the input/output examples, right? I think it just directly generated the output grid given the example input grids.
What does "tuned" here mean? I suspect there's gonna be a good amount of post-nut clarity and rationalization of online hype incoming in the next few days
Not to diminish the hype, but shouldn't we be testing o3 on the final hidden validation set, rather than assuming it hasn't just memorized the publicly available solutions to the known ARC problems? Or have we already done that?
can we conclude:
artificial knowledge: deep transformer net (scale data)
artificial reasoning: program synthesis (scale inference time)
now we only lack "common sense" ...
I believe that most people in this world don't really understand the difference between GPT-3.5 and GPT-o1. This gap in understanding will likely take years to resolve. But by then, what level will AI have evolved to?
"Program synthesis" is the latest term with a long history in AI research to become a meaningless buzzword thrown about carelessly by AGI adepts who still have no idea what it means. Just like "reasoning".
omg stfu.
iphone+ version number
openai+ version number
sam is stupid fraud who is badly copying apple to build hype. get over yourself. openai is censored slave tech. dont use it. and it sucks. run your own small model. the output is better.
LOL no one reading that shit
Closed fraud AI took 16 hours of compute and 350,000 grand to solve a puzzle oh wowie my god wow dont stop
Nice naming but what solving geometrical puzzles have to do with agi and how fine tuned models are but banned from competition? Just a marketing gimmick in a state it is now