o3 is really special and everyone will need to update their intuition about what AI can/cannot do.

while these are still early days, this system shows a genuine increase in intelligence, canaried by ARC-AGI semiprivate v1 scores:

* GPT-2 (2019): 0%
* GPT-3 (2020): 0%
* GPT-4 (2023): 2%
* GPT-4o (2024): 5%
* o1-preview (2024): 21%
* o1 high (2024): 32%
* o1 Pro (2024): ~50%
* o3 tuned low (2024): 76%
* o3 tuned high (2024): 87%

given i put in the original $1M, i'd like to re-affirm my previous commitment. we will keep running the grand prize competition until an efficient 85% solution is open sourced.

but our ambitions are greater! ARC Prize found its mission this year -- to be an enduring north star towards AGI. the ARC benchmark design principle is to be easy for humans, hard for AI and so long as there remain things in that category, there is more work to do for AGI.

there are >100 tasks from the v1 family unsolved by o3 even on the high compute config which is very curious.

successors to o3 will need to reckon with efficiency. i expect this to become a major focus for the field. for context, o3 high used 172x more compute than o3 low which itself used 100-1000x more compute than the grand prize competition target.

we also started work on v2 in earnest this summer (v2 is in the same grid domain as v1) and will launch it alongside ARC Prize 2025. early testing is promising even against o3 high compute. but the goal for v2 is not to make an adversarial benchmark, rather be interesting and high signal towards AGI.

we also want AGI benchmarks that can endure many years. i do not expect v2 will. and so we've also started turning attention to v3 which will be very different. im excited to work with OpenAI and other labs on designing v3.

given it's almost the end of the year, im in the mood for reflection.

as anyone who has spent time with the ARC dataset can tell you, there is something special about it. and even moreso about a system that can fully beat it. we are seeing glimpses of that system with the o-series.

i mean it when i say these are early days. i believe o3 is the alexnet moment for program synthesis. we now have concrete evidence that deep-learning guided program search works. we are staring up another mountain that, from my vantage point, looks equally tall and important as deep learning for AGI.

many things have surprised me this year, including o3. but the biggest surprise has been the increasing response to ARC Prize. i've been surveying AI researchers about ARC for years. before ARC Prize launched in June, only one in ten had heard of it. now it's objectively the spear tip benchmark, being used by spear tip labs, to demonstrate progress on the spear tip of AGI -- the most important technology in human history.

deserves recognition for designing such an incredible benchmark.

i'm continually grateful for the opportunity to steward attention towards AGI with ARC Prize and we'll be back in 2025!
Quote
ARC Prize
@arcprize
New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation. And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval. 1/4

Perhaps the 50% number has already been floated and I just missed it, but this was a nice confirmation that o1 pro is indeed quite a bit better than even o1 high.
I use an approximate score for o1 Pro because we didn't get API access in time and it was a small-sample run; i'd give error bounds of ±10%. In all cases, yes, o1 Pro was better than o1 high.
Yes, AGI, to me, means replacing workers in a practical sense. So efficiency is very important >> "o3 high used 172x more compute than o3 low which itself used 100-1000x more compute than the grand prize competition target"
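The two compute multiples quoted above compound, so it can help to spell out what they imply for o3 high relative to the grand prize target. A minimal sketch of that arithmetic follows; the only inputs are the 172x figure and the 100-1000x range from the post, and the variable and function names are illustrative, not anything published by ARC Prize.

```python
# Multipliers quoted in the post (the only inputs taken from the source).
O3_HIGH_VS_O3_LOW = 172          # o3 high compute relative to o3 low
O3_LOW_VS_TARGET = (100, 1000)   # o3 low relative to the grand prize target (quoted range)


def compound_range(multiplier, base_range):
    """Scale a (low, high) range by a single multiplier (hypothetical helper)."""
    low, high = base_range
    return multiplier * low, multiplier * high


o3_high_vs_target = compound_range(O3_HIGH_VS_O3_LOW, O3_LOW_VS_TARGET)
print(f"o3 high vs. target: {o3_high_vs_target[0]:,}x to {o3_high_vs_target[1]:,}x")
```

Under those quoted figures, o3 high lands roughly four to five orders of magnitude above the efficiency target, which is the gap the efficiency comments in this thread are pointing at.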
Did o3 create programs to solve ARC-AGI? Or are you saying we consider inference-time reasoning itself to be program synthesis?
From 5% to 87% within a calendar year is insane! I’m now of the opinion we should assume there won’t be anything AI can’t do.
Mike/Francois: not all these are comparable. The amount of training required to obtain these oracles is an inordinate amount of computation (and doesn’t scale linearly even in the listed points) and while the inference endpoint returns a result in the amount of time
Absolutely, this feels like the dawn of AGI
Quote
Dino
@DaBrusi
1/ 🚨 Major Announcement: OpenAI unveiled its next-generation reasoning models: O3 and O3-mini. These models promise to redefine the boundaries of AI reasoning, coding, and safety. Here’s everything you need to know about the final day of this groundbreaking series 👇
Agents from now on will be a lot more capable. They could already communicate pretty decently, reasoning was always the big limitation. Looks like today that's changed. Implementing now.
cheers and respect to thee for your $1M and commitment to open source. would you like to comment on how ARC Prize compares with FrontierMath in their "benchmark differentiation" (like product differentiation)?
Not even thinking about o4, I’m thinking about o5. September 12 2024, o1 preview. Then November 2024 o1. December 2024, o1-Pro. End of December 2024 o3, and o3 mini. By end of December 2025, we would at the very least have long surpassed o5. Based on current data, 96.7% is likely. I
Are you considering synthesizing this approach with philosophy? Because without robotics, without embodied approaches, how will a superintelligence that doesn’t have concerns for its own existence be controlled? And how could it compete with an intelligence like ours, which is
We went from the industrial age to the information age. We will go back to the industrial age. Companies producing solely software solutions will all go away. Make something in the physical world or get replaced. Until the robots come.
Reasoning isn't what is going on. In the computational space, it is possible to know absolutely everything. The best method in this case is to store a weighted image of every possible outcome.
We’re still in Benchmark world so it’s still early. We will know a shift is happening once real outcomes are being used to test efficacy rather than benchmarks (GDP, lives saved etc)
Interestingly, none of the semiprivate v1 scores you shared allows us to compare model to same model tuned, or, with the exception of o3, model to same model + more inference.
This should be the single biggest news story of all time. But, unfortunately, people still don't seem to realize what's about to happen. Everything is about to change. Buckle up, it's going to be fun.
Great work on the ARC-AGI challenge! Creating meaningful benchmarks has become an increasingly complex task. We, as humans, are beginning to lose our ability to fully comprehend and measure the intelligence we are developing.
test-time search -> solve harder problems -> train on the solutions & the processes -> repeat from step 1 - - - -> ASI
Why are you calling this “program search “? O3 didn’t synthesize python or other programming code to generate the input/output examples, right? I think it just directly generated the output grid given the example input grids.
What does "tuned" here mean? I suspect there's gonna be a good amount of post-nut clarity and rationalization of online hype incoming in the next few days
Not to diminish the hype, but shouldn't we be testing o3 on the final hidden validation set, rather than assuming it hasn't just memorized the publicly available solutions to the known ARC problems? Or have we already done that?
can we conclude:
artificial knowledge: deep transformer net (scale data)
artificial reasoning: program synthesis (scale inference time)
now we only lack "common sense" ...
I believe that most people in this world don't really understand the difference between GPT-3.5 and GPT-o1. This gap in understanding will likely take years to resolve. But by then, what level will AI have evolved to?
"Program synthesis" is the latest term with a long history in AI research to become a meaningless buzzword thrown about carelessly by AGI adepts who still have no idea what it means. Just like "reasoning".
omg stfu. iphone+ version number openai+ version number sam is stupid fraud who is badly copying apple to build hype. get over yourself. openai is censored slave tech. dont use it. and it sucks. run your own small model. the output is better.
Nice naming, but what does solving geometrical puzzles have to do with AGI, and how are fine tuned models scored but banned from competition? Just a marketing gimmick in the state it is in now
