I've developed a bit of a reputation as an "AI skeptic," but I think I was just accurately reporting on the slow pace of LLM progress following GPT-4. o1 is a totally different story. It's by far the biggest jump in performance since GPT-4.
David Watson 🥑

Over the last 9 months I developed a suite of reasoning puzzles to test the capabilities of new frontier models. o1 aced every single one of them, forcing me to come up with new ones.
For example, I tested the models on long word problems like this. GPT-4o can keep track of how many marbles are in each jar up to about 50 steps, but gets confused by 70. o1-preview gets it right up to about 200 steps.
Image
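For anyone who wants to try a marble-style test themselves, here is a minimal sketch of how such a puzzle could be generated along with its ground-truth answer. The exact wording of the original puzzle is not shown in the text, so the phrasing, the jar names, the step types (move or add), and the function name `make_marble_puzzle` are all assumptions made for illustration, not the author's actual framework.

```python
import random


def pluralize(k):
    """Return '1 marble' or 'k marbles' as appropriate."""
    return f"{k} marble" if k == 1 else f"{k} marbles"


def make_marble_puzzle(num_steps, num_jars=3, seed=0):
    """Build a hypothetical marble word problem and its correct final counts.

    Returns (puzzle_text, counts), where counts maps jar name -> marbles.
    Requires num_jars >= 2 so that a source and destination jar can differ.
    """
    rng = random.Random(seed)  # seeded so the same puzzle can be regenerated
    names = [chr(ord("A") + i) for i in range(num_jars)]
    counts = {name: rng.randint(5, 10) for name in names}
    lines = [f"Jar {name} starts with {pluralize(n)}." for name, n in counts.items()]

    for step in range(1, num_steps + 1):
        src, dst = rng.sample(names, 2)
        if counts[src] > 0 and rng.random() < 0.5:
            # Move marbles between jars, never more than the source holds.
            k = rng.randint(1, counts[src])
            counts[src] -= k
            counts[dst] += k
            lines.append(f"Step {step}: move {pluralize(k)} from jar {src} to jar {dst}.")
        else:
            # Add fresh marbles to a jar.
            k = rng.randint(1, 5)
            counts[dst] += k
            lines.append(f"Step {step}: add {pluralize(k)} to jar {dst}.")

    lines.append("How many marbles are in each jar now?")
    return "\n".join(lines), counts


if __name__ == "__main__":
    puzzle, answer = make_marble_puzzle(num_steps=5, seed=42)
    print(puzzle)
    print("Ground truth:", answer)
```

Raising `num_steps` (for example 50, 70, 200) gives a rough way to probe where a model starts losing track, since the returned counts can be compared against whatever the model answers.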
Here's a problem that's challenging because it requires trial and error. GPT-4o gets stuck and gives up. o1-preview got the right answer.
Image
The one big blind spot I found is that o1 is bad at spatial reasoning. o1 can't accept images yet, but I gave it a word problem describing a set of streets like this. The brown boxes show streets that are closed. o1-preview recommended the following invalid route.
Image
While being a significant jump, it's also been 18 months since GPT-4, and 22 months since GPT-3.5. Given that, just how significant is it? It seems like the pace of development has slowed dramatically. Still a useful tool in certain niches, but still very unreliable.
It depends on what you're comparing it to. The AI field made ~no progress for many years prior to AlexNet in 2012. So GPT-4 to o1 in 18 months still feels pretty fast to me, even if it's not "AGI in 2027" fast.
If you asked 10 humans, of different ages and education levels, to try and solve your set of reasoning questions, do you think they would get them all correct? I feel like the tendency is to compare to human potential performance, and not actual performance
It really depends on the question. I bet everyone older than 10 knows that 100 pennies are worth more than 3 quarters. I bet no one would get the 200-step marble one right.
O1's the biggest jump since GPT-3. Planning was the biggest missing piece of the AGI puzzle. Superintelligence is just a matter of time and experience now.
I think we still need another breakthrough to enable AI systems to have state, separate from their context window, that lets them create and operate on abstract concepts.
Yes. Some of them aren't even original to me and have been floating around longer than my first post about them. But the fact that o1 gets so many problems the latest GPT-4o didn't suggests there's more going on.
You have to be kidding. Sonnet 3.5 was extremely ahead of 4o. Combine it with Artifacts and Projects, and they are not even in the same category. I mean, o1 is not even close to (Sonnet + Opus) + Artifacts + Projects
Shouldn’t this have been predictable? The new model from OAI pushing capabilities, while other firms catching up to GPT-4 didn’t, makes sense. O1 is a way to use more inference compute with ~GPT-4, the next gen from frontier labs will be throwing compute into training, etc.
I like the marble approach. I was considering making something similar myself, but I don’t know if it’s necessary anymore. Algorithmically generated word puzzles are the best LLM benchmark. They should be the gold standard that all LLMs are evaluated on
I believe real-world use cases should be the primary focus. I need an AI that can create full-fledged apps from a single prompt and deploy them. Puzzle-solving isn't a priority for me.
AI is in the "gimmick" stage of development, a lot like 3-d Printing was when everyone wanted to adopt it. AI is moving into the "this is dangerous" phase...We are not near the "useful" stage yet
What's the easiest way to get started with your evaluation framework? I'm working on a personal exocortex project that will eventually (think 20+ years from now) hit AGI on 20 watts - just like your brain. I'd like to better understand how far from SOTA a single rpi can be.
The drop in cost from the first GPT-4 to the latest GPT-4o is totally underrated. Economics are quite important for a new tech to have a tangible impact on society.
"skeptic" ? Why? We had the breakthrough necessary to create sentient AGI for almost a decade now. The methodology isn't being put into practice, publicly, but PHYSICALLY TANGIBLY possible? Yeah, almost 10 years. Pretty neat stuff.
Is it wrong to want it trained to be really good at matching up social media aliases with their most likely authors so we can live in an endless bubble of Mark Robinson-like reveals?
But what we were saying was wait and see before you call the pace "slow". GPT-3 and GPT-4 were launched almost 3 years apart, and within a year of GPT-4's release people were saying AI was slowing down. I believe the false narrative that it was slowing down came because ChatGPT
Except it's still incapable of giving sound advice about navigating human life. I actually think it’s gotten worse. Good reasoning (i.e., IQ) does not equate with good social understanding (i.e., for lack of a better term, EQ).
How has the pace been slow since GPT-4? You know OpenAI releases a frontier model every 2 years and everyone else is playing catch-up.
Are you confident that o1 is better at ‘logic’ and not just better at recalling the right data from its training? Or is it simply better at recalling correct data? I suspect rather than a new model with new capabilities, it’s just better training data with better chain of