I've developed a bit of a reputation as an "AI skeptic," but I think I was just accurately reporting on the slow pace of LLM progress following GPT-4. o1 is a totally different story. It's by far the biggest jump in performance since GPT-4.

7:25 AM · Sep 20, 2024

215.7K Views

Timothy B. Lee

@binarybits

Sep 20

Over the last 9 months I developed a suite of reasoning puzzles to test the capabilities of new frontier models. o1 aced every single one of them, forcing me to come up with new ones.

For example, I tested the models on long word problems like this. GPT-4o can keep track of how many marbles are in each jar up to about 50 steps, but gets confused by 70. o1-preview gets it right up to about 200 steps.
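A minimal sketch of how a marble puzzle of arbitrary length could be generated and scored automatically. This is not the actual test suite from the thread: the function name `make_marble_puzzle`, the jar names, the starting counts, and the step wording are all assumptions made for illustration.

```python
import random


def make_marble_puzzle(num_steps, num_jars=3, seed=0):
    """Return (puzzle_text, expected_counts) for a random marble puzzle.

    Hypothetical helper: the wording and parameters are illustrative only.
    """
    rng = random.Random(seed)  # seeded so the same puzzle can be regenerated
    names = [f"jar {chr(ord('A') + i)}" for i in range(num_jars)]
    counts = {name: rng.randint(5, 15) for name in names}
    lines = [f"At the start, {name} holds {n} marbles." for name, n in counts.items()]

    for step in range(1, num_steps + 1):
        src, dst = rng.sample(names, 2)  # two distinct jars
        if counts[src] == 0:
            # Nothing to move, so add marbles instead to keep the step valid.
            k = rng.randint(1, 5)
            counts[src] += k
            lines.append(f"Step {step}: add {k} marbles to {src}.")
        else:
            k = rng.randint(1, min(5, counts[src]))
            counts[src] -= k
            counts[dst] += k
            lines.append(f"Step {step}: move {k} marbles from {src} to {dst}.")

    lines.append("How many marbles are in each jar now?")
    return "\n".join(lines), dict(counts)


if __name__ == "__main__":
    puzzle, expected = make_marble_puzzle(num_steps=10, seed=42)
    print(puzzle)
    print("Expected answer:", expected)
```

Because the generator tracks the true counts as it writes each step, the same puzzle can be produced at 50, 70, or 200 steps and a model's answer compared against `expected` without manual checking.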

Here's a problem that's challenging because it requires trial and error. GPT-4o gets stuck and gives up. o1-preview got the right answer.

The one big blind spot I found is that o1 is bad at spatial reasoning. o1 can't accept images yet, but I gave it a word problem describing a set of streets like this. The brown boxes show streets that are closed. o1-preview recommended the following invalid route.

Check out my writeup! I tried to explain what reinforcement learning is and why it helped to make o1 so much better at reasoning.

OpenAI just unleashed an alien of extraordinary ability

From understandingai.org

CuddlySalmon | nptacek.eth

@nptacek

Sep 20

you didn't feel that Claude 3.5 Sonnet was a jump in performance? it was insane and unlocked a ton of value

I would describe Claude 3.5 Sonnet as a small improvement in performance combined with a significant improvement in cost.

could u give a couple of quick examples?

Yes, see the reply tweets and the article linked at the end of the thread.

what kind of gap have you seen between 4o with COT baked into the prompt and native o1?

There is an article link further down the thread.

While being a significant jump, it's also been 18 months since GPT-4, and 22 months since GPT-3.5. Given that, just how significant is it? It seems like the pace of development has slowed dramatically. Still a useful tool in certain niches, but still very unreliable.

It depends on what you're comparing it to. The AI field made ~no progress for many years prior to AlexNet in 2012. So GPT-4 to o1 in 18 months still feels pretty fast to me, even if it's not "AGI in 2027" fast.

If you asked 10 humans, of different ages and education levels, to try and solve your set of reasoning questions, do you think they would get them all correct? I feel like the tendency is to compare to human potential performance, and not actual performance

It really depends on the question. I bet everyone older than 10 knows that 100 pennies are worth more than 3 quarters. I bet no one would get the 200-step marble one right.

O1's the biggest jump since GPT-3. Planning was the biggest missing piece of the AGI puzzle. Superintelligence is just a matter of time and experience now.

Timothy B. Lee

@binarybits

Sep 20

I think we still need another breakthrough to enable AI systems to have state, separate from their context window, that lets them create and operate on abstract concepts.

How is it different than just running any of the previous models with ReACT agents?

I don't know enough about ReACT agents to answer this question. 🤷‍♂️

Hey, do you think it's possible your previous attempts ended up going into their training data?

Yes. Some of them aren't even original to me and have been floating around longer than my first post about them. But the fact that o1 gets so many problems the latest GPT-4o didn't suggests there's more going on.

You have to be kidding. Sonnet 3.5 was extremely ahead of 4o. Combine it with Artifacts and Projects, and they are not even in the same category. I mean, o1 is not even close to (Sonnet + Opus) + Artifacts + Projects

I dunno man.

Nice try but some of us had a bit more forethought (well documented on here).

“Nice try?”

Shouldn’t this have been predictable? The new model from OAI pushing capabilities, while other firms catching up to GPT-4 didn’t, makes sense. O1 is a way to use more inference compute with ~GPT-4, the next gen from frontier labs will be throwing compute into training, etc.

You forget the 2nd Law of Papers, which is to always look two papers down the line!

wordgrammer

@wordgrammer

Sep 20

I like the marble approach. I was considering making something similar myself, but I don’t know if it’s necessary anymore. Algorithmically generated word puzzles are the best LLM benchmark. They should be the gold standard that all LLMs are evaluated on

I believe real-world use cases should be the primary focus. I need an AI that can create full-fledged apps from a single prompt and deploy them. Puzzle-solving isn't a priority for me.

AI is in the "gimmick" stage of development, a lot like 3-d Printing was when everyone wanted to adopt it. AI is moving into the "this is dangerous" phase...We are not near the "useful" stage yet

will (exo/acc)

@wbic16

Sep 20

What's the easiest way to get started with your evaluation framework? I'm working on a personal exocortex project that will eventually (think 20+ years from now) hit AGI on 20 watts - just like your brain. I'd like to better understand how far from SOTA a single rpi can be.

I’m definitely a skeptic when it comes to AI. It’s too dangerous.

The drop in cost from the first gpt4 to the last gpt4o is totally underrated. Economics are quite important for a new tech in order to have tangible impact on society

Do you have a leaderboard? Also, I highly recommend creating a private version of any benchmark

I’m not sure a handful of months would qualify as “slow progress” …

"skeptic" ? Why? We had the breakthrough necessary to create sentient AGI for almost a decade now. The methodology isn't being put into practice, publicly, but PHYSICALLY TANGIBLY possible? Yeah, almost 10 years. Pretty neat stuff.

Singularity's Child gonzo/ai

@shoecatladder

True but people are way too focused on 1-shotting hard problems

Meghan Murphy

@meghanclare

Sep 21

Worth frying the planet?

Nealy Willy

@nealcincinnati

Sep 21

Is it wrong to want it trained to be really good at matching up social media aliases with their most likely authors so we can live in an endless bubble of Mark Robinson-like reveals?

kfue

@kfue_crypto

Sep 21

But what we were saying was wait and see before you call the pace "slow". GPT-3 and GPT-4 were launched almost 3 years apart and within a year of GPT-4's release people were saying AI was slowing down. I believe the false narrative that it was slowing down came because ChatGPT

Except it’s still incapable of giving sound advice about navigating human life. I actually think it’s gotten worse. Good reasoning (i.e., IQ) does not equate with good social understanding (i.e., for lack of a better term, EQ)

How has the pace been slow since GPT-4? You know OpenAI releases a frontier model every 2 years and everyone else is playing catch up

It has been less than 2 years since gpt-4 so I don’t understand your question.

Are you confident that o1 is better at ‘logic’ and not just better at recalling the right data from its training? Or is it simply better at recalling correct data? I suspect rather than a new model with new capabilities, it’s just better training data with better chain of
