Ok, this thread is long overdue. Now that everyone sees that GPT-4.5 is a disappointment: how has OpenAI underperformed so much on the intelligence of their core models? They've been faring worse and worse since the day GPT-4 came out. Buckle in for the tale:

2:31 PM · Feb 27, 2025

Dan Schwarz

@dschwarz26

First, ChatGPT in Nov 2022 and GPT-4 in Mar 2023 were amazing, world-changing innovations. Full credit there. Shortly after GPT-4, forecasts of GPT-5's release began. The initial view was mid 2024. It has crept up and up and up:

OpenAI did release GPT-4-Turbo in Dec 2023, and GPT-4o in May 2024. Way faster and cheaper, but hardly better. Fine - for almost a full year, GPT-4 was king. And so OpenAI was still king. But Anthropic released Claude-3-Opus on March 4, 2024 and took the crown.

Claude-3-Opus was immediately and clearly smarter than GPT-4 class models. I mean real intelligence here, not usability or whatnot. (We at futuresearch.ai switched our most important operations over, despite the cost.) As we know, @AnthropicAI was just getting started:

3 months later: Claude-3.5-sonnet, big intelligence jump
4 months later: Claude-3.6-sonnet, big intelligence jump
4 months later: Claude-3.7-sonnet, big intelligence jump

Even Google caught up and surpassed OpenAI in core LLMs during this time. Google!!

But - you say - what about thinking models? Isn't OpenAI still king there? I don't think so! In Sept 2024, OpenAI "released" o1-preview. It was expensive, slow, and not widely available. And it was worse than Claude on average, at least in our evals: futuresearch.ai/llm-agent-eval

But that was a preview. o1 was the king of thinking models, right? The huge cost and latency increase over Sonnet make it only usable in niche cases. Who actually used it in production? Nobody I know. Claude-3.7-Sonnet-Thinking might have dethroned it anyway (too soon to say).

And o3? Well o3 hasn't been released! o3 exists only in OpenAI Deep Research. Which is an impressive report writer, but not very intelligent, at least not on our evals: futuresearch.ai/oaidr-feb-2025 Outside Deep Research, o3 is just marketing.

futuresearch.ai

OpenAI Deep Research - Six Strange Failures — FUTURESEARCH

Six case studies of OpenAI Deep Research going wrong on web research tasks humans can solve, and what we learned about when to use, and when not to use, OpenAI Deep Research for serious work.

But, but, but - you say - what about other modalities? What about OpenAI Whisper for audio, or Dalle-3 for images, and SORA for video? None of them are state of the art! Take SORA for example, its own story of disappointment and delay:

SORA was teased on x.com in Feb 2024. In March, Mira Murati said release "possibly before summer". More teasers. In Sept it was still in private beta. It was finally released in Dec (almost 10 months after the teaser!), and by then was not a top 3 video model!

As I keep saying: OpenAI lost the mandate of heaven. Why? A thread for another day. Maybe because most of their top researchers left, first to found Anthropic, then again after the board coup. Disagree? I'll take bets on whether OpenAI will ever reclaim the top spot.
