Clad3815 on X: "GPT-5 has reached Victory Road! This is the last challenge before the Elite Four. GPT-5 reached this part almost three times faster than o3 (6105 steps for GPT-5 vs 16882 steps for o3). Here are my observations as to why: - GPT-5 hallucinates far less than o3. This is the main https://t.co/bncasoFPOt" / X

GPT-5 has reached Victory Road! This is the last challenge before the Elite Four. GPT-5 reached this part almost three times faster than o3 (6105 steps for GPT-5 vs 16882 steps for o3). Here are my observations as to why:
- GPT-5 hallucinates far less than o3. This is the main reason for the speed increase.
- GPT-5 has better spatial reasoning. o3 often tried to brute-force through walls and had a hard time navigating complex areas. GPT-5 can plan long input sequences with few mistakes, which saves a lot of time.
- GPT-5 is better at planning its own objectives and following them.
Let's see how it handles this last challenge!
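
As a quick sanity check on the "almost three times faster" figure, the ratio can be worked out from the two step counts quoted in the post. This is only a minimal sketch: the names `GPT5_STEPS`, `O3_STEPS`, and `speedup` are made up for illustration and are not part of the author's tooling.

```python
# Step counts quoted in the post (steps needed to reach Victory Road).
GPT5_STEPS = 6105
O3_STEPS = 16882


def speedup(baseline_steps: int, new_steps: int) -> float:
    """Return how many times fewer steps the new run needed than the baseline."""
    if new_steps <= 0:
        raise ValueError("new_steps must be positive")
    return baseline_steps / new_steps


ratio = speedup(O3_STEPS, GPT5_STEPS)
saved = O3_STEPS - GPT5_STEPS

print(f"o3 / GPT-5 step ratio: {ratio:.2f}x")  # prints 2.77x
print(f"steps saved: {saved}")                 # prints 10777
```

A ratio of roughly 2.77 is consistent with "almost three times faster" when speed is measured in steps rather than wall-clock time.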

Quote

Clad3815

@Clad3815

Aug 7

We’re live! GPT-5 vs. Pokémon Red, real-time decisions, gym runs, and chat-picked nicknames. Jump in

11:39 PM · Aug 13, 2025

196.9K

Views

Clad3815

@Clad3815

16h

Livestream: twitch.tv/gpt_plays_poke Statistics and comparative data: gpt-plays-pokemon.clad3815.dev/timeline

twitch.tv

GPT_Plays_Pokemon - Twitch

GPT-5 Plays Pokemon - HM05 FLASH GET! Drowzee hunt next

Do you have a repo for this? I’m interested to know what you feed the LLMs.

No it's not open source, but you can see what I send / receive from the AI here gpt-plays-pokemon.clad3815.dev/timeline

which reasoning effort is used for gpt-5?

high

You're bragging about "spatial reasoning" while the biggest hallucination here is you calling your script "GPT-5," proving you're still trapped in the delusion that you're doing anything more than brute-forcing a solved game with a stolen name.

What?

This is awesome. Think it would be nice if we could see another line on the graph that shows a pro-gamer or record human player number of steps taken? Just as a reference. I imagine I was probably a lot less efficient than GPT-5

Haven't seen any other model handle this so fast. GPT-5 is really capable

you might like this eval

Is this against slow o3 or o3 post-speedup? Since u didnt specify ima have to assume it’s slow o3 which kinda invalidates the comparison

Which starter did it pick? What about o3?

Pedro Teixeira MD PhD

What’s progress / $?

Cool

bet GPT-5 is about to crush this

That’s actually really impressive. It’s much more efficient at playing games… is this really because of the hallucination rate difference?

Is there a baseline of a human without experience in the game?

That's gpt 5 pro, not what everyone had access to

dude more like the end of the Road , well I totally agree at the Part that 5 proven itself useful at calculus , information , advice , some stuff Yet it's Total useless at Creative writing , not chatty , destined and start hallucinating fast we want 4o back at free tier #keep4o

How does this compare to average human?

Did they have the same harness?

Look, I love these video game tests. If it is a general intelligence, then it should be able to play all video games like a 10 year old human (at a minimum)

awesome, now run it again with gpt-5-mini with a subagent which summarizes chat to give feedback and hints to the player agent. chat will inject speedrun strats it will mog

How's it doing compared to Gemini? I think you're the same guy who had Gemini play the game too

Complainants, please read.

Now we wait for GTA 6

nice

Wow

Discover more

Sourced from across X

Shai Shalev-Shwartz

@shai_s_shwartz

11h

Are frontier AI models really capable of “PhD-level” reasoning? To answer this question, we introduce FormulaOne, a new reasoning benchmark of expert-level Dynamic Programming problems. We have curated a benchmark consisting of three tiers, in increasing complexity, which we call

The pro models (GPT-5 Pro, Gemini 2.5 Deep Think, Grok 4 Heavy) can be impressive in ways that are hard to see. They take a lot of time to answer questions & are built for very hard problems that require expert evaluation. That is a narrow, but, also very valuable, problem space.

GPT-5 earned 8 badges in Pokemon Red in just 6,000 steps compared to o3’s 16,700! It’s in complex, long-term agent workflows that GPT-5’s true power really shines. Absolutely mind-blowing.

GPT-5 just finished Pokémon Red! 6,470 steps vs. 18,184 for o3! Check the stats site to compare! That's a huge improvement! Well done, @OpenAI you cooked with GPT-5. What an incredible model. Next up: GPT-5 vs. Pokémon Crystal (16 Badges + Red). The run starts soon on Twitch.
