GPT-5 has reached Victory Road! This is the last challenge before the Elite Four.
GPT-5 reached this part almost three times faster than o3 (6105 steps for GPT-5 vs 16882 steps for o3). Here are my observations as to why:
- GPT-5 hallucinates far less than o3. This is the main reason for the speed increase.
- GPT-5 has better spatial reasoning. o3 often tried to brute-force through walls and had a hard time navigating complex areas. GPT-5 can plan long input sequences with few mistakes, which saves a lot of time.
- GPT-5 is better at planning its own objectives and following them.
Let's see how it handles this last challenge!
Livestream: twitch.tv/gpt_plays_poke
Statistics and comparative data: gpt-plays-pokemon.clad3815.dev/timeline
Do you have a repo for this? I’m interested to know what you feed the LLMs.
No, it's not open source, but you can see what I send to and receive from the AI here: gpt-plays-pokemon.clad3815.dev/timeline
You're bragging about "spatial reasoning" while the biggest hallucination here is you calling your script "GPT-5," proving you're still trapped in the delusion that you're doing anything more than brute-forcing a solved game with a stolen name.
This is awesome. Think it would be nice if we could see another line on the graph that shows a pro-gamer or record human player number of steps taken? Just as a reference. I imagine I was probably a lot less efficient than GPT-5
Is this against slow o3 or o3 post-speedup? Since u didnt specify ima have to assume it’s slow o3 which kinda invalidates the comparison
That’s actually really impressive. It’s much more efficient at playing games… is this really because of the hallucination rate difference?
dude more like the end of the Road , well I totally agree at the Part that 5 proven itself useful at calculus , information , advice , some stuff Yet it's Total useless at Creative writing , not chatty , destined and start hallucinating fast we want 4o back at free tier #keep4o
Look, I love these video game tests
If it is a general intelligence, then it should be able to play all video games like a 10 year old human (at a minimum)
awesome, now run it again with gpt-5-mini with a subagent which summarizes chat to give feedback and hints to the player agent. chat will inject speedrun strats it will mog
How's it doing compared to Gemini? I think you're the same guy who had Gemini play the game too
GPT-5 earned 8 badges in Pokemon Red in just 6,000 steps compared to o3’s 16,700! It’s in complex, long-term agent workflows that GPT-5’s true power really shines. Absolutely mind-blowing. 
GPT-5 just finished Pokémon Red! 6,470 steps vs. 18,184 for o3! Check the stats site to compare!
That's a huge improvement! Well done, you cooked with GPT-5. What an incredible model.
Next up: GPT-5 vs. Pokémon Crystal (16 Badges + Red). The run starts soon on Twitch.
Quoted post by Clad3815 (@Clad3815)