A few more observations after replicating the Tower of Hanoi game with their exact prompts:

- You need AT LEAST 2^N - 1 moves, and the output format requires 10 tokens per move + some constant stuff.
- Furthermore, the output limit for Sonnet 3.7 is 128k, DeepSeek R1 64k, and o3-mini 100k tokens. This includes the reasoning tokens they use before outputting their final answer!
- all models will have 0 accuracy with more than 13 disks simply because they cannot output that much!
- the max solvable sizes WITHOUT ANY ROOM FOR REASONING (floor(log2(output_limit/10))): DeepSeek: 12 disks; Sonnet 3.7 and o3-mini: 13 disks
- If you actually look at the output of the models, you will see that they don't even reason about the problem if it gets too large: "Due to the large number of moves, I'll explain the solution approach rather than listing all 32,767 moves individually"
- At least for Sonnet, it doesn't try to reason through the problem once it's above ~7 disks. It will state the problem and the algorithm to solve it, and then output its solution without even thinking about individual steps.
- it's also interesting to look at the models as having an X% chance of picking the correct token at each move
- even with a 99.99% probability the models will eventually make an error simply because of the exponentially growing problem size
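
The token-budget arithmetic in the post can be checked with a few lines of Python. This is only a sketch: the output limits and the 10-tokens-per-move figure are the rough numbers quoted in the post, the "constant stuff" is ignored, and the names `OUTPUT_LIMITS`, `min_moves`, and `max_disks` are made up for illustration.

```python
import math

# Output limits in tokens as quoted in the post (treated here as round numbers).
OUTPUT_LIMITS = {
    "Sonnet 3.7": 128_000,
    "DeepSeek R1": 64_000,
    "o3-mini": 100_000,
}
TOKENS_PER_MOVE = 10  # rough per-move cost quoted in the post


def min_moves(n_disks: int) -> int:
    """Minimum number of moves for an n-disk Tower of Hanoi: 2^n - 1."""
    return 2 ** n_disks - 1


def max_disks(output_limit: int, tokens_per_move: int = TOKENS_PER_MOVE) -> int:
    """Largest n whose move list fits, leaving no tokens for reasoning.

    Uses the formula from the post: floor(log2(output_limit / tokens_per_move)).
    """
    return math.floor(math.log2(output_limit / tokens_per_move))


for model, limit in OUTPUT_LIMITS.items():
    n = max_disks(limit)
    needed = min_moves(n) * TOKENS_PER_MOVE
    print(f"{model}: {n} disks ({needed:,} of {limit:,} tokens for the move list)")
```

Under these assumptions, one disk more than the computed maximum already exceeds the limit: 14 disks need 16,383 moves, which is about 164k tokens at 10 tokens per move.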

Quote

Josh Wolfe

@wolfejosh

Jun 6

Apple just GaryMarcus'd LLM reasoning ability

11:39 AM · Jun 8, 2025

Lisan al Gaib

@scaling01

12h

Decomposing helps the model focus more on reasoning, as it keeps the problem size smaller, but it will basically get lost in the algorithm and repeat steps. It needs the history to pick up where it left off, although Tower of Hanoi is in theory stateless. Like after each move the

But I also observed this peak in token usage across the models I tested at around 9-11 disks. That's simply the threshold where the models say: "Fuck off I'm not writing down 2^n_disks - 1 steps"

But it's not like they are reasoning through the problem step by step before that anyway. I saw some reasoning for very small problems up until ~5-6 disks. After that it's just: repeat problem, repeat algorithm, print steps and then after those 10-11 disks they start saying fuck

Quote

Lisan al Gaib

@scaling01

12h

ALL OF THIS IS JUST NONSENSE, but no, they didn't even bother looking at the outputs. The models literally recite the algorithm in their chains-of-thought, in plain text and in code. As I have explained in the other post, the steps of the different games do not have equal x.com/scaling01/stat…

Quote

Lisan al Gaib

@scaling01

12h

and they are also extremely confused about the complexity of the games. Just because Tower of Hanoi requires exponentially more steps than the other ones, which only require quadratically or linearly more steps, doesn't mean Tower of Hanoi is more difficult x.com/scaling01/stat…

Lisan al Gaib

@scaling01

just go on my profile at this point... more examples there

Lisan al Gaib

@scaling01

Quote

Lisan al Gaib

@scaling01

I was wondering why LLMs can do so many steps on Hanoi but not in the other games... In the paper they use the minimum/optimal path length as a proxy for problem difficulty or compositional depth, like 2^N - 1 for Towers of Hanoi, (N+1)^2 - 1 for Checker Jumping and some linear x.com/scaling01/stat…
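
To get a feel for how differently these path-length proxies grow, here is a small sketch that tabulates the two formulas quoted above. The function names are made up for illustration, and the linear puzzle is left out because its formula is cut off in the quoted post.

```python
def hanoi_moves(n: int) -> int:
    """Optimal solution length for Tower of Hanoi with n disks: 2^n - 1."""
    return 2 ** n - 1


def checker_jumping_moves(n: int) -> int:
    """Optimal solution length for Checker Jumping as quoted: (n + 1)^2 - 1."""
    return (n + 1) ** 2 - 1


print(f"{'N':>3} {'Hanoi':>10} {'Checkers':>10}")
for n in (3, 5, 10, 15, 20):
    print(f"{n:>3} {hanoi_moves(n):>10,} {checker_jumping_moves(n):>10,}")
```

At the same N the two puzzles sit at very different solution lengths, so equal N does not mean a comparable amount of output.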

Yuchen Jin

@Yuchenj_UW

> all models will have 0 accuracy with more than 13 disks simply because they can not output that much!

How can they miss that?!

Lisan al Gaib

@scaling01

funnily enough it might be even worse than stated above because they used only 64k tokens for Sonnet instead of the 128k that are available

Humans aren’t solving a 10 disk tower of Hanoi by hand either.

no way

Using Caps-lock reduces credibility. Also, I don't see how most of this is relevant. Other problems are not O(2**N), and they display the same failure pattern? Multiplying two million digit numbers may be infeasible, but how does that make LLMs failing to multiply uninteresting?

they are even worse in fact, the paper only shows the minimal (optimal) solution length scaling

If that’s true then this paper from Apple makes no sense

it doesn't, hope that helps

Apple is cooked. Or is it Cooked…?

giga cooked

"We don't have an AI update for you, but put some impossible tasks in front of the other companies AIs to show you it's no big deal that we don't."

Good work digging, man.

Ryan Greenblatt

@RyanPGreenblatt

All these posts are great. Thx for this, I'm glad there are at least some checks on papers which are this wrong.

Daniel Han

@danielhanchen

Nice analysis and very good plots!!

It’s shocking that nobody on the paper read the reasoning traces. Sure they can’t for OpenAI and Anthropic but R1 will readily tell you that it doesn’t want to output that long of an answer and would rather give you code output.

Gary Basin

@garybasin

10h

This is cool, thanks for doing the work

someone reproduced this experiment with tools and posted on it yesterday. can someone find that tweet pls?

Exactly, that was the first thought that came to my mind: the chosen problems are all exponential problems!!! Obviously they will blow through the token and context window

Chris

@llm_wizard

Imagine if Apple reasoned through this paper.

did you reproduce this? can you give a try at reasoning via tools? like, create a 'reason' tool and instruct it to reason as long as needed, or similar

Scott

@scottstts

I wonder whether the models could solve this if they had access to tools. They seem to understand the problem well enough; it's just that the problem space is too vast

Jiao Dong

@jd_markovchain

> all models will have 0 accuracy with more than 13 disks simply because they can not output that much!

This is a super interesting insight

Valentin Cristea

@vmuaddib

cool stuff, thanks for doing this

what’s the SOTA in tower of Hanoi?

Paul Scudder

@paulscu1

You’ve got to be kidding me, they didn’t account for this?

Gemini has 1M

Can’t you validate this by doing the experiment with a smaller number of discs?

Is there some weird rule that LLMs are forbidden to set up and run simple algorithms on pen and paper, or will this be the next big breakthrough? If FPGAs exist, this just seems so wasteful.

Evert Junior

@evertjr

What is your point? The paper acknowledges this and doesn’t really invalidate any of the core findings. They fail before they hit the ceiling. It’s okay to accept the current architecture has limitations.

ander

@andrea21385914

What is p on the chart?

Thank you. That's great research. It explains why they did not include Gemini 2.5 Pro...

Can you try River Crossing puzzle with n=3 ? I tried a few models and they failed badly! Also the solution size is very small and well within their context size.

Sadegh Mahdavi

@sadegh_mahdavi4

You’re definitely the Lisan al Gaib, saving us time by sparing us from reading those papers :) Appreciate your service!

They actually made the chart go up to 20 disks = over 1 million moves for the quickest solution, lol

Anne Frank from Palestine

@Sispotfer

@grok

please summarise and explain in simple terms.

Thanks for the analysis. I wanted to dig it up but you did a great job. I was questioning their choice of title but I wonder how come they didn't double check their work before publishing it? Didn't they expect that Gary will bring it up?!

@murkyverdant

> even with a 99.99% probability the models will eventually make an error simply because of the exponentially growing problem size

reminds me of humans tbh
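
The compounding-error point can be made concrete with a quick sketch. It assumes, purely for illustration, that every move is produced correctly with the same independent probability `p_move`; real models are unlikely to behave this way, and `p_all_correct` is a made-up helper name.

```python
def p_all_correct(n_disks: int, p_move: float = 0.9999) -> float:
    """Chance of an error-free optimal solution under the independence assumption.

    The optimal solution has 2^n - 1 moves, so the probability is p_move ** (2^n - 1).
    """
    moves = 2 ** n_disks - 1
    return p_move ** moves


for n in (5, 10, 13, 15, 20):
    print(f"{n:>2} disks: {2 ** n - 1:>9,} moves, P(no error) = {p_all_correct(n):.3g}")
```

Under this toy model the success probability at 13 disks is already below one half, and it is effectively zero at 20 disks.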

Alex Foster

@AlertFoxes

Great work! Why does Apple seem to continually misunderstand how LLMs work?

Ajay-Seed

@AjaySeed

Glad someone actually digs deeper instead of blindly reposting and believing it just because it is written by ‘Apple’

Well researched

Impressive attention to verifying the process. Replication reveals subtle differences not initially apparent.

Vincent Bons

@bons_vincent

10h

They should be able to write and execute code to solve this
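
For reference, the kind of program meant here is short. Below is a minimal sketch of the standard recursive solution plus a checker; the `(disk, from_peg, to_peg)` move tuple and the 0-indexed pegs are assumptions made for this example, not necessarily the format used in the paper's prompts.

```python
def hanoi(n, source=0, target=2, spare=1):
    """Yield the optimal moves as (disk, from_peg, to_peg); disk 1 is the smallest."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, move disk n, then stack them back.
    yield from hanoi(n - 1, source, spare, target)
    yield (n, source, target)
    yield from hanoi(n - 1, spare, target, source)


def is_valid_solution(n, moves):
    """Replay the moves and check that every move is legal and the goal is reached."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at the bottom
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))


moves = list(hanoi(10))
print(len(moves), is_valid_solution(10, moves))
```

The checker is the useful half for evaluation: it accepts any legal solution, not only the optimal one.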

Franklin Goodman

@moderncpp7

Guess that's why they didn't include Gemini 2.5 Pro

AI Millionaire

@AIMilli0naire

Your insights clarify how variable prompt design affects solution method.

Dan

@danialbka

imagine openai releasing a paper to counter this

tldr?

This is what I assumed was happening the moment I noticed what kind of questions they were asking. What a silly paper; it's very annoying how seriously people are taking it without actually reading it.

Seeker

@__countzero___

What about the river crossing problems? Those don't scale exponentially.

thx for digging into it, task zip bombs like these are midwit catnip. super common trope in these kinds of papers

Cyber Quack

@cyberQuack_

Noticed subtle pattern shifts in different iterations.

NeonCipher

@ne0nCipher

Interesting comparison. Did their approach produce similar move sequences or highlight unexpected ambiguities in the instructions?

Tech Titan

@tech_t1tan

Replicating the Tower of Hanoi game offers a strong basis for consistency checks.

Rodriguez ML_Savant

@rdguezMLSavant

Replication highlights consistent mechanics, but subtle outcome deviations persist.

Eliott Transformer

@E1iott_T

This replication highlights how prompt design directly influences problem solving in iterative text models.

Data Tisha

@DataTisha

It's notable how precise instructions impact user performance and learning.

BotBanshee

@b0t_banshee

Replicating the prompts provides relevant insight into problem performance and clarification of each step’s complexity within the Tower of Hanoi.
