A few more observations after replicating the Tower of Hanoi game with their exact prompts:
- You need AT LEAST 2^N - 1 moves and the output format requires 10 tokens per move + some constant stuff.
- Furthermore, the output limit is 128k tokens for Sonnet 3.7, 64k for DeepSeek R1, and 100k for o3-mini. This includes the reasoning tokens they use before outputting their final answer!
- all models will have 0 accuracy with more than 13 disks simply because they can not output that much!
- the max solvable sizes WITHOUT ANY ROOM FOR REASONING, i.e. floor(log2(output_limit/10)):
DeepSeek: 12 disks
Sonnet 3.7 and o3-mini: 13 disks
- If you actually look at the output of the models you will see that they don't even reason about the problem if it gets too large:
"Due to the large number of moves, I'll explain the solution approach rather than listing all 32,767 moves individually"
- At least for Sonnet, it doesn't try to reason through the problem once it's above ~7 disks. It will state what the problem is and the algorithm to solve it, and then output its solution without even thinking about individual steps.
- it's also interesting to look at the models as having a X% chance of picking the correct token at each move
- even with a 99.99% probability the models will eventually make an error simply because of the exponentially growing problem size
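The two back-of-the-envelope numbers above (the output ceiling and the compounding per-move error) can be checked with a small sketch. It assumes exactly 10 tokens per move, ignores the constant overhead, takes the limits quoted above as round numbers, and treats every move as independently correct with the same probability. The function names are made up for illustration.

```python
TOKENS_PER_MOVE = 10  # figure quoted in the post; constant overhead is ignored

# Output limits as quoted in the post (reasoning tokens included).
OUTPUT_LIMITS = {
    "DeepSeek R1": 64_000,
    "o3-mini": 100_000,
    "Sonnet 3.7": 128_000,
}


def min_moves(n_disks: int) -> int:
    """Minimum number of moves for Tower of Hanoi with n_disks disks."""
    return 2 ** n_disks - 1


def max_disks(output_limit: int, tokens_per_move: int = TOKENS_PER_MOVE) -> int:
    """Largest N whose full move list fits in output_limit, with no room for reasoning."""
    n = 0
    while min_moves(n + 1) * tokens_per_move <= output_limit:
        n += 1
    return n


def p_flawless(n_disks: int, p_move: float) -> float:
    """Chance of an error-free run, ASSUMING each move is independently correct with p_move."""
    return p_move ** min_moves(n_disks)


for name, limit in OUTPUT_LIMITS.items():
    print(f"{name}: at most {max_disks(limit)} disks")

for n in (10, 15, 20):
    print(f"{n} disks: P(no mistake at 99.99% per move) = {p_flawless(n, 0.9999):.4f}")
```

Under these assumptions the ceiling comes out at 12 disks for DeepSeek R1 and 13 for the other two, and a 99.99% per-move accuracy leaves only a few percent chance of a flawless 15-disk run.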
Decomposing helps the model to focus more on reasoning as it keeps the problem size smaller but it will basically get lost in the algorithm and repeat steps.
It needs the history to pick up where it left off, although Tower of Hanoi is in theory stateless. Like after each move the
But I also observed this peak in token usage across the models I tested at around 9-11 disks.
That's simply the threshold where the models say: "Fuck off I'm not writing down 2^n_disks - 1 steps"
But it's not like they are reasoning through the problem step by step before that anyway.
I saw some reasoning for very small problems up until ~5-6 disks. After that it's just: repeat problem, repeat algorithm, print steps and then after those 10-11 disks they start saying fuck
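To illustrate the "in theory stateless" remark: in the optimal solution, the move at any step can be computed from the step number and the disk count alone, with no history. The sketch below is not what the models do, just one standard way to show it. It assumes pegs numbered 0, 1, 2, all disks starting on peg 0 and ending on peg 2, and disks numbered from 1 (smallest). `move_at` and `hanoi_recursive` are illustrative names.

```python
def hanoi_recursive(n, src=0, aux=1, dst=2):
    """Textbook recursive solution; returns a list of (disk, from_peg, to_peg)."""
    if n == 0:
        return []
    return (
        hanoi_recursive(n - 1, src, dst, aux)    # park the n-1 smaller disks on aux
        + [(n, src, dst)]                        # move the largest disk
        + hanoi_recursive(n - 1, aux, src, dst)  # bring the smaller disks back on top
    )


def move_at(step, n):
    """Move number `step` (1-indexed) of the optimal n-disk solution, without any history."""
    # The disk moved at this step is 1 + the number of trailing zero bits of step.
    disk = (step & -step).bit_length()
    # How many times that disk has moved so far, including this move.
    count = (step + (1 << (disk - 1))) >> disk
    # Each disk always cycles through the pegs in one fixed direction.
    direction = -1 if (n - disk) % 2 == 0 else 1
    return disk, (direction * (count - 1)) % 3, (direction * count) % 3


n = 4
stateless = [move_at(step, n) for step in range(1, 2 ** n)]
print(stateless == hanoi_recursive(n))
```

So a solver that decomposes the problem would not need the full move history, only the current step index (or, equivalently, the current peg configuration).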
Quote
Lisan al Gaib
@scaling01
ALL OF THIS IS JUST NONSENSE
but no they didn't even bother looking at the outputs
The models literally recite the algorithm in their chains-of-thought, in plain text and in code.
As I have explained in the other post, the steps of the different games do not have equal x.com/scaling01/stat…
Quote
Lisan al Gaib
@scaling01
and they are also extremely confused about the complexity of the games
just because Tower of Hanoi requires exponentially more steps than the other ones, that only require quadratically or linearly more steps, doesn't mean Tower of Hanoi is more difficult x.com/scaling01/stat…
Quote
Lisan al Gaib
@scaling01
I was wondering why LLMs can do so many steps on Hanoi but not in the other games...
In the paper they use the minimum/optimal path length as a proxy for problem difficulty or compositional depth, like 2^N - 1 for Towers of Hanoi, (N+1)^2 - 1 for Checker Jumping and some linear x.com/scaling01/stat…
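A quick sketch of how differently the two quoted path lengths grow, using only the formulas mentioned above (2^N - 1 for Tower of Hanoi, (N+1)^2 - 1 for Checker Jumping). The linear case is left out because its exact formula is not given here. Function names are illustrative.

```python
def hanoi_moves(n: int) -> int:
    """Optimal solution length for Tower of Hanoi with n disks."""
    return 2 ** n - 1


def checker_jumping_moves(n: int) -> int:
    """Optimal solution length for Checker Jumping, per the formula quoted above."""
    return (n + 1) ** 2 - 1


print(f"{'N':>3} {'Hanoi':>8} {'Checkers':>9}")
for n in (1, 3, 5, 10, 15):
    print(f"{n:>3} {hanoi_moves(n):>8} {checker_jumping_moves(n):>9}")
```

At N = 10 that is 1023 moves against 120, so the same N means very different output lengths across the games.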
> all models will have 0 accuracy with more than 13 disks simply because they can not output that much!
How can they miss that?!
funnily enough it might be even worse than stated above, because they used only 64k tokens for Sonnet instead of the 128k that are available
Using Caps-lock reduces credibility. Also, I don't see how most of this is relevant. Other problems are not O(2**N), and they display the same failure pattern?
Multiplying two million digit numbers may be infeasible, but how does that make LLMs failing to multiply uninteresting?
they are even worse in fact, the paper only shows the minimal (optimal) solution length scaling
"We don't have an AI update for you, but put some impossible tasks in front of the other companies AIs to show you it's no big deal that we don't."
All these posts are great. Thx for this, I'm glad there are at least some checks on papers which are this wrong.
someone reproduced this experiment with tools and posted on it yesterday. can someone find that tweet pls?
Exactly, the first thought that came to my mind: the chosen problems are all exponential problems!!! Obviously they will blow past the token limit and context window
did you reproduce this?
can you give a try at reasoning via tools?
like, create a 'reason' tool and instruct it to reason as long as needed, or similar
I wonder whether the models could solve this if they had access to tools. They seem to understand the problem well enough, it's just that the problem space is too vast
- all models will have 0 accuracy with more than 13 disks simply because they can not output that much!
This is super interesting insight
Is there some weird rule that LLMs are forbidden to set up and run simple algorithms on pen and paper, or will this be the next big breakthrough? If FPGAs exist, this just seems so wasteful.
What is your point? The paper acknowledges this, and it doesn’t really invalidate any of the core findings. They fail before they hit the ceiling. It’s okay to accept the current architecture has limitations.
Thank you. That's great research.
It explains why they did not include Gemini 2.5 Pro...
Can you try River Crossing puzzle with n=3 ? I tried a few models and they failed badly! Also the solution size is very small and well within their context size.
You’re definitely the Lisan al Gaib, saving us time by sparing us from reading those papers :) Appreciate your service!
They actually made chart go up to 20 disks = over 1 million moves for quickest solution, lol
Thanks for the analysis.
I wanted to dig it up but you did a great job.
I was questioning their choice of title but I wonder how come they didn't double check their work before publishing it? Didn't they expect that Gary will bring it up?!
> even with a 99.99% probability the models will eventually make an error simply because of the exponentially growing problem size
reminds me of humans tbh
Great work!
Why does Apple seem to continually misunderstand how LLMs work?
Impressive attention to verifying the process. Replication reveals subtle differences not initially apparent.
Your insights clarify how variable prompt design affects solution method.
This is what I assumed was happening the moment I noticed what kind of questions they were asking. What a silly paper, it's very annoying how seriously people are taking it without actually reading it.
thx for digging into it, task zip bombs like these are midwit catnip. super common trope in these kinds of papers
Interesting comparison. Did their approach produce similar move sequences or highlight unexpected ambiguities in the instructions?
Replicating the Tower of Hanoi game offers a strong basis for consistency checks.
Replication highlights consistent mechanics, but subtle outcome deviations persist.
This replication highlights how prompt design directly influences problem solving in iterative text models.
Replicating the prompts provides relevant insight into problem performance and clarification of each step’s complexity within the Tower of Hanoi.
Ok I guess I have to go through that Apple paper.
My main issue is the framing which is super binary: "Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?" Or what if they only caught genuine yet partial heuristics.