A few more observations after replicating the Tower of Hanoi game with their exact prompts:
- You need AT LEAST 2^N - 1 moves, and the output format requires 10 tokens per move plus some constant overhead.
- Furthermore, the output limit for Sonnet 3.7 is 128k, DeepSeek R1 64k, and o3-mini 100k tokens. This includes the reasoning tokens they use before outputting their final answer!
- All models will have 0 accuracy with more than 13 disks simply because they cannot output that much!
- The max solvable sizes WITHOUT ANY ROOM FOR REASONING, floor(log2(output_limit/10)):
  DeepSeek: 12 disks
  Sonnet 3.7 and o3-mini: 13 disks
- If you actually look at the output of the models, you will see that they don't even reason about the problem if it gets too large: "Due to the large number of moves, I'll explain the solution approach rather than listing all 32,767 moves individually"
- At least for Sonnet, it doesn't try to reason through the problem once it's above ~7 disks. It will state what the problem is and the algorithm to solve it, and then output its solution without even thinking about individual steps.
- It's also interesting to look at the models as having an X% chance of picking the correct token at each move.
- Even with a 99.99% probability, the models will eventually make an error simply because of the exponentially growing problem size.
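The output-budget arithmetic above can be checked with a few lines of Python. This is only a sketch: the 10 tokens per move and the output limits are the figures quoted in the post, "k" is read as 1,000 tokens (an assumption; reading it as 1,024 gives the same disk counts), and the names `min_moves` and `max_disks` are made up for illustration.

```python
import math

TOKENS_PER_MOVE = 10  # figure quoted in the post; constant overhead is ignored

# Output limits in tokens as quoted in the post ("k" read as 1,000, an assumption).
OUTPUT_LIMITS = {
    "Sonnet 3.7": 128_000,
    "DeepSeek R1": 64_000,
    "o3-mini": 100_000,
}


def min_moves(n_disks: int) -> int:
    """Minimum number of moves needed to solve Tower of Hanoi with n_disks."""
    return 2 ** n_disks - 1


def max_disks(output_limit: int, tokens_per_move: int = TOKENS_PER_MOVE) -> int:
    """Largest puzzle whose full move list fits in the output limit,
    using the post's formula floor(log2(output_limit / tokens_per_move))
    and leaving no room for reasoning tokens."""
    return math.floor(math.log2(output_limit / tokens_per_move))


if __name__ == "__main__":
    for model, limit in OUTPUT_LIMITS.items():
        n = max_disks(limit)
        print(f"{model}: {n} disks ({min_moves(n)} moves, "
              f"{min_moves(n) * TOKENS_PER_MOVE} tokens)")
```

Any reasoning tokens spent before the final answer come out of the same budget, so the real ceiling is lower than what this prints.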
Quote
Josh Wolfe
@wolfejosh
Apple just GaryMarcus'd LLM reasoning ability

Decomposing helps the model focus more on reasoning, as it keeps the problem size smaller, but it will basically get lost in the algorithm and repeat steps. It needs the history to pick up where it left off, although Tower of Hanoi is in theory stateless. Like after each move the
But I also observed this peak in token usage across the models I tested at around 9-11 disks. That's simply the threshold where the models say: "Fuck off I'm not writing down 2^n_disks - 1 steps"
But it's not like they are reasoning through the problem step by step before that anyway. I saw some reasoning for very small problems up until ~5-6 disks. After that it's just: repeat problem, repeat algorithm, print steps and then after those 10-11 disks they start saying fuck
Quote
Lisan al Gaib
@scaling01
ALL OF THIS IS JUST NONSENSE but no they didn't even bother looking at the outputs The models literally recite the algorithm in their chains-of-thought, in plain text and in code. As I have explained in the other post, the steps of the different games do not have equal x.com/scaling01/stat…
Quote
Lisan al Gaib
@scaling01
and they are also extremely confused about the complexity of the games. Just because Tower of Hanoi requires exponentially more steps than the other ones, which only require quadratically or linearly more steps, doesn't mean Tower of Hanoi is more difficult x.com/scaling01/stat…
Quote
Lisan al Gaib
@scaling01
I was wondering why LLMs can do so many steps on Hanoi but not in the other games... In the paper they use the minimum/optimal path length as a proxy for problem difficulty or compositional depth, like 2^N - 1 for Towers of Hanoi, (N+1)^2 - 1 for Checker Jumping and some linear x.com/scaling01/stat…
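To make the two path-length formulas quoted above concrete, here is a small sketch that tabulates them side by side. Only 2^N - 1 and (N+1)^2 - 1 come from the post; the function names are hypothetical, and the linear formula is left out because the post is cut off before stating it.

```python
def hanoi_min_moves(n: int) -> int:
    """Optimal path length for Tower of Hanoi with n disks: 2^N - 1."""
    return 2 ** n - 1


def checker_jumping_min_moves(n: int) -> int:
    """Optimal path length quoted for Checker Jumping: (N+1)^2 - 1."""
    return (n + 1) ** 2 - 1


if __name__ == "__main__":
    print(f"{'N':>3} {'Hanoi':>8} {'Checkers':>9}")
    for n in range(1, 16):
        print(f"{n:>3} {hanoi_min_moves(n):>8} {checker_jumping_min_moves(n):>9}")
```

The table only shows how quickly the step counts diverge; it says nothing about how hard an individual step is in either game.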
> all models will have 0 accuracy with more than 13 disks simply because they can not output that much!
How can they miss that?!
funnily enough it might be even worse than stated above because they used only 64k tokens for Sonnet instead of 128k that are available
Using Caps-lock reduces credibility. Also, I don't see how most of this is relevant. Other problems are not O(2**N), and they display the same failure pattern? Multiplying two million digit numbers may be infeasible, but how does that make LLMs failing to multiply uninteresting?
"We don't have an AI update for you, but we put some impossible tasks in front of the other companies' AIs to show you it's no big deal that we don't."
It’s shocking that nobody on the paper read the reasoning traces. Sure they can’t for OpenAI and Anthropic but R1 will readily tell you that it doesn’t want to output that long of an answer and would rather give you code output.
did you reproduce this? can you give a try at reasoning via tools? like, create a 'reason' tool and instruct it to reason as long as needed, or similar
I wonder whether the models could solve this if they had access to tools. They seem to understand the problem well enough; it's just that the problem space is too vast.
> all models will have 0 accuracy with more than 13 disks simply because they can not output that much!
This is a super interesting insight
Is there some weird rule that LLMs are forbidden to set up and run simple algorithms on pen and paper, or will this be the next big breakthrough? If FPGAs exist, this just seems so wasteful.
What is your point? The paper acknowledges this, and it doesn't really invalidate any of the core findings. They fail before they hit the ceiling. It's okay to accept that the current architecture has limitations.
Can you try River Crossing puzzle with n=3 ? I tried a few models and they failed badly! Also the solution size is very small and well within their context size.
Thanks for the analysis. I wanted to dig it up but you did a great job. I was questioning their choice of title but I wonder how come they didn't double check their work before publishing it? Didn't they expect that Gary will bring it up?!
> even with a 99.99% probability the models will eventually make an error simply because of the exponentially growing problem size
reminds me of humans tbh
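The 99.99% remark quoted above can be put into rough numbers. The sketch below assumes each move is an independent trial with the same success probability, which is a simplification added here and not something the thread claims; `p_flawless_run` is a hypothetical helper name.

```python
def p_flawless_run(n_disks: int, p_correct_per_move: float = 0.9999) -> float:
    """Probability of emitting all 2^N - 1 moves without a single error,
    ASSUMING every move is an independent trial with the same success rate."""
    n_moves = 2 ** n_disks - 1
    return p_correct_per_move ** n_moves


if __name__ == "__main__":
    for n in (5, 8, 10, 13, 15, 20):
        print(f"{n:>2} disks: {2 ** n - 1:>8} moves, "
              f"P(no error) = {p_flawless_run(n):.4f}")
```

Under this assumption the chance of a flawless run falls below one half at 13 disks and is close to zero by 20 disks, even though the per-move accuracy never changes.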
Glad someone actually digs deeper instead of blindly reposting and believing it just because it is written by ‘Apple’
This is what I assumed was happening the moment I noticed what kind of questions they were asking. What a silly paper. It's very annoying how seriously people are taking it without actually reading it.
thx for digging into it, task zip bombs like these are midwit catnip. super common trope in these kinds of papers
Interesting comparison. Did their approach produce similar move sequences or highlight unexpected ambiguities in the instructions?
Replicating the prompts provides relevant insight into problem performance and clarification of each step’s complexity within the Tower of Hanoi.