The "reasoning doesn't exist" Apple paper drives me crazy. Take a logic puzzle like Tower of Hanoi w/ 10s to 1,000,000s of moves to solve correctly. Check the first step where an LLM makes a mistake. Long problems aren't solved. Fewer thought tokens/early mistakes on longer problems. 1/11

4:20 PM · Jun 8, 2025
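
To make the scale in 1/11 concrete: an n-disk Tower of Hanoi needs at least 2^n - 1 moves, so 10 disks take 1,023 moves and 20 disks take 1,048,575. Below is a rough Python sketch of "check the first step where an LLM makes a mistake". It is not the paper's evaluation code; the names `min_moves`, `solve`, and `first_invalid_move`, and the (from_peg, to_peg) move format, are assumptions made up for illustration.

```python
from typing import Iterator, List, Optional, Tuple

Move = Tuple[int, int]  # hypothetical format: (from_peg, to_peg), pegs numbered 0..2


def min_moves(n_disks: int) -> int:
    """Minimum number of moves needed to solve n disks: 2**n - 1."""
    return 2 ** n_disks - 1


def solve(n: int, src: int = 0, dst: int = 2, aux: int = 1) -> Iterator[Move]:
    """Yield the optimal move sequence using the standard recursion."""
    if n == 0:
        return
    yield from solve(n - 1, src, aux, dst)  # park the n-1 smaller disks on aux
    yield (src, dst)                        # move the largest disk
    yield from solve(n - 1, aux, dst, src)  # bring the smaller disks back on top


def first_invalid_move(n: int, moves: List[Move]) -> Optional[int]:
    """Return the 0-based index of the first illegal move, or None if every move is legal."""
    # Peg 0 starts with disks n..1, largest at the bottom (end of list = top of peg).
    pegs: List[List[int]] = [list(range(n, 0, -1)), [], []]
    for i, (a, b) in enumerate(moves):
        if a not in (0, 1, 2) or b not in (0, 1, 2) or a == b:
            return i  # malformed move
        if not pegs[a]:
            return i  # nothing to pick up
        if pegs[b] and pegs[b][-1] < pegs[a][-1]:
            return i  # larger disk placed on a smaller one
        pegs[b].append(pegs[a].pop())
    return None


if __name__ == "__main__":
    print(min_moves(10), min_moves(20))
    print(first_invalid_move(3, list(solve(3))))
    print(first_invalid_move(3, [(0, 2), (0, 2)]))
```

A model's transcript would have to be parsed into this move format first. Note that `first_invalid_move` only checks legality; a full check would also confirm that all disks end up on the target peg.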

Kevin A. Bryan

@Afinetheorem

But the fundamental problem is that there are foundation *models*, and then there is *the way we train these models to cut off thought*. Imagine you say "hello". Try it on o3 and 4o. o3 takes longer because it "thinks" about other things you could mean. Unnecessary in this case. 2/x

Kevin A. Bryan

@Afinetheorem

ALL LLMs just predict the next token (~word) based on the distribution of words in training data plus reinforcement learning. "Thinking models" just RL in "thinking tokens" to check assumptions more carefully before responding. If you don't think that can create "reasoning", fine. 3/x

Kevin A. Bryan

@Afinetheorem

The trick here is "how long should you think to respond to a problem". We as humans realize sometimes we need to write things down (counting a large set, multiplying big numbers), sometimes we answer right away, and sometimes that is a mistake. We "RL ourselves" on this. 4/x

Kevin A. Bryan

@Afinetheorem

But if you tell me to solve a problem that would take me an hour of pen and paper, but give me five minutes, I'll probably give you an approximate solution or a heuristic. THIS IS EXACTLY WHAT FOUNDATION MODELS WITH THINKING ARE RL'D TO DO. 5/x

Kevin A. Bryan

@Afinetheorem

As a matter of underlying statistical properties of a transformer, we can of course program an LLM to spit out millions of tokens in response to "good evening", and RL it to iterate creatively on all sorts of possible interpretations, then collate, then brainstorm more, etc. 6/x

Kevin A. Bryan

@Afinetheorem

When the models *don't* do that, it's not b/c they *can't*; it's because we use post-training to stop them from doing something so crazy. This does mean that in some cases (e.g., "A boy goes to the hospital", h/t @goodside, below), it *should* think longer. 7/x

Quote

Kevin A. Bryan

@Afinetheorem

May 22

Replying to @Afinetheorem

Gives amazing proofs on some mathematical questions I ask (looks better than any model I have tried), but the true test of AGI continues to be understanding that the doctor here could just be the kid's father (every single model from everyone still fails on this).

Kevin A. Bryan

@Afinetheorem

We know from things like Code with Claude and internal benchmarks that performance strictly increases as we increase the tokens used for inference, on ~every problem domain tried. But LLM companies can do this: *you* can't b/c the model you have access to tries not to "overthink". 9/x

Kevin A. Bryan

@Afinetheorem

The team on this paper are good (incl. Yoshua Bengio's brother!), but the interpretation media folks give it is just wrong. It 100% does not, and cannot, show "reasoning is just pattern matching" (beyond the trivial fact that all LLMs do nothing more than RL'd token prediction...) 10/x

Kevin A. Bryan

@Afinetheorem

You can read the setup here: machinelearning.apple.com/research/illus Paper is very short so I think what I'm saying above is essentially uncontroversial. 11/11

machinelearning.apple.com

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the...

Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes…

Kevin A. Bryan

@Afinetheorem

PS - This is a nice writeup where the mathematicians who provided novel FrontierMath questions were asked to analyze o3-mini's (not even today's frontier model, to be clear!) "reasoning". They did *not* believe it was pure regurgitation of heuristics.

Quote

Epoch AI

@EpochAIResearch

15h

How do reasoning models solve hard math problems? We asked 14 mathematicians to review o3-mini-high’s raw, unsummarized reasoning traces on 29 FrontierMath problems. Here’s what they found:
