The "reasoning doesn't exist" Apple paper drives me crazy. Take a logic puzzle like Tower of Hanoi w/ 10s to 1000000s of moves to solve correctly. Check the first step where an LLM makes a mistake. Long problems aren't solved. Fewer thought tokens/earlier mistakes on longer problems. 1/11
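To make the setup concrete, here is a minimal sketch of that kind of check, not the paper's actual harness. The optimal Tower of Hanoi solution for n disks has 2**n - 1 moves (15 for 4 disks, 1,048,575 for 20), and a simple simulator can report the first illegal move in a proposed sequence. The names `solve_hanoi` and `first_bad_move` and the (from_peg, to_peg) move encoding are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg), pegs numbered 0..2 (assumed encoding)


def solve_hanoi(n: int, src: int = 0, dst: int = 2, aux: int = 1) -> List[Move]:
    """Return the optimal move list for n disks; its length is 2**n - 1."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, src, aux, dst)    # park the n-1 smaller disks on aux
        + [(src, dst)]                       # move the largest disk
        + solve_hanoi(n - 1, aux, dst, src)  # bring the smaller disks back on top
    )


def first_bad_move(n: int, moves: List[Move]) -> Optional[int]:
    """Return the index of the first illegal move, or None if all moves are legal."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 starts with the largest disk at the bottom
    for i, (a, b) in enumerate(moves):
        if a not in (0, 1, 2) or b not in (0, 1, 2) or a == b:
            return i  # not a valid peg pair
        if not pegs[a]:
            return i  # nothing to move from the source peg
        if pegs[b] and pegs[b][-1] < pegs[a][-1]:
            return i  # would place a larger disk on a smaller one
        pegs[b].append(pegs[a].pop())
    return None


if __name__ == "__main__":
    moves = solve_hanoi(4)
    print(len(moves), first_bad_move(4, moves))
    bad = solve_hanoi(3)
    bad[1] = (0, 2)  # corrupt the second move
    print(first_bad_move(3, bad))
```

This only checks legality move by move; whether the final state is actually solved would need one more comparison against the target peg.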
David Watson 🥑

But the fundamental problem is that there are foundation *models*, and then there is *the way we train these models to cut off thought*. Imagine you say "hello". Try it on o3 and 4o. o3 takes longer because it "thinks" about other things you could mean. Unnecessary in this case. 2/x
ALL LLMs just predict next token (~word) based on distribution of words in training data plus reinforcement learning. "Thinking models" just RL in "thinking tokens" to check assumptions more carefully before responding. If you don't think that can create "reasoning", fine. 3/x
The trick here is "how long should you think to respond to a problem". We as humans realize sometimes we need to write things down (counting a large set, multiplying big numbers), sometimes we answer right away, sometimes that is a mistake. We "RL ourselves" on this. 4/x
But if you tell me to solve a problem that would take me an hour of pen and paper, but give me five minutes, I'll probably give you an approximate solution or a heuristic. THIS IS EXACTLY WHAT FOUNDATION MODELS WITH THINKING ARE RL'D TO DO. 5/x
As a matter of underlying statistical properties of a transformer, we can of course program an LLM to spit out millions of tokens in response to "good evening", and RL it to iterate creatively on all sorts of possible interpretations, then collate, then brainstorm more, etc. 6/x
When the models *don't* do that, it's not b/c they *can't*; it's because we use post-training to stop them from doing something so crazy. This does mean that in some cases (e.g., "A boy goes to the hospital" h/t below), they *should* think longer. 7/x
Quote
Kevin A. Bryan
@Afinetheorem
Replying to @Afinetheorem
Gives amazing proofs on some mathematical questions I ask - looks better than any model I have tried- but the true test of AGI continues to be understanding that the doctor here could just be the kid's father (every single model from everyone still fails on this).
We know from things like Code with Claude and internal benchmarks that performance strictly increases as we increase the tokens used for inference, on ~every problem domain tried. But LLM companies can do this: *you* can't b/c the model you have access to tries not to "overthink". 9/x
The team on this paper are good (incl. Yoshua Bengio's brother!), but the interpretation media folks give it is just wrong. It 100% does not, and cannot, show "reasoning is just pattern matching" (beyond the trivial fact that all LLMs do nothing more than RL'd token prediction...) 10/x
PS - This is a nice writeup where the mathematicians who provided FrontierMath novel questions were asked to analyze o3-mini's (not even today's frontier model, to be clear!) "reasoning". They did *not* believe it was pure regurgitation of heuristics.
Quote
Epoch AI
@EpochAIResearch
How do reasoning models solve hard math problems? We asked 14 mathematicians to review o3-mini-high’s raw, unsummarized reasoning traces on 29 FrontierMath problems. Here’s what they found: