In-context back-tracking was emergent in R1. Bitter lesson adjacent. I thought this was plausible. Wonder if the whole o1 paradigm started out as heavy RL on 4o for reasoning tasks, without a particular prior about long CoT or in-context "search".
Quote
Paul Calcraft
@paul_cal
Replying to @teortaxesTex
Worth noting even earlier versions of Claude & GPT4 were spotted occasionally backtracking w "wait" in the wild. If it's a low % path that, when taken, improves final answers, then as long as your reward is based on final answer quality, it seems findable & iteratively boostable
171.5K views
David Watson 🥑

I would believe that, yeah: they started with RL on verifiable domains, saw this emerging, and were like “holy shit”
