Post

Conversation

In-context back-tracking was emergent in R1. Bitter lesson adjacent. I thought this was plausible. Wonder if the whole o1 paradigm started out as heavy RL on 4o for reasoning tasks, without a particular prior about long CoT or in-context "search".

Quote

Paul Calcraft

@paul_cal

Nov 26, 2024

Replying to @teortaxesTex

Worth noting even earlier versions of Claude & GPT4 were spotted occasionally backtracking w "wait" in the wild. If it's a low % path that, when taken, improves final answers, then as long as your reward is based on final answer quality, it seems findable & iteratively boostable

4:52 AM · Jan 20, 2025

171.5K

Views
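
The quoted argument (a low-probability "wait" path that improves final answers is findable and boostable under a final-answer reward) can be illustrated with a toy calculation. This is only a sketch under loud assumptions, not anything DeepSeek or OpenAI has described: the policy is reduced to a single logit for "backtrack or not", the accuracies `acc_with` and `acc_without` are made-up numbers, and the update is the exact expected policy gradient of the final-answer reward rather than a sampled RL algorithm.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def boost_backtracking(
    p_init: float = 0.02,      # assumed: backtracking starts as a rare path
    acc_with: float = 0.8,     # assumed final-answer accuracy when backtracking
    acc_without: float = 0.4,  # assumed final-answer accuracy otherwise
    lr: float = 1.0,
    steps: int = 500,
) -> list[float]:
    """Return P(backtrack) before each update, plus the final value.

    Expected reward is J(theta) = p * acc_with + (1 - p) * acc_without,
    with p = sigmoid(theta), so dJ/dtheta = p * (1 - p) * (acc_with - acc_without).
    The reward only looks at the final answer; nothing rewards "wait" directly.
    """
    theta = math.log(p_init / (1.0 - p_init))  # logit of the initial probability
    history = []
    for _ in range(steps):
        p = sigmoid(theta)
        history.append(p)
        grad = p * (1.0 - p) * (acc_with - acc_without)
        theta += lr * grad  # gradient ascent on expected final-answer reward
    history.append(sigmoid(theta))
    return history


if __name__ == "__main__":
    hist = boost_backtracking()
    for step in (0, 100, 200, 500):
        print(f"step {step:3d}: P(backtrack) = {hist[step]:.3f}")
```

In this toy setting the probability of backtracking grows slowly while the path is rare (the gradient scales with p) and then accelerates, which matches the "iteratively boostable" intuition. Whether real training runs behave this way is exactly what the thread is speculating about.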

stochasm

@stochasticchasm

Jan 20

I would believe that yeah, they started with RL on verifiable domains and saw this emerging and were like “holy shit”

2.8K

Srivatsa Chakravarthy

@srivatsamath

Jan 20

This is beautiful!

1.3K

Discover more

Sourced from across X

kalomaze

@kalomaze

12h

this is an old paper but i believe unfamiliar people should read it

18K

Lucas Beyer (bl16)

@giffmana

Just had a quick look at DeepSeek's new Janus Pro paper. I don't think it's a big deal (yet...!), but quick TL;DR below before hype gets out of hand.

The most surprising part of DeepSeek-R1 is that it only takes ~800k samples of 'good' RL reasoning to convert other models into RL-reasoners. Now that DeepSeek-R1 is available people will be able to refine samples out of it to convert any other model into an RL reasoner.

42K

(((ل()(ل() 'yoav))))

@yoavgo

deepseek published their V3 model a month ago and that's where all the efficiency stuff was disclosed and discussed. why are people having the meltdown only now, after the R1 release?

40K
