In-context backtracking was emergent in R1. Bitter-lesson adjacent. I thought this was plausible.
Wonder if the whole o1 paradigm started out as heavy RL on 4o for reasoning tasks, without a particular prior about long CoT or in-context "search".
Quote
Paul Calcraft
@paul_cal
Replying to @teortaxesTex
Worth noting even earlier versions of Claude & GPT-4 were spotted occasionally backtracking with "wait" in the wild. If it's a low-% path that, when taken, improves final answers, then as long as your reward is based on final answer quality, it seems findable & iteratively boostable.
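
A toy sketch of the "findable & iteratively boostable" argument, not anything R1 or o1 actually does: treat "backtrack or not" as a single logit, reward only the final answer, and follow the expected policy gradient. The starting probability, the two accuracies, the learning rate, and the function name `boost_backtracking` are all made-up assumptions for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def boost_backtracking(
    p0: float = 0.02,            # assumed initial chance the model backtracks
    acc_backtrack: float = 0.8,  # assumed final-answer accuracy when it backtracks
    acc_plain: float = 0.5,      # assumed final-answer accuracy when it does not
    lr: float = 1.0,
    steps: int = 500,
) -> list[float]:
    """Return the backtracking probability after each expected-gradient step."""
    theta = math.log(p0 / (1.0 - p0))  # logit of the backtracking probability
    history = [p0]
    for _ in range(steps):
        p = sigmoid(theta)
        # Expected reward J = p * acc_backtrack + (1 - p) * acc_plain,
        # so dJ/dtheta = p * (1 - p) * (acc_backtrack - acc_plain).
        grad = p * (1.0 - p) * (acc_backtrack - acc_plain)
        theta += lr * grad  # gradient ascent on final-answer reward only
        history.append(sigmoid(theta))
    return history


if __name__ == "__main__":
    probs = boost_backtracking()
    print(f"start: {probs[0]:.3f}, end: {probs[-1]:.3f}")
```

The reward never mentions backtracking. The behavior gets amplified only because it correlates with better final answers, and the `p * (1 - p)` factor means growth is slow while the path is rare, then speeds up. If the two accuracies are equal, the gradient is zero and the probability stays put.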