Conversation
Cute idea, reminds me of the “let’s think step by step” trick. Both lean on the language prior to steer the thoughts.
Don’t trust your strategy until it’s passed this test. Walk-Forward Validation is the key to crafting a reliable machine learning model. Learn how this method helps you avoid overfitting and prepares your model for live market scenarios. Full article and code on my Substack:
taps sign
Quote
Not Gary Marcus
@InverseMarcus
this is probably not exactly right, but maybe directionally right:
there is a prompt out there in the universe that will make gpt3 as good as o3
This is cool!
Now this gives superior CoT-like capabilities to simple models.
We’ll have to see how much of a time adder it is for an on-device model.
Honestly not too sold on the “wait/hmm/alternatively” trick, as there doesn’t seem to be *that much* improvement per Table 4. But the dataset efforts are absolutely great — collecting and filtering 59k samples down to 1k and open-sourcing them all is downright god’s work for the
You seen this? Under 900 examples and beats o1 at math.
Quote
The AI Veteran
@TheAIVeteran
LIMO: Less is More for Reasoning
Efficient RL through careful data curation enables using 817 carefully curated reasoning traces to generate better results on math benchmarks than 100k+ traces.
"We formalize the Less-Is-More Reasoning (LIMO) Hypothesis as follows: In foundation x.com/BLeavesYe/stat…
If you want to test this in action:
Quote
An Qu
@hahahahohohe
I made DeepSeek R1 Overthinker - a chatbot that lets you force r1 models to think for as long as you wish.
Set a minimum thinking threshold and watch the model think about your problem for hours
• Unlimited context length
• Run models up to 14B on free Colab T4
Link below
One question, for anyone who read the papers in detail and has better knowledge: in the supervised step, with those 1,000 examples, is the training computationally demanding? I was wondering if the model sees each example one time or millions of times.
full paper: openread.academy/en/paper/readi
Breakdown of the paper:
The paper introduces a new approach to language modeling called "test-time scaling" which aims to improve the performance of language models by increasing the compute at test time. This approach has been validated by
yep!
long-time fave, think in HTML numbered tags
e.g., <note,#,note>, <oops,#,note>, <fix,#,note>, <btw,#,note>, <pausing,#,note>, etc.
add tags based on thread & weave as I go; can multi-tag, e.g., <pause,19,collecting thoughts><fix,4,mod title>, etc.
awesome results! the architecture is really critical; raw LLMs need to be molded using agents and deeper architectures like this. Now combine this method with 2 or more models cooperating on a problem, and watch the accuracy soar.
This is my favorite AI tool for reviewing reports.
Just upload a report, ask for a summary, and get one in seconds.
It's like ChatGPT, but built for documents.
Try it for free.
The AI industry thinks building better agents requires:
• Massive compute
• Billion-dollar training runs
• Warehouse-scale infrastructure
Stanford just proved everyone wrong
Their breakthrough:
A simple wrapper called budget forcing
Forces models to think sequentially &
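The thread’s “budget forcing” wrapper is, at its core, a decode-time floor and ceiling on thinking tokens. A minimal pure-Python sketch of that idea; `model_step`, the token strings, and the budgets are all illustrative stand-ins, not the paper’s actual code.

```python
# Sketch of budget forcing: floor and ceiling on "thinking" tokens.
# `model_step` stands in for a real next-token call on a local model.

END = "</think>"  # R1-style end-of-reasoning marker (illustrative)

def budget_force(model_step, min_budget, max_budget):
    """Decode thinking tokens with a floor and a ceiling.

    - Model emits END before min_budget tokens -> swap it for "Wait",
      nudging the model to keep reasoning.
    - max_budget reached -> force END to move on to the answer phase.
    """
    toks = []
    while True:
        tok = model_step(toks)
        if len(toks) >= max_budget:      # ceiling: cut thinking short
            toks.append(END)
            break
        if tok == END:
            if len(toks) < min_budget:   # floor: suppress early stop
                toks.append("Wait")
                continue
            toks.append(END)
            break
        toks.append(tok)
    return toks

# Toy "models": one that never stops, one that stops immediately.
print(budget_force(lambda toks: "x", 2, 5))          # capped at the ceiling
print(budget_force(lambda toks: "</think>", 3, 10))  # padded with "Wait"
```

With a real model, `model_step` would be one forward pass plus sampling; everything else stays the same.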
That means I have just seen the cost of running reasoning models come down 100 times. Amazing progress.
I wish they could create an alternative token. Using already meaningful words reduces the model's prompt-following ability and increases model confusion. When I am developing such models, I use nonsensical words such as "UKILAL".
Imagine being an intelligent entity and you just stopped thinking and then some external actor implants "wait" in your thoughts and you can't stop thinking? This is literally what anxiety is
I wonder how often it second-guesses itself and changes a correct answer?
interesting findings! we actually tried something similar at jenova ai while testing our model router. found that while this trick helps, the results aren't quite as consistent as using specialized models like o3-mini or claude 3.5 sonnet for complex reasoning. the "Wait" prompt
R1: 90% Cheaper Than O1—And It Learns to Reason Without All Those Pre-Labeled Examples!
Thread on why this destroys the "hitting a wall" argument and what this could mean for AI in 2025

make your offline 32b llm smarter than SOTA closed models with this one weird trick…
On a related note, with a simple prompt template change the R1-distill-llama-8B can be forced to think for longer achieving way better results. It thinks from 2x to 10x longer and solves reasoning problems that 100B models can't solve.
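One common way that kind of template change is done with R1-style distills is simply to pre-open the reasoning block in the prompt, so the model starts in thinking mode instead of answering immediately. A sketch under that assumption; the role markers and cue text below are placeholders, not DeepSeek’s exact template:

```python
def build_prompt(question,
                 think_prefix="<think>\nLet me think about this step by step."):
    # Pre-filling the assistant turn with an open <think> block nudges
    # the model to keep generating reasoning tokens before it answers.
    # The <|user|>/<|assistant|> markers are illustrative placeholders.
    return f"<|user|>{question}<|assistant|>{think_prefix}"

p = build_prompt("How many primes are below 20?")
print(p)
```

The model then continues from inside the `<think>` block, which is what stretches its reasoning.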
Think Long > Work Hard. Also adds a new layer of meaning to something being thoughtful, aka thought-full.
Looking to automate reporting?
Use AI agents to turn spreadsheets to reports in minutes without any coding. breadcrumb.ai
What I love about this is it tells me that I don't necessarily need to work harder -- but think for longer -- for better outcomes in my life.
And how much processing power did it use to reach the answer?
For those who want a non-technical summary of the paper:
It explores a new way to improve AI reasoning without retraining it, called test-time scaling. Instead of spending more time and resources training an AI model, this method allocates more computing power while the model
Thought is the ultimate latent variable we have all been looking for. Every model output backed by thought will be more accurate and explainable. For the first time, I feel AGI is possible, and I am excited and fearful at the same time.
Sounds a lot like deepseek. It's constantly repeating "but wait" in its thinking
How much compute is required to train 1K examples on an existing LLM? Please be gentle, I'm ignorant on this topic.
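A fair question. A back-of-envelope answer, using the common rule of thumb of roughly 6 FLOPs per parameter per training token; every number below is an illustrative assumption (32B model, ~5k-token traces, a few epochs), not a figure from the paper:

```python
params = 32e9               # assume a 32B-parameter model
tokens_per_example = 5_000  # assume long reasoning traces
examples = 1_000
epochs = 5

total_tokens = tokens_per_example * examples * epochs
flops = 6 * params * total_tokens          # ~6 * N * D rule of thumb
print(f"~{flops:.1e} training FLOPs")

# At roughly 4e14 usable FLOP/s per modern GPU, that works out to a
# handful of GPU-hours, tiny next to pretraining.
print(f"~{flops / 4e14 / 3600:.1f} GPU-hours")
```

So the expensive part is the base model, not the 1K-example fine-tune.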
Stanford's paper is interesting. Budget forcing helps models reason better without massive retraining.
But let's not overhype it.
It doesn’t crush 175B models or exceed GPT-4. It just nudges smaller models to think more carefully.
Then why does Grok suck so bad?
No customisable agents...
Fails to answer questions...
Contradicts itself...
I think they modelled its brain after Elon's.
So the potential and intelligence are already there in current models. Just a few more of these tricks and we’re on an ASI fast takeoff?
How close to the inflection point are we getting? It feels like there's a new paper for efficiency gains or a stepwise increase in capability every other week.
Is it when we get 10X Deep research at pennies per run, and agents start writing the papers every day?
AI-first pull request reviewer with context-aware feedback, line-by-line code suggestions, and real-time chat.
how does one even interfere with inference time activities of the llm? how can i force chatgpt or claude to take some conditional actions during the inference
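Short answer: with the hosted ChatGPT/Claude APIs you can’t; they only return text. With a locally run open model you own the sampling loop, so conditional actions are just code between decode steps, e.g. masking a token’s logit until some condition holds. A toy, model-free sketch (all names and numbers illustrative):

```python
import math

def mask_token(logits, token_id, allowed):
    """Return logits with `token_id` set to -inf unless `allowed`,
    so neither greedy nor sampled decoding can pick it."""
    logits = list(logits)
    if not allowed:
        logits[token_id] = -math.inf
    return logits

def greedy(logits):
    # Pick the highest-logit token id.
    return max(range(len(logits)), key=lambda i: logits[i])

EOS = 0
toy_logits = [5.0, 1.0, 3.0]   # the "model" most wants to emit EOS
print(greedy(mask_token(toy_logits, EOS, allowed=False)))  # forced past EOS
print(greedy(mask_token(toy_logits, EOS, allowed=True)))   # allowed to stop
```

In practice this is what logits-processor hooks in open-source inference stacks let you do per step.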
So by force feeding it the way you do a duck to get foie gras? Great way to foster alignment.
It needs a contrast character to distinguish placement. (If you follow me regularly,) like yoox shipping system.
Do they append "wait" when they see that the answer is going to be wrong, or answer-agnostically to force the model to exhaust some fixed compute budget?
This is hilarious. Probably the same thing an elementary school teacher would say to an over-eager student.
Can they tell my nut to wait cuz I literally bust every time under 3 seconds
Read THIS before you make a mistake with your first franchise.
Looking at 10+ options?
I know you want to be thorough, but you'll become overwhelmed QUICKLY.
Do this instead:
1. Define your goals.
2. Set a realistic budget.
3. Focus on the lifestyle you want.
Eliminate
Does this imply that OpenAI reasoning models are more akin to test time compute configurations or budgets?
I wonder if we'll find use for the 37% rule (not the exact percentage ofc) for search in reasoning based models
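For context, the “37% rule” is the optimal-stopping (secretary problem) strategy: observe the first n/e candidates without committing, then take the first one that beats them all. Whether it maps onto search over reasoning paths is the commenter’s speculation; below is just a quick simulation of the rule itself:

```python
import math
import random

def secretary_trial(n, rng):
    # One round of the secretary problem with random candidate scores.
    scores = [rng.random() for _ in range(n)]
    cutoff = int(n / math.e)               # observe the first ~37%
    best_seen = max(scores[:cutoff])
    for s in scores[cutoff:]:
        if s > best_seen:
            return s == max(scores)        # committed: did we get the best?
    return scores[-1] == max(scores)       # never committed: stuck with last

rng = random.Random(0)
n, trials = 50, 20_000
rate = sum(secretary_trial(n, rng) for _ in range(trials)) / trials
print(rate)   # close to 1/e, about 0.37
```

The ~37% success probability is what makes the rule memorable; whether it helps pick among sampled reasoning chains is an open question.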
Here's a simple strategy to churn out 100s of viral videos
(we’ve done 40 Million views in the past 3 weeks using this exact framework)

That's what R1 is doing. It regularly self-prompts within its reasoning with: "wait," and "another approach could be," and a few others. Sometimes you can see it walk in circles. I think it's trained in, but maybe just part of the service. That's how diligent humans work too.
Stanford's out here turning "Wait" into the newest mind hack battle cry! Fascinating stuff!
So, they basically transformed the AI into a stubborn student? Fascinating!
What’s the cost of mistakes in your contracts? If you work with contracts day-to-day, it’s time to automate. Track every detail, streamline workflows ...
Make managing contracts as easy as a few clicks.
Visit our new website & book your demo today!
Model: tries to stop
Next tokens: “But hey, I’m not limited here by human reasoning and should instead create a 10x10 matrix of approaches with their strengths and weaknesses”
Any reason why they chose the Qwen model over others such as Llama? I couldn’t find it in the paper.