Conversation
Cute idea, reminds me of the “let’s think step by step” trick. Both lean on the language prior to steer the thoughts.
Don’t trust your strategy until it’s passed this test. Walk-Forward Validation is the key to crafting a reliable machine learning model. Learn how this method helps you avoid overfitting and prepares your model for live market scenarios. Full article and code on my Substack:
taps sign
Quote
Not Gary Marcus
@InverseMarcus
this is probably not exactly right, but maybe directionally right:
there is a prompt out there in the universe that will make gpt3 as good as o3
This is cool!
Now this gives superior CoT-like capabilities to simple models.
We’ll have to see how much of a time adder it is for an on-device model.
Honestly not too sold on the “wait/hmm/alternatively” trick, as there doesn’t seem to be *that much* improvement per Table 4. But the dataset efforts are absolutely great — collecting and filtering 59k samples down to 1k and open-sourcing them all is downright god’s work for the
You seen this? Under 900 examples and beats o1 at math.
Quote
The AI Veteran
@TheAIVeteran
LIMO: Less is More for Reasoning
Efficient RL through careful data curation enables using 817 carefully curated reasoning traces to generate better results on math benchmarks than 100k+ traces.
"We formalize the Less-Is-More Reasoning (LIMO) Hypothesis as follows: In foundation x.com/BLeavesYe/stat…
If you want to test this in action:
Quote
An Qu
@hahahahohohe
I made DeepSeek R1 Overthinker - a chatbot that lets you force r1 models to think for as long as you wish.
Set a minimum thinking threshold and watch the model think about your problem for hours
• Unlimited context length
• Run models up to 14B on free Colab T4
Link below
One question, for anyone who read the papers in detail and has better knowledge: in the supervised step, with those 1,000 examples, is the training computationally demanding? I was wondering if the model sees each example one time or millions of times.
full paper: openread.academy/en/paper/readi
Breakdown of the paper:
The paper introduces a new approach to language modeling called "test-time scaling" which aims to improve the performance of language models by increasing the compute at test time. This approach has been validated by
yep!
long-time fave, think in HTML numbered tags
e.g., <note,#,note>, <oops,#,note>, <fix,#,note>, <btw,#,note>, <pausing,#,note>, etc.
add tags based on thread & weave as I go; can multi-tag, e.g., <pause,19,collecting thoughts><fix,4,mod title>, etc.
awesome results! the architecture is really critical; raw LLMs need to be molded using agents and deeper architectures like this. Now combine this method with 2 or more models cooperating on a problem, and watch the accuracy soar.
This is my favorite AI tool for reviewing reports.
Just upload a report, ask for a summary, and get one in seconds.
It's like ChatGPT, but built for documents.
Try it for free.
The AI industry thinks building better agents requires:
• Massive compute
• Billion-dollar training runs
• Warehouse-scale infrastructure
Stanford just proved everyone wrong
Their breakthrough:
A simple wrapper called budget forcing
Forces models to think sequentially &
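The thread’s “budget forcing” wrapper is, at its core, a decode-time floor and ceiling on thinking tokens. A minimal pure-Python sketch of that idea; `model_step`, the token strings, and the budgets are all illustrative stand-ins, not the paper’s actual code.

```python
# Sketch of budget forcing: floor and ceiling on "thinking" tokens.
# `model_step` stands in for a real next-token call on a local model.

END = "</think>"  # R1-style end-of-reasoning marker (illustrative)

def budget_force(model_step, min_budget, max_budget):
    """Decode thinking tokens with a floor and a ceiling.

    - Model emits END before min_budget tokens -> swap it for "Wait",
      nudging the model to keep reasoning.
    - max_budget reached -> force END to move on to the answer phase.
    """
    toks = []
    while True:
        tok = model_step(toks)
        if len(toks) >= max_budget:      # ceiling: cut thinking short
            toks.append(END)
            break
        if tok == END:
            if len(toks) < min_budget:   # floor: suppress early stop
                toks.append("Wait")
                continue
            toks.append(END)
            break
        toks.append(tok)
    return toks

# Toy "models": one that never stops, one that stops immediately.
print(budget_force(lambda toks: "x", 2, 5))          # capped at the ceiling
print(budget_force(lambda toks: "</think>", 3, 10))  # padded with "Wait"
```

With a real model, `model_step` would be one forward pass plus sampling; everything else stays the same.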
That means I have just seen the cost of running reasoning models come down 100 times. Amazing progress.
I wish they could create an alternative token. Using already meaningful words reduces the model's prompt-following ability and increases model confusion. When I am developing such models, I use nonsensical words such as "UKILAL".
Imagine being an intelligent entity and you just stopped thinking and then some external actor implants "wait" in your thoughts and you can't stop thinking? This is literally what anxiety is
I wonder how often it second-guesses itself and changes a correct answer?
interesting findings! we actually tried something similar at jenova ai while testing our model router. found that while this trick helps, the results aren't quite as consistent as using specialized models like o3-mini or claude 3.5 sonnet for complex reasoning. the "Wait" prompt
R1: 90% Cheaper Than O1—And It Learns to Reason Without All Those Pre-Labeled Examples!
Thread on why this destroys the "hitting a wall" argument and what this could mean for AI in 2025

make your offline 32b llm smarter than SOTA closed models with this one weird trick…
On a related note, with a simple prompt template change the R1-distill-llama-8B can be forced to think for longer achieving way better results. It thinks from 2x to 10x longer and solves reasoning problems that 100B models can't solve.
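One common way that kind of template change is done with R1-style distills is simply to pre-open the reasoning block in the prompt, so the model starts in thinking mode instead of answering immediately. A sketch under that assumption; the role markers and cue text below are placeholders, not DeepSeek’s exact template:

```python
def build_prompt(question,
                 think_prefix="<think>\nLet me think about this step by step."):
    # Pre-filling the assistant turn with an open <think> block nudges
    # the model to keep generating reasoning tokens before it answers.
    # The <|user|>/<|assistant|> markers are illustrative placeholders.
    return f"<|user|>{question}<|assistant|>{think_prefix}"

p = build_prompt("How many primes are below 20?")
print(p)
```

The model then continues from inside the `<think>` block, which is what stretches its reasoning.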
Think Long > Work Hard. Also adds a new layer of meaning to something being thoughtful, aka thought-full.
Looking to automate reporting?
Use AI agents to turn spreadsheets to reports in minutes without any coding. breadcrumb.ai
What I love about this is it tells me that I don't necessarily need to work harder -- but think for longer -- for better outcomes in my life.
And how much processing power did it use to reach the answer?
For those who want a non-technical summary of the paper:
It explores a new way to improve AI reasoning without retraining it, called test-time scaling. Instead of spending more time and resources training an AI model, this method allocates more computing power while the model
Thought is the ultimate latent variable we have all been looking for. Every model output backed by thought will be more accurate and explainable. For the first time, I feel AGI is possible, and I am excited and fearful at the same time.
Sounds a lot like deepseek. It's constantly repeating "but wait" in its thinking
How much compute is required to train 1K examples on an existing LLM? Please be gentle, I'm ignorant on this topic.
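A fair question. A back-of-envelope answer, using the common rule of thumb of roughly 6 FLOPs per parameter per training token; every number below is an illustrative assumption (32B model, ~5k-token traces, a few epochs), not a figure from the paper:

```python
params = 32e9               # assume a 32B-parameter model
tokens_per_example = 5_000  # assume long reasoning traces
examples = 1_000
epochs = 5

total_tokens = tokens_per_example * examples * epochs
flops = 6 * params * total_tokens          # ~6 * N * D rule of thumb
print(f"~{flops:.1e} training FLOPs")

# At roughly 4e14 usable FLOP/s per modern GPU, that works out to a
# handful of GPU-hours, tiny next to pretraining.
print(f"~{flops / 4e14 / 3600:.1f} GPU-hours")
```

So the expensive part is the base model, not the 1K-example fine-tune.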
Stanford's paper is interesting. Budget forcing helps models reason better without massive retraining.
But let's not overhype it.
It doesn’t crush 175B models or exceed GPT-4. It just nudges smaller models to think more carefully.
Then why does Grok suck so bad?
No customisable agents...
Fails to answer questions...
Contradicts itself...
I think they modelled its brain after Elon's.
So the potential and intelligence are already there in current models. Just a few more of these tricks and we’re on an ASI fast takeoff?
How close to the inflection point are we getting? It feels like there's a new paper for efficiency gains or a stepwise increase in capability every other week.
Is it when we get 10X Deep research at pennies per run, and agents start writing the papers every day?
AI-first pull request reviewer with context-aware feedback, line-by-line code suggestions, and real-time chat.
how does one even interfere with inference time activities of the llm? how can i force chatgpt or claude to take some conditional actions during the inference
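Short answer: with the hosted ChatGPT/Claude APIs you can’t; they only return text. With a locally run open model you own the sampling loop, so conditional actions are just code between decode steps, e.g. masking a token’s logit until some condition holds. A toy, model-free sketch (all names and numbers illustrative):

```python
import math

def mask_token(logits, token_id, allowed):
    """Return logits with `token_id` set to -inf unless `allowed`,
    so neither greedy nor sampled decoding can pick it."""
    logits = list(logits)
    if not allowed:
        logits[token_id] = -math.inf
    return logits

def greedy(logits):
    # Pick the highest-logit token id.
    return max(range(len(logits)), key=lambda i: logits[i])

EOS = 0
toy_logits = [5.0, 1.0, 3.0]   # the "model" most wants to emit EOS
print(greedy(mask_token(toy_logits, EOS, allowed=False)))  # forced past EOS
print(greedy(mask_token(toy_logits, EOS, allowed=True)))   # allowed to stop
```

In practice this is what logits-processor hooks in open-source inference stacks let you do per step.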
So by force feeding it the way you do a duck to get foie gras? Great way to foster alignment.
It needs a contrast character to distinguish placement. (If you follow me regularly,) like yoox shipping system.
Do they append "wait" when they see that the answer is going to be wrong, or answer-agnostically to force the model to exhaust some fixed compute budget?
This is hilarious. Probably the same thing an elementary school teacher would say to an over-eager student.
Can they tell my nut to wait cuz I literally bust every time under 3 seconds
Read THIS before you make a mistake with your first franchise.
Looking at 10+ options?
I know you want to be thorough, but you'll become overwhelmed QUICKLY.
Do this instead:
1. Define your goals.
2. Set a realistic budget.
3. Focus on the lifestyle you want.
Eliminate
Does this imply that OpenAI reasoning models are more akin to test time compute configurations or budgets?
I wonder if we'll find use for the 37% rule (not the exact percentage ofc) for search in reasoning based models
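For context, the “37% rule” is the optimal-stopping (secretary problem) strategy: observe the first n/e candidates without committing, then take the first one that beats them all. Whether it maps onto search over reasoning paths is the commenter’s speculation; below is just a quick simulation of the rule itself:

```python
import math
import random

def secretary_trial(n, rng):
    # One round of the secretary problem with random candidate scores.
    scores = [rng.random() for _ in range(n)]
    cutoff = int(n / math.e)               # observe the first ~37%
    best_seen = max(scores[:cutoff])
    for s in scores[cutoff:]:
        if s > best_seen:
            return s == max(scores)        # committed: did we get the best?
    return scores[-1] == max(scores)       # never committed: stuck with last

rng = random.Random(0)
n, trials = 50, 20_000
rate = sum(secretary_trial(n, rng) for _ in range(trials)) / trials
print(rate)   # close to 1/e, about 0.37
```

The ~37% success probability is what makes the rule memorable; whether it helps pick among sampled reasoning chains is an open question.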
Here's a simple strategy to churn out 100s of viral videos
(we’ve done 40 Million views in the past 3 weeks using this exact framework)

That's what R1 is doing. It regularly self-prompts within its reasoning with: "wait," and "another approach could be," and a few others. Sometimes you can see it walk in circles. I think it's trained in, but maybe just part of the service. That's how diligent humans work too.
Stanford's out here turning "Wait" into the newest mind hack battle cry! Fascinating stuff!
So, they basically transformed the AI into a stubborn student? Fascinating!
What’s the cost of mistakes in your contracts? If you work with contracts day-to-day, it’s time to automate. Track every detail, streamline workflows ...
Make managing contracts as easy as a few clicks.
Visit our new website & book your demo today!
Model: tries to stop
Next tokens: “But hey, I’m not limited here by human reasoning and should instead create a 10x10 matrix of approaches with their strengths and weaknesses”
Any reason why they chose the Qwen model over others such as Llama? I couldn’t find it in the paper.