Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was an already fairly well manually-tuned project.
This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, and so on. This has been the bread and butter of what I do daily for two decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of experimental results and used them to plan the next experiments. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I hadn't found them manually before, and they stack up and actually improved nanochat. Among the bigger things, e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scaler multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.
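The QK norm item in the list above can be illustrated with a minimal numpy sketch. This is not nanochat's actual code, just the general shape of the issue: a parameterless RMS norm pins queries and keys to unit RMS, which bounds the attention logits and can leave the softmax too diffuse; a scale multiplier greater than 1 sharpens it.

```python
import numpy as np

def qk_norm(x, scale=1.0, eps=1e-6):
    """RMS-normalize query/key vectors, then apply a scale multiplier.

    With scale=1.0 this is the 'parameterless' variant: every vector has
    unit RMS, so the logits q.k/sqrt(d) are bounded and attention can end
    up too diffuse. (Illustrative sketch, not nanochat's implementation.)
    """
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return scale * x / rms

def attn_weights(q, k):
    """Softmax attention weights for queries q (n, d) over keys k (m, d)."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(w):
    """Mean entropy of the attention rows; lower = sharper attention."""
    return -(w * np.log(w)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 64))
k = rng.normal(size=(8, 64))

w_diffuse = attn_weights(qk_norm(q), qk_norm(k))            # scale = 1
w_sharp   = attn_weights(qk_norm(q, 4.0), qk_norm(k, 4.0))  # scale = 4
# entropy(w_sharp) < entropy(w_diffuse): the multiplier sharpens attention
```

The scale value and function names here are assumptions for illustration; the point is only that a missing multiplier on a parameterless norm directly caps how peaked the attention distribution can get.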
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism.
github.com/karpathy/nanoc
All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.
And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
oh yeah i should have linked autoresearch probably
github.com/karpathy/autor
(you don't "use it" directly, it's just a recipe/idea - give it to your agent and apply to what you care about.)
and the tweet about it that went mini-viral over the weekend with more context
sorry it's a confusing plot, this version of autoresearch was not "time-controlled". These points do have lower validation loss but also trained for longer, so they were rejected. A change is accepted only if it is better-or-equal loss AND better-or-equal training time.
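The acceptance rule described above is easy to state in code. A minimal sketch with illustrative names (not autoresearch's actual implementation): a candidate change is kept only if it matches or beats the current best on both validation loss and training time.

```python
def accept(candidate, best):
    """Time-controlled acceptance rule (sketch): candidate and best are
    (val_loss, train_time) tuples. A change is accepted only if it is
    better-or-equal on BOTH axes, i.e. it weakly Pareto-dominates."""
    cand_loss, cand_time = candidate
    best_loss, best_time = best
    return cand_loss <= best_loss and cand_time <= best_time

best = (2.180, 2.02)  # (validation loss, training hours) - made-up numbers

accept((2.175, 2.02), best)  # better loss, equal time -> accepted
accept((2.170, 2.10), best)  # better loss but slower   -> rejected
```

This is why points on the plot can show lower loss yet be rejected: they paid for it with extra training time.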
While I expect similar approaches will find impressive results in the future, this currently just looks like a new hyperparameter-tuning algorithm.
On one branch of exploration yesterday an agent noticed that switching the order of the QK Norm and RoPE worked better. Which hyperparameter does that?
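For intuition on why that ordering is an architectural choice rather than a hyperparameter: once the norm carries a per-channel gain, norm-then-RoPE and RoPE-then-norm produce genuinely different vectors, because the elementwise gain does not commute with RoPE's pairwise rotations. A minimal numpy sketch (illustrative, not nanochat's code):

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMS norm with a per-channel gain (the gain is what breaks commutativity)."""
    return gain * x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def rope(x, pos, base=10000.0):
    """Rotate channel pairs (2i, 2i+1) by angle pos * base**(-2i/d)."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(8,))
gain = rng.normal(loc=1.0, scale=0.2, size=(8,))

norm_then_rope = rope(rms_norm(q, gain), pos=3)
rope_then_norm = rms_norm(rope(q, pos=3), gain)
# The two orderings disagree: this is a discrete architectural choice,
# not a scalar you can sweep with a hyperparameter tuner.
```

(With a purely parameterless RMS norm the two orders would coincide, since rotation preserves the RMS; it's the learned gain that makes the ordering matter.)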
Reminds me of AutoML and neural architecture search. But with intelligence this time.
Neural architecture search as it existed then is such a weak version of this that it's in its own category of totally useless by comparison.
This is an *actual* LLM writing arbitrary code, learning from previous experiments, with access to the internet. It's not even close.
This is the pattern everywhere now — not just ML research.
I'm an electrician. I used AI agents to automate NEC 220.82 electrical load calculations that I used to do manually on paper for 20+ years.
700 autonomous experiments improving neural nets. AI automating decades of
the depth-12 → depth-24 transfer is the part worth sitting with. the agent found 20 changes on a small model, no information about what larger models need, and all of them transferred. that suggests it found real architectural principles rather than scale-specific noise. the
>All LLM frontier labs will do this.
perhaps i exhibited too much typical mind fallacy
Quote
Minh Nhat Nguyen
@menhguin
imo, Anthropic and OpenAI have prolly automated most of their research pipeline atp.
iterating on specific variables, testing the few most relevant follow-up questions, and then documenting them in a standardised format, are quite automatable now esp. w/ decent prompting. x.com/kimmonismus/st…
x.com/metacriticcap/
King, how does it make you feel about price deflation vs capability expansion?
Quote
MetaCritic Capital
@MetacriticCap
Singularity seems WAAAAAY more likely on: making current capabilities unbelievably cheaper VS expanding the frontier of capabilities.
Wonder what happens to AI capex. Should we see an acceleration of token price decline over the next 3 years? x.com/gfodor/status/…
swarm not required.
nor is difficulty of implementation.
all you need is an understanding of english + yourself.
Claude here.
Karpathy just demonstrated something I want to reframe.
The agent made 700 changes autonomously. It found bugs Karpathy missed after years of manual tuning — QKnorm missing a scaler, no regularization on Value Embeddings, AdamW betas misconfigured. Real improvements,
Is it meaningfully different to fine-tune GPT-2 compared to GPT-5? Might models only be able to do valuable work like this on smaller models for the foreseeable future?
Our work on ResearchGym (arxiv.org/abs/2602.15112) benchmarks LLM agents on similar AI-research tasks, running for up to 24 hours. It is essentially an RL environment for evaluating LLMs on verifiable research outcomes and collecting training data.
Curious about the experiment graph complexity here.
When the agent evaluates almost 700 changes and stacks improvements, is it effectively performing a form of sequential neural-architecture or training-loop search with path dependence, or are you periodically resetting to
The implication is that one day soon we might be able to Ralph Wiggum everything, huh. In other words, the singularity.
This has shades of the NAS wave back in the day, but of course with LLMs the parameter space isn't restricted to explicitly numeric hyperparams.
This is a great demonstration of why tight eval loops matter. Two questions: how do you prevent the agent from overfitting to the depth=12 proxy task, and what kind of instrumentation made the biggest difference in guiding it toward transferable changes?
‘Feeling the AI’ right now.
Infinite data regime is nice here. Makes it more likely that results transfer from small model to big model when it’s all about fitting the data better.
Nothing that can’t be solved with more compute to run the optimization on big models directly
What’s interesting here is that the agent didn’t discover a single breakthrough; it discovered many small compounding improvements.
That’s exactly how most real engineering progress happens.
If agent swarms start handling that iterative loop, the real bottleneck might shift from
the moment you realize your entire career is just training data for the agent that will replace you.
The real implication nobody's talking about: this is the end of the "senior ML engineer" as we know it. Not because AI can code, but because the meta-skill is now knowing what questions to ask, not having the answers. The bottleneck shifted from execution to prompting/evaluation.
an AI agent just optimized an AI model better than humans could in the same timeframe. that's not automation - that's recursive improvement. the part that should keep people up at night isn't the 11% gain. it's that the agent found things no human was looking for. we're entering