DeepSeek (Chinese AI co) making it look easy today with an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M).
For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being brought up today are more around 100K GPUs. E.g. Llama 3 405B used 30.8M GPU-hours, while DeepSeek-V3 looks to be a stronger model at only 2.8M GPU-hours (~11X less compute). If the model also passes vibe checks (e.g. LLM arena rankings are ongoing, my few quick tests went well so far) it will be a highly impressive display of research and engineering under resource constraints.
Does this mean you don't need large GPU clusters for frontier LLMs? No but you have to ensure that you're not wasteful with what you have, and this looks like a nice demonstration that there's still a lot to get through with both data and algorithms.
Very nice & detailed tech report too, reading through.
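A quick sanity check of the numbers in the post, as a rough sketch. The GPU-hour figures, the GPU count, and the $6M budget are taken from the post itself. The function names are made up for illustration, and the implied hourly rate is just a back-of-the-envelope division, not a quoted rental price.

```python
# Figures quoted in the post above.
LLAMA3_405B_GPU_HOURS = 30.8e6   # Llama 3 405B
DEEPSEEK_V3_GPU_HOURS = 2.8e6    # DeepSeek-V3
DEEPSEEK_NUM_GPUS = 2048
QUOTED_BUDGET_USD = 6e6


def compute_ratio(baseline_gpu_hours: float, new_gpu_hours: float) -> float:
    """How many times more GPU-hours the baseline used."""
    return baseline_gpu_hours / new_gpu_hours


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Days of training if every GPU runs for the whole job."""
    return gpu_hours / num_gpus / 24


def implied_usd_per_gpu_hour(budget_usd: float, gpu_hours: float) -> float:
    """Hourly GPU price implied by a total budget (hypothetical helper)."""
    return budget_usd / gpu_hours


ratio = compute_ratio(LLAMA3_405B_GPU_HOURS, DEEPSEEK_V3_GPU_HOURS)
days = wall_clock_days(DEEPSEEK_V3_GPU_HOURS, DEEPSEEK_NUM_GPUS)
rate = implied_usd_per_gpu_hour(QUOTED_BUDGET_USD, DEEPSEEK_V3_GPU_HOURS)

print(f"compute ratio: ~{ratio:.0f}x")
print(f"wall clock on {DEEPSEEK_NUM_GPUS} GPUs: ~{days:.0f} days")
print(f"implied price: ~${rate:.2f} per GPU-hour")
```

The three figures hang together: 2.8M GPU-hours on 2048 GPUs works out to just under two months, and a $6M budget implies roughly $2 per GPU-hour.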
When you have 100k GPUs everything looks like a nail, but when you have 2k you really have to focus
Lots of custom things in the technical report, as they needed the improvements given their relative lack of compute
Still feel 14 trillion tokens is a lot vs what may be needed
“Hey everyone, in today’s YouTube video we’ll be recreating OpenAI’s o3 model in 75min with $20 of compute.”
Didn't "we" already know much of this?
e.g.
- Learn from simple to complex (TinyStories paper)
- Learn from quality sources (Textbooks are all you need paper)
- Don't moosh together random text (Apple's Dataset Decomposition: Pretrain LLMs with Variable Sequence Lengths)
- I
Software expands to fill the available resources. If you want more efficient software, build it on less powerful hardware. AI training runs are no exception!
DeepSeek showed us what we should have known, you need both talent and constraints to be really creative. Throwing money at the problem works but leads to inefficiencies at every single layer.
> E.g. Llama 3 405B used 30.8M GPU-hours, while DeepSeek-V3 looks to be a stronger model at only 2.8M GPU-hours (~11X less compute).
Super interesting! And DeepSeek was trained on H800s, which are probably also a tad (or noticeably?) slower than Meta’s H100s.
I'm trying to understand the implications here.
Could the labs with huge budgets just steal all the algorithms used and then scale up the data cleaning & training process and produce a fantastic model?
USSR had to do a lot of the same stuff, keeping up with the US on shoestring budgets. Constraints are beautiful. Necessity is the mother of invention, and constraints are the father of creativity.
Detailing DeepSeek-V3 in today's newsletter
open.substack.com/pub/rohanpaul/
Still reading, a lot here - is it right to say 1) DualPipe/comms optimizations and 2) fine-grained MoE/load balancing strategy for expert utilization are the biggest drivers of compute efficiency?
with FP8 as enabler (reducing mem footprint, efficient comms, bigger batches)
What does this mean? Is it possible that the improvements they made to offset the constraints of a much smaller model can also be applied to bigger models, and can we expect similar ~11x capability improvements at 100k clusters?
i just had a vision of a world where chinese AI co's making frontier-grade LLMs for pennies on the dollar, this is our future, we'll have the models and the means to do something about the problems facing us, let the efficiency begin
I was told by the team that they have fewer than 100 people, including all functional teams. It's impressive that with so few people and so little compute, they can deliver the SOTA open model
Feynman was right when he said there's plenty of room at the bottom - it doesn't just apply to physics.
My working hypothesis is that most of the GPUs are going to the government, something like a Manhattan Project for AI, and the for-profits are scraping by with the leftovers lest they be regulated out of existence.
This probably also means significant extra gains are still to be had by continuing training!
It's a very chonky model though. Takes a full 8 GPU node to host!
The limited supply of the newest GPUs pushed the engineers to their limits, forcing them to use resources more wisely.
That’s how top-notch engineers work within physical limits!
Some investors are starting to question the return on capital used for compute by leading publicly traded frontier tech labs.
Is there a new kind of scaling law? With how few resources can one train a “saturated” model (~gpt4 level) over time?
Is the US sure they want to “Keep China out of the AI race”?
Feels to me that we will be racing….from behind.
deepseek proved it.
creativity needs both talent and constraints.
money can solve problems, but it breeds inefficiency. real innovation thrives on limits.
If the US wants to get serious about competing with China, they need to restrict the compute budgets of US companies immediately.
Not so long ago we had AI "experts" who defined frontier AI models as those that cost more than $100M to train.
Scarcity breeds innovation
Quote
Vaibhav (VB) Srivastav
@reach_vb
Scarcity breeds Innovation - cost to build a frontier model - 5.5 Million USD
In a way, it's the maximum it'd be (Note: H800s have ~2x slower chip-to-chip data transfer)
This cost will only go down further and further as we continue to find newer walls to scale! x.com/reach_vb/statu…
Quote
Anthonix
@zealandic1
We are now down to 160 GPU hours for speed running SOTA evals in the 100-200M param smol model class (~31x less compute!)
Thanks to @KoszarskyB, @leloykun, @Grad62304977, @YouJiacheng et al. for their work that helped to make this possible, and to @HotAisle for their great x.com/karpathy/statu…
Is it actually possible to sign up for the API? Been trying since the announcement this morning...
You yourself, Sir, have done an excellent job and taken us all to a respectable level of GPT-2 (124M) on a micro budget from scratch.
Hopefully we will see a GPT-3 level model next. 
One of the biggest improvements in 2024 was the quality - price ratio!
If this trend continues, models like o3 will be ~free in 2025.
This is going to be a massive boost for the AI revolution 
the model is great, but the team that pulled this off is significantly greater
human ingenuity knows no bounds
I just did a DeepSeek vs Claude comparison and was surprised to see DeepSeek win:
Quote
Breck Yunits
@breckyunits
I put the new DeepSeek v3 model head-to-head versus Claude Sonnet 3.5. The winner will surprise you:
Makes me think that the runaway AGI "winner takes all" problem might not be the big threat we all thought it was.
looks like someone will have to make a dash for an Indic LLM in India eventually
Why not you?
I'm also impressed they went out of their way to support inference on AMD hardware
Thanks for sharing your early thoughts here. I'm looking forward to the results of the vibe check!
Efficiency at its finest. Wonder if this marks a shift toward more lean, cost-effective AI development. Thoughts on future implications for smaller players?
of course frontier LLMs don't need large GPU clusters, just look at me, all i consume is a pizza a day and i don't even hallucinate..
Extremely impressive work from the team. Cost per intelligence is going down down down.
Are the dataset and whatever else would be needed to fully replicate its training available, or only the weights?
VC asks... What have you been doing with all of our money....
Lab says.... Uh...
Incredible efficiency! DeepSeek’s approach could set new standards for cost-effective LLM development.
It's always like that. When you don't have an unlimited budget, you need to focus and squeeze out every little performance gain. Impressive leap forward ngl
Imagine Musk making DeepSeek style Grok 3 with his 100k GPUs. He can re-train the whole network every 14 days max.
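For scale, here is a rough sketch of that estimate using only the 2.8M GPU-hour figure from the original post. It assumes perfectly linear scaling across the cluster and identical per-GPU throughput, which real 100k-GPU runs will not achieve, so treat the result as a lower bound on wall-clock time. The function name is hypothetical.

```python
DEEPSEEK_V3_GPU_HOURS = 2.8e6  # figure quoted in the original post


def ideal_wall_clock_hours(gpu_hours: float, num_gpus: int) -> float:
    """Wall-clock hours under (unrealistic) perfectly linear scaling."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return gpu_hours / num_gpus


hours_on_100k = ideal_wall_clock_hours(DEEPSEEK_V3_GPU_HOURS, 100_000)
runs_per_14_days = (14 * 24) / hours_on_100k

print(f"ideal wall clock on 100k GPUs: {hours_on_100k:.0f} hours")
print(f"ideal full runs per 14 days: {runs_per_14_days:.0f}")
```

Under those idealized assumptions a full run would take a bit over a day, so "every 14 days max" leaves a lot of slack for communication overhead and lower utilization at that scale.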
In my experience, companies tend to throw hardware at performance problems rather than refactoring, rethinking, or reducing functionality. It’s easier, and many “coders” lack a solid foundation in mathematics and have little to no critical/analytical thinking prowess. Being efficient
DeepSeek's team is an ultra-talented cohort of ex-quants
Quants are famous for squeezing out every little performance gain
They just did it again, only in a different domain.
High IQ individuals are just a blessing to the world
The potential impact of DeepSeek’s new model on Nvidia is huge.
If you can train a model comparable to one requiring 16K GPUs using only 2K GPUs, it reduces the competitive advantage of investing in massive training clusters.
With data distilled from a good model (e.g. GPT-4) it’s a lot easier and more feasible to train a strong model on a budget. This is basic distillation.
There are 3 components: talent, data, and compute.
They have access to pretty much any data they want.
They probably don't have as much of an advantage in human talent and compute as America. Maybe they have decent enough talent to be equivalent, and compute will catch up.
Data
Sounds like magic when you explain it. So now the race is not to train the best LLM but the fastest one. Groq won already, no?
Mighty impressive. Makes American AI labs look bad
Quote
gary
@garyfung
Replying to @giffmana
Yet, if it’s more cracked at coding tasks than both gpt4 and Claude. And with such “cheap” and fast training. That’s actually even more impressive
Literally using competitor model as source of synthetic data to beat them at their own game
Incredible progress.
AGI will be sooner than expected, and cheaper.
I switched to Deepseek as my default model after one day of use. Great performance. Dirt cheap.
Resource efficiency is impressive. OnChain AI similarly optimizes operations with blockchain-based recording. The forefront of cost-effective intelligence applications.
Does this mean they have a much better algorithm for churning through large amounts of training data while keeping compute usage as efficient as possible?