Andrej Karpathy

DeepSeek (Chinese AI co) making it look easy today with an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M). For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being brought up today are more around 100K GPUs. E.g. Llama 3 405B used 30.8M GPU-hours, while DeepSeek-V3 looks to be a stronger model at only 2.8M GPU-hours (~11X less compute). If the model also passes vibe checks (e.g. LLM arena rankings are ongoing, my few quick tests went well so far) it will be a highly impressive display of research and engineering under resource constraints. Does this mean you don't need large GPU clusters for frontier LLMs? No but you have to ensure that you're not wasteful with what you have, and this looks like a nice demonstration that there's still a lot to get through with both data and algorithms. Very nice & detailed tech report too, reading through.

Quote

DeepSeek

@deepseek_ai

Dec 26

Introducing DeepSeek-V3! Biggest leap forward yet:

60 tokens/second (3x faster than V2!)

Enhanced capabilities

API compatibility intact

Fully open-source models & papers

1/n

11:23 AM · Dec 26, 2024

2.6M

Views
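
As a rough sanity check on the figures quoted in the post above, the arithmetic can be sketched in a few lines of Python. All inputs are the rounded numbers from the post (30.8M and 2.8M GPU-hours, 2048 GPUs, $6M). The function names are illustrative, and the per-GPU-hour rate is only what those rounded figures imply, not a number stated in the thread.

```python
# Rounded figures as quoted in the post; nothing here comes from outside the thread.
LLAMA3_405B_GPU_HOURS = 30.8e6
DEEPSEEK_V3_GPU_HOURS = 2.8e6
DEEPSEEK_V3_GPUS = 2048
DEEPSEEK_V3_BUDGET_USD = 6e6


def compute_ratio(baseline_gpu_hours: float, other_gpu_hours: float) -> float:
    """How many times more GPU-hours the baseline used."""
    return baseline_gpu_hours / other_gpu_hours


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Days of training if every GPU runs continuously for the whole job."""
    return gpu_hours / num_gpus / 24


def implied_usd_per_gpu_hour(budget_usd: float, gpu_hours: float) -> float:
    """Hourly rate implied by a total budget (a derived value, not a quoted one)."""
    return budget_usd / gpu_hours


ratio = compute_ratio(LLAMA3_405B_GPU_HOURS, DEEPSEEK_V3_GPU_HOURS)
days = wall_clock_days(DEEPSEEK_V3_GPU_HOURS, DEEPSEEK_V3_GPUS)
rate = implied_usd_per_gpu_hour(DEEPSEEK_V3_BUDGET_USD, DEEPSEEK_V3_GPU_HOURS)

print(f"compute ratio: ~{ratio:.0f}x")
print(f"wall-clock time on {DEEPSEEK_V3_GPUS} GPUs: ~{days:.0f} days")
print(f"implied rate: ~${rate:.2f} per GPU-hour")
```

The ratio matches the "~11X" in the post, and roughly 57 days on 2048 GPUs is consistent with "2 months".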

David Watson 🥑

When you have 100k GPUs everything looks like a nail, but when you have 2k you really have to focus. Lots of custom things in the technical report, as they needed the improvements given their relative lack of compute. Still feel 14T tokens is a lot vs. what may be needed.

“Hey everyone, in today’s YouTube video we’ll be recreating OpenAI’s o3 model in 75min with $20 of compute.”

Maynard Handley

Didn't "we" already know much of this? E.g.:
- Learn from simple to complex (TinyStories paper)
- Learn from quality sources (Textbooks Are All You Need paper)
- Don't moosh together random text (Apple's Dataset Decomposition: Pretrain LLMs with Variable Sequence Lengths)
- I

James Darpinian

Software expands to fill the available resources. If you want more efficient software, build it on less powerful hardware. AI training runs are no exception!

DeepSeek showed us what we should have known, you need both talent and constraints to be really creative. Throwing money at the problem works but leads to inefficiencies at every single layer.

Sebastian Raschka

> E.g. Llama 3 405B used 30.8M GPU-hours, while DeepSeek-V3 looks to be a stronger model at only 2.8M GPU-hours (~11X less compute).

Super interesting! And DeepSeek was trained on H800s, which are probably also a tad (or noticeably?) slower than Meta’s H100s.

@AlwaysUhhJustin

I'm trying to understand the implications here. Could the labs with huge budgets just steal all the algorithms used and then scale up the data cleaning & training process and produce a fantastic model?

USSR had to do a lot of the same stuff, keeping up with the US on shoestring budgets. Constraints are beautiful. Necessity is the mother of invention, and constraints are the father of creativity.

Detailing on DeepSeek-V3 in my today's newsletter open.substack.com/pub/rohanpaul/

sarah guo // conviction

Still reading, a lot here - is it right to say 1) dualpipe/comms optimizations and 2) fine grained MOE/load balancing strategy for expert utilization are the biggest drivers of compute efficiency? with FP8 as enabler (reducing mem footprint, efficient comms, bigger batches)

race to 0 has truly begun

What does this mean? Is it possible the improvements they made to offset the constraints of a much smaller model can also be applied to bigger models.. and can we expect similar ~11x capability improvements at 100k clusters?

try it out here: huggingface.co/spaces/akhaliq for developers: github.com/AK391/ai-gradi

i just had a vision of a world where chinese AI co's making frontier-grade LLMs for pennies on the dollar, this is our future, we'll have the models and the means to do something about the problems facing us, let the efficiency begin

the efficiency level they reached here is crazy

hardware is only half the battle. data and algorithms are the secret sauce

Cracked researchers + engineers > GPUs

I was told by the team that they have fewer than 100 people, including all functional teams. It's impressive that with such a small amount of talent and compute, they can deliver the SOTA open model.

Feynman was right when he said there's plenty of room at the bottom - it doesn't just apply to physics.

OG Whippersnapper

@RobertDobalina7

My working hypothesis is that most of the GPUs are going to the government, something like a Manhattan Project for AI, and the for-profits are scraping by with the leftovers lest they be regulated out of existence.

Secular Christmas Robot

This probably also means significant extra gains are still to be had by continuing training! It's a very chonky model though. Takes a full 8 GPU node to host!

@Yang_ML_Estate

The limited supply of the newest GPUs pushed engineers to their limits, forcing them to use resources more wisely. That’s how top-notch engineers work within physical limits!

Quite possible that a lot of their 'efficiency' wins are coming from the fact that there are a number of frontier models that you can train off of now.

Nicolas Granatino

Some investors are starting to question the return on capital used for compute by leading publicly traded frontier tech labs.

Peter Hargreaves’ Blue Whale sells major tech stocks over AI concerns

0xroyce - τ/Æ

2 years ago we needed 10s or 100s of millions...unbelievable progress

Are you saying they could reach ASI today with oai budget?

the engineering is absurd

Quote

wh

@nrehiew_

Dec 26

How to train a 670B parameter model. Let's talk about the DeepSeek v3 report + some comparisons with what Meta did with Llama 405B

Is there a new kind of scaling law? With how few resources can one train a “saturated” model (~gpt4 level) over time?

Is the US sure they want to “Keep China out of the AI race”? Feels to me that we will be racing….from behind.

Sanket Sabharwal, Ph.D.

@sanketsabharwal

deepseek proved it. creativity needs both talent and constraints. money can solve problems, but it breeds inefficiency. real innovation thrives on limits.

transriphean bede

If the US wants to get serious about competing with China, they need to restrict the compute budgets of US companies immediately.

Not so long ago, we had AI "experts" who defined frontier AI models as those that cost more than $100M to train.

feels very capable in my initial testing

Vaibhav (VB) Srivastav

Scarcity breeds innovation

Quote

Vaibhav (VB) Srivastav

@reach_vb

Dec 26

Scarcity breeds innovation. Cost to build a frontier model: 5.5 million USD. In a way, that's the maximum it would be (note: H800s have ~2x slower chip-to-chip data transfer). This cost will only go down further and further as we continue to find newer walls to scale! x.com/reach_vb/statu…

insane efficiency

Quote

Anthonix

@zealandic1

Dec 27

We are now down to 160 GPU hours for speed running SOTA evals in the 100-200M param smol model class (~31x less compute!)

Thanks to @KoszarskyB, @leloykun, @Grad62304977, @YouJiacheng et al. for their work that helped to make this possible, and to @HotAisle for their great x.com/karpathy/statu…

you just need really cracked engineers

Open source will force western AI to become cheaper and I love it

Is it actually possible to sign up for the API? Been trying since the announcement this morning...

How did they do it? Would 10x more compute have made much of a difference for V3?

Thank you for the long-overdue attention to this matter. We all hope that the world becomes fairer and that these remarkable researchers receive the necessary equipment to conduct their work. However, it is likely that their company will face sanctions very soon, as they are

DeepSeek is a ChatGPT wrapper??

You yourself, sir, have done an excellent job and taken us all to a respectable level of GPT-2 (124M) on a micro budget from scratch. Hopefully we will see a GPT-3 level model next.

AI Leaks and News

@AILeaksAndNews

DeepSeek cooked

One of the biggest improvements in 2024 was the quality - price ratio! Following this trend then models like o3 will be ~for free in 2025. This is going to be a massive boost for the AI revolution

Oh goodness, this can't be good for Nvidia. Fewer GPUs needed?

there is no moat

Why isn’t $nvda stock tanking on this news?

the model is great, but the team that pulled this off is significantly greater. Human ingenuity knows no bounds.

I just did a DeepSeek vs Claude comparison and was surprised to see DeepSeek win:

Quote

Breck Yunits

@breckyunits

Dec 26

I put the new DeepSeek v3 model head-to-head versus Claude Sonnet 3.5. The winner will surprise you:

@Leigh_Christie

Makes me think that the runaway AGI "winner takes all" problem might not be the big threat we all thought it was.

Looks like someone will have to make a dash for an Indic LLM in India eventually. Why not you, @mohamedimran_kr?

AGI acceleration enjoyer

@RightTechGadfly

I'm also impressed they went out of their way to support inference on AMD hardware

Amazing. How far do you think they are from making an o1-like model?

Isn’t this horrible news for the semiconductor industry?

Thanks for sharing your early thoughts here. I'm looking forward to the results of the vibe check!

Grand Archon of Antimemetics

poach a few of these boys on o1 status

Efficiency at its finest. Wonder if this marks a shift toward more lean, cost-effective AI development. Thoughts on future implications for smaller players?

of course frontier LLMs don't need large GPU clusters, just look at me, all i consume is a pizza a day and i don't even hallucinate..

Shalev Lifshitz @NeurIPS

Extremely impressive work from the team. Cost per intelligence is going down down down.

the engineering effort behind this is probably insane

Are the dataset and whatever else would be needed to fully replicate its training available, or only the weights?

ffmpeg of ai

So using this approach on 1 million GPUs is going to get nutty.

VC asks... What have you been doing with all of our money.... Lab says.... Uh...

Incredible efficiency! DeepSeek’s approach could set new standards for cost-effective LLM development.

It's always like that. When you don't have an unlimited budget, you need to focus and squeeze out every little performance gain. Impressive leap forward ngl

They sure did prove a point.

It’s pretty casual too.

Banks (meme/acc)

Is someone just leaking the OpenAI weights to them?

Vincent Valentine (CEO of UnOpen.ai)

Innovative strides in AI are truly inspiring.

Making a virtue of necessity

Imagine Musk making DeepSeek style Grok 3 with his 100k GPUs. He can re-train the whole network every 14 days max.
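
For a sense of scale, here is a back-of-the-envelope sketch of that retraining time in Python. It assumes the 2.8M GPU-hour figure quoted earlier in the thread carries over unchanged to a 100k-GPU cluster. The function name and the `efficiency` parameter are hypothetical, and real clusters do not scale linearly, so treat the result as an optimistic lower bound, not a prediction.

```python
# Assumption: the 2.8M GPU-hour figure from the thread applies as-is to a larger cluster.
DEEPSEEK_V3_GPU_HOURS = 2.8e6
LARGE_CLUSTER_GPUS = 100_000


def ideal_wall_clock_hours(gpu_hours: float, num_gpus: int, efficiency: float = 1.0) -> float:
    """Wall-clock hours for a job, given a scaling efficiency in (0, 1].

    efficiency=1.0 means perfect linear scaling, which is never reached in practice.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return gpu_hours / (num_gpus * efficiency)


perfect = ideal_wall_clock_hours(DEEPSEEK_V3_GPU_HOURS, LARGE_CLUSTER_GPUS)
half = ideal_wall_clock_hours(DEEPSEEK_V3_GPU_HOURS, LARGE_CLUSTER_GPUS, efficiency=0.5)

print(f"perfect scaling: {perfect:.0f} hours (~{perfect / 24:.1f} days)")
print(f"50% scaling efficiency: {half:.0f} hours (~{half / 24:.1f} days)")
```

Under these idealized assumptions the run takes on the order of a day or two, comfortably inside the "14 days max" mentioned above.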

Aditya Kumar Saroj

Engineering >>>>

This method proves whether your design is correct. Then you move to a big cluster.

Bobby van Gilder

In my experience, companies tend to throw hardware at performance problems rather than refactoring or rethinking, reducing functionality. It’s easier, and many “coders” lack a solid foundation in mathematics and have little to no critical/analytical thinking prowess. Being efficient

DeepSeek's team is an ultra-talented cohort of ex-quants. Quants are famous for squeezing out every little performance gain. They just did it again, only in a different domain. High-IQ individuals are just a blessing to the world.

@justin_trugman

The potential impact of this new model on Nvidia is huge. If you can train a model comparable to one requiring 16K GPUs using only 2K GPUs, it reduces the competitive advantage of investing in massive training clusters.

Ilya Chetviorkin

With data distilled from a good model (e.g. GPT-4), it’s a lot easier and more feasible to train a strong model on a budget. This is basic distillation.

Ended the year with a bang

There are 3 components: talent, data, and compute. They have access to pretty much any data they want. They probably don't have as much of an advantage in human talent and compute as America. Maybe they have decent enough talent to be equivalent, and compute will catch up. Data

Coding Accountant

The big question is: Are there any trade-offs?

Yuval Avidani (יובל אבידני)

Sounds like magic when you explain it. So now the race is not training the best LLM but the fastest one. Groq won already no?

@merovingian_man

@Dr_Gingerballs

Shorts on NVDA?

would love to know more about exactly how they did this !

Mighty impressive. Makes American AI labs look bad

Quote

gary

³

@garyfung

Dec 27

Replying to @giffmana

Yet, if it’s more cracked at coding tasks than both GPT-4 and Claude, and with such “cheap” and fast training, that’s actually even more impressive. Literally using a competitor's model as a source of synthetic data to beat them at their own game.

David Scott Patterson

@davidpattersonx

Incredible progress. AGI will be sooner than expected, and cheaper.

Some speculate that it was trained on the output of frontier models. What are your thoughts on that?

ufff what does that say about grok?

I switched to Deepseek as my default model after one day of use. Great performance. Dirt cheap.

Resource efficiency is impressive. OnChain AI similarly optimizes operations with blockchain-based recording. The forefront of cost-effective intelligence applications.

@nyasha_mawungwe

This is quite impressive so far

Wow!

...just the beginning

@officiallywasim

So efficient use of resources, in other words.

Julien Muresianu

Impressive

Very interesting!

@simplehumanhaha

Does this mean they have a much better algorithm for churning through large amounts of training data while keeping compute usage as efficient as possible?

Cc

bearish NVDA

China will soon be an AI super power