This is the image that has been going around, so you probably know how nuts this is, but some added context: Llama 3 405B was trained on 16K H100s
For the MoE part, this time they go with 256 experts + 1 shared. Everything else is the same as DeepSeek v2, but with sigmoid routing
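To make the routing idea concrete, here is a minimal sketch of sigmoid routing with one always-on shared expert. The top-k selection, the renormalisation over the selected experts, and all function names are assumptions for illustration; the thread only states "256 experts + 1 shared" and "sigmoid routing". The toy uses 4 routed experts and scalar activations to stay readable.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def route_token(logits, top_k):
    """Pick top_k routed experts for one token from sigmoid affinities.

    logits: one raw router score per routed expert.
    Returns {expert_index: gate_weight}; the weights sum to 1.
    """
    scores = [sigmoid(z) for z in logits]
    # Keep the top_k experts with the highest affinity (ties keep index order).
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Sigmoid scores are independent per expert, so renormalise over the chosen
    # ones (an assumption, not stated in the thread).
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def moe_output(x, routed_experts, shared_expert, logits, top_k):
    """Shared expert always runs; routed experts are added with their gates."""
    gates = route_token(logits, top_k)
    out = shared_expert(x)
    for i, g in gates.items():
        out += g * routed_experts[i](x)
    return out

# Toy setup: 4 routed experts (hypothetical scalings) plus 1 shared expert.
routed = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: 0.5 * x
logits = [0.0, 2.0, -1.0, 2.0]

gates = route_token(logits, top_k=2)
y = moe_output(1.0, routed, shared, logits, top_k=2)
print(gates, y)
```

Unlike a softmax over all experts, each sigmoid score here does not depend on the other experts' logits, which is why the gates are normalised only after selection in this sketch.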
For MoE models, if the routing is not balanced across all the experts during training, you might end up defeating the purpose of sparsity in the first
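One rough way to see what "not balanced" means is to count how many token slots each expert receives and compare the busiest expert with a uniform split. This is only an illustrative metric with hypothetical names and toy assignments, not something taken from the paper.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Fraction of routed token slots handled by each expert.

    assignments: one tuple of expert ids per token (the experts it was routed to).
    """
    counts = Counter(e for token in assignments for e in token)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]

def max_load_ratio(load):
    """Busiest expert's load relative to a uniform split (1.0 means balanced)."""
    return max(load) * len(load)

# Toy top-2 routing over 4 experts.
balanced = [(0, 1), (2, 3), (0, 2), (1, 3)]
collapsed = [(0, 1), (0, 1), (0, 1), (0, 2)]  # expert 3 never gets used

balanced_ratio = max_load_ratio(expert_load(balanced, 4))
collapsed_ratio = max_load_ratio(expert_load(collapsed, 4))
print(balanced_ratio, collapsed_ratio)
```

In the collapsed case one expert does twice its fair share while another sits idle, so part of the parameter budget is effectively wasted.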
Now for something significantly unexpected: multi-token prediction. They have a bunch of "lookahead" single-layer modules that take the hidden states of the main model and try to predict future tokens. This then just becomes an additional loss term
imo I'm surprised this works
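A minimal sketch of how the lookahead predictions could enter the objective as an extra loss term. The averaging over lookahead modules and the `mtp_weight` factor are assumptions for illustration; the thread only says the predictions become an additional loss term.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under a distribution."""
    return -math.log(probs[target])

def mtp_loss(main_probs, lookahead_probs, targets, mtp_weight):
    """Next-token loss plus a weighted average of lookahead losses.

    main_probs:      distribution over the vocab for the next token.
    lookahead_probs: one distribution per lookahead module; module d is scored
                     against the token d + 2 positions ahead.
    targets:         true token ids at offsets 1, 2, 3, ...
    mtp_weight:      hypothetical scaling factor for the extra term.
    """
    main = cross_entropy(main_probs, targets[0])
    extra = [cross_entropy(p, t) for p, t in zip(lookahead_probs, targets[1:])]
    aux = sum(extra) / len(extra) if extra else 0.0
    return main + mtp_weight * aux

# Toy vocab of 3 tokens, one lookahead module.
main_probs = [0.5, 0.25, 0.25]
lookahead_probs = [[0.25, 0.5, 0.25]]
targets = [0, 1]

loss = mtp_loss(main_probs, lookahead_probs, targets, mtp_weight=0.5)
print(loss)
```

Setting `mtp_weight` to 0 recovers the ordinary next-token loss, so the extra modules only shape training and can be ignored when not needed.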
The next section is infra stuff which, to put it bluntly, feels like them flexing
Not gonna attempt to explain the infra stuff here since it's not really my thing, so linking this (and all his other posts)
They also investigate how this affects expert specialisation. Specifically, this graph is from batch-wise load balancing, which is more flexible than sequence-level balancing.
The interesting thing is that this has some implications at inference time. Imagine someone spams their API with
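To illustrate why balancing over the batch is looser than balancing within every sequence, here is a toy comparison. The routing pattern, the top-1 simplification, and the imbalance metric are all made up for illustration: each sequence leans heavily on one expert (think of a domain-specialised expert), yet the batch as a whole is perfectly even.

```python
from collections import Counter

def imbalance(expert_ids, num_experts):
    """Busiest expert's share relative to a uniform split (1.0 means balanced)."""
    counts = Counter(expert_ids)
    return max(counts.values()) / len(expert_ids) * num_experts

# Top-1 routing for 4 sequences of 4 tokens each, over 4 experts.
batch = [
    [0, 0, 0, 1],  # sequence that mostly hits expert 0
    [2, 2, 3, 2],  # mostly expert 2
    [1, 1, 1, 0],  # mostly expert 1
    [3, 3, 2, 3],  # mostly expert 3
]

per_sequence = [imbalance(seq, 4) for seq in batch]
whole_batch = imbalance([e for seq in batch for e in seq], 4)
print(per_sequence, whole_batch)
```

A sequence-level constraint would penalise every one of these sequences, while a batch-level constraint is already satisfied, which leaves room for experts to specialise by domain.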
Absolutely insane that they TRAINED DeepSeek v3 using compute within an OOM of what OpenAI used to test o3 on ARC-AGI
Instructor for PHP now supports Deepseek out of the box
One of the densest papers I've read, and it rocks. But hosting MoE isn't something approachable for enthusiasts
"... DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training ..."
Writing that sentence must have been a lot of fun 
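For a feel of what 2.788M GPU hours means, a quick back-of-the-envelope conversion. The cluster sizes and the rental rate below are hypothetical inputs chosen for illustration; only the GPU-hour figure comes from the quote.

```python
GPU_HOURS = 2.788e6  # H800 GPU hours, from the quoted sentence

def wall_clock_days(gpu_hours, num_gpus):
    """Days of wall-clock time if the work is spread evenly over num_gpus."""
    return gpu_hours / num_gpus / 24

def rental_cost(gpu_hours, usd_per_gpu_hour):
    """Total cost at a flat rental rate."""
    return gpu_hours * usd_per_gpu_hour

# Hypothetical cluster of 2,048 GPUs and a hypothetical rate of $2 per GPU hour.
days = wall_clock_days(GPU_HOURS, 2048)
cost = rental_cost(GPU_HOURS, 2.0)

# Same GPU-hour budget on a hypothetical 16,384-GPU cluster. This only divides
# hours by GPU count and ignores any H800 vs H100 hardware differences.
days_16k = wall_clock_days(GPU_HOURS, 16384)

print(f"{days:.1f} days, ${cost:,.0f}, {days_16k:.1f} days on 16,384 GPUs")
```

Under those assumptions the run is roughly two months on the smaller cluster, and the same budget would be spent in about a week on a 16K-GPU cluster.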

Deepseek is indeed a crack team… and they raised 0, funding only from quant trading profits
Insane