How to train a 671B parameter model. Let's talk about the DeepSeek v3 report + some comparisons with what Meta did with Llama 405B
David Watson 🥑

This is the image that has been going around, so you probably know how nuts this is, but some added context is that Llama 3 405B was trained on 16K H100s
Quoting Teortaxes▶️ (@teortaxesTex):
> $5.5M for Sonnet tier it's unsurprising that they're proud of it, but it sure feels like they're rubbing it in. «$100M runs, huh? 30.84M H100-hours on 405B, yeah? Half-witted Western hacks, your silicon is wasted on you, your thoughts wouldn't reduce loss of your own models»
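A quick back-of-the-envelope check on the numbers in that quote. This is a rough sketch, assuming the $2 per H800 GPU-hour rental price that the DeepSeek report uses for its own cost estimate, plus the 2.788M H800 GPU-hours figure from the report. H100 and H800 hours are not the same thing, so treat the ratio as a ballpark only.

```python
# Figures quoted in this thread / the DeepSeek-V3 report.
DEEPSEEK_V3_GPU_HOURS = 2.788e6   # H800 GPU-hours, full training
LLAMA3_405B_GPU_HOURS = 30.84e6   # H100 GPU-hours
PRICE_PER_GPU_HOUR = 2.0          # USD; rental price assumed in the report


def training_cost_usd(gpu_hours, price=PRICE_PER_GPU_HOUR):
    """Rental-cost estimate: GPU-hours times an assumed hourly price."""
    return gpu_hours * price


cost = training_cost_usd(DEEPSEEK_V3_GPU_HOURS)
ratio = LLAMA3_405B_GPU_HOURS / DEEPSEEK_V3_GPU_HOURS

print(f"DeepSeek-V3 estimated cost: ${cost / 1e6:.3f}M")
print(f"Llama 3 405B used ~{ratio:.1f}x the GPU-hours")
```

The estimate only covers the GPU rental for the final run, not research, ablations, or data.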
Arch wise they differ significantly from Meta, which just used a single massive dense transformer. For OSS Mixture of Experts, Mixtral was the first (i think) and DeepSeek popularised it. Multi-Head Latent Attention (MLA) comes from their DeepSeek v2 paper, which basically makes…
tbh if you're new to MLA, the notation is pretty bad imo. You will be better off looking at the diagram above and just looking at the code. This is the sglang (~official) impl, which imo is pretty readable
For the MoE part this time they go with 256 experts + 1 shared. Everything else is the same as DS v2 but with sigmoid routing. For MoE models, if the routing is not balanced between all the experts during training, you might end up defeating the purpose of sparsity in the first…
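To make "sigmoid routing" concrete, here is a minimal sketch in plain Python. The 256 routed experts, top 8, and 1 shared expert come from the thread; the function names, the toy scalar "experts", and the random logits are all hypothetical stand-ins, and renormalising the kept scores is an assumption about the gating details rather than a copy of their code.

```python
import math
import random

NUM_ROUTED = 256  # routed experts (from the thread)
TOP_K = 8         # routed experts picked per token (from the thread)


def route(logits, k=TOP_K):
    """Sigmoid gating: score each expert independently, keep the top k,
    then renormalise the kept scores so the gate weights sum to 1."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}


def moe_output(x, gates, routed_experts, shared_expert):
    """The shared expert always runs; routed experts are gate-weighted."""
    return shared_expert(x) + sum(g * routed_experts[i](x) for i, g in gates.items())


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]  # stand-in for router(x)
gates = route(logits)

# Toy scalar "experts" so the example runs end to end.
routed_experts = [lambda x, s=i: x * (s % 3) for i in range(NUM_ROUTED)]
y = moe_output(1.0, gates, routed_experts, shared_expert=lambda x: x)
print(sorted(gates), round(y, 3))
```

Unlike softmax, each sigmoid score does not depend on the other experts' logits, which is why the kept scores get renormalised after the top-k pick.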
Now for something significantly unexpected: multi token prediction. They have a bunch of "lookahead" single layer modules that take the hidden states of the main model and try to predict future tokens. This then just becomes an additional loss term. imo I'm surprised this works…
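The "additional loss term" idea can be sketched like this. It is a toy under stated assumptions: the function names are hypothetical, each module is represented only by its output probabilities, and `mtp_weight=0.3` is just an illustrative value for the weighting factor.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under one distribution."""
    return -math.log(probs[target])


def mtp_training_loss(main_probs, mtp_probs_per_depth, tokens, mtp_weight=0.3):
    """main_probs[i] predicts tokens[i + 1]; the depth-d lookahead module's
    probs[i] predicts tokens[i + 1 + d]. Positions with no target are skipped."""
    def mean_ce(probs, offset):
        pairs = [(p, tokens[i + offset]) for i, p in enumerate(probs)
                 if i + offset < len(tokens)]
        return sum(cross_entropy(p, t) for p, t in pairs) / len(pairs)

    main_loss = mean_ce(main_probs, 1)
    mtp_losses = [mean_ce(probs, 1 + d)
                  for d, probs in enumerate(mtp_probs_per_depth, start=1)]
    # Average the lookahead losses and add them as a weighted extra term.
    extra = mtp_weight * sum(mtp_losses) / len(mtp_losses) if mtp_losses else 0.0
    return main_loss + extra


tokens = [0, 1, 2, 0]
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in tokens]  # toy 3-token vocabulary
loss = mtp_training_loss(uniform, [uniform], tokens)
print(round(loss, 4))
```

With no lookahead modules the function falls back to the plain next-token loss, which matches the fact that the extra modules only matter during training.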
The next section is infra stuff, which to put it bluntly feels like them flexing. Not gonna attempt to explain the infra stuff here since it's not really my thing, so linking this (and all his other posts)
Quoting main (@main_horse):
just so many casual drops in the paper "oh, by the way: we don't need TP, because our SOTA pipelining scheme permits perfect compute-comm overlap with EP, by manually managing SM allocation && autotuning message sizes, unlike all NCCL users."
Training is done in FP8 using their own custom framework, with the MoE gate and attention done in bf16 (as Noam Shazeer intended)
For MoE serving they duplicate the experts which seem to always be routed to. During serving they can dynamically detect experts which are useless and adjust accordingly
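A toy version of that idea: count how often each expert gets hit, duplicate the busiest ones, and flag the ones that never show up. The function names and the routing trace are hypothetical; this is a sketch of the bookkeeping, not their serving code.

```python
from collections import Counter


def pick_redundant_experts(routed_ids, num_redundant):
    """Return the most frequently routed experts, busiest first."""
    load = Counter(routed_ids)
    return [expert for expert, _ in load.most_common(num_redundant)]


def unused_experts(routed_ids, num_experts):
    """Experts that never appeared in the observed routing decisions."""
    seen = set(routed_ids)
    return [e for e in range(num_experts) if e not in seen]


# Hypothetical trace of expert ids chosen by the router over recent tokens.
observed = [3, 7, 3, 1, 3, 7, 2, 3, 7, 5]
hot = pick_redundant_experts(observed, num_redundant=2)
cold = unused_experts(observed, num_experts=8)
print("duplicate:", hot, "idle:", cold)
```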
Now they go on to "Suggestions on Hardware Design" lol.
- Asking for better communication without SM use
- Improve existing FP8 GEMM
- Support quantization better
DeepSeek used 14.8T pretraining tokens, Meta used slightly more at 15.6T. Arch wise its c AdamW and initialization. Each token is routed to 8 of 256 experts + 1 shared expert, and multi token prediction covers only 1 extra token beyond the next one (ie 2 at a time). 671B total, 37B active
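To see what "37B active" buys you, here is a rough compute comparison. It leans on the common rule of thumb of about 6 FLOPs per active parameter per token, which is an assumption on my side (it ignores attention cost and hardware efficiency), so the output is an order-of-magnitude sketch only.

```python
def approx_training_flops(active_params, tokens):
    """Rule-of-thumb estimate: ~6 FLOPs per active parameter per token."""
    return 6 * active_params * tokens


deepseek = approx_training_flops(37e9, 14.8e12)   # 37B active, 14.8T tokens
llama = approx_training_flops(405e9, 15.6e12)     # dense 405B, 15.6T tokens
active_fraction = 37 / 671                        # share of params used per token

print(f"DeepSeek-V3 : {deepseek:.2e} FLOPs")
print(f"Llama 3 405B: {llama:.2e} FLOPs ({llama / deepseek:.1f}x)")
print(f"Active fraction per token: {active_fraction:.1%}")
```

The compute ratio from this estimate lands in the same ballpark as the ratio of reported GPU-hours (30.84M vs 2.788M).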
They do a multi token prediction ablation and it clearly works. Hopefully in 2025 we get more research into what the best approach to MTP is. Their load balancing strategy is also ablated and seems to be even better with scale, which makes intuitive sense
They also investigate how this affects expert specialisation. Specifically, this graph is from batch wise load balancing, which is more flexible than sequence level. The interesting thing is that this has some implications at inference time. Imagine someone spams their API with…
Post training now. They FT on data from R1 (**NON LITE**) but say that it suffers from "overthinking, poor formatting, and excessive length". They have 2 types of data: 1) Standard synthetic data 2) A system prompt that asks for o1 style verification with the R1 style response as the…
> After hundreds of RL steps, the intermediate RL model learns to incorporate R1 patterns, thereby enhancing overall performance strategically.
They have 2 types of RL rewards: verifiers (code, math) and a standard model based RM. Importantly, the model based RM is trained CoT style. GRPO from DeepSeekMath is used here
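The core trick in GRPO is that you sample a group of responses per prompt and score each one relative to the rest of its group, so no separate value model is needed. A minimal sketch of that normalisation step, assuming population standard deviation and a small `eps` for stability (both my choices, and the function name is hypothetical):

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: centre each reward on the group mean
    and scale by the group's standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt, scored by a verifier (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)
print([round(a, 3) for a in advantages])
```

If every response in a group gets the same reward, all advantages come out as zero, so that prompt contributes no learning signal.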
They now evaluate their post trained model and well, they do pretty damned well. It's something like GPT-4o < Sonnet ~= DeepSeek
I'm skipping most of the benchmark stuff for obv reasons, but 1 interesting thing here is they eval its performance as an LLM judge
Now, some fun stuff on R1. Distilling on R1 data leads to better perf but higher response length. This part makes intuitive sense. For the multi token stuff, they say this is significantly faster to serve and the acceptance rate is 85-90%, which is way higher than i expected
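Why the acceptance rate matters for serving: if the extra predicted token is used as a draft (speculative decoding style), every decoding step emits one guaranteed token plus the drafted one whenever it is accepted. A tiny sketch, assuming exactly one drafted token per step and ignoring the cost of running the extra module:

```python
def expected_tokens_per_step(acceptance_rate):
    """One token always comes from the main head; the drafted second
    token is kept with probability `acceptance_rate`."""
    return 1.0 + acceptance_rate


low = expected_tokens_per_step(0.85)
high = expected_tokens_per_step(0.90)
print(f"~{low:.2f} to {high:.2f} tokens per decoding step")
```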
ok. Some thoughts: 1) If you haven't woken up to how far talent can get you, just read this paper. Have you ever seen a paper that literally has a section with SUGGESTIONS TO CHIP MANUFACTURERS? 2) No clue how anyone is gonna serve this but imo this is a profitable research…
.".. DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training ..." 😁 Writing that sentence must have been a lot of fun 😁🙃
