How to train a 671B parameter model. Let's talk about the DeepSeek v3 report + some comparisons with what Meta did with Llama 405B

8:26 AM · Dec 26, 2024

@nrehiew_

Dec 26

This is the image that has been going around, so you probably know how nuts this is, but some added context: Llama 3 405B was trained on 16K H100s

Quote

Teortaxes

@teortaxesTex

Dec 26

> $5.5M for Sonnet tier it's unsurprising that they're proud of it, but it sure feels like they're rubbing it in. «$100M runs, huh? 30.84M H100-hours on 405B, yeah? Half-witted Western hacks, your silicon is wasted on you, your thoughts wouldn't reduce loss of your own models»
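A quick sanity check of the numbers in that quote. This is a rough sketch, not the paper's accounting: the 2.788M H800-hour figure is the one quoted further down in the replies, and the $2 per GPU-hour rental price is an assumption used here to turn hours into dollars. H800-hours and H100-hours are also not strictly comparable, so treat the ratio as ballpark only.

```python
# Figures quoted in this thread.
DEEPSEEK_V3_H800_HOURS = 2.788e6  # "2.788M H800 GPU hours" (quoted in the replies)
LLAMA_405B_H100_HOURS = 30.84e6   # "30.84M H100-hours on 405B"

# Assumption: a flat rental price of $2 per GPU-hour.
ASSUMED_USD_PER_GPU_HOUR = 2.0


def training_cost_usd(gpu_hours, usd_per_gpu_hour=ASSUMED_USD_PER_GPU_HOUR):
    """Convert GPU-hours into dollars at a flat hourly price."""
    return gpu_hours * usd_per_gpu_hour


deepseek_cost = training_cost_usd(DEEPSEEK_V3_H800_HOURS)
gpu_hour_ratio = LLAMA_405B_H100_HOURS / DEEPSEEK_V3_H800_HOURS

print(f"DeepSeek v3 at the assumed price: ${deepseek_cost / 1e6:.2f}M")
print(f"Llama 405B used about {gpu_hour_ratio:.1f}x the GPU-hours")
```

At the assumed price this lands close to the "$5.5M" headline, and the GPU-hour gap is roughly an order of magnitude.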

@nrehiew_

Dec 26

Arch-wise they differ significantly from Meta, which just used a single massive dense transformer. For OSS Mixture of Experts, Mixtral was the first (I think) and DeepSeek popularised it. Multi-Head Latent Attention (MLA) comes from their DeepSeek v2 paper which basically makes

@nrehiew_

Dec 26

tbh if you're new to MLA, the notation is pretty bad imo. You will be better off looking at the diagram above and just looking at the code. This is the sglang (~official) impl which imo is pretty readable

@nrehiew_

Dec 26

For the MoE part this time they go with 256 experts + 1 shared. Everything else is the same as DS v2 but with sigmoid routing. For MoE models, if the routing is not balanced between all the experts during training, you might end up defeating the purpose of sparsity in the first
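To make the routing part concrete, here is a minimal sketch of sigmoid top-k routing with a load counter. The 256 routed experts and top-8 selection come from the thread; renormalizing the chosen gate weights to sum to 1 is an assumption, and the function names are made up for illustration. The shared expert is not routed at all (every token goes through it), so it does not appear in the router.

```python
import math
import random

NUM_ROUTED_EXPERTS = 256  # from the thread
TOP_K = 8                 # "256 experts routed to 8" (from later in the thread)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route_token(router_logits, top_k=TOP_K):
    """Return {expert_index: gate_weight} for one token.

    Each expert gets an independent sigmoid score (no softmax across experts).
    Assumption: the weights of the chosen experts are renormalized to sum to 1.
    """
    scores = [sigmoid(z) for z in router_logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


def expert_load(all_token_logits):
    """Count how many tokens each routed expert receives."""
    load = [0] * NUM_ROUTED_EXPERTS
    for logits in all_token_logits:
        for expert in route_token(logits):
            load[expert] += 1
    return load


rng = random.Random(0)
# Hypothetical router logits for 512 tokens.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)] for _ in range(512)]
load = expert_load(tokens)
gates = route_token(tokens[0])

print("experts per token:", len(gates))
print("max load / mean load:", max(load) / (sum(load) / len(load)))
```

The max/mean load ratio is the thing you want to keep near 1 during training: if a handful of experts soak up most tokens, the rest are dead weight.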

@nrehiew_

Dec 26

Now for something significantly unexpected: multi-token prediction. They have a bunch of "lookahead" single-layer modules that take the hidden states of the main model and try to predict future tokens. This then just becomes an additional loss term. imo I'm surprised this works
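A toy sketch of what "an additional loss term" means here, assuming a single lookahead module that predicts the token two positions ahead. The weighting factor `mtp_weight` and all function names are hypothetical; the real modules also consume hidden states, which this sketch skips by taking logits directly.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one softmax prediction against an integer target."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def mean_ce(logit_rows, targets):
    return sum(cross_entropy(l, t) for l, t in zip(logit_rows, targets)) / len(targets)


def total_loss(main_logits, lookahead_logits, tokens, mtp_weight=0.3):
    """Next-token loss plus a weighted lookahead loss.

    main_logits[i]      : main model's prediction for tokens[i + 1]
    lookahead_logits[i] : lookahead module's prediction for tokens[i + 2]
    mtp_weight          : hypothetical weight on the extra loss term
    """
    n = len(tokens)
    main = mean_ce(main_logits[: n - 1], tokens[1:])
    lookahead = mean_ce(lookahead_logits[: n - 2], tokens[2:])
    return main + mtp_weight * lookahead, main, lookahead


# Toy example: vocabulary of 3, uniform (all-zero) logits everywhere.
tokens = [0, 1, 2, 1]
uniform = [[0.0, 0.0, 0.0] for _ in tokens]
loss, main, lookahead = total_loss(uniform, uniform, tokens)
print(loss, main, lookahead)
```

With uniform logits both terms equal ln(3), so the total is just (1 + mtp_weight) * ln(3); the point is only that the lookahead loss is added on top of the usual next-token loss.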

@nrehiew_

Dec 26

The next section is infra stuff which, to put it bluntly, feels like them flexing. Not gonna attempt to explain the infra stuff here since it's not really my thing, so linking this (and all his other posts)

Quote

main

@main_horse

Dec 26

just so many casual drops in the paper "oh, by the way: we don't need TP, because our SOTA pipelining scheme permits perfect compute-comm overlap with EP, by manually managing SM allocation && autotuning message sizes, unlike all NCCL users."

@nrehiew_

Dec 26

Training is done in FP8 using their own custom framework with MOE gate and attention done in bf16 (as Noam Shazeer intended)

@nrehiew_

Dec 26

For MoE serving they duplicate the experts which seem to always be routed to. During serving they can dynamically detect experts which are useless and adjust accordingly
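The duplication idea can be sketched like this: look at observed per-expert load, give the hottest experts an extra replica, and split their traffic. This is an illustrative sketch only; the function names, the load numbers, and the "one extra replica each" policy are all assumptions, not the deployed scheme.

```python
def plan_redundant_experts(load_per_expert, num_redundant):
    """Give one extra replica to each of the most heavily loaded experts.

    Returns {expert_index: replica_count}.
    """
    hottest = sorted(
        range(len(load_per_expert)),
        key=lambda i: load_per_expert[i],
        reverse=True,
    )[:num_redundant]
    replicas = {i: 1 for i in range(len(load_per_expert))}
    for i in hottest:
        replicas[i] += 1
    return replicas


def per_replica_load(load_per_expert, replicas):
    """Load seen by each replica if traffic is split evenly across replicas."""
    return [load / replicas[i] for i, load in enumerate(load_per_expert)]


observed_load = [900, 50, 40, 700, 60, 30, 20, 200]  # hypothetical token counts
replicas = plan_redundant_experts(observed_load, num_redundant=2)
balanced = per_replica_load(observed_load, replicas)

print("worst-case load before:", max(observed_load))
print("worst-case load after: ", max(balanced))
```

Re-running the planner on fresh load statistics every so often is one simple way to "adjust accordingly" as traffic shifts.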

@nrehiew_

Dec 26

Now they go on to "Suggestions on Hardware Design" lol.
- asking for better communication without SM use
- Improve existing FP8 GEMM
- Support Quantization better

@nrehiew_

Dec 26

DeepSeek used 14.8T pretraining tokens, Meta used slightly more at 15.6T. Arch wise its c AdamW and initialization. 256 experts routed to 8 + 1 shared expert, multi token prediction of only the next token (ie 2 at a time). 671B total, 37B active
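Some back-of-the-envelope math on those numbers. The parameter and token counts are from the thread; the 6 * N * D estimate for training FLOPs is a common rule of thumb and an assumption here (it ignores attention cost, MoE routing overhead, and the multi-token prediction modules), so read the result as order-of-magnitude only.

```python
def approx_training_flops(active_params, tokens):
    """Rule-of-thumb estimate: ~6 FLOPs per active parameter per token."""
    return 6 * active_params * tokens


# From the thread.
deepseek_total, deepseek_active, deepseek_tokens = 671e9, 37e9, 14.8e12
llama_params, llama_tokens = 405e9, 15.6e12  # dense: all params are active

active_fraction = deepseek_active / deepseek_total
deepseek_flops = approx_training_flops(deepseek_active, deepseek_tokens)
llama_flops = approx_training_flops(llama_params, llama_tokens)
flops_ratio = llama_flops / deepseek_flops

print(f"active fraction per token: {active_fraction:.1%}")
print(f"DeepSeek v3 ~{deepseek_flops:.2e} FLOPs, Llama 405B ~{llama_flops:.2e} FLOPs")
print(f"Llama / DeepSeek: ~{flops_ratio:.1f}x")
```

Under this estimate, only about 5.5% of the weights touch any given token, which is where most of the compute gap against a dense 405B comes from.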

@nrehiew_

Dec 26

Learning rate comparisons. ty claude for the graphs

@nrehiew_

Dec 26

Presented without comment.

@nrehiew_

Dec 26

Results look good (among open models at least). This is base models only

lol

They do a multi-token prediction ablation and it clearly works. Hopefully in 2025 we get more research into what the best approach to MTP is. Their load balancing strategy is also ablated and seems to be even better with scale, which makes intuitive sense

@nrehiew_

Dec 26

They also investigate how this affects expert specialisation. Specifically, this graph is from batch wise load balancing which is more flexible than sequence level. The interesting thing is that this has some implications at inference time. Imagine someone spams their API with
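A tiny illustration of why batch-wise balancing is "more flexible" than sequence-level balancing. The routing decisions below are made up: each sequence hits a single expert, which looks terrible per sequence but is perfectly balanced across the whole batch. The `imbalance` metric (max load over ideal uniform load) is just one convenient choice for this sketch.

```python
from collections import Counter


def imbalance(expert_ids, num_experts):
    """Max expert load divided by the ideal uniform load (1.0 = perfectly balanced)."""
    counts = Counter(expert_ids)
    ideal = len(expert_ids) / num_experts
    return max(counts.values()) / ideal


NUM_EXPERTS = 4
# Hypothetical routing: one expert id per token, one list per sequence.
batch = [
    [0, 0, 0, 0],  # e.g. a sequence from one domain that only uses expert 0
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [3, 3, 3, 3],
]

per_sequence = [imbalance(seq, NUM_EXPERTS) for seq in batch]
whole_batch = imbalance([e for seq in batch for e in seq], NUM_EXPERTS)

print("per-sequence imbalance:", per_sequence)
print("batch-wise imbalance:  ", whole_batch)
```

A sequence-level constraint would penalize every one of these sequences, while a batch-level one is satisfied, which leaves experts free to specialise by domain. The flip side is that balance now depends on what else is in the batch.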

@nrehiew_

Dec 26

Post training now. They FT on R1 (**NON LITE**) but say that it suffers from "overthinking, poor formatting, and excessive length". They have 2 types of data: 1) Standard synthetic data 2) A system prompt that asks for o1 style verification with the r1 style response as the

@nrehiew_

Dec 26

> After hundreds of RL steps, the intermediate RL model learns to incorporate R1 patterns, thereby enhancing overall performance strategically.

@nrehiew_

Dec 26

They have 2 types of RL rewards: verifiers (code, math) and a standard model-based RM. Importantly, the model-based RM is trained CoT style. GRPO from DeepSeekMath is used here
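For anyone who hasn't seen GRPO: the core trick is that each sampled answer is scored relative to the other answers for the same prompt, so no separate value model is needed. Below is a minimal sketch of just that advantage step. The reward values are hypothetical, and using the population standard deviation plus a small epsilon is an assumption of this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Assumption: population standard deviation, with eps to avoid division by zero
    when every sample in the group gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifier rewards (1 = passed, 0 = failed) for 4 sampled answers.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat the group average get a positive advantage and are reinforced; the rest are pushed down. The same step works whether the reward comes from a verifier or from the model-based RM.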

@nrehiew_

Dec 26

They now evaluate their post-trained model and, well, they do pretty damn well. It's something like GPT-4o < Sonnet ~= DeepSeek

@nrehiew_

Dec 26

I'm skipping most of the benchmark stuff for obv reasons but 1 interesting thing here is they eval its performance as an LLM judge

@nrehiew_

Dec 26

Now, some fun stuff on R1. Distilling on R1 data leads to better perf but higher response length. This part makes intuitive sense. For the multi-token stuff, they say this is significantly faster to serve and the acceptance rate is 85-90%, which is way higher than I expected
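Why the acceptance rate matters for serving: if the extra predicted token is used as a speculative draft, every accepted draft is a token you got without another forward pass of the main model. A rough sketch of the expected tokens per step, assuming each draft is accepted independently with the same probability and that a draft only counts if all earlier drafts were accepted. It ignores the cost of running the lookahead module itself, so real speedups will be somewhat lower.

```python
def expected_tokens_per_step(acceptance_rate, num_draft_tokens=1):
    """Expected tokens emitted per forward pass of the main model.

    The main model always yields 1 token; draft k only counts if drafts 1..k
    were all accepted (independence is an assumption of this sketch).
    """
    return 1 + sum(acceptance_rate ** k for k in range(1, num_draft_tokens + 1))


for rate in (0.85, 0.90):
    print(f"acceptance {rate:.0%}: ~{expected_tokens_per_step(rate):.2f} tokens per step")
```

With one draft token and 85-90% acceptance, that is roughly 1.85 to 1.9 tokens per step instead of 1.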

@nrehiew_

Dec 26

ok. Some thoughts: 1) If you haven't woken up to how far talent can get you, just read this paper. Have you ever seen a paper that literally has a section with SUGGESTIONS TO CHIP MANUFACTURERS? 2) No clue how anyone is gonna serve this but imo this is a profitable research

thank you for this write-up, it is super helpful

@nrehiew_

Dec 26

thanks! really kind of you

Absolutely insane that they TRAINED DeepSeek v3 using within an OOM of the compute OpenAI used to test o3 on ARC-AGI

appreciate this. posting to

Brilliant explanations. Thanks. Following.

TLDR: they cooked

Thank you for this stellar breakdown

excellent thread. thank you for breaking the article down

great thread!

deepseek going hard

helpful... thanks for this!

excellent read!!! thanks

Instructor for PHP now supports Deepseek out of the box

Quote

Dariusz Debowczyk

@ddebowczyk

Dec 26

now that's efficient

One of the densest papers I've read, it rocks. But hosting MoE isn’t something approachable for enthusiasts

Reinventing the wheel is much underrated

"... DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training ..."

Writing that sentence must have been a lot of fun

Great explanation, thank you so much!

Thank you for the write up

Good content! Followed

It’s bugging ChatGPT out

psyUpasaka - e/moksha

@ganjamarchindia

Dec 26

Is DeepSeek R2 with this base coming?

This is the second bible of state-of-the-art AI, after the Llama 405B paper.

Sorry guys, what is an expert?

Yackadaisical

@Yackadaisical

Dec 26

DeepSeek is indeed a crack team… and they raised 0, funding only from quant trading profits. Insane

A good solid breakdown

Direct PDF link:

DeepSeek-V3/DeepSeek_V3.pdf at main · deepseek-ai/DeepSeek-V3

And you can run it for yourself at $20/hr on CentML platform.
