DeepSeek Model Training Time and Efficiency | Generated by AI
To answer your question directly: training a model like DeepSeek-V2 (and its successor V3) takes about two months of continuous compute time.
Because DeepSeek uses a relatively small but highly optimized cluster of GPUs, they don’t need a year-long training cycle. Here is the specific breakdown of how that time is used:
Training Timeline Breakdown
For DeepSeek-V3 (the most recent and larger version), the training was completed in several phases:
- Pre-training: Less than 2 months (specifically around 55–60 days).
- Rate of Progress: On their cluster of 2,048 H800 GPUs, they were able to process 1 trillion tokens every 3.7 days.
- Context Extension: An additional few days to teach the model how to handle long documents (up to 128k tokens).
- Post-training (SFT/RL): A very short period (days) compared to the main pre-training.
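The pre-training number above can be sanity-checked with simple arithmetic. A minimal sketch, assuming the figures DeepSeek published (a 14.8-trillion-token corpus and the 3.7 days per trillion tokens quoted above):

```python
# Back-of-the-envelope check of the pre-training timeline.
# Inputs are DeepSeek's published figures; treat results as approximate.
TOKENS_TRILLIONS = 14.8    # reported pre-training corpus size
DAYS_PER_TRILLION = 3.7    # reported throughput on the cluster below
GPUS = 2048                # H800 GPUs

pretrain_days = TOKENS_TRILLIONS * DAYS_PER_TRILLION
gpu_hours = pretrain_days * 24 * GPUS

print(f"Pre-training: ~{pretrain_days:.0f} days")     # ~55 days
print(f"GPU-hours:    ~{gpu_hours / 1e6:.2f} million")
```

The ~55-day result matches the "55–60 days" range above, and the implied ~2.7M GPU-hours is close to the pre-training figure DeepSeek reported.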
Why it doesn’t take longer
You might wonder why a “smaller” cluster of 2,048 GPUs can finish so quickly when other companies use 20,000+ GPUs.
- Mixture-of-Experts (MoE): The model only "turns on" a small fraction of its parameters for any given token (for V3, 37B activated out of 671B total, roughly 5.5%), so each training step costs far less compute than a dense model of the same size.
- FP8 Training: They used a data format called FP8 (8-bit floating point), which essentially doubles the speed of the hardware compared to the traditional 16-bit format used by older models.
- High Stability: DeepSeek reported that their training run was “remarkably stable” with zero irrecoverable loss spikes. In AI training, “crashes” or “glitches” often force engineers to restart from a week-old backup; DeepSeek avoided these delays entirely.
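The MoE saving in particular is easy to quantify from the published parameter counts. A rough sketch using V3's reported 671B total / 37B activated figures (an illustration of the compute ratio, not a benchmark):

```python
# Why MoE is cheap per token: only the activated parameters
# contribute to the FLOPs of each forward/backward pass.
# Parameter counts are DeepSeek-V3's published figures.
TOTAL_PARAMS_B = 671    # total parameters (billions)
ACTIVE_PARAMS_B = 37    # parameters activated per token

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
# Compute ratio vs. a hypothetical dense model of the same total size:
dense_ratio = TOTAL_PARAMS_B / ACTIVE_PARAMS_B

print(f"Active per token: {active_fraction:.1%}")   # ≈ 5.5%
print(f"Saving vs. equal-size dense: ~{dense_ratio:.0f}x")
```

In other words, per token the model does the arithmetic of a ~37B-parameter network while retaining the capacity of a 671B-parameter one.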
The “500 Million USD” Comparison
To put your $500M estimate into perspective:
- DeepSeek’s Run: ~2 months on 2,048 GPUs ≈ $5.6 million (their own estimate: 2.788M H800 GPU-hours at an assumed $2 per GPU-hour).
- Meta’s Llama 3 (405B): roughly two months on 16,000 GPUs ≈ estimated in the hundreds of millions of dollars.
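The $5.6M figure is not a mystery number; it is just GPU-hours multiplied by a rental price. A minimal sketch of DeepSeek's own calculation (the $2/GPU-hour rate is the assumption they stated):

```python
# How the ~$5.6M estimate arises: total GPU-hours x assumed rental rate.
GPU_HOURS_TOTAL = 2.788e6   # reported H800 GPU-hours for the full V3 run
PRICE_PER_GPU_HOUR = 2.0    # USD; DeepSeek's assumed rental rate

cost_musd = GPU_HOURS_TOTAL * PRICE_PER_GPU_HOUR / 1e6
print(f"Estimated training cost: ${cost_musd:.2f}M")   # ≈ $5.58M
```

Note this covers the compute for the final training run only, not hardware purchase, research experiments, or staff.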
DeepSeek didn’t spend less because they trained for less time; they spent less because their “car” (the model architecture) gets 10x better gas mileage than the competition.