Run DeepSeek-V3 on H200 or MI300X | Generated by AI
Question: What if we have H200 or AMD MI300X — can we run the full DeepSeek-V3/V4 (671B) model on those?
Answer:
Running DeepSeek-V3/V4 on H200 or AMD MI300X
Let’s do the math properly first, then the setup.
Memory Math
DeepSeek-V3 (671B parameters, the latest open-weights model)
| Precision | VRAM needed | Notes |
|---|---|---|
| FP8 | ~380 GB | Recommended for inference |
| BF16 | ~1.34 TB | Way too much for single node |
| INT4 (GPTQ/AWQ) | ~190 GB | Lossy but viable |
Your hardware:
| GPU | VRAM | Notes |
|---|---|---|
| H200 SXM (single) | 141 GB HBM3e | Not enough solo |
| H200 SXM (2x) | 282 GB | Tight for FP8, doable with offload |
| AMD MI300X (single) | 192 GB HBM3 | Best single-GPU option — barely fits FP8 |
| MI300X (2x) | 384 GB | Comfortable for FP8, possible BF16 |
The MI300X 192GB is actually the sweet spot for running DeepSeek-V3 solo in FP8.
Option 1: AMD MI300X — DeepSeek-V3 Full Model (FP8)
DeepSeek officially supports this. Their repo targets exactly this config.
Setup (ROCm path)
# On your AMD Dev Cloud instance ($2/hr — ~$3-4/hr for MI300X nodes)
# Check what you have
rocm-smi
rocminfo | grep "gfx" # Should be gfx942 for MI300X
# Clone DeepSeek-V3 inference repo
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
# Install deps (ROCm-compatible)
pip install torch --index-url https://download.pytorch.org/whl/rocm6.1
pip install transformers accelerate safetensors
pip install flashinfer # may need ROCm build
# Download model (FP8 checkpoint ~380GB)
# From HuggingFace — needs HF_TOKEN
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='deepseek-ai/DeepSeek-V3',
local_dir='/mnt/data/deepseek-v3',
ignore_patterns=['*.pt'], # prefer safetensors
)
"
Run inference (single MI300X, FP8)
# Using their official inference script
cd DeepSeek-V3/inference
python generate.py \
--ckpt-path /mnt/data/deepseek-v3 \
--config configs/config_671B.json \
--interactive \
--max-new-tokens 512 \
--temperature 0.7
vLLM path (recommended for serving)
# vLLM has first-class MI300X support as of v0.4+
pip install vllm
vllm serve deepseek-ai/DeepSeek-V3 \
--dtype fp8 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92 \
--tensor-parallel-size 1 # single MI300X
Option 2: H200 — Multi-GPU Setup
Single H200 (141GB) can’t fit V3 in FP8 (needs ~380GB). You need tensor parallelism.
2x H200 = 282GB — tight but works
# Using vLLM with tensor parallelism
vllm serve deepseek-ai/DeepSeek-V3 \
--dtype fp8 \
--tensor-parallel-size 2 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90
# Or with their native inference (fp8, tp=2)
python generate.py \
--ckpt-path /mnt/data/deepseek-v3 \
--tp 2 \
--dtype fp8 \
--interactive
4x H200 = 564GB — comfortable, can do BF16 MoE layers
vllm serve deepseek-ai/DeepSeek-V3 \
--dtype bfloat16 \
--tensor-parallel-size 4 \
--max-model-len 32768 # full context now viable
Option 3: SGLang (Best for DeepSeek MLA Architecture)
SGLang has specific optimizations for DeepSeek’s MLA (Multi-head Latent Attention) — the KV cache compression trick that makes V3/V4 memory-efficient. Better than vLLM for this arch.
pip install sglang[all]
# Single MI300X
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--tp 1 \
--dtype fp8 \
--port 30000 \
--mem-fraction-static 0.88 \
--enable-mla # MLA-specific KV cache optimization
# Client
python -c "
import sglang as sgl
@sgl.function
def pipeline(s, prompt):
s += sgl.user(prompt)
s += sgl.assistant(sgl.gen('response', max_new_tokens=512))
sgl.set_default_backend(sgl.RuntimeEndpoint('http://localhost:30000'))
state = pipeline.run(prompt='Explain transformer attention from scratch')
print(state['response'])
"
Architecture Reminder: Why MLA Matters on These GPUs
DeepSeek-V3 uses Multi-head Latent Attention — it compresses KV cache by projecting keys/values into a low-rank latent space:
Standard MHA KV cache: 2 × n_heads × d_head × seq_len × bytes
MLA KV cache: d_c × seq_len × bytes # d_c << n_heads × d_head
For V3: d_c = 512 vs n_heads × d_head = 128 × 128 = 16384 — that’s 32x KV cache reduction. This is why V3 can serve long contexts on MI300X that would OOM on equivalent VRAM with standard MHA.
Recommendation for Your Setup
| Goal | Best Option |
|---|---|
| Quick experiment, single GPU | MI300X + SGLang + FP8 |
| Production serving | 2x MI300X + SGLang TP=2 |
| H200 access | 4x H200 + vLLM TP=4 + FP8 |
| Budget-conscious | MI300X 1x, $2-4/hr, FP8 only |
Your AMD Dev Cloud MI300X is the right machine for this. Single node, single GPU, FP8 — DeepSeek-V3 fits (barely, ~92% utilization) and SGLang’s MLA optimization means you get decent throughput. Start there.