DeepSeek-V4-Pro Architecture and Code Overview | Generated by AI
Here’s the full breakdown:
DeepSeek-V4-Pro Inference Code
Location: ~/projects/deepseek-v4-inference/
Architecture (from config.json)
DeepSeek-V4-Pro is a massive Mixture-of-Experts model:
- 129,280 vocab, 7168 hidden dim, 61 layers
- 384 routed experts, 1 shared expert, 6 activated per token
- 128 attention heads, 512 head dim (with 64-dim RoPE component)
- Low-rank Q projection: q_lora_rank=1536
- Low-rank O projection: o_lora_rank=1024, 16 groups
- Sliding window attention (128 tokens) + KV compression (various ratios per layer)
- Scoring: sqrtsoftplus with route_scale=2.5
- Default quantization: FP8 weights + FP4 experts (UE8M0 scale format)
- YaRN-scaled RoPE for long context (original_seq_len=65536)
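The hyperparameters above can be collected into a small config object to make the derived sizes explicit. This is a minimal sketch, not the repo's ModelArgs: only q_lora_rank, o_lora_rank, route_scale and original_seq_len are names taken from the list above; every other field name is hypothetical, and the derived properties reflect one reading of "512 head dim (with 64-dim RoPE component)" and "16 groups".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class V4ProConfigSketch:
    # Values come from the architecture list above.
    # Field names are hypothetical unless noted in the lead-in.
    vocab_size: int = 129_280
    dim: int = 7168
    n_layers: int = 61
    n_routed_experts: int = 384
    n_shared_experts: int = 1
    n_activated_experts: int = 6
    n_heads: int = 128
    head_dim: int = 512
    rope_head_dim: int = 64
    q_lora_rank: int = 1536
    o_lora_rank: int = 1024
    o_groups: int = 16
    window_size: int = 128
    score_func: str = "sqrtsoftplus"
    route_scale: float = 2.5
    original_seq_len: int = 65_536

    @property
    def nope_head_dim(self) -> int:
        # Assumption: the part of each head that does not receive RoPE.
        return self.head_dim - self.rope_head_dim

    @property
    def heads_per_o_group(self) -> int:
        # Assumption: the 16 O-projection groups split the heads evenly.
        return self.n_heads // self.o_groups


cfg = V4ProConfigSketch()
print(cfg.nope_head_dim, cfg.heads_per_o_group)
```

Under these assumptions each head has 448 non-RoPE dimensions, and each O-projection group covers 8 heads.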
File-by-File
model.py (38K, 827 lines) — the core:
- ModelArgs dataclass: all hyperparameters
- ParallelEmbedding: vocab-sharded embedding with all-reduce
- Linear / ColumnParallelLinear / RowParallelLinear: supports BF16, FP8, FP4 weight formats with per-block scaling
- RMSNorm: standard, stored in fp32 for precision
- precompute_freqs_cis: YaRN-scaled rotary embeddings
- Compressor: learned gated pooling to compress KV cache (ratio 4 or 128). Handles both prefill and incremental decode. Overlapping windows for ratio==4
- Indexer: selects top-k compressed KV positions via learned scoring with its own Compressor (uses Hadamard-rotated FP4 quantization)
- Attention: Multi-head Latent Attention (MLA) with low-rank Q (wq_a -> q_norm -> wq_b), sliding window + compressed KV, grouped low-rank O projection (wo_a -> wo_b), learnable attn_sink bias
- FFNSwiGLU: standard SwiGLU with optional swiglu_limit clamping
- MOE: top-k routing with sqrtsoftplus scoring, shared expert, e_score_correction_bias
- TransformerBlock: layer types controlled by compress_ratios; hash layers (compression ratio 128) use HC attention, others use standard MLA + MoE
- HCAttention: Hash Compress attention, a new mechanism with multi-head compress routing via Sinkhorn normalization
- Transformer: full model with ParallelEmbedding, layers, RMSNorm, lm_head, KV cache management
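To make the MoE routing step concrete, here is a minimal pure-Python sketch of top-k routing for a single token. It rests on several assumptions that the overview does not confirm: that sqrtsoftplus means sqrt(softplus(x)), that e_score_correction_bias affects only which experts are selected (not their weights), and that the selected weights are normalized and then multiplied by route_scale. The function names are hypothetical.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route(logits, top_k=6, route_scale=2.5, bias=None):
    """Pick top_k experts for one token and return (indices, weights).

    Assumed scoring: sqrt(softplus(logit)). The optional bias is used
    only to choose experts; weights come from the unbiased scores.
    """
    scores = [math.sqrt(softplus(x)) for x in logits]
    if bias is None:
        selection = scores
    else:
        selection = [s + b for s, b in zip(scores, bias)]
    order = sorted(range(len(scores)), key=lambda i: selection[i], reverse=True)
    chosen = order[:top_k]
    total = sum(scores[i] for i in chosen)
    weights = [route_scale * scores[i] / total for i in chosen]
    return chosen, weights


logits = [0.0, 1.0, 2.0, 3.0, -1.0, 0.5, 4.0, 2.5]
experts, weights = route(logits, top_k=2)
print(experts, weights)
```

With this normalization the weights of the selected experts always sum to route_scale, regardless of how many experts are active.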
kernel.py (22K, 536 lines) — tilelang JIT kernels:
- act_quant_kernel: block-wise FP8 quantization (block_size=128), optional in-place quant-dequant
- fp4_quant_kernel: block-wise FP4 quantization (block_size=32), power-of-2 scales
- fp8_gemm_kernel: FP8 matmul with per-block A/B scaling, L2 swizzle, 4-stage pipeline
- fp4_gemm_kernel: FP8 activation x FP4 weight GEMM; loads FP4, casts to FP8, applies separate act (128) and weight (32) scales
- sparse_attn_kernel: FlashAttention-style online softmax with index-gathered top-k KV, learnable attn_sink
- hc_split_sinkhorn_kernel: Sinkhorn normalization for HC (Hash Compress) routing; iteratively normalizes row/col of a combination matrix
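The Sinkhorn step can be illustrated without any GPU code. The sketch below is the textbook row/column alternation on a small matrix with strictly positive entries, written in plain Python. It is not the tilelang kernel: how the kernel builds the combination matrix, how many iterations it runs, and whether it exponentiates first are not covered here. The function name and parameters are hypothetical.

```python
def sinkhorn(matrix, n_iters=20, eps=1e-9):
    """Alternately normalize rows and columns of a positive matrix.

    For a square matrix with strictly positive entries this converges
    toward a doubly stochastic matrix (all row and column sums near 1).
    """
    m = [[float(v) for v in row] for row in matrix]
    n_cols = len(m[0])
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        m = [[v / (sum(row) + eps) for v in row] for row in m]
        # Column step: make every column sum to 1.
        col_sums = [sum(row[j] for row in m) + eps for j in range(n_cols)]
        m = [[v / col_sums[j] for j, v in enumerate(row)] for row in m]
    return m


balanced = sinkhorn([[1.0, 2.0], [3.0, 4.0]], n_iters=50)
for row in balanced:
    print([round(v, 4) for v in row])
```

Each step undoes a little of the previous one, which is why the procedure is run for several iterations rather than once.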
generate.py (6.1K, 155 lines) — entry point:
- sample(): Gumbel-max trick (faster than multinomial on GPU, avoids CPU sync)
- generate(): batch generation with left-padding, prefill+decode phases
- Interactive mode: chat loop with /exit and /clear commands
- Batch mode: reads prompts separated by blank lines
- Multi-node via torchrun with NCCL
- Uses encode_messages / parse_message_from_completion_text from an external encoding module (not included: ../encoding/encoding_dsv4.py)
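For readers unfamiliar with the Gumbel-max trick used by sample(), here is a minimal CPU-only sketch in plain Python. It shows the idea only: add independent Gumbel noise to the temperature-scaled logits and take the argmax, which draws from the same distribution as softmax sampling. The real function operates on GPU tensors and may use an equivalent formulation; the name gumbel_max_sample and its parameters are hypothetical.

```python
import math
import random


def gumbel_max_sample(logits, temperature=1.0, rng=random):
    """Sample an index with probability softmax(logits / temperature)."""
    if temperature <= 0:
        # Greedy decoding: plain argmax, no noise.
        return max(range(len(logits)), key=lambda i: logits[i])
    best_index, best_value = 0, -math.inf
    for i, logit in enumerate(logits):
        # rng.random() is in [0, 1); clamp away from 0 so log() is defined.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        value = logit / temperature + gumbel
        if value > best_value:
            best_index, best_value = i, value
    return best_index


rng = random.Random(0)
# Target probabilities are 0.25 and 0.75.
draws = [gumbel_max_sample([math.log(1.0), math.log(3.0)], rng=rng)
         for _ in range(20000)]
print(sum(draws) / len(draws))
```

The printed frequency of index 1 should land close to 0.75, matching the softmax probabilities of the two logits.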
convert.py (6.9K, 168 lines) — weight conversion:
- Converts HuggingFace safetensors to model-parallel shards
- Maps HF weight names to internal names (e.g., self_attn -> attn, mlp -> ffn)
- Handles FP4->FP8 lossless conversion (e2m1fn to e4m3fn) with careful scale handling
- Supports expert_dtype fp8 or fp4
- Shards by expert index for MoE, by dimension for embeddings/projections
- Copies tokenizer files alongside weights
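The FP4->FP8 conversion can be lossless because every E2M1 value (0, 0.5, 1, 1.5, 2, 3, 4, 6 and their negatives) is exactly representable in E4M3. The sketch below checks this by decoding both formats bit by bit in plain Python and building a 16-entry lookup table. It covers element values only; the scale handling mentioned above (FP4 and FP8 use different block sizes) is not modeled, and all function names are hypothetical rather than taken from convert.py.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 value: 1 sign, 2 exponent, 1 mantissa bit, bias 1."""
    sign = -1.0 if code & 0x8 else 1.0
    exp = (code >> 1) & 0x3
    man = code & 0x1
    if exp == 0:
        return sign * man * 0.5  # subnormal: 0 or 0.5
    return sign * (1 + man / 2) * 2.0 ** (exp - 1)


def decode_e4m3fn(code: int) -> float:
    """Decode an 8-bit E4M3FN value: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if code & 0x80 else 1.0
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")  # the only NaN pattern; the format has no infinities
    if exp == 0:
        return sign * (man / 8) * 2.0 ** -6  # subnormal
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)


def build_fp4_to_fp8_table():
    """Map each FP4 code to the FP8 code with the same sign and value."""
    table = []
    for c4 in range(16):
        value = decode_e2m1(c4)
        matches = [c8 for c8 in range(256)
                   if (c8 >> 7) == (c4 >> 3) and decode_e4m3fn(c8) == value]
        table.append(matches[0])  # IndexError here would mean a lossy value
    return table


table = build_fp4_to_fp8_table()
print([hex(c) for c in table])
```

Because the table builds without error, every FP4 code has an exact FP8 counterpart, so a converter of this shape changes the storage format without changing any element value.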
requirements.txt:
- torch>=2.10.0, transformers>=5.0.0, safetensors>=0.7.0
- fast_hadamard_transform (for rotate_activation in Indexer)
- tilelang==0.1.8 (JIT kernel compiler for quantization + attention kernels)
What’s Missing
The ../encoding/ directory with encoding_dsv4.py is not included — it handles chat template encoding/parsing. You’d need that from the full repo for interactive mode to work.
Hardware Requirements
Even with FP8 weights and FP4 experts, the 384 routed experts make this model very large. The default config expects 8-way model parallelism (MP=8). Your RTX 4070 (12 GB) won’t fit even one shard. The MI300X (192 GB HBM3) on AMD Dev Cloud could handle it, likely needing 1-2 cards depending on batch size.