Sizzling MoE Skewer Analogy

Generated by AI

2026.05

Question: Explain MoE (Mixture of Experts) structure using the BBQ skewer analogy — each skewer is an expert, each piece of meat is a token/logit. Is the simplified forward pass correct?

Answer:

The skewer analogy is solid. Let me extend it precisely and then fix your code.

🍢 Lamb Skewer (羊肉串) MoE Analogy

Each SKEWER  = one Expert FFN          (e.g., 64 skewers total)
Each CHUNK OF MEAT = one token         (one position in the sequence)
The GRILL CHEF = the Router            (decides which skewers to cook each chunk on)
TopK selection = only 8 skewers get the meat, 56 sit idle

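A token (chunk of meat) gets routed to only 8 skewers out of 64. The other 56 experts don’t fire for that token, so the FFN compute per token is that of 8 experts rather than 64. This is why MoE is compute-efficient, in training as well as at inference. All 64 experts still have to sit in memory, because a different token may pick a different set of 8.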

Your Code — Almost Right, One Bug

# Simplified MoE forward
router_logits = x @ router_weight.T        # [B*T, n_routed_experts]
topk_ids = router_logits.topk(k=8).indices # select 8 experts
output = sum(expert[i](x) for i in topk_ids)  # ❌ missing routing weights

The missing piece: routing weights (gating scores). You’re summing expert outputs equally — but each expert should contribute proportionally to its router score. The meat on skewer #3 might be 60% done, skewer #7 only 40% — you weight by that.

Correct Forward Pass

import torch
import torch.nn.functional as F

# Assumed to exist already:
#   x:             [B*T, d_model]       one token per row (one chunk of meat)
#   router_weight: [n_experts, d_model]
#   experts:       list of n_experts FFN modules (the skewers)
top_k = 8

router_logits = x @ router_weight.T              # [B*T, n_experts], score each skewer
router_probs  = F.softmax(router_logits, dim=-1) # normalize scores over all experts

topk_vals, topk_ids = router_probs.topk(k=top_k, dim=-1)
# topk_vals:  [B*T, 8]  how much each selected expert contributes
# topk_ids:   [B*T, 8]  which 8 skewers got the meat

# Re-normalize among the 8 selected so each token's weights sum to 1
topk_vals = topk_vals / topk_vals.sum(dim=-1, keepdim=True)  # [B*T, 8]

# Weighted sum over selected experts
output = torch.zeros_like(x)
for k in range(top_k):
    expert_idx   = topk_ids[:, k]          # which expert for each token  [B*T]
    expert_score = topk_vals[:, k]         # its weight                   [B*T]

    # Naive per-token loop, written for clarity
    # (in a real implementation, tokens are batched per expert for efficiency)
    for b in range(x.shape[0]):
        expert_out = experts[expert_idx[b].item()](x[b])   # FFN forward, [d_model]
        output[b] += expert_score[b] * expert_out          # weighted add
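
The comment in the loop above mentions that real implementations batch tokens per expert. Here is a minimal, self-contained sketch of that idea. The toy sizes, the two-layer GELU FFN, and the name moe_forward are assumptions made for illustration and are not taken from any particular model. Each expert runs once on all tokens routed to it, and the result is scattered back with the routing weight.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only so the example runs quickly.
n_tokens, d_model, d_hidden, n_experts, top_k = 6, 16, 32, 8, 2

# Assumed expert shape: a plain two-layer FFN per skewer.
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
    for _ in range(n_experts)
])
router_weight = torch.randn(n_experts, d_model)
x = torch.randn(n_tokens, d_model)


def moe_forward(x):
    probs = F.softmax(x @ router_weight.T, dim=-1)                # [N, E]
    topk_vals, topk_ids = probs.topk(top_k, dim=-1)               # [N, k]
    topk_vals = topk_vals / topk_vals.sum(dim=-1, keepdim=True)   # renormalize

    output = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        # Rows (tokens) and top-k slots where expert e was selected.
        token_idx, slot_idx = (topk_ids == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue                                              # idle skewer
        expert_out = expert(x[token_idx])                         # one batched call
        weights = topk_vals[token_idx, slot_idx].unsqueeze(-1)    # [n_e, 1]
        output.index_add_(0, token_idx, weights * expert_out)     # scatter back
    return output, topk_ids, topk_vals


with torch.no_grad():
    output, topk_ids, topk_vals = moe_forward(x)
print(output.shape)
```

The result matches the per-token loop up to floating-point error. The difference is that each expert is called at most once per batch instead of once per token.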

The Full Picture in Skewer Terms

Token (chunk of meat)
        │
        ▼
   Router (chef)
   scores all 64 skewers
        │
   softmax → probabilities
        │
   topk(8) → pick 8 skewers
        │
   ┌────┴────────────────────┐
   │  skewer 3  (w=0.35)     │  FFN_3(x) * 0.35
   │  skewer 17 (w=0.28)     │  FFN_17(x) * 0.28
   │  skewer 42 (w=0.15)     │  FFN_42(x) * 0.15
   │  ... 5 more ...         │
   └────────────────────────┘
        │
   weighted sum → output token

For this token, the other 56 skewers stay cold and idle: zero FLOPs. That’s the MoE win.

DeepSeek-V3 / Qwen MoE Reality Check

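Property	Value
Total routed experts	64–256
Activated per token	8 (DeepSeek-V3 uses 8 of 256 routed experts)
Active params per token	~37B of 671B total (DeepSeek-V3)
Router type	linear + softmax (DeepSeek-V3 uses sigmoid scores)
Load balancing	auxiliary loss to prevent expert collapse (DeepSeek-V3 relies mainly on a per-expert routing bias, with only a small auxiliary loss)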
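
As a quick arithmetic check on the DeepSeek-V3 figures in the table, the snippet below compares the fraction of routed experts used per token with the fraction of parameters used per token. The variable names are made up for illustration.

```python
# Figures from the table above (DeepSeek-V3); names are illustrative.
routed_experts_total = 256
routed_experts_active = 8
params_total_b = 671    # billions of parameters
params_active_b = 37    # billions of parameters active per token

expert_fraction = routed_experts_active / routed_experts_total
param_fraction = params_active_b / params_total_b

print(f"routed experts used per token: {expert_fraction:.1%}")
print(f"parameters used per token:     {param_fraction:.1%}")
```

The parameter fraction comes out higher than the expert fraction. That is expected, since the parts of the model that are not routed (attention, embeddings, always-on experts) run for every token.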

Expert collapse is the main training hazard. Without a balancing mechanism such as the aux loss, the router learns to always send meat to the same 3 skewers and the other 61 starve: they receive almost no tokens, so they get almost no gradient and never improve. DeepSeek-V3 also adds a shared expert (a skewer that gets every chunk of meat, always on) on top of the routed ones.
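
To make the balancing idea concrete, here is a minimal sketch of one common auxiliary loss, in the style of the Switch Transformer: n_experts · Σ_i f_i · P_i, where f_i is the share of routing assignments that went to expert i and P_i is the mean router probability for expert i. This is a generic illustration and not DeepSeek-V3's exact scheme. The function name and the toy logits are assumptions.

```python
import torch
import torch.nn.functional as F


def load_balance_loss(router_logits, top_k):
    """Switch-style auxiliary loss: about 1.0 when balanced, larger when skewed."""
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                  # [N, E]
    topk_ids = probs.topk(top_k, dim=-1).indices              # [N, k]
    # f_i: share of all (token, slot) assignments sent to expert i.
    counts = torch.bincount(topk_ids.flatten(), minlength=n_experts).float()
    f = counts / counts.sum()
    # P_i: mean router probability of expert i (gradients flow through this term).
    p = probs.mean(dim=0)
    return n_experts * torch.sum(f * p)


n_tokens, n_experts, top_k = 64, 8, 2
tok = torch.arange(n_tokens)

# Balanced toy router: token t prefers experts t % 8 and (t + 1) % 8.
balanced_logits = torch.zeros(n_tokens, n_experts)
balanced_logits[tok, tok % n_experts] = 5.0
balanced_logits[tok, (tok + 1) % n_experts] = 5.0

# Collapsed toy router: every token prefers experts 0 and 1.
collapsed_logits = torch.zeros(n_tokens, n_experts)
collapsed_logits[:, :2] = 5.0

balanced_loss = load_balance_loss(balanced_logits, top_k)
collapsed_loss = load_balance_loss(collapsed_logits, top_k)
print(balanced_loss.item(), collapsed_loss.item())
```

The collapsed router gets a clearly higher loss than the balanced one. Adding this term to the training loss with a small coefficient pushes the router toward spreading tokens across experts.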

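DeepSeek-V3 Formula

output = FFN_shared(x) + Σ_{i ∈ topk} g_i · FFN_i(x)

where s_i = sigmoid(router_logits)[i]  and  g_i = s_i / Σ_{j ∈ topk} s_j  (renormalized among the selected k only)

DeepSeek-V3 scores experts with a sigmoid rather than a softmax. The top-k selection adds a per-expert bias used for load balancing, but g_i is computed from the unbiased scores s_i. The residual connection around the layer is omitted here.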
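
The formula can be sketched as a shared expert that always runs, plus a renormalized top-k mixture of routed experts. In this illustrative version the score function is a parameter, so both a sigmoid and a softmax router can be tried. The sizes, the SiLU FFN, and the names gates and moe_with_shared are assumptions, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_hidden, n_experts, top_k = 16, 32, 8, 2   # hypothetical toy sizes


def make_ffn():
    # Assumed expert shape: a plain two-layer FFN.
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))


shared_expert = make_ffn()                                   # always-on skewer
routed_experts = nn.ModuleList([make_ffn() for _ in range(n_experts)])
router_weight = torch.randn(n_experts, d_model)


def gates(x, score_fn):
    scores = score_fn(x @ router_weight.T)                   # [N, E]
    topk_scores, topk_ids = scores.topk(top_k, dim=-1)       # [N, k]
    g = topk_scores / topk_scores.sum(dim=-1, keepdim=True)  # renormalize over selected k
    return g, topk_ids


def moe_with_shared(x, score_fn=torch.sigmoid):
    g, topk_ids = gates(x, score_fn)
    out = shared_expert(x)                                   # FFN_shared(x)
    for e, expert in enumerate(routed_experts):
        token_idx, slot_idx = (topk_ids == e).nonzero(as_tuple=True)
        if token_idx.numel():
            weighted = g[token_idx, slot_idx].unsqueeze(-1) * expert(x[token_idx])
            out = out.index_add(0, token_idx, weighted)      # + g_i * FFN_i(x)
    return out, g


x = torch.randn(5, d_model)
with torch.no_grad():
    out_sigmoid, g_sigmoid = moe_with_shared(x, torch.sigmoid)
    out_softmax, g_softmax = moe_with_shared(x, lambda s: torch.softmax(s, dim=-1))
print(g_sigmoid.sum(dim=-1), g_softmax.sum(dim=-1))
```

With either score function the gates of each token sum to 1 after renormalization, so the routed part is a convex combination of the selected experts' outputs, added on top of the shared expert.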

Your intuition was right — it’s a weighted sum. You just needed the g_i weights in there.
