Muon: Structure-Aware Matrix Optimizer | Generated by AI


“Muon” in Kimi refers to an advanced optimizer: a structure-aware algorithm that orthogonalizes weight-matrix updates, used in training the Kimi K2 language model.


What Is the Muon Optimizer?

  1. Matrix-Aware Optimization. Unlike AdamW or SGD, Muon treats weight matrices as geometric entities rather than collections of independent scalar parameters. It applies Newton–Schulz iterations to orthogonalize the momentum-averaged gradient, yielding well-conditioned, balanced updates that respect both the row and column structure of the matrix (Medium, kellerjordan.github.io).

  2. Orthogonalization via Newton–Schulz. Rather than computing an expensive Singular Value Decomposition (SVD), Muon uses a fast iterative method (Newton–Schulz) to approximate the nearest semi-orthogonal matrix to the momentum-averaged gradient. This keeps the update within a spectral-norm constraint and normalizes its singular values, so learning is spread across all directions rather than dominated by a few large ones; a code sketch follows this list (Medium, kellerjordan.github.io).

  3. Pipeline Adjustment. The standard update flow (Gradient → Momentum → Parameter Update) is replaced by Gradient → Momentum → Newton–Schulz Orthogonalization → Parameter Update. This modification improves training efficiency and stability for 2D parameter matrices (Medium, kellerjordan.github.io).

  4. Efficient in Practice. Despite a small added computational overhead, Muon delivers significant speedups:

    • Set records in the NanoGPT speedrun, improving training speed by ~35% (kellerjordan.github.io).
    • Scales well in large language model training when combined with weight decay and per-parameter RMS adjustments (arXiv).
  5. Strong Theoretical Foundations. Recent research supports Muon's convergence properties, the benefits of weight decay, and guidance on batch sizing, with tighter theoretical bounds and demonstrated efficiency across practical scenarios (arXiv).
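
As a concrete illustration of steps 2–4 above, here is a minimal PyTorch-style sketch of a Muon-like update. The quintic Newton–Schulz coefficients and the RMS rescaling constant follow commonly published open-source implementations; the exact recipe used for Kimi K2 may differ, so treat the function names, hyperparameters, and weight-decay handling as illustrative assumptions rather than the definitive algorithm.

```python
import torch


def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix with a quintic Newton-Schulz iteration.

    The coefficients below are the widely circulated values from public Muon
    implementations; the loop pushes the singular values of `g` toward 1
    without computing an explicit SVD.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)          # Frobenius normalization keeps the spectral norm <= 1
    transposed = x.size(0) > x.size(1)
    if transposed:                    # work in the wide orientation so x @ x.T is the smaller Gram matrix
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x


def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95, weight_decay: float = 0.0) -> None:
    """One illustrative Muon-style update: momentum -> orthogonalization -> scaled step."""
    momentum_buf.mul_(beta).add_(grad)                    # momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)    # structure-aware orthogonalization
    # Rescale so the update magnitude is comparable to AdamW's; the 0.2 * sqrt(max dim)
    # heuristic appears in published Muon scaling work, but the constant is an assumption here.
    update = update * (0.2 * max(param.size(0), param.size(1)) ** 0.5)
    if weight_decay:
        param.mul_(1.0 - lr * weight_decay)               # decoupled weight decay
    param.add_(update, alpha=-lr)
```

For a linear layer's weight of shape (out_features, in_features), muon_step would be called once per training step with that layer's gradient and a persistent momentum buffer of the same shape; non-matrix parameters such as embeddings, gains, and biases are typically still handled by AdamW.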


Role of Muon in Kimi K2

Kimi K2 is Moonshot AI’s Mixture-of-Experts (MoE) model, with 1 trillion total parameters (32B activated), optimized for agentic capabilities. It achieved exceptional performance on knowledge, reasoning, and coding benchmarks thanks in part to training with Muon (and a variant termed MuonClip) (Hugging Face, Poe).
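
MuonClip is reported to add an attention-logit clipping step ("QK-clip") on top of Muon to keep attention logits from growing unstably during large-scale training. The sketch below only illustrates that idea, assuming a per-head maximum attention logit is tracked during the forward pass; the threshold value and the even square-root split of the rescaling between the query and key projections are assumptions, not confirmed Kimi K2 settings.

```python
import torch


def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor, max_logit: float, tau: float = 100.0) -> None:
    """Illustrative QK-clip: if the largest attention logit observed for a head exceeds
    the threshold tau, rescale the query/key projection weights in place so that
    future logits stay bounded. Splitting the factor evenly via sqrt is an
    illustrative choice, not a confirmed MuonClip detail."""
    if max_logit > tau:
        gamma = tau / max_logit        # shrink factor for the q.k dot product
        w_q.mul_(gamma ** 0.5)         # apply half the shrinkage to the query projection
        w_k.mul_(gamma ** 0.5)         # and half to the key projection
```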


In Summary

Muon replaces element-wise gradient steps with structure-aware matrix updates: the momentum-averaged gradient of each 2D weight matrix is orthogonalized via Newton–Schulz iterations before being applied, which makes training faster and more stable than with AdamW alone. In its MuonClip variant, it is part of the training recipe behind Moonshot AI's Kimi K2.

Let me know if you’d like a deeper dive—perhaps how Muon compares to AdamW in training trajectories, how MuonClip tweaks the algorithm for MoEs, or how the Newton–Schulz method works in code!

