Maximal Update Parametrization for Neural Networks

What is μP (Maximal Update Parametrization)?

μP, short for Maximal Update Parametrization, is a neural network parametrization technique developed by Greg Yang, a mathematician and AI researcher now at xAI (previously at Microsoft Research). Introduced in the 2022 paper “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer,” it prescribes how a model’s initialization and per-layer learning rates should scale with its width, so that training behaves consistently as the model grows.

Purpose

The core goal of μP is to make hyperparameter (HP) tuning stable across model sizes. Under the standard parametrization (SP), hyperparameters such as the learning rate must be retuned every time a model is scaled up, e.g., from millions to billions of parameters, because the size of activations and weight updates drifts with width, so settings that work for a small model destabilize a large one. μP rescales initialization and per-layer learning rates so that every layer keeps updating its features at the maximal stable rate, with updates of the same order of magnitude regardless of scale. This enables μTransfer, a workflow where you tune HPs on a tiny “proxy” model and apply them directly to a massive target model without further adjustment (sketched below).
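A minimal sketch of the μTransfer workflow, assuming the open-source `mup` package (github.com/microsoft/mup); the toy MLP, the widths, and the 3e-3 learning rate are illustrative placeholders, not values from the paper.

```python
import torch.nn as nn
from mup import MuAdam, MuReadout, set_base_shapes


def make_mlp(width: int, d_in: int = 32, d_out: int = 10) -> nn.Module:
    """Toy MLP whose only scaling knob is its hidden width."""
    return nn.Sequential(
        nn.Linear(d_in, width),
        nn.ReLU(),
        nn.Linear(width, width),
        nn.ReLU(),
        MuReadout(width, d_out),  # μP-aware output layer from the mup package
    )


# 1. Pair a base model with a slightly wider "delta" model so mup can infer
#    which dimensions scale with width.
base, delta = make_mlp(width=64), make_mlp(width=128)

# 2. Build the wide target model and attach the inferred base shapes to it.
target = make_mlp(width=4096)
set_base_shapes(target, base, delta=delta)

# 3. Train with a μP-aware optimizer; the learning rate tuned on the small
#    proxy (3e-3 here is just a placeholder) is reused on the wide target.
optimizer = MuAdam(target.parameters(), lr=3e-3)
```

The base/delta pair only tells `mup` which dimensions count as “width”; the actual HP sweep would be run at the small base width and the best values passed to the wide target unchanged.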

Key Benefits

- Zero-shot HP transfer: tune learning rates and other HPs on a small proxy model, then reuse them on the full-size target via μTransfer.
- Cheaper tuning: the expensive HP sweep runs on the small proxy instead of the large target model.
- Stable training at scale: activations and updates stay well-behaved as width grows, instead of exploding or vanishing as they can under SP.

Quick Mathematical Intuition

In SP, the typical size of activations and weight updates shifts as the network gets wider, which shows up as exploding/vanishing gradients and as optimal learning rates that drift with scale. μP counters this by tying initialization variances and per-layer learning rates to each layer’s fan-in (e.g., initializing a hidden linear layer with variance proportional to 1/fan-in and shrinking its learning rate or output multiplier as width grows), so that activations and their updates stay the same order of magnitude at any width. This keeps the optimal HPs (like the base learning rate) roughly constant across scales, avoiding the need for per-size retuning.
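To make the fan-in intuition concrete, here is a hand-rolled sketch of one common rendering of the Adam-variant μP rules (hidden weights initialized with variance 1/fan-in and given a learning rate that shrinks like 1/width, readout input damped by the width ratio). The `MuMLP` class, `base_width`, and all constants are illustrative assumptions, not the paper’s exact table.

```python
import torch
import torch.nn as nn


class MuMLP(nn.Module):
    """Toy MLP with μP-style width scaling (Adam variant), written out by hand."""

    def __init__(self, d_in: int, width: int, d_out: int, base_width: int = 128):
        super().__init__()
        self.width_mult = width / base_width      # ratio to the tuned proxy width
        self.fc_in = nn.Linear(d_in, width)       # input layer: init and LR are Θ(1) in width
        self.fc_hidden = nn.Linear(width, width)  # hidden layer: init std ~ 1/sqrt(fan_in)
        self.readout = nn.Linear(width, d_out)    # output layer: input damped by 1/width_mult
        nn.init.normal_(self.fc_hidden.weight, std=width ** -0.5)
        nn.init.zeros_(self.readout.weight)       # zero readout init, a common μP choice

    def forward(self, x):
        h = torch.relu(self.fc_in(x))
        h = torch.relu(self.fc_hidden(h))
        return self.readout(h / self.width_mult)  # keeps the logits Θ(1) as width grows


def mu_adam_param_groups(model: MuMLP, lr: float):
    """Give the hidden (matrix-like) weights an Adam LR scaled down by width_mult."""
    hidden, rest = [], []
    for name, p in model.named_parameters():
        (hidden if name == "fc_hidden.weight" else rest).append(p)
    return [
        {"params": rest, "lr": lr},               # input layer, biases, readout: LR unchanged
        {"params": hidden, "lr": lr / model.width_mult},
    ]


# The same lr (tuned at base_width) is reused at 8x the width without retuning.
model = MuMLP(d_in=32, width=1024, d_out=10, base_width=128)
optimizer = torch.optim.Adam(mu_adam_param_groups(model, lr=3e-3))
```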

μP builds on Yang’s broader “Tensor Programs” framework for understanding infinite-width/depth limits in NNs, and it’s still evolving—recent work (as of 2025) emphasizes its role in maximal feature learning beyond just HP transfer.
