NanoGPT Initialization and Activation Details


In Andrej Karpathy’s NanoGPT (the popular minimal GPT implementation, released in late 2022 and still maintained as of 2025), the model does not use ReLU. It uses GELU (Gaussian Error Linear Unit) as the activation function in its feed-forward blocks, the same choice as GPT-2 and GPT-3. (Llama-family models use SwiGLU instead, so they are not a direct comparison point here.)
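
For reference, the feed-forward (MLP) block in NanoGPT’s model.py looks essentially like the sketch below; treat the field names (c_fc, c_proj, config.bias, config.dropout) as a close paraphrase of the repository rather than a verbatim copy:

import torch.nn as nn

class MLP(nn.Module):
    # GPT-2 style feed-forward block: expand 4x, apply GELU, project back down
    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))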

Key parameter initialization ranges in NanoGPT (as of the latest 2025 version)

| Parameter / Component | Initialization method | Typical scale | Notes |
| --- | --- | --- | --- |
| Embedding layers (token + position) | nn.Embedding → normal distribution | std = 0.02 | Small, to keep initial logits small |
| Linear layers in attention and FFN (c_attn, c_fc) | nn.Linear → normal distribution, zero bias | std = 0.02 | Overrides PyTorch's default uniform init |
| Final LM head (output projection) | Weight-tied with the token embedding | std = 0.02 | No separate init of its own |
| LayerNorm bias | Zeros | 0 | Standard |
| LayerNorm weight | Ones | 1.0 | Standard |
| Residual output projections (attn.c_proj and mlp.c_proj) | Normal distribution with a depth-scaled std | std = 0.02 / sqrt(2 * n_layer) | The GPT-2-paper trick; compensates for the residual path and is crucial for stable training in deep models |
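
If you want to sanity-check these numbers yourself, a quick empirical probe like the one below works against the repository's model.py (the module names wte, c_attn and c_proj match the current NanoGPT code; the config values here are just an example):

import math
from model import GPT, GPTConfig   # NanoGPT's model.py must be on the path

config = GPTConfig(n_layer=6, n_head=6, n_embd=384,
                   block_size=256, vocab_size=50304)
model = GPT(config)

print(model.transformer.wte.weight.std().item())               # ~0.02   (token embedding)
print(model.transformer.h[0].attn.c_attn.weight.std().item())  # ~0.02   (QKV projection)
print(model.transformer.h[0].attn.c_proj.weight.std().item())  # ~0.0058 (scaled residual projection)
print(0.02 / math.sqrt(2 * config.n_layer))                    # expected residual-proj std ≈ 0.00577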

Special initialization trick in NanoGPT

In model.py, the GPT constructor applies an _init_weights hook to every submodule and then rescales the residual projections (exact values depend on the config):

def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

# in GPT.__init__, after self.apply(self._init_weights):
# special scaled init for the residual projections, per the GPT-2 paper
for pn, p in self.named_parameters():
    if pn.endswith('c_proj.weight'):
        torch.nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * config.n_layer))

The scaling factor applied to the residual projections is:

std = 0.02 / math.sqrt(2 * n_layer)   # helps with deeper models

The 2 * n_layer term counts the residual additions in the network: each of the n_layer blocks contributes two (one from attention, one from the MLP). Since variances of independent contributions add, shrinking each c_proj by 1 / sqrt(2 * n_layer) keeps the residual-stream variance roughly constant as depth increases, instead of growing with the number of layers.
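
A back-of-the-envelope check (plain arithmetic, not code from the repository) shows the effect of this scaling for a deep model:

import math

n_layer = 48                                  # a deep, GPT-2-XL-sized stack
emb_std = 0.02
plain_std  = 0.02                             # no residual scaling
scaled_std = 0.02 / math.sqrt(2 * n_layer)    # NanoGPT's scaling

# variances of independent contributions add across the 2 * n_layer residual branches
std_plain  = math.sqrt(emb_std**2 + 2 * n_layer * plain_std**2)
std_scaled = math.sqrt(emb_std**2 + 2 * n_layer * scaled_std**2)

print(std_plain)   # ~0.197, grows like sqrt(depth)
print(std_scaled)  # ~0.028, stays on the order of the embedding std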

Summary: NanoGPT uses GELU, not ReLU

This very small initialization std, combined with the residual-projection scaling, is a large part of why NanoGPT trains stably, even with little or no learning-rate warm-up in the smallest configs.

If you’re training your own NanoGPT from scratch, copy these exact init tricks — they matter a lot more than people think!
