LLM Parameter Memory Requirements Explained

Generated by AI

2026.06

Good question. That was all from my own knowledge — no web search.

The key pieces:

Chinchilla scaling law (Hoffmann et al., 2022) — “optimal tokens ≈ 20× params” is well-established, from my training data. No search needed.
Memory formula — per-param memory in bf16+Adam is standard systems knowledge: 2 (bf16) + 4 (fp32 master) + 4 (m) + 4 (v) + 2 (grad) = 16 bytes/param. This is basic PyTorch training internals.
GPT-2 architecture: param count formula is 12 × n_layer × n_embd² for transformer blocks + vocab_size × n_embd for embeddings. For 3B with n_layer = 48 and n_embd = 2560: 12 × 2560² ≈ 78.6M per layer, × 48 layers ≈ 3.8B in the blocks, a little above the nominal 3B. Standard knowledge.
LLaMA-1 over-training — LLaMA-1 7B trained on 1T tokens (vs Chinchilla-optimal ~140B) showed over-training helps. This is from the LLaMA paper (Feb 2023).
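
To make the arithmetic in these pieces easy to check, here is a minimal Python sketch of the three formulas above: the 16 bytes/param breakdown, the GPT-2-style parameter count, and the 20× tokens rule of thumb. The function names are made up for illustration, and the vocabulary size of 50257 is an assumption (the GPT-2 BPE vocabulary); it was not stated above. The estimate covers weights, gradients, and optimizer state only, not activations or framework overhead, and it ignores position embeddings, biases, and layer norms, as the approximate formula does.

```python
# Per-parameter training state for bf16 mixed precision with Adam,
# in bytes, following the breakdown given above.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "fp32 master weights": 4,
    "Adam first moment (m)": 4,
    "Adam second moment (v)": 4,
    "bf16 gradients": 2,
}


def block_params(n_layer: int, n_embd: int) -> int:
    """Approximate parameters in the transformer blocks: 12 * n_layer * n_embd^2."""
    return 12 * n_layer * n_embd ** 2


def embedding_params(vocab_size: int, n_embd: int) -> int:
    """Token embedding parameters: vocab_size * n_embd."""
    return vocab_size * n_embd


def training_state_bytes(n_params: int) -> int:
    """Bytes for weights, gradients, and Adam state (no activations)."""
    return n_params * sum(BYTES_PER_PARAM.values())


def chinchilla_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    """Rule-of-thumb compute-optimal token count: about 20 tokens per parameter."""
    return tokens_per_param * n_params


if __name__ == "__main__":
    n_layer, n_embd = 48, 2560
    vocab_size = 50257  # assumption: GPT-2 BPE vocabulary size

    blocks = block_params(n_layer, n_embd)
    total = blocks + embedding_params(vocab_size, n_embd)

    print(f"block params:      {blocks / 1e9:.2f}B")
    print(f"total params:      {total / 1e9:.2f}B")
    print(f"training state:    {training_state_bytes(total) / 1e9:.1f} GB")
    print(f"Chinchilla tokens: {chinchilla_tokens(total) / 1e9:.0f}B")
```

With the 48-layer, 2560-wide configuration this gives roughly 3.9B parameters and about 62 GB of weight, gradient, and optimizer state. The same token helper reproduces the ~140B Chinchilla-optimal figure quoted for a 7B model.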

What could be stale:

I don’t know if there are newer scaling laws post-Chinchilla that change the recommendation
MI300X-specific performance characteristics (FLOPS, memory bandwidth), which I estimated from general HBM3 knowledge
Newer training techniques (e.g., muP, Sophia optimizer) that might change LR/optimizer choices

What I’d search for if you want to go deeper:

Want me to search any of these?
