Optimal Model Size for 10B Tokens | Generated by AI

2026.04

For 10B tokens on an H200:

124M → too small. At 10B tokens that's ~80 tokens per parameter, four times Chinchilla's ~20, so you're deep into diminishing returns on extra data and the model underutilizes the H200.

500M → better. By Chinchilla's ~20 tokens per parameter, 10B tokens is compute-optimal for roughly a 500M model. But Chinchilla assumes compute is the constraint; with a fixed 10B-token dataset and GPU to spare, a larger model trained on the same tokens reaches lower final loss, so for 10B tokens the sweet spot is closer to 1B-2B.
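The ratios above can be checked with a few lines of arithmetic (the 20-tokens-per-parameter rule of thumb is the only assumption):

```python
# Tokens-per-parameter ratio for candidate model sizes on a 10B-token budget.
# Chinchilla's compute-optimal rule of thumb is ~20 tokens per parameter.
TOKENS = 10e9
CHINCHILLA_RATIO = 20

for params in (124e6, 500e6, 1.5e9):
    ratio = TOKENS / params
    optimal_tokens = CHINCHILLA_RATIO * params
    print(f"{params / 1e6:>6.0f}M params: {ratio:6.1f} tokens/param "
          f"(compute-optimal budget ~{optimal_tokens / 1e9:.0f}B tokens)")
```

At 124M the ratio is ~81 tokens per parameter; at 500M it is exactly 20 (compute-optimal); at 1.5B it is ~6.7, i.e. the model is larger than Chinchilla-optimal for the data, which is the deliberate trade here.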

My pick: go 1B or 1.5B; you have the VRAM for it. With bf16 mixed precision, a 1.5B model fits comfortably in the H200's 141 GB with room for a decent batch size.

Quick memory estimate for 1.5B:
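A rough back-of-the-envelope sketch, assuming the usual bf16 mixed-precision setup with AdamW (fp32 master weights plus two fp32 moment buffers; adjust the per-parameter byte counts if your trainer differs):

```python
# Rough training-memory estimate for a 1.5B-parameter model in bf16 mixed
# precision with AdamW. Activation memory depends on batch size, sequence
# length, and gradient checkpointing, so it is left as headroom.
params = 1.5e9
h200_gb = 141  # H200 HBM3e capacity

bytes_per_param = (
    2    # bf16 weights
    + 2  # bf16 gradients
    + 4  # fp32 master weights
    + 8  # fp32 AdamW moments (exp_avg + exp_avg_sq)
)

state_gb = params * bytes_per_param / 1e9
print(f"weights + grads + optimizer: ~{state_gb:.0f} GB")      # ~24 GB
print(f"headroom for activations:    ~{h200_gb - state_gb:.0f} GB")
```

Roughly 24 GB of fixed state leaves well over 100 GB for activations, which is why a large batch size (or long sequences) is feasible without gradient checkpointing.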

Want me to pull a config for 1.5B from nanoGPT examples?
