Training GPT-2 Medium on RTX 4070 | Generated by AI

2026.05

Here’s what the web sources say about training nanoGPT GPT-2-medium (350M) on an RTX 4070 (12GB) and the loss you can expect.

Expected loss targets

Karpathy’s nanoGPT README reports these from-scratch OpenWebText results:

The METR / nanoGPT-speedrun “medium track” uses val loss 2.92 as the GPT-2-medium baseline target (Karpathy’s 350M llm.c run).

Reality check on hardware

So: training GPT-2-medium to the published ~2.85 loss on a single 4070 is technically possible but practically takes weeks-to-months of continuous training and aggressive memory tricks. Most people instead train the 124M to ~3.1 in days, or finetune the released 350M weights.

Config changes to fit 350M on a 4070

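Defaults in config/train_gpt2.py are batch_size=12, block_size=1024, gradient_accumulation_steps=40 (5 per GPU × 8 GPUs), i.e. 12 × 1024 × 40 = 491,520 ≈ 0.5M tokens/step. That file targets the 124M model on an 8-GPU node, so on one 4070 you’d set the 350M dimensions and shrink the micro-batch, roughly: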

# config/train_gpt2_medium_4070.py
n_layer = 24
n_head = 16
n_embd = 1024
block_size = 1024
batch_size = 1            # micro-batch; start here, raise to 2 (and halve accumulation to 240) if it fits
gradient_accumulation_steps = 480   # 1 * 1024 * 480 = 491,520 tokens/step (~0.5M)
dtype = 'bfloat16'        # Ada supports bf16 natively
compile = True            # torch.compile for ~20-30% speedup
# On PyTorch 2.x, nanoGPT calls F.scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on Ada
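The accumulation count in that config follows from simple arithmetic: tokens per optimizer step = micro-batch × block size × accumulation steps. A minimal sketch of the calculation is below; the helper names `tokens_per_step` and `accum_steps_for` are made up for illustration and are not part of nanoGPT.

```python
# Hypothetical helpers (not nanoGPT functions) for sizing gradient accumulation.

def tokens_per_step(batch_size: int, block_size: int, grad_accum_steps: int) -> int:
    """Tokens consumed by one optimizer step."""
    return batch_size * block_size * grad_accum_steps


def accum_steps_for(target_tokens: int, batch_size: int, block_size: int) -> int:
    """Accumulation steps needed to reach at least target_tokens per optimizer step."""
    per_micro_step = batch_size * block_size
    return -(-target_tokens // per_micro_step)  # ceiling division


# Reference budget from the 8-GPU default: 12 * 1024 * 40
reference = tokens_per_step(batch_size=12, block_size=1024, grad_accum_steps=40)

for micro_batch in (1, 2):
    steps = accum_steps_for(reference, micro_batch, block_size=1024)
    print(f"batch_size={micro_batch}: gradient_accumulation_steps={steps}")
```

If block_size is later cut to 512 to save memory, the same helper shows the accumulation count has to double to keep the token budget per step unchanged.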

Additional VRAM levers, in order of preference:

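  1. dtype='bfloat16' (required; fp32 won’t fit).
  2. PyTorch 2 SDPA gives Flash-Attn-equivalent kernels for free.
  3. Gradient checkpointing. nanoGPT doesn’t ship it; you’d wrap each block call (the loop over self.transformer.h in GPT.forward) in torch.utils.checkpoint.checkpoint so activations are recomputed in the backward pass. Cuts activation memory ~2–3× at ~25% speed cost.
  4. 8-bit AdamW via bitsandbytes. Stores the two Adam moments in 1 byte each instead of 4, which saves ~2 GB of optimizer state at 350M params.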
  5. Drop block_size to 512 if still OOM (hurts final loss slightly).
  6. Keep weight decay off the LayerNorm/bias params (already in nanoGPT).
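The ~2 GB figure for 8-bit AdamW can be checked with back-of-the-envelope arithmetic. This is a rough sketch under stated assumptions: a round 350M parameter count, two Adam moment tensors per parameter, fp32 moments in the stock optimizer, and no accounting for the small per-block quantization constants that an 8-bit optimizer also stores. The function name `adam_state_bytes` is illustrative only.

```python
# Rough optimizer-state estimate; all inputs are assumptions, not measurements.

def adam_state_bytes(n_params: int, bytes_per_value: int) -> int:
    """Memory for Adam's two moment tensors (first and second moment)."""
    return 2 * n_params * bytes_per_value


N_PARAMS = 350_000_000  # assumed round figure for GPT-2 medium

fp32_states = adam_state_bytes(N_PARAMS, 4)  # fp32 moments, 4 bytes each
int8_states = adam_state_bytes(N_PARAMS, 1)  # 8-bit moments, quantization overhead ignored
saved = fp32_states - int8_states

print(f"fp32 Adam states : {fp32_states / 1e9:.1f} GB")
print(f"8-bit Adam states: {int8_states / 1e9:.1f} GB")
print(f"saved            : {saved / 1e9:.1f} GB")
```

On a 12 GB card that saving is a meaningful share of the budget, but it does nothing for activation memory, which is what gradient checkpointing and a smaller block_size address.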

Realistic recommendation
