Learning NanoGPT: Run First Then Study | Generated by AI


Question: Is it a good approach to learning NanoGPT training to start by running it on a home server with an RTX 4070 (or a similar GPU), and then study the details by reading papers?

Answer:

Yes — this is actually one of the best ways to learn. Here’s why it works, and how to structure it.


The “Run First, Read Later” Learning Loop

The excerpt you shared reflects expert intuition that only comes from having trained models before. The goal is to build that intuition yourself.

Why Starting with Hardware Makes Sense

Training on your own GPU turns abstractions into things you can watch: VRAM limits force real trade-offs between model size, batch size, and context length, and loss curves unfold in front of you in real time. When you later read the papers, every concept maps onto something you have already seen happen.


Practical Learning Path

Phase 1 — Just Run It (Week 1)

```bash
# Clone nanoGPT and install its dependencies (listed in the repo README)
git clone https://github.com/karpathy/nanoGPT
cd nanoGPT
pip install torch numpy transformers datasets tiktoken wandb tqdm

# Prepare and train the Shakespeare char-level demo first
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py
```
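Once training has produced a checkpoint, sample from it to see what the model learned. This command comes from the nanoGPT README; `out-shakespeare-char` is the default `out_dir` set by the char-level config:

```bash
# Generate text from the trained char-level checkpoint
python sample.py --out_dir=out-shakespeare-char
```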

Phase 2 — Scale to Your Hardware (Week 2)

On an RTX 4070 (12 GB VRAM), reasonable configs:

| Model Size | Batch Size | Context | Approx. Time |
|---|---|---|---|
| 124M (GPT-2 small) | 12–16 | 1024 | ~20–40 h on full data |
| 50M custom | 32 | 512 | ~8–12 h |
| Shakespeare char | 64 | 256 | ~15 min |

Start small, validate your setup, then scale.
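As a concrete starting point, here is a minimal config sketch in nanoGPT's own config-file style for a ~50M-parameter run on a 12 GB card. The variable names are nanoGPT's actual `train.py` settings; the file name and the specific values are assumptions to validate against your own memory usage, not tuned results:

```python
# config/train_4070_small.py -- hypothetical config for a ~50M model
# on a 12 GB RTX 4070; any value can be overridden on the command line,
# e.g. python train.py config/train_4070_small.py --batch_size=16

out_dir = 'out-4070-small'
dataset = 'openwebtext'          # assumes you ran data/openwebtext/prepare.py

# ~50M parameters: 8 layers x 8 heads x 512-dim embeddings
n_layer = 8
n_head = 8
n_embd = 512
dropout = 0.0

batch_size = 32                  # micro-batch per step (see table above)
block_size = 512                 # context length
gradient_accumulation_steps = 4  # effective batch = 32 * 4 * 512 tokens

learning_rate = 6e-4
max_iters = 60000
lr_decay_iters = 60000           # decay over the full run, so the final
min_lr = 6e-5                    # val loss reflects a completed schedule
warmup_iters = 2000
```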

Phase 3 — Read the Papers (in parallel, not after)

Read while a run is in progress. Good order:

  1. Attention Is All You Need (2017) — the Transformer architecture you’re training
  2. GPT-2 paper (Radford et al., 2019) — exactly the model nanoGPT implements
  3. Chinchilla (Hoffmann et al., 2022) — teaches you why the token/parameter ratio in your run matters (see the quick check after this list)
  4. The Karpathy “Let’s build GPT” video — bridges code ↔ paper perfectly
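The Chinchilla result is often summarized as a rule of thumb of roughly 20 training tokens per model parameter. A back-of-the-envelope sketch of that heuristic (not the paper's full scaling law):

```python
# Rough Chinchilla-style token budget: ~20 tokens per parameter,
# a common summary of Hoffmann et al. (2022); a heuristic, not exact
params = 124e6                 # GPT-2 small
tokens = 20 * params           # ~2.5e9 tokens
print(f"compute-optimal budget: ~{tokens / 1e9:.1f}B tokens")
```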

Phase 4 — Interpret Your Runs

Now re-read your own training logs with paper knowledge:

  - Does the loss curve bend the way the scaling papers predict, or does it flatten early?
  - How big is the gap between train and val loss, and when does it open up?
  - Does val loss settle only once the LR schedule has fully decayed?
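A small sketch for turning console output into a picture, assuming you saved the training output to a file (e.g. `python train.py ... | tee train.log`). nanoGPT prints lines like `iter 100: loss 2.3456, time 12.34ms, mfu 35.67%`; adjust the regex if your log format differs:

```python
import re
import matplotlib.pyplot as plt

# Parse "iter N: loss X, ..." lines from a saved training log
pattern = re.compile(r"iter (\d+): loss ([\d.]+)")

iters, losses = [], []
with open("train.log") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            iters.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(iters, losses)
plt.xlabel("iteration")
plt.ylabel("train loss")
plt.title("nanoGPT training loss")
plt.show()
```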


What the Excerpt Is Teaching You

The reasoning in that block is a masterclass in applied ML economics:

| Concept | What It Means Practically |
|---|---|
| Sunk cost reasoning | 46 GPU-hours already spent; the marginal cost of finishing is only 14 h |
| Baseline discipline | Never compare future runs to an incomplete reference |
| LR decay completion | A half-decayed cosine schedule gives a noisier, pessimistic val loss |
| Experiment sequencing | Lock in cheap baselines before committing to expensive runs |

You will only feel this reasoning after you’ve killed a run early and regretted it, or compared two runs where one had a clean decay and one didn’t.
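To make the LR-decay point concrete, here is a minimal sketch of the warmup-plus-cosine schedule in the style of nanoGPT's `get_lr` (the parameter values are illustrative). Killing a run halfway through the decay means evaluating at a learning rate far above `min_lr`, where val loss is still noisy:

```python
import math

learning_rate = 6e-4    # peak LR (illustrative values)
min_lr = 6e-5
warmup_iters = 2000
lr_decay_iters = 60000

def get_lr(it: int) -> float:
    """Linear warmup, then cosine decay to min_lr, then flat."""
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)

# Halfway through the decay, the LR is still over half of peak,
# so val loss measured here is pessimistic vs. a completed schedule
print(get_lr(lr_decay_iters // 2))  # ~3.5e-4
```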


Bottom Line

Run → Observe → Break → Read → Repeat

The RTX 4070 is genuinely capable for this learning loop; 124M-parameter models are within reach. The papers will transform from abstract math into explanations of things you’ve already seen happen on your screen. That’s the fastest path to real understanding.

