Learning NanoGPT: Run First Then Study | Generated by AI
Question: Is it a good approach to learning NanoGPT training to start by running it on a home server with an RTX 4070 (or similar GPU), and then study the details by reading papers?
Answer:
Yes — this is actually one of the best ways to learn. Here’s why it works, and how to structure it.
The “Run First, Read Later” Learning Loop
The excerpt you shared reflects expert intuition that only comes from having trained models before. The goal is to build that intuition yourself.
Why Starting with Hardware Makes Sense
- You get grounded feedback immediately — loss curves, GPU utilization, OOM errors, and training-time estimates are all real and tangible
- Papers become meaningful — when you read “cosine LR decay,” you’ve already seen what happens at the end of a run; the paper is now explaining something you observed, not something abstract
- Debugging builds intuition faster than reading — figuring out why your val loss is noisy teaches more than a paragraph about regularization
Practical Learning Path
Phase 1 — Just Run It (Week 1)
```bash
# Clone and run the Shakespeare char-level demo first
git clone https://github.com/karpathy/nanoGPT
cd nanoGPT
pip install torch numpy transformers datasets tiktoken wandb tqdm  # dependencies listed in the README
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py
```
- Watch the loss curve drop in real time
- Note when it flattens
- Try intentionally breaking things (huge LR, tiny batch) to see what happens; one cheap way to do this is sketched below
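A convenient way to run these break-it experiments is a throwaway config file: nanoGPT configs are plain Python files whose top-level variables override the defaults in train.py (you can also pass `--key=value` overrides on the command line). The sketch below uses a hypothetical filename, `config/break_it.py`, and the specific values are only illustrative, not recommendations.

```python
# config/break_it.py (hypothetical name): deliberately bad hyperparameters,
# so you can watch the failure modes cheaply.
# run with: python train.py config/break_it.py
out_dir = 'out-break-it'
dataset = 'shakespeare_char'   # assumes data/shakespeare_char/prepare.py has been run

# small model so each broken run only costs a few minutes
n_layer = 4
n_head = 4
n_embd = 128
block_size = 256
dropout = 0.0

gradient_accumulation_steps = 1
batch_size = 2                 # tiny batch: expect very noisy gradients and loss
learning_rate = 1e-1           # ~100x too high: watch the loss spike or diverge
max_iters = 2000
lr_decay_iters = 2000
warmup_iters = 100
eval_interval = 250
```

Compare the loss curve from a run like this against the stock `train_shakespeare_char.py` run; the contrast is the point of the exercise.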
Phase 2 — Scale to Your Hardware (Week 2)
On an RTX 4070 (12GB VRAM), reasonable configs:
| Model size | Batch size | Context length (tokens) | Approx. training time |
|---|---|---|---|
| 124M (GPT-2 small) | 12–16 | 1024 | ~20–40 h on the full dataset |
| 50M custom | 32 | 512 | ~8–12 h |
| Shakespeare char | 64 | 256 | ~15 min |
Start small, validate your setup, then scale.
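To make the middle row of the table concrete, here is a rough sketch of what a ~50M-parameter config might look like. The filename `config/train_small_4070.py` is hypothetical, the variables are nanoGPT's standard config knobs, and the dataset choice, sizes, and iteration counts are ballpark assumptions rather than tested numbers.

```python
# config/train_small_4070.py (hypothetical): a ~50M-parameter GPT sized for a 12GB RTX 4070
# run with: python train.py config/train_small_4070.py
# assumes a dataset has been prepared first, e.g. python data/openwebtext/prepare.py

out_dir = 'out-small-4070'
dataset = 'openwebtext'

# model: roughly 50M parameters (8 layers x 512-dim, plus token embeddings)
n_layer = 8
n_head = 8
n_embd = 512
block_size = 512           # context length from the table above
dropout = 0.0

# batching: raise gradient_accumulation_steps instead of batch_size if you hit OOM
batch_size = 32
gradient_accumulation_steps = 4

# optimization (values illustrate the knobs, not tuned settings)
learning_rate = 6e-4
max_iters = 60000
lr_decay_iters = 60000     # decay all the way to the end of the run
min_lr = 6e-5
warmup_iters = 2000

eval_interval = 1000
eval_iters = 200
```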
Phase 3 — Read the Papers (Parallel, not after)
Read while a run is in progress. Good order:
- Attention Is All You Need (2017) — the Transformer architecture you’re training
- GPT-2 paper (Radford et al., 2019) — exactly the model nanoGPT implements
- Chinchilla (Hoffmann et al., 2022) — teaches you why the token/parameter ratio in your run matters
- The Karpathy “Let’s build GPT” video — bridges code ↔ paper perfectly
Phase 4 — Interpret Your Runs
Now re-read your own training logs with paper knowledge:
- Why did loss flatten at step 12k?
- Was your run compute-optimal (Chinchilla ratio)? A quick back-of-the-envelope check is sketched below
- What would a 350M run realistically cost you?
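For the Chinchilla question, tokens seen is roughly batch size × context length × gradient accumulation steps × iterations, and the commonly quoted Chinchilla heuristic is on the order of 20 tokens per parameter. A minimal sketch, with numbers that are illustrative rather than taken from any particular run:

```python
# Rough Chinchilla-style check: was a run anywhere near compute-optimal?
# The ~20 tokens/parameter figure is the commonly quoted heuristic from Hoffmann et al. (2022).

def tokens_seen(batch_size: int, block_size: int, grad_accum: int, iters: int) -> int:
    """Total training tokens processed, assuming every step uses a full batch."""
    return batch_size * block_size * grad_accum * iters

def chinchilla_ratio(n_params: float, total_tokens: float) -> float:
    """Tokens per parameter; ~20 is the compute-optimal ballpark."""
    return total_tokens / n_params

if __name__ == "__main__":
    # Illustrative numbers only: a ~50M-parameter model trained for 60k iterations
    n_params = 50e6
    total = tokens_seen(batch_size=32, block_size=512, grad_accum=4, iters=60_000)
    ratio = chinchilla_ratio(n_params, total)
    print(f"tokens seen: {total/1e9:.1f}B, tokens/param: {ratio:.0f} (Chinchilla target ~20)")
```

With these illustrative numbers the run lands near 80 tokens per parameter, well past the Chinchilla optimum; that is common for small models you actually intend to use, but it tells you the run was not compute-optimal in the Chinchilla sense.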
What the Excerpt Is Teaching You
The reasoning in that block is a masterclass in applied ML economics:
| Concept | What It Means Practically |
|---|---|
| Sunk cost reasoning | 46 GPU-hours already spent; marginal cost of finishing is only 14h |
| Baseline discipline | Never compare future runs to an incomplete reference |
| LR decay completion | A half-decayed cosine schedule gives a noisier, pessimistic val loss |
| Experiment sequencing | Lock in cheap baselines before committing to expensive runs |
You will only feel this reasoning after you’ve killed a run early and regretted it, or compared two runs where one had a clean decay and one didn’t.
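The "clean decay" point is easier to see with the schedule itself in front of you. The sketch below mirrors the warmup-then-cosine learning-rate function in nanoGPT's train.py (the variable names match its config; the specific values are just examples): if you kill a run halfway through lr_decay_iters, the model is still training at a learning rate several times above min_lr, and the val loss you record is correspondingly noisier and more pessimistic than what a finished run would show.

```python
import math

# Warmup + cosine decay, in the same shape as nanoGPT's get_lr() in train.py.
# Example values below are illustrative, not taken from the excerpt's run.
learning_rate = 6e-4   # peak LR
min_lr = 6e-5          # floor, reached only at the END of the decay
warmup_iters = 2000
lr_decay_iters = 60_000

def get_lr(it: int) -> float:
    if it < warmup_iters:          # 1) linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:        # 3) after the decay window, hold at the floor
        return min_lr
    # 2) cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

# Killing the run halfway through the schedule leaves the LR well above the floor:
print(f"LR at 50% of the schedule: {get_lr(lr_decay_iters // 2):.2e}")  # several times min_lr
print(f"LR at the end of the schedule: {get_lr(lr_decay_iters):.2e}")   # equals min_lr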
Bottom Line
Run → Observe → Break → Read → Repeat
The RTX 4070 is genuinely capable enough for this learning loop, and 124M-parameter models are within reach. The papers will transform from abstract math into explanations of things you’ve already seen happen on your screen. That’s the fastest path to real understanding.