A Two-Phase Machine Learning Workflow | Generated by AI


Question: When training nanoGPT, is the recommended workflow to first do a smoke run with Claude Code, and then run the full training command manually in a separate terminal since training takes a long time?

Answer:

Yes — this is exactly the right workflow, and it’s a well-established best practice in ML training. Here’s why it makes sense and how it works:


Why This Two-Phase Workflow Is Correct

Phase 1: Smoke Run via Claude Code (Short Sanity Check)

A smoke run is a minimal, fast test to confirm your setup doesn't break before committing to a long training job. With Claude Code, you ask it to set up, configure, and kick off a very short run (a few iterations), so it can verify the data pipeline, catch configuration or dependency errors, and confirm the model actually trains before you commit hours of compute.

A typical smoke run for nanoGPT might look like:

# Smoke run: just 10 iterations to confirm setup works
python train.py config/train_shakespeare_char.py \
  --max_iters=10 \
  --eval_interval=5 \
  --device=cuda

Claude Code runs this interactively, reads the output, and can debug any errors inline. Short, iterative, agentic work is exactly what Claude Code is good at.
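If you want the smoke run to be repeatable rather than retyped, a small helper can assemble the command and check the exit code. This is a sketch, not part of nanoGPT; the `build_smoke_cmd` function and its defaults are assumptions matching the command above:

```python
import subprocess

def build_smoke_cmd(config: str, max_iters: int = 10,
                    eval_interval: int = 5, device: str = "cuda") -> list[str]:
    """Assemble the nanoGPT smoke-run command as an argument list."""
    return [
        "python", "train.py", config,
        f"--max_iters={max_iters}",
        f"--eval_interval={eval_interval}",
        f"--device={device}",
    ]

cmd = build_smoke_cmd("config/train_shakespeare_char.py")
print(" ".join(cmd))
# A real smoke run would then execute it and check the exit code:
# result = subprocess.run(cmd)
# assert result.returncode == 0, "smoke run failed -- fix before full training"
```

A non-zero exit code here is exactly the signal you want Claude Code to see and debug before Phase 2.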


Phase 2: Full Training in a Separate Terminal (Long-Running Job)

Once the smoke run passes, you do not want Claude Code to run the real training, because a full run can take hours or days, the session would sit blocked on a single long command the whole time, and a dropped connection or closed session could kill the job.

So you open a separate terminal and run it yourself:

# For Shakespeare (small, CPU-friendly, ~5 min on GPU):
python train.py config/train_shakespeare_char.py

# For GPT-2 full reproduction (requires 8× A100):
torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py

Use tmux or screen to keep training alive even if you disconnect:

tmux new -s nanogpt_train
python train.py config/train_shakespeare_char.py
# Ctrl+B, D to detach; tmux attach -t nanogpt_train to reconnect
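If tmux or screen is not available, nohup gives similar persistence. A minimal sketch, with `sleep 2` standing in for the real `python train.py ...` command:

```shell
# nohup detaches the command from the terminal so it keeps running
# after you disconnect; output is redirected to train.log.
# "sleep 2" is a stand-in for the actual training command.
nohup sleep 2 > train.log 2>&1 &
TRAIN_PID=$!
echo "training running as PID $TRAIN_PID"
wait "$TRAIN_PID" && echo "done"
```

With the real command, `tail -f train.log` follows progress, and `kill $TRAIN_PID` stops the run if needed.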

Summary of the Full Workflow

| Step | Who does it | Why |
| --- | --- | --- |
| Set up env, prep data | Claude Code | Agentic, handles errors |
| Smoke run (10–50 iters) | Claude Code | Fast validation, catches bugs |
| Full training run | You, in a separate terminal | Long-running, needs persistence |
| Monitor logs / debug errors | Claude Code (if needed) | Can read log files and advise |
| Sample/generate text | Either | Short task again |

Pro Tip: Checkpoint Recovery

nanoGPT automatically saves checkpoints during training (ckpt.pt in out_dir). If training crashes, you can resume:

python train.py config/train_shakespeare_char.py --init_from=resume

This makes the manual terminal approach even safer — you’re not starting over if something goes wrong.
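The resume pattern itself is simple. The sketch below is illustrative only: nanoGPT actually saves model and optimizer state with `torch.save` to `out_dir/ckpt.pt`, while here a plain dict, `pickle`, and a hypothetical `ckpt.pkl` file stand in for the real state:

```python
import os
import pickle

CKPT = "ckpt.pkl"  # stand-in for nanoGPT's out_dir/ckpt.pt

def train(total_iters, resume=False):
    """Run a toy training loop, checkpointing progress each iteration."""
    start = 0
    if resume and os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            start = pickle.load(f)["iter_num"]  # pick up where we left off
    for it in range(start, total_iters):
        # ... one optimization step would happen here ...
        with open(CKPT, "wb") as f:
            pickle.dump({"iter_num": it + 1}, f)  # persist progress
    return start

# Simulate a crash after 3 of 5 iterations, then resume:
train(3)
resumed_from = train(5, resume=True)
print(resumed_from)  # → 3
```

The resumed run starts at iteration 3 instead of 0, which is exactly what `--init_from=resume` buys you after a crash.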


In short: Yes, smoke run with Claude Code ✅, full training yourself in a separate terminal ✅ — this is the correct and professional pattern.
