Nanochat Training on RTX 4070

Generated by AI

2026.06

Here’s the full rundown of what happened:

Nanochat Setup Summary on RTX 4070 Workstation

1. Clone the repo

SSH’d into lzw@192.168.1.36, cloned https://github.com/karpathy/nanochat to /mnt/data/nanochat/ (3.7 MB, pure code — data is downloaded separately at runtime).

2. Install uv (Python package manager)

The workstation had Python 3.12.3 but no uv. Installed it:

curl -LsSf https://astral.sh/uv/install.sh | sh

Installed to ~/.local/bin/uv. The project uses uv for dependency management: its pyproject.toml pins torch==2.9.1+cu128 (CUDA 12.8) via a custom PyTorch index.
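A freshly installed uv is often not on PATH until a new shell is opened. As a minimal sketch, the helper below (find_uv is a hypothetical name, not part of nanochat or uv) looks on PATH first and then falls back to the install location mentioned above:

```python
import shutil
from pathlib import Path


def find_uv():
    """Return the path to the uv executable, or None if it cannot be found."""
    # First try whatever is on the current PATH.
    on_path = shutil.which("uv")
    if on_path:
        return on_path
    # Fall back to the location the installer reported in this write-up.
    candidate = Path.home() / ".local" / "bin" / "uv"
    return str(candidate) if candidate.is_file() else None


if __name__ == "__main__":
    print(find_uv() or "uv not found; open a new shell or add ~/.local/bin to PATH")
```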

3. Create venv and install dependencies

cd /mnt/data/nanochat
uv venv                    # Created .venv with CPython 3.10.20 (auto-downloaded)
uv sync --extra gpu        # Installed 81 packages including torch 2.9.1+cu128

Key packages: torch, triton, nvidia-cudnn, nvidia-nccl, datasets, wandb, fastapi, tiktoken, tokenizers, rustbpe. Total download ~3 GB.
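Before launching a long run, it can help to confirm that the installed torch build actually sees the card. The sketch below uses only standard torch.cuda calls; describe_gpu is a hypothetical helper name, and it returns None rather than failing when torch or a CUDA device is missing:

```python
def describe_gpu():
    """Return basic CUDA info as a dict, or None if torch or a GPU is unavailable."""
    try:
        import torch  # installed into .venv by `uv sync --extra gpu`
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    props = torch.cuda.get_device_properties(0)
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "name": torch.cuda.get_device_name(0),
        "compute_capability": f"{major}.{minor}",
        "vram_gb": round(props.total_memory / 1024**3, 1),
    }


if __name__ == "__main__":
    print(describe_gpu())
```

Saved under any filename inside the repo, it can be run with uv run python so that it uses the project's .venv.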

4. Hit a wandb issue

The first run crashed because --run=rtx4070-d8 triggered wandb init and no API key was configured. Fix: added export WANDB_MODE=disabled to the script. The code falls back to DummyWandb() when the run name is “dummy”, but WANDB_MODE=disabled is cleaner because it allows any run name without wandb auth.
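The same fix can be applied when a run is launched from Python instead of a shell script. This is a sketch only: training_env is a hypothetical helper, and the one thing it relies on is wandb's documented WANDB_MODE environment variable.

```python
import os


def training_env(base_env=None):
    """Copy an environment mapping and disable wandb in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    # With WANDB_MODE=disabled, wandb calls become no-ops and no API key is needed.
    env["WANDB_MODE"] = "disabled"
    return env


env = training_env({"PATH": "/usr/bin"})
print(env["WANDB_MODE"])  # disabled
```

The returned mapping can be passed as the env argument of subprocess.run when starting the training command.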

5. Wrote a custom run script for RTX 4070 (12 GB)

The default speedrun.sh targets 8xH100 (80 GB each, 640 GB total). Had to scale everything down for a single 12 GB card:

Parameter	speedrun.sh (8xH100)	Our run (RTX 4070)	Why
`--depth`	24	8	Smaller model (~40M vs ~1.6B params)
`--device-batch-size`	16	4	Fit in 12 GB VRAM
`--total-batch-size`	auto (1M tokens)	32768	8 accumulation steps of 4 x 1024 tokens on one GPU
`--max-seq-len`	2048	1024	Save VRAM
`--fp8`	yes	no	RTX 4070 is SM 8.9 (Ada); fp8 left disabled for this run
`--num-iterations`	auto (compute-optimal)	5000	Quick demo run
`--core-metric-every`	2000	-1	Skip expensive CORE eval during training
GPUs	8 (torchrun)	1 (python -m)	Single process; gradient accumulation is computed automatically
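The batch-size rows in the table fit together through gradient accumulation. The sketch below assumes the usual relationship, where the total batch size in tokens is split into micro-steps of device batch size times sequence length times GPU count. The function name is illustrative and is not taken from the nanochat source.

```python
def grad_accum_steps(total_batch_size, device_batch_size, max_seq_len, num_gpus=1):
    """Number of micro-steps accumulated before each optimizer step."""
    tokens_per_micro_step = device_batch_size * max_seq_len * num_gpus
    if total_batch_size % tokens_per_micro_step != 0:
        raise ValueError("total batch size must be a multiple of the tokens per micro-step")
    return total_batch_size // tokens_per_micro_step


# Values from the RTX 4070 column: 4 sequences x 1024 tokens = 4096 tokens per micro-step.
steps = grad_accum_steps(total_batch_size=32768, device_batch_size=4, max_seq_len=1024)
print(steps)  # 8
```

Lowering --device-batch-size to save VRAM therefore leaves the effective batch size unchanged and only raises the number of accumulation steps.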

6. Runtime data pipeline

nanochat downloads its own data at runtime (not bundled in the repo):

Step 1: python -m nanochat.dataset -n 8 — downloads 8 shards (~2B chars, ~800 MB) of ClimbMix pretraining data from HuggingFace to ~/.cache/nanochat/base_data_climbmix/
Step 2: python -m scripts.tok_train — trains a BPE tokenizer (vocab 32768) on the downloaded data
Step 3: Pretraining — 5000 steps, ~103K tok/sec, ~26 min
Step 4: Download SFT identity conversations (2.3 MB JSONL from Karpathy’s S3)
Step 5: SFT — 1500 steps, teaches the model conversation format
Step 6: Eval + chat
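The pretraining time in Step 3 can be checked against the run settings. This back-of-the-envelope sketch assumes every step processes the full 32768-token batch and that throughput stays near the reported ~103K tok/sec:

```python
def pretrain_minutes(num_iterations, total_batch_size, tokens_per_sec):
    """Estimated wall-clock minutes for pretraining at a constant throughput."""
    total_tokens = num_iterations * total_batch_size
    return total_tokens / tokens_per_sec / 60


total_tokens = 5000 * 32768
minutes = pretrain_minutes(5000, 32768, 103_000)
print(total_tokens)      # 163840000
print(f"{minutes:.1f}")  # 26.5
```

About 164M tokens at that rate comes to roughly 26.5 minutes, consistent with the ~26 min figure above.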

7. Current status

Running in tmux session nanochat. Pretraining is ~1.6% done (step 81/5000), loss dropping steadily (7.73 → 6.22), VRAM usage only 2.7 GB / 12 GB with GPU at 99% utilization.

Key takeaway

nanochat is self-contained: all you need is a GPU and uv. It downloads data, trains the tokenizer, pretrains, runs SFT, and serves a chat UI, all from one script. The --depth dial controls everything else automatically (width, heads, LR, training horizon). On a 12 GB card, depth=8 is comfortable (this run used only 2.7 GB); you could push to depth=12-16 with smaller batch sizes.
