Custom Nanochat Training Setup Guide

Generated by AI, 2026.06


How to Set Up Nanochat and Train on Custom Data

1. Understanding the Difference

nanoGPT uses pre-tokenized .bin files (numpy memmap of uint16 token IDs, GPT-2 BPE via tiktoken).

nanochat reads raw-text parquet files and tokenizes them on the fly with its own BPE tokenizer. You CANNOT reuse .bin files from nanoGPT: they contain only GPT-2 token IDs, not the text that nanochat needs to tokenize.
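To make the incompatibility concrete, here is a minimal sketch of what a nanoGPT-style .bin file holds: a flat run of uint16 token IDs, with no text and no record of which tokenizer produced them. nanoGPT itself reads these files with a numpy memmap; this sketch uses the standard-library array module to stay dependency-free. The file name and token IDs are made up for illustration.

```python
import array
import os
import tempfile

def write_token_bin(path, token_ids):
    """Write token IDs as raw unsigned 16-bit integers (native byte order)."""
    buf = array.array("H", token_ids)  # "H" = unsigned short
    assert buf.itemsize == 2, "expected 2-byte unsigned shorts on this platform"
    with open(path, "wb") as f:
        buf.tofile(f)

def read_token_bin(path):
    """Read the whole file back as a list of uint16 token IDs."""
    n_tokens = os.path.getsize(path) // 2  # 2 bytes per token
    buf = array.array("H")
    with open(path, "rb") as f:
        buf.fromfile(f, n_tokens)
    return buf.tolist()

# Hypothetical token IDs; in a real file they index into the GPT-2 vocabulary.
demo_ids = [15496, 11, 995, 50256]
demo_path = os.path.join(tempfile.mkdtemp(), "train.bin")
write_token_bin(demo_path, demo_ids)
roundtrip = read_token_bin(demo_path)
print(roundtrip)
```

The file carries integers only, so a tokenizer with a different vocabulary has nothing to re-tokenize. That is why the original text has to be supplied as parquet.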

2. Prepare Your Data

nanochat expects parquet files in ~/.cache/nanochat/base_data_climbmix/ (or wherever NANOCHAT_BASE_DIR points). Each parquet file must have a text column containing raw text documents. The last parquet file is used for validation; all others for training.
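The layout rules above can be checked before training with a short sketch like the one below. It assumes shards are ordered by sorted file name (so "last" means last alphabetically) and that pyarrow is installed for the schema check. The helper names are hypothetical and are not part of nanochat.

```python
import os
from pathlib import Path

def nanochat_base_dir():
    """Resolve the base dir: NANOCHAT_BASE_DIR if set, else ~/.cache/nanochat."""
    override = os.environ.get("NANOCHAT_BASE_DIR")
    return Path(override) if override else Path.home() / ".cache" / "nanochat"

def split_shards(data_dir):
    """Return (train_files, val_files); the last shard in sorted order is validation."""
    shards = sorted(p for p in Path(data_dir).iterdir() if p.suffix == ".parquet")
    if len(shards) < 2:
        raise ValueError("need at least two parquet shards (training + validation)")
    return shards[:-1], shards[-1:]

def has_text_column(parquet_path):
    """Check the file schema for a 'text' column (assumes pyarrow is installed)."""
    import pyarrow.parquet as pq  # imported lazily so the split logic works without it
    return "text" in pq.read_schema(parquet_path).names

# Example usage against the real data directory:
#   train, val = split_shards(nanochat_base_dir() / "base_data_climbmix")
#   missing = [p for p in train + val if not has_text_column(p)]
```

Reading only the schema avoids loading hundreds of gigabytes just to confirm that the column exists.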

In this run, the source data was FineWeb-Edu parquet files at /mnt/data/zz/datasets/fineweb-edu/ (176 files, 374GB, each already containing a text column).

# Back up existing nanochat data (ClimbMix, 9 shards, 788MB)
mv ~/.cache/nanochat/base_data_climbmix ~/.cache/nanochat/base_data_climbmix.bak

# Symlink your data
ln -s /mnt/data/zz/datasets/fineweb-edu ~/.cache/nanochat/base_data_climbmix

3. Train the Tokenizer

nanochat uses its own BPE tokenizer (Rust-backed, GPT-4 style). It must be trained on the same data you’ll use for pretraining:

cd /mnt/data/nanochat

# Back up old tokenizer
mv ~/.cache/nanochat/tokenizer ~/.cache/nanochat/tokenizer.bak

# Train new tokenizer (default: 32K vocab, 2B chars, ~42 seconds)
.venv/bin/python -m scripts.tok_train

This saves to ~/.cache/nanochat/tokenizer/ (tokenizer.pkl + token_bytes.pt).

4. Smoke Test

Quick sanity check (depth=4, 5 steps, ~15 seconds):

cd /mnt/data/nanochat
.venv/bin/python -m scripts.base_train \
  --depth=4 \
  --max-seq-len=512 \
  --device-batch-size=4 \
  --total-batch-size=2048 \
  --num-iterations=5 \
  --eval-every=-1 \
  --core-metric-every=-1 \
  --sample-every=-1 \
  --save-every=-1 \
  --window-pattern=L \
  --tracker=none

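Key rule: --total-batch-size must be a multiple of device-batch-size * max-seq-len * world_size, where world_size is the number of GPUs; the quotient is the number of gradient-accumulation steps. With device-batch-size=4, max-seq-len=512, and 1 GPU, the minimum total-batch-size is 4 * 512 * 1 = 2048.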
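The divisibility rule is easy to check before launching a run. The following is a minimal sketch of the arithmetic only; the function name is hypothetical and this is not nanochat's own validation code.

```python
def grad_accum_steps(total_batch_size, device_batch_size, max_seq_len, world_size=1):
    """Return the gradient-accumulation step count, or raise if the sizes do not divide."""
    # Tokens processed by one forward/backward pass across all GPUs.
    tokens_per_micro_step = device_batch_size * max_seq_len * world_size
    if total_batch_size % tokens_per_micro_step != 0:
        raise ValueError(
            f"total batch size {total_batch_size} is not a multiple of "
            f"{tokens_per_micro_step} (device batch * seq len * world size)"
        )
    return total_batch_size // tokens_per_micro_step

print(grad_accum_steps(2048, 4, 512))     # smoke-test settings: 1
print(grad_accum_steps(65536, 8, 2048))   # real-training settings from section 5: 4
```

With the smoke-test settings every optimizer step is a single micro-step. The section 5 settings accumulate gradients over four micro-steps of 16,384 tokens each.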

5. Real Training

tmux new-session -d -s train 'cd /mnt/data/nanochat && \
.venv/bin/python -m scripts.base_train \
  --depth=12 \
  --max-seq-len=2048 \
  --device-batch-size=8 \
  --total-batch-size=65536 \
  --num-iterations=10000 \
  --eval-every=500 \
  --core-metric-every=-1 \
  --sample-every=2000 \
  --window-pattern=L \
  --tracker=none \
  --run=fineweb-edu-d12 \
  2>&1 | tee /mnt/data/nanochat/train.log'

Attach: tmux attach -t train

Key Flags Reference

Flag	What it does	VRAM impact
`--depth`	Transformer layers (4=tiny, 12=small, 20=default)	High
`--max-seq-len`	Context length (512/1024/2048)	High
`--device-batch-size`	Per-GPU batch size	High — first thing to reduce if OOM
`--total-batch-size`	Tokens per optimizer step (must be divisible by device_batch * seq_len * GPUs)	None (uses grad accum)
`--num-iterations`	Total training steps	—
`--eval-every`	Validate every N steps (-1=disable)	—
`--sample-every`	Generate samples every N steps (-1=disable)	—
`--window-pattern`	`L`=full attention, `SSL`=alternating sliding window (needs FA3)	—
`--tracker`	`wandb`/`mlflow`/`none`	—
`--fp8`	FP8 training (H100+ only)	Saves VRAM

OOM Troubleshooting

If you run out of VRAM, reduce in this order:

1. --device-batch-size=4 (or 2, or 1)
2. --max-seq-len=1024 (or 512)
3. --depth=8 (smaller model)
4. --window-pattern=L (full attention; already set in the commands above)

Our Actual Run Results

Model: 286M params (depth=12, i.e. 12 layers; dim=768; 6 heads)
Throughput: ~55,700 tok/sec on RTX 4070
ETA: ~3.3 hours for 10K steps (655M tokens)
Loss at step 32: 7.25 (still warmup, dropping fast)
VRAM: ~2.7GB used (plenty of headroom on 12GB)
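The token count and ETA above follow directly from the training flags and the measured throughput. This small sketch reproduces the arithmetic; the function name is hypothetical, and the throughput figure is the one reported for this run.

```python
def training_eta_hours(num_iterations, total_batch_size, tokens_per_sec):
    """Return (total tokens processed, estimated wall-clock hours)."""
    total_tokens = num_iterations * total_batch_size
    hours = total_tokens / tokens_per_sec / 3600
    return total_tokens, hours

# 10K steps at 65,536 tokens per step, ~55,700 tok/sec measured on the RTX 4070.
total_tokens, hours = training_eta_hours(10_000, 65_536, 55_700)
print(f"{total_tokens / 1e6:.0f}M tokens, ~{hours:.1f} hours")  # 655M tokens, ~3.3 hours
```

The estimate covers training steps only. Periodic evaluation and sampling add time on top of it.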

Directory Structure

~/.cache/nanochat/
├── base_data_climbmix -> /mnt/data/zz/datasets/fineweb-edu   # your data
├── base_data_climbmix.bak/                                     # old ClimbMix backup
├── tokenizer/
│   ├── tokenizer.pkl
│   └── token_bytes.pt
├── tokenizer.bak/                                              # old tokenizer backup
└── base_checkpoints/
    └── d4/                                                     # smoke test checkpoint
        ├── model_000005.pt
        ├── meta_000005.json
        └── optim_000005_rank0.pt

Resuming Training

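If interrupted, resume from a saved checkpoint. The step passed to --resume-from-step must have a checkpoint on disk, and the real-training command in section 5 does not set --save-every, so add that flag to the original run if you want intermediate checkpoints to resume from: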

.venv/bin/python -m scripts.base_train \
  --resume-from-step=5000 \
  --save-every=1000 \
  ... (same flags as before)
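To find out which steps are available for --resume-from-step, you can scan the checkpoint directory for saved model files. The sketch below assumes the model_<step>.pt naming shown in the directory listing above; the helper name and the d12 directory name in the usage comment are hypothetical.

```python
import re
from pathlib import Path

def saved_steps(checkpoint_dir):
    """Return the sorted step numbers that have a model_<step>.pt file."""
    pattern = re.compile(r"model_(\d+)\.pt")
    steps = []
    for path in Path(checkpoint_dir).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            steps.append(int(match.group(1)))  # "000005" -> 5
    return sorted(steps)

# Example usage (directory name is an assumption for a depth-12 run):
#   steps = saved_steps(Path.home() / ".cache/nanochat/base_checkpoints/d12")
#   print(steps[-1] if steps else "no checkpoints saved yet")
```

An empty result means no checkpoint has been written yet, so there is nothing to resume from.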
