Guide to nanochat Codebase Architecture | Generated by AI




nanochat Codebase Guide

nanochat (by Andrej Karpathy) is a minimal, end-to-end LLM training harness covering every major stage: tokenization, pretraining, SFT, RL, evaluation, inference, and a ChatGPT-like web UI. The guiding philosophy: one dial (--depth) controls model size, and everything else is derived automatically.


  1. ARCHITECTURE OVERVIEW

The project has four top-level directories:

nanochat/     -- Core library (model, data, optimizers, engine)
scripts/      -- Executable entry points (train, eval, chat)
tasks/        -- Evaluation task definitions (MMLU, GSM8K, etc.)
runs/         -- Shell scripts for full training pipelines

The training pipeline flows as:

Tokenizer Training -> Base Pretraining -> SFT Finetuning -> RL (optional) -> Chat

All intermediate artifacts go to ~/.cache/nanochat/ (overridable via $NANOCHAT_BASE_DIR).
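A minimal sketch of how such an environment-variable override can be resolved (the helper name is illustrative, not nanochat's actual API):

```python
import os

def resolve_base_dir() -> str:
    """Return the artifact directory: $NANOCHAT_BASE_DIR if set,
    otherwise the default ~/.cache/nanochat noted above."""
    default = os.path.join(os.path.expanduser("~"), ".cache", "nanochat")
    return os.environ.get("NANOCHAT_BASE_DIR", default)
```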


  2. CORE LIBRARY (nanochat/)

gpt.py – The GPT Transformer model

engine.py – Efficient inference engine

flash_attention.py – Unified Flash Attention interface

optim.py – Mixed Muon + AdamW optimizer

tokenizer.py – BPE Tokenizer

dataloader.py – BOS-aligned best-fit packing

dataset.py – Data download/management

common.py – Utilities

checkpoint_manager.py – Save/Load

core_eval.py – DCLM CORE metric evaluation

loss_eval.py – Bits Per Byte (BPB) evaluation

fp8.py – FP8 training support (requires H100+ and torchao)

execution.py – Python code execution tool for the model
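To make the dataloader's "BOS-aligned best-fit packing" concrete, here is a toy sketch of the idea: every document is prefixed with a BOS token, and each one is placed into the open row whose remaining space is tightest but still fits. The function name, BOS id, and padding choice are illustrative assumptions, not nanochat's actual implementation.

```python
BOS = 0  # hypothetical BOS token id

def pack_best_fit(docs, row_len):
    """Pack tokenized documents into fixed-length rows of row_len tokens.

    Each document is prefixed with BOS. Best-fit: place each document
    into the open row with the smallest remaining capacity that still
    fits it; otherwise open a new row. Rows are padded with BOS.
    """
    rows = []       # packed rows of token ids
    remaining = []  # remaining capacity per row
    for doc in sorted(docs, key=len, reverse=True):
        item = ([BOS] + list(doc))[:row_len]  # truncate oversized docs
        best = None
        for i, cap in enumerate(remaining):
            if len(item) <= cap and (best is None or cap < remaining[best]):
                best = i
        if best is None:
            rows.append(list(item))
            remaining.append(row_len - len(item))
        else:
            rows[best].extend(item)
            remaining[best] -= len(item)
    return [row + [BOS] * (row_len - len(row)) for row in rows]
```

Best-fit wastes far less space than naive one-document-per-row loading, while the BOS prefix keeps every document's start aligned with an attention-reset boundary.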


  3. SCRIPTS (scripts/)

base_train.py – Pretraining (the main training loop)

base_eval.py – Evaluate base model (CORE score + BPB + samples)

chat_sft.py – Supervised Fine-Tuning

chat_rl.py – Reinforcement Learning (simplified GRPO/REINFORCE)

chat_eval.py – Evaluate chat model on task suite

chat_cli.py – CLI chat interface

chat_web.py – FastAPI + uvicorn web UI (ChatGPT-like)

tok_train.py – Train the BPE tokenizer

tok_eval.py – Evaluate tokenizer compression rate


  4. TASKS (tasks/)

Each file defines one evaluation task (e.g. MMLU, GSM8K) with its prompt formatting and answer-checking logic, consumed by chat_eval.py.


  5. RUN SCRIPTS (runs/)

Shell scripts that chain the stages above (tokenizer -> pretraining -> SFT -> optional RL -> evaluation) into complete, reproducible pipelines.


  6. THE DEPTH DIAL

This is the single most important concept. Setting --depth=N auto-derives everything else:

n_embd      = depth * aspect_ratio (default 64)
n_head      = n_embd // head_dim (default 128)
n_kv_head   = n_head (GQA can reduce this)
vocab_size  = 32768
sequence_len = 2048

Roughly GPT-2-level capability lands in the d24-d26 range; quick experiments use ~d12 (about 5 minutes on 8xH100).
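The derivation above can be written out directly. This is a sketch following the formulas in this guide (the function name is illustrative; consult base_train.py for the actual derivation):

```python
def derived_config(depth, aspect_ratio=64, head_dim=128):
    """Derive model hyperparameters from the single --depth dial,
    per the relations listed above (illustrative, not nanochat's API)."""
    n_embd = depth * aspect_ratio          # model width scales with depth
    n_head = max(1, n_embd // head_dim)    # fixed head dimension
    return dict(
        depth=depth,
        n_embd=n_embd,
        n_head=n_head,
        n_kv_head=n_head,                  # GQA can reduce this
        vocab_size=32768,
        sequence_len=2048,
    )
```

For example, depth=12 gives a 768-wide model with 6 heads, while depth=26 gives width 1664 with 13 heads.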


  7. DATA FLOW

Raw parquet shards
    -> dataloader.py (BOS-aligned best-fit packing)
    -> tokenized batches (B, T) of ints
    -> model forward (cross-entropy loss)
    -> Muon + AdamW optimizer step
    -> checkpoint save
    -> SFT: chat-formatted conversations with loss masking
    -> RL: on-policy rollouts + REINFORCE
    -> Engine inference (KV cache + prefill/decode)
    -> CLI or Web UI
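The SFT stage above trains only on assistant tokens via loss masking. A toy sketch of the idea (the helper and span representation are illustrative, not nanochat's actual format):

```python
def build_loss_mask(tokens, assistant_spans):
    """Return a 0/1 mask over tokens: 1 where the token contributes to
    the cross-entropy loss (assistant replies), 0 elsewhere (system and
    user text, which the model should not be trained to reproduce)."""
    mask = [0] * len(tokens)
    for start, end in assistant_spans:  # half-open [start, end) spans
        for i in range(start, min(end, len(tokens))):
            mask[i] = 1
    return mask
```

During SFT, the per-token cross-entropy is multiplied by this mask before averaging, so gradient signal comes only from the assistant's turns.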

  8. KEY DESIGN DECISIONS

One dial – --depth determines model size; all other hyperparameters derive from it.

Mixed optimizer – Muon for the matrix parameters, AdamW for the rest (optim.py).

BOS-aligned best-fit packing – the dataloader packs documents tightly instead of naively concatenating them.

One artifact directory – everything lands under ~/.cache/nanochat/, overridable via $NANOCHAT_BASE_DIR.

Optional FP8 – fp8.py enables FP8 training on supported hardware (H100+, torchao).

