Nanochat Additions Over nanoGPT

Nanochat Additions Over nanoGPT | Generated by AI

Home 2026.06

Here’s the full breakdown of what nanochat adds over nanoGPT:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ nanochat vs nanoGPT — What’s Extra ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

nanoGPT is a pretraining-only harness. It has:

model.py (~340 lines): vanilla GPT-2 architecture (LayerNorm, GELU, learned positional embeddings, weight tying, bias in Linears)
train.py (~300 lines): pretraining loop on raw text
sample.py: basic text generation
That’s it. 2 Python files.

nanochat is the full LLM lifecycle in one repo. Here’s every major addition:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

MODEL ARCHITECTURE (gpt.py) — significantly upgraded ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Feature	nanoGPT	nanochat
Positional encoding	Learned (wpe embedding)	Rotary (RoPE) — no wpe
Normalization	LayerNorm (learnable)	RMSNorm (no learnable params)
Activation	GELU	ReLU² (relu squared)
Attention	Single combined c_attn	Separate c_q, c_k, c_v, c_proj
KV heads	MHA only	Grouped-Query Attention (GQA)
QK norm	No	Yes (q,k normalized after RoPE)
Bias	Yes (configurable)	No bias anywhere
Weight tying	Yes (wte = lm_head)	No — untied embeddings
Dropout	Yes	No dropout at all
Logit softcap	No	Yes (tanh softcap at ±15)
Sliding window attention	No	Yes (SSSL pattern per layer)
Value Embeddings	No	ResFormer-style value residual
Smear (prev token mix)	No	Gate-mixed prev token embedding
Backout (mid-layer sub)	No	Subtract halfway residual
Residual scaling	Fixed	Per-layer resid_lambdas + x0
Flash Attention	PyTorch SDPA	FA3 → FA2 → SDPA fallback chain
KV Cache	None (crop-based)	Proper FA3 KV cache for inference
FP8 training	No	Dynamic tensorwise FP8 (e4m3/e5m2)
Optimizer	Single AdamW	MuonAdamW (Muon for matrices,
		AdamW for embeddings/scalars)
Weight init	Normal(0, 0.02)	Uniform for attn, zeros for proj,
		explicit per-layer resid/x0 init
Vocab padding	Yes (to 50304)	Yes (to nearest 64)

Key architecture changes explained:

RoPE vs learned positional: nanoGPT adds a learned embedding per position (wpe). nanochat uses rotary embeddings — relative position encoded via rotation in complex space. Better length generalization.

ReLU² vs GELU: F.relu(x).square() — simpler, faster, empirically competitive at this scale. No erf computation.

GQA: n_kv_head can be < n_head. E.g. 6 query heads but only 6 KV heads (equal here, but the infrastructure supports GQA ratios). Saves KV cache memory during inference.

Sliding window: The SSSL pattern means 3 layers use short window (1/4 context), 1 layer uses full context. Tiled across layers. Final layer always full. Saves FLOPs on most layers while preserving long-range capability.

Value Residuals (ResFormer): Every other layer has learned per-token embeddings (value_embeds) that get gated into the V tensor. v = v + gate * ve. Alternating layers, last always included.

Smear: Mixes previous token’s embedding into current via a learned gate. Cheap bigram-like information flow at the embedding level. x = x + sigmoid(gate(x)) * x_prev

Backout: At the halfway layer, caches the residual stream. Before the final norm, subtracts lambda * x_backout to remove low-level features before logit projection.

Muon optimizer: Matrix params use Muon (momentum + Newton-Schulz orthogonalization), embeddings use AdamW. Separate LR schedules per param group. Much more efficient for large matrix params.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

SFT — Supervised Fine-Tuning (chat_sft.py) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

nanoGPT has NO SFT. nanochat has a full SFT pipeline:

What SFT does: Takes a pretrained base model and fine-tunes it on conversations (user/assistant pairs) so it learns to follow instructions and chat.

How it works in nanochat:

a) Conversation rendering (tokenizer.render_conversation): - Conversations are tokenized with special tokens: <|bos|> <|user_start|> … <|user_end|> <|assistant_start|> … <|assistant_end|> - A loss mask is generated: mask=1 only for assistant tokens - User prompts, BOS, special tokens = mask=0 (not trained on)

b) Data mixture (TaskMixture): - SmolTalk: 460K rows of general conversations - CustomJSON: 1000 synthetic identity conversations (“Who are you?” “I am nanochat…”) - MMLU: 100K rows × 3 epochs (multiple choice knowledge) - GSM8K: 8K rows × 4 epochs (math with tool use) - SimpleSpelling: 200K rows (spell the word ‘apple’) - SpellingBee: 80K rows (how many ‘r’ in ‘strawberry’?)

c) BOS-aligned packing (bestfit): - Conversations are packed into fixed-length rows using best-fit algorithm - No tokens discarded (padding with masked targets instead) - Each row starts with BOS

e) ChatCORE evaluation during SFT: - Runs 6 benchmarks every N steps: ARC-Easy, ARC-Challenge, MMLU, GSM8K, HumanEval, SpellingBee - ChatCORE = mean centered accuracy (normalized against random baseline)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

RL — Reinforcement Learning (chat_rl.py) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

nanoGPT has NO RL. nanochat implements a simplified GRPO/REINFORCE on GSM8K:

The pipeline:

Load the SFT model
For each GSM8K question:
- Generate N=16 samples from the model
- Check each sample against the ground truth answer
- Reward = 1 if correct, 0 if wrong
Compute advantages: reward - mean_reward (not z-score, just subtract mean)
Policy gradient: loss = -sum(logp * advantage) / num_valid_tokens
No KL penalty, no PPO ratio/clip — pure on-policy REINFORCE

What makes it “GRPO-inspired” but simplified:

No trust region / KL to reference model
On-policy (no need for PPO ratio + clip)
DAPO-style token-level normalization
Advantage = (r - mu) not (r - mu)/sigma

Tracks pass@k metrics: probability that at least 1 of k samples is correct.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

INFERENCE ENGINE (engine.py) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

nanoGPT: basic generate() that crops to block_size, no caching.

nanochat: Full inference engine with:

KV Cache (FA3-native, pre-allocated tensors)
Prefill: batch=1 prompt forward, then replicate cache for N samples
Tool use state machine: detects < python_start >, evals expressions, injects results
Multi-sample generation (generate N completions in parallel)
Streaming yields (token_column, token_masks) per step

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

EVALUATION SUITE (tasks/ + core_eval.py + chat_eval.py) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

nanoGPT: only val loss on OpenWebText.

nanochat has 8 evaluation tasks:

Categorical (logit-based, fast): - ARC-Easy / ARC-Challenge (science reasoning, 4-way MC) - MMLU (57 subjects, 4-way MC)

Generative (sample + check): - GSM8K (math word problems) - HumanEval (Python code generation) - SpellingBee (letter counting) - SimpleSpelling (word spelling)

Pretraining metrics: - DCLM CORE score (perplexity-based across 5 tasks) - val_bpb (bits per byte, vocab-size-invariant)

ChatCORE: composite metric across all 6 chat eval tasks, centered against random baselines.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

TOKENIZER (tokenizer.py) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

nanoGPT: uses tiktoken (GPT-2 encoding) or trains a char-level tokenizer. No special tokens.

nanochat: Custom BPE tokenizer (RustBPE for training, tiktoken for inference):

32K vocab (configurable)

8 special tokens: <

bos

>, <

user_start

>, <

user_end

>, <

assistant_start

>, <

assistant_end

>, <

python_start

>, <

python_end

>, <

output_start

>, <

output_end

render_conversation() method: turns chat messages into token ids + loss mask
render_for_completion(): strips last assistant message for RL generation
visualize_tokenization(): debug helper with color-coded output

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

OTHER INFRASTRUCTURE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Checkpoint manager: saves/loads model + optimizer + metadata per stage (base/sft/rl)
MLflow logger: experiment tracking alternative to wandb
Scaling laws support: miniseries.sh, scaling_laws.sh, –depth as single dial
Report system: structured logging of eval results
Chat UI: web-based ChatGPT-like interface (scripts/chat_web.py)
Chat CLI: terminal-based chat interface (scripts/chat_cli.py)
Distributed training: proper DDP with Gloo patches
compute_init/cleanup: device detection, distributed setup

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Summary: The Full LLM Pipeline ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

nanoGPT covers: [pretrain] → [generate text]

nanochat covers: [train tokenizer] → [pretrain] → [eval CORE] → [SFT on conversations] → [RL on math tasks] → [eval ChatCORE] → [serve via chat UI]

The biggest conceptual additions are:

SFT: teaching a base model to be a chat assistant via conversation fine-tuning
RL: improving math/tool-use via reward-based policy gradient
Tool use: the model can invoke a Python calculator and get results fed back
Full eval suite: 6+ benchmarks with a composite metric (ChatCORE)
Modern architecture: RoPE, GQA, sliding window, value residuals, ReLU², Muon optimizer

Back Donate