NanoChat 760M on MI300X Technical Breakdown

NanoChat 760M on MI300X Technical Breakdown | Generated by AI

2026.06

Here’s the full technical breakdown of every decision made:

============================================================
COMPREHENSIVE SUMMARY: NANOCHAT 760M ON MI300X
============================================================

ENVIRONMENT DISCOVERY

First, I explored the system to understand what we’re working with:

Host: snapshots-gpu-mi300x-192gb-devcloud-atl1 (DigitalOcean GPU droplet)
OS: Ubuntu 24.04.4 LTS
CPU: Intel Xeon Platinum 8568Y+ (20 cores, QEMU VM)
RAM: 235 GB
GPU: AMD Instinct MI300X (192 GB VRAM, gfx942)
ROCm: 7.2.0 installed at /opt/rocm-7.2.0
PyTorch: 2.9.1+rocm6.4 (system-wide, pre-installed)

Found existing projects on disk:
/root/llama.cpp/ - Built with ROCm HIP for MI300X, served Qwen 122B
/root/nanoGPT/ - Trained GPT-2 760M (logged 108-113% MFU on MI300X)
/root/zz/ - Training pipeline with logs, inference audits
/root/nanochat/ - Freshly cloned, not yet set up

The nanoGPT 760M training logs showed:

depth=24, n_head=24, n_embd=1536
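MFU 108-113% as logged on MI300X (a value above 100% cannot be utilization of the MI300X's own peak; stock nanoGPT computes MFU against a hard-coded A100 peak, and the repo ships no custom CUDA kernels)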
Val loss ~3.27 at step 29K
Trained on FineWeb dataset
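
An MFU above 100% deserves a sanity check. The sketch below converts the logged figure into absolute TFLOPS and re-expresses it against the MI300X. It rests on two assumptions that do not come from the logs: that the run used nanoGPT's stock estimate_mfu, which divides by a hard-coded A100 bf16 peak of 312 TFLOPS, and that the MI300X dense bf16 peak is 1,307.4 TFLOPS as listed in AMD's published specifications. The helper name rebase_mfu is made up for illustration.

```python
# Assumption: stock nanoGPT estimate_mfu uses the A100 bf16 peak as its reference.
A100_BF16_PEAK_TFLOPS = 312.0
# Assumption: MI300X dense bf16 peak from AMD's spec sheet (not from the logs).
MI300X_BF16_PEAK_TFLOPS = 1307.4


def rebase_mfu(logged_mfu: float, logged_peak: float, actual_peak: float) -> float:
    """Convert an MFU measured against one peak into MFU against another peak."""
    achieved_tflops = logged_mfu * logged_peak
    return achieved_tflops / actual_peak


for logged in (1.08, 1.13):
    achieved = logged * A100_BF16_PEAK_TFLOPS
    rebased = rebase_mfu(logged, A100_BF16_PEAK_TFLOPS, MI300X_BF16_PEAK_TFLOPS)
    print(f"logged {logged:.0%} -> {achieved:.0f} TFLOPS -> {rebased:.1%} of MI300X peak")
```

Under these assumptions the nanoGPT run sits at roughly 26-27% of the MI300X peak, which is in the same range as the 27% nanochat reports.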

============================================================

NANOCHAT vs NANOGPT - WHY NANOCHAT

nanochat is Karpathy’s successor to nanoGPT (deprecated Nov 2025). Key differences:

nanoGPT:
- Simple 300-line train.py
- GPT-2 architecture
- Manual hyperparameters
- No built-in eval
- Attention via PyTorch SDPA (the repo ships no custom CUDA kernels)
- 108-113% logged MFU on MI300X (likely computed against an A100 peak)

nanochat:
- Full-stack: tokenizer, pretrain, SFT, eval, chat UI, web UI
- Auto-scales all hyperparams from a single --depth dial
- Uses PyTorch SDPA/FA3 (portable)
- 27% MFU on MI300X (SDPA fallback)

The two MFU figures are not directly comparable: a value above 100% cannot be measured against the MI300X's own peak, and nanoGPT itself has no hand-written CUDA kernels (it also runs attention through PyTorch's SDPA). What nanochat adds is a complete pipeline: tokenizer training, evaluation, SFT, and a chat UI out of the box.

============================================================

FLASH ATTENTION DECISION

This was a critical architectural decision. Here’s the full story:

FLASH ATTENTION 3 (FA3):

Available in nanochat via the ‘kernels’ package
Requires Hopper GPU (SM 90) - only NVIDIA H100/H200
Detection code in nanochat/flash_attention.py:
    if major != 9:    # checks SM capability
        return None   # falls back to SDPA
MI300X reports capability 9.4 (gfx942) via ROCm, so its major version also reads 9, but the FA3 kernels are compiled for NVIDIA SM 90 only - they won’t run on AMD
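
Since a capability of 9.4 passes a check on the major version alone, the fallback has to be decided in two stages. The sketch below illustrates that idea; it is not nanochat's code. The function choose_attention_backend and the load_fa3 callable are hypothetical names, and the sketch assumes that loading the FA3 kernels raises an exception on hardware they were not built for.

```python
from typing import Callable, Tuple


def choose_attention_backend(
    capability: Tuple[int, int],
    load_fa3: Callable[[], object],
) -> str:
    """Return "fa3" only if the capability gate passes AND the kernels load."""
    major, _minor = capability
    if major != 9:      # same gate as the snippet quoted above
        return "sdpa"
    try:
        load_fa3()      # hypothetical loader; assumed to raise if unusable
    except Exception:
        return "sdpa"
    return "fa3"


def failing_loader() -> object:
    # Stand-in for a kernel package that has no build for this device.
    raise RuntimeError("no FA3 build for this device")


print(choose_attention_backend((9, 4), failing_loader))  # MI300X-like case
print(choose_attention_backend((8, 9), failing_loader))  # non-Hopper NVIDIA case
```

Both calls select SDPA: the second is rejected by the capability gate, the first only by the failed load.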

FLASH ATTENTION 2:

The flash-attn package (by Tri Dao) has ROCm support but it’s not installed and not trivial to build
Would need: pip install flash-attn with ROCm compilation
Risk of build failures on ROCm 7.2

WHAT WE USE INSTEAD - PyTorch SDPA:

PyTorch’s scaled_dot_product_attention is the fallback
It dispatches to the best available backend:
- cuDNN/hipDNN attention (if available)
- Memory-efficient attention (if available)
- Math implementation (fallback)
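On ROCm, which backend gets selected depends on the PyTorch build; when it ends up on the math implementation, attention runs as unfused matmul + softmax
This is the most likely reason MFU is 27% instead of 50-60%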

IMPACT ON TRAINING:

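PyTorch SDPA has no native sliding-window option; a window would have to be passed as an explicit attention mask, which gives up the fast causal path
That’s why we use --window-pattern L (full attention)
With FA3, we could use “SSSL” pattern (3/4 sliding window) which saves compute on 3 out of 4 layers
Full attention does more attention work per token, which lowers tokens/sec (MFU itself is limited by kernel efficiency, not by the FLOP count)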
But it works correctly and reliably on AMD
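
To put a rough number on what the “SSSL” pattern would save, the sketch below counts (query, key) pairs under causal attention for both patterns. It assumes the short window is half the sequence length (1024 of 2048), which is not stated above, and it counts only attention-score work, not the linear layers. The function names are illustrative.

```python
def attended_pairs(seq_len: int, window: int) -> int:
    """(query, key) pairs under causal attention limited to the last `window` keys."""
    return sum(min(i + 1, window) for i in range(seq_len))


def pattern_pairs(pattern: str, n_layers: int, seq_len: int, short_window: int) -> int:
    """Total attended pairs across layers, tiling `pattern` over the layer index."""
    total = 0
    for layer in range(n_layers):
        kind = pattern[layer % len(pattern)]
        window = seq_len if kind == "L" else short_window
        total += attended_pairs(seq_len, window)
    return total


SEQ_LEN, N_LAYERS = 2048, 24
SHORT_WINDOW = SEQ_LEN // 2  # assumption: "S" means half the context

full = pattern_pairs("L", N_LAYERS, SEQ_LEN, SHORT_WINDOW)
mixed = pattern_pairs("SSSL", N_LAYERS, SEQ_LEN, SHORT_WINDOW)
ratio = mixed / full
print(f"SSSL does {ratio:.1%} of the attention-score work of full attention")
```

With these assumptions the saving is under 20% of the attention-score work, and attention is itself a minority of the per-token FLOPs at this sequence length, so the pattern choice affects throughput only modestly.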

WHAT WOULD IMPROVE THIS:
Option A: Install flash-attn for ROCm (risky, may not compile)
Option B: Wait for ROCm-native flash attention (AMD is working on it)
Option C: Use the Composable Kernel (CK) flash attention from ROCm
Option D: Accept 27% MFU and train longer (~62 hours vs ~20 hours)

We went with Option D for reliability. The MI300X has 192 GB VRAM so we’re not memory-constrained - we’re compute-limited by SDPA.

============================================================

MODEL ARCHITECTURE DECISIONS

TARGET: GPT-2 760M (matching the nanoGPT run)

nanochat auto-scales everything from --depth:
--depth=24 (number of transformer layers)
--aspect-ratio=64 (default, controls width)
--head-dim=128 (default, attention head size)

CALCULATION:
model_dim = depth × aspect_ratio = 24 × 64 = 1536
n_heads = model_dim / head_dim = 1536 / 128 = 12
n_layers = depth = 24
ffn_dim = 4 × model_dim = 6144

This gives: n_layer=24, n_head=12, n_kv_head=12, n_embd=1536
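
The derivation above can be written as a few lines of arithmetic. This is a sketch of the calculation as described here, not nanochat's actual configuration code; derive_shape is a made-up name.

```python
def derive_shape(depth: int, aspect_ratio: int = 64, head_dim: int = 128) -> dict:
    """Derive model dimensions from depth, following the calculation in the text."""
    model_dim = depth * aspect_ratio
    if model_dim % head_dim != 0:
        raise ValueError("model_dim must be a multiple of head_dim")
    return {
        "n_layer": depth,
        "n_embd": model_dim,
        "n_head": model_dim // head_dim,
        "ffn_dim": 4 * model_dim,
    }


print(derive_shape(24))
```

Passing head_dim=64 instead reproduces the nanoGPT layout with 24 heads at the same width.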

NOTE: The original nanoGPT 760M used n_head=24 (head_dim=64), but nanochat uses head_dim=128 by default. The total parameter count is similar because:

Fewer heads (12 vs 24) but larger head_dim (128 vs 64)
Same total attention dimension: 12×128 = 24×64 = 1536

PARAMETER BREAKDOWN:
wte (word embeddings):           50,331,648  (32768 vocab × 1536 dim)
value_embeds:                   603,979,776  (nanochat innovation)
lm_head (output projection):     50,331,648
transformer_matrices:           679,478,976  (the actual GPT layers)
scalars:                                 74  (resid_lambdas, x0_lambdas)
TOTAL:                        1,384,122,122  (~1.38B)

The “value_embeds” is a nanochat-specific feature:

Every other layer has a value embedding: 12 of the 24 layers, each with its own 32768 × 1536 table
Adds ~604M parameters that nanoGPT doesn’t have
The “transformer_matrices” (679M) is closer to the 760M target
This is why total params are 1.38B, not 760M
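
The breakdown can be cross-checked with plain arithmetic. The sketch below recomputes the embedding tables exactly and estimates the transformer matrices as 12·d² per layer (four d×d attention projections plus two d×4d MLP matrices, no biases), which is a standard approximation and an assumption about the layer layout rather than something read from nanochat's code.

```python
VOCAB, DIM, LAYERS = 32_768, 1_536, 24

wte = VOCAB * DIM
lm_head = VOCAB * DIM
value_embeds = (LAYERS // 2) * VOCAB * DIM   # one table on every other layer
matrices_estimate = LAYERS * 12 * DIM * DIM  # assumption: 4d^2 attention + 8d^2 MLP

# Figures as reported in the breakdown above.
reported = {
    "wte": 50_331_648,
    "value_embeds": 603_979_776,
    "lm_head": 50_331_648,
    "transformer_matrices": 679_478_976,
    "scalars": 74,
}

total_reported = sum(reported.values())
gap = reported["transformer_matrices"] - matrices_estimate
print(f"total reported: {total_reported:,}")
print(f"matrix estimate: {matrices_estimate:,} (reported is {gap:,} higher)")
```

The embedding figures match exactly. The estimate for the transformer matrices comes out 1,728 parameters below the reported value, presumably from small extra parameters that this simple formula does not model.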

============================================================

HYPERPARAMETER DECISIONS

A. BATCH SIZE: 524,288 tokens/step

How I chose this:

nanochat default for depth=20 is 524,288
This is 256 sequences × 2048 tokens each
Standard for models in the 500M-1B range
Matches what the nanoGPT 760M used

Breakdown on MI300X:
device_batch_size=32 (32 sequences per GPU forward pass)
tokens per micro-batch: 32 × 2048 = 65,536
gradient accumulation steps: 524,288 / 65,536 = 8
Each step = 8 forward+backward passes, then 1 optimizer step
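
The accumulation count follows directly from the three batch settings. A minimal sketch, with a hypothetical helper name and a world_size parameter added for the multi-GPU case:

```python
def grad_accum_steps(
    total_batch_tokens: int,
    device_batch_size: int,
    seq_len: int,
    world_size: int = 1,
) -> int:
    """Micro-batches per optimizer step; the sizes must divide evenly."""
    tokens_per_pass = device_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_pass != 0:
        raise ValueError(
            f"{total_batch_tokens} tokens is not a multiple of {tokens_per_pass}"
        )
    return total_batch_tokens // tokens_per_pass


print(grad_accum_steps(524_288, device_batch_size=32, seq_len=2048))  # this run
print(grad_accum_steps(524_288, device_batch_size=64, seq_len=2048))  # larger micro-batch
```

The divisibility check matters when experimenting: a device batch size of 48, for example, gives 98,304 tokens per pass, which does not divide 524,288.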

B. SEQUENCE LENGTH: 2048

nanochat default, good balance of context vs memory
Longer sequences (4096) would use more VRAM per micro-batch
The MI300X could handle 4096, but 2048 is standard
Twice GPT-2’s original context length of 1024 tokens

C. WINDOW PATTERN: L (full attention)

FA3 supports “SSSL” (sliding window on 3/4 layers)
PyTorch SDPA has no native sliding-window option (a window would need an explicit attention mask)
Must use “L” (full attention on all layers)
This means every layer does full O(n²) attention
More compute per token, but simpler and correct

D. NUMBER OF ITERATIONS: 29,000

Chinchilla-optimal scaling:

Chinchilla paper says: optimal tokens = 20 × parameters
Our model: 760M params (the nominal target; the transformer matrices actually total 679M)
Target tokens: 20 × 760M = 15.2B tokens
Steps needed: 15.2B / 524,288 ≈ 28,992, rounded to 29,000 steps

This matches what the nanoGPT run did:
nanoGPT 760M trained to ~29K steps with similar batch size
Achieved val loss ~3.27 at that point
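
The step count is a one-line calculation from the 20-tokens-per-parameter rule of thumb. The sketch below reproduces it; chinchilla_steps is a made-up helper name.

```python
import math


def chinchilla_steps(params: int, batch_tokens: int, tokens_per_param: int = 20):
    """Return (target_tokens, optimizer_steps) for a tokens-per-parameter budget."""
    target_tokens = tokens_per_param * params
    return target_tokens, math.ceil(target_tokens / batch_tokens)


target, steps = chinchilla_steps(760_000_000, 524_288)
print(f"{target / 1e9:.1f}B tokens -> {steps:,} steps")

# Doubling the batch size halves the step count for the same token budget.
_, steps_big_batch = chinchilla_steps(760_000_000, 1_048_576)
print(f"with 1,048,576-token batches: {steps_big_batch:,} steps")
```

Using the 679M transformer-matrix count instead of the nominal 760M would lower the budget to about 13.6B tokens, so the 29,000-step plan errs on the side of training slightly longer.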

E. LEARNING RATES (auto-scaled by nanochat):

embedding_lr: 0.3 (default)
unembedding_lr: 0.008 (default)
matrix_lr: 0.02 (Muon optimizer for weights)
scalar_lr: 0.5 (for resid_lambdas, x0_lambdas)
weight_decay: 0.28 (base value, before depth scaling)

nanochat auto-adjusts:
Weight decay scaled: 0.28 × (24/160) ≈ 0.042 for depth 24
Adam LR scaled by 1/√(1536/768) = 0.707 for larger model
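
The two scaling rules quoted above can be reproduced as follows. This only restates the arithmetic given in this section; the function names and the reference values 160 and 768 are taken from the formulas above, not from nanochat's source, so treat it as an illustration rather than the library's implementation.

```python
import math


def scaled_weight_decay(base: float, depth: int, ref: int = 160) -> float:
    """Weight decay after depth scaling, as in 0.28 x (24/160)."""
    return base * depth / ref


def adam_lr_scale(model_dim: int, ref_dim: int = 768) -> float:
    """Multiplier applied to Adam learning rates, as in 1/sqrt(1536/768)."""
    return 1.0 / math.sqrt(model_dim / ref_dim)


wd = scaled_weight_decay(0.28, depth=24)
lr_scale = adam_lr_scale(1536)
print(f"weight decay: {wd:.3f}, Adam LR multiplier: {lr_scale:.3f}")
print(f"effective embedding_lr: {0.3 * lr_scale:.3f}")
```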

F. EVALUATION SETTINGS:

eval_every=1000 (validation loss every 1000 steps)
eval_tokens=1,048,576 (~1M tokens for val loss estimate)
core_metric_every=5000 (DCLM CORE benchmark every 5K steps)
sample_every=5000 (generate text samples every 5K steps)
save_every=5000 (checkpoint every 5K steps)

============================================================

TRAINING SPEED ANALYSIS

MEASURED PERFORMANCE:
First step (compilation): 17.5 seconds (JIT warmup)
Subsequent steps: ~7.7 seconds each
Throughput: ~68,000 tokens/sec
MFU: ~27.5% (bf16)
Peak VRAM: 105 GB / 192 GB (55%)

TIME ESTIMATE:
29,000 steps × 7.7 sec/step = 223,300 seconds = 3,722 minutes = 62 hours ≈ 2.6 days
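
The throughput and wall-clock figures follow from the step time and batch size. A short sketch of the arithmetic, using only numbers from this section (the first-step compilation time is ignored):

```python
STEPS = 29_000
SECONDS_PER_STEP = 7.7
BATCH_TOKENS = 524_288

tokens_per_sec = BATCH_TOKENS / SECONDS_PER_STEP
total_seconds = STEPS * SECONDS_PER_STEP
hours = total_seconds / 3600
days = hours / 24

print(f"throughput: {tokens_per_sec:,.0f} tokens/sec")
print(f"wall clock: {hours:.1f} hours ({days:.1f} days)")
```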

WHY MFU IS 27% (not 50%+):

SDPA fallback (no fused attention kernels)
- FA3 on H100: fused, vectorized, pipelined
- SDPA on MI300X: separate matmuls + softmax + dropout
Value embeddings add ~604M extra params
- More memory bandwidth for embedding lookups
- Not as compute-dense as pure matmuls
Gradient accumulation (8 micro-batches)
- Each micro-batch has kernel launch overhead
- Overhead × 8 = significant time
Model is small for the GPU
- 760M params on 192 GB GPU = not fully utilized
- Larger models (7B+) would get better MFU
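
The logged MFU can be cross-checked from the measured throughput. The sketch below uses the common approximation of 6 FLOPs per matmul parameter per token plus 12·L·H·Q·T for attention, counts only the transformer matrices and lm_head as matmul parameters (embedding lookups excluded), and takes 1,307.4 TFLOPS as the MI300X dense bf16 peak. All three choices are assumptions, not values read from nanochat.

```python
N_LAYER, N_HEAD, HEAD_DIM, SEQ_LEN = 24, 12, 128, 2048
MATMUL_PARAMS = 679_478_976 + 50_331_648  # transformer matrices + lm_head
PEAK_FLOPS = 1307.4e12                    # assumption: MI300X dense bf16 peak
BATCH_TOKENS, SECONDS_PER_STEP = 524_288, 7.7


def flops_per_token(matmul_params: int, n_layer: int, n_head: int,
                    head_dim: int, seq_len: int) -> int:
    """Forward+backward FLOPs per token: 6N for matmuls plus the attention term."""
    return 6 * matmul_params + 12 * n_layer * n_head * head_dim * seq_len


per_token = flops_per_token(MATMUL_PARAMS, N_LAYER, N_HEAD, HEAD_DIM, SEQ_LEN)
tokens_per_sec = BATCH_TOKENS / SECONDS_PER_STEP
mfu = tokens_per_sec * per_token / PEAK_FLOPS

print(f"{per_token / 1e9:.2f} GFLOPs/token, "
      f"{tokens_per_sec * per_token / 1e12:.0f} TFLOPS achieved, MFU {mfu:.1%}")
```

The result lands close to the logged 27.5%, which suggests, but does not prove, that the logged figure was computed with similar inputs.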

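COMPARISON WITH NANOGPT:
nanoGPT 760M on same MI300X: 108-113% MFU (as logged)
nanochat 760M on same MI300X: 27.5% MFU

These two numbers are not directly comparable. An MFU above 100% is not physically possible; stock nanoGPT divides by a hard-coded A100 bf16 peak of 312 TFLOPS, so its figure is relative to an A100 rather than to the MI300X.

The actual differences are:

nanoGPT is a minimal GPT-2 trainer; its attention also goes through PyTorch SDPA (the repo ships no custom CUDA kernels)
nanochat uses PyTorch SDPA on this machine because FA3 is unavailable
nanoGPT doesn’t have value embeddings (simpler)
nanochat has full pipeline (tokenizer, eval, chat)

If you only need a minimal pretraining loop: use nanoGPT
If you need the full pipeline: use nanochat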

============================================================

ROCm-SPECIFIC CONSIDERATIONS

ENVIRONMENT VARIABLES SET:

HIP_FORCE_DEV_KERNARG=1
- Forces HIP to use kernel arguments in device memory
- Can improve performance on some ROCm versions

HSA_OVERRIDE_GFX_VERSION=9.4.2
- Tells HIP/HSA to use gfx942 target (MI300X)
- Ensures correct ISA is used

PYTORCH_ALLOC_CONF=expandable_segments:True
- PyTorch memory allocator uses expandable segments
- Reduces fragmentation for large allocations
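
These variables only take effect if they are set before the GPU runtime initializes. One way to guarantee that from a Python entry point is sketched below; apply_rocm_env is a hypothetical helper, and the same values could equally be exported in the launch script.

```python
import os
from typing import MutableMapping

# Values as listed above.
ROCM_ENV = {
    "HIP_FORCE_DEV_KERNARG": "1",
    "HSA_OVERRIDE_GFX_VERSION": "9.4.2",
    "PYTORCH_ALLOC_CONF": "expandable_segments:True",
}


def apply_rocm_env(env: MutableMapping[str, str] = os.environ) -> None:
    """Set each variable unless the caller's environment already defines it."""
    for name, value in ROCM_ENV.items():
        env.setdefault(name, value)


apply_rocm_env()
# Import torch only after this point so the settings are seen at initialization.
```

Using setdefault keeps any value exported by the shell, which makes it easy to override a setting for a single run.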

FP8 TRAINING:

Checked: torch._scaled_mm exists, float8_e4m3fn exists
BUT: “Float8_e4m3fn is only supported for ROCm 6.5 and above”
We have ROCm 7.2, but PyTorch was built against ROCm 6.4
So FP8 is NOT available
With FP8, we could get ~2x throughput (similar to H100 FP8)
Would need PyTorch built with ROCm 7.2 support

DDP (Distributed Data Parallel):

Single GPU, so DDP is not used
If multi-GPU: would need backend="nccl" (on ROCm, PyTorch maps this backend name to RCCL)
The code checks for RANK/WORLD_SIZE env vars
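
A check on RANK and WORLD_SIZE of the kind described above might look like the following. This is an illustrative sketch, not nanochat's code; detect_ddp and DistInfo are made-up names.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistInfo(NamedTuple):
    enabled: bool
    rank: int
    world_size: int


def detect_ddp(env: Optional[Mapping[str, str]] = None) -> DistInfo:
    """Report a distributed run only when a launcher set both variables."""
    env = os.environ if env is None else env
    if "RANK" in env and "WORLD_SIZE" in env:
        return DistInfo(True, int(env["RANK"]), int(env["WORLD_SIZE"]))
    return DistInfo(False, 0, 1)  # single-process default


print(detect_ddp({}))                                # plain `python` launch
print(detect_ddp({"RANK": "1", "WORLD_SIZE": "8"}))  # launched via torchrun
```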

COMPILATION:

nanochat uses torch.compile internally
First step is slow (17.5s) due to JIT compilation
Subsequent steps benefit from compiled kernels
ROCm’s torch.compile works but may generate suboptimal code

============================================================

DATA PIPELINE

DATASET: ClimbMix-400B

URL: huggingface.co/datasets/karpathy/climbmix-400b-shuffle
Format: Parquet shards (~810M tokens each)
We downloaded 30 train shards + 1 val shard = 31 total
~25B tokens available (more than the 15.2B needed)
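
A quick budget check confirms the download is sufficient. The sketch uses the approximate shard size quoted above, so the results are estimates.

```python
import math

TOKENS_PER_SHARD = 810_000_000   # approximate, as quoted above
TRAIN_SHARDS = 30
TARGET_TOKENS = 15_200_000_000   # 20 x 760M

shards_needed = math.ceil(TARGET_TOKENS / TOKENS_PER_SHARD)
train_tokens = TRAIN_SHARDS * TOKENS_PER_SHARD
fraction_used = TARGET_TOKENS / train_tokens

print(f"shards needed: {shards_needed} of {TRAIN_SHARDS}")
print(f"training tokens on disk: {train_tokens / 1e9:.1f}B "
      f"({fraction_used:.0%} consumed by the run)")
```

The run therefore sees each training token at most once, with roughly a third of the downloaded shards left unused.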

TOKENIZER: BPE (Byte Pair Encoding)

Trained on the ClimbMix data
Vocab size: 32,768
Training time: 49.77 seconds
Saved to: ~/.cache/nanochat/tokenizer/

DATA LOADING:

nanochat uses a streaming dataloader
Reads parquet files on-the-fly (no full dataset in RAM)
Supports distributed reading (DDP-safe)
Last shard is always the validation set

============================================================

CHECKPOINT & MONITORING

CHECKPOINTS (every 5000 steps):
~/.cache/nanochat/base_checkpoints/d24/
- model_XXXXX.pt - Model weights (~4 GB each)
- optim_XXXXX_rank0.pt - Optimizer state (~5.7 GB each)
- meta_XXXXX.json - Training metadata

MONITORING:

Training log: /root/nanochat/run_mi300x_d24.log
MLflow tracking (local file store)
Live metrics: loss, learning rate, MFU, tok/sec

RESUME FROM CHECKPOINT:
./run_mi300x_d24_pretrain.sh --resume-from-step=5000

============================================================

WHAT HAPPENS AFTER TRAINING

The full pipeline (run_mi300x_d24.sh) includes:

Step 4: BASE EVALUATION

Evaluates on DCLM CORE benchmark
Measures bits-per-byte (bpb) on various tasks
Target: match GPT-2 CORE score (0.2565)

Step 5: SFT (Supervised Fine-Tuning)

Downloads identity_conversations.jsonl
Fine-tunes base model on chat data
3000 steps with smaller batch (262,144)

Step 6: CHAT EVALUATION

Tests the SFT model on chat benchmarks

Step 7: INTERACTIVE CHAT

python -m scripts.chat_cli -p 'Why is the sky blue?'
python -m scripts.chat_web   # Web UI on localhost

============================================================

FILES CREATED

/root/nanochat/run_mi300x_d24.sh
- Full pipeline: data → tokenizer → pretrain → eval → SFT → chat

/root/nanochat/run_mi300x_d24_pretrain.sh
- Pretrain-only (for resuming or partial runs)

/root/nanochat/run_mi300x_d24.log
- Live training log (tail -f to monitor)

~/.cache/nanochat/base_data_climbmix/
- 31 parquet shards (~25B tokens)

~/.cache/nanochat/tokenizer/
- BPE tokenizer (32,768 vocab)

~/.cache/nanochat/base_checkpoints/d24/
- Model checkpoints (every 5K steps)

============================================================

POTENTIAL IMPROVEMENTS

To speed up training:

1. Install ROCm flash attention
- pip install flash-attn   # may need ROCm build flags
- Would enable --window-pattern SSSL (sliding window)
- Expected: 40-50% MFU (vs current 27%)

2. Use FP8 when available
- Need PyTorch built with ROCm 7.2 support
- Expected: ~2x throughput boost

3. Increase device_batch_size
- Current: 32, uses 105 GB / 192 GB
- Could try: 64, which gives 4 accumulation steps (may OOM due to SDPA memory); 48 does not divide the 524,288-token batch evenly
- Would reduce gradient accumulation overhead

4. Multi-GPU training
- torchrun --nproc_per_node=N -m scripts.base_train
- Should scale throughput close to linearly

5. Use total_batch_size=1048576
- Halves the number of steps (14.5K instead of 29K)
- Same total tokens, fewer optimizer steps
- May slightly change training dynamics
