NanoChat 760M on MI300X Technical Breakdown | Generated by AI

2026.06

Here’s the full technical breakdown of every decision made:

============================================================
COMPREHENSIVE SUMMARY: NANOCHAT 760M ON MI300X
============================================================

  1. ENVIRONMENT DISCOVERY

First, I explored the system to understand what we’re working with:

Host:    snapshots-gpu-mi300x-192gb-devcloud-atl1 (DigitalOcean GPU droplet)
OS:      Ubuntu 24.04.4 LTS
CPU:     Intel Xeon Platinum 8568Y+ (20 cores, QEMU VM)
RAM:     235 GB
GPU:     AMD Instinct MI300X (192 GB VRAM, gfx942)
ROCm:    7.2.0 installed at /opt/rocm-7.2.0
PyTorch: 2.9.1+rocm6.4 (system-wide, pre-installed)
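The listing above shows a PyTorch wheel tagged for ROCm 6.4 on a host with ROCm 7.2.0 installed. A minimal sketch of how that mismatch could be detected from the version strings alone; the helper names `rocm_build_of` and `major_minor` are hypothetical, and the two strings are copied from the listing rather than queried from a live system.

```python
import re


def rocm_build_of(torch_version):
    """Return the (major, minor) ROCm version a PyTorch wheel is tagged with, or None."""
    match = re.search(r"\+rocm(\d+)\.(\d+)", torch_version)
    if match is None:
        return None  # e.g. a CUDA or CPU-only wheel
    return int(match.group(1)), int(match.group(2))


def major_minor(version):
    """Reduce a dotted version such as '7.2.0' to (7, 2)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


wheel_rocm = rocm_build_of("2.9.1+rocm6.4")   # string from the environment listing
system_rocm = major_minor("7.2.0")            # ROCm installed under /opt/rocm-7.2.0

if wheel_rocm != system_rocm:
    print(f"PyTorch wheel targets ROCm {wheel_rocm}, system has ROCm {system_rocm}")
```

On a live machine the first string would normally come from `torch.__version__`.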

Found existing projects on disk:
/root/llama.cpp/ - Built with ROCm HIP for MI300X, served Qwen 122B
/root/nanoGPT/   - Trained GPT-2 760M (got 108-113% MFU on MI300X)
/root/zz/        - Training pipeline with logs, inference audits
/root/nanochat/  - Freshly cloned, not yet set up

The nanoGPT 760M training logs showed:

============================================================

  2. NANOCHAT vs NANOGPT - WHY NANOCHAT

nanochat is Karpathy’s successor to nanoGPT (deprecated Nov 2025). Key differences:

nanoGPT: nanochat:

nanochat trades raw kernel efficiency for a complete pipeline. The MFU is lower because it uses PyTorch’s SDPA instead of hand-written CUDA kernels, but you get tokenizer training, evaluation, SFT, and a chat UI out of the box.

============================================================

  3. FLASH ATTENTION DECISION

This was a critical architectural decision. Here’s the full story:

FLASH ATTENTION 3 (FA3):

FLASH ATTENTION 2:

WHAT WE USE INSTEAD - PyTorch SDPA:

IMPACT ON TRAINING:

WHAT WOULD IMPROVE THIS:
Option A: Install flash-attn for ROCm (risky, may not compile)
Option B: Wait for ROCm-native flash attention (AMD is working on it)
Option C: Use the Composable Kernel (CK) flash attention from ROCm
Option D: Accept 27% MFU and train longer (~62 hours vs ~20 hours)

We went with Option D for reliability. The MI300X has 192 GB VRAM so we’re not memory-constrained - we’re compute-limited by SDPA.
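A fallback of this kind is often implemented by probing for an optional package and defaulting to SDPA when it is missing. The following is a minimal sketch of that pattern, not nanochat's actual selection logic, which this breakdown does not show. The function name `pick_attention_backend` and the returned labels are hypothetical; `flash_attn` is the import name of the flash-attn package.

```python
import importlib.util


def pick_attention_backend(preferred=("flash_attn",)):
    """Return the first importable optional backend, else fall back to 'sdpa'.

    Only checks that the module can be found; it does not verify that the
    package was actually built for the local GPU.
    """
    for module_name in preferred:
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    return "sdpa"  # PyTorch's built-in scaled_dot_product_attention path


print("attention backend:", pick_attention_backend())
```

A real check would also need a small smoke test on the device, since a package can import cleanly and still lack kernels for gfx942.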

============================================================

  4. MODEL ARCHITECTURE DECISIONS

TARGET: GPT-2 760M (matching the nanoGPT run)

nanochat auto-scales everything from --depth:
--depth=24 (number of transformer layers)
--aspect-ratio=64 (default, controls width)
--head-dim=128 (default, attention head size)

CALCULATION:
model_dim = depth × aspect_ratio = 24 × 64 = 1536
n_heads   = model_dim / head_dim = 1536 / 128 = 12
n_layers  = depth = 24
ffn_dim   = 4 × model_dim = 6144

This gives: n_layer=24, n_head=12, n_kv_head=12, n_embd=1536
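The calculation above can be written as a small function. This is a sketch of the arithmetic as described here, not nanochat's source; the name `derive_config` and the dictionary keys are chosen for illustration.

```python
def derive_config(depth, aspect_ratio=64, head_dim=128):
    """Derive model dimensions from depth, following the calculation above."""
    model_dim = depth * aspect_ratio
    if model_dim % head_dim != 0:
        raise ValueError("model_dim must be divisible by head_dim")
    return {
        "n_layer": depth,
        "n_embd": model_dim,
        "n_head": model_dim // head_dim,
        "ffn_dim": 4 * model_dim,
    }


print(derive_config(24))
# {'n_layer': 24, 'n_embd': 1536, 'n_head': 12, 'ffn_dim': 6144}
```

Calling `derive_config(24, head_dim=64)` yields 24 heads at the same width, which shows that the head count follows from `head_dim` alone.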

NOTE: The original nanoGPT 760M used n_head=24 (head_dim=64), but nanochat uses head_dim=128 by default. The total parameter count is similar because:

PARAMETER BREAKDOWN:
wte (word embeddings):        50,331,648  (32768 vocab × 1536 dim)
value_embeds:                603,979,776  (nanochat innovation)
lm_head (output projection):  50,331,648
transformer_matrices:        679,478,976  (the actual GPT layers)
scalars:                              74  (resid_lambdas, x0_lambdas)
TOTAL:                     1,384,122,122  (~1.38B)
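Most of this breakdown can be cross-checked from the dimensions alone. The sketch below assumes bias-free layers with four d×d attention projections and a 4x MLP (12·d² weights per layer); that accounting is an assumption, not something the breakdown states. It lands 1,728 parameters short of the reported transformer_matrices figure, so a few small tensors are presumably not covered by the estimate.

```python
VOCAB, DIM, LAYERS = 32_768, 1_536, 24

# Figures reported in the breakdown above.
REPORTED = {
    "wte": 50_331_648,
    "value_embeds": 603_979_776,
    "lm_head": 50_331_648,
    "transformer_matrices": 679_478_976,
    "scalars": 74,
}

embedding_table = VOCAB * DIM                            # one vocab x dim table
value_tables = REPORTED["value_embeds"] // embedding_table

# ASSUMPTION: 4*d^2 (Q, K, V, O) + 8*d^2 (MLP up and down), no biases.
matrices_estimate = LAYERS * 12 * DIM * DIM

print("embedding table:", embedding_table)
print("value-embedding tables of that size:", value_tables)
print("estimated transformer matrices:", matrices_estimate)
print("reported minus estimate:", REPORTED["transformer_matrices"] - matrices_estimate)
print("sum of reported rows:", sum(REPORTED.values()))
```

The value embeddings work out to exactly 12 vocabulary-sized tables, one for every second layer of the 24, and the reported rows do sum to the stated total.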

The “value_embeds” is a nanochat-specific feature:

============================================================

  5. HYPERPARAMETER DECISIONS

A. BATCH SIZE: 524,288 tokens/step

How I chose this:

B. SEQUENCE LENGTH: 2048

C. WINDOW PATTERN: L (full attention)

D. NUMBER OF ITERATIONS: 29,000

Chinchilla-optimal scaling:
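The derivation itself is not preserved here, so the following is a reconstruction under stated assumptions: the common rule of thumb of about 20 training tokens per parameter, applied to the 760M target size. Both the ratio and the choice of 760M as the counted parameters are assumptions. The result is consistent with the 29,000 steps chosen.

```python
TOKENS_PER_PARAM = 20        # ASSUMPTION: Chinchilla-style rule of thumb
TARGET_PARAMS = 760e6        # ASSUMPTION: the "760M" target is the counted size
TOKENS_PER_STEP = 524_288    # batch size from item A

tokens_needed = TOKENS_PER_PARAM * TARGET_PARAMS
steps = tokens_needed / TOKENS_PER_STEP

print(f"tokens needed: {tokens_needed / 1e9:.1f}B")
print(f"steps: {steps:,.0f}")
print(f"tokens seen in 29,000 steps: {29_000 * TOKENS_PER_STEP:,}")
```

Rounding the step count to 29,000 gives a budget of about 15.2B tokens.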

E. LEARNING RATES (auto-scaled by nanochat):

F. EVALUATION SETTINGS:

============================================================

  6. TRAINING SPEED ANALYSIS

MEASURED PERFORMANCE:
First step (compilation): 17.5 seconds (JIT warmup)
Subsequent steps:         ~7.7 seconds each
Throughput:               ~68,000 tokens/sec
MFU:                      ~27.5% (bf16)
Peak VRAM:                105 GB / 192 GB (55%)
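The throughput and MFU figures can be reproduced approximately from numbers quoted elsewhere in this breakdown. The sketch assumes the usual estimate of 6 FLOPs per matmul parameter plus 12·n_layer·n_embd·seq_len per token for attention, counts only the transformer matrices and lm_head (embedding lookups excluded), and takes 1307.4 TFLOPS as the MI300X dense bf16 peak from AMD's published specification. Whether the training logger uses exactly this accounting is an assumption.

```python
TRANSFORMER_MATRICES = 679_478_976   # from the parameter breakdown
LM_HEAD = 50_331_648                 # from the parameter breakdown
N_LAYER, N_EMBD, SEQ_LEN = 24, 1_536, 2_048
TOKENS_PER_STEP = 524_288
SECONDS_PER_STEP = 7.7

# ASSUMPTION: AMD's published dense bf16 peak for the MI300X.
PEAK_FLOPS = 1307.4e12


def flops_per_token(matmul_params, n_layer, n_embd, seq_len):
    """Training FLOPs per token: 6 per matmul weight plus the attention term."""
    return 6 * matmul_params + 12 * n_layer * n_embd * seq_len


tokens_per_sec = TOKENS_PER_STEP / SECONDS_PER_STEP
achieved_flops = tokens_per_sec * flops_per_token(
    TRANSFORMER_MATRICES + LM_HEAD, N_LAYER, N_EMBD, SEQ_LEN
)
mfu = achieved_flops / PEAK_FLOPS

print(f"{tokens_per_sec:,.0f} tokens/sec")
print(f"{achieved_flops / 1e12:.0f} TFLOPS achieved")
print(f"MFU {mfu:.1%}")
```

Under these assumptions the estimate comes out close to the measured 27.5%.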

TIME ESTIMATE:
29,000 steps × 7.7 sec/step = 223,300 seconds = 3,722 minutes = 62 hours ≈ 2.6 days
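The same arithmetic can be extended to ask how long the run would take at a different MFU. The sketch assumes wall-clock time scales inversely with MFU for a fixed workload, which ignores compilation, evaluation and checkpoint time; the function names are illustrative.

```python
def training_hours(steps, sec_per_step):
    """Total wall-clock hours for a fixed number of equal-length steps."""
    return steps * sec_per_step / 3600


def hours_at_mfu(baseline_hours, baseline_mfu, target_mfu):
    # ASSUMPTION: same total FLOPs, so time is inversely proportional to MFU.
    return baseline_hours * baseline_mfu / target_mfu


baseline = training_hours(29_000, 7.7)
print(f"baseline at 27.5% MFU: {baseline:.1f} h")
for target in (0.40, 0.50):
    print(f"at {target:.0%} MFU: {hours_at_mfu(baseline, 0.275, target):.1f} h")
```

Under that assumption, 40-50% MFU corresponds to roughly 34 to 43 hours. Reaching about 20 hours would take an MFU near 85% on this basis, so the 20-hour figure quoted earlier presumably rests on a different baseline.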

WHY MFU IS 27% (not 50%+):

  1. SDPA fallback (no fused attention kernels)
    • FA3 on H100: fused, vectorized, pipelined
    • SDPA on MI300X: separate matmuls + softmax + dropout
  2. Value embeddings add ~604M extra params
    • More memory bandwidth for embedding lookups
    • Not as compute-dense as pure matmuls
  3. Gradient accumulation (8 micro-batches)
    • Each micro-batch has kernel launch overhead
    • Overhead × 8 = significant time
  4. Model is small for the GPU
    • 760M params on 192 GB GPU = not fully utilized
    • Larger models (7B+) would get better MFU

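COMPARISON WITH NANOGPT:
nanoGPT 760M on same MI300X:  108-113% MFU (as logged; a value above 100% means the logger's reference peak is lower than the MI300X's actual peak)
nanochat 760M on same MI300X: 27.5% MFU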
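One caveat on this comparison, offered as an assumption rather than something the logs confirm: nanoGPT's `estimate_mfu` helper normalizes against a hard-coded A100 bf16 peak of 312 TFLOPS. If the /root/nanoGPT run used that helper unmodified, its 108-113% figure can be converted to the MI300X's own peak (taken here as 1307.4 TFLOPS dense bf16, from AMD's published specification) as follows.

```python
A100_BF16_PEAK = 312e12       # reference peak hard-coded in nanoGPT's estimate_mfu
MI300X_BF16_PEAK = 1307.4e12  # ASSUMPTION: AMD's published dense bf16 peak


def renormalize_mfu(mfu, old_peak, new_peak):
    """Re-express an MFU measured against old_peak as a fraction of new_peak."""
    return mfu * old_peak / new_peak


for logged in (1.08, 1.13):
    converted = renormalize_mfu(logged, A100_BF16_PEAK, MI300X_BF16_PEAK)
    print(f"{logged:.0%} of A100 peak = {converted:.1%} of MI300X peak")
```

On that basis the nanoGPT run sits at roughly 26-27% of MI300X peak. That is close to nanochat's 27.5%, so the two percentages are probably not directly comparable as logged.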

The difference is:

============================================================

  7. ROCm-SPECIFIC CONSIDERATIONS

ENVIRONMENT VARIABLES SET:

HIP_FORCE_DEV_KERNARG=1
- Forces HIP to use kernel arguments in device memory
- Can improve performance on some ROCm versions

HSA_OVERRIDE_GFX_VERSION=9.4.2
- Tells HIP/HSA to use gfx942 target (MI300X)
- Ensures correct ISA is used

PYTORCH_ALLOC_CONF=expandable_segments:True
- PyTorch memory allocator uses expandable segments
- Reduces fragmentation for large allocations
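These variables are usually exported in the launch script, but they can also be set from Python as long as that happens before PyTorch initializes the GPU runtime. A minimal sketch, with a hypothetical helper name `apply_rocm_env`; it uses `setdefault` so that values already exported by the shell take precedence.

```python
import os

# Values copied from the list above.
ROCM_ENV = {
    "HIP_FORCE_DEV_KERNARG": "1",
    "HSA_OVERRIDE_GFX_VERSION": "9.4.2",
    "PYTORCH_ALLOC_CONF": "expandable_segments:True",
}


def apply_rocm_env(env=os.environ):
    """Set each variable unless the caller's environment already defines it."""
    for name, value in ROCM_ENV.items():
        env.setdefault(name, value)
    return env


apply_rocm_env()
# To be safe, import torch only after this point so the settings are visible
# when the HIP runtime and the caching allocator start up.
```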

FP8 TRAINING:

DDP (Distributed Data Parallel):

COMPILATION:

============================================================

  8. DATA PIPELINE

DATASET: ClimbMix-400B

TOKENIZER: BPE (Byte Pair Encoding)

DATA LOADING:

============================================================

  9. CHECKPOINT & MONITORING

CHECKPOINTS (every 5000 steps):
~/.cache/nanochat/base_checkpoints/d24/
    model_XXXXX.pt       - Model weights (~4 GB each)
    optim_XXXXX_rank0.pt - Optimizer state (~5.7 GB each)
    meta_XXXXX.json      - Training metadata

MONITORING:

RESUME FROM CHECKPOINT:
./run_mi300x_d24_pretrain.sh --resume-from-step=5000
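To pick the step for `--resume-from-step` without inspecting the directory by hand, the most recent checkpoint can be found from the file names. This is a sketch that assumes the `XXXXX` in `model_XXXXX.pt` is a decimal step number; the helper name `latest_checkpoint_step` is hypothetical.

```python
import re
from pathlib import Path

CHECKPOINT_DIR = Path("~/.cache/nanochat/base_checkpoints/d24").expanduser()


def latest_checkpoint_step(checkpoint_dir):
    """Return the highest step that has a model_<step>.pt file, or None."""
    pattern = re.compile(r"model_(\d+)\.pt")
    steps = []
    for path in Path(checkpoint_dir).glob("model_*.pt"):
        match = pattern.fullmatch(path.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps, default=None)


step = latest_checkpoint_step(CHECKPOINT_DIR)
if step is None:
    print("no checkpoint found; start from scratch")
else:
    print(f"./run_mi300x_d24_pretrain.sh --resume-from-step={step}")
```

A more careful version would also confirm that the matching optimizer and metadata files exist for that step.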

============================================================

  10. WHAT HAPPENS AFTER TRAINING

The full pipeline (run_mi300x_d24.sh) includes:

Step 4: BASE EVALUATION

Step 5: SFT (Supervised Fine-Tuning)

Step 6: CHAT EVALUATION

Step 7: INTERACTIVE CHAT
python -m scripts.chat_cli -p 'Why is the sky blue?'
python -m scripts.chat_web # Web UI on localhost

============================================================

  11. FILES CREATED

/root/nanochat/run_mi300x_d24.sh
    Full pipeline: data → tokenizer → pretrain → eval → SFT → chat

/root/nanochat/run_mi300x_d24_pretrain.sh
    Pretrain-only (for resuming or partial runs)

/root/nanochat/run_mi300x_d24.log
    Live training log (tail -f to monitor)

~/.cache/nanochat/base_data_climbmix/
    31 parquet shards (~25B tokens)

~/.cache/nanochat/tokenizer/
    BPE tokenizer (32,768 vocab)

~/.cache/nanochat/base_checkpoints/d24/
    Model checkpoints (every 5K steps)

============================================================

  12. POTENTIAL IMPROVEMENTS

To speed up training:

  1. Install ROCm flash attention
    • pip install flash-attn # may need ROCm build flags
    • Would enable --window-pattern SSSL (sliding window)
    • Expected: 40-50% MFU (vs current 27%)

  2. Use FP8 when available
    • Need PyTorch built with ROCm 7.2 support
    • Expected: ~2x throughput boost

  3. Increase device_batch_size
    • Current: 32, uses 105 GB / 192 GB
    • Could try: 48 or 64 (may OOM due to SDPA memory)
    • Would reduce gradient accumulation overhead

  4. Multi-GPU training
    • torchrun --nproc_per_node=N -m scripts.base_train
    • Would scale throughput roughly linearly with N, less communication overhead

  5. Use total_batch_size=1048576
    • Halves the number of steps (14.5K instead of 29K)
    • Same total tokens, fewer optimizer steps
    • May slightly change training dynamics
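Items 3 and 5 both change how many micro-batches make up one optimizer step. The sketch below shows that arithmetic, assuming the trainer requires the total batch to be a whole multiple of the tokens processed per forward/backward pass; the function name and the strict divisibility check are illustrative, not quoted from nanochat.

```python
def grad_accum_steps(total_batch_tokens, device_batch_size, seq_len, world_size=1):
    """Micro-batches per optimizer step for a given per-device batch size."""
    tokens_per_pass = device_batch_size * seq_len * world_size
    steps, remainder = divmod(total_batch_tokens, tokens_per_pass)
    if remainder:
        raise ValueError(
            f"{total_batch_tokens} tokens is not a multiple of {tokens_per_pass}"
        )
    return steps


SEQ_LEN = 2_048
print(grad_accum_steps(524_288, 32, SEQ_LEN))    # current setup
print(grad_accum_steps(524_288, 64, SEQ_LEN))    # item 3 with device_batch_size=64
print(grad_accum_steps(1_048_576, 32, SEQ_LEN))  # item 5 with the current device batch

total_tokens = 29_000 * 524_288
print(total_tokens // 1_048_576)                 # optimizer steps after doubling the batch
```

With these numbers a device batch of 64 halves the accumulation from 8 to 4 micro-batches. A device batch of 48 does not divide 524,288 tokens evenly, so it would need a different total batch size under this assumption.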

