Nanochat Training History | Generated by AI

2026.06

Here’s the full picture of your previous nanochat training:

=== TRAINING HISTORY SUMMARY ===

You ran multiple training experiments. Here’s what happened:

— 1. d8 base model (2-node CPU distributed, Jun 2) —

— 2. d12 base model (RTX 4070, Jun 7-10) —

— 3. d12-fresh-fa2 (RTX 4070, Jun 10) —

— 4. d4 base model (2-node CPU, Jun 11) —

— 5. d8 SFT (RTX 4070, Jun 2) —

=== RESULTS ANALYSIS ===

The d12 model was your main training effort. Results:

CORE metric (d8, step 5k): 0.0815 (very weak, random is ~0)
Target (GPT-2): 0.2565

The d12 model reached val_bpb=0.78, which is decent for a 12-layer model, but it was still far from GPT-2 capability (CORE 0.2565). The d8 model was too small to be useful. SFT on d8 produced a ChatCORE of 0.0037, which is essentially random: the base model wasn’t good enough for SFT to help.
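To put the gap in numbers, here is a minimal sketch that compares the reported d8 CORE score with the GPT-2 target. The two values come from the results above; the helper names (`fraction_of_target`, `gap_to_target`) are made up for illustration and are not part of nanochat.

```python
# Values reported in the results above.
GPT2_CORE_TARGET = 0.2565   # GPT-2 CORE score used as the target
D8_CORE_STEP_5K = 0.0815    # d8 base model at step 5k


def fraction_of_target(score: float, target: float = GPT2_CORE_TARGET) -> float:
    """Return the score as a fraction of the target (hypothetical helper)."""
    return score / target


def gap_to_target(score: float, target: float = GPT2_CORE_TARGET) -> float:
    """Return how many CORE points are still missing (hypothetical helper)."""
    return target - score


d8_fraction = fraction_of_target(D8_CORE_STEP_5K)
d8_gap = gap_to_target(D8_CORE_STEP_5K)

print(f"d8 reaches {d8_fraction:.1%} of the GPT-2 target")
print(f"remaining gap: {d8_gap:.4f} CORE points")
```

By this measure the d8 model sits at roughly a third of the GPT-2 target. ChatCORE is a separate metric, so the 0.0037 figure is deliberately left out of this comparison.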

For reference, Karpathy’s speedrun uses depth=24 with 8xH100 and reaches GPT-2 level in ~2 hours. Your RTX 4070 (12GB VRAM) can only fit smaller models. The d12 on RTX 4070 took 65 hours and still didn’t reach GPT-2 level.
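A rough way to compare the two runs is in GPU-hours. The sketch below uses only the figures quoted above (8 GPUs for about 2 hours versus 1 GPU for 65 hours); the `gpu_hours` helper is hypothetical. Treat the result as a loose comparison: one H100-hour delivers far more compute than one RTX 4070-hour, and the two runs trained models of different depth.

```python
def gpu_hours(num_gpus: int, wall_clock_hours: float) -> float:
    """Total GPU-hours for a run (hypothetical helper, ignores GPU type)."""
    return num_gpus * wall_clock_hours


# Speedrun reference: 8x H100 for roughly 2 hours (approximate figure).
speedrun_gpu_hours = gpu_hours(8, 2.0)

# Local d12 run: a single RTX 4070 for 65 hours.
local_gpu_hours = gpu_hours(1, 65.0)

print(f"speedrun: {speedrun_gpu_hours:.0f} H100-hours")
print(f"local d12: {local_gpu_hours:.0f} RTX 4070-hours")
```

The local run spent about four times as many GPU-hours as the speedrun, on a much slower card and a smaller model, which is consistent with it ending up well short of GPT-2 level.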

Want to try fine-tuning the d12 checkpoint with SFT, or train a new model?

