GPT-2 FineWeb Training Summary | Generated by AI


Here’s what’s trained in this folder:

| Folder | Architecture | Size | Dataset | Iters | Best val loss |
|---|---|---|---|---|---|
| out-fineweb/ckpt.pt | 12L / 12H / 768d, block 1024 | GPT-2 124M | FineWeb | 11,500 | 3.32 |
| out-wikipedia/ckpt.pt | 6L / 6H / 384d, block 512 | ~30M | Wikipedia (openwebtext_local) | 18,000 | 4.05 |
| out-helloworld/ckpt.pt | 6L / 6H / 384d, block 256 | ~10M (char-vocab 65) | shakespeare_char | 500 | 1.77 |
| out-shakespeare-char/ckpt.pt | 6L / 6H / 384d, block 256 | ~10M (char-vocab 65) | shakespeare_char | 500 | 1.71 |
| out-shakespeare/ | (empty) | – | – | – | – |

So yes: out-fineweb/ckpt.pt is the GPT-2 124M architecture (12 layers, 12 heads, 768-dim embeddings, 1024 block size, vocab 50304), trained from scratch on FineWeb to iteration 11,500. The ~1.4 GB file size matches a 124M-parameter model saved together with its optimizer state.
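A quick way to confirm this is to open the checkpoint and compare its metadata and on-disk size against the table. The sketch below assumes a nanoGPT-style checkpoint dict with 'model_args', 'iter_num', and 'best_val_loss' keys; those key names are an assumption about how the file was saved, not something verified here.

```python
# Minimal sketch: inspect the checkpoint and relate its size to the parameter count.
# Assumes a nanoGPT-style dict with 'model_args', 'iter_num', 'best_val_loss' keys
# (an assumption about the save format, not verified here).
import os
import torch

path = "out-fineweb/ckpt.pt"
ckpt = torch.load(path, map_location="cpu")

print(ckpt["model_args"])                       # expect n_layer=12, n_head=12, n_embd=768, block_size=1024, vocab_size=50304
print(ckpt["iter_num"], ckpt["best_val_loss"])  # expect 11500 and ~3.32

# fp32 weights (~124M * 4 bytes ~= 0.5 GB) plus two AdamW moment buffers of the
# same shape lands around 1.5 GB, consistent with a ~1.4 GB file on disk.
print(f"on disk: {os.path.getsize(path) / 1e9:.2f} GB")
```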

The Shakespeare-char and helloworld checkpoints are the small ~10M character-level models, and the wikipedia one is a ~30M mid-size model; none of these is GPT-2 124M.
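These headline sizes follow from the configs in the table. As a rough check, the usual GPT-2 estimate of 12·L·d² parameters for the transformer blocks plus (vocab + block)·d for the embeddings reproduces them; the Wikipedia run is assumed here to use the 50304 BPE vocabulary, which is what makes it roughly three times larger than the char-level models.

```python
# Rough GPT-2 parameter estimate: 12 * n_layer * n_embd^2 for the blocks,
# plus (vocab_size + block_size) * n_embd for token/position embeddings.
def approx_params(n_layer: int, n_embd: int, vocab_size: int, block_size: int) -> int:
    return 12 * n_layer * n_embd ** 2 + (vocab_size + block_size) * n_embd

print(f"{approx_params(12, 768, 50304, 1024) / 1e6:.1f}M")  # fineweb          -> ~124.4M
print(f"{approx_params(6,  384, 50304,  512) / 1e6:.1f}M")  # wikipedia        -> ~30.1M (BPE vocab assumed)
print(f"{approx_params(6,  384,    65,  256) / 1e6:.1f}M")  # shakespeare_char -> ~10.7M
```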

