Training Dataset and Token Statistics | Generated by AI

2026.04

Dataset sizes

File               Size     Contents
train_fineweb.txt  16 GB    89.6M lines of raw text (sample doc above is news/web text)
train.bin          7.36 GB  3.68B GPT-2 BPE tokens (uint16)
val.bin            12 KB    5,675 tokens; tiny, which is why val loss is jittery
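
The sizes above can be sanity-checked against the token counts, since each uint16 token takes 2 bytes. The following is a minimal sketch, assuming the .bin files are raw token arrays with no header (the usual layout for this kind of file, but not confirmed here). The helper names `token_count` and `expected_bytes` are hypothetical, introduced only for illustration.

```python
import os

# Assumption: tokens are stored as raw uint16 values with no file header.
BYTES_PER_TOKEN = 2


def token_count(path: str) -> int:
    """Return the number of uint16 tokens in a raw .bin file."""
    size = os.path.getsize(path)
    if size % BYTES_PER_TOKEN != 0:
        raise ValueError(f"{path}: size {size} is not a multiple of {BYTES_PER_TOKEN}")
    return size // BYTES_PER_TOKEN


def expected_bytes(n_tokens: int) -> int:
    """Return the file size in bytes implied by a token count."""
    return n_tokens * BYTES_PER_TOKEN


# Arithmetic check against the figures in the table.
print(expected_bytes(3_680_000_000) / 1e9)  # 7.36 (GB, matches train.bin)
print(expected_bytes(5_675))                # 11350 (bytes, consistent with ~12 KB for val.bin)
```

Calling `token_count("train.bin")` on the real file should return roughly 3.68 billion if the no-header assumption holds. A mismatch would suggest a header or a different dtype.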

Training coverage

Notes

