Tokenization and File Compression Ratios Explained | Generated by AI
Yes, this is completely normal! Here’s why:
The Math
Raw text → Tokenized (uint16) conversion:
Original: "The cat sat on the mat"
Chars: 22 characters × 1 byte = 22 bytes
Tokens: 6 tokens × 2 bytes = 12 bytes (GPT-2 BPE: "The", " cat", " sat", " on", " the", " mat")
Compression ratio: ~2x from text to uint16 tokens (English averages ~4 chars/token, and each token takes 2 bytes)
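The arithmetic above can be checked with a few lines of Python. The GPT-2 token count (6) is taken as given here; with `tiktoken` installed you could verify it via `len(tiktoken.get_encoding("gpt2").encode(text))`:

```python
# Back-of-envelope check of the example above.
text = "The cat sat on the mat"

raw_bytes = len(text.encode("utf-8"))   # ASCII text: 1 byte per character -> 22
n_tokens = 6                            # assumed GPT-2 BPE token count
token_bytes = n_tokens * 2              # uint16 storage: 2 bytes per token -> 12

print(raw_bytes, token_bytes, raw_bytes / token_bytes)  # 22 12 ~1.8x
```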
Your Numbers
| File | Size | Token Count | Calculation |
|---|---|---|---|
| Raw text (full) | 61 GB | ~15B tokens* | 61 GB ÷ ~4 chars/token |
| Raw text (sample) | 477 MB | ~114M tokens | 477 MB ÷ ~4 chars/token |
| `train.bin` | 27 MB | ~14M tokens | 27 MB ÷ 2 bytes/token |
| `val.bin` | 191 MB | ~100M tokens | 191 MB ÷ 2 bytes/token |
*Assuming ~4 chars/token average for English text with GPT-2 tokenizer
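The table's estimates can be reproduced with two small helpers. These are hypothetical functions written for illustration; sizes are interpreted as MiB/GiB (1 MiB = 2^20 bytes), which is how 191 MB lines up with ~100M tokens:

```python
def tokens_from_text(size_bytes: float, chars_per_token: float = 4.0) -> float:
    """Rough token estimate from raw UTF-8 text size (~4 chars/token for English)."""
    return size_bytes / chars_per_token

def tokens_from_bin(size_bytes: int, bytes_per_token: int = 2) -> float:
    """Exact token count of a uint16 .bin file: 2 bytes per token."""
    return size_bytes / bytes_per_token

MiB = 2**20
print(tokens_from_text(61e9) / 1e9)       # ~15.25B tokens for the full 61GB corpus
print(tokens_from_bin(191 * MiB) / 1e6)   # ~100M tokens for the 191MB val.bin
```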
Why So Small?
- Text → Tokens: English averages ~4 characters per token, so the token count is roughly 4x smaller than the character count (common words become single tokens)
- uint16: each token ID fits in 2 bytes, since GPT-2's vocabulary (50,257) is below 2^16 = 65,536 (vs 4-8 bytes for int32/int64)
- Partial data: you only prepared a 477MB sample of the 61GB file (less than 1%)
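The uint16 point is easy to see concretely. nanoGPT's `prepare.py` writes tokens as numpy `uint16`; the stdlib `array` module shows the same 2-bytes-per-token layout without extra dependencies (the token IDs below are placeholders, not verified GPT-2 encodings):

```python
import array

token_ids = [464, 3797, 3332, 319, 262, 2603]   # placeholder IDs, all < 65536
buf = array.array("H", token_ids)               # "H" = unsigned 16-bit integer

assert buf.itemsize == 2                        # 2 bytes per token
print(len(buf) * buf.itemsize)                  # 12 bytes for 6 tokens
```

On disk, a `.bin` file is exactly this: a flat sequence of uint16 values, which nanoGPT reads back with `np.memmap(..., dtype=np.uint16)`.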
Real-World Comparison
nanoGPT's standard Shakespeare dataset:
- Raw text: ~1MB
- Tokenized `train.bin`: ~1MB

FineWeb full (10B tokens): ~20GB tokenized (10B × 2 bytes).
So the 477MB text sample became ~218MB of .bin files (27MB `train.bin` + 191MB `val.bin`), roughly 2.2x compression. That's normal. If you prepared the full 61GB, you'd get ~30GB of .bin files total.
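Projecting the full-corpus size follows the same two-step arithmetic (~4 chars/token, 2 bytes/token); this sketch just restates it:

```python
# Estimate the tokenized size of the full 61GB corpus.
raw_gb = 61
chars_per_token = 4.0   # assumed English average for the GPT-2 tokenizer
bytes_per_token = 2     # uint16 storage

tokens_billion = raw_gb / chars_per_token        # ~15.25B tokens
bin_gb = tokens_billion * bytes_per_token        # ~30.5GB of .bin files

print(tokens_billion, bin_gb)
```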