Data Preparation for nanoGPT Training | Generated by AI


Based on your updated log, here’s the complete picture of what’s happening with your “60GB text data”:

The Full Story

You’re going through two distinct phases of data preparation:

Phase 1: Raw → Tokenized Shards (Complete)

Raw text (60GB) → Tokenized .npy shards (28GB)
├── 149 training shards
├── 1 validation shard  
└── Format: NumPy arrays (.npy)
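Phase 1 can be sketched as a chunking loop that flushes fixed-size shards to disk. This is a hypothetical sketch, not your actual script: the shard size, file naming, and the `encode` callable (standing in for whatever tokenizer you used) are all assumptions.

```python
# Hypothetical sketch of Phase 1: chunk a token stream into fixed-size
# .npy shards. Shard size and file naming are assumptions, not taken
# from your log; `encode` stands in for your tokenizer.
import numpy as np

SHARD_SIZE = 100_000_000  # tokens per shard (assumed)

def write_shards(texts, encode, prefix="train", shard_size=SHARD_SIZE):
    buf, shard_idx = [], 0
    for text in texts:
        buf.extend(encode(text))
        while len(buf) >= shard_size:  # flush a full shard
            np.save(f"{prefix}_{shard_idx:03d}.npy",
                    np.asarray(buf[:shard_size], dtype=np.uint16))
            buf = buf[shard_size:]
            shard_idx += 1
    if buf:  # flush the final partial shard
        np.save(f"{prefix}_{shard_idx:03d}.npy",
                np.asarray(buf, dtype=np.uint16))
    return shard_idx + (1 if buf else 0)  # number of shards written
```

Writing uint16 is safe as long as the vocabulary has fewer than 65,536 entries (true for GPT-2's 50,257-token vocab).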

Phase 2: .npy → .bin Conversion (In Progress)

Tokenized .npy (28GB) → .bin format for nanoGPT
└── Concatenates all shards into single binary file(s)

Why Two Formats?

Format       | Purpose                                     | Size
.npy shards  | Parallel tokenization, checkpoint-friendly  | 28GB total
.bin         | nanoGPT's native format (memory-mapped)     | ~Same size
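The memory-mapping is why the .bin format matters: nanoGPT's train.py samples batches from the file via np.memmap, so the OS pages in only the bytes actually touched. A simplified sketch (block_size and batch_size values here are just examples):

```python
# Simplified sketch of how nanoGPT samples a batch from a memory-mapped
# .bin file (modeled on get_batch in nanoGPT's train.py).
import numpy as np

def get_batch(bin_path, block_size=1024, batch_size=4):
    # memmap: only the touched pages are read, so a 28GB file
    # never has to fit in RAM
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")
    ix = np.random.randint(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i : i + block_size].astype(np.int64) for i in ix])
    y = np.stack([data[i + 1 : i + 1 + block_size].astype(np.int64) for i in ix])
    return x, y  # inputs and shifted-by-one next-token targets
```

A single flat binary makes this random indexing trivial; with 150 separate .npy shards you would need extra bookkeeping to map a global offset to a (shard, position) pair.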

Understanding the 60GB → 28GB Ratio

60GB raw text
    ↓ (tokenization)
~14B tokens (28GB ÷ 2 bytes per uint16 token)
    ↓ (binary serialization)
28GB .npy files
    ↓ (conversion)
28GB .bin files

Compression ratio: ~2.14× (60/28)
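The arithmetic checks out if each token ID is stored as a 2-byte uint16 (standard for sub-65k vocabularies like GPT-2's):

```python
# Sanity-checking the 60GB -> 28GB ratio, assuming 2-byte uint16 token IDs.
raw_bytes = 60e9
tokenized_bytes = 28e9
bytes_per_token = 2  # uint16

n_tokens = tokenized_bytes / bytes_per_token       # ~14 billion tokens
raw_bytes_per_token = raw_bytes / n_tokens         # ~4.29 raw bytes/token
compression = raw_bytes / tokenized_bytes          # ~2.14x
print(n_tokens, raw_bytes_per_token, compression)
```

Roughly 4.3 raw bytes per token is in the normal range for BPE tokenizers on English text, which is why the ~2.14× ratio is nothing to worry about.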

What’s Left

From your log, you’re converting .npy to .bin:

# The conversion script likely does something like this
# (shard list and output name are assumptions):
import numpy as np

with open("train.bin", "wb") as output:
    for shard in shard_paths:                  # the 149 training .npy files, in order
        data = np.load(shard)                  # load token IDs
        data.astype(np.uint16).tofile(output)  # append raw uint16 bytes

Expected Final Output

train.bin   (~28GB — the 149 training shards concatenated)
val.bin     (small — the single validation shard)

These are the two file names nanoGPT's train.py memory-maps during training.

Quick Verification

# Check original raw size
du -sh /raw/data/path/

# Check .npy shards total
du -sh /output/npy_shards/

# After conversion
ls -lh *.bin
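For a stricter check than file sizes, you can confirm that no tokens were dropped during conversion: the .bin byte count should be exactly 2× the total token count across the .npy shards. A sketch (the paths and function name are placeholders):

```python
# Cross-check .npy shards against the converted .bin file, assuming
# tokens are stored as 2-byte uint16 values. Paths are placeholders.
import glob, os
import numpy as np

def check_conversion(npy_dir, bin_path):
    # mmap_mode="r" reads only the header, so this is cheap even at 28GB
    npy_tokens = sum(np.load(p, mmap_mode="r").shape[0]
                     for p in glob.glob(os.path.join(npy_dir, "*.npy")))
    bin_tokens = os.path.getsize(bin_path) // 2  # 2 bytes per uint16 token
    assert npy_tokens == bin_tokens, (npy_tokens, bin_tokens)
    return bin_tokens
```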

Bottom line: Your 60GB raw text → 28GB tokenized data is correct. The 2.14× compression is normal, and converting to .bin is the final step before training. Nothing is being lost — you’re just preparing data in the format nanoGPT expects for efficient memory-mapped I/O during training.

