# From Raw Text to Tokenized Data | Generated by AI
Based on your data preparation log, here’s how to understand the “60GB text data” claim versus what you’re seeing:
## What’s Happening
Your output: 150 shards × ~0.19GB each ≈ 28GB total
This suggests one of three scenarios:
### 1. Compression/Tokenization Factor
The “60GB” likely refers to raw source text, while the 28GB is tokenized/binary format:
- Raw text → tokens typically compresses 2-3× in binary format
- 60GB raw ÷ 2.1 = 28.5GB (matches your output)
- Each shard is likely a binary `.bin` or `.arrow` file, not raw text
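A quick way to sanity-check this scenario is to redo the size arithmetic yourself. A minimal sketch using the figures from the log (the 60GB raw size is taken at face value from the dataset description):

```python
# Re-derive the compression ratio from the sizes in the log.
# The 60GB raw figure is an assumption taken from the dataset's advertised size.
raw_gb = 60.0
num_shards = 150
gb_per_shard = 0.19                      # approximate shard size from the log
binary_gb = num_shards * gb_per_shard    # tokenized output size
ratio = raw_gb / binary_gb               # raw-to-binary compression factor
print(round(binary_gb, 1), round(ratio, 1))   # 28.5 2.1
```

A ratio near 2x falls inside the 2-3x range quoted above, so the two sizes are consistent.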
### 2. The Math from Your Log
- Speed: 40M tokens/sec
- Time: ~20 minutes total = 1,200 seconds
- Total tokens processed: 40M × 1,200 = 48B tokens
If each token averages ~0.6 bytes in the stored format (note: this figure is back-calculated from your output size, not a typical BPE storage cost; uncompressed uint16 token IDs take 2 bytes each):
- 48B tokens × 0.6 bytes ≈ 28.8GB, which matches your output
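The same throughput math, written out (the ~0.6 bytes/token figure is the average implied by the 28GB output, not a measured property of your pipeline):

```python
# Reproduce the throughput math from the log.
rate = 40_000_000          # tokens/sec reported in the log
seconds = 20 * 60          # ~20 minutes of processing
tokens = rate * seconds    # 48,000,000,000 tokens
# ~0.6 bytes/token is back-calculated from the output size,
# not a standard storage cost for BPE token IDs.
bytes_per_token = 0.6
size_gb = tokens * bytes_per_token / 1e9
print(tokens, round(size_gb, 1))   # 48000000000 28.8
```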
### 3. What “60GB” Could Mean
- Raw text size: 60GB of JSON/parquet/text files before processing
- After tokenization: 28GB of binary training data
- Ratio: ~2.1× compression (normal for text → tokens)
## Why This Matters
| Metric | Value | Meaning |
|---|---|---|
| Raw input | ~60GB | Source text files |
| Binary output | 28GB | Ready for training |
| Tokens | ~48B | Actual training tokens |
| Shards | 150 | For parallel training |
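The table's figures can also be cross-checked against each other, for instance by deriving per-shard numbers from the totals (assuming the shards are evenly sized):

```python
# Cross-check the table: derive per-shard numbers from the totals,
# assuming shards are roughly evenly sized.
total_tokens = 48_000_000_000
total_gb = 28
num_shards = 150
tokens_per_shard = total_tokens // num_shards   # 320M tokens per shard
gb_per_shard = total_gb / num_shards            # ~0.19GB, matching the log
print(tokens_per_shard, round(gb_per_shard, 2))   # 320000000 0.19
```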
Check with:

```shell
# Verify the raw input size
du -sh /path/to/raw/data/

# Check the format of one shard
file /path/to/output/shard_0000.bin
```
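If the shards do turn out to be binary token files, you can peek at the first few IDs from Python. This is a hypothetical sketch: it assumes each shard is a flat array of little-endian uint16 token IDs with no header, which many but not all pipelines use; adjust the format if `file` reports otherwise.

```python
import os
import struct

def peek_shard(path, n=8):
    """Read the first few token IDs from a shard, assuming a flat
    little-endian uint16 array with no header (an assumption; some
    pipelines use uint32 IDs or prepend a header)."""
    size = os.path.getsize(path)
    n_tokens = size // 2               # 2 bytes per uint16 token ID
    count = min(n, n_tokens)
    with open(path, "rb") as f:
        ids = struct.unpack(f"<{count}H", f.read(count * 2))
    return n_tokens, ids
```

If the decoded IDs all fall below your tokenizer's vocabulary size, the uint16 guess is probably right; values that look like garbage suggest a header or a different dtype.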
**Bottom line:** your processing is correct. 60GB of raw text → 28GB of tokenized output holding ~48B tokens is a healthy ratio for LLM data preparation.