Explaining Raw Text to Tokenized Data | Generated by AI

Based on your data preparation log, here’s how to understand the “60GB text data” claim versus what you’re seeing:

What’s Happening

Your output: 150 shards × ~0.19GB each ≈ 28GB total

This suggests one of three scenarios:

1. Compression/Tokenization Factor

The “60GB” likely refers to the raw source text, while the 28GB is the tokenized binary output: each token ID occupies only a few bytes on disk, no matter how many characters it originally spanned.
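To see why tokenized output can be smaller than the raw text, here is a minimal sketch using a toy whitespace tokenizer (not the real BPE pipeline) that stores each token ID as a 2-byte unsigned integer:

```python
import struct

# Toy illustration (not real BPE): map each whitespace-separated word
# to an integer ID, then store the IDs as 2-byte uint16 values.
text = "the quick brown fox jumps over the lazy dog " * 1000
vocab = {}
ids = [vocab.setdefault(w, len(vocab)) for w in text.split()]

raw_bytes = len(text.encode("utf-8"))                  # raw UTF-8 size
bin_bytes = len(struct.pack(f"{len(ids)}H", *ids))     # 2 bytes per token

print(raw_bytes, bin_bytes, round(raw_bytes / bin_bytes, 2))
```

Because the average word here is longer than 2 bytes, the binary shrinks relative to the raw text; a real BPE tokenizer with a large vocabulary behaves similarly.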

2. The Math from Your Log

Speed: 40M tokens/sec
Time: ~20 minutes total = 1,200 seconds
Total tokens processed: 40M × 1,200 = 48B tokens

If each token averages ~0.6 bytes in binary format (typical for BPE), then 48B tokens × 0.6 bytes/token ≈ 28.8GB, which matches the ~28GB of shards you see on disk.
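The arithmetic above can be checked in a few lines (the 0.6 bytes/token figure is the estimate from the log, not a measured value):

```python
tokens_per_sec = 40_000_000        # 40M tokens/sec from the log
seconds = 20 * 60                  # ~20 minutes total
total_tokens = tokens_per_sec * seconds

bytes_per_token = 0.6              # estimated average for this BPE setup
output_gb = total_tokens * bytes_per_token / 1e9

print(total_tokens, output_gb)     # 48,000,000,000 tokens, ~28.8 GB
```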

3. What “60GB” Could Mean

Absent more detail from the dataset’s documentation, the most likely reading is the on-disk size of the raw source text files, which you can confirm with du as shown below.

Why This Matters

| Metric | Value | Meaning |
|---|---|---|
| Raw input | ~60GB | Source text files |
| Binary output | 28GB | Ready for training |
| Tokens | ~48B | Actual training tokens |
| Shards | 150 | For parallel training |

Check with:

# Verify raw input size
du -sh /path/to/raw/data/

# Check one shard format
file /path/to/output/shard_0000.bin
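Beyond du and file, a short Python sketch can total the shard sizes and back out a token estimate. The path pattern is hypothetical (adjust to your layout), and the bytes-per-token constant reuses the ~0.6 estimate from above:

```python
import glob
import os

BYTES_PER_TOKEN = 0.6  # estimated average from the 48B-token / ~28GB figures

def summarize_shards(pattern):
    """Sum shard file sizes and estimate the token count they hold."""
    paths = glob.glob(pattern)
    total_bytes = sum(os.path.getsize(p) for p in paths)
    return len(paths), total_bytes, total_bytes / BYTES_PER_TOKEN

# Hypothetical output directory; adjust to your layout.
n, size, tokens = summarize_shards("/path/to/output/shard_*.bin")
print(f"{n} shards, {size / 1e9:.2f} GB, ~{tokens / 1e9:.1f}B tokens")
```

If the totals come out near 150 shards / 28GB / 48B tokens, the pipeline’s numbers are internally consistent.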

Bottom line: Your processing is correct. 60GB of raw text → 28GB of tokenized output (~48B tokens) is a healthy ratio for LLM data preparation.

