Understanding Dataset Variations and Tokenization | Generated by AI

2026.06

You’re right — they’re NOT duplicates. Here’s what each is:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  /root/zz/datasets/fineweb-edu/     (40 GB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  14 parquet files (2.2-2.3 GB each)
  Raw text from HuggingFace FineWeb-Edu
  Human-readable, can be reprocessed
  Source: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  /root/nanoGPT/data/fineweb/edu_fineweb100B/  (28 GB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  147 .npy files (191 MB each)
  Tokenized with GPT-2 BPE tokenizer
  Machine-readable numpy arrays of token IDs
  Ready for training — this is what train.py reads

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
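To confirm the file counts and sizes listed above on your own machine, a short script can total each directory by file extension. This is a minimal sketch using only the standard library; the helper names `summarize_dir` and `format_summary` are made up for illustration, and the two paths are the ones from the listing above.

```python
from collections import defaultdict
from pathlib import Path


def summarize_dir(root):
    """Return {suffix: (file_count, total_bytes)} for all files under root."""
    stats = defaultdict(lambda: [0, 0])
    for path in Path(root).rglob("*"):
        if path.is_file():
            entry = stats[path.suffix.lower()]
            entry[0] += 1
            entry[1] += path.stat().st_size
    return {suffix: (count, size) for suffix, (count, size) in stats.items()}


def format_summary(root):
    """Return one human-readable line per file extension found under root."""
    lines = []
    for suffix, (count, size) in sorted(summarize_dir(root).items()):
        lines.append(f"{suffix or '(none)'}: {count} files, {size / 1e9:.1f} GB")
    return lines


if __name__ == "__main__":
    # Paths taken from the listing above; adjust if yours differ.
    for root in ("/root/zz/datasets/fineweb-edu",
                 "/root/nanoGPT/data/fineweb/edu_fineweb100B"):
        if Path(root).is_dir():
            print(root)
            for line in format_summary(root):
                print("  " + line)
        else:
            print(f"{root}: not found")
```

If the listing is accurate, the first directory should be dominated by `.parquet` files and the second by `.npy` files.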

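Why tokenized is smaller: each token is stored as a small integer ID instead of text. The GPT-2 vocabulary has 50,257 entries, which fits in 16 bits, so a token that stands for several characters of text can take as little as 2 bytes (assuming the shards use a 16-bit dtype). The parquet files also carry metadata columns alongside the text.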
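The shard sizes are consistent with 2 bytes per token. Here is a rough back-of-the-envelope check, assuming the .npy shards store token IDs as uint16 and that "191 MB" means 191 MiB; neither is confirmed by the listing, so treat the results as estimates. The function names are made up for illustration.

```python
GPT2_VOCAB_SIZE = 50257   # size of the GPT-2 BPE vocabulary
BYTES_PER_TOKEN = 2       # assumption: token IDs stored as uint16
SHARD_COUNT = 147         # from the listing above
SHARD_SIZE_MIB = 191      # from the listing above, read as MiB (assumption)

# Every GPT-2 token ID fits in an unsigned 16-bit integer.
assert GPT2_VOCAB_SIZE <= 2**16


def tokens_per_shard(shard_size_mib, bytes_per_token=BYTES_PER_TOKEN):
    """Estimate tokens in one shard, ignoring the small .npy header."""
    return shard_size_mib * 1024 * 1024 // bytes_per_token


def total_gib(shard_count, shard_size_mib):
    """Total size of all shards in GiB."""
    return shard_count * shard_size_mib / 1024


if __name__ == "__main__":
    per_shard = tokens_per_shard(SHARD_SIZE_MIB)
    print(f"tokens per shard: {per_shard:,}")
    print(f"total tokens:     {per_shard * SHARD_COUNT:,}")
    print(f"total size:       {total_gib(SHARD_COUNT, SHARD_SIZE_MIB):.1f} GiB")
```

Under these assumptions each shard holds roughly 100 million tokens, and the 147 shards add up to about 14.7 billion tokens in a little over 27 GiB, which lines up with the 28 GB shown above.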

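They’re complementary, not duplicates: the parquet files are the raw source text, and the .npy shards are the tokenized output derived from that text for training.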

Can you delete the parquet? Yes, if you don’t plan to re-tokenize the raw text:

# Free 40 GB — safe if you won't re-tokenize
rm -rf /root/zz/datasets/fineweb-edu

But if you later want to switch to nanochat (which uses its own tokenizer/preprocessing), you’d need to re-download the parquet. Your call.

