Lacking FineWeb Tokenization Scripts
Short answer: no. There is no tokenization script in this repo; only download and parquet→text extraction exist.
What’s here today (scripts/):
- `download/`: fetches FineWeb parquet shards (and Wikimedia dumps)
- `extract/extract_fineweb.py`: concatenates parquet rows into a single `.txt`, inserting `<|endoftext|>` separators (`datasets/fineweb-edu/` already holds the raw parquet shards); a sketch of this step follows the list
- `analysis/`: log/eval helpers
- no `prepare.py`, no `tiktoken`/`uint16` writer, no `train.bin`/`val.bin` output
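The actual `extract_fineweb.py` isn't shown here, but the parquet→text step it performs amounts to roughly the following (a minimal sketch, not the repo's code; the `text` column name, shard glob, and output filename are assumptions):

```python
import glob

import pyarrow.parquet as pq

# Assumed layout: datasets/fineweb-edu/*.parquet, each shard with a "text" column.
shards = sorted(glob.glob("datasets/fineweb-edu/*.parquet"))

with open("fineweb_edu.txt", "w", encoding="utf-8") as out:
    for shard in shards:
        table = pq.read_table(shard, columns=["text"])
        for doc in table.column("text").to_pylist():
            out.write(doc)
            out.write("<|endoftext|>")  # separator the downstream tokenizer splits on
```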
Before handing data to nanoGPT you need to:
- Concatenate parquets to text with EOT delimiters → already covered by `extract_fineweb.py` (but it points at `fineweb_test_dump/`, not `datasets/fineweb-edu/`).
- Tokenize that text with GPT-2 BPE (`tiktoken.get_encoding("gpt2")`), appending the `<|endoftext|>` id between docs.
- Write the token ids as `np.uint16` into a flat `train.bin`/`val.bin` (this is the format `nanoGPT/train.py` memmaps via `data/<name>/{train,val}.bin`). A sketch of these last two steps follows this list.
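Steps 2 and 3 are the missing pieces. A minimal sketch of them, assuming the extracted text file from step 1 fits in memory (file names and the split ratio are illustrative):

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token  # 50256, the <|endoftext|> token id

# Step 1 left a single text file with <|endoftext|> between documents.
with open("fineweb_edu.txt", "r", encoding="utf-8") as f:
    docs = f.read().split("<|endoftext|>")

ids = []
for doc in docs:
    if doc:
        ids.extend(enc.encode_ordinary(doc))  # plain BPE, no special-token handling
        ids.append(eot)                       # re-insert the separator as a token id

arr = np.array(ids, dtype=np.uint16)  # GPT-2's 50257-token vocab fits in uint16
n_train = int(len(arr) * 0.9995)      # assumption: tiny val slice, like nanoGPT's default
arr[:n_train].tofile("train.bin")
arr[n_train:].tofile("val.bin")
```

A real script should stream shard by shard instead of holding every token id in RAM; nanoGPT's `prepare.py` does this by writing batches into a memmapped output file.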
The canonical reference is nanoGPT's `data/openwebtext/prepare.py`: it streams the HF dataset, tokenizes with tiktoken in parallel, and dumps `train.bin`/`val.bin`. We'd want a `scripts/extract/prepare_fineweb_edu.py` that does the same but reads the local parquet shards in `datasets/fineweb-edu/` instead of pulling from HF.
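A sketch of what that script could look like, reading the local shards directly and skipping the intermediate `.txt` (the `text` column name, output paths, and the one-shard val split are assumptions, not decisions the repo has made):

```python
import glob

import numpy as np
import pyarrow.parquet as pq
import tiktoken

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token  # <|endoftext|> id appended after every document

def tokenize_shard(path: str) -> np.ndarray:
    """BPE-encode every document in one parquet shard, EOT-terminated."""
    docs = pq.read_table(path, columns=["text"]).column("text").to_pylist()
    ids = []
    # encode_ordinary_batch tokenizes the documents on multiple threads
    for toks in enc.encode_ordinary_batch(docs, num_threads=8):
        ids.extend(toks)
        ids.append(eot)
    return np.array(ids, dtype=np.uint16)

shards = sorted(glob.glob("datasets/fineweb-edu/*.parquet"))
splits = {"val": shards[:1], "train": shards[1:]}  # assumption: first shard = val

for name, group in splits.items():
    # Write where nanoGPT can memmap it, e.g. nanoGPT/data/<dataset>/{train,val}.bin
    with open(f"{name}.bin", "wb") as f:
        for shard in group:
            tokenize_shard(shard).tofile(f)  # append shard by shard, bounded memory
```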
Want me to write that prepare script (parquet shards → `train.bin` + `val.bin` with GPT-2 BPE)?