Download and Tokenize 100GB Code (generated by AI)

2026.06

wget works: 273 MB in 28 s (~10 MB/s). At that rate, 100 GB will take ~3 hours.
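The 3-hour figure follows directly from the measured sample. A minimal sketch of the arithmetic, assuming decimal units (1 MB = 10^6 bytes) and a throughput that stays constant for the whole run; `eta_hours` is a hypothetical helper name, not part of either script:

```python
MB = 10**6  # assumption: decimal megabytes
GB = 10**9  # assumption: decimal gigabytes


def eta_hours(total_bytes: float, sample_bytes: float, sample_seconds: float) -> float:
    """Extrapolate total transfer time from one timed sample download."""
    rate = sample_bytes / sample_seconds  # bytes per second
    return total_bytes / rate / 3600


rate_mb_s = 273 * MB / 28 / MB
hours = eta_hours(100 * GB, 273 * MB, 28)
print(f"{rate_mb_s:.2f} MB/s -> {hours:.2f} hours for 100 GB")
```

In practice the rate may drift with network load, so the result is best read as a rough lower bound rather than a promise.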

Both scripts are ready and tested. Here’s the summary:

================================================================
FILES CREATED IN /mnt/data/zz/
================================================================

/mnt/data/zz/scripts/download/download_github_code.py

Downloads ~100GB of code parquet shards from HuggingFace
Resumable (skips already-downloaded files)
Output: /mnt/data/zz/datasets/github-code/*.parquet
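The script itself is not reproduced here, so the following is only a sketch of how "skips already-downloaded files" could be implemented. It uses the standard library's `urllib.request` rather than wget, and `download_if_missing` and the `.part` suffix are hypothetical choices made for illustration:

```python
import os
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: Path) -> bool:
    """Fetch url into dest unless dest already exists.

    Returns True if a download happened, False if the file was skipped.
    """
    dest = Path(dest)
    if dest.exists():
        return False  # completed on an earlier run, nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")  # hypothetical temp-file suffix
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out, length=1 << 20)  # stream in 1 MiB chunks
    os.replace(tmp, dest)  # rename only after the whole file has been written
    return True
```

Writing to a temporary name and renaming at the end means an interrupted run never leaves a truncated `.parquet` file that a later run would mistake for a finished one.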

/mnt/data/zz/scripts/extract/tokenize_github_code.py

Tokenizes parquet -> nanoGPT binary shards (.npy)
GPT-2 BPE, 100M tokens per shard
Output: /mnt/data/zz/datasets/github-code-tok/
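As an illustration of the fixed-size sharding described above, here is a minimal sketch, not the actual script. It assumes tokens are stored as `uint16` (enough for GPT-2's 50,257-token vocabulary) and takes the encoder as a parameter; `write_token_shards`, the `shard_` file prefix, and the buffering strategy are all hypothetical:

```python
from pathlib import Path
from typing import Callable, Iterable, List, Sequence

import numpy as np


def write_token_shards(
    docs: Iterable[str],
    encode: Callable[[str], Sequence[int]],
    out_dir: Path,
    shard_size: int = 100_000_000,  # tokens per shard, as in the summary
) -> List[Path]:
    """Tokenize docs and write them as consecutive fixed-size .npy shards."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    buf = np.empty(shard_size, dtype=np.uint16)  # reused across shards
    filled = 0
    paths: List[Path] = []

    def flush(n: int) -> None:
        path = out_dir / f"shard_{len(paths):05d}.npy"  # hypothetical naming
        np.save(path, buf[:n])
        paths.append(path)

    for text in docs:
        tokens = np.asarray(encode(text), dtype=np.uint16)
        while tokens.size:
            take = min(shard_size - filled, tokens.size)
            buf[filled:filled + take] = tokens[:take]
            filled += take
            tokens = tokens[take:]  # a document may straddle two shards
            if filled == shard_size:
                flush(filled)
                filled = 0
    if filled:
        flush(filled)  # final shard may be shorter than shard_size
    return paths
```

With tiktoken installed, one plausible encoder is `enc = tiktoken.get_encoding("gpt2")` with `encode=lambda s: enc.encode_ordinary(s) + [enc.eot_token]`, so that each file ends with the end-of-text token; whether the real script appends that separator is not stated in the summary.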

================================================================
RUN THESE IN TMUX (one at a time)
================================================================

Step 1 — Download (~3 hours):

python3.11 /mnt/data/zz/scripts/download/download_github_code.py

Step 2 — Tokenize (~1 hour):

python3.11 /mnt/data/zz/scripts/extract/tokenize_github_code.py
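Because Step 2 reads what Step 1 writes, the two commands have to run in order. If typing them by hand is inconvenient, a small driver along these lines could enforce the ordering; this is a hypothetical convenience wrapper (`run_steps` and `STEPS` are illustrative names), not something the summary says was created:

```python
import subprocess
import sys

# The two commands from the steps above, in the order they must run.
STEPS = [
    ["python3.11", "/mnt/data/zz/scripts/download/download_github_code.py"],
    ["python3.11", "/mnt/data/zz/scripts/extract/tokenize_github_code.py"],
]


def run_steps(steps) -> int:
    """Run each command in turn and stop at the first failure.

    Returns the number of steps that completed successfully.
    """
    done = 0
    for cmd in steps:
        print("running:", " ".join(cmd), flush=True)
        # check=True raises CalledProcessError on a non-zero exit code,
        # so tokenization never starts after a failed download.
        subprocess.run(cmd, check=True)
        done += 1
    return done
```

Calling `run_steps(STEPS)` inside a tmux session keeps the whole pipeline alive if the SSH connection drops, which is presumably why tmux is recommended.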

================================================================
NOTES
================================================================

The dataset has 1,126 shards, each ~285 MB with ~100K code files
100 GB = ~350 shards = ~129B tokens (far more than a 760M-parameter model needs)
Columns: content, repo_name, path, license, size; the data covers 30 languages
One test shard is already downloaded and verified working
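The figures in these notes can be cross-checked, and extended to the size of the tokenized output, with a few lines of arithmetic. This is a sketch under stated assumptions: decimal units, and 2 bytes per token (`uint16`), which the summary does not confirm; `plan` is a hypothetical helper:

```python
import math

MB, GB = 10**6, 10**9  # assumption: decimal units


def plan(
    budget_bytes: int = 100 * GB,          # download budget
    shard_bytes: int = 285 * MB,           # approximate parquet shard size
    files_per_shard: int = 100_000,        # approximate files per parquet shard
    total_tokens: int = 129_000_000_000,   # token estimate from the notes
    tokens_per_out_shard: int = 100_000_000,
    bytes_per_token: int = 2,              # assumption: uint16 storage
) -> dict:
    input_shards = budget_bytes / shard_bytes
    return {
        "input_shards": input_shards,
        "tokens_per_input_shard": total_tokens / input_shards,
        "tokens_per_file": total_tokens / (input_shards * files_per_shard),
        "output_shards": math.ceil(total_tokens / tokens_per_out_shard),
        "output_gb": total_tokens * bytes_per_token / GB,
    }


for name, value in plan().items():
    print(f"{name}: {value:,.1f}")
```

Under those assumptions the tokenized output would be larger than the compressed parquet input, so it is worth checking free space on /mnt/data before starting Step 2.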
