GitHub Code GPT-2 Training Pipeline | Generated by AI

2026.06

Here’s the full pipeline summary:

=== /mnt/data/zz — GitHub Code Training Pipeline ===

STEP 1: DOWNLOAD (codeparrot/github-code from HuggingFace)
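A minimal sketch of what a partial download could look like, not the repository's actual script. It assumes the `huggingface_hub` package is installed and that the shards follow a `data/train-XXXXX-of-01126.parquet` naming pattern. The function names, the naming pattern, and the `local_dir` argument are all assumptions made for illustration.

```python
from pathlib import Path

REPO_ID = "codeparrot/github-code"
TOTAL_SHARDS = 1126  # shard count quoted in this summary


def shard_filenames(n_shards, total=TOTAL_SHARDS):
    """Return paths for the first n_shards shards (naming pattern is an assumption)."""
    if not 0 <= n_shards <= total:
        raise ValueError(f"n_shards must be between 0 and {total}")
    return [f"data/train-{i:05d}-of-{total:05d}.parquet" for i in range(n_shards)]


def download_shards(n_shards, local_dir):
    """Download the first n_shards shards; needs network access."""
    # Imported lazily so shard_filenames() works without the package installed.
    from huggingface_hub import hf_hub_download

    Path(local_dir).mkdir(parents=True, exist_ok=True)
    paths = []
    for name in shard_filenames(n_shards):
        paths.append(
            hf_hub_download(
                repo_id=REPO_ID,
                filename=name,
                repo_type="dataset",
                local_dir=local_dir,
            )
        )
    return paths
```

Calling `download_shards(38, "/mnt/data/zz/parquet")` would fetch the same number of shards as this run; the target directory here is hypothetical. Raising the count later resumes where the run left off, since files that are already present are not fetched again.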

STEP 2: TOKENIZE (GPT-2 BPE via tiktoken)
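The tokenize step can be sketched roughly as follows, assuming the output is the flat uint16 token file that nanoGPT reads. The helper names are hypothetical, and the standard-library `array` module stands in for NumPy to keep the sketch dependency-free. The encoder is passed in as an argument; in practice it would be `tiktoken.get_encoding("gpt2")`.

```python
from array import array


def encode_documents(texts, enc):
    """Encode each document and append the end-of-text token as a separator."""
    ids = array("H")  # unsigned 16-bit; GPT-2's 50257-token vocabulary fits
    for text in texts:
        # encode_ordinary treats special-token strings in raw text as plain text
        ids.extend(enc.encode_ordinary(text))
        ids.append(enc.eot_token)
    return ids


def write_bin(ids, path):
    """Write token ids as raw native-endian uint16 values."""
    with open(path, "wb") as f:
        ids.tofile(f)


def tokenize_to_file(texts, path):
    """Convenience wrapper; tiktoken fetches the GPT-2 vocabulary on first use."""
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    ids = encode_documents(texts, enc)
    write_bin(ids, path)
    return len(ids)
```

Appending the end-of-text token after every file lets the model see document boundaries when training samples are cut from one long token stream.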

STEP 3: TRAIN (nanoGPT, currently running in the tmux session “train”)
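For orientation, here is a hypothetical nanoGPT-style config showing how a token budget of about 14B maps to an iteration count. Only the GPT-2 124M architecture values (12 layers, 12 heads, 768-dim embeddings) are standard. The directory names, batch size, and accumulation steps are illustrative guesses for a single consumer GPU, not the values in commit 79273b9.

```python
# Hypothetical config in the style of nanoGPT's config/train_gpt2.py.
out_dir = "out-github-code"
dataset = "github_code"  # nanoGPT looks for data/<dataset>/train.bin and val.bin

# GPT-2 124M architecture
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024

# Small micro-batch with gradient accumulation (assumed values)
batch_size = 8
gradient_accumulation_steps = 64
tokens_per_iter = batch_size * block_size * gradient_accumulation_steps

# Iterations needed to process roughly 14B tokens once
target_tokens = 14_000_000_000
max_iters = target_tokens // tokens_per_iter
lr_decay_iters = max_iters  # decay the learning rate over the whole run
```

With these assumed values each iteration consumes 524,288 tokens, so one pass over about 14B tokens takes 26,702 iterations. A file like this would be passed to nanoGPT as `python train.py <config file>`.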

LATEST COMMITS:
  79273b9 feat(training): add GPT-2 124M config for github-code on RTX 4070
  c0672b7 refactor(extract): simplify tokenization logic
  0f9416d chore(finetune): add merge script, gitignore *.gguf
  336bcef Add github-code download and tokenize scripts

NOTE: Only 38 of the 1126 parquet shards were downloaded (11 GB of the 100 GB target). Tokenizing them already produced ~14B tokens, which matches the token budget the training config targets, so the pipeline is complete for this run. Download more shards later if you want more data.
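To plan a later download, the figures in the note can be extrapolated with a short calculation. This is a rough sketch that assumes all shards are about the same size and yield about the same number of tokens, which the source does not confirm. The function name is hypothetical.

```python
import math


def shard_estimate(shards_done, shards_total, gb_done, gb_target, tokens_done):
    """Extrapolate from the downloaded shards, assuming shards are similar in size."""
    gb_per_shard = gb_done / shards_done
    return {
        "fraction_downloaded": shards_done / shards_total,
        "tokens_per_shard": tokens_done / shards_done,
        "shards_for_target": math.ceil(gb_target / gb_per_shard),
    }


est = shard_estimate(38, 1126, 11, 100, 14_000_000_000)
print(f"{est['fraction_downloaded']:.1%} of shards downloaded")
print(f"about {est['tokens_per_shard'] / 1e6:.0f}M tokens per shard")
print(f"about {est['shards_for_target']} shards needed for the 100 GB target")
```

Under that assumption, the 100 GB target corresponds to roughly 346 shards, well short of the full 1126.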

