StarCoderData Corpus Overview | Generated by AI
Question: Give me an intro to bigcode/starcoderdata — what it is, what’s in it, and how it’s structured.
Answer:
What it is
starcoderdata is the exact training corpus used for StarCoder and StarCoderBase: 783 GB of code in 86 programming languages, plus 54 GB of GitHub issues, 13 GB of Jupyter notebooks (in both script and text-code-pair form), and 32 GB of GitHub commits, roughly 250B tokens in total. StarCoderBase was trained on ~1T tokens (about 4 epochs over this data), and StarCoder is StarCoderBase fine-tuned for 2 further epochs on the Python subset.
This makes it one of the best open artifacts for understanding what a “Codex-style” pretraining corpus looks like in practice: it is a fully published counterpart to what closed labs build privately.
The pipeline that produced it (the actually interesting part)
It starts from The Stack v1 (GitHub code identified as permissively licensed via license detection) and applies, in order:
- Language selection — 86 languages chosen by data volume + popularity, plus config/markup formats (JSON, YAML, Markdown).
- Quality filters — per-language heuristics: line length limits, alphanumeric fraction, alpha-token ratio, autogenerated-file detection, plus filtering of data-heavy files (long JSON/YAML get aggressively trimmed). Manual inspection per language to tune thresholds.
- Near-deduplication — MinHash + LSH (Jaccard ~0.85, 5-gram shingles). Dedup was the single highest-impact step; their ablations showed near-dedup beats exact-dedup-only by a wide margin.
- PII redaction — they trained a NER model (StarPII) on an annotated dataset to detect and mask names, emails, keys, passwords, and IPs, which are replaced with tokens like <NAME>, <API_KEY>, <IP_ADDRESS> (literally the redaction policy you just added to your custom instructions, applied at corpus scale).
- Decontamination — removal of files matching HumanEval, MBPP, APPS, GSM8K test sets.
- Opt-out — repos of developers who requested removal via the “Am I in The Stack” tool were dropped.
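To make the near-deduplication step concrete, here is a minimal pure-Python sketch of MinHash signatures with LSH banding. It is not the BigCode implementation: the signature length (128), the band count (32), the word-level tokenization, and the helper names are all assumptions chosen for illustration. Only the 5-gram shingles and the ~0.85 threshold come from the description above.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

NUM_PERM = 128     # signature length (assumed, not from the source)
BANDS = 32         # LSH bands of NUM_PERM // BANDS rows each (assumed)
SHINGLE_SIZE = 5   # 5-gram shingles, as quoted above


def shingles(text, k=SHINGLE_SIZE):
    """Set of k-token shingles; documents shorter than k yield one shingle."""
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return set()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, item):
    """64-bit hash of `item`, salted so each seed acts as a separate hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=NUM_PERM):
    """Minimum salted hash per seed; matching slots estimate Jaccard similarity."""
    sh = shingles(text)
    if not sh:
        return [0] * num_perm
    return [min(_hash(seed, s) for s in sh) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots on which the two documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=BANDS):
    """Pairs of doc ids that agree on every row of at least one band."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def near_duplicates(docs, threshold=0.85):
    """Candidate pairs from LSH, kept only if the estimated Jaccard passes the threshold."""
    sigs = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
    return {
        (a, b)
        for a, b in lsh_candidate_pairs(sigs)
        if estimated_jaccard(sigs[a], sigs[b]) >= threshold
    }
```

Calling `near_duplicates({"a": text_a, "b": text_b, ...})` returns the id pairs to collapse. The banding is what keeps this tractable at corpus scale: only documents that collide in at least one band are ever compared, instead of all pairs.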
Structure on disk
One directory per language under the repo root (python/, cpp/, rust/, java/, … 86 dirs), each containing parquet shards. Code rows have content (the source text), id, and max_stars_count. Four special subsets have different schemas, so loading the whole dataset at once fails — load them separately: jupyter-scripts-dedup-filtered, jupyter-structured-clean-dedup, github-issues-filtered-structured, git-commits-cleaned.
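A minimal sketch of the "load them separately" advice, assuming the `datasets` library is installed and you are already authenticated for the gated repo. The helper names (`stream_subset`, `peek_columns`, `print_schemas`) are hypothetical, and the exact columns of each subset are deliberately not hard-coded here; the point is to inspect them rather than assume them.

```python
REPO_ID = "bigcode/starcoderdata"

# The four subsets named above whose schemas differ from the per-language code dirs.
SPECIAL_SUBSETS = (
    "jupyter-scripts-dedup-filtered",
    "jupyter-structured-clean-dedup",
    "github-issues-filtered-structured",
    "git-commits-cleaned",
)


def stream_subset(data_dir):
    """Open one subdirectory as its own streaming dataset."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, data_dir=data_dir, split="train", streaming=True)


def peek_columns(data_dir):
    """Return the column names of the first row, without downloading the subset."""
    first_row = next(iter(stream_subset(data_dir)))
    return sorted(first_row.keys())


def print_schemas():
    """Print the columns of one code directory and each special subset."""
    for name in ("python",) + SPECIAL_SUBSETS:
        print(name, peek_columns(name))
```

Running `print_schemas()` once before writing any preprocessing code shows which fields each subset actually carries, so the code path for `content`-style rows is not applied blindly to issues or commits.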
A key formatting detail: the content embeds metadata and structure as special tokens, which StarCoder’s tokenizer treats atomically:
<reponame>owner/repo<filename>src/foo.py<gh_stars>42
... actual code ...
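If you want to separate that metadata from the code (for filtering by stars, say, or for training without the markers), a small parser is enough. This is a sketch that assumes the header layout shown in the example above: the three markers in that order, each optional, on a single first line. `split_header` and `build_prompt` are hypothetical helper names, and the star count is kept as a string because its exact format is not specified here.

```python
import re

# Each marker is optional; values run until the next "<" or the end of the line.
HEADER_RE = re.compile(
    r"^(?:<reponame>(?P<reponame>[^<\n]*))?"
    r"(?:<filename>(?P<filename>[^<\n]*))?"
    r"(?:<gh_stars>(?P<gh_stars>[^<\n]*))?"
)


def split_header(content):
    """Return (metadata dict, code) for one `content` string."""
    match = HEADER_RE.match(content)  # always matches, possibly the empty string
    meta = {k: v for k, v in match.groupdict().items() if v is not None}
    code = content[match.end():]
    # Drop the single newline that separates a header from the code, if present.
    if meta and code.startswith("\n"):
        code = code[1:]
    return meta, code


def build_prompt(reponame, filename, gh_stars, code_prefix=""):
    """Rebuild a header in the same layout, e.g. to condition a completion."""
    return f"<reponame>{reponame}<filename>{filename}<gh_stars>{gh_stars}\n{code_prefix}"
```

`split_header` leaves rows without a header untouched, so it can be mapped over a whole stream without first checking which rows carry markers.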
Commits look like <commit_before>...<commit_msg>...<commit_after>, and issues use <issue_start>, <issue_comment>, etc. If you train your own tokenizer on this data, decide whether to keep these markers — they’re how StarCoder learned repo/file conditioning, and they enable tricks like prompting with <gh_stars>1000 to bias toward higher-quality completions. FIM (<fim_prefix>/<fim_middle>/<fim_suffix>) is applied at training time, not baked into the dataset.
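Since FIM is applied at training time, your data loader has to do the rearranging itself. Below is a minimal sketch of a prefix-suffix-middle transform using the three FIM markers named above. The 0.5 FIM rate, the character-level split points, and the function names are assumptions for illustration, not StarCoder's actual training configuration.

```python
import random

FIM_PREFIX = "<fim_prefix>"
FIM_MIDDLE = "<fim_middle>"
FIM_SUFFIX = "<fim_suffix>"


def apply_fim(text, rng, fim_rate=0.5):
    """With probability `fim_rate`, rewrite `text` in prefix-suffix-middle order."""
    if len(text) < 2 or rng.random() >= fim_rate:
        return text  # leave as an ordinary left-to-right sample
    # Two distinct cut points split the document into prefix / middle / suffix.
    lo, hi = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    # The model sees prefix and suffix first and learns to generate the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def undo_fim(sample):
    """Recover the original text from a sample produced by `apply_fim`."""
    if not sample.startswith(FIM_PREFIX):
        return sample
    body = sample[len(FIM_PREFIX):]
    prefix, rest = body.split(FIM_SUFFIX, 1)
    suffix, middle = rest.split(FIM_MIDDLE, 1)
    return prefix + middle + suffix
```

Passing an explicit `random.Random(seed)` as `rng` keeps the transform reproducible across data-loader workers. In a real pipeline the markers would map to single token ids, and many implementations pick the cut points on token sequences rather than characters.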
How to use it
It’s gated — you must log in and accept The Stack’s Terms of Use (downstream use must respect original code licenses; you agree to propagate removal-request updates). Then:
from datasets import load_dataset

# stream one language; never download all 783GB blindly
ds = load_dataset("bigcode/starcoderdata", data_dir="python",
                  split="train", streaming=True)
for i, row in enumerate(ds):
    print(row["content"][:300], "\n---")
    if i == 2:
        break
Sizing for your setup: the python/ dir alone is ~60GB (~20B tokens) — comfortably enough for a Chinchilla-optimal 760M run (~15B tokens), and it fits on your workstation’s 916GB disk but not sensibly on the Air. For the MI300X, streaming + on-the-fly tokenization into a memmapped .bin (nanoGPT-style prepare.py) is the cleanest path; tokenizing 20B tokens with tiktoken is a one-time ~1–2 hr CPU job. TinyLlama-1.1B used this dataset as its code mix, so its repo is a good reference for a working ingestion pipeline at your scale.
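The streaming-to-`.bin` idea can be sketched with the standard library alone. This is an illustration of the file layout (a flat array of uint16 token ids with an end-of-text id between documents), not nanoGPT's actual prepare.py, which uses NumPy. `toy_tokenize` is a stand-in byte-level tokenizer so the example has no dependencies; in practice you would pass a real encoder, such as tiktoken's `encode_ordinary`, together with that tokenizer's own end-of-text id.

```python
import mmap
from array import array

EOT_ID = 0  # placeholder end-of-text id (assumption); real tokenizers define their own


def toy_tokenize(text):
    """Stand-in tokenizer: UTF-8 bytes shifted by one so that 0 stays free for EOT."""
    return [b + 1 for b in text.encode("utf-8")]


def write_token_bin(docs, path, tokenize=toy_tokenize, flush_every=1_000_000):
    """Stream documents into a flat uint16 file; returns the number of tokens written."""
    buf = array("H")  # unsigned 16-bit: enough for vocabularies below 65,536 ids
    total = 0
    with open(path, "wb") as f:
        for text in docs:
            buf.extend(tokenize(text))
            buf.append(EOT_ID)  # document boundary
            if len(buf) >= flush_every:
                buf.tofile(f)
                total += len(buf)
                buf = array("H")
        buf.tofile(f)  # flush the remainder
        total += len(buf)
    return total


def open_token_bin(path):
    """Memory-map the file and view it as uint16 tokens without loading it into RAM."""
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return memoryview(mapped).cast("H")
```

`docs` can be any iterator of strings, for example `(row["content"] for row in ds)` over a streaming dataset, so nothing beyond the flush buffer is held in memory. At 2 bytes per token, the ~20B Python tokens come to about 40 GB on disk.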
Note: this is the v1 lineage. If you want the bigger/newer version, the equivalent for StarCoder2 is bigcode/the-stack-v2-train-smol-ids / -full-ids (~900B tokens, but contents must be fetched from Software Heritage S3, as covered earlier). For a first training run, starcoderdata’s inline text makes it far more ergonomic.
References:
- starcoderdata dataset card
- StarCoder: May the source be with you! (paper)
- The Stack: 3 TB of permissively licensed source code (paper)