Frontier Labs' Secret Code Datasets

Generated by AI

2026.06

Question: What code datasets do OpenAI, Google (Gemini), and Anthropic use to train their coding models (Codex, etc.)?

Answer:

The honest top-line: frontier labs stopped disclosing dataset composition around 2022. The GPT-4, Gemini, and Claude technical reports each describe training data in a sentence or two of boilerplate (along the lines of “publicly available data, licensed data, and data from third parties”). So what we know splits into three buckets: (1) what was disclosed in the early papers, (2) what’s inferable from lawsuits, deals, and open replications, and (3) how the modern post-training recipe works, which has largely replaced raw GitHub scale as the differentiator.

What was actually disclosed (the historical record)

OpenAI Codex (2021 paper, arXiv:2107.03374): 159 GB of deduplicated Python files scraped from 54M public GitHub repos in May 2020. Only files under 1 MB were collected, and the filter dropped likely auto-generated code, files with an average line length above 100, and files with a low ratio of alphanumeric characters. The model was fine-tuned from GPT-3. The Codex-S variant in the same paper added a “supervised fine-tuning” set of standalone, correctly implemented functions drawn from competitive programming sites and from repos with CI; correctness was verified by running tests, which makes it an early ancestor of execution-feedback training.
GPT-3 itself already contained code incidentally via Common Crawl and WebText; GPT-4 onward, nothing is disclosed.
Google: PaLM (2022) disclosed 5% code (39B tokens) from GitHub across 24 languages. AlphaCode (DeepMind, 2022) disclosed 715 GB of GitHub code plus a curated CodeContests dataset for fine-tuning. Gemini reports say only “web documents, books, and code” — nothing quantified. Google also has an internal monorepo (billions of lines) and disclosed in the DIDACT work (2023) that they train on internal Google developer activity — edit histories, build errors, code review comments — not just final code snapshots. That’s a data source no one else has.
Anthropic: has never published code-data composition. Public statements: training on publicly available internet data, licensed third-party data, and data generated by contractors or by Claude itself, with its crawler respecting robots.txt. Anthropic doesn’t train on API customer data by default. The copyright lawsuit over books (Bartz v. Anthropic) revealed a lot about how text was acquired, but nothing specific to code.
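The Codex file filter described above is simple enough to sketch. This is a minimal illustration, not OpenAI's code: the function name `keep_python_file` is hypothetical, the exact byte limit for “under 1 MB” is an assumption, the maximum-line-length cutoff of 1000 comes from my reading of the paper rather than from the summary above, and the alphanumeric threshold of 0.25 is a placeholder because the paper does not give a number.

```python
def keep_python_file(
    text: str,
    max_bytes: int = 1_000_000,       # assumption: "1 MB" read as 10^6 bytes
    max_avg_line_len: float = 100.0,  # from the Codex paper
    max_line_len: int = 1000,         # assumption: cutoff as I recall it from the paper
    min_alnum_ratio: float = 0.25,    # hypothetical: the paper gives no number
) -> bool:
    """Return True if a source file passes Codex-style quality heuristics."""
    if not text or len(text.encode("utf-8")) >= max_bytes:
        return False

    lines = text.splitlines()
    if not lines:
        return False

    # Very long lines usually mean minified or machine-generated content.
    avg_len = sum(len(line) for line in lines) / len(lines)
    if avg_len > max_avg_line_len:
        return False
    if max(len(line) for line in lines) > max_line_len:
        return False

    # Files that are mostly punctuation or whitespace are rarely real code.
    alnum = sum(ch.isalnum() for ch in text)
    return alnum / len(text) >= min_alnum_ratio
```

Detecting auto-generated files (for example by header comments) is left out here, since the summary above does not say how it was done.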

What the open replications tell you (the best proxy)

If you want to see what a frontier code pretraining corpus actually looks like, read the papers from labs that do disclose — the recipes converge:

The Stack v2 / StarCoder2 (BigCode): 67.5 TB raw → ~3TB after filtering, 619 languages from Software Heritage, with license filtering, near-dedup via MinHash, PII redaction, and quality filters. This is roughly the open-source ceiling of “all of permissive GitHub.”
DeepSeek-Coder (most relevant to you given your DeepSeek v4 interest): 2T tokens, 87% source code, 10% English code-related (GitHub Markdown, StackExchange), 3% Chinese. Key trick everyone now copies: repo-level data construction — topologically sorting files within a repo by dependency graph so the model sees imports before usage, instead of shuffled single files. Plus FIM (fill-in-the-middle) at ~50% rate.
Qwen-Coder, Code Llama papers tell the same story: GitHub + StackOverflow/StackExchange + Jupyter notebooks + commit diffs + issues/PRs + documentation.
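The repo-level ordering idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not DeepSeek's pipeline: `local_imports` and `order_repo_files` are hypothetical names, the repo is a dict of path to source, only Python `import` statements are parsed, and each file is mapped to a module by its bare filename. Cycles are tolerated by always picking the file with the fewest unmet dependencies.

```python
import ast


def local_imports(source: str, modules: set[str]) -> set[str]:
    """Top-level names of in-repo modules imported by `source`."""
    found: set[str] = set()
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return found  # unparsable files simply contribute no edges
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in modules:
                found.add(root)
    return found


def order_repo_files(repo: dict[str, str]) -> list[str]:
    """Order files so that dependencies come before the files that import them."""
    # Assumption: flat layout, so "pkg/utils.py" is importable as "utils".
    module_of = {path: path.rsplit("/", 1)[-1].removesuffix(".py") for path in repo}
    path_of = {module: path for path, module in module_of.items()}
    deps = {
        path: {path_of[m] for m in local_imports(src, set(path_of))} - {path}
        for path, src in repo.items()
    }

    ordered: list[str] = []
    remaining = set(repo)
    while remaining:
        # On a DAG this is a plain topological sort; on a cycle it still makes progress.
        nxt = min(remaining, key=lambda p: (len(deps[p] & remaining), p))
        ordered.append(nxt)
        remaining.remove(nxt)
    return ordered
```

A training sample is then the concatenation of the files in that order, typically with a path comment ahead of each file so the model can tell them apart.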

It’s a reasonable inference that OpenAI/Google/Anthropic pretraining corpora are a superset of this: full GitHub clones (including non-permissive licenses, which is what the litigation is about), commit histories and diffs, GitHub issues and PR discussions, Stack Overflow (both Google and OpenAI signed licensing deals with Stack Overflow in 2024 through its OverflowAPI program, Google first), package registries and docs (PyPI, npm, readthedocs), and Common Crawl code-adjacent pages.

The modern differentiator: it’s not the pretraining corpus anymore

For models like GPT-5.x-Codex, Gemini 3, and Claude Opus/Sonnet 4.x, everyone has roughly the same “all public code” pretraining data. The capability gap now comes from post-training data the labs generate themselves:

Synthetic code data at scale: model-generated problems, solutions, and explanations, filtered by execution. The open analog is what Phi-1 (“Textbooks Are All You Need”) and Magicoder/OSS-Instruct demonstrated: seed snippets from real code → LLM generates instruction/solution pairs → filter by running tests.
RL with verifiable rewards (RLVR): generate solutions, run them against unit tests in sandboxes, reward on pass/fail. This is the core of the reasoning-model training loop (same family as what DeepSeek-R1 disclosed with GRPO) and a large part of why coding improved so much faster than other domains: unit tests give a cheap, automatically checkable reward signal, though one only as reliable as the tests themselves.
Agentic trajectories: SWE-bench-style data. Take real repos with real issues and merged PRs, reconstruct the environment, and train on multi-step trajectories (read files → edit → run tests → fix). OpenAI’s Codex models are explicitly described as trained on real-world software engineering tasks via RL; Anthropic’s Claude Code lineage is the same idea. The open replications here are SWE-Gym and SWE-smith if you want to see the construction pipeline in code.
Human expert data: all three labs pay contractors (Surge, Scale, Turing, etc.) for hard coding problems, preference rankings, and trajectory corrections.
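To make the verifiable-reward loop concrete, here is a minimal sketch of the two pieces it needs: a pass/fail reward from running unit tests, and the group-relative advantage that GRPO computes over several samples for the same prompt. The function names are hypothetical, and a bare `subprocess` call with a timeout is an assumption standing in for a real sandbox (it provides no isolation and should not be used on untrusted model output).

```python
import statistics
import subprocess
import sys
import tempfile
from pathlib import Path


def unit_test_reward(solution: str, tests: str, timeout_s: float = 10.0) -> float:
    """Return 1.0 if `tests` (plain assert statements) pass against `solution`, else 0.0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # infinite loops and hangs count as failures
    return 1.0 if proc.returncode == 0 else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's sample group (the GRPO-style baseline)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

If every sample in a group passes, or every sample fails, all advantages are zero and the group contributes no gradient, which is one reason problem difficulty has to be matched to the model.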

Practical pointer for you: since you’re training at nanoGPT/nanochat scale, the replicable stack is: pretrain on a filtered slice of The Stack v2 (or smollm-corpus’s Stack subset) with repo-level concatenation + FIM, then SFT on OSS-Instruct-style synthetic data, then GRPO against unit-test rewards using something like verl or TRL. That’s the same shape as the frontier recipe, minus three orders of magnitude of compute. The DeepSeek-Coder and StarCoder2 papers are the two most information-dense reads — they’re effectively the leaked blueprints of what the closed labs do.
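For the FIM step in that stack, a character-level sketch of the prefix-suffix-middle transformation looks roughly like this. The sentinel strings are placeholders (each tokenizer defines its own special tokens), `maybe_fim` is a hypothetical name, and real pipelines usually apply this per document before tokenization and packing.

```python
import random

# Assumption: placeholder sentinels; use the special tokens of your own tokenizer.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def maybe_fim(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """With probability `fim_rate`, rewrite `doc` into prefix-suffix-middle order."""
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc  # left as an ordinary left-to-right sample

    # Two distinct cut points split the document into prefix, middle, suffix.
    start, end = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:start], doc[start:end], doc[end:]

    # The model sees prefix and suffix first, then learns to generate the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


if __name__ == "__main__":
    rng = random.Random(0)
    print(maybe_fim("def add(a, b):\n    return a + b\n", rng, fim_rate=1.0))
```

The loss is still ordinary next-token prediction over the rearranged string, so no architectural change is needed.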
