Frontier Labs' Secret Code Datasets | Generated by AI

2026.06

Question: What code datasets do OpenAI, Google (Gemini), and Anthropic use to train their coding models (Codex, etc.)?

Answer:

The honest top-line: frontier labs stopped disclosing dataset composition around 2022. The GPT-4, Gemini, and Claude technical reports all describe training data in one vague sentence (“publicly available data, licensed data, and data from third parties”). So what we know splits into three buckets: (1) what was disclosed in the early papers, (2) what’s inferable from lawsuits, deals, and open replications, and (3) how the modern post-training recipe works, which has largely replaced raw GitHub scale as the differentiator.

What was actually disclosed (the historical record)

What the open replications tell you (the best proxy)

If you want to see what a frontier code pretraining corpus actually looks like, read the papers from labs that do disclose — the recipes converge:

It’s a reasonable inference, though not a confirmed one, that OpenAI/Google/Anthropic pretraining corpora are a superset of this: full GitHub clones (including non-permissive licenses, which is what the litigation is about), commit histories and diffs, GitHub issues and PR discussions, Stack Overflow (OpenAI signed a licensing deal with Stack Overflow in 2024, and Google signed one earlier that year; both go through OverflowAPI), package registries and docs (PyPI, npm, readthedocs), and code-adjacent pages from Common Crawl.

The modern differentiator: it’s not the pretraining corpus anymore

For models like GPT-5.x-Codex, Gemini 3, and Claude Opus/Sonnet 4.x, the labs most likely start from roughly the same “all public code” pretraining data. The capability gap now appears to come mainly from post-training data the labs generate themselves:

  1. Synthetic code data at scale: model-generated problems, solutions, and explanations, filtered by execution. The open analogs are Phi-1 (“Textbooks Are All You Need”), which leaned on synthetic textbook-style data, and Magicoder/OSS-Instruct, which seeds an LLM with real code snippets. The general pipeline: seed snippets from real code → LLM generates instruction/solution pairs → keep only the pairs whose solutions pass tests.
  2. RL with verifiable rewards (RLVR): generate solutions, run them against unit tests in sandboxes, and reward on pass/fail. This is the core of the reasoning-model training loop (same family as what DeepSeek-R1 disclosed with GRPO) and a likely reason coding improved faster than other domains: code gives you a cheap, automatically checkable reward signal, although it is only as reliable as the tests behind it.
  3. Agentic trajectories: SWE-bench-style data. Take real repos with real issues and merged PRs, reconstruct the environment, and train on multi-step trajectories (read files → edit → run tests → fix). OpenAI describes its Codex models as trained with RL on real-world software engineering tasks; Anthropic’s Claude Code lineage appears to follow the same idea. The open replications here are SWE-Gym and SWE-smith if you want to see the construction pipeline in code.
  4. Human expert data: all three labs reportedly pay contractors (Surge, Scale, Turing, etc.) for hard coding problems, preference rankings, and trajectory corrections.
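To make the pass/fail reward in item 2 concrete, here is a minimal sketch of how a unit-test reward and a GRPO-style group-relative advantage could be computed. It is an illustration under stated assumptions, not any lab's pipeline: the function names `unit_test_reward` and `group_relative_advantages` are hypothetical, and a plain subprocess with a timeout stands in for a real sandbox (it provides no isolation, so it should not be used on untrusted code).

```python
import statistics
import subprocess
import sys
import tempfile
from pathlib import Path


def unit_test_reward(solution: str, tests: str, timeout_s: float = 5.0) -> float:
    """Return 1.0 if the solution plus its assert-based tests exit cleanly, else 0.0.

    Assumption: tests are plain `assert` statements appended to the solution.
    A subprocess is NOT a sandbox; real pipelines isolate execution properly.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Non-terminating candidates count as failures.
            return 0.0
    return 1.0 if result.returncode == 0 else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), the
    group-relative normalization GRPO is known for.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5"
    candidates = [
        "def add(a, b):\n    return a + b",  # correct
        "def add(a, b):\n    return a - b",  # wrong
    ]
    rewards = [unit_test_reward(c, tests) for c in candidates]
    print(rewards, group_relative_advantages(rewards))
```

The same `unit_test_reward` check can double as the execution filter in item 1: keep a synthetic instruction/solution pair only when its reward is 1.0. When every sample in a group gets the same reward, all advantages are zero, so that prompt contributes no learning signal.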

Practical pointer for you: since you’re training at nanoGPT/nanochat scale, the replicable stack is: pretrain on a filtered slice of The Stack v2 (or smollm-corpus’s Stack subset) with repo-level concatenation + FIM (fill-in-the-middle), then SFT on OSS-Instruct-style synthetic data, then GRPO against unit-test rewards using something like verl or TRL. That’s the same shape as the frontier recipe, at several orders of magnitude less compute. The DeepSeek-Coder and StarCoder2 papers are the two most information-dense reads. They are published, not leaked, but they are probably the closest public proxies for what the closed labs do.
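For the “repo-level concatenation + FIM” step, a minimal character-level sketch might look like the following. The sentinel strings follow the StarCoder-style convention (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`, `<file_sep>`), but treating them as plain strings is an assumption for illustration; in a real tokenizer they would be registered as special tokens. The helper names `concat_repo` and `to_fim_example` and the `fim_rate` default are hypothetical.

```python
import random

# Assumed sentinel strings (StarCoder-style); a real setup registers these
# as special tokens in the tokenizer rather than as raw text.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def concat_repo(files: dict[str, str], sep: str = "<file_sep>") -> str:
    """Join all files of one repo into a single training document.

    Files are ordered by path here for determinism; real pipelines may
    order by dependency or at random.
    """
    return "".join(f"{sep}{path}\n{text}" for path, text in sorted(files.items()))


def to_fim_example(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """With probability `fim_rate`, rewrite `doc` in prefix-suffix-middle order.

    The document is cut at two random character positions; the model then
    sees prefix and suffix first and learns to generate the middle.
    """
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc  # leave as an ordinary left-to-right example
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


if __name__ == "__main__":
    rng = random.Random(0)
    repo = {"util.py": "def add(a, b):\n    return a + b\n", "main.py": "print(1)\n"}
    print(to_fim_example(concat_repo(repo), rng, fim_rate=1.0))
```

Applying FIM to only a fraction of documents keeps ordinary left-to-right examples in the mix; the right fraction is something to tune at your scale rather than a fixed constant.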
