Undertrained Model Degenerate Attractor | Generated by AI
Question: Why is my GPT-2 / nanoGPT model degenerating into a repeating .S.S.S.S loop when trained on GitHub code (BigCode/The Stack from HuggingFace)?
Answer:
That .S.S.S.S loop is not a code-data problem. It’s the classic signature of an undertrained model sampling at temperature into a degenerate attractor. Let me break down the actual mechanics, because the symptom is diagnostic.
What you’re seeing, mechanically
Your prompt replay is clean (lines 1–48) most likely because that part is not generated at all: nanoGPT's sample.py decodes the prompt tokens together with the continuation, so the prompt is echoed back verbatim. The collapse starts at the first token the model has to produce itself. That gap, clean prompt followed by garbage generation, tells you the model has learned a few local n-gram statistics but not the actual distribution of code.
The .S specifically: S is likely a high-frequency BPE token in your vocab (think System, String, .S-prefixed merges everywhere in Java — your prompt is full of System.out.println). The model found that . → S → . → S is a high-probability local transition, and with nothing else learned, sampling falls into that 2-cycle. Once in it, every step reinforces it. This is a repetition trap: the argmax-ish region of the distribution is a short cycle, and top_k=200, temp=0.8 isn’t enough entropy to escape because the cycle tokens dominate the probability mass.
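To check whether a sample has actually fallen into a short cycle like the one described above, a small detector over the generated token ids is enough. This is a minimal sketch: the function name, the thresholds, and the token ids in the usage lines are made up for illustration and do not come from nanoGPT.

```python
def shortest_trailing_cycle(token_ids, max_period=8, min_repeats=4):
    """Return the shortest period p such that the tail of token_ids is a
    single p-token pattern repeated at least min_repeats times, else None."""
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(token_ids) < span:
            break  # longer periods need even more tokens, so stop here
        tail = token_ids[-span:]
        pattern = tail[:period]
        if all(tail[i] == pattern[i % period] for i in range(span)):
            return period
    return None


# Hypothetical ids: three ordinary tokens, then "." (13) and "S" (50) alternating.
sample_ids = [5, 9, 2] + [13, 50] * 10
print(shortest_trailing_cycle(sample_ids))  # 2
```

Running this over a batch of samples gives a quick count of how many end in a period-2 loop, which is easier to track across checkpoints than eyeballing the output.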
The real cause: it’s undertrained, period
A from-scratch GPT-2 760M trained compute-optimally wants roughly 20 tokens per parameter (the Chinchilla rule of thumb), which works out to about 15B tokens, and realistically more for code. Chinchilla is a compute-optimality result rather than a hard threshold for coherent text, but it is a reasonable target. If you’ve trained on far less (a few hundred M tokens, a few thousand iters), this output is exactly what you’d expect. For scale, nanoGPT’s own GPT-2 (124M) reproduction runs ~300B tokens on OpenWebText, which takes about 4 days on an 8×A100 node. On a single MI300X you can do it, but you need the token budget.
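The budget arithmetic can be written out as a short sketch. The 20 tokens/param ratio is the rule of thumb quoted above; the batch configuration in the usage lines is a hypothetical example, not your actual settings.

```python
def chinchilla_token_budget(n_params, tokens_per_param=20):
    """Rule-of-thumb token target for a model with n_params parameters."""
    return n_params * tokens_per_param


def iters_needed(target_tokens, batch_size, block_size, grad_accum_steps, world_size=1):
    """Optimizer steps needed to see target_tokens (rounded up)."""
    tokens_per_iter = batch_size * block_size * grad_accum_steps * world_size
    return -(-target_tokens // tokens_per_iter)  # ceiling division on ints


budget = chinchilla_token_budget(760_000_000)
print(f"token budget: {budget / 1e9:.1f}B")  # token budget: 15.2B

# Hypothetical config: 16 sequences x 1024 tokens x 32 accumulation steps.
print(iters_needed(budget, batch_size=16, block_size=1024, grad_accum_steps=32))  # 28992
```

Comparing that iteration count with your current iter_num shows directly what fraction of the target the run has covered.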
Quick sanity checklist — run these before assuming anything exotic:
# 1. What's your actual loss? Code should get to ~1.0-1.5 train loss for coherence.
# If you're sitting at >3.0, the model literally hasn't learned the language.
# 2. How many tokens have you actually seen?
tokens_seen = iter_num * batch_size * block_size * grad_accum_steps
print(f"tokens seen: {tokens_seen/1e9:.2f}B") # target is ~15B for 760M; multiply by GPU count if using DDP
# 3. Is your data tokenized with a CODE tokenizer?
# GPT-2 BPE on code is wasteful — indentation/whitespace explode token count.
# Use StarCoder/BigCode tokenizer instead.
Concrete fixes, in priority order
1. Check loss first. If train loss isn’t dropping below ~2.0, nothing downstream matters. Paste your loss curve. A flat or high loss = bug (LR too high/low, data not shuffled, bad masking) or just not enough steps.
2. Use the BigCode/StarCoder tokenizer, not GPT-2 BPE. GPT-2’s vocab was built on web text and tokenizes code horribly — each indent level eats tokens, and identifiers fragment. This both wastes your token budget and creates exactly the kind of pathological high-frequency fragments (.S) that trap sampling.
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b") # 49k vocab, code-aware
# tok.encode("System.out.println") → far fewer, more semantic tokens
# Re-tokenize The Stack with this, rebuild your .bin shards.
3. Add repetition penalty at inference to confirm the diagnosis (this is a band-aid, not a fix):
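def sample_with_rep_penalty(logits, generated_ids, penalty=1.3):
    # CTRL-style repetition penalty (Keskar et al., 2019): make already-seen
    # tokens less likely. Positive logits are divided by the penalty and
    # negative logits are multiplied, so both move toward lower probability.
    for tid in set(generated_ids):
        if logits[tid] > 0:
            logits[tid] /= penalty
        else:
            logits[tid] *= penalty
    return logits  # note: logits is modified in place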
If repetition penalty makes the output look better but still semantically garbage, that confirms undertraining — you’ve just suppressed the symptom. The cure is more tokens, not penalty.
4. Lower temperature won’t help here, counterintuitively. The degenerate cycle is the high-probability region, so lowering temp pushes sampling even more deterministically into the loop. Higher temp adds noise, but at the cost of coherence. Neither fixes an undertrained model. Don’t tune sampling to escape a training problem.
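The temperature point can be illustrated with a toy distribution. The sketch below applies temperature scaling and top-k truncation to made-up logits in which one token (standing in for the next token of the loop) dominates. The logit values are invented for illustration, and the truncation mirrors the usual "keep the k largest logits" approach rather than any specific library call.

```python
import math


def sampling_probs(logits, temperature=1.0, top_k=None):
    """Next-token probabilities after temperature scaling and top-k truncation."""
    scaled = [x / temperature for x in logits]
    if top_k is not None and top_k < len(scaled):
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        # Everything below the k-th largest logit gets zero probability.
        scaled = [s if s >= cutoff else -math.inf for s in scaled]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Toy logits (assumed): index 0 is the loop token, the other 999 are weak.
toy_logits = [6.0] + [-0.001 * i for i in range(999)]

for t in (0.5, 0.8, 1.0, 1.5):
    p = sampling_probs(toy_logits, temperature=t, top_k=200)[0]
    print(f"temperature={t}: p(loop token)={p:.4f}, p(50 in a row)={p ** 50:.4f}")
```

In this toy setup, top_k=200 never removes the dominant token, lowering the temperature only increases its share, and raising the temperature escapes the loop by moving mass onto tokens the model has no good reason to prefer.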
On training with The Stack specifically
A few things that bite people on BigCode/The Stack:
- Deduplication matters enormously. Use the-stack-dedup, not raw. Near-duplicate code (forks, vendored deps, generated files) inflates token counts with low-information repetition and encourages exactly the copy-loop behavior you’re seeing.
- Filter by stars / quality. The Stack has a lot of low-quality and autogenerated code. Training on bigcode/the-stack-smol or a starred subset first gives cleaner signal.
- Document packing + proper attention reset. If you pack multiple files into one block_size sequence without resetting attention at document boundaries, the model learns to bleed context across unrelated files, which degrades generation quality.
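For the attention-reset point, one way to picture it is a block-diagonal causal mask: a position may attend only to earlier positions inside the same packed document. The sketch below builds such a mask in plain Python from a list of document lengths. The function name is hypothetical, and how the mask gets wired into the attention call depends on your model code.

```python
def packed_causal_mask(doc_lengths):
    """mask[i][j] is True when position i may attend to position j:
    j is not in the future and both positions belong to the same document."""
    doc_ids = [d for d, n in enumerate(doc_lengths) for _ in range(n)]
    size = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(size)]
            for i in range(size)]


# Two files of 3 and 2 tokens packed into one 5-token block.
for row in packed_causal_mask([3, 2]):
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# ...x.
# ...xx
```

With a plain causal mask the last two rows would be fully filled on the left, which is exactly the cross-file context bleed described in the list above.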
My read on your situation
Given you’re running GPT-2 760M on MI300X (192GB — you have the memory headroom to crank batch size hard), the bottleneck is almost certainly token budget and/or tokenizer, not architecture. The output isn’t “broken weights” — it’s a model that’s seen maybe 1–5% of the tokens it needs. Set a target of at least 10–15B code tokens with the StarCoder tokenizer, watch loss drop below 1.5, and the .S.S.S loop disappears on its own.
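To turn a token target into a rough wall-clock figure, a common rule of thumb is about 6 FLOPs per parameter per training token. The sketch below applies it. The peak throughput and utilization (MFU) values are placeholders chosen for round numbers, not MI300X specifications, and should be replaced with what your run actually measures.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def gpu_hours(n_params, n_tokens, peak_flops_per_s, mfu):
    """Hours on one accelerator, given its peak FLOP/s and achieved utilization."""
    seconds = training_flops(n_params, n_tokens) / (peak_flops_per_s * mfu)
    return seconds / 3600


# Placeholder hardware numbers (assumptions): 1e15 FLOP/s peak at 30% utilization.
hours = gpu_hours(760e6, 15e9, peak_flops_per_s=1.0e15, mfu=0.3)
print(f"estimated training time: {hours:.1f} GPU-hours")
```

The estimate scales linearly with the token target and inversely with utilization, so a measured tokens/second figure from the first few hundred iterations is the more reliable input once the run has started.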
What’s your current iter count, batch×block×grad_accum, and train loss? With those three numbers I can tell you exactly how undertrained you are and how many MI300X-hours you need to get to coherence.
References:
- nanoGPT — Karpathy
- The Stack / BigCode datasets
- StarCoder2 tokenizer
- CTRL: repetition penalty (Keskar et al., 2019)
- Training Compute-Optimal LLMs (Chinchilla, Hoffmann et al., 2022)