SEC-EDGAR Training Loss Recovery | Generated by AI
Here’s my analysis:
═══════════════════════════════════════════════════
SEC-EDGAR 124M Training: Progress Report
═══════════════════════════════════════════════════
Progress:  36,120 / 47,400 iters = 76.2% (~1.18B / 1.55B tokens)
Remaining: ~11,280 iters ≈ 16 minutes (almost done!)
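The progress figures above are plain arithmetic. A minimal sketch, assuming 32,768 tokens per iteration (the effective batch size quoted elsewhere in this report) and a per-iteration time back-derived from the 16-minute estimate rather than measured. `run_progress` and `sec_per_iter` are illustrative names, not part of any training script:

```python
def run_progress(cur_iter, max_iters, tokens_per_iter, sec_per_iter):
    """Summarize how far a fixed-length training run has progressed."""
    remaining = max_iters - cur_iter
    return {
        "pct_done": 100.0 * cur_iter / max_iters,
        "tokens_seen": cur_iter * tokens_per_iter,
        "tokens_total": max_iters * tokens_per_iter,
        "remaining_iters": remaining,
        "eta_minutes": remaining * sec_per_iter / 60.0,
    }


# ASSUMPTION: 0.085 s/iter is 16 min / 11,280 iters, not a measured value.
stats = run_progress(36_120, 47_400, tokens_per_iter=32_768, sec_per_iter=0.085)

print(f"{stats['pct_done']:.1f}% done")
print(f"{stats['tokens_seen'] / 1e9:.2f}B / {stats['tokens_total'] / 1e9:.2f}B tokens")
print(f"{stats['remaining_iters']} iters left, about {stats['eta_minutes']:.0f} min")
```

If the logged time per iteration differs from the assumed value, the ETA scales linearly with it, so it is worth plugging in the real number from the training log.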
═══════════════════════════════════════════════════
VALIDATION LOSS CURVE (the reliable metric)
═══════════════════════════════════════════════════
Step    Val Loss   Notes
-----   --------   -----
17000   2.27       ← earliest data in buffer, already good
18000   3.31
19000   2.70
20000   6.36       ← SPIKE START
21000   8.04       ← PEAK (3.5x normal!)
22000   6.44
23000   6.14
24000   5.99
25000   5.27
26000   4.68       ← steady recovery
27000   4.24
28000   3.96
29000   3.59
30000   3.34
31000   3.20
32000   3.01
33000   2.93
34000   2.77
35000   2.67
36000   2.52       ← current, STILL DECLINING
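The annotations in the curve above can be checked mechanically. The following is a rough sketch that uses only the values from the table; `summarize_spike` is a hypothetical helper, and treating step 20000 as the spike start follows the table's own annotation:

```python
# Validation losses copied from the table above: (step, val_loss).
VAL_LOSS = [
    (17000, 2.27), (18000, 3.31), (19000, 2.70), (20000, 6.36),
    (21000, 8.04), (22000, 6.44), (23000, 6.14), (24000, 5.99),
    (25000, 5.27), (26000, 4.68), (27000, 4.24), (28000, 3.96),
    (29000, 3.59), (30000, 3.34), (31000, 3.20), (32000, 3.01),
    (33000, 2.93), (34000, 2.77), (35000, 2.67), (36000, 2.52),
]


def summarize_spike(curve, spike_start_step):
    """Find the peak and the first eval that drops back under the pre-spike loss."""
    before = [(s, l) for s, l in curve if s < spike_start_step]
    best_before = min(l for _, l in before)        # lowest loss before the spike
    pre_spike_step, pre_spike_loss = before[-1]    # last eval before the spike
    peak_step, peak_loss = max(curve, key=lambda p: p[1])
    recovered_step = next(
        (s for s, l in curve if s > peak_step and l < pre_spike_loss), None
    )
    return {
        "best_before": best_before,
        "pre_spike": (pre_spike_step, pre_spike_loss),
        "peak": (peak_step, peak_loss),
        "peak_ratio": peak_loss / best_before,
        "recovered_step": recovered_step,
    }


summary = summarize_spike(VAL_LOSS, spike_start_step=20000)
print(summary)
```

On this data the peak is 8.04 at step 21000, about 3.5x the 2.27 logged at step 17000, and the first eval back under the 2.70 from step 19000 is step 35000.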
═══════════════════════════════════════════════════
THE SPIKE at step 20k: what happened
═══════════════════════════════════════════════════
Validation loss jumped from ~2.7 to 8.0 between steps 19k and 21k, after an earlier wobble to 3.31 at step 18k. This is NOT normal overfitting or LR decay. The most likely cause (a hypothesis, not confirmed from the logs) is that the SEC-EDGAR data stream hit a run of structurally different filings, e.g. XBRL-heavy or unusual 10-K/10-Q sections unlike the text the model had seen so far. SEC filings vary wildly in structure: a simple 8-K looks nothing like a 300-page 10-K with nested tables, XBRL tags, and legal boilerplate.
The model has since recovered: the current 2.52 is below the 2.70 logged at step 19000 just before the spike, though still above the 2.27 logged at step 17000.
═══════════════════════════════════════════════════
COMPARISON: SEC-EDGAR vs GitHub-Code
═══════════════════════════════════════════════════
                 SEC-EDGAR          GitHub-Code
Model:           GPT-2 124M         GPT-2 124M
Val Loss:        2.52 (↓ing)        3.466 (final)
Output quality:  TBD                Garbage
Data quality:    Clean (SEC)        ~46% noise (codeparrot)
Status:          76% done           Done (abandoned)
SEC-EDGAR is already at 2.52 val loss and STILL improving. GitHub-Code bottomed out at 3.466 and produced garbage. The most likely explanation is data quality: SEC filings are structured, consistent financial prose, while codeparrot/github-code was ~46% noise (auto-generated files, minified code, configs, etc.).
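To make the gap between the two runs more tangible, the losses can be converted to perplexity. This is a small sketch that assumes the reported val loss is mean cross-entropy in nats per token, as is usual for GPT-2 style training scripts. The report does not state this explicitly, and perplexities on two different corpora are only a rough comparison:

```python
import math


def perplexity(loss_nats):
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)


# Val losses taken from the comparison above.
for name, loss in [("SEC-EDGAR", 2.52), ("GitHub-Code", 3.466)]:
    print(f"{name}: val loss {loss} -> perplexity {perplexity(loss):.1f}")
# SEC-EDGAR: val loss 2.52 -> perplexity 12.4
# GitHub-Code: val loss 3.466 -> perplexity 32.0
```

Under that assumption the SEC-EDGAR model is, on average, choosing among roughly 12 equally likely next tokens on its own validation set, versus roughly 32 for the GitHub-Code run on its set.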
═══════════════════════════════════════════════════
KEY PATTERNS
═══════════════════════════════════════════════════
- PER-ITER LOSS NOISE IS HUGE: individual steps swing 0.8–3.5. This is normal for an effective batch of 32,768 tokens. The VAL loss (averaged over 100 eval iters) is the real signal.
- STILL DECLINING: val loss went 2.67 → 2.52 in the last 1000 steps. This run has more to give. It’s not plateaued yet.
- EXPECTED FINAL VAL LOSS: ~2.3-2.4 by step 47,400 if the trend holds. For reference, GPT-2 124M on OpenWebText gets ~2.85 val loss. SEC-EDGAR may beat that because the domain is narrower/more regular.
- LEARNING RATE: still decaying (min_lr=6e-5, will bottom out at step 47400). MFU is stable at 12.79%, which is expected on a 4070 with no compile (compile=False).
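For the learning-rate item above, here is a sketch of a cosine decay schedule with linear warmup, a common choice for GPT-2 style runs. Only `min_lr=6e-5` and the 47,400-step decay horizon come from this report. The schedule shape, the peak rate `max_lr=6e-4`, and `warmup_iters=2000` are assumptions used for illustration:

```python
import math


def get_lr(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=2000, lr_decay_iters=47_400):
    """Cosine decay with linear warmup (max_lr and warmup_iters are assumed values)."""
    if it < warmup_iters:
        # Linear warmup from near zero up to max_lr.
        return max_lr * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        # Past the decay horizon the rate stays at the floor.
        return min_lr
    # Cosine interpolation from max_lr down to min_lr.
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)


for step in (2_000, 20_000, 36_120, 47_400):
    print(f"step {step}: lr = {get_lr(step):.2e}")
```

With these assumed settings the rate at step 36,120 is still above the floor, which matches the report's note that it is still decaying and reaches `min_lr` only at step 47,400.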
Let it finish; it should be done in ~16 min. The checkpoint at step 36000 (val_loss=2.52) is already the best since the spike, though the 2.27 logged at step 17000 is still the lowest value in the buffer.