SEC-EDGAR 124M Quality Diagnosis & Improvement | Generated by AI

2026.06

OK, I’ve reviewed everything. Here are my analysis and improvement suggestions.

================================================================
SEC-EDGAR 124M: Quality Diagnosis & Improvement Plan
================================================================

CURRENT STATE:
  Model:    GPT-2 124M (12 layers, 768 dim)
  Data:     1.66B tokens, 1 epoch on RTX 4070
  Val loss: ~2.28 (final)
  Quality:  Good SEC language/structure, but loops and drifts to biotech

================================================================
ROOT CAUSE #1: DATA IS 100% 10-K FILINGS
================================================================

This is the biggest problem. The download covered all 10 filing types (10-K, 10-Q, 8-K, S-1, S-8, 144, 20-F, 3, 4, 5), but only the 10-K directory actually contains data (10GB, 17 parquet files). The rest are empty directories (4KB each).
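The directory check described here can be reproduced with a short script. This is a minimal sketch, assuming one subdirectory per filing type that holds `.parquet` files. The function names and the layout are illustrative assumptions, not the actual download code.

```python
from pathlib import Path


def summarize_filing_dirs(root):
    """Return {filing_type: (parquet_file_count, total_bytes)} per subdirectory.

    Assumes one subdirectory per filing type (e.g. 10-K, 10-Q) under `root`.
    """
    summary = {}
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue
        files = list(sub.rglob("*.parquet"))
        summary[sub.name] = (len(files), sum(f.stat().st_size for f in files))
    return summary


def empty_types(summary):
    """List the filing types whose directory contains no parquet files."""
    return [name for name, (count, _) in summary.items() if count == 0]
```

Calling `empty_types(summarize_filing_dirs(data_root))` on the download directory lists the filing types that still need to be fetched.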

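The 10-K set appears to be dominated by biotech/pharma companies (thousands of small-cap biotechs file 10-Ks with nearly identical risk factors about “product candidates”). This explains the drift to biotech noted above, and the repeated risk-factor boilerplate probably contributes to the looping.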

FIX: Download the other filing types, starting with 10-Q, 8-K, and S-1, to add diversity.

================================================================
ROOT CAUSE #2: NO DATA CLEANING
================================================================

The raw SEC filings contain XBRL/HTML tags, near-identical boilerplate paragraphs, and inconsistent whitespace and formatting.

The model memorized these surface patterns. That’s why it can ECHO tables perfectly but can’t GENERATE new content.

FIX: Clean the data before tokenization:

  1. Strip XBRL/HTML tags (keep plain text)
  2. Deduplicate paragraphs (exact match for verbatim repeats, SimHash for near-identical ones)
  3. Normalize whitespace/formatting
  4. Optionally: separate sections (risk factors, MD&A, financials) and train with section markers
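Steps 1 to 3 can be sketched with the standard library alone. This is an illustrative sketch, not a production cleaner. The regex tag stripper is crude (a real HTML parser is safer, and it will also eat text such as `a < b > c`). The 3-word shingles and the `max_hamming=3` threshold are assumptions to tune. The pairwise fingerprint comparison is quadratic, so a corpus of this size would need bucketing by fingerprint bands.

```python
import hashlib
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches HTML and inline XBRL tags alike


def strip_markup(raw):
    """Step 1: drop tags, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", raw))


def normalize_whitespace(text):
    """Step 3: collapse runs of spaces/tabs, keep at most one blank line."""
    text = re.sub(r"[ \t\r\f\v]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def simhash(text, bits=64):
    """64-bit SimHash over 3-word shingles (shingle size is an assumption)."""
    counts = [0] * bits
    words = re.findall(r"\w+", text.lower())
    for i in range(max(len(words) - 2, 1)):
        shingle = " ".join(words[i:i + 3])
        h = int.from_bytes(hashlib.md5(shingle.encode()).digest()[:8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def dedup_paragraphs(paragraphs, max_hamming=3):
    """Step 2: keep a paragraph only if no kept fingerprint is within
    max_hamming bits. Exact duplicates have distance 0, so they are dropped too."""
    kept, seen = [], []
    for p in paragraphs:
        fp = simhash(p)
        if any(bin(fp ^ s).count("1") <= max_hamming for s in seen):
            continue
        seen.append(fp)
        kept.append(p)
    return kept
```

A typical pipeline would apply `strip_markup`, then `normalize_whitespace`, split on blank lines, and pass the paragraphs through `dedup_paragraphs` before tokenization.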

================================================================
ROOT CAUSE #3: MODEL CAPACITY (124M)
================================================================

124M parameters is GPT-2 small. At this size the model can learn:
  ✓ Document structure and vocabulary
  ✓ Grammatical sentence construction
  ✓ Template patterns (headings, bullets)

It is unlikely to learn reliably:
  ✗ Numerical reasoning (arithmetic consistency)
  ✗ Long-range coherence (>500 tokens)
  ✗ Table extension (needs structured understanding)

FIX OPTIONS (in order of effort):

A. REPEAT DATA (easiest, free): Train for 2-3 epochs instead of 1. The val_loss was still declining at step 47k — more passes could help. Set max_iters = 142200 (3x) and lr_decay_iters to match. Risk: overfitting on repeated data. Monitor val_loss closely.
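The schedule arithmetic and the val_loss monitoring can be sketched as follows. The per-epoch iteration count is inferred from 142200 / 3 and is an assumption about the actual run. `should_stop` and its `patience` parameter are hypothetical helpers, not part of the existing training script.

```python
# Iterations in one pass over the data, implied by "142200 (3x)".
ITERS_PER_EPOCH = 142200 // 3  # 47400, consistent with "step 47k"


def schedule_for(epochs):
    """Return max_iters and a matching lr_decay_iters for a given epoch count."""
    max_iters = ITERS_PER_EPOCH * epochs
    return {"max_iters": max_iters, "lr_decay_iters": max_iters}


def should_stop(val_losses, patience=3):
    """True when the best val loss is at least `patience` evaluations old,
    a simple signal that repeated data has started to overfit."""
    if not val_losses:
        return False
    best_idx = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_idx >= patience
```

For example, `schedule_for(2)` gives the settings for a 2-epoch run, and `should_stop` can be called after each evaluation with the history of val_loss values.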

B. BETTER DECODING (immediate, no retraining): Add a repetition penalty, lower the temperature to 0.5, and test nucleus sampling with top-p=0.9.
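A framework-free sketch of one decoding step with these three controls. It works on a plain list of logits so the logic is easy to check; in the real generation loop the same steps would be applied to the model's logits tensor. The penalty value 1.2 is an assumption (the plan only names the technique), and the penalty rule used here is the common one that divides positive logits and multiplies negative ones.

```python
import math
import random


def sample_next_token(logits, prev_tokens, temperature=0.5, top_p=0.9,
                      repetition_penalty=1.2, rng=random):
    """Pick the next token id from raw logits.

    repetition_penalty=1.2 is an assumed starting value, not a tuned one.
    """
    logits = list(logits)
    # Repetition penalty: make already-generated tokens less likely.
    for t in set(prev_tokens):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    # Temperature scaling, then a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus (top-p): keep the smallest high-probability set reaching top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the weights within the nucleus.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

Lower temperature sharpens the distribution, top-p removes the unlikely tail, and the penalty targets the looping directly, so the three settings are worth testing one at a time.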

C. DATA CLEANING + REDO (medium effort): Clean the 10-K data + download other filing types. Retrain for 2 epochs on the improved corpus. This is the highest-impact change.

D. SCALE UP (if VRAM allows): GPT-2 355M (24 layers, 1024 dim) should fit in 12GB of VRAM with batch_size=2, grad_accum=16. Larger models generally handle long-range coherence better.
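A back-of-the-envelope check of the sizes involved. This sketch uses the standard approximation of 12 * layers * dim^2 for the transformer blocks plus the embedding tables (biases and layer norms ignored). The 16 bytes per parameter assumes fp32 weights, gradients, and two AdamW moment buffers, which is an assumption about the training setup. Activation memory is not included, so the estimate is a lower bound.

```python
def approx_params(n_layer, n_embd, vocab_size=50257, block_size=1024):
    """Approximate GPT-2 parameter count (biases and layer norms ignored)."""
    blocks = 12 * n_layer * n_embd ** 2        # attention + MLP weights
    embeddings = (vocab_size + block_size) * n_embd  # token + position tables
    return blocks + embeddings


def training_state_gib(params, bytes_per_param=16):
    """Weights + grads + two AdamW moments in fp32, excluding activations."""
    return params * bytes_per_param / 2 ** 30


small = approx_params(12, 768)     # about 124.3M
medium = approx_params(24, 1024)   # about 354.5M
print(f"124M config: {small / 1e6:.1f}M params, {training_state_gib(small):.2f} GiB state")
print(f"355M config: {medium / 1e6:.1f}M params, {training_state_gib(medium):.2f} GiB state")
```

Under these assumptions the 355M model's weights and optimizer state take a little over 5 GiB, which leaves the rest of a 12GB card for activations; that is why a small micro-batch with gradient accumulation is suggested.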

E. USE A BETTER BASE MODEL: Instead of training from scratch, fine-tune an existing GPT-2 or a small LLaMA on SEC data. The pre-trained model already has general language ability, so training only has to add domain knowledge. At this compute budget, fine-tuning is usually a better use of resources than training from scratch.

================================================================
RECOMMENDED PRIORITY ORDER
================================================================

  1. IMMEDIATE (no retraining):
    • Add repetition_penalty to generation code
    • Lower temperature to 0.5
    • Test with top-p=0.9
  2. SHORT TERM (1-2 days):
    • Download 10-Q + 8-K + S-1 filings (add diversity)
    • Clean XBRL/HTML from existing data
    • Retrain for 2 epochs on cleaned+diversified data
  3. MEDIUM TERM:
    • Fine-tune a pre-trained model instead of from-scratch
    • Try GPT-2 355M if VRAM allows

Want me to start on any of these? I can:
  a) Fix the generation code to add repetition penalty
  b) Download the missing filing types (10-Q, 8-K, S-1, etc.)
  c) Write a data cleaning script (strip XBRL/HTML, dedup)
  d) Set up a fine-tuning pipeline on a pre-trained GPT-2

