SEC-EDGAR 124M Generation Review | Generated by AI

2026.06

Here’s the detailed quality analysis across all 5 generations:

================================================================
SEC-EDGAR 124M GENERATION QUALITY REVIEW (Step 37k, val_loss 2.45)
================================================================

PROMPT 1 — Business Overview (Healthcare SaaS)

INPUT: ~3 paragraphs, $487.2M revenue, 8200 employees, EHR platform
OUTPUT: 4 coherent paragraphs, then total collapse

WHAT WORKED (lines 17-20):

WHAT FAILED (lines 22-112):

LONG-RANGE COHERENCE: ★★☆☆☆ — Maintains topic (healthcare) but switches sub-domain (SaaS -> biotech) within 3 paragraphs. No memory of the original company description (EHR, hospitals, etc.)
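
The sub-domain switch noted above can be located with a crude keyword tally per paragraph. This is a minimal sketch, not the method used in this review: the keyword sets `SAAS_TERMS` and `BIOTECH_TERMS` and both function names are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical keyword sets, chosen only for illustration.
SAAS_TERMS = {"ehr", "platform", "subscription", "software", "hospitals", "customers"}
BIOTECH_TERMS = {"clinical", "fda", "candidates", "trials", "commercialization", "drug"}

def domain_scores(paragraph):
    """Count keyword hits for each domain in one paragraph."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    saas = sum(w in SAAS_TERMS for w in words)
    biotech = sum(w in BIOTECH_TERMS for w in words)
    return saas, biotech

def first_drift_paragraph(text):
    """Index of the first paragraph where biotech terms outnumber SaaS terms, else None."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for i, paragraph in enumerate(paragraphs):
        saas, biotech = domain_scores(paragraph)
        if biotech > saas:
            return i
    return None
```

Running `first_drift_paragraph` on each generation would give a rough drift position to compare across checkpoints. A keyword tally is blunt, so it is better suited to ranking outputs than to judging any single one.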

PROMPT 2 — MD&A (Revenue/Cost Analysis)

INPUT: Revenue +28%, cost of revenue +22%, gross margin 64.4%
OUTPUT: First continuation paragraph perfect, then 10 paragraphs of “Cost of revenue increased/decreased by $X” loops

WHAT WORKED (line 19, first continuation):

WHAT FAILED (lines 20-27):

LONG-RANGE COHERENCE: ★☆☆☆☆ — After the first continuation paragraph, loses all numerical consistency. The model learned the TEMPLATE of MD&A paragraphs but can’t maintain arithmetic logic.
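
The loss of numerical consistency can be checked mechanically when a sentence states a dollar change, a percentage, and a resulting amount. The sketch below assumes a specific sentence shape ("increased by $X million, or Y%, to $Z million"); that shape, the regex, and the name `check_change` are assumptions for illustration, and real generations will need a looser parser.

```python
import re

# Assumed sentence shape (hypothetical, for illustration):
# "... increased by $12.4 million, or 22%, to $68.8 million ..."
PATTERN = re.compile(
    r"(increased|decreased) by \$([\d.]+) million, or ([\d.]+)%, to \$([\d.]+) million"
)

def check_change(sentence, tolerance_pct=1.0):
    """True/False if the stated percentage matches the dollar figures; None if no match."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    direction = m.group(1)
    delta, pct, current = float(m.group(2)), float(m.group(3)), float(m.group(4))
    # Recover the prior-period amount from the current amount and the change.
    prior = current - delta if direction == "increased" else current + delta
    if prior <= 0:
        return False
    implied_pct = 100.0 * delta / prior
    return abs(implied_pct - pct) <= tolerance_pct
```

The fraction of matched sentences that return `False` gives a simple per-generation inconsistency rate.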

PROMPT 3 — Risk Factors

INPUT: Net losses $42.3M/$67.8M/$89.1M, accumulated deficit $523.4M
OUTPUT: 2 coherent risk factor paragraphs, then “product candidates” loop for 30+ lines

WHAT WORKED (lines 18-24):

WHAT FAILED (lines 25-49):

LONG-RANGE COHERENCE: ★★★☆☆ — Better than others. Maintains risk factor STRUCTURE (heading + explanation) for longer. But content degenerates into repetitive “product candidates” loop. The model clearly over-indexed on biotech risk factors in training data.
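
Repetitive loops like the one described above can be quantified by counting repeated word n-grams. This is a minimal sketch under stated assumptions: the function names and the choice of `n=4` are illustrative and not part of the original evaluation.

```python
from collections import Counter

def _ngrams(text, n):
    """Lowercased word n-grams of the text, in order of appearance."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def repeated_ngram_fraction(text, n=4):
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    grams = _ngrams(text, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(grams)

def most_common_ngram(text, n=4):
    """Most frequent n-gram and its count, or None if the text is too short."""
    grams = _ngrams(text, n)
    if not grams:
        return None
    gram, count = Counter(grams).most_common(1)[0]
    return " ".join(gram), count
```

A fraction near zero indicates varied text and a fraction approaching one indicates a loop. `most_common_ngram` surfaces the attractor phrase itself.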

PROMPT 4 — Revenue Recognition Notes (with table)

INPUT: Revenue table ($380M subscription, $89M services, $18M HW) + remaining performance obligations ($892.3M)
OUTPUT: Perfect table echo, one sentence continuation, then blank

WHAT WORKED (lines 12-25):

WHAT FAILED (lines 26-29):

LONG-RANGE COHERENCE: ★★☆☆☆ — Perfect at echoing input, zero ability to extend. This is the fundamental limitation: the model memorized table FORMATS but can’t generate new coherent numbers.
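
One way to separate echoing from extending is to measure how much of the output already appears in the prompt. The sketch below is an assumption-laden illustration: `echo_ratio` is a hypothetical helper that compares whole lines after collapsing whitespace, which suits table rows but would miss paraphrased prose.

```python
def echo_ratio(prompt, output):
    """Fraction of non-blank output lines that appear in the prompt,
    compared after collapsing runs of whitespace."""
    def norm(line):
        return " ".join(line.split())

    prompt_lines = {norm(line) for line in prompt.splitlines() if line.strip()}
    out_lines = [norm(line) for line in output.splitlines() if line.strip()]
    if not out_lines:
        return 0.0
    echoed = sum(line in prompt_lines for line in out_lines)
    return echoed / len(out_lines)
```

A ratio close to 1.0 with very few new lines matches the "perfect echo, no extension" pattern described here.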

PROMPT 5 — Proxy Statement (Executive Comp Table)

INPUT: 3 executives with full comp breakdown ($5.5M, $3.6M, $3.0M)
OUTPUT: Perfect table echo, adds one broken row, then blank

WHAT WORKED (lines 14-24):

WHAT FAILED (line 25):

LONG-RANGE COHERENCE: ★★☆☆☆ — Same as Prompt 4. Perfect echo, broken extension. The model treats tables as fixed patterns to reproduce, not as structured data to extend.
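
A broken table row can often be caught by checking that its components add up to its stated total. This is a minimal sketch, assuming rows have already been parsed into `(label, components, stated_total)` tuples; the parsing step, the tuple layout, and the name `inconsistent_rows` are all hypothetical.

```python
def inconsistent_rows(rows, tolerance=0.05):
    """Return labels of rows whose components do not sum to the stated total.

    rows: iterable of (label, components, stated_total), amounts in the same unit.
    tolerance: allowed absolute difference, to absorb rounding in the table.
    """
    bad = []
    for label, components, stated_total in rows:
        if abs(sum(components) - stated_total) > tolerance:
            bad.append(label)
    return bad
```

Applied to both the echoed rows and any generated rows, this distinguishes rows copied intact from rows the model invented without internal consistency.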

================================================================
CROSS-PROMPT PATTERNS
================================================================

  1. ECHO vs GENERATE DISTINCTION:
    • Model is excellent at ECHOING input (tables, numbers, formatting)
    • Model is poor at GENERATING new content that maintains consistency
    • This suggests the model learned surface patterns, not underlying data relationships
  2. LOOP ATTRACTORS:
    • “commercialization of product candidates” (Prompt 1, 3)
    • “Cost of revenue increased by $X million” (Prompt 2)
    • “raise additional capital” (earlier tests)
    • These are likely among the most frequent phrases in the SEC training data, so the model falls into them as probability sinks
  3. NUMERICAL COHERENCE:
    • Dollar amounts: plausible scale ($1M-$500M range) but internally inconsistent (can’t do arithmetic)
    • Percentages: often don’t match the dollar changes cited
    • Dates: consistent (always “December 31, 2023/2022”)
    • The model learned the FORMAT of numbers, not their MEANING
  4. DOMAIN DRIFT:
    • Healthcare SaaS prompt → biotech/pharma within 200 tokens
    • This suggests training data is dominated by biotech 10-Ks
    • Or: biotech risk factors are the “default” SEC content the model generates when uncertain
  5. GRAMMAR vs LOGIC:
    • Grammatical structure remains correct even when content is nonsensical (“We have suffered a number of risks involved in our research and development programs”)
    • Subject-verb agreement maintained even in loops
    • This is typical of a model that has captured local, surface-level word statistics without modeling the meaning of what it writes
  6. TABLE HANDLING:
    • Can reproduce tables with perfect alignment
    • Cannot extend tables with new rows
    • Cannot maintain numerical consistency across rows
    • Suggests table formatting is learned as character patterns, not as structured data

================================================================
OVERALL ASSESSMENT
================================================================

The 124M model at 78% training has learned:
  ✓ SEC document structure (headings, bullets, tables)
  ✓ Financial vocabulary and boilerplate language
  ✓ Proper formatting (dollar signs, percentages, alignment)
  ✓ Grammatical sentence construction

The model has NOT learned:
  ✗ Numerical consistency (can’t do arithmetic)
  ✗ Long-range topic coherence (drifts to dominant training patterns)
  ✗ Table extension (can echo, can’t continue)
  ✗ Factual grounding (hallucinates drug names, acquisitions)

This is expected for a 124M parameter model. GPT-2 1.5B (12x larger) shows similar but less severe patterns. The model is useful for:

It is NOT useful for:

