SEC-EDGAR 124M Generation Review

SEC-EDGAR 124M Generation Review | Generated by AI

2026.06

Here’s the detailed quality analysis across all 5 generations:

================================================================
SEC-EDGAR 124M GENERATION QUALITY REVIEW (Step 37k, val_loss 2.45)
================================================================

PROMPT 1 — Business Overview (Healthcare SaaS)

INPUT: ~3 paragraphs, $487.2M revenue, 8200 employees, EHR platform
OUTPUT: 4 coherent paragraphs, then total collapse

WHAT WORKED (lines 17-20):

“Our business is heavily dependent on the resources of our clinical trials” — plausible SEC language, pivots from SaaS to biotech framing
“technology costs are based on materials, manufacturing, test, and customer support costs” — generic but grammatically correct
Bullet list structure with ● formatting maintained

WHAT FAILED (lines 22-112):

After ~200 generated tokens, enters a “commercialization of new product candidates” loop that runs for 90+ lines
Hallucinated drug names: X-Avent, X-Zentib, S-Zentib, Q-partnerib, X-Zitib — none exist, all follow pharma naming patterns
Self-contradictory: starts as healthcare SaaS, becomes biotech
Grammatical collapse: “commercializing our product candidates” used as noun, verb, adjective interchangeably

LONG-RANGE COHERENCE: ★★☆☆☆ — Maintains topic (healthcare) but switches sub-domain (SaaS -> biotech) within 3 paragraphs. No memory of the original company description (EHR, hospitals, etc.)
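
The SaaS-to-biotech switch described above could be located automatically with a simple keyword profile. The following is a minimal sketch, not the method used for this review: the two keyword sets, the window size, and the function names are all hypothetical choices made for illustration.

```python
import re

# Hypothetical keyword sets chosen for illustration; not taken from the review.
SAAS_TERMS = {"subscription", "platform", "software", "ehr", "hospitals", "customers"}
BIOTECH_TERMS = {"clinical", "trials", "candidates", "fda", "commercialization", "drug"}


def drift_profile(text, window=50):
    """Return one (saas_hits, biotech_hits) pair per consecutive window of words."""
    words = re.findall(r"[a-z]+", text.lower())
    profile = []
    for start in range(0, len(words), window):
        chunk = words[start:start + window]
        saas = sum(w in SAAS_TERMS for w in chunk)
        bio = sum(w in BIOTECH_TERMS for w in chunk)
        profile.append((saas, bio))
    return profile


def first_drift_window(profile):
    """Index of the first window where biotech terms outnumber SaaS terms, or None."""
    for i, (saas, bio) in enumerate(profile):
        if bio > saas:
            return i
    return None
```

Running `first_drift_window(drift_profile(generated_text))` on a generation gives a rough word offset (window index times window size) at which the sub-domain changes. A keyword count is crude, so the result is best read as a pointer to where a human should start reading, not as a score.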

PROMPT 2 — MD&A (Revenue/Cost Analysis)

INPUT: Revenue +28%, cost of revenue +22%, gross margin 64.4%
OUTPUT: First continuation paragraph perfect, then 10 paragraphs of “Cost of revenue increased/decreased by $X” loops

WHAT WORKED (line 19, first continuation):

“Cost of revenue increased by $32.1 million, or 26%… primarily attributable to decreased depreciation and amortization expense in the period of the acquisition of DMR”
Proper SEC formula: dollar amount + percentage + explanation
References a specific acquisition (DMR) — hallucinated but plausible

WHAT FAILED (lines 20-27):

10 consecutive paragraphs all starting with “Cost of revenue increased/decreased by $X.X million, or X%”
Numbers become nonsensical: “$40.0 million, or 2%, to $107.1 million… from $162.8 million” (2% of $162.8M is about $3.3M, not $40.0M, and the stated endpoints actually differ by $55.7M)
Internal contradictions: “decreased by $61.2 million, or 13%, to $1.9 million from $2.7 million” (a drop from $2.7M to $1.9M is $0.8M, or about 30%, not $61.2M or 13%)
Same sentence structures repeated verbatim with number swaps

LONG-RANGE COHERENCE: ★☆☆☆☆ — After the first continuation paragraph, loses all numerical consistency. The model learned the TEMPLATE of MD&A paragraphs but can’t maintain arithmetic logic.
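
The arithmetic failures above can be checked mechanically. The following is a minimal sketch under stated assumptions: the regular expression only covers sentences shaped like the ones quoted in this section (amounts in millions, no thousands separators), and `check_change` and its tolerance are hypothetical names and values, not part of any existing tool.

```python
import re

NUM = r"(\d+(?:\.\d+)?)"

# Assumed sentence shape, modeled on the quoted outputs, e.g.
# "decreased by $61.2 million, or 13%, to $1.9 million from $2.7 million".
CHANGE_RE = re.compile(
    r"(increased|decreased) by \$" + NUM + r" million, or " + NUM + r"%, "
    r"to \$" + NUM + r" million.*?from \$" + NUM + r" million"
)


def check_change(sentence, tol=0.5):
    """Compare the stated dollar and percent change with the stated endpoints.

    Returns None if the sentence does not match the assumed shape.
    """
    m = CHANGE_RE.search(sentence)
    if m is None:
        return None
    direction = m.group(1)
    delta, pct, new, old = (float(g) for g in m.groups()[1:])
    if old == 0:
        return None  # percent change is undefined for a zero base
    sign = 1 if direction == "increased" else -1
    implied_delta = new - old
    implied_pct = abs(implied_delta) / old * 100
    return {
        "delta_ok": abs(implied_delta - sign * delta) <= tol,
        "pct_ok": abs(implied_pct - pct) <= tol,
        "implied_delta": round(implied_delta, 1),
        "implied_pct": round(implied_pct, 1),
    }
```

Applied to the second quoted sentence, the check reports both the dollar change and the percentage as inconsistent, with an implied change of -0.8 (million) and an implied percentage of 29.6. Counting the share of matching sentences that pass gives a rough numerical-consistency rate per generation.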

PROMPT 3 — Risk Factors

INPUT: Net losses $42.3M/$67.8M/$89.1M, accumulated deficit $523.4M
OUTPUT: 2 coherent risk factor paragraphs, then “product candidates” loop for 30+ lines

WHAT WORKED (lines 18-24):

“Our quarterly revenue and operating results have varied in the past and may continue to vary significantly from quarter to quarter” — textbook SEC risk factor language
“Factors that may cause our quarterly results to fluctuate include the timing of large enterprise contracts, seasonal purchasing patterns in the healthcare industry” — specific, plausible
Maintains bullet point format with proper transitions

WHAT FAILED (lines 25-49):

“product candidates” appears 47 times in 25 lines
Recursive self-reference: “Our product candidates may fail to develop, develop and commercialize our product candidates may fail”
Grammatical breakdown: “We have suffered a number of risks involved in our research and development programs”
Loses thread of healthcare SaaS, becomes generic biotech risk factors

LONG-RANGE COHERENCE: ★★★☆☆ — Better than others. Maintains risk factor STRUCTURE (heading + explanation) for longer. But content degenerates into repetitive “product candidates” loop. The model clearly over-indexed on biotech risk factors in training data.
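
Counts like the “product candidates” tally above, and the degree of looping more generally, can be measured with a few lines of code. This is a minimal sketch: `phrase_count` and `repeated_ngram_ratio` are hypothetical helper names, and the 4-word n-gram size is an arbitrary choice.

```python
from collections import Counter


def phrase_count(text, phrase):
    """Case-insensitive count of non-overlapping occurrences of a phrase."""
    return text.lower().count(phrase.lower())


def repeated_ngram_ratio(text, n=4):
    """Fraction of word n-grams that duplicate an earlier n-gram in the text.

    0.0 means every n-gram is unique; values near 1.0 indicate a loop.
    """
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)
```

Computing the ratio over a sliding window of the output, instead of the whole text, would also show where the loop begins, which is the transition point each prompt section reports.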

PROMPT 4 — Revenue Recognition Notes (with table)

INPUT: Revenue table ($380M subscription, $89M services, $18M HW) + remaining performance obligations ($892.3M)
OUTPUT: Perfect table echo, one sentence continuation, then blank

WHAT WORKED (lines 12-25):

Table echoed EXACTLY — all numbers, alignment, formatting preserved
“The aggregate amount of the transaction price allocated to remaining performance obligations was $892.3 million” — exact copy of input
Proper ASC 606 language maintained

WHAT FAILED (lines 26-29):

“The table below presents our revenues in the periods indicated” — tries to start another table, then outputs blank spaces
Only generated ~50 actual tokens of continuation before collapse
Model can’t generate new table rows — only echo existing ones

LONG-RANGE COHERENCE: ★★☆☆☆ — Perfect at echoing input, zero ability to extend. This is the fundamental limitation: the model memorized table FORMATS but can’t generate new coherent numbers.
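
The echo-versus-extend split can be quantified by measuring how much of the output is a verbatim copy of the prompt. The sketch below assumes line-level comparison is good enough for table-heavy prompts; `echo_fraction` is a hypothetical helper, and the table rows in the usage note are simplified stand-ins, not the actual prompt text.

```python
def echo_fraction(prompt, output):
    """Share of non-blank output lines that appear verbatim in the prompt."""
    prompt_lines = {line.strip() for line in prompt.splitlines() if line.strip()}
    out_lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not out_lines:
        return 0.0
    echoed = sum(line in prompt_lines for line in out_lines)
    return echoed / len(out_lines)
```

For example, with a three-row prompt table and an output that repeats those rows and adds one new sentence before going blank, `echo_fraction` returns 0.75. A value close to 1.0 combined with a very short continuation matches the “perfect echo, no extension” pattern described here.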

PROMPT 5 — Proxy Statement (Executive Comp Table)

INPUT: 3 executives with full comp breakdown ($5.5M, $3.6M, $3.0M)
OUTPUT: Perfect table echo, adds one broken row, then blank

WHAT WORKED (lines 14-24):

Table structure perfectly preserved — column alignment, dollar signs
All 3 executive rows echoed exactly with correct numbers
“Our executive compensation program is designed to attract, retain, and motivate” — proper proxy boilerplate

WHAT FAILED (line 25):

Attempts to add “William R. Gras” as 4th executive
Only gets: “William R. Gras 100,000 $” — missing most columns
Then blank spaces — model can’t continue the table pattern
“Gras” is likely a hallucinated name fragment

LONG-RANGE COHERENCE: ★★☆☆☆ — Same as Prompt 4. Perfect echo, broken extension. The model treats tables as fixed patterns to reproduce, not as structured data to extend.
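
A broken row like the “William R. Gras” line can be flagged by comparing cell counts against the expected number of columns. This is a minimal sketch that assumes columns in the plain-text table are separated by two or more spaces; `split_row` and `incomplete_rows` are hypothetical helpers, and the complete row in the usage note is invented for illustration.

```python
import re


def split_row(row):
    """Split a whitespace-aligned table row into cells.

    Assumes columns are separated by two or more whitespace characters,
    so single spaces inside a name or "$ 500,000" stay within one cell.
    """
    return [cell for cell in re.split(r"\s{2,}", row.strip()) if cell]


def incomplete_rows(rows, expected_cols):
    """Return indices of rows whose cell count differs from expected_cols."""
    return [i for i, row in enumerate(rows) if len(split_row(row)) != expected_cols]
```

Given a four-column table, a full row such as `"Jane Doe   $ 500,000   $ 1,000,000   $ 1,500,000"` passes, while `"William R. Gras   100,000   $"` splits into only three cells and is reported. Blank rows produce zero cells and are reported as well.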

================================================================
CROSS-PROMPT PATTERNS
================================================================

ECHO vs GENERATE DISTINCTION:
- Model is excellent at ECHOING input (tables, numbers, formatting)
- Model is poor at GENERATING new content that maintains consistency
- This suggests the model learned surface patterns, not underlying data relationships

LOOP ATTRACTORS:
- “commercialization of product candidates” (Prompt 1, 3)
- “Cost of revenue increased by $X million” (Prompt 2)
- “raise additional capital” (earlier tests)
- These are likely among the most common SEC phrases in the training data, so the model falls into them as probability sinks

NUMERICAL COHERENCE:
- Dollar amounts: plausible scale ($1M-$500M range) but internally inconsistent (can’t do arithmetic)
- Percentages: often don’t match the dollar changes cited
- Dates: consistent (always “December 31, 2023/2022”)
- The model learned the FORMAT of numbers, not their MEANING

DOMAIN DRIFT:
- Healthcare SaaS prompt → biotech/pharma within 200 tokens
- This suggests training data is dominated by biotech 10-Ks
- Or: biotech risk factors are the “default” SEC content the model generates when uncertain

GRAMMAR vs LOGIC:
- Grammatical structure remains correct even when content is nonsensical (“We have suffered a number of risks involved in our research and development programs”)
- Subject-verb agreement maintained even in loops
- This is typical of a model that has learned local, surface-level token statistics without reasoning about content

TABLE HANDLING:
- Can reproduce tables with perfect alignment
- Cannot extend tables with new rows
- Cannot maintain numerical consistency across rows
- Suggests table formatting is learned as character patterns, not as structured data

================================================================
OVERALL ASSESSMENT
================================================================

The 124M model, at 78% of training, has learned:
✓ SEC document structure (headings, bullets, tables)
✓ Financial vocabulary and boilerplate language
✓ Proper formatting (dollar signs, percentages, alignment)
✓ Grammatical sentence construction

The model has NOT learned:
✗ Numerical consistency (can’t do arithmetic)
✗ Long-range topic coherence (drifts to dominant training patterns)
✗ Table extension (can echo, can’t continue)
✗ Factual grounding (hallucinates drug names, acquisitions)

This is expected for a 124M parameter model. GPT-2 1.5B (12x larger) shows similar but less severe patterns. The model is useful for:

Generating SEC boilerplate language
Suggesting section structures
Drafting placeholder text that LOOKS like real SEC filings

It is NOT useful for:

Generating accurate financial data
Maintaining consistency across long documents
Producing factually grounded content
