AI Training Data Value and Scarcity

AI Training Data Value and Scarcity | Generated by AI

2026.05

Question: Is the analysis in this conversation about personal data collections and their value for AI training correct?

Answer:

The conversation touches on several interesting and largely accurate points about AI training data. Here’s a breakdown:

1. Proprietary / Rare Data Will Be Increasingly Valuable — ✅ Correct

As publicly available internet data gets exhausted, AI labs are actively seeking:

Rare books not digitized online
Domain-specific expert knowledge (e.g., nonwoven fabric engineering books)
Private datasets with unique information

This is well documented. The “data wall”, the point at which labs run out of fresh, high-quality public text to train on, is a real concern in the AI industry right now.

2. FineWeb is a Real Dataset — ✅ Correct

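FineWeb is a large-scale web dataset released by Hugging Face, built from filtered and deduplicated Common Crawl snapshots and commonly used for LLM pretraining. The 44TB figure is in line with the on-disk size Hugging Face reports for the full dataset.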

3. Dario Amodei / Book Scanning Leak — ⚠️ Partially True

There have been reports of major AI companies (including Anthropic) discussing or using book scanning and OCR to expand their training corpora. However, the specific details of any leaked emails should be treated with caution, since not all reported leaks are fully verified.

4. DeepSeek Using OCR to Extract Text from Images — ✅ Plausible

Many modern AI labs use OCR pipelines to extract text from scanned documents and images. DeepSeek's models show strong document understanding capabilities, which is consistent with this approach, although that alone does not confirm how its training data was built.
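
As a rough illustration of what such a pipeline involves, here is a minimal Python sketch of the text clean-up stage that typically follows OCR. It is not DeepSeek's or any lab's actual method. The OCR engine is left as a pluggable function, and the names `OcrEngine`, `clean_ocr_text`, `scan_to_corpus`, and the `min_chars` threshold are hypothetical, chosen only for this example.

```python
import re
from typing import Callable, Iterable

# Hypothetical OCR backend: any function that maps raw page-image bytes to text.
# A real pipeline would plug an actual OCR model or library in here.
OcrEngine = Callable[[bytes], str]


def clean_ocr_text(raw: str) -> str:
    """Normalize raw OCR output into flowing paragraphs."""
    # Rejoin words split by a hyphen at a line break: "engi-\nneering" -> "engineering".
    # Caveat: this also merges genuine hyphenated compounds that happen to wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Blank lines mark paragraph breaks; single line breaks are just page layout.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


def scan_to_corpus(pages: Iterable[bytes], ocr: OcrEngine, min_chars: int = 20) -> list[str]:
    """Run OCR on each page image, clean the text, and drop near-empty pages."""
    corpus = []
    for image in pages:
        text = clean_ocr_text(ocr(image))
        if len(text) >= min_chars:  # assumed threshold for "page has real content"
            corpus.append(text)
    return corpus
```

The engine is passed in as a parameter so the clean-up logic can be exercised with a stub before any real OCR model is attached.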

5. E-books Are the Most Valuable by Content Density — ✅ Correct

Per GB of storage:

Data Type	Information Density	AI Training Value
Video	Low (mostly visual)	Lower
Audio	Medium	Medium
E-books / Text	Very High	Highest

Per gigabyte, plain text carries far more language content than audio or video, which is why it is the most storage-efficient format for training language models. This is well established.
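
To make the per-GB comparison concrete, here is a back-of-the-envelope sketch in Python. Every constant in it (bytes per word, speaking rate, audio and video bitrates) is an illustrative assumption rather than a measurement, and the function names are hypothetical.

```python
# Rough words-per-gigabyte comparison across formats.
# Every constant below is an illustrative assumption, not a measurement.

GB = 1_000_000_000  # bytes in one decimal gigabyte

BYTES_PER_WORD_TEXT = 6        # assumed: about 5 letters plus a space, 1 byte each
SPEECH_WORDS_PER_MINUTE = 150  # assumed conversational speaking rate
AUDIO_BITRATE_BPS = 128_000    # assumed compressed audio, bits per second
VIDEO_BITRATE_BPS = 5_000_000  # assumed compressed HD video, bits per second


def words_per_gb_text(bytes_per_word: int = BYTES_PER_WORD_TEXT) -> float:
    """Words that fit in one GB of uncompressed plain text."""
    return GB / bytes_per_word


def words_per_gb_media(bitrate_bps: int,
                       words_per_minute: int = SPEECH_WORDS_PER_MINUTE) -> float:
    """Spoken words captured in one GB of audio or video at a given bitrate."""
    bytes_per_minute = bitrate_bps / 8 * 60
    minutes_per_gb = GB / bytes_per_minute
    return minutes_per_gb * words_per_minute


def density_table() -> dict[str, float]:
    """Words per GB for each format, under the assumptions above."""
    return {
        "text": words_per_gb_text(),
        "audio": words_per_gb_media(AUDIO_BITRATE_BPS),
        "video": words_per_gb_media(VIDEO_BITRATE_BPS),
    }


if __name__ == "__main__":
    for name, words in density_table().items():
        print(f"{name:<6} ~{words:,.0f} words per GB")
```

Under these assumptions, one gigabyte holds roughly 167 million words as plain text, about 156 thousand as spoken audio, and about 4 thousand as video. The exact ratios shift with the bitrates chosen, but the ordering does not. The sketch counts only spoken words, so it ignores the visual information that matters for multimodal models.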

6. “Digital Archaeology” Analogy — ✅ Good Analogy

Obscure, out-of-print, or niche books that current LLMs cannot access represent genuine value, similar to archaeological discoveries. If the data isn’t in the training set, the model simply doesn’t know it.

Summary

Your friend’s intuition is largely correct: rare, text-dense, hard-to-find data collections will become increasingly strategic assets as AI companies push beyond publicly crawled web data.
