Verified Finance Datasets on Hugging Face | Generated by AI


Here’s the verified answer. I checked each dataset by actually fetching its HF page. The previous answer had at least 3 dead links.


Verified Finance Datasets on Hugging Face (June 2026)

Here’s what actually exists, organized by training stage.


TIER 1: Pretraining Corpora (FineWeb-equivalent attempt)

kapilrao/SEC-EDGAR — THE closest thing to a finance FineWeb

eloukas/edgar-corpus — Academic, smaller, cleaner

Brianferrell787/financial-news-multisource — News corpus at scale

PleIAs/SEC — SEC filings in clean text

JanosAudran/financial-reports-sec — Parsed 10-K with structure


TIER 2: Instruction Tuning / SFT

Josephgflowers/Finance-Instruct-500k — Largest SFT collection

sujet-ai/Sujet-Finance-Instruct-177k — Curated multi-task

nvidia/Nemotron-SpecializedDomains-Finance-v1 — NVIDIA’s synthetic QA

FinGPT/fingpt-sentiment-train — Sentiment SFT

TheTokenFactory/sec-contracts-financial-extraction-instructions — Structured extraction


TIER 3: Numerical Reasoning / Math (for GRPO/DPO)

TheFinAI/FinCoT — Chain-of-thought financial reasoning

ibm-research/finqa — IBM’s financial QA benchmark

yale-nlp/FinanceMath — Mathematical finance problems

virattt/financial-qa-10K — QA from 10-K filings


TIER 4: Benchmarks & Evaluation

SALT-NLP/FLUE-FiQA — FLUE benchmark

yixuantt/FinEntity — Entity-level sentiment (EMNLP 2023)

takala/financial_phrasebank — Classic sentiment benchmark

zeroshot/twitter-financial-news-sentiment — Twitter finance sentiment


TIER 5: Structured / Market Data

defeatbeta/yahoo-finance-data — Price + fundamentals

glopardo/sp500-earnings-transcripts — Earnings calls

Josephgflowers/Financial-NER-NLP — Financial NER


What’s Missing (honest assessment)

There is no single FineWeb-equivalent for finance. The closest:

| Analogy | Exists? | What to use instead |
| --- | --- | --- |
| FineWeb (web crawl → clean) | No | kapilrao/SEC-EDGAR (43B tokens of filings) + Brianferrell787/financial-news-multisource (57M rows of news) |
| BigCode (The Stack) | No | Finance code is too niche; use general code + finance SFT |
| Dolma (AI2 pretrain mix) | No | Build your own mix from the above |

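BloombergGPT trained on 363B proprietary tokens. In open data, you can get ~43B tokens from SEC-EDGAR alone, plus ~57M news rows. Assuming roughly 125-300 tokens per news row (an assumption; row lengths are not given here), that’s roughly 50-60B tokens total. That is only about 15% of BloombergGPT’s corpus, so it is not a match on size, but it is a workable base for domain-adaptive pretraining if you mix well.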
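The arithmetic behind that estimate can be checked with a short sketch. The 43B-token, 57M-row, and 363B-token figures come from the text above; the tokens-per-row values for news are assumptions (no per-row length is given), picked so the total spans the 50-60B range.

```python
# Back-of-the-envelope token budget for the open finance corpus.
SEC_EDGAR_TOKENS = 43e9        # from the text above
NEWS_ROWS = 57e6               # from the text above
BLOOMBERGGPT_TOKENS = 363e9    # from the text above


def estimate_total_tokens(tokens_per_news_row: float) -> float:
    """Total tokens if each news row averages `tokens_per_news_row` tokens."""
    return SEC_EDGAR_TOKENS + NEWS_ROWS * tokens_per_news_row


# ASSUMPTION: average news row length; not stated in the source.
for tokens_per_row in (125, 200, 300):
    total = estimate_total_tokens(tokens_per_row)
    share = total / BLOOMBERGGPT_TOKENS
    print(f"{tokens_per_row} tokens/row: {total / 1e9:.1f}B total, "
          f"{share:.0%} of 363B")
```

Under these assumptions the open corpus comes to roughly 14-17% of the 363B figure. Measuring the real average row length on a sample of the news data would replace the guess with an actual number.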


# Pretraining (domain-adaptive)
pretrain = [
    "kapilrao/SEC-EDGAR",                    # 43B tokens, filings
    "Brianferrell787/financial-news-multisource",  # 57M rows, news
]

# SFT
sft = [
    "Josephgflowers/Finance-Instruct-500k",  # 500K multi-task
    "nvidia/Nemotron-SpecializedDomains-Finance-v1",  # 326K synthetic QA, commercial OK
    "FinGPT/fingpt-sentiment-train",         # sentiment
]

# Reasoning (GRPO/DPO)
reasoning = [
    "TheFinAI/FinCoT",                       # chain-of-thought
    "ibm-research/finqa",                    # numerical QA
]

# Eval
eval_ds = [
    "takala/financial_phrasebank",           # sentiment benchmark
    "SALT-NLP/FLUE-FiQA",                   # FLUE benchmark
    "yixuantt/FinEntity",                    # entity sentiment
]
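One way to consume the lists above is to stream a few rows from each repo before committing to a full download. The sketch below assumes the Hugging Face `datasets` library is installed and that a repo exposes a `train` split under its default config; neither is confirmed here, and some repos may need an explicit config name. The names `STAGES`, `is_repo_id`, and `stream_first_rows` are illustrative, not part of any library.

```python
import re
from itertools import islice

# Stage -> repo IDs, copied from the lists above.
STAGES = {
    "pretrain": [
        "kapilrao/SEC-EDGAR",
        "Brianferrell787/financial-news-multisource",
    ],
    "sft": [
        "Josephgflowers/Finance-Instruct-500k",
        "nvidia/Nemotron-SpecializedDomains-Finance-v1",
        "FinGPT/fingpt-sentiment-train",
    ],
    "reasoning": ["TheFinAI/FinCoT", "ibm-research/finqa"],
    "eval": [
        "takala/financial_phrasebank",
        "SALT-NLP/FLUE-FiQA",
        "yixuantt/FinEntity",
    ],
}

_REPO_ID = re.compile(r"^[\w.-]+/[\w.-]+$")


def is_repo_id(name: str) -> bool:
    """Cheap sanity check that a string looks like 'owner/dataset'."""
    return bool(_REPO_ID.match(name))


def stream_first_rows(repo_id, n=3, split="train", config=None):
    """Return the first `n` rows of a dataset without downloading all of it.

    ASSUMPTION: the repo has the given split (and config, if any).
    """
    if not is_repo_id(repo_id):
        raise ValueError(f"not an 'owner/dataset' id: {repo_id!r}")
    # Imported lazily so the manifest is usable without the library.
    from datasets import load_dataset
    ds = load_dataset(repo_id, name=config, split=split, streaming=True)
    return list(islice(ds, n))


# Example (needs network access and `pip install datasets`):
# rows = stream_first_rows("ibm-research/finqa", n=2)
```

Streaming keeps the first pass cheap: the large pretraining repos can be inspected for column names and text quality before any of them is mirrored locally.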


