Verified Finance Datasets on Hugging Face

Verified Finance Datasets on Hugging Face | Generated by AI

Home 2026.06

Here’s the verified answer. I checked each dataset by actually fetching its HF page. The previous answer had at least 3 dead links.

Verified Finance Datasets on Hugging Face (June 2026)

I checked each URL. Here’s what actually exists, organized by training stage.

TIER 1: Pretraining Corpora (FineWeb-equivalent attempt)

kapilrao/SEC-EDGAR — THE closest thing to a finance FineWeb

590 GB, 8M samples, 43 billion tokens
All major SEC EDGAR filings (10-K, 10-Q, 8-K, etc.)
Collaborators: Datamule, Teraflop AI, Eventual
This is what BloombergGPT’s 363B token proprietary corpus looks like in open form
https://huggingface.co/datasets/kapilrao/SEC-EDGAR

eloukas/edgar-corpus — Academic, smaller, cleaner

10-K annual reports, 1993-2020, billions of tokens
Paper: “EDGAR-CORPUS: Billions of Tokens Make The World Go Round” (EMNLP 2021)
63 likes — well-established in finance NLP research
https://huggingface.co/datasets/eloukas/edgar-corpus

Brianferrell787/financial-news-multisource — News corpus at scale

57.1M+ rows from 24 public datasets, 1990-2025
Unified format — no crawling needed
80 likes
https://huggingface.co/datasets/Brianferrell787/financial-news-multisource

PleIAs/SEC — SEC filings in clean text

Part of PleIAs’s common_corpus collection
https://huggingface.co/datasets/PleIAs/SEC

JanosAudran/financial-reports-sec — Parsed 10-K with structure

10-K filings 1993-2020, split into 20 sections + sentences
Includes sentiment labels from market reactions
77 likes
https://huggingface.co/datasets/JanosAudran/financial-reports-sec

TIER 2: Instruction Tuning / SFT

Josephgflowers/Finance-Instruct-500k — Largest SFT collection

500K+ entries, multi-task: reasoning, QA, NER, sentiment, multi-turn
Apache 2.0 license
Aggregates many finance sources into one
https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k

sujet-ai/Sujet-Finance-Instruct-177k — Curated multi-task

177,597 entries from 18 HF sources
7 task types: sentiment, QA, NER, summarization, classification, etc.
83 likes
https://huggingface.co/datasets/sujet-ai/Sujet-Finance-Instruct-177k

nvidia/Nemotron-SpecializedDomains-Finance-v1 — NVIDIA’s synthetic QA

326K+ Q&A pairs from SEC filings (S&P 500, 2019-2024)
6-stage template-based synthetic data generation
Commercial-ready license
https://huggingface.co/datasets/nvidia/Nemotron-SpecializedDomains-Finance-v1

FinGPT/fingpt-sentiment-train — Sentiment SFT

Financial news headlines + sentiment labels
36 likes, 1.16K followers on FinGPT org
https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train

TheTokenFactory/sec-contracts-financial-extraction-instructions — Structured extraction

7,683 instruction examples for extracting structured data from SEC filings
https://huggingface.co/datasets/TheTokenFactory/sec-contracts-financial-extraction-instructions

TIER 3: Numerical Reasoning / Math (for GRPO/DPO)

TheFinAI/FinCoT — Chain-of-thought financial reasoning

GPT-4o-generated reasoning paths with iterative verification
Good for GRPO reward signal
https://huggingface.co/datasets/TheFinAI/FinCoT

ibm-research/finqa — IBM’s financial QA benchmark

Numerical reasoning over financial tables/text
14 likes, well-cited
https://huggingface.co/datasets/ibm-research/finqa

yale-nlp/FinanceMath — Mathematical finance problems

19 likes, requires agreement to access
Covers DCF, Black-Scholes, ratio analysis
https://huggingface.co/datasets/yale-nlp/FinanceMath

virattt/financial-qa-10K — QA from 10-K filings

Example: NVIDIA 2023 10-K Q&A pairs
Good for teaching models to read actual filings
https://huggingface.co/datasets/virattt/financial-qa-10K

TIER 4: Benchmarks & Evaluation

SALT-NLP/FLUE-FiQA — FLUE benchmark

Financial Language Understanding Evaluation
FiQA subtask for financial opinion mining
https://huggingface.co/datasets/SALT-NLP/FLUE-FiQA

yixuantt/FinEntity — Entity-level sentiment (EMNLP 2023)

First public dataset for entity-level sentiment in finance
Sentiment directed at specific entities in news
https://huggingface.co/datasets/yixuantt/FinEntity

takala/financial_phrasebank — Classic sentiment benchmark

4,840 sentences, 3-class (pos/neg/neutral)
259 likes — most popular finance sentiment dataset on HF
CC BY-NC-SA 3.0
https://huggingface.co/datasets/takala/financial_phrasebank

zeroshot/twitter-financial-news-sentiment — Twitter finance sentiment

TIER 5: Structured / Market Data

defeatbeta/yahoo-finance-data — Price + fundamentals

Yahoo Finance, Nasdaq, US Treasury data
Regularly updated
96 likes
https://huggingface.co/datasets/defeatbeta/yahoo-finance-data

glopardo/sp500-earnings-transcripts — Earnings calls

S&P 500 earnings transcripts 2014-2024
Combined with quarterly financial metrics
Used in ECB working paper
https://huggingface.co/datasets/glopardo/sp500-earnings-transcripts

Josephgflowers/Financial-NER-NLP — Financial NER

Derived from FiNER-139 (1.1M sentences, 139 XBRL tags)
Reformatted as NL prompts for LLM training
https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP

What’s Missing (honest assessment)

There is no single FineWeb-equivalent for finance. The closest:

Analogy	Exists?	What to use instead
FineWeb (web crawl → clean)	No	`kapilrao/SEC-EDGAR` (43B tokens of filings) + `Brianferrell787/financial-news-multisource` (57M rows news)
BigCode (The Stack)	No	Finance code is too niche; use general code + finance SFT
Dolma (AI2 pretrain mix)	No	Build your own mix from the above

BloombergGPT trained on 363B proprietary tokens. In open data, you can get ~43B tokens from SEC-EDGAR alone, plus ~57M news rows. That’s roughly 50-60B tokens total — competitive with BloombergGPT’s corpus if you mix well.

Recommended Stack for Your MI300X

# Pretraining (domain-adaptive)
pretrain = [
    "kapilrao/SEC-EDGAR",                    # 43B tokens, filings
    "Brianferrell787/financial-news-multisource",  # 57M rows, news
]

# SFT
sft = [
    "Josephgflowers/Finance-Instruct-500k",  # 500K multi-task
    "nvidia/Nemotron-SpecializedDomains-Finance-v1",  # 326K synthetic QA, commercial OK
    "FinGPT/fingpt-sentiment-train",         # sentiment
]

# Reasoning (GRPO/DPO)
reasoning = [
    "TheFinAI/FinCoT",                       # chain-of-thought
    "ibm-research/finqa",                    # numerical QA
]

# Eval
eval_ds = [
    "takala/financial_phrasebank",           # sentiment benchmark
    "SALT-NLP/FLUE-FiQA",                   # FLUE benchmark
    "yixuantt/FinEntity",                    # entity sentiment
]

References:

sugatoray/llm-datasets-finance collection — curated HF collection of finance datasets
adlnlp/FinLLMs GitHub — related work + benchmarks list
BloombergGPT paper
FINALE paper (ACL 2024) — finance instruction-tuning with CoT rationales

Back Donate