Fixing Slow Hugging Face Downloads
This is a common issue when downloading huge HF datasets through a proxy. The combination of:
- Many multi-hundred-MB files
- Python requests/huggingface_hub
- A local HTTP proxy (especially Clash, V2Ray, or tun2socks)
often results in extremely slow throughput or connection resets.
Below are the most useful fixes.
1. Best Fix: Use hf_transfer
Hugging Face ships an accelerated, Rust-based downloader (hf_transfer) that is typically far faster than the default Python downloader and more stable through proxies.
Install:
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
Run your command again.
This alone usually fixes slow downloads through proxies.
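For a programmatic equivalent, here is a minimal sketch; the repo and filename are placeholders, and note that the flag is read when huggingface_hub is imported, so set it first:

import os

# Must be set before importing huggingface_hub; the flag is read at import time.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import hf_hub_download

# Placeholder repo/file; any dataset file downloads the same way.
path = hf_hub_download("openwebtext", "README.md", repo_type="dataset")
print(path)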
2. Use the Hugging Face CLI or aria2c (fastest for huge files)
If you download manually:
pip install "huggingface_hub[cli]"
sudo apt install aria2
Then download:
huggingface-cli download --repo-type dataset openwebtext --local-dir ./openwebtext --local-dir-use-symlinks False
Note: the HF CLI does not invoke aria2; it uses its own parallel, resumable downloader (accelerated further by hf_transfer when enabled). If you want aria2c itself, resolve the file URL and pass it to aria2c (sketch below), which handles:
- parallel connections
- resume
- unstable proxies
much better than plain requests.
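As a sketch of the aria2c route: resolve a file's direct URL with huggingface_hub and hand it to aria2c. The shard filename below is hypothetical (list real ones with list_repo_files), and the -x/-s connection counts are arbitrary:

import subprocess
from huggingface_hub import hf_hub_url, list_repo_files

# Inspect the repo for real filenames; the shard name below is hypothetical.
print(list_repo_files("openwebtext", repo_type="dataset")[:5])
url = hf_hub_url("openwebtext", "some_shard.tar", repo_type="dataset")

# -x/-s: up to 16 parallel connections; -c: resume a partial download.
subprocess.run(["aria2c", "-x", "16", "-s", "16", "-c", url], check=True)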
3. Fix for local proxy slowness: use system-wide SOCKS instead of HTTP
If you’re using something like:
HTTP_PROXY=http://127.0.0.1:7890
HTTPS_PROXY=http://127.0.0.1:7890
Switch to SOCKS5 if your proxy supports it:
export ALL_PROXY=socks5://127.0.0.1:7891
In practice, SOCKS5 tends to be noticeably more stable than an HTTP proxy for long-lived, large-file streams.
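One caveat: Python's requests only understands socks5:// proxy URLs if PySocks is installed (pip install "requests[socks]"). A quick connectivity check, reusing the port from the export above; socks5h:// also resolves DNS through the proxy:

import requests

proxies = {
    "http": "socks5h://127.0.0.1:7891",
    "https": "socks5h://127.0.0.1:7891",
}
# A 200 here means the proxy path to the Hub works.
r = requests.head("https://huggingface.co", proxies=proxies, timeout=10)
print(r.status_code)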
4. Disable proxy for large files
If your direct connection is fine and the proxy itself is the bottleneck:
unset HTTP_PROXY
unset HTTPS_PROXY
unset ALL_PROXY
Then retry. Depending on your region, the Hugging Face CDN can be fast even without a proxy.
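If you only want to bypass the proxy for a single Python process (say, your prepare script) without touching the shell environment, a small sketch:

import os

# Drop proxy settings for this process only; the shell env is untouched.
for var in ("HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY",
            "http_proxy", "https_proxy", "all_proxy"):
    os.environ.pop(var, None)

# Import HF libraries after clearing the variables.
from datasets import load_dataset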
5. Use an HF mirror (helpful where routes to the main CDN are slow, e.g. mainland China or Taiwan)
You can point the client at a mirror endpoint:
export HF_ENDPOINT=https://hf-mirror.com
Then run again. In some regions this dramatically improves speeds.
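As with the hf_transfer flag, HF_ENDPOINT is read when huggingface_hub is imported, so if you set it from Python rather than the shell, set it first; a sketch:

import os

# Must be set before importing huggingface_hub / datasets.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from datasets import load_dataset
ds = load_dataset("openwebtext", split="train")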
6. Resume interrupted downloads
Downloads made by the datasets library are cached under:
~/.cache/huggingface/datasets/
(files fetched directly via huggingface_hub land in ~/.cache/huggingface/hub/ instead). If a file was partially downloaded, the client resumes it automatically, so there is no need to start from zero.
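To check how much is already cached before retrying, a quick sketch using the default path above:

from pathlib import Path

cache = Path.home() / ".cache" / "huggingface" / "datasets"
total = sum(f.stat().st_size for f in cache.rglob("*") if f.is_file())
print(f"datasets cache: {total / 1e9:.1f} GB")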
7. If still slow: don't pull the full dataset, use a slice or streaming
OpenWebText is >40 GB. You can load only a slice of the train split:
from datasets import load_dataset
ds = load_dataset("openwebtext", split="train[:1%]")
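Note, however, that split slicing like train[:1%] is applied after the source files are downloaded and prepared, so it saves processing time, not bandwidth. To avoid downloading everything, use streaming mode. A minimal sketch, assuming the openwebtext script supports streaming (recent datasets releases may also require trust_remote_code=True), with 10_000 as an arbitrary example count:

from datasets import load_dataset

# Streaming fetches records lazily over HTTP instead of downloading
# the whole dataset up front.
ds = load_dataset("openwebtext", split="train", streaming=True)

# Materialize only the first 10k examples (arbitrary illustration).
subset = list(ds.take(10_000))
print(subset[0]["text"][:200])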
Quick recommendation
Fastest + easiest:
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
python data/openwebtext_10k/prepare.py
You should immediately see download speeds jump.
If you want, I can show you how to modify your prepare.py to use a completely custom downloader with retry/backoff and parallel fetching, so your pipeline always downloads fast and never stalls.