Using Proxy with Hugging Face Datasets | Generated by AI
Yes. datasets picks up the proxy settings from your system environment, so no datasets-specific proxy configuration is required.
So you can absolutely make Hugging Face download datasets through a proxy when you’re in China.
Here’s how.
How to make load_dataset use a proxy (China-friendly)
1. Set environment variables for HTTP/HTTPS proxy
datasets downloads files over HTTP through Python libraries such as requests (the exact client depends on the installed version).
requests automatically reads the standard proxy environment variables:
Linux / macOS
export HTTP_PROXY="http://127.0.0.1:7890"
export HTTPS_PROXY="http://127.0.0.1:7890"
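If your proxy uses SOCKS5 (requests needs the PySocks extra for this, installed with pip install "requests[socks]"; use socks5h:// instead of socks5:// if DNS lookups should also go through the proxy):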
export HTTP_PROXY="socks5://127.0.0.1:1080"
export HTTPS_PROXY="socks5://127.0.0.1:1080"
Windows (PowerShell)
setx HTTP_PROXY "http://127.0.0.1:7890"
setx HTTPS_PROXY "http://127.0.0.1:7890"
Then reopen the terminal. setx saves the variables for new sessions only and does not change the current one.
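To confirm that Python actually sees the variables from step 1, a quick standard-library check can help. This is a minimal sketch: the helper name visible_proxies is made up for illustration, and on Windows and macOS urllib.request.getproxies() may fall back to OS-level proxy settings when no variables are set.

```python
import urllib.request


def visible_proxies():
    """Return the proxy URLs Python detects, keyed by scheme ('http', 'https', ...)."""
    # getproxies() scans environment variables named <scheme>_proxy,
    # matching the name in either upper or lower case.
    return urllib.request.getproxies()


if __name__ == "__main__":
    proxies = visible_proxies()
    for scheme in ("http", "https"):
        print(scheme, "->", proxies.get(scheme, "not set"))
```

If either scheme prints "not set", the shell that launched Python most likely did not inherit the variables.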
2. If you’re working inside Python (for example in Jupyter), set the variables in code
import os
os.environ['HTTP_PROXY'] = 'http://127.0.0.1:7890'
os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'
This works as long as the variables are set before load_dataset is called.
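If the proxy should apply only to one download rather than the whole session, the same assignment can be wrapped in a context manager. The sketch below is one possible approach, not part of datasets. The name use_proxy is hypothetical, and it only touches the uppercase variables, so a lowercase http_proxy or https_proxy that is already set may still take precedence.

```python
import os
from contextlib import contextmanager

PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY")


@contextmanager
def use_proxy(proxy_url):
    """Temporarily point HTTP_PROXY and HTTPS_PROXY at proxy_url, then restore them."""
    saved = {name: os.environ.get(name) for name in PROXY_VARS}
    try:
        for name in PROXY_VARS:
            os.environ[name] = proxy_url
        yield
    finally:
        # Put back the previous values, or remove the variables if they were unset.
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old


# Usage (the dataset name is a placeholder):
# from datasets import load_dataset
# with use_proxy("http://127.0.0.1:7890"):
#     dataset = load_dataset("some/dataset")
```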
3. Confirm proxy is working
You can test with:
import requests
print(requests.get("https://api.ipify.org?format=json").text)
If the reported IP is your proxy’s exit node, requests is routing through the proxy, and Hugging Face dataset downloads made from the same environment should go through it as well.
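To make the result easier to read, the check can compare the IP seen through the environment proxy with the IP seen on a direct connection. This is a rough sketch: the function names are hypothetical, it assumes requests is installed, and it relies on a requests.Session with trust_env set to False ignoring the proxy variables. If a system-wide (TUN-style) proxy is active, both addresses may be identical.

```python
import json


def exit_ip(session, url="https://api.ipify.org?format=json", timeout=10):
    """Return the public IP the service reports for requests made via `session`."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return json.loads(response.text)["ip"]


def compare_exit_ips():
    """Return (ip_via_env_proxy, ip_direct) using two requests sessions."""
    import requests  # third-party; imported lazily so exit_ip works without it

    proxied = requests.Session()  # trust_env is True by default: env proxies apply
    direct = requests.Session()
    direct.trust_env = False      # ignore HTTP_PROXY / HTTPS_PROXY for this session
    return exit_ip(proxied), exit_ip(direct)


if __name__ == "__main__":
    via_proxy, without_proxy = compare_exit_ips()
    print("via proxy:", via_proxy)
    print("direct:   ", without_proxy)
```

Two different addresses suggest the environment proxy is in effect. The direct request may simply time out on a restricted network, which is also informative.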
4. Hugging Face’s huggingface_hub also respects these variables
datasets internally uses huggingface_hub, and it also follows:
- HTTP_PROXY
- HTTPS_PROXY
- http_proxy
- https_proxy
So once env variables are set, both model and dataset downloads will use your proxy automatically.
Notes about usage in China
- Hugging Face raw downloads often require a proxy because the files are hosted on US/Europe endpoints.
- If you see timeouts on .zst dataset files, this is normal without a proxy.
- A global proxy (Clash, V2Ray, Qv2ray, etc.) makes everything smoother.
If you want, I can show:
- How to ensure NanoGPT downloads datasets through your proxy
- How to cache the dataset locally to avoid future downloads
- How to mirror Hugging Face datasets locally in China (e.g., Tsinghua mirror workaround)
Just tell me what you need.