Downloading OpenWebText Subset via Hugging Face | Generated by AI
The dataset you’re referring to is hosted on Hugging Face under the repo ID Skylion007/openwebtext. The specific file subsets/urlsf_subset00.tar is publicly accessible (no auth required), so you can download it using the Hugging Face CLI tool (huggingface-cli). Here’s how:
Prerequisites
- Install the Hugging Face Hub CLI if you haven’t already:
pip install huggingface_hub - (Optional) If you have the
huggingface-clicommand aliased ashf(common on some systems), you can use that shorthand. Otherwise, use the fullhuggingface-cliprefix.
Download Command
Run this in your terminal:
huggingface-cli download Skylion007/openwebtext subsets/urlsf_subset00.tar --local-dir ./openwebtext
- This will download the file to
./openwebtext/subsets/urlsf_subset00.tar(creating thesubsetssubdir if needed). - The file is quite large (~1.5 GB uncompressed, but it’s a tarball), so ensure you have enough disk space and a stable connection.
- If you want to download it without the repo structure (directly to the current dir), add
--repo-type dataset(though it’s implied here) or adjust--local-dir.
Using Python (Alternative)
If you prefer scripting it in Python (e.g., via the huggingface_hub library):
from huggingface_hub import hf_hub_download
# Download to current directory
hf_hub_download(
repo_id="Skylion007/openwebtext",
filename="subsets/urlsf_subset00.tar",
local_dir="./openwebtext", # Optional: specifies base dir
repo_type="dataset"
)
Install the lib first: pip install huggingface_hub.
Notes
- This is a subset of OpenWebText (a large web corpus). After downloading, you’ll need to extract the
.tarfile (e.g.,tar -xvf subsets/urlsf_subset00.tar). - If the download fails (e.g., due to network issues), retry or use
--cache-dirto specify a custom cache location. - For the full dataset or other subsets, check the repo page: https://huggingface.co/datasets/Skylion007/openwebtext.
If you run into errors, share the output for debugging!