Hub Storage Differences Explained

Generated by AI, 2026.05

Great question. Let me explain from the storage layer up.

The Core Difference: Git vs Git LFS vs Xet

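GitHub stores everything in standard git objects (packfiles). Git was designed for source code: it delta-compresses text well but handles large binaries poorly. Binary files such as model weights rarely delta-compress, so every version gets stored essentially in full. A 7GB model file with 10 versions takes roughly 70GB on the server. In practice GitHub rejects files over 100MB in plain git, so anything that large has to go through Git LFS.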

Hugging Face Hub is git on the surface but uses two different backends for large files:

1. Git LFS (the original backend)

When you ran hf upload lzwjava/zz ., the client did not shell out to git; it created the commit through the Hub's HTTP API. For a repo that contains large files, the flow looks like this:

Your machine                    HF Hub
───────────                     ──────
commit (small files)      ───>  standard git repo
                                (README.md, config.json, etc.)

LFS pointer files         ───>  LFS batch API
(large files tracked in         Stores actual blobs in
 .gitattributes)                object storage (e.g. S3)

The git repo itself only contains pointer files — tiny text stubs like:

version https://git-lfs.github.com/spec/v1
oid sha256:abc123...
size 7000000000

The actual model weights live in object storage. When you git clone, the LFS smudge filter replaces each pointer with the real content, fetched from the blob store rather than from git. hf_hub_download skips git altogether and fetches the file over HTTPS.

This is why HF can host repos with 100GB+ of model files — git never sees the actual bytes.
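To make the pointer format concrete, here is a minimal sketch that builds and parses a pointer of the shape shown above. The function names make_lfs_pointer and parse_lfs_pointer are made up for illustration, and the 1KB byte string is just a stand-in for a real model file; git-lfs itself does this work for you.

```python
import hashlib

LFS_SPEC = "https://git-lfs.github.com/spec/v1"


def make_lfs_pointer(content: bytes) -> str:
    """Build the pointer text for a blob held in memory (illustrative helper)."""
    oid = hashlib.sha256(content).hexdigest()
    return f"version {LFS_SPEC}\noid sha256:{oid}\nsize {len(content)}\n"


def parse_lfs_pointer(text: str) -> dict:
    """Split a pointer into fields; every line has the form 'key value'."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


weights = b"\x00" * 1024  # stand-in for a real model file
pointer = make_lfs_pointer(weights)
info = parse_lfs_pointer(pointer)
print(pointer)
```

The pointer is the same three lines whether the blob is 1KB or 7GB, which is why the git side of the repo stays small.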

2. Xet (newer, now the default for new repos)

Xet is HF's newer storage backend, built on content-defined chunking (CDC), the same family of techniques used by tools like rsync and restic. Instead of storing whole files in blob storage:

Traditional LFS:     store entire 7GB file as one blob
Xet (CDC):           chunk file into ~64KB variable-size chunks,
                     deduplicate chunks across files/versions

This means if you fine-tune a model and only 5% of the weights change, Xet uploads and stores only the chunks that differ; the rest are already in the store. Huge win for iterative ML workflows. In the huggingface_hub source, Xet support shows up in hf_api.py, in _upload_large_folder.py, and throughout the download path.
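The idea behind content-defined chunking can be sketched in a few lines. This is a toy version under stated assumptions, not Xet's actual algorithm: it uses a simple gear-style rolling hash, SHA256 as the chunk ID, and much smaller chunks (about 1KB on average instead of ~64KB) so the demo runs quickly. The names cdc_chunks and chunk_ids are invented for this example.

```python
import hashlib
import random

# Fixed pseudo-random table mapping each byte value to a 64-bit number.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data: bytes, avg_bits: int = 10, min_size: int = 256, max_size: int = 8192):
    """Cut where a rolling hash hits a bit pattern, so cuts follow content, not offsets."""
    cut_mask = (1 << avg_bits) - 1  # a cut about every 2**avg_bits bytes past min_size
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        if size >= max_size or (size >= min_size and (h & cut_mask) == 0):
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_ids(data: bytes) -> set:
    """Content-address every chunk by its SHA256."""
    return {hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)}


# Two "versions" of a file: v2 is v1 with a few bytes inserted in the middle.
src = random.Random(1)
v1 = bytes(src.getrandbits(8) for _ in range(200_000))
v2 = v1[:100_000] + b"a small edit" + v1[100_000:]

ids_v1, ids_v2 = chunk_ids(v1), chunk_ids(v2)
new_chunks = ids_v2 - ids_v1
print(f"v2 has {len(ids_v2)} chunks, {len(new_chunks)} of them not already in v1")
```

With fixed-size blocks, the insertion would shift every later block and nothing after the edit would match. Because the cut points here depend on the bytes themselves, the chunking falls back into step shortly after the edit and only the chunks around it need to be uploaded.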

Architectural Comparison

                    GitHub                  HuggingFace Hub
─────────────────────────────────────────────────────────────
Protocol            Git (smart HTTP)        Git (smart HTTP)
                                              + LFS batch API
                                              + Xet CAS API

Large file storage  Git LFS (optional,      Xet (default for new
                    per-repo opt-in)          repos; content-addressed
                                              chunked storage)
                                              + Git LFS (older repos)

Max file size       100MB per file in       50GB+ supported
                    plain git; Git LFS
                    raises this to ~2GB
                    on free plans

Deduplication       None (git delta only)   Xet: cross-file chunk
                                              deduplication

Clone/download      git clone (fetches      hf_hub_download:
                    everything)               selective file fetch,
                                              symlink-based caching,
                                              parallel downloads

Repo types          Just "repo"             models, datasets, spaces
                                              (each with different
                                               metadata schemas)

Metadata            Just README             Model card (YAML front
                                              matter), dataset tags,
                                              library name, metrics,
                                              license, etc.

Pull model weights  N/A (not designed       First-class feature:
                    for this)                 snapshot_download(),
                                              from_pretrained() all
                                              pull from Hub

Web hosting         GitHub Pages            Spaces (Gradio/Streamlit
                                              apps hosted with GPU
                                              support)

Why GitHub Can’t Do This (and Doesn’t Try)

GitHub’s value is code collaboration — PRs, issues, Actions CI/CD, code review. Git’s object model is optimized for text diffing. Running git diff on a 7B parameter .safetensors file is meaningless.

HF Hub’s value is model distribution. The entire storage layer is designed around “a few small config files + several multi-GB binary blobs that change infrequently.” The selective download in hf_hub_download (fetch only model-00002-of-00004.safetensors, not the whole repo) has no simple equivalent in plain git, where a clone fetches the full history by default.
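As a rough illustration of selective download, the sketch below fetches a single shard with hf_hub_download from the huggingface_hub package. The helper names shard_filename and fetch_one_shard are invented here, the shard naming follows the pattern quoted above, and "some-org/some-model" is a placeholder rather than a real repo.

```python
def shard_filename(index: int, total: int) -> str:
    """Name of one shard, following the 'model-XXXXX-of-YYYYY.safetensors' pattern."""
    return f"model-{index:05d}-of-{total:05d}.safetensors"


def fetch_one_shard(repo_id: str, index: int, total: int) -> str:
    """Download a single shard and return its local cache path (needs network access)."""
    # Imported lazily so shard_filename works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=shard_filename(index, total))


print(shard_filename(2, 4))

# Hypothetical usage; the repo id below is a placeholder:
# path = fetch_one_shard("some-org/some-model", index=2, total=4)
```

Only the requested file is transferred; the other shards and the repo history stay on the Hub.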

What Your Upload Did

Start hashing 36 files.      # hash each file to detect changes
Finished hashing 36 files.   # only upload new/changed files
✓ Uploaded                   # commit pushed via HF API

The hashing step is how HF does incremental uploads: it computes a SHA256 for each local file, checks which of those the Hub already has, and skips them. Git LFS does a similar check through its batch API, so the practical difference from git push is that none of this needs a local git clone.
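The hash-then-skip logic can be sketched as follows. This is not the huggingface_hub implementation: plan_upload is a made-up helper, and remote_hashes is a plain set standing in for whatever the Hub reports it already stores.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file block by block so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(block_size):
            digest.update(block)
    return digest.hexdigest()


def plan_upload(folder: Path, remote_hashes: set) -> list:
    """Return relative paths whose content hash the remote does not have yet."""
    to_upload = []
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        if sha256_of(path) not in remote_hashes:
            to_upload.append(path.relative_to(folder).as_posix())
    return to_upload


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "README.md").write_bytes(b"hello")
    (root / "config.json").write_bytes(b"{}")
    # Pretend the remote already stores the content of README.md.
    already_there = {hashlib.sha256(b"hello").hexdigest()}
    pending = plan_upload(root, already_there)

print(pending)  # ['config.json']
```

Because the comparison is by content hash rather than by filename, a renamed but otherwise unchanged file would also be recognized as already stored.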

For zz (your CLI tool), the repo is probably all small files, so the upload is essentially a regular commit with the file contents sent inline. But if you uploaded a 7GB .safetensors model, it would go through the LFS/Xet path transparently.
