Expanding Disk Storage for LLM Training | Generated by AI
Question: My current Linux machine has only 1.7 GB of disk space left on a 916 GB NVMe drive (100% full). I’m doing LLM training and need to store large datasets like FineWeb and GPT-3 scale data (~400 GB). How do I upgrade/expand the disk storage?
Answer:
Context: Your Current Situation
Filesystem Size Used Avail Use%
/dev/nvme0n1p2 916G 868G 1.7G 100% ← basically full
Your new PC build (from the parts list you shared) includes:
- Samsung 980 SSD 1TB (PCIe 3.0 NVMe) — your current boot drive (this is the one that’s full)
- WD Blue 2TB HDD — also in the build (7200 RPM SATA)
You actually already have the solution in your PC: the 2TB WD HDD is your expansion drive. But you may not have it mounted and set up yet.
Strategy Options (Best to Simplest)
Option A: Use the 2TB HDD (Already in Your Build) — Recommended for Training Data
This is the easiest and most practical path. Training datasets (FineWeb, etc.) are read sequentially, so HDD speed is acceptable for data storage/preprocessing, even if not ideal.
Step 1 — Identify the new disk:
lsblk
# or
sudo fdisk -l
Look for a device like /dev/sdb or /dev/sda that is ~2TB and unpartitioned.
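For reference, the output will look roughly like this (an illustrative sketch only; device names, sizes, and the boot drive's partition layout will differ on your machine):
NAME          SIZE TYPE MOUNTPOINT
nvme0n1     931.5G disk
├─nvme0n1p1   512M part /boot/efi
└─nvme0n1p2   931G part /
sdb            1.8T disk             ← the unpartitioned 2TB WD Blue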
Step 2 — Partition it:
sudo fdisk /dev/sdb
# Inside fdisk:
# n → new partition
# p → primary
# 1 → partition number
# (press Enter twice to use full disk)
# w → write and exit
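If you prefer a non-interactive route, parted can do the same in one line (a sketch assuming /dev/sdb from Step 1; this writes a GPT label rather than the MBR label fdisk may default to):
# Create a GPT label and a single partition spanning the whole disk
sudo parted --script /dev/sdb mklabel gpt mkpart primary ext4 0% 100%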
Step 3 — Format as ext4:
sudo mkfs.ext4 /dev/sdb1
Step 4 — Create mount point and mount:
sudo mkdir -p /data/llm
sudo mount /dev/sdb1 /data/llm
Step 5 — Make it permanent (auto-mount on boot):
# Get the UUID
sudo blkid /dev/sdb1
# Edit fstab
sudo nano /etc/fstab
# Add this line (replace UUID with your actual value):
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /data/llm ext4 defaults 0 2
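Before relying on the fstab entry, test it now, since a typo in /etc/fstab can stall the next boot:
# Unmount the manual mount from Step 4, then remount everything listed in fstab
sudo umount /data/llm
sudo mount -a
findmnt /data/llm   # should show /dev/sdb1 mounted at /data/llm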
Step 6 — Set permissions so your user can write:
sudo chown -R $USER:$USER /data/llm
Verify:
df -h /data/llm
# Should show ~1.8TB available
Now download your datasets there:
# Example: HuggingFace datasets
huggingface-cli download HuggingFaceFW/fineweb --repo-type dataset --local-dir /data/llm/fineweb
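If you only want the 10B-token sample rather than the full repo, a filtered download along these lines should work (the --include pattern assumes the sample files live under sample/10BT/ in the repo; check the file listing on the Hub first):
huggingface-cli download HuggingFaceFW/fineweb --repo-type dataset \
  --include "sample/10BT/*" --local-dir /data/llm/fineweb-sample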
Option B: Add a Second NVMe SSD (Better for Active Training I/O)
The MSI MAG B760M Mortar motherboard you chose has two M.2 slots, so you can physically add another NVMe SSD (e.g., a 2TB PCIe 4.0 drive). This gives much better read speeds (~5000 MB/s vs ~150 MB/s HDD) which matters during tokenization and DataLoader I/O.
Steps are identical to Option A after physical install — it will appear as /dev/nvme1n1.
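If you want to quantify the gap on your own hardware before buying anything, a rough sequential-read comparison (device names are examples; use whatever lsblk reports):
# Timed buffered reads on each drive
sudo hdparm -t /dev/nvme0n1   # existing Samsung 980 NVMe
sudo hdparm -t /dev/sdb       # WD Blue HDD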
Option C: Immediate Relief — Clean Up the Existing NVMe
Before buying/mounting anything, reclaim space now:
# Find what's eating space
du -sh /* 2>/dev/null | sort -rh | head -20
du -sh /home/* 2>/dev/null | sort -rh
du -sh ~/.cache/* 2>/dev/null | sort -rh
# Common culprits in ML workloads:
# HuggingFace cache
du -sh ~/.cache/huggingface/
# Conda/pip cache
conda clean --all
pip cache purge
# Docker images (if using)
docker system prune -a
# Old model checkpoints
find ~ \( -name "*.ckpt" -o -name "*.pt" \) -exec du -sh {} + | sort -rh | head -20
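If the HuggingFace cache turns out to be a main offender, recent versions of huggingface_hub ship cache-management subcommands that are safer than deleting directories by hand:
# Per-repo size report for the HF cache
huggingface-cli scan-cache
# Interactively select cached revisions to delete
huggingface-cli delete-cache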
Option D: Symlink Cache to the New Drive (Transparent Integration)
After mounting /data/llm, redirect HuggingFace’s cache there seamlessly:
# Move existing HF cache to new drive
mv ~/.cache/huggingface /data/llm/hf_cache
# Symlink so tools don't notice
ln -s /data/llm/hf_cache ~/.cache/huggingface
# Or just set the env var in ~/.bashrc:
echo 'export HF_HOME=/data/llm/hf_cache' >> ~/.bashrc
echo 'export TRANSFORMERS_CACHE=/data/llm/hf_cache' >> ~/.bashrc  # legacy variable; only needed for older transformers versions, HF_HOME covers recent ones
source ~/.bashrc
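Either way, a quick sanity check that the cache now lives on the new drive (paths match the example above; the symlink check only applies if you used the mv + ln -s route):
echo "$HF_HOME"              # should print /data/llm/hf_cache if you set the env var
ls -ld ~/.cache/huggingface  # should show a symlink to /data/llm/hf_cache if you used the symlink
df -h /data/llm              # downloads should now count against the 2TB drive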
Storage Sizing Reality Check for LLM Training Data
| Dataset | Approximate Size |
|---|---|
| FineWeb (sample-10BT) | ~10–25 GB |
| FineWeb (full) | ~44 TB (impractical locally) |
| The Pile | ~825 GB |
| OpenWebText | ~40 GB |
| GPT-3 training data (filtered) | ~570 GB of text (drawn from ~45 TB of raw Common Crawl) |
Practical recommendation: For a local training setup, use FineWeb sample subsets or OpenWebText, stored on the 2TB HDD. The full FineWeb/GPT-3 scale data requires cloud/cluster storage.
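For example, OpenWebText sits comfortably next to FineWeb samples on the 2TB drive; one widely used copy on the Hub is Skylion007/openwebtext (the repo id is an assumption here; confirm it and the license on the Hub before downloading):
huggingface-cli download Skylion007/openwebtext --repo-type dataset \
  --local-dir /data/llm/openwebtext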
Summary: What to Do Right Now
- lsblk — confirm the 2TB WD HDD appears (e.g., /dev/sdb)
- Partition → format → mount it to /data/llm
- Add to /etc/fstab so it persists across reboots
- Set HF_HOME=/data/llm/hf_cache in your shell profile
- Download datasets directly to /data/llm/
This gives you ~1.8 TB of usable training data space without touching your OS partition at all.