Managing Large Files with Git | Generated by AI
Question: For files larger than 500 MB, such as MP4 recordings and Whisper outputs, how should I manage them with Git? Should I simply ignore them?
Answer:
For large media/transcription workflows like your setup, the normal approach is:
Git manages:
- source code
- scripts
- configs
- prompts
- small metadata
- documentation
Git ignores:
- large videos
- generated outputs
- caches
- logs
- temporary artifacts
So yes: the .mp4, .srt, .vtt, .json, and .txt files generated by Whisper should usually be ignored.
A clean structure is something like:
whisper-test/
├── src/
├── scripts/
├── prompts/
├── data/
│   ├── raw/
│   └── processed/
├── outputs/
├── logs/
├── .gitignore
└── README.md
Recommended .gitignore:
# Large media
*.mp4
*.mkv
*.mov
*.avi
# Whisper outputs
*.srt
*.vtt
*.tsv
*.json
*.txt
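# NOTE: *.json and *.txt are broad patterns; they also match files you may
# want to keep (configs, requirements.txt). Scope them to output paths,
# e.g. outputs/**/*.json, if that becomes a problem.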
# Logs
*.log
# Temp/cache
.cache/
tmp/
temp/
# Python
__pycache__/
*.pyc
# Models
models/
*.bin
*.pt
*.gguf
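To sanity-check which files these rules actually catch, Git can report the matching rule for any path (run from the repository root):
```bash
# show which .gitignore rule (file and line number) matches a given path
git check-ignore -v meeting_01_2.vtt

# list everything currently ignored in the working tree
git status --ignored --short
```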
For your current directory:
meeting_01_2.json
meeting_01_2_refined.md
meeting_01_2.tsv
meeting_01_2.vtt
myopia.mp4
whisper.log
You probably want:
| File | Git? |
|---|---|
| myopia.mp4 | Ignore |
| meeting_01_2.tsv | Ignore |
| meeting_01_2.vtt | Ignore |
| meeting_01_2.srt | Ignore |
| meeting_01_2.txt | Ignore |
| whisper.log | Ignore |
| meeting_01_2_refined.md | Keep (maybe) |
| scripts/configs | Keep |
Because:
- raw/generated artifacts are reproducible
- refined markdown may contain human edits and knowledge
A very common engineering pattern is:
raw data -> generated outputs -> refined knowledge
Only the final refined knowledge enters Git.
If you REALLY need versioning for large files, there are several approaches.
1. Git LFS (Large File Storage)
Use when:
- you truly need version history of videos/models
- team collaboration requires it
Install:
git lfs install
Track mp4:
git lfs track "*.mp4"
This creates:
.gitattributes
Then commit normally.
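Putting those steps together, a first-time sequence looks roughly like this (a sketch; demo.mp4 is a placeholder for a video you genuinely want versioned, and it must not be matched by your .gitignore):
```bash
git lfs install                  # one-time setup per machine
git lfs track "*.mp4"            # writes the tracking rule into .gitattributes
git add .gitattributes demo.mp4  # commit the rule together with the file
git commit -m "Track MP4 files with Git LFS"
git push                         # uploads the file content to LFS storage
```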
But:
- GitHub storage/bandwidth limits exist
- repo becomes heavier
- cloning slower
- not ideal for generated outputs
Usually good for:
- ML models
- datasets
- important demo videos
Not good for:
- constantly regenerated Whisper outputs
2. External Storage + Git Metadata (Most Practical)
Very common in AI/data engineering.
Store large files in:
- NAS
- external SSD
- cloud storage
- S3
- Google Drive
- Dropbox
Git only stores:
- scripts
- manifests
- hashes
- metadata
Example:
videos/
  2026/
    meeting_01.mp4
Git tracks:
{
  "file": "meeting_01.mp4",
  "sha256": "...",
  "duration": 7200
}
This scales much better.
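A minimal sketch of how such a manifest could be generated, assuming the video lives on external storage (the /mnt/nas path and the manifests/ directory are placeholders, and the *.json rule above would need an exception like !manifests/*.json for the manifest to be committable):
```bash
VIDEO=/mnt/nas/videos/2026/meeting_01.mp4

# hash the large file so the repo can verify it later without storing it
# (use "shasum -a 256" on macOS)
SHA=$(sha256sum "$VIDEO" | awk '{print $1}')

# write a small JSON manifest; Git tracks this instead of the video
mkdir -p manifests
cat > manifests/meeting_01.json <<EOF
{
  "file": "$(basename "$VIDEO")",
  "sha256": "$SHA"
}
EOF

git add manifests/meeting_01.json
git commit -m "Add manifest for meeting_01.mp4"
```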
3. DVC (Data Version Control)
In ML/data pipelines, many people use DVC.
It combines:
- Git for code
- external storage for data/models
Very useful when:
- datasets are huge
- GPU workflows exist
- many experiments happen
But it may be overkill for personal Whisper experiments.
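If you do try it, the typical flow is roughly the following (a sketch; the remote URL and file path are placeholders):
```bash
dvc init                              # one-time setup inside an existing Git repo
git add .dvc .dvcignore
git commit -m "Initialize DVC"

dvc add data/raw/meeting_01.mp4       # DVC writes a small .dvc pointer file

# Git tracks only the pointer (and the .gitignore DVC updates), not the video
git add data/raw/meeting_01.mp4.dvc data/raw/.gitignore
git commit -m "Track meeting_01.mp4 with DVC"

dvc remote add -d storage s3://my-bucket/whisper-data
dvc push                              # uploads the actual data to the remote
```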
My recommendation for your workflow
Given your setup (local GPU + Whisper + experimentation):
Put into Git
scripts/
prompts/
configs/
README.md
refined markdown notes
small examples
Ignore
mp4
logs
srt/vtt/tsv
intermediate json
cache
models
Optional
Keep ONE small demo sample for reproducibility/testing:
samples/demo_30sec.mp4
Since *.mp4 is ignored globally, re-include it with a negation rule (!samples/demo_30sec.mp4) in .gitignore.
A nice practical workflow:
# generate
python transcribe.py myopia.mp4
# refine
vim meeting_01_2_refined.md
# commit only knowledge/code
git add scripts README.md meeting_01_2_refined.md
git commit -m "Refine transcription pipeline"
This keeps repos:
- fast
- clean
- cloneable
- long-term maintainable
This becomes very important after several years; otherwise repos turn into giant storage dumps instead of engineering history.