Managing Large Files with Git | Generated by AI
Question: For files larger than 500 MB, such as MP4 recordings and Whisper outputs, how should I manage them with Git? Should I simply ignore them?
Answer:
For large media/transcription workflows like your setup, the normal approach is:
Git manages:
- source code
- scripts
- configs
- prompts
- small metadata
- documentation
Git ignores:
- large videos
- generated outputs
- caches
- logs
- temporary artifacts
So yes: the .mp4, .srt, .vtt, .json, and .txt files generated by Whisper should usually be ignored.
A clean structure is something like:
whisper-test/
├── src/
├── scripts/
├── prompts/
├── data/
│   ├── raw/
│   └── processed/
├── outputs/
├── logs/
├── .gitignore
└── README.md
Recommended .gitignore:
# Large media
*.mp4
*.mkv
*.mov
*.avi
# Whisper outputs
*.srt
*.vtt
*.tsv
*.json
*.txt
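# NOTE: *.json and *.txt are broad patterns; they also match files you may
# want to keep (configs, requirements.txt). Scope them to output paths,
# e.g. outputs/**/*.json, if that becomes a problem.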
# Logs
*.log
# Temp/cache
.cache/
tmp/
temp/
# Python
__pycache__/
*.pyc
# Models
models/
*.bin
*.pt
*.gguf
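To sanity-check which files these rules actually catch, Git can report the matching rule for any path (run from the repository root):
```bash
# show which .gitignore rule (file and line number) matches a given path
git check-ignore -v meeting_01_2.vtt

# list everything currently ignored in the working tree
git status --ignored --short
```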
For your current directory:
meeting_01_2.json
meeting_01_2_refined.md
meeting_01_2.tsv
meeting_01_2.vtt
myopia.mp4
whisper.log
You probably want:
| File | Git? |
|---|---|
| myopia.mp4 | Ignore |
| meeting_01_2.tsv | Ignore |
| meeting_01_2.vtt | Ignore |
| meeting_01_2.srt | Ignore |
| meeting_01_2.txt | Ignore |
| whisper.log | Ignore |
| meeting_01_2_refined.md | Keep (maybe) |
| scripts/configs | Keep |
Because:
- raw/generated artifacts are reproducible
- refined markdown may contain human edits and knowledge
A very common engineering pattern is:
raw data -> generated outputs -> refined knowledge
Only the final refined knowledge enters Git.
If you REALLY need versioning for large files, there are several approaches.
1. Git LFS (Large File Storage)
Use when:
- you truly need version history of videos/models
- team collaboration requires it
Install:
git lfs install
Track mp4:
git lfs track "*.mp4"
This creates:
.gitattributes
Then commit normally.
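Putting those steps together, a first-time sequence looks roughly like this (a sketch; demo.mp4 is a placeholder for a video you genuinely want versioned, and it must not be matched by your .gitignore):
```bash
git lfs install                  # one-time setup per machine
git lfs track "*.mp4"            # writes the tracking rule into .gitattributes
git add .gitattributes demo.mp4  # commit the rule together with the file
git commit -m "Track MP4 files with Git LFS"
git push                         # uploads the file content to LFS storage
```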
But:
- GitHub storage/bandwidth limits exist
- repo becomes heavier
- cloning slower
- not ideal for generated outputs
Usually good for:
- ML models
- datasets
- important demo videos
Not good for:
- constantly regenerated Whisper outputs
2. External Storage + Git Metadata (Most Practical)
Very common in AI/data engineering.
Store large files in:
- NAS
- external SSD
- cloud storage
- S3
- Google Drive
- Dropbox
Git only stores:
- scripts
- manifests
- hashes
- metadata
Example:
videos/
  2026/
    meeting_01.mp4
Git tracks:
{
  "file": "meeting_01.mp4",
  "sha256": "...",
  "duration": 7200
}
This scales much better.
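A minimal sketch of how such a manifest could be generated, assuming the video lives on external storage (the /mnt/nas path and the manifests/ directory are placeholders, and the *.json rule above would need an exception like !manifests/*.json for the manifest to be committable):
```bash
VIDEO=/mnt/nas/videos/2026/meeting_01.mp4

# hash the large file so the repo can verify it later without storing it
# (use "shasum -a 256" on macOS)
SHA=$(sha256sum "$VIDEO" | awk '{print $1}')

# write a small JSON manifest; Git tracks this instead of the video
mkdir -p manifests
cat > manifests/meeting_01.json <<EOF
{
  "file": "$(basename "$VIDEO")",
  "sha256": "$SHA"
}
EOF

git add manifests/meeting_01.json
git commit -m "Add manifest for meeting_01.mp4"
```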
3. DVC (Data Version Control)
In ML/data pipelines, many people use DVC.
It combines:
- Git for code
- external storage for data/models
Very useful when:
- datasets are huge
- GPU workflows exist
- many experiments happen
But it may be overkill for personal Whisper experiments.
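If you do try it, the typical flow is roughly the following (a sketch; the remote URL and file path are placeholders):
```bash
dvc init                              # one-time setup inside an existing Git repo
git add .dvc .dvcignore
git commit -m "Initialize DVC"

dvc add data/raw/meeting_01.mp4       # DVC writes a small .dvc pointer file

# Git tracks only the pointer (and the .gitignore DVC updates), not the video
git add data/raw/meeting_01.mp4.dvc data/raw/.gitignore
git commit -m "Track meeting_01.mp4 with DVC"

dvc remote add -d storage s3://my-bucket/whisper-data
dvc push                              # uploads the actual data to the remote
```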
My recommendation for your workflow
Given your setup (local GPU + Whisper + experimentation):
Put into Git
scripts/
prompts/
configs/
README.md
refined markdown notes
small examples
Ignore
mp4
logs
srt/vtt/tsv
intermediate json
cache
models
Optional
Keep ONE small demo sample for reproducibility/testing:
samples/demo_30sec.mp4
Since *.mp4 is ignored globally, re-include it with a negation rule (!samples/demo_30sec.mp4) in .gitignore.
A nice practical workflow:
# generate
python transcribe.py myopia.mp4
# refine
vim meeting_01_2_refined.md
# commit only knowledge/code
git add scripts README.md meeting_01_2_refined.md
git commit -m "Refine transcription pipeline"
This keeps repos:
- fast
- clean
- cloneable
- long-term maintainable
This becomes very important after several years; otherwise repos turn into giant storage dumps instead of engineering history.