Managing Large Files with Git | Generated by AI

2026.05

Question: For files larger than 500MB like MP4 recordings and Whisper outputs, how should I use Git to manage them? Should I only ignore them?

Answer:

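For large media/transcription workflows like your setup, the normal approach is to keep code, configs, and curated notes in Git and to keep recordings and generated outputs out of it.

So yes: the .mp4 recordings and the .srt, .vtt, .json, and .txt files generated by Whisper should mostly be ignored.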

A clean structure is something like:

whisper-test/
├── src/
├── scripts/
├── prompts/
├── data/
│   ├── raw/
│   └── processed/
├── outputs/
├── logs/
├── .gitignore
└── README.md

Recommended .gitignore:

# Large media
*.mp4
*.mkv
*.mov
*.avi

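# Whisper outputs
*.srt
*.vtt
*.tsv
*.json
*.txt

# Exceptions so the broad *.json / *.txt rules do not hide tracked files
# (file names below are assumptions; adjust to your repo)
!requirements.txt
!configs/*.json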

# Logs
*.log

# Temp/cache
.cache/
tmp/
temp/

# Python
__pycache__/
*.pyc

# Models
models/
*.bin
*.pt
*.gguf

For your current directory:

meeting_01_2.json
meeting_01_2_refined.md
meeting_01_2.tsv
meeting_01_2.vtt
myopia.mp4
whisper.log

You probably want:

File                       Git?
myopia.mp4                 Ignore
meeting_01_2.json          Ignore
meeting_01_2.tsv           Ignore
meeting_01_2.vtt           Ignore
meeting_01_2.srt           Ignore
meeting_01_2.txt           Ignore
whisper.log                Ignore
meeting_01_2_refined.md    Keep (maybe)
scripts/, configs/         Keep
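
If you want to sanity-check a decision table like this, here is a minimal Python sketch that matches bare file names against a subset of the patterns above. It is only an approximation: fnmatch does not implement full .gitignore semantics (no negation, no directory rules), and the names IGNORE_PATTERNS, is_ignored, and classify are made up for this example. The authoritative check inside a repo is git check-ignore -v <path>.

```python
from fnmatch import fnmatchcase

# Subset of the ignore patterns from the recommended .gitignore (simple globs only).
IGNORE_PATTERNS = ["*.mp4", "*.srt", "*.vtt", "*.tsv", "*.json", "*.txt", "*.log"]

# Files from the example directory listing.
FILES = [
    "meeting_01_2.json",
    "meeting_01_2_refined.md",
    "meeting_01_2.tsv",
    "meeting_01_2.vtt",
    "myopia.mp4",
    "whisper.log",
]


def is_ignored(filename: str) -> bool:
    """Return True if the bare file name matches any of the glob patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in IGNORE_PATTERNS)


def classify(files):
    """Map each file name to 'Ignore' or 'Keep' based on the patterns."""
    return {name: ("Ignore" if is_ignored(name) else "Keep") for name in files}


for name, decision in classify(FILES).items():
    print(f"{name:28} {decision}")
```

Only the refined Markdown note falls through to "Keep", which is the point of the pattern described next.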

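Because the recording is large and the Whisper outputs can be regenerated from it, while the refined notes are small, hand-edited text.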

A very common engineering pattern is:

raw data -> generated outputs -> refined knowledge

Only the final refined knowledge enters Git.


If you REALLY need versioning for large files, there are several approaches.

1. Git LFS (Large File Storage)

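Use when the large files themselves must be versioned alongside the code.

Install (the git-lfs package must already be installed; this command sets up its Git hooks):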

git lfs install

Track mp4:

git lfs track "*.mp4"

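This creates or updates .gitattributes with a line like:

*.mp4 filter=lfs diff=lfs merge=lfs -text

Commit .gitattributes, then add and commit the .mp4 files normally.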

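But the remote must support LFS, and hosting services typically limit LFS storage and bandwidth.

Usually good for a modest number of binary assets that change rarely.

Not good for a growing archive of long recordings, or for outputs that can be regenerated.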


2. External Storage + Git Metadata (Most Practical)

Very common in AI/data engineering.

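Store large files outside the repository, for example on an external disk, a NAS, or object storage.

Git only stores small metadata describing each file.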

Example:

videos/
  2026/
    meeting_01.mp4

Git tracks:

{
  "file": "meeting_01.mp4",
  "sha256": "...",
  "duration": 7200
}

This scales much better.
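
A metadata file like the one above can be generated and checked with a few lines of Python. This is a minimal sketch, not a fixed convention: the function names (sha256_of, build_metadata, write_metadata, verify) and the one-JSON-file-per-video layout are assumptions for illustration. The duration is passed in by the caller because the standard library cannot read it from an MP4.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a multi-GB video never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_metadata(path, duration=None):
    """Build the small record that Git tracks in place of the video itself."""
    path = Path(path)
    meta = {"file": path.name, "sha256": sha256_of(path)}
    if duration is not None:
        meta["duration"] = duration  # seconds, supplied by the caller
    return meta


def write_metadata(path, out_dir, duration=None):
    """Write <video stem>.json into out_dir and return the path written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (Path(path).stem + ".json")
    meta = build_metadata(path, duration)
    out_path.write_text(json.dumps(meta, indent=2) + "\n", encoding="utf-8")
    return out_path


def verify(path, meta_path):
    """Return True if the file on disk still matches the recorded checksum."""
    meta = json.loads(Path(meta_path).read_text(encoding="utf-8"))
    return sha256_of(path) == meta["sha256"]
```

For example, write_metadata("videos/2026/meeting_01.mp4", "metadata", duration=7200) would produce metadata/meeting_01.json. If your .gitignore contains a broad *.json rule, the metadata directory needs an exception (for instance !metadata/*.json) or these files will be ignored too.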


3. DVC (Data Version Control)

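In ML/data pipelines, many people use DVC.

It combines small pointer files tracked in Git with the actual data kept in separate storage.

Very useful when datasets or models change over time and results need to be reproducible.

But it may be overkill for personal Whisper experiments.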


My recommendation for your workflow

Given your setup (local GPU + Whisper + experimentation):

Put into Git

scripts/
prompts/
configs/
README.md
refined markdown notes
small examples

Ignore

mp4
logs
srt/vtt/tsv
intermediate json
cache
models

Optional

Keep ONE small demo sample:

samples/demo_30sec.mp4

for reproducibility/testing.


A nice practical workflow:

# generate
python transcribe.py myopia.mp4

# refine
vim meeting_01_2_refined.md

# commit only knowledge/code
git add scripts README.md meeting_01_2_refined.md
git commit -m "Refine transcription pipeline"

This keeps repos small and focused on code and notes.

That matters after several years: otherwise repos become giant storage dumps instead of engineering history.
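
One way to guard against that drift is a small check you run before committing. The sketch below lists files over a size threshold in the working tree. The function name find_large_files is made up for this example, and the 100 MiB default is an assumption, chosen because some hosts (GitHub, for example) reject files above that size.

```python
import os
from pathlib import Path


def find_large_files(root, limit_bytes=100 * 1024 * 1024):
    """Return (path, size) pairs for files larger than limit_bytes, biggest first."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own object store; it is not part of the working tree.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            full = Path(dirpath) / name
            try:
                size = full.stat().st_size
            except OSError:
                continue  # broken symlink or file removed mid-scan
            if size > limit_bytes:
                large.append((full, size))
    return sorted(large, key=lambda item: item[1], reverse=True)


# Usage (run from the repo root):
#   for path, size in find_large_files("."):
#       print(f"{size / 1024**2:8.1f} MiB  {path}")
```

Anything it reports should be either covered by .gitignore or moved to external storage before you run git add.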
