Duplicate note detection — pre-commit vs CI | Generated by AI

2026.06

Problem

The notes/ directory contained 5 pairs of duplicate notes: same content, different filenames. Duplicates slip in because the note pipeline generates content without checking what already exists.

What changed

Pre-commit hook added (.pre-commit-config.yaml):

- id: duplicate-notes-check
  name: Duplicate notes check
  entry: python3 -m unittest tests.workflow.test_duplicate_notes -v
  language: system
  pass_filenames: false
  always_run: true

CI unit tests kept (.github/workflows/gh-pages.yml):

- name: Run Unit Tests
  run: python -m unittest discover -s tests/workflow

5 duplicate pairs removed from notes/.

Why this works

Pre-commit catches duplicates locally before they land. The test groups notes by date (YYYY-MM-DD prefix), then compares notes within each group using a fast character-level similarity computed on the first and last 200 characters of each note. If the similarity exceeds 90%, the pair is flagged as a duplicate.
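The grouping and comparison described above can be sketched roughly as follows. This is a minimal illustration, not the actual test code: the function names, the in-memory notes dictionary, and the use of difflib.SequenceMatcher as the similarity measure are all assumptions. The source only says "fast character similarity", so the real check may use a different metric.

```python
from __future__ import annotations

import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

DATE_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2})")
WINDOW = 200       # characters taken from each end of a note
THRESHOLD = 0.90   # similarity above this is flagged as a duplicate


def fingerprint(text: str, window: int = WINDOW) -> str:
    """Return the first and last `window` characters of a note (hypothetical helper)."""
    if len(text) <= 2 * window:
        return text
    return text[:window] + text[-window:]


def find_duplicates(notes: dict[str, str]) -> list[tuple[str, str]]:
    """Return filename pairs in the same date group whose fingerprints exceed THRESHOLD."""
    groups = defaultdict(list)
    for name in sorted(notes):
        match = DATE_PREFIX.match(name)
        if match:  # notes without a date prefix are skipped in this sketch
            groups[match.group(1)].append(name)

    flagged = []
    for names in groups.values():
        # Every note is compared with every other note from the same date.
        for a, b in combinations(names, 2):
            matcher = SequenceMatcher(
                None, fingerprint(notes[a]), fingerprint(notes[b]), autojunk=False
            )
            if matcher.ratio() > THRESHOLD:
                flagged.append((a, b))
    return flagged
```

In a unittest-based check, a test method could load the files under notes/ into the dictionary and assert that find_duplicates returns an empty list. Disabling autojunk is a deliberate choice here: with it enabled, SequenceMatcher treats frequently occurring characters as noise once the second string reaches 200 characters, which distorts ratios on ordinary prose.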

CI still runs the full test suite: all 15 workflow tests, including the duplicate check. This is the safety net for anything that bypasses pre-commit (direct pushes, bot commits, CI-generated notes).

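The two layers serve different purposes: pre-commit gives fast local feedback before a commit is created, and CI is the backstop for changes that never pass through a local hook.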

Why this might not work

False positives. The similarity check looks only at the first and last 200 characters of each note. Two legitimately different notes on the same topic, with similar openings and closings, could trigger it. With so little text compared, the 90% threshold is aggressive.
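The false-positive risk can be demonstrated with a small example. It assumes the check concatenates the first and last 200 characters (the fingerprint helper is hypothetical) and uses made-up note text. Two notes that share a templated opening and closing but differ entirely in the middle produce identical fingerprints, so any similarity measure applied to them reports a perfect match.

```python
from difflib import SequenceMatcher

WINDOW = 200  # characters taken from each end, as described in the note


def fingerprint(text: str, window: int = WINDOW) -> str:
    """First and last `window` characters of a note (hypothetical helper)."""
    if len(text) <= 2 * window:
        return text
    return text[:window] + text[-window:]


# Made-up content: a shared template around two unrelated bodies.
opening = "Weekly infrastructure review. " * 10          # 300 characters
closing = "See the linked runbook for details. " * 10    # 360 characters

note_a = opening + "Database failover findings. " * 50 + closing
note_b = opening + "Unrelated cache tuning notes. " * 50 + closing

fp_a, fp_b = fingerprint(note_a), fingerprint(note_b)
similarity = SequenceMatcher(None, fp_a, fp_b, autojunk=False).ratio()

print(note_a == note_b)   # False: the notes differ
print(fp_a == fp_b)       # True: the compared windows are identical
print(similarity)         # 1.0, well above the 90% threshold
```

The middle of each note never enters the comparison, so no threshold below 100% can separate these two notes.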

Performance. With 9,884 notes, the date-grouped comparison is fast (~0.3s). But if the number of notes grows 10x, the O(n²) pairwise comparison within each date group could become slow.
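A quick calculation shows how the work scales. A date group with k notes needs k(k-1)/2 pairwise comparisons, so the cost depends on how many notes share a date rather than on the total count alone. The group sizes below are hypothetical, chosen only to illustrate the growth.

```python
from math import comb


def pairwise_comparisons(group_sizes):
    """Total comparisons when every note is compared with every other note in its date group."""
    return sum(comb(k, 2) for k in group_sizes)


# Hypothetical: 30 dates with 10 notes each, then 10x more notes per date.
print(pairwise_comparisons([10] * 30))    # 30 * 45   = 1350
print(pairwise_comparisons([100] * 30))   # 30 * 4950 = 148500

# Hypothetical: the same 10x growth spread over 10x more dates instead.
print(pairwise_comparisons([10] * 300))   # 300 * 45  = 13500
```

In this example, ten times more notes per date means 110 times more comparisons, while ten times more dates means only ten times more. Whether a 10x increase hurts therefore depends on whether the growth lands in existing date groups or in new ones.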

Bypass paths. Pre-commit can be skipped with git commit --no-verify. The CI test suite catches this, but only after the commit is already pushed.

Bot-generated content. The notes pipeline runs in CI (gh-pages.yml) and generates new notes. If it creates a duplicate, CI will catch it on the next run, but by then the duplicate has already been committed and pushed.

What’s working

What could be better

