Manual Dedup Pattern for JSON Storage

Manual Dedup Pattern for JSON Storage | Generated by AI

Home 2026.05

Question: Should I enforce uniqueness when adding records — i.e., is the dedup logic in add_record the right place to handle the “unique key” concern for JSON storage?

Answer:

Yes, your _record_key + existing_keys set approach in add_record is exactly the right pattern for JSON storage. You’ve essentially manually implemented what a DB UNIQUE constraint does.

But there are a few gaps worth fixing:

Current logic is correct but fragile in one spot

def _record_key(record_type: str, rec: dict) -> str:
    if record_type == "client_timesheet":
        return f"{rec.get('week_start', '')}_{rec.get('week_end', '')}"
    elif record_type == "vendor_punch_card":
        return rec.get("date", "")
    elif record_type == "vendor_timesheet":
        return rec.get("date", "")
    return str(sorted(rec.items()))  # ← fragile fallback

The fallback str(sorted(rec.items())) will fail silently if a new record type is added — it’ll “deduplicate” by full content equality, which is almost never what you want. Make it explicit:

def _record_key(record_type: str, rec: dict) -> str:
    if record_type == "client_timesheet":
        return f"{rec['week_start']}_{rec['week_end']}"
    elif record_type in ("vendor_punch_card", "vendor_timesheet"):
        return rec["date"]
    raise ValueError(f"Unknown record_type: {record_type!r}")

Fail loud, not silent.

The real uniqueness gap: cross-month

Your dedup only checks within a month bucket:

existing_keys = {_record_key(record_type, r) for r in all_data[month]["records"]}

A punch card date like 2026-05-31 could theoretically land in both 202605 and 202606 buckets if the caller passes the wrong month. For punch cards and vendor timesheets, the natural key is the date itself — consider validating that the date actually belongs to the declared month:

def _validate_month(record_type: str, rec: dict, month: str) -> bool:
    date_field = rec.get("date") or f"{rec.get('week_start', '')}"
    return date_field.startswith(f"{month[:4]}-{month[4:]}")

The `add_record` return value isn’t used anywhere visibly

You return (added, skipped) but callers should log it:

added, skipped = add_record(data_dir, staff_id, record_type, month, records)
if skipped:
    print(f"  ↩ {skipped} duplicate(s) skipped, {added} added")

Otherwise silent skips look like silent failures.

Bottom line

Your JSON + manual dedup is correct and sufficient for this scale. The pattern maps cleanly to what SQLite would give you:

SQLite	Your JSON
`PRIMARY KEY (date)`	`_record_key` returning `date`
`INSERT OR IGNORE`	`if key in existing_keys: skip`
`UNIQUE (week_start, week_end)`	`f"{week_start}_{week_end}"` composite key

No database needed — just tighten the fallback and add logging on skips.

Back Donate