Format-aware dedup and verify/prune #43

jakebromberg merged 3 commits into main from feat/format-aware-dedup on Mar 11, 2026

@jakebromberg
Member

Summary

  • Add format normalization module and format column to the release table schema, partitioning dedup by (master_id, format) so different formats of the same album survive independently
  • Make verify/prune format-aware: exact-match KEEP releases are downgraded to PRUNE when the release format doesn't match the library's owned formats
  • Backward-compatible: NULL format on either side matches anything; old library.db schemas without a format column degrade gracefully
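The verify/prune rule in the second and third bullets can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `format_matches` signature, the `KEEP`/`PRUNE` string values, and the `classify_exact_match` helper are assumptions made for the example. Only the NULL-matches-anything behavior and the KEEP-to-PRUNE downgrade come from the summary above.

```python
from typing import Iterable, Optional

KEEP = "KEEP"    # assumed label values; the PR only names the decisions
PRUNE = "PRUNE"


def format_matches(
    release_format: Optional[str],
    library_formats: Optional[Iterable[Optional[str]]],
) -> bool:
    """Return True when a release format is compatible with the owned formats.

    None (NULL) on either side matches anything, so data without format
    information behaves as it did before formats were tracked.
    """
    if release_format is None or library_formats is None:
        return True
    owned = set(library_formats)
    # Assumption: an empty set or a NULL entry means "format unknown".
    if not owned or None in owned:
        return True
    return release_format in owned


def classify_exact_match(release_format, library_formats) -> str:
    """Downgrade an exact-match KEEP to PRUNE when the format does not match."""
    return KEEP if format_matches(release_format, library_formats) else PRUNE
```

With a library that owns `{"CD", "Vinyl"}`, a `"Cassette"` release would be pruned under this sketch, while a release whose format is `None` would be kept.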

Closes #42

Test plan

  • 39 parametrized unit tests for format normalization (all Discogs/library format strings, edge cases)
  • Unit tests for import_csv format column config and transform
  • Integration tests for format-aware dedup (same-format dedup, different-format survival, NULL format grouping)
  • Unit tests for LibraryIndex format_by_pair construction (3-tuples, 2-tuple backward compat, from_sqlite with/without format)
  • Unit tests for format filtering in classify_all_releases (matching/mismatching/NULL formats)
  • All 463 unit tests pass
  • All 156 integration tests pass
  • ruff format + ruff check clean
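To give a feel for what the table-driven normalization tests might look like, here is a small sketch. The alias table and canonical names below are hypothetical (the PR does not list the actual format strings), and the body of `normalize_format` is an assumed implementation, not the one in `lib/format_normalization.py`. The LP-to-Vinyl alias is inferred from the E2E test description, where a library owning CD and LP keeps the Vinyl release.

```python
from typing import Optional

# Hypothetical aliases; the real module covers all Discogs and library
# format strings, which are not enumerated in this PR.
_ALIASES = {
    "cd": "CD",
    "vinyl": "Vinyl",
    "lp": "Vinyl",
    "cassette": "Cassette",
}


def normalize_format(raw: Optional[str]) -> Optional[str]:
    """Map a raw format string to a canonical name, or None if unknown."""
    if raw is None:
        return None
    # Unknown or empty strings fall through to None (stored as NULL).
    return _ALIASES.get(raw.strip().lower())


# (raw input, expected canonical value) pairs, one test case per row.
CASES = [
    ("CD", "CD"),
    ("  cd ", "CD"),
    ("Vinyl", "Vinyl"),
    ("LP", "Vinyl"),
    ("Cassette", "Cassette"),
    ("", None),
    (None, None),
    ("Betamax", None),
]


def run_cases() -> int:
    """Check every case and return how many were exercised."""
    for raw, expected in CASES:
        assert normalize_format(raw) == expected, (raw, expected)
    return len(CASES)
```

In a pytest suite, a table like `CASES` would typically be passed to `pytest.mark.parametrize` so that each row is reported as its own test.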

Jake Bromberg added 3 commits March 11, 2026 12:29
Partition dedup by (master_id, format) instead of just master_id so different formats (CD, LP, Cassette, etc.) of the same album survive dedup independently. Add format-aware verify/prune that checks the library's owned formats against release formats, downgrading exact-match KEEP releases to PRUNE when the format doesn't match.

- Add lib/format_normalization.py with normalize_format(), normalize_library_format(), and format_matches() functions
- Add format column to release table schema and import pipeline with normalize_format transform
- Change dedup PARTITION BY from master_id to (master_id, format)
- Update LibraryIndex to track format_by_pair from library.db (with backward-compatible fallback for old schemas)
- Update classify_all_releases to apply format filtering on exact-match KEEP releases
- Update copy-swap column lists and COPY_TABLE_SPEC to include format
- Update test fixtures with format data in library.db and release_artist.csv
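The change to the dedup partition can be illustrated with a self-contained sketch. It uses an in-memory SQLite database (window functions need SQLite 3.25 or newer) and a plain `DELETE`, whereas the commit message describes a copy-swap. The three-column table, the `ORDER BY id` tie-break, and the `master_id IS NOT NULL` filter are all assumptions for the example. Only the `PARTITION BY master_id, format` clause reflects the commit.

```python
import sqlite3

# Keep one row per (master_id, format) partition; delete the rest.
# Assumptions: lowest id wins, and rows without a master_id are left alone.
DEDUP_SQL = """
DELETE FROM release
WHERE id IN (
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY master_id, format
                   ORDER BY id
               ) AS rn
        FROM release
        WHERE master_id IS NOT NULL
    )
    WHERE rn > 1
)
"""


def dedup(conn: sqlite3.Connection) -> list:
    """Run the dedup statement and return the surviving release ids."""
    conn.execute(DEDUP_SQL)
    return [row[0] for row in conn.execute("SELECT id FROM release ORDER BY id")]


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE release (id INTEGER PRIMARY KEY, master_id INTEGER, format TEXT)"
    )
    conn.executemany(
        "INSERT INTO release VALUES (?, ?, ?)",
        [
            (1, 100, "CD"),
            (2, 100, "CD"),      # same master and format as 1: removed
            (3, 100, "Vinyl"),   # different format: survives
            (4, 100, None),
            (5, 100, None),      # NULL formats share one partition: removed
            (6, None, "CD"),     # no master_id: untouched
        ],
    )
    survivors = dedup(conn)
    conn.close()
    return survivors
```

Partitioning by `master_id` alone would have left a single survivor for master 100. Adding `format` to the partition keeps one release per format, and rows with a NULL format are grouped together.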
The import_csv.py script is invoked as a subprocess by run_pipeline.py, which means the repo root may not be on Python's module search path. Add sys.path.insert following the same pattern used by verify_cache.py and run_pipeline.py.
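The `sys.path` fix described for import_csv.py might look roughly like this. The helper name, the duplicate guard, and the assumed directory layout (script one level below the repo root) are illustrative and not taken from the commit.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(root: Path) -> None:
    """Put `root` at the front of sys.path unless it is already listed."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)


# In a script that lives one directory below the repo root (assumed layout):
#
#     ensure_on_sys_path(Path(__file__).resolve().parent.parent)
#     from lib.format_normalization import normalize_format
```

Inserting at index 0 makes the repo's own packages win over any installed module with the same name, which matters when the script is launched as a subprocess with a different working directory.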
Format-aware dedup partitions by (master_id, format), so fixture releases with different formats all survive dedup. Update E2E test expectations: test_format_aware_dedup_and_prune verifies both CD and Vinyl survive while Cassette is pruned (library owns CD and LP only). Label-aware dedup tests verify all format variants survive. master_id persists when no dedup copy-swap runs. Add test_format_column_present.
@jakebromberg merged commit 267e13e into main on Mar 11, 2026
3 checks passed
@jakebromberg deleted the feat/format-aware-dedup branch on March 11, 2026 19:43