perf: use ProcessPoolExecutor for multi-core fuzzy matching #47

Merged
jakebromberg merged 3 commits into main from perf/parallel-fuzzy
Mar 11, 2026
@jakebromberg
Member

@jakebromberg commented Mar 11, 2026

Summary

  • Switch Phase 4 fuzzy matching from ThreadPoolExecutor to ProcessPoolExecutor with the fork context for true multi-core parallelism (the GIL was serializing the threads onto a single core)
  • Use smaller chunks (~200 artists each) so progress is logged more often
  • Add throughput (artists/s) and ETA to the Phase 4 chunk logs, and elapsed time to the Phase 3 log
  • Set PYTHONUNBUFFERED=1 in the run_step subprocess environment so log lines stream immediately instead of sitting in fully buffered pipes

Closes #46

Test plan

  • New tests verify ProcessPoolExecutor worker produces same results as direct call
  • New tests verify multi-chunk aggregation matches single-batch results
  • New test verifies worker initializer sets module globals correctly
  • New test verifies Phase 4 logs contain throughput metrics
  • New tests verify PYTHONUNBUFFERED=1 is set in subprocess env and caller env vars are preserved
  • All 473 existing unit tests pass with no regressions
  • Run full pipeline with --prune and observe multi-core CPU usage and frequent chunk progress logs
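
The environment behavior covered by the PYTHONUNBUFFERED tests above could look roughly like the sketch below. The real `run_step` signature is not shown in this PR, so the parameters here and the helper `build_step_env` are assumptions made for illustration.

```python
import os
import subprocess


def build_step_env(extra_env=None):
    """Copy the caller's environment and force unbuffered Python output.

    Hypothetical helper: existing variables are preserved, and only
    PYTHONUNBUFFERED is overridden.
    """
    env = dict(os.environ)
    if extra_env:
        env.update(extra_env)
    env["PYTHONUNBUFFERED"] = "1"
    return env


def run_step(cmd, extra_env=None):
    """Run a pipeline step and collect its log lines as they arrive.

    Assumed signature, not the project's actual one.
    """
    proc = subprocess.Popen(
        cmd,
        env=build_step_env(extra_env),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    lines = []
    for line in proc.stdout:  # yields each line as soon as the child flushes it
        lines.append(line.rstrip("\n"))
    proc.wait()
    return proc.returncode, lines
```

Without the variable, a Python child whose stdout is a pipe typically block-buffers its output, so the parent sees log lines in bursts rather than as they are written.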

Jake Bromberg added 3 commits March 11, 2026 13:45
…ll CPU cores

ThreadPoolExecutor was serializing on a single core because the Python loop overhead between rapidfuzz extractOne calls holds the GIL. ProcessPoolExecutor with fork context gives true multi-core parallelism. Also improves logging: Phase 3 now reports elapsed time, Phase 4 chunk logs include throughput (artists/s) and ETA, and chunks are smaller (~200 artists) for more frequent progress updates.
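
The throughput and ETA figures mentioned in the commit message reduce to simple arithmetic over the count of completed artists and the elapsed time. The sketch below is one possible way to compute them; the function name `progress_line` and the log format are assumptions, not the format this PR emits.

```python
def progress_line(done, total, elapsed_s):
    """Format a chunk progress message with throughput and ETA.

    Hypothetical helper; the message layout is illustrative only.
    """
    rate = done / elapsed_s if elapsed_s > 0 else 0.0  # artists per second
    remaining = total - done
    eta_s = remaining / rate if rate > 0 else float("inf")
    return (
        f"Phase 4: {done}/{total} artists, "
        f"{rate:.1f} artists/s, ETA {eta_s:.0f}s"
    )
```

For example, `progress_line(200, 1000, 10.0)` gives a rate of 20.0 artists/s and an ETA of 40 seconds for the remaining 800 artists.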
@jakebromberg merged commit 4af953b into main Mar 11, 2026
3 checks passed
@jakebromberg deleted the perf/parallel-fuzzy branch March 11, 2026 22:56
Development

Successfully merging this pull request may close these issues.

Phase 4 fuzzy matching pegs single CPU core due to GIL
