fix: load classifier once per pipeline run #61

Merged: SeanClay10 merged 3 commits into main from fix/classifier-loading on May 4, 2026

Conversation

SeanClay10 (Collaborator) commented Apr 29, 2026

Summary

The XGBoost classifier artifacts were being reloaded from disk for every PDF processed, which adds unnecessary I/O overhead that grows with the number of PDFs in a batch run.

  • Moved load_classifier() from the per-PDF worker function into run_pipeline() so the model, vectorizer, and encoder load exactly once per run
  • Cleaned up dead code in _process_single_pdf() — removed leftover directory collection logic and an unused loop from a previous refactor
  • Fixed flake8 linting errors in chunked_biomistral_llm.py and chunked_extraction.py
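
The first change above can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: only the names load_classifier(), run_pipeline(), and _process_single_pdf() come from the description, while their signatures, the ClassifierArtifacts container, the load_calls counter, and the placeholder artifact values are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, Iterable, List

# Hypothetical counter, present only so the sketch can show how often loading happens.
load_calls: Dict[str, int] = {"count": 0}


@dataclass(frozen=True)
class ClassifierArtifacts:
    """Hypothetical bundle for the three artifacts named in the PR."""
    model: Any
    vectorizer: Any
    encoder: Any


def load_classifier() -> ClassifierArtifacts:
    """Stand-in for the real loader, which would read the artifacts from disk."""
    load_calls["count"] += 1
    # Placeholder strings instead of real XGBoost / vectorizer / encoder objects.
    return ClassifierArtifacts(model="model", vectorizer="vectorizer", encoder="encoder")


def _process_single_pdf(pdf_path: Path, classifier: ClassifierArtifacts) -> str:
    """Per-PDF worker: receives the already-loaded artifacts instead of loading them."""
    return f"{pdf_path.name}: classified with {classifier.model}"


def run_pipeline(pdf_paths: Iterable[str]) -> List[str]:
    """Load the classifier once, then reuse it for every PDF in the run."""
    classifier = load_classifier()
    return [_process_single_pdf(Path(p), classifier) for p in pdf_paths]


if __name__ == "__main__":
    results = run_pipeline(["a.pdf", "b.pdf", "c.pdf"])
    print(results)
    print("loader calls:", load_calls["count"])
```

If the real pipeline hands PDFs to separate worker processes, the artifacts would need to be picklable or loaded once per process instead; the description does not say which applies.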

Closes #59

SeanClay10 merged commit 662226c into main on May 4, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

Fix Classifier Model Reloading on Every PDF

2 participants