fix: load classifier once per pipeline run by SeanClay10 · Pull Request #61 · NovakLabOSU/FracFeedExtractor

SeanClay10 · 2026-04-29T22:54:27Z

Summary

The XGBoost classifier artifacts were being reloaded from disk for every PDF processed, which added unnecessary overhead to batch runs.

Moved load_classifier() from the per-PDF worker function into run_pipeline() so the model, vectorizer, and encoder load exactly once per run.
Cleaned up dead code in _process_single_pdf(): removed leftover directory-collection logic and an unused loop from a previous refactor.
Fixed flake8 linting errors in chunked_biomistral_llm.py and chunked_extraction.py.

Closes #59

SeanClay10 added 3 commits April 29, 2026 15:46

fix: load classifier once per pipeline run instead of per PDF

5793b9a

fix: removing dead code and updating docstrings

8ff9ac0

fix: linting

455bea5

SeanClay10 requested review from QuiteRocks, bradleyrule and raymondcen April 29, 2026 22:55

SeanClay10 self-assigned this Apr 29, 2026

raymondcen approved these changes May 4, 2026

SeanClay10 merged commit 662226c into main May 4, 2026
2 checks passed