Problem:
The two entry-point scripts, classify_extract.py (main) and extract-from-txt.py (secondary), duplicate their CSV-writing and row-building logic, apply inconsistent preprocessing (the secondary pipeline has text cleaning and section filtering that the main pipeline lacks), and use different default model names. As a result, the main pipeline is effectively less capable than the secondary one.
Tasks:
Fold text_cleaner.py and section_filter.py preprocessing steps from extract-from-txt.py into classify_extract.py
Consolidate shared CSV writing and row building logic into a shared utility
Expose a --skip-classifier flag on classify_extract.py
Align default model names across both scripts
Mark extract-from-txt.py as deprecated pending removal
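A minimal sketch of how the flag, the aligned default, and the deprecation notice could look. Only the --skip-classifier flag and the script names come from this issue; DEFAULT_MODEL, its placeholder value, the --model option, and the function names are assumptions for illustration.

```python
import argparse
import warnings

# Hypothetical shared default; the real model name is not given in this issue.
# Defining it once and importing it from both scripts keeps the defaults aligned.
DEFAULT_MODEL = "example-model"


def build_parser() -> argparse.ArgumentParser:
    """Build the argument parser for classify_extract.py (sketch only)."""
    parser = argparse.ArgumentParser(prog="classify_extract.py")
    # Assumed option name; the issue only says the defaults should match.
    parser.add_argument(
        "--model",
        default=DEFAULT_MODEL,
        help="model name (default shared by both scripts)",
    )
    # store_true makes the flag default to False, so the classifier runs
    # unless the flag is passed explicitly.
    parser.add_argument(
        "--skip-classifier",
        action="store_true",
        help="bypass the classification step",
    )
    return parser


def warn_deprecated() -> None:
    """Emit a deprecation warning; intended for extract-from-txt.py's entry point."""
    warnings.warn(
        "extract-from-txt.py is deprecated and pending removal; "
        "use classify_extract.py instead.",
        DeprecationWarning,
        stacklevel=2,
    )
```

argparse exposes the flag as args.skip_classifier. Python hides DeprecationWarning by default outside of __main__ and test runners, so a visible notice on stderr may be preferable if end users run the script directly.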
Context:
Once these tasks are complete, both pipelines should produce output of equivalent quality on the same input, and the duplicate code paths should be eliminated. Source: classify_extract.py and extract-from-txt.py
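One way to eliminate the duplicate code paths is a single module that both scripts import for row building and CSV writing. The sketch below is illustrative only: the column names and the build_row and write_rows helpers are hypothetical, since the real schema is not described in this issue.

```python
import csv
from typing import Iterable, Mapping, Sequence, TextIO

# Hypothetical column names; replace with the columns the scripts actually emit.
FIELDNAMES: Sequence[str] = ("source_file", "section", "value")


def build_row(source_file: str, section: str, value: str) -> dict[str, str]:
    """Build one output row in the shared column layout."""
    return {"source_file": source_file, "section": section, "value": value}


def write_rows(out: TextIO, rows: Iterable[Mapping[str, str]]) -> int:
    """Write a header plus all rows to an open text stream; return the row count."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count
```

With both entry points calling the same two functions, any change to the column layout happens in one place, which also makes it straightforward to compare the two pipelines' output on the same input.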