Consolidate and Fix Both Pipeline Entry Points #58

@raymondcen

Description

Problem:
The two entry point scripts duplicate their CSV writing and row building logic, apply inconsistent preprocessing (the secondary pipeline, extract-from-txt.py, cleans text and filters sections; the main pipeline, classify_extract.py, does neither), and use different default model names. As a result, the main pipeline is effectively less capable than the secondary one.

Tasks:
- Fold the text_cleaner.py and section_filter.py preprocessing steps from extract-from-txt.py into classify_extract.py
- Move the duplicated CSV writing and row building logic into a shared utility
- Expose a --skip-classifier flag on classify_extract.py
- Align default model names across both scripts
- Mark extract-from-txt.py as deprecated pending removal
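
A rough sketch of what the shared utility and the new flag could look like. The column names, the `DEFAULT_MODEL` value, and the function names `build_row`, `write_rows`, and `build_parser` are all placeholders, since this issue does not specify them; the real columns and model name should come from the existing scripts.

```python
import argparse
import csv
import io
from typing import Iterable, Mapping, Sequence

# Placeholder values: the issue does not name the real columns or model.
FIELDNAMES = ("source_file", "section", "extracted_text")
DEFAULT_MODEL = "example-model"


def build_row(source_file: str, section: str, extracted_text: str) -> dict:
    """Build one output row; both entry points would call this."""
    return {
        "source_file": source_file,
        "section": section,
        "extracted_text": extracted_text.strip(),
    }


def write_rows(handle, rows: Iterable[Mapping[str, str]],
               fieldnames: Sequence[str] = FIELDNAMES) -> int:
    """Write a header plus rows to an open text handle; return the row count."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


def build_parser() -> argparse.ArgumentParser:
    """Argument parser with a single shared default model name."""
    parser = argparse.ArgumentParser(prog="classify_extract.py")
    parser.add_argument("--model", default=DEFAULT_MODEL)
    parser.add_argument(
        "--skip-classifier",
        action="store_true",
        help="run extraction only, without the classification step",
    )
    return parser
```

Both scripts would import `DEFAULT_MODEL` from the same module, so the defaults cannot drift apart again. When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends.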

Context:
Once this is complete, both pipelines should produce output of equivalent quality on the same input, and the duplicated code paths should be gone. Source files: classify_extract.py and extract-from-txt.py
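
One possible way to check the "equivalent output" goal is to compare the CSV files the two scripts produce for the same input. The sketch below assumes that equivalence can be approximated by comparing rows while ignoring row and column order, which is stricter than "equivalent quality" and may need loosening. The helper names `load_rows` and `diff_outputs` are hypothetical.

```python
import csv
import io
from collections import Counter


def load_rows(handle) -> Counter:
    """Read a CSV into a multiset of rows, ignoring row and column order."""
    reader = csv.DictReader(handle)
    # Assumes every data row has exactly as many fields as the header.
    return Counter(tuple(sorted(row.items())) for row in reader)


def diff_outputs(main_handle, secondary_handle):
    """Return (rows only in main output, rows only in secondary output)."""
    main_rows = load_rows(main_handle)
    secondary_rows = load_rows(secondary_handle)
    return main_rows - secondary_rows, secondary_rows - main_rows
```

If both returned counters are empty, the two outputs contain the same rows.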
