Background
The insight stage was originally designed as a two-pass extraction:
Cheap pass (currently in production): every retrieved chunk goes through a cheap model (gpt-4o-mini) for fast, broad fact extraction. Records are deduped across documents.
Strong pass (never implemented): after cross-document dedup, the surviving records get re-checked by a more capable model (gpt-4o) to adjust confidence scores, fix mis-classifications, fill in missing structured fields, or reject records the strong model judges as weak.
The scaffold for the second pass shipped in the initial insight-stage PR (#10) as a use_strong_model_refinement: bool = False config flag with a placeholder branch in InsightPipeline.run that only appended a "not yet implemented" note. The strong_model: str = "gpt-4o" config sibling pointed at the intended model.
That scaffold has now been removed on the same branch from which this issue was filed (see commit on feat/insight-stage-hardening). The removal was deliberate: a flag that's wired into the config but does nothing useful is worse than no flag at all. Users who enable it get a silent no-op and a misleading note, and future contributors have to reverse-engineer the design intent.
When to revisit
After the first end-to-end benchmark run, evaluate whether the cheap-model records are good enough on their own, or whether one of these failure modes is significant:
Confidence miscalibration — cheap-model confidence scores don't track real accuracy, so forecasting can't trust them as weights.
Misclassified event_type — cheap model puts a "deaths" fact under case_count, or similar category drift.
Missing structured fields — metric_unit, iso_country_code, event_date_precision left blank when the chunk actually supports filling them in.
Subtle hallucinations — facts that pass the substring/quote-match guard but make claims the chunk doesn't actually support.
If any of these matter enough to forecasting accuracy, a strong-model refinement pass is a natural fix.
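One way to make the confidence-miscalibration check concrete is an expected calibration error (ECE) over benchmark-labeled records. The sketch below is only illustrative: LabeledRecord, is_correct, and the bin count are hypothetical names and choices, not part of the existing pipeline, and it assumes the benchmark can label each record as correct or incorrect.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledRecord:
    """Hypothetical stand-in: one extracted record plus a benchmark label."""
    confidence: float  # cheap-model confidence in [0, 1]
    is_correct: bool   # whether the benchmark judged the record correct


def expected_calibration_error(records: List[LabeledRecord], n_bins: int = 5) -> float:
    """Weighted mean gap between average confidence and accuracy per bin."""
    if not records:
        return 0.0
    bins: List[List[LabeledRecord]] = [[] for _ in range(n_bins)]
    for rec in records:
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(rec.confidence * n_bins), n_bins - 1)
        bins[idx].append(rec)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(r.confidence for r in bucket) / len(bucket)
        accuracy = sum(r.is_correct for r in bucket) / len(bucket)
        ece += (len(bucket) / len(records)) * abs(avg_conf - accuracy)
    return ece
```

A value near 0 means confidence tracks accuracy. A large value (for example, records at 0.9 confidence that are right only a quarter of the time) would suggest forecasting cannot use the scores as weights. What counts as "significant" is a threshold still to be chosen from benchmark data.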
Scope (when implementing)
Decide what the strong pass produces. Options ordered by ambition:
Confidence-only: strong model sees the record's quote + summary and emits an adjusted confidence value. Cheapest, smallest API surface.
Field-fill: strong model can also write to currently-null structured fields (iso_country_code, metric_unit, event_date_precision), but cannot change non-null ones.
Full refinement: strong model can change any field; original record kept as notes for audit.
Decide where it runs in the pipeline.
The removed placeholder sat after _deduplicate_records in InsightPipeline.run. That's the right place: dedup first, refine the survivors. Verify that's still true when implementing.
Budget plumbing. The strong-model calls must go through the existing BudgetTracker so the max_input_tokens_per_run early-stop still applies.
Reject path. If the strong model decides a record is unsupported, the original record should not be silently dropped. Either keep it with a lowered confidence, or drop it but record the reason in result.notes for downstream auditability.
Tests. Synthetic-LLM tests for: confidence adjustment, field-fill on null fields only, rejection path. Plus a live-LLM smoke test on the existing 6-doc fixture set comparing record quality before/after.
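The field-fill, budget, and reject-path rules above can be sketched together. This is a minimal illustration under stated assumptions, not the intended implementation: Record, Verdict, refine, and the judge callable are hypothetical, and a plain token counter stands in for the real BudgetTracker, whose API is not shown in this issue.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, List, Optional

# Fields the strong model may fill, and only when currently null.
FILLABLE_FIELDS = ("metric_unit", "iso_country_code", "event_date_precision")


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for an insight record."""
    quote: str
    summary: str
    confidence: float
    metric_unit: Optional[str] = None
    iso_country_code: Optional[str] = None
    event_date_precision: Optional[str] = None


@dataclass
class Verdict:
    """Hypothetical strong-model output for one record."""
    supported: bool
    confidence: float
    fills: Dict[str, str] = field(default_factory=dict)
    reason: str = ""
    input_tokens: int = 0


def refine(
    records: List[Record],
    judge: Callable[[Record], Verdict],
    max_input_tokens: int,
    notes: List[str],
) -> List[Record]:
    """Refine deduplicated records, stopping early once the budget is spent."""
    spent = 0  # stand-in for BudgetTracker accounting
    refined: List[Record] = []
    for i, rec in enumerate(records):
        if spent >= max_input_tokens:
            notes.append(f"strong pass stopped early after {i} records: budget exhausted")
            refined.extend(records[i:])  # remaining records pass through unrefined
            break
        verdict = judge(rec)
        spent += verdict.input_tokens
        if not verdict.supported:
            # Never drop silently: record the reason for downstream audit.
            notes.append(f"strong pass rejected record {i}: {verdict.reason}")
            continue
        # Field-fill: write only to fields that are currently null.
        fills = {
            name: value
            for name, value in verdict.fills.items()
            if name in FILLABLE_FIELDS and getattr(rec, name) is None
        }
        refined.append(replace(rec, confidence=verdict.confidence, **fills))
    return refined
```

In synthetic-LLM tests, judge can be a plain function returning canned Verdict objects, which covers confidence adjustment, null-only field-fill, and the rejection path without network calls. The sketch drops rejected records and logs the reason. Keeping them with a lowered confidence is the other option named above.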
Definition of done
Benchmark results identify at least one of the failure modes above as significant.
Strong-pass scope (confidence-only vs field-fill vs full) decided based on benchmark data.
Implementation lands behind a re-introduced use_strong_model_refinement flag, defaulting to False.
Costs measured on the 6-doc fixture set: the strong pass should cost no more than 10× the existing $0.01-per-run insight budget (about $0.10 per run).
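The cost criterion can be checked mechanically once token counts are measured. In the sketch below, the $0.01 budget and the 10× ceiling come from this issue. The function name and the per-million-token prices passed to it are hypothetical inputs to be replaced with the provider's current rates.

```python
from typing import Tuple

INSIGHT_BUDGET_USD = 0.01   # existing per-run insight budget (from this issue)
MAX_COST_MULTIPLIER = 10    # strong pass may cost at most 10x that budget


def strong_pass_cost_check(
    input_tokens: int,
    output_tokens: int,
    usd_per_1m_input: float,
    usd_per_1m_output: float,
) -> Tuple[float, bool]:
    """Return (cost in USD, whether the cost is within the 10x ceiling)."""
    cost = (
        input_tokens / 1_000_000 * usd_per_1m_input
        + output_tokens / 1_000_000 * usd_per_1m_output
    )
    ceiling = INSIGHT_BUDGET_USD * MAX_COST_MULTIPLIER
    return cost, cost <= ceiling
```

Running this on the token totals from the 6-doc fixture run gives a pass/fail signal for the cost criterion, independent of which strong-pass scope is chosen.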
References
feat/insight-stage-hardening
bioscancast/insight/pipeline.py, InsightPipeline.run, after _deduplicate_records(all_records)
gpt-4o-mini