Strong-model refinement pass in the insight stage #26

@smodee

Background

The insight stage was originally designed as a two-pass extraction:

  1. Cheap pass (currently in production): every retrieved chunk goes through a cheap model (gpt-4o-mini) for fast, broad fact extraction. Records are deduped across documents.
  2. Strong pass (never implemented): after cross-document dedup, the surviving records get re-checked by a more capable model (gpt-4o) to adjust confidence scores, fix mis-classifications, fill in missing structured fields, or reject records the strong model judges as weak.

The scaffold for the second pass shipped in the initial insight-stage PR (#10) as a use_strong_model_refinement: bool = False config flag with a placeholder branch in InsightPipeline.run that only appended a "not yet implemented" note. The strong_model: str = "gpt-4o" config sibling pointed at the intended model.

That scaffold has now been removed on the same branch that prompted this issue (see commit on feat/insight-stage-hardening). The removal was deliberate: a flag that is wired into the config but does nothing useful is worse than no flag at all. Users who enable it get a silent no-op and a misleading note, and future contributors have to reverse-engineer the design intent.

When to revisit

After the first end-to-end benchmark run, evaluate whether the cheap-model records are good enough on their own, or whether one of these failure modes is significant:

  • Confidence miscalibration: cheap-model confidence scores don't track real accuracy, so forecasting can't trust them as weights.
  • Misclassified event_type: the cheap model files a "deaths" fact under case_count, or shows similar category drift.
  • Missing structured fields: metric_unit, iso_country_code, or event_date_precision left blank when the chunk actually supports filling them in.
  • Subtle hallucinations: facts that pass the substring/quote-match guard but make claims the chunk doesn't actually support.

If any of these matter enough to forecasting accuracy, a strong-model refinement pass is a natural fix.
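One way to put a number on the first failure mode is a binned calibration error over hand-labelled benchmark records. The sketch below is illustrative only: ScoredRecord, its is_correct label, and the choice of five equal-width bins are assumptions, not part of the existing pipeline.

```python
from dataclasses import dataclass


@dataclass
class ScoredRecord:
    confidence: float  # model-reported confidence, expected in [0, 1]
    is_correct: bool   # hand label from the benchmark (hypothetical field)


def expected_calibration_error(records, n_bins=5):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if not records:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for rec in records:
        # Clamp the index so out-of-range or boundary values stay in a valid bin.
        idx = max(0, min(int(rec.confidence * n_bins), n_bins - 1))
        bins[idx].append(rec)
    total = len(records)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(r.confidence for r in bucket) / len(bucket)
        accuracy = sum(r.is_correct for r in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece
```

A value near 0 means confidence roughly tracks accuracy. What counts as "significant" miscalibration is a judgment call to make once benchmark labels exist.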

Scope (when implementing)

  1. Decide what the strong pass produces. Options ordered by ambition:
    • Confidence-only: strong model sees the record's quote + summary and emits an adjusted confidence value. Cheapest, smallest API surface.
    • Field-fill: strong model can also write to currently-null structured fields (iso_country_code, metric_unit, event_date_precision), but cannot change non-null ones.
    • Full refinement: strong model can change any field; original record kept as notes for audit.
  2. Decide where it runs in the pipeline.
    • Before its removal, the placeholder sat after _deduplicate_records. That is probably still the right place (dedup first, then refine the survivors), but verify that when implementing.
  3. Budget plumbing. The strong-model calls must go through the existing BudgetTracker so the max_input_tokens_per_run early-stop still applies.
  4. Reject path. If the strong model decides a record is unsupported, the original record should not be silently dropped. Either keep it with a lowered confidence, or drop it but record the reason in result.notes for downstream auditability.
  5. Tests. Synthetic-LLM tests for: confidence adjustment, field-fill on null fields only, rejection path. Plus a live-LLM smoke test on the existing 6-doc fixture set comparing record quality before/after.
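To make items 1, 3, and 4 concrete, here is a minimal sketch of a field-fill pass with a budget early-stop and a non-silent reject path. Everything in it is hypothetical: Record, Verdict, refine_records, and the parameter defaults are invented for illustration, and TokenBudget is only a stand-in for the existing BudgetTracker, whose real interface is not shown in this issue. The strong model is injected as a plain callable so a synthetic LLM can be swapped in for tests.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Optional

# Structured fields the strong pass may fill, and only when they are None.
FILLABLE_FIELDS = ("metric_unit", "iso_country_code", "event_date_precision")


@dataclass(frozen=True)
class Record:
    """Hypothetical, trimmed-down insight record."""
    quote: str
    summary: str
    confidence: float
    metric_unit: Optional[str] = None
    iso_country_code: Optional[str] = None
    event_date_precision: Optional[str] = None


@dataclass
class Verdict:
    """Hypothetical shape of the strong model's answer for one record."""
    supported: bool
    confidence: float
    fills: dict = field(default_factory=dict)
    reason: str = ""
    input_tokens: int = 0


@dataclass
class TokenBudget:
    """Stand-in for BudgetTracker; the real API may differ."""
    max_input_tokens: int
    used: int = 0

    def can_spend(self, tokens: int) -> bool:
        return self.used + tokens <= self.max_input_tokens

    def spend(self, tokens: int) -> None:
        self.used += tokens


def refine_records(records, strong_model: Callable[[Record], Verdict],
                   budget: TokenBudget, est_tokens_per_call: int = 500,
                   rejected_confidence: float = 0.1):
    """Return (refined_records, notes); never drops a record silently."""
    refined, notes = [], []
    for i, rec in enumerate(records):
        if not budget.can_spend(est_tokens_per_call):
            # Early stop: pass the remaining records through unrefined.
            notes.append(f"strong pass stopped at record {i}: token budget exhausted")
            refined.extend(records[i:])
            break
        verdict = strong_model(rec)
        budget.spend(verdict.input_tokens)
        if not verdict.supported:
            # Reject path: keep the record, lower its confidence, log the reason.
            lowered = min(rec.confidence, rejected_confidence)
            refined.append(replace(rec, confidence=lowered))
            notes.append(f"record {i} judged unsupported: {verdict.reason}")
            continue
        # Field-fill: write only fields that are currently None.
        fills = {k: v for k, v in verdict.fills.items()
                 if k in FILLABLE_FIELDS and v is not None and getattr(rec, k) is None}
        refined.append(replace(rec, confidence=verdict.confidence, **fills))
    return refined, notes
```

This sketch takes the "keep it with a lowered confidence" option for rejected records and still records the reason. The drop-and-note variant would skip the append to refined and keep only the note.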

Definition of done

  • Benchmark results identify at least one of the failure modes above as significant.
  • Strong-pass scope (confidence-only vs field-fill vs full) decided based on benchmark data.
  • Implementation lands behind a re-introduced use_strong_model_refinement flag, defaulting to False.
  • Costs measured on the 6-doc fixture set. The strong pass should cost no more than 10× the existing $0.01-per-run insight budget (roughly $0.10 per run).
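The cost criterion can be checked with simple arithmetic once token counts are measured. The sketch below reads "10×" as a ceiling of ten times the $0.01 budget, which is an interpretation. The function names are hypothetical, and all token counts and per-token prices are caller-supplied placeholders rather than actual model pricing.

```python
def strong_pass_cost_usd(n_records, input_tokens_per_record, output_tokens_per_record,
                         usd_per_1m_input, usd_per_1m_output):
    """Estimated strong-pass cost for one run, assuming one call per record."""
    input_cost = n_records * input_tokens_per_record * usd_per_1m_input / 1_000_000
    output_cost = n_records * output_tokens_per_record * usd_per_1m_output / 1_000_000
    return input_cost + output_cost


def within_cost_ceiling(cost_usd, budget_usd=0.01, max_multiple=10.0):
    """True if the cost stays at or below max_multiple times the per-run budget."""
    return cost_usd <= budget_usd * max_multiple


# Placeholder inputs: 40 records, 300 input and 50 output tokens each,
# with made-up prices of $2 and $8 per million tokens.
estimate = strong_pass_cost_usd(40, 300, 50, usd_per_1m_input=2.0, usd_per_1m_output=8.0)
ok = within_cost_ceiling(estimate)
```

With these placeholder numbers the estimate is about $0.04 per run, which is under the $0.10 ceiling. The real figures should come from the measured fixture run.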

References

  • Original insight-stage PR: Implement Stage 4: Insight stage pipeline #10
  • Branch where the flag was removed: feat/insight-stage-hardening
  • Relevant code path before removal: bioscancast/insight/pipeline.py InsightPipeline.run, after _deduplicate_records(all_records)
  • Cheap-model behaviour as of benchmark setup: 23 → 40 records across 6 real biosecurity docs at ~$0.006 per run with gpt-4o-mini

Labels: enhancement (New feature or request), post-benchmark (Revisit after first end-to-end benchmark run)
