Strong-model refinement pass in the insight stage #26

## Background

The insight stage was originally designed as a two-pass extraction:

1. **Cheap pass** (currently in production): every retrieved chunk goes through a cheap model (`gpt-4o-mini`) for fast, broad fact extraction. Records are deduped across documents.
2. **Strong pass** (never implemented): after cross-document dedup, the surviving records get re-checked by a more capable model (`gpt-4o`) to adjust confidence scores, fix mis-classifications, fill in missing structured fields, or reject records the strong model judges as weak.

The scaffold for the second pass shipped in the initial insight-stage PR (#10) as a `use_strong_model_refinement: bool = False` config flag with a placeholder branch in `InsightPipeline.run` that only appended a "not yet implemented" note. The `strong_model: str = "gpt-4o"` config sibling pointed at the intended model.

That scaffold has now been removed on the branch from which this issue was filed (see the commit on `feat/insight-stage-hardening`). The removal was deliberate: a flag that is wired into the config but does nothing useful is worse than no flag at all. Users who enable it get a silent no-op and a misleading note, and future contributors have to reverse-engineer the design intent.

## When to revisit

After the first end-to-end benchmark run, evaluate whether the cheap-model records are good enough on their own, or whether one of these failure modes is significant:

- **Confidence miscalibration** — cheap-model confidence scores don't track real accuracy, so forecasting can't trust them as weights.
- **Misclassified `event_type`** — cheap model puts a "deaths" fact under `case_count`, or similar category drift.
- **Missing structured fields** — `metric_unit`, `iso_country_code`, `event_date_precision` left blank when the chunk actually supports filling them in.
- **Subtle hallucinations** — facts that pass the substring/quote-match guard but make claims the chunk doesn't actually support.

If any of these matter enough to forecasting accuracy, a strong-model refinement pass is a natural fix.
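
One way to check the first failure mode on benchmark output is a simple binned calibration table. The sketch below is illustrative only: it assumes each benchmark record has been reduced to a `(confidence, is_correct)` pair, where `is_correct` comes from manual or reference labelling. Neither that labelling step nor the function names exist in the codebase today.

```python
from collections import defaultdict


def calibration_table(pairs, n_bins=5):
    """Bin (confidence, is_correct) pairs into equal-width confidence bins.

    Returns a list of (bin_index, count, mean_confidence, accuracy) tuples,
    one per non-empty bin, ordered by bin index.
    """
    bins = defaultdict(list)
    for confidence, is_correct in pairs:
        # Clamp so that confidence == 1.0 lands in the last bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, is_correct))

    table = []
    for idx in sorted(bins):
        members = bins[idx]
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        table.append((idx, len(members), mean_conf, accuracy))
    return table


def expected_calibration_error(pairs, n_bins=5):
    """Count-weighted mean of |mean confidence - accuracy| over the bins."""
    total = len(pairs)
    if total == 0:
        return 0.0
    return sum(
        count / total * abs(mean_conf - accuracy)
        for _, count, mean_conf, accuracy in calibration_table(pairs, n_bins)
    )
```

A large gap between `mean_confidence` and `accuracy` in the high-confidence bins would be evidence that forecasting cannot use the cheap-model scores as weights without adjustment.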

## Scope (when implementing)

1. **Decide what the strong pass produces.** Options ordered by ambition:
   - **Confidence-only**: strong model sees the record's quote + summary and emits an adjusted `confidence` value. Cheapest, smallest API surface.
   - **Field-fill**: strong model can also write to currently-null structured fields (`iso_country_code`, `metric_unit`, `event_date_precision`), but cannot change non-null ones.
   - **Full refinement**: strong model can change any field; original record kept as `notes` for audit.
2. **Decide where it runs in the pipeline.**
   - Before removal, the placeholder sat *after* `_deduplicate_records`. That's the right place (dedup first, refine the survivors), but verify that's still true when implementing.
3. **Budget plumbing.** The strong-model calls must go through the existing `BudgetTracker` so the `max_input_tokens_per_run` early-stop still applies.
4. **Reject path.** If the strong model decides a record is unsupported, the original record should not be silently dropped. Either keep it with a lowered confidence, or drop it but record the reason in `result.notes` for downstream auditability.
5. **Tests.** Synthetic-LLM tests for: confidence adjustment, field-fill on null fields only, rejection path. Plus a live-LLM smoke test on the existing 6-doc fixture set comparing record quality before/after.
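
To make the field-fill and reject-path options concrete, here is a minimal sketch of how a strong-model suggestion could be merged into a record. Everything in it is an assumption for illustration: `Record` is a stand-in for the real insight record type, the `suggestion` dict shape is hypothetical, and `apply_strong_suggestion` is not an existing function. Budget plumbing is left out because the `BudgetTracker` interface is not described in this issue.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Structured fields the strong pass may fill, but only when currently null.
FILLABLE_FIELDS = ("metric_unit", "iso_country_code", "event_date_precision")


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for the real insight record type."""
    summary: str
    confidence: float
    event_type: str
    metric_unit: Optional[str] = None
    iso_country_code: Optional[str] = None
    event_date_precision: Optional[str] = None


def apply_strong_suggestion(record, suggestion, reject_confidence_cap=0.2):
    """Merge a strong-model suggestion into a record.

    Returns (updated_record, notes). The record is never dropped here:
    a rejection only caps its confidence and leaves an audit note.
    """
    notes = []

    if suggestion.get("reject"):
        reason = suggestion.get("reason", "no reason given")
        notes.append(f"strong model rejected record: {reason}")
        capped = min(record.confidence, reject_confidence_cap)
        return replace(record, confidence=capped), notes

    updates = {}
    if "confidence" in suggestion:
        # Keep the adjusted confidence inside [0, 1].
        updates["confidence"] = max(0.0, min(1.0, float(suggestion["confidence"])))

    for name in FILLABLE_FIELDS:
        proposed = suggestion.get(name)
        if proposed is None:
            continue
        if getattr(record, name) is None:
            updates[name] = proposed
        else:
            # Field-fill mode: non-null values are never overwritten.
            notes.append(f"ignored strong-model value for non-null field {name}")

    return replace(record, **updates), notes
```

This version keeps rejected records with a lowered confidence, which is the first of the two reject-path choices in item 4. Dropping the record and appending the note to `result.notes` would be the alternative.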

## Definition of done

- [ ] Benchmark results identify at least one of the failure modes above as significant.
- [ ] Strong-pass scope (confidence-only vs field-fill vs full) decided based on benchmark data.
- [ ] Implementation lands behind a re-introduced `use_strong_model_refinement` flag, defaulting to `False`.
- [ ] Costs measured on the 6-doc fixture set: with the strong pass enabled, per-run cost should stay within 10× the existing $0.01-per-run insight budget.
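
The cost item can be checked with simple arithmetic once token counts are available. The helper names below are hypothetical, the per-token prices are deliberately left as parameters to be taken from the provider's current price list, and the default ceiling reflects one reading of the budget item (10 × $0.01 per run).

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      usd_per_million_input, usd_per_million_output):
    """Estimate the cost of one run from token counts and per-million-token prices."""
    return (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000


def within_cost_ceiling(cost_usd, baseline_budget_usd=0.01, max_multiplier=10.0):
    """True if the measured per-run cost stays within the allowed multiple of the baseline."""
    return cost_usd <= baseline_budget_usd * max_multiplier
```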

## References

- Original insight-stage PR: #10
- Branch where the flag was removed: `feat/insight-stage-hardening`
- Relevant code path before removal: `bioscancast/insight/pipeline.py` `InsightPipeline.run`, after `_deduplicate_records(all_records)`
- Cheap-model behaviour as of benchmark setup: 23 → 40 records across 6 real biosecurity docs at ~\$0.006 per run with `gpt-4o-mini`
