
fix: guard baseline deltas by eval input fingerprint #1

Open
Osraka wants to merge 1 commit into NousResearch:main from Osraka:fix/guard-baseline-inputs

@Osraka Osraka commented May 17, 2026

Summary

  • record SHA-256 fingerprints for each fixture and probe bank in per-run outputs
  • refuse to summarize mixed-input runs and skip baseline deltas when fixture/probe contents no longer match
  • document the comparison guardrail and cover both changed-input and legacy-baseline cases with tests
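
A rough sketch of the fingerprinting and mixed-input check described above, not the code in this PR. The names `fingerprint_file`, `build_run_record`, and `check_single_input_set`, and the record keys `fixture_sha256` and `probe_bank_sha256`, are hypothetical. The sketch also assumes each fixture and probe bank is a single file on disk.

```python
import hashlib
from pathlib import Path


def fingerprint_file(path):
    """Return the SHA-256 hex digest of the file's raw bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large probe banks are not loaded at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_record(fixture_path, probe_bank_path, score):
    """Bundle a score with fingerprints of the inputs that produced it."""
    # Hypothetical record layout; the real per-run output may differ.
    return {
        "fixture": Path(fixture_path).stem,
        "fixture_sha256": fingerprint_file(fixture_path),
        "probe_bank_sha256": fingerprint_file(probe_bank_path),
        "score": score,
    }


def check_single_input_set(records):
    """Raise ValueError if runs of one fixture used different inputs."""
    seen = {}
    for record in records:
        key = (record["fixture_sha256"], record["probe_bank_sha256"])
        # Remember the first fingerprint pair seen for each fixture name.
        previous = seen.setdefault(record["fixture"], key)
        if previous != key:
            raise ValueError(f"mixed inputs for fixture {record['fixture']!r}")
```

Hashing raw bytes means even a whitespace-only edit to a fixture counts as a change. That is stricter than strictly necessary but avoids any format-specific normalization.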

Why

--compare-to currently keys only by fixture name. If a fixture or probe bank is edited between the baseline and current run, the report can show score deltas for unlike eval targets. That makes a prompt change look better or worse for reasons unrelated to the compressor itself. Fingerprinting the inputs keeps those comparisons honest while preserving the current report flow for compatible runs.
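
To illustrate the guard, here is a minimal sketch of a delta function that only compares scores when both fingerprints match. The function name `baseline_delta` and the record keys are hypothetical, and treating a legacy baseline (one recorded before fingerprints existed) as "skip the delta" is an assumption; the PR may handle that case differently.

```python
def baseline_delta(current, baseline):
    """Return current minus baseline score, or None if inputs may differ."""
    for key in ("fixture_sha256", "probe_bank_sha256"):
        baseline_fingerprint = baseline.get(key)
        if baseline_fingerprint is None:
            # Legacy baseline without fingerprints: inputs cannot be verified.
            return None
        if baseline_fingerprint != current.get(key):
            # Fixture or probe bank was edited since the baseline run.
            return None
    return current["score"] - baseline["score"]


# Example records (made-up values, for illustration only).
baseline_run = {"fixture_sha256": "aa", "probe_bank_sha256": "bb", "score": 0.5}
current_run = {"fixture_sha256": "aa", "probe_bank_sha256": "bb", "score": 0.75}
edited_run = {"fixture_sha256": "cc", "probe_bank_sha256": "bb", "score": 0.75}
legacy_run = {"score": 0.5}
```

With these records, `baseline_delta(current_run, baseline_run)` yields a number, while comparing `edited_run` to `baseline_run`, or `current_run` to `legacy_run`, yields `None` so the report can omit the delta instead of showing a misleading one.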

Test plan

  • python3 -m pytest tests/ -q
