feat: detect score regressions in Run-Benchmark_issues.ps1 #74

@openaitx-system

Summary

Add automatic score-regression detection to `scripts/Run-Benchmark_issues.ps1`.

Problem

When the issue benchmark script regenerates reference PDFs and re-runs comparisons, there was no way to tell at a glance whether any case's `overall_score` had dropped compared to the previous run. Regressions were silently overwritten into `comparison_report.json`.

Solution

Two new PowerShell helper functions are added to the script:

`Get-ReportScores`

Reads `comparison_report.json` from a given report directory and returns a hashtable of `name -> overall_score` for all cases present in the file.
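
The script itself is not shown in this issue, so the following is only a minimal sketch of what such a reader could look like. It assumes that `comparison_report.json` is a JSON array of objects that each carry `name` and `overall_score` properties, and that the function takes a `-ReportDir` parameter. Both the schema and the parameter name are assumptions, not confirmed by the issue.

```powershell
# Sketch only. Assumed (not confirmed by the issue): the report is a JSON
# array of objects with 'name' and 'overall_score'; the parameter is -ReportDir.
function Get-ReportScores {
    param(
        [Parameter(Mandatory)]
        [string]$ReportDir
    )

    $scores = @{}
    $reportPath = Join-Path $ReportDir 'comparison_report.json'

    # A missing report (for example on the first run) yields an empty snapshot.
    if (-not (Test-Path -LiteralPath $reportPath)) {
        return $scores
    }

    $cases = Get-Content -LiteralPath $reportPath -Raw | ConvertFrom-Json
    foreach ($case in @($cases)) {
        # Skip entries that lack either field instead of failing.
        if ($null -ne $case.name -and $null -ne $case.overall_score) {
            $scores[[string]$case.name] = [double]$case.overall_score
        }
    }
    return $scores
}
```

Taking the snapshot before the comparison step matters because the report file is overwritten by the run; the hashtable keeps the old scores in memory for the later comparison.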

`Show-ScoreDrops`

Compares before/after score snapshots and prints a formatted table of any cases whose score decreased, sorted by delta (worst first):

```
Score drop check (DOCX): 2 lower score(s) found

Name   Before  After   Delta
caseA  0.9500  0.9300  -0.0200
caseB  0.8100  0.7900  -0.0200
```

If no regressions are found, a green confirmation line is printed instead:

```
Score drop check (XLSX): no lower scores found.
```
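
As a rough illustration of the comparison described above, the sketch below splits the work in two: a hypothetical helper `Get-ScoreDrops` (not named in this issue) that computes the drops, and a `Show-ScoreDrops` that prints them. The parameter names `-Label`, `-Before`, and `-After` and the column widths are assumptions; the actual script may differ.

```powershell
# Hypothetical helper, not part of the issue: returns one object per case
# whose score decreased between the two snapshots.
function Get-ScoreDrops {
    param(
        [hashtable]$Before,
        [hashtable]$After
    )
    $drops = @(foreach ($name in $Before.Keys) {
        # Cases absent from the new snapshot are skipped; cases that exist
        # only in $After are never visited, so new cases are ignored.
        if ($After.ContainsKey($name) -and $After[$name] -lt $Before[$name]) {
            [pscustomobject]@{
                Name   = $name
                Before = $Before[$name]
                After  = $After[$name]
                Delta  = $After[$name] - $Before[$name]
            }
        }
    })
    # Ascending sort puts the most negative delta (worst drop) first.
    $drops | Sort-Object -Property Delta
}

# Sketch of the printer; parameter names are assumed.
function Show-ScoreDrops {
    param(
        [string]$Label,
        [hashtable]$Before,
        [hashtable]$After
    )
    $drops = @(Get-ScoreDrops -Before $Before -After $After)
    if ($drops.Count -eq 0) {
        Write-Host "Score drop check ($Label): no lower scores found." -ForegroundColor Green
        return
    }
    Write-Host "Score drop check ($Label): $($drops.Count) lower score(s) found" -ForegroundColor Red
    Write-Host ('{0,-20} {1,8} {2,8} {3,8}' -f 'Name', 'Before', 'After', 'Delta') -ForegroundColor Red
    foreach ($d in $drops) {
        Write-Host ('{0,-20} {1,8:F4} {2,8:F4} {3,8:F4}' -f $d.Name, $d.Before, $d.After, $d.Delta) -ForegroundColor Red
    }
}
```

Keeping the comparison separate from the printing is a design choice of this sketch only; it makes the drop logic easy to check without capturing console output.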

Changes

  • *`scripts/Run-Benchmark_issues.ps1`*
    • Added `Get-ReportScores`: reads `comparison_report.json` before comparison runs
    • Added `Show-ScoreDrops`: prints regression table after comparison completes
    • Called for both XLSX and DOCX pipelines (Step 3)

Behaviour

  • Only cases that existed in the previous report are checked (new cases are ignored)
  • Output is colour-coded: red for regressions, green for clean runs
  • The check is non-blocking; it never interrupts the benchmark run
  • Works correctly with `-Filter` (filtered runs only compare the subset that was re-evaluated)
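
One way the non-blocking behaviour listed above could be achieved is to wrap the check so that a failure inside it is downgraded to a warning. The wrapper name `Invoke-NonBlockingCheck` is hypothetical and does not appear in this issue; the real script may handle this differently.

```powershell
# Hypothetical wrapper, not from the issue: runs a check and never lets a
# terminating error escape to the caller.
function Invoke-NonBlockingCheck {
    param(
        [Parameter(Mandatory)]
        [scriptblock]$Check
    )
    try {
        # Discard pipeline output so only the status value is returned;
        # Write-Host output from the check still reaches the console.
        & $Check | Out-Null
        return $true
    }
    catch {
        # Only terminating errors (such as 'throw') are caught here.
        Write-Warning "Score drop check skipped: $($_.Exception.Message)"
        return $false
    }
}
```

A call site could then pass the score check as a script block, for example `Invoke-NonBlockingCheck -Check { ... }`, and continue with the benchmark regardless of the returned status.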
