Fix benchmark scoring call signature #1

Draft

Bortlesboat wants to merge 1 commit into elder-plinius:main from Bortlesboat:codex/autotemp-benchmark-evaluate-output-fix

Conversation

@Bortlesboat

Summary

  • pass both the original prompt and extracted best output into evaluate_output() during benchmark runs
  • add an offline regression test that stubs external imports so the benchmark path can be verified without live API calls

Why

benchmark() currently calls evaluate_output() with the wrong arguments, raising a TypeError that is then swallowed by the broad exception handler. That quietly turns otherwise valid benchmark rows into 0.0 overall scores.
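
A minimal sketch of the failure mode, using illustrative names rather than the actual AutoTemp signatures:

```python
# Hypothetical stand-ins for the real functions; only the call-signature
# mismatch and the broad exception handler mirror the bug described above.

def evaluate_output(prompt, output):
    """Pretend scorer: expects both the prompt and the model output."""
    return 1.0 if output else 0.0

def benchmark_row(prompt, best_output):
    try:
        # Buggy call: one argument instead of two raises a TypeError...
        return evaluate_output(best_output)
    except Exception:
        # ...which the broad handler silently converts into a 0.0 score.
        return 0.0

def benchmark_row_fixed(prompt, best_output):
    try:
        # Fixed call: pass both the original prompt and the best output.
        return evaluate_output(prompt, best_output)
    except Exception:
        return 0.0

print(benchmark_row("q", "answer"))        # 0.0 — valid row silently zeroed
print(benchmark_row_fixed("q", "answer"))  # 1.0
```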

Verification

  • python -m unittest tests.test_benchmark -v
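
The stub-based approach from the summary might look roughly like this; the stubbed package name ("openai") and the patched symbol are assumptions about the project layout, not taken from the AutoTemp source:

```python
# Sketch of an offline regression test that registers a fake module in
# sys.modules before the code under test imports it, so no live API
# client is ever constructed.
import sys
import types
import unittest
from unittest import mock

def make_stub(name):
    """Build a fake module so 'import <name>' needs no network access."""
    stub = types.ModuleType(name)
    stub.OpenAI = mock.MagicMock()  # stand-in for a real client class
    return stub

class BenchmarkOfflineTest(unittest.TestCase):
    def test_import_resolves_to_stub(self):
        with mock.patch.dict(sys.modules, {"openai": make_stub("openai")}):
            import openai  # resolves to the stub, not the real package
            self.assertIsInstance(openai.OpenAI, mock.MagicMock)
```

Because mock.patch.dict restores sys.modules on exit, the stub never leaks into other tests.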
