feat: adds ability to use inverted judges #168

Open
andrewklatzke wants to merge 2 commits into aklatzke/AIC-2263/sdk-dx-improvements from aklatzke/AIC-2377/inverted-judges

andrewklatzke commented May 5, 2026

Requirements

  • I have added test coverage for new or changed functionality
  • I have followed the repository's pull request submission guidelines
  • I have validated my changes against all supported platform versions

Describe the solution you've provided

Implements handling for "inverted" judges.

Describe alternatives you've considered

This brings feature parity with our online evals functionality; no alternatives were considered.

Additional context

When a metric has is_inverted set, the score comparison flips from >= to <=, so a score passes when it is at or below the threshold. This adds a utility, judge_passed, to handle that logic and uses it throughout. The SDK doesn't surface the inverted property, so we fetch the judge directly to get this information.
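
To make the comparison flip concrete, here is a minimal sketch of what such a helper could look like. The argument order follows the call shown in the review diff for this PR; the body is an assumption, not the merged implementation.

```python
def judge_passed(score: float, threshold: float, is_inverted: bool = False) -> bool:
    """Return True when the score satisfies the threshold for the judge's direction.

    Assumed behavior: standard judges pass at or above the threshold,
    inverted judges (lower is better) pass at or below it.
    """
    if is_inverted:
        return score <= threshold
    return score >= threshold


# Same score and threshold, opposite outcomes depending on direction.
standard_result = judge_passed(0.8, 0.7, is_inverted=False)
inverted_result = judge_passed(0.8, 0.7, is_inverted=True)
```

With a score of 0.8 against a threshold of 0.7, the standard judge passes and the inverted judge fails; a score exactly at the threshold passes in both directions under this sketch.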


Note

Medium Risk
Changes core judge pass/fail semantics and adds per-judge REST calls (get_ai_config) during config-driven runs, which could affect optimization outcomes and introduce new failure/performance modes if the API is unavailable or slow.

Overview
Adds first-class support for inverted judges (where lower scores are better) by introducing a shared judge_passed helper and using it for pass/fail decisions in OptimizationClient and in prompt feedback generation.

Extends OptimizationJudge with an is_inverted flag and, for optimize_from_config, fetches each judge’s isInverted value via api_client.get_ai_config when building options. Updates logging to include the inverted status, and adds targeted tests covering the helper, mixed inverted/standard evaluation, config building behavior, and variation_prompt_feedback output.

Reviewed by Cursor Bugbot for commit a8f14de. Bugbot is set up for automated code reviews on this repo.
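
The per-judge lookup described in the overview might be sketched roughly as follows. Only the names get_ai_config, isInverted, and is_inverted come from this PR; the JudgeOptions class, the build_judge_options function, the dict-shaped response, and the fallback on failure are all hypothetical and exist only to illustrate the risk noted above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class JudgeOptions:
    """Hypothetical stand-in for the SDK's OptimizationJudge."""
    key: str
    threshold: Optional[float] = None
    is_inverted: bool = False


def build_judge_options(api_client: Any, key: str, threshold: Optional[float]) -> JudgeOptions:
    """Build judge options, looking up the inverted flag for this judge."""
    # Assumption: get_ai_config(key) returns a dict-like payload that may
    # contain an "isInverted" field. The real signature is not shown in the PR.
    try:
        payload: Dict[str, Any] = api_client.get_ai_config(key)
    except Exception:
        # Assumed fallback: treat the judge as standard when the lookup fails.
        payload = {}
    return JudgeOptions(key, threshold, bool(payload.get("isInverted", False)))


class FakeClient:
    """Test double standing in for the real API client."""

    def __init__(self, configs: Dict[str, Dict[str, Any]]) -> None:
        self._configs = configs

    def get_ai_config(self, key: str) -> Dict[str, Any]:
        return self._configs[key]  # raises KeyError for unknown judges


client = FakeClient({"toxicity": {"isInverted": True}, "relevance": {}})
toxicity = build_judge_options(client, "toxicity", 0.2)
relevance = build_judge_options(client, "relevance", 0.8)
missing = build_judge_options(client, "unknown", None)
```

Silently defaulting to a standard judge keeps the run going but can reverse the meaning of an inverted judge's score, so a real implementation may prefer to raise or log loudly instead.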

andrewklatzke requested a review from jsonbailey May 5, 2026 23:12
andrewklatzke requested a review from a team as a code owner May 5, 2026 23:12

cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

  score = result.score
  if optimization_judge.threshold is not None:
-     passed = score >= optimization_judge.threshold
+     passed = judge_passed(score, optimization_judge.threshold, optimization_judge.is_inverted)

Inverted judge logic missed in fallback threshold branch

Medium Severity

In variation_prompt_feedback, the else branch (when optimization_judge.threshold is None) still uses passed = score >= 1.0 instead of calling judge_passed(score, 1.0, optimization_judge.is_inverted). For inverted judges hitting this path, a low score (which should pass) would be marked as FAILED, and only a perfect 1.0 would pass — the exact opposite of the intended behavior. The other two equivalent locations in client.py correctly default to 1.0 and pass it through judge_passed.
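
The difference between the two branches can be reproduced with a small sketch. The Judge class and the two passed_* functions are hypothetical names for illustration, and judge_passed is re-declared here with an assumed body so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judge:
    """Hypothetical stand-in for optimization_judge."""
    threshold: Optional[float] = None
    is_inverted: bool = False


def judge_passed(score: float, threshold: float, is_inverted: bool) -> bool:
    # Assumed body: inverted judges pass at or below the threshold.
    return score <= threshold if is_inverted else score >= threshold


def passed_before_fix(score: float, judge: Judge) -> bool:
    if judge.threshold is not None:
        return judge_passed(score, judge.threshold, judge.is_inverted)
    # Fallback branch as reported: ignores is_inverted entirely.
    return score >= 1.0


def passed_after_fix(score: float, judge: Judge) -> bool:
    # Default the threshold to 1.0, then route both cases through the helper.
    threshold = 1.0 if judge.threshold is None else judge.threshold
    return judge_passed(score, threshold, judge.is_inverted)


inverted_no_threshold = Judge(threshold=None, is_inverted=True)
before = passed_before_fix(0.1, inverted_no_threshold)
after = passed_after_fix(0.1, inverted_no_threshold)
```

For an inverted judge with no threshold and a score of 0.1, the fallback branch reports a failure while the routed version reports a pass. Standard judges behave the same in both versions.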
