fix(eval): handle unevaluated final response v2 results #5728
pragnyanramtha wants to merge 13 commits
Refreshed this branch with current Validation rerun:
|
Pushed The failure was from repository-wide hooks updating existing files outside this PR's evaluator patch:
Validation:
|
Hi @pragnyanramtha, thank you for your contribution! We appreciate you taking the time to submit this pull request. Your PR has been received by the team and is currently under review. We will provide feedback as soon as we have an update to share.
|
Hi @sasha-gitg, can you please review this?
|
I noticed the |
Merge #5728

## Summary

Fixes a small aggregation edge case in `FinalResponseMatchV2Evaluator`: when every per-invocation result is skipped or not evaluated, the evaluator currently divides by zero while computing the overall score.

## Root Cause

`aggregate_invocation_results()` filters out results whose `score` is `None` or whose `eval_status` is `NOT_EVALUATED`, but it unconditionally computes:

```python
overall_score = num_valid / num_evaluated
```

If all judge samples fail to produce a usable score, `num_evaluated` remains `0` and evaluation crashes instead of returning a not-evaluated aggregate result. Other ADK evaluators handle this condition by returning `overall_score=None` and `overall_eval_status=NOT_EVALUATED`.

## Change

- Return an `EvaluationResult` with `overall_score=None` and `overall_eval_status=NOT_EVALUATED` when no FinalResponseMatchV2 invocation results are evaluable.
- Add a focused regression test for all-skipped/all-not-evaluated invocation results.

## Validation

```bash
uv sync --extra test
uv run pytest tests/unittests/evaluation/test_final_response_match_v2.py
```

Result: `18 passed, 20 warnings`.

Full unit suite was not run; this patch is limited to FinalResponseMatchV2 aggregation and its targeted unit test file.

Co-authored-by: Haran Rajkumar <haranrk@google.com>
COPYBARA_INTEGRATE_REVIEW=#5728 from pragnyanramtha:pragnyan/final-response-v2-no-eval-guard 3d5ab73
PiperOrigin-RevId: 933818272
|
Thank you @pragnyanramtha for your contribution! 🎉 Your changes have been successfully imported and merged via Copybara in commit 5cfef01. Closing this PR as the changes are now in the main branch.
## Summary

Fixes a small aggregation edge case in `FinalResponseMatchV2Evaluator`: when every per-invocation result is skipped or not evaluated, the evaluator currently divides by zero while computing the overall score.

## Root Cause

`aggregate_invocation_results()` filters out results whose `score` is `None` or whose `eval_status` is `NOT_EVALUATED`, but it unconditionally computes `overall_score = num_valid / num_evaluated`.

If all judge samples fail to produce a usable score, `num_evaluated` remains `0` and evaluation crashes instead of returning a not-evaluated aggregate result. Other ADK evaluators handle this condition by returning `overall_score=None` and `overall_eval_status=NOT_EVALUATED`.

## Change

- Return an `EvaluationResult` with `overall_score=None` and `overall_eval_status=NOT_EVALUATED` when no FinalResponseMatchV2 invocation results are evaluable.
- Add a focused regression test for all-skipped/all-not-evaluated invocation results.

## Validation

Commands run: `uv sync --extra test`, then `uv run pytest tests/unittests/evaluation/test_final_response_match_v2.py`.

Result: `18 passed, 20 warnings`.

Full unit suite was not run; this patch is limited to FinalResponseMatchV2 aggregation and its targeted unit test file.
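
For readers without the diff at hand, the guard described under Change can be sketched roughly as follows. This is not the ADK source: `EvalStatus`, `PerInvocationResult`, and `EvaluationResult` are simplified local stand-ins, the `threshold` parameter is hypothetical, and counting a result as valid when its score equals `1.0` is an assumption made only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class EvalStatus(Enum):
    # Stand-in for the evaluator status enum (assumed shape).
    PASSED = 1
    FAILED = 2
    NOT_EVALUATED = 3


@dataclass
class PerInvocationResult:
    # Simplified stand-in: only the two fields the aggregation reads.
    score: Optional[float]
    eval_status: EvalStatus


@dataclass
class EvaluationResult:
    # Defaults model the "nothing was evaluable" aggregate.
    overall_score: Optional[float] = None
    overall_eval_status: EvalStatus = EvalStatus.NOT_EVALUATED


def aggregate_invocation_results(
    results: Sequence[PerInvocationResult],
    threshold: float = 0.5,  # hypothetical pass threshold
) -> EvaluationResult:
    num_valid = 0
    num_evaluated = 0
    for result in results:
        # Skip results that carry no usable score.
        if result.score is None or result.eval_status == EvalStatus.NOT_EVALUATED:
            continue
        num_evaluated += 1
        if result.score == 1.0:  # assumption: 1.0 marks a valid response
            num_valid += 1

    # The guard: with nothing evaluated, return a not-evaluated aggregate
    # instead of dividing by zero.
    if num_evaluated == 0:
        return EvaluationResult()

    overall_score = num_valid / num_evaluated
    status = EvalStatus.PASSED if overall_score >= threshold else EvalStatus.FAILED
    return EvaluationResult(overall_score=overall_score, overall_eval_status=status)
```

A regression check in the spirit of the one added here would pass only skipped results, for example `aggregate_invocation_results([PerInvocationResult(None, EvalStatus.NOT_EVALUATED)])`, and assert that `overall_score` is `None` rather than expecting a `ZeroDivisionError`.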