feat(eval): add evaluate_full_response option to rubric-based evaluation #5216

Open
Siddhartha90 wants to merge 3 commits into google:main from Siddhartha90:feat/evaluate-full-response

Conversation


Siddhartha90 commented Apr 8, 2026

Fixes #5217

Summary

When an agent emits text before a tool call (e.g. presenting a plan), then calls a tool, then emits more text (e.g. an explanation), rubric_based_final_response_quality_v1 only sends the post-tool-call text to the judge as final_response. The pre-tool-call text is stored in intermediate_data.invocation_events but is never included in the judge prompt.

This means rubrics that check for content in the pre-tool-call text always fail, even though the agent correctly produced that content.

Changes

  • Added evaluate_full_response: bool = False to RubricsBasedCriterion (following the pattern of evaluate_intermediate_nl_responses on HallucinationsCriterion)
  • When enabled, the evaluator concatenates all NL text from invocation_events + final_response before sending to the judge
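The concatenation step can be sketched roughly as follows. This is an illustrative sketch only: apart from `evaluate_full_response` and `final_response`, the function and parameter names are assumptions, not the actual ADK API, and each invocation event's NL text is modelled as a plain string.

```python
# Sketch (assumed names): build the text placed in <final_answer> for the judge.
def build_judge_final_answer(event_texts, final_response, evaluate_full_response):
    """Concatenate NL text from invocation events with the final response."""
    if not evaluate_full_response:
        # Default behavior: only the post-tool-call text is judged.
        return final_response
    # Full-response mode: include pre-/inter-tool-call text segments too.
    parts = [text for text in event_texts if text]
    parts.append(final_response)
    return "\n\n".join(parts)
```

With the flag off, rubrics that target pre-tool-call content never see it; with the flag on, the judge receives the plan text, any intermediate text, and the explanation as one response.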

Usage

{
  "rubric_based_final_response_quality_v1": {
    "threshold": 0.8,
    "evaluate_full_response": true,
    "rubrics": [...]
  }
}

Motivation

We have a resume improvement agent that:

  1. Streams a plan to the user (text)
  2. Calls a tool (e.g. submit_improved_resume)
  3. Streams an explanation of changes (text)

From the user's perspective this is one continuous response. But the rubric evaluator only judges step 3. Rubrics checking for the plan (step 1) always fail.

With evaluate_full_response: true, the judge sees the complete agent output and can accurately evaluate all rubrics.

Backwards compatible

The flag defaults to false, so existing behavior is unchanged.

Test plan

Scenario: Agent emits text before and after a tool call within a single invocation

  • Without flag (default behavior preserved): Run rubric_based_final_response_quality_v1 without evaluate_full_response set. Confirm the judge only receives the post-tool-call text in <final_answer>. Rubrics checking for pre-tool-call content should fail. This validates no regression.
  • With evaluate_full_response: true: Run the same eval with the flag enabled. Confirm the judge receives the concatenated text from all invocation events + final_response in <final_answer>. Rubrics checking for pre-tool-call content should now pass.
  • Agent with no intermediate text: Run with the flag enabled against an agent that only emits a final response (no pre-tool-call text). Confirm behavior is identical to the default — the judge receives just the final_response text.
  • Agent with multiple intermediate text events: Run with the flag enabled against an agent that emits text → tool call → text → tool call → text. Confirm all three text segments are concatenated and sent to the judge.
  • Backwards compatibility: Confirm existing test_config.json files without evaluate_full_response continue to work unchanged (field defaults to false).
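The key scenarios above can be exercised with a minimal stand-in for the concatenation logic. This is a self-contained sketch, not the PR's actual code: the function name is hypothetical and event text is modelled as plain strings.

```python
# Hypothetical stand-in for the full-response concatenation under test.
def concat_full_response(event_texts, final_response):
    parts = [t for t in event_texts if t]
    parts.append(final_response)
    return "\n\n".join(parts)

# Multiple intermediate text events: text -> tool call -> text -> tool call -> text.
events = ["Here is my plan.", "Applying the first change."]
final = "All changes applied."
assert concat_full_response(events, final) == (
    "Here is my plan.\n\nApplying the first change.\n\nAll changes applied."
)

# No intermediate text: output is identical to the default (final_response only).
assert concat_full_response([], final) == final
```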

Pre-PR validation: we installed the library from this PR (uv pip install "google-adk[eval] @ git+https://github.com/Siddhartha90/adk-python.git@feat/evaluate-full-response"), then tested the core logic (concatenating text from invocation_events + final_response) against a production agent that emits plan text → calls a tool → emits explanation text.

Without full-response concatenation, two of our rubrics (presents_plan and warm_acknowledgment), which relied on pre-tool-call content, consistently scored 0.0. With full-response concatenation, all rubrics scored 1.0. The same logic is applied in this PR's changes to format_auto_rater_prompt.

🤖 Generated with Claude Code


google-cla bot commented Apr 8, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.


adk-bot commented Apr 8, 2026

Response from ADK Triaging Agent

Hello @Siddhartha90, thank you for creating this PR!

Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

Also, this PR is a new feature, could you please associate the github issue with this PR? If there is no existing issue, could you please create one?

In addition, could you please include a testing plan section in your PR to describe how you will test?

This information will help reviewers to review your PR more efficiently. Thanks!

adk-bot added the eval [Component] label Apr 8, 2026

adk-bot commented Apr 8, 2026

Response from ADK Triaging Agent

Hello @Siddhartha90, thank you for updating the pull request with the associated issue and a testing plan!

It looks like the Contributor License Agreement (CLA) is still not signed. Before we can merge your contribution, we'll need you to sign the CLA. You can do so at https://cla.developers.google.com/.

Thanks!

The criterion may be deserialized as BaseCriterion (which accepts extra fields via extra="allow") rather than RubricsBasedCriterion, so the isinstance check fails even when evaluate_full_response is present. Using getattr with a default handles both cases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
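The getattr fallback described in that commit message can be illustrated with a minimal stdlib stand-in; BaseCriterion here merely mimics a pydantic model with extra="allow" and is not the actual ADK class.

```python
# Minimal stand-in: a base model that accepts arbitrary extra fields,
# mimicking pydantic's extra="allow" behavior (hypothetical classes).
class BaseCriterion:
    def __init__(self, **data):
        for key, value in data.items():
            setattr(self, key, value)


class RubricsBasedCriterion(BaseCriterion):
    pass


# Deserialized as the base class: the extra field is retained as an
# attribute, but an isinstance check against the subclass fails.
criterion = BaseCriterion(threshold=0.8, evaluate_full_response=True)
assert not isinstance(criterion, RubricsBasedCriterion)

# getattr with a default reads the flag regardless of the concrete class.
assert getattr(criterion, "evaluate_full_response", False) is True

# A criterion without the field falls back to the default (False).
plain = BaseCriterion(threshold=0.8)
assert getattr(plain, "evaluate_full_response", False) is False
```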
rohityan self-assigned this Apr 9, 2026

rohityan commented Apr 9, 2026

Hi @Siddhartha90 , Thank you for your contribution! It appears you haven't yet signed the Contributor License Agreement (CLA). Please visit https://cla.developers.google.com/ to complete the signing process. Once the CLA is signed, we'll be able to proceed with the review of your PR. Thank you!

rohityan added the request clarification [Status] label Apr 9, 2026


Development

Successfully merging this pull request may close these issues.

rubric_based_final_response_quality_v1 does not evaluate text emitted before tool calls
