Problem
When an agent emits text before a tool call (e.g. presenting a plan), then calls a tool, then emits more text (e.g. an explanation), `rubric_based_final_response_quality_v1` only sends the post-tool-call text to the judge as `final_response`.
The pre-tool-call text is stored in `intermediate_data.invocation_events` (confirmed by inspecting the eval results), but `format_auto_rater_prompt` in `rubric_based_final_response_quality_v1.py` only extracts tool calls/responses from `intermediate_data` — the text content in those events is ignored.
Impact
Rubrics that check for content in the pre-tool-call text always fail, even though the agent correctly produced that content. For example, if an agent:
1. Streams a plan to the user (text)
2. Calls a tool
3. Streams an explanation of changes (text)
The judge only sees step 3. Rubrics checking for the plan (step 1) always fail.
Expected behavior
There should be an option to evaluate the agent's full response — all text emitted during the invocation, not just the text after the last tool call. This follows the pattern of `evaluate_intermediate_nl_responses` on `HallucinationsCriterion`.
Proposed solution
Add `evaluate_full_response: bool = False` to `RubricsBasedCriterion`. When true, concatenate text from `intermediate_data.invocation_events` + `final_response` before sending to the judge.
PR: #5216
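A minimal sketch of the proposed concatenation logic, using simplified stand-in types (the `Event` shape and `build_judged_response` helper below are hypothetical, not the actual ADK data structures):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    """Hypothetical stand-in for an invocation event."""
    text: str = ""                   # natural-language content, if any
    tool_call: Optional[str] = None  # name of a tool call, if any

def build_judged_response(
    events: list,
    final_response: str,
    evaluate_full_response: bool = False,
) -> str:
    """Assemble the text sent to the judge.

    With evaluate_full_response=False (current behavior), only the final
    response is judged. With True, all text emitted during the invocation
    is concatenated, in order, ahead of the final response.
    """
    if not evaluate_full_response:
        return final_response
    parts = [e.text for e in events if e.text]  # skip tool-call-only events
    parts.append(final_response)
    return "\n\n".join(parts)

events = [
    Event(text="Here is my plan: ..."),  # step 1: pre-tool-call text
    Event(tool_call="edit_file"),        # step 2: tool call, no text
]
# Current behavior drops the plan; the flag preserves it.
print(build_judged_response(events, "I made the changes.", evaluate_full_response=True))
```

With the flag off, the judge sees only `"I made the changes."`; with it on, the plan text is included, so rubrics targeting step 1 can pass.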