rubric_based_final_response_quality_v1 does not evaluate text emitted before tool calls #5217

@Siddhartha90

Description

Problem

When an agent emits text before a tool call (e.g. presenting a plan), then calls a tool, then emits more text (e.g. an explanation), rubric_based_final_response_quality_v1 only sends the post-tool-call text to the judge as final_response.

The pre-tool-call text is stored in intermediate_data.invocation_events (confirmed by inspecting the eval results), but format_auto_rater_prompt in rubric_based_final_response_quality_v1.py only extracts tool calls/responses from intermediate_data — the text content in those events is ignored.

Impact

Rubrics that check for content in the pre-tool-call text always fail, even though the agent correctly produced that content. For example, if an agent:

  1. Streams a plan to the user (text)
  2. Calls a tool
  3. Streams an explanation of changes (text)

The judge only sees step 3. Rubrics checking for the plan (step 1) always fail.

Expected behavior

There should be an option to evaluate the agent's full response — all text emitted during the invocation, not just the text after the last tool call. This follows the pattern of evaluate_intermediate_nl_responses on HallucinationsCriterion.

Proposed solution

Add evaluate_full_response: bool = False to RubricsBasedCriterion. When true, concatenate the text from intermediate_data.invocation_events with final_response before sending the result to the judge.
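A minimal sketch of the proposed concatenation. The Event/Part structures and field names here are simplified stand-ins for illustration, not the actual ADK event types:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Part:
    # Simplified stand-in: a part either carries text or represents a
    # tool call/response (text is None in that case).
    text: Optional[str] = None


@dataclass
class Event:
    parts: List[Part] = field(default_factory=list)


def build_full_response(invocation_events: List[Event], final_response: str) -> str:
    """Concatenate all text emitted during the invocation with the final response."""
    chunks = [
        part.text
        for event in invocation_events
        for part in event.parts
        if part.text  # skip tool-call/response parts, which carry no text
    ]
    chunks.append(final_response)
    return "\n\n".join(chunks)


# Example mirroring the issue: plan text -> tool call (no text) -> explanation
events = [
    Event([Part(text="Plan: first I will refactor the parser ...")]),
    Event([Part()]),  # tool call event, no text content
    Event([Part(text="I changed X because ...")]),
]
print(build_full_response(events, "Done. Summary of changes: ..."))
```

With this, a rubric checking for the plan from step 1 would see it, since the pre-tool-call text is included alongside the post-tool-call final_response.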

PR: #5216

Metadata

Labels: eval [Component] This issue is related to evaluation
