AgentEvaluator crashes with ValidationError when evaluating conversation_scenario eval cases #5214

@ASRagab

Description

Bug

AgentEvaluator.evaluate() raises a pydantic ValidationError when processing eval cases that use conversation_scenario (LLM-backed user simulation) instead of explicit conversation arrays.

Error

pydantic_core._pydantic_core.ValidationError: 1 validation error for _EvalMetricResultWithInvocation
expected_invocation
  Input should be a valid dictionary or instance of Invocation [type=model_type, input_value=None, input_type=NoneType]

at agent_evaluator.py:639

Root Cause

Type mismatch between the public and private invocation result models:

  • EvalMetricResultPerInvocation in eval_metrics.py:323 correctly declares:

    expected_invocation: Optional[Invocation] = Field(default=None, ...)
  • _EvalMetricResultWithInvocation in agent_evaluator.py:93 incorrectly declares:

    expected_invocation: Invocation  # required, non-optional

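The mismatch can be reproduced in isolation with a minimal pydantic sketch. The model names below mirror the ADK classes but are simplified stand-ins, not the real definitions:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Invocation(BaseModel):
    """Simplified stand-in for the ADK Invocation model."""

    user_content: str = ""


class PublicResult(BaseModel):
    # Mirrors EvalMetricResultPerInvocation: Optional with a None default.
    expected_invocation: Optional[Invocation] = None


class PrivateResult(BaseModel):
    # Mirrors _EvalMetricResultWithInvocation: required and non-optional.
    expected_invocation: Invocation


PublicResult(expected_invocation=None)  # accepted

try:
    PrivateResult(expected_invocation=None)  # rejected
except ValidationError as exc:
    print(exc.errors()[0]["type"])  # "model_type", as in the traceback above
```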
local_eval_service.py:285-287 intentionally sets expected_invocation=None when eval_case.conversation is None (i.e., when using conversation_scenario):

EvalMetricResultPerInvocation(
    actual_invocation=actual,
    expected_invocation=eval_case.conversation[idx]
    if eval_case.conversation
    else None,  # <-- None for conversation_scenario cases
)

This None flows through _get_eval_metric_results_with_invocation() at line 636 and into _EvalMetricResultWithInvocation() at line 639, which rejects it.

Fix

Two changes are needed in agent_evaluator.py:

1. Make the field optional (line 93):

expected_invocation: Optional[Invocation] = None

2. Guard the downstream attribute accesses (lines 439-449 in _print_details):

"prompt": AgentEvaluator._convert_content_to_text(
    per_invocation_result.expected_invocation.user_content
    if per_invocation_result.expected_invocation else None
),
"expected_response": AgentEvaluator._convert_content_to_text(
    per_invocation_result.expected_invocation.final_response
    if per_invocation_result.expected_invocation else None
),
...
"expected_tool_calls": AgentEvaluator._convert_tool_calls_to_text(
    per_invocation_result.expected_invocation.intermediate_data
    if per_invocation_result.expected_invocation else None
),

Both _convert_content_to_text and _convert_tool_calls_to_text already accept Optional parameters and handle None gracefully.

Reproduction

# evalset with conversation_scenario (no explicit conversation array)
{
  "eval_set_id": "test",
  "eval_cases": [{
    "eval_id": "scenario_1",
    "conversation_scenario": {
      "starting_prompt": "Hello",
      "conversation_plan": "Ask the agent a question and accept the answer."
    },
    "session_input": {"app_name": "my_agent", "user_id": "user1", "state": {}}
  }]
}
# test_eval.py
import pytest
from google.adk.evaluation import AgentEvaluator

@pytest.mark.asyncio
async def test_scenario():
    await AgentEvaluator.evaluate(
        "my_agent",
        "path/to/evalset.json",
        num_runs=1,
    )

The run crashes during post-processing in _get_eval_metric_results_with_invocation(), after all metrics have already been computed.
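With both changes in place, scenario cases flow through cleanly. A pure-pydantic sketch of the fixed shape (simplified stand-ins, not the real ADK classes):

```python
from typing import Optional

from pydantic import BaseModel


class Invocation(BaseModel):
    """Simplified stand-in for the ADK Invocation model."""

    user_content: str = ""


class FixedResult(BaseModel):
    # Fix 1: the field is Optional, matching the public model.
    expected_invocation: Optional[Invocation] = None


result = FixedResult(expected_invocation=None)  # no longer raises

# Fix 2: downstream access is guarded before dereferencing.
prompt = (
    result.expected_invocation.user_content
    if result.expected_invocation
    else None
)
assert prompt is None
```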

Environment

  • google-adk[eval]>=1.28.0 (also confirmed on main branch — same code)
  • Python 3.11

Metadata

Labels

eval (Component: This issue is related to evaluation)
