fix(eval): include function-call events in invocation_events when skip_summarization is set#5417
Conversation
|
Hi @Koushik-Salammagari , Thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix formatting errors by running autoformat.sh |
7e902ee to
709d857
Compare
|
Hi @rohityan — rebased onto latest |
|
Hi @Jacksunwei , can you please review this. |
…in thread pool When RunConfig.tool_thread_pool_config is enabled, _call_tool_in_thread_pool used None as a sentinel to distinguish "FunctionTool ran in thread pool" from "non-FunctionTool sync tool, needs async fallback". Because None is also a valid return value from any FunctionTool whose underlying function has no explicit return statement (implicit None), the sentinel check failed and execution fell through to tool.run_async(), invoking the function a second time silently. Replace the None sentinel with a dedicated _SYNC_TOOL_RESULT_UNSET object so that a legitimate None result from a FunctionTool is correctly returned on the first execution, without triggering the async fallback path. Fixes google#5284
…ases
Per reviewer feedback: collapse the two near-identical None tests into a
single @pytest.mark.parametrize test, and add falsy-but-not-None cases
(0, '', {}, False) to prove the sentinel is identity-based and does not
mishandle any falsy return value from a FunctionTool.
…p_summarization is set EvaluationGenerator.convert_events_to_eval_invocations builds invocation_events by excluding the final_event from intermediate steps. However, is_final_response() returns True for any event with skip_summarization=True, even when that event contains function calls (e.g. tools using skip_summarization to bypass LLM summarization). Such events were incorrectly excluded from invocation_events, causing get_all_tool_calls() to return an empty list and tool_trajectory_avg_score to always be 0.0 despite matching tool calls. Fix: keep an event in invocation_events even if it is the final_event when it contains function calls. Fixes google#5410
4d7739c to
f860460
Compare
|
Hi @rohityan @Jacksunwei — rebased onto latest |
_call_tool_in_thread_pool uses early returns for each tool category (sync FunctionTool, async tool, and the sync non-FunctionTool fallback), so a sync FunctionTool that returns None exits immediately via its own return and never falls through to the fallback path. The _SYNC_TOOL_RESULT_UNSET sentinel added to guard that case is never referenced anywhere, and its comment describes a fallthrough the code structure already prevents. Remove the dead definition and comment.
|
I have conducted a thorough, read-only analysis of PR #5417 in accordance with the I have generated a premium PR Analysis Report which is now saved and available: Key Takeaways & Recommendations:
Next Steps for You:
|
…p_summarization is set Merge #5417 ### Link to Issue or Description of Change Fixes #5410 ### Description `EvaluationGenerator.convert_events_to_eval_invocations` builds `invocation_events` (the intermediate tool-call record used by `TrajectoryEvaluator`) by collecting all qualifying events and then excluding the `final_event` from the list. The final event is identified via `event.is_final_response()`, but `is_final_response()` returns `True` for **any** event with `skip_summarization=True` — even events that contain `function_call` parts (e.g. tools that use `skip_summarization` to surface their result directly without an LLM summarization step). Those events were silently dropped from `invocation_events`, causing `get_all_tool_calls()` to return `[]` for the actual invocation. The result: `tool_trajectory_avg_score` was always **0.0** even when the tool name and args matched the expected exactly. **Root cause:** `is_final_response()` conflates "final user-visible response" with "should be excluded from tool trajectory". When `skip_summarization=True` the function-call event is both the final response *and* an intermediate step that must appear in the trajectory. **Fix:** in the list comprehension that builds `invocation_events`, keep an event even when it equals `final_event` if it contains function calls: ```python # before if e is not final_event # after if e is not final_event or e.get_function_calls() ``` ### Changes - `src/google/adk/evaluation/evaluation_generator.py`: one-line fix - `tests/unittests/evaluation/test_evaluation_generator.py`: regression test that verifies tool calls are preserved when `skip_summarization=True` - `tests/unittests/evaluation/test_trajectory_evaluator.py`: end-to-end tests for `InvocationEvents` intermediate_data format (exact match → 1.0, mismatch → 0.0) ### Testing Plan ``` pytest tests/unittests/evaluation/test_trajectory_evaluator.py \ tests/unittests/evaluation/test_evaluation_generator.py -v ======================== 47 passed in 1.23s ============================ ``` Co-authored-by: George Weale <gweale@google.com> COPYBARA_INTEGRATE_REVIEW=#5417 from Koushik-Salammagari:fix/trajectory-eval-skip-summarization ce8087f PiperOrigin-RevId: 933236523
|
Thank you @Koushik-Salammagari for your contribution! 🎉 Your changes have been successfully imported and merged via Copybara in commit 5b16a86. Closing this PR as the changes are now in the main branch. |
Link to Issue or Description of Change
Fixes #5410
Description
EvaluationGenerator.convert_events_to_eval_invocationsbuildsinvocation_events(the intermediate tool-call record used byTrajectoryEvaluator) by collecting all qualifying events and then excludingthe
final_eventfrom the list.The final event is identified via
event.is_final_response(), butis_final_response()returnsTruefor any event withskip_summarization=True— even events that containfunction_callparts(e.g. tools that use
skip_summarizationto surface their result directlywithout an LLM summarization step). Those events were silently dropped from
invocation_events, causingget_all_tool_calls()to return[]for theactual invocation. The result:
tool_trajectory_avg_scorewas always 0.0even when the tool name and args matched the expected exactly.
Root cause:
is_final_response()conflates "final user-visible response"with "should be excluded from tool trajectory". When
skip_summarization=Truethe function-call event is both the final response and an intermediate step
that must appear in the trajectory.
Fix: in the list comprehension that builds
invocation_events, keep anevent even when it equals
final_eventif it contains function calls:Changes
src/google/adk/evaluation/evaluation_generator.py: one-line fixtests/unittests/evaluation/test_evaluation_generator.py: regression test that verifies tool calls are preserved whenskip_summarization=Truetests/unittests/evaluation/test_trajectory_evaluator.py: end-to-end tests forInvocationEventsintermediate_data format (exact match → 1.0, mismatch → 0.0)Testing Plan