[test][runtime] Strengthen live-LLM e2e tests with structured tool-invocation assertions#722
Open
weiqingy wants to merge 5 commits into
Open
[test][runtime] Strengthen live-LLM e2e tests with structured tool-invocation assertions#722weiqingy wants to merge 5 commits into
weiqingy wants to merge 5 commits into
Conversation
…hat model cross-language test Replace the weak "3 in output" substring check in the chat model cross-language e2e test with a structured assertion that the agent actually invoked the add tool with the correct arguments. Add two shared helpers to e2e test_utils: - collect_tool_invocations(log_dir): reads events-*.log written by the FileEventLogger, filters _tool_request_event records, and extracts each tool call's name and arguments from the nested function field. - assert_tool_invoked(invocations, name, arguments): asserts a matching tool call, normalizing arguments given as a dict or a JSON string. Cover the helpers with fixture-based unit tests that need no live model.
…e e2e tests Strengthen two more live-LLM Flink-path tests using the structured tool-invocation helper: - yaml cross-language test: replace the "22 in output" substring check with an assertion that the Java calculateBMI tool was invoked with the weight and height parsed from the input. - react agent remote-runner test: in addition to checking the output has a result field, assert the add and multiply tools were invoked with the expected arguments, including multiply's first argument being add's threaded result. Both read the tool calls from the event log via collect_tool_invocations.
The Flink execution path persists tool events to an event log, but the pure-Python local runner does not, so tests on the local runner cannot assert which tools an agent invoked. Capture ToolRequestEvents as they pass through the local runner's event deque and expose them via get_tool_request_events() on both the runner and the local execution environment. The capture falls through to normal dispatch, so tool requests still reach their action. Add a tool_invocations_from_events adapter so in-memory events and event-log records yield the same tool-invocation shape for assertions.
Strengthen the react agent local-runner test, which previously asserted only the exact output value. It now also asserts the multiply tool was invoked with the threaded sum as its first argument, read from the local runner's captured tool events. Assert multiply rather than add because the small model frequently computes the addition itself and only calls the multiply tool, so an add assertion would be an unreliable signal. The kept value assertion is normalized to tolerate equivalent numeric representations; it remains because the final result is a separate model synthesis step that can be wrong even when the tool calls are correct.
The remote-runner test asserted add(1, 2) and multiply(3, 3) on inputs (1, 2, 3). qwen3:1.7b reliably computes 1+2 in its head and calls multiply directly without the add tool, so the add assertion fails non-deterministically. The local-runner test already documents this and asserts only multiply(sum, c) on large inputs that force genuine tool use. Mirror that approach on the remote runner: use inputs (2123, 2321, 312), assert the exact result value, and assert multiply(4444, 312) read back through the event-log capture path. The multiply first arg (4444 = 2123 + 2321) proves the addition was computed correctly and threaded into multiply, without depending on whether the add tool fired.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Linked issue: #719
Purpose of change
Follow-up to #716 (item 3). The live-LLM e2e tests assert on the agent's final output value (
react_agent_testassertedresult == 1386528) or weak substrings of it ("3" in output,"22" in output). A single check conflates three things a small non-deterministic model gets right at different rates: which tool was invoked, with what arguments, and the final synthesized answer — so a failure cannot be localized, and the substring checks barely test anything.This PR adds layered assertions on the structured
ToolRequestEvents the runtime already produces, so CI checks what the agent did (which tool, what arguments) rather than only an exact number the model often gets wrong.Sourced two ways, by execution path:
collect_tool_invocations(log_dir)reads theevents-*.logtheFileEventLoggeralready writes.from_list/to_list): the pure-PythonLocalRunnerhas no event log, so a small in-memory capture hook exposes theToolRequestEvents that already flow through its event deque.Both yield the same
{name, arguments}shape, so assertions read identically.Notes on two deliberate choices:
.resultis a separate model synthesis step, not a tool output, so it can be wrong even when the tool calls are correct — this was observed live (the model calledmultiply(4444, 312)correctly but emitted a wrong final number). The value check catches a failure the tool assertions cannot, so it remains; its residual flakiness is covered by the agent's retries and the per-test retry from [Tech Debt] Flaky live-LLM e2e/cross-language CI tests cause frequent red CI #716.multiplyinvocation, notadd. The small model frequently computes the addition itself and only calls the multiply tool, so anaddassertion would be an unreliable signal;multiply's first argument is the threaded sum, so asserting it proves the addition was computed correctly and the tool was used.Tests
collect_tool_invocations,assert_tool_invoked,tool_invocations_from_events) — no live model required.API
Adds a test-facing accessor
get_tool_request_events()onLocalRunnerandLocalExecutionEnvironment(Python runtime), returning theToolRequestEvents captured during execution. No change to the Java side; the events were already emitted there.Documentation
doc-neededdoc-not-neededdoc-included