[test][runtime] Strengthen live-LLM e2e tests with structured tool-invocation assertions by weiqingy · Pull Request #722 · apache/flink-agents

weiqingy · 2026-06-01T01:12:54Z

Linked issue: #719

Purpose of change

Follow-up to #716 (item 3). The live-LLM e2e tests assert on the agent's final output value (react_agent_test asserted result == 1386528) or weak substrings of it ("3" in output, "22" in output). A single check conflates three things a small non-deterministic model gets right at different rates: which tool was invoked, with what arguments, and the final synthesized answer — so a failure cannot be localized, and the substring checks barely test anything.

This PR adds layered assertions on the structured ToolRequestEvents the runtime already produces, so CI checks what the agent did (which tool, what arguments) rather than only an exact number the model often gets wrong.

The tool invocations are sourced in two ways, depending on the execution path:

Flink path (cross-language tests + react remote): a shared helper collect_tool_invocations(log_dir) reads the events-*.log the FileEventLogger already writes.
Local runner (react local, from_list/to_list): the pure-Python LocalRunner has no event log, so a small in-memory capture hook exposes the ToolRequestEvents that already flow through its event deque.

Both yield the same {name, arguments} shape, so assertions read identically.

Notes on two deliberate choices:

Final-value checks are kept (not relaxed). The agent's .result is a separate model synthesis step, not a tool output, so it can be wrong even when the tool calls are correct. This was observed live: the model called multiply(4444, 312) correctly but emitted a wrong final number. The value check catches a failure the tool assertions cannot, so it remains; its residual flakiness is covered by the agent's retries and the per-test retry from #716 ([Tech Debt] Flaky live-LLM e2e/cross-language CI tests cause frequent red CI).
For the react local test we assert the multiply invocation, not add. The small model frequently computes the addition itself and only calls the multiply tool, so an add assertion would be an unreliable signal; multiply's first argument is the threaded sum, so asserting it proves the addition was computed correctly and the tool was used.
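A minimal sketch of why the multiply assertion alone is a sufficient signal, using the inputs (2123, 2321, 312) quoted in this PR. The invocation list is hand-written for illustration, the argument names `a` and `b` are assumptions about the tool signature, and `find_invocation` is a hypothetical helper rather than anything in the PR.

```python
# Hand-written example in the {name, arguments} shape described above.
# Here the model skipped the add tool and called multiply directly.
# Assumption: the multiply tool's parameters are named "a" and "b".
invocations = [{"name": "multiply", "arguments": {"a": 4444, "b": 312}}]

a, b, c = 2123, 2321, 312


def find_invocation(invocations, name):
    """Return the first invocation with the given tool name, or None."""
    return next((inv for inv in invocations if inv["name"] == name), None)


multiply_call = find_invocation(invocations, "multiply")
assert multiply_call is not None, "multiply tool was never invoked"

# The first argument must equal the sum, so the addition was done correctly
# regardless of whether the add tool fired or the model added in its head.
assert multiply_call["arguments"]["a"] == a + b
assert multiply_call["arguments"]["b"] == c

# The final value is still checked separately, because the model synthesizes it.
expected_result = (a + b) * c
```

An assertion on add would fail on this trace even though the agent behaved acceptably, which is the unreliability the note describes.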

Tests

New fixture-based unit tests for the helpers (collect_tool_invocations, assert_tool_invoked, tool_invocations_from_events) — no live model required.
New unit test for the local-runner capture hook, asserting both that the event is captured and that it still dispatches to its action (so capture cannot silently break tool execution).
The strengthened e2e tests: chat_model / yaml / react (remote + local) cross-language tests.
The react local-runner test was exercised live (Ollama, qwen3:1.7b) across many runs to confirm the captured tool calls and arguments end-to-end. The Flink-path e2e tests run in CI.

API

Adds a test-facing accessor get_tool_request_events() on LocalRunner and LocalExecutionEnvironment (Python runtime), returning the ToolRequestEvents captured during execution. No change to the Java side; the events were already emitted there.

Documentation

doc-needed
doc-not-needed
doc-included

…hat model cross-language test

Replace the weak "3 in output" substring check in the chat model cross-language e2e test with a structured assertion that the agent actually invoked the add tool with the correct arguments.

Add two shared helpers to e2e test_utils:
- collect_tool_invocations(log_dir): reads events-*.log written by the FileEventLogger, filters _tool_request_event records, and extracts each tool call's name and arguments from the nested function field.
- assert_tool_invoked(invocations, name, arguments): asserts a matching tool call, normalizing arguments given as a dict or a JSON string.

Cover the helpers with fixture-based unit tests that need no live model.
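The following is a hedged sketch of how two helpers with these names might look; it is not the code in this PR. The record layout is an assumption: one JSON object per log line, with the text `_tool_request_event` somewhere in the record and each tool call carrying a nested `function` object with `name` and `arguments`. To avoid guessing the exact nesting, the sketch searches the record recursively.

```python
import glob
import json
import os


def _normalize_arguments(arguments):
    """Accept arguments as a dict or a JSON-encoded string; return a dict."""
    if isinstance(arguments, str):
        return json.loads(arguments) if arguments else {}
    return dict(arguments or {})


def _find_function_calls(node):
    """Yield every nested {"name": ..., "arguments": ...} found under a "function" key."""
    if isinstance(node, dict):
        function = node.get("function")
        if isinstance(function, dict) and "name" in function:
            yield function
        for value in node.values():
            yield from _find_function_calls(value)
    elif isinstance(node, list):
        for item in node:
            yield from _find_function_calls(item)


def collect_tool_invocations(log_dir):
    """Read events-*.log files and return [{"name": ..., "arguments": {...}}, ...]."""
    invocations = []
    for path in sorted(glob.glob(os.path.join(log_dir, "events-*.log"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                # Assumption: JSON Lines, with the event type spelled out in the record.
                if "_tool_request_event" not in line:
                    continue
                record = json.loads(line)
                for function in _find_function_calls(record):
                    invocations.append({
                        "name": function["name"],
                        "arguments": _normalize_arguments(function.get("arguments")),
                    })
    return invocations


def assert_tool_invoked(invocations, name, arguments):
    """Assert that at least one invocation matches the tool name and arguments."""
    expected = _normalize_arguments(arguments)
    matches = [
        inv for inv in invocations
        if inv["name"] == name and inv["arguments"] == expected
    ]
    assert matches, f"no call to {name}({expected}) found in {invocations}"
```

Normalizing both sides to a dict means a test can pass `{"a": 1, "b": 2}` or `'{"a": 1, "b": 2}'` and get the same result, which matters if the logger serializes arguments as a string.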

…e e2e tests

Strengthen two more live-LLM Flink-path tests using the structured tool-invocation helper:
- yaml cross-language test: replace the "22 in output" substring check with an assertion that the Java calculateBMI tool was invoked with the weight and height parsed from the input.
- react agent remote-runner test: in addition to checking the output has a result field, assert the add and multiply tools were invoked with the expected arguments, including multiply's first argument being add's threaded result.

Both read the tool calls from the event log via collect_tool_invocations.

The Flink execution path persists tool events to an event log, but the pure-Python local runner does not, so tests on the local runner cannot assert which tools an agent invoked. Capture ToolRequestEvents as they pass through the local runner's event deque and expose them via get_tool_request_events() on both the runner and the local execution environment. The capture falls through to normal dispatch, so tool requests still reach their action. Add a tool_invocations_from_events adapter so in-memory events and event-log records yield the same tool-invocation shape for assertions.
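To illustrate the capture-then-dispatch idea, here is a minimal sketch on a toy runner. `ToyLocalRunner` and this `ToolRequestEvent` are stand-ins invented for the example, and the event's `tool_calls` field and the nested `function` layout are assumptions; the real LocalRunner and event classes may differ.

```python
from collections import deque


class ToolRequestEvent:
    """Stand-in for the runtime's event class; the tool_calls field is assumed."""

    def __init__(self, tool_calls):
        self.tool_calls = tool_calls


class ToyLocalRunner:
    """Hypothetical runner that pops events from a deque and dispatches them."""

    def __init__(self, actions):
        self._actions = actions  # maps event type -> callable
        self._events = deque()
        self._tool_request_events = []

    def send(self, event):
        self._events.append(event)

    def run(self):
        while self._events:
            event = self._events.popleft()
            if isinstance(event, ToolRequestEvent):
                # Capture only. There is no early exit here, so the event
                # still falls through to its action below.
                self._tool_request_events.append(event)
            action = self._actions.get(type(event))
            if action is not None:
                action(event)

    def get_tool_request_events(self):
        """Return a copy so callers cannot mutate the captured list."""
        return list(self._tool_request_events)


def tool_invocations_from_events(events):
    """Adapt in-memory events to the same {name, arguments} shape as log records."""
    return [
        {"name": call["function"]["name"], "arguments": call["function"]["arguments"]}
        for event in events
        for call in event.tool_calls
    ]


# Usage: the action still runs, and the capture is visible afterwards.
executed = []
runner = ToyLocalRunner({ToolRequestEvent: executed.append})
runner.send(ToolRequestEvent(
    [{"function": {"name": "multiply", "arguments": {"a": 4444, "b": 312}}}]
))
runner.run()
captured = tool_invocations_from_events(runner.get_tool_request_events())
```

Checking both `executed` and `captured` mirrors the unit test described in the PR: capture must not swallow the event before its action sees it.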

Strengthen the react agent local-runner test, which previously asserted only the exact output value. It now also asserts the multiply tool was invoked with the threaded sum as its first argument, read from the local runner's captured tool events. Assert multiply rather than add because the small model frequently computes the addition itself and only calls the multiply tool, so an add assertion would be an unreliable signal. The kept value assertion is normalized to tolerate equivalent numeric representations; it remains because the final result is a separate model synthesis step that can be wrong even when the tool calls are correct.
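One possible way to tolerate equivalent numeric representations is sketched below. The commit message does not say which representations the test accepts, so the cases handled here (ints, floats, numeric strings, thousands separators) are assumptions, and `normalize_number` is a hypothetical name.

```python
from decimal import Decimal, InvalidOperation


def normalize_number(value):
    """Convert ints, floats, and numeric strings to a comparable Decimal."""
    if isinstance(value, bool):
        raise TypeError("booleans are not treated as numbers here")
    if isinstance(value, str):
        # Assumption: commas are thousands separators, not decimal marks.
        value = value.strip().replace(",", "")
    try:
        return Decimal(str(value))
    except InvalidOperation as exc:
        raise ValueError(f"not a number: {value!r}") from exc


expected = normalize_number(1386528)

# Each of these is the same value in a different representation.
assert normalize_number("1386528") == expected
assert normalize_number(1386528.0) == expected
assert normalize_number("1386528.0") == expected
assert normalize_number(" 1,386,528 ") == expected
```

Decimal compares by numeric value, so a trailing `.0` does not affect equality, while a genuinely different number still fails the check.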

The remote-runner test asserted add(1, 2) and multiply(3, 3) on inputs (1, 2, 3). qwen3:1.7b reliably computes 1+2 in its head and calls multiply directly without the add tool, so the add assertion fails non-deterministically. The local-runner test already documents this and asserts only multiply(sum, c) on large inputs that force genuine tool use. Mirror that approach on the remote runner: use inputs (2123, 2321, 312), assert the exact result value, and assert multiply(4444, 312) read back through the event-log capture path. The multiply first arg (4444 = 2123 + 2321) proves the addition was computed correctly and threaded into multiply, without depending on whether the add tool fired.

weiqingy added 4 commits May 31, 2026 16:46

github-actions bot added the doc-not-needed, fixVersion/0.3.0, and priority/major labels on Jun 1, 2026

weiqingy wants to merge 5 commits into apache:main from weiqingy:719-impl
