[test][runtime] Strengthen live-LLM e2e tests with structured tool-invocation assertions#722

Open
weiqingy wants to merge 5 commits into apache:main from weiqingy:719-impl

weiqingy (Collaborator) commented Jun 1, 2026

Linked issue: #719

Purpose of change

Follow-up to #716 (item 3). The live-LLM e2e tests assert either on the agent's final output value (react_agent_test asserted result == 1386528) or on weak substrings of it ("3" in output, "22" in output). A single check conflates three things that a small non-deterministic model gets right at different rates: which tool was invoked, with what arguments, and the final synthesized answer. As a result, a failure cannot be localized, and the substring checks pass on any output that happens to contain those digits.

This PR adds layered assertions on the structured ToolRequestEvents the runtime already produces, so CI checks what the agent did (which tool, what arguments) rather than only an exact number the model often gets wrong.

The events are sourced in two ways, depending on the execution path:

  • Flink path (cross-language tests + react remote): a shared helper collect_tool_invocations(log_dir) reads the events-*.log the FileEventLogger already writes.
  • Local runner (react local, from_list/to_list): the pure-Python LocalRunner has no event log, so a small in-memory capture hook exposes the ToolRequestEvents that already flow through its event deque.

Both yield the same {name, arguments} shape, so assertions read identically.
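As a rough illustration of the Flink-path helper, the sketch below reads `events-*.log` files and returns the shared {name, arguments} shape. The record layout is an assumption made for this example: one JSON object per line, the event type under a hypothetical `eventType` key, and tool calls under `event.tool_calls[*].function`. The real log schema written by the FileEventLogger may differ.

```python
import json
import tempfile
from pathlib import Path


def collect_tool_invocations(log_dir):
    """Return [{"name": ..., "arguments": ...}] for each tool call found in the logs."""
    invocations = []
    for log_file in sorted(Path(log_dir).glob("events-*.log")):
        for line in log_file.read_text().splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            # ASSUMPTION: the event type is stored under "eventType".
            if record.get("eventType") != "_tool_request_event":
                continue
            # ASSUMPTION: tool calls live under event.tool_calls[*].function.
            for call in record.get("event", {}).get("tool_calls", []):
                function = call.get("function", {})
                invocations.append(
                    {"name": function.get("name"), "arguments": function.get("arguments")}
                )
    return invocations


# Fixture-style usage: write a tiny log, then read it back.
with tempfile.TemporaryDirectory() as log_dir:
    records = [
        {"eventType": "_input_event", "event": {"input": "1 + 2"}},
        {
            "eventType": "_tool_request_event",
            "event": {
                "tool_calls": [
                    {"function": {"name": "add", "arguments": {"a": 1, "b": 2}}}
                ]
            },
        },
    ]
    log_text = "\n".join(json.dumps(r) for r in records) + "\n"
    Path(log_dir, "events-0.log").write_text(log_text)
    demo = collect_tool_invocations(log_dir)

print(demo)
```

Only the tool-request record survives the filter, so the non-tool event in the fixture does not appear in the result.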

Notes on two deliberate choices:

  • Final-value checks are kept (not relaxed). The agent's .result is a separate model synthesis step, not a tool output, so it can be wrong even when the tool calls are correct. This was observed live: the model called multiply(4444, 312) correctly but emitted a wrong final number. The value check catches a failure the tool assertions cannot, so it remains; its residual flakiness is covered by the agent's retries and the per-test retry from #716.
  • For the react local test we assert the multiply invocation, not add. The small model frequently computes the addition itself and only calls the multiply tool, so an add assertion would be an unreliable signal; multiply's first argument is the threaded sum, so asserting it proves the addition was computed correctly and the tool was used.
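The two choices above can be pictured as a pair of layered checks. This is a minimal sketch, not the test code in this PR: the `invocations` list is hand-written to stand in for captured tool calls, the argument names `a` and `b` are assumptions, and `find_invocation` is a hypothetical helper. The inputs (2123, 2321, 312) and the expected value 1386528 are the ones used by the react tests.

```python
def find_invocation(invocations, name):
    """Return the first captured call to `name`, or None if it was never called."""
    return next((inv for inv in invocations if inv["name"] == name), None)


a, b, c = 2123, 2321, 312

# Hand-written stand-in for captured tool calls: the model did the addition
# itself and only called multiply. Argument names "a"/"b" are assumptions.
invocations = [{"name": "multiply", "arguments": {"a": 4444, "b": 312}}]
final_result = 1386528  # stand-in for the agent's synthesized .result

# Layer 1: what the agent did. The first argument is the threaded sum, so this
# passes whether or not the add tool fired.
multiply_call = find_invocation(invocations, "multiply")
assert multiply_call is not None, "multiply tool was never invoked"
assert multiply_call["arguments"]["a"] == a + b
assert multiply_call["arguments"]["b"] == c

# Layer 2: the final synthesized value, which can be wrong even if layer 1 passes.
assert final_result == (a + b) * c
```

If layer 1 passes and layer 2 fails, the failure is localized to the model's final synthesis step rather than to tool selection or arguments.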

Tests

  • New fixture-based unit tests for the helpers (collect_tool_invocations, assert_tool_invoked, tool_invocations_from_events) — no live model required.
  • New unit test for the local-runner capture hook, asserting both that the event is captured and that it still dispatches to its action (so capture cannot silently break tool execution).
  • The strengthened e2e tests: chat_model / yaml / react (remote + local) cross-language tests.
  • The react local-runner test was exercised live (Ollama, qwen3:1.7b) across many runs to confirm the captured tool calls and arguments end-to-end. The Flink-path e2e tests run in CI.
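A fixture-based check of the assertion helper might look like the sketch below. The helper body is an assumption written for illustration (the PR's actual implementation is not shown here); it follows the described behavior of accepting arguments as either a dict or a JSON string.

```python
import json


def _normalize(arguments):
    """Parse JSON-string arguments into a dict; pass dicts through unchanged."""
    if isinstance(arguments, str):
        return json.loads(arguments)
    return arguments


def assert_tool_invoked(invocations, name, arguments):
    """Assert that some invocation matches `name` and `arguments`."""
    expected = _normalize(arguments)
    matches = [
        inv
        for inv in invocations
        if inv["name"] == name and _normalize(inv["arguments"]) == expected
    ]
    assert matches, f"no call to {name!r} with {expected!r}; saw {invocations!r}"


# Fixture: arguments recorded as a JSON string, expectation given as a dict.
fixture = [{"name": "add", "arguments": '{"a": 1, "b": 2}'}]
assert_tool_invoked(fixture, "add", {"a": 1, "b": 2})
```

Because both sides are normalized before comparison, the same assertion works for event-log records and in-memory events without a live model.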

API

Adds a test-facing accessor get_tool_request_events() on LocalRunner and LocalExecutionEnvironment (Python runtime), returning the ToolRequestEvents captured during execution. No change to the Java side; the events were already emitted there.
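The capture-then-dispatch idea behind the accessor can be sketched with a toy runner. Everything here is a stand-in written for illustration: `ToyLocalRunner`, its `send`/`run` methods, and the simplified `ToolRequestEvent` dataclass are hypothetical and are not the real LocalRunner API. Only the accessor name `get_tool_request_events()` comes from this PR.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ToolRequestEvent:
    """Simplified stand-in for the runtime's tool request event."""
    tool_calls: list


class ToyLocalRunner:
    """Hypothetical runner: drains an event deque and dispatches to actions."""

    def __init__(self, actions):
        self._actions = actions  # maps event type -> handler callable
        self._events = deque()
        self._tool_request_events = []

    def send(self, event):
        self._events.append(event)

    def run(self):
        while self._events:
            event = self._events.popleft()
            # Capture first, then fall through to normal dispatch so the
            # tool request still reaches its action.
            if isinstance(event, ToolRequestEvent):
                self._tool_request_events.append(event)
            handler = self._actions.get(type(event))
            if handler is not None:
                handler(event)

    def get_tool_request_events(self):
        # Return a copy so callers cannot mutate the runner's internal state.
        return list(self._tool_request_events)


dispatched = []
runner = ToyLocalRunner({ToolRequestEvent: dispatched.append})
request = ToolRequestEvent(
    tool_calls=[{"function": {"name": "multiply", "arguments": {"a": 4444, "b": 312}}}]
)
runner.send(request)
runner.run()

captured = runner.get_tool_request_events()
assert captured == [request]    # the event was captured
assert dispatched == [request]  # and it still reached its action
```

Checking both lists mirrors the unit test described under Tests: capture must not silently swallow the event.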

Documentation

  • [ ] doc-needed
  • [x] doc-not-needed
  • [ ] doc-included

weiqingy added 4 commits May 31, 2026 16:46
…hat model cross-language test

Replace the weak "3 in output" substring check in the chat model
cross-language e2e test with a structured assertion that the agent
actually invoked the add tool with the correct arguments.

Add two shared helpers to e2e test_utils:
- collect_tool_invocations(log_dir): reads events-*.log written by the
  FileEventLogger, filters _tool_request_event records, and extracts each
  tool call's name and arguments from the nested function field.
- assert_tool_invoked(invocations, name, arguments): asserts a matching
  tool call, normalizing arguments given as a dict or a JSON string.

Cover the helpers with fixture-based unit tests that need no live model.
…e e2e tests

Strengthen two more live-LLM Flink-path tests using the structured
tool-invocation helper:

- yaml cross-language test: replace the "22 in output" substring check
  with an assertion that the Java calculateBMI tool was invoked with the
  weight and height parsed from the input.
- react agent remote-runner test: in addition to checking the output has
  a result field, assert the add and multiply tools were invoked with the
  expected arguments, including multiply's first argument being add's
  threaded result.

Both read the tool calls from the event log via collect_tool_invocations.

The Flink execution path persists tool events to an event log, but the
pure-Python local runner does not, so tests on the local runner cannot
assert which tools an agent invoked.

Capture ToolRequestEvents as they pass through the local runner's event
deque and expose them via get_tool_request_events() on both the runner
and the local execution environment. The capture falls through to normal
dispatch, so tool requests still reach their action.

Add a tool_invocations_from_events adapter so in-memory events and
event-log records yield the same tool-invocation shape for assertions.

Strengthen the react agent local-runner test, which previously asserted
only the exact output value. It now also asserts the multiply tool was
invoked with the threaded sum as its first argument, read from the local
runner's captured tool events.

Assert multiply rather than add because the small model frequently
computes the addition itself and only calls the multiply tool, so an add
assertion would be an unreliable signal. The kept value assertion is
normalized to tolerate equivalent numeric representations; it remains
because the final result is a separate model synthesis step that can be
wrong even when the tool calls are correct.

github-actions (bot) added the doc-not-needed, fixVersion/0.3.0, and priority/major labels on Jun 1, 2026

The remote-runner test asserted add(1, 2) and multiply(3, 3) on inputs
(1, 2, 3). qwen3:1.7b reliably computes 1+2 in its head and calls multiply
directly without the add tool, so the add assertion fails non-deterministically.
The local-runner test already documents this and asserts only multiply(sum, c)
on large inputs that force genuine tool use.

Mirror that approach on the remote runner: use inputs (2123, 2321, 312), assert
the exact result value, and assert multiply(4444, 312) read back through the
event-log capture path. The multiply first arg (4444 = 2123 + 2321) proves the
addition was computed correctly and threaded into multiply, without depending on
whether the add tool fired.