Add hermetic E2E coverage for strict chat-completions tool-call validation #4537
Open
Labels
E2E, Local Models, Provider: Ollama, enhancement: inference, enhancement: testing
Summary
Follow-up from #4532 and the review at #4532 (review).
#4532 bounded the strict Chat Completions tool-call onboarding probe with `max_tokens: 256` and `stream: false`. The unit coverage asserts the payload shape, but we do not yet have a PR-safe hermetic E2E that validates the behavior through the onboarding flow.
Why this matters
The strict tool-call probe is currently reached from the Local Ollama onboarding path. A payload-shape regression, retry regression, or mock/server compatibility issue could make onboarding fail before the sandbox is created. The existing `gpu-e2e` path is the closest real-flow coverage, but it requires GPU/Ollama infrastructure and is not a cheap hermetic guard.
Proposed follow-up
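One piece of this follow-up is checking the probe request against a mock endpoint. The TypeScript sketch below shows what such a check might look like. The names `checkStrictProbeRequest` and `mockToolCallResponse` are hypothetical and not part of NemoClaw; only the `tool_choice`, `max_tokens`, `stream`, and `tool_calls` fields come from this issue, and the response shape assumes the standard OpenAI-compatible Chat Completions format.

```typescript
// Hypothetical helpers for a hermetic mock endpoint; not NemoClaw's actual test code.
interface ChatCompletionsRequest {
  tool_choice?: unknown;
  max_tokens?: unknown;
  stream?: unknown;
}

// Returns a list of human-readable problems. An empty list means the request
// matches the bounded strict-probe shape described in this issue.
function checkStrictProbeRequest(body: ChatCompletionsRequest): string[] {
  const problems: string[] = [];
  if (body.tool_choice !== "required") {
    problems.push(`tool_choice should be "required", got ${JSON.stringify(body.tool_choice)}`);
  }
  if (body.max_tokens !== 256) {
    problems.push(`max_tokens should be 256, got ${JSON.stringify(body.max_tokens)}`);
  }
  if (body.stream !== false) {
    problems.push(`stream should be false, got ${JSON.stringify(body.stream)}`);
  }
  return problems;
}

// Canned non-streaming reply a mock server could return so the probe sees a tool call.
function mockToolCallResponse(toolName: string) {
  return {
    choices: [
      {
        index: 0,
        finish_reason: "tool_calls",
        message: {
          role: "assistant",
          content: null,
          tool_calls: [
            {
              id: "call_mock_1",
              type: "function",
              function: { name: toolName, arguments: "{}" },
            },
          ],
        },
      },
    ],
  };
}
```

A mock server in the test could call `checkStrictProbeRequest` on each incoming body and fail the run if the list is non-empty, which would catch a payload-shape regression without GPU or Ollama infrastructure.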
Add a hermetic E2E under `test/e2e/` and wire it into `regression-e2e.yaml` that:
- [...] `tool_choice=required`, `max_tokens=256`, and `stream=false`,
- [...] `tool_calls`,
Thinking-model carve-out to evaluate
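To make the carve-out question concrete, here is a minimal TypeScript sketch of a strict probe payload that keeps the 256-token cap and optionally disables thinking. The function `buildStrictProbePayload`, the `supportsThinkingToggle` flag, the prompt, and the `noop` tool are all hypothetical and do not reflect the actual `probeChatCompletionsToolCalling` implementation. Only `tool_choice`, `max_tokens`, `stream`, and `chat_template_kwargs: { thinking: false }` are taken from this issue.

```typescript
// Hypothetical sketch; the real logic lives in probeChatCompletionsToolCalling.
interface StrictProbeOptions {
  model: string;
  // Assumption: the caller knows whether the model honors a thinking toggle.
  supportsThinkingToggle?: boolean;
}

function buildStrictProbePayload(opts: StrictProbeOptions): Record<string, unknown> {
  const payload: Record<string, unknown> = {
    model: opts.model,
    messages: [{ role: "user", content: "Call the noop tool." }],
    tools: [
      {
        type: "function",
        function: {
          name: "noop",
          description: "Placeholder tool used only to verify tool calling.",
          parameters: { type: "object", properties: {} },
        },
      },
    ],
    // Bounded strict-probe fields named in this issue.
    tool_choice: "required",
    max_tokens: 256,
    stream: false,
  };

  // Possible carve-out: turn off internal reasoning so a thinking model
  // does not spend the 256-token budget before emitting a tool call.
  if (opts.supportsThinkingToggle) {
    payload.chat_template_kwargs = { thinking: false };
  }
  return payload;
}
```

Without the carve-out branch, every model gets the same cap, which matches today's behavior for the Local Ollama path.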
The review also called out that `probeChatCompletionsToolCalling` now applies the same `max_tokens: 256` cap to every model in the strict path. Today that path is Local Ollama only, but if `requireChatCompletionsToolCalling` is extended to reasoning/thinking models, a model might consume the cap with internal reasoning before emitting a tool call and create a false negative.
As part of this follow-up, evaluate whether strict tool-call validation needs a thinking-model carve-out similar to the existing `getChatCompletionsProbePayload` special cases (`chat_template_kwargs: { thinking: false }` for models that support it), or at least keep the assumption documented near the strict probe payload.
References