Skip to content

Add hermetic E2E coverage for strict chat-completions tool-call validation #4537

@cv

Description

@cv

Summary

Follow-up from #4532 and the review at #4532 (review).

#4532 bounded the strict Chat Completions tool-call onboarding probe with max_tokens: 256 and stream: false. The unit coverage asserts the payload shape, but we do not yet have a PR-safe hermetic E2E that validates the behavior through the onboarding flow.

Why this matters

The strict tool-call probe is currently reached from the Local Ollama onboarding path. A payload-shape regression, retry regression, or mock/server compatibility issue could make onboarding fail before the sandbox is created. The existing gpu-e2e path is the closest real-flow coverage, but it requires GPU/Ollama infrastructure and is not a cheap hermetic guard.

Proposed follow-up

Add a hermetic E2E under test/e2e/ and wire it into regression-e2e.yaml that:

  • onboards against an OpenAI-compatible mock endpoint that requires structured Chat Completions tool calls,
  • captures the strict validation request body,
  • asserts tool_choice=required, max_tokens=256, and stream=false,
  • verifies success when the mock returns structured tool_calls,
  • verifies bounded retry/failure behavior when the first strict probe times out or returns a transient error.

Thinking-model carve-out to evaluate

The review also called out that probeChatCompletionsToolCalling now applies the same max_tokens: 256 cap to every model in the strict path. Today that path is Local Ollama only, but if requireChatCompletionsToolCalling is extended to reasoning/thinking models, a model might consume the cap with internal reasoning before emitting a tool call and create a false negative.

As part of this follow-up, evaluate whether strict tool-call validation needs a thinking-model carve-out similar to the existing getChatCompletionsProbePayload special cases (chat_template_kwargs: { thinking: false } for models that support it), or at least keep the assumption documented near the strict probe payload.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    E2EEnd-to-end testing — Brev infrastructure, test cases, nightly failures, and coverage gapsLocal ModelsRunning NemoClaw with local modelsProvider: OllamaUse this label to identify issues with the local Ollama model integration.enhancement: inferenceItems related to running (local or hosted) inference models from NemoClaw.enhancement: testingUse this label to identify requests to improve NemoClaw test coverage.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions