test: lock in real-world 3-consecutive-tool_call streaming regression #62

Draft

daniel-farina wants to merge 4 commits into youssofal:main from daniel-farina:add-hip-workload-regression-test

Conversation

@daniel-farina
Contributor

Context

We hit the consecutive-XML-tool_call streaming bug downstream while running the hippo-code agent CLI (hip) against MTPLX 0.3.3. Symptom: the model emitted three back-to-back <tool_call> blocks with the same tool name and blank lines between them, the SSE stream arrived in ~5-character chunks, and only the first call dispatched; the other two leaked into delta.content as raw <tool_call><function=name>... XML, prefixed with a truncated _call> fragment where the streaming parser ate part of an opening tag.

Your v0.3.4 (commit 9e3b929 "Fix consecutive XML tool-call streaming") resolved this in the parser - thanks for the quick fix. We pulled it into our fork and the bug is gone.

What this PR adds

A single new test in tests/test_tool_aware_stream_translator.py:

test_qwen_xml_three_tool_calls_streamed_5char_chunks_no_leak

The existing test_consecutive_qwen_xml_tool_calls_do_not_leak_as_content feeds the whole response as one chunk, which doesn't exercise the incremental parser at the chunk boundaries that were actually broken in the wild. This new test (see the sketch after this list):

  • Uses three consecutive blocks (not two), same tool name (so any per-name caching can't help)
  • Has blank lines between blocks (matches real-world model output formatting)
  • Splits into 5-char chunks (matches SSE delta sizes we observed in MTPLX 0.3.3 traffic)
  • Asserts: zero <tool_call substring leaked, zero _call> fragment leaked, all three calls dispatched with sequential indices [0, 1, 2], and arguments preserved per-block
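
For reference, a minimal sketch of that test's shape. This is hedged: ToolAwareStreamTranslator, feed(), finish(), and the event fields are placeholder names I'm assuming for illustration, not the actual API in tests/test_tool_aware_stream_translator.py.

```python
# Sketch only: the translator class, feed()/finish(), and event fields are
# assumed names, not the project's real interface.
def test_qwen_xml_three_tool_calls_streamed_5char_chunks_no_leak():
    block = (
        "<tool_call><function=read_file>"
        "<parameter=path>src/{n}.py</parameter>"
        "</function></tool_call>"
    )
    # Three consecutive blocks, same tool name, blank lines between them.
    raw = "\n\n".join(block.format(n=i) for i in range(3))

    translator = ToolAwareStreamTranslator()  # hypothetical entry point
    content, calls = "", []
    for i in range(0, len(raw), 5):  # replay the ~5-char SSE deltas seen in the wild
        for event in translator.feed(raw[i:i + 5]):
            if event.kind == "content":
                content += event.text
            else:
                calls.append(event)
    calls.extend(e for e in translator.finish() if e.kind == "tool_call")

    assert "<tool_call" not in content           # no opening-tag leak
    assert "_call>" not in content               # no truncated-tag fragment
    assert [c.index for c in calls] == [0, 1, 2]  # all three dispatched, in order
    assert [c.arguments["path"] for c in calls] == [
        "src/0.py", "src/1.py", "src/2.py",       # arguments preserved per-block
    ]
```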

Why bother

The fix in 9e3b929 is correct; this just locks in the specific chunk-boundary shape from production traffic so a future regression in _QwenXMLToolCallStreamParser is caught immediately. The new test passes against current main and would fail against the v0.3.3 parser.

Out of scope

  • No parser changes - the fix is already in
  • No new tooling, no API changes

Happy to fold this into the existing consecutive-tool_call test instead if you'd prefer one bigger test rather than a separate case. Draft status until you weigh in.

Implements a context-length-based gating ladder for MTP speculative
decoding depth, replacing the single-step 98K threshold in the Sustained
profile with a more aggressive default schedule:

  <16K     -> requested depth (default 3)
  16-24K   -> min(requested, 2)
  24-30K   -> min(requested, 1)
  >=30K    -> 0 (MTP off, request switches to AR mode)
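
For illustration, a minimal sketch of how this ladder resolution could work. Hedged: resolve_depth and its signature are hypothetical stand-ins; the real logic lives in resolve_long_context_mtp_depth in profiles.py and may be structured differently.

```python
# Hypothetical sketch of the ladder semantics described above; not the
# actual implementation of resolve_long_context_mtp_depth.
def resolve_depth(context_len: int, requested: int,
                  ladder: str = "16384:2,24576:1,30720:0") -> int:
    # Parse "threshold:depth_cap" steps and sort by threshold.
    steps = sorted(
        (int(threshold), int(cap))
        for threshold, cap in (step.split(":") for step in ladder.split(","))
    )
    depth = requested
    for threshold, cap in steps:
        if context_len >= threshold:
            depth = min(depth, cap)  # each crossed threshold tightens the cap
    return depth

assert resolve_depth(8_000, 3) == 3    # <16K: requested depth
assert resolve_depth(20_000, 3) == 2   # 16-24K: capped at 2
assert resolve_depth(26_000, 3) == 1   # 24-30K: capped at 1
assert resolve_depth(40_000, 3) == 0   # >=30K: MTP off, AR mode
```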

Rationale: MTP draft acceptance on Qwen3-family models collapses on long
contexts. vLLM tracking confirms the trend (#35387, #36872, #40756) and
MTPLX issue youssofal#49 reproduces it at depth=3 (47% -> 33% -> 27% acceptance
across consecutive long-context turns). vLLM's recommended setting for
Qwen3.6-35B is num_speculative_tokens=2, not 3.

Changes:
- profiles.py: extend `resolve_long_context_mtp_depth` to support a
  multi-step ladder via MTPLX_MTP_LONG_CONTEXT_LADDER env var
  ("16384:2,24576:1,30720:0" by default). The ladder takes precedence
  over the legacy single-step gate when configured. Set the env var to
  an empty string to fall back to the legacy gate. Sustained profile
  now ships the ladder explicitly.
- server/openai.py: apply the ladder at request time
  (_apply_long_context_ladder), so depth is capped BEFORE generation
  starts. When the ladder selects depth=0, the request switches to AR
  mode for that turn. The ladder details are surfaced in the metrics
  envelope as long_context_mtp_depth_policy (with reason like
  "ladder_step_16384" or "ladder_step_30720_mtp_off") so chat-monitor
  can see when it kicked in.
- tests/test_profiles.py: replace the old single-step test with full
  coverage of the ladder, the explicit-disable fallback, and custom
  ladder specs.
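
As a rough illustration of the surfaced policy entry: only the key name long_context_mtp_depth_policy and the "ladder_step_*" reason strings come from this change; the inner field names below are guesses, not the actual envelope schema.

```python
# Hypothetical excerpt of the metrics envelope.
metrics_envelope = {
    "long_context_mtp_depth_policy": {
        "requested_depth": 3,        # depth asked for by the profile/request
        "applied_depth": 1,          # depth after the ladder cap
        "reason": "ladder_step_24576",
    },
}
```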

Existing MTPLX_MTP_HISTORY_POLICY=committed workaround and the
auto/committed/off policy semantics are unchanged.
…v0.3.2, 4ce879e prefill chunk) on top of local depth ladder
…char chunks

Found this pattern in a downstream session running hippo-code CLI against
MTPLX 0.3.3 on 2026-05-13. The model emitted three consecutive <tool_call>
blocks of the same tool name with blank lines between, and the SSE stream
arrived in ~5-character chunks. With 0.3.3 only the first call dispatched
and the other two leaked into delta.content as raw '<tool_call><function=name>...'
XML, prefixed with a truncated '_call>' fragment where the parser ate part
of an opening tag.

v0.3.4's fix (commit 9e3b929) resolved this, but the existing coverage
(test_consecutive_qwen_xml_tool_calls_do_not_leak_as_content) feeds the
whole text in a single chunk, so it does not exercise the incremental
parser through the chunk-boundary case that was actually broken in the
wild. This new test locks in the real-world streaming shape so a future
regression in _QwenXMLToolCallStreamParser would be caught immediately.
@youssofal
Owner

Confirmed: this regression is real and the test is valuable. I reproduced and locked down the 5-character-chunk, three-consecutive Qwen XML tool-call stream shape on main, plus adjacent mixed-tool, split-marker, long OpenCode-style argument, and server streaming cases.

I did not merge this PR branch directly because it currently contains unrelated MTP depth-ladder commits ahead of the useful regression-test commit. The clean hardening commit on main is 0c2cde5 and carries co-author credit for Daniel on the regression fixture.

If you want this PR merged independently, please rebase it so it contains only the tool-call regression test commit. Otherwise the main commit supersedes the branch for this issue. Thanks for catching the real stream shape.
