test: lock in real-world 3-consecutive-tool_call streaming regression#62
Draft
daniel-farina wants to merge 4 commits into
Conversation
Implements a context-length-based gating ladder for MTP speculative decoding depth, replacing the single-step 98K threshold in the Sustained profile with a more aggressive default schedule:

- <16K -> requested depth (default 3)
- 16-24K -> min(requested, 2)
- 24-30K -> min(requested, 1)
- >=30K -> 0 (MTP off, request switches to AR mode)

Rationale: MTP draft acceptance on Qwen3-family models collapses on long contexts. vLLM tracking confirms the trend (#35387, #36872, #40756), and MTPLX issue youssofal#49 reproduces it at depth=3 (47% -> 33% -> 27% acceptance across consecutive long-context turns). vLLM's recommended setting for Qwen3.6-35B is num_speculative_tokens=2, not 3.

Changes:

- profiles.py: extend `resolve_long_context_mtp_depth` to support a multi-step ladder via the MTPLX_MTP_LONG_CONTEXT_LADDER env var ("16384:2,24576:1,30720:0" by default). The ladder takes precedence over the legacy single-step gate when configured; set the env var to an empty string to fall back to the legacy gate. The Sustained profile now ships the ladder explicitly.
- server/openai.py: apply the ladder at request time (_apply_long_context_ladder), so depth is capped before generation starts. When the ladder selects depth=0, the request switches to AR mode for that turn. Ladder decisions are surfaced in the metrics envelope as long_context_mtp_depth_policy (with a reason such as "ladder_step_16384" or "ladder_step_30720_mtp_off") so chat-monitor can see when it kicked in.
- tests/test_profiles.py: replace the old single-step test with full coverage of the ladder, the explicit-disable fallback, and custom ladder specs.

The existing MTPLX_MTP_HISTORY_POLICY=committed workaround and the auto/committed/off policy semantics are unchanged.
…v0.3.2, 4ce879e prefill chunk) on top of local depth ladder
…char chunks

Found this pattern in a downstream session running the hippo-code CLI against MTPLX 0.3.3 on 2026-05-13. The model emitted three consecutive <tool_call> blocks of the same tool name with blank lines between them, and the SSE stream arrived in ~5-character chunks. With 0.3.3 only the first call dispatched; the other two leaked into delta.content as raw '<tool_call><function=name>...' XML, prefixed with a truncated '_call>' fragment where the parser ate part of an opening tag.

v0.3.4's fix (commit 9e3b929) resolved this, but the existing coverage (test_consecutive_qwen_xml_tool_calls_do_not_leak_as_content) feeds the whole text in a single chunk, so it does not exercise the incremental parser through the chunk-boundary case that was actually broken in the wild. This new test locks in the real-world streaming shape so a future regression in _QwenXMLToolCallStreamParser would be caught immediately.
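The stream shape described above can be captured as a small fixture plus a chunker. The 5-character chunking and the three-block layout mirror this report; the tool name and arguments below are illustrative, not the actual downstream payload:

```python
def chunked(text: str, size: int = 5):
    """Yield `text` in fixed-size chunks, mimicking the ~5-character SSE
    deltas from the downstream session (so tag markers split mid-chunk)."""
    for i in range(0, len(text), size):
        yield text[i:i + size]


# Three consecutive <tool_call> blocks of the same tool, blank lines between
# them, in the <function=...> XML shape from the report. Tool name and
# parameter values are illustrative placeholders.
RESPONSE = "\n\n".join(
    "<tool_call><function=read_file>"
    f"<parameter=path>file_{i}.py</parameter>"
    "</function></tool_call>"
    for i in range(3)
)
```

Replaying `RESPONSE` through `chunked` puts the opening `<tool_call>` marker across a chunk boundary on the very first block (`"<tool"` then `"_call"`), which is exactly the truncated `_call>`-style fragment the 0.3.3 parser mishandled.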
Owner
Confirmed: this regression test is real and valuable. I reproduced/locked down the 5-character chunk, 3-consecutive Qwen XML tool-call stream shape on main, plus adjacent mixed-tool, split-marker, long OpenCode-style argument, and server streaming cases.

I did not merge this PR branch directly because it currently contains unrelated MTP depth-ladder commits before the useful regression-test commit. The clean hardening commit on main is 0c2cde5 and includes Daniel co-author credit for the regression fixture. If you want this PR merged independently, please rebase it so it contains only the tool-call regression test commit. Otherwise the main commit supersedes the branch for this issue. Thanks for catching the real stream shape.
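The split-marker cases mentioned in this comment can be generated systematically by cutting the stream at every position inside a marker. `marker_splits` below is a generic sketch of that idea, not code from commit 0c2cde5:

```python
def marker_splits(text: str, marker: str):
    """Yield two-chunk splits of `text` that cut inside every occurrence of
    `marker`, so an incremental parser gets exercised at each boundary where
    the marker could straddle a chunk edge."""
    start = text.find(marker)
    while start != -1:
        # Cut points strictly inside the marker: start+1 .. start+len-1.
        for cut in range(start + 1, start + len(marker)):
            yield text[:cut], text[cut:]
        start = text.find(marker, start + 1)
```

Feeding each `(head, tail)` pair to the stream parser as two chunks covers all the ways `<tool_call>` can be torn across a chunk boundary, including the `_call>`-fragment case from the original report.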
### Context

We hit the consecutive-XML-tool_call streaming bug downstream while running the hippo-code agent CLI (`hip`) against `mtplx` 0.3.3. Symptom: the model emitted three back-to-back `<tool_call>` blocks of the same tool name with blank lines between them, the SSE stream arrived in ~5-character chunks, and only the first call dispatched; the other two leaked into `delta.content` as raw `<tool_call><function=name>...` XML, prefixed with a truncated `_call>` fragment where the streaming parser ate part of an opening tag.

Your `v0.3.4` (commit 9e3b929 "Fix consecutive XML tool-call streaming") resolved this in the parser; thanks for the quick fix. We pulled it into our fork and the bug is gone.

### What this PR adds
A single new test in `tests/test_tool_aware_stream_translator.py`.

The existing `test_consecutive_qwen_xml_tool_calls_do_not_leak_as_content` feeds the whole response as one chunk, which doesn't exercise the incremental parser at the chunk boundaries that were actually broken in the wild. This new test replays the three-block response in ~5-character chunks, matching the production stream shape, and asserts: zero `<tool_call` substring leaked, zero `_call>` fragment leaked, all three calls dispatched with sequential indices `[0, 1, 2]`, and arguments preserved per-block.

### Why bother
The fix in 9e3b929 is correct; this just locks in the specific chunk-boundary shape from production traffic so a future regression in `_QwenXMLToolCallStreamParser` gets caught immediately. The new test passes against current `main`. It would fail against the v0.3.3 parser.

### Out of scope
Happy to fold this into the existing consecutive-tool_call test instead if you'd prefer one bigger test rather than a separate case. Draft status until you weigh in.