test: lock in real-world 3-consecutive-tool_call streaming regression#62
Draft
daniel-farina wants to merge 4 commits into
Conversation
Implements a context-length-based gating ladder for MTP speculative decoding depth, replacing the single-step 98K threshold in the Sustained profile with a more aggressive default schedule:

- <16K -> requested depth (default 3)
- 16-24K -> min(requested, 2)
- 24-30K -> min(requested, 1)
- >=30K -> 0 (MTP off, request switches to AR mode)

Rationale: MTP draft acceptance on Qwen3-family models collapses on long contexts. vLLM tracking confirms the trend (#35387, #36872, #40756), and MTPLX issue youssofal#49 reproduces it at depth=3 (47% -> 33% -> 27% acceptance across consecutive long-context turns). vLLM's recommended setting for Qwen3.6-35B is num_speculative_tokens=2, not 3.

Changes:

- profiles.py: extend `resolve_long_context_mtp_depth` to support a multi-step ladder via the MTPLX_MTP_LONG_CONTEXT_LADDER env var ("16384:2,24576:1,30720:0" by default). The ladder takes precedence over the legacy single-step gate when configured; set the env var to an empty string to fall back to the legacy gate. The Sustained profile now ships the ladder explicitly.
- server/openai.py: apply the ladder at request time (_apply_long_context_ladder), so depth is capped before generation starts. When the ladder selects depth=0, the request switches to AR mode for that turn. Ladder decisions are surfaced in the metrics envelope as long_context_mtp_depth_policy (with a reason such as "ladder_step_16384" or "ladder_step_30720_mtp_off") so chat-monitor can see when it kicked in.
- tests/test_profiles.py: replace the old single-step test with full coverage of the ladder, the explicit-disable fallback, and custom ladder specs.

The existing MTPLX_MTP_HISTORY_POLICY=committed workaround and the auto/committed/off policy semantics are unchanged.
…v0.3.2, 4ce879e prefill chunk) on top of local depth ladder
…char chunks

Found this pattern in a downstream session running the hippo-code CLI against MTPLX 0.3.3 on 2026-05-13. The model emitted three consecutive <tool_call> blocks of the same tool name with blank lines between them, and the SSE stream arrived in ~5-character chunks. With 0.3.3 only the first call dispatched; the other two leaked into delta.content as raw '<tool_call><function=name>...' XML, prefixed with a truncated '_call>' fragment where the parser ate part of an opening tag.

v0.3.4's fix (commit 9e3b929) resolved this, but the existing coverage (test_consecutive_qwen_xml_tool_calls_do_not_leak_as_content) feeds the whole text in a single chunk, so it does not exercise the incremental parser through the chunk-boundary case that was actually broken in the wild. This new test locks in the real-world streaming shape so a future regression in _QwenXMLToolCallStreamParser would be caught immediately.
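The stream shape described above can be captured as a small fixture plus a chunker. The 5-character chunking and the three-block layout mirror this report; the tool name and arguments below are illustrative, not the actual downstream payload:

```python
def chunked(text: str, size: int = 5):
    """Yield `text` in fixed-size chunks, mimicking the ~5-character SSE
    deltas from the downstream session (so tag markers split mid-chunk)."""
    for i in range(0, len(text), size):
        yield text[i:i + size]


# Three consecutive <tool_call> blocks of the same tool, blank lines between
# them, in the <function=...> XML shape from the report. Tool name and
# parameter values are illustrative placeholders.
RESPONSE = "\n\n".join(
    "<tool_call><function=read_file>"
    f"<parameter=path>file_{i}.py</parameter>"
    "</function></tool_call>"
    for i in range(3)
)
```

Replaying `RESPONSE` through `chunked` puts the opening `<tool_call>` marker across a chunk boundary on the very first block (`"<tool"` then `"_call"`), which is exactly the truncated `_call>`-style fragment the 0.3.3 parser mishandled.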
Owner
Confirmed: this regression test is real and valuable. I reproduced/locked down the 5-character chunk, 3-consecutive Qwen XML tool-call stream shape on main, plus adjacent mixed-tool, split-marker, long OpenCode-style argument, and server streaming cases.

I did not merge this PR branch directly because it currently contains unrelated MTP depth-ladder commits before the useful regression-test commit. The clean hardening commit on main is 0c2cde5 and includes Daniel co-author credit for the regression fixture. If you want this PR merged independently, please rebase it so it contains only the tool-call regression test commit. Otherwise the main commit supersedes the branch for this issue. Thanks for catching the real stream shape.
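The split-marker cases mentioned in this comment can be generated systematically by cutting the stream at every position inside a marker. `marker_splits` below is a generic sketch of that idea, not code from commit 0c2cde5:

```python
def marker_splits(text: str, marker: str):
    """Yield two-chunk splits of `text` that cut inside every occurrence of
    `marker`, so an incremental parser gets exercised at each boundary where
    the marker could straddle a chunk edge."""
    start = text.find(marker)
    while start != -1:
        # Cut points strictly inside the marker: start+1 .. start+len-1.
        for cut in range(start + 1, start + len(marker)):
            yield text[:cut], text[cut:]
        start = text.find(marker, start + 1)
```

Feeding each `(head, tail)` pair to the stream parser as two chunks covers all the ways `<tool_call>` can be torn across a chunk boundary, including the `_call>`-fragment case from the original report.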
### Context

We hit the consecutive-XML-tool_call streaming bug downstream while running the hippo-code agent CLI (`hip`) against `mtplx` 0.3.3. Symptom: the model emitted three back-to-back `<tool_call>` blocks of the same tool name with blank lines between them, the SSE stream arrived in ~5-character chunks, and only the first call dispatched; the other two leaked into `delta.content` as raw `<tool_call><function=name>...` XML, prefixed with a truncated `_call>` fragment where the streaming parser ate part of an opening tag.

Your `v0.3.4` (commit 9e3b929 "Fix consecutive XML tool-call streaming") resolved this in the parser; thanks for the quick fix. We pulled it into our fork and the bug is gone.

### What this PR adds
A single new test in `tests/test_tool_aware_stream_translator.py`.

The existing `test_consecutive_qwen_xml_tool_calls_do_not_leak_as_content` feeds the whole response as one chunk, which doesn't exercise the incremental parser at the chunk boundaries that were actually broken in the wild. This new test replays the three-block response in ~5-character chunks, matching the production stream shape, and asserts: zero `<tool_call` substring leaked, zero `_call>` fragment leaked, all three calls dispatched with sequential indices `[0, 1, 2]`, and arguments preserved per-block.

### Why bother
The fix in 9e3b929 is correct; this just locks in the specific chunk-boundary shape from production traffic so a future regression in `_QwenXMLToolCallStreamParser` gets caught immediately. The new test passes against current `main`. It would fail against the v0.3.3 parser.

### Out of scope
Happy to fold this into the existing consecutive-tool_call test instead if you'd prefer one bigger test rather than a separate case. Draft status until you weigh in.