Add K2V3TITOTokenizer for K2V3 and the TITO test #18
Open
ZhentingWang wants to merge 1 commit into prod from
Conversation
The K2V3 chat template emits `<|im_end|>\n` after every message (the jinja
block whitespace between `{{- '<|im_end|>' }}` and the next block is
preserved under the default `trim_blocks=False`). The model autoregressively
stops at `<|im_end|>` without producing the trailing `\n`. Without a
per-model fix, the rollout buffer ends with `<|im_end|>` while the canonical
chat-template render has `<|im_end|>\n` — diverging by exactly one `\n`
token, which trips `update_pretokenized_state`'s prefix check.
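The one-token divergence can be made concrete with a minimal sketch; the token ids below are purely illustrative (the real ids come from the K2V3 tokenizer):

```python
# Hypothetical token ids for illustration only.
IM_END, NEWLINE = 7, 11

# Canonical chat-template render of the messages: ... <|im_end|> '\n'
canonical = [1, 2, 3, IM_END, NEWLINE]
# Rollout buffer: the model stops generating at <|im_end|>, no trailing '\n'.
rollout = [1, 2, 3, IM_END]

def is_prefix(prefix: list[int], full: list[int]) -> bool:
    """True iff `prefix` is a (possibly equal-length) prefix of `full`."""
    return full[:len(prefix)] == prefix

# The rollout buffer is a prefix of the canonical render...
assert is_prefix(rollout, canonical)
# ...but the canonical render (ending in '\n') is NOT a prefix of the
# rollout buffer — the shape of mismatch the prefix check rejects.
assert not is_prefix(canonical, rollout)
```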
`K2V3TITOTokenizer.merge_tokens` mirrors `Qwen3TITOTokenizer`'s strategy:
insert `\n` when `prefix[-1] == im_end_id`. It is a standalone subclass
(rather than an alias of Qwen3) so future K2V3-specific divergences have a
clean hook.
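A minimal sketch of that boundary fix, with hypothetical names and token ids (the real subclass resolves `im_end_id` and the `\n` id from the tokenizer):

```python
# Illustrative ids; the real implementation looks these up at init time.
IM_END_ID = 7    # id of <|im_end|>
NEWLINE_ID = 11  # id of '\n'

def merge_tokens(prefix: list[int], delta: list[int]) -> list[int]:
    """Append env-delta tokens to the rollout buffer.

    The model stops at <|im_end|> without emitting the '\n' the chat
    template would render, so re-insert it at the seam before appending
    the next message's tokens.
    """
    if prefix and prefix[-1] == IM_END_ID:
        prefix = prefix + [NEWLINE_ID]
    return prefix + delta

merged = merge_tokens([5, 6, IM_END_ID], [9, 10])
assert merged == [5, 6, IM_END_ID, NEWLINE_ID, 9, 10]
```

If the prefix does not end at `<|im_end|>` (e.g. a clean `apply_chat_template` render already carrying the `\n`), the seam is left untouched.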
`tests/fast/utils/chat_template_utils/test_tito_k2v3.py` — K2V3-focused
contract test suite (54 parametrized cases) verifying the boundary fix
is alive and effective, that the K2V3 chat template round-trips cleanly
through the real SGLang parsers, and that the full rollout flow stays
consistent end-to-end.
Coverage:
- 8 trajectory shapes (single_tool / multi_turn /
multi_tool_single_turn / multi_tool_multi_turn, each with native or
synthesized thinking variant). Each runs a 2-phase check: finalized
buffer vs canonical, plus a synthetic env follow-up that forces the
boundary fix path even on single-turn shapes (defeating
trim_trailing_ids that would otherwise hide missing-fix bugs).
- 8 trajectories x 4 env append shapes = 32 cross-cases driving
realistic `<|im_end|>`-ending buffers through `prepare_pretokenized` ->
`merge_tokens` with various env-delta patterns (single tool / user /
system / alternating mixed). Each case also asserts the env content
markers appear in the incremental tokens in order.
- 8 trajectories run through real SGLang ReasoningParser (deepseek-r1)
+ FunctionCallParser (hermes), verifying chat template <-> parser
structural round-trip on each shape's first assistant message.
- 4 boss-level integration flows: drive every assistant turn of a
multi-turn trajectory through real parsers (so session.messages
accumulates parser-derived parsed_msg across turns), then append a
complex env follow-up. Catches integration regressions that only
surface in the full flow.
- Sanity: production prefix-check defense fires on intentional
violation; K2V3 enum value is correctly wired in
_TOKENIZER_REGISTRY.
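The 8 × 4 = 32 cross-product could be built as below; the shape and append names are taken from this description, but the exact fixture names in the suite are an assumption:

```python
import itertools

# 4 base trajectory shapes, each in a plain and a thinking variant = 8.
TRAJECTORY_SHAPES = [
    "single_tool", "multi_turn",
    "multi_tool_single_turn", "multi_tool_multi_turn",
]
THINKING_VARIANTS = ["plain", "thinking"]  # native vs synthesized thinking

# 4 env-delta append shapes.
ENV_APPENDS = [
    "tool_followup", "user_followup",
    "system_inject", "alternating_user_tool_followup",
]

trajectories = [
    f"{shape}_{variant}"
    for shape, variant in itertools.product(TRAJECTORY_SHAPES, THINKING_VARIANTS)
]
cross_cases = list(itertools.product(trajectories, ENV_APPENDS))

assert len(trajectories) == 8
assert len(cross_cases) == 32  # 8 trajectories x 4 env append shapes
```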
Reverse-validated: with the `\n` boundary insertion commented out, all
44 trajectory + append + boss cases fail; the 8 parser cases (which
don't drive merge_tokens) and the 2 fix-independent sanity cases still
pass. Confirms the suite is genuinely exercising the boundary fix.
Summary
`K2V3TITOTokenizer` (new `TITOTokenizerType.K2V3` enum value + subclass + registry entry) for the K2V3 family. Same `<|im_end|>\n` boundary fix as `Qwen3TITOTokenizer` — its chat template emits `<|im_end|>\n` after every message via jinja block whitespace, but the model autoregressively stops at `<|im_end|>` without the trailing `\n`. Without the fix, the rollout buffer's last token is `<|im_end|>` while the canonical chat-template render has `<|im_end|>\n`, diverging by exactly one token, which trips `update_pretokenized_state`'s prefix check.

`tests/fast/utils/chat_template_utils/test_tito_k2v3.py` — K2V3-focused contract test suite (54 parametrized cases) verifying the boundary fix is alive and effective, that the K2V3 chat template round-trips cleanly through the real SGLang parsers, and that the full rollout flow stays consistent end-to-end.

What the test suite covers
- `test_buffer_matches_canonical_under_realistic_rollout` — finalized buffer vs canonical render per trajectory shape, plus a synthetic env follow-up forcing the boundary fix path (defeating `trim_trailing_ids` that would otherwise hide missing-fix bugs)
- `test_append_via_realistic_buffer` — `merge_tokens` correctness across the (buffer end-state × env append shape) cross-product, plus a required-content-in-order check on the incremental segment
- `test_chat_template_round_trip_through_real_sglang_parsers` — `deepseek-r1` ReasoningParser + `hermes` FunctionCallParser ↔ chat template structural round-trip per trajectory's first assistant message — independent of the boundary fix; catches parser-related regressions
- `test_end_to_end_realistic_rollout_with_real_parsers` (boss) — drives every assistant turn through real parsers (so `session.messages` accumulates parser-derived `parsed_msg` across turns), then appends a complex env follow-up that triggers `merge_tokens` over the parser-tainted history. Catches integration regressions only visible in the full flow
- `test_production_prefix_check_raises_on_intentional_violation` — `update_pretokenized_state`'s defense in `linear_trajectory.py:103-137` is alive
- `test_k2v3_subclass_is_wired` — `_TOKENIZER_REGISTRY[TITOTokenizerType.K2V3]` correctly returns the K2V3 subclass

The trajectory + append + boss tests all use `update_pretokenized_state` to drive the session (so the buffer ends at `<|im_end|>` — the realistic autoregressive-stop shape), unlike `test_tito_tokenizer_model_matrix.py`, whose `pretokenized` comes from a clean `apply_chat_template` render with the trailing `\n` already in it. That distinction matters: the model_matrix test cannot exercise the missing-`\n` boundary fix because its `prefix[-1]` is never `<|im_end|>`. This file closes that gap for K2V3.

The 4 boss flows pick deliberately complex shapes:
- `multi_turn_thinking + tool_followup`
- `multi_tool_multi_turn_thinking + alternating_user_tool_followup`
- `multi_tool_single_turn_thinking + system_inject` (parallel tools + reasoning + system injection)
- `multi_tool_multi_turn_thinking + complex_env_chain` (tool/user/tool/system/tool chain)

How this complements production runtime safeguards
Production already has TITO defenses that fire on real rollouts:
1. Append-only / prefix check at every assistant turn —
`linear_trajectory.py:103`. `update_pretokenized_state()` requires the
stored `token_ids` (= "what the trainer sees so far") to be a strict prefix of
`prompt_token_ids + completion_token_ids` (= "what the rollout buffer carries forward"), modulo
`max_trim_tokens`. Violation raises
`TokenizationError("pretokenized prefix mismatch: ...")` at line 133. This is the runtime guarantee that the rollout-side and
trainer-side token streams stay bit-identical across turns.
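The check described above can be sketched as follows; this is an illustrative re-implementation under the stated contract, not the `linear_trajectory.py` source:

```python
class TokenizationError(Exception):
    pass

def check_prefix(stored: list[int], carried: list[int],
                 max_trim_tokens: int = 0) -> None:
    """Require `stored` (what the trainer has seen so far) to be a prefix
    of `carried` (prompt_token_ids + completion_token_ids, what the rollout
    buffer carries forward), allowing up to `max_trim_tokens` trailing
    stored tokens to be trimmed before comparing."""
    for trim in range(max_trim_tokens + 1):
        candidate = stored[:len(stored) - trim] if trim else stored
        if carried[:len(candidate)] == candidate:
            return
    raise TokenizationError("pretokenized prefix mismatch: ...")

check_prefix([1, 2, 3], [1, 2, 3, 4, 5])                # ok: strict prefix
check_prefix([1, 2, 9], [1, 2, 3], max_trim_tokens=1)   # ok after trimming 1
try:
    check_prefix([1, 9], [1, 2, 3])                      # one-token divergence
except TokenizationError as e:
    print(f"raised: {e}")
```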
2. Per-rollout chat-template round-trip metric —
`linear_trajectory.py:357`. `compute_session_mismatch()` compares `session.token_ids` (rollout buffer) against the canonical `apply_chat_template(session.messages)` render, surfaced as `tito_session_mismatch_rate` in `ray/rollout.py:1238`. The strict subset
(`special_token_count`, `special_token_type`, `non_assistant_text`) is asserted to be 0 under `args.ci_test` at `ray/rollout.py:1243-1252`.

The new test in this PR is the offline / pre-PR-merge counterpart of those runtime defenses:
- The `update_pretokenized_state` prefix check (1) catches a TITO break the moment it happens in a real run — but only after the bug has already shipped. The trajectory + append + boss cases here exercise that same code path on synthetic but production-realistic shapes, so per-model boundary fixes (like K2V3's `<|im_end|>\n` insertion) can be verified BEFORE the model is wired into a training run.
- The `compute_session_mismatch` comparator (2) is what every test case here invokes directly via `tito_tok.create_comparator()`, applying the same severity classification (`special_token_*` / `non_assistant_text` strict; `assistant_text` tolerated as BPE-merge / parser-whitespace noise). So this PR's test enforces the CI strict assertion at `ray/rollout.py:1243-1252` as a precondition before training, rather than only catching it via the `--ci_test` runtime gate.

Verification
Both runs against the K2V3 checkpoint inside the agentic-rl runtime container:
With the boundary fix in place (the committed state):
With the boundary fix DISABLED (commented out the `\n` insertion in `K2V3TITOTokenizer.merge_tokens` to verify the suite catches its absence):

Per-test-family breakdown when fix is disabled:
- `test_buffer_matches_canonical_under_realistic_rollout` × 8 — fail
- `test_append_via_realistic_buffer` × 32 — fail
- `test_end_to_end_realistic_rollout_with_real_parsers` × 4 (boss) — fail
- `test_chat_template_round_trip_through_real_sglang_parsers` × 8 — pass
- `test_production_prefix_check_raises_on_intentional_violation` × 1 — pass
- `test_k2v3_subclass_is_wired` × 1 — pass

Notes:
- The 44 failing cases are exactly those whose buffers end at `<|im_end|>` without the trailing `\n` (the boundary fix path is `prepare_pretokenized` → `merge_tokens`, which only fires when there is prior session state — these tests drive that path).
- The parser cases never hit the fix (no `prepare_pretokenized` → `merge_tokens` call). They test a different invariant (parser ↔ chat template structural round-trip) and would catch parser regressions independently.

Together this confirms the suite is genuinely exercising the boundary fix on the cases that should depend on it, and the parser/sanity tests are scoped to other invariants as designed.
How to run
The test needs the agentic-rl runtime container (which bundles the LLM360 SGLang fork with `hermes` and `deepseek-r1` parsers registered) and access to the K2V3 checkpoint on `/mnt/weka`. End-to-end on M2 SLURM, no GPU required (this is a fast pytest, ~2s):

If you're already inside the container (or any host env with miles + transformers + sglang + the K2V3 checkpoint accessible):
Env overrides if needed:
- `TITO_TEST_MODEL_PATH_K2V3` — model checkpoint path
- `TITO_TEST_TOOL_PARSER_K2V3` — defaults to `hermes`
- `TITO_TEST_REASONING_PARSER_K2V3` — defaults to `deepseek-r1`
- `TITO_TEST_REASONING_EFFORT_K2V3` — defaults to `high`

Defaults match the K2V3 production config in RL360-tooluse-harbor.
Reviewers
@LLM360/RL360-Maintainers