
Add K2V3TITOTokenizer for K2V3 and the TITO test #18

Open
ZhentingWang wants to merge 1 commit into prod from add-k2v3-tito-tokenizer-and-tito-test

Conversation


@ZhentingWang ZhentingWang commented May 7, 2026

Summary

  • Adds K2V3TITOTokenizer (new TITOTokenizerType.K2V3 enum value + subclass + registry entry) for the K2V3 family. Same <|im_end|>\n boundary fix as Qwen3TITOTokenizer — its chat template emits <|im_end|>\n after every message via jinja block whitespace, but the model autoregressively stops at <|im_end|> without the trailing \n. Without the fix, the rollout buffer's last token is <|im_end|> while the canonical chat-template render has <|im_end|>\n, diverging by exactly one token which trips update_pretokenized_state's prefix check.
  • Adds tests/fast/utils/chat_template_utils/test_tito_k2v3.py — K2V3-focused contract test suite (54 parametrized cases) verifying the boundary fix is alive and effective, that the K2V3 chat template round-trips cleanly through the real SGLang parsers, and that the full rollout flow stays consistent end-to-end.
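The one-token divergence and the fix can be sketched in a few lines — a minimal illustration with hypothetical helper and token-id names, not the real `K2V3TITOTokenizer.merge_tokens` implementation:

```python
# Minimal sketch of the <|im_end|>\n boundary fix. Function and id names
# are illustrative; the real logic lives in K2V3TITOTokenizer.merge_tokens.
def merge_with_boundary_fix(prefix, delta, im_end_id, newline_id):
    r"""Re-insert the trailing newline the model never emits.

    The chat template renders <|im_end|>\n after every message, but the
    model autoregressively stops at <|im_end|>. If the rollout buffer
    ends on <|im_end|>, append the \n token before merging so the buffer
    matches the canonical chat-template render.
    """
    if prefix and prefix[-1] == im_end_id:
        prefix = prefix + [newline_id]
    return prefix + delta

# Buffer ends at <|im_end|> (id 5): the missing \n token (id 9) is restored.
print(merge_with_boundary_fix([1, 5], [7, 8], im_end_id=5, newline_id=9))  # -> [1, 5, 9, 7, 8]
# Buffer ends elsewhere: plain concatenation, no insertion.
print(merge_with_boundary_fix([1, 2], [7, 8], im_end_id=5, newline_id=9))  # -> [1, 2, 7, 8]
```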

What the test suite covers

| Test | Cases | What it catches |
| --- | --- | --- |
| test_buffer_matches_canonical_under_realistic_rollout | 8 | trajectories (4 native + thinking variants). Per-model boundary fix correctness on realistic buffer end-states; a phase-2 follow-up forces the boundary fix path even on single-turn shapes (defeating trim_trailing_ids that would otherwise hide missing-fix bugs) |
| test_append_via_realistic_buffer | 8 trajectories × 4 env shapes = 32 | merge_tokens correctness across the (buffer end-state × env append shape) cross-product, plus a required-content-in-order check on the incremental segment |
| test_chat_template_round_trip_through_real_sglang_parsers | 8 | trajectories. Real deepseek-r1 ReasoningParser + hermes FunctionCallParser ↔ chat-template structural round-trip on each trajectory's first assistant message — independent of the boundary fix; catches parser-related regressions |
| test_end_to_end_realistic_rollout_with_real_parsers (boss) | 4 | complex flows. Full integration: drives every assistant turn of a multi-turn trajectory through real parsers (so session.messages accumulates parser-derived parsed_msg across turns), then appends a complex env follow-up that triggers merge_tokens over the parser-tainted history. Catches integration regressions only visible in the full flow |
| test_production_prefix_check_raises_on_intentional_violation | 1 | update_pretokenized_state's defense in linear_trajectory.py:103-137 is alive |
| test_k2v3_subclass_is_wired | 1 | _TOKENIZER_REGISTRY[TITOTokenizerType.K2V3] correctly returns the K2V3 subclass |
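The 8 × 4 = 32 cross-product driving the append test can be reconstructed as a quick sketch — shape names are illustrative, following the trajectory and env-shape families described in this PR rather than the exact identifiers in the test file:

```python
import itertools

# Illustrative reconstruction of the 8 x 4 case grid; the real
# parametrization lives in tests/fast/utils/chat_template_utils/test_tito_k2v3.py.
BASE = ["single_tool", "multi_turn", "multi_tool_single_turn", "multi_tool_multi_turn"]
TRAJECTORIES = BASE + [f"{s}_thinking" for s in BASE]  # 8 buffer end-states
ENV_SHAPES = ["tool_followup", "user_followup", "system_inject", "alternating_mixed"]  # 4 deltas

CASES = list(itertools.product(TRAJECTORIES, ENV_SHAPES))
print(len(CASES))  # -> 32
```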

The trajectory + append + boss tests all use update_pretokenized_state to drive the session (so the buffer ends at <|im_end|> — the realistic autoregressive-stop shape), unlike test_tito_tokenizer_model_matrix.py whose pretokenized comes from a clean apply_chat_template render with the trailing \n already in it. That distinction matters: the model_matrix test cannot exercise the missing-\n boundary fix because its prefix[-1] is never <|im_end|>. This file closes that gap for K2V3.

The 4 boss flows pick deliberately complex shapes:

  • multi_turn_thinking + tool_followup
  • multi_tool_multi_turn_thinking + alternating_user_tool_followup
  • multi_tool_single_turn_thinking + system_inject (parallel tools + reasoning + system injection)
  • multi_tool_multi_turn_thinking + complex_env_chain (tool/user/tool/system/tool chain)

How this complements production runtime safeguards

Production already has TITO defenses that fire on real rollouts:

  1. Append-only / prefix check at every assistant turn
    linear_trajectory.py:103 update_pretokenized_state() requires the
    stored token_ids (= "what the trainer sees so far") to be a strict
    prefix of prompt_token_ids + completion_token_ids (= "what the
    rollout buffer carries forward"), modulo max_trim_tokens. Violation
    raises TokenizationError("pretokenized prefix mismatch: ...") at
    line 133. This is the runtime guarantee that the rollout-side and
    trainer-side token streams stay bit-identical across turns.

  2. Per-rollout chat-template round-trip metric
    linear_trajectory.py:357 compute_session_mismatch() compares
    session.token_ids (rollout buffer) against the canonical
    apply_chat_template(session.messages) render, surfaced as
    tito_session_mismatch_rate in ray/rollout.py:1238. The strict
    subset (special_token_count, special_token_type,
    non_assistant_text) is asserted to be 0 under args.ci_test at
    ray/rollout.py:1243-1252.
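The prefix check in (1) can be sketched as follows — a minimal illustration with hypothetical names; the real defense is `update_pretokenized_state` in linear_trajectory.py:103-137, which raises `TokenizationError` rather than `ValueError`:

```python
# Sketch of the append-only prefix check (hypothetical helper name;
# production raises TokenizationError("pretokenized prefix mismatch: ...")).
def check_pretokenized_prefix(stored_ids, rollout_ids, max_trim_tokens=0):
    """Require stored_ids (what the trainer has seen) to be a prefix of
    rollout_ids (what the buffer carries), modulo up to max_trim_tokens
    tokens trimmed from the end of stored_ids."""
    for trim in range(max_trim_tokens + 1):
        candidate = stored_ids[: len(stored_ids) - trim]
        if rollout_ids[: len(candidate)] == candidate:
            return True
    raise ValueError("pretokenized prefix mismatch")

print(check_pretokenized_prefix([1, 2, 3], [1, 2, 3, 4]))     # -> True (strict prefix)
print(check_pretokenized_prefix([1, 2, 9], [1, 2, 3, 4], 1))  # -> True (one trailing token trimmed)
```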

The new test suite in this PR is the offline, pre-merge counterpart of those runtime defenses:

  • The update_pretokenized_state prefix check (1) catches a TITO break the moment it happens in a real run — but only after the bug has already shipped. The trajectory + append + boss cases here exercise that same code path on synthetic but production-realistic shapes, so per-model boundary fixes (like K2V3's <|im_end|>\n insertion) can be verified BEFORE the model is wired into a training run.
  • The compute_session_mismatch comparator (2) is what every test case here invokes directly via tito_tok.create_comparator(), applying the same severity classification (special_token_* / non_assistant_text strict; assistant_text tolerated as BPE-merge / parser-whitespace noise). So this PR's test enforces the CI strict assertion at ray/rollout.py:1243-1252 as a precondition before training, rather than only catching it via the --ci_test runtime gate.

Verification

Both runs against the K2V3 checkpoint inside the agentic-rl runtime container:

With the boundary fix in place (the committed state):

54 passed in 1.83s

With the boundary fix DISABLED (commented out the \n insertion in K2V3TITOTokenizer.merge_tokens to verify the suite catches its absence):

44 failed, 10 passed in 1.81s

Per-test-family breakdown when fix is disabled:

| Test | Result |
| --- | --- |
| test_buffer_matches_canonical_under_realistic_rollout × 8 | 8 FAILED |
| test_append_via_realistic_buffer × 32 | 32 FAILED |
| test_end_to_end_realistic_rollout_with_real_parsers × 4 (boss) | 4 FAILED |
| test_chat_template_round_trip_through_real_sglang_parsers × 8 | 8 PASSED |
| test_production_prefix_check_raises_on_intentional_violation × 1 | 1 PASSED |
| test_k2v3_subclass_is_wired × 1 | 1 PASSED |

Notes:

  • The 44 trajectory + append + boss cases all detect the missing \n (the boundary fix path is prepare_pretokenized → merge_tokens, which only fires when there is prior session state — these tests drive that path).
  • The 8 parser round-trip cases are unaffected by the boundary fix because they're single-turn drives (no prepare_pretokenized → merge_tokens call). They test a different invariant (parser ↔ chat template structural round-trip) and would catch parser regressions independently.
  • The 2 fix-independent sanity cases (production prefix-check defense, registry wiring) also still pass.

Together this confirms the suite is genuinely exercising the boundary fix on the cases that should depend on it, and the parser/sanity tests are scoped to other invariants as designed.

How to run

The test needs the agentic-rl runtime container (which bundles the LLM360 SGLang fork with hermes and deepseek-r1 parsers registered) and access to the K2V3 checkpoint on /mnt/weka. End-to-end on M2 SLURM, no GPU required (this is a fast pytest, ~2s):

srun --partition=main --time=15:00 --cpus-per-task=2 \
  --container-image=/mnt/weka/shrd/k2pta/agentic_rl_images/agentic-rl-f9986751.sqsh \
  --container-mounts=/mnt/weka:/mnt/weka \
  bash -lc 'cd /path/to/your/miles-checkout && \
            PYTHONPATH=$PWD:$PYTHONPATH \
            pytest tests/fast/utils/chat_template_utils/test_tito_k2v3.py -v'

If you're already inside the container (or any host env with miles + transformers + sglang + the K2V3 checkpoint accessible):

cd /path/to/your/miles-checkout
PYTHONPATH=$PWD:$PYTHONPATH \
  pytest tests/fast/utils/chat_template_utils/test_tito_k2v3.py -v

Env overrides if needed:

  • TITO_TEST_MODEL_PATH_K2V3 — model checkpoint path
  • TITO_TEST_TOOL_PARSER_K2V3 — defaults to hermes
  • TITO_TEST_REASONING_PARSER_K2V3 — defaults to deepseek-r1
  • TITO_TEST_REASONING_EFFORT_K2V3 — defaults to high

Defaults match the K2V3 production config in RL360-tooluse-harbor.
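An override run might look like this — the checkpoint path is a placeholder, not a real location:

```shell
# Illustrative override invocation; the model path below is a placeholder.
TITO_TEST_MODEL_PATH_K2V3=/mnt/weka/path/to/k2v3-checkpoint \
TITO_TEST_TOOL_PARSER_K2V3=hermes \
TITO_TEST_REASONING_PARSER_K2V3=deepseek-r1 \
TITO_TEST_REASONING_EFFORT_K2V3=high \
pytest tests/fast/utils/chat_template_utils/test_tito_k2v3.py -v
```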

Reviewers

@LLM360/RL360-Maintainers

The K2V3 chat template emits `<|im_end|>\n` after every message (the
jinja newline after the `{{- '<|im_end|>' }}` block survives the
template's default whitespace handling). The model autoregressively
stops at `<|im_end|>` without producing the trailing `\n`. Without a
per-model
fix, the rollout buffer ends with `<|im_end|>` while the canonical
chat-template render has `<|im_end|>\n` — diverging by exactly one `\n`
token, which trips `update_pretokenized_state`'s prefix check.

`K2V3TITOTokenizer.merge_tokens` mirrors `Qwen3TITOTokenizer`'s strategy:
insert `\n` when `prefix[-1] == im_end_id`. Standalone subclass (rather
than alias of Qwen3) so future K2V3-specific divergences have a clean
hook.

`tests/fast/utils/chat_template_utils/test_tito_k2v3.py` — K2V3-focused
contract test suite (54 parametrized cases) verifying the boundary fix
is alive and effective, that the K2V3 chat template round-trips cleanly
through the real SGLang parsers, and that the full rollout flow stays
consistent end-to-end.

Coverage:

  - 8 trajectory shapes (single_tool / multi_turn /
    multi_tool_single_turn / multi_tool_multi_turn, each with native or
    synthesized thinking variant). Each runs a 2-phase check: finalized
    buffer vs canonical, plus a synthetic env follow-up that forces the
    boundary fix path even on single-turn shapes (defeating
    trim_trailing_ids that would otherwise hide missing-fix bugs).

  - 8 trajectories x 4 env append shapes = 32 cross-cases driving
    realistic <|im_end|>-end buffers through prepare_pretokenized ->
    merge_tokens with various env-delta patterns (single tool / user /
    system / alternating mixed). Each case also asserts the env content
    markers appear in the incremental tokens in order.

  - 8 trajectories run through real SGLang ReasoningParser (deepseek-r1)
    + FunctionCallParser (hermes), verifying chat template <-> parser
    structural round-trip on each shape's first assistant message.

  - 4 boss-level integration flows: drive every assistant turn of a
    multi-turn trajectory through real parsers (so session.messages
    accumulates parser-derived parsed_msg across turns), then append a
    complex env follow-up. Catches integration regressions that only
    surface in the full flow.

  - Sanity: production prefix-check defense fires on intentional
    violation; K2V3 enum value is correctly wired in
    _TOKENIZER_REGISTRY.

Reverse-validated: with the `\n` boundary insertion commented out, all
44 trajectory + append + boss cases fail; the 8 parser cases (which
don't drive merge_tokens) and the 2 fix-independent sanity cases still
pass. Confirms the suite is genuinely exercising the boundary fix.
@ZhentingWang ZhentingWang requested a review from a team as a code owner May 7, 2026 08:55