
Add K2V3TITOTokenizer for K2V3 and the TITO test #18

Open
ZhentingWang wants to merge 1 commit into prod from add-k2v3-tito-tokenizer-and-tito-test

Conversation


@ZhentingWang ZhentingWang commented May 7, 2026

Summary

  • Adds K2V3TITOTokenizer (new TITOTokenizerType.K2V3 enum value + subclass + registry entry) for the K2V3 family. Same <|im_end|>\n boundary fix as Qwen3TITOTokenizer — its chat template emits <|im_end|>\n after every message via jinja block whitespace, but the model autoregressively stops at <|im_end|> without the trailing \n. Without the fix, the rollout buffer's last token is <|im_end|> while the canonical chat-template render has <|im_end|>\n, diverging by exactly one token which trips update_pretokenized_state's prefix check.
  • Adds tests/fast/utils/chat_template_utils/test_tito_k2v3.py — K2V3-focused contract test suite (54 parametrized cases) verifying the boundary fix is alive and effective, that the K2V3 chat template round-trips cleanly through the real SGLang parsers, and that the full rollout flow stays consistent end-to-end.
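The one-token divergence and the fix can be sketched in a few lines — a minimal illustration with hypothetical helper and token-id names, not the real `K2V3TITOTokenizer.merge_tokens` implementation:

```python
# Minimal sketch of the <|im_end|>\n boundary fix. Function and id names
# are illustrative; the real logic lives in K2V3TITOTokenizer.merge_tokens.
def merge_with_boundary_fix(prefix, delta, im_end_id, newline_id):
    r"""Re-insert the trailing newline the model never emits.

    The chat template renders <|im_end|>\n after every message, but the
    model autoregressively stops at <|im_end|>. If the rollout buffer
    ends on <|im_end|>, append the \n token before merging so the buffer
    matches the canonical chat-template render.
    """
    if prefix and prefix[-1] == im_end_id:
        prefix = prefix + [newline_id]
    return prefix + delta

# Buffer ends at <|im_end|> (id 5): the missing \n token (id 9) is restored.
print(merge_with_boundary_fix([1, 5], [7, 8], im_end_id=5, newline_id=9))  # -> [1, 5, 9, 7, 8]
# Buffer ends elsewhere: plain concatenation, no insertion.
print(merge_with_boundary_fix([1, 2], [7, 8], im_end_id=5, newline_id=9))  # -> [1, 2, 7, 8]
```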

What the test suite covers

| Test | Cases | What it catches |
| --- | --- | --- |
| test_buffer_matches_canonical_under_realistic_rollout | 8 | trajectories (4 native + thinking variants). Per-model boundary fix correctness on realistic buffer end-states; a phase-2 follow-up forces the boundary fix path even on single-turn shapes (defeating trim_trailing_ids that would otherwise hide missing-fix bugs) |
| test_append_via_realistic_buffer | 8 trajectories × 4 env shapes = 32 | merge_tokens correctness across the (buffer end-state × env append shape) cross-product, plus a required-content-in-order check on the incremental segment |
| test_chat_template_round_trip_through_real_sglang_parsers | 8 | trajectories. Real deepseek-r1 ReasoningParser + hermes FunctionCallParser ↔ chat-template structural round-trip on each trajectory's first assistant message — independent of the boundary fix; catches parser-related regressions |
| test_end_to_end_realistic_rollout_with_real_parsers (boss) | 4 | complex flows. Full integration: drives every assistant turn of a multi-turn trajectory through real parsers (so session.messages accumulates parser-derived parsed_msg across turns), then appends a complex env follow-up that triggers merge_tokens over the parser-tainted history. Catches integration regressions only visible in the full flow |
| test_production_prefix_check_raises_on_intentional_violation | 1 | update_pretokenized_state's defense in linear_trajectory.py:103-137 is alive |
| test_k2v3_subclass_is_wired | 1 | _TOKENIZER_REGISTRY[TITOTokenizerType.K2V3] correctly returns the K2V3 subclass |
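The 8 × 4 = 32 cross-product driving the append test can be reconstructed as a quick sketch — shape names are illustrative, following the trajectory and env-shape families described in this PR rather than the exact identifiers in the test file:

```python
import itertools

# Illustrative reconstruction of the 8 x 4 case grid; the real
# parametrization lives in tests/fast/utils/chat_template_utils/test_tito_k2v3.py.
BASE = ["single_tool", "multi_turn", "multi_tool_single_turn", "multi_tool_multi_turn"]
TRAJECTORIES = BASE + [f"{s}_thinking" for s in BASE]  # 8 buffer end-states
ENV_SHAPES = ["tool_followup", "user_followup", "system_inject", "alternating_mixed"]  # 4 deltas

CASES = list(itertools.product(TRAJECTORIES, ENV_SHAPES))
print(len(CASES))  # -> 32
```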

The trajectory + append + boss tests all use update_pretokenized_state to drive the session (so the buffer ends at <|im_end|> — the realistic autoregressive-stop shape), unlike test_tito_tokenizer_model_matrix.py whose pretokenized comes from a clean apply_chat_template render with the trailing \n already in it. That distinction matters: the model_matrix test cannot exercise the missing-\n boundary fix because its prefix[-1] is never <|im_end|>. This file closes that gap for K2V3.

The 4 boss flows pick deliberately complex shapes:

  • multi_turn_thinking + tool_followup
  • multi_tool_multi_turn_thinking + alternating_user_tool_followup
  • multi_tool_single_turn_thinking + system_inject (parallel tools + reasoning + system injection)
  • multi_tool_multi_turn_thinking + complex_env_chain (tool/user/tool/system/tool chain)

How this complements production runtime safeguards

Production already has TITO defenses that fire on real rollouts:

  1. Append-only / prefix check at every assistant turn
    linear_trajectory.py:103 update_pretokenized_state() requires the
    stored token_ids (= "what the trainer sees so far") to be a strict
    prefix of prompt_token_ids + completion_token_ids (= "what the
    rollout buffer carries forward"), modulo max_trim_tokens. Violation
    raises TokenizationError("pretokenized prefix mismatch: ...") at
    line 133. This is the runtime guarantee that the rollout-side and
    trainer-side token streams stay bit-identical across turns.

  2. Per-rollout chat-template round-trip metric
    linear_trajectory.py:357 compute_session_mismatch() compares
    session.token_ids (rollout buffer) against the canonical
    apply_chat_template(session.messages) render, surfaced as
    tito_session_mismatch_rate in ray/rollout.py:1238. The strict
    subset (special_token_count, special_token_type,
    non_assistant_text) is asserted to be 0 under args.ci_test at
    ray/rollout.py:1243-1252.
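The prefix check in (1) can be sketched as follows — a minimal illustration with hypothetical names; the real defense is `update_pretokenized_state` in linear_trajectory.py:103-137, which raises `TokenizationError` rather than `ValueError`:

```python
# Sketch of the append-only prefix check (hypothetical helper name;
# production raises TokenizationError("pretokenized prefix mismatch: ...")).
def check_pretokenized_prefix(stored_ids, rollout_ids, max_trim_tokens=0):
    """Require stored_ids (what the trainer has seen) to be a prefix of
    rollout_ids (what the buffer carries), modulo up to max_trim_tokens
    tokens trimmed from the end of stored_ids."""
    for trim in range(max_trim_tokens + 1):
        candidate = stored_ids[: len(stored_ids) - trim]
        if rollout_ids[: len(candidate)] == candidate:
            return True
    raise ValueError("pretokenized prefix mismatch")

print(check_pretokenized_prefix([1, 2, 3], [1, 2, 3, 4]))     # -> True (strict prefix)
print(check_pretokenized_prefix([1, 2, 9], [1, 2, 3, 4], 1))  # -> True (one trailing token trimmed)
```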

The new test suite in this PR is the offline, pre-merge counterpart of those runtime defenses:

  • The update_pretokenized_state prefix check (1) catches a TITO break the moment it happens in a real run — but only after the bug has already shipped. The trajectory + append + boss cases here exercise that same code path on synthetic but production-realistic shapes, so per-model boundary fixes (like K2V3's <|im_end|>\n insertion) can be verified BEFORE the model is wired into a training run.
  • The compute_session_mismatch comparator (2) is what every test case here invokes directly via tito_tok.create_comparator(), applying the same severity classification (special_token_* / non_assistant_text strict; assistant_text tolerated as BPE-merge / parser-whitespace noise). So this PR's test enforces the CI strict assertion at ray/rollout.py:1243-1252 as a precondition before training, rather than only catching it via the --ci_test runtime gate.

Verification

Both runs against the K2V3 checkpoint inside the agentic-rl runtime container:

With the boundary fix in place (the committed state):

54 passed in 1.83s

With the boundary fix DISABLED (commented out the \n insertion in K2V3TITOTokenizer.merge_tokens to verify the suite catches its absence):

44 failed, 10 passed in 1.81s

Per-test-family breakdown when fix is disabled:

| Test | Result |
| --- | --- |
| test_buffer_matches_canonical_under_realistic_rollout × 8 | 8 FAILED |
| test_append_via_realistic_buffer × 32 | 32 FAILED |
| test_end_to_end_realistic_rollout_with_real_parsers × 4 (boss) | 4 FAILED |
| test_chat_template_round_trip_through_real_sglang_parsers × 8 | 8 PASSED |
| test_production_prefix_check_raises_on_intentional_violation × 1 | 1 PASSED |
| test_k2v3_subclass_is_wired × 1 | 1 PASSED |

Notes:

  • The 44 trajectory + append + boss cases all detect the missing \n (the boundary fix path is prepare_pretokenized → merge_tokens, which only fires when there is prior session state — these tests drive that path).
  • The 8 parser round-trip cases are unaffected by the boundary fix because they're single-turn drives (no prepare_pretokenized → merge_tokens call). They test a different invariant (parser ↔ chat template structural round-trip) and would catch parser regressions independently.
  • The 2 fix-independent sanity cases (production prefix-check defense, registry wiring) also still pass.

Together this confirms the suite is genuinely exercising the boundary fix on the cases that should depend on it, and the parser/sanity tests are scoped to other invariants as designed.

How to run

The test needs the agentic-rl runtime container (which bundles the LLM360 SGLang fork with hermes and deepseek-r1 parsers registered) and access to the K2V3 checkpoint on /mnt/weka. End-to-end on M2 SLURM, no GPU required (this is a fast pytest, ~2s):

srun --partition=main --time=15:00 --cpus-per-task=2 \
  --container-image=/mnt/weka/shrd/k2pta/agentic_rl_images/agentic-rl-f9986751.sqsh \
  --container-mounts=/mnt/weka:/mnt/weka \
  bash -lc 'cd /path/to/your/miles-checkout && \
            PYTHONPATH=$PWD:$PYTHONPATH \
            pytest tests/fast/utils/chat_template_utils/test_tito_k2v3.py -v'

If you're already inside the container (or any host env with miles + transformers + sglang + the K2V3 checkpoint accessible):

cd /path/to/your/miles-checkout
PYTHONPATH=$PWD:$PYTHONPATH \
  pytest tests/fast/utils/chat_template_utils/test_tito_k2v3.py -v

Env overrides if needed:

  • TITO_TEST_MODEL_PATH_K2V3 — model checkpoint path
  • TITO_TEST_TOOL_PARSER_K2V3 — defaults to hermes
  • TITO_TEST_REASONING_PARSER_K2V3 — defaults to deepseek-r1
  • TITO_TEST_REASONING_EFFORT_K2V3 — defaults to high

Defaults match the K2V3 production config in RL360-tooluse-harbor.
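An override run might look like this — the checkpoint path is a placeholder, not a real location:

```shell
# Illustrative override invocation; the model path below is a placeholder.
TITO_TEST_MODEL_PATH_K2V3=/mnt/weka/path/to/k2v3-checkpoint \
TITO_TEST_TOOL_PARSER_K2V3=hermes \
TITO_TEST_REASONING_PARSER_K2V3=deepseek-r1 \
TITO_TEST_REASONING_EFFORT_K2V3=high \
pytest tests/fast/utils/chat_template_utils/test_tito_k2v3.py -v
```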

Reviewers

@LLM360/RL360-Maintainers

The K2V3 chat template emits `<|im_end|>\n` after every message (the
jinja newline after the `{{- '<|im_end|>' }}` block survives the
template's default whitespace handling). The model autoregressively
stops at `<|im_end|>` without producing the trailing `\n`. Without a
per-model
fix, the rollout buffer ends with `<|im_end|>` while the canonical
chat-template render has `<|im_end|>\n` — diverging by exactly one `\n`
token, which trips `update_pretokenized_state`'s prefix check.

`K2V3TITOTokenizer.merge_tokens` mirrors `Qwen3TITOTokenizer`'s strategy:
insert `\n` when `prefix[-1] == im_end_id`. Standalone subclass (rather
than alias of Qwen3) so future K2V3-specific divergences have a clean
hook.

`tests/fast/utils/chat_template_utils/test_tito_k2v3.py` — K2V3-focused
contract test suite (54 parametrized cases) verifying the boundary fix
is alive and effective, that the K2V3 chat template round-trips cleanly
through the real SGLang parsers, and that the full rollout flow stays
consistent end-to-end.

Coverage:

  - 8 trajectory shapes (single_tool / multi_turn /
    multi_tool_single_turn / multi_tool_multi_turn, each with native or
    synthesized thinking variant). Each runs a 2-phase check: finalized
    buffer vs canonical, plus a synthetic env follow-up that forces the
    boundary fix path even on single-turn shapes (defeating
    trim_trailing_ids that would otherwise hide missing-fix bugs).

  - 8 trajectories x 4 env append shapes = 32 cross-cases driving
    realistic <|im_end|>-end buffers through prepare_pretokenized ->
    merge_tokens with various env-delta patterns (single tool / user /
    system / alternating mixed). Each case also asserts the env content
    markers appear in the incremental tokens in order.

  - 8 trajectories run through real SGLang ReasoningParser (deepseek-r1)
    + FunctionCallParser (hermes), verifying chat template <-> parser
    structural round-trip on each shape's first assistant message.

  - 4 boss-level integration flows: drive every assistant turn of a
    multi-turn trajectory through real parsers (so session.messages
    accumulates parser-derived parsed_msg across turns), then append a
    complex env follow-up. Catches integration regressions that only
    surface in the full flow.

  - Sanity: production prefix-check defense fires on intentional
    violation; K2V3 enum value is correctly wired in
    _TOKENIZER_REGISTRY.

Reverse-validated: with the `\n` boundary insertion commented out, all
44 trajectory + append + boss cases fail; the 8 parser cases (which
don't drive merge_tokens) and the 2 fix-independent sanity cases still
pass. Confirms the suite is genuinely exercising the boundary fix.
@ZhentingWang ZhentingWang requested a review from a team as a code owner May 7, 2026 08:55