[data] fix: Remove spurious EOS append in _chat_preprocess #2699
gautham-kollu merged 2 commits into NVIDIA-NeMo:main
Conversation
📝 Walkthrough
Removed automatic EOS token insertion logic from `_chat_preprocess`.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
🧹 Nitpick comments (1)
tests/unit_tests/data/datasets/test_chat_template.py (1)
151-169: Assert the full token sequence here.
Checking `[-1] == 888` plus `len == 4` catches the extra-suffix case, but it can still miss unintended changes earlier in `input_ids`. Comparing against the full list keeps this regression pinned to the exact pass-through behavior.

Suggested tightening:

```diff
- assert result["input_ids"][-1].item() == 888
- assert len(result["input_ids"]) == 4
+ assert result["input_ids"].tolist() == [1, 10, 20, 888]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/data/datasets/test_chat_template.py` around lines 151 - 169, Update the test_chat_preprocess_trusts_template_eos test to assert the entire produced input_ids sequence instead of only the last element and length: call _chat_preprocess with the mocked tokenizer (which has mock_hf_tokenizer.apply_chat_template returning {"input_ids": [1, 10, 20, 888]}) and assert that result["input_ids"] matches the exact sequence [1, 10, 20, 888] (ensuring all elements are identical and no unintended modifications occurred by _chat_preprocess).
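The full-sequence assertion the reviewer suggests can be sketched with a self-contained stand-in. This is a hypothetical illustration, not the actual test: the real test targets `_chat_preprocess` in megatron bridge with its real signature, while here a minimal pass-through stand-in and a mocked tokenizer show why comparing the whole list pins the behavior exactly.

```python
from unittest.mock import MagicMock

def _chat_preprocess(messages, tokenizer):
    # Stand-in for illustration only: trust the chat template's output
    # verbatim, with no EOS append (the behavior this PR establishes).
    return {"input_ids": tokenizer.apply_chat_template(messages)["input_ids"]}

# Mocked HF tokenizer whose template output ends in 888, not eos_id.
mock_hf_tokenizer = MagicMock()
mock_hf_tokenizer.apply_chat_template.return_value = {"input_ids": [1, 10, 20, 888]}

result = _chat_preprocess([{"role": "user", "content": "hi"}], mock_hf_tokenizer)

# Full-list comparison catches any unintended change, not just a
# spurious suffix or a length mismatch.
assert result["input_ids"] == [1, 10, 20, 888]
```

A last-element-plus-length check would still pass if, say, an earlier token were rewritten; the exact-list assertion cannot.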
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d32e5425-02d4-41c4-9bb9-72809cb26e7a
📒 Files selected for processing (2)
- src/megatron/bridge/data/datasets/utils.py
- tests/unit_tests/data/datasets/test_chat_template.py
💤 Files with no reviewable changes (1)
- src/megatron/bridge/data/datasets/utils.py
The chat template already handles end-of-sequence tokens. For models where the end-of-turn token differs from eos_id (Qwen3, Llama 3.x), the extra append produces a token the model never sees in pretraining.

Fixes: NVIDIA-NeMo#2698

Signed-off-by: Shane Moran <shane.moran@shopify.com>
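The removed behavior can be sketched in isolation. This is a hypothetical reconstruction (function name and shapes assumed, not the actual utils.py code): a conditional that appends `eos_id` whenever the template's last token differs from it, which corrupts output from templates that end turns with their own token.

```python
def append_eos_if_missing(input_ids, eos_id):
    # Hypothetical stand-in for the deleted conditional: append eos_id
    # when the sequence does not already end with it.
    if input_ids and input_ids[-1] != eos_id:
        input_ids = input_ids + [eos_id]
    return input_ids

# A Qwen3/Llama-3.x-style template correctly ends the turn with its own
# end-of-turn token (888 here), which is not eos_id (2 here).
template_output = [1, 10, 20, 888]

# The old logic bolts eos_id on anyway, producing a trailing token the
# model never saw after end-of-turn during pretraining.
print(append_eos_if_missing(template_output, eos_id=2))  # -> [1, 10, 20, 888, 2]
```

Deleting the conditional leaves the template's output untouched, which is the pass-through behavior the new test asserts.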
Address review feedback: assert the entire input_ids list instead of only the last element and length, to catch any unintended modifications earlier in the sequence.

Signed-off-by: Shane Moran <shane.moran@shopify.com>
57a1482 to e123587
/ok to test e123587
Summary
`_chat_preprocess` appends `eos_id` when the last token differs from `eos_id`, but for models like Qwen3 and Llama 3.x the chat template correctly ends with a different end-of-turn token.

Changes
- src/megatron/bridge/data/datasets/utils.py: Delete 3-line conditional EOS append (lines 952-954)
- tests/unit_tests/data/datasets/test_chat_template.py: Replace test_chat_preprocess_adds_eos_if_missing with test_chat_preprocess_trusts_template_eos

Fixes #2698
Summary by CodeRabbit
Release Notes