
[data] fix: Remove spurious EOS append in _chat_preprocess#2699

Merged
gautham-kollu merged 2 commits into NVIDIA-NeMo:main from shanecmoran:fix/chat-preprocess-extra-eos
Mar 13, 2026

Conversation

@shanecmoran
Contributor

@shanecmoran shanecmoran commented Mar 8, 2026

Summary

Remove the conditional EOS append from _chat_preprocess. The chat template already handles end-of-sequence tokens; for models whose end-of-turn token differs from eos_id (Qwen3, Llama 3.x), the extra append produced a token the model never saw in pretraining.

Changes

  • src/megatron/bridge/data/datasets/utils.py: Delete 3-line conditional EOS append (lines 952-954)
  • tests/unit_tests/data/datasets/test_chat_template.py: Replace test_chat_preprocess_adds_eos_if_missing with test_chat_preprocess_trusts_template_eos

Fixes #2698
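For context, the deleted conditional behaved roughly like the sketch below (a hypothetical reconstruction from the PR description, not the exact source; append_eos_if_missing is an illustrative name):

```python
def append_eos_if_missing(input_ids, eos_id):
    # Sketch of the removed 3-line conditional: force-append eos_id
    # whenever the template output did not already end with it.
    if input_ids[-1] != eos_id:
        input_ids = input_ids + [eos_id]
    return input_ids

# Qwen3/Llama-3.x-style templates end each turn with a dedicated
# end-of-turn token (here 888) that differs from eos_id (here 2).
# The removed logic appended eos_id anyway, producing a token pair
# the model never saw in pretraining:
print(append_eos_if_missing([1, 10, 20, 888], eos_id=2))  # [1, 10, 20, 888, 2]
```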

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Chat preprocessing now honors the template's end-of-sequence configuration: when the template defines its own end token, no EOS marker is appended, so input sequences are tokenized exactly as the template specifies.

@copy-pr-bot

copy-pr-bot Bot commented Mar 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 8, 2026

📝 Walkthrough

Walkthrough

Removed automatic EOS token insertion logic from the _chat_preprocess function in the data utilities, making the preprocessing rely on the chat template to handle EOS tokens. The corresponding test was updated to validate this new behavior.

Changes

  • EOS Token Handling — src/megatron/bridge/data/datasets/utils.py: removed the logic that automatically appended an EOS token to input_ids when the last token was not an EOS token during chat preprocessing.
  • Test Updates — tests/unit_tests/data/datasets/test_chat_template.py: renamed test_chat_preprocess_adds_eos_if_missing to test_chat_preprocess_trusts_template_eos and updated assertions to verify EOS is not appended when the template end token differs from eos_id.
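A minimal sketch of what the renamed test asserts (token values and the mocked apply_chat_template output follow the review discussion; chat_preprocess_passthrough is an illustrative stand-in for the real _chat_preprocess, which is mocked at the HF tokenizer level):

```python
def chat_preprocess_passthrough(apply_chat_template, messages):
    # After this PR, preprocessing trusts the template's own end token
    # and performs no EOS append of its own.
    return {"input_ids": apply_chat_template(messages)["input_ids"]}

def test_chat_preprocess_trusts_template_eos():
    # Mocked template output ends with 888, an end-of-turn token that
    # is deliberately not eos_id.
    mock_apply = lambda messages: {"input_ids": [1, 10, 20, 888]}
    result = chat_preprocess_passthrough(mock_apply, messages=[])
    # Per the review suggestion, assert the full sequence rather than
    # only the last element and length:
    assert result["input_ids"] == [1, 10, 20, 888]

test_chat_preprocess_trusts_template_eos()
```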

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning: the PR removes spurious EOS token logic affecting tokenization output, and the unit test was updated, but no test results or regression validation are documented. Resolution: add test results, regression validation on Qwen3/Llama models, and a before/after comparison to the PR description.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately summarizes the main change, removing the spurious EOS token append in _chat_preprocess.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, above the required 80.00% threshold.


Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
tests/unit_tests/data/datasets/test_chat_template.py (1)

151-169: Assert the full token sequence here.

[-1] == 888 plus len == 4 catches the extra-suffix case, but it can still miss unintended changes earlier in input_ids. Comparing against the full list keeps this regression pinned to the exact pass-through behavior.

Suggested tightening
-        assert result["input_ids"][-1].item() == 888
-        assert len(result["input_ids"]) == 4
+        assert result["input_ids"].tolist() == [1, 10, 20, 888]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/data/datasets/test_chat_template.py` around lines 151 - 169,
Update the test_chat_preprocess_trusts_template_eos test to assert the entire
produced input_ids sequence instead of only the last element and length: call
_chat_preprocess with the mocked tokenizer (which has
mock_hf_tokenizer.apply_chat_template returning {"input_ids": [1, 10, 20, 888]})
and assert that result["input_ids"] matches the exact sequence [1, 10, 20, 888]
(ensuring all elements are identical and no unintended modifications occurred by
_chat_preprocess).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d32e5425-02d4-41c4-9bb9-72809cb26e7a

📥 Commits

Reviewing files that changed from the base of the PR and between 9fe8e36 and 5964e90.

📒 Files selected for processing (2)
  • src/megatron/bridge/data/datasets/utils.py
  • tests/unit_tests/data/datasets/test_chat_template.py
💤 Files with no reviewable changes (1)
  • src/megatron/bridge/data/datasets/utils.py

@yaoyu-33 added the area:data (Dataset builders, preprocessing, and samplers) and needs-review (PR is ready for code review and waiting on a reviewer) labels on Mar 10, 2026
@yaoyu-33 added and then removed the area:data and needs-review labels on Mar 11, 2026
The chat template already handles end-of-sequence tokens. For models
where the end-of-turn token differs from eos_id (Qwen3, Llama 3.x),
the extra append produces a token the model never sees in pretraining.

Fixes: NVIDIA-NeMo#2698
Signed-off-by: Shane Moran <shane.moran@shopify.com>
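To make the failure mode in the commit message concrete, a hedged illustration using Llama-3.x-style special tokens (the token names reflect Llama 3's vocabulary; the numeric ids are illustrative):

```python
# Llama-3.x-style special tokens (ids illustrative):
EOT_ID = 128009   # <|eot_id|>, the chat template's end-of-turn token
EOS_ID = 128001   # <|end_of_text|>, the tokenizer's eos_id

template_output = [128000, 9906, EOT_ID]   # template already ends the turn

# Old behavior: last token != eos_id, so eos_id was appended on top.
old = template_output + ([EOS_ID] if template_output[-1] != EOS_ID else [])
# New behavior: trust the template's end token, append nothing.
new = list(template_output)

print(old)  # [128000, 9906, 128009, 128001] — <|end_of_text|> right after <|eot_id|>
print(new)  # [128000, 9906, 128009]
```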
Address review feedback: assert the entire input_ids list instead of
only the last element and length, to catch any unintended modifications
earlier in the sequence.

Signed-off-by: Shane Moran <shane.moran@shopify.com>
@shanecmoran force-pushed the fix/chat-preprocess-extra-eos branch from 57a1482 to e123587 on March 12, 2026 at 14:59
@yaoyu-33
Contributor

/ok to test e123587

@yaoyu-33 added the ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge) label and removed the needs-review (PR is ready for code review and waiting on a reviewer) label on Mar 12, 2026
Contributor

@cuichenx cuichenx left a comment

LGTM

@gautham-kollu gautham-kollu enabled auto-merge (squash) March 13, 2026 17:54
@gautham-kollu gautham-kollu merged commit c2e2fbd into NVIDIA-NeMo:main Mar 13, 2026
63 of 64 checks passed
copy-pr-bot Bot pushed a commit that referenced this pull request Mar 19, 2026
Signed-off-by: Shane Moran <shane.moran@shopify.com>

Labels

  • area:data — Dataset builders, preprocessing, and samplers
  • community-request
  • ready-to-merge — PR is approved, current, and only waiting for CI to pass before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[data] _chat_preprocess appends spurious EOS token when chat template uses different end-of-turn token

4 participants