
[data] fix: Remove spurious EOS append in _chat_preprocess#2699

Merged
gautham-kollu merged 2 commits into NVIDIA-NeMo:main from shanecmoran:fix/chat-preprocess-extra-eos
Mar 13, 2026

Conversation

@shanecmoran
Contributor

@shanecmoran shanecmoran commented Mar 8, 2026

Summary

Remove the conditional EOS append from _chat_preprocess. The chat template already handles end-of-sequence tokens; for models whose end-of-turn token differs from eos_id (Qwen3, Llama 3.x), the extra append produced a token the model never saw in pretraining.

Changes

  • src/megatron/bridge/data/datasets/utils.py: Delete 3-line conditional EOS append (lines 952-954)
  • tests/unit_tests/data/datasets/test_chat_template.py: Replace test_chat_preprocess_adds_eos_if_missing with test_chat_preprocess_trusts_template_eos

Fixes #2698
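For context, the deleted conditional behaved roughly like the sketch below (a hypothetical reconstruction from the PR description, not the exact source; append_eos_if_missing is an illustrative name):

```python
def append_eos_if_missing(input_ids, eos_id):
    # Sketch of the removed 3-line conditional: force-append eos_id
    # whenever the template output did not already end with it.
    if input_ids[-1] != eos_id:
        input_ids = input_ids + [eos_id]
    return input_ids

# Qwen3/Llama-3.x-style templates end each turn with a dedicated
# end-of-turn token (here 888) that differs from eos_id (here 2).
# The removed logic appended eos_id anyway, producing a token pair
# the model never saw in pretraining:
print(append_eos_if_missing([1, 10, 20, 888], eos_id=2))  # [1, 10, 20, 888, 2]
```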

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Chat preprocessing now honors the template's end-of-sequence configuration: when the template defines its own end token, no EOS marker is appended, so input sequences are tokenized exactly as the template specifies.

@copy-pr-bot

copy-pr-bot Bot commented Mar 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 8, 2026

📝 Walkthrough

Walkthrough

Removed automatic EOS token insertion logic from the _chat_preprocess function in the data utilities, making the preprocessing rely on the chat template to handle EOS tokens. The corresponding test was updated to validate this new behavior.

Changes

  • EOS Token Handling — src/megatron/bridge/data/datasets/utils.py: removed the logic that automatically appended an EOS token to input_ids when the last token was not an EOS token during chat preprocessing.
  • Test Updates — tests/unit_tests/data/datasets/test_chat_template.py: renamed test_chat_preprocess_adds_eos_if_missing to test_chat_preprocess_trusts_template_eos and updated assertions to verify EOS is not appended when the template end token differs from eos_id.
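A minimal sketch of what the renamed test asserts (token values and the mocked apply_chat_template output follow the review discussion; chat_preprocess_passthrough is an illustrative stand-in for the real _chat_preprocess, which is mocked at the HF tokenizer level):

```python
def chat_preprocess_passthrough(apply_chat_template, messages):
    # After this PR, preprocessing trusts the template's own end token
    # and performs no EOS append of its own.
    return {"input_ids": apply_chat_template(messages)["input_ids"]}

def test_chat_preprocess_trusts_template_eos():
    # Mocked template output ends with 888, an end-of-turn token that
    # is deliberately not eos_id.
    mock_apply = lambda messages: {"input_ids": [1, 10, 20, 888]}
    result = chat_preprocess_passthrough(mock_apply, messages=[])
    # Per the review suggestion, assert the full sequence rather than
    # only the last element and length:
    assert result["input_ids"] == [1, 10, 20, 888]

test_chat_preprocess_trusts_template_eos()
```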

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning: the PR removes spurious EOS token logic affecting tokenization output, and the unit test was updated, but no test results or regression validation are documented. Resolution: add test results, regression validation on Qwen3/Llama models, and a before/after comparison to the PR description.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately summarizes the main change, removing the spurious EOS token append in _chat_preprocess.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, above the required 80.00% threshold.


Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
tests/unit_tests/data/datasets/test_chat_template.py (1)

151-169: Assert the full token sequence here.

[-1] == 888 plus len == 4 catches the extra-suffix case, but it can still miss unintended changes earlier in input_ids. Comparing against the full list keeps this regression pinned to the exact pass-through behavior.

Suggested tightening
-        assert result["input_ids"][-1].item() == 888
-        assert len(result["input_ids"]) == 4
+        assert result["input_ids"].tolist() == [1, 10, 20, 888]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/data/datasets/test_chat_template.py` around lines 151 - 169,
Update the test_chat_preprocess_trusts_template_eos test to assert the entire
produced input_ids sequence instead of only the last element and length: call
_chat_preprocess with the mocked tokenizer (which has
mock_hf_tokenizer.apply_chat_template returning {"input_ids": [1, 10, 20, 888]})
and assert that result["input_ids"] matches the exact sequence [1, 10, 20, 888]
(ensuring all elements are identical and no unintended modifications occurred by
_chat_preprocess).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d32e5425-02d4-41c4-9bb9-72809cb26e7a

📥 Commits

Reviewing files that changed from the base of the PR and between 9fe8e36 and 5964e90.

📒 Files selected for processing (2)
  • src/megatron/bridge/data/datasets/utils.py
  • tests/unit_tests/data/datasets/test_chat_template.py
💤 Files with no reviewable changes (1)
  • src/megatron/bridge/data/datasets/utils.py

@yaoyu-33 added the area:data (Dataset builders, preprocessing, and samplers) and needs-review (PR is ready for code review and waiting on a reviewer) labels on Mar 10, 2026
@yaoyu-33 added and then removed the area:data and needs-review labels on Mar 11, 2026
The chat template already handles end-of-sequence tokens. For models
where the end-of-turn token differs from eos_id (Qwen3, Llama 3.x),
the extra append produces a token the model never sees in pretraining.

Fixes: NVIDIA-NeMo#2698
Signed-off-by: Shane Moran <shane.moran@shopify.com>
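To make the failure mode in the commit message concrete, a hedged illustration using Llama-3.x-style special tokens (the token names reflect Llama 3's vocabulary; the numeric ids are illustrative):

```python
# Llama-3.x-style special tokens (ids illustrative):
EOT_ID = 128009   # <|eot_id|>, the chat template's end-of-turn token
EOS_ID = 128001   # <|end_of_text|>, the tokenizer's eos_id

template_output = [128000, 9906, EOT_ID]   # template already ends the turn

# Old behavior: last token != eos_id, so eos_id was appended on top.
old = template_output + ([EOS_ID] if template_output[-1] != EOS_ID else [])
# New behavior: trust the template's end token, append nothing.
new = list(template_output)

print(old)  # [128000, 9906, 128009, 128001] — <|end_of_text|> right after <|eot_id|>
print(new)  # [128000, 9906, 128009]
```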
Address review feedback: assert the entire input_ids list instead of
only the last element and length, to catch any unintended modifications
earlier in the sequence.

Signed-off-by: Shane Moran <shane.moran@shopify.com>
@shanecmoran force-pushed the fix/chat-preprocess-extra-eos branch from 57a1482 to e123587 on March 12, 2026 at 14:59
@yaoyu-33
Contributor

/ok to test e123587

@yaoyu-33 added the ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge) label and removed the needs-review (PR is ready for code review and waiting on a reviewer) label on Mar 12, 2026
Contributor

@cuichenx cuichenx left a comment

LGTM

@gautham-kollu gautham-kollu enabled auto-merge (squash) March 13, 2026 17:54
@gautham-kollu gautham-kollu merged commit c2e2fbd into NVIDIA-NeMo:main Mar 13, 2026
63 of 64 checks passed
copy-pr-bot Bot pushed a commit that referenced this pull request Mar 19, 2026
Signed-off-by: Shane Moran <shane.moran@shopify.com>

Labels

  • area:data — Dataset builders, preprocessing, and samplers
  • community-request
  • ready-to-merge — PR is approved, current, and only waiting for CI to pass before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[data] _chat_preprocess appends spurious EOS token when chat template uses different end-of-turn token

4 participants