
[None][fix] Fix contrained decoding for GLM5#12869

Open
cascade812 wants to merge 4 commits into NVIDIA:main from cascade812:guided_glm

Conversation

@cascade812
Collaborator

@cascade812 cascade812 commented Apr 9, 2026

Description

This PR fixes an error that occurs when running GLM5 with constrained decoding (guided decoding).

Users need to specify the following config:
llm_kwargs["custom_tokenizer"] = 'glm_moe_dsa'
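As a hedged usage sketch (the model path and the surrounding LLM construction are illustrative assumptions, not from the PR; only the `custom_tokenizer` key is what the PR requires):

```python
# Illustrative sketch of the required kwargs. Only the "custom_tokenizer"
# entry comes from this PR; the model path is a placeholder.
llm_kwargs = {
    "model": "/path/to/GLM5-checkpoint",  # placeholder checkpoint location
    "custom_tokenizer": "glm_moe_dsa",    # alias resolved via TOKENIZER_ALIASES
}
# The kwargs would then be forwarded to the LLM constructor, e.g.
# llm = LLM(**llm_kwargs)  # construction elided; requires a real checkpoint
```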

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • New Features

    • Added support for custom tokenizer aliases, enabling deepseek_v32 and glm_moe_dsa tokenizers.
    • GLM MOE DSA tokenizer now loads chat templates from chat_template.jinja files.
  • Improvements

    • Enhanced speculative decoding resource management with proper reset and reallocation during KV-cache estimation.

Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
@cascade812 cascade812 requested review from a team as code owners April 9, 2026 02:25
@cascade812 cascade812 requested a review from syuoni April 9, 2026 02:25
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

The changes consolidate tokenizer alias definitions into a shared constant, implement dynamic custom tokenizer loading via configurable aliases, add chat template file support to the GLM Moe DSA tokenizer, and refactor speculative decoding components to be reset and recreated during KV-cache estimation reinitialization.

Changes

Tokenizer Alias Consolidation
tensorrt_llm/tokenizer/__init__.py, tensorrt_llm/llmapi/llm_args.py
Moved TOKENIZER_ALIASES mapping from llm_args.py to tensorrt_llm/tokenizer/__init__.py and exported it. Maps short built-in tokenizer names to fully-qualified class paths for runtime resolution.
Custom Tokenizer Loading
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
Added branching logic for guided decoding tokenizer initialization: if llm_args.custom_tokenizer is set, resolves alias via TOKENIZER_ALIASES, dynamically imports the target module/class, and instantiates via from_pretrained; otherwise uses TransformersTokenizer.from_pretrained.
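The alias-resolution-plus-dynamic-import pattern described above can be sketched as follows. This is a minimal, runnable illustration, not the real code: the alias table below stands in for `TOKENIZER_ALIASES`, and a stdlib class is used as the target only so the sketch runs anywhere.

```python
import importlib

# Illustrative alias table in the spirit of TOKENIZER_ALIASES: a short name
# maps to a fully-qualified "module.path.ClassName" string. The real table
# maps names like 'glm_moe_dsa' to tokenizer classes; a stdlib class is used
# here only so the sketch is self-contained.
ALIASES = {"odict": "collections.OrderedDict"}

def resolve_tokenizer_class(name: str):
    """Resolve a short alias (or a fully-qualified path) to a class object."""
    qualified = ALIASES.get(name, name)
    module_path, class_name = qualified.rsplit(".", 1)
    try:
        module = importlib.import_module(module_path)
        return getattr(module, class_name)
    except (ImportError, AttributeError) as e:
        # Surface the offending identifier, as the review comment suggests.
        raise ValueError(f"Cannot load tokenizer class {qualified!r}: {e}") from e

cls = resolve_tokenizer_class("odict")
# The resolved class would then be instantiated via cls.from_pretrained(...)
```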
Chat Template Support
tensorrt_llm/tokenizer/glm_moe_dsa/tokenizer.py
Enhanced GlmMoeDsaTokenizer.from_pretrained to check for and load chat_template.jinja file, reading its contents and assigning to hf_tokenizer.chat_template if present.
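The chat-template loading step can be sketched as below. This is an assumption-laden sketch, not the PR's implementation: the function name is hypothetical, and `hf_tokenizer` is any object exposing a writable `chat_template` attribute (as HF tokenizers do).

```python
import os

def attach_chat_template(checkpoint_dir, hf_tokenizer):
    """If a chat_template.jinja file ships with the checkpoint, attach its
    contents to the tokenizer so chat templating can pick it up."""
    template_path = os.path.join(checkpoint_dir, "chat_template.jinja")
    if os.path.isfile(template_path):
        with open(template_path, "r", encoding="utf-8") as f:
            hf_tokenizer.chat_template = f.read()
    # If the file is absent, the tokenizer is returned unchanged.
    return hf_tokenizer
```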
Speculative Decoding Refactoring
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
Extracted speculative decoding resource and drafter construction into create_spec_components(max_seq_len) helper. Components are now explicitly reset (None) and recreated during KV-cache estimation reinitialization, with explicit gc.collect() call between cycles.
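The reset-and-recreate cycle described above can be sketched as follows. Names mirror the walkthrough but the class and callable here are stand-ins, not the executor's actual types.

```python
import gc

class SpecState:
    """Stand-in holder for the speculative-decoding components that the real
    executor keeps alive across KV-cache estimation cycles."""

    def __init__(self, create_spec_components, max_seq_len):
        self.spec_resource_manager, self.drafter = create_spec_components(max_seq_len)

    def reinitialize(self, create_spec_components, max_seq_len):
        # Drop the old components first so gc.collect() can reclaim their
        # memory before the new allocation happens.
        self.spec_resource_manager = None
        self.drafter = None
        gc.collect()
        # Rebuild fresh components sized for the (possibly revised) max_seq_len.
        self.spec_resource_manager, self.drafter = create_spec_components(max_seq_len)
```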

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check — ⚠️ Warning. Explanation: PR description is incomplete and missing critical required sections. Resolution: Complete the Test Coverage section with specific test cases, provide a more detailed explanation of the root cause and how the fix addresses it, and clarify which checklist items were verified.

✅ Passed checks (2 passed)

  • Docstring Coverage — ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.
  • Title check — ✅ Passed. The title describes fixing constrained decoding for GLM5, which is the main objective of the PR, though it contains a typo ('contrained' vs 'constrained'). The changes support this by adding GLM-specific tokenizer handling and chat template loading.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)

720-741: Consider documenting side effects of the helper function.

The create_spec_components function modifies the outer resources dict as a side effect (lines 728-731). While functional, this implicit mutation pattern can be confusing for maintainers.

A brief docstring clarifying that the function updates resources[SPEC_RESOURCE_MANAGER] would improve readability.

📝 Suggested documentation
     def create_spec_components(max_seq_len):
+        """Create speculative decoding components.
+        
+        Side effect: Updates resources[SPEC_RESOURCE_MANAGER] if spec_resource_manager is created.
+        
+        Returns:
+            Tuple of (spec_resource_manager, drafter).
+        """
         with allocation_scope(ExecutorMemoryType.SPEC_RESOURCES):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py` around lines 720 -
741, Add a short docstring to create_spec_components explaining that in addition
to returning (spec_resource_manager, drafter) it may mutate the outer resources
dict by setting or removing
resources[ResourceManagerType.SPEC_RESOURCE_MANAGER]; mention the
allocation_scope side effects (ExecutorMemoryType.SPEC_RESOURCES and
ExecutorMemoryType.DRAFTER) so callers know the function updates external state
as well as its return values.

877-880: Unused variable spec_resource_manager after unpacking.

The spec_resource_manager returned from create_spec_components is not used after assignment—the function already stores it in the resources dict internally. Use _ to indicate the value is intentionally discarded.

♻️ Proposed fix
         spec_resource_manager = None
         drafter = None
         gc.collect()
-        spec_resource_manager, drafter = create_spec_components(max_seq_len)
+        _, drafter = create_spec_components(max_seq_len)

For consistency, apply the same pattern at line 804:

-    spec_resource_manager, drafter = create_spec_components(max_seq_len)
+    _, drafter = create_spec_components(max_seq_len)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py` around lines 877 -
880, The unpacked return from create_spec_components currently assigns
spec_resource_manager and drafter but spec_resource_manager is unused because
create_spec_components already stores it in the resources dict; change the
assignment to use _ for the discarded value (e.g., "_, drafter =
create_spec_components(max_seq_len)") mirroring the existing pattern used
elsewhere to indicate the first value is intentionally ignored and keep drafter
assigned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py`:
- Around line 288-302: The custom tokenizer import block (checking
llm_args.custom_tokenizer, using TOKENIZER_ALIASES, rsplit to get
module_path/class_name, and calling tokenizer_class.from_pretrained or
TransformersTokenizer.from_pretrained) lacks error handling and duplicates logic
in llm_args.py; wrap the dynamic import/attribute lookup and from_pretrained
call in a try/except that catches ImportError/AttributeError/Exception and
raises or logs a clear message including the provided tokenizer identifier and
underlying exception, or better yet move the resolution/loading logic into a
shared helper (e.g., a new function in tensorrt_llm.tokenizer.__init__ like
load_tokenizer(checkpoint_dir, tokenizer_identifier, trust_remote_code)) and
call that helper from both py_executor_creator.py and llm_args.py to centralize
error handling and avoid duplication.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 512060e0-e087-43de-9b77-1e65887eb721

📥 Commits

Reviewing files that changed from the base of the PR and between 2fe39c1 and d31f896.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/tokenizer/__init__.py
  • tensorrt_llm/tokenizer/glm_moe_dsa/tokenizer.py

Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
@cascade812
Collaborator Author

/run bot

@cascade812 cascade812 changed the title [None][fix] Fix Contrained decoding for GLM5 [None][fix] Fix contrained decoding for GLM5 Apr 9, 2026
@cascade812
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42569 [ run ] triggered by Bot. Commit: 9122dc4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42569 [ run ] completed with state SUCCESS. Commit: 9122dc4
/LLM/main/L0_MergeRequest_PR pipeline #33302 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
