
[None][fix] Fix contrained decoding for GLM5#12869

Open
cascade812 wants to merge 4 commits into NVIDIA:main from cascade812:guided_glm

Conversation

@cascade812
Collaborator

@cascade812 cascade812 commented Apr 9, 2026

Description

This PR fixes an error that occurs when running GLM5 with constrained decoding (guided decoding).

Users need to specify the following config:
llm_kwargs["custom_tokenizer"] = 'glm_moe_dsa'
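As a hedged usage sketch (the model path and the surrounding LLM construction are illustrative assumptions, not from the PR; only the `custom_tokenizer` key is what the PR requires):

```python
# Illustrative sketch of the required kwargs. Only the "custom_tokenizer"
# entry comes from this PR; the model path is a placeholder.
llm_kwargs = {
    "model": "/path/to/GLM5-checkpoint",  # placeholder checkpoint location
    "custom_tokenizer": "glm_moe_dsa",    # alias resolved via TOKENIZER_ALIASES
}
# The kwargs would then be forwarded to the LLM constructor, e.g.
# llm = LLM(**llm_kwargs)  # construction elided; requires a real checkpoint
```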

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • New Features

    • Added support for custom tokenizer aliases, enabling deepseek_v32 and glm_moe_dsa tokenizers.
    • GLM MOE DSA tokenizer now loads chat templates from chat_template.jinja files.
  • Improvements

    • Enhanced speculative decoding resource management with proper reset and reallocation during KV-cache estimation.

Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
@cascade812 cascade812 requested review from a team as code owners April 9, 2026 02:25
@cascade812 cascade812 requested a review from syuoni April 9, 2026 02:25
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

The changes consolidate tokenizer alias definitions into a shared constant, implement dynamic custom tokenizer loading via configurable aliases, add chat template file support to the GLM Moe DSA tokenizer, and refactor speculative decoding components to be reset and recreated during KV-cache estimation reinitialization.

Changes

Tokenizer Alias Consolidation
tensorrt_llm/tokenizer/__init__.py, tensorrt_llm/llmapi/llm_args.py
Moved TOKENIZER_ALIASES mapping from llm_args.py to tensorrt_llm/tokenizer/__init__.py and exported it. Maps short built-in tokenizer names to fully-qualified class paths for runtime resolution.
Custom Tokenizer Loading
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
Added branching logic for guided decoding tokenizer initialization: if llm_args.custom_tokenizer is set, resolves alias via TOKENIZER_ALIASES, dynamically imports the target module/class, and instantiates via from_pretrained; otherwise uses TransformersTokenizer.from_pretrained.
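The alias-resolution-plus-dynamic-import pattern described above can be sketched as follows. This is a minimal, runnable illustration, not the real code: the alias table below stands in for `TOKENIZER_ALIASES`, and a stdlib class is used as the target only so the sketch runs anywhere.

```python
import importlib

# Illustrative alias table in the spirit of TOKENIZER_ALIASES: a short name
# maps to a fully-qualified "module.path.ClassName" string. The real table
# maps names like 'glm_moe_dsa' to tokenizer classes; a stdlib class is used
# here only so the sketch is self-contained.
ALIASES = {"odict": "collections.OrderedDict"}

def resolve_tokenizer_class(name: str):
    """Resolve a short alias (or a fully-qualified path) to a class object."""
    qualified = ALIASES.get(name, name)
    module_path, class_name = qualified.rsplit(".", 1)
    try:
        module = importlib.import_module(module_path)
        return getattr(module, class_name)
    except (ImportError, AttributeError) as e:
        # Surface the offending identifier, as the review comment suggests.
        raise ValueError(f"Cannot load tokenizer class {qualified!r}: {e}") from e

cls = resolve_tokenizer_class("odict")
# The resolved class would then be instantiated via cls.from_pretrained(...)
```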
Chat Template Support
tensorrt_llm/tokenizer/glm_moe_dsa/tokenizer.py
Enhanced GlmMoeDsaTokenizer.from_pretrained to check for and load chat_template.jinja file, reading its contents and assigning to hf_tokenizer.chat_template if present.
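The chat-template loading step can be sketched as below. This is an assumption-laden sketch, not the PR's implementation: the function name is hypothetical, and `hf_tokenizer` is any object exposing a writable `chat_template` attribute (as HF tokenizers do).

```python
import os

def attach_chat_template(checkpoint_dir, hf_tokenizer):
    """If a chat_template.jinja file ships with the checkpoint, attach its
    contents to the tokenizer so chat templating can pick it up."""
    template_path = os.path.join(checkpoint_dir, "chat_template.jinja")
    if os.path.isfile(template_path):
        with open(template_path, "r", encoding="utf-8") as f:
            hf_tokenizer.chat_template = f.read()
    # If the file is absent, the tokenizer is returned unchanged.
    return hf_tokenizer
```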
Speculative Decoding Refactoring
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
Extracted speculative decoding resource and drafter construction into create_spec_components(max_seq_len) helper. Components are now explicitly reset (None) and recreated during KV-cache estimation reinitialization, with explicit gc.collect() call between cycles.
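The reset-and-recreate cycle described above can be sketched as follows. Names mirror the walkthrough but the class and callable here are stand-ins, not the executor's actual types.

```python
import gc

class SpecState:
    """Stand-in holder for the speculative-decoding components that the real
    executor keeps alive across KV-cache estimation cycles."""

    def __init__(self, create_spec_components, max_seq_len):
        self.spec_resource_manager, self.drafter = create_spec_components(max_seq_len)

    def reinitialize(self, create_spec_components, max_seq_len):
        # Drop the old components first so gc.collect() can reclaim their
        # memory before the new allocation happens.
        self.spec_resource_manager = None
        self.drafter = None
        gc.collect()
        # Rebuild fresh components sized for the (possibly revised) max_seq_len.
        self.spec_resource_manager, self.drafter = create_spec_components(max_seq_len)
```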

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check — ⚠️ Warning. Explanation: PR description is incomplete and missing critical required sections. Resolution: Complete the Test Coverage section with specific test cases, provide a more detailed explanation of the root cause and how the fix addresses it, and clarify which checklist items were verified.

✅ Passed checks (2 passed)

  • Docstring Coverage — ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.
  • Title check — ✅ Passed. The title describes fixing constrained decoding for GLM5, which is the main objective of the PR, though it contains a typo ('contrained' vs 'constrained'). The changes support this by adding GLM-specific tokenizer handling and chat template loading.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)

720-741: Consider documenting side effects of the helper function.

The create_spec_components function modifies the outer resources dict as a side effect (lines 728-731). While functional, this implicit mutation pattern can be confusing for maintainers.

A brief docstring clarifying that the function updates resources[SPEC_RESOURCE_MANAGER] would improve readability.

📝 Suggested documentation
     def create_spec_components(max_seq_len):
+        """Create speculative decoding components.
+        
+        Side effect: Updates resources[SPEC_RESOURCE_MANAGER] if spec_resource_manager is created.
+        
+        Returns:
+            Tuple of (spec_resource_manager, drafter).
+        """
         with allocation_scope(ExecutorMemoryType.SPEC_RESOURCES):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py` around lines 720 -
741, Add a short docstring to create_spec_components explaining that in addition
to returning (spec_resource_manager, drafter) it may mutate the outer resources
dict by setting or removing
resources[ResourceManagerType.SPEC_RESOURCE_MANAGER]; mention the
allocation_scope side effects (ExecutorMemoryType.SPEC_RESOURCES and
ExecutorMemoryType.DRAFTER) so callers know the function updates external state
as well as its return values.

877-880: Unused variable spec_resource_manager after unpacking.

The spec_resource_manager returned from create_spec_components is not used after assignment—the function already stores it in the resources dict internally. Use _ to indicate the value is intentionally discarded.

♻️ Proposed fix
         spec_resource_manager = None
         drafter = None
         gc.collect()
-        spec_resource_manager, drafter = create_spec_components(max_seq_len)
+        _, drafter = create_spec_components(max_seq_len)

For consistency, apply the same pattern at line 804:

-    spec_resource_manager, drafter = create_spec_components(max_seq_len)
+    _, drafter = create_spec_components(max_seq_len)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py` around lines 877 -
880, The unpacked return from create_spec_components currently assigns
spec_resource_manager and drafter but spec_resource_manager is unused because
create_spec_components already stores it in the resources dict; change the
assignment to use _ for the discarded value (e.g., "_, drafter =
create_spec_components(max_seq_len)") mirroring the existing pattern used
elsewhere to indicate the first value is intentionally ignored and keep drafter
assigned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py`:
- Around line 288-302: The custom tokenizer import block (checking
llm_args.custom_tokenizer, using TOKENIZER_ALIASES, rsplit to get
module_path/class_name, and calling tokenizer_class.from_pretrained or
TransformersTokenizer.from_pretrained) lacks error handling and duplicates logic
in llm_args.py; wrap the dynamic import/attribute lookup and from_pretrained
call in a try/except that catches ImportError/AttributeError/Exception and
raises or logs a clear message including the provided tokenizer identifier and
underlying exception, or better yet move the resolution/loading logic into a
shared helper (e.g., a new function in tensorrt_llm.tokenizer.__init__ like
load_tokenizer(checkpoint_dir, tokenizer_identifier, trust_remote_code)) and
call that helper from both py_executor_creator.py and llm_args.py to centralize
error handling and avoid duplication.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 512060e0-e087-43de-9b77-1e65887eb721

📥 Commits

Reviewing files that changed from the base of the PR and between 2fe39c1 and d31f896.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/tokenizer/__init__.py
  • tensorrt_llm/tokenizer/glm_moe_dsa/tokenizer.py

Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
@cascade812
Collaborator Author

/run bot

@cascade812 cascade812 changed the title [None][fix] Fix Contrained decoding for GLM5 [None][fix] Fix contrained decoding for GLM5 Apr 9, 2026
@cascade812
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42569 [ run ] triggered by Bot. Commit: 9122dc4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42569 [ run ] completed with state SUCCESS. Commit: 9122dc4
/LLM/main/L0_MergeRequest_PR pipeline #33302 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
