
[https://nvbugs/6050489][fix] fix agg pp4 hang issue#12888

Open
bo-nv wants to merge 1 commit into NVIDIA:main from bo-nv:main-6050489

Conversation

@bo-nv
Collaborator

@bo-nv bo-nv commented Apr 9, 2026

Summary by CodeRabbit

  • Chores
    • Improved internal MPI communication responsiveness in the executor backend by enhancing non-blocking polling and progress handling mechanisms.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Bo Deng <deemod@nvidia.com>
@bo-nv bo-nv self-assigned this Apr 9, 2026
@bo-nv bo-nv requested a review from a team as a code owner April 9, 2026 09:12
@bo-nv bo-nv requested review from Shixiaowei02 and joyang-nv April 9, 2026 09:12
@bo-nv bo-nv assigned pcastonguay and unassigned pcastonguay Apr 9, 2026
@bo-nv bo-nv requested a review from pcastonguay April 9, 2026 09:17
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

Modified the PyExecutor broadcast thread to use non-blocking queue polling with integrated MPI progress testing. Added a new _get_executed_batch() method that attempts short timeouts on the queue and periodically calls handle.test() on pending MPI send handles to enable progress during idle waiting periods.

Changes

  • MPI Progress Polling — tensorrt_llm/_torch/pyexecutor/py_executor.py: Added the Empty import from queue. Introduced a _get_executed_batch() method for non-blocking polling with a 0.001 s timeout, triggering handle.test() on pending MPI send handles. Modified _broadcast_sample_state_loop to use the new polling method instead of blocking queue operations, replacing the explicit "flush last isend" logic with continuous MPI progress attempts.
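To make the mechanism above concrete, here is a minimal stand-alone sketch of the polling pattern the walkthrough describes: block on the queue in short slices and, whenever a slice times out, call test() on pending send handles so MPI can make progress while the thread is otherwise idle. The class and handle below are stubs invented for illustration (a real MPI isend request, e.g. mpi4py's Request, exposes a test() method with a similar contract); they are not the actual PyExecutor code.

```python
import queue

POLL_TIMEOUT_S = 0.001  # short slice: stays responsive without busy-spinning


class FakeSendHandle:
    """Stub standing in for an MPI isend request that exposes test()."""

    def __init__(self, done_after: int):
        self._remaining = done_after

    def test(self):
        # Mimics mpi4py Request.test(): returns (completed, status).
        self._remaining -= 1
        return (self._remaining <= 0, None)


class BroadcastPoller:
    """Illustrative model of the broadcast thread's queue-polling loop."""

    def __init__(self):
        self.executed_batch_queue = queue.Queue()
        self.send_handles = []

    def get_executed_batch(self):
        """Wait for the next batch, driving MPI progress while idle."""
        while True:
            try:
                return self.executed_batch_queue.get(timeout=POLL_TIMEOUT_S)
            except queue.Empty:
                # Idle slice: nudge pending isends forward and drop
                # the ones that have completed.
                self.send_handles = [
                    h for h in self.send_handles if not h.test()[0]
                ]


poller = BroadcastPoller()
poller.send_handles.append(FakeSendHandle(done_after=2))
poller.executed_batch_queue.put("batch-0")
print(poller.get_executed_batch())  # -> batch-0
```

The point of the pattern is that the old blocking get() never returned control to the loop, so pending isends were only progressed when a new batch arrived; with short timeouts, progress is attempted continuously even when the queue stays empty.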

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check ⚠️ Warning — The PR description contains only the template with empty placeholders; the author did not fill in the required Description or Test Coverage sections, making it impossible to understand the issue, solution, or test validation. Resolution: fill in the Description section explaining the hang issue and how the MPI progress fix resolves it, and provide Test Coverage details demonstrating that the fix has been validated.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check ✅ Passed — The title clearly references a specific NVBugs ticket and indicates this is a fix for a hang issue in pipeline-parallel (pp4) aggregation, which aligns with the code changes that address MPI progress handling.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1575-1579: Annotate the new polling helper and name the timeout.

This helper can return the shutdown sentinel, but that contract is implicit today, and 0.001 is a non-obvious tuning knob in a hot idle loop. Please add -> BatchStatePP | None and give the timeout a name so the shutdown path and latency/CPU trade-off stay obvious.

♻️ Suggested cleanup
-    def _get_executed_batch(self):
+    def _get_executed_batch(self) -> BatchStatePP | None:
+        poll_timeout_s = 0.001
         while True:
             try:
-                return self.executed_batch_queue.get(timeout=0.001)
+                return self.executed_batch_queue.get(timeout=poll_timeout_s)
             except Empty:

As per coding guidelines, "Always annotate functions with type hints; make the return type None if the function does not return anything."
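Applied to a minimal stand-alone model of the helper, the suggested cleanup (explicit return annotation plus a named timeout constant) might look as follows. BatchStatePP and the executor internals are stubbed here for illustration; the constant name EXECUTED_BATCH_POLL_TIMEOUT is the reviewer's suggestion, not an identifier from the actual codebase.

```python
from __future__ import annotations

import queue
from dataclasses import dataclass
from typing import Optional

# Named constant suggested by the review, replacing the magic 0.001 literal.
EXECUTED_BATCH_POLL_TIMEOUT = 0.001  # seconds


@dataclass
class BatchStatePP:
    """Stub standing in for the real pyexecutor batch-state type."""
    batch_id: int


class Executor:
    def __init__(self) -> None:
        self.executed_batch_queue: queue.Queue = queue.Queue()
        self.send_handles: list = []

    def _get_executed_batch(self) -> Optional[BatchStatePP]:
        """Poll the queue; may return None, the shutdown sentinel.

        While the queue is empty, call test() on pending send handles so
        MPI can make progress during idle waits.
        """
        while True:
            try:
                return self.executed_batch_queue.get(
                    timeout=EXECUTED_BATCH_POLL_TIMEOUT)
            except queue.Empty:
                for handle in self.send_handles:
                    handle.test()
```

With the annotation in place, callers see at the type level that a None (shutdown) result must be handled before touching batch fields.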

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 1575 - 1579,
Annotate the helper _get_executed_batch with an explicit return type of
BatchStatePP | None and document that it may return the shutdown sentinel (None)
to make the contract explicit; replace the magic literal 0.001 with a clearly
named timeout constant (e.g. EXECUTED_BATCH_POLL_TIMEOUT) so the idle-loop
latency/CPU tradeoff is obvious and can be tuned, and update any
callers/comments to reflect the None/shutdown return path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6937c9fd-ef4f-4850-88a5-c10ae61e056d

📥 Commits

Reviewing files that changed from the base of the PR and between 2dff089 and cfa2258.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py

@bo-nv
Collaborator Author

bo-nv commented Apr 9, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@bo-nv bo-nv requested a review from Tabrizian April 9, 2026 09:34
@tensorrt-cicd
Collaborator

PR_Github #42519 [ run ] triggered by Bot. Commit: cfa2258 Link to invocation

Collaborator


Do we have a test to cover this bug fix? If possible, add a test to verify the hang is fixed.



4 participants