
[https://nvbugs/6050489][fix] fix agg pp4 hang issue#12888

Open
bo-nv wants to merge 1 commit into NVIDIA:main from bo-nv:main-6050489

Conversation

@bo-nv
Collaborator

@bo-nv bo-nv commented Apr 9, 2026

Summary by CodeRabbit

  • Chores
    • Improved internal MPI communication responsiveness in the executor backend by enhancing non-blocking polling and progress handling mechanisms.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Bo Deng <deemod@nvidia.com>
@bo-nv bo-nv self-assigned this Apr 9, 2026
@bo-nv bo-nv requested a review from a team as a code owner April 9, 2026 09:12
@bo-nv bo-nv requested review from Shixiaowei02 and joyang-nv April 9, 2026 09:12
@bo-nv bo-nv assigned pcastonguay and unassigned pcastonguay Apr 9, 2026
@bo-nv bo-nv requested a review from pcastonguay April 9, 2026 09:17
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

Modified the PyExecutor broadcast thread to use non-blocking queue polling with integrated MPI progress testing. Added a new _get_executed_batch() method that attempts short timeouts on the queue and periodically calls handle.test() on pending MPI send handles to enable progress during idle waiting periods.

Changes

  • MPI Progress Polling — tensorrt_llm/_torch/pyexecutor/py_executor.py: Added the Empty import from queue. Introduced a _get_executed_batch() method for non-blocking polling with a 0.001 s timeout, triggering handle.test() on pending MPI send handles. Modified _broadcast_sample_state_loop to use the new polling method instead of blocking queue operations, replacing the explicit "flush last isend" logic with continuous MPI progress attempts.
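To make the mechanism above concrete, here is a minimal stand-alone sketch of the polling pattern the walkthrough describes: block on the queue in short slices and, whenever a slice times out, call test() on pending send handles so MPI can make progress while the thread is otherwise idle. The class and handle below are stubs invented for illustration (a real MPI isend request, e.g. mpi4py's Request, exposes a test() method with a similar contract); they are not the actual PyExecutor code.

```python
import queue

POLL_TIMEOUT_S = 0.001  # short slice: stays responsive without busy-spinning


class FakeSendHandle:
    """Stub standing in for an MPI isend request that exposes test()."""

    def __init__(self, done_after: int):
        self._remaining = done_after

    def test(self):
        # Mimics mpi4py Request.test(): returns (completed, status).
        self._remaining -= 1
        return (self._remaining <= 0, None)


class BroadcastPoller:
    """Illustrative model of the broadcast thread's queue-polling loop."""

    def __init__(self):
        self.executed_batch_queue = queue.Queue()
        self.send_handles = []

    def get_executed_batch(self):
        """Wait for the next batch, driving MPI progress while idle."""
        while True:
            try:
                return self.executed_batch_queue.get(timeout=POLL_TIMEOUT_S)
            except queue.Empty:
                # Idle slice: nudge pending isends forward and drop
                # the ones that have completed.
                self.send_handles = [
                    h for h in self.send_handles if not h.test()[0]
                ]


poller = BroadcastPoller()
poller.send_handles.append(FakeSendHandle(done_after=2))
poller.executed_batch_queue.put("batch-0")
print(poller.get_executed_batch())  # -> batch-0
```

The point of the pattern is that the old blocking get() never returned control to the loop, so pending isends were only progressed when a new batch arrived; with short timeouts, progress is attempted continuously even when the queue stays empty.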

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check ⚠️ Warning — The PR description contains only the template with empty placeholders; the author did not fill in the required Description or Test Coverage sections, making it impossible to understand the issue, solution, or test validation. Resolution: fill in the Description section explaining the hang issue and how the MPI progress fix resolves it, and provide Test Coverage details demonstrating that the fix has been validated.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check ✅ Passed — The title clearly references a specific NVBugs ticket and indicates this is a fix for a hang issue in pipeline-parallel (pp4) aggregation, which aligns with the code changes that address MPI progress handling.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1575-1579: Annotate the new polling helper and name the timeout.

This helper can return the shutdown sentinel, but that contract is implicit today, and 0.001 is a non-obvious tuning knob in a hot idle loop. Please add -> BatchStatePP | None and give the timeout a name so the shutdown path and latency/CPU trade-off stay obvious.

♻️ Suggested cleanup
-    def _get_executed_batch(self):
+    def _get_executed_batch(self) -> BatchStatePP | None:
+        poll_timeout_s = 0.001
         while True:
             try:
-                return self.executed_batch_queue.get(timeout=0.001)
+                return self.executed_batch_queue.get(timeout=poll_timeout_s)
             except Empty:

As per coding guidelines, "Always annotate functions with type hints; make the return type None if the function does not return anything."
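Applied to a minimal stand-alone model of the helper, the suggested cleanup (explicit return annotation plus a named timeout constant) might look as follows. BatchStatePP and the executor internals are stubbed here for illustration; the constant name EXECUTED_BATCH_POLL_TIMEOUT is the reviewer's suggestion, not an identifier from the actual codebase.

```python
from __future__ import annotations

import queue
from dataclasses import dataclass
from typing import Optional

# Named constant suggested by the review, replacing the magic 0.001 literal.
EXECUTED_BATCH_POLL_TIMEOUT = 0.001  # seconds


@dataclass
class BatchStatePP:
    """Stub standing in for the real pyexecutor batch-state type."""
    batch_id: int


class Executor:
    def __init__(self) -> None:
        self.executed_batch_queue: queue.Queue = queue.Queue()
        self.send_handles: list = []

    def _get_executed_batch(self) -> Optional[BatchStatePP]:
        """Poll the queue; may return None, the shutdown sentinel.

        While the queue is empty, call test() on pending send handles so
        MPI can make progress during idle waits.
        """
        while True:
            try:
                return self.executed_batch_queue.get(
                    timeout=EXECUTED_BATCH_POLL_TIMEOUT)
            except queue.Empty:
                for handle in self.send_handles:
                    handle.test()
```

With the annotation in place, callers see at the type level that a None (shutdown) result must be handled before touching batch fields.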

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 1575 - 1579,
Annotate the helper _get_executed_batch with an explicit return type of
BatchStatePP | None and document that it may return the shutdown sentinel (None)
to make the contract explicit; replace the magic literal 0.001 with a clearly
named timeout constant (e.g. EXECUTED_BATCH_POLL_TIMEOUT) so the idle-loop
latency/CPU tradeoff is obvious and can be tuned, and update any
callers/comments to reflect the None/shutdown return path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6937c9fd-ef4f-4850-88a5-c10ae61e056d

📥 Commits

Reviewing files that changed from the base of the PR and between 2dff089 and cfa2258.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py

@bo-nv
Collaborator Author

bo-nv commented Apr 9, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@bo-nv bo-nv requested a review from Tabrizian April 9, 2026 09:34
@tensorrt-cicd
Collaborator

PR_Github #42519 [ run ] triggered by Bot. Commit: cfa2258 Link to invocation

Collaborator


Do we have a test to cover this bug fix? If possible, add a test to verify the hang is fixed.



4 participants