[https://nvbugs/6050489][fix] fix agg pp4 hang issue #12888
bo-nv wants to merge 1 commit into NVIDIA:main
Conversation
Signed-off-by: Bo Deng <deemod@nvidia.com>
📝 Walkthrough
Modified the PyExecutor broadcast thread to use non-blocking queue polling with integrated MPI progress testing. Added a new
Changes
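The walkthrough describes polling a queue with a short timeout while driving MPI progress between attempts. A minimal sketch of that pattern, assuming names like `poll_with_progress` and `POLL_TIMEOUT_S` that are illustrative and not the PR's actual code:

```python
import queue

# Hypothetical constant name; the PR itself uses a bare 0.001 literal.
POLL_TIMEOUT_S = 0.001

def poll_with_progress(work_queue, progress, max_polls=10_000):
    """Poll work_queue with a short timeout, calling progress() between
    attempts. In the real executor this is where outstanding MPI requests
    would be tested, so a blocked queue.get() can no longer starve MPI
    progress and cause the pipeline-parallel ranks to hang.

    Returns the first item received, which may be a shutdown sentinel.
    """
    for _ in range(max_polls):
        progress()  # give MPI (or any other engine) a turn
        try:
            return work_queue.get(timeout=POLL_TIMEOUT_S)
        except queue.Empty:
            continue  # nothing queued yet; poll again
    raise TimeoutError("no item arrived within the polling budget")
```

The key design point is that the loop never blocks indefinitely in `get()`, so the side channel (`progress()`) is serviced at a bounded interval.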
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1575-1579: Annotate the new polling helper and name the timeout.

This helper can return the shutdown sentinel, but that contract is implicit today, and `0.001` is a non-obvious tuning knob in a hot idle loop. Please add `-> BatchStatePP | None` and give the timeout a name so the shutdown path and latency/CPU trade-off stay obvious.

♻️ Suggested cleanup

```diff
-    def _get_executed_batch(self):
+    def _get_executed_batch(self) -> BatchStatePP | None:
+        poll_timeout_s = 0.001
         while True:
             try:
-                return self.executed_batch_queue.get(timeout=0.001)
+                return self.executed_batch_queue.get(timeout=poll_timeout_s)
             except Empty:
```

As per coding guidelines, "Always annotate functions with type hints; make the return type `None` if the function does not return anything."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 1575 - 1579, Annotate the helper _get_executed_batch with an explicit return type of BatchStatePP | None and document that it may return the shutdown sentinel (None) to make the contract explicit; replace the magic literal 0.001 with a clearly named timeout constant (e.g. EXECUTED_BATCH_POLL_TIMEOUT) so the idle-loop latency/CPU tradeoff is obvious and can be tuned, and update any callers/comments to reflect the None/shutdown return path.
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py
/bot run --add-multi-gpu-test --disable-fail-fast |
PR_Github #42519 [ run ] triggered by Bot. Commit: |
Do we have a test to cover this bug fix? If possible, add a test to verify the hang is fixed.
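One hedged way to cover a hang fix (a sketch under stated assumptions, not an existing TRT-LLM test helper): run the previously hanging path under a deadline and fail if it does not finish. The helper name `assert_finishes_within` and how the pp4 iteration would be invoked are illustrative.

```python
import threading

def assert_finishes_within(fn, timeout_s):
    """Fail if fn does not return within timeout_s seconds.

    The worker runs as a daemon thread so a genuinely hung fn cannot
    block interpreter shutdown after the assertion fires.
    """
    t = threading.Thread(target=fn, daemon=True)
    t.start()
    t.join(timeout=timeout_s)
    assert not t.is_alive(), f"{fn.__name__} hung past {timeout_s}s"
```

In a real multi-GPU regression test the deadline would need generous headroom over the expected iteration time to avoid flakiness on loaded CI machines.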
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
- [ ] PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- [ ] PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- [ ] Test cases are provided for new code paths (see test instructions).
- [ ] Any new dependencies have been scanned for license and vulnerabilities.
- [ ] CODEOWNERS updated if ownership changes.
- [ ] Documentation updated as needed.
- [ ] Update the tava architecture diagram if there is a significant design change in the PR.
- [ ] The reviewers assigned automatically/manually are appropriate for the PR.

Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.