[None][feat] Add benchmark for all allreduce backends #12887
yilin-void wants to merge 1 commit into NVIDIA:main from
Conversation
/bot run
📝 Walkthrough
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI/Main
    participant Benchmark as allreduce_benchmark_all
    participant StrategyMgr as Strategy Manager
    participant ProfileFunc as profile_allreduce
    participant AllReduce as AllReduce Instance
    participant UBUtil as UB Utilities
    participant Results as Results/CSV

    CLI->>Benchmark: Run with test_range, strategies, fusions
    Benchmark->>StrategyMgr: Get selected AllReduceStrategy backends
    StrategyMgr-->>Benchmark: Return strategy list
    loop For each fusion op
        Benchmark->>ProfileFunc: Generate shapes from test_range
        loop For each shape
            loop For each strategy
                alt Strategy supports UB
                    Benchmark->>UBUtil: copy_to_userbuffers
                    Benchmark->>UBUtil: userbuffers_allreduce_finalize
                    UBUtil->>AllReduce: Execute with UB path
                else Regular strategy
                    Benchmark->>ProfileFunc: profile_allreduce(allreduce_instance, dtype)
                    ProfileFunc->>AllReduce: Execute and measure
                end
                AllReduce-->>Benchmark: Timing results
                Benchmark->>Benchmark: Compute algbw
            end
        end
        Benchmark->>Benchmark: Print fusion table with results
    end
    opt save_csv provided
        Benchmark->>Results: Write CSV file
    end
    Results-->>CLI: Benchmark complete
```
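The "Compute algbw" step in the diagram follows the nccl-tests convention: algorithm bandwidth is bytes moved divided by measured time, and bus bandwidth applies the ring-allreduce factor 2(n-1)/n. A minimal sketch of that arithmetic (the helper name and units are illustrative, not taken from the PR):

```python
def allreduce_bandwidth(size_bytes: int, time_us: float, world_size: int):
    """nccl-tests style bandwidth: algbw = size / time; busbw adds the
    ring-allreduce correction factor 2 * (n - 1) / n."""
    time_s = time_us * 1e-6
    algbw = size_bytes / time_s / 1e9  # GB/s seen by each rank
    busbw = algbw * 2 * (world_size - 1) / world_size
    return algbw, busbw


# 1 GiB reduced in 1 ms across 8 ranks
algbw, busbw = allreduce_bandwidth(1 << 30, 1000.0, 8)
```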
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (4)
tests/microbenchmarks/all_reduce.py (4)
478-484: Catching broad `Exception` violates coding guidelines.

Per coding guidelines, avoid catching bare exceptions. For AllReduce initialization, consider catching more specific exceptions (e.g., `RuntimeError`, `ValueError`) to avoid masking unexpected errors.

Proposed fix:

```diff
 for strat in strategies:
     try:
         ar_instances[strat] = AllReduce(mapping=mapping,
                                         strategy=strat,
                                         dtype=torch_dtype)
-    except Exception as e:
+    except (RuntimeError, ValueError) as e:
         if rank == 0:
             print(f"[WARN] Cannot init {strat.name}: {e}", flush=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/microbenchmarks/all_reduce.py` around lines 478 - 484, The try/except around AllReduce initialization is catching a broad Exception (in the block that assigns ar_instances[strat] = AllReduce(mapping=mapping, strategy=strat, dtype=torch_dtype)), which violates guidelines; replace the bare except with specific exceptions (e.g., except (RuntimeError, ValueError) as e:) that you expect AllReduce to raise during init, and keep the existing rank==0 warning print, so only expected initialization failures are swallowed while unexpected errors still surface.
60-61: Missing type annotations for new parameters.

Per coding guidelines, function parameters should have type hints.

Proposed fix:

```diff
     bias=None,
-    allreduce_instance=None,
-    dtype=None,
+    allreduce_instance: AllReduce | None = None,
+    dtype: torch.dtype | None = None,
 ):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/microbenchmarks/all_reduce.py` around lines 60 - 61, The parameters allreduce_instance and dtype are missing type hints; update the function signature to annotate them as optional (e.g., allreduce_instance: Optional[Any] and dtype: Optional[Any]) since they default to None, and add the required imports (from typing import Optional, Any) at the top of the module; ensure you update the signature where these parameters are declared (reference symbols: allreduce_instance, dtype) and run tests to confirm no further type errors.
544-547: Same issue: catching broad `Exception`.

Consider catching specific exceptions here as well.

Proposed fix:

```diff
-    except Exception as e:
+    except (RuntimeError, ValueError, torch.cuda.CudaError) as e:
         if rank == 0:
             print(f"  [SKIP] {sn} @ {_fmt_size(msg_bytes)}: {e}", flush=True)
         row[f"{sn}_time"] = row[f"{sn}_algbw"] = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/microbenchmarks/all_reduce.py` around lines 544 - 547, The except block currently catches a broad Exception; narrow it to the specific errors expected (e.g., RuntimeError, OSError, ValueError) instead of Exception in the try/except around the benchmark step that references rank, sn, _fmt_size and row in tests/microbenchmarks/all_reduce.py; update the except clause to list those concrete exception types, keep the same handling (printing the skip message and setting row[f"{sn}_time"] and row[f"{sn}_algbw"] to None) for those known errors, and add a final generic except Exception as e: raise to re-raise any unexpected exceptions so they surface during testing.
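The pattern this prompt asks for — swallow only the failure modes a strategy is expected to hit, and let anything unexpected propagate — can be sketched generically (the function and exception list below are illustrative, not the PR's code):

```python
def run_step(fn):
    """Run one benchmark step; return None on known skip conditions."""
    try:
        return fn()
    except (RuntimeError, ValueError, OSError) as e:
        # Expected failures (unsupported shape, missing backend, ...)
        print(f"  [SKIP] {e}", flush=True)
        return None
    # Any other exception type propagates to the caller unchanged.


def unsupported():
    raise ValueError("unsupported shape")


assert run_step(lambda: 42) == 42
assert run_step(unsupported) is None
```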
549-555: Consider distinguishing skipped vs. failed vs. zero in CSV output.

Currently, skipped strategies record `0.0` for `time_us` and `algbw_GBps`, which is indistinguishable from actual zero values (unlikely but possible) or failures. Consider using `None` or an empty string to make the CSV more accurate for downstream analysis.

Proposed alternative:

```diff
 csv_rows.append({
     "world_size": world_size,
     "dtype": dtype,
     "fusion": fusion_name,
     "num_tokens": num_tokens,
     "hidden_size": hidden_size,
     "size_bytes": msg_bytes,
     "strategy": sn,
-    "time_us": row[f"{sn}_time"] or 0.0,
-    "algbw_GBps": row[f"{sn}_algbw"] or 0.0,
+    "time_us": row[f"{sn}_time"] if row[f"{sn}_time"] is not None else "",
+    "algbw_GBps": row[f"{sn}_algbw"] if row[f"{sn}_algbw"] is not None else "",
 })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/microbenchmarks/all_reduce.py` around lines 549 - 555, The CSV currently uses "or 0.0" for time and algbw fields which collapses skipped/failed values into 0.0; change the construction of the dict appended to csv_rows so that the values for f"{sn}_time" and f"{sn}_algbw" are set to None or an empty string when the source row value is missing (e.g., row.get(f"{sn}_time") is None) instead of falling back to 0.0; update the keys "time_us" and "algbw_GBps" in the csv_rows.append call (and any usage of row[f"{sn}_time"] / row[f"{sn}_algbw"]) to use explicit presence checks so downstream CSVs clearly distinguish skipped/failed entries from real zero measurements.
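Writing blanks for missing measurements keeps skipped entries distinguishable once the CSV is loaded downstream; a minimal sketch with `csv.DictWriter` (the column names mirror the PR's, the rows are made up):

```python
import csv
import io

rows = [
    {"strategy": "NCCL", "time_us": 12.3, "algbw_GBps": 87.1},
    {"strategy": "UB", "time_us": None, "algbw_GBps": None},  # skipped
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["strategy", "time_us", "algbw_GBps"])
writer.writeheader()
for row in rows:
    # Emit "" explicitly for missing values so a real 0.0 stays meaningful.
    writer.writerow({k: ("" if v is None else v) for k, v in row.items()})

print(buf.getvalue())
```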
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e85e5ea7-0932-4435-aa0c-ae9f82cb58e4
📒 Files selected for processing (1)
tests/microbenchmarks/all_reduce.py
```python
# ── nccl-tests style comprehensive benchmark (--benchmark mode) ──────────────

_STRATEGY_MAP = {
    "NCCL": AllReduceStrategy.NCCL,
    "NCCL_SYMMETRIC": AllReduceStrategy.NCCL_SYMMETRIC,
    "UB": AllReduceStrategy.UB,
    "ONESHOT": AllReduceStrategy.ONESHOT,
    "TWOSHOT": AllReduceStrategy.TWOSHOT,
    "AUTO": AllReduceStrategy.AUTO,
    "MNNVL": AllReduceStrategy.MNNVL,
}
_UB_STRATEGIES = {AllReduceStrategy.NCCL_SYMMETRIC, AllReduceStrategy.UB}
_FUSION_MAP = {
    "NONE": AllReduceFusionOp.NONE,
    "RESIDUAL_RMS_NORM": AllReduceFusionOp.RESIDUAL_RMS_NORM,
    "RESIDUAL_RMS_NORM_QUANT_FP8": AllReduceFusionOp.RESIDUAL_RMS_NORM_QUANT_FP8,
    "RESIDUAL_RMS_NORM_QUANT_NVFP4": AllReduceFusionOp.RESIDUAL_RMS_NORM_QUANT_NVFP4,
```
Pipeline failure: yapf formatting not compliant.
The pre-commit hook reports that yapf modified files, indicating formatting doesn't match project standards. Please run yapf on this file to fix formatting issues before merging.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/microbenchmarks/all_reduce.py` around lines 284 - 300, The file has
yapf formatting issues; run the project's formatter (yapf) on
tests/microbenchmarks/all_reduce.py to reformat the code so pre-commit passes.
In particular, ensure the mapping blocks around _STRATEGY_MAP, _UB_STRATEGIES,
and _FUSION_MAP are wrapped/indented and line-broken per project style (e.g.,
consistent commas, spacing, and alignment) so that yapf no longer changes the
file.
PR_Github #42513 [ run ] triggered by Bot. Commit:

a0988e3 to 394a56f (Compare)

/bot run

/bot kill

PR_Github #42517 [ run ] triggered by Bot. Commit:

PR_Github #42513 [ run ] completed with state

PR_Github #42520 [ kill ] triggered by Bot. Commit:

PR_Github #42517 [ run ] completed with state

PR_Github #42520 [ kill ] completed with state

Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>

394a56f to 39abe8c (Compare)

/bot run

PR_Github #42525 [ run ] triggered by Bot. Commit:

PR_Github #42525 [ run ] completed with state
Add benchmark for all allreduce backends.
usage example:
results on H200x8:
Summary by CodeRabbit
- `--benchmark` CLI flag to activate enhanced benchmarking capabilities

Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help

To see a list of available CI bot commands, please comment `/bot help`.