[None][feat] Add benchmark for all allreduce backend #12887

Open
yilin-void wants to merge 1 commit into NVIDIA:main from yilin-void:benchmark/allreduce

Conversation

@yilin-void
Collaborator

@yilin-void yilin-void commented Apr 9, 2026

Add a benchmark covering all AllReduce backends.

Usage example:

mpirun -n 8 --oversubscribe --allow-run-as-root python TensorRT-LLM/tests/microbenchmarks/all_reduce.py --benchmark --enable_cudagraph

Results on H200x8:

================================================================================
  TRT-LLM AllReduce Benchmark
  world_size=8  dtype=bfloat16  SM=90  cudagraph=True  inner=200  outer=10
  Strategies : NCCL, NCCL_SYMMETRIC, UB, ONESHOT, TWOSHOT, AUTO, MNNVL
  Fusions    : NONE, RESIDUAL_RMS_NORM, RESIDUAL_RMS_NORM_QUANT_FP8
================================================================================
# Fusion: NONE    world_size=8    algbw = size / time (GB/s)
#
#       size    ntok    hdim           NCCL               NCCL_SYMMETRIC                UB                   ONESHOT                 TWOSHOT                   AUTO                   MNNVL                       BEST
#                               time(us)       algbw    time(us)       algbw    time(us)       algbw    time(us)       algbw    time(us)       algbw    time(us)       algbw    time(us)       algbw                  
#---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        256B       1     128       14.27        0.02        9.05        0.03         N/A         N/A        2.54        0.10         N/A         N/A        2.62        0.10        2.62        0.10           ONESHOT
        2.5K       1    1280       14.99        0.17        9.52        0.27         N/A         N/A        3.28        0.78         N/A         N/A        2.98        0.86        2.98        0.86              AUTO
         24K       3    4096       19.77        1.24       12.35        1.99         N/A         N/A        3.20        7.68         N/A         N/A        2.83        8.67        2.84        8.67              AUTO
        248K      31    4096       23.17       10.96       17.66       14.38         N/A         N/A        9.47       26.81       49.56        5.12        4.33       58.64        4.33       58.72             MNNVL
       2.44M     312    4096       41.27       61.93       31.60       80.89         N/A         N/A       73.40       34.82       35.81       71.37       16.68      153.27       16.68      153.24              AUTO
      24.41M    3125    4096      164.72      155.42      144.53      177.13         N/A         N/A      627.73       40.78      199.20      128.51      133.94      191.13      133.94      191.13              AUTO
     244.14M   31250    4096     1007.47      254.10     1094.76      233.84         N/A         N/A         N/A         N/A         N/A         N/A     1298.58      197.14     1298.62      197.13              NCCL

# Fusion: RESIDUAL_RMS_NORM    world_size=8    algbw = size / time (GB/s)
#
#       size    ntok    hdim           NCCL               NCCL_SYMMETRIC                UB                   ONESHOT                 TWOSHOT                   AUTO                   MNNVL                       BEST
#                               time(us)       algbw    time(us)       algbw    time(us)       algbw    time(us)       algbw    time(us)       algbw    time(us)       algbw    time(us)       algbw                  
#---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        256B       1     128       15.67        0.02        5.12        0.05        3.36        0.08        3.20        0.08         N/A         N/A        3.80        0.07        3.80        0.07           ONESHOT
        2.5K       1    1280       16.35        0.16        5.28        0.48        3.41        0.75        3.99        0.64         N/A         N/A        4.02        0.64        4.02        0.64                UB
         24K       3    4096       21.35        1.15        6.71        3.66        4.44        5.54        5.56        4.42         N/A         N/A        4.60        5.35        4.60        5.34                UB
        248K      31    4096       24.95       10.18       10.83       23.45        6.41       39.63       11.41       22.25       60.13        4.22        5.59       45.47        5.58       45.53             MNNVL
       2.44M     312    4096       44.03       58.05       21.47      119.05       16.18      157.99       68.61       37.25       45.91       55.67       17.78      143.78       17.77      143.81                UB
      24.41M    3125    4096      189.68      134.96      140.80      181.82      101.46      252.33      636.68       40.21      224.27      114.15      166.11      154.11      166.13      154.10                UB
     244.14M   31250    4096     1258.81      203.37     1345.77      190.23      946.82      270.38         N/A         N/A         N/A         N/A     1601.68      159.83     1601.73      159.83                UB

# Fusion: RESIDUAL_RMS_NORM_QUANT_FP8    world_size=8    algbw = size / time (GB/s)
#
#       size    ntok    hdim           NCCL               NCCL_SYMMETRIC                UB                   ONESHOT                 TWOSHOT                   AUTO                   MNNVL                       BEST
#                               time(us)       algbw    time(us)       algbw    time(us)       algbw    time(us)       algbw    time(us)       algbw    time(us)       algbw    time(us)       algbw                  
#---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        256B       1     128       18.08        0.01        7.50        0.03        3.63        0.07        3.26        0.08         N/A         N/A        3.26        0.08        3.27        0.08              AUTO
        2.5K       1    1280       18.80        0.14        7.69        0.33        3.61        0.71        4.04        0.63         N/A         N/A        4.05        0.63        4.04        0.63                UB
         24K       3    4096       23.86        1.03        9.17        2.68        4.59        5.35        5.62        4.37         N/A         N/A        5.64        4.36        5.64        4.36                UB
        248K      31    4096       27.71        9.16       13.46       18.87        6.34       40.08       11.42       22.23       60.88        4.17       11.41       22.26       11.42       22.24                UB
       2.44M     312    4096       47.40       53.92       24.62      103.81       14.83      172.40       68.14       37.51       46.62       54.83       24.59      103.94       24.62      103.81                UB
      24.41M    3125    4096      201.48      127.06      152.62      167.74       88.24      290.13      636.27       40.23      224.00      114.29      152.60      167.76      152.60      167.76                UB
     244.14M   31250    4096     1354.52      189.00     1439.43      177.85      821.09      311.78         N/A         N/A         N/A         N/A     1353.94      189.08     1354.66      188.98                UB

================================================================================
  Summary: peak algbw (GB/s) per strategy per fusion
================================================================================
  fusion                                         NCCL  NCCL_SYMMETRIC              UB         ONESHOT         TWOSHOT            AUTO           MNNVL
  ---------------------------------------------------------------------------------------------------------------------------------------------------
  NONE                                         254.10          233.84            0.00           40.78          128.51          197.14          197.13
  RESIDUAL_RMS_NORM                            203.37          190.23          270.38           40.21          114.15          159.83          159.83
  RESIDUAL_RMS_NORM_QUANT_FP8                  189.00          177.85          311.78           40.23          114.29          189.08          188.98
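The size and algbw columns in the tables above follow the nccl-tests convention stated in the headers (algbw = size / time). A quick sketch of the arithmetic for a bf16 tensor of shape [ntok, hdim] (2 bytes per element); the helper names here are illustrative, not from the PR:

```python
def msg_bytes(ntok: int, hdim: int, dtype_size: int = 2) -> int:
    """Message size in bytes for a [ntok, hdim] tensor; bf16 is 2 bytes/elem."""
    return ntok * hdim * dtype_size

def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s, matching the tables' size/time definition."""
    return size_bytes / (time_us * 1e-6) / 1e9

# The 2.44M row: 312 tokens x 4096 hidden in bf16.
size = msg_bytes(312, 4096)   # 2,555,904 bytes, i.e. ~2.44 MiB
bw = algbw_gbps(size, 16.68)  # ~153.2 GB/s, matching the MNNVL column
```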

Summary by CodeRabbit

  • New Features
    • Added new benchmarking mode with formatted table output and multi-strategy comparison
    • Introduced --benchmark CLI flag to activate enhanced benchmarking capabilities
    • Added 2D exploration for input shape benchmarking
    • Enabled CSV export functionality for benchmark results
    • Expanded user buffer profiling support across multiple backends
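The CSV export mentioned above could look like the following minimal sketch. The column names are my assumption based on the review comments further down, not the PR's actual schema; skipped strategies are written as empty cells rather than 0.0 so downstream analysis can distinguish "skipped" from a genuine zero:

```python
import csv
import io

def write_results_csv(rows, fileobj):
    """Write one benchmark row per (shape, strategy) pair.

    None values (skipped or failed strategies) become empty cells
    instead of 0.0, keeping them distinguishable in the CSV.
    """
    fields = ["world_size", "dtype", "fusion", "num_tokens", "hidden_size",
              "size_bytes", "strategy", "time_us", "algbw_GBps"]
    writer = csv.DictWriter(fileobj, fieldnames=fields)
    writer.writeheader()
    for row in rows:
        writer.writerow({k: ("" if row.get(k) is None else row[k]) for k in fields})

buf = io.StringIO()
write_results_csv([{"world_size": 8, "dtype": "bfloat16", "fusion": "NONE",
                    "num_tokens": 1, "hidden_size": 128, "size_bytes": 256,
                    "strategy": "UB", "time_us": None, "algbw_GBps": None}], buf)
```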

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@yilin-void yilin-void requested a review from hyukn April 9, 2026 09:11
@yilin-void yilin-void self-assigned this Apr 9, 2026
@yilin-void
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough


Added allreduce_benchmark_all benchmark mode to all_reduce.py with support for profiling multiple AllReduce strategies and fusion operators across configurable input size ranges. Extended profile_allreduce with optional instance and dtype parameters. Added CLI --benchmark flag to switch between original and new benchmark modes.

Changes

Cohort / File(s) Summary
AllReduce Benchmark Enhancement
tests/microbenchmarks/all_reduce.py
Updated profile_allreduce signature with optional allreduce_instance and dtype parameters. Added allreduce_benchmark_all function supporting multi-strategy profiling (including NCCL-tests style iteration across fusion ops, AllReduceStrategy backends, and input shapes). Implemented helper functions for size formatting, UB buffer profiling, and CSV result logging. Added --benchmark CLI flag to toggle between legacy and new benchmark modes.
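The `--benchmark` toggle described above can be sketched with `argparse`; this is a hypothetical subset of the script's CLI (flag names taken from the usage example and walkthrough, defaults assumed):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Subset of the microbenchmark CLI: --benchmark switches from the
    legacy single-strategy mode to the new multi-strategy mode."""
    p = argparse.ArgumentParser(description="AllReduce microbenchmark")
    p.add_argument("--benchmark", action="store_true",
                   help="run the nccl-tests style multi-strategy benchmark")
    p.add_argument("--enable_cudagraph", action="store_true",
                   help="capture the op in a CUDA graph before timing")
    p.add_argument("--save_csv", type=str, default=None,
                   help="optional output path for CSV export of results")
    return p

args = build_parser().parse_args(["--benchmark", "--enable_cudagraph"])
```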

Sequence Diagram

sequenceDiagram
    participant CLI as CLI/Main
    participant Benchmark as allreduce_benchmark_all
    participant StrategyMgr as Strategy Manager
    participant ProfileFunc as profile_allreduce
    participant AllReduce as AllReduce Instance
    participant UBUtil as UB Utilities
    participant Results as Results/CSV

    CLI->>Benchmark: Run with test_range, strategies, fusions
    Benchmark->>StrategyMgr: Get selected AllReduceStrategy backends
    StrategyMgr-->>Benchmark: Return strategy list
    
    loop For each fusion op
        Benchmark->>ProfileFunc: Generate shapes from test_range
        loop For each shape
            loop For each strategy
                alt Strategy supports UB
                    Benchmark->>UBUtil: copy_to_userbuffers
                    Benchmark->>UBUtil: userbuffers_allreduce_finalize
                    UBUtil->>AllReduce: Execute with UB path
                else Regular strategy
                    Benchmark->>ProfileFunc: profile_allreduce(allreduce_instance, dtype)
                    ProfileFunc->>AllReduce: Execute and measure
                end
                AllReduce-->>Benchmark: Timing results
                Benchmark->>Benchmark: Compute algbw
            end
        end
        Benchmark->>Benchmark: Print fusion table with results
    end
    
    opt save_csv provided
        Benchmark->>Results: Write CSV file
    end
    Results-->>CLI: Benchmark complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning PR description is incomplete; missing required Title, Description, and Test Coverage sections from the template. Add a properly formatted PR title following [type] format, fill in the Description section explaining the feature, and list relevant tests for the new benchmark functionality.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change—adding a benchmark for all allreduce backends—and is directly related to the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
tests/microbenchmarks/all_reduce.py (4)

478-484: Catching broad Exception violates coding guidelines.

Per coding guidelines, avoid catching bare exceptions. For AllReduce initialization, consider catching more specific exceptions (e.g., RuntimeError, ValueError) to avoid masking unexpected errors.

Proposed fix
     for strat in strategies:
         try:
             ar_instances[strat] = AllReduce(mapping=mapping, strategy=strat, dtype=torch_dtype)
-        except Exception as e:
+        except (RuntimeError, ValueError) as e:
             if rank == 0:
                 print(f"[WARN] Cannot init {strat.name}: {e}", flush=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/microbenchmarks/all_reduce.py` around lines 478 - 484, The try/except
around AllReduce initialization is catching a broad Exception (in the block that
assigns ar_instances[strat] = AllReduce(mapping=mapping, strategy=strat,
dtype=torch_dtype)), which violates guidelines; replace the bare except with
specific exceptions (e.g., except (RuntimeError, ValueError) as e:) that you
expect AllReduce to raise during init, and keep the existing rank==0 warning
print, so only expected initialization failures are swallowed while unexpected
errors still surface.

60-61: Missing type annotations for new parameters.

Per coding guidelines, function parameters should have type hints.

Proposed fix
     bias=None,
-    allreduce_instance=None,
-    dtype=None,
+    allreduce_instance: AllReduce | None = None,
+    dtype: torch.dtype | None = None,
 ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/microbenchmarks/all_reduce.py` around lines 60 - 61, The parameters
allreduce_instance and dtype are missing type hints; update the function
signature to annotate them as optional (e.g., allreduce_instance: Optional[Any]
and dtype: Optional[Any]) since they default to None, and add the required
imports (from typing import Optional, Any) at the top of the module; ensure you
update the signature where these parameters are declared (reference symbols:
allreduce_instance, dtype) and run tests to confirm no further type errors.

544-547: Same issue: catching broad Exception.

Consider catching specific exceptions here as well.

Proposed fix
-                    except Exception as e:
+                    except (RuntimeError, ValueError, torch.cuda.CudaError) as e:
                         if rank == 0:
                             print(f"  [SKIP] {sn} @ {_fmt_size(msg_bytes)}: {e}", flush=True)
                         row[f"{sn}_time"] = row[f"{sn}_algbw"] = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/microbenchmarks/all_reduce.py` around lines 544 - 547, The except block
currently catches a broad Exception; narrow it to the specific errors expected
(e.g., RuntimeError, OSError, ValueError) instead of Exception in the try/except
around the benchmark step that references rank, sn, _fmt_size and row in
tests/microbenchmarks/all_reduce.py; update the except clause to list those
concrete exception types, keep the same handling (printing the skip message and
setting row[f"{sn}_time"] and row[f"{sn}_algbw"] to None) for those known
errors, and add a final generic except Exception as e: raise to re-raise any
unexpected exceptions so they surface during testing.

549-555: Consider distinguishing skipped vs. failed vs. zero in CSV output.

Currently, skipped strategies record 0.0 for time_us and algbw_GBps, which is indistinguishable from actual zero values (unlikely but possible) or failures. Consider using None or an empty string to make the CSV more accurate for downstream analysis.

Proposed alternative
                 csv_rows.append({
                     "world_size": world_size, "dtype": dtype, "fusion": fusion_name,
                     "num_tokens": num_tokens, "hidden_size": hidden_size,
                     "size_bytes": msg_bytes, "strategy": sn,
-                    "time_us": row[f"{sn}_time"] or 0.0,
-                    "algbw_GBps": row[f"{sn}_algbw"] or 0.0,
+                    "time_us": row[f"{sn}_time"] if row[f"{sn}_time"] is not None else "",
+                    "algbw_GBps": row[f"{sn}_algbw"] if row[f"{sn}_algbw"] is not None else "",
                 })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/microbenchmarks/all_reduce.py` around lines 549 - 555, The CSV
currently uses "or 0.0" for time and algbw fields which collapses skipped/failed
values into 0.0; change the construction of the dict appended to csv_rows so
that the values for f"{sn}_time" and f"{sn}_algbw" are set to None or an empty
string when the source row value is missing (e.g., row.get(f"{sn}_time") is
None) instead of falling back to 0.0; update the keys "time_us" and "algbw_GBps"
in the csv_rows.append call (and any usage of row[f"{sn}_time"] /
row[f"{sn}_algbw"]) to use explicit presence checks so downstream CSVs clearly
distinguish skipped/failed entries from real zero measurements.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e85e5ea7-0932-4435-aa0c-ae9f82cb58e4

📥 Commits

Reviewing files that changed from the base of the PR and between 2dff089 and a0988e3.

📒 Files selected for processing (1)
  • tests/microbenchmarks/all_reduce.py

Comment on lines +284 to +300
# ── nccl-tests style comprehensive benchmark (--benchmark mode) ──────────────

_STRATEGY_MAP = {
    "NCCL": AllReduceStrategy.NCCL,
    "NCCL_SYMMETRIC": AllReduceStrategy.NCCL_SYMMETRIC,
    "UB": AllReduceStrategy.UB,
    "ONESHOT": AllReduceStrategy.ONESHOT,
    "TWOSHOT": AllReduceStrategy.TWOSHOT,
    "AUTO": AllReduceStrategy.AUTO,
    "MNNVL": AllReduceStrategy.MNNVL,
}
_UB_STRATEGIES = {AllReduceStrategy.NCCL_SYMMETRIC, AllReduceStrategy.UB}
_FUSION_MAP = {
    "NONE": AllReduceFusionOp.NONE,
    "RESIDUAL_RMS_NORM": AllReduceFusionOp.RESIDUAL_RMS_NORM,
    "RESIDUAL_RMS_NORM_QUANT_FP8": AllReduceFusionOp.RESIDUAL_RMS_NORM_QUANT_FP8,
    "RESIDUAL_RMS_NORM_QUANT_NVFP4": AllReduceFusionOp.RESIDUAL_RMS_NORM_QUANT_NVFP4,
Contributor


⚠️ Potential issue | 🟡 Minor

Pipeline failure: yapf formatting not compliant.

The pre-commit hook reports that yapf modified files, indicating formatting doesn't match project standards. Please run yapf on this file to fix formatting issues before merging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/microbenchmarks/all_reduce.py` around lines 284 - 300, The file has
yapf formatting issues; run the project's formatter (yapf) on
tests/microbenchmarks/all_reduce.py to reformat the code so pre-commit passes.
In particular, ensure the mapping blocks around _STRATEGY_MAP, _UB_STRATEGIES,
and _FUSION_MAP are wrapped/indented and line-broken per project style (e.g.,
consistent commas, spacing, and alignment) so that yapf no longer changes the
file.

@tensorrt-cicd
Collaborator

PR_Github #42513 [ run ] triggered by Bot. Commit: a0988e3 Link to invocation

@yilin-void yilin-void force-pushed the benchmark/allreduce branch from a0988e3 to 394a56f Compare April 9, 2026 09:26
@yilin-void
Collaborator Author

/bot run

@yilin-void
Collaborator Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #42517 [ run ] triggered by Bot. Commit: 394a56f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42513 [ run ] completed with state ABORTED. Commit: a0988e3

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42520 [ kill ] triggered by Bot. Commit: 394a56f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42517 [ run ] completed with state ABORTED. Commit: 394a56f

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42520 [ kill ] completed with state SUCCESS. Commit: 394a56f
Successfully killed previous jobs for commit 394a56f

Link to invocation

Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
@yilin-void yilin-void force-pushed the benchmark/allreduce branch from 394a56f to 39abe8c Compare April 9, 2026 09:52
@yilin-void
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42525 [ run ] triggered by Bot. Commit: 39abe8c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42525 [ run ] completed with state SUCCESS. Commit: 39abe8c
/LLM/main/L0_MergeRequest_PR pipeline #33265 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

