These scripts run microbenchmarks for MoRI-EP (MoRI collective engine) and DeepEP (expert-parallel / MoE communication) on a Slurm cluster with AMD Instinct accelerators and InfiniBand (CX7) networking. The benchmarks validate collective performance and low-latency behavior in both intranode and internode configurations.
The Docker image is built from the MAD repository and is intended for use on clusters with Mellanox CX7 NICs and compatible AMD GPUs (e.g., gfx942). You can adjust the Dockerfile for different NICs or GPU architectures as needed.
- Validate communication stacks — Before running large training or inference workloads (e.g., MoE models, distributed LLMs), microbenchmarks confirm that MoRI and DeepEP collectives achieve expected throughput and latency on your cluster. Regressions in drivers, ROCm, or NIC firmware show up here first.
- Tune and qualify hardware — Results help tune environment variables (e.g., ROCSHMEM_HEAP_SIZE, IBDEVICES), compare NIC/GPU configurations, and qualify new nodes or fabrics (intranode vs. internode, normal vs. low-latency paths).
- Regression and CI — These scripts can be used in CI or release validation to ensure that updates to MoRI, DeepEP, rocSHMEM, or the ROCm stack do not degrade collective performance.
- Research and development — Microbenchmark data supports optimization of collective algorithms, FP8/low-latency paths, and scaling studies across nodes.
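As a concrete illustration of the tuning knobs mentioned above, a job wrapper might export values like the following. These values are examples only, not recommendations; the exact format accepted for ROCSHMEM_HEAP_SIZE depends on your rocSHMEM version.

```shell
# Illustrative environment tuning; values are examples, not recommendations.
export ROCSHMEM_HEAP_SIZE=8589934592   # rocSHMEM symmetric heap size (example: 8 GiB in bytes)
export IBDEVICES=mlx5_0                # InfiniBand device(s) the scripts should use
echo "heap=$ROCSHMEM_HEAP_SIZE devices=$IBDEVICES"
```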
MoRI is a communication backend for the ROCm platform that optimizes inter-node collective operations on GPU clusters. It uses RDMA (Remote Direct Memory Access) for efficient GPU-to-GPU communication across nodes and supports standard collectives. MoRI is used in distributed inference (e.g., with SGLang / vLLM on AMD Instinct).
| Resource | Link |
|---|---|
| MoRI (GitHub) | ROCm/mori |
DeepEP is a high-performance communication library for Mixture-of-Experts (MoE) and expert parallelism, providing GPU all-to-all primitives with optional FP8/BF16 support. The AMD ROCm version uses xGMI/Infinity Fabric for intranode communication and InfiniBand/RoCE for internode communication. DeepEP builds on rocSHMEM (ROCm OpenSHMEM), an intra-kernel networking library that provides GPU-centric, OpenSHMEM-like APIs and a symmetric heap on GPU memory, enabling better communication–computation overlap than host-driven networking.
| Resource | Link |
|---|---|
| DeepEP (GitHub) | ROCm/DeepEP |
| rocSHMEM — What is rocSHMEM? (ROCm docs) | rocSHMEM introduction |
| rocSHMEM (GitHub) | ROCm/ROC_SHMEM |
| Scope | Mode | DeepEP | MoRI-EP |
|---|---|---|---|
| Intranode | Normal | ✅ | ✅ |
| Intranode | Low latency | ✅ | ✅ |
| Internode | Normal | ✅ | ✅ |
| Internode | Low latency | ✅ | ✅ |
- Intranode: Single-node multi-process collectives.
- Internode: Multi-node collectives over InfiniBand.
- Low latency: FP8/low-latency collective paths (e.g., rocSHMEM heap and MoRI FP8).
- Slurm cluster with AMD Instinct nodes.
- CX7 (or compatible) InfiniBand NICs; scripts assume InfiniBand devices are available.
- Docker available on compute nodes (or use a container runtime configured for your cluster).
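A quick preflight check can confirm the expected tools exist on a node before building or submitting. This is a sketch; tool names such as `ibv_devices` come from rdma-core and may differ on your cluster.

```shell
#!/bin/sh
# Preflight: warn (rather than fail) about missing tools, so this can
# also run on login nodes that lack GPUs or InfiniBand userspace tools.
missing=""
for tool in sbatch docker ibv_devices rocminfo; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -n "$missing" ]; then
    echo "WARNING: not found in PATH:$missing"
else
    echo "all expected tools found"
fi
```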
Build from the MAD repository root (not from this script directory). The default Dockerfile targets gfx942 and CX7; edit `docker/large_ep_benchmark.ubuntu.amd.Dockerfile` if your cluster uses a different GPU architecture or NIC.
```shell
# From MAD repository root
docker build -f docker/large_ep_benchmark.ubuntu.amd.Dockerfile -t ep-benchmarking:latest .
```

| Variable | Description | Example |
|---|---|---|
| `DOCKER_IMAGE` | Docker image to run on the cluster | `ep-benchmarking:latest` |
| `IBDEVICES` | InfiniBand device(s) for rocSHMEM | `mlx5_0` (default) |
Optional:

- `LOG_PATH`: Directory for benchmark logs (default: `./logs` in the current working directory).
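The fallback behavior described above can be made explicit in a wrapper script, for example:

```shell
# Use LOG_PATH if set, otherwise fall back to ./logs (the documented default).
LOG_PATH="${LOG_PATH:-./logs}"
mkdir -p "$LOG_PATH"
echo "benchmark logs will go to: $LOG_PATH"
```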
Navigate to the scripts directory and submit the job. The script supports 1 to N nodes; Slurm allocates the nodes and the job runs both intranode and internode tests when multiple nodes are requested.
```shell
cd scripts/large-ep-benchmark
export DOCKER_IMAGE=ep-benchmarking:latest
export IBDEVICES=mlx5_0
sbatch -p <partition> -N <num-nodes> run_benchmark.sbatch
```

Examples:
- Single node (intranode only):

  ```shell
  sbatch -p <partition> -N 1 run_benchmark.sbatch
  ```

- Multi-node (intranode and internode):

  ```shell
  sbatch -p <partition> -N 4 run_benchmark.sbatch
  ```
> **Note:** Ensure the partition and account allow the requested number of nodes, and that compute nodes have Docker (or the configured container runtime) and access to the built image (e.g., via a registry or shared image path).
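If compute nodes cannot pull from a registry, one common workaround is to stage the image on a shared filesystem with `docker save` and load it on each node. The paths below are examples only; substitute your cluster's shared filesystem.

```shell
#!/bin/sh
# Sketch: export the built image to a shared path, then load it on the
# allocated compute nodes. /shared/images is a hypothetical example path.
IMAGE=ep-benchmarking:latest
TARBALL=/shared/images/ep-benchmarking.tar
if command -v docker >/dev/null 2>&1; then
    docker save "$IMAGE" -o "$TARBALL" || echo "docker save failed (is the image built?)"
    # Then, on the allocated nodes, e.g.:
    # srun -p <partition> -N <num-nodes> docker load -i "$TARBALL"
else
    echo "docker not found on this node; run where the image was built"
fi
```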
- Slurm stdout/stderr: `ep_bench_slurm_job.out` and `ep_bench_slurm_job.err` in the directory where `sbatch` was run.
- Benchmark logs are written under `LOG_PATH` (default: `./logs`), including:
  - `ep_bench_results.log` — main run log
  - `intranode_results.log` / `mori_intranode_results.log` — intranode DeepEP and MoRI results
  - `internode_<rank>.txt`, `low-latency-<rank>.log` — internode and low-latency results
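A small helper can gather the tail of each per-rank log for a quick scan after a run. This is a sketch using the file names listed above:

```shell
#!/bin/sh
# Print the last lines of each internode / low-latency log, if present.
LOG_PATH="${LOG_PATH:-./logs}"
found=0
for f in "$LOG_PATH"/internode_*.txt "$LOG_PATH"/low-latency-*.log; do
    [ -e "$f" ] || continue   # skip unexpanded glob patterns
    found=1
    echo "== $f =="
    tail -n 5 "$f"
done
[ "$found" -eq 1 ] || echo "no internode logs found under $LOG_PATH"
```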
| Step | Action |
|---|---|
| 1 | Build image from MAD root: docker build -f docker/large_ep_benchmark.ubuntu.amd.Dockerfile -t ep-benchmarking:latest . |
| 2 | Set DOCKER_IMAGE and IBDEVICES (and optionally LOG_PATH) |
| 3 | From scripts/large-ep-benchmark, run: sbatch -p <partition> -N <num-nodes> run_benchmark.sbatch |
For different GPU architectures or NICs, update the Dockerfile and rebuild before submitting jobs.
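To confirm which architecture string the Dockerfile should target, `rocminfo` (shipped with ROCm) reports the gfx identifiers of the GPUs on a compute node. This check degrades gracefully where ROCm is not installed:

```shell
#!/bin/sh
# Report the GPU architecture(s) present (e.g., gfx942) so the Dockerfile
# target can be matched to the hardware; empty when ROCm is not installed.
if command -v rocminfo >/dev/null 2>&1; then
    arches=$(rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u)
else
    arches=""
fi
echo "detected GPU architectures: ${arches:-none}"
```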