- [2026/02] 🔥 MORI powers AMD's WideEP and PD disaggregation in SemiAnalysis InferenceX v2 benchmark (PR, InferenceX, blog).
- [2026/01] 🔥 MORI-EP and MORI-IO integrated into SGLang and vLLM for MoE Expert Parallelism and PD Disaggregation on AMD GPUs (sglang & MORI-EP, sglang & MORI-IO, vllm & MORI-EP, vllm & MORI-IO).
- [2025/12] MORI adds support for AMD's AINIC (Pollara) with SOTA performance (AINIC & MORI-EP, AINIC & MORI-IO).
- [2025/09] MORI-EP now seamlessly scales to 64 GPUs with SOTA performance (multiple optimizations, multi-QP support, low-latency kernel).
- [2025/09] MORI adds Broadcom BNXT (Thor2) IBGDA support (PR).
MORI (Modular RDMA Interface) is a bottom-up, modular, and composable framework for building high-performance communication applications, with a strong focus on RDMA + GPU integration. Inspired by the role of MLIR in compiler infrastructure, MORI provides reusable and extensible building blocks that make it easier for developers to adopt advanced techniques such as IBGDA (InfiniBand GPUDirect Async) and GDS (GPUDirect Storage).
To help developers get started quickly, MORI also includes a suite of optimized libraries: MORI-EP (MoE dispatch & combine kernels), MORI-IO (P2P communication for KVCache transfer), and MORI-CCL (collective communication). These deliver out-of-the-box performance, with support for AMD Pensando DSC, Broadcom Thor2, and NVIDIA Mellanox ConnectX-7 NICs.
Feature summary:
- Applications
- MORI-EP: intra- and inter-node dispatch/combine kernels with SOTA performance.
- MORI-IO: point-to-point communication library with ultra-low overhead.
- MORI-CCL: lightweight, flexible collective communication library designed for highly customized use cases such as latency-sensitive or resource-constrained environments.
- Framework
- High-performance building blocks for IBGDA / P2P, and more.
- Modular & composable components for developing communication applications, such as transport management and topology detection.
- Shmem-style APIs
- C++ level APIs
- Python level APIs
Benchmark results on the DeepSeek V3 model configuration:
Bandwidth Performance
4096 tokens per batch, 7168 hidden, top-8 experts, FP8 dispatching and BF16 combining
| Hardware | Kernels | Dispatch XGMI | Dispatch RDMA | Combine XGMI | Combine RDMA |
|---|---|---|---|---|---|
| MI300X + CX7 | EP8 | 307 GB/s | N/A | 330 GB/s | N/A |
| MI300X + CX7 | EP16-V1 | 171 GB/s | 52 GB/s | 219 GB/s | 67 GB/s |
| MI300X + CX7 | EP32-V1 | 103 GB/s* | 57 GB/s* | 91 GB/s* | 50 GB/s* |
| MI355X + AINIC | EP8 | 345 GB/s | N/A | 420 GB/s | N/A |
| MI355X + AINIC | EP16-V1 | 179 GB/s | 54 GB/s | 234 GB/s | 71 GB/s |
| MI355X + AINIC | EP32-V1 | 85 GB/s | 46 GB/s | 110 GB/s | 61 GB/s |
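For scale, the per-batch payload implied by this configuration can be computed directly. This is a rough lower bound that counts only hidden-state bytes, assuming each token is replicated to its top-8 experts and ignoring routing metadata and FP8 scale factors:

```python
# Payload arithmetic for the bandwidth benchmark above:
# 4096 tokens, 7168-dim hidden state, top-8 experts,
# FP8 dispatch (1 byte/elem) and BF16 combine (2 bytes/elem).
tokens, hidden, topk = 4096, 7168, 8
fp8, bf16 = 1, 2  # bytes per element

dispatch_bytes = tokens * hidden * fp8 * topk   # token replicated per expert
combine_bytes = tokens * hidden * bf16 * topk

print(f"dispatch: {dispatch_bytes / 1e6:.1f} MB")  # 234.9 MB
print(f"combine:  {combine_bytes / 1e6:.1f} MB")   # 469.8 MB
```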
Latency Performance
128 tokens per batch, 7168 hidden, top-8 experts, FP8 dispatching and BF16 combining
| Hardware | Kernels | Dispatch Latency | Dispatch BW | Combine Latency | Combine BW |
|---|---|---|---|---|---|
| MI300X + CX7 | EP8 | 35 us | 134 GB/s | 47 us | 204 GB/s |
| MI300X + CX7 | EP16-V1-LL | 76 us | 96 GB/s | 122 us | 121 GB/s |
| MI300X + CX7 | EP32-V1-LL | 157 us* | 48 GB/s* | 280 us* | 55 GB/s* |
| MI355X + AINIC | EP8 | 31 us | 142 GB/s | 36 us | 276 GB/s |
| MI355X + AINIC | EP16-V1-LL | 84 us | 87 GB/s | 108 us | 139 GB/s |
| MI355X + AINIC | EP32-V1-LL | 152 us | 45 GB/s | 187 us | 76 GB/s |
\* Stale data from a previous kernel version; updated numbers pending re-benchmarking.
NOTE: This is a preview of MORI-IO benchmark performance; MORI-IO will soon be merged into the main branch.
Benchmark results with the following configuration:
- Operation: GPU direct RDMA READ
- Mode: pairwise
- Number of consecutive transfers: 128
- Number of GPUs: 1
- Hardware: MI300X + Thor2
```
+--------------------------------------------------------------------------------------------------------+
|                                             Initiator Rank 0                                            |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
| MsgSize (B) | BatchSize | TotalSize (MB) | Max BW (GB/s) | Avg BW (GB/s) | Min Lat (us) | Avg Lat (us) |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
|           8 |       128 |           0.00 |          0.03 |          0.03 |        33.38 |        36.33 |
|          16 |       128 |           0.00 |          0.06 |          0.06 |        34.09 |        36.35 |
|          32 |       128 |           0.00 |          0.12 |          0.11 |        34.57 |        36.33 |
|          64 |       128 |           0.01 |          0.24 |          0.23 |        33.62 |        36.33 |
|         128 |       128 |           0.02 |          0.49 |          0.45 |        33.62 |        36.49 |
|         256 |       128 |           0.03 |          0.94 |          0.89 |        34.81 |        36.99 |
|         512 |       128 |           0.07 |          1.86 |          1.77 |        35.29 |        37.01 |
|        1024 |       128 |           0.13 |          3.84 |          3.53 |        34.09 |        37.09 |
|        2048 |       128 |           0.26 |          7.33 |          6.96 |        35.76 |        37.65 |
|        4096 |       128 |           0.52 |         12.94 |         12.46 |        40.53 |        42.09 |
|        8192 |       128 |           1.05 |         20.75 |         20.12 |        50.54 |        52.11 |
|       16384 |       128 |           2.10 |         29.03 |         28.33 |        72.24 |        74.02 |
|       32768 |       128 |           4.19 |         36.50 |         35.91 |       114.92 |       116.81 |
|       65536 |       128 |           8.39 |         41.74 |         41.39 |       200.99 |       202.70 |
|      131072 |       128 |          16.78 |         45.14 |         44.85 |       371.69 |       374.10 |
|      262144 |       128 |          33.55 |         46.93 |         46.76 |       715.02 |       717.56 |
|      524288 |       128 |          67.11 |         47.94 |         47.81 |      1399.99 |      1403.64 |
|     1048576 |       128 |         134.22 |         48.44 |         48.32 |      2770.90 |      2777.76 |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
```
- Session is a MORI-IO-specific technique used to reduce per-transfer overhead.
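As a sanity check on the MORI-IO table above, average bandwidth is simply the total bytes per batch divided by the average latency; the last row reproduces the reported figure:

```python
# Verify: Avg BW == MsgSize * BatchSize / AvgLat for the 1 MiB row above.
msg_size, batch = 1048576, 128
avg_lat_us = 2777.76

total_bytes = msg_size * batch                      # 134.22 MB per batch
avg_bw = total_bytes / (avg_lat_us * 1e-6) / 1e9    # bytes/s -> GB/s
print(f"{avg_bw:.2f} GB/s")  # prints 48.32 GB/s, matching the table
```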
GPU
| | MORI-EP | MORI-IO | MORI-SHMEM |
|---|---|---|---|
| MI308X | ✅ | ✅ | ✅ |
| MI300X | ✅ | ✅ | ✅ |
| MI325X | ✅ | ✅ | ✅ |
| MI355X | ✅ | ✅ | ✅ |
| MI450X | 🚧 | 🚧 | 🚧 |
NIC
| | MORI-EP | MORI-IO | MORI-SHMEM |
|---|---|---|---|
| Pollara | ✅ | ✅ | ✅ |
| CX7 | ✅ | ✅ | ✅ |
| Thor2 | ✅ | ✅ | ✅ |
| Volcano | 🚧 | 🚧 | 🚧 |
✅ Supported | 🚧 Under Development
- ROCm >= 6.4 (hipcc needed at runtime for JIT kernel compilation, not at install time)
- System packages: `libopenmpi-dev`, `openmpi-bin`, `libpci-dev` (see Dockerfile.dev)
Or build a docker image with:

```shell
cd mori && docker build -t rocm/mori:dev -f docker/Dockerfile.dev .
```

IBGDA NIC support (optional, for GPU-direct RDMA; auto-detected, no manual configuration needed):
| NIC | User library | Headers |
|---|---|---|
| AMD Pollara (AINIC) | libionic.so | not required |
| Mellanox ConnectX | libmlx5.so (typically pre-installed) | not required |
| Broadcom Thor2 | libbnxt_re.so | bnxt_re_dv.h, bnxt_re_hsi.h |

Note: IBGDA requires vendor-specific DV (Direct Verbs) libraries. Mellanox `libmlx5` is typically pre-installed with the kernel OFED stack. For Thor2 and Pollara, install the corresponding userspace library and headers from your NIC vendor.
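To see which of these DV userspace libraries are present on a host, a quick check like the following can help (illustrative only; a missing library simply means that NIC's IBGDA path is unavailable):

```shell
# Look for each vendor DV library in the dynamic linker cache.
for lib in libmlx5 libbnxt_re libionic; do
  if ldconfig -p 2>/dev/null | grep -q "$lib"; then
    echo "$lib: found"
  else
    echo "$lib: missing"
  fi
done
```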
```shell
# NOTE: for venv build, add --no-build-isolation at the end
cd mori && pip install .
```

That's it. No hipcc is needed at install time; host code compiles with a standard C++ compiler. GPU kernels are JIT-compiled on first use and cached to `~/.mori/jit/`. If a GPU is detected during install, kernel precompilation starts automatically in the background.
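The compile-on-first-use flow described above is a standard JIT-and-cache pattern. Here is a minimal sketch of that pattern for illustration only; it is not MORI's implementation, `compile_kernel` is a hypothetical stand-in for the real hipcc build step, and the cache lives in a temp directory rather than `~/.mori/jit/`:

```python
import hashlib
import pathlib
import tempfile

# Illustrative cache directory (MORI uses ~/.mori/jit/).
CACHE_DIR = pathlib.Path(tempfile.gettempdir()) / "jit_cache_demo"

def compile_kernel(source: str) -> bytes:
    # Hypothetical stand-in for invoking hipcc on the kernel source.
    return ("compiled:" + source).encode()

def get_kernel(source: str) -> bytes:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    # Key the cache entry by a hash of the kernel source.
    key = hashlib.sha256(source.encode()).hexdigest()
    cached = CACHE_DIR / f"{key}.bin"
    if cached.exists():                 # cache hit: skip compilation
        return cached.read_bytes()
    binary = compile_kernel(source)     # cache miss: compile once
    cached.write_bytes(binary)          # then persist for future runs
    return binary

# First call compiles and caches; the second is served from disk.
a = get_kernel("__global__ void k() {}")
b = get_kernel("__global__ void k() {}")
```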
To manually precompile all kernels (e.g. in a Docker image build):

```shell
MORI_PRECOMPILE=1 python -c "import mori"
```

Verify the installation:

```shell
python -c "import mori; print('OK')"
```

Test MORI-EP:

```shell
cd /path/to/mori
export PYTHONPATH=/path/to/mori:$PYTHONPATH
# Test correctness (8 GPUs)
pytest tests/python/ops/test_dispatch_combine.py -q
# Benchmark performance
python tests/python/ops/bench_dispatch_combine.py
```

Test MORI-IO:

```shell
cd /path/to/mori
export PYTHONPATH=/path/to/mori:$PYTHONPATH
# Test correctness
pytest tests/python/io/
# Benchmark performance (two nodes; run the same command with --node_rank=1 on the second node)
export GLOO_SOCKET_IFNAME=ens14np0
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 --master_addr="10.194.129.65" --master_port=1234 \
    tests/python/io/benchmark.py --host="10.194.129.65" --enable-batch-transfer --enable-sess --buffer-size 32768 --transfer-batch-size 128
```

Test MORI-IR (Triton + shmem integration, guide):
```shell
# Basic shmem put (2 GPUs)
torchrun --nproc_per_node=2 examples/shmem/ir/test_triton_shmem.py
# Allreduce (8 GPUs)
torchrun --nproc_per_node=8 examples/shmem/ir/test_triton_allreduce.py
```

Welcome to MORI! We appreciate your interest in contributing. Whether you're fixing bugs, adding features, improving documentation, or sharing feedback, your contributions help make MORI better for everyone.
MORI uses pre-commit hooks to maintain code quality. After cloning the repository:
```shell
# Install and set up pre-commit
pip install pre-commit
cd /path/to/mori
pre-commit install
# Run on all files (first time)
pre-commit run --all-files
```

Pre-commit automatically runs code formatting, linting, license-header, and other quality checks on commit. To skip them when necessary: `git commit --no-verify`
