AITER (AI Tensor Engine for ROCm) is AMD's high-performance AI operator library, providing optimized GPU kernels for inference and training workloads on ROCm. It serves as a unified collection of production-ready operators that framework developers can integrate directly into their stacks.
- C++ and Python APIs — use operators from either level (see the Python sketch after this list)
- Multiple kernel backends — Triton, Composable Kernel (CK), and hand-tuned ASM
- Inference and training — not just serving kernels, but also training and GEMM+communication fused kernels
- Framework-agnostic — integrate into vLLM, SGLang, or any custom framework
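As a quick illustration of the Python-level API, here is a minimal sketch. The `aiter.rms_norm` entry point and its signature are assumptions for illustration; the matching test under op_tests/ (e.g., op_tests/test_rmsnorm2d.py) is the authoritative reference for the exact call:

```python
# Minimal sketch of calling an AITER operator from Python on a ROCm GPU.
# Assumption: aiter exposes rms_norm(input, weight, epsilon) at the top
# level; the entry point and signature may differ between versions, so
# check the corresponding test under op_tests/ for the exact API.
import torch
import aiter

# ROCm devices are addressed through the "cuda" device string in PyTorch.
x = torch.randn(8, 4096, dtype=torch.bfloat16, device="cuda")
w = torch.ones(4096, dtype=torch.bfloat16, device="cuda")

y = aiter.rms_norm(x, w, 1e-6)  # dispatches to a tuned kernel backend
print(y.shape)  # torch.Size([8, 4096])
```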
- [2026/04] AITER v0.1.12.post1 Released — patch on v0.1.12 with GEMM and scale masking accuracy fixes; v0.1.12 highlights include blockwise sparse Sage Attention, fused gated RMSNorm+group quantization, etc., plus MI355X tuned configs for Kimi-K2.5 and DeepSeek-V3
- [2026/02] JAX-AITER: Bringing AMD's Optimized AI Kernels to JAX on ROCm
- [2026/02] Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm
- [2026/01] Character.ai: 2x Production Inference Performance on AMD Instinct GPUs
- [2026/01] ROCm Becomes a First-Class Platform in the vLLM Ecosystem
- [2025] Accelerated LLM Inference with vLLM 0.9.x and ROCm
- [2025] Accelerate DeepSeek-R1 Inference: Integrate AITER into SGLang
- [2025/08] AITER-Enabled MLA Layer Inference on AMD Instinct MI300X
- [2025/08] Tutorial: MLA Decoding Kernel of the AITER Library to Accelerate LLM Inference
- [2025/03] Accelerating DeepSeek Inference with AMD MI300 — Microsoft
- [2025/03] AITER: AI Tensor Engine For ROCm — Launch Announcement
AITER is the default kernel backend for LLM inference on AMD GPUs, integrated into the major serving frameworks and powering production workloads at scale.
| Framework | Integration | Status | Operators Used |
|---|---|---|---|
| vLLM | Default attention backend on ROCm | Production | MHA, MLA, Paged Attention, Fused MoE, GEMM, RMSNorm, RoPE+KVCache |
| SGLang | Default on ROCm Docker | Production | Attention, Fused MoE, Block-scale GEMM, All-reduce, RMSNorm |
| ATOM | Built natively on AITER | Active development | All AITER operators (attention, MoE, sampling, communication) |
| JAX | XLA FFI bridge, no PyTorch dependency | Experimental | MHA/FMHA, RMSNorm, BF16 GEMM |
| Various customer proprietary inference engines | Kernel-level integration | Production | Attention, MoE, GEMM, quantization |
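For example, recent vLLM releases gate AITER usage on ROCm behind environment flags. The sketch below assumes the `VLLM_ROCM_USE_AITER` flag; check your vLLM version's envs module for the exact flags it supports:

```python
# Sketch: enabling AITER kernels for vLLM on ROCm (flag name assumed;
# recent vLLM versions read VLLM_ROCM_USE_AITER at import time).
import os
os.environ["VLLM_ROCM_USE_AITER"] = "1"  # set before importing vllm

from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(["The fastest way to serve LLMs on MI300X is"])
print(outputs[0].outputs[0].text)
```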
Representative operator and end-to-end speedups:

| Operator / Workload | Speedup |
|---|---|
| MLA decode kernel | up to 17x |
| MHA prefill kernel | up to 14x |
| Block-scaled Fused MoE | up to 3x |
| Block-scaled GEMM | up to 2x |
| DeepSeek-R1 e2e (SGLang) | 6,484 → 13,704 tok/s (2.1x) |
| JAX-AITER attention (MI350) | 4.39x median |
For detailed benchmarks, see the ATOM Benchmark Dashboard.
AITER targets AMD Instinct GPUs:

| GPU | Architecture | Status |
|---|---|---|
| AMD Instinct MI300X | gfx942 (CDNA3) | Fully supported |
| AMD Instinct MI325X | gfx942 (CDNA3) | Fully supported |
| AMD Instinct MI350 | gfx950 (CDNA4) | Supported |
| AMD Instinct MI355X | gfx950 (CDNA4) | Supported |
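To confirm which gfx target your GPU reports, PyTorch ROCm builds expose the architecture name on the device properties object:

```python
# Print the gfx target of GPU 0; PyTorch ROCm builds expose gcnArchName.
import torch
print(torch.cuda.get_device_properties(0).gcnArchName)  # e.g. "gfx942" on MI300X
```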
AITER provides optimized kernels for attention, MoE, GEMM, normalization, quantization, communication, and more. Each operator has unit tests under op_tests/ that you can run directly:
```bash
# Example: run a single operator test
python3 op_tests/test_mha.py
python3 op_tests/test_mla.py
python3 op_tests/test_moe.py
python3 op_tests/test_gemm_a8w8.py
python3 op_tests/test_rmsnorm2d.py

# See all available operator tests
ls op_tests/test_*.py
```

To build and install AITER from source:

```bash
git clone --recursive https://github.com/ROCm/aiter.git
cd aiter
python3 setup.py develop
```

If you forgot the `--recursive` flag during clone, run the following from inside `aiter`:

```bash
git submodule sync && git submodule update --init --recursive
```

AITER's FusedMoE supports FlyDSL-based kernels for mixed-precision MoE (e.g., A4W4). FlyDSL is optional — when not installed, AITER automatically falls back to CK kernels.
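When in doubt about which MoE path you will get, a minimal check is to test whether the optional dependency imports; the backend selection itself happens inside AITER's FusedMoE:

```python
# Minimal availability check: AITER selects the FusedMoE backend
# internally; this only verifies whether the optional FlyDSL package
# is importable in the current environment.
try:
    import flydsl  # noqa: F401
    print("FlyDSL available: mixed-precision (e.g., A4W4) kernels can be used")
except ImportError:
    print("FlyDSL not installed: AITER falls back to CK kernels")
```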
To enable the FlyDSL path, install it from PyPI:

```bash
pip install --pre flydsl
```

Or install all optional dependencies at once:

```bash
pip install -r requirements.txt
```

AITER includes Triton-based operators that require `triton` from the AMD PyPI index (ROCm 7.0, ROCm 7.1, or ROCm 7.2), with the correct version selected based on your ROCm installation.
If you install with `python3 setup.py develop`, `triton` is installed automatically. If you use `pip install -e .` instead, run the install script manually:

```bash
./.github/scripts/install_triton.sh
```

Opus is a single-header C++ template library (`opus.hpp`) for writing HIP kernels on AMD GPUs — vectorized load/store, layout abstractions, and MFMA wrappers, with a strong focus on build-time optimization (up to 61x faster than standard torch extension builds). See the Opus README and `op_tests/opus/` for details.
AITER supports GPU-initiated communication using the Iris library. This enables high-performance Triton-based communication primitives like reduce-scatter and all-gather.
To enable them, install AITER with the Triton communication requirements:

```bash
pip install -e .
pip install -r requirements-triton-comms.txt
```

For more details, see docs/triton_comms.md.
