AITER (AI Tensor Engine for ROCm) is AMD's high-performance AI operator library, providing optimized GPU kernels for inference and training workloads on ROCm. It serves as a unified collection of production-ready operators that framework developers can integrate directly into their stacks.
- C++ and Python APIs — use operators from either level (see the Python sketch after this list)
- Multiple kernel backends — Triton, Composable Kernel (CK), and hand-tuned ASM
- Inference and training — not just serving kernels, but also training and GEMM+communication fused kernels
- Framework-agnostic — integrate into vLLM, SGLang, or any custom framework
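As a quick illustration of the Python-level API, here is a minimal sketch. The `aiter.rms_norm` entry point and its signature are assumptions for illustration; the matching test under op_tests/ (e.g., op_tests/test_rmsnorm2d.py) is the authoritative reference for the exact call:

```python
# Minimal sketch of calling an AITER operator from Python on a ROCm GPU.
# Assumption: aiter exposes rms_norm(input, weight, epsilon) at the top
# level; the entry point and signature may differ between versions, so
# check the corresponding test under op_tests/ for the exact API.
import torch
import aiter

# ROCm devices are addressed through the "cuda" device string in PyTorch.
x = torch.randn(8, 4096, dtype=torch.bfloat16, device="cuda")
w = torch.ones(4096, dtype=torch.bfloat16, device="cuda")

y = aiter.rms_norm(x, w, 1e-6)  # dispatches to a tuned kernel backend
print(y.shape)  # torch.Size([8, 4096])
```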
- [2026/04] AITER v0.1.12.post1 Released — patch on v0.1.12 with GEMM and scale masking accuracy fixes; v0.1.12 highlights include blockwise sparse Sage Attention, fused gated RMSNorm+group quantization, etc., plus MI355X tuned configs for Kimi-K2.5 and DeepSeek-V3
- [2026/02] JAX-AITER: Bringing AMD's Optimized AI Kernels to JAX on ROCm
- [2026/02] Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm
- [2026/01] Character.ai: 2x Production Inference Performance on AMD Instinct GPUs
- [2026/01] ROCm Becomes a First-Class Platform in the vLLM Ecosystem
- [2025] Accelerated LLM Inference with vLLM 0.9.x and ROCm
- [2025] Accelerate DeepSeek-R1 Inference: Integrate AITER into SGLang
- [2025/08] AITER-Enabled MLA Layer Inference on AMD Instinct MI300X
- [2025/08] Tutorial: MLA Decoding Kernel of the AITER Library to Accelerate LLM Inference
- [2025/03] Accelerating DeepSeek Inference with AMD MI300 — Microsoft
- [2025/03] AITER: AI Tensor Engine For ROCm — Launch Announcement
AITER is the default kernel backend for LLM inference on AMD GPUs, integrated into the major serving frameworks and powering production workloads at scale.
| Framework | Integration | Status | Operators Used |
|---|---|---|---|
| vLLM | Default attention backend on ROCm | Production | MHA, MLA, Paged Attention, Fused MoE, GEMM, RMSNorm, RoPE+KVCache |
| SGLang | Default on ROCm Docker | Production | Attention, Fused MoE, Block-scale GEMM, All-reduce, RMSNorm |
| ATOM | Built natively on AITER | Active development | All AITER operators (attention, MoE, sampling, communication) |
| JAX | XLA FFI bridge, no PyTorch dependency | Experimental | MHA/FMHA, RMSNorm, BF16 GEMM |
| Various customer proprietary inference engines | Kernel-level integration | Production | Attention, MoE, GEMM, quantization |
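For example, recent vLLM releases gate AITER usage on ROCm behind environment flags. The sketch below assumes the `VLLM_ROCM_USE_AITER` flag; check your vLLM version's envs module for the exact flags it supports:

```python
# Sketch: enabling AITER kernels for vLLM on ROCm (flag name assumed;
# recent vLLM versions read VLLM_ROCM_USE_AITER at import time).
import os
os.environ["VLLM_ROCM_USE_AITER"] = "1"  # set before importing vllm

from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(["The fastest way to serve LLMs on MI300X is"])
print(outputs[0].outputs[0].text)
```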
Representative operator and end-to-end speedups:

| Operator / Workload | Speedup |
|---|---|
| MLA decode kernel | up to 17x |
| MHA prefill kernel | up to 14x |
| Block-scaled Fused MoE | up to 3x |
| Block-scaled GEMM | up to 2x |
| DeepSeek-R1 e2e (SGLang) | 6,484 → 13,704 tok/s (2.1x) |
| JAX-AITER attention (MI350) | 4.39x median |
For detailed benchmarks, see the ATOM Benchmark Dashboard.
AITER targets AMD Instinct GPUs:

| GPU | Architecture | Status |
|---|---|---|
| AMD Instinct MI300X | gfx942 (CDNA3) | Fully supported |
| AMD Instinct MI325X | gfx942 (CDNA3) | Fully supported |
| AMD Instinct MI350 | gfx950 (CDNA4) | Supported |
| AMD Instinct MI355X | gfx950 (CDNA4) | Supported |
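To confirm which gfx target your GPU reports, PyTorch ROCm builds expose the architecture name on the device properties object:

```python
# Print the gfx target of GPU 0; PyTorch ROCm builds expose gcnArchName.
import torch
print(torch.cuda.get_device_properties(0).gcnArchName)  # e.g. "gfx942" on MI300X
```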
AITER provides optimized kernels for attention, MoE, GEMM, normalization, quantization, communication, and more. Each operator has unit tests under op_tests/ that you can run directly:
```bash
# Example: run a single operator test
python3 op_tests/test_mha.py
python3 op_tests/test_mla.py
python3 op_tests/test_moe.py
python3 op_tests/test_gemm_a8w8.py
python3 op_tests/test_rmsnorm2d.py

# See all available operator tests
ls op_tests/test_*.py
```

To build and install AITER from source:

```bash
git clone --recursive https://github.com/ROCm/aiter.git
cd aiter
python3 setup.py develop
```

If you forgot the `--recursive` flag during clone, run the following from inside `aiter`:

```bash
git submodule sync && git submodule update --init --recursive
```

AITER's FusedMoE supports FlyDSL-based kernels for mixed-precision MoE (e.g., A4W4). FlyDSL is optional — when not installed, AITER automatically falls back to CK kernels.
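When in doubt about which MoE path you will get, a minimal check is to test whether the optional dependency imports; the backend selection itself happens inside AITER's FusedMoE:

```python
# Minimal availability check: AITER selects the FusedMoE backend
# internally; this only verifies whether the optional FlyDSL package
# is importable in the current environment.
try:
    import flydsl  # noqa: F401
    print("FlyDSL available: mixed-precision (e.g., A4W4) kernels can be used")
except ImportError:
    print("FlyDSL not installed: AITER falls back to CK kernels")
```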
To enable the FlyDSL path, install it from PyPI:

```bash
pip install --pre flydsl
```

Or install all optional dependencies at once:

```bash
pip install -r requirements.txt
```

AITER includes Triton-based operators that require `triton` from the AMD PyPI index (ROCm 7.0, ROCm 7.1, or ROCm 7.2), with the correct version selected based on your ROCm installation.
If you install with `python3 setup.py develop`, `triton` is installed automatically. If you use `pip install -e .` instead, run the install script manually:

```bash
./.github/scripts/install_triton.sh
```

Opus is a single-header C++ template library (`opus.hpp`) for writing HIP kernels on AMD GPUs — vectorized load/store, layout abstractions, and MFMA wrappers, with a strong focus on build-time optimization (up to 61x faster than standard torch extension builds). See the Opus README and `op_tests/opus/` for details.
AITER supports GPU-initiated communication using the Iris library. This enables high-performance Triton-based communication primitives like reduce-scatter and all-gather.
To enable them, install AITER with the Triton communication requirements:

```bash
pip install -e .
pip install -r requirements-triton-comms.txt
```

For more details, see docs/triton_comms.md.
