PyTorch built from source for AMD RDNA 3.5 (gfx1150) integrated GPUs
gfx1150 / RDNA 3.5 / Radeon 890M
AMD's Ryzen AI 300 series ("Strix Point") processors include RDNA 3.5 integrated GPUs — the Radeon 890M and 880M. These use the gfx1150 ISA, which is not supported by any pre-built PyTorch package.
Here is what happens when you try the standard approaches:
| Approach | Result |
|---|---|
| pip install torch (CPU) | Works, but no GPU acceleration |
| pip install torch --index-url .../rocm6.3 | Detects GPU, then crashes: invalid device function |
| HSA_OVERRIDE_GFX_VERSION=11.0.0 | Does not map to a valid target for gfx1150 |
| Build from source with PYTORCH_ROCM_ARCH=gfx1150 | Native GPU acceleration |
This repository provides the build scripts, documentation, and benchmarks to get PyTorch running with real GPU acceleration on RDNA 3.5 hardware. No hacks, no overrides, no compatibility shims.
| GPU | Processor | Architecture | Status |
|---|---|---|---|
| Radeon 890M | Ryzen AI 9 HX 370 / HX 375 | gfx1150 (RDNA 3.5) | Tested |
| Radeon 880M | Ryzen AI 7 PRO 360 | gfx1150 (RDNA 3.5) | Expected to work |
| Other RDNA 3.5 | Various | gfx1150 / gfx1151 | Should work (untested) |
Known working device: GPD Pocket 4 (Ryzen AI 9 HX 370 / Radeon 890M / 32GB RAM)
Tested on AMD Radeon 890M with ROCm 7.2.0 and PyTorch 2.12.0a0:
| Test | Result | Notes |
|---|---|---|
| FP32 matmul 512x512 | ~1,059 GFLOPS | Peak single-precision |
| FP32 matmul 1024x1024 | ~1,043 GFLOPS | Sustained single-precision |
| FP16 matmul 1024x1024 | ~556 GFLOPS | Half-precision |
| Shared VRAM | 11.5 GB | System memory allocated to GPU |
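
The GFLOPS figures above follow from the usual operation count for a square matmul, 2·n³ floating-point operations per n×n product. The sketch below shows that arithmetic and one way to time it. It is not the code in benchmarks/run_benchmarks.py; the function names, warm-up count, and iteration count are assumptions chosen for illustration.

```python
import time


def matmul_gflops(n: int, seconds: float) -> float:
    """GFLOPS for one (n x n) @ (n x n) matmul, counted as 2*n^3 operations."""
    return 2 * n**3 / seconds / 1e9


def time_matmul(n: int = 1024, iters: int = 50, dtype_name: str = "float32") -> float:
    """Average seconds per matmul on the GPU. Needs a working GPU build of torch."""
    import torch  # imported lazily so matmul_gflops works without torch

    x = torch.randn(n, n, device="cuda", dtype=getattr(torch, dtype_name))
    for _ in range(5):  # warm-up so one-time setup cost is not measured
        torch.mm(x, x)
    torch.cuda.synchronize()  # GPU work is asynchronous; wait before timing
    start = time.perf_counter()
    for _ in range(iters):
        torch.mm(x, x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    try:
        import torch
        have_gpu = torch.cuda.is_available()
    except ImportError:
        have_gpu = False
    if have_gpu:
        secs = time_matmul(1024)
        print(f"FP32 1024x1024: {matmul_gflops(1024, secs):.0f} GFLOPS")
```

By this count, ~1,043 GFLOPS at 1024x1024 corresponds to roughly 2.06 ms per matmul.
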
| Configuration | Time | Speedup |
|---|---|---|
| CPU only | 19.4s | baseline |
| GPU (no AOTriton) | 49.4s | 0.4x (SLOWER) |
| GPU + AOTriton experimental | 12.8s | 1.5x faster |
The AOTriton result demonstrates why TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 is critical. See Known Issues.
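
The speedup column is the CPU time divided by the time for each configuration. A minimal sketch of that arithmetic, using only the timings from the table (the helper name is illustrative):

```python
def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline (<1 means slower)."""
    return baseline_s / candidate_s


cpu_s, gpu_plain_s, gpu_aotriton_s = 19.4, 49.4, 12.8

print(f"GPU (no AOTriton) vs CPU: {speedup(cpu_s, gpu_plain_s):.1f}x")
print(f"GPU + AOTriton vs CPU:    {speedup(cpu_s, gpu_aotriton_s):.1f}x")
print(f"AOTriton vs plain GPU:    {speedup(gpu_plain_s, gpu_aotriton_s):.1f}x")
```

The last line compares the two GPU rows directly: with the flag set, the GPU run is about 3.9x faster than without it.
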
ROCm 7.2+ (required):
# Arch/CachyOS
sudo pacman -S rocm-hip-runtime rocm-hip-sdk rocm-opencl-runtime
# Ubuntu (see https://rocm.docs.amd.com)
# Follow AMD's official ROCm installation guide

Build tools:
# Arch/CachyOS
sudo pacman -S base-devel cmake ninja python python-pip git
# Ubuntu/Debian
sudo apt install build-essential cmake ninja-build python3 python3-pip python3-venv git

rocminfo | grep -E "Marketing Name|Name:.*gfx"
# Should show: "AMD Radeon 890M Graphics" and "gfx1150"

python3 -m venv ~/pytorch-gfx1150-env
source ~/pytorch-gfx1150-env/bin/activate
pip install numpy pyyaml typing-extensions sympy filelock jinja2 networkx setuptools wheel cffi

git clone https://github.com/Peterc3-dev/pytorch-gfx1150.git
cd pytorch-gfx1150
chmod +x build.sh
./build.sh --venv ~/pytorch-gfx1150-env

The build takes approximately 44 minutes on a Ryzen AI 9 HX 370 with MAX_JOBS=10.

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
python3 -c "
import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA: {torch.cuda.is_available()}')
print(f'Device: {torch.cuda.get_device_name(0)}')
x = torch.randn(1000, 1000, device='cuda')
y = torch.mm(x, x)
print(f'Matmul: OK ({y.shape})')
"

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
python3 benchmarks/run_benchmarks.py --markdown

The build.sh script supports several options:
./build.sh # Fresh build, default settings
./build.sh --rebuild # Incremental rebuild (skip clone)
./build.sh --arch gfx1150 # Explicit GPU architecture
./build.sh --jobs 12 # Custom parallel job count
./build.sh --venv ~/my_venv # Use a specific virtualenv
./build.sh --build-dir ~/my/pytorch # Custom source directory
./build.sh --branch v2.6.0 # Build a specific PyTorch version
./build.sh --dry-run # Show config without building
Set these environment variables in your shell profile, or export them before running any PyTorch workload:
# REQUIRED for good performance on gfx1150
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
# Optional: useful for debugging
export ROCM_PATH=/opt/rocm
export HSA_ENABLE_SDMA=0 # May help with stability on some systems
export AMD_LOG_LEVEL=0 # Reduce ROCm log noise

Without TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1, PyTorch's Scaled Dot-Product Attention (SDPA) falls back to a naive implementation that is slower than CPU. This is because the flash-attention and memory-efficient attention kernels are treated as experimental on gfx1150 and are not used by default. Setting the flag enables the AOTriton versions of these kernels.
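
To see the effect on your own machine, you can time SDPA with and without the flag. The sketch below is one way to do that from Python. It assumes that setting the variable before torch is imported is sufficient, and the tensor shapes, iteration count, and helper names are arbitrary illustrative choices rather than anything taken from this repository.

```python
import os
import time

FLAG = "TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL"


def ensure_flag(env=os.environ) -> bool:
    """Set the flag to "1" if needed; return True if this call changed env."""
    if env.get(FLAG) == "1":
        return False
    env[FLAG] = "1"
    return True


def time_sdpa(seq_len: int = 1024, heads: int = 8, head_dim: int = 64, iters: int = 20) -> float:
    """Average seconds per FP16 SDPA call on the GPU (shapes are arbitrary)."""
    import torch
    import torch.nn.functional as F

    q, k, v = (
        torch.randn(1, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
        for _ in range(3)
    )
    F.scaled_dot_product_attention(q, k, v)  # warm-up call
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    ensure_flag()  # assumption: must happen before torch is imported
    try:
        import torch
        have_gpu = torch.cuda.is_available()
    except ImportError:
        have_gpu = False
    if have_gpu:
        print(f"SDPA: {time_sdpa() * 1e3:.2f} ms per call")
```

Run it once as written and once with the ensure_flag() call removed and the variable unset in your shell, then compare the two timings.
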
Always set this variable. Add it to your .bashrc / .bash_profile / config.fish:
# bash/zsh
echo 'export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1' >> ~/.bashrc
# fish
set -Ux TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL 1

Before building PyTorch for ROCm, the CUDA-to-HIP translation ("hipify") step must be run:
cd ~/builds/pytorch-gfx1150
python3 tools/amd_build/build_amd.py

The build.sh script handles this automatically, but if you are building manually, do not skip it. The PyTorch build system does not run hipify on its own.
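
For a manual build, the order of operations can be sketched as follows. This is not a transcription of build.sh. It only collects the variables this README names (PYTORCH_ROCM_ARCH, MAX_JOBS, USE_DISTRIBUTED, CMAKE_POLICY_VERSION_MINIMUM) plus PyTorch's USE_ROCM switch, and it assumes the final step is `python3 setup.py bdist_wheel`. The function names are hypothetical.

```python
import os
import subprocess


def build_env(arch: str = "gfx1150", jobs: int = 10, base=None) -> dict:
    """Environment for a manual ROCm build, layered on the current environment."""
    env = dict(os.environ if base is None else base)
    env.update({
        "USE_ROCM": "1",
        "PYTORCH_ROCM_ARCH": arch,
        "MAX_JOBS": str(jobs),
        "USE_DISTRIBUTED": "0",                 # single-iGPU target
        "CMAKE_POLICY_VERSION_MINIMUM": "3.5",  # needed with cmake 4.x
    })
    env.pop("HSA_OVERRIDE_GFX_VERSION", None)   # native builds must not use it
    return env


def build_steps() -> list:
    """Commands in the order they must run: hipify first, then the build."""
    return [
        ["python3", "tools/amd_build/build_amd.py"],
        ["python3", "setup.py", "bdist_wheel"],  # assumed final build command
    ]


def run_build(source_dir: str, **kwargs) -> None:
    """Run each step in the PyTorch source tree, stopping on the first failure."""
    env = build_env(**kwargs)
    for cmd in build_steps():
        subprocess.run(cmd, cwd=source_dir, env=env, check=True)
```

Calling `run_build(os.path.expanduser("~/builds/pytorch-gfx1150"), jobs=6)` would run both steps in the source directory used above.
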
cmake 4.x changed default policy behavior. Some older PyTorch subprojects fail with policy errors. The fix:
export CMAKE_POLICY_VERSION_MINIMUM=3.5

This is set automatically by build.sh.
GCC 15 introduces new warnings that PyTorch treats as errors. The build script sets:
export CXXFLAGS="-Wno-error=deprecated-declarations -Wno-error=maybe-uninitialized -Wno-error"

FP16 matmul shows ~556 GFLOPS vs ~1043 for FP32. This is likely because the RDNA 3.5 iGPU's FP16 throughput depends on packed math instructions, and the current ROCm/PyTorch code path may not be fully exploiting them. This may improve with future ROCm updates.
Q: Do I need a discrete GPU?
No. This targets integrated GPUs (iGPUs): the Radeon 890M/880M built into Ryzen AI 300 processors. No discrete GPU required.
Q: Can I use the pre-built ROCm wheels from PyTorch.org?
No. Those target older architectures (gfx900, gfx906, gfx908, gfx90a, gfx942, gfx1030, gfx1100). They will detect your GPU but crash at runtime with invalid device function.
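
You can check an installed build against your GPU before running a workload. The sketch below compares the device's reported architecture with the list the build was compiled for. It assumes `torch.cuda.get_arch_list()` returns gfx names on ROCm builds and that the device properties expose `gcnArchName`; the helper names are illustrative.

```python
def base_arch(name: str) -> str:
    """Strip feature suffixes, e.g. 'gfx90a:sramecc+:xnack-' -> 'gfx90a'."""
    return name.split(":", 1)[0].strip()


def build_supports(device_arch: str, compiled_archs) -> bool:
    """True if the device architecture is among those the build was compiled for."""
    return base_arch(device_arch) in {base_arch(a) for a in compiled_archs}


def check_installed_torch():
    """Return (device arch, compiled archs, supported?) for the installed torch."""
    import torch  # imported lazily; the helpers above do not need it

    compiled = torch.cuda.get_arch_list()
    props = torch.cuda.get_device_properties(0)
    device = getattr(props, "gcnArchName", "")  # assumed present on ROCm builds
    return device, compiled, build_supports(device, compiled)


# Architecture list quoted in the answer above
WHEEL_ARCHS = ["gfx900", "gfx906", "gfx908", "gfx90a", "gfx942", "gfx1030", "gfx1100"]
print(build_supports("gfx1150", WHEEL_ARCHS))
```

With the architecture list quoted above, the check reports no match for gfx1150, which is the condition behind the invalid device function crash.
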
Q: What about HSA_OVERRIDE_GFX_VERSION?
This trick maps one GPU architecture to another. It works for some cases (e.g., gfx1030 -> gfx1036) but there is no valid override target for gfx1150. Building from source is the only option.
Warning: If you previously set HSA_OVERRIDE_GFX_VERSION=11.0.0 for gfx1100 (RDNA 3) hardware, remove it before building or running native gfx1150 builds. The override makes the runtime treat the GPU as gfx1100, so it looks for gfx1100 code that a natively-compiled gfx1150 binary does not contain, and this produces invalid device function errors. Native builds must run without any HSA_OVERRIDE_GFX_VERSION set.
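
A leftover override is easy to miss when it lives in a shell profile. The following sketch checks for it at the top of a script, before torch is imported. The function name and message wording are illustrative.

```python
import os


def override_problem(env=os.environ):
    """Return a warning string if HSA_OVERRIDE_GFX_VERSION is set, else None."""
    value = env.get("HSA_OVERRIDE_GFX_VERSION")
    if value:
        return (
            f"HSA_OVERRIDE_GFX_VERSION={value} is set; "
            "unset it before using a native gfx1150 build"
        )
    return None


if __name__ == "__main__":
    problem = override_problem()
    if problem:
        print(f"WARNING: {problem}")
```
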
Q: How much disk space do I need?
About 15 GB for the PyTorch source + build artifacts. The final installed size is around 1-2 GB.
Q: How much RAM does the build need?
With MAX_JOBS=10, peak memory usage is around 16-20 GB. If you have 16 GB RAM, reduce to --jobs 6 or lower.
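
The figures above work out to roughly 2 GB per parallel job. A rough heuristic for picking a --jobs value from installed RAM can be sketched as below. The 2 GB per job estimate comes from the numbers in this answer, while the 4 GB reserve for the rest of the system and the function names are assumptions.

```python
import os


def total_ram_gb(meminfo_path: str = "/proc/meminfo") -> float:
    """Total system RAM in GiB, read from the Linux MemTotal field (given in kB)."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / (1024 * 1024)
    raise RuntimeError("MemTotal not found")


def suggest_jobs(ram_gb: float, cpu_count: int,
                 gb_per_job: float = 2.0, reserve_gb: float = 4.0) -> int:
    """Jobs that fit in RAM after the reserve, capped by CPU count, at least 1."""
    by_ram = int((ram_gb - reserve_gb) // gb_per_job)
    return max(1, min(cpu_count, by_ram))


if __name__ == "__main__":
    try:
        ram = total_ram_gb()
    except OSError:
        ram = None  # not Linux, or /proc is unavailable
    if ram is not None:
        print(f"./build.sh --jobs {suggest_jobs(ram, os.cpu_count() or 1)}")
```

For a nominal 16 GB this gives 6, matching the advice above. On an iGPU system the reported total is lower because part of the memory is allocated to the GPU, so the result will usually be smaller.
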
Q: Does this work on Ubuntu/Fedora/other distros?
It should work on any Linux distro with ROCm 7.2+ installed. The build was tested on CachyOS (Arch-based) but the process is distro-agnostic.
Q: Can I build for gfx1100/gfx1030/other architectures?
Yes. Pass --arch gfx1100 (or any supported ISA) to build.sh. This repo's scripts work for any ROCm-supported GPU architecture.
Q: Will there be pre-built wheels?
A wheel packaging script (build_wheel.sh) is included. Pre-built wheels are planned for future releases once the build is validated on more systems.
Q: Does training work, or just inference?
Training works. The build includes full autograd support. However, distributed training is disabled (USE_DISTRIBUTED=0) since this targets single-iGPU systems.
pytorch-gfx1150/
    build.sh                 # Main build script
    build_wheel.sh           # Wheel packaging script
    benchmarks/
        run_benchmarks.py    # GPU benchmark suite
    README.md
    CHANGELOG.md
    LICENSE                  # MIT
    .gitignore
Found an issue or got this working on different hardware? Open an issue or PR. Of particular interest:
- Benchmark results from other RDNA 3.5 devices (Radeon 880M, etc.)
- Compatibility reports from different Linux distributions
- ROCm version compatibility findings
- Performance optimizations
MIT. See LICENSE.
- amdxdna-strix-fix — Patch for AMD XDNA NPU driver on Strix Point/Halo
- R.A.G-Race-Router — Tri-processor inference runtime (CPU + GPU + NPU)
- unified-ml — Custom HIP + Vulkan kernels for AMD APU unified memory
- The PyTorch team for ROCm support
- AMD for ROCm and open-source GPU compute
- The Arch/CachyOS community for bleeding-edge ROCm packages