pytorch-gfx1150

PyTorch built from source for AMD RDNA 3.5 (gfx1150) integrated GPUs

 ____       _____              _         __ _  _ ____   ___
|  _ \ _   |_   _|__  _ __ ___| |__     / _/ || | ___| / _ \
| |_) | | | || |/ _ \| '__/ __| '_ \   | |_| || |___ \| | | |
|  __/| |_| || | (_) | | | (__| | | |  |  _|__   _|__) | |_| |
|_|    \__, ||_|\___/|_|  \___|_| |_|  |_|    |_||____/ \___/
       |___/           gfx1150 / RDNA 3.5 / Radeon 890M

Why This Exists

AMD's Ryzen AI 300 series ("Strix Point") processors include RDNA 3.5 integrated GPUs — the Radeon 890M and 880M. These use the gfx1150 ISA, which is not supported by any pre-built PyTorch package.

Here is what happens when you try the standard approaches:

Approach	Result
`pip install torch` (CPU)	Works, but no GPU acceleration
`pip install torch --index-url .../rocm6.3`	Detects GPU, then crashes: `invalid device function`
`HSA_OVERRIDE_GFX_VERSION=11.0.0`	Does not map to a valid target for gfx1150
Build from source with `PYTORCH_ROCM_ARCH=gfx1150`	Native GPU acceleration

This repository provides the build scripts, documentation, and benchmarks to get PyTorch running with real GPU acceleration on RDNA 3.5 hardware. No hacks, no overrides, no compatibility shims.

Supported Hardware

GPU	Processor	Architecture	Status
Radeon 890M	Ryzen AI 9 HX 370 / HX 375	gfx1150 (RDNA 3.5)	Tested
Radeon 880M	Ryzen AI 7 PRO 360	gfx1150 (RDNA 3.5)	Expected to work
Other RDNA 3.5	Various	gfx1150 / gfx1151	Should work (untested)

Known working device: GPD Pocket 4 (Ryzen AI 9 HX 370 / Radeon 890M / 32GB RAM)

Benchmarks

Tested on AMD Radeon 890M with ROCm 7.2.0 and PyTorch 2.12.0a0:

Raw Compute

Test	Result	Notes
FP32 matmul 512x512	~1,059 GFLOPS	Peak single-precision
FP32 matmul 1024x1024	~1,043 GFLOPS	Sustained single-precision
FP16 matmul 1024x1024	~556 GFLOPS	Half-precision
Shared VRAM	11.5 GB	System memory allocated to GPU
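
For context on how throughput figures like these are commonly derived: an n x n matmul costs roughly 2n^3 floating-point operations, so GFLOPS follows from the measured time. The sketch below shows only that arithmetic. The helper name and the example timing are hypothetical, and this is not taken from benchmarks/run_benchmarks.py, whose exact method is not described here.

```python
def matmul_gflops(n: int, seconds: float) -> float:
    """Approximate GFLOPS for one (n x n) @ (n x n) matmul that took `seconds`."""
    flops = 2 * n**3  # n^3 multiply-adds, each counted as two operations
    return flops / seconds / 1e9


# Hypothetical timing: a 1024x1024 matmul that completes in 2.06 ms
print(f"{matmul_gflops(1024, 2.06e-3):.0f} GFLOPS")
```

When timing on the GPU, the elapsed time should be measured after torch.cuda.synchronize(), since kernel launches are asynchronous.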

Real-World Inference (MusicGen, 5 seconds of audio)

Configuration	Time	Speedup
CPU only	19.4s	baseline
GPU (no AOTriton)	49.4s	0.4x (SLOWER)
GPU + AOTriton experimental	12.8s	1.5x faster

The AOTriton result demonstrates why TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 is critical. See Known Issues.
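
The speedup column in the table is simply the CPU time divided by each GPU time. A small sketch of that arithmetic, using only the timings listed above (the variable names are illustrative):

```python
cpu_seconds = 19.4  # CPU-only baseline from the table

gpu_runs = {
    "GPU (no AOTriton)": 49.4,
    "GPU + AOTriton experimental": 12.8,
}

# Speedup relative to CPU: values below 1.0 mean the GPU run was slower.
speedups = {name: cpu_seconds / seconds for name, seconds in gpu_runs.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```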

Quick Start

1. Install prerequisites

ROCm 7.2+ (required):

# Arch/CachyOS
sudo pacman -S rocm-hip-runtime rocm-hip-sdk rocm-opencl-runtime

# Ubuntu (see https://rocm.docs.amd.com)
# Follow AMD's official ROCm installation guide

Build tools:

# Arch/CachyOS
sudo pacman -S base-devel cmake ninja python python-pip git

# Ubuntu/Debian
sudo apt install build-essential cmake ninja-build python3 python3-pip python3-venv git

2. Verify your GPU is detected

rocminfo | grep -E "Marketing Name|Name:.*gfx"
# Should show: "AMD Radeon 890M Graphics" and "gfx1150"
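
If you prefer to script this check, the same information can be pulled out of the rocminfo output in Python. This is a minimal sketch: the function name is hypothetical, and it assumes ISA names appear in the output as plain gfx tokens (for example gfx1150), as in the grep above.

```python
import re
import shutil
import subprocess


def find_gfx_targets(rocminfo_output: str) -> list[str]:
    """Return the unique gfx ISA names found in rocminfo output, in order."""
    seen = []
    for match in re.finditer(r"\bgfx[0-9a-f]+\b", rocminfo_output):
        if match.group(0) not in seen:
            seen.append(match.group(0))
    return seen


if __name__ == "__main__":
    if shutil.which("rocminfo") is None:
        print("rocminfo not found; is ROCm installed and on PATH?")
    else:
        out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
        targets = find_gfx_targets(out)
        print("Detected:", targets or "no gfx targets")
        print("gfx1150 present:", "gfx1150" in targets)
```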

3. Create a virtual environment (recommended)

python3 -m venv ~/pytorch-gfx1150-env
source ~/pytorch-gfx1150-env/bin/activate
pip install numpy pyyaml typing-extensions sympy filelock jinja2 networkx setuptools wheel cffi

4. Build PyTorch

git clone https://github.com/Peterc3-dev/pytorch-gfx1150.git
cd pytorch-gfx1150
chmod +x build.sh
./build.sh --venv ~/pytorch-gfx1150-env

The build takes approximately 44 minutes on a Ryzen AI 9 HX 370 with MAX_JOBS=10.
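
After the build completes, it can be useful to confirm that gfx1150 was actually among the compiled targets before moving on. The sketch below is one way to do that. The helper built_for is hypothetical; torch.cuda.get_arch_list() and torch.version.hip are existing PyTorch attributes, but the assumption that a ROCm build lists gfx names there should be checked against your own build.

```python
def built_for(arch_list, target="gfx1150"):
    """True if `target` is among the architectures a build was compiled for."""
    # Entries may carry feature suffixes such as "gfx90a:sramecc+:xnack-".
    return any(entry.split(":")[0] == target for entry in arch_list)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("HIP version:", torch.version.hip)  # None on non-ROCm builds
        archs = torch.cuda.get_arch_list()
        print("Compiled for:", archs)
        print("gfx1150 in build:", built_for(archs))
```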

5. Verify

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

python3 -c "
import torch
print(f'PyTorch:  {torch.__version__}')
print(f'CUDA:     {torch.cuda.is_available()}')
print(f'Device:   {torch.cuda.get_device_name(0)}')
x = torch.randn(1000, 1000, device='cuda')
y = torch.mm(x, x)
print(f'Matmul:   OK ({y.shape})')
"

6. Run benchmarks

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
python3 benchmarks/run_benchmarks.py --markdown

Build Options

The build.sh script supports several options:

./build.sh                              # Fresh build, default settings
./build.sh --rebuild                    # Incremental rebuild (skip clone)
./build.sh --arch gfx1150              # Explicit GPU architecture
./build.sh --jobs 12                   # Custom parallel job count
./build.sh --venv ~/my_venv            # Use a specific virtualenv
./build.sh --build-dir ~/my/pytorch    # Custom source directory
./build.sh --branch v2.6.0            # Build a specific PyTorch version
./build.sh --dry-run                   # Show config without building

Environment Variables

These should be set in your shell profile or before running any PyTorch workload:

# REQUIRED for good performance on gfx1150
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

# Optional: install path, stability, and logging tweaks
export ROCM_PATH=/opt/rocm
export HSA_ENABLE_SDMA=0          # May help with stability on some systems
export AMD_LOG_LEVEL=0            # Reduce ROCm log noise
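
For scripts and notebooks that cannot rely on the shell profile, the same setup can be done in-process. This is a sketch under assumptions: prepare_env is a hypothetical helper, and setting the variable before importing torch is a conservative choice, since this README does not state exactly when PyTorch reads it. The check for HSA_OVERRIDE_GFX_VERSION reflects the warning in the FAQ.

```python
import os

AOTRITON_FLAG = "TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL"


def prepare_env(env=os.environ):
    """Set the AOTriton flag if it is unset and return a list of warnings."""
    warnings = []
    env.setdefault(AOTRITON_FLAG, "1")
    if env[AOTRITON_FLAG] != "1":
        warnings.append(f"{AOTRITON_FLAG} is {env[AOTRITON_FLAG]!r}, expected '1'")
    if "HSA_OVERRIDE_GFX_VERSION" in env:
        warnings.append(
            "HSA_OVERRIDE_GFX_VERSION is set; unset it for native gfx1150 builds"
        )
    return warnings


for message in prepare_env():
    print("warning:", message)

# import torch  # import only after the environment has been prepared
```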

Known Issues

AOTriton experimental flag is required

Without TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1, PyTorch's Scaled Dot-Product Attention (SDPA) falls back to a naive implementation that is slower than the CPU (49.4s vs 19.4s in the MusicGen benchmark above). The flash-attention and memory-efficient attention kernels are not enabled for gfx1150 by default. Setting the flag opts in to the AOTriton attention kernels, which PyTorch still treats as experimental on this architecture.

Always set this variable. Add it to your shell startup file (~/.bashrc, ~/.zshrc, or a fish universal variable):

# bash (for zsh, append to ~/.zshrc instead)
echo 'export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1' >> ~/.bashrc

# fish
set -Ux TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL 1
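
To see the effect of the flag on your own machine, one option is to time SDPA directly, once with the variable set and once without. The sketch below is a rough illustration, not the benchmark used for the MusicGen numbers: time_sdpa and its tensor shapes are arbitrary choices, and only torch.nn.functional.scaled_dot_product_attention is an actual PyTorch API.

```python
import time


def time_sdpa(device="cuda", seq_len=1024, heads=8, head_dim=64, iters=10):
    """Average seconds per scaled_dot_product_attention call on `device`."""
    import torch
    import torch.nn.functional as F

    # Half precision on the GPU, float32 on the CPU fallback.
    dtype = torch.float16 if device == "cuda" else torch.float32
    q, k, v = (
        torch.randn(1, heads, seq_len, head_dim, device=device, dtype=dtype)
        for _ in range(3)
    )
    F.scaled_dot_product_attention(q, k, v)  # warm-up call
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"{device}: {time_sdpa(device) * 1e3:.2f} ms per call")
```

Run it in two fresh shells, one with the variable exported and one without, and compare the reported times.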

Hipify must be run manually

Before building PyTorch for ROCm, the CUDA-to-HIP translation ("hipify") step must be run:

cd ~/builds/pytorch-gfx1150
python3 tools/amd_build/build_amd.py

The build.sh script handles this automatically, but if you are building manually, do not skip it. The PyTorch build system does not run hipify on its own.

cmake 4.x compatibility

cmake 4.x changed default policy behavior. Some older PyTorch subprojects fail with policy errors. The fix:

export CMAKE_POLICY_VERSION_MINIMUM=3.5

This is set automatically by build.sh.

GCC 15 warnings-as-errors

GCC 15 introduces new warnings that PyTorch treats as errors. The build script sets:

export CXXFLAGS="-Wno-error=deprecated-declarations -Wno-error=maybe-uninitialized -Wno-error"

FP16 performance lower than expected

FP16 matmul shows ~556 GFLOPS vs ~1043 for FP32. This is likely because the RDNA 3.5 iGPU's FP16 throughput depends on packed math instructions, and the current ROCm/PyTorch code path may not be fully exploiting them. This may improve with future ROCm updates.

FAQ

Q: Do I need a discrete GPU? No. This targets integrated GPUs (iGPUs) — the Radeon 890M/880M built into Ryzen AI 300 processors. No discrete GPU required.

Q: Can I use the pre-built ROCm wheels from PyTorch.org? No. Those target older architectures (gfx900, gfx906, gfx908, gfx90a, gfx942, gfx1030, gfx1100). They will detect your GPU but crash at runtime with invalid device function.

Q: What about HSA_OVERRIDE_GFX_VERSION? This trick makes the runtime treat one GPU architecture as another. It works in some cases (e.g., running a gfx1036 iGPU as gfx1030), but there is no valid override target for gfx1150. Building from source is the only option.

Warning: If you previously set HSA_OVERRIDE_GFX_VERSION=11.0.0 to target gfx1100 (RDNA 3), remove it before building or running native gfx1150 builds. The override makes the HIP runtime treat the GPU as gfx1100, so it looks for gfx1100 code objects that a gfx1150-only build does not contain, which produces invalid device function errors. Native builds must run with HSA_OVERRIDE_GFX_VERSION unset.

Q: How much disk space do I need? About 15 GB for the PyTorch source + build artifacts. The final installed size is around 1-2 GB.
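
A quick way to check free space before starting is shutil.disk_usage from the Python standard library. This is a minimal sketch: has_room is a hypothetical helper, the 15 GB threshold comes from the estimate above, and the current directory stands in for wherever you plan to build.

```python
import shutil


def has_room(free_bytes: int, required_gb: float = 15.0) -> bool:
    """True if `free_bytes` covers the estimated source + build footprint."""
    return free_bytes >= required_gb * 1024**3


if __name__ == "__main__":
    free = shutil.disk_usage(".").free  # free bytes on the current filesystem
    print(f"Free: {free / 1024**3:.1f} GB, enough for build: {has_room(free)}")
```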

Q: How much RAM does the build need? With MAX_JOBS=10, peak memory usage is around 16-20 GB. If you have 16 GB RAM, reduce to --jobs 6 or lower.
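
The numbers above suggest roughly 2 GB of RAM per parallel job at peak (about 20 GB for 10 jobs). The sketch below turns that into a rough heuristic for picking a --jobs value. Both the 2 GB per job figure and the 4 GB reserved for the rest of the system are assumptions chosen to match the guidance above (16 GB gives 6 jobs), not measured values, and suggest_jobs is a hypothetical helper.

```python
import os


def suggest_jobs(ram_gb: float, cpu_count: int,
                 gb_per_job: float = 2.0, reserve_gb: float = 4.0) -> int:
    """Suggest a parallel job count limited by both memory and CPU threads."""
    by_memory = int((ram_gb - reserve_gb) // gb_per_job)
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    for ram in (16, 32):
        print(f"{ram} GB RAM -> --jobs {suggest_jobs(ram, cpus)}")
```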

Q: Does this work on Ubuntu/Fedora/other distros? It should work on any Linux distro with ROCm 7.2+ installed. The build was tested on CachyOS (Arch-based) but the process is distro-agnostic.

Q: Can I build for gfx1100/gfx1030/other architectures? Yes. Pass --arch gfx1100 (or any supported ISA) to build.sh. This repo's scripts work for any ROCm-supported GPU architecture.

Q: Will there be pre-built wheels? A wheel packaging script (build_wheel.sh) is included. Pre-built wheels are planned for future releases once the build is validated on more systems.

Q: Does training work, or just inference? Training works. The build includes full autograd support. However, distributed training is disabled (USE_DISTRIBUTED=0) since this targets single-iGPU systems.

Project Structure

pytorch-gfx1150/
  build.sh              # Main build script
  build_wheel.sh        # Wheel packaging script
  benchmarks/
    run_benchmarks.py   # GPU benchmark suite
  README.md
  CHANGELOG.md
  LICENSE               # MIT
  .gitignore

Contributing

Found an issue or got this working on different hardware? Open an issue or PR. Of particular interest:

Benchmark results from other RDNA 3.5 devices (Radeon 880M, etc.)
Compatibility reports from different Linux distributions
ROCm version compatibility findings
Performance optimizations

License

MIT. See LICENSE.

Related projects

amdxdna-strix-fix — Patch for AMD XDNA NPU driver on Strix Point/Halo
R.A.G-Race-Router — Tri-processor inference runtime (CPU + GPU + NPU)
unified-ml — Custom HIP + Vulkan kernels for AMD APU unified memory

Acknowledgments

The PyTorch team for ROCm support
AMD for ROCm and open-source GPU compute
The Arch/CachyOS community for bleeding-edge ROCm packages
