
Peterc3-dev/pytorch-gfx1150

pytorch-gfx1150

PyTorch built from source for AMD RDNA 3.5 (gfx1150) integrated GPUs

 ____       _____              _         __ _  _ ____   ___
|  _ \ _   |_   _|__  _ __ ___| |__     / _/ || | ___| / _ \
| |_) | | | || |/ _ \| '__/ __| '_ \   | |_| || |___ \| | | |
|  __/| |_| || | (_) | | | (__| | | |  |  _|__   _|__) | |_| |
|_|    \__, ||_|\___/|_|  \___|_| |_|  |_|    |_||____/ \___/
       |___/           gfx1150 / RDNA 3.5 / Radeon 890M

ROCm 7.2.0 | PyTorch 2.12 | Python 3.14 | gfx1150 | License: MIT


Why This Exists

AMD's Ryzen AI 300 series ("Strix Point") processors include RDNA 3.5 integrated GPUs — the Radeon 890M and 880M. These use the gfx1150 ISA, which is not supported by any pre-built PyTorch package.

Here is what happens when you try the standard approaches:

| Approach | Result |
|---|---|
| pip install torch (CPU) | Works, but no GPU acceleration |
| pip install torch --index-url .../rocm6.3 | Detects GPU, then crashes: invalid device function |
| HSA_OVERRIDE_GFX_VERSION=11.0.0 | Does not map to a valid target for gfx1150 |
| Build from source with PYTORCH_ROCM_ARCH=gfx1150 | Native GPU acceleration |

This repository provides the build scripts, documentation, and benchmarks to get PyTorch running with real GPU acceleration on RDNA 3.5 hardware. No hacks, no overrides, no compatibility shims.
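
The "invalid device function" failure comes down to which GPU targets a given PyTorch build was compiled for. The following is a minimal sketch for checking that, assuming `torch.cuda.get_arch_list()` reports ROCm targets as gfx names (possibly with feature suffixes after a colon). The helper name `build_supports` is illustrative and not part of PyTorch or this repository.

```python
def build_supports(arch_list, target="gfx1150"):
    """Return True if `target` appears in a build's compiled arch list."""
    # Strip optional feature suffixes such as ":xnack-" before comparing.
    return target in {arch.split(":")[0] for arch in arch_list}


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        archs = torch.cuda.get_arch_list()  # empty if no GPU is usable
        print("compiled targets:", archs)
        print("gfx1150 supported:", build_supports(archs))
```

Run against a stock ROCm wheel, a check like this would be expected to report that gfx1150 is missing; run against a build from this repository, it should report the target as present.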


Supported Hardware

| GPU | Processor | Architecture | Status |
|---|---|---|---|
| Radeon 890M | Ryzen AI 9 HX 370 / HX 375 | gfx1150 (RDNA 3.5) | Tested |
| Radeon 880M | Ryzen AI 7 PRO 360 | gfx1150 (RDNA 3.5) | Expected to work |
| Other RDNA 3.5 | Various | gfx1150 / gfx1151 | Should work (untested) |

Known working device: GPD Pocket 4 (Ryzen AI 9 HX 370 / Radeon 890M / 32GB RAM)


Benchmarks

Tested on AMD Radeon 890M with ROCm 7.2.0 and PyTorch 2.12.0a0:

Raw Compute

| Test | Result | Notes |
|---|---|---|
| FP32 matmul 512x512 | ~1,059 GFLOPS | Peak single-precision |
| FP32 matmul 1024x1024 | ~1,043 GFLOPS | Sustained single-precision |
| FP16 matmul 1024x1024 | ~556 GFLOPS | Half-precision |
| Shared VRAM | 11.5 GB | System memory allocated to GPU |
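
To put the GFLOPS figures in context, the sketch below converts between matmul time and throughput. It assumes the usual convention that multiplying two n x n matrices costs 2n³ floating-point operations; the README does not state which convention the benchmark script uses, so treat that as an assumption. The function names are illustrative.

```python
def matmul_gflops(n, seconds):
    """GFLOPS for one n x n by n x n matmul, counting 2*n**3 operations."""
    return 2 * n**3 / seconds / 1e9


def seconds_per_matmul(n, gflops):
    """Invert the formula: time one matmul takes at a given GFLOPS rate."""
    return 2 * n**3 / (gflops * 1e9)


# At ~1,043 GFLOPS, a 1024x1024 FP32 matmul takes roughly 2 ms.
print(f"{seconds_per_matmul(1024, 1043) * 1e3:.2f} ms")
```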

Real-World Inference (MusicGen, 5 seconds of audio)

| Configuration | Time | Speedup |
|---|---|---|
| CPU only | 19.4s | baseline |
| GPU (no AOTriton) | 49.4s | 0.4x (SLOWER) |
| GPU + AOTriton experimental | 12.8s | 1.5x faster |

The AOTriton result demonstrates why TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 is critical. See Known Issues.
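
The speedup column is the CPU baseline time divided by each configuration's time. A short sketch that reproduces it from the timings in the table (the `speedup` helper is illustrative):

```python
# Wall-clock times in seconds, copied from the table above.
timings = {
    "CPU only": 19.4,
    "GPU (no AOTriton)": 49.4,
    "GPU + AOTriton experimental": 12.8,
}


def speedup(seconds, baseline_seconds):
    """Values above 1 are faster than the baseline, below 1 are slower."""
    return baseline_seconds / seconds


baseline = timings["CPU only"]
for name, seconds in timings.items():
    print(f"{name}: {speedup(seconds, baseline):.1f}x")
```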


Quick Start

1. Install prerequisites

ROCm 7.2+ (required):

# Arch/CachyOS
sudo pacman -S rocm-hip-runtime rocm-hip-sdk rocm-opencl-runtime

# Ubuntu (see https://rocm.docs.amd.com)
# Follow AMD's official ROCm installation guide

Build tools:

# Arch/CachyOS
sudo pacman -S base-devel cmake ninja python python-pip git

# Ubuntu/Debian
sudo apt install build-essential cmake ninja-build python3 python3-pip python3-venv git

2. Verify your GPU is detected

rocminfo | grep -E "Marketing Name|Name:.*gfx"
# Should show: "AMD Radeon 890M Graphics" and "gfx1150"
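
The same check can be scripted. This sketch assumes rocminfo prints each agent's ISA name on a line of the form `Name: gfx1150`, which matches the grep pattern above; `gfx_targets` is a hypothetical helper, not part of this repository.

```python
import re
import subprocess


def gfx_targets(rocminfo_text):
    """Return the distinct gfx ISA names found on 'Name:' lines, in order."""
    found = []
    pattern = r"^[ \t]*Name:[ \t]*(gfx[0-9a-f]+)[ \t]*$"
    for match in re.finditer(pattern, rocminfo_text, re.MULTILINE):
        if match.group(1) not in found:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    try:
        out = subprocess.run(
            ["rocminfo"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"could not run rocminfo: {exc}")
    else:
        print(gfx_targets(out) or "no gfx agents reported")
```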

3. Create a virtual environment (recommended)

python3 -m venv ~/pytorch-gfx1150-env
source ~/pytorch-gfx1150-env/bin/activate
pip install numpy pyyaml typing-extensions sympy filelock jinja2 networkx setuptools wheel cffi

4. Build PyTorch

git clone https://github.com/Peterc3-dev/pytorch-gfx1150.git
cd pytorch-gfx1150
chmod +x build.sh
./build.sh --venv ~/pytorch-gfx1150-env

The build takes approximately 44 minutes on a Ryzen AI 9 HX 370 with MAX_JOBS=10.

5. Verify

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

python3 -c "
import torch
print(f'PyTorch:  {torch.__version__}')
print(f'CUDA:     {torch.cuda.is_available()}')
print(f'Device:   {torch.cuda.get_device_name(0)}')
x = torch.randn(1000, 1000, device='cuda')
y = torch.mm(x, x)
print(f'Matmul:   OK ({y.shape})')
"

6. Run benchmarks

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
python3 benchmarks/run_benchmarks.py --markdown

Build Options

The build.sh script supports several options:

./build.sh                              # Fresh build, default settings
./build.sh --rebuild                    # Incremental rebuild (skip clone)
./build.sh --arch gfx1150              # Explicit GPU architecture
./build.sh --jobs 12                   # Custom parallel job count
./build.sh --venv ~/my_venv            # Use a specific virtualenv
./build.sh --build-dir ~/my/pytorch    # Custom source directory
./build.sh --branch v2.6.0            # Build a specific PyTorch version
./build.sh --dry-run                   # Show config without building

Environment Variables

These should be set in your shell profile or before running any PyTorch workload:

# REQUIRED for good performance on gfx1150
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

# Optional: useful for debugging
export ROCM_PATH=/opt/rocm
export HSA_ENABLE_SDMA=0          # May help with stability on some systems
export AMD_LOG_LEVEL=0            # Reduce ROCm log noise
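
A workload can also check its own environment at startup. The sketch below flags a missing AOTriton variable and, following the FAQ warning, a leftover HSA_OVERRIDE_GFX_VERSION. The function name and messages are illustrative.

```python
import os


def env_problems(env=None):
    """Return a list of human-readable problems with the runtime environment."""
    env = os.environ if env is None else env
    problems = []
    if env.get("TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL") != "1":
        problems.append("TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL is not set to 1")
    if "HSA_OVERRIDE_GFX_VERSION" in env:
        problems.append(
            "HSA_OVERRIDE_GFX_VERSION is set; unset it for native gfx1150 builds"
        )
    return problems


if __name__ == "__main__":
    for problem in env_problems():
        print("warning:", problem)
```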

Known Issues

AOTriton experimental flag is required

Without TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1, PyTorch's Scaled Dot-Product Attention (SDPA) falls back to a naive implementation that, in the MusicGen benchmark above, ran slower than the CPU. PyTorch treats the flash-attention and memory-efficient attention kernels on gfx1150 as experimental and leaves them disabled by default. Setting the flag opts in to the AOTriton-backed attention kernels. AOTriton compiles its kernels ahead of time, so this is an opt-in switch rather than a JIT compilation step.

Always set this variable. Add it to your shell startup file (.bashrc, .zshrc, or config.fish):

# bash (for zsh, append to ~/.zshrc instead)
echo 'export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1' >> ~/.bashrc

# fish
set -Ux TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL 1
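
To see the effect on your own machine, you can time SDPA directly. This is a rough sketch, not the repository's benchmark: the tensor shape and iteration count are arbitrary choices, and setting the variable before importing torch is a conservative assumption about when PyTorch reads it.

```python
import os

# Conservative assumption: set the flag before torch is imported.
os.environ.setdefault("TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL", "1")

import time


def time_sdpa(seq_len=1024, heads=8, head_dim=64, iters=10):
    """Average seconds per scaled_dot_product_attention call on the GPU."""
    import torch
    import torch.nn.functional as F

    shape = (1, heads, seq_len, head_dim)  # (batch, heads, sequence, head_dim)
    q, k, v = (
        torch.randn(shape, device="cuda", dtype=torch.float16) for _ in range(3)
    )
    F.scaled_dot_product_attention(q, k, v)  # warm-up call
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    try:
        import torch
        gpu_ready = torch.cuda.is_available()
    except ImportError:
        gpu_ready = False
    if gpu_ready:
        print(f"{time_sdpa() * 1e3:.2f} ms per call")
    else:
        print("no usable GPU build of torch found; skipping")
```

Because `setdefault` only fills in a missing value, a value you export in the shell beforehand takes precedence over the one set here.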

Hipify must be run before manual builds

Before building PyTorch for ROCm by hand, the CUDA-to-HIP translation ("hipify") step must be run:

cd ~/builds/pytorch-gfx1150
python3 tools/amd_build/build_amd.py

The build.sh script handles this automatically, but if you are building manually, do not skip it. The PyTorch build system does not run hipify on its own.

cmake 4.x compatibility

cmake 4.x changed default policy behavior. Some older PyTorch subprojects fail with policy errors. The fix:

export CMAKE_POLICY_VERSION_MINIMUM=3.5

This is set automatically by build.sh.
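
If you build manually, you can check whether the workaround applies to your cmake. This sketch parses the first line of `cmake --version` output ("cmake version X.Y.Z"); the helper names are illustrative, and the 4.x threshold is taken from the description above.

```python
import re
import subprocess


def cmake_major(version_output):
    """Extract the major version from `cmake --version` output, or None."""
    match = re.search(r"cmake version (\d+)\.", version_output)
    return int(match.group(1)) if match else None


def needs_policy_minimum(version_output):
    """True when cmake is 4.x or newer, where the workaround above applies."""
    major = cmake_major(version_output)
    return major is not None and major >= 4


if __name__ == "__main__":
    try:
        out = subprocess.run(
            ["cmake", "--version"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"could not run cmake: {exc}")
    else:
        if needs_policy_minimum(out):
            print("export CMAKE_POLICY_VERSION_MINIMUM=3.5 before building")
        else:
            print("cmake is older than 4.x; no policy override needed")
```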

GCC 15 warnings-as-errors

GCC 15 introduces new warnings that PyTorch treats as errors. The build script sets:

export CXXFLAGS="-Wno-error=deprecated-declarations -Wno-error=maybe-uninitialized -Wno-error"

FP16 performance lower than expected

FP16 matmul reaches ~556 GFLOPS versus ~1,043 GFLOPS for FP32, roughly half, where FP16 would normally be expected to match or exceed FP32. This is likely because the RDNA 3.5 iGPU's FP16 throughput depends on packed math instructions, and the current ROCm/PyTorch code path may not be fully exploiting them. This may improve with future ROCm updates.


FAQ

Q: Do I need a discrete GPU? No. This targets integrated GPUs (iGPUs) — the Radeon 890M/880M built into Ryzen AI 300 processors. No discrete GPU required.

Q: Can I use the pre-built ROCm wheels from PyTorch.org? No. Those target older architectures (gfx900, gfx906, gfx908, gfx90a, gfx942, gfx1030, gfx1100). They will detect your GPU but crash at runtime with invalid device function.

Q: What about HSA_OVERRIDE_GFX_VERSION? This trick makes the ROCm runtime treat the GPU as a different architecture than the hardware's own. It works when a close, compatible target exists (e.g., running a gfx1036 iGPU as gfx1030), but there is no valid override target for gfx1150. Building from source is the only option.

Warning: If you previously set HSA_OVERRIDE_GFX_VERSION=11.0.0 for gfx1100 (RDNA 3) hardware, remove it before building or running native gfx1150 builds. The override makes the HIP runtime treat the GPU as gfx1100 and look for gfx1100 code, which a binary compiled only for gfx1150 does not contain, so kernels fail with invalid device function errors. Native builds must run without any HSA_OVERRIDE_GFX_VERSION set.

Q: How much disk space do I need? About 15 GB for the PyTorch source + build artifacts. The final installed size is around 1-2 GB.

Q: How much RAM does the build need? With MAX_JOBS=10, peak memory usage is around 16-20 GB. If you have 16 GB RAM, reduce to --jobs 6 or lower.
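
As a rough way to pick a --jobs value, the sketch below budgets about 2 GB per job, which comes from the upper bound quoted above (around 20 GB at MAX_JOBS=10). The 4 GB reserved for the rest of the system is an assumption, chosen so that a 16 GB machine lands on the suggested 6 jobs. Treat the result as a starting point, not a guarantee.

```python
import os

GB_PER_JOB = 2.0   # from the FAQ's upper bound: ~20 GB at MAX_JOBS=10
RESERVED_GB = 4.0  # assumed headroom for the OS and desktop (not from the README)


def suggest_jobs(ram_gb, cpu_count=None):
    """Suggest a parallel job count bounded by both memory and CPU threads."""
    cpu_count = cpu_count or os.cpu_count() or 1
    by_memory = int((ram_gb - RESERVED_GB) // GB_PER_JOB)
    return max(1, min(by_memory, cpu_count))


if __name__ == "__main__":
    for ram in (16, 32, 64):
        print(f"{ram} GB RAM -> --jobs {suggest_jobs(ram)}")
```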

Q: Does this work on Ubuntu/Fedora/other distros? It should work on any Linux distro with ROCm 7.2+ installed. The build was tested on CachyOS (Arch-based) but the process is distro-agnostic.

Q: Can I build for gfx1100/gfx1030/other architectures? Yes. Pass --arch gfx1100 (or any supported ISA) to build.sh. This repo's scripts work for any ROCm-supported GPU architecture.

Q: Will there be pre-built wheels? A wheel packaging script (build_wheel.sh) is included. Pre-built wheels are planned for future releases once the build is validated on more systems.

Q: Does training work, or just inference? Training works. The build includes full autograd support. However, distributed training is disabled (USE_DISTRIBUTED=0) since this targets single-iGPU systems.


Project Structure

pytorch-gfx1150/
  build.sh              # Main build script
  build_wheel.sh        # Wheel packaging script
  benchmarks/
    run_benchmarks.py   # GPU benchmark suite
  README.md
  CHANGELOG.md
  LICENSE               # MIT
  .gitignore

Contributing

Found an issue or got this working on different hardware? Open an issue or PR. Of particular interest:

  • Benchmark results from other RDNA 3.5 devices (Radeon 880M, etc.)
  • Compatibility reports from different Linux distributions
  • ROCm version compatibility findings
  • Performance optimizations

License

MIT. See LICENSE.


Acknowledgments

  • The PyTorch team for ROCm support
  • AMD for ROCm and open-source GPU compute
  • The Arch/CachyOS community for bleeding-edge ROCm packages
