Add support for AMD GPU by pxl-th · Pull Request #798 · nerfstudio-project/gsplat

pxl-th · 2025-09-13T20:58:09Z

This PR introduces AMD GPU support for gsplat.

Tested on:

CPU: AMD Ryzen 7 5800X 8-Core @ 16x 4.594GHz
GPU: Radeon RX 7900 XTX
OS: Ubuntu 24.04
ROCm: 6.4 (from official installation script).

Tested and confirmed working:

3DGS (simple & mcmc; packed & un-packed)
2DGS (packed & un-packed)
3DGUT

Relies on:

Closes #771.
Closes #434.

Move the glm library out of the cuda/ directory to avoid hipifying it, which otherwise causes confusion during compilation.
Since AMD GPUs do not support cg::labeled_partition, skip the warp reductions there and do the global memory writes directly. To reduce code duplication, introduce a FOR_HIP variable that determines whether to use the labeled partition or just a placeholder that exists only to avoid compilation errors, e.g.:

#if FOR_HIP
auto warp_group_g = warp; // Not used, just here to not error in the if-statements.
#else
auto warp_group_g = cg::labeled_partition(warp, gid);
#endif

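and, in places where only one thread per group would do the global atomic add, short-circuit the rank check on FOR_HIP (since FOR_HIP is a compile-time constant, this should eliminate the branch altogether on HIP):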

if (FOR_HIP || warp_group_g.thread_rank() == 0) {
...
}
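The FOR_HIP pattern above can be sketched in plain C++ to show why the condition costs nothing on the HIP path. This is a minimal sketch under assumptions: deriving FOR_HIP from `__HIP_PLATFORM_AMD__` is a guess at how the macro could be set (the PR may define it differently, e.g. from the build script), and `FakeGroup` and `should_write` are hypothetical names that stand in for the cooperative-groups partition and the guarded atomic add.

```cpp
#include <cassert>

// Assumption: FOR_HIP is a 0/1 preprocessor macro. Deriving it from
// __HIP_PLATFORM_AMD__ is only one possible way to set it.
#ifndef FOR_HIP
#  ifdef __HIP_PLATFORM_AMD__
#    define FOR_HIP 1
#  else
#    define FOR_HIP 0
#  endif
#endif

// Hypothetical stand-in for a cooperative-groups partition.
struct FakeGroup {
    unsigned rank;
    unsigned thread_rank() const { return rank; }
};

// Returns true when the calling "thread" should issue the global write.
// With FOR_HIP == 1 the left operand is a constant true, so the compiler can
// drop the rank comparison and every thread writes. With FOR_HIP == 0 only
// rank 0 of the group writes.
inline bool should_write(const FakeGroup &g) {
    return FOR_HIP || g.thread_rank() == 0;
}
```

On a host build without HIP, FOR_HIP evaluates to 0, so only rank 0 passes the check, which mirrors the CUDA path where a single thread per labeled partition performs the atomic add.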

Replace std::array::at with [] indexing to avoid bounds checking, which causes compilation errors in device code:

/usr/lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/array:220:9: error: reference to __host__ function '__throw_out_of_range_fmt' in __host__ __device__ function
  220 |           std::__throw_out_of_range_fmt(__N("array::at: __n (which is %zu) "
      |                ^
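A minimal sketch of the indexing change, assuming a helper that reads from a std::array inside a function compiled for both host and device. `HOST_DEVICE` and `sum_first` are hypothetical names for illustration and are not taken from the PR. The point is that operator[] never reaches the host-only throw helper named in the error above, at the cost of leaving bounds to the caller.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumption: HOST_DEVICE is a hypothetical helper macro for this sketch.
// Under nvcc/hipcc it marks a function as callable from host and device;
// in a plain C++ build it expands to nothing.
#if defined(__CUDACC__) || defined(__HIPCC__)
#  define HOST_DEVICE __host__ __device__
#else
#  define HOST_DEVICE
#endif

// Sums the first n entries using unchecked operator[]. Using a.at(i) here
// would pull in the throwing bounds check, which is host-only.
// The caller must guarantee n <= N.
template <typename T, std::size_t N>
HOST_DEVICE T sum_first(const std::array<T, N> &a, std::size_t n) {
    T acc = T(0);
    for (std::size_t i = 0; i < n; ++i) {
        acc += a[i];
    }
    return acc;
}
```

Depending on the compiler, calling std::array members from device code may additionally need relaxed constexpr handling (for nvcc, the `--expt-relaxed-constexpr` flag); whether the gsplat build already passes it is not stated here.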

Since ROCm does not support cg::reduce, create equivalent warpSum reduction methods using the shfl_down intrinsic and use them where we reduce over the whole warp (i.e. tiled_partition in our case).
Use the respective NVCC flags for each GPU backend. E.g. -munsafe-fp-atomics is required to replace the CAS loop with fast hardware floating-point atomics, which significantly improves performance on AMD GPUs.
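The shfl_down-based warpSum mentioned above follows the usual halving pattern: each lane adds the value held by the lane `offset` positions higher, and `offset` halves every step until lane 0 holds the total. The block below is a host-side emulation of that pattern in plain C++, not the PR's device code. `warp_sum_emulated` and `kWarpSize` are hypothetical names, and the 32-lane width is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32; // assumption: 32-wide warp/wave

// Emulates one warp executing a shuffle-down sum on the host. lanes[i] is the
// value held by lane i. In each step lane i adds the value of lane i + offset,
// which is what a shfl_down-style intrinsic would deliver. Lanes whose source
// would fall outside the warp are left unchanged in this emulation; their
// contents do not matter because only lane 0 is read at the end.
inline float warp_sum_emulated(std::array<float, kWarpSize> lanes) {
    for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
        std::array<float, kWarpSize> next = lanes; // all lanes read before any write
        for (std::size_t lane = 0; lane + offset < kWarpSize; ++lane) {
            next[lane] = lanes[lane] + lanes[lane + offset];
        }
        lanes = next;
    }
    return lanes[0]; // only lane 0 is guaranteed to hold the full sum
}
```

The copy into `next` mirrors the lockstep behaviour of a real warp, where every lane reads its neighbour's old value before any lane updates its own. The reduction takes log2(32) = 5 steps.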

charyang-ai · 2026-02-24T16:09:29Z

How is the performance compared to an Nvidia RTX card, like the 4090?

Spacefish · 2026-04-11T13:10:58Z

https://github.com/Spacefish/gsplat-rdna/tree/rocm_7_2 haha, did the same thing but on AMD's fork of gsplat.

Performance is OK. I guess 8x8 is impacted a little bit, as we have to use shared memory to accumulate two 32-wide warps, but other sizes are fine and work on 32-wide waves like a charm.

pxl-th added 2 commits September 5, 2025 01:45

Initial AMD GPU enablement

4373426

Fixup submodule

e6805eb

pxl-th force-pushed the pxl-th/amd branch from 339ac82 to 1cfadf5 Compare September 13, 2025 21:00

Fixup remaining modules & cleanup

4e95ad6

pxl-th force-pushed the pxl-th/amd branch from 1cfadf5 to 4e95ad6 Compare September 13, 2025 21:01

pxl-th marked this pull request as ready for review September 13, 2025 21:07

pxl-th wants to merge 3 commits into nerfstudio-project:main from pxl-th:pxl-th/amd
