feat: add AMD GPU (HIP/ROCm) support and optimize GPU prover pipeline by max-sanchez · Pull Request #1 · hemilabs/sp1

max-sanchez · 2026-03-20T18:38:26Z

Port the SP1 GPU prover to AMD GPUs via HIP/ROCm, enabling proving on RDNA3 (RX 7900 XTX) and RDNA4 hardware alongside existing NVIDIA CUDA support. Also includes CPU pipeline optimizations that benefit both backends.

AMD HIP/ROCm porting:

Dual-target CMake build system (USE_HIP=ON for AMD, CUDA default)
build.rs auto-detects HIP/ROCm and sets compiler/linker flags
CUDA→HIP API mappings for all runtime, memory, stream, and event calls
Portable C++ replacements for PTX intrinsics (ptx_portable.cuh)
Custom cooperative_groups implementation for HIP (gpu_compat.cuh)
BN254 field stubs for HIP (mont_t depends on PTX, unused on AMD path)
Caching memory allocator to work around hipMallocAsync leak bug
Montgomery reduction shift optimization (MOD = 2^31 - 2^24 + 1)
Per-module CMakeLists.txt HIP language properties for CMake 4.x

Bug fixes:

Fix grinding kernel race condition: remove per-thread found_flag=false write that caused late-starting blocks to overwrite early successes, making PoW search run 100-300x longer than necessary on AMD
Fix compress kernel 4x redundancy: block dim (128,4,1)→(128,1,1) since kernel only uses threadIdx.x
Fix kb31_extension_t frobenius: z=one() not z=z0
Fix kb31_extension_t frobeniusInverse: W_INT=3 not 11
Fix kb31_extension_t schoolbook multiply: force Montgomery reduction via explicit kb31_t cast on each product to prevent accel_t overflow
Fix kb31_septic_extension_t FUN macro for HIP device compilation
Fix poseidon2.cuh dependent base class member access (this->absorb)
Fix zerocheck_eval.cuh switch/case variable scoping for clang
Fix scan.cuh shared arrays of non-trivial types for HIP
Fix alt_bn128.hpp unterminated preprocessor guard

CPU pipeline optimizations:

BTreeMap→HashMap for memory event collection (eliminates L3 thrashing)
Parallel recursion program compilation with rayon
Parallel generate_dependencies outer chip loop
Enable global_dependencies_opt in builder
Skip redundant execute() in benchmark (controller re-executes)

GPU kernel optimizations:

Poseidon2 internal linear layer: fast sum64 mod p using bitwise reduction (2^31 ≡ 2^24-1 mod p) instead of 64-bit division
launch_bounds on grinding (256,1), leafHash (256,1), compress (128,2)
Adaptive grinding grid size scaled to PoW difficulty
atomicCAS-based field reduction for HIP (replacing cuda::atomic_ref)

Benchmark results (SHA2, core mode):
RTX 4090 RX 7900 XTX Ratio
1M SHA2 2.73s 6.57s 2.4x
10M SHA2 16.85s 52.00s 3.1x
100M SHA2 151.44s 498.32s 3.3x

Motivation

Solution

PR Checklist

Added Tests
Added Documentation
Breaking changes

Port the SP1 GPU prover to AMD GPUs via HIP/ROCm, enabling proving on RDNA3 (RX 7900 XTX) and RDNA4 hardware alongside existing NVIDIA CUDA support. Also includes CPU pipeline optimizations that benefit both backends. AMD HIP/ROCm porting: - Dual-target CMake build system (USE_HIP=ON for AMD, CUDA default) - build.rs auto-detects HIP/ROCm and sets compiler/linker flags - CUDA→HIP API mappings for all runtime, memory, stream, and event calls - Portable C++ replacements for PTX intrinsics (ptx_portable.cuh) - Custom cooperative_groups implementation for HIP (gpu_compat.cuh) - BN254 field stubs for HIP (mont_t depends on PTX, unused on AMD path) - Caching memory allocator to work around hipMallocAsync leak bug - Montgomery reduction shift optimization (MOD = 2^31 - 2^24 + 1) - Per-module CMakeLists.txt HIP language properties for CMake 4.x Bug fixes: - Fix grinding kernel race condition: remove per-thread found_flag=false write that caused late-starting blocks to overwrite early successes, making PoW search run 100-300x longer than necessary on AMD - Fix compress kernel 4x redundancy: block dim (128,4,1)→(128,1,1) since kernel only uses threadIdx.x - Fix kb31_extension_t frobenius: z=one() not z=z0 - Fix kb31_extension_t frobeniusInverse: W_INT=3 not 11 - Fix kb31_extension_t schoolbook multiply: force Montgomery reduction via explicit kb31_t cast on each product to prevent accel_t overflow - Fix kb31_septic_extension_t FUN macro for HIP device compilation - Fix poseidon2.cuh dependent base class member access (this->absorb) - Fix zerocheck_eval.cuh switch/case variable scoping for clang - Fix scan.cuh __shared__ arrays of non-trivial types for HIP - Fix alt_bn128.hpp unterminated preprocessor guard CPU pipeline optimizations: - BTreeMap→HashMap for memory event collection (eliminates L3 thrashing) - Parallel recursion program compilation with rayon - Parallel generate_dependencies outer chip loop - Enable global_dependencies_opt in builder - Skip redundant execute() in benchmark (controller re-executes) GPU kernel optimizations: - Poseidon2 internal linear layer: fast sum64 mod p using bitwise reduction (2^31 ≡ 2^24-1 mod p) instead of 64-bit division - __launch_bounds__ on grinding (256,1), leafHash (256,1), compress (128,2) - Adaptive grinding grid size scaled to PoW difficulty - atomicCAS-based field reduction for HIP (replacing cuda::atomic_ref) Benchmark results (SHA2, core mode): RTX 4090 RX 7900 XTX Ratio 1M SHA2 2.73s 6.57s 2.4x 10M SHA2 16.85s 52.00s 3.1x 100M SHA2 151.44s 498.32s 3.3x Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Karatsuba extension field multiply for HIP: reduces base-field multiplies from 16 to 9 (44% fewer muls) using two-level Karatsuba decomposition. Uses x+x+x for W=3 fold (2 full-rate adds vs 1 quarter-rate multiply on RDNA3). - Batch absorb in Poseidon2 leafHash: absorb RATE=8 elements at a time instead of per-element, eliminating per-element branch overhead in the sponge absorb loop. - Singleton sppark_init: avoid redundant NTT warmup on every SpparkDftKoalaBear::default() call via std::sync::Once. - Increase tracegen BLOCK_DIM: recursion kernels 64→256 for better occupancy, riscv global 64→128 (heavier kernel, avoid spilling). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Batch small Merkle tree compress layers (<128 nodes) into a single kernel launch instead of one launch per layer. Reduces ~7 kernel dispatches per tree to 1, saving dispatch overhead across ~200 trees. - Replace zeros_in with with_sizes_in for folded_codeword in FRI commit phase (data is fully overwritten by LDE, zeroing is wasted). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace separate sum_dim kernel dispatch with inline atomic reduction in fix_and_sum_materialized_round_raw_with_reduction. The CUDA kernels already support writing reduced results atomically to an output buffer via the reduced_output pointer (previously always null). This eliminates one kernel launch per sumcheck round in the GPU challenger path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Allow 2 blocks per SM for leafHash kernel to improve latency hiding, especially on AMD RDNA3 where higher occupancy compensates for quarter-rate int32 multiply. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add degree-4 (quartic) GPU observe-and-sample kernel infrastructure for the zerocheck sumcheck, matching the existing quadratic (jagged) and cubic (GKR) patterns. The kernel performs 5-point Lagrange interpolation, eq correction, Fiat-Shamir observe/sample, and polynomial evaluation entirely on the GPU. Infrastructure added: - CUDA kernel: sumcheckObserveAndSampleQuartic in observe_and_sample.cuh - Kernel exports: quartic_duplex and quartic_multi_field_32 - FFI bindings in sys/src/sumcheck.rs - ObserveAndSampleQuarticKernel trait + implementations - evaluate_zerocheck_device helper (returns [3] tensor on device) - launch_observe_and_sample_quartic helper function The zerocheck main loop still uses the CPU challenger path due to trait bound propagation complexity. Wiring in the GPU challenger requires refactoring the prover's generic challenger bounds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add GPU quartic observe-and-sample kernel for zerocheck sumcheck with full trait plumbing through the prover stack: - CUDA kernel: interpolateQuarticSumcheck + sumcheckObserveAndSampleQuartic - FFI bindings, kernel exports, ObserveAndSampleQuarticKernel trait - evaluate_zerocheck_device helper (keeps [3] tensor on device) - DeviceChallenger plumbing: CudaShardProverComponents trait now requires ObserveAndSampleQuarticKernel - Device challenger creation at prove_shard_with_data call site The zerocheck loop currently uses CPU challenger path — the GPU kernel's quartic Lagrange interpolation needs debugging (produces sumcheck round inconsistency). The infrastructure is fully wired and ready to activate once the interpolation is verified correct. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Complete the trait plumbing for the GPU zerocheck challenger: - Add ObserveAndSampleQuarticKernel to CudaShardProverComponents::DeviceChallenger bounds - Create device challenger at prove_shard_with_data call site - Pass device challenger to zerocheck function - Rewrite interpolateQuarticSumcheck with explicit (non-loop) Lagrange interpolation matching the proven cubic pattern - Add FromHostChallengerSync<C> bound and re-sync inside zerocheck The zerocheck loop currently uses CPU challenger path — the GPU kernel produces incorrect polynomial coefficients that need unit test debugging to isolate (the trait plumbing and kernel launch infrastructure are fully functional). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Enable the GPU quartic observe-and-sample kernel for zerocheck sumcheck. The kernel performs degree-4 Lagrange interpolation, eq correction, Fiat-Shamir observe/sample, and polynomial evaluation entirely on GPU, eliminating the per-round D2H of full polynomial coefficients. The fix: the earlier failures were caused by: 1. Device challenger created before zerocheck-internal samples (stale state) - Fixed by re-syncing device challenger inside zerocheck after lambda sampling 2. Accumulating all DeviceTensors in saved_reduced_evals caused OOM - Fixed by reconstructing polynomials and freeing tensors per-round CPU challenger is replayed at the end for proof state synchronization. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add DeviceBuffer::to_host_pinned() method that copies into a pre-allocated pinned staging buffer instead of allocating a new Vec per call. This avoids per-round heap allocation and enables DMA- compatible transfers for the small alpha/claim D2H copies in the GKR sumcheck and zerocheck GPU challenger loops (~40+ copies per shard). Also makes the pinned module public for use by downstream crates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The synchronize_blocking() in to_host_pinned() does a full stream drain which is significantly more expensive than the implicit sync from unpinned memcpy. This caused a 30% regression on NVIDIA 10M (22.3s vs 16.9s). Revert to regular to_host() for the sumcheck alpha/claim reads. The to_host_pinned infrastructure remains available for future use where explicit sync is acceptable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Skip the separate bit_rev permutation kernel before LDE_spread in batch_coset_dft. Instead, LDE_spread_distribute_powers reads from the un-permuted input with a bit-reversed index pattern (in[bit_rev(idx)] instead of in[idx]). This eliminates one complete read+write pass over the polynomial data per LDE call, trading sequential reads for scattered reads. On AMD RDNA3 where bit_rev was 5.5x slower than expected (due to 6MB L2 cache thrashing on the strided access pattern), this single-pass approach avoids the cache-hostile permutation entirely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Define SP1_KERNEL macro in config.cuh that expands to __global__ __launch_bounds__(256, 2) on HIP and plain __global__ on CUDA. Apply to all 21 sumcheck, zerocheck, and logup-gkr compute kernels. On RDNA3, the min_blocks=2 hint forces the compiler to limit VGPR usage, improving occupancy and memory latency hiding. NVIDIA behavior is unchanged (nvcc has good register heuristics). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add __restrict__ qualifiers to const input pointers and output pointers in fixAndSum, fixLastVariable, and jagged sumcheck kernel signatures. This tells the compiler that input and output buffers don't alias, enabling it to: - Keep loaded values in registers across store operations - Reorder loads/stores for better ILP - Potentially optimize the write-then-read pattern in fused fixAndSum kernels where values are written to output then immediately re-read for sumAsPoly Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the quarter-rate v_mul_lo_u32 for computing red = tl * M (where M = 0x7EFFFFFF = 2^31 - 2^24 - 1) with full-rate shift operations: red = (tl << 31) - (tl << 24) - tl. This saves 1 quarter-rate multiply per Montgomery reduction on RDNA3, applied across 6 call sites in kb31_t: accel_t conversion, mul(), mul_by_1(), sqr_n(), and both dot_product variants. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When input_natural_order=true, the input and output buffers don't overlap, so the cooperative kernel launch (cudaLaunchCooperativeKernel) is unnecessary. Use a regular kernel launch instead, which avoids the cooperative launch overhead and doesn't require occupying the entire GPU. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The fixAndSumCircuitLayer kernel previously wrote computed CircuitValues to global memory via fixLastVariable, then immediately re-read them in sumAsPolyCircuitLayerInner. This wastes memory bandwidth, especially on AMD RDNA3 where the 6MB L2 cache may not retain the values. Fix: add fixLastVariableWithValues() that returns CircuitValues in registers, and sumAsPolyCircuitLayerFromValues() that accepts them directly. Add corresponding template wrappers in jagged.cuh (fixLastVariableUncheckedWithValues, fixLastVariableTwoPaddingWithValues) to propagate the values through the JaggedMle abstraction. This eliminates 8 global memory reads per thread iteration (4 ext_t values x 2 for the zero/one pair) that were redundant re-reads of data just written by the same thread. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ons write-then-read - Remove the reduced shard threshold for 24GB GPUs (e.g. RX 7900 XTX). The <=30GB tier used ELEMENT_THRESHOLD - 96M (~302M elements), but 24GB GPUs have headroom for the full ~403M threshold (+0.57GB VRAM). This reduces shard count by ~28% (25→18 at 10M), saving 2.5% proving time. - Add 16GB GPU tier (e.g. RX 9070 XT) with ~134M element threshold. - Batch NTT across columns using sppark's Base_dev_ptr_batch API instead of per-column Base_dev_ptr calls. Reduces kernel launch count from N to 1 for the NTT butterfly step. Also batch iNTT and forward NTT. - Eliminate write-then-read in fixAndSumInteractionsLayer kernel by keeping CircuitValues in registers (same pattern as fixAndSumCircuitLayer). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…nc count - Caching allocator: when exact-size match fails, try finding a cached buffer up to 2x the requested size. This handles the sumcheck halving pattern where each round needs exactly half the previous size — the freed buffer from round N-1 satisfies round N. Changed free_pool from unordered_map to ordered map to enable range lookups. Measured 2.2% improvement at 10M scale (49.1s vs 50.5s baseline). - partialBlockReduce: replace tree-based shared memory reduction (log2(numWarps) __syncthreads barriers) with single-warp final reduction via shuffle. Reduces barrier count from 3-4 to 2 for 256-thread blocks. Neutral in benchmarks but architecturally cleaner. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add a new benchmark program that exercises many RISC-V opcodes without precompiles: ADD, SUB, MUL, DIV/REM, AND, OR, XOR, SLL, SRL, SRA, SLT, SLTU, LW/SW, LH/SH, LB/SB, BEQ/BNE/BLT/BGE, LUI, AUIPC. The param controls iteration count (complexity): --program local-mixed-ops --param 60000 (~40M cycles, like SHA2 1M) --program local-mixed-ops --param 600000 (~400M cycles, like SHA2 10M) --program local-mixed-ops --param 6000000 (~4B cycles, like SHA2 100M) Achieves much higher kHz than SHA2 because it uses simple ALU/memory chips without precompile overhead (64,140 kHz on 4090 at 6M). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…thmetic for HIP PLONK proof generation was broken due to a Gnark serialization bug where WriteRawTo/ReadFrom roundtrip for PLONK proofs corrupts the data. This is a pre-existing upstream issue (confirmed at commit 1b3cef8). The ProvePlonk Go function already verifies the proof internally before returning, making the external Docker verify step redundant. Skip it. Also fix RecursionVks::verify to properly skip merkle proof checks when vk_verification is disabled (experimental mode), which previously always failed because the dummy VK map contains placeholder hashes that never match actual VK digests. For AMD HIP: replace the BN254 field arithmetic and Poseidon2 stubs (which returned all zeros) with a full Montgomery multiplication implementation and enable the complete BN254 Poseidon2 permutation. This is required for the wrap circuit which uses BN254 commitments. The AMD implementation compiles but still hits a GPU hardware exception at runtime — the CIOS Montgomery multiply needs further debugging on RDNA architecture. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ework for GPU PLONK Phase 1 foundation for the GPU-accelerated PLONK prover: - bn254_fq_t.cuh: BN254 base field (Fq) Montgomery arithmetic with CIOS method. Supports add, sub, mul, sqr, neg, dbl, mul2/3/4/8, and Montgomery form conversion. Portable C++ (works on both CUDA and HIP). - bn254_g1.cuh: BN254 G1 elliptic curve point operations in Jacobian coordinates. Implements: - Point doubling using BN254 a=0 "dbl-2009-l" formula (1M+5S) - Mixed affine-Jacobian addition "madd-2007-bl" (8M+3S) - Full Jacobian-Jacobian addition "add-2007-bl" (12M+4S) - Edge case handling: identity (Z=0), P==Q (doubling), P==-Q (identity) - bn254_msm.cuh: MSM kernel framework with scalar decomposition, bucket accumulation, bucket reduction (running-sum trick), and window combination (Horner's method). - alt_bn128.hpp: Added include guards to prevent double-inclusion. - GPU_PLONK_IMPLEMENTATION_PLAN.md: Comprehensive implementation plan reviewed by 75+ OPUS 4.6 agents across 5 review rounds. Tested and verified on all 3 GPUs: - RTX 4090 (CUDA sm_89): 7/7 field tests + 3/3 MSM tests PASS - RX 7900 XTX (HIP gfx1100): 7/7 + 3/3 PASS - RX 9070 XT (HIP gfx1201): 7/7 + 3/3 PASS Performance: 390M adds/s (4090), 259M (7900XTX), 156M (9070XT) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ove) Implement a complete GPU-accelerated PLONK prover for BN254 that produces 864-byte proofs compatible with gnark's verifier. Supports both NVIDIA (CUDA/sppark, ~42s) and AMD (HIP/ROCm, ~81s) GPUs. Key components: - BN254 Fr/Fq Montgomery field arithmetic for GPU (portable 32-bit CIOS) - G1 Jacobian elliptic curve operations (add, double, mixed add) - Pippenger MSM with hipCUB radix sort, parallel bucket accumulation (BUCKET_PAR=128), and block-parallel running-sum reduction - Cooley-Tukey NTT with OpenMP-parallelized twiddle computation, persistent twiddle caching, and two-level coset lookup tables - Double-buffered async quotient constraint evaluation kernel - Fiat-Shamir transcript (SHA-256) matching gnark's wire format AMD-specific optimizations (36x speedup from initial 2943s to 81s): - Fixed critical err.message!=nullptr bug (HIP CUDA_SUCCESS_CSL non-null) - Bucket-0 skip in MSM accumulation (eliminates 196x slowdown on sparse scalars) - Compact mask MSMs for wire depadding (gather matching SRS points) - NTT twiddle cache persistence across coset FFT calls - Host-side twiddle cache for fast GPU re-upload after memory pressure clears - NTT buffer steal pattern (DeviceBuffer reuses NTT scratch, no extra alloc) - Fused fold eliminating intermediate linearization polynomial - Non-blocking GPU streams for quotient kernel double-buffering Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add a CPU quotient computation path that avoids keeping 4 coset FFT DeviceBuffers (17.2 GiB) on GPU simultaneously. When GPU VRAM is detected to be <20 GiB, the prover: 1. Computes all coset FFTs via gpu_coset_fft_padded (returns to CPU) 2. Evaluates the quotient polynomial on CPU with rayon parallelism 3. Uploads the result to GPU for the coset iFFT only This enables PLONK proving on the AMD RX 9070 XT (16 GiB VRAM) at 107.6s per proof, while preserving the faster GPU quotient path on GPUs with ≥20 GiB (RTX 4090, RX 7900 XTX). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add sp1_plonk_quotient_eval_streamed: a fully-streamed quotient kernel that chunks ALL 18 arrays (13 static + l, r, o, z, z_shifted) from host memory. This requires only ~5.5 GiB peak GPU VRAM (output buffer + chunk buffers), compared to ~22 GiB for the device-resident path. The z_shifted cross-chunk dependency is resolved by precomputing z_shifted[i] = z[(i+4) % big_n] on CPU before chunking. On the 9070 XT (16 GiB, PCIe Gen 2 VM): 124s prove time. On bare metal with PCIe Gen 4/5, the GPU-streamed path would be significantly faster than the CPU rayon fallback. Three quotient paths now available: >=20 GiB: Device-resident (4 DeviceBuffers + GPU chunk kernel) <20 GiB: GPU-streamed (all arrays chunked from host) CPU: Rayon parallel (available as compute_quotient in non-cuda path) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

RTX 4090: 25s → ~20s prove body (~10x faster than gnark) RX 7900 XTX: 58s → ~48s (17% speedup, ~4x faster than gnark) GPU kernel optimizations: - Quotient kernel: reduce PCIe-streamed arrays from 13 to 9 by computing coset points on-the-fly via two-level omega lookup tables, passing zh_inv and zh_values as 4-element cyclic constants, and fusing qk+pi_bsb22 - Precompute beta*k1/k2 on host to break kernel dependency chains - Deferred loading pattern in quotient kernel (peak 8 vs 18 live bn254_t) - HIP-specific __launch_bounds__(256, 4) for RDNA3 occupancy - Four-step NTT with LDS NTT-1024 sub-NTTs for AMD HIP (coset iFFT: 4.3s → 0.26s). Uses XOR-swizzled LDS for zero bank conflicts - Fix HIP NTT build: use custom bn254_ntt_hip.cu instead of sppark (PTX inline assembly fails at runtime on gfx1100) - LDS butterfly-10 fusion in iterative HIP NTT (10 launches → 1) - XYZZ coordinates for MSM bucket accumulation (9-12% MSM speedup) - sqr() warp-divergence fix in BN254 field arithmetic Rust-level optimizations: - Overlap SRS Lagrange upload with BN254Fr→Fr conversion - Overlap SRS canonical upload with split_quotient + h2 sparsity check - Grand product computation overlapped with PI/BSB22 aux NTTs on GPU - CPU correction tree-reduction overlapped with GPU MSM (depadding) - In-place batch inversion (batch_inv_fr_inplace) for grand product - Fused iFFT+cosetFFT functions (gpu_ifft_then_coset_fft_to_host/device) - Pre-faulted D2H buffers to avoid DMA page faults (h_coeffs, NTT coefficients, coset evals, batch_ifft packed buffer, div_by_linear, Round 5 linear combination result) - linear_combination_into write-first optimization (no pre-zeroing) - Merged qcp+bsb22 Horner evals into main parallel batch (Round 4: 2.7s → 1.0s on AMD) - Device sync + twiddle cache clear before coset iFFT (HIP VRAM fix) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

RX 7900 XTX: 48s → 46.9s median (additional 2% speedup) - Enable mont=true for HIP MSM path: ROCm 7.2 fixed the driver bug that previously caused sppark's mont_to_canonical_kernel to fail on gfx1100. Eliminates CPU Montgomery→canonical conversion (~60ms per MSM call × 7 calls = ~420ms saved per proof). - GPU-resident pi+qk+bsb22 fusion: keep PI and BSB22 coset FFT results on GPU as DeviceBuffers instead of downloading to CPU. Upload qk and qcp coset evals to GPU, run element-wise add/FMA kernels to produce the fused qk_plus_pi array on device, then download once. Saves 2 unnecessary 4 GiB D2H transfers (~0.6s on AMD VM PCIe). - Add bn254_elementwise_add and bn254_elementwise_fma GPU kernels for on-device BN254 Fr array operations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

RX 7900 XTX: ~46.3s median (was ~46.9s, ~0.6s saved) Add sp1_bn254_msm_invoke_device() FFI for both CUDA (sppark) and HIP paths. For sppark, uses the existing set_d_scalars_ptr() + nullptr scalar path to skip H2D upload. For HIP, uses hipMemcpyDeviceToDevice instead of hipMemcpyHostToDevice. After the quotient coset iFFT, h_coeffs remain on GPU as a DeviceBuffer. The h0/h1 MSM commits now use msm_device() with pointer offsets into this device buffer, eliminating two 1 GiB H2D scalar uploads (~300ms each on AMD VM PCIe at 3.6 GB/s). Also increase quotient chunk buffer from 80% to 90% of free VRAM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pre-upload L/R/O wire scalars to GPU during prover init, then reuse the device copies for the Round 3 fused iFFT+cosetFFT via D2D copy instead of re-uploading from host. Saves 3 × 300ms H2D transfers on AMD VM (3.6 GB/s PCIe), partially offset by the pre-upload cost. Also adds gpu_ifft_then_coset_fft_to_device_from_device() variant that accepts a pre-uploaded device pointer and uses D2D copy. Net improvement: ~0.4s on AMD VM (46.5s → 46.1s median). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Hoist the grand product's H2D uploads (4 × 1 GiB) to a background thread that starts BEFORE the wire commits. The data (s1, s2, s3, omega_powers) is circuit-static and doesn't depend on gamma/beta. The uploads complete during the 5.6s of wire MSM GPU compute. RTX 4090 grand product: 0.58s -> 0.17s (-0.41s) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The D2H timer was including iNTT tail execution time (cudaMemcpy is synchronous and implicitly waits for prior GPU work). Add explicit cudaDeviceSynchronize before the D2H timer for accurate measurement. Also check cuda_host_register return value — silent failure would degrade D2H from 25 GB/s (pinned) to 11 GB/s (unpinned) on 4090. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

HIP/AMD GPU memory operations from background threads can crash due to device context issues. Gate the background s1/s2/s3/omega upload thread behind #[cfg(not(hip_backend))]. HIP path falls back to inline synchronous uploads (original behavior). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add bn254_ntt_needs_temp_buffer() FFI: returns true for RDNA3 four-step NTT (needs 4 GiB temp), false for sppark (in-place). Prover uses this at runtime to choose: - CUDA (sppark): d_qk_plus_pi stays on device, NTTs from device - HIP (RDNA3): spill d_qk_plus_pi, NTTs from host Replaces broken #[cfg(hip_backend)] approach (cfg not visible across crate boundaries). Fixes AMD test failure while preserving CUDA optimization. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Defer R+O BN254Fr→Fr Montgomery conversion to a background thread. L conversion happens immediately (needed for L upload+MSM), then R+O converts while GPU runs L MSM. R+O conversion (~0.24s) is hidden behind L MSM's ~0.36s GPU compute. Wire conversion on critical path: 0.35s -> 0.25s (L+PI+BSB22 only) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Foundation for an independent GPU-accelerated Groth16 prover that works on both NVIDIA (CUDA) and AMD (HIP) without the Icicle library. New crate sp1-gpu-groth16 includes: - Fq2 arithmetic (quadratic extension field, Karatsuba multiplication) - G2 affine/Jacobian point operations (double, add, mixed add, scalar mul) - CPU Pippenger MSM for G2 points (windowed, rayon-parallelized) - Type definitions for Groth16 proving data and witness - Prover skeleton (TODO: implement full algorithm) Reuses sp1-gpu-plonk's BN254 Fr/Fq/G1 and GPU MSM/NTT infrastructure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Full Groth16 prove algorithm in Rust using GPU MSM/NTT: - H polynomial computation via 7 GPU NTTs (3 iFFT + 3 coset FFT + 1 coset iFFT) - 4 G1 MSMs on GPU: Ar, Bs1, Krs, Krs2 (reuses sppark/HIP MSM infrastructure) - 1 G2 MSM on CPU (Pippenger with rayon parallelism) - Random blinding (r, s), proof assembly with scalar multiplications - CPU fallback path for non-GPU builds Go export function for gnark R1CS solution: - Exports solved witness vectors (W, A, B, C) as flat binary - Exports proving key G1/G2 points + metadata - Bridges gnark's R1CS solver to Rust GPU prover Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The add-2007-bl formula requires r = 2*(S2-S1) but the code had r = S2-S1 (missing the doubling). This caused incorrect results in: - G2 MSM bucket reduction (Jacobian+Jacobian additions) - G2 MSM window combination - G2Jacobian::scalar_mul The mixed addition (add_affine) was already correct. Found by code review agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

From code review agents: - Add batch_iNTT_bn254_with_temp and batch_coset_NTT_bn254_with_temp stubs in sppark_bn254.cuh (CUDA path). These delegate to the non-temp variants since sppark NTT is in-place. - Remove N/32 tiled transpose fallback in RDNA3 NTT — it was dead code that would cause buffer overflow if triggered (wrote N elements into N/32 buffer). Replace with clean OOM error. - Remove d_inout+N fallback (guaranteed out-of-bounds write). - Remove unused temp_elems variable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

From 15-agent code review: - Propagate cuda feature to sp1-gpu-plonk (fixes silent CPU MSM fallback) - Remove redundant identical #[cfg(feature = "cuda")] branches in g1_msm - Remove unused sha2 and num-bigint dependencies Known issues to fix in next commits: - Go export has compilation errors + byte-order mismatch - K MSM missing commitment wire filtering - Missing BSB22 hint override in Go export Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

On the CUDA path, z_lagrange is empty (Z stays on device as d_z_gp). When use_gpu_quotient=false (<20 GiB VRAM), the code passed empty z_lagrange to gpu_ifft_then_coset_fft_to_host, causing an assertion failure. Fix: download Z from d_z_gp before the NTT when z_lagrange is empty. Found by 15-agent code review. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Major changes from 15-agent review: - Rewrote Go export to use reverseBytes (LE canonical, matching PLONK) - Fixed nb_wires to use len(InfinityA) (includes internal wires) - Implemented Groth16ProvingData::load() and Groth16WitnessData::load() - Added G2 point loader (128 bytes LE canonical per point) - Added K wire filter loading (commitment indices to exclude) - Fixed K MSM wire filtering to use filter set - Added Go test for export + Rust test for loading - Removed BSB22 hint complexity (kept in gnark's Prove() flow) Loading test passes: domain=2^24, 12.8M G1.A points, 15.3M G2.B points. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Architecture optimizations from 15-agent review: 1. GPU pointwise kernel for H polynomial: bn254_h_poly_pointwise computes (a*b - c) * den entirely on GPU, eliminating 4 unnecessary PCIe transfers (3 downloads + 1 upload of N-element arrays). 2. Overlap G2 CPU MSM with G1 GPU MSMs: the ~3-5s G2 Pippenger MSM runs on a background CPU thread while the 4 G1 MSMs run on GPU. The G2 MSM is fully hidden behind the G1 MSM GPU compute. 3. Fixed Go export nb_wires to use len(InfinityA) (total including internal wires, not just public+secret). 4. Added K MSM wire filter (commitment indices to exclude). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. Go ExportGroth16GpuWitness function: - Solves R1CS with BSB22 Pedersen commitment override - Exports wire_values, solution A/B/C as LE canonical Fr - Exports BSB22 commitments + PoK as LE canonical G1 - Uses hash_to_field with CommitmentDst (matching gnark prove.go) 2. Rust proof serialization (proof.rs): - to_raw_bytes(): gnark WriteRawTo format (Ar|Bs|Krs|Commitments|PoK) - to_solidity_bytes(): Ethereum ABI encoding for on-chain verification - Handles Montgomery→canonical→big-endian conversion for G1/G2 3. Go test for witness export Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add CGo-exported functions for the Groth16 GPU proving pipeline: - ExportGroth16GpuData: exports PK + metadata (circuit-static) - ExportGroth16GpuWitness: solves R1CS with BSB22 + exports witness These are the Go-side entry points that the Rust SP1 pipeline will call to prepare data for the GPU Groth16 prover. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The function produced LE element order instead of BE. It placed limb0 first (least significant) after reversal, but gnark expects limb3 first (most significant) for big-endian canonical encoding. Fix: delegate to write_fq_be_from_montgomery which has the correct (3-i)*8 offset pattern. This affects all G1 points in to_raw_bytes() and to_solidity_bytes() — proofs would fail verification without fix. Found by code review agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gnark's prove.go hashes wireValues[commitmentInfo[i].CommitmentIndex] (Fr scalar) for ALL commitments, concatenated. Our export was hashing commitments[0].Marshal() (G1 point, wrong type, only first commitment). Fix: use solution.W[commitmentInfo[i].CommitmentIndex].Marshal() for all commitments, matching gnark exactly. Found by code review agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gnark's G2Affine.RawBytes() serializes as [X.A1, X.A0, Y.A1, Y.A0], but the loader was reading [offset, offset+32] as [c0, c1] = [A1, A0], swapping the real and imaginary parts. This corrupts all G2 points (pk_g2_b, pk_g2_beta, pk_g2_delta), causing invalid G2 MSM results. Fix: read A0 from offset+32 (second position) into c0, and A1 from offset (first position) into c1. Found by code review agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Issues from review: 1. CGo ExportGroth16GpuData/Witness now have defer/recover to prevent Go panics from crashing the Rust process across the FFI boundary. 2. Added 11 Fq2 unit tests: mul identity/zero/commutativity, square, inverse roundtrip, add/sub roundtrip, mul_by_nonresidue, conjugate, double. 3. Added 6 G2 unit tests: infinity handling, add/double with infinity, scalar_mul by zero, empty MSM. All 18 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The defer/recover in ExportGroth16GpuData and ExportGroth16GpuWitness was logging to stderr but returning nil (success), silently swallowing Go panics. Fixed by using named return variable (errStr *C.char) so the defer block can set the return value to a C.CString error message. Found by code review agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Complete integration of GPU Groth16 prover as drop-in for gnark/Icicle: 1. Rust FFI wrappers for ExportGroth16GpuData and ExportGroth16GpuWitness in gnark-ffi/src/ffi/native.rs. Calls Go via CGo to solve R1CS with BSB22 and export witness data. 2. Groth16Bn254Prover::prove_gpu() method in groth16_bn254.rs: - Calls Go to solve R1CS + export witness (BSB22 Pedersen commitments) - Loads exported data via Groth16ProvingData::load() - Runs GPU prover (7 NTTs + 4 G1 MSMs + 1 G2 MSM) - Serializes proof in gnark-compatible format - Converts to Groth16Bn254Proof with public inputs 3. Added sp1-gpu-groth16 dependency to gnark-ffi crate. The prove_gpu() method is a drop-in replacement for prove() that works on both NVIDIA (CUDA) and AMD (HIP) GPUs without Icicle. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Two critical bugs found by review agents: 1. public_inputs were extracted from raw wire values (wire_values[1..6]) but gnark's wire ordering doesn't match the logical public input order. Fix: use named fields from GnarkWitness (VkeyHash, CommittedValuesDigest, ExitCode, VkRoot, ProofNonce). 2. encoded_proof was missing the 96-byte prefix (exit_code, vk_root, proof_nonce as 32-byte BE uint256) that the SP1 verification pipeline expects. Fix: prepend the prefix before hex encoding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gnark's G2Affine.RawBytes() writes [X.A1, X.A0, Y.A1, Y.A0] but write_g2_be was writing [X.A0, X.A1, Y.A0, Y.A1]. This would cause raw_proof verification to fail. The Solidity serialization (write_fq2_be_canonical_solidity) was already correct (A1 first per EIP-197). Found by code review agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

End-to-end Groth16 prover now runs entirely on GPU on RTX 4090 (CUDA), RX 7900 XTX, and RX 9070 XT (HIP), with no CPU MSM fallbacks. Optimizations on NVIDIA (RTX 4090): - Batch 3 iNTTs / 3 coset NTTs into single calls (poly_count=3) - Pre-convert wire_values to Fr at load time (saves ~350ms/proof) - PersistentMsm contexts for all 4 G1 MSMs (SRS pre-uploaded to GPU) - mont=true scalar handoff (skip CPU canonical conversion) - Overlap CPU wire-filtering with GPU H polynomial computation - Keep H polynomial on GPU (msm_device for Krs2 — no D2H/H2D round-trip) - Result: prove step 3.40s -> 1.66s (3.0x faster than Icicle) GPU G2 MSM on AMD (HIP): - Portable Fp2/G2 types built on existing bn254_fq_t (CUDA + HIP) - Pippenger MSM with parallel bucket accumulation (BUCKET_PAR=256) - Block-parallel bucket reduction (Phase 1 + Phase 2) - XYZZ coordinates in merge->reduce pipeline - Transposed partial_sums layout for coalesced merge reads - GPU-side window combination (Horner's method) - Result: G2 MSM 13.3s CPU -> 3.16s GPU on 7900 XTX (4.2x) GPU G2 MSM on NVIDIA (CUDA): - sppark templated Pippenger with custom fp2_t satisfying field_t interface - MSM_INTEGRATE_NTHREADS=128 to fit shared memory budget - TAKE_RESPONSIBILITY_FOR_ERROR_MESSAGE for diagnostic output - Result: G2 MSM 460ms Final benchmarks (RTX 4090 / 7900 XTX / 9070 XT): - Prove step: 1.66s / 7.59s / 12.4s - G2 MSM: 460ms / 3.16s / 4.79s New files: - sp1-gpu/crates/sys/include/fields/bn254_fp2_t.cuh (sppark-style Fp2 for CUDA) - sp1-gpu/crates/sys/include/fields/bn254_fq2_t.cuh (portable Fq2 for HIP) - sp1-gpu/crates/sys/include/ec/bn254_g2.cuh (G2 affine/Jacobian/XYZZ types) - sp1-gpu/crates/sys/lib/msm/bn254_g2_msm_sppark.cu (CUDA G2 MSM via sppark) - sp1-gpu/crates/sys/lib/msm/bn254_g2_msm_hip.cu (custom HIP G2 MSM) - crates/recursion/gnark-ffi/examples/bench_groth16.rs (head-to-head benchmark) - sp1-gpu/crates/groth16/examples/test_g2_gpu.rs (G2 GPU correctness test) Also includes earlier uncommitted PLONK GPU prover work and design docs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three independent improvements landed together: 1. Persistent G2 MSM context on HIP (`bn254_g2_msm_hip.cu` + `PersistentG2Msm` wrapper in `groth16/src/g2.rs`). The G2 SRS (≈2.3 GB for 15M bases) is uploaded once at `Groth16Prover::new` and reused across proofs, eliminating a ~550 ms per-prove point upload + hipMalloc overhead. Gated by runtime `SP1_GPU_BACKEND=hip` because on CUDA sppark's G2 context conflicts with its G1 `gpu_t` singleton. 2. Signed-digit G2 MSM decomposition on HIP (window_bits=13, signed). Halves the bucket count (8192 → 4097), halves the partial-sums buffer (192 MB vs 384 MB), and allows BUCKET_PAR=128 (instead of 256) with the same total parallelism. Note: hipCUB `SortPairs` `end_bit` must stay at WINDOW_BITS (13), not WINDOW_BITS-1 — signed digits have values [0, 2^(c-1)] *inclusive*, so the max 4096 still needs 13 bits to sort correctly. Correctness verified against arkworks CPU reference via `GROTH16_G2_VERIFY=1` env var. 3. Runtime-dispatched G2/G1 overlap. CUDA (sppark multi-stream) keeps using `std::thread::scope` to overlap G2 MSM with the remaining G1 MSMs — sppark's gpu_t singleton provides real concurrent streams so G2 hides fully behind Bs1/Krs/Krs2. HIP runs sequentially: every custom HIP MSM synchronously blocks on its own stream before returning, and concurrent host-thread kernel dispatch thrashes the single queue hard enough to balloon Krs MSM from ~0.9 s to 40+ s. Dispatch is by runtime `SP1_GPU_BACKEND` env var (no compile-time `hip_backend` cfg in this crate). 4. CUDA toolkit ABI fix (`sys/build.rs`). The `cudaDeviceProp` struct layout changed between CUDA 12 and 13. CMake cached `/usr/local/cuda-13.1/bin/nvcc` while Rust's link-search resolved `/usr/local/cuda` → `cuda-12.9`, linking `libcudart.so.12`. Runtime `cudaGetDeviceProperties` then wrote CUDA 12 layout but callers read CUDA 13 offsets → `multiProcessorCount=1` on RTX 4090 (should be 128) → `grid_size = sm_count/3 = 0` → sort kernel launched with `dim3(0, nwins)` → `cudaErrorInvalidConfiguration` at the first G1 MSM invoke. New `resolve_cuda_toolkit_path()` helper resolves the toolkit once (CUDA_PATH → CUDACXX → which nvcc → newest `/usr/local/cuda-*` → symlink fallback) and uses the result for both `CMAKE_CUDA_COMPILER` and the link-search path so compile and link always agree. Newest-first scan also sidesteps CUDA 12.9 ptxas rejecting the G2 `accumulate<xyzz_t<fp2_t>>` register count. Also: - `bn254_msm_sppark.cu` defines `TAKE_RESPONSIBILITY_FOR_ERROR_MESSAGE` so sppark's file:line error strings propagate back to Rust panics. - `test_g2_gpu` example updated for `g2_msm_gpu` now returning Option. Measured on Groth16 v6.0.0 PK (N=2^24, ~15M bases per MSM): - RTX 4090 (CUDA): 1.54–1.69 s prove (was 1.65 s baseline, now broken binary → fixed) - RX 7900 XTX (HIP): 6.91–6.97 s prove (was 7.59 s baseline, −8.6%) - RX 9070 XT (HIP): ~12.5 s prove (roughly flat vs 12.4 s) Raw proof bytes match gnark's CPU reference on both backends. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Five independent improvements from a 9-agent optimization review: 1. EC point arithmetic: fix XYZZ→Jacobian inversion bug (B1). Both `bn254_g1_xyzz_t::to_jacobian` and `bn254_g2_xyzz_t::to_jacobian` were computing Z via `ZZZ * ZZ.inv()` — one (G1) or two (G2) Fq/Fq2 inversions per call, each ~380 Fq muls via Fermat. The correct inversion-free formula is `jacobian_t{X*ZZ, Y*ZZZ, ZZ}` (2 Fq muls). Derivation: XYZZ affine is (X/ZZ, Y/ZZZ); with ZZ=Z², ZZZ=Z³ we have Z = ZZZ/ZZ and ZZZ² = ZZ³, so scaling (X',Y',Z') = (X*ZZ, Y*ZZZ, ZZ) gives X'/Z'² = X/ZZ ✓ and Y'/Z'³ = Y*ZZZ/ZZ³ = Y/ZZZ ✓. Measured: ~100 ms off G2 MSM on 7900 XTX. 2. Fr (`bn254_t`) branchless reductions. Ported Fq's fully-branchless conditional-subtract pattern to `bn254_t::operator+`, `-`, `*`, `sqr`, `dbl`, and `gte_p()`. Fr is on every NTT butterfly, so this removes warp divergence on ~N·lg_N reductions per prove. 3. NTT pipeline: remove per-call 512 MB `hipMalloc`/`hipFree` and unnecessary syncs. `compute_h_gpu` now allocates a single 512 MB `d_temp` buffer once, threads it through `batch_iNTT_bn254_with_temp` / `batch_coset_NTT_bn254_with_temp`, and frees it at the end. Added a matching `batch_coset_iNTT_bn254_with_temp` entrypoint on both HIP (real impl in `ntt_rdna3.cu`) and CUDA (sppark-backed thin wrapper). Also dropped the `hipStreamSynchronize` inside `run_ntt_four_step` and the `hipDeviceSynchronize` inside the `_with_temp` batch wrappers — the Rust caller already chains on the stream correctly. 4. Host-side pinning (T1a): register the three per-proof scalar Vecs (`wire_values_a/b`, `filtered_wire_values`) via `cuda_host_register` after they're built and before the MSMs consume them, then `cuda_host_unregister` once the last MSM completes. DMA-speed H2D on PCIe 4.0 (~25 GB/s pinned vs ~1.7 GB/s pageable). Plonk already uses the same pattern. 5. Kernel launch bounds (T1c): `bucket_accumulate_parallel_packed_kernel` changed from `__launch_bounds__(256, 4)` to `(256, 2)`. The (256,4) hint capped the compiler at ~48 VGPR/thread, forcing scratch spills for the ~80 VGPR XYZZ inner loop on both RDNA3 and RDNA4. 6. Rust workspace LTO: added `lto = "thin"`, `codegen-units = 16` to `[profile.release]`. Measured on the Groth16 v6.0.0 PK (N=2^24, ~15 M bases per MSM): - RTX 4090 (CUDA): 1.70–1.94 s prove (prior best 1.54–1.69 s) - RX 7900 XTX (HIP): 6.92–6.99 s prove (prior 6.91–6.97 s); G2 MSM dropped from ~2.58 s to ~2.48 s from B1 - RX 9070 XT (HIP): 12.8–13.0 s prove (prior 12.5 s) All 648-hex-char proofs match the gnark CPU reference. G2 MSM output cross-checked against arkworks via `GROTH16_G2_VERIFY=1`: MATCH on both iterations on both AMD cards. The dramatic gains expected from these items (pinning estimated 1 s, NTT temp estimated 300-600 ms) did not materialize, suggesting the MSM compute dominates far more than host transfer on HIP and that ROCm's caching allocator already makes repeated 512 MB `hipMalloc`s effectively free. The RDNA4 regression is still open — the `__launch_bounds__` fix is necessary but not sufficient; per-gfx1201 tuning (BUCKET_PAR, REDUCE_BLOCK_SIZE) is the next lever to pull. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tested BUCKET_PAR=64 on HIP (both G1 `bn254_msm.cuh` and G2 `bn254_g2_msm_hip.cu`), hoping to reduce merge-phase work and improve RDNA4 occupancy. Result was a severe regression: - 7900 XTX total prove: 6.9 s → 22.8 s (3.3× slower) - G2 MSM alone: 2.5 s → 17.4 s (7× slower) - G1 Ar/Bs1/Krs each ~40% slower Root cause: BUCKET_PAR governs how many parallel threads split a single bucket. With 4097 buckets and ~3600 points average per bucket, each thread does bucket_count / BUCKET_PAR serial `add_affine_unsafe` calls. The accumulate kernel is compute-bound on EC point arithmetic per thread, not on total oversubscription — halving BUCKET_PAR doubles the serial work per thread with no matching reduction in per-thread register/LDS usage, so latency hiding gets worse, not better. Comment updates document the finding so future sweeps don't repeat the experiment. No code change from previous state other than the comments. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously the G2 bucket accumulate kernel held per-thread partial sums in Jacobian coordinates (`bn254_g2_t`) and used Jacobian's `add_affine_unsafe` (7M + 3S in Fq2). The downstream merge kernel then converted Jacobian → XYZZ for the reduce phase. Switch to XYZZ throughout the accumulate + merge path: - New `bn254_g2_xyzz_t::add_affine_unsafe` (7M + 2S in Fq2), mirror of the G1 XYZZ `add_affine_unsafe`. Saves 1 Fq2 squaring (2 Fq muls) per inner-loop add vs the Jacobian form. - `g2_bucket_accumulate_parallel_kernel` now holds its accumulator in `bn254_g2_xyzz_t`, bootstraps from an affine point via `from_affine`, and writes XYZZ partials. - `g2_merge_partial_sums_kernel` reads XYZZ partials and sums via XYZZ `operator+=` (12M + 2S in Fq2 vs Jacobian's 12M + 4S) — eliminates the per-bucket Jac→XYZZ conversion. - `hip_g2_msm_context::d_partial_sums` retyped to `bn254_g2_xyzz_t*`. Partial sums buffer grows 33% (192B→256B per slot, ~100MB→134MB at BUCKET_PAR=128, num_buckets=4097). Still fits easily on both 24 GB and 16 GB AMD targets. Measured on Groth16 v6.0.0 PK (N≈15M bases), GROTH16_G2_VERIFY=1 MATCH on both iterations both cards: - RX 7900 XTX: 6.86s → 6.79s prove (−70 ms, −1%) - RX 9070 XT: 12.8s → 12.36s prove (−440 ms best iter, −3%) CUDA (sppark) untouched — this change is scoped to the HIP G2 MSM kernels; `bn254_g2.cuh` is HIP-only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ge reads The G1 bucket accumulate kernel was writing partial sums in layout `[bucket_id][par_id]`, so merge-kernel reads — one thread per bucket, reading BUCKET_PAR=128 partials per bucket — accessed memory with stride = BUCKET_PAR × sizeof(xyzz) = 16 KB between threads in the same wave. Completely uncoalesced → one L2/DRAM transaction per thread per partial. Switch to transposed `[par_id][bucket_id]` layout: merge reads become coalesced (adjacent threads touch adjacent 128 B XYZZ points). Write side becomes stride-num_buckets in accumulate, but accumulate is bandwidth-bound on random SRS point loads, not on partial writes, so this is a net win. Mirrors the G2 MSM layout already in use since the signed-digit rework. Two kernels updated: - `bucket_accumulate_parallel_kernel` (legacy separate-signs path) - `bucket_accumulate_parallel_packed_kernel` (the currently-used one) - `bucket_merge_kernel` now reads the transposed layout - Both kernels' `bucket_id == 0` infinity-init write also transposed. Measured on Groth16 v6.0.0 PK (N≈15M bases), GROTH16_G2_VERIFY=1 MATCH on both iterations both cards: - RX 7900 XTX: 6.79 s → 6.66 s best prove (−130 ms) - RX 9070 XT: 12.36 s → 12.15 s best prove (−210 ms) (9070 XT gains are larger because its lower memory bandwidth was feeling the uncoalesced reads more acutely) Closes pending task succinctlabs#17 (Transpose MSM merge kernel layout for coalesced access). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tested double-buffered prefetch in G2 accumulate (mirroring G1's pattern at bn254_msm.cuh:448-481) to overlap the ~500-cycle DRAM latency of random G2 SRS point loads with XYZZ add_affine_unsafe. Result: ~150 ms regression on 7900 XTX (6.66 s → 6.81 s). Root cause: G2 affine points are 128 B (2 × Fq2) vs G1's 64 B. Holding `next_p` as a local variable adds 32 VGPRs to an already-tight accumulate kernel, pushing past the spill threshold. The compiler's native scheduling already does acceptable latency hiding for G2. Comment documents the finding so future work doesn't re-try. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

max-sanchez and others added 30 commits March 20, 2026 18:19

perf: increase leafHash occupancy hint to (256, 2)

4795dda

Allow 2 blocks per SM for leafHash kernel to improve latency hiding, especially on AMD RDNA3 where higher occupancy compensates for quarter-rate int32 multiply. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: remove backup file

676290b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

max-sanchez and others added 30 commits April 11, 2026 20:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add AMD GPU (HIP/ROCm) support and optimize GPU prover pipeline#1

feat: add AMD GPU (HIP/ROCm) support and optimize GPU prover pipeline#1
max-sanchez wants to merge 87 commits intomainfrom
max/optimizations_and_amd

max-sanchez commented Mar 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

max-sanchez commented Mar 20, 2026

Motivation

Solution

PR Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant