CUDA variable rate decompression hybrid index warp-level prefix sum by jwake · Pull Request #268 · llnl/zfp

jwake · 2025-05-25T23:47:29Z

Hi!

I was running into a bunch of invalid memory accesses trying to test out the CUDA variable rate decompression support in staging, and I believe I've tracked it down to the index block offset calculation using a warp-sized but thread-block-shared-memory offset array, causing the remaining warps in the thread block to clobber the offset calculation.

I've replaced it with a basic warp-level prefix sum and an assertion to ensure that the partition size is equal to the warp size; it'd probably be better to re-do this with something from the cooperative_groups namespace so it can support partition sizes of any power of 2 up to the thread block size, but this should be sufficient to start getting useful results from decompression.

…p-level prefix sum

lindstro · 2025-05-27T15:15:47Z

@jwake Thanks for your contribution. The parallel decoding implementation is currently wholly broken and will be rewritten over the coming months. We're currently re-engineering the block index representation and API, after which we will tackle decompression. During this time, do not expect the staging branch to work correctly until it has been finished and merged into develop.

jwake · 2025-05-27T23:30:35Z

For what it's worth, I managed to get the parallel decoder working fairly well on CUDA (~70GB/sec on an A100 with buffers in device memory, though I'll note I'm only testing 3D at the moment) between this change and manually bodging the stream_rseek call at the end of zfp_internal_cuda_decompress to reset the pointer offset without trying to read from the stream (in my case, the stream was in device memory, so rseek internally trying to read the buffer would fail; short of a separate device-aware bitstream implementation I'm not sure what the best approach here would be from an API standpoint), and then manually saving/restoring the appropriate index data externally after compression / before decompression. I did note that further compressing the index data with eg. LZ4 or Deflate netted a further ~15% size improvement on the index data but with appropriate granularity the index was already pretty small compared to the bulk of the compressed data.

As an aside, adding a zfp_mode_expert clause to the mode check in zfp_internal_cuda_[de]compress to set up the index as appropriate also allowed combining maxbits with maxprec / minexp to work as expected in parallel on CUDA - my use of the API is operating in a strictly bounded-memory environment where we've previously been using fixed-rate due to using CUDA, so being able to move to a variable-rate-with-bounded-size mode is pretty exciting.

Completely understand re: expected brokenness on the branch - I just wanted to see if I could get it to work and it's honestly looking pretty good all round. Eager to see the next major release, and thanks for all the hard work!

lindstro · 2025-05-28T15:37:58Z

The purpose of the stream_rseek call is to update the bitstream state in host memory so that when you return from zfp_decompress, the stream is positioned correctly for the next zfp_decompress call. No need to update any device data structures here.

Past experiments suggest that a granularity larger than 1, while helpful in reducing index size, is likely to degrade performance quite a bit. Also, it's incompatible with the index data structures needed for zfp's const_array classes, so it may not be available in the future. It definitely would be incompatible with some index representations we're considering.

There's no good reason I can think of why we would not support expert mode. Really the distinction should be between fixed and variable rate (and reversible mode, which uses a different algorithm), which is how we're distinguishing modes during compression. So I expect expert mode will be supported.

We'll hopefully get this cleaned up and working soon. While we certainly appreciate external contributions, I'm not sure it makes sense to merge this PR as we're already planning to rewrite major parts of this code.

Calculate per-thread block offset in hybrid indexing mode using a war…

d86e4da

…p-level prefix sum

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA variable rate decompression hybrid index warp-level prefix sum#268

CUDA variable rate decompression hybrid index warp-level prefix sum#268
jwake wants to merge 1 commit intollnl:stagingfrom
jwake:cuda_hybrid_index_warp_prefix_sum

jwake commented May 25, 2025

Uh oh!

lindstro commented May 27, 2025

Uh oh!

jwake commented May 27, 2025

Uh oh!

lindstro commented May 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jwake commented May 25, 2025

Uh oh!

lindstro commented May 27, 2025

Uh oh!

jwake commented May 27, 2025

Uh oh!

lindstro commented May 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants