CUDA variable rate decompression hybrid index warp-level prefix sum #268
jwake wants to merge 1 commit into llnl:staging
@jwake Thanks for your contribution. The parallel decoding implementation is currently wholly broken and will be rewritten over the coming months. We're re-engineering the block index representation and API, after which we will tackle decompression. During this time, do not expect the […]
For what it's worth, I managed to get the parallel decoder working fairly well on CUDA (~70 GB/s on an A100 with buffers in device memory, though I'll note I'm only testing 3D at the moment) between this change and manually bodging the […]
As an aside, adding a […]
Completely understand re: expected brokenness on the branch - I just wanted to see if I could get it to work and it's honestly looking pretty good all round. Eager to see the next major release, and thanks for all the hard work!
The purpose of the […]
Past experiments suggest that a granularity larger than 1, while helpful in reducing index size, is likely to degrade performance quite a bit. Also, it's incompatible with the index data structures needed for zfp's […]
There's no good reason I can think of why we would not support expert mode. Really the distinction should be between fixed and variable rate (and reversible mode, which uses a different algorithm), which is how we're distinguishing modes during compression. So I expect expert mode will be supported.
We'll hopefully get this cleaned up and working soon. While we certainly appreciate external contributions, I'm not sure it makes sense to merge this PR as we're already planning to rewrite major parts of this code.
Hi!
I was running into a bunch of invalid memory accesses while testing the CUDA variable rate decompression support in staging. I believe I've tracked it down to the index block offset calculation: it uses an offset array that is sized for a single warp but lives in thread-block shared memory, so the remaining warps in the thread block clobber the offset calculation.
I've replaced it with a basic warp-level prefix sum and an assertion to ensure that the partition size is equal to the warp size; it'd probably be better to re-do this with something from the cooperative_groups namespace so it can support partition sizes of any power of 2 up to the thread block size, but this should be sufficient to start getting useful results from decompression.
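For readers unfamiliar with the technique, here is a minimal sketch of a warp-level exclusive prefix sum built on `__shfl_up_sync`. It is not the code from this PR or from zfp: `warp_exclusive_scan`, `partition_offsets`, `check_warp_scan`, and the made-up block lengths are hypothetical names and data chosen for illustration. It assumes a 32-lane warp, a partition size equal to the warp size, and a thread block size that is a multiple of 32. Because each running sum lives in registers, one warp cannot overwrite another warp's offsets.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumption: 32-lane warps, and one partition of the index per warp.
constexpr int kWarpSize = 32;

// Exclusive prefix sum across the lanes of one warp (hypothetical helper).
// Every lane of the warp must call this, since the shuffles use a full mask.
__device__ unsigned long long warp_exclusive_scan(unsigned long long value)
{
  const unsigned full_mask = 0xffffffffu;
  const int lane = threadIdx.x % kWarpSize;
  unsigned long long inclusive = value;
  for (int delta = 1; delta < kWarpSize; delta *= 2) {
    // Read the partial sum held by the lane `delta` positions below this one.
    const unsigned long long from_below = __shfl_up_sync(full_mask, inclusive, delta);
    if (lane >= delta)
      inclusive += from_below;
  }
  return inclusive - value;  // convert the inclusive sum to an exclusive one
}

// One thread per compressed block; offsets are relative to the partition start.
__global__ void partition_offsets(const unsigned* lengths,
                                  unsigned long long* offsets, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  // Out-of-range lanes contribute zero but still take part in the shuffles.
  const unsigned long long len = (i < n) ? lengths[i] : 0;
  const unsigned long long off = warp_exclusive_scan(len);
  if (i < n)
    offsets[i] = off;
}

// Host-side check against a sequential reference (hypothetical driver).
bool check_warp_scan()
{
  const int n = 1000;  // deliberately not a multiple of the warp size
  std::vector<unsigned> lengths(n);
  for (int i = 0; i < n; i++)
    lengths[i] = 1 + (i * 7) % 61;  // made-up per-block bit lengths

  unsigned* d_lengths = nullptr;
  unsigned long long* d_offsets = nullptr;
  cudaMalloc(&d_lengths, n * sizeof(unsigned));
  cudaMalloc(&d_offsets, n * sizeof(unsigned long long));
  cudaMemcpy(d_lengths, lengths.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice);

  const int threads = 128;  // four warps per thread block
  partition_offsets<<<(n + threads - 1) / threads, threads>>>(d_lengths, d_offsets, n);

  std::vector<unsigned long long> offsets(n);
  const cudaError_t err = cudaMemcpy(offsets.data(), d_offsets,
                                     n * sizeof(unsigned long long),
                                     cudaMemcpyDeviceToHost);
  cudaFree(d_lengths);
  cudaFree(d_offsets);
  if (err != cudaSuccess)
    return false;

  unsigned long long running = 0;
  for (int i = 0; i < n; i++) {
    if (i % kWarpSize == 0)
      running = 0;  // each partition restarts at offset zero
    if (offsets[i] != running)
      return false;
    running += lengths[i];
  }
  return true;
}
```

Calling `check_warp_scan()` from a `main` function on a machine with a CUDA device compares the GPU result with the sequential reference. Launching four warps per block is deliberate: it is the multi-warp configuration in which a shared, warp-sized offset array would be overwritten.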