zstd: Combine Raw+RLE block fields to reduce buffer and dispatch count #104
Open
Jonathan-Weinstein-AMD wants to merge 1 commit into microsoft:development
…hes. 5 fewer buffers.
Refactor for the load
Less work in parse frames.

Notable:
kzstdgpu_RLEBlock_OffsetFlag
RLE block offsets were not validated before (see deleted `//VALIDATE_BLOCKS(RLE);` line). Validate them now and make it work in the new combined RR fields by applying kzstdgpu_RLEBlock_OffsetFlag and the byte read in zstdgpu_ReferenceStore_Report_Block.

---
5 fewer dispatches:
- 2 dispatches for lookback buffer memsets, though since the value is 0, this might not be worth much in some contexts (e.g., if memory is already initialized to 0).
- 2 PrefixSum dispatches.
- Leaves just one dispatch of MemsetMemcpy.
For experimentation, it would be helpful to reduce the number of buffers.
This PR reduces the number of buffers by 5 by combining the Raw and RLE block buffers into "RR" buffers. The number of dispatches is also reduced by 5 (though two of those are zero-clear memsets for lookback buffers, which in some contexts might not be needed anyway).
This mainly works via the added `kzstdgpu_RLEBlock_OffsetFlag`, which `ZstdGpuMemsetMemcpy.hlsl` checks per block. This flag is encoded in bit 31 (a la lookback buffer style) of the "offset" for RLE blocks, which should be fine since that field is already not an actual offset and instead contains the filler byte. The branch becoming non-uniform doesn't really affect performance, and the shader was slightly restructured so a wave always does one store. Some performance may also be gained from fewer PrefixSum dispatches, a slightly leaner ParseFrames, and in some (maybe contrived) cases where Raw/RLE block destinations are consecutive and have a better memory access pattern.

Notable changes:
- `kzstdgpu_RLEBlock_OffsetFlag`.
- `zstdgpu_ReferenceStore_Validate_Blocks` now validates RLE blocks (as part of "RR" validation); they weren't before (see removed `//VALIDATE_BLOCKS(RLE);` line). This is handled by `zstdgpu_ReferenceStore_Report_Block` constructing the "offset" the shader is expected to.
- `DecompressedSequence{LLen,Offs,MLen}` since CMP blocks may not have sequences (such blocks can be useful for testing purposes).

Testing, used `--chk-cpu` and `--chk-gpu`:
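
For readers unfamiliar with the bit-31 flag trick described above, here is a rough CPU-side sketch in C++ of how a combined "RR" offset field could be packed and then branched on per block. It is not the PR's shader code: `kRleOffsetFlag`, `packRleOffset`, `isRleBlock`, and `expandBlock` are hypothetical names, and placing the filler byte in the low 8 bits is an assumption (the description only says the flag lives in bit 31 and that the field holds the filler byte).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumption: the flag is bit 31, as the PR description states. The real
// kzstdgpu_RLEBlock_OffsetFlag definition in the repository may differ.
static constexpr uint32_t kRleOffsetFlag = 1u << 31;

// Hypothetical helper: store an RLE filler byte in the "offset" field.
// Assumption: the filler byte occupies the low 8 bits.
static uint32_t packRleOffset(uint8_t fillerByte) {
    return kRleOffsetFlag | fillerByte;
}

// Hypothetical helper: a set bit 31 marks the block as RLE rather than Raw.
static bool isRleBlock(uint32_t offsetField) {
    return (offsetField & kRleOffsetFlag) != 0;
}

// CPU model of the per-block branch: an RLE block becomes a memset of the
// filler byte, a Raw block becomes a memcpy from the compressed source.
static void expandBlock(const std::vector<uint8_t>& src, uint32_t offsetField,
                        uint32_t size, uint8_t* dst) {
    if (isRleBlock(offsetField)) {
        std::memset(dst, static_cast<int>(offsetField & 0xFFu), size);
    } else {
        // Raw offsets must stay below 2^31 so they never collide with the flag.
        assert(static_cast<size_t>(offsetField) + size <= src.size());
        std::memcpy(dst, src.data() + offsetField, size);
    }
}
```

With this layout, one pass over a single combined block list can handle both block types, which is the property that lets the separate Raw and RLE buffers be merged.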