
zstd: Combine Raw+RLE block fields to reduce buffer and dispatch count #104

Open
Jonathan-Weinstein-AMD wants to merge 1 commit into microsoft:development from Jonathan-Weinstein-AMD:zstd-combine-raw-rle

Conversation

@Jonathan-Weinstein-AMD

For experimentation, it would be helpful to reduce the number of buffers.

This PR reduces the number of buffers by 5 by combining the Raw and RLE block buffers into "RR" buffers. The number of dispatches is also reduced by 5 (though two of those are zero-clear memsets for lookback buffers, which in some contexts might not be needed anyway).

This mainly works via the added kzstdgpu_RLEBlock_OffsetFlag, which ZstdGpuMemsetMemcpy.hlsl checks per block. The flag is encoded in bit 31 (lookback-buffer style) of the "offset" for RLE blocks, which should be fine since that field is already not the actual offset and instead contains the filler byte. The branch becoming non-uniform doesn't really affect performance, and the shader was slightly restructured so a wave always does one store. Some performance may also be gained from fewer PrefixSum dispatches, a slightly leaner ParseFrames, and, in some (maybe contrived) cases where Raw/RLE block destinations are consecutive, a better memory access pattern.
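To make the encoding concrete, here is a minimal C++ sketch of the bit-31 scheme. Only kzstdgpu_RLEBlock_OffsetFlag comes from this PR; the function names and scalar loop bodies below are illustrative assumptions (the real per-block logic lives in ZstdGpuMemsetMemcpy.hlsl):

#include <cstdint>

// From this PR: bit 31 of the per-block "offset" word marks an RLE block,
// in the style of the lookback-buffer flag.
constexpr uint32_t kzstdgpu_RLEBlock_OffsetFlag = 1u << 31;

// Hypothetical packing: for RLE blocks the "offset" word already holds the
// filler byte rather than a real offset, so OR-ing in the flag is safe.
inline uint32_t PackRleOffsetWord(uint8_t fillerByte)
{
    return kzstdgpu_RLEBlock_OffsetFlag | fillerByte;
}

// Shader-equivalent per-block logic, written as scalar C++ for clarity:
// Raw blocks copy from the source offset, RLE blocks replicate the filler byte.
inline void ProcessRRBlock(uint32_t offsetWord, uint8_t* dst,
                           const uint8_t* srcBase, uint32_t length)
{
    if (offsetWord & kzstdgpu_RLEBlock_OffsetFlag)
    {
        const uint8_t filler = static_cast<uint8_t>(offsetWord & 0xFFu);
        for (uint32_t i = 0; i < length; ++i)
            dst[i] = filler; // RLE: memset with the filler byte
    }
    else
    {
        for (uint32_t i = 0; i < length; ++i)
            dst[i] = srcBase[offsetWord + i]; // Raw: memcpy from the source
    }
}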

Notable changes:

  • kzstdgpu_RLEBlock_OffsetFlag.
  • zstdgpu_ReferenceStore_Validate_Blocks now validates RLE blocks (as part of "RR" validation); they weren't validated before (see the removed //VALIDATE_BLOCKS(RLE); line). This is handled by zstdgpu_ReferenceStore_Report_Block constructing the same "offset" value the shader is expected to produce.
  • Avoid nullptr resource barriers with DecompressedSequence{LLen,Offs,MLen}, since CMP blocks may not have sequences (such blocks can be useful for testing purposes); see the sketch after this list.
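As a minimal D3D12-style sketch of that last point (the helper name is an assumption; only the DecompressedSequence resources are from this PR), the idea is to emit a barrier only when the resource actually exists:

#include <vector>
#include <d3d12.h>
#include "d3dx12.h"

// Hypothetical helper: collect UAV barriers, skipping resources that may be
// null. DecompressedSequence{LLen,Offs,MLen} can be null when no block in
// the batch carries sequences (e.g. the degenerate CMP blocks used in testing).
void AppendUavBarrierIfPresent(std::vector<D3D12_RESOURCE_BARRIER>& barriers,
                               ID3D12Resource* resource)
{
    if (resource != nullptr)
        barriers.push_back(CD3DX12_RESOURCE_BARRIER::UAV(resource));
}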

Testing (used --chk-cpu and --chk-gpu):

  • Insects archive.
  • Other inputs, which have Huffman-compressed literals.
  • Some custom .zst files made programmatically, for example:
// A "DegenerateCompressedBlock" is a Compressed_Block with no sequences and non-Huffman compressed literals.
// Useful when just want to test basic things about all blocks types.
ByteVector v;
FrameBuilder fb;
fb.BeginFrame(v);
for (int i = 0; i < 257; ++i)
{
    fb.EmitBlockRLE(v, 'A' + i % 26u, 5); // last argument = run length
    fb.EmitBlockRaw(v, "{:}"); // len = 3
    fb.EmitDegenerateCompressedBlockRLE(v, 'a' + i % 26u, 2);
    fb.EmitDegenerateCompressedBlockRaw(v, "-"); // len = 1
}
fb.EndFrame(v);
WriteFile(R"(RR_test_N.txt.zst)", v);

…hes.

5 fewer buffers.

Refactored the shader load so a wave always does one store.

Less work in ParseFrames.

Notable:

kzstdgpu_RLEBlock_OffsetFlag

RLE block offsets were not validated before (see the deleted `//VALIDATE_BLOCKS(RLE);` line). Validate them now, and make this work with the new combined RR fields by applying kzstdgpu_RLEBlock_OffsetFlag and the byte read in zstdgpu_ReferenceStore_Report_Block.
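A minimal sketch of that reference-side construction; the struct and field names are assumptions for illustration, with only kzstdgpu_RLEBlock_OffsetFlag and zstdgpu_ReferenceStore_Report_Block taken from this PR:

#include <cstdint>

// Hypothetical block descriptor for illustration only.
struct BlockDesc
{
    enum class Type { Raw, RLE } type;
    uint32_t srcOffset; // byte offset of the block's payload in the input
};

constexpr uint32_t kzstdgpu_RLEBlock_OffsetFlag = 1u << 31;

// Sketch of the word zstdgpu_ReferenceStore_Report_Block now constructs:
// for RLE blocks, read the filler byte from the input and OR in the flag so
// the reference matches what the GPU path writes; Raw blocks keep the offset.
uint32_t ExpectedRROffsetWord(const BlockDesc& block, const uint8_t* input)
{
    if (block.type == BlockDesc::Type::RLE)
        return kzstdgpu_RLEBlock_OffsetFlag | input[block.srcOffset];
    return block.srcOffset;
}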

---

5 fewer dispatches:
- 2 lookback-buffer memset dispatches (though since the cleared value is 0, this might not be worth much in some contexts, e.g. if the memory is already zero-initialized).
- 2 PrefixSum dispatches.
- 1 MemsetMemcpy dispatch, leaving just a single dispatch of MemsetMemcpy.
