
zstd: Combine Raw+RLE block fields to reduce buffer and dispatch count #104

Open
Jonathan-Weinstein-AMD wants to merge 1 commit into microsoft:development from Jonathan-Weinstein-AMD:zstd-combine-raw-rle

Conversation

@Jonathan-Weinstein-AMD

For experimentation, it would be helpful to reduce the number of buffers.

This PR reduces the number of buffers by 5 by combining the Raw and RLE block buffers into "RR" buffers. The number of dispatches is also reduced by 5 (though two of those are zero-clear memsets for lookback buffers, which in some contexts might not be needed anyway).

This mainly works via the added kzstdgpu_RLEBlock_OffsetFlag, which ZstdGpuMemsetMemcpy.hlsl checks per block. The flag is encoded in bit 31 (lookback-buffer style) of the "offset" for RLE blocks, which should be fine since that field is already not the actual offset and instead contains the filler byte. The branch becoming non-uniform doesn't really affect performance, and the shader was slightly restructured so a wave always does one store. Some performance may also be gained from fewer PrefixSum dispatches, a slightly leaner ParseFrames, and, in some (maybe contrived) cases where Raw/RLE block destinations are consecutive, a better memory access pattern.
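To make the encoding concrete, here is a minimal C++ sketch of the bit-31 scheme. Only kzstdgpu_RLEBlock_OffsetFlag comes from this PR; the function names and scalar loop bodies below are illustrative assumptions (the real per-block logic lives in ZstdGpuMemsetMemcpy.hlsl):

#include <cstdint>

// From this PR: bit 31 of the per-block "offset" word marks an RLE block,
// in the style of the lookback-buffer flag.
constexpr uint32_t kzstdgpu_RLEBlock_OffsetFlag = 1u << 31;

// Hypothetical packing: for RLE blocks the "offset" word already holds the
// filler byte rather than a real offset, so OR-ing in the flag is safe.
inline uint32_t PackRleOffsetWord(uint8_t fillerByte)
{
    return kzstdgpu_RLEBlock_OffsetFlag | fillerByte;
}

// Shader-equivalent per-block logic, written as scalar C++ for clarity:
// Raw blocks copy from the source offset, RLE blocks replicate the filler byte.
inline void ProcessRRBlock(uint32_t offsetWord, uint8_t* dst,
                           const uint8_t* srcBase, uint32_t length)
{
    if (offsetWord & kzstdgpu_RLEBlock_OffsetFlag)
    {
        const uint8_t filler = static_cast<uint8_t>(offsetWord & 0xFFu);
        for (uint32_t i = 0; i < length; ++i)
            dst[i] = filler; // RLE: memset with the filler byte
    }
    else
    {
        for (uint32_t i = 0; i < length; ++i)
            dst[i] = srcBase[offsetWord + i]; // Raw: memcpy from the source
    }
}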

Notable changes:

  • kzstdgpu_RLEBlock_OffsetFlag.
  • zstdgpu_ReferenceStore_Validate_Blocks now validates RLE blocks (as part of "RR" validation); they weren't validated before (see the removed //VALIDATE_BLOCKS(RLE); line). This is handled by zstdgpu_ReferenceStore_Report_Block constructing the same "offset" value the shader is expected to produce.
  • Avoid nullptr resource barriers with DecompressedSequence{LLen,Offs,MLen}, since CMP blocks may not have sequences (such blocks can be useful for testing purposes); see the sketch after this list.
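As a minimal D3D12-style sketch of that last point (the helper name is an assumption; only the DecompressedSequence resources are from this PR), the idea is to emit a barrier only when the resource actually exists:

#include <vector>
#include <d3d12.h>
#include "d3dx12.h"

// Hypothetical helper: collect UAV barriers, skipping resources that may be
// null. DecompressedSequence{LLen,Offs,MLen} can be null when no block in
// the batch carries sequences (e.g. the degenerate CMP blocks used in testing).
void AppendUavBarrierIfPresent(std::vector<D3D12_RESOURCE_BARRIER>& barriers,
                               ID3D12Resource* resource)
{
    if (resource != nullptr)
        barriers.push_back(CD3DX12_RESOURCE_BARRIER::UAV(resource));
}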

Testing (used --chk-cpu and --chk-gpu):

  • Insects archive.
  • Other inputs, which have Huffman-compressed literals.
  • Some custom .zst files made programmatically, for example:
// A "DegenerateCompressedBlock" is a Compressed_Block with no sequences and non-Huffman compressed literals.
// Useful when just want to test basic things about all blocks types.
ByteVector v;
FrameBuilder fb;
fb.BeginFrame(v);
for (int i = 0; i < 257; ++i)
{
    fb.EmitBlockRLE(v, 'A' + i % 26u, 5); // last argument = run length
    fb.EmitBlockRaw(v, "{:}"); // len = 3
    fb.EmitDegenerateCompressedBlockRLE(v, 'a' + i % 26u, 2);
    fb.EmitDegenerateCompressedBlockRaw(v, "-"); // len = 1
}
fb.EndFrame(v);
WriteFile(R"(RR_test_N.txt.zst)", v);

…hes.

5 fewer buffers.

Refactored the shader load so a wave always does one store.

Less work in ParseFrames.

Notable:

kzstdgpu_RLEBlock_OffsetFlag

RLE block offsets were not validated before (see the deleted `//VALIDATE_BLOCKS(RLE);` line). Validate them now, and make this work with the new combined RR fields by applying kzstdgpu_RLEBlock_OffsetFlag and the byte read in zstdgpu_ReferenceStore_Report_Block.
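A minimal sketch of that reference-side construction; the struct and field names are assumptions for illustration, with only kzstdgpu_RLEBlock_OffsetFlag and zstdgpu_ReferenceStore_Report_Block taken from this PR:

#include <cstdint>

// Hypothetical block descriptor for illustration only.
struct BlockDesc
{
    enum class Type { Raw, RLE } type;
    uint32_t srcOffset; // byte offset of the block's payload in the input
};

constexpr uint32_t kzstdgpu_RLEBlock_OffsetFlag = 1u << 31;

// Sketch of the word zstdgpu_ReferenceStore_Report_Block now constructs:
// for RLE blocks, read the filler byte from the input and OR in the flag so
// the reference matches what the GPU path writes; Raw blocks keep the offset.
uint32_t ExpectedRROffsetWord(const BlockDesc& block, const uint8_t* input)
{
    if (block.type == BlockDesc::Type::RLE)
        return kzstdgpu_RLEBlock_OffsetFlag | input[block.srcOffset];
    return block.srcOffset;
}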

---

5 fewer dispatches:
- 2 lookback-buffer memset dispatches (though since the cleared value is 0, this might not be worth much in some contexts, e.g. if the memory is already zero-initialized).
- 2 PrefixSum dispatches.
- 1 MemsetMemcpy dispatch, leaving just a single dispatch of MemsetMemcpy.
