perf(encoding): Use pre-compiled and cache-aligned lookup table for faster decode (varint encoding) (#589) by srsuryadev · Pull Request #589 · facebookincubator/nimble

srsuryadev · 2026-03-19T17:23:12Z

Summary:

Use a pre-compiled, cache-aligned lookup table for varint decode to eliminate the switch statement.

Note: there is a regression in the 2-byte case; a fix is planned in D97189009.

Reviewed By: Yuhta

Differential Revision: D96756546
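
For readers unfamiliar with the idea, here is a minimal sketch of a table-driven varint decode. It is not the code in this diff: `WindowInfo`, `WindowTable`, `kTable`, `appendVarint`, and `decodeAll` are hypothetical names, the continuation bits are gathered with a portable loop rather than `_pext_u64`, and the input is assumed to hold well-formed 32-bit varints (at most 5 bytes each).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical descriptor for one 8-byte window, keyed by its continuation bits.
struct WindowInfo {
  uint8_t count;     // number of varints that end inside the window
  uint8_t consumed;  // total bytes occupied by those varints
  uint8_t lens[8];   // byte length of each complete varint
};

// 64-byte alignment so the table starts on a cache-line boundary.
struct alignas(64) WindowTable {
  WindowInfo entries[256];
};

// Built at compile time: one entry per possible continuation-bit mask.
constexpr WindowTable buildTable() {
  WindowTable table{};
  for (int mask = 0; mask < 256; ++mask) {
    WindowInfo info{};
    uint8_t len = 0;
    for (int bit = 0; bit < 8; ++bit) {
      ++len;
      if ((mask & (1 << bit)) == 0) {  // no continuation bit: a varint ends here
        info.lens[info.count++] = len;
        info.consumed += len;
        len = 0;
      }
    }
    table.entries[mask] = info;
  }
  return table;
}

constexpr WindowTable kTable = buildTable();

// Helper for the usage example: standard LEB128-style varint encoding.
void appendVarint(std::vector<uint8_t>& out, uint32_t value) {
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>(value | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
}

std::vector<uint32_t> decodeAll(const uint8_t* data, size_t size) {
  std::vector<uint32_t> out;
  size_t pos = 0;
  while (pos + 8 <= size) {
    const uint8_t* p = data + pos;
    uint8_t mask = 0;
    for (int i = 0; i < 8; ++i) {
      mask |= (p[i] >> 7) << i;  // bit i = continuation bit of byte i
    }
    const WindowInfo& info = kTable.entries[mask];
    if (info.count == 0) {
      break;  // no varint ends in this window; leave it to the scalar loop
    }
    for (uint8_t v = 0; v < info.count; ++v) {
      uint32_t value = 0;
      for (uint8_t b = 0; b < info.lens[v]; ++b) {
        value |= static_cast<uint32_t>(*p++ & 0x7f) << (7 * b);
      }
      out.push_back(value);
    }
    pos += info.consumed;
  }
  // Scalar tail for the last few bytes.
  while (pos < size) {
    uint32_t value = 0;
    int shift = 0;
    uint8_t byte = 0;
    do {
      byte = data[pos++];
      value |= static_cast<uint32_t>(byte & 0x7f) << shift;
      shift += 7;
    } while ((byte & 0x80) != 0 && pos < size);
    out.push_back(value);
  }
  return out;
}
```

The table lookup replaces a branch per mask value with a single indexed load; a varint that straddles the window boundary is simply left for the next window.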

Summary: Add `decodeSingleByteRun` fast path to `bulkVarintDecode32` and `bulkVarintDecode64` that processes leading runs of single-byte varints (values 0-127) using 8-byte word reads before falling through to the BMI2 switch-based decoder. For each 8-byte word where no continuation bits are set (`(word & 0x8080808080808080) == 0`), all 8 varints are decoded with simple shifts, avoiding the `_pext_u64` and 64-case switch overhead.

This is placed in the caller functions rather than inside `bulkVarintDecodeBmi2` to preserve the BMI2 function's code layout and icache behavior for mixed-width data.

Benchmark results (1M elements, mode/opt):

| Scenario              | Before    | After     | Speedup       |
|-----------------------|-----------|-----------|---------------|
| 1-byte (32-bit)       | 465us     | 260us     | 1.79x         |
| 5-byte (32-bit)       | slower    | 1.22ms    | fixed         |
| 3-byte (32-bit)       | 1.04ms    | 864us     | 1.20x         |
| 4-byte (32-bit)       | 1.50ms    | 1.04ms    | 1.44x         |
| 64-bit 1-byte         | 294us     | 232us     | 1.27x         |
| batch1024             | 1.96us    | 1.20us    | 1.63x         |
| Uniform/2-byte/8-byte | unchanged | unchanged | no regression |

Also enhances the varint benchmark with fixed byte-width benchmarks (1-5 byte for 32-bit, 1/4/8 byte for 64-bit), skip benchmarks, and batch size benchmarks.

Differential Revision: D96617939
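
The word-at-a-time check described above can be sketched roughly as follows. This is an illustration, not the function from the diff: the signature of `decodeSingleByteRun` and the name `kHighBits` are assumed here, and the shifts assume a little-endian host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// One continuation bit per byte of an 8-byte word.
constexpr uint64_t kHighBits = 0x8080808080808080ULL;

// Decodes the leading run of single-byte varints, 8 at a time, and returns
// how many values were written. The caller would hand the remaining bytes
// to the general (multi-byte) decoder.
size_t decodeSingleByteRun(const uint8_t* in, size_t size, uint32_t* out) {
  size_t done = 0;
  while (done + 8 <= size) {
    uint64_t word;
    std::memcpy(&word, in + done, sizeof(word));
    if ((word & kHighBits) != 0) {
      break;  // at least one byte starts a multi-byte varint
    }
    for (int b = 0; b < 8; ++b) {
      // Little-endian assumption: byte b of the input sits at bits [8b, 8b+7].
      out[done + b] = static_cast<uint32_t>((word >> (8 * b)) & 0xff);
    }
    done += 8;
  }
  return done;
}
```

Note the parentheses around `word & kHighBits`: in C++ `==` and `!=` bind tighter than `&`, so the unparenthesized form would test the wrong thing.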

… single-byte varints

Summary: Manually loop-unroll `decodeSingleByteRun` with a 3-tier approach:

1. 32-element (4-word) unrolled loop with combined high-bit check `(w0 | w1 | w2 | w3) & kHighBits` to minimize branch overhead
2. 8-element (1-word) loop for smaller runs
3. Single-element trailing loop to pick up individual single-byte varints before multi-byte values

Also extracts the byte-expansion logic into a reusable `expandWord()` helper for clarity.

Differential Revision: D96619597
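
A rough sketch of the three tiers, under the same caveats as before: the signatures of `decodeSingleByteRun` and `expandWord`, and the `loadWord` helper, are assumptions made for illustration, and the byte expansion assumes a little-endian host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr uint64_t kHighBits = 0x8080808080808080ULL;

// Hypothetical helper: alignment-safe 8-byte load.
inline uint64_t loadWord(const uint8_t* p) {
  uint64_t word;
  std::memcpy(&word, p, sizeof(word));
  return word;
}

// Widens the 8 bytes of `word` into 8 output elements (little-endian order).
inline void expandWord(uint64_t word, uint32_t* out) {
  for (int b = 0; b < 8; ++b) {
    out[b] = static_cast<uint32_t>((word >> (8 * b)) & 0xff);
  }
}

size_t decodeSingleByteRun(const uint8_t* in, size_t size, uint32_t* out) {
  size_t done = 0;
  // Tier 1: 32 elements per iteration, one combined high-bit check.
  while (done + 32 <= size) {
    const uint64_t w0 = loadWord(in + done);
    const uint64_t w1 = loadWord(in + done + 8);
    const uint64_t w2 = loadWord(in + done + 16);
    const uint64_t w3 = loadWord(in + done + 24);
    if (((w0 | w1 | w2 | w3) & kHighBits) != 0) {
      break;
    }
    expandWord(w0, out + done);
    expandWord(w1, out + done + 8);
    expandWord(w2, out + done + 16);
    expandWord(w3, out + done + 24);
    done += 32;
  }
  // Tier 2: 8 elements per iteration.
  while (done + 8 <= size) {
    const uint64_t w = loadWord(in + done);
    if ((w & kHighBits) != 0) {
      break;
    }
    expandWord(w, out + done);
    done += 8;
  }
  // Tier 3: remaining single-byte varints before the first multi-byte value.
  while (done < size && (in[done] & 0x80) == 0) {
    out[done] = in[done];
    ++done;
  }
  return done;
}
```

When tier 1 bails out because one of the four words contains a continuation bit, tiers 2 and 3 still pick up any clean words and bytes that precede it, so the run is always consumed up to the first multi-byte varint.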

…gleByteRun

Summary: Replace scalar byte expansion and `reinterpret_cast`-based `uint64_t` loads in `decodeSingleByteRun` with xsimd-based SIMD operations:

- Use `xsimd::batch<uint8_t>::load_unaligned` for a single wide load (32 bytes on AVX2) + `vptest` to check all high bits at once, replacing 4 separate `uint64_t` loads + OR chain.
- Use `xsimd::batch<T>` construction and `store_unaligned` for byte-to-element widening (compiles to `vpmovzxbd` on AVX2, `vmovl` on NEON).
- Replace `reinterpret_cast<const uint64_t*>` with `std::memcpy` in the 8-byte loop to avoid strict-aliasing/alignment issues.

Differential Revision: D96628007

Reviewed By: xiaoxmeng

…ncoding to make it robust

Summary: Add further tests to the varint encoding to make it robust

Differential Revision: D96665765

meta-codesync · 2026-03-19T17:23:33Z

@srsuryadev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96756546.


srsuryadev added 4 commits March 19, 2026 06:44

test(encoding): add further tests for varint encoding to the varint e…

fb93c36


meta-cla bot added the CLA Signed label Mar 19, 2026

meta-codesync bot added fb-exported meta-exported labels Mar 19, 2026

srsuryadev force-pushed the export-D96756546 branch from 57ff524 to f114948 Compare March 19, 2026 22:32

meta-codesync bot changed the title ~~perf(encoding): Use pre-compiled and cache-aligned lookup table for faster decode (varint encoding)~~ perf(encoding): Use pre-compiled and cache-aligned lookup table for faster decode (varint encoding) (#589) Mar 19, 2026

srsuryadev force-pushed the export-D96756546 branch from f114948 to 5a623ab Compare March 19, 2026 22:37

srsuryadev force-pushed the export-D96756546 branch from 5a623ab to ed4a646 Compare March 20, 2026 03:39

srsuryadev force-pushed the export-D96756546 branch from ed4a646 to ecc64cf Compare March 20, 2026 03:43

srsuryadev wants to merge 5 commits into main from export-D96756546
