perf(encoding): Use pre-compiled and cache-aligned lookup table for faster decode (varint encoding) (#589)

Open
srsuryadev wants to merge 5 commits into main from export-D96756546

srsuryadev (Contributor) commented Mar 19, 2026

Summary:

Use a pre-compiled, cache-aligned lookup table for varint decode to eliminate the switch statement.

Note: there is a regression for the 2-byte case; a fix is planned in D97189009.

Reviewed By: Yuhta

Differential Revision: D96756546
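
For readers unfamiliar with the technique, the following is a minimal, portable sketch of a table-driven varint decoder. It is not the code in this diff: the `GroupInfo` layout, the 6-byte group size, and the names `makeGroupTable`, `kGroupTable`, and `decodeVarints32` are all assumptions made for illustration, and the continuation-bit mask is gathered with a plain loop where a BMI2 build would presumably use `_pext_u64`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical table entry: how the next 6 input bytes split into varints.
struct GroupInfo {
  uint8_t count = 0;                // complete varints ending within 6 bytes
  uint8_t consumed = 0;             // bytes covered by those varints
  std::array<uint8_t, 6> length{};  // byte length of each complete varint
};

// Built at compile time: one entry per 6-bit continuation-bit pattern.
constexpr std::array<GroupInfo, 64> makeGroupTable() {
  std::array<GroupInfo, 64> table{};
  for (unsigned mask = 0; mask < 64; ++mask) {
    GroupInfo& info = table[mask];
    unsigned start = 0;
    for (unsigned i = 0; i < 6; ++i) {
      if (((mask >> i) & 1u) == 0) {  // byte i terminates a varint
        info.length[info.count++] = static_cast<uint8_t>(i - start + 1);
        start = i + 1;
      }
    }
    info.consumed = static_cast<uint8_t>(start);
  }
  return table;
}

// Aligned to a 64-byte boundary so the table starts on a cache line.
alignas(64) constexpr std::array<GroupInfo, 64> kGroupTable = makeGroupTable();

// Decodes `count` well-formed varints that each fit in 32 bits.
// Returns the number of input bytes consumed.
inline size_t decodeVarints32(
    const uint8_t* data, size_t size, size_t count, uint32_t* out) {
  size_t pos = 0;
  size_t produced = 0;
  while (produced < count) {
    if (size - pos >= 6 && count - produced >= 6) {
      // Gather the continuation bits of the next 6 bytes into a table index.
      unsigned mask = 0;
      for (unsigned i = 0; i < 6; ++i) {
        mask |= static_cast<unsigned>(data[pos + i] >> 7) << i;
      }
      const GroupInfo& info = kGroupTable[mask];
      if (info.count > 0) {
        const uint8_t* p = data + pos;
        for (unsigned v = 0; v < info.count; ++v) {
          uint32_t value = 0;
          for (unsigned b = 0; b < info.length[v]; ++b) {
            value |= static_cast<uint32_t>(*p++ & 0x7f) << (7 * b);
          }
          out[produced++] = value;
        }
        pos += info.consumed;
        continue;
      }
    }
    // Scalar fallback: the tail of the input, or a group with no terminator.
    uint32_t value = 0;
    unsigned shift = 0;
    while (true) {
      const uint8_t byte = data[pos++];
      value |= static_cast<uint32_t>(byte & 0x7f) << shift;
      if ((byte & 0x80) == 0) {
        break;
      }
      shift += 7;
    }
    out[produced++] = value;
  }
  return pos;
}
```

The table lookup replaces a branch per continuation-bit pattern with a single indexed load; whether that wins in practice depends on the data mix, as the 2-byte regression noted above suggests.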

Summary:
Add `decodeSingleByteRun` fast path to `bulkVarintDecode32` and
`bulkVarintDecode64` that processes leading runs of single-byte varints
(values 0-127) using 8-byte word reads before falling through to the
BMI2 switch-based decoder. For each 8-byte word where no continuation
bits are set (`(word & 0x8080808080808080) == 0`), all 8 varints are
decoded with simple shifts, avoiding the `_pext_u64` and 64-case switch
overhead.

This is placed in the caller functions rather than inside
`bulkVarintDecodeBmi2` to preserve the BMI2 function's code layout and
icache behavior for mixed-width data.
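
A minimal sketch of the fast path described above, under stated assumptions: the signature and return value here are hypothetical (the diff's actual `decodeSingleByteRun` may differ), and the shift-based expansion assumes a little-endian host such as x86-64.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr uint64_t kHighBits = 0x8080808080808080ULL;

// Decodes the leading run of single-byte varints (values 0-127), 8 at a time.
// Returns how many values were decoded, which equals the bytes consumed.
// Hypothetical signature; stops at the first word containing a continuation
// bit so the caller can hand the rest to the general decoder.
template <typename T>
size_t decodeSingleByteRun(
    const uint8_t* data, size_t size, size_t count, T* out) {
  size_t n = 0;
  while (count - n >= 8 && size - n >= 8) {
    uint64_t word;
    std::memcpy(&word, data + n, sizeof(word));  // safe unaligned load
    // Parentheses matter: == binds tighter than & in C++.
    if ((word & kHighBits) != 0) {
      break;
    }
    // Little-endian assumption: byte i of the input is bits [8i, 8i+8).
    for (unsigned i = 0; i < 8; ++i) {
      out[n + i] = static_cast<T>((word >> (8 * i)) & 0xff);
    }
    n += 8;
  }
  return n;
}
```

A caller would decode the remaining `count - n` values from `data + n` with the general-purpose decoder.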

Benchmark results (1M elements, mode/opt):

| Scenario              | Before    | After     | Speedup       |
|-----------------------|-----------|-----------|---------------|
| 1-byte (32-bit)       | 465us     | 260us     | 1.79x         |
| 5-byte (32-bit)       | slower    | 1.22ms    | fixed         |
| 3-byte (32-bit)       | 1.04ms    | 864us     | 1.20x         |
| 4-byte (32-bit)       | 1.50ms    | 1.04ms    | 1.44x         |
| 64-bit 1-byte         | 294us     | 232us     | 1.27x         |
| batch1024             | 1.96us    | 1.20us    | 1.63x         |
| Uniform/2-byte/8-byte | unchanged | unchanged | no regression |

Also enhances the varint benchmark with fixed byte-width benchmarks
(1-5 byte for 32-bit, 1/4/8 byte for 64-bit), skip benchmarks, and
batch size benchmarks.
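
One way to produce fixed byte-width inputs for such benchmarks is sketched below. This is an illustration only: `encodeVarint`, `minValueForWidth`, and `makeFixedWidthInput` are hypothetical helpers, not names taken from the benchmark in this diff. It relies on the fact that a varint of width w >= 2 encodes values starting at 2^(7(w-1)).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends the varint encoding of `value` (7 payload bits per byte, high bit
// set on every byte except the last). Returns the number of bytes written.
inline size_t encodeVarint(uint64_t value, std::vector<uint8_t>& out) {
  size_t written = 1;
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>((value & 0x7f) | 0x80));
    value >>= 7;
    ++written;
  }
  out.push_back(static_cast<uint8_t>(value));
  return written;
}

// Smallest value whose encoding takes exactly `width` bytes (1 <= width <= 10).
constexpr uint64_t minValueForWidth(unsigned width) {
  return width == 1 ? 0 : (uint64_t{1} << (7 * (width - 1)));
}

// Encodes `count` values that all occupy exactly `width` bytes.
// Adding at most 127 to the minimum keeps every value in the same width
// class, including the 5-byte class of 32-bit values (2^28 + 127 < 2^32).
inline std::vector<uint8_t> makeFixedWidthInput(unsigned width, size_t count) {
  std::vector<uint8_t> buffer;
  buffer.reserve(width * count);
  for (size_t i = 0; i < count; ++i) {
    encodeVarint(minValueForWidth(width) + (i % 128), buffer);
  }
  return buffer;
}
```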

Differential Revision: D96617939
… single-byte varints

Summary:
Manually loop-unroll `decodeSingleByteRun` with a 3-tier approach:
1. 32-element (4-word) unrolled loop with combined high-bit check
   `(w0 | w1 | w2 | w3) & kHighBits` to minimize branch overhead
2. 8-element (1-word) loop for smaller runs
3. Single-element trailing loop to pick up individual single-byte
   varints before multi-byte values

Also extracts the byte-expansion logic into a reusable `expandWord()`
helper for clarity.
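
The three tiers above can be sketched roughly as follows. The signatures, the `loadWord` helper, and the little-endian byte order in `expandWord` are assumptions for illustration; only the names `decodeSingleByteRun`, `expandWord`, and `kHighBits` come from the summary.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr uint64_t kHighBits = 0x8080808080808080ULL;

// Hypothetical helper: unaligned 8-byte load without aliasing issues.
inline uint64_t loadWord(const uint8_t* p) {
  uint64_t word;
  std::memcpy(&word, p, sizeof(word));
  return word;
}

// Widens the 8 bytes of `word` into 8 output elements.
// Assumes a little-endian host, so input byte i sits at bits [8i, 8i+8).
template <typename T>
inline void expandWord(uint64_t word, T* out) {
  for (unsigned i = 0; i < 8; ++i) {
    out[i] = static_cast<T>((word >> (8 * i)) & 0xff);
  }
}

// Returns the length of the leading run of single-byte varints decoded.
template <typename T>
size_t decodeSingleByteRun(
    const uint8_t* data, size_t size, size_t count, T* out) {
  const size_t limit = size < count ? size : count;
  size_t n = 0;
  // Tier 1: 32 elements per iteration, one combined high-bit check.
  while (limit - n >= 32) {
    const uint64_t w0 = loadWord(data + n);
    const uint64_t w1 = loadWord(data + n + 8);
    const uint64_t w2 = loadWord(data + n + 16);
    const uint64_t w3 = loadWord(data + n + 24);
    if (((w0 | w1 | w2 | w3) & kHighBits) != 0) {
      break;
    }
    expandWord(w0, out + n);
    expandWord(w1, out + n + 8);
    expandWord(w2, out + n + 16);
    expandWord(w3, out + n + 24);
    n += 32;
  }
  // Tier 2: 8 elements per iteration for shorter runs.
  while (limit - n >= 8) {
    const uint64_t w = loadWord(data + n);
    if ((w & kHighBits) != 0) {
      break;
    }
    expandWord(w, out + n);
    n += 8;
  }
  // Tier 3: single bytes up to the first multi-byte varint.
  while (n < limit && (data[n] & 0x80) == 0) {
    out[n] = static_cast<T>(data[n]);
    ++n;
  }
  return n;
}
```

When tier 1 breaks out, tiers 2 and 3 still pick up any single-byte values that precede the first continuation bit, so the run length returned is exact.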

Differential Revision: D96619597
…gleByteRun

Summary:
Replace scalar byte expansion and reinterpret_cast-based uint64_t loads in
decodeSingleByteRun with xsimd-based SIMD operations:

- Use xsimd::batch<uint8_t>::load_unaligned for a single wide load (32 bytes
  on AVX2) + vptest to check all high bits at once, replacing 4 separate
  uint64_t loads + OR chain.
- Use xsimd::batch<T> construction and store_unaligned for byte-to-element
  widening (compiles to vpmovzxbd on AVX2, vmovl on NEON).
- Replace reinterpret_cast<const uint64_t*> with std::memcpy in the 8-byte
  loop to avoid strict-aliasing/alignment issues.

Differential Revision: D96628007

Reviewed By: xiaoxmeng
…ncoding to make it robust

Summary: Add further tests to the varint encoding to make it robust

Differential Revision: D96665765
meta-cla bot added the CLA Signed label Mar 19, 2026
meta-codesync bot commented Mar 19, 2026

@srsuryadev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96756546.

srsuryadev added a commit that referenced this pull request Mar 19, 2026
…aster decode (varint encoding) (#589)

Summary:
Pull Request resolved: #589

Use a pre-compiled, cache-aligned lookup table for varint decode to eliminate the switch statement.

Note: there is a regression for the 2-byte case; a fix is planned in D97189009.

Reviewed By: Yuhta

Differential Revision: D96756546
meta-codesync bot changed the title from "perf(encoding): Use pre-compiled and cache-aligned lookup table for faster decode (varint encoding)" to "perf(encoding): Use pre-compiled and cache-aligned lookup table for faster decode (varint encoding) (#589)" on Mar 19, 2026
srsuryadev added a commit that referenced this pull request Mar 20, 2026
Labels

CLA Signed, fb-exported, meta-exported
