[Feat] Add the support of Dual buckets - 1st phase #246
Merged
jiashuy merged 2 commits into NVIDIA-Merlin:master on Apr 9, 2026
Documentation preview: https://nvidia-merlin.github.io/HierarchicalKV/review/pr-246
…tils Document the digest mechanism for single-bucket and dual-bucket modes:
- Single-bucket digests use bits [32:39]; dual-bucket uses bits [56:63] to avoid collision with the b2 bucket address derived from the high 32 bits.
- Pipeline kernels in lookup.cuh and contains.cuh compute target digests inline (hashed_key >> 32) for performance, bypassing get_digest().
jiashuy approved these changes on Apr 9, 2026
rhdong added a commit to rhdong/HierarchicalKV that referenced this pull request on Apr 18, 2026
…ments Documents the exact methodology, commands, and expected results for the single vs dual-bucket throughput comparison reported in the HierarchicalKV SIGMOD paper (s5 Exp NVIDIA-Merlin#4 + L2-residency sensitivity footnote). Covers 3 configs:
- cap=1Mi dim=64 (PR NVIDIA-Merlin#246 default, L2-resident, dual wins)
- cap=1Mi dim=32 (L2-resident, dual wins insert)
- cap=128Mi dim=32 (paper scale, DRAM-bound, single wins)
Measured on H100 NVL (2026-04-17). Enables reviewer reproduction of the capacity-dependent crossover discussed in the paper.
Dual-Bucket Hashing for Memory-Optimized GPU Hash Table (TableMode::kMemory)
Algorithm Overview
This PR implements two-choice hashing (a d-left hashing variant) for TableMode::kMemory, enabling near-100% load factor on GPU hash tables without rehashing.
Bucket addressing. Each key is hashed with Murmur3-128 to produce a 128-bit hash. The two candidate buckets are derived as b1 = hash[0:63] mod N and b2 = hash[64:127] mod N, where N is the number of buckets. An 8-bit tag (digest = hash[56:63]) is stored per slot for fast negative filtering.
Three-phase upsert pipeline:
Two-pass lookup: Pass 1 scans b1 with digest filtering and exits early on a match. Pass 2 scans b2 only on a miss. A register-cached digest eliminates redundant Murmur3 recomputation between passes.
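The bucket addressing and two-pass lookup described above can be sketched on the host in plain C++. This is a minimal illustration under stated assumptions, not the PR's kernel code: the names `Hash128`, `derive_candidates`, `scan_bucket`, and `two_pass_find` are hypothetical, bit 0 is assumed to be the least significant bit of the low 64-bit word, and `mix64` is a splitmix64-style stand-in used only to keep the sketch self-contained (the PR uses Murmur3-128).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// 128-bit hash as two words; lo holds bits 0..63, hi holds bits 64..127
// (assumed LSB-first bit numbering).
struct Hash128 {
  uint64_t lo;
  uint64_t hi;
};

struct Candidates {
  uint64_t b1;     // hash[0:63] mod N
  uint64_t b2;     // hash[64:127] mod N
  uint8_t digest;  // hash[56:63], the top byte of the low word
};

// Stand-in mixer (splitmix64 finalizer). NOT Murmur3-128.
inline uint64_t mix64(uint64_t x) {
  x += 0x9E3779B97F4A7C15ULL;
  x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
  x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
  return x ^ (x >> 31);
}

inline Hash128 hash_key(uint64_t key) {
  return Hash128{mix64(key), mix64(~key)};
}

inline Candidates derive_candidates(const Hash128& h, uint64_t num_buckets) {
  assert(num_buckets > 0);
  return Candidates{h.lo % num_buckets, h.hi % num_buckets,
                    static_cast<uint8_t>(h.lo >> 56)};
}

struct Slot {
  bool used;
  uint8_t digest;
  uint64_t key;
  float value;
};
using Bucket = std::vector<Slot>;

// Compare the 8-bit digest first so most non-matching slots are rejected
// without a full key comparison.
inline const Slot* scan_bucket(const Bucket& bucket, uint8_t digest,
                               uint64_t key) {
  for (const Slot& s : bucket) {
    if (s.used && s.digest == digest && s.key == key) return &s;
  }
  return nullptr;
}

// Pass 1 scans b1 and returns on a match; pass 2 scans b2 only on a miss.
// The hash is computed once and its digest reused for both passes.
inline bool two_pass_find(const std::vector<Bucket>& table, uint64_t key,
                          float* out) {
  const Candidates c = derive_candidates(hash_key(key), table.size());
  const Slot* hit = scan_bucket(table[c.b1], c.digest, key);
  if (hit == nullptr && c.b2 != c.b1) {
    hit = scan_bucket(table[c.b2], c.digest, key);
  }
  if (hit == nullptr) return false;
  *out = hit->value;
  return true;
}
```

In the PR the scan is performed cooperatively by one warp per key over shared-memory digest arrays; the sequential loop here only mirrors the control flow.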
Kernel architecture: 128-thread blocks, 32 threads per key (one warp), 128-slot buckets, async global→shared memory pipeline (cp.async), and shared-memory-resident digest arrays for both bucket scans.
Benchmark Results
Platform: NVIDIA RTX A6000 (48 GB, Ampere sm_86), CUDA 12.9
Config: capacity = 1M, dim = 64, value_type = float, batch = 1M keys, EvictStrategy = kCustomized
(Throughput in Mops/s. Higher is better.)
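For orientation, the sizes implied by this configuration and the kernel shape from the Algorithm Overview can be worked out with a few lines of arithmetic. The sketch below is illustrative only: `BenchConfig`, `KernelShape`, and `summarize_sizes` are hypothetical names, "1M" is assumed to mean 2^20, and the block count assumes one warp handles exactly one key with no grid-stride loop.

```cpp
#include <cassert>
#include <cstdint>

struct BenchConfig {
  uint64_t capacity;     // total slots in the table
  uint64_t dim;          // elements per value vector
  uint64_t value_bytes;  // sizeof(value_type)
  uint64_t batch;        // keys per API call
};

struct KernelShape {
  uint64_t block_threads;    // threads per block
  uint64_t threads_per_key;  // one warp per key
  uint64_t bucket_slots;     // slots per bucket
};

struct SizeSummary {
  uint64_t num_buckets;
  uint64_t value_bytes_per_key;
  uint64_t value_storage_bytes;
  uint64_t keys_per_block;
  uint64_t blocks_per_batch;
};

inline SizeSummary summarize_sizes(const BenchConfig& c, const KernelShape& k) {
  assert(k.threads_per_key > 0 && k.bucket_slots > 0);
  assert(k.block_threads % k.threads_per_key == 0);
  SizeSummary s;
  s.num_buckets = c.capacity / k.bucket_slots;
  s.value_bytes_per_key = c.dim * c.value_bytes;
  s.value_storage_bytes = c.capacity * s.value_bytes_per_key;
  s.keys_per_block = k.block_threads / k.threads_per_key;
  // Round up so a partial final block is still counted.
  s.blocks_per_batch = (c.batch + s.keys_per_block - 1) / s.keys_per_block;
  return s;
}
```

Under these assumptions, capacity = 2^20 with 128-slot buckets gives 8192 buckets, each value vector occupies 256 bytes, a full table holds 256 MiB of values, and a 2^20-key batch maps to 262144 blocks of 4 keys each.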
Key observations:
Limitations (Phase I)
Init configuration constraints:
init_capacity, max_capacity, max_hbm_for_vectors, dim * sizeof(V), capacity / max_bucket_size
Unsupported APIs (planned for Phase II):
insert_and_evict(), find_or_insert(), accum_or_assign(), assign_scores(), assign_values(), contains(), erase(), reserve()
Supported APIs:
insert_or_assign(), find(), clear(), size(), export_batch(), export_batch_if()