
[Feat] Add the support of Dual buckets - 1st phase #246

Merged
jiashuy merged 2 commits into NVIDIA-Merlin:master from rhdong:hrong/dual-buckets on Apr 9, 2026

Conversation

@rhdong
Member

@rhdong rhdong commented Feb 20, 2026

Dual-Bucket Hashing for Memory-Optimized GPU Hash Table (TableMode::kMemory)

Algorithm Overview

This PR implements two-choice hashing (a d-left hashing variant) for TableMode::kMemory, enabling near-100% load factor on GPU hash tables without rehashing.

Bucket addressing. Each key is hashed with Murmur3-128 to produce a 128-bit digest. The two candidate buckets are derived as: b1 = hash[0:63] mod N, b2 = hash[64:127] mod N, where N is the number of buckets. An 8-bit tag (digest = hash[56:63]) is stored per slot for fast negative filtering.
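The addressing scheme above can be sketched on the host side as follows. This is an illustrative model only: `lo` and `hi` stand for the low and high 64-bit halves of the Murmur3-128 digest, and all names are invented, not the PR's API.

```cpp
#include <cassert>
#include <cstdint>

// Candidate buckets and slot tag for one key, per the dual-bucket scheme:
// b1 = hash[0:63] mod N, b2 = hash[64:127] mod N, digest = hash[56:63].
struct BucketAddr {
  uint64_t b1;     // first candidate bucket
  uint64_t b2;     // second candidate bucket
  uint8_t digest;  // 8-bit tag for fast negative filtering
};

BucketAddr address_key(uint64_t lo, uint64_t hi, uint64_t num_buckets) {
  BucketAddr a;
  a.b1 = lo % num_buckets;
  a.b2 = hi % num_buckets;
  a.digest = static_cast<uint8_t>(lo >> 56);  // bits [56:63] of the low half
  return a;
}
```

With a power-of-two bucket count the modulo reduces to a mask, which is presumably what the kernel does in practice.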

Three-phase upsert pipeline:

  • Phase 0 (Duplicate scan): All 32 threads in a warp cooperatively scan both b1 and b2 for an existing copy of the key (digest match → full key compare). If found, the value and score are updated in-place.
  • Phase 1 (Two-choice insert): If the key is new, threads probe for an empty slot. The bucket with more free slots is preferred (power-of-two-choices), reducing worst-case clustering.
  • Phase 2 (Score-based eviction): When both buckets are full, a CAS-based eviction loop replaces the lowest-scoring entry across both buckets, followed by a re-verification pass to prevent duplicate insertion under concurrency.
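The three phases can be modeled in simplified, single-threaded form. The real kernel is warp-cooperative with CAS-based eviction and digest prefiltering; every identifier below is invented for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kSlots = 4;

struct Slot {
  uint64_t key = 0;
  uint64_t score = 0;
  bool used = false;
};
using Bucket = std::array<Slot, kSlots>;

static int free_slots(const Bucket& b) {
  int n = 0;
  for (const auto& s : b) n += !s.used;
  return n;
}

void upsert(Bucket& b1, Bucket& b2, uint64_t key, uint64_t score) {
  // Phase 0: duplicate scan -- update in place if the key already exists.
  for (Bucket* b : {&b1, &b2})
    for (auto& s : *b)
      if (s.used && s.key == key) { s.score = score; return; }

  // Phase 1: two-choice insert -- prefer the bucket with more free slots.
  Bucket& target = free_slots(b1) >= free_slots(b2) ? b1 : b2;
  for (auto& s : target)
    if (!s.used) { s = {key, score, true}; return; }

  // Phase 2: both buckets full -- evict the lowest-scoring entry overall.
  Slot* victim = &b1[0];
  for (Bucket* b : {&b1, &b2})
    for (auto& s : *b)
      if (s.score < victim->score) victim = &s;
  *victim = {key, score, true};
}
```

The concurrent version additionally re-verifies after eviction to catch a racing insert of the same key, a step this sequential sketch does not need.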

Two-pass lookup: Pass 1 scans b1 with digest filtering and early-exits on match. Pass 2 scans b2 only on miss. A register-cached digest eliminates redundant Murmur3 recomputation between passes.
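A minimal single-threaded model of the two-pass lookup with digest prefiltering (names are illustrative, not the PR's API):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kSlots = 4;

struct Slot {
  uint64_t key = 0;
  uint8_t digest = 0;
  bool used = false;
};
using Bucket = std::array<Slot, kSlots>;

// Scan one bucket: compare the cheap 8-bit digest first, then the full key.
static bool scan(const Bucket& b, uint64_t key, uint8_t digest) {
  for (const auto& s : b)
    if (s.used && s.digest == digest && s.key == key) return true;  // early exit
  return false;
}

// Pass 1 probes b1; pass 2 probes b2 only on a miss. The digest is computed
// once and reused, mirroring the register-cached digest described above.
bool lookup(const Bucket& b1, const Bucket& b2, uint64_t key, uint8_t digest) {
  if (scan(b1, key, digest)) return true;
  return scan(b2, key, digest);
}
```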

Kernel architecture: 128-thread blocks, 32 threads per key (one warp), 128-slot buckets, async global→shared memory pipeline (cp.async), and shared-memory-resident digest arrays for both bucket scans.

Benchmark Results

Platform: NVIDIA RTX A6000 (48 GB, Ampere sm_86), CUDA 12.9
Config: capacity = 1M, dim = 64, value_type = float, batch = 1M keys, EvictStrategy = kCustomized

| Load Factor | THROUGHPUT_MODE Insert | THROUGHPUT_MODE Find | MEMORY_MODE Insert | MEMORY_MODE Find |
| --- | --- | --- | --- | --- |
| 0.25 | 32.2 | 95.3 | 85.8 | 102.1 |
| 0.50 | 89.9 | 98.3 | 90.8 | 103.7 |
| 0.75 | 88.4 | 98.3 | 88.9 | 104.1 |
| 0.90 | 80.6 | 98.9 | 90.7 | 104.4 |
| 0.95 | 60.2 | 93.1 | 86.5 | 104.5 |
| 1.00 | 50.3 | 95.0 | 82.6 | 104.8 |

(Throughput in Mops/s. Higher is better.)

| Metric | THROUGHPUT_MODE | MEMORY_MODE |
| --- | --- | --- |
| Top-K score retention (fraction of ideal highest-scored keys present) | 96.49% | 99.73% |
| Insert at LF=1.0 | 50.3 Mops/s | 82.6 Mops/s (1.6×) |
| Find at LF=1.0 | 95.0 Mops/s | 104.8 Mops/s (1.1×) |

Key observations:

  • MEMORY_MODE maintains stable insert throughput (82–91 Mops/s) across all load factors, while THROUGHPUT_MODE degrades to 50 Mops/s at LF=1.0
  • MEMORY_MODE achieves 99.73% top-K score retention (1,045,795 / 1,048,576 of the ideal highest-scored keys are present) vs 96.49% for single-bucket, meaning the eviction mechanism almost perfectly preserves the most valuable entries
  • Find throughput is consistently ~7% higher in MEMORY_MODE due to the two-pass early-exit optimization

Limitations (Phase I)

Init configuration constraints:

| Parameter | Constraint | Reason |
| --- | --- | --- |
| `init_capacity` | Must equal `max_capacity` | No auto-rehash in dual-bucket mode |
| `max_hbm_for_vectors` | Must be 0 (pure HBM only) | Hybrid HBM+HMEM not yet supported |
| `dim * sizeof(V)` | ≤ 896 bytes (dim ≤ 224 for float) | Fixed shared-memory buffer size in lookup kernel |
| `capacity / max_bucket_size` | ≥ 2 | Two-choice addressing requires at least 2 buckets |
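The constraints above might be validated at init time roughly as follows. Field and function names are invented for this sketch; they mirror the documented constraints, not the library's actual options struct.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the Phase-I init constraints.
struct Options {
  std::size_t init_capacity;
  std::size_t max_capacity;
  std::size_t max_hbm_for_vectors;
  std::size_t dim;
  std::size_t value_size;      // sizeof(V)
  std::size_t max_bucket_size;
};

bool valid_dual_bucket_options(const Options& o) {
  return o.init_capacity == o.max_capacity          // no auto-rehash
      && o.max_hbm_for_vectors == 0                 // hybrid HBM+HMEM unsupported
      && o.dim * o.value_size <= 896                // lookup kernel's shared buffer
      && o.max_capacity / o.max_bucket_size >= 2;   // at least two buckets
}
```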

Unsupported APIs (planned for Phase II):

| API | Status |
| --- | --- |
| `insert_and_evict()` | Not supported: requires return-evicted-pair semantics |
| `find_or_insert()` | Not supported: requires atomic find-then-insert |
| `accum_or_assign()` | Not supported: requires read-modify-write |
| `assign_scores()` | Not supported |
| `assign_values()` | Not supported |
| `contains()` | Not supported |
| `erase()` | Not supported |
| `reserve()` | Not supported: no dynamic capacity change |
Supported APIs: insert_or_assign(), find(), clear(), size(), export_batch(), export_batch_if()


@rhdong rhdong self-assigned this Feb 20, 2026
…tils

Document the digest mechanism for single-bucket and dual-bucket modes:
- Single-bucket digests use bits [32:39]; dual-bucket uses bits [56:63] to
  avoid collision with the b2 bucket address derived from the high 32 bits.
- Pipeline kernels in lookup.cuh and contains.cuh compute target digests
  inline (hashed_key >> 32) for performance, bypassing get_digest().
@jiashuy jiashuy merged commit 6883e43 into NVIDIA-Merlin:master Apr 9, 2026
1 check passed
rhdong added a commit to rhdong/HierarchicalKV that referenced this pull request Apr 18, 2026
…ments

Documents the exact methodology, commands, and expected results for the
single vs dual-bucket throughput comparison reported in the HierarchicalKV
SIGMOD paper (§5 Exp #4 + L2-residency sensitivity footnote).

Covers 3 configs:
- cap=1Mi dim=64 (PR #246 default, L2-resident, dual wins)
- cap=1Mi dim=32 (L2-resident, dual wins insert)
- cap=128Mi dim=32 (paper scale, DRAM-bound, single wins)

Measured on H100 NVL (2026-04-17). Enables reviewer reproduction of the
capacity-dependent crossover discussed in the paper.
