
[Feat] Add the support of Dual buckets - 1st phase #246

Merged
jiashuy merged 2 commits into NVIDIA-Merlin:master from rhdong:hrong/dual-buckets on Apr 9, 2026

Conversation

@rhdong
Member

@rhdong rhdong commented Feb 20, 2026

Dual-Bucket Hashing for Memory-Optimized GPU Hash Table (TableMode::kMemory)

Algorithm Overview

This PR implements two-choice hashing (a d-left hashing variant) for TableMode::kMemory, enabling near-100% load factor on GPU hash tables without rehashing.

Bucket addressing. Each key is hashed with Murmur3-128 to produce a 128-bit digest. The two candidate buckets are derived as: b1 = hash[0:63] mod N, b2 = hash[64:127] mod N, where N is the number of buckets. An 8-bit tag (digest = hash[56:63]) is stored per slot for fast negative filtering.
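The addressing scheme above can be sketched on the host side as follows. This is an illustrative model only: `lo` and `hi` stand for the low and high 64-bit halves of the Murmur3-128 digest, and all names are invented, not the PR's API.

```cpp
#include <cassert>
#include <cstdint>

// Candidate buckets and slot tag for one key, per the dual-bucket scheme:
// b1 = hash[0:63] mod N, b2 = hash[64:127] mod N, digest = hash[56:63].
struct BucketAddr {
  uint64_t b1;     // first candidate bucket
  uint64_t b2;     // second candidate bucket
  uint8_t digest;  // 8-bit tag for fast negative filtering
};

BucketAddr address_key(uint64_t lo, uint64_t hi, uint64_t num_buckets) {
  BucketAddr a;
  a.b1 = lo % num_buckets;
  a.b2 = hi % num_buckets;
  a.digest = static_cast<uint8_t>(lo >> 56);  // bits [56:63] of the low half
  return a;
}
```

With a power-of-two bucket count the modulo reduces to a mask, which is presumably what the kernel does in practice.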

Three-phase upsert pipeline:

  • Phase 0 (Duplicate scan): All 32 threads in a warp cooperatively scan both b1 and b2 for an existing copy of the key (digest match → full key compare). If found, the value and score are updated in-place.
  • Phase 1 (Two-choice insert): If the key is new, threads probe for an empty slot. The bucket with more free slots is preferred (power-of-two-choices), reducing worst-case clustering.
  • Phase 2 (Score-based eviction): When both buckets are full, a CAS-based eviction loop replaces the lowest-scoring entry across both buckets, followed by a re-verification pass to prevent duplicate insertion under concurrency.
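The three phases can be modeled in simplified, single-threaded form. The real kernel is warp-cooperative with CAS-based eviction and digest prefiltering; every identifier below is invented for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kSlots = 4;

struct Slot {
  uint64_t key = 0;
  uint64_t score = 0;
  bool used = false;
};
using Bucket = std::array<Slot, kSlots>;

static int free_slots(const Bucket& b) {
  int n = 0;
  for (const auto& s : b) n += !s.used;
  return n;
}

void upsert(Bucket& b1, Bucket& b2, uint64_t key, uint64_t score) {
  // Phase 0: duplicate scan -- update in place if the key already exists.
  for (Bucket* b : {&b1, &b2})
    for (auto& s : *b)
      if (s.used && s.key == key) { s.score = score; return; }

  // Phase 1: two-choice insert -- prefer the bucket with more free slots.
  Bucket& target = free_slots(b1) >= free_slots(b2) ? b1 : b2;
  for (auto& s : target)
    if (!s.used) { s = {key, score, true}; return; }

  // Phase 2: both buckets full -- evict the lowest-scoring entry overall.
  Slot* victim = &b1[0];
  for (Bucket* b : {&b1, &b2})
    for (auto& s : *b)
      if (s.score < victim->score) victim = &s;
  *victim = {key, score, true};
}
```

The concurrent version additionally re-verifies after eviction to catch a racing insert of the same key, a step this sequential sketch does not need.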

Two-pass lookup: Pass 1 scans b1 with digest filtering and early-exits on match. Pass 2 scans b2 only on miss. A register-cached digest eliminates redundant Murmur3 recomputation between passes.
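A minimal single-threaded model of the two-pass lookup with digest prefiltering (names are illustrative, not the PR's API):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kSlots = 4;

struct Slot {
  uint64_t key = 0;
  uint8_t digest = 0;
  bool used = false;
};
using Bucket = std::array<Slot, kSlots>;

// Scan one bucket: compare the cheap 8-bit digest first, then the full key.
static bool scan(const Bucket& b, uint64_t key, uint8_t digest) {
  for (const auto& s : b)
    if (s.used && s.digest == digest && s.key == key) return true;  // early exit
  return false;
}

// Pass 1 probes b1; pass 2 probes b2 only on a miss. The digest is computed
// once and reused, mirroring the register-cached digest described above.
bool lookup(const Bucket& b1, const Bucket& b2, uint64_t key, uint8_t digest) {
  if (scan(b1, key, digest)) return true;
  return scan(b2, key, digest);
}
```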

Kernel architecture: 128-thread blocks, 32 threads per key (one warp), 128-slot buckets, async global→shared memory pipeline (cp.async), and shared-memory-resident digest arrays for both bucket scans.

Benchmark Results

Platform: NVIDIA RTX A6000 (48 GB, Ampere sm_86), CUDA 12.9
Config: capacity = 1M, dim = 64, value_type = float, batch = 1M keys, EvictStrategy = kCustomized

| Load Factor | THROUGHPUT_MODE Insert | THROUGHPUT_MODE Find | MEMORY_MODE Insert | MEMORY_MODE Find |
| --- | --- | --- | --- | --- |
| 0.25 | 32.2 | 95.3 | 85.8 | 102.1 |
| 0.50 | 89.9 | 98.3 | 90.8 | 103.7 |
| 0.75 | 88.4 | 98.3 | 88.9 | 104.1 |
| 0.90 | 80.6 | 98.9 | 90.7 | 104.4 |
| 0.95 | 60.2 | 93.1 | 86.5 | 104.5 |
| 1.00 | 50.3 | 95.0 | 82.6 | 104.8 |

(Throughput in Mops/s. Higher is better.)

| Metric | THROUGHPUT_MODE | MEMORY_MODE |
| --- | --- | --- |
| Top-K score retention (fraction of ideal highest-scored keys present) | 96.49% | 99.73% |
| Insert at LF=1.0 | 50.3 Mops/s | 82.6 Mops/s (1.6×) |
| Find at LF=1.0 | 95.0 Mops/s | 104.8 Mops/s (1.1×) |

Key observations:

  • MEMORY_MODE maintains stable insert throughput (82–91 Mops/s) across all load factors, while THROUGHPUT_MODE degrades to 50 Mops/s at LF=1.0
  • MEMORY_MODE achieves 99.73% top-K score retention (1,045,795 / 1,048,576 of the ideal highest-scored keys are present) vs 96.49% for single-bucket, meaning the eviction mechanism almost perfectly preserves the most valuable entries
  • Find throughput is consistently ~7% higher in MEMORY_MODE due to the two-pass early-exit optimization

Limitations (Phase I)

Init configuration constraints:

| Parameter | Constraint | Reason |
| --- | --- | --- |
| `init_capacity` | Must equal `max_capacity` | No auto-rehash in dual-bucket mode |
| `max_hbm_for_vectors` | Must be 0 (pure HBM only) | Hybrid HBM+HMEM not yet supported |
| `dim * sizeof(V)` | ≤ 896 bytes (dim ≤ 224 for float) | Fixed shared-memory buffer size in lookup kernel |
| `capacity / max_bucket_size` | ≥ 2 | Two-choice addressing requires at least 2 buckets |
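The constraints above might be validated at init time roughly as follows. Field and function names are invented for this sketch; they mirror the documented constraints, not the library's actual options struct.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the Phase-I init constraints.
struct Options {
  std::size_t init_capacity;
  std::size_t max_capacity;
  std::size_t max_hbm_for_vectors;
  std::size_t dim;
  std::size_t value_size;      // sizeof(V)
  std::size_t max_bucket_size;
};

bool valid_dual_bucket_options(const Options& o) {
  return o.init_capacity == o.max_capacity          // no auto-rehash
      && o.max_hbm_for_vectors == 0                 // hybrid HBM+HMEM unsupported
      && o.dim * o.value_size <= 896                // lookup kernel's shared buffer
      && o.max_capacity / o.max_bucket_size >= 2;   // at least two buckets
}
```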

Unsupported APIs (planned for Phase II):

| API | Status |
| --- | --- |
| `insert_and_evict()` | Not supported: requires return-evicted-pair semantics |
| `find_or_insert()` | Not supported: requires atomic find-then-insert |
| `accum_or_assign()` | Not supported: requires read-modify-write |
| `assign_scores()` | Not supported |
| `assign_values()` | Not supported |
| `contains()` | Not supported |
| `erase()` | Not supported |
| `reserve()` | Not supported: no dynamic capacity change |
Supported APIs: insert_or_assign(), find(), clear(), size(), export_batch(), export_batch_if()


@rhdong rhdong self-assigned this Feb 20, 2026
…tils

Document the digest mechanism for single-bucket and dual-bucket modes:
- Single-bucket digests use bits [32:39]; dual-bucket uses bits [56:63] to
  avoid collision with the b2 bucket address derived from the high 32 bits.
- Pipeline kernels in lookup.cuh and contains.cuh compute target digests
  inline (hashed_key >> 32) for performance, bypassing get_digest().
@jiashuy jiashuy merged commit 6883e43 into NVIDIA-Merlin:master Apr 9, 2026
1 check passed
rhdong added a commit to rhdong/HierarchicalKV that referenced this pull request Apr 18, 2026
…ments

Documents the exact methodology, commands, and expected results for the
single vs dual-bucket throughput comparison reported in the HierarchicalKV
SIGMOD paper (§5 Exp #4 + L2-residency sensitivity footnote).

Covers 3 configs:
- cap=1Mi dim=64 (PR #246 default, L2-resident, dual wins)
- cap=1Mi dim=32 (L2-resident, dual wins insert)
- cap=128Mi dim=32 (paper scale, DRAM-bound, single wins)

Measured on H100 NVL (2026-04-17). Enables reviewer reproduction of the
capacity-dependent crossover discussed in the paper.
