[Opt] cache thread-local CUDA stream for lock acquisition kernels by rhdong · Pull Request #248 · NVIDIA-Merlin/HierarchicalKV

rhdong · 2026-02-24T21:45:17Z

Replace the per-call cudaStreamCreate/cudaStreamDestroy pair in lock_read(), lock_update(), and lock_update_read() with a thread_local cached stream. This eliminates the CUDA driver contention that arises when multiple threads acquire locks concurrently. That contention caused triple-group mode to underperform the R/W lock in mixed workloads (0.93×) despite allowing concurrent updaters.
CI tests passed on the local machine.
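
The caching pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `FakeStream`, `create_stream`, `destroy_stream`, `CachedStream`, `lock_stream`, `acquire_lock_once`, and `streams_created_by` are hypothetical names, and the stream calls are counting stand-ins so the sketch runs without a GPU. In real code the stand-ins would be `cudaStreamCreate` and `cudaStreamDestroy`. Whether the PR destroys the cached stream at thread exit is not stated in the description.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Stand-in for cudaStream_t so the sketch compiles without the CUDA toolkit.
using FakeStream = int;

// Count how often the expensive create/destroy paths run across all threads.
std::atomic<int> g_streams_created{0};
std::atomic<int> g_streams_destroyed{0};

// Stand-ins for cudaStreamCreate / cudaStreamDestroy (assumption).
FakeStream create_stream() { return ++g_streams_created; }
void destroy_stream(FakeStream) { ++g_streams_destroyed; }

// Owns one stream for the lifetime of the thread that constructed it.
struct CachedStream {
  FakeStream stream;
  CachedStream() : stream(create_stream()) {}
  ~CachedStream() { destroy_stream(stream); }
  CachedStream(const CachedStream&) = delete;
  CachedStream& operator=(const CachedStream&) = delete;
};

// Returns the calling thread's cached stream, creating it on first use only.
FakeStream lock_stream() {
  thread_local CachedStream cached;
  return cached.stream;
}

// Hypothetical lock path: instead of creating and destroying a stream on
// every call, it reuses the stream cached for the calling thread.
FakeStream acquire_lock_once() {
  FakeStream s = lock_stream();
  // ... launch the lock-acquisition kernel on `s` and synchronize ...
  return s;
}

// Runs `calls` lock acquisitions on each of `threads` threads and returns
// how many streams were created in total during the run.
int streams_created_by(int threads, int calls) {
  const int before = g_streams_created.load();
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([calls] {
      const FakeStream first = acquire_lock_once();
      for (int i = 1; i < calls; ++i) {
        const FakeStream s = acquire_lock_once();
        assert(s == first);  // every call on this thread sees the same stream
        (void)s;
      }
    });
  }
  for (auto& th : pool) th.join();
  return g_streams_created.load() - before;
}
```

With this shape, the number of stream creations scales with the number of threads rather than with the number of lock calls, which is the property the description relies on to remove driver contention.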

Triple-Group vs R/W Lock Concurrency Benchmark

Config: dim=16, capacity=128M, HBM=16GB, λ=0.75, batch=64K, 200 batches/thread, H100 NVL

With Stream Cache Optimization

Workload	Threads	TG (B-KV/s)	RW (B-KV/s)	Speedup
Read-heavy	8F/1U/1I	1.451	1.408	1.03×
Update-heavy	4F/5U/1I	1.675	0.523	3.21×
Insert-heavy	4F/2U/4I	0.551	0.459	1.20×
Update-only	1U	1.054	1.046	1.01×
Update-only	2U	1.613	1.125	1.43×
Update-only	5U	2.279	0.591	3.86×
Update-only	10U	2.569	0.535	4.80×

Before vs After Stream Cache

Workload	TG/RW speedup before	TG/RW speedup after	Change
Read-heavy	1.02×	1.03×	—
Update-heavy	0.93×	3.21×	fixed
Insert-heavy	0.98×	1.20×	+22%
Update-only 1U	1.01×	1.01×	—
Update-only 2U	1.33×	1.43×	+8%
Update-only 5U	2.36×	3.86×	+64%
Update-only 10U	2.14×	4.80×	+124%
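
The percentages in the Change column are consistent with the relative change between the two TG-over-RW speedup ratios, rounded to the nearest percent. A minimal sketch of that arithmetic, assuming this is how the column was derived (`percent_change` is a hypothetical helper, not part of the PR):

```cpp
#include <cassert>
#include <cmath>

// Relative change, in whole percent, between two TG/RW speedup ratios.
long percent_change(double before, double after) {
  assert(before > 0.0);  // a speedup ratio must be positive
  return std::lround((after / before - 1.0) * 100.0);
}
```

For the Update-only 5U row, `percent_change(2.36, 3.86)` gives 64, and for the 10U row `percent_change(2.14, 4.80)` gives 124, matching the table.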

github-actions · 2026-02-24T21:46:50Z

Documentation preview

https://nvidia-merlin.github.io/HierarchicalKV/review/pr-248

rhdong requested a review from jiashuy February 24, 2026 21:45

rhdong self-assigned this Feb 24, 2026

rhdong wants to merge 1 commit into NVIDIA-Merlin:master from rhdong:hrong/cache-stream-triple-lock