
[None][fix] Combined disagg fixes: gen transfer timeout + perf metrics crash + KV cache block reuse hang #12889

Draft

yifjiang wants to merge 5 commits into NVIDIA:main from yifjiang:fix/cpp-gen-transfer-timeout-plus-kvcache

Conversation

Contributor

@yifjiang yifjiang commented Apr 9, 2026

Summary

This is a combined image-build branch, not intended for merge. It aggregates three independent fixes for disaggregated serving into a single branch for building test images. Each fix has (or will have) its own PR for review:

  1. PR #12476 — Bounded timeout for gen-side KV cache transfer (cacheTransceiver.cpp)
  2. PR #12868 — Guard CUDA event elapsed_time in perf_metrics_manager.py to prevent executor crash
  3. Cherry-pick of Tabrizian@b298334 — Fix disagg serving hang on block reuse after eviction (kvCacheManager.cpp)

Changed files (3)

| File | Fix | Source |
| --- | --- | --- |
| cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp | Bounded timeout + allgather guard | PR #12476 |
| tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py | Guard `elapsed_time()` with try/except | PR #12868 |
| cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp | Claim block from free queue before pinning | Tabrizian/TensorRT-LLM@b298334 |

Commits (5)

  1. db0b59a8f — Add bounded timeout to gen-side checkGenTransferStatus
  2. 4921fbeca — Remove !blockAll guard, apply timeout unconditionally on both paths
  3. 37063ba56 — Guard updateKVCacheTransferBW allgather collective
  4. ccac43725 — Guard CUDA event elapsed_time in perf_metrics_manager
  5. 9e6cd36df — Claim block from eviction free queue before pinning (cherry-pick from Tabrizian)

Base

Same as PR #12476: branched from main at 3e1207164 ([None][fix] Make KVCacheManagerV2 release mem immediately on shutdown)

Test images built from this branch

See image build playbook for full details.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>

🤖 Generated with Claude Code

yifjiang and others added 5 commits March 23, 2026 20:35
…ransceiver

CacheTransceiver::checkGenTransferStatus called future.get() without a
timeout, causing an unbounded block when the KV transfer never completes.
This leads to decode worker hangs in disaggregated serving.

The context (send) path in checkContextTransferStatus already uses
future.wait_for() with kv_transfer_sender_future_timeout_ms. Apply the
same pattern to the generation (receive) path: use wait_for() with a
bounded timeout, log a warning on timeout, and skip to the next
iteration instead of blocking forever.

Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
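The bounded-wait pattern described above can be sketched in Python, using `concurrent.futures` as a stand-in for `std::future`; the helper name and the printed warning are illustrative only, not the actual cacheTransceiver.cpp code:

```python
import concurrent.futures

def check_transfer_status(future, timeout_ms=1000):
    """Return True if the transfer completed within timeout_ms, else False.

    Analogue of replacing a blocking future.get() with
    future.wait_for(timeout): the scheduler loop gets a bounded wait
    and can retry on the next iteration instead of hanging forever.
    """
    try:
        future.result(timeout=timeout_ms / 1000.0)
        return True
    except concurrent.futures.TimeoutError:
        # Log a warning and skip to the next iteration rather than
        # blocking the decode worker on a stalled KV transfer.
        print("warning: KV transfer still pending; retrying next iteration")
        return False
```

A `value_or(1000)`-style default (as in the later commit) keeps the wait bounded even when no timeout is configured.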
…and checkContextTransferStatus

Remove the !blockAll guard so the timeout applies in all code paths,
including when the scheduler calls with atLeastRequestNum=nullopt.
Previously, blockAll=true caused the timeout to be skipped, falling
through to future.get() which blocks indefinitely on stalled transfers.

Also use value_or(1000) instead of value_or(0) as the default timeout
when no config is set, ensuring there is always a bounded wait.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…ut unconditionally

Address review feedback:
- Guard updateKVCacheTransferBW timing collective so it only runs when
  all ranks block together (blockAll) or the request was confirmed ready
  on every rank in the initial poll (freqIt->second == syncSize). This
  prevents hangs in allgather when a peer timed out and skipped the request.
- Keep bounded timeout on both context and gen sides unconditionally
  (remove !blockAll guard) with value_or(1000) default, so the scheduler
  loop is never blocked indefinitely on a single future.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
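The guard condition described above can be written as a small predicate. The inputs (blockAll, the per-request ready count from the initial poll, and syncSize) come from the commit message, but the function itself is an illustrative sketch, not the actual C++ code:

```python
def should_run_transfer_bw_collective(block_all, ready_rank_count, sync_size):
    """Decide whether it is safe to enter the allgather timing collective.

    Safe only when all ranks are guaranteed to participate: either every
    rank blocks together (block_all), or the initial poll confirmed the
    request ready on every rank (ready_rank_count == sync_size).
    Otherwise a peer that timed out and skipped the request would leave
    the allgather waiting forever.
    """
    return block_all or ready_rank_count == sync_size
```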
… executor crash

Wrap the elapsed_time() calls in compute_batch_gpu_times() with
try/except RuntimeError. If a CUDA event was not recorded on the
current stream, elapsed_time() raises RuntimeError, which propagates
up through the executor event loop and kills the executor thread.

The main process and Dynamo runtime continue running (serving HTTP,
responding to health probes), but with no executor thread, every
inference request hangs forever.

With this fix, a CUDA event timing failure logs 0.0 for that batch's
metrics instead of crashing the executor.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
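A minimal sketch of the guard, assuming the `torch.cuda.Event.elapsed_time()` behavior described above (raising RuntimeError when an event was not recorded); the helper name is illustrative, and the real fix lives inside `compute_batch_gpu_times()`:

```python
def safe_elapsed_time_ms(start_event, end_event):
    """Return elapsed time in ms, or 0.0 if the events are unusable.

    elapsed_time() raises RuntimeError when an event was not recorded
    on the current stream; swallowing it here keeps the executor event
    loop alive instead of letting the exception kill the thread.
    """
    try:
        return start_event.elapsed_time(end_event)
    except RuntimeError:
        # Report 0.0 for this batch's metrics instead of crashing.
        return 0.0
```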
Cherry-pick of Tabrizian/TensorRT-LLM@b2983349.

Claim block from eviction policy free queue before pinning in
storeBlocks so that the later unpinBlocksById / releaseBlock cycle
does not create a duplicate queue entry, which causes a hang in
disaggregated serving with block reuse after eviction.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
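A toy model of the invariant the cherry-pick restores: a block must be claimed out of the eviction free queue before it is pinned, otherwise the later unpin/release cycle enqueues a duplicate entry. All class and function names here are illustrative, not the kvCacheManager.cpp API:

```python
class FreeQueue:
    """Toy eviction-policy free queue keyed by block id."""
    def __init__(self):
        self.entries = []

    def release(self, block_id):
        # An unpinned block goes back on the free queue.
        self.entries.append(block_id)

    def claim(self, block_id):
        # The fix: remove the block from the queue before pinning it,
        # so a later release cannot create a duplicate entry.
        if block_id in self.entries:
            self.entries.remove(block_id)

def store_block(free_queue, block_id, pinned):
    """Sketch of the corrected storeBlocks ordering: claim, then pin."""
    free_queue.claim(block_id)
    pinned.add(block_id)
```

Without the `claim()` call, reusing an evicted block would leave its stale queue entry in place, and the subsequent release would add a second one.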
@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 9, 2026