[None][fix] Combined disagg fixes: gen transfer timeout + perf metrics crash + KV cache block reuse hang #12889
Draft
yifjiang wants to merge 5 commits into NVIDIA:main from
Conversation
…ransceiver

CacheTransceiver::checkGenTransferStatus called future.get() without a timeout, causing an unbounded block when the KV transfer never completes. This leads to decode worker hangs in disaggregated serving.

The context (send) path in checkContextTransferStatus already uses future.wait_for() with kv_transfer_sender_future_timeout_ms. Apply the same pattern to the generation (receive) path: use wait_for() with a bounded timeout, log a warning on timeout, and skip to the next iteration instead of blocking forever.

Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
…and checkContextTransferStatus

Remove the !blockAll guard so the timeout applies in all code paths, including when the scheduler calls with atLeastRequestNum=nullopt. Previously, blockAll=true caused the timeout to be skipped, falling through to future.get(), which blocks indefinitely on stalled transfers. Also use value_or(1000) instead of value_or(0) as the default timeout when no config is set, ensuring there is always a bounded wait.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
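The bounded-wait pattern the two commits above describe can be sketched as follows. The actual fix is C++ (future.wait_for() in cacheTransceiver.cpp); this is a hypothetical Python illustration of the same idea using concurrent.futures, with made-up names (check_transfer_status, DEFAULT_TIMEOUT_MS): wait for each pending transfer with a timeout, and on timeout warn and move on rather than block the scheduler loop.

```python
# Illustrative sketch only -- the real fix is C++ future.wait_for() in
# cacheTransceiver.cpp. Names here (check_transfer_status, etc.) are
# hypothetical; the pattern is: bounded wait, warn on timeout, skip.
import concurrent.futures
import logging
import time

DEFAULT_TIMEOUT_MS = 1000  # mirrors the value_or(1000) default in the fix


def check_transfer_status(futures, timeout_ms=None):
    """Poll each pending transfer with a bounded wait; a transfer that
    times out is logged and retried later instead of blocking forever."""
    timeout_s = (timeout_ms or DEFAULT_TIMEOUT_MS) / 1000.0
    done, still_pending = [], []
    for req_id, fut in futures:
        try:
            fut.result(timeout=timeout_s)  # bounded, like future.wait_for()
            done.append(req_id)
        except concurrent.futures.TimeoutError:
            logging.warning("KV transfer for request %s timed out; will retry", req_id)
            still_pending.append((req_id, fut))
    return done, still_pending


if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor() as pool:
        fast = pool.submit(time.sleep, 0.01)  # completes well within the timeout
        slow = pool.submit(time.sleep, 1.0)   # stalls past the timeout
        done, pending = check_transfer_status([(1, fast), (2, slow)], timeout_ms=100)
        print(done, [r for r, _ in pending])  # → [1] [2]
```

The key property is that the scheduler thread spends at most timeout_ms per future, so a single stalled peer cannot wedge the whole loop.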
…ut unconditionally

Address review feedback:
- Guard the updateKVCacheTransferBW timing collective so it only runs when all ranks block together (blockAll) or the request was confirmed ready on every rank in the initial poll (freqIt->second == syncSize). This prevents hangs in allgather when a peer timed out and skipped the request.
- Keep the bounded timeout on both context and gen sides unconditionally (remove the !blockAll guard) with a value_or(1000) default, so the scheduler loop is never blocked indefinitely on a single future.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
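The guard described above reduces to a single predicate. A collective such as allgather blocks until every rank enters it, so it must only be entered when every rank is guaranteed to participate. This is a hypothetical sketch (the real code is C++ MPI; should_run_timing_collective and its parameters are made-up names standing in for the blockAll / freqIt->second == syncSize condition):

```python
# Hypothetical sketch of the guard; the real code is a C++ MPI path in
# cacheTransceiver.cpp. An allgather hangs unless every rank enters it,
# so run the timing collective only when all ranks will participate.
def should_run_timing_collective(block_all, ready_count, world_size):
    """True only if this is a blockAll call (everyone blocks together)
    or the initial poll confirmed the request ready on every rank."""
    return block_all or ready_count == world_size


# With 4 ranks: a peer that timed out and skipped the request means
# entering the collective here would deadlock, so skip it.
print(should_run_timing_collective(False, 3, 4))  # → False (a peer skipped)
print(should_run_timing_collective(False, 4, 4))  # → True (all ranks ready)
print(should_run_timing_collective(True, 0, 4))   # → True (blockAll path)
```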
… executor crash

Wrap the elapsed_time() calls in compute_batch_gpu_times() with try/except RuntimeError. If a CUDA event was not recorded on the current stream, elapsed_time() raises RuntimeError, which propagates up through the executor event loop and kills the executor thread. The main process and Dynamo runtime continue running (serving HTTP, responding to health probes), but with no executor thread, every inference request hangs forever. With this fix, a CUDA event timing failure logs 0.0 for that batch's metrics instead of crashing the executor.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
Cherry-pick of Tabrizian/TensorRT-LLM@b2983349.

Claim the block from the eviction policy free queue before pinning it in storeBlocks, so that the later unpinBlocksById / releaseBlock cycle does not create a duplicate queue entry, which causes a hang in disaggregated serving with block reuse after eviction.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
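The ordering bug above can be modeled with a toy free queue (data structures hypothetical; the real fix is in kvCacheManager.cpp's storeBlocks and eviction policy). If a reused block is pinned while still sitting in the free queue, the later release appends it again, leaving a duplicate entry; claiming it first keeps the queue consistent.

```python
# Toy model of the claim-before-pin fix. FreeQueue and its methods are
# illustrative names, not the actual kvCacheManager.cpp API.
from collections import deque


class FreeQueue:
    def __init__(self):
        self.q = deque()

    def claim(self, block_id):
        """Remove the block from the free queue before pinning it, so a
        later release() adds exactly one entry instead of a duplicate."""
        if block_id in self.q:
            self.q.remove(block_id)

    def release(self, block_id):
        self.q.append(block_id)


free = FreeQueue()
free.release(7)       # block 7 was evicted: it sits in the free queue

# Buggy ordering would pin block 7 while it is still queued, and the
# later release would append a second entry ([7, 7]). Fixed ordering:
free.claim(7)         # storeBlocks reuses block 7: take it off the queue
free.release(7)       # unpin/release puts it back exactly once
print(list(free.q))   # → [7], no duplicate entry
```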
Summary
This is a combined image-build branch, not intended for merge. It aggregates three independent fixes for disaggregated serving into a single branch for building test images. Each fix has (or will have) its own PR for review:
- Bounded gen-transfer timeout (cacheTransceiver.cpp)
- Guard elapsed_time in perf_metrics_manager.py to prevent executor crash
- KV cache block reuse fix (kvCacheManager.cpp)

Changed files (3)
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
- tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py (wrap elapsed_time() with try/except)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp

Commits (5)
- db0b59a8f — Add bounded timeout to gen-side checkGenTransferStatus
- 4921fbeca — Remove !blockAll guard, apply timeout unconditionally on both paths
- 37063ba56 — Guard updateKVCacheTransferBW allgather collective
- ccac43725 — Guard CUDA event elapsed_time in perf_metrics_manager
- 9e6cd36df — Claim block from eviction free queue before pinning (cherry-pick from Tabrizian)

Base
Same as PR #12476: branched from main at 3e1207164 ([None][fix] Make KVCacheManagerV2 release mem immediately on shutdown)

Test images built from this branch
See image build playbook for full details.
Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
🤖 Generated with Claude Code