
[None][fix] Combined disagg fixes: gen transfer timeout + perf metrics crash + KV cache block reuse hang #12889

Draft

yifjiang wants to merge 5 commits into NVIDIA:main from yifjiang:fix/cpp-gen-transfer-timeout-plus-kvcache

Conversation

Contributor

@yifjiang yifjiang commented Apr 9, 2026

Summary

This is a combined image-build branch, not intended for merge. It aggregates three independent fixes for disaggregated serving into a single branch for building test images. Each fix has (or will have) its own PR for review:

  1. PR #12476 — Bounded timeout for gen-side KV cache transfer (cacheTransceiver.cpp)
  2. PR #12868 — Guard CUDA event elapsed_time in perf_metrics_manager.py to prevent executor crash
  3. Cherry-pick of Tabrizian@b298334 — Fix disagg serving hang on block reuse after eviction (kvCacheManager.cpp)

Changed files (3)

| File | Fix | Source |
| --- | --- | --- |
| cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp | Bounded timeout + allgather guard | PR #12476 |
| tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py | Guard `elapsed_time()` with try/except | PR #12868 |
| cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp | Claim block from free queue before pinning | Tabrizian/TensorRT-LLM@b298334 |

Commits (5)

  1. db0b59a8f — Add bounded timeout to gen-side checkGenTransferStatus
  2. 4921fbeca — Remove !blockAll guard, apply timeout unconditionally on both paths
  3. 37063ba56 — Guard updateKVCacheTransferBW allgather collective
  4. ccac43725 — Guard CUDA event elapsed_time in perf_metrics_manager
  5. 9e6cd36df — Claim block from eviction free queue before pinning (cherry-pick from Tabrizian)

Base

Same as PR #12476: branched from main at 3e1207164 ([None][fix] Make KVCacheManagerV2 release mem immediately on shutdown)

Test images built from this branch

See image build playbook for full details.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>

🤖 Generated with Claude Code

yifjiang and others added 5 commits March 23, 2026 20:35
…ransceiver

CacheTransceiver::checkGenTransferStatus called future.get() without a
timeout, causing an unbounded block when the KV transfer never completes.
This leads to decode worker hangs in disaggregated serving.

The context (send) path in checkContextTransferStatus already uses
future.wait_for() with kv_transfer_sender_future_timeout_ms. Apply the
same pattern to the generation (receive) path: use wait_for() with a
bounded timeout, log a warning on timeout, and skip to the next
iteration instead of blocking forever.

Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
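The bounded-wait pattern described above can be sketched in Python, using `concurrent.futures` as a stand-in for `std::future`; the helper name and the printed warning are illustrative only, not the actual cacheTransceiver.cpp code:

```python
import concurrent.futures

def check_transfer_status(future, timeout_ms=1000):
    """Return True if the transfer completed within timeout_ms, else False.

    Analogue of replacing a blocking future.get() with
    future.wait_for(timeout): the scheduler loop gets a bounded wait
    and can retry on the next iteration instead of hanging forever.
    """
    try:
        future.result(timeout=timeout_ms / 1000.0)
        return True
    except concurrent.futures.TimeoutError:
        # Log a warning and skip to the next iteration rather than
        # blocking the decode worker on a stalled KV transfer.
        print("warning: KV transfer still pending; retrying next iteration")
        return False
```

A `value_or(1000)`-style default (as in the later commit) keeps the wait bounded even when no timeout is configured.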
…and checkContextTransferStatus

Remove the !blockAll guard so the timeout applies in all code paths,
including when the scheduler calls with atLeastRequestNum=nullopt.
Previously, blockAll=true caused the timeout to be skipped, falling
through to future.get() which blocks indefinitely on stalled transfers.

Also use value_or(1000) instead of value_or(0) as the default timeout
when no config is set, ensuring there is always a bounded wait.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…ut unconditionally

Address review feedback:
- Guard updateKVCacheTransferBW timing collective so it only runs when
  all ranks block together (blockAll) or the request was confirmed ready
  on every rank in the initial poll (freqIt->second == syncSize). This
  prevents hangs in allgather when a peer timed out and skipped the request.
- Keep bounded timeout on both context and gen sides unconditionally
  (remove !blockAll guard) with value_or(1000) default, so the scheduler
  loop is never blocked indefinitely on a single future.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
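The guard condition described above can be written as a small predicate. The inputs (blockAll, the per-request ready count from the initial poll, and syncSize) come from the commit message, but the function itself is an illustrative sketch, not the actual C++ code:

```python
def should_run_transfer_bw_collective(block_all, ready_rank_count, sync_size):
    """Decide whether it is safe to enter the allgather timing collective.

    Safe only when all ranks are guaranteed to participate: either every
    rank blocks together (block_all), or the initial poll confirmed the
    request ready on every rank (ready_rank_count == sync_size).
    Otherwise a peer that timed out and skipped the request would leave
    the allgather waiting forever.
    """
    return block_all or ready_rank_count == sync_size
```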
… executor crash

Wrap the elapsed_time() calls in compute_batch_gpu_times() with
try/except RuntimeError. If a CUDA event was not recorded on the
current stream, elapsed_time() raises RuntimeError, which propagates
up through the executor event loop and kills the executor thread.

The main process and Dynamo runtime continue running (serving HTTP,
responding to health probes), but with no executor thread, every
inference request hangs forever.

With this fix, a CUDA event timing failure logs 0.0 for that batch's
metrics instead of crashing the executor.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
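A minimal sketch of the guard, assuming the `torch.cuda.Event.elapsed_time()` behavior described above (raising RuntimeError when an event was not recorded); the helper name is illustrative, and the real fix lives inside `compute_batch_gpu_times()`:

```python
def safe_elapsed_time_ms(start_event, end_event):
    """Return elapsed time in ms, or 0.0 if the events are unusable.

    elapsed_time() raises RuntimeError when an event was not recorded
    on the current stream; swallowing it here keeps the executor event
    loop alive instead of letting the exception kill the thread.
    """
    try:
        return start_event.elapsed_time(end_event)
    except RuntimeError:
        # Report 0.0 for this batch's metrics instead of crashing.
        return 0.0
```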
Cherry-pick of Tabrizian/TensorRT-LLM@b2983349.

Claim block from eviction policy free queue before pinning in
storeBlocks so that the later unpinBlocksById / releaseBlock cycle
does not create a duplicate queue entry, which causes a hang in
disaggregated serving with block reuse after eviction.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
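A toy model of the invariant the cherry-pick restores: a block must be claimed out of the eviction free queue before it is pinned, otherwise the later unpin/release cycle enqueues a duplicate entry. All class and function names here are illustrative, not the kvCacheManager.cpp API:

```python
class FreeQueue:
    """Toy eviction-policy free queue keyed by block id."""
    def __init__(self):
        self.entries = []

    def release(self, block_id):
        # An unpinned block goes back on the free queue.
        self.entries.append(block_id)

    def claim(self, block_id):
        # The fix: remove the block from the queue before pinning it,
        # so a later release cannot create a duplicate entry.
        if block_id in self.entries:
            self.entries.remove(block_id)

def store_block(free_queue, block_id, pinned):
    """Sketch of the corrected storeBlocks ordering: claim, then pin."""
    free_queue.claim(block_id)
    pinned.add(block_id)
```

Without the `claim()` call, reusing an evicted block would leave its stale queue entry in place, and the subsequent release would add a second one.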
@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 9, 2026