[None][fix] Batch addSequence with pre-claim to fix host offloading MNT overflow #12878

Open
liji-nv wants to merge 2 commits into NVIDIA:feat/bench_y from liji-nv:fix/batch-addsequence-mnt-overflow

Conversation

@liji-nv
Collaborator

liji-nv commented Apr 9, 2026

Batch addSequence with pre-claim to fix host offloading MNT overflow

When host offloading is enabled, onboarding a host block to GPU during addSequence can trigger eviction of other reusable host blocks from the radix tree. This causes actual KV cache reuse to be less than the scheduler estimated, leading to max_num_tokens (MNT) overflow assertions.

Add a new addSequenceBatch API that processes all first-chunk context requests in two phases:

  • Phase 1: Walk the radix tree and claimBlock() for all matching blocks across all requests. No onboarding, no allocation. This protects reusable blocks from eviction.
  • Phase 2: Onboard host blocks and allocate non-matching blocks. Since all reusable blocks are already claimed, evictions during onboarding cannot touch them.
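A minimal Python sketch of the two-phase pre-claim idea described above (all class and function names here are illustrative stand-ins, not the actual TRT-LLM C++ API):

```python
# Hypothetical sketch: claim all reusable blocks first, then onboard/allocate.
from dataclasses import dataclass, field

@dataclass
class Block:
    tokens: tuple
    on_host: bool = False   # resident on host (needs onboarding to GPU)
    ref_count: int = 0      # a claimed block (ref_count > 0) cannot be evicted

    def claim(self):
        self.ref_count += 1

@dataclass
class RadixTree:
    blocks: dict = field(default_factory=dict)  # token prefix -> Block

    def match(self, tokens):
        """Return the blocks whose stored prefix matches the request's tokens."""
        return [b for prefix, b in self.blocks.items()
                if tokens[:len(prefix)] == prefix]

def add_sequence_batch(tree, requests, onboard, allocate):
    # Phase 1: claim every matching block for every request up front,
    # so evictions triggered later (by onboarding) cannot reclaim them.
    matches = {req_id: tree.match(tokens) for req_id, tokens in requests}
    for blocks in matches.values():
        for b in blocks:
            b.claim()
    # Phase 2: onboard host blocks and allocate the rest; any eviction
    # here can only touch unclaimed (non-reusable) blocks.
    for req_id, tokens in requests:
        for b in matches[req_id]:
            if b.on_host:
                onboard(b)
                b.on_host = False
        allocate(req_id, tokens, matches[req_id])
```

In the single-phase version, onboarding for an early request can evict blocks that a later request in the same batch was counting on; claiming everything before onboarding anything removes that ordering hazard.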

On the Python side, replace the TOCTOU-prone revalidation loop (count_reusable_blocks + budget check) with a single batch call.
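The TOCTOU hazard can be illustrated with a toy model (`Manager` and its methods are hypothetical stand-ins; the real interplay lives in the C++ KV cache manager):

```python
# Toy model of the old check-then-act loop vs. the single batch call.
class Manager:
    def __init__(self, reusable):
        self.reusable = reusable   # blocks currently reusable
        self.added = []

    def count_reusable_blocks(self, req):
        return self.reusable

    def add_sequence(self, req):
        # Onboarding inside add_sequence may evict other reusable blocks,
        # silently invalidating earlier count_reusable_blocks() answers.
        self.reusable = max(0, self.reusable - 1)
        self.added.append(req)

    def add_sequence_batch(self, reqs):
        # Pre-claims all matches first, so reuse stays stable for the batch.
        self.added.extend(reqs)

# Old pattern: estimate, budget-check, then add — the estimate can go
# stale between the check and the add (the TOCTOU window).
def schedule_old(mgr, reqs):
    for r in reqs:
        if mgr.count_reusable_blocks(r) > 0:   # check
            mgr.add_sequence(r)                # use (may shrink reuse)

# New pattern: one batch call with no window between check and use.
def schedule_new(mgr, reqs):
    mgr.add_sequence_batch(reqs)
```

In the old pattern, the third request is rejected because the first two consumed the reuse the scheduler had already counted; the batch call sees a consistent snapshot.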


PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains the what and the why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions).

  • Any new dependencies have been scanned for licenses and vulnerabilities.

  • CODEOWNERS is updated if ownership changes.

  • Documentation is updated as needed.

  • The Tava architecture diagram is updated if the PR contains a significant design change.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this item after reviewing the above as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@liji-nv liji-nv requested a review from a team as a code owner April 9, 2026 05:30
@liji-nv liji-nv force-pushed the fix/batch-addsequence-mnt-overflow branch 3 times, most recently from ba22dfe to cecae98 on April 9, 2026 06:43
Batch addSequence with pre-claim to fix host offloading MNT overflow

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
@liji-nv liji-nv force-pushed the fix/batch-addsequence-mnt-overflow branch from cecae98 to 9dc7da9 on April 10, 2026 06:11
…chContent (NVIDIA#12550)

Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
@liji-nv liji-nv requested a review from a team as a code owner April 10, 2026 07:00
