server: fix SWA prompt reuse boundary condition #21695
Draft
1oridevs wants to merge 1 commit into ggml-org:master from
Author
I've used Cursor AI to help me write this PR description.
Hi @1oridevs, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Description
9.4.26 1oridevs
While testing long prompts with SWA-enabled models, I noticed that the server sometimes falls back to full prompt reprocessing even when it looks like the cached state should still be reusable.
After digging into it, it seems to come down to how the boundary is checked:
The current check treats pos_min == pos_next - n_swa as invalid. So even when the cache is technically still valid, the server ends up not using it.
What this change does
Changes the reuse condition from:
to:
Updates the checkpoint lookup to allow:
This makes the boundary inclusive instead of exclusive, so valid reuse cases right at the edge are no longer thrown away.
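To make the inclusive versus exclusive distinction concrete, here is a minimal sketch of the two comparisons. It is an illustration only, not the server code: the helper names can_reuse_exclusive and can_reuse_inclusive are hypothetical, and the direction of the inequality (a smaller pos_min means the cache reaches further back) is an assumption based on the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration; these are not functions from the server.
//   pos_min  : smallest position still held in the cache for the sequence
//   pos_next : position at which prompt processing would resume
//   n_swa    : sliding-window size

// Exclusive boundary: the edge case pos_min == pos_next - n_swa is rejected.
constexpr bool can_reuse_exclusive(int32_t pos_min, int32_t pos_next, int32_t n_swa) {
    return pos_min < pos_next - n_swa;
}

// Inclusive boundary: the edge case pos_min == pos_next - n_swa is accepted.
constexpr bool can_reuse_inclusive(int32_t pos_min, int32_t pos_next, int32_t n_swa) {
    return pos_min <= pos_next - n_swa;
}

// Checks the three cases around the boundary for pos_next = 100, n_swa = 4,
// where the boundary value is 100 - 4 = 96.
inline void check_boundary_cases() {
    // Just inside the boundary: both variants allow reuse.
    assert(can_reuse_exclusive(95, 100, 4));
    assert(can_reuse_inclusive(95, 100, 4));

    // Exactly on the boundary: only the inclusive variant allows reuse.
    assert(!can_reuse_exclusive(96, 100, 4));
    assert(can_reuse_inclusive(96, 100, 4));

    // Past the boundary: neither variant allows reuse.
    assert(!can_reuse_exclusive(97, 100, 4));
    assert(!can_reuse_inclusive(97, 100, 4));
}
```

The two predicates agree everywhere except at pos_min == pos_next - n_swa, which is the single case this change is concerned with.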
Also removed a leftover debug comment while I was there.
Why it matters
Without this, some requests end up doing a full prompt reprocess even though they don’t need to, which can hurt performance, especially with longer contexts.
With this change, reuse behaves more consistently at the sliding window boundary.
Notes
Tested locally with the server build. Behavior looks correct, and unnecessary resets are avoided in cases where reuse should be possible.
Checklist
AI usage
I used AI tools to help reason about the issue and review the change, but the debugging, testing, and final code changes were done manually.