
server: fix SWA prompt reuse boundary condition#21695

Draft
1oridevs wants to merge 1 commit into ggml-org:master from 1oridevs:fix/swa-partial-reuse-boundary

Conversation


@1oridevs 1oridevs commented Apr 9, 2026

Description


While testing long prompts with SWA-enabled models, I noticed that the server sometimes falls back to full prompt reprocessing even when it looks like the cached state should still be reusable.

After digging into it, it seems to come down to how the boundary is checked:

  • The current condition treats pos_min == pos_next - n_swa as invalid
  • Checkpoints at that same boundary can also get skipped

So even when the cache is technically still valid, the server ends up not using it.

What this change does

  • Changes the reuse condition from:

    pos_min >= pos_min_thold

    to:

    pos_min > pos_min_thold
  • Updates the checkpoint lookup to allow:

    cur.pos_min <= pos_min_thold

This makes the boundary inclusive instead of exclusive, so valid reuse cases right at the edge are no longer thrown away.

Also removed a leftover debug comment while I was there.

Why it matters

Without this, some requests end up doing a full prompt reprocess even though they don’t need to, which can hurt performance — especially with longer contexts.

With this change, reuse behaves more consistently at the sliding window boundary.


Notes

Tested locally with the server build — behavior looks correct and avoids unnecessary resets in cases where reuse should be possible.


Checklist

  • Built and tested locally
  • Change is minimal and scoped
  • No behavior change outside SWA boundary handling

AI usage

I used AI tools to help reason about the issue and review the change, but the debugging, testing, and final code changes were done manually.

@1oridevs
Author

1oridevs commented Apr 9, 2026

I've used Cursor AI to help me write this PR description.

@ggml-gh-bot

ggml-gh-bot bot commented Apr 9, 2026

Hi @1oridevs, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs examples server ggml changes relating to the ggml tensor library for machine learning labels Apr 9, 2026
