
server: fix SWA prompt reuse boundary condition#21695

Draft
1oridevs wants to merge 1 commit into ggml-org:master from 1oridevs:fix/swa-partial-reuse-boundary

Conversation


@1oridevs 1oridevs commented Apr 9, 2026

Description


While testing long prompts with SWA-enabled models, I noticed that the server sometimes falls back to full prompt reprocessing even when it looks like the cached state should still be reusable.

After digging into it, it seems to come down to how the boundary is checked:

  • The current condition treats pos_min == pos_next - n_swa as invalid
  • Checkpoints at that same boundary can also get skipped

So even when the cache is technically still valid, the server ends up not using it.

What this change does

  • Changes the reuse condition from:

    pos_min >= pos_min_thold

    to:

    pos_min > pos_min_thold
  • Updates the checkpoint lookup to allow:

    cur.pos_min <= pos_min_thold

This makes the boundary inclusive instead of exclusive, so valid reuse cases right at the edge are no longer thrown away.

Also removed a leftover debug comment while I was there.

Why it matters

Without this, some requests end up doing a full prompt reprocess even though they don’t need to, which can hurt performance — especially with longer contexts.

With this change, reuse behaves more consistently at the sliding window boundary.


Notes

Tested locally with the server build — behavior looks correct and avoids unnecessary resets in cases where reuse should be possible.


Checklist

  • Built and tested locally
  • Change is minimal and scoped
  • No behavior change outside SWA boundary handling

AI usage

I used AI tools to help reason about the issue and review the change, but the debugging, testing, and final code changes were done manually.

@1oridevs
Author

1oridevs commented Apr 9, 2026

I've used Cursor AI to help me write this PR description.

@ggml-gh-bot

ggml-gh-bot bot commented Apr 9, 2026

Hi @1oridevs, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs examples server ggml changes relating to the ggml tensor library for machine learning labels Apr 9, 2026
