
【Hackathon 10th Spring No.53】[Feature][KVCache] Support head-wise SWA cache recycle in ResourceManagerV1 [cf] #7717

Open

bob-cloudforge wants to merge 1 commit into PaddlePaddle:develop from CloudForge-Solutions:task/h10-053-pr1-headwise-swa-v4

Conversation


@bob-cloudforge bob-cloudforge commented May 4, 2026

Motivation

Hackathon 10th Spring Task No.53 — 离散 KV Cache 管理和 AppendAttention 算子的性能优化 (PR1 of 2). Spec: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_10th/【Hackathon_10th】开源贡献个人挑战赛春节特别季—任务合集.md#no53.

For models that mix Sliding-Window Attention (SWA) heads with full-attention heads inside the same layer, today's V1 KV-cache scheduling path (ResourceManagerV1 + PrefixCacheManager, gated by the default-on ENABLE_V1_KVCACHE_SCHEDULER=1) allocates one shared block_idx per layer for all heads. SWA heads finish their window long before full-attn heads, but their cache stays pinned until the whole layer evicts. Throughput suffers.

This PR teaches the V1 scheduler + PrefixCacheManager to manage block_idx per head (head-wise SWA layout) and recycle a SWA head's cache as soon as it crosses its window — the per-head equivalent of what PR #6702 did for V0.

Authorship: this PR is independently designed and implemented by the submitter for Hackathon 10th Spring No.53. The earlier community PR #6702 (V0, not merged) is referenced as prior art only; no code is lifted unattributed. Any future contributor work will be acknowledged via per-commit Co-authored-by trailers.

RFC: PaddlePaddle/community#1364.

Modifications

| Area | Change |
| --- | --- |
| fastdeploy/cache_manager/prefix_cache_manager.py | Per-request head-wise GPU free list (`gpu_free_block_list_head_wise[head]`); `allocate_gpu_blocks_head_wise` / `recycle_gpu_blocks_head_wise`; TP-aware sizing (`num_key_value_heads // tp_size`) |
| fastdeploy/engine/sched/resource_manager_v1.py | `recycle_request_swa_head_cache` (per-head cursor advance ≥ window+sink); `_should_skip_swa_recycle_for_overlap` (per-request `cache_swap_metadata` / `cache_evict_metadata` inspection); P4 cleanup in `_free_blocks` |
| fastdeploy/model_executor/models/paddleformers/base.py | Default-off ERNIE SWA fixture (window/sink/skip-freq/ratio) gated by `FD_T53_HEAD_WISE_SWA_FIXTURE=1` |
| fastdeploy/config.py | Engine-main FDConfig fixture: mirrors the paddleformers/base.py head-wise SWA attribute injection so `ResourceManagerV1._should_use_head_wise_swa` (engine-main) sees the same `model_config.head_wise_swa_ratio` as the worker. Gated by `FD_T53_HEAD_WISE_SWA_FIXTURE`. |
| Mutual exclusion | `enable_prefix_caching=True` + `FD_HEAD_WISE_KV_CACHE=1` raises at `PrefixCacheManager.__init__` |
| Env gates | `FD_HEAD_WISE_KV_CACHE=0` by default: bit-identical behavior when disabled |
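The allocation changes above can be sketched in miniature. The class below is an illustrative model of the per-head free-list idea, not FastDeploy code: only the method names `allocate_gpu_blocks_head_wise` / `recycle_gpu_blocks_head_wise` and the TP-aware sizing come from the PR description; the `HeadWiseFreeList` class, the min-heap layout, and the dict-of-heaps structure are assumptions for illustration.

```python
import heapq

class HeadWiseFreeList:
    """Minimal sketch of a per-head GPU block free list.

    Each local KV head owns an independent free list, so a SWA head's
    blocks can be recycled without touching full-attention heads."""

    def __init__(self, num_gpu_blocks, num_key_value_heads, tp_size):
        # TP-aware sizing: each rank manages only its local share of KV heads.
        self.local_kv_heads = num_key_value_heads // tp_size
        self.free = {
            head: list(range(num_gpu_blocks))
            for head in range(self.local_kv_heads)
        }
        for heap in self.free.values():
            heapq.heapify(heap)

    def allocate_gpu_blocks_head_wise(self, head, num_blocks):
        # Pop the lowest-numbered free blocks for this head only.
        heap = self.free[head]
        if len(heap) < num_blocks:
            raise RuntimeError(f"head {head}: not enough free blocks")
        return [heapq.heappop(heap) for _ in range(num_blocks)]

    def recycle_gpu_blocks_head_wise(self, head, block_ids):
        # Return a SWA head's aged-out blocks to its own free list.
        for block_id in block_ids:
            heapq.heappush(self.free[head], block_id)
```

Because each head's list is isolated, recycling head 0's window never hands its blocks to another head's allocator, which is the isolation property the PR relies on.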

Tests use real lightweight objects plus object.__new__ and AST/shape oracles (no MagicMock-only tests). PR2, not PR1, owns the kernel-visible block_tables_headwise / FP8 scale-layout changes.

PR2 (separate) lands the AppendAttention rank-2 block_tables_headwise ABI + ForwardMeta wiring + kv_num_heads field as a frozen-shape parameter; PR1 keeps share_inputs.block_tables 2D and reaches the +30% recycle gate via cache-manager-side changes only.

Usage or Command

```bash
# Enable head-wise V1 cache + timely SWA recycle.
# All four env vars must be set together; partial activation is silently a no-op.
# Without FD_T53_HEAD_WISE_SWA_FIXTURE=1, the engine-main gate stays dormant
# (no model config publishes head_wise_swa_ratio) and head-wise alloc/recycle
# never fires, as verified by the wrapper oracle in bench_recycle.sh.
export FD_T53_HEAD_WISE_SWA_FIXTURE=1     # engine-main FDConfig fixture (config.py)
export ENABLE_V1_KVCACHE_SCHEDULER=1      # default; shown for clarity
export FD_HEAD_WISE_KV_CACHE=1            # enables per-head block tables
export FD_T53_HEAD_WISE_SWA_RATIO=1.0     # SWA recycle ratio (>0 = recycle active)
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-21B-A3B-Paddle \
    --max-model-len 32768
```
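The "all four env vars or silent no-op" activation rule above can be expressed as a small predicate. This is a hedged sketch of the gating logic as described in the comments, not the actual FastDeploy check; the function name `head_wise_swa_active` and the standalone structure are assumptions.

```python
import os

# The four gates named in the usage block above. ENABLE_V1_KVCACHE_SCHEDULER
# defaults to on in FastDeploy, but it is listed here for completeness.
REQUIRED_GATES = {
    "FD_T53_HEAD_WISE_SWA_FIXTURE": "1",
    "ENABLE_V1_KVCACHE_SCHEDULER": "1",
    "FD_HEAD_WISE_KV_CACHE": "1",
}

def head_wise_swa_active(env=os.environ):
    # Partial activation is silently a no-op: every boolean gate must be
    # set, and the recycle ratio must be strictly positive.
    if any(env.get(key) != value for key, value in REQUIRED_GATES.items()):
        return False
    return float(env.get("FD_T53_HEAD_WISE_SWA_RATIO", "0")) > 0.0
```

Passing a plain dict makes the predicate easy to unit-test without mutating the process environment.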

Accuracy Tests

Spec PR1 acceptance: throughput up ≥30% with timely SWA recycle vs. without, same VRAM, fixed-IO dataset, V1 KV-cache scheduler on (ENABLE_V1_KVCACHE_SCHEDULER=1, default):

Round 2 (gate run, 128 prompts):

| Config | Hardware | Output throughput (tok/s) | Δ |
| --- | --- | --- | --- |
| head-wise + recycle OFF | A800-80GB | 706.29 | baseline |
| head-wise + recycle ON | A800-80GB | 1107.98 | +56.9% (≥30% gate ✓) |

Round 3 (full run, 1024 prompts):

| Config | Hardware | Output throughput (tok/s) | Δ |
| --- | --- | --- | --- |
| head-wise + recycle OFF | A800-80GB | 722.93 | baseline |
| head-wise + recycle ON | A800-80GB | 1270.87 | +75.8% (≥30% gate ✓) |

Round 3 integrity: completed=1024/1024 in both arms, errors=0, mean TTFT improved by 48.0% (2,708 s → 1,407 s).

Benchmark: FastDeploy/benchmarks/benchmark_serving.py — random fixed-IO dataset, input≈10.6k tokens avg / output≈4k tokens avg, request-rate=8, seed=42, --ignore-eos, server --max-concurrency=8192, YAML eb45-21b-a3b-32k-bf16-kv50-512s.yaml (kv_cache_max_ratio=0.50, max_seq_len=512). Fixed-IO integrity: both arms produce identical total_input_tokens=1,356,656 / total_output_tokens=518,946 for the 128-prompt gate run. Round 2 harness gate: completed=128, nonempty_errors=0. Round 3 target: completed=1024.
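The fixed-IO totals above are internally consistent with the quoted per-prompt averages; a two-line check reproduces them:

```python
# Sanity check of the fixed-IO averages quoted above: 1,356,656 input
# tokens and 518,946 output tokens over the 128-prompt gate run.
total_input_tokens = 1_356_656
total_output_tokens = 518_946
prompts = 128

avg_input = total_input_tokens / prompts    # ~10.6k tokens per prompt
avg_output = total_output_tokens / prompts  # ~4.1k tokens per prompt
```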

Hardware note for reviewers: spec does not pin PR1 hardware. Numbers above are A800-80GB (SM80) via Baidu AI Studio. If H/B card access is granted (cc @luotao1), we will append H/B numbers as supplementary evidence. PR2 (5% TTFT/TBT) does require H/B per spec; tracked separately.

Correctness:

  • CPU pytest coverage under tests/cache_manager/test_head_wise_*.py, tests/cache_manager/test_swa_recycle*.py, and tests/layers/test_append_attention_head_wise_shapes.py — real _FakeCacheManager + object.__new__(ResourceManagerV1) + AST/shape oracles. No MagicMock-only tests.
  • A800 smoke (bsz=4, seq=1024) + long-context recycle smoke — TBD, pending CI access
  • GSM8K parity (head-wise vs non-head-wise abs diff ≤ 0.5 pp) — TBD, deferred to follow-up validation pass
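The "real lightweight objects + object.__new__" pattern mentioned in the first bullet can be sketched as follows. The `ResourceManagerV1` class here is a stand-in defined locally to show the technique, not the real FastDeploy import, and `swa_recycle_cursor` / `make_manager` are hypothetical names; only the window+sink cursor rule comes from the PR description.

```python
# Bypass a heavy __init__ (GPU, engine config, model weights) with
# object.__new__, then set only the attributes the method under test reads.

class ResourceManagerV1:
    def __init__(self):
        raise RuntimeError("heavy init: needs GPU + engine config")

    def swa_recycle_cursor(self, seq_len):
        # Per-head cursor rule from the PR: tokens before window+sink
        # are past the SWA window and eligible for recycling.
        return max(0, seq_len - (self.window_size + self.sink_size))

def make_manager(window_size, sink_size):
    mgr = object.__new__(ResourceManagerV1)  # skips __init__ entirely
    mgr.window_size = window_size
    mgr.sink_size = sink_size
    return mgr
```

The test then exercises real method code paths rather than MagicMock return values, which is the point of the "no MagicMock-only" rule.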

CI run: https://github.com/PaddlePaddle/FastDeploy/pull/7717/checks

Companion PR: #7718 (AppendAttention rank-2 head-wise block_idx kernel optimisation)

Checklist

  • pre-commit run --all-files clean
  • All CI checks green (Coverage / base_tests / codestyle / iluvatar / xpu)
  • Reviewer-requested changes addressed
  • No prohibited claims in PR body (verified by pre-push grep): "first in framework", "novel research", "unique to FastDeploy"
  • Authorship statement accurate (no unattributed lifted code)
  • Hardware label on every benchmark number matches the actual card used

@paddle-bot

paddle-bot Bot commented May 4, 2026

Thanks for your contribution!

@paddle-bot paddle-bot Bot added the contributor External developers label May 4, 2026
@CLAassistant

CLAassistant commented May 4, 2026

CLA assistant check
All committers have signed the CLA.


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-04 20:40:44

📋 Review Summary

PR overview: adds head-wise (per-KV-head) SWA KV cache recycling, enabled via FD_HEAD_WISE_KV_CACHE=1, off by default, bit-identical to mainline behavior when disabled.
Change scope: custom_ops/gpu_ops/append_attn/, fastdeploy/cache_manager/, fastdeploy/engine/sched/, fastdeploy/config.py, fastdeploy/envs.py
Impact tags: [KVCache] [Scheduler] [OP] [FDConfig]

📝 PR Convention Check

The PR title uses the Conventional Commit format (feat(kvcache): ...), which does not match FastDeploy's standard [Tag] description format; all required PR-body sections are present but empty (only the template comment placeholders remain) and need concrete content.

Suggested title (copy-paste ready):

  • [KVCache] Support head-wise SWA block recycle

Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):

## Motivation
Support head-wise (per-KV-head) Sliding Window Attention (SWA) KV cache recycling to reduce VRAM pressure in long-sequence scenarios. Enabled via the environment variable `FD_HEAD_WISE_KV_CACHE=1`; off by default and bit-identical to mainline behavior when disabled.

## Modifications
- `custom_ops/gpu_ops/append_attn/`: add the optional parameter `block_tables_headwise` (rank-2, logical shape `[bsz*kv_num_heads, max_blocks_per_head]`) to `AppendAttention` / `AppendAttentionWithOutput` for head-wise KV block routing
- `fastdeploy/cache_manager/prefix_cache_manager.py`: add `allocate_gpu_blocks_head_wise` / `recycle_gpu_blocks_head_wise` / `_init_head_wise_free_list`, managing an independent head-wise free list (isolated from the legacy `gpu_free_block_list` to avoid OOB)
- `fastdeploy/engine/sched/resource_manager_v1.py`: add `recycle_request_swa_head_cache` and related methods to recycle aged head-wise blocks per SWA window in the decode schedule loop
- `fastdeploy/config.py`: add the T53 fixture, syncing the `head_wise_swa_ratio`, `window_size`, and `sink_size` attributes in the engine-main process to stay consistent with the worker side
- `fastdeploy/envs.py`: add three environment variables: `FD_HEAD_WISE_KV_CACHE`, `FD_T53_HEAD_WISE_SWA_FIXTURE`, `FD_T53_HEAD_WISE_SWA_RATIO`
- `tests/cache_manager/`: add 7 unit tests covering head-wise allocation, recycling, TP consistency, and related scenarios

## Usage or Command
```bash
export FD_HEAD_WISE_KV_CACHE=1
export FD_T53_HEAD_WISE_SWA_FIXTURE=1
# Optional: set the SWA head ratio (default 1/kv_num_heads)
export FD_T53_HEAD_WISE_SWA_RATIO=0.5
```

## Accuracy Tests
N/A (this PR only changes the KV cache management path and does not affect attention computation semantics, so no accuracy comparison tests are needed)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh:216 | `block_id < 0` falls back to block 0; if `sink_size > 0` is enabled in the future, this introduces KV data pollution |
| 🟡 Suggestion | fastdeploy/cache_manager/prefix_cache_manager.py | `_init_head_wise_free_list` sets `free_gpu_block_num` to the total cache-id count, inflating it by a factor of `kv_num_heads` |
| ❓ Question | fastdeploy/envs.py | The new `FD_T53_HEAD_WISE_SWA_FIXTURE` and related env vars are not exposed on the EngineArgs CLI side; is this a deliberate design for an experimental interface? |
| 📝 Convention | (PR body) | Title does not follow the `[Tag] description` format; the body sections are empty |

Overall Assessment

The overall architecture is sound: the head-wise free list is isolated from the legacy list to avoid OOB, the default-off gate guarantees no mainline side effects, and the atomic allocation rollback logic is complete. The main concerns are the `block_id < 0` fallback handling in the CUDA kernel (recommend documenting the invariant or using a dummy block) and the head-wise metrics accuracy issue; both should be confirmed before merge.


```cpp
uint32_t kv_idx_base = chunk_start;
int block_id = __ldg(&block_table_now[kv_idx_base / BLOCK_SIZE]);
if (block_id < 0) {
```

🟡 Suggestion: when `block_id < 0`, the fallback to `block_id = 0` reads block 0's real KV data.

With `sink_size > 0`, the attention kernel must access both the sink region and the window region, and the "recycled hole" blocks in between may have `block_id = -1`; falling back to block 0 then pulls block 0's KV into the attention computation, and if the attention mask does not fully cover that position, the output is polluted.

Suggestions:

  1. If the current design guarantees that a negative block_id is never accessed (bounded via seq_lens_kv), this guard is defensive code; add a comment stating the invariant.
  2. If sink_size > 0 support is planned, allocate a dedicated dummy block (all-zero KV) as the fallback instead of reusing block 0.


```cpp
uint32_t kv_idx_base = chunk_start;
int block_id = __ldg(&block_table_now[kv_idx_base / BLOCK_SIZE]);
if (block_id < 0) {
```

🟡 Suggestion: same as line 216: the `block_id < 0` fallback to block 0 carries the same KV data pollution risk. Recommend applying the same fix strategy as for line 216.

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-04 20:47:43

The CI report is generated from the following code (updated every 30 minutes):


1 Task Overview

All Required tasks passed (no Required tasks are currently configured); 1 optional task failed, which does not block merging.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 2 (0) | 2 | 0 | 1 | 0 | 0 | 1 |

⚠️ Note: the following 7 workflows are in the action_required state (they run only after approval): CI_HPU, CI_XPU, PR Build and Test, ILUVATAR-CI, Approval, Check PR Template, Codestyle-Check. These workflows need manual approval to trigger.

Note: action_required workflows are not counted in the table above.

2 Task Status Summary

2.1 Required tasks: 0/0 passed

No Required tasks are configured (there are no Required Status Checks in the Branch Protection Rules), so the merge flow is unaffected.

2.2 Optional tasks: 0/2 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Trigger Jenkins for PR | 9m22s | Job | - |
| ⏭️ | cherry-pick | - | Skipped | - |

3 Failure Details (required only)

No required tasks failed.

@bob-cloudforge bob-cloudforge changed the title from "feat(kvcache): support head-wise SWA recycle" to "【Hackathon 10th Spring No.53】[Feature][KVCache] Support head-wise SWA cache recycle in ResourceManagerV1 [cf]" May 4, 2026
bob-cloudforge added a commit to CloudForge-Solutions/FastDeploy that referenced this pull request May 6, 2026
The PR1 head-wise allocator (PaddlePaddle#7717) emits flat global block IDs in
[0, num_gpu_blocks * kv_num_heads) from a single shared min-heap, but
the PR2 discrete kernel (PaddlePaddle#7718) ABI L1 expects per-head local IDs in
{-1} ∪ [0, num_gpu_blocks). This causes cudaIllegalAddress on any
request whose allocated IDs cross the num_gpu_blocks boundary
(i.e. immediately on head index ≥ ceil(num_gpu_blocks / num_blocks)).

This commit normalizes IDs at the backend boundary in append_attn_backend.py
using `local = flat % num_gpu_blocks` (sentinel -1 preserved), with a
fail-fast assert to catch any residual OOB. The hotfix is bench-only;
the canonical fix (per-head independent allocator pools) is deferred to
PR1 v5 (RFC-PR1-reanchored.md §3).

Also adds FD_T53_HEAD_WISE_SWA_RATIO ∈ [0.0, 1.0] validator.

Refs: .checkpoints/h10/task-53/design/PR2-HOTFIX-SPEC.md (Option B, OPUS-GATE PASS)
     .checkpoints/h10/task-53/design/CONTRACT-ORACLE.md (I2, I7)
     .checkpoints/h10/task-53/design/RFC-PR2-reanchored.md (ABI L1)

Files: 2 changed (1 backend hotfix, 1 envs validator)
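The normalization described in this commit message can be sketched directly. The function name `normalize_block_ids` is illustrative (the commit applies the mapping inside append_attn_backend.py); the `flat % num_gpu_blocks` mapping, the preserved -1 sentinel, and the fail-fast OOB assert all come from the commit text.

```python
# Map PR1's flat global block IDs (range [0, num_gpu_blocks * kv_num_heads))
# onto the per-head local IDs ({-1} ∪ [0, num_gpu_blocks)) expected by the
# PR2 kernel ABI, preserving the -1 "no block" sentinel.

def normalize_block_ids(flat_ids, num_gpu_blocks, kv_num_heads):
    local_ids = []
    for flat in flat_ids:
        if flat == -1:
            local_ids.append(-1)  # sentinel preserved unchanged
            continue
        # Fail fast on any residual out-of-bounds ID rather than letting
        # the kernel hit cudaIllegalAddress.
        assert 0 <= flat < num_gpu_blocks * kv_num_heads, f"OOB block id {flat}"
        local_ids.append(flat % num_gpu_blocks)
    return local_ids
```

As the commit notes, this is a bench-only bridge: the modulo collapses the head dimension, so the canonical fix (per-head independent allocator pools) is still required for PR1 v5.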
