【Hackathon 10th Spring No.53】[Feature][KVCache] Support head-wise SWA cache recycle in ResourceManagerV1 [cf]#7717
Conversation

PaddlePaddle-bot left a comment:

Thanks for your contribution!

🤖 Paddle-CI-Agent | pr_review | 2026-05-04 20:40:44
📋 Review Summary
PR overview: adds a head-wise (per KV head) SWA KV cache recycling mechanism, enabled via FD_HEAD_WISE_KV_CACHE=1; off by default and bit-identical to mainline behavior when disabled.
Change scope: custom_ops/gpu_ops/append_attn/, fastdeploy/cache_manager/, fastdeploy/engine/sched/, fastdeploy/config.py, fastdeploy/envs.py
Impact tags: [KVCache] [Scheduler] [OP] [FDConfig]
📝 PR Convention Check
The PR title uses the Conventional Commit format (feat(kvcache): ...), which does not match the FastDeploy-standard [Tag] description format. All required sections exist in the PR body but are empty (only template comment placeholders remain); concrete content needs to be filled in.
Suggested title (copy-paste ready):
[KVCache] Support head-wise SWA block recycle
Suggested PR description (copy-paste ready; must reproduce the full structure of the checklist §D2 template):
## Motivation
Support head-wise (per KV head) Sliding Window Attention (SWA) KV cache recycling to reduce GPU memory pressure on long sequences. Enabled via the environment variable `FD_HEAD_WISE_KV_CACHE=1`; off by default and bit-identical to mainline behavior when disabled.
## Modifications
- `custom_ops/gpu_ops/append_attn/`: add an optional `block_tables_headwise` parameter (rank-2, logical shape `[bsz*kv_num_heads, max_blocks_per_head]`) to `AppendAttention` / `AppendAttentionWithOutput` for head-wise KV block routing
- `fastdeploy/cache_manager/prefix_cache_manager.py`: add `allocate_gpu_blocks_head_wise` / `recycle_gpu_blocks_head_wise` / `_init_head_wise_free_list` to manage a dedicated head-wise free list (isolated from the legacy `gpu_free_block_list` to avoid OOB)
- `fastdeploy/engine/sched/resource_manager_v1.py`: add `recycle_request_swa_head_cache` and related methods to reclaim aged-out head-wise blocks per SWA window inside the decode schedule loop
- `fastdeploy/config.py`: add a T53 fixture that syncs the `head_wise_swa_ratio`, `window_size`, and `sink_size` attributes in the engine-main process, keeping them consistent with the worker side
- `fastdeploy/envs.py`: add three environment variables: `FD_HEAD_WISE_KV_CACHE`, `FD_T53_HEAD_WISE_SWA_FIXTURE`, `FD_T53_HEAD_WISE_SWA_RATIO`
- `tests/cache_manager/`: add 7 unit tests covering head-wise allocation, recycling, TP consistency, and related scenarios
## Usage or Command
```bash
export FD_HEAD_WISE_KV_CACHE=1
export FD_T53_HEAD_WISE_SWA_FIXTURE=1
# Optional: SWA head ratio (default 1/kv_num_heads)
export FD_T53_HEAD_WISE_SWA_RATIO=0.5
```
## Accuracy Tests
N/A (this PR only changes the KV cache management path and does not alter attention computation semantics, so no accuracy comparison is needed)
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh:216 | `block_id < 0` falls back to block 0; if `sink_size > 0` is enabled in the future, this will introduce KV data pollution |
| 🟡 Suggestion | fastdeploy/cache_manager/prefix_cache_manager.py | `_init_head_wise_free_list` sets `free_gpu_block_num` to the total cache-id count, inflating it by a factor of `kv_num_heads` |
| ❓ Question | fastdeploy/envs.py | New env variables such as `FD_T53_HEAD_WISE_SWA_FIXTURE` are not exposed on the EngineArgs CLI side — is this a deliberate choice for an experimental interface? |
| 📝 PR convention | — | Title does not follow the `[Tag] description` format; all PR body sections are empty |
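The second row's metrics concern can be made concrete with a small sketch (all numbers and variable names below are invented for illustration, not from the PR):

```python
# With head-wise cache ids, each physical block contributes kv_num_heads ids,
# so reporting the raw id count overstates whole-block availability.
num_gpu_blocks = 128
kv_num_heads = 8

head_wise_cache_ids = num_gpu_blocks * kv_num_heads  # ids in the head-wise free list

inflated_metric = head_wise_cache_ids                     # the value the review flags as inflated
whole_block_metric = head_wise_cache_ids // kv_num_heads  # one possible normalization

print(inflated_metric, whole_block_metric)  # 1024 128
```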
Overall Assessment
The overall architecture is sound: the head-wise free list is isolated from the legacy list to avoid OOB, default-off guarantees no side effects on mainline, and the atomic allocation/rollback logic is complete. The main open points are the `block_id < 0` fallback handling in the CUDA kernel (document the invariant explicitly or use a dummy block) and the head-wise metrics accuracy issue; please confirm both before merging.
```cpp
uint32_t kv_idx_base = chunk_start;
int block_id = __ldg(&block_table_now[kv_idx_base / BLOCK_SIZE]);
if (block_id < 0) {
```
🟡 Suggestion: when `block_id < 0`, falling back to `block_id = 0` reads block 0's real KV data.
In a `sink_size > 0` scenario, the attention kernel must access both the sink region and the window region, and a recycled "hole" in between may carry `block_id == -1`. Falling back to block 0 then feeds block 0's KV into the attention computation; if the attention mask does not fully cover that position, the output is polluted.
Suggestions:
- If the current design guarantees negative block_ids are never accessed (bounded by `seq_lens_kv`), this guard is defensive code; add a comment documenting that invariant.
- If `sink_size > 0` support is planned, allocate a dedicated dummy block (all-zero KV) as the fallback target instead of reusing block 0.
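To make the risk concrete, here is a minimal Python sketch (not the kernel code; `resolve_block` and the toy cache are hypothetical) contrasting the block-0 fallback with a dedicated dummy block:

```python
def resolve_block(block_table, logical_idx, dummy_block_id):
    """Route recycled holes (block_id == -1) to a reserved all-zero dummy block."""
    block_id = block_table[logical_idx]
    return dummy_block_id if block_id < 0 else block_id

# Toy physical KV cache: blocks 0 and 1 hold live data; block 2 is a reserved
# all-zero dummy block.
kv_cache = {0: [1.5, 2.5], 1: [3.0, 4.0], 2: [0.0, 0.0]}
block_table = [0, -1, 1]  # the middle logical block was recycled (SWA hole)

# Buggy fallback (clamp to block 0): the hole reads block 0's live KV values.
polluted = kv_cache[max(block_table[1], 0)]
assert polluted == [1.5, 2.5]  # real data would leak into attention

# Dummy-block fallback: the hole contributes only zeros.
safe = kv_cache[resolve_block(block_table, 1, dummy_block_id=2)]
assert safe == [0.0, 0.0]
```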
```cpp
uint32_t kv_idx_base = chunk_start;
int block_id = __ldg(&block_table_now[kv_idx_base / BLOCK_SIZE]);
if (block_id < 0) {
```
🟡 Suggestion: same as line 216 — the `block_id < 0` fallback to block 0 carries the same KV data pollution risk. Apply a fix consistent with the one chosen for line 216.
CI report generated from the code below (refreshed every 30 minutes):
1. Task overview: all Required tasks passed (no Required tasks are currently configured); 1 optional task failed, which does not block merging.
2. Task status summary
2.1 Required tasks: 0/0 passed
2.2 Optional tasks: 0/2 passed
3. Failure details (required only): no required tasks failed.
The PR1 head-wise allocator (PaddlePaddle#7717) emits flat global block IDs in [0, num_gpu_blocks * kv_num_heads) from a single shared min-heap, but the PR2 discrete kernel (PaddlePaddle#7718) ABI L1 expects per-head local IDs in {-1} ∪ [0, num_gpu_blocks). This causes cudaIllegalAddress on any request whose allocated IDs cross the num_gpu_blocks boundary (i.e. immediately on head index ≥ ceil(num_gpu_blocks / num_blocks)).

This commit normalizes IDs at the backend boundary in append_attn_backend.py using `local = flat % num_gpu_blocks` (sentinel -1 preserved), with a fail-fast assert to catch any residual OOB. The hotfix is bench-only; the canonical fix (per-head independent allocator pools) is deferred to PR1 v5 (RFC-PR1-reanchored.md §3). Also adds an FD_T53_HEAD_WISE_SWA_RATIO ∈ [0.0, 1.0] validator.

Refs:
- .checkpoints/h10/task-53/design/PR2-HOTFIX-SPEC.md (Option B, OPUS-GATE PASS)
- .checkpoints/h10/task-53/design/CONTRACT-ORACLE.md (I2, I7)
- .checkpoints/h10/task-53/design/RFC-PR2-reanchored.md (ABI L1)

Files: 2 changed (1 backend hotfix, 1 envs validator)
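The normalization step described above can be sketched as follows. This is a hedged illustration only: `normalize_block_ids` and its signature are invented for the sketch, not the actual append_attn_backend.py code; only the `local = flat % num_gpu_blocks` mapping and sentinel handling come from the commit message.

```python
def normalize_block_ids(flat_ids, num_gpu_blocks):
    """Map flat global block IDs in [0, num_gpu_blocks * kv_num_heads)
    to per-head local IDs in [0, num_gpu_blocks), preserving the -1 sentinel."""
    local_ids = []
    for flat in flat_ids:
        if flat == -1:  # recycled hole: keep the sentinel untouched
            local_ids.append(-1)
            continue
        local = flat % num_gpu_blocks
        # Fail fast on anything still outside the kernel's expected range.
        assert 0 <= local < num_gpu_blocks, f"OOB block id after normalization: {local}"
        local_ids.append(local)
    return local_ids

# Example with num_gpu_blocks=100: head 1's flat IDs 100..199 map back to 0..99.
print(normalize_block_ids([5, -1, 100, 199], num_gpu_blocks=100))  # [5, -1, 0, 99]
```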
Motivation
Hackathon 10th Spring Task No.53 — 离散 KV Cache 管理和 AppendAttention 算子的性能优化 (PR1 of 2). Spec: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_10th/【Hackathon_10th】开源贡献个人挑战赛春节特别季—任务合集.md#no53.
For models that mix Sliding-Window Attention (SWA) heads with full-attention heads inside the same layer, today's V1 KV-cache scheduling path (`ResourceManagerV1` + `PrefixCacheManager`, gated by the default-on `ENABLE_V1_KVCACHE_SCHEDULER=1`) allocates one shared `block_idx` per layer for all heads. SWA heads finish their window long before full-attn heads, but their cache stays pinned until the whole layer evicts. Throughput suffers.

This PR teaches the V1 scheduler + `PrefixCacheManager` to manage `block_idx` per head (head-wise SWA layout) and recycle a SWA head's cache as soon as it crosses its window — the per-head equivalent of what PR #6702 did for V0.

Authorship: this PR is independently designed and implemented by the submitter for Hackathon 10th Spring No.53. The earlier community PR #6702 (V0, not merged) is referenced as prior art only; no code is lifted unattributed. Any future contributor work will be acknowledged via per-commit `Co-authored-by` trailers.

RFC: PaddlePaddle/community#1364.
Modifications
- `fastdeploy/cache_manager/prefix_cache_manager.py`: per-head free lists (`gpu_free_block_list_head_wise[head]`); `allocate_gpu_blocks_head_wise` / `recycle_gpu_blocks_head_wise`; TP-aware sizing (`num_key_value_heads // tp_size`)
- `fastdeploy/engine/sched/resource_manager_v1.py`: `recycle_request_swa_head_cache` (per-head cursor advance ≥ window+sink); `_should_skip_swa_recycle_for_overlap` (per-request `cache_swap_metadata` / `cache_evict_metadata` inspection); P4 cleanup in `_free_blocks`
- `fastdeploy/model_executor/models/paddleformers/base.py`: `FD_T53_HEAD_WISE_SWA_FIXTURE=1`
- `fastdeploy/config.py`: mirrors the `paddleformers/base.py` head-wise SWA attribute injection so `ResourceManagerV1._should_use_head_wise_swa` (engine-main) sees the same `model_config.head_wise_swa_ratio` as the worker. Gated on `FD_T53_HEAD_WISE_SWA_FIXTURE`.
- `enable_prefix_caching=True` + `FD_HEAD_WISE_KV_CACHE=1` raises at `PrefixCacheManager.__init__`
- `FD_HEAD_WISE_KV_CACHE=0` default — bit-identical when disabled
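The per-head free-list bookkeeping described above can be sketched as a minimal illustration. `HeadWiseFreeList` and its method names are hypothetical, not the PR's actual classes; the sketch only mirrors the stated design: one independent free list per KV head, isolated from the legacy shared list, with atomic allocate-or-fail semantics.

```python
import heapq

class HeadWiseFreeList:
    """Minimal sketch: one independent free list (min-heap) per KV head,
    isolated from the legacy shared list so head-wise IDs never go OOB."""

    def __init__(self, num_gpu_blocks, kv_num_heads):
        self.num_gpu_blocks = num_gpu_blocks
        # Each head owns local IDs in [0, num_gpu_blocks).
        self.free = {h: list(range(num_gpu_blocks)) for h in range(kv_num_heads)}
        for heap in self.free.values():
            heapq.heapify(heap)

    def allocate(self, head, n):
        """Pop n blocks for one head, or fail before mutating anything."""
        if len(self.free[head]) < n:
            raise MemoryError(f"head {head}: only {len(self.free[head])} blocks free")
        return [heapq.heappop(self.free[head]) for _ in range(n)]

    def recycle(self, head, block_ids):
        """Return blocks that aged out of this head's SWA window."""
        for bid in block_ids:
            heapq.heappush(self.free[head], bid)

pool = HeadWiseFreeList(num_gpu_blocks=4, kv_num_heads=2)
got = pool.allocate(head=0, n=2)
print(got)                 # [0, 1]
pool.recycle(head=0, block_ids=got)
print(len(pool.free[0]))   # 4
```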
object.__new__/AST or shape oracles (noMagicMock-only). PR2, not PR1, owns kernel-visibleblock_tables_headwise/ FP8 scale-layout changes.PR2 (separate) lands the AppendAttention rank-2
block_tables_headwiseABI + ForwardMeta wiring +kv_num_headsfield as a frozen-shape parameter; PR1 keepsshare_inputs.block_tables2D and reaches the +30% recycle gate via cache-manager-side changes only.Usage or Command
Accuracy Tests
Spec PR1 acceptance — throughput up ≥30% with timely SWA recycle vs without, same VRAM, fixed-IO dataset, V1 KV-cache scheduler on (`ENABLE_V1_KVCACHE_SCHEDULER=1`, default):
Round 3 (full run — 1024 prompts):
Round 3 integrity: `completed=1024/1024` both arms, `errors=0`, mean TTFT improved -48.0% (2,708 s → 1,407 s).

Benchmark: `FastDeploy/benchmarks/benchmark_serving.py` — random fixed-IO dataset, input ≈10.6k tokens avg / output ≈4k tokens avg, request-rate=8, seed=42, `--ignore-eos`, server `--max-concurrency=8192`, YAML `eb45-21b-a3b-32k-bf16-kv50-512s.yaml` (kv_cache_max_ratio=0.50, max_seq_len=512). Fixed-IO integrity: both arms produce identical `total_input_tokens=1,356,656` / `total_output_tokens=518,946` for the 128-prompt gate run. Round 2 harness gate: `completed=128`, `nonempty_errors=0`. Round 3 target: `completed=1024`.

Correctness: `tests/cache_manager/test_head_wise_*.py`, `tests/cache_manager/test_swa_recycle*.py`, and `tests/layers/test_append_attention_head_wise_shapes.py` — real `_FakeCacheManager` + `object.__new__(ResourceManagerV1)` + AST/shape oracles. No `MagicMock`-only tests.

CI run: https://github.com/PaddlePaddle/FastDeploy/pull/7717/checks
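As a hedged illustration of the `object.__new__` test pattern mentioned above (the stand-in class, method, and arithmetic below are invented for the sketch, not the PR's actual test code):

```python
class ResourceManagerV1:
    """Stand-in: assume the real __init__ wires up a full engine config."""
    def __init__(self, config):
        raise RuntimeError("heavyweight init we want to skip in unit tests")

    def recyclable_head_blocks(self, cursor, window, sink):
        # Hypothetical helper: blocks strictly before window+sink are recyclable.
        return max(0, cursor - (window + sink))

# The test pattern: bypass __init__ entirely with object.__new__, then call
# the method under test on the bare instance (it needs no instance state here).
rm = object.__new__(ResourceManagerV1)
assert rm.recyclable_head_blocks(cursor=10, window=4, sink=2) == 4
assert rm.recyclable_head_blocks(cursor=3, window=4, sink=2) == 0
```

The appeal of this pattern over `MagicMock` is that the real method body runs, so a logic bug in the production code still fails the test.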
Companion PR: #7718 (AppendAttention rank-2 head-wise block_idx kernel optimisation)
Checklist
- `pre-commit run --all-files` clean