diff --git a/.claude/skills/en/graph-mode-internals/SKILL.md b/.claude/skills/en/graph-mode-internals/SKILL.md
new file mode 100644
index 00000000..fa857a4c
--- /dev/null
+++ b/.claude/skills/en/graph-mode-internals/SKILL.md
@@ -0,0 +1,254 @@
+---
+name: graph-mode-internals
+description: Understand the complete graph mode flow in lmdeploy+dlinfer, covering the runner architecture, buffer management, vendor differences, and common pitfalls.
+---
+# graph-mode-internals
+
+This skill explains how graph mode works end-to-end in lmdeploy+dlinfer,
+covering the runner layer, buffer layer, capture/replay flow, and
+vendor-specific differences. The goal is understanding, not just
+implementation details.
+
+---
+
+## Background
+
+**What is graph mode?**
+Graph mode captures a sequence of compute operations as a static graph and
+replays it without Python overhead. In practice this means each decode step
+can reuse a pre-compiled execution plan, reducing per-step latency.
+
+**Why decode only — not prefill?**
+Prefill sequence lengths vary widely across requests. Capturing a separate
+graph for each possible length would require far too many buckets, consuming
+large amounts of compile time and device memory. Decode is different: each
+request generates exactly one new token per step, so `q_seqlen = 1` for all
+requests. This makes bucketing by batch size alone practical.
+
+**Eager mode** skips graph capture entirely and runs ops directly through
+Python dispatch. It is the reference execution path.
+
+---
+
+## Code Organisation
+
+### lmdeploy (base classes and CUDA implementation)
+
+- **`CudaGraphMeta`** (`lmdeploy/pytorch/models/utils/cudagraph.py`) —
+  dataclass that stores graph configuration: `max_batchs`, `max_tokens`,
+  `num_blocks`, `device`, `input_buffers`, `output_buffers`, and optional
+  flags for MLA, SSM, MRoPE, etc.
+- **`CudaGraphMixin`** (same file) — mixin class that defines five methods
+  with default CUDA implementations:
+  - `support_cuda_graph` — returns True if the current step should use graph
+    mode (default: True when decoding)
+  - `make_buffers_cudagraph` — allocates fixed-shape tensors that will serve
+    as graph inputs for all future replays
+  - `fill_buffers_cudagraph` — copies real per-step data into the fixed
+    buffers before capture or replay
+  - `update_context_cudagraph` — updates `StepContext` fields to point at
+    the buffer tensors
+  - `get_outputs_cudagraph` — slices the full output buffers to the actual
+    token count after replay
+- **`GraphRunner`** (`lmdeploy/pytorch/backends/graph_runner.py`) — base
+  class; `__call__` simply calls `self.model(**kwargs)` (no graph)
+- **`CUDAGraphRunner`** (`lmdeploy/pytorch/backends/cuda/graph_runner.py`)
+  — full CUDA implementation with `CUDASingleGraphRunner` (uses
+  `torch.cuda.CUDAGraph`) and batch-size bucketing
+
+### dlinfer (vendor extensions)
+
+All vendors monkey-patch the three buffer methods at import time:
+
+```python
+CudaGraphMixin.make_buffers_cudagraph  = Vendor_make_buffers_cudagraph
+CudaGraphMixin.fill_buffers_cudagraph  = Vendor_fill_buffers_cudagraph
+CudaGraphMixin.update_context_cudagraph = Vendor_update_context_cudagraph
+```
+
+Ascend additionally provides **`AscendGraphRunner`**, which extends
+`GraphRunner` with `AscendSingleGraphRunner` (uses `torch.npu.NPUGraph`).
+Camb, MACA, and PPU reuse lmdeploy's `CUDAGraphRunner`.
+
+The wiring point where each vendor selects its runner class is
+**`op_backend.build_graph_runner()`** in
+`lmdeploy/pytorch/backends/dlinfer/<vendor>/op_backend.py`.
+
+---
+
+## Runner Layer
+
+### Batch-size bucketing (`compatible_size`)
+
+Graph capture is keyed by batch size. To maximise graph reuse, the actual
+batch size is rounded up to the nearest bucket before looking up or
+creating a graph:
+
+- **Ascend** (`AscendGraphRunner.get_ascend_compatible_size`):
+  three stages — power-of-2 for ≤ 16, 16-aligned for ≤ 256, 256-aligned
+  for > 256
+- **Camb / MACA / PPU** (via `CUDAGraphRunner`): pure power-of-2
+
+### `_runner_map` and graph lifecycle
+
+`_runner_map` maps `(compatible_batch_size, is_decoding, ...)` to a single
+graph runner. On first encounter the runner captures the graph; on
+subsequent encounters it replays the cached graph.
+
+---
+
+## Buffer Layer
+
+### Two categories of tensors
+
+| Category | Shape changes with batch size? | Needs buffer? |
+|---|---|---|
+| KV cache (`past_key_values`) | No — allocated once at max size | No |
+| `q_seqlens`, `kv_seqlens`, `block_offsets`, … | Yes | Yes |
+
+KV cache is passed through unchanged. Variable-shape tensors must be backed
+by fixed-shape buffers so the captured graph always sees the same memory
+addresses and shapes.
+
+### The three buffer methods
+
+**`make_buffers_cudagraph`** — called once during graph capture setup.
+Allocates fixed-shape tensors on device (at `max_batchs` / `max_tokens`
+size) and stores them in `graph_meta.input_buffers`.
+
+**`fill_buffers_cudagraph`** — called before every capture and every
+replay. Copies real data from the actual forward inputs into the
+pre-allocated buffers. Pads unused slots with safe defaults (e.g. repeating
+`max_tokens // max_batchs` for padding seqlens; initialising `kv_start_indices`
+to -1 so that padding slots never corrupt KV cache slot 0).
+
+**`update_context_cudagraph`** — called before every capture and replay.
+Updates `StepContext` to point at the buffer tensors so that downstream ops
+(e.g. attention) read from the right memory.
+
+If you introduce a new tensor input that varies with batch size, all three
+methods must be updated in sync.
+
+---
+
+## Capture Flow
+
+```text
+GraphRunner.__call__
+  └─ compatible_size = get_compatible_size(batch_size)
+       └─ _runner_map[compatible_size] not found → create AscendSingleGraphRunner
+            (or CUDASingleGraphRunner for Camb / MACA / PPU)
+            │
+            ├─ make_buffers_cudagraph(graph_meta)  ← allocate fixed buffers once
+            │
+            ├─ fill_buffers_cudagraph(...)          ← copy real data into buffers
+            │
+            ├─ update_context_cudagraph(...)        ← point StepContext at buffers
+            │
+            ├─ warmup forward (outside graph scope)
+            │
+            └─ with torch.cuda.graph() / torch.npu.NPUGraph():
+                 model.forward(...)                 ← ops captured here
+                 make_output_buffers(output)        ← store output tensor refs
+```
+
+---
+
+## Replay Flow
+
+```text
+GraphRunner.__call__
+  └─ compatible_size = get_compatible_size(batch_size)
+       └─ _runner_map[compatible_size] found → AscendSingleGraphRunner.forward()
+            │
+            ├─ fill_buffers_cudagraph(...)     ← update buffer contents
+            │
+            ├─ update_context_cudagraph(...)   ← re-point StepContext
+            │
+            ├─ [Ascend only] update kv_seqlens in-place (see next section)
+            │
+            ├─ _graph.replay()                 ← execute captured ops
+            │
+            └─ get_outputs_cudagraph(...)      ← slice output to actual token count
+```
+
+> **Note**: `get_outputs_cudagraph` is a simple output-slicing step. It
+> reads `output_buffers['hidden_states']` and slices `[:, :num_tokens]`.
+> For most vendors this is identical to the lmdeploy default.
+
+---
+
+## Ascend — kv_seqlens Update During Replay
+
+For Camb and MACA, writing updated values into the input buffer before
+replay is sufficient — the graph reads from the live device buffer automatically.
+Ascend is different: the attention operator takes `actual_seq_lengths_kv`
+as a CPU tensor or list, not as part of the NPU input buffer. An NPU buffer
+write cannot reach this CPU-side parameter, so the new values must be
+explicitly pushed into the captured graph via a dedicated update API.
+
+Two mechanisms exist, selected at runtime by `aclgraph_use_torch_npu_update()`:
+
+**torch_npu < 2.8.0.post1** — uses the low-level ACL graph task update API:
+
+```python
+graph_task_update_begin(graph_handle)
+update_attn_params(kv_seqlens, ...)  # writes via ACL
+graph_task_update_end(graph_handle)
+```
+
+**torch_npu ≥ 2.8.0.post1** — uses the higher-level torch_npu graph update
+API:
+
+```python
+graph.update(cpu_update_input=[{"actual_seq_lengths_kv": kv_seqlens}])
+```
+
+---
+
+## Vendor Comparison
+
+| Item | Ascend | Camb | MACA |
+|---|---|---|---|
+| Runner | `AscendGraphRunner` | `CUDAGraphRunner` | `CUDAGraphRunner` |
+| Graph API | `npu.NPUGraph` | `cuda.CUDAGraph` | `cuda.CUDAGraph` |
+| `compatible_size` | 3-stage (p2/16-align/256-align) | power-of-2 | power-of-2 |
+| `attn_metadata` slicing | not sliced | sliced | sliced |
+| `kv_start_indices` | `(max_batchs,)` | `(max_batchs,)` | `(max_batchs, 1)` |
+| `max_kv_seq_len` | kept as-is | set to -1 | kept as-is |
+| `x_active_mask` buffer | Yes | No | No |
+| kv_seqlens update | `update_attn_params` / `graph.update()` | write | write |
+
+---
+
+## Points to Note
+
+1. **`kv_start_indices` must be initialised to -1, not 0.** Index 0 is a
+   valid KV cache slot; padding slots initialised to 0 will silently corrupt
+   it.
+
+2. **`max_kv_seq_len` must be -1 for Camb.** This integer is captured as
+   a constant node in the graph at capture time. The `torch_mlu_ops` API
+   treats any value ≤ 0 as "compute the max dynamically from `kv_seqlens`";
+   setting it to the actual max at capture time would make it wrong at
+   every subsequent replay step.
+
+3. **All three buffer methods must be updated together.** If you add a new
+   tensor that varies with batch size, `make_buffers` must allocate the
+   buffer, `fill_buffers` must copy data into it, and `update_context` must
+   point `StepContext` at it. Missing any one of the three will cause
+   incorrect behaviour or a silent read from stale data.
+
+4. **Graph capture happens at `compatible_size`, not at actual batch size.**
+   Batch sizes are rounded up to a bucket. Do not compare `new_batch_size`
+   directly to `max_batchs` — use the compatible-size logic instead.
+
+5. **Ascend kv_seqlens update version check.** When debugging Ascend graph
+   mode failures involving wrong attention outputs, check which torch_npu
+   version is in use and verify the correct update path is taken in
+   `AscendSingleGraphRunner`.
+
+6. **Eager mode is always available as a reference.** If graph mode produces
+   wrong outputs, run the same step in eager mode (`eager_mode=True`) to
+   confirm whether the bug is in graph capture/replay or in the underlying
+   ops.
diff --git a/.claude/skills/zh_cn/graph-mode-internals/SKILL.md b/.claude/skills/zh_cn/graph-mode-internals/SKILL.md
new file mode 100644
index 00000000..2e280d8d
--- /dev/null
+++ b/.claude/skills/zh_cn/graph-mode-internals/SKILL.md
@@ -0,0 +1,239 @@
+---
+name: graph-mode-internals
+description: 理解 lmdeploy+dlinfer 中 graph mode 的完整流程，涵盖 runner 架构、buffer 管理、vendor 差异与常见陷阱。
+---
+# graph-mode-internals
+
+本技能说明 lmdeploy+dlinfer 中 graph mode 的端到端工作原理，涵盖 runner
+层、buffer 层、capture/replay 流程以及各 vendor 的具体差异。目标是理解，
+而不只是查阅实现细节。
+
+---
+
+## 背景
+
+**什么是 graph mode？**
+Graph mode 将一段计算过程捕获为静态图，之后每次执行只需回放（replay），无
+需 Python 层的调度开销。在实际推理中，这意味着每个 decode 步骤可以复用预
+编译的执行计划，从而降低单步延迟。
+
+**为什么只用于 decode，不用于 prefill？**
+Prefill 的序列长度因请求而异，变化范围极大。若要为每种可能的长度单独捕获
+一张图，需要大量分桶，占用大量编译时间和显存。Decode 则不同：每个请求每步
+只生成一个新 token，即 `q_seqlen = 1`，因此只需按 batch size 分桶，代价可
+以接受。
+
+**Eager mode** 完全跳过图捕获，通过 Python dispatch 直接执行算子，是参考
+执行路径。
+
+---
+
+## 代码组织
+
+### lmdeploy（基类与 CUDA 实现）
+
+- **`CudaGraphMeta`**（`lmdeploy/pytorch/models/utils/cudagraph.py`）——
+  存储图配置的 dataclass：`max_batchs`、`max_tokens`、`num_blocks`、
+  `device`、`input_buffers`、`output_buffers`，以及 MLA、SSM、MRoPE
+  等可选标志。
+- **`CudaGraphMixin`**（同一文件）—— 定义五个方法并提供默认 CUDA 实现的
+  mixin 类：
+  - `support_cuda_graph` —— 判断当前步骤是否使用 graph mode（默认：
+    decode 时返回 True）
+  - `make_buffers_cudagraph` —— 分配固定形状的 tensor，供后续所有
+    replay 步骤用作图输入
+  - `fill_buffers_cudagraph` —— 在 capture 或 replay 前，将真实的
+    per-step 数据拷贝到固定 buffer 中
+  - `update_context_cudagraph` —— 更新 `StepContext` 中的字段，使其
+    指向 buffer tensor
+  - `get_outputs_cudagraph` —— replay 结束后，将完整输出 buffer 按
+    实际 token 数截取
+- **`GraphRunner`**（`lmdeploy/pytorch/backends/graph_runner.py`）——
+  基类，`__call__` 直接调用 `self.model(**kwargs)`（无图）
+- **`CUDAGraphRunner`**（`lmdeploy/pytorch/backends/cuda/graph_runner.py`）
+  —— 完整的 CUDA 实现，包含 `CUDASingleGraphRunner`
+  （使用 `torch.cuda.CUDAGraph`）和 batch size 分桶逻辑
+
+### dlinfer（各 vendor 扩展）
+
+所有 vendor 在模块导入时以 monkey-patch 方式替换三个 buffer 方法：
+
+```python
+CudaGraphMixin.make_buffers_cudagraph   = Vendor_make_buffers_cudagraph
+CudaGraphMixin.fill_buffers_cudagraph   = Vendor_fill_buffers_cudagraph
+CudaGraphMixin.update_context_cudagraph = Vendor_update_context_cudagraph
+```
+
+Ascend 还额外提供了 **`AscendGraphRunner`**，继承自 `GraphRunner`，内部
+使用 `AscendSingleGraphRunner`（基于 `torch.npu.NPUGraph`）。Camb、MACA
+和 PPU 则复用 lmdeploy 的 `CUDAGraphRunner`。
+
+各 vendor 选择 runner 类的入口是
+`lmdeploy/pytorch/backends/dlinfer/<vendor>/op_backend.py` 中的
+**`op_backend.build_graph_runner()`**。
+
+---
+
+## Runner 层
+
+### Batch size 分桶（`compatible_size`）
+
+图捕获以 batch size 为键。为最大化图复用，实际 batch size 在查找或创建图
+之前会向上取整到最近的桶：
+
+- **Ascend**（`AscendGraphRunner.get_ascend_compatible_size`）：三段策略
+  —— ≤ 16 时取 2 的幂次，≤ 256 时按 16 对齐，> 256 时按 256 对齐
+- **Camb / MACA / PPU**（通过 `CUDAGraphRunner`）：纯粹的 2 的幂次
+
+### `_runner_map` 与图的生命周期
+
+`_runner_map` 以 `(compatible_batch_size, is_decoding, ...)` 为键，映射
+到单个图 runner。首次遇到时捕获图；后续遇到时直接回放已缓存的图。
+
+---
+
+## Buffer 层
+
+### Tensor 的两类
+
+| 类别 | 形状随 batch size 变化？ | 需要 buffer？ |
+|---|---|---|
+| KV cache（`past_key_values`） | 否——启动时按最大容量分配 | 否 |
+| `q_seqlens`、`kv_seqlens`、`block_offsets`、`kv_start_indices`、… | 是 | 是 |
+
+KV cache 直接透传，无需 buffer。形状随 batch size 变化的 tensor 必须由
+固定形状的 buffer 托底，以确保捕获的图始终看到相同的内存地址和形状。
+
+### 三个 buffer 方法
+
+**`make_buffers_cudagraph`** —— 在图捕获准备阶段调用一次。在设备上分配
+固定形状的 tensor（按 `max_batchs` / `max_tokens` 大小），并存入
+`graph_meta.input_buffers`。
+
+**`fill_buffers_cudagraph`** —— 在每次 capture 和每次 replay 前调用。将
+真实数据从 forward 输入拷贝到预分配的 buffer 中。对填充槽位使用安全默认值
+（例如，padding seqlen 填为 `max_tokens // max_batchs`；`kv_start_indices`
+初始化为 -1，防止 padding 槽位污染 KV cache 的 slot 0）。
+
+**`update_context_cudagraph`** —— 在每次 capture 和每次 replay 前调用。
+更新 `StepContext`，使其指向 buffer tensor，以便下游算子（如 attention）
+读取正确的内存。
+
+如果引入新的随 batch size 变化的 tensor，三个方法都需要同步更新。
+
+---
+
+## Capture 流程
+
+```text
+GraphRunner.__call__
+  └─ compatible_size = get_compatible_size(batch_size)
+       └─ _runner_map[compatible_size] 不存在 → 创建 AscendSingleGraphRunner
+            （Camb / MACA / PPU 使用 CUDASingleGraphRunner）
+            │
+            ├─ make_buffers_cudagraph(graph_meta)  ← 一次性分配固定 buffer
+            │
+            ├─ fill_buffers_cudagraph(...)          ← 将真实数据写入 buffer
+            │
+            ├─ update_context_cudagraph(...)        ← StepContext 指向 buffer
+            │
+            ├─ warmup forward（图范围之外）
+            │
+            └─ with torch.cuda.graph() / torch.npu.NPUGraph():
+                 model.forward(...)                 ← 算子在此处被捕获
+                 make_output_buffers(output)        ← 保存输出 tensor 引用
+```
+
+---
+
+## Replay 流程
+
+```text
+GraphRunner.__call__
+  └─ compatible_size = get_compatible_size(batch_size)
+       └─ _runner_map[compatible_size] 存在 → AscendSingleGraphRunner.forward()
+            │
+            ├─ fill_buffers_cudagraph(...)     ← 更新 buffer 内容
+            │
+            ├─ update_context_cudagraph(...)   ← 重新指向 StepContext
+            │
+            ├─ [仅 Ascend] 原地更新 kv_seqlens（见下节）
+            │
+            ├─ _graph.replay()                 ← 执行捕获的算子序列
+            │
+            └─ get_outputs_cudagraph(...)      ← 按实际 token 数截取输出
+```
+
+> **说明**：`get_outputs_cudagraph` 是一个简单的输出截取步骤。它读取
+> `output_buffers['hidden_states']` 并截取 `[:, :num_tokens]`。对于大多数
+> vendor，与 lmdeploy 默认实现相同。
+
+---
+
+## Ascend —— Replay 期间的 kv_seqlens 更新
+
+对于 Camb 和 MACA，在 replay 前将新值写入设备 buffer 即可——图
+replay 时会自动读取 buffer 中的最新值。Ascend 则不同：attention 算子的
+`actual_seq_lengths_kv` 是 CPU tensor 或 list，而非 NPU buffer 的一部分，
+写 NPU buffer 无法触达这个 CPU 侧参数，因此必须通过专门的 update API 将新
+值显式推入已捕获的图中。
+
+通过 `aclgraph_use_torch_npu_update()` 在运行时选择以下两种机制之一：
+
+**torch_npu < 2.8.0.post1** —— 使用底层 ACL graph task update API：
+
+```python
+graph_task_update_begin(graph_handle)
+update_attn_params(kv_seqlens, ...)  # 通过 ACL 写入
+graph_task_update_end(graph_handle)
+```
+
+**torch_npu ≥ 2.8.0.post1** —— 使用更高级的 torch_npu graph update API：
+
+```python
+graph.update(cpu_update_input=[{"actual_seq_lengths_kv": kv_seqlens}])
+```
+
+---
+
+## 各 Vendor 对比
+
+| 项目 | Ascend | Camb | MACA |
+|---|---|---|---|
+| Runner 类 | `AscendGraphRunner` | `CUDAGraphRunner` | `CUDAGraphRunner` |
+| Graph API | `npu.NPUGraph` | `cuda.CUDAGraph` | `cuda.CUDAGraph` |
+| `compatible_size` 策略 | 三段式（2幂/16对齐/256对齐） | 纯 2 的幂次 | 纯 2 的幂次 |
+| `attn_metadata` 截取 | 不截取 | 截取 | 截取 |
+| `kv_start_indices` | `(max_batchs,)` | `(max_batchs,)` | `(max_batchs, 1)` |
+| `max_kv_seq_len` | 保持原值 | 设为 -1 | 保持原值 |
+| `x_active_mask` buffer | 有 | 无 | 无 |
+| kv_seqlens 更新 | `update_attn_params` / `graph.update()` | 写入 | 写入 |
+
+---
+
+## 注意事项
+
+1. **`kv_start_indices` 必须初始化为 -1，而非 0。** Index 0 是合法的
+   KV cache slot；若 padding 槽位初始化为 0，会悄无声息地污染它。
+
+2. **Camb 的 `max_kv_seq_len` 必须设为 -1。** 此整数在捕获时会作为
+   常量节点固化到图中。`torch_mlu_ops` API 约定：值 ≤ 0 表示"从
+   `kv_seqlens` tensor 动态计算最大值"；若 capture 时填入真实最大值，
+   replay 时每一步都会使用该固化的错误常量。
+
+3. **三个 buffer 方法必须同步更新。** 如果引入新的随 batch size 变化的
+   tensor，`make_buffers` 需要分配 buffer，`fill_buffers` 需要写入数据，
+   `update_context` 需要让 `StepContext` 指向它。任意一步缺失都会导致错
+   误行为或悄悄读取到旧数据。
+
+4. **图以 `compatible_size` 为键，而非实际 batch size。** Batch size 向
+   上取整到桶。不要将 `new_batch_size` 直接与 `max_batchs` 比较，应使用
+   compatible-size 逻辑。
+
+5. **Ascend kv_seqlens 更新的版本检查。** 调试 Ascend graph mode 中
+   attention 输出错误的问题时，先确认 torch_npu 版本，再验证
+   `AscendSingleGraphRunner` 走的是哪条更新路径。
+
+6. **Eager mode 始终可作为参考。** 若 graph mode 产生错误输出，以
+   `eager_mode=True` 运行同一步骤，确认 bug 是在图捕获/回放中，还是在
+   底层算子中。