254 changes: 254 additions & 0 deletions .claude/skills/en/graph-mode-internals/SKILL.md
@@ -0,0 +1,254 @@
---
name: graph-mode-internals
description: Understand the complete graph mode flow in lmdeploy+dlinfer, covering the runner architecture, buffer management, vendor differences, and common pitfalls.
---
# graph-mode-internals

This skill explains how graph mode works end-to-end in lmdeploy+dlinfer,
covering the runner layer, buffer layer, capture/replay flow, and
vendor-specific differences. The goal is understanding, not just
implementation details.

---

## Background

**What is graph mode?**
Graph mode captures a sequence of compute operations as a static graph and
replays it without per-operator Python dispatch overhead. In practice each
decode step reuses a previously captured execution plan, which reduces
per-step latency.

**Why decode only — not prefill?**
Prefill sequence lengths vary widely across requests. Capturing a separate
graph for each possible length would require far too many buckets, consuming
large amounts of compile time and device memory. Decode is different: each
request generates exactly one new token per step, so `q_seqlen = 1` for all
requests. This makes bucketing by batch size alone practical.

**Eager mode** skips graph capture entirely and runs ops directly through
Python dispatch. It is the reference execution path.

---

## Code Organisation

### lmdeploy (base classes and CUDA implementation)

- **`CudaGraphMeta`** (`lmdeploy/pytorch/models/utils/cudagraph.py`) —
dataclass that stores graph configuration: `max_batchs`, `max_tokens`,
`num_blocks`, `device`, `input_buffers`, `output_buffers`, and optional
flags for MLA, SSM, MRoPE, etc.
- **`CudaGraphMixin`** (same file) — mixin class that defines five methods
  with default CUDA implementations:
  - `support_cuda_graph` — returns True if the current step should use graph
    mode (default: True when decoding)
  - `make_buffers_cudagraph` — allocates fixed-shape tensors that will serve
    as graph inputs for all future replays
  - `fill_buffers_cudagraph` — copies real per-step data into the fixed
    buffers before capture or replay
  - `update_context_cudagraph` — updates `StepContext` fields to point at
    the buffer tensors
  - `get_outputs_cudagraph` — slices the full output buffers to the actual
    token count after replay
- **`GraphRunner`** (`lmdeploy/pytorch/backends/graph_runner.py`) — base
class; `__call__` simply calls `self.model(**kwargs)` (no graph)
- **`CUDAGraphRunner`** (`lmdeploy/pytorch/backends/cuda/graph_runner.py`)
— full CUDA implementation with `CUDASingleGraphRunner` (uses
`torch.cuda.CUDAGraph`) and batch-size bucketing

### dlinfer (vendor extensions)

All vendors monkey-patch the three buffer methods at import time:

```python
CudaGraphMixin.make_buffers_cudagraph = Vendor_make_buffers_cudagraph
CudaGraphMixin.fill_buffers_cudagraph = Vendor_fill_buffers_cudagraph
CudaGraphMixin.update_context_cudagraph = Vendor_update_context_cudagraph
```
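
The patching pattern can be tried in isolation. The following is a minimal sketch, assuming a stand-in mixin and a made-up vendor function (`CudaGraphMixinStandIn`, `ModelStandIn`, and `vendor_make_buffers_cudagraph` are hypothetical names, not lmdeploy or dlinfer code). It shows that rebinding the class attribute changes behaviour for instances that already exist.

```python
class CudaGraphMixinStandIn:
    """Hypothetical, simplified stand-in for lmdeploy's CudaGraphMixin."""

    def make_buffers_cudagraph(self, graph_meta):
        return {"source": "default"}


class ModelStandIn(CudaGraphMixinStandIn):
    """Hypothetical model class that inherits the mixin methods."""


def vendor_make_buffers_cudagraph(self, graph_meta):
    """Hypothetical vendor override with the same signature as the default."""
    return {"source": "vendor", "max_batchs": graph_meta["max_batchs"]}


model = ModelStandIn()
before = model.make_buffers_cudagraph({"max_batchs": 8})

# Rebinding the attribute on the mixin class affects every subclass and
# instance that has not overridden the method, including existing ones.
CudaGraphMixinStandIn.make_buffers_cudagraph = vendor_make_buffers_cudagraph
after = model.make_buffers_cudagraph({"max_batchs": 8})
```

Because the patch is applied to the shared mixin, it matters which vendor module is imported, and importing two vendor modules in one process would leave only the last patch in effect.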

Ascend additionally provides **`AscendGraphRunner`**, which extends
`GraphRunner` with `AscendSingleGraphRunner` (uses `torch.npu.NPUGraph`).
Camb, MACA, and PPU reuse lmdeploy's `CUDAGraphRunner`.

The wiring point where each vendor selects its runner class is
**`op_backend.build_graph_runner()`** in
`lmdeploy/pytorch/backends/dlinfer/<vendor>/op_backend.py`.

---

## Runner Layer

### Batch-size bucketing (`compatible_size`)

Graph capture is keyed by batch size. To maximise graph reuse, the actual
batch size is rounded up to the nearest bucket before looking up or
creating a graph:

- **Ascend** (`AscendGraphRunner.get_ascend_compatible_size`):
three stages — power-of-2 for ≤ 16, 16-aligned for ≤ 256, 256-aligned
for > 256
- **Camb / MACA / PPU** (via `CUDAGraphRunner`): pure power-of-2
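
The two rounding schemes can be sketched as follows. This is an illustration of the rules as stated in the list, not the code from `AscendGraphRunner` or `CUDAGraphRunner`. The function names are hypothetical, and any clamping to a configured maximum batch size that the real implementations may perform is left out.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (n must be at least 1)."""
    if n < 1:
        raise ValueError("batch size must be positive")
    return 1 << (n - 1).bit_length()


def align_up(n: int, multiple: int) -> int:
    """Round n up to the next multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


def power_of_two_bucket(batch_size: int) -> int:
    """Pure power-of-2 bucketing, as described for Camb / MACA / PPU."""
    return next_power_of_two(batch_size)


def three_stage_bucket(batch_size: int) -> int:
    """Three-stage bucketing, as described for Ascend."""
    if batch_size <= 16:
        return next_power_of_two(batch_size)
    if batch_size <= 256:
        return align_up(batch_size, 16)
    return align_up(batch_size, 256)
```

With these rules, a batch of 17 requests lands in a bucket of 32 under both schemes, while a batch of 130 rounds to 144 in the three-stage scheme and to 256 under pure power-of-2, so the three-stage scheme wastes fewer padded slots at medium batch sizes.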

### `_runner_map` and graph lifecycle

`_runner_map` maps `(compatible_batch_size, is_decoding, ...)` to a single
graph runner. On first encounter the runner captures the graph; on
subsequent encounters it replays the cached graph.
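
The capture-once, replay-afterwards lifecycle can be modelled without any device code. The sketch below is a simplified assumption of the control flow: `GraphRunnerStandIn` and `SingleRunnerStandIn` are hypothetical classes, the key holds only the bucket and the decoding flag, and non-decoding steps simply fall back to eager execution.

```python
class SingleRunnerStandIn:
    """Hypothetical stand-in for a single captured graph."""

    def __init__(self, bucket):
        self.bucket = bucket

    def capture(self, batch):
        # The real runner would record the model forward pass here.
        return f"capture@{self.bucket}"

    def forward(self, batch):
        # The real runner would refill buffers and replay the graph here.
        return f"replay@{self.bucket}"


class GraphRunnerStandIn:
    """Hypothetical runner that keeps one captured graph per bucket."""

    def __init__(self, bucket_fn):
        self._runner_map = {}
        self._bucket_fn = bucket_fn

    def __call__(self, batch, is_decoding=True):
        if not is_decoding:
            return "eager"
        key = (self._bucket_fn(len(batch)), is_decoding)
        runner = self._runner_map.get(key)
        if runner is None:
            # First time this bucket is seen: create the runner and capture.
            runner = SingleRunnerStandIn(key[0])
            self._runner_map[key] = runner
            return runner.capture(batch)
        # Bucket already captured: replay the cached graph.
        return runner.forward(batch)


runner = GraphRunnerStandIn(lambda n: 1 << (n - 1).bit_length())
results = [
    runner([0] * 3),                     # bucket 4, first encounter
    runner([0] * 4),                     # bucket 4 again
    runner([0] * 5),                     # bucket 8, first encounter
    runner([0] * 3, is_decoding=False),  # prefill-like step
]
```

Batches of 3 and 4 share one graph because both round up to bucket 4, which is the reuse that bucketing is meant to provide.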

---

## Buffer Layer

### Two categories of tensors

| Category | Shape changes with batch size? | Needs buffer? |
|---|---|---|
| KV cache (`past_key_values`) | No — allocated once at max size | No |
| `q_seqlens`, `kv_seqlens`, `block_offsets`, … | Yes | Yes |

KV cache is passed through unchanged. Variable-shape tensors must be backed
by fixed-shape buffers so the captured graph always sees the same memory
addresses and shapes.

### The three buffer methods

**`make_buffers_cudagraph`** — called once during graph capture setup.
Allocates fixed-shape tensors on device (at `max_batchs` / `max_tokens`
size) and stores them in `graph_meta.input_buffers`.

**`fill_buffers_cudagraph`** — called before every capture and every
replay. Copies real data from the actual forward inputs into the
pre-allocated buffers. Pads unused slots with safe defaults (e.g. repeating
`max_tokens // max_batchs` for padding seqlens; initialising `kv_start_indices`
to -1 so that padding slots never corrupt KV cache slot 0).

**`update_context_cudagraph`** — called before every capture and replay.
Updates `StepContext` to point at the buffer tensors so that downstream ops
(e.g. attention) read from the right memory.

If you introduce a new tensor input that varies with batch size, all three
methods must be updated in sync.
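
How the three methods cooperate can be sketched with plain Python lists standing in for device tensors. The function names, the choice of fields, and the context dictionary are illustrative assumptions. The real methods operate on `graph_meta`, torch tensors, and `StepContext`.

```python
def make_buffers(max_batchs: int) -> dict:
    """Allocate fixed-size buffers once; lists stand in for device tensors."""
    return {
        "q_seqlens": [0] * max_batchs,
        "kv_start_indices": [-1] * max_batchs,
    }


def fill_buffers(buffers, q_seqlens, kv_start_indices, max_tokens):
    """Copy real data in place and reset padding slots to safe defaults."""
    max_batchs = len(buffers["q_seqlens"])
    pad = max_batchs - len(q_seqlens)
    pad_seqlen = max_tokens // max_batchs
    # Slice assignment keeps the same list object, mirroring an in-place
    # copy into a tensor whose address the captured graph depends on.
    buffers["q_seqlens"][:] = list(q_seqlens) + [pad_seqlen] * pad
    buffers["kv_start_indices"][:] = list(kv_start_indices) + [-1] * pad
    return buffers


def update_context(context: dict, buffers: dict) -> dict:
    """Point the context at the buffers rather than at the raw inputs."""
    context.update(buffers)
    return context


buffers = make_buffers(max_batchs=4)
q_buffer = buffers["q_seqlens"]

fill_buffers(buffers, [1, 1], [7, 12], max_tokens=4)
context = update_context({}, buffers)
first_step_indices = list(context["kv_start_indices"])

# Next step with three live requests: same objects, new contents.
fill_buffers(buffers, [1, 1, 1], [8, 13, 40], max_tokens=4)
second_step_indices = list(context["kv_start_indices"])
```

The context never needs to see the variable-length inputs: it holds references to the fixed buffers, and only their contents change between steps.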

---

## Capture Flow

```text
GraphRunner.__call__
  └─ compatible_size = get_compatible_size(batch_size)
  └─ _runner_map[compatible_size] not found → create AscendSingleGraphRunner
     (or CUDASingleGraphRunner for Camb / MACA / PPU)
       ├─ make_buffers_cudagraph(graph_meta)   ← allocate fixed buffers once
       ├─ fill_buffers_cudagraph(...)          ← copy real data into buffers
       ├─ update_context_cudagraph(...)        ← point StepContext at buffers
       ├─ warmup forward (outside graph scope)
       └─ with torch.cuda.graph() / torch.npu.NPUGraph():
            model.forward(...)                 ← ops captured here
            make_output_buffers(output)        ← store output tensor refs
```

---

## Replay Flow

```text
GraphRunner.__call__
  └─ compatible_size = get_compatible_size(batch_size)
  └─ _runner_map[compatible_size] found → AscendSingleGraphRunner.forward()
       ├─ fill_buffers_cudagraph(...)          ← update buffer contents
       ├─ update_context_cudagraph(...)        ← re-point StepContext
       ├─ [Ascend only] update kv_seqlens in-place (see next section)
       ├─ _graph.replay()                      ← execute captured ops
       └─ get_outputs_cudagraph(...)           ← slice output to actual token count
```

> **Note**: `get_outputs_cudagraph` is a simple output-slicing step. It
> reads `output_buffers['hidden_states']` and slices `[:, :num_tokens]`.
> For most vendors this is identical to the lmdeploy default.
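
The slicing step can be illustrated with nested lists in place of a tensor. This is a sketch under the assumption that `hidden_states` is laid out as `[1, max_tokens, hidden_size]`. `get_outputs` is a hypothetical helper, and the list comprehension plays the role of `[:, :num_tokens]`.

```python
def get_outputs(output_buffers: dict, num_tokens: int) -> list:
    """Drop the padded token positions from the replayed output."""
    hidden_states = output_buffers["hidden_states"]
    # Equivalent to hidden_states[:, :num_tokens] on a tensor.
    return [row[:num_tokens] for row in hidden_states]


# Four token slots were captured, but only two requests are live.
output_buffers = {
    "hidden_states": [[[0.1, 0.2], [0.3, 0.4], [9.9, 9.9], [9.9, 9.9]]]
}
sliced = get_outputs(output_buffers, num_tokens=2)
```

The padded positions still hold whatever the graph computed for the padding slots, so they have to be removed before the output reaches sampling.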

---

## Ascend — kv_seqlens Update During Replay

For Camb and MACA, writing updated values into the input buffer before
replay is sufficient — the graph reads from the live device buffer automatically.
Ascend is different: the attention operator takes `actual_seq_lengths_kv`
as a CPU tensor or list, not as part of the NPU input buffer. An NPU buffer
write cannot reach this CPU-side parameter, so the new values must be
explicitly pushed into the captured graph via a dedicated update API.

Two mechanisms exist, selected at runtime by `aclgraph_use_torch_npu_update()`:

**torch_npu < 2.8.0.post1** — uses the low-level ACL graph task update API:

```python
graph_task_update_begin(graph_handle)
update_attn_params(kv_seqlens, ...) # writes via ACL
graph_task_update_end(graph_handle)
```

**torch_npu ≥ 2.8.0.post1** — uses the higher-level torch_npu graph update
API:

```python
graph.update(cpu_update_input=[{"actual_seq_lengths_kv": kv_seqlens}])
```
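
When debugging, it can help to work out by hand which path a given installation should take. The helper below is a hypothetical sketch of the version comparison only. It is not the implementation of `aclgraph_use_torch_npu_update()`, and it assumes version strings of the form `X.Y.Z` or `X.Y.Z.postN`, ignoring any other suffix.

```python
import re

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?")


def parse_release(version: str) -> tuple:
    """Turn 'X.Y.Z' or 'X.Y.Z.postN' into a comparable 4-tuple."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch, post = match.groups()
    # A release without a post segment sorts before its post-releases.
    return (int(major), int(minor), int(patch), int(post or 0))


def expects_torch_npu_update(version: str) -> bool:
    """True if the version is at or above the 2.8.0.post1 threshold."""
    return parse_release(version) >= (2, 8, 0, 1)
```

Passing the installed version string (for example the value of `torch_npu.__version__`, if the package exposes it in this form) indicates whether the `graph.update()` path or the ACL task-update path is the expected one.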

---

## Vendor Comparison

| Item | Ascend | Camb | MACA |
|---|---|---|---|
| Runner | `AscendGraphRunner` | `CUDAGraphRunner` | `CUDAGraphRunner` |
| Graph API | `npu.NPUGraph` | `cuda.CUDAGraph` | `cuda.CUDAGraph` |
| `compatible_size` | 3-stage (p2/16-align/256-align) | power-of-2 | power-of-2 |
| `attn_metadata` slicing | not sliced | sliced | sliced |
| `kv_start_indices` | `(max_batchs,)` | `(max_batchs,)` | `(max_batchs, 1)` |
| `max_kv_seq_len` | kept as-is | set to -1 | kept as-is |
| `x_active_mask` buffer | Yes | No | No |
| kv_seqlens update | `update_attn_params` / `graph.update()` | write | write |

---

## Points to Note

1. **`kv_start_indices` must be initialised to -1, not 0.** Index 0 is a
valid KV cache slot; padding slots initialised to 0 will silently corrupt
it.

2. **`max_kv_seq_len` must be -1 for Camb.** This integer is captured as
a constant node in the graph at capture time. The `torch_mlu_ops` API
treats any value ≤ 0 as "compute the max dynamically from `kv_seqlens`";
setting it to the actual max at capture time would make it wrong at
every subsequent replay step.

3. **All three buffer methods must be updated together.** If you add a new
tensor that varies with batch size, `make_buffers` must allocate the
buffer, `fill_buffers` must copy data into it, and `update_context` must
point `StepContext` at it. Missing any one of the three will cause
incorrect behaviour or a silent read from stale data.

4. **Graph capture happens at `compatible_size`, not at actual batch size.**
Batch sizes are rounded up to a bucket. Do not compare `new_batch_size`
directly to `max_batchs` — use the compatible-size logic instead.

5. **Ascend kv_seqlens update version check.** When debugging Ascend graph
mode failures involving wrong attention outputs, check which torch_npu
version is in use and verify the correct update path is taken in
`AscendSingleGraphRunner`.

6. **Eager mode is always available as a reference.** If graph mode produces
wrong outputs, run the same step in eager mode (`eager_mode=True`) to
confirm whether the bug is in graph capture/replay or in the underlying
ops.
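
Point 2 is easier to reason about with a toy model of capture. The sketch below is not a real graph API: `RecordedGraphStandIn` is a hypothetical class that records callables and replays them. It shows why a Python integer computed at capture time stays frozen, while a value read from a fixed buffer follows in-place updates.

```python
class RecordedGraphStandIn:
    """Toy graph: records zero-argument callables and replays them in order."""

    def __init__(self):
        self._ops = []

    def record(self, op):
        self._ops.append(op)

    def replay(self):
        return [op() for op in self._ops]


kv_seqlens = [5, 9, 0, 0]          # fixed buffer, updated in place
max_kv_seq_len = max(kv_seqlens)   # plain int computed at capture time

graph = RecordedGraphStandIn()
# The int is bound by value through the default argument, so it is frozen.
graph.record(lambda frozen=max_kv_seq_len: frozen)
# The buffer is read when the op runs, so in-place updates are visible.
graph.record(lambda: max(kv_seqlens))

# Next decode step: contents change, the buffer object does not.
kv_seqlens[:] = [6, 10, 0, 0]
frozen_value, live_value = graph.replay()
```

After the update the frozen value is still 9 while the live value is 10. That mismatch is what passing -1 avoids on Camb, because the operator then derives the maximum from `kv_seqlens` at run time.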