diff --git a/.claude/skills/en/precision-align/SKILL.md b/.claude/skills/en/precision-align/SKILL.md
new file mode 100644
index 00000000..e027828a
--- /dev/null
+++ b/.claude/skills/en/precision-align/SKILL.md
@@ -0,0 +1,447 @@
+---
+name: precision-align
+description: Debug precision regressions in lmdeploy+dlinfer on domestic AI hardware (Ascend / CAMB / MACA) by comparing against a reference implementation.
+---
+# precision-align
+
+You are helping the user fix a precision bug in lmdeploy+dlinfer on a domestic
+AI hardware backend (Ascend, CAMB, or MACA). The reference implementation is
+typically vllm+vllm-ascend (for Ascend) or another agreed-upon reference.
+The goal is to identify where lmdeploy+dlinfer diverges from the reference
+and fix it.
+
+The examples in this skill use Ascend as the concrete hardware. Apply the same
+methodology to CAMB and MACA by substituting the appropriate vendor paths
+and ops.
+
+---
+
+## Step 1 — Gather information
+
+Ask the user:
+
+1. **Which model** are you aligning? (e.g., `qwen3`, `deepseek_v2`)
+2. **Which hardware** are you targeting? (ascend / camb / maca)
+3. **What is the symptom?** — e.g., output tokens differ from the first token,
+   answers become nonsensical after a few tokens, accuracy benchmark score
+   dropped by X points.
+4. **Parallelism configuration**: what TP / DP / EP values are you using?
+5. **Any preliminary observations?** First token already wrong (→ prefill
+   issue), or diverges after a few correct tokens (→ decode / KV cache)?
+6. **Single-batch or multi-batch?** Can you reproduce the issue with a single
+   request (batch_size=1), or does it only appear when multiple requests are
+   batched together?
+
+Do not proceed until these are answered.
+
+---
+
+## Step 2 — Verify environment setup
+
+Before any debugging, confirm the comparison environment is controlled.
+Both sides must be identical except for the framework under test:
+
+| Condition                 | lmdeploy+dlinfer        | vllm+vllm-ascend      |
+|---------------------------|-------------------------|-----------------------|
+| Same SoC version          | ✓                       | ✓                     |
+| Warmup disabled           | ✓                       | ✓                     |
+| Eager mode                | ✓ (`--eager-mode true`) | ✓ (`--enforce-eager`) |
+| Same TP / DP / EP         | ✓                       | ✓                     |
+| `temperature=0`,`top_k=1` | ✓                       | ✓                     |
+| Same prompt / input       | ✓                       | ✓                     |
+
+If any condition is unmet, fix it first. Warmup leaves stale KV cache entries;
+temperature > 0 introduces sampling randomness — both mask real precision bugs.
+
+---
+
+## Step 3 — Quick output comparison
+
+Run both frameworks on the **same prompt** and compare the generated tokens
+directly.
+
+- **Tokens match** → output is consistent; precision is likely fine. Suggest
+  running opencompass or evalscope for benchmark scoring.
+- **Tokens differ** → proceed to Step 4.
+
+---
+
+## Step 4 — Diagnose root cause
+
+Map the symptom to a debugging path:
+
+| Symptom                            | Most likely cause           | Path   |
+|------------------------------------|-----------------------------|--------|
+| First token already wrong          | Prefill operator precision  | B      |
+| First token correct, then diverges | KV cache or decode op       | A or B |
+| Divergence grows with seq length   | KV cache or op precision    | A or B |
+| Divergence at a fixed depth        | Operator precision          | B      |
+| Only wrong at TP > 1 / dp×tp / ep  | Communication / parallelism | C      |
+| Only wrong with multiple requests  | Batching / seqlen / masking | A or B |
+
+**Important**: accumulating divergence does **not** imply KV cache pollution.
+An operator precision bug can also compound over decode steps — for example,
+a rope embedding op silently falling back to CPU produces slightly wrong
+position encodings that accumulate into a visible accuracy drop (observed on
+Qwen30B-A3B: a 2-point drop on LiveCodeBench traced to cos/sin computed on
+CPU). Do not assume Path A without ruling out Path B first.
+
+**If unsure**:
+
+- Start with a single-batch request to eliminate batching interactions.
+- Start with the simplest parallelism configuration that can load the model
+  weights (see Path C for the parallelism hierarchy). Some large models cannot
+  fit at TP=1, so "simplest" means fewest parallelism dimensions that still
+  fits the weights.
+- Then go to Path B starting from layer 0.
+
+---
+
+## Path A — KV cache pollution
+
+KV cache pollution means `fill_kv_cache` was called with mismatched indices,
+writing tokens into wrong cache slots (stomping). The `fill_kv_cache` kernel
+itself is generally not the source of bugs — the problem is almost always in
+the indices passed to it.
+
+### What to check
+
+Read these two files:
+
+- `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/fill_kv_cache.py`
+- `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/pagedattention.py`
+
+Key parameters:
+
+- `kv_start_indices` in `fill_kv_cache`: flat index of each token's cache
+  slot.
+- `block_offsets`, `q_start_loc`, `q_seq_len`, `kv_seq_len` in
+  `prefill_attention` and `paged_token_attention` / `paged_attention_fwd`.
+
+**Multi-batch note**: if the bug only appears with multiple requests, pay extra
+attention to per-request seqlen tracking (`q_seq_len`, `kv_seq_len`) and
+`kv_start_indices`. A wrong per-request length causes attention to read from
+the wrong positions in the KV cache.
+
+### How to debug
+
+Do **not** dump the KV cache tensors — they are prohibitively large. Instead,
+dump the three tensors immediately **before** the `fill_kv_cache` call at the
+suspect layer:
+
+```python
+dump("key_states",       key_states)        # shape: [num_tokens, num_kv_heads, head_size]
+dump("value_states",     value_states)      # shape: [num_tokens, num_kv_heads, head_size]
+dump("kv_start_indices", kv_start_indices)  # shape: [num_tokens]
+```
+
+**Key check**: `kv_start_indices.shape[0]` must equal `key_states.shape[0]`.
+A mismatch means the index count does not match the token count being written,
+which causes fill-time stomping and corrupts subsequent decode steps.
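+
+The key check above can be scripted against the dumped tensors. The following
+is a minimal sketch, not part of lmdeploy or dlinfer: `check_kv_indices` is a
+hypothetical helper, and the duplicate-slot test rests on the assumption that
+every token in one fill call owns a distinct slot (padding entries, if the
+backend uses any, may legitimately violate it).
+
+```python
+import torch
+
+
+def check_kv_indices(key_states: torch.Tensor,
+                     kv_start_indices: torch.Tensor) -> list:
+    """Return a list of problems; an empty list means both checks passed."""
+    problems = []
+    # One index per token being written.
+    if kv_start_indices.shape[0] != key_states.shape[0]:
+        problems.append(
+            f"length mismatch: {kv_start_indices.shape[0]} indices "
+            f"for {key_states.shape[0]} tokens")
+    # Assumption: a repeated index means two tokens target the same slot.
+    flat = kv_start_indices.flatten()
+    if flat.unique().numel() != flat.numel():
+        problems.append("duplicate slot indices")
+    return problems
+```
+
+Run it on the tensors loaded from the dump files of the suspect layer; any
+reported problem is a reason to inspect how `kv_start_indices` was built.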
+
+---
+
+## Path B — Operator precision
+
+The goal is to find the **first op** where lmdeploy+dlinfer diverges from the
+reference.
+
+**Single-batch first**: if the issue is reproducible with a single request,
+debug at batch_size=1. This eliminates batching interactions and simplifies
+seqlen shapes.
+
+### Strategy: start at layer 0
+
+Start at **layer 0** of the first linear-attention or full-attention block.
+Do not start at the midpoint: most of the model's layers share the same
+operator set, so layer 0's result is representative. If layer 0 is clean,
+most other layers will be too; if layer 0 already diverges, fix it before
+searching deeper.
+
+1. At layer 0, dump after each sub-op in order: RMSNorm → Attention → MLP.
+2. Compare with the reference framework at the same layer
+   (e.g. vllm+vllm-ascend for Ascend).
+   - Sub-op diverges at layer 0 → that is the first divergent op; investigate
+     it.
+   - Layer 0 is fully clean → use binary search across later layers (check
+     layer N/2, then narrow down) to find the first divergent layer.
+3. After identifying the faulty op, selectively verify one or two more layers
+   that might behave differently (e.g. the last layer, MoE layers if
+   applicable).
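+
+The search order described above (layer 0 first, then bisection) can be
+sketched as follows. `layer_matches` is a hypothetical callback that compares
+the dumped output of one layer against the reference and returns `True` when
+they agree. The bisection assumes that divergence persists: once a layer's
+output differs, every later layer's output differs too.
+
+```python
+from typing import Callable, Optional
+
+
+def first_divergent_layer(num_layers: int,
+                          layer_matches: Callable[[int], bool]) -> Optional[int]:
+    """Return the index of the first divergent layer, or None if all match."""
+    if not layer_matches(0):
+        return 0  # layer 0 is always checked first
+    if layer_matches(num_layers - 1):
+        return None  # the last layer still matches, so nothing diverged
+    lo, hi = 0, num_layers - 1  # invariant: layer lo matches, layer hi diverges
+    while hi - lo > 1:
+        mid = (lo + hi) // 2
+        if layer_matches(mid):
+            lo = mid
+        else:
+            hi = mid
+    return hi
+```
+
+Each call to `layer_matches` typically costs one pair of dump runs, so the
+bisection keeps the number of runs logarithmic in the layer count.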
+
+### Comparison method
+
+For **deterministic vendor ops** (e.g. `torch_npu` on Ascend): use
+`torch.equal()`. These ops must produce bit-identical outputs. Any difference
+is a real bug.
+
+For **non-deterministic ops** (e.g. triton, less common on Ascend):
+`torch.equal()` may be too strict due to FP rounding. Check error magnitude
+instead:
+
+```python
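+# Upcast first: 1e-8 is not representable in fp16 (it rounds to 0), so zero
+# entries in b would otherwise divide by zero.
+a32, b32 = a.float(), b.float()
+diff = (a32 - b32).abs()
+print("max abs:", diff.max().item())
+print("max rel:", (diff / b32.abs().clamp(min=1e-8)).max().item())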
+```
+
+A relative error below ~1e-3 is generally acceptable; above that it is a real
+divergence.
+
+### Once the divergent op is found
+
+Read its implementation stack:
+
+- `lmdeploy/lmdeploy/pytorch/backends/dlinfer/<vendor>/` — the `Impl` class
+- `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/` — the thin kernel wrapper
+- `dlinfer/dlinfer/vendor/<vendor>/` — the actual hardware op call
+  (e.g. for Ascend: `ascend/torch_npu_ops.py`)
+
+Dump the **inputs** to that op in both frameworks and check whether they are
+identical. If inputs differ, the bug is upstream; if inputs are identical but
+outputs differ, the bug is in the op itself (wrong argument order, dtype, or
+shape).
+
+---
+
+## Path C — Communication / parallelism
+
+Precision bugs that only appear with certain parallelism configurations point
+to communication or parallelism-patch issues. Before debugging, understand
+which parallelism level introduces the problem.
+
+### lmdeploy parallelism terminology
+
+lmdeploy supports three parallelism dimensions used in combination:
+
+- **TP only** (EP=1, DP=1): attention and FFN are both sharded across `tp`
+  GPUs. Total GPUs = tp.
+- **dp×tp** (EP=1, DP>1): attention is split into `dp` data-parallel groups,
+  each with `tp_size = tp / dp` ranks. When EP=1, `tp` in the config equals
+  the total GPU count.
+- **dp×tp + ep** (EP>1): attention uses dp×tp as above; FFN/MoE experts are
+  further sharded across `ep` groups. When EP>1, `tp` in the config is the
+  tp_size **per DP group** (not the total GPU count).
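+
+The arithmetic implied by these cases can be written out as a small helper.
+This is only a transcription of the rules stated in this section, not code
+taken from lmdeploy: `attention_layout` and its dictionary keys are
+hypothetical, and the divisibility check is an added assumption.
+
+```python
+def attention_layout(tp: int, dp: int, ep: int) -> dict:
+    """Derive the per-DP-group tp_size and attention device count."""
+    if ep == 1:
+        # `tp` is the total device count; each DP group gets an equal share.
+        if tp % dp != 0:
+            raise ValueError("tp must be divisible by dp when ep == 1")
+        return {"tp_size_per_dp_group": tp // dp, "attention_devices": tp}
+    # ep > 1: `tp` is already the per-DP-group size.
+    return {"tp_size_per_dp_group": tp, "attention_devices": dp * tp}
+```
+
+For example, `tp=8, dp=2, ep=1` gives two DP groups of four ranks on eight
+devices, whereas `tp=2, dp=4, ep=8` gives four DP groups of two ranks.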
+
+### Isolation strategy
+
+Not all models fit at TP=1. Work through the parallelism hierarchy from
+simplest to most complex, stopping at the level that introduces the bug:
+
+1. **TP only** (simplest that fits the weights): run both frameworks with
+   TP=N, DP=1, EP=1 (where N is the minimum number of GPUs needed to load
+   the model).
+   - Bug present → issue is in TP operator sharding or all_reduce; go to
+     operator dumps (Path B) focusing on the all_reduce outputs.
+   - Clean → proceed to step 2.
+
+2. **dp×tp** (add DP): increase DP while keeping EP=1.
+   - Bug appears → issue is in DP+TP interaction; check communication between
+     DP groups. Read
+     `dlinfer/dlinfer/framework/lmdeploy_ext/device/<vendor>.py` for relevant
+     patches.
+   - Clean → proceed to step 3 (only for MoE models).
+
+3. **dp×tp + ep** (add EP): enable EP>1.
+   - Bug appears → issue is in expert parallelism or EP communication. Read
+     the MoE forward class in `device/<vendor>.py` (e.g.
+     `AscendMoEForwardDPTP` in `device/ascend.py`) and verify the MoE routing
+     and reduce-scatter pattern.
+
+### Dummy data in idle DP groups
+
+When dp > 1, lmdeploy fills DP groups that have no real requests with
+**dummy data of sequence length 1**. vllm-ascend uses a similar mechanism.
+This is expected behaviour, not a bug. When dumping tensors across DP groups:
+
+- Idle DP groups will show tensors with a leading dimension of 1 — do not
+  mistake this for a seqlen mismatch.
+- Only compare tensor values in DP groups that are actually processing real
+  tokens.
+- If a precision discrepancy appears specifically in the idle-group dummy
+  path, verify that both frameworks use the same dummy length and that the
+  dummy data does not pollute the real groups' KV cache slots.
+
+### When TP=1 is impossible
+
+If the model is too large to fit at TP=1, start at the minimum TP that loads
+the weights and compare it against the same TP on the reference side. You can
+still isolate DP and EP by fixing TP and varying DP/EP independently.
+
+### What to read
+
+- `dlinfer/dlinfer/framework/lmdeploy_ext/device/<vendor>.py` — patches for
+  distributed behaviours specific to the hardware (e.g. for Ascend: `ascend.py`
+  containing `AscendMoEForwardDPTP` for MoE communication).
+- Dump outputs immediately **before and after** each all_reduce / all_gather
+  call across ranks to find where values first diverge.
+
+---
+
+## Tensor dump mechanics
+
+**Always dump to files. Never use `print` or `logger`.**
+
+On multi-rank runs, log output from all ranks interleaves and individual
+tensor values are lost. Use `torch.save` to per-rank files instead.
+
+```python
+import os
+
+import torch
+import torch.distributed as dist
+
+_DUMP_DIR = "/tmp/dlinfer_dump"
+os.makedirs(_DUMP_DIR, exist_ok=True)
+
+def dump(name: str, tensor: torch.Tensor):
+    rank = dist.get_rank() if dist.is_initialized() else 0
+    torch.save(tensor.detach().cpu(), f"{_DUMP_DIR}/{name}_rank{rank}.pt")
+```
+
+**Naming convention**: `{layer}_{op}_{in|out}_rank{rank}.pt` (`dump()`
+appends the `_rank{rank}.pt` suffix itself)
+
+Example: `layer0_attn_out_rank0.pt` for lmdeploy+dlinfer, same name in a
+separate directory for vllm+vllm-ascend, so files are easy to pair.
+
+**Loading and comparing**:
+
+```python
+import torch
+
+a = torch.load("dlinfer/layer0_attn_out_rank0.pt")
+b = torch.load("vllm/layer0_attn_out_rank0.pt")
+
+# deterministic vendor ops (e.g. torch_npu on Ascend): expect exact match
+print(torch.equal(a, b))
+
+# triton / float ops: check error magnitude
+# Upcast first: 1e-8 is not representable in fp16 (it rounds to 0), so zero
+# entries in b would otherwise divide by zero.
+a32, b32 = a.float(), b.float()
+diff = (a32 - b32).abs()
+print("max abs:", diff.max().item())
+print("max rel:", (diff / b32.abs().clamp(min=1e-8)).max().item())
+```
+
+**Placement of dump calls**: add dumps inside the
+`lmdeploy/lmdeploy/pytorch/backends/dlinfer/<vendor>/` `Impl` classes — right
+after the kernel call and before returning. This gives the output in
+framework-native shape and is above the vendor-specific layer.
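+
+When both frameworks dump under the same file names into separate
+directories, pairing and comparing can be automated. The sketch below is one
+possible way to do it; `compare_dump_dirs` is a hypothetical helper, and it
+assumes every `.pt` file holds a single tensor saved with `torch.save`.
+
+```python
+from pathlib import Path
+
+import torch
+
+
+def compare_dump_dirs(dir_a: str, dir_b: str) -> dict:
+    """Pair *.pt files by name and report the max absolute difference."""
+    results = {}
+    for path_a in sorted(Path(dir_a).glob("*.pt")):
+        path_b = Path(dir_b) / path_a.name
+        if not path_b.exists():
+            results[path_a.name] = "missing in second directory"
+            continue
+        a = torch.load(path_a, map_location="cpu")
+        b = torch.load(path_b, map_location="cpu")
+        if a.shape != b.shape:
+            results[path_a.name] = (
+                f"shape mismatch: {tuple(a.shape)} vs {tuple(b.shape)}")
+        else:
+            # Upcast so half-precision dumps are compared in float32.
+            results[path_a.name] = (a.float() - b.float()).abs().max().item()
+    return results
+```
+
+A value of `0.0` means the pair is numerically equal after upcasting; string
+entries flag files that could not be compared at all.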
+
+---
+
+## Checklist
+
+- [ ] Same SoC, warmup disabled, `--eager-mode true`, same TP/DP/EP,
+      temperature=0 / top_k=1 confirmed
+- [ ] Single-batch reproduction attempted first
+- [ ] Output token comparison done on the same prompt
+- [ ] Parallelism hierarchy tested from simplest fitting config upward
+- [ ] Root cause path identified: A (KV cache) / B (operator) / C
+      (communication)
+- [ ] Tensor dumps use file-based approach (not print / logger)
+- [ ] Layer 0 verified first before searching deeper layers
+- [ ] First divergent layer / op identified
+- [ ] Inputs to the divergent op verified (identical or upstream bug found)
+- [ ] Root cause fixed and output tokens re-verified
+
+---
+
+## Troubleshooting
+
+### Accumulating divergence, but Path A checks out
+
+**Symptom**: `kv_start_indices` length matches `key_states`, but divergence
+still grows with sequence length.
+
+**Cause**: Operator precision bugs (e.g. cos/sin falling back to CPU) also
+accumulate across decode steps and look identical to cache pollution from the
+outside.
+
+**Action**: Shift to Path B. Dump layer 0 sub-ops (RMSNorm → Attention → MLP)
+to confirm whether divergence begins there.
+
+---
+
+### Tensors in some DP groups show unexpected length-1 shapes
+
+**Symptom**: When dp > 1, tensors in idle DP groups have a leading dimension
+of 1 and produce odd-looking outputs.
+
+**Cause**: lmdeploy (and vllm-ascend) fill DP groups that have no real
+requests with dummy data of sequence length 1. This is expected behaviour.
+
+**Action**: Only compare tensor values in DP groups that are actually
+processing real tokens. Do not treat length-1 tensors from idle groups as
+errors.
+
+---
+
+### Precision issue at dp×tp or ep, but unclear which dimension causes it
+
+**Symptom**: Adding DP or EP introduces a precision regression, but the root
+cause dimension is unknown.
+
+**Cause**: Testing multiple parallelism dimensions simultaneously makes it
+impossible to isolate which one introduces the bug.
+
+**Action**: Fix TP, add DP first. If dp×tp is clean, then add EP. See Path C
+for the full isolation strategy.
+
+---
+
+### Dump files are empty, truncated, or contain garbled content
+
+**Symptom**: Saved dump files have no usable data, or values from multiple
+ranks are mixed together.
+
+**Cause**: Using `print` or `logger` on multi-rank runs causes output from all
+ranks to interleave and overwrite each other.
+
+**Action**: Use `torch.save` to write a separate file per rank. See the Tensor
+dump section for the pattern.
+
+---
+
+### KV cache pollution bug appears intermittently
+
+**Symptom**: Same prompt gives different results across runs; the precision
+issue is not consistently reproducible.
+
+**Cause**: Warmup leaves stale KV cache entries that corrupt subsequent
+inference runs.
+
+**Action**: Disable warmup on both lmdeploy+dlinfer and vllm+vllm-ascend,
+then retry.
+
+---
+
+### Generated tokens differ between runs on the same prompt
+
+**Symptom**: Token-level comparison is not stable; the outputs change each
+run.
+
+**Cause**: `temperature > 0` introduces sampling randomness that makes
+comparison meaningless.
+
+**Action**: Set `temperature=0, top_k=1` on both sides.
+
+---
+
+### Binary search finishes but the bug was in an early layer all along
+
+**Symptom**: After several bisection steps the divergent layer turns out to be
+very early in the model.
+
+**Cause**: Starting at layer N/2 skips checking whether layer 0 is already
+wrong.
+
+**Action**: Always verify layer 0 first. If layer 0 is clean, then apply
+binary search to the remaining layers.
+
+---
+
+### Found the suspected divergent op, inputs look identical, root cause unclear
+
+**Symptom**: Inputs to the op match between frameworks, but you cannot tell
+whether the op itself is at fault.
+
+**Cause**: Without output dumps, there is no evidence of whether the op
+produces wrong results.
+
+**Action**: Dump both inputs and outputs. If inputs match but outputs differ,
+the bug is in the op call itself — check argument order, dtype, and shape
+passed to the hardware op (e.g. NPU op on Ascend).
diff --git a/.claude/skills/en/support-new-model/SKILL.md b/.claude/skills/en/support-new-model/SKILL.md
new file mode 100644
index 00000000..7adfa59b
--- /dev/null
+++ b/.claude/skills/en/support-new-model/SKILL.md
@@ -0,0 +1,255 @@
+---
+name: support-new-model
+description: Add support for a new model (already in lmdeploy's CUDA backend) on domestic AI hardware (Ascend / CAMB / MACA) via dlinfer.
+---
+# support-new-model
+
+You are helping the user adapt a new LLM or VLM for dlinfer's supported
+hardware backends (Ascend NPU, CAMB MLU, MACA GPU). The model already runs
+on CUDA via lmdeploy — your job is to identify what is missing for the target
+vendor and implement it.
+
+---
+
+## Step 1 — Gather information
+
+Ask the user:
+
+1. **Which model** are you adding support for? (name as it appears in
+   `lmdeploy/pytorch/models/`, e.g. `qwen3`, `deepseek_v2`)
+2. **Which vendor(s)** are you targeting? (ascend / camb / maca — may be
+   multiple)
+
+Do not proceed until both questions are answered.
+
+---
+
+## Step 2 — Analyse the model
+
+Read all of the following files yourself using Read/Bash tools — do not ask
+the user:
+
+```text
+lmdeploy/lmdeploy/pytorch/models/<model>.py
+lmdeploy/lmdeploy/pytorch/backends/dlinfer/op_backend.py
+lmdeploy/lmdeploy/pytorch/backends/dlinfer/<vendor>/op_backend.py  ← per vendor
+```
+
+The full call chain is: `models/<model>.py` → `lmdeploy/pytorch/nn/`
+→ `backends/dlinfer/` → `kernels/dlinfer/` → `dlinfer/ops/` → `vendor/`.
+If the connection between a model layer and its backend op is unclear, trace
+through `lmdeploy/pytorch/nn/` to find the intermediate abstraction.
+`lmdeploy/pytorch/kernels/default/` contains the CUDA reference
+implementations and is useful as a specification when writing new vendor ops.
+
+### From `models/<model>.py`, identify
+
+- Every non-trivial operator the model uses: attention variants (paged, flash,
+  MLA), MLP activation functions, RMS norm variants, MoE routing, rotary
+  embedding variants (standard, MROPE, multi-scale), quantization ops.
+- Whether the model passes any fields through `StepContext` or `attn_metadata`
+  beyond the standard set: `input_ids`, `position_ids`, `block_offsets`,
+  `q_seqlens`, `kv_seqlens`, `kv_start_indices`.
+  Known extra fields already handled: `state_ids` (SSM),
+  `mrope_position_ids` (MROPE),
+  `cu_seqlens` / `has_initial_state` (Gated Delta Networks).
+
+### From the generic `op_backend.py`, check
+
+- `get_layer_impl_builder()`: which `OpType`s already have a dlinfer `Impl`.
+  Cross-reference with the op list above to identify gaps → **Path A**.
+
+### From `<vendor>/op_backend.py`, check each of the following carefully
+
+- **`update_step_context()`**: this method builds `attn_metadata` and (for
+  Ascend) `moe_metadata` for every inference step. Verify that it correctly
+  handles all fields the new model requires. If the model introduces new
+  context fields or a new attention mode (e.g. a new `is_gated_delta`-style
+  flag), this method must be extended → **Path B**.
+- **`get_k_block_shape()` / `get_v_block_shape()`**: confirm the KV cache
+  layout matches what the model's attention implementation expects. Different
+  vendors and even different SoC generations (Ascend A2 vs A3, 310P) may use
+  different layouts → **Path B** if wrong.
+- **`AscendKVQuantMeta`** (Ascend only): if the model uses KV cache
+  quantization with a scale/offset format different from the current
+  implementation → **Path B**.
+
+Summarise your findings to the user before writing any code:
+
+- Op gaps (→ Path A)
+- Vendor `op_backend.py` gaps (→ Path B)
+- Framework-level gaps (→ Path C)
+
+---
+
+## Path A — Add missing ops (4-layer stack)
+
+Follow this path for every op that is absent from `get_layer_impl_builder()`.
+
+Implement each layer in top-to-bottom order:
+
+### Layer 1 — `lmdeploy/lmdeploy/pytorch/backends/dlinfer/`
+
+Add a new `XxxImpl` (inherits lmdeploy base `Impl`) and `XxxBuilder`
+(with `build()`).
+Register the builder in `op_backend.py`'s `get_layer_impl_builder()`
+dispatcher.
+Reference: `activation.py` (simplest), `norm.py`, `attention.py` (most
+complex).
+
+### Layer 2 — `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/`
+
+Add a thin wrapper function calling `dlinfer.ops.<op_name>(...)`.
+Export it from `__init__.py`.
+
+### Layer 3 — `dlinfer/dlinfer/ops/llm.py`
+
+Register with `@register_custom_op("dlinfer::<op_name>", [...])`.
+Forward to `vendor_ops_registry["<op_name>"]`.
+**The key string must exactly match the function name in Layer 4.**
+
+### Layer 4 — `dlinfer/dlinfer/vendor/<vendor>/`
+
+Add `@register_ops(vendor_ops_registry)` implementation calling the vendor's
+native op:
+
+- **Ascend**: `torch.ops.npu.*` — see `vendor/ascend/torch_npu_ops.py`
+- **CAMB**: `tmo.*` (`torch_mlu_ops`) — see `vendor/camb/camb_ops.py`
+- **MACA**: `mcoplib.*` — see `vendor/maca/maca_ops.py`
+
+**Ascend**: before writing any new op in `torch_npu_ops.py`, ask the user to
+provide the official NPU operator documentation for that op. Implement
+strictly according to the docs: parameter names, tensor shapes, and dtype
+constraints are not always inferrable from existing code and a mismatch causes
+hard-to-debug runtime errors.
+
+For complex ops (e.g. Ascend attention with graph-mode bookkeeping), split
+logic into a helper module (e.g. `vendor/ascend/attention.py`) and import
+from `torch_npu_ops.py`.
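+
+The name-matching requirement between Layer 3 and Layer 4 can be illustrated
+with a toy registry. Everything below is a self-contained stand-in written for
+illustration, not dlinfer's actual code: the local `vendor_ops_registry`,
+`register_ops`, `dispatch`, and `toy_scale` only mimic the pattern described
+above, under the assumption that an op is registered under its function name.
+
+```python
+from typing import Callable, Dict
+
+# Toy stand-in for the vendor registry: maps op name -> implementation.
+vendor_ops_registry: Dict[str, Callable] = {}
+
+
+def register_ops(registry: Dict[str, Callable]):
+    """Toy decorator that registers a function under its own __name__."""
+    def decorator(fn: Callable) -> Callable:
+        registry[fn.__name__] = fn
+        return fn
+    return decorator
+
+
+# "Layer 4": vendor implementation, registered as "toy_scale".
+@register_ops(vendor_ops_registry)
+def toy_scale(x: float, factor: float) -> float:
+    return x * factor  # placeholder; a real op would call the vendor kernel
+
+
+# "Layer 3": the lookup key must equal the Layer 4 function name.
+def dispatch(op_name: str, *args):
+    if op_name not in vendor_ops_registry:
+        raise KeyError(f"no vendor op registered as {op_name!r}; "
+                       f"known ops: {sorted(vendor_ops_registry)}")
+    return vendor_ops_registry[op_name](*args)
+```
+
+In this toy, `dispatch("toy_scale", 2.0, 3.0)` succeeds, while a misspelt key
+such as `"toy_scal"` fails at lookup time, which is the failure mode the
+exact-match rule guards against.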
+
+---
+
+## Path B — Vendor-specific `op_backend.py` changes
+
+File: `lmdeploy/lmdeploy/pytorch/backends/dlinfer/<vendor>/op_backend.py`
+
+Handle each sub-case independently:
+
+### B1 — `update_step_context()`: new context fields or attention modes
+
+When the new model requires fields in `attn_metadata` that the current
+implementation does not populate, extend `update_step_context()`:
+
+- Add the computation of the new field (following the existing
+  helper-function pattern inside the method).
+- Pass the new field when constructing `attn_metadata` at the end of the
+  method.
+- For Ascend: also extend `moe_metadata` if the model introduces a new MoE
+  communication pattern or parallelism topology.
+
+Reference: the `is_gated_delta` block (adds `cu_seqlens` and
+`has_initial_state`) and the `kv_quant_policy == 8` block (populates
+`AscendKVQuantMeta`).
+
+### B2 — `get_k_block_shape()` / `get_v_block_shape()`: KV cache layout
+
+This rarely needs changing once the hardware target is fixed. Skip unless the
+new model introduces a fundamentally different attention architecture that
+requires a new block memory layout not covered by any existing vendor backend.
+
+### B3 — `AscendKVQuantMeta`: KV quantization (Ascend only)
+
+Legacy feature; its correctness is not actively verified. Skip for standard
+model support — only revisit if KV cache quantization is explicitly required
+and confirmed to be working.
+
+---
+
+## Path C — Framework patches (`dlinfer/dlinfer/framework/lmdeploy_ext/`)
+
+Each sub-area is independent — assess and handle separately.
+
+### C1 — cudagraph / aclgraph buffer management
+
+**When needed**: only when the model introduces a new `StepContext` field
+whose **shape varies with batch size or sequence length** at runtime.
+Fixed-shape tensors do not need special buffer management. Example:
+`x_active_mask` (shape `[batch_size]`) was added to handle Expert Parallelism
+— its size changes per step, so it requires a pre-allocated maximum-size
+buffer.
+
+- **Ascend**: `framework/lmdeploy_ext/cudagraph/ascend_cudagraph.py`
+  - `make_buffers_cudagraph`: allocate the new field at maximum size
+    (`max_batches` / `max_tokens`). Using runtime size here causes shape
+    errors on replay.
+  - `fill_buffers_cudagraph`: copy runtime values into the pre-allocated
+    buffer.
+  - `update_context_cudagraph`: wire the buffer back into the step context.
+  - Reference: `is_ssm` (`state_ids`) and `use_mrope` (`mrope_position_ids`)
+    paths.
+- **Other vendors**: apply the same pattern in `camb_cudagraph.py` /
+  `maca_cudagraph.py`.
+
+Skip if the model uses only the standard fields already handled.
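+
+The allocate-at-maximum, fill-per-step pattern can be sketched in plain
+PyTorch. This is an illustration under stated assumptions, not the signatures
+used in `ascend_cudagraph.py`: `make_buffers` and `fill_buffers` are
+hypothetical names, and the buffer is modelled as a boolean mask of shape
+`[max_batches]`.
+
+```python
+import torch
+
+
+def make_buffers(max_batches: int) -> dict:
+    # Allocate once at the maximum size so the captured graph sees a fixed shape.
+    return {"x_active_mask": torch.zeros(max_batches, dtype=torch.bool)}
+
+
+def fill_buffers(buffers: dict, x_active_mask: torch.Tensor) -> int:
+    """Copy this step's values into the buffer; return the live length."""
+    buf = buffers["x_active_mask"]
+    n = x_active_mask.shape[0]
+    if n > buf.shape[0]:
+        raise ValueError("runtime batch exceeds the pre-allocated buffer")
+    buf.zero_()                    # clear stale values from the previous step
+    buf[:n].copy_(x_active_mask)   # in-place copy keeps the same storage
+    return n
+```
+
+Because both operations are in-place, the buffer keeps its address and shape
+across steps, which is what graph replay relies on.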
+
+### C2 — Device-specific patches
+
+**When needed**: when the model requires a vendor-specific override of
+lmdeploy behaviour (e.g. a different MoE communication strategy on Ascend, an
+unsupported sampling op on CAMB, hardware-specific cache formats such as
+Ascend 310P NZ layout).
+
+- **Ascend**: `framework/lmdeploy_ext/device/ascend.py`
+- **CAMB**: `framework/lmdeploy_ext/device/camb.py`
+
+Patch the relevant lmdeploy class method directly. Ensure the file is
+imported in `framework/lmdeploy_ext/device/__init__.py`.
+
+### C3 — Quantization patches
+
+**When needed**: only when the model uses AWQ and the weight packing or scale
+layout differs from the current Ascend implementation.
+
+File: `framework/lmdeploy_ext/quants/ascend_awq.py`
+
+This file patches `WeightOnlyQLinear`, `MergedAwqLinear`, `AwqLinear`, and
+`QKVAwqLinear`. Only modify if the new model's quantized checkpoint uses a
+layout the current patches cannot handle.
+
+---
+
+## Verification checklist
+
+**Path A (new op):**
+
+- [ ] All 4 layers implemented for each missing op
+- [ ] `get_layer_impl_builder()` dispatcher updated in generic `op_backend.py`
+- [ ] `vendor_ops_registry` key in `ops/llm.py` exactly matches the decorated
+      function name in the vendor file
+- [ ] New kernel exported from `kernels/dlinfer/__init__.py`
+
+**Path B (vendor `op_backend.py`):**
+
+- [ ] `update_step_context()` populates all fields the new model's
+      `attn_metadata` requires
+
+**Path C1 (graph buffers):**
+
+- [ ] New field pre-allocated at max size in `make_buffers_cudagraph`
+- [ ] New field filled in `fill_buffers_cudagraph`
+- [ ] New field wired back in `update_context_cudagraph`
+
+**Path C2 (device patch):**
+
+- [ ] Patch applied directly to the lmdeploy class
+- [ ] Patch file imported in `device/__init__.py`
+
+**Path C3 (quant patch):**
+
+- [ ] Weight packing / scale layout verified against checkpoint format
+- [ ] Relevant class methods patched in `ascend_awq.py`
+
+**General:**
+
+- [ ] Eager mode: model runs without error
+- [ ] Graph mode: model runs without error (if vendor supports it)
diff --git a/.claude/skills/zh_cn/precision-align/SKILL.md b/.claude/skills/zh_cn/precision-align/SKILL.md
new file mode 100644
index 00000000..f65fcf9a
--- /dev/null
+++ b/.claude/skills/zh_cn/precision-align/SKILL.md
@@ -0,0 +1,411 @@
+---
+name: precision-align
+description: 诊断并修复 lmdeploy+dlinfer 在国产 AI 硬件上的精度问题，
+  通过与参考实现对比找到偏差根因。
+---
+# 精度对齐
+
+你正在帮助用户修复 lmdeploy+dlinfer 在国产 AI 硬件后端
+（Ascend、CAMB 或 MACA）上的精度问题。
+参考实现通常是 vllm+vllm-ascend（针对 Ascend）或其他约定的参考框架。
+目标是找到 lmdeploy+dlinfer 与参考实现的偏差根因并修复。
+
+本 skill 中的示例以 Ascend 为具体硬件。CAMB 和 MACA 适用相同方法论，
+替换对应的 vendor 路径和算子调用即可。
+
+---
+
+## 第一步 — 收集信息
+
+询问用户：
+
+1. **对齐的是哪个模型**？（例如 `qwen3`、`deepseek_v2`）
+2. **目标硬件是哪个**？（ascend / camb / maca）
+3. **现象是什么**？——例如从第一个 token 就不一致、几个 token
+   之后开始乱说、精度评测分数下降了多少分。
+4. **并行配置**：使用的是什么 TP / DP / EP？
+5. **是否有初步观察**？第一个生成的 token 就已经错误（→ prefill 问题），
+   还是前几个 token 正确之后才出现偏差（→ decode / KV cache 问题）？
+6. **单 batch 还是多 batch**？用单个请求（batch_size=1）能否复现问题，
+   还是只有多个请求同时处理时才出现？
+
+以上问题都得到回答后再继续。
+
+---
+
+## 第二步 — 确认环境配置
+
+开始排查之前，先确认对比环境是受控的。两侧除被测框架不同外，
+其余条件必须完全一致：
+
+| 条件                        | lmdeploy+dlinfer       | vllm+vllm-ascend     |
+|---------------------------|------------------------|----------------------|
+| 相同的 SoC 版本             | ✓                      | ✓                    |
+| 关闭 warmup                | ✓                      | ✓                    |
+| Eager mode                 | ✓（`--eager-mode true`）| ✓（`--enforce-eager`）|
+| 相同的 TP / DP / EP        | ✓                      | ✓                    |
+| `temperature=0`、`top_k=1` | ✓                      | ✓                    |
+| 相同的 prompt / 输入        | ✓                      | ✓                    |
+
+如果任何条件未满足，先修复。Warmup 会在 KV cache 中遗留脏数据；
+temperature > 0 引入采样随机性——两者都会掩盖真实的精度 bug。
+
+---
+
+## 第三步 — 快速输出对比
+
+用**相同的 prompt** 运行两个框架，直接对比生成的 token 序列。
+
+- **token 一致** → 输出对齐，精度大概率没问题。建议用 opencompass
+  或 evalscope 跑评测分数确认。
+- **token 不一致** → 进入第四步。
+
+---
+
+## 第四步 — 诊断根因
+
+根据现象确定排查方向：
+
+| 现象                           | 最可能的原因                  | 路径   |
+|--------------------------------|-------------------------------|--------|
+| 第一个生成的 token 就已经错误  | Prefill 算子精度问题          | B      |
+| 第一个 token 正确，之后开始偏差| KV cache 污染或 decode 算子   | A 或 B |
+| 偏差随序列长度增加而累积       | KV cache 污染**或**算子精度   | A 或 B |
+| 在某个固定深度立即出现偏差     | 算子精度问题                  | B      |
+| 仅在 TP > 1 或 dp×tp / ep 下  | 通信 / 并行策略 patch         | C      |
+| 多 batch 时出错，单 batch 正常 | Batching / seqlen / masking   | A 或 B |
+
+**重要提示**：偏差随序列长度累积**并不代表**一定是 KV cache 污染。
+算子精度问题同样会在 decode 步骤中逐渐累积——例如 rope embedding 的
+cos/sin 在 CPU 上计算，会让每个 token 的位置编码都略有偏差，最终累积
+成可见的精度下降（曾在 Qwen30B-A3B 上排查出此问题，LiveCodeBench 评分
+低 2 分）。不要在排除路径 B 之前就断定是 cache 污染。
+
+**不确定时**：
+
+- 先用单 batch 请求复现，排除 batching 的干扰。
+- 从能装下模型权重的最简并行配置开始（具体见路径 C 的并行层次说明）。
+  部分大模型无法在 TP=1 下运行，"最简"指的是能加载权重的最少并行维度。
+- 然后从路径 B 的 layer 0 开始排查。
+
+---
+
+## 路径 A — KV cache 污染
+
+KV cache 污染意味着 `fill_kv_cache` 的索引传错了，把某些 token 写到了
+错误的 cache slot（踩踏）。`fill_kv_cache` 的内部逻辑一般不会出问题——
+问题几乎总是在传给它的索引上。
+
+### 检查重点
+
+- `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/fill_kv_cache.py`
+- `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/pagedattention.py`
+
+重点参数：
+
+- `fill_kv_cache` 的 `kv_start_indices`：每个 token 在 cache 中的
+  flat slot 索引。
+- `prefill_attention` 和 `paged_token_attention` /
+  `paged_attention_fwd` 的 `block_offsets`、`q_start_loc`、
+  `q_seq_len`、`kv_seq_len`。
+
+**多 batch 注意**：如果问题只在多请求时出现，重点核查每个请求的
+seqlen 追踪（`q_seq_len`、`kv_seq_len`）和 `kv_start_indices`。
+per-request 长度错误会导致 attention 从 KV cache 的错误位置读取数据。
+
+### 排查方法
+
+**不要 dump KV cache tensor 本身**——太大了。在可疑层的
+`fill_kv_cache` 调用**之前**，dump 以下三个 tensor：
+
+```python
+dump("key_states",       key_states)        # shape: [num_tokens, num_kv_heads, head_size]
+dump("value_states",     value_states)      # shape: [num_tokens, num_kv_heads, head_size]
+dump("kv_start_indices", kv_start_indices)  # shape: [num_tokens]
+```
+
+**关键检查**：`kv_start_indices.shape[0]` 必须等于
+`key_states.shape[0]`。如果长度不一致，说明索引数量与待写入的 token
+数量不匹配，fill 时会发生踩踏，导致后续 decode 步骤的 cache 内容
+被污染。
+
+---
+
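+## Path B — Op precision
+
+The goal is to find the **first op whose output differs** between
+lmdeploy+dlinfer and the reference implementation.
+
+**Single batch first**: if a single request reproduces the problem,
+investigate with batch_size=1. This rules out batching interactions and
+keeps the seqlen shapes simple.
+
+### Strategy: start from layer 0
+
+Start from **layer 0** of the first linear attention block or full
+attention block, not from a middle layer. Most layers of a model use the
+same set of ops, so layer 0 is representative. If layer 0 is clean, the
+other layers are most likely clean too; if layer 0 already diverges, fix
+it before looking further.
+
+1. In layer 0, dump after each sub-op in order:
+   RMSNorm → Attention → MLP.
+2. Compare with the reference framework's results for the same layer (for
+   example, vllm+vllm-ascend on Ascend).
+   - A sub-op already diverges in layer 0 → this is the first failing op;
+     dig into it.
+   - All sub-ops in layer 0 match → the divergence is in a deeper layer.
+     Bisect the remaining layers (check layer N/2 first, then narrow the
+     range).
+3. After locating the failing op, selectively verify one or two more
+   layers that may behave differently (for example, the last layer or a
+   layer with MoE routing).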
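+
+The layer search above can be sketched as follows. This is a minimal
+illustration, not an lmdeploy utility: `first_divergent_layer` and the
+`layer_matches` callback are hypothetical, and the bisection assumes that
+once a layer's output diverges, every later layer diverges as well.
+
+```python
+from typing import Callable, Optional
+
+
+def first_divergent_layer(num_layers: int,
+                          layer_matches: Callable[[int], bool]) -> Optional[int]:
+    """Return the first layer whose output differs, or None if all match."""
+    # Always verify layer 0 before bisecting.
+    if not layer_matches(0):
+        return 0
+    # If the last layer still matches, nothing diverged (under the assumption).
+    if layer_matches(num_layers - 1):
+        return None
+    lo, hi = 0, num_layers - 1  # invariant: lo matches, hi diverges
+    while hi - lo > 1:
+        mid = (lo + hi) // 2
+        if layer_matches(mid):
+            lo = mid
+        else:
+            hi = mid
+    return hi
+```
+
+In practice `layer_matches(i)` would load the layer-`i` dumps from both
+frameworks and compare them with the method described in this skill.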
+
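+### Comparison method
+
+**Deterministic vendor ops** (for example, `torch_npu` ops on Ascend): use
+`torch.equal()`. These ops are deterministic, so the results must match
+exactly. Any difference is a real bug.
+
+**Non-deterministic ops** (for example, triton, which is rare on Ascend):
+`torch.equal()` can be too strict because of floating-point rounding.
+Judge by the error magnitude instead:
+
+```python
+diff = (a - b).abs()
+print("max abs error:", diff.max().item())
+# Clamp the denominator so near-zero reference values do not blow up the ratio.
+print("max rel error:", (diff / b.abs().clamp(min=1e-8)).max().item())
+```
+
+A relative error within ~1e-3 is usually acceptable; anything beyond that
+magnitude counts as a real divergence.
+
+### After finding the diverging op
+
+Read its implementation stack layer by layer:
+
+- `lmdeploy/lmdeploy/pytorch/backends/dlinfer/<vendor>/`: the `Impl` class
+- `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/`: the thin kernel wrapper
+- `dlinfer/dlinfer/vendor/<vendor>/`: the actual hardware op call
+  (for example, Ascend: `ascend/torch_npu_ops.py`)
+
+Dump the **inputs** of the op in both frameworks and check whether they
+are identical. If the inputs already differ, the bug is upstream; if the
+inputs match but the outputs differ, the bug is in the op call itself
+(wrong argument order, wrong dtype, wrong shape, and so on).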
+
+---
+
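+## Path C — Communication / parallel strategy
+
+A precision problem that only appears under certain parallel
+configurations points to the communication or parallel-strategy patches.
+Before investigating, determine which parallel dimension introduces the
+problem.
+
+### lmdeploy parallel terminology
+
+lmdeploy supports three combinations of parallel dimensions:
+
+- **TP only** (EP=1, DP=1): both Attention and FFN are split across `tp`
+  GPUs. Total GPU count = tp.
+- **dp×tp** (EP=1, DP>1): Attention is spread over `dp` groups with
+  `tp_size = tp / dp` GPUs each. When EP=1, the configured `tp` equals
+  the total GPU count.
+- **dp×tp + ep** (EP>1): Attention is still split by dp×tp; the FFN / MoE
+  experts are further split across `ep` groups. When EP>1, the configured
+  `tp` is the **tp_size of each DP group** (not the total GPU count).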
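+
+The two meanings of `tp` are easy to mix up, so here is the arithmetic as
+a small sketch. It only encodes the convention as described in this
+section; `resolve_layout` and `ParallelLayout` are hypothetical names, not
+lmdeploy APIs.
+
+```python
+from typing import NamedTuple
+
+
+class ParallelLayout(NamedTuple):
+    total_gpus: int    # devices used by the whole deployment
+    attn_tp_size: int  # TP size inside one DP group for Attention
+
+
+def resolve_layout(tp: int, dp: int = 1, ep: int = 1) -> ParallelLayout:
+    """Interpret the configured tp/dp/ep values as described in this skill."""
+    if ep > 1:
+        # EP > 1: the configured tp is already the per-DP-group tp_size.
+        return ParallelLayout(total_gpus=dp * tp, attn_tp_size=tp)
+    if tp % dp != 0:
+        raise ValueError(f"tp={tp} is not divisible by dp={dp}")
+    # EP = 1: the configured tp is the total GPU count.
+    return ParallelLayout(total_gpus=tp, attn_tp_size=tp // dp)
+```
+
+For example, `resolve_layout(8, dp=2)` and `resolve_layout(4, dp=2, ep=2)`
+both describe 8 GPUs with a per-group Attention TP size of 4, even though
+the configured `tp` differs.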
+
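+### Isolation strategy
+
+Not every model can run with TP=1. Test in order of increasing parallel
+complexity and stop at the level that introduces the problem:
+
+1. **TP only** (the simplest configuration that fits the weights): run
+   both frameworks with the TP-only configuration (DP=1, EP=1) that uses
+   the fewest GPUs.
+   - Problem appears → it is in the TP op sharding or all_reduce; combine
+     with the dump method from Path B and focus on the outputs before and
+     after all_reduce.
+   - Clean → go to step 2.
+
+2. **dp×tp** (add DP): keep EP=1 and increase DP.
+   - Problem appears → it is in the DP+TP interaction or the
+     communication between DP groups; read the relevant communication
+     patches in
+     `dlinfer/dlinfer/framework/lmdeploy_ext/device/<vendor>.py`.
+   - Clean → go to step 3 (MoE models only).
+
+3. **dp×tp + ep** (add EP): enable EP>1.
+   - Problem appears → it is in expert parallelism or EP communication;
+     read the MoE forward class in `device/<vendor>.py` (for example,
+     `AscendMoEForwardDPTP` on Ascend) and check the MoE routing and the
+     reduce-scatter pattern.
+
+### Dummy data in idle DP groups
+
+When dp > 1, lmdeploy fills DP groups that have no real request with
+**dummy data of length 1**; vllm-ascend has a similar mechanism. This is
+expected behavior, not a bug. Keep the following in mind when dumping
+tensors across DP groups:
+
+- Tensors in an idle DP group have a leading dimension of 1. Do not
+  mistake this for a seqlen mismatch.
+- Compare values only for DP groups that processed real tokens.
+- If the precision problem happens to sit on the dummy path of an idle
+  group, confirm that both frameworks use the same dummy length and that
+  the dummy data does not corrupt the KV cache slots of the real groups.
+
+### When the weights do not fit with TP=1
+
+If the model is too large to run with TP=1, start from the smallest TP
+that can load the weights and use the same TP on both sides. You can still
+hold TP fixed and vary DP and EP separately to isolate the effect of each
+dimension.
+
+### Files to read
+
+- `dlinfer/dlinfer/framework/lmdeploy_ext/device/<vendor>.py`: the
+  distributed-behavior patches for the hardware (for example, `ascend.py`
+  on Ascend, which contains `AscendMoEForwardDPTP` for MoE
+  communication).
+- Dump each rank's output **before and after** every all_reduce /
+  all_gather call to find the first communication op that diverges.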
+
+---
+
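+## Tensor dump how-to
+
+**Always dump to files; never use `print` or `logger`.**
+
+In multi-rank runs, the logs of all ranks interleave and the tensor values
+scroll away unreadable. Use `torch.save` to write a separate file per rank
+instead.
+
+```python
+import os
+
+import torch
+import torch.distributed as dist
+
+_DUMP_DIR = "/tmp/dlinfer_dump"
+os.makedirs(_DUMP_DIR, exist_ok=True)
+
+
+def dump(name: str, tensor: torch.Tensor) -> None:
+    """Save `tensor` to a per-rank file so multi-rank output never interleaves."""
+    rank = dist.get_rank() if dist.is_initialized() else 0
+    # Detach and move to CPU so the file loads without the accelerator.
+    torch.save(tensor.detach().cpu(), f"{_DUMP_DIR}/{name}_rank{rank}.pt")
+```
+
+**Naming convention**: `{layer index}_{op}_{input|output}_rank{rank}.pt`
+
+Example: `layer0_attn_output_rank0.pt` for lmdeploy+dlinfer, with the
+vllm+vllm-ascend file of the same name stored in a different directory so
+the two are easy to pair.
+
+**Load and compare**:
+
+```python
+import torch
+
+a = torch.load("dlinfer/layer0_attn_output_rank0.pt")
+b = torch.load("vllm/layer0_attn_output_rank0.pt")
+
+# Deterministic vendor ops (e.g. torch_npu on Ascend): expect an exact match
+print(torch.equal(a, b))
+
+# triton / floating-point ops: check the error magnitude
+diff = (a - b).abs()
+print("max abs error:", diff.max().item())
+print("max rel error:", (diff / b.abs().clamp(min=1e-8)).max().item())
+```
+
+**Where to dump**: preferably in the `Impl` classes under
+`lmdeploy/lmdeploy/pytorch/backends/dlinfer/<vendor>/`, right after the
+kernel call and before the return. This layer sits above the
+vendor-specific code, so the output tensors have the framework-native
+shape and are easy to compare.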
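+
+Keeping the same file names in two directories makes pairing mechanical.
+The sketch below lists which dumps exist on both sides and which are
+missing from one side; `pair_dumps` is a hypothetical helper, and loading
+and comparing each pair is left to the snippet above.
+
+```python
+from pathlib import Path
+from typing import List, Tuple
+
+
+def pair_dumps(dir_a: str, dir_b: str) -> Tuple[List[str], List[str]]:
+    """Return (names present in both dirs, names present in only one dir)."""
+    names_a = {p.name for p in Path(dir_a).glob("*.pt")}
+    names_b = {p.name for p in Path(dir_b).glob("*.pt")}
+    # Names are sorted as strings, so "layer10" sorts before "layer2".
+    return sorted(names_a & names_b), sorted(names_a ^ names_b)
+```
+
+An unpaired file usually means one framework was not instrumented at that
+point, which is worth fixing before comparing any values.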
+
+---
+
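+## Acceptance checklist
+
+- [ ] Confirmed: same SoC version, warmup disabled, `--eager-mode true`,
+  same TP/DP/EP, temperature=0 / top_k=1
+- [ ] Tried to reproduce with a single batch first
+- [ ] Compared output tokens using the same prompt
+- [ ] Started from the simplest parallel configuration that fits the
+  weights and tested upward step by step
+- [ ] Chose the investigation path: A (KV cache) / B (op) / C
+  (communication)
+- [ ] Dumped tensors to files (not print / logger)
+- [ ] Verified layer 0 first, then searched deeper layers
+- [ ] Found the first diverging layer / op
+- [ ] Confirmed whether the inputs of the diverging op match (or found an
+  upstream bug)
+- [ ] Re-verified that the output tokens match after the fix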
+
+---
+
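+## Troubleshooting
+
+### Divergence accumulates with sequence length, but Path A finds nothing
+
+**Symptom**: the length of `kv_start_indices` matches `key_states`, but
+the divergence still grows with sequence length.
+
+**Cause**: op precision issues (for example, cos/sin falling back to CPU
+computation) also accumulate across decode steps and look identical to
+cache corruption from the outside.
+
+**Action**: switch to Path B. Dump the sub-ops of layer 0
+(RMSNorm → Attention → MLP) and check whether the divergence starts there.
+
+---
+
+### Unexpected leading dimension of 1 in tensors of some DP groups
+
+**Symptom**: with dp > 1, tensors in idle DP groups have a first
+dimension of 1 and the outputs look abnormal.
+
+**Cause**: lmdeploy (and vllm-ascend) fills DP groups that have no real
+request with dummy data of length 1. This is expected behavior.
+
+**Action**: compare values only for DP groups that processed real tokens;
+do not treat the length-1 tensors of idle groups as errors.
+
+---
+
+### Precision problem under dp×tp or ep, but unclear which dimension causes it
+
+**Symptom**: precision regresses after adding DP or EP, but it is unclear
+which dimension is responsible.
+
+**Cause**: several parallel dimensions were changed at once, so they
+cannot be checked one by one.
+
+**Action**: hold TP fixed and add only DP first; add EP once dp×tp is
+clean. See the isolation strategy in Path C.
+
+---
+
+### Dump output is empty, truncated, or garbled
+
+**Symptom**: the saved dump has no usable data, or values from several
+ranks are mixed together.
+
+**Cause**: `print` or `logger` was used in a multi-rank run, and the
+output of the ranks interleaved and overwrote each other.
+
+**Action**: write a separate file per rank with `torch.save`. See the
+tensor dump section.
+
+---
+
+### KV cache corruption bug is intermittent and hard to reproduce
+
+**Symptom**: the same prompt gives different results across runs, and the
+precision problem is unstable.
+
+**Cause**: warmup leaves dirty data in the KV cache, which affects later
+inference.
+
+**Action**: disable warmup on both sides and compare again.
+
+---
+
+### The same prompt generates different tokens on every run
+
+**Symptom**: the token-level comparison is unstable; the result changes
+on every run.
+
+**Cause**: temperature > 0 introduces sampling randomness, so the
+comparison is meaningless.
+
+**Action**: set temperature=0 and top_k=1 on both sides.
+
+---
+
+### Bisection ends and the bug turns out to be in a very early layer
+
+**Symptom**: after several bisection rounds, the failing layer turns out
+to be very early in the model.
+
+**Cause**: starting from layer N/2 skipped the check of layer 0.
+
+**Action**: verify layer 0 first; if layer 0 is clean, bisect the
+remaining layers.
+
+---
+
+### A suspect op is found and its inputs look identical, but the root cause is unclear
+
+**Symptom**: the inputs of the op match in both frameworks, but it is
+unclear whether the op itself is at fault.
+
+**Cause**: the outputs were not dumped, so there is no direct evidence of
+whether the op produces a wrong result.
+
+**Action**: dump both inputs and outputs. If the inputs match but the
+outputs differ, the bug is in the op call itself: check the argument
+order, dtype, and shape passed to the hardware op (for example, the NPU
+op on Ascend).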
diff --git a/.claude/skills/zh_cn/support-new-model/SKILL.md b/.claude/skills/zh_cn/support-new-model/SKILL.md
new file mode 100644
index 00000000..879c06ac
--- /dev/null
+++ b/.claude/skills/zh_cn/support-new-model/SKILL.md
@@ -0,0 +1,238 @@
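+---
+name: support-new-model
+description: Adapt a new model to domestic AI hardware via dlinfer
+  (the model is already supported on the lmdeploy CUDA backend).
+---
+# Support a new model
+
+You are helping the user adapt a new LLM or VLM to a hardware backend
+supported by dlinfer (Ascend NPU, CAMB MLU, MACA GPU). The model already
+runs on CUDA through lmdeploy. Your task is to find out what the target
+vendor is missing and fill the gap.
+
+---
+
+## Step 1 — Gather information
+
+Ask the user:
+
+1. **Which model** should be adapted? (Give the model's name in
+   `lmdeploy/pytorch/models/`, e.g., `qwen3`, `deepseek_v2`)
+2. **Which vendors** are targeted? (ascend / camb / maca, more than one
+   allowed)
+
+Do not proceed until both questions are answered.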
+
+---
+
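+## Step 2 — Analyze the model
+
+Read all of the following files yourself with the Read/Bash tools; do not
+ask the user to read them:
+
+```text
+lmdeploy/lmdeploy/pytorch/models/<model>.py
+lmdeploy/lmdeploy/pytorch/backends/dlinfer/op_backend.py
+lmdeploy/lmdeploy/pytorch/backends/dlinfer/<vendor>/op_backend.py
+```
+
+The full call chain is: `models/<model>.py` → `lmdeploy/pytorch/nn/`
+→ `backends/dlinfer/` → `kernels/dlinfer/` → `dlinfer/ops/`
+→ `vendor/`. If the mapping between a model layer and a backend op is
+unclear, trace it through the intermediate `lmdeploy/pytorch/nn/` layer.
+`lmdeploy/pytorch/kernels/default/` contains the CUDA reference
+implementations, which can serve as a specification when writing a new
+vendor op; read them as needed.
+
+### Identify from `models/<model>.py`
+
+- Every non-trivial op the model uses: attention variants (paged, flash,
+  MLA), MLP activation functions, RMS norm variants, MoE routing, rotary
+  embedding variants (standard, MROPE, multi-scale), quantization ops,
+  and so on.
+- Whether the model passes new inputs beyond the standard fields through
+  `StepContext` or `attn_metadata`. The standard fields are `input_ids`,
+  `position_ids`, `block_offsets`, `q_seqlens`, `kv_seqlens`, and
+  `kv_start_indices`. Known extension fields: `state_ids` (SSM models),
+  `mrope_position_ids` (MROPE models), and `cu_seqlens` /
+  `has_initial_state` (Gated Delta Network).
+
+### Check in the generic `op_backend.py`
+
+- `get_layer_impl_builder()`: which `OpType` values already have a
+  dlinfer `Impl`. Compare against the op list above to find the gaps
+  → **Path A**.
+
+### Check item by item in `<vendor>/op_backend.py`
+
+- **`update_step_context()`**: this method builds `attn_metadata` (and,
+  on Ascend, `moe_metadata`) at every inference step. Check carefully
+  that it handles every field the new model needs. If the model
+  introduces a new context field or a new attention mode (for example, a
+  flag like `is_gated_delta`), this method must be extended → **Path B**.
+- **`get_k_block_shape()` / `get_v_block_shape()`**: confirm that the KV
+  cache memory layout matches what the model's attention implementation
+  expects. Different vendors, and even different SoC versions of the same
+  vendor (Ascend A2 vs A3, 310P), may use different layouts → **Path B**
+  (if they do not match).
+- **`AscendKVQuantMeta`** (Ascend only): if the model uses KV cache
+  quantization and the scale/offset format differs from the current
+  implementation → **Path B**.
+
+Before writing any code, report the analysis to the user:
+
+- Op gaps (→ Path A)
+- Vendor `op_backend.py` gaps (→ Path B)
+- Framework-level gaps (→ Path C)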
+
+---
+
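+## Path A — Add the missing ops (4-layer stack)
+
+For each op missing from `get_layer_impl_builder()`, implement the layers
+one by one from top to bottom.
+
+### Layer 1 — `lmdeploy/lmdeploy/pytorch/backends/dlinfer/`
+
+Add `XxxImpl` (inheriting from the lmdeploy base `Impl` class) and
+`XxxBuilder` (with a `build()` method). Register the Builder in the
+`get_layer_impl_builder()` dispatcher in `op_backend.py`.
+References: `activation.py` (simplest), `norm.py`, `attention.py` (most
+complex).
+
+### Layer 2 — `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/`
+
+Add a thin wrapper function that calls `dlinfer.ops.<op_name>(...)` and
+export it in `__init__.py`.
+
+### Layer 3 — `dlinfer/dlinfer/ops/llm.py`
+
+Register the new op with `@register_custom_op("dlinfer::<op_name>", [...])`;
+the function body forwards to `vendor_ops_registry["<op_name>"]`.
+**The string key here must exactly match the name of the decorated
+function in Layer 4.**
+
+### Layer 4 — `dlinfer/dlinfer/vendor/<vendor>/`
+
+Add an implementation decorated with `@register_ops(vendor_ops_registry)`
+that calls the vendor's native op:
+
+- **Ascend**: `torch.ops.npu.*`, see `vendor/ascend/torch_npu_ops.py`
+- **CAMB**: `tmo.*` (`torch_mlu_ops`), see `vendor/camb/camb_ops.py`
+- **MACA**: `mcoplib.*`, see `vendor/maca/maca_ops.py`
+
+**Ascend**: before adding any op to `torch_npu_ops.py`, ask the user for
+the official NPU documentation of that op. Follow the documentation
+strictly: parameter names, tensor shape constraints, and dtype constraints
+cannot be inferred from existing code, and getting them wrong causes
+runtime errors that are hard to locate.
+
+When the logic is complex (for example, Ascend attention with graph-mode
+recording), split it into a helper module (such as
+`vendor/ascend/attention.py`) and import it in `torch_npu_ops.py`.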
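+
+The name-matching rule between Layer 3 and Layer 4 can be illustrated
+with a toy registry. This is a simplified stand-in written for this
+explanation, not dlinfer's actual `register_ops`: it assumes the decorator
+stores each function under its own `__name__`, and `my_new_op` and
+`dispatch_my_new_op` are hypothetical.
+
+```python
+from typing import Callable, Dict
+
+# Toy stand-in for the shared registry (assumption: keyed by function name).
+vendor_ops_registry: Dict[str, Callable] = {}
+
+
+def register_ops(registry: Dict[str, Callable]) -> Callable:
+    """Return a decorator that stores the function under its own name."""
+    def decorator(fn: Callable) -> Callable:
+        registry[fn.__name__] = fn
+        return fn
+    return decorator
+
+
+# "Layer 4": the vendor implementation; its name becomes the registry key.
+@register_ops(vendor_ops_registry)
+def my_new_op(x: float, scale: float) -> float:
+    return x * scale
+
+
+# "Layer 3": the op body forwards by string key, which must equal the
+# decorated function's name or the lookup raises KeyError at call time.
+def dispatch_my_new_op(x: float, scale: float) -> float:
+    return vendor_ops_registry["my_new_op"](x, scale)
+```
+
+Because the lookup happens only when the op is called, a mismatched key
+does not fail at import time, which is why the names are worth checking
+by hand.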
+
+---
+
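+## Path B — Vendor-specific `op_backend.py` changes
+
+File: `lmdeploy/lmdeploy/pytorch/backends/dlinfer/<vendor>/op_backend.py`
+
+The three sub-areas below are independent; evaluate each one separately.
+
+### B1 — `update_step_context()`: new context fields or attention modes
+
+When the new model needs a field in `attn_metadata` that the current
+implementation does not populate, extend `update_step_context()`:
+
+- Compute the new field inside the method, following the pattern of the
+  existing helper functions.
+- Pass the new field in when `attn_metadata` is constructed at the end of
+  the method.
+- On Ascend, if the model introduces a new MoE communication pattern or
+  parallel topology, `moe_metadata` must be extended as well.
+
+References: the `is_gated_delta` branch (adds `cu_seqlens` and
+`has_initial_state`) and the `kv_quant_policy == 8` branch (populates
+`AscendKVQuantMeta`).
+
+### B2 — `get_k_block_shape()` / `get_v_block_shape()`: KV cache layout
+
+Once the hardware target is fixed, this rarely changes. Skip it unless the
+new model introduces an entirely new attention memory layout requirement
+that no existing vendor backend covers.
+
+### B3 — `AscendKVQuantMeta`: KV quantization (Ascend only)
+
+Legacy feature; its correctness is not actively verified at present. Skip
+it for routine model adaptation, and handle it only when KV cache
+quantization is explicitly required and the feature is confirmed to work.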
+
+---
+
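+## Path C — Framework patches (`dlinfer/dlinfer/framework/lmdeploy_ext/`)
+
+The three sub-modules below are independent; evaluate each one separately.
+
+### C1 — cudagraph / aclgraph buffer management
+
+**Trigger**: needed only when the model introduces a new `StepContext`
+field whose **shape changes at runtime with batch size or seq_len**.
+Tensors with a fixed shape need no special buffer management. Example:
+`x_active_mask` (shape `[batch_size]`) was added to support Expert
+Parallelism. Its size changes at every step, so a buffer of the maximum
+size must be preallocated.
+
+- **Ascend**: `framework/lmdeploy_ext/cudagraph/ascend_cudagraph.py`
+  - `make_buffers_cudagraph`: preallocate the tensor for the new field at
+    the maximum size (`max_batches` / `max_tokens`). Using the runtime
+    size causes a shape mismatch during graph replay.
+  - `fill_buffers_cudagraph`: copy the runtime data into the preallocated
+    buffer.
+  - `update_context_cudagraph`: write the buffer back to the step context.
+  - References: the handling of `is_ssm` (`state_ids`) and `use_mrope`
+    (`mrope_position_ids`).
+- **Other vendors**: apply the same pattern in the corresponding
+  `camb_cudagraph.py` / `maca_cudagraph.py`.
+
+If the model only uses the existing standard fields, skip this section.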
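+
+The preallocate-then-copy pattern can be sketched in isolation. The code
+below is not the lmdeploy implementation: `make_buffers` and
+`fill_buffers` are simplified hypothetical stand-ins for the methods named
+above, and the `torch.bool` dtype for `x_active_mask` is an assumption.
+
+```python
+import torch
+
+
+def make_buffers(max_batches: int) -> dict:
+    """Allocate once at the maximum size so the captured shape never changes."""
+    return {"x_active_mask": torch.zeros(max_batches, dtype=torch.bool)}
+
+
+def fill_buffers(buffers: dict, x_active_mask: torch.Tensor) -> int:
+    """Copy this step's data into the fixed buffer and return the live size."""
+    batch_size = x_active_mask.shape[0]
+    buf = buffers["x_active_mask"]
+    buf.zero_()                              # clear data left by earlier steps
+    buf[:batch_size].copy_(x_active_mask)    # in-place: same storage every step
+    return batch_size
+```
+
+The point of the in-place copy is that the buffer's storage and shape stay
+the same across steps; only the first `batch_size` entries carry real data.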
+
+### C2 — Device-specific patches
+
+**Trigger**: the model needs a vendor-level override of some lmdeploy
+behavior, for example a different MoE communication strategy on Ascend, a
+sampling op that is unsupported on CAMB, or a hardware-specific KV cache
+format (such as the NZ format on Ascend 310P).
+
+- **Ascend**: `framework/lmdeploy_ext/device/ascend.py`
+- **CAMB**: `framework/lmdeploy_ext/device/camb.py`
+
+Patch the relevant method directly on the lmdeploy class. Make sure the
+patch file is imported in
+`framework/lmdeploy_ext/device/__init__.py`.
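+
+Patching a method directly on a class looks like the sketch below.
+`MoEForward` is a hypothetical stand-in defined here for illustration; it
+is not a real lmdeploy class, and the doubled output only makes the
+override visible.
+
+```python
+class MoEForward:
+    """Hypothetical stand-in for a framework class to be patched."""
+
+    def forward(self, x: int) -> int:
+        return x
+
+
+# Keep a handle to the original in case the override needs to call it.
+_original_forward = MoEForward.forward
+
+
+def _vendor_forward(self: MoEForward, x: int) -> int:
+    # Vendor-specific behavior replaces the default implementation.
+    return _original_forward(self, x) * 2
+
+
+# Assigning on the class changes the behavior of every instance.
+MoEForward.forward = _vendor_forward
+```
+
+The assignment only runs when the module containing it is imported, which
+is why the patch file has to be imported from the package `__init__.py`.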
+
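+### C3 — Quantization patches
+
+**Trigger**: only when the model uses AWQ and its weight packing format or
+scale layout is incompatible with the current Ascend implementation.
+
+File: `framework/lmdeploy_ext/quants/ascend_awq.py`
+
+This file patches `WeightOnlyQLinear`, `MergedAwqLinear`, `AwqLinear`, and
+`QKVAwqLinear`. Modify it only when the new model's quantized checkpoint
+uses a layout that the current patches cannot handle.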
+
+---
+
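+## Acceptance checklist
+
+**Path A (new ops):**
+
+- [ ] All 4 layers are implemented for every missing op
+- [ ] `get_layer_impl_builder()` in the generic `op_backend.py` is updated
+- [ ] The `vendor_ops_registry` key in `ops/llm.py` exactly matches the
+  name of the decorated function in the vendor file
+- [ ] The new kernel is exported in `kernels/dlinfer/__init__.py`
+
+**Path B (vendor `op_backend.py`):**
+
+- [ ] `update_step_context()` correctly populates every `attn_metadata`
+  field the new model needs
+
+**Path C1 (graph buffers):**
+
+- [ ] The new field is preallocated at maximum size in
+  `make_buffers_cudagraph`
+- [ ] The new field is filled in `fill_buffers_cudagraph`
+- [ ] The new field is written back to the context in
+  `update_context_cudagraph`
+
+**Path C2 (device patches):**
+
+- [ ] The patch is applied directly on the lmdeploy class
+- [ ] The patch file is imported in `device/__init__.py`
+
+**Path C3 (quantization patches):**
+
+- [ ] The weight packing / scale layout is verified against the
+  checkpoint format
+- [ ] The relevant class methods are patched in `ascend_awq.py`
+
+**General:**
+
+- [ ] eager mode: the model runs inference correctly
+- [ ] graph mode: the model runs inference correctly (if the vendor
+  supports it)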