diff --git a/.claude/skills/en/precision-align/SKILL.md b/.claude/skills/en/precision-align/SKILL.md new file mode 100644 index 00000000..e027828a --- /dev/null +++ b/.claude/skills/en/precision-align/SKILL.md @@ -0,0 +1,447 @@ +--- +name: precision-align +description: Debug precision regressions in lmdeploy+dlinfer on domestic AI hardware (Ascend / CAMB / MACA) by comparing against a reference implementation. +--- +# precision-align + +You are helping the user fix a precision bug in lmdeploy+dlinfer on a domestic +AI hardware backend (Ascend, CAMB, or MACA). The reference implementation is +typically vllm+vllm-ascend (for Ascend) or another agreed-upon reference. +The goal is to identify where lmdeploy+dlinfer diverges from the reference +and fix it. + +The examples in this skill use Ascend as the concrete hardware. Apply the same +methodology to CAMB and MACA by substituting the appropriate vendor paths +and ops. + +--- + +## Step 1 — Gather information + +Ask the user: + +1. **Which model** are you aligning? (e.g., `qwen3`, `deepseek_v2`) +2. **Which hardware** are you targeting? (ascend / camb / maca) +3. **What is the symptom?** — e.g., output tokens differ from the first token, + answers become nonsensical after a few tokens, accuracy benchmark score + dropped by X points. +4. **Parallelism configuration**: what TP / DP / EP values are you using? +5. **Any preliminary observations?** First token already wrong (→ prefill + issue), or diverges after a few correct tokens (→ decode / KV cache)? +6. **Single-batch or multi-batch?** Can you reproduce the issue with a single + request (batch_size=1), or does it only appear when multiple requests are + batched together? + +Do not proceed until these are answered. + +--- + +## Step 2 — Verify environment setup + +Before any debugging, confirm the comparison environment is controlled. 
+Both sides must be identical except for the framework under test: + +| Condition | lmdeploy+dlinfer | vllm+vllm-ascend | +|---------------------------|-------------------------|-----------------------| +| Same SoC version | ✓ | ✓ | +| Warmup disabled | ✓ | ✓ | +| Eager mode | ✓ (`--eager-mode true`) | ✓ (`--enforce-eager`) | +| Same TP / DP / EP | ✓ | ✓ | +| `temperature=0`,`top_k=1` | ✓ | ✓ | +| Same prompt / input | ✓ | ✓ | + +If any condition is unmet, fix it first. Warmup leaves stale KV cache entries; +temperature > 0 introduces sampling randomness — both mask real precision bugs. + +--- + +## Step 3 — Quick output comparison + +Run both frameworks on the **same prompt** and compare the generated tokens +directly. + +- **Tokens match** → output is consistent; precision is likely fine. Suggest + running opencompass or evalscope for benchmark scoring. +- **Tokens differ** → proceed to Step 4. + +--- + +## Step 4 — Diagnose root cause + +Map the symptom to a debugging path: + +| Symptom | Most likely cause | Path | +|------------------------------------|-----------------------------|--------| +| First token already wrong | Prefill operator precision | B | +| First token correct, then diverges | KV cache or decode op | A or B | +| Divergence grows with seq length | KV cache or op precision | A or B | +| Divergence at a fixed depth | Operator precision | B | +| Only wrong at TP > 1 / dp×tp / ep | Communication / parallelism | C | +| Only wrong with multiple requests | Batching / seqlen / masking | A or B | + +**Important**: accumulating divergence does **not** imply KV cache pollution. +An operator precision bug can also compound over decode steps — for example, +a rope embedding op silently falling back to CPU produces slightly wrong +position encodings that accumulate into a visible accuracy drop (observed on +Qwen30B-A3B: a 2-point drop on LiveCodeBench traced to cos/sin computed on +CPU). Do not assume Path A without ruling out Path B first. 
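The first two rows of the symptom table depend on where the first mismatching token sits. The sketch below is one possible way to compute that position from two token-ID lists and map it to a starting path. The helper names `first_mismatch` and `suggest_path` are hypothetical and are not part of lmdeploy, dlinfer, or vllm; it is assumed that the token IDs were collected from each framework beforehand.

```python
from typing import Optional, Sequence


def first_mismatch(ref_tokens: Sequence[int], test_tokens: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if both sequences are identical."""
    for i, (r, t) in enumerate(zip(ref_tokens, test_tokens)):
        if r != t:
            return i
    # One sequence is a strict prefix of the other: they diverge where the shorter one ends.
    if len(ref_tokens) != len(test_tokens):
        return min(len(ref_tokens), len(test_tokens))
    return None


def suggest_path(mismatch_index: Optional[int]) -> str:
    """Map the first-mismatch position to a starting path from the symptom table (hypothetical helper)."""
    if mismatch_index is None:
        return "tokens match"
    if mismatch_index == 0:
        return "B"       # first token already wrong: prefill operator precision
    return "A or B"      # later divergence: KV cache or decode op
```

A mismatch index greater than zero only narrows the search to "A or B"; as the note above says, it does not by itself indicate KV cache pollution.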
+ +**If unsure**: + +- Start with a single-batch request to eliminate batching interactions. +- Start with the simplest parallelism configuration that can load the model + weights (see Path C for the parallelism hierarchy). Some large models cannot + fit at TP=1, so "simplest" means fewest parallelism dimensions that still + fits the weights. +- Then go to Path B starting from layer 0. + +--- + +## Path A — KV cache pollution + +KV cache pollution means `fill_kv_cache` was called with mismatched indices, +writing tokens into wrong cache slots (stomping). The `fill_kv_cache` kernel +itself is generally not the source of bugs — the problem is almost always in +the indices passed to it. + +### What to check + +Read these two files: + +- `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/fill_kv_cache.py` +- `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/pagedattention.py` + +Key parameters: + +- `kv_start_indices` in `fill_kv_cache`: flat index of each token's cache + slot. +- `block_offsets`, `q_start_loc`, `q_seq_len`, `kv_seq_len` in + `prefill_attention` and `paged_token_attention` / `paged_attention_fwd`. + +**Multi-batch note**: if the bug only appears with multiple requests, pay extra +attention to per-request seqlen tracking (`q_seq_len`, `kv_seq_len`) and +`kv_start_indices`. A wrong per-request length causes attention to read from +the wrong positions in the KV cache. + +### How to debug + +Do **not** dump the KV cache tensors — they are prohibitively large. Instead, +dump the three tensors immediately **before** the `fill_kv_cache` call at the +suspect layer: + +```python +dump("key_states", key_states) # [num_tokens, num_kv_heads, head_size] +dump("value_states", value_states) # [num_tokens, num_kv_heads, head_size] +dump("kv_start_indices", kv_start_indices) # shape: [num_tokens] +``` + +**Key check**: `kv_start_indices.shape[0]` must equal `key_states.shape[0]`. 
+A mismatch means the index count does not match the token count being written, +which causes fill-time stomping and corrupts subsequent decode steps. + +--- + +## Path B — Operator precision + +The goal is to find the **first op** where lmdeploy+dlinfer diverges from the +reference. + +**Single-batch first**: if the issue is reproducible with a single request, +debug at batch_size=1. This eliminates batching interactions and simplifies +seqlen shapes. + +### Strategy: start at layer 0 + +Start at **layer 0** of the first linear-attention or full-attention block. +Do not start at the midpoint: most of the model's layers share the same +operator set, so layer 0's result is representative. If layer 0 is clean, +most other layers will be too; if layer 0 already diverges, fix it before +searching deeper. + +1. At layer 0, dump after each sub-op in order: RMSNorm → Attention → MLP. +2. Compare with the reference framework at the same layer + (e.g. vllm+vllm-ascend for Ascend). + - Sub-op diverges at layer 0 → that is the first divergent op; investigate + it. + - Layer 0 is fully clean → use binary search across later layers (check + layer N/2, then narrow down) to find the first divergent layer. +3. After identifying the faulty op, selectively verify one or two more layers + that might behave differently (e.g. the last layer, MoE layers if + applicable). + +### Comparison method + +For **deterministic vendor ops** (e.g. `torch_npu` on Ascend): use +`torch.equal()`. These ops must produce bit-identical outputs. Any difference +is a real bug. + +For **non-deterministic ops** (e.g. triton, less common on Ascend): +`torch.equal()` may be too strict due to FP rounding. Check error magnitude +instead: + +```python +diff = (a - b).abs() +print("max abs:", diff.max().item()) +print("max rel:", (diff / b.abs().clamp(min=1e-8)).max().item()) +``` + +A relative error below ~1e-3 is generally acceptable; above that it is a real +divergence. 
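The "layer 0 first, then binary search" strategy from this path can be sketched as follows. This is a minimal illustration, not part of either framework: `layer_matches` is a hypothetical callback that would load the two dumps for a given layer and compare them, and the search assumes that once a layer's output diverges, every later layer also diverges because the error propagates forward.

```python
from typing import Callable, Optional


def first_divergent_layer(num_layers: int, layer_matches: Callable[[int], bool]) -> Optional[int]:
    """Return the first layer whose output differs from the reference, or None if all layers match."""
    if num_layers <= 0:
        return None
    # Always check layer 0 before bisecting.
    if not layer_matches(0):
        return 0
    # If the last layer still matches, no layer diverges (under the propagation assumption).
    if layer_matches(num_layers - 1):
        return None
    lo, hi = 0, num_layers - 1  # invariant: layer lo matches, layer hi diverges
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if layer_matches(mid):
            lo = mid
        else:
            hi = mid
    return hi
```

Each call to `layer_matches` stands for one dump-and-compare run, so the number of runs grows with the logarithm of the layer count rather than linearly.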
+ +### Once the divergent op is found + +Read its implementation stack: + +- `lmdeploy/lmdeploy/pytorch/backends/dlinfer//` — the `Impl` class +- `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/` — the thin kernel wrapper +- `dlinfer/dlinfer/vendor//` — the actual hardware op call + (e.g. for Ascend: `ascend/torch_npu_ops.py`) + +Dump the **inputs** to that op in both frameworks and check whether they are +identical. If inputs differ, the bug is upstream; if inputs are identical but +outputs differ, the bug is in the op itself (wrong argument order, dtype, or +shape). + +--- + +## Path C — Communication / parallelism + +Precision bugs that only appear with certain parallelism configurations point +to communication or parallelism-patch issues. Before debugging, understand +which parallelism level introduces the problem. + +### lmdeploy parallelism terminology + +lmdeploy supports three parallelism dimensions used in combination: + +- **TP only** (EP=1, DP=1): attention and FFN are both sharded across `tp` + GPUs. Total GPUs = tp. +- **dp×tp** (EP=1, DP>1): attention uses dp×tp GPUs total; within each DP + group, `tp_size = tp / dp`. When EP=1, `tp` in the config equals the total + GPU count. +- **dp×tp + ep** (EP>1): attention uses dp×tp as above; FFN/MoE experts are + further sharded across `ep` groups. When EP>1, `tp` in the config is the + tp_size **per DP group** (not the total GPU count). + +### Isolation strategy + +Not all models fit at TP=1. Work through the parallelism hierarchy from +simplest to most complex, stopping at the level that introduces the bug: + +1. **TP only** (simplest that fits the weights): run both frameworks with + TP=N, DP=1, EP=1 (where N is the minimum number of GPUs needed to load + the model). + - Bug present → issue is in TP operator sharding or all_reduce; go to + operator dumps (Path B) focusing on the all_reduce outputs. + - Clean → proceed to step 2. + +2. **dp×tp** (add DP): increase DP while keeping EP=1. 
+ - Bug appears → issue is in DP+TP interaction; check communication between + DP groups. Read + `dlinfer/dlinfer/framework/lmdeploy_ext/device/.py` for relevant + patches. + - Clean → proceed to step 3 (only for MoE models). + +3. **dp×tp + ep** (add EP): enable EP>1. + - Bug appears → issue is in expert parallelism or EP communication. Read + the MoE forward class in `device/.py` (e.g. + `AscendMoEForwardDPTP` in `device/ascend.py`) and verify the MoE routing + and reduce-scatter pattern. + +### Dummy data in idle DP groups + +When dp > 1, lmdeploy fills DP groups that have no real requests with +**dummy data of sequence length 1**. vllm-ascend uses a similar mechanism. +This is expected behaviour, not a bug. When dumping tensors across DP groups: + +- Idle DP groups will show tensors with a leading dimension of 1 — do not + mistake this for a seqlen mismatch. +- Only compare tensor values in DP groups that are actually processing real + tokens. +- If a precision discrepancy appears specifically in the idle-group dummy + path, verify that both frameworks use the same dummy length and that the + dummy data does not pollute the real groups' KV cache slots. + +### When TP=1 is impossible + +If the model is too large to fit at TP=1, start at the minimum TP that loads +the weights and compare it against the same TP on the reference side. You can +still isolate DP and EP by fixing TP and varying DP/EP independently. + +### What to read + +- `dlinfer/dlinfer/framework/lmdeploy_ext/device/.py` — patches for + distributed behaviours specific to the hardware (e.g. for Ascend: `ascend.py` + containing `AscendMoEForwardDPTP` for MoE communication). +- Dump outputs immediately **before and after** each all_reduce / all_gather + call across ranks to find where values first diverge. + +--- + +## Tensor dump mechanics + +**Always dump to files. Never use `print` or `logger`.** + +On multi-rank runs, log output from all ranks interleaves and individual +tensor values are lost. 
Use `torch.save` to write per-rank files instead. + +```python +import os, torch, torch.distributed as dist + +_DUMP_DIR = "/tmp/dlinfer_dump" +os.makedirs(_DUMP_DIR, exist_ok=True) + +def dump(name: str, tensor: torch.Tensor): + rank = dist.get_rank() if dist.is_initialized() else 0 + torch.save(tensor.detach().cpu(), f"{_DUMP_DIR}/{name}_rank{rank}.pt") +``` + +**Naming convention**: `{layer}_{op}_{input|output}_rank{rank}.pt` + +Example: `layer0_attn_output_rank0.pt` for lmdeploy+dlinfer, same name in a +separate directory for vllm+vllm-ascend, so files are easy to pair. + +**Loading and comparing**: + +```python +a = torch.load("dlinfer/layer0_attn_output_rank0.pt") +b = torch.load("vllm/layer0_attn_output_rank0.pt") + +# deterministic vendor ops (e.g. torch_npu on Ascend) — expect exact match +print(torch.equal(a, b)) + +# triton / float ops — check error magnitude +diff = (a - b).abs() +print("max abs:", diff.max().item()) +print("max rel:", (diff / b.abs().clamp(min=1e-8)).max().item()) +``` + +**Placement of dump calls**: add dumps inside the +`lmdeploy/lmdeploy/pytorch/backends/dlinfer//` `Impl` classes — right +after the kernel call and before returning. This gives the output in +framework-native shape and is above the vendor-specific layer. 
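Because both frameworks write same-named files into separate directories, pairing them can be automated before any tensor is loaded. The sketch below uses only the standard library. The function name `pair_dumps` and the two directory arguments are assumptions for illustration; the source does not prescribe a pairing tool.

```python
from pathlib import Path
from typing import List, Tuple


def pair_dumps(dir_a: str, dir_b: str) -> Tuple[List[str], List[str], List[str]]:
    """Return (paired, only_in_a, only_in_b) file names for *.pt dumps found in two directories."""
    names_a = {p.name for p in Path(dir_a).glob("*.pt")}
    names_b = {p.name for p in Path(dir_b).glob("*.pt")}
    return (
        sorted(names_a & names_b),   # present on both sides: ready to compare
        sorted(names_a - names_b),   # dumped only by the first framework
        sorted(names_b - names_a),   # dumped only by the second framework
    )
```

Names that appear on only one side usually mean a dump call is missing or named differently in one framework, which is worth fixing before comparing values.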
+ +--- + +## Checklist + +- [ ] Same SoC, warmup disabled, `--eager-mode true`, same TP/DP/EP, + temperature=0 / top_k=1 confirmed +- [ ] Single-batch reproduction attempted first +- [ ] Output token comparison done on the same prompt +- [ ] Parallelism hierarchy tested from simplest fitting config upward +- [ ] Root cause path identified: A (KV cache) / B (operator) / C + (communication) +- [ ] Tensor dumps use file-based approach (not print / logger) +- [ ] Layer 0 verified first before searching deeper layers +- [ ] First divergent layer / op identified +- [ ] Inputs to the divergent op verified (identical or upstream bug found) +- [ ] Root cause fixed and output tokens re-verified + +--- + +## Troubleshooting + +### Accumulating divergence, but Path A checks out + +**Symptom**: `kv_start_indices` length matches `key_states`, but divergence +still grows with sequence length. + +**Cause**: Operator precision bugs (e.g. cos/sin falling back to CPU) also +accumulate across decode steps and look identical to cache pollution from the +outside. + +**Action**: Shift to Path B. Dump layer 0 sub-ops (RMSNorm → Attention → MLP) +to confirm whether divergence begins there. + +--- + +### Tensors in some DP groups show unexpected length-1 shapes + +**Symptom**: When dp > 1, tensors in idle DP groups have a leading dimension +of 1 and produce odd-looking outputs. + +**Cause**: lmdeploy (and vllm-ascend) fill DP groups that have no real +requests with dummy data of sequence length 1. This is expected behaviour. + +**Action**: Only compare tensor values in DP groups that are actually +processing real tokens. Do not treat length-1 tensors from idle groups as +errors. + +--- + +### Precision issue at dp×tp or ep, but unclear which dimension causes it + +**Symptom**: Adding DP or EP introduces a precision regression, but the root +cause dimension is unknown. 
+ +**Cause**: Testing multiple parallelism dimensions simultaneously makes it +impossible to isolate which one introduces the bug. + +**Action**: Fix TP, add DP first. If dp×tp is clean, then add EP. See Path C +for the full isolation strategy. + +--- + +### Dump files are empty, truncated, or contain garbled content + +**Symptom**: Saved dump files have no usable data, or values from multiple +ranks are mixed together. + +**Cause**: Using `print` or `logger` on multi-rank runs causes output from all +ranks to interleave and overwrite each other. + +**Action**: Use `torch.save` to write a separate file per rank. See the Tensor +dump section for the pattern. + +--- + +### KV cache pollution bug appears intermittently + +**Symptom**: Same prompt gives different results across runs; the precision +issue is not consistently reproducible. + +**Cause**: Warmup leaves stale KV cache entries that corrupt subsequent +inference runs. + +**Action**: Disable warmup on both lmdeploy+dlinfer and vllm+vllm-ascend, +then retry. + +--- + +### Generated tokens differ between runs on the same prompt + +**Symptom**: Token-level comparison is not stable; the outputs change each +run. + +**Cause**: `temperature > 0` introduces sampling randomness that makes +comparison meaningless. + +**Action**: Set `temperature=0, top_k=1` on both sides. + +--- + +### Binary search finishes but the bug was in an early layer all along + +**Symptom**: After several bisection steps the divergent layer turns out to be +very early in the model. + +**Cause**: Starting at layer N/2 skips checking whether layer 0 is already +wrong. + +**Action**: Always verify layer 0 first. If layer 0 is clean, then apply +binary search to the remaining layers. + +--- + +### Found the suspected divergent op, inputs look identical, root cause unclear + +**Symptom**: Inputs to the op match between frameworks, but you cannot tell +whether the op itself is at fault. 
+ +**Cause**: Without output dumps, there is no evidence of whether the op +produces wrong results. + +**Action**: Dump both inputs and outputs. If inputs match but outputs differ, +the bug is in the op call itself — check argument order, dtype, and shape +passed to the hardware op (e.g. NPU op on Ascend). diff --git a/.claude/skills/en/support-new-model/SKILL.md b/.claude/skills/en/support-new-model/SKILL.md new file mode 100644 index 00000000..7adfa59b --- /dev/null +++ b/.claude/skills/en/support-new-model/SKILL.md @@ -0,0 +1,255 @@ +--- +name: support-new-model +description: Add support for a new model (already in lmdeploy's CUDA backend) on domestic AI hardware (Ascend / CAMB / MACA) via dlinfer. +--- +# support-new-model + +You are helping the user adapt a new LLM or VLM for dlinfer's supported +hardware backends (Ascend NPU, CAMB MLU, MACA GPU). The model already runs +on CUDA via lmdeploy — your job is to identify what is missing for the target +vendor and implement it. + +--- + +## Step 1 — Gather information + +Ask the user: + +1. **Which model** are you adding support for? (name as it appears in + `lmdeploy/pytorch/models/`, e.g. `qwen3`, `deepseek_v2`) +2. **Which vendor(s)** are you targeting? (ascend / camb / maca — may be + multiple) + +Do not proceed until both questions are answered. + +--- + +## Step 2 — Analyse the model + +Read all of the following files yourself using Read/Bash tools — do not ask +the user: + +```text +lmdeploy/lmdeploy/pytorch/models/.py +lmdeploy/lmdeploy/pytorch/backends/dlinfer/op_backend.py +lmdeploy/lmdeploy/pytorch/backends/dlinfer//op_backend.py ← per vendor +``` + +The full call chain is: `models/.py` → `lmdeploy/pytorch/nn/` +→ `backends/dlinfer/` → `kernels/dlinfer/` → `dlinfer/ops/` → `vendor/`. +If the connection between a model layer and its backend op is unclear, trace +through `lmdeploy/pytorch/nn/` to find the intermediate abstraction. 
+`lmdeploy/pytorch/kernels/default/` contains the CUDA reference +implementations and is useful as a specification when writing new vendor ops. + +### From `models/.py`, identify + +- Every non-trivial operator the model uses: attention variants (paged, flash, + MLA), MLP activation functions, RMS norm variants, MoE routing, rotary + embedding variants (standard, MROPE, multi-scale), quantization ops. +- Whether the model passes any fields through `StepContext` or `attn_metadata` + beyond the standard set: `input_ids`, `position_ids`, `block_offsets`, + `q_seqlens`, `kv_seqlens`, `kv_start_indices`. + Known extra fields already handled: `state_ids` (SSM), + `mrope_position_ids` (MROPE), + `cu_seqlens` / `has_initial_state` (Gated Delta Networks). + +### From the generic `op_backend.py`, check + +- `get_layer_impl_builder()`: which `OpType`s already have a dlinfer `Impl`. + Cross-reference with the op list above to identify gaps → **Path A**. + +### From `/op_backend.py`, check each of the following carefully + +- **`update_step_context()`**: this method builds `attn_metadata` and (for + Ascend) `moe_metadata` for every inference step. Verify that it correctly + handles all fields the new model requires. If the model introduces new + context fields or a new attention mode (e.g. a new `is_gated_delta`-style + flag), this method must be extended → **Path B**. +- **`get_k_block_shape()` / `get_v_block_shape()`**: confirm the KV cache + layout matches what the model's attention implementation expects. Different + vendors and even different SoC generations (Ascend A2 vs A3, 310P) may use + different layouts → **Path B** if wrong. +- **`AscendKVQuantMeta`** (Ascend only): if the model uses KV cache + quantization with a scale/offset format different from the current + implementation → **Path B**. 
+ +Summarise your findings to the user before writing any code: + +- Op gaps (→ Path A) +- Vendor `op_backend.py` gaps (→ Path B) +- Framework-level gaps (→ Path C) + +--- + +## Path A — Add missing ops (4-layer stack) + +Follow this path for every op that is absent from `get_layer_impl_builder()`. + +Implement each layer in top-to-bottom order: + +### Layer 1 — `lmdeploy/lmdeploy/pytorch/backends/dlinfer/` + +Add a new `XxxImpl` (inherits lmdeploy base `Impl`) and `XxxBuilder` +(with `build()`). +Register the builder in `op_backend.py`'s `get_layer_impl_builder()` +dispatcher. +Reference: `activation.py` (simplest), `norm.py`, `attention.py` (most +complex). + +### Layer 2 — `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/` + +Add a thin wrapper function calling `dlinfer.ops.(...)`. +Export it from `__init__.py`. + +### Layer 3 — `dlinfer/dlinfer/ops/llm.py` + +Register with `@register_custom_op("dlinfer::", [...])`. +Forward to `vendor_ops_registry[""]`. +**The key string must exactly match the function name in Layer 4.** + +### Layer 4 — `dlinfer/dlinfer/vendor//` + +Add `@register_ops(vendor_ops_registry)` implementation calling the vendor's +native op: + +- **Ascend**: `torch.ops.npu.*` — see `vendor/ascend/torch_npu_ops.py` +- **CAMB**: `tmo.*` (`torch_mlu_ops`) — see `vendor/camb/camb_ops.py` +- **MACA**: `mcoplib.*` — see `vendor/maca/maca_ops.py` + +**Ascend**: before writing any new op in `torch_npu_ops.py`, ask the user to +provide the official NPU operator documentation for that op. Implement +strictly according to the docs: parameter names, tensor shapes, and dtype +constraints are not always inferrable from existing code and a mismatch causes +hard-to-debug runtime errors. + +For complex ops (e.g. Ascend attention with graph-mode bookkeeping), split +logic into a helper module (e.g. `vendor/ascend/attention.py`) and import +from `torch_npu_ops.py`. 
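The requirement that the Layer 3 key match the Layer 4 function name is easiest to see with a toy registry. The sketch below is not dlinfer's code: `toy_registry`, `toy_register_ops`, `example_op`, and `dispatch` are hypothetical stand-ins, and it assumes the real decorator stores each function under its own name, which is what the matching rule above implies.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the vendor op registry; the real object lives in dlinfer.
toy_registry: Dict[str, Callable] = {}


def toy_register_ops(registry: Dict[str, Callable]) -> Callable[[Callable], Callable]:
    """Toy decorator: store the function under its own name (assumed behaviour)."""
    def decorator(fn: Callable) -> Callable:
        registry[fn.__name__] = fn
        return fn
    return decorator


@toy_register_ops(toy_registry)
def example_op(x: float, gate: float) -> float:
    # Placeholder body; a real Layer 4 function would call the vendor's native op.
    return x * gate


def dispatch(op_name: str, *args):
    """Layer 3 analogue: look up the vendor function by key and forward the call."""
    return toy_registry[op_name](*args)
```

With this shape, a key that differs from the function name by even one character fails at lookup time with a `KeyError`, which is why the checklist asks for an exact match.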
+ +--- + +## Path B — Vendor-specific `op_backend.py` changes + +File: `lmdeploy/lmdeploy/pytorch/backends/dlinfer//op_backend.py` + +Handle each sub-case independently: + +### B1 — `update_step_context()`: new context fields or attention modes + +When the new model requires fields in `attn_metadata` that the current +implementation does not populate, extend `update_step_context()`: + +- Add the computation of the new field (following the existing + helper-function pattern inside the method). +- Pass the new field when constructing `attn_metadata` at the end of the + method. +- For Ascend: also extend `moe_metadata` if the model introduces a new MoE + communication pattern or parallelism topology. + +Reference: the `is_gated_delta` block (adds `cu_seqlens` and +`has_initial_state`), +the `kv_quant_policy == 8` block (populates `AscendKVQuantMeta`). + +### B2 — `get_k_block_shape()` / `get_v_block_shape()`: KV cache layout + +This rarely needs changing once the hardware target is fixed. Skip unless the +new model introduces a fundamentally different attention architecture that +requires a new block memory layout not covered by any existing vendor backend. + +### B3 — `AscendKVQuantMeta`: KV quantization (Ascend only) + +Legacy feature; its correctness is not actively verified. Skip for standard +model support — only revisit if KV cache quantization is explicitly required +and confirmed to be working. + +--- + +## Path C — Framework patches (`dlinfer/dlinfer/framework/lmdeploy_ext/`) + +Each sub-area is independent — assess and handle separately. + +### C1 — cudagraph / aclgraph buffer management + +**When needed**: only when the model introduces a new `StepContext` field +whose **shape varies with batch size or sequence length** at runtime. +Fixed-shape tensors do not need special buffer management. 
Example: +`x_active_mask` (shape `[batch_size]`) was added to handle Expert Parallelism +— its size changes per step, so it requires a pre-allocated maximum-size +buffer. + +- **Ascend**: `framework/lmdeploy_ext/cudagraph/ascend_cudagraph.py` + - `make_buffers_cudagraph`: allocate the new field at maximum size + (`max_batches` / `max_tokens`). Using runtime size here causes shape + errors on replay. + - `fill_buffers_cudagraph`: copy runtime values into the pre-allocated + buffer. + - `update_context_cudagraph`: wire the buffer back into the step context. + - Reference: `is_ssm` (`state_ids`) and `use_mrope` (`mrope_position_ids`) + paths. +- **Other vendors**: apply the same pattern in `camb_cudagraph.py` / + `maca_cudagraph.py`. + +Skip if the model uses only the standard fields already handled. + +### C2 — Device-specific patches + +**When needed**: when the model requires a vendor-specific override of +lmdeploy behaviour (e.g. a different MoE communication strategy on Ascend, an +unsupported sampling op on CAMB, hardware-specific cache formats such as +Ascend 310P NZ layout). + +- **Ascend**: `framework/lmdeploy_ext/device/ascend.py` +- **CAMB**: `framework/lmdeploy_ext/device/camb.py` + +Patch the relevant lmdeploy class method directly. Ensure the file is +imported in `framework/lmdeploy_ext/device/__init__.py`. + +### C3 — Quantization patches + +**When needed**: only when the model uses AWQ and the weight packing or scale +layout differs from the current Ascend implementation. + +File: `framework/lmdeploy_ext/quants/ascend_awq.py` + +This file patches `WeightOnlyQLinear`, `MergedAwqLinear`, `AwqLinear`, and +`QKVAwqLinear`. Only modify if the new model's quantized checkpoint uses a +layout the current patches cannot handle. 
+ +--- + +## Verification checklist + +**Path A (new op):** + +- [ ] All 4 layers implemented for each missing op +- [ ] `get_layer_impl_builder()` dispatcher updated in generic `op_backend.py` +- [ ] `vendor_ops_registry` key in `ops/llm.py` exactly matches the decorated + function name in the vendor file +- [ ] New kernel exported from `kernels/dlinfer/__init__.py` + +**Path B (vendor `op_backend.py`):** + +- [ ] `update_step_context()` populates all fields the new model's + `attn_metadata` requires + +**Path C1 (graph buffers):** + +- [ ] New field pre-allocated at max size in `make_buffers_cudagraph` +- [ ] New field filled in `fill_buffers_cudagraph` +- [ ] New field wired back in `update_context_cudagraph` + +**Path C2 (device patch):** + +- [ ] Patch applied directly to the lmdeploy class +- [ ] Patch file imported in `device/__init__.py` + +**Path C3 (quant patch):** + +- [ ] Weight packing / scale layout verified against checkpoint format +- [ ] Relevant class methods patched in `ascend_awq.py` + +**General:** + +- [ ] Eager mode: model runs without error +- [ ] Graph mode: model runs without error (if vendor supports it) diff --git a/.claude/skills/zh_cn/precision-align/SKILL.md b/.claude/skills/zh_cn/precision-align/SKILL.md new file mode 100644 index 00000000..f65fcf9a --- /dev/null +++ b/.claude/skills/zh_cn/precision-align/SKILL.md @@ -0,0 +1,411 @@ +--- +name: precision-align +description: 诊断并修复 lmdeploy+dlinfer 在国产 AI 硬件上的精度问题, + 通过与参考实现对比找到偏差根因。 +--- +# 精度对齐 + +你正在帮助用户修复 lmdeploy+dlinfer 在国产 AI 硬件后端 +(Ascend、CAMB 或 MACA)上的精度问题。 +参考实现通常是 vllm+vllm-ascend(针对 Ascend)或其他约定的参考框架。 +目标是找到 lmdeploy+dlinfer 与参考实现的偏差根因并修复。 + +本 skill 中的示例以 Ascend 为具体硬件。CAMB 和 MACA 适用相同方法论, +替换对应的 vendor 路径和算子调用即可。 + +--- + +## 第一步 — 收集信息 + +询问用户: + +1. **对齐的是哪个模型**?(例如 `qwen3`、`deepseek_v2`) +2. **目标硬件是哪个**?(ascend / camb / maca) +3. **现象是什么**?——例如从第一个 token 就不一致、几个 token + 之后开始乱说、精度评测分数下降了多少分。 +4. **并行配置**:使用的是什么 TP / DP / EP? +5. 
**是否有初步观察**?第一个生成的 token 就已经错误(→ prefill 问题), + 还是前几个 token 正确之后才出现偏差(→ decode / KV cache 问题)? +6. **单 batch 还是多 batch**?用单个请求(batch_size=1)能否复现问题, + 还是只有多个请求同时处理时才出现? + +以上问题都得到回答后再继续。 + +--- + +## 第二步 — 确认环境配置 + +开始排查之前,先确认对比环境是受控的。两侧除被测框架不同外, +其余条件必须完全一致: + +| 条件 | lmdeploy+dlinfer | vllm+vllm-ascend | +|---------------------------|------------------------|----------------------| +| 相同的 SoC 版本 | ✓ | ✓ | +| 关闭 warmup | ✓ | ✓ | +| Eager mode | ✓(`--eager-mode true`)| ✓(`--enforce-eager`)| +| 相同的 TP / DP / EP | ✓ | ✓ | +| `temperature=0`、`top_k=1` | ✓ | ✓ | +| 相同的 prompt / 输入 | ✓ | ✓ | + +如果任何条件未满足,先修复。Warmup 会在 KV cache 中遗留脏数据; +temperature > 0 引入采样随机性——两者都会掩盖真实的精度 bug。 + +--- + +## 第三步 — 快速输出对比 + +用**相同的 prompt** 运行两个框架,直接对比生成的 token 序列。 + +- **token 一致** → 输出对齐,精度大概率没问题。建议用 opencompass + 或 evalscope 跑评测分数确认。 +- **token 不一致** → 进入第四步。 + +--- + +## 第四步 — 诊断根因 + +根据现象确定排查方向: + +| 现象 | 最可能的原因 | 路径 | +|--------------------------------|-------------------------------|--------| +| 第一个生成的 token 就已经错误 | Prefill 算子精度问题 | B | +| 第一个 token 正确,之后开始偏差| KV cache 污染或 decode 算子 | A 或 B | +| 偏差随序列长度增加而累积 | KV cache 污染**或**算子精度 | A 或 B | +| 在某个固定深度立即出现偏差 | 算子精度问题 | B | +| 仅在 TP > 1 或 dp×tp / ep 下 | 通信 / 并行策略 patch | C | +| 多 batch 时出错,单 batch 正常 | Batching / seqlen / masking | A 或 B | + +**重要提示**:偏差随序列长度累积**并不代表**一定是 KV cache 污染。 +算子精度问题同样会在 decode 步骤中逐渐累积——例如 rope embedding 的 +cos/sin 在 CPU 上计算,会让每个 token 的位置编码都略有偏差,最终累积 +成可见的精度下降(曾在 Qwen30B-A3B 上排查出此问题,LiveCodeBench 评分 +低 2 分)。不要在排除路径 B 之前就断定是 cache 污染。 + +**不确定时**: + +- 先用单 batch 请求复现,排除 batching 的干扰。 +- 从能装下模型权重的最简并行配置开始(具体见路径 C 的并行层次说明)。 + 部分大模型无法在 TP=1 下运行,"最简"指的是能加载权重的最少并行维度。 +- 然后从路径 B 的 layer 0 开始排查。 + +--- + +## 路径 A — KV cache 污染 + +KV cache 污染意味着 `fill_kv_cache` 的索引传错了,把某些 token 写到了 +错误的 cache slot(踩踏)。`fill_kv_cache` 的内部逻辑一般不会出问题—— +问题几乎总是在传给它的索引上。 + +### 检查重点 + +- `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/fill_kv_cache.py` +- `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/pagedattention.py` + +重点参数: + +- `fill_kv_cache` 
的 `kv_start_indices`:每个 token 在 cache 中的 + flat slot 索引。 +- `prefill_attention` 和 `paged_token_attention` / + `paged_attention_fwd` 的 `block_offsets`、`q_start_loc`、 + `q_seq_len`、`kv_seq_len`。 + +**多 batch 注意**:如果问题只在多请求时出现,重点核查每个请求的 +seqlen 追踪(`q_seq_len`、`kv_seq_len`)和 `kv_start_indices`。 +per-request 长度错误会导致 attention 从 KV cache 的错误位置读取数据。 + +### 排查方法 + +**不要 dump KV cache tensor 本身**——太大了。在可疑层的 +`fill_kv_cache` 调用**之前**,dump 以下三个 tensor: + +```python +dump("key_states", key_states) # shape: [num_tokens, num_kv_heads, head_size] +dump("value_states", value_states) # shape: [num_tokens, num_kv_heads, head_size] +dump("kv_start_indices", kv_start_indices) # shape: [num_tokens] +``` + +**关键检查**:`kv_start_indices.shape[0]` 必须等于 +`key_states.shape[0]`。如果长度不一致,说明索引数量与待写入的 token +数量不匹配,fill 时会发生踩踏,导致后续 decode 步骤的 cache 内容 +被污染。 + +--- + +## 路径 B — 算子精度 + +目标是找到 lmdeploy+dlinfer 与参考实现**第一次出现差异的算子**。 + +**先用单 batch**:如果单请求可以复现问题,在 batch_size=1 下排查。 +这样可以排除 batching 交互,seqlen 的形状也更简单。 + +### 策略:从 layer 0 开始 + +从第一个 linear attention block 或 full attention block 的 **layer 0** +开始,不要直接从中间层开始。原因:模型的大多数层使用相同的算子集合, +layer 0 的情况具有代表性。如果 layer 0 没有问题,其他层大概率也没问题; +如果 layer 0 已经有偏差,先修复它再往后看。 + +1. 在 layer 0 中,按顺序在每个子算子之后 dump: + RMSNorm → Attention → MLP。 +2. 与参考框架在同一层的结果对比(例如 Ascend 使用 + vllm+vllm-ascend)。 + - 某个子算子在 layer 0 就有偏差 → 这就是第一个出现问题的算子, + 深入排查。 + - layer 0 所有子算子均正常 → 偏差在更深的层。对后续层使用二分 + 查找(先看第 N/2 层,再缩小范围)。 +3. 
定位到有问题的算子后,有选择性地再验证一两个行为可能不同的层 + (例如最后一层、存在 MoE routing 的层)。 + +### 对比方法 + +**确定性 vendor 算子**(例如 Ascend 的 `torch_npu` 算子):使用 +`torch.equal()`。这类算子是确定性的,结果必须完全相同。任何差异都是 +真实 bug。 + +**非确定性算子**(例如 triton,Ascend 上较少见):`torch.equal()` +可能因浮点舍入而过于严格,改用误差量来判断: + +```python +diff = (a - b).abs() +print("最大绝对误差:", diff.max().item()) +print("最大相对误差:", (diff / b.abs().clamp(min=1e-8)).max().item()) +``` + +相对误差在 ~1e-3 以内通常可接受;超过这个量级则认为是真实偏差。 + +### 找到出现偏差的算子后 + +逐层读取其实现栈: + +- `lmdeploy/lmdeploy/pytorch/backends/dlinfer//` — `Impl` 类 +- `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/` — 薄 kernel wrapper +- `dlinfer/dlinfer/vendor//` — 实际硬件 op 调用 + (例如 Ascend:`ascend/torch_npu_ops.py`) + +Dump 两个框架中该算子的**输入**,确认是否相同。若输入已经不同,则 +bug 在上游;若输入相同但输出不同,则 bug 在算子调用本身(参数顺序 +错误、dtype 错误、shape 错误等)。 + +--- + +## 路径 C — 通信 / 并行策略 + +仅在某些并行配置下出现的精度问题,指向通信或并行策略 patch 的问题。 +排查前,先明确是哪个并行维度引入了问题。 + +### lmdeploy 并行术语 + +lmdeploy 支持三种并行维度的组合: + +- **仅 TP**(EP=1, DP=1):Attention 和 FFN 都在 `tp` 张 GPU 上切分。 + 总 GPU 数 = tp。 +- **dp×tp**(EP=1, DP>1):Attention 总共使用 dp×tp 张 GPU;每个 DP + 组内,`tp_size = tp / dp`。当 EP=1 时,配置中的 `tp` 等于总 GPU 数。 +- **dp×tp + ep**(EP>1):Attention 仍按 dp×tp 切分;FFN / MoE experts + 进一步在 `ep` 组之间切分。当 EP>1 时,配置中的 `tp` 是**每个 DP 组 + 的 tp_size**(不是总 GPU 数)。 + +### 隔离策略 + +不是所有模型都能在 TP=1 下运行。按并行复杂度从低到高逐步测试, +在引入问题的那一层停下: + +1. **仅 TP**(能装下权重的最简配置):用最少 GPU 数的 TP-only 配置 + (DP=1, EP=1)运行两个框架。 + - 有问题 → 问题在 TP 算子切分或 all_reduce;结合路径 B 的 dump + 方法,重点关注 all_reduce 前后的输出。 + - 正常 → 进入第 2 步。 + +2. **dp×tp**(加入 DP):保持 EP=1,增大 DP。 + - 有问题 → 问题在 DP+TP 交互或 DP 组间通信;读取 + `dlinfer/dlinfer/framework/lmdeploy_ext/device/.py` + 中相关的通信 patch。 + - 正常 → 进入第 3 步(仅限 MoE 模型)。 + +3. 
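**dp×tp + ep** (add EP): enable EP>1.
+   - Problem present → it is in expert parallelism or EP communication; read the
+     MoE forward class in `device/.py` (e.g., Ascend's
+     `AscendMoEForwardDPTP`) and check the MoE routing and
+     reduce-scatter pattern.
+
+### Dummy data in idle DP groups
+
+When dp > 1, lmdeploy fills DP groups that have no real request with **dummy data of length
+1**; vllm-ascend has a similar mechanism. This is expected behavior, not a bug.
+Keep this in mind when dumping tensors across DP groups:
+
+- An idle DP group's tensors have a leading dimension of 1. Do not mistake this for a
+  seqlen mismatch.
+- Compare values only for DP groups that actually processed real tokens.
+- If the precision problem happens to appear on an idle group's dummy path, confirm that both frameworks use
+  the same dummy length and that the dummy data does not pollute the KV cache slots of real groups.
+
+### When TP=1 cannot fit the weights
+
+If the model is too large to run at TP=1, start from the smallest TP that can load the weights and use
+the same TP on both sides. You can still fix TP and vary DP and EP separately to isolate the effect of each
+dimension.
+
+### Files to read
+
+- `dlinfer/dlinfer/framework/lmdeploy_ext/device/.py` —
+  the distributed-behavior patch for that hardware (e.g., Ascend's `ascend.py`, which contains
+  `AscendMoEForwardDPTP` for MoE communication).
+- Dump each rank's output **before and after** every all_reduce / all_gather call
+  to find the first communication operation that deviates.
+
+---
+
+## How to dump tensors
+
+**Always dump to files; never use `print` or `logger`.**
+
+In multi-rank runs, logs from all ranks interleave, so tensor values scroll away and become
+unreadable. Use `torch.save` to write a separate file per rank instead.
+
+```python
+import os, torch, torch.distributed as dist
+
+_DUMP_DIR = "/tmp/dlinfer_dump"
+os.makedirs(_DUMP_DIR, exist_ok=True)
+
+def dump(name: str, tensor: torch.Tensor):
+    rank = dist.get_rank() if dist.is_initialized() else 0
+    torch.save(tensor.detach().cpu(), f"{_DUMP_DIR}/{name}_rank{rank}.pt")
+```
+
+**Naming convention**: `{layer_index}_{op}_{input|output}_rank{rank}.pt`
+
+Example: `layer0_attn_out_rank0.pt` for lmdeploy+dlinfer, with
+the same file name for vllm+vllm-ascend stored in a different directory so the two are easy to pair.
+
+**Load and compare**:
+
+```python
+a = torch.load("dlinfer/layer0_attn_out_rank0.pt")
+b = torch.load("vllm/layer0_attn_out_rank0.pt")
+
+# Deterministic vendor ops (e.g., Ascend's torch_npu): expect an exact match
+print(torch.equal(a, b))
+
+# triton / floating-point ops: check the error magnitude
+diff = (a - b).abs()
+print("max abs error:", diff.max().item())
+print("max rel error:", (diff / b.abs().clamp(min=1e-8)).max().item())
+```
+
+**Where to dump**: preferably in the `Impl` classes under
+`lmdeploy/lmdeploy/pytorch/backends/dlinfer//`,
+right after the kernel call and before the return. This layer sits above the
+vendor-specific code, so output tensor shapes are in the framework-native format,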
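+which makes comparison easier.
+
+---
+
+## Acceptance checklist
+
+- [ ] Confirmed: same SoC version, warmup disabled, `--eager-mode true`, same
+      TP/DP/EP, temperature=0 / top_k=1
+- [ ] Tried to reproduce with a single batch first
+- [ ] Compared output tokens using the same prompt
+- [ ] Started from the simplest parallel configuration that fits the weights and worked upward step by step
+- [ ] Determined the debugging path: A (KV cache) / B (operator) / C (communication)
+- [ ] Tensor dumps written to files (not print / logger)
+- [ ] Verified layer 0 first, then searched deeper layers
+- [ ] Found the first deviating layer / operator
+- [ ] Confirmed whether the deviating operator's inputs match (or found an upstream bug)
+- [ ] Re-verified that output tokens match after the fix
+
+---
+
+## Troubleshooting
+
+### Divergence accumulates with sequence length, but Path A finds nothing
+
+**Symptom**: `kv_start_indices` has the same length as `key_states`, but the divergence still grows with
+sequence length.
+
+**Cause**: an operator precision problem (e.g., cos/sin falling back to CPU computation) also accumulates
+gradually over decode steps and looks identical to cache pollution from the outside.
+
+**Action**: switch to Path B. Dump each sub-operator of layer 0
+(RMSNorm → Attention → MLP) and confirm whether the deviation starts there.
+
+---
+
+### Some DP groups show an unexpected leading dimension of 1
+
+**Symptom**: with dp > 1, tensors in idle DP groups have a first dimension of 1, and the outputs look
+abnormal.
+
+**Cause**: lmdeploy (and vllm-ascend) fills DP groups that have no real request
+with dummy data of length 1. This is expected behavior.
+
+**Action**: compare values only for DP groups that actually processed real tokens; do not treat an idle group's
+length-1 tensors as errors.
+
+---
+
+### Precision problem under dp×tp or ep, but unclear which dimension introduced it
+
+**Symptom**: a precision regression appears after adding DP or EP, but it is unclear which dimension caused it.
+
+**Cause**: several parallel dimensions were changed at once, so they cannot be checked one at a time.
+
+**Action**: fix TP and add DP alone first; add EP only after dp×tp is fine. See the isolation strategy in Path C
+for details.
+
+---
+
+### Dump files are empty, truncated, or contain scrambled values
+
+**Symptom**: the saved dump has no usable data, or values from multiple ranks are mixed
+together.
+
+**Cause**: `print` or `logger` was used in a multi-rank run, and the ranks' output
+interleaved and overwrote each other.
+
+**Action**: use `torch.save` to write a separate file for each rank. See the
+"How to dump tensors" section.
+
+---
+
+### KV cache pollution bug is intermittent and cannot be reproduced reliably
+
+**Symptom**: the same prompt gives different results across runs, and the precision problem is unstable.
+
+**Cause**: warmup leaves stale data in the KV cache, which affects later inference.
+
+**Action**: disable warmup on both sides and compare again.
+
+---
+
+### The same prompt generates different tokens on every run
+
+**Symptom**: the token-level comparison is unstable and the result changes on every run.
+
+**Cause**: temperature > 0 introduces sampling randomness, which makes the comparison meaningless.
+
+**Action**: set temperature=0 and top_k=1 on both sides.
+
+---
+
+### Bisection ends and the bug turns out to be in a very early layer
+
+**Symptom**: after several rounds of bisection, the faulty layer turns out to be very early in the model.
+
+**Cause**: starting from layer N/2 skipped the check of layer 0.
+
+**Action**: verify layer 0 first; bisect the later layers only if layer 0 is fine.
+
+---
+
+### Found a suspect operator, inputs look identical, but the root cause is unclear
+
+**Symptom**: the operator's inputs match in both frameworks, but it is unclear whether the operator itself is at fault.
+
+**Cause**: the outputs were not dumped, so there is no direct evidence of whether the operator produces a wrong result.
+
+**Action**: dump both inputs and outputs. If the inputs match but the outputs differ, the bug is in the operator call
+itself: for the hardware op (e.g., the NPU op on Ascend), check the argument order, dtype,
+and 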
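shape.
diff --git a/.claude/skills/zh_cn/support-new-model/SKILL.md b/.claude/skills/zh_cn/support-new-model/SKILL.md
new file mode 100644
index 00000000..879c06ac
--- /dev/null
+++ b/.claude/skills/zh_cn/support-new-model/SKILL.md
@@ -0,0 +1,238 @@
+---
+name: support-new-model
+description: Adapt a new model to domestic AI hardware through dlinfer
+  (the model is already supported on the lmdeploy CUDA backend).
+---
+# Support a new model
+
+You are helping the user adapt a new LLM or VLM to a hardware backend supported by dlinfer
+(Ascend NPU, CAMB MLU, MACA GPU). The model already runs on CUDA through lmdeploy;
+your task is to find out what the target vendor is missing and fill it in.
+
+---
+
+## Step 1 — Gather information
+
+Ask the user:
+
+1. **Which model** should be adapted? (Give the model's name in `lmdeploy/pytorch/models/`,
+   e.g., `qwen3`, `deepseek_v2`)
+2. **Which vendors** are targeted? (ascend / camb / maca; more than one is allowed)
+
+Do not proceed until both questions are answered.
+
+---
+
+## Step 2 — Analyze the model
+
+Read all of the following files yourself with the Read/Bash tools; do not ask the user to read them:
+
+```text
+lmdeploy/lmdeploy/pytorch/models/.py
+lmdeploy/lmdeploy/pytorch/backends/dlinfer/op_backend.py
+lmdeploy/lmdeploy/pytorch/backends/dlinfer//op_backend.py
+```
+
+The full call chain is: `models/.py` → `lmdeploy/pytorch/nn/`
+→ `backends/dlinfer/` → `kernels/dlinfer/` → `dlinfer/ops/`
+→ `vendor/`. If the mapping between a model layer and a backend op is unclear,
+trace through the `lmdeploy/pytorch/nn/` intermediate layer.
+`lmdeploy/pytorch/kernels/default/` contains the CUDA reference implementations,
+which can serve as a spec when writing a new vendor op; read them as needed.
+
+### Identify from `models/.py`
+
+- All non-trivial operators the model uses: attention variants (paged, flash, MLA),
+  MLP activation functions, RMS norm variants, MoE routing, rotary embedding variants
+  (standard, MROPE, multi-scale), quantization ops, etc.
+- Whether the model passes new inputs beyond the standard fields through `StepContext` or `attn_metadata`.
+  The standard fields are: `input_ids`, `position_ids`, `block_offsets`,
+  `q_seqlens`, `kv_seqlens`, `kv_start_indices`. Known extension fields:
+  `state_ids` (SSM models), `mrope_position_ids` (MROPE models),
+  `cu_seqlens` / `has_initial_state` (Gated Delta Network).
+
+### Check in the generic `op_backend.py`
+
+- `get_layer_impl_builder()`: which `OpType`s already have a dlinfer `Impl`.
+  Compare against the op list above to find the gaps → **Path A**.
+
+### Check item by item in `/op_backend.py`
+
+- **`update_step_context()`**: this method builds `attn_metadata` at every inference step
+  (and also `moe_metadata` on Ascend). Check carefully
+  that it correctly handles every field the new model needs. If the model introduces a new context
+  field or a new attention mode (e.g., a flag like `is_gated_delta`),
+  this method needs to be extended → 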
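**Path B**.
+- **`get_k_block_shape()` / `get_v_block_shape()`**: confirm that the KV cache
+  memory layout matches what the model's attention implementation expects. Different vendors, and even different
+  SoC versions of the same vendor (Ascend A2 vs A3, 310P), may use different
+  layouts → **Path B** (if they do not match).
+- **`AscendKVQuantMeta`** (Ascend only): if the model uses KV cache quantization
+  and its scale/offset format differs from the current implementation → **Path B**.
+
+Before writing any code, report the analysis to the user:
+
+- Op gaps (→ Path A)
+- Vendor `op_backend.py` gaps (→ Path B)
+- Framework-level gaps (→ Path C)
+
+---
+
+## Path A — Add missing ops (4-layer stack)
+
+For every op missing from `get_layer_impl_builder()`, implement the layers
+one by one from top to bottom.
+
+### Layer 1 — `lmdeploy/lmdeploy/pytorch/backends/dlinfer/`
+
+Add `XxxImpl` (inheriting the lmdeploy base class `Impl`) and `XxxBuilder`
+(with a `build()` method). Register the Builder in the
+`get_layer_impl_builder()` dispatcher in `op_backend.py`.
+Reference: `activation.py` (simplest), `norm.py`, `attention.py` (most complex).
+
+### Layer 2 — `lmdeploy/lmdeploy/pytorch/kernels/dlinfer/`
+
+Add a thin wrapper function that calls `dlinfer.ops.(...)`,
+and export it in `__init__.py`.
+
+### Layer 3 — `dlinfer/dlinfer/ops/llm.py`
+
+Register the new op with `@register_custom_op("dlinfer::", [...])`;
+the function body forwards to `vendor_ops_registry[""]`.
+**The string key here must exactly match the name of the decorated function in Layer 4.**
+
+### Layer 4 — `dlinfer/dlinfer/vendor//`
+
+Add an implementation decorated with `@register_ops(vendor_ops_registry)` that calls the
+vendor's native op:
+
+- **Ascend**: `torch.ops.npu.*`; see `vendor/ascend/torch_npu_ops.py`
+- **CAMB**: `tmo.*` (`torch_mlu_ops`); see `vendor/camb/camb_ops.py`
+- **MACA**: `mcoplib.*`; see `vendor/maca/maca_ops.py`
+
+**Ascend**: before adding any operator to `torch_npu_ops.py`, ask the user for
+the operator's official NPU documentation. Follow the documentation strictly: parameter names, tensor shape constraints,
+and dtype constraints cannot be inferred from existing code, and mistakes cause runtime errors that are hard to trace.
+
+When the logic is complex (e.g., Ascend attention with graph-mode recording), split it into a helper
+module (e.g., `vendor/ascend/attention.py`) and import it in `torch_npu_ops.py`.
+
+---
+
+## Path B — Vendor-specific `op_backend.py` changes
+
+File: `lmdeploy/lmdeploy/pytorch/backends/dlinfer//op_backend.py`
+
+The following three sub-areas are independent; evaluate each separately.
+
+### B1 — `update_step_context()`: new context fields or attention modes
+
+When the new model needs fields in `attn_metadata` that the current implementation does not populate, extend
+`update_step_context()`:
+
+- Compute the new fields inside the method, following the pattern of the existing helper functions.
+- Pass the new fields in when `attn_metadata` is constructed at the end of the method.
+- On Ascend, if the model introduces a new MoE communication pattern or parallel topology, also extend
+  `moe_metadata`.
+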
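The first two bullets of B1 can be sketched in plain Python. This is only an illustration under stated assumptions: `ToyAttnMetadata` and `build_metadata` are hypothetical names, not lmdeploy or dlinfer APIs, and the meaning given to `cu_seqlens` (prefix sums of the query lengths) and `has_initial_state` (the request already has cached history) is an assumption rather than something the source confirms.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import List, Optional


@dataclass
class ToyAttnMetadata:
    """Hypothetical stand-in for a vendor attn_metadata object."""
    q_seqlens: List[int]
    kv_seqlens: List[int]
    # New optional fields; they stay None for models that do not need them.
    cu_seqlens: Optional[List[int]] = None
    has_initial_state: Optional[List[bool]] = None


def build_metadata(q_seqlens, kv_seqlens, is_gated_delta=False):
    """Compute the new fields first, then pass them in at construction."""
    cu_seqlens = None
    has_initial_state = None
    if is_gated_delta:
        # Prefix sums with a leading 0: request i owns the tokens in
        # [cu_seqlens[i], cu_seqlens[i + 1]).
        cu_seqlens = [0, *accumulate(q_seqlens)]
        # Assumed meaning: kv length exceeds query length, so history exists.
        has_initial_state = [kv > q for q, kv in zip(q_seqlens, kv_seqlens)]
    return ToyAttnMetadata(q_seqlens, kv_seqlens, cu_seqlens, has_initial_state)
```

Keeping the new fields optional with a `None` default means models that do not set the flag go through the same construction path unchanged.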
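+Reference: the `is_gated_delta` branch (adds `cu_seqlens` and
+`has_initial_state`) and the `kv_quant_policy == 8` branch
+(populates `AscendKVQuantMeta`).
+
+### B2 — `get_k_block_shape()` / `get_v_block_shape()`: KV cache layout
+
+Once the hardware target is fixed, this rarely changes. Skip it unless the new model introduces
+an entirely new attention memory-layout requirement that no existing vendor backend can cover.
+
+### B3 — `AscendKVQuantMeta`: KV quantization (Ascend only)
+
+Legacy feature; its correctness is not actively verified at present. Skip it for routine new-model adaptation, and handle it only when
+KV cache quantization is explicitly required and the feature has been confirmed to work.
+
+---
+
+## Path C — Framework patches (`dlinfer/dlinfer/framework/lmdeploy_ext/`)
+
+The following three submodules are independent; evaluate each separately.
+
+### C1 — cudagraph / aclgraph buffer management
+
+**Trigger**: needed only when the model introduces a new `StepContext` field whose
+**shape changes dynamically at runtime with batch size or seq_len**.
+Tensors with a fixed shape need no special buffer management. Example: `x_active_mask`
+(shape `[batch_size]`) was added to support Expert Parallelism; its
+size changes every step, so a buffer of the maximum size must be preallocated.
+
+- **Ascend**: `framework/lmdeploy_ext/cudagraph/ascend_cudagraph.py`
+  - `make_buffers_cudagraph`: preallocate the new field's tensor at the maximum size (`max_batches` /
+    `max_tokens`). Using the runtime size causes a
+    shape mismatch during graph replay.
+  - `fill_buffers_cudagraph`: copy the runtime data into the preallocated buffer.
+  - `update_context_cudagraph`: write the buffer back to the step context.
+  - Reference: the handling of `is_ssm` (`state_ids`) and `use_mrope`
+    (`mrope_position_ids`).
+- **Other vendors**: apply the same pattern in the corresponding `camb_cudagraph.py` /
+  `maca_cudagraph.py`.
+
+If the model only uses the existing standard fields, skip this section.
+
+### C2 — Device-specific patches
+
+**Trigger**: the model needs a vendor-level override of some lmdeploy behavior,
+for example a different MoE communication strategy on Ascend, a sampling op unsupported on CAMB,
+or a hardware-specific KV cache format (such as the Ascend 310P NZ format).
+
+- **Ascend**: `framework/lmdeploy_ext/device/ascend.py`
+- **CAMB**: `framework/lmdeploy_ext/device/camb.py`
+
+Patch the corresponding method directly on the lmdeploy class. Make sure the patch file is imported in
+`framework/lmdeploy_ext/device/__init__.py`.
+
+### C3 — Quantization patches
+
+**Trigger**: only when the model uses AWQ and its weight packing format or scale layout is incompatible with the current
+Ascend implementation.
+
+File: `framework/lmdeploy_ext/quants/ascend_awq.py`
+
+This file patches `WeightOnlyQLinear`, `MergedAwqLinear`, `AwqLinear`, and
+`QKVAwqLinear`. Modify it only when the new model's quantized checkpoint uses a layout the current patch cannot
+handle.
+
+---
+
+## Acceptance checklist
+
+**Path A (new ops):**
+
+- [ ] All 4 layers are implemented for every missing op
+- [ ] In the generic `op_backend.py`, 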
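`get_layer_impl_builder()` has been updated
+- [ ] The `vendor_ops_registry` key in `ops/llm.py` exactly matches the decorated function name
+      in the vendor file
+- [ ] The new kernel is exported in `kernels/dlinfer/__init__.py`
+
+**Path B (vendor `op_backend.py`):**
+
+- [ ] `update_step_context()` correctly populates every
+      `attn_metadata` field the new model needs
+
+**Path C1 (graph buffers):**
+
+- [ ] The new field is preallocated at the maximum size in `make_buffers_cudagraph`
+- [ ] The new field is filled in `fill_buffers_cudagraph`
+- [ ] The new field is written back to the context in `update_context_cudagraph`
+
+**Path C2 (device patch):**
+
+- [ ] The patch is applied directly to the lmdeploy class
+- [ ] The patch file is imported in `device/__init__.py`
+
+**Path C3 (quantization patch):**
+
+- [ ] Weight packing / scale layout verified against the checkpoint format
+- [ ] The relevant class methods are patched in `ascend_awq.py`
+
+**General:**
+
+- [ ] eager mode: the model runs inference correctly
+- [ ] graph mode: the model runs inference correctly (if the vendor supports it)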
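The Path A checklist item about the `vendor_ops_registry` key can be illustrated with a toy registry. This is a minimal sketch, assuming the decorator stores each function under its own `__name__`; the `register_ops` and `vendor_ops_registry` defined here are simplified stand-ins written for this example, not dlinfer's real implementations, and `my_new_op` and `call_vendor_op` are hypothetical names.

```python
from typing import Callable, Dict

# Simplified stand-in for dlinfer's registry (assumption: the real one is
# also keyed by the decorated function's name, as the checklist implies).
vendor_ops_registry: Dict[str, Callable] = {}


def register_ops(registry: Dict[str, Callable]):
    """Return a decorator that registers a function under its __name__."""
    def decorator(func: Callable) -> Callable:
        registry[func.__name__] = func
        return func
    return decorator


# "Layer 4": the vendor implementation. Its function name becomes the key.
@register_ops(vendor_ops_registry)
def my_new_op(values):
    # Toy computation standing in for a vendor-native op call.
    return [v * 2 for v in values]


# "Layer 3": the generic op forwards to the registry by string key.
def call_vendor_op(key: str, *args):
    if key not in vendor_ops_registry:
        raise KeyError(f"no vendor op registered under {key!r}")
    return vendor_ops_registry[key](*args)
```

With this layout, a typo in the Layer 3 key surfaces only at call time as a missing-key error, which is why the checklist asks for the two names to be compared explicitly.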