Add DeepSeek V4 prefill QKV RoPE tile by zhaozhaozz · Pull Request #374 · hw-native-sys/pypto-lib

zhaozhaozz · 2026-05-25T10:28:08Z

Summary

Add a DeepSeek V4 prefill Q/KV projection + partial RoPE kernel with a standalone golden test wrapper.
Use config-level PREFILL_BATCH=1 and PREFILL_SEQ=128 for the current QKV RoPE kernel invocation, while keeping internal token chunking for projection, quantization, and RoPE scopes.
Keep prefill model dimensions and tiling constants local to the prefill kernel so it does not inherit decode-specific T assumptions; reuse only the stable QKV tensor-spec helper for golden inputs.
Materialize RoPE rows from absolute start_pos and align the q-path RMS inverse math with the existing decode QKV core.
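One way to picture "materialize RoPE rows from absolute start_pos" with internal token chunking is the sketch below. It is not the kernel's code: the table builder, the base of 10000, the RoPE width of 8, and the chunk size of 32 are all illustrative assumptions. Only `start_pos`, the 128-token prefill window, and the idea of per-chunk row selection come from the description above.

```python
import math

def build_freq_tables(max_pos, rope_dim, base=10000.0):
    """Precompute cos/sin tables with one row per absolute position (assumed layout)."""
    half = rope_dim // 2
    inv_freq = [base ** (-2.0 * i / rope_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_pos)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_pos)]
    return cos, sin

def rope_rows_for_chunk(table, start_pos, t0, chunk, seq_len):
    """Rows for tokens t0 .. t0+chunk-1 of the prefill window, indexed by absolute position."""
    lo = start_pos + t0
    hi = start_pos + min(t0 + chunk, seq_len)
    return table[lo:hi]

# Hypothetical sizes: 128-token window starting at absolute position 128.
cos, sin = build_freq_tables(max_pos=512, rope_dim=8)
rows = []
for t0 in range(0, 128, 32):
    rows.extend(rope_rows_for_chunk(cos, start_pos=128, t0=t0, chunk=32, seq_len=128))
```

Because each chunk indexes the table by `start_pos + t0`, the chunked result matches a single slice over the whole window regardless of chunk size.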

Validation

python3 -m py_compile models/deepseek/v4/config.py models/deepseek/v4/prefill_qkv_proj_rope.py
Remote a2a3 NPU, PTOAS v0.41: python models/deepseek/v4/prefill_qkv_proj_rope.py
Remote a2a3 NPU, PTOAS v0.41: python models/deepseek/v4/prefill_qkv_proj_rope.py --start-pos 128

Both remote NPU runs passed for q, kv, qr, and qr_scale.

coderabbitai · 2026-05-25T10:28:21Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds a token-chunked DeepSeek-V4 prefill Q/KV projection kernel with partial RoPE tiling, introduces prefill tuning constants, implements a JIT core (RMSNorm, LoRA, per-row INT8 quantization, partial RoPE, INT8 matmul/dequant, BF16 outputs), and provides a Torch golden reference, tensor-spec builder, and CLI validator.

Changes

DeepSeek-V4 Prefill Q/KV Projection Kernel

| Layer / File(s) | Summary |
| --- | --- |
| Deployment tuning constants `models/deepseek/v4/config.py` | Reformats `DECODE_BATCH`/`DECODE_SEQ` and adds `PREFILL_BATCH`, `PREFILL_SEQ` module constants used by the new prefill kernel. |
| Module constants and JIT signature `models/deepseek/v4/prefill_qkv_proj_rope.py` | Defines prefill/tiling constants and the full signature for `prefill_qkv_proj_rope_core`. |
| Core kernel — chunked RMSNorm, LoRA, KV assembly `models/deepseek/v4/prefill_qkv_proj_rope.py` | Per-chunk RoPE cos/sin materialization, FP32 attention RMSNorm+gamma, LoRA query A, per-row INT8 `qr` quantization with scales, KV RMSNorm/gamma, partial RoPE on KV tail, and assembly of `kv`, `qr`, `qr_scale`. |
| Core kernel — INT8 query projection and final outputs `models/deepseek/v4/prefill_qkv_proj_rope.py` | W8A8C16 INT8 query projection using `qr`/`wq_b` with dequantization, per-head RMSNorm, partial RoPE on query tail, interleaved reassembly, and BF16 writes producing `(q, kv, qr, qr_scale)`. |
| JIT wrapper `models/deepseek/v4/prefill_qkv_proj_rope.py` | Thin `prefill_qkv_proj_rope` wrapper that forwards to the core implementation. |
| Golden reference `models/deepseek/v4/prefill_qkv_proj_rope.py` | Torch `golden_prefill_qkv_proj_rope` reproducing kernel numeric stages (RMSNorm+LoRA, amax→scale quantization, INT8 matmul/dequant, RoPE convention) and writing BF16 outputs and `qr`/`qr_scale`. |
| Tensor specs & CLI `models/deepseek/v4/prefill_qkv_proj_rope.py` | `build_tensor_specs(start_pos=...)` initializes inputs including `freqs_cos`/`freqs_sin`; CLI entrypoint runs the JIT with golden comparison and fails on mismatch. |
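The "amax→scale quantization" and "INT8 matmul/dequant" stages named in the table can be sketched in plain Python as follows. This is a hedged illustration, not the kernel or the golden reference: the symmetric range of 127, the per-output-column weight scale, the zero-row fallback, and both function names are assumptions.

```python
def quantize_row(row):
    """Per-row symmetric INT8 quantization: scale = amax / 127 (assumed convention)."""
    amax = max(abs(v) for v in row)
    scale = amax / 127.0 if amax > 0.0 else 1.0  # fallback for an all-zero row (assumption)
    # Python's round() is round-half-to-even; a real kernel may round differently.
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale

def int8_matmul_dequant(q_rows, row_scales, w_q, w_scales):
    """Integer accumulate, then dequantize with row scale * per-column weight scale."""
    out = []
    for q, s in zip(q_rows, row_scales):
        acc = [sum(q[k] * w_q[k][n] for k in range(len(q)))
               for n in range(len(w_scales))]
        out.append([a * s * w_scales[n] for n, a in enumerate(acc)])
    return out

q, scale = quantize_row([127.0, -64.0, 0.0])
y = int8_matmul_dequant([q], [scale], w_q=[[1], [2], [3]], w_scales=[0.5])
```

Keeping the accumulation in integers and applying both scales once at the end is what lets the quantized activations (`qr`) and their scales (`qr_scale`) be emitted as separate outputs.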

Sequence Diagram

sequenceDiagram
  participant Activations as Activations (input)
  participant RMSNorm as RMSNorm + Gamma
  participant LoRA_A as LoRA A
  participant Quant as INT8 Quantize (qr -> int8 + scale)
  participant KV_RoPE as Partial RoPE (KV tail)
  participant INT8MatMul as W8A8C16 MatMul (qr, wq_b)
  participant RoPE_Q as Partial RoPE (Q tail)
  participant Outputs as Outputs (q, kv, qr, qr_scale)

  Activations->>RMSNorm: chunked flatten & normalize
  RMSNorm->>LoRA_A: produce LoRA A query projection
  LoRA_A->>Quant: normalize & quantize qr (int8 + scale)
  Activations->>KV_RoPE: KV path RMSNorm + partial RoPE
  Quant->>INT8MatMul: qr_tile (int8) + qr_scale
  INT8MatMul->>RoPE_Q: dequantize & apply per-head RMSNorm, partial RoPE
  RoPE_Q->>Outputs: assemble interleaved heads, write BF16 q/kv and qr/qr_scale
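The "partial RoPE" steps in the diagram rotate only the trailing dimensions of each vector and pass the leading dimensions through unchanged. A minimal sketch follows, assuming an interleaved pairing of `(x[2i], x[2i+1])` within the tail. The pairing convention, the function name, and the sizes are assumptions, since the page does not spell out the kernel's exact layout.

```python
import math

def partial_rope(vec, rope_dim, cos_row, sin_row):
    """Rotate the last rope_dim entries pairwise; leave the leading entries untouched."""
    nope = len(vec) - rope_dim
    out = list(vec[:nope])            # non-rotary prefix is copied as-is
    tail = vec[nope:]
    for i in range(rope_dim // 2):
        a, b = tail[2 * i], tail[2 * i + 1]
        c, s = cos_row[i], sin_row[i]
        out.append(a * c - b * s)     # standard 2-D rotation of the pair (a, b)
        out.append(a * s + b * c)
    return out

angle = math.pi / 2
rotated = partial_rope([1.0, 2.0, 3.0, 4.0, 1.0, 0.0], rope_dim=2,
                       cos_row=[math.cos(angle)], sin_row=[math.sin(angle)])
```

With a zero angle the function is the identity, which is a quick sanity check for any layout variant.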

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Refactor: centralize DeepSeek-V4 kernel config into config.py #263: Centralized DeepSeek-V4 kernel and decode constants; this change adds prefill tuning constants and a new prefill kernel implementation.

Poem

🐇 I count the tokens, chunk by chunk,
RoPE twirls where bitstreams munch,
INT8 hums while BF16 sleeps,
Golden checks the sums it keeps,
Prefill kernels hop, and merrily crunch.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a DeepSeek V4 prefill QKV projection with RoPE tiling, which aligns with the primary additions in the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description directly addresses the changeset: it explains the addition of a DeepSeek V4 prefill Q/KV projection + RoPE kernel, config updates for PREFILL_BATCH and PREFILL_SEQ, and provides validation details matching the file changes shown in the summary. |

coderabbitai

🧹 Nitpick comments (1)

models/deepseek/v4/prefill_qkv_proj_rope.py (1)
408-408: 💤 Low value

Inconsistent use of pl.rsqrt vs pl.recip(pl.sqrt(...)).

This line uses pl.rsqrt(...) while all other inverse-RMS computations in this file (lines 147, 209, 288) and in the decode kernel use pl.recip(pl.sqrt(...)). While mathematically equivalent, different instruction paths could cause subtle numerical differences that may affect validation consistency.
Suggested fix for consistency
-                q_head_inv_rms = pl.rsqrt(pl.add(pl.mul(q_head_sq_sum, 1.0 / HEAD_DIM), EPS))
+                q_head_inv_rms = pl.recip(pl.sqrt(pl.add(pl.mul(q_head_sq_sum, 1.0 / HEAD_DIM), EPS)))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/prefill_qkv_proj_rope.py` at line 408, The line computing
q_head_inv_rms uses pl.rsqrt(...) which is inconsistent with other inverse-RMS
calculations; replace the pl.rsqrt(...) expression in the q_head_inv_rms
assignment with the equivalent pl.recip(pl.sqrt(...)) form (matching the pattern
used elsewhere) so the computation for q_head_inv_rms uses
pl.recip(pl.sqrt(pl.add(pl.mul(q_head_sq_sum, 1.0 / HEAD_DIM), EPS))) and thus
aligns numerically with the other inverse-RMS uses.
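To see why two mathematically equivalent inverse-RMS forms can disagree in the last bit, the sketch below emulates float32 rounding with `struct`. It is only a model: the two-step path rounds after `sqrt` and again after the reciprocal, while the one-step path stands in for a correctly rounded `rsqrt`. A hardware `rsqrt` may instead be an approximation with its own error profile, so this says nothing definitive about `pl.rsqrt` or `pl.recip`. The helper names and sample values are assumptions.

```python
import math
import struct

def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def inv_rms_two_step(mean_sq, eps):
    """Model of recip(sqrt(x)): round after sqrt, then round after the reciprocal."""
    s = f32(math.sqrt(f32(mean_sq + eps)))
    return f32(1.0 / s)

def inv_rms_one_step(mean_sq, eps):
    """Model of an idealized rsqrt(x): a single rounding of the final result."""
    return f32(1.0 / math.sqrt(f32(mean_sq + eps)))

samples = [0.01 + 0.37 * k for k in range(1000)]
mismatches = sum(
    inv_rms_two_step(m, 1e-6) != inv_rms_one_step(m, 1e-6) for m in samples
)
```

Any mismatch stays within about one float32 ULP, but that is enough to matter when a kernel is compared against a reference that uses the other form.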

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 023d4d38-d85a-4832-9a4c-4b961702d334

📥 Commits

Reviewing files that changed from the base of the PR and between cddcc84 and f0092c7.

📒 Files selected for processing (2)

models/deepseek/v4/config.py
models/deepseek/v4/prefill_qkv_proj_rope.py

gemini-code-assist

Code Review

This pull request introduces a new JIT-compiled kernel for DeepSeek-V4 prefill Q/KV projection and partial RoPE application, along with corresponding deployment configuration constants. The implementation covers RMSNorm, LoRA projections, and W8A8C16 quantization. Review feedback identifies opportunities to optimize the kernel by moving loop-invariant frequency slicing outside the batch tile loop and utilizing the pl.rsqrt primitive for more efficient RMSNorm inverse calculations across the query and KV paths.
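The loop-invariant hoisting mentioned in this review can be illustrated with a toy loop. The kernel's real loop structure is not shown on this page, so the tile layout, the function names, and the elementwise multiply below are hypothetical stand-ins for "slice the frequency rows, then use them per batch tile".

```python
def apply_naive(batch_tiles, freqs, start, n):
    """Re-slices the same frequency rows on every batch-tile iteration."""
    out, slices = [], 0
    for tile in batch_tiles:
        rows = freqs[start:start + n]   # does not depend on the tile
        slices += 1
        out.append([v * r for v, r in zip(tile, rows)])
    return out, slices

def apply_hoisted(batch_tiles, freqs, start, n):
    """Slices once before the loop and reuses the rows for every tile."""
    rows = freqs[start:start + n]
    out = [[v * r for v, r in zip(tile, rows)] for tile in batch_tiles]
    return out, 1

freqs = [0.5 * i for i in range(16)]
tiles = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
naive_out, naive_slices = apply_naive(tiles, freqs, start=4, n=4)
hoisted_out, hoisted_slices = apply_hoisted(tiles, freqs, start=4, n=4)
```

Both versions produce the same values; the hoisted one performs the slice once instead of once per tile.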

coderabbitai

♻️ Duplicate comments (1)

models/deepseek/v4/prefill_qkv_proj_rope.py (1)
402-405: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore pl.rsqrt(...) on the q-head RMS path.

Line 405 regresses the per-head q normalization to pl.recip(pl.sqrt(...)), but the decode kernel and the golden q path both use rsqrt here. That rounding delta is exactly on the post-dequant q path we compare, so this can drift from the validated decode behavior.
Proposed fix
-                q_head_inv_rms = pl.recip(pl.sqrt(pl.add(pl.mul(q_head_sq_sum, 1.0 / HEAD_DIM), EPS)))
+                q_head_inv_rms = pl.rsqrt(pl.add(pl.mul(q_head_sq_sum, 1.0 / HEAD_DIM), EPS))
Run this to verify the mismatch against the existing decode path and golden reference:
#!/bin/bash
set -euo pipefail

rg -n -C2 'q_head_inv_rms\s*=' models/deepseek/v4/prefill_qkv_proj_rope.py models/deepseek/v4/qkv_proj_rope.py
rg -n -C2 'torch\.rsqrt|pl\.rsqrt|pl\.recip\(pl\.sqrt' models/deepseek/v4/prefill_qkv_proj_rope.py models/deepseek/v4/qkv_proj_rope.py
Expected result: the validated decode q path and golden reference show rsqrt, while this prefill q-head RMS path shows pl.recip(pl.sqrt(...)).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/prefill_qkv_proj_rope.py` around lines 402 - 405, The
q-head RMS calculation currently uses pl.recip(pl.sqrt(...)) which causes
rounding differences; change the computation that assigns q_head_inv_rms to use
pl.rsqrt(...) instead (i.e., call pl.rsqrt on the same argument
pl.add(pl.mul(q_head_sq_sum, 1.0 / HEAD_DIM), EPS)), so the prefill q
normalization matches the decode/golden q path (refer to the q_head_inv_rms
variable, HEAD_DIM, EPS, and pl.rsqrt).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cc60a769-e828-47b3-97d7-6edc67ee8079

📥 Commits

Reviewing files that changed from the base of the PR and between f0092c7 and cf2c52c.

📒 Files selected for processing (2)

models/deepseek/v4/config.py
models/deepseek/v4/prefill_qkv_proj_rope.py

coderabbitai Bot reviewed May 25, 2026

gemini-code-assist Bot reviewed May 25, 2026

zhaozhaozz commented May 25, 2026

coderabbitai Bot reviewed May 26, 2026

userZ added 3 commits May 26, 2026 02:57

Add DeepSeek V4 prefill QKV RoPE tile

d3e53a7

Align prefill QKV RMS precision

4372da7

Use config prefill shape for QKV RoPE

6c5c7e1

zhaozhaozz force-pushed the feat/deepseek-v4-prefill-qkv-rope branch from cf2c52c to 6c5c7e1 Compare May 26, 2026 03:08

zhaozhaozz requested a review from zhangqi-chen May 26, 2026 03:37

Keep prefill QKV tiling local

2998ac0

zhangqi-chen merged commit 619fba2 into hw-native-sys:main May 26, 2026
7 checks passed
