
Add DeepSeek V4 prefill QKV RoPE tile #374

Merged
zhangqi-chen merged 4 commits into hw-native-sys:main from zhaozhaozz:feat/deepseek-v4-prefill-qkv-rope on May 26, 2026

Conversation

@zhaozhaozz
Contributor

@zhaozhaozz zhaozhaozz commented May 25, 2026

Summary

  • Add a DeepSeek V4 prefill Q/KV projection + partial RoPE kernel with a standalone golden test wrapper.
  • Use config-level PREFILL_BATCH=1 and PREFILL_SEQ=128 for the current QKV RoPE kernel invocation, while keeping internal token chunking for projection, quantization, and RoPE scopes.
  • Keep prefill model dimensions and tiling constants local to the prefill kernel so it does not inherit decode-specific T assumptions; reuse only the stable QKV tensor-spec helper for golden inputs.
  • Materialize RoPE rows from absolute start_pos and align the q-path RMS inverse math with the existing decode QKV core.
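
As an illustration of the last two bullets, here is a minimal sketch of materializing RoPE rows from an absolute `start_pos` and of walking a prefill sequence in token chunks. It is not the kernel code: `rope_rows`, `token_chunks`, `rope_dim`, `theta`, and the chunk size are hypothetical names and values chosen for the example, and the frequency formula is the common RoPE definition, assumed rather than taken from this PR.

```python
import math


def rope_rows(start_pos, seq_len, rope_dim, theta=10000.0):
    """Return (cos_rows, sin_rows) for absolute positions start_pos .. start_pos + seq_len - 1.

    Assumption: standard RoPE inverse frequencies theta ** (-2i / rope_dim).
    """
    half = rope_dim // 2
    inv_freq = [theta ** (-2.0 * i / rope_dim) for i in range(half)]
    cos_rows, sin_rows = [], []
    for pos in range(start_pos, start_pos + seq_len):
        cos_rows.append([math.cos(pos * f) for f in inv_freq])
        sin_rows.append([math.sin(pos * f) for f in inv_freq])
    return cos_rows, sin_rows


def token_chunks(total_tokens, chunk):
    """Yield half-open [begin, end) token ranges of at most `chunk` tokens."""
    for begin in range(0, total_tokens, chunk):
        yield begin, min(begin + chunk, total_tokens)


if __name__ == "__main__":
    # Rows for a 128-token prefill that starts at absolute position 128,
    # processed in hypothetical 32-token chunks.
    cos_rows, sin_rows = rope_rows(start_pos=128, seq_len=128, rope_dim=64)
    for begin, end in token_chunks(128, 32):
        chunk_cos = cos_rows[begin:end]
        print(begin, end, len(chunk_cos))
```

Because the rows depend only on the absolute position, a run with `--start-pos 128` should see the same rows as positions 128 onward of a run that starts at 0, which is the property the second validation command exercises.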

Validation

  • python3 -m py_compile models/deepseek/v4/config.py models/deepseek/v4/prefill_qkv_proj_rope.py
  • Remote a2a3 NPU, PTOAS v0.41: python models/deepseek/v4/prefill_qkv_proj_rope.py
  • Remote a2a3 NPU, PTOAS v0.41: python models/deepseek/v4/prefill_qkv_proj_rope.py --start-pos 128

Both remote NPU runs passed for q, kv, qr, and qr_scale.

@coderabbitai

coderabbitai Bot commented May 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a token-chunked DeepSeek-V4 prefill Q/KV projection kernel with partial RoPE tiling, introduces prefill tuning constants, implements a JIT core (RMSNorm, LoRA, per-row INT8 quantization, partial RoPE, INT8 matmul/dequant, BF16 outputs), and provides a Torch golden reference, tensor-spec builder, and CLI validator.
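
The per-row INT8 quantization step mentioned above can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the kernel: the function names are hypothetical, the scale is assumed to be `amax / 127` (the kernel may store the reciprocal or use a different range), and Python's `round` uses round-half-to-even, which may differ from the hardware rounding mode.

```python
def quantize_row_int8(row):
    """Quantize one row to INT8 with a single per-row scale (amax -> scale).

    Assumption: symmetric quantization with scale = amax / 127.
    """
    amax = max(abs(v) for v in row)
    scale = amax / 127.0 if amax > 0.0 else 1.0  # avoid dividing by zero on all-zero rows
    quantized = [max(-127, min(127, round(v / scale))) for v in row]
    return quantized, scale


def dequantize_row(quantized, scale):
    """Recover approximate float values from INT8 codes and the row scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    row = [1.0, -2.0, 0.5]
    codes, scale = quantize_row_int8(row)
    print(codes, scale, dequantize_row(codes, scale))
```

With this convention the round-trip error per element is bounded by half a quantization step, that is `scale / 2`.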

Changes

DeepSeek-V4 Prefill Q/KV Projection Kernel

| Layer / File(s) | Summary |
|---|---|
| Deployment tuning constants (models/deepseek/v4/config.py) | Reformats DECODE_BATCH/DECODE_SEQ and adds the PREFILL_BATCH and PREFILL_SEQ module constants used by the new prefill kernel. |
| Module constants and JIT signature (models/deepseek/v4/prefill_qkv_proj_rope.py) | Defines prefill/tiling constants and the full signature for prefill_qkv_proj_rope_core. |
| Core kernel: chunked RMSNorm, LoRA, KV assembly (models/deepseek/v4/prefill_qkv_proj_rope.py) | Per-chunk RoPE cos/sin materialization, FP32 attention RMSNorm+gamma, LoRA query A, per-row INT8 qr quantization with scales, KV RMSNorm/gamma, partial RoPE on KV tail, and assembly of kv, qr, qr_scale. |
| Core kernel: INT8 query projection and final outputs (models/deepseek/v4/prefill_qkv_proj_rope.py) | W8A8C16 INT8 query projection using qr/wq_b with dequantization, per-head RMSNorm, partial RoPE on query tail, interleaved reassembly, and BF16 writes producing (q, kv, qr, qr_scale). |
| JIT wrapper (models/deepseek/v4/prefill_qkv_proj_rope.py) | Thin prefill_qkv_proj_rope wrapper that forwards to the core implementation. |
| Golden reference (models/deepseek/v4/prefill_qkv_proj_rope.py) | Torch golden_prefill_qkv_proj_rope reproducing kernel numeric stages (RMSNorm+LoRA, amax→scale quantization, INT8 matmul/dequant, RoPE convention) and writing BF16 outputs and qr/qr_scale. |
| Tensor specs & CLI (models/deepseek/v4/prefill_qkv_proj_rope.py) | build_tensor_specs(start_pos=...) initializes inputs including freqs_cos/freqs_sin; CLI entrypoint runs the JIT with golden comparison and fails on mismatch. |
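
For readers unfamiliar with the INT8 matmul/dequant stage named in the table, a minimal sketch follows. It assumes exact integer accumulation, a per-row activation scale, and a per-output-channel weight scale; the weight-scale layout and the name `int8_matmul_dequant` are assumptions for illustration and are not taken from this PR.

```python
def int8_matmul_dequant(a_q, a_scale, w_q, w_scale):
    """Multiply INT8 activations by INT8 weights and dequantize the result.

    a_q:     M rows of K integer codes (quantized activations)
    a_scale: M per-row activation scales
    w_q:     N rows of K integer codes (one row per output channel)
    w_scale: N per-output-channel weight scales (assumed layout)
    """
    out = []
    for m, a_row in enumerate(a_q):
        out_row = []
        for n, w_row in enumerate(w_q):
            acc = sum(x * y for x, y in zip(a_row, w_row))  # exact integer accumulate
            out_row.append(acc * a_scale[m] * w_scale[n])   # dequantize to float
        out.append(out_row)
    return out


if __name__ == "__main__":
    a_q = [[1, 2], [3, -4]]
    w_q = [[1, 0], [2, 1]]
    print(int8_matmul_dequant(a_q, [0.5, 0.25], w_q, [2.0, 1.0]))
```

Keeping the accumulation in integers and applying both scales once at the end is what lets a golden reference reproduce the kernel's result stage by stage.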

Sequence Diagram

sequenceDiagram
  participant Activations as Activations (input)
  participant RMSNorm as RMSNorm + Gamma
  participant LoRA_A as LoRA A
  participant Quant as INT8 Quantize (qr -> int8 + scale)
  participant KV_RoPE as Partial RoPE (KV tail)
  participant INT8MatMul as W8A8C16 MatMul (qr, wq_b)
  participant RoPE_Q as Partial RoPE (Q tail)
  participant Outputs as Outputs (q, kv, qr, qr_scale)

  Activations->>RMSNorm: chunked flatten & normalize
  RMSNorm->>LoRA_A: produce LoRA A query projection
  LoRA_A->>Quant: normalize & quantize qr (int8 + scale)
  Activations->>KV_RoPE: KV path RMSNorm + partial RoPE
  Quant->>INT8MatMul: qr_tile (int8) + qr_scale
  INT8MatMul->>RoPE_Q: dequantize & apply per-head RMSNorm, partial RoPE
  RoPE_Q->>Outputs: assemble interleaved heads, write BF16 q/kv and qr/qr_scale
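
The two "Partial RoPE" steps in the diagram rotate only the trailing part of each vector and pass the leading part through unchanged. A minimal sketch, assuming an interleaved pair convention (elements `2i` and `2i+1` form one rotated pair); the kernel's actual layout may differ, and `apply_partial_rope` is a hypothetical name:

```python
def apply_partial_rope(vec, cos_row, sin_row):
    """Rotate the last 2 * len(cos_row) elements of vec; leave the rest untouched.

    Assumption: interleaved pairs (x[2i], x[2i+1]) share angle i.
    """
    rope_dim = 2 * len(cos_row)
    split = len(vec) - rope_dim
    head, tail = vec[:split], vec[split:]
    rotated = []
    for i, (c, s) in enumerate(zip(cos_row, sin_row)):
        x0, x1 = tail[2 * i], tail[2 * i + 1]
        rotated.append(x0 * c - x1 * s)
        rotated.append(x0 * s + x1 * c)
    return head + rotated


if __name__ == "__main__":
    # A 90-degree rotation (cos = 0, sin = 1) of the tail pair (3, 4).
    print(apply_partial_rope([1.0, 2.0, 3.0, 4.0], [0.0], [1.0]))
```

Each pair undergoes a plain 2-D rotation, so the norm of the tail is preserved up to floating-point rounding.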

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I count the tokens, chunk by chunk,
RoPE twirls where bitstreams munch,
INT8 hums while BF16 sleeps,
Golden checks the sums it keeps,
Prefill kernels hop, and merrily crunch.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a DeepSeek V4 prefill QKV projection with RoPE tiling, which aligns with the primary additions in the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description directly addresses the changeset: it explains the addition of a DeepSeek V4 prefill Q/KV projection + RoPE kernel, config updates for PREFILL_BATCH and PREFILL_SEQ, and provides validation details matching the file changes shown in the summary. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
models/deepseek/v4/prefill_qkv_proj_rope.py (1)

408-408: 💤 Low value

Inconsistent use of pl.rsqrt vs pl.recip(pl.sqrt(...)).

This line uses pl.rsqrt(...) while all other inverse-RMS computations in this file (lines 147, 209, 288) and in the decode kernel use pl.recip(pl.sqrt(...)). While mathematically equivalent, different instruction paths could cause subtle numerical differences that may affect validation consistency.

Suggested fix for consistency
-                q_head_inv_rms = pl.rsqrt(pl.add(pl.mul(q_head_sq_sum, 1.0 / HEAD_DIM), EPS))
+                q_head_inv_rms = pl.recip(pl.sqrt(pl.add(pl.mul(q_head_sq_sum, 1.0 / HEAD_DIM), EPS)))
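
To make the "subtle numerical differences" concrete, the sketch below models the two forms in float32 using only the standard library. It is a model, not a measurement of the NPU: `recip_sqrt_f32` rounds after each of the two steps, while `rsqrt_f32` is modelled as a single rounding of the float64 result. A real hardware `rsqrt` may be an approximation with different error, so treat the helper names and the model as assumptions.

```python
import math
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def recip_sqrt_f32(x):
    """Model of recip(sqrt(x)): two operations, each rounded to float32."""
    return f32(1.0 / f32(math.sqrt(x)))


def rsqrt_f32(x):
    """Model of a fused rsqrt(x): one rounding to float32 (assumed behaviour)."""
    return f32(1.0 / math.sqrt(x))


def count_mismatches(xs):
    """Count inputs where the two modelled forms give different float32 results."""
    return sum(1 for x in xs if recip_sqrt_f32(x) != rsqrt_f32(x))


if __name__ == "__main__":
    samples = [f32(0.001 + 0.37 * k) for k in range(1000)]
    print("mismatching inputs:", count_mismatches(samples), "of", len(samples))
```

In this model any disagreement is confined to the last bit or two of the float32 result, which is small but can matter when a golden comparison is tight.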
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/prefill_qkv_proj_rope.py` at line 408, The line computing
q_head_inv_rms uses pl.rsqrt(...) which is inconsistent with other inverse-RMS
calculations; replace the pl.rsqrt(...) expression in the q_head_inv_rms
assignment with the equivalent pl.recip(pl.sqrt(...)) form (matching the pattern
used elsewhere) so the computation for q_head_inv_rms uses
pl.recip(pl.sqrt(pl.add(pl.mul(q_head_sq_sum, 1.0 / HEAD_DIM), EPS))) and thus
aligns numerically with the other inverse-RMS uses.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 023d4d38-d85a-4832-9a4c-4b961702d334

📥 Commits

Reviewing files that changed from the base of the PR and between cddcc84 and f0092c7.

📒 Files selected for processing (2)
  • models/deepseek/v4/config.py
  • models/deepseek/v4/prefill_qkv_proj_rope.py

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new JIT-compiled kernel for DeepSeek-V4 prefill Q/KV projection and partial RoPE application, along with corresponding deployment configuration constants. The implementation covers RMSNorm, LoRA projections, and W8A8C16 quantization. Review feedback identifies opportunities to optimize the kernel by moving loop-invariant frequency slicing outside the batch tile loop and utilizing the pl.rsqrt primitive for more efficient RMSNorm inverse calculations across the query and KV paths.
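
The first optimization mentioned in this review, hoisting loop-invariant frequency slicing out of the batch tile loop, can be illustrated in plain Python. The function and parameter names here are hypothetical and the per-tile work is a stand-in (an elementwise multiply); the point is only that the slice does not depend on the tile index, so it can be taken once.

```python
def apply_per_tile(freqs_cos, chunk_start, chunk_len, batch_tiles, hoist=True):
    """Apply a cos chunk to every batch tile; return (results, slice_count).

    With hoist=True the chunk is sliced once before the loop; with hoist=False
    it is re-sliced on every iteration even though it never changes.
    """
    slices = 0
    results = []
    if hoist:
        cos_chunk = freqs_cos[chunk_start:chunk_start + chunk_len]
        slices += 1
    for tile in batch_tiles:
        if not hoist:
            cos_chunk = freqs_cos[chunk_start:chunk_start + chunk_len]
            slices += 1
        results.append([x * c for x, c in zip(tile, cos_chunk)])
    return results, slices


if __name__ == "__main__":
    freqs = [1.0, 0.5, 0.25, 0.125]
    tiles = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(apply_per_tile(freqs, 1, 2, tiles, hoist=True))
    print(apply_per_tile(freqs, 1, 2, tiles, hoist=False))
```

Both variants produce identical results; only the number of slice operations differs.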

Comment thread models/deepseek/v4/prefill_qkv_proj_rope.py
Comment thread models/deepseek/v4/prefill_qkv_proj_rope.py
Comment thread models/deepseek/v4/prefill_qkv_proj_rope.py
Comment thread models/deepseek/v4/prefill_qkv_proj_rope.py
Comment thread models/deepseek/v4/prefill_qkv_proj_rope.py Outdated

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
models/deepseek/v4/prefill_qkv_proj_rope.py (1)

402-405: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore pl.rsqrt(...) on the q-head RMS path.

Line 405 regresses the per-head q normalization to pl.recip(pl.sqrt(...)), but the decode kernel and the golden q path both use rsqrt here. That rounding delta is exactly on the post-dequant q path we compare, so this can drift from the validated decode behavior.

Proposed fix
-                q_head_inv_rms = pl.recip(pl.sqrt(pl.add(pl.mul(q_head_sq_sum, 1.0 / HEAD_DIM), EPS)))
+                q_head_inv_rms = pl.rsqrt(pl.add(pl.mul(q_head_sq_sum, 1.0 / HEAD_DIM), EPS))

Run this to verify the mismatch against the existing decode path and golden reference:

#!/bin/bash
set -euo pipefail

rg -n -C2 'q_head_inv_rms\s*=' models/deepseek/v4/prefill_qkv_proj_rope.py models/deepseek/v4/qkv_proj_rope.py
rg -n -C2 'torch\.rsqrt|pl\.rsqrt|pl\.recip\(pl\.sqrt' models/deepseek/v4/prefill_qkv_proj_rope.py models/deepseek/v4/qkv_proj_rope.py

Expected result: the validated decode q path and golden reference show rsqrt, while this prefill q-head RMS path shows pl.recip(pl.sqrt(...)).
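
For reference, the per-head normalization under discussion follows the expression in the diff: the inverse RMS is `rsqrt(sq_sum * (1 / HEAD_DIM) + EPS)`. A plain-Python sketch of that formula is below. `q_head_rmsnorm` is a hypothetical name, the default `eps` is an illustrative value rather than the kernel's EPS, and no gamma weight is applied because the diff does not show one.

```python
import math


def q_head_rmsnorm(head, eps=1e-6):
    """Normalize one query head by its RMS, mirroring the structure of the diff line.

    inv_rms = rsqrt(sq_sum * (1 / head_dim) + eps); eps value is an assumption.
    """
    head_dim = len(head)
    sq_sum = sum(v * v for v in head)
    inv_rms = 1.0 / math.sqrt(sq_sum * (1.0 / head_dim) + eps)
    return [v * inv_rms for v in head]


if __name__ == "__main__":
    normalized = q_head_rmsnorm([3.0, 4.0], eps=0.0)
    mean_square = sum(v * v for v in normalized) / len(normalized)
    print(normalized, mean_square)
```

After normalization with `eps=0.0` the mean of the squared elements is 1 up to rounding, which gives a quick sanity check for either the kernel or the golden path.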

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/prefill_qkv_proj_rope.py` around lines 402 - 405, The
q-head RMS calculation currently uses pl.recip(pl.sqrt(...)) which causes
rounding differences; change the computation that assigns q_head_inv_rms to use
pl.rsqrt(...) instead (i.e., call pl.rsqrt on the same argument
pl.add(pl.mul(q_head_sq_sum, 1.0 / HEAD_DIM), EPS)), so the prefill q
normalization matches the decode/golden q path (refer to the q_head_inv_rms
variable, HEAD_DIM, EPS, and pl.rsqrt).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cc60a769-e828-47b3-97d7-6edc67ee8079

📥 Commits

Reviewing files that changed from the base of the PR and between f0092c7 and cf2c52c.

📒 Files selected for processing (2)
  • models/deepseek/v4/config.py
  • models/deepseek/v4/prefill_qkv_proj_rope.py

@zhaozhaozz zhaozhaozz force-pushed the feat/deepseek-v4-prefill-qkv-rope branch from cf2c52c to 6c5c7e1 Compare May 26, 2026 03:08
@zhaozhaozz zhaozhaozz requested a review from zhangqi-chen May 26, 2026 03:37
@zhangqi-chen zhangqi-chen merged commit 619fba2 into hw-native-sys:main May 26, 2026
7 checks passed