Add DeepSeek V4 prefill indexer compressor#384

Merged
zhangqi-chen merged 1 commit into
hw-native-sys:mainfrom
wuzhf9:dev
May 26, 2026
@wuzhf9 wuzhf9 commented May 26, 2026

Summary

  • Promote the DeepSeek V4 prefill indexer compressor from draft to a standalone kernel.
  • Implement B=1, S=128 ratio-4 overlap compression with projected KV/score scratch, final state writes, pooled KV cache output, RoPE, and optional Hadamard rotation.
  • Batch RMSNorm, RoPE, and Hadamard over all 32 compressed rows to satisfy A2/A3 tile alignment constraints.
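To make the shapes in the summary concrete, here is a minimal sketch of one plausible reading of ratio-4 overlap compression: each compressed row softmax-pools the current 4-token window together with the previous one, so S=128 input rows give 128 / 4 = 32 compressed rows. The function name, the per-channel softmax, and the treatment of the first window are assumptions made for illustration; the sketch is not taken from the kernel in this PR.

```python
import math

RATIO = 4  # compression ratio stated in the PR summary


def overlap_softmax_pool(kv, score, ratio=RATIO):
    """Pool S rows into S // ratio rows (hypothetical reference, not the kernel).

    kv, score: lists of S rows, each a list of D floats.
    Output row i pools input rows [(i - 1) * ratio, (i + 1) * ratio),
    clipped at 0, using per-channel softmax weights taken from `score`.
    """
    seq_len, dim = len(kv), len(kv[0])
    assert seq_len % ratio == 0, "sequence length must be a multiple of ratio"
    out = []
    for i in range(seq_len // ratio):
        lo = max(0, (i - 1) * ratio)  # previous window (absent for i == 0)
        hi = (i + 1) * ratio          # end of the current window
        row = []
        for d in range(dim):
            col = [score[t][d] for t in range(lo, hi)]
            m = max(col)              # subtract the max for numerical stability
            w = [math.exp(s - m) for s in col]
            z = sum(w)
            row.append(sum(wi * kv[t][d] for wi, t in zip(w, range(lo, hi))) / z)
        out.append(row)
    return out


kv = [[float(t)] * 2 for t in range(128)]   # S=128 rows, D=2 channels
score = [[0.0] * 2 for _ in range(128)]     # uniform scores -> plain averaging
pooled = overlap_softmax_pool(kv, score)
print(len(pooled))  # 32 compressed rows for S=128
```

With uniform scores the pooling reduces to a plain mean over each overlapping window, which makes the result easy to check by hand.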

Verification

  • task-submit task_20260526_111825_17581997185: python models/deepseek/v4/prefill_indexer_compressor_draft.py -p a2a3 --device 5 --enable-l2-swimlane
  • PASS: kv, kv_state, score_state, kv_cache

Related Issues

None

coderabbitai Bot commented May 26, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Replaces a draft stub with a production DeepSeek-V4 prefill indexer compressor JIT kernel that projects inputs into key/value and score spaces, performs ratio-4 overlapping softmax pooling, applies RMSNorm and RoPE rotation, optionally applies Hadamard transform, and writes compressed results to kv_cache; includes Torch golden reference and test harness.
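For readers unfamiliar with two of the steps named in the walkthrough, the sketch below shows textbook RMSNorm and an orthonormal fast Walsh-Hadamard transform on a single row. The function names, the default `eps` value, and the 1/sqrt(n) scaling are assumptions for illustration; the kernel's own constants and tiling are not reproduced here.

```python
import math


def rms_norm(row, weight, eps=1e-6):
    """Scale `row` to unit root-mean-square, then apply a per-channel weight.

    `eps` is an assumed default; the kernel uses its own constant.
    """
    mean_sq = sum(x * x for x in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * w for x, w in zip(row, weight)]


def hadamard(row):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(row)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(row)
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [x * scale for x in out]


normed = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
rotated = hadamard([1.0, 0.0, 0.0, 0.0])
print(rotated)  # [0.5, 0.5, 0.5, 0.5]
```

Because the scaled transform is orthonormal and symmetric, applying `hadamard` twice returns the original row, which is a convenient sanity check for any reference implementation.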

Changes

Kernel Implementation and Testing

| Layer / File(s) | Summary |
| --- | --- |
| Kernel implementation and setup: models/deepseek/v4/prefill_indexer_compressor.py, models/deepseek/v4/prefill_indexer_compressor_draft.py | Defines compression ratio and block parameters; implements prefill_indexer_compressor JIT kernel that flattens/tiles inputs, projects to KV/score spaces, writes intermediate state buffers, computes overlapping softmax-pooled KV across consecutive windows, applies RMSNorm using norm_w, slices and rotates heads with RoPE using cosine/sine and selection matrices, conditionally applies Hadamard vs. direct cast, and writes compressed KV to kv_cache at an offset derived from start_pos. Removes draft stub in favor of full implementation. |
| Test wrapper and golden reference: models/deepseek/v4/prefill_indexer_compressor.py | Exports prefill_indexer_compressor_test wrapper that forwards all kernel inputs to the JIT kernel for integration with golden test harness. Adds golden_prefill_indexer_compressor Torch reference that reproduces the same compression overlap, RMSNorm, RoPE rotation, and optional Hadamard logic for numerical validation and comparison. |
| Tensor specifications and test runner: models/deepseek/v4/prefill_indexer_compressor.py | Defines build_tensor_specs() factory creating TensorSpec and ScalarSpec entries for all inputs/outputs including shape, dtype, and initializer metadata. Adds __main__ runner wiring run_jit with CLI options (platform, device, start-pos override, l2-swimlane toggle) and configures per-output comparison tolerances and failure behavior for numerical validation. |
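The CLI options listed for the test runner could be wired up roughly as follows. This is a sketch using only the standard `argparse` module: `-p`, `--device`, and `--enable-l2-swimlane` appear in the verification command in the PR description, while the `--platform` long form, the `--start-pos` flag name, and all default values are assumptions. The call into `run_jit` is deliberately left out because its signature is not shown in this conversation.

```python
import argparse


def build_parser():
    """Argument parser mirroring the options named in the review summary."""
    parser = argparse.ArgumentParser(
        description="Run the prefill indexer compressor test (illustrative sketch)."
    )
    # "-p a2a3" is taken from the PR's verification command; "--platform" is assumed.
    parser.add_argument("-p", "--platform", default="a2a3")
    parser.add_argument("--device", type=int, default=0)
    # Hypothetical flag name for the start-pos override mentioned in the summary.
    parser.add_argument("--start-pos", type=int, default=None)
    parser.add_argument("--enable-l2-swimlane", action="store_true")
    return parser


# Parse the same arguments used in the PR's verification command.
args = build_parser().parse_args(
    ["-p", "a2a3", "--device", "5", "--enable-l2-swimlane"]
)
print(args.platform, args.device, args.enable_l2_swimlane)  # a2a3 5 True
```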

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#270: Implements earlier DeepSeek-V4 KV compressor logic with KV/score projection, selector-matrix RoPE, optional Hadamard, ratio pooling, and state/cache updates at offsets from start_pos.
  • hw-native-sys/pypto-lib#346: Directly replaces the stub prefill_indexer_compressor_draft.py with the production implementation in prefill_indexer_compressor.py and updates module exports.

Poem

🐰 A kernel crystallized from drafty dreams,
With overlapping pooling, RoPE at the seams,
Golden reference dancing, Hadamard's grace—
Compression wisdom in a JIT-compiled space! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: promoting and implementing the DeepSeek V4 prefill indexer compressor kernel. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing summary of promotion objectives, implementation details, and verification results. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
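A docstring coverage figure like the one in the failed check can be approximated locally with the standard `ast` module. The sketch below counts functions and classes only; which definitions the bot actually counts is not stated here, so the counting rule is an assumption. As a point of reference, 1 documented definition out of 18 rounds to 5.56%, though the real breakdown for this PR is not shown.

```python
import ast


def docstring_coverage(source):
    """Return the percentage of functions and classes that have a docstring."""
    tree = ast.parse(source)
    defs = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not defs:
        return 100.0  # nothing to document
    documented = sum(1 for node in defs if ast.get_docstring(node) is not None)
    return 100.0 * documented / len(defs)


sample = '''
def documented():
    """Has a docstring."""
    return 1

def undocumented():
    return 2
'''
print(docstring_coverage(sample))  # 50.0
```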

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request replaces the empty scaffold file with a complete implementation of the DeepSeek-V4 prefill indexer compressor for ratio-4 overlapping KV cache in models/deepseek/v4/prefill_indexer_compressor.py. The implementation includes the JIT-compiled kernel, a test wrapper, and a PyTorch golden reference. Feedback suggests replacing wildcard imports with explicit imports to prevent namespace pollution and improve code maintainability.

Comment on lines +14 to +15
from config import FP32_NEG_INF
from decode_indexer_compressor import * # noqa: F401,F403

Severity: medium

Wildcard imports (from decode_indexer_compressor import *) should be avoided as they pollute the namespace and make it difficult to track the origin of constants and variables. Additionally, FP32_NEG_INF is imported from config but never used in this file. It is highly recommended to explicitly import only the required names to improve code readability and maintainability.

Suggested change
- from config import FP32_NEG_INF
- from decode_indexer_compressor import * # noqa: F401,F403
+ from decode_indexer_compressor import (
+     COMPRESS_RATIO,
+     STATE_LEN,
+     OUT_DIM,
+     HEAD_DIM,
+     ROPE_HEAD_DIM,
+     NOPE_HEAD_DIM,
+     IDX_KV_LEN,
+     ROPE_CHUCK,
+     HEAD_DIM_INV,
+     EPS,
+     ROTATE,
+     D,
+ )
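The namespace effect described in this comment can be reproduced with a small standalone experiment. The module name `fake_constants` and the values assigned to it are hypothetical stand-ins built in memory; they are not the contents of `decode_indexer_compressor`.

```python
import sys
import types

# Build a stand-in module in memory (hypothetical names and values).
fake = types.ModuleType("fake_constants")
fake.COMPRESS_RATIO = 4
fake.HEAD_DIM = 128
fake._private = "hidden"  # leading underscore: skipped by "import *"
sys.modules["fake_constants"] = fake

# Run each import style in its own empty namespace.
wild_ns = {}
exec("from fake_constants import *", wild_ns)
explicit_ns = {}
exec("from fake_constants import COMPRESS_RATIO", explicit_ns)


def public_names(namespace):
    """List the names an import added, ignoring the injected __builtins__."""
    return sorted(name for name in namespace if name != "__builtins__")


print(public_names(wild_ns))      # ['COMPRESS_RATIO', 'HEAD_DIM']
print(public_names(explicit_ns))  # ['COMPRESS_RATIO']
```

The wildcard form pulls in every public name whether or not the importing file uses it, while the explicit form makes the origin of each constant visible at the import site.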

@zhangqi-chen zhangqi-chen merged commit a13fc73 into hw-native-sys:main May 26, 2026
4 of 7 checks passed