perf(dsv4): mix-fuse ROPE matmul + vector epilogue in decode_indexer (-15%)#371

Merged
zhangqi-chen merged 3 commits into hw-native-sys:main from wangqin1723-max:perf/dsv4-decode-indexer-rope-core-spread on May 26, 2026

@wangqin1723-max wangqin1723-max commented May 25, 2026

What

Mix-fuse the two ROPE cube→vector handshakes in decode_indexer:

  • rope_slice + rope_apply (select matmul → cos/sin rotate cast)
  • rope_assemble + rope_write (assemble matmul → final BF16 cast)

Each merges into one scope keeping its FP32-out matmul accumulator
scope-local (no GM round-trip), inner-chunked over ROPE_ROW_CHUNK rows
under pl.split(UP_DOWN) so the fused acc+epilogue fits the 192KB Vec
budget at GRP=4. Rows are per-token independent, so the fused form is
bit-identical to the per-token form.
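
The row-independence argument can be checked with a small pure-Python sketch. It is not the kernel code: `matmul`, `rotate`, `two_scope`, `fused`, the toy shapes, and the chunk value are all hypothetical, and the epilogue uses one angle per row as a simplification. It only shows that running the matmul and the rotate epilogue per row chunk, with a chunk-local accumulator, produces exactly the same values as running the matmul over all rows first.

```python
import math

ROPE_ROW_CHUNK = 2  # hypothetical chunk size, for illustration only


def matmul(a, b):
    """Plain row-by-column matmul on nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def rotate(acc, cos, sin):
    """Epilogue: rotate each (even, odd) pair of a row by that row's angle."""
    out = []
    for row, c, s in zip(acc, cos, sin):
        rotated = []
        for i in range(0, len(row), 2):
            even, odd = row[i], row[i + 1]
            rotated += [even * c - odd * s, even * s + odd * c]
        out.append(rotated)
    return out


def two_scope(x, w, cos, sin):
    """Unfused form: matmul over all rows, then the epilogue."""
    acc = matmul(x, w)  # full accumulator handed from one stage to the next
    return rotate(acc, cos, sin)


def fused(x, w, cos, sin):
    """Fused form: matmul and epilogue per row chunk, accumulator stays local."""
    out = []
    for r0 in range(0, len(x), ROPE_ROW_CHUNK):
        r1 = r0 + ROPE_ROW_CHUNK
        acc = matmul(x[r0:r1], w)  # chunk-local accumulator
        out.extend(rotate(acc, cos[r0:r1], sin[r0:r1]))
    return out


# Toy inputs: 6 rows, 4 columns (assumed shapes, not the real tile sizes).
x = [[0.5 * (r + 1), -1.25 * r, 2.0, 0.75] for r in range(6)]
w = [
    [1.0, 0.5, -2.0, 0.25],
    [0.0, 1.5, 1.0, -1.0],
    [2.0, -0.5, 0.5, 1.0],
    [1.0, 1.0, -1.0, 0.5],
]
cos = [math.cos(0.1 * r) for r in range(6)]
sin = [math.sin(0.1 * r) for r in range(6)]

# Each output row depends only on its own input row, so the results match exactly.
assert fused(x, w, cos, sin) == two_scope(x, w, cos, sin)
```

Because every row goes through the same operations in the same order in both forms, the comparison uses exact equality rather than a tolerance.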

Perf (a2a3, S=2)

Stage | Total
--- | ---
baseline (rope core-spread) | ~1357 us
+ rope_slice/apply fuse | ~1357 → −14%
+ rope_assemble/write fuse | ~1118–1176 us (avg ~1147, −15%)
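
As a quick arithmetic check on the final row, the relative reduction can be recomputed from the quoted totals. The helper name `reduction_pct` is illustrative only; the inputs are the rounded figures from the table above.

```python
def reduction_pct(before_us, after_us):
    """Relative latency reduction, in percent."""
    return 100.0 * (before_us - after_us) / before_us


baseline_us = 1357.0  # "baseline (rope core-spread)" row
for label, after_us in [("best", 1118.0), ("avg", 1147.0), ("worst", 1176.0)]:
    print(f"{label}: {reduction_pct(baseline_us, after_us):.1f}%")
```

The average works out to roughly 15.5%, consistent with the −15% in the title, and the two measured runs bracket it at roughly 13% to 18%.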

Tuning notes

  • Fuse only FP32-out matmul + vector. The INT8 (INT32-acc) scopes (qr_proj+write, qr_hadamard+quant, score_accum+store) do NOT fuse: score trips ptoas pto.subview valid_shape even with UP_DOWN (its col-slice + row_sum conflicts with the row-split). These scopes are left split.
  • On buffer overflow, add an inner pl.range(ROPE_ROW_CHUNK) loop rather than halving GRP.
  • The win comes from dropping the scope handshake; the whole-block and auto_chunk variants were slower.
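
The buffer-budget reasoning behind the inner row chunk can be sketched as simple arithmetic. Only the 192KB figure comes from the description; the buffer mix (one FP32 accumulator row plus one BF16 output row), the shape, and the helper names `bytes_per_row` and `largest_row_chunk` are assumptions for illustration and do not reflect the real tile sizes.

```python
VEC_BUDGET_BYTES = 192 * 1024  # Vec budget quoted in the PR description


def bytes_per_row(cols, acc_bytes=4, out_bytes=2):
    """Assumed footprint per row: FP32 accumulator plus BF16 output."""
    return cols * (acc_bytes + out_bytes)


def largest_row_chunk(total_rows, cols, budget=VEC_BUDGET_BYTES):
    """Largest divisor of total_rows whose buffers fit within the budget."""
    per_row = bytes_per_row(cols)
    for chunk in range(total_rows, 0, -1):
        if total_rows % chunk == 0 and chunk * per_row <= budget:
            return chunk
    raise ValueError("even a single row exceeds the budget")


# Hypothetical shape: 64 rows of 4096 columns.
print(largest_row_chunk(total_rows=64, cols=4096))
```

Shrinking the inner chunk keeps the outer grouping intact, which is the trade-off the note above describes: the row count per scope stays the same and only the live buffer per iteration gets smaller.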

Validation

-p a2a3: score / topk_idxs / idx_kv_cache all PASS, 2 runs confirm range.

coderabbitai Bot commented May 25, 2026

Walkthrough

This PR refactors the decode indexer: adds head-group/rope-group tiling and QR proj row-chunking, rewrites RoPE to write into a fresh buffer, computes local FP32 hadamard accumulation with per-head INT8 quant, and replaces global score accumulation with per-token INT32 matmuls plus dequant/store.

Changes

Decode Indexer Tiling and Computation Reorganization

Layer | File(s) | Summary
--- | --- | ---
Grouped tiling constants and shape assertions | models/deepseek/v4/decode_indexer.py | Introduce HEAD_GROUP, HEAD_ROWS, HEAD_GROUP_ROPE, HEAD_ROWS_ROPE, and SCORE_B_GROUP and add divisibility assertions.
QR projection scheduling and row-chunk dequant | models/deepseek/v4/decode_indexer.py | Add QR_PROJ_ROW_CHUNK, change qr_proj core-group scheduling, and replace full-tensor epilogue with per-row-chunk dequant/cast writes into qr_proj.
RoPE grouped transform and fresh output buffer | models/deepseek/v4/decode_indexer.py, models/deepseek/v4/decode_indexer_compressor.py | Rewrite RoPE using HEAD_ROWS_ROPE and ROPE_ROW_CHUNK, produce chunked BF16 even/odd accumulators, assemble FP32 rope_acc, write rotated results into qr_rope_out, and update compressor RoPE stages accordingly.
Hadamard local FP32 accumulate and INT8 quant | models/deepseek/v4/decode_indexer.py | Compute FP32 hadamard accumulation locally per rope group, emit per-HEAD_ROWS quant/dequant scales, and write INT8 activations into qr_hadamard_i8 without a global FP32 accumulator.
Grouped scoring buffers and in-store score computation | models/deepseek/v4/decode_indexer.py | Reallocate/reshape score_flat, score_kv_scale, kv_tile_i8_g; switch score loop to pl.auto_chunk; perform per-token INT32 matmuls from kv_tile_i8_g and qr_hadamard_i8, apply dequant scales, and store FP32 scores into score_flat.
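
The INT8 scoring path summarized in the last row (integer matmul, then dequant scales, then an FP32 store) follows a common pattern that can be sketched in plain Python. This is a generic illustration under assumed symmetric per-row quantization, not the kernel's actual scheme; `quantize_row` and `score` are hypothetical names.

```python
def quantize_row(row):
    """Symmetric per-row INT8 quantization; returns (int values, dequant scale)."""
    amax = max(abs(v) for v in row) or 1.0  # avoid a zero scale for all-zero rows
    scale = amax / 127.0
    return [max(-127, min(127, round(v / scale))) for v in row], scale


def score(q_row, kv_rows):
    """Per-token score: integer dot product, then dequant by both scales."""
    q_i8, q_scale = quantize_row(q_row)
    scores = []
    for kv in kv_rows:
        kv_i8, kv_scale = quantize_row(kv)
        acc_i32 = sum(a * b for a, b in zip(q_i8, kv_i8))  # integer accumulate
        scores.append(acc_i32 * q_scale * kv_scale)  # dequant to float
    return scores


q = [1.0, -2.0, 0.5, 0.0]
kv = [[0.5, 0.5, 1.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
print(score(q, kv))
```

The results approximate the float dot products (0.0 and -0.5 for these inputs) up to quantization error, which is why the dequant scales have to travel alongside the INT8 tiles.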

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

enhancement

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
--- | --- | --- | ---
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
--- | --- | ---
Title check | ✅ Passed | The title clearly and concisely describes the main performance optimization: fusing ROPE matmul and vector epilogue operations to achieve a 15% improvement. It is directly supported by the changeset modifications.
Description check | ✅ Passed | The description provides comprehensive context about what, how, and why the changes were made, including performance metrics, tuning notes, and validation results. It is fully related to the changeset.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

@wangqin1723-max wangqin1723-max force-pushed the perf/dsv4-decode-indexer-rope-core-spread branch from 6d676a5 to 399d147 Compare May 25, 2026 08:54
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes the decode_indexer by implementing group-chunking and refactoring core loops to improve parallelization and reduce task launch overhead. Key changes include using a fresh tensor for RoPE outputs to avoid data hazards and splitting Hadamard operations based on hardware constraints. A review comment identified a typo in the score_quant scope where the optimization parameter was used instead of optimizations, which would cause the chunked_loop_optimizer to be ignored.

Comment thread models/deepseek/v4/decode_indexer.py Outdated
kv_cache_tile_i8[:, h1 : h1 + HEAD_DIM_CHUCK] = kv_q_i8
score_kv_scale[score_row0 : score_row0 + CACHE_TILE, :] = kv_cache_scale_dq

with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer, name_hint="score_quant"):

high

There is a typo in the keyword argument name for the pl.at scope. The parameter should be optimizations (plural) and it expects a list of optimization objects, consistent with the usage in lines 158 and 178. Using optimization (singular) will likely result in the pl.chunked_loop_optimizer being ignored by the compiler, which could lead to performance degradation or buffer overflows as mentioned in the PR description.

Suggested change
- with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer, name_hint="score_quant"):
+ with pl.at(level=pl.Level.CORE_GROUP, optimizations=[pl.chunked_loop_optimizer], name_hint="score_quant"):
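
Why a misspelled keyword might be ignored rather than rejected can be illustrated with two toy functions. Neither is the real pl.at, whose signature and handling of unknown keywords are not shown in this thread; `at_lenient` and `at_strict` are hypothetical stand-ins. A function that collects extra keywords in `**options` drops the typo silently, while one with explicit keyword-only parameters raises immediately.

```python
def at_lenient(level, **options):
    """Hypothetical scope factory that accepts and ignores unknown keywords."""
    return {"level": level, "optimizations": options.get("optimizations", [])}


def at_strict(level, *, optimizations=None, name_hint=None):
    """Hypothetical scope factory with explicit keyword-only parameters."""
    return {
        "level": level,
        "optimizations": list(optimizations or []),
        "name_hint": name_hint,
    }


# The singular keyword is accepted, but the optimizer list ends up empty.
lenient = at_lenient("CORE_GROUP", optimization="chunked_loop_optimizer")
print(lenient["optimizations"])

# The strict variant refuses the unknown keyword outright.
try:
    at_strict("CORE_GROUP", optimization="chunked_loop_optimizer")
except TypeError as exc:
    error = str(exc)
    print(error)
```

If the real API behaves like the lenient variant, the typo would run without error and simply skip the optimizer, which matches the reviewer's "will likely be ignored" wording.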

…logue

Fold the two ROPE cube->vector handshakes in decode_indexer into single
scopes: rope_slice+rope_apply and rope_assemble+rope_write. Each keeps its
FP32-out matmul accumulator scope-local (no GM round-trip) and inner-chunks
over ROPE_ROW_CHUNK rows under pl.split(UP_DOWN) so the fused acc+epilogue
fits the 192KB Vec budget at GRP=4. Rows are independent, so the result is
bit-identical; score/topk/idx_kv_cache all PASS. ~1357->~1147us (-15%).
…ul+vector epilogue

Merge separate matmul and vector epilogue scopes into single UP_DOWN-split
mix scopes: qr_proj dequant, qr_hadamard amax/quant, and compressor
rope_slice/assemble. Keeps FP32 acc scope-local, drops cross-scope GM
handshakes; inner row-chunk keeps Vec buffer under 192KB.
@wangqin1723-max wangqin1723-max force-pushed the perf/dsv4-decode-indexer-rope-core-spread branch from 399d147 to ccde41b Compare May 26, 2026 08:40
@wangqin1723-max wangqin1723-max force-pushed the perf/dsv4-decode-indexer-rope-core-spread branch from 132d041 to b906cab Compare May 26, 2026 09:49
@zhangqi-chen zhangqi-chen merged commit 9a352e9 into hw-native-sys:main May 26, 2026
5 of 7 checks passed