perf(dsv4): mix-fuse ROPE matmul + vector epilogue in decode_indexer (-15%) by wangqin1723-max · Pull Request #371 · hw-native-sys/pypto-lib

wangqin1723-max · 2026-05-25T08:52:13Z

What

Mix-fuse the two ROPE cube→vector handshakes in decode_indexer:

rope_slice + rope_apply (select matmul → cos/sin rotate cast)
rope_assemble + rope_write (assemble matmul → final BF16 cast)

Each pair merges into a single scope that keeps its FP32-out matmul
accumulator scope-local (no GM round-trip). The scope is inner-chunked over
ROPE_ROW_CHUNK rows under pl.split(UP_DOWN) so that the fused
accumulator + epilogue fits the 192KB Vec budget at GRP=4. Rows are
independent per token, so the fused form is bit-identical to the
per-token form.
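The bit-identical claim follows from row independence: each output row depends on one input row only, so chunking the rows changes where the accumulator lives, not the arithmetic. The sketch below illustrates this in plain Python rather than pypto. The function names, the chunk size, and the interleaved-pair rotation are assumptions chosen for illustration, not the kernel's actual code.

```python
import math

ROPE_ROW_CHUNK = 2  # hypothetical chunk size, for illustration only


def matmul(a, b):
    """Plain row-major matmul: each output row depends on one row of `a`."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def rotate(rows, cos, sin):
    """Stand-in vector epilogue: rotate adjacent (even, odd) pairs of each row."""
    out = []
    for row in rows:
        new = []
        for i in range(0, len(row), 2):
            c, s = cos[i // 2], sin[i // 2]
            new += [row[i] * c - row[i + 1] * s, row[i] * s + row[i + 1] * c]
        out.append(new)
    return out


def two_stage(a, b, cos, sin):
    """Unfused form: materialize the whole accumulator, then run the epilogue."""
    acc = matmul(a, b)
    return rotate(acc, cos, sin)


def fused_chunked(a, b, cos, sin, chunk=ROPE_ROW_CHUNK):
    """Fused form: matmul + epilogue per row chunk, accumulator kept local."""
    out = []
    for r0 in range(0, len(a), chunk):
        acc = matmul(a[r0:r0 + chunk], b)  # accumulator lives only in this iteration
        out += rotate(acc, cos, sin)
    return out


a = [[0.1 * (i + j) for j in range(4)] for i in range(5)]
b = [[0.3 * (i - j) for j in range(4)] for i in range(4)]
cos = [math.cos(0.5), math.cos(1.25)]
sin = [math.sin(0.5), math.sin(1.25)]

# Exact equality, not approximate: every row sees the same operations in the same order.
print(fused_chunked(a, b, cos, sin) == two_stage(a, b, cos, sin))
```

The comparison uses `==` on floats deliberately, since the point is exact agreement. It holds for any chunk size, including one that does not divide the row count.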

Perf (a2a3, S=2)

stage	Total
baseline (rope core-spread)	~1357 us
+ rope_slice/apply fuse	~1357 → −14%
+ rope_assemble/write fuse	~1118–1176 us (avg ~1147, −15%)
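As a quick check of the headline number, the percentages can be recomputed from the totals in the table. The figures are the approximate ones quoted above; the helper name is illustrative.

```python
def change_pct(baseline_us, fused_us):
    """Percent change in latency relative to the baseline (negative means faster)."""
    return (fused_us - baseline_us) / baseline_us * 100.0


BASELINE_US = 1357.0  # approximate baseline total from the table

for label, fused_us in [("best", 1118.0), ("avg", 1147.0), ("worst", 1176.0)]:
    print(f"{label}: {change_pct(BASELINE_US, fused_us):.1f}%")
```

With these inputs the average works out to about -15.5%, and the two reported runs span roughly -13% to -18%.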

Tuning notes

Fuse only FP32-out matmul + vector scopes. The INT8 (INT32-acc) scopes (qr_proj+write, qr_hadamard+quant, score_accum+store) do NOT fuse: the score scope trips ptoas pto.subview valid_shape even with UP_DOWN, because col-slice + row_sum conflicts with the row split. These scopes are left split.
On buffer overflow, add an inner pl.range(ROPE_ROW_CHUNK) loop rather than halving GRP.
The win comes from dropping the scope handshake; the whole-block and auto_chunk variants are slower.
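The row-chunk approach to buffer overflow can be reasoned about with simple budget arithmetic. The following is a minimal sketch, assuming a working set that scales linearly with the number of rows in flight. Only the 192KB figure comes from the PR; the row count, column count, and per-row byte accounting are hypothetical, and the real compiler may count buffers differently.

```python
VEC_BUDGET_BYTES = 192 * 1024  # the 192KB Vec budget quoted in the PR


def max_row_chunk(total_rows, bytes_per_row, budget=VEC_BUDGET_BYTES):
    """Return the largest divisor of total_rows whose working set fits the budget."""
    for chunk in range(total_rows, 0, -1):
        if total_rows % chunk == 0 and chunk * bytes_per_row <= budget:
            return chunk
    raise ValueError("even a single row exceeds the budget")


# Hypothetical accounting: per row, three FP32 buffers (accumulator plus two
# temporaries) and one BF16 output buffer, each `cols` elements wide.
cols = 512
bytes_per_row = cols * 4 * 3 + cols * 2  # 7168 bytes

print(max_row_chunk(256, bytes_per_row))  # 16: 16 rows fit, 32 rows would not
```

Shrinking the inner chunk keeps the group size, and therefore the amount of work per scope, unchanged, which is consistent with the note's preference for an inner loop over halving GRP.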

Validation

-p a2a3: score / topk_idxs / idx_kv_cache all PASS, 2 runs confirm range.

coderabbitai · 2026-05-25T08:52:24Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


Walkthrough

This PR refactors the decode indexer: adds head-group/rope-group tiling and QR proj row-chunking, rewrites RoPE to write into a fresh buffer, computes local FP32 hadamard accumulation with per-head INT8 quant, and replaces global score accumulation with per-token INT32 matmuls plus dequant/store.

Changes

Decode Indexer Tiling and Computation Reorganization

Layer / File(s)	Summary
Grouped tiling constants and shape assertions `models/deepseek/v4/decode_indexer.py`	Introduce `HEAD_GROUP`, `HEAD_ROWS`, `HEAD_GROUP_ROPE`, `HEAD_ROWS_ROPE`, and `SCORE_B_GROUP` and add divisibility assertions.
QR projection scheduling and row-chunk dequant `models/deepseek/v4/decode_indexer.py`	Add `QR_PROJ_ROW_CHUNK`, change `qr_proj` core-group scheduling, and replace full-tensor epilogue with per-row-chunk dequant/cast writes into `qr_proj`.
RoPE grouped transform and fresh output buffer `models/deepseek/v4/decode_indexer.py`, `models/deepseek/v4/decode_indexer_compressor.py`	Rewrite RoPE using `HEAD_ROWS_ROPE` and `ROPE_ROW_CHUNK`, produce chunked BF16 even/odd accumulators, assemble FP32 rope_acc, write rotated results into `qr_rope_out`, and update compressor RoPE stages accordingly.
Hadamard local FP32 accumulate and INT8 quant `models/deepseek/v4/decode_indexer.py`	Compute FP32 hadamard accumulation locally per rope group, emit per-`HEAD_ROWS` quant/dequant scales, and write INT8 activations into `qr_hadamard_i8` without a global FP32 accumulator.
Grouped scoring buffers and in-store score computation `models/deepseek/v4/decode_indexer.py`	Reallocate/reshape `score_flat`, `score_kv_scale`, `kv_tile_i8_g`; switch score loop to `pl.auto_chunk`; perform per-token INT32 matmuls from `kv_tile_i8_g` and `qr_hadamard_i8`, apply dequant scales, and store FP32 scores into `score_flat`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hw-native-sys/pypto-lib#350: Related refactor of decode_indexer.py's qr_proj/RoPE/INT8 scoring tiling and fused store changes.
hw-native-sys/pypto-lib#341: Related RoPE even/odd transformation and downstream INT8 quantization/scale flow updates.
hw-native-sys/pypto-lib#268: Related changes to INT8 hadamard + INT8 KV tiling → INT32 matmul → dequant scoring flow.

Suggested labels

enhancement

Poem

🐰 I slice the rows in tidy chunks,
Fold heads and rope with nimble funks,
Fresh buffers keep the hazards cold,
INT8 scales and scores unfold,
A rabbit hops — the kernel hums!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main performance optimization: fusing ROPE matmul and vector epilogue operations to achieve a 15% improvement. It is directly supported by the changeset modifications.
Description check	✅ Passed	The description provides comprehensive context about what, how, and why the changes were made, including performance metrics, tuning notes, and validation results. It is fully related to the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


gemini-code-assist

Code Review

This pull request optimizes the decode_indexer by implementing group-chunking and refactoring core loops to improve parallelization and reduce task launch overhead. Key changes include using a fresh tensor for RoPE outputs to avoid data hazards and splitting Hadamard operations based on hardware constraints. A review comment identified a typo in the score_quant scope where the optimization parameter was used instead of optimizations, which would cause the chunked_loop_optimizer to be ignored.

gemini-code-assist · 2026-05-25T08:54:54Z

-                    kv_cache_tile_i8[:, h1 : h1 + HEAD_DIM_CHUCK] = kv_q_i8
-                score_kv_scale[score_row0 : score_row0 + CACHE_TILE, :] = kv_cache_scale_dq
+
+            with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer, name_hint="score_quant"):


There is a typo in the keyword argument name for the pl.at scope. The parameter should be optimizations (plural) and it expects a list of optimization objects, consistent with the usage in lines 158 and 178. Using optimization (singular) will likely result in the pl.chunked_loop_optimizer being ignored by the compiler, which could lead to performance degradation or buffer overflows as mentioned in the PR description.

Suggested change

-            with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer, name_hint="score_quant"):
+            with pl.at(level=pl.Level.CORE_GROUP, optimizations=[pl.chunked_loop_optimizer], name_hint="score_quant"):
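Whether a misspelled keyword is silently ignored or rejected depends on how the callee declares its parameters. The sketch below contrasts the two cases with hypothetical stand-in functions. It does not reproduce the real pl.at signature, which is not shown in this PR.

```python
def at_permissive(level, optimizations=None, name_hint=None, **kwargs):
    """Hypothetical scope factory that swallows unknown keywords via **kwargs."""
    return {"level": level, "optimizations": list(optimizations or []), "name_hint": name_hint}


def at_strict(level, *, optimizations=None, name_hint=None):
    """Hypothetical scope factory with a closed keyword set."""
    return {"level": level, "optimizations": list(optimizations or []), "name_hint": name_hint}


# Misspelled keyword: accepted, but the optimizer list stays empty.
scope = at_permissive("CORE_GROUP", optimization="chunked_loop_optimizer", name_hint="score_quant")
print(scope["optimizations"])  # []

# The same misspelling against a closed signature fails immediately.
try:
    at_strict("CORE_GROUP", optimization="chunked_loop_optimizer", name_hint="score_quant")
except TypeError as exc:
    print("rejected:", exc)
```

If the real API behaves like the permissive variant, the typo would drop the optimizer without any error, which matches the reviewer's concern.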

…logue Fold the two ROPE cube->vector handshakes in decode_indexer into single scopes: rope_slice+rope_apply and rope_assemble+rope_write. Each keeps its FP32-out matmul accumulator scope-local (no GM round-trip) and inner-chunks over ROPE_ROW_CHUNK rows under pl.split(UP_DOWN) so the fused acc+epilogue fits the 192KB Vec budget at GRP=4. Rows are independent, so the result is bit-identical; score/topk/idx_kv_cache all PASS. ~1357->~1147us (-15%).

…ul+vector epilogue Merge separate matmul and vector epilogue scopes into single UP_DOWN-split mix scopes: qr_proj dequant, qr_hadamard amax/quant, and compressor rope_slice/assemble. Keeps FP32 acc scope-local, drops cross-scope GM handshakes; inner row-chunk keeps Vec buffer under 192KB.

wangqin1723-max force-pushed the perf/dsv4-decode-indexer-rope-core-spread branch from 6d676a5 to 399d147 Compare May 25, 2026 08:54

gemini-code-assist Bot reviewed May 25, 2026


wangqin1723-max added 2 commits May 25, 2026 19:18

wangqin1723-max force-pushed the perf/dsv4-decode-indexer-rope-core-spread branch from 399d147 to ccde41b Compare May 26, 2026 08:40

perf(dsv4): rope_assemble use SplitMode.NONE for FP32-out mix

b906cab

wangqin1723-max force-pushed the perf/dsv4-decode-indexer-rope-core-spread branch from 132d041 to b906cab Compare May 26, 2026 09:49

zhangqi-chen merged commit 9a352e9 into hw-native-sys:main May 26, 2026
5 of 7 checks passed

zhangqi-chen merged 3 commits into hw-native-sys:main from wangqin1723-max:perf/dsv4-decode-indexer-rope-core-spread

2 participants
