perf(dsv4): mix-fuse ROPE matmul + vector epilogue in decode_indexer (-15%) #371
Walkthrough
This PR refactors the decode indexer: adds head-group/rope-group tiling and QR proj row-chunking, rewrites RoPE to write into a fresh buffer, computes local FP32 hadamard accumulation with per-head INT8 quant, and replaces global score accumulation with per-token INT32 matmuls plus dequant/store.
Changes
Decode Indexer Tiling and Computation Reorganization
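To make the quantization step in the walkthrough concrete, here is a minimal pure-Python sketch of amax-based symmetric INT8 quantization followed by an integer dot product and a dequantizing scale. It is an illustration of the general technique only: the function names, the 127 clamp, and the per-row granularity are assumptions, not the kernel's actual code.

```python
def quantize_int8(row):
    """Symmetric per-row quantization (assumed scheme): scale = amax / 127."""
    amax = max(abs(v) for v in row)
    if amax == 0.0:
        # All-zero row: any scale works; 1.0 avoids a division by zero.
        return [0] * len(row), 1.0
    scale = amax / 127.0
    # Round to nearest integer and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def int8_dot_dequant(q_a, scale_a, q_b, scale_b):
    """Integer accumulate (stands in for an INT32 matmul), then dequantize."""
    acc = 0  # Python ints do not overflow; hardware would use INT32 here.
    for a, b in zip(q_a, q_b):
        acc += a * b
    return acc * scale_a * scale_b


if __name__ == "__main__":
    q_vec, q_scale = quantize_int8([1.0, -2.0, 0.5])
    k_vec, k_scale = quantize_int8([0.25, 0.5, -1.0])
    # Approximates the FP32 dot product of the two original rows.
    print(int8_dot_dequant(q_vec, q_scale, k_vec, k_scale))
```

The dequantized score approximates the floating-point dot product; the error shrinks as the values in a row cluster closer to its amax.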
6d676a5 to 399d147
Code Review
This pull request optimizes the decode_indexer by implementing group-chunking and refactoring core loops to improve parallelization and reduce task launch overhead. Key changes include using a fresh tensor for RoPE outputs to avoid data hazards and splitting Hadamard operations based on hardware constraints. A review comment identified a typo in the score_quant scope where the optimization parameter was used instead of optimizations, which would cause the chunked_loop_optimizer to be ignored.
kv_cache_tile_i8[:, h1 : h1 + HEAD_DIM_CHUCK] = kv_q_i8
score_kv_scale[score_row0 : score_row0 + CACHE_TILE, :] = kv_cache_scale_dq

with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer, name_hint="score_quant"):
There is a typo in the keyword argument name for the pl.at scope. The parameter should be optimizations (plural) and it expects a list of optimization objects, consistent with the usage in lines 158 and 178. Using optimization (singular) will likely result in the pl.chunked_loop_optimizer being ignored by the compiler, which could lead to performance degradation or buffer overflows as mentioned in the PR description.
Suggested change:
- with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer, name_hint="score_quant"):
+ with pl.at(level=pl.Level.CORE_GROUP, optimizations=[pl.chunked_loop_optimizer], name_hint="score_quant"):
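For readers wondering how a misspelled keyword could be ignored rather than rejected, the sketch below shows the general Python mechanism. The two functions are hypothetical stand-ins written for this illustration; they are not the real `pl.at` signature, which is not shown in this thread.

```python
def at_lenient(level, optimizations=None, name_hint=None, **kwargs):
    """Hypothetical scope factory that accepts arbitrary extra keywords.

    A misspelled keyword such as `optimization=` lands in **kwargs and is
    silently dropped, so the scope is built with no optimizations at all.
    """
    return {
        "level": level,
        "optimizations": list(optimizations or []),
        "name_hint": name_hint,
    }


def at_strict(level, optimizations=None, name_hint=None):
    """Hypothetical variant without **kwargs: a typo raises TypeError."""
    return {
        "level": level,
        "optimizations": list(optimizations or []),
        "name_hint": name_hint,
    }


if __name__ == "__main__":
    scope = at_lenient("CORE_GROUP", optimization="chunked", name_hint="score_quant")
    print(scope["optimizations"])  # the optimizer never reached the scope
    try:
        at_strict("CORE_GROUP", optimization="chunked", name_hint="score_quant")
    except TypeError as exc:
        print("rejected:", exc)
```

Whether the real API behaves like the lenient or the strict variant determines if the typo fails loudly or only shows up as a missing optimization.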
…logue
Fold the two ROPE cube->vector handshakes in decode_indexer into single scopes: rope_slice+rope_apply and rope_assemble+rope_write. Each keeps its FP32-out matmul accumulator scope-local (no GM round-trip) and inner-chunks over ROPE_ROW_CHUNK rows under pl.split(UP_DOWN) so the fused acc+epilogue fits the 192KB Vec budget at GRP=4. Rows are independent, so the result is bit-identical; score/topk/idx_kv_cache all PASS. ~1357->~1147us (-15%).
…ul+vector epilogue
Merge separate matmul and vector epilogue scopes into single UP_DOWN-split mix scopes: qr_proj dequant, qr_hadamard amax/quant, and compressor rope_slice/assemble. Keeps FP32 acc scope-local, drops cross-scope GM handshakes; inner row-chunk keeps Vec buffer under 192KB.
399d147 to ccde41b
132d041 to b906cab
What
Mix-fuse the two ROPE cube→vector handshakes in decode_indexer:
- rope_slice + rope_apply (select matmul → cos/sin rotate cast)
- rope_assemble + rope_write (assemble matmul → final BF16 cast)

Each merges into one scope keeping its FP32-out matmul accumulator scope-local (no GM round-trip), inner-chunked over ROPE_ROW_CHUNK rows under pl.split(UP_DOWN) so the fused acc+epilogue fits the 192KB Vec budget at GRP=4. Rows are per-token independent, so the fused form is bit-identical to the per-token form.
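The bit-identical claim follows from row independence, which can be checked with a small pure-Python sketch. Everything here is an assumption made for illustration: the helper names, the scaling epilogue (a stand-in for the cos/sin rotate and cast), and the list-of-lists layout; none of it is the kernel's code.

```python
def matmul_rows(a_rows, b):
    """Row-by-column matmul; each output row reads only its own input row."""
    inner, cols = len(b), len(b[0])
    out = []
    for row in a_rows:
        out_row = []
        for j in range(cols):
            acc = 0.0
            for k in range(inner):  # fixed accumulation order per element
                acc += row[k] * b[k][j]
            out_row.append(acc)
        out.append(out_row)
    return out


def epilogue(acc_rows, scale):
    """Stand-in row-wise vector epilogue (the real one rotates and casts)."""
    return [[v * scale for v in row] for row in acc_rows]


def unfused(a, b, scale):
    """Matmul over all rows first, then the epilogue over the full result."""
    return epilogue(matmul_rows(a, b), scale)


def fused_chunked(a, b, scale, row_chunk):
    """Matmul + epilogue per row chunk; the accumulator stays chunk-local."""
    out = []
    for r0 in range(0, len(a), row_chunk):
        acc = matmul_rows(a[r0:r0 + row_chunk], b)  # small local accumulator
        out.extend(epilogue(acc, scale))
    return out


if __name__ == "__main__":
    a = [[0.1 * (i + 1) + 0.01 * k for k in range(3)] for i in range(5)]
    b = [[0.3 * (k + 1) - 0.07 * j for j in range(2)] for k in range(3)]
    print(fused_chunked(a, b, 0.5, 2) == unfused(a, b, 0.5))
```

Because chunking only changes which rows are processed together, and never the order of operations within a row, the two forms produce the same floating-point values for any chunk size.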
Perf (a2a3, S=2)
Tuning notes
- qr_proj+write, qr_hadamard+quant, score_accum+store: do NOT fuse. score trips ptoas pto.subview valid_shape even with UP_DOWN (col-slice + row_sum conflicts with row-split). Left split.
- pl.range(ROPE_ROW_CHUNK), not GRP-halve.

Validation
-p a2a3: score / topk_idxs / idx_kv_cache all PASS, 2 runs confirm range.