chore(dsv4): migrate chunked_loop_optimizer to auto_chunk (#388) by wangqin1723-max · Pull Request #389 · hw-native-sys/pypto-lib

wangqin1723-max · 2026-05-26T09:47:06Z

Summary

Closes #388. chunked_loop_optimizer is deprecated upstream; the rest of the repo already uses auto_chunk.

Commit 1 — swap the 7 remaining sites to auto_chunk:

decode_attention_hca.py — hca_topk
decode_attention_swa.py — swa_scatter_kv / swa_topk / swa_cmp_dummy
hc_post.py — hc_post
qkv_proj_rope.py — attn_norm_rms_partial / qr_rms_partial

Commit 2 — for the two pl.parallel(0,N,1,chunk=16) sites (hc_post, swa_scatter_kv), drop the optimizer entirely and make chunking explicit: pl.parallel(0,N,16) + pl.at + pl.range(16). The other 5 sites stay on auto_chunk (not parallel-chunk loops).
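The explicit chunking described for Commit 2 can be modeled in plain Python. This is a rough sketch, not pypto code: it assumes `pl.parallel(0, N, 16)` steps like `range(0, N, 16)` and that the inner loop covers `t0 .. t0 + 15`, as in the hc_post diff. `visit_flat` and `visit_chunked` are hypothetical helper names used only for illustration.

```python
CHUNK = 16  # chunk width used at both sites in this PR


def visit_flat(n):
    """Indices visited by a flat loop over [0, n) with step 1."""
    return list(range(n))


def visit_chunked(n, chunk=CHUNK):
    """Indices visited by an outer stride-`chunk` loop plus a fixed-width
    inner loop, with no tail handling (mirrors the structure in the PR)."""
    visited = []
    for t0 in range(0, n, chunk):        # models pl.parallel(0, N, 16)
        for t in range(t0, t0 + chunk):  # models pl.range(t0, t0 + 16)
            visited.append(t)
    return visited


# The two forms cover the same iteration space when n is a multiple of 16.
assert visit_chunked(64) == visit_flat(64)
```

When `n` is not a multiple of 16 the chunked form visits extra indices, which is the case the review comments on this PR flag.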

Validation

Run on a2a3 (with PTO2_RING_* env), all PASS, precision-neutral:

| target | result |
| --- | --- |
| `decode_hca.py` | x_next PASS |
| `decode_csa.py` | x_next PASS |
| `decode_attention_swa.py` | x_out PASS |

coderabbitai · 2026-05-26T09:47:19Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab88bdc7-d120-4693-8b08-67ef9a33afd0

📥 Commits

Reviewing files that changed from the base of the PR and between f05cd1f and c29b69c.

📒 Files selected for processing (3)

models/deepseek/v4/decode_attention_hca.py
models/deepseek/v4/decode_attention_swa.py
models/deepseek/v4/hc_post.py

📝 Walkthrough

Four DeepSeek v4 kernel modules migrate loop optimization directives from pl.chunked_loop_optimizer to pl.auto_chunk. HCA and SWA attention decoders update scheduling hints; HC_post restructures loop nesting; QKV projection RMS loops switch optimization parameters with updated comments.

Changes

Loop Optimization Migration to Auto_Chunk

| Layer / File(s) | Summary |
| --- | --- |
| Attention decoder kernel optimization migration: `models/deepseek/v4/decode_attention_hca.py`, `models/deepseek/v4/decode_attention_swa.py` | HCA topk switches from `chunked_loop_optimizer` to `auto_chunk`. SWA KV scatter and sparse-topk loops migrate optimization directives and restructure batch parallelization from implicit chunking to `pl.parallel(0, B, 16)` plus inner 16-wide range loop. SWA cmp_dummy dummy-block preparation also switches to `auto_chunk`. Destination-slot computation and KV-cache assembly semantics preserved. |
| HC_post core loop restructuring: `models/deepseek/v4/hc_post.py` | HC_post inner iteration replaces single `pl.parallel(0, T, 1, chunk=16)` with two-level structure: outer `pl.parallel(0, T, 16)` and inner `pl.range(t0, t0 + 16)` loop. Per-iteration math (load, compute rows, accumulate, assemble) remains unchanged. |
| QKV projection RMS loop optimization: `models/deepseek/v4/qkv_proj_rope.py` | Attn_norm_rms_partial and qr_rms_partial worker loops switch `pl.at` from `optimization=pl.chunked_loop_optimizer` to `optimizations=[pl.auto_chunk]`. Comments updated to document auto_chunk requirement for preventing Vec UB over-allocation. Partial-split tiling structure and final reduction behavior unchanged. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

hw-native-sys/pypto-lib#373: Parallel auto_chunk migration across multiple DeepSeek kernels covering the same optimization directive consolidation pattern.
hw-native-sys/pypto-lib#339: Restructures RMS partial-reduction stages in qkv_proj_rope that are directly related to the RMS loop optimization changes in this PR.
hw-native-sys/pypto-lib#332: Concurrent modifications to qkv_proj_rope.py loop optimization and scope-fusion refactoring affecting the same RMS partial-sum paths.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title accurately describes the main change: migrating from deprecated chunked_loop_optimizer to auto_chunk in the DSv4 decoder modules. |
| Description check | ✅ Passed | The PR description is clearly related to the changeset, explaining the migration from chunked_loop_optimizer to auto_chunk across the specific files and functions shown in the code changes. |
| Linked Issues check | ✅ Passed | The PR addresses all objectives from issue `#388`: migrating 7 sites from chunked_loop_optimizer to auto_chunk (hca_topk, swa_scatter_kv/topk/cmp_dummy, hc_post, attn/qr_rms_partial) with validation confirming numeric preservation. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the migration task; only the specified optimization mechanism and explicit chunking patterns were modified without unrelated alterations. |

gemini-code-assist

Code Review

This pull request replaces the deprecated optimization=pl.chunked_loop_optimizer argument with optimizations=[pl.auto_chunk] across multiple DeepSeek v4 model files, including decode_attention_hca.py, decode_attention_swa.py, hc_post.py, and qkv_proj_rope.py. The inline documentation and comments are also updated to reflect this API change. No review comments were provided, so there is no feedback to address.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/deepseek/v4/decode_attention_swa.py`:
- Around line 163-173: The 16-wide parallel loop assumes full 16-lane chunks and
can read/write past valid batch rows when B % 16 != 0; in the
pl.parallel(0, B, 16) block (and the similar block at lines
191-193) replace the fixed pl.range(b0, b0 + 16) with a guarded range or mask by
computing the actual tail width (e.g., tail = min(16, B - b0)) and iterate
pl.range(b0, b0 + tail) or conditionally skip/avoid assemble for b >= B, and
ensure the pl.assemble into kv_cache_flat only happens for valid b (and uses the
same guarded index compute involving block_table_flat, S, s_idx, BLOCK_SIZE,
HEAD_DIM) so no out-of-bounds reads/writes occur.

In `@models/deepseek/v4/hc_post.py`:
- Around line 46-71: The inner loop `for t in pl.range(t0, t0 + 16)` can iterate
past the valid token count when T%16 != 0; modify the iteration to respect the
global T bound by limiting t to < T (e.g., compute t_end = min(t0+16, T) and
iterate pl.range(t0, t_end) or keep the existing range and skip iterations with
an if t >= T guard). Apply this change where the loop over t appears (the block
reading post_flat, x_flat, residual_flat and assembling y_flat) so all uses of t
(e.g., reads from post_flat at [t * HC_MULT + out_h], slices using [t, ...], and
writes to y_flat at [t, ...]) are protected from out-of-bounds access.
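The two remedies named in the comments above (clamping the inner loop end with `min`, or keeping the fixed-width loop and skipping out-of-range indices) visit the same indices. The plain-Python sketch below illustrates that equivalence only; whether `pl.range` accepts a clamped bound or the DSL allows an `if` guard inside the scope is not confirmed in this thread, and `clamped` and `guarded` are hypothetical names.

```python
CHUNK = 16


def clamped(n, chunk=CHUNK):
    """Option 1: clamp the inner loop end to n."""
    out = []
    for t0 in range(0, n, chunk):
        t_end = min(t0 + chunk, n)  # tail chunk may be narrower than 16
        out.extend(range(t0, t_end))
    return out


def guarded(n, chunk=CHUNK):
    """Option 2: keep the fixed-width inner loop and skip t >= n."""
    out = []
    for t0 in range(0, n, chunk):
        for t in range(t0, t0 + chunk):
            if t < n:
                out.append(t)
    return out


# Both options cover exactly [0, n), whether or not n divides by 16.
for n in (1, 16, 20, 33):
    assert clamped(n) == guarded(n) == list(range(n))
```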

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bcc0f626-c162-4c41-bf28-4537083cef26

📥 Commits

Reviewing files that changed from the base of the PR and between 2d6f086 and f05cd1f.

📒 Files selected for processing (4)

models/deepseek/v4/decode_attention_hca.py
models/deepseek/v4/decode_attention_swa.py
models/deepseek/v4/hc_post.py
models/deepseek/v4/qkv_proj_rope.py

coderabbitai · 2026-05-26T11:48:03Z

+        for t0 in pl.parallel(0, T, 16):
+            with pl.at(level=pl.Level.CORE_GROUP, name_hint="hc_post"):
+                for t in pl.range(t0, t0 + 16):
+                    post_w = pl.read(post_flat, [t * HC_MULT + out_h])
+                    for db in pl.range(D_BLOCKS):
+                        d0 = db * D_CHUNK
+                        x_row = pl.cast(
+                            pl.slice(x_flat, [1, D_CHUNK], [t, d0]),
                            target_type=pl.FP32,
                        )
-                        y_row = pl.add(y_row, pl.mul(residual_row, comb_w))
-                    y_flat = pl.assemble(
-                        y_flat,
-                        pl.cast(y_row, target_type=pl.BF16, mode="rint"),
-                        [t, out_h * D + d0],
-                    )
+                        y_row = pl.mul(x_row, post_w)
+                        for in_h in pl.range(HC_MULT):
+                            comb_w = pl.read(
+                                comb_flat,
+                                [t * HC_MULT * HC_MULT + in_h * HC_MULT + out_h],
+                            )
+                            residual_row = pl.cast(
+                                pl.slice(residual_flat, [1, D_CHUNK], [t, in_h * D + d0]),
+                                target_type=pl.FP32,
+                            )
+                            y_row = pl.add(y_row, pl.mul(residual_row, comb_w))
+                        y_flat = pl.assemble(
+                            y_flat,
+                            pl.cast(y_row, target_type=pl.BF16, mode="rint"),
+                            [t, out_h * D + d0],
+                        )
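For orientation, the per-row math in the hunk above can be written as a plain-Python reference. This is a sketch under assumptions inferred only from the index expressions in the diff: `x` is `[T][D]`, `residual` and `y` are `[T][HC_MULT * D]`, `post_flat` has length `T * HC_MULT`, and `comb_flat` is a row-major flattening over `(t, in_h, out_h)`. The `D_CHUNK` tiling and the FP32/BF16 casts are omitted, and `hc_post_reference` is a hypothetical name.

```python
def hc_post_reference(x, residual, post_flat, comb_flat, T, HC_MULT, D):
    """y[t][out_h*D + d] = x[t][d] * post[t, out_h]
    + sum over in_h of residual[t][in_h*D + d] * comb[t, in_h, out_h]."""
    y = [[0.0] * (HC_MULT * D) for _ in range(T)]
    for out_h in range(HC_MULT):
        for t in range(T):
            post_w = post_flat[t * HC_MULT + out_h]
            for d in range(D):
                acc = x[t][d] * post_w
                for in_h in range(HC_MULT):
                    comb_w = comb_flat[
                        t * HC_MULT * HC_MULT + in_h * HC_MULT + out_h
                    ]
                    acc += residual[t][in_h * D + d] * comb_w
                y[t][out_h * D + d] = acc
    return y


# Tiny case: T=1, HC_MULT=2, D=1.
y = hc_post_reference(
    x=[[2.0]],
    residual=[[10.0, 100.0]],
    post_flat=[1.0, 0.5],
    comb_flat=[1.0, 2.0, 3.0, 4.0],  # index = in_h * 2 + out_h
    T=1, HC_MULT=2, D=1,
)
assert y == [[312.0, 421.0]]
```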


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add tail bound check for the new 16-wide t chunk loop.

for t in pl.range(t0, t0 + 16) can run beyond valid token rows on the last chunk when T % 16 != 0.
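A small plain-Python check makes the overrun concrete. It models the loop bounds only, not the pypto runtime, and `overrun_indices` is a hypothetical helper.

```python
def overrun_indices(total, chunk=16):
    """Indices the fixed-width inner loop visits that fall outside [0, total)."""
    return [
        t
        for t0 in range(0, total, chunk)
        for t in range(t0, t0 + chunk)
        if t >= total
    ]


# No overrun when T is a multiple of 16.
assert overrun_indices(32) == []
# With T = 20, the last chunk starts at 16 and also visits 20..31.
assert overrun_indices(20) == list(range(20, 32))
```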

💡 Suggested fix

     for out_h in pl.parallel(HC_MULT):
         for t0 in pl.parallel(0, T, 16):
             with pl.at(level=pl.Level.CORE_GROUP, name_hint="hc_post"):
                 for t in pl.range(t0, t0 + 16):
-                    post_w = pl.read(post_flat, [t * HC_MULT + out_h])
-                    for db in pl.range(D_BLOCKS):
-                        d0 = db * D_CHUNK
-                        x_row = pl.cast(
-                            pl.slice(x_flat, [1, D_CHUNK], [t, d0]),
-                            target_type=pl.FP32,
-                        )
-                        y_row = pl.mul(x_row, post_w)
-                        for in_h in pl.range(HC_MULT):
-                            comb_w = pl.read(
-                                comb_flat,
-                                [t * HC_MULT * HC_MULT + in_h * HC_MULT + out_h],
-                            )
-                            residual_row = pl.cast(
-                                pl.slice(residual_flat, [1, D_CHUNK], [t, in_h * D + d0]),
-                                target_type=pl.FP32,
-                            )
-                            y_row = pl.add(y_row, pl.mul(residual_row, comb_w))
-                        y_flat = pl.assemble(
-                            y_flat,
-                            pl.cast(y_row, target_type=pl.BF16, mode="rint"),
-                            [t, out_h * D + d0],
-                        )
+                    if t < T:
+                        post_w = pl.read(post_flat, [t * HC_MULT + out_h])
+                        for db in pl.range(D_BLOCKS):
+                            d0 = db * D_CHUNK
+                            x_row = pl.cast(
+                                pl.slice(x_flat, [1, D_CHUNK], [t, d0]),
+                                target_type=pl.FP32,
+                            )
+                            y_row = pl.mul(x_row, post_w)
+                            for in_h in pl.range(HC_MULT):
+                                comb_w = pl.read(
+                                    comb_flat,
+                                    [t * HC_MULT * HC_MULT + in_h * HC_MULT + out_h],
+                                )
+                                residual_row = pl.cast(
+                                    pl.slice(residual_flat, [1, D_CHUNK], [t, in_h * D + d0]),
+                                    target_type=pl.FP32,
+                                )
+                                y_row = pl.add(y_row, pl.mul(residual_row, comb_w))
+                            y_flat = pl.assemble(
+                                y_flat,
+                                pl.cast(y_row, target_type=pl.BF16, mode="rint"),
+                                [t, out_h * D + d0],
+                            )

zhangqi-chen · 2026-05-26T11:56:40Z

    topk_idxs = pl.create_tensor([T, SPARSE_TOPK], dtype=pl.INT32)
    for t0 in pl.range(0, T, HCA_TOPK_CHUNK):
-        with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer, name_hint="hca_topk"):
+        with pl.at(level=pl.Level.CORE_GROUP, optimizations=[pl.auto_chunk], name_hint="hca_topk"):


zhangqi-chen · 2026-05-26T11:57:01Z

    sparse_topk = pl.create_tensor([T, SPARSE_TOPK], dtype=pl.INT32)
    for b0 in pl.range(0, T, SWA_BATCH_CHUNK):
-        with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer, name_hint="swa_topk"):
+        with pl.at(level=pl.Level.CORE_GROUP, optimizations=[pl.auto_chunk], name_hint="swa_topk"):


zhangqi-chen · 2026-05-26T11:57:14Z

    cmp_block_table_dummy = pl.create_tensor([B, SPARSE_CMP_MAX_BLOCKS], dtype=pl.INT32)
    for b0 in pl.range(0, B, SWA_BATCH_CHUNK):
-        with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer, name_hint="swa_cmp_dummy"):
+        with pl.at(level=pl.Level.CORE_GROUP, optimizations=[pl.auto_chunk], name_hint="swa_cmp_dummy"):


gemini-code-assist Bot reviewed May 26, 2026

wangqin1723-max force-pushed the chore/dsv4-clo-to-auto-chunk-388 branch from 00c1e82 to f0f2ad4 Compare May 26, 2026 09:48

coderabbitai Bot reviewed May 26, 2026

zhangqi-chen reviewed May 26, 2026

wangqin1723-max force-pushed the chore/dsv4-clo-to-auto-chunk-388 branch 3 times, most recently from bea63ab to c29b69c Compare May 26, 2026 12:20

chore(dsv4): migrate chunked_loop_optimizer to auto_chunk and tune to…

c29b69c

…pk scopes Replace parallel(chunk=) with explicit parallel+range and migrate the remaining chunked_loop_optimizer sites to auto_chunk; drop auto_chunk from hca_topk/swa_topk/swa_cmp_dummy scopes.

wangqin1723-max closed this May 27, 2026

wangqin1723-max reopened this May 27, 2026

wangqin1723-max wants to merge 1 commit into hw-native-sys:main from wangqin1723-max:chore/dsv4-clo-to-auto-chunk-388
