chore(dsv4): migrate chunked_loop_optimizer to auto_chunk (#388) #389

Open
wangqin1723-max wants to merge 1 commit into hw-native-sys:main from wangqin1723-max:chore/dsv4-clo-to-auto-chunk-388

Conversation

@wangqin1723-max
Contributor

@wangqin1723-max wangqin1723-max commented May 26, 2026

Summary

Closes #388. chunked_loop_optimizer is deprecated upstream; the rest of the repo already uses auto_chunk.

Commit 1 — swap the 7 remaining sites to auto_chunk:

  • decode_attention_hca.py — hca_topk
  • decode_attention_swa.py — swa_scatter_kv / swa_topk / swa_cmp_dummy
  • hc_post.py — hc_post
  • qkv_proj_rope.py — attn_norm_rms_partial / qr_rms_partial

Commit 2 — for the two pl.parallel(0,N,1,chunk=16) sites (hc_post, swa_scatter_kv), drop the optimizer entirely and make chunking explicit: pl.parallel(0,N,16) + pl.at + pl.range(16). The other 5 sites stay on auto_chunk (not parallel-chunk loops).
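The explicit form in Commit 2 can be modeled in plain Python. This is a minimal sketch of the iteration order only: built-in `range` stands in for `pl.parallel` and `pl.range`, the function names are hypothetical, and it assumes N is a multiple of 16. It says nothing about how pypto schedules the loops.

```python
CHUNK = 16  # chunk width used at both sites in the PR


def flat_indices(n):
    """Indices visited by a unit-stride loop over [0, n)."""
    return list(range(0, n, 1))


def two_level_indices(n, chunk=CHUNK):
    """Indices visited by an outer stride-`chunk` loop plus a fixed-width inner loop."""
    visited = []
    for t0 in range(0, n, chunk):        # stands in for pl.parallel(0, N, 16)
        for t in range(t0, t0 + chunk):  # stands in for the inner 16-wide pl.range
            visited.append(t)
    return visited


if __name__ == "__main__":
    n = 64  # a multiple of 16, as assumed above
    print(two_level_indices(n) == flat_indices(n))
```

Under that assumption both forms visit the same indices in the same order, so the rewrite changes where the chunk boundary is expressed, not which rows are processed.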

Validation

Run on a2a3 (with PTO2_RING_* env), all PASS, precision-neutral:

| target | result |
| --- | --- |
| decode_hca.py | x_next PASS |
| decode_csa.py | x_next PASS |
| decode_attention_swa.py | x_out PASS |

@coderabbitai

coderabbitai Bot commented May 26, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab88bdc7-d120-4693-8b08-67ef9a33afd0

📥 Commits

Reviewing files that changed from the base of the PR and between f05cd1f and c29b69c.

📒 Files selected for processing (3)
  • models/deepseek/v4/decode_attention_hca.py
  • models/deepseek/v4/decode_attention_swa.py
  • models/deepseek/v4/hc_post.py
📝 Walkthrough

Walkthrough

Four DeepSeek v4 kernel modules migrate loop optimization directives from pl.chunked_loop_optimizer to pl.auto_chunk. HCA and SWA attention decoders update scheduling hints; HC_post restructures loop nesting; QKV projection RMS loops switch optimization parameters with updated comments.

Changes

Loop Optimization Migration to Auto_Chunk

| Layer / File(s) | Summary |
| --- | --- |
| Attention decoder kernel optimization migration: models/deepseek/v4/decode_attention_hca.py, models/deepseek/v4/decode_attention_swa.py | HCA topk switches from chunked_loop_optimizer to auto_chunk. SWA KV scatter and sparse-topk loops migrate optimization directives and restructure batch parallelization from implicit chunking to pl.parallel(0, B, 16) plus inner 16-wide range loop. SWA cmp_dummy dummy-block preparation also switches to auto_chunk. Destination-slot computation and KV-cache assembly semantics preserved. |
| HC_post core loop restructuring: models/deepseek/v4/hc_post.py | HC_post inner iteration replaces single pl.parallel(0, T, 1, chunk=16) with two-level structure: outer pl.parallel(0, T, 16) and inner pl.range(t0, t0 + 16) loop. Per-iteration math (load, compute rows, accumulate, assemble) remains unchanged. |
| QKV projection RMS loop optimization: models/deepseek/v4/qkv_proj_rope.py | Attn_norm_rms_partial and qr_rms_partial worker loops switch pl.at from optimization=pl.chunked_loop_optimizer to optimizations=[pl.auto_chunk]. Comments updated to document auto_chunk requirement for preventing Vec UB over-allocation. Partial-split tiling structure and final reduction behavior unchanged. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#373: Parallel auto_chunk migration across multiple DeepSeek kernels covering the same optimization directive consolidation pattern.
  • hw-native-sys/pypto-lib#339: Restructures RMS partial-reduction stages in qkv_proj_rope that are directly related to the RMS loop optimization changes in this PR.
  • hw-native-sys/pypto-lib#332: Concurrent modifications to qkv_proj_rope.py loop optimization and scope-fusion refactoring affecting the same RMS partial-sum paths.

Poem

🐰 Chunks and loops in harmony dance,
Auto-chunked, no second glance,
HCA, SWA, vectors bright,
Compiler guides all paths aright!
RMS sums and timely strides,
Optimization's joyful tides! 🎵

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title accurately describes the main change: migrating from deprecated chunked_loop_optimizer to auto_chunk in the DSv4 decoder modules. |
| Description check | ✅ Passed | The PR description is clearly related to the changeset, explaining the migration from chunked_loop_optimizer to auto_chunk across the specific files and functions shown in the code changes. |
| Linked Issues check | ✅ Passed | The PR addresses all objectives from issue #388: migrating 7 sites from chunked_loop_optimizer to auto_chunk (hca_topk, swa_scatter_kv/topk/cmp_dummy, hc_post, attn/qr_rms_partial) with validation confirming numeric preservation. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the migration task; only the specified optimization mechanism and explicit chunking patterns were modified without unrelated alterations. |


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request replaces the deprecated optimization=pl.chunked_loop_optimizer argument with optimizations=[pl.auto_chunk] across multiple DeepSeek v4 model files, including decode_attention_hca.py, decode_attention_swa.py, hc_post.py, and qkv_proj_rope.py. The inline documentation and comments are also updated to reflect this API change. No review comments were provided, so there is no feedback to address.

@wangqin1723-max wangqin1723-max force-pushed the chore/dsv4-clo-to-auto-chunk-388 branch from 00c1e82 to f0f2ad4 Compare May 26, 2026 09:48

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/deepseek/v4/decode_attention_swa.py`:
- Around line 163-173: The 16-wide parallel loop assumes full 16-lane chunks and
can read/write past valid batch rows when B % 16 != 0; in the
pl.parallel(0, B, 16) block (and the similar block at lines
191-193) replace the fixed pl.range(b0, b0 + 16) with a guarded range or mask by
computing the actual tail width (e.g., tail = min(16, B - b0)) and iterate
pl.range(b0, b0 + tail) or conditionally skip/avoid assemble for b >= B, and
ensure the pl.assemble into kv_cache_flat only happens for valid b (and uses the
same guarded index compute involving block_table_flat, S, s_idx, BLOCK_SIZE,
HEAD_DIM) so no out-of-bounds reads/writes occur.

In `@models/deepseek/v4/hc_post.py`:
- Around line 46-71: The inner loop `for t in pl.range(t0, t0 + 16)` can iterate
past the valid token count when T%16 != 0; modify the iteration to respect the
global T bound by limiting t to < T (e.g., compute t_end = min(t0+16, T) and
iterate pl.range(t0, t_end) or keep the existing range and skip iterations with
an if t >= T guard). Apply this change where the loop over t appears (the block
reading post_flat, x_flat, residual_flat and assembling y_flat) so all uses of t
(e.g., reads from post_flat at [t * HC_MULT + out_h], slices using [t, ...], and
writes to y_flat at [t, ...]) are protected from out-of-bounds access.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bcc0f626-c162-4c41-bf28-4537083cef26

📥 Commits

Reviewing files that changed from the base of the PR and between 2d6f086 and f05cd1f.

📒 Files selected for processing (4)
  • models/deepseek/v4/decode_attention_hca.py
  • models/deepseek/v4/decode_attention_swa.py
  • models/deepseek/v4/hc_post.py
  • models/deepseek/v4/qkv_proj_rope.py

Comment thread models/deepseek/v4/decode_attention_swa.py
Comment on lines +46 to +71
for t0 in pl.parallel(0, T, 16):
    with pl.at(level=pl.Level.CORE_GROUP, name_hint="hc_post"):
        for t in pl.range(t0, t0 + 16):
            post_w = pl.read(post_flat, [t * HC_MULT + out_h])
            for db in pl.range(D_BLOCKS):
                d0 = db * D_CHUNK
                x_row = pl.cast(
                    pl.slice(x_flat, [1, D_CHUNK], [t, d0]),
                    target_type=pl.FP32,
                )
                y_row = pl.mul(x_row, post_w)
                for in_h in pl.range(HC_MULT):
                    comb_w = pl.read(
                        comb_flat,
                        [t * HC_MULT * HC_MULT + in_h * HC_MULT + out_h],
                    )
                    residual_row = pl.cast(
                        pl.slice(residual_flat, [1, D_CHUNK], [t, in_h * D + d0]),
                        target_type=pl.FP32,
                    )
                    y_row = pl.add(y_row, pl.mul(residual_row, comb_w))
                y_flat = pl.assemble(
                    y_flat,
                    pl.cast(y_row, target_type=pl.BF16, mode="rint"),
                    [t, out_h * D + d0],
                )

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add tail bound check for the new 16-wide t chunk loop.

for t in pl.range(t0, t0 + 16) can run beyond valid token rows on the last chunk when T % 16 != 0.

💡 Suggested fix
 for out_h in pl.parallel(HC_MULT):
     for t0 in pl.parallel(0, T, 16):
         with pl.at(level=pl.Level.CORE_GROUP, name_hint="hc_post"):
             for t in pl.range(t0, t0 + 16):
-                post_w = pl.read(post_flat, [t * HC_MULT + out_h])
-                for db in pl.range(D_BLOCKS):
-                    d0 = db * D_CHUNK
-                    x_row = pl.cast(
-                        pl.slice(x_flat, [1, D_CHUNK], [t, d0]),
-                        target_type=pl.FP32,
-                    )
-                    y_row = pl.mul(x_row, post_w)
-                    for in_h in pl.range(HC_MULT):
-                        comb_w = pl.read(
-                            comb_flat,
-                            [t * HC_MULT * HC_MULT + in_h * HC_MULT + out_h],
-                        )
-                        residual_row = pl.cast(
-                            pl.slice(residual_flat, [1, D_CHUNK], [t, in_h * D + d0]),
-                            target_type=pl.FP32,
-                        )
-                        y_row = pl.add(y_row, pl.mul(residual_row, comb_w))
-                    y_flat = pl.assemble(
-                        y_flat,
-                        pl.cast(y_row, target_type=pl.BF16, mode="rint"),
-                        [t, out_h * D + d0],
-                    )
+                if t < T:
+                    post_w = pl.read(post_flat, [t * HC_MULT + out_h])
+                    for db in pl.range(D_BLOCKS):
+                        d0 = db * D_CHUNK
+                        x_row = pl.cast(
+                            pl.slice(x_flat, [1, D_CHUNK], [t, d0]),
+                            target_type=pl.FP32,
+                        )
+                        y_row = pl.mul(x_row, post_w)
+                        for in_h in pl.range(HC_MULT):
+                            comb_w = pl.read(
+                                comb_flat,
+                                [t * HC_MULT * HC_MULT + in_h * HC_MULT + out_h],
+                            )
+                            residual_row = pl.cast(
+                                pl.slice(residual_flat, [1, D_CHUNK], [t, in_h * D + d0]),
+                                target_type=pl.FP32,
+                            )
+                            y_row = pl.add(y_row, pl.mul(residual_row, comb_w))
+                        y_flat = pl.assemble(
+                            y_flat,
+                            pl.cast(y_row, target_type=pl.BF16, mode="rint"),
+                            [t, out_h * D + d0],
+                        )
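The tail overrun, and the two remedies the review mentions (clamping the inner bound, or guarding each iteration), can be checked with a plain-Python model. This is a sketch only: built-in `range` stands in for `pl.parallel` and `pl.range`, the function names are hypothetical, and whether the real kernels ever run with T % 16 != 0 is not stated in this thread.

```python
CHUNK = 16  # inner loop width from the PR


def unguarded(total, chunk=CHUNK):
    """Outer stride loop with a fixed-width inner loop and no bound check."""
    return [t for t0 in range(0, total, chunk) for t in range(t0, t0 + chunk)]


def clamped(total, chunk=CHUNK):
    """Inner loop stops at t_end = min(t0 + chunk, total)."""
    visited = []
    for t0 in range(0, total, chunk):
        t_end = min(t0 + chunk, total)
        visited.extend(range(t0, t_end))
    return visited


def masked(total, chunk=CHUNK):
    """Fixed-width inner loop that skips rows at or beyond `total`."""
    visited = []
    for t0 in range(0, total, chunk):
        for t in range(t0, t0 + chunk):
            if t < total:  # same condition as the `if t < T` guard in the suggested fix
                visited.append(t)
    return visited


if __name__ == "__main__":
    total = 40  # not a multiple of 16
    print(max(unguarded(total)))                  # highest row index touched without a guard
    print(clamped(total) == list(range(total)))   # clamped loop stays in bounds
    print(masked(total) == clamped(total))        # both remedies visit the same rows
```

With 40 rows the unguarded loop touches indices up to 47, while both guarded variants visit exactly rows 0 through 39. When the row count is a multiple of 16, all three forms agree.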

 topk_idxs = pl.create_tensor([T, SPARSE_TOPK], dtype=pl.INT32)
 for t0 in pl.range(0, T, HCA_TOPK_CHUNK):
-    with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer, name_hint="hca_topk"):
+    with pl.at(level=pl.Level.CORE_GROUP, optimizations=[pl.auto_chunk], name_hint="hca_topk"):
Collaborator


delete

 sparse_topk = pl.create_tensor([T, SPARSE_TOPK], dtype=pl.INT32)
 for b0 in pl.range(0, T, SWA_BATCH_CHUNK):
-    with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer, name_hint="swa_topk"):
+    with pl.at(level=pl.Level.CORE_GROUP, optimizations=[pl.auto_chunk], name_hint="swa_topk"):
Collaborator


delete

 cmp_block_table_dummy = pl.create_tensor([B, SPARSE_CMP_MAX_BLOCKS], dtype=pl.INT32)
 for b0 in pl.range(0, B, SWA_BATCH_CHUNK):
-    with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer, name_hint="swa_cmp_dummy"):
+    with pl.at(level=pl.Level.CORE_GROUP, optimizations=[pl.auto_chunk], name_hint="swa_cmp_dummy"):
Collaborator


delete

@wangqin1723-max wangqin1723-max force-pushed the chore/dsv4-clo-to-auto-chunk-388 branch 3 times, most recently from bea63ab to c29b69c Compare May 26, 2026 12:20
…pk scopes

Replace parallel(chunk=) with explicit parallel+range and migrate the
remaining chunked_loop_optimizer sites to auto_chunk; drop auto_chunk from
hca_topk/swa_topk/swa_cmp_dummy scopes.

Development

Successfully merging this pull request may close these issues.

[Feature] Remove chunked_loop_optimizer from dsv4 attention (hca/swa), migrate to auto_chunk

2 participants