Perf: bump Qwen3-14B decode_full scope-3 K_CHUNK 128 -> 256 #275

Open
wangqin1723-max wants to merge 1 commit into hw-native-sys:main from wangqin1723-max:perf/qwen3-14b-decode-full-k-chunk-256

@wangqin1723-max
Contributor

Halves the per-layer down_proj kernel count (160 -> 80 at num-layers=4,
1600 -> 800 at num-layers=40) and lets each cube call do twice the work,
amortising dispatch overhead.
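
The counts quoted above are consistent with one down_proj kernel launch per hidden-dimension block per layer. A minimal sketch of that arithmetic, assuming a hidden size of 5120 (inferred from the validated output shape `(16, 5120)`); the helper `down_proj_kernel_count` is hypothetical and not part of the repository:

```python
HIDDEN = 5120  # assumption: inferred from the validated output shape (16, 5120)


def down_proj_kernel_count(num_layers: int, k_chunk: int, hidden: int = HIDDEN) -> int:
    """Hypothetical helper: one kernel launch per hidden block per layer."""
    if hidden % k_chunk != 0:
        raise ValueError("k_chunk must divide hidden evenly")
    hidden_blocks = hidden // k_chunk
    return num_layers * hidden_blocks


for layers in (4, 40):
    old = down_proj_kernel_count(layers, 128)
    new = down_proj_kernel_count(layers, 256)
    print(f"num-layers={layers}: {old} -> {new}")
```

With these assumptions the loop reproduces 160 -> 80 and 1600 -> 800.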

Measured on a2a3, --num-layers 40, --runtime-profiling:

  Metric              baseline      256       delta
  ---------------------------------------------------
  TotalExec (us)      781842.78  691453.76   -11.6%
  Avg Exec/task (us)      34.41      32.74    -4.9%
  Avg Latency (us)        61.78      53.66   -13.1%
  Tail OH P99 (us)         57.1       41.3    -28%
  Wall-clock (us)       121553     111863    -8.0%
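
The delta column can be re-derived from the two measured columns. A small sketch using only the numbers in the table; the function name `pct_delta` is illustrative:

```python
# (baseline, K_CHUNK=256) pairs copied from the table above, all in microseconds.
ROWS = {
    "TotalExec (us)": (781842.78, 691453.76),
    "Avg Exec/task (us)": (34.41, 32.74),
    "Avg Latency (us)": (61.78, 53.66),
    "Tail OH P99 (us)": (57.1, 41.3),
    "Wall-clock (us)": (121553.0, 111863.0),
}


def pct_delta(baseline: float, new: float) -> float:
    """Relative change of `new` against `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


for name, (baseline, new) in ROWS.items():
    print(f"{name:<20} {pct_delta(baseline, new):+.1f}%")
```

The Tail OH P99 row comes out at about -27.7%, which the table rounds to -28%.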

Mechanism (per-kernel breakdown):

  incore_18 (down_proj cube):
    Count:    1600 -> 800   (halved, as expected)
    Exec %:   68.1% -> 92.8% (cube engine utilisation +25pp)
    Head OH:  55.05 us -> 3.92 us (-93%)
    Per-call exec rises 135 -> 180 us (+33%) but the saved dispatch
    overhead far outweighs that.

  incore_2/3 (Q/KV proj output):
    Tail OH 5.87/8.94 us -> 0.68/1.06 us — scheduler is less loaded
    (Complete-phase total 48.2 ms -> 36.5 ms), so downstream kernels
    are picked up sooner.
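
To see why the higher per-call exec time is still a net win for incore_18, the figures can be summed over all calls. This is a rough sketch that assumes the quoted exec and Head OH values are per-call averages (the profiler's exact definitions are not stated here):

```python
# Figures quoted above for incore_18; treating them as per-call averages is an assumption.
BASELINE = {"count": 1600, "exec_us": 135.0, "head_oh_us": 55.05}
K256 = {"count": 800, "exec_us": 180.0, "head_oh_us": 3.92}


def aggregate_us(stats: dict) -> tuple:
    """Total exec time and total head overhead summed over all calls, in microseconds."""
    total_exec = stats["count"] * stats["exec_us"]
    total_head = stats["count"] * stats["head_oh_us"]
    return total_exec, total_head


for label, stats in (("baseline", BASELINE), ("K_CHUNK=256", K256)):
    exec_us, head_us = aggregate_us(stats)
    print(f"{label:<12} exec={exec_us / 1000:.1f} ms  head_oh={head_us / 1000:.1f} ms")
```

Under that assumption, aggregate exec time falls from roughly 216 ms to 144 ms and aggregate head overhead from roughly 88 ms to 3 ms, because halving the call count more than offsets the +33% per-call cost.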

Scope-3 tiles at K=256, N=256 fit the 512 KB mat buffer verifier
(128 KB BF16 cube tile, well under the limit that hw-native-sys#223 hit when trying
K=512,N=256 in qwen3_14b_decode.py).
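
The 128 KB figure follows from the tile shape and the 2-byte BF16 element size. A minimal sketch of that check; it covers a single tile only, and how many tiles the verifier counts against the 512 KB limit is not described here, so `fits_mat_buffer` is an illustrative helper rather than the verifier's actual rule:

```python
BF16_BYTES = 2                      # bytes per bfloat16 element
MAT_BUFFER_LIMIT = 512 * 1024       # the 512 KB limit quoted above


def tile_bytes(k: int, n: int, elem_bytes: int = BF16_BYTES) -> int:
    """Size in bytes of one K x N tile."""
    return k * n * elem_bytes


def fits_mat_buffer(k: int, n: int) -> bool:
    """Illustrative single-tile check, not the real verifier logic."""
    return tile_bytes(k, n) <= MAT_BUFFER_LIMIT


print(tile_bytes(256, 256) // 1024, "KB", fits_mat_buffer(256, 256))
```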

Validation:
  python models/qwen3/14b/qwen3_14b_decode_full.py -p a2a3 \
    --num-layers 40 --runtime-profiling
  -> 'out' PASS shape=(16, 5120) dtype=torch.bfloat16 (pass_rate>=0.9800)
@coderabbitai
coderabbitai Bot commented May 14, 2026

📝 Walkthrough

The pull request updates the Qwen3 14B decoder's tiling constant K_CHUNK from 128 to 256 in a single line. This change affects how the hidden dimension is partitioned during decode operations, altering the block granularity for Scope 1 staging/assembly and Scope 3 matmul slicing without modifying any functional logic or interfaces.

Changes

Qwen3 14B Decoder Tiling Granularity Update

| Layer / File(s) | Summary |
| --- | --- |
| K_CHUNK tiling constant update (`models/qwen3/14b/qwen3_14b_decode_full.py`) | K_CHUNK constant changed from 128 to 256, which directly alters the hidden_blocks partitioning (hidden // K_CHUNK) and therefore the slice/iteration counts in Scope 1 staging and Scope 3 matmul operations. |
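
The effect of `hidden // K_CHUNK` on the slice offsets can be sketched as follows, assuming a hidden size of 5120 (inferred from the validated output shape); `hidden_slices` is a hypothetical helper, not code from the changed file:

```python
HIDDEN = 5120  # assumption: inferred from the validated output shape (16, 5120)


def hidden_slices(k_chunk: int, hidden: int = HIDDEN) -> list:
    """Hypothetical helper: (start, stop) offsets of each hidden-dimension block."""
    hidden_blocks = hidden // k_chunk
    return [(kb * k_chunk, (kb + 1) * k_chunk) for kb in range(hidden_blocks)]


for k_chunk in (128, 256):
    slices = hidden_slices(k_chunk)
    print(f"K_CHUNK={k_chunk}: {len(slices)} blocks, first={slices[0]}, last={slices[-1]}")
```

Doubling K_CHUNK halves the number of blocks while each slice covers twice as many columns, so the full hidden range is still covered exactly once.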

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#223: Both PRs change the Qwen3 14B decode tiling granularity by updating K_CHUNK from 128 to 256, which alters the hidden-dimension block/loop slicing used in the decoder's matmul/scoping logic.
  • hw-native-sys/pypto-lib#104: Both PRs directly change the Qwen3 decode tiling granularity by updating the K_CHUNK constant, which alters the hidden_blocks partitioning and corresponding slice/assembly logic.
  • hw-native-sys/pypto-lib#203: Both PRs touch the fused full-layer decoder module qwen3_14b_decode_full.py, with the main PR changing its tiling constant K_CHUNK while #203 introduces and wires up that same full decode kernel.

Poem

🐰 A constant tweaked from one-two-eight,
To two-five-six, no time to wait!
Tiles grow larger, blocks align,
Decoder partitions now divine,
Qwen3 hops forth with tiling refined! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: bumping K_CHUNK from 128 to 256 in the Qwen3-14B decode_full implementation for performance improvement. |
| Description check | ✅ Passed | The description is highly relevant and detailed, explaining the performance rationale, measured improvements, kernel-level breakdown, memory verification, and validation results. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the K_CHUNK tiling constant from 128 to 256 in the qwen3_14b_decode_full.py model script. I have no feedback to provide as there were no review comments to evaluate.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
models/qwen3/14b/qwen3_14b_decode_full.py (1)

656-659: ⚡ Quick win

Update comment to reflect new K_CHUNK value.

The comment calculates 16*128*4 = 8 KiB based on the old K_CHUNK=128. With K_CHUNK=256, the per-iteration size is now 16*256*4 = 16 KiB.

📝 Suggested comment update
                        # FP32 GM scratch chunk used as the cube -> vec bridge.
-                       # Per-iter [BATCH_TILE, K_CHUNK] is small (16*128*4 =
-                       # 8 KiB) and avoids a large pre-allocated scratch.
+                       # Per-iter [BATCH_TILE, K_CHUNK] is small (16*256*4 =
+                       # 16 KiB) and avoids a large pre-allocated scratch.
                        fp32_chunk_gm = pl.create_tensor([BATCH_TILE, K_CHUNK], dtype=pl.FP32)
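
The sizes in the suggested comment can be checked with a few lines of arithmetic. A minimal sketch, assuming BATCH_TILE is 16 (as the `16*128*4` expression in the existing comment implies) and 4 bytes per FP32 element; `scratch_kib` is an illustrative name:

```python
BATCH_TILE = 16   # assumption: taken from the 16*K_CHUNK*4 expression in the comment
FP32_BYTES = 4    # bytes per float32 element


def scratch_kib(k_chunk: int) -> float:
    """Size in KiB of one [BATCH_TILE, K_CHUNK] FP32 scratch chunk."""
    return BATCH_TILE * k_chunk * FP32_BYTES / 1024


for k_chunk in (128, 256):
    print(f"K_CHUNK={k_chunk}: {scratch_kib(k_chunk):.0f} KiB")
```

This gives 8 KiB at K_CHUNK=128 and 16 KiB at K_CHUNK=256, matching the corrected comment.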
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/qwen3/14b/qwen3_14b_decode_full.py` around lines 656 - 659, Comment
about per-iter buffer size is outdated: update the comment above fp32_chunk_gm
to use the current K_CHUNK value (256) and correct the calculation to "16*256*4
= 16 KiB" (referencing symbols BATCH_TILE, K_CHUNK, fp32_chunk_gm) so the
comment accurately reflects the per-iteration size and rationale for avoiding a
large pre-allocated scratch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d5bea0b2-43ce-4060-bd5d-52f680f6bd6e

📥 Commits

Reviewing files that changed from the base of the PR and between 8f7af9a and d9537af.

📒 Files selected for processing (1)
  • models/qwen3/14b/qwen3_14b_decode_full.py
