Perf: bump Qwen3-14B decode_full scope-3 K_CHUNK 128 -> 256 by wangqin1723-max · Pull Request #275 · hw-native-sys/pypto-lib

wangqin1723-max · 2026-05-14T03:43:33Z

Halves the per-layer down_proj kernel count (160 -> 80 at num-layers=4, 1600 -> 800 at num-layers=40) and lets each cube call do twice the work, amortising dispatch overhead.
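The kernel counts above are consistent with one down_proj call per K_CHUNK-wide block of the hidden dimension. A minimal sketch of that arithmetic, assuming hidden = 5120 (inferred from the validated output shape (16, 5120)); the helper name `down_proj_kernel_count` is hypothetical and not part of the repository:

```python
# Assumed hidden size, inferred from the validation output shape (16, 5120).
HIDDEN = 5120


def down_proj_kernel_count(k_chunk: int, num_layers: int, hidden: int = HIDDEN) -> int:
    """Launch count if each layer issues one down_proj call per K_CHUNK-wide block."""
    if hidden % k_chunk != 0:
        raise ValueError("k_chunk must divide hidden evenly")
    return (hidden // k_chunk) * num_layers


for layers in (4, 40):
    before = down_proj_kernel_count(128, layers)
    after = down_proj_kernel_count(256, layers)
    print(f"num-layers={layers}: {before} -> {after}")
```

Under this assumption the counts come out to 160 -> 80 and 1600 -> 800, matching the figures reported in the description.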

Measured on a2a3, --num-layers 40, --runtime-profiling:

Metric               baseline     256          delta
----------------------------------------------------
TotalExec (us)       781842.78    691453.76    -11.6%
Avg Exec/task (us)   34.41        32.74        -4.9%
Avg Latency (us)     61.78        53.66        -13.1%
Tail OH P99 (us)     57.1         41.3         -28%
Wall-clock (us)      121553       111863       -8.0%
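The delta column can be re-derived from the two measured columns. A small sketch that recomputes it; the numbers are copied from the table, and `pct_delta` is an illustrative helper name:

```python
def pct_delta(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# (baseline, K_CHUNK=256) pairs copied from the table in the description.
metrics = {
    "TotalExec (us)": (781842.78, 691453.76),
    "Avg Exec/task (us)": (34.41, 32.74),
    "Avg Latency (us)": (61.78, 53.66),
    "Tail OH P99 (us)": (57.1, 41.3),
    "Wall-clock (us)": (121553, 111863),
}

for name, (baseline, new) in metrics.items():
    print(f"{name:<20} {pct_delta(baseline, new):+.1f}%")
```

The recomputed values agree with the reported deltas after rounding (the Tail OH P99 row is about -27.7%, reported as -28%).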

Mechanism (per-kernel breakdown):

incore_18 (down_proj cube):
  Count:   1600 -> 800 (halved, as expected)
  Exec %:  68.1% -> 92.8% (cube engine utilisation +25pp)
  Head OH: 55.05 us -> 3.92 us (-93%)
  Per-call exec rises 135 -> 180 us (+33%), but the saved dispatch
  overhead far outweighs that.

incore_2/3 (Q/KV proj output):
  Tail OH 5.87/8.94 us -> 0.68/1.06 us; the scheduler is less loaded
  (Complete-phase total 48.2 ms -> 36.5 ms), so downstream kernels
  are picked up sooner.
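A back-of-envelope check of the incore_18 trade-off, assuming the reported per-call exec and Head OH figures are per-call averages (the description does not say so explicitly). The aggregate values are derived here and are not profiler output:

```python
def total_us(count: int, per_call_us: float) -> float:
    """Aggregate time if every call costs the reported average."""
    return count * per_call_us


# incore_18 figures copied from the per-kernel breakdown.
exec_before = total_us(1600, 135)     # K_CHUNK = 128
exec_after = total_us(800, 180)       # K_CHUNK = 256
head_before = total_us(1600, 55.05)   # assumes Head OH is a per-call average
head_after = total_us(800, 3.92)

print(f"exec total:    {exec_before:.0f} us -> {exec_after:.0f} us")
print(f"head OH total: {head_before:.0f} us -> {head_after:.0f} us")
```

Under these assumptions the aggregate exec time for this kernel falls even though each call is about 33% longer, because the call count is halved.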

Scope-3 tiles at K=256, N=256 fit the 512 KB mat buffer verifier (128 KB BF16 cube tile, well under the limit that #223 hit when trying K=512,N=256 in qwen3_14b_decode.py).
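The 128 KB figure follows from the tile shape and the 2-byte BF16 element size. A minimal sketch; `cube_tile_bytes` is a hypothetical helper and only models a single tile, not the verifier's full accounting of the mat buffer:

```python
BF16_BYTES = 2                  # bfloat16 is 2 bytes per element
MAT_BUFFER_BYTES = 512 * 1024   # limit quoted in the description


def cube_tile_bytes(k: int, n: int, elem_bytes: int = BF16_BYTES) -> int:
    """Size in bytes of a single K x N tile."""
    return k * n * elem_bytes


tile = cube_tile_bytes(256, 256)
print(f"K=256, N=256: {tile // 1024} KB of {MAT_BUFFER_BYTES // 1024} KB")
```

The same formula gives 256 KB for a single K=512, N=256 tile; the description does not spell out what else occupies the buffer in that configuration, so this sketch does not try to reproduce the failure mentioned for #223.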

Validation:
python models/qwen3/14b/qwen3_14b_decode_full.py -p a2a3 \
    --num-layers 40 --runtime-profiling
-> 'out' PASS shape=(16, 5120) dtype=torch.bfloat16 (pass_rate>=0.9800)


coderabbitai · 2026-05-14T03:43:44Z

Walkthrough

The pull request updates the Qwen3 14B decoder's tiling constant K_CHUNK from 128 to 256 in a single line. This change affects how the hidden dimension is partitioned during decode operations, altering the block granularity for Scope 1 staging/assembly and Scope 3 matmul slicing without modifying any functional logic or interfaces.

Changes

Qwen3 14B Decoder Tiling Granularity Update

Layer / File(s)	Summary
K_CHUNK tiling constant update `models/qwen3/14b/qwen3_14b_decode_full.py`	K_CHUNK constant changed from 128 to 256, which directly alters the hidden_blocks partitioning (hidden // K_CHUNK) and therefore the slice/iteration counts in Scope 1 staging and Scope 3 matmul operations.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

hw-native-sys/pypto-lib#223: Both PRs change the Qwen3 14B decode tiling granularity by updating K_CHUNK from 128 to 256, which alters the hidden-dimension block/loop slicing used in the decoder's matmul/scoping logic.
hw-native-sys/pypto-lib#104: Both PRs directly change the Qwen3 decode tiling granularity by updating the K_CHUNK constant, which alters the hidden_blocks partitioning and corresponding slice/assembly logic.
hw-native-sys/pypto-lib#203: Both PRs touch the fused full-layer decoder module qwen3_14b_decode_full.py, with the main PR changing its tiling constant K_CHUNK while #203 introduces and wires up that same full decode kernel.

Poem

🐰 A constant tweaked from one-two-eight,
To two-five-six, no time to wait!
Tiles grow larger, blocks align,
Decoder partitions now divine,
Qwen3 hops forth with tiling refined! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: bumping K_CHUNK from 128 to 256 in the Qwen3-14B decode_full implementation for performance improvement.
Description check	✅ Passed	The description is highly relevant and detailed, explaining the performance rationale, measured improvements, kernel-level breakdown, memory verification, and validation results.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


gemini-code-assist

Code Review

This pull request updates the K_CHUNK tiling constant from 128 to 256 in the qwen3_14b_decode_full.py model script. I have no feedback to provide as there were no review comments to evaluate.

coderabbitai

🧹 Nitpick comments (1)

models/qwen3/14b/qwen3_14b_decode_full.py (1)

656-659: ⚡ Quick win

Update comment to reflect new K_CHUNK value.

The comment calculates 16*128*4 = 8 KiB based on the old K_CHUNK=128. With K_CHUNK=256, the per-iteration size is now 16*256*4 = 16 KiB.

📝 Suggested comment update

                        # FP32 GM scratch chunk used as the cube -> vec bridge.
-                       # Per-iter [BATCH_TILE, K_CHUNK] is small (16*128*4 =
-                       # 8 KiB) and avoids a large pre-allocated scratch.
+                       # Per-iter [BATCH_TILE, K_CHUNK] is small (16*256*4 =
+                       # 16 KiB) and avoids a large pre-allocated scratch.
                        fp32_chunk_gm = pl.create_tensor([BATCH_TILE, K_CHUNK], dtype=pl.FP32)
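The corrected size in the suggested comment can be checked directly. A minimal sketch, assuming BATCH_TILE = 16 as implied by the 16*256*4 expression in the review comment; `fp32_scratch_bytes` is an illustrative name:

```python
FP32_BYTES = 4
BATCH_TILE = 16  # assumed from the "16*256*4" expression in the review comment


def fp32_scratch_bytes(k_chunk: int, batch_tile: int = BATCH_TILE) -> int:
    """Bytes in one [BATCH_TILE, K_CHUNK] FP32 scratch chunk."""
    return batch_tile * k_chunk * FP32_BYTES


for k_chunk in (128, 256):
    print(f"K_CHUNK={k_chunk}: {fp32_scratch_bytes(k_chunk) // 1024} KiB")
```

This gives 8 KiB for K_CHUNK=128 and 16 KiB for K_CHUNK=256, in line with the suggested comment update.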

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/qwen3/14b/qwen3_14b_decode_full.py` around lines 656 - 659, Comment
about per-iter buffer size is outdated: update the comment above fp32_chunk_gm
to use the current K_CHUNK value (256) and correct the calculation to "16*256*4
= 16 KiB" (referencing symbols BATCH_TILE, K_CHUNK, fp32_chunk_gm) so the
comment accurately reflects the per-iteration size and rationale for avoiding a
large pre-allocated scratch.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d5bea0b2-43ce-4060-bd5d-52f680f6bd6e

📥 Commits

Reviewing files that changed from the base of the PR and between 8f7af9a and d9537af.

📒 Files selected for processing (1)

models/qwen3/14b/qwen3_14b_decode_full.py

gemini-code-assist Bot reviewed May 14, 2026

coderabbitai Bot reviewed May 14, 2026

wangqin1723-max wants to merge 1 commit into hw-native-sys:main from wangqin1723-max:perf/qwen3-14b-decode-full-k-chunk-256