Perf: bump Qwen3-14B decode_full scope-3 K_CHUNK 128 -> 256 #275
wangqin1723-max wants to merge 1 commit into
Halves the per-layer down_proj kernel count (160 -> 80 at num-layers=4,
1600 -> 800 at num-layers=40) and lets each cube call do twice the work,
amortising dispatch overhead.
Measured on a2a3, --num-layers 40, --runtime-profiling:
Metric                 baseline (128)   K_CHUNK=256   delta
-------------------------------------------------------------
TotalExec (us)         781842.78        691453.76     -11.6%
Avg Exec/task (us)     34.41            32.74         -4.9%
Avg Latency (us)       61.78            53.66         -13.1%
Tail OH P99 (us)       57.1             41.3          -28%
Wall-clock (us)        121553           111863        -8.0%
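The delta column can be re-derived from the two measurement columns. The snippet below is a small check, not part of the PR; the dictionary keys and the `pct_delta` helper are illustrative names, and the numbers are copied from the table above.

```python
# Values copied from the table above, in microseconds: (baseline, K_CHUNK=256).
metrics = {
    "TotalExec": (781842.78, 691453.76),
    "Avg Exec/task": (34.41, 32.74),
    "Avg Latency": (61.78, 53.66),
    "Tail OH P99": (57.1, 41.3),
    "Wall-clock": (121553, 111863),
}


def pct_delta(baseline, new):
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Round to one decimal place, matching the precision used in the table.
deltas = {name: round(pct_delta(b, n), 1) for name, (b, n) in metrics.items()}

for name, delta in deltas.items():
    print(f"{name:<15} {delta:+.1f}%")
```

The Tail OH P99 row comes out at about -27.7%, which the table reports rounded to -28%.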
Mechanism (per-kernel breakdown):
  incore_18 (down_proj cube):
    Count:    1600 -> 800 (halved, as expected)
    Exec %:   68.1% -> 92.8% (cube engine utilisation +25pp)
    Head OH:  55.05 us -> 3.92 us (-93%)
    Per-call exec rises 135 -> 180 us (+33%), but the saved dispatch
    overhead far outweighs that.
  incore_2/3 (Q/KV proj output):
    Tail OH 5.87/8.94 us -> 0.68/1.06 us. The scheduler is less loaded
    (Complete-phase total 48.2 ms -> 36.5 ms), so downstream kernels
    are picked up sooner.
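A rough model of the incore_18 numbers, as a sketch only. It assumes the chunked dimension is 5120 (inferred from the `(16, 5120)` output shape, not stated in the PR), that Head OH is a per-call average, and that exec plus head overhead is a fair proxy for per-call cost; tail overhead is ignored. The function names are hypothetical.

```python
# Assumption: down_proj is chunked along a 5120-wide dimension, which
# reproduces the reported call counts (1600 / 800 at 40 layers).
HIDDEN = 5120


def down_proj_calls(num_layers, k_chunk, dim=HIDDEN):
    """Number of down_proj cube calls if each layer issues dim / k_chunk of them."""
    return num_layers * (dim // k_chunk)


def total_us(calls, exec_us, head_oh_us):
    """Aggregate cost if every call pays its exec time plus its head overhead."""
    return calls * (exec_us + head_oh_us)


base_calls = down_proj_calls(40, 128)  # 1600, as reported
new_calls = down_proj_calls(40, 256)   # 800, as reported

# Per-call exec and Head OH figures are taken from the breakdown above.
base_total = total_us(base_calls, exec_us=135.0, head_oh_us=55.05)
new_total = total_us(new_calls, exec_us=180.0, head_oh_us=3.92)

print(f"calls: {base_calls} -> {new_calls}")
print(f"exec + head OH: {base_total:.0f} us -> {new_total:.0f} us")
```

Under these assumptions each call gets about a third slower, but half as many calls each paying far less head overhead leaves the aggregate well below the baseline. The totals are summed over all calls and are not directly comparable to wall-clock time.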
Scope-3 tiles at K=256, N=256 fit the 512 KB mat buffer verifier
(128 KB BF16 cube tile, well under the limit that hw-native-sys#223 hit when trying
K=512,N=256 in qwen3_14b_decode.py).
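The 128 KB figure follows from the tile shape and element width. A minimal sketch of that arithmetic, assuming "512 KB" means 512 KiB and that the verifier budget is compared against a single K x N BF16 tile (how the verifier actually accounts for buffers is not described here); `tile_bytes` and the constants are illustrative names:

```python
BF16_BYTES = 2                       # bfloat16 is 2 bytes per element
MAT_BUFFER_LIMIT = 512 * 1024        # assumption: "512 KB" read as 512 KiB


def tile_bytes(k, n, elem_bytes=BF16_BYTES):
    """Size in bytes of one K x N tile with the given element width."""
    return k * n * elem_bytes


tile = tile_bytes(256, 256)
print(f"K=256, N=256 BF16 tile: {tile // 1024} KiB")
print(f"fraction of mat buffer: {tile / MAT_BUFFER_LIMIT:.2f}")
```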
Validation:
  python models/qwen3/14b/qwen3_14b_decode_full.py -p a2a3 \
      --num-layers 40 --runtime-profiling
  -> 'out' PASS shape=(16, 5120) dtype=torch.bfloat16 (pass_rate>=0.9800)
Nitpick comments (1)
models/qwen3/14b/qwen3_14b_decode_full.py (1)
656-659: Update comment to reflect new K_CHUNK value.
The comment calculates `16*128*4 = 8 KiB` based on the old `K_CHUNK=128`. With `K_CHUNK=256`, the per-iteration size is now `16*256*4 = 16 KiB`.
Suggested comment update:
      # FP32 GM scratch chunk used as the cube -> vec bridge.
    - # Per-iter [BATCH_TILE, K_CHUNK] is small (16*128*4 =
    - # 8 KiB) and avoids a large pre-allocated scratch.
    + # Per-iter [BATCH_TILE, K_CHUNK] is small (16*256*4 =
    + # 16 KiB) and avoids a large pre-allocated scratch.
      fp32_chunk_gm = pl.create_tensor([BATCH_TILE, K_CHUNK], dtype=pl.FP32)
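The scratch-size arithmetic in the review comment can be checked directly. This is a standalone sketch: `BATCH_TILE = 16` is taken from the `16*K_CHUNK*4` expression in the comment, and `scratch_kib` is a hypothetical helper, not a function in the repository.

```python
FP32_BYTES = 4    # float32 is 4 bytes per element
BATCH_TILE = 16   # from the 16*K_CHUNK*4 arithmetic in the review comment


def scratch_kib(k_chunk, batch_tile=BATCH_TILE):
    """Size in KiB of a [batch_tile, k_chunk] FP32 scratch tensor."""
    return batch_tile * k_chunk * FP32_BYTES / 1024


print(scratch_kib(128))  # old K_CHUNK
print(scratch_kib(256))  # new K_CHUNK
```

Doubling `K_CHUNK` doubles the per-iteration scratch, which stays small in absolute terms.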