test(runtime): Add runtime regression test for #1352 Path 1#1355
test(runtime): Add runtime regression test for #1352 Path 1#1355luohuan19 wants to merge 5 commits into
Conversation
…w-native-sys#1352) Ascend 910B has no hardware path to copy data between disjoint L0C (accumulator) addresses; ptoas rejects any such pto.tmov with: 'pto.tmov' op expects a supported tmov address-space pair for this target These acc→acc tile.move ops are always semantic no-ops: they arise when MemoryReuse fails to unify MemRef bases across nested pipelined matmul_acc loops — YieldFixupMutator inserts them as fixups, but AllocateMemoryAddr then assigns different physical addresses to the un-unified bases, producing an acc@0 → acc@16384 → acc@0 round-trip. The prior fix (e4ddc66, hw-native-sys#1310) only elided same-space same-address moves. Extend the condition to also elide when both src and dst are MemorySpace::Acc, regardless of byte_offset. Co-authored-by: Cursor <cursoragent@cursor.com>
…peline loops Adds test_pipeline_matmul_acc_no_acc_to_acc_tmov to TestTileMoveAccNoopElision, covering the pattern where pl.pipeline(stage=2) + Mat-resident matmul_acc causes AutoTileMatmulL0 to insert inner K-loops whose _l0_c IterArg bases may not be unified by MemoryReuse, exposing acc→acc tile.move that ptoas rejects (hw-native-sys#1352). Co-authored-by: Cursor <cursoragent@cursor.com>
Add a runtime regression test that reproduces the gate_up_silu pattern from the Qwen3-32B decode kernel which triggered the acc→acc pto.tmov bug (hw-native-sys#1352). The test exercises Path 1 of the fix (codegen-level elision of acc→acc pto.tmov in src/backend/common/pto_ops_common.cpp). The test uses the exact prolog-then-pipeline pattern with two independent accumulators (gate_acc, up_acc) under pl.at(CORE_GROUP, split=UP_DOWN), which forces the AIC/AIV split and exposes the MemRef-base mismatch on the AIC function. Without the Path 1 fix, ptoas rejects the generated kernel with "'pto.tmov' op expects a supported tmov address-space pair". With the fix, the kernel compiles and produces numerically correct results. Refs hw-native-sys#1352
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe PR elides ChangesAcc→Acc tile.move no-op elision
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses an issue where unsupported accumulator-to-accumulator moves were being generated in the backend for certain pipeline matmul patterns. The fix involves eliding these moves in the PTO operations common logic when both source and destination are in the accumulator memory space. The PR also includes a new runtime regression test and a codegen unit test to verify the fix. Feedback was provided regarding the use of the PyPTO DSL in the new runtime test, specifically pointing out that pl.create_tile should be used for in-core scratch space instead of pl.create_tensor, and that global tensors should be explicitly loaded into tiles before being used in pl.matmul operations within device-side blocks.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/backend/common/pto_ops_common.cpp`:
- Around line 2197-2204: The current aliasing branch uses both_acc and silently
aliases all Acc→Acc tile.move, which breaks distinct-buffer copy semantics;
change the condition so aliasing only occurs when the move is provably a no-op
(e.g., retain the same_address check and replace the both_acc check with a call
to an explicit predicate like IsProvenNoOpTileMove(op) or inspect a specific op
attribute), and for other Acc→Acc cases do not call
codegen.SetCurrentResultBuf(codegen.GetExprAsCode(op->args_[0]))—instead emit
the normal copy lowering or return an explicit rejection/error for unsupported
distinct-buffer Acc→Acc copies; refer to same_address, both_acc,
ir::MemorySpace::Acc, tile.move, codegen.SetCurrentResultBuf, and op->args_[0]
when making the change.
In `@tests/ut/codegen/test_pto_codegen_ops.py`:
- Around line 1723-1784: The test only generates MLIR for the first InCore
variant, so when split=UP_DOWN it may miss the AIC kernel that contains the
acc→acc pto.tmov; update the test (test_pipeline_matmul_acc_no_acc_to_acc_tmov)
to iterate and codegen all split InCore variants produced by _generate_mlir (or
explicitly select the AIC variant) by checking ir.is_incore_type() for each
function and asserting _has_acc_to_acc_tmov on the AIC InCore output instead of
relying on the first function only.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d59032ee-5de7-4a54-8f94-4c745f9741d6
📒 Files selected for processing (3)
src/backend/common/pto_ops_common.cpptests/st/runtime/test_pipeline_matmul_acc.pytests/ut/codegen/test_pto_codegen_ops.py
- ruff-format tests/st/runtime/test_pipeline_matmul_acc.py - Clarify out_tile comment: it is a DDR staging tensor, not an InCore tile - Harden TestTileMoveAccNoopElision._generate_mlir to codegen all InCore functions (not just the first), so split-kernel variants are covered - Drop out_dtype=FP32 from pl.matmul in test_pipeline_matmul_acc_no_acc_to_acc_tmov so the Tile overload matches and pyright accepts the Mat-tile inputs
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/st/runtime/test_pipeline_matmul_acc.py`:
- Around line 57-58: Update the docstrings to match the implemented computation
and output dtype: replace references that say "gate_result + up_result" and
"FP32 output" with wording that the operation is elementwise multiplication
"(gate * up)" and that the result is stored as BF16 (e.g., "gate_result (BF16) *
up_result (BF16) stored as BF16"); apply the same change to both docstring
locations referenced (the earlier block around the first occurrence and the
second block at lines ~80-84) so the documentation accurately reflects the math
and dtype used in the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 0ada6f57-b84e-4f0e-b3ff-aeb9a7a0846e
📒 Files selected for processing (2)
tests/st/runtime/test_pipeline_matmul_acc.pytests/ut/codegen/test_pto_codegen_ops.py
Summary
Add a runtime regression test reproducing the
gate_up_silupattern from the Qwen3-32B decode kernel which triggered theacc→acc pto.tmovbug on Ascend 910B (#1352).The test uses the exact prolog-then-pipeline pattern with two independent accumulators (
gate_acc,up_acc) underpl.at(CORE_GROUP, split=UP_DOWN), which forces the AIC/AIV split and exposes the MemRef-base mismatch on the AIC function — the root condition that produces invalidacc→acc pto.tmovops.This is the runtime-level regression test for Path 1 (codegen-level elision) of issue #1352. The codegen-level fix and a codegen-only regression test are bundled in this branch (the fix had not yet been pushed upstream).
Changes
tests/st/runtime/test_pipeline_matmul_acc.py— new runtime regression test reproducing the gate/up dual-accumulator pipeline kernel that triggered ptoas error:'pto.tmov' op expects a supported tmov address-space pair.Testing
ptoasrejects acc→accpto.tmov).src/backend/common/pto_ops_common.cpp: kernel compiles and produces numerically correct results (rtol=1e-3, atol=1e-3).Related Issues
Refs #1352 (Path 1 — short-term codegen-level fix).