Skip to content

test(runtime): Add runtime regression test for #1352 Path 1#1355

Open
luohuan19 wants to merge 5 commits into
hw-native-sys:mainfrom
luohuan19:test/pipeline-matmul-acc-1352
Open

test(runtime): Add runtime regression test for #1352 Path 1#1355
luohuan19 wants to merge 5 commits into
hw-native-sys:mainfrom
luohuan19:test/pipeline-matmul-acc-1352

Conversation

@luohuan19
Copy link
Copy Markdown
Contributor

Summary

Add a runtime regression test reproducing the gate_up_silu pattern from the Qwen3-32B decode kernel which triggered the acc→acc pto.tmov bug on Ascend 910B (#1352).

The test uses the exact prolog-then-pipeline pattern with two independent accumulators (gate_acc, up_acc) under pl.at(CORE_GROUP, split=UP_DOWN), which forces the AIC/AIV split and exposes the MemRef-base mismatch on the AIC function — the root condition that produces invalid acc→acc pto.tmov ops.

This is the runtime-level regression test for Path 1 (codegen-level elision) of issue #1352. The codegen-level fix and a codegen-only regression test are bundled in this branch (the fix had not yet been pushed upstream).

Changes

  • tests/st/runtime/test_pipeline_matmul_acc.py — new runtime regression test reproducing the gate/up dual-accumulator pipeline kernel that triggered ptoas error: 'pto.tmov' op expects a supported tmov address-space pair.

Testing

  • Without the Path 1 fix: kernel fails to compile (ptoas rejects acc→acc pto.tmov).
  • With the Path 1 fix in src/backend/common/pto_ops_common.cpp: kernel compiles and produces numerically correct results (rtol=1e-3, atol=1e-3).

Related Issues

Refs #1352 (Path 1 — short-term codegen-level fix).

luohuan19 and others added 3 commits May 13, 2026 11:49
…w-native-sys#1352)

Ascend 910B has no hardware path to copy data between disjoint L0C
(accumulator) addresses; ptoas rejects any such pto.tmov with:

  'pto.tmov' op expects a supported tmov address-space pair for this target

These acc→acc tile.move ops are always semantic no-ops: they arise when
MemoryReuse fails to unify MemRef bases across nested pipelined
matmul_acc loops — YieldFixupMutator inserts them as fixups, but
AllocateMemoryAddr then assigns different physical addresses to the
un-unified bases, producing an acc@0 → acc@16384 → acc@0 round-trip.

The prior fix (e4ddc66, hw-native-sys#1310) only elided same-space same-address
moves. Extend the condition to also elide when both src and dst are
MemorySpace::Acc, regardless of byte_offset.

Co-authored-by: Cursor <cursoragent@cursor.com>
…peline loops

Adds test_pipeline_matmul_acc_no_acc_to_acc_tmov to TestTileMoveAccNoopElision,
covering the pattern where pl.pipeline(stage=2) + Mat-resident matmul_acc causes
AutoTileMatmulL0 to insert inner K-loops whose _l0_c IterArg bases may not be
unified by MemoryReuse, exposing acc→acc tile.move that ptoas rejects (hw-native-sys#1352).

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a runtime regression test that reproduces the gate_up_silu pattern
from the Qwen3-32B decode kernel which triggered the acc→acc pto.tmov
bug (hw-native-sys#1352). The test exercises Path 1 of the fix (codegen-level
elision of acc→acc pto.tmov in src/backend/common/pto_ops_common.cpp).

The test uses the exact prolog-then-pipeline pattern with two
independent accumulators (gate_acc, up_acc) under
pl.at(CORE_GROUP, split=UP_DOWN), which forces the AIC/AIV split and
exposes the MemRef-base mismatch on the AIC function.

Without the Path 1 fix, ptoas rejects the generated kernel with
"'pto.tmov' op expects a supported tmov address-space pair". With the
fix, the kernel compiles and produces numerically correct results.

Refs hw-native-sys#1352
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR elides pto.tmov in tile.move when both source and destination are in the Acc memory space (in addition to same-address elision), aliases the destination SSA to the source via SetCurrentResultBuf(), and adds unit and runtime regression tests exercising pipeline matmul_acc scenarios.

Changes

Acc→Acc tile.move no-op elision

Layer / File(s) Summary
tile.move acc→acc elision implementation
src/backend/common/pto_ops_common.cpp
Expanded inline documentation and updated codegen to compute both_acc; elides pto.tmov when `same_address
Runtime dual-accumulator pipeline test
tests/st/runtime/test_pipeline_matmul_acc.py
New TestPipelineMatmulAccGateUp runtime test building gate/up pl.matmul_acc pipelines, enforcing consumption ordering with pl.add(..., 0.0), multiplying accumulators, casting to BF16, and validating output against FP32 reference.
Codegen unit test for acc→acc tmov elision
tests/ut/codegen/test_pto_codegen_ops.py
_generate_mlir updated to concatenate all InCore variants' MLIR; added test_pipeline_matmul_acc_no_acc_to_acc_tmov() which codegens a pipeline matmul_acc and asserts no acc→acc pto.tmov appears.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • lyfne123

Poem

🐰 I hopped through code with eager cheer,

Found phantom moves that disappeared,
Now acc-to-acc need not be stirred,
Aliased bytes sing without a word,
Pipelines hum, no extra copy feared.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title clearly and specifically describes the main change: adding a runtime regression test for issue #1352 Path 1, which is the primary change across the files modified.
Description check ✅ Passed The pull request description is well-detailed and directly related to the changeset, explaining the context of the gate_up_silu pattern, the test structure, and how it addresses issue #1352 Path 1.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses an issue where unsupported accumulator-to-accumulator moves were being generated in the backend for certain pipeline matmul patterns. The fix involves eliding these moves in the PTO operations common logic when both source and destination are in the accumulator memory space. The PR also includes a new runtime regression test and a codegen unit test to verify the fix. Feedback was provided regarding the use of the PyPTO DSL in the new runtime test, specifically pointing out that pl.create_tile should be used for in-core scratch space instead of pl.create_tensor, and that global tensors should be explicitly loaded into tiles before being used in pl.matmul operations within device-side blocks.

Comment thread tests/st/runtime/test_pipeline_matmul_acc.py
Comment thread tests/st/runtime/test_pipeline_matmul_acc.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/backend/common/pto_ops_common.cpp`:
- Around line 2197-2204: The current aliasing branch uses both_acc and silently
aliases all Acc→Acc tile.move, which breaks distinct-buffer copy semantics;
change the condition so aliasing only occurs when the move is provably a no-op
(e.g., retain the same_address check and replace the both_acc check with a call
to an explicit predicate like IsProvenNoOpTileMove(op) or inspect a specific op
attribute), and for other Acc→Acc cases do not call
codegen.SetCurrentResultBuf(codegen.GetExprAsCode(op->args_[0]))—instead emit
the normal copy lowering or return an explicit rejection/error for unsupported
distinct-buffer Acc→Acc copies; refer to same_address, both_acc,
ir::MemorySpace::Acc, tile.move, codegen.SetCurrentResultBuf, and op->args_[0]
when making the change.

In `@tests/ut/codegen/test_pto_codegen_ops.py`:
- Around line 1723-1784: The test only generates MLIR for the first InCore
variant, so when split=UP_DOWN it may miss the AIC kernel that contains the
acc→acc pto.tmov; update the test (test_pipeline_matmul_acc_no_acc_to_acc_tmov)
to iterate and codegen all split InCore variants produced by _generate_mlir (or
explicitly select the AIC variant) by checking ir.is_incore_type() for each
function and asserting _has_acc_to_acc_tmov on the AIC InCore output instead of
relying on the first function only.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d59032ee-5de7-4a54-8f94-4c745f9741d6

📥 Commits

Reviewing files that changed from the base of the PR and between 7da3b83 and a5741f4.

📒 Files selected for processing (3)
  • src/backend/common/pto_ops_common.cpp
  • tests/st/runtime/test_pipeline_matmul_acc.py
  • tests/ut/codegen/test_pto_codegen_ops.py

Comment thread src/backend/common/pto_ops_common.cpp
Comment thread tests/ut/codegen/test_pto_codegen_ops.py
- ruff-format tests/st/runtime/test_pipeline_matmul_acc.py
- Clarify out_tile comment: it is a DDR staging tensor, not an InCore tile
- Harden TestTileMoveAccNoopElision._generate_mlir to codegen all InCore
  functions (not just the first), so split-kernel variants are covered
- Drop out_dtype=FP32 from pl.matmul in test_pipeline_matmul_acc_no_acc_to_acc_tmov
  so the Tile overload matches and pyright accepts the Mat-tile inputs
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/st/runtime/test_pipeline_matmul_acc.py`:
- Around line 57-58: Update the docstrings to match the implemented computation
and output dtype: replace references that say "gate_result + up_result" and
"FP32 output" with wording that the operation is elementwise multiplication
"(gate * up)" and that the result is stored as BF16 (e.g., "gate_result (BF16) *
up_result (BF16) stored as BF16"); apply the same change to both docstring
locations referenced (the earlier block around the first occurrence and the
second block at lines ~80-84) so the documentation accurately reflects the math
and dtype used in the test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0ada6f57-b84e-4f0e-b3ff-aeb9a7a0846e

📥 Commits

Reviewing files that changed from the base of the PR and between a5741f4 and d7b687b.

📒 Files selected for processing (2)
  • tests/st/runtime/test_pipeline_matmul_acc.py
  • tests/ut/codegen/test_pto_codegen_ops.py

Comment thread tests/st/runtime/test_pipeline_matmul_acc.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant