Skip to content

fix(ir): chain Acc -> Mat -> Left/Right tile.move in InferTileMemoryS…#1460

Draft
ChristosMatzoros wants to merge 1 commit into
hw-native-sys:mainfrom
ChristosMatzoros:fix/infer-tile-memory-space-acc-mat-hop
Draft

fix(ir): chain Acc -> Mat -> Left/Right tile.move in InferTileMemoryS…#1460
ChristosMatzoros wants to merge 1 commit into
hw-native-sys:mainfrom
ChristosMatzoros:fix/infer-tile-memory-space-acc-mat-hop

Conversation

@ChristosMatzoros
Copy link
Copy Markdown

Summary

InsertMoveStmt in InferTileMemorySpace did not consult the SoC memory graph
when inserting a tile.move between a producer and its constrained consumer.
When a cube op's output (placed in Acc per the matmul op's
set_output_memory(MemorySpace::Acc)) was consumed by another cube op that
required the value in Left or Right, the pass emitted a single direct
tile.move(producer, target=Left/Right). PTOAS's TMovOp::verify then
rejects the resulting pto.tmov at the codegen verifier:

error: 'pto.tmov' op expects a supported tmov address-space pair for this target

This PR teaches InsertMoveStmt to emit the legal two-hop sequence
Acc -> Mat -> Left/Right for the rejected pair, so each individual hop is
in mem_graph and accepted by PTOAS.

Cause

src/backend/common/soc.cpp declares for a2a3:

mem_graph[MemorySpace::Acc] = {MemorySpace::Mat, MemorySpace::DDR};
mem_graph[MemorySpace::Mat] = {MemorySpace::Left, MemorySpace::Right};

i.e. Acc has no direct edge to Left or Right. PTOAS enforces the same set in
its okPair table (AccToMat, AccToVec, MatToTile, VecToVec). The two
declarations agree on the hardware constraint; the pass simply ignored both.

Any pattern where a cube op's output feeds a subsequent cube op as Left or
Right triggers this. The minimal repro is the same one captured in the
test:

ab = pl.matmul(a, b)    # ab is placed in Acc per matmul's set_output_memory
abc = pl.matmul(ab, c)  # second matmul needs `ab` in Left — direct Acc->Left
                        # is not a legal tmov pair on a2a3

Fix

In InsertMoveStmt (src/ir/transforms/infer_tile_memory_space_pass.cpp),
detect the Acc -> {Left, Right} pair and emit two tile.move ops chained
through Mat. Intermediate hops carry no blayout/slayout override; only
the final hop carries the consumer's required layout so consumers see the
expected tile shape/layout.

Why narrow, not universal

The natural-looking alternative is "call Backend::FindMemPath for every
move and emit the path." That CHECK-fails on benign pairs that aren't in the
graph but are accepted by codegen (e.g. Mat -> Vec for pre-pop staging).
mem_graph models physical TMOV edges only — it's necessary but not
sufficient as a model of all permitted moves. This PR therefore handles only
the specific pair PTOAS actually rejects today.

Alternative locations considered

I'd be happy to relocate the rewrite if you prefer a different layer:

  • a dedicated LegalizeMemoryMoves pass running after InferTileMemorySpace,
  • an op-level input-memory constraint on tile.move to force the IR not to
    produce the bad pair in the first place,
  • making mem_graph a complete model of codegen-supported pairs and using
    Backend::FindMemPath universally.

These are bigger architectural changes; I picked the minimal-delta fix at the
site where the offending tile.move is emitted.

Related work

The pypto3 team has been steadily fixing the family of "pypto3 emits tmov
pairs that PTOAS rejects":

This PR continues that pattern at the same IR layer for the remaining
Acc→Left/Right case — the team's stated "Path 2" preference in #1352.

Test plan

Added TestAutoMoveInsertion::test_matmul_auto_moves_acc_through_mat_to_left
in tests/ut/ir/transforms/test_infer_tile_memory_space.py. Uses the
Before/Expected @pl.program pattern and ir.assert_structural_equal per
the conventions in testing-and-examples.md and the existing tests in this
file. Asserts that the inserted moves form the Acc -> Mat -> Left chain.

Verified locally:

  • tests/ut/ir/transforms/test_infer_tile_memory_space.py — all tests in
    this file pass (29 pre-existing + 1 new).
  • Full tests/ut/ run vs the same suite on a clean upstream/main checkout
    in my local environment: identical pre-existing failure/skip sets. No
    failure introduced by this patch; +1 test pass for the new regression
    case. (My local env has a handful of environment-specific failures unrelated
    to this pass — they reproduce on upstream/main without this patch and are
    green on CI.)

Files changed

  • src/ir/transforms/infer_tile_memory_space_pass.cpp — refactor
    InsertMoveStmt to factor out an emit_one_move helper, add the
    Acc → Mat → {Left, Right} chain when needed.
  • tests/ut/ir/transforms/test_infer_tile_memory_space.py — new test.

No header / ODS / Python-binding changes. No documentation changes (the pass
behaviour described in docs/en/dev/passes/17-infer_tile_memory_space.md
remains accurate — this PR is a constraint enforcement, not a new feature).

…pace

`InsertMoveStmt` in `InferTileMemorySpace` did not consult the SoC memory
graph when inserting a `tile.move` between a producer and its constrained
consumer.  When a cube op's output (placed in `Acc` per the matmul op's
`set_output_memory(MemorySpace::Acc)`) was consumed by another cube op
that required the value in `Left` or `Right`, the pass emitted a single
direct `tile.move(producer, target=Left/Right)`.  PTOAS's `TMovOp::verify`
then rejects the resulting `pto.tmov` at the codegen verifier with:

    'pto.tmov' op expects a supported tmov address-space pair for this target

`backend/common/soc.cpp` declares for a2a3:

    mem_graph[Acc] = {Mat, DDR}
    mem_graph[Mat] = {Left, Right}

i.e. Acc has no direct edge to Left or Right.  PTOAS enforces the same set
in its `okPair` table.  The two declarations agreed; the pass simply
ignored them.

This patch detects the `Acc -> Left/Right` pair in `InsertMoveStmt` and
emits two `tile.move` ops chained through `Mat`, so each individual hop
is in `mem_graph` and accepted by PTOAS.  Intermediate hops carry no
blayout/slayout override; only the final hop carries the consumer's
required layout.

The rewrite is narrow on purpose.  `mem_graph` models only physical TMOV
edges, but codegen accepts several pairs not in the graph (e.g.
`Mat -> Vec` for pre-pop staging).  A universal `Backend::FindMemPath`
BFS would CHECK-fail on those benign pairs, so we only insert the
intermediate hop for the specific pair PTOAS actually rejects.

Test: new `TestAutoMoveInsertion::test_matmul_auto_moves_acc_through_mat_to_left`
in `tests/ut/ir/transforms/test_infer_tile_memory_space.py` uses the
Before/Expected pattern with `ir.assert_structural_equal` to verify the
inserted moves form the `Acc -> Mat -> Left` chain.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5b089517-4116-4182-ac17-20d2ce3519ab

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the TileMemorySpaceMutator to handle hardware-constrained memory moves by inserting intermediate hops, such as routing Acc through Mat before reaching Left or Right spaces. The implementation uses a helper lambda to emit move operations and includes a new unit test to verify the multi-hop logic. Feedback suggests simplifying the move emission logic by consolidating the multi-hop and direct move paths into a single loop and final call for better maintainability.

Comment on lines +746 to +763
VarPtr final_moved_var;
if (path.size() > 2) {
// Multi-hop: emit one intermediate tile.move per hop. Intermediate hops
// get no blayout/slayout overrides (use default for the intermediate
// memory space); only the FINAL hop carries the consumer's required
// layout so that the consumer sees the expected tile shape/layout.
ExprPtr cur = mutated_producer;
// path[0] == src_space, path.back() == target; iterate hops [1..size).
for (size_t i = 1; i + 1 < path.size(); ++i) {
auto inter_var = emit_one_move(cur, path[i], std::nullopt, std::nullopt);
cur = inter_var;
}
final_moved_var = emit_one_move(cur, target, required_blayout, required_slayout);
} else {
// Direct move (path.empty() means direct edge or src == target; treat
// both as single-hop fall-through to the original behaviour).
final_moved_var = emit_one_move(mutated_producer, target, required_blayout, required_slayout);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for handling multi-hop and direct moves can be simplified by removing the explicit if/else block. Since cur is initialized to mutated_producer and the loop condition i + 1 < path.size() naturally handles empty or short paths, a single sequence of operations is sufficient and more maintainable.

    ExprPtr cur = mutated_producer;
    // path[0] == src_space, path.back() == target; iterate hops [1..size-1).
    for (size_t i = 1; i + 1 < path.size(); ++i) {
      cur = emit_one_move(cur, path[i], std::nullopt, std::nullopt);
    }
    VarPtr final_moved_var = emit_one_move(cur, target, required_blayout, required_slayout);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant