feat(distributed): lift pld.tensor.put staging tile into IR for level3 by YunjiQin · Pull Request #1548 · hw-native-sys/pypto

YunjiQin · 2026-05-27T02:11:33Z

Summary

Lift the VEC staging tile that pld.tensor.put bounces through from inline codegen synthesis into PyPTO IR, so it flows through InitMemRef + AllocateMemoryAddr and carries a real UB address (the `addr =` operand on pto.alloc_tile) before backend codegen. This is the contract PTOAS requires at --pto-level=level3.
Mirrors the existing tensor.transpose -> tile.create + tile.transpose pattern: ConvertTensorToTileOps emits tile.create + new pld.tile.put (4 args, last is the staging tile); MakePutCodegenPTO now reads the stage SSA from args[3] instead of calling AllocNewTileBuf. The DSL surface (pld.tensor.put / pld.put) is unchanged.
Drive-by refactor: register @CommRemoteOffset_<dtype> helpers lazily from EmitCommRemoteView instead of via a separate pre-walk. Helpers are now emitted at module end, so any new op that routes peer addressing through EmitCommRemoteView gets the matching helper without extra wiring.
TestL3Put stays under @pytest.mark.skip: the reason is re-targeted from the now-resolved N6/N7 plumbing gap to an open PTOAS issue where the stage-in tile.store is not ordered before pld.tensor.put, so the put can issue before the local window slice has been written.
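
The rewrite described in the summary can be pictured with a toy model. This is a minimal sketch, not the PyPTO pass: `Call` and `lower_put` are hypothetical stand-ins, the operand names `dst`, `src` and `peer` are assumptions about the 3-arg DSL form, and only the op names, the `tput_stage` name, and the "stage is args[3]" convention come from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Call:
    """Hypothetical stand-in for an IR call node (not the PyPTO class)."""
    op: str
    args: tuple
    result: Optional[str] = None
    attrs: dict = field(default_factory=dict)


def lower_put(call, stage_shape, stage_dtype):
    """Rewrite one tensor-level put into tile.create + pld.tile.put.

    Any other op is passed through unchanged.
    """
    if call.op != "pld.tensor.put":
        return [call]
    # The staging tile becomes an ordinary IR value, so later passes
    # (memory allocation in the real pipeline) can see and address it.
    create = Call(
        "tile.create",
        (),
        result="tput_stage",
        attrs={"shape": stage_shape, "dtype": stage_dtype, "memory_space": "Vec"},
    )
    # The original operands are kept; the stage is appended as args[3].
    put = Call("pld.tile.put", call.args + ("tput_stage",), attrs=dict(call.attrs))
    return [create, put]


lowered = lower_put(Call("pld.tensor.put", ("dst", "src", "peer")), [16, 64], "FP16")
for c in lowered:
    print(c.op, c.args, c.result)
```

The point of the shape is that codegen no longer has to invent a buffer: it only reads the fourth operand, which already exists as a value that earlier passes could annotate.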

Before / after

Before — ring_step.pto from the user's local L3 run had no address on the staging tile:

```
%tput_stage = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=64, ...>
```

After:

```
%tput_stage = pto.alloc_tile addr = %c0_i64 valid_row = %c1_index valid_col = %c64_index
: !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=64, v_row=?, v_col=?, ...>
```

Test plan

tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py — new test_put_emits_tile_create_plus_tile_put verifies the conversion shape: pld.tensor.put gone, tile.create + pld.tile.put present, stage Var is named tput_stage, shape [16, 64], FP16, MemorySpace.Vec
tests/ut/codegen/distributed/test_distributed_pto_codegen.py::test_put_emits_comm_tput_with_attr_and_staging_tile — strengthened to assert addr = on the pto.alloc_tile … tput_stage line (the whole point of the lift)
Full distributed put / convert suites pass locally (156 passed, 1 pre-existing skip)
clang-tidy on diff clean for changed C++ files
cmake --build build --parallel green
TestL3Put end-to-end will exercise the PTOAS store -> put ordering once that lands; tracked separately
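
The strengthened codegen assertion boils down to a text check on the emitted module. A minimal sketch of that idea, run against the before/after lines quoted above; `stage_alloc_has_addr` is a hypothetical helper, not the function used in test_distributed_pto_codegen.py, and the regex is an assumption about how one might match the operand.

```python
import re

# The two lines quoted under "Before / after" in this description.
BEFORE = "%tput_stage = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=64, ...>"
AFTER = (
    "%tput_stage = pto.alloc_tile addr = %c0_i64 valid_row = %c1_index valid_col = %c64_index\n"
    ": !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=64, v_row=?, v_col=?, ...>"
)


def stage_alloc_has_addr(text):
    """Return True if the tput_stage alloc_tile line carries an `addr =` operand."""
    for line in text.splitlines():
        if "pto.alloc_tile" in line and "tput_stage" in line:
            # Require an SSA value after `addr =`, e.g. `addr = %c0_i64`.
            return re.search(r"\baddr\s*=\s*%\w+", line) is not None
    return False


print(stage_alloc_has_addr(BEFORE))
print(stage_alloc_has_addr(AFTER))
```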

Out of scope

pld.tensor.get has the symmetric problem (tget_stage synthesised inline) and will get the same treatment in a follow-up PR.
tile.textract extract_buf is the third in-codegen AllocNewTileBuf call site; also follow-up.

The VEC staging tile that pld.tensor.put bounces through is now materialised in IR by ConvertTensorToTileOps (tile.create + new pld.tile.put op) instead of being synthesised inline at codegen. That lets InitMemRef and AllocateMemoryAddr assign a real UB address before backend codegen runs -- required when PTOAS is invoked at --pto-level=level3, which expects PyPTO to allocate every tile address itself. Mirrors the tensor.transpose -> tile.create + tile.transpose pattern. The DSL surface (pld.tensor.put / pld.put, 3 args) is unchanged; pld.tile.put is created only by the conversion pass.

coderabbitai · 2026-05-27T02:11:46Z

Walkthrough

This PR refactors distributed put operation handling by introducing lazy remote offset helper registration, adding a tile-level pld.tile.put IR operation, and implementing an automatic conversion from tensor-level to tile-level put with synthesized staging tiles. Backend codegen and Python DSL wrappers are updated accordingly.

Changes

Distributed Put Refactoring: Lazy Helper Registration and Tile-Level Operation

| Layer / File(s) | Summary |
| --- | --- |
| Lazy remote offset helper registration API: `include/pypto/codegen/pto/pto_codegen.h`, `src/codegen/pto/pto_codegen.cpp` | PTOCodegen replaces eager IR-walk collection with lazy dtype registration during op lowering via `RegisterCommRemoteOffsetHelper`, moving helper emission to module end after all user functions to enable forward references. |
| Tile-level put operation definition and validation: `src/ir/op/distributed/put.cpp`, `include/pypto/ir/transforms/op_conversion_registry.h` | New IR op `pld.tile.put` with validation for matching dtypes, element counts, and staging tile type; `pld.tensor.put` description updated to document conversion through ConvertTensorToTileOps. |
| Tensor-to-tile conversion with staging tile synthesis: `src/ir/transforms/op_conversion_registry.cpp` | RegisterDistributedOps converts `pld.tensor.put` by flattening N-D shape to 2D `[rows, cols]`, allocating a Vec staging tile via `tile.create`, and emitting `pld.tile.put` with the synthesized stage operand. |
| Backend codegen for tile.put and helper registration: `src/backend/common/pto_ops_common.cpp` | Backend lowers `pld.tile.put` to `pto.comm.tput` using explicit staging tile operand; EmitCommRemoteView registers helpers lazily; operation registration targets tile.put instead of tensor.put. |
| Python IR layer builder for tile.put: `python/pypto/ir/op/distributed/tile_ops.py` | IR builder normalizes peer to INT32, captures optional span, sets atomic attribute, and constructs `pld.tile.put` IR call; module exports include put. |
| Python language layer wrappers and documentation: `python/pypto/language/distributed/op/tile_ops.py`, `python/pypto/language/distributed/op/tensor_ops.py` | Language-level `put` wrapper validates window-bound distributed tensors and delegates to IR builder; `pld.tensor.put` docstring describes full conversion pipeline including staging tile allocation and lowering. |
| Unit and system test updates: `tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py`, `tests/ut/codegen/distributed/test_distributed_pto_codegen.py`, `tests/st/distributed/test_l3_put.py` | Conversion test verifies tensor.put lowers to tile.create plus tile.put; codegen test asserts explicit address assignment on staging tile; system test skip reason updated. |
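
The walkthrough mentions flattening an N-D shape to 2D `[rows, cols]` but does not state the rule. The sketch below assumes the common convention of keeping the last dimension as `cols` and folding all leading dimensions into `rows`; that rule and the helper name `flatten_to_2d` are assumptions, not taken from op_conversion_registry.cpp.

```python
from math import prod


def flatten_to_2d(shape):
    """Collapse an N-D static shape into (rows, cols).

    Assumed rule: cols is the innermost dimension, rows is the product
    of every leading dimension (1 when the shape is 1-D).
    """
    if not shape:
        raise ValueError("shape must have at least one dimension")
    rows = prod(shape[:-1])  # prod(()) == 1, so a 1-D shape gives rows == 1
    cols = shape[-1]
    return rows, cols


print(flatten_to_2d([2, 8, 64]))
print(flatten_to_2d([64]))
```

Under this assumed rule the element count is preserved, which is consistent with the element-count validation the table lists for `pld.tile.put`.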

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

hw-native-sys/pypto#1442: Introduces the original pld.tensor.put path and pto.comm.tput codegen behavior that this PR refactors to use a tile-level op with explicit staging tile.
hw-native-sys/pypto#1453: Modifies EmitCommRemoteView and remote CommRemoteOffset_* helper emission, which is directly affected by the lazy registration refactoring in this PR.

Suggested labels

enhancement

Suggested reviewers

lyfne123

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(distributed): lift pld.tensor.put staging tile into IR for level3' directly and clearly summarizes the primary change: moving VEC staging tile synthesis from codegen into PyPTO IR to support PTOAS level3 contracts. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00%, which meets the required threshold of 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description clearly relates to the changeset, detailing the lift of VEC staging tile, the new pld.tile.put IR op, lazy registration of helpers, and corresponding test updates. |


gemini-code-assist

Code Review

This pull request refactors the distributed put operation by splitting pld.tensor.put into a pre-allocated staging tile (tile.create) and a post-conversion pld.tile.put call. This allows the staging tile to flow through PyPTO's memory allocator, which is required at --pto-level=level3. Additionally, it refactors the codegen to lazily register and emit @CommRemoteOffset_<dtype> helpers at the end of the module. The reviewer suggested using the INTERNAL_CHECK_SPAN macro instead of INTERNAL_CHECK in src/ir/op/distributed/put.cpp to preserve source location information in error messages, which is a valid improvement that aligns with project conventions.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/pypto/language/distributed/op/tensor_ops.py`:
- Around line 131-139: Update the module-level header describing "put" to match
the new lowering flow: change the text that currently claims the staging tile is
synthesized in codegen and that put is tensor-level so it instead states that
ConvertTensorToTileOps allocates the VEC staging tile (tile.create) and that the
lowering emits a pld.tile.put (tile-level) paired with the GM-to-GM TGET;
mention the staging tile flows through the allocator and backend emits
CommRemoteOffset+addptr+make_tensor_view+partition_view+TPUT as described in
ConvertTensorToTileOps and the function docstring.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between f8dbde9 and 6eefe0e.

📒 Files selected for processing (12)

include/pypto/codegen/pto/pto_codegen.h
include/pypto/ir/transforms/op_conversion_registry.h
python/pypto/ir/op/distributed/tile_ops.py
python/pypto/language/distributed/op/tensor_ops.py
python/pypto/language/distributed/op/tile_ops.py
src/backend/common/pto_ops_common.cpp
src/codegen/pto/pto_codegen.cpp
src/ir/op/distributed/put.cpp
src/ir/transforms/op_conversion_registry.cpp
tests/st/distributed/test_l3_put.py
tests/ut/codegen/distributed/test_distributed_pto_codegen.py
tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py

…sites

Replace the pre-walk CollectRemoteOffsetDtypes with RegisterCommRemoteOffsetHelper called from EmitCommRemoteView -- any op that routes peer addressing through that helper gets its @CommRemoteOffset_<dtype> definition emitted automatically, no codegen-side opt-in. Helpers move to module end so future call sites can forward-reference them (MLIR resolves symbols whole-module).

Also re-targets the TestL3Put @pytest.mark.skip reason from the now-resolved N6/N7 plumbing gap to the open PTOAS issue where the store -> put pair loses its synchronisation (the put can issue before the local window slice has been written).
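
The lazy registration described in this commit message can be sketched with a toy emitter. Everything here is illustrative: `ToyCodegen` and its snake_case methods are hypothetical Python stand-ins for the C++ PTOCodegen functions, and the emitted strings are placeholders rather than real PTO or MLIR text. Only the helper naming scheme and the "register on use, emit at module end" ordering come from the message above.

```python
class ToyCodegen:
    """Hypothetical emitter that registers helpers on first use."""

    def __init__(self):
        self._body = []
        self._helper_dtypes = []  # insertion-ordered, deduplicated

    def register_comm_remote_offset_helper(self, dtype):
        # Idempotent: a dtype used by many call sites is recorded once.
        if dtype not in self._helper_dtypes:
            self._helper_dtypes.append(dtype)
        return f"@CommRemoteOffset_{dtype}"

    def emit_comm_remote_view(self, dtype):
        # Registration happens as a side effect of lowering, so no
        # separate pre-walk over the IR is needed.
        helper = self.register_comm_remote_offset_helper(dtype)
        self._body.append(f"call {helper}")

    def finish(self):
        # Helper definitions go after all user code; call sites refer to
        # them by symbol name, which is resolved over the whole module.
        helpers = [f"func @CommRemoteOffset_{d}" for d in self._helper_dtypes]
        return self._body + helpers


cg = ToyCodegen()
cg.emit_comm_remote_view("f32")
cg.emit_comm_remote_view("f32")
cg.emit_comm_remote_view("f16")
module = cg.finish()
print("\n".join(module))
```

With this shape, a new op only has to route through `emit_comm_remote_view` to get its helper, which is the "no opt-in" property the message describes.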

- use INTERNAL_CHECK_SPAN in DeducePutTileType so dst/stage-shape invariant failures emit IR source location (args[0]/args[3] are already non-null at the failure point)
- update tensor_ops module header for put: reflect ConvertTensorToTileOps lowering to tile.create + pld.tile.put (was still describing the pre-RFC codegen-time staging tile)

github-project-automation Bot added this to pto project May 27, 2026

gemini-code-assist Bot reviewed May 27, 2026

Comment thread src/ir/op/distributed/put.cpp Outdated

YunjiQin changed the title ~~lift pld.tensor.put staging tile into IR for PTOAS level3~~ feat(distributed): lift pld.tensor.put staging tile into IR for level3 May 27, 2026

coderabbitai Bot reviewed May 27, 2026

Comment thread python/pypto/language/distributed/op/tensor_ops.py

YunjiQin added 2 commits May 27, 2026 15:34

YunjiQin force-pushed the fix/tput branch from 6eefe0e to 2dacfc8 Compare May 27, 2026 07:34

lyfne123 approved these changes May 27, 2026

lyfne123 merged commit f79ba00 into hw-native-sys:main May 27, 2026
9 checks passed

lyfne123 merged 3 commits into hw-native-sys:main from YunjiQin:fix/tput
