feat(distributed): lift pld.tensor.put staging tile into IR for level3#1548

Merged
lyfne123 merged 3 commits into hw-native-sys:main from YunjiQin:fix/tput
May 27, 2026

@YunjiQin YunjiQin commented May 27, 2026

Summary

  • Lift the VEC staging tile that pld.tensor.put bounces through from inline codegen synthesis into PyPTO IR, so it flows through InitMemRef + AllocateMemoryAddr and carries a real UB address (addr = on its pto.alloc_tile) before backend codegen. This is the contract PTOAS requires at --pto-level=level3.
  • Mirrors the existing tensor.transpose -> tile.create + tile.transpose pattern: ConvertTensorToTileOps emits tile.create + new pld.tile.put (4 args, last is the staging tile); MakePutCodegenPTO now reads the stage SSA from args[3] instead of calling AllocNewTileBuf. The DSL surface (pld.tensor.put / pld.put) is unchanged.
  • Drive-by refactor: register @CommRemoteOffset_<dtype> helpers lazily from EmitCommRemoteView instead of via a separate pre-walk. Helpers are now emitted at module end, so any new op that routes peer addressing through EmitCommRemoteView gets the matching helper without extra wiring.
  • TestL3Put stays under @pytest.mark.skip: the reason is re-targeted from the now-resolved N6/N7 plumbing gap to an open PTOAS issue where the stage-in tile.store is not ordered before pld.tensor.put, so the put can issue before the local window slice has been written.
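
The lazy helper registration in the third bullet can be pictured with a toy model. This is only a sketch under stated assumptions: `ToyModuleEmitter`, its method names, and the `call` / `func` output lines are hypothetical illustrations, not the real PTOCodegen API or real PTO syntax. Only the `@CommRemoteOffset_<dtype>` naming comes from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModuleEmitter:
    """Toy stand-in for a codegen object; not the real PTOCodegen API."""

    body: list = field(default_factory=list)
    helper_dtypes: list = field(default_factory=list)  # first-seen order, no duplicates

    def register_remote_offset_helper(self, dtype: str) -> str:
        """Record dtype on first use and return the helper symbol name."""
        if dtype not in self.helper_dtypes:
            self.helper_dtypes.append(dtype)
        return f"@CommRemoteOffset_{dtype}"

    def emit_comm_remote_view(self, dtype: str, peer: str) -> None:
        """Lower one peer-addressing site; registration is a side effect."""
        symbol = self.register_remote_offset_helper(dtype)
        self.body.append(f"call {symbol}({peer})")

    def finish(self) -> str:
        """Emit user code first, then one helper definition per registered dtype."""
        helpers = [f"func @CommRemoteOffset_{d}" for d in self.helper_dtypes]
        return "\n".join(self.body + helpers)


emitter = ToyModuleEmitter()
emitter.emit_comm_remote_view("f32", "%peer0")
emitter.emit_comm_remote_view("f32", "%peer1")  # same dtype: no second helper
emitter.emit_comm_remote_view("f16", "%peer0")
module_text = emitter.finish()
```

Because registration happens inside the lowering call, a new op that goes through the same entry point needs no separate pre-walk to get its helper.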

Before / after

Before — ring_step.pto from the user's local L3 run had no address on the staging tile:

```
%tput_stage = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=64, ...>
```

After:

```
%tput_stage = pto.alloc_tile addr = %c0_i64 valid_row = %c1_index valid_col = %c64_index
: !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=64, v_row=?, v_col=?, ...>
```
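
A check along these lines can tell the two outputs apart. It is a minimal sketch, assuming the `addr =` operand always sits on the same line as `pto.alloc_tile` (as in the snippet above). The helper name `alloc_tiles_missing_addr` is hypothetical and this is not the assertion used in the actual unit test.

```python
import re

# Matches "%name = pto.alloc_tile <rest of line>".
ALLOC_RE = re.compile(r"^\s*(%\w+)\s*=\s*pto\.alloc_tile\b(.*)$")


def alloc_tiles_missing_addr(pto_text: str) -> list:
    """Return the SSA names of pto.alloc_tile results that lack an 'addr =' operand."""
    missing = []
    for line in pto_text.splitlines():
        match = ALLOC_RE.match(line)
        if match and not re.search(r"\baddr\s*=", match.group(2)):
            missing.append(match.group(1))
    return missing


before = "%tput_stage = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=64, ...>"
after = (
    "%tput_stage = pto.alloc_tile addr = %c0_i64 valid_row = %c1_index valid_col = %c64_index\n"
    ": !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=64, v_row=?, v_col=?, ...>"
)
```

With these inputs, `alloc_tiles_missing_addr(before)` reports `%tput_stage`, while `alloc_tiles_missing_addr(after)` reports nothing.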

Test plan

  • tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py — new test_put_emits_tile_create_plus_tile_put verifies the conversion shape: pld.tensor.put gone, tile.create + pld.tile.put present, stage Var is named tput_stage, shape [16, 64], FP16, MemorySpace.Vec
  • tests/ut/codegen/distributed/test_distributed_pto_codegen.py::test_put_emits_comm_tput_with_attr_and_staging_tile — strengthened to assert addr = on the pto.alloc_tile … tput_stage line (the whole point of the lift)
  • Full distributed put / convert suites pass locally (156 passed, 1 pre-existing skip)
  • clang-tidy on diff clean for changed C++ files
  • cmake --build build --parallel green
  • TestL3Put end-to-end will exercise the PTOAS store -> put ordering once that lands; tracked separately

Out of scope

  • pld.tensor.get has the symmetric problem (tget_stage synthesised inline) and will get the same treatment in a follow-up PR.
  • tile.textract extract_buf is the third in-codegen AllocNewTileBuf call site; also follow-up.

The VEC staging tile that pld.tensor.put bounces through is now materialised
in IR by ConvertTensorToTileOps (tile.create + new pld.tile.put op) instead
of being synthesised inline at codegen. That lets InitMemRef and
AllocateMemoryAddr assign a real UB address before backend codegen runs --
required when PTOAS is invoked at --pto-level=level3, which expects PyPTO
to allocate every tile address itself.

Mirrors the tensor.transpose -> tile.create + tile.transpose pattern. The
DSL surface (pld.tensor.put / pld.put, 3 args) is unchanged; pld.tile.put
is created only by the conversion pass.
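
The shape of the conversion can be sketched with a toy IR. This is an illustration only: the `Call` record, `convert_put`, and the operand names `dst`, `src`, `peer` are hypothetical placeholders. Only the op names, the 3-arg / 4-arg split, the `tput_stage` name, and the Vec memory space are taken from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Call:
    """Toy IR call: op name, positional args, optional result name."""

    op: str
    args: tuple
    result: Optional[str] = None


def convert_put(call: Call, stage_shape: tuple, dtype: str) -> list:
    """Rewrite a 3-arg pld.tensor.put into tile.create + 4-arg pld.tile.put."""
    if call.op != "pld.tensor.put":
        return [call]  # other ops pass through untouched
    if len(call.args) != 3:
        raise ValueError("pld.tensor.put is expected to carry 3 args")
    # The staging tile becomes an ordinary IR value, visible to later passes.
    stage = Call("tile.create", (stage_shape, dtype, "Vec"), result="tput_stage")
    # The staging tile is appended as the last operand (args[3]).
    put = Call("pld.tile.put", call.args + (stage.result,))
    return [stage, put]


lowered = convert_put(Call("pld.tensor.put", ("dst", "src", "peer")), (16, 64), "FP16")
```

Since the staging tile is now a normal IR value, passes that run after the conversion can assign it an address like any other tile.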

coderabbitai Bot commented May 27, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1464bd1c-937d-48a8-a345-05483f17a58b

Walkthrough

This PR refactors distributed put operation handling by introducing lazy remote offset helper registration, adding a tile-level pld.tile.put IR operation, and implementing an automatic conversion from tensor-level to tile-level put with synthesized staging tiles. Backend codegen and Python DSL wrappers are updated accordingly.

Changes

Distributed Put Refactoring: Lazy Helper Registration and Tile-Level Operation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Lazy remote offset helper registration API | include/pypto/codegen/pto/pto_codegen.h, src/codegen/pto/pto_codegen.cpp | PTOCodegen replaces eager IR-walk collection with lazy dtype registration during op lowering via RegisterCommRemoteOffsetHelper, moving helper emission to module end after all user functions to enable forward references. |
| Tile-level put operation definition and validation | src/ir/op/distributed/put.cpp, include/pypto/ir/transforms/op_conversion_registry.h | New IR op pld.tile.put with validation for matching dtypes, element counts, and staging tile type; pld.tensor.put description updated to document conversion through ConvertTensorToTileOps. |
| Tensor-to-tile conversion with staging tile synthesis | src/ir/transforms/op_conversion_registry.cpp | RegisterDistributedOps converts pld.tensor.put by flattening N-D shape to 2D [rows, cols], allocating a Vec staging tile via tile.create, and emitting pld.tile.put with the synthesized stage operand. |
| Backend codegen for tile.put and helper registration | src/backend/common/pto_ops_common.cpp | Backend lowers pld.tile.put to pto.comm.tput using explicit staging tile operand; EmitCommRemoteView registers helpers lazily; operation registration targets tile.put instead of tensor.put. |
| Python IR layer builder for tile.put | python/pypto/ir/op/distributed/tile_ops.py | IR builder normalizes peer to INT32, captures optional span, sets atomic attribute, and constructs pld.tile.put IR call; module exports include put. |
| Python language layer wrappers and documentation | python/pypto/language/distributed/op/tile_ops.py, python/pypto/language/distributed/op/tensor_ops.py | Language-level put wrapper validates window-bound distributed tensors and delegates to IR builder; pld.tensor.put docstring describes full conversion pipeline including staging tile allocation and lowering. |
| Unit and system test updates | tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py, tests/ut/codegen/distributed/test_distributed_pto_codegen.py, tests/st/distributed/test_l3_put.py | Conversion test verifies tensor.put lowers to tile.create plus tile.put; codegen test asserts explicit address assignment on staging tile; system test skip reason updated. |
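
The shape flattening and the tile-level validation summarized above can be sketched as follows. This rests on assumptions the summary does not confirm: that flattening folds all leading dimensions into `rows` and keeps the last dimension as `cols`, and that the dtype and element-count checks compare the destination with the staging tile. Both function names are hypothetical.

```python
from math import prod


def flatten_to_2d(shape: list) -> list:
    """Fold leading dims into rows and keep the last dim as cols (assumed rule)."""
    if not shape:
        raise ValueError("shape must have at least one dimension")
    *leading, cols = shape
    return [prod(leading), cols]  # prod([]) == 1, so a 1-D shape becomes [1, cols]


def check_tile_put(dst_dtype: str, dst_shape: list,
                   stage_dtype: str, stage_shape: list, stage_space: str) -> None:
    """Raise if the staging tile cannot carry the destination's data."""
    if stage_space != "Vec":
        raise TypeError(f"staging tile must live in Vec, got {stage_space}")
    if dst_dtype != stage_dtype:
        raise TypeError(f"dtype mismatch: {dst_dtype} vs {stage_dtype}")
    if prod(dst_shape) != prod(stage_shape):
        raise ValueError("element count mismatch between dst and staging tile")


stage_shape = flatten_to_2d([2, 8, 64])
check_tile_put("FP16", [2, 8, 64], "FP16", stage_shape, "Vec")  # passes silently
```

Under the assumed rule, a `[2, 8, 64]` tensor is staged through a `[16, 64]` tile and a 1-D `[64]` tensor through a `[1, 64]` tile.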

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hw-native-sys/pypto#1442: Introduces the original pld.tensor.put path and pto.comm.tput codegen behavior that this PR refactors to use a tile-level op with explicit staging tile.
  • hw-native-sys/pypto#1453: Modifies EmitCommRemoteView and remote CommRemoteOffset_* helper emission, which is directly affected by the lazy registration refactoring in this PR.

Suggested labels

enhancement

Suggested reviewers

  • lyfne123

Poem

🐰 From eager walks to lazy calls,
A staging tile now answers all,
Register helpers as we go—
Let tensors split, and tile-ops flow! 🎭✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(distributed): lift pld.tensor.put staging tile into IR for level3' directly and clearly summarizes the primary change—moving VEC staging tile synthesis from codegen into PyPTO IR to support PTOAS level3 contracts. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description clearly relates to the changeset, detailing the lift of VEC staging tile, the new pld.tile.put IR op, lazy registration of helpers, and corresponding test updates. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the distributed put operation by splitting pld.tensor.put into a pre-allocated staging tile (tile.create) and a post-conversion pld.tile.put call. This allows the staging tile to flow through PyPTO's memory allocator, which is required at --pto-level=level3. Additionally, it refactors the codegen to lazily register and emit @CommRemoteOffset_<dtype> helpers at the end of the module. The reviewer suggested using the INTERNAL_CHECK_SPAN macro instead of INTERNAL_CHECK in src/ir/op/distributed/put.cpp to preserve source location information in error messages, which is a valid improvement that aligns with project conventions.

Comment thread src/ir/op/distributed/put.cpp Outdated
@YunjiQin YunjiQin changed the title from "lift pld.tensor.put staging tile into IR for PTOAS level3" to "feat(distributed): lift pld.tensor.put staging tile into IR for level3" on May 27, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/pypto/language/distributed/op/tensor_ops.py`:
- Around line 131-139: Update the module-level header describing "put" to match
the new lowering flow: change the text that currently claims the staging tile is
synthesized in codegen and that put is tensor-level so it instead states that
ConvertTensorToTileOps allocates the VEC staging tile (tile.create) and that the
lowering emits a pld.tile.put (tile-level) paired with the GM-to-GM TGET;
mention the staging tile flows through the allocator and backend emits
CommRemoteOffset+addptr+make_tensor_view+partition_view+TPUT as described in
ConvertTensorToTileOps and the function docstring.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a68f1b8a-5586-48c2-968b-e7e719988b1a

📥 Commits

Reviewing files that changed from the base of the PR and between f8dbde9 and 6eefe0e.

📒 Files selected for processing (12)
  • include/pypto/codegen/pto/pto_codegen.h
  • include/pypto/ir/transforms/op_conversion_registry.h
  • python/pypto/ir/op/distributed/tile_ops.py
  • python/pypto/language/distributed/op/tensor_ops.py
  • python/pypto/language/distributed/op/tile_ops.py
  • src/backend/common/pto_ops_common.cpp
  • src/codegen/pto/pto_codegen.cpp
  • src/ir/op/distributed/put.cpp
  • src/ir/transforms/op_conversion_registry.cpp
  • tests/st/distributed/test_l3_put.py
  • tests/ut/codegen/distributed/test_distributed_pto_codegen.py
  • tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py

Comment thread python/pypto/language/distributed/op/tensor_ops.py
YunjiQin added 2 commits May 27, 2026 15:34
…sites

Replace the pre-walk CollectRemoteOffsetDtypes with
RegisterCommRemoteOffsetHelper called from EmitCommRemoteView -- any op
that routes peer addressing through that helper gets its
@CommRemoteOffset_<dtype> definition emitted automatically, no codegen-
side opt-in. Helpers move to module end so future call sites can
forward-reference them (MLIR resolves symbols whole-module).

Also re-targets the TestL3Put @pytest.mark.skip reason from the now-
resolved N6/N7 plumbing gap to the open PTOAS issue where the store ->
put pair loses its synchronisation (the put can issue before the local
window slice has been written).
- use INTERNAL_CHECK_SPAN in DeducePutTileType so dst/stage-shape
  invariant failures emit IR source location (args[0]/args[3] are
  already non-null at the failure point)
- update tensor_ops module header for put: reflect ConvertTensorToTileOps
  lowering to tile.create + pld.tile.put (was still describing the
  pre-RFC codegen-time staging tile)
@lyfne123 lyfne123 merged commit f79ba00 into hw-native-sys:main May 27, 2026
9 checks passed

2 participants