feat(distributed): lift pld.tensor.put staging tile into IR for level3#1548
Conversation
The VEC staging tile that pld.tensor.put bounces through is now materialised in IR by ConvertTensorToTileOps (tile.create + new pld.tile.put op) instead of being synthesised inline at codegen. That lets InitMemRef and AllocateMemoryAddr assign a real UB address before backend codegen runs -- required when PTOAS is invoked at --pto-level=level3, which expects PyPTO to allocate every tile address itself. Mirrors the tensor.transpose -> tile.create + tile.transpose pattern. The DSL surface (pld.tensor.put / pld.put, 3 args) is unchanged; pld.tile.put is created only by the conversion pass.
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR refactors distributed put operation handling by introducing lazy remote offset helper registration, adding a tile-level ChangesDistributed Put Refactoring: Lazy Helper Registration and Tile-Level Operation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the distributed put operation by splitting pld.tensor.put into a pre-allocated staging tile (tile.create) and a post-conversion pld.tile.put call. This allows the staging tile to flow through PyPTO's memory allocator, which is required at --pto-level=level3. Additionally, it refactors the codegen to lazily register and emit @CommRemoteOffset_<dtype> helpers at the end of the module. The reviewer suggested using the INTERNAL_CHECK_SPAN macro instead of INTERNAL_CHECK in src/ir/op/distributed/put.cpp to preserve source location information in error messages, which is a valid improvement that aligns with project conventions.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/pypto/language/distributed/op/tensor_ops.py`:
- Around line 131-139: Update the module-level header describing "put" to match
the new lowering flow: change the text that currently claims the staging tile is
synthesized in codegen and that put is tensor-level so it instead states that
ConvertTensorToTileOps allocates the VEC staging tile (tile.create) and that the
lowering emits a pld.tile.put (tile-level) paired with the GM-to-GM TGET;
mention the staging tile flows through the allocator and backend emits
CommRemoteOffset+addptr+make_tensor_view+partition_view+TPUT as described in
ConvertTensorToTileOps and the function docstring.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: a68f1b8a-5586-48c2-968b-e7e719988b1a
📒 Files selected for processing (12)
include/pypto/codegen/pto/pto_codegen.hinclude/pypto/ir/transforms/op_conversion_registry.hpython/pypto/ir/op/distributed/tile_ops.pypython/pypto/language/distributed/op/tensor_ops.pypython/pypto/language/distributed/op/tile_ops.pysrc/backend/common/pto_ops_common.cppsrc/codegen/pto/pto_codegen.cppsrc/ir/op/distributed/put.cppsrc/ir/transforms/op_conversion_registry.cpptests/st/distributed/test_l3_put.pytests/ut/codegen/distributed/test_distributed_pto_codegen.pytests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py
…sites Replace the pre-walk CollectRemoteOffsetDtypes with RegisterCommRemoteOffsetHelper called from EmitCommRemoteView -- any op that routes peer addressing through that helper gets its @CommRemoteOffset_<dtype> definition emitted automatically, no codegen- side opt-in. Helpers move to module end so future call sites can forward-reference them (MLIR resolves symbols whole-module). Also re-targets the TestL3Put @pytest.mark.skip reason from the now- resolved N6/N7 plumbing gap to the open PTOAS issue where the store -> put pair loses its synchronisation (the put can issue before the local window slice has been written).
- use INTERNAL_CHECK_SPAN in DeducePutTileType so dst/stage-shape invariant failures emit IR source location (args[0]/args[3] are already non-null at the failure point) - update tensor_ops module header for put: reflect ConvertTensorToTileOps lowering to tile.create + pld.tile.put (was still describing the pre-RFC codegen-time staging tile)
Summary
pld.tensor.putbounces through from inline codegen synthesis into PyPTO IR, so it flows throughInitMemRef+AllocateMemoryAddrand carries a real UBaddr =before backend codegen. This is the contract PTOAS requires at--pto-level=level3.tensor.transpose -> tile.create + tile.transposepattern:ConvertTensorToTileOpsemitstile.create+ newpld.tile.put(4 args, last is the staging tile);MakePutCodegenPTOnow reads the stage SSA fromargs[3]instead of callingAllocNewTileBuf. The DSL surface (pld.tensor.put/pld.put) is unchanged.@CommRemoteOffset_<dtype>helpers lazily fromEmitCommRemoteViewinstead of via a separate pre-walk. Helpers are now emitted at module end, so any new op that routes peer addressing throughEmitCommRemoteViewgets the matching helper without extra wiring.TestL3Putstays under@pytest.mark.skip: the reason is re-targeted from the now-resolved N6/N7 plumbing gap to an open PTOAS issue where the stage-intile.storeis not ordered beforepld.tensor.put, so the put can issue before the local window slice has been written.Before / after
Before —
ring_step.ptofrom the user's local L3 run had no address on the staging tile:```
%tput_stage = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=64, ...>
```
After:
```
%tput_stage = pto.alloc_tile addr = %c0_i64 valid_row = %c1_index valid_col = %c64_index
: !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=64, v_row=?, v_col=?, ...>
```
Test plan
tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py— newtest_put_emits_tile_create_plus_tile_putverifies the conversion shape:pld.tensor.putgone,tile.create+pld.tile.putpresent, stage Var is namedtput_stage, shape[16, 64], FP16,MemorySpace.Vectests/ut/codegen/distributed/test_distributed_pto_codegen.py::test_put_emits_comm_tput_with_attr_and_staging_tile— strengthened to assertaddr =on thepto.alloc_tile … tput_stageline (the whole point of the lift)cmake --build build --parallelgreenTestL3Putend-to-end will exercise the PTOASstore -> putordering once that lands; tracked separatelyOut of scope
pld.tensor.gethas the symmetric problem (tget_stagesynthesised inline) and will get the same treatment in a follow-up PR.tile.textractextract_bufis the third in-codegenAllocNewTileBufcall site; also follow-up.