feat(ir): Add tile.comm_notify, tile.comm_wait, tile.comm_test cross-rank signal ops #1301

Open
Little-oil wants to merge 6 commits into hw-native-sys:main from Little-oil:add_notify

@Little-oil Little-oil commented May 7, 2026

Summary

Add three cross-rank signaling ops on AIV that wrap PTOAS's pto.comm.* custom dialect:

  • pl.tile.comm_notify(signal, value, *, op) — write or atomic-add an INT32 value into a remote rank's signal slot. op ∈ {"set", "atomic_add"}. Lowers to pto.comm.tnotify.
  • pl.tile.comm_wait(signal, cmp_value, *, cmp) — block until a local INT32 signal slot satisfies a comparison. cmp ∈ {"eq","ne","gt","ge","lt","le"}. Lowers to pto.comm.twait.
  • pl.tile.comm_test(signal, cmp_value, *, cmp) — non-blocking poll of the same comparison; returns a pl.Scalar(BOOL). Lowers to pto.comm.ttest (returns i1).

signal is a 1-element INT32 pl.Tensor viewing a slot in the rank's HCCL window (typically obtained via import_peer_buffer). value / cmp_value accepts Python int, pl.Scalar, or pl.Expr and is normalised to INT32.
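The semantics of the three ops can be modelled on the host with a plain Python object. This is only an illustrative sketch: `SignalSlot` is a hypothetical stand-in for the 1-element INT32 signal tensor, not part of pypto, and it ignores INT32 overflow and the blocking behaviour of `comm_wait` (only the comparison that `comm_wait` and `comm_test` share is modelled).

```python
import operator

# Comparison names accepted by comm_wait / comm_test, per the PR description.
_CMPS = {
    "eq": operator.eq, "ne": operator.ne,
    "gt": operator.gt, "ge": operator.ge,
    "lt": operator.lt, "le": operator.le,
}


class SignalSlot:
    """Hypothetical host-side model of a single INT32 signal slot."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def notify(self, value: int, *, op: str) -> None:
        """Model of comm_notify: overwrite the slot or add to it."""
        if op == "set":
            self.value = value
        elif op == "atomic_add":
            self.value += value
        else:
            raise ValueError(f"unsupported op: {op!r}")

    def test(self, cmp_value: int, *, cmp: str) -> bool:
        """Model of comm_test: evaluate the comparison once, without blocking."""
        if cmp not in _CMPS:
            raise ValueError(f"unsupported cmp: {cmp!r}")
        return _CMPS[cmp](self.value, cmp_value)


slot = SignalSlot()
slot.notify(3, op="atomic_add")
slot.notify(2, op="atomic_add")
ready = slot.test(5, cmp="ge")   # slot holds 5, so the comparison is satisfied
slot.notify(1, op="set")         # "set" overwrites rather than accumulates
```

In this model, `comm_wait` would simply repeat `test` until it returns true.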

Mirrors the two real usage patterns from simpler's ep_dispatch_combine kernels (count exchange via atomic_add, done barrier via atomic_add + wait ge).
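The done-barrier pattern (atomic_add followed by wait ge) can be sketched with host threads standing in for ranks. This is an analogy under stated assumptions, not the kernel code: `SharedSignal` and `run_done_barrier` are hypothetical names, and the per-rank signal slots of the real pattern are collapsed into one shared counter.

```python
import threading


class SharedSignal:
    """Hypothetical shared counter standing in for a signal slot."""

    def __init__(self) -> None:
        self._value = 0
        self._cond = threading.Condition()

    def atomic_add(self, delta: int) -> None:
        # Analogue of comm_notify(..., op="atomic_add").
        with self._cond:
            self._value += delta
            self._cond.notify_all()

    def wait_ge(self, threshold: int, timeout: float = 5.0) -> bool:
        # Analogue of comm_wait(..., cmp="ge"); the timeout only guards the sketch.
        with self._cond:
            return self._cond.wait_for(lambda: self._value >= threshold, timeout=timeout)

    @property
    def value(self) -> int:
        with self._cond:
            return self._value


def run_done_barrier(num_ranks: int):
    """Each simulated rank signals 'done', then waits until all ranks have signalled."""
    done = SharedSignal()
    passed = []
    passed_lock = threading.Lock()

    def rank(rank_id: int) -> None:
        done.atomic_add(1)
        if done.wait_ge(num_ranks):
            with passed_lock:
                passed.append(rank_id)

    threads = [threading.Thread(target=rank, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done.value, sorted(passed)
```

No rank passes the barrier until the counter has reached `num_ranks`, which is the property the "done barrier" ST program is described as exercising.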

Layers updated

  • C++ op registrations + IR-level validators — src/ir/op/tile_ops/cross_core.cpp
  • PTO codegen (signal lowered to !pto.partition_tensor_view<Nxi32>, generic MLIR op syntax) — src/backend/common/pto_ops_common.cpp
  • Python IR wrappers — python/pypto/ir/op/tile_ops.py
  • DSL wrappers (re-exported via pl.tile.*) — python/pypto/language/op/system_ops.py, python/pypto/language/op/tile_ops.py
  • UT for IR + PTO codegen — tests/ut/ir/operators/test_tile_ops.py, tests/ut/codegen/test_pto_codegen_ops.py
  • ST loopback (count exchange + done barrier + wait-only) — tests/st/runtime/test_notify_wait.py
  • Bilingual docs — docs/en/dev/ir/05-operators.md, docs/zh-cn/dev/ir/05-operators.md
  • PTOAS bumped to 0.37 (exposes the pto.comm.tnotify/twait/ttest custom ops) — .github/workflows/ci.yml

Key design / fix notes

  • Naming: comm_* prefix groups these under cross-rank signaling, parallel to the pto.comm.* MLIR namespace.
  • Codegen: signal Var is lowered through make_tensor_view → partition_view to produce a !pto.partition_tensor_view<Nxi32> covering the full signal shape. Assembly uses generic MLIR op syntax ("pto.comm.tnotify"(%sig, %v) {...} : (...) -> ()) because PTOAS defines no custom assemblyFormat for these ops.
  • Validation: f_deduce_type enforces 1-element INT32 signal + INT32 scalar at IR construction, so misuse fails at @pl.program decoration time.
  • Orchestrator pattern: side-effect kernels must bind the call (signal = self.kernel(signal); return signal); a bare return self.kernel(...) is silently dropped (KERNELS=[]).
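To make the codegen note concrete, the generic MLIR form quoted above can be assembled with ordinary string formatting. This is a sketch, not the actual C++ codegen: `emit_tnotify` is a hypothetical helper, and the attribute spelling `notifyOp = #pto.notify_op<...>` is taken from the commit message in this PR and may differ in the final implementation.

```python
def emit_tnotify(signal_ssa: str, value_ssa: str, num_elems: int, op: str) -> str:
    """Format a pto.comm.tnotify op in generic MLIR syntax (illustrative only)."""
    if op not in ("set", "atomic_add"):
        raise ValueError(f"unsupported notify op: {op!r}")
    # The signal operand is a partition view over the full signal shape.
    sig_type = f"!pto.partition_tensor_view<{num_elems}xi32>"
    # Attribute spelling assumed from the commit message in this PR.
    attrs = f"{{notifyOp = #pto.notify_op<{op}>}}"
    # Generic syntax: quoted op name, parenthesised operands, attribute dict,
    # then a function type with an empty result list.
    return f'"pto.comm.tnotify"({signal_ssa}, {value_ssa}) {attrs} : ({sig_type}, i32) -> ()'


line = emit_tnotify("%sig", "%v", 1, "set")
```

The quoted op name and the explicit `(operand types) -> (result types)` signature are what distinguish the generic form from a custom assembly format.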

Testing

  • UT (22/22): TestTileCommNotifyOp, TestTileCommWaitOp, TestTileCommTestOp, TestTileNotifyPtoCodegen, TestTileWaitPtoCodegen, TestTileTestPtoCodegen.
  • ST tests/st/runtime/test_notify_wait.py — gated on PTOAS_HAS_COMM_NOTIFY_WAIT=1 so infra without the upgraded PTOAS skips cleanly. Three programs: CountExchangeProgram (atomic_add), WaitOnlyProgram (wait ge), DoneBarrierProgram (notify + wait combined).
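A gate of this kind might look roughly like the following. The helper name `ptoas_has_comm_notify_wait` is hypothetical; only the environment variable name and the value `1` come from the PR description.

```python
import os


def ptoas_has_comm_notify_wait(environ=None) -> bool:
    """Return True only when the opt-in variable is set to exactly "1"."""
    env = os.environ if environ is None else environ
    return env.get("PTOAS_HAS_COMM_NOTIFY_WAIT") == "1"


# In a pytest module the gate could be applied to every test at once, e.g.:
#
#   import pytest
#   pytestmark = pytest.mark.skipif(
#       not ptoas_has_comm_notify_wait(),
#       reason="PTOAS build without pto.comm.tnotify/twait/ttest",
#   )
```

Making the gate opt-in means a missing or misspelled variable skips the suite instead of failing on infrastructure that lacks the upgraded PTOAS.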

Notes

  • Lint/format/pyright pre-commit hooks pass.
  • No new global state, no IR design changes — additive op registrations following the existing CrossCoreOp pattern (mirror of tpush_* / tpop_* / tfree_*).

coderabbitai Bot commented May 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds tile.comm_notify and tile.comm_wait: IR ops, IR-level Python bindings, language DSL wrappers, PTO backend lowering to pto.comm.tnotify/twait, English/Chinese docs, and unit + codegen + runtime tests validating INT32 signal semantics and attributes.

Changes

Tile Signal Operations (tile.comm_notify / tile.comm_wait)

| Layer | File(s) | Summary |
|---|---|---|
| IR Operation Registration | src/ir/op/tile_ops/cross_core.cpp | Registers tile.comm_notify and tile.comm_wait as CrossCoreOp with INT32 signal operands and string attributes (op for notify, cmp for wait). |
| Python IR Bindings | python/pypto/ir/op/tile_ops.py | Adds _NOTIFY_OPS/_WAIT_CMPS and implements comm_notify(signal, value, *, op, span) and comm_wait(signal, cmp_value, *, cmp, span) with validation, span capture, and IR Call emission. |
| Language DSL (system_ops) | python/pypto/language/op/system_ops.py | Adds public comm_notify / comm_wait that normalize Python int/Scalar/Expr to INT32 Expr (using ConstInt when needed) and forward to IR-layer functions. |
| Module Exports (tile_ops) | python/pypto/language/op/tile_ops.py | Imports/forwards new IR-level symbols and adds comm_notify and comm_wait to __all__ to expose pl.tile.comm_notify/pl.tile.comm_wait. |
| PTO Backend Codegen | src/backend/common/pto_ops_common.cpp | Adds partition-view helper, SSA/type parsing helper, INT32 signal validation, MakeTileNotifyCodegenPTO and MakeTileWaitCodegenPTO to emit pto.comm.tnotify/pto.comm.twait with enum attributes; registers handlers. |
| Documentation | docs/en/dev/ir/05-operators.md, docs/zh-cn/dev/ir/05-operators.md | Adds "Cross-Rank Signal Operations" sections documenting signatures, INT32 signal-slot semantics, lowering targets (TNOTIFY/TWAIT), pipeline ordering note, and examples. |
| Unit & Codegen Tests | tests/ut/ir/operators/test_tile_ops.py, tests/ut/codegen/test_pto_codegen_ops.py | IR unit tests for call construction and kwarg printing; PTO codegen tests asserting pto.comm.tnotify/pto.comm.twait emission, enum attributes, and INT32-signal rejection tests. |
| Runtime Tests | tests/st/runtime/test_notify_wait.py | On-device loopback tests (single-rank) exercising notify (atomic_add/set) and wait (eq/ne/gt/ge/lt/le) patterns and asserting final INT32 signal slot contents. |

Sequence Diagram(s)

sequenceDiagram
  participant Program as User Program
  participant Lang as language.system_ops
  participant IR as ir.tile_ops
  participant PTO as PTO Codegen
  participant GM as Device GM
  Program->>Lang: comm_notify(signal, value, op)
  Lang->>IR: normalized signal/value -> tile.comm_notify Call
  IR->>PTO: tile.comm_notify
  PTO->>GM: emit pto.comm.tnotify (partition_view, i32 value, notifyOp)
  Program->>Lang: comm_wait(signal, cmp_value, cmp)
  Lang->>IR: normalized cmp_value -> tile.comm_wait Call
  IR->>PTO: tile.comm_wait
  PTO->>GM: emit pto.comm.twait (partition_view, i32 cmp, cmp attr)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hw-native-sys/pypto#267: Related PTO codegen/partition_view changes and partition_tensor_view naming; likely intersects backend lowering infrastructure.
  • hw-native-sys/pypto#1312: Also modifies PTO ops registration and codegen handlers; may overlap in RegisterPTOOps edits.

Suggested reviewers

  • Hzfengsy
  • lyfne123

Poem

🐰
I nudge the slot, I add, I set,
Across the ranks my whispers get,
A wait that watches, eyes so bright,
Until the signal says "alright".
— hops, notifies, sleeps with delight

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 48.08% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The PR title accurately describes the main changes: two new cross-rank signal operations (tile.comm_notify and tile.comm_wait) are added to the IR layer. |
| Description check | ✅ Passed | The pull request description clearly outlines the three new cross-rank signaling operations (tile.comm_notify and tile.comm_wait being primary; tile.comm_test mentioned), their signatures, behavior, and implementation across multiple layers (C++, Python, tests, docs). |


@Little-oil (Contributor, Author)

Waiting for the new PTOAS version.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces cross-rank signal operations, specifically tile.notify and tile.wait, to support synchronization between different ranks. The changes include documentation in both English and Chinese, Python IR and language-level API definitions, C++ backend codegen for PTO operations, and comprehensive unit and system tests. The feedback suggests refactoring the argument conversion logic in python/pypto/language/op/system_ops.py into a shared helper function to improve maintainability and ensure consistent validation of IntLike arguments across both operations.

Comment thread python/pypto/language/op/system_ops.py Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/ut/codegen/test_pto_codegen_ops.py (1)

1608-1673: 💤 Low value

TestTileNotifyPtoCodegen — LGTM with one optional improvement

The three tests cover the key paths (set, atomic_add, bad dtype). One gap worth noting: there is no rejection test for an unsupported op string (e.g. op="invalid"). If input validation is enforced at the IR construction layer rather than codegen, add the equivalent test to tests/ut/ir/operators/test_tile_ops.py; if it's enforced in codegen, a small pytest.raises case here would complete the contract coverage.

Optionally, test_tile_notify_set_codegen and test_tile_notify_atomic_add_codegen can be collapsed into a single @pytest.mark.parametrize("op,attr", [("set", "set"), ("atomic_add", "atomic_add")]) test to reduce duplication.
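The contract the reviewer asks to cover (accept the supported op strings, reject anything else) can be sketched without any pypto dependency. `NOTIFY_OPS`, `check_notify_op`, and `rejects` are hypothetical names for illustration; where the real validation lives is exactly the open question raised in this comment. A plain loop is used here in place of `@pytest.mark.parametrize` so the snippet stays dependency-free.

```python
# Supported op strings, per the PR description.
NOTIFY_OPS = ("set", "atomic_add")


def check_notify_op(op: str) -> str:
    """Return op unchanged if supported, otherwise raise ValueError."""
    if op not in NOTIFY_OPS:
        raise ValueError(f"comm_notify: op must be one of {NOTIFY_OPS}, got {op!r}")
    return op


def rejects(op: str) -> bool:
    """True when check_notify_op refuses the given op string."""
    try:
        check_notify_op(op)
    except ValueError:
        return True
    return False


# Table-driven positive cases, mirroring what a parametrized test would cover.
accepted = [check_notify_op(op) for op in NOTIFY_OPS]
# Negative case: an unsupported string must be refused.
invalid_rejected = rejects("invalid")
```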

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/ut/codegen/test_pto_codegen_ops.py` around lines 1608 - 1673, Add a
rejection test for unsupported op strings so tile.notify validates op values:
add a new test (e.g. in TestTileNotifyPtoCodegen or in
tests/ut/ir/operators/test_tile_ops.py depending on where validation lives) that
constructs a program using pl.tile.notify(signal, 1, op="invalid") and asserts
it raises (pytest.raises) with an appropriate message; reference the existing
helper _generate_mlir and the test names
test_tile_notify_set_codegen/test_tile_notify_atomic_add_codegen to locate
similar test patterns and mirror their structure (or convert the two positive
tests into a single parametric `@pytest.mark.parametrize` if you prefer to reduce
duplication).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/ir/op/tile_ops/cross_core.cpp`:
- Around line 90-114: Add IR-level operand validation to the REGISTER_OP
declarations for "tile.notify" and "tile.wait": implement .f_validate handlers
that check the "signal" operand is an INT32 tensor with exactly one element
(shape == 1) and that the secondary operand ("value" for tile.notify,
"cmp_value" for tile.wait) is an INT32 scalar; emit a clear validation error
when these conditions fail so invalid uses fail during IR construction rather
than backend lowering. Ensure the validators reference the op names
("tile.notify", "tile.wait") and the operand names ("signal", "value",
"cmp_value") so reviewers can locate the checks.

In `@tests/st/runtime/test_notify_wait.py`:
- Around line 268-297: The test suite unconditionally exercises PTOAS-only APIs
pto.comm.tnotify / pto.comm.twait causing infra-driven failures when PTOAS is
not present; modify the TestNotifyWait tests to be skipped when the capability
is absent by checking the PTOAS capability at import/runtime (e.g., a helper
like has_ptoas_capability() or checking pto.comm for tnotify/twait) and applying
pytest.skip or pytest.mark.skipif to the whole TestNotifyWait class or
individual test methods (referencing TestNotifyWait, test_notify_* methods, and
pto.comm.tnotify/twait) so the suite only runs when those APIs are available.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2a67f8d0-4681-494b-8c02-d82da2702789

📥 Commits

Reviewing files that changed from the base of the PR and between 774775d and de39e6c.

📒 Files selected for processing (10)
  • docs/en/dev/ir/05-operators.md
  • docs/zh-cn/dev/ir/05-operators.md
  • python/pypto/ir/op/tile_ops.py
  • python/pypto/language/op/system_ops.py
  • python/pypto/language/op/tile_ops.py
  • src/backend/common/pto_ops_common.cpp
  • src/ir/op/tile_ops/cross_core.cpp
  • tests/st/runtime/test_notify_wait.py
  • tests/ut/codegen/test_pto_codegen_ops.py
  • tests/ut/ir/operators/test_tile_ops.py

Comment thread src/ir/op/tile_ops/cross_core.cpp Outdated
Comment thread tests/st/runtime/test_notify_wait.py Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/backend/common/pto_ops_common.cpp`:
- Around line 1924-1937: CheckCommSignalType currently only rejects rank-0
tensors but must enforce the single-slot contract: verify the tensor contains
exactly one element or reject statically-known non-singleton shapes before
lowering. In CheckCommSignalType (and using span/op_name for diagnostics) keep
the rank>=1 check, then inspect signal_tensor_type->shape_: if all extents are
statically-known, compute the product and REQUIRE it equals 1 (emit a clear
CHECK/INTERNAL_CHECK_SPAN failure referencing op_name and the shape); if any
extent is dynamic/unknown, allow it (since it could be singleton at runtime) but
still reject any statically-known extent >1 early. Return the same
signal_tensor_type on success.
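The shape rule described in this comment can be written out as a small Python function. This is a sketch of the logic only, not the C++ `CheckCommSignalType`: the function name `check_signal_shape` is hypothetical, and `None` is used here as an assumed encoding for a dynamic (statically unknown) extent.

```python
from math import prod
from typing import Optional, Sequence


def check_signal_shape(shape: Sequence[Optional[int]], op_name: str) -> None:
    """Reject shapes that cannot be a single-slot signal; None marks a dynamic extent."""
    if len(shape) < 1:
        raise ValueError(f"{op_name}: signal must have rank >= 1, got shape {list(shape)}")
    # Any statically known extent greater than 1 can never yield one element.
    for extent in shape:
        if extent is not None and extent > 1:
            raise ValueError(f"{op_name}: signal must hold one element, got shape {list(shape)}")
    # With a fully static shape, the element count must be exactly 1.
    if all(extent is not None for extent in shape) and prod(shape) != 1:
        raise ValueError(f"{op_name}: signal must hold one element, got shape {list(shape)}")


def is_accepted(shape: Sequence[Optional[int]]) -> bool:
    """Convenience wrapper: True when the shape passes the check."""
    try:
        check_signal_shape(shape, "tile.comm_notify")
    except ValueError:
        return False
    return True
```

Under these rules `[1]`, `[1, 1]`, and `[1, None]` pass, while `[]`, `[4]`, `[None, 2]`, and `[0]` are rejected.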

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9b35e237-2d6f-4661-af36-5784836ec7cb

📥 Commits

Reviewing files that changed from the base of the PR and between de39e6c and 96f2d6d.

📒 Files selected for processing (10)
  • docs/en/dev/ir/05-operators.md
  • docs/zh-cn/dev/ir/05-operators.md
  • python/pypto/ir/op/tile_ops.py
  • python/pypto/language/op/system_ops.py
  • python/pypto/language/op/tile_ops.py
  • src/backend/common/pto_ops_common.cpp
  • src/ir/op/tile_ops/cross_core.cpp
  • tests/st/runtime/test_notify_wait.py
  • tests/ut/codegen/test_pto_codegen_ops.py
  • tests/ut/ir/operators/test_tile_ops.py
✅ Files skipped from review due to trivial changes (2)
  • docs/en/dev/ir/05-operators.md
  • docs/zh-cn/dev/ir/05-operators.md

Comment thread src/backend/common/pto_ops_common.cpp Outdated
@Little-oil Little-oil changed the title from "feat(ir): Add tile.notify and tile.wait cross-rank signal ops" to "feat(ir): Add tile.notify, tile.wait, tile.test cross-rank signal ops" on May 12, 2026
Little-oil pushed a commit to Little-oil/pypto that referenced this pull request May 12, 2026
- system_ops: extract _value_to_int32_expr helper shared by comm_notify/
  comm_wait/comm_test; rewrap INDEX ConstInt as INT32 so the literal int
  case satisfies the IR contract (gemini-code-assist)
- ir/cross_core: add f_deduce_type validators for tile.comm_notify/
  comm_wait/comm_test enforcing 1-element INT32 signal + INT32 scalar
  value at IR construction (coderabbitai)
- backend/pto_ops_common: extend CheckCommSignalType to reject
  statically-known non-singleton signal shapes before PTO lowering
  (coderabbitai)
- tests/st: gate test_notify_wait suite on PTOAS_HAS_COMM_NOTIFY_WAIT=1
  env var so infra without the staged PTOAS build skips cleanly
  (coderabbitai)
- tests/ut: relocate pytest.raises around the @pl.program class body in
  the reject_non_int32_signal cases, since the new IR-level validators
  now fire during decoration instead of during codegen
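The normalisation step in the first bullet (a literal int arrives as an INDEX constant and is rewrapped as INT32) might be modelled as follows. Everything here is a hypothetical stand-in: `ConstInt` is a toy dataclass rather than the pypto IR node, `value_to_int32` is not the real `_value_to_int32_expr`, and the INT32 range check is an assumption of this sketch, not something the commit message states.

```python
from dataclasses import dataclass

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


@dataclass(frozen=True)
class ConstInt:
    """Toy integer constant carrying a dtype tag (hypothetical stand-in)."""
    value: int
    dtype: str


def value_to_int32(value) -> ConstInt:
    """Normalise a Python int or ConstInt to an INT32-typed ConstInt."""
    if isinstance(value, bool) or not isinstance(value, (int, ConstInt)):
        raise TypeError(f"expected int or ConstInt, got {type(value).__name__}")
    if isinstance(value, int):
        # Model: a bare literal first becomes an INDEX-typed constant.
        value = ConstInt(value, "INDEX")
    if not INT32_MIN <= value.value <= INT32_MAX:
        # Range check is an assumption of this sketch.
        raise ValueError(f"{value.value} does not fit in INT32")
    if value.dtype == "INDEX":
        # Rewrap so the literal-int case satisfies the INT32 contract.
        return ConstInt(value.value, "INT32")
    if value.dtype != "INT32":
        raise TypeError(f"expected INT32 value, got {value.dtype}")
    return value
```

Sharing one helper across the three wrappers keeps the accepted input kinds and the resulting dtype consistent between `comm_notify`, `comm_wait`, and `comm_test`.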
@Little-oil Little-oil changed the title from "feat(ir): Add tile.notify, tile.wait, tile.test cross-rank signal ops" to "feat(ir): Add tile.comm_notify, tile.comm_wait, tile.comm_test cross-rank signal ops" on May 12, 2026
Youhezhen added 5 commits May 12, 2026 18:47
Introduces a pair of cross-rank signaling operations on AIV:

- tile.notify(signal, value, op): write or atomic-add an INT32 value to
  a remote rank's signal slot (1-element INT32 GM tensor). Lowers to
  pto.comm.tnotify with notifyOp = #pto.notify_op<set|atomic_add>.
- tile.wait(signal, cmp_value, cmp): block until a local INT32 signal
  slot satisfies a comparison. Lowers to pto.comm.twait with
  cmp = #pto.wait_cmp<eq|ne|gt|ge|lt|le>.

All six layers updated:
- C++ op registrations in src/ir/op/tile_ops/cross_core.cpp
- PTO codegen in src/backend/common/pto_ops_common.cpp
- Python IR wrappers in python/pypto/ir/op/tile_ops.py
- DSL wrappers in python/pypto/language/op/system_ops.py with
  re-export through python/pypto/language/op/tile_ops.py
- Tests: UT for IR + PTO codegen, ST loopback covering all six cmp
  variants and both notify ops
- Docs: Cross-Rank Signal Operations sections in
  docs/en/dev/ir/05-operators.md and docs/zh-cn/dev/ir/05-operators.md

Note: pto.comm.tnotify / pto.comm.twait require a PTOAS build that
exposes those custom ops; the on-board ST will only run on a PTOAS
that has the comm dialect enabled.
…ps to tile.comm_{notify,wait}

The previous codegen for tile.notify/tile.wait was broken — PTOAS rejected
the emitted MLIR. Two bugs:

1. Wrong operand type. Codegen emitted the signal as !pto.ptr<i32> (or a
   raw tensor_view), but pto.comm.tnotify / pto.comm.twait require
   !pto.partition_tensor_view<Nxi32>. Fix: lower the signal Var through
   make_tensor_view → partition_view to build a partition view covering
   the full signal shape.

2. Wrong assembly syntax. Codegen used the custom format
   "pto.comm.tnotify %sig, %v {...} : <type>, i32", but PTOAS's TNotifyOp /
   TWaitOp have no custom assemblyFormat — only generic MLIR op syntax is
   accepted. Fix: emit "pto.comm.tnotify"(%sig, %v) {...} : (<type>, i32) -> ().

Also rename the ops from tile.notify/tile.wait to tile.comm_notify/tile.comm_wait
for namespace consistency with the pto.comm.* MLIR ops and to keep cross-rank
signaling ops grouped under a comm_* prefix.

ST tests reshaped to mirror the two real usage patterns from simpler's
ep_dispatch_combine kernels (count exchange via atomic_add, done barrier
via atomic_add + wait ge), instead of exhaustively covering every cmp op.
…y/wait ST

Add `tile.comm_test` (non-blocking signal check, returns i1) alongside the
existing `tile.comm_notify`/`tile.comm_wait` cross-rank signal ops. Emits
`pto.comm.ttest(... : !pto.partition_tensor_view<Nxi32>, i32) {cmp = ...} -> i1`
using PTOAS custom assembly syntax.

Also fix `tests/st/runtime/test_notify_wait.py` orchestrators: replaced
`return self.kernel(signal)` with `signal = self.kernel(signal); return signal`
so the kernel call lands in an AssignStmt and gets emitted (previously
KERNELS=[] and the signal stayed at its init value). Add a dedicated
wait-only ST case to isolate the twait codegen path.
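The syntactic difference between the two orchestrator forms can be seen with Python's standard `ast` module. This only illustrates which statement kind holds the kernel call; how the pypto parser treats each kind is as described in the commit message, not something this sketch demonstrates. The function `kernel_call_contexts` and the two source strings are hypothetical.

```python
import ast

# Orchestrator body that returns the kernel call directly.
BARE = """
def orch(self, signal):
    return self.kernel(signal)
"""

# Orchestrator body that binds the call first, then returns the name.
BOUND = """
def orch(self, signal):
    signal = self.kernel(signal)
    return signal
"""


def kernel_call_contexts(source: str, kernel_attr: str = "kernel") -> list:
    """Return the statement kinds whose value is a call to <obj>.<kernel_attr>(...)."""
    kinds = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.stmt):
            continue
        value = getattr(node, "value", None)
        if (
            isinstance(value, ast.Call)
            and isinstance(value.func, ast.Attribute)
            and value.func.attr == kernel_attr
        ):
            kinds.append(type(node).__name__)
    return kinds


bare_kinds = kernel_call_contexts(BARE)    # call sits inside a Return statement
bound_kinds = kernel_call_contexts(BOUND)  # call sits inside an Assign statement
```

In the bound form the call is the right-hand side of an assignment, which matches the commit's description of the call landing in an AssignStmt.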