
fix(codegen): lower scatter_update to tile.scatter, remove orch memcpy#1537

Merged
lyfne123 merged 10 commits into
hw-native-sys:mainfrom
Little-oil:fix/issue-1490-scatter-update-tscatter
May 27, 2026
Conversation

@Little-oil Little-oil commented May 26, 2026

Summary

  • tensor.scatter_update / tile.scatter_update now lower to a whole-row tile.scatter (pto.tscatter) + select preserve-blend in ConvertTensorToTileOps, so the op runs on a device kernel and respects TensorMap producer sync.
  • Removed the orchestration-level raw memcpy path (which dereferenced tensor data pointers, racing with producer tasks) and the now-dead GetTensorDataPtr helper.
  • Dropped the scratch arg from tile.scatter_update (4→3) and deleted the textract/tinsert PTO codegen.
  • FP16 is now correct and device-verified (see FP16 root cause below).

Approach

Whole-row scatter is expressed as a per-element flat scatter: flat_idx[k, c] = index.flat[k] * d + c. The conversion builds the flat index (column arange broadcast via col_expand + an index * d row offset), scatters src into a zeroed base, and reconstructs the DPS row-preserve via a mask scatter + cmps + sel blend — mirroring tensor.scatter.
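
The flat-index expansion above can be sketched in plain Python. This is only an illustration, assuming a 2D row-major `[m, d]` destination and an index already flattened to length `n`; the helper names `flat_index` and `scatter_rows_flat` are hypothetical and are not PyPTO APIs.

```python
def flat_index(index_flat, d):
    """Build flat_idx[k][c] = index_flat[k] * d + c for every column c."""
    return [[row * d + c for c in range(d)] for row in index_flat]


def scatter_rows_flat(m, d, index_flat, src):
    """Scatter whole src rows into a zeroed, flattened [m * d] base."""
    base = [0] * (m * d)
    for k, idx_row in enumerate(flat_index(index_flat, d)):
        for c, dst in enumerate(idx_row):
            base[dst] = src[k][c]  # per-element write: dst.flat[idx[k][c]] = src[k][c]
    # View the flat buffer as [m, d] rows again.
    return [base[r * d:(r + 1) * d] for r in range(m)]


index_flat = [2, 0]
src = [[10, 11, 12], [20, 21, 22]]
out = scatter_rows_flat(m=4, d=3, index_flat=index_flat, src=src)
```

Here `src` row 0 lands in destination row 2 and `src` row 1 in destination row 0; rows 1 and 3 stay zero because nothing was scattered into them.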

The flat-index arithmetic is computed entirely in i32, and only the finished row-major [n, d] flat index is narrowed to the tscatter-required width (i16 for 2-byte data). This keeps every intermediate index tile in a canonical, 32-byte-aligned, row-major layout.

Lowering (generated PTO)

Hardware pto.tscatter writes per element (dst.flat[idx[k, c]] = src[k, c]) and treats dst as write-only, so the preserve semantics are rebuilt on the PyPTO side. Generated kernel for FP32 [32, 32] input / [2, 8] index / [16, 32] src:

| # | PTO op | Produces |
|---|--------|----------|
| 1–3 | pto.tload ×3 | input_tile, index_tile, src_tile |
| 4 | pto.tci | column arange [1, d] = 0..d-1 |
| 5 | pto.texpands | zero template [n, d] |
| 6 | pto.tcolexpand | col_nd[k, c] = c |
| 7 | pto.tmuls | row_base[k] = index.flat[k] * d |
| 8 | pto.trowexpandadd | flat_idx = col_nd + row_base, shape [n, d] |
| 8a | pto.tcvt | narrow flat_idx i32→i16 (2-byte dtypes only) |
| 9 | pto.texpands | zeroed scatter base [m, d] |
| 10 | pto.tscatter | scattered = src into zeroed base (written = src, unwritten = 0) |
| 11–12 | pto.texpands ×2 | mask zero base, ones src |
| 13 | pto.tscatter | mask = ones into zeroed base (written = 1, unwritten = 0) |
| 14 | pto.tcmps | pred = (mask != 0) |
| 15 | pto.tsel | out = sel(pred, scattered, input_tile) |
| 16 | pto.tstore | write out to output |

tile.sel (not input * mask) avoids emitting pto.tmul, which A2/A3 reject for bf16/i8. The index reshape [b, s] → [n, 1] is a buffer-view realias, not a separate PTO op. Full walkthrough: docs/en/dev/passes/12-convert_tensor_to_tile_ops.md (Scatter Update Lowering).
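
As a reading aid for the op table, here is a rough pure-Python emulation of the same sequence: scatter `src` into a zeroed base, scatter ones into a zeroed mask, then select. It assumes 2D row-major data held as nested lists; `scatter_update_rows` is a hypothetical reference helper, not the generated kernel.

```python
def scatter_update_rows(input_rows, index_flat, src):
    """Reference for whole-row scatter_update with preserved unwritten rows."""
    m, d = len(input_rows), len(input_rows[0])
    scattered = [0] * (m * d)  # steps 9-10: src scattered into a zeroed base
    mask = [0] * (m * d)       # steps 11-13: ones scattered into a zeroed base
    for k, row in enumerate(index_flat):
        for c in range(d):
            flat = row * d + c  # steps 4-8: flat_idx = col_nd + row_base
            scattered[flat] = src[k][c]
            mask[flat] = 1
    out = []
    for r in range(m):
        # steps 14-15: pred = (mask != 0); out = sel(pred, scattered, input)
        out.append([
            scattered[r * d + c] if mask[r * d + c] != 0 else input_rows[r][c]
            for c in range(d)
        ])
    return out


inp = [[1, 1], [2, 2], [3, 3]]
result = scatter_update_rows(inp, index_flat=[1], src=[[0, 9]])
```

The example deliberately writes a `0` from `src`: testing `scattered != 0` alone could not tell a written zero from an untouched element, which is why the separate mask scatter is needed.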

FP16 root cause

Two FP16-specific defects surfaced on device and are fixed here:

  1. Device hang. The system-test kernels returned the pl.store(...) op result directly (return pl.store(result, ...)). On the FP16 case this triggered an AICPU stream-sync timeout that cascaded into unrelated tests. Materializing the store as a statement before the return (dst_t = pl.store(...) / return dst_t) clears the hang.
  2. Wrong result. The lowering previously narrowed the index i32→i16 on a col_major [n, 1] view; tile.cast mis-orders elements on a col_major source, so whole src rows scattered in reverse (dst row 0 received src[15]). Computing the indices in i32 and narrowing only the final row-major [n, d] flat index fixes it. (Narrowing earlier on the [b, s] tile is also invalid — an i16 [b, s] row is cols * 2 bytes and breaks 32-byte alignment.) The companion tensor.scatter FP16/BF16 path is unaffected: it takes an i16 index from the caller and never casts.
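
The alignment remark in item 2 comes down to simple byte arithmetic. The sketch below assumes the constraint applies to the byte length of each tile row and reuses the shapes from the example kernel (`[2, 8]` index, `[16, 32]` flat index); the helper names are illustrative only.

```python
ALIGN_BYTES = 32  # row alignment quoted in the description above


def row_bytes(cols, elem_bytes):
    """Bytes occupied by one row of a row-major tile."""
    return cols * elem_bytes


def row_is_aligned(cols, elem_bytes):
    return row_bytes(cols, elem_bytes) % ALIGN_BYTES == 0


# [b, s] = [2, 8] index tile
i32_index_row_ok = row_is_aligned(cols=8, elem_bytes=4)   # 8 * 4 = 32 bytes
i16_index_row_ok = row_is_aligned(cols=8, elem_bytes=2)   # 8 * 2 = 16 bytes

# [n, d] = [16, 32] finished flat index, narrowed at the end
i16_flat_row_ok = row_is_aligned(cols=32, elem_bytes=2)   # 32 * 2 = 64 bytes
```

For these shapes the i32 index row and the narrowed `[n, d]` flat index both land on a 32-byte multiple, while an i16 `[b, s]` row would be only 16 bytes.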

Testing

  • 90+ scatter/orch unit tests pass (convert, tile_ops, codegen)
  • FP32 end-to-end through the Default pipeline
  • FP16 test_tile_scatter_update_fp16 passes on device (previously hung / produced incorrect output)
  • Docs updated (en/zh op reference + ConvertTensorToTileOps lowering walkthrough)

Related Issues

Fixes #1490

hw-native-sys#1490)

tensor.scatter_update now expands to per-element flat-index tile.scatter
(pto.tscatter) + select preserve-blend during ConvertTensorToTileOps, so it
runs on device and respects TensorMap producer sync. Removes the orch
raw-memcpy path and the dead GetTensorDataPtr helper, drops the scratch arg
from tile.scatter_update, and deletes the textract/tinsert PTO codegen.

coderabbitai Bot commented May 26, 2026

Walkthrough

Removes the scratch: Tile parameter from scatter_update across the API, IR, and backend layers. Rewrites tensor.scatter_update lowering to synthesize whole-row scatter via flat destination indices using tile.scatter operations instead of a dedicated tile.scatter_update IR op. Removes PTO backend codegen and orchestration op implementations, with comprehensive test updates.

Changes

Scatter Update Refactoring: API, Lowering, and Backend Cleanup

| Layer / File(s) | Summary |
|-----------------|---------|
| Public API signature simplification: python/pypto/ir/op/tile_ops.py, python/pypto/language/op/tile_ops.py, docs/en/user/02-operation_reference.md, docs/zh-cn/user/02-operation_reference.md | scatter_update function signature changes from (input, dim, index, src, scratch) to (input, dim, index, src). Python argument validation, docstrings, and user documentation (English and Chinese) updated to match. |
| IR operation registration and type inference: src/ir/op/tile_ops/transform.cpp, include/pypto/codegen/codegen_base.h | tile.scatter_update IR op now accepts exactly 3 operands (input, index, src). DeduceTileScatterUpdateType validates only 3 inputs, with scratch-tile validation removed. REGISTER_OP wires only 3 operands to MemorySpace::Vec. GetTensorDataPtr virtual method removed from CodegenBase. |
| Tensor.scatter_update lowering rewrite: src/ir/transforms/op_conversion_registry.cpp | New scatter_update_conv implements whole-row scatter via flat destination indices: flattens index, scatters source and write-mask into zero bases, derives predicate with tile.cmps, and reconstructs read-preserve semantics with tile.sel. Both tensor.scatter_update and DSL tile.scatter_update route to this conversion. |
| Backend implementation removal: src/backend/common/pto_ops_common.cpp, src/codegen/tensor_op_codegen.cpp, src/codegen/orchestration/orchestration_codegen.cpp, src/codegen/pto/pto_codegen.cpp | Deleted MakeScatterUpdateCodegenPTO and its PTO registration. Removed tensor.scatter_update orchestration op codegen. Removed GetTensorDataPtr override from OrchestrationStmtCodegen. Updated IsInPlaceScatterFamilyOp to exclude tile.scatter_update, now only recognizing tile.scatter and tile.scatter_mask. |
| Test suite updates: tests/ut/codegen/test_pto_codegen_ops.py, tests/ut/ir/operators/test_tile_ops.py, tests/st/runtime/ops/test_scatter_update.py, tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py | Removed TestTileScatterUpdateCodegen unit test. Updated IR operator tests to use 3-operand API. Removed explicit scratch_tile allocations and scratch= arguments from runtime kernels. Updated conversion test to expect tile.scatter only; replaced structural checks with IR printing and string assertions. Updated all test docstrings to describe new flat-index whole-row lowering. |

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly Related PRs

  • hw-native-sys/pypto#1004: Directly touches the tile.scatter_update PTO backend lowering by adding/removing the MakeScatterUpdateCodegenPTO implementation and registration in src/backend/common/pto_ops_common.cpp.
  • hw-native-sys/pypto#1106: Refactors tile.scatter_update IR-level tests in tests/ut/ir/operators/test_tile_ops.py, affecting the same test file and scatter_update test logic.
  • hw-native-sys/pypto#1426: Introduces new tile.scatter/tensor.scatter operator family that the main PR now uses for the rewritten tensor.scatter_update lowering path.

Suggested Reviewers

  • lyfne123
  • Hzfengsy

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 31.03% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Title check | ✅ Passed | The PR title accurately summarizes the main change: lowering scatter_update to tile.scatter and removing orchestration memcpy, which is the primary objective of this changeset. |
| Linked Issues check | ✅ Passed | The PR addresses the core requirements of issue #1490: removes the unsafe orchestration-level raw memcpy path that bypassed TensorMap producer sync, deletes GetTensorDataPtr helper, and lowers tensor.scatter_update to device-scope tile.scatter kernel. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope: documentation updates, removal of scatter_update scratch parameter, conversion logic for whole-row scatter, codegen updates, test adjustments, and removal of unsafe memcpy paths directly address issue #1490. |
| Description check | ✅ Passed | The PR description clearly describes the changes: lowering scatter_update to tile.scatter with whole-row semantics, removing orchestration memcpy, dropping scratch argument, and fixing FP16 issues. It is directly related to the changeset shown. |

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the scatter_update operation (at both the tensor and tile levels) to lower directly to a whole-row tile.scatter using flat indices, eliminating the need for a temporary scratch tile. Consequently, the scratch parameter has been removed from the Python APIs, IR definitions, and C++ codegen. The review feedback suggests improving the C++ IR transformation in op_conversion_registry.cpp by using INTERNAL_CHECK_SPAN instead of CHECK to conform to project conventions, and adding an explicit safety check to verify that the number of source rows matches the index size.

Comment thread src/ir/transforms/op_conversion_registry.cpp Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/st/runtime/cross_core/test_chained_matmul_cast.py`:
- Around line 66-68: The expected-value path in compute_expected currently does
a pure FP32 matmul but the tested kernel performs FP32→BF16→FP32 (CubeVecCast),
so modify compute_expected to mimic that round-trip: cast tensors["a"] and
tensors["w"] to torch.bfloat16 then back to torch.float32 before calling
torch.matmul, and store the result into tensors["y"]; this will make the
reference follow the same FP32→BF16→FP32 behavior as the implementation under
test.

In `@tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py`:
- Around line 1989-1993: Replace the fragile substring assertions with exact-op
checks: after running passes.convert_tensor_to_tile_ops() and getting After,
either (A) inspect the IR Call nodes in After (e.g., traverse After to collect
call.op.name values) and assert "tile.scatter" appears the expected number of
times and "tile.scatter_mask" is not present, or (B) if staying with the printed
text from ir.python_print(After), use a word-boundary regex like
r"\btile\.scatter\b" to assert exact matches and separately assert
r"\btile\.scatter_mask\b" is absent; update the assertions accordingly (refer to
passes.PassContext, passes.convert_tensor_to_tile_ops, ir.python_print, and the
After variable).
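
The word-boundary option (B) from the second comment can be tried on plain strings with Python's standard `re` module. The IR text below is made up for illustration and is not real `ir.python_print` output; only the two regex patterns come from the review comment.

```python
import re

EXACT_SCATTER = re.compile(r"\btile\.scatter\b")
SCATTER_MASK = re.compile(r"\btile\.scatter_mask\b")


def count_exact_scatter(printed_ir):
    """Count tile.scatter calls without matching tile.scatter_mask/_update."""
    return len(EXACT_SCATTER.findall(printed_ir))


# Hypothetical printed IR for the new lowering: two index-form scatters.
printed = """
scattered = pl.tile.scatter(base, flat_idx, src)
mask = pl.tile.scatter(mask_base, flat_idx, ones)
"""

# Hypothetical regression to the mask form.
regressed = "out = pl.tile.scatter_mask(x, m, y)"

exact_hits = count_exact_scatter(printed)
regressed_hits = count_exact_scatter(regressed)
substring_false_positive = "tile.scatter" in regressed
```

Because `_` is a word character, `\b` does not match between `scatter` and `_mask`, so the exact pattern ignores `tile.scatter_mask`, whereas a plain substring test would accept it.
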
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d60d101c-7572-4fb4-9790-02f7045606e7

📥 Commits

Reviewing files that changed from the base of the PR and between e7e41f5 and 7ca8ba2.

📒 Files selected for processing (16)
  • docs/en/user/02-operation_reference.md
  • docs/zh-cn/user/02-operation_reference.md
  • include/pypto/codegen/codegen_base.h
  • python/pypto/ir/op/tile_ops.py
  • python/pypto/language/op/tile_ops.py
  • src/backend/common/pto_ops_common.cpp
  • src/codegen/orchestration/orchestration_codegen.cpp
  • src/codegen/pto/pto_codegen.cpp
  • src/codegen/tensor_op_codegen.cpp
  • src/ir/op/tile_ops/transform.cpp
  • src/ir/transforms/op_conversion_registry.cpp
  • tests/st/runtime/cross_core/test_chained_matmul_cast.py
  • tests/st/runtime/ops/test_scatter_update.py
  • tests/ut/codegen/test_pto_codegen_ops.py
  • tests/ut/ir/operators/test_tile_ops.py
  • tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py
💤 Files with no reviewable changes (6)
  • include/pypto/codegen/codegen_base.h
  • tests/ut/ir/operators/test_tile_ops.py
  • src/backend/common/pto_ops_common.cpp
  • src/codegen/orchestration/orchestration_codegen.cpp
  • src/codegen/tensor_op_codegen.cpp
  • tests/ut/codegen/test_pto_codegen_ops.py

Comment thread tests/st/runtime/cross_core/test_chained_matmul_cast.py Outdated
Comment thread tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py
@Little-oil Little-oil changed the title from "fix(scatter_update): lower to whole-row tile.scatter, drop orch memcpy (#1490)" to "fix(codegen): lower scatter_update to tile.scatter, remove orch memcpy" May 26, 2026
Youhezhen and others added 9 commits May 26, 2026 18:12
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- op_conversion_registry: use INTERNAL_CHECK_SPAN for scatter_update
  post-bridge invariants (Span in scope) and add explicit check that
  src rows == index size (b * s)
- test_convert_tensor_to_tile_ops: tighten scatter_update assertion to
  exact pl.tile.scatter( (index form), and assert pl.tile.scatter_mask(
  is absent so a regression to mask form is caught
Replace `return pl.store(result, ...)` with explicit
`dst_t = pl.store(result, ...)` / `return dst_t` across all kernel
programs in the scatter_update system tests.

Returning the store op result directly was triggering a device-side
AICPU stream-sync timeout (hang) on the FP16 case; materializing the
store as a statement before the return clears the hang.
…he end

The FP16 scatter_update lowering built the flat destination indices in
the tscatter-required i16 width, which forced an i32->i16 tile.cast on the
col_major [n,1] index view. That cast mis-ordered the indices, scattering
whole src rows in reversed order (dst row 0 received src[15]).

Keep the entire flat-index computation in i32 (identical to the FP32
path) and narrow only the finished row-major [n,d] flat_idx to the
tscatter index width. The final cast runs on a 32-byte-aligned,
row-major tile, which is both alignment-legal and correct.
Add a Scatter Update Lowering section to the ConvertTensorToTileOps pass
doc (en + zh-cn): the flat-index expansion (flat_idx = index*d + c), the
i32-compute / narrow-at-the-end rule, and the generated pto.tscatter +
sel preserve-blend op sequence.
- CHECK that m*d fits in i16 for 2-byte dst (the tscatter index width),
  so an oversized FP16/BF16/INT16 dst raises a clear error instead of
  silently scattering to wrong rows on flat-index overflow.
- Reject 4D input/src in the lowering with a user-facing CHECK: 4D
  type-checks via the op's deduction but is not yet lowered, so it would
  otherwise hit an internal error.
- Add convert-pass tests covering both rejections.
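
The two rejections in this commit can be pictured as a small shape guard. This is a hedged Python sketch rather than the C++ CHECKs in the pass: the function name is hypothetical, and the bound reads "m*d fits in i16" literally as `m * d <= 32767`, which may differ from the exact condition used in the lowering.

```python
import math

I16_MAX = 2**15 - 1  # 32767, largest value an i16 flat index can hold


def check_scatter_update_lowerable(dst_shape, elem_bytes):
    """Raise ValueError for destinations this sketch treats as not lowerable."""
    if len(dst_shape) == 4:
        # 4D type-checks but is described above as not yet lowered.
        raise ValueError("scatter_update: 4D input/src is not lowered yet")
    total = math.prod(dst_shape)  # m * d for a 2D destination
    if elem_bytes == 2 and total > I16_MAX:
        # 2-byte data uses an i16 tscatter index, so flat indices could overflow.
        raise ValueError(
            f"scatter_update: {total} elements do not fit the i16 index width"
        )


def is_lowerable(dst_shape, elem_bytes):
    try:
        check_scatter_update_lowerable(dst_shape, elem_bytes)
    except ValueError:
        return False
    return True
```

For example, a `[32, 32]` FP16 destination passes, a `[1024, 64]` FP16 destination is rejected, and the same `[1024, 64]` shape in FP32 passes because its index stays i32.
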
@lyfne123 lyfne123 merged commit 538f073 into hw-native-sys:main May 27, 2026
9 checks passed
lyfne123 pushed a commit that referenced this pull request May 29, 2026
## Summary

`pto.tcvt` (the lowering of `tile.cast`) silently **mis-orders elements
when its source tile is `col_major`** — e.g. a reshaped `[n, 1]` index
vector narrowed `i32 -> i16`. The same cast on a `row_major` source is
correct, so the failure is silent wrong output with no diagnostic. This
is what produced reversed scatter rows in the FP16
`tensor.scatter_update` lowering (issue #1549).

PyPTO already drives this exact class of ISA constraint through the
`ResolveBackendOpLayouts` pass, which reshapes a `[n, 1] col_major`
vector to `[1, n] row_major` around a constrained op and restores the
layout afterwards. `tile.cast` simply had **no layout spec**, so it was
never repaired.

### Changes

- **`src/backend/common/pto_ops_common.cpp`**: register `tile.cast` with
`set_input_layout(0, row_major)` + `set_output_layout(row_major)`,
mirroring `tile.rsqrt` / `tile.cmps` / `tile.sort32`.
`ResolveBackendOpLayouts` now repairs every `col_major` caller
generically. Row-major callers are unaffected (no repair, zero
overhead).
- **`tests/ut/ir/transforms/test_resolve_backend_op_layouts_pass.py`**:
pass-level regression for a `col_major [16, 1]` `i32 -> i16` cast being
repaired through a `[1, 16] row_major` reshape.
- **`tests/st/runtime/ops/test_cast.py`**: new end-to-end ST — a
`col_major [N, 1]` `i32 -> i16` narrow (the #1549 regression, must
preserve element order) plus a `row_major [1, N]` control case.
- **`docs/en` + `docs/zh-cn` `20-resolve_backend_op_layouts.md`**: list
`tile.cast` among the constrained ops.
- **`src/ir/transforms/op_conversion_registry.cpp`**: comment-only trim
of the `scatter_update` lowering — its i32-compute / narrow-at-the-end
design is kept for the alignment benefit. No behavior change: this path
narrows only the **row-major `[n, d]`** flat index, so it never feeds a
`col_major` source to `tile.cast`, and the new layout spec is a no-op
for it.
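
Why reshaping `[n, 1] col_major` to `[1, n] row_major` can preserve element order is easiest to see from the linear offsets. The sketch below uses the textbook dense-layout formulas and assumes tiles have no padding; it does not model the actual `ResolveBackendOpLayouts` pass.

```python
def offset_row_major(r, c, rows, cols):
    """Linear offset of element (r, c) in a dense row-major tile."""
    return r * cols + c


def offset_col_major(r, c, rows, cols):
    """Linear offset of element (r, c) in a dense column-major tile."""
    return c * rows + r


n = 16
# Element i of the vector, viewed as [n, 1] col_major ...
col_vec_offsets = [offset_col_major(i, 0, rows=n, cols=1) for i in range(n)]
# ... and as [1, n] row_major.
row_vec_offsets = [offset_row_major(0, i, rows=1, cols=n) for i in range(n)]
```

Both views place element `i` at offset `i`, so under these assumptions the reshape is a pure relabeling of the same buffer and the cast then runs on a row-major source.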

## Testing

- [x] New + existing `ResolveBackendOpLayouts` UTs pass (5/5)
- [x] `tests/ut/ir/transforms/` + `tests/ut/codegen/`: 1687 passed, 26
skipped
- [x] `tests/ut/ir/operators/test_tile_ops.py`: 237 passed
- [x] Pre-commit hooks (clang-format, cpplint, ruff, pyright,
markdownlint) pass
- [ ] On-device ST
`tests/st/runtime/ops/test_cast.py::TestCast::test_tile_cast_col_major_narrow`
(hardware, to be confirmed by reviewer) — the direct col_major-cast
regression for this fix.

> Note: `test_scatter_update.py::...::test_tile_scatter_update_fp16` is
**not** a target of this PR. That path was already worked around in
#1537 (narrow only the row-major `[n, d]` flat index), so this PR does
not change its codegen and it needs no re-confirmation here.

## Related Issues

Fixes #1549

---------

Co-authored-by: Youhezhen <youhezhen@huawei.com>
Development

Successfully merging this pull request may close these issues.

[Bug] Orchestration codegen for tensor.scatter_update emits raw memcpy through tensor data ptr, bypassing get/set_tensor_data producer sync

2 participants