fix(codegen): lower scatter_update to tile.scatter, remove orch memcpy by Little-oil · Pull Request #1537 · hw-native-sys/pypto

Little-oil · 2026-05-26T09:55:51Z

Summary

tensor.scatter_update / tile.scatter_update now lower to a whole-row tile.scatter (pto.tscatter) + select preserve-blend in ConvertTensorToTileOps, so the op runs on a device kernel and respects TensorMap producer sync.
Removed the orchestration-level raw memcpy path (which dereferenced tensor data pointers, racing with producer tasks) and the now-dead GetTensorDataPtr helper.
Dropped the scratch arg from tile.scatter_update (4→3) and deleted the textract/tinsert PTO codegen.
FP16 is now correct and device-verified (see FP16 root cause below).

Approach

Whole-row scatter is expressed as a per-element flat scatter: flat_idx[k, c] = index.flat[k] * d + c. The conversion builds the flat index (column arange broadcast via col_expand + an index * d row offset), scatters src into a zeroed base, and reconstructs the DPS row-preserve via a mask scatter + cmps + sel blend — mirroring tensor.scatter.

The flat-index arithmetic is computed entirely in i32, and only the finished row-major [n, d] flat index is narrowed to the tscatter-required width (i16 for 2-byte data). This keeps every intermediate index tile in a canonical, 32-byte-aligned, row-major layout.
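The flat-index formula above can be sketched in plain Python. This is only an illustration of `flat_idx[k, c] = index.flat[k] * d + c`; the helper name `build_flat_index` and the sample values are hypothetical and are not part of the PyPTO code base.

```python
def build_flat_index(index, d):
    """Expand a [b, s] table of destination row ids into [n, d] flat element indices."""
    # Flatten [b, s] -> [n], matching index.flat in the description.
    index_flat = [row_id for row in index for row_id in row]
    # Row offset (row_id * d) plus the column arange 0..d-1.
    return [[row_id * d + c for c in range(d)] for row_id in index_flat]


index = [[3, 0], [2, 5]]          # b = 2, s = 2, so n = 4 source rows
flat_idx = build_flat_index(index, d=4)

for row in flat_idx:
    print(row)
```

Each output row `k` lists the `d` consecutive flat positions occupied by destination row `index.flat[k]`, which is why a per-element scatter over this table behaves like a whole-row scatter.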

Lowering (generated PTO)

Hardware pto.tscatter writes per element (dst.flat[idx[k, c]] = src[k, c]) and treats dst as write-only, so the preserve semantics are rebuilt on the PyPTO side. Generated kernel for FP32 [32, 32] input / [2, 8] index / [16, 32] src:

| # | PTO op | Produces |
|---|---|---|
| 1–3 | `pto.tload` ×3 | `input_tile`, `index_tile`, `src_tile` |
| 4 | `pto.tci` | column arange `[1, d]` = `0..d-1` |
| 5 | `pto.texpands` | zero template `[n, d]` |
| 6 | `pto.tcolexpand` | `col_nd[k, c] = c` |
| 7 | `pto.tmuls` | `row_base[k] = index.flat[k] * d` |
| 8 | `pto.trowexpandadd` | `flat_idx = col_nd + row_base` → `[n, d]` |
| 8a | `pto.tcvt` | narrow `flat_idx` i32→i16 (2-byte dtypes only) |
| 9 | `pto.texpands` | zeroed scatter base `[m, d]` |
| 10 | `pto.tscatter` | `scattered` = src into zeroed base (written = src, unwritten = 0) |
| 11–12 | `pto.texpands` ×2 | mask zero base, ones src |
| 13 | `pto.tscatter` | `mask` = ones into zeroed base (written = 1, unwritten = 0) |
| 14 | `pto.tcmps` | `pred = (mask != 0)` |
| 15 | `pto.tsel` | `out = sel(pred, scattered, input_tile)` |
| 16 | `pto.tstore` | write `out` to output |

tile.sel (not input * mask) avoids emitting pto.tmul, which A2/A3 reject for bf16/i8. The index reshape [b, s] → [n, 1] is a buffer-view realias, not a separate PTO op. Full walkthrough: docs/en/dev/passes/12-convert_tensor_to_tile_ops.md (Scatter Update Lowering).
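As a rough host-side model of steps 4 to 15, the sketch below emulates the two scatters and the select blend with nested lists. It is not the generated kernel: the function name `scatter_update_reference` and the sample data are hypothetical, and the only behavior assumed is what the description states (write-only per-element scatter into a zeroed base, then `sel(mask != 0, scattered, input)`).

```python
def scatter_update_reference(input_tile, index_flat, src):
    """Emulate the scatter + mask + select lowering on nested lists."""
    m, d = len(input_tile), len(input_tile[0])
    n = len(index_flat)

    # Steps 4-8: flat destination index for every source element.
    flat_idx = [[index_flat[k] * d + c for c in range(d)] for k in range(n)]

    # Steps 9-13: scatter src and an all-ones mask into zeroed [m, d] bases.
    scattered = [0] * (m * d)
    mask = [0] * (m * d)
    for k in range(n):
        for c in range(d):
            scattered[flat_idx[k][c]] = src[k][c]
            mask[flat_idx[k][c]] = 1

    # Steps 14-15: keep the scattered value where the mask is set, else the input.
    return [
        [scattered[r * d + c] if mask[r * d + c] != 0 else input_tile[r][c]
         for c in range(d)]
        for r in range(m)
    ]


input_tile = [[10, 11], [20, 21], [30, 31], [40, 41]]
index_flat = [2, 0]
src = [[0, -1], [7, 8]]   # contains a 0 on purpose
out = scatter_update_reference(input_tile, index_flat, src)
print(out)
```

The sample `src` contains a zero so the separate mask scatter matters: testing `scattered != 0` directly would wrongly restore the input value at that position.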

FP16 root cause

Two FP16-specific defects surfaced on device and are fixed here:

Device hang. The system-test kernels returned the pl.store(...) op result directly (return pl.store(result, ...)). On the FP16 case this triggered an AICPU stream-sync timeout that cascaded into unrelated tests. Materializing the store as a statement before the return (dst_t = pl.store(...) / return dst_t) clears the hang.
Wrong result. The lowering previously narrowed the index i32→i16 on a col_major [n, 1] view; tile.cast mis-orders elements on a col_major source, so whole src rows scattered in reverse (dst row 0 received src[15]). Computing the indices in i32 and narrowing only the final row-major [n, d] flat index fixes it. (Narrowing earlier on the [b, s] tile is also invalid — an i16 [b, s] row is cols * 2 bytes and breaks 32-byte alignment.) The companion tensor.scatter FP16/BF16 path is unaffected: it takes an i16 index from the caller and never casts.
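The alignment argument in the second item is simple arithmetic and can be checked with a few lines of Python. The sketch assumes, based only on the wording above, that the rule is "each tile row must span a multiple of 32 bytes"; the helper names are hypothetical.

```python
ALIGN_BYTES = 32  # assumed row alignment, taken from the "32-byte alignment" wording


def row_bytes(cols, elem_bytes):
    """Bytes occupied by one row of a row-major tile."""
    return cols * elem_bytes


def is_row_aligned(cols, elem_bytes):
    return row_bytes(cols, elem_bytes) % ALIGN_BYTES == 0


# Shapes from the FP32 example in this PR: index [2, 8], flat index [16, 32].
print("i32 [b, s] index row:", row_bytes(8, 4), is_row_aligned(8, 4))
print("i16 [b, s] index row:", row_bytes(8, 2), is_row_aligned(8, 2))
print("i16 [n, d] flat row: ", row_bytes(32, 2), is_row_aligned(32, 2))
```

Under that assumption, narrowing the `[2, 8]` index early yields 16-byte rows, while narrowing the finished `[16, 32]` flat index yields 64-byte rows, which is consistent with the narrow-at-the-end rule.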

Testing

90+ scatter/orch unit tests pass (convert, tile_ops, codegen)
FP32 end-to-end through the Default pipeline
FP16 test_tile_scatter_update_fp16 passes on device (previously hung / produced incorrect output)
Docs updated (en/zh op reference + ConvertTensorToTileOps lowering walkthrough)

Related Issues

Fixes #1490

hw-native-sys#1490) tensor.scatter_update now expands to per-element flat-index tile.scatter (pto.tscatter) + select preserve-blend during ConvertTensorToTileOps, so it runs on device and respects TensorMap producer sync. Removes the orch raw-memcpy path and the dead GetTensorDataPtr helper, drops the scratch arg from tile.scatter_update, and deletes the textract/tinsert PTO codegen.

coderabbitai · 2026-05-26T09:56:04Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Removes the scratch: Tile parameter from scatter_update across the API, IR, and backend layers. Rewrites tensor.scatter_update lowering to synthesize whole-row scatter via flat destination indices using tile.scatter operations instead of a dedicated tile.scatter_update IR op. Removes PTO backend codegen and orchestration op implementations, with comprehensive test updates.

Changes

Scatter Update Refactoring: API, Lowering, and Backend Cleanup

| Layer | File(s) | Summary |
|---|---|---|
| Public API signature simplification | `python/pypto/ir/op/tile_ops.py`, `python/pypto/language/op/tile_ops.py`, `docs/en/user/02-operation_reference.md`, `docs/zh-cn/user/02-operation_reference.md` | `scatter_update` function signature changes from `(input, dim, index, src, scratch)` to `(input, dim, index, src)`. Python argument validation, docstrings, and user documentation (English and Chinese) updated to match. |
| IR operation registration and type inference | `src/ir/op/tile_ops/transform.cpp`, `include/pypto/codegen/codegen_base.h` | `tile.scatter_update` IR op now accepts exactly 3 operands (`input`, `index`, `src`). `DeduceTileScatterUpdateType` validates only 3 inputs, with scratch-tile validation removed. `REGISTER_OP` wires only 3 operands to `MemorySpace::Vec`. `GetTensorDataPtr` virtual method removed from `CodegenBase`. |
| Tensor.scatter_update lowering rewrite | `src/ir/transforms/op_conversion_registry.cpp` | New `scatter_update_conv` implements whole-row scatter via flat destination indices: flattens index, scatters source and write-mask into zero bases, derives predicate with `tile.cmps`, and reconstructs read-preserve semantics with `tile.sel`. Both `tensor.scatter_update` and DSL `tile.scatter_update` route to this conversion. |
| Backend implementation removal | `src/backend/common/pto_ops_common.cpp`, `src/codegen/tensor_op_codegen.cpp`, `src/codegen/orchestration/orchestration_codegen.cpp`, `src/codegen/pto/pto_codegen.cpp` | Deleted `MakeScatterUpdateCodegenPTO` and its PTO registration. Removed `tensor.scatter_update` orchestration op codegen. Removed `GetTensorDataPtr` override from `OrchestrationStmtCodegen`. Updated `IsInPlaceScatterFamilyOp` to exclude `tile.scatter_update`, now only recognizing `tile.scatter` and `tile.scatter_mask`. |
| Test suite updates | `tests/ut/codegen/test_pto_codegen_ops.py`, `tests/ut/ir/operators/test_tile_ops.py`, `tests/st/runtime/ops/test_scatter_update.py`, `tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py` | Removed `TestTileScatterUpdateCodegen` unit test. Updated IR operator tests to use 3-operand API. Removed explicit `scratch_tile` allocations and `scratch=` arguments from runtime kernels. Updated conversion test to expect `tile.scatter` only; replaced structural checks with IR printing and string assertions. Updated all test docstrings to describe new flat-index whole-row lowering. |

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly Related PRs

hw-native-sys/pypto#1004: Directly touches the tile.scatter_update PTO backend lowering by adding/removing the MakeScatterUpdateCodegenPTO implementation and registration in src/backend/common/pto_ops_common.cpp.
hw-native-sys/pypto#1106: Refactors tile.scatter_update IR-level tests in tests/ut/ir/operators/test_tile_ops.py, affecting the same test file and scatter_update test logic.
hw-native-sys/pypto#1426: Introduces new tile.scatter/tensor.scatter operator family that the main PR now uses for the rewritten tensor.scatter_update lowering path.

Suggested Reviewers

lyfne123
Hzfengsy

🐰 A scatter's tale, now cleaned and bright,
No scratch in sight, just flat indices right,
Whole rows now dance through tile.scatter's gate,
While select-blend ensures no stale state,
The rabbit cheers: complexity deprecated! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 31.03% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title accurately summarizes the main change: lowering scatter_update to tile.scatter and removing orchestration memcpy, which is the primary objective of this changeset. |
| Linked Issues check | ✅ Passed | The PR addresses the core requirements of issue `#1490`: removes the unsafe orchestration-level raw memcpy path that bypassed TensorMap producer sync, deletes GetTensorDataPtr helper, and lowers tensor.scatter_update to device-scope tile.scatter kernel. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope: documentation updates, removal of scatter_update scratch parameter, conversion logic for whole-row scatter, codegen updates, test adjustments, and removal of unsafe memcpy paths directly address issue `#1490`. |
| Description check | ✅ Passed | The PR description clearly describes the changes: lowering scatter_update to tile.scatter with whole-row semantics, removing orchestration memcpy, dropping scratch argument, and fixing FP16 issues. It is directly related to the changeset shown. |

gemini-code-assist

Code Review

This pull request refactors the scatter_update operation (at both the tensor and tile levels) to lower directly to a whole-row tile.scatter using flat indices, eliminating the need for a temporary scratch tile. Consequently, the scratch parameter has been removed from the Python APIs, IR definitions, and C++ codegen. The review feedback suggests improving the C++ IR transformation in op_conversion_registry.cpp by using INTERNAL_CHECK_SPAN instead of CHECK to conform to project conventions, and adding an explicit safety check to verify that the number of source rows matches the index size.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/st/runtime/cross_core/test_chained_matmul_cast.py`:
- Around line 66-68: The expected-value path in compute_expected currently does
a pure FP32 matmul but the tested kernel performs FP32→BF16→FP32 (CubeVecCast),
so modify compute_expected to mimic that round-trip: cast tensors["a"] and
tensors["w"] to torch.bfloat16 then back to torch.float32 before calling
torch.matmul, and store the result into tensors["y"]; this will make the
reference follow the same FP32→BF16→FP32 behavior as the implementation under
test.

In `@tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py`:
- Around line 1989-1993: Replace the fragile substring assertions with exact-op
checks: after running passes.convert_tensor_to_tile_ops() and getting After,
either (A) inspect the IR Call nodes in After (e.g., traverse After to collect
call.op.name values) and assert "tile.scatter" appears the expected number of
times and "tile.scatter_mask" is not present, or (B) if staying with the printed
text from ir.python_print(After), use a word-boundary regex like
r"\btile\.scatter\b" to assert exact matches and separately assert
r"\btile\.scatter_mask\b" is absent; update the assertions accordingly (refer to
passes.PassContext, passes.convert_tensor_to_tile_ops, ir.python_print, and the
After variable).
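Option (B) in the comment above can be illustrated with Python's `re` module. The printed-IR text below is a made-up stand-in for the output of `ir.python_print(After)`; only the two regex patterns come from the review comment.

```python
import re

# Patterns quoted in the review comment.
EXACT_SCATTER = re.compile(r"\btile\.scatter\b")
EXACT_SCATTER_MASK = re.compile(r"\btile\.scatter_mask\b")

# Hypothetical printed IR; real pass output will look different.
printed = (
    "scattered = pl.tile.scatter(base, flat_idx, src)\n"
    "mask = pl.tile.scatter(mask_base, flat_idx, ones)\n"
)

scatter_count = len(EXACT_SCATTER.findall(printed))
print("tile.scatter calls:", scatter_count)
print("mask form present:", EXACT_SCATTER_MASK.search(printed) is not None)

# Why the plain substring check is fragile: it also matches the mask form.
mask_form = "pl.tile.scatter_mask(x, m, src)"
print("substring hit:", "tile.scatter" in mask_form)
print("regex hit:", EXACT_SCATTER.search(mask_form) is not None)
```

The trailing `\b` does the work here: `_` is a word character, so there is no word boundary between `scatter` and `_mask`, and the exact pattern rejects `tile.scatter_mask` where the substring test accepts it.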

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d60d101c-7572-4fb4-9790-02f7045606e7

📥 Commits

Reviewing files that changed from the base of the PR and between e7e41f5 and 7ca8ba2.

📒 Files selected for processing (16)

docs/en/user/02-operation_reference.md
docs/zh-cn/user/02-operation_reference.md
include/pypto/codegen/codegen_base.h
python/pypto/ir/op/tile_ops.py
python/pypto/language/op/tile_ops.py
src/backend/common/pto_ops_common.cpp
src/codegen/orchestration/orchestration_codegen.cpp
src/codegen/pto/pto_codegen.cpp
src/codegen/tensor_op_codegen.cpp
src/ir/op/tile_ops/transform.cpp
src/ir/transforms/op_conversion_registry.cpp
tests/st/runtime/cross_core/test_chained_matmul_cast.py
tests/st/runtime/ops/test_scatter_update.py
tests/ut/codegen/test_pto_codegen_ops.py
tests/ut/ir/operators/test_tile_ops.py
tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py

💤 Files with no reviewable changes (6)

include/pypto/codegen/codegen_base.h
tests/ut/ir/operators/test_tile_ops.py
src/backend/common/pto_ops_common.cpp
src/codegen/orchestration/orchestration_codegen.cpp
src/codegen/tensor_op_codegen.cpp
tests/ut/codegen/test_pto_codegen_ops.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- op_conversion_registry: use INTERNAL_CHECK_SPAN for scatter_update post-bridge invariants (Span in scope) and add explicit check that src rows == index size (b * s)
- test_convert_tensor_to_tile_ops: tighten scatter_update assertion to exact pl.tile.scatter( (index form), and assert pl.tile.scatter_mask( is absent so a regression to mask form is caught

Replace `return pl.store(result, ...)` with explicit `dst_t = pl.store(result, ...)` / `return dst_t` across all kernel programs in the scatter_update system tests. Returning the store op result directly was triggering a device-side AICPU stream-sync timeout (hang) on the FP16 case; materializing the store as a statement before the return clears the hang.

…he end The FP16 scatter_update lowering built the flat destination indices in the tscatter-required i16 width, which forced an i32->i16 tile.cast on the col_major [n,1] index view. That cast mis-ordered the indices, scattering whole src rows in reversed order (dst row 0 received src[15]). Keep the entire flat-index computation in i32 (identical to the FP32 path) and narrow only the finished row-major [n,d] flat_idx to the tscatter index width. The final cast runs on a 32-byte-aligned, row-major tile, which is both alignment-legal and correct.

Add a Scatter Update Lowering section to the ConvertTensorToTileOps pass doc (en + zh-cn): the flat-index expansion (flat_idx = index*d + c), the i32-compute / narrow-at-the-end rule, and the generated pto.tscatter + sel preserve-blend op sequence.

- CHECK that m*d fits in i16 for 2-byte dst (the tscatter index width), so an oversized FP16/BF16/INT16 dst raises a clear error instead of silently scattering to wrong rows on flat-index overflow.
- Reject 4D input/src in the lowering with a user-facing CHECK: 4D type-checks via the op's deduction but is not yet lowered, so it would otherwise hit an internal error.
- Add convert-pass tests covering both rejections.
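The i16 range guard described in this commit might look roughly like the sketch below. The exact bound used by the real CHECK is not shown on this page; the sketch assumes the condition is that the largest flat index, `m * d - 1`, must not exceed the signed 16-bit maximum. The function name and error text are hypothetical.

```python
I16_MAX = 2**15 - 1  # 32767, largest value representable in a signed 16-bit index


def check_flat_index_fits_i16(m, d):
    """Raise if a flat index into an [m, d] destination cannot be stored as i16."""
    max_flat = m * d - 1  # assumption: the largest index is the last element
    if max_flat > I16_MAX:
        raise ValueError(
            f"scatter_update: dst [{m}, {d}] needs flat index {max_flat}, "
            f"which exceeds the i16 limit {I16_MAX}"
        )


check_flat_index_fits_i16(32, 32)      # 1024 elements, accepted
check_flat_index_fits_i16(256, 128)    # 32768 elements, max index 32767, accepted

try:
    check_flat_index_fits_i16(512, 128)
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

Failing early in the conversion pass is what turns a silent wrong-row scatter into a diagnosable error, as the commit message describes.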

## Summary

`pto.tcvt` (the lowering of `tile.cast`) silently **mis-orders elements when its source tile is `col_major`**, e.g. a reshaped `[n, 1]` index vector narrowed `i32 -> i16`. The same cast on a `row_major` source is correct, so the failure is silent wrong output with no diagnostic. This is what produced reversed scatter rows in the FP16 `tensor.scatter_update` lowering (issue #1549).

PyPTO already drives this exact class of ISA constraint through the `ResolveBackendOpLayouts` pass, which reshapes a `[n, 1] col_major` vector to `[1, n] row_major` around a constrained op and restores the layout afterwards. `tile.cast` simply had **no layout spec**, so it was never repaired.

### Changes

- **`src/backend/common/pto_ops_common.cpp`**: register `tile.cast` with `set_input_layout(0, row_major)` + `set_output_layout(row_major)`, mirroring `tile.rsqrt` / `tile.cmps` / `tile.sort32`. `ResolveBackendOpLayouts` now repairs every `col_major` caller generically. Row-major callers are unaffected (no repair, zero overhead).
- **`tests/ut/ir/transforms/test_resolve_backend_op_layouts_pass.py`**: pass-level regression for a `col_major [16, 1]` `i32 -> i16` cast being repaired through a `[1, 16] row_major` reshape.
- **`tests/st/runtime/ops/test_cast.py`**: new end-to-end ST: a `col_major [N, 1]` `i32 -> i16` narrow (the #1549 regression, must preserve element order) plus a `row_major [1, N]` control case.
- **`docs/en` + `docs/zh-cn` `20-resolve_backend_op_layouts.md`**: list `tile.cast` among the constrained ops.
- **`src/ir/transforms/op_conversion_registry.cpp`**: comment-only trim of the `scatter_update` lowering; its i32-compute / narrow-at-the-end design is kept for the alignment benefit. No behavior change: this path narrows only the **row-major `[n, d]`** flat index, so it never feeds a `col_major` source to `tile.cast`, and the new layout spec is a no-op for it.

## Testing

- [x] New + existing `ResolveBackendOpLayouts` UTs pass (5/5)
- [x] `tests/ut/ir/transforms/` + `tests/ut/codegen/`: 1687 passed, 26 skipped
- [x] `tests/ut/ir/operators/test_tile_ops.py`: 237 passed
- [x] Pre-commit hooks (clang-format, cpplint, ruff, pyright, markdownlint) pass
- [ ] On-device ST `tests/st/runtime/ops/test_cast.py::TestCast::test_tile_cast_col_major_narrow` (hardware, to be confirmed by reviewer): the direct col_major-cast regression for this fix.

> Note: `test_scatter_update.py::...::test_tile_scatter_update_fp16` is **not** a target of this PR. That path was already worked around in #1537 (narrow only the row-major `[n, d]` flat index), so this PR does not change its codegen and it needs no re-confirmation here.

## Related Issues

Fixes #1549

---------

Co-authored-by: Youhezhen <youhezhen@huawei.com>

github-project-automation Bot added this to pto project May 26, 2026

gemini-code-assist Bot reviewed May 26, 2026

Comment thread src/ir/transforms/op_conversion_registry.cpp Outdated

coderabbitai Bot reviewed May 26, 2026

Comment thread tests/st/runtime/cross_core/test_chained_matmul_cast.py Outdated

Comment thread tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py

Little-oil changed the title ~~fix(scatter_update): lower to whole-row tile.scatter, drop orch memcpy (#1490)~~ fix(codegen): lower scatter_update to tile.scatter, remove orch memcpy May 26, 2026

Youhezhen and others added 9 commits May 26, 2026 18:12

test: remove test_chained_matmul_cast.py

c738c0f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ci: retrigger flaky system-tests (AICPU stream timeout cascade)

6b196b1

ci: retrigger (actions/checkout download infra failure)

c7a60fb

ci: retrigger system-tests

713d8b1

lyfne123 approved these changes May 27, 2026

lyfne123 merged commit 538f073 into hw-native-sys:main May 27, 2026
9 checks passed

This was referenced May 27, 2026

[Bug] tile.cast (pto.tcvt) narrowing mis-orders elements when the source tile is col_major #1549

Closed

fix(codegen): require row_major layout for tile.cast (pto.tcvt) #1559

Merged

lyfne123 merged 10 commits into hw-native-sys:main from Little-oil:fix/issue-1490-scatter-update-tscatter

2 participants
