fix(expand_mixed_kernel): reject GM-mediated cross-lane dependencies by lwDavid · Pull Request #1434 · hw-native-sys/pypto

lwDavid · 2026-05-20T14:12:45Z

Summary

ExpandMixedKernel detects cross-CV boundaries via tile.move across the cube/vec memory boundary and inserts tpush/tpop handshakes for them. It does not recognise the GM-mediated cross-lane pattern produced by tile.store -> GM tensor on one lane followed by tile.load <- same tensor on the other lane inside the same mixed-root scope. The two ops are independently CUBE- and VECTOR-affine, so neither is classified as a CV boundary; the pass splits the body and the resulting AIC/AIV kernels run in parallel sharing the GM region with no fence between them.

On device this surfaces as a real data race. With torch.manual_seed(0) and two consecutive runs of models/qwen3/14b/qwen3_14b_decode_mix.py (a fused Flash Attention scope using exactly this pattern), the dumped device outputs disagree on 2796/65536 elements (about 4.3%) with max delta ~0.00195 (~1 ULP for BF16 just below 0.5). The equivalent un-fused decode_layer.py (four independent pl.at scopes) is bit-identical across runs. Running grep -cE 'tpush|tpop|aic_initialize_pipe|aiv_initialize_pipe' on fa_fused.pto and fa_fused__windowed.pto reports 0 matches for both kernels, versus 4 for a sibling scope that has an explicit tile.move CV boundary.
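The figures quoted above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, assuming the standard BF16 layout (7 explicit mantissa bits) and normal-range values only; `bf16_ulp` is a hypothetical helper written for this check, not part of PyPTO.

```python
import math


def bf16_ulp(x: float) -> float:
    """Spacing between adjacent BF16 values at magnitude x (normal range only).

    Assumes the standard BF16 layout with 7 explicit mantissa bits, so values
    in [2**e, 2**(e+1)) are spaced 2**(e - 7) apart.
    """
    # frexp returns (m, exp) with x = m * 2**exp and 0.5 <= |m| < 1,
    # so floor(log2(|x|)) == exp - 1.
    _, exp = math.frexp(abs(x))
    return math.ldexp(1.0, (exp - 1) - 7)


# Fraction of elements that differed between the two seeded runs.
mismatch_fraction = 2796 / 65536

# ULP for values in [0.25, 0.5) versus values in [0.5, 1.0).
ulp_below_half = bf16_ulp(0.4)
ulp_at_half = bf16_ulp(0.5)

print(f"mismatch fraction: {mismatch_fraction:.4%}")
print(f"BF16 ULP in [0.25, 0.5): {ulp_below_half}")
print(f"BF16 ULP in [0.5, 1.0):  {ulp_at_half}")
```

Under these assumptions the reported max delta of ~0.00195 matches one BF16 step for values in [0.25, 0.5), which is consistent with a last-bit disagreement rather than a gross numerical error.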

Detect the pattern at the start of ExpandMixedFunction (right after affinity analysis) and refuse with a user-facing ValueError that names the offending tensor, the store / load source locations, and points to the tracking issue. This converts the silent on-device race into a loud compile-time failure. Proper cross-lane sync-insertion is tracked separately by #1433; this PR is the diagnostic foundation it can build on.

Closes / partially addresses #1433.

What's in this PR

src/ir/transforms/expand_mixed_kernel_pass.cpp: new DetectCrossLaneGmDeps walker collects (tile.store result var, lane affinity) pairs and, for each subsequent tile.load whose first arg points to one of those vars with a different lane affinity, records a cross-lane GM dependency. Invoked from ExpandMixedFunction after AnalyzeStmtsAffinity; non-empty result raises ValueError via CHECK(false) << ... with a clear, actionable message.
tests/ut/ir/transforms/test_expand_mixed_kernel_a2a3.py: regression test_cross_lane_gm_dependency_rejected that exercises the minimal AIC-store + AIV-load pattern and asserts the ValueError mentions the tensor name, both lanes, and the tracking issue.
docs/en/dev/passes/20-expand_mixed_kernel.md + docs/zh-cn/dev/passes/20-expand_mixed_kernel.md: new "Cross-lane GM dependency check" subsection documenting the new check and recommending the multi-pl.at restructure (as decode_layer.py does for Flash Attention) as the current workaround.

The recursive walker handles ForStmt / IfStmt / WhileStmt body nesting using the same FlattenBody helper as the existing CollectCVBoundaryMoves.
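The detection described above can be sketched on a toy IR. This is an illustrative model only: `Op`, `Block`, `CrossLaneDep`, `detect_cross_lane_gm_deps`, and `check_scope` are hypothetical Python stand-ins, not the C++ `DetectCrossLaneGmDeps` implementation or any PyPTO API, and the error text is a made-up approximation of the real message.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union


@dataclass
class Op:
    """Toy stand-in for a tile op (hypothetical, not PyPTO's IR node)."""
    name: str                     # e.g. "tile.store" or "tile.load"
    lane: str                     # lane affinity: "CUBE" or "VECTOR"
    result: str                   # SSA variable defined by this op
    args: Tuple[str, ...] = ()    # operand variable names
    span: str = "?"               # source location such as "491:28"


@dataclass
class Block:
    """Toy stand-in for a nested ForStmt / IfStmt / WhileStmt body."""
    body: List[Union["Op", "Block"]] = field(default_factory=list)


@dataclass
class CrossLaneDep:
    tensor: str
    store_lane: str
    store_span: str
    load_lane: str
    load_span: str


def detect_cross_lane_gm_deps(
    stmts: List[Union[Op, Block]],
    stores: Optional[Dict[str, Tuple[str, str]]] = None,
) -> List[CrossLaneDep]:
    """Walk statements once in program order, recursing into nested bodies."""
    if stores is None:
        stores = {}  # store result var -> (lane, span)
    deps: List[CrossLaneDep] = []
    for stmt in stmts:
        if isinstance(stmt, Block):
            deps.extend(detect_cross_lane_gm_deps(stmt.body, stores))
        elif stmt.name == "tile.store":
            stores[stmt.result] = (stmt.lane, stmt.span)
        elif stmt.name == "tile.load" and stmt.args and stmt.args[0] in stores:
            store_lane, store_span = stores[stmt.args[0]]
            if store_lane != stmt.lane:
                deps.append(CrossLaneDep(stmt.args[0], store_lane, store_span,
                                         stmt.lane, stmt.span))
    return deps


def check_scope(stmts: List[Union[Op, Block]]) -> None:
    """Refuse the scope with a ValueError if any cross-lane GM dependency exists."""
    deps = detect_cross_lane_gm_deps(stmts)
    if deps:
        d = deps[0]
        raise ValueError(
            f"cross-lane GM dependency on '{d.tensor}': {d.store_lane} tile.store "
            f"at {d.store_span}, {d.load_lane} tile.load at {d.load_span} "
            f"({len(deps)} total); see #1433"
        )
```

A scope with a CUBE `tile.store` followed by a nested VECTOR `tile.load` of the same variable is rejected by `check_scope`, while the same pair on a single lane passes through unchanged.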

Test plan

python3 -m pytest tests/ut/ir/transforms/ -k expand_mixed -x -q — 63 passed (62 existing + 1 new), 0 regressions
python3 -m pytest tests/ut/ir/transforms/ -x -q — 1305 passed, 25 skipped, no regressions
models/qwen3/14b/qwen3_14b_decode_mix.py (pypto-lib) now fails with the expected ValueError pointing at line 491:28 (store) and 492:28 (load), naming the offending all_raw_scores__tile tensor.
models/qwen3/14b/decode_layer.py (pypto-lib, the bit-identical baseline) still compiles and runs to PASS on a2a3.

ExpandMixedKernel detects cross-CV boundaries via 'tile.move' across the cube/vec memory boundary and inserts tpush/tpop handshakes for them. It does not, however, recognise the GM-mediated cross-lane pattern produced by 'tile.store -> GM tensor' on one lane followed by 'tile.load <- same tensor' on the other lane inside the same mixed-root scope. The two ops are independently CUBE- and VECTOR-affine, so neither is classified as a CV boundary. The pass happily splits the body, and the resulting AIC/AIV kernels run in parallel sharing the GM region with no fence between them.

On device this surfaces as a real data race: same seeded inputs yield non-bit-identical outputs (~4% elements differing by ~1 ULP for the qwen3-14b fused Flash Attention scope in pypto-lib).

Detect the pattern at the start of ExpandMixedFunction (after affinity analysis) and refuse with a user-facing ValueError that names the offending tensor, the store / load source locations, and points to the tracking issue for proper sync-insertion. This converts the silent on-device race into a loud compile-time failure until the cross-lane fence machinery lands.

Add a regression test that exercises the minimal AIC-store + AIV-load pattern and asserts the ValueError, plus pass-doc sections (en + zh-cn) documenting the new check and the recommended workaround (split the scope so each data-flow phase lives in its own pl.at).

Tracks: hw-native-sys#1433

coderabbitai · 2026-05-20T14:13:02Z

Walkthrough

Adds a pre-split validation to ExpandMixedKernel that detects and rejects mixed-root scopes where one lane writes a GM tensor via tile.store and the opposite lane reads it via tile.load inside the same scope without synchronization; includes detection logic, an integration check that aborts compilation with details, docs, and a regression test.

Changes

Cross-lane GM dependency validation

Layer / File(s)	Summary
Documentation of cross-lane GM dependency rule `docs/en/dev/passes/20-expand_mixed_kernel.md`, `docs/zh-cn/dev/passes/20-expand_mixed_kernel.md`	English and Chinese sections document the rejected pattern: `tile.store` on one lane writing GM followed by `tile.load` on the opposite lane reading the same tensor in a single mixed-root scope triggers `ValueError` with tensor name and source locations; remediation is separating producer and consumer into different `pl.at` scope boundaries.
Cross-lane GM dependency detection walker `src/ir/transforms/expand_mixed_kernel_pass.cpp`	`CrossLaneGmDep` record and `DetectCrossLaneGmDeps` walker recursively scan statements to record `tile.store` result variables per-lane and detect `tile.load` operations reading those same SSA values from the opposite lane, aggregating unsafe dependency pairs.
Integration into ExpandMixedFunction `src/ir/transforms/expand_mixed_kernel_pass.cpp`	After affinity analysis, runs the walker on the mixed-root scope; if dependencies are found, derives lane names and aborts compilation with `CHECK(false)` including scope hint, source spans, SSA var, dependency count, and remediation guidance.
Regression test `tests/ut/ir/transforms/test_expand_mixed_kernel_a2a3.py`	`test_cross_lane_gm_dependency_rejected` validates that a scope with AIC `tile.store` writing GM and AIV `tile.load` reading the same buffer triggers `ValueError` with correct error message content.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

[Pass Bug] ExpandMixedKernel misses GM-mediated cross-lane data dependencies, leaving fused-scope kernels racing on shared GM scratch #1433: Cross-lane GM dependency detection and rejection directly addresses the missing safety check for race conditions between mixed-kernel lanes sharing GM tensors without synchronization.

Suggested labels

bug

Suggested reviewers

Hzfengsy
lyfne123

Poem

🐰 A walker hops through scopes so deep,
Finding lanes that cross the GM heap,
"Two reads, one write—but no sync!"—it cries,
With source spans and helpful advice:
Split thy blocks, let synchrony shine! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: adding detection and rejection of unsafe GM-mediated cross-lane dependencies in ExpandMixedKernel pass.
Description check	✅ Passed	The description clearly explains the problem, solution, files changed, and test results, all directly related to the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


gemini-code-assist

Code Review

This pull request introduces a cross-lane Global Memory (GM) dependency check within the ExpandMixedKernel pass to prevent data races between AIC and AIV kernels. The changes include the implementation of DetectCrossLaneGmDeps to identify unsafe tile.store and tile.load patterns, logic to trigger a compilation error when these patterns are found, and updated documentation and unit tests. Feedback suggests extending this detection to cover other memory operations like tile.mscatter and tile.mgather or documenting these as known limitations.

- Apply clang-format to DetectCrossLaneGmDeps and the CHECK message block so pre-commit passes (continuation lines align to the start of the streamed argument, not to a fixed indent).
- Document the limitation flagged in code review: the detector covers tile.store / tile.load (the contiguous pattern produced by pl.assemble + bracket-slice). Indexed GM transfers via tile.mscatter / tile.mgather can in principle exhibit the same race and are deferred to upstream issue hw-native-sys#1433 along with the proper sync-insertion fix.

coderabbitai

🧹 Nitpick comments (1)

src/ir/transforms/expand_mixed_kernel_pass.cpp (1)
263-318: 💤 Low value

Detection is order-sensitive within loop bodies.

The walker traverses loop bodies once in program order, so { CUBE_store(T); VEC_load(T); } is caught, but { VEC_load(T); CUBE_store(T); } is not—even though after splitting, both operations run in parallel across iterations and race on T.

This is acceptable for an initial implementation (the common pattern is store-before-load, and the workaround applies regardless), but consider adding a brief comment noting this limitation alongside the existing scope note about mscatter/mgather.
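The order sensitivity can be illustrated on a toy loop body. This is a minimal sketch under stated assumptions: ops are modelled as hypothetical `(kind, lane, tensor)` tuples, `detect_program_order` mimics a single program-order walk, and `detect_any_order` is one possible order-insensitive variant that is not part of this PR.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy op: (kind, lane, tensor), e.g. ("store", "CUBE", "T").
ToyOp = Tuple[str, str, str]


def detect_program_order(body: List[ToyOp]) -> Set[str]:
    """Single walk: a load is flagged only if an opposite-lane store came earlier."""
    stored: Dict[str, str] = {}  # tensor -> lane of the most recent store
    hits: Set[str] = set()
    for kind, lane, tensor in body:
        if kind == "store":
            stored[tensor] = lane
        elif kind == "load" and tensor in stored and stored[tensor] != lane:
            hits.add(tensor)
    return hits


def detect_any_order(body: List[ToyOp]) -> Set[str]:
    """Two walks: collect every store first, then flag opposite-lane loads."""
    store_lanes: Dict[str, Set[str]] = {}
    for kind, lane, tensor in body:
        if kind == "store":
            store_lanes.setdefault(tensor, set()).add(lane)
    hits: Set[str] = set()
    for kind, lane, tensor in body:
        if kind == "load" and any(l != lane for l in store_lanes.get(tensor, ())):
            hits.add(tensor)
    return hits


store_first = [("store", "CUBE", "T"), ("load", "VECTOR", "T")]
load_first = [("load", "VECTOR", "T"), ("store", "CUBE", "T")]

print(detect_program_order(store_first))
print(detect_program_order(load_first))
print(detect_any_order(load_first))
```

The single-walk version misses the load-before-store body, while the two-walk version flags both orderings. The trade-off is that the two-walk version is more conservative and may reject scopes the single walk would accept.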
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/ir/transforms/expand_mixed_kernel_pass.cpp` around lines 263 - 318, Add a
brief comment to DetectCrossLaneGmDeps noting that detection is order-sensitive
within loop bodies: because the function walks statements once in program order
(see FlattenBody recursion for ForStmt/IfStmt/WhileStmt), a pattern like {
VEC_load(T); CUBE_store(T); } inside the same loop will not be detected even
though splitting can make them race across iterations; this limitation is
analogous to the existing mscatter/mgather scope note—place the comment near the
top of DetectCrossLaneGmDeps (or adjacent to the recursion into compound
statements) so readers understand the single-pass/program-order limitation.

📥 Commits

Reviewing files that changed from the base of the PR and between 9d0b53f and 6a40c16.

📒 Files selected for processing (1)

src/ir/transforms/expand_mixed_kernel_pass.cpp

github-project-automation Bot added this to pto project May 20, 2026

gemini-code-assist Bot reviewed May 20, 2026

coderabbitai Bot reviewed May 21, 2026

lwDavid wants to merge 2 commits into hw-native-sys:main from lwDavid:fix-mixed-kernel-gm-sync
