fix(expand_mixed_kernel): reject GM-mediated cross-lane dependencies #1434

Open
lwDavid wants to merge 2 commits into hw-native-sys:main from lwDavid:fix-mixed-kernel-gm-sync

Conversation


@lwDavid lwDavid commented May 20, 2026

Summary

ExpandMixedKernel detects cross-CV boundaries via tile.move across the cube/vec memory boundary and inserts tpush/tpop handshakes for them. It does not recognise the GM-mediated cross-lane pattern produced by tile.store -> GM tensor on one lane followed by tile.load <- same tensor on the other lane inside the same mixed-root scope. The two ops are independently CUBE- and VECTOR-affine, so neither is classified as a CV boundary; the pass splits the body and the resulting AIC/AIV kernels run in parallel sharing the GM region with no fence between them.

On device this surfaces as a real data race. With torch.manual_seed(0) and two consecutive runs of models/qwen3/14b/qwen3_14b_decode_mix.py (a fused Flash Attention scope using exactly this pattern), the dumped device outputs disagree on 2796/65536 elements (about 4.3%) with a max delta of ~0.00195 (~1 ULP for BF16 values just below 0.5). The equivalent un-fused decode_layer.py (four independent pl.at scopes) is bit-identical across runs. Running grep -cE 'tpush|tpop|aic_initialize_pipe|aiv_initialize_pipe' on fa_fused.pto and fa_fused__windowed.pto reports 0 matches for both kernels, versus 4 for a sibling scope that has an explicit tile.move CV boundary.
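The figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: `bf16_ulp` is a hypothetical helper (not part of pypto) that assumes BF16's 8 significant bits and ignores subnormals.

```python
import math


def bf16_ulp(x: float) -> float:
    """Spacing between adjacent BF16 values near x (normal range only).

    Hypothetical helper: BF16 keeps 8 significant bits
    (1 implicit bit + 7 stored mantissa bits).
    """
    # frexp gives abs(x) = m * 2**exp with 0.5 <= m < 1,
    # so the binary exponent of the leading bit is exp - 1.
    _, exp = math.frexp(abs(x))
    return math.ldexp(1.0, (exp - 1) - 7)


# Fraction of mismatching elements reported in the summary.
mismatch_fraction = 2796 / 65536

print(f"{mismatch_fraction:.2%}")  # 4.27%
print(bf16_ulp(0.49))              # 0.001953125 (2**-9)
print(bf16_ulp(0.51))              # 0.00390625  (2**-8)
```

Under these assumptions the reported max delta of ~0.00195 matches one BF16 step for values in [0.25, 0.5), which is consistent with a last-bit difference rather than a gross miscomputation.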

Detect the pattern at the start of ExpandMixedFunction (right after affinity analysis) and refuse with a user-facing ValueError that names the offending tensor, the store / load source locations, and points to the tracking issue. This converts the silent on-device race into a loud compile-time failure. Proper cross-lane sync-insertion is tracked separately by #1433; this PR is the diagnostic foundation it can build on.

Partially addresses #1433.

What's in this PR

  • src/ir/transforms/expand_mixed_kernel_pass.cpp: new DetectCrossLaneGmDeps walker collects (tile.store result var, lane affinity) pairs and, for each subsequent tile.load whose first arg points to one of those vars with a different lane affinity, records a cross-lane GM dependency. Invoked from ExpandMixedFunction after AnalyzeStmtsAffinity; non-empty result raises ValueError via CHECK(false) << ... with a clear, actionable message.
  • tests/ut/ir/transforms/test_expand_mixed_kernel_a2a3.py: regression test_cross_lane_gm_dependency_rejected that exercises the minimal AIC-store + AIV-load pattern and asserts the ValueError mentions the tensor name, both lanes, and the tracking issue.
  • docs/en/dev/passes/20-expand_mixed_kernel.md + docs/zh-cn/dev/passes/20-expand_mixed_kernel.md: new "Cross-lane GM dependency check" subsection documenting the new check and recommending the multi-pl.at restructure (as decode_layer.py does for Flash Attention) as the current workaround.

The recursive walker handles ForStmt / IfStmt / WhileStmt body nesting using the same FlattenBody helper as the existing CollectCVBoundaryMoves.
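The detection logic described above can be sketched as a toy model in Python. This is not the C++ implementation: `Op`, `detect_cross_lane_gm_deps`, `check_scope`, and the error wording are hypothetical stand-ins, and a tensor name takes the place of the SSA result var that the real walker tracks.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Toy IR node (hypothetical; the real pass walks C++ IR statements)."""
    kind: str                      # "tile.store", "tile.load", or a compound tag such as "for"
    lane: str = ""                 # "CUBE" or "VECTOR" affinity
    tensor: str = ""               # stands in for the tile.store result var
    span: str = ""                 # source location, e.g. "491:28"
    body: list = field(default_factory=list)


def detect_cross_lane_gm_deps(stmts, stores=None, deps=None):
    """Walk statements in program order and pair stores with opposite-lane loads."""
    stores = {} if stores is None else stores
    deps = [] if deps is None else deps
    for op in stmts:
        if op.body:
            # Recurse into for / if / while bodies, sharing the same state.
            detect_cross_lane_gm_deps(op.body, stores, deps)
        elif op.kind == "tile.store":
            stores[op.tensor] = op
        elif op.kind == "tile.load":
            producer = stores.get(op.tensor)
            if producer is not None and producer.lane != op.lane:
                deps.append((producer, op))
    return deps


def check_scope(stmts):
    """Raise a ValueError naming the tensor, both lanes, and both locations."""
    deps = detect_cross_lane_gm_deps(stmts)
    if deps:
        store, load = deps[0]
        raise ValueError(
            f"cross-lane GM dependency on '{store.tensor}': "
            f"{store.lane} tile.store at {store.span} feeds "
            f"{load.lane} tile.load at {load.span} "
            f"({len(deps)} found); see #1433"
        )
```

A nested scope such as `[Op("for", body=[Op("tile.store", "CUBE", "t", "1:1"), Op("tile.load", "VECTOR", "t", "2:1")])]` yields one dependency, while a store and load on the same lane yield none.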

Test plan

  • python3 -m pytest tests/ut/ir/transforms/ -k expand_mixed -x -q — 63 passed (62 existing + 1 new), 0 regressions
  • python3 -m pytest tests/ut/ir/transforms/ -x -q — 1305 passed, 25 skipped, no regressions
  • models/qwen3/14b/qwen3_14b_decode_mix.py (pypto-lib) now fails with the expected ValueError pointing at line 491:28 (store) and 492:28 (load), naming the offending all_raw_scores__tile tensor.
  • models/qwen3/14b/decode_layer.py (pypto-lib, the bit-identical baseline) still compiles and runs to PASS on a2a3.

ExpandMixedKernel detects cross-CV boundaries via 'tile.move' across the
cube/vec memory boundary and inserts tpush/tpop handshakes for them. It
does not, however, recognise the GM-mediated cross-lane pattern produced
by 'tile.store -> GM tensor' on one lane followed by 'tile.load <- same
tensor' on the other lane inside the same mixed-root scope.

The two ops are independently CUBE- and VECTOR-affine, so neither is
classified as a CV boundary. The pass happily splits the body, and the
resulting AIC/AIV kernels run in parallel sharing the GM region with no
fence between them. On device this surfaces as a real data race: the
same seeded inputs yield non-bit-identical outputs (~4% of elements
differ by ~1 ULP for the qwen3-14b fused Flash Attention scope in
pypto-lib).

Detect the pattern at the start of ExpandMixedFunction (after affinity
analysis) and refuse with a user-facing ValueError that names the
offending tensor, the store / load source locations, and points to the
tracking issue for proper sync-insertion. This converts the silent
on-device race into a loud compile-time failure until the cross-lane
fence machinery lands.

Add a regression test that exercises the minimal AIC-store + AIV-load
pattern and asserts the ValueError, plus pass-doc sections (en + zh-cn)
documenting the new check and the recommended workaround (split the
scope so each data-flow phase lives in its own pl.at).

Tracks: hw-native-sys#1433

coderabbitai Bot commented May 20, 2026

Walkthrough

Adds a pre-split validation to ExpandMixedKernel that detects and rejects mixed-root scopes where one lane writes a GM tensor via tile.store and the opposite lane reads it via tile.load inside the same scope without synchronization; includes detection logic, an integration check that aborts compilation with details, docs, and a regression test.

Changes

Cross-lane GM dependency validation

| Layer / File(s) | Summary |
| --- | --- |
| Documentation of cross-lane GM dependency rule (docs/en/dev/passes/20-expand_mixed_kernel.md, docs/zh-cn/dev/passes/20-expand_mixed_kernel.md) | English and Chinese sections document the rejected pattern: tile.store on one lane writing GM followed by tile.load on the opposite lane reading the same tensor in a single mixed-root scope triggers ValueError with tensor name and source locations; remediation is separating producer and consumer into different pl.at scope boundaries. |
| Cross-lane GM dependency detection walker (src/ir/transforms/expand_mixed_kernel_pass.cpp) | CrossLaneGmDep record and DetectCrossLaneGmDeps walker recursively scan statements to record tile.store result variables per-lane and detect tile.load operations reading those same SSA values from the opposite lane, aggregating unsafe dependency pairs. |
| Integration into ExpandMixedFunction (src/ir/transforms/expand_mixed_kernel_pass.cpp) | After affinity analysis, runs the walker on the mixed-root scope; if dependencies are found, derives lane names and aborts compilation with CHECK(false) including scope hint, source spans, SSA var, dependency count, and remediation guidance. |
| Regression test (tests/ut/ir/transforms/test_expand_mixed_kernel_a2a3.py) | test_cross_lane_gm_dependency_rejected validates that a scope with AIC tile.store writing GM and AIV tile.load reading the same buffer triggers ValueError with correct error message content. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

bug

Suggested reviewers

  • Hzfengsy
  • lyfne123

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding detection and rejection of unsafe GM-mediated cross-lane dependencies in ExpandMixedKernel pass. |
| Description check | ✅ Passed | The description clearly explains the problem, solution, files changed, and test results, all directly related to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a cross-lane Global Memory (GM) dependency check within the ExpandMixedKernel pass to prevent data races between AIC and AIV kernels. The changes include the implementation of DetectCrossLaneGmDeps to identify unsafe tile.store and tile.load patterns, logic to trigger a compilation error when these patterns are found, and updated documentation and unit tests. Feedback suggests extending this detection to cover other memory operations like tile.mscatter and tile.mgather or documenting these as known limitations.

Comment thread src/ir/transforms/expand_mixed_kernel_pass.cpp
- Apply clang-format to DetectCrossLaneGmDeps and the CHECK message
  block so pre-commit passes (continuation lines align to the start of
  the streamed argument, not to a fixed indent).
- Document the limitation flagged in code review: the detector covers
  tile.store / tile.load (the contiguous pattern produced by
  pl.assemble + bracket-slice). Indexed GM transfers via
  tile.mscatter / tile.mgather can in principle exhibit the same race
  and are deferred to upstream issue hw-native-sys#1433 along with the proper
  sync-insertion fix.

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/ir/transforms/expand_mixed_kernel_pass.cpp (1)

263-318: 💤 Low value

Detection is order-sensitive within loop bodies.

The walker traverses loop bodies once in program order, so { CUBE_store(T); VEC_load(T); } is caught, but { VEC_load(T); CUBE_store(T); } is not—even though after splitting, both operations run in parallel across iterations and race on T.

This is acceptable for an initial implementation (the common pattern is store-before-load, and the workaround applies regardless), but consider adding a brief comment noting this limitation alongside the existing scope note about mscatter/mgather.
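The order sensitivity can be illustrated with a small, self-contained Python sketch. Both functions are hypothetical toy models over `(kind, lane, tensor)` tuples: `single_pass` mirrors the program-order behaviour described in this comment, while `order_insensitive` is one possible alternative (collect every store first, then check loads) and is not what the PR implements.

```python
def single_pass(body):
    """Program-order scan: a load is flagged only if a store was seen earlier."""
    stores, deps = {}, []
    for kind, lane, tensor in body:
        if kind == "store":
            stores[tensor] = lane
        elif kind == "load" and tensor in stores and stores[tensor] != lane:
            deps.append(tensor)
    return deps


def order_insensitive(body):
    """Two-phase scan (assumed variant): gather all store lanes, then check loads."""
    store_lanes = {}
    for kind, lane, tensor in body:
        if kind == "store":
            store_lanes.setdefault(tensor, set()).add(lane)
    # A load conflicts if any store to the same tensor comes from another lane.
    return [
        tensor
        for kind, lane, tensor in body
        if kind == "load" and store_lanes.get(tensor, set()) - {lane}
    ]


store_first = [("store", "CUBE", "T"), ("load", "VECTOR", "T")]
load_first = [("load", "VECTOR", "T"), ("store", "CUBE", "T")]

print(single_pass(store_first))        # ['T']
print(single_pass(load_first))         # []
print(order_insensitive(load_first))   # ['T']
```

The two-phase variant trades precision for coverage: it would also flag a load that can never observe the later store, so whether it is appropriate depends on how conservative the check is meant to be.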

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/ir/transforms/expand_mixed_kernel_pass.cpp` around lines 263 - 318, Add a
brief comment to DetectCrossLaneGmDeps noting that detection is order-sensitive
within loop bodies: because the function walks statements once in program order
(see FlattenBody recursion for ForStmt/IfStmt/WhileStmt), a pattern like {
VEC_load(T); CUBE_store(T); } inside the same loop will not be detected even
though splitting can make them race across iterations; this limitation is
analogous to the existing mscatter/mgather scope note—place the comment near the
top of DetectCrossLaneGmDeps (or adjacent to the recursion into compound
statements) so readers understand the single-pass/program-order limitation.


📥 Commits

Reviewing files that changed from the base of the PR and between 9d0b53f and 6a40c16.

📒 Files selected for processing (1)
  • src/ir/transforms/expand_mixed_kernel_pass.cpp
