fix(expand_mixed_kernel): reject GM-mediated cross-lane dependencies#1434
fix(expand_mixed_kernel): reject GM-mediated cross-lane dependencies#1434lwDavid wants to merge 2 commits into
Conversation
ExpandMixedKernel detects cross-CV boundaries via 'tile.move' across the cube/vec memory boundary and inserts tpush/tpop handshakes for them. It does not, however, recognise the GM-mediated cross-lane pattern produced by 'tile.store -> GM tensor' on one lane followed by 'tile.load <- same tensor' on the other lane inside the same mixed-root scope. The two ops are independently CUBE- and VECTOR-affine, so neither is classified as a CV boundary. The pass happily splits the body, and the resulting AIC/AIV kernels run in parallel sharing the GM region with no fence between them. On device this surfaces as a real data race: same seeded inputs yield non-bit-identical outputs (~4% elements differing by ~1 ULP for the qwen3-14b fused Flash Attention scope in pypto-lib). Detect the pattern at the start of ExpandMixedFunction (after affinity analysis) and refuse with a user-facing ValueError that names the offending tensor, the store / load source locations, and points to the tracking issue for proper sync-insertion. This converts the silent on-device race into a loud compile-time failure until the cross-lane fence machinery lands. Add a regression test that exercises the minimal AIC-store + AIV-load pattern and asserts the ValueError, plus pass-doc sections (en + zh-cn) documenting the new check and the recommended workaround (split the scope so each data-flow phase lives in its own pl.at). Tracks: hw-native-sys#1433
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a pre-split validation to ExpandMixedKernel that detects and rejects mixed-root scopes where one lane writes a GM tensor via ChangesCross-lane GM dependency validation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a cross-lane Global Memory (GM) dependency check within the ExpandMixedKernel pass to prevent data races between AIC and AIV kernels. The changes include the implementation of DetectCrossLaneGmDeps to identify unsafe tile.store and tile.load patterns, logic to trigger a compilation error when these patterns are found, and updated documentation and unit tests. Feedback suggests extending this detection to cover other memory operations like tile.mscatter and tile.mgather or documenting these as known limitations.
- Apply clang-format to DetectCrossLaneGmDeps and the CHECK message block so pre-commit passes (continuation lines align to the start of the streamed argument, not to a fixed indent). - Document the limitation flagged in code review: the detector covers tile.store / tile.load (the contiguous pattern produced by pl.assemble + bracket-slice). Indexed GM transfers via tile.mscatter / tile.mgather can in principle exhibit the same race and are deferred to upstream issue hw-native-sys#1433 along with the proper sync-insertion fix.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/ir/transforms/expand_mixed_kernel_pass.cpp (1)
263-318: 💤 Low valueDetection is order-sensitive within loop bodies.
The walker traverses loop bodies once in program order, so
{ CUBE_store(T); VEC_load(T); }is caught, but{ VEC_load(T); CUBE_store(T); }is not—even though after splitting, both operations run in parallel across iterations and race on T.This is acceptable for an initial implementation (the common pattern is store-before-load, and the workaround applies regardless), but consider adding a brief comment noting this limitation alongside the existing scope note about mscatter/mgather.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/ir/transforms/expand_mixed_kernel_pass.cpp` around lines 263 - 318, Add a brief comment to DetectCrossLaneGmDeps noting that detection is order-sensitive within loop bodies: because the function walks statements once in program order (see FlattenBody recursion for ForStmt/IfStmt/WhileStmt), a pattern like { VEC_load(T); CUBE_store(T); } inside the same loop will not be detected even though splitting can make them race across iterations; this limitation is analogous to the existing mscatter/mgather scope note—place the comment near the top of DetectCrossLaneGmDeps (or adjacent to the recursion into compound statements) so readers understand the single-pass/program-order limitation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/ir/transforms/expand_mixed_kernel_pass.cpp`:
- Around line 263-318: Add a brief comment to DetectCrossLaneGmDeps noting that
detection is order-sensitive within loop bodies: because the function walks
statements once in program order (see FlattenBody recursion for
ForStmt/IfStmt/WhileStmt), a pattern like { VEC_load(T); CUBE_store(T); } inside
the same loop will not be detected even though splitting can make them race
across iterations; this limitation is analogous to the existing mscatter/mgather
scope note—place the comment near the top of DetectCrossLaneGmDeps (or adjacent
to the recursion into compound statements) so readers understand the
single-pass/program-order limitation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: fad242b0-841d-4687-bb1c-3dd8dc8ed2d1
📒 Files selected for processing (1)
src/ir/transforms/expand_mixed_kernel_pass.cpp
Summary
ExpandMixedKerneldetects cross-CV boundaries viatile.moveacross the cube/vec memory boundary and insertstpush/tpophandshakes for them. It does not recognise the GM-mediated cross-lane pattern produced bytile.store -> GM tensoron one lane followed bytile.load <- same tensoron the other lane inside the same mixed-root scope. The two ops are independently CUBE- and VECTOR-affine, so neither is classified as a CV boundary; the pass splits the body and the resulting AIC/AIV kernels run in parallel sharing the GM region with no fence between them.On device this surfaces as a real data race. With
torch.manual_seed(0)and two consecutive runs ofmodels/qwen3/14b/qwen3_14b_decode_mix.py(a fused Flash Attention scope using exactly this pattern), the dumped device outputs disagree on 2796/65536 elements with max delta ~0.00195 (~1 ULP for BF16 around 0.5). The equivalent un-fuseddecode_layer.py(four independentpl.atscopes) is bit-identical across runs.grep -cE 'tpush|tpop|aic_initialize_pipe|aiv_initialize_pipe' fa_fused.pto fa_fused__windowed.pto= 0 on both kernels, vs. 4 on a sibling scope that has an explicittile.moveCV boundary.Detect the pattern at the start of
ExpandMixedFunction(right after affinity analysis) and refuse with a user-facingValueErrorthat names the offending tensor, the store / load source locations, and points to the tracking issue. This converts the silent on-device race into a loud compile-time failure. Proper cross-lane sync-insertion is tracked separately by #1433; this PR is the diagnostic foundation it can build on.Closes / partially addresses #1433.
What's in this PR
src/ir/transforms/expand_mixed_kernel_pass.cpp: newDetectCrossLaneGmDepswalker collects(tile.store result var, lane affinity)pairs and, for each subsequenttile.loadwhose first arg points to one of those vars with a different lane affinity, records a cross-lane GM dependency. Invoked fromExpandMixedFunctionafterAnalyzeStmtsAffinity; non-empty result raisesValueErrorviaCHECK(false) << ...with a clear, actionable message.tests/ut/ir/transforms/test_expand_mixed_kernel_a2a3.py: regressiontest_cross_lane_gm_dependency_rejectedthat exercises the minimal AIC-store + AIV-load pattern and asserts theValueErrormentions the tensor name, both lanes, and the tracking issue.docs/en/dev/passes/20-expand_mixed_kernel.md+docs/zh-cn/dev/passes/20-expand_mixed_kernel.md: new "Cross-lane GM dependency check" subsection documenting the new check and recommending the multi-pl.atrestructure (asdecode_layer.pydoes for Flash Attention) as the current workaround.The recursive walker handles
ForStmt/IfStmt/WhileStmtbody nesting using the sameFlattenBodyhelper as the existingCollectCVBoundaryMoves.Test plan
python3 -m pytest tests/ut/ir/transforms/ -k expand_mixed -x -q— 63 passed (62 existing + 1 new), 0 regressionspython3 -m pytest tests/ut/ir/transforms/ -x -q— 1305 passed, 25 skipped, no regressionsmodels/qwen3/14b/qwen3_14b_decode_mix.py(pypto-lib) now fails with the expectedValueErrorpointing at line 491:28 (store) and 492:28 (load), naming the offendingall_raw_scores__tiletensor.models/qwen3/14b/decode_layer.py(pypto-lib, the bit-identical baseline) still compiles and runs to PASS on a2a3.