feat(pass): compress manual_scope phase-fence deps with ExpandManualPhaseFence #1545
Leaf-Salix wants to merge 8 commits into
📝 Walkthrough
This PR implements a new IR transform pass that optimizes manual-scope task dependencies by inserting explicit dummy barrier tasks to compress large all-to-all dependency arrays. The pass analyzes manual-dependency patterns, inserts
Changes
Manual Phase Fence Dependency Compression
Sequence Diagram
sequenceDiagram
participant OrchestrationFunc as Orchestration<br/>Function Body
participant ExpandPass as ExpandManualPhaseFence<br/>IR Pass
participant ManualScope as RuntimeScopeStmt<br/>(manual=true)
participant ForLoop as ForStmt<br/>(parallel)
participant Barrier as system.task_dummy<br/>Assignment
participant Consumer as Consumer<br/>Call
OrchestrationFunc->>ExpandPass: Transform program
ExpandPass->>ManualScope: Analyze within manual scope
ManualScope->>ForLoop: Check loop for profitable deps
ForLoop->>ExpandPass: Identify manual_dep_edges array
ExpandPass->>ExpandPass: Count producers/consumers,<br/>estimate edge savings
alt Profitable
ExpandPass->>Barrier: Insert phase_fence_barrier_*_tid<br/>assignment with manual_dep_edges
Barrier->>Consumer: Rewrite consumer manual_dep_edges<br/>to reference barrier variable
else Not Profitable
ExpandPass->>ForLoop: Keep original dependency arrays
end
ExpandPass-->>OrchestrationFunc: Return transformed program
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Code Review
This pull request introduces a new compiler pass, ExpandManualPhaseFence, designed to optimize manual phase fences by compressing multi-dependency fanouts into synthetic dummy barrier tasks (system.task_dummy). It updates the orchestration codegen to lower these dummy tasks to rt_submit_dummy_task, registers the pass in the Python bindings and pass manager, and adds comprehensive correctness and swimlane tests. The review feedback identifies two main improvement opportunities: removing the unused helper function WithBoolAttr in the new pass file, and optimizing sequential loops by placing barriers for loop-invariant dependencies outside the loop body to avoid redundant dummy task submissions on every iteration.
static std::vector<std::pair<std::string, std::any>> WithBoolAttr(
    std::vector<std::pair<std::string, std::any>> attrs, const std::string& key, bool value) {
  for (auto& [k, v] : attrs) {
    if (k == key) {
      v = value;
      return attrs;
    }
  }
  attrs.emplace_back(key, value);
  return attrs;
}
if (!decisions.empty() && op->kind_ != ForKind::Parallel) {
  auto body_stmts = FlattenToVector(body_with_current_rewrites);
  std::vector<StmtPtr> with_barriers;
  with_barriers.reserve(decisions.size() + body_stmts.size());
  for (const auto& decision : decisions) {
    with_barriers.push_back(decision.barrier_stmt);
  }
  with_barriers.insert(with_barriers.end(), body_stmts.begin(), body_stmts.end());
  body_with_current_rewrites = MakeSeqOrStmt(std::move(with_barriers), op->span_);
}
For sequential loops (op->kind_ != ForKind::Parallel), placing the barrier inside the loop body causes a dummy task to be submitted on every single iteration of the loop. While this is necessary for loop-carried dependencies (which change per iteration), it is highly inefficient for loop-invariant dependencies (which are constant across all iterations). For loop-invariant dependencies, the barrier should be placed outside the sequential loop, just like it is for parallel loops. This would reduce the number of dummy task submissions from O(trip_count) to O(1) and significantly lower runtime overhead.
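To make the O(trip_count) versus O(1) claim concrete, here is a toy Python model of how many dummy-task submissions one sequential loop would issue. It is only a sketch: `dummy_submissions` and its parameters are hypothetical names for illustration and do not correspond to anything in the pass.

```python
def dummy_submissions(trip_count: int, loop_invariant: bool, hoist_invariant: bool) -> int:
    """Count barrier (dummy-task) submissions for one sequential loop.

    Toy model only: it mirrors the reviewer's argument, not the pass itself.
    """
    if loop_invariant and hoist_invariant:
        # The dependency array never changes, so one barrier before the loop suffices.
        return 1
    # Loop-carried deps (or no hoisting): the barrier is re-submitted per iteration.
    return trip_count


if __name__ == "__main__":
    for hoist in (False, True):
        print("hoist =", hoist, "->", dummy_submissions(1000, True, hoist), "submissions")
```

Loop-carried dependencies still need one barrier per iteration in this model, which matches the distinction drawn in the comment above.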
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef45dc562d
for (const auto& dep_array : CollectDirectManualDepArrays(body)) {
  if (!dep_array || BodyUpdatesArray(body, dep_array.get())) continue;
  int64_t consumers = CountManualDepConsumersOnArray(body, dep_array.get());
  if (is_parallel) consumers *= trip_count;
  try_add(dep_array, dep_array, consumers);
Guard barrier hoisting against loop-local dep arrays
When BuildDecisions scans CollectDirectManualDepArrays(body), it treats every matching array as hoistable in parallel loops, but VisitStmt_(ForStmtPtr) later always places the barrier before the loop. If a manual_dep_edges array is introduced inside the loop body (for example via an alias/rebind var used by consumers), the synthesized system.task_dummy will reference a value that is not defined at the hoist point. In that case codegen can compute zero barrier deps and leave the barrier TaskId invalid, so rewritten consumers lose their intended dependency edges. Please restrict this path to dep-array vars that dominate the loop header (or keep such cases uncompressed).
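The requested guard can be pictured with a small Python toy. This is a sketch under stated assumptions: `ToyLoop` and `can_hoist_barrier` are hypothetical, and the real pass would need a proper dominance query on the IR rather than name sets.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyLoop:
    """Hypothetical stand-in for a loop: names visible at the header vs. bound in the body."""

    defined_before: set[str]
    defined_in_body: set[str] = field(default_factory=set)


def can_hoist_barrier(loop: ToyLoop, dep_array: str) -> bool:
    """A barrier placed before the loop may only reference values defined before it."""
    return dep_array in loop.defined_before and dep_array not in loop.defined_in_body


if __name__ == "__main__":
    loop = ToyLoop(defined_before={"tids"}, defined_in_body={"alias"})
    print(can_hoist_barrier(loop, "tids"))   # array defined outside the loop
    print(can_hoist_barrier(loop, "alias"))  # alias introduced inside the loop body
```

In this model an alias bound inside the loop body is rejected, so such a case would stay uncompressed, which is one of the two options the comment proposes.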
Actionable comments posted: 3
🧹 Nitpick comments (3)
tests/ut/ir/transforms/test_expand_manual_phase_fence.py (1)
292-298: 💤 Low value
Missing `pytest.main` entry point.
The file is missing the standard `if __name__ == "__main__"` block for direct test execution that other test files in the repository have.
Based on learnings, prefer invoking pytest programmatically via `pytest.main([__file__, "-v"])` inside the `if __name__ == "__main__"` guard for consistent test execution across test files.
🔧 Suggested addition at end of file
 def test_pure_range_consumer_does_not_insert_dummy():
     tids = ir.Var("tids", ir.ArrayType(DataType.TASK_ID, 4), S)
     before = _program_with_loop(_consumer("a", [tids]), kind=ir.ForKind.Sequential)
     after = _run(before)
     first_call = cast(ir.Call, cast(ir.AssignStmt, _loop_body_stmts(after)[0]).value)
     assert first_call.op.name != "system.task_dummy"
+
+
+if __name__ == "__main__":
+    import pytest
+    pytest.main([__file__, "-v"])
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/ut/ir/transforms/test_expand_manual_phase_fence.py` around lines 292 - 298, Add the standard "__main__" pytest entry so this test file can be executed directly: add an if __name__ == "__main__" guard and call pytest.main([__file__, "-v"]). Ensure you import pytest at top if missing; reference the existing test_pure_range_consumer_does_not_insert_dummy test and other tests in this file so running the module executes pytest for this file.
tests/ut/codegen/test_phase_fence_dep_compression.py (1)
22-37: ⚡ Quick win
Avoid re-running phase-fence transforms after the default pipeline.
`_compile_program()` already runs the default pass strategy, then `_generate_orch_code()` applies `derive_call_directions` + `expand_manual_phase_fence` again. That can hide pipeline-wiring regressions and makes this suite assert a double-pass flow.
Suggested helper split
-def _generate_orch_code(program) -> str:
-    program = passes.derive_call_directions()(program)
-    program = passes.expand_manual_phase_fence()(program)
+def _generate_orch_code(program, *, apply_phase_fence_passes: bool = False) -> str:
+    if apply_phase_fence_passes:
+        program = passes.derive_call_directions()(program)
+        program = passes.expand_manual_phase_fence()(program)
     for func in program.functions.values():
         if func.func_type == ir.FunctionType.Orchestration:
             return codegen.generate_orchestration(program, func).code
     raise ValueError("No orchestration function found in program")
 def _compile_program(program_cls) -> str:
@@
     transformed = pm.run_passes(program_cls)
-    return _generate_orch_code(transformed)
+    return _generate_orch_code(transformed)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/ut/codegen/test_phase_fence_dep_compression.py` around lines 22 - 37, The test helper double-runs phase-fence transforms: _compile_program runs the default pass pipeline (PassManager.get_strategy/pm.run_passes) and then _generate_orch_code re-applies derive_call_directions() and expand_manual_phase_fence(), masking regressions that depend on a single-pass flow. Fix by removing the redundant transforms from _generate_orch_code: have _compile_program produce a fully transformed program (using pm.run_passes) and make _generate_orch_code simply locate the orchestration function and call codegen.generate_orchestration(program, func). Update references to derive_call_directions and expand_manual_phase_fence accordingly so they are only invoked via the default pipeline, not again in _generate_orch_code.
tests/ut/codegen/test_orchestration_codegen.py (1)
3936-3941: ⚡ Quick win
Make task-deps assertions resilient to task index renumbering.
These checks pin `params_t0`/`params_t1` directly, which is brittle when non-semantic codegen changes renumber tasks. Prefer `params_t\d+` regex here, like other tests in this file.
Suggested adjustment
-    assert "PTO2TaskId params_t0_deps[1];" in code, code
+    assert re.search(r"PTO2TaskId params_t\d+_deps\[1\];", code), code
 ...
-    assert "PTO2TaskId params_t1_deps[5];" in code, code
+    assert re.search(r"PTO2TaskId params_t\d+_deps\[5\];", code), code
 ...
-    assert "PTO2TaskId params_t0_deps[4];" in code, code
+    assert re.search(r"PTO2TaskId params_t\d+_deps\[4\];", code), code
Also applies to: 4026-4027, 4071-4072
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/ut/codegen/test_orchestration_codegen.py` around lines 3936 - 3941, The assertions hard-code task numbers (e.g., "params_t0_deps" / "params_t1_deps") which breaks when task indices are renumbered; change them to use regex-based checks that match params_t\d+ (for example use re.search on patterns like r'PTO2TaskId\s+params_t\d+_deps\[\d+\];' and r'if\s+\(phase_fence_barrier_0_tid\.is_valid\(\)\)\s+params_t\d+\_deps\[params_t\d+_deps_count\+\+\]\s*=' ) so the tests verify the same semantics without binding to a specific task index (also update the similar assertions at the other occurrences noted around lines 4026-4027 and 4071-4072); keep the negative check for ABOVE_LEGACY_CAP as-is but apply the params_t\d+ pattern where task-specific names are asserted.
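A quick way to see why the regex form survives renumbering is the standalone Python sketch below. The `declares_deps_array` helper is hypothetical and exists only for illustration; the pattern itself is the one from the suggested adjustment above.

```python
import re

# Pattern taken from the suggested adjustment: any task index, deps array of size 5.
DEPS_DECL = re.compile(r"PTO2TaskId params_t\d+_deps\[5\];")


def declares_deps_array(code: str) -> bool:
    """Return True if the generated code declares a 5-entry deps array for any task."""
    return DEPS_DECL.search(code) is not None


if __name__ == "__main__":
    print(declares_deps_array("PTO2TaskId params_t1_deps[5];"))  # original index
    print(declares_deps_array("PTO2TaskId params_t7_deps[5];"))  # after renumbering
    print(declares_deps_array("PTO2TaskId params_t1_deps[4];"))  # different array size
```

The array size stays pinned while the task index is free, so the assertion still checks the dependency count that matters.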
📒 Files selected for processing (19)
- CMakeLists.txt
- include/pypto/ir/expr.h
- include/pypto/ir/transforms/pass_properties.h
- include/pypto/ir/transforms/passes.h
- python/bindings/modules/ir.cpp
- python/bindings/modules/passes.cpp
- python/pypto/ir/pass_manager.py
- python/pypto/pypto_core/passes.pyi
- src/codegen/orchestration/orchestration_codegen.cpp
- src/ir/op/sync_ops/task.cpp
- src/ir/transforms/expand_manual_phase_fence_pass.cpp
- tests/st/runtime/scheduling/test_manual_scope_pipeline.py
- tests/st/runtime/scheduling/test_phase_fence_dep_compression.py
- tests/st/runtime/scheduling/test_pl_at_deps_pipeline.py
- tests/ut/codegen/test_orchestration_codegen.py
- tests/ut/codegen/test_phase_fence_dep_compression.py
- tests/ut/ir/operators/test_task_ops.py
- tests/ut/ir/transforms/test_expand_manual_phase_fence.py
- tests/ut/ir/transforms/test_pass_manager.py
        # Swimlane fanout is runtime-dependent on this witness and can
        # under-report the same underlying dependency shape. Do not hard-assert
        # the aggregate fanout count here.
        # tasks = manual_scope_swimlane_data["tasks"]
        # total_fanout = sum(t["fanout_count"] for t in tasks)
        # assert total_fanout >= _M * _N, (
        #     f"expected at least {_M * _N} fan-out edges (one per stage1->stage2 pair), got {total_fanout}"
        # )
test_intra_iteration_dep_present is currently a no-op.
Line [237]–Line [244] comment out all runtime checks, so this test always passes and cannot detect dep-edge regressions.
Suggested fix
     def test_intra_iteration_dep_present(self, manual_scope_swimlane_data: dict):
@@
-        # Swimlane fanout is runtime-dependent on this witness and can
-        # under-report the same underlying dependency shape. Do not hard-assert
-        # the aggregate fanout count here.
-        # tasks = manual_scope_swimlane_data["tasks"]
-        # total_fanout = sum(t["fanout_count"] for t in tasks)
-        # assert total_fanout >= _M * _N, (
-        #     f"expected at least {_M * _N} fan-out edges (one per stage1->stage2 pair), got {total_fanout}"
-        # )
+        tasks = manual_scope_swimlane_data["tasks"]
+        total_fanout = sum(t["fanout_count"] for t in tasks)
+        if total_fanout == 0:
+            pytest.fail("no observable stage1->stage2 fanout edges in swimlane")
+        if total_fanout < _M * _N:
+            pytest.skip(
+                f"swimlane under-reports fanout edges ({total_fanout} < {_M * _N}); "
+                "strict dep wiring is covered by codegen UT"
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/st/runtime/scheduling/test_manual_scope_pipeline.py` around lines 237 -
244, The test test_intra_iteration_dep_present was left as a no-op because the
runtime checks were commented out; re-enable the runtime fanout assertions by
uncommenting the manual_scope_swimlane_data checks (the tasks extraction,
total_fanout computation and assert) and ensure the assertion verifies
total_fanout >= _M * _N (or another documented, conservative lower-bound if you
expect runtime variance) so the test fails on missing intra-iteration dependency
edges; reference the variables manual_scope_swimlane_data, tasks, total_fanout,
and the constants _M and _N when restoring the assertion and keeping the
original failure message for diagnostic clarity.
def _assert_flattened_stage_strict(swimlane_data: dict, *, stages: int, branches: int) -> None:
    expected = stages * branches
    tasks = swimlane_data["tasks"]
    if len(tasks) < expected:
        pytest.skip(f"need >= {expected} tasks for phase-fence check, got {len(tasks)}")
    tasks = sorted(tasks, key=lambda t: t["start_time_us"])[:expected]
Task-count undershoots should fail, not skip, in witness helpers.
Line [55]-Line [56] and Line [69]-Line [70] currently skip when too few tasks are observed, which can hide real scheduling regressions across many callers.
Suggested fix
 def _assert_flattened_stage_strict(swimlane_data: dict, *, stages: int, branches: int) -> None:
     expected = stages * branches
     tasks = swimlane_data["tasks"]
-    if len(tasks) < expected:
-        pytest.skip(f"need >= {expected} tasks for phase-fence check, got {len(tasks)}")
+    assert len(tasks) >= expected, (
+        f"need >= {expected} tasks for phase-fence check, got {len(tasks)}"
+    )
@@
 def _assert_min_task_count(swimlane_data: dict, *, expected: int) -> None:
     tasks = swimlane_data["tasks"]
-    if len(tasks) < expected:
-        pytest.skip(f"need >= {expected} tasks for swimlane check, got {len(tasks)}")
+    assert len(tasks) >= expected, (
+        f"need >= {expected} tasks for swimlane check, got {len(tasks)}"
+    )
Also applies to: 67-70
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/st/runtime/scheduling/test_phase_fence_dep_compression.py` around lines
52 - 57, The helper _assert_flattened_stage_strict currently calls pytest.skip
when the observed task count is less than expected (in the block around the
tasks length check) which masks regressions; change that behavior to fail the
test instead by replacing pytest.skip(...) with pytest.fail(...) (or an explicit
assert/raise) and update the similar helper block that also skips on
insufficient tasks (the second occurrence around the other helper/branch at
lines 67-70) so both helpers (including the one referenced near the other
task-count check) raise a failure with a clear message like "need >= {expected}
tasks for phase-fence check, got {len(tasks)}".
        if total_fanout < _M * _N:
            pytest.skip(
                f"pl.at swimlane under-reports outlined fanout edges ({total_fanout} < {_M * _N}); "
                "strict dep wiring is covered by codegen UT"
            )
Skip-only fanout handling can hide real intra-iteration dep regressions.
At Line [228]–Line [232], low observed fanout always skips; a broken dependency shape can silently pass in runtime witness mode.
Suggested fix
         tasks = pl_at_deps_swimlane_data["tasks"]
         total_fanout = sum(t["fanout_count"] for t in tasks)
+        if total_fanout == 0:
+            pytest.fail("no observable stage1->stage2 fanout edges in pl.at swimlane")
         if total_fanout < _M * _N:
             pytest.skip(
                 f"pl.at swimlane under-reports outlined fanout edges ({total_fanout} < {_M * _N}); "
                 "strict dep wiring is covered by codegen UT"
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/st/runtime/scheduling/test_pl_at_deps_pipeline.py` around lines 228 -
232, The current conditional uses pytest.skip when total_fanout < _M * _N which
can hide real intra-iteration dependency regressions; change the behavior so the
test fails (or at minimum xfails) instead of silently skipping: replace the
pytest.skip call with pytest.fail (or an assert) that includes the same
diagnostic message and values for total_fanout and _M * _N so failures are
visible in runtime witness mode; keep the same message text but call
pytest.fail(...) (or assert total_fanout >= _M * _N, f"...") referencing
total_fanout, _M and _N and remove the silent skip.
Summary
- Adds a dedicated `ExpandManualPhaseFence` IR pass that compresses profitable explicit `manual_scope` full-array `TaskId` dependencies before backend lowering.
- Rewrites profitable phase-fence fanout from direct all-to-all dependencies into a single barrier dependency by inserting one dependency-only `system.task_dummy`.
- Keeps loop-entry dependency snapshots aligned with the previous phase, so current-phase parallel branches do not accidentally observe `TaskId` slots already updated by earlier branches in the same phase.
- Preserves the existing direct dependency lowering for low-benefit, mixed, partial-slot, unresolved, pure range-only, non-orchestration, and update-hazard dependency shapes.
- Updates runtime swimlane witnesses under the current scheduling test layout: `tests/st/runtime/scheduling/`.
ExpandManualPhaseFence
This PR moves manual phase-fence dependency compression into a dedicated IR pass.
The pass recognizes profitable orchestration `manual_scope` consumer sites whose manual dependency is exactly one full `Array[TASK_ID]`. For those sites, it inserts a marked `system.task_dummy` barrier carrying the original dependency array, then rewrites covered consumers to depend on the barrier `TaskId`.
That changes the runtime dependency shape from direct all-to-all fanout to a single dependency-only barrier.
The main correctness point is the phase snapshot boundary. A loop-carried `TaskId` array should represent the completed previous phase when current-phase consumers are submitted. The pass therefore creates the barrier before the current profitable fanout rather than letting each consumer branch expand `deps=[tids]` after earlier branches may have already written updated task ids back into the same carrier.
The pass is intentionally narrow. It is not a general DAG optimizer; it only compresses clear local full-array phase-fence fanout patterns where the barrier is both profitable and safe.
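The snapshot hazard described above can be reproduced with a few lines of plain Python. This is a toy simulation, not the pass or the runtime: `submit_phase` is a hypothetical helper, task ids are plain integers, and `carrier` stands in for the loop-carried `TaskId` array.

```python
def submit_phase(carrier: list, branches: int, snapshot: bool) -> list:
    """Return the dependency list each branch observes when it is submitted.

    Each branch reads the carrier as its deps, then writes its own new task id
    back into its slot, as a loop-carried TaskId array would be updated.
    """
    next_id = max(carrier) + 1
    # With a snapshot, deps are frozen before the first branch is submitted.
    fence = list(carrier) if snapshot else carrier
    observed = []
    for slot in range(branches):
        observed.append(list(fence))  # deps seen by this branch
        carrier[slot] = next_id       # branch publishes its own task id
        next_id += 1
    return observed


if __name__ == "__main__":
    print(submit_phase([0, 1, 2], 3, snapshot=True))
    print(submit_phase([0, 1, 2], 3, snapshot=False))
```

Without the snapshot, later branches in this model see ids written by earlier branches of the same phase, which is the accidental observation that creating the barrier before the fanout is meant to rule out.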
Fallback Boundaries
The pass keeps direct dependency lowering when the shape is not clearly safe or profitable.
Current behavior:
- `3 -> 3` fanout compresses;
- `N -> 1` and `2 -> 2` fanout stay direct;
- `range -> range` repeated consumers stay direct;
- auto-scope deps stay direct;
- `prev = tids[i]; deps=[prev]` stay guarded scalar direct deps;
In practice, if the pass cannot recognize one clear profitable full-array manual phase-fence pattern, it leaves the existing dependency path unchanged.
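One edge-count model that is consistent with the fanout cases listed above is sketched below in Python. It is an assumption for illustration only: the description does not state the pass's actual profitability rule, and `direct_edges`, `barrier_edges`, and `is_profitable` are hypothetical names.

```python
def direct_edges(producers: int, consumers: int) -> int:
    """All-to-all fanout: every consumer depends on every producer."""
    return producers * consumers


def barrier_edges(producers: int, consumers: int) -> int:
    """One dummy barrier: producers feed the barrier, consumers depend on it."""
    return producers + consumers


def is_profitable(producers: int, consumers: int) -> bool:
    """Assumed rule: compress only when the barrier strictly reduces the edge count."""
    return barrier_edges(producers, consumers) < direct_edges(producers, consumers)


if __name__ == "__main__":
    for p, c in [(3, 3), (2, 2), (8, 1)]:
        print(f"{p} -> {c}: direct={direct_edges(p, c)}, "
              f"barrier={barrier_edges(p, c)}, compress={is_profitable(p, c)}")
```

Under this model `3 -> 3` saves edges (6 versus 9), while `2 -> 2` (4 versus 4) and `N -> 1` (N + 1 versus N) do not, which lines up with the behavior listed above. The real heuristic may weigh other costs, such as the dummy-task submission itself.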
One caveat worth calling out: a shared outer `tids` carrier is not implicitly lifted into per-lane ownership. Independent outer lanes still need the carrier itself to be scoped or indexed per lane.
Runtime Test Sync Notes
This branch also updates older `manual_scope` and `pl.at` scheduling witnesses to match the post-compression dependency shape and the current test directory layout.
The runtime ST coverage now lives under:
- tests/st/runtime/scheduling/test_phase_fence_dep_compression.py
- tests/st/runtime/scheduling/test_manual_scope_pipeline.py
- tests/st/runtime/scheduling/test_pl_at_deps_pipeline.py
The swimlane assertions intentionally avoid depending on a stable visible dummy-task marker. The externally required runtime contract is strict phase ordering: tasks in flattened stage `k + 1` must not start before all tasks in flattened stage `k` finish.
The updated witnesses also avoid stale pre-compression assumptions that required direct all-to-all fanout observability in the swimlane artifact.
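The strict phase-ordering contract can be checked along the lines of the Python sketch below. It is a minimal illustration, not the helper from the test suite: `phases_strictly_ordered` is hypothetical, `start_time_us` appears in the reviewed helper, and `end_time_us` is an assumed field name for the task end timestamp.

```python
def phases_strictly_ordered(tasks: list, *, stages: int, branches: int) -> bool:
    """Check that no task in stage k + 1 starts before every task in stage k has finished.

    Tasks are grouped into flattened stages by start time, `branches` tasks per stage.
    """
    expected = stages * branches
    if len(tasks) < expected:
        raise ValueError(f"need >= {expected} tasks, got {len(tasks)}")
    ordered = sorted(tasks, key=lambda t: t["start_time_us"])[:expected]
    for k in range(stages - 1):
        current = ordered[k * branches:(k + 1) * branches]
        following = ordered[(k + 1) * branches:(k + 2) * branches]
        last_end = max(t["end_time_us"] for t in current)  # assumed field name
        first_start = min(t["start_time_us"] for t in following)
        if first_start < last_end:
            return False
    return True


if __name__ == "__main__":
    tasks = [
        {"start_time_us": 0, "end_time_us": 5},
        {"start_time_us": 1, "end_time_us": 6},
        {"start_time_us": 6, "end_time_us": 9},
        {"start_time_us": 7, "end_time_us": 10},
    ]
    print(phases_strictly_ordered(tasks, stages=2, branches=2))
```

Because the check relies only on timestamps, it does not depend on the dummy barrier being visible in the swimlane, in line with the note above.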
Tests
Coverage added or updated in this branch:
- tests/ut/ir/transforms/test_expand_manual_phase_fence.py
- tests/ut/codegen/test_phase_fence_dep_compression.py
- tests/ut/codegen/test_orchestration_codegen.py
- tests/st/runtime/scheduling/test_phase_fence_dep_compression.py
- tests/st/runtime/scheduling/test_manual_scope_pipeline.py
- tests/st/runtime/scheduling/test_pl_at_deps_pipeline.py
Representative validation commands:
    pytest tests/ut/ir/transforms/test_expand_manual_phase_fence.py
    pytest tests/ut/codegen/test_phase_fence_dep_compression.py
    pytest tests/ut/codegen/test_orchestration_codegen.py -k "phase_fence or manual_scope"
    pytest tests/st/runtime/scheduling/test_phase_fence_dep_compression.py -q --platform=a2a3
    pytest tests/st/runtime/scheduling/test_manual_scope_pipeline.py -q --platform=a2a3
    pytest tests/st/runtime/scheduling/test_pl_at_deps_pipeline.py -q --platform=a2a3
    PYPTO_PHASE_FENCE_EXTRA_SWIMLANE=1 pytest \
      tests/st/runtime/scheduling/test_phase_fence_dep_compression.py::TestPhaseFenceDepCompressionSwimlane::test_multiloop_chain_default \
      --enable-l2-swimlane -q --platform=a2a3
    PYPTO_PHASE_FENCE_EXTRA_SWIMLANE=1 pytest \
      tests/st/runtime/scheduling/test_phase_fence_dep_compression.py::TestPhaseFenceDepCompressionSwimlane::test_submit_three_level_strict \
      --enable-l2-swimlane -q --platform=a2a3
    PYPTO_PHASE_FENCE_EXTRA_SWIMLANE=1 pytest \
      tests/st/runtime/scheduling/test_phase_fence_dep_compression.py::TestPhaseFenceDepCompressionSwimlane::test_pl_at_three_level_strict \
      --enable-l2-swimlane -q --platform=a2a3
These cover profitable full-array compression, loop-entry snapshot correctness, low-benefit and mixed-dep fallback, partial-slot scalar fallback, codegen barrier shape, runtime correctness, and strict phase ordering after dummy-barrier insertion.