Refactor: drop fanout from L2PerfRecord hot path #863
Merged
ChaoWao merged 1 commit on May 27, 2026
poursoul reviewed on May 27, 2026
The L2 swimlane per-task commit on AICPU was copying up to 128*8B = 1 KB of fanout edges plus walking the producer's fanout linked list, every task, on the scheduler completion critical path. The fanout edges are already the static DAG and are reconstructed offline by dep_gen replay into deps.json, so the device-side hot path was paying GM-bandwidth and cache-miss cost to duplicate information host tooling already has.

Scope is a2a3 only; a5 is untouched.

Device side:
- L2PerfRecord drops fanout[128] / fanout_count (~1088 B -> 64 B per record).
- l2_perf_aicpu_complete_record drops the trailing fanout / fanout_count parameters; the impl no longer touches them.
- scheduler_completion drops the fanout_arr build + linked-list walk; host_build_graph/aicpu_executor drops the same pattern at all four call sites.

Host side:
- l2_perf_collector::export_swimlane_json emits "fanout": [] and "fanout_count": 0 per task to keep the JSON schema shape stable, and drops the top-level "version" field, which had drifted into a duplicate of L2PerfLevel (see in-flight PR hw-native-sys#856 for the misaligned guard cleanup on the consumer side).

Downstream tools:
- swimlane_converter already preferred deps.json over task["fanout"]; it now reads the version-free schema and treats empty fanout as the expected steady state.
- sched_overhead_analysis no longer gates phase parsing on the dropped "version" field; it gates on presence of aicpu_scheduler_phases, which is the right key.

Tests and comments:
- dep_gen tests drop the now-vacuous "fanout subset-of deps" gate and the auto-add of --enable-l2-swimlane that only existed to feed that gate.
- _swimlane_validate drops the version assertion.
- profiling_levels.md, dep_gen.h, dep_gen_replay.h, pto_orchestrator comments updated to reflect deps.json as the sole source of truth for fanout.

Verified on a2a3sim: test_l2_swimlane, test_l2_swimlane_mixed, test_dep_gen, test_dep_gen_chain all pass with --enable-l2-swimlane --enable-dep-gen.
Summary
The L2 swimlane per-task commit on AICPU was copying up to 128*8B = 1 KB of fanout edges plus walking the producer's fanout linked list, every task, on the scheduler completion critical path. The fanout edges are already the static DAG and are reconstructed offline by dep_gen replay into `deps.json`, so the device-side hot path was paying GM-bandwidth and cache-miss cost to duplicate information host tooling already has.

Scope is a2a3 only; a5 is untouched.
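
As a rough check on the sizes quoted in this PR (128 edges at 8 B each, and a record shrinking from about 1088 B to 64 B), the arithmetic can be reproduced with a minimal `ctypes` sketch. The field layout here is hypothetical: `RecordWithFanout`, `RecordSlim`, `header`, and `_pad` are placeholders chosen only so the totals match the quoted figures, and they are not the actual `L2PerfRecord` definition.

```python
import ctypes

FANOUT_MAX = 128                              # from the PR: fanout[128]
EDGE_BYTES = ctypes.sizeof(ctypes.c_uint64)   # 8 B per edge, per the PR's 128*8B figure


class RecordWithFanout(ctypes.Structure):
    """Hypothetical pre-change layout: placeholder fields, a count, then inline fanout."""
    _fields_ = [
        ("header", ctypes.c_uint64 * 7),      # placeholder for the real per-task fields
        ("fanout_count", ctypes.c_uint32),
        ("_pad", ctypes.c_uint32),            # keeps the array 8-byte aligned
        ("fanout", ctypes.c_uint64 * FANOUT_MAX),
    ]


class RecordSlim(ctypes.Structure):
    """Hypothetical post-change layout: placeholder fields only, 64 B in total."""
    _fields_ = [("header", ctypes.c_uint64 * 8)]


fanout_bytes = FANOUT_MAX * EDGE_BYTES
old_size = ctypes.sizeof(RecordWithFanout)
new_size = ctypes.sizeof(RecordSlim)

print(f"fanout array: {fanout_bytes} B")                 # fanout array: 1024 B
print(f"record: {old_size} B -> {new_size} B "
      f"({old_size // new_size}x smaller)")              # record: 1088 B -> 64 B (17x smaller)
```

Under these assumptions the inline fanout array accounts for 1024 of the 1088 bytes, which is the per-task copy the hot path no longer performs.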
Device side

- `L2PerfRecord` drops `fanout[128]` / `fanout_count` (~1088 B → 64 B per record).
- `l2_perf_aicpu_complete_record` drops the trailing `fanout` / `fanout_count` parameters; the impl no longer touches them.
- `scheduler_completion` drops the `fanout_arr` build + linked-list walk; `host_build_graph/aicpu_executor` drops the same pattern at all four call sites.

Host side

- `l2_perf_collector::export_swimlane_json` emits `"fanout": []` and `"fanout_count": 0` per task to keep the JSON schema shape stable, and drops the top-level `"version"` field, which had drifted into a duplicate of `L2PerfLevel` (see in-flight PR #856, Fix: clean up "version" fields in L2 swimlane / dep_gen JSON, for the misaligned guard cleanup on the consumer side).

Downstream tools

- `swimlane_converter` already preferred `deps.json` over `task["fanout"]`; it now reads the version-free schema and treats empty fanout as the expected steady state.
- `sched_overhead_analysis` no longer gates phase parsing on the dropped `"version"` field; it gates on presence of `aicpu_scheduler_phases`, which is the right key.

Tests and comments

- dep_gen tests drop the now-vacuous "fanout subset-of deps" gate and the auto-add of `--enable-l2-swimlane` that only existed to feed that gate.
- `_swimlane_validate` drops the version assertion.
- `profiling_levels.md`, `dep_gen.h`, `dep_gen_replay.h`, `pto_orchestrator` comments updated to reflect `deps.json` as the sole source of truth for fanout.

Test plan

- `pip install --no-build-isolation -e .` clean on macOS sim
- `tests/st/a2a3/.../dfx/l2_swimlane/test_l2_swimlane.py` passes (`--platform a2a3sim --enable-l2-swimlane`)
- `tests/st/a2a3/.../dfx/l2_swimlane/test_l2_swimlane_mixed.py` passes (`--platform a2a3sim --enable-l2-swimlane`)
- `tests/st/a2a3/.../dfx/dep_gen/test_dep_gen.py` passes (`--platform a2a3sim --enable-l2-swimlane --enable-dep-gen`)
- `tests/st/a2a3/.../dfx/dep_gen/test_dep_gen_chain.py` passes (`--platform a2a3sim --enable-l2-swimlane --enable-dep-gen`)
- [...] `paged_attention_unroll_manual_scope` Case1)

Notes for reviewers

- [...] `fanout: []` / `fanout_count: 0` are still emitted per task), so any consumer that just shape-checks keeps working.
- `"version"` field removal overlaps with PR #856 (Fix: clean up "version" fields in L2 swimlane / dep_gen JSON) in direction. Whichever lands first leaves a trivial conflict for the other.
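
To illustrate the consumer-side behavior described above, here is a minimal Python sketch under stated assumptions. It is not the code of `swimlane_converter` or `sched_overhead_analysis`. The helper names (`export_task`, `shape_ok`, `resolve_fanout`, `has_phase_data`), the `task_id` / `start` / `end` / `tasks` keys, the placement of `aicpu_scheduler_phases` at the top level, and the shape of the `deps` mapping (producer id as a string mapped to a list of consumer ids) are all hypothetical; only the `fanout`, `fanout_count`, `version`, and `aicpu_scheduler_phases` key names come from this PR.

```python
def export_task(task_id, start, end):
    """Per-task entry after the change: fanout keys are kept but always empty."""
    return {"task_id": task_id, "start": start, "end": end,
            "fanout": [], "fanout_count": 0}


def shape_ok(task):
    """A consumer that only shape-checks still passes on the new output."""
    fanout = task.get("fanout")
    return isinstance(fanout, list) and task.get("fanout_count") == len(fanout)


def resolve_fanout(task, deps):
    """Prefer edges from deps (assumed loaded from deps.json) over task["fanout"]."""
    key = str(task["task_id"])
    if key in deps:
        return deps[key]
    return task.get("fanout", [])   # empty fanout is the expected steady state


def has_phase_data(swimlane):
    """Gate phase parsing on the phases key itself, not on a "version" field."""
    return "aicpu_scheduler_phases" in swimlane


# Hypothetical inputs: two tasks, where task 0 feeds task 1.
swimlane = {
    "tasks": [export_task(0, 0.0, 1.5), export_task(1, 1.5, 3.0)],
    "aicpu_scheduler_phases": [],
}
deps = {"0": [1], "1": []}

edges = {t["task_id"]: resolve_fanout(t, deps) for t in swimlane["tasks"]}
print(edges)
print(all(shape_ok(t) for t in swimlane["tasks"]), has_phase_data(swimlane))
```

In this sketch the edges come entirely from `deps`, while the empty per-task `fanout` list only keeps the JSON shape stable for consumers that check it.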