Refactor: drop fanout from L2PerfRecord hot path #863

Merged: ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:refactor/drop-l2-perf-record-fanout on May 27, 2026

@ChaoWao (Collaborator) commented May 27, 2026

Summary

The L2 swimlane per-task commit on AICPU copied up to 128 × 8 B = 1 KB of fanout edges and walked the producer's fanout linked list for every task, on the scheduler completion critical path. The fanout edges are the static DAG, which dep_gen replay already reconstructs offline into deps.json, so the device-side hot path was paying GM-bandwidth and cache-miss cost to duplicate information the host tooling already has.

Scope is a2a3 only; a5 is untouched.

Device side

  • L2PerfRecord drops fanout[128] / fanout_count (~1088 B → 64 B per record).
  • l2_perf_aicpu_complete_record drops the trailing fanout / fanout_count parameters; the impl no longer touches them.
  • scheduler_completion drops the fanout_arr build + linked-list walk; host_build_graph/aicpu_executor drops the same pattern at all four call sites.
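The size arithmetic behind the first bullet can be checked with a small sketch. The layout is hypothetical: only `fanout[128]`, `fanout_count`, the 8-byte edge size, and the 64-byte post-change record size come from the description above. The `payload` member and the 4-byte width of `fanout_count` are assumptions standing in for the real fields, so the "before" size is only approximate.

```python
import ctypes

MAX_FANOUT = 128  # from the PR description: fanout[128], 8 B per edge


class RecordAfter(ctypes.Structure):
    """Stand-in for the slimmed record; `payload` is a placeholder, not a real field."""

    _fields_ = [("payload", ctypes.c_uint64 * 8)]  # 8 * 8 B = 64 B


class RecordBefore(ctypes.Structure):
    """Stand-in for the old record: same payload plus the fanout members."""

    _fields_ = [
        ("payload", ctypes.c_uint64 * 8),
        ("fanout", ctypes.c_uint64 * MAX_FANOUT),  # 128 * 8 B = 1024 B
        ("fanout_count", ctypes.c_uint32),  # width is an assumption
    ]


fanout_bytes = ctypes.sizeof(ctypes.c_uint64 * MAX_FANOUT)
after_bytes = ctypes.sizeof(RecordAfter)
before_bytes = ctypes.sizeof(RecordBefore)
saved_bytes = before_bytes - after_bytes

print(f"fanout array:     {fanout_bytes} B")
print(f"record after:     {after_bytes} B")
print(f"saved per record: {saved_bytes} B")
```

Whatever the real member layout is, the fanout array alone accounts for 1 KB of every record that no longer has to be written to GM on task completion.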

Host side

  • l2_perf_collector::export_swimlane_json emits "fanout": [] and "fanout_count": 0 per task to keep the JSON schema shape stable, and drops the top-level "version" field, which had drifted into a duplicate of L2PerfLevel (see in-flight PR #856, "Fix: clean up "version" fields in L2 swimlane / dep_gen JSON", for the misaligned guard cleanup on the consumer side).
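As a rough illustration of the resulting shape, here is a minimal Python sketch of the exported document. It is not the collector's actual code (which is C++). The top-level `tasks` key and the `task_id`, `start`, and `end` fields are hypothetical; only the `"fanout": []` / `"fanout_count": 0` pair and the absence of a top-level `"version"` key come from the description above.

```python
import json


def export_task(task_id: int, start: int, end: int) -> dict:
    """Build one per-task entry; task_id/start/end are hypothetical fields."""
    return {
        "task_id": task_id,
        "start": start,
        "end": end,
        # Emitted only so the schema shape stays stable; edges live in deps.json.
        "fanout": [],
        "fanout_count": 0,
    }


def export_swimlane(tasks: list) -> str:
    """Serialize the tasks without any top-level "version" key."""
    return json.dumps({"tasks": tasks}, indent=2)


doc = json.loads(
    export_swimlane([export_task(0, 100, 250), export_task(1, 260, 400)])
)
print(sorted(doc.keys()))
print(doc["tasks"][0]["fanout"], doc["tasks"][0]["fanout_count"])
```

A consumer that only checks for the presence and type of `fanout` and `fanout_count` sees the same shape as before.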

Downstream tools

  • swimlane_converter already preferred deps.json over task["fanout"]; it now reads the version-free schema and treats empty fanout as the expected steady state.
  • sched_overhead_analysis no longer gates phase parsing on the dropped "version" field — it gates on presence of aicpu_scheduler_phases, which is the right key.
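The two consumer-side behaviours can be sketched as follows. This is an assumption-laden illustration, not the tools' real code: the function names are hypothetical, and `deps` is assumed to map a producer task id (as a string) to a list of consumer ids, which may not match the actual deps.json layout. Only `task["fanout"]` and the `aicpu_scheduler_phases` key are taken from the description above.

```python
def resolve_fanout(task: dict, deps: dict) -> list:
    """Prefer edges from deps.json; fall back to the (normally empty) task["fanout"]."""
    key = str(task["task_id"])  # hypothetical key format
    if key in deps:
        return list(deps[key])
    return list(task.get("fanout", []))


def scheduler_phases(swimlane: dict) -> list:
    """Gate on the presence of the phase data itself, not on a "version" field."""
    return swimlane.get("aicpu_scheduler_phases", [])


# Hypothetical inputs: task 0 feeds tasks 1 and 2 according to deps.json.
deps = {"0": [1, 2]}
task = {"task_id": 0, "fanout": [], "fanout_count": 0}

edges = resolve_fanout(task, deps)
no_phases = scheduler_phases({"tasks": [task]})
with_phases = scheduler_phases({"aicpu_scheduler_phases": [{"name": "dispatch"}]})
print(edges, no_phases, with_phases)
```

Gating on the key that holds the data means a swimlane file without phase records is simply treated as having none, with no separate version field to keep in sync.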

Tests and comments

  • dep_gen tests drop the now-vacuous "fanout ⊆ deps" gate and the auto-add of --enable-l2-swimlane that only existed to feed that gate.
  • _swimlane_validate drops the version assertion.
  • profiling_levels.md, dep_gen.h, dep_gen_replay.h, pto_orchestrator comments updated to reflect deps.json as the sole source of truth for fanout.
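To see why the "fanout ⊆ deps" gate became vacuous: once every task reports an empty fanout, the subset check holds for any deps.json content, so it can no longer catch a mismatch. A minimal sketch, with a hypothetical helper name and made-up edge ids:

```python
def fanout_subset_of_deps(task_fanout: list, deps_edges: list) -> bool:
    """Return True when every swimlane fanout edge also appears in deps.json."""
    return set(task_fanout) <= set(deps_edges)


deps_edges = [3, 5, 8]  # made-up consumer ids for one producer

# After this change the swimlane fanout is always empty, so the check
# passes regardless of what deps.json contains.
empty_vs_deps = fanout_subset_of_deps([], deps_edges)
empty_vs_empty = fanout_subset_of_deps([], [])

# Before the change, a stray edge (9) could make the check fail.
stale_edge = fanout_subset_of_deps([3, 9], deps_edges)

print(empty_vs_deps, empty_vs_empty, stale_edge)
```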

Test plan

  • pip install --no-build-isolation -e . clean on macOS sim
  • tests/st/a2a3/.../dfx/l2_swimlane/test_l2_swimlane.py passes (--platform a2a3sim --enable-l2-swimlane)
  • tests/st/a2a3/.../dfx/l2_swimlane/test_l2_swimlane_mixed.py passes (--platform a2a3sim --enable-l2-swimlane)
  • tests/st/a2a3/.../dfx/dep_gen/test_dep_gen.py passes (--platform a2a3sim --enable-l2-swimlane --enable-dep-gen)
  • tests/st/a2a3/.../dfx/dep_gen/test_dep_gen_chain.py passes (--platform a2a3sim --enable-l2-swimlane --enable-dep-gen)
  • Hardware run on a2a3 to confirm the AICPU-cycle savings translate to wall-clock improvement on a fanin-tail-bound case (e.g. paged_attention_unroll_manual_scope Case1)

Notes for reviewers

  • JSON schema shape is preserved (fanout: [] / fanout_count: 0 are still emitted per task), so any consumer that just shape-checks keeps working.
  • The "version" field removal goes in the same direction as PR #856 ("Fix: clean up "version" fields in L2 swimlane / dep_gen JSON"). Whichever lands first leaves a trivial conflict for the other.
  • Phase-record cost (orchestrator 9-per-submit + scheduler per-iter) is intentionally untouched — that's a separate follow-up if measurement after this still shows AICPU pressure at higher perf levels.

Comment thread src/a2a3/platform/include/common/l2_perf_profiling.h
@hw-native-sys-bot force-pushed the refactor/drop-l2-perf-record-fanout branch 3 times, most recently from ca33ace to 703ffef, on May 27, 2026 02:31
Verified on a2a3sim: test_l2_swimlane, test_l2_swimlane_mixed,
test_dep_gen, test_dep_gen_chain all pass with
--enable-l2-swimlane --enable-dep-gen.
@hw-native-sys-bot force-pushed the refactor/drop-l2-perf-record-fanout branch from 703ffef to fa95657 on May 27, 2026 06:01
@ChaoWao merged commit 49d74e8 into hw-native-sys:main on May 27, 2026
15 checks passed
@ChaoWao deleted the refactor/drop-l2-perf-record-fanout branch on May 27, 2026 06:20