Refactor: fold orch sub-step phase records into one per-submit envelope#871
Conversation
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughWalkthroughThis PR consolidates orchestrator profiling from per-sub-step phase records to single per-submit envelope records. The phase enum introduces ChangesOrchestrator profiling envelope consolidation
Possibly related PRs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request folds the multiple orchestrator sub-step phases into a single ORCH_SUBMIT phase record covering the entire submit task window, while preserving the per-sub-step cycle counts in the cold-path device log. It updates the documentation, Python swimlane converter, and C++ profiling headers and source files to support this change while maintaining fallback support for legacy captures. A review comment points out a logical discrepancy in swimlane_converter.py where the last-seen orch_submit wins instead of the first-seen as documented, and provides a code suggestion to fix it.
7e0c70f to
6d9829f
Compare
Each submit_task() / alloc_tensors() call used to emit 6 separate AicpuPhaseRecord entries (ORCH_SYNC, ORCH_ALLOC, ORCH_LOOKUP, ORCH_INSERT, ORCH_PARAMS, ORCH_FANIN) — 240 B of GM writes per submit. The per-sub-step breakdown they carried duplicated the g_orch_*_cycle cumulatives that the cold-path device log already prints; the only consumer of the per-sub-step records was Perfetto eye-balling, where 6 µs-scale bars convey essentially the same information as one bar spanning the whole submit. Fold them into one ORCH_SUBMIT record per submit: 240 B → 40 B GM write (6×↓) 6 record_phase calls → 1 (6×↓) 7 get_sys_cnt_aicpu calls → 2 (1 at START, 1 at SUBMIT_END) (3.5×↓) For "which sub-step is slow overall", the cold-path log's per-step cycle ratios still cover it. For "which submit is slow", the single ORCH_SUBMIT record carries the full wall-clock envelope. Scope is a2a3 tensormap_and_ringbuffer only. Device (pto_orchestrator.cpp): - CYCLE_COUNT_START captures _submit_start_ts as a separate variable so the per-submit envelope is recoverable after the LAPs have advanced _t0. - CYCLE_COUNT_LAP_RECORD macro deleted; all 9 call sites (6 in submit_task_common, 3 in alloc_tensors) become plain CYCLE_COUNT_LAP (accumulate-only — g_orch_*_cycle cumulatives unchanged). - New CYCLE_COUNT_ORCH_SUBMIT_RECORD(tid) fires once per submit path with [_submit_start_ts, _t1, g_orch_submit_idx, task_id.raw]. - Dropped commented-out ORCH_SCOPE_END emit at scope_end(). Enum (l2_perf_profiling.h): - AicpuPhaseId: drop ORCH_SYNC..ORCH_SCOPE_END (ids 16-24). Replace with single ORCH_SUBMIT = 16. Doc note records that ids 17-24 may appear in legacy captures and are dropped by the host parser. Host parser (l2_perf_collector.cpp): - is_scheduler_phase boundary check uses ORCH_SUBMIT instead of the removed ORCH_SYNC. - orch_phase_name switch collapses to one case (ORCH_SUBMIT → "orch_submit"); default returns "unknown" so legacy ids on old captures land in "unknown" and downstream tools drop them. Header (l2_perf_collector_aicpu.h): - Doc for l2_perf_aicpu_record_orch_phase updated to reflect the single-record semantics. Function signature unchanged — phase_id param still required, callers pass ORCH_SUBMIT. Tools (swimlane_converter.py): - orch_phase_colors: "orch_submit" primary; legacy 6 sub-step strings retained so old captures render. - Orch → sched dispatch arrow anchor: prefers orch_submit end, falls back to legacy orch_fanin / orch_params for old captures. - submit_count derivation prefers orch_submit, falls back to orch_fanin. Docs (docs/dfx/l2-swimlane-profiling.md): - Section 2: orchestrator description shifted from "9 sub-steps" to "per-submit envelope + cumulative log counters". - Section 3 phase table: orchestrator phase string list updated. - Section 4: orchestrator overhead breakdown rephrased to point at the per-submit record + cold-path log split. Verified on a2a3sim with --enable-l2-swimlane --enable-dep-gen: test_l2_swimlane, test_l2_swimlane_mixed, test_dep_gen, test_dep_gen_chain all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6d9829f to
75358f2
Compare
Why
Each
submit_task()/alloc_tensors()call used to emit 6 separateAicpuPhaseRecordentries (ORCH_SYNC,ORCH_ALLOC,ORCH_LOOKUP,ORCH_INSERT,ORCH_PARAMS,ORCH_FANIN) — 240 B of GM writes per submit. The per-sub-step breakdown they carried duplicated theg_orch_*_cyclecumulatives that the cold-path device log already prints; the only consumer of the per-sub-step records was Perfetto eye-balling, where 6 µs-scale bars convey essentially the same information as one bar spanning the whole submit.This PR folds them into one
ORCH_SUBMITrecord per submit:l2_perf_aicpu_record_phasecalls per submitget_sys_cnt_aicpu()calls per submitDecision rationale: "which sub-step is slow overall?" is already covered by the cold-path log's per-step cycle ratios. "Which submit is slow?" is what the single
ORCH_SUBMITrecord (covering the whole[start, end]wall-clock window) addresses. Two artifacts, two clean responsibilities.Scope is a2a3
tensormap_and_ringbufferonly.What changed
Device (
pto_orchestrator.cpp)CYCLE_COUNT_STARTcaptures_submit_start_tsas a separate variable so the per-submit envelope is recoverable after theCYCLE_COUNT_LAPcalls have advanced_t0.CYCLE_COUNT_LAP_RECORDmacro deleted; all 9 call sites (6 insubmit_task_common, 3 inalloc_tensors) become plainCYCLE_COUNT_LAP— accumulate-only,g_orch_*_cyclecumulatives unchanged.CYCLE_COUNT_ORCH_SUBMIT_RECORD(tid)fires once per submit path with[_submit_start_ts, _t1, g_orch_submit_idx, task_id.raw].ORCH_SCOPE_ENDemit atscope_end().Enum (
l2_perf_profiling.h)AicpuPhaseId: dropORCH_SYNC..ORCH_SCOPE_END(ids 16-24). Replace with singleORCH_SUBMIT = 16. Doc note records that ids 17-24 may appear in legacy captures and are dropped by the host parser.Host parser (
l2_perf_collector.cpp)is_scheduler_phaseboundary check usesORCH_SUBMITinstead of the removedORCH_SYNC.orch_phase_nameswitch collapses to one case (ORCH_SUBMIT→"orch_submit");defaultreturns"unknown"so legacy ids on old captures land in"unknown"and downstream tools drop them.Header doc (
l2_perf_collector_aicpu.h)l2_perf_aicpu_record_orch_phaseupdated to reflect single-record semantics. Function signature unchanged —phase_idparam still required, callers passORCH_SUBMIT.Tools (
swimlane_converter.py)orch_phase_colors:"orch_submit"primary; legacy 6 sub-step strings retained so old captures render.orch_submitend, falls back to legacyorch_fanin/orch_paramsfor old captures.submit_countderivation prefersorch_submit, falls back toorch_fanin.Docs (
docs/dfx/l2-swimlane-profiling.md)Test plan
pytest tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/ tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/ --platform a2a3sim --enable-l2-swimlane --enable-dep-gen→ 4 passed in 36s--enable-l2-swimlane 4— expected: sched_cost stays in noise (same pattern as Refactor: drop fanout from L2PerfRecord hot path #863 / Refactor: drop SCHED_IDLE_WAIT phase records and vestigial SCAN slot #869); orch-side records-per-submit drop from 6 to 1.🤖 Generated with Claude Code