Refactor: fold orch sub-step phase records into one per-submit envelope#871

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:refactor/orch-submit-fold
May 27, 2026
@hw-native-sys-bot
Collaborator

Why

Each submit_task() / alloc_tensors() call used to emit 6 separate AicpuPhaseRecord entries (ORCH_SYNC, ORCH_ALLOC, ORCH_LOOKUP, ORCH_INSERT, ORCH_PARAMS, ORCH_FANIN) — 240 B of GM writes per submit. The per-sub-step breakdown they carried duplicated the g_orch_*_cycle cumulatives that the cold-path device log already prints; the only consumer of the per-sub-step records was Perfetto eye-balling, where 6 µs-scale bars convey essentially the same information as one bar spanning the whole submit.

This PR folds them into one ORCH_SUBMIT record per submit:

| | Before | After | × |
|---|---|---|---|
| GM write per submit | 240 B (6 × 40 B) | 40 B (1 record) | 6 ×↓ |
| l2_perf_aicpu_record_phase calls per submit | 6 | 1 | 6 ×↓ |
| get_sys_cnt_aicpu() calls per submit | 7 | 2 | 3.5 ×↓ |
| Schema bump / new buffer / new struct | | 0 | |

Decision rationale: "which sub-step is slow overall?" is already covered by the cold-path log's per-step cycle ratios. "Which submit is slow?" is what the single ORCH_SUBMIT record (covering the whole [start, end] wall-clock window) addresses. Two artifacts, two clean responsibilities.
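The ratios in the table above can be reproduced with a few lines of arithmetic. This is only a sanity-check sketch: the 40 B record size and the call counts come from the PR description, while `per_submit_cost` and its field names are hypothetical helpers made up for illustration.

```python
# Size of one AicpuPhaseRecord as implied by the PR description (240 B / 6).
RECORD_BYTES = 40


def per_submit_cost(records: int, timer_reads: int) -> dict:
    """Return the profiling cost of one submit for the given counts."""
    return {
        "gm_bytes": records * RECORD_BYTES,  # GM bytes written per submit
        "record_calls": records,             # l2_perf_aicpu_record_phase calls
        "timer_reads": timer_reads,          # get_sys_cnt_aicpu() calls
    }


before = per_submit_cost(records=6, timer_reads=7)
after = per_submit_cost(records=1, timer_reads=2)

# Reduction factor per metric (before / after).
reduction = {key: before[key] / after[key] for key in before}
print(before)
print(after)
print(reduction)
```

The same helper can be reused to estimate total GM traffic for a run by multiplying `gm_bytes` by the number of submits.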

Scope is a2a3 tensormap_and_ringbuffer only.

What changed

Device (pto_orchestrator.cpp)

  • CYCLE_COUNT_START captures _submit_start_ts as a separate variable so the per-submit envelope is recoverable after the CYCLE_COUNT_LAP calls have advanced _t0.
  • CYCLE_COUNT_LAP_RECORD macro deleted; all 9 call sites (6 in submit_task_common, 3 in alloc_tensors) become plain CYCLE_COUNT_LAP — accumulate-only, g_orch_*_cycle cumulatives unchanged.
  • New CYCLE_COUNT_ORCH_SUBMIT_RECORD(tid) fires once per submit path with [_submit_start_ts, _t1, g_orch_submit_idx, task_id.raw].
  • Dropped commented-out ORCH_SCOPE_END emit at scope_end().
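The interplay between the moving lap timestamp and the fixed submit-start timestamp can be modelled in a few lines of Python. This is a toy model, not the device code: the class `SubmitProfiler`, its method names, and the fake clock are all hypothetical, and it assumes every lap reads the clock and that the submit index advances after each record. How the real macros get down to two timer reads per submit is not shown in the PR text.

```python
from collections import defaultdict
from itertools import count


class SubmitProfiler:
    """Toy stand-in for the CYCLE_COUNT_* macros (illustrative only)."""

    def __init__(self, clock):
        self.clock = clock              # stand-in for get_sys_cnt_aicpu()
        self.cycles = defaultdict(int)  # stand-in for g_orch_*_cycle
        self.records = []               # stand-in for the GM phase buffer
        self.submit_idx = 0             # stand-in for g_orch_submit_idx

    def start(self):
        # Like CYCLE_COUNT_START: keep the start time twice. t0 moves with
        # every lap, submit_start_ts stays fixed for the envelope.
        self.t0 = self.clock()
        self.submit_start_ts = self.t0

    def lap(self, step):
        # Like CYCLE_COUNT_LAP: accumulate only, no record is emitted.
        t1 = self.clock()
        self.cycles[step] += t1 - self.t0
        self.t0 = t1

    def submit_record(self, task_id):
        # Like CYCLE_COUNT_ORCH_SUBMIT_RECORD: one envelope per submit.
        # The end timestamp here is the time of the last lap.
        self.records.append(
            (self.submit_start_ts, self.t0, self.submit_idx, task_id)
        )
        self.submit_idx += 1  # assumption: index advances once per record


# Fake clock that advances 10 ticks per read, starting at 100.
ticks = count(start=100, step=10)
prof = SubmitProfiler(clock=lambda: next(ticks))

prof.start()
for step in ("sync", "alloc", "lookup", "insert", "params", "fanin"):
    prof.lap(step)
prof.submit_record(task_id=7)

print(prof.records)
print(dict(prof.cycles))
```

In this model the per-step cumulatives still sum to the width of the envelope, which is the property that lets the cold-path log and the single record answer two different questions from the same timestamps.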

Enum (l2_perf_profiling.h)

  • AicpuPhaseId: drop ORCH_SYNC..ORCH_SCOPE_END (ids 16-24). Replace with single ORCH_SUBMIT = 16. Doc note records that ids 17-24 may appear in legacy captures and are dropped by the host parser.

Host parser (l2_perf_collector.cpp)

  • is_scheduler_phase boundary check uses ORCH_SUBMIT instead of the removed ORCH_SYNC.
  • orch_phase_name switch collapses to one case (ORCH_SUBMIT → "orch_submit"); default returns "unknown" so legacy ids on old captures land in "unknown" and downstream tools drop them.
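The parser behaviour described above can be mirrored in Python to show what happens to legacy ids. This is a sketch under stated assumptions, not the C++ in `l2_perf_collector.cpp`: it takes `ORCH_SUBMIT = 16` and the legacy range 17-24 from the PR description, assumes scheduler phase ids sit below the orchestrator boundary, and uses a hypothetical `keep_orch_records` filter to stand in for "downstream tools drop them".

```python
ORCH_SUBMIT = 16  # value stated in the PR description


def is_scheduler_phase(phase_id: int) -> bool:
    # Assumption: every id below the orchestrator boundary is a scheduler phase.
    return phase_id < ORCH_SUBMIT


def orch_phase_name(phase_id: int) -> str:
    # Single known case; anything else (including legacy ids 17-24) is "unknown".
    return "orch_submit" if phase_id == ORCH_SUBMIT else "unknown"


def keep_orch_records(phase_ids):
    """Hypothetical downstream filter: keep only named orchestrator records."""
    return [
        pid
        for pid in phase_ids
        if not is_scheduler_phase(pid) and orch_phase_name(pid) != "unknown"
    ]


# A mixed capture: one scheduler id, two new records, two legacy ids.
capture = [3, 16, 17, 24, 16]
print([orch_phase_name(pid) for pid in capture if not is_scheduler_phase(pid)])
print(keep_orch_records(capture))
```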

Header doc (l2_perf_collector_aicpu.h)

  • Doc for l2_perf_aicpu_record_orch_phase updated to reflect single-record semantics. Function signature unchanged — phase_id param still required, callers pass ORCH_SUBMIT.

Tools (swimlane_converter.py)

  • orch_phase_colors: "orch_submit" primary; legacy 6 sub-step strings retained so old captures render.
  • Orch → sched dispatch arrow anchor: prefers orch_submit end, falls back to legacy orch_fanin / orch_params for old captures.
  • submit_count derivation prefers orch_submit, falls back to orch_fanin.
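The fallback rules for the arrow anchor and the submit count can be sketched as follows. This is an illustration, not the code in `swimlane_converter.py`: the record layout (`task_id`, `phase`, `end`), the function names, and the relative priority of `orch_fanin` over `orch_params` are assumptions. Within one phase the first record seen wins, matching the first-seen semantics mentioned in the review.

```python
# Preferred anchor first; the fanin-before-params order is an assumption.
ANCHOR_PRIORITY = ("orch_submit", "orch_fanin", "orch_params")


def select_anchors(records):
    """Map task_id -> (phase, end_ts), preferring orch_submit over legacy."""
    rank = {name: i for i, name in enumerate(ANCHOR_PRIORITY)}
    anchors = {}
    for rec in records:
        phase = rec["phase"]
        if phase not in rank:
            continue  # not usable as an arrow anchor
        current = anchors.get(rec["task_id"])
        # Strict "<" keeps the first-seen record when priorities tie.
        if current is None or rank[phase] < rank[current[0]]:
            anchors[rec["task_id"]] = (phase, rec["end"])
    return anchors


def submit_count(records):
    """Count orch_submit records, falling back to orch_fanin for old captures."""
    n = sum(1 for r in records if r["phase"] == "orch_submit")
    if n:
        return n
    return sum(1 for r in records if r["phase"] == "orch_fanin")


records = [
    {"task_id": 1, "phase": "orch_fanin", "end": 50},
    {"task_id": 1, "phase": "orch_submit", "end": 60},
    {"task_id": 1, "phase": "orch_submit", "end": 90},  # later duplicate, ignored
    {"task_id": 2, "phase": "orch_params", "end": 70},
    {"task_id": 2, "phase": "orch_fanin", "end": 80},
]
print(select_anchors(records))
print(submit_count(records))
```

Keeping the priority list in one tuple means a new capture format only needs a new entry at the front, while old captures keep rendering through the fallbacks.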

Docs (docs/dfx/l2-swimlane-profiling.md)

  • Section 2: orchestrator description shifted from "9 sub-steps" to "per-submit envelope + cumulative log counters".
  • Section 3 phase table: orchestrator phase string list updated.
  • Section 4: orchestrator overhead breakdown rephrased to point at the per-submit record + cold-path log split.

Test plan

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented May 27, 2026

Review Change Stack

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR consolidates orchestrator profiling from per-sub-step phase records to single per-submit envelope records. The phase enum introduces ORCH_SUBMIT = 25, profiling macros emit one record per submit_task() or alloc_tensors() call, step-level timing moves to cycle counters, and swimlane visualization and host-side tooling are updated to match the new model.

Changes

Orchestrator profiling envelope consolidation

| Layer | File(s) | Summary |
|---|---|---|
| Phase enum and contract definitions | src/a2a3/platform/include/common/l2_perf_profiling.h, src/a2a3/platform/include/aicpu/l2_perf_collector_aicpu.h | Introduced AicpuPhaseId::ORCH_SUBMIT = 25 as the new envelope-level orchestrator phase ID; removed legacy per-sub-step IDs (16–24); kept scheduler phases (SCHED_COMPLETE, SCHED_DISPATCH). Updated l2_perf_aicpu_record_orch_phase documentation to describe submission-level envelope recording. |
| Profiling macro infrastructure for per-submit emission | src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp | Refactored profiling macros to track per-submit start timestamp and introduce CYCLE_COUNT_ORCH_SUBMIT_RECORD(tid) macro. Adjusted non-ORCH_PROFILING path to conditionally disable cycle accumulation. Removed obsolete ORCH_SCOPE_END record emission. |
| Orchestrator step timing in submit_task_common | src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp | Replaced per-sub-step phase record emissions with cycle accumulation calls (CYCLE_COUNT_LAP) for alloc, sync, lookup, insert, args, and fanin steps. Added final CYCLE_COUNT_ORCH_SUBMIT_RECORD to emit single envelope record. |
| Orchestrator step timing in alloc_tensors | src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp | Replaced per-sub-step phase records with lap-counter accumulation for alloc, args, and fanin steps. Added final CYCLE_COUNT_ORCH_SUBMIT_RECORD emission. |
| Host-side phase classification and naming | src/a2a3/platform/src/host/l2_perf_collector.cpp | Refactored is_scheduler_phase() to use fixed numeric boundary kAicpuOrchPhaseIdBase = 16. Simplified orch_phase_name() to explicitly handle only ORCH_SUBMIT; all other orchestrator IDs now resolve as "unknown". |
| Swimlane visualization anchor selection and rendering | simpler_setup/tools/swimlane_converter.py | Updated phase color mapping for orch_submit. Introduced per-task anchor-selection dictionary with first-seen semantics: orch_submit preferred, legacy orch_fanin/orch_params used as fallback. Modified arrow emission to use selected anchor timing. Adjusted submit counting to use orch_submit with orch_fanin fallback. |
| Documentation of the new phase model | docs/dfx/l2-swimlane-profiling.md | Updated swimlane profiling documentation to describe orch_submit as per-submit envelope (not per-sub-step), explained per-sub-step timing now reported via g_orch_*_cycle device log counters, and clarified scheduler and orchestrator phase field semantics with legacy value notes. |

Possibly related PRs

  • hw-native-sys/simpler#869: Modifies L2 profiling phase handling in the same host collector and swimlane tooling, removing legacy scheduler phases while this PR refactors orchestrator phase emission to envelope-level records.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit refactors the trace,
One record per submit, more grace—
Sub-steps now counted as cycles in logs,
No per-step records to clog the pipeline,
The swimlane glows cleaner, profile more aligned! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: folding multiple per-sub-step orchestrator phase records into a single per-submit envelope, which is the core refactoring across the entire changeset. |
| Description check | ✅ Passed | The description is comprehensively related to the changeset, providing detailed motivation (6× reduction in GM writes and calls), before/after metrics, and specific file-by-file changes that align with the actual modifications shown in the summary. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request folds the multiple orchestrator sub-step phases into a single ORCH_SUBMIT phase record covering the entire submit task window, while preserving the per-sub-step cycle counts in the cold-path device log. It updates the documentation, Python swimlane converter, and C++ profiling headers and source files to support this change while maintaining fallback support for legacy captures. A review comment points out a logical discrepancy in swimlane_converter.py where the last-seen orch_submit wins instead of the first-seen as documented, and provides a code suggestion to fix it.

Comment thread simpler_setup/tools/swimlane_converter.py Outdated
@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/orch-submit-fold branch from 7e0c70f to 6d9829f Compare May 27, 2026 07:58
Each submit_task() / alloc_tensors() call used to emit 6 separate
AicpuPhaseRecord entries (ORCH_SYNC, ORCH_ALLOC, ORCH_LOOKUP,
ORCH_INSERT, ORCH_PARAMS, ORCH_FANIN) — 240 B of GM writes per submit.
The per-sub-step breakdown they carried duplicated the
g_orch_*_cycle cumulatives that the cold-path device log already
prints; the only consumer of the per-sub-step records was Perfetto
eye-balling, where 6 µs-scale bars convey essentially the same
information as one bar spanning the whole submit.

Fold them into one ORCH_SUBMIT record per submit:
  240 B → 40 B GM write  (6×↓)
  6 record_phase calls → 1                                    (6×↓)
  7 get_sys_cnt_aicpu calls → 2  (1 at START, 1 at SUBMIT_END) (3.5×↓)

For "which sub-step is slow overall", the cold-path log's per-step
cycle ratios still cover it. For "which submit is slow", the single
ORCH_SUBMIT record carries the full wall-clock envelope.

Scope is a2a3 tensormap_and_ringbuffer only.

Device (pto_orchestrator.cpp):
- CYCLE_COUNT_START captures _submit_start_ts as a separate variable
  so the per-submit envelope is recoverable after the LAPs have
  advanced _t0.
- CYCLE_COUNT_LAP_RECORD macro deleted; all 9 call sites (6 in
  submit_task_common, 3 in alloc_tensors) become plain CYCLE_COUNT_LAP
  (accumulate-only — g_orch_*_cycle cumulatives unchanged).
- New CYCLE_COUNT_ORCH_SUBMIT_RECORD(tid) fires once per submit path
  with [_submit_start_ts, _t1, g_orch_submit_idx, task_id.raw].
- Dropped commented-out ORCH_SCOPE_END emit at scope_end().

Enum (l2_perf_profiling.h):
- AicpuPhaseId: drop ORCH_SYNC..ORCH_SCOPE_END (ids 16-24). Replace
  with single ORCH_SUBMIT = 16. Doc note records that ids 17-24 may
  appear in legacy captures and are dropped by the host parser.

Host parser (l2_perf_collector.cpp):
- is_scheduler_phase boundary check uses ORCH_SUBMIT instead of the
  removed ORCH_SYNC.
- orch_phase_name switch collapses to one case (ORCH_SUBMIT →
  "orch_submit"); default returns "unknown" so legacy ids on old
  captures land in "unknown" and downstream tools drop them.

Header (l2_perf_collector_aicpu.h):
- Doc for l2_perf_aicpu_record_orch_phase updated to reflect the
  single-record semantics. Function signature unchanged — phase_id
  param still required, callers pass ORCH_SUBMIT.

Tools (swimlane_converter.py):
- orch_phase_colors: "orch_submit" primary; legacy 6 sub-step strings
  retained so old captures render.
- Orch → sched dispatch arrow anchor: prefers orch_submit end, falls
  back to legacy orch_fanin / orch_params for old captures.
- submit_count derivation prefers orch_submit, falls back to
  orch_fanin.

Docs (docs/dfx/l2-swimlane-profiling.md):
- Section 2: orchestrator description shifted from "9 sub-steps" to
  "per-submit envelope + cumulative log counters".
- Section 3 phase table: orchestrator phase string list updated.
- Section 4: orchestrator overhead breakdown rephrased to point at
  the per-submit record + cold-path log split.

Verified on a2a3sim with --enable-l2-swimlane --enable-dep-gen:
  test_l2_swimlane, test_l2_swimlane_mixed, test_dep_gen,
  test_dep_gen_chain all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/orch-submit-fold branch from 6d9829f to 75358f2 Compare May 27, 2026 08:05
@ChaoWao ChaoWao marked this pull request as ready for review May 27, 2026 08:41
@ChaoWao ChaoWao merged commit 18d62e2 into hw-native-sys:main May 27, 2026
15 of 16 checks passed
@ChaoWao ChaoWao deleted the refactor/orch-submit-fold branch May 27, 2026 08:43
2 participants