Refactor: fold orch sub-step phase records into one per-submit envelope by hw-native-sys-bot · Pull Request #871 · hw-native-sys/simpler

hw-native-sys-bot · 2026-05-27T07:49:34Z

Why

Each submit_task() / alloc_tensors() call used to emit 6 separate AicpuPhaseRecord entries (ORCH_SYNC, ORCH_ALLOC, ORCH_LOOKUP, ORCH_INSERT, ORCH_PARAMS, ORCH_FANIN) — 240 B of GM writes per submit. The per-sub-step breakdown they carried duplicated the g_orch_*_cycle cumulatives that the cold-path device log already prints; the only consumer of the per-sub-step records was Perfetto eye-balling, where 6 µs-scale bars convey essentially the same information as one bar spanning the whole submit.

This PR folds them into one ORCH_SUBMIT record per submit:

Metric	Before	After	Reduction
GM write per submit	240 B (6 × 40 B)	40 B (1 record)	6×↓
`l2_perf_aicpu_record_phase` calls per submit	6	1	6×↓
`get_sys_cnt_aicpu()` calls per submit	7	2	3.5×↓
Schema bump / new buffer / new struct	none	none	0
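
The ratios in the table follow directly from the record size and the call counts. A minimal sketch of that arithmetic, where the helper name `per_submit_cost` is hypothetical and only the numbers come from the table:

```python
RECORD_BYTES = 40  # size of one AicpuPhaseRecord, as stated in the table


def per_submit_cost(records: int, clock_reads: int) -> dict:
    """Per-submit profiling cost for a given number of emitted records."""
    return {
        "gm_bytes": records * RECORD_BYTES,  # global-memory bytes written
        "record_calls": records,             # l2_perf_aicpu_record_phase calls
        "clock_reads": clock_reads,          # get_sys_cnt_aicpu() calls
    }


before = per_submit_cost(records=6, clock_reads=7)
after = per_submit_cost(records=1, clock_reads=2)

# Reduction factor per metric (before / after).
reduction = {key: before[key] / after[key] for key in before}
```

Here `reduction` works out to 6× for GM bytes and record calls and 3.5× for clock reads, matching the last column.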

Decision rationale: "which sub-step is slow overall?" is already covered by the cold-path log's per-step cycle ratios. "Which submit is slow?" is what the single ORCH_SUBMIT record (covering the whole [start, end] wall-clock window) addresses. Two artifacts, two clean responsibilities.

Scope is a2a3 tensormap_and_ringbuffer only.

What changed

Device (pto_orchestrator.cpp)

CYCLE_COUNT_START captures _submit_start_ts as a separate variable so the per-submit envelope is recoverable after the CYCLE_COUNT_LAP calls have advanced _t0.
CYCLE_COUNT_LAP_RECORD macro deleted; all 9 call sites (6 in submit_task_common, 3 in alloc_tensors) become plain CYCLE_COUNT_LAP — accumulate-only, g_orch_*_cycle cumulatives unchanged.
New CYCLE_COUNT_ORCH_SUBMIT_RECORD(tid) fires once per submit path with [_submit_start_ts, _t1, g_orch_submit_idx, task_id.raw].
Dropped commented-out ORCH_SCOPE_END emit at scope_end().
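
The macro changes above can be modelled in a few lines of Python. This is a sketch of the pattern only, not the actual C++ macros: the class `SubmitProfiler`, the step names, the injected `clock`, and the choice to bump the submit index inside the record step are all assumptions made for illustration.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Illustrative sub-step names, loosely mirroring the old ORCH_* phases.
STEPS = ("sync", "alloc", "lookup", "insert", "params", "fanin")


@dataclass
class SubmitProfiler:
    clock: Callable[[], int]  # stand-in for get_sys_cnt_aicpu()
    cumulative: Dict[str, int] = field(default_factory=dict)  # g_orch_*_cycle analogue
    records: List[Tuple[int, int, int, int]] = field(default_factory=list)
    submit_idx: int = 0
    _submit_start_ts: int = 0
    _t0: int = 0

    def start(self) -> None:
        # CYCLE_COUNT_START analogue: keep the envelope start separately from _t0.
        self._submit_start_ts = self.clock()
        self._t0 = self._submit_start_ts

    def lap(self, step: str) -> None:
        # CYCLE_COUNT_LAP analogue: accumulate only, then advance _t0.
        t1 = self.clock()
        self.cumulative[step] = self.cumulative.get(step, 0) + (t1 - self._t0)
        self._t0 = t1

    def submit_record(self, task_id: int) -> None:
        # CYCLE_COUNT_ORCH_SUBMIT_RECORD analogue: one record per submit.
        t1 = self.clock()
        self.records.append((self._submit_start_ts, t1, self.submit_idx, task_id))
        self.submit_idx += 1  # assumption: the index advances once per submit


def profiled_submit(prof: SubmitProfiler, task_id: int) -> None:
    prof.start()
    for step in STEPS:
        # The real sub-step work would run here.
        prof.lap(step)
    prof.submit_record(task_id)


# Deterministic fake clock: 0, 10, 20, ...
prof = SubmitProfiler(clock=itertools.count(0, 10).__next__)
profiled_submit(prof, task_id=42)
```

Because `_t0` moves forward on every lap, the envelope start would be lost without the separate `_submit_start_ts`; that is the reason the PR description gives for capturing it in `CYCLE_COUNT_START`. In this model every lap still reads the clock; the PR's figure of 2 clock reads per submit implies the real macros skip that read in some configurations, which this sketch does not model.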

Enum (l2_perf_profiling.h)

AicpuPhaseId: drop ORCH_SYNC..ORCH_SCOPE_END (ids 16-24). Replace with single ORCH_SUBMIT = 16. Doc note records that ids 17-24 may appear in legacy captures and are dropped by the host parser.

Host parser (l2_perf_collector.cpp)

is_scheduler_phase boundary check uses ORCH_SUBMIT instead of the removed ORCH_SYNC.
orch_phase_name switch collapses to one case (ORCH_SUBMIT → "orch_submit"); default returns "unknown" so legacy ids on old captures land in "unknown" and downstream tools drop them.
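
A Python rendering of the host-side classification described above may help when reading old captures. It is a sketch under assumptions: the real code is C++ in l2_perf_collector.cpp, the `id < ORCH_SUBMIT` form of the boundary check is inferred from the description, and `keep_orch_records` is a hypothetical stand-in for the downstream tools that drop unknown phases.

```python
ORCH_SUBMIT = 16  # per the PR description; ids 17-24 may appear in legacy captures


def is_scheduler_phase(phase_id: int) -> bool:
    # Assumption: scheduler phase ids sit below the first orchestrator id.
    return 0 <= phase_id < ORCH_SUBMIT


def orch_phase_name(phase_id: int) -> str:
    # Single known orchestrator phase; everything else falls through.
    if phase_id == ORCH_SUBMIT:
        return "orch_submit"
    return "unknown"


def keep_orch_records(phase_ids):
    """Hypothetical downstream filter: keep only named orchestrator phases."""
    return [
        pid
        for pid in phase_ids
        if not is_scheduler_phase(pid) and orch_phase_name(pid) != "unknown"
    ]


# A mixed capture: one scheduler id, two orch_submit ids, two legacy ids.
kept = keep_orch_records([3, 16, 17, 24, 16])
```

With this filter the legacy ids 17 and 24 are dropped and both `ORCH_SUBMIT` records survive.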

Header doc (l2_perf_collector_aicpu.h)

Doc for l2_perf_aicpu_record_orch_phase updated to reflect single-record semantics. Function signature unchanged — phase_id param still required, callers pass ORCH_SUBMIT.

Tools (swimlane_converter.py)

orch_phase_colors: "orch_submit" primary; legacy 6 sub-step strings retained so old captures render.
Orch → sched dispatch arrow anchor: prefers orch_submit end, falls back to legacy orch_fanin / orch_params for old captures.
submit_count derivation prefers orch_submit, falls back to orch_fanin.
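
The anchor and count fallbacks listed above could look roughly like the following. This is not the code in swimlane_converter.py: the record keys (`phase`, `task_id`, `end`), the function names, and the ordering of `orch_fanin` ahead of `orch_params` are assumptions. The first-seen tie-break reflects the semantics the reviews below describe as intended.

```python
# Lower number = preferred anchor. Assumed order for the two legacy fallbacks.
ANCHOR_PRIORITY = {"orch_submit": 0, "orch_fanin": 1, "orch_params": 2}


def select_anchors(phases):
    """Map task_id -> end timestamp of the best available anchor phase."""
    best = {}  # task_id -> (priority, end)
    for rec in phases:
        prio = ANCHOR_PRIORITY.get(rec["phase"])
        if prio is None:
            continue  # not an anchor candidate
        current = best.get(rec["task_id"])
        # Strict '<' keeps the first-seen record among equal priorities.
        if current is None or prio < current[0]:
            best[rec["task_id"]] = (prio, rec["end"])
    return {task_id: end for task_id, (_, end) in best.items()}


def count_submits(phases):
    """Prefer orch_submit records; fall back to orch_fanin on legacy captures."""
    names = [rec["phase"] for rec in phases]
    return names.count("orch_submit") or names.count("orch_fanin")


new_capture = [
    {"phase": "orch_submit", "task_id": 1, "end": 5},
    {"phase": "orch_submit", "task_id": 1, "end": 9},  # later duplicate, ignored
    {"phase": "orch_submit", "task_id": 2, "end": 12},
]
legacy_capture = [
    {"phase": "orch_params", "task_id": 7, "end": 3},
    {"phase": "orch_fanin", "task_id": 7, "end": 4},
]

new_anchors = select_anchors(new_capture)
legacy_anchors = select_anchors(legacy_capture)
```

Using a strict comparison means a later record of equal priority never replaces an earlier one, which avoids the last-seen-wins behaviour flagged in the review.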

Docs (docs/dfx/l2-swimlane-profiling.md)

Section 2: orchestrator description shifted from "9 sub-steps" to "per-submit envelope + cumulative log counters".
Section 3 phase table: orchestrator phase string list updated.
Section 4: orchestrator overhead breakdown rephrased to point at the per-submit record + cold-path log split.

Test plan

pytest tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/ tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/ --platform a2a3sim --enable-l2-swimlane --enable-dep-gen → 4 passed in 36s
Hardware bench on paged_attention_unroll Case1 with --enable-l2-swimlane 4. Expected: sched_cost stays in noise (same pattern as #863, "Refactor: drop fanout from L2PerfRecord hot path", and #869, "Refactor: drop SCHED_IDLE_WAIT phase records and vestigial SCAN slot"); orch-side records per submit drop from 6 to 1.

🤖 Generated with Claude Code

coderabbitai · 2026-05-27T07:49:41Z

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR consolidates orchestrator profiling from per-sub-step phase records to single per-submit envelope records. The phase enum introduces ORCH_SUBMIT = 25, profiling macros emit one record per submit_task() or alloc_tensors() call, step-level timing moves to cycle counters, and swimlane visualization and host-side tooling are updated to match the new model.

Changes

Orchestrator profiling envelope consolidation

Layer / File(s)	Summary
Phase enum and contract definitions `src/a2a3/platform/include/common/l2_perf_profiling.h`, `src/a2a3/platform/include/aicpu/l2_perf_collector_aicpu.h`	Introduced `AicpuPhaseId::ORCH_SUBMIT = 25` as the new envelope-level orchestrator phase ID; removed legacy per-sub-step IDs (16–24); kept scheduler phases (`SCHED_COMPLETE`, `SCHED_DISPATCH`). Updated `l2_perf_aicpu_record_orch_phase` documentation to describe submission-level envelope recording.
Profiling macro infrastructure for per-submit emission `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`	Refactored profiling macros to track per-submit start timestamp and introduce `CYCLE_COUNT_ORCH_SUBMIT_RECORD(tid)` macro. Adjusted non-ORCH_PROFILING path to conditionally disable cycle accumulation. Removed obsolete `ORCH_SCOPE_END` record emission.
Orchestrator step timing in submit_task_common `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`	Replaced per-sub-step phase record emissions with cycle accumulation calls (`CYCLE_COUNT_LAP`) for alloc, sync, lookup, insert, args, and fanin steps. Added final `CYCLE_COUNT_ORCH_SUBMIT_RECORD` to emit single envelope record.
Orchestrator step timing in alloc_tensors `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`	Replaced per-sub-step phase records with lap-counter accumulation for alloc, args, and fanin steps. Added final `CYCLE_COUNT_ORCH_SUBMIT_RECORD` emission.
Host-side phase classification and naming `src/a2a3/platform/src/host/l2_perf_collector.cpp`	Refactored `is_scheduler_phase()` to use fixed numeric boundary `kAicpuOrchPhaseIdBase = 16`. Simplified `orch_phase_name()` to explicitly handle only `ORCH_SUBMIT`; all other orchestrator IDs now resolve as `"unknown"`.
Swimlane visualization anchor selection and rendering `simpler_setup/tools/swimlane_converter.py`	Updated phase color mapping for `orch_submit`. Introduced per-task anchor-selection dictionary with first-seen semantics: `orch_submit` preferred, legacy `orch_fanin`/`orch_params` used as fallback. Modified arrow emission to use selected anchor timing. Adjusted submit counting to use `orch_submit` with `orch_fanin` fallback.
Documentation of the new phase model `docs/dfx/l2-swimlane-profiling.md`	Updated swimlane profiling documentation to describe `orch_submit` as per-submit envelope (not per-sub-step), explained per-sub-step timing now reported via `g_orch_*_cycle` device log counters, and clarified scheduler and orchestrator phase field semantics with legacy value notes.

Possibly related PRs

hw-native-sys/simpler#869: Modifies L2 profiling phase handling in the same host collector and swimlane tooling, removing legacy scheduler phases while this PR refactors orchestrator phase emission to envelope-level records.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit refactors the trace,
One record per submit, more grace—
Sub-steps now counted as cycles in logs,
No per-step records to clog the pipeline,
The swimlane glows cleaner, profile more aligned! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: folding multiple per-sub-step orchestrator phase records into a single per-submit envelope, which is the core refactoring across the entire changeset.
Description check	✅ Passed	The description is comprehensively related to the changeset, providing detailed motivation (6× reduction in GM writes and calls), before/after metrics, and specific file-by-file changes that align with the actual modifications shown in the summary.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

gemini-code-assist

Code Review

This pull request folds the multiple orchestrator sub-step phases into a single ORCH_SUBMIT phase record covering the entire submit task window, while preserving the per-sub-step cycle counts in the cold-path device log. It updates the documentation, Python swimlane converter, and C++ profiling headers and source files to support this change while maintaining fallback support for legacy captures. A review comment points out a logical discrepancy in swimlane_converter.py where the last-seen orch_submit wins instead of the first-seen as documented, and provides a code suggestion to fix it.

Each submit_task() / alloc_tensors() call used to emit 6 separate AicpuPhaseRecord entries (ORCH_SYNC, ORCH_ALLOC, ORCH_LOOKUP, ORCH_INSERT, ORCH_PARAMS, ORCH_FANIN), 240 B of GM writes per submit. The per-sub-step breakdown they carried duplicated the g_orch_*_cycle cumulatives that the cold-path device log already prints; the only consumer of the per-sub-step records was Perfetto eye-balling, where 6 µs-scale bars convey essentially the same information as one bar spanning the whole submit.

Fold them into one ORCH_SUBMIT record per submit:

- 240 B → 40 B GM write (6×↓)
- 6 record_phase calls → 1 (6×↓)
- 7 get_sys_cnt_aicpu calls → 2 (1 at START, 1 at SUBMIT_END) (3.5×↓)

For "which sub-step is slow overall", the cold-path log's per-step cycle ratios still cover it. For "which submit is slow", the single ORCH_SUBMIT record carries the full wall-clock envelope.

Scope is a2a3 tensormap_and_ringbuffer only.

Device (pto_orchestrator.cpp):
- CYCLE_COUNT_START captures _submit_start_ts as a separate variable so the per-submit envelope is recoverable after the LAPs have advanced _t0.
- CYCLE_COUNT_LAP_RECORD macro deleted; all 9 call sites (6 in submit_task_common, 3 in alloc_tensors) become plain CYCLE_COUNT_LAP (accumulate-only; g_orch_*_cycle cumulatives unchanged).
- New CYCLE_COUNT_ORCH_SUBMIT_RECORD(tid) fires once per submit path with [_submit_start_ts, _t1, g_orch_submit_idx, task_id.raw].
- Dropped commented-out ORCH_SCOPE_END emit at scope_end().

Enum (l2_perf_profiling.h):
- AicpuPhaseId: drop ORCH_SYNC..ORCH_SCOPE_END (ids 16-24). Replace with single ORCH_SUBMIT = 16. Doc note records that ids 17-24 may appear in legacy captures and are dropped by the host parser.

Host parser (l2_perf_collector.cpp):
- is_scheduler_phase boundary check uses ORCH_SUBMIT instead of the removed ORCH_SYNC.
- orch_phase_name switch collapses to one case (ORCH_SUBMIT → "orch_submit"); default returns "unknown" so legacy ids on old captures land in "unknown" and downstream tools drop them.

Header (l2_perf_collector_aicpu.h):
- Doc for l2_perf_aicpu_record_orch_phase updated to reflect the single-record semantics. Function signature unchanged: phase_id param still required, callers pass ORCH_SUBMIT.

Tools (swimlane_converter.py):
- orch_phase_colors: "orch_submit" primary; legacy 6 sub-step strings retained so old captures render.
- Orch → sched dispatch arrow anchor: prefers orch_submit end, falls back to legacy orch_fanin / orch_params for old captures.
- submit_count derivation prefers orch_submit, falls back to orch_fanin.

Docs (docs/dfx/l2-swimlane-profiling.md):
- Section 2: orchestrator description shifted from "9 sub-steps" to "per-submit envelope + cumulative log counters".
- Section 3 phase table: orchestrator phase string list updated.
- Section 4: orchestrator overhead breakdown rephrased to point at the per-submit record + cold-path log split.

Verified on a2a3sim with --enable-l2-swimlane --enable-dep-gen: test_l2_swimlane, test_l2_swimlane_mixed, test_dep_gen, test_dep_gen_chain all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed May 27, 2026

Comment thread simpler_setup/tools/swimlane_converter.py Outdated

hw-native-sys-bot force-pushed the refactor/orch-submit-fold branch from 7e0c70f to 6d9829f Compare May 27, 2026 07:58

hw-native-sys-bot force-pushed the refactor/orch-submit-fold branch from 6d9829f to 75358f2 Compare May 27, 2026 08:05

ChaoWao marked this pull request as ready for review May 27, 2026 08:41

ChaoWao approved these changes May 27, 2026

ChaoWao merged commit 18d62e2 into hw-native-sys:main May 27, 2026
15 of 16 checks passed

ChaoWao deleted the refactor/orch-submit-fold branch May 27, 2026 08:43

ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:refactor/orch-submit-fold
