Refactor: drop SCHED_IDLE_WAIT phase records and vestigial SCAN slot by hw-native-sys-bot · Pull Request #869 · hw-native-sys/simpler

hw-native-sys-bot · 2026-05-27T06:20:42Z

Why

When --enable-l2-swimlane is at SCHED_PHASES (3) or higher, every idle
scheduler loop iteration emits a 40-byte SCHED_IDLE_WAIT phase record. On
paged_attention_unroll Case1, idle iterations make up 56–62 % of all
sched records on the busy threads (measured: T0 363/614, T1 541/867,
T2 404/725; T3 unused on this 4-thread case). Each idle record carries no
unique information — its [start_time, end_time] is the wall-clock gap
between two non-idle records on the same thread, and that gap is fully
recoverable from the neighbouring records' timestamps. The host tools
double-paint the same time range either way.
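The recoverability argument above can be sketched in a few lines of Python. This is an illustrative model only, not the code in sched_overhead_analysis: the record layout (dicts with `tid`, `start_us`, `end_us`) and the function names are assumptions made for the example.

```python
from collections import defaultdict


def reconstruct_idle(records):
    """Return {tid: [(gap_start_us, gap_end_us), ...]} from work records.

    `records` is a hypothetical list of dicts with 'tid', 'start_us' and
    'end_us'; only non-idle (complete/dispatch) records are expected.
    """
    by_thread = defaultdict(list)
    for rec in records:
        by_thread[rec["tid"]].append(rec)

    gaps = {}
    for tid, recs in by_thread.items():
        recs.sort(key=lambda r: r["start_us"])
        thread_gaps = []
        # An idle span is the hole between one record's end and the next start.
        for prev, nxt in zip(recs, recs[1:]):
            if nxt["start_us"] > prev["end_us"]:
                thread_gaps.append((prev["end_us"], nxt["start_us"]))
        gaps[tid] = thread_gaps
    return gaps


def idle_us(gaps):
    """Sum the reconstructed gap lengths per thread."""
    return {tid: sum(end - start for start, end in g) for tid, g in gaps.items()}


records = [
    {"tid": 0, "phase": "complete", "start_us": 0.0, "end_us": 2.0},
    {"tid": 0, "phase": "dispatch", "start_us": 2.0, "end_us": 3.5},
    {"tid": 0, "phase": "complete", "start_us": 10.0, "end_us": 11.0},
    {"tid": 1, "phase": "dispatch", "start_us": 1.0, "end_us": 4.0},
    {"tid": 1, "phase": "complete", "start_us": 6.0, "end_us": 7.0},
]
gaps = reconstruct_idle(records)
print(idle_us(gaps))  # {0: 6.5, 1: 2.0}
```

Back-to-back records (thread 0 at 2.0 us) produce no gap, so only genuine holes are counted as idle.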

Dropping the emit:

- cuts the per-thread phase-record GM write traffic by ~60 % on this case;
- frees roughly 2.5× headroom in the 16384-record per-thread phase buffer
  (current peak measured: 867 records, well below saturation, so this is
  a future-proofing margin for longer captures or finer phase
  taxonomies, not a fix for an observed overflow);
- has no measurable effect on sched_cost wall-clock: hardware bench on
  Case1 with --enable-l2-swimlane 4 shows BASE 1192.97 us → HEAD 1190.85 us
  (−0.18 %, well within the ±13 us run-to-run noise).
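The ~60 % and ~2.5× figures follow directly from the per-thread counts quoted above. A small worked check, using only those measured numbers (the variable names are illustrative):

```python
# (idle records, total sched records) per busy thread, as measured on Case1.
counts = {"T0": (363, 614), "T1": (541, 867), "T2": (404, 725)}
BUFFER_RECORDS = 16384  # per-thread phase buffer capacity

summary = {}
for thread, (idle, total) in counts.items():
    idle_share = idle / total        # fraction of records that were idle
    remaining = total - idle         # records still emitted after the change
    headroom = total / remaining     # how much longer the buffer lasts
    summary[thread] = (idle_share, headroom)
    print(f"{thread}: idle share {idle_share:.1%}, "
          f"{remaining} records remain, headroom x{headroom:.2f}")

peak_usage = max(total for _, total in counts.values()) / BUFFER_RECORDS
print(f"peak buffer usage before the change: {peak_usage:.1%}")
```

The per-thread headroom lands between roughly 2.3× and 2.7×, which is where the "roughly 2.5×" summary comes from.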

The SCHED_SCAN slot is removed at the same time. This runtime is
event-driven (a task's last fanin release pushes downstream into the
ready/wiring queue) so there is no poll-style scan phase. The enum
value, the sched_scan_cycle counter, and its cold-path log line were
all vestigial leftovers that always read as 0 us / 0 %.

Scope is a2a3 tensormap_and_ringbuffer only.

What changed

Device (scheduler_dispatch.cpp)

- Stops emitting SCHED_IDLE_WAIT on idle iterations.
- CYCLE_COUNT_LAP(sched_idle_cycle) stays, so the cold-path summary still has the wall-clock idle total.
- _t0_phase = _t1 is lifted out of the (now-deleted) emit branch so the next iteration's COMPLETE/DISPATCH record gets the correct start_time rather than absorbing the preceding idle gap into its own duration.
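Why the `_t0_phase = _t1` assignment has to survive the deletion can be shown with a toy model of the loop. This is a Python simulation under simplifying assumptions, not the C++ in scheduler_dispatch.cpp: each iteration is reduced to a `(kind, t1_us)` pair, and the `advance_on_idle` flag is a hypothetical switch between the two behaviours.

```python
def emit_work_records(iterations, advance_on_idle):
    """Model the phase-record emit for a sequence of loop iterations.

    iterations: list of (kind, t1_us), where t1_us is the timestamp read at
    the end of the iteration and kind is 'complete', 'dispatch' or 'idle'.
    Returns the emitted work records as (kind, start_us, end_us).
    """
    t0_phase = 0.0
    records = []
    for kind, t1 in iterations:
        if kind != "idle":
            records.append((kind, t0_phase, t1))
            t0_phase = t1
        elif advance_on_idle:
            # Idle iteration: nothing is emitted, but the phase start still
            # moves forward so the next work record starts after the gap.
            t0_phase = t1
    return records


iterations = [("dispatch", 2.0), ("idle", 5.0), ("idle", 9.0), ("complete", 10.0)]

correct = emit_work_records(iterations, advance_on_idle=True)
absorbed = emit_work_records(iterations, advance_on_idle=False)
print(correct)   # [('dispatch', 0.0, 2.0), ('complete', 9.0, 10.0)]
print(absorbed)  # [('dispatch', 0.0, 2.0), ('complete', 2.0, 10.0)]
```

Without the advance, the COMPLETE record would span 8 us instead of 1 us, and the host-side gap reconstruction would see no idle time at all.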

Enum + cold path (l2_perf_profiling.h, scheduler_types.h, scheduler_cold_path.cpp)

- Drops SCHED_SCAN (never emitted in this runtime), SCHED_IDLE_WAIT, and the SCHED_PHASE_COUNT sentinel.
- Drops the sched_scan_cycle counter field and removes it from the sched_total formula.
- Removes the cold-path "scan : 0.000us (0%)" log line.
- Doc note records that legacy IDs 2–3 may appear in old captures.

Host parser (l2_perf_collector.cpp)

- is_scheduler_phase checks against ORCH_SYNC (16) instead of the removed SCHED_PHASE_COUNT.
- The lambda's default: branch labels legacy IDs 2–3 as "unknown"; host tools then drop them.
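The classification described above can be modelled as follows. This is a Python sketch of the logic, not the C++ in l2_perf_collector.cpp; the numeric IDs come from this PR, while the strict `<` comparison and the helper names are assumptions.

```python
SCHED_COMPLETE = 0
SCHED_DISPATCH = 1
ORCH_SYNC = 16  # first orchestrator-side phase ID (assumed upper bound)

SCHED_PHASE_NAMES = {SCHED_COMPLETE: "complete", SCHED_DISPATCH: "dispatch"}
WORK_PHASES = {"complete", "dispatch"}


def is_scheduler_phase(phase_id):
    """Scheduler-side IDs are everything below ORCH_SYNC, legacy 2-3 included."""
    return 0 <= phase_id < ORCH_SYNC


def scheduler_phase_name(phase_id):
    """Map an ID to its name; removed legacy IDs fall through to 'unknown'."""
    return SCHED_PHASE_NAMES.get(phase_id, "unknown")


def keep_for_analysis(phase_id):
    """Host tools keep only work records and drop 'unknown' ones."""
    return is_scheduler_phase(phase_id) and scheduler_phase_name(phase_id) in WORK_PHASES


for pid in (0, 1, 2, 3, 16):
    print(pid, is_scheduler_phase(pid), scheduler_phase_name(pid), keep_for_analysis(pid))
```

Keeping the boundary at ORCH_SYNC means a legacy record from an old capture still lands on the scheduler branch and is then filtered by name, rather than being misrouted to the orchestrator side.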

Tools

- sched_overhead_analysis: parses only "complete" / "dispatch" records; idle_us is reconstructed by summing gaps between consecutive work records on each thread. Legacy "idle" / "scan" / "unknown" records in old captures are skipped (they would otherwise double-count idle). Print precision on Avg scheduler loop iteration is bumped from .1f to .3f, because sub-µs values are common now that idle threads no longer inflate the sum.
- swimlane_converter: sorts each thread's work records by start_time and emits a synthetic yellow IDLE bar for each gap; the visualization matches the prior look with no per-iteration device-side record cost.
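The synthetic IDLE bar can be sketched as a Chrome trace "complete" (`ph: "X"`) event inserted between work events. This is a minimal sketch, not the code in swimlane_converter.py: the event fields follow the shape suggested in the review comment on this PR, and the record layout, the helper name, and the `"yellow"` colour name are assumptions.

```python
import json

SCHED_PID = 3          # process lane used for scheduler events (per the review comment)
IDLE_CNAME = "yellow"  # assumed trace colour name for the idle bar


def build_thread_events(work_records):
    """Emit work events plus one synthetic idle event per gap, per thread.

    work_records: hypothetical dicts with 'tid', 'phase', 'start_us', 'end_us'.
    """
    by_tid = {}
    for rec in work_records:
        by_tid.setdefault(rec["tid"], []).append(rec)

    events = []
    for tid in sorted(by_tid):
        last_end_us = None
        for rec in sorted(by_tid[tid], key=lambda r: r["start_us"]):
            start_us, end_us = rec["start_us"], rec["end_us"]
            if last_end_us is not None and start_us > last_end_us:
                # Gap between two work records: paint it as idle.
                events.append({
                    "ph": "X", "name": "idle", "cat": "scheduler",
                    "pid": SCHED_PID, "tid": tid,
                    "ts": last_end_us, "dur": start_us - last_end_us,
                    "cname": IDLE_CNAME, "args": {"phase": "idle"},
                })
            events.append({
                "ph": "X", "name": rec["phase"], "cat": "scheduler",
                "pid": SCHED_PID, "tid": tid,
                "ts": start_us, "dur": end_us - start_us,
                "args": {"phase": rec["phase"]},
            })
            last_end_us = end_us if last_end_us is None else max(last_end_us, end_us)
    return events


work = [
    {"tid": 0, "phase": "dispatch", "start_us": 12.0, "end_us": 13.0},
    {"tid": 0, "phase": "complete", "start_us": 0.0, "end_us": 4.0},
]
trace = build_thread_events(work)
print(json.dumps({"traceEvents": trace}, indent=2))
```

Sorting by start time first matters: the input above arrives out of order, and the idle bar is only placed correctly (4.0 us to 12.0 us) once the records are ordered per thread.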

Test plan

- pytest tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/ tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/ --platform a2a3sim --enable-l2-swimlane --enable-dep-gen → 4 passed.
- Hardware bench on paged_attention_unroll Case1 with --enable-l2-swimlane 4, n=10 BASE vs n=9 HEAD on device 0:
  - sched_cost mean: 1192.97 → 1190.85 us (−0.18 %, in noise)
  - orch_cost mean: 1007.85 → 1002.40 us (−0.54 %, in noise)
  - Phase buffer max usage: 867 records / thread (5 % of 16384)
  - Idle share of records: 56–62 %
- (Out of scope) Workloads with longer-running captures where the 16384 limit would matter: none in the current example set.
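The "in noise" judgement on the bench numbers is a simple comparison of the mean shift against the run-to-run spread. A worked check using the figures above; applying the ±13 us sched_cost noise band to orch_cost as well is an assumption made only for this illustration.

```python
NOISE_US = 13.0  # run-to-run noise quoted for sched_cost; reused for orch_cost here

bench = {
    "sched_cost": (1192.97, 1190.85),  # (BASE mean us, HEAD mean us)
    "orch_cost": (1007.85, 1002.40),
}

results = {}
for metric, (base, head) in bench.items():
    delta_us = head - base
    delta_pct = delta_us / base * 100.0
    in_noise = abs(delta_us) <= NOISE_US
    results[metric] = (round(delta_pct, 2), in_noise)
    print(f"{metric}: {delta_us:+.2f} us ({delta_pct:+.2f} %), in noise: {in_noise}")
```

Both shifts are a few microseconds against a noise band of 13 us, so the bench supports "no measurable effect" rather than a speed-up.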

🤖 Generated with Claude Code

gemini-code-assist · 2026-05-27T06:20:45Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

coderabbitai · 2026-05-27T07:09:23Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Walkthrough

This PR consolidates the a2a3 scheduler profiling model by removing legacy scheduler phases (SCHED_SCAN, SCHED_IDLE_WAIT) and their associated phase records. Idle spans are now reconstructed by host tooling from gaps between consecutive work records, and profiling counters, runtime emission, and host-side analysis tools are aligned accordingly.

Changes

Scheduler Phase Model Consolidation

| Layer / File(s) | Summary |
|---|---|
| Scheduler phase enum and profiling counter struct: `src/a2a3/platform/include/common/l2_perf_profiling.h`, `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h` | `AicpuPhaseId` enum now defines only `SCHED_COMPLETE` (0) and `SCHED_DISPATCH` (1) for scheduler work phases; legacy IDs (`SCHED_SCAN`, `SCHED_IDLE_WAIT`) and `SCHED_PHASE_COUNT` sentinel are removed. `AicpuPhaseRecord` documentation is updated for `extra1/extra2` field semantics. `SchedL2PerfCounters` drops the `sched_scan_cycle` counter field. |
| Scheduler runtime, phase emission and profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`, `scheduler_cold_path.cpp`, `src/a2a3/platform/src/host/l2_perf_collector.cpp` | Dispatcher no longer emits `SCHED_IDLE_WAIT` phase records; only `_t0_phase` progression is maintained for subsequent work record timestamps. Cold path profiling log removes scan cycle reporting and recomputes `sched_total` from wiring/complete/dispatch/idle only. Host collector's phase classification changes boundary from `SCHED_PHASE_COUNT` to `ORCH_SYNC` to route legacy IDs in older captures; phase name mapping adds explanatory comments for legacy behavior. |
| Host-side phase parsing and trace rendering: `simpler_setup/tools/sched_overhead_analysis.py`, `swimlane_converter.py` | Scheduler overhead analysis filters phase records to complete and dispatch only, reconstructs idle as gaps between adjacent work records per thread, and removes scan from phase tables and dominant-phase selection. Output formatting uses higher precision (%.3f) for average loop iteration. Swimlane converter narrows phase color map to work phases and filters event emission to skip non-work phase records. |
| L2 swimlane profiling documentation: `docs/dfx/l2-swimlane-profiling.md` | Clarifies that on a2a3, idle iterations no longer emit scheduler phase records; idle spans are reconstructed from gaps between work records. Documents that legacy captures may include `SCHED_IDLE_WAIT` / `SCHED_SCAN`, which the parser drops. Field documentation lists current expected `phase_id` values and notes legacy IDs 2–3 may appear in older captures. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 The scheduler hops with just two phases now,
No scan, no idle—only work, and how!
The gaps between them tell the idle tale,
Simpler profiling that will never fail.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title accurately and specifically describes the main change: removing SCHED_IDLE_WAIT phase records and the unused SCHED_SCAN enum slot, which aligns with the core refactoring across device, enum, and tool changes. |
| Description check | ✅ Passed | The PR description is comprehensive and directly related to the changeset, explaining the rationale (idle records account for 56-62% of phase records with no unique information), the specific changes across device/enum/host/tools, and test results. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


When --enable-l2-swimlane is at SCHED_PHASES (3) or higher, every idle scheduler iteration was emitting a 40-byte SCHED_IDLE_WAIT phase record. On realistic loads (paged_attention Case1 ~60% idle iters; drain-heavy shapes 85-90%) this can saturate the per-thread 16384-record phase buffer and silently drop later records. The record itself carries no unique data: it is the gap between two non-idle records on the same thread, and that gap is fully recoverable from the start_time/end_time of those neighbouring records.

The SCHED_SCAN slot is removed at the same time. This runtime is event-driven (a task's last fanin release pushes downstream into the ready/wiring queue) so there is no poll-style scan phase. The enum value, the sched_scan_cycle counter, and its cold-path log line were all vestigial leftovers that always read as 0 us / 0%.

Scope is a2a3 tensormap_and_ringbuffer only.

Device:
- scheduler_dispatch.cpp stops emitting SCHED_IDLE_WAIT on idle iterations. CYCLE_COUNT_LAP(sched_idle_cycle) stays so the cumulative cold-path summary still has the wall-clock idle total.
- _t0_phase = _t1 is lifted out of the (now-deleted) emit branch so the next iter's COMPLETE/DISPATCH record gets the correct start_time rather than absorbing the preceding idle gap into its own duration.
- l2_perf_profiling.h drops SCHED_SCAN, SCHED_IDLE_WAIT, and the SCHED_PHASE_COUNT sentinel. Doc note records that legacy IDs 2-3 may appear in old captures.
- scheduler_types.h drops the unused sched_scan_cycle field.
- scheduler_cold_path.cpp drops sched_scan_cycle from the sched_total formula and removes the "scan : 0.000us (0%)" log line.

Host:
- l2_perf_collector::is_scheduler_phase checks against ORCH_SYNC (16) instead of the removed SCHED_PHASE_COUNT, keeping legacy IDs 2-3 on the scheduler-side branch where the JSON writer maps them to "unknown" (host tools then drop them).

Tools:
- sched_overhead_analysis parses only "complete" / "dispatch" records; idle_us is reconstructed by summing gaps between consecutive work records on each thread. Legacy "idle" / "scan" / "unknown" records in old captures are skipped.
- swimlane_converter sorts each thread's work records by start_time and emits a synthetic yellow IDLE bar for each gap, matching the prior visualization with no per-iter record cost.

Verified on a2a3sim with --enable-l2-swimlane --enable-dep-gen: test_l2_swimlane, test_l2_swimlane_mixed, test_dep_gen, test_dep_gen_chain all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@simpler_setup/tools/swimlane_converter.py`:
- Around line 738-766: The loop over thread_records currently drops non-work
phases and never emits synthetic IDLE events for gaps; modify the logic in
swimlane_converter.py where thread_records is iterated to track the previous
record end (e.g., last_end_us) for that tid and, before appending the next work
event (inside the loop that builds events with "ph": "X"), detect if start_us >
last_end_us and append a synthetic IDLE event covering [last_end_us, start_us)
with fields similar to work events (ph="X", name="idle" or "IDLE",
cat="scheduler", pid=3, tid=tid, ts=last_end_us, dur=start_us-last_end_us, cname
from phase_colors.get("idle", ...), args with phase="idle"); update last_end_us
to end_us after emitting either the gap event and/or the work event so long idle
stretches appear in Perfetto.


📥 Commits

Reviewing files that changed from the base of the PR and between 352c3f8 and 6a17da0.

📒 Files selected for processing (8)

docs/dfx/l2-swimlane-profiling.md
simpler_setup/tools/sched_overhead_analysis.py
simpler_setup/tools/swimlane_converter.py
src/a2a3/platform/include/common/l2_perf_profiling.h
src/a2a3/platform/src/host/l2_perf_collector.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h

💤 Files with no reviewable changes (1)

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h

hw-native-sys-bot force-pushed the refactor/drop-idle-wait-phase-record branch from 23755dd to 7ea650b Compare May 27, 2026 06:21

hw-native-sys-bot marked this pull request as ready for review May 27, 2026 06:21

hw-native-sys-bot force-pushed the refactor/drop-idle-wait-phase-record branch from 7ea650b to 7d5d976 Compare May 27, 2026 06:26

hw-native-sys-bot changed the title ~~Refactor: drop SCHED_IDLE_WAIT phase records from hot path~~ Refactor: drop SCHED_IDLE_WAIT phase records and vestigial SCAN slot May 27, 2026

hw-native-sys-bot force-pushed the refactor/drop-idle-wait-phase-record branch 2 times, most recently from f4b5fe7 to 6a17da0 Compare May 27, 2026 07:07

hw-native-sys-bot force-pushed the refactor/drop-idle-wait-phase-record branch from 6a17da0 to 0eb35c9 Compare May 27, 2026 07:12

coderabbitai Bot reviewed May 27, 2026

Comment thread simpler_setup/tools/swimlane_converter.py

ChaoWao approved these changes May 27, 2026

ChaoWao merged commit bb0b37a into hw-native-sys:main May 27, 2026
16 checks passed

ChaoWao deleted the refactor/drop-idle-wait-phase-record branch May 27, 2026 07:24

hw-native-sys-bot mentioned this pull request May 27, 2026: Refactor: fold orch sub-step phase records into one per-submit envelope #871 (Merged, 2 tasks)

