
Refactor: drop SCHED_IDLE_WAIT phase records and vestigial SCAN slot#869

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:refactor/drop-idle-wait-phase-record
May 27, 2026

Conversation

@hw-native-sys-bot
Collaborator

@hw-native-sys-bot hw-native-sys-bot commented May 27, 2026

Why

When --enable-l2-swimlane is at SCHED_PHASES (3) or higher, every idle
scheduler loop iteration emits a 40-byte SCHED_IDLE_WAIT phase record. On
paged_attention_unroll Case1, idle iterations make up 56–62 % of all
sched records on the busy threads (measured: T0 363/614, T1 541/867,
T2 404/725; T3 unused on this 4-thread case). Each idle record carries no
unique information — its [start_time, end_time] is the wall-clock gap
between two non-idle records on the same thread, and that gap is fully
recoverable from the neighbouring records' timestamps. The host tools
double-paint the same time range either way.

Dropping the emit:

  • cuts the per-thread phase-record GM write traffic by ~60 % on this case;
  • frees roughly 2.5× headroom in the 16384-record per-thread phase buffer
    (current peak measured: 867 records, well below saturation, so this is
    a future-proofing margin for longer captures or finer phase
    taxonomies, not a fix for an observed overflow);
  • has no measurable effect on sched_cost wall-clock: hardware bench on
    Case1 with --enable-l2-swimlane 4 shows BASE 1192.97 us → HEAD 1190.85 us
    (−0.18 %, well within the ±13 us run-to-run noise).
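
The percentages above can be re-derived from the quoted per-thread counts. The following is only a worked-arithmetic sketch: the dictionary layout and variable names are illustrative, and the 60 % figure used for the headroom estimate is the rounded idle share from the description, not a new measurement.

```python
# Per-thread (idle records, total records), as quoted in the description.
counts = {"T0": (363, 614), "T1": (541, 867), "T2": (404, 725)}
BUFFER_RECORDS = 16384  # per-thread phase buffer capacity, from the description

# Idle share of all sched records on each busy thread.
idle_share = {tid: idle / total for tid, (idle, total) in counts.items()}

# Peak buffer usage before the change (worst thread).
peak_records = max(total for _, total in counts.values())
peak_fraction = peak_records / BUFFER_RECORDS

# If about 60 % of records disappear, the same buffer covers
# 1 / (1 - 0.60) times as much capture.
headroom_factor = 1 / (1 - 0.60)

for tid, share in idle_share.items():
    print(f"{tid}: idle share {share:.1%}")
print(f"peak usage {peak_fraction:.1%}, headroom ~{headroom_factor:.1f}x")
```

The shares land between roughly 56 % and 62 %, and the peak of 867 records is about 5 % of the buffer, which matches the numbers reported in the test plan.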

The SCHED_SCAN slot is removed at the same time. This runtime is
event-driven (a task's last fanin release pushes downstream into the
ready/wiring queue) so there is no poll-style scan phase. The enum
value, the sched_scan_cycle counter, and its cold-path log line were
all vestigial leftovers that always read as 0 us / 0 %.

Scope is a2a3 tensormap_and_ringbuffer only.

What changed

Device (scheduler_dispatch.cpp)

  • Stops emitting SCHED_IDLE_WAIT on idle iterations.
  • CYCLE_COUNT_LAP(sched_idle_cycle) stays — cold-path summary still has the wall-clock idle total.
  • _t0_phase = _t1 lifted out of the (now-deleted) emit branch so the next iter's COMPLETE/DISPATCH record gets the correct start_time rather than absorbing the preceding idle gap into its own duration.
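
Why the `_t0_phase = _t1` reset has to survive the deleted emit branch can be seen in a small host-side model. This is a hedged Python sketch, not the device code: `emit_records`, its arguments, and the assumption that `_t1` is sampled once per loop iteration are all illustrative.

```python
def emit_records(iterations, lift_reset=True):
    """Model one scheduler thread.

    iterations: list of (t1, phase), where t1 is the timestamp sampled at the
    end of the iteration and phase is "complete", "dispatch", or None (idle).
    Returns the emitted work records as (phase, start_time, end_time).
    """
    records = []
    t0_phase = 0  # start time of the phase currently being measured
    for t1, phase in iterations:
        if phase is not None:
            # Work iteration: emit a record, then start the next phase at t1.
            records.append((phase, t0_phase, t1))
            t0_phase = t1
        elif lift_reset:
            # Idle iteration: no record is emitted, but the phase start
            # still advances so the idle gap is not billed to later work.
            t0_phase = t1
    return records


loop = [(10, "complete"), (25, None), (40, None), (47, "dispatch")]
fixed = emit_records(loop, lift_reset=True)
buggy = emit_records(loop, lift_reset=False)
print(fixed)  # dispatch record starts at 40
print(buggy)  # dispatch record starts at 10 and absorbs the idle gap
```

With the reset kept, the gap between 10 and 40 stays a gap between two records, which is what the host tools later reconstruct as idle time.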

Enum + cold path (l2_perf_profiling.h, scheduler_types.h, scheduler_cold_path.cpp)

  • Drops SCHED_SCAN (never emitted in this runtime), SCHED_IDLE_WAIT, and the SCHED_PHASE_COUNT sentinel.
  • Drops the sched_scan_cycle counter field and removes it from the sched_total formula.
  • Removes the cold-path "scan : 0.000us (0%)" log line.
  • Doc note records that legacy IDs 2–3 may appear in old captures.

Host parser (l2_perf_collector.cpp)

  • is_scheduler_phase checks against ORCH_SYNC (16) instead of the removed SCHED_PHASE_COUNT.
  • The lambda's default: branch labels legacy IDs 2–3 as "unknown"; host tools then drop them.
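
The classification described above can be modelled in a few lines. The real collector is C++; this Python sketch only mirrors the stated behaviour. The comparison `phase_id < ORCH_SYNC` and the function names are assumptions; the IDs 0, 1, and 16 come from the PR text.

```python
ORCH_SYNC = 16  # first orchestrator phase ID, per the description

# Only two scheduler work phases remain defined.
SCHED_PHASE_NAMES = {0: "complete", 1: "dispatch"}


def is_scheduler_phase(phase_id):
    # Assumed form of the check: everything below ORCH_SYNC is scheduler-side,
    # which keeps legacy IDs 2 and 3 on the scheduler branch.
    return 0 <= phase_id < ORCH_SYNC


def scheduler_phase_name(phase_id):
    # Legacy or undefined IDs fall through to "unknown", like a default: branch.
    return SCHED_PHASE_NAMES.get(phase_id, "unknown")


for pid in (0, 1, 2, 3, 16):
    side = "scheduler" if is_scheduler_phase(pid) else "orchestrator"
    print(pid, side, scheduler_phase_name(pid) if side == "scheduler" else "-")
```

Routing legacy IDs to the scheduler side and naming them "unknown" means old captures still parse, and downstream tools can filter on the name alone.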

Tools

  • sched_overhead_analysis: parses only "complete" / "dispatch" records; idle_us is reconstructed by summing gaps between consecutive work records on each thread. Legacy "idle" / "scan" / "unknown" records in old captures are skipped (would otherwise double-count idle). Print precision on Avg scheduler loop iteration bumped from .1f to .3f — sub-µs values are common now that idle threads no longer inflate the sum.
  • swimlane_converter: sorts each thread's work records by start_time and emits a synthetic yellow IDLE bar for each gap; visualization matches the prior look with no per-iter device-side record cost.
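
The gap-based reconstruction used by the tools can be sketched as follows. This is a minimal illustration, not the code in `sched_overhead_analysis.py`: the record field names (`tid`, `phase`, `start_us`, `end_us`) and the function name are hypothetical.

```python
from collections import defaultdict

WORK_PHASES = {"complete", "dispatch"}


def reconstruct_idle_us(records):
    """Sum the gaps between consecutive work records on each thread."""
    spans_by_thread = defaultdict(list)
    for rec in records:
        # Legacy "idle" / "scan" / "unknown" records are skipped so that
        # idle time in old captures is not counted twice.
        if rec["phase"] in WORK_PHASES:
            spans_by_thread[rec["tid"]].append((rec["start_us"], rec["end_us"]))

    idle_us = {}
    for tid, spans in spans_by_thread.items():
        spans.sort()
        total = 0.0
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            # Clamp at zero in case two records touch or overlap.
            total += max(0.0, next_start - prev_end)
        idle_us[tid] = total
    return idle_us


capture = [
    {"tid": 0, "phase": "complete", "start_us": 0.0, "end_us": 10.0},
    {"tid": 0, "phase": "idle", "start_us": 10.0, "end_us": 40.0},  # legacy
    {"tid": 0, "phase": "dispatch", "start_us": 40.0, "end_us": 47.0},
    {"tid": 0, "phase": "complete", "start_us": 47.0, "end_us": 50.0},
    {"tid": 1, "phase": "dispatch", "start_us": 5.0, "end_us": 6.0},
]
idle = reconstruct_idle_us(capture)
print(idle)
```

Because the legacy idle record is dropped before the gaps are summed, a new capture and an old capture of the same run would yield the same idle total under this scheme.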

Test plan

  • pytest tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/ tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/ --platform a2a3sim --enable-l2-swimlane --enable-dep-gen → 4 passed.
  • Hardware bench on paged_attention_unroll Case1 with --enable-l2-swimlane 4, n=10 BASE vs n=9 HEAD on device 0:
    • sched_cost mean: 1192.97 → 1190.85 us (−0.18 %, in noise)
    • orch_cost mean: 1007.85 → 1002.40 us (−0.54 %, in noise)
    • Phase buffer max usage: 867 records / thread (5 % of 16384)
    • Idle share of records: 56–62 %
  • (Out of scope) Workloads with longer-running captures where the 16384 limit would matter — none in the current example set.

🤖 Generated with Claude Code


@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/drop-idle-wait-phase-record branch from 23755dd to 7ea650b Compare May 27, 2026 06:21
@hw-native-sys-bot hw-native-sys-bot marked this pull request as ready for review May 27, 2026 06:21

@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/drop-idle-wait-phase-record branch from 7ea650b to 7d5d976 Compare May 27, 2026 06:26
@hw-native-sys-bot hw-native-sys-bot changed the title Refactor: drop SCHED_IDLE_WAIT phase records from hot path Refactor: drop SCHED_IDLE_WAIT phase records and vestigial SCAN slot May 27, 2026
@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/drop-idle-wait-phase-record branch 2 times, most recently from f4b5fe7 to 6a17da0 Compare May 27, 2026 07:07
coderabbitai Bot commented May 27, 2026

Walkthrough

This PR consolidates the a2a3 scheduler profiling model by removing legacy scheduler phases (SCHED_SCAN, SCHED_IDLE_WAIT) and their associated phase records. Idle spans are now reconstructed by host tooling from gaps between consecutive work records, and profiling counters, runtime emission, and host-side analysis tools are aligned accordingly.

Changes

Scheduler Phase Model Consolidation

| Layer / File(s) | Summary |
| --- | --- |
| Scheduler phase enum and profiling counter struct: src/a2a3/platform/include/common/l2_perf_profiling.h, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h | AicpuPhaseId enum now defines only SCHED_COMPLETE (0) and SCHED_DISPATCH (1) for scheduler work phases; legacy IDs (SCHED_SCAN, SCHED_IDLE_WAIT) and the SCHED_PHASE_COUNT sentinel are removed. AicpuPhaseRecord documentation is updated for extra1/extra2 field semantics. SchedL2PerfCounters drops the sched_scan_cycle counter field. |
| Scheduler runtime, phase emission and profiling: src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp, scheduler_cold_path.cpp, src/a2a3/platform/src/host/l2_perf_collector.cpp | Dispatcher no longer emits SCHED_IDLE_WAIT phase records; only _t0_phase progression is maintained for subsequent work record timestamps. Cold path profiling log removes scan cycle reporting and recomputes sched_total from wiring/complete/dispatch/idle only. Host collector's phase classification changes its boundary from SCHED_PHASE_COUNT to ORCH_SYNC to route legacy IDs in older captures; phase name mapping adds explanatory comments for legacy behavior. |
| Host-side phase parsing and trace rendering: simpler_setup/tools/sched_overhead_analysis.py, swimlane_converter.py | Scheduler overhead analysis filters phase records to complete and dispatch only, reconstructs idle as gaps between adjacent work records per thread, and removes scan from phase tables and dominant-phase selection. Output formatting uses higher precision (%.3f) for average loop iteration. Swimlane converter narrows the phase color map to work phases and filters event emission to skip non-work phase records. |
| L2 swimlane profiling documentation: docs/dfx/l2-swimlane-profiling.md | Clarifies that on a2a3, idle iterations no longer emit scheduler phase records; idle spans are reconstructed from gaps between work records. Documents that legacy captures may include SCHED_IDLE_WAIT / SCHED_SCAN, which the parser drops. Field documentation lists current expected phase_id values and notes legacy IDs 2–3 may appear in older captures. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 The scheduler hops with just two phases now,
No scan, no idle—only work, and how!
The gaps between them tell the idle tale,
Simpler profiling that will never fail.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title accurately and specifically describes the main change: removing SCHED_IDLE_WAIT phase records and the unused SCHED_SCAN enum slot, which aligns with the core refactoring across device, enum, and tool changes. |
| Description check | ✅ Passed | The PR description is comprehensive and directly related to the changeset, explaining the rationale (idle records account for 56-62% of phase records with no unique information), the specific changes across device/enum/host/tools, and test results. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


When --enable-l2-swimlane is at SCHED_PHASES (3) or higher, every idle
scheduler iteration was emitting a 40-byte SCHED_IDLE_WAIT phase record.
On realistic loads (paged_attention Case1 ~60% idle iters; drain-heavy
shapes 85-90%) this saturates the per-thread 16384-record phase buffer
and silently drops later records. The record itself carries no
unique data — it's the gap between two non-idle records on the same
thread, and that gap is fully recoverable from the start_time/end_time
of those neighbouring records.

The SCHED_SCAN slot is removed at the same time. This runtime is
event-driven (a task's last fanin release pushes downstream into the
ready/wiring queue) so there is no poll-style scan phase. The enum
value, the sched_scan_cycle counter, and its cold-path log line were
all vestigial leftovers that always read as 0 us / 0%.

Scope is a2a3 tensormap_and_ringbuffer only.

Device:
- scheduler_dispatch.cpp stops emitting SCHED_IDLE_WAIT on idle
  iterations. CYCLE_COUNT_LAP(sched_idle_cycle) stays so the cumulative
  cold-path summary still has the wall-clock idle total.
- _t0_phase = _t1 is lifted out of the (now-deleted) emit branch so the
  next iter's COMPLETE/DISPATCH record gets the correct start_time
  rather than absorbing the preceding idle gap into its own duration.
- l2_perf_profiling.h drops SCHED_SCAN, SCHED_IDLE_WAIT, and the
  SCHED_PHASE_COUNT sentinel. Doc note records that legacy IDs 2-3
  may appear in old captures.
- scheduler_types.h drops the unused sched_scan_cycle field.
- scheduler_cold_path.cpp drops sched_scan_cycle from the sched_total
  formula and removes the "scan : 0.000us (0%)" log line.

Host:
- l2_perf_collector::is_scheduler_phase checks against ORCH_SYNC (16)
  instead of the removed SCHED_PHASE_COUNT, keeping legacy IDs 2-3
  on the scheduler-side branch where the JSON writer maps them to
  "unknown" (host tools then drop them).

Tools:
- sched_overhead_analysis parses only "complete" / "dispatch" records;
  idle_us is reconstructed by summing gaps between consecutive work
  records on each thread. Legacy "idle" / "scan" / "unknown" records
  in old captures are skipped.
- swimlane_converter sorts each thread's work records by start_time
  and emits a synthetic yellow IDLE bar for each gap, matching the
  prior visualization with no per-iter record cost.

Verified on a2a3sim with --enable-l2-swimlane --enable-dep-gen:
  test_l2_swimlane, test_l2_swimlane_mixed, test_dep_gen,
  test_dep_gen_chain all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/drop-idle-wait-phase-record branch from 6a17da0 to 0eb35c9 Compare May 27, 2026 07:12
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@simpler_setup/tools/swimlane_converter.py`:
- Around line 738-766: The loop over thread_records currently drops non-work
phases and never emits synthetic IDLE events for gaps; modify the logic in
swimlane_converter.py where thread_records is iterated to track the previous
record end (e.g., last_end_us) for that tid and, before appending the next work
event (inside the loop that builds events with "ph": "X"), detect if start_us >
last_end_us and append a synthetic IDLE event covering [last_end_us, start_us)
with fields similar to work events (ph="X", name="idle" or "IDLE",
cat="scheduler", pid=3, tid=tid, ts=last_end_us, dur=start_us-last_end_us, cname
from phase_colors.get("idle", ...), args with phase="idle"); update last_end_us
to end_us after emitting either the gap event and/or the work event so long idle
stretches appear in Perfetto.
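
The behaviour requested in this comment could look roughly like the sketch below. It is not the code in `swimlane_converter.py`: the function name `add_idle_gaps` and the default `"yellow"` colour name are assumptions, while the event fields (`ph`, `name`, `cat`, `pid`, `tid`, `ts`, `dur`, `cname`, `args`) follow the ones named in the comment.

```python
def add_idle_gaps(work_events, tid, idle_color="yellow"):
    """Interleave one thread's work events with synthetic idle events.

    work_events: complete-duration events ("ph": "X") with "ts" and "dur"
    in microseconds. Returns a new list sorted by start time.
    """
    out = []
    last_end_us = None
    for ev in sorted(work_events, key=lambda e: e["ts"]):
        start_us = ev["ts"]
        if last_end_us is not None and start_us > last_end_us:
            # Gap between the previous work record and this one: emit idle.
            out.append({
                "ph": "X", "name": "idle", "cat": "scheduler",
                "pid": 3, "tid": tid,
                "ts": last_end_us, "dur": start_us - last_end_us,
                "cname": idle_color, "args": {"phase": "idle"},
            })
        out.append(ev)
        end_us = start_us + ev["dur"]
        # Keep the furthest end seen so overlapping records create no gap.
        last_end_us = end_us if last_end_us is None else max(last_end_us, end_us)
    return out


events = [
    {"ph": "X", "name": "dispatch", "ts": 40.0, "dur": 7.0},
    {"ph": "X", "name": "complete", "ts": 0.0, "dur": 10.0},
]
timeline = add_idle_gaps(events, tid=0)
print([(e["name"], e["ts"], e["dur"]) for e in timeline])
```

Tracking the furthest end time rather than the last record's end avoids emitting a spurious idle bar when two work records overlap.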

📥 Commits

Reviewing files that changed from the base of the PR and between 352c3f8 and 6a17da0.

📒 Files selected for processing (8)
  • docs/dfx/l2-swimlane-profiling.md
  • simpler_setup/tools/sched_overhead_analysis.py
  • simpler_setup/tools/swimlane_converter.py
  • src/a2a3/platform/include/common/l2_perf_profiling.h
  • src/a2a3/platform/src/host/l2_perf_collector.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
💤 Files with no reviewable changes (1)
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h

Comment thread simpler_setup/tools/swimlane_converter.py
@ChaoWao ChaoWao merged commit bb0b37a into hw-native-sys:main May 27, 2026
16 checks passed
@ChaoWao ChaoWao deleted the refactor/drop-idle-wait-phase-record branch May 27, 2026 07:24