Refactor: drop SCHED_IDLE_WAIT phase records and vestigial SCAN slot#869
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
23755dd to
7ea650b
Compare
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
7ea650b to
7d5d976
Compare
f4b5fe7 to
6a17da0
Compare
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR consolidates the a2a3 scheduler profiling model by removing legacy scheduler phases ( ChangesScheduler Phase Model Consolidation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
When --enable-l2-swimlane is at SCHED_PHASES (3) or higher, every idle scheduler iteration was emitting a 40-byte SCHED_IDLE_WAIT phase record. On realistic loads (paged_attention Case1 ~60% idle iters; drain-heavy shapes 85-90%) this saturates the per-thread 16384-record phase buffer and silently drops later records. The record itself carries no unique data — it's the gap between two non-idle records on the same thread, and that gap is fully recoverable from the start_time/end_time of those neighbouring records. The SCHED_SCAN slot is removed at the same time. This runtime is event-driven (a task's last fanin release pushes downstream into the ready/wiring queue) so there is no poll-style scan phase. The enum value, the sched_scan_cycle counter, and its cold-path log line were all vestigial leftovers that always read as 0 us / 0%. Scope is a2a3 tensormap_and_ringbuffer only. Device: - scheduler_dispatch.cpp stops emitting SCHED_IDLE_WAIT on idle iterations. CYCLE_COUNT_LAP(sched_idle_cycle) stays so the cumulative cold-path summary still has the wall-clock idle total. - _t0_phase = _t1 is lifted out of the (now-deleted) emit branch so the next iter's COMPLETE/DISPATCH record gets the correct start_time rather than absorbing the preceding idle gap into its own duration. - l2_perf_profiling.h drops SCHED_SCAN, SCHED_IDLE_WAIT, and the SCHED_PHASE_COUNT sentinel. Doc note records that legacy IDs 2-3 may appear in old captures. - scheduler_types.h drops the unused sched_scan_cycle field. - scheduler_cold_path.cpp drops sched_scan_cycle from the sched_total formula and removes the "scan : 0.000us (0%)" log line. Host: - l2_perf_collector::is_scheduler_phase checks against ORCH_SYNC (16) instead of the removed SCHED_PHASE_COUNT, keeping legacy IDs 2-3 on the scheduler-side branch where the JSON writer maps them to "unknown" (host tools then drop them). Tools: - sched_overhead_analysis parses only "complete" / "dispatch" records; idle_us is reconstructed by summing gaps between consecutive work records on each thread. Legacy "idle" / "scan" / "unknown" records in old captures are skipped. - swimlane_converter sorts each thread's work records by start_time and emits a synthetic yellow IDLE bar for each gap, matching the prior visualization with no per-iter record cost. Verified on a2a3sim with --enable-l2-swimlane --enable-dep-gen: test_l2_swimlane, test_l2_swimlane_mixed, test_dep_gen, test_dep_gen_chain all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6a17da0 to
0eb35c9
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@simpler_setup/tools/swimlane_converter.py`:
- Around line 738-766: The loop over thread_records currently drops non-work
phases and never emits synthetic IDLE events for gaps; modify the logic in
swimlane_converter.py where thread_records is iterated to track the previous
record end (e.g., last_end_us) for that tid and, before appending the next work
event (inside the loop that builds events with "ph": "X"), detect if start_us >
last_end_us and append a synthetic IDLE event covering [last_end_us, start_us)
with fields similar to work events (ph="X", name="idle" or "IDLE",
cat="scheduler", pid=3, tid=tid, ts=last_end_us, dur=start_us-last_end_us, cname
from phase_colors.get("idle", ...), args with phase="idle"); update last_end_us
to end_us after emitting either the gap event and/or the work event so long idle
stretches appear in Perfetto.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 62cb474f-244b-4224-ad69-977f10473486
📒 Files selected for processing (8)
docs/dfx/l2-swimlane-profiling.mdsimpler_setup/tools/sched_overhead_analysis.pysimpler_setup/tools/swimlane_converter.pysrc/a2a3/platform/include/common/l2_perf_profiling.hsrc/a2a3/platform/src/host/l2_perf_collector.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
💤 Files with no reviewable changes (1)
- src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
Why
When
--enable-l2-swimlaneis atSCHED_PHASES(3) or higher, every idlescheduler loop iteration emits a 40-byte
SCHED_IDLE_WAITphase record. Onpaged_attention_unroll Case1, idle iterations make up 56–62 % of all
sched records on the busy threads (measured: T0 363/614, T1 541/867,
T2 404/725; T3 unused on this 4-thread case). Each idle record carries no
unique information — its
[start_time, end_time]is the wall-clock gapbetween two non-idle records on the same thread, and that gap is fully
recoverable from the neighbouring records' timestamps. The host tools
double-paint the same time range either way.
Dropping the emit:
(current peak measured: 867 records — well below saturation, so this is
a future-proofing margin for longer captures or finer phase
taxonomies, not a fix for an observed overflow);
sched_costwall-clock: hardware bench onCase1 with
--enable-l2-swimlane 4shows BASE 1192.97 us → HEAD 1190.85 us(−0.18 %, well within the ±13 us run-to-run noise).
The
SCHED_SCANslot is removed at the same time. This runtime isevent-driven (a task's last fanin release pushes downstream into the
ready/wiring queue) so there is no poll-style scan phase. The enum
value, the
sched_scan_cyclecounter, and its cold-path log line wereall vestigial leftovers that always read as
0 us / 0 %.Scope is a2a3
tensormap_and_ringbufferonly.What changed
Device (
scheduler_dispatch.cpp)SCHED_IDLE_WAITon idle iterations.CYCLE_COUNT_LAP(sched_idle_cycle)stays — cold-path summary still has the wall-clock idle total._t0_phase = _t1lifted out of the (now-deleted) emit branch so the next iter's COMPLETE/DISPATCH record gets the correctstart_timerather than absorbing the preceding idle gap into its own duration.Enum + cold path (
l2_perf_profiling.h,scheduler_types.h,scheduler_cold_path.cpp)SCHED_SCAN(never emitted in this runtime),SCHED_IDLE_WAIT, and theSCHED_PHASE_COUNTsentinel.sched_scan_cyclecounter field and removes it from thesched_totalformula."scan : 0.000us (0%)"log line.Host parser (
l2_perf_collector.cpp)is_scheduler_phasechecks againstORCH_SYNC(16) instead of the removedSCHED_PHASE_COUNT.default:branch labels legacy IDs 2–3 as"unknown"; host tools then drop them.Tools
sched_overhead_analysis: parses only"complete"/"dispatch"records;idle_usis reconstructed by summing gaps between consecutive work records on each thread. Legacy"idle"/"scan"/"unknown"records in old captures are skipped (would otherwise double-count idle). Print precision onAvg scheduler loop iterationbumped from.1fto.3f— sub-µs values are common now that idle threads no longer inflate the sum.swimlane_converter: sorts each thread's work records bystart_timeand emits a synthetic yellow IDLE bar for each gap; visualization matches the prior look with no per-iter device-side record cost.Test plan
pytest tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/ tests/st/a2a3/tensormap_and_ringbuffer/dfx/dep_gen/ --platform a2a3sim --enable-l2-swimlane --enable-dep-gen→ 4 passed.--enable-l2-swimlane 4, n=10 BASE vs n=9 HEAD on device 0:🤖 Generated with Claude Code