Fix: clean up "version" fields in L2 swimlane / dep_gen JSON #856
Code Review
This pull request updates the dependency graph generation schema from version 2 (v2) to version 3 (v3), introducing a strided tensor representation. Key changes include adding an args array with detailed tensor slice geometry to tasks, replacing raw_shapes with buffer_numel in the tensors schema, and replacing simple offsets with explicit start offsets and strides for both consumers and producers in the edges schema. Downstream tools, documentation, and tests have been updated to support and validate the new v3 schema. There are no review comments, so I have no feedback to provide.
The L2 swimlane per-task commit on AICPU was copying up to 128*8B = 1 KB of fanout edges plus walking the producer's fanout linked list, every task, on the scheduler completion critical path. The fanout edges are already the static DAG and are reconstructed offline by dep_gen replay into deps.json, so the device-side hot path was paying GM-bandwidth and cache-miss cost to duplicate information host tooling already has.

Scope is a2a3 only; a5 is untouched.

Device side:
- L2PerfRecord drops fanout[128] / fanout_count (~1088 B -> 64 B per record).
- l2_perf_aicpu_complete_record drops the trailing fanout / fanout_count parameters; the impl no longer touches them.
- scheduler_completion drops the fanout_arr build + linked-list walk; host_build_graph/aicpu_executor drops the same pattern at all four call sites.

Host side:
- l2_perf_collector::export_swimlane_json emits "fanout": [] and "fanout_count": 0 per task to keep the JSON schema shape stable, and drops the top-level "version" field, which had drifted into a duplicate of L2PerfLevel (see in-flight PR hw-native-sys#856 for the misaligned guard cleanup on the consumer side).

Downstream tools:
- swimlane_converter already preferred deps.json over task["fanout"]; it now reads the version-free schema and treats empty fanout as the expected steady state.
- sched_overhead_analysis no longer gates phase parsing on the dropped "version" field; it gates on presence of aicpu_scheduler_phases, which is the right key.

Tests and comments:
- dep_gen tests drop the now-vacuous "fanout subset-of deps" gate and the auto-add of --enable-l2-swimlane that only existed to feed that gate.
- _swimlane_validate drops the version assertion.
- profiling_levels.md, dep_gen.h, dep_gen_replay.h, pto_orchestrator comments updated to reflect deps.json as the sole source of truth for fanout.

Verified on a2a3sim: test_l2_swimlane, test_l2_swimlane_mixed, test_dep_gen, test_dep_gen_chain all pass with --enable-l2-swimlane --enable-dep-gen.
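A minimal sketch of the consumer behaviour described under "Downstream tools", not the actual swimlane_converter or sched_overhead_analysis code. The keys "fanout", "fanout_count" and "aicpu_scheduler_phases" come from the text above; the "task_id" key, the shape of the deps_fanout mapping, and both function names are assumptions made for illustration.

```python
def resolve_fanout(task: dict, deps_fanout: dict) -> list:
    """Prefer edges reconstructed from deps.json over the per-task list.

    deps_fanout is a hypothetical {task_id: [consumer_task_id, ...]} map
    built from deps.json; "task_id" is an assumed key name.
    """
    task_id = task["task_id"]
    if task_id in deps_fanout:
        return list(deps_fanout[task_id])
    # An empty per-task "fanout" is the expected steady state, not an error.
    return list(task.get("fanout", []))


def scheduler_phases(perf: dict) -> list:
    """Gate on presence of the phase data itself, not on a version number."""
    phases = perf.get("aicpu_scheduler_phases")
    return phases if isinstance(phases, list) else []
```

Gating on the key that is actually consumed means the check cannot drift out of step with a separately maintained number.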
deps.json and l2_perf_records.json both carried a "version" field
that consumers were getting wrong:
- deps.json bumped v2 → v3 in #808 but swimlane_converter still
  guarded on `version != 2`, silently rejected every fresh capture,
  and fell back to L2PerfRecord::fanout[], losing the race-window
  edges dep_gen replay exists to recover.
- l2_perf_records.json's "version" was never a schema version: the
  producer writes L2PerfLevel (1..4). Misreading it caused two
  consumers to short-circuit on `version != 2` / `< 2`, while phase
  blocks only exist at level >= 3.
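The mismatch in the second bullet can be illustrated with a small sketch. It assumes only what the text states: the field held an L2PerfLevel in 1..4, one guard short-circuited on `version != 2`, and phase blocks exist only at level >= 3. The function names are hypothetical and do not correspond to the real consumers.

```python
def old_guard_accepts(perf: dict) -> bool:
    """Pre-fix style guard: proceed only when "version" equals 2."""
    return perf.get("version") == 2


def phases_exist(level: int) -> bool:
    """Per the description, phase blocks are only emitted at level >= 3."""
    return level >= 3


# For every possible level, check whether the old guard would let phase
# parsing proceed on a record that actually contains phase blocks.
useful_levels = [
    level
    for level in range(1, 5)
    if old_guard_accepts({"version": level}) and phases_exist(level)
]
```

Under these assumptions `useful_levels` is empty: the guard admits only level 2, where no phase blocks exist, and rejects levels 3 and 4, where they do.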
Producer side: deps.json drops the field outright; l2_perf_records.json
(a2a3 + a5) renames "version" → "l2_perf_level" so the name matches
its meaning. Consumer side: drop the three now-misaligned guards
(deps_to_graph, swimlane_converter.load_deps_json /
_print_verbose_data_info,
sched_overhead_analysis.parse_scheduler_from_json_phases)
plus the version assertions in test_dep_gen,
test_dep_gen_chain, and _swimlane_validate.
Doc / comment fallout per .claude/rules/doc-consistency.md: retire
"v2 JSON" / "version 2" wording in favour of "l2_perf_level >= N"
across docs/dfx/{dep_gen,l2-swimlane-profiling}.md, profiling_levels.md
(a2a3 + a5), tools/README.md, the 6 scheduler comments (dispatch /
cold_path / types × a2a3, a5), and the tool docstrings. dep_gen.md §4
example + fields table rewritten against the strided-Tensor producer
(buffer_numel / start_offset / strides[] replace raw_shapes /
multi-dim offset[]); strides type corrected to uint32 (Tensor::strides
invariant > 0).
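As a rough illustration of the strided representation, the sketch below maps a slice described by a start offset and per-dimension strides onto flat element indices of the underlying buffer, and checks that it stays within buffer_numel. The names buffer_numel, start_offset and strides come from the description above; the separate shape argument and all function names are assumptions, and the real deps.json layout may differ.

```python
from itertools import product


def max_flat_index(start_offset: int, shape: tuple, strides: tuple) -> int:
    """Largest flat element index touched by the slice."""
    if len(shape) != len(strides):
        raise ValueError("shape and strides must have the same rank")
    if any(s <= 0 for s in strides):
        raise ValueError("strides must be > 0")  # invariant stated above
    if any(d <= 0 for d in shape):
        raise ValueError("shape extents must be > 0")
    return start_offset + sum((d - 1) * s for d, s in zip(shape, strides))


def fits_in_buffer(buffer_numel: int, start_offset: int,
                   shape: tuple, strides: tuple) -> bool:
    """True when every element of the slice lies inside the buffer."""
    return max_flat_index(start_offset, shape, strides) < buffer_numel


def flat_indices(start_offset: int, shape: tuple, strides: tuple) -> list:
    """Enumerate flat indices in row-major order of the logical index."""
    return [
        start_offset + sum(i * s for i, s in zip(idx, strides))
        for idx in product(*(range(d) for d in shape))
    ]
```

For example, a 2x3 slice with start_offset 2 and strides (4, 1) touches flat indices 2, 3, 4, 6, 7 and 8, so it fits in a buffer of 9 elements but not in one of 8.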