Fix: clean up "version" fields in L2 swimlane / dep_gen JSON #856

Merged
ChaoWao merged 1 commit into hw-native-sys:main from indigo1973:dep_0525 on May 27, 2026

@indigo1973 indigo1973 commented May 26, 2026

deps.json and l2_perf_records.json both carried a "version" field
that consumers were getting wrong:

  • deps.json bumped v2 → v3 in #808 but swimlane_converter still
    guarded on version != 2, silently rejected every fresh capture,
    and fell back to L2PerfRecord::fanout[] — losing the race-window
    edges dep_gen replay exists to recover.

  • l2_perf_records.json's "version" was never a schema version — the
    producer writes L2PerfLevel (1..4). Misreading it caused two
    consumers to short-circuit on version != 2 / < 2, while phase
    blocks only exist at level >= 3.

Producer side: deps.json drops the field outright; l2_perf_records.json
(a2a3 + a5) renames "version" → "l2_perf_level" so the name matches
its meaning. Consumer side: drop the three now-misaligned guards
(deps_to_graph, swimlane_converter.load_deps_json /
_print_verbose_data_info,
sched_overhead_analysis.parse_scheduler_from_json_phases) plus the
version assertions in test_dep_gen, test_dep_gen_chain, and
_swimlane_validate.
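
As a rough illustration of why the old guard misfired, the sketch below contrasts a check that treats the field as a schema version with one that reads it as a profiling level. Only the JSON keys "version" and "l2_perf_level" and the level >= 3 threshold come from the description above; the function names and the dictionary shape are hypothetical and are not taken from the real tools.

```python
# Hypothetical helpers; only the key names and the ">= 3" threshold
# are taken from the PR description.

PHASE_MIN_LEVEL = 3  # phase blocks only exist at level >= 3


def old_guard_allows_phases(records: dict) -> bool:
    """Pre-fix style: treat "version" as a schema version and require 2."""
    return records.get("version") == 2


def new_guard_allows_phases(records: dict) -> bool:
    """Post-fix style: read the profiling level under its real name."""
    return records.get("l2_perf_level", 0) >= PHASE_MIN_LEVEL


# A level-3 capture from the old producer ("version" actually held the level).
old_capture = {"version": 3}
# The same capture from the renamed producer.
new_capture = {"l2_perf_level": 3}

print(old_guard_allows_phases(old_capture))  # rejected although phases exist
print(new_guard_allows_phases(new_capture))  # accepted
```

Under these assumptions the old check accepts only level 2, which has no phase blocks, and rejects levels 3 and 4, which do.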

Doc / comment fallout per .claude/rules/doc-consistency.md: retire
"v2 JSON" / "version 2" wording in favour of "l2_perf_level >= N"
across docs/dfx/{dep_gen,l2-swimlane-profiling}.md, profiling_levels.md
(a2a3 + a5), tools/README.md, the 6 scheduler comments (dispatch /
cold_path / types × a2a3, a5), and the tool docstrings. dep_gen.md §4
example + fields table rewritten against the strided-Tensor producer
(buffer_numel / start_offset / strides[] replace raw_shapes /
multi-dim offset[]); strides type corrected to uint32 (Tensor::strides
invariant > 0).
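
To make the strided representation concrete, here is a minimal sketch of how a slice described by start_offset and strides[] maps onto flat element offsets inside a buffer of buffer_numel elements. Those three field names come from the description above; the `shape` argument and both function names are assumptions introduced for the example, and the actual deps.json layout may differ.

```python
from itertools import product


def slice_offsets(start_offset, shape, strides):
    """Return the flat element offsets touched by a strided slice."""
    if len(shape) != len(strides):
        raise ValueError("shape and strides must have the same rank")
    if any(s <= 0 for s in strides):
        raise ValueError("strides must be > 0")
    return [
        start_offset + sum(i * s for i, s in zip(index, strides))
        for index in product(*(range(n) for n in shape))
    ]


def fits_in_buffer(buffer_numel, start_offset, shape, strides):
    """Check that the last element of the slice lies inside the buffer."""
    if any(n == 0 for n in shape):
        return True  # an empty slice touches nothing
    last = start_offset + sum((n - 1) * s for n, s in zip(shape, strides))
    return last < buffer_numel


# A 2x3 window starting at row 1, column 2 of a row-major 4x8 buffer.
offsets = slice_offsets(start_offset=10, shape=[2, 3], strides=[8, 1])
print(offsets)  # [10, 11, 12, 18, 19, 20]
print(fits_in_buffer(32, 10, [2, 3], [8, 1]))  # True
```

Because every stride is positive, the largest offset is always reached at the last index of each dimension, which is what the bounds check relies on.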


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the dependency graph generation schema from version 2 (v2) to version 3 (v3), introducing a strided tensor representation. Key changes include adding an args array with detailed tensor slice geometry to tasks, replacing raw_shapes with buffer_numel in the tensors schema, and replacing simple offsets with explicit start offsets and strides for both consumers and producers in the edges schema. Downstream tools, documentation, and tests have been updated to support and validate the new v3 schema. There are no review comments, so I have no feedback to provide.

@indigo1973 indigo1973 changed the title from 'Fix: complete deps.json v3 rollout missed by PR #808' to 'Fix: drop misaligned version guards in the L2 swimlane pipeline' on May 26, 2026
hw-native-sys-bot pushed a commit to ChaoWao/simpler-fork that referenced this pull request May 27, 2026
The L2 swimlane per-task commit on AICPU was copying up to 128*8B = 1 KB
of fanout edges plus walking the producer's fanout linked list, every
task, on the scheduler completion critical path. The fanout edges are
already the static DAG and are reconstructed offline by dep_gen replay
into deps.json — so the device-side hot path was paying GM-bandwidth and
cache-miss cost to duplicate information host tooling already has.

Scope is a2a3 only; a5 is untouched.

Device side:
- L2PerfRecord drops fanout[128] / fanout_count (~1088 B -> 64 B per
  record).
- l2_perf_aicpu_complete_record drops the trailing fanout / fanout_count
  parameters; the impl no longer touches them.
- scheduler_completion drops the fanout_arr build + linked-list walk;
  host_build_graph/aicpu_executor drops the same pattern at all four
  call sites.
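
The size figures above can be checked with a few lines of arithmetic. The constants are the ones quoted in the commit message (the 1088 B figure is stated there as approximate); the task count and the helper name are hypothetical and only serve to show how the saving scales.

```python
FANOUT_SLOTS = 128        # fanout[128]
BYTES_PER_EDGE = 8        # "128*8B"
OLD_RECORD_BYTES = 1088   # approximate, per the commit message
NEW_RECORD_BYTES = 64

fanout_bytes = FANOUT_SLOTS * BYTES_PER_EDGE             # 1024 B = 1 KB
saved_per_record = OLD_RECORD_BYTES - NEW_RECORD_BYTES   # 1024 B


def buffer_bytes(num_tasks, record_bytes):
    """Total bytes needed to hold one record per task."""
    return num_tasks * record_bytes


tasks = 10_000  # hypothetical workload size
print(buffer_bytes(tasks, OLD_RECORD_BYTES))  # 10880000
print(buffer_bytes(tasks, NEW_RECORD_BYTES))  # 640000
```

With these numbers the per-record saving equals the size of the removed fanout array, and the record shrinks by a factor of 17.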

Host side:
- l2_perf_collector::export_swimlane_json emits "fanout": [] and
  "fanout_count": 0 per task to keep the JSON schema shape stable, and
  drops the top-level "version" field, which had drifted into a
  duplicate of L2PerfLevel (see in-flight PR hw-native-sys#856 for the misaligned
  guard cleanup on the consumer side).

Downstream tools:
- swimlane_converter already preferred deps.json over task["fanout"]; it
  now reads the version-free schema and treats empty fanout as the
  expected steady state.
- sched_overhead_analysis no longer gates phase parsing on the dropped
  "version" field — it gates on presence of aicpu_scheduler_phases,
  which is the right key.
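
The two downstream behaviours described above could look roughly like the following. The JSON keys "fanout" and "aicpu_scheduler_phases" come from the commit message; the function names, the "task_id" key, and the pre-parsed `deps_successors` mapping are assumptions made for illustration and do not reflect the real swimlane_converter or sched_overhead_analysis code.

```python
def resolve_fanout(task, deps_successors=None):
    """Pick successor task ids for one task.

    deps_successors is a hypothetical mapping of task id to successor ids,
    assumed to have been parsed from deps.json beforehand. When it is
    available it wins; otherwise fall back to the per-task "fanout" list,
    where an empty list is the expected steady state.
    """
    if deps_successors is not None:
        return list(deps_successors.get(task["task_id"], []))
    return list(task.get("fanout", []))


def scheduler_phases(records):
    """Gate on the presence of the phase data, not on a version field."""
    return records.get("aicpu_scheduler_phases") or []


task = {"task_id": 7, "fanout": [], "fanout_count": 0}
print(resolve_fanout(task, {7: [8, 9]}))  # [8, 9]
print(resolve_fanout(task))               # []
print(scheduler_phases({"tasks": [task]}))  # []
```

Keying on the data that is actually needed means a consumer keeps working if the producer's level numbering changes again.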

Tests and comments:
- dep_gen tests drop the now-vacuous "fanout subset-of deps" gate and
  the auto-add of --enable-l2-swimlane that only existed to feed that
  gate.
- _swimlane_validate drops the version assertion.
- profiling_levels.md, dep_gen.h, dep_gen_replay.h, pto_orchestrator
  comments updated to reflect deps.json as the sole source of truth
  for fanout.

Verified on a2a3sim: test_l2_swimlane, test_l2_swimlane_mixed,
test_dep_gen, test_dep_gen_chain all pass with
--enable-l2-swimlane --enable-dep-gen.
@indigo1973 indigo1973 changed the title from 'Fix: drop misaligned version guards in the L2 swimlane pipeline' to 'Fix: clean up "version" fields in L2 swimlane / dep_gen JSON' on May 27, 2026
@indigo1973
Contributor Author

/gemini review

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

ChaoWao added a commit that referenced this pull request May 27, 2026
The L2 swimlane per-task commit on AICPU was copying up to 128*8B = 1 KB
of fanout edges plus walking the producer's fanout linked list, every
task, on the scheduler completion critical path. The fanout edges are
already the static DAG and are reconstructed offline by dep_gen replay
into deps.json — so the device-side hot path was paying GM-bandwidth and
cache-miss cost to duplicate information host tooling already has.

Scope is a2a3 only; a5 is untouched.

Device side:
- L2PerfRecord drops fanout[128] / fanout_count (~1088 B -> 64 B per
  record).
- l2_perf_aicpu_complete_record drops the trailing fanout / fanout_count
  parameters; the impl no longer touches them.
- scheduler_completion drops the fanout_arr build + linked-list walk;
  host_build_graph/aicpu_executor drops the same pattern at all four
  call sites.

Host side:
- l2_perf_collector::export_swimlane_json emits "fanout": [] and
  "fanout_count": 0 per task to keep the JSON schema shape stable, and
  drops the top-level "version" field, which had drifted into a
  duplicate of L2PerfLevel (see in-flight PR #856 for the misaligned
  guard cleanup on the consumer side).

Downstream tools:
- swimlane_converter already preferred deps.json over task["fanout"]; it
  now reads the version-free schema and treats empty fanout as the
  expected steady state.
- sched_overhead_analysis no longer gates phase parsing on the dropped
  "version" field — it gates on presence of aicpu_scheduler_phases,
  which is the right key.

Tests and comments:
- dep_gen tests drop the now-vacuous "fanout subset-of deps" gate and
  the auto-add of --enable-l2-swimlane that only existed to feed that
  gate.
- _swimlane_validate drops the version assertion.
- profiling_levels.md, dep_gen.h, dep_gen_replay.h, pto_orchestrator
  comments updated to reflect deps.json as the sole source of truth
  for fanout.

Verified on a2a3sim: test_l2_swimlane, test_l2_swimlane_mixed,
test_dep_gen, test_dep_gen_chain all pass with
--enable-l2-swimlane --enable-dep-gen.
deps.json and l2_perf_records.json both carried a "version" field
that consumers were getting wrong:

- deps.json bumped v2 → v3 in hw-native-sys#808 but swimlane_converter still
  guarded on `version != 2`, silently rejected every fresh capture,
  and fell back to L2PerfRecord::fanout[] — losing the race-window
  edges dep_gen replay exists to recover.

- l2_perf_records.json's "version" was never a schema version — the
  producer writes L2PerfLevel (1..4). Misreading it caused two
  consumers to short-circuit on `version != 2` / `< 2`, while phase
  blocks only exist at level >= 3.

Producer side: deps.json drops the field outright; l2_perf_records.json
(a2a3 + a5) renames "version" → "l2_perf_level" so the name matches
its meaning. Consumer side: drop the three now-misaligned guards
(deps_to_graph, swimlane_converter.load_deps_json /
_print_verbose_data_info,
sched_overhead_analysis.parse_scheduler_from_json_phases) plus the
version assertions in test_dep_gen, test_dep_gen_chain, and
_swimlane_validate.

Doc / comment fallout per .claude/rules/doc-consistency.md: retire
"v2 JSON" / "version 2" wording in favour of "l2_perf_level >= N"
across docs/dfx/{dep_gen,l2-swimlane-profiling}.md, profiling_levels.md
(a2a3 + a5), tools/README.md, the 6 scheduler comments (dispatch /
cold_path / types × a2a3, a5), and the tool docstrings. dep_gen.md §4
example + fields table rewritten against the strided-Tensor producer
(buffer_numel / start_offset / strides[] replace raw_shapes /
multi-dim offset[]); strides type corrected to uint32 (Tensor::strides
invariant > 0).
@ChaoWao ChaoWao merged commit 352c3f8 into hw-native-sys:main May 27, 2026
15 checks passed