hw-native-sys · indigo1973 · May 27, 2026
diff --git a/docs/dfx/dep_gen.md b/docs/dfx/dep_gen.md
@@ -1,4 +1,4 @@
-# dep_gen — Complete Per-Submit Dependency Graph (v2, Tensor-Annotated)
+# dep_gen — Complete Per-Submit Dependency Graph (Tensor-Annotated)
 
 ## 1. Background & Motivation
 
@@ -61,8 +61,8 @@ appear too.
   This is the guarantee against silent shotgun modifications — anyone
   who changes `compute_task_fanin` semantics will trip the gate
   immediately and know to update the annotated mirror.
-- **Output.** `<output_prefix>/deps.json` — v2 schema with `tasks[]`,
-  `tensors[]`, and tensor-annotated `edges[]` (see §4).
+- **Output.** `<output_prefix>/deps.json` — strided-Tensor schema with
+  `tasks[]`, `tensors[]`, and tensor-annotated `edges[]` (see §4).
 
 ---
 
@@ -97,29 +97,35 @@ The standard SceneTest path
 
 ---
 
-## 4. Output: `deps.json` (v2)
+## 4. Output: `deps.json`
 
 ```json
 {
-  "version": 2,
   "tasks": [
-    {"task_id": "0",          "scope": "auto"},
-    {"task_id": "4294967296", "scope": "auto"}
+    {"task_id": "0",          "scope": "auto", "args": []},
+    {"task_id": "4294967296", "scope": "auto", "args": [
+      {"idx": 0, "type": "INPUT", "tensor_id": "13451765318376212391",
+       "dtype": "FLOAT32", "shape": [16384],
+       "start_offset": "0", "strides": [1]}
+    ]}
   ],
   "tensors": [
     {"tensor_id": "13451765318376212391",
      "buffer_addr": "29204938752", "version": 0,
-     "dtype": "FLOAT32", "ndims": 1, "raw_shapes": [16384]}
+     "dtype": "FLOAT32", "buffer_numel": "16384"}
   ],
   "edges": [
     {"pred": "0", "succ": "4294967296", "arg": 0, "source": "creator",
      "tensor_id": "13451765318376212391", "consumer_dtype": "FLOAT32",
-     "consumer_shape": [16384], "consumer_offset": [0]},
+     "consumer_shape": [16384],
+     "consumer_start_offset": "0", "consumer_strides": [1]},
     {"pred": "4294967296", "succ": "4294967298", "arg": 0, "source": "tensormap",
      "overlap": "covered",
      "tensor_id": "9514117477438350967", "consumer_dtype": "FLOAT32",
-     "consumer_shape": [16384], "consumer_offset": [0],
-     "producer_shape": [16384], "producer_offset": [0]}
+     "consumer_shape": [16384],
+     "consumer_start_offset": "0", "consumer_strides": [1],
+     "producer_shape": [16384],
+     "producer_start_offset": "0", "producer_strides": [1]}
   ]
 }
 ```
@@ -153,8 +159,9 @@ this block.
 One entry per unique `(buffer_addr, version)` pair touched by the trace.
 `tensor_id` is a stable FNV-1a 64-bit hash of that pair — identical
 inputs across runs yield the same id, making `deps.json` files diffable.
-`raw_shapes` describes the **underlying buffer**, not the slice;
-per-edge slice information lives in the `edges[]` entries.
+`buffer_numel` is the element count of the **underlying buffer**, not the
+slice; per-edge slice geometry (`shape` + `start_offset` + `strides`)
+lives in the `edges[]` entries.
 
 ### `edges[]`
 
@@ -168,8 +175,12 @@ Each edge is `{pred, succ}` plus annotation. Fields:
 | `overlap` | string | `source=tensormap` | `covered` (producer slice fully contains consumer slice) or `other` |
 | `tensor_id` | uint64 (string) | not `explicit` | Identity of the underlying tensor; cross-references `tensors[]` |
 | `consumer_dtype` | string | not `explicit` | Element type the consumer reads as |
-| `consumer_shape`, `consumer_offset` | uint32 array | not `explicit` | The slice the consumer actually reads |
-| `producer_shape`, `producer_offset` | uint32 array | `source=tensormap` | The slice the producer wrote (recovered from the live tensormap entry) |
+| `consumer_shape` | uint32 array | not `explicit` | Per-dim element count of the consumer slice |
+| `consumer_start_offset` | uint64 (string) | not `explicit` | Element offset of the consumer slice into the buffer |
+| `consumer_strides` | uint32 array | not `explicit` | Per-dim stride (in elements) of the consumer slice; runtime invariant > 0 |
+| `producer_shape` | uint32 array | `source=tensormap` | Per-dim element count of the producer slice |
+| `producer_start_offset` | uint64 (string) | `source=tensormap` | Element offset of the producer slice into the buffer |
+| `producer_strides` | uint32 array | `source=tensormap` | Per-dim stride (in elements) of the producer slice; runtime invariant > 0 |
 
 A single `(pred, succ)` pair can appear in `edges[]` multiple times if
 the producer drives the consumer through multiple slots, multiple
@@ -222,9 +233,10 @@ Each arg row carries a 4-line block:
 
 ```text
 arg<i> <ARG_TYPE>[ ?] <Tname>:<dtype>
-raw:    [...]    # underlying buffer (from tensors[].raw_shapes)
-shape:  [...]    # slice this slot accesses
-offset: [...]    # slice start in the raw buffer
+storage:      <buffer_numel> elems   # underlying buffer size
+shape:        [...]                  # slice this slot accesses
+strides:      [...]                  # per-dim element strides
+start_offset: <N> (elem)             # slice start in the underlying buffer
 ```
 
 `<Tname>` is `T<idx>` from `tensors[]` order, so two slots referencing
@@ -270,7 +282,7 @@ for this tool.
 
 ## 6. Relationship to `fanout[]` + Validation Gate
 
-When checking fanout coverage, project v2 edges down to a
+When checking fanout coverage, project annotated edges down to a
 `{(pred, succ)}` set first — the per-edge annotation distinguishes
 sources / args / slices, so the raw `edges[]` count is a superset of the
 underlying task-pair count.
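The projection described above can be sketched in Python. This is a minimal sketch, not the validation gate's actual code: the helper names `project_edges` and `missing_fanout` are hypothetical, the edge list is trimmed to the fields the projection needs, and the third edge is invented only to show a repeated `(pred, succ)` pair.

```python
import json

# Trimmed edges in the shape of the section 4 example. The third edge is
# hypothetical: same task pair as the second, reached through another arg.
DEPS_JSON = """
{
  "edges": [
    {"pred": "0", "succ": "4294967296", "arg": 0, "source": "creator"},
    {"pred": "4294967296", "succ": "4294967298", "arg": 0, "source": "tensormap"},
    {"pred": "4294967296", "succ": "4294967298", "arg": 1, "source": "tensormap"}
  ]
}
"""


def project_edges(deps):
    """Collapse annotated edges to the set of (pred, succ) task-id pairs."""
    # Task ids are uint64 values serialized as strings, so convert first.
    return {(int(e["pred"]), int(e["succ"])) for e in deps["edges"]}


def missing_fanout(fanout_pairs, deps):
    """Return fanout pairs with no matching edge; empty means fanout is covered."""
    return set(fanout_pairs) - project_edges(deps)


deps = json.loads(DEPS_JSON)
pairs = project_edges(deps)
print(len(deps["edges"]), "annotated edges ->", len(pairs), "task pairs")
```

Comparing against the projected set rather than the raw `edges[]` list keeps the coverage check independent of how many slots or slices connect the same two tasks.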
@@ -342,7 +354,7 @@ list; only the dep_gen replay graph loses the tail.
 | AICPU writer | `src/a2a3/platform/{include,src}/aicpu/dep_gen_collector_aicpu.{h,cpp}` | Single-instance write path; weak-fallback exported to host build |
 | Host collector | `src/a2a3/platform/{include/host,src/host}/dep_gen_collector.{h,cpp}` | `ProfilerBase<DepGenCollector, DepGenModule>` — drains ring → `records_` vector |
 | Capture call site | `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` `submit_task_common` | One conditional block that snapshots inputs into the ring when `is_dep_gen_enabled()`; fires for both `submit_task` and `submit_dummy_task`. Dep-only tasks land in the record stream with valid tensor/dep info but no kernel_id field (the schema does not carry kernel_id), so replay treats them as ordinary dep nodes — viewers do not currently distinguish dummy from real tasks. |
-| Replay | `src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}` | Pure CPU; runs dual-pass differential replay — `compute_task_fanin` (oracle) + inlined STEP A/B mirror (annotated) against two `PTO2TensorMap` instances. Emits v2 `deps.json` when both passes agree per record. |
+| Replay | `src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}` | Pure CPU; runs dual-pass differential replay — `compute_task_fanin` (oracle) + inlined STEP A/B mirror (annotated) against two `PTO2TensorMap` instances. Emits `deps.json` when both passes agree per record. |
 | Device-runner hookup | `src/a2a3/platform/{onboard,sim}/host/device_runner.cpp` | post-`reconcile_counters` calls `dep_gen_replay_emit_deps_json(records.data(), records.size(), deps_path, nullptr)` |
 | Viewer | `simpler_setup/tools/deps_to_graph.py` | `deps.json` → pan/zoom HTML |
 | Test | `tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py` | Smoke test + `fanout ⊆ deps` validation gate |
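The strided slice fields in section 4 (`shape`, `start_offset`, `strides`, all in elements) can be read as a set of buffer element offsets. The sketch below enumerates that set by brute force and uses it to illustrate what `overlap: "covered"` means (producer slice fully contains consumer slice). It is an illustration under that reading only; the replay code's actual overlap test is not shown in this diff, and the names `slice_offsets` and `is_covered` are hypothetical.

```python
from itertools import product


def slice_offsets(shape, start_offset, strides):
    """Return the set of buffer element offsets touched by a strided slice."""
    offsets = set()
    # Visit every index tuple of the slice and map it to a flat element offset.
    for idx in product(*(range(n) for n in shape)):
        offsets.add(start_offset + sum(i * s for i, s in zip(idx, strides)))
    return offsets


def is_covered(producer, consumer):
    """True when every element the consumer reads was written by the producer."""
    return slice_offsets(*consumer) <= slice_offsets(*producer)


# Hypothetical 2-D slices over a buffer with 8 elements per row.
# In deps.json, start_offset is a string; convert with int() before use.
producer = ([4, 4], 0, [8, 1])   # rows 0..3, columns 0..3
inside = ([2, 2], 9, [8, 1])     # rows 1..2, columns 1..2
straddle = ([2, 2], 3, [8, 1])   # rows 0..1, columns 3..4

print(is_covered(producer, inside))
print(is_covered(producer, straddle))
```

Brute-force enumeration is only practical for small slices; it is used here because it makes the geometry explicit.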

diff --git a/docs/dfx/l2-swimlane-profiling.md b/docs/dfx/l2-swimlane-profiling.md
@@ -25,14 +25,13 @@ available.
 
 ## 2. Overview
 
-- **Per-task AICore timing** — `start_time`, `end_time`,
-  `duration`, plus AICPU-stamped `dispatch_time` / `finish_time`.
+- **Per-task AICore timing** — `start_time_us`, `end_time_us`,
+  `duration_us`, plus AICPU-stamped `dispatch_time_us` / `finish_time_us`.
 - **Per-task fanout chain** — successor `task_id`s recorded in
   the L2 record so dependency arrows show up in the Perfetto
   view.
 - **AICPU scheduler phases** — per-iteration breakdown into
-  `SCHED_COMPLETE` / `SCHED_DISPATCH` / `SCHED_SCAN` /
-  `SCHED_IDLE_WAIT`.
+  `complete` / `dispatch` / `scan` / `idle`.
 - **Orchestrator phase summary** — cumulative cycle counts for
   the orchestrator's nine sub-steps (sync / alloc / params /
   lookup / heap / insert / fanin / finalize / scope_end).
@@ -57,10 +56,10 @@ backward-compatible with the old boolean behavior).
 | Level | Collects | Notes |
 | ----- | -------- | ----- |
 | 0 | Nothing (disabled) | Default when flag is absent |
-| 1 | AICore timing only (start/end/task_id/func_id/core_type) | No AICPU timestamps, no fanout |
-| 2 | + dispatch_time, finish_time, fanout | Full per-task record |
-| 3 | + Scheduler phases (`SCHED_*`) | Skips orchestrator phases |
-| 4 | + Orchestrator phases | Full collection |
+| 1 | AICore timing only (start_time_us/end_time_us/task_id/func_id/core_type) | No AICPU timestamps, no fanout |
+| 2 | + dispatch_time_us, finish_time_us, fanout | Full per-task record |
+| 3 | + scheduler phases (`aicpu_scheduler_phases[]`) | Skips orchestrator phases |
+| 4 | + orchestrator phases (`aicpu_orchestrator_phases[]`) | Full collection |
 
 ```bash
 # Standalone runner
@@ -88,8 +87,8 @@ dispatch/finish timestamps and fanout are recorded only at
 level >= 2, scheduler phase records only at level >= 3, and
 orchestrator phase records only at level >= 4.
 
-The JSON output `"version"` field directly reflects the
-perf_level: `1` = AICore timing only, `2` = +dispatch/fanout,
+The JSON output `"l2_perf_level"` field is the captured perf_level:
+`1` = AICore timing only, `2` = +dispatch/fanout,
 `3` = +scheduler phases, `4` = +orchestrator phases.
 
 `--rounds > 1` collects only on the **first** round so warm-up
@@ -118,22 +117,29 @@ you pass to `swimlane_converter`. Important fields per task:
 
 | Field | Meaning |
 | ----- | ------- |
-| `task_id` | Runtime task id, hex (low 32 bits = AICore register token; full 64 bits filled by AICPU) |
+| `task_id` | Runtime task id (`(ring_id << 32) \| local_id`); the ring index is also exposed separately as `ring_id` |
 | `func_id` | Kernel function id |
-| `core_type` | `0` = AIC, `1` = AIV |
-| `start_time` / `end_time` / `duration` | AICore device-clock cycles (`get_sys_cnt`) |
-| `dispatch_time` | AICPU timestamp when this task was dispatched |
-| `finish_time` | AICPU timestamp when AICPU observed FIN |
-| `fanout[]` / `fanout_count` | Successor task ids, used by Perfetto dependency arrows |
+| `core_id` / `core_type` | Physical core index and `"aic"` / `"aiv"` string |
+| `start_time_us` / `end_time_us` / `duration_us` | AICore execution window in microseconds |
+| `dispatch_time_us` | AICPU timestamp when this task was dispatched (filled at level >= 2) |
+| `finish_time_us` | AICPU timestamp when AICPU observed FIN (filled at level >= 2) |
+| `fanout[]` / `fanout_count` | Successor task ids (level >= 2), used by Perfetto dependency arrows |
 
-Phase records (per scheduler thread):
+Phase records (per scheduler thread, level >= 3 for
+`aicpu_scheduler_phases[]` and level >= 4 for
+`aicpu_orchestrator_phases[]`):
 
 | Field | Meaning |
 | ----- | ------- |
-| `start_time` / `end_time` | Phase start / end timestamps |
-| `loop_iter` | Scheduler loop iteration number |
-| `phase_id` | One of `SCHED_COMPLETE` / `SCHED_DISPATCH` / `SCHED_SCAN` / `SCHED_IDLE_WAIT`, or `ORCH_*` for orchestrator phases |
+| `start_time_us` / `end_time_us` | Phase start / end timestamps in microseconds |
+| `phase` | Lowercase phase name. Scheduler: `complete` / `dispatch` / `scan` / `idle`. Orchestrator: `orch_*` (sync / alloc / params / lookup / heap / insert / fanin / finalize / scope_end). |
+| `loop_iter` (scheduler) / `submit_idx` (orchestrator) | Iteration / submit-call counter for the producing thread |
 | `tasks_processed` (scheduler) / `task_id` (orchestrator) | Phase-specific union field |
+| `pop_hit` / `pop_miss` (dispatch only) | Ready-queue pop deltas since the previous dispatch emit |
+
+`core_to_thread[]` (level >= 3) maps `core_id` (array index) to the
+scheduler thread index that retired that core's tasks (`-1` =
+unassigned).
 
 ### 3.3 Convert and view in Perfetto
 
@@ -162,7 +168,7 @@ in. The trace contains:
   channel). Each task shows `func_name(t<task_id>)`; dependency
   arrows follow `fanout[]`.
 - **AICPU View** — scheduler thread lanes with per-iteration
-  phase blocks coloured by `phase_id`.
+  phase blocks coloured by `phase`.
 - **AICPU Scheduler** — orchestrator phase summary at the top.
 
 When the run also emitted a device log (`device-*` file under
@@ -206,12 +212,13 @@ schema and L3 example.
 What the swimlane shows:
 
 - **Per-task wall-clock placement.** Where each task ran on which
-  AICore, with start / end / duration in device cycles.
-- **Dispatch and finish overhead.** `dispatch_time` and
-  `finish_time` come from AICPU, so the gap between
-  `dispatch_time` and `start_time` is the AICPU→AICore
-  hand-off latency, and the gap between `end_time` and
-  `finish_time` is the FIN-observation latency.
+  AICore, with `start_time_us` / `end_time_us` / `duration_us` in
+  microseconds (converted from device cycles).
+- **Dispatch and finish overhead.** `dispatch_time_us` and
+  `finish_time_us` come from AICPU, so the gap between
+  `dispatch_time_us` and `start_time_us` is the AICPU→AICore
+  hand-off latency, and the gap between `end_time_us` and
+  `finish_time_us` is the FIN-observation latency.
 - **Dependency chains.** `fanout[]` lets Perfetto draw arrows
   between predecessor and successor tasks.
 - **Scheduler-loop time decomposition.** Per-iteration AICPU
@@ -279,7 +286,7 @@ platform-owned AICore state, and never reassigned — so AICore is
 fully decoupled from any AICPU-side records-buffer rotation. AICPU,
 on observing FIN, validates the slot's register token, copies the slot
 record into the current `L2PerfBuffer::records[count]`, fills
-`func_id` / `core_type` / `dispatch_time` / `finish_time` / `fanout`,
+`func_id` / `core_type` / `dispatch_time_us` / `finish_time_us` / `fanout`,
 advances `count`, and rotates the records buffer in place when it
 fills up. The ring is sized to the runtime's in-flight issue depth
 (2 for dual-issue today; raise to the next power of two when issue
@@ -619,7 +626,7 @@ data (only `tensormap_and_ringbuffer` does, and only when
 `AicpuPhaseHeader` was not initialized. Verify the runtime sets
 the magic in its scheduler init path.
 
-**`dispatch_time` < `finish_time` mismatch.** Verify the runtime
+**`dispatch_time_us` < `finish_time_us` mismatch.** Verify the runtime
 overwrites `task_id` with the full encoding on FIN
 (`tensormap_and_ringbuffer` does
 `(ring_id << 32) | local_id`); a half-filled record means AICore
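When checking whether a record carries the full encoding, it can help to split `task_id` back into its two halves. A minimal sketch, assuming the `(ring_id << 32) | local_id` layout quoted above and a `local_id` that fits in 32 bits; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # assumption: local_id occupies the low 32 bits


def pack_task_id(ring_id, local_id):
    """Build the full task id from its ring and local parts."""
    return (ring_id << 32) | (local_id & MASK32)


def split_task_id(task_id):
    """Return (ring_id, local_id) for a full 64-bit task id."""
    return task_id >> 32, task_id & MASK32


# Ids taken from the examples in this change.
for tid in (0, 4294967296, 4294967298):
    ring_id, local_id = split_task_id(tid)
    print(f"task_id={tid} ring_id={ring_id} local_id={local_id}")
```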

diff --git a/simpler_setup/tools/README.md b/simpler_setup/tools/README.md
@@ -120,7 +120,7 @@ Analyze AICPU scheduler overhead and quantitatively decompose the sources of Tai
 
 `sched_overhead_analysis` reads two artifacts produced by the runtime:
 
-1. **Perf profiling data** (`l2_perf_records_*.json`, v2): per-task Exec / Head OH / Tail OH time breakdowns plus `aicpu_scheduler_phases` — per-thread, per-loop-iteration phase records carrying scan / complete / dispatch / idle timings and per-emit pop_hit / pop_miss deltas.
+1. **Perf profiling data** (`l2_perf_records_*.json`, l2_perf_level >= 3): per-task Exec / Head OH / Tail OH time breakdowns plus `aicpu_scheduler_phases` — per-thread, per-loop-iteration phase records carrying scan / complete / dispatch / idle timings and per-emit pop_hit / pop_miss deltas.
 2. **`deps.json`** (optional, dep_gen replay output): structural task DAG. When colocated with the perf JSON, Part 2 prints per-thread fanout / fanin aggregates derived from it.
 
 ### Basic Usage
@@ -154,7 +154,7 @@ Output is emitted in three parts:
 - **Part 2: AICPU scheduler loop breakdown** — per-scheduler-thread loop statistics, per-phase (scan / complete / dispatch / idle) time ratios, pop_hit / pop_miss totals, and (when deps.json is available) per-thread fanout / fanin aggregates
 - **Part 3: Tail OH distribution & cause analysis** — Tail OH quantile distribution (P10–P99), correlation between scheduler loop iteration time and Tail OH, and data-driven insights into the dominant phase
 
-The perf JSON must be a v2 capture with non-empty `aicpu_scheduler_phases` (rerun the case with `--enable-l2-swimlane` if the tool reports the field is missing).
+The perf JSON must be captured at l2_perf_level >= 3 so that `aicpu_scheduler_phases` is non-empty (rerun the case with `--enable-l2-swimlane 3` or higher if the tool reports the field is missing).
 
 ---
 
@@ -270,23 +270,49 @@ The analysis tools share the same input format - the `l2_perf_records_*.json` fi
 
 ```json
 {
-  "version": 1,
+  "l2_perf_level": 4,
   "tasks": [
     {
       "task_id": 0,
       "func_id": 0,
-      "core_id": 0,
-      "core_type": "aic",
-      "start_time_us": 100.0,
-      "end_time_us": 250.5,
-      "duration_us": 150.5,
-      "fanout": [1, 2],
-      "fanout_count": 2
+      "core_id": 7,
+      "core_type": "aiv",
+      "ring_id": 0,
+      "start_time_us": 47.46,
+      "end_time_us": 55.9,
+      "duration_us": 8.44,
+      "dispatch_time_us": 45.94,
+      "finish_time_us": 60.52,
+      "fanout": [4294967299, 4294967297, 4294967296],
+      "fanout_count": 3
+    },
+    {
+      "task_id": 4294967296,
+      "func_id": 1,
+      "core_id": 7,
+      "core_type": "aiv",
+      "ring_id": 1,
+      "start_time_us": 68.68,
+      "end_time_us": 70.42,
+      "duration_us": 1.74,
+      "dispatch_time_us": 68.24,
+      "finish_time_us": 71.2,
+      "fanout": [4294967298],
+      "fanout_count": 1
     }
   ]
 }
 ```
 
+Top-level layout depends on `l2_perf_level`:
+
+- All levels: `l2_perf_level`, `tasks[]` (per-task fields above).
+- `>= 3`: also `aicpu_scheduler_phases[]` (per-thread phase records:
+  scan / complete / dispatch / idle) and `core_to_thread[]` (core_id →
+  scheduler thread index).
+- `>= 4`: also `aicpu_orchestrator_phases[]` (per-task orchestrator
+  phase records).
+
 ### Kernel Config Format
 
 To display meaningful function names in the output, provide a `kernel_config.py` file:
@@ -366,10 +392,11 @@ For batch-run hardware regression, see the dev-only script
 - Check the kernel_config.py file format
 - Make sure every KERNELS entry has a 'func_id' and 'name' field
 
-### Error: Unsupported version
+### Error: Unsupported l2_perf_level
 
-- The tools only support version 1 of the profiling data format
-- Regenerate the profiling data with the latest runtime
+- The tools accept l2_perf_level 1–4 (the integer captured at runtime
+  via `--enable-l2-swimlane <N>`)
+- Regenerate the profiling data with a supported level
 
 ### Error: Perf JSON missing required fields for scheduler overhead analysis
 
@@ -394,7 +421,7 @@ For batch-run hardware regression, see the dev-only script
 | ---- | ---- | ------- | ------ |
 | `l2_perf_records_*.json` | Runtime | Raw timing profiling data | JSON |
 | `merged_swimlane_*.json` | swimlane_converter | Perfetto visualization | Chrome Trace Event JSON |
-| `deps.json` | Runtime (dep_gen replay) | Structural task dependency graph + per-edge tensor info | JSON (v2) |
+| `deps.json` | Runtime (dep_gen replay) | Structural task dependency graph + per-edge tensor info | JSON |
 | `deps_graph.html` | deps_to_graph | Pan/zoom dependency graph viewer | HTML (self-contained) |
 
 ---
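The per-task fields in `l2_perf_records_*.json` are enough to compute the two gaps the swimlane doc describes: the AICPU to AICore hand-off latency (`start_time_us - dispatch_time_us`) and the FIN-observation latency (`finish_time_us - end_time_us`). The sketch below is a hedged illustration, not part of the shipped tools: the function name `latencies` is hypothetical, and the embedded JSON is a trimmed copy of the example records shown in this change.

```python
import json

# Trimmed copy of the example records; only the fields used below are kept.
RECORDS_JSON = """
{
  "l2_perf_level": 4,
  "tasks": [
    {"task_id": 0, "start_time_us": 47.46, "end_time_us": 55.9,
     "dispatch_time_us": 45.94, "finish_time_us": 60.52},
    {"task_id": 4294967296, "start_time_us": 68.68, "end_time_us": 70.42,
     "dispatch_time_us": 68.24, "finish_time_us": 71.2}
  ]
}
"""


def latencies(task):
    """Return (hand_off_us, fin_observation_us) for one task record."""
    hand_off = task["start_time_us"] - task["dispatch_time_us"]
    fin_observation = task["finish_time_us"] - task["end_time_us"]
    return hand_off, fin_observation


data = json.loads(RECORDS_JSON)

# dispatch_time_us / finish_time_us are only filled at level >= 2.
if data["l2_perf_level"] < 2:
    raise SystemExit("need l2_perf_level >= 2 for dispatch/finish timestamps")

for task in data["tasks"]:
    hand_off, fin_observation = latencies(task)
    print(f"task {task['task_id']}: hand-off {hand_off:.2f} us, "
          f"FIN observation {fin_observation:.2f} us")
```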