52 changes: 32 additions & 20 deletions docs/dfx/dep_gen.md
@@ -1,4 +1,4 @@
# dep_gen — Complete Per-Submit Dependency Graph (v2, Tensor-Annotated)
# dep_gen — Complete Per-Submit Dependency Graph (Tensor-Annotated)

## 1. Background & Motivation

@@ -61,8 +61,8 @@ appear too.
This is the guarantee against silent shotgun modifications — anyone
who changes `compute_task_fanin` semantics will trip the gate
immediately and know to update the annotated mirror.
- **Output.** `<output_prefix>/deps.json` — v2 schema with `tasks[]`,
`tensors[]`, and tensor-annotated `edges[]` (see §4).
- **Output.** `<output_prefix>/deps.json` — strided-Tensor schema with
`tasks[]`, `tensors[]`, and tensor-annotated `edges[]` (see §4).

---

@@ -97,29 +97,35 @@ The standard SceneTest path

---

## 4. Output: `deps.json` (v2)
## 4. Output: `deps.json`

```json
{
"version": 2,
"tasks": [
{"task_id": "0", "scope": "auto"},
{"task_id": "4294967296", "scope": "auto"}
{"task_id": "0", "scope": "auto", "args": []},
{"task_id": "4294967296", "scope": "auto", "args": [
{"idx": 0, "type": "INPUT", "tensor_id": "13451765318376212391",
"dtype": "FLOAT32", "shape": [16384],
"start_offset": "0", "strides": [1]}
]}
],
"tensors": [
{"tensor_id": "13451765318376212391",
"buffer_addr": "29204938752", "version": 0,
"dtype": "FLOAT32", "ndims": 1, "raw_shapes": [16384]}
"dtype": "FLOAT32", "buffer_numel": "16384"}
],
"edges": [
{"pred": "0", "succ": "4294967296", "arg": 0, "source": "creator",
"tensor_id": "13451765318376212391", "consumer_dtype": "FLOAT32",
"consumer_shape": [16384], "consumer_offset": [0]},
"consumer_shape": [16384],
"consumer_start_offset": "0", "consumer_strides": [1]},
{"pred": "4294967296", "succ": "4294967298", "arg": 0, "source": "tensormap",
"overlap": "covered",
"tensor_id": "9514117477438350967", "consumer_dtype": "FLOAT32",
"consumer_shape": [16384], "consumer_offset": [0],
"producer_shape": [16384], "producer_offset": [0]}
"consumer_shape": [16384],
"consumer_start_offset": "0", "consumer_strides": [1],
"producer_shape": [16384],
"producer_start_offset": "0", "producer_strides": [1]}
]
}
```
@@ -153,8 +159,9 @@ this block.
One entry per unique `(buffer_addr, version)` pair touched by the trace.
`tensor_id` is a stable FNV-1a 64-bit hash of that pair — identical
inputs across runs yield the same id, making `deps.json` files diffable.
`raw_shapes` describes the **underlying buffer**, not the slice;
per-edge slice information lives in the `edges[]` entries.
`buffer_numel` is the element count of the **underlying buffer**, not the
slice; per-edge slice geometry (`shape` + `start_offset` + `strides`)
lives in the `edges[]` entries.
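
As a rough illustration of the `tensor_id` derivation described above, the sketch below hashes a `(buffer_addr, version)` pair with standard FNV-1a 64-bit. The byte layout fed to the hash (little-endian `u64` address followed by little-endian `u32` version) and the helper name `tensor_id_sketch` are assumptions made for this example, not something the doc specifies, so the ids it produces are not expected to match a real `deps.json`.

```python
import struct

# Standard FNV-1a 64-bit parameters.
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Plain FNV-1a: XOR each byte in, then multiply by the prime (mod 2**64)."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def tensor_id_sketch(buffer_addr: int, version: int) -> str:
    """Hypothetical helper: hash the (buffer_addr, version) pair.

    ASSUMPTION: the pair is serialized as little-endian u64 + u32.
    The real serialization may differ.
    """
    payload = struct.pack("<QI", buffer_addr, version)
    # uint64 values are emitted as decimal strings, as in deps.json.
    return str(fnv1a_64(payload))


print(tensor_id_sketch(29204938752, 0))
```

Whatever the real layout is, the property that matters for diffing is the one shown here: the id depends only on the pair, so two runs that touch the same buffer at the same version produce the same string.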

### `edges[]`

@@ -168,8 +175,12 @@ Each edge is `{pred, succ}` plus annotation. Fields:
| `overlap` | string | `source=tensormap` | `covered` (producer slice fully contains consumer slice) or `other` |
| `tensor_id` | uint64 (string) | not `explicit` | Identity of the underlying tensor; cross-references `tensors[]` |
| `consumer_dtype` | string | not `explicit` | Element type the consumer reads as |
| `consumer_shape`, `consumer_offset` | uint32 array | not `explicit` | The slice the consumer actually reads |
| `producer_shape`, `producer_offset` | uint32 array | `source=tensormap` | The slice the producer wrote (recovered from the live tensormap entry) |
| `consumer_shape` | uint32 array | not `explicit` | Per-dim element count of the consumer slice |
| `consumer_start_offset` | uint64 (string) | not `explicit` | Element offset of the consumer slice into the buffer |
| `consumer_strides` | uint32 array | not `explicit` | Per-dim stride (in elements) of the consumer slice; runtime invariant > 0 |
| `producer_shape` | uint32 array | `source=tensormap` | Per-dim element count of the producer slice |
| `producer_start_offset` | uint64 (string) | `source=tensormap` | Element offset of the producer slice |
| `producer_strides` | uint32 array | `source=tensormap` | Per-dim stride of the producer slice; runtime invariant > 0 |
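
To make the strided slice fields concrete, here is a minimal sketch that expands a `(shape, start_offset, strides)` triple into the set of flat element offsets it touches and then classifies an edge as `covered` or `other`. The brute-force set comparison and the function names are assumptions chosen for readability; the replay code may well decide coverage analytically.

```python
from itertools import product


def slice_elements(shape, start_offset, strides):
    """Flat element offsets touched by a strided slice.

    start_offset arrives as a decimal string (uint64) in deps.json.
    """
    start = int(start_offset)
    index_ranges = (range(n) for n in shape)
    return {
        start + sum(i * s for i, s in zip(idx, strides))
        for idx in product(*index_ranges)
    }


def classify_overlap(edge):
    """Illustrative check: 'covered' if the producer slice contains
    every element the consumer reads, otherwise 'other'."""
    consumer = slice_elements(
        edge["consumer_shape"],
        edge["consumer_start_offset"],
        edge["consumer_strides"],
    )
    producer = slice_elements(
        edge["producer_shape"],
        edge["producer_start_offset"],
        edge["producer_strides"],
    )
    return "covered" if consumer <= producer else "other"


edge = {
    "consumer_shape": [4], "consumer_start_offset": "2", "consumer_strides": [1],
    "producer_shape": [8], "producer_start_offset": "0", "producer_strides": [1],
}
print(classify_overlap(edge))
```

Enumerating offsets is only practical for small slices; it is used here because it makes the meaning of `start_offset` and `strides` easy to inspect.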

A single `(pred, succ)` pair can appear in `edges[]` multiple times if
the producer drives the consumer through multiple slots, multiple
@@ -222,9 +233,10 @@ Each arg row carries a 4-line block:

```text
arg<i> <ARG_TYPE>[ ?] <Tname>:<dtype>
raw: [...] # underlying buffer (from tensors[].raw_shapes)
shape: [...] # slice this slot accesses
offset: [...] # slice start in the raw buffer
storage: <buffer_numel> elems # underlying buffer size
shape: [...] # slice this slot accesses
strides: [...] # per-dim element strides
start_offset: <N> (elem) # slice start in the underlying buffer
```

`<Tname>` is `T<idx>` from `tensors[]` order, so two slots referencing
@@ -270,7 +282,7 @@ for this tool.

## 6. Relationship to `fanout[]` + Validation Gate

When checking fanout coverage, project v2 edges down to a
When checking fanout coverage, project annotated edges down to a
`{(pred, succ)}` set first — the per-edge annotation distinguishes
sources / args / slices, so the raw `edges[]` count is a superset of the
underlying task-pair count.
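
The projection step can be sketched as follows. This assumes both files are already loaded as Python dicts and that ids are normalized to `int` (they are decimal strings in `deps.json` and plain integers in the perf JSON); the function names are hypothetical and not taken from the test suite.

```python
def project_edges(deps):
    """Collapse annotated edges to a set of (pred, succ) task pairs."""
    return {(int(e["pred"]), int(e["succ"])) for e in deps["edges"]}


def fanout_pairs(perf):
    """(task, successor) pairs recorded in the perf JSON fanout lists."""
    return {
        (int(task["task_id"]), int(succ))
        for task in perf["tasks"]
        for succ in task.get("fanout", [])
    }


def missing_from_deps(perf, deps):
    """Pairs that fanout[] reports but deps.json lacks.

    An empty result means fanout is a subset of deps.
    """
    return fanout_pairs(perf) - project_edges(deps)


deps = {"edges": [
    {"pred": "0", "succ": "4294967296", "arg": 0, "source": "creator"},
    {"pred": "0", "succ": "4294967296", "arg": 1, "source": "tensormap"},
]}
perf = {"tasks": [{"task_id": 0, "fanout": [4294967296]}]}
print(missing_from_deps(perf, deps))
```

In the sample data two annotated edges collapse to a single task pair, which is why comparing raw `edges[]` length against fanout counts would be misleading.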
@@ -342,7 +354,7 @@ list; only the dep_gen replay graph loses the tail.
| AICPU writer | `src/a2a3/platform/{include,src}/aicpu/dep_gen_collector_aicpu.{h,cpp}` | Single-instance write path; weak-fallback exported to host build |
| Host collector | `src/a2a3/platform/{include/host,src/host}/dep_gen_collector.{h,cpp}` | `ProfilerBase<DepGenCollector, DepGenModule>` — drains ring → `records_` vector |
| Capture call site | `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` `submit_task_common` | One conditional block that snapshots inputs into the ring when `is_dep_gen_enabled()`; fires for both `submit_task` and `submit_dummy_task`. Dep-only tasks land in the record stream with valid tensor/dep info but no kernel_id field (the schema does not carry kernel_id), so replay treats them as ordinary dep nodes — viewers do not currently distinguish dummy from real tasks. |
| Replay | `src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}` | Pure CPU; runs dual-pass differential replay — `compute_task_fanin` (oracle) + inlined STEP A/B mirror (annotated) against two `PTO2TensorMap` instances. Emits v2 `deps.json` when both passes agree per record. |
| Replay | `src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}` | Pure CPU; runs dual-pass differential replay — `compute_task_fanin` (oracle) + inlined STEP A/B mirror (annotated) against two `PTO2TensorMap` instances. Emits `deps.json` when both passes agree per record. |
| Device-runner hookup | `src/a2a3/platform/{onboard,sim}/host/device_runner.cpp` | post-`reconcile_counters` calls `dep_gen_replay_emit_deps_json(records.data(), records.size(), deps_path, nullptr)` |
| Viewer | `simpler_setup/tools/deps_to_graph.py` | `deps.json` → pan/zoom HTML |
| Test | `tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py` | Smoke test + `fanout ⊆ deps` validation gate |
65 changes: 36 additions & 29 deletions docs/dfx/l2-swimlane-profiling.md
@@ -25,14 +25,13 @@ available.

## 2. Overview

- **Per-task AICore timing** — `start_time`, `end_time`,
`duration`, plus AICPU-stamped `dispatch_time` / `finish_time`.
- **Per-task AICore timing** — `start_time_us`, `end_time_us`,
`duration_us`, plus AICPU-stamped `dispatch_time_us` / `finish_time_us`.
- **Per-task fanout chain** — successor `task_id`s recorded in
the L2 record so dependency arrows show up in the Perfetto
view.
- **AICPU scheduler phases** — per-iteration breakdown into
`SCHED_COMPLETE` / `SCHED_DISPATCH` / `SCHED_SCAN` /
`SCHED_IDLE_WAIT`.
`complete` / `dispatch` / `scan` / `idle`.
- **Orchestrator phase summary** — cumulative cycle counts for
the orchestrator's nine sub-steps (sync / alloc / params /
lookup / heap / insert / fanin / finalize / scope_end).
@@ -57,10 +56,10 @@ backward-compatible with the old boolean behavior).
| Level | Collects | Notes |
| ----- | -------- | ----- |
| 0 | Nothing (disabled) | Default when flag is absent |
| 1 | AICore timing only (start/end/task_id/func_id/core_type) | No AICPU timestamps, no fanout |
| 2 | + dispatch_time, finish_time, fanout | Full per-task record |
| 3 | + Scheduler phases (`SCHED_*`) | Skips orchestrator phases |
| 4 | + Orchestrator phases | Full collection |
| 1 | AICore timing only (start_time_us/end_time_us/task_id/func_id/core_type) | No AICPU timestamps, no fanout |
| 2 | + dispatch_time_us, finish_time_us, fanout | Full per-task record |
| 3 | + scheduler phases (`aicpu_scheduler_phases[]`) | Skips orchestrator phases |
| 4 | + orchestrator phases (`aicpu_orchestrator_phases[]`) | Full collection |

```bash
# Standalone runner
@@ -88,8 +87,8 @@ dispatch/finish timestamps and fanout are recorded only at
level >= 2, scheduler phase records only at level >= 3, and
orchestrator phase records only at level >= 4.

The JSON output `"version"` field directly reflects the
perf_level: `1` = AICore timing only, `2` = +dispatch/fanout,
The JSON output `"l2_perf_level"` field is the captured perf_level:
`1` = AICore timing only, `2` = +dispatch/fanout,
`3` = +scheduler phases, `4` = +orchestrator phases.

`--rounds > 1` collects only on the **first** round so warm-up
@@ -118,22 +117,29 @@ you pass to `swimlane_converter`. Important fields per task:

| Field | Meaning |
| ----- | ------- |
| `task_id` | Runtime task id, hex (low 32 bits = AICore register token; full 64 bits filled by AICPU) |
| `task_id` | Runtime task id (`(ring_id << 32) \| local_id`); the ring half is also exposed separately as `ring_id` |
| `func_id` | Kernel function id |
| `core_type` | `0` = AIC, `1` = AIV |
| `start_time` / `end_time` / `duration` | AICore device-clock cycles (`get_sys_cnt`) |
| `dispatch_time` | AICPU timestamp when this task was dispatched |
| `finish_time` | AICPU timestamp when AICPU observed FIN |
| `fanout[]` / `fanout_count` | Successor task ids, used by Perfetto dependency arrows |
| `core_id` / `core_type` | Physical core index and `"aic"` / `"aiv"` string |
| `start_time_us` / `end_time_us` / `duration_us` | AICore execution window in microseconds |
| `dispatch_time_us` | AICPU timestamp when this task was dispatched (filled at level >= 2) |
| `finish_time_us` | AICPU timestamp when AICPU observed FIN (filled at level >= 2) |
| `fanout[]` / `fanout_count` | Successor task ids (level >= 2), used by Perfetto dependency arrows |
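
A small sketch of how the per-task fields above might be consumed: it splits `task_id` back into its ring and local halves and derives two gaps from the AICPU and AICore timestamps. The helper names and the output keys `handoff_us` / `fin_observe_us` are made up for this example; the latency fields require a capture at level >= 2.

```python
def split_task_id(task_id: int) -> tuple[int, int]:
    """Invert (ring_id << 32) | local_id."""
    ring_id = task_id >> 32
    local_id = task_id & 0xFFFFFFFF
    return ring_id, local_id


def task_latencies(task: dict) -> dict:
    """Gaps between AICPU and AICore timestamps, in microseconds."""
    return {
        # Time from AICPU dispatch until the AICore actually started.
        "handoff_us": task["start_time_us"] - task["dispatch_time_us"],
        # Time from AICore end until AICPU observed FIN.
        "fin_observe_us": task["finish_time_us"] - task["end_time_us"],
    }


task = {
    "task_id": 4294967296,
    "start_time_us": 68.68, "end_time_us": 70.42,
    "dispatch_time_us": 68.24, "finish_time_us": 71.2,
}
print(split_task_id(task["task_id"]), task_latencies(task))
```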

Phase records (per scheduler thread):
Phase records (per scheduler thread, level >= 3 for
`aicpu_scheduler_phases[]` and level >= 4 for
`aicpu_orchestrator_phases[]`):

| Field | Meaning |
| ----- | ------- |
| `start_time` / `end_time` | Phase start / end timestamps |
| `loop_iter` | Scheduler loop iteration number |
| `phase_id` | One of `SCHED_COMPLETE` / `SCHED_DISPATCH` / `SCHED_SCAN` / `SCHED_IDLE_WAIT`, or `ORCH_*` for orchestrator phases |
| `start_time_us` / `end_time_us` | Phase start / end timestamps in microseconds |
| `phase` | Lowercase phase name. Scheduler: `complete` / `dispatch` / `scan` / `idle`. Orchestrator: `orch_*` (sync / alloc / params / lookup / heap / insert / fanin / finalize / scope_end). |
| `loop_iter` (scheduler) / `submit_idx` (orchestrator) | Iteration / submit-call counter for the producing thread |
| `tasks_processed` (scheduler) / `task_id` (orchestrator) | Phase-specific union field |
| `pop_hit` / `pop_miss` (dispatch only) | Ready-queue pop deltas since the previous dispatch emit |

`core_to_thread[]` (level >= 3) maps `core_id` (array index) to the
scheduler thread index that retired that core's tasks (`-1` =
unassigned).
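
For illustration, `core_to_thread[]` can be inverted into a per-thread list of cores. This is a sketch under the layout described above (array index is `core_id`, value is the scheduler thread index, `-1` means unassigned); `cores_by_thread` is a hypothetical helper name.

```python
from collections import defaultdict


def cores_by_thread(core_to_thread):
    """Group core ids by the scheduler thread that retired their tasks."""
    groups = defaultdict(list)
    for core_id, thread_idx in enumerate(core_to_thread):
        if thread_idx == -1:
            continue  # unassigned core, skip it
        groups[thread_idx].append(core_id)
    return dict(groups)


print(cores_by_thread([0, 1, -1, 0]))
```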

### 3.3 Convert and view in Perfetto

@@ -162,7 +168,7 @@ in. The trace contains:
channel). Each task shows `func_name(t<task_id>)`; dependency
arrows follow `fanout[]`.
- **AICPU View** — scheduler thread lanes with per-iteration
phase blocks coloured by `phase_id`.
phase blocks coloured by `phase`.
- **AICPU Scheduler** — orchestrator phase summary at the top.

When the run also emitted a device log (`device-*` file under
@@ -206,12 +212,13 @@ schema and L3 example.
What the swimlane shows:

- **Per-task wall-clock placement.** Where each task ran on which
AICore, with start / end / duration in device cycles.
- **Dispatch and finish overhead.** `dispatch_time` and
`finish_time` come from AICPU, so the gap between
`dispatch_time` and `start_time` is the AICPU→AICore
hand-off latency, and the gap between `end_time` and
`finish_time` is the FIN-observation latency.
AICore, with `start_time_us` / `end_time_us` / `duration_us` in
microseconds (converted from device cycles).
- **Dispatch and finish overhead.** `dispatch_time_us` and
`finish_time_us` come from AICPU, so the gap between
`dispatch_time_us` and `start_time_us` is the AICPU→AICore
hand-off latency, and the gap between `end_time_us` and
`finish_time_us` is the FIN-observation latency.
- **Dependency chains.** `fanout[]` lets Perfetto draw arrows
between predecessor and successor tasks.
- **Scheduler-loop time decomposition.** Per-iteration AICPU
@@ -279,7 +286,7 @@ platform-owned AICore state, and never reassigned — so AICore is
fully decoupled from any AICPU-side records-buffer rotation. AICPU,
on observing FIN, validates the slot's register token, copies the slot
record into the current `L2PerfBuffer::records[count]`, fills
`func_id` / `core_type` / `dispatch_time` / `finish_time` / `fanout`,
`func_id` / `core_type` / `dispatch_time_us` / `finish_time_us` / `fanout`,
advances `count`, and rotates the records buffer in place when it
fills up. The ring is sized to the runtime's in-flight issue depth
(2 for dual-issue today; raise to the next power of two when issue
@@ -619,7 +626,7 @@ data (only `tensormap_and_ringbuffer` does, and only when
`AicpuPhaseHeader` was not initialized. Verify the runtime sets
the magic in its scheduler init path.

**`dispatch_time` < `finish_time` mismatch.** Verify the runtime
**`dispatch_time_us` < `finish_time_us` mismatch.** Verify the runtime
overwrites `task_id` with the full encoding on FIN
(`tensormap_and_ringbuffer` does
`(ring_id << 32) | local_id`); a half-filled record means AICore
55 changes: 41 additions & 14 deletions simpler_setup/tools/README.md
@@ -120,7 +120,7 @@ Analyze AICPU scheduler overhead and quantitatively decompose the sources of Tai

`sched_overhead_analysis` reads two artifacts produced by the runtime:

1. **Perf profiling data** (`l2_perf_records_*.json`, v2): per-task Exec / Head OH / Tail OH time breakdowns plus `aicpu_scheduler_phases` — per-thread, per-loop-iteration phase records carrying scan / complete / dispatch / idle timings and per-emit pop_hit / pop_miss deltas.
1. **Perf profiling data** (`l2_perf_records_*.json`, l2_perf_level >= 3): per-task Exec / Head OH / Tail OH time breakdowns plus `aicpu_scheduler_phases` — per-thread, per-loop-iteration phase records carrying scan / complete / dispatch / idle timings and per-emit pop_hit / pop_miss deltas.
2. **`deps.json`** (optional, dep_gen replay output): structural task DAG. When colocated with the perf JSON, Part 2 prints per-thread fanout / fanin aggregates derived from it.

### Basic Usage
@@ -154,7 +154,7 @@ Output is emitted in three parts:
- **Part 2: AICPU scheduler loop breakdown** — per-scheduler-thread loop statistics, per-phase (scan / complete / dispatch / idle) time ratios, pop_hit / pop_miss totals, and (when deps.json is available) per-thread fanout / fanin aggregates
- **Part 3: Tail OH distribution & cause analysis** — Tail OH quantile distribution (P10–P99), correlation between scheduler loop iteration time and Tail OH, and data-driven insights into the dominant phase

The perf JSON must be a v2 capture with non-empty `aicpu_scheduler_phases` (rerun the case with `--enable-l2-swimlane` if the tool reports the field is missing).
The perf JSON must be captured at l2_perf_level >= 3 so that `aicpu_scheduler_phases` is non-empty (rerun the case with `--enable-l2-swimlane` if the tool reports the field is missing).

---

@@ -270,23 +270,49 @@ The analysis tools share the same input format - the `l2_perf_records_*.json` fi

```json
{
"version": 1,
"l2_perf_level": 4,
"tasks": [
{
"task_id": 0,
"func_id": 0,
"core_id": 0,
"core_type": "aic",
"start_time_us": 100.0,
"end_time_us": 250.5,
"duration_us": 150.5,
"fanout": [1, 2],
"fanout_count": 2
"core_id": 7,
"core_type": "aiv",
"ring_id": 0,
"start_time_us": 47.46,
"end_time_us": 55.9,
"duration_us": 8.44,
"dispatch_time_us": 45.94,
"finish_time_us": 60.52,
"fanout": [4294967299, 4294967297, 4294967296],
"fanout_count": 3
},
{
"task_id": 4294967296,
"func_id": 1,
"core_id": 7,
"core_type": "aiv",
"ring_id": 1,
"start_time_us": 68.68,
"end_time_us": 70.42,
"duration_us": 1.74,
"dispatch_time_us": 68.24,
"finish_time_us": 71.2,
"fanout": [4294967298],
"fanout_count": 1
}
]
}
```

Top-level layout depends on `l2_perf_level`:

- All levels: `l2_perf_level`, `tasks[]` (per-task fields above).
- `>= 3`: also `aicpu_scheduler_phases[]` (per-thread phase records:
scan / complete / dispatch / idle) and `core_to_thread[]` (core_id →
scheduler thread index).
- `>= 4`: also `aicpu_orchestrator_phases[]` (per-task orchestrator
phase records).
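
The level-dependent layout above could be checked with something like the sketch below. It only encodes the section names listed in this README; the helper names are hypothetical and the real tools may validate differently.

```python
def expected_sections(level: int) -> list[str]:
    """Top-level keys a capture at the given l2_perf_level should carry."""
    if not 1 <= level <= 4:
        raise ValueError(f"unsupported l2_perf_level: {level}")
    sections = ["l2_perf_level", "tasks"]
    if level >= 3:
        sections += ["aicpu_scheduler_phases", "core_to_thread"]
    if level >= 4:
        sections.append("aicpu_orchestrator_phases")
    return sections


def missing_sections(perf: dict) -> list[str]:
    """Sections the capture's own level promises but the JSON lacks."""
    return [key for key in expected_sections(perf["l2_perf_level"])
            if key not in perf]


perf = {"l2_perf_level": 3, "tasks": [], "aicpu_scheduler_phases": []}
print(missing_sections(perf))
```

A check like this is a quick way to tell apart a capture taken at too low a level from one that was truncated.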

### Kernel Config Format

To display meaningful function names in the output, provide a `kernel_config.py` file:
@@ -366,10 +392,11 @@ For batch-run hardware regression, see the dev-only script
- Check the kernel_config.py file format
- Make sure every KERNELS entry has a 'func_id' and 'name' field

### Error: Unsupported version
### Error: Unsupported l2_perf_level

- The tools only support version 1 of the profiling data format
- Regenerate the profiling data with the latest runtime
- The tools accept l2_perf_level 1–4 (the integer captured at runtime
via `--enable-l2-swimlane <N>`)
- Regenerate the profiling data with a supported level

### Error: Perf JSON missing required fields for scheduler overhead analysis

@@ -394,7 +421,7 @@ For batch-run hardware regression, see the dev-only script
| ---- | ---- | ------- | ------ |
| `l2_perf_records_*.json` | Runtime | Raw timing profiling data | JSON |
| `merged_swimlane_*.json` | swimlane_converter | Perfetto visualization | Chrome Trace Event JSON |
| `deps.json` | Runtime (dep_gen replay) | Structural task dependency graph + per-edge tensor info | JSON (v2) |
| `deps.json` | Runtime (dep_gen replay) | Structural task dependency graph + per-edge tensor info | JSON |
| `deps_graph.html` | deps_to_graph | Pan/zoom dependency graph viewer | HTML (self-contained) |

---