Add: scope_stats collector for per-scope queue-fill peaks#858
Add: scope_stats collector for per-scope queue-fill peaks#858doraemonmj wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a decoupled scope-stats collector (PTO2_SCOPE_STATS) to report peak ring-buffer fill (heap bytes and tasks in-flight) and tensormap usage at each user scope's end. The feedback highlights critical thread-safety issues with global variables in multi-threaded orchestrator environments, recommending the use of thread_local storage and passing the orchestrator state pointer to the hooks for lazy binding. Additionally, defensive checks are recommended to prevent potential division-by-zero and null pointer dereferences.
9b13d72 to
d5c4d64
Compare
| // host opts in via `--enable-scope-stats` and the allocation is skipped when | ||
| // the flag is off. | ||
|
|
||
| #define PTO2_SCOPE_STATS_LOG_CAP 16384u |
There was a problem hiding this comment.
这里定义的宏好像和src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h中的是相关的,如果runtime中的量被改大,这里会不会出现什么问题
There was a problem hiding this comment.
PTO2_SCOPE_STATS_LOG_CAP是用来存储记录的数据的一个环形缓冲区,只是表示一个大小,不会影响其他地方,当前大概最大只占用了2MB左右
There was a problem hiding this comment.
文档中给大家讲解下接口是什么,怎么用。在tests/st/a2a3/tensormap_and_ringbuffer/dfx下新建一个测试用例,里面测试的同时也可以重点凸显出来怎么使用的新加的接口
d5c4d64 to
df45c71
Compare
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR introduces scope-stats, a peak resource attribution feature for PTO2_SCOPE regions. It adds shared-memory ring buffers, platform collectors, host JSON export, runtime instrumentation, and comprehensive test support across a2a3 and a5 platforms with end-to-end Python test integration. ChangesScope Stats Peak Attribution Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
|
There was a problem hiding this comment.
Actionable comments posted: 12
🧹 Nitpick comments (4)
docs/dfx/scope-stats.md (3)
232-232: ⚡ Quick winDocument the constant value or reference its definition.
The constant
PTO2_SCOPE_STATS_LOG_CAP(16384) is mentioned but readers might benefit from knowing where this is defined or whether it's configurable.📝 Suggested addition
Consider adding a brief explanation:
-| `cap` | uint32 | Ring capacity, `PTO2_SCOPE_STATS_LOG_CAP` (16384) | +| `cap` | uint32 | Ring capacity, `PTO2_SCOPE_STATS_LOG_CAP` (16384, defined in `scope_stats_buffer.h`) |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/dfx/scope-stats.md` at line 232, Update the docs entry for the `cap` field to document the `PTO2_SCOPE_STATS_LOG_CAP` constant: state its numeric value (16384), indicate where it is defined (reference the header/source that declares `PTO2_SCOPE_STATS_LOG_CAP`), and clarify whether it is configurable or compile-time fixed; mention any related environment variables or build-time flags if applicable and add a short example or link to the definition for readers to inspect.
245-245: ⚡ Quick winClarify why fanin capacity isn't tracked.
The parenthetical note "capacity isn't currently carried — fanin reservation is implicit in dep accounting" could confuse readers about the design decision.
💡 Suggested clarification
Consider expanding slightly:
-| `fanin_used[ring]` | int32 | Peak fanin-pool entries in use (capacity isn't currently carried — fanin reservation is implicit in dep accounting) | +| `fanin_used[ring]` | int32 | Peak fanin-pool entries in use (capacity not shown separately because fanin slots are reserved as part of dep accounting; see dep capacity for the effective limit) |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/dfx/scope-stats.md` at line 245, Clarify the parenthetical note for `fanin_used[ring]` by explaining that fanin capacity isn't recorded because reservations are implicit in dependency accounting: state already tracks outstanding dep reservations elsewhere (so capacity would duplicate data), capacity is fixed per design or derived when needed, and omitting it avoids extra bookkeeping and potential inconsistency; update the parenthetical to a short explanatory sentence referencing `fanin_used[ring]` and "dep accounting" so readers know where reservations are recorded and why capacity is unnecessary.
95-95: ⚡ Quick winDocument the constant value or reference its definition.
The constant
PTO2_SCOPE_STATS_MAX_RING_DEPTHis referenced but not defined in the documentation. Readers would benefit from knowing the actual limit or where to find it.📝 Suggested addition
Consider adding a note near the first use:
copied verbatim into the buffer header so the host JSON can render `used/cap` ratios without a second device→host query. `ring_id` outside -`[0, PTO2_SCOPE_STATS_MAX_RING_DEPTH)` is silently dropped. +`[0, PTO2_SCOPE_STATS_MAX_RING_DEPTH)` is silently dropped. +(`PTO2_SCOPE_STATS_MAX_RING_DEPTH` is defined in `scope_stats_buffer.h` as 4)Or add a constants reference subsection listing all relevant limits.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/dfx/scope-stats.md` at line 95, Add a concrete definition or link for PTO2_SCOPE_STATS_MAX_RING_DEPTH where it’s first mentioned in scope-stats.md: either state the numeric range (e.g. "PTO2_SCOPE_STATS_MAX_RING_DEPTH = <value>") or add a cross-reference pointing to the source/constant definition, and/or create a "Constants" subsection listing PTO2_SCOPE_STATS_MAX_RING_DEPTH and any related limits so readers can see the actual limit instead of the opaque constant name.src/a2a3/platform/include/common/scope_stats_buffer.h (1)
28-31: 💤 Low valueMinor: record size comment appears inaccurate.
The comment states "16 384 records × ~96 B = ~1.5 MB", but
sizeof(ScopeStatsRecord)calculates to 136 bytes (8+32+4+2+2+32+16+16+16+4+4), giving ~2.2 MB total. This doesn't affect functionality but could mislead capacity planning.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/include/common/scope_stats_buffer.h` around lines 28 - 31, Update the explanatory comment that currently claims "16 384 records × ~96 B = ~1.5 MB" to reflect the actual sizeof(ScopeStatsRecord) and total buffer size: compute sizeof(ScopeStatsRecord) (used by PTO2_SCOPE_STATS_LOG_CAP) and replace the inaccurate per-record and total bytes with the correct values (e.g., 136 B per record, 16,384 × 136 B = 2,228,224 bytes ≈ 2.13 MiB) in the comment near ScopeStatsRecord/PTO2_SCOPE_STATS_LOG_CAP so capacity planning is accurate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@conftest.py`:
- Around line 166-172: The help text for the pytest option registered via
parser.addoption("--enable-scope-stats") is incorrect about file output; update
the help string to reflect runtime behavior by noting that enabling scope stats
will emit a scope_stats.json file under the configured output_prefix (i.e.,
mention "writes scope_stats.json under output_prefix" instead of "produces no
extra files"); edit the help argument passed to
parser.addoption("--enable-scope-stats") to include this file and location
detail.
In `@simpler_setup/scene_test.py`:
- Around line 1262-1268: Update the help string for the "--enable-scope-stats"
argument in the parser.add_argument call so it accurately describes that
enabling scope stats emits a scope_stats.json file under the configured
output_prefix (i.e., it does produce an extra artifact), rather than saying no
extra files are produced; edit the help text attached to
parser.add_argument("--enable-scope-stats", ...) to mention "emits
scope_stats.json under output_prefix" and remove the "produces no extra files"
phrase.
In `@src/a2a3/platform/include/host/scope_stats_dump.h`:
- Line 238: The current fprintf uses an unbounded "%s" to print
rec.site_file_basename which is populated from shared memory and may not be
NUL-terminated; change to a bounded print by computing size_t len =
strnlen(rec.site_file_basename, sizeof(rec.site_file_basename)); then use
fprintf(fp, "\"site\": \"%.*s:%d\", ", (int)len, rec.site_file_basename,
rec.site_line); to avoid over-reads when printing rec.site_file_basename.
- Around line 220-236: Validate and clamp buf->header.cap before using it for
modulo/indexing: read header.cap into cap, ensure cap is at least 1 and no
greater than the compile-time record array size (the max entries in
ScopeStatsBuffer::records), and if out of range replace it with the clamped
value; recompute kept, dropped and start based on the clamped cap (and guard
against division/modulo by zero) so the loop that indexes buf->records[(start +
i) % cap] and the use of ScopeStatsRecord cannot read out of bounds.
In `@src/a2a3/platform/src/aicpu/scope_stats_collector_aicpu.cpp`:
- Around line 71-79: When rebinding shared buffer in
set_platform_scope_stats_base(uint64_t scope_stats_data_base) you must also
reset collector-local state so a prior run cannot leak into the new run: after
assigning scope_stats_shared_buf (and after clearing the shared header) reset
scope_stats_depth to 0, clear any pending site state (e.g.
scope_stats_pending_site or equivalent) and zero/clear the peak matrices and any
per-collector accumulators used by the collector; update the function
set_platform_scope_stats_base and/or related initialization path to explicitly
reinitialize those local symbols (scope_stats_depth, pending site variable, peak
matrix variables) whenever scope_stats_data_base is rebound.
In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h`:
- Line 183: The accessor heap_used_bytes() can divide/modulo by zero when
heap_size_ == 0; update heap_used_bytes() (referencing heap_top_, heap_size_,
heap_tail_) to guard early: if heap_size_ == 0 return 0 (or the intended neutral
value) before performing (heap_top_ + heap_size_ - heap_tail_) % heap_size_,
ensuring no modulo-by-zero occurs in profiling paths.
In `@src/a5/platform/include/host/scope_stats_dump.h`:
- Line 193: Replace the unbounded "%s" usage when printing
rec.site_file_basename in scope_stats_dump.h with a bounded print; compute a
safe length using strnlen(rec.site_file_basename,
sizeof(rec.site_file_basename)-1) and use the "%.*s" printf format to pass that
length and rec.site_file_basename, keeping rec.site_line as-is, so serialization
cannot read past the shared-memory buffer.
- Around line 175-191: The code uses device-written buf->header.cap directly for
modulo indexing into buf->records, which can OOB if cap is corrupted; fix by
introducing a bounded_cap (e.g., bounded_cap = std::min(buf->header.cap,
HOST_ARRAY_CAPACITY)) and treat cap==0 as empty, then recompute kept, dropped
and start using bounded_cap (ensure kept <= bounded_cap and avoid modulo by
zero) and use bounded_cap for the array indexing (buf->records[(start + i) %
bounded_cap]) and any related calculations; update all references to cap in this
scope to bounded_cap (but keep original header.cap preserved for logging if
needed).
In `@src/a5/platform/src/aicpu/scope_stats_collector_aicpu.cpp`:
- Around line 71-79: When rebinding the shared buffer in
set_platform_scope_stats_base, also reinitialize the collector-local tracking
state so stale per-run values (e.g., depth/peak counters) don't carry over;
locate set_platform_scope_stats_base and after assigning scope_stats_shared_buf
and resetting its header, reset the local tracking variables used by the
collector (for example scope_depth, scope_peak_depth, any current index/offset
or per-thread counters maintained in this translation unit) back to their
initial zero/empty state so a previous run that didn't unwind fully cannot leak
into the new run.
In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h`:
- Line 184: The accessor heap_used_bytes() can perform a modulo by heap_size_
which is undefined when heap_size_ == 0; update the implementation of
heap_used_bytes() to check heap_size_ first (e.g., if heap_size_ == 0 return 0)
before computing (heap_top_ + heap_size_ - heap_tail_) % heap_size_, using the
existing members heap_top_, heap_size_, and heap_tail_ to locate the change.
In `@tests/st/a2a3/tensormap_and_ringbuffer/dfx/scope_stats/test_scope_stats.py`:
- Around line 96-97: The test currently does a silent return when the glob found
no output directories (if not matches: return), which can hide failures; replace
this silent return with an explicit assertion or failure (e.g., assert matches
and/or pytest.fail) that at least one match was found when --enable-scope-stats
is set and the platform condition is met, and include a clear message like "no
scope-stats output directories found" so missing artifacts fail the test instead
of passing silently; update the check around the matches variable in
test_scope_stats.py accordingly.
In `@tests/ut/cpp/CMakeLists.txt`:
- Line 93: Update the CMakeLists entry that references scope_stats_collector.cpp
to the actual source filename scope_stats_collector_aicpu.cpp so CMake uses the
existing file; locate the offending list item in the tests/ut/cpp CMake target
(the line adding
${CMAKE_SOURCE_DIR}/../../../src/a2a3/platform/src/aicpu/scope_stats_collector.cpp)
and replace that filename with scope_stats_collector_aicpu.cpp, ensuring any
variable or target that expects the collector source (e.g., the unit test source
list) references the corrected name.
---
Nitpick comments:
In `@docs/dfx/scope-stats.md`:
- Line 232: Update the docs entry for the `cap` field to document the
`PTO2_SCOPE_STATS_LOG_CAP` constant: state its numeric value (16384), indicate
where it is defined (reference the header/source that declares
`PTO2_SCOPE_STATS_LOG_CAP`), and clarify whether it is configurable or
compile-time fixed; mention any related environment variables or build-time
flags if applicable and add a short example or link to the definition for
readers to inspect.
- Line 245: Clarify the parenthetical note for `fanin_used[ring]` by explaining
that fanin capacity isn't recorded because reservations are implicit in
dependency accounting: state already tracks outstanding dep reservations
elsewhere (so capacity would duplicate data), capacity is fixed per design or
derived when needed, and omitting it avoids extra bookkeeping and potential
inconsistency; update the parenthetical to a short explanatory sentence
referencing `fanin_used[ring]` and "dep accounting" so readers know where
reservations are recorded and why capacity is unnecessary.
- Line 95: Add a concrete definition or link for PTO2_SCOPE_STATS_MAX_RING_DEPTH
where it’s first mentioned in scope-stats.md: either state the numeric range
(e.g. "PTO2_SCOPE_STATS_MAX_RING_DEPTH = <value>") or add a cross-reference
pointing to the source/constant definition, and/or create a "Constants"
subsection listing PTO2_SCOPE_STATS_MAX_RING_DEPTH and any related limits so
readers can see the actual limit instead of the opaque constant name.
In `@src/a2a3/platform/include/common/scope_stats_buffer.h`:
- Around line 28-31: Update the explanatory comment that currently claims "16
384 records × ~96 B = ~1.5 MB" to reflect the actual sizeof(ScopeStatsRecord)
and total buffer size: compute sizeof(ScopeStatsRecord) (used by
PTO2_SCOPE_STATS_LOG_CAP) and replace the inaccurate per-record and total bytes
with the correct values (e.g., 136 B per record, 16,384 × 136 B = 2,228,224
bytes ≈ 2.13 MiB) in the comment near ScopeStatsRecord/PTO2_SCOPE_STATS_LOG_CAP
so capacity planning is accurate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 858cc650-6d15-4526-bb28-da3414161590
📒 Files selected for processing (55)
conftest.pydocs/dfx/scope-stats.mdpython/bindings/task_interface.cpppython/simpler/worker.pysimpler_setup/scene_test.pysrc/a2a3/platform/include/aicpu/scope_stats_collector_aicpu.hsrc/a2a3/platform/include/common/kernel_args.hsrc/a2a3/platform/include/common/platform_config.hsrc/a2a3/platform/include/common/scope_stats_buffer.hsrc/a2a3/platform/include/host/scope_stats_dump.hsrc/a2a3/platform/onboard/aicpu/kernel.cppsrc/a2a3/platform/onboard/host/device_runner.cppsrc/a2a3/platform/onboard/host/device_runner.hsrc/a2a3/platform/onboard/host/pto_runtime_c_api.cppsrc/a2a3/platform/sim/host/device_runner.cppsrc/a2a3/platform/sim/host/device_runner.hsrc/a2a3/platform/sim/host/pto_runtime_c_api.cppsrc/a2a3/platform/src/aicpu/scope_stats_collector_aicpu.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.hsrc/a5/platform/include/aicpu/scope_stats_collector_aicpu.hsrc/a5/platform/include/common/kernel_args.hsrc/a5/platform/include/common/platform_config.hsrc/a5/platform/include/common/scope_stats_buffer.hsrc/a5/platform/include/host/scope_stats_dump.hsrc/a5/platform/onboard/aicpu/kernel.cppsrc/a5/platform/onboard/host/device_runner.cppsrc/a5/platform/onboard/host/device_runner.hsrc/a5/platform/onboard/host/pto_runtime_c_api.cppsrc/a5/platform/sim/host/device_runner.cppsrc/a5/platform/sim/host/device_runner.hsrc/a5/platform/sim/host/pto_runtime_c_api.cppsrc/a5/platform/src/aicpu/scope_stats_collector_aicpu.cppsrc/a5/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.hsrc/common/task_interface/call_config.hsrc/common/worker/chip_worker.cppsrc/common/worker/chip_worker.hsrc/common/worker/pto_runtime_c_api.htests/st/a2a3/host_build_graph/prepared_callable/test_prepared_callable.pytests/st/a2a3/tensormap_and_ringbuffer/dfx/scope_stats/test_scope_stats.pytests/st/a2a3/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.pytests/st/a5/host_build_graph/prepared_callable/test_prepared_callable.pytests/st/a5/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.pytests/ut/cpp/CMakeLists.txt
| const std::uint64_t write_count = buf->header.write_count; | ||
| const std::uint32_t cap = buf->header.cap; | ||
| const std::uint64_t kept = write_count > cap ? cap : write_count; | ||
| const std::uint64_t dropped = write_count > cap ? (write_count - cap) : 0; | ||
| const std::uint64_t start = write_count - kept; | ||
|
|
||
| std::fprintf(fp, "{\n"); | ||
| std::fprintf(fp, " \"version\": 2,\n"); | ||
| std::fprintf(fp, " \"fatal\": %s,\n", buf->header.fatal_latched ? "true" : "false"); | ||
| std::fprintf(fp, " \"write_count\": %" PRIu64 ",\n", write_count); | ||
| std::fprintf(fp, " \"cap\": %u,\n", cap); | ||
| std::fprintf(fp, " \"dropped\": %" PRIu64 ",\n", dropped); | ||
|
|
||
| std::fprintf(fp, " \"records\": ["); | ||
| for (std::uint64_t i = 0; i < kept; i++) { | ||
| const ScopeStatsRecord &rec = buf->records[(start + i) % cap]; | ||
| if (i) std::fputc(',', fp); |
There was a problem hiding this comment.
Validate and clamp cap before ring-buffer indexing.
cap comes from shared memory (buf->header.cap) and is used in % cap + record indexing. If it is corrupted or larger than the compile-time record array size, this can read out of bounds.
Suggested fix
- const std::uint32_t cap = buf->header.cap;
+ const std::uint32_t raw_cap = buf->header.cap;
+ const std::uint32_t max_cap =
+ static_cast<std::uint32_t>(sizeof(buf->records) / sizeof(buf->records[0]));
+ const std::uint32_t cap = (raw_cap <= max_cap) ? raw_cap : max_cap;
+ if (cap == 0) {
+ std::fprintf(fp, "{\n \"version\": 2,\n \"fatal\": %s,\n \"write_count\": %" PRIu64
+ ",\n \"cap\": %u,\n \"dropped\": %" PRIu64 ",\n \"records\": []\n}\n",
+ buf->header.fatal_latched ? "true" : "false", buf->header.write_count, raw_cap,
+ buf->header.write_count);
+ std::fclose(fp);
+ return 0;
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a2a3/platform/include/host/scope_stats_dump.h` around lines 220 - 236,
Validate and clamp buf->header.cap before using it for modulo/indexing: read
header.cap into cap, ensure cap is at least 1 and no greater than the
compile-time record array size (the max entries in ScopeStatsBuffer::records),
and if out of range replace it with the clamped value; recompute kept, dropped
and start based on the clamped cap (and guard against division/modulo by zero)
so the loop that indexes buf->records[(start + i) % cap] and the use of
ScopeStatsRecord cannot read out of bounds.
There was a problem hiding this comment.
Declining. header.cap is host-written, not device-written: ScopeStatsHostBuffer::init() allocates the buffer and set_platform_scope_stats_base() (called once at
▎ AICPU init) writes header.cap = PTO2_SCOPE_STATS_LOG_CAP — the compile-time constant that also sizes ScopeStatsBuffer::records[PTO2_SCOPE_STATS_LOG_CAP]. The AICPU
▎ writer never touches header.cap; it only writes records[idx] and increments header.write_count.
▎
▎ Since both the writer and the reader bind to the same compile-time constant, and the host re-initializes the header at the start of every run, there is no path
▎ where cap can drift from the array size — barring a memory-corruption bug somewhere else in the system, in which case a defensive clamp here would mask the real
▎ problem rather than catch it.
▎
▎ If the concern is "what if cap is ever made runtime-configurable in the future", I'd rather add a static_assert or an init-time check in
▎ set_platform_scope_stats_base than scatter clamps at every read site — happy to add that if you'd like.
There was a problem hiding this comment.
@doraemonmj, understood — the invariant is clear: header.cap is written once by the host to the compile-time constant PTO2_SCOPE_STATS_LOG_CAP and the device never touches it, so the host reader and AICPU writer are always in sync by construction. A scattered read-site clamp would indeed mask real corruption rather than expose it. Thanks for the thorough explanation.
(🐰 ._.)
✏️ Learnings added
Learnt from: doraemonmj
Repo: hw-native-sys/simpler PR: 858
File: src/a2a3/platform/include/host/scope_stats_dump.h:220-236
Timestamp: 2026-05-27T09:15:17.659Z
Learning: In `src/a2a3/platform/include/host/scope_stats_dump.h` (and its a5 counterpart), `ScopeStatsBuffer::header.cap` is host-written only: `ScopeStatsHostBuffer::init()` allocates the buffer and `set_platform_scope_stats_base()` sets `header.cap = PTO2_SCOPE_STATS_LOG_CAP` — the same compile-time constant that sizes `ScopeStatsBuffer::records[]`. The AICPU writer never mutates `header.cap`; it only writes records and increments `header.write_count`. Therefore defensive clamping of `cap` at every read site is unnecessary and undesirable; a `static_assert` or an init-time assertion in `set_platform_scope_stats_base` is the preferred guard if runtime configurability is ever added.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
| extern "C" void set_scope_stats_enabled(bool enable) { scope_stats_enabled = enable; } | ||
|
|
||
| extern "C" void set_platform_scope_stats_base(uint64_t scope_stats_data_base) { | ||
| scope_stats_shared_buf = reinterpret_cast<ScopeStatsBuffer *>(scope_stats_data_base); | ||
| if (scope_stats_shared_buf) { | ||
| memset(&scope_stats_shared_buf->header, 0, sizeof(scope_stats_shared_buf->header)); | ||
| scope_stats_shared_buf->header.cap = PTO2_SCOPE_STATS_LOG_CAP; | ||
| } | ||
| } |
There was a problem hiding this comment.
Reinitialize collector-local tracking state on buffer rebinding.
At Line 73, only the shared header is reset. Local collector state persists, so a previous run that did not unwind scope depth can leak stale depth/peak data into the next run.
Proposed fix
namespace {
+inline void reset_scope_stats_local_state() {
+ scope_stats_depth = -1;
+ s_pending_site_file = nullptr;
+ s_pending_site_line = 0;
+ memset(scope_stats_peak_heap_bytes, 0, sizeof(scope_stats_peak_heap_bytes));
+ memset(scope_stats_peak_task_in_flight, 0, sizeof(scope_stats_peak_task_in_flight));
+ memset(scope_stats_peak_fanin_used, 0, sizeof(scope_stats_peak_fanin_used));
+ memset(scope_stats_peak_dep_used, 0, sizeof(scope_stats_peak_dep_used));
+ memset(scope_stats_peak_tensormap_used, 0, sizeof(scope_stats_peak_tensormap_used));
+ memset(s_scope_site_file, 0, sizeof(s_scope_site_file));
+ memset(s_scope_site_line, 0, sizeof(s_scope_site_line));
+}
}
extern "C" void set_scope_stats_enabled(bool enable) { scope_stats_enabled = enable; }
extern "C" void set_platform_scope_stats_base(uint64_t scope_stats_data_base) {
scope_stats_shared_buf = reinterpret_cast<ScopeStatsBuffer *>(scope_stats_data_base);
+ reset_scope_stats_local_state();
if (scope_stats_shared_buf) {
memset(&scope_stats_shared_buf->header, 0, sizeof(scope_stats_shared_buf->header));
scope_stats_shared_buf->header.cap = PTO2_SCOPE_STATS_LOG_CAP;
}
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a5/platform/src/aicpu/scope_stats_collector_aicpu.cpp` around lines 71 -
79, When rebinding the shared buffer in set_platform_scope_stats_base, also
reinitialize the collector-local tracking state so stale per-run values (e.g.,
depth/peak counters) don't carry over; locate set_platform_scope_stats_base and
after assigning scope_stats_shared_buf and resetting its header, reset the local
tracking variables used by the collector (for example scope_depth,
scope_peak_depth, any current index/offset or per-thread counters maintained in
this translation unit) back to their initial zero/empty state so a previous run
that didn't unwind fully cannot leak into the new run.
Platform-owned collector with pure-value API — runtime calls scope_stats_update_*_peaks() at instrumentation points, passing extracted int/uint64_t values. No runtime-specific types cross the platform boundary. Gated by --enable-scope-stats runtime flag; emits scope_stats.json via shared buffer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
df45c71 to
3f6c4b1
Compare
Summary
Add a scope_stats diagnostic that records per-scope peak occupancy of ring buffer / heap / tensormap / fanin / dep_pool, and dumps <output_prefix>/scope_stats.json at the end of each run. Aimed at capacity-class failures (PTO2_ERROR_FLOW_CONTROL_DEADLOCK, RING_BUFFER_BACKPRESSURE) — tells the user which scope, on which ring, exhausted which resource.
Details
Collector (device side). New scope_stats_collector.cpp under src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/. Probes wired into the orchestrator track a PTO2_MAX_SCOPE_DEPTH × PTO2_MAX_RING_DEPTH peak matrix; per-scope source site flows from PTO2ScopeGuard's ctor via scope_stats_set_pending_site. Compile-time gated by PTO2_PROFILING.
Runtime flag. New --enable-scope-stats CLI / CallConfig::enable_scope_stats / PROFILING_FLAG_SCOPE_STATS bit. AICPU entry calls set_scope_stats_enabled(...); when off, every probe collapses to one relaxed bool load.
Emission off the hot path. Earlier draft called LOG_INFO_V5 synchronously at every scope exit (8469 calls per paged_attention run, ~12 µs apart), which starved the orchestrator and tripped PTO2_ERROR_FLOW_CONTROL_DEADLOCK. The probe now just stores a record into a host-allocated ScopeStatsBuffer (header + ring of records) — zero vsnprintf, zero IO. scope_stats_on_fatal only latches fatal_latched.
Output. ScopeStatsHostBuffer (header-only) dumps <output_prefix>/scope_stats.json next to l2_perf_records.json / deps.json, keeping the used/cap string shape (e.g. "task_window": ["0/16384","5/16384",...]).
Cross-platform transport. a2a3 maps the buffer host-visible via halHostRegister; a5 keeps a host shadow and refreshes it via rtMemcpy DEVICE_TO_HOST at dump time; sim is single-address-space. Mode is selected by which callbacks the user passes to ScopeStatsHostBuffer::init(...).
Wiring. kernel_args.scope_stats_data_base + set_platform_scope_stats_base(...) mirror the existing set_platform_*_base hooks. DeviceRunner gains a private init_scope_stats() / dump / finalize in the same shape as init_dep_gen. CallConfig::diagnostics_any() and scene_test.py::diagnostics_on include enable_scope_stats so the output_prefix requirement applies.