Skip to content

Add: scope_stats collector for per-scope queue-fill peaks#858

Open
doraemonmj wants to merge 1 commit into
hw-native-sys:mainfrom
doraemonmj:scopememoryuse
Open

Add: scope_stats collector for per-scope queue-fill peaks#858
doraemonmj wants to merge 1 commit into
hw-native-sys:mainfrom
doraemonmj:scopememoryuse

Conversation

@doraemonmj
Copy link
Copy Markdown
Contributor

@doraemonmj doraemonmj commented May 26, 2026

Summary
Add a scope_stats diagnostic that records per-scope peak occupancy of ring buffer / heap / tensormap / fanin / dep_pool, and dumps <output_prefix>/scope_stats.json at the end of each run. Aimed at capacity-class failures (PTO2_ERROR_FLOW_CONTROL_DEADLOCK, RING_BUFFER_BACKPRESSURE) — tells the user which scope, on which ring, exhausted which resource.

Details

  • Collector (device side). New scope_stats_collector.cpp under src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/. Probes wired into the orchestrator track a PTO2_MAX_SCOPE_DEPTH × PTO2_MAX_RING_DEPTH peak matrix; per-scope source site flows from PTO2ScopeGuard's ctor via scope_stats_set_pending_site. Compile-time gated by PTO2_PROFILING.

  • Runtime flag. New --enable-scope-stats CLI / CallConfig::enable_scope_stats / PROFILING_FLAG_SCOPE_STATS bit. AICPU entry calls set_scope_stats_enabled(...); when off, every probe collapses to one relaxed bool load.

  • Emission off the hot path. Earlier draft called LOG_INFO_V5 synchronously at every scope exit (8469 calls per paged_attention run, ~12 µs apart), which starved the orchestrator and tripped PTO2_ERROR_FLOW_CONTROL_DEADLOCK. The probe now just stores a record into a host-allocated ScopeStatsBuffer (header + ring of records) — zero vsnprintf, zero IO. scope_stats_on_fatal only latches fatal_latched.

  • Output. ScopeStatsHostBuffer (header-only) dumps <output_prefix>/scope_stats.json next to l2_perf_records.json / deps.json, keeping the used/cap string shape (e.g. "task_window": ["0/16384","5/16384",...]).

  • Cross-platform transport. a2a3 maps the buffer host-visible via halHostRegister; a5 keeps a host shadow and refreshes it via rtMemcpy DEVICE_TO_HOST at dump time; sim is single-address-space. Mode is selected by which callbacks the user passes to ScopeStatsHostBuffer::init(...).

  • Wiring. kernel_args.scope_stats_data_base + set_platform_scope_stats_base(...) mirror the existing set_platform_*_base hooks. DeviceRunner gains a private init_scope_stats() / dump / finalize in the same shape as init_dep_gen. CallConfig::diagnostics_any() and scene_test.py::diagnostics_on include enable_scope_stats so the output_prefix requirement applies.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a decoupled scope-stats collector (PTO2_SCOPE_STATS) to report peak ring-buffer fill (heap bytes and tasks in-flight) and tensormap usage at each user scope's end. The feedback highlights critical thread-safety issues with global variables in multi-threaded orchestrator environments, recommending the use of thread_local storage and passing the orchestrator state pointer to the hooks for lazy binding. Additionally, defensive checks are recommended to prevent potential division-by-zero and null pointer dereferences.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scope_stats_collector.cpp Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scope_stats_collector.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scope_stats_collector.cpp Outdated
Comment thread src/a2a3/platform/include/aicpu/scope_stats_collector.h Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scope_stats_collector.cpp Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scope_stats_collector.cpp Outdated
Comment thread src/a5/platform/include/aicpu/scope_stats_collector.h Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
@doraemonmj doraemonmj force-pushed the scopememoryuse branch 6 times, most recently from 9b13d72 to d5c4d64 Compare May 27, 2026 06:19
// host opts in via `--enable-scope-stats` and the allocation is skipped when
// the flag is off.

#define PTO2_SCOPE_STATS_LOG_CAP 16384u
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里定义的宏好像和src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h中的是相关的,如果runtime中的量被改大,这里会不会出现什么问题

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTO2_SCOPE_STATS_LOG_CAP是用来存储记录的数据的一个环形缓冲区,只是表示一个大小,不会影响其他地方,当前大概最大只占用了2MB左右

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/docs/profiling_levels.md Outdated
Comment thread docs/dfx/scope-stats.md
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

文档中给大家讲解下接口是什么,怎么用。在tests/st/a2a3/tensormap_and_ringbuffer/dfx下新建一个测试用例,里面测试的同时也可以重点凸显出来怎么使用的新加的接口

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已添加文档与用例

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d5723703-9aa2-4fda-9f35-22ddf6d37062

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR introduces scope-stats, a peak resource attribution feature for PTO2_SCOPE regions. It adds shared-memory ring buffers, platform collectors, host JSON export, runtime instrumentation, and comprehensive test support across a2a3 and a5 platforms with end-to-end Python test integration.

Changes

Scope Stats Peak Attribution Feature

Layer / File(s) Summary
Shared platform data contracts and CLI config
src/common/task_interface/call_config.h, src/a2a3/platform/include/common/scope_stats_buffer.h, src/a2a3/platform/include/common/platform_config.h, src/a2a3/platform/include/common/kernel_args.h
Defines cross-platform ring buffer layout (ScopeStatsRecord, ScopeStatsHeader, ScopeStatsBuffer), profiling flag bit PROFILING_FLAG_SCOPE_STATS (bit 4), kernel args scope_stats_data_base field, and CallConfig struct field enable_scope_stats with updated wire-layout assertion.
AICPU collector interfaces and implementations
src/a2a3/platform/include/aicpu/scope_stats_collector_aicpu.h, src/a2a3/platform/src/aicpu/scope_stats_collector_aicpu.cpp, src/a5/platform/include/aicpu/scope_stats_collector_aicpu.h, src/a5/platform/src/aicpu/scope_stats_collector_aicpu.cpp
Declares and implements C ABI interface with lifecycle probes, site tracking, enable/base setters, peak update functions, and capacity registration for both platforms.
Host-side JSON export infrastructure
src/a2a3/platform/include/host/scope_stats_dump.h, src/a5/platform/include/host/scope_stats_dump.h
Header-only helpers managing allocation, device↔host synchronization, and JSON v2 serialization with mode-dependent cleanup for both a2a3 and a5.
A2A3 platform integration (onboard and sim)
src/a2a3/platform/onboard/aicpu/kernel.cpp, src/a2a3/platform/onboard/host/device_runner.{h,cpp}, src/a2a3/platform/sim/host/device_runner.{h,cpp}, src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp, src/a2a3/platform/sim/host/pto_runtime_c_api.cpp
Extends kernel and host device runners to initialize scope-stats buffers, wire profiling flags, load AICPU collectors (sim), dump JSON after execution, and finalize resources.
A5 platform integration (onboard and sim)
src/a5/platform/onboard/aicpu/kernel.cpp, src/a5/platform/onboard/host/device_runner.{h,cpp}, src/a5/platform/sim/host/device_runner.{h,cpp}, src/a5/platform/onboard/host/pto_runtime_c_api.cpp, src/a5/platform/sim/host/pto_runtime_c_api.cpp
Identical integration pattern as a2a3 with BufferPoolManager-based allocation instead of HAL register wrapper.
Runtime orchestration instrumentation
src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.{h,cpp}, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h, plus a5 equivalents
Adds scope_set_site() hook to PTO2ScopeGuard to capture call-site; updates orchestrator to emit scope lifecycle probes and peak metrics; adds scheduler pool tracking; provides heap_used_bytes(), current_used(), pool_capacity() accessors for peak snapshots. All gated by PTO2_PROFILING.
Python config threading and test integration
conftest.py, python/bindings/task_interface.cpp, python/simpler/worker.py, simpler_setup/scene_test.py, src/common/worker/{chip_worker.h,cpp}, src/common/worker/pto_runtime_c_api.h
Adds pytest --enable-scope-stats flag; extends CallConfig Python bindings and mailbox wire format; threads enable_scope_stats through run_class_cases() and L2/L3 execution paths; updates common C API signature and ChipWorker to propagate the new parameter.
Test helpers and scope stats smoke test
tests/st/a2a3/host_build_graph/prepared_callable/test_prepared_callable.py, tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py, tests/st/a2a3/tensormap_and_ringbuffer/dfx/scope_stats/test_scope_stats.py, tests/st/a5/host_build_graph/prepared_callable/test_prepared_callable.py, tests/st/a5/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
Adds enable_scope_stats parameter to test helper _run_and_validate_l2() methods; introduces new smoke test that validates scope_stats.json schema v2, fatal flag, record count, and required fields only when --enable-scope-stats is enabled.
Documentation and build integration
docs/dfx/scope-stats.md, tests/ut/cpp/CMakeLists.txt
Comprehensive documentation of purpose, architecture, platform API, runtime flow, JSON schema, symbol resolution, and future extensibility; adds scope_stats_collector.cpp to a2a3_rt_objs library.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop, skip, and a scope trace!
Peaks now captured in every place,
Ring buffers dance, JSON flows free,
Resource exhaustion for all to see!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

🧹 Nitpick comments (4)
docs/dfx/scope-stats.md (3)

232-232: ⚡ Quick win

Document the constant value or reference its definition.

The constant PTO2_SCOPE_STATS_LOG_CAP (16384) is mentioned but readers might benefit from knowing where this is defined or whether it's configurable.

📝 Suggested addition

Consider adding a brief explanation:

-| `cap` | uint32 | Ring capacity, `PTO2_SCOPE_STATS_LOG_CAP` (16384) |
+| `cap` | uint32 | Ring capacity, `PTO2_SCOPE_STATS_LOG_CAP` (16384, defined in `scope_stats_buffer.h`) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/dfx/scope-stats.md` at line 232, Update the docs entry for the `cap`
field to document the `PTO2_SCOPE_STATS_LOG_CAP` constant: state its numeric
value (16384), indicate where it is defined (reference the header/source that
declares `PTO2_SCOPE_STATS_LOG_CAP`), and clarify whether it is configurable or
compile-time fixed; mention any related environment variables or build-time
flags if applicable and add a short example or link to the definition for
readers to inspect.

245-245: ⚡ Quick win

Clarify why fanin capacity isn't tracked.

The parenthetical note "capacity isn't currently carried — fanin reservation is implicit in dep accounting" could confuse readers about the design decision.

💡 Suggested clarification

Consider expanding slightly:

-| `fanin_used[ring]` | int32 | Peak fanin-pool entries in use (capacity isn't currently carried — fanin reservation is implicit in dep accounting) |
+| `fanin_used[ring]` | int32 | Peak fanin-pool entries in use (capacity not shown separately because fanin slots are reserved as part of dep accounting; see dep capacity for the effective limit) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/dfx/scope-stats.md` at line 245, Clarify the parenthetical note for
`fanin_used[ring]` by explaining that fanin capacity isn't recorded because
reservations are implicit in dependency accounting: state already tracks
outstanding dep reservations elsewhere (so capacity would duplicate data),
capacity is fixed per design or derived when needed, and omitting it avoids
extra bookkeeping and potential inconsistency; update the parenthetical to a
short explanatory sentence referencing `fanin_used[ring]` and "dep accounting"
so readers know where reservations are recorded and why capacity is unnecessary.

95-95: ⚡ Quick win

Document the constant value or reference its definition.

The constant PTO2_SCOPE_STATS_MAX_RING_DEPTH is referenced but not defined in the documentation. Readers would benefit from knowing the actual limit or where to find it.

📝 Suggested addition

Consider adding a note near the first use:

 copied verbatim into the buffer header so the host JSON can render
 `used/cap` ratios without a second device→host query. `ring_id` outside
-`[0, PTO2_SCOPE_STATS_MAX_RING_DEPTH)` is silently dropped.
+`[0, PTO2_SCOPE_STATS_MAX_RING_DEPTH)` is silently dropped.
+(`PTO2_SCOPE_STATS_MAX_RING_DEPTH` is defined in `scope_stats_buffer.h` as 4)

Or add a constants reference subsection listing all relevant limits.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/dfx/scope-stats.md` at line 95, Add a concrete definition or link for
PTO2_SCOPE_STATS_MAX_RING_DEPTH where it’s first mentioned in scope-stats.md:
either state the numeric range (e.g. "PTO2_SCOPE_STATS_MAX_RING_DEPTH =
<value>") or add a cross-reference pointing to the source/constant definition,
and/or create a "Constants" subsection listing PTO2_SCOPE_STATS_MAX_RING_DEPTH
and any related limits so readers can see the actual limit instead of the opaque
constant name.
src/a2a3/platform/include/common/scope_stats_buffer.h (1)

28-31: 💤 Low value

Minor: record size comment appears inaccurate.

The comment states "16 384 records × ~96 B = ~1.5 MB", but sizeof(ScopeStatsRecord) calculates to 136 bytes (8+32+4+2+2+32+16+16+16+4+4), giving ~2.2 MB total. This doesn't affect functionality but could mislead capacity planning.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/common/scope_stats_buffer.h` around lines 28 - 31,
Update the explanatory comment that currently claims "16 384 records × ~96 B =
~1.5 MB" to reflect the actual sizeof(ScopeStatsRecord) and total buffer size:
compute sizeof(ScopeStatsRecord) (used by PTO2_SCOPE_STATS_LOG_CAP) and replace
the inaccurate per-record and total bytes with the correct values (e.g., 136 B
per record, 16,384 × 136 B = 2,228,224 bytes ≈ 2.13 MiB) in the comment near
ScopeStatsRecord/PTO2_SCOPE_STATS_LOG_CAP so capacity planning is accurate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@conftest.py`:
- Around line 166-172: The help text for the pytest option registered via
parser.addoption("--enable-scope-stats") is incorrect about file output; update
the help string to reflect runtime behavior by noting that enabling scope stats
will emit a scope_stats.json file under the configured output_prefix (i.e.,
mention "writes scope_stats.json under output_prefix" instead of "produces no
extra files"); edit the help argument passed to
parser.addoption("--enable-scope-stats") to include this file and location
detail.

In `@simpler_setup/scene_test.py`:
- Around line 1262-1268: Update the help string for the "--enable-scope-stats"
argument in the parser.add_argument call so it accurately describes that
enabling scope stats emits a scope_stats.json file under the configured
output_prefix (i.e., it does produce an extra artifact), rather than saying no
extra files are produced; edit the help text attached to
parser.add_argument("--enable-scope-stats", ...) to mention "emits
scope_stats.json under output_prefix" and remove the "produces no extra files"
phrase.

In `@src/a2a3/platform/include/host/scope_stats_dump.h`:
- Line 238: The current fprintf uses an unbounded "%s" to print
rec.site_file_basename which is populated from shared memory and may not be
NUL-terminated; change to a bounded print by computing size_t len =
strnlen(rec.site_file_basename, sizeof(rec.site_file_basename)); then use
fprintf(fp, "\"site\": \"%.*s:%d\", ", (int)len, rec.site_file_basename,
rec.site_line); to avoid over-reads when printing rec.site_file_basename.
- Around line 220-236: Validate and clamp buf->header.cap before using it for
modulo/indexing: read header.cap into cap, ensure cap is at least 1 and no
greater than the compile-time record array size (the max entries in
ScopeStatsBuffer::records), and if out of range replace it with the clamped
value; recompute kept, dropped and start based on the clamped cap (and guard
against division/modulo by zero) so the loop that indexes buf->records[(start +
i) % cap] and the use of ScopeStatsRecord cannot read out of bounds.

In `@src/a2a3/platform/src/aicpu/scope_stats_collector_aicpu.cpp`:
- Around line 71-79: When rebinding shared buffer in
set_platform_scope_stats_base(uint64_t scope_stats_data_base) you must also
reset collector-local state so a prior run cannot leak into the new run: after
assigning scope_stats_shared_buf (and after clearing the shared header) reset
scope_stats_depth to 0, clear any pending site state (e.g.
scope_stats_pending_site or equivalent) and zero/clear the peak matrices and any
per-collector accumulators used by the collector; update the function
set_platform_scope_stats_base and/or related initialization path to explicitly
reinitialize those local symbols (scope_stats_depth, pending site variable, peak
matrix variables) whenever scope_stats_data_base is rebound.

In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h`:
- Line 183: The accessor heap_used_bytes() can divide/modulo by zero when
heap_size_ == 0; update heap_used_bytes() (referencing heap_top_, heap_size_,
heap_tail_) to guard early: if heap_size_ == 0 return 0 (or the intended neutral
value) before performing (heap_top_ + heap_size_ - heap_tail_) % heap_size_,
ensuring no modulo-by-zero occurs in profiling paths.

In `@src/a5/platform/include/host/scope_stats_dump.h`:
- Line 193: Replace the unbounded "%s" usage when printing
rec.site_file_basename in scope_stats_dump.h with a bounded print; compute a
safe length using strnlen(rec.site_file_basename,
sizeof(rec.site_file_basename)-1) and use the "%.*s" printf format to pass that
length and rec.site_file_basename, keeping rec.site_line as-is, so serialization
cannot read past the shared-memory buffer.
- Around line 175-191: The code uses device-written buf->header.cap directly for
modulo indexing into buf->records, which can OOB if cap is corrupted; fix by
introducing a bounded_cap (e.g., bounded_cap = std::min(buf->header.cap,
HOST_ARRAY_CAPACITY)) and treat cap==0 as empty, then recompute kept, dropped
and start using bounded_cap (ensure kept <= bounded_cap and avoid modulo by
zero) and use bounded_cap for the array indexing (buf->records[(start + i) %
bounded_cap]) and any related calculations; update all references to cap in this
scope to bounded_cap (but keep original header.cap preserved for logging if
needed).

In `@src/a5/platform/src/aicpu/scope_stats_collector_aicpu.cpp`:
- Around line 71-79: When rebinding the shared buffer in
set_platform_scope_stats_base, also reinitialize the collector-local tracking
state so stale per-run values (e.g., depth/peak counters) don't carry over;
locate set_platform_scope_stats_base and after assigning scope_stats_shared_buf
and resetting its header, reset the local tracking variables used by the
collector (for example scope_depth, scope_peak_depth, any current index/offset
or per-thread counters maintained in this translation unit) back to their
initial zero/empty state so a previous run that didn't unwind fully cannot leak
into the new run.

In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h`:
- Line 184: The accessor heap_used_bytes() can perform a modulo by heap_size_
which is undefined when heap_size_ == 0; update the implementation of
heap_used_bytes() to check heap_size_ first (e.g., if heap_size_ == 0 return 0)
before computing (heap_top_ + heap_size_ - heap_tail_) % heap_size_, using the
existing members heap_top_, heap_size_, and heap_tail_ to locate the change.

In `@tests/st/a2a3/tensormap_and_ringbuffer/dfx/scope_stats/test_scope_stats.py`:
- Around line 96-97: The test currently does a silent return when the glob found
no output directories (if not matches: return), which can hide failures; replace
this silent return with an explicit assertion or failure (e.g., assert matches
and/or pytest.fail) that at least one match was found when --enable-scope-stats
is set and the platform condition is met, and include a clear message like "no
scope-stats output directories found" so missing artifacts fail the test instead
of passing silently; update the check around the matches variable in
test_scope_stats.py accordingly.

In `@tests/ut/cpp/CMakeLists.txt`:
- Line 93: Update the CMakeLists entry that references scope_stats_collector.cpp
to the actual source filename scope_stats_collector_aicpu.cpp so CMake uses the
existing file; locate the offending list item in the tests/ut/cpp CMake target
(the line adding
${CMAKE_SOURCE_DIR}/../../../src/a2a3/platform/src/aicpu/scope_stats_collector.cpp)
and replace that filename with scope_stats_collector_aicpu.cpp, ensuring any
variable or target that expects the collector source (e.g., the unit test source
list) references the corrected name.

---

Nitpick comments:
In `@docs/dfx/scope-stats.md`:
- Line 232: Update the docs entry for the `cap` field to document the
`PTO2_SCOPE_STATS_LOG_CAP` constant: state its numeric value (16384), indicate
where it is defined (reference the header/source that declares
`PTO2_SCOPE_STATS_LOG_CAP`), and clarify whether it is configurable or
compile-time fixed; mention any related environment variables or build-time
flags if applicable and add a short example or link to the definition for
readers to inspect.
- Line 245: Clarify the parenthetical note for `fanin_used[ring]` by explaining
that fanin capacity isn't recorded because reservations are implicit in
dependency accounting: state already tracks outstanding dep reservations
elsewhere (so capacity would duplicate data), capacity is fixed per design or
derived when needed, and omitting it avoids extra bookkeeping and potential
inconsistency; update the parenthetical to a short explanatory sentence
referencing `fanin_used[ring]` and "dep accounting" so readers know where
reservations are recorded and why capacity is unnecessary.
- Line 95: Add a concrete definition or link for PTO2_SCOPE_STATS_MAX_RING_DEPTH
where it’s first mentioned in scope-stats.md: either state the numeric range
(e.g. "PTO2_SCOPE_STATS_MAX_RING_DEPTH = <value>") or add a cross-reference
pointing to the source/constant definition, and/or create a "Constants"
subsection listing PTO2_SCOPE_STATS_MAX_RING_DEPTH and any related limits so
readers can see the actual limit instead of the opaque constant name.

In `@src/a2a3/platform/include/common/scope_stats_buffer.h`:
- Around line 28-31: Update the explanatory comment that currently claims "16
384 records × ~96 B = ~1.5 MB" to reflect the actual sizeof(ScopeStatsRecord)
and total buffer size: compute sizeof(ScopeStatsRecord) (used by
PTO2_SCOPE_STATS_LOG_CAP) and replace the inaccurate per-record and total bytes
with the correct values (e.g., 136 B per record, 16,384 × 136 B = 2,228,224
bytes ≈ 2.13 MiB) in the comment near ScopeStatsRecord/PTO2_SCOPE_STATS_LOG_CAP
so capacity planning is accurate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 858cc650-6d15-4526-bb28-da3414161590

📥 Commits

Reviewing files that changed from the base of the PR and between f8b7285 and df45c71.

📒 Files selected for processing (55)
  • conftest.py
  • docs/dfx/scope-stats.md
  • python/bindings/task_interface.cpp
  • python/simpler/worker.py
  • simpler_setup/scene_test.py
  • src/a2a3/platform/include/aicpu/scope_stats_collector_aicpu.h
  • src/a2a3/platform/include/common/kernel_args.h
  • src/a2a3/platform/include/common/platform_config.h
  • src/a2a3/platform/include/common/scope_stats_buffer.h
  • src/a2a3/platform/include/host/scope_stats_dump.h
  • src/a2a3/platform/onboard/aicpu/kernel.cpp
  • src/a2a3/platform/onboard/host/device_runner.cpp
  • src/a2a3/platform/onboard/host/device_runner.h
  • src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp
  • src/a2a3/platform/sim/host/device_runner.cpp
  • src/a2a3/platform/sim/host/device_runner.h
  • src/a2a3/platform/sim/host/pto_runtime_c_api.cpp
  • src/a2a3/platform/src/aicpu/scope_stats_collector_aicpu.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
  • src/a5/platform/include/aicpu/scope_stats_collector_aicpu.h
  • src/a5/platform/include/common/kernel_args.h
  • src/a5/platform/include/common/platform_config.h
  • src/a5/platform/include/common/scope_stats_buffer.h
  • src/a5/platform/include/host/scope_stats_dump.h
  • src/a5/platform/onboard/aicpu/kernel.cpp
  • src/a5/platform/onboard/host/device_runner.cpp
  • src/a5/platform/onboard/host/device_runner.h
  • src/a5/platform/onboard/host/pto_runtime_c_api.cpp
  • src/a5/platform/sim/host/device_runner.cpp
  • src/a5/platform/sim/host/device_runner.h
  • src/a5/platform/sim/host/pto_runtime_c_api.cpp
  • src/a5/platform/src/aicpu/scope_stats_collector_aicpu.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
  • src/common/task_interface/call_config.h
  • src/common/worker/chip_worker.cpp
  • src/common/worker/chip_worker.h
  • src/common/worker/pto_runtime_c_api.h
  • tests/st/a2a3/host_build_graph/prepared_callable/test_prepared_callable.py
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/scope_stats/test_scope_stats.py
  • tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
  • tests/st/a5/host_build_graph/prepared_callable/test_prepared_callable.py
  • tests/st/a5/tensormap_and_ringbuffer/prepared_callable/test_prepared_callable.py
  • tests/ut/cpp/CMakeLists.txt

Comment thread conftest.py
Comment thread simpler_setup/scene_test.py
Comment on lines +220 to +236
const std::uint64_t write_count = buf->header.write_count;
const std::uint32_t cap = buf->header.cap;
const std::uint64_t kept = write_count > cap ? cap : write_count;
const std::uint64_t dropped = write_count > cap ? (write_count - cap) : 0;
const std::uint64_t start = write_count - kept;

std::fprintf(fp, "{\n");
std::fprintf(fp, " \"version\": 2,\n");
std::fprintf(fp, " \"fatal\": %s,\n", buf->header.fatal_latched ? "true" : "false");
std::fprintf(fp, " \"write_count\": %" PRIu64 ",\n", write_count);
std::fprintf(fp, " \"cap\": %u,\n", cap);
std::fprintf(fp, " \"dropped\": %" PRIu64 ",\n", dropped);

std::fprintf(fp, " \"records\": [");
for (std::uint64_t i = 0; i < kept; i++) {
const ScopeStatsRecord &rec = buf->records[(start + i) % cap];
if (i) std::fputc(',', fp);
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate and clamp cap before ring-buffer indexing.

cap comes from shared memory (buf->header.cap) and is used in % cap + record indexing. If it is corrupted or larger than the compile-time record array size, this can read out of bounds.

Suggested fix
-        const std::uint32_t cap = buf->header.cap;
+        const std::uint32_t raw_cap = buf->header.cap;
+        const std::uint32_t max_cap =
+            static_cast<std::uint32_t>(sizeof(buf->records) / sizeof(buf->records[0]));
+        const std::uint32_t cap = (raw_cap <= max_cap) ? raw_cap : max_cap;
+        if (cap == 0) {
+            std::fprintf(fp, "{\n  \"version\": 2,\n  \"fatal\": %s,\n  \"write_count\": %" PRIu64
+                             ",\n  \"cap\": %u,\n  \"dropped\": %" PRIu64 ",\n  \"records\": []\n}\n",
+                         buf->header.fatal_latched ? "true" : "false", buf->header.write_count, raw_cap,
+                         buf->header.write_count);
+            std::fclose(fp);
+            return 0;
+        }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/host/scope_stats_dump.h` around lines 220 - 236,
Validate and clamp buf->header.cap before using it for modulo/indexing: read
header.cap into cap, ensure cap is at least 1 and no greater than the
compile-time record array size (the max entries in ScopeStatsBuffer::records),
and if out of range replace it with the clamped value; recompute kept, dropped
and start based on the clamped cap (and guard against division/modulo by zero)
so the loop that indexes buf->records[(start + i) % cap] and the use of
ScopeStatsRecord cannot read out of bounds.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declining. header.cap is host-written, not device-written: ScopeStatsHostBuffer::init() allocates the buffer and set_platform_scope_stats_base() (called once at
▎ AICPU init) writes header.cap = PTO2_SCOPE_STATS_LOG_CAP — the compile-time constant that also sizes ScopeStatsBuffer::records[PTO2_SCOPE_STATS_LOG_CAP]. The AICPU
▎ writer never touches header.cap; it only writes records[idx] and increments header.write_count.

▎ Since both the writer and the reader bind to the same compile-time constant, and the host re-initializes the header at the start of every run, there is no path
▎ where cap can drift from the array size — barring a memory-corruption bug somewhere else in the system, in which case a defensive clamp here would mask the real
▎ problem rather than catch it.

▎ If the concern is "what if cap is ever made runtime-configurable in the future", I'd rather add a static_assert or an init-time check in
▎ set_platform_scope_stats_base than scatter clamps at every read site — happy to add that if you'd like.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@doraemonmj, understood — the invariant is clear: header.cap is written once by the host to the compile-time constant PTO2_SCOPE_STATS_LOG_CAP and the device never touches it, so the host reader and AICPU writer are always in sync by construction. A scattered read-site clamp would indeed mask real corruption rather than expose it. Thanks for the thorough explanation.

(🐰 ._.)


✏️ Learnings added
Learnt from: doraemonmj
Repo: hw-native-sys/simpler PR: 858
File: src/a2a3/platform/include/host/scope_stats_dump.h:220-236
Timestamp: 2026-05-27T09:15:17.659Z
Learning: In `src/a2a3/platform/include/host/scope_stats_dump.h` (and its a5 counterpart), `ScopeStatsBuffer::header.cap` is host-written only: `ScopeStatsHostBuffer::init()` allocates the buffer and `set_platform_scope_stats_base()` sets `header.cap = PTO2_SCOPE_STATS_LOG_CAP` — the same compile-time constant that sizes `ScopeStatsBuffer::records[]`. The AICPU writer never mutates `header.cap`; it only writes records and increments `header.write_count`. Therefore defensive clamping of `cap` at every read site is unnecessary and undesirable; a `static_assert` or an init-time assertion in `set_platform_scope_stats_base` is the preferred guard if runtime configurability is ever added.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment thread src/a2a3/platform/include/host/scope_stats_dump.h Outdated
Comment thread src/a2a3/platform/src/aicpu/scope_stats_collector_aicpu.cpp
Comment thread src/a5/platform/include/host/scope_stats_dump.h Outdated
Comment on lines +71 to +79
extern "C" void set_scope_stats_enabled(bool enable) { scope_stats_enabled = enable; }

extern "C" void set_platform_scope_stats_base(uint64_t scope_stats_data_base) {
scope_stats_shared_buf = reinterpret_cast<ScopeStatsBuffer *>(scope_stats_data_base);
if (scope_stats_shared_buf) {
memset(&scope_stats_shared_buf->header, 0, sizeof(scope_stats_shared_buf->header));
scope_stats_shared_buf->header.cap = PTO2_SCOPE_STATS_LOG_CAP;
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reinitialize collector-local tracking state on buffer rebinding.

At Line 73, only the shared header is reset. Local collector state persists, so a previous run that did not unwind scope depth can leak stale depth/peak data into the next run.

Proposed fix
 namespace {
+inline void reset_scope_stats_local_state() {
+    scope_stats_depth = -1;
+    s_pending_site_file = nullptr;
+    s_pending_site_line = 0;
+    memset(scope_stats_peak_heap_bytes, 0, sizeof(scope_stats_peak_heap_bytes));
+    memset(scope_stats_peak_task_in_flight, 0, sizeof(scope_stats_peak_task_in_flight));
+    memset(scope_stats_peak_fanin_used, 0, sizeof(scope_stats_peak_fanin_used));
+    memset(scope_stats_peak_dep_used, 0, sizeof(scope_stats_peak_dep_used));
+    memset(scope_stats_peak_tensormap_used, 0, sizeof(scope_stats_peak_tensormap_used));
+    memset(s_scope_site_file, 0, sizeof(s_scope_site_file));
+    memset(s_scope_site_line, 0, sizeof(s_scope_site_line));
+}
 }
 
 extern "C" void set_scope_stats_enabled(bool enable) { scope_stats_enabled = enable; }
 
 extern "C" void set_platform_scope_stats_base(uint64_t scope_stats_data_base) {
     scope_stats_shared_buf = reinterpret_cast<ScopeStatsBuffer *>(scope_stats_data_base);
+    reset_scope_stats_local_state();
     if (scope_stats_shared_buf) {
         memset(&scope_stats_shared_buf->header, 0, sizeof(scope_stats_shared_buf->header));
         scope_stats_shared_buf->header.cap = PTO2_SCOPE_STATS_LOG_CAP;
     }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/src/aicpu/scope_stats_collector_aicpu.cpp` around lines 71 -
79, When rebinding the shared buffer in set_platform_scope_stats_base, also
reinitialize the collector-local tracking state so stale per-run values (e.g.,
depth/peak counters) don't carry over; locate set_platform_scope_stats_base and
after assigning scope_stats_shared_buf and resetting its header, reset the local
tracking variables used by the collector (for example scope_depth,
scope_peak_depth, any current index/offset or per-thread counters maintained in
this translation unit) back to their initial zero/empty state so a previous run
that didn't unwind fully cannot leak into the new run.

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h Outdated
Comment thread tests/st/a2a3/tensormap_and_ringbuffer/dfx/scope_stats/test_scope_stats.py Outdated
Comment thread tests/ut/cpp/CMakeLists.txt Outdated
Platform-owned collector with pure-value API — runtime calls
scope_stats_update_*_peaks() at instrumentation points, passing
extracted int/uint64_t values. No runtime-specific types cross the
platform boundary.

Gated by --enable-scope-stats runtime flag; emits scope_stats.json
via shared buffer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants