Refactor: host-build trb runtime arena by poursoul · Pull Request #846 · hw-native-sys/simpler

poursoul · 2026-05-22T10:01:15Z

Summary

Move the trb runtime arena's layout + data initialization from AICPU boot
onto host, so each AICPU launch reduces to a cheap pointer-fixup pass plus
the SM reset that can't run off-device. The pooled prebuilt image lives in
a per-Worker DeviceRunner pool and is reused across runs via a single
rtMemcpy — multi-launch boot cost drops from O(task_window_size) per
worker to a constant.

Two related cleanups ride along:

- RingSchedState::init's O(task_window_size) slot-bind loop is lifted
  into per-submit orch::prepare_task, making startup independent of
  window size. The two extra stores hit the same 64B cache line that
  prepare_task already dirties, so the per-submit cost is essentially
  free.
- AICPU SM reset (per-slot bind_ring + reset_for_reuse +
  fanin_count/active_mask zero) consolidated into
  PTO2SharedMemoryHandle::init_header_per_ring so the host-build path
  doesn't dereference SM.

Covers both a2a3 and a5 trb runtimes (platform layer onboard + sim).
hbg is unaffected by the runtime-arena split — its
setup_static_arena(...,0) keeps the third region unreserved.

Mechanism

- runtime_create_from_sm is split into four phases that run on either side:
  - runtime_reserve_layout: pure arithmetic; host computes sub-region
    offsets on a libc-backed DeviceArena.
  - runtime_init_data_from_layout: writes standalone fields, memsets
    arena regions, and stores SM device pointers (only stores, no
    dereferences).
  - runtime_wire_arena_pointers: walks every arena-internal pointer
    field and binds it to arena.base() + offset. Idempotent: host runs
    it once with the host mirror, AICPU runs it once after attach with
    device addresses.
  - runtime_finalize_after_wire: AICPU-only fixup for s_runtime_ops
    (device-side file-local global) and the SPMD core counts from the
    SchedulerContext.
- DeviceArena::attach() wraps an externally-owned buffer with no
  per-attach allocation; re-attach is permitted so each AICPU boot can
  reuse the same pooled image. The pre-alignment / non-null /
  power-of-two checks call std::abort() instead of assert() so release
  builds still trap on contract violations.
- The pto2_sm_layout namespace computes SM device-side field addresses
  by pure offset arithmetic so host init never reads SM. It takes a
  per-ring task_window_sizes[] array (mirroring the SM API) and asserts
  that ring_id is in range, which structurally prevents the host-built
  image from silently disagreeing with the SM layout.
- New runtime/shared/pto_runtime2_init.cpp holds the host-pluggable
  cold path lifted from pto_runtime2.cpp / pto_orchestrator.cpp /
  scheduler/pto_scheduler.cpp. AICPU-only ops table / submit_task /
  dispatch / business logic stay in their original files.
- DeviceRunner now owns three independent pooled arenas
  (gm_heap_arena_, gm_sm_arena_, runtime_arena_pool_), one
  device_malloc each. They are split out from a single backing
  allocation because the combined size can exceed the device
  allocator's largest contiguous block.
  setup_static_arena(gm_heap_size, gm_sm_size, runtime_arena_size)
  commits each region independently; acquire_pooled_runtime_arena()
  returns nullptr when the region is unreserved (hbg's
  setup_static_arena(...,0) path) so misuse is loud, not undefined.
- bind_prepared_to_runtime_impl (host runtime_maker) does the full
  reserve_layout → init_data → wire on a host arena, stashes the layout
  inside the PTO2Runtime image at prebuilt_layout, then rtMemcpys
  the whole arena into the pooled device region.
- Dead fields and parameters dropped: PTO2TensorMap::orch back-pointer
  (never dereferenced), PTO2Runtime::prebuilt_arena_base mirror (host
  Runtime::prebuilt_arena_base_ is the real source of truth), unused
  task_window_size / dep_pool_capacity from
  PTO2SchedulerState::init_data_from_layout and
  RingSchedState::init_data_from_layout (scheduler only needs SM base
  + ring index, both window-size-independent).
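The reserve / init-data / wire split described above can be illustrated with a minimal sketch. Everything in it (MiniLayout, MiniRuntime, build_host_image, the 64-byte alignment) is a hypothetical stand-in rather than the PR's actual types; it only shows why an image whose internal pointers are derived from base + offset can be built on one side, copied byte-for-byte, and re-wired on the other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <vector>

// All names below are hypothetical stand-ins, not the PR's real types.
struct MiniLayout {
    std::size_t off_runtime;  // offset of the MiniRuntime header in the arena
    std::size_t off_slots;    // offset of the slot array in the arena
    std::size_t total_size;   // bytes needed for the whole image
};

struct MiniRuntime {
    MiniLayout layout;         // stashed so the consumer can re-wire from the image
    std::uint32_t* slots;      // arena-internal pointer, rebound by every wire pass
    std::uint32_t slot_count;  // plain data, valid on both sides after the copy
};

inline std::size_t align_up(std::size_t v, std::size_t a) { return (v + a - 1) & ~(a - 1); }

// Phase 1: pure arithmetic, touches no memory.
inline MiniLayout reserve_layout(std::uint32_t slot_count) {
    MiniLayout l{};
    l.off_runtime = 0;
    l.off_slots = align_up(sizeof(MiniRuntime), 64);
    l.total_size = l.off_slots + slot_count * sizeof(std::uint32_t);
    return l;
}

// Phase 2: write plain data only; arena-internal pointers stay null.
inline void init_data_from_layout(unsigned char* base, const MiniLayout& l,
                                  std::uint32_t slot_count) {
    std::memset(base, 0, l.total_size);
    MiniRuntime* rt = new (base + l.off_runtime) MiniRuntime{};
    rt->layout = l;
    rt->slot_count = slot_count;
}

// Phase 3: bind every arena-internal pointer to base + offset.
// Idempotent, so it can run on the host mirror and again after the copy.
inline MiniRuntime* wire_arena_pointers(unsigned char* base, std::size_t off_runtime) {
    MiniRuntime* rt = reinterpret_cast<MiniRuntime*>(base + off_runtime);
    rt->slots = reinterpret_cast<std::uint32_t*>(base + rt->layout.off_slots);
    return rt;
}

// Host side: reserve, init and wire on a libc-backed buffer.
inline std::vector<unsigned char> build_host_image(std::uint32_t slot_count) {
    const MiniLayout l = reserve_layout(slot_count);
    std::vector<unsigned char> image(l.total_size);
    init_data_from_layout(image.data(), l, slot_count);
    wire_arena_pointers(image.data(), l.off_runtime);
    return image;
}
```

In this sketch a plain std::memcpy into a second buffer plays the role of the rtMemcpy, and a second wire_arena_pointers call on the destination buffer plays the role of the AICPU-side wire pass; the device-only finalize phase is left out.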

Test plan

- cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state /
  task_state / wiring / tensormap UTs migrated to the data+wire API.
  task_allocator.init grew an optional initial_local_task_id
  (default 0) so the near-INT32_MAX corner case is still exercised
  without reading the SM.
- a5sim: L2 trb 21/21 + L2 host_build_graph 6/6 pass.
- a2a3sim: L2 trb 29/29 + L2 host_build_graph 9/9 pass.
- a2a3 hardware: tests/st/.../paged_attention_unroll passes on
  device 9 (--build, pto-isa commit pinned to CI).

Move the per-slot payload/task pointer assignments out of the RingSchedState::init() O(task_window_size) loop and into orch::prepare_task. Their value is per-slot constant (&task_payloads[slot] / &task_descriptors[slot]) but writing them at submit time, on the same 64B slot_state cache line prepare_task is already dirtying, is essentially free — while removing the only "scale-dependent" pointer assignments from the init path. ring_id stays in init (its value is per-ring constant, so rewriting it each submit would only add noise without removing a loop). Split PTO2TaskSlotState::bind() into bind_ring() (init-time) and bind_buffers() (per-submit) to make the two call-site shapes explicit. Mirrored across both a2a3 and a5 trb runtimes.
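As a rough illustration of the bind_ring() / bind_buffers() split, the sketch below uses simplified, hypothetical types (SlotState, Ring, TaskPayload, TaskDescriptor and this prepare_task are not the PR's definitions). The ring id is written once at init, while the two buffer pointers are written at submit time on a cache line that the submit path modifies anyway.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the per-slot structures.
struct TaskPayload { std::uint64_t words[4]; };
struct TaskDescriptor { std::int32_t task_id; };

struct alignas(64) SlotState {
    TaskPayload* payload = nullptr;   // per-slot constant, bound at submit time
    TaskDescriptor* task = nullptr;   // per-slot constant, bound at submit time
    std::int32_t fanin_count = 0;     // state the submit path rewrites anyway
    std::uint8_t ring_id = 0;         // per-ring constant, bound at init time

    void bind_ring(std::uint8_t r) { ring_id = r; }
    void bind_buffers(TaskPayload* p, TaskDescriptor* t) { payload = p; task = t; }
};
static_assert(sizeof(SlotState) == 64, "one slot state per 64B cache line");

struct Ring {
    std::vector<SlotState> slots;
    std::vector<TaskPayload> payloads;
    std::vector<TaskDescriptor> descriptors;

    // Init-time shape: only the ring id is written; buffer pointers stay null.
    Ring(std::uint8_t ring_id, std::size_t window)
        : slots(window), payloads(window), descriptors(window) {
        for (SlotState& s : slots) s.bind_ring(ring_id);
    }
};

// Per-submit shape: the pointer stores land on the line already being dirtied.
inline SlotState& prepare_task(Ring& ring, std::int32_t task_id) {
    const std::size_t slot = static_cast<std::size_t>(task_id) % ring.slots.size();
    SlotState& s = ring.slots[slot];
    s.bind_buffers(&ring.payloads[slot], &ring.descriptors[slot]);
    s.fanin_count = 0;
    s.task->task_id = task_id;
    return s;
}
```

Rebinding the same two pointers on every submit is redundant after a slot's first use, but it costs two stores to an already-dirty line, which is the trade the commit message argues for.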

Previously the AICPU rebuilt the entire trb runtime arena (PTO2Runtime, orchestrator/scheduler/tensor_map sub-regions, sm_handle wrapper, mailbox) on every device boot via runtime_create_from_sm. This commit moves layout + data init onto the host so the AICPU only does a cheap arena-internal pointer wire pass plus the SM reset that can't run off-device. Multi-run boots reuse the pooled prebuilt image with a single rtMemcpy.

Mechanism
- DeviceArena::attach() wraps an externally-owned buffer; re-attach is permitted so each AICPU boot can reuse the pooled image.
- runtime_create_from_sm split into reserve_layout / init_data_from_layout / wire_arena_pointers / finalize_after_wire. orchestrator / scheduler / tensor_map / ready_queue / spsc gain matching data+wire pairs; finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
- pto2_sm_layout helper computes SM field device addresses by pure offset arithmetic so host init never dereferences SM.
- Per-slot SM-side reset (bind_ring + reset_for_reuse + active_mask) moved from RingSchedState::init into PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns it after the split.
- runtime/shared/pto_runtime2_init.cpp: new file holding the host-able pieces lifted out of pto_runtime2.cpp / pto_orchestrator.cpp / pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch stay in place.

Host wiring (runtime_maker.cpp)
- DeviceRunner::setup_static_arena gains a third runtime_arena_size region (hbg passes 0). The prebuilt image lives in the same pooled backing allocation as gm_heap and SM, keeping worker lifetime to one rtMalloc.
- bind_prepared_to_runtime_impl reserves layout on a host arena, sizes the pooled regions, runs init_data + wire, stashes prebuilt metadata into the rt image, rtMemcpys to device, and records base/offset on Runtime so the AICPU boot can find it.

AICPU boot (aicpu_executor.cpp)
- attach the runtime arena to the pooled buffer, take rt from base+off_runtime, wire arena-internal pointers, sm_handle->init (SM reset including the per-slot fields above), mailbox reset, finalize_after_wire (ops table + cluster/aiv counts).

Tests
- cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state / task_state / wiring / tensormap UTs migrated to the data+wire API. task_allocator.init grew an optional initial_local_task_id (default 0) so UTs can still exercise task_id near INT32_MAX without reading the SM.
- a2a3sim trb: standalone (dynamic_register variants, L3 group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
- a2a3sim host_build_graph: 9/9 pass (verifies the shared HostApi changes don't break hbg).
- a2a3 hardware: tests/st/.../paged_attention_unroll PASS on device 9 (--build with pto-isa commit pinned to CI).
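The attach step in this commit, combined with the always-on contract checks mentioned in the PR summary, might look roughly like the following. AttachedArena and its require() helper are hypothetical names for illustration only; the real DeviceArena has more state (region table, commit flag) that is not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical non-owning arena view; not the PR's DeviceArena.
class AttachedArena {
public:
    // Wraps an externally-owned buffer. No allocation happens here, and
    // calling attach() again simply rebinds the view to the new buffer.
    void attach(void* buffer, std::size_t size, std::size_t alignment) {
        require(buffer != nullptr, "buffer is null");
        require(alignment != 0 && (alignment & (alignment - 1)) == 0,
                "alignment is not a power of two");
        require(reinterpret_cast<std::uintptr_t>(buffer) % alignment == 0,
                "buffer is not pre-aligned");
        base_ = static_cast<unsigned char*>(buffer);
        size_ = size;
    }

    unsigned char* base() const { return base_; }
    std::size_t size() const { return size_; }

private:
    // Unlike assert(), this check is not compiled out under NDEBUG, so a
    // release build still traps on a contract violation.
    static void require(bool ok, const char* what) {
        if (!ok) {
            std::fprintf(stderr, "AttachedArena contract violation: %s\n", what);
            std::abort();
        }
    }

    unsigned char* base_ = nullptr;  // not owned, never freed by this class
    std::size_t size_ = 0;
};
```

Because the view owns nothing, there is no destructor work and no cost beyond the three checks, which is what lets every boot re-attach to the same pooled buffer.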

gemini-code-assist

Code Review

This pull request implements a prebuilt-arena fast path for the PTO2 runtime, allowing the host to pre-compute the runtime arena image and upload it to the device. This optimization reduces AICPU boot time by replacing full initialization with a simple attachment and pointer "wiring" phase. Key changes include refactoring the initialization logic for the runtime, orchestrator, and scheduler into separate data-population and pointer-wiring stages, extending the DeviceRunner to manage a pooled runtime arena, and adding an attach method to DeviceArena for externally-owned buffers. Review feedback correctly identified potential undefined behavior in the new acquire_pooled_runtime_arena methods when the arena is not provisioned, suggesting defensive checks against SIZE_MAX offsets.

Address review feedback from PR hw-native-sys#846:
- pto2_sm_layout::ring_task_descriptors_addr: take per-ring task_window_sizes[] array (mirroring PTO2SharedMemoryHandle's SM API) and assert ring_id range, so a future per-ring SM layout cannot silently disagree with the addresses the host bakes into the prebuilt image.
- DeviceRunner::acquire_pooled_runtime_arena (onboard + sim): return nullptr when runtime_arena_region_off_ == SIZE_MAX so a stray hbg-path call cannot resolve to base + SIZE_MAX. Failure is now loud and contained at the acquire boundary.
- DeviceArena::attach(): rewrite doc to match real behavior (region table is not repopulated after attach, reserve() asserts !committed_ so cannot replay, region_size() returns 0); promote the pre-alignment / non-null / power-of-two checks from plain assert() to an unconditional abort() so release builds still trap on contract violations.
- PTO2TensorMap: drop the dead `orch` back-pointer field (a2a3 never dereferences it), strip parent_orch parameter from wire_arena_pointers, and remove the now-unused PTO2OrchestratorState forward declaration.
- PTO2RingFlowControl::init(): add a coupling comment so future fc-initial-value or boot-order changes flag PTO2TaskAllocator::init's initial_local_task_id default in the same edit.
- PTO2SchedulerState::init_data_from_layout / RingSchedState::init_data_from_layout: drop the task_window_size / dep_pool_capacity parameters that were never consumed (scheduler only needs SM base + ring index, both window-size-independent; orchestrator counterpart still takes task_window_size for ring_task_descriptors arithmetic). Updated all callsites (pto_runtime2_init.cpp + 4 cpput suites).
- PTO2Runtime::prebuilt_arena_base: removed the dead mirror field. The host Runtime's prebuilt_arena_base_ is the real source of truth (AICPU reads it to locate the pooled buffer *before* dereferencing the image); the PTO2Runtime image still carries prebuilt_layout, which the AICPU does consume.

Tests
- cpput: 25/25 pass.
- a2a3sim trb: dummy_task / dynamic_register / L2 trb suite pass with --build.
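The offset-only address computation with a per-ring window-size array can be sketched as below. The layout itself (a fixed header followed by one descriptor array per ring) and the constants kMaxRings, kHeaderBytes and kDescriptorBytes are assumptions made up for this example; the actual SM layout is not shown in this PR text. The point is that the function only does arithmetic on a device address and never dereferences it.

```cpp
#include <cassert>
#include <cstdint>

namespace sm_layout_sketch {

// Assumed values for illustration only; the real SM layout may differ.
constexpr int kMaxRings = 4;
constexpr std::uint64_t kHeaderBytes = 256;
constexpr std::uint64_t kDescriptorBytes = 64;

// Returns the device address of ring `ring_id`'s descriptor array, assuming
// the arrays are packed back to back after the header. The address is a
// plain integer: the host can compute it without ever touching SM.
inline std::uint64_t ring_task_descriptors_addr(
    std::uint64_t sm_dev_base,
    const std::uint32_t (&task_window_sizes)[kMaxRings],
    int ring_id) {
    assert(ring_id >= 0 && ring_id < kMaxRings);
    std::uint64_t offset = kHeaderBytes;
    for (int r = 0; r < ring_id; ++r) {
        offset += static_cast<std::uint64_t>(task_window_sizes[r]) * kDescriptorBytes;
    }
    return sm_dev_base + offset;
}

}  // namespace sm_layout_sketch
```

Taking the whole per-ring array rather than a single window size means that rings with different window sizes shift each other's addresses correctly, which is the disagreement the review item is guarding against.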

Sync of PR hw-native-sys#846 commit 2/3 to a5; commit 1 (slot_state.bind split) was already mirrored. Brings the a5 trb runtime up to the same host-build arena fast path as a2a3.
- 4-phase API (reserve_layout / init_data_from_layout / wire_arena_pointers / finalize_after_wire) replaces runtime_create_from_sm.
- New runtime/shared/pto_runtime2_init.cpp (~355 lines) and shared/pto_tensormap.cpp (the old runtime/pto_tensormap.cpp moved + split) hold the host-pluggable cold-path lifted from pto_runtime2.cpp / pto_orchestrator.cpp / scheduler/pto_scheduler.cpp.
- AICPU boot becomes attach + wire + sm_handle->init + finalize.
- runtime_maker.cpp pre-builds the arena image on host and rtMemcpys it into a pooled runtime-arena region; onboard + sim DeviceRunner setup_static_arena grow a third runtime_arena_size argument with matching acquire_pooled_runtime_arena (hbg path passes 0).

a5-specific divergences kept: enable_l2_swimlane (bool) instead of L2PerfLevel, no dep_gen subsystem, wait_init_complete naming, alignas(64) PTO2SpscQueue queue, cache_invalidate_range + cond.retire in async_wait, RUNTIME_MAX_WORKER 108.

Tests
- cpput: 25/25 pass.
- a5sim: trb 21/21 + host_build_graph 6/6 pass.
- a2a3sim regression: trb 29/29 + host_build_graph 9/9 pass.

…tions

DeviceRunner's GM heap / PTO2 SM / trb prebuilt runtime arena used to live in a single backing device buffer (one rtMalloc per worker, three regions sub-divided via DeviceArena::reserve). The combined size can exceed the device allocator's largest contiguous block on real hardware, so split into three independent DeviceArena instances: each commits exactly one region (one device_malloc), and acquire_pooled_* returns its base().

Touches all four DeviceRunner implementations (a2a3/a5 × onboard/sim). The setup_static_arena and acquire_pooled_* signatures are unchanged; the host_api / runtime_maker callers are unaffected. hbg keeps passing runtime_arena_size = 0, which leaves runtime_arena_pool_ uncommitted and acquire_pooled_runtime_arena returning nullptr.

Tests
- cpput: 25/25 pass.
- a5sim: L2 trb + host_build_graph full suite pass.
- a2a3sim: L2 trb + host_build_graph full suite pass.
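A minimal sketch of the three-independent-arenas arrangement follows, with std::malloc standing in for device_malloc. PooledRegion, RunnerSketch, acquire_pooled_gm_heap and acquire_pooled_gm_sm are hypothetical names; only setup_static_arena and acquire_pooled_runtime_arena appear in the PR text, and their real signatures may differ from what is shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical single-region pool: one allocation, owned for the pool's lifetime.
class PooledRegion {
public:
    PooledRegion() = default;
    PooledRegion(const PooledRegion&) = delete;
    PooledRegion& operator=(const PooledRegion&) = delete;
    ~PooledRegion() { std::free(base_); }

    // Size 0 means "leave this region unreserved" and is not an error.
    bool commit(std::size_t size) {
        if (base_ != nullptr) return false;  // already committed
        if (size == 0) return true;
        base_ = std::malloc(size);           // stand-in for device_malloc
        return base_ != nullptr;
    }

    void* base() const { return base_; }     // nullptr while unreserved

private:
    void* base_ = nullptr;
};

// Hypothetical runner owning three independent pools instead of one block.
class RunnerSketch {
public:
    bool setup_static_arena(std::size_t gm_heap_size, std::size_t gm_sm_size,
                            std::size_t runtime_arena_size) {
        return gm_heap_.commit(gm_heap_size) && gm_sm_.commit(gm_sm_size) &&
               runtime_.commit(runtime_arena_size);
    }

    void* acquire_pooled_gm_heap() const { return gm_heap_.base(); }
    void* acquire_pooled_gm_sm() const { return gm_sm_.base(); }
    // Returns nullptr when the caller passed runtime_arena_size == 0, so a
    // stray call fails at the acquire boundary instead of yielding a bad address.
    void* acquire_pooled_runtime_arena() const { return runtime_.base(); }

private:
    PooledRegion gm_heap_;
    PooledRegion gm_sm_;
    PooledRegion runtime_;
};
```

With separate allocations, each request only needs a contiguous block as large as its own region, which is the motivation given in the commit message.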

poursoul added 2 commits May 22, 2026 12:22

gemini-code-assist Bot reviewed May 22, 2026

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp

Comment thread src/a2a3/platform/sim/host/device_runner.cpp

poursoul added 2 commits May 25, 2026 11:27

poursoul changed the title ~~Refactor: host-build trb runtime arena (a2a3 only)~~ Refactor: host-build trb runtime arena May 27, 2026

poursoul wants to merge 5 commits into hw-native-sys:main from poursoul:refactor-defer-slot-state-bind-to-prepare-task
