Refactor: host-build trb runtime arena #846
Move the per-slot payload/task pointer assignments out of the RingSchedState::init() O(task_window_size) loop and into orch::prepare_task. Their value is per-slot constant (&task_payloads[slot] / &task_descriptors[slot]) but writing them at submit time, on the same 64B slot_state cache line prepare_task is already dirtying, is essentially free — while removing the only "scale-dependent" pointer assignments from the init path. ring_id stays in init (its value is per-ring constant, so rewriting it each submit would only add noise without removing a loop). Split PTO2TaskSlotState::bind() into bind_ring() (init-time) and bind_buffers() (per-submit) to make the two call-site shapes explicit. Mirrored across both a2a3 and a5 trb runtimes.
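The split between init-time and per-submit binding can be sketched roughly as follows. This is a minimal illustration, not the runtime's code: SlotState, Payload, Descriptor, and Ring are simplified stand-ins invented here, and only the bind_ring / bind_buffers / prepare_task names come from the commit message.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the runtime's payload and descriptor types.
struct Payload { int data = 0; };
struct Descriptor { int kind = 0; };

// Simplified slot state (the real one occupies a 64B cache line).
struct SlotState {
    int32_t ring_id = -1;
    Payload* payload = nullptr;
    Descriptor* task = nullptr;
    int32_t fanin_count = 0;

    // Init-time: the ring id is a per-ring constant, written once per slot.
    void bind_ring(int32_t ring) { ring_id = ring; }

    // Per-submit: per-slot constant pointers, rewritten on every submit.
    void bind_buffers(Payload* p, Descriptor* d) { payload = p; task = d; }
};

struct Ring {
    std::vector<SlotState> slots;
    std::vector<Payload> task_payloads;
    std::vector<Descriptor> task_descriptors;

    Ring(int32_t ring_id, std::size_t window)
        : slots(window), task_payloads(window), task_descriptors(window) {
        // Only the ring id is set here; no pointer assignments at init.
        for (auto& s : slots) s.bind_ring(ring_id);
    }

    // Submit path: the slot is already being written, so the two pointer
    // stores ride along with the other per-submit fields.
    SlotState& prepare_task(std::size_t slot, int32_t fanin) {
        SlotState& s = slots[slot];
        s.fanin_count = fanin;
        s.bind_buffers(&task_payloads[slot], &task_descriptors[slot]);
        return s;
    }
};
```

In this sketch a freshly constructed Ring has null payload/task pointers in every slot until prepare_task touches that slot.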
Previously the AICPU rebuilt the entire trb runtime arena (PTO2Runtime, orchestrator/scheduler/tensor_map sub-regions, sm_handle wrapper, mailbox) on every device boot via runtime_create_from_sm. This commit moves layout + data init onto the host so the AICPU only does a cheap arena-internal pointer wire pass plus the SM reset that can't run off-device. Multi-run boots reuse the pooled prebuilt image with a single rtMemcpy.

Mechanism
- DeviceArena::attach() wraps an externally-owned buffer; re-attach is permitted so each AICPU boot can reuse the pooled image.
- runtime_create_from_sm split into reserve_layout / init_data_from_layout / wire_arena_pointers / finalize_after_wire. orchestrator / scheduler / tensor_map / ready_queue / spsc gain matching data+wire pairs; finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
- pto2_sm_layout helper computes SM field device addresses by pure offset arithmetic so host init never dereferences SM.
- Per-slot SM-side reset (bind_ring + reset_for_reuse + active_mask) moved from RingSchedState::init into PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns it after the split.
- runtime/shared/pto_runtime2_init.cpp: new file holding the host-able pieces lifted out of pto_runtime2.cpp / pto_orchestrator.cpp / pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch stay in place.

Host wiring (runtime_maker.cpp)
- DeviceRunner::setup_static_arena gains a third runtime_arena_size region (hbg passes 0). The prebuilt image lives in the same pooled backing allocation as gm_heap and SM, keeping worker lifetime to one rtMalloc.
- bind_prepared_to_runtime_impl reserves layout on a host arena, sizes the pooled regions, runs init_data + wire, stashes prebuilt metadata into the rt image, rtMemcpys to device, and records base/offset on Runtime so the AICPU boot can find it.
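The host-side flow (reserve layout, init data, wire, copy, re-wire at the new base) might look roughly like the sketch below. Everything here is an assumption made for illustration: the Layout and Runtime structs are toy versions, a plain vector copy stands in for rtMemcpy, and the function names drop the runtime_ prefix. It only aims to show why the wire pass can run once per address space over the same image.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <vector>

// Hypothetical layout: byte offsets of each sub-region inside the arena.
struct Layout {
    std::size_t off_runtime = 0;
    std::size_t off_queue = 0;
    std::size_t queue_len = 0;
    std::size_t total = 0;
};

// Hypothetical runtime header living inside the arena image.
struct Runtime {
    Layout prebuilt_layout;   // stashed so the consumer can re-wire
    uint64_t sm_device_addr;  // stored on host, never dereferenced there
    int32_t* queue;           // arena-internal pointer, set by the wire pass
};

inline std::size_t align_up(std::size_t v, std::size_t a) { return (v + a - 1) & ~(a - 1); }

// Phase 1: pure arithmetic, no memory touched.
Layout reserve_layout(std::size_t queue_len) {
    Layout l;
    l.off_runtime = 0;
    l.off_queue = align_up(sizeof(Runtime), 64);
    l.queue_len = queue_len;
    l.total = l.off_queue + queue_len * sizeof(int32_t);
    return l;
}

// Phase 2: fill data; arena-internal pointers are left for phase 3.
void init_data_from_layout(unsigned char* base, const Layout& l, uint64_t sm_addr) {
    std::memset(base, 0, l.total);
    new (base + l.off_runtime) Runtime{l, sm_addr, nullptr};
}

// Phase 3: bind every arena-internal pointer to base + offset.
// Running it again with the same base yields the same result.
Runtime* wire_arena_pointers(unsigned char* base, std::size_t off_runtime) {
    Runtime* rt = reinterpret_cast<Runtime*>(base + off_runtime);
    rt->queue = reinterpret_cast<int32_t*>(base + rt->prebuilt_layout.off_queue);
    return rt;
}

// Host builds and wires the image; the vector copy stands in for rtMemcpy.
// After the copy the pointers still reference the host buffer, so the
// consumer wires once more against its own base.
Runtime* build_and_upload(std::vector<unsigned char>& host, std::vector<unsigned char>& device) {
    Layout l = reserve_layout(8);
    host.assign(l.total, 0);
    init_data_from_layout(host.data(), l, 0x1000);
    Runtime* host_rt = wire_arena_pointers(host.data(), l.off_runtime);
    host_rt->queue[0] = 42;
    device = host;
    return wire_arena_pointers(device.data(), l.off_runtime);
}
```

The design point the sketch tries to capture is that the image carries offsets rather than trusted pointers, so the same bytes are valid at any base once the wire pass has run there.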
AICPU boot (aicpu_executor.cpp)
- attach the runtime arena to the pooled buffer, take rt from base+off_runtime, wire arena-internal pointers, sm_handle->init (SM reset including the per-slot fields above), mailbox reset, finalize_after_wire (ops table + cluster/aiv counts).

Tests
- cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state / task_state / wiring / tensormap UTs migrated to the data+wire API. task_allocator.init grew an optional initial_local_task_id (default 0) so UTs can still exercise task_id near INT32_MAX without reading the SM.
- a2a3sim trb: standalone (dynamic_register variants, L3 group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
- a2a3sim host_build_graph: 9/9 pass (verifies the shared HostApi changes don't break hbg).
- a2a3 hardware: tests/st/.../paged_attention_unroll PASS on device 9 (--build with pto-isa commit pinned to CI).
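The attach step at the start of the AICPU boot sequence could be approximated as below. The Arena class, its at() helper, and the require() checker are hypothetical names for this sketch; the real DeviceArena interface is not shown in this thread. The checks abort unconditionally, since a plain assert() is compiled out under NDEBUG.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical, simplified arena that wraps a buffer it does not own.
class Arena {
public:
    // Wrap an externally-owned buffer. No allocation happens here, and
    // calling attach() again simply rebinds to the new buffer.
    void attach(void* base, std::size_t size, std::size_t alignment) {
        require(base != nullptr, "null base");
        require(alignment != 0 && (alignment & (alignment - 1)) == 0,
                "alignment is not a power of two");
        require(reinterpret_cast<std::uintptr_t>(base) % alignment == 0,
                "base is not pre-aligned");
        base_ = static_cast<unsigned char*>(base);
        size_ = size;
    }

    unsigned char* base() const { return base_; }
    std::size_t size() const { return size_; }

    // Resolve an offset recorded by whoever built the image.
    void* at(std::size_t offset) const {
        require(base_ != nullptr && offset < size_, "offset out of range");
        return base_ + offset;
    }

private:
    // Unconditional trap: unlike assert(), this survives release builds.
    static void require(bool ok, const char* what) {
        if (!ok) {
            std::fprintf(stderr, "Arena contract violation: %s\n", what);
            std::abort();
        }
    }

    unsigned char* base_ = nullptr;
    std::size_t size_ = 0;
};
```

A boot path built on this would attach to the pooled buffer, resolve the runtime header with at(off_runtime), and then run the wire pass.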
Code Review
This pull request implements a prebuilt-arena fast path for the PTO2 runtime, allowing the host to pre-compute the runtime arena image and upload it to the device. This optimization reduces AICPU boot time by replacing full initialization with a simple attachment and pointer "wiring" phase. Key changes include refactoring the initialization logic for the runtime, orchestrator, and scheduler into separate data-population and pointer-wiring stages, extending the DeviceRunner to manage a pooled runtime arena, and adding an attach method to DeviceArena for externally-owned buffers. Review feedback correctly identified potential undefined behavior in the new acquire_pooled_runtime_arena methods when the arena is not provisioned, suggesting defensive checks against SIZE_MAX offsets.
Address review feedback from PR hw-native-sys#846:
- pto2_sm_layout::ring_task_descriptors_addr: take per-ring task_window_sizes[] array (mirroring PTO2SharedMemoryHandle's SM API) and assert ring_id range, so a future per-ring SM layout cannot silently disagree with the addresses the host bakes into the prebuilt image.
- DeviceRunner::acquire_pooled_runtime_arena (onboard + sim): return nullptr when runtime_arena_region_off_ == SIZE_MAX so a stray hbg-path call cannot resolve to base + SIZE_MAX. Failure is now loud and contained at the acquire boundary.
- DeviceArena::attach(): rewrite doc to match real behavior (region table is not repopulated after attach, reserve() asserts !committed_ so cannot replay, region_size() returns 0); promote the pre-alignment / non-null / power-of-two checks from plain assert() to an unconditional abort() so release builds still trap on contract violations.
- PTO2TensorMap: drop the dead `orch` back-pointer field (a2a3 never dereferences it), strip parent_orch parameter from wire_arena_pointers, and remove the now-unused PTO2OrchestratorState forward declaration.
- PTO2RingFlowControl::init(): add a coupling comment so future fc-initial-value or boot-order changes flag PTO2TaskAllocator::init's initial_local_task_id default in the same edit.
- PTO2SchedulerState::init_data_from_layout / RingSchedState::init_data_from_layout: drop the task_window_size / dep_pool_capacity parameters that were never consumed (scheduler only needs SM base + ring index, both window-size-independent; orchestrator counterpart still takes task_window_size for ring_task_descriptors arithmetic). Updated all callsites (pto_runtime2_init.cpp + 4 cpput suites).
- PTO2Runtime::prebuilt_arena_base: removed the dead mirror field. The host Runtime's prebuilt_arena_base_ is the real source of truth (AICPU reads it to locate the pooled buffer *before* dereferencing the image); the PTO2Runtime image still carries prebuilt_layout, which the AICPU does consume.
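As an illustration of the first item, address computation by pure offset arithmetic might look like the sketch below. The layout it assumes (a fixed-size header followed by per-ring descriptor arrays laid out back to back) and the constants kMaxRings, kHeaderBytes, and kDescriptorBytes are invented for this example and are not the real SM layout; only the ring_task_descriptors_addr name and the per-ring task_window_sizes[] parameter come from the commit message.

```cpp
#include <cassert>
#include <cstdint>

namespace sm_layout_sketch {

constexpr int kMaxRings = 4;               // assumption for this sketch
constexpr uint64_t kHeaderBytes = 256;     // assumption for this sketch
constexpr uint64_t kDescriptorBytes = 64;  // assumption for this sketch

// Device address of ring `ring_id`'s descriptor array, derived from the SM
// base address alone. Nothing here reads device memory, so it can run on
// the host while building the prebuilt image.
inline uint64_t ring_task_descriptors_addr(uint64_t sm_base,
                                           const uint64_t task_window_sizes[kMaxRings],
                                           int ring_id) {
    assert(ring_id >= 0 && ring_id < kMaxRings);
    uint64_t addr = sm_base + kHeaderBytes;
    // Skip the descriptor arrays of all preceding rings, each sized by its
    // own window, so rings with different window sizes stay consistent.
    for (int r = 0; r < ring_id; ++r) {
        addr += task_window_sizes[r] * kDescriptorBytes;
    }
    return addr;
}

}  // namespace sm_layout_sketch
```

Taking the whole per-ring array rather than a single window size is what keeps this helper correct if rings ever get different window sizes.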
cpput: 25/25 pass. a2a3sim trb: dummy_task / dynamic_register / L2 trb suite pass with --build.
Sync of PR hw-native-sys#846 commit 2/3 to a5. Commit 1 (slot_state.bind split) was already mirrored. Brings the a5 trb runtime up to the same host-build arena fast path as a2a3.
- 4-phase API (reserve_layout / init_data_from_layout / wire_arena_pointers / finalize_after_wire) replaces runtime_create_from_sm.
- New runtime/shared/pto_runtime2_init.cpp (~355 lines) and shared/pto_tensormap.cpp (the old runtime/pto_tensormap.cpp moved + split) hold the host-pluggable cold-path lifted from pto_runtime2.cpp / pto_orchestrator.cpp / scheduler/pto_scheduler.cpp.
- AICPU boot becomes attach + wire + sm_handle->init + finalize.
- runtime_maker.cpp pre-builds the arena image on host and rtMemcpys it into a pooled runtime-arena region; onboard + sim DeviceRunner setup_static_arena grow a third runtime_arena_size argument with matching acquire_pooled_runtime_arena (hbg path passes 0).

a5-specific divergences kept: enable_l2_swimlane (bool) instead of L2PerfLevel, no dep_gen subsystem, wait_init_complete naming, alignas(64) PTO2SpscQueue queue, cache_invalidate_range + cond.retire in async_wait, RUNTIME_MAX_WORKER 108.

Tests
- cpput: 25/25 pass.
- a5sim: trb 21/21 + host_build_graph 6/6 pass.
- a2a3sim regression: trb 29/29 + host_build_graph 9/9 pass.
…tions

DeviceRunner's GM heap / PTO2 SM / trb prebuilt runtime arena used to live in a single backing device buffer (one rtMalloc per worker, three regions sub-divided via DeviceArena::reserve). The combined size can exceed the device allocator's largest contiguous block on real hardware, so split into three independent DeviceArena instances: each commits exactly one region (one device_malloc), and acquire_pooled_* returns its base().

Touches all four DeviceRunner implementations (a2a3/a5 × onboard/sim). The setup_static_arena and acquire_pooled_* signatures are unchanged; the host_api / runtime_maker callers are unaffected. hbg keeps passing runtime_arena_size = 0, which leaves runtime_arena_pool_ uncommitted and acquire_pooled_runtime_arena returning nullptr.

Tests
- cpput: 25/25 pass.
- a5sim: L2 trb + host_build_graph full suite pass.
- a2a3sim: L2 trb + host_build_graph full suite pass.
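The three-pool arrangement might be sketched as follows. RegionPool and Runner are hypothetical simplifications, std::malloc stands in for the device allocator, and the acquire_pooled_gm_heap / acquire_pooled_sm accessor names are guesses at what acquire_pooled_* expands to. The point is only that a zero-sized region stays uncommitted and its accessor yields nullptr rather than a bogus address.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical single-region pool: one allocation, or none when size is 0.
class RegionPool {
public:
    RegionPool() = default;
    RegionPool(const RegionPool&) = delete;
    RegionPool& operator=(const RegionPool&) = delete;
    ~RegionPool() { std::free(base_); }

    // std::malloc stands in for the device allocator in this sketch.
    bool commit(std::size_t size) {
        if (base_ != nullptr) return false;  // already committed
        if (size == 0) return true;          // leave the pool uncommitted
        base_ = std::malloc(size);
        return base_ != nullptr;
    }

    // nullptr means "never provisioned"; callers can test for it directly.
    void* base() const { return base_; }

private:
    void* base_ = nullptr;
};

class Runner {
public:
    // Each region is committed independently, so no single allocation has
    // to cover the combined size.
    bool setup_static_arena(std::size_t gm_heap_size, std::size_t gm_sm_size,
                            std::size_t runtime_arena_size) {
        return gm_heap_.commit(gm_heap_size) && gm_sm_.commit(gm_sm_size) &&
               runtime_arena_.commit(runtime_arena_size);
    }

    void* acquire_pooled_gm_heap() const { return gm_heap_.base(); }
    void* acquire_pooled_sm() const { return gm_sm_.base(); }
    void* acquire_pooled_runtime_arena() const { return runtime_arena_.base(); }

private:
    RegionPool gm_heap_, gm_sm_, runtime_arena_;
};
```

With this shape, an hbg-style caller that passes 0 for the third size gets nullptr from acquire_pooled_runtime_arena() and can fail at that boundary.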
Summary
Move the trb runtime arena's layout + data initialization from AICPU boot
onto host, so each AICPU launch reduces to a cheap pointer-fixup pass plus
the SM reset that can't run off-device. The pooled prebuilt image lives in
a per-Worker DeviceRunner pool and is reused across runs via a single
rtMemcpy — multi-launch boot cost drops from O(task_window_size) per
worker to a constant.
Two related cleanups ride along:
- RingSchedState::init's O(task_window_size) slot-bind loop is lifted into per-submit orch::prepare_task, making startup independent of window size. The two extra stores hit the same 64B cache line that prepare_task already dirties, so the per-submit cost is essentially free.
- Per-slot SM-side reset (bind_ring + reset_for_reuse + fanin_count / active_mask zero) consolidated into PTO2SharedMemoryHandle::init_header_per_ring so the host-build path doesn't dereference SM.
Covers both a2a3 and a5 trb runtimes (platform layer onboard + sim).
hbg is unaffected by the runtime-arena split: its setup_static_arena(...,0) keeps the third region unreserved.

Mechanism
- runtime_create_from_sm split into four phases that run on either side:
  1. runtime_reserve_layout: pure arithmetic; host computes sub-region offsets on a libc-backed DeviceArena.
  2. runtime_init_data_from_layout: writes standalone fields, memsets arena regions, and stores SM device pointers (only stores, no dereferences).
  3. runtime_wire_arena_pointers: walks every arena-internal pointer field and binds it to arena.base() + offset. Idempotent: host runs once with the host mirror, AICPU runs once after attach with device addresses.
  4. runtime_finalize_after_wire: AICPU-only fixup for s_runtime_ops (device-side file-local global) and the SPMD core counts from the SchedulerContext.
- DeviceArena::attach() wraps an externally-owned buffer with no per-attach allocation; re-attach is permitted so each AICPU boot can reuse the same pooled image. Pre-alignment / non-null / power-of-two checks call std::abort() instead of assert() so release builds still trap on contract violations.
- pto2_sm_layout namespace computes SM device-side field addresses by pure offset arithmetic so host init never reads SM. Takes a per-ring task_window_sizes[] array (mirroring the SM API) and asserts ring_id in range, which structurally prevents the host-built image from silently disagreeing with the SM layout.
- runtime/shared/pto_runtime2_init.cpp holds the host-pluggable cold-path lifted from pto_runtime2.cpp / pto_orchestrator.cpp / scheduler/pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch / business logic stay in their original files.
- DeviceRunner now owns three independent pooled arenas (gm_heap_arena_, gm_sm_arena_, runtime_arena_pool_), one device_malloc each. Split out from a single backing allocation because the combined size can exceed the device allocator's largest contiguous block.
- setup_static_arena(gm_heap_size, gm_sm_size, runtime_arena_size) commits each region independently; acquire_pooled_runtime_arena() returns nullptr when the region is unreserved (hbg's setup_static_arena(...,0) path) so misuse is loud, not undefined.
- bind_prepared_to_runtime_impl (host runtime_maker) does the full reserve_layout → init_data → wire on a host arena, stashes the layout inside the PTO2Runtime image at prebuilt_layout, then rtMemcpys the whole arena into the pooled device region.
- Dead code dropped: PTO2TensorMap::orch back-pointer (never dereferenced), PTO2Runtime::prebuilt_arena_base mirror (host Runtime::prebuilt_arena_base_ is the real source of truth), unused task_window_size / dep_pool_capacity from PTO2SchedulerState::init_data_from_layout and RingSchedState::init_data_from_layout (scheduler only needs SM base + ring index).

Test plan
- cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state / task_state / wiring / tensormap UTs migrated to the data+wire API. task_allocator.init grew an optional initial_local_task_id (default 0) so the near-INT32_MAX corner case is still exercised without reading the SM.
- a2a3 hardware: tests/st/.../paged_attention_unroll passes on device 9 (--build, pto-isa commit pinned to CI).