Refactor: persist AICPU/AICore streams + eager bootstrap at simpler_init by hw-native-sys-bot · Pull Request #864 · hw-native-sys/simpler

hw-native-sys-bot · 2026-05-27T02:46:56Z

Summary

Two related lifecycle changes on onboard DeviceRunner (a2a3 + a5) that both move per-run work to one-shot init/finalize.

1. Streams now live for the DeviceRunner's lifetime

rtStreamCreate / rtStreamDestroy were happening on every prepare_callable and run_prepared call (4 rtStream* calls per launch, ~ms each). The stream check inside prepare_run_context already short-circuits on existing streams, so the per-run create/destroy becomes strictly redundant once streams persist.
release_run_context becomes a no-op; finalize gains the matching rtStreamDestroy pair. simpler_init triggers stream creation eagerly via ensure_device_initialized.
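
The lifetime change above can be sketched as follows. This is a minimal illustration, not the DeviceRunner code: `FakeRuntime`, `FakeStream`, `RunnerSketch`, and `stream_calls_for` are hypothetical names, and the fake runtime only counts calls in place of rtStreamCreate / rtStreamDestroy, whose real signatures are not shown in this PR.

```cpp
#include <cassert>

// Hypothetical stand-in for the runtime stream API: it only counts calls.
using FakeStream = int*;
struct FakeRuntime {
    int creates = 0;
    int destroys = 0;
    FakeStream create() { ++creates; return new int(0); }
    void destroy(FakeStream s) { ++destroys; delete s; }
};

class RunnerSketch {
public:
    explicit RunnerSketch(FakeRuntime& rt) : rt_(rt) {}

    // Idempotent: creates the two streams only if they do not exist yet.
    void ensure_device_initialized() {
        if (stream_aicpu_ == nullptr) stream_aicpu_ = rt_.create();
        if (stream_aicore_ == nullptr) stream_aicore_ = rt_.create();
    }

    // Per-run entry point (assumed shape): the stream check short-circuits
    // because both streams already exist after the eager init.
    void prepare_run_context() { ensure_device_initialized(); }

    // Streams now outlive the run, so there is nothing to release here.
    void release_run_context() {}

    // One-shot teardown: the matching destroy pair lives here.
    void finalize() {
        if (stream_aicpu_ != nullptr) { rt_.destroy(stream_aicpu_); stream_aicpu_ = nullptr; }
        if (stream_aicore_ != nullptr) { rt_.destroy(stream_aicore_); stream_aicore_ = nullptr; }
    }

private:
    FakeRuntime& rt_;
    FakeStream stream_aicpu_ = nullptr;
    FakeStream stream_aicore_ = nullptr;
};

// Runs `runs` launches and returns the total number of create + destroy calls.
int stream_calls_for(int runs) {
    assert(runs >= 0);
    FakeRuntime rt;
    RunnerSketch runner(rt);
    runner.ensure_device_initialized();  // eager, as simpler_init now does
    for (int i = 0; i < runs; ++i) {
        runner.prepare_run_context();
        runner.release_run_context();
    }
    runner.finalize();
    return rt.creates + rt.destroys;
}
```

In this sketch the number of stream calls is independent of the number of launches: two creates at init and two destroys at finalize, instead of four calls per launch.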

2. Bootstrap moves from "first `run()`" to `simpler_init`

BootstrapDispatcher + LoadAicpuOp::Init + AicpuSoInfo H2D + init_device_args now run at init time.
The previous laziness was a stream-lifecycle side effect: bootstrap needs a stream, streams were per-run, so bootstrap had to wait for the first run(). With persistent streams that constraint is gone.
ensure_device_initialized moved to the public section on DeviceRunner so simpler_init can call it directly after executor bytes are cached.
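
A rough sketch of the ordering described above, under stated assumptions: `BootstrapSketch` and `init_sketch` are hypothetical, the bootstrap work is reduced to a counter, and the `bool` return plus the "executor bytes must be cached first" check are illustrative choices rather than the actual DeviceRunner contract.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

class BootstrapSketch {
public:
    // Stand-in for the step in simpler_init that caches executor bytes.
    void cache_executor(std::vector<std::uint8_t> bytes) {
        assert(!bytes.empty());
        executor_ = std::move(bytes);
    }

    // Idempotent guard. Returns false if called before the executor bytes
    // are cached (an assumption made for this sketch).
    bool ensure_device_initialized() {
        if (initialized_) return true;        // already bootstrapped
        if (executor_.empty()) return false;  // precondition not met
        ++bootstrap_count_;                   // placeholder for the one-time bootstrap work
        initialized_ = true;
        return true;
    }

    // A run goes through the same guard, so it never bootstraps a second time.
    bool run() { return ensure_device_initialized(); }

    int bootstrap_count() const { return bootstrap_count_; }

private:
    std::vector<std::uint8_t> executor_;
    bool initialized_ = false;
    int bootstrap_count_ = 0;
};

// Hypothetical init wrapper: cache the executor bytes first, then bootstrap eagerly.
bool init_sketch(BootstrapSketch& runner, std::vector<std::uint8_t> executor) {
    runner.cache_executor(std::move(executor));
    return runner.ensure_device_initialized();
}
```

Because the guard is idempotent, the first run after `init_sketch` finds the bootstrap already done, which is where the cold-start saving would come from.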

ABI and Python surface unchanged. Sim platforms untouched (no streams or bootstrap there).

Hardware validation (Ascend910, device 3)

aicore_op_timeout (a2a3, a5): PASS
paged_attention_unroll (a2a3 — HANDOFF's canary): PASS
vector_add, hello_worker, paged_attention_manual_scope: PASS
a2a3sim hello_worker (sanity for unchanged sim path): PASS

Benchmark (tensormap_and_ringbuffer, device 3, 100 rounds)

Total / Sched / Orch: within ±1% (device-side wall time is untouched, as expected).
Round 0 (cold start) Host: 7 of 9 examples improved 5%–50%; max -424 ms on paged_attention_unroll C1, consistent with BootstrapDispatcher being the ~200-500 ms first-run cost. The remaining two examples were within per-example noise.
Steady-state Host (round 50+ mean, summed across 9 examples): -12% aggregate, but per-example deltas are noise-dominated; the small per-run rtStream* saving (~0.5-1 ms per run) sits well below the host noise floor and only the aggregate is interpretable.

Test plan

Onboard a2a3 ST passes locally
Onboard a5 ST passes locally
Sim a2a3 ST passes locally
Benchmark shows expected cold-start improvement on Round 0
CI green

gemini-code-assist · 2026-05-27T02:46:59Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

coderabbitai · 2026-05-27T07:40:02Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 07032907-ab56-49cc-a8e7-1c97089edfcb

Walkthrough

The PR refactors stream resource management across a2a3 and a5 device runners. AICPU and AICore streams transition from per-run teardown (release_run_context()) to runner-lifetime persistence, with destruction moved to finalize(). The public ensure_device_initialized() method is now called eagerly during simpler_init() to create streams once rather than on first run.

Changes

Stream Lifecycle Refactor: a2a3 and a5 Device Runners

| Layer / File(s) | Summary |
| --- | --- |
| a2a3 DeviceRunner stream lifecycle `src/a2a3/platform/onboard/host/device_runner.h`, `src/a2a3/platform/onboard/host/device_runner.cpp` | `ensure_device_initialized()` is moved to public; `release_run_context()` becomes a no-op with updated documentation; `finalize()` explicitly destroys persistent AICPU and AICore streams before other cleanup. |
| a2a3 simpler_init eager stream initialization `src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp` | `simpler_init()` now calls `ensure_device_initialized()` after transferring executor and dispatcher binaries, triggering stream creation and dispatcher bootstrap during initialization. |
| a5 DeviceRunner stream lifecycle `src/a5/platform/onboard/host/device_runner.h`, `src/a5/platform/onboard/host/device_runner.cpp` | Mirrors a2a3 changes: `ensure_device_initialized()` becomes public; `release_run_context()` is documented as a no-op; `finalize()` destroys persistent streams. |
| a5 simpler_init eager stream initialization `src/a5/platform/onboard/host/pto_runtime_c_api.cpp` | Mirrors a2a3 pattern: `simpler_init()` eagerly calls `ensure_device_initialized()` after binary setup to initialize streams during runtime startup. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Streams now live long, not die each run,
From birth to death with the runner—one!
Called early, eager, in simpler_init's care,
Then peacefully released when finalize's there. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main changes: persisting AICPU/AICore streams and moving bootstrap to simpler_init, matching the primary refactoring objectives. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description is directly related to the changeset, explaining the motivation for stream lifecycle refactoring and bootstrap timing changes across multiple files. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/platform/onboard/host/device_runner.cpp`:
- Around line 1168-1176: DeviceRunner::finalize() currently calls
rtStreamDestroy on stream_aicpu_ and stream_aicore_ without first
draining/synchronizing outstanding work and without checking the rtStreamDestroy
return values; update finalize() to, for each non-null stream (stream_aicpu_ and
stream_aicore_), call the appropriate synchronization API (e.g.,
aclrtSynchronizeStreamWithTimeout or rtStreamSynchronize) to wait for queued
work to finish, handle and log any sync errors, then call rtStreamDestroy and
check its return code, logging/handling failures and only setting the stream
pointer to nullptr on successful destroy to avoid silent teardown of in-flight
work.
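
One way the suggested teardown order could look, as a sketch only: `FakeStream`, `fake_stream_synchronize`, `fake_stream_destroy`, and `drain_and_destroy` are hypothetical stand-ins that return 0 on success. They are not the rtStreamSynchronize / rtStreamDestroy signatures, and continuing to destroy after a failed sync is one possible policy, not a requirement stated in the review.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stream object with switches to simulate failures.
struct FakeStream {
    bool pending = false;       // queued work not yet finished
    bool destroyed = false;
    bool fail_sync = false;     // make synchronize report an error
    bool fail_destroy = false;  // make destroy report an error
};

// Hypothetical stand-ins for the runtime calls: 0 means success.
int fake_stream_synchronize(FakeStream* s) {
    if (s->fail_sync) return 1;
    s->pending = false;
    return 0;
}
int fake_stream_destroy(FakeStream* s) {
    if (s->fail_destroy) return 2;
    s->destroyed = true;
    return 0;
}

// Drains, then destroys one stream. The pointer is cleared only after a
// successful destroy. Returns the first error seen, or 0.
int drain_and_destroy(FakeStream*& stream, const char* name) {
    assert(name != nullptr);
    if (stream == nullptr) return 0;  // nothing to tear down

    int first_error = 0;
    int rc = fake_stream_synchronize(stream);
    if (rc != 0) {
        std::fprintf(stderr, "sync failed for %s stream: %d\n", name, rc);
        first_error = rc;  // log and keep going so the stream is still released
    }

    rc = fake_stream_destroy(stream);
    if (rc != 0) {
        std::fprintf(stderr, "destroy failed for %s stream: %d\n", name, rc);
        return first_error != 0 ? first_error : rc;  // pointer left intact
    }

    stream = nullptr;
    return first_error;
}
```

A finalize routine built on this helper would call it once for the AICPU stream and once for the AICore stream, then proceed with the remaining cleanup.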

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6c8eb67d-a675-43d3-924d-da312fda7815

📥 Commits

Reviewing files that changed from the base of the PR and between f8b7285 and 5b1a058.

📒 Files selected for processing (6)

src/a2a3/platform/onboard/host/device_runner.cpp
src/a2a3/platform/onboard/host/device_runner.h
src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp
src/a5/platform/onboard/host/device_runner.cpp
src/a5/platform/onboard/host/device_runner.h
src/a5/platform/onboard/host/pto_runtime_c_api.cpp

Two related lifecycle changes on onboard DeviceRunner (a2a3 + a5) that both move per-run work to one-shot init/finalize:

1. Streams now live for the DeviceRunner's lifetime.
   - rtStreamCreate / rtStreamDestroy were happening on every prepare_callable and run_prepared call (4 rtStream* per launch, ~ms each). The stream check inside prepare_run_context already short-circuits on existing streams, so the per-run create/destroy was strictly redundant once streams persist.
   - release_run_context becomes a no-op; finalize gains the matching rtStreamDestroy pair. simpler_init triggers stream creation eagerly via ensure_device_initialized.

2. Bootstrap (BootstrapDispatcher + LoadAicpuOp::Init + AicpuSoInfo H2D + init_device_args) moves from "first run() call" to simpler_init.
   - The previous laziness was a stream-lifecycle side effect: bootstrap needs a stream, streams were per-run, so bootstrap had to wait for the first run. With persistent streams, that constraint is gone.
   - ensure_device_initialized is moved to the public section on DeviceRunner so simpler_init can call it directly after the executor bytes are cached.

ABI / Python surface unchanged. Sim platforms untouched (no streams or bootstrap there).

Hardware validation (Ascend910, device 3):
- aicore_op_timeout (a2a3, a5): PASS
- paged_attention_unroll (a2a3 — HANDOFF's canary): PASS
- vector_add, hello_worker, paged_attention_manual_scope: PASS
- a2a3sim hello_worker (sanity for unchanged sim path): PASS

Benchmark (tensormap_and_ringbuffer, device 3, 100 rounds):
- Total / Sched / Orch: ±1% (device-side wall is untouched, as expected)
- Round 0 (cold start) Host: 7 of 9 examples improved 5%–50%, max -424 ms on paged_attention_unroll C1 (consistent with BootstrapDispatcher being the ~200-500 ms first-run cost). The remaining two examples were within per-example noise (±100 ms band on a single round).
- Steady-state Host (round 50+ mean, summed across 9 examples): -12% aggregate, but per-example deltas are noise-dominated (single example variance ±200 ms); the small per-run rtStream*/no-op-release saving (~0.5-1 ms per run) sits well below host noise floor and only the aggregate is interpretable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hw-native-sys-bot force-pushed the refactor/persistent-streams-and-eager-bootstrap branch from 69497cc to 5efbda2 Compare May 27, 2026 04:58

ChaoWao marked this pull request as draft May 27, 2026 06:01

hw-native-sys-bot force-pushed the refactor/persistent-streams-and-eager-bootstrap branch from 5efbda2 to 5b1a058 Compare May 27, 2026 07:39

ChaoWao marked this pull request as ready for review May 27, 2026 07:39

hw-native-sys-bot force-pushed the refactor/persistent-streams-and-eager-bootstrap branch from 5b1a058 to 15c23f2 Compare May 27, 2026 07:45

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp

hw-native-sys-bot force-pushed the refactor/persistent-streams-and-eager-bootstrap branch 2 times, most recently from 3f0c10f to 562081d Compare May 27, 2026 08:17

hw-native-sys-bot force-pushed the refactor/persistent-streams-and-eager-bootstrap branch from 562081d to ca4a884 Compare May 27, 2026 09:00

hw-native-sys-bot wants to merge 1 commit into hw-native-sys:main from hw-native-sys-bot:refactor/persistent-streams-and-eager-bootstrap

2 participants
