Refactor: persist AICPU/AICore streams + eager bootstrap at simpler_init #864

Open
hw-native-sys-bot wants to merge 1 commit into hw-native-sys:main from hw-native-sys-bot:refactor/persistent-streams-and-eager-bootstrap

@hw-native-sys-bot hw-native-sys-bot commented May 27, 2026

Summary

Two related lifecycle changes on onboard DeviceRunner (a2a3 + a5) that both move per-run work to one-shot init/finalize.

1. Streams now live for the DeviceRunner's lifetime

  • rtStreamCreate / rtStreamDestroy ran on every prepare_callable and run_prepared call (4 rtStream* calls per launch, ~ms each). The stream check inside prepare_run_context already short-circuits when the streams exist, so once streams persist the per-run create/destroy is strictly redundant.
  • release_run_context becomes a no-op; finalize gains the matching rtStreamDestroy pair. simpler_init triggers stream creation eagerly via ensure_device_initialized.
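
The lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `fake_stream_create` / `fake_stream_destroy` are hypothetical stand-ins for rtStreamCreate / rtStreamDestroy (whose real signatures are not shown here), and `RunnerSketch` and `demo_persistent_streams` are made-up names.

```cpp
#include <cassert>

namespace stream_sketch {

// Hypothetical stand-ins for the runtime's stream calls, so the sketch
// compiles on its own. They only count how often they are invoked.
using FakeStream = void *;
static int g_creates = 0;
static int g_destroys = 0;

static int fake_stream_create(FakeStream *out) {
    *out = new int(0);
    ++g_creates;
    return 0;
}

static int fake_stream_destroy(FakeStream s) {
    delete static_cast<int *>(s);
    ++g_destroys;
    return 0;
}

class RunnerSketch {
public:
    // Idempotent: each stream is created on the first call only.
    int ensure_device_initialized() {
        if (stream_aicpu_ == nullptr) {
            int rc = fake_stream_create(&stream_aicpu_);
            if (rc != 0) return rc;
        }
        if (stream_aicore_ == nullptr) {
            int rc = fake_stream_create(&stream_aicore_);
            if (rc != 0) return rc;
        }
        return 0;
    }

    // The per-run stream check short-circuits because the streams exist.
    int prepare_run_context() { return ensure_device_initialized(); }

    // No per-run teardown any more: the streams outlive the run.
    void release_run_context() {}

    // The matching destroy pair runs once, at finalize.
    void finalize() {
        if (stream_aicpu_ != nullptr) {
            fake_stream_destroy(stream_aicpu_);
            stream_aicpu_ = nullptr;
        }
        if (stream_aicore_ != nullptr) {
            fake_stream_destroy(stream_aicore_);
            stream_aicore_ = nullptr;
        }
    }

private:
    FakeStream stream_aicpu_ = nullptr;
    FakeStream stream_aicore_ = nullptr;
};

struct Counts {
    int creates;
    int destroys;
};

// Runs `runs` prepare/release cycles and reports how many stream
// create/destroy calls were made in total.
Counts demo_persistent_streams(int runs) {
    g_creates = 0;
    g_destroys = 0;
    RunnerSketch runner;
    int rc = runner.ensure_device_initialized();  // what simpler_init triggers
    assert(rc == 0);
    (void)rc;
    for (int i = 0; i < runs; ++i) {
        runner.prepare_run_context();
        runner.release_run_context();
    }
    runner.finalize();
    return {g_creates, g_destroys};
}

}  // namespace stream_sketch
```

Under these assumptions, calling `stream_sketch::demo_persistent_streams(100)` from a `main` performs two creates and two destroys in total, regardless of the number of runs.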

2. Bootstrap moves from "first run()" to simpler_init

  • BootstrapDispatcher + LoadAicpuOp::Init + AicpuSoInfo H2D + init_device_args now run at init time.
  • The previous laziness was a stream-lifecycle side effect: bootstrap needs a stream, streams were per-run, so bootstrap had to wait for the first run(). With persistent streams that constraint is gone.
  • ensure_device_initialized moved to the public section on DeviceRunner so simpler_init can call it directly after executor bytes are cached.
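
The ordering change can be sketched as below. This is an illustration under stated assumptions rather than the actual implementation: `RunnerSketch`, `cache_executor_bytes`, and `simpler_init_sketch` are hypothetical names, the `"bootstrap"` log entry stands in for the whole BootstrapDispatcher + LoadAicpuOp::Init + AicpuSoInfo H2D + init_device_args sequence, and the failure when executor bytes are missing is an assumed precondition inferred from the description.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace bootstrap_sketch {

class RunnerSketch {
public:
    // Records the order in which lifecycle steps happen.
    std::vector<std::string> log;

    void cache_executor_bytes() {
        executor_cached_ = true;
        log.push_back("cache_executor_bytes");
    }

    // Public so the init entry point can call it directly.
    // Creates streams and bootstraps exactly once.
    int ensure_device_initialized() {
        if (initialized_) return 0;
        // Assumption: bootstrap needs the cached executor bytes.
        if (!executor_cached_) return -1;
        log.push_back("create_streams");
        log.push_back("bootstrap");
        initialized_ = true;
        return 0;
    }

    int run() {
        // After eager init this call returns immediately.
        int rc = ensure_device_initialized();
        if (rc != 0) return rc;
        log.push_back("run");
        return 0;
    }

private:
    bool executor_cached_ = false;
    bool initialized_ = false;
};

// Eager path: bootstrap happens here instead of inside the first run().
int simpler_init_sketch(RunnerSketch &runner) {
    runner.cache_executor_bytes();
    return runner.ensure_device_initialized();
}

}  // namespace bootstrap_sketch
```

In this sketch the first `run()` no longer pays the bootstrap cost, because the log already contains `create_streams` and `bootstrap` by the time `simpler_init_sketch` returns.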

ABI and Python surface unchanged. Sim platforms untouched (no streams or bootstrap there).

Hardware validation (Ascend910, device 3)

  • aicore_op_timeout (a2a3, a5): PASS
  • paged_attention_unroll (a2a3 — HANDOFF's canary): PASS
  • vector_add, hello_worker, paged_attention_manual_scope: PASS
  • a2a3sim hello_worker (sanity for unchanged sim path): PASS

Benchmark (tensormap_and_ringbuffer, device 3, 100 rounds)

  • Total / Sched / Orch: within ±1% (device-side wall is untouched, as expected).
  • Round 0 (cold start) Host: 7 of 9 examples improved 5%–50%; max -424 ms on paged_attention_unroll C1, consistent with BootstrapDispatcher being the ~200-500 ms first-run cost. The remaining two examples were within per-example noise.
  • Steady-state Host (round 50+ mean, summed across 9 examples): -12% aggregate, but per-example deltas are noise-dominated; the small per-run rtStream* saving (~0.5-1 ms per run) sits well below the host noise floor and only the aggregate is interpretable.

Test plan

  • Onboard a2a3 ST passes locally
  • Onboard a5 ST passes locally
  • Sim a2a3 ST passes locally
  • Benchmark shows expected cold-start improvement on Round 0
  • CI green

@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/persistent-streams-and-eager-bootstrap branch from 69497cc to 5efbda2 Compare May 27, 2026 04:58
@ChaoWao ChaoWao marked this pull request as draft May 27, 2026 06:01
@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/persistent-streams-and-eager-bootstrap branch from 5efbda2 to 5b1a058 Compare May 27, 2026 07:39
@ChaoWao ChaoWao marked this pull request as ready for review May 27, 2026 07:39
coderabbitai Bot commented May 27, 2026

Walkthrough

The PR refactors stream resource management across a2a3 and a5 device runners. AICPU and AICore streams transition from per-run teardown (release_run_context()) to runner-lifetime persistence, with destruction moved to finalize(). The public ensure_device_initialized() method is now called eagerly during simpler_init() to create streams once rather than on first run.

Changes

Stream Lifecycle Refactor: a2a3 and a5 Device Runners

| Layer / File(s) | Summary |
| --- | --- |
| a2a3 DeviceRunner stream lifecycle: src/a2a3/platform/onboard/host/device_runner.h, src/a2a3/platform/onboard/host/device_runner.cpp | ensure_device_initialized() is moved to public; release_run_context() becomes a no-op with updated documentation; finalize() explicitly destroys persistent AICPU and AICore streams before other cleanup. |
| a2a3 simpler_init eager stream initialization: src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp | simpler_init() now calls ensure_device_initialized() after transferring executor and dispatcher binaries, triggering stream creation and dispatcher bootstrap during initialization. |
| a5 DeviceRunner stream lifecycle: src/a5/platform/onboard/host/device_runner.h, src/a5/platform/onboard/host/device_runner.cpp | Mirrors a2a3 changes: ensure_device_initialized() becomes public; release_run_context() is documented as a no-op; finalize() destroys persistent streams. |
| a5 simpler_init eager stream initialization: src/a5/platform/onboard/host/pto_runtime_c_api.cpp | Mirrors a2a3 pattern: simpler_init() eagerly calls ensure_device_initialized() after binary setup to initialize streams during runtime startup. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main changes: persisting AICPU/AICore streams and moving bootstrap to simpler_init, matching the primary refactoring objectives. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description is directly related to the changeset, explaining the motivation for stream lifecycle refactoring and bootstrap timing changes across multiple files. |

@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/persistent-streams-and-eager-bootstrap branch from 5b1a058 to 15c23f2 Compare May 27, 2026 07:45

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/platform/onboard/host/device_runner.cpp`:
- Around line 1168-1176: DeviceRunner::finalize() currently calls
rtStreamDestroy on stream_aicpu_ and stream_aicore_ without first
draining/synchronizing outstanding work and without checking the rtStreamDestroy
return values; update finalize() to, for each non-null stream (stream_aicpu_ and
stream_aicore_), call the appropriate synchronization API (e.g.,
aclrtSynchronizeStreamWithTimeout or rtStreamSynchronize) to wait for queued
work to finish, handle and log any sync errors, then call rtStreamDestroy and
check its return code, logging/handling failures and only setting the stream
pointer to nullptr on successful destroy to avoid silent teardown of in-flight
work.
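
One way the suggested teardown order could look is sketched below. It is a hedged illustration only: `FakeRuntime` and its `stream_synchronize` / `stream_destroy` members are hypothetical stand-ins for the real synchronization and rtStreamDestroy calls, `drain_and_destroy` is a made-up helper name, and the choice to still attempt the destroy after a failed synchronize is an assumption about how the errors should be handled.

```cpp
#include <cassert>
#include <cstdio>

namespace finalize_sketch {

using FakeStream = void *;

// Hypothetical runtime with injectable return codes, so both the success
// and the failure paths can be exercised without hardware.
struct FakeRuntime {
    int sync_rc = 0;
    int destroy_rc = 0;
    int syncs = 0;
    int destroys = 0;

    int stream_synchronize(FakeStream) {
        ++syncs;
        return sync_rc;
    }

    int stream_destroy(FakeStream) {
        if (destroy_rc == 0) ++destroys;
        return destroy_rc;
    }
};

// Drains the stream, then destroys it. Returns 0 on full success, otherwise
// the first non-zero return code. The pointer is cleared only when the
// destroy itself succeeded.
int drain_and_destroy(FakeRuntime &rt, FakeStream &stream, const char *name) {
    if (stream == nullptr) return 0;

    int first_error = 0;
    int rc = rt.stream_synchronize(stream);
    if (rc != 0) {
        std::fprintf(stderr, "%s: synchronize failed, rc=%d\n", name, rc);
        first_error = rc;
    }

    rc = rt.stream_destroy(stream);
    if (rc != 0) {
        std::fprintf(stderr, "%s: destroy failed, rc=%d\n", name, rc);
        if (first_error == 0) first_error = rc;
        return first_error;  // keep the pointer: the stream still exists
    }

    stream = nullptr;
    return first_error;
}

}  // namespace finalize_sketch
```

A finalize routine following this pattern would call the helper once for the AICPU stream and once for the AICore stream before the rest of its cleanup.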
📥 Commits

Reviewing files that changed from the base of the PR and between f8b7285 and 5b1a058.

📒 Files selected for processing (6)
  • src/a2a3/platform/onboard/host/device_runner.cpp
  • src/a2a3/platform/onboard/host/device_runner.h
  • src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp
  • src/a5/platform/onboard/host/device_runner.cpp
  • src/a5/platform/onboard/host/device_runner.h
  • src/a5/platform/onboard/host/pto_runtime_c_api.cpp

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp
@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/persistent-streams-and-eager-bootstrap branch 2 times, most recently from 3f0c10f to 562081d Compare May 27, 2026 08:17
Two related lifecycle changes on onboard DeviceRunner (a2a3 + a5) that
both move per-run work to one-shot init/finalize:

1. Streams now live for the DeviceRunner's lifetime.
   - rtStreamCreate / rtStreamDestroy were happening on every prepare_callable
     and run_prepared call (4 rtStream* per launch, ~ms each). The stream
     check inside prepare_run_context already short-circuits on existing
     streams, so the per-run create/destroy was strictly redundant once
     streams persist.
   - release_run_context becomes a no-op; finalize gains the matching
     rtStreamDestroy pair. simpler_init triggers stream creation eagerly
     via ensure_device_initialized.

2. Bootstrap (BootstrapDispatcher + LoadAicpuOp::Init + AicpuSoInfo H2D +
   init_device_args) moves from "first run() call" to simpler_init.
   - The previous laziness was a stream-lifecycle side effect: bootstrap
     needs a stream, streams were per-run, so bootstrap had to wait for
     the first run. With persistent streams, that constraint is gone.
   - ensure_device_initialized is moved to the public section on
     DeviceRunner so simpler_init can call it directly after the executor
     bytes are cached.

ABI / Python surface unchanged. Sim platforms untouched (no streams or
bootstrap there).

Hardware validation (Ascend910, device 3):
- aicore_op_timeout (a2a3, a5): PASS
- paged_attention_unroll (a2a3 — HANDOFF's canary): PASS
- vector_add, hello_worker, paged_attention_manual_scope: PASS
- a2a3sim hello_worker (sanity for unchanged sim path): PASS

Benchmark (tensormap_and_ringbuffer, device 3, 100 rounds):
- Total / Sched / Orch: ±1% (device-side wall is untouched, as expected)
- Round 0 (cold start) Host: 7 of 9 examples improved 5%–50%, max -424 ms
  on paged_attention_unroll C1 (consistent with BootstrapDispatcher being
  the ~200-500 ms first-run cost). The remaining two examples were within
  per-example noise (±100 ms band on a single round).
- Steady-state Host (round 50+ mean, summed across 9 examples): -12%
  aggregate, but per-example deltas are noise-dominated (single example
  variance ±200 ms); the small per-run rtStream*/no-op-release saving
  (~0.5-1 ms per run) sits well below host noise floor and only the
  aggregate is interpretable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/persistent-streams-and-eager-bootstrap branch from 562081d to ca4a884 Compare May 27, 2026 09:00