Refactor: persist AICPU/AICore streams + eager bootstrap at simpler_init #864
📝 Walkthrough
The PR refactors stream resource management across the a2a3 and a5 device runners. AICPU and AICore streams transition from per-run teardown (…
Changes: Stream Lifecycle Refactor: a2a3 and a5 Device Runners
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a2a3/platform/onboard/host/device_runner.cpp`:
- Around line 1168-1176: DeviceRunner::finalize() currently calls
rtStreamDestroy on stream_aicpu_ and stream_aicore_ without first
draining/synchronizing outstanding work and without checking the rtStreamDestroy
return values; update finalize() to, for each non-null stream (stream_aicpu_ and
stream_aicore_), call the appropriate synchronization API (e.g.,
aclrtSynchronizeStreamWithTimeout or rtStreamSynchronize) to wait for queued
work to finish, handle and log any sync errors, then call rtStreamDestroy and
check its return code, logging/handling failures and only setting the stream
pointer to nullptr on successful destroy to avoid silent teardown of in-flight
work.
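The drain-then-destroy order requested above can be sketched as follows. This is a minimal illustration, not the PR's code: `FakeRuntime`, `FakeStream`, `kOk`, and `drain_and_destroy` are hypothetical stand-ins so the snippet compiles without the device runtime, and the two `FakeRuntime` methods only mark where the real synchronize and destroy calls would go.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-ins (not the real runtime API) so the sketch is self-contained.
using FakeStream = void*;
constexpr int kOk = 0;

struct FakeRuntime {
    int sync_calls = 0;
    int destroy_calls = 0;
    int destroy_result = kOk;  // set non-zero to simulate a failing destroy

    // Placeholder for the runtime's stream-synchronize call.
    int stream_synchronize(FakeStream) { ++sync_calls; return kOk; }
    // Placeholder for the runtime's stream-destroy call.
    int stream_destroy(FakeStream) { ++destroy_calls; return destroy_result; }
};

// Waits for queued work, then destroys the stream.
// The handle is cleared only when destroy reports success.
// Returns true if the handle is null on exit.
bool drain_and_destroy(FakeRuntime& rt, FakeStream& stream, const char* name) {
    if (stream == nullptr) {
        return true;  // nothing to tear down
    }
    int rc = rt.stream_synchronize(stream);
    if (rc != kOk) {
        // Log and continue: the stream still has to be released.
        std::fprintf(stderr, "synchronize failed on %s stream: %d\n", name, rc);
    }
    rc = rt.stream_destroy(stream);
    if (rc != kOk) {
        std::fprintf(stderr, "destroy failed on %s stream: %d\n", name, rc);
        return false;  // keep the handle so the failure stays visible
    }
    stream = nullptr;
    return true;
}
```

In a `finalize()` shaped like the one under review, the helper would be called once for `stream_aicpu_` and once for `stream_aicore_`. Whether a failed synchronize should still proceed to destroy is a policy choice; the sketch proceeds so the handle is not leaked.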
📒 Files selected for processing (6)
- src/a2a3/platform/onboard/host/device_runner.cpp
- src/a2a3/platform/onboard/host/device_runner.h
- src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp
- src/a5/platform/onboard/host/device_runner.cpp
- src/a5/platform/onboard/host/device_runner.h
- src/a5/platform/onboard/host/pto_runtime_c_api.cpp
Two related lifecycle changes on onboard DeviceRunner (a2a3 + a5) that
both move per-run work to one-shot init/finalize:
1. Streams now live for the DeviceRunner's lifetime.
- rtStreamCreate / rtStreamDestroy were happening on every prepare_callable
and run_prepared call (4 rtStream* per launch, ~ms each). The stream
check inside prepare_run_context already short-circuits on existing
streams, so the per-run create/destroy was strictly redundant once
streams persist.
- release_run_context becomes a no-op; finalize gains the matching
rtStreamDestroy pair. simpler_init triggers stream creation eagerly
via ensure_device_initialized.
2. Bootstrap (BootstrapDispatcher + LoadAicpuOp::Init + AicpuSoInfo H2D +
init_device_args) moves from "first run() call" to simpler_init.
- The previous laziness was a stream-lifecycle side effect: bootstrap
needs a stream, streams were per-run, so bootstrap had to wait for
the first run. With persistent streams, that constraint is gone.
- ensure_device_initialized is moved to the public section on
DeviceRunner so simpler_init can call it directly after the executor
bytes are cached.
ABI / Python surface unchanged. Sim platforms untouched (no streams or
bootstrap there).
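The lifecycle described in the two points above can be pictured with a small model. This is a sketch under stated assumptions, not the actual `DeviceRunner`: `CountingRuntime` and `RunnerSketch` are hypothetical, the method names are borrowed from the description only for readability, and the bodies just count calls instead of touching a device.

```cpp
#include <cassert>

// Hypothetical counter standing in for the device runtime.
struct CountingRuntime {
    int stream_creates = 0;
    int stream_destroys = 0;
    int bootstraps = 0;
};

// Toy model of a runner whose streams live from init to finalize.
class RunnerSketch {
public:
    explicit RunnerSketch(CountingRuntime& rt) : rt_(rt) {}

    // Called eagerly at init time; safe to call more than once.
    void ensure_device_initialized() {
        prepare_run_context();
        if (!bootstrapped_) {
            ++rt_.bootstraps;  // stands in for the one-shot bootstrap work
            bootstrapped_ = true;
        }
    }

    // One launch: no stream is created or destroyed on this path.
    void run() {
        ensure_device_initialized();
        release_run_context();
    }

    // Matching teardown for the two persistent streams.
    void finalize() {
        if (streams_live_) {
            rt_.stream_destroys += 2;  // AICPU + AICore
            streams_live_ = false;
        }
    }

private:
    // Short-circuits when the streams already exist.
    void prepare_run_context() {
        if (streams_live_) {
            return;
        }
        rt_.stream_creates += 2;  // AICPU + AICore
        streams_live_ = true;
    }

    // Per-run release is a no-op once streams persist.
    void release_run_context() {}

    CountingRuntime& rt_;
    bool streams_live_ = false;
    bool bootstrapped_ = false;
};
```

In this model, one init followed by any number of `run()` calls creates two streams and bootstraps once, and `finalize()` destroys the same two streams.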
Hardware validation (Ascend910, device 3):
- aicore_op_timeout (a2a3, a5): PASS
- paged_attention_unroll (a2a3 — HANDOFF's canary): PASS
- vector_add, hello_worker, paged_attention_manual_scope: PASS
- a2a3sim hello_worker (sanity for unchanged sim path): PASS
Benchmark (tensormap_and_ringbuffer, device 3, 100 rounds):
- Total / Sched / Orch: ±1% (device-side wall is untouched, as expected)
- Round 0 (cold start) Host: 7 of 9 examples improved 5%–50%, max -424 ms
on paged_attention_unroll C1 (consistent with BootstrapDispatcher being
the ~200-500 ms first-run cost). The remaining two examples were within
per-example noise (±100 ms band on a single round).
- Steady-state Host (round 50+ mean, summed across 9 examples): -12%
aggregate, but per-example deltas are noise-dominated (single-example
variation is about ±200 ms). The small per-run saving from dropping the
rtStream* calls and the now no-op release (~0.5-1 ms per run) sits well
below the host noise floor, so only the aggregate is interpretable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
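A rough back-of-the-envelope check of the steady-state claim in the commit message above. The assumptions are not from the source: the ±200 ms figure is treated as one standard deviation σ of a single example's per-round host time, rounds are treated as independent, and the steady-state window is taken as n = 50 rounds (rounds 50 to 99 of 100). δ is the expected per-run saving, taken at its upper bound of 1 ms.

```latex
\begin{align*}
\mathrm{SE}_{\text{mean}} &= \frac{\sigma}{\sqrt{n}} = \frac{200\ \text{ms}}{\sqrt{50}} \approx 28\ \text{ms} \\
\mathrm{SE}_{\text{delta}} &= \sqrt{2}\,\mathrm{SE}_{\text{mean}} \approx 40\ \text{ms} \\
\frac{\mathrm{SE}_{\text{delta}}}{\delta} &\approx \frac{40\ \text{ms}}{1\ \text{ms}} = 40
\end{align*}
```

Under these assumptions the uncertainty on a before/after difference for one example is about 40 times the expected saving (about 80 times at δ = 0.5 ms), which is consistent with per-example steady-state deltas being unreadable.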
Summary
Two related lifecycle changes on onboard `DeviceRunner` (a2a3 + a5) that both move per-run work to one-shot init/finalize.

1. Streams now live for the DeviceRunner's lifetime
- `rtStreamCreate` / `rtStreamDestroy` were happening on every `prepare_callable` and `run_prepared` call (4 `rtStream*` per launch, ~ms each). The stream check inside `prepare_run_context` already short-circuits on existing streams, so the per-run create/destroy was strictly redundant once streams persist.
- `release_run_context` becomes a no-op; `finalize` gains the matching `rtStreamDestroy` pair. `simpler_init` triggers stream creation eagerly via `ensure_device_initialized`.

2. Bootstrap moves from "first `run()`" to `simpler_init`
- BootstrapDispatcher + `LoadAicpuOp::Init` + `AicpuSoInfo` H2D + `init_device_args` now run at init time.
- The previous laziness was a stream-lifecycle side effect: bootstrap needs a stream, streams were per-run, so bootstrap had to wait for the first `run()`. With persistent streams that constraint is gone.
- `ensure_device_initialized` moved to the public section on `DeviceRunner` so `simpler_init` can call it directly after executor bytes are cached.

ABI and Python surface unchanged. Sim platforms untouched (no streams or bootstrap there).

Hardware validation (Ascend910, device 3)
- `aicore_op_timeout` (a2a3, a5): PASS
- `paged_attention_unroll` (a2a3, HANDOFF's canary): PASS
- `vector_add`, `hello_worker`, `paged_attention_manual_scope`: PASS
- `a2a3sim hello_worker` (sanity for unchanged sim path): PASS

Benchmark (tensormap_and_ringbuffer, device 3, 100 rounds)
- Total / Sched / Orch: ±1% (device-side wall is untouched, as expected).
- Round 0 (cold start) Host: 7 of 9 examples improved 5%–50%, max -424 ms on paged_attention_unroll C1, consistent with BootstrapDispatcher being the ~200-500 ms first-run cost. The remaining two examples were within per-example noise.
- Steady-state Host (round 50+ mean, summed across 9 examples): -12% aggregate, but per-example deltas are noise-dominated; the small per-run `rtStream*` saving (~0.5-1 ms per run) sits well below the host noise floor and only the aggregate is interpretable.

Test plan