Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Description
When enable_dump_tensor=True is used on a moderately-sized kernel
(paged_attention SPMD, 64 batch × 8192 ctx), the run reliably fails
with device error codes 507017 / 507018 / 507046 after some
seconds. Our diagnosis is that the host-side dump collector drains over
PCIe more slowly than the device-side AICPU produces dump data: the
AICPU back-pressures on the dump arena and is killed by STARS once
PLATFORM_OP_EXECUTE_TIMEOUT_US elapses. The same kernel runs to
completion with dump disabled.
Looking for guidance: are users expected to scale down the test case
when they need dumps at this scale, or should the dump pipeline adapt?
Steps to Reproduce
- Build a moderately-sized kernel through pypto. In our case:
paged_attention_spmd_64bat_64h_256d_64bs_8192ctx, with 8 tensor
bindings (largest 2 × 1 GB), block_dim=24, 4 AICPU threads,
heap=4 GB, task_window=131072.
- Run the debug replay with enable_dump_tensor=True (equivalent to
RunConfig(enable_dump_tensor=True) in the user-facing API).
- Observe a device error after a few seconds.
Expected Behavior
Either:
- The dump completes and the kernel finishes normally (preferred), or
- A clearer "dump capacity exceeded — drop to selective dump"
diagnostic surfaces before the device-side op-timeout kill.
Actual Behavior
Run A — default platform_config.h
Config:
PLATFORM_OP_EXECUTE_TIMEOUT_US = 1_000_000 (1 s)
PLATFORM_STREAM_SYNC_TIMEOUT_MS = 2_000 (2 s)
PLATFORM_DUMP_AVG_TENSOR_BYTES = 65_536
PLATFORM_DUMP_BUFFERS_PER_THREAD = 8
=> dump arena: 128 MB/thread × 4 threads = 512 MB
Log excerpt:
WARN process_dump_buffer: Tensor dump truncation detected.
Increase PLATFORM_DUMP_AVG_TENSOR_BYTES.
ERROR run: Stream sync timeout: stream=AICPU timeout_ms=2000
INFO on_buffer_collected: Collecting: 1280 tensors, 9.5 GB written (227 s)
RuntimeError: run_prepared failed with code 507046
Effective collector drain rate: 9.5 GB / 227 s ≈ 42 MB/s.
Run B — tuned platform_config.h (raised timeouts + arena)
Config:
PLATFORM_OP_EXECUTE_TIMEOUT_US = 60_000_000 (60 s)
PLATFORM_STREAM_SYNC_TIMEOUT_MS = 600_000 (10 min)
PLATFORM_DUMP_AVG_TENSOR_BYTES = 1_048_576 (1 MB)
PLATFORM_DUMP_BUFFERS_PER_THREAD = 16
PLATFORM_DUMP_TIMEOUT_SECONDS = 300
=> dump arena: 4096 MB/thread × 4 threads = 16 GB
Log excerpt:
INFO aclrtSetOpExecuteTimeOutV2: requested=60000000 us, actual=60129542 us
INFO Tensor dump initialized: 4 threads, arena=4096 MB/thread, 16 buffers/thread
INFO === aclrtSynchronizeStreamWithTimeout stream_aicpu_ ===
[~77 s elapsed, no truncation warning, no stream-sync-timeout branch hit]
ERROR aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507017
Notable: no truncation warning, and the host-side "Stream sync timeout: …"
log line was not hit. The 77 s wait plus return code 507017 (not
ACL_ERROR_RT_STREAM_SYNC_TIMEOUT) is consistent with STARS killing the
AICPU op after OP_EXECUTE_TIMEOUT_US = 60 s while it was back-pressured
on dump buffers.
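The arena sizes quoted in the two configs can be cross-checked with a few lines of arithmetic. This is only a sketch: the factor of 256 is an assumption inferred from the reported numbers (it is the value that reproduces both 128 MB/thread and 4096 MB/thread), not something read from platform_config.h, and the function name is made up for illustration.

```python
# Assumption: 256 is inferred from the two reported arena sizes; the real
# formula in platform_config.h may be expressed differently.
INFERRED_SLOTS_PER_BUFFER = 256
MIB = 1024 * 1024
AICPU_THREADS = 4


def arena_mib_per_thread(avg_tensor_bytes: int, buffers_per_thread: int) -> int:
    """Per-thread dump arena in MiB under the inferred sizing rule."""
    total = avg_tensor_bytes * buffers_per_thread * INFERRED_SLOTS_PER_BUFFER
    return total // MIB


run_a = arena_mib_per_thread(65_536, 8)        # Run A defaults
run_b = arena_mib_per_thread(1_048_576, 16)    # Run B tuned values

print(f"Run A: {run_a} MB/thread, {run_a * AICPU_THREADS} MB total")
print(f"Run B: {run_b} MB/thread, {run_b * AICPU_THREADS} MB total")
```

Under that assumption the totals match the 512 MB and 16 GB figures above, which suggests the arena grew 32× between runs while the failure mode stayed the same.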
Run C — same case, dump disabled
Runs to completion successfully.
Git Commit ID
324df3d
Host Platform
Linux (aarch64)
Additional Context
Analysis
- Arena size only delays when the per-thread ring fills; it does not
change the steady-state imbalance between AICPU dump production and
host PCIe consumption.
- At ~42 MB/s drain, this case's ~9.5 GB payload needs >220 s just to
land on host. Any per-op timeout shorter than that is hit while
AICPU is back-pressured.
- Raising PLATFORM_OP_EXECUTE_TIMEOUT_US globally to mask the
problem is a blunt instrument that also hides real hangs in non-dump
runs.
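The points above can be put into a small back-of-the-envelope model. It is a sketch under simplifying assumptions: constant production and drain rates, decimal GB/MB units, and helper names that are ours rather than anything in the runtime. The device-side production rate was not measured, so it appears only as a parameter.

```python
import math

GB = 1e9
MB = 1e6


def drain_seconds(payload_bytes: float, drain_bytes_per_s: float) -> float:
    """Minimum wall time for the host collector to pull the whole payload."""
    return payload_bytes / drain_bytes_per_s


def seconds_until_arena_full(arena_bytes: float,
                             produce_bytes_per_s: float,
                             drain_bytes_per_s: float) -> float:
    """Time before the device stalls on a full arena (inf if the host keeps up)."""
    surplus = produce_bytes_per_s - drain_bytes_per_s
    return arena_bytes / surplus if surplus > 0 else math.inf


observed_drain = 9.5 * GB / 227                    # Run A: about 42 MB/s
full_drain_s = drain_seconds(9.5 * GB, 42 * MB)    # about 226 s

for name, timeout_s in (("Run A", 1), ("Run B", 60)):
    print(f"{name}: op timeout {timeout_s} s vs {full_drain_s:.0f} s needed to drain")
```

In this model a larger arena only scales the result of seconds_until_arena_full linearly. As long as production exceeds drain, the stall still arrives, and both configured op timeouts are far below the time needed to drain the full payload.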
Question for maintainers
What is the intended workflow for tensor-dumping at this scale?
- Scale down: are users expected to shrink the test case (smaller
batch / shorter context) whenever they need dumps? Is there a
documented "max dump-friendly case size"?
- Pipeline adapts: would you be open to changes in any of these
  directions?
  - Per-tensor / per-task dump filter so users can opt in to a
    subset (today enable_dump_tensor is all-or-nothing)
  - Async / larger PCIe writeback path to raise the ~42 MB/s
    collector drain ceiling
  - Dump-aware op timeout (longer when dump is enabled, default
    otherwise) so the non-dump path stays strict
  - Any combination of the above?
Happy to test patches or contribute the per-tensor filter route if it
matches the direction the maintainers prefer.
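To make two of the proposed directions concrete, here is a rough sketch of what a per-tensor / per-task filter and a dump-aware op timeout could look like on the host side. Everything in it is hypothetical: DumpFilter, should_dump, and effective_op_timeout_us do not exist in the current code base, and the timeout values are simply the ones from Runs A and B.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional, Tuple


@dataclass(frozen=True)
class DumpFilter:
    """Hypothetical opt-in filter; the default dumps everything, as today."""
    tensor_patterns: Tuple[str, ...] = ("*",)
    task_range: Optional[Tuple[int, int]] = None  # inclusive (first, last)

    def should_dump(self, tensor_name: str, task_id: int) -> bool:
        if self.task_range is not None:
            first, last = self.task_range
            if not first <= task_id <= last:
                return False
        return any(fnmatchcase(tensor_name, p) for p in self.tensor_patterns)


def effective_op_timeout_us(dump_enabled: bool,
                            default_us: int = 1_000_000,
                            dump_us: int = 60_000_000) -> int:
    """Hypothetical dump-aware timeout: relaxed only when dumping is on."""
    return dump_us if dump_enabled else default_us


# Example: dump only tensors named kv_cache* for the first ten tasks.
selective = DumpFilter(tensor_patterns=("kv_cache*",), task_range=(0, 9))
print(selective.should_dump("kv_cache_k", 3))   # True
print(selective.should_dump("query", 3))        # False
print(effective_op_timeout_us(dump_enabled=False))
```

The filter would cut the payload that has to cross PCIe, while the dump-aware timeout keeps the strict 1 s default for non-dump runs. The tensor names in the example are placeholders, not the actual bindings of this kernel.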
Environment notes
- CANN version: not captured (collect_env path warns about ownership
mismatch on this host)
- Driver version: npu-smi info unavailable on this host
(DrvMngGetConsoleLogLevel failed, dcmi module initialize failed ret=-8005)
- Reproducible on device 4 of the test machine