[Question] Tensor dump cannot keep up with moderately-sized kernel workloads (paged_attention 64bat/8192ctx) — host collector drain becomes a kernel-hang root cause #860

@luohuan19

Platform

a2a3 (Ascend 910B/C hardware)

Runtime Variant

tensormap_and_ringbuffer

Description

When enable_dump_tensor=True is set for a moderately-sized kernel
(paged_attention SPMD, 64 batch × 8192 ctx), the run reliably fails
with device error code 507017, 507018, or 507046 after some seconds.
Our diagnosis is that the host-side dump collector drains over PCIe
more slowly than the device-side AICPU threads produce dump data: the
AICPU op blocks under back-pressure on the dump arena and is killed by
STARS once PLATFORM_OP_EXECUTE_TIMEOUT_US elapses. The same kernel
runs to completion with dump disabled.

Looking for guidance: are users expected to scale down the test case
when they need dumps at this scale, or should the dump pipeline adapt?

Steps to Reproduce

  1. Build a moderately-sized kernel through pypto. In our case:
    paged_attention_spmd_64bat_64h_256d_64bs_8192ctx — 8 tensor
    bindings (largest 2 × 1 GB), block_dim=24, 4 AICPU threads,
    heap=4 GB, task_window=131072.
  2. Run the debug replay with enable_dump_tensor=True (equivalent to
    RunConfig(enable_dump_tensor=True) in the user-facing API).
  3. Observe a device error after a few seconds.

Expected Behavior

Either:

  • The dump completes and the kernel finishes normally (preferred), or
  • A clearer "dump capacity exceeded — drop to selective dump"
    diagnostic surfaces before the device-side op-timeout kill.

Actual Behavior

Run A — default platform_config.h

Config:

PLATFORM_OP_EXECUTE_TIMEOUT_US     = 1_000_000   (1 s)
PLATFORM_STREAM_SYNC_TIMEOUT_MS    = 2_000       (2 s)
PLATFORM_DUMP_AVG_TENSOR_BYTES     = 65_536
PLATFORM_DUMP_BUFFERS_PER_THREAD   = 8
=> dump arena: 128 MB/thread × 4 threads = 512 MB

Log excerpt:

WARN process_dump_buffer: Tensor dump truncation detected.
     Increase PLATFORM_DUMP_AVG_TENSOR_BYTES.
ERROR run: Stream sync timeout: stream=AICPU timeout_ms=2000
INFO on_buffer_collected: Collecting: 1280 tensors, 9.5 GB written (227 s)
RuntimeError: run_prepared failed with code 507046

Effective collector drain rate: 9.5 GB / 227 s ≈ 42 MB/s.

Run B — tuned platform_config.h (raised timeouts + arena)

Config:

PLATFORM_OP_EXECUTE_TIMEOUT_US     = 60_000_000  (60 s)
PLATFORM_STREAM_SYNC_TIMEOUT_MS    = 600_000     (10 min)
PLATFORM_DUMP_AVG_TENSOR_BYTES     = 1_048_576   (1 MB)
PLATFORM_DUMP_BUFFERS_PER_THREAD   = 16
PLATFORM_DUMP_TIMEOUT_SECONDS      = 300
=> dump arena: 4096 MB/thread × 4 threads = 16 GB

Log excerpt:

INFO aclrtSetOpExecuteTimeOutV2: requested=60000000 us, actual=60129542 us
INFO Tensor dump initialized: 4 threads, arena=4096 MB/thread, 16 buffers/thread
INFO === aclrtSynchronizeStreamWithTimeout stream_aicpu_ ===
[~77 s elapsed, no truncation warning, no stream-sync-timeout branch hit]
ERROR aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507017

Notable: there was no truncation warning, and the host-side
"Stream sync timeout: …" log line never appeared. The ~77 s wait and
the return code 507017 (rather than ACL_ERROR_RT_STREAM_SYNC_TIMEOUT)
are consistent with STARS killing the AICPU op once
OP_EXECUTE_TIMEOUT_US = 60 s expired while it was back-pressured on
dump buffers.
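
For reference, the arena sizes quoted in Runs A and B can be reproduced with a short calculation. This is only a sketch: the factor of 256 slots per buffer is inferred from the reported numbers (128 MB/thread and 4096 MB/thread), not read from platform_config.h, and the function names are ours.

```python
# Sketch: reproduce the dump-arena sizes reported in Runs A and B.
# ASSUMPTION: arena = avg_tensor_bytes * buffers_per_thread * slots_per_buffer.
# SLOTS_PER_BUFFER = 256 is inferred from the logged arena sizes, not taken
# from platform_config.h.
SLOTS_PER_BUFFER = 256
MIB = 1024 * 1024


def arena_mib_per_thread(avg_tensor_bytes: int, buffers_per_thread: int,
                         slots_per_buffer: int = SLOTS_PER_BUFFER) -> int:
    """Per-thread dump arena size in MiB under the assumed formula."""
    return avg_tensor_bytes * buffers_per_thread * slots_per_buffer // MIB


def total_arena_mib(avg_tensor_bytes: int, buffers_per_thread: int,
                    threads: int = 4) -> int:
    """Total dump arena across all AICPU threads, in MiB."""
    return arena_mib_per_thread(avg_tensor_bytes, buffers_per_thread) * threads


if __name__ == "__main__":
    for label, avg_bytes, buffers in (("Run A", 65_536, 8),
                                      ("Run B", 1_048_576, 16)):
        per_thread = arena_mib_per_thread(avg_bytes, buffers)
        total = total_arena_mib(avg_bytes, buffers)
        print(f"{label}: {per_thread} MiB/thread, {total} MiB total")
```

If the real sizing formula differs, only the constant and the product above need to change; the point is that both tuning knobs scale the arena linearly.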

Run C — same case, dump disabled

Runs to completion successfully.

Git Commit ID

324df3d

Host Platform

Linux (aarch64)

Additional Context

Analysis

  • Arena size only delays when the per-thread ring fills; it does not
    change the steady-state imbalance between AICPU dump production and
    host PCIe consumption.
  • At ~42 MB/s drain, this case's ~9.5 GB payload needs >220 s just to
    land on host. Any per-op timeout shorter than that is hit while
    AICPU is back-pressured.
  • Raising PLATFORM_OP_EXECUTE_TIMEOUT_US globally to mask the
    problem is a blunt instrument that would also hide real hangs in
    non-dump runs.
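
The arithmetic behind these points can be sketched as follows. The drain rate and payload come from the Run A log (treating GB as 10^9 bytes, which is our assumption); the production rate is a purely hypothetical placeholder, since we did not measure it.

```python
# Sketch of the back-pressure arithmetic behind the analysis above.
# Inputs from the Run A log: 9.5 GB written in 227 s (GB taken as 10^9 bytes).
import math


def drain_rate_mb_s(written_gb: float, elapsed_s: float) -> float:
    """Effective collector drain rate in MB/s."""
    return written_gb * 1000.0 / elapsed_s


def landing_time_s(payload_gb: float, drain_mb_s: float) -> float:
    """Time for the whole payload to reach the host at a given drain rate."""
    return payload_gb * 1000.0 / drain_mb_s


def seconds_until_arena_full(arena_mb: float, produce_mb_s: float,
                             drain_mb_s: float) -> float:
    """Time until the arena fills; infinite if the host keeps up."""
    net = produce_mb_s - drain_mb_s
    return math.inf if net <= 0 else arena_mb / net


if __name__ == "__main__":
    drain = drain_rate_mb_s(9.5, 227.0)
    print(f"drain rate: {drain:.1f} MB/s")
    print(f"landing time at 42 MB/s: {landing_time_s(9.5, 42.0):.0f} s")

    # HYPOTHETICAL production rate, for illustration only (not measured).
    produce = 1000.0
    for arena_mb in (512, 16384):  # Run A and Run B arena sizes
        t = seconds_until_arena_full(arena_mb, produce, drain)
        print(f"arena {arena_mb} MB fills after {t:.1f} s")
```

Whatever the real production rate is, a larger arena only moves the fill time; once the arena is full the AICPU op waits on the host, and the landing time (well over 200 s here) exceeds both the 1 s and 60 s op timeouts we tried.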

Question for maintainers

What is the intended workflow for tensor-dumping at this scale?

  1. Scale down: are users expected to shrink the test case (smaller
    batch / shorter context) whenever they need dumps? Is there a
    documented "max dump-friendly case size"?
  2. Pipeline adapts: would you be open to changes in any of these
    directions?
    • Per-tensor / per-task dump filter so users can opt in to a
      subset (today enable_dump_tensor is all-or-nothing)
    • Async / larger PCIe writeback path to raise the ~42 MB/s
      collector drain ceiling
    • Dump-aware op timeout (longer when dump is enabled, default
      otherwise) so the non-dump path stays strict
  3. Any combination of the above?
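
To make the dump-aware timeout option concrete, here is a minimal sketch of the policy we have in mind. Everything in it is hypothetical: the function name, the parameters, the 42 MB/s default (taken from our Run A measurement), and the safety factor are illustrations, not an existing runtime API.

```python
# Sketch of a dump-aware op timeout policy (hypothetical, not an existing API).
# The non-dump path keeps the strict base timeout; the dump path adds the
# estimated time needed to drain the expected dump payload to the host.

def op_timeout_us(base_timeout_us: int,
                  dump_enabled: bool,
                  expected_dump_bytes: int = 0,
                  drain_bytes_per_s: float = 42_000_000.0,  # from Run A
                  safety_factor: float = 1.5) -> int:
    """Return the op-execute timeout in microseconds."""
    if not dump_enabled:
        # Keep real hangs visible in normal runs.
        return base_timeout_us
    if drain_bytes_per_s <= 0:
        raise ValueError("drain_bytes_per_s must be positive")
    drain_us = expected_dump_bytes / drain_bytes_per_s * 1_000_000
    return base_timeout_us + int(drain_us * safety_factor)


if __name__ == "__main__":
    base = 1_000_000  # default PLATFORM_OP_EXECUTE_TIMEOUT_US (1 s)
    print(op_timeout_us(base, dump_enabled=False))
    print(op_timeout_us(base, dump_enabled=True,
                        expected_dump_bytes=9_500_000_000))
```

The design intent is that the extended timeout is derived from the dump volume rather than set globally, so runs without dump are unaffected.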

Happy to test patches or contribute the per-tensor filter route if it
matches the direction the maintainers prefer.
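
As a starting point for discussion, the per-tensor filter could look roughly like the sketch below. The DumpFilter class, its fields, and the tensor names are all hypothetical; nothing here exists in the current runtime, and the real hook point would be wherever the dump path decides to copy a tensor.

```python
# Sketch of a per-tensor / per-task dump filter (hypothetical interface).
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class DumpFilter:
    # Glob patterns matched against the tensor name; default dumps everything.
    name_patterns: Tuple[str, ...] = ("*",)
    # Skip tensors larger than this many bytes; None means no size limit.
    max_tensor_bytes: Optional[int] = None
    # Restrict dumping to these task ids; None means all tasks.
    task_ids: Optional[FrozenSet[int]] = None

    def should_dump(self, name: str, nbytes: int, task_id: int) -> bool:
        """Decide whether a single tensor should be dumped."""
        if self.task_ids is not None and task_id not in self.task_ids:
            return False
        if self.max_tensor_bytes is not None and nbytes > self.max_tensor_bytes:
            return False
        return any(fnmatchcase(name, p) for p in self.name_patterns)


if __name__ == "__main__":
    # Hypothetical tensor names; dump only small query/output tensors.
    flt = DumpFilter(name_patterns=("q*", "out"),
                     max_tensor_bytes=64 * 1024 * 1024)
    for name, nbytes in (("query", 1 << 20), ("key_cache", 1 << 30),
                         ("out", 1 << 20)):
        print(name, flt.should_dump(name, nbytes, task_id=0))
```

A default-constructed DumpFilter matches everything, so today's all-or-nothing behavior would remain the default when enable_dump_tensor is set without a filter.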

Environment notes

  • CANN version: not captured (collect_env path warns about ownership
    mismatch on this host)
  • Driver version: npu-smi info unavailable on this host
    (DrvMngGetConsoleLogLevel failed, dcmi module initialize failed ret=-8005)
  • Reproducible on device 4 of the test machine

Labels: enhancement, question
