[Question] Tensor dump cannot keep up with moderately-sized kernel workloads (paged_attention 64bat/8192ctx) — host collector drain becomes a kernel-hang root cause

### Platform

a2a3 (Ascend 910B/C hardware)

### Runtime Variant

tensormap_and_ringbuffer

### Description

When `enable_dump_tensor=True` is used on a moderately-sized kernel
(paged_attention SPMD, 64 batch × 8192 ctx), the run reliably fails
with device error codes `507017` / `507018` / `507046`, after a few
seconds with the default config and after about 77 s with raised
timeouts (Run B). Our diagnosis is that the host-side dump collector
drains over PCIe more slowly than the device-side AICPU produces dump
data; the AICPU then blocks on a full dump arena and is killed by STARS
once `PLATFORM_OP_EXECUTE_TIMEOUT_US` elapses. The same kernel runs to
completion with dump disabled.

Looking for guidance: are users expected to scale down the test case
when they need dumps at this scale, or should the dump pipeline adapt?

### Steps to Reproduce

1. Build a moderately-sized kernel through pypto. In our case:
   `paged_attention_spmd_64bat_64h_256d_64bs_8192ctx` — 8 tensor
   bindings (largest 2 × 1 GB), block_dim=24, 4 AICPU threads,
   heap=4 GB, task_window=131072.
2. Run the debug replay with `enable_dump_tensor=True` (equivalent to
   `RunConfig(enable_dump_tensor=True)` in the user-facing API).
3. Observe a device error after a few seconds.

### Expected Behavior

Either:
- The dump completes and the kernel finishes normally (preferred), or
- A clearer "dump capacity exceeded — drop to selective dump"
  diagnostic surfaces *before* the device-side op-timeout kill.

### Actual Behavior

#### Run A — default `platform_config.h`

Config:
```
PLATFORM_OP_EXECUTE_TIMEOUT_US     = 1_000_000   (1 s)
PLATFORM_STREAM_SYNC_TIMEOUT_MS    = 2_000       (2 s)
PLATFORM_DUMP_AVG_TENSOR_BYTES     = 65_536
PLATFORM_DUMP_BUFFERS_PER_THREAD   = 8
=> dump arena: 128 MB/thread × 4 threads = 512 MB
```

Log excerpt:
```
WARN process_dump_buffer: Tensor dump truncation detected.
     Increase PLATFORM_DUMP_AVG_TENSOR_BYTES.
ERROR run: Stream sync timeout: stream=AICPU timeout_ms=2000
INFO on_buffer_collected: Collecting: 1280 tensors, 9.5 GB written (227 s)
RuntimeError: run_prepared failed with code 507046
```

Effective collector drain rate: 9.5 GB / 227 s ≈ **42 MB/s**.

#### Run B — tuned `platform_config.h` (raised timeouts + arena)

Config:
```
PLATFORM_OP_EXECUTE_TIMEOUT_US     = 60_000_000  (60 s)
PLATFORM_STREAM_SYNC_TIMEOUT_MS    = 600_000     (10 min)
PLATFORM_DUMP_AVG_TENSOR_BYTES     = 1_048_576   (1 MB)
PLATFORM_DUMP_BUFFERS_PER_THREAD   = 16
PLATFORM_DUMP_TIMEOUT_SECONDS      = 300
=> dump arena: 4096 MB/thread × 4 threads = 16 GB
```

Log excerpt:
```
INFO aclrtSetOpExecuteTimeOutV2: requested=60000000 us, actual=60129542 us
INFO Tensor dump initialized: 4 threads, arena=4096 MB/thread, 16 buffers/thread
INFO === aclrtSynchronizeStreamWithTimeout stream_aicpu_ ===
[~77 s elapsed, no truncation warning, no stream-sync-timeout branch hit]
ERROR aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507017
```

Notable: there is no truncation warning, and the host-side
`Stream sync timeout: …` log line was *not* hit. The ~77 s wait plus
return code 507017 (not `ACL_ERROR_RT_STREAM_SYNC_TIMEOUT`) is
consistent with STARS killing the AICPU op after
`PLATFORM_OP_EXECUTE_TIMEOUT_US` (60 s) expired while the op was
back-pressured on dump buffers.
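
As a sanity check on the arena figures in the Run A and Run B configs, the sketch below recomputes them from the reported values. The "slots per buffer" factor is only inferred from these two data points; it is an assumption, not something read from `platform_config.h`, and the dictionary keys are hypothetical names used for illustration.

```python
# Values copied from the Run A / Run B config blocks in this issue.
RUNS = {
    "A": {"avg_tensor_bytes": 65_536, "buffers_per_thread": 8, "arena_mb_per_thread": 128},
    "B": {"avg_tensor_bytes": 1_048_576, "buffers_per_thread": 16, "arena_mb_per_thread": 4096},
}
THREADS = 4
MIB = 1024 * 1024


def implied_slots_per_buffer(run):
    """Arena bytes / (avg tensor bytes * buffers per thread).

    This is an inferred quantity: it only shows which multiplier would make
    the reported arena size consistent with the two config knobs.
    """
    arena_bytes = run["arena_mb_per_thread"] * MIB
    return arena_bytes // (run["avg_tensor_bytes"] * run["buffers_per_thread"])


for name, run in RUNS.items():
    total_mb = run["arena_mb_per_thread"] * THREADS
    print(f"Run {name}: implied slots/buffer={implied_slots_per_buffer(run)}, "
          f"total arena={total_mb} MB")
```

Both runs yield the same implied multiplier, which suggests the arena scales linearly with `PLATFORM_DUMP_AVG_TENSOR_BYTES` and `PLATFORM_DUMP_BUFFERS_PER_THREAD`.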

#### Run C — same case, dump disabled

Runs to completion successfully.

### Git Commit ID

324df3d6557b0c6571a3ff3d170675324df0fa1c

### Host Platform

Linux (aarch64)

### Additional Context

#### Analysis

- Arena size only delays *when* the per-thread ring fills; it does not
  change the steady-state imbalance between AICPU dump production and
  host PCIe consumption.
- At ~42 MB/s drain, this case's ~9.5 GB payload needs >220 s just to
  land on host. Any per-op timeout shorter than that is hit while
  AICPU is back-pressured.
- Raising `PLATFORM_OP_EXECUTE_TIMEOUT_US` globally to mask the
  problem is a blunt instrument that would also hide real hangs in
  non-dump runs.
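
The arithmetic behind these bullets can be sketched as follows. This is a rough model, assuming decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and a constant drain rate; it ignores whatever part of the payload the arena absorbs, so it estimates how long the whole payload takes to land on the host rather than the exact stall time. The function name is hypothetical.

```python
def drain_seconds(payload_bytes: float, drain_bytes_per_s: float) -> float:
    """Time for the whole dump payload to reach the host at a fixed drain rate."""
    return payload_bytes / drain_bytes_per_s


PAYLOAD_BYTES = 9.5e9        # ~9.5 GB written, from the Run A log
DRAIN_BYTES_PER_S = 42e6     # ~42 MB/s observed collector drain rate

# Per-op timeouts used in Run A and Run B, in seconds.
OP_TIMEOUTS_S = {"Run A (default)": 1.0, "Run B (tuned)": 60.0}

needed = drain_seconds(PAYLOAD_BYTES, DRAIN_BYTES_PER_S)
print(f"time to land payload on host: {needed:.0f} s")
for label, timeout_s in OP_TIMEOUTS_S.items():
    verdict = "shorter than" if timeout_s < needed else "covers"
    print(f"{label}: op timeout {timeout_s:.0f} s {verdict} the drain time")
```

Under these assumptions both the default and the tuned op timeout fall well short of the time the payload needs to drain.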

#### Question for maintainers

What is the intended workflow for tensor-dumping at this scale?

1. **Scale down**: are users expected to shrink the test case (smaller
   batch / shorter context) whenever they need dumps? Is there a
   documented "max dump-friendly case size"?
2. **Pipeline adapts**: would you be open to changes in any of these
   directions?
   - **Per-tensor / per-task dump filter** so users can opt in to a
     subset (today `enable_dump_tensor` is all-or-nothing)
   - **Async / larger PCIe writeback path** to raise the ~42 MB/s
     collector drain ceiling
   - **Dump-aware op timeout** (longer when dump is enabled, default
     otherwise) so the non-dump path stays strict
3. Any combination of the above?
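
To make the per-tensor filter option concrete, here is a minimal sketch of what a host-side selection predicate could look like. Everything in it is hypothetical: `DumpFilter`, its fields, and the tensor names are illustrative only and do not exist in the current runtime or in `RunConfig`.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional, Sequence


@dataclass(frozen=True)
class DumpFilter:
    """Hypothetical opt-in filter: dump a tensor only if its name matches
    one of the glob patterns and its size is within an optional byte cap."""
    name_patterns: Sequence[str] = ("*",)   # default: match every tensor
    max_bytes: Optional[int] = None         # default: no size cap

    def should_dump(self, tensor_name: str, nbytes: int) -> bool:
        # Reject oversized tensors first so large buffers never enter the arena.
        if self.max_bytes is not None and nbytes > self.max_bytes:
            return False
        return any(fnmatchcase(tensor_name, p) for p in self.name_patterns)


# Example: keep small query/output tensors, skip the two 1 GB bindings.
flt = DumpFilter(name_patterns=("query*", "out"), max_bytes=64 * 1024 * 1024)
print(flt.should_dump("query_0", 1024))        # matches pattern, under cap
print(flt.should_dump("key_cache", 1 << 30))   # no pattern match, over cap
```

A default-constructed filter keeps today's all-or-nothing behavior, so the change would be backward compatible if it were wired behind `enable_dump_tensor`.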

Happy to test patches or contribute the per-tensor filter route if it
matches the direction the maintainers prefer.

#### Environment notes

- CANN version: not captured (collect_env path warns about ownership
  mismatch on this host)
- Driver version: `npu-smi info` unavailable on this host
  (`DrvMngGetConsoleLogLevel failed`, `dcmi module initialize failed
  ret=-8005`)
- Reproducible on device 4 of the test machine
