
[Bug] A2A3 triangular inverse AIC kernel times out in tensormap_and_ringbuffer runtime #831

@MirkoDeVita98

Platform

a2a3 (Ascend 910B2)

Runtime Variant

tensormap_and_ringbuffer

Description

The triangular inverse example in PR #830 compiles for a2a3 hardware after separating simulator-only L0C-to-L1 store handling from hardware direct accumulator stores, but the AIC task times out at runtime.

The timeout happens both with the recursive/unrolled triangular inverse kernel and with a reduced single-core tri_inv_trick kernel. Probing the single-core version shows that initial GM-to-L1 loads, L1-to-L0 moves, the first TMATMUL, and accumulator-to-L1 moves can complete. The timeout starts when the kernel enters the refinement loop and reuses L0A/L0B buffers after prior cube/FIX work.

This suggests a missing or incorrect pipe/event synchronization sequence around reusing L0 buffers after TMATMUL/TMOV accumulator paths, or a runtime/device scheduling issue triggered by that pattern.
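The first hypothesis can be illustrated with a toy model. This is only a sketch, not the PTO API and not a description of the actual hardware: the pipe labels `MTE1` (L1-to-L0 moves) and `M` (cube) and the flag names `L0_full` and `L0_free` are assumptions chosen for illustration. Each pipe runs its ops in order, and a `wait` blocks until the matching `set` has been issued. If the cube side never signals that it is done reading L0A/L0B, the second iteration's reload stalls and so does the matmul that depends on it, which would look like a hang from the host.

```python
from collections import defaultdict

def run(pipes):
    """Run in-order pipe programs; return the names of pipes that never finish.

    Each program is a list of ops:
      ("work", label): always completes
      ("set",  flag):  signals `flag` once
      ("wait", flag):  blocks until `flag` has a pending signal
    """
    pc = {name: 0 for name in pipes}      # per-pipe program counter
    pending = defaultdict(int)            # outstanding signals per flag
    progressed = True
    while progressed:
        progressed = False
        for name, prog in pipes.items():
            while pc[name] < len(prog):
                kind, arg = prog[pc[name]]
                if kind == "wait":
                    if pending[arg] == 0:
                        break             # stalled; let the other pipes run
                    pending[arg] -= 1
                elif kind == "set":
                    pending[arg] += 1
                pc[name] += 1
                progressed = True
    return sorted(n for n, prog in pipes.items() if pc[n] < len(prog))

# Two loop iterations with both directions of the handshake present.
good = {
    "MTE1": [("work", "load L0A/L0B #0"), ("set", "L0_full"),
             ("wait", "L0_free"),
             ("work", "load L0A/L0B #1"), ("set", "L0_full")],
    "M":    [("wait", "L0_full"), ("work", "matmul #0"), ("set", "L0_free"),
             ("wait", "L0_full"), ("work", "matmul #1")],
}

# Same programs, but the cube pipe never releases the L0 buffers.
broken = {
    "MTE1": good["MTE1"],
    "M":    [op for op in good["M"] if op != ("set", "L0_free")],
}

print("good   stuck pipes:", run(good))
print("broken stuck pipes:", run(broken))
```

In the broken variant both pipes end up blocked on each other, so a useful first check on the real kernel may be whether every wait in the refinement loop has a matching set on the producing pipe for the same event.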

Steps to Reproduce

On the PR #830 branch, run:

```bash
python examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/test_triangular_inverse.py \
  -p a2a3 \
  -d 1 \
  --case TestTriangularInverse::Case_upper_tri_matrix_size_32 \
  --skip-golden \
  --log-level debug
```

Expected Behavior

The AIC task should complete on hardware, copy output tensors back to host, and the test should pass or at least proceed to numerical validation.

Actual Behavior

The AIC task launches, but the AICPU stream synchronization times out. Typical failure:

```
aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100
FAILED: run_prepared failed with code 507018
```
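When comparing several failing runs, it can help to pull the numeric fields out of these lines. The snippet below is a minimal sketch that parses the three lines exactly as quoted above. The function name `parse_failure` and the output keys `stream` and `acl_code` are made up for illustration, and the snippet does not interpret what the codes mean.

```python
import re

# Failure lines copied verbatim from the report above.
LOG = """\
aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100
FAILED: run_prepared failed with code 507018
"""

def parse_failure(log: str) -> dict:
    """Extract the stream name, its return code, and the key=value fields."""
    fields = {}
    m = re.search(
        r"aclrtSynchronizeStreamWithTimeout \((\w+)\) failed: (\d+)", log
    )
    if m:
        fields["stream"] = m.group(1)
        fields["acl_code"] = int(m.group(2))
    # Pick up every name=integer pair, including negative values.
    for key, value in re.findall(r"(\w+)=(-?\d+)", log):
        fields[key] = int(value)
    return fields

summary = parse_failure(LOG)
for key, value in summary.items():
    print(f"{key}: {value}")
```

In this log the orchestration code is 0 while the scheduler code and runtime status are nonzero, so the fields are worth tracking separately across runs.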

A debug run shows the runtime reaches AICore launch:

```
=== launch_aicpu_kernel DynTileFwkKernelServerInit ===
=== launch_aicpu_kernel DynTileFwkKernelServer ===
=== launch_aicore_kernel ===
=== aclrtSynchronizeStreamWithTimeout stream_aicpu_ ===
```

The stream synchronization then times out before the AIC task completes.

Git Commit ID

d423e87

CANN Version

9.0.0

Driver Version

25.5.1

Host Platform

Linux (aarch64)

Additional Context

No response
