Platform
a2a3 (Ascend 910B2)
Runtime Variant
tensormap_and_ringbuffer
Description
The triangular inverse example in PR #830 compiles for a2a3 hardware after separating simulator-only L0C-to-L1 store handling from hardware direct accumulator stores, but the AIC task times out at runtime.
The timeout happens both with the recursive/unrolled triangular inverse kernel and with a reduced single-core tri_inv_trick kernel. Probing the single-core version shows that initial GM-to-L1 loads, L1-to-L0 moves, the first TMATMUL, and accumulator-to-L1 moves can complete. The timeout starts when the kernel enters the refinement loop and reuses L0A/L0B buffers after prior cube/FIX work.
This suggests a missing or incorrect pipe/event synchronization sequence around reusing L0 buffers after TMATMUL/TMOV accumulator paths, or a runtime/device scheduling issue triggered by that pattern.
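To make the first hypothesis concrete, here is a toy model of cross-pipe event flags, written as a sketch and not as Ascend or PTO code. The pipe names (`MTE1`, `M`), the event names, and the `run` helper are all hypothetical. It only illustrates how one missing "buffer is free" signal before an L0 buffer is reused leaves both pipes waiting on each other, which would look like a hang from the host side.

```python
from collections import defaultdict

def run(programs):
    """Toy scheduler. programs maps a pipe name to a list of
    ("op", label), ("set", event) or ("wait", event) steps.
    Returns (finished, trace); finished is False if the pipes deadlock."""
    flags = defaultdict(int)          # pending count per event
    pc = {pipe: 0 for pipe in programs}
    trace = []
    progressed = True
    while progressed:
        progressed = False
        for pipe, prog in programs.items():
            while pc[pipe] < len(prog):
                kind, arg = prog[pc[pipe]]
                if kind == "wait":
                    if flags[arg] == 0:
                        break         # blocked until another pipe sets it
                    flags[arg] -= 1
                elif kind == "set":
                    flags[arg] += 1
                trace.append((pipe, kind, arg))
                pc[pipe] += 1
                progressed = True
    finished = all(pc[p] == len(programs[p]) for p in programs)
    return finished, trace

# Hypothetical two-iteration loop: MTE1 fills L0A/L0B, M (cube) consumes them.
mte1 = [("op", "fill L0 iter0"), ("set", "MTE1_to_M"),
        ("wait", "M_to_MTE1"),               # L0 must be free before reuse
        ("op", "fill L0 iter1"), ("set", "MTE1_to_M")]
cube_ok = [("wait", "MTE1_to_M"), ("op", "matmul iter0"),
           ("set", "M_to_MTE1"),             # releases L0 for reuse
           ("wait", "MTE1_to_M"), ("op", "matmul iter1")]
# Same cube program with the release signal missing.
cube_bad = [step for step in cube_ok if step != ("set", "M_to_MTE1")]

ok_finished, _ = run({"MTE1": mte1, "M": cube_ok})
bad_finished, bad_trace = run({"MTE1": mte1, "M": cube_bad})
print("with release:", ok_finished, "| without release:", bad_finished)
```

In the broken variant the first fill and the first matmul still complete and the stall begins only at the second iteration, which has the same shape as the symptom described above. The model does not show that this is the actual cause.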
Steps to Reproduce
With PR #830 checked out, run:
python examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/test_triangular_inverse.py \
-p a2a3 \
-d 1 \
--case TestTriangularInverse::Case_upper_tri_matrix_size_32 \
--skip-golden \
--log-level debug
Expected Behavior
The AIC task should complete on hardware, copy output tensors back to host, and the test should pass or at least proceed to numerical validation.
Actual Behavior
The AIC task launches but the AICPU stream synchronization times out. Typical failure:
aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100
FAILED: run_prepared failed with code 507018
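When comparing runs, it can help to pull the three status fields out of the `PTO2 runtime failed` line programmatically. The helper below is a small sketch. The function name `parse_runtime_failure` is hypothetical, and it assumes the `key=value` layout shown in the log above.

```python
import re

def parse_runtime_failure(line):
    """Extract integer key=value fields from a 'PTO2 runtime failed' log line."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(-?\d+)", line)}

fields = parse_runtime_failure(
    "PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100"
)
print(fields)
```

For the line above, this separates the zero orchestrator error code from the nonzero scheduler code and runtime status.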
A debug run shows the runtime reaches AICore launch:
=== launch_aicpu_kernel DynTileFwkKernelServerInit ===
=== launch_aicpu_kernel DynTileFwkKernelServer ===
=== launch_aicore_kernel ===
=== aclrtSynchronizeStreamWithTimeout stream_aicpu_ ===
Then it times out before successful completion.
Git Commit ID
d423e87
CANN Version
9.0.0
Driver Version
25.5.1
Host Platform
Linux (aarch64)
Additional Context
No response