[Bug] A2A3 triangular inverse AIC kernel times out in `tensormap_and_ringbuffer` runtime #831

### Platform

a2a3 (Ascend 910B2)

### Runtime Variant

tensormap_and_ringbuffer

### Description

The triangular inverse example in PR https://github.com/hw-native-sys/simpler/pull/830 compiles for `a2a3` hardware after separating simulator-only L0C-to-L1 store handling from hardware direct accumulator stores, but the AIC task times out at runtime.

The timeout happens both with the recursive/unrolled triangular inverse kernel and with a reduced single-core `tri_inv_trick` kernel. Probing the single-core version shows that initial GM-to-L1 loads, L1-to-L0 moves, the first `TMATMUL`, and accumulator-to-L1 moves can complete. The timeout starts when the kernel enters the refinement loop and reuses L0A/L0B buffers after prior cube/FIX work.

This suggests a missing or incorrect pipe/event synchronization sequence around reusing L0 buffers after `TMATMUL`/`TMOV` accumulator paths, or a runtime/device scheduling issue triggered by that pattern.
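To make the suspected failure mode concrete, here is a minimal toy model in Python. It is not the kernel from the PR and it uses no real PTO or CANN API. The pipe labels `MTE1` and `M`, the op tuples, and the helpers `build` and `run` are all hypothetical and exist only for illustration. The sketch assumes two in-order pipes that hand L0 buffers back and forth with counting set/wait flags, and it shows how leaving out the "buffer free" flag after a matmul lets the first iteration finish while the second one stalls.

```python
from collections import defaultdict


def build(iterations, release_after_matmul=True):
    """Build a hypothetical two-pipe program for a loop that reuses L0 buffers."""
    mte1, m = [], []
    # Prime the loop: the L0 buffers start out free.
    m.append(("set", "M", "MTE1", 0))
    for i in range(iterations):
        # Mover pipe: wait until L0 is free, refill it, then signal "ready".
        mte1 += [
            ("wait", "M", "MTE1", 0),
            ("work", f"load_l0[{i}]"),
            ("set", "MTE1", "M", 0),
        ]
        # Cube pipe: wait for "ready", consume L0, then release it.
        m += [("wait", "MTE1", "M", 0), ("work", f"matmul[{i}]")]
        if release_after_matmul:
            m.append(("set", "M", "MTE1", 0))
    return {"MTE1": mte1, "M": m}


def run(pipes):
    """Execute the pipes in order; return {pipe: stalled_op} for pipes that hang."""
    flags = defaultdict(int)          # (src, dst, event_id) -> pending sets
    pc = {name: 0 for name in pipes}  # next op index per pipe
    progressed = True
    while progressed:
        progressed = False
        for name, ops in pipes.items():
            while pc[name] < len(ops):
                kind, *args = ops[pc[name]]
                if kind == "wait":
                    if flags[tuple(args)] == 0:
                        break  # blocked until another pipe sets this flag
                    flags[tuple(args)] -= 1
                elif kind == "set":
                    flags[tuple(args)] += 1
                # "work" ops always complete in this model
                pc[name] += 1
                progressed = True
    return {n: ops[pc[n]] for n, ops in pipes.items() if pc[n] < len(ops)}


if __name__ == "__main__":
    print("balanced loop:", run(build(3)))
    print("missing release, 1 iteration:", run(build(1, release_after_matmul=False)))
    print("missing release, 3 iterations:", run(build(3, release_after_matmul=False)))
```

In this model a single pass completes even without the release flag, and the stall only appears once the loop tries to refill L0 a second time. That matches the observed symptom, but it does not prove that this is the actual cause on hardware.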

### Steps to Reproduce

From PR https://github.com/hw-native-sys/simpler/pull/830, run:

```bash
python examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/test_triangular_inverse.py \
  -p a2a3 \
  -d 1 \
  --case TestTriangularInverse::Case_upper_tri_matrix_size_32 \
  --skip-golden \
  --log-level debug
```

### Expected Behavior

The AIC task should complete on hardware, copy output tensors back to host, and the test should pass or at least proceed to numerical validation.

### Actual Behavior

The AIC task launches, but the AICPU stream synchronization times out. A typical failure:

```text
aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100
FAILED: run_prepared failed with code 507018
```

A debug run shows the runtime reaches AICore launch:

```text
=== launch_aicpu_kernel DynTileFwkKernelServerInit ===
=== launch_aicpu_kernel DynTileFwkKernelServer ===
=== launch_aicore_kernel ===
=== aclrtSynchronizeStreamWithTimeout stream_aicpu_ ===
```

The run then times out without the AIC task completing.

### Git Commit ID

d423e878c960e5dd3c5b65aeba5f8a82fda88e96

### CANN Version

9.0.0

### Driver Version

25.5.1

### Host Platform

Linux (aarch64)

### Additional Context

_No response_
