Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Summary
For examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py, we profiled the main activities of the AICPU cores:
(timeline produced by Noah Baumann using TracR+Perfetto)
- Initialization (green)
- DLL_loading (light lilac)
- Orchestration (pink)
- Scheduling (lilac)
- Deinitializing (light green)
We find the following:
- DLL_loading takes a non-trivial amount of time (750us on the orchestrator thread). Alternatives such as JIT-compiling the orchestration functions (statically or dynamically) could help reduce this cost.
- Orchestration (possibly building the initial graph?) takes by far the longest time, at 9900us.
- Scheduling takes very little time (100us - 170us). This may be because the operation itself is very small, but the gap to the other phases is still notable.
- Deinitialization also takes a noticeable amount of time (3050us).
The most urgent thing, in my opinion, is looking into why orchestration takes so long.
Git Commit ID
d423e87
CANN Version
9.0.0
Driver Version
25.5.1
Host Platform
Linux (aarch64)
Reproduction
python examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py -p a2a3
Expected Performance
No pre-existing expectations.
Actual Performance
For the orchestrator thread:
231us Initialization
750us DLL_loading
9900us Orchestration
3050us Deinitialization
For one of the scheduling threads:
200us - 240us Initialization
100us - 170us Scheduling
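To put the orchestrator-thread numbers above in proportion, here is a minimal sketch that computes each phase's share of the summed phase time. It assumes the four phases run back to back without overlap, which the report does not state explicitly. The names `ORCHESTRATOR_PHASES_US` and `phase_shares` are illustrative and not part of the runtime.

```python
# Orchestrator-thread phase durations in microseconds, copied from this report.
ORCHESTRATOR_PHASES_US = {
    "Initialization": 231,
    "DLL_loading": 750,
    "Orchestration": 9900,
    "Deinitialization": 3050,
}


def phase_shares(phases_us):
    """Return each phase's share of the summed phase time, in percent.

    Assumption: phases are sequential and non-overlapping, so their sum
    approximates the total profiled time of the thread.
    """
    total = sum(phases_us.values())
    return {name: 100.0 * us / total for name, us in phases_us.items()}


if __name__ == "__main__":
    total_us = sum(ORCHESTRATOR_PHASES_US.values())
    print(f"total: {total_us} us")
    for name, share in phase_shares(ORCHESTRATOR_PHASES_US).items():
        print(f"{name:<18}{ORCHESTRATOR_PHASES_US[name]:>6} us {share:5.1f}%")
```

Under that assumption, Orchestration accounts for roughly 71% of the summed time and Deinitialization for roughly 22%, which is consistent with treating orchestration as the first thing to investigate.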
Profiling Data (Optional)
The timeline is shown above.
Additional Context
No response