[Performance] Orchestration taking most of the running time #849

@SergioMartin86

Description

Platform

a2a3 (Ascend 910B/C hardware)

Runtime Variant

tensormap_and_ringbuffer

Summary

For:

examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py

We profiled the main activities of the AICPU cores:

Image (timeline produced by Noah Baumann using TracR+Perfetto)

  • Initialization (green)
  • DLL_loading (light lilac)
  • Orchestration (pink)
  • Scheduling (lilac)
  • Deinitializing (light green)

We find the following:

  • DLL_loading takes a non-trivial amount of time. Alternatives such as JIT-compiling the orchestration functions (statically or dynamically) could help reduce this cost.
  • Orchestration (possibly building the initial graph?) takes by far the longest.
  • Scheduling takes very little time. This might be because the operation itself is very small, but the time is still notably short.
  • Deinitialization also takes some time.

The most urgent thing, in my opinion, is looking into why orchestration takes so long.

Git Commit ID

d423e87

CANN Version

9.0.0

Driver Version

25.5.1

Host Platform

Linux (aarch64)

Reproduction

python examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py -p a2a3

Expected Performance

No pre-existing expectations.

Actual Performance

For the orchestrator thread:

231us Initialization
750us DLL_loading
9900us Orchestration
3050us Deinitialization

For one of the scheduling threads:

200us - 240us Initialization
100us - 170us Scheduling
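
To put the orchestrator-thread numbers above in proportion, here is a minimal sketch that computes each phase's share of the summed phase time. It assumes the four phases run back to back and that their sum approximates the thread's total running time, which the timeline suggests but this report does not measure directly. The names `orchestrator_us` and `phase_shares` are hypothetical helpers for illustration and are not part of the runtime.

```python
# Phase durations in microseconds, copied from the orchestrator-thread numbers above.
orchestrator_us = {
    "Initialization": 231,
    "DLL_loading": 750,
    "Orchestration": 9900,
    "Deinitialization": 3050,
}


def phase_shares(durations_us):
    """Return each phase's percentage of the summed phase time."""
    total = sum(durations_us.values())
    return {name: 100.0 * us / total for name, us in durations_us.items()}


shares = phase_shares(orchestrator_us)
total_us = sum(orchestrator_us.values())

print(f"sum of phases: {total_us} us")
# List the phases from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<17}{orchestrator_us[name]:>6} us  {pct:5.1f}%")
```

Under that assumption, orchestration accounts for roughly 71% of the orchestrator thread's time, deinitialization for roughly 22%, and DLL_loading for roughly 5%.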

Profiling Data (Optional)

The timeline is shown above.

Additional Context

No response

Labels

performance (Performance regression or optimization)
