How to analyze etdump results for QNN backend? #16285

liu-mengyang · 2025-12-17T01:41:21Z


I have obtained etdump results from a QNN‑LLM execution and used the Inspector API to generate the profiling table. However, I’m not sure how to interpret this table. Specifically, I have the following questions:

1. After sorting the results, I see `Accelerator (execute) time (cycles)`, `Method::execute`, and `DELEGATE_CALL` at the top. Their latencies are quite similar. Are these essentially representing the same thing?
2. The remaining entries all start with `aten_`, which I assume correspond to different operators. How can I determine which of these are running on the CPU due to fallback, and which are actually running on the NPU?
3. Is there a way to know whether each operator uses HMX or HVX inside the NPU?
4. Finally, can I extract more detailed profiling information such as memory consumption, TCM usage, or overall NPU utilization?

Any guidance or documentation pointers would be greatly appreciated. Thank you!

yujiaoliang · 2025-12-18T02:01:30Z


For QNN profiling, I would mainly look at two aspects:

1. QNN runtime profiling itself already provides useful information about NPU execution and TCM usage. By enabling the built-in runtime profiling options, you can inspect per-op execution time and memory behavior, which is usually the primary source for understanding performance and resource utilization.
2. As an additional reference, this document might be helpful:
   https://github.com/pytorch/executorch/blob/main/backends/qualcomm/debugger/README.md

   It describes some related debugging and profiling workflows around the Qualcomm backend, though you may already be familiar with it.


cccclai (Collaborator) · 2025-12-18T19:00:23Z

> After sorting the results, I see Accelerator (execute) time (cycles), Method::execute, and DELEGATE_CALL at the top. Their latencies are quite similar. Are these essentially representing the same thing?

`DELEGATE_CALL` happens inside the `Method::execute` call. If the two times are similar, the delegate call (the execution time on the HTP) dominates the total.
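To put a number on "dominant", the delegate's share of `Method::execute` can be computed from the two latencies in the profiling table. This is only a minimal sketch: `delegate_share` is a hypothetical helper, and the latencies are made-up placeholders, not values from this thread. Compare only rows that use the same time unit; the accelerator row is reported in cycles, so it is left out here.

```python
def delegate_share(method_execute_ms: float, delegate_call_ms: float) -> float:
    """Return the fraction of Method::execute time spent inside DELEGATE_CALL."""
    if method_execute_ms <= 0:
        raise ValueError("Method::execute latency must be positive")
    return delegate_call_ms / method_execute_ms


# Placeholder latencies in milliseconds (illustrative only).
share = delegate_share(method_execute_ms=100.0, delegate_call_ms=95.0)
print(f"{share:.0%} of Method::execute is spent in the delegate")
```

A share close to 100% suggests that time outside the delegate (CPU fallback operators and framework overhead) is small.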

> The remaining entries all start with aten_, which I assume correspond to different operators. How can I determine which of these are running on the CPU due to fallback, and which are actually running on the NPU?

`DELEGATE_CALL` covers everything that runs inside the HTP. Individual operator entries (like `aten_bmm`) are operators that fell back to the CPU.
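The rule of thumb above can be applied mechanically to the event names in the table. The sketch below is plain Python over hand-written rows; the row values and the `aten_softmax` name are placeholders, and `classify` is a hypothetical helper, not part of ExecuTorch. Name matching is only a heuristic. Recent ExecuTorch versions appear to expose an `is_delegated_op` field on Inspector events; if yours does, that is a more direct signal than the name prefix.

```python
from collections import defaultdict

# Hypothetical (event name, average latency) rows; the values are placeholders.
rows = [
    ("Method::execute", 100.0),
    ("DELEGATE_CALL", 95.0),
    ("aten_bmm", 2.0),
    ("aten_softmax", 1.5),
]


def classify(name: str) -> str:
    """Bucket an event name using the rule of thumb from this thread."""
    if name == "Method::execute":
        return "framework"
    if name == "DELEGATE_CALL":
        return "delegate (HTP)"
    if name.startswith("aten_"):
        return "cpu fallback"
    return "other"


# Sum the latency per bucket to see where the time goes.
totals = defaultdict(float)
for name, latency in rows:
    totals[classify(name)] += latency

for bucket, total in sorted(totals.items()):
    print(f"{bucket}: {total}")
```

Summing per bucket makes it easy to see whether CPU fallback operators contribute a meaningful share of the runtime.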

> Is there a way to know whether each operator uses HMX or HVX inside the NPU?

We can use optrace or the debugger (https://github.com/pytorch/executorch/tree/main/backends/qualcomm/debugger#qairt-visualizer), as shared by @yujiaoliang.

> Finally, can I extract more detailed profiling information such as memory consumption, TCM usage, or overall NPU utilization?

@haowhsu-quic @shewu-quic @winskuo-quic @DannyYuyang-quic do we have guidance on this?


shewu-quic (Collaborator) · 2025-12-19T01:35:32Z

Hi @yujiaoliang

> Finally, can I extract more detailed profiling information such as memory consumption, TCM usage, or overall NPU utilization?

Yes, you can dump QHAS and optrace to observe NPU utilization and TCM usage:
https://github.com/pytorch/executorch/tree/main/backends/qualcomm/debugger#2-generate-optrace-and-qhas
