-
Notifications
You must be signed in to change notification settings - Fork 767
Closed
Labels
module: qnnIssues related to Qualcomm's QNN delegate and code under backends/qualcomm/Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/partner: qualcommFor backend delegation, kernels, demo, etc. from the 3rd-party partner, QualcommFor backend delegation, kernels, demo, etc. from the 3rd-party partner, QualcommtriagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate moduleThis issue has been looked at a team member, and triaged and prioritized into an appropriate module
Description
🐛 Describe the bug
I ran Llama-v3.2-3B-Chat(precision w4a16) from ai-hub-model on a Snapdragon 8 Gen 3 device, achieving 20 tokens/s.
For comparison, I ran inference for the Llama3.2-3B model quantized to W4A16 using executorch with the QNN backend on the same device. The performance I observed was 10 tokens/s.
Could you provide insights into what might be causing this performance difference? Are there issues with how executorch handles quantized models that could explain this gap?
Any guidance or suggestions would be greatly appreciated!
Metadata
Metadata
Assignees
Labels
module: qnnIssues related to Qualcomm's QNN delegate and code under backends/qualcomm/Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/partner: qualcommFor backend delegation, kernels, demo, etc. from the 3rd-party partner, QualcommFor backend delegation, kernels, demo, etc. from the 3rd-party partner, QualcommtriagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate moduleThis issue has been looked at a team member, and triaged and prioritized into an appropriate module