support multiheadattention int8 #3940

tpoisonooo · 2022-06-21T07:23:17Z

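What this PR does

Adds an int8 kernel for MHA (multi-head attention).

GEMM weights are still quantized per-channel.
Internally, the kernel needs five input scale parameters:
- the scales of xq/xk/xv
- the scale before softmax
- the scale before multiplying by out_weight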
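
To make the scheme above concrete, here is a minimal sketch of an int8 GEMM with per-channel weight scales and a single activation scale. It is not the kernel from this PR: the function names (`quantize`, `per_channel_scales`, `int8_gemm`) are hypothetical, and the convention `scale = 127 / absmax` with `q = round(v * scale)` is an assumption chosen for illustration. `input_scale` stands in for any one of the five activation scales listed above.

```python
def quantize(values, scale):
    """Symmetric int8 quantization: q = clamp(round(v * scale), -127, 127).

    Note: Python's round() uses round-half-to-even; a real kernel may round differently.
    """
    return [max(-127, min(127, round(v * scale))) for v in values]


def per_channel_scales(weight):
    """One scale per output channel (row), so that the row's absmax maps to 127."""
    scales = []
    for row in weight:
        absmax = max(abs(v) for v in row)
        scales.append(127.0 / absmax if absmax > 0 else 1.0)
    return scales


def int8_gemm(x, weight, input_scale):
    """Compute y = W @ x with integer products, then dequantize per output channel."""
    w_scales = per_channel_scales(weight)
    xq = quantize(x, input_scale)  # one (per-tensor) scale for the activation
    out = []
    for row, w_scale in zip(weight, w_scales):
        wq = quantize(row, w_scale)
        acc = sum(a * b for a, b in zip(wq, xq))  # integer accumulator
        out.append(acc / (w_scale * input_scale))  # back to floating point
    return out


# Hypothetical toy data: 2 output channels, 3 input features.
x = [0.5, -1.0, 0.25]
weight = [[0.1, 0.2, -0.3], [1.0, -2.0, 0.5]]
y = int8_gemm(x, weight, input_scale=127.0)
print(y)  # close to the float result [-0.225, 2.625]
```

Because each weight row gets its own scale, a row with small values (the first one here) keeps its resolution even when another row has a much larger range.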

Speed comparison (WSL2 virtual machine)

1 thread

$ ./benchncnn  10 1
loop_count = 10
num_threads = 1
powersave = 0
gpu_device = -1
cooling_down = 1
  vision_transformer  min = 2955.98  max = 3130.18  avg = 3051.40
vision_transformer_int8  min = 2403.91  max = 2459.07  avg = 2431.06

8 threads

$ ./benchncnn
loop_count = 4
num_threads = 8
powersave = 0
gpu_device = -1
cooling_down = 1
  vision_transformer  min = 1175.01  max = 1575.90  avg = 1343.40
vision_transformer_int8  min = 1076.93  max = 1153.30  avg = 1109.33
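
The relative gain can be read off the `avg` columns of the two runs. The short script below does that arithmetic; the helper name `time_saved_percent` and the `runs` dictionary are hypothetical, and only the four `avg` figures come from the logs.

```python
def time_saved_percent(fp32_time, int8_time):
    """Percentage of the fp32 run time saved by the int8 model."""
    return (fp32_time - int8_time) / fp32_time * 100.0


# (fp32 avg, int8 avg) taken from the benchncnn output for each thread count.
runs = {
    "1 thread": (3051.40, 2431.06),
    "8 threads": (1343.40, 1109.33),
}

for label, (fp32_time, int8_time) in runs.items():
    saved = time_saved_percent(fp32_time, int8_time)
    ratio = fp32_time / int8_time
    print(f"{label}: {saved:.1f}% less time, {ratio:.2f}x faster")
```

By this measure the int8 model saves roughly 20% of the run time with 1 thread and roughly 17% with 8 threads.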

Softmax output comparison

Version with the three operator types mha/conv/gemm quantized directly, without bias calibration
(base) khj@khj:~/ncnn/ninjabuild/examples$ ./vision_transformer
data size 1769472
output shape whc 1000,1,1
softmax result: 65 0.978581

Floating-point version
(base) khj@khj:~/ncnn/ninjabuild/examples$ ./vision_transformer_fp32
data size 1769472
output shape whc 1000,1,1
softmax result: 65 0.985758

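Notes

PR 3911 needs to be handled first; I will rebase afterwards.
Alternatively, review this PR directly; it amounts to the same thing.

Accuracy test

PyTorch fp32 original model, full set of 50,000 images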
top-1 84.01%
top-5 97.08%

Baseline: ncnn fp32 original model; CPU inference is too slow, so only 2,000 images were run
2022-06-28 17:49:46,793 - test - INFO - accuracy_top-1 : 83.55
2022-06-28 17:49:46,799 - test - INFO - accuracy_top-5 : 97.55

Quantized conv+mha
2022-06-28 14:26:39,188 - test - INFO - accuracy_top-1 : 83.25
2022-06-28 14:26:39,194 - test - INFO - accuracy_top-5 : 97.65

Quantized conv+mha+gemm
2022-06-27 21:05:06,841 - test - INFO - accuracy_top-1 : 82.55
2022-06-27 21:05:06,844 - test - INFO - accuracy_top-5 : 97.45

Quantized conv+mha+gemm, with bias calibration
2022-06-29 12:31:18,982 - test - INFO - accuracy_top-1 : 82.80
2022-06-29 12:31:18,984 - test - INFO - accuracy_top-5 : 97.55

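Conclusion: quantizing mha+conv directly costs 0.3% top-1 accuracy; quantizing gemm directly costs a further 0.7%; bias calibration recovers 0.25%.

Overall, the naive approach gives about a 20% speed-up with a 0.75% drop in top-1 accuracy, and the model size goes from 337MB to 86MB.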
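
The per-step figures in the conclusion follow from the top-1 numbers in the logs above (all measured on the 2,000-image subset). A small sketch of that arithmetic, with the hypothetical names `top1` and `step_deltas`:

```python
# Top-1 accuracy (%) for each configuration, copied from the test logs.
top1 = [
    ("ncnn fp32 baseline", 83.55),
    ("conv+mha", 83.25),
    ("conv+mha+gemm", 82.55),
    ("conv+mha+gemm + bias calibration", 82.80),
]


def step_deltas(results):
    """Change in accuracy between each configuration and the one before it."""
    return [
        (name, round(acc - prev_acc, 2))
        for (_, prev_acc), (name, acc) in zip(results, results[1:])
    ]


for name, delta in step_deltas(top1):
    print(f"{name}: {delta:+.2f}")

total = round(top1[-1][1] - top1[0][1], 2)
print(f"total vs. baseline: {total:+.2f}")
print(f"model size ratio: {337 / 86:.2f}x smaller")
```

The three steps are -0.30, -0.70 and +0.25, which add up to the overall -0.75 quoted in the summary line.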

codecov-commenter · 2022-06-23T13:55:52Z

Codecov Report

Merging #3940 (3f1844b) into master (8c06103) will decrease coverage by 0.18%.
The diff coverage is 9.72%.

@@            Coverage Diff             @@
##           master    #3940      +/-   ##
==========================================
- Coverage   93.84%   93.65%   -0.19%     
==========================================
  Files         721      728       +7     
  Lines      175071   177009    +1938     
==========================================
+ Hits       164291   165778    +1487     
- Misses      10780    11231     +451

Impacted Files	Coverage Δ
src/layer/multiheadattention.cpp	`47.82% <9.72%> (-45.41%)`	⬇️
src/command.cpp	`72.70% <0.00%> (-14.94%)`	⬇️
src/pipeline.cpp	`58.69% <0.00%> (-2.18%)`	⬇️
src/layer/vulkan/reshape_vulkan.cpp	`92.01% <0.00%> (-2.14%)`	⬇️
src/layer/x86/cast_x86.cpp	`96.07% <0.00%> (-1.91%)`	⬇️
src/layer/vulkan/packing_vulkan.cpp	`81.70% <0.00%> (-1.88%)`	⬇️
src/layer/vulkan/permute_vulkan.cpp	`96.99% <0.00%> (-1.60%)`	⬇️
src/layer/vulkan/reorg_vulkan.cpp	`96.35% <0.00%> (-1.57%)`	⬇️
src/layer/vulkan/pixelshuffle_vulkan.cpp	`96.35% <0.00%> (-1.57%)`	⬇️
src/layer/vulkan/flatten_vulkan.cpp	`95.97% <0.00%> (-1.51%)`	⬇️
... and 49 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

nihui · 2025-04-18T07:33:13Z

ref
https://github.com/megvii-research/FQ-ViT
https://github.com/megvii-research/FQ-ViT/blob/main/models/ptq/layers.py#L209

nihui · 2025-06-23T03:26:29Z

close as mha dynamic int8 quantization now supported

tpoisonooo and others added 30 commits June 13, 2022 17:14

feat(tools/quantize): support toml

2a5a296

apply code-format changes

e8ad914

feat(tools/quantize): add .ini parser

77e6546

apply code-format changes

8e2f806

improvement(tools/quantize): add ini config

146b8ba

Merge branch 'master' of https://github.com/tencent/ncnn into ncnn-int8-toml

12f075f

Merge branch 'ncnn-int8-toml' of https://github.com/tpoisonooo/ncnn into ncnn-int8-toml

f719ee7

improvement(tools/quantize): refactor code

9863b26

apply code-format changes

1612caf

test(tools/quantize/ncnn2int8): test quant sqznet

be66fac

improvement(CMakeLists): downgrade to cxx11

ba6640d

apply code-format changes

d106fc0

Update CMakeLists.txt

fab112d

Update ncnn2table.cpp

77cf07a

Merge branch 'ncnn-int8-toml' of https://github.com/tpoisonooo/ncnn into ncnn-int8-toml

9262515

fix(CI): remove cxx17 grammar

9d473f5

fix(tools/quantize): typo

181714e

docs(ncnn2int8): add ini description

b32dd56

feat(ncnn2int8): parse mha

12bef90

feat(src/layer): add mha int8

c7641ca

apply code-format changes

f20318b

feat(src/layer): add mha int8

4de1aff

Merge branch 'master' of https://github.com/tencent/ncnn into support-mha-int8

acedd44

Merge branch 'support-mha-int8' of https://github.com/tpoisonooo/ncnn into support-mha-int8

9d743fe

feat(src/layer): mha int8 input transform

2428661

apply code-format changes

5305e50

feat(src/layer/multiheadattention): add log_int_softmax

8d276f4

Merge branch 'support-mha-int8' of https://github.com/tpoisonooo/ncnn into support-mha-int8

a560617

apply code-format changes

75061d9

feat(src/layer): log_int_softmax

30d6388

apply code-format changes

c81850e

tpoisonooo and others added 8 commits June 24, 2022 21:42

fix(lis): scale error

83e3368

fix(mha): single opr precision

58df666

improvement(mha): fp32 version using fake quant

b958cab

fix(mha): remove LIS and get good precision

0843acf

Merge branch 'support-mha-int8' of https://github.com/tpoisonooo/ncnn into support-mha-int8

527b03a

apply code-format changes

aa6e791

improvement(mha): quantize softmax output

bdf52ab

apply code-format changes

1bf72dc

tpoisonooo mentioned this pull request Jun 26, 2022

WIP: ncnn ViT int8 OpenPPL/ppq#154

Open

improvement(benchmark): clean code

9258065

tpoisonooo changed the title ~~WIP: mha int8~~ support multiheadattention int8 Jun 26, 2022

tpoisonooo changed the title ~~support multiheadattention int8~~ WIP: support multiheadattention int8 Jun 26, 2022

tpoisonooo and others added 4 commits June 26, 2022 17:46

docs(operators.md): update mha

6c7d992

revert(src/layer/mha): do not quantize softmax

3f1844b

improvement(test): add mha test

240137b

apply code-format changes

14d45ab

tpoisonooo changed the title ~~WIP: support multiheadattention int8~~ support multiheadattention int8 Jun 29, 2022

tpoisonooo mentioned this pull request Jul 28, 2022

improve vit int8 mha opr #4096

Closed

tpoisonooo and others added 6 commits July 28, 2022 18:40

fix(CI): rebase code

c9f430f

Merge branch 'support-mha-int8' of https://github.com/tpoisonooo/ncnn into support-mha-int8

66ed718

apply code-format changes

435e380

fix(CI): test mha exceeding

497dbd7

fix(src/layer/mha): miss convert weight to int8

5c5a586

apply code-format changes

8c44ccf

EdVince mentioned this pull request Jan 19, 2023

[ARM] Multiheadattention #4463

Merged

nihui closed this Jun 23, 2025
