Definition:
- Model deployment is the process of launching a trained model in a specific software/hardware environment so that it can accept inputs and return predictions
- To meet performance and efficiency requirements, the model usually has to be optimized first, e.g. via model compression and hardware acceleration
Product forms:
- Cloud, edge computing, mobile
Compute devices:
- CPU, GPU, NPU, TPU
Huge memory overhead
- Enormous parameter counts: a 7B model needs 14+ GB of memory for FP16 weights alone
- Autoregressive token generation requires caching the attention k/v, which adds a huge memory overhead of its own (see the sketch below)
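A minimal sketch of why decoding needs a k/v cache (plain PyTorch; single head, made-up `dim` and random projections, so purely illustrative): each step computes q/k/v only for the newest token and appends k, v to the cache instead of re-encoding the whole prefix.

```python
import torch

# Single-head attention decode step with a k/v cache (illustrative only).
dim = 128
wq, wk, wv = (torch.randn(dim, dim) for _ in range(3))
k_cache, v_cache = [], []    # grows by one entry per generated token

def decode_step(x):          # x: (dim,) hidden state of the newest token only
    q, k, v = x @ wq, x @ wk, x @ wv
    k_cache.append(k); v_cache.append(v)                # cache, don't recompute
    K, V = torch.stack(k_cache), torch.stack(v_cache)   # (seq, dim)
    attn = torch.softmax(q @ K.T / dim ** 0.5, dim=-1)  # attend over all cached k
    return attn @ V                                     # (dim,)

for _ in range(4):           # pretend to generate 4 tokens
    out = decode_step(torch.randn(dim))
# The cache holds seq_len x dim values per head per layer -- that is the k/v cost.
```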
Dynamic shape
- The number of requests is not fixed
- Tokens are generated one by one, so the output length is uncertain
Compared with vision models, LLM structure is simple
- Transformer architecture, mostly decoder-only
Device:
- How to cope with the huge storage requirement? How to deploy on low-memory devices (consumer GPUs, phones, etc.)?
Inference:
- How to speed up token generation
- How to handle dynamic shapes so inference can run without interruption
- How to manage and use memory effectively
Serving:
- How to improve overall system throughput?
- How to reduce response latency for individual users?
Key techniques:
- Model parallelism
- Transformer compute and memory-access optimization
- Low-bit quantization
- Continuous Batching
- Paged Attention
Solutions
- huggingface transformers
- Dedicated inference acceleration frameworks
    - Cloud
        - lmdeploy
        - vllm
        - tensorrt-llm
        - deepspeed
    - Mobile
        - llama.cpp
        - mlc-llm
LMDeploy
A full-pipeline solution for deploying LLMs on NVIDIA devices, covering model lightweighting, inference, and serving.
Inference performance
Weight FP16 + KV Cache FP16
| Model | Weights | KV Cache (tokens=2k, batch=8) | KV Cache (tokens=8k, batch=8) | KV Cache (tokens=32k, batch=8) |
|---|---|---|---|---|
| Llama 7B | 14GB | 8GB | 32GB | 128GB |
| Llama 70B | 140GB | 5GB | 20GB | 80GB |
Weight INT4 + KV Cache INT8
| Model | Weights | KV Cache (tokens=2k, batch=8) | KV Cache (tokens=8k, batch=8) | KV Cache (tokens=32k, batch=8) |
|---|---|---|---|---|
| Llama 7B | 3.5GB | 4GB | 16GB | 64GB |
| Llama 70B | 35GB | 2.5GB | 10GB | 40GB |
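The KV-cache cells can be reproduced from the architectures (assuming the usual shapes: Llama 7B uses 32 layers of MHA with 32 heads of dim 128; Llama 70B uses 80 layers with GQA and 8 KV heads):

```python
def kv_cache_gib(layers, kv_heads, head_dim, tokens, batch, bytes_per_val):
    # 2x for K and V; result in GiB
    return 2 * layers * kv_heads * head_dim * tokens * batch * bytes_per_val / 2**30

# Llama 7B: 32 layers, 32 KV heads, head_dim 128 (MHA), FP16 cache
print(kv_cache_gib(32, 32, 128, 2048, 8, 2))   # 8.0  -> the "8GB" cell
# Llama 70B: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
print(kv_cache_gib(80, 8, 128, 2048, 8, 2))    # 5.0  -> the "5GB" cell
# INT8 cache halves the bytes per value
print(kv_cache_gib(32, 32, 128, 2048, 8, 1))   # 4.0  -> the "4GB" cell
```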
Two basic concepts
- Compute-bound: most of the inference time is spent on numerical computation. In compute-bound scenarios, speed can be raised with faster hardware or faster compute units, e.g. quantizing to W8A8 and using INT8 Tensor Cores to accelerate the math.
- Memory-bound: most of the inference time is spent reading and writing data. In memory-bound scenarios, performance is generally improved by raising the compute-to-memory-access ratio (arithmetic intensity).
LLM inference is a typical memory-bound task
- Common LLMs are decoder-only. Most of the inference time goes to the token-by-token generation (decoding) phase, a classic memory-bound workload.
- As the (omitted) figure shows, the A100's FP16 peak is about 312 TFLOPS, and compute only becomes the bottleneck once the batch size reaches the order of 128. But because LLM weights are large, the KV cache also eats a lot of GPU memory, and other factors interfere (e.g. Persistent Batching), so batch sizes as large as 128 are hard to reach in practice.
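A rough roofline estimate of where the "order of 128" comes from (my arithmetic, assuming A100-80GB-class numbers: 312 TFLOPS FP16 and roughly 2 TB/s of HBM bandwidth, and ignoring KV-cache traffic):

```python
# Roofline sketch for the decoding phase (illustrative only).
peak_flops = 312e12          # FP16 Tensor Core peak, FLOP/s
mem_bw     = 2e12            # ~2 TB/s HBM bandwidth
ridge = peak_flops / mem_bw  # FLOPs needed per byte moved to saturate compute

# Per decode step, each FP16 weight (2 bytes) is read once and feeds one
# multiply-add (2 FLOPs) per sequence in the batch, so the arithmetic
# intensity is roughly 2 * batch / 2 = batch FLOPs per byte.
print(ridge)                 # ~156: compute-bound only around batch ~128-156
```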

- 4-bit weight-only quantization: quantizing the FP16 model weights to INT4 cuts the bytes read to 1/4 of the FP16 model, greatly lowering memory-access cost and speeding up decoding.
- Besides the speedup, it also saves GPU memory, so the same device can host a larger model and longer conversations.
How it is done
- LMDeploy uses the AWQ algorithm (open-sourced by MIT HAN Lab) to quantize weights to 4 bit.
- At inference time:
    - the 4-bit weights are dequantized back to FP16 inside the kernel (reads from Global Memory are still 4 bit), and the actual computation is still done in FP16;
    - compared with the more widely used GPTQ algorithm, AWQ gives faster inference and shorter quantization time.
    - (Doesn't the dequantization itself cost time?)
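A toy numpy sketch of the weight-only idea (per-tensor scale for brevity; real AWQ uses per-group scales chosen from activation statistics): store two 4-bit values per byte, then unpack and dequantize to FP16 right before the FP16 matmul.

```python
import numpy as np

# Toy 4-bit weight-only quantization round trip (illustrative only).
w = np.random.randn(4096).astype(np.float16)

scale = np.abs(w).max() / 7                      # map to signed int4 [-7, 7]
q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)

nib = (q & 0xF).astype(np.uint8)                 # two's-complement low nibbles
packed = nib[0::2] | (nib[1::2] << 4)            # two 4-bit values per byte

# "Kernel side": unpack, sign-extend, dequantize to FP16, compute in FP16 as usual.
lo = (packed & 0xF).astype(np.int8); lo[lo > 7] -= 16
hi = (packed >> 4).astype(np.int8); hi[hi > 7] -= 16
deq = np.empty_like(w); deq[0::2] = lo * scale; deq[1::2] = hi * scale
print(np.abs(w - deq).max())                     # small quantization error
```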
TurboMind is an efficient inference engine for LLMs, developed on top of NVIDIA's FasterTransformer.
- Support for LLaMa-architecture models
- Persistent / continuous batching ("token bucket"?)
- High-performance CUDA kernels
- Stateful inference (server-side caching)
- Blocked k/v cache (the caching scheme)
BlockSize = 2 × Layers × Heads × HeadDim × Seq × B, where Seq is the sequence length held by one block (default 128) and B is the byte width of one k/v value.
Example: llama-7b with a 2K sequence takes 1GB of k/v block memory (see the check below).
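Plugging llama-7b's shapes into the formula (a quick check, assuming FP16 k/v):

```python
def block_bytes(layers, heads, head_dim, seq=128, b=2):
    return 2 * layers * heads * head_dim * seq * b  # 2x for K and V

per_block = block_bytes(32, 32, 128)    # llama-7b, FP16 k/v
blocks_2k = 2048 // 128                 # 16 blocks cover a 2K sequence
print(per_block / 2**20)                # 64 MiB per block
print(per_block * blocks_2k / 2**30)    # 1.0 GiB for the whole 2K sequence
```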
Block states
- Free: not occupied by any sequence
- Active: occupied by a sequence currently being inferred
- Cached: occupied by a sequence that is cached
Block state transitions (the slide's diagram is omitted; a sketch of the idea follows):
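A hedged sketch of the state machine, as I read the scheme (not LMDeploy's actual code): blocks are allocated from Free when a request runs, parked in Cached when the session pauses so its history can be reused, and reclaimed to Free on eviction.

```python
from enum import Enum, auto

class BlockState(Enum):
    FREE = auto()      # not owned by any sequence
    ACTIVE = auto()    # owned by a sequence currently decoding
    CACHED = auto()    # owned by a paused sequence; history kept for reuse

# Plausible transitions (illustrative; the engine manages this internally):
TRANSITIONS = {
    (BlockState.FREE,   "allocate"): BlockState.ACTIVE,  # request scheduled
    (BlockState.ACTIVE, "pause"):    BlockState.CACHED,  # session suspended
    (BlockState.CACHED, "resume"):   BlockState.ACTIVE,  # session continues
    (BlockState.ACTIVE, "finish"):   BlockState.CACHED,  # turn ends, kept warm
    (BlockState.CACHED, "evict"):    BlockState.FREE,    # memory reclaimed
}

state = BlockState.FREE
for event in ["allocate", "pause", "resume", "finish", "evict"]:
    state = TRANSITIONS[(state, event)]
    print(event, "->", state.name)
```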
The overall serving architecture has three parts:
- Model inference/serving: provides inference for the model itself. It can generally be decoupled from the concrete business logic and focus purely on inference performance, and can be exposed as a model, an API, and so on.
- Client: can be thought of as the frontend, where users interact.
- API Server: usually the backend behind the frontend, providing the data and feature support that the product and service need.
In a real pipeline the three are not strictly separated.
To run inference with TurboMind, the model must first be converted to the TurboMind format. Both online and offline conversion are currently supported:
- Online conversion loads a Huggingface model directly
- Offline conversion saves the converted model first and loads it afterwards
Online conversion (convert while loading)

```bash
# needs access to hf
lmdeploy chat turbomind Qwen/Qwen-7B-Chat --model-name qwen-7b
# local model
lmdeploy chat turbomind /share/temp/model_repos/internlm-chat-7b/ --model-name internlm-chat-7b
```

Offline conversion and loading

```bash
# convert the model (FasterTransformer format) for TurboMind
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy convert internlm-chat-7b /root/share/temp/model_repos/internlm-chat-7b/
```

The output lands in the workspace folder. In it:
The parameter layers.0.attention.w_qkv.0.bias
- The first 0 is the layer index; the 0 after w_qkv is the tensor-parallel index
- With two GPUs for inference, each such parameter is split into two shards:
  layers.0.attention.w_qkv.0.weight and layers.0.attention.w_qkv.1.weight
- Tensor parallelism is specified with --tp; the default value is 1
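A small illustration (numpy; the shapes are made up for the example) of what --tp 2 implies for a fused weight like w_qkv: the tensor is split along one axis and each rank stores one shard.

```python
import numpy as np

# Hypothetical fused QKV projection: hidden=4096 in, 3*4096 out.
w_qkv = np.random.randn(4096, 3 * 4096).astype(np.float16)

tp = 2
shards = np.split(w_qkv, tp, axis=1)   # column-split across ranks

# Conceptually these shards are what would be saved as
# layers.0.attention.w_qkv.0.weight and layers.0.attention.w_qkv.1.weight
for rank, shard in enumerate(shards):
    print(f"rank {rank}: shard shape {shard.shape}")  # (4096, 6144) each
```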
```text
(base) root@intern-studio:~# tree ./workspace/
./workspace/
├── model_repository
│   ├── postprocessing -> ../triton_models/postprocessing
│   ├── preprocessing -> ../triton_models/preprocessing
│   └── turbomind -> ../triton_models/interactive
├── service_docker_up.sh
└── triton_models
    ├── interactive
    │   ├── 1
    │   │   ├── placeholder
    │   │   └── weights -> ../../weights
    │   └── config.pbtxt
    ├── postprocessing
    │   ├── 1
    │   │   ├── __pycache__
    │   │   │   └── model.cpython-310.pyc
    │   │   ├── model.py
    │   │   └── tokenizer -> ../../tokenizer
    │   └── config.pbtxt
    ├── preprocessing
    │   ├── 1
    │   │   ├── __pycache__
    │   │   │   └── model.cpython-310.pyc
    │   │   ├── model.py
    │   │   └── tokenizer -> ../../tokenizer
    │   └── config.pbtxt
    ├── tokenizer
    │   ├── config.json
    │   ├── configuration.json
    │   ├── configuration_internlm.py
    │   ├── generation_config.json
    │   ├── modeling_internlm.py
    │   ├── placeholder
    │   ├── pytorch_model.bin.index.json
    │   ├── special_tokens_map.json
    │   ├── tokenization_internlm.py
    │   ├── tokenizer.model
    │   ├── tokenizer.py
    │   └── tokenizer_config.json
    └── weights
        ├── config.ini
        ├── layers.0.attention.w_qkv.0.bias
        ├── layers.0.attention.w_qkv.0.weight
        ├── layers.0.attention.wo.0.bias
        ├── layers.0.attention.wo.0.weight
        ├── layers.0.attention_norm.weight
        ├── layers.0.feed_forward.w1.0.weight
        ├── layers.0.feed_forward.w2.0.weight
        ├── layers.0.feed_forward.w3.0.weight
        ├── layers.0.ffn_norm.weight
        ├── ...              (layers.1 through layers.31: the same 9 files per layer)
        ├── norm.weight
        ├── output.weight
        └── tok_embeddings.weight

18 directories, 313 files
```
**Tensor parallelism** (set with --tp, see above)

Bash local chat: skip the API Server and call TurboMind directly. (The pytorch and DeepSpeed backends are currently weak on features.)

```bash
# Turbomind + Bash Local Chat
lmdeploy chat turbomind ./workspace
```
Server side

```bash
# ApiServer + Turbomind: api_server => AsyncEngine => TurboMind
lmdeploy serve api_server ./workspace \
    --server_name 0.0.0.0 \
    --server_port 23333 \
    --instance_num 64 \
    --tp 1
# --instance_num sets the batch size
```

Terminal client
```bash
# ChatApiClient + ApiServer (note this is plain HTTP, so the URL needs the http:// prefix)
lmdeploy serve api_client http://localhost:23333
```

If you are working in a remote environment, you can forward the server port to your local machine:
```bash
ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p <your ssh port>
```

Gradio
```bash
lmdeploy serve gradio http://0.0.0.0:23333 \
    --server_name 0.0.0.0 \
    --server_port 6006 \
    --restful_api True
```

Calling TurboMind from Python:

```python
from lmdeploy import turbomind as tm
# load model
model_path = "/root/share/temp/model_repos/internlm-chat-7b/"
tm_model = tm.TurboMind.from_pretrained(model_path, model_name='internlm-chat-7b')
generator = tm_model.create_instance()

# process query
query = "ๆไธๅๅฅๅฅฝ"  # "What should I eat tonight?"
prompt = tm_model.model.get_prompt(query)
input_ids = tm_model.tokenizer.encode(prompt)

# inference
for outputs in generator.stream_infer(
        session_id=0,
        input_ids=[input_ids]):
    res, tokens = outputs[0]

response = tm_model.tokenizer.decode(res.tolist())
print(response)
```

Benchmarks
- Scenario 1 (first 4 charts): fixed input and output token counts (1 and 2048 respectively), measuring output token throughput.
- Scenario 2 (chart 5): real data, measuring request throughput.
Comparison

```text
(lmdeploy) root@intern-studio:~# python infer_compare.py hf
Loading checkpoint shards: 100%|███████████████████████████| 8/8 [00:10<00:00,  1.29s/it]
hf ่ๆถ 29.50็ง 40 ๅญ/็ง
[TM][INFO] [Forward] [0, 1), dc_bsz = 0, pf_bsz = 1, n_tok = 113, max_q = 113, max_k = 113
[TM][INFO] ------------------------- step = 120 -------------------------
...
[TM][INFO] [Interrupt] slot = 0, id = 0
[TM][INFO] [forward] Request complete for 0, code 0
lmdeploy ่ๆถ 10.35็ง 109 ๅญ/็ง
[TM][INFO] ~LlamaBatch()
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [OutputThreadEntry] stop requested.
```

That is: HF transformers took 29.50 s (40 chars/s), lmdeploy took 10.35 s (109 chars/s).

- web api: TurboMind inference + API service
- demo: TurboMind inference + Gradio
- python project: TurboMind inference + Python
Model attributes and data types in the config must not be modified.
```text
(lmdeploy) root@intern-studio:~# cat ./workspace/triton_models/weights/config.ini
[llama]
model_name = internlm-chat-7b ; model attribute
tensor_para_size = 1
head_num = 32 ; model attribute
kv_head_num = 32 ; model attribute
vocab_size = 103168 ; model attribute
num_layer = 32 ; model attribute
inter_size = 11008 ; model attribute
norm_eps = 1e-06 ; model attribute
attn_bias = 1 ; model attribute
start_id = 1 ; model attribute
end_id = 2 ; model attribute
session_len = 2056
weight_type = fp16 ; data type
rotary_embedding = 128 ; model attribute
rope_theta = 10000.0
size_per_head = 128 ; model attribute
group_size = 0 ; data type
max_batch_size = 64
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
rope_scaling_factor = 0.0
use_logn_attn = 0
```

A few parameters may need adjusting:
quant_policy:
- The KV int8 switch. KV Cache quantization quantizes the K and V produced while generating a sequence, to save GPU memory.
- The default value 0 means KV Cache quantization is disabled; set the parameter to 4 to enable it.
- Recommended when GPU memory is tight or sequences are long.
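A toy per-tensor int8 round trip to show the saving (illustrative only; the real scheme may use per-head or per-channel scales calibrated offline):

```python
import numpy as np

def quant_int8(x):
    scale = np.abs(x).max() / 127                  # symmetric per-tensor scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequant_int8(q, scale):
    return q.astype(np.float16) * scale

k = np.random.randn(128).astype(np.float16)        # one cached K vector
q, s = quant_int8(k)                               # stored as int8: half the FP16 bytes
print(np.abs(k - dequant_int8(q, s)).max())        # small reconstruction error
```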
rope_scaling_factor:
- The extrapolation switch: whether the model can keep generating when the inference context exceeds the maximum length seen in training.
- The default value 0.0 means no extrapolation; setting it to 1.0 enables RoPE's Dynamic NTK feature, which supports long-text inference.
- Turn extrapolation on when the inference text is very long, i.e. clearly beyond the training-time maximum length.
Without extrapolation, quality drops sharply once the context exceeds the training-time maximum; with it, the drop is far less pronounced, although going far past the limit still hurts quality.
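A sketch of the Dynamic NTK idea, paraphrasing the variant used in e.g. Huggingface's dynamic NTK rotary embedding (treat the exact formula as an assumption): once the current sequence outgrows the trained length, the RoPE base is enlarged so the rotary frequencies stretch to cover it.

```python
def dynamic_ntk_base(seq_len, base=10000.0, dim=128,
                     max_pos=2048, scaling_factor=1.0):
    # Within the trained length, RoPE is unchanged.
    if seq_len <= max_pos:
        return base
    # Stretch the base so frequencies cover the longer context.
    factor = scaling_factor * seq_len / max_pos - (scaling_factor - 1)
    return base * factor ** (dim / (dim - 2))

print(dynamic_ntk_base(2048))   # 10000.0 (unchanged)
print(dynamic_ntk_base(8192))   # larger base -> slower-rotating frequencies
```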
use_logn_attn:
- LogN attention scaling. The default value is 0; change it to 1 to enable.
max_batch_size:
- The batch size.
- Defaults to 64; it is also what the API Server's instance_num argument sets at startup.
- Larger values give higher throughput (more requests accepted at once) but occupy more GPU memory. Tune it according to the request volume, the largest context length, and the actual situation, e.g. with the estimate below.
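A back-of-the-envelope link between batch size and the k/v budget, reusing the block arithmetic above (assumptions: an 80 GB card, FP16 k/v, and my reading that cache_max_entry_count = 0.5 caps the k/v cache at half the memory):

```python
# How many full-length 2K sessions fit in the k/v budget? (rough estimate)
kv_budget_gib = 80 * 0.5       # assumed cap: half of an 80 GiB card
per_session_gib = 1.0          # llama-7b, 2K tokens, FP16 k/v (computed earlier)
print(int(kv_budget_gib / per_session_gib))  # ~40 concurrent 2K sessions
```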