
Quantization and Deployment with LMDeploy

Background: Deploying Large Models

Deployment

Definition:

  • The process of launching a trained model in a specific software/hardware environment so that it can receive inputs and return predictions
  • To meet performance and efficiency requirements, the model often needs optimization, e.g., model compression and hardware acceleration

Product forms:

  • Cloud, edge computing, mobile

Compute devices:

  • CPU, GPU, NPU, TPU

Characteristics of Large Models

Huge memory overhead

  • Enormous parameter counts: a 7B model needs 14+ GB of memory for its weights alone
  • Tokens are generated autoregressively, which requires caching the attention k/v, adding a huge memory overhead

Dynamic shapes

  • The number of requests is not fixed
  • Tokens are generated one by one, in unpredictable numbers

Compared with vision models, LLM architectures are simple

  • Transformer architecture, mostly decoder-only

Challenges in Deploying Large Models

Devices:

  • How to cope with the huge storage requirement? How to deploy on low-memory devices (consumer GPUs, phones, etc.)?

Inference:

  • How to speed up token generation
  • How to handle dynamic shapes so that inference can run without interruption
  • How to manage and utilize memory effectively

Serving:

  • How to raise overall system throughput?
  • How to reduce response latency for individual users?

Key techniques:

  • Model parallelism
  • Transformer compute and memory-access optimization
  • Low-bit quantization
  • Continuous batching
  • Paged attention

Solutions

  • huggingface transformers
  • Dedicated inference acceleration frameworks
    • Cloud
      • lmdeploy
      • vllm
      • tensorrt-llm
      • deepspeed
    • Mobile
      • llama.cpp
      • mlc-llm

LMDeploy

A full-pipeline solution for deploying LLMs on NVIDIA devices: model compression, inference, and serving.

image

ๆŽจ็†ๆ€ง่ƒฝ

image

Quantization

Weight FP16 + KV Cache FP16

| Model     | Weights | KV Cache (tokens=2k, batch=8) | KV Cache (tokens=8k, batch=8) | KV Cache (tokens=32k, batch=8) |
|-----------|---------|-------------------------------|-------------------------------|--------------------------------|
| Llama 7B  | 14GB    | 8GB                           | 32GB                          | 128GB                          |
| Llama 70B | 140GB   | 5GB                           | 20GB                          | 80GB                           |

Weight INT4 + KV Cache INT8

| Model     | Weights | KV Cache (tokens=2k, batch=8) | KV Cache (tokens=8k, batch=8) | KV Cache (tokens=32k, batch=8) |
|-----------|---------|-------------------------------|-------------------------------|--------------------------------|
| Llama 7B  | 3.5GB   | 4GB                           | 16GB                          | 64GB                           |
| Llama 70B | 35GB    | 2.5GB                         | 10GB                          | 40GB                           |
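
The table entries can be sanity-checked with the standard KV-cache size formula (the 70B numbers assume Llama 2 70B's grouped-query attention with 8 KV heads, which is why its cache is smaller than the 7B one despite more layers):

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * tokens * batch * bytes
def kv_cache_gb(layers, kv_heads, head_dim, tokens, batch, bytes_per_val):
    return 2 * layers * kv_heads * head_dim * tokens * batch * bytes_per_val / 1e9

# Llama 7B:  32 layers, 32 KV heads, head_dim 128, fp16 cache
print(kv_cache_gb(32, 32, 128, 2048, 8, 2))   # ~8.6 GB -> the "8GB" entry
# Llama 70B: 80 layers, 8 KV heads (GQA), head_dim 128, fp16 cache
print(kv_cache_gb(80, 8, 128, 2048, 8, 2))    # ~5.4 GB -> the "5GB" entry
```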

Weight-Only Quantization

Two basic concepts

  • Compute-bound: most of the inference time is spent on numerical computation. For compute-bound scenarios, speed can be improved by using faster hardware compute units, e.g., quantizing to W8A8 and using INT8 Tensor Cores to accelerate the math.
  • Memory-bound: most of the inference time is spent reading data. For memory-bound scenarios, performance is generally improved by raising the compute-to-memory-access ratio.

LLMs are a typical memory-bound workload

  • Common LLMs use a decoder-only architecture. Most inference time is spent in the token-by-token generation stage (the decoding stage), a typical memory-bound scenario.

As the figure shows, the A100's FP16 peak compute is 312 TFLOPS, and computation only becomes the inference bottleneck once the batch size reaches the order of 128. But because the LLM itself is large, the KV cache also takes up a lot of GPU memory during inference, and other factors (such as persistent batching) get in the way, so in practice it is hard to reach a batch size as large as 128. The sketch below works through this balance point.

image
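
As a rough, hedged illustration of why decoding stays memory-bound at realistic batch sizes (the bandwidth figure is an assumption for the A100 80GB; exact numbers vary by SKU):

```python
# Back-of-envelope roofline for A100 decoding.
peak_flops = 312e12   # FP16 Tensor Core peak, FLOP/s
mem_bw = 2e12         # HBM bandwidth, bytes/s (~2 TB/s on A100 80GB; assumption)

# During decoding each new token reads every weight once (a GEMV per request).
# With batch size B, each 2-byte fp16 weight element contributes 2*B FLOPs,
# i.e. the arithmetic intensity is about B FLOPs per byte read.
critical_batch = peak_flops / mem_bw
print(critical_batch)  # ~156: below this batch size, decoding is memory-bound
```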

  • 4-bit weight-only quantization converts the FP16 model weights to INT4, cutting the memory traffic to 1/4 of the FP16 model's, which greatly reduces the memory-access cost and speeds up decoding.
  • Alongside the speedup, it also saves GPU memory, so the same device can host larger models and longer conversation lengths.

Memory-bound workload ???

ๅฆ‚ไฝ•ๅš

  • LMDeploy ไฝฟ็”จ MIT HAN LAB ๅผ€ๆบ็š„ AWQ ็ฎ—ๆณ•๏ผŒ้‡ๅŒ–ไธบ 4bit ๆจกๅž‹๏ผ›
  • ๆŽจ็†ๆ—ถ๏ผŒๅ…ˆๆŠŠ 4bit ๆƒ้‡๏ผŒๅ้‡ๅŒ–ๅ›ž FP16๏ผˆๅœจ Kernelๅ†…้ƒจ่ฟ›่กŒ๏ผŒไปŽ Global Memory ่ฏปๅ–ๆ—ถไปๆ˜ฏ 4bit๏ผ‰๏ผŒไพๆ—งไฝฟ็”จ็š„ๆ˜ฏ FP16 ่ฎก็ฎ— ็›ธ่พƒไบŽ็คพๅŒบไฝฟ็”จๆฏ”่พƒๅคš็š„ GPTQ ็ฎ—ๆณ•๏ผŒAWQ ็š„ๆŽจ็†้€Ÿๅบฆๆ›ดๅฟซ๏ผŒ้‡ๅŒ–็š„ๆ—ถ้—ดๆ›ด็Ÿญ

image

ๅ้‡ๅŒ–ไธ้œ€่ฆๆ—ถ้—ด๏ผŸ

TurboMind

TurboMind is an efficient inference engine for LLMs, developed on top of NVIDIA's FasterTransformer. Its main features:

  • Support for LLaMA-architecture models
  • Continuous batching (token bucket?) — see the toy sketch after this list
  • High-performance CUDA kernels
  • Stateful inference (server-side caching?)
  • Blocked k/v cache (cache eviction algorithm?)
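
A toy sketch of the continuous-batching idea: iteration-level scheduling, where finished sequences leave the batch and queued requests join between decoding steps, so the engine never waits for the longest sequence. All names here are illustrative, not LMDeploy's API:

```python
from collections import deque

def serve(requests, max_batch_size, decode_step):
    """decode_step(batch) runs one decoding iteration and marks finished items."""
    queue, active = deque(requests), []
    while queue or active:
        # Admit waiting requests whenever a slot frees up.
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        decode_step(active)                        # one step for the whole batch
        active = [r for r in active if not r.finished]
```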

Blocked k/v cache

BlockSize = 2 × Layers × Heads × HeadDim × Seq × B, where the leading 2 accounts for K and V, Seq is the sequence length held in one block (default 128), and B is the number of bytes per value for the k/v precision.

For llama-7b at a 2K sequence length, the k/v blocks take about 1 GiB of memory, as the sketch below verifies.
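
A quick check of the formula for the figure quoted above:

```python
# llama-7b: 32 layers, 32 heads, head_dim 128; fp16 k/v means B = 2 bytes.
layers, heads, head_dim, bytes_per_val = 32, 32, 128, 2
seq = 2048  # a full 2K context spread over blocks
kv_bytes = 2 * layers * heads * head_dim * seq * bytes_per_val  # 2 = K and V
print(kv_bytes / 2**30)  # 1.0 GiB, matching the number above
```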

image

Block states

  • Free: not held by any sequence
  • Active: held by a sequence currently being decoded
  • Cached: held by a cached (suspended) sequence

Block็Šถๆ€่ฟ็งป

image
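
A hypothetical sketch of the transition rules, reconstructed from the figure; the real TurboMind logic may differ in details:

```python
from enum import Enum, auto

class BlockState(Enum):
    FREE = auto()    # held by no sequence
    ACTIVE = auto()  # held by a sequence currently being decoded
    CACHED = auto()  # held by a suspended (cached) sequence

TRANSITIONS = {
    (BlockState.FREE, "allocate"): BlockState.ACTIVE,   # new tokens need a block
    (BlockState.ACTIVE, "suspend"): BlockState.CACHED,  # session pauses, k/v kept
    (BlockState.CACHED, "resume"): BlockState.ACTIVE,   # session continues
    (BlockState.CACHED, "evict"): BlockState.FREE,      # reclaimed under memory pressure
}
```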

High-performance CUDA kernels

image

API Server

image

  • ๆจกๅž‹ๆŽจ็†/ๆœๅŠก: ไธป่ฆๆไพ›ๆจกๅž‹ๆœฌ่บซ็š„ๆŽจ็†๏ผŒไธ€่ˆฌๆฅ่ฏดๅฏไปฅๅ’Œๅ…ทไฝ“ไธšๅŠก่งฃ่€ฆ๏ผŒไธ“ๆณจๆจกๅž‹ๆŽจ็†ๆœฌ่บซๆ€ง่ƒฝ็š„ไผ˜ๅŒ–ใ€‚ๅฏไปฅไปฅๆจกๅ—ใ€API็ญ‰ๅคš็งๆ–นๅผๆไพ›ใ€‚

  • Client: ๅฏไปฅ็†่งฃไธบๅ‰็ซฏ๏ผŒไธŽ็”จๆˆทไบคไบ’็š„ๅœฐๆ–นใ€‚

  • API Server: ไธ€่ˆฌไฝœไธบๅ‰็ซฏ็š„ๅŽ็ซฏ๏ผŒๆไพ›ไธŽไบงๅ“ๅ’ŒๆœๅŠก็›ธๅ…ณ็š„ๆ•ฐๆฎๅ’ŒๅŠŸ่ƒฝๆ”ฏๆŒใ€‚

    ไธ‰ไธชๆต็จ‹ไธไธ€ๅฎšไธฅๆ ผๅŒบๅˆ†ใ€‚

TurboMind ๆŽจ็†ๆจกๅž‹

ไฝฟ็”จ TurboMind ๆŽจ็†ๆจกๅž‹้œ€่ฆๅ…ˆๅฐ†ๆจกๅž‹่ฝฌๅŒ–ไธบ TurboMind ็š„ๆ ผๅผ๏ผŒ็›ฎๅ‰ๆ”ฏๆŒๅœจ็บฟ่ฝฌๆขๅ’Œ็ฆป็บฟ่ฝฌๆขไธค็งๅฝขๅผ๏ผš

  • ๅœจ็บฟ่ฝฌๆขๅฏไปฅ็›ดๆŽฅๅŠ ่ฝฝ Huggingface ๆจกๅž‹
  • ็ฆป็บฟ่ฝฌๆข้œ€้œ€่ฆๅ…ˆไฟๅญ˜ๆจกๅž‹ๅ†ๅŠ ่ฝฝ

ๅŠ ่ฝฝๅŽๅ†่ฝฌๆข

#  ้œ€่ฆ่ฎฟ้—ฎ hf
lmdeploy chat turbomind Qwen/Qwen-7B-Chat --model-name qwen-7b
# ๆœฌๅœฐ
lmdeploy chat turbomind /share/temp/model_repos/internlm-chat-7b/  --model-name internlm-chat-7b

็บฟ่ฝฌๅ†ๅŠ ่ฝฝ

# ่ฝฌๆขๆจกๅž‹๏ผˆFastTransformerๆ ผๅผ๏ผ‰ TurboMind
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy convert internlm-chat-7b  /root/share/temp/model_repos/internlm-chat-7b/

The output is written to the workspace folder. Taking the parameter layers.0.attention.w_qkv.0.bias as an example:

  • The first 0 is the layer index; the second 0 is the tensor-parallel index
  • With two GPUs used for inference, the same parameter is split into two shards: layers.0.attention.w_qkv.0.weight and layers.0.attention.w_qkv.1.weight
  • Tensor parallelism is set with --tp, which defaults to 1
(base) root@intern-studio:~# tree ./workspace/
./workspace/
โ”œโ”€โ”€ model_repository
โ”‚   โ”œโ”€โ”€ postprocessing -> ../triton_models/postprocessing
โ”‚   โ”œโ”€โ”€ preprocessing -> ../triton_models/preprocessing
โ”‚   โ””โ”€โ”€ turbomind -> ../triton_models/interactive
โ”œโ”€โ”€ service_docker_up.sh
โ””โ”€โ”€ triton_models
    โ”œโ”€โ”€ interactive
    โ”‚   โ”œโ”€โ”€ 1
    โ”‚   โ”‚   โ”œโ”€โ”€ placeholder
    โ”‚   โ”‚   โ””โ”€โ”€ weights -> ../../weights
    โ”‚   โ””โ”€โ”€ config.pbtxt
    โ”œโ”€โ”€ postprocessing
    โ”‚   โ”œโ”€โ”€ 1
    โ”‚   โ”‚   โ”œโ”€โ”€ __pycache__
    โ”‚   โ”‚   โ”‚   โ””โ”€โ”€ model.cpython-310.pyc
    โ”‚   โ”‚   โ”œโ”€โ”€ model.py
    โ”‚   โ”‚   โ””โ”€โ”€ tokenizer -> ../../tokenizer
    โ”‚   โ””โ”€โ”€ config.pbtxt
    โ”œโ”€โ”€ preprocessing
    โ”‚   โ”œโ”€โ”€ 1
    โ”‚   โ”‚   โ”œโ”€โ”€ __pycache__
    โ”‚   โ”‚   โ”‚   โ””โ”€โ”€ model.cpython-310.pyc
    โ”‚   โ”‚   โ”œโ”€โ”€ model.py
    โ”‚   โ”‚   โ””โ”€โ”€ tokenizer -> ../../tokenizer
    โ”‚   โ””โ”€โ”€ config.pbtxt
    โ”œโ”€โ”€ tokenizer
    โ”‚   โ”œโ”€โ”€ config.json
    โ”‚   โ”œโ”€โ”€ configuration.json
    โ”‚   โ”œโ”€โ”€ configuration_internlm.py
    โ”‚   โ”œโ”€โ”€ generation_config.json
    โ”‚   โ”œโ”€โ”€ modeling_internlm.py
    โ”‚   โ”œโ”€โ”€ placeholder
    โ”‚   โ”œโ”€โ”€ pytorch_model.bin.index.json
    โ”‚   โ”œโ”€โ”€ special_tokens_map.json
    โ”‚   โ”œโ”€โ”€ tokenization_internlm.py
    โ”‚   โ”œโ”€โ”€ tokenizer.model
    โ”‚   โ”œโ”€โ”€ tokenizer.py
    โ”‚   โ””โ”€โ”€ tokenizer_config.json
    โ””โ”€โ”€ weights
        โ”œโ”€โ”€ config.ini
        โ”œโ”€โ”€ layers.0.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.0.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.0.attention.wo.0.bias
        โ”œโ”€โ”€ layers.0.attention.wo.0.weight
        โ”œโ”€โ”€ layers.0.attention_norm.weight
        โ”œโ”€โ”€ layers.0.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.0.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.0.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.0.ffn_norm.weight
        โ”œโ”€โ”€ layers.1.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.1.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.1.attention.wo.0.bias
        โ”œโ”€โ”€ layers.1.attention.wo.0.weight
        โ”œโ”€โ”€ layers.1.attention_norm.weight
        โ”œโ”€โ”€ layers.1.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.1.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.1.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.1.ffn_norm.weight
        โ”œโ”€โ”€ layers.10.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.10.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.10.attention.wo.0.bias
        โ”œโ”€โ”€ layers.10.attention.wo.0.weight
        โ”œโ”€โ”€ layers.10.attention_norm.weight
        โ”œโ”€โ”€ layers.10.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.10.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.10.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.10.ffn_norm.weight
        โ”œโ”€โ”€ layers.11.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.11.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.11.attention.wo.0.bias
        โ”œโ”€โ”€ layers.11.attention.wo.0.weight
        โ”œโ”€โ”€ layers.11.attention_norm.weight
        โ”œโ”€โ”€ layers.11.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.11.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.11.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.11.ffn_norm.weight
        โ”œโ”€โ”€ layers.12.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.12.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.12.attention.wo.0.bias
        โ”œโ”€โ”€ layers.12.attention.wo.0.weight
        โ”œโ”€โ”€ layers.12.attention_norm.weight
        โ”œโ”€โ”€ layers.12.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.12.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.12.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.12.ffn_norm.weight
        โ”œโ”€โ”€ layers.13.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.13.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.13.attention.wo.0.bias
        โ”œโ”€โ”€ layers.13.attention.wo.0.weight
        โ”œโ”€โ”€ layers.13.attention_norm.weight
        โ”œโ”€โ”€ layers.13.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.13.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.13.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.13.ffn_norm.weight
        โ”œโ”€โ”€ layers.14.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.14.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.14.attention.wo.0.bias
        โ”œโ”€โ”€ layers.14.attention.wo.0.weight
        โ”œโ”€โ”€ layers.14.attention_norm.weight
        โ”œโ”€โ”€ layers.14.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.14.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.14.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.14.ffn_norm.weight
        โ”œโ”€โ”€ layers.15.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.15.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.15.attention.wo.0.bias
        โ”œโ”€โ”€ layers.15.attention.wo.0.weight
        โ”œโ”€โ”€ layers.15.attention_norm.weight
        โ”œโ”€โ”€ layers.15.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.15.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.15.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.15.ffn_norm.weight
        โ”œโ”€โ”€ layers.16.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.16.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.16.attention.wo.0.bias
        โ”œโ”€โ”€ layers.16.attention.wo.0.weight
        โ”œโ”€โ”€ layers.16.attention_norm.weight
        โ”œโ”€โ”€ layers.16.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.16.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.16.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.16.ffn_norm.weight
        โ”œโ”€โ”€ layers.17.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.17.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.17.attention.wo.0.bias
        โ”œโ”€โ”€ layers.17.attention.wo.0.weight
        โ”œโ”€โ”€ layers.17.attention_norm.weight
        โ”œโ”€โ”€ layers.17.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.17.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.17.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.17.ffn_norm.weight
        โ”œโ”€โ”€ layers.18.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.18.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.18.attention.wo.0.bias
        โ”œโ”€โ”€ layers.18.attention.wo.0.weight
        โ”œโ”€โ”€ layers.18.attention_norm.weight
        โ”œโ”€โ”€ layers.18.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.18.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.18.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.18.ffn_norm.weight
        โ”œโ”€โ”€ layers.19.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.19.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.19.attention.wo.0.bias
        โ”œโ”€โ”€ layers.19.attention.wo.0.weight
        โ”œโ”€โ”€ layers.19.attention_norm.weight
        โ”œโ”€โ”€ layers.19.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.19.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.19.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.19.ffn_norm.weight
        โ”œโ”€โ”€ layers.2.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.2.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.2.attention.wo.0.bias
        โ”œโ”€โ”€ layers.2.attention.wo.0.weight
        โ”œโ”€โ”€ layers.2.attention_norm.weight
        โ”œโ”€โ”€ layers.2.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.2.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.2.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.2.ffn_norm.weight
        โ”œโ”€โ”€ layers.20.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.20.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.20.attention.wo.0.bias
        โ”œโ”€โ”€ layers.20.attention.wo.0.weight
        โ”œโ”€โ”€ layers.20.attention_norm.weight
        โ”œโ”€โ”€ layers.20.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.20.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.20.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.20.ffn_norm.weight
        โ”œโ”€โ”€ layers.21.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.21.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.21.attention.wo.0.bias
        โ”œโ”€โ”€ layers.21.attention.wo.0.weight
        โ”œโ”€โ”€ layers.21.attention_norm.weight
        โ”œโ”€โ”€ layers.21.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.21.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.21.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.21.ffn_norm.weight
        โ”œโ”€โ”€ layers.22.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.22.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.22.attention.wo.0.bias
        โ”œโ”€โ”€ layers.22.attention.wo.0.weight
        โ”œโ”€โ”€ layers.22.attention_norm.weight
        โ”œโ”€โ”€ layers.22.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.22.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.22.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.22.ffn_norm.weight
        โ”œโ”€โ”€ layers.23.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.23.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.23.attention.wo.0.bias
        โ”œโ”€โ”€ layers.23.attention.wo.0.weight
        โ”œโ”€โ”€ layers.23.attention_norm.weight
        โ”œโ”€โ”€ layers.23.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.23.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.23.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.23.ffn_norm.weight
        โ”œโ”€โ”€ layers.24.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.24.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.24.attention.wo.0.bias
        โ”œโ”€โ”€ layers.24.attention.wo.0.weight
        โ”œโ”€โ”€ layers.24.attention_norm.weight
        โ”œโ”€โ”€ layers.24.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.24.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.24.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.24.ffn_norm.weight
        โ”œโ”€โ”€ layers.25.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.25.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.25.attention.wo.0.bias
        โ”œโ”€โ”€ layers.25.attention.wo.0.weight
        โ”œโ”€โ”€ layers.25.attention_norm.weight
        โ”œโ”€โ”€ layers.25.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.25.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.25.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.25.ffn_norm.weight
        โ”œโ”€โ”€ layers.26.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.26.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.26.attention.wo.0.bias
        โ”œโ”€โ”€ layers.26.attention.wo.0.weight
        โ”œโ”€โ”€ layers.26.attention_norm.weight
        โ”œโ”€โ”€ layers.26.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.26.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.26.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.26.ffn_norm.weight
        โ”œโ”€โ”€ layers.27.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.27.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.27.attention.wo.0.bias
        โ”œโ”€โ”€ layers.27.attention.wo.0.weight
        โ”œโ”€โ”€ layers.27.attention_norm.weight
        โ”œโ”€โ”€ layers.27.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.27.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.27.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.27.ffn_norm.weight
        โ”œโ”€โ”€ layers.28.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.28.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.28.attention.wo.0.bias
        โ”œโ”€โ”€ layers.28.attention.wo.0.weight
        โ”œโ”€โ”€ layers.28.attention_norm.weight
        โ”œโ”€โ”€ layers.28.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.28.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.28.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.28.ffn_norm.weight
        โ”œโ”€โ”€ layers.29.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.29.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.29.attention.wo.0.bias
        โ”œโ”€โ”€ layers.29.attention.wo.0.weight
        โ”œโ”€โ”€ layers.29.attention_norm.weight
        โ”œโ”€โ”€ layers.29.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.29.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.29.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.29.ffn_norm.weight
        โ”œโ”€โ”€ layers.3.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.3.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.3.attention.wo.0.bias
        โ”œโ”€โ”€ layers.3.attention.wo.0.weight
        โ”œโ”€โ”€ layers.3.attention_norm.weight
        โ”œโ”€โ”€ layers.3.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.3.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.3.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.3.ffn_norm.weight
        โ”œโ”€โ”€ layers.30.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.30.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.30.attention.wo.0.bias
        โ”œโ”€โ”€ layers.30.attention.wo.0.weight
        โ”œโ”€โ”€ layers.30.attention_norm.weight
        โ”œโ”€โ”€ layers.30.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.30.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.30.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.30.ffn_norm.weight
        โ”œโ”€โ”€ layers.31.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.31.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.31.attention.wo.0.bias
        โ”œโ”€โ”€ layers.31.attention.wo.0.weight
        โ”œโ”€โ”€ layers.31.attention_norm.weight
        โ”œโ”€โ”€ layers.31.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.31.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.31.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.31.ffn_norm.weight
        โ”œโ”€โ”€ layers.4.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.4.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.4.attention.wo.0.bias
        โ”œโ”€โ”€ layers.4.attention.wo.0.weight
        โ”œโ”€โ”€ layers.4.attention_norm.weight
        โ”œโ”€โ”€ layers.4.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.4.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.4.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.4.ffn_norm.weight
        โ”œโ”€โ”€ layers.5.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.5.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.5.attention.wo.0.bias
        โ”œโ”€โ”€ layers.5.attention.wo.0.weight
        โ”œโ”€โ”€ layers.5.attention_norm.weight
        โ”œโ”€โ”€ layers.5.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.5.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.5.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.5.ffn_norm.weight
        โ”œโ”€โ”€ layers.6.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.6.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.6.attention.wo.0.bias
        โ”œโ”€โ”€ layers.6.attention.wo.0.weight
        โ”œโ”€โ”€ layers.6.attention_norm.weight
        โ”œโ”€โ”€ layers.6.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.6.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.6.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.6.ffn_norm.weight
        โ”œโ”€โ”€ layers.7.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.7.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.7.attention.wo.0.bias
        โ”œโ”€โ”€ layers.7.attention.wo.0.weight
        โ”œโ”€โ”€ layers.7.attention_norm.weight
        โ”œโ”€โ”€ layers.7.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.7.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.7.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.7.ffn_norm.weight
        โ”œโ”€โ”€ layers.8.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.8.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.8.attention.wo.0.bias
        โ”œโ”€โ”€ layers.8.attention.wo.0.weight
        โ”œโ”€โ”€ layers.8.attention_norm.weight
        โ”œโ”€โ”€ layers.8.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.8.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.8.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.8.ffn_norm.weight
        โ”œโ”€โ”€ layers.9.attention.w_qkv.0.bias
        โ”œโ”€โ”€ layers.9.attention.w_qkv.0.weight
        โ”œโ”€โ”€ layers.9.attention.wo.0.bias
        โ”œโ”€โ”€ layers.9.attention.wo.0.weight
        โ”œโ”€โ”€ layers.9.attention_norm.weight
        โ”œโ”€โ”€ layers.9.feed_forward.w1.0.weight
        โ”œโ”€โ”€ layers.9.feed_forward.w2.0.weight
        โ”œโ”€โ”€ layers.9.feed_forward.w3.0.weight
        โ”œโ”€โ”€ layers.9.ffn_norm.weight
        โ”œโ”€โ”€ norm.weight
        โ”œโ”€โ”€ output.weight
        โ””โ”€โ”€ tok_embeddings.weight

18 directories, 313 files
(base) root@intern-studio:~# 

Tensor Parallelism

Column parallelism: image

Row parallelism: image

A minimal numpy illustration of the two schemes follows.
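
This is standard Megatron-style tensor parallelism sketched across two hypothetical "GPUs", not LMDeploy-specific code:

```python
import numpy as np

x = np.random.randn(4, 8)        # activations: batch=4, hidden=8
W = np.random.randn(8, 16)       # weight of a linear layer

# Column parallelism: split W along the output dimension; concat partial outputs.
W0, W1 = np.split(W, 2, axis=1)
y_col = np.concatenate([x @ W0, x @ W1], axis=1)   # "all-gather"

# Row parallelism: split W along the input dimension; sum partial outputs.
Wa, Wb = np.split(W, 2, axis=0)
xa, xb = np.split(x, 2, axis=1)
y_row = xa @ Wa + xb @ Wb                          # "all-reduce"

assert np.allclose(y_col, x @ W) and np.allclose(y_row, x @ W)
```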

Local Chat (Bash Local Chat)

This skips the API Server and calls TurboMind directly.

The pytorch and DeepSpeed backends are currently much more limited in functionality.

# Turbomind + Bash Local Chat
lmdeploy chat turbomind ./workspace

APIๆœๅŠก

ๆœๅŠก็ซฏ

# ApiServer+Turbomind   api_server => AsyncEngine => TurboMind
lmdeploy serve api_server ./workspace \
	--server_name 0.0.0.0 \
	--server_port 23333 \
	--instance_num 64 \ #  Batch ็š„ๅคงๅฐ
	--tp 1

Terminal client

# ChatApiClient + ApiServer (note: this is HTTP, so the http:// prefix is required)
lmdeploy serve api_client http://localhost:23333
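
Beyond the bundled client, the server can also be called over plain HTTP. A minimal sketch, assuming the OpenAI-compatible /v1/chat/completions route that recent LMDeploy api_server versions expose (check http://localhost:23333/docs for the actual schema of your version):

```python
import requests

resp = requests.post(
    "http://localhost:23333/v1/chat/completions",
    json={
        "model": "internlm-chat-7b",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json())
```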

image

If you are working in a remote environment, you can forward the server port to your local machine:

ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p <your ssh port>

image

gradio

lmdeploy serve gradio http://0.0.0.0:23333 \
	--server_name 0.0.0.0 \
	--server_port 6006 \
	--restful_api True

image

TurboMind ๆŽจ็† + Python ไปฃ็ ้›†ๆˆ

from lmdeploy import turbomind as tm

# load model
model_path = "/root/share/temp/model_repos/internlm-chat-7b/"
tm_model = tm.TurboMind.from_pretrained(model_path, model_name='internlm-chat-20b')
generator = tm_model.create_instance()

# process query
query = "ๆ™šไธŠๅƒๅ•ฅๅฅฝ"
prompt = tm_model.model.get_prompt(query)
input_ids = tm_model.tokenizer.encode(prompt)

# inference
for outputs in generator.stream_infer(
        session_id=0,
        input_ids=[input_ids]):
    res, tokens = outputs[0]

response = tm_model.tokenizer.decode(res.tolist())
print(response)

Best practices?

Deployment

image

  • Scenario 1 (first 4 charts): fixed input and output token counts (1 and 2048 respectively), measuring output token throughput.
  • Scenario 2 (5th chart): real data, measuring request throughput.

Comparison

(lmdeploy) root@intern-studio:~# python infer_compare.py hf
Loading checkpoint shards: 100%|████████████████████████████████████████| 8/8 [00:10<00:00,  1.29s/it]
hf took 29.50 s, 40 chars/s

[TM][INFO] [Forward] [0, 1), dc_bsz = 0, pf_bsz = 1, n_tok = 113, max_q = 113, max_k = 113
[TM][INFO] ------------------------- step = 120 -------------------------
...
[TM][INFO] [Interrupt] slot = 0, id = 0
[TM][INFO] [forward] Request complete for 0, code 0
lmdeploy took 10.35 s, 109 chars/s
[TM][INFO] ~LlamaBatch()
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [OutputThreadEntry] stop requested.

Recommended setups:

  • web api: TurboMind inference + API Server
  • demo: TurboMind inference + Gradio
  • python project: TurboMind inference + Python

Configuration

Model-attribute and data-type fields must not be changed.

(lmdeploy) root@intern-studio:~# cat ./workspace/triton_models/weights/config.ini 
[llama]
model_name = internlm-chat-7b ; model attribute
tensor_para_size = 1
head_num = 32 ; model attribute
kv_head_num = 32 ; model attribute
vocab_size = 103168 ; model attribute
num_layer = 32 ; model attribute
inter_size = 11008 ; model attribute
norm_eps = 1e-06 ; model attribute
attn_bias = 1 ; model attribute
start_id = 1 ; model attribute
end_id = 2 ; model attribute
session_len = 2056
weight_type = fp16 ; data type
rotary_embedding = 128 ; model attribute
rope_theta = 10000.0
size_per_head = 128 ; model attribute
group_size = 0 ; data type
max_batch_size = 64
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
rope_scaling_factor = 0.0
use_logn_attn = 0

ไธ‰ไธชๅฏ่ƒฝ้œ€่ฆ่ฐƒๆ•ด็š„ๅ‚ๆ•ฐใ€‚

quant_policy๏ผš

  • KV int8 ๅผ€ๅ…ณ๏ผŒKV Cache ๆ˜ฏๅฏนๅบๅˆ—็”Ÿๆˆ่ฟ‡็จ‹ไธญ็š„ K ๅ’Œ V ่ฟ›่กŒ้‡ๅŒ–๏ผŒ็”จไปฅ่Š‚็œๆ˜พๅญ˜
  • ้ป˜่ฎคๅ€ผไธบ 0๏ผŒ่กจ็คบไธไฝฟ็”จ KV Cache๏ผŒๅฆ‚ๆžœ้œ€่ฆๅผ€ๅฏ๏ผŒๅˆ™ๅฐ†่ฏฅๅ‚ๆ•ฐ่ฎพ็ฝฎไธบ 4
  • ๅฝ“ๆ˜พๅญ˜ไธ่ถณ๏ผŒๆˆ–ๅบๅˆ—ๆฏ”่พƒ้•ฟๆ—ถ๏ผŒๅปบ่ฎฎๆ‰“ๅผ€ๆญคๅผ€ๅ…ณ
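
The saving is easy to estimate from the block-size formula above: KV int8 drops the per-value byte count B from 2 to 1, halving the k/v cache:

```python
layers, heads, head_dim, seq = 32, 32, 128, 2048  # llama-7b, 2K context
for bytes_per_val in (2, 1):  # fp16 vs. int8 k/v
    print(2 * layers * heads * head_dim * seq * bytes_per_val / 2**30, "GiB")
# 1.0 GiB -> 0.5 GiB
```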

rope_scaling_factor:

  • The extrapolation switch: the model's ability to keep generating when the inference context exceeds the maximum length seen in training
  • The default 0.0 means no extrapolation; setting it to 1.0 enables RoPE's Dynamic NTK mechanism, which supports long-text inference
  • Recommended when the inference text is very long (clearly beyond the training-time maximum)

Without extrapolation, quality drops off sharply once the inference context exceeds the training-time maximum. With it, the drop-off is much gentler, though quality still degrades badly if the limit is exceeded by too much.

use_logn_attn:

  • LogN attention scaling; the default is 0, change it to 1 to enable

max_batch_size:

  • The batch size
  • Defaults to 64; it is also what the instance_num argument sets when launching the API Server
  • Larger values give higher throughput (more requests served concurrently) but consume more GPU memory; tune it according to the request volume and the maximum context length