Definition:
- Model deployment is the process of launching a trained model in a specific software/hardware environment so that it can accept inputs and return predictions
- To meet performance and efficiency requirements, the model usually has to be optimized first, e.g. via model compression and hardware acceleration
Product forms:
- Cloud, edge computing, mobile
Compute devices:
- CPU, GPU, NPU, TPU
Huge memory overhead
- Enormous parameter counts: a 7B model needs 14+ GB of memory for FP16 weights alone
- Autoregressive token generation requires caching the attention k/v, which adds a huge memory overhead of its own (see the sketch below)
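A minimal sketch of why decoding needs a k/v cache (plain PyTorch; single head, made-up `dim` and random projections, so purely illustrative): each step computes q/k/v only for the newest token and appends k, v to the cache instead of re-encoding the whole prefix.

```python
import torch

# Single-head attention decode step with a k/v cache (illustrative only).
dim = 128
wq, wk, wv = (torch.randn(dim, dim) for _ in range(3))
k_cache, v_cache = [], []    # grows by one entry per generated token

def decode_step(x):          # x: (dim,) hidden state of the newest token only
    q, k, v = x @ wq, x @ wk, x @ wv
    k_cache.append(k); v_cache.append(v)                # cache, don't recompute
    K, V = torch.stack(k_cache), torch.stack(v_cache)   # (seq, dim)
    attn = torch.softmax(q @ K.T / dim ** 0.5, dim=-1)  # attend over all cached k
    return attn @ V                                     # (dim,)

for _ in range(4):           # pretend to generate 4 tokens
    out = decode_step(torch.randn(dim))
# The cache holds seq_len x dim values per head per layer -- that is the k/v cost.
```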
Dynamic shape
- The number of requests is not fixed
- Tokens are generated one by one, so the output length is uncertain
Compared with vision models, LLM structure is simple
- Transformer architecture, mostly decoder-only
Device:
- How to cope with the huge storage requirement? How to deploy on low-memory devices (consumer GPUs, phones, etc.)?
Inference:
- How to speed up token generation
- How to handle dynamic shapes so inference can run without interruption
- How to manage and use memory effectively
Serving:
- How to improve overall system throughput?
- How to reduce response latency for individual users?
Key techniques:
- Model parallelism
- Transformer compute and memory-access optimization
- Low-bit quantization
- Continuous Batching
- Paged Attention
Solutions
- huggingface transformers
- Dedicated inference acceleration frameworks
    - Cloud
        - lmdeploy
        - vllm
        - tensorrt-llm
        - deepspeed
    - Mobile
        - llama.cpp
        - mlc-llm
LMDeploy
A full-pipeline solution for deploying LLMs on NVIDIA devices, covering model lightweighting, inference, and serving.
Inference performance
Weight FP16 + KV Cache FP16
| Model | Weights | KV Cache (tokens=2k, batch=8) | KV Cache (tokens=8k, batch=8) | KV Cache (tokens=32k, batch=8) |
|---|---|---|---|---|
| Llama 7B | 14GB | 8GB | 32GB | 128GB |
| Llama 70B | 140GB | 5GB | 20GB | 80GB |
Weight INT4 + KV Cache INT8
| Model | Weights | KV Cache (tokens=2k, batch=8) | KV Cache (tokens=8k, batch=8) | KV Cache (tokens=32k, batch=8) |
|---|---|---|---|---|
| Llama 7B | 3.5GB | 4GB | 16GB | 64GB |
| Llama 70B | 35GB | 2.5GB | 10GB | 40GB |
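The KV-cache cells can be reproduced from the architectures (assuming the usual shapes: Llama 7B uses 32 layers of MHA with 32 heads of dim 128; Llama 70B uses 80 layers with GQA and 8 KV heads):

```python
def kv_cache_gib(layers, kv_heads, head_dim, tokens, batch, bytes_per_val):
    # 2x for K and V; result in GiB
    return 2 * layers * kv_heads * head_dim * tokens * batch * bytes_per_val / 2**30

# Llama 7B: 32 layers, 32 KV heads, head_dim 128 (MHA), FP16 cache
print(kv_cache_gib(32, 32, 128, 2048, 8, 2))   # 8.0  -> the "8GB" cell
# Llama 70B: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
print(kv_cache_gib(80, 8, 128, 2048, 8, 2))    # 5.0  -> the "5GB" cell
# INT8 cache halves the bytes per value
print(kv_cache_gib(32, 32, 128, 2048, 8, 1))   # 4.0  -> the "4GB" cell
```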
Two basic concepts
- Compute-bound: most of the inference time is spent on numerical computation. In compute-bound scenarios, speed can be raised with faster hardware or faster compute units, e.g. quantizing to W8A8 and using INT8 Tensor Cores to accelerate the math.
- Memory-bound: most of the inference time is spent reading and writing data. In memory-bound scenarios, performance is generally improved by raising the compute-to-memory-access ratio (arithmetic intensity).
LLM inference is a typical memory-bound task
- Common LLMs are decoder-only. Most of the inference time goes to the token-by-token generation (decoding) phase, a classic memory-bound workload.
- As the (omitted) figure shows, the A100's FP16 peak is about 312 TFLOPS, and compute only becomes the bottleneck once the batch size reaches the order of 128. But because LLM weights are large, the KV cache also eats a lot of GPU memory, and other factors interfere (e.g. Persistent Batching), so batch sizes as large as 128 are hard to reach in practice.
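A rough roofline estimate of where the "order of 128" comes from (my arithmetic, assuming A100-80GB-class numbers: 312 TFLOPS FP16 and roughly 2 TB/s of HBM bandwidth, and ignoring KV-cache traffic):

```python
# Roofline sketch for the decoding phase (illustrative only).
peak_flops = 312e12          # FP16 Tensor Core peak, FLOP/s
mem_bw     = 2e12            # ~2 TB/s HBM bandwidth
ridge = peak_flops / mem_bw  # FLOPs needed per byte moved to saturate compute

# Per decode step, each FP16 weight (2 bytes) is read once and feeds one
# multiply-add (2 FLOPs) per sequence in the batch, so the arithmetic
# intensity is roughly 2 * batch / 2 = batch FLOPs per byte.
print(ridge)                 # ~156: compute-bound only around batch ~128-156
```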

- 4-bit weight-only quantization: quantizing the FP16 model weights to INT4 cuts the bytes read to 1/4 of the FP16 model, greatly lowering memory-access cost and speeding up decoding.
- Besides the speedup, it also saves GPU memory, so the same device can host a larger model and longer conversations.
How it is done
- LMDeploy uses the AWQ algorithm (open-sourced by MIT HAN Lab) to quantize weights to 4 bit.
- At inference time:
    - the 4-bit weights are dequantized back to FP16 inside the kernel (reads from Global Memory are still 4 bit), and the actual computation is still done in FP16;
    - compared with the more widely used GPTQ algorithm, AWQ gives faster inference and shorter quantization time.
    - (Doesn't the dequantization itself cost time?)
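A toy numpy sketch of the weight-only idea (per-tensor scale for brevity; real AWQ uses per-group scales chosen from activation statistics): store two 4-bit values per byte, then unpack and dequantize to FP16 right before the FP16 matmul.

```python
import numpy as np

# Toy 4-bit weight-only quantization round trip (illustrative only).
w = np.random.randn(4096).astype(np.float16)

scale = np.abs(w).max() / 7                      # map to signed int4 [-7, 7]
q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)

nib = (q & 0xF).astype(np.uint8)                 # two's-complement low nibbles
packed = nib[0::2] | (nib[1::2] << 4)            # two 4-bit values per byte

# "Kernel side": unpack, sign-extend, dequantize to FP16, compute in FP16 as usual.
lo = (packed & 0xF).astype(np.int8); lo[lo > 7] -= 16
hi = (packed >> 4).astype(np.int8); hi[hi > 7] -= 16
deq = np.empty_like(w); deq[0::2] = lo * scale; deq[1::2] = hi * scale
print(np.abs(w - deq).max())                     # small quantization error
```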
TurboMind is an efficient inference engine for LLMs, developed on top of NVIDIA's FasterTransformer.
- Support for LLaMa-architecture models
- Persistent / continuous batching ("token bucket"?)
- High-performance CUDA kernels
- Stateful inference (server-side caching)
- Blocked k/v cache (the caching scheme)
BlockSize = 2 × Layers × Heads × HeadDim × Seq × B, where Seq is the sequence length held by one block (default 128) and B is the byte width of one k/v value.
Example: llama-7b with a 2K sequence takes 1GB of k/v block memory (see the check below).
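Plugging llama-7b's shapes into the formula (a quick check, assuming FP16 k/v):

```python
def block_bytes(layers, heads, head_dim, seq=128, b=2):
    return 2 * layers * heads * head_dim * seq * b  # 2x for K and V

per_block = block_bytes(32, 32, 128)    # llama-7b, FP16 k/v
blocks_2k = 2048 // 128                 # 16 blocks cover a 2K sequence
print(per_block / 2**20)                # 64 MiB per block
print(per_block * blocks_2k / 2**30)    # 1.0 GiB for the whole 2K sequence
```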
Block states
- Free: not occupied by any sequence
- Active: occupied by a sequence currently being inferred
- Cached: occupied by a sequence that is cached
Block state transitions (the slide's diagram is omitted; a sketch of the idea follows):
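A hedged sketch of the state machine, as I read the scheme (not LMDeploy's actual code): blocks are allocated from Free when a request runs, parked in Cached when the session pauses so its history can be reused, and reclaimed to Free on eviction.

```python
from enum import Enum, auto

class BlockState(Enum):
    FREE = auto()      # not owned by any sequence
    ACTIVE = auto()    # owned by a sequence currently decoding
    CACHED = auto()    # owned by a paused sequence; history kept for reuse

# Plausible transitions (illustrative; the engine manages this internally):
TRANSITIONS = {
    (BlockState.FREE,   "allocate"): BlockState.ACTIVE,  # request scheduled
    (BlockState.ACTIVE, "pause"):    BlockState.CACHED,  # session suspended
    (BlockState.CACHED, "resume"):   BlockState.ACTIVE,  # session continues
    (BlockState.ACTIVE, "finish"):   BlockState.CACHED,  # turn ends, kept warm
    (BlockState.CACHED, "evict"):    BlockState.FREE,    # memory reclaimed
}

state = BlockState.FREE
for event in ["allocate", "pause", "resume", "finish", "evict"]:
    state = TRANSITIONS[(state, event)]
    print(event, "->", state.name)
```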
The overall serving architecture has three parts:
- Model inference/serving: provides inference for the model itself. It can generally be decoupled from the concrete business logic and focus purely on inference performance, and can be exposed as a model, an API, and so on.
- Client: can be thought of as the frontend, where users interact.
- API Server: usually the backend behind the frontend, providing the data and feature support that the product and service need.
In a real pipeline the three are not strictly separated.
To run inference with TurboMind, the model must first be converted to the TurboMind format. Both online and offline conversion are currently supported:
- Online conversion loads a Huggingface model directly
- Offline conversion saves the converted model first and loads it afterwards
Online conversion (convert while loading)

```bash
# needs access to hf
lmdeploy chat turbomind Qwen/Qwen-7B-Chat --model-name qwen-7b
# local model
lmdeploy chat turbomind /share/temp/model_repos/internlm-chat-7b/ --model-name internlm-chat-7b
```

Offline conversion and loading

```bash
# convert the model (FasterTransformer format) for TurboMind
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy convert internlm-chat-7b /root/share/temp/model_repos/internlm-chat-7b/
```

The output lands in the workspace folder. In it:
The parameter layers.0.attention.w_qkv.0.bias
- The first 0 is the layer index; the 0 after w_qkv is the tensor-parallel index
- With two GPUs for inference, each such parameter is split into two shards:
  layers.0.attention.w_qkv.0.weight and layers.0.attention.w_qkv.1.weight
- Tensor parallelism is specified with --tp; the default value is 1
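A small illustration (numpy; the shapes are made up for the example) of what --tp 2 implies for a fused weight like w_qkv: the tensor is split along one axis and each rank stores one shard.

```python
import numpy as np

# Hypothetical fused QKV projection: hidden=4096 in, 3*4096 out.
w_qkv = np.random.randn(4096, 3 * 4096).astype(np.float16)

tp = 2
shards = np.split(w_qkv, tp, axis=1)   # column-split across ranks

# Conceptually these shards are what would be saved as
# layers.0.attention.w_qkv.0.weight and layers.0.attention.w_qkv.1.weight
for rank, shard in enumerate(shards):
    print(f"rank {rank}: shard shape {shard.shape}")  # (4096, 6144) each
```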
```text
(base) root@intern-studio:~# tree ./workspace/
./workspace/
├── model_repository
│   ├── postprocessing -> ../triton_models/postprocessing
│   ├── preprocessing -> ../triton_models/preprocessing
│   └── turbomind -> ../triton_models/interactive
├── service_docker_up.sh
└── triton_models
    ├── interactive
    │   ├── 1
    │   │   ├── placeholder
    │   │   └── weights -> ../../weights
    │   └── config.pbtxt
    ├── postprocessing
    │   ├── 1
    │   │   ├── __pycache__
    │   │   │   └── model.cpython-310.pyc
    │   │   ├── model.py
    │   │   └── tokenizer -> ../../tokenizer
    │   └── config.pbtxt
    ├── preprocessing
    │   ├── 1
    │   │   ├── __pycache__
    │   │   │   └── model.cpython-310.pyc
    │   │   ├── model.py
    │   │   └── tokenizer -> ../../tokenizer
    │   └── config.pbtxt
    ├── tokenizer
    │   ├── config.json
    │   ├── configuration.json
    │   ├── configuration_internlm.py
    │   ├── generation_config.json
    │   ├── modeling_internlm.py
    │   ├── placeholder
    │   ├── pytorch_model.bin.index.json
    │   ├── special_tokens_map.json
    │   ├── tokenization_internlm.py
    │   ├── tokenizer.model
    │   ├── tokenizer.py
    │   └── tokenizer_config.json
    └── weights
        ├── config.ini
        ├── layers.0.attention.w_qkv.0.bias
        ├── layers.0.attention.w_qkv.0.weight
        ├── layers.0.attention.wo.0.bias
        ├── layers.0.attention.wo.0.weight
        ├── layers.0.attention_norm.weight
        ├── layers.0.feed_forward.w1.0.weight
        ├── layers.0.feed_forward.w2.0.weight
        ├── layers.0.feed_forward.w3.0.weight
        ├── layers.0.ffn_norm.weight
        ├── ...              (layers.1 through layers.31: the same 9 files per layer)
        ├── norm.weight
        ├── output.weight
        └── tok_embeddings.weight

18 directories, 313 files
```
**Tensor parallelism** (set with --tp, see above)

Bash local chat: skip the API Server and call TurboMind directly. (The pytorch and DeepSpeed backends are currently weak on features.)

```bash
# Turbomind + Bash Local Chat
lmdeploy chat turbomind ./workspace
```
Server side

```bash
# ApiServer + Turbomind: api_server => AsyncEngine => TurboMind
lmdeploy serve api_server ./workspace \
    --server_name 0.0.0.0 \
    --server_port 23333 \
    --instance_num 64 \
    --tp 1
# --instance_num sets the batch size
```

Terminal client
```bash
# ChatApiClient + ApiServer (note this is plain HTTP, so the URL needs the http:// prefix)
lmdeploy serve api_client http://localhost:23333
```

If you are working in a remote environment, you can forward the server port to your local machine:
```bash
ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p <your ssh port>
```

Gradio
```bash
lmdeploy serve gradio http://0.0.0.0:23333 \
    --server_name 0.0.0.0 \
    --server_port 6006 \
    --restful_api True
```

Calling TurboMind from Python:

```python
from lmdeploy import turbomind as tm
# load model
model_path = "/root/share/temp/model_repos/internlm-chat-7b/"
tm_model = tm.TurboMind.from_pretrained(model_path, model_name='internlm-chat-7b')
generator = tm_model.create_instance()

# process query
query = "ๆไธๅๅฅๅฅฝ"  # "What should I eat tonight?"
prompt = tm_model.model.get_prompt(query)
input_ids = tm_model.tokenizer.encode(prompt)

# inference
for outputs in generator.stream_infer(
        session_id=0,
        input_ids=[input_ids]):
    res, tokens = outputs[0]

response = tm_model.tokenizer.decode(res.tolist())
print(response)
```

Benchmarks
- Scenario 1 (first 4 charts): fixed input and output token counts (1 and 2048 respectively), measuring output token throughput.
- Scenario 2 (chart 5): real data, measuring request throughput.
Comparison

```text
(lmdeploy) root@intern-studio:~# python infer_compare.py hf
Loading checkpoint shards: 100%|███████████████████████████| 8/8 [00:10<00:00,  1.29s/it]
hf ่ๆถ 29.50็ง 40 ๅญ/็ง
[TM][INFO] [Forward] [0, 1), dc_bsz = 0, pf_bsz = 1, n_tok = 113, max_q = 113, max_k = 113
[TM][INFO] ------------------------- step = 120 -------------------------
...
[TM][INFO] [Interrupt] slot = 0, id = 0
[TM][INFO] [forward] Request complete for 0, code 0
lmdeploy ่ๆถ 10.35็ง 109 ๅญ/็ง
[TM][INFO] ~LlamaBatch()
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [OutputThreadEntry] stop requested.
```

That is: HF transformers took 29.50 s (40 chars/s), lmdeploy took 10.35 s (109 chars/s).

- web api: TurboMind inference + API service
- demo: TurboMind inference + Gradio
- python project: TurboMind inference + Python
Model attributes and data types in the config must not be modified.
```text
(lmdeploy) root@intern-studio:~# cat ./workspace/triton_models/weights/config.ini
[llama]
model_name = internlm-chat-7b ; model attribute
tensor_para_size = 1
head_num = 32 ; model attribute
kv_head_num = 32 ; model attribute
vocab_size = 103168 ; model attribute
num_layer = 32 ; model attribute
inter_size = 11008 ; model attribute
norm_eps = 1e-06 ; model attribute
attn_bias = 1 ; model attribute
start_id = 1 ; model attribute
end_id = 2 ; model attribute
session_len = 2056
weight_type = fp16 ; data type
rotary_embedding = 128 ; model attribute
rope_theta = 10000.0
size_per_head = 128 ; model attribute
group_size = 0 ; data type
max_batch_size = 64
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
rope_scaling_factor = 0.0
use_logn_attn = 0
```

A few parameters may need adjusting:
quant_policy:
- The KV int8 switch. KV Cache quantization quantizes the K and V produced while generating a sequence, to save GPU memory.
- The default value 0 means KV Cache quantization is disabled; set the parameter to 4 to enable it.
- Recommended when GPU memory is tight or sequences are long.
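A toy per-tensor int8 round trip to show the saving (illustrative only; the real scheme may use per-head or per-channel scales calibrated offline):

```python
import numpy as np

def quant_int8(x):
    scale = np.abs(x).max() / 127                  # symmetric per-tensor scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequant_int8(q, scale):
    return q.astype(np.float16) * scale

k = np.random.randn(128).astype(np.float16)        # one cached K vector
q, s = quant_int8(k)                               # stored as int8: half the FP16 bytes
print(np.abs(k - dequant_int8(q, s)).max())        # small reconstruction error
```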
rope_scaling_factor:
- The extrapolation switch: whether the model can keep generating when the inference context exceeds the maximum length seen in training.
- The default value 0.0 means no extrapolation; setting it to 1.0 enables RoPE's Dynamic NTK feature, which supports long-text inference.
- Turn extrapolation on when the inference text is very long, i.e. clearly beyond the training-time maximum length.
Without extrapolation, quality drops sharply once the context exceeds the training-time maximum; with it, the drop is far less pronounced, although going far past the limit still hurts quality.
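A sketch of the Dynamic NTK idea, paraphrasing the variant used in e.g. Huggingface's dynamic NTK rotary embedding (treat the exact formula as an assumption): once the current sequence outgrows the trained length, the RoPE base is enlarged so the rotary frequencies stretch to cover it.

```python
def dynamic_ntk_base(seq_len, base=10000.0, dim=128,
                     max_pos=2048, scaling_factor=1.0):
    # Within the trained length, RoPE is unchanged.
    if seq_len <= max_pos:
        return base
    # Stretch the base so frequencies cover the longer context.
    factor = scaling_factor * seq_len / max_pos - (scaling_factor - 1)
    return base * factor ** (dim / (dim - 2))

print(dynamic_ntk_base(2048))   # 10000.0 (unchanged)
print(dynamic_ntk_base(8192))   # larger base -> slower-rotating frequencies
```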
use_logn_attn:
- LogN attention scaling. The default value is 0; change it to 1 to enable.
max_batch_size:
- The batch size.
- Defaults to 64; it is also what the API Server's instance_num argument sets at startup.
- Larger values give higher throughput (more requests accepted at once) but occupy more GPU memory. Tune it according to the request volume, the largest context length, and the actual situation, e.g. with the estimate below.
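A back-of-the-envelope link between batch size and the k/v budget, reusing the block arithmetic above (assumptions: an 80 GB card, FP16 k/v, and my reading that cache_max_entry_count = 0.5 caps the k/v cache at half the memory):

```python
# How many full-length 2K sessions fit in the k/v budget? (rough estimate)
kv_budget_gib = 80 * 0.5       # assumed cap: half of an 80 GiB card
per_session_gib = 1.0          # llama-7b, 2K tokens, FP16 k/v (computed earlier)
print(int(kv_budget_gib / per_session_gib))  # ~40 concurrent 2K sessions
```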