61 commits
bbc9eba
tpsp optimization (#1269)
hiworldwzj Apr 15, 2026
529f9ca
optimization prefill dp banlance, support multimodal dp balance. (#1271)
hiworldwzj Apr 16, 2026
6fa2f23
feat(api): add Anthropic Messages API compatibility endpoint (#1272)
sufubao Apr 17, 2026
5034706
fix: upgrade flashinfer to 0.6.8.post1 (#1280)
blueswhen Apr 24, 2026
3368043
qwen3 omni support long audio (#1268)
WANDY666 Apr 29, 2026
e28f984
feat(api): consolidate HTTP API endpoints and fixes (#1282)
sufubao Apr 29, 2026
4208b76
fix: typo prefll -> prefill in cudagraph option (#1283)
sufubao Apr 30, 2026
88b2fc6
feat: refactor kv buffer + qwen3.5 linear att radix cache upgrade. (#…
blueswhen Apr 30, 2026
0d5e122
add --performance_mode start args (#1285)
hiworldwzj May 6, 2026
1f54d60
auto set tool call parser and reasoning_parser (#1284)
hiworldwzj May 6, 2026
162df8b
fix: honor visual infer batch size (#1293)
sufubao May 6, 2026
3d08cba
use pinned device_ptr to init cpu cache tensor (#1287)
hiworldwzj May 7, 2026
28254a9
Communication opt (#1286)
blueswhen May 7, 2026
b8eee5c
feat(triton): support 256 headdim in attention decode kernels (#1291)
sufubao May 7, 2026
447fc40
fix(httpserver): quiet client-disconnect log path, return 499 (#1288)
sufubao May 8, 2026
cc7e8f4
remove lightllm_kernel (#1296)
hiworldwzj May 8, 2026
e1f8723
support prefill cudagraph for gdn (#1294)
WANDY666 May 8, 2026
38609d1
auto-derive max_req_total_len from model config (#1297)
Owleye4 May 9, 2026
592cad2
fix(basemodel): Format AssertionError message for max_seq_length vs m…
hiworldwzj May 9, 2026
8bcd28b
feat: support invalid_token_ids in sampling params (#1305)
shihaobai May 11, 2026
70cdb07
refactor(kv-cache): embed KvCacheAllocator in MemoryManager as alloca…
hiworldwzj May 13, 2026
8141c56
fix(multimodal): detect truncated images at the frontend via pixel-le…
shihaobai May 14, 2026
f41b8c4
feat(multimodal): add max_image_token_count guard with OOM risk guida…
hiworldwzj May 14, 2026
45e8cca
improve multimodal image preprocessing with max_image_pixels auto-res…
hiworldwzj May 14, 2026
171204e
Fix window size for sliding attention layer (#1311)
WANDY666 May 18, 2026
eaf0f42
Fix sliding window size for token attention kernel (#1312)
WANDY666 May 18, 2026
f850264
muliturn benchmark (#1313)
shihaobai May 19, 2026
4c069d3
fix: fix cache length (#1314)
blueswhen May 21, 2026
1b38e8d
support gemma4 (#1304)
WANDY666 May 22, 2026
eaa3b28
add enable_prefill_decode_mixed start args (#1315)
hiworldwzj May 22, 2026
c73698e
fix linear att cpu cache offload load speed (#1317)
hiworldwzj May 25, 2026
5adbf00
opt: optimatize cpu cache start time (#1319)
blueswhen May 26, 2026
e696aed
opt: refine cpu cache start time (#1321)
blueswhen May 26, 2026
375ad57
fix: fp8 group_fuse_moe (#1323)
shihaobai May 27, 2026
520c041
fix health check (#1322)
hiworldwzj May 29, 2026
466651c
feat: deep_ep v2 (#1303)
blueswhen Jun 1, 2026
63269d5
Refine prefill CUDA graph capture sizes (#1331)
shihaobai Jun 4, 2026
b16ccfa
fix: v32 tokenizer for transformers 5.x (#1326)
shihaobai Jun 5, 2026
105d57f
fix: update ci to cuda13.0 (#1332)
blueswhen Jun 5, 2026
3863844
pd nixl upgrade write mode to transfer kv (#1324)
hiworldwzj Jun 7, 2026
5514e24
fix prefill_params when prefill num_reqs > 1024 (#1336)
shihaobai Jun 8, 2026
e03ef9a
refactor(mtp): extract BaseMTPModel mixin shared by existing MTP draf…
sufubao Jun 9, 2026
da9dfb8
revert(mtp): drop shared BaseMTPModel base, keep per-model is_mtp_dra…
sufubao Jun 9, 2026
78e34a7
nixl pd support qwen3.5 (#1340)
hiworldwzj Jun 9, 2026
2740083
add Flashinfer sampling backend (#1328)
blueswhen Jun 10, 2026
8196f35
remove nccl pd mode. (#1342)
hiworldwzj Jun 10, 2026
316b398
fix lmeval start speed (#1343)
hiworldwzj Jun 11, 2026
6a70412
fix: correct 'Unsupport' typo to 'Unsupported' in error messages (#1320)
SuperMarioYL Jun 11, 2026
d471c21
feat(metrics): add model_name label and new throughput/cache metrics …
sufubao Jun 11, 2026
3a15cb0
fix duplicate reasoning and reasoning_content (#1345)
shihaobai Jun 12, 2026
b0231de
fix(linear-att): fix latent prefix-cache ref/buffer leaks (#1348)
sufubao Jun 14, 2026
41ed8e9
basic Profiler support (#1247)
WuSiYu Jun 15, 2026
630f7b8
Return 400 for chat template build errors (#1356)
shihaobai Jun 15, 2026
9e4f552
Fix config utils (#1357)
blueswhen Jun 16, 2026
b28eeac
fix: truncate oversized output token strings (#1359)
shihaobai Jun 16, 2026
9ae6a5d
perf(qwen3next): drop q/k/v/a/b contiguous copies in GDN fused_recurr…
sufubao Jun 16, 2026
84c7fe8
feat: add fused moe shared-expert and add-rmsnorm optimization (#1353)
blueswhen Jun 23, 2026
7876db3
improve moe align (#1369)
hiworldwzj Jun 29, 2026
13c7017
Fix linear attention CPU cache tail index buffer (#1372)
hiworldwzj Jun 29, 2026
cf55326
fix position_delta in decode. (#1377)
hiworldwzj Jul 1, 2026
1ff35a0
feat: opt flashinfer (#1367)
blueswhen Jul 1, 2026
9 changes: 5 additions & 4 deletions .github/workflows/docker-publish.yml
@@ -86,8 +86,8 @@ jobs:
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

# Build and push default image (cuda12.8.0)
- name: Build and push Docker image (default cuda12.8.0)
# Build and push default image (cuda13.0.0)
- name: Build and push Docker image (default cuda13.0.0)
id: build-and-push
uses: docker/build-push-action@ac9327eae2b366085ac7f6a2d02df8aa8ead720a
with:
@@ -97,10 +97,11 @@ jobs:
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
build-args: |
CUDA_VERSION=12.8.0
CUDA_VERSION=13.0.0
ENABLE_DEEPEP=1
ENABLE_NIXL=1
ENABLE_CACHE=1
ENABLE_SM100=0
cache-from: type=gha
cache-to: type=gha,mode=max

@@ -117,4 +118,4 @@ jobs:
DIGEST: ${{ steps.build-and-push.outputs.digest }}
# This step uses the identity token to provision an ephemeral certificate
# against the sigstore community Fulcio instance.
run: echo "${TAGS}" | xargs -I {} cosign sign --yes {}@${DIGEST}
run: echo "${TAGS}" | xargs -I {} cosign sign --yes {}@${DIGEST}
1 change: 1 addition & 0 deletions .gitignore
@@ -7,3 +7,4 @@ dist
.vscode
tmp/
requirements-musa.txt
logs/
55 changes: 29 additions & 26 deletions docker/Dockerfile
@@ -1,14 +1,17 @@
ARG CUDA_VERSION=12.8.0
ARG CUDA_VERSION=13.0.0
FROM nvidia/cuda:${CUDA_VERSION}-cudnn-devel-ubuntu22.04

ARG PYTHON_VERSION=3.10
ARG MAMBA_VERSION=24.7.1-0
ARG VLLM_VERSION=0.16.0
ARG VLLM_VERSION=0.21.0
ARG NIXL_REF=v1.2.0
ARG FLASH_MLA_REF=47c35a7
ARG DEEPGEMM_REF=891d57b4db1071624b5c8fa0d1e51cb317fa709f
ARG TARGETPLATFORM
ARG ENABLE_DEEPEP=1
ARG ENABLE_NIXL=1
ARG ENABLE_CACHE=1
ARG ENABLE_SM100=0

ENV PATH=/opt/conda/bin:$PATH \
CONDA_PREFIX=/opt/conda
@@ -44,13 +47,20 @@ WORKDIR /root

COPY ./requirements.txt /lightllm/requirements.txt
RUN pip install -U pip
RUN pip install -r /lightllm/requirements.txt --no-cache-dir
RUN pip install --no-cache-dir vllm==${VLLM_VERSION}
RUN git clone https://github.com/deepseek-ai/FlashMLA.git /root/FlashMLA && \
RUN pip install --no-cache-dir \
-i https://pypi.org/simple \
--extra-index-url https://download.pytorch.org/whl/cu130 \
vllm==${VLLM_VERSION}
RUN pip install -r /lightllm/requirements.txt --no-cache-dir \
-i https://pypi.org/simple \
--extra-index-url https://download.pytorch.org/whl/cu130
RUN export CPATH=/usr/local/cuda/targets/x86_64-linux/include/cccl:/usr/local/cuda/targets/x86_64-linux/include${CPATH:+:${CPATH}} && \
git clone https://github.com/deepseek-ai/FlashMLA.git /root/FlashMLA && \
cd /root/FlashMLA && \
git checkout ${FLASH_MLA_REF} && \
git submodule update --init --recursive && \
FLASH_MLA_DISABLE_SM100=1 pip install --no-cache-dir .
FLASH_MLA_DISABLE_SM100="$(if [ "${ENABLE_SM100}" = "1" ]; then echo 0; else echo 1; fi)" \
pip install --no-cache-dir .

RUN apt-get update && apt-get install -y libnuma-dev && rm -rf /var/lib/apt/lists/*

@@ -78,27 +88,20 @@ RUN if [ "${ENABLE_NIXL}" = "1" ] || [ "${ENABLE_DEEPEP}" = "1" ]; then \
RUN if [ "${ENABLE_DEEPEP}" = "1" ]; then \
set -e; \
ln -sf /usr/lib/x86_64-linux-gnu/libmlx5.so.1 /usr/lib/x86_64-linux-gnu/libmlx5.so; \
NVSHMEM_VERSION=3.3.9; \
CUDA_ARCHS=90; \
wget https://developer.download.nvidia.com/compute/redist/nvshmem/${NVSHMEM_VERSION}/source/nvshmem_src_cuda12-all-all-${NVSHMEM_VERSION}.tar.gz \
&& tar -xf nvshmem_src_cuda12-all-all-${NVSHMEM_VERSION}.tar.gz && mv nvshmem_src nvshmem \
&& cd nvshmem \
&& rm -f /root/nvshmem_src_cuda12-all-all-${NVSHMEM_VERSION}.tar.gz \
&& NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_MPI_SUPPORT=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/root/nvshmem/install -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCHS} \
&& cmake --build build --target install -j64; \
DEEPEP_COMMIT=b6ce310bb0b75079682d09bc2ebc063a074fbd58; \
cd /root && git clone https://github.com/deepseek-ai/DeepEP.git && cd DeepEP && git checkout ${DEEPEP_COMMIT} && cd ..; \
cd /root/DeepEP && NVSHMEM_DIR=/root/nvshmem/install python setup.py install; \
python -m pip install --upgrade --no-deps \
"nvidia-nccl-cu13==2.30.4" \
"nvidia-nvshmem-cu13==3.6.5"; \
cd /root && git clone https://github.com/deepseek-ai/DeepEP.git && cd DeepEP && git checkout b306af06afd412c88e51e71802951606e40b7358; \
ln -sf /opt/conda/lib/python${PYTHON_VERSION}/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so.3 /opt/conda/lib/python${PYTHON_VERSION}/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so; \
ln -sf /opt/conda/lib/python${PYTHON_VERSION}/site-packages/nvidia/nccl/lib/libnccl.so.2 /opt/conda/lib/python${PYTHON_VERSION}/site-packages/nvidia/nccl/lib/libnccl.so; \
pip install --no-build-isolation .; \
fi

RUN cd /root && git clone https://github.com/deepseek-ai/DeepGEMM.git && \
cd DeepGEMM && git checkout ${DEEPGEMM_REF} && \
git submodule update --init --recursive && \
pip install --no-build-isolation .

RUN if [ "${ENABLE_NIXL}" = "1" ]; then \
apt-get update && apt-get install -y cmake automake autotools-dev libtool libz-dev && \
DEBIAN_FRONTEND=noninteractive apt-get -y install --reinstall libibverbs-dev rdma-core ibverbs-utils libibumad-dev; \
@@ -126,7 +129,7 @@ RUN if [ "${ENABLE_NIXL}" = "1" ]; then \
apt-get update && apt-get install -y pkg-config tmux net-tools && \
cd /usr/local/src; \
pip install --upgrade meson pybind11 patchelf; \
git clone https://github.com/ai-dynamo/nixl.git -b main && \
git clone https://github.com/ai-dynamo/nixl.git -b ${NIXL_REF} && \
cd nixl && \
rm -rf build && \
mkdir build && \
14 changes: 10 additions & 4 deletions docker/scripts/build.sh
@@ -18,21 +18,23 @@ set -euo pipefail
# --no-nixl Disable NIXL (default: enabled)
# --no-cache Disable cache (default: enabled)
# --lite Disable DEEPEP, NIXL and cache in one shot
# --cuda-version <ver> CUDA version (default: 12.8.0)
# --cuda-version <ver> CUDA version (default: 13.0.0)
# --image-prefix <name> Image prefix (default: lightllm)
# --image-tag <tag> Image tag (default: generated from enabled features)
# --enable-sm100 Enable SM100 support (default: disabled)
# -h / --help Show help

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
cd "${ROOT_DIR}"

IMAGE_PREFIX="${IMAGE_PREFIX:-lightllm}"
CUDA_VERSION="${CUDA_VERSION:-12.8.0}"
CUDA_VERSION="${CUDA_VERSION:-13.0.0}"
IMAGE_TAG="${IMAGE_TAG:-}"

ENABLE_DEEPEP="${ENABLE_DEEPEP:-1}"
ENABLE_NIXL="${ENABLE_NIXL:-1}"
ENABLE_CACHE="${ENABLE_CACHE:-1}"
ENABLE_SM100="${ENABLE_SM100:-0}"

print_help() {
sed -n '1,80p' "$0" | sed 's/^# \{0,1\}//'
@@ -43,6 +45,7 @@ while [[ $# -gt 0 ]]; do
--no-deepep) ENABLE_DEEPEP=0 ;;
--no-nixl) ENABLE_NIXL=0 ;;
--no-cache) ENABLE_CACHE=0 ;;
--enable-sm100) ENABLE_SM100=1 ;;
--lite)
ENABLE_DEEPEP=0
ENABLE_NIXL=0
@@ -78,13 +81,16 @@ done
# - Other combos: composed from enabled feature names
if [[ -z "${IMAGE_TAG}" ]]; then
tag_parts=()
if [[ "${ENABLE_SM100}" -eq 1 ]]; then
tag_parts+=("sm100")
fi
if [[ "${ENABLE_NIXL}" -eq 1 ]]; then
tag_parts+=("nixl")
fi
if [[ "${ENABLE_DEEPEP}" -eq 1 ]]; then
tag_parts+=("deepep")
fi
if [[ "${ENABLE_NIXL}" -eq 1 && "${ENABLE_DEEPEP}" -eq 1 && "${ENABLE_CACHE}" -eq 1 ]]; then
if [[ "${ENABLE_SM100}" -eq 0 && "${ENABLE_NIXL}" -eq 1 && "${ENABLE_DEEPEP}" -eq 1 && "${ENABLE_CACHE}" -eq 1 ]]; then
IMAGE_TAG="cuda${CUDA_VERSION}"
else
prefix=""
@@ -100,6 +106,6 @@ DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile \
--build-arg ENABLE_DEEPEP="${ENABLE_DEEPEP}" \
--build-arg ENABLE_NIXL="${ENABLE_NIXL}" \
--build-arg ENABLE_CACHE="${ENABLE_CACHE}" \
--build-arg ENABLE_SM100="${ENABLE_SM100}" \
--progress=plain \
-t "${IMAGE_PREFIX}:${IMAGE_TAG}" .
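The tag selection changed in this hunk can be illustrated with a small Python sketch. It is only an approximation of the visible shell logic: the function names are hypothetical, and the way non-default tags are assembled from the collected parts is elided in the diff, so it is not reproduced here.

```python
def feature_tag_parts(enable_sm100, enable_nixl, enable_deepep):
    """Collect feature names in the order build.sh appends them to tag_parts."""
    parts = []
    if enable_sm100 == 1:
        parts.append("sm100")
    if enable_nixl == 1:
        parts.append("nixl")
    if enable_deepep == 1:
        parts.append("deepep")
    return parts


def uses_default_tag(enable_sm100, enable_nixl, enable_deepep, enable_cache):
    """Mirror the updated condition: the plain cuda<version> tag is used only
    when SM100 is disabled and NIXL, DeepEP and cache are all enabled."""
    return (
        enable_sm100 == 0
        and enable_nixl == 1
        and enable_deepep == 1
        and enable_cache == 1
    )


if __name__ == "__main__":
    print(feature_tag_parts(1, 1, 1))
    print(uses_default_tag(0, 1, 1, 1))
    print(uses_default_tag(1, 1, 1, 1))
```

With the new `ENABLE_SM100` check, an SM100 build no longer shares the default `cuda${CUDA_VERSION}` tag with the standard image.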

11 changes: 11 additions & 0 deletions docs/CN/source/cookbook/qwen35_deployment.rst
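@@ -74,6 +74,17 @@ Qwen3.5-397B-A17B(8×H200)
- ``--graph_max_batch_size 128``: maximum CUDA graph batch size (reduce it if GPU memory is insufficient)
- ``--reasoning_parser qwen3``: enables the Qwen3 reasoning parser, which supports thinking mode

Linear attention cache tuning notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Qwen3.5 uses a hybrid attention architecture. When linear attention cache reuse is involved, pay attention to the following parameters:

- ``--linear_att_hash_page_size``: small-block granularity (number of tokens per hash bucket)
- ``--linear_att_page_block_num``: block-level matching configuration. The block size can be approximated as ``linear_att_page_block_num * linear_att_hash_page_size``.
- When ``linear_att_page_block_num * linear_att_hash_page_size > max_req_total_len``, block-level matching in the radix cache is effectively disabled, and matching relies more on request-level small-block matching (the small-block size is ``linear_att_hash_page_size``).
- Under high load, a shortage of small blocks combined with internal LRU eviction can lower the hit rate. In that case, increase ``--linear_att_cache_size`` to improve the hit rate, at the cost of higher memory usage.
- When ``--enable_cpu_cache`` is enabled, the CPU cache page size is forced to ``linear_att_page_block_num * linear_att_hash_page_size`` to satisfy internal reuse constraints.

Text-only mode (saves GPU memory)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
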
98 changes: 84 additions & 14 deletions docs/CN/source/tutorial/api_server_args.rst
@@ -18,6 +18,16 @@ APIServer parameter reference
* ``pd_master``: pd master node mode (used in pd disaggregated run mode)
* ``config_server``: config server mode (used in pd disaggregated mode to register pd_master nodes and fetch the list of pd_master nodes). It is designed for large-scale, high-concurrency scenarios and is intended for cases where `pd_master` hits a significant CPU bottleneck.

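.. option:: --performance_mode, --p_mode

Performance mode for different scenarios. Possible values:

* ``None``: no performance mode is applied (default)
* ``personal``: private, personal-use mode, which automatically sets:
- ``running_max_req_size`` to 3
- ``batch_max_tokens`` to 2048 (2k)
- ``chunked_prefill_size`` to 1024 (1k)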
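
The effect of the ``personal`` mode can be pictured as a preset applied on top of the parsed arguments. The sketch below is illustrative only: `apply_performance_mode` and `PERSONAL_PRESET` are hypothetical names, and it assumes the preset overrides whatever values were already present, which the text above does not confirm.

```python
# Values taken from the option description above.
PERSONAL_PRESET = {
    "running_max_req_size": 3,
    "batch_max_tokens": 2048,
    "chunked_prefill_size": 1024,
}


def apply_performance_mode(args, mode=None):
    """Return a copy of args with the preset for the given mode applied.

    Assumption: preset values replace existing ones.
    """
    merged = dict(args)
    if mode is None:
        return merged
    if mode != "personal":
        raise ValueError(f"unknown performance mode: {mode!r}")
    merged.update(PERSONAL_PRESET)
    return merged


if __name__ == "__main__":
    print(apply_performance_mode({"host": "127.0.0.1"}, mode="personal"))
```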

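.. option:: --host

Server listen address, default is ``127.0.0.1``
@@ -122,7 +132,10 @@ PD disaggregation mode parameters

.. option:: --max_req_total_len

Maximum of request input length + request output length, default is ``16384``
Maximum of request input length + request output length. If not set explicitly, it is derived automatically from the model config,
falling back to ``16384`` if the derivation fails.
For some RoPE types (such as ``yarn/dynamic/su/llama3``), the derivation does not directly multiply ``max_position_embeddings``
by ``rope_scaling.factor``, to avoid overestimating the maximum length.

.. option:: --eos_id

@@ -201,6 +214,16 @@ PD disaggregation mode parameters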

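Aggressive scheduling may cause frequent prefill interruptions during decoding. Disabling it lets the router_max_wait_tokens parameter work more effectively.

.. option:: --enable_prefill_decode_mixed

Run prefill and decode together in the same inference scheduling step.

Can only be enabled when ``--run_mode`` is ``normal``. When both prefill and decode requests are present, the scheduler runs
prefill first and then decode within the same step, instead of running only prefill and blocking decode as under aggressive scheduling.
Decode therefore keeps progressing even when new prefill requests arrive, which improves overall throughput.

Cannot be used together with ``--enable_prefill_microbatch_overlap`` or ``--enable_decode_microbatch_overlap``.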
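
The two constraints on ``--enable_prefill_decode_mixed`` stated above can be expressed as a small argument check. This is a sketch under stated assumptions, not LightLLM's actual validation code: the function name and error messages are hypothetical.

```python
def check_prefill_decode_mixed(
    run_mode,
    enable_prefill_decode_mixed,
    enable_prefill_microbatch_overlap=False,
    enable_decode_microbatch_overlap=False,
):
    """Raise ValueError if the documented constraints are violated."""
    if not enable_prefill_decode_mixed:
        return  # nothing to check when the feature is off
    if run_mode != "normal":
        raise ValueError(
            "--enable_prefill_decode_mixed requires --run_mode normal"
        )
    if enable_prefill_microbatch_overlap or enable_decode_microbatch_overlap:
        raise ValueError(
            "--enable_prefill_decode_mixed cannot be combined with "
            "microbatch overlap options"
        )


if __name__ == "__main__":
    check_prefill_decode_mixed("normal", True)  # accepted
    try:
        check_prefill_decode_mixed("prefill", True)
    except ValueError as exc:
        print(exc)
```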

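.. option:: --disable_dynamic_prompt_cache

Disable kv cache caching
@@ -259,6 +282,18 @@ PD disaggregation mode parameters

Capacity of the cache server for multimodal resources, default is ``200``

.. option:: --max_image_token_count

Maximum number of tokens a single image may occupy after being converted to tokens, default is ``6128``

A request is rejected when any image exceeds this threshold.

.. option:: --max_image_pixels

Maximum number of pixels allowed for a single image before preprocessing resize, default is ``8294400`` (roughly the total pixel count of a 4K image).

When an input image exceeds this threshold, LightLLM first automatically resizes it to fit within this pixel budget and then continues with the rest of the pipeline.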
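
To make the pixel budget concrete: ``8294400`` equals 3840 × 2160. The sketch below shows one way an image could be scaled down to fit such a budget. It assumes an aspect-ratio-preserving downscale with floored dimensions; the actual resize rule used by LightLLM is not shown in this diff, and `fit_to_pixel_budget` is a hypothetical name.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels=8294400):
    """Return (width, height) scaled down so width * height <= max_pixels.

    Assumption: the aspect ratio is kept and dimensions are floored,
    so the result never exceeds the budget.
    """
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    new_width = max(1, math.floor(width * scale))
    new_height = max(1, math.floor(height * scale))
    return new_width, new_height


if __name__ == "__main__":
    print(fit_to_pixel_budget(3840, 2160))  # already within budget
    print(fit_to_pixel_budget(7680, 4320))  # an 8K image is scaled down
```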

.. option:: --visual_infer_batch_size

Number of images processed in each inference batch, default is ``1``
@@ -293,13 +328,13 @@ PD disaggregation mode parameters
Performance optimization parameters
-----------------------------------

.. option:: --disable_custom_allreduce
.. option:: --disable_symm_mem_allreduce

Whether to disable custom allreduce
Disable the SymmMem all-reduce fast path, which is enabled by default, and fall back to NCCL

.. option:: --enable_custom_allgather
.. option:: --disable_flashinfer_allreduce

Whether to enable custom allgather
Disable the FlashInfer all-reduce fast path, which is enabled by default, and fall back to SymmMem / NCCL

.. option:: --enable_tpsp_mix_mode

@@ -342,6 +377,41 @@ PD disaggregation mode parameters
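- ``fp8kv_sph``: FP8 static per-head quantization, for the fa3 backend
- ``fp8kv_spt``: FP8 static per-tensor quantization, for the flashinfer backend

.. option:: --linear_att_hash_page_size

Hash page size for linear attention, default is ``512``.

This parameter controls the number of tokens in each hash bucket and affects radix cache reuse.

.. option:: --linear_att_page_block_num

Number of blocks used for linear attention state storage, default is ``10000000``.

This parameter controls the number of pages available for storing attention state and affects memory usage and multi-turn conversation performance.
In the current implementation, the block size can be approximated as
``linear_att_page_block_num * linear_att_hash_page_size``.
When ``linear_att_page_block_num * linear_att_hash_page_size > max_req_total_len``,
block-level matching in the radix cache is effectively disabled, and matching relies more on request-level small-block matching (the small-block size is ``linear_att_hash_page_size``).
Under high load, a shortage of small blocks combined with the internal LRU eviction mechanism can lower the cache hit rate.

When ``--enable_cpu_cache`` is enabled, the cpu cache page size is forced to
``linear_att_page_block_num * linear_att_hash_page_size`` to satisfy internal reuse constraints.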
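
The relationship between the two parameters and ``max_req_total_len`` can be checked with simple arithmetic. The sketch below uses the defaults stated above (``512`` and ``10000000``); the function names are hypothetical, and the comparison only restates the documented rule rather than reproducing the internal radix cache logic.

```python
def linear_att_block_size(page_block_num=10_000_000, hash_page_size=512):
    """Approximate block size in tokens, as described in the docs."""
    return page_block_num * hash_page_size


def block_level_matching_effective(
    max_req_total_len, page_block_num=10_000_000, hash_page_size=512
):
    """Block-level matching is described as effectively disabled when the
    block size exceeds max_req_total_len."""
    block_size = linear_att_block_size(page_block_num, hash_page_size)
    return block_size <= max_req_total_len


if __name__ == "__main__":
    print(linear_att_block_size())                # 5120000000 tokens
    print(block_level_matching_effective(16384))  # False with the defaults
    print(block_level_matching_effective(16384, page_block_num=8))  # True
```

With the defaults, the product is far larger than any realistic ``max_req_total_len``, so under the documented rule reuse relies on small-block matching unless ``--linear_att_page_block_num`` is lowered.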

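.. option:: --linear_att_cache_size

Linear attention cache size.

If not specified, it is computed automatically from the cache-related configuration.
When small-block cache hits are insufficient under high load (for example, due to the number of small blocks and LRU eviction),
increase this parameter to improve the hit rate, at the cost of higher memory usage.

.. option:: --linear_att_ssm_data_type

Data type of the linear attention SSM state. Possible values:

* ``bfloat16``
* ``float32`` (default)

.. option:: --disable_cudagraph

Disable cudagraph for the decode stage
@@ -394,6 +464,14 @@ PD disaggregation mode parameters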

An example can be found in test/advanced_config/mixed_quantization/llamacls-mix-down.yaml.

.. option:: --expert_dtype

EP MoE expert quantization type. Possible values:

* ``fp8``
* ``fp4``, supported only on SM100 GPUs
* ``None`` (default)

.. option:: --vit_quant_type

ViT quantization method. Possible values:
@@ -426,14 +504,6 @@ PD disaggregation mode parameters

Use a reward model

.. option:: --long_truncation_mode

How to handle the case where input_token_len + max_new_tokens > max_req_total_len. Possible values:

* ``None``: raise an exception (default)
* ``head``: remove some head tokens so that input_token_len + max_new_tokens <= max_req_total_len
* ``center``: remove some tokens at the center so that input_token_len + max_new_tokens <= max_req_total_len

.. option:: --use_tgi_api

Use the tgi input and output format
@@ -509,4 +579,4 @@ DeepSeek redundant expert parameters

.. option:: --enable_monitor_auth

Whether to enable authentication for push_gateway
Whether to enable authentication for push_gateway
4 changes: 3 additions & 1 deletion docs/CN/source/tutorial/deepseek_deployment.rst
@@ -175,6 +175,7 @@ PD (Prefill-Decode) disaggregation mode deploys the prefill and decode stages separately, which can

# PD prefill mode for DeepSeek-R1 (DP+EP) on H200
# Usage: sh pd_prefill.sh <host> <pd_master_ip>
# NIXL transport is used by default; to use the NCCL data plane, set LIGHTLLM_PD_KV_TRANSPORT_BACKEND=nccl
# nvidia-cuda-mps-control -d starts MPS (optional; performance is much better with MPS, but enabling MPS is error-prone on some GPU and driver environments, so upgrading to a recent driver version is recommended, especially for H-series GPUs)

export host=$1
@@ -201,6 +202,7 @@ PD (Prefill-Decode) disaggregation mode deploys the prefill and decode stages separately, which can

# PD decode mode for DeepSeek-R1 (DP+EP) on H200
# Usage: sh pd_decode.sh <host> <pd_master_ip>
# NIXL transport is used by default; to use the NCCL data plane, set LIGHTLLM_PD_KV_TRANSPORT_BACKEND=nccl
export host=$1
export pd_master_ip=$2
nvidia-cuda-mps-control -d
@@ -336,4 +338,4 @@ PD (Prefill-Decode) disaggregation mode deploys the prefill and decode stages separately, which can
--tokenizer_path /path/DeepSeek-R1/ \
--url http://127.0.0.1:8088/generate_stream

All of the scripts above can be found in the `test/start_scripts/multi_pd_master/` directory.
All of the scripts above can be found in the `test/start_scripts/multi_pd_master/` directory.