Description
System Info
- GPU: NVIDIA GeForce RTX 5090 Blackwell (SM 120)
- CPU: Intel i9-14900K Raptor Lake-S Refresh (32) @ 5.700GHz
- RAM: TeamGroup Delta RGB 64GB DDR5-7600 7600MHz Dual Channel
- OS: Ubuntu 24.04.3 LTS x86_64
- Kernel: 6.14.0-37-generic
- Docker Image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2
- Python Version: 3.12.3
- CUDA Version: 12.8
Who can help?
@2ez4bz
@yuanjingx87
@karljang
@greg-kwasniewski1
@Wanli-Jiang
@kaiyux
@nv-guomingz
@nekorobov
@Funatiq
@byshiue
Information
- The official example scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
# Build the engine for Llama 3.1 (which uses a list for eos_token_id in generation_config.json)
trtllm-build \
--checkpoint_dir /models/llama3_8b_legacy \
--output_dir /models/llama3_8b_base_engine \
--max_batch_size 1 \
--max_seq_len 3000 \
--max_num_tokens 6000
# Start the FastAPI server using the C++ Executor backend
trtllm-serve /models/llama3_8b_base_engine \
--backend tensorrt \
--tokenizer /models/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# Send a basic completions request (in a separate terminal)
curl -sS http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3_8b_base_engine",
"prompt": "Say hello.",
"max_tokens": 10
}'
Expected behavior
The trtllm-serve wrapper and underlying C++ Executor API should natively parse and accept an array/list of integers for eos_token_id (and pad_token_id) directly from the Hugging Face generation_config.json, correctly mapping them to multiple stop conditions without throwing a C++ exception.
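For comparison, the coercion we expected is straightforward to express. The helper below is a hypothetical sketch (not TensorRT-LLM code) of how a config loader could accept both the scalar and list forms that Hugging Face allows for these fields:

```python
def normalize_token_ids(value):
    """Coerce an eos_token_id / pad_token_id entry from
    generation_config.json into a list of ints.

    Hugging Face allows either a single int (e.g. 128001) or a
    list of ints (e.g. [128001, 128009]); a consumer that only
    accepts the scalar form breaks on Llama 3/3.1 configs.
    """
    if value is None:
        return []
    if isinstance(value, int):
        return [value]
    if isinstance(value, (list, tuple)) and all(isinstance(v, int) for v in value):
        return list(value)
    raise TypeError(f"Unsupported token id spec: {value!r}")

# Llama 3.1 ships a list; older configs ship a scalar.
print(normalize_token_ids([128001, 128009]))  # [128001, 128009]
print(normalize_token_ids(128001))            # [128001]
```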
Actual behavior
The server immediately rejects the request with a C++ type-casting exception. Because the official Meta config sets "eos_token_id": [128001, 128009], PyBind11 fails to cast the Python list into the single int32_t field expected by the C++ SamplingConfig.
The server returns the following error to the client:
{"object":"error","message":"Encountered an error in forwardAsync function: std::bad_cast","type":"BadRequestError","param":null,"code":400}
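A quick way to check whether a given checkpoint is affected is to inspect its generation_config.json directly. The snippet below is a triage sketch (paths are illustrative; adjust to your model directory):

```python
import json
from pathlib import Path

def eos_is_list(model_dir):
    """Return True if the checkpoint's generation_config.json declares
    eos_token_id as a list -- the shape that triggers the std::bad_cast
    described above."""
    cfg_path = Path(model_dir) / "generation_config.json"
    cfg = json.loads(cfg_path.read_text())
    return isinstance(cfg.get("eos_token_id"), list)

# Llama 3.1's upstream config declares "eos_token_id": [128001, 128009],
# so this returns True for e.g. /models/Llama-3.1-8B-Instruct.
```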
Additional notes
- Impact: This breaks out-of-the-box deployment for the Llama 3/3.1 family when using trtllm-serve in a production environment.
- Constraint: We cannot manually alter generation_config.json to force the list into a single integer.
- Question regarding support: Given that Meta's official Llama 3.1 configurations natively use a list for eos_token_id, is this model fully supported in the 1.3.0rc4 Executor API without modifying the upstream JSON?
- Request for help: If it is supported, what is the officially recommended way to pass or override these stop tokens (e.g., via server startup flags or the API request payload) to safely bypass the C++ list-casting crash while we wait for a permanent patch?
eos_token_id, is this model fully supported in the1.3.0rc4Executor API without modifying the upstream JSON? - Request for Help: If it is supported, what is the officially recommended way to pass or override these stop tokens (e.g., via server startup flags or the API request payload) to safely bypass the C++ list-casting crash while we wait for a permanent patch?