[Bug]: trtllm-serve crashes with std::bad_cast when eos_token_id is a list (Llama 3.1) #11625

@CristyNel

Description

System Info

  • GPU: NVIDIA GeForce RTX 5090 Blackwell (SM 120)
  • CPU: Intel i9-14900K Raptor Lake-S Refresh (32) @ 5.700GHz
  • RAM: TeamGroup Delta RGB 64GB DDR5-7600 7600MHz Dual Channel
  • OS: Ubuntu 24.04.3 LTS x86_64
  • Kernel: 6.14.0-37-generic
  • Docker Image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2
  • Python Version: 3.12.3
  • CUDA Version: 12.8

Who can help?

@2ez4bz
@yuanjingx87
@karljang
@greg-kwasniewski1
@Wanli-Jiang
@kaiyux
@nv-guomingz
@nekorobov
@Funatiq
@byshiue

Information

  • The official example scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction

# Build the engine for Llama 3.1 (which uses a list for eos_token_id in generation_config.json)

trtllm-build \
  --checkpoint_dir /models/llama3_8b_legacy \
  --output_dir /models/llama3_8b_base_engine \
  --max_batch_size 1 \
  --max_seq_len 3000 \
  --max_num_tokens 6000

#  Start the FastAPI server using the C++ Executor backend

trtllm-serve /models/llama3_8b_base_engine \
  --backend tensorrt \
  --tokenizer /models/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

#  Send a basic completions request (in a separate terminal)

curl -sS http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3_8b_base_engine",
    "prompt": "Say hello.",
    "max_tokens": 10
  }'

Expected behavior

The trtllm-serve wrapper and underlying C++ Executor API should natively parse and accept an array/list of integers for eos_token_id (and pad_token_id) directly from the Hugging Face generation_config.json, correctly mapping them to multiple stop conditions without throwing a C++ exception.
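The accepted-shape handling described above could look like the following minimal sketch. This is an assumption about what the wrapper ought to do, not the actual TensorRT-LLM code: accept either an int or a list of ints for eos_token_id, and map a list to one primary end id plus additional stop conditions.

```python
def normalize_eos(eos_token_id):
    """Return (end_id, extra_stop_ids) from an int or list[int].

    Hypothetical helper illustrating the expected behavior: a scalar maps
    to a single end id; a list maps to a primary end id plus extra stop
    conditions instead of raising a cast error.
    """
    if isinstance(eos_token_id, int):
        return eos_token_id, []
    ids = list(eos_token_id)
    if not ids:
        raise ValueError("eos_token_id must not be empty")
    return ids[0], ids[1:]

# Meta's Llama 3.1 config uses two terminators:
print(normalize_eos([128001, 128009]))  # → (128001, [128009])
print(normalize_eos(128001))            # → (128001, [])
```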

Actual behavior

The server immediately rejects the request with a C++ type-cast error. Because the official Meta config sets "eos_token_id": [128001, 128009], PyBind11 fails to cast the Python list into the single int32_t variable expected by the C++ SamplingConfig.

The server logs the following error to the client:
{"object":"error","message":"Encountered an error in forwardAsync function: std::bad_cast","type":"BadRequestError","param":null,"code":400}

Additional notes

  • Impact: This breaks out-of-the-box deployment for the Llama 3/3.1 family when using trtllm-serve in a production environment.
  • Constraint: We cannot manually alter generation_config.json to force the list into a single integer.
  • Question regarding support: Given that Meta's official Llama 3.1 configurations natively utilize a list for eos_token_id, is this model fully supported in the 1.3.0rc4 Executor API without modifying the upstream JSON?
  • Request for Help: If it is supported, what is the officially recommended way to pass or override these stop tokens (e.g., via server startup flags or the API request payload) to safely bypass the C++ list-casting crash while we wait for a permanent patch?
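One client-side mitigation we have considered (an assumption, pending guidance): instead of relying on the engine's eos_token_id list, pass the textual forms of Llama 3.1's two terminators (<|end_of_text|> is token 128001, <|eot_id|> is token 128009) via the OpenAI-style "stop" field of the request. Whether trtllm-serve maps "stop" strings to tokenizer-level stop conditions on the tensorrt backend is exactly the open question above.

```python
import json

# Hypothetical request payload: stop sequences given as strings so the
# request never touches the list-valued eos_token_id path.
payload = {
    "model": "llama3_8b_base_engine",
    "prompt": "Say hello.",
    "max_tokens": 10,
    "stop": ["<|eot_id|>", "<|end_of_text|>"],
}
print(json.dumps(payload, indent=2))
```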
