Description
System Info
- GPU: NVIDIA GeForce RTX 5090 Blackwell (SM 120)
- CPU: Intel i9-14900K Raptor Lake-S Refresh (32) @ 5.700GHz
- RAM: TeamGroup Delta RGB 64GB DDR5-7600 7600MHz Dual Channel
- OS: Ubuntu 24.04.3 LTS x86_64
- Kernel: 6.14.0-37-generic
- Docker Image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2
- Python Version: 3.12.3
- CUDA Version: 12.8
Who can help?
@2ez4bz
@yuanjingx87
@karljang
@greg-kwasniewski1
@Wanli-Jiang
@kaiyux
@nv-guomingz
@nekorobov
@Funatiq
@byshiue
Information
- The official example scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
# Build the engine for Llama 3.1 (which uses a list for eos_token_id in generation_config.json)
trtllm-build \
--checkpoint_dir /models/llama3_8b_legacy \
--output_dir /models/llama3_8b_base_engine \
--max_batch_size 1 \
--max_seq_len 3000 \
--max_num_tokens 6000
# Start the FastAPI server using the C++ Executor backend
trtllm-serve /models/llama3_8b_base_engine \
--backend tensorrt \
--tokenizer /models/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# Send a basic completions request (in a separate terminal)
curl -sS http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3_8b_base_engine",
"prompt": "Say hello.",
"max_tokens": 10
}'
Expected behavior
The trtllm-serve wrapper and underlying C++ Executor API should natively parse and accept an array/list of integers for eos_token_id (and pad_token_id) directly from the Hugging Face generation_config.json, correctly mapping them to multiple stop conditions without throwing a C++ exception.
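For comparison, the coercion we expected is straightforward to express. The helper below is a hypothetical sketch (not TensorRT-LLM code) of how a config loader could accept both the scalar and list forms that Hugging Face allows for these fields:

```python
def normalize_token_ids(value):
    """Coerce an eos_token_id / pad_token_id entry from
    generation_config.json into a list of ints.

    Hugging Face allows either a single int (e.g. 128001) or a
    list of ints (e.g. [128001, 128009]); a consumer that only
    accepts the scalar form breaks on Llama 3/3.1 configs.
    """
    if value is None:
        return []
    if isinstance(value, int):
        return [value]
    if isinstance(value, (list, tuple)) and all(isinstance(v, int) for v in value):
        return list(value)
    raise TypeError(f"Unsupported token id spec: {value!r}")

# Llama 3.1 ships a list; older configs ship a scalar.
print(normalize_token_ids([128001, 128009]))  # [128001, 128009]
print(normalize_token_ids(128001))            # [128001]
```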
Actual behavior
The server immediately rejects the request with a C++ type-casting exception. Because the official Meta config sets "eos_token_id": [128001, 128009], PyBind11 fails to cast the Python list into the single int32_t field expected by the C++ SamplingConfig.
The server returns the following error to the client:
{"object":"error","message":"Encountered an error in forwardAsync function: std::bad_cast","type":"BadRequestError","param":null,"code":400}
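A quick way to check whether a given checkpoint is affected is to inspect its generation_config.json directly. The snippet below is a triage sketch (paths are illustrative; adjust to your model directory):

```python
import json
from pathlib import Path

def eos_is_list(model_dir):
    """Return True if the checkpoint's generation_config.json declares
    eos_token_id as a list -- the shape that triggers the std::bad_cast
    described above."""
    cfg_path = Path(model_dir) / "generation_config.json"
    cfg = json.loads(cfg_path.read_text())
    return isinstance(cfg.get("eos_token_id"), list)

# Llama 3.1's upstream config declares "eos_token_id": [128001, 128009],
# so this returns True for e.g. /models/Llama-3.1-8B-Instruct.
```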
Additional notes
- Impact: This breaks out-of-the-box deployment for the Llama 3/3.1 family when using trtllm-serve in a production environment.
- Constraint: We cannot manually alter generation_config.json to force the list into a single integer.
- Question regarding support: Given that Meta's official Llama 3.1 configurations natively use a list for eos_token_id, is this model fully supported in the 1.3.0rc4 Executor API without modifying the upstream JSON?
- Request for help: If it is supported, what is the officially recommended way to pass or override these stop tokens (e.g., via server startup flags or the API request payload) to safely bypass the C++ list-casting crash while we wait for a permanent patch?
eos_token_id, is this model fully supported in the1.3.0rc4Executor API without modifying the upstream JSON? - Request for Help: If it is supported, what is the officially recommended way to pass or override these stop tokens (e.g., via server startup flags or the API request payload) to safely bypass the C++ list-casting crash while we wait for a permanent patch?