
[Bug]: [trtllm-gen fmha] Missing OE2m1 ForGen VarSeqQ32/64/128 kernels + E2M1 hard-gated selector causes high-throughput decode regression #11620

@baonudesifeizhai

Description


System Info

NVIDIA B300 GPU

Who can help?

for q in 8 16 32 64 128; do
  n=$(ls cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/QkvE4m3OE2m1ForGencubin.cpp 2>/dev/null | grep -E "VarSeqQ${q}Kv128" | wc -l)
  echo "Q${q}: ${n}"
done

Result (number of matching kernel files per Q tile size):
Q8: 24
Q16: 24
Q32: 0
Q64: 0
Q128: 0
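
The count above can be repeated for other kernel families with a small helper. This is only a sketch: the function name `count_varseq_kernels`, its argument order, and the assumption that each kernel lives in a file whose name contains both a family tag (such as `OE2m1`) and a `VarSeqQ<N>Kv128` token are illustrative, not part of TensorRT-LLM. It matches on name fragments with `find`, so it does not depend on one literal file path.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TensorRT-LLM): for each Q tile size,
# count the files under DIR whose names contain both TAG and
# "VarSeqQ<q>Kv128". The naming scheme is an assumption taken from the
# command in this issue.
#
# Usage: count_varseq_kernels DIR TAG Q [Q ...]
count_varseq_kernels() {
    local dir="$1" tag="$2"
    shift 2
    local q n
    for q in "$@"; do
        # Two -name tests are ANDed by find, so the order of the two
        # fragments inside the file name does not matter.
        n=$(find "$dir" -type f -name "*${tag}*" -name "*VarSeqQ${q}Kv128*" 2>/dev/null \
            | wc -l | tr -d ' ')
        echo "Q${q}: ${n}"
    done
}
```

For example, `count_varseq_kernels cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin OE2m1 8 16 32 64 128` prints one `Q<N>: <count>` line per tile size, in the same format as the result above. A count of 0 for a tile size means no file for that family and tile size was found under the directory.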

Related issue: vllm-project/vllm#34988


Reproduction

.

Expected behavior

.

Actual behavior

.

Additional notes

vllm-project/vllm#34988


Labels: Customized kernels<NV>, bug
