LLM inference speed: 27ms/token on A100 - how to optimize? #51

@ndmgrphc

Description

Environment

  • GPU: NVIDIA A100-SXM4-40GB
  • PyTorch: 2.3.1+cu121
  • CUDA: 12.1
  • cuDNN: 8902
  • Model loaded with: torch_dtype=torch.float16, device_map="cuda:0", attn_implementation="sdpa"

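For reference, this is roughly how I'm loading the model (a minimal sketch; the checkpoint path is a placeholder, not the actual repo id):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/glm-tts-llm"  # placeholder -- substitute the real checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,    # fp16 weights, as listed above
    device_map="cuda:0",
    attn_implementation="sdpa",   # PyTorch scaled_dot_product_attention
).eval()
```
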
Issue

I'm seeing ~27ms per token during autoregressive LLM inference, which works out to ~4-5s time-to-first-audio for typical sentences. I was expecting faster inference on an A100.

Benchmarks

I ran isolated benchmarks on the LLM (outside of the GLM-TTS wrapper):

Autoregressive generation with KV-cache

50-token prompt, generating 20 tokens one at a time:

Autoregressive per-token: 27.30ms
Expected for 158 tokens: 4.31s

This matches what I see in practice:

LLM=4.04s, Flow=0.59s, Total=4.63s
RTF: 0.75x
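
For reproducibility, the per-token figure comes from a loop like this (a minimal sketch, assuming the model object from above; random token ids stand in for the real prompt, and timing uses CUDA events averaged over the 20 decode steps):

```python
import torch

prompt_len, gen_len = 50, 20
input_ids = torch.randint(0, model.config.vocab_size, (1, prompt_len), device="cuda")

with torch.no_grad():
    # Prefill: run the 50-token prompt once and keep the KV cache.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_tok = out.logits[:, -1:].argmax(-1)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(gen_len):
        # Decode: feed only the newest token, reusing the KV cache.
        out = model(next_tok, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_tok = out.logits[:, -1:].argmax(-1)
    end.record()
    torch.cuda.synchronize()

print(f"per-token: {start.elapsed_time(end) / gen_len:.2f} ms")
```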

Model info:

hidden_size: 2048
num_hidden_layers: 28
num_attention_heads: 16
num_key_value_heads: 4
intermediate_size: 6144
vocab_size: 98304

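For context on why I expected faster, here is a back-of-envelope estimate from the config above (my own sketch; it assumes a SwiGLU-style gated MLP, untied embeddings, and ~1.5 TB/s effective HBM bandwidth on the A100-SXM4, any of which may be off for this model):

```python
hidden, layers, heads, kv_heads, inter, vocab = 2048, 28, 16, 4, 6144, 98304
head_dim = hidden // heads            # 128
kv_dim = kv_heads * head_dim          # 512 (GQA: 4 KV heads)

attn = 2 * hidden * hidden + 2 * hidden * kv_dim      # q/o + k/v projections
mlp = 3 * hidden * inter                              # gate/up/down (assumed SwiGLU)
params = layers * (attn + mlp) + 2 * vocab * hidden   # + embed/lm_head (assumed untied)
print(f"~{params / 1e9:.2f}B params")                 # ~1.75B

# fp16 decode has to stream all weights once per token -> bandwidth-bound floor:
print(f"floor: ~{2 * params / 1.5e12 * 1e3:.1f} ms/token")  # ~2.3 ms
```

If that estimate is even roughly right, 27ms/token is an order of magnitude above the weight-streaming floor, which makes me suspect per-step overhead (Python loop, kernel launches) rather than raw compute.
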
Questions

  1. Is ~27ms/token expected for this model on A100?
  2. How are users achieving fast TTFA (time-to-first-audio) for real-time applications?
  3. Would vLLM or TensorRT-LLM provide significant speedups?
  4. Are there any recommended optimizations (quantization, flash-attn, etc.)? A sketch of the kind of change I mean is below.

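To make question 4 concrete (untested sketch; FlashAttention-2 needs the flash-attn package installed, and I haven't verified that this model's code is compatible with torch.compile):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/glm-tts-llm",                    # placeholder path
    torch_dtype=torch.float16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",  # instead of "sdpa"
).eval()

# "reduce-overhead" uses CUDA graphs to cut per-step launch overhead,
# the usual suspect when decode sits far above the bandwidth floor.
model.forward = torch.compile(model.forward, mode="reduce-overhead")
```

If anyone has this model running under vLLM or TensorRT-LLM, real numbers would be very welcome.
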
The RTF < 1.0 is great, but for interactive/conversational use cases a 4-5s TTFA is a serious limitation.

The voice quality is impressive; however, I do wish TTS model cards came with expected real-world TTFA figures. RTF is only a small part of the story.
