Environment
- GPU: NVIDIA A100-SXM4-40GB
- PyTorch: 2.3.1+cu121
- CUDA: 12.1
- cuDNN: 8902
- Model loaded with:
torch_dtype=torch.float16, device_map="cuda:0", attn_implementation="sdpa"
Issue
I'm seeing ~27ms per token during autoregressive LLM inference, which results in ~4-5s time-to-first-audio (TTFA) for typical sentences. I was expecting faster inference on an A100.
Benchmarks
I ran isolated benchmarks on the LLM (outside of GLM-TTS wrapper):
Autoregressive generation with KV-cache
50 token prompt, generating 20 tokens one at a time
Autoregressive per-token: 27.30ms
Expected for 158 tokens: 4.31s
This matches what I see in practice:
LLM=4.04s, Flow=0.59s, Total=4.63s
RTF: 0.75x
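For reference, the per-token numbers above came from a loop roughly like the sketch below. The `decode_step` callable here is a stand-in (a sleep mimicking the measured latency); the real benchmark ran a single model forward with KV-cache per step, and on GPU you would also call `torch.cuda.synchronize()` before reading the clock.

```python
import time

def time_per_token(decode_step, n_tokens, warmup=3):
    """Time an autoregressive decode loop and return mean ms/token.

    decode_step: callable performing one decode step. Here it is a
    placeholder; the real benchmark ran the LLM forward with KV-cache.
    """
    for _ in range(warmup):            # warm up kernels / allocator
        decode_step()
    start = time.perf_counter()
    for _ in range(n_tokens):
        decode_step()
    elapsed = time.perf_counter() - start
    return elapsed / n_tokens * 1000.0  # mean ms per token

# Stand-in step sleeping ~27 ms, mimicking the measured latency.
ms = time_per_token(lambda: time.sleep(0.027), n_tokens=20)
print(f"{ms:.2f} ms/token")
```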
Model info:
hidden_size: 2048
num_hidden_layers: 28
num_attention_heads: 16
num_key_value_heads: 4
intermediate_size: 6144
vocab_size: 98304
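A quick back-of-the-envelope check (my own arithmetic, assuming a standard GQA transformer with a gated SwiGLU MLP and untied embeddings; norms and biases ignored) puts this model around 1.7-1.8B parameters and suggests decode is far from the A100's memory-bandwidth floor:

```python
hidden, layers, heads, kv_heads, inter, vocab = 2048, 28, 16, 4, 6144, 98304
head_dim = hidden // heads              # 128
kv_dim = kv_heads * head_dim            # 512 (GQA)

attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q,O + K,V projections
mlp = 3 * hidden * inter                          # gate/up/down (SwiGLU assumed)
params = layers * (attn + mlp) + 2 * vocab * hidden  # + embedding & lm_head

bytes_fp16 = 2 * params
bw = 1555e9   # A100-SXM4-40GB peak HBM bandwidth, ~1.6 TB/s
lower_bound_ms = bytes_fp16 / bw * 1e3  # each decoded token streams all weights once

print(f"~{params/1e9:.2f}B params, bandwidth-bound floor ~{lower_bound_ms:.1f} ms/token")
```

If this rough count is right, ~27 ms/token is an order of magnitude above the weight-streaming floor of ~2-3 ms, which usually points at per-step overhead (Python loop, kernel launches) rather than raw compute; that is exactly what CUDA graphs / `torch.compile` or serving engines like vLLM target.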
Questions
- Is ~27ms/token expected for this model on A100?
- How are users achieving fast TTFA (time-to-first-audio) for real-time applications?
- Would vLLM or TensorRT-LLM provide significant speedups?
- Are there any recommended optimizations (quantization, flash-attn, etc.)?
RTF < 1.0 is great, but for interactive/conversational use cases a 4-5s TTFA is a real obstacle.
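To make that concrete: if the flow stage could consume speech tokens in chunks instead of waiting for the full sequence (I don't know whether the current wrapper supports this, so the chunk size below is a made-up parameter), TTFA would be set by the first chunk rather than the whole utterance:

```python
per_token_ms = 27.3   # measured above
total_tokens = 158    # measured above
flow_s = 0.59         # measured above, flow time for the full utterance

# Non-streaming: wait for every LLM token, then run flow once.
ttfa_blocking = total_tokens * per_token_ms / 1000 + flow_s

# Hypothetical streaming: flow starts after the first chunk of tokens.
first_chunk = 25                                     # made-up chunk size
flow_chunk_s = flow_s * first_chunk / total_tokens   # crude proportional estimate
ttfa_streaming = first_chunk * per_token_ms / 1000 + flow_chunk_s

print(f"blocking TTFA ~{ttfa_blocking:.2f}s, streaming TTFA ~{ttfa_streaming:.2f}s")
```

Under these assumptions TTFA would drop from roughly 5s to under 1s, without making the LLM itself any faster.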
The voice quality is impressive; that said, I wish TTS model cards reported expected real-world TTFA, since RTF is only a small part of the story.