qwen3-tts-1.7b-custom-voice vllm-omni backend map fails. Qwen backend map has errors #9293

@JohnGalt1717

Description
LocalAI version:
LocalAI v4.1.3 (fdc9f7b)

Environment, CPU architecture, OS, and Version:
Docker, x64, Ubuntu, 24.04

Describe the bug
When you install this model and try to use it in the studio TTS interface, it lists two options: the vllm-omni backend and the Qwen backend.

Both have issues. vllm-omni fails entirely. The Qwen backend shows a ton of warnings and eventually works.

vllm-omni should work (I think this is related to the comments I made on #8536 that omni doesn't work at all), and the Qwen backend should ship with all of the right modules installed (flash-attn, per the logs below) for maximum performance.
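The traceback in the logs below shows the backend requesting flash_attention_2, hitting an ImportError, and only then retrying with SDPA. A minimal sketch of guarding the default up front instead (hypothetical helper; how the actual backend.py wires `attn_implementation` into `load_kwargs` is assumed):

```python
import importlib.util

def pick_attn_implementation() -> str:
    """Return flash_attention_2 only if flash_attn is importable,
    otherwise fall back to SDPA before model load, avoiding the
    ImportError/retry cycle seen in the logs."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

print(pick_attn_implementation())
```

In an image without flash-attn installed (as here), this prints `sdpa`; the real fix for performance is still to bake flash-attn into the cuda13-qwen-tts venv.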

To Reproduce

  1. Install Qwen3-tts-12hz-1.7b-customvoice from the model gallery
  2. Go to studio/tts
  3. Type "hi"
  4. Choose one of the two models listed.
  5. Submit

Expected behavior

Both listed backends generate audio successfully: vllm-omni should not fail outright, and the Qwen backend should load with flash_attention_2 as configured instead of warning and falling back to SDPA.
Logs

11:07:23 AM.587
stdout
Initializing libbackend for cuda13-qwen-tts
11:07:23 AM.589
stdout
Using portable Python
11:07:23 AM.662
stdout
Added /backends/cuda13-qwen-tts/lib to LD_LIBRARY_PATH for GPU libraries
11:07:27 AM.779
stderr
/backends/cuda13-qwen-tts/venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
11:07:27 AM.779
stderr
  warnings.warn(
11:07:33 AM.059
stderr
Server started. Listening on: 127.0.0.1:38041
11:07:33 AM.615
stderr
CUDA is available
11:07:33 AM.615
stderr
Using device: cuda, torch_dtype: torch.bfloat16, attn_implementation: flash_attention_2, model_type: CustomVoice
11:07:33 AM.615
stderr
Loading model from: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
11:08:07 AM.492
stderr
[ERROR] Loading model: ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
11:08:07 AM.498
stderr
Traceback (most recent call last):
11:08:07 AM.498
stderr
  File "/backends/cuda13-qwen-tts/backend.py", line 250, in LoadModel
11:08:07 AM.498
stderr
    self.model = Qwen3TTSModel.from_pretrained(model_path, **load_kwargs)
11:08:07 AM.498
stderr
  File "/backends/cuda13-qwen-tts/venv/lib/python3.10/site-packages/qwen_tts/inference/qwen3_tts_model.py", line 112, in from_pretrained
11:08:07 AM.498
stderr
    model = AutoModel.from_pretrained(pretrained_model_name_or_path, **kwargs)
11:08:07 AM.498
stderr
  File "/backends/cuda13-qwen-tts/venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 604, in from_pretrained
11:08:07 AM.498
stderr
    return model_class.from_pretrained(
11:08:07 AM.498
stderr
  File "/backends/cuda13-qwen-tts/venv/lib/python3.10/site-packages/qwen_tts/core/models/modeling_qwen3_tts.py", line 1876, in from_pretrained
11:08:07 AM.498
stderr
    model = super().from_pretrained(
11:08:07 AM.498
stderr
  File "/backends/cuda13-qwen-tts/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 277, in _wrapper
11:08:07 AM.498
stderr
    return func(*args, **kwargs)
11:08:07 AM.498
stderr
  File "/backends/cuda13-qwen-tts/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4971, in from_pretrained
11:08:07 AM.498
stderr
    model = cls(config, *model_args, **model_kwargs)
11:08:07 AM.498
stderr
  File "/backends/cuda13-qwen-tts/venv/lib/python3.10/site-packages/qwen_tts/core/models/modeling_qwen3_tts.py", line 1817, in __init__
11:08:07 AM.498
stderr
    super().__init__(config)
11:08:07 AM.498
stderr
  File "/backends/cuda13-qwen-tts/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2076, in __init__
11:08:07 AM.498
stderr
    self.config._attn_implementation_internal = self._check_and_adjust_attn_implementation(
11:08:07 AM.498
stderr
  File "/backends/cuda13-qwen-tts/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2686, in _check_and_adjust_attn_implementation
11:08:07 AM.498
stderr
    applicable_attn_implementation = self.get_correct_attn_implementation(
11:08:07 AM.498
stderr
  File "/backends/cuda13-qwen-tts/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2714, in get_correct_attn_implementation
11:08:07 AM.499
stderr
    self._flash_attn_2_can_dispatch(is_init_check)
11:08:07 AM.499
stderr
  File "/backends/cuda13-qwen-tts/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2422, in _flash_attn_2_can_dispatch
11:08:07 AM.499
stderr
    raise ImportError(f"{preface} the package flash_attn seems to be not installed. {install_message}")
11:08:07 AM.499
stderr
ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
11:08:07 AM.499
stderr
11:08:07 AM.499
stderr
11:08:07 AM.499
stderr
Trying to use SDPA instead of flash_attention_2...
11:08:09 AM.164
stdout
11:08:09 AM.164
stdout
********
11:08:09 AM.164
stdout
Warning: flash-attn is not installed. Will only run the manual PyTorch version. Please install flash-attn for faster inference.
11:08:09 AM.164
stdout
********
11:08:09 AM.164
stdout
 
11:08:15 AM.342
stderr

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]
Fetching 4 files:  25%|██▌       | 1/4 [00:00<00:01,  2.58it/s]
Fetching 4 files:  75%|███████▌  | 3/4 [00:06<00:02,  2.24s/it]
Fetching 4 files: 100%|██████████| 4/4 [00:06<00:00,  1.54s/it]
11:08:19 AM.599
stderr
Model loaded successfully: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
11:08:19 AM.599
stderr
Detected mode: CustomVoice
11:08:19 AM.599
stderr
Warning: Speaker 'Aiden' not in supported list. Available: ['aiden', 'dylan', 'eric', 'ono_anna', 'ryan', 'serena', 'sohee', 'uncle_fu', 'vivian']
11:08:19 AM.599
stderr
Using matched speaker: aiden
11:08:19 AM.785
stderr
Setting `pad_token_id` to `eos_token_id`:2150 for open-end generation.
11:08:21 AM.120
stderr
Saved 0.56s audio to /tmp/generated/content/audio/tts.wav
