
Add llama-cpp-python backend for CosyVoice3#1872

Open
Ferraronp wants to merge 4 commits into FunAudioLLM:main from Ferraronp:main


@Ferraronp


Summary

Adds an optional llama-cpp-python inference backend for CosyVoice3, enabling CPU and low-VRAM inference with GGUF-quantized models.

Changes

  • cosyvoice/cli/cosyvoice.py: Added load_llama_cpp and gguf_model_path parameters to CosyVoice3.__init__. Overrides inference_zero_shot, inference_cross_lingual, and inference_instruct2 with a llama.cpp code path. Both streaming and non-streaming modes are supported.
  • cosyvoice/cli/model.py: Added tts_with_external_tokens and tts_stream_external_llm methods.
  • README.md: Added llama-cpp-python backend installation and usage instructions.

Usage

cosyvoice = AutoModel(
    model_dir='pretrained_models/Fun-CosyVoice3-0.5B',
    load_llama_cpp=True,
    gguf_model_path='/path/to/model.gguf'
)

All existing inference methods (inference_zero_shot, inference_cross_lingual, inference_instruct2) work unchanged.

Performance (NVIDIA T4, fp16)

| Backend | Avg RTF |
|---|---|
| PyTorch fp16 (original) | ~1.17 |
| llama-cpp-python F16 GGUF | ~0.45 |

~2.6x faster inference on T4 GPU.
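
For readers checking the figure: real-time factor (RTF) is conventionally synthesis time divided by the duration of the generated audio, so lower is better and values under 1 are faster than real time. The following is a minimal sketch of that arithmetic using the two averages reported above. The function names are illustrative and are not part of this PR.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(baseline_rtf: float, new_rtf: float) -> float:
    """How many times faster the new backend is than the baseline."""
    if new_rtf <= 0:
        raise ValueError("new_rtf must be positive")
    return baseline_rtf / new_rtf


# Average RTFs reported above for the NVIDIA T4 (approximate values).
PYTORCH_FP16_RTF = 1.17
LLAMA_CPP_F16_RTF = 0.45

print(f"speedup: {speedup(PYTORCH_FP16_RTF, LLAMA_CPP_F16_RTF):.1f}x")
```

An RTF of ~1.17 means the PyTorch path takes slightly longer than the audio it produces, while ~0.45 means the GGUF path generates audio in under half its playback time.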

Pre-converted GGUF models

Available on Hugging Face: Ferraronp/CosyVoice3-qwen2.5-0.5b-speech-gguf
Converter: Ferraronp/CosyVoice-gguf-converter

Notes

  • Only supported for CosyVoice3 / Fun-CosyVoice3-0.5B
  • Requires pip install llama-cpp-python
  • When load_llama_cpp=True, the PyTorch LLM weights are not loaded, which saves VRAM

Syphys added a commit to Syphys/maestria that referenced this pull request May 25, 2026
Out of ~1870 GGUFs in a typical local catalogue, ~30-50 are not
chat models at all (TTS, audio codec, embedding, vision, video).
Running the 28-prompt suite against them wastes minutes and pollutes
the radar with all-zero scores. Three signals catch them, ordered
cheapest to most structural; any one alone is enough to quarantine.

Tier 1 - `general.architecture` blacklist (header-only, instant):
  extend `NON_GENERATIVE_ARCH` with the dedicated TTS arch names
  introduced by Qwen3-TTS / OuteTTS / Parler-TTS / Kokoro /
  MOSS-TTS / SNAC-TTS / Qwen2-TTS GGUFs.

Tier 2 - `general.name` regex (header-only, instant):
  `NON_CHAT_NAME_RE` matches `tokenizer|codec|vocoder|vq|cfm|tts`.
  Uses `[\W_]` boundary class instead of `\b` because JS regex
  treats `_` as a word char - `\btokenizer\b` does NOT match
  `Llamacpp_Tokenizer` (the Cosyvoice3 name). Tested against the
  real Cosyvoice GGUF + false-positive guards (Mistral, Qwen2.5-Math,
  Llama-3.1-Instruct, "Ottosaurus" do NOT match).
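
The boundary issue in Tier 2 can be reproduced outside JS, since Python's `re` module also counts `_` as a word character. The sketch below is an illustration only: the keyword list comes from this commit message, while the exact pattern shape, the case-insensitive flag, and the helper name `is_non_chat_name` are assumptions rather than the code in the commit.

```python
import re

# Keywords listed in the commit message; the production pattern may differ.
_KEYWORDS = "tokenizer|codec|vocoder|vq|cfm|tts"

# Naive version: \b never fires between "_" and a letter, because both
# are word characters, so "Llamacpp_Tokenizer" slips through.
NAIVE_RE = re.compile(rf"\b(?:{_KEYWORDS})\b", re.IGNORECASE)

# Boundary-class version: a keyword must be delimited by the start or end
# of the string, or by any non-word character or underscore ([\W_]).
# Assumption: matching is case-insensitive.
NON_CHAT_NAME_RE = re.compile(
    rf"(?:^|[\W_])(?:{_KEYWORDS})(?:$|[\W_])", re.IGNORECASE
)


def is_non_chat_name(name: str) -> bool:
    """Return True when a model name contains a delimited non-chat keyword."""
    return NON_CHAT_NAME_RE.search(name) is not None


# Names taken from the commit message: one positive, four guards.
for name in ["Llamacpp_Tokenizer", "Mistral", "Qwen2.5-Math",
             "Llama-3.1-Instruct", "Ottosaurus"]:
    print(name, is_non_chat_name(name), NAIVE_RE.search(name) is not None)
```

Requiring a delimiter on both sides is what keeps short keywords such as `vq` or `tts` from matching inside longer words.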

Tier 3 - adaptive quarantine on non-text output (post-launch):
  `looksLikeNonTextResponse` in `characterize.ts` matches responses
  consisting entirely of `<|stop_1|>`-style audio codebook tokens.
  Plus a 5-consecutive-empty-response fallback for models that emit
  nothing at all. Throws an `Error` with "non-chat model" in the
  message - `characterizeAll`'s `isUnsupported` regex now catches
  that pattern so the model lands in `characterization_state:'failed'`
  and is never re-tried.
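
A rough Python sketch of the Tier 3 check, for illustration only. The actual `looksLikeNonTextResponse` lives in `characterize.ts` and is not shown here; the token pattern, the helper `check_responses`, and the use of `RuntimeError` are assumptions made for this example.

```python
import re
from typing import Iterable

# Assumption: audio codebook tokens look like <|name|>, e.g. <|stop_1|>.
_SPECIAL_TOKEN = r"<\|[^|<>\s]+\|>"
_ONLY_SPECIAL_TOKENS_RE = re.compile(rf"\s*(?:{_SPECIAL_TOKEN}\s*)+")

# Fallback threshold named in the commit message.
MAX_CONSECUTIVE_EMPTY = 5


def looks_like_non_text_response(text: str) -> bool:
    """True when the response is made up entirely of <|...|>-style tokens."""
    return _ONLY_SPECIAL_TOKENS_RE.fullmatch(text) is not None


def check_responses(responses: Iterable[str]) -> None:
    """Raise an error containing "non-chat model" on either Tier 3 signal."""
    empty_streak = 0
    for text in responses:
        if looks_like_non_text_response(text):
            raise RuntimeError("non-chat model: response is only special tokens")
        if text.strip() == "":
            empty_streak += 1
            if empty_streak >= MAX_CONSECUTIVE_EMPTY:
                raise RuntimeError("non-chat model: consecutive empty responses")
        else:
            # Any real text resets the empty-response counter.
            empty_streak = 0
```

Keeping the phrase "non-chat model" in both messages mirrors the commit's approach of letting one downstream pattern recognise either failure mode.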

Side change: read the GGUF header ONCE in `runCharacterization`
(arch + name + embedding pooling all come from the same fs read).
The `archOf` test seam is still honoured for shape parity.

Also wire the `archiveServerLog` calls at the start of each per-
model loop in `characterizeAll` AND at the head of
`runCharacterization` so single-CARACTÉRISER triggers also get the
fresh session log.

`prompt_done` events now carry the full `DiagnosticRunEntry` so the
renderer can live-update the Interactions tab without waiting for
the whole-model signature to land on disk (consumed in the next
commit).

Refs: https://huggingface.co/cstr/qwen3-tts-1.7b-customvoice-GGUF
      (Qwen3-TTS arch convention)
      FunAudioLLM/CosyVoice#1872
      (Cosyvoice3 llama-cpp-python backend - confirms it uses an
      LLM backbone with a separate audio tokenizer)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>