Add llama-cpp-python backend for CosyVoice3 by Ferraronp · Pull Request #1872 · FunAudioLLM/CosyVoice

Ferraronp · 2026-04-15T20:41:22Z

Summary

Adds an optional llama-cpp-python inference backend for CosyVoice3, enabling CPU and low-VRAM inference with GGUF-quantized models.

Changes

cosyvoice/cli/cosyvoice.py: Added load_llama_cpp and gguf_model_path parameters to CosyVoice3.__init__. Overrides inference_zero_shot, inference_cross_lingual, and inference_instruct2 with a llama.cpp path. Both streaming and non-streaming modes are supported.
cosyvoice/cli/model.py: Added tts_with_external_tokens and tts_stream_external_llm methods.
README.md: Added llama-cpp-python backend installation and usage instructions.

Usage

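# Import path assumed from the CosyVoice repository layout (cosyvoice/cli/cosyvoice.py)
from cosyvoice.cli.cosyvoice import AutoModel

cosyvoice = AutoModel(
    model_dir='pretrained_models/Fun-CosyVoice3-0.5B',
    load_llama_cpp=True,                    # use the llama-cpp-python backend for the LLM
    gguf_model_path='/path/to/model.gguf',  # GGUF file for the LLM
)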

All existing inference methods (inference_zero_shot, inference_cross_lingual, inference_instruct2) work unchanged.

Performance (NVIDIA T4, fp16)

Backend	Avg RTF
PyTorch fp16 (original)	~1.17
llama-cpp-python F16 GGUF	~0.45

~2.6x faster inference on the T4 GPU (1.17 / 0.45 ≈ 2.6; a lower RTF is faster).
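
For readers unfamiliar with the metric: real-time factor (RTF) is commonly defined as synthesis time divided by the duration of the generated audio. The sketch below shows how the speedup figure follows from the two table rows. The helper names are illustrative and not part of this PR, and the PR does not state how its averages were measured.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(baseline_rtf: float, new_rtf: float) -> float:
    """How many times faster the new backend is than the baseline."""
    if new_rtf <= 0:
        raise ValueError("new_rtf must be positive")
    return baseline_rtf / new_rtf


# Approximate averages reported in the table above.
pytorch_rtf = 1.17
llama_cpp_rtf = 0.45
print(f"speedup: {speedup(pytorch_rtf, llama_cpp_rtf):.1f}x")  # prints "speedup: 2.6x"
```

An RTF below 1.0 means audio is produced faster than it plays back, which is what matters for the streaming mode.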

Pre-converted GGUF models

Available on Hugging Face: Ferraronp/CosyVoice3-qwen2.5-0.5b-speech-gguf
Converter: Ferraronp/CosyVoice-gguf-converter

Notes

Only supported for CosyVoice3 / Fun-CosyVoice3-0.5B
Requires pip install llama-cpp-python
When load_llama_cpp=True, the PyTorch LLM weights are not loaded, which saves VRAM
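
The notes above imply a small decision at construction time: the llama.cpp path needs both the optional package and a GGUF path, and otherwise the PyTorch path applies. A hedged sketch of such a guard follows. `choose_backend` and its `is_installed` parameter are hypothetical helpers for illustration, not code from this PR; `llama_cpp` is the module name installed by the llama-cpp-python package.

```python
import importlib.util
from typing import Callable, Optional


def llama_cpp_installed() -> bool:
    """True if the llama-cpp-python package (module `llama_cpp`) is importable."""
    return importlib.util.find_spec("llama_cpp") is not None


def choose_backend(
    load_llama_cpp: bool,
    gguf_model_path: Optional[str],
    is_installed: Callable[[], bool] = llama_cpp_installed,
) -> str:
    """Return "llama_cpp" or "pytorch" (hypothetical helper, not from the PR)."""
    if not load_llama_cpp:
        return "pytorch"
    if not gguf_model_path:
        raise ValueError("load_llama_cpp=True requires gguf_model_path")
    if not is_installed():
        raise ImportError("llama-cpp-python is missing: pip install llama-cpp-python")
    return "llama_cpp"
```

Passing the availability check as a parameter keeps the guard testable on machines where llama-cpp-python is not installed.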

Out of ~1870 GGUFs in a typical local catalogue, ~30-50 are not chat models at all (TTS, audio codec, embedding, vision, video). Running the 28-prompt suite against them wastes minutes and pollutes the radar with all-zero scores. Three signals catch them, ordered cheapest to most structural; any one alone is enough to quarantine.

Tier 1 - `general.architecture` blacklist (header-only, instant): extend `NON_GENERATIVE_ARCH` with the dedicated TTS arch names introduced by Qwen3-TTS / OuteTTS / Parler-TTS / Kokoro / MOSS-TTS / SNAC-TTS / Qwen2-TTS GGUFs.

Tier 2 - `general.name` regex (header-only, instant): `NON_CHAT_NAME_RE` matches `tokenizer|codec|vocoder|vq|cfm|tts`. Uses `[\W_]` boundary class instead of `\b` because JS regex treats `_` as a word char - `\btokenizer\b` does NOT match `Llamacpp_Tokenizer` (the Cosyvoice3 name). Tested against the real Cosyvoice GGUF + false-positive guards (Mistral, Qwen2.5-Math, Llama-3.1-Instruct, "Ottosaurus" do NOT match).

Tier 3 - adaptive quarantine on non-text output (post-launch): `looksLikeNonTextResponse` in `characterize.ts` matches responses consisting entirely of `<|stop_1|>`-style audio codebook tokens. Plus a 5-consecutive-empty-response fallback for models that emit nothing at all. Throws an `Error` with "non-chat model" in the message - `characterizeAll`'s `isUnsupported` regex now catches that pattern so the model lands in `characterization_state:'failed'` and is never re-tried.

Side change: read the GGUF header ONCE in `runCharacterization` (arch + name + embedding pooling all come from the same fs read). The `archOf` test seam is still honoured for shape parity.

Also wire the `archiveServerLog` calls at the start of each per-model loop in `characterizeAll` AND at the head of `runCharacterization` so single-CARACTÉRISER triggers also get the fresh session log.

`prompt_done` events now carry the full `DiagnosticRunEntry` so the renderer can live-update the Interactions tab without waiting for the whole-model signature to land on disk (consumed in the next commit).

Refs:
https://huggingface.co/cstr/qwen3-tts-1.7b-customvoice-GGUF (Qwen3-TTS arch convention)
FunAudioLLM/CosyVoice#1872 (Cosyvoice3 llama-cpp-python backend - confirms it uses an LLM backbone with a separate audio tokenizer)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
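
The Tier 2 and Tier 3 checks described in the commit message above can be sketched in Python, whose `re` module shares the relevant behavior with JS: `_` counts as a word character, so `\b` finds no boundary inside `Llamacpp_Tokenizer`. This is a minimal illustration, not the project's TypeScript code. The exact patterns, the special-token grammar, the snake_case function names, and the `check_responses` helper are assumptions derived only from the description.

```python
import re

# Keywords listed for NON_CHAT_NAME_RE in the commit message.
KEYWORDS = r"tokenizer|codec|vocoder|vq|cfm|tts"

# \b version: "_" is a word character, so "_Tokenizer" has no boundary.
NAIVE_NAME_RE = re.compile(rf"\b(?:{KEYWORDS})\b", re.IGNORECASE)

# [\W_] version: a keyword must be delimited by string start/end or by any
# character that is not a letter or digit (underscore included).
NON_CHAT_NAME_RE = re.compile(
    rf"(?:^|[\W_])(?:{KEYWORDS})(?:[\W_]|$)", re.IGNORECASE
)

# One or more <|...|> special tokens and nothing else (whitespace allowed).
# The token grammar is an assumption based on the `<|stop_1|>` example.
SPECIAL_TOKENS_ONLY_RE = re.compile(r"\s*(?:<\|[^|<>\s]+\|>\s*)+")


def is_non_chat_name(name: str) -> bool:
    """Tier 2: does the GGUF `general.name` look like a non-chat model?"""
    return NON_CHAT_NAME_RE.search(name) is not None


def looks_like_non_text_response(text: str) -> bool:
    """Tier 3: is the response made up entirely of special tokens?"""
    return SPECIAL_TOKENS_ONLY_RE.fullmatch(text) is not None


def check_responses(responses, empty_limit: int = 5) -> None:
    """Raise with "non-chat model" in the message when a quarantine signal fires."""
    consecutive_empty = 0
    for text in responses:
        if looks_like_non_text_response(text):
            raise RuntimeError("non-chat model: output is only special tokens")
        # Count empty responses in a row; any non-empty response resets the run.
        consecutive_empty = consecutive_empty + 1 if not text.strip() else 0
        if consecutive_empty >= empty_limit:
            raise RuntimeError("non-chat model: consecutive empty responses")


print(NAIVE_NAME_RE.search("Llamacpp_Tokenizer"))  # prints "None"
print(is_non_chat_name("Llamacpp_Tokenizer"))      # prints "True"
print(is_non_chat_name("Llama-3.1-Instruct"))      # prints "False"
```

Keeping "non-chat model" in every raised message mirrors the description, where a single downstream regex decides whether the failure is permanent.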

Ferraronp added 3 commits April 15, 2026 22:28

Add llama-cpp-python backend for CosyVoice3

de3341f

Remove verbose debug logging from llama.cpp backend

1b007c4

Remove unused speech_token_offset parameter

de8818d

Change sampling params

d8f708d

Ferraronp wants to merge 4 commits into FunAudioLLM:main from Ferraronp:main
