Offline VibeVoice text-to-speech that turns text into audio files — built to run CPU-native, and to use a GPU automatically when one is present.
This is a file generator, not a live read-aloud voice. Point it at some text and a reference voice, and it writes a WAV. It downloads the model and the reference voices on first use, and warns you when RAM is tight so a long run on a small machine isn't a surprise.
Microsoft's VibeVoice is a great long-form, expressive TTS model — but the published code is GPU-oriented, needs a manual model download, and has no clean "just give me a WAV" entry point. This package wraps it into something small and CPU-first: automatic device selection, on-demand model + voice downloads, a one-call Python API, and a CLI.
Born out of Quill (a screen-reader-first editor, a Community Access project), which wanted to export a document to a spoken audio file without requiring a GPU.
pip install "vibevoice-cpu[model]" # CLI/helpers + the inference stack (torch, etc.)
pip install "vibevoice-cpu[model,ram]" # also better RAM detection (psutil)

The first run downloads the model (several GB for 1.5B) and the reference voices into a cache ($VIBEVOICE_CPU_HOME, else ~/.cache/vibevoice-cpu).
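The cache lookup described above can be sketched in a few lines. This is an illustration only, not the package's code: the helper name `cache_dir` is hypothetical, and treating an empty `VIBEVOICE_CPU_HOME` as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Hypothetical helper: resolve the cache directory as the README describes.

    Uses $VIBEVOICE_CPU_HOME when it is set, otherwise ~/.cache/vibevoice-cpu.
    Assumption: an empty value counts as unset.
    """
    env = os.environ if env is None else env
    override = env.get("VIBEVOICE_CPU_HOME")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "vibevoice-cpu"
```

Pointing `VIBEVOICE_CPU_HOME` at a larger disk before the first run is a simple way to keep the multi-gigabyte download off a small home partition.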
from vibevoice_cpu import synthesize, list_voices
print(list_voices(download=True)) # ['Alice', 'Carter', 'Frank', ...]
synthesize("Hello there, this is VibeVoice.", "out.wav", voice="Alice")

vibevoice-cpu download # pre-fetch model + voices
vibevoice-cpu voices # list reference voices
vibevoice-cpu synth -o out.wav -v Alice "Hello there."
echo "A whole paragraph…" | vibevoice-cpu synth -o out.wav

- CUDA (NVIDIA) → used automatically, bfloat16.
- MPS (Apple Silicon) → used automatically, float32.
- CPU → the fallback, float32. It works — but VibeVoice is a large model, so on CPU expect minutes per passage, not real time. With less than ~8 GB free it will be very slow and may swap; the library prints a warning before it starts.
This is why it's a file tool: generate in the background, save the WAV, play it when it's ready.
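As a rough sketch of the device order and the low-memory warning described above (not the library's actual implementation): the function names `choose_device`, `detect_device`, and `low_ram_note` are hypothetical, the 8 GB threshold is taken from the "~8 GB" figure above, and the warning wording is invented for illustration. The `torch` calls used are `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
from typing import Optional, Tuple


def choose_device(cuda_available: bool, mps_available: bool) -> Tuple[str, str]:
    """Return (device, dtype name) in the documented order: CUDA, MPS, CPU."""
    if cuda_available:
        return "cuda", "bfloat16"
    if mps_available:
        return "mps", "float32"
    return "cpu", "float32"


def detect_device() -> Tuple[str, str]:
    """Probe torch if it is installed; fall back to CPU otherwise."""
    try:
        import torch
    except ImportError:
        return choose_device(False, False)
    # Older torch builds may not ship the MPS backend at all.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return choose_device(torch.cuda.is_available(), mps_ok)


def low_ram_note(free_gb: Optional[float], threshold_gb: float = 8.0) -> Optional[str]:
    """Return a warning string when free RAM is below the threshold.

    Assumption: an unknown amount (None) produces no warning.
    """
    if free_gb is None or free_gb >= threshold_gb:
        return None
    return (
        f"Only {free_gb:.1f} GB of RAM free (under ~{threshold_gb:g} GB): "
        "expect a very slow run and possible swapping."
    )
```

Keeping the decision in a pure function (`choose_device`) separate from the hardware probe makes the ordering easy to check on a machine that has neither a GPU nor `torch` installed.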
- synthesize(text, output_path, *, voice="Alice", model="1.5B", on_log=...) -> Path
- VibeVoiceEngine(model="1.5B", *, on_log=..., cfg_scale=1.3, cpu_threads=None), with .load(), .synthesize(text, output_path, voice=...), and .device
- list_voices(download=False), available_ram_gb(), ram_warning(model="1.5B")
"1.5B" (default) and "7B" map to the community Hugging Face mirrors; you can also
pass any repo id or local path.
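When exporting several pieces of text in one run, it is probably cheaper to build one engine and reuse it than to call the one-shot function repeatedly. The sketch below assumes that: `export_sections` is a hypothetical helper, it relies only on the `.synthesize(text, output_path, voice=...)` method listed above, and it assumes the loaded model stays in memory between calls.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def export_sections(
    engine,
    sections: Iterable[Tuple[str, str]],
    out_dir,
    voice: str = "Alice",
) -> List[Path]:
    """Hypothetical helper: write one WAV per (name, text) pair.

    `engine` is any object with .synthesize(text, output_path, voice=...).
    Blank sections are skipped rather than synthesized.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for name, text in sections:
        if not text.strip():
            continue
        target = out / f"{name}.wav"
        # Pass a plain string path; whether Path objects are accepted is not stated.
        engine.synthesize(text, str(target), voice=voice)
        written.append(target)
    return written


# Intended wiring with the package (not executed here):
#   from vibevoice_cpu import VibeVoiceEngine
#   engine = VibeVoiceEngine(model="1.5B")
#   engine.load()
#   export_sections(engine, [("intro", "Hello there.")], "audio")
```

Because the helper takes the engine as an argument, it can be exercised with a stub object before committing to a long CPU run.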
MIT — made by Taylor Arndt, a Community Access project. Contributions welcome.