Community-Access/vibevoice-cpu

vibevoice-cpu

Offline VibeVoice text-to-speech that turns text into audio files — built to run CPU-native, and to use a GPU automatically when one is present.

This is a file generator, not a live read-aloud voice. Point it at some text and a reference voice, and it writes a WAV. It downloads the model and the reference voices on first use, and warns you when RAM is tight so a long run on a small machine isn't a surprise.

Why

Microsoft's VibeVoice is a great long-form, expressive TTS model — but the published code is GPU-oriented, needs a manual model download, and has no clean "just give me a WAV" entry point. This package wraps it into something small and CPU-first: automatic device selection, on-demand model + voice downloads, a one-call Python API, and a CLI.

This package grew out of Quill, a screen-reader-first editor and a Community Access project, which needed to export a document to a spoken audio file without requiring a GPU.

Install

pip install "vibevoice-cpu[model]"     # CLI/helpers + the inference stack (torch, etc.)
pip install "vibevoice-cpu[model,ram]" # also better RAM detection (psutil)

The first run downloads the model (several GB for 1.5B) and the reference voices into a cache ($VIBEVOICE_CPU_HOME, else ~/.cache/vibevoice-cpu).
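
The cache lookup described above can be sketched as follows. This is only an illustration of the stated rule (environment variable first, then the default under the home directory); the function name `cache_dir` is hypothetical and is not claimed to be part of the package.

```python
import os
from pathlib import Path


def cache_dir(env=None):
    """Return $VIBEVOICE_CPU_HOME if set, else ~/.cache/vibevoice-cpu.

    `cache_dir` is a hypothetical helper for illustration; the package's
    own lookup may differ in detail.
    """
    env = os.environ if env is None else env
    override = env.get("VIBEVOICE_CPU_HOME")
    if override:
        # An explicit override wins over the default location.
        return Path(override).expanduser()
    return Path.home() / ".cache" / "vibevoice-cpu"


if __name__ == "__main__":
    print(cache_dir())
```

Setting VIBEVOICE_CPU_HOME before the first run is the way to keep the multi-gigabyte download on a larger disk.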

Use it (Python)

from vibevoice_cpu import synthesize, list_voices

print(list_voices(download=True))          # ['Alice', 'Carter', 'Frank', ...]
synthesize("Hello there, this is VibeVoice.", "out.wav", voice="Alice")

Use it (CLI)

vibevoice-cpu download                     # pre-fetch model + voices
vibevoice-cpu voices                       # list reference voices
vibevoice-cpu synth -o out.wav -v Alice "Hello there."
echo "A whole paragraph…" | vibevoice-cpu synth -o out.wav

Devices & speed

  • CUDA (NVIDIA) → used automatically, bfloat16.
  • MPS (Apple Silicon) → used automatically, float32.
  • CPU → the fallback, float32. It works — but VibeVoice is a large model, so on CPU expect minutes per passage, not real time. With less than ~8 GB free it will be very slow and may swap; the library prints a warning before it starts.
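
The device rules in the list above can be expressed as a small decision function. This is a sketch under assumptions, not the library's code: the names `pick_device` and `low_ram_message` are hypothetical, checking CUDA before MPS is an assumed order, and the 8 GB threshold simply mirrors the "~8 GB free" figure quoted above.

```python
def pick_device(cuda_available, mps_available):
    """Map availability flags to a (device, dtype name) pair.

    Hypothetical helper: CUDA is checked before MPS as an assumption.
    """
    if cuda_available:
        return "cuda", "bfloat16"
    if mps_available:
        return "mps", "float32"
    return "cpu", "float32"


def low_ram_message(free_gb, threshold_gb=8.0):
    """Return a warning string when free RAM is below the threshold, else None."""
    if free_gb < threshold_gb:
        return (
            f"Only {free_gb:.1f} GB of RAM free (under {threshold_gb:.0f} GB): "
            "synthesis will be very slow and may swap."
        )
    return None


if __name__ == "__main__":
    print(pick_device(cuda_available=False, mps_available=False))
    print(low_ram_message(4.0))
```

With PyTorch installed, the flags could come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the free-memory figure from `psutil.virtual_memory().available / 1024**3`.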

This is why it's a file tool: generate in the background, save the WAV, play it when it's ready.
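
One way to "generate in the background" is to hand the call to a worker thread and collect the result later. The sketch below uses only the standard library. `generate_in_background` is a hypothetical helper, and `fake_synth` is a stand-in that writes placeholder bytes so the example runs without the model; in real use you would pass `vibevoice_cpu.synthesize` instead.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def generate_in_background(synth, text, output_path, **kwargs):
    """Start synth(text, output_path, **kwargs) on a worker thread.

    Returns a Future; call .result() when the file is needed.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(synth, text, output_path, **kwargs)
    # Do not block here; the already-submitted job still runs to completion.
    executor.shutdown(wait=False)
    return future


def fake_synth(text, output_path, voice="Alice"):
    """Stand-in for the real synthesize(): writes placeholder bytes, not audio."""
    path = Path(output_path)
    path.write_bytes(b"placeholder")
    return path


if __name__ == "__main__":
    target = Path(tempfile.mkdtemp()) / "out.wav"
    job = generate_in_background(fake_synth, "Hello there.", target, voice="Alice")
    # ... the caller is free to do other work here ...
    print("written:", job.result())
```

A thread is enough for this because the caller only waits on a long-running call. Whether the real synthesis leaves the rest of the program responsive on a given machine is something to measure rather than assume.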

API

  • synthesize(text, output_path, *, voice="Alice", model="1.5B", on_log=...) -> Path
  • VibeVoiceEngine(model="1.5B", *, on_log=..., cfg_scale=1.3, cpu_threads=None), with .load(), .synthesize(text, output_path, voice=...), and .device.
  • list_voices(download=False), available_ram_gb(), ram_warning(model="1.5B").
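
For several passages, the engine signature listed above suggests loading once and calling `.synthesize` repeatedly. The sketch below assumes exactly that and nothing more: `export_passages` is a hypothetical helper, the `part-001.wav` naming is an arbitrary choice, and the engine is passed in so the function works with any object that has a matching `.synthesize(text, output_path, voice=...)` method.

```python
from pathlib import Path


def export_passages(engine, passages, out_dir, voice="Alice"):
    """Write one WAV per passage using an already-loaded engine.

    Hypothetical helper; returns the list of target paths in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, text in enumerate(passages, start=1):
        target = out_dir / f"part-{index:03d}.wav"
        engine.synthesize(text, target, voice=voice)
        written.append(target)
    return written
```

Usage would look roughly like `engine = VibeVoiceEngine(model="1.5B")`, then `engine.load()`, then `export_passages(engine, ["First passage.", "Second passage."], "audio")`. Calling `.load()` up front is an assumption based on the method list, not documented behavior.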

Models

"1.5B" (default) and "7B" map to the community Hugging Face mirrors; you can also pass any repo id or local path.

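The alias behavior described above (short names map to a mirror, anything else passes through unchanged) can be sketched as a dictionary lookup with a fallback. The repo ids below are placeholders invented for illustration, since the actual mirror names are not given here, and `resolve_model` is a hypothetical name.

```python
# Placeholder repo ids: NOT the real mirrors, which are not named in this README.
MODEL_ALIASES = {
    "1.5B": "example-org/VibeVoice-1.5B",
    "7B": "example-org/VibeVoice-7B",
}


def resolve_model(model="1.5B"):
    """Return the mapped repo id for a known alias, else the value unchanged.

    Unknown values are treated as a Hugging Face repo id or a local path.
    """
    return MODEL_ALIASES.get(model, model)


if __name__ == "__main__":
    print(resolve_model())                    # alias lookup
    print(resolve_model("./models/my-copy"))  # local path passes through
```
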
License

MIT. Made by Taylor Arndt as a Community Access project. Contributions welcome.
