vox

Cross-platform TTS CLI with six backends and MCP server for AI assistants.

English • Français • 中文 • 日本語 • 한국어 • Español

                              vox
                               |
       +--------+--------+----+----+--------+--------+
       |        |        |         |        |        |
     say     piper    qwen-native kokoro  voxtream  qwen
   (macOS)  (Rust/ort) (Rust/candle) (ONNX) (zero-shot) (MLX/Py)
   native   CPU       CPU/Metal  opt-in  CUDA/MPS  Apple Si.
                       /CUDA
                         |
                       rodio (audio playback)

Backends

Backend	Engine	Voice cloning	Latency (warm)	GPU	Platform
`say`	macOS native	No	3s	No	macOS
`piper`	ONNX (Rust)	No	<1s	No	All
`qwen-native`	Candle (Rust)	Yes	~3s	Metal/CUDA	All
`kokoro`	ONNX (Rust, opt-in)	No	<1s	No	macOS only
`voxtream`	PyTorch 0.5B	Yes	~8s	CUDA/MPS	All
`qwen`	MLX-Audio (Python)	Yes	~2s	Apple Neural	macOS

Benchmark — single sentence (~50 chars)

All times measured end-to-end (model loading + inference + audio playback). Cold = first CLI call.

Backend	M2 Pro (CPU)	RTX 4070 Ti SUPER	Voice cloning	Quality
`say`	3s	macOS only	No	System voices
`piper`	<1s	<1s	No	Good
`kokoro`	<1s	macOS only	No	Fair (EN only)
`voxtream` (VoXtream2, 0.5B)	68s / 40s warm	23s / 19s warm	Yes (zero-shot)	Excellent
`qwen-native` (Qwen3-TTS, 0.6B)	11m33s / 3s warm	48s (CPU)	Yes	Excellent
`qwen` (MLX-Audio)	~15s / 2s warm	macOS only	Yes	Excellent

With daemon (vox daemon start — keeps model server warm):

Backend	M2 Pro (CPU)	Notes
`voxtream`	32s	Inference CPU-bound (~25s). On CUDA: paper reports 74ms first-packet
`qwen-native`	~3s	Model stays in RAM via global Mutex

All CUDA benchmarks measured on RTX 4070 Ti SUPER (16GB). For lowest latency: say (macOS) or piper (all platforms). For best quality + cloning: voxtream on CUDA with daemon.

Install

Pre-built binaries (recommended)

# Quick install (macOS ARM / Linux x86_64)
curl -fsSL https://raw.githubusercontent.com/rtk-ai/vox/main/install.sh | sh

# Homebrew (macOS)
brew install rtk-ai/tap/vox

Pre-built binaries are available for each release:

Platform	Binary	GPU
macOS (Apple Silicon)	`vox-aarch64-apple-darwin.tar.gz`	Metal
Linux x86_64	`vox-x86_64-unknown-linux-gnu.tar.gz`	CPU
Linux x86_64 + CUDA	`vox-x86_64-unknown-linux-gnu-cuda.tar.gz`	CUDA
Windows x86_64	`vox-x86_64-pc-windows-msvc.zip`	CPU
Windows x86_64 + CUDA	`vox-x86_64-pc-windows-msvc-cuda.zip`	CUDA

Download from GitHub Releases.

From source

cargo install --path .                   # CPU only
cargo install --path . --features metal  # macOS Apple Silicon (GPU)
cargo install --path . --features cuda   # Linux/Windows NVIDIA (GPU)

Linux requires sudo apt install libasound2-dev.

Platform defaults

Platform	Default backend	Notes
macOS	`say`	No setup needed
Linux / Windows	`piper`	Models auto-download on first use

VoXtream backend (optional)

brew install espeak-ng                              # macOS (or apt install espeak-ng on Linux)
uv venv ~/.local/venvs/voxtream --python 3.11
uv pip install --python ~/.local/venvs/voxtream/bin/python "voxtream>=0.2"
# Copy config files
git clone --depth 1 https://github.com/herimor/voxtream.git /tmp/voxtream-repo
mkdir -p "$(vox config show 2>/dev/null | grep dir | awk '{print $2}' || echo ~/.config/vox)/voxtream"
cp /tmp/voxtream-repo/configs/*.json "$(vox config show 2>/dev/null | grep dir | awk '{print $2}' || echo ~/.config/vox)/voxtream/"

Quick start

vox "Hello, world."                     # Speak with default backend
vox -b qwen-native "Neural TTS."        # Qwen3 (best quality)
vox -b piper "Fast TTS."                # Piper (fastest)
vox --volume 2.0 "Louder!"             # 2x volume (range: 0.0-5.0)
vox -l fr "Bonjour"                     # French
echo "Piped text" | vox                 # Read from stdin
vox --list-voices                       # List available voices
vox setup                               # Interactive TUI configuration

Interactive setup (TUI)

For humans — choose backend, voice, language, style, and volume interactively:

vox setup

┌ Backend ──┐┌ Voice ─────┐┌ Lang ┐┌ Style ────┐┌ Volume ┐┌ Config ──────┐
│> say      ││> Samantha  ││> en  ││> (default)││  0.5   ││ Backend: say │
│  piper    ││  Thomas    ││  fr  ││  calm     ││> 1.0   ││ Voice: ...   │
│  qwen-nat ││  Amelie    ││  es  ││  warm     ││  1.5   ││ Lang:  en    │
│  voxtream ││           ││  de  ││  cheerful ││  2.0   ││ Volume: 1.0x │
│  qwen     ││           ││  ja  ││          ││  3.0   ││ [T]est [S]ave│
└───────────┘└────────────┘└──────┘└──────────┘└────────┘└──────────────┘

Navigate with arrow keys / hjkl, Tab to switch panel, T to test, S to save, Q to quit.

AI agents use CLI flags instead: vox -b qwen-native -l fr "text"

AI assistant integration

One command configures 14 AI tools (Claude Code, Cursor, VS Code, Zed, Codex, Gemini, Amazon Q, and more):

vox init                # MCP server (default) — all AI tools
vox init -m cli         # CLAUDE.md + Stop hook (recommended)
vox init -m skill       # /speak slash command
vox init -m all         # all of the above

Running vox init again is safe — it skips files that are already configured.

CLI mode vs MCP mode

CLI mode is recommended for AI coding agents. Benchmarks show CLI tools are 10-32x cheaper and 100% reliable vs 72% for MCP due to MCP's TCP timeout overhead and JSON schema cost per call.

Mode	Reliability	Token cost	Best for
CLI (`vox init -m cli`)	100%	Low (Bash call)	Claude Code, Codex, terminal agents
MCP (`vox init`)	~72%	Higher (JSON schema)	Cursor, VS Code, GUI-based tools

Voice cloning

vox clone add patrick --audio ~/voice.wav --text "Transcription"
vox clone record myvoice --duration 10
vox -v patrick "This speaks with your voice."
vox clone list
vox clone remove patrick

Works with qwen, qwen-native, and voxtream backends. VoXtream2 uses zero-shot cloning (3-10s audio prompt, no training needed).

Preferences

vox config show
vox config set backend voxtream
vox config set lang fr
vox config set voice Chelsie
vox config set gender feminine
vox config set style warm
vox config reset

Sound packs

vox pack install peon              # Install a pack
vox pack set peon                  # Activate it
vox pack play greeting             # Play a sound
vox pack list                      # List available packs

Voice conversation (macOS)

export ANTHROPIC_API_KEY=sk-...
vox chat -l fr                     # Talk with Claude
vox hear -l fr                     # Speech-to-text only

Data

All state is stored locally — no data sent to external servers (except vox chat which uses Claude API).

~/.config/vox/           # or ~/Library/Application Support/vox/ on macOS
  vox.db                 # SQLite: preferences, voice clones, usage logs
  clones/                # Audio files for voice clones
  packs/                 # Installed sound packs
  voxtream/              # VoXtream2 config files

Env var	Description
`VOX_CONFIG_DIR`	Override config directory
`VOX_DB_PATH`	Override database path

Documentation

Document	Description
Architecture	Technical architecture, backends, DB schema, MCP protocol, security
Features	All commands and features documented
Guide	Installation, quick start, troubleshooting

License

Apache-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 87 Commits
.cargo		.cargo
.claude		.claude
.github/workflows		.github/workflows
assets		assets
docs		docs
examples		examples
src		src
tests		tests
.gitignore		.gitignore
.release-please-manifest.json		.release-please-manifest.json
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
README_es.md		README_es.md
README_fr.md		README_fr.md
README_ja.md		README_ja.md
README_ko.md		README_ko.md
README_zh.md		README_zh.md
install.sh		install.sh
models.toml		models.toml
release-please-config.json		release-please-config.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

vox

Backends

Benchmark — single sentence (~50 chars)

Install

Pre-built binaries (recommended)

From source

Platform defaults

VoXtream backend (optional)

Quick start

Interactive setup (TUI)

AI assistant integration

CLI mode vs MCP mode

Voice cloning

Preferences

Sound packs

Voice conversation (macOS)

Data

Documentation

License

About

Uh oh!

Releases 18

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

vox

Backends

Benchmark — single sentence (~50 chars)

Install

Pre-built binaries (recommended)

From source

Platform defaults

VoXtream backend (optional)

Quick start

Interactive setup (TUI)

AI assistant integration

CLI mode vs MCP mode

Voice cloning

Preferences

Sound packs

Voice conversation (macOS)

Data

Documentation

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 18

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages