Runs OpenBMB VoxCPM as a service on macOS, on the Apple GPU (MPS), without Pinokio in the picture.
Fourth in the family, after kokoro-tts-mac-service, xtts-mac-service, and f5-tts-mac-service. Same shape, different model.
VoxCPM is the heaviest of the four engines but currently the strongest at preserving the style of the reference voice (rhythm, breath, micropauses). Useful when you want the cloned voice to actually sound like the person speaking, not just share their timbre.
The catch on Mac:
- The upstream `app.py` only knows about CUDA. We patch it for MPS detection.
- The upstream `app.py` always passes `optimize=True`, which inside the lib raises `ValueError("VoxCPMModel can only be optimized on CUDA device")` on MPS or CPU. We patch it to disable optimize when not on CUDA.
- The upstream `app.py` binds to `0.0.0.0` with no `--host` flag. We add one.
voxcpm itself (the Python package) already detects MPS internally, so the patch is small and only touches the demo app, not the model lib.
- VoxCPM at `http://127.0.0.1:7863`, 24/7.
- Inference on the Apple GPU (with CPU fallback for ops without an MPS kernel).
- Two model choices via `VOXCPM_MODEL_ID`:
  - `openbmb/VoxCPM-0.5B` (default, ~3 GB, low VRAM, recommended for MPS)
  - `openbmb/VoxCPM2` (full 2B, ~5 GB, slower on MPS, higher quality)
- Co-exists with Kokoro on 7860, XTTS on 7861, F5 on 7862.
- Auto-restart if the process crashes.
- 48 kHz output (vs 24 kHz from XTTS and F5).
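
To see which of the sibling services are currently listening, you can probe the four ports listed above. This is a minimal sketch, not part of the repo: the `SERVICES` dict and `port_open` helper are names made up for the example, and it only checks that something accepts TCP connections, not that it is the expected engine.

```python
import socket

# Ports as listed above; this mapping is a hypothetical helper, not part of the repo.
SERVICES = {"kokoro": 7860, "xtts": 7861, "f5": 7862, "voxcpm": 7863}


def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if port_open("127.0.0.1", port) else "down"
        print(f"{name:7s} :{port} {state}")
```
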
- macOS (tested on Sequoia, Apple Silicon).
- Python 3.12 on PATH. `brew install python@3.12`.
- ~5 GB of disk for the 0.5B model + deps; ~7 GB for the 2B model.
- Some patience: VoxCPM is heavier than the other three engines. First call after cold boot takes a while because of MPS JIT warmup, and even warm calls are slower than F5 or XTTS.
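
If you want to check the requirements above before installing, something like the following could work. It is only a sketch: `python_ok` and `disk_ok` are hypothetical helpers, the exact-match check on 3.12 is an assumption (the installer may accept other versions), and the thresholds are the rough disk figures quoted above.

```python
import shutil
import sys


def python_ok(version=sys.version_info, required=(3, 12)) -> bool:
    """True if the interpreter's major.minor matches the required version."""
    return tuple(version[:2]) == required


def disk_ok(path: str = ".", needed_gb: float = 5.0) -> bool:
    """True if the filesystem holding `path` has at least `needed_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


if __name__ == "__main__":
    # 5 GB covers the 0.5B model + deps; pass 7.0 for the 2B model.
    print("python 3.12:", python_ok())
    print("disk space :", disk_ok(".", 5.0))
```
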
```
git clone https://github.com/linuxelitebr/voxcpm-mac-service.git
cd voxcpm-mac-service
./scripts/install.sh
```

To use the full 2B model instead of the default 0.5B:

```
VOXCPM_MODEL_ID=openbmb/VoxCPM2 ./scripts/install.sh
```

The installer will:

- Create `./env` with Python 3.12 if it doesn't exist.
- `pip install voxcpm` from PyPI (and a pile of friends including funasr, modelscope, etc.).
- Apply the MPS + host patch to `app.py` (with backup).
- Render `com.voxcpm.tts.plist` with your real paths, drop it in `~/Library/LaunchAgents/`.
- Load the LaunchAgent and wait for Gradio to answer on :7863.

First boot downloads the model, the SenseVoice ASR, and the denoiser. Plan for 5 to 15 minutes on a decent connection.
Open http://127.0.0.1:7863. Upload a 5 to 30 second reference WAV, optionally write a control instruction in natural language ("warm voice, calm pacing"), type the target text, generate.
```python
from gradio_client import Client, handle_file

c = Client("http://127.0.0.1:7863")
audio_path = c.predict(
    "Hello in my own voice, generated on the Apple GPU.",  # text
    "",  # control instruction
    handle_file("path/to/your/voice-sample.wav"),  # reference_wav
    True,  # show_prompt_text
    "transcript of the reference audio",  # prompt_text
    2.0,  # cfg_value (1.0 to 3.0)
    True,  # DoNormalizeText
    False,  # DoDenoisePromptAudio
    10,  # dit_steps (1 to 50)
    api_name="/generate",
)
print(audio_path)
```

- `cfg_value` (1.0 to 3.0, default 2.0): higher follows the reference more strictly; lower lets the model improvise.
- `dit_steps` (1 to 50, default 10): diffusion steps. More = cleaner output, slower. Sweet spot 10 to 20.
- `DoNormalizeText`: recommended True for technical writing with punctuation.
- `DoDenoisePromptAudio`: only turn on if your reference WAV has noise.
- `control_instruction`: free text in any of the 30 supported languages. Examples: `"warm young woman, calm pacing"`, `"voz masculina madura, leitura técnica pausada"`. Empty = take cues from reference.

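
Because `/generate` takes its arguments positionally, it is easy to swap two of them or pass a value outside the documented range. One possible guard is a small wrapper that builds the argument list in the order shown in the call above and clamps `cfg_value` and `dit_steps`. `build_generate_args` is a hypothetical helper for illustration, not something the repo or `gradio_client` provides, and the argument order is taken from the example call rather than from an API reference.

```python
def build_generate_args(
    text: str,
    reference_wav,  # pass handle_file("...") here when calling the service
    prompt_text: str,
    control_instruction: str = "",
    cfg_value: float = 2.0,
    dit_steps: int = 10,
    normalize_text: bool = True,
    denoise_prompt: bool = False,
) -> list:
    """Return positional args for /generate, clamped to the documented ranges."""
    cfg_value = min(3.0, max(1.0, float(cfg_value)))  # documented range 1.0 to 3.0
    dit_steps = min(50, max(1, int(dit_steps)))       # documented range 1 to 50
    return [
        text,
        control_instruction,
        reference_wav,
        True,  # show_prompt_text
        prompt_text,
        cfg_value,
        normalize_text,
        denoise_prompt,
        dit_steps,
    ]
```

With a connected client `c`, the call would then look like `c.predict(*build_generate_args(text, handle_file(path), transcript), api_name="/generate")`.
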
| Thing | Command | What it does |
|---|---|---|
| Start | `./scripts/start.sh` | Idempotent. Returns 0 instantly if already up. |
| Stop | `./scripts/stop.sh` | Unloads the agent for this session. |
| Status | `./scripts/status.sh` | Plist + LaunchAgent + HTTP, all in one screen. |
| Tail stdout | `tail -f logs/voxcpm.out.log` | |
| Tail stderr | `tail -f logs/voxcpm.err.log` | |

`start.sh` defaults to `WAIT_TIMEOUT=300` (5 min) because cold boot on MPS can be slow.
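
If your own client needs the same "wait until the service answers" behavior, a polling loop along these lines is one option. This is a sketch in Python, not a transcription of `start.sh`: `wait_for_http` is a hypothetical name, the 300-second default simply mirrors `WAIT_TIMEOUT`, and the 2-second poll interval is an arbitrary choice.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_http(url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll `url` until it returns any HTTP response or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            return True  # the server is up, it just answered with an error status
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            time.sleep(interval)  # not listening yet, try again
    return False
```

For example, `wait_for_http("http://127.0.0.1:7863/")` returns `True` as soon as Gradio responds, and `False` if nothing answers within five minutes.
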
```
VOXCPM_REF_AUDIO=/path/to/your-voice.wav ./scripts/test.sh
```

Four checks: launchd state, MPS in logs, HTTP, and an E2E voice cloning call.

Three changes to `app.py` (full diff in `patches/01-mps-and-host.patch`):

1. Detect MPS

```python
if torch.cuda.is_available():
    self.device = "cuda"
elif torch.backends.mps.is_available():
    self.device = "mps"
    os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
else:
    self.device = "cpu"
```

2. Don't pass `optimize=True` outside CUDA

```python
self.voxcpm_model = voxcpm.VoxCPM.from_pretrained(
    self._model_id,
    optimize=(self.device == "cuda"),
)
```

3. Add `--host` flag so we can pin to 127.0.0.1 from the LaunchAgent.

The actual voxcpm lib already supports MPS internally, so we only patch the demo app.
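
For the third change, a `--host` flag can be wired in with `argparse` roughly as follows. This is an illustrative sketch, not the contents of the patch: the parser layout and the `--port` option are assumptions, and only the `--host` flag and the upstream `0.0.0.0` default come from the description above.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse launcher options; argv=None means read from sys.argv."""
    parser = argparse.ArgumentParser(description="VoxCPM demo launcher (sketch)")
    # Default mirrors the upstream behavior of binding to all interfaces.
    parser.add_argument("--host", default="0.0.0.0", help="interface to bind")
    # Hypothetical extra option for this sketch; the real app may not expose it.
    parser.add_argument("--port", type=int, default=7863, help="TCP port to listen on")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"would bind to {args.host}:{args.port}")
```

In a Gradio app the parsed values would typically be handed to `launch(server_name=args.host, server_port=args.port)`.
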
`~/Library/LaunchAgents/com.voxcpm.tts.plist` runs as your user and starts at login. Same shape as the sister projects: `RunAtLoad=true`, `KeepAlive` only on crash, `ThrottleInterval=10`, `PYTHONUNBUFFERED=1`, `TOKENIZERS_PARALLELISM=false`.
The model id is rendered into the plist at install time, so to switch models you uninstall and reinstall with a different `VOXCPM_MODEL_ID`.
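
To make the "rendered at install time" idea concrete, here is a sketch of building such a plist with Python's `plistlib`. It is not the repo's template: the `ProgramArguments` layout, the `{"Crashed": True}` form of `KeepAlive`, the log path keys, and passing the model id through `EnvironmentVariables` are all assumptions. Only the label, the settings named above, and the log file names come from this README.

```python
import os
import plistlib


def render_plist(python_bin: str, app_py: str, log_dir: str, model_id: str) -> bytes:
    """Build a LaunchAgent plist as XML bytes; all paths are supplied by the caller."""
    agent = {
        "Label": "com.voxcpm.tts",
        # Argument layout is an assumption; the real installer may differ.
        "ProgramArguments": [python_bin, app_py, "--host", "127.0.0.1"],
        "RunAtLoad": True,
        # One way to express "restart only on crash" (assumed key choice).
        "KeepAlive": {"Crashed": True},
        "ThrottleInterval": 10,
        "EnvironmentVariables": {
            "PYTHONUNBUFFERED": "1",
            "TOKENIZERS_PARALLELISM": "false",
            # Assumed mechanism for carrying the model id into the service.
            "VOXCPM_MODEL_ID": model_id,
        },
        "StandardOutPath": os.path.join(log_dir, "voxcpm.out.log"),
        "StandardErrorPath": os.path.join(log_dir, "voxcpm.err.log"),
    }
    return plistlib.dumps(agent)
```

Since the model id is baked into the rendered file, changing it means rendering and loading a new plist, which is why switching models goes through uninstall and reinstall.
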
```
./scripts/uninstall.sh           # just unloads the service
./scripts/uninstall.sh --revert  # also reverts the patch in app.py
./scripts/uninstall.sh --purge   # also nukes ./env and the model cache
```

- Slow on MPS. VoxCPM is genuinely heavy, and several diffusion ops fall back to CPU. Even the 0.5B model is significantly slower than F5 or XTTS on the same Apple Silicon. If you need throughput, run it on a CUDA box and point your client at it.
- First call is the slowest. Plan for several minutes on the first generation while MPS JIT warms up.
- Model cache lives in `~/.cache/modelscope/hub/openbmb/` for the VoxCPM weights, and `~/.cache/huggingface/` for the funasr ASR.
- Apache-2.0 license on the weights, free for commercial use.
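
To see how much disk the two cache locations mentioned above actually occupy, a small directory walk is enough. This is a generic sketch with a hypothetical `dir_size_bytes` helper. Symlinks are skipped on the assumption that the cache may link the same file from more than one place.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(os.path.expanduser(root)):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks so nothing is counted twice
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    for cache in ("~/.cache/modelscope/hub/openbmb/", "~/.cache/huggingface/"):
        print(f"{cache}: {dir_size_bytes(cache) / 1e9:.2f} GB")
```
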
- Model and code: OpenBMB/VoxCPM
- PyPI package: `voxcpm` (maintained by OpenBMB)
MIT for the wrapper code in this repo. VoxCPM weights and lib code are Apache-2.0. See LICENSE.