Clone any voice. Read anything. All on your Mac.
Local voice cloning · long-form reading · zero cloud · native Apple Silicon
Drop in 5 seconds of reference audio, clone any Mandarin or English voice locally, and read your entire script aloud — without sending a single byte to the cloud.
- 🎙 Voice clone from a 5–15 sec sample, auto-transcribed
- 📝 Long-form synthesis with automatic segmentation and streaming playback
- 🎧 Export to WAV / M4A / MP3 with ⌘S
- 📚 Persistent voice library across launches
- 🕘 Generation history — every synthesis saved, replay & re-export
- 🛡 100% local: on-device inference, no network requests after the one-time model download
⬇️ Download the latest release
Or browse the Releases page for older versions.
- Download `voiceBox-X.Y.Z.dmg` and double-click to mount
- Drag `voiceBox.app` into your `Applications` folder
- First launch: right-click (or Control-click) `voiceBox.app` in Applications → choose Open → click Open again in the dialog
- Subsequent launches: just double-click
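The app isn't notarized, so the first launch needs the right-click → Open step. This is a one-time macOS requirement for non-notarized apps, not a problem with voiceBox. On macOS 15 the dialog may not offer an Open button; in that case, open System Settings → Privacy & Security and click Open Anyway. If you see a "damaged" warning, run in Terminal: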
xattr -dr com.apple.quarantine /Applications/voiceBox.app
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Reference │ │ Your │ │ Cloned │
│ Audio (5s) │ + │ Script │ → │ Speech │
│ + ASR text │ │ (any len) │ │ WAV/M4A/MP3 │
└──────────────┘ └──────────────┘ └──────────────┘
(one click ✨) (paste / drop) (⌘S export)
3 steps:
- Studio tab → click the voice chip → Add voice → drop in reference audio → click ✨ to auto-transcribe → save
- Studio main input → paste your script (or drop a `.txt`) → pick a voice
- ⌘↩ to generate · listen · ⌘S to export
| Purpose | Engine | Source |
|---|---|---|
| Speech synthesis (TTS) | Qwen3 voice engine | Alibaba Qwen |
| Speech recognition (ASR) | Qwen3 voice recognition | Alibaba Qwen |
| On-device acceleration | Apple Silicon (GPU / Neural Engine) | Apple |
On first launch the voice models (~4 GB total) are downloaded — use a stable connection. After that, everything runs offline.
Is voiceBox open source?
The binary releases are free for personal use. The source code is not publicly available. voiceBox builds on open-source models and frameworks, credited below.
Will my voice or text be uploaded?
No. All speech computation runs locally on your Mac's GPU / Neural Engine, fully offline. The only network request is on first launch, to download the voice models. After that you can use it with no connection at all.
Which languages are supported?
Mandarin Chinese and English work best. The Qwen3 voice engine also officially supports Spanish, French, German, Japanese, Portuguese, Italian and others — ten languages in total.
Why isn't it on the Mac App Store?
App Store sandboxing breaks the local file-system access we need for reference audio and exports. Direct distribution gives a cleaner experience.
Can I use it commercially?
The app itself is free, but commercial use is governed by the license of each underlying Qwen3 model, so check those licenses before using generated audio commercially. voiceBox takes no responsibility for the compliance of generated output.
- macOS 15+ (Sequoia or newer)
- Apple Silicon (M1 / M2 / M3 / M4)
- At least 5 GB of free disk space (model weights)
- Internet (first-time model download only)
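If you want to confirm a Mac meets the requirements above before downloading, the checks can be sketched as a small shell script. This is a hypothetical helper, not something shipped with voiceBox: the function name `check_requirements` is made up for illustration, and it assumes free space should be measured on the boot volume (`/`), which may not be where the app stores its models.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of voiceBox.
# Usage: check_requirements ARCH OS_MAJOR FREE_GB
# Prints "ok" and returns 0 when every threshold is met; otherwise
# prints the first unmet requirement and returns 1.
check_requirements() {
  arch="$1"; os_major="$2"; free_gb="$3"
  if [ "$arch" != "arm64" ]; then
    echo "needs Apple Silicon (arm64), found: $arch"; return 1
  fi
  if [ "$os_major" -lt 15 ]; then
    echo "needs macOS 15 or newer, found: $os_major"; return 1
  fi
  if [ "$free_gb" -lt 5 ]; then
    echo "needs at least 5 GB free, found: ${free_gb} GB"; return 1
  fi
  echo "ok"
}

# Query the system only on macOS: sw_vers is macOS-only and
# `df -g` (sizes in 1 GB blocks) is the BSD form of df.
# Assumption: free space is measured on the boot volume (/).
if [ "$(uname -s)" = "Darwin" ]; then
  check_requirements \
    "$(uname -m)" \
    "$(sw_vers -productVersion | cut -d. -f1)" \
    "$(df -g / | awk 'NR==2 {print $4}')"
fi
```

Run it from a native Terminal session. A shell running under Rosetta can report `x86_64` from `uname -m` even on Apple Silicon, which would trip the first check.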
- App notarization + auto-update (Sparkle)
- Batch generation across voices
- Synchronized subtitle (SRT) export
- Custom pause / emphasis markers
- iOS version
voiceBox wouldn't exist without these projects:
- MLX by Apple — the framework
- mlx-audio-swift by Prince Canuma — the Swift TTS/STT layer
- mlx-audio by Prince Canuma — the Python research playground
- Qwen by Alibaba — TTS & ASR models
- Hugging Face — model distribution
Found a bug / want a feature? Open an Issue.
Made with ☕ on Apple Silicon.