AI-powered Text-to-Speech desktop application with voice cloning, built on OminiX MLX
Moxin Voice is a modern, GPU-accelerated desktop TTS application built entirely in Rust. It uses the Makepad UI framework for native performance and the OminiX MLX inference stack for high-speed, Python-free speech synthesis on Apple Silicon.
Moxin Voice now includes a built-in Live Translation mode for real-time bilingual subtitles.
- Microphone or system audio input: translate speech from your mic, browser, meeting app, or video player
- Real-time subtitle overlay: compact or fullscreen floating window with adjustable text size, position, and opacity
- Low-latency streaming pipeline: VAD-segmented ASR plus rolling translation commits for readable subtitle chunks
- Bilingual display: original text and translated text shown together in the overlay
- No extra virtual audio driver required: system audio capture uses macOS ScreenCaptureKit directly
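To illustrate what "VAD-segmented" means in the pipeline above, here is a minimal sketch of energy-based voice activity detection. It is not Moxin Voice's actual VAD: the `Segment` type, the function names, and the threshold and hangover parameters are all hypothetical, chosen only to show how an audio stream can be cut into speech chunks before each chunk is handed to ASR.

```rust
/// A speech segment as a half-open range of frame indices: [start, end).
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Segment {
    pub start: usize,
    pub end: usize,
}

/// Root-mean-square energy of one frame of samples.
pub fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}

/// Splits `samples` into speech segments with a simple energy threshold.
/// A frame counts as voiced when its RMS is at least `threshold`; an open
/// segment is closed after `hangover` consecutive quiet frames, and the
/// trailing quiet frames are excluded from the segment.
pub fn segment_speech(
    samples: &[f32],
    frame_len: usize,
    threshold: f32,
    hangover: usize,
) -> Vec<Segment> {
    let mut segments = Vec::new();
    if frame_len == 0 {
        return segments;
    }
    let mut start: Option<usize> = None;
    let mut quiet = 0usize;
    let mut frame_count = 0usize;

    for (i, frame) in samples.chunks(frame_len).enumerate() {
        frame_count = i + 1;
        let voiced = rms(frame) >= threshold;
        match (start, voiced) {
            (None, true) => {
                start = Some(i);
                quiet = 0;
            }
            (Some(_), true) => quiet = 0,
            (Some(s), false) => {
                quiet += 1;
                if quiet >= hangover {
                    segments.push(Segment { start: s, end: i + 1 - quiet });
                    start = None;
                    quiet = 0;
                }
            }
            (None, false) => {}
        }
    }
    // Flush a segment that is still open when the audio ends.
    if let Some(s) = start {
        segments.push(Segment { start: s, end: frame_count - quiet });
    }
    segments
}
```

A production VAD would more likely use a trained model than a fixed threshold, but the segment-then-transcribe structure is the same.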
- Apple Silicon Mac required: M1 / M2 / M3 / M4
- macOS 14.0+ recommended for the full app
- Live Translation system audio input is macOS-only
- System audio capture requires Screen Recording permission. On first use, macOS will prompt for Screen Recording access because system audio capture is implemented with ScreenCaptureKit.
- A display must be available. ScreenCaptureKit requires a display-backed capture session even when you only want audio.
If Screen Recording permission is denied or ScreenCaptureKit is unavailable, Live Translation still works with the microphone input source.
The inference engine behind Moxin Voice is OminiX MLX, a comprehensive Rust-native ML inference ecosystem for Apple Silicon.
OminiX MLX provides:
- Pure Rust inference: no Python runtime required at synthesis time
- Metal GPU acceleration: optimized for M1/M2/M3/M4 chips via Apple's MLX framework
- Unified memory: zero-copy CPU/GPU data sharing
- Qwen3-TTS-MLX: the TTS engine used by Moxin Voice (9 built-in voices, 12 languages, ICL voice cloning, 2.3× real-time on M3 Max)
Moxin Voice uses OminiX MLX's `dora-qwen3-tts-mlx` node as its sole TTS backend. Source: `node-hub/dora-qwen3-tts-mlx/`
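The "2.3× real-time" figure is read here as "2.3 seconds of audio produced per second of compute". The sketch below shows how such a speed-up could be measured; the function names and the 24 kHz sample rate in the usage note are assumptions for illustration, not values taken from the Moxin Voice source.

```rust
use std::time::Duration;

/// Duration of a mono audio buffer. Returns zero for a zero sample rate
/// so the caller never divides by zero.
pub fn audio_duration(sample_count: usize, sample_rate_hz: u32) -> Duration {
    if sample_rate_hz == 0 {
        return Duration::ZERO;
    }
    Duration::from_secs_f64(sample_count as f64 / sample_rate_hz as f64)
}

/// Speed relative to real time: seconds of audio per second of synthesis.
/// Values above 1.0 mean synthesis is faster than playback.
/// Returns `None` when the measured synthesis time is zero.
pub fn realtime_speedup(audio: Duration, synthesis: Duration) -> Option<f64> {
    let synth_secs = synthesis.as_secs_f64();
    if synth_secs == 0.0 {
        None
    } else {
        Some(audio.as_secs_f64() / synth_secs)
    }
}
```

For example, if 23 seconds of audio (552,000 samples at an assumed 24 kHz) took 10 seconds to synthesize, `realtime_speedup` would report roughly 2.3.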
- Zero-Shot Voice Cloning: clone any voice with 5-30 seconds of audio (ICL Express mode)
- Text-to-Speech: 9 preset voices across Chinese, English, Japanese, and Korean
- Live Translation: real-time subtitles from microphone or system audio with a floating overlay
- Qwen3-TTS-MLX Backend: 2.3× real-time synthesis via OminiX MLX on Apple Silicon
- Audio Recording: built-in real-time recording with waveform visualization
- ASR Integration: automatic text transcription for cloning reference audio
- Audio Export: save generated speech as WAV files
- Dark Mode: native dark theme via Makepad GPU rendering
- Bilingual UI: Chinese and English interface
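As an illustration of the Audio Export feature, the following sketch encodes mono floating-point samples as a 16-bit PCM WAV file in memory. It is an assumption-laden example, not the app's export code: the function name is hypothetical, and the app may well use a crate or a different sample format.

```rust
/// Encodes mono `f32` samples (expected range -1.0..=1.0) as a 16-bit PCM
/// WAV byte buffer with the canonical 44-byte header.
/// Hypothetical helper; assumes the data fits in a 32-bit length field.
pub fn encode_wav_pcm16(samples: &[f32], sample_rate_hz: u32) -> Vec<u8> {
    let channels: u16 = 1;
    let bits_per_sample: u16 = 16;
    let block_align: u16 = channels * bits_per_sample / 8;
    let byte_rate: u32 = sample_rate_hz * block_align as u32;
    let data_len: u32 = (samples.len() * block_align as usize) as u32;

    let mut out = Vec::with_capacity(44 + data_len as usize);
    // RIFF container header.
    out.extend_from_slice(b"RIFF");
    out.extend_from_slice(&(36 + data_len).to_le_bytes());
    out.extend_from_slice(b"WAVE");
    // "fmt " chunk describing uncompressed PCM.
    out.extend_from_slice(b"fmt ");
    out.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size
    out.extend_from_slice(&1u16.to_le_bytes()); // format tag 1 = PCM
    out.extend_from_slice(&channels.to_le_bytes());
    out.extend_from_slice(&sample_rate_hz.to_le_bytes());
    out.extend_from_slice(&byte_rate.to_le_bytes());
    out.extend_from_slice(&block_align.to_le_bytes());
    out.extend_from_slice(&bits_per_sample.to_le_bytes());
    // "data" chunk with little-endian 16-bit samples.
    out.extend_from_slice(b"data");
    out.extend_from_slice(&data_len.to_le_bytes());
    for &s in samples {
        let v = (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16;
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}
```

The returned buffer can be written to disk with `std::fs::write(path, bytes)`.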
    moxin-voice/
    ├── moxin-voice-shell/         # Application entry point (binary)
    ├── apps/moxin-voice/          # UI + application logic
    │   └── dataflow/tts.yml       # Dora dataflow graph
    ├── moxin-widgets/             # Shared Makepad UI components
    ├── moxin-ui/                  # Application infrastructure
    ├── moxin-dora-bridge/         # Dora dataflow integration bridge
    └── node-hub/
        ├── dora-qwen3-tts-mlx/    # OminiX MLX Qwen3-TTS Rust node
        │   └── previews/          # Pre-generated voice preview WAVs
        └── dora-qwen3-asr/        # OminiX MLX Qwen3-ASR Rust node
The TTS pipeline runs as a Dora dataflow: the UI sends text, the qwen-tts-node (built from dora-qwen3-tts-mlx) synthesizes audio using OminiX MLX, and the audio player receives the stream.
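The three-stage shape of that pipeline (text in, audio out, playback) can be sketched with standard-library channels. This is only a conceptual stand-in: it does not use the Dora API, `fake_synthesize` is a placeholder that emits silence instead of running Qwen3-TTS, and every name below is hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Placeholder for the TTS node: returns 100 silent samples per character.
/// The real node would run Qwen3-TTS through OminiX MLX.
fn fake_synthesize(text: &str) -> Vec<f32> {
    vec![0.0; text.chars().count() * 100]
}

/// Wires three stages with channels: UI -> TTS -> player.
/// Returns the number of samples the player received for each utterance,
/// in the order the texts were sent.
pub fn run_pipeline(texts: Vec<String>) -> Vec<usize> {
    let (text_tx, text_rx) = mpsc::channel::<String>();
    let (audio_tx, audio_rx) = mpsc::channel::<Vec<f32>>();

    // TTS stage: consume text, emit audio buffers.
    let tts = thread::spawn(move || {
        for text in text_rx {
            if audio_tx.send(fake_synthesize(&text)).is_err() {
                break; // player hung up
            }
        }
        // `audio_tx` is dropped here, which ends the player's loop.
    });

    // Player stage: here it only records how much audio arrived.
    let player = thread::spawn(move || {
        audio_rx.iter().map(|chunk| chunk.len()).collect::<Vec<usize>>()
    });

    // UI stage: send each utterance, then close the channel.
    for text in texts {
        text_tx.send(text).expect("TTS stage stopped unexpectedly");
    }
    drop(text_tx);

    tts.join().expect("TTS thread panicked");
    player.join().expect("player thread panicked")
}
```

In the real application the stages are separate Dora nodes connected by the graph in `dataflow/tts.yml` rather than threads in one process.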
- macOS 14.0+ (Sonoma), Apple Silicon (M1/M2/M3/M4)
- Rust 1.82+
- Dora CLI (`cargo install dora-cli`)
- Python 3.8+ (for the one-time model download script; not required at runtime)
`bash scripts/init_qwen3_models.sh`

This downloads all three model snapshots into `~/.OminiX/models/`:

| Model | Purpose |
|---|---|
| `Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit` | Preset voice synthesis |
| `Qwen3-TTS-12Hz-1.7B-Base-8bit` | ICL zero-shot voice cloning |
| `Qwen3-ASR-1.7B-8bit` | Voice cloning reference audio transcription |
huggingface_hub is installed automatically if not present.
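A quick way to confirm the download succeeded is to check that each snapshot is present. The sketch below assumes each model lives in a directory named after the model directly under the models root; that layout is an assumption, as are the helper names, so adjust it to whatever the script actually produces.

```rust
use std::path::{Path, PathBuf};

/// Model snapshot names from the table above.
pub const MODEL_NAMES: [&str; 3] = [
    "Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
    "Qwen3-TTS-12Hz-1.7B-Base-8bit",
    "Qwen3-ASR-1.7B-8bit",
];

/// Default models root, `~/.OminiX/models`, resolved from `$HOME`.
/// Returns `None` when `HOME` is not set.
pub fn default_models_root() -> Option<PathBuf> {
    std::env::var_os("HOME")
        .map(|home| PathBuf::from(home).join(".OminiX").join("models"))
}

/// Returns the names of models with no directory under `models_root`.
/// Assumes one directory per model, named exactly after the model.
pub fn missing_models(models_root: &Path) -> Vec<&'static str> {
    MODEL_NAMES
        .iter()
        .copied()
        .filter(|name| !models_root.join(name).is_dir())
        .collect()
}
```

An empty result from `missing_models` means all three directories exist; it does not verify that the files inside them are complete.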
`cargo build --release`

This builds all binaries, including `dora-qwen3-asr` (the ASR Dora node) and `qwen-tts-node`.

`dora up`

`cargo run -p moxin-voice-shell`

For end users receiving the distributed `.app`, model download and initialization happen automatically via the in-app bootstrap wizard on first launch.
9 built-in preset voices, UI names localized to Chinese or English:
| ID | Language | Character |
|---|---|---|
| `vivian` | zh | 薇薇安: bright, slightly edgy young female |
| `serena` | zh | 赛琳娜: warm, gentle young female |
| `uncle_fu` | zh | 傅叔: low, mellow seasoned male |
| `dylan` | zh | 迪伦: clear Beijing young male |
| `eric` | zh | 埃里克: lively Chengdu young male |
| `ryan` | en | Ryan: dynamic male with rhythmic drive |
| `aiden` | en | Aiden: sunny American male |
| `ono_anna` | ja | 小野安奈: playful Japanese female |
| `sohee` | ko | 素熙: warm Korean female |
Upload or record 5-30 seconds of reference audio. Moxin Voice uses Qwen3-TTS's In-Context Learning (ICL) to clone the voice in real time, with no training required. ASR auto-transcription is optional; if ASR is unavailable, users can enter reference text manually.
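The 5-30 second window can be enforced before a clip is sent for cloning. The following is a minimal sketch of such a check; the type, constants, and function name are hypothetical, and the app's own validation may differ (for example, it might trim long clips instead of rejecting them).

```rust
/// Outcome of validating a reference clip. (Hypothetical type.)
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ReferenceCheck {
    Accepted,
    TooShort,
    TooLong,
    InvalidSampleRate,
}

/// Bounds taken from the 5-30 second guidance above.
pub const MIN_REFERENCE_SECS: f64 = 5.0;
pub const MAX_REFERENCE_SECS: f64 = 30.0;

/// Checks whether a mono clip of `sample_count` samples at
/// `sample_rate_hz` falls inside the accepted duration window.
/// Both bounds are inclusive.
pub fn check_reference_audio(sample_count: usize, sample_rate_hz: u32) -> ReferenceCheck {
    if sample_rate_hz == 0 {
        return ReferenceCheck::InvalidSampleRate;
    }
    let secs = sample_count as f64 / sample_rate_hz as f64;
    if secs < MIN_REFERENCE_SECS {
        ReferenceCheck::TooShort
    } else if secs > MAX_REFERENCE_SECS {
        ReferenceCheck::TooLong
    } else {
        ReferenceCheck::Accepted
    }
}
```

For a stereo recording, divide the total sample count by the channel count before calling the function.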
`cargo build -p moxin-voice-shell`

`bash scripts/build_macos_app.sh --version 0.1.0`

`bash scripts/build_macos_dmg.sh`

`bash scripts/macos_bootstrap.sh`

`macos_bootstrap.sh` downloads the Qwen3-TTS and Qwen3-ASR models and sets up the app-private conda env (needed for the TTS download script only).
| Component | Technology |
|---|---|
| UI framework | Makepad (GPU-accelerated, pure Rust) |
| TTS inference | OminiX MLX · Qwen3-TTS-MLX |
| TTS model | Qwen3-TTS (Alibaba) |
| ML runtime | Apple MLX via mlx-sys / mlx-rs (OminiX MLX) |
| Dataflow | Dora |
| Audio I/O | CPAL |
| ASR | OminiX MLX · Qwen3-ASR-MLX (Rust, Metal GPU) |
| Language | Rust 2021 edition |
Apache License 2.0. See LICENSE.
- OminiX MLX: the core ML inference engine powering all synthesis in this project
- Qwen3-TTS: the TTS model (Alibaba)
- Makepad: GPU-accelerated UI framework
- Dora: dataflow architecture
- Apple MLX: foundation for OminiX MLX
Repository: https://github.com/moxin-org/Moxin-Voice