Moxin Voice

AI-powered Text-to-Speech desktop application with voice cloning, built on OminiX MLX


Moxin Voice is a modern, GPU-accelerated desktop TTS application built entirely in Rust. It uses the Makepad UI framework for native performance and the OminiX MLX inference stack for high-speed, Python-free speech synthesis on Apple Silicon.


🪄 New: Live Translation

Moxin Voice now includes a built-in Live Translation mode for real-time bilingual subtitles.

  • Microphone or system audio input: translate speech from your mic, browser, meeting app, or video player
  • Real-time subtitle overlay: compact or fullscreen floating window with adjustable text size, position, and opacity
  • Low-latency streaming pipeline: VAD-segmented ASR + rolling translation commits for readable subtitle chunks
  • Bilingual display: original text and translated text shown together in the overlay
  • No extra virtual audio driver required: system audio capture uses macOS ScreenCaptureKit directly
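
The "VAD-segmented ASR" step in the list above can be pictured with a minimal energy-based voice activity detector. This is only an illustrative sketch: the README does not say which VAD Moxin Voice uses, and the function names, frame length, and threshold below are assumptions made for the example.

```rust
/// Root-mean-square energy of one frame of mono samples.
fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}

/// Splits `samples` into speech segments, returned as (start, end) sample
/// indices with `end` exclusive. A frame counts as speech when its RMS
/// energy is at or above `threshold` (a hypothetical tuning parameter).
fn segment_speech(samples: &[f32], frame_len: usize, threshold: f32) -> Vec<(usize, usize)> {
    assert!(frame_len > 0, "frame_len must be positive");
    let mut segments = Vec::new();
    let mut current_start: Option<usize> = None;

    for (i, frame) in samples.chunks(frame_len).enumerate() {
        let offset = i * frame_len;
        let is_speech = rms(frame) >= threshold;
        match (is_speech, current_start) {
            // Silence -> speech: open a new segment.
            (true, None) => current_start = Some(offset),
            // Speech -> silence: close the open segment.
            (false, Some(start)) => {
                segments.push((start, offset));
                current_start = None;
            }
            _ => {}
        }
    }
    // Close a segment that runs to the end of the buffer.
    if let Some(start) = current_start {
        segments.push((start, samples.len()));
    }
    segments
}
```

Each returned segment would then be handed to the ASR stage as one utterance. A production detector would typically add hangover frames and a minimum segment length so that short pauses do not split a sentence.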

Hardware / System Requirements

  • Apple Silicon Mac required: M1 / M2 / M3 / M4
  • macOS 14.0+ recommended for the full app
  • Live Translation system audio input is macOS-only
  • System audio capture requires Screen Recording permission. On first use, macOS prompts for Screen Recording access because system audio capture is implemented with ScreenCaptureKit.
  • A display must be available. ScreenCaptureKit requires a display-backed capture session even when you only want audio.

If Screen Recording permission is denied or ScreenCaptureKit is unavailable, Live Translation still works with the microphone input source.
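
The fallback rule above can be written as a small decision function. This is a sketch under assumptions: the `InputSource` enum, the function name, and the two boolean flags are hypothetical and do not come from the Moxin Voice source, and the real app may ask the user to switch rather than switching automatically.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InputSource {
    Microphone,
    SystemAudio,
}

/// Picks the input source to use. System audio is only usable when Screen
/// Recording permission was granted and ScreenCaptureKit is available;
/// in every other case the microphone is used.
fn resolve_input_source(
    requested: InputSource,
    screen_recording_granted: bool,
    capture_available: bool,
) -> InputSource {
    match requested {
        InputSource::SystemAudio if screen_recording_granted && capture_available => {
            InputSource::SystemAudio
        }
        _ => InputSource::Microphone,
    }
}
```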


⚡ Powered by OminiX MLX

The inference engine behind Moxin Voice is OminiX MLX, a comprehensive Rust-native ML inference ecosystem for Apple Silicon.

OminiX MLX provides:

  • Pure Rust inference: no Python runtime required at synthesis time
  • Metal GPU acceleration: optimized for M1/M2/M3/M4 chips via Apple's MLX framework
  • Unified memory: zero-copy CPU/GPU data sharing
  • Qwen3-TTS-MLX: the TTS engine used by Moxin Voice (9 built-in voices, 12 languages, ICL voice cloning, 2.3× real-time on M3 Max)

Moxin Voice uses OminiX MLX's dora-qwen3-tts-mlx node as its sole TTS backend. Source: node-hub/dora-qwen3-tts-mlx/
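
To make the "2.3× real-time" figure concrete, the helper below converts a clip length into an expected synthesis time. It assumes that "2.3× real-time" means 2.3 seconds of audio are produced per second of compute; the function name is hypothetical, and actual speed will vary with hardware and text.

```rust
/// Estimated wall-clock seconds needed to synthesize `audio_seconds` of
/// speech at the given real-time factor (audio seconds per compute second).
fn synthesis_seconds(audio_seconds: f64, realtime_factor: f64) -> f64 {
    assert!(realtime_factor > 0.0, "real-time factor must be positive");
    audio_seconds / realtime_factor
}
```

Under that reading, `synthesis_seconds(23.0, 2.3)` is about 10: a 23-second clip would take roughly 10 seconds to generate on an M3 Max.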


✨ Features

  • 🎙️ Zero-Shot Voice Cloning: clone any voice with 5-30 seconds of audio (ICL Express mode)
  • 🎵 Text-to-Speech: 9 preset voices across Chinese, English, Japanese, and Korean
  • 🌐 Live Translation: real-time subtitles from microphone or system audio with a floating overlay
  • 🔮 Qwen3-TTS-MLX Backend: 2.3× real-time synthesis via OminiX MLX on Apple Silicon
  • 🎤 Audio Recording: built-in real-time recording with waveform visualization
  • 🔍 ASR Integration: automatic text transcription for cloning reference audio
  • 💾 Audio Export: save generated speech as WAV files
  • 🌓 Dark Mode: native dark theme via Makepad GPU rendering
  • 🌐 Bilingual UI: Chinese and English interface
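
As an illustration of the Audio Export feature listed above, the sketch below encodes mono samples as a 16-bit PCM WAV buffer using only the standard library. The README does not state the app's actual sample format or sample rate, so the mono 16-bit layout and the function name are assumptions.

```rust
/// Encodes mono f32 samples in [-1.0, 1.0] as a 16-bit PCM WAV byte buffer
/// (44-byte RIFF header followed by little-endian sample data).
fn encode_wav_mono16(samples: &[f32], sample_rate_hz: u32) -> Vec<u8> {
    let channels: u16 = 1;
    let bits_per_sample: u16 = 16;
    let block_align: u16 = channels * bits_per_sample / 8;
    let byte_rate: u32 = sample_rate_hz * u32::from(block_align);
    let data_len: u32 = (samples.len() * usize::from(block_align)) as u32;

    let mut out = Vec::with_capacity(44 + data_len as usize);
    out.extend_from_slice(b"RIFF");
    out.extend_from_slice(&(36 + data_len).to_le_bytes()); // size of the rest of the file
    out.extend_from_slice(b"WAVE");
    out.extend_from_slice(b"fmt ");
    out.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size for PCM
    out.extend_from_slice(&1u16.to_le_bytes()); // audio format 1 = PCM
    out.extend_from_slice(&channels.to_le_bytes());
    out.extend_from_slice(&sample_rate_hz.to_le_bytes());
    out.extend_from_slice(&byte_rate.to_le_bytes());
    out.extend_from_slice(&block_align.to_le_bytes());
    out.extend_from_slice(&bits_per_sample.to_le_bytes());
    out.extend_from_slice(b"data");
    out.extend_from_slice(&data_len.to_le_bytes());
    for &s in samples {
        // Clamp to the valid range, then scale to signed 16-bit.
        let pcm = (s.clamp(-1.0, 1.0) * 32767.0) as i16;
        out.extend_from_slice(&pcm.to_le_bytes());
    }
    out
}
```

The resulting buffer can be written to disk with `std::fs::write(path, bytes)`.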

🏗️ Architecture

moxin-voice/
├── moxin-voice-shell/          # Application entry point (binary)
├── apps/moxin-voice/           # UI + application logic
│   └── dataflow/tts.yml        # Dora dataflow graph
├── moxin-widgets/              # Shared Makepad UI components
├── moxin-ui/                   # Application infrastructure
├── moxin-dora-bridge/          # Dora dataflow integration bridge
└── node-hub/
    ├── dora-qwen3-tts-mlx/     # ★ OminiX MLX Qwen3-TTS Rust node
    │   └── previews/           # Pre-generated voice preview WAVs
    └── dora-qwen3-asr/         # ★ OminiX MLX Qwen3-ASR Rust node

The TTS pipeline runs as a Dora dataflow: the UI sends text, the qwen-tts-node (built from dora-qwen3-tts-mlx) synthesizes audio using OminiX MLX, and the audio player receives the stream.
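
The three-stage flow described above (UI sends text, TTS node synthesizes, player receives audio) can be mimicked with standard-library channels. This sketch does not use the Dora API: the channels stand in for dataflow edges, and `fake_synthesize` is a hypothetical placeholder for the real Qwen3-TTS node.

```rust
use std::sync::mpsc;
use std::thread;

/// Placeholder for the TTS node: emits one silent sample per input byte.
fn fake_synthesize(text: &str) -> Vec<f32> {
    vec![0.0; text.len()]
}

/// Sends each text through a "TTS" thread and returns the length of every
/// audio chunk that the "player" side receives, in order.
fn run_pipeline(texts: Vec<String>) -> Vec<usize> {
    let (text_tx, text_rx) = mpsc::channel::<String>();
    let (audio_tx, audio_rx) = mpsc::channel::<Vec<f32>>();

    // TTS stage: consume text messages and emit audio chunks.
    let tts = thread::spawn(move || {
        for text in text_rx {
            if audio_tx.send(fake_synthesize(&text)).is_err() {
                break; // the player side has gone away
            }
        }
    });

    // UI stage: push text into the pipeline, then close the input edge.
    for text in texts {
        text_tx.send(text).expect("TTS stage hung up");
    }
    drop(text_tx);

    // Player stage: drain audio until the TTS stage finishes.
    let lengths: Vec<usize> = audio_rx.iter().map(|chunk| chunk.len()).collect();
    tts.join().expect("TTS thread panicked");
    lengths
}
```

In the real application these stages are separate Dora nodes wired together in dataflow/tts.yml rather than threads in one process.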


🚀 Quick Start (macOS)

Prerequisites

  • macOS 14.0+ (Sonoma), Apple Silicon (M1/M2/M3/M4)
  • Rust 1.82+
  • Dora CLI (cargo install dora-cli)
  • Python 3.8+ (for the one-time model download script; not required at runtime)

1. Download Models

bash scripts/init_qwen3_models.sh

This downloads all three model snapshots into ~/.OminiX/models/:

| Model | Purpose |
|---|---|
| Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit | Preset voice synthesis |
| Qwen3-TTS-12Hz-1.7B-Base-8bit | ICL zero-shot voice cloning |
| Qwen3-ASR-1.7B-8bit | Voice cloning reference audio transcription |

huggingface_hub is installed automatically if not present.
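
After the download step, a quick check for the three snapshots might look like the sketch below. It assumes each model is stored in a subdirectory of ~/.OminiX/models/ named after the model; that layout and the function names are assumptions, not something the README confirms.

```rust
use std::path::{Path, PathBuf};

/// Model names taken from the README's model list.
static MODEL_NAMES: [&str; 3] = [
    "Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
    "Qwen3-TTS-12Hz-1.7B-Base-8bit",
    "Qwen3-ASR-1.7B-8bit",
];

/// Expected snapshot directories under `<home>/.OminiX/models/`
/// (assumed layout: one subdirectory per model name).
fn model_dirs(home: &Path) -> Vec<PathBuf> {
    MODEL_NAMES
        .iter()
        .map(|name| home.join(".OminiX").join("models").join(name))
        .collect()
}

/// Returns the expected directories that do not exist on disk.
fn missing_models(home: &Path) -> Vec<PathBuf> {
    model_dirs(home)
        .into_iter()
        .filter(|dir| !dir.is_dir())
        .collect()
}
```

An empty result from `missing_models` would suggest the download finished; a non-empty one lists the paths to re-fetch by rerunning the script.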

2. Build

cargo build --release

This builds all binaries including dora-qwen3-asr (the ASR Dora node) and qwen-tts-node.

3. Run

dora up
cargo run -p moxin-voice-shell

First-Time Distribution (macOS .app)

For end-users receiving the distributed .app, model download and initialization happen automatically via the in-app bootstrap wizard on first launch.


🔮 Qwen3-TTS Voice Library

9 built-in preset voices, UI names localized to Chinese or English:

| ID | Language | Character |
|---|---|---|
| vivian | zh | 薇薇安: bright, slightly edgy young female |
| serena | zh | 赛琳娜: warm, gentle young female |
| uncle_fu | zh | 傅叔: low, mellow seasoned male |
| dylan | zh | 迪伦: clear Beijing young male |
| eric | zh | 埃里克: lively Chengdu young male |
| ryan | en | Ryan: dynamic male with rhythmic drive |
| aiden | en | Aiden: sunny American male |
| ono_anna | ja | 小野安奈: playful Japanese female |
| sohee | ko | 素熙: warm Korean female |
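
The voice list above maps naturally onto a static lookup table. The sketch below uses the IDs and language codes from the table; the `Voice` struct and the helper functions are hypothetical and are not taken from the Moxin Voice source.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Voice {
    id: &'static str,
    language: &'static str,
}

/// The nine preset voices, with IDs and language codes from the README.
static VOICES: [Voice; 9] = [
    Voice { id: "vivian", language: "zh" },
    Voice { id: "serena", language: "zh" },
    Voice { id: "uncle_fu", language: "zh" },
    Voice { id: "dylan", language: "zh" },
    Voice { id: "eric", language: "zh" },
    Voice { id: "ryan", language: "en" },
    Voice { id: "aiden", language: "en" },
    Voice { id: "ono_anna", language: "ja" },
    Voice { id: "sohee", language: "ko" },
];

/// Looks up a preset voice by its ID.
fn find_voice(id: &str) -> Option<&'static Voice> {
    VOICES.iter().find(|v| v.id == id)
}

/// Lists the IDs of all preset voices for a language code such as "zh".
fn voice_ids_for_language(language: &str) -> Vec<&'static str> {
    VOICES
        .iter()
        .filter(|v| v.language == language)
        .map(|v| v.id)
        .collect()
}
```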

Voice Cloning (Express Mode)

Upload or record 5-30 seconds of reference audio. Moxin Voice uses Qwen3-TTS's In-Context Learning (ICL) to clone the voice in real time, with no training required. ASR auto-transcription is optional; if ASR is unavailable, users can enter reference text manually.
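
The 5-30 second window for reference audio can be checked before cloning starts. This is a minimal sketch with hypothetical function names; whether the app enforces the bounds strictly or only recommends them is not stated in the README.

```rust
/// Duration in seconds of a mono clip with `sample_count` samples.
fn duration_seconds(sample_count: usize, sample_rate_hz: u32) -> f64 {
    assert!(sample_rate_hz > 0, "sample rate must be positive");
    sample_count as f64 / f64::from(sample_rate_hz)
}

/// True when the clip length falls inside the 5 to 30 second window
/// (both ends inclusive, which is an assumption).
fn is_valid_reference(sample_count: usize, sample_rate_hz: u32) -> bool {
    let secs = duration_seconds(sample_count, sample_rate_hz);
    (5.0..=30.0).contains(&secs)
}
```

For example, a 10-second recording at a 16 kHz sample rate (160,000 samples) passes the check, while a 3-second one does not.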


📦 Build

Development

cargo build -p moxin-voice-shell

macOS App Bundle

bash scripts/build_macos_app.sh --version 0.1.0
bash scripts/build_macos_dmg.sh

Distribution Bootstrap (user machine)

bash scripts/macos_bootstrap.sh

Downloads the Qwen3-TTS and Qwen3-ASR models and sets up the app-private conda env (needed only for the TTS download script).


🔧 Technology Stack

| Component | Technology |
|---|---|
| UI framework | Makepad: GPU-accelerated, pure Rust |
| TTS inference | OminiX MLX · Qwen3-TTS-MLX |
| TTS model | Qwen3-TTS (Alibaba) |
| ML runtime | Apple MLX via mlx-sys / mlx-rs (OminiX MLX) |
| Dataflow | Dora |
| Audio I/O | CPAL |
| ASR | OminiX MLX · Qwen3-ASR-MLX (Rust, Metal GPU) |
| Language | Rust 2021 edition |

📝 License

Apache License 2.0; see LICENSE.


🙏 Acknowledgments

  • OminiX MLX: the core ML inference engine powering all synthesis in this project
  • Qwen3-TTS: the TTS model (Alibaba)
  • Makepad: GPU-accelerated UI framework
  • Dora: dataflow architecture
  • Apple MLX: foundation for OminiX MLX

Repository: https://github.com/moxin-org/Moxin-Voice
