
localvoxtral

localvoxtral demo

localvoxtral app icon

localvoxtral is a native macOS menu bar app for realtime dictation. It keeps the loop simple: start dictation, speak, get text fast. Unlike Whisper-based tools that transcribe after you stop speaking, Voxtral Realtime streams text as audio arrives, so words appear while you're still talking. On Apple Silicon, localvoxtral + voxmlx + mlx-lm provides a fully local path (audio + inference + LLM polishing stay on-device), improving privacy and avoiding API costs.

It connects to any OpenAI Realtime-compatible endpoint. Recommended backends are voxmlx (Apple Silicon) and vLLM (NVIDIA GPU). LLM polishing connects to any OpenAI /chat/completions endpoint; the recommended backend is mlx-lm (Apple Silicon).

Built for Mistral AI's Voxtral Mini 4B Realtime model, but it works with any OpenAI-compatible Realtime API backend and model.
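On the wire, these backends speak the OpenAI Realtime API over a WebSocket: the client streams base64-encoded audio chunks as JSON events and receives transcript deltas back. As a rough illustration (the event name follows OpenAI's Realtime spec; a given backend's exact schema may differ), an audio-append event looks like this:

```shell
# Illustrative only: exact event schemas can vary by backend.
cat > /tmp/realtime_event.json <<'EOF'
{
  "type": "input_audio_buffer.append",
  "audio": "<base64-encoded PCM16 audio chunk>"
}
EOF
# Confirm the payload is well-formed JSON
python3 -m json.tool /tmp/realtime_event.json
```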

Features

  • Global shortcut with selectable behavior: Toggle (press-to-start/stop) or Push to Talk (hold-to-dictate)
  • Native menu bar app that opens instantly and gives visual feedback through the icon
  • Output modes: overlay buffer (commit on stop) or live auto-paste into focused input
  • Personal replacement dictionary (exact match, or exact match plus LLM-aware replacement)
  • Editable LLM system and user prompt templates
  • Fully local dictation option with voxmlx (no third-party API traffic)
  • Fully local LLM polishing option with mlx-lm (no third-party API traffic)
  • Pick your preferred microphone input device
  • Copy the latest transcribed segment

Quick start

Recommended: install from GitHub Releases (DMG)

Download the latest .dmg from Releases.

If macOS blocks first launch, go to System Settings -> Privacy & Security and click Open Anyway for localvoxtral.

Alternatively, build from source as an app bundle

./scripts/package_app.sh release
open ./dist/localvoxtral.app

Settings

  • Open Settings from the menu bar popover to set:
    • Dictation keyboard shortcut
    • Shortcut behavior (Toggle / Push to Talk)
    • Realtime endpoint (URL, model name, API key)
    • Commit interval (vLLM/voxmlx)
    • Auto-copy final segment
    • Output mode (Overlay Buffer / Live Auto-Paste)
    • Replacement dictionary (overlay buffer output mode only)
    • LLM polishing endpoint (URL, model name, API key; overlay buffer output mode only)
    • Open the shared config folder for replacement_dictionary.toml, llm_system_prompt.toml, and llm_user_prompt.toml

The shared config directory lives at ~/Library/Application Support/localvoxtral/config.
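For illustration, a replacement_dictionary.toml might map commonly misheard terms to their preferred spellings. The schema below is a hypothetical sketch, not the app's actual format; open the file localvoxtral generates in the config folder to see the real structure:

```shell
# Hypothetical example; the real replacement_dictionary.toml schema may differ.
DEMO_DIR=/tmp/localvoxtral-demo
mkdir -p "$DEMO_DIR"
cat > "$DEMO_DIR/replacement_dictionary.toml" <<'EOF'
# Map misheard or shorthand terms to their preferred spelling
[replacements]
"vox mlx" = "voxmlx"
"local voxtral" = "localvoxtral"
EOF
cat "$DEMO_DIR/replacement_dictionary.toml"
```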

Tested setup

In this tested setup, vLLM and voxmlx stream partial text fast enough for realtime dictation; latency and throughput will vary by hardware, model, and quantization.

voxmlx (recommended)

An OpenAI Realtime-compatible voxmlx server running on an M1 Pro with a 4-bit quantized model. Use this fork, which adds a WebSocket server that speaks the OpenAI Realtime API protocol, plus memory-management optimizations.

# install uv once: https://docs.astral.sh/uv/getting-started/installation/
uvx --from "git+https://github.com/T0mSIlver/voxmlx.git[server]" \
  voxmlx-serve --model T0mSIlver/Voxtral-Mini-4B-Realtime-2602-MLX-4bit

vLLM

A vLLM OpenAI Realtime-compatible server running on an NVIDIA RTX 3090, using the default settings recommended on the Voxtral Mini 4B Realtime model page.

# Prefix the env var to the command; on its own line it never reaches the vllm process
VLLM_DISABLE_COMPILE_CACHE=1 \
vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 --compilation_config '{"cudagraph_mode": "PIECEWISE"}'
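Once the server is up, you can sanity-check that the model is loaded before pointing localvoxtral at it. This assumes vLLM's default port of 8000; adjust if you changed it:

```shell
# List the models the server is exposing; falls back to a message if it is
# not reachable yet (vLLM serves the OpenAI-compatible API on port 8000 by default)
curl --silent --fail http://localhost:8000/v1/models || echo "server not reachable yet"
```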

mlx-audio (deprecated)

Deprecated: mlx-audio does not provide true incremental inference for Voxtral Realtime, so partial transcripts are chunked and less responsive than the vLLM and voxmlx backends.

mlx-audio server on an M1 Pro, running a 4-bit quant of Voxtral Mini 4B Realtime.

# Default max_chunk (6s) force-splits continuous speech mid-sentence; 30 lets silence detection handle segmentation naturally
MLX_AUDIO_REALTIME_MAX_CHUNK_SECONDS=30 python -m mlx_audio.server --workers 1

Tested setup (LLM polishing)

mlx-lm (recommended)

mlx_lm.server on an M1 Pro, running Qwen3.5-0.8B in 8-bit for local LLM polishing. Use this fork, which adds prompt-caching optimizations. Qwen3.5-0.8B is a lightweight default that adds little overhead while remaining smart enough for reliable polishing.

# install uv once: https://docs.astral.sh/uv/getting-started/installation/
# use prompt caching to avoid reprocessing the full conversation on every request
uvx --from "git+https://github.com/T0mSIlver/mlx-lm.git" mlx_lm.server \
  --model mlx-community/Qwen3.5-0.8B-8bit \
  --prompt-cache-size 1 \
  --prompt-cache-bytes 1GB
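The polishing step talks to this server with standard OpenAI /chat/completions requests. The payload below is a sketch of that request shape, not localvoxtral's actual traffic: the real prompts come from llm_system_prompt.toml and llm_user_prompt.toml, and the port (8080 here) depends on your mlx_lm.server settings:

```shell
# Hypothetical polishing request in the OpenAI chat-completions format
cat > /tmp/polish_request.json <<'EOF'
{
  "model": "mlx-community/Qwen3.5-0.8B-8bit",
  "messages": [
    {"role": "system", "content": "Fix punctuation and casing. Do not change wording."},
    {"role": "user", "content": "so the meeting moved to tuesday at ten"}
  ],
  "temperature": 0
}
EOF
# Validate the payload; with the server running you could POST it:
#   curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d @/tmp/polish_request.json
python3 -m json.tool /tmp/polish_request.json
```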

With the default polishing prompts, prompt processing is roughly 286 ms (~50%) faster on average on an M1 Pro with my fork's enhanced prompt caching. On more powerful Apple Silicon, the absolute ms savings will likely be smaller because prompt processing is faster to begin with.

Roadmap

UI

Screenshots: menubar icon, Realtime Endpoint settings, Dictation settings, Text Processing settings, and the popover view.
