localvoxtral is a native macOS menu bar app for realtime dictation.
It keeps the loop simple: start dictation, speak, get text fast.
Unlike Whisper-based tools that transcribe after you stop speaking, Voxtral Realtime streams text as audio arrives, so words appear while you're still talking.
On Apple Silicon, localvoxtral + voxmlx + mlx-lm provides a fully local path (audio + inference + LLM polishing stay on-device), improving privacy and avoiding API costs.
It connects to any OpenAI Realtime-compatible endpoint. Recommended backends are voxmlx (Apple Silicon) and vLLM (NVIDIA GPU).
LLM polishing connects to any OpenAI `/chat/completions` endpoint. The recommended backend is mlx-lm (Apple Silicon).
Built for Mistral AI's Voxtral Mini 4B Realtime model, but it works with any OpenAI-compatible Realtime API backend and model.
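For context, an OpenAI Realtime-compatible backend receives microphone audio as JSON events over a WebSocket. The sketch below builds the standard `input_audio_buffer.append` event; the sample rate and framing are illustrative assumptions, not localvoxtral's exact wiring.

```python
import base64
import json

def append_audio_event(pcm16_bytes: bytes) -> str:
    """Build an OpenAI Realtime 'input_audio_buffer.append' event.

    The Realtime protocol carries base64-encoded PCM16 audio in the
    'audio' field; a client sends these as it captures microphone data.
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    })

# 100 ms of silence at 16 kHz mono PCM16 (2 bytes per sample) -- illustrative only
silence = b"\x00\x00" * 1600
event = json.loads(append_audio_event(silence))
```

Because text streams back over the same socket as these events arrive, the app can show partial words while you are still speaking.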
- Global shortcut with selectable behavior: `Toggle` (press-to-start/stop) or `Push to Talk` (hold-to-dictate)
- Native menu bar app with instant open and visual feedback via the icon
- Output modes: overlay buffer (commit on stop) or live auto-paste into the focused input
- Personal replacement dictionary (exact match, or exact match + LLM-aware replacement)
- Editable LLM system and user prompt templates
- Fully local dictation option with `voxmlx` (no third-party API traffic)
- Fully local LLM polishing option with `mlx-lm` (no third-party API traffic)
- Pick your preferred microphone input device
- Copy the latest transcribed segment
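The replacement dictionary is a TOML file. The fragment below is a hypothetical sketch of what entries could look like; the actual schema is whatever the app writes to `replacement_dictionary.toml`, so check that file for the real keys.

```toml
# Hypothetical schema -- illustrative only, not the app's actual keys.
# Left side: form the model tends to transcribe; right side: replacement.
[replacements]
"vox troll" = "Voxtral"
"em el ex" = "MLX"
```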
Download the latest .dmg from Releases.
If macOS blocks first launch, go to System Settings -> Privacy & Security and click Open Anyway for localvoxtral.
```sh
./scripts/package_app.sh release
open ./dist/localvoxtral.app
```

- Open Settings from the menu bar popover to set:
  - Dictation keyboard shortcut
  - Shortcut behavior (`Toggle` / `Push to Talk`)
  - Realtime endpoint (URL, model name, API key)
  - Commit interval (`vLLM`/`voxmlx`)
  - Auto-copy final segment
  - Output mode (`Overlay Buffer` / `Live Auto-Paste`)
  - Replacement dictionary (overlay buffer output mode only)
  - LLM polishing endpoint (URL, model name, API key; overlay buffer output mode only)
- Open the shared config folder for `replacement_dictionary.toml`, `llm_system_prompt.toml`, and `llm_user_prompt.toml`
The shared config directory lives at `~/Library/Application Support/localvoxtral/config`.
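For polishing, the app posts the raw transcript to the configured `/chat/completions` endpoint using the editable prompt templates. The sketch below shows the general payload shape; the helper name and prompt wording are illustrative, not localvoxtral's actual templates.

```python
def build_polish_request(model: str, system_prompt: str, raw_transcript: str) -> dict:
    """Assemble an OpenAI /chat/completions payload for transcript polishing.

    Hypothetical helper: the real prompts come from llm_system_prompt.toml
    and llm_user_prompt.toml in the shared config directory.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": raw_transcript},
        ],
        "stream": False,
    }

payload = build_polish_request(
    "mlx-community/Qwen3.5-0.8B-8bit",
    "Fix punctuation and casing; do not change the meaning.",
    "so the app streams text while youre still talking",
)
```

Any server that accepts this request shape (mlx-lm, vLLM, llama.cpp, a hosted API) can act as the polishing backend.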
In this tested setup, vLLM and voxmlx stream partial text fast enough for realtime dictation; latency and throughput will vary by hardware, model, and quantization.
voxmlx OpenAI Realtime-compatible server running on an M1 Pro with a 4-bit quantized model. Use this fork, which adds a WebSocket server speaking the OpenAI Realtime API protocol, plus memory-management optimizations.
```sh
# install uv once: https://docs.astral.sh/uv/getting-started/installation/
uvx --from "git+https://github.com/T0mSIlver/voxmlx.git[server]" \
  voxmlx-serve --model T0mSIlver/Voxtral-Mini-4B-Realtime-2602-MLX-4bit
```

vLLM OpenAI Realtime-compatible server running on an NVIDIA RTX 3090, using the default settings recommended on the Voxtral Mini 4B Realtime model page.
```sh
VLLM_DISABLE_COMPILE_CACHE=1 \
vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 --compilation_config '{"cudagraph_mode": "PIECEWISE"}'
```

Deprecated: mlx-audio does not provide true incremental inference for Voxtral Realtime, so partial transcripts are chunked and less responsive than with the vLLM and voxmlx backends.
mlx-audio server on M1 Pro, running a 4-bit quant of Voxtral Mini 4B Realtime.
```sh
# Default max_chunk (6s) force-splits continuous speech mid-sentence; 30 lets silence detection handle segmentation naturally
MLX_AUDIO_REALTIME_MAX_CHUNK_SECONDS=30 python -m mlx_audio.server --workers 1
```

mlx_lm.server on an M1 Pro, running Qwen3.5-0.8B in 8-bit for local LLM polishing.
Use this fork, which adds prompt-caching optimizations.
Qwen3.5-0.8B is a lightweight default that adds little overhead while remaining smart enough for reliable polishing.
```sh
# install uv once: https://docs.astral.sh/uv/getting-started/installation/
# use prompt caching to avoid reprocessing the full conversation on every request
uvx --from "git+https://github.com/T0mSIlver/mlx-lm.git" mlx_lm.server \
  --model mlx-community/Qwen3.5-0.8B-8bit \
  --prompt-cache-size 1 \
  --prompt-cache-bytes 1GB
```

With the default polishing prompts, prompt processing is roughly 286 ms (~50%) faster on average on an M1 Pro with my fork's enhanced prompt caching. On more powerful Apple Silicon, the absolute millisecond savings will likely be smaller because prompt processing is faster to begin with.
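Prompt caching helps here because consecutive polishing requests share a long fixed prefix (system prompt plus dictionary); the server can reuse the already-computed KV state for that prefix and only prefill the new transcript tokens. A toy illustration of the idea, not mlx-lm's implementation:

```python
def reusable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Length of the shared prefix between the cached prompt and a new one.

    A prompt cache can skip prefill for exactly this many tokens and
    only process the remainder of the new prompt.
    """
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Same system prompt + dictionary (tokens 1..4), different transcript tail:
cached = [1, 2, 3, 4, 90, 91]
fresh = [1, 2, 3, 4, 70, 71, 72]
skipped = reusable_prefix_len(cached, fresh)  # 4 tokens skip prefill
```

The longer and more stable your polishing prompts are relative to each dictated segment, the larger the fraction of prefill work the cache removes.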
- Enhance the server connection UX
- Drive `voxmlx-serve` (from the `voxmlx` fork) upstream and assess app-managed local serving (start/stop/config) in localvoxtral.
- Implement more of the on-device Voxtral Realtime integrations recommended in the model README:
  - Pure C - thanks Salvatore Sanfilippo - done
  - mlx-audio framework - thanks Shreyas Karnik - done
  - MLX - thanks Awni Hannun
  - Rust - thanks TrevorS
| Realtime Endpoint | Dictation |
|---|---|
| (screenshot) | (screenshot) |
| **Text Processing** | **Popover** |
| (screenshot) | (screenshot) |