Local speech-to-text. Hotkey toggles recording (or hold-to-talk). Transcribes via Metal GPU, pastes at cursor.
Requires Apple Silicon. Intel Macs are not supported.
There's also a Windows version using AutoHotkey + CUDA (with CPU fallback). Caveat: it hasn't been tested to the same extent as the Mac version.
- Homebrew (the only prereq; install it first if you haven't)
- whisper-cpp: local speech-to-text engine (Metal GPU on Apple Silicon)
- sox: mic recording
- Hammerspoon: hotkey binding + automation
- ggml-large-v3-turbo model (~1.5GB), downloaded automatically
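If you want to see what is already in place before running the installer, a minimal preflight sketch might look like the following. The function names are hypothetical and are not part of dictpaste; the installer does its own checks.

```shell
#!/bin/sh
# Hypothetical preflight helpers; not part of dictpaste itself.

# Succeeds only on macOS running natively on Apple Silicon,
# which `uname -m` reports as arm64.
is_apple_silicon() {
  [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]
}

# Prints each named command that is not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

For example, `missing_tools brew sox whisper-cli` prints whichever of those three are not installed yet, and prints nothing if all are present.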
The installer prompts you to pick a hotkey and recording mode (toggle or hold-to-talk).
yolo method:
/bin/zsh -c "$(curl -fsSL https://raw.githubusercontent.com/hwrok/dictpaste/main/mac/install-dictpaste.zsh)"

If you're scared of running random scripts from the internet and would rather see it first:
curl -fsSL https://raw.githubusercontent.com/hwrok/dictpaste/main/mac/install-dictpaste.zsh -o install-dictpaste.zsh
less install-dictpaste.zsh # satisfy your paranoia
zsh install-dictpaste.zsh

- Focus any text input
- Hit your hotkey → "● Recording"
- Speak
- Hit your hotkey again → "Transcribing…" → text pastes at cursor (also remains on clipboard in case you need to repaste)
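The toggle flow above can be sketched in shell as a small pidfile-based state machine. This is only an illustration of the logic: dictpaste binds the hotkey through Hammerspoon, and every name here (`PIDFILE`, `WAV`, `record_to`, `toggle_recording`) as well as the exact `rec` options are assumptions.

```shell
#!/bin/sh
# Hypothetical toggle sketch; paths and names are illustrative only.
PIDFILE="${TMPDIR:-/tmp}/dictpaste-sketch.pid"
WAV="${TMPDIR:-/tmp}/dictpaste-sketch.wav"

# Assumed recorder: sox's `rec`, 16 kHz mono. `exec` makes the background
# job's PID the recorder itself, so it can be signalled directly.
record_to() {
  exec rec -q -r 16000 -c 1 "$1"
}

toggle_recording() {
  if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    # Second press: stop the running recorder.
    kill "$(cat "$PIDFILE")"
    rm -f "$PIDFILE"
    echo "Transcribing…"
  else
    # First press (or stale pidfile): start recording in the background.
    record_to "$WAV" >/dev/null 2>&1 &
    echo "$!" > "$PIDFILE"
    echo "● Recording"
  fi
}
```

Calling `toggle_recording` once starts the recorder and a second call stops it, which mirrors the two hotkey presses. Hold-to-talk would call the same two branches on key-down and key-up instead.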
All transcriptions are appended to ~/Library/Logs/dictpaste/dictpaste.log with timestamps. Rolling rotation at 1MB, max 5 files. Nothing is lost even if paste lands in the wrong place and clipboard is overwritten for whatever reason.
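A rough sketch of how size-based rolling rotation like this can work is shown below. It assumes the 5-file cap counts the active log plus four rotated copies (`.1` to `.4`); the function names and the timestamp format are also assumptions, not dictpaste's actual implementation.

```shell
#!/bin/sh
# Hypothetical rotation sketch; defaults mirror the 1MB / 5 files described above.
rotate_log() {
  log="$1"; max_bytes="${2:-1048576}"; max_files="${3:-5}"
  [ -f "$log" ] || return 0
  size=$(($(wc -c < "$log")))          # arithmetic expansion strips padding
  [ "$size" -ge "$max_bytes" ] || return 0
  # Drop the oldest copy, then shift log.N to log.(N+1).
  rm -f "$log.$((max_files - 1))"
  i=$((max_files - 2))
  while [ "$i" -ge 1 ]; do
    [ -f "$log.$i" ] && mv -f "$log.$i" "$log.$((i + 1))"
    i=$((i - 1))
  done
  mv -f "$log" "$log.1"
  : > "$log"                           # start a fresh, empty active log
}

# Appends one timestamped line, rotating first if the log is full.
log_transcript() {
  log="$1"; text="$2"
  mkdir -p "$(dirname "$log")"
  rotate_log "$log"
  printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$text" >> "$log"
}
```

Usage would be along the lines of `log_transcript ~/Library/Logs/dictpaste/dictpaste.log "hello world"`.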
- No audio: System Settings → Privacy & Security → Microphone → Hammerspoon must be enabled
- whisper-cli not found: the binary name varies by brew version, so check what is installed with
  ls /opt/homebrew/bin/whisper*
- bad magic / model error: re-download from the huggingface repo; older cached models may be incompatible with newer whisper-cpp
- Junk output on short clips: whisper hallucinates on <1s audio — cleanup strips common artifacts but very short recordings may still produce noise
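To give a feel for what artifact cleanup can involve, here is a minimal sketch of a filter that drops bracketed non-speech tags such as `[BLANK_AUDIO]` and tidies whitespace. The function name and the exact rules are assumptions; dictpaste's real cleanup may strip a different set of artifacts.

```shell
#!/bin/sh
# Hypothetical cleanup filter: reads a transcript on stdin, writes it cleaned.
clean_transcript() {
  # 1) drop [bracketed] tags, 2) collapse whitespace runs,
  # 3) trim the leading space, 4) trim the trailing space.
  sed -E \
    -e 's/\[[^]]*\]//g' \
    -e 's/[[:space:]]+/ /g' \
    -e 's/^ //' \
    -e 's/ $//'
}
```

For example, `printf ' [BLANK_AUDIO] hello   world \n' | clean_transcript` yields `hello world`.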
dictpaste ships two whisper tweaks that make sense for short dictation but wouldn't be appropriate for long-form transcription (meetings, podcasts, movies):
- -mc 0 (max-context 0) disables cross-segment context, where whisper uses the previous ~30s segment to inform the next one. For long recordings this improves coherence across segment boundaries. For dictation (a few seconds to maybe a minute), there's only one segment anyway, so the feature does nothing useful. Worse, it's the mechanism behind a known decoder bug where whisper's attention gets stuck on an earlier phrase and repeats it in a loop. Disabling it eliminates that bug with zero practical downside for dictpaste's use case. (Previously --no-context.)
- Sox silence trimming strips leading and trailing silence from the audio before handing it to whisper. Whisper was trained heavily on YouTube content that ends with "Thank you" or "Thanks for watching", so it confidently hallucinates sign-off phrases when it hits trailing dead air. Removing the silence before transcription prevents this. Overhead on a short 16kHz mono clip is negligible (single-digit ms).
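Put together, the two tweaks might look roughly like this on the command line. The silence thresholds (0.1s, 1%), the file naming, the extra `-nt`/`-np` flags, and the helper names are assumptions for illustration, not necessarily what dictpaste runs. With `DRY_RUN` set, the sketch prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical pipeline sketch; not dictpaste's actual invocation.

# With DRY_RUN set, print each command instead of executing it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

transcribe_clip() {
  raw="$1"; model="$2"
  trimmed="${raw%.wav}.trimmed.wav"
  # Trim leading silence, reverse, trim what is now leading
  # (originally trailing) silence, then reverse back.
  run sox "$raw" "$trimmed" \
    silence 1 0.1 1% reverse silence 1 0.1 1% reverse || return 1
  # -mc 0: no cross-segment context; -nt: no timestamps;
  # -np: print only the transcription result.
  run whisper-cli -m "$model" -f "$trimmed" -mc 0 -nt -np
}
```

Running `DRY_RUN=1; transcribe_clip clip.wav ggml-large-v3-turbo.bin` shows the two commands without needing sox or whisper-cli installed.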
- Claude Code: version 2.1.68+ has built-in push-to-talk STT. dictpaste is still generally the better option: it works system-wide (not just Claude's input box), transcription is fully local (no audio sent to Anthropic), and it's already wired into the OS-level paste flow.