A speech-to-text toolkit wrapping faster-whisper (large-v3) with GPU acceleration. Includes a PyQt6 desktop GUI, a headless CLI, and a vocabulary manager for domain-specific transcription.
Platform: Linux only. This application depends on systemd, FIFO-based IPC, PyAudio/ALSA, and X11/XCB. It does not work on macOS or Windows.
- Speech-to-text using faster-whisper large-v3 (the largest Whisper model; the CTranslate2 backend runs 4-6x faster than openai-whisper)
- GPU acceleration with CUDA float16 (~4 GB VRAM), automatic CPU fallback with int8 quantization
- PyQt6 GUI with system tray icon, recording controls, transcription history, and click-to-copy
- Systemd integration for autostart on login
- Keyboard shortcut via FIFO IPC — bind any desktop shortcut to toggle recording (Wayland-compatible)
- Claude integration for AI-powered text refinement and keyword highlighting
Measured on NVIDIA GeForce RTX 4060 Laptop GPU (8 GB VRAM), Debian 12, Python 3.11:
| Metric | Value |
|---|---|
| Model | large-v3 (float16, CTranslate2) |
| Model load time | ~1.3s (CUDA) |
| VRAM usage | ~3.9 GB |
| Inference speed | ~19x realtime (84s audio in 4.4s) |
| Throughput | ~1,800 words/min processing capacity |
Performance is logged automatically on each transcription:
Transcription: 83.7s audio → 4.4s inference (19.0x realtime) on cuda
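
The realtime factor in that log line is the audio duration divided by the inference time. A minimal sketch of the arithmetic, assuming the line format shown above; `realtime_factor` and `format_perf_line` are hypothetical helper names, not functions taken from the project:

```python
def realtime_factor(audio_seconds: float, inference_seconds: float) -> float:
    """Return how many seconds of audio are processed per second of inference."""
    if inference_seconds <= 0:
        raise ValueError("inference_seconds must be positive")
    return audio_seconds / inference_seconds


def format_perf_line(audio_seconds: float, inference_seconds: float, device: str) -> str:
    """Build a log line in the same shape as the example above (assumed format)."""
    factor = realtime_factor(audio_seconds, inference_seconds)
    return (
        f"Transcription: {audio_seconds:.1f}s audio → "
        f"{inference_seconds:.1f}s inference ({factor:.1f}x realtime) on {device}"
    )


if __name__ == "__main__":
    print(format_perf_line(83.7, 4.4, "cuda"))
```

Values above 1.0x mean transcription finishes faster than the recording took to speak.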
| Requirement | Minimum | Recommended |
|---|---|---|
| OS | Linux (Debian/Ubuntu, Fedora, Arch) | Debian 12+, Ubuntu 22.04+ |
| Python | 3.8 | 3.11+ |
| RAM | 4 GB | 8 GB+ |
| VRAM (GPU) | 4 GB (NVIDIA) | 6 GB+ |
| Disk | 5 GB (model download) | 10 GB |
System packages needed:
# Debian/Ubuntu
sudo apt install python3-dev portaudio19-dev xclip xdotool
# Fedora
sudo dnf install python3-devel portaudio-devel

git clone https://github.com/rahulrajaram/whisper-app.git
cd whisper-app
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

The first run downloads the large-v3 model (~3 GB) from HuggingFace. This happens once.

source venv/bin/activate
python -m whisper_app

The setup script creates a virtual environment, installs dependencies, and configures a systemd user service that starts the GUI on login.

./scripts/setup_venv_systemd.sh

Then start it immediately:

systemctl --user start whisper-gui

systemctl --user status whisper-gui # Check status
journalctl --user -u whisper-gui -f # View logs
systemctl --user restart whisper-gui # Restart after code changes
systemctl --user stop whisper-gui # Stop
systemctl --user disable whisper-gui # Disable autostart

Recording is controlled via the whisper-recording-toggle script, which communicates with the running GUI through a FIFO pipe. This approach works on both X11 and Wayland.

KDE:

- Open System Settings > Shortcuts > Custom Shortcuts
- Add a new Global Shortcut > Command/URL
- Set the trigger to your preferred key combo (e.g. Ctrl+Alt+Shift+R)
- Set the command to:
/path/to/whisper-app/scripts/whisper-recording-toggle toggle

GNOME:

# Note: this replaces any existing custom-keybindings list
gsettings set org.gnome.settings-daemon.plugins.media-keys custom-keybindings \
"['/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings/whisper/']"
# gsettings sets one key per invocation, so each property gets its own command
KEYBINDING=org.gnome.settings-daemon.plugins.media-keys.custom-keybinding:/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings/whisper/
gsettings set "$KEYBINDING" name 'Whisper Toggle'
gsettings set "$KEYBINDING" command '/path/to/whisper-app/scripts/whisper-recording-toggle toggle'
gsettings set "$KEYBINDING" binding '<Ctrl><Alt><Shift>r'

XFCE:

xfconf-query -c xfce4-keyboard-shortcuts \
-p '/commands/custom/<Primary><Alt><Shift>r' \
-n -t string \
-s '/path/to/whisper-app/scripts/whisper-recording-toggle toggle'

Or via GUI: Settings Manager > Keyboard > Application Shortcuts > Add, set the command to the toggle script path, then press Ctrl+Alt+Shift+R.

Any mechanism that can run a shell command on a keypress will work. Point it at:
/path/to/whisper-app/scripts/whisper-recording-toggle toggle
The toggle script also supports start, stop, and status subcommands.
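
For illustration, the sending side of this FIFO protocol can be sketched in a few lines of Python. This is a sketch under assumptions, not the actual toggle script: it assumes commands are newline-terminated plain text written to `~/.whisper/control.fifo`, and `send_command` is a hypothetical name. It only covers sending; how the real script reports `status` back is not shown here.

```python
import errno
import os
from pathlib import Path

# Subcommands named in this README; the wire format is an assumption.
VALID_COMMANDS = {"toggle", "start", "stop", "status"}
FIFO_PATH = Path.home() / ".whisper" / "control.fifo"


def send_command(command: str, fifo_path=FIFO_PATH) -> bool:
    """Write one command to the FIFO. Return False if no GUI is listening."""
    if command not in VALID_COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    try:
        # O_NONBLOCK makes open() fail with ENXIO instead of hanging
        # when no process has the FIFO open for reading.
        fd = os.open(str(fifo_path), os.O_WRONLY | os.O_NONBLOCK)
    except FileNotFoundError:
        return False  # FIFO was never created, so the app is not running
    except OSError as exc:
        if exc.errno == errno.ENXIO:
            return False  # FIFO exists but nothing is reading from it
        raise
    try:
        os.write(fd, (command + "\n").encode("utf-8"))
    finally:
        os.close(fd)
    return True


if __name__ == "__main__":
    print("sent" if send_command("toggle") else "GUI is not running")
```

Opening the FIFO without blocking means a keyboard shortcut pressed while the GUI is stopped returns immediately instead of leaving a hung process behind.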
- Start recording: Press your keyboard shortcut or click the record button in the GUI
- Speak: Audio is captured via your selected microphone
- Stop recording: Press the shortcut again or click stop
- View result: Transcription appears in the history table — click any row to copy to clipboard
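
The transcription step in this workflow maps onto faster-whisper's `WhisperModel.transcribe`, which returns a lazy generator of segments plus an info object. A minimal sketch, assuming a model has already been loaded; `join_segments` and `transcribe_file` are hypothetical helper names, and the app's own TranscriptionService may be structured differently:

```python
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate segment texts into one transcript, skipping empty ones."""
    texts = (seg.text.strip() for seg in segments)
    return " ".join(text for text in texts if text)


def transcribe_file(model, audio_path: str) -> str:
    """Transcribe one audio file with an already-loaded faster-whisper model."""
    # transcribe() returns (segments, info); segments is a generator,
    # so inference actually runs while join_segments iterates over it.
    segments, _info = model.transcribe(audio_path)
    return join_segments(segments)


# Usage (requires faster-whisper and a downloaded model):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("large-v3", device="cuda", compute_type="float16")
#   print(transcribe_file(model, "recording.wav"))
```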
On first run you will be prompted to select a microphone. To reconfigure later:
rm ~/.whisper/config
systemctl --user restart whisper-gui

A short sound plays when recording stops and transcription finishes. To use your own sound, place an audio file named exactly completion.wav in the assets directory:

assets/completion.wav
If the file is missing, the app runs silently with no error.
Audio assets are gitignored, so each installation manages its own sound file.
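
The "missing file means silence" behavior can be sketched with the standard-library `wave` module. This is an illustration only: `load_completion_sound` is a hypothetical helper, playback itself is left out, and the app's real loading code is not shown in this README.

```python
import wave
from pathlib import Path


def load_completion_sound(assets_dir):
    """Return (params, frames) for assets/completion.wav, or None if absent."""
    path = Path(assets_dir) / "completion.wav"
    if not path.is_file():
        return None  # no sound file: the caller simply skips playback
    with wave.open(str(path), "rb") as wav:
        params = wav.getparams()
        frames = wav.readframes(wav.getnframes())
    return params, frames


if __name__ == "__main__":
    sound = load_completion_sound("assets")
    if sound is None:
        print("No completion.wav found; running silently")
    else:
        params, frames = sound
        print(f"{params.nframes} frames at {params.framerate} Hz")
```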
whisper-app/
├── src/whisper_app/
│ ├── gui/ # PyQt6 GUI (main_window, presenter, history, workers)
│ ├── cli.py # Headless CLI mode
│ ├── config.py # Runtime configuration (model, device, hotkeys)
│ ├── controllers/ # Recording controller (start/stop/toggle)
│ ├── services/ # TranscriptionService (faster-whisper), AudioInput, RecordingSession
│ ├── hotkeys/ # Pynput-based hotkey backend (disabled by default)
│ ├── fifo_controller.py # FIFO-based IPC for external shortcut commands
│ ├── dbus_controller.py # D-Bus IPC (with FIFO fallback)
│ └── command_bus.py # Command dispatch (toggle/start/stop)
├── config/systemd/ # Service file template
├── scripts/
│ ├── setup_venv_systemd.sh # One-command setup
│ └── whisper-recording-toggle # CLI to control recording via FIFO
└── tests/ # Test suite (pytest)
Desktop shortcut (KDE/GNOME/XFCE)
-> whisper-recording-toggle toggle
-> writes "toggle" to ~/.whisper/control.fifo
-> FifoController reads it
-> CommandBus dispatches to GUI
-> start or stop recording
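
The receiving half of this flow can be sketched as a loop that reads lines from the FIFO and hands each one to a dispatcher. The classes below are hypothetical stand-ins written for this example; they are not the project's FifoController or CommandBus, and the newline-delimited format is an assumption.

```python
import os
from typing import Callable, Dict, Iterable


class DemoCommandBus:
    """Hypothetical dispatcher mapping command names to handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[], None]] = {}

    def register(self, name: str, handler: Callable[[], None]) -> None:
        self._handlers[name] = handler

    def dispatch(self, name: str) -> bool:
        handler = self._handlers.get(name)
        if handler is None:
            return False  # unknown commands are ignored
        handler()
        return True


class DemoRecorder:
    """Hypothetical recorder that only tracks on/off state."""

    def __init__(self) -> None:
        self.recording = False

    def start(self) -> None:
        self.recording = True

    def stop(self) -> None:
        self.recording = False

    def toggle(self) -> None:
        self.recording = not self.recording


def dispatch_lines(lines: Iterable[str], bus: DemoCommandBus) -> int:
    """Dispatch each non-empty line; return how many commands were handled."""
    handled = 0
    for line in lines:
        command = line.strip()
        if command and bus.dispatch(command):
            handled += 1
    return handled


def serve_fifo(fifo_path: str, bus: DemoCommandBus) -> None:
    """Block forever, dispatching commands as writers connect."""
    if not os.path.exists(fifo_path):
        os.mkfifo(fifo_path, 0o600)
    while True:
        # open() blocks until a writer connects; iteration ends at EOF
        # when that writer closes, and the loop waits for the next one.
        with open(fifo_path, "r") as fifo:
            dispatch_lines(fifo, bus)
```

In a GUI application a loop like `serve_fifo` would run on a background thread, so that blocking on the FIFO never stalls the event loop.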
The app auto-detects CUDA. To verify:
python3 -c "import torch; print('CUDA:', torch.cuda.is_available())"
nvidia-smi # Should show ~4 GB used when model is loaded

If CUDA is unavailable, the app falls back to CPU with int8 quantization (slower but functional).
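
The fallback described here can be sketched with CTranslate2's own device query, which does not need PyTorch to be installed. This is an illustrative sketch, not the app's config code: `pick_device_and_compute_type`, `cuda_available`, and `load_model` are hypothetical names, while `ctranslate2.get_cuda_device_count()` and the `WhisperModel` constructor arguments are real library calls.

```python
from typing import Tuple


def pick_device_and_compute_type(has_cuda: bool) -> Tuple[str, str]:
    """Mirror the README's policy: float16 on CUDA, int8 on CPU."""
    return ("cuda", "float16") if has_cuda else ("cpu", "int8")


def cuda_available() -> bool:
    """Ask CTranslate2 (the backend behind faster-whisper) for CUDA devices."""
    try:
        import ctranslate2
    except ImportError:
        return False  # backend not installed, treat as CPU-only
    return ctranslate2.get_cuda_device_count() > 0


def load_model(model_size: str = "large-v3"):
    """Load a faster-whisper model on the best available device."""
    from faster_whisper import WhisperModel  # imported lazily: heavy dependency

    device, compute_type = pick_device_and_compute_type(cuda_available())
    return WhisperModel(model_size, device=device, compute_type=compute_type)


if __name__ == "__main__":
    print(pick_device_and_compute_type(cuda_available()))
```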
- Linux only: depends on systemd, FIFO IPC, ALSA/PulseAudio, and X11/XCB. No macOS or Windows support.
- NVIDIA GPU recommended: CPU inference works but is roughly 10-30x slower for large-v3.
- First startup is slow: the large-v3 model (~3 GB) must be downloaded once from HuggingFace, and loading it into GPU memory takes 30-60 seconds on each start.
- Single instance: only one GUI process should run at a time (enforced via ~/.whisper/app.lock).
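
One common way to enforce a single instance with a lock file is an advisory `flock` held for the lifetime of the process. The sketch below assumes that mechanism; the README only names the lock path, so the real implementation may work differently, and `acquire_single_instance_lock` is a hypothetical name.

```python
import fcntl
import os
from pathlib import Path

LOCK_PATH = Path.home() / ".whisper" / "app.lock"


def acquire_single_instance_lock(lock_path=LOCK_PATH):
    """Return an open file holding the lock, or None if it is already held."""
    lock_path = Path(lock_path)
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    handle = open(lock_path, "a+")
    try:
        # LOCK_NB makes flock() raise instead of waiting for the other holder.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # Record our PID for debugging; the lock itself is what matters.
    handle.seek(0)
    handle.truncate()
    handle.write(str(os.getpid()))
    handle.flush()
    return handle  # keep this object alive; closing it releases the lock


if __name__ == "__main__":
    lock = acquire_single_instance_lock()
    print("lock acquired" if lock else "another instance is already running")
```

Because the kernel drops the lock when the process exits, a crash does not leave a stale lock behind the way a bare "file exists" check would.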
Service won't start:
journalctl --user -u whisper-gui -n 50
# Common: ModuleNotFoundError; run ./scripts/setup_venv_systemd.sh again

No audio input detected:

python3 -c "
import pyaudio
audio = pyaudio.PyAudio()
# List every device that can capture audio, with its index
for i in range(audio.get_device_count()):
    info = audio.get_device_info_by_index(i)
    if info['maxInputChannels'] > 0:
        print(f'{i}: {info[\"name\"]}')
audio.terminate()
"
# Then: rm ~/.whisper/config && systemctl --user restart whisper-gui

Shortcut triggers twice:
Make sure you only have one mechanism bound — either a desktop shortcut calling whisper-recording-toggle, or the built-in pynput hotkey (disabled by default), but not both.
High memory usage: ~4 GB VRAM for the model is expected. System RAM usage is typically 1-2 GB.
pip install -e ".[dev]"
pytest # Run all tests
pytest -v --cov=whisper_app # With coverage

MIT License. See LICENSE for details.
Rahul Rajaram — GitHub
