WhisperS2T Push-to-Talk Usage Guide

🚀 Quick Start

conda activate whisper
python whisper_hotkey.py

Then hold your hotkey to record, release to transcribe. The transcription is automatically copied to your clipboard!

🎤 How It Works

1. Press and hold your configured hotkey (default: ctrl+alt+shift+space).
2. Speak once you hear the pop sound that signals recording has started.
3. Release the hotkey when you are done.
4. The transcription appears and is automatically copied to the clipboard.
5. Paste anywhere with Ctrl+V.
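
The hold-to-record, release-to-transcribe flow can be pictured as a small state machine. The sketch below is illustrative only: `PushToTalk`, `feed`, and the `transcribe` and `copy` callables are hypothetical names, not the actual internals of whisper_hotkey.py.

```python
from typing import Callable, List


class PushToTalk:
    """Collect audio while the hotkey is held; transcribe on release."""

    def __init__(self, transcribe: Callable[[bytes], str],
                 copy: Callable[[str], None]) -> None:
        self.transcribe = transcribe  # hypothetical: audio bytes -> text
        self.copy = copy              # hypothetical: text -> clipboard
        self.recording = False
        self.frames: List[bytes] = []

    def on_press(self) -> None:
        # Ignore key-repeat events while the hotkey is already held.
        if not self.recording:
            self.recording = True
            self.frames = []

    def feed(self, frame: bytes) -> None:
        # Audio frames are kept only while recording.
        if self.recording:
            self.frames.append(frame)

    def on_release(self) -> str:
        # Stop recording, transcribe everything captured, copy the result.
        if not self.recording:
            return ""
        self.recording = False
        text = self.transcribe(b"".join(self.frames))
        self.copy(text)
        return text


clipboard: List[str] = []
ptt = PushToTalk(transcribe=lambda audio: f"{len(audio)} bytes heard",
                 copy=clipboard.append)
ptt.on_press()
ptt.feed(b"\x00\x01")
ptt.feed(b"\x02\x03")
print(ptt.on_release())
```

In a real setup, `on_press` and `on_release` would be wired to the global hotkey and `feed` to the microphone stream.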

⚙️ Configuration

All settings live in the .env file. Edit it to customize:

# =============================================================================
# AUDIO SETTINGS
# =============================================================================
MIC_DEVICE=5                    # Your microphone device index
SAMPLE_RATE=16000               # Keep at 16000 for Whisper

# =============================================================================
# MODEL SETTINGS
# =============================================================================
MODEL=large-v3                  # Whisper model (see options below)
BACKEND=CTranslate2             # Fastest backend
LANGUAGE=en                     # Language code

# --- OR use Parakeet for best English accuracy ---
# MODEL=models/parakeet-tdt-0.6b-v2.nemo  # Local .nemo file
# MODEL=nvidia/parakeet-tdt-0.6b-v2       # Download from NGC
# BACKEND=Parakeet
# LANGUAGE=en                             # Parakeet is English-only

# =============================================================================
# RECORDING SETTINGS
# =============================================================================
CHUNK_DURATION=10               # Seconds per chunk (10 recommended)
CHUNK_OVERLAP=2                 # Overlap between chunks

# =============================================================================
# HOTKEY SETTINGS
# =============================================================================
HOTKEY=ctrl+alt+shift+space     # Your push-to-talk hotkey

# =============================================================================
# STOP MODE SETTINGS
# =============================================================================
AUTO_STOP_ENABLED=false         # Set to true for auto-stop on silence
SILENCE_THRESHOLD=2.0           # Seconds of silence before auto-stop

# =============================================================================
# OUTPUT SETTINGS
# =============================================================================
COPY_TO_CLIPBOARD=true          # Auto-copy transcription
SHOW_PROGRESS=true              # Show recording progress
PRINT_TRANSCRIPTION=true        # Print final result to console
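
To make the value types concrete, here is a minimal sketch of how a file in this format could be read using only the standard library. `parse_env` and `as_bool` are hypothetical helpers; the actual script may load its settings differently (for example with a dotenv library).

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blanks, comment lines, and inline comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Treat " #" as the start of an inline comment, as in the .env above.
        value = value.split(" #", 1)[0].strip()
        settings[key.strip()] = value
    return settings


def as_bool(value: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


sample = """
# AUDIO SETTINGS
MIC_DEVICE=5                    # Your microphone device index
SAMPLE_RATE=16000               # Keep at 16000 for Whisper
AUTO_STOP_ENABLED=false         # Set to true for auto-stop on silence
SILENCE_THRESHOLD=2.0           # Seconds of silence before auto-stop
HOTKEY=ctrl+alt+shift+space     # Your push-to-talk hotkey
"""

env = parse_env(sample)
mic_device = int(env["MIC_DEVICE"])
sample_rate = int(env["SAMPLE_RATE"])
auto_stop = as_bool(env["AUTO_STOP_ENABLED"])
silence_threshold = float(env["SILENCE_THRESHOLD"])
print(mic_device, sample_rate, auto_stop, silence_threshold, env["HOTKEY"])
```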

Finding Your Microphone Device Index

python -c "import pyaudio; p=pyaudio.PyAudio(); [print(f'{i}: {p.get_device_info_by_index(i)[\"name\"]}') for i in range(p.get_device_count()) if p.get_device_info_by_index(i)['maxInputChannels'] > 0]; p.terminate()"
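
The one-liner above is dense, so here is the same logic as a short script. It uses the same PyAudio calls as the command; `input_devices` and `query_pyaudio` are hypothetical helper names, and the sketch assumes each device info dict carries `index`, `name`, and `maxInputChannels` keys.

```python
from typing import Dict, Iterable, List, Tuple


def input_devices(infos: Iterable[Dict]) -> List[Tuple[int, str]]:
    """Keep only devices that can record (at least one input channel)."""
    return [(int(info["index"]), str(info["name"]))
            for info in infos
            if info.get("maxInputChannels", 0) > 0]


def query_pyaudio() -> List[Dict]:
    """Collect device info dicts from PyAudio (imported lazily)."""
    import pyaudio  # requires: pip install pyaudio

    p = pyaudio.PyAudio()
    try:
        return [p.get_device_info_by_index(i)
                for i in range(p.get_device_count())]
    finally:
        p.terminate()


if __name__ == "__main__":
    try:
        infos = query_pyaudio()
    except (ImportError, OSError) as exc:
        infos = []
        print(f"Could not query audio devices: {exc}")
    for index, name in input_devices(infos):
        print(f"{index}: {name}")
```

Use the number printed next to your microphone as the MIC_DEVICE value in .env.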

📋 Available Options

Models

Whisper Models (Multilingual)

Model	Parameters	Speed	Accuracy	Recommended For
`tiny`	~39M	⚡ Fastest	Basic	Testing only
`base`	~74M	⚡ Very Fast	Good	Quick notes
`small`	~244M	🚀 Fast	Better	Daily use
`medium`	~769M	🐌 Slower	Very Good	Important recordings
`large-v2`	~1550M	🐌 Slowest	Best	Maximum accuracy
`large-v3`	~1550M	🐌 Slowest	Best	Recommended

Parakeet Models (English-Only, State-of-the-Art)

Model	Parameters	Speed	Accuracy	Notes
`nvidia/parakeet-tdt-0.6b-v2`	~0.6B	🚀 Fast	Best	Recommended for English
`nvidia/parakeet-tdt-1.1b`	~1.1B	🚀 Fast	Best	Larger, slightly better
Local `.nemo` file	Varies	🚀 Fast	Best	Use your own model

Note: Parakeet models require the NeMo toolkit: pip install "nemo_toolkit[asr]" (the quotes keep shells such as zsh from interpreting the brackets)

Backends

Backend	Best For	Notes
`CTranslate2`	General Whisper use	Default, fast, good balance
`TensorRT`	Maximum Whisper speed	Requires TensorRT-LLM setup
`HuggingFace`	Distil models, flexibility	Slower, more features
`OpenAI`	Original implementation	Reference, not optimized
`Parakeet`	Best English accuracy	English-only, requires NeMo

Hotkey Examples

HOTKEY=ctrl+alt+shift+space     # 4-key combo
HOTKEY=ctrl+shift+r             # 3-key combo
HOTKEY=f9                       # Single function key
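
Each hotkey value is a `+`-separated list of key names. As a rough illustration of how such a string breaks down into modifiers and a trigger key, consider the sketch below. `parse_hotkey` and `MODIFIERS` are hypothetical; this is not how the keyboard library or whisper_hotkey.py necessarily parses the value.

```python
from typing import FrozenSet, Tuple

MODIFIERS = frozenset({"ctrl", "alt", "shift"})


def parse_hotkey(spec: str) -> Tuple[FrozenSet[str], str]:
    """Split a hotkey string into its modifier set and single trigger key."""
    keys = [part.strip().lower() for part in spec.split("+") if part.strip()]
    if not keys:
        raise ValueError("empty hotkey")
    modifiers = frozenset(k for k in keys if k in MODIFIERS)
    triggers = [k for k in keys if k not in MODIFIERS]
    if len(triggers) != 1:
        raise ValueError(f"expected exactly one non-modifier key in {spec!r}")
    return modifiers, triggers[0]


for spec in ("ctrl+alt+shift+space", "ctrl+shift+r", "f9"):
    modifiers, key = parse_hotkey(spec)
    print(sorted(modifiers), key)
```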

Languages

Common codes: en, es, fr, de, it, pt, ru, ja, zh

🔧 Command Line Options

# Normal push-to-talk mode
python whisper_hotkey.py

# Show current configuration
python whisper_hotkey.py --config

# Simple one-shot mode (no hotkey, just record for X seconds)
python whisper_hotkey.py --simple --duration 10

# Use a different .env file
python whisper_hotkey.py --env /path/to/custom.env
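
For readers who want to script around these flags, the sketch below shows one way they could be declared with argparse. It mirrors only the flags documented above; the defaults and help texts are assumptions, and the real script may define them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; defaults here are assumptions."""
    parser = argparse.ArgumentParser(prog="whisper_hotkey.py")
    parser.add_argument("--config", action="store_true",
                        help="show the current configuration and exit")
    parser.add_argument("--simple", action="store_true",
                        help="one-shot mode: record for a fixed duration")
    parser.add_argument("--duration", type=float, default=10.0,
                        help="seconds to record in --simple mode")
    parser.add_argument("--env", default=".env",
                        help="path to the .env file to load")
    return parser


args = build_parser().parse_args(["--simple", "--duration", "10"])
print(args.simple, args.duration, args.env)
```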

🎯 Tips for Best Results

Model Selection:
- For English: Use Parakeet backend with nvidia/parakeet-tdt-0.6b-v2 for state-of-the-art accuracy
- For multilingual: Use large-v3 with CTranslate2 backend
Chunk Duration: 10 seconds works well. The app automatically handles longer recordings by chunking and stitching.
Speak Naturally: The intelligent stitching algorithm handles sentence boundaries well. Don't worry about pausing between chunks.
Wait for the Pop: The audio notification confirms recording has started. Speak after you hear it.
Clean Release: Release the hotkey cleanly after you finish speaking. The transcription starts immediately.
Parakeet Setup: If using Parakeet, install NeMo first: pip install "nemo_toolkit[asr]"
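
To illustrate the chunking tip, here is a sketch of how a long recording might be cut into overlapping windows using the default CHUNK_DURATION=10 and CHUNK_OVERLAP=2. `chunk_bounds` is a hypothetical helper, and the stitching of the resulting transcripts is not shown because the guide does not describe the algorithm.

```python
from typing import List, Tuple


def chunk_bounds(total_seconds: float, chunk: float = 10.0,
                 overlap: float = 2.0) -> List[Tuple[float, float]]:
    """Return (start, end) times of overlapping chunks covering the recording."""
    if chunk <= overlap:
        raise ValueError("chunk duration must exceed the overlap")
    step = chunk - overlap  # each new chunk starts this much later
    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk, total_seconds)
        bounds.append((start, end))
        if end >= total_seconds:
            break  # the final chunk reached the end of the recording
        start += step
    return bounds


for start, end in chunk_bounds(25.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Multiplying each bound by SAMPLE_RATE gives the corresponding sample indices. The 2-second overlap means words that straddle a boundary appear whole in at least one chunk.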

🔧 Troubleshooting

Hotkey Not Working

Run as Administrator: The keyboard library may need admin privileges for global hotkeys
Try a simpler hotkey: Change to f9 or ctrl+shift+r in .env
Check for conflicts: Another app might be using the same hotkey

No Sound on Recording Start

Ensure files/pop.wav exists
Check Windows sound settings

Transcription Not Copying to Clipboard

Verify pyperclip is installed: pip install pyperclip
Check COPY_TO_CLIPBOARD=true in .env

Audio Issues

Verify microphone index with the command above
Check Windows sound settings → Recording devices
Ensure no other apps are using the microphone

CUDA/GPU Issues

Run python verify_setup.py to check GPU status
Ensure NVIDIA drivers are up to date
Check that PyTorch sees your GPU: python -c "import torch; print(torch.cuda.is_available())"

📊 Performance

With an RTX 4080 and large-v3 model:

10-second chunk: ~1.5s transcription time
Speed: ~6-7x faster than real time (10 s of audio in ~1.5 s)
Memory usage: ~3-4GB VRAM
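
The speed figure follows directly from the chunk timing. A small worked example, using the numbers quoted above (`speedup` and `real_time_factor` are illustrative helper names):

```python
def speedup(audio_seconds: float, processing_seconds: float) -> float:
    """How many times faster than real time the transcription ran."""
    return audio_seconds / processing_seconds


def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Processing time per second of audio (lower is faster)."""
    return processing_seconds / audio_seconds


print(round(speedup(10, 1.5), 2))           # 6.67
print(round(real_time_factor(10, 1.5), 2))  # 0.15
```

To benchmark your own hardware, time a transcription and plug in the measured values.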

🎉 You're Ready!

conda activate whisper
python whisper_hotkey.py

Hold your hotkey, speak, release, paste! 🎤✨
