Universal transcription. Any video, podcast, or audio source — one command.
Works with 1,800+ platforms via yt-dlp. Runs OpenAI Whisper locally for speech-to-text. No API keys. No data leaves your machine.
pip install yt-dlp openai-whisper
# Optional: for video file support (mp4, mkv, avi, etc.)
brew install ffmpeg # macOS
# sudo apt install ffmpeg # Linux

# Any URL
python3 scripts/transcribe.py "https://www.youtube.com/watch?v=..."
# Local file
python3 scripts/transcribe.py /path/to/recording.mp4
# Podcast feed (latest 3 episodes)
python3 scripts/transcribe.py "https://feeds.example.com/podcast.xml" --latest 3
# Twitter/X Space
python3 scripts/transcribe.py "https://x.com/i/spaces/1yxBeMYdqgnJN"
# Batch mode
python3 scripts/transcribe.py file1.mp3 "https://youtube.com/..." file2.mp4
# Higher accuracy
python3 scripts/transcribe.py "URL" --model small
# Just download audio
python3 scripts/transcribe.py "URL" --download-only

| Source | Example |
|---|---|
| YouTube | https://www.youtube.com/watch?v=... |
| Twitter/X | https://x.com/i/spaces/..., tweet videos |
| Twitch | https://www.twitch.tv/videos/... |
| Vimeo | https://vimeo.com/... |
| SoundCloud | https://soundcloud.com/... |
| Podcast RSS | https://feeds.example.com/podcast.xml |
| Local audio | recording.mp3, audio.m4a, interview.wav |
| Local video | meeting.mp4, lecture.mkv, clip.mov |
| 1,800+ more | Anything yt-dlp supports |
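
As a rough illustration of how the source types in the table above could be told apart, here is a minimal sketch. The function name `classify_source`, the extension sets, and the rule that a feed is recognised by a `.xml` or `.rss` suffix are all assumptions made for this example; they are not taken from `scripts/transcribe.py`.

```python
from pathlib import Path
from urllib.parse import urlparse

# Illustrative extension sets, based on the examples in the table above.
AUDIO_EXTS = {".mp3", ".m4a", ".wav"}
VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi"}


def classify_source(source: str) -> str:
    """Return 'rss', 'url', 'audio', or 'video' for a source string."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Assumption: feeds are recognised by a .xml or .rss path suffix.
        suffix = Path(parsed.path).suffix.lower()
        return "rss" if suffix in {".xml", ".rss"} else "url"
    suffix = Path(source).suffix.lower()
    if suffix in AUDIO_EXTS:
        return "audio"
    if suffix in VIDEO_EXTS:
        return "video"
    raise ValueError(f"unsupported source: {source!r}")
```

A real tool would more likely let yt-dlp decide whether it can handle a URL; the suffix check here only keeps the example short.
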

| Flag | Default | Description |
|---|---|---|
| `--model` | `base` | Whisper model: tiny, base, small, medium, large |
| `--language` | `en` | Language code |
| `-o` / `--output` | stdout | Write transcript to file |
| `--text-only` | off | Skip metadata header |
| `--download-only` | off | Download audio only, skip transcription |
| `--keep-audio` | off | Keep audio file after transcription |
| `--latest N` | all | For RSS feeds: transcribe latest N episodes |
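
The flags and defaults above map naturally onto `argparse`. The following is only a sketch of what such a parser might look like; `build_parser` and the positional `sources` argument are hypothetical names, and the actual script may be organised differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flag table (a sketch, not the real script)."""
    parser = argparse.ArgumentParser(prog="transcribe.py")
    # Batch mode: one or more URLs, feeds, or local files.
    parser.add_argument("sources", nargs="+", help="URLs, feeds, or local files")
    parser.add_argument("--model", default="base",
                        choices=["tiny", "base", "small", "medium", "large"])
    parser.add_argument("--language", default="en")
    # None stands for "write to stdout".
    parser.add_argument("-o", "--output", default=None)
    parser.add_argument("--text-only", action="store_true")
    parser.add_argument("--download-only", action="store_true")
    parser.add_argument("--keep-audio", action="store_true")
    # None stands for "all episodes".
    parser.add_argument("--latest", type=int, default=None, metavar="N")
    return parser


args = build_parser().parse_args(
    ["https://feeds.example.com/podcast.xml", "--latest", "3"]
)
```

Here `args.latest` is `3`, `args.model` falls back to `base`, and the boolean flags stay off unless passed.
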

| Model | Parameters | Speed (35 min audio) | Quality |
|---|---|---|---|
| tiny | 39 M | ~5s | Low |
| base | 74 M | ~15s | Good |
| small | 244 M | ~45s | Better |
| medium | 769 M | ~2 min | High |
| large | 1550 M | ~5 min | Best |

All models run on CPU. No GPU required.
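
To get a feel for the trade-off in the table, the sketch below scales the 35-minute timings to other audio lengths and picks the best model that fits a time budget. Linear scaling is an assumption, and both function names are hypothetical; actual speed depends heavily on the machine.

```python
# Seconds to transcribe 35 minutes of audio, copied from the table above.
SECONDS_PER_35_MIN = {"tiny": 5, "base": 15, "small": 45, "medium": 120, "large": 300}


def estimate_seconds(model: str, audio_minutes: float) -> float:
    """Scale the table's 35-minute figure linearly (an assumption)."""
    return SECONDS_PER_35_MIN[model] * audio_minutes / 35.0


def pick_model(audio_minutes: float, budget_seconds: float) -> str:
    """Return the highest-quality model whose estimate fits the budget."""
    for model in ("large", "medium", "small", "base", "tiny"):
        if estimate_seconds(model, audio_minutes) <= budget_seconds:
            return model
    return "tiny"  # nothing fits; fall back to the fastest model
```

For example, `pick_model(35, 60)` selects `small`, since `medium` is estimated at 120 seconds for the same audio.
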
- URL validation — only http(s) URLs accepted
- RSS enclosure URL validation before download
- No shell injection — all subprocess calls use list form
- Symlink protection on output paths
- Predictable path avoidance (PID-based temp filenames)
- No XXE risk (stdlib ElementTree)
- Metadata sanitization (control characters stripped, length capped)
- Timeout enforcement on all subprocess calls
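
A few of the checks listed above can be sketched with the standard library alone. This is an illustration of the ideas, not the script's actual code: the function names, the 200-character cap, and the assumption that feeds list the newest episode first are all made up for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET
from typing import List, Optional
from urllib.parse import urlparse


def is_allowed_url(url: str) -> bool:
    """Accept only http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def latest_enclosures(feed_xml: str, latest: Optional[int] = None) -> List[str]:
    """Return validated enclosure URLs from an RSS document, in feed order."""
    root = ET.fromstring(feed_xml)
    urls = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        url = enclosure.get("url", "") if enclosure is not None else ""
        # Enclosures with any other scheme (file://, ftp://, ...) are skipped.
        if is_allowed_url(url):
            urls.append(url)
    # Assumption: the feed lists the newest episodes first.
    return urls if latest is None else urls[:latest]


def sanitize_metadata(value: str, max_len: int = 200) -> str:
    """Strip control characters (Unicode category Cc) and cap the length."""
    cleaned = "".join(ch for ch in value if unicodedata.category(ch) != "Cc")
    return cleaned[:max_len]
```

Note that category `Cc` also covers tabs and newlines, so a multi-line description collapses into a single line under this sketch.
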

Source (URL / file / RSS)
       │
       ▼
┌─────────────┐
│   yt-dlp    │  Downloads audio from 1,800+ platforms
│ (or ffmpeg) │  Extracts audio from local video files
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Whisper   │  Local speech-to-text (no API)
└──────┬──────┘
       │
       ▼
Clean transcript
+ metadata header
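
The pipeline above could be wired together roughly as follows. This is a sketch under stated assumptions: the helper names, the MP3 and 16 kHz mono WAV targets, and the 600-second timeout are illustrative choices, not details confirmed by the script. The yt-dlp and ffmpeg options and the Whisper calls (`load_model`, `transcribe`) are real, and `whisper` is imported lazily so the command builders work without it installed.

```python
import subprocess
from typing import List


def ytdlp_audio_cmd(url: str, out_template: str) -> List[str]:
    """Build a yt-dlp command that downloads a URL and extracts its audio."""
    return ["yt-dlp", "--no-playlist", "-x", "--audio-format", "mp3",
            "-o", out_template, url]


def ffmpeg_extract_cmd(video_path: str, wav_path: str) -> List[str]:
    """Build an ffmpeg command that drops video and writes 16 kHz mono WAV."""
    return ["ffmpeg", "-nostdin", "-y", "-i", video_path,
            "-vn", "-ac", "1", "-ar", "16000", wav_path]


def run(cmd: List[str], timeout_s: int = 600) -> None:
    """Run a command in list form (no shell) with a timeout."""
    subprocess.run(cmd, check=True, timeout=timeout_s)


def transcribe(audio_path: str, model_name: str = "base", language: str = "en") -> str:
    """Transcribe an audio file locally with Whisper and return the text."""
    import whisper  # from the openai-whisper package; imported lazily

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, language=language)["text"]
```

Passing the command as a list means the URL or file path is never interpreted by a shell, which is the property the security list refers to.
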

- Hermes skill: install via `hermes skills` or place in `~/.hermes/skills/media/transcribe/`
MIT