- Overview
- The Whisper Model
- Backends
- Project Structure
- Installation
- Configuration
- CLI Usage
- Web Interface
- Makefile Reference
- References
- License
whispr is a transcription pipeline built around OpenAI's Whisper model. It supports three interchangeable backends, parallel chunk processing, automatic fallback, and multiple output formats.
Two interfaces ship with the same codebase:
- A CLI (`main.py`) for scripting, automation, and direct use.
- A web UI (`web/`) for browser-based use — upload a file or paste a URL, pick a backend, download the result.
Key capabilities:
| Area | Details |
|---|---|
| Backends | whisper.cpp (GGML subprocess), faster-whisper (CTranslate2), OpenAI Whisper API |
| Input | Local files, YouTube URLs (yt-dlp), direct HTTP URLs |
| Input formats | MP3, MP4, WAV, M4A, MKV, WEBM, OGG, FLAC, AVI, MOV, and any format supported by ffmpeg |
| Output formats | Plain text, JSON (with segments and timestamps), SRT subtitles, WebVTT subtitles |
| Chunking | ffmpeg-based splitting for arbitrarily long files, transparent to the user |
| Parallel processing | ThreadPoolExecutor, configurable worker count |
| Retry / fallback | Exponential backoff per chunk, automatic switch to secondary backend on failure |
| Configuration | JSON file + CLI overrides + ${ENV_VAR} interpolation |
| Web interface | FastAPI + vanilla JS, real-time WebSocket log, per-backend options |
Whisper is an automatic speech recognition (ASR) model published by OpenAI in September 2022 (Radford et al., arXiv:2212.04356). It is a sequence-to-sequence model based on a standard encoder-decoder Transformer architecture, trained end-to-end with weak supervision on a large-scale multilingual dataset.
Raw audio undergoes the following transformations before reaching the model:
- Resampled to 16 kHz mono.
- A log-Mel spectrogram is computed with a 25 ms window, 10 ms stride, and 80 Mel frequency bins.
- The spectrogram is split into 30-second windows of 3000 frames (padded with silence if shorter).
The resulting 80 x 3000 tensor is the input to the encoder.
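The framing arithmetic above can be sketched directly. The constants are the values stated in the text (16 kHz, 10 ms stride, 80 Mel bins, 30 s windows); this is an illustration, not Whisper's actual preprocessing code.

```python
# Preprocessing arithmetic for Whisper's encoder input tensor.
SAMPLE_RATE = 16_000   # Hz, after resampling to mono
STRIDE_MS = 10         # hop between successive spectrogram frames
N_MELS = 80            # Mel frequency bins
WINDOW_SECONDS = 30    # fixed context length, padded if shorter

def spectrogram_shape(duration_s: float = WINDOW_SECONDS) -> tuple[int, int]:
    """Shape (mel_bins, frames) of one padded 30 s window."""
    hop = SAMPLE_RATE * STRIDE_MS // 1000          # 160 samples per frame
    frames = int(duration_s * SAMPLE_RATE) // hop  # 480000 / 160 = 3000
    return (N_MELS, frames)

print(spectrogram_shape())  # (80, 3000)
```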
The encoder consists of a convolutional front-end followed by Transformer blocks:
- Two 1D convolution layers (kernel size 3, GELU activation) downsample the time axis by a factor of 2, reducing the sequence from 3000 to 1500 frames.
- A stack of Transformer encoder blocks with multi-head self-attention and fixed sinusoidal positional embeddings produces a sequence of contextual hidden representations.
Hyperparameters scale with model size (e.g. base: 6 layers, 512 dims, 8 heads — large-v2: 32 layers, 1280 dims, 20 heads).
The decoder generates text tokens autoregressively via cross-attention over the encoder output:
- A multilingual Byte-Pair Encoding (BPE) tokenizer with a vocabulary of ~50,000 tokens.
- Each generation is conditioned on a structured prompt of special tokens:
  - `<|startoftranscript|>` — sequence boundary
  - `<|fr|>`, `<|en|>`, ... — target language (omitted for auto-detect)
  - `<|transcribe|>` or `<|translate|>` — task selector
  - `<|notimestamps|>` or `<|0.00|>` — disables or enables timestamp prediction
When timestamp mode is active, the decoder interleaves timestamp tokens (<|0.00|> to <|30.00|>, 0.02 s resolution) with text tokens, enabling segment-level timing without any external alignment step.
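The 0.02 s quantization can be illustrated with a small helper (hypothetical, for exposition only):

```python
# Hypothetical helper showing the timestamp-token convention described above:
# offsets in 0.00-30.00 s, quantized to 0.02 s steps.
TIME_RESOLUTION = 0.02  # seconds per timestamp token

def timestamp_token(seconds: float) -> str:
    """Render a time offset as the nearest Whisper timestamp token."""
    steps = round(seconds / TIME_RESOLUTION)
    return f"<|{steps * TIME_RESOLUTION:.2f}|>"

print(timestamp_token(1.283))  # <|1.28|>
print(timestamp_token(30.0))   # <|30.00|>
```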
When no language is specified, Whisper runs the encoder on the first 30 seconds, then takes a softmax over language tokens at the first decoder step to identify the language. This adds ~1-2 seconds of overhead and works reliably for the 99 languages present in the training data.
Whisper was trained on 680,000 hours of multilingual audio paired with transcripts collected from the web. Training uses standard cross-entropy over token sequences. The dataset was not manually curated — transcripts were obtained automatically (weak supervision), which accounts for the model's breadth as well as its sensitivity to low-resource languages and strong accents.
| Model | Parameters | Layers | Embedding dim | VRAM | Relative speed (CPU) |
|---|---|---|---|---|---|
| `tiny` | 39 M | 4 | 384 | ~1 GB | ~32x |
| `base` | 74 M | 6 | 512 | ~1 GB | ~16x |
| `small` | 244 M | 12 | 768 | ~2 GB | ~6x |
| `medium` | 769 M | 24 | 1024 | ~5 GB | ~2x |
| `large-v2` | 1550 M | 32 | 1280 | ~10 GB | 1x |
Speed is relative to large-v2 on CPU. All sizes share the same architecture and are interchangeable across backends.
The three backends execute Whisper inference through different runtimes, model formats, and compute environments. For the same model size, transcription quality is essentially identical across backends (quantization settings can introduce minor differences).
The API backend delegates inference to OpenAI's servers. Audio chunks are sent over HTTPS and the transcript is returned as JSON. No local compute or model download is required.
Request format:
POST https://api.openai.com/v1/audio/transcriptions
Authorization: Bearer <api_key>
Content-Type: multipart/form-data
file=<audio_chunk>
model=whisper-1
language=fr # optional
response_format=verbose_json
verbose_json returns segments with start/end timestamps, used to produce srt and vtt outputs.
Constraints:
- 25 MB per request — the chunking pipeline splits audio with ffmpeg before sending, so any file size is handled transparently.
- Rate limits — `429 Too Many Requests` responses are handled with exponential backoff.
- The server always runs `whisper-1` (equivalent to `large`). No model size selection is available through the API.
- Cost: ~$0.006/min of audio (as of 2024).
faster-whisper runs Whisper locally using CTranslate2, a C++ inference engine originally developed for Neural Machine Translation by the OpenNMT team, extended to support encoder-decoder Transformers.
Model format:
Original PyTorch weights are converted to CTranslate2's serialization format using ct2-whisper-converter. Converted models are hosted on HuggingFace (e.g. Systran/faster-whisper-base) and downloaded automatically on first use, cached at ~/.cache/huggingface/hub/.
Quantization:
| `compute_type` | Precision | Use case |
|---|---|---|
| `int8` | 8-bit integer | CPU — best speed/memory tradeoff |
| `float16` | 16-bit float | GPU (CUDA) — best throughput |
| `float32` | 32-bit float | Reference, no compression |
int8 on CPU reduces memory bandwidth by ~4x vs float32 with negligible accuracy loss on the Whisper architecture.
Inference:
The Python API's WhisperModel.transcribe() returns a generator of Segment objects (start, end, text, and optionally per-word timestamps derived from cross-attention weights). The pipeline consumes this generator to build the final TranscriptionResult.
GPU:
With faster_whisper_device: cuda, CTranslate2 dispatches matrix multiplications to the GPU via cuBLAS. float16 is recommended on GPU.
whisper.cpp is a standalone C++ reimplementation of Whisper by Georgi Gerganov, built on GGML — a minimal C tensor library with no dependencies beyond the C standard library.
Model format:
GGML stores weights as flat tensors in a custom binary .bin format, with a header describing the architecture. Quantization is applied statically at conversion time:
| Quantization | Bits/weight | Notes |
|---|---|---|
| `f32` | 32 | Reference, no compression |
| `f16` | 16 | Default for distributed `.bin` models |
| `q8_0` | 8 | 8-bit symmetric per-block |
| `q5_0` / `q5_1` | 5 | 5-bit quantization |
| `q4_0` / `q4_1` | 4 | Lowest memory footprint |
Quantized variants can be produced locally with the quantize tool bundled with whisper.cpp.
CPU optimizations: GGML selects SIMD intrinsics at compile time: AVX, AVX2, AVX-512 on x86-64, NEON on ARM (Apple Silicon, Raspberry Pi). This gives near-metal CPU throughput without any GPU requirement.
Input: whisper.cpp requires 16 kHz mono PCM WAV. The pipeline converts any input format to this specification automatically via ffmpeg.
Output format:
With --output-json, the binary writes a JSON file:
{
"transcription": [
{
"timestamps": { "from": "00:00:01,280", "to": "00:00:03,760" },
"text": "Bonjour tout le monde."
}
]
}

Timestamps follow the SRT convention with a comma as decimal separator (HH:MM:SS,mmm).
Invocation: The pipeline calls the binary as a subprocess and reads the output JSON from disk:
whisper-cli -m <model.bin> -f <audio.wav> --language fr --output-json -of <output_base>
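A sketch of that invocation from Python, with the SRT-style timestamps parsed back to seconds. The flags match the command above; the helper names are illustrative, not whispr's actual code.

```python
import json
import subprocess

def srt_time_to_seconds(ts: str) -> float:
    """Parse 'HH:MM:SS,mmm' (comma decimal separator) into seconds."""
    hms, ms = ts.split(",")
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s + int(ms) / 1000

def run_whisper_cpp(binary: str, model: str, wav: str, out_base: str,
                    language: str = "fr") -> list[tuple[float, float, str]]:
    """Run the binary, then read <out_base>.json written by --output-json."""
    subprocess.run(
        [binary, "-m", model, "-f", wav,
         "--language", language, "--output-json", "-of", out_base],
        check=True,
    )
    with open(out_base + ".json") as f:
        doc = json.load(f)
    return [(srt_time_to_seconds(seg["timestamps"]["from"]),
             srt_time_to_seconds(seg["timestamps"]["to"]),
             seg["text"])
            for seg in doc["transcription"]]
```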
| whisper_cpp | faster_whisper | openai | |
|---|---|---|---|
| Inference engine | GGML (C++) | CTranslate2 (C++) | OpenAI server |
| Model format | GGML `.bin` | CTranslate2 (HuggingFace) | Server-side |
| Quantization | Static (at conversion) | Dynamic (at load time) | N/A |
| Model auto-download | No | Yes (HuggingFace, first run) | N/A |
| Offline | Yes | Yes | No |
| GPU support | No | Yes (CUDA) | N/A |
| RAM (base model) | ~200 MB | ~500 MB | None |
| Speed (CPU) | Fastest | Fast | Network-bound |
| Cost | Free | Free | ~$0.006/min |
| Privacy | Full (local) | Full (local) | Audio sent to OpenAI |
| Setup | High (compile + model) | Minimal | Minimal (API key) |
| whisper_cpp | faster_whisper | openai | |
|---|---|---|---|
| Best for | Low RAM machines, offline/embedded, maximum CPU speed | General use, GPU, easy setup | No local compute, one-off jobs |
| Advantages | Lowest footprint, SIMD-optimised, no Python at runtime | Trivial install, CUDA, int8, word timestamps, active development | Zero local setup, no model management |
| Disadvantages | Requires compilation, WAV 16 kHz only, no GPU, manual model download | Slower than whisper.cpp on CPU, higher RAM | Internet required, paid, audio leaves your machine, no model choice |
Recommended: faster_whisper for most use cases.
whispr/
├── main.py # CLI entry point (argparse)
├── config.example.json # Annotated configuration reference
├── requirements.txt # Core dependencies (optional deps commented)
├── install.sh # Interactive setup script
├── Makefile # Common tasks
│
├── transcriber/
│ ├── config.py # Config dataclass, JSON loading, env interpolation
│ ├── logger.py # Structured logging (file + stderr)
│ ├── backends/
│ │ ├── base.py # TranscriptionBackend ABC + TranscriptionResult
│ │ ├── whisper_cpp.py # subprocess to whisper.cpp binary
│ │ ├── faster_whisper.py # faster-whisper (optional import)
│ │ └── openai_api.py # OpenAI Whisper API with rate-limit handling
│ ├── processors/
│ │ ├── video.py # Source resolution: local / YouTube / HTTP
│ │ └── audio.py # WAV conversion, duration probing
│ ├── managers/
│ │ └── transcription.py # Orchestrator: chunks, parallel, retry, merge
│ └── formatters/
│ └── output.py # txt, json, srt, vtt writers
│
└── web/
├── app.py # FastAPI application
└── static/
├── index.html
├── style.css
└── app.js
System prerequisites: Python 3.9+, ffmpeg
ffmpeg is required at the system level for audio extraction, format conversion, and chunk splitting. Install it with your package manager if not already present (apt-get install ffmpeg, brew install ffmpeg, dnf install ffmpeg). install.sh checks for it and installs it automatically if absent.
git clone https://github.com/franckferman/whispr.git
cd whispr
bash install.sh

The installer asks two questions:
1. Interface:
[1] CLI only
[2] Web UI (adds fastapi, uvicorn, python-multipart)
[3] Both
2. Backend:
[1] faster-whisper (local, no API key, recommended)
[2] OpenAI API (cloud, requires OPENAI_API_KEY)
[3] whisper.cpp (local C++ binary, requires compilation)
[4] All
[5] Skip
What the installer does:
- Verifies Python 3.9+ and ffmpeg (installs ffmpeg if absent)
- Creates a virtual environment in `.venv/`
- Installs core dependencies (`ffmpeg-python`, `tqdm`, `requests`, `yt-dlp`, `pydub`)
- Installs web UI dependencies if selected
- Installs the chosen backend(s)
- For whisper.cpp: clones the repository, compiles with `make -j$(nproc)`, prompts for model download, and writes the detected binary and model paths into `config.json`
- Copies `config.example.json` to `config.json` if absent
After installation:
source .venv/bin/activate

Manual installation:

git clone https://github.com/franckferman/whispr.git
cd whispr
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# faster-whisper
pip install faster-whisper
# OpenAI
pip install openai
# Web UI
pip install fastapi "uvicorn[standard]" python-multipart

For whisper.cpp:

make whisper-cpp-setup                       # clone, compile, download base model
make whisper-cpp-setup WHISPER_MODEL=small   # or medium / large-v2

After compilation the binary is at ./whisper.cpp/build/bin/whisper-cli. Set this path in config.json or pass it via --whisper-binary.
make check

Validates all Python imports and verifies the CLI entry point responds correctly.
Settings are read from config.json (if present). Any value can be overridden with a CLI flag. CLI flags always take precedence over the file.
cp config.example.json config.json

String values support ${ENV_VAR} interpolation:

"openai_api_key": "${OPENAI_API_KEY}"

| Field | Default | Description |
|---|---|---|
| `backend` | `faster_whisper` | Primary backend: `faster_whisper`, `whisper_cpp`, `openai` |
| `fallback_backend` | `null` | Secondary backend used if primary exhausts all retries |
| `faster_whisper_model` | `base` | Model size: `tiny`, `base`, `small`, `medium`, `large-v2` |
| `faster_whisper_device` | `cpu` | Compute device: `cpu` or `cuda` |
| `faster_whisper_compute_type` | `int8` | Quantization: `int8`, `float16`, `float32` |
| `whisper_cpp_binary` | `whisper` | Path to the compiled whisper.cpp binary |
| `whisper_cpp_model` | `models/ggml-base.bin` | Path to the GGML `.bin` model file |
| `whisper_cpp_extra_args` | `[]` | Additional CLI arguments forwarded verbatim to the binary |
| `openai_api_key` | `${OPENAI_API_KEY}` | OpenAI API key |
| `openai_model` | `whisper-1` | OpenAI model identifier |
| `language` | `null` | ISO 639-1 code (`fr`, `en`, ...) or `null` for auto-detect |
| `chunk_duration_seconds` | `600` | Duration of each audio chunk in seconds |
| `workers` | `2` | Number of parallel transcription threads |
| `temp_dir` | `null` | Temporary directory (system default if `null`) |
| `output_formats` | `["txt"]` | Output formats: `txt`, `json`, `srt`, `vtt` |
| `output_dir` | `.` | Directory where output files are written |
| `output_prefix` | `transcript` | Filename prefix for output files |
| `max_retries` | `3` | Retry attempts per chunk before declaring failure |
| `retry_base_delay` | `1.0` | Base delay in seconds for exponential backoff |
| `retry_max_delay` | `30.0` | Maximum backoff delay cap in seconds |
| `dry_run` | `false` | Print the execution plan without running anything |
| `debug` | `false` | Enable DEBUG-level logging |
| `log_file` | `null` | Write logs to a file (`auto` generates a timestamped filename) |
| `ytdlp_format` | `bestaudio/best` | yt-dlp format selector |
| `ytdlp_output_template` | `%(title)s.%(ext)s` | yt-dlp output filename template |
python main.py [--file PATH | --url URL] [options]
If --backend is not specified, the value from config.json is used (default: faster_whisper). The CLI does not perform automatic backend detection — set your preferred backend in config.json once and omit the flag on subsequent runs.
source .venv/bin/activate
# Local file, auto-detect language
python main.py --file video.mp4 --backend faster_whisper
# Explicit language (skips the 30 s language detection pass)
python main.py --file video.mp4 --backend faster_whisper --language fr
# YouTube URL
python main.py --url "https://youtube.com/watch?v=..." --backend faster_whisper --language fr
# Multiple output formats
python main.py --file video.mp4 --backend faster_whisper --format txt,srt,vtt,json
# whisper.cpp
python main.py --file video.mp4 --backend whisper_cpp \
--whisper-binary ./whisper.cpp/build/bin/whisper-cli \
--whisper-model ./whisper.cpp/models/ggml-base.bin \
--language fr
# OpenAI API
python main.py --file audio.mp3 --backend openai --openai-key sk-...
# Primary + fallback: re-run with OpenAI if faster-whisper exhausts retries
python main.py --file video.mp4 --backend faster_whisper --fallback-backend openai
# Long file: 5-minute chunks, 4 workers
python main.py --file film.mp4 --backend faster_whisper \
--chunk-duration 300 --workers 4 --language fr --format txt,srt
# Dry run
python main.py --dry-run --file video.mp4 --backend faster_whisper
# Load config from file
python main.py --config config.json

| Flag | Default | Description |
|---|---|---|
| `--file`, `-f` | | Local audio/video file |
| `--url`, `-u` | | YouTube URL or direct HTTP link |
| `--config`, `-c` | | JSON config file (CLI flags override) |
| `--backend`, `-b` | `faster_whisper` | `whisper_cpp` \| `faster_whisper` \| `openai` |
| `--fallback-backend` | | Secondary backend if primary fails all retries |
| `--whisper-binary` | `whisper` | Path to the whisper.cpp binary |
| `--whisper-model` | | Path to the GGML `.bin` model file |
| `--fw-model` | `base` | faster-whisper model size |
| `--fw-device` | `cpu` | faster-whisper device: `cpu` \| `cuda` |
| `--openai-key` | `$OPENAI_API_KEY` | OpenAI API key |
| `--language`, `-l` | auto-detect | ISO 639-1 code: `fr`, `en`, `es`, ... |
| `--chunk-duration` | `600` | Chunk size in seconds |
| `--workers`, `-w` | `2` | Parallel transcription threads |
| `--format`, `-F` | `txt` | Output formats, comma-separated: `txt,json,srt,vtt` |
| `--output-dir`, `-o` | `.` | Output directory |
| `--output-prefix` | `transcript` | Output filename prefix |
| `--max-retries` | `3` | Retry attempts per chunk on failure |
| `--dry-run` | | Print plan without executing |
| `--debug` | | Enable DEBUG logging |
| `--log-file` | | Write logs to file (`auto` for timestamped name) |
Models are downloaded on first use and cached at ~/.cache/huggingface/hub/.
| Model | Size | Speed (CPU) | Quality | Use case |
|---|---|---|---|---|
| `tiny` | ~75 MB | Fastest | Basic | Quick drafts, noisy audio |
| `base` | ~150 MB | Fast | Good | Default, everyday use |
| `small` | ~500 MB | Moderate | Better | Interviews, mixed accents |
| `medium` | ~1.5 GB | Slow | High | Difficult content |
| `large-v2` | ~3 GB | Slowest | Best | Maximum accuracy |
python main.py --file video.mp4 --backend faster_whisper --fw-device cuda

In config.json:

"faster_whisper_device": "cuda",
"faster_whisper_compute_type": "float16"

Use float16 on GPU, int8 on CPU.
All four formats are produced by the same formatter, independently of the backend.
| Format | Description |
|---|---|
| `txt` | Plain text, no timestamps |
| `json` | Full output: segments, timestamps, language, metadata |
| `srt` | SubRip subtitles — VLC, ffmpeg, Premiere |
| `vtt` | WebVTT — HTML5 `<video>` element |
Long files are split into fixed-duration chunks before transcription. Each chunk is processed independently and results are merged with corrected absolute timestamps.
The default is 600 seconds (10 minutes). Reduce for very long files to lower peak memory, increase to reduce per-chunk overhead:
python main.py --file film.mp4 --chunk-duration 300 --workers 4

The OpenAI API's 25 MB limit per request is handled transparently by the chunking layer.
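The chunk planning can be sketched as follows. The ffmpeg flags are a plausible invocation for 16 kHz mono WAV extraction (assumed, not whispr's exact command), and the helper names are illustrative.

```python
def chunk_plan(duration_s: float, chunk_s: int = 600) -> list[tuple[float, float]]:
    """(start, length) pairs covering the whole file; last chunk truncated."""
    duration_s = float(duration_s)
    return [(float(s), min(float(chunk_s), duration_s - s))
            for s in range(0, int(duration_s), chunk_s)]

def ffmpeg_cmd(src: str, start: float, length: float, dst: str) -> list[str]:
    """Extract one chunk, converted to 16 kHz mono WAV."""
    return ["ffmpeg", "-y", "-ss", str(start), "-t", str(length),
            "-i", src, "-ar", "16000", "-ac", "1", dst]

print(chunk_plan(1500, 600))  # [(0.0, 600.0), (600.0, 600.0), (1200.0, 300.0)]
```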
Each chunk is retried independently with exponential backoff:
attempt 1 -> wait 1 s -> attempt 2 -> wait 2 s -> attempt 3 -> ...
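That schedule corresponds to a doubling delay starting at retry_base_delay and capped at retry_max_delay; a minimal sketch (function names are illustrative, not whispr's actual code):

```python
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay before retry `attempt` (1-indexed): base * 2**(attempt - 1), capped."""
    return min(base * 2 ** (attempt - 1), cap)

def with_retries(fn, max_retries: int = 3):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(backoff_delay(attempt))
```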
If all retries fail and a fallback backend is configured, the entire job restarts from scratch on the fallback:
python main.py --file video.mp4 --backend faster_whisper \
--fallback-backend openai --max-retries 5

Same transcription pipeline as the CLI, accessible from a browser.
make web
# or manually
source .venv/bin/activate
pip install fastapi "uvicorn[standard]" python-multipart
python -m uvicorn web.app:app --host 0.0.0.0 --port 8000

Open http://localhost:8000.
- Backend detection at startup — green dot if ready, grey with install hint if unavailable
- Auto-selects the best available backend: `faster_whisper` > `whisper_cpp` > `openai`
- Per-backend options shown dynamically:
- faster-whisper: model size dropdown (tiny / base / small / medium / large-v2)
- whisper.cpp: binary path and model path, auto-filled from detected installation
- OpenAI: API key input
- File upload (drag and drop) or URL (YouTube or direct link)
- Language, output formats (TXT / SRT / VTT / JSON), and worker count
- Real-time transcription log via WebSocket
- Download buttons per format on completion
make install # Full setup via install.sh
make deps # Core Python deps only
make deps-faster-whisper # + faster-whisper
make deps-openai # + openai
make deps-all # All Python deps
make whisper-cpp-build # Clone and compile whisper.cpp
make whisper-cpp-model # Download model (WHISPER_MODEL=base|tiny|small|medium|large-v2)
make whisper-cpp-setup # Build + model in one step
make run FILE=video.mp4 # Transcribe (faster_whisper, fr by default)
make run URL=https://... # Transcribe a URL
make run FILE=video.mp4 BACKEND=whisper_cpp # Specific backend
make run FILE=video.mp4 LANGUAGE=en WORKERS=4 # Override defaults
make run-whisper-cpp FILE=video.mp4
make run-faster-whisper FILE=video.mp4
make run-openai FILE=video.mp4
make run-srt FILE=video.mp4 # SRT output only
make run-all-formats FILE=video.mp4 # txt + json + srt + vtt
make dry-run FILE=video.mp4 # Preview without executing
make run-config # Use config.json
make web # Start web UI at http://localhost:8000
make web-install # Install web deps only
make check # Validate imports + CLI
make clean # Remove output files and __pycache__
make clean-all          # Remove venv too

[1] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356. https://arxiv.org/abs/2212.04356
[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03762
[3] Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). https://arxiv.org/abs/1508.07909
[4] Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M. (2017). OpenNMT: Open-Source Toolkit for Neural Machine Translation. Proceedings of ACL 2017, System Demonstrations. https://arxiv.org/abs/1701.02810
[5] Gerganov, G. (2022). whisper.cpp: Port of OpenAI's Whisper model in C/C++. GitHub repository. https://github.com/ggerganov/whisper.cpp
[6] Systran. (2023). faster-whisper: Faster Whisper transcription with CTranslate2. GitHub repository. https://github.com/SYSTRAN/faster-whisper
[7] Klein, G., Crego, J., & Senellart, J. (2020). Efficient and High-Quality Neural Machine Translation with OpenNMT. Proceedings of the 4th Workshop on Neural Generation and Translation. — CTranslate2 inference engine underlying faster-whisper. https://github.com/OpenNMT/CTranslate2
[8] OpenAI. (2023). Whisper API Reference. OpenAI Platform Documentation. https://platform.openai.com/docs/api-reference/audio
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
Any use, modification, or distribution — including over a network — requires the full source code to remain open under the same license.