
whispr

A modular Python CLI and web UI for transcribing audio and video with multi-backend support.





Overview

whispr is a transcription pipeline built around OpenAI's Whisper model. It supports three interchangeable backends, parallel chunk processing, automatic fallback, and multiple output formats.

Two interfaces ship with the same codebase:

  • A CLI (main.py) for scripting, automation, and direct use.
  • A web UI (web/) for browser-based use — upload a file or paste a URL, pick a backend, download the result.

Key capabilities:

| Area | Details |
| --- | --- |
| Backends | whisper.cpp (GGML subprocess), faster-whisper (CTranslate2), OpenAI Whisper API |
| Input | Local files, YouTube URLs (yt-dlp), direct HTTP URLs |
| Input formats | MP3, MP4, WAV, M4A, MKV, WEBM, OGG, FLAC, AVI, MOV, and any format supported by ffmpeg |
| Output formats | Plain text, JSON (with segments and timestamps), SRT subtitles, WebVTT subtitles |
| Chunking | ffmpeg-based splitting for arbitrarily long files, transparent to the user |
| Parallel processing | ThreadPoolExecutor, configurable worker count |
| Retry / fallback | Exponential backoff per chunk, automatic switch to secondary backend on failure |
| Configuration | JSON file + CLI overrides + ${ENV_VAR} interpolation |
| Web interface | FastAPI + vanilla JS, real-time WebSocket log, per-backend options |

The Whisper Model

Whisper is an automatic speech recognition (ASR) model published by OpenAI in September 2022 (Radford et al., arXiv:2212.04356). It is a sequence-to-sequence model based on a standard encoder-decoder Transformer architecture, trained end-to-end with weak supervision on a large-scale multilingual dataset.

Audio preprocessing

Raw audio undergoes the following transformations before reaching the model:

  1. Resampled to 16 kHz mono.
  2. A log-Mel spectrogram is computed with a 25 ms window, 10 ms stride, and 80 Mel frequency bins.
  3. The spectrogram is split into 30-second windows of 3000 frames (padded with silence if shorter).

The resulting 80 x 3000 tensor is the input to the encoder.
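The shape arithmetic above can be checked directly (an illustrative sketch, not project code):

```python
# Frame arithmetic for Whisper's input spectrogram (illustrative).
SAMPLE_RATE = 16_000   # Hz, after resampling to mono
HOP_SECONDS = 0.010    # 10 ms stride between spectrogram frames
WINDOW_SECONDS = 30    # one encoder window
N_MELS = 80            # Mel frequency bins

frames_per_window = int(WINDOW_SECONDS / HOP_SECONDS)  # 3000 frames
encoder_input_shape = (N_MELS, frames_per_window)

print(encoder_input_shape)  # (80, 3000)
```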

Encoder

The encoder consists of a convolutional front-end followed by Transformer blocks:

  • Two 1D convolution layers (kernel size 3, GELU activation) downsample the time axis by a factor of 2, reducing the sequence from 3000 to 1500 frames.
  • A stack of Transformer encoder blocks with multi-head self-attention and fixed sinusoidal positional embeddings produces a sequence of contextual hidden representations (the decoder, by contrast, uses learned positional embeddings).

Hyperparameters scale with model size (e.g. base: 6 layers, 512 dims, 8 heads — large-v2: 32 layers, 1280 dims, 20 heads).

Decoder

The decoder generates text tokens autoregressively via cross-attention over the encoder output:

  • A multilingual Byte-Pair Encoding (BPE) tokenizer with a vocabulary of ~50,000 tokens.
  • Each generation is conditioned on a structured prompt of special tokens:
    • <|startoftranscript|> — sequence boundary
    • <|fr|>, <|en|>, ... — target language (omitted for auto-detect)
    • <|transcribe|> or <|translate|> — task selector
    • <|notimestamps|> or <|0.00|> — disables or enables timestamp prediction
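The prompt structure above can be sketched as a small helper. This is illustrative only: the real model consumes BPE token IDs, not the string forms shown here.

```python
# Assembling the decoder's special-token prompt (illustrative sketch;
# the real prompt is a sequence of token IDs from Whisper's BPE tokenizer).
def build_prompt(language=None, task="transcribe", timestamps=True):
    tokens = ["<|startoftranscript|>"]
    if language is not None:               # omitted for auto-detect
        tokens.append(f"<|{language}|>")
    tokens.append(f"<|{task}|>")           # "transcribe" or "translate"
    if not timestamps:
        tokens.append("<|notimestamps|>")  # plain text, no timing tokens
    return tokens

print(build_prompt(language="fr"))
# ['<|startoftranscript|>', '<|fr|>', '<|transcribe|>']
```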

Timestamp prediction

When timestamp mode is active, the decoder interleaves timestamp tokens (<|0.00|> to <|30.00|>, 0.02 s resolution) with text tokens, enabling segment-level timing without any external alignment step.

Language detection

When no language is specified, Whisper runs the encoder on the first 30 seconds, then takes a softmax over language tokens at the first decoder step to identify the language. This adds ~1-2 seconds of overhead and works reliably for the 99 languages present in the training data.

Training

Whisper was trained on 680,000 hours of multilingual audio paired with transcripts collected from the web. Training uses standard cross-entropy over token sequences. The dataset was not manually curated — transcripts were obtained automatically (weak supervision), which explains both the model's breadth and its weaker performance on low-resource languages and strong accents.

Model sizes

| Model | Parameters | Layers | Embedding dim | VRAM | Relative speed (CPU) |
| --- | --- | --- | --- | --- | --- |
| tiny | 39 M | 4 | 384 | ~1 GB | ~32x |
| base | 74 M | 6 | 512 | ~1 GB | ~16x |
| small | 244 M | 12 | 768 | ~2 GB | ~6x |
| medium | 769 M | 24 | 1024 | ~5 GB | ~2x |
| large-v2 | 1550 M | 32 | 1280 | ~10 GB | 1x |

Speed is relative to large-v2 on CPU. All sizes share the same architecture and are interchangeable across backends.


Backends

The three backends execute Whisper inference through different runtimes, model formats, and compute environments. For a given model size, transcription quality is essentially identical across backends; quantization can introduce minor differences.

OpenAI API (openai)

The API backend delegates inference to OpenAI's servers. Audio chunks are sent over HTTPS and the transcript is returned as JSON. No local compute or model download is required.

Request format:

POST https://api.openai.com/v1/audio/transcriptions
Authorization: Bearer <api_key>
Content-Type: multipart/form-data

file=<audio_chunk>
model=whisper-1
language=fr            # optional
response_format=verbose_json

verbose_json returns segments with start/end timestamps, used to produce srt and vtt outputs.
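The conversion from verbose_json segments to SRT blocks can be sketched as follows (a minimal illustration; field names match the verbose_json response, where each segment carries "start", "end", and "text"):

```python
# Rendering verbose_json segments as SRT blocks (illustrative sketch).
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT convention HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_timestamp(seg['start'])} --> "
                      f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

print(to_srt([{"start": 1.28, "end": 3.76, "text": "Bonjour tout le monde."}]))
```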

Constraints:

  • 25 MB per request — the chunking pipeline splits audio with ffmpeg before sending, so any file size is handled transparently.
  • Rate limits: 429 Too Many Requests responses are handled with exponential backoff.
  • The server always runs whisper-1 (based on the large-v2 model). No model size selection is available through the API.
  • Cost: ~$0.006/min of audio (as of 2024).

faster-whisper (faster_whisper)

faster-whisper runs Whisper locally using CTranslate2, a C++ inference engine originally developed for Neural Machine Translation by the OpenNMT team, extended to support encoder-decoder Transformers.

Model format: Original PyTorch weights are converted to CTranslate2's serialization format using the ct2-transformers-converter tool. Converted models are hosted on HuggingFace (e.g. Systran/faster-whisper-base) and downloaded automatically on first use, cached at ~/.cache/huggingface/hub/.

Quantization:

| compute_type | Precision | Use case |
| --- | --- | --- |
| int8 | 8-bit integer | CPU — best speed/memory tradeoff |
| float16 | 16-bit float | GPU (CUDA) — best throughput |
| float32 | 32-bit float | Reference, no compression |

int8 on CPU reduces memory bandwidth by ~4x vs float32 with negligible accuracy loss on the Whisper architecture.

Inference: The Python API's WhisperModel.transcribe() returns a generator of Segment objects (start, end, text, and optionally per-word timestamps derived from cross-attention weights). The pipeline consumes this generator to build the final TranscriptionResult.

GPU: With faster_whisper_device: cuda, CTranslate2 dispatches matrix multiplications to the GPU via cuBLAS. float16 is recommended on GPU.

whisper.cpp (whisper_cpp)

whisper.cpp is a standalone C++ reimplementation of Whisper by Georgi Gerganov, built on GGML — a minimal C tensor library with no dependencies beyond the C standard library.

Model format: GGML stores weights as flat tensors in a custom binary .bin format, with a header describing the architecture. Quantization is applied statically at conversion time:

| Quantization | Bits/weight | Notes |
| --- | --- | --- |
| f32 | 32 | Reference, no compression |
| f16 | 16 | Default for distributed .bin models |
| q8_0 | 8 | 8-bit symmetric per-block |
| q5_0 / q5_1 | 5 | 5-bit quantization |
| q4_0 / q4_1 | 4 | Lowest memory footprint |

Quantized variants can be produced locally with the quantize tool bundled with whisper.cpp.

CPU optimizations: GGML selects SIMD intrinsics at compile time: AVX, AVX2, AVX-512 on x86-64, NEON on ARM (Apple Silicon, Raspberry Pi). This gives near-metal CPU throughput without any GPU requirement.

Input: whisper.cpp requires 16 kHz mono PCM WAV. The pipeline converts any input format to this specification automatically via ffmpeg.

Output format: With --output-json, the binary writes a JSON file:

{
  "transcription": [
    {
      "timestamps": { "from": "00:00:01,280", "to": "00:00:03,760" },
      "text": "Bonjour tout le monde."
    }
  ]
}

Timestamps follow the SRT convention with a comma as decimal separator (HH:MM:SS,mmm).
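Parsing these timestamps back into seconds — as any consumer of the JSON must do before merging chunk results — can be sketched like this (illustrative, not the project's parser):

```python
# Parsing whisper.cpp's SRT-style "HH:MM:SS,mmm" timestamps into seconds
# (illustrative sketch of what a consumer of the output JSON would do).
def parse_ggml_timestamp(ts: str) -> float:
    hms, ms = ts.split(",")                     # comma is the decimal separator
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s + int(ms) / 1000

print(parse_ggml_timestamp("00:00:01,280"))
print(parse_ggml_timestamp("00:01:30,500"))
```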

Invocation: The pipeline calls the binary as a subprocess and reads the output JSON from disk:

whisper-cli -m <model.bin> -f <audio.wav> --language fr --output-json -of <output_base>

Comparison

| | whisper_cpp | faster_whisper | openai |
| --- | --- | --- | --- |
| Inference engine | GGML (C++) | CTranslate2 (C++) | OpenAI server |
| Model format | GGML .bin | CTranslate2 (HuggingFace) | Server-side |
| Quantization | Static (at conversion) | Dynamic (at load time) | N/A |
| Model auto-download | No | Yes (HuggingFace, first run) | N/A |
| Offline | Yes | Yes | No |
| GPU support | No | Yes (CUDA) | N/A |
| RAM (base model) | ~200 MB | ~500 MB | None |
| Speed (CPU) | Fastest | Fast | Network-bound |
| Cost | Free | Free | ~$0.006/min |
| Privacy | Full (local) | Full (local) | Audio sent to OpenAI |
| Setup | High (compile + model) | Minimal | Minimal (API key) |

| | whisper_cpp | faster_whisper | openai |
| --- | --- | --- | --- |
| Best for | Low-RAM machines, offline/embedded, maximum CPU speed | General use, GPU, easy setup | No local compute, one-off jobs |
| Advantages | Lowest footprint, SIMD-optimised, no Python at runtime | Trivial install, CUDA, int8, word timestamps, active development | Zero local setup, no model management |
| Disadvantages | Requires compilation, 16 kHz WAV only, no GPU, manual model download | Slower than whisper.cpp on CPU, higher RAM | Internet required, paid, audio leaves your machine, no model choice |

Recommended: faster_whisper for most use cases.


Project Structure

whispr/
├── main.py                        # CLI entry point (argparse)
├── config.example.json            # Annotated configuration reference
├── requirements.txt               # Core dependencies (optional deps commented)
├── install.sh                     # Interactive setup script
├── Makefile                       # Common tasks
│
├── transcriber/
│   ├── config.py                  # Config dataclass, JSON loading, env interpolation
│   ├── logger.py                  # Structured logging (file + stderr)
│   ├── backends/
│   │   ├── base.py                # TranscriptionBackend ABC + TranscriptionResult
│   │   ├── whisper_cpp.py         # subprocess to whisper.cpp binary
│   │   ├── faster_whisper.py      # faster-whisper (optional import)
│   │   └── openai_api.py          # OpenAI Whisper API with rate-limit handling
│   ├── processors/
│   │   ├── video.py               # Source resolution: local / YouTube / HTTP
│   │   └── audio.py               # WAV conversion, duration probing
│   ├── managers/
│   │   └── transcription.py       # Orchestrator: chunks, parallel, retry, merge
│   └── formatters/
│       └── output.py              # txt, json, srt, vtt writers
│
└── web/
    ├── app.py                     # FastAPI application
    └── static/
        ├── index.html
        ├── style.css
        └── app.js

Installation

System prerequisites: Python 3.9+, ffmpeg

ffmpeg is required at the system level for audio extraction, format conversion, and chunk splitting. Install it with your package manager if not already present (apt-get install ffmpeg, brew install ffmpeg, dnf install ffmpeg). install.sh checks for it and installs it automatically if absent.

Interactive installer (recommended)

git clone https://github.com/franckferman/whispr.git
cd whispr
bash install.sh

The installer asks two questions:

1. Interface:

[1] CLI only
[2] Web UI  (adds fastapi, uvicorn, python-multipart)
[3] Both

2. Backend:

[1] faster-whisper  (local, no API key, recommended)
[2] OpenAI API      (cloud, requires OPENAI_API_KEY)
[3] whisper.cpp     (local C++ binary, requires compilation)
[4] All
[5] Skip

What the installer does:

  1. Verifies Python 3.9+ and ffmpeg (installs ffmpeg if absent)
  2. Creates a virtual environment in .venv/
  3. Installs core dependencies (ffmpeg-python, tqdm, requests, yt-dlp, pydub)
  4. Installs web UI dependencies if selected
  5. Installs the chosen backend(s)
  6. For whisper.cpp: clones the repository, compiles with make -j$(nproc), prompts for model download, and writes the detected binary and model paths into config.json
  7. Copies config.example.json to config.json if absent

After installation:

source .venv/bin/activate

Manual installation

git clone https://github.com/franckferman/whispr.git
cd whispr
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# faster-whisper
pip install faster-whisper

# OpenAI
pip install openai

# Web UI
pip install fastapi "uvicorn[standard]" python-multipart

whisper.cpp (manual build)

make whisper-cpp-setup                        # clone, compile, download base model
make whisper-cpp-setup WHISPER_MODEL=small    # or small / medium / large-v2

After compilation the binary is at ./whisper.cpp/build/bin/whisper-cli. Set this path in config.json or pass it via --whisper-binary.

Verify

make check

Validates all Python imports and verifies the CLI entry point responds correctly.


Configuration

Settings are read from config.json (if present). Any value can be overridden with a CLI flag. CLI flags always take precedence over the file.

cp config.example.json config.json

String values support ${ENV_VAR} interpolation:

"openai_api_key": "${OPENAI_API_KEY}"
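The interpolation mechanism can be sketched in a few lines (a minimal illustration; the actual implementation lives in transcriber/config.py and may differ):

```python
import os
import re

# Minimal sketch of ${ENV_VAR} interpolation for string config values
# (illustrative; the real logic is in transcriber/config.py).
_ENV_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def interpolate(value: str) -> str:
    # Replace each ${NAME} with the environment variable's value,
    # leaving the placeholder intact if the variable is unset.
    return _ENV_RE.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)

os.environ["OPENAI_API_KEY"] = "sk-demo"
print(interpolate("${OPENAI_API_KEY}"))  # sk-demo
```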

Full reference

| Field | Default | Description |
| --- | --- | --- |
| backend | faster_whisper | Primary backend: faster_whisper, whisper_cpp, openai |
| fallback_backend | null | Secondary backend used if the primary exhausts all retries |
| faster_whisper_model | base | Model size: tiny, base, small, medium, large-v2 |
| faster_whisper_device | cpu | Compute device: cpu or cuda |
| faster_whisper_compute_type | int8 | Quantization: int8, float16, float32 |
| whisper_cpp_binary | whisper | Path to the compiled whisper.cpp binary |
| whisper_cpp_model | models/ggml-base.bin | Path to the GGML .bin model file |
| whisper_cpp_extra_args | [] | Additional CLI arguments forwarded verbatim to the binary |
| openai_api_key | ${OPENAI_API_KEY} | OpenAI API key |
| openai_model | whisper-1 | OpenAI model identifier |
| language | null | ISO 639-1 code (fr, en, ...) or null for auto-detect |
| chunk_duration_seconds | 600 | Duration of each audio chunk in seconds |
| workers | 2 | Number of parallel transcription threads |
| temp_dir | null | Temporary directory (system default if null) |
| output_formats | ["txt"] | Output formats: txt, json, srt, vtt |
| output_dir | . | Directory where output files are written |
| output_prefix | transcript | Filename prefix for output files |
| max_retries | 3 | Retry attempts per chunk before declaring failure |
| retry_base_delay | 1.0 | Base delay in seconds for exponential backoff |
| retry_max_delay | 30.0 | Maximum backoff delay cap in seconds |
| dry_run | false | Print the execution plan without running anything |
| debug | false | Enable DEBUG-level logging |
| log_file | null | Write logs to a file (a timestamped filename is generated automatically) |
| ytdlp_format | bestaudio/best | yt-dlp format selector |
| ytdlp_output_template | %(title)s.%(ext)s | yt-dlp output filename template |
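A minimal config.json combining a few of these fields (values are examples, not requirements):

```json
{
  "backend": "faster_whisper",
  "fallback_backend": "openai",
  "faster_whisper_model": "base",
  "faster_whisper_compute_type": "int8",
  "openai_api_key": "${OPENAI_API_KEY}",
  "language": "fr",
  "chunk_duration_seconds": 600,
  "workers": 2,
  "output_formats": ["txt", "srt"]
}
```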

CLI Usage

python main.py [--file PATH | --url URL] [options]

If --backend is not specified, the value from config.json is used (default: faster_whisper). The CLI does not perform automatic backend detection — set your preferred backend in config.json once and omit the flag on subsequent runs.

Examples

source .venv/bin/activate

# Local file, auto-detect language
python main.py --file video.mp4 --backend faster_whisper

# Explicit language (skips the 30 s language detection pass)
python main.py --file video.mp4 --backend faster_whisper --language fr

# YouTube URL
python main.py --url "https://youtube.com/watch?v=..." --backend faster_whisper --language fr

# Multiple output formats
python main.py --file video.mp4 --backend faster_whisper --format txt,srt,vtt,json

# whisper.cpp
python main.py --file video.mp4 --backend whisper_cpp \
    --whisper-binary ./whisper.cpp/build/bin/whisper-cli \
    --whisper-model  ./whisper.cpp/models/ggml-base.bin \
    --language fr

# OpenAI API
python main.py --file audio.mp3 --backend openai --openai-key sk-...

# Primary + fallback: re-run with OpenAI if faster-whisper exhausts retries
python main.py --file video.mp4 --backend faster_whisper --fallback-backend openai

# Long file: 5-minute chunks, 4 workers
python main.py --file film.mp4 --backend faster_whisper \
    --chunk-duration 300 --workers 4 --language fr --format txt,srt

# Dry run
python main.py --dry-run --file video.mp4 --backend faster_whisper

# Load config from file
python main.py --config config.json

All flags

| Flag | Default | Description |
| --- | --- | --- |
| --file, -f | | Local audio/video file |
| --url, -u | | YouTube URL or direct HTTP link |
| --config, -c | | JSON config file (CLI flags override) |
| --backend, -b | faster_whisper | Backend: whisper_cpp, faster_whisper, openai |
| --fallback-backend | | Secondary backend if primary fails all retries |
| --whisper-binary | whisper | Path to the whisper.cpp binary |
| --whisper-model | | Path to the GGML .bin model file |
| --fw-model | base | faster-whisper model size |
| --fw-device | cpu | faster-whisper device: cpu or cuda |
| --openai-key | $OPENAI_API_KEY | OpenAI API key |
| --language, -l | auto-detect | ISO 639-1 code: fr, en, es, ... |
| --chunk-duration | 600 | Chunk size in seconds |
| --workers, -w | 2 | Parallel transcription threads |
| --format, -F | txt | Output formats, comma-separated: txt,json,srt,vtt |
| --output-dir, -o | . | Output directory |
| --output-prefix | transcript | Output filename prefix |
| --max-retries | 3 | Retry attempts per chunk on failure |
| --dry-run | | Print plan without executing |
| --debug | | Enable DEBUG logging |
| --log-file | | Write logs to a file (timestamped name generated automatically) |

faster-whisper model sizes

Models are downloaded on first use and cached at ~/.cache/huggingface/hub/.

| Model | Size | Speed (CPU) | Quality | Use case |
| --- | --- | --- | --- | --- |
| tiny | ~75 MB | Fastest | Basic | Quick drafts, noisy audio |
| base | ~150 MB | Fast | Good | Default, everyday use |
| small | ~500 MB | Moderate | Better | Interviews, mixed accents |
| medium | ~1.5 GB | Slow | High | Difficult content |
| large-v2 | ~3 GB | Slowest | Best | Maximum accuracy |

GPU acceleration

python main.py --file video.mp4 --backend faster_whisper --fw-device cuda

In config.json:

"faster_whisper_device": "cuda",
"faster_whisper_compute_type": "float16"

Rule of thumb: float16 on GPU, int8 on CPU.

Output formats

All four formats are produced by the same formatter, independently of the backend.

| Format | Description |
| --- | --- |
| txt | Plain text, no timestamps |
| json | Full output: segments, timestamps, language, metadata |
| srt | SubRip subtitles — VLC, ffmpeg, Premiere |
| vtt | WebVTT — HTML5 <video> element |

Chunking

Long files are split into fixed-duration chunks before transcription. Each chunk is processed independently and results are merged with corrected absolute timestamps.

The default is 600 seconds (10 minutes). Reduce it to lower peak memory on very long files; increase it to reduce per-chunk overhead:

python main.py --file film.mp4 --chunk-duration 300 --workers 4

The OpenAI API's 25 MB limit per request is handled transparently by the chunking layer.
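The split-and-merge logic can be sketched as follows (illustrative only; the real orchestration lives in transcriber/managers/transcription.py):

```python
# Fixed-duration chunking and absolute-timestamp correction (illustrative).
def chunk_offsets(total_seconds: float, chunk_seconds: int = 600):
    """Start offset of each chunk, e.g. a 25-minute file -> [0, 600, 1200]."""
    offsets, t = [], 0.0
    while t < total_seconds:
        offsets.append(t)
        t += chunk_seconds
    return offsets

def merge(chunk_results):
    """Shift each chunk's local segment timestamps by its absolute start offset."""
    merged = []
    for offset, segments in chunk_results:
        for seg in segments:
            merged.append({"start": seg["start"] + offset,
                           "end": seg["end"] + offset,
                           "text": seg["text"]})
    return merged

print(chunk_offsets(1500))  # [0.0, 600.0, 1200.0]
```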

Retry and fallback

Each chunk is retried independently with exponential backoff:

attempt 1 -> wait 1 s -> attempt 2 -> wait 2 s -> attempt 3 -> ...

If all retries fail and a fallback backend is configured, the entire job restarts from scratch on the fallback:

python main.py --file video.mp4 --backend faster_whisper \
    --fallback-backend openai --max-retries 5
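The backoff schedule implied by retry_base_delay and retry_max_delay can be expressed as (a sketch of the schedule, not the project's exact retry loop):

```python
# Exponential backoff delay for retry attempt N (1-indexed), capped at a
# maximum — mirrors the retry_base_delay / retry_max_delay config fields.
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    return min(base * (2 ** (attempt - 1)), cap)

print([backoff_delay(n) for n in range(1, 7)])  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```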

Web Interface

Same transcription pipeline as the CLI, accessible from a browser.

Start

make web

# or manually
source .venv/bin/activate
pip install fastapi "uvicorn[standard]" python-multipart
python -m uvicorn web.app:app --host 0.0.0.0 --port 8000

Open http://localhost:8000.

Features

  • Backend detection at startup — green dot if ready, grey with install hint if unavailable
  • Auto-selects the best available backend: faster_whisper > whisper_cpp > openai
  • Per-backend options shown dynamically:
    • faster-whisper: model size dropdown (tiny / base / small / medium / large-v2)
    • whisper.cpp: binary path and model path, auto-filled from detected installation
    • OpenAI: API key input
  • File upload (drag and drop) or URL (YouTube or direct link)
  • Language, output formats (TXT / SRT / VTT / JSON), and worker count
  • Real-time transcription log via WebSocket
  • Download buttons per format on completion

Makefile Reference

make install              # Full setup via install.sh
make deps                 # Core Python deps only
make deps-faster-whisper  # + faster-whisper
make deps-openai          # + openai
make deps-all             # All Python deps

make whisper-cpp-build    # Clone and compile whisper.cpp
make whisper-cpp-model    # Download model (WHISPER_MODEL=base|tiny|small|medium|large-v2)
make whisper-cpp-setup    # Build + model in one step

make run FILE=video.mp4                         # Transcribe (faster_whisper, fr by default)
make run URL=https://...                        # Transcribe a URL
make run FILE=video.mp4 BACKEND=whisper_cpp     # Specific backend
make run FILE=video.mp4 LANGUAGE=en WORKERS=4  # Override defaults
make run-whisper-cpp FILE=video.mp4
make run-faster-whisper FILE=video.mp4
make run-openai FILE=video.mp4

make run-srt FILE=video.mp4                     # SRT output only
make run-all-formats FILE=video.mp4             # txt + json + srt + vtt

make dry-run FILE=video.mp4                     # Preview without executing
make run-config                                 # Use config.json

make web                                        # Start web UI at http://localhost:8000
make web-install                                # Install web deps only

make check                                      # Validate imports + CLI
make clean                                      # Remove output files and __pycache__
make clean-all                                  # Remove venv too

References

[1] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356. https://arxiv.org/abs/2212.04356

[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03762

[3] Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). https://arxiv.org/abs/1508.07909

[4] Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M. (2017). OpenNMT: Open-Source Toolkit for Neural Machine Translation. Proceedings of ACL 2017, System Demonstrations. https://arxiv.org/abs/1701.02810

[5] Gerganov, G. (2022). whisper.cpp: Port of OpenAI's Whisper model in C/C++. GitHub repository. https://github.com/ggerganov/whisper.cpp

[6] Systran. (2023). faster-whisper: Faster Whisper transcription with CTranslate2. GitHub repository. https://github.com/SYSTRAN/faster-whisper

[7] Klein, G., Crego, J., & Senellart, J. (2020). Efficient and High-Quality Neural Machine Translation with OpenNMT. Proceedings of the 4th Workshop on Neural Generation and Translation. — CTranslate2 inference engine underlying faster-whisper. https://github.com/OpenNMT/CTranslate2

[8] OpenAI. (2023). Whisper API Reference. OpenAI Platform Documentation. https://platform.openai.com/docs/api-reference/audio


License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

Any use, modification, or distribution — including over a network — requires the full source code to remain open under the same license.