The whisperX API is a tool for enhancing and analyzing audio content. This API provides a suite of services for processing audio and video files, including transcription, alignment, diarization, and combining transcript with diarization results.
Swagger UI is available at /docs for all the services, and a dump of the OpenAPI definition is available in the app/docs folder. You can also explore the definition directly in Swagger Editor.
See the WhisperX Documentation for details on whisperX functions.
- In .env you can define the default language using DEFAULT_LANG; if not defined, en is used (you can also set it in the request).
- .env contains the definition of the Whisper model using WHISPER_MODEL (you can also set it in the request).
- .env contains the definition of the logging level using LOG_LEVEL; if not defined, DEBUG is used in development and INFO in production.
- .env contains the definition of the environment using ENVIRONMENT; if not defined, production is used.
- .env contains a boolean DEV to indicate whether the environment is development; if not defined, true is used.
- .env contains a boolean FILTER_WARNING to enable or disable filtering of specific warnings; if not defined, true is used.
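The defaults listed above can be summarised in a small resolver. This is only a sketch of the documented rules, not the application's actual settings code: the function name load_settings, the helper _as_bool, and the choice to key the LOG_LEVEL default off ENVIRONMENT (rather than DEV) are assumptions made for illustration. The tiny default for WHISPER_MODEL is taken from the model section of this document.

```python
import os
from typing import Mapping, Optional


def _as_bool(value: Optional[str], default: bool) -> bool:
    """Interpret common truthy strings; fall back to the default when unset."""
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve the documented defaults from an environment mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    environment = env.get("ENVIRONMENT", "production").lower()
    # Documented rule: DEBUG in development, INFO in production.
    # Assumption: "development" is detected from ENVIRONMENT.
    default_level = "DEBUG" if environment == "development" else "INFO"
    return {
        "DEFAULT_LANG": env.get("DEFAULT_LANG", "en"),
        "WHISPER_MODEL": env.get("WHISPER_MODEL", "tiny"),
        "ENVIRONMENT": environment,
        "LOG_LEVEL": env.get("LOG_LEVEL", default_level).upper(),
        "DEV": _as_bool(env.get("DEV"), default=True),
        "FILTER_WARNING": _as_bool(env.get("FILTER_WARNING"), default=True),
    }
```

Passing an explicit mapping instead of reading os.environ directly keeps the resolver easy to exercise in tests.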
Supported audio file extensions: .oga, .m4a, .aac, .wav, .amr, .wma, .awb, .mp3, .ogg

Supported video file extensions: .wmv, .mkv, .avi, .mov, .mp4
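A client can check a file name against the two extension lists above before uploading. The sketch below is a client-side convenience under the assumption that the first list covers audio and the second covers video; classify_media is a hypothetical helper, and the server's own validation may differ.

```python
from pathlib import Path
from typing import Optional

# Extension lists copied from this document.
AUDIO_EXTENSIONS = {".oga", ".m4a", ".aac", ".wav", ".amr", ".wma", ".awb", ".mp3", ".ogg"}
VIDEO_EXTENSIONS = {".wmv", ".mkv", ".avi", ".mov", ".mp4"}


def classify_media(filename: str) -> Optional[str]:
    """Return "audio", "video", or None when the extension is not listed."""
    suffix = Path(filename).suffix.lower()  # case-insensitive comparison
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return None
```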
- Speech-to-Text (/speech-to-text)
  - Upload audio/video files for transcription
  - Supports multiple languages and Whisper models
- Speech-to-Text URL (/speech-to-text-url)
  - Transcribe audio/video from URLs
  - Same features as direct upload
- Individual Services:
  - Transcribe (/service/transcribe): Convert speech to text
  - Align (/service/align): Align transcript with audio
  - Diarize (/service/diarize): Speaker diarization
  - Combine (/service/combine): Merge transcript with diarization
- Task Management:
  - Get all tasks (/task/all)
  - Get task status (/task/{identifier})
- Health Check Endpoints:
  - Basic health check (/health): Simple service status check
  - Liveness probe (/health/live): Verifies if application is running
  - Readiness probe (/health/ready): Checks if application is ready to accept requests (includes database connectivity check)
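Jobs submitted to the upload and service endpoints are followed through the task endpoints. A polling loop could look like the sketch below. The response schema is not described in this document, so the "status" key and the terminal values "completed" and "failed" are assumptions; fetch_status is a hypothetical callable that performs the GET /task/{identifier} request with whatever HTTP client you use and returns the decoded JSON.

```python
import time
from typing import Callable, Iterable


def wait_for_task(
    fetch_status: Callable[[], dict],
    done_states: Iterable[str] = ("completed", "failed"),  # assumed status values
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the task reports a terminal status or the timeout expires."""
    terminal = set(done_states)
    deadline = clock() + timeout
    while True:
        task = fetch_status()
        if task.get("status") in terminal:
            return task
        if clock() >= deadline:
            raise TimeoutError("task did not reach a terminal status in time")
        sleep(interval)
```

The sleep and clock parameters exist only so the loop can be tested without waiting in real time.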
The API also exposes synchronous OpenAI Whisper-compatible endpoints:
- POST /v1/audio/transcriptions
- POST /v1/audio/translations
These endpoints accept the same multipart/form-data style requests expected by
OpenAI SDK clients and return the transcript directly in the response body. Each
request is also persisted as a task, so its result (or error) can be retrieved later
via the /task endpoints.
- model="whisper-1" maps to the local checkpoint configured by WHISPER_MODEL
- Direct local Whisper checkpoint names such as tiny, base, large-v3, or distil-large-v3 are also accepted
- response_format supports json, text, srt, verbose_json, and vtt
- timestamp_granularities[]=word is supported with response_format=verbose_json on /v1/audio/transcriptions and triggers alignment for word timings
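The model-name and timestamp rules above can be expressed as two small functions. This is an illustrative sketch rather than the service's code: resolve_model and wants_word_alignment are hypothetical names, LOCAL_CHECKPOINTS is copied from the supported-model list in this document, and raising ValueError for unknown names is an assumption about error handling.

```python
from typing import Iterable

# Checkpoint names taken from the supported-model list in this document.
LOCAL_CHECKPOINTS = {
    "tiny", "tiny.en", "base", "base.en", "small", "small.en",
    "medium", "medium.en", "large", "large-v1", "large-v2", "large-v3",
    "large-v3-turbo", "distil-large-v2", "distil-medium.en",
    "distil-small.en", "distil-large-v3",
}


def resolve_model(requested: str, configured_default: str = "tiny") -> str:
    """Map the OpenAI alias to the configured checkpoint; pass local names through."""
    if requested == "whisper-1":
        return configured_default  # value of WHISPER_MODEL
    if requested in LOCAL_CHECKPOINTS:
        return requested
    raise ValueError(f"unknown model: {requested!r}")  # assumed behaviour


def wants_word_alignment(response_format: str, granularities: Iterable[str]) -> bool:
    """Word timings are documented only for response_format=verbose_json."""
    return response_format == "verbose_json" and "word" in set(granularities)
```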
Example with the official OpenAI Python SDK:
from openai import OpenAI

# Point the SDK at the local service; the API key is a placeholder.
client = OpenAI(
    api_key="not-used-but-required-by-some-clients",
    base_url="http://127.0.0.1:8000/v1",
)

# Send the audio file and request segment-level timestamps.
with open("tests/test_files/audio_en.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

print(transcript.text)

flowchart TD
Client(["Client / OpenAI SDK"])
subgraph AsyncAPI ["Asynchronous API — background jobs"]
STT["POST /speech-to-text<br/>POST /speech-to-text-url"]
SVC["POST /service/transcribe<br/>POST /service/align<br/>POST /service/diarize<br/>POST /service/combine"]
BG{{"Background task<br/>WhisperX pipeline"}}
TASK["GET /task/all<br/>GET /task/{identifier}<br/>DELETE /task/{identifier}/delete"]
end
subgraph SyncAPI ["OpenAI-compatible API — synchronous"]
OAI["POST /v1/audio/transcriptions<br/>POST /v1/audio/translations"]
end
subgraph SpeakerAPI ["Speaker management"]
SPK["POST / GET / PUT / DELETE /speakers<br/>POST /speakers/search<br/>POST /speakers/identify"]
end
SEM(["GPU semaphore<br/>MAX_CONCURRENT_GPU_TASKS"])
DB[("Database<br/>tasks, results and speaker embeddings")]
Client -->|submit job| STT
Client -->|submit job| SVC
Client -->|poll / manage| TASK
Client -->|request| OAI
Client -->|manage speakers| SPK
STT --> BG
SVC --> BG
BG -->|store status + result| DB
OAI -->|store status + result| DB
DB -->|read| TASK
SPK -->|CRUD / search / identify| DB
BG -. identify / auto-store speakers .-> DB
BG -. acquire .-> SEM
OAI -. acquire .-> SEM
OAI -->|transcript in response| Client
The asynchronous endpoints enqueue a background WhisperX job, persist its status and
result to the database, and let clients poll or manage it via the /task endpoints. The
OpenAI-compatible endpoints run the pipeline synchronously and return the transcript
directly in the response, while also persisting the task and its result to the same
database — so completed (and failed) synchronous requests are queryable via the /task
endpoints just like the asynchronous ones. Both paths share the same GPU semaphore
(MAX_CONCURRENT_GPU_TASKS) to prevent out-of-memory errors.
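The shared-semaphore idea can be illustrated with asyncio. The following is a minimal sketch, assuming an asyncio.Semaphore sized from MAX_CONCURRENT_GPU_TASKS; run_bounded and fake_gpu_job are hypothetical names, and the real pipeline is replaced by a short sleep. The peak counter is there only to show that concurrency never exceeds the limit.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def run_bounded(
    jobs: List[Callable[[], Awaitable[str]]], limit: int
) -> Tuple[List[str], int]:
    """Run all jobs, letting at most `limit` hold the semaphore at once."""
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def worker(job: Callable[[], Awaitable[str]]) -> str:
        nonlocal active, peak
        async with semaphore:  # both sync and async paths would acquire here
            active += 1
            peak = max(peak, active)
            try:
                return await job()
            finally:
                active -= 1

    results = await asyncio.gather(*(worker(job) for job in jobs))
    return list(results), peak


async def fake_gpu_job() -> str:
    """Stand-in for a WhisperX pipeline run."""
    await asyncio.sleep(0.01)
    return "ok"
```

Because the semaphore lives in process memory, this pattern only bounds work within a single worker process, which matches the single-worker assumption stated elsewhere in this document.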
The /speakers endpoints provide CRUD, similarity search, and identification over speaker
embeddings persisted in the same database. Diarization tasks can optionally identify
against, or auto-store into, these embeddings (identify_speakers / auto_store_speakers).
Task status and results are stored in a database via async SQLAlchemy. The DB connection
is configured with DB_URL (default: sqlite:///records.db).
See SQLAlchemy Engine configuration for supported database URLs.
Async drivers are required — the application rewrites the URL scheme automatically:
| DB_URL scheme | Async driver used |
|---|---|
| sqlite:// | aiosqlite (included by default) |
| postgresql:// | asyncpg (install with --extra postgres) |
For PostgreSQL, install the driver extra: uv sync --no-dev --extra postgres. The
Docker image includes it automatically.
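The scheme rewrite described above amounts to a prefix substitution. The sketch below shows one way to do it, using the SQLAlchemy dialect names sqlite+aiosqlite and postgresql+asyncpg; to_async_url is a hypothetical helper, and the application's own rewrite logic may handle more cases.

```python
def to_async_url(db_url: str) -> str:
    """Rewrite a plain DB_URL scheme to its async-driver equivalent."""
    rewrites = {
        "sqlite://": "sqlite+aiosqlite://",
        "postgresql://": "postgresql+asyncpg://",
    }
    for sync_prefix, async_prefix in rewrites.items():
        if db_url.startswith(sync_prefix):
            return async_prefix + db_url[len(sync_prefix):]
    # URLs that already name a driver are left untouched in this sketch.
    return db_url
```

For example, the default sqlite:///records.db becomes sqlite+aiosqlite:///records.db.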
Performance note: SQLite is suitable for development and low-concurrency use. For production or sustained concurrent load, use PostgreSQL — it sustains 350+ req/s at 200 concurrent users vs. ~15 req/s with SQLite. See Async SQLAlchemy concurrency guide for full load test results.
The structure of the database is described in DB Schema.
Configure compute options in .env:
- DEVICE: Device for inference (cuda or cpu, default: cuda)
- COMPUTE_TYPE: Computation type (float16, float32, int8, default: float16)

Note: When using CPU, COMPUTE_TYPE must be set to int8.
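The device and compute-type constraints can be checked before starting the service. This is a sketch of the documented rules only: validate_compute is a hypothetical helper, and failing with ValueError is an assumption about how a misconfiguration might be reported.

```python
from typing import Tuple

ALLOWED_DEVICES = {"cuda", "cpu"}
ALLOWED_COMPUTE_TYPES = {"float16", "float32", "int8"}


def validate_compute(device: str = "cuda", compute_type: str = "float16") -> Tuple[str, str]:
    """Return the pair unchanged when it satisfies the documented constraints."""
    if device not in ALLOWED_DEVICES:
        raise ValueError(f"DEVICE must be one of {sorted(ALLOWED_DEVICES)}")
    if compute_type not in ALLOWED_COMPUTE_TYPES:
        raise ValueError(f"COMPUTE_TYPE must be one of {sorted(ALLOWED_COMPUTE_TYPES)}")
    # Documented rule: CPU inference requires int8.
    if device == "cpu" and compute_type != "int8":
        raise ValueError("COMPUTE_TYPE must be int8 when DEVICE is cpu")
    return device, compute_type
```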
WhisperX supports these model sizes:
- tiny, tiny.en
- base, base.en
- small, small.en
- medium, medium.en
- large, large-v1, large-v2, large-v3, large-v3-turbo
- Distilled models: distil-large-v2, distil-medium.en, distil-small.en, distil-large-v3
- Custom models: nyrahealth/faster_CrisperWhisper
Set default model in .env using WHISPER_MODEL= (default: tiny)
The transcription endpoints (/speech-to-text, /service/*, /v1/audio/*) can be
protected with optional, configurable safeguards. Every option is a no-op by default,
so existing deployments are unaffected until they opt in.
| Variable | Default | Effect |
|---|---|---|
| MAX_UPLOAD_SIZE_MB | 0 | Reject uploads larger than this many MB with HTTP 413, checked from Content-Length before the body is read. 0 = unlimited. (Requests without a Content-Length, e.g. chunked uploads, are not pre-checked; see the note below.) |
| MAX_QUEUED_GPU_REQUESTS | 0 | Cap on concurrent in-flight transcription requests admitted across the API, returning HTTP 503 (with Retry-After) when exceeded. 0 = unlimited. Use >= 2 for an exact split; 1 admits up to 2 (one per path). |
| SYNC_GPU_QUOTA_FRACTION | 0.5 | Fraction of MAX_QUEUED_GPU_REQUESTS reserved for the synchronous (/v1/audio/*) path; the async path gets the remainder. For total >= 2 the split is exact and each path keeps at least one slot. |
| RATE_LIMIT__ENABLED | false | Enable per-caller rate limiting (slowapi). Returns HTTP 429 with Retry-After. |
| RATE_LIMIT__REQUESTS_PER_MINUTE | 60 | Sustained per-caller budget per minute. |
| RATE_LIMIT__BURST | 10 | Short-term per-caller burst budget per second. |
| RATE_LIMIT__KEY_STRATEGY | ip | How callers are identified: ip or bearer_token. |
| AUTH__ENABLED | false | Require a shared bearer token on protected endpoints (HTTP 401 otherwise). |
| AUTH__BEARER_TOKEN | (empty) | The shared token. Required when AUTH__ENABLED=true. |
Example .env snippet enabling all of them:
MAX_UPLOAD_SIZE_MB=25
MAX_QUEUED_GPU_REQUESTS=20
SYNC_GPU_QUOTA_FRACTION=0.5
RATE_LIMIT__ENABLED=true
RATE_LIMIT__REQUESTS_PER_MINUTE=60
RATE_LIMIT__BURST=10
RATE_LIMIT__KEY_STRATEGY=ip
AUTH__ENABLED=true
AUTH__BEARER_TOKEN=replace-with-a-long-random-secret

Notes:
- Values are read from the environment at startup and are fixed for the life of the process; changing them requires a restart.
- Rate-limit and concurrency state are kept in-process. Running multiple workers (uvicorn --workers >1) gives each worker its own budget, so these limits, like the GPU semaphore, assume a single worker process.
- The upload cap is enforced from the Content-Length header. Uploads sent without one (chunked transfer encoding) are not rejected up front and are bounded only by available memory/disk; the common multipart upload path always sends Content-Length.
- For async endpoints (/speech-to-text, /service/*), the concurrency gate is admission control for the request phase (validation, audio decode, enqueue). The GPU pipeline runs in a background task bounded by MAX_CONCURRENT_GPU_TASKS, not by this gate.
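Two of the safeguards above involve simple arithmetic: the quota split between the synchronous and asynchronous paths, and the Content-Length pre-check. The sketch below reproduces only the documented outcomes (0 = unlimited, a total of 1 admits one request per path, totals >= 2 split exactly with at least one slot per path). The function names are hypothetical, rounding down is an assumption, and so is treating 1 MB as 1024 * 1024 bytes.

```python
from typing import Optional, Tuple


def split_quota(total: int, sync_fraction: float = 0.5) -> Tuple[Optional[int], Optional[int]]:
    """Return (sync_slots, async_slots); None means unlimited."""
    if total <= 0:
        return None, None  # 0 disables the cap
    if total == 1:
        return 1, 1  # documented: 1 admits up to 2, one per path
    sync_slots = int(total * sync_fraction)  # assumption: round down
    sync_slots = min(max(sync_slots, 1), total - 1)  # each path keeps >= 1 slot
    return sync_slots, total - sync_slots


def upload_too_large(content_length: Optional[str], max_upload_size_mb: int) -> bool:
    """True when the declared body size exceeds the cap (the HTTP 413 case)."""
    if max_upload_size_mb <= 0 or content_length is None:
        return False  # unlimited, or no Content-Length to pre-check
    return int(content_length) > max_upload_size_mb * 1024 * 1024
```

With the example values MAX_QUEUED_GPU_REQUESTS=20 and SYNC_GPU_QUOTA_FRACTION=0.5, split_quota(20, 0.5) gives ten slots to each path.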
- NVIDIA GPU with CUDA 12.8+ support
- At least 8GB RAM (16GB+ recommended for large models)
- Storage space for models (varies by model size):
  - tiny/base: ~1GB
  - small: ~2GB
  - medium: ~5GB
  - large: ~10GB
To get started with the API, follow these steps:
- Install the uv package manager.
- Create a virtual environment and install dependencies:

  # For production dependencies only
  uv sync --no-dev

  # For development (includes testing, linting, async SQLite driver)
  uv sync --all-extras

- Configure your environment (see the .env file setup below).

Note: This project uses uv for dependency management with platform-specific PyTorch configuration (CUDA 12.8 on Linux, CPU-only on macOS/Windows). All dependencies are defined in pyproject.toml.
The application uses two logging configuration files:
- uvicorn_log_conf.yaml: Used by Uvicorn for logging configuration.
- gunicorn_logging.conf: Used by Gunicorn for logging configuration (located in the root of the app directory).
Ensure these files are correctly configured and placed in the app directory.
- Create a .env file

Define your Whisper model and your Hugging Face token:

HF_TOKEN=<<YOUR HUGGINGFACE TOKEN>>
WHISPER_MODEL=<<WHISPER MODEL SIZE>>
LOG_LEVEL=<<LOG LEVEL>>

- Run the FastAPI application:

uvicorn app.main:app --reload --log-config uvicorn_log_conf.yaml --log-level $LOG_LEVEL

The API will be accessible at http://127.0.0.1:8000.
- Create a .env file

Define your Whisper model and your Hugging Face token:

HF_TOKEN=<<YOUR HUGGINGFACE TOKEN>>
WHISPER_MODEL=<<WHISPER MODEL SIZE>>
LOG_LEVEL=<<LOG LEVEL>>

- Build the image

Using docker-compose.yaml:

# Build and start the image using the compose file
docker-compose up

Alternative approach:

# Build the image
docker build -t whisperx-service .
# Run the container
docker run -d --gpus all -p 8000:8000 --env-file .env whisperx-service

The API will be accessible at http://127.0.0.1:8000.
Note: The Docker build uses uv for installing dependencies, as specified in the Dockerfile. The main entrypoint for the Docker container is via Gunicorn (not Uvicorn directly), using the configuration in app/gunicorn_logging.conf.

Important: For GPU support in Docker, you must have CUDA drivers 12.8+ installed on your host system.
The models used by whisperX are stored in root/.cache. To avoid downloading the models each time the container starts, you can store the cache in persistent storage. docker-compose.yaml defines a volume, whisperx-models-cache, for this purpose.
- faster-whisper cache: root/.cache/huggingface/hub
- pyannote and other models cache: root/.cache/torch
- Environment Variables Not Loaded
  - Ensure your .env file is correctly formatted and placed in the root directory.
  - Verify that all required environment variables are defined.
- Database Connection Issues
  - Check the DB_URL environment variable for correctness.
  - Ensure the database server is running and accessible.
  - PostgreSQL driver: when using DB_URL=postgresql://... outside Docker, install the driver with uv sync --extra postgres.
  - Async driver mismatch: if you set a DB_URL with a sync scheme (e.g. postgresql+psycopg2://), the app will fail to start. Use the plain scheme (postgresql://) and let the app rewrite it to postgresql+asyncpg:// automatically.
- Model Download Failures
  - Verify your internet connection.
  - Ensure the HF_TOKEN is correctly set in the .env file.
- GPU Not Detected
  - Ensure NVIDIA drivers and CUDA are correctly installed.
  - Verify that Docker is configured to use the GPU (nvidia-docker).
- Warnings Not Filtered
  - Ensure the FILTER_WARNING environment variable is set to true in the .env file.

- Check the logs for detailed error messages.
- Use the LOG_LEVEL environment variable to set the appropriate logging level (DEBUG, INFO, WARNING, ERROR).
The API provides built-in health check endpoints that can be used for monitoring and orchestration:
- Basic Health Check (/health)
  - Returns a simple status check with HTTP 200 if the service is running
  - Useful for basic availability monitoring
- Liveness Probe (/health/live)
  - Includes a timestamp with status information
  - Designed for Kubernetes liveness probes or similar orchestration systems
  - Returns HTTP 200 if the application is running
- Readiness Probe (/health/ready)
  - Tests if the application is fully ready to accept requests
  - Checks connectivity to the database
  - Returns HTTP 200 if all dependencies are available
  - Returns HTTP 503 if there's an issue with dependencies (e.g., database connection)
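A monitoring script can turn the readiness probe into a boolean using only the standard library. The sketch below relies on the documented 200/503 behaviour of /health/ready; is_ready is a hypothetical helper, the response body is not inspected because its shape is not described here, and treating a connection failure as "not ready" is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable


def is_ready(
    base_url: str,
    opener: Callable = urllib.request.urlopen,
    timeout: float = 5.0,
) -> bool:
    """Return True when GET /health/ready answers HTTP 200."""
    url = base_url.rstrip("/") + "/health/ready"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx; 503 means a dependency is down.
        if exc.code == 503:
            return False
        raise
    except urllib.error.URLError:
        # Assumption: an unreachable service counts as not ready.
        return False
```

Calling is_ready("http://127.0.0.1:8000") against a running instance performs a single request; the opener parameter only exists to make the function testable without a network.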
For further assistance, please open an issue on the GitHub repository.