Benchmark OpenAI-compatible chat completion endpoints across multiple models and providers. Compares latency, throughput, cost-efficiency, and consistency using streaming responses.
Disclosure: This benchmark is maintained by Opper. We encourage independent reproduction — see How to reproduce below.
| Model | Tier | Output $/1M tokens | Providers |
|---|---|---|---|
| gpt-4.1 | Flagship | $8.00 | OpenAI, OpenRouter, Opper |
| gpt-4.1-mini | Fast | $1.60 | OpenAI, OpenRouter, Opper |
| gpt-4.1-nano | Ultra-fast | $0.40 | OpenAI, OpenRouter, Opper |
| claude-sonnet-4-6 | Flagship | $15.00 | OpenRouter, Opper, Bedrock |
| claude-haiku-4-5 | Fast | $5.00 | OpenRouter, Opper, Bedrock |
| gemini-2.5-pro | Flagship | $10.00 | OpenRouter, Opper |
| gemini-2.5-flash | Fast | $2.50 | OpenRouter, Opper |
| Provider | Type | Endpoint | Auth |
|---|---|---|---|
| OpenAI | Direct | api.openai.com | API key |
| OpenRouter | Router | openrouter.ai | API key |
| Opper | Router | api.opper.ai | API key |
| Bedrock US-East | Direct (regional) | us-east-1 | AWS credentials |
| Bedrock US-West | Direct (regional) | us-west-2 | AWS credentials |
| Bedrock EU-Ireland | Direct (regional) | eu-west-1 | AWS credentials |
| Bedrock EU-Frankfurt | Direct (regional) | eu-central-1 | AWS credentials |
| Bedrock AP-Tokyo | Direct (regional) | ap-northeast-1 | AWS credentials |
- TTFB -- time to first content token (seconds)
- Total -- full request duration (seconds)
- Tokens/sec -- generation throughput (excludes TTFB)
- CV -- coefficient of variation (consistency: lower = more predictable)
- P95/P99 -- tail latency percentiles
- Price-performance -- tokens/sec per dollar (throughput efficiency relative to cost)
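As a rough illustration of how the metrics above relate to raw timings, here is a minimal sketch. It is not taken from benchmark.py: the function names are hypothetical, the percentile uses the nearest-rank method (the script may interpolate differently), and dividing throughput by the output $/1M price is an assumption about how "per dollar" is normalized.

```python
import math
import statistics


def tokens_per_sec(output_tokens, ttfb, total):
    """Generation throughput, excluding the wait for the first token."""
    gen_time = total - ttfb
    return output_tokens / gen_time if gen_time > 0 else float("nan")


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (lower = more predictable)."""
    return statistics.stdev(values) / statistics.fmean(values)


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct such as 95 or 99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def price_performance(tps, output_price_per_million):
    """Tokens/sec per dollar of output price (assumed normalization)."""
    return tps / output_price_per_million


totals = [1.0, 1.2, 1.1, 1.3, 4.0]  # made-up request durations in seconds
tps = tokens_per_sec(output_tokens=512, ttfb=0.5, total=4.5)
print(tps)                                   # 128.0
print(percentile(totals, 95))                # 4.0
print(round(coefficient_of_variation(totals), 3))
print(price_performance(tps, 8.00))          # 16.0
```

The single slow request in `totals` barely moves the median but dominates P95 and inflates CV, which is why the tail and consistency metrics are reported alongside it.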
Results are reported as medians with 95% bootstrap confidence intervals. When the CIs of two providers overlap, the difference is treated as not statistically significant.
pip install httpx
pip install boto3 # optional, only needed for Bedrock providers

Set API keys as environment variables or in a .env file alongside benchmark.py:
OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-...
OPPER_API_KEY=...
Only keys for the providers you want to test are required.
For Bedrock providers, configure AWS credentials via aws configure, environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), or an IAM role.
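Because only the keys for the providers under test are required, it can help to check up front which key-based providers are usable. The sketch below is a hypothetical helper, not code from benchmark.py; the provider ids mirror the --providers values used in this README, and `available_providers` is an assumed name.

```python
import os

# Environment variable expected for each key-based provider (names from this README).
KEY_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "opper": "OPPER_API_KEY",
}


def available_providers(env=None):
    """Return the key-based providers whose API key is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in KEY_ENV_VARS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    print("Providers with keys configured:", available_providers())
```

Bedrock is left out on purpose, since its credentials can come from several sources (aws configure, environment variables, or an IAM role) rather than a single variable.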
# Default: all models × all providers, 30 rounds + 2 warmup
python benchmark.py
# Specific models and providers
python benchmark.py --models gpt-4.1,gpt-4.1-mini --providers openai,opper
# Only flagship models across all providers
python benchmark.py --models gpt-4.1,claude-sonnet-4-6,gemini-2.5-pro
# Compare OpenAI direct vs routed (routing overhead analysis)
python benchmark.py --models gpt-4.1 --providers openai,openrouter,opper
# Geographic comparison: same model across AWS regions
python benchmark.py --models claude-sonnet-4-6 --providers bedrock-us-east-1,bedrock-eu-west-1,bedrock-ap-northeast-1
# Direct vs routed vs regional: Claude through every path
python benchmark.py --models claude-sonnet-4-6 --providers opper,openrouter,bedrock-us-east-1,bedrock-eu-west-1
# Structured JSON output mode (measures "JSON tax")
python benchmark.py --models gpt-4.1 --json-mode
# Throughput scaling: sweep across output lengths
python benchmark.py --models gpt-4.1-mini --sweep-tokens 256,512,1024,2048
# Custom iterations and max tokens
python benchmark.py --iterations 50 --max-tokens 1024
# Reproducible run with fixed seed
python benchmark.py --seed 42
# JSON output to stdout (for piping)
python benchmark.py --json
# Analyze multiple result files for cross-run variance
python benchmark.py --analyze ./

make_charts.py reads any results-*.json, results-cache-*.json, results-sweep-*.json, or results/<run>/benchmark.json file in a directory and renders matplotlib bar/line charts (with 95% CI whiskers) plus a sidecar .txt listing the exact numbers each chart was built from.
pip install matplotlib # not a hard dep of benchmark.py
python make_charts.py --results-dir . --out-dir charts/

File shape is auto-detected, so the same command produces router-vs-direct, Bedrock regional, backend-heterogeneity, prompt-cache-by-turn, and token-sweep charts depending on what's in the directory.
- The Router Tax -- How much latency does routing add vs direct API access?
- The Model Ladder -- What's the speed multiplier when dropping from flagship to mini/nano?
- Cross-Family Showdown -- Is Claude Haiku faster than GPT-4.1-mini? Is Gemini Flash faster than both?
- Geographic Reality -- How does latency differ between US, EU, and Asia regions? (Bedrock)
- The Throughput Curve -- Does throughput hold steady as output gets longer? (--sweep-tokens)
- Consistency -- Which providers have predictable performance vs wild swings? (CV metric)
- The JSON Tax -- How much does structured output mode cost in speed? (--json-mode)
- Price-Performance -- Which model gives the most tokens/sec per dollar?
Every run prints a seed value. To reproduce the exact same prompt sequence and provider ordering:
python benchmark.py --seed <SEED> --iterations <N> --models <MODELS>

The JSON output includes:
- SHA-256 hash of the benchmark script (verify it hasn't been modified)
- Git commit SHA
- All raw measurements (not just aggregates)
- Environment metadata (Python version, OS, location)
- The seed used
- Pricing and cost-efficiency data
For published results, we recommend --iterations 50 or higher.
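To check a local copy of the script against the SHA-256 hash recorded with a result, something like the following would work. This is a sketch using only the standard library; the name of the JSON field that stores the hash is not shown in this README, so the recorded value is passed in directly, and both function names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def script_matches(path, recorded_hash):
    """Compare the local script against the hash stored with a published result."""
    return sha256_of_file(path) == recorded_hash.strip().lower()
```

For example, `script_matches("benchmark.py", recorded)` returns True only when the local file is byte-for-byte identical to the one that produced the result.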
Each round picks a random prompt from a pool of 25 CS/engineering topics, appends a unique seed to bust caches, and sends it to every model×provider combination via streaming SSE. Provider order is randomized per round to reduce ordering bias. Warmup rounds are discarded from the final statistics.
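The round structure described above can be sketched as follows. This is an illustration under assumptions, not the script's actual code: the two prompts stand in for the 25-topic pool, the cache-busting suffix format is invented, and `plan_round` is a hypothetical name.

```python
import random


def plan_round(rng, prompts, combos):
    """Pick one prompt, tag it with a cache-busting nonce, and shuffle the call order."""
    prompt = rng.choice(prompts)
    nonce = f"{rng.getrandbits(64):016x}"  # suffix format is an assumption
    order = list(combos)
    rng.shuffle(order)  # fresh ordering every round to reduce ordering bias
    return f"{prompt}\n\n[run-id: {nonce}]", order


# Stand-ins for the real prompt pool and the model x provider matrix.
prompts = ["Explain how B-trees work.", "Describe TCP congestion control."]
combos = [("gpt-4.1", "openai"), ("gpt-4.1", "openrouter"), ("gpt-4.1", "opper")]

rng = random.Random(42)  # one seeded RNG drives both prompt choice and ordering
for round_no in range(3):
    prompt, order = plan_round(rng, prompts, combos)
    print(round_no, order[0], prompt.splitlines()[0])
```

Driving every random choice from a single seeded generator is what makes a rerun with the same seed replay the same prompt sequence and provider ordering.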
Quality checks: Each response is verified for minimum token count, paragraph structure, and prompt keyword relevance. Quality failures are reported in the summary.
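A check along these lines might look like the sketch below. The thresholds, the issue labels, and the use of whitespace-separated words as a token proxy are all assumptions made for illustration; the script's real criteria may differ.

```python
def quality_issues(text, prompt_keywords, min_tokens=50, min_paragraphs=2):
    """Return a list of failed checks; an empty list means the response passed."""
    issues = []

    # Whitespace-separated words serve as a rough stand-in for token count.
    if len(text.split()) < min_tokens:
        issues.append("too_short")

    # Paragraphs are taken to be blocks separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) < min_paragraphs:
        issues.append("no_paragraph_structure")

    # Relevance: at least one prompt keyword must appear in the response.
    lowered = text.lower()
    if not any(keyword.lower() in lowered for keyword in prompt_keywords):
        issues.append("off_topic")

    return issues


good = ("B-trees keep keys sorted. " * 20) + "\n\n" + ("Nodes split when full. " * 20)
print(quality_issues(good, ["B-tree"]))            # []
print(quality_issues("Sorry, I can't.", ["B-tree"]))
```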
Statistical method: Bootstrap confidence intervals (10,000 resamples) are used because latency distributions are right-skewed. Pairwise comparisons note when CIs overlap, meaning the difference is not statistically significant.
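For readers unfamiliar with the technique, a percentile bootstrap for the median can be sketched in a few lines. This is a generic illustration using only the standard library, not the script's own implementation; the function names and the sample data are made up.

```python
import random
import statistics


def bootstrap_median_ci(values, resamples=10_000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(resamples)
    )
    alpha = (1 - confidence) / 2
    lower = medians[int(alpha * resamples)]
    upper = medians[min(resamples - 1, int((1 - alpha) * resamples))]
    return statistics.median(values), lower, upper


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 3.5]  # right-skewed, made-up sample
median, low, high = bootstrap_median_ci(latencies)
print(f"median={median} CI=({low}, {high})")
```

The bootstrap makes no normality assumption, which suits right-skewed latency data where a mean plus or minus a standard error would be misleading.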
- Results are measured from a single geographic location. Latency will vary by region.
- Sequential requests only — does not test concurrent/burst performance.
- Performance varies by time of day and day of week. Use --analyze to compare multiple runs.
- Token counts may vary slightly across providers due to different tokenizer implementations.
- Pricing is hardcoded and may become stale. Verify against current provider pricing pages.