Provider Benchmark

Benchmark OpenAI-compatible chat completion endpoints across multiple models and providers. Compares latency, throughput, cost-efficiency, and consistency using streaming responses.

Disclosure: This benchmark is maintained by Opper. We encourage independent reproduction — see How to reproduce below.

Models

Model	Tier	Output $/1M tokens	Providers
`gpt-4.1`	Flagship	$8.00	OpenAI, OpenRouter, Opper
`gpt-4.1-mini`	Fast	$1.60	OpenAI, OpenRouter, Opper
`gpt-4.1-nano`	Ultra-fast	$0.40	OpenAI, OpenRouter, Opper
`claude-sonnet-4-6`	Flagship	$15.00	OpenRouter, Opper, Bedrock
`claude-haiku-4-5`	Fast	$5.00	OpenRouter, Opper, Bedrock
`gemini-2.5-pro`	Flagship	$10.00	OpenRouter, Opper
`gemini-2.5-flash`	Fast	$2.50	OpenRouter, Opper

Providers

Provider	Type	Endpoint / AWS region	Auth
OpenAI	Direct	`api.openai.com`	API key
OpenRouter	Router	`openrouter.ai`	API key
Opper	Router	`api.opper.ai`	API key
Bedrock US-East	Direct (regional)	`us-east-1`	AWS credentials
Bedrock US-West	Direct (regional)	`us-west-2`	AWS credentials
Bedrock EU-Ireland	Direct (regional)	`eu-west-1`	AWS credentials
Bedrock EU-Frankfurt	Direct (regional)	`eu-central-1`	AWS credentials
Bedrock AP-Tokyo	Direct (regional)	`ap-northeast-1`	AWS credentials

Metrics

TTFB -- time to first content token (seconds)
Total -- full request duration (seconds)
Tokens/sec -- generation throughput (excludes TTFB)
CV -- coefficient of variation (consistency: lower = more predictable)
P95/P99 -- tail latency percentiles
Price-performance -- tokens/sec per dollar (throughput efficiency relative to cost)

Results are reported as medians with 95% bootstrap confidence intervals. When two providers' CIs overlap, treat the difference as not statistically significant; this is a conservative rule of thumb rather than a formal test.
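The metric definitions above can be sketched in a few lines of Python. This is an illustration, not the code in benchmark.py: the function names are hypothetical, the percentile uses linear interpolation, and the price-performance formula assumes "per dollar" means the output price per 1M tokens from the Models table.

```python
import statistics


def stream_metrics(t_start, t_first_token, t_end, output_tokens):
    """Derive per-request metrics from three timestamps (seconds)."""
    ttfb = t_first_token - t_start
    total = t_end - t_start
    gen_time = t_end - t_first_token  # generation window, excludes TTFB
    tokens_per_sec = output_tokens / gen_time if gen_time > 0 else 0.0
    return {"ttfb": ttfb, "total": total, "tokens_per_sec": tokens_per_sec}


def coefficient_of_variation(samples):
    """CV = sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.fmean(samples)


def percentile(samples, p):
    """Linear-interpolation percentile, with p in [0, 100]."""
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * p / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def price_performance(tokens_per_sec, usd_per_million_output_tokens):
    """Assumption: throughput divided by the output price per 1M tokens."""
    return tokens_per_sec / usd_per_million_output_tokens
```

Under these assumptions, a request that starts at t=0, yields its first token at 0.5 s, and finishes 200 tokens at 2.5 s has a TTFB of 0.5 s and a throughput of 100 tokens/sec.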

Setup

pip install httpx
pip install boto3  # optional, only needed for Bedrock providers

Set API keys as environment variables or in a .env file alongside benchmark.py:

OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-...
OPPER_API_KEY=...

Only keys for the providers you want to test are required.

For Bedrock providers, configure AWS credentials via aws configure, environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), or an IAM role.
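As a rough illustration of how the keys above could decide which providers get tested, the sketch below parses .env-style text and reports the providers whose key is set. The helper names are hypothetical, and the precedence rule (real environment variables override the .env file) is an assumption, not confirmed behavior of benchmark.py.

```python
import os

# Environment variable names taken from the setup section above.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "opper": "OPPER_API_KEY",
}


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def available_providers(env_text="", environ=None):
    """Return the providers that have a non-empty API key configured."""
    environ = dict(os.environ if environ is None else environ)
    # Assumption: values from the real environment win over the .env file.
    merged = {**parse_env_text(env_text), **environ}
    return sorted(name for name, var in PROVIDER_KEYS.items() if merged.get(var))
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.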

Usage

# Default: all models × all providers, 30 rounds + 2 warmup
python benchmark.py

# Specific models and providers
python benchmark.py --models gpt-4.1,gpt-4.1-mini --providers openai,opper

# Only flagship models across all providers that serve them
python benchmark.py --models gpt-4.1,claude-sonnet-4-6,gemini-2.5-pro

# Compare OpenAI direct vs routed (routing overhead analysis)
python benchmark.py --models gpt-4.1 --providers openai,openrouter,opper

# Geographic comparison: same model across AWS regions
python benchmark.py --models claude-sonnet-4-6 --providers bedrock-us-east-1,bedrock-eu-west-1,bedrock-ap-northeast-1

# Direct vs routed vs regional: Claude through every path
python benchmark.py --models claude-sonnet-4-6 --providers opper,openrouter,bedrock-us-east-1,bedrock-eu-west-1

# Structured JSON output mode (measures "JSON tax")
python benchmark.py --models gpt-4.1 --json-mode

# Throughput scaling: sweep across output lengths
python benchmark.py --models gpt-4.1-mini --sweep-tokens 256,512,1024,2048

# Custom iterations and max tokens
python benchmark.py --iterations 50 --max-tokens 1024

# Reproducible run with fixed seed
python benchmark.py --seed 42

# JSON output to stdout (for piping)
python benchmark.py --json

# Analyze multiple result files for cross-run variance
python benchmark.py --analyze ./

Generating charts

make_charts.py reads any results-*.json, results-cache-*.json, results-sweep-*.json, or results/<run>/benchmark.json file in a directory and renders matplotlib bar/line charts (with 95% CI whiskers) plus a sidecar .txt listing the exact numbers each chart was built from.

pip install matplotlib  # not a hard dep of benchmark.py
python make_charts.py --results-dir . --out-dir charts/

File shape is auto-detected, so the same command produces router-vs-direct, Bedrock regional, backend-heterogeneity, prompt-cache-by-turn, and token-sweep charts depending on what's in the directory.

Key questions this benchmark answers

The Router Tax -- How much latency does routing add vs direct API access?
The Model Ladder -- What's the speed multiplier when dropping from flagship to mini/nano?
Cross-Family Showdown -- Is Claude Haiku faster than GPT-4.1-mini? Is Gemini Flash faster than both?
Geographic Reality -- How does latency differ between US, EU, and Asia regions? (Bedrock)
The Throughput Curve -- Does throughput hold steady as output gets longer? (--sweep-tokens)
Consistency -- Which providers have predictable performance vs wild swings? (CV metric)
The JSON Tax -- How much does structured output mode cost in speed? (--json-mode)
Price-Performance -- Which model gives the most tokens/sec per dollar?

How to reproduce

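Every run prints its seed. To reproduce the same prompt sequence and provider ordering (measured timings will still vary), pass that seed back together with the same iteration count and model list: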

python benchmark.py --seed <SEED> --iterations <N> --models <MODELS>

The JSON output includes:

SHA-256 hash of the benchmark script (verify it hasn't been modified)
Git commit SHA
All raw measurements (not just aggregates)
Environment metadata (Python version, OS, location)
The seed used
Pricing and cost-efficiency data
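A minimal sketch of how part of this provenance data could be collected with the standard library. The field names are hypothetical and may not match the actual JSON output; the Git commit SHA and location are left out because they need external tooling.

```python
import hashlib
import platform


def script_sha256(path):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_metadata(script_path, seed):
    """Collect provenance fields (hypothetical names) for a benchmark run."""
    return {
        "script_sha256": script_sha256(script_path),
        "python_version": platform.python_version(),
        "os": platform.system(),
        "seed": seed,
    }
```

To verify a published result, compute `script_sha256("benchmark.py")` on your own checkout and compare it with the hash stored in the result file.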

For published results, we recommend --iterations 50 or higher.

How it works

Each round picks a random prompt from a pool of 25 CS/engineering topics, appends a unique seed to bust caches, and sends it to every model×provider combination via streaming SSE. Provider order is randomized per round to reduce ordering bias. Warmup rounds are discarded from the final statistics.
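The round structure described above can be sketched with a seeded random generator. This is an illustration under assumptions: the function names, the cache-busting suffix format, and the dictionary layout are hypothetical and may differ from benchmark.py.

```python
import random


def plan_rounds(prompts, combos, iterations, warmup, seed):
    """Build a deterministic schedule of warmup and measured rounds."""
    rng = random.Random(seed)  # one seeded generator makes the plan reproducible
    rounds = []
    for index in range(warmup + iterations):
        topic = rng.choice(prompts)
        # Round index plus random bits gives a suffix unique within the run.
        nonce = f"{index}-{rng.getrandbits(32):08x}"
        order = list(combos)
        rng.shuffle(order)  # fresh provider order every round
        rounds.append({
            "index": index,
            "warmup": index < warmup,
            "prompt": f"{topic} [run-id: {nonce}]",
            "order": order,
        })
    return rounds


def measured(rounds):
    """Drop warmup rounds before computing statistics."""
    return [r for r in rounds if not r["warmup"]]
```

Calling `plan_rounds` twice with the same seed and arguments yields the same prompts and orderings, which is the property the `--seed` flag relies on.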

Quality checks: Each response is verified for minimum token count, paragraph structure, and prompt keyword relevance. Quality failures are reported in the summary.
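One way such checks could look is shown below. The thresholds, the whitespace-based token estimate, and the failure labels are assumptions chosen for illustration; the actual checks in benchmark.py may be stricter or use different rules.

```python
import re


def check_quality(prompt_keywords, response, min_tokens=50, min_paragraphs=2):
    """Return a list of failure labels; an empty list means the response passed."""
    failures = []

    # Whitespace-separated words are a crude stand-in for a real token count.
    if len(response.split()) < min_tokens:
        failures.append("too_short")

    # Paragraphs are blocks of text separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", response) if p.strip()]
    if len(paragraphs) < min_paragraphs:
        failures.append("no_paragraph_structure")

    # Relevance: at least one prompt keyword must appear in the response.
    lowered = response.lower()
    if not any(keyword.lower() in lowered for keyword in prompt_keywords):
        failures.append("off_topic")

    return failures
```

Returning labels rather than a single boolean makes it possible to report which check failed in the run summary.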

Statistical method: Bootstrap confidence intervals (10,000 resamples) are used because latency distributions are right-skewed, so intervals that assume normality would be misleading. Pairwise comparisons flag overlapping CIs, which the benchmark treats as no statistically significant difference.
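For readers unfamiliar with the technique, here is a minimal sketch of a percentile bootstrap for the median, plus the overlap check used in pairwise comparisons. It assumes the simple percentile method; benchmark.py may use a different bootstrap variant, and the function names are hypothetical.

```python
import random
import statistics


def bootstrap_median_ci(samples, resamples=10_000, confidence=0.95, seed=None):
    """Percentile-bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    alpha = (1 - confidence) / 2
    lower = medians[int(alpha * resamples)]
    upper = medians[min(resamples - 1, int((1 - alpha) * resamples))]
    return statistics.median(samples), lower, upper


def intervals_overlap(a, b):
    """True if the closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Because the interval is read off the resampled medians directly, it can be asymmetric around the point estimate, which suits right-skewed latency data.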

Limitations

Results are measured from a single geographic location. Latency will vary by region.
Sequential requests only — does not test concurrent/burst performance.
Performance varies by time of day and day of week. Use --analyze to compare multiple runs.
Token counts may vary slightly across providers due to different tokenizer implementations.
Pricing is hardcoded and may become stale. Verify against current provider pricing pages.
