opper-ai/provider-benchmark

Provider Benchmark

Benchmark OpenAI-compatible chat completion endpoints across multiple models and providers. Compares latency, throughput, cost-efficiency, and consistency using streaming responses.

Disclosure: This benchmark is maintained by Opper. We encourage independent reproduction — see How to reproduce below.

Models

| Model | Tier | Output $/1M tokens | Providers |
| --- | --- | --- | --- |
| gpt-4.1 | Flagship | $8.00 | OpenAI, OpenRouter, Opper |
| gpt-4.1-mini | Fast | $1.60 | OpenAI, OpenRouter, Opper |
| gpt-4.1-nano | Ultra-fast | $0.40 | OpenAI, OpenRouter, Opper |
| claude-sonnet-4-6 | Flagship | $15.00 | OpenRouter, Opper, Bedrock |
| claude-haiku-4-5 | Fast | $5.00 | OpenRouter, Opper, Bedrock |
| gemini-2.5-pro | Flagship | $10.00 | OpenRouter, Opper |
| gemini-2.5-flash | Fast | $2.50 | OpenRouter, Opper |

Providers

| Provider | Type | Endpoint / Region | Auth |
| --- | --- | --- | --- |
| OpenAI | Direct | api.openai.com | API key |
| OpenRouter | Router | openrouter.ai | API key |
| Opper | Router | api.opper.ai | API key |
| Bedrock US-East | Direct (regional) | us-east-1 | AWS credentials |
| Bedrock US-West | Direct (regional) | us-west-2 | AWS credentials |
| Bedrock EU-Ireland | Direct (regional) | eu-west-1 | AWS credentials |
| Bedrock EU-Frankfurt | Direct (regional) | eu-central-1 | AWS credentials |
| Bedrock AP-Tokyo | Direct (regional) | ap-northeast-1 | AWS credentials |

Metrics

  • TTFB -- time to first content token (seconds)
  • Total -- full request duration (seconds)
  • Tokens/sec -- generation throughput (excludes TTFB)
  • CV -- coefficient of variation (consistency: lower = more predictable)
  • P95/P99 -- tail latency percentiles
  • Price-performance -- tokens/sec per dollar (throughput efficiency relative to cost)

Results are reported as medians with 95% bootstrap confidence intervals. When the CIs of two providers overlap, the difference is treated as not statistically significant.
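
The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the code in benchmark.py: the function names are hypothetical, the percentile uses the nearest-rank definition (one of several in common use), CV is taken as sample standard deviation over mean, and price-performance is assumed to mean tokens/sec divided by the output price per 1M tokens.

```python
import math
import statistics


def request_metrics(start, first_token_at, end, output_tokens):
    """Per-request metrics from three timestamps (seconds) and a token count."""
    ttfb = first_token_at - start
    total = end - start
    # Throughput excludes the wait for the first token.
    generation_time = total - ttfb
    tokens_per_sec = output_tokens / generation_time if generation_time > 0 else 0.0
    return {"ttfb": ttfb, "total": total, "tokens_per_sec": tokens_per_sec}


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean; lower is more consistent."""
    return statistics.stdev(samples) / statistics.mean(samples)


def percentile(samples, pct):
    """Nearest-rank percentile (assumed definition), e.g. pct=95 for P95."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def price_performance(tokens_per_sec, usd_per_million_output_tokens):
    """Assumed definition: throughput divided by output price per 1M tokens."""
    return tokens_per_sec / usd_per_million_output_tokens
```

With these definitions, a request that waits 0.5 s for its first token and then streams 200 tokens over the next 2 s has a throughput of 100 tokens/sec, regardless of how long the first token took.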

Setup

pip install httpx
pip install boto3  # optional, only needed for Bedrock providers

Set API keys as environment variables or in a .env file alongside benchmark.py:

OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-...
OPPER_API_KEY=...

Only keys for the providers you want to test are required.
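
As a rough sketch of how key discovery could work, the snippet below parses .env-style text and reports which providers have a key available. The helper names are hypothetical, and the precedence rule (real environment variables override the .env file) is an assumption rather than documented behavior of benchmark.py.

```python
import os

# Provider names as used with --providers, mapped to their key variables.
KEY_BY_PROVIDER = {
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "opper": "OPPER_API_KEY",
}


def parse_env_text(text):
    """Parse KEY=value lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values


def available_providers(env_file_values, environ=None):
    """Return providers that have a non-empty key in the file or environment."""
    environ = os.environ if environ is None else environ
    # Assumption: a variable set in the real environment wins over the file.
    merged = {**env_file_values, **environ}
    return [name for name, key in KEY_BY_PROVIDER.items() if merged.get(key)]
```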

For Bedrock providers, configure AWS credentials via aws configure, environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), or an IAM role.

Usage

# Default: all models × all providers, 30 rounds + 2 warmup
python benchmark.py

# Specific models and providers
python benchmark.py --models gpt-4.1,gpt-4.1-mini --providers openai,opper

# Only flagship models across all providers
python benchmark.py --models gpt-4.1,claude-sonnet-4-6,gemini-2.5-pro

# Compare OpenAI direct vs routed (routing overhead analysis)
python benchmark.py --models gpt-4.1 --providers openai,openrouter,opper

# Geographic comparison: same model across AWS regions
python benchmark.py --models claude-sonnet-4-6 --providers bedrock-us-east-1,bedrock-eu-west-1,bedrock-ap-northeast-1

# Direct vs routed vs regional: Claude through every path
python benchmark.py --models claude-sonnet-4-6 --providers opper,openrouter,bedrock-us-east-1,bedrock-eu-west-1

# Structured JSON output mode (measures "JSON tax")
python benchmark.py --models gpt-4.1 --json-mode

# Throughput scaling: sweep across output lengths
python benchmark.py --models gpt-4.1-mini --sweep-tokens 256,512,1024,2048

# Custom iterations and max tokens
python benchmark.py --iterations 50 --max-tokens 1024

# Reproducible run with fixed seed
python benchmark.py --seed 42

# JSON output to stdout (for piping)
python benchmark.py --json

# Analyze multiple result files for cross-run variance
python benchmark.py --analyze ./

Generating charts

make_charts.py reads any results-*.json, results-cache-*.json, results-sweep-*.json, or results/<run>/benchmark.json file in a directory and renders matplotlib bar/line charts (with 95% CI whiskers) plus a sidecar .txt listing the exact numbers each chart was built from.

pip install matplotlib  # not a hard dep of benchmark.py
python make_charts.py --results-dir . --out-dir charts/

File shape is auto-detected, so the same command produces router-vs-direct, Bedrock regional, backend-heterogeneity, prompt-cache-by-turn, and token-sweep charts depending on what's in the directory.

Key questions this benchmark answers

  1. The Router Tax -- How much latency does routing add vs direct API access?
  2. The Model Ladder -- What's the speed multiplier when dropping from flagship to mini/nano?
  3. Cross-Family Showdown -- Is Claude Haiku faster than GPT-4.1-mini? Is Gemini Flash faster than both?
  4. Geographic Reality -- How does latency differ between US, EU, and Asia regions? (Bedrock)
  5. The Throughput Curve -- Does throughput hold steady as output gets longer? (--sweep-tokens)
  6. Consistency -- Which providers have predictable performance vs wild swings? (CV metric)
  7. The JSON Tax -- How much does structured output mode cost in speed? (--json-mode)
  8. Price-Performance -- Which model gives the most tokens/sec per dollar?

How to reproduce

Every run prints a seed value. To reproduce the exact same prompt sequence and provider ordering:

python benchmark.py --seed <SEED> --iterations <N> --models <MODELS>

The JSON output includes:

  • SHA-256 hash of the benchmark script (verify it hasn't been modified)
  • Git commit SHA
  • All raw measurements (not just aggregates)
  • Environment metadata (Python version, OS, location)
  • The seed used
  • Pricing and cost-efficiency data
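
To check a results file against a local copy of the script, the recorded hash can be compared with a freshly computed one. The sketch below shows only the hashing step; the function names are hypothetical, and how the hash is keyed inside the JSON output is not assumed here, so the recorded value is passed in as a plain string.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def script_matches(path, recorded_hash):
    """True if the file's digest equals the hash recorded with the results."""
    return sha256_of_file(path) == recorded_hash.strip().lower()
```

A mismatch means the local script differs from the one that produced the results, so timing differences may come from the code rather than the providers.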

For published results, we recommend --iterations 50 or higher.

How it works

Each round picks a random prompt from a pool of 25 CS/engineering topics, appends a unique seed to bust caches, and sends it to every model×provider combination via streaming SSE. Provider order is randomized per round to reduce ordering bias. Warmup rounds are discarded from the final statistics.
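
The round structure described above can be sketched as a seeded plan. This is a simplified illustration under stated assumptions: plan_rounds and the "[run-id: ...]" suffix format are hypothetical, the shuffle is applied to model×provider pairs rather than to providers alone, and it ignores the fact that not every model is offered by every provider.

```python
import itertools
import random


def plan_rounds(prompts, models, providers, rounds, warmup, seed):
    """Build a reproducible schedule: one prompt and one shuffled order per round."""
    rng = random.Random(seed)  # a fixed seed reproduces prompts and ordering
    plan = []
    for index in range(warmup + rounds):
        topic = rng.choice(prompts)
        nonce = rng.getrandbits(32)  # unique-looking suffix to defeat caching
        combos = list(itertools.product(models, providers))
        rng.shuffle(combos)  # randomize order to reduce ordering bias
        plan.append({
            "round": index,
            "warmup": index < warmup,  # warmup rounds are excluded from stats
            "prompt": f"{topic} [run-id: {nonce:08x}]",
            "order": combos,
        })
    return plan
```

Drawing the prompt, the suffix, and the ordering from one seeded generator is what makes a run repeatable from the seed alone.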

Quality checks: Each response is verified for minimum token count, paragraph structure, and prompt keyword relevance. Quality failures are reported in the summary.
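
A minimal sketch of such checks follows. The thresholds, issue labels, and the quality_issues name are all hypothetical; the actual rules in benchmark.py may differ.

```python
import re


def quality_issues(response_text, prompt_keywords, token_count,
                   min_tokens=50, min_paragraphs=2):
    """Return a list of failed checks; an empty list means the response passed."""
    issues = []
    # Minimum token count (threshold is an assumed value).
    if token_count < min_tokens:
        issues.append("too_short")
    # Paragraph structure: count blocks separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", response_text) if p.strip()]
    if len(paragraphs) < min_paragraphs:
        issues.append("no_paragraph_structure")
    # Keyword relevance: at least one prompt keyword must appear.
    lowered = response_text.lower()
    if not any(keyword.lower() in lowered for keyword in prompt_keywords):
        issues.append("off_topic")
    return issues
```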

Statistical method: Bootstrap confidence intervals (10,000 resamples) are used because latency distributions are right-skewed, which makes intervals based on a normal approximation unreliable. Pairwise comparisons flag overlapping CIs, which the benchmark treats as no statistically significant difference.
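
For illustration, a bootstrap interval for the median can be computed as below. This sketch assumes the percentile bootstrap; the README does not say which bootstrap variant benchmark.py uses, and the function names are hypothetical.

```python
import random
import statistics


def bootstrap_median_ci(samples, resamples=10_000, confidence=0.95, seed=None):
    """Return (median, low, high) using a percentile bootstrap of the median."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    alpha = (1 - confidence) / 2
    low = medians[int(alpha * resamples)]
    high = medians[min(resamples - 1, int((1 - alpha) * resamples))]
    return statistics.median(samples), low, high


def intervals_overlap(a, b):
    """True if two (median, low, high) results have overlapping intervals."""
    return a[1] <= b[2] and b[1] <= a[2]
```

Overlap is a conservative criterion: non-overlapping 95% intervals indicate a real difference, while overlapping intervals only mean the data cannot separate the two providers at that level.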

Limitations

  • Results are measured from a single geographic location. Latency will vary by region.
  • Sequential requests only — does not test concurrent/burst performance.
  • Performance varies by time of day and day of week. Use --analyze to compare multiple runs.
  • Token counts may vary slightly across providers due to different tokenizer implementations.
  • Pricing is hardcoded and may become stale. Verify against current provider pricing pages.

About

Benchmark different gateways and providers and compare them to Opper.