bench: endpoint benchmark harness for prod ↔ preview cutover comparison #101

Draft
alastairong1 wants to merge 12 commits into main from bench/endpoint-harness


@alastairong1 alastairong1 commented May 6, 2026

Summary

Adds a local-only benchmark harness under bench/ that compares response speed and reliability of every safe-to-bench endpoint between api.preview.st0x.io and api.st0x.io. Used today to gate the preview→prod cutover; structured so it can be wired into CI later (deferred).

  • 15 idempotent read endpoints covered (mutating endpoints excluded by design)
  • Tooling: oha (load gen) + jq + bash. No Rust code touched.
  • Auto-discovers fixtures (token addresses, an order hash, an owner, a tx hash) per host
  • Runs at 0.8 qps × 1 concurrency × 30 reqs/endpoint to stay under the 60 rpm per-key rate limit (`config/rest-api.toml`)
  • Emits per-host JSON + a side-by-side markdown report with advisory regression thresholds (p95 +25%, success drop −2pp)

bench/.env and bench/results/* are gitignored.
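
The advisory thresholds in the last bullet can be read as a simple per-endpoint comparison. The sketch below is illustrative only and is not the harness code (which is bash + jq): the function name, the choice of prod as the baseline, and the strict `>` comparisons are assumptions.

```python
from typing import Optional

# Thresholds quoted in the summary above.
P95_REGRESSION_PCT = 25.0  # flag when candidate p95 is more than 25% above baseline
SUCCESS_DROP_PP = 2.0      # flag when success rate falls by more than 2 percentage points


def advisory_flags(
    prod_p95: Optional[float],
    preview_p95: Optional[float],
    prod_success_pct: float,
    preview_success_pct: float,
) -> list[str]:
    """Return advisory regression flags for one endpoint (hypothetical helper)."""
    flags = []
    # Skip the latency check when either percentile is missing or the baseline is zero.
    if prod_p95 and preview_p95 is not None:
        change_pct = (preview_p95 - prod_p95) / prod_p95 * 100.0
        if change_pct > P95_REGRESSION_PCT:
            flags.append(f"p95 +{change_pct:.0f}%")
    # Success rates are percentages, so the difference is in percentage points.
    drop_pp = prod_success_pct - preview_success_pct
    if drop_pp > SUCCESS_DROP_PP:
        flags.append(f"success -{drop_pp:.1f}pp")
    return flags
```

Because the thresholds are advisory, a non-empty result would only annotate the report; it would not fail the run.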

Cutover findings (using this harness)

Preview build vs current prod, apples-to-apples (same fixtures both hosts):

Reliability — preview fixes broken-on-prod functionality:

  • `/v1/trades/{owner}` → 500 on prod, 200 on preview
  • `/v1/trades/tx/{tx_hash}` → 500 on prod, 200 on preview
  • `/v1/trades/token/{addr}`, `/v1/trades/taker/{addr}` → 404 on prod, 200 on preview
  • `/health/detailed`, `/registry/history`, `/v1/trades/batch` → 404 on prod (routes not deployed), 200 on preview

Performance:

  • `/v1/orders/token/{addr}` p95: 4.8 s on prod → 1.1 s on preview (4.5× faster)
  • Working-on-both endpoints (`health`, `tokens`, `registry`) comparable

Non-blocking follow-ups:

  • `order-by-hash` p95 ≈ 11 s on both hosts (RPC multicall live-quote refresh — UX concern, not a regression)
  • `swap-quote` / `swap-calldata` / `trades-batch` body templates in `endpoints.toml` are wrong (both hosts return 422 identically). Harness gap, not server issue.

Usage

```bash
cp bench/.env.example bench/.env # fill in BENCH_USER and BENCH_PASS
cargo install oha # not yet in nix shell
bench/all.sh # ~21 min for both hosts

# → bench/results/-compare.md

```

See `bench/README.md` for tunables and what's deferred.

Test plan

  • Smoke test against preview with auth — all 15 endpoints return real data, schema parses correctly
  • Verified rate-limit budget: 0.8 qps stays under per-key 60 rpm; spot 429s on cached endpoints only
  • Verified bug fixes hold (oha 1.14 JSON schema, null percentile handling, success-rate derivation from status codes not oha's transport-level field)
  • Full apples-to-apples bench against both hosts produced actionable cutover findings (above)
  • CI integration — deferred per scope decision; `bench/all.sh` exits 0 on advisory regressions, hooks ready for future job

🤖 Generated with Claude Code

alastairong1 and others added 8 commits May 6, 2026 15:00
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… README)

- Fix 1: guard empty discover_keys loop so empty string isn't iterated
- Fix 2: null oha percentiles pass through as null (not 0); statusCodeDistribution null survives to_entries via type guard
- Fix 3: derive report timestamp from prod result filename so all artifacts share a stem
- Fix 5: copy oha .err file to bench/results/.<name>.err on non-zero exit for post-mortem
- Fix 7: README correctly states oha is not in nix develop; python3 version note
- Fix 8: clearer Python error when neither tomllib nor tomli are available

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
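
Fix 2 above is about not turning missing data into misleading zeros. A minimal Python sketch of the same idea follows; the harness itself does this in jq, and the `latencyPercentiles` key and the `p50`/`p95`/`p99` names are assumptions about oha's JSON output that should be checked against the installed version.

```python
import json


def summarize(oha_result: dict) -> dict:
    """Extract percentiles and status codes without coercing nulls (illustrative)."""
    # A missing or null percentile block becomes an empty dict, so lookups give None.
    percentiles = oha_result.get("latencyPercentiles") or {}
    # Type guard: a null distribution is treated as "no responses", not an error.
    dist = oha_result.get("statusCodeDistribution")
    if not isinstance(dist, dict):
        dist = {}
    return {
        "p50": percentiles.get("p50"),  # None stays None, never 0
        "p95": percentiles.get("p95"),
        "p99": percentiles.get("p99"),
        "status_codes": dist,
    }


if __name__ == "__main__":
    # A run where every request failed at transport level: no latencies recorded.
    empty_run = {"latencyPercentiles": {"p50": None, "p95": None}, "statusCodeDistribution": None}
    print(json.dumps(summarize(empty_run)))
```

Keeping `None` (serialized as `null`) lets the report show a missing percentile as missing instead of as a 0-second latency.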
coderabbitai Bot commented May 6, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 103c92ba-2e76-4036-8d47-6dd8acf2bc77

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

- oha 1.14 replaced -j with --output-format json
- Summary now derives total_requests from statusCodeDistribution sum
  (oha doesn't emit a top-level request count) and uses summary.successRate
  directly when present
- Discover now reads camelCase fields (orderHash, txHash) and uses the
  trades-by-token endpoint to source orderHash + txHash, since orders can
  exist with zero trades
- Added defensive skip when summary computation produces empty output
- oha's summary.successRate counts 4xx/5xx as successful (it means
  'transport-level success'). Derive HTTP-level success from
  statusCodeDistribution where status < 400. Without this fix the
  harness reported 100% success on endpoints that were returning 404
  or 429 on every request.
- Surface the per-endpoint status-code histogram in the summary block
  so failures aren't masked by rolled-up percentages.
- Discover now walks every token in /v1/tokens until one has at least
  one trade, so order_hash and tx_hash fixtures populate even when the
  first token has no associated trades.
- Use a portable while-read array idiom (mapfile is bash 4+ only).
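
The success-rate and request-count derivations described in these commit notes can be sketched as follows. This is an illustration in Python, not the harness's jq code; the function name and the output keys are hypothetical, while `statusCodeDistribution` and the `status < 400` rule come from the notes above.

```python
from typing import Optional


def http_success(status_code_distribution: Optional[dict]) -> dict:
    """Derive HTTP-level success from a {"<status>": count} histogram (illustrative)."""
    dist = status_code_distribution or {}
    # oha does not emit a top-level request count, so sum the histogram instead.
    total = sum(dist.values())
    if total == 0:
        # No responses at all: report an unknown rate instead of 0% or 100%.
        return {"total_requests": 0, "success_rate": None, "status_codes": {}}
    # Only statuses below 400 count as success; 404 and 429 are failures here.
    ok = sum(count for code, count in dist.items() if int(code) < 400)
    return {
        "total_requests": total,
        "success_rate": ok / total,
        "status_codes": dict(dist),  # keep the histogram so failures stay visible
    }
```

For example, a run of 30 requests that all return 404 yields a success rate of 0.0 here, where a transport-level measure would have reported 100%.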
…env)

Discovery now writes a target-specific fixture file so each host benches
against values that exist in its own data, avoiding cross-host 404s.
all.sh now runs discovery against both hosts before benching.
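
A rough sketch of per-host discovery, combined with the token walk described earlier, is shown below. It is not the harness implementation: `fetch_json` is a hypothetical callable bound to one host, the response shapes (a list of tokens with an `address` field, a list of trades) are assumptions, and only the `orderHash` and `txHash` field names and the two endpoint paths come from this PR. Owner discovery is omitted.

```python
from typing import Callable, Optional


def discover_fixtures(fetch_json: Callable[[str], object]) -> Optional[dict]:
    """Walk tokens until one has a trade, then return fixtures for this host."""
    tokens = fetch_json("/v1/tokens") or []
    for token in tokens:
        addr = token.get("address")  # assumed field name
        if not addr:
            continue
        trades = fetch_json(f"/v1/trades/token/{addr}") or []
        if trades:
            first = trades[0]
            return {
                "token": addr,
                "order_hash": first.get("orderHash"),
                "tx_hash": first.get("txHash"),
            }
    # No token on this host has any trades; the caller decides how to proceed.
    return None
```

Calling this once per host, each with its own `fetch_json`, gives each host fixtures that exist in its own data, which is the point of the target-specific fixture file.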
…te limit

The API's per-key rate limit is 60 rpm and the global limit is 600 rpm
(see config/rest-api.toml). The previous defaults (50 reqs at concurrency 5)
ran at ~25 req/sec per endpoint and produced 429-saturated results on every
endpoint touching uncached data. The fault lay with the harness settings, not
with slow servers.

New defaults:
  BENCH_REQUESTS=30, BENCH_CONCURRENCY=1, BENCH_QPS=0.8,
  BENCH_INTER_ENDPOINT_SLEEP=5

This stays under the 60 rpm budget on a rolling 60s window across all 15
endpoints. One host completes in ~10 min; both hosts in ~21 min.

Override via env vars for soak / ceiling tests when running with a
high-rate-limit key.
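
The budget and runtime figures above can be checked with a little arithmetic. The sketch below is a back-of-envelope model, not part of the harness; it assumes endpoints run sequentially, that the sleep follows every endpoint, and that response latency never slows pacing below the target QPS.

```python
def bench_budget(
    requests: int = 30,                  # BENCH_REQUESTS
    qps: float = 0.8,                    # BENCH_QPS
    inter_endpoint_sleep: float = 5.0,   # BENCH_INTER_ENDPOINT_SLEEP, seconds
    endpoints: int = 15,
    hosts: int = 2,
    per_key_rpm_limit: float = 60.0,
) -> dict:
    """Estimate request rate and wall-clock time for the default settings."""
    rpm = qps * 60.0                                  # 0.8 qps is about 48 requests per minute
    per_endpoint_s = requests / qps + inter_endpoint_sleep   # about 37.5 s + 5 s
    per_host_min = per_endpoint_s * endpoints / 60.0
    return {
        "rpm": rpm,
        "under_limit": rpm < per_key_rpm_limit,
        "per_host_min": per_host_min,
        "all_hosts_min": per_host_min * hosts,
    }


if __name__ == "__main__":
    print(bench_budget())
```

With the defaults this gives about 48 rpm, roughly 10.6 minutes per host and 21.25 minutes for both, in line with the ~10 min and ~21 min quoted above. Slow endpoints would stretch the real runtime beyond this estimate.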
@alastairong1 alastairong1 changed the title from "bench: endpoint perf+reliability harness for preview→prod cutover" to "bench: endpoint benchmark harness for prod ↔ preview cutover comparison" on May 7, 2026