Reproducible benchmark harness for delta-rs. Measure performance against any revision, compare branches, and track regressions over time.
Prepare the checkout, generate fixtures, run a smoke benchmark, then compare your branch against main:
```
./scripts/prepare_delta_rs.sh
./scripts/sync_harness_to_delta_rs.sh
./scripts/bench.sh data --dataset-id tiny_smoke --seed 42
./scripts/bench.sh run --suite scan --runner rust --dataset-id tiny_smoke --label local
./scripts/compare_branch.sh --current-vs-main --methodology-profile pr-macro scan
```

Results go to `results/local/<suite>.json`. Pass `--help` to any script for details.
For the full setup walkthrough, see Getting Started.
`bench.sh run` defaults to the `smoke` lane. Use `correctness` for correctness-backed suites such as `write`, `delete_update`, `merge`, `metadata`, `optimize_vacuum`, and `interop_py`. Use `macro` only for macro-safe perf exploration. GitHub-hosted CI stays on the `smoke` and `correctness` lanes, while self-hosted workflows are the authoritative path for macro perf, decision compare, and longitudinal automation.
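As a minimal sketch, a correctness-lane run could look like the block below. The `--lane` flag is an assumption for illustration only (this page never shows the lane selector; check `./scripts/bench.sh --help` for the real spelling); the remaining flags mirror the quick-start above.

```
# NOTE: --lane is a hypothetical flag name; confirm via ./scripts/bench.sh --help.
./scripts/bench.sh run --suite write --runner rust --dataset-id tiny_smoke --lane correctness --label local
```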
`tiny_smoke` stays the fast setup/smoke dataset. The PR macro compare contract is stronger: `--methodology-profile pr-macro` automatically switches branch compare onto the deterministic local-disk `medium_selective` dataset and the decision-grade 7-run / 15-iteration methodology.
Choose the benchmark surface that matches the change:
- Use `scan` plus `pr-macro` when the suspected effect is on query execution or Parquet reads.
- Use `metadata_perf` plus `pr-metadata-perf` when the suspected effect is on checkpoint loading, long-history log replay, or metadata-heavy table open paths (see the compare sketch after this list).
- Use `tpcds` plus `pr-tpcds` on trusted self-hosted runners when the suspected effect is on analytical execute-path regressions against the DuckDB-backed `tpcds_duckdb` corpus. `tpcds_q72` remains outside the PR decision surface.
- Keep `scan` as the public execute-phase guardrail for replay-adjacent work, then pair it with `cargo bench -p delta-bench --bench metadata_replay_bench` when you need the narrower replay-state or snapshot-owned provider signal.
- Use Criterion as the primary signal when replay-state timings stay sub-millisecond or too noisy for branch-compare classification.
- Use `run benchmark decision full` only for the harness-owned `pr-full-decision` pack in `bench/evidence/registry.yaml`. `full` does not mean `--suite all`, and the bot blocks the command until every listed suite is `readiness=ready`.
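For the metadata-focused pairing above, a compare invocation would follow the same shape as the quick-start compare. This is a sketch that reuses only flags shown elsewhere in this document (`--current-vs-main`, `--methodology-profile`, and the suite as the trailing argument):

```
# Pair the pr-metadata-perf profile with the metadata_perf suite,
# mirroring the pr-macro + scan pattern used in the quick-start.
./scripts/compare_branch.sh --current-vs-main --methodology-profile pr-metadata-perf metadata_perf
```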
Keep the replay-state probe separate from the execute-phase guardrail:
```
cargo bench -p delta-bench --bench metadata_replay_bench
```

The replay-state microbench is investigation-grade. Do not substitute it for the default execute-phase guardrail:
```
# Approximation-only fallback: plan-phase timing for ad hoc
# historical replay-state checks.
./scripts/compare_branch.sh \
  --base-sha 61ea71b77d3322bec3ddb857685a46562925d9fd \
  --candidate-sha 385e7bd1730a2d21703001777d1368ffce5ce559 \
  --timing-phase plan \
  scan
```

```
# Default decision-grade execute-phase guardrail.
./scripts/compare_branch.sh \
  --base-sha 61ea71b77d3322bec3ddb857685a46562925d9fd \
  --candidate-sha 385e7bd1730a2d21703001777d1368ffce5ce559 \
  --methodology-profile pr-macro \
  scan
```

For ad hoc historical replay-state checks, `scan --timing-phase plan` remains an approximation-only fallback. The current `scan` suite registers `table.table_provider()` from loaded eager state, so `timing_phase=plan` is not proof of the snapshot-owned replay-state path by itself.
For trust-contract verification, read Validation and run `./scripts/validate_perf_harness.sh`. For Python interop coverage, install `python/requirements-audit.txt` first.
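A minimal install sketch, assuming a standard pip workflow (the requirements path is the one named above):

```
# Install Python interop dependencies before running interop_py coverage.
python3 -m pip install -r python/requirements-audit.txt
```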
- Run local smoke checks on any machine with `./scripts/bench.sh run`.
- Use GitHub-hosted CI for smoke and correctness validation, including correctness-backed suites such as `interop_py`.
- Use self-hosted runners for macro perf, decision compare, Criterion microbench, and longitudinal workflows.
- For PR macro evidence, run `./scripts/compare_branch.sh --methodology-profile pr-macro ...`; the profile fixes the decision-grade compare contract, uses `medium_selective`, and keeps sub-millisecond cases out of the normal macro verdict.
- For PR comment automation, use `run benchmark scan`, `run benchmark decision scan`, `run benchmark decision full`, and `show benchmark queue`.
| I want to... | Read this |
|---|---|
| Set up from scratch | Getting Started |
| Compare two revisions | Comparing Branches |
| Track performance over many revisions | Longitudinal Benchmarking |
| Run on dedicated cloud hardware | Cloud Runner |
| Look up a flag, metric, or schema | Reference |
| Understand how the harness works | Architecture |
| Validate the trust contract | Validation |
The reference surface currently covers `scan`, `write`, `write_perf`, `delete_update`, `delete_update_perf`, `merge`, `merge_perf`, `metadata`, `metadata_perf`, `optimize_vacuum`, `optimize_perf`, `concurrency`, `tpcds`, and `interop_py`. `concurrency` covers Rust-only contention paths, and `interop_py` is correctness-backed coverage for the Python runtime path in addition to the Rust-native suites. Replay-state internals stay in the dedicated `metadata_replay_bench` engineering probe instead of a public suite contract.
See Reference for the full listing.
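To enumerate suites from your checkout rather than from the docs, the scripts table below notes that `bench.sh` can list benchmarks; the subcommand name in this sketch is an assumption, so defer to `--help` if it differs:

```
# Assumed subcommand: the Purpose column below only says bench.sh can
# "list benchmarks"; ./scripts/bench.sh --help shows the real spelling.
./scripts/bench.sh list
```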
| Script | Purpose |
|---|---|
| `./scripts/bench.sh` | Generate fixtures, run suites, list benchmarks, health checks |
| `./scripts/compare_branch.sh` | Branch-to-branch comparison with aggregated reporting |
| `./scripts/longitudinal_bench.sh` | Longitudinal matrix, ingest, reporting, and retention |
| `./scripts/cleanup_local.sh` | Clean fixtures, results, and checkout artifacts (dry-run by default) |
| `./scripts/validate_perf_harness.sh` | Trust-contract verification for perf claims |
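Because `cleanup_local.sh` is dry-run by default (per the table above), invoking it bare is safe; the sketch below relies only on that documented behavior:

```
# Dry run by default: prints what would be removed without deleting anything.
./scripts/cleanup_local.sh
# Pass --help to find the flag that actually applies the deletion.
```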
Before opening a PR, run:

```
cargo test --locked
(cd python && python3 -m pytest -q tests)
./scripts/validate_perf_harness.sh
```

See Getting Started for the full CI baseline.