Add benchmarking script #76
I would make the comparisons ratios instead of differences, e.g.,

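A minimal sketch of the ratio idea, assuming each benchmark yields one baseline time and one candidate time in seconds. The function name `speed_ratio` and the timings are hypothetical and only there to show why a ratio reads better than a difference when benchmarks differ in scale.

```python
def speed_ratio(candidate_seconds: float, baseline_seconds: float) -> float:
    """Return candidate / baseline; values above 1.0 mean the candidate is slower."""
    if baseline_seconds <= 0:
        raise ValueError("baseline time must be positive")
    return candidate_seconds / baseline_seconds


# Hypothetical timings in seconds, invented for illustration only.
timings = {
    "fast_op": {"baseline": 0.010, "candidate": 0.015},
    "slow_op": {"baseline": 10.000, "candidate": 10.005},
}

for name, t in timings.items():
    diff = t["candidate"] - t["baseline"]
    ratio = speed_ratio(t["candidate"], t["baseline"])
    # Both cases have nearly the same absolute difference, but very different ratios.
    print(f"{name}: diff={diff:+.3f}s ratio={ratio:.2f}x")
```

Both hypothetical benchmarks differ by about 5 ms, yet only the first is a meaningful slowdown, which the ratio makes obvious.
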
Was chatting with @nx10 about this on Friday. From his past experience, benchmarking on the GitHub Actions runners isn't very reliable. Given this, one solution is an updated script (plus the desired changes) that can be used to run the benchmarks locally. An alternative to explore is self-hosted runners, which may be more reliable for benchmarking.

f7a737d folds the CI steps into a locally run benchmarking script that generates a markdown file with the results (the same content the action was posting as a PR comment; see screenshot).
Note that in its current state this is not meant to be exhaustive, but it at least gives us a quick look at how changes may affect performance.

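For readers who have not opened the script, here is a rough sketch of what "generate a markdown file with the results" could look like. This is not the code from f7a737d: the function names `render_markdown` and `write_report`, the column layout, and the example rows are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def render_markdown(rows):
    """Render (label, baseline_seconds, candidate_seconds) tuples as a markdown table."""
    lines = [
        "| Benchmark | Baseline (s) | Candidate (s) | Ratio |",
        "| --- | ---: | ---: | ---: |",
    ]
    # Sort by label so the output is stable from run to run.
    for label, base, cand in sorted(rows):
        lines.append(f"| {label} | {base:.3f} | {cand:.3f} | {cand / base:.2f}x |")
    return "\n".join(lines) + "\n"


def write_report(path, rows):
    """Write the rendered table to a markdown file (hypothetical helper)."""
    Path(path).write_text(render_markdown(rows), encoding="utf-8")


# Hypothetical results; the "local" and "remote" labels follow the PR description.
example_rows = [("remote", 2.0, 3.0), ("local", 1.0, 1.0)]
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "benchmark.md"
    write_report(report, example_rows)
    print(report.read_text(encoding="utf-8"))
```
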
Forgive my failing to find this in the code, but how many retries are we doing here? To @nx10's point, benchmarking in gh-actions isn't super consistent, but I don't think that's a big deal... We can still get an apples-to-apples comparison if we run all jobs on the same node under the same conditions, and all we're looking for, to @effigies' point, is a sense of the ratio for faster/slower. I think the more important piece is to have ~10–20 retries each so we have a consistent estimate. I'd also suggest a margin of error (e.g., 5%), below which we consider performance unchanged, since it'll rarely be perfectly identical.

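The retry-plus-tolerance suggestion could be sketched roughly as follows, using the standard-library `timeit.repeat` and taking the median of the samples. The helper names `median_time` and `classify`, the default of 15 repeats, and the toy workloads are assumptions, not what the benchmark script currently does.

```python
import statistics
import timeit


def median_time(func, repeats=15, number=1):
    """Time `func` `repeats` times and return the median seconds per call."""
    samples = timeit.repeat(func, repeat=repeats, number=number)
    return statistics.median(samples) / number


def classify(ratio, tolerance=0.05):
    """Label a candidate/baseline ratio, treating anything within the tolerance as unchanged."""
    if ratio > 1 + tolerance:
        return "slower"
    if ratio < 1 - tolerance:
        return "faster"
    return "unchanged"


# Toy workloads standing in for the real baseline and candidate code paths.
def baseline():
    return sum(range(10_000))


def candidate():
    return sum(range(20_000))


ratio = median_time(candidate) / median_time(baseline)
print(f"ratio={ratio:.2f}x -> {classify(ratio)}")
```

The median is used here because it is less sensitive to the occasional slow sample than the mean; the tolerance would need to be widened on noisier machines.
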
From past experience, continuous benchmarking and microbenchmarking on GHA generally don't work well. They can catch large regressions, but even then it takes care: stochastic sampling, cache warmups, and controlling for runner-to-runner variance all matter.

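One possible reading of the warmup and sampling points, sketched under assumptions: each variant is called a few times untimed before measurement, and the order in which variants run is reshuffled every round so that drift on the machine does not consistently favour one of them. The name `interleaved_samples` and its parameters are hypothetical, and this does nothing about runner-to-runner variance, which needs all variants to run on the same machine.

```python
import random
import time


def interleaved_samples(variants, rounds=10, warmups=2, seed=0):
    """Collect per-variant timings with untimed warmups and a shuffled order each round."""
    rng = random.Random(seed)

    # Warm caches (imports, file system, allocator) before timing anything.
    for func in variants.values():
        for _ in range(warmups):
            func()

    samples = {name: [] for name in variants}
    names = list(variants)
    for _ in range(rounds):
        rng.shuffle(names)  # randomize which variant runs first in this round
        for name in names:
            start = time.perf_counter()
            variants[name]()
            samples[name].append(time.perf_counter() - start)
    return samples


# Toy callables standing in for the code under test.
results = interleaved_samples(
    {"baseline": lambda: sorted(range(5_000)), "candidate": lambda: sorted(range(5_000))},
    rounds=5,
)
for name, values in results.items():
    print(name, f"min={min(values):.6f}s", f"max={max(values):.6f}s")
```
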
To Greg's point: I think the margin of error on GHA would be even higher, maybe 10%, maybe 20%.
- Mark tests with "cloud" and / or "benchmark" as needed
- Combine both "dev" and "benchmark" dependencies, was causing issues with the pytest due to imports (alternatively, use `try-except` block for optional dependency import)
- Replace pandas with polars in dev dependency (for benchmarking)

- Switch to shortened SHA for PR
- Add PR for unique output file artifact
- Disable comparison against tag due to lack of dependency group
- Add step to comment on PR
- Sort labels for comment

- Fold CI scripts into local benchmark script
- Remove CI workflow
- Use importlib for pytest for identical file names across different test modules
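
The `try-except` alternative mentioned in the first commit message could look roughly like this. It is a sketch, not code from this PR: the `HAS_POLARS` flag and the `require_polars` helper are hypothetical names, and polars is used only because the commit names it as the benchmarking dependency.

```python
try:
    import polars as pl  # optional: only needed for benchmarking
except ImportError:
    pl = None

HAS_POLARS = pl is not None


def require_polars():
    """Return the polars module, or fail with a clear message if it is not installed."""
    if pl is None:
        raise RuntimeError("polars is required for benchmarking but is not installed")
    return pl


print("polars available:", HAS_POLARS)
```

Within a test module, `pytest.importorskip("polars")` is another way to get the same effect, skipping the tests instead of failing at import time.
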

Adds CI to benchmark new PRs against both the `main` branch (dev) and the previous release. Benchmarks are performed on a small dataset from the `bids-examples` submodule (labelled as "local") and a subset from OpenNeuro (labelled as "remote"), with the focus primarily on performance.

Index sizes are negligible in these benchmarks given the size of these datasets. If desired, index-size comparisons should be performed on datasets too large to fit on the GitHub-hosted runners. We can consider adding these using self-hosted runners with larger datasets.

I've disabled benchmarking against tags for now due to missing dependencies in older versions, but we should re-enable it following the next release. Benchmarks on the `main` branch should work once this is merged in, so leaving that in there for now.

Below is an example of what the benchmark results would look like (was run on a fork):
Closes #75