Skip to content

Add benchmarking script#76

Draft
kaitj wants to merge 7 commits into
mainfrom
ci/benchmark
Draft

Add benchmarking script#76
kaitj wants to merge 7 commits into
mainfrom
ci/benchmark

Conversation

@kaitj
Copy link
Copy Markdown
Contributor

@kaitj kaitj commented May 6, 2026

Adds CI to benchmark new PRs against both the main branch (dev) and previous release. Benchmarks are performed on a small dataset from the bids-examples submodule (labelled as "local") and a subset from OpenNeuro (labelled as "remote"), with the focus primarily on performance.

Index sizes are negligible with these benchmarks given the size of these datasets and if desired should be performed on datasets that are larger than possible to fit on the gh runners. Can consider adding these using self-hosted runners with larger datasets.

I've disabled benchmarking against tags for now due to missing dependencies in older versions, but should re-enable them following the next release. Benchmarks on the main branch should work once this is merged in, so leaving that in there for now.

Below is an example of what this benchmark results would look like (was run on a fork):

{9974971E-E052-4D0A-AB46-E71D03ABCD3E}

Closes #75

@kaitj kaitj force-pushed the ci/benchmark branch 4 times, most recently from 978193e to b35fc76 Compare May 6, 2026 19:45
@effigies
Copy link
Copy Markdown
Contributor

effigies commented May 7, 2026

I would make the comparisons ratios instead of differences, e.g., new/old (1 is no change, <1 is faster, >1 is slower). A difference of 40ms is huge if the baseline is 100ms, not so much if it's 20s.

@kaitj
Copy link
Copy Markdown
Contributor Author

kaitj commented May 11, 2026

Was chatting with @nx10 about this on Friday. It didn't seem like benchmarking using the gh actions runners is very reliable from his past experience.

Given this, one solution is an updated script + desired changes that can be used to run the benchmarks locally. Another alternative to explore is self-hosted runners, which may be more reliable for performing benchmarks.

@kaitj
Copy link
Copy Markdown
Contributor Author

kaitj commented May 14, 2026

f7a737d folded in the CI stuff into a locally run script for benchmarking and generates a markdown file with the results (same as to what was being commented via the action - see screenshot)

{9A3FFC1C-84A0-44DE-9856-6F079186F75A}

Note that in the current state, this is not meant to be exhaustive or comprehensive at the moment, but at least give us a quick look into how changes may affect performance.

@kaitj kaitj marked this pull request as ready for review May 14, 2026 15:34
@kaitj kaitj requested a review from nx10 May 14, 2026 15:35
@kaitj kaitj changed the title Add benchmarking CI Add benchmarking script May 14, 2026
@gkiar
Copy link
Copy Markdown
Collaborator

gkiar commented May 20, 2026

Forgive my failing to find this in the code, but how many retries are we doing here? To @nx10 's point, benchmarking in gh-actions isn't super consistent, but I don't think that's a big deal... We are still able to get apples-apples comparison if we run all jobs in the same node in the same conditions, and all we're looking for, to @effigies ' point is a sense of ratio for faster/slower.

I think the more important piece is to have ~10–20 retries per so we have a consistent estimate.

I'd also suggest a margin of error (eg 5%), below which we consider performance unchanged, since it'll rarely be perfectly identical.

@nx10
Copy link
Copy Markdown
Collaborator

nx10 commented May 20, 2026

From past experience, continuous benchmarking and microbenchmarking on GHA generally don't work well. They can catch large regressions, but even then it takes care - stochastic sampling, cache warmups, and controlling for runner-to-runner variance all matter.

@nx10
Copy link
Copy Markdown
Collaborator

nx10 commented May 20, 2026

To Greg's point: I think the margin of error on GHA would be much higher even - maybe 10% maybe 20%

kaitj added 7 commits May 21, 2026 16:08
- Mark tests with "cloud" and / or "benchmark" as needed
- Combine both "dev" and "benchmark" dependencies, was causing issues with the pytest due to imports (alternatively, use `try-except` block for optional dependency import)
- Replace pandas with polars in dev dependency (for benchmarking)
- Switch to shortened SHA for PR
- Add PR for unique output file artifact
- Disable comparison against tag due to lack of dependency group
- Add step to comment on PR
- Sort labels for comment
- Fold CI scripts into local benchmark script
- Remove CI workflow
- Use importlib for pytest for identical file names across different test modules
@github-actions
Copy link
Copy Markdown

Coverage

Coverage Report
FileStmtsMissCoverMissing
__init__.py90100% 
__main__.py69888%101, 110, 135, 137–138, 164, 168, 172
_entities.py112199%129
_indexing.py212597%150, 159–160, 409, 447
_logging.py31487%30, 37, 39–40
_metadata.py48491%39–40, 66, 71
_pathlib.py21576%16–17, 19–20, 22
_version.py110100% 
pybids
   __init__.py40100% 
   _bidsfile.py381365%71–73, 77–79, 83–85, 89–91, 95
   _layout.py1564571%63, 72, 81, 104, 114–115, 118, 140–141, 156–157, 173–174, 177–181, 186, 188–189, 192–193, 228, 233, 241, 322–324, 389–394, 396, 399–404, 406, 462, 482
   _utils.py13561%47–50, 52
TOTAL7249087% 

Tests Skipped Failures Errors Time
100 1 💤 0 ❌ 0 🔥 13.061s ⏱️

@childmindresearch childmindresearch deleted a comment from github-actions Bot May 21, 2026
@kaitj kaitj marked this pull request as draft May 21, 2026 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add benchmarking CI

4 participants