Future Feature Idea: Git-Aware A/B Comparison Runner
Summary:
Create a new, high-level benchmark runner that is aware of the version control system (Git). This runner would be capable of checking out two different commits, running the same benchmark against both versions within a single session, and performing a paired-sample statistical analysis to provide a highly robust and reliable comparison. This would transform the framework from a baseline-comparison tool into a true A/B performance testing system.
Problem Solved:
Traditional baseline comparison (like pytest-benchmark) suffers from the "noisy environment" problem. A performance change can be masked by variations in system load between runs, leading to flaky tests and low confidence in results. This new model would eliminate environmental noise as a confounding variable by running both versions of the code in an interleaved fashion under the exact same system conditions.
Proposed Workflow:
The user would invoke the benchmark from the command line with references to two Git commits:
# Compare the current working directory against the 'main' branch
simplebench --compare-with main
# Compare two specific commits
simplebench --compare b1c3a4d --with a0f9e8b
The output would be a direct, high-confidence statement about the performance difference:
Benchmark: process_data
- Commit b1c3a4d is 15.2% slower (±1.8%) than commit a0f9e8b.
- The difference is statistically significant (p < 0.001).
- Result: REGRESSION DETECTED
Technical Implementation Plan:
- CLI Enhancement (cli.py):
  - Add a new command group or arguments like --compare and --with to the main simplebench CLI entry point.
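The argument wiring might look like the following sketch. The flag names come from the workflow above, but build_parser and the dest name with_commit are illustrative, not part of simplebench's existing cli.py; note that --with needs an explicit dest because "with" is a Python keyword.

```python
import argparse

# Hypothetical sketch of the proposed flags; the real parser lives in cli.py.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="simplebench")
    # --compare-with REF: compare the working tree against a Git ref (e.g. 'main')
    parser.add_argument("--compare-with", metavar="REF", default=None,
                        help="compare the working tree against this Git ref")
    # --compare A --with B: compare two explicit commits
    parser.add_argument("--compare", metavar="COMMIT", default=None)
    # 'with' is a Python keyword, so the attribute needs an explicit dest
    parser.add_argument("--with", dest="with_commit", metavar="COMMIT", default=None)
    return parser

args = build_parser().parse_args(["--compare", "b1c3a4d", "--with", "a0f9e8b"])
print(args.compare, args.with_commit)
```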
- Git Interaction Layer:
  - Create a new module responsible for interacting with the Git repository. This could use a library like GitPython or the subprocess module.
  - It needs to handle:
    - Identifying the current branch/commit.
    - Checking out specific commits to a temporary directory to avoid disturbing the user's working tree.
    - Installing dependencies for each checked-out version (e.g., by running pip install -e . in the temporary directory).
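A minimal subprocess-based sketch of that layer, assuming git is on PATH. The helper names are hypothetical; git worktree add checks a commit out into a separate directory without touching the user's working tree. Installing each version into the current environment (rather than an isolated virtualenv per commit) is a simplification.

```python
import subprocess
import tempfile
from pathlib import Path

def current_commit(repo: str = ".") -> str:
    # Resolve the commit hash the repository currently points at.
    out = subprocess.run(["git", "rev-parse", "HEAD"], cwd=repo,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def checkout_to_tempdir(commit: str, repo: str = ".") -> Path:
    # Check the commit out into a fresh worktree, leaving the user's tree intact.
    tmp = Path(tempfile.mkdtemp(prefix=f"simplebench-{commit[:7]}-"))
    subprocess.run(["git", "worktree", "add", "--detach", str(tmp), commit],
                   cwd=repo, check=True)
    # Simplification: install that version's dependencies into the current
    # environment; an isolated virtualenv per commit would be safer.
    subprocess.run(["pip", "install", "-e", str(tmp)], check=True)
    return tmp
```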
- Dynamic Code Loading:
  - The runner will need to dynamically import the benchmark functions from the two different checked-out versions of the code. Python's importlib will be essential here.
- New "Comparison Runner" (runners.py):
  - Create a new ComparisonRunner class.
  - This runner will orchestrate the process:
    - Set up the two temporary environments for commit A and commit B.
    - Load function_A and function_B.
    - In a loop, run the functions in an interleaved or randomized order (A, B, A, B, ... or B, A, A, B, ...).
    - Collect the raw timing/measurement data for both versions into two separate Results objects.
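The heart of that orchestration is the interleaved measurement loop; a minimal sketch follows. The function and parameter names are assumptions, and the real ComparisonRunner would wrap this in setup/teardown and feed the samples into Results objects. Randomizing the A/B order each round means slow environmental drift hits both versions equally.

```python
import random
import time
from typing import Callable

def interleaved_samples(func_a: Callable, func_b: Callable,
                        rounds: int = 50, seed: int = 0):
    rng = random.Random(seed)
    samples_a, samples_b = [], []
    for _ in range(rounds):
        # Randomize order each round so system-load drift affects both equally.
        order = [("a", func_a), ("b", func_b)]
        rng.shuffle(order)
        for label, func in order:
            start = time.perf_counter()
            func()
            elapsed = time.perf_counter() - start
            (samples_a if label == "a" else samples_b).append(elapsed)
    return samples_a, samples_b
```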
- Paired Statistical Analysis (stats/):
  - Enhance the stats module to include paired-sample statistical tests (e.g., a paired t-test).
  - This test will operate on the two lists of results from the ComparisonRunner to determine whether the difference between them is statistically significant.
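The core of a paired t-test is the statistic below, sketched with only the standard library. A real implementation would convert the statistic to a p-value via the t distribution (e.g. scipy.stats.ttest_rel does both steps); the function name here is hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(samples_a, samples_b):
    # Pairing works because round i of A and round i of B ran under the
    # same system conditions; we test whether the mean difference is zero.
    if len(samples_a) != len(samples_b):
        raise ValueError("paired test needs equal-length samples")
    diffs = [a - b for a, b in zip(samples_a, samples_b)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean(diffs) / se

# A |t| well above ~3 with 30+ pairs is strong evidence of a real difference.
```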
- New Reporter Mode (reporters/):
  - The existing Reporter system will need a new mode or a new dedicated ComparisonReporter to format and display the results of the paired analysis, including the percentage change, confidence intervals, and the p-value.
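As an illustration, a ComparisonReporter could format the sample output shown earlier roughly like this. The function signature, field names, and the 0.05 significance threshold are assumptions, not part of the existing Reporter API.

```python
def format_comparison(name: str, commit_a: str, commit_b: str,
                      pct_change: float, ci: float, p_value: float,
                      alpha: float = 0.05) -> str:
    # Positive pct_change means commit_a is slower than commit_b.
    direction = "slower" if pct_change > 0 else "faster"
    significant = p_value < alpha
    verdict = "REGRESSION DETECTED" if significant and pct_change > 0 else "NO REGRESSION"
    lines = [
        f"Benchmark: {name}",
        f"- Commit {commit_a} is {abs(pct_change):.1f}% {direction} "
        f"(±{ci:.1f}%) than commit {commit_b}.",
        f"- The difference is "
        f"{'statistically significant' if significant else 'not statistically significant'} "
        f"(p = {p_value:.3g}).",
        f"- Result: {verdict}",
    ]
    return "\n".join(lines)

print(format_comparison("process_data", "b1c3a4d", "a0f9e8b", 15.2, 1.8, 0.0005))
```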