
Git-Aware A/B Comparison Runner #2

@JerilynFranz

Description

Future Feature Idea: Git-Aware A/B Comparison Runner

Summary:

Create a new, high-level benchmark runner that is aware of the version control system (Git). This runner would be capable of checking out two different commits, running the same benchmark against both versions within a single session, and performing a paired-sample statistical analysis to provide a highly robust and reliable comparison. This would transform the framework from a baseline-comparison tool into a true A/B performance testing system.

Problem Solved:

Traditional baseline comparison (like pytest-benchmark) suffers from the "noisy environment" problem. A performance change can be masked by variations in system load between runs, leading to flaky tests and low confidence in results. This new model would eliminate environmental noise as a confounding variable by running both versions of the code in an interleaved fashion under the exact same system conditions.

Proposed Workflow:

The user would invoke the benchmark from the command line with references to two Git commits:

# Compare the current working directory against the 'main' branch
simplebench --compare-with main

# Compare two specific commits
simplebench --compare b1c3a4d --with a0f9e8b

The output would be a direct, high-confidence statement about the performance difference:

Benchmark: process_data

  • Commit b1c3a4d is 15.2% slower (±1.8%) than commit a0f9e8b.
  • The difference is statistically significant (p < 0.001).
  • Result: REGRESSION DETECTED

Technical Implementation Plan:

  1. CLI Enhancement (cli.py):

    • Add a new command group or arguments like --compare and --with to the main simplebench CLI entry point.
  2. Git Interaction Layer:

    • Create a new module responsible for interacting with the Git repository. This could use a library like GitPython or the subprocess module.
    • It needs to handle:
      • Identifying the current branch/commit.
      • Checking out specific commits to a temporary directory to avoid disturbing the user's working tree.
      • Installing dependencies for each checked-out version (e.g., by running pip install -e . in the temporary directory).
  3. Dynamic Code Loading:

    • The runner will need to dynamically import the benchmark functions from the two different checked-out versions of the code. Python's importlib will be essential here.
  4. New "Comparison Runner" (runners.py):

    • Create a new ComparisonRunner class.
    • This runner will orchestrate the process:
      • Set up the two temporary environments for commit A and commit B.
      • Load function_A and function_B.
      • In a loop, run the functions in an interleaved or randomized order (A, B, A, B, ... or B, A, A, B, ...).
      • Collect the raw timing/measurement data for both versions into two separate Results objects.
  5. Paired Statistical Analysis (stats/):

    • Enhance the stats module to include paired-sample statistical tests (e.g., a paired t-test).
    • This test will operate on the two lists of results from the ComparisonRunner to determine if the difference between them is statistically significant.
  6. New Reporter Mode (reporters/):

    • The existing Reporter system will need a new mode or a new dedicated ComparisonReporter to format and display the results of the paired analysis, including the percentage change, confidence intervals, and the p-value.
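Step 1's flags could be wired up with `argparse`. A minimal sketch, assuming the option names from the usage examples above (note that `--with` needs an explicit `dest`, since `with` is a Python keyword and cannot be an attribute name):

```python
import argparse

# Hypothetical sketch of the proposed CLI flags; the option names and
# defaults are assumptions, not the actual simplebench interface.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="simplebench")
    # Candidate version to benchmark; defaults to the working tree if omitted.
    parser.add_argument("--compare", metavar="REF",
                        help="Git ref of the candidate version")
    # --compare-with and --with are aliases; dest must be set explicitly
    # because 'with' is a reserved word in Python.
    parser.add_argument("--compare-with", "--with", dest="baseline",
                        metavar="REF",
                        help="Git ref of the baseline version to compare against")
    return parser
```

Both invocation styles from the workflow section then parse to the same namespace, e.g. `build_parser().parse_args(["--compare-with", "main"])`.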
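Step 2's Git interaction layer could be sketched with the `subprocess` module and `git worktree`, which checks a commit out into a separate directory without disturbing the user's working tree. Function names here are illustrative, and error handling is reduced to `check=True`:

```python
import subprocess
import tempfile
from pathlib import Path

def current_commit(repo: str = ".") -> str:
    """Return the SHA of HEAD in the given repository."""
    out = subprocess.run(["git", "rev-parse", "HEAD"], cwd=repo,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def checkout_to_tempdir(ref: str, repo: str = ".") -> Path:
    """Materialize `ref` as a detached worktree under a fresh temporary
    directory, leaving the user's working tree untouched."""
    parent = Path(tempfile.mkdtemp(prefix="simplebench-"))
    workdir = parent / ref.replace("/", "-")
    subprocess.run(["git", "worktree", "add", "--detach", str(workdir), ref],
                   cwd=repo, check=True)
    # Each version's dependencies could then be installed with, e.g.:
    # subprocess.run(["pip", "install", "-e", str(workdir)], check=True)
    return workdir
```

GitPython would offer the same operations behind an object API; `subprocess` keeps the dependency footprint at zero.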
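Step 3's dynamic loading could use `importlib.util` to import the same benchmark module from two checked-out trees under distinct aliases, so both versions coexist in one process. The module path and alias scheme are assumptions for illustration:

```python
import importlib.util
import sys
from pathlib import Path
from types import ModuleType

def load_benchmark_module(tree: Path, relpath: str, alias: str) -> ModuleType:
    """Import `relpath` (e.g. 'benchmarks/bench_data.py') from a specific
    checked-out tree, registered under a unique alias so that the two
    versions of the module do not collide in sys.modules."""
    spec = importlib.util.spec_from_file_location(alias, tree / relpath)
    module = importlib.util.module_from_spec(spec)
    sys.modules[alias] = module   # register before exec, as import machinery does
    spec.loader.exec_module(module)
    return module
```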
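Step 4's measurement loop could look like the following sketch. The real `ComparisonRunner` would collect into `Results` objects; plain lists stand in here. Randomizing the A/B order within each round is what makes slow drift in system load hit both versions equally:

```python
import random
import time
from typing import Callable, List, Tuple

def run_interleaved(func_a: Callable[[], object],
                    func_b: Callable[[], object],
                    rounds: int = 50,
                    seed: int = 0) -> Tuple[List[float], List[float]]:
    """Time both functions `rounds` times each, shuffling the A/B order
    per round so environmental noise is shared between the two versions."""
    rng = random.Random(seed)
    times_a: List[float] = []
    times_b: List[float] = []
    for _ in range(rounds):
        pair = [(func_a, times_a), (func_b, times_b)]
        rng.shuffle(pair)                  # A,B or B,A, chosen each round
        for func, sink in pair:
            start = time.perf_counter()
            func()
            sink.append(time.perf_counter() - start)
    return times_a, times_b
```

Because each round yields one measurement per version taken back-to-back, the two lists are naturally paired, which is exactly what the statistical step needs.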
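Step 5's paired analysis could be sketched with the standard library alone. This version computes the paired t statistic on per-round differences and approximates the two-sided p-value with a normal distribution, which is reasonable at the sample sizes typical of benchmark runs (a real implementation might use `scipy.stats.ttest_rel` for an exact t distribution):

```python
import math
from statistics import NormalDist, mean, stdev

def paired_comparison(times_a, times_b):
    """Paired test on per-round timing differences.
    Returns (relative change of A vs. B, two-sided p-value)."""
    assert len(times_a) == len(times_b), "paired test needs equal-length samples"
    diffs = [a - b for a, b in zip(times_a, times_b)]
    n = len(diffs)
    d_mean = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)        # standard error of the mean difference
    t_stat = d_mean / se if se else float("inf")
    # Normal approximation to the t distribution (good for large n).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    rel_change = d_mean / mean(times_b)     # e.g. +0.152 means A is 15.2% slower
    return rel_change, p_value
```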
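Step 6's `ComparisonReporter` might render output like the sample shown earlier; a sketch, with the function name, thresholds, and wording all assumptions:

```python
def format_comparison(name: str, commit_a: str, commit_b: str,
                      rel_change: float, ci: float, p_value: float,
                      alpha: float = 0.05) -> str:
    """Format a paired-comparison result in the style of the sample output."""
    direction = "slower" if rel_change > 0 else "faster"
    p_text = "p < 0.001" if p_value < 0.001 else f"p = {p_value:.3f}"
    lines = [
        f"Benchmark: {name}",
        "",
        f"  • Commit {commit_a} is {abs(rel_change) * 100:.1f}% {direction} "
        f"(±{ci * 100:.1f}%) than commit {commit_b}.",
    ]
    if p_value < alpha:
        lines.append(f"  • The difference is statistically significant ({p_text}).")
        lines.append("  • Result: REGRESSION DETECTED" if rel_change > 0
                     else "  • Result: IMPROVEMENT DETECTED")
    else:
        lines.append(f"  • The difference is not statistically significant ({p_text}).")
        lines.append("  • Result: NO SIGNIFICANT CHANGE")
    return "\n".join(lines)
```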

Metadata

Labels: enhancement (New feature or request), roadmap (Long term direction)