
feat(analysis): add RunTrendAnalyzer β€” detect score regression across sequential runs#104

Open
nanookclaw wants to merge 2 commits into pinchbench:main from nanookclaw:feat/run-trend-analyzer

Conversation

@nanookclaw

Summary

Implements RunTrendAnalyzer as discussed in #101.

Detects whether a model's benchmark score is improving, stable, or degrading across sequential runs using OLS slope fitting over a configurable sliding window.

Changes

New file: scripts/lib_trend.py

  • RunPoint β€” dataclass (run_id, timestamp, model, score_pct, task_count)
  • RunTrendReport β€” analysis result with regression_detected and task_count_varies flag
  • RunTrendAnalyzer.load_points() β€” parses result JSONs, groups by model, sorts by timestamp. Narrowed exception handling: (json.JSONDecodeError, OSError) with silent skip.
  • RunTrendAnalyzer.analyze() β€” OLS slope via statistics.linear_regression, configurable window + threshold
  • RunTrendReport.summary() β€” CLI-friendly output with task-count-varied warning
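For orientation, the two data structures above might look roughly like this. This is a hypothetical sketch: the field names come from the PR description, but the summary() wording and dataclass options are assumptions.

```python
# Hypothetical sketch of RunPoint and RunTrendReport; only the field names
# are taken from the PR description, the summary() text is an assumption.
from dataclasses import dataclass

@dataclass(frozen=True)
class RunPoint:
    run_id: str
    timestamp: float
    model: str
    score_pct: float
    task_count: int

@dataclass
class RunTrendReport:
    model: str
    slope: float
    regression_detected: bool
    task_count_varies: bool

    def summary(self) -> str:
        """CLI-friendly one-liner, with the task-count-varied warning."""
        verdict = "REGRESSION" if self.regression_detected else "ok"
        line = f"{self.model}: slope {self.slope:+.3f} ({verdict})"
        if self.task_count_varies:
            line += " [warning: task count varied across window]"
        return line
```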

Design decisions

  1. Pure stdlib β€” uses statistics.linear_regression, no new dependencies
  2. Composable β€” can be imported as a library or called from CLI
  3. Suite expansion aware β€” task_count_varies flag warns when the benchmark suite composition changed across the trending window
  4. Per-model β€” handles multiple models concurrently, returns sorted by slope

Tests

12 new tests in tests/test_lib_trend.py covering:

  • Edge cases (empty, single run, malformed files)
  • Regression/improving/stable detection
  • task_count_varies flag correctness
  • Multiple models + single model filtering
  • CLI summary string format
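As an illustration of the malformed-file case, a self-contained pytest-style sketch might look like this. load_scores is a hypothetical stand-in for RunTrendAnalyzer.load_points(), reduced to just the narrowed-exception silent-skip behavior described above.

```python
# Self-contained illustration of the malformed-file edge case. load_scores
# is a hypothetical stand-in for RunTrendAnalyzer.load_points(), reduced to
# the narrowed (json.JSONDecodeError, OSError) silent-skip behavior.
import json
from pathlib import Path

def load_scores(result_dir: Path) -> list[float]:
    """Parse each result JSON, silently skipping unreadable/invalid files."""
    scores: list[float] = []
    for path in sorted(result_dir.glob("*.json")):
        try:
            scores.append(json.loads(path.read_text())["score_pct"])
        except (json.JSONDecodeError, OSError):
            continue  # skip malformed or unreadable result files
    return scores

def test_malformed_file_skipped(tmp_path: Path) -> None:
    (tmp_path / "a.json").write_text("{not valid json")
    (tmp_path / "b.json").write_text(json.dumps({"score_pct": 81.5}))
    assert load_scores(tmp_path) == [81.5]
```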

CLI integration

Can be wired into benchmark.py post-run:

from pathlib import Path

from scripts.lib_trend import RunTrendAnalyzer

analyzer = RunTrendAnalyzer(Path(args.output_dir))
analyzer.run(model=args.model)

Closes #101.

cc @Soham-o — all review points addressed per our discussion: exception handling narrowed ✅ task_count_varies flag added ✅


Detects whether a model's benchmark score is improving, stable, or degrading
across sequential runs using OLS slope fitting over a sliding window.

- RunPoint dataclass with run_id, timestamp, score, task_count
- RunTrendReport with regression_detected, task_count_varies flag
- Narrowed exception handling (JSONDecodeError, OSError) in load_points
- CLI output warns when suite composition changed across trending window
- Pure stdlib (statistics.linear_regression), no new dependencies

See Issue pinchbench#101 for full proposal and maintainer review thread.

- test_no_data_returns_empty: empty directory returns []
- test_single_run_returns_empty: needs >= 2 runs for trend
- test_regression_detected: declining scores trigger the flag
- test_improving_not_regression: positive slope not flagged
- test_malformed_file_skipped: bad JSON files skipped gracefully
- test_task_count_varies_flag: suite expansion sets warning
- test_task_count_varies_false_when_equal: consistent suite no warning
- test_summary_string_regression: CLI output format
- test_summary_string_task_count_warning: warning in summary
- test_stable_scores: zero slope flat detection
- test_multiple_models: concurrent analysis per model, sorted
- test_filter_by_model: single model query