Skip to content

fix: reject empty holdout datasets#59

Closed
steezkelly wants to merge 1 commit into
NousResearch:mainfrom
steezkelly:fix/33-empty-holdout-gate
Closed

fix: reject empty holdout datasets#59
steezkelly wants to merge 1 commit into
NousResearch:mainfrom
steezkelly:fix/33-empty-holdout-gate

Conversation

@steezkelly
Copy link
Copy Markdown

Summary

Partially addresses #33 M2 by failing fast when an evaluation dataset has no holdout examples:

  • adds _require_non_empty_holdout(...) to reject empty holdout splits before optimization starts
  • raises an actionable click.ClickException instead of silently producing 0.000/0.000 holdout scores
  • adds regression tests for both valid holdout and empty-holdout rejection

Root cause

The holdout scoring code uses sum(scores) / max(1, len(scores)). When dataset.holdout is empty, both baseline and evolved score lists are empty, so the report silently shows 0.000 scores rather than warning that no independent evaluation occurred.

Opposite-perspective review notes

I checked the skeptical cases:

  • Could this block intentional train/val-only runs? Yes, but the current CLI always reports holdout improvement, so allowing a zero-example holdout is more misleading than useful.
  • Should this happen before or after optimization? Before. If the independent evaluation split is invalid, running GEPA first just wastes model calls and creates a fake-looking report.
  • Is this only a warning? No. It hard-fails with a ClickException so the operator must fix the dataset/split.

Test Plan

  • RED first: pytest tests/skills/test_evolve_skill_dataset_gates.py -q failed because _require_non_empty_holdout did not exist
  • pytest tests/skills/test_evolve_skill_dataset_gates.py -q
  • pytest -q
  • static added-line security scan
  • git diff --check

Result: 141 passed, 11 warnings (DSPy deprecation warnings only).

Partially addresses #33 (M2: empty holdout set silently produces 0.000 scores).

@steezkelly
Copy link
Copy Markdown
Author

Closing this PR in favor of consolidated PR #68. Local integration found real helper-block overlap in evolution/skills/evolve_skill.py across the stack, and #68 preserves local test evidence: targeted stack tests 41 passed; full suite 164 passed; GitHub checks were absent on the split PRs. Review #68 instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant