fix: use LLM judge feedback for skill fitness #57

Closed
steezkelly wants to merge 1 commit into NousResearch:main from steezkelly:fix/12-llm-judge-fitness

@steezkelly

Summary

Fixes #12 by wiring the existing rubric-based LLM judge into the skill optimization metric:

  • makes skill_fitness_metric(...) use the currently configured DSPy LM as the primary judge
  • returns dspy.Prediction(score=float, feedback=str) so GEPA can use reflective feedback instead of generic score-only feedback
  • preserves the previous keyword-overlap score as an offline/rate-limit fallback with explicit feedback
  • keeps holdout aggregation numeric by coercing metric results with float(...)
  • adds regression tests for judge scoring, fallback scoring, and empty-output scoring
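
The judge-then-fallback flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `ScoredFeedback` is a hypothetical stand-in for `dspy.Prediction(score=..., feedback=...)` so the example runs without DSPy installed, and `skill_fitness_sketch`, `keyword_overlap`, `mean_score`, the `judge` callable, and the zero score for empty output are all assumed names and behaviors.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class ScoredFeedback:
    """Hypothetical stand-in for dspy.Prediction(score=..., feedback=...)."""
    score: float
    feedback: str

    def __float__(self) -> float:
        # Lets numeric aggregation call float(result) on a metric result.
        return float(self.score)


def keyword_overlap(expected: str, actual: str) -> float:
    """Fraction of distinct expected words that also appear in the output."""
    expected_words = set(re.findall(r"[a-z0-9]+", expected.lower()))
    if not expected_words:
        return 0.0
    actual_words = set(re.findall(r"[a-z0-9]+", actual.lower()))
    return len(expected_words & actual_words) / len(expected_words)


# Assumed judge shape: (expected, actual) -> (score, feedback).
Judge = Callable[[str, str], Tuple[float, str]]


def skill_fitness_sketch(
    expected: str, actual: str, judge: Optional[Judge] = None
) -> ScoredFeedback:
    # Assumption: empty output scores zero without calling the judge.
    if not actual.strip():
        return ScoredFeedback(0.0, "Output was empty; nothing to score.")
    if judge is not None:
        try:
            score, feedback = judge(expected, actual)
            return ScoredFeedback(float(score), feedback)
        except Exception as exc:  # e.g. offline or rate-limited
            reason = f"judge failed ({exc})"
    else:
        reason = "no judge configured"
    # Fallback: keyword overlap, with feedback that says why it was used.
    score = keyword_overlap(expected, actual)
    return ScoredFeedback(
        score, f"Fallback keyword-overlap score {score:.2f}: {reason}."
    )


def mean_score(results: Iterable[ScoredFeedback]) -> float:
    """Holdout-style aggregation that stays numeric via float(...)."""
    values = [float(r) for r in results]
    return sum(values) / len(values) if values else 0.0
```

Returning one object that carries both a score and a feedback string means a reflective optimizer can read the text while numeric aggregation coerces the same object with `float(...)`.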

Root cause

The codebase already had LLMJudge, but the actual optimizer metric still used only keyword overlap. That produced a narrow and easily gamed objective and deprived GEPA of actionable feedback.
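
To illustrate why a keyword-overlap objective is easy to game, here is a minimal sketch. The `keyword_overlap` function is an assumed, simplified version of such a metric, not the repository's implementation, and the example strings are invented for illustration.

```python
import re


def keyword_overlap(expected: str, actual: str) -> float:
    """Assumed overlap metric: share of expected words present in the output."""
    expected_words = set(re.findall(r"[a-z0-9]+", expected.lower()))
    if not expected_words:
        return 0.0
    actual_words = set(re.findall(r"[a-z0-9]+", actual.lower()))
    return len(expected_words & actual_words) / len(expected_words)


expected = "validate the input then write the file"

# A sensible answer and a scrambled bag of the same words.
sensible = "First validate the input, then write the file to disk."
scrambled = "file the write then input the validate"

sensible_score = keyword_overlap(expected, sensible)
scrambled_score = keyword_overlap(expected, scrambled)

# Both outputs contain every expected word, so the metric cannot
# tell the coherent answer from the word salad.
print(sensible_score, scrambled_score)
```

Because the score depends only on which words appear, an optimizer can raise it by stuffing keywords, and a bare number gives no indication of what to change.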

Test Plan

  • RED first: pytest tests/core/test_fitness.py -q failed before _score_with_llm_judge existed
  • pytest tests/core/test_fitness.py -q
  • pytest -q
  • runtime probe: with no LM configured, the metric still returns dspy.Prediction(...), float(prediction) works, and fallback feedback is populated
  • static added-line security scan: clean
  • git diff --check

Result: 142 passed, 11 warnings (DSPy deprecation warnings only).

Closes #12

@steezkelly
Author

Closing this PR in favor of consolidated PR #68. Local integration found real helper-block overlap in evolution/skills/evolve_skill.py across the stack. #68 preserves the local test evidence: 41 targeted stack tests passed and the full suite passed with 164 tests. GitHub checks were absent on the split PRs. Review #68 instead.

Development

Successfully merging this pull request may close these issues.

Fitness metric uses keyword overlap only — insufficient signal for optimization