fix: use LLM judge feedback for skill fitness by steezkelly · Pull Request #57 · NousResearch/hermes-agent-self-evolution

steezkelly · 2026-05-09T02:27:53Z

Summary

Fixes #12 by wiring the existing rubric-based LLM judge into the skill optimization metric:

- makes skill_fitness_metric(...) use the currently configured DSPy LM as the primary judge
- returns dspy.Prediction(score=float, feedback=str) so GEPA can use reflective feedback instead of generic score-only feedback
- preserves the previous keyword-overlap score as an offline/rate-limit fallback with explicit feedback
- keeps holdout aggregation numeric by coercing metric results with float(...)
- adds regression tests for judge scoring, fallback scoring, and empty-output scoring
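
The flow described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: MetricResult is a stdlib stand-in for dspy.Prediction(score=..., feedback=...), and skill_fitness_sketch, keyword_overlap, holdout_mean, and the judge callable are hypothetical names. The overlap formula and the 0.0 score for empty output are assumptions as well.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class MetricResult:
    """Stand-in for dspy.Prediction(score=..., feedback=...); not the real class."""
    score: float
    feedback: str

    def __float__(self) -> float:
        # Lets aggregation code call float(result) on a metric result.
        return float(self.score)


def keyword_overlap(expected: str, output: str) -> float:
    """Assumed fallback: fraction of expected words that appear in the output."""
    expected_words = set(expected.lower().split())
    if not expected_words:
        return 0.0
    return len(expected_words & set(output.lower().split())) / len(expected_words)


# Hypothetical judge signature: (expected, output) -> (score, feedback).
Judge = Callable[[str, str], Tuple[float, str]]


def skill_fitness_sketch(expected: str, output: str,
                         judge: Optional[Judge] = None) -> MetricResult:
    # Assumption: empty output scores 0.0 with explicit feedback.
    if not output.strip():
        return MetricResult(0.0, "Output was empty; nothing to score.")
    if judge is not None:
        try:
            score, feedback = judge(expected, output)
            return MetricResult(float(score), feedback)
        except Exception as exc:  # e.g. offline or rate-limited judge
            reason = f"judge failed ({exc})"
    else:
        reason = "no judge configured"
    # Fall back to keyword overlap and say so in the feedback.
    score = keyword_overlap(expected, output)
    return MetricResult(score, f"Fallback keyword-overlap score ({reason}): {score:.2f}")


def holdout_mean(results: Sequence[MetricResult]) -> float:
    """Keep aggregation numeric by coercing each result with float(...)."""
    return sum(float(r) for r in results) / len(results) if results else 0.0
```

Returning score and feedback together lets a reflective optimizer read the feedback string, while float(...) coercion keeps any code that only needs a number working unchanged.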

Root cause

The codebase already had LLMJudge, but the actual optimizer metric still used only keyword overlap. That produced a narrow and easily gamed objective and deprived GEPA of actionable feedback.
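
To illustrate why a keyword-overlap objective is easy to game, here is a small sketch. The keyword_overlap function is a hypothetical approximation of such a metric, not the repository's implementation, and the example strings are invented for illustration.

```python
def keyword_overlap(expected: str, output: str) -> float:
    """Assumed metric: fraction of expected words that appear in the output."""
    expected_words = set(expected.lower().split())
    if not expected_words:
        return 0.0
    return len(expected_words & set(output.lower().split())) / len(expected_words)


reference = "validate input then retry the request with backoff"

# A reasonable paraphrase that reuses only some of the reference words.
honest = "check the payload first and retry with exponential backoff"

# The reference words shuffled into nonsense: no usable instructions.
stuffed = "backoff request retry validate input then the with"

honest_score = keyword_overlap(reference, honest)
stuffed_score = keyword_overlap(reference, stuffed)

print(f"honest:  {honest_score:.2f}")
print(f"stuffed: {stuffed_score:.2f}")
```

Under this assumed metric the shuffled output outscores the paraphrase, and the bare number gives an optimizer no hint about what to change.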

Test Plan

- RED first: pytest tests/core/test_fitness.py -q failed before _score_with_llm_judge existed
- pytest tests/core/test_fitness.py -q
- pytest -q
- runtime probe: with no LM configured, the metric returns dspy.Prediction(...), float(prediction) works, and fallback feedback is populated
- static added-line security scan: clean
- git diff --check

Result: 142 passed, 11 warnings (DSPy deprecation warnings only).

Closes #12

steezkelly · 2026-05-09T03:39:43Z

Closing this PR in favor of consolidated PR #68. Local integration found real helper-block overlap in evolution/skills/evolve_skill.py across the stack, and #68 preserves local test evidence: targeted stack tests 41 passed; full suite 164 passed; GitHub checks were absent on the split PRs. Review #68 instead.

fix: use LLM judge feedback for skill fitness

ba60e2d

steezkelly mentioned this pull request May 9, 2026

feat: consolidate evolution ingestion and safety gates #68

Closed

steezkelly closed this May 9, 2026

This was referenced May 9, 2026

Implement all-agent session ingestion and promotion gates #54

Open

feat: consolidate evolution gates stack (#55-#61) #70

Closed

steezkelly wants to merge 1 commit into NousResearch:main from steezkelly:fix/12-llm-judge-fitness
