Skip to content

fix: harden phase 1 skill evolution path#75

Open
ohkyuetaek wants to merge 1 commit into
NousResearch:mainfrom
ohkyuetaek:fix/phase1-minimal-e2e
Open

fix: harden phase 1 skill evolution path#75
ohkyuetaek wants to merge 1 commit into
NousResearch:mainfrom
ohkyuetaek:fix/phase1-minimal-e2e

Conversation

@ohkyuetaek
Copy link
Copy Markdown

Summary

  • Store skill text in DSPy signature instructions so GEPA can actually optimize the skill body
  • Update the GEPA construction path for DSPy 3.2+ with reflection_lm and max_full_evals
  • Validate full SKILL.md artifacts, including YAML frontmatter, before accepting candidates
  • Add configurable output_dir support and API-free golden dataset regression coverage for emitted artifacts

Test Plan

  • .venv/bin/python -m pytest -q
  • Static added-line security scan: no hardcoded secrets, shell=True/os.system, eval/exec, pickle.loads, or obvious SQL format injection matches
  • Independent pre-commit reviewer passed; one non-blocking baseline_body suggestion was applied

Notes

  • Direct push to NousResearch/hermes-agent-self-evolution was denied for this account, so this PR is opened from the ohkyuetaek fork.

- Store skill text in DSPy signature instructions so GEPA can evolve it
- Use DSPy 3.2 GEPA constructor with reflection LM and max_full_evals
- Validate full SKILL.md artifacts and honor configurable output_dir
- Add API-free golden dataset regression coverage for emitted artifacts
@ohkyuetaek
Copy link
Copy Markdown
Author

Ready for review.

Validation completed locally:

  • .venv/bin/python -m pytest -q → 145 passed, 11 DSPy deprecation warnings
  • Added-line static security scan found no hardcoded secrets, shell=True/os.system, eval/exec, pickle.loads, or obvious SQL string-format injection patterns
  • Independent pre-commit review passed; one non-blocking suggestion around baseline_body handling was applied and the full suite was re-run

This PR is intentionally scoped to hardening the Phase 1 minimal skill-evolution path: DSPy 3.2+ GEPA construction, optimizable skill instructions, full SKILL.md validation, configurable artifact output, and API-free regression coverage for emitted baseline/evolved/metrics artifacts.

Copy link
Copy Markdown

@jarrettj jarrettj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hermes Agent Code Review

Checked out pr-75, ran 139 tests (all pass), reviewed all 4 changed files. Two issues need to be addressed before merge, two minor suggestions included.


Critical

None

Warnings

  1. evolve_skill.py:300 — Variable name output_dir is reused: the function parameter (line 77, Optional[str]) is consumed at line 90, then shadowed at line 300 with a Path object of different semantics. This confuses readers and static analysis. Rename to run_dir.

  2. evolve_skill.py:264 — Holdout baseline may be scored with the evolved skill. optimizer.compile(baseline_module, ...) may mutate baseline_module in-place. If baseline_module is optimized_module, baseline holdout scores use the evolved skill text, making the improvement delta meaningless. Fix: create evolved_module = SkillModule(evolved_body) and use it for evolved-path scoring instead of optimized_module.

Suggestions

  1. skill_module.py:8import re is unused (leftover after TaskWithSkill was removed). Delete it.

  2. skill_module.py:104self.predictor.predict.signature is an undocumented DSPy internal. Add a comment: # DSPy 3.x: ChainOfThought stores the inner Predict at .predict.

Looks Good

  • Storing skill text in signature instructions (not as an InputField) is the correct GEPA target — sound design.
  • build_gepa_optimizer properly uses max_full_evals + reflection_lm per DSPy 3.2+ API.
  • validate_skill_candidate correctly assembles full SKILL.md (frontmatter + body) before constraint checks.
  • output_dir configurability is clean; the Path(output_dir) conversion is properly guarded.
  • Test coverage for all four new behaviors is solid.

Reviewed by Hermes Agent

# ── 10. Save output ─────────────────────────────────────────────────
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_dir = Path("output") / skill_name / timestamp
output_dir = config.output_dir / skill_name / timestamp
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning: output_dir here shadows the function parameter of the same name (line 77, Optional[str]). At this point it becomes a Path — different type, different semantics. Rename this local variable to run_dir to avoid confusion.

@property
def skill_text(self) -> str:
"""Return the current/evolved skill instructions from the predictor."""
return self.predictor.predict.signature.instructions
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: self.predictor.predict.signature accesses an undocumented DSPy internal (ChainOfThought.predict). This works today (all tests pass), but could silently break on a DSPy version bump. Add a comment noting the DSPy 3.x dependency.

@jarrettj
Copy link
Copy Markdown

Code Review Summary — PR #75: fix: harden phase 1 skill evolution path

Verdict: Request Changes (0 critical, 2 warnings, 2 suggestions)

Scope: 4 files · +235 / -30 lines · 139 tests pass


⚠️ Warnings

  1. evolution/skills/evolve_skill.py:300 — Variable name output_dir is reused in the same function scope. The parameter output_dir: Optional[str] (line 77) is consumed at line 90, then reassigned at line 300 as Path with different semantics. Rename the local to run_dir to eliminate the shadow.

  2. evolution/skills/evolve_skill.py:264 — Holdout baseline may be inadvertently scored with the evolved skill. optimizer.compile(baseline_module, ...) may mutate baseline_module in-place and return it. If so, baseline_module is optimized_module and the comparison is invalid. Fix: after extracting evolved_body, create evolved_module = SkillModule(evolved_body) and score the evolved path from that instead of from optimized_module. (The existing test doesn't catch this because FakeSkillModule always returns identical output.)

💡 Suggestions

  1. evolution/skills/skill_module.py:8import re is a dead import — leftover after TaskWithSkill was removed. Delete it.

  2. evolution/skills/skill_module.py:104self.predictor.predict.signature accesses an undocumented DSPy internal. Works now, but fragile across DSPy versions. Add a brief comment: # DSPy 3.x: ChainOfThought stores the inner Predict at .predict.


✅ Looks Good

  • Storing skill text in signature instructions (not as an InputField) is the right GEPA target — this is the core correctness fix.
  • build_gepa_optimizer correctly uses max_full_evals + reflection_lm (DSPy 3.2+ API).
  • validate_skill_candidate correctly reassembles full SKILL.md before validation — the old body-only check was the bug this PR set out to fix.
  • output_dir configurability is clean and properly guarded with Path(output_dir).
  • Test file covers all four new behaviors (signature placement, property wiring, GEPA constructor shape, end-to-end artifact emission) without making live API calls.

Reviewed by Hermes Agent | 139 tests passed locally

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants