feat: Phase 2 tool description + Phase 3 prompt section evolution #86

Open
hotepfederales-creator wants to merge 2 commits into NousResearch:main from hotepfederales-creator:feat/phases-2-3-evolution

@hotepfederales-creator

Summary

Extends self-evolution beyond skills (Phase 1) to two additional artifact types using the same KPI-gated harness pattern:

  • Phase 2 - tool descriptions (evolution/tools/): AST-load *_SCHEMA dicts from hermes-agent/tools/*.py, wrap the description as predictor.signature.instructions, optimize with MIPROv2/GEPA against a positive/negative tool-selection dataset, enforce a 500-char budget with sentence-bounded truncation, gate with cross-tool non-regression.
  • Phase 3 - system-prompt sections (evolution/prompts/): AST-load DEFAULT_AGENT_IDENTITY / MEMORY_GUIDANCE / SESSION_SEARCH_GUIDANCE / SKILLS_GUIDANCE and explode PLATFORM_HINTS per-platform, optimize with overlap / LLM-judge / hybrid metrics, gate with identity-trait preservation check, with a non-baseline-winner fallback for when MIPROv2's best candidate equals the baseline.
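
The AST-loading step described in both bullets can be sketched as follows. This is a minimal illustration, not the PR's loader: the function name `load_schema_dicts` and the sample source are hypothetical, and it assumes each schema is a top-level assignment whose value is a pure literal dict.

```python
import ast


def load_schema_dicts(source: str) -> dict[str, dict]:
    """Collect top-level NAME_SCHEMA = {...} assignments without importing the module."""
    schemas: dict[str, dict] = {}
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue  # annotated assignments (ast.AnnAssign) are ignored in this sketch
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id.endswith("_SCHEMA"):
                try:
                    # literal_eval accepts an AST node and never executes code
                    value = ast.literal_eval(node.value)
                except (ValueError, TypeError, SyntaxError):
                    continue  # value is not a pure literal; skip it
                if isinstance(value, dict):
                    schemas[target.id] = value
    return schemas


# Hypothetical stand-in for the contents of a tools/*.py file.
EXAMPLE_SOURCE = '''
READ_FILE_SCHEMA = {"name": "read_file", "description": "Read a file from disk."}
UNRELATED_CONSTANT = 3
'''

schemas = load_schema_dicts(EXAMPLE_SOURCE)
```

Parsing instead of importing means a tool module with heavy or side-effecting imports can still be inspected safely.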

Shared infrastructure: polymorphic reproducibility manifests, per-phase gate evaluators, tool_description and system_prompt rollout policy tiers.

Live KPI Gate Results

| Phase | Artifact | Baseline | Evolved | Delta | Gate |
| --- | --- | --- | --- | --- | --- |
| 1 | github-code-review skill | 0.473 | 0.518 | +9.7% | PASS |
| 2 | read_file tool description | 0.600 | 0.800 | +33% selection acc | PASS |
| 3 | MEMORY_GUIDANCE section | 0.513 | 0.579 | +12.82% rel / +0.066 abs | PASS |

All three runs used a local ollama/qwen2.5:7b model end-to-end. The constraint, manifest reproducibility, identity-trait preservation, and rollout safety checks were all green per phase.
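
For readers checking the Delta column, the relative and absolute deltas can be computed as below. The helper name `deltas` is hypothetical. Recomputing from the rounded scores shown above may differ slightly from the reported percentages, presumably because those were derived from unrounded scores (an assumption, not something the PR states).

```python
def deltas(baseline: float, evolved: float) -> tuple[float, float]:
    """Return (absolute, relative) improvement of evolved over baseline."""
    if baseline == 0:
        raise ValueError("relative delta is undefined for a zero baseline")
    absolute = evolved - baseline
    relative = absolute / baseline
    return absolute, relative


# Phase 2 row: selection accuracy 0.600 -> 0.800
phase2_abs, phase2_rel = deltas(0.600, 0.800)
print(f"abs={phase2_abs:+.3f} rel={phase2_rel:+.1%}")
```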

Tests

  • 229 tests passing (up from 167 pre-PR)
  • python -m ruff check . clean
  • New test packages: tests/tools/ (loader, module, phase2 gate) and tests/prompts/ (loader, module, judge metric, phase3 gate)

Files of Interest

  • evolution/tools/tool_module.py - clean_evolved_description(max_chars) strips optimizer-inlined few-shot examples and enforces a sentence-bounded budget.
  • evolution/prompts/prompt_module.py - make_llm_judge_metric adds LLM-as-judge fitness with optional keyword-overlap fallback weight.
  • evolution/prompts/evolve_prompt_section.py - falls back to the highest-scoring non-baseline MIPROv2 candidate when the optimizer's best winner is the baseline itself.
  • evolution/core/phase_gate.py - evaluate_phase2_gate adds cross-tool regression check, evaluate_phase3_gate adds identity-trait preservation.
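
A sentence-bounded character budget, as described for clean_evolved_description, might look roughly like this. It is a sketch under assumptions: the name `truncate_to_sentences`, the punctuation-based sentence splitter, and the hard-cut fallback are illustrative choices, and the few-shot stripping step is omitted.

```python
import re


def truncate_to_sentences(text: str, max_chars: int = 500) -> str:
    """Keep whole sentences from the start of text while staying within max_chars."""
    text = " ".join(text.split())  # normalize whitespace
    if len(text) <= max_chars:
        return text
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept: list[str] = []
    length = 0
    for sentence in sentences:
        extra = len(sentence) + (1 if kept else 0)  # +1 for the joining space
        if length + extra > max_chars:
            break
        kept.append(sentence)
        length += extra
    if kept:
        return " ".join(kept)
    # Assumed fallback: the first sentence alone exceeds the budget, so cut hard.
    return text[:max_chars].rstrip()


SAMPLE = "First sentence. Second sentence is here. Third."
short = truncate_to_sentences(SAMPLE, max_chars=20)
```

Cutting at sentence boundaries avoids handing the model a description that ends mid-clause, at the cost of sometimes using less than the full budget.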

Out of Scope

  • Phase 4 (Darwinian Evolver: code-as-organism, sandboxed pytest, composite fitness, strict review) - separate effort.
  • Phase 5 (continuous loop with budget caps + cron) - separate effort.
  • GEPA currently falls back to MIPROv2 because of a signature mismatch (the installed DSPy version rejects max_steps); the fix is deferred until we pin a version-correct wrapper.

Repository Hygiene

  • .gitignore now excludes output/ run trees and stray Windows NUL 0-byte files.
  • Per-run outputs under output/{skills,tools,prompts}/<name>/<timestamp>/ are not committed.

Claude Agent and others added 2 commits May 17, 2026 19:10
Extends self-evolution beyond skills (Phase 1) to two additional artifact types,
each with the same KPI-gated harness pattern: AST loader -> DSPy module exposing
artifact text via signature.instructions -> synthetic dataset -> MIPROv2/GEPA
optimization -> constraint validation -> reproducibility manifest -> phase gate.

Phase 2 (tool descriptions, evolution/tools/):
- tool_loader.py parses *_SCHEMA dicts from hermes-agent/tools/*.py
- tool_module.py wraps a tool description as predictor.signature.instructions,
  with clean_evolved_description that strips MIPROv2-inlined few-shot examples
  and enforces a max_chars budget by sentence boundaries.
- tool_dataset.py builds positive/negative scenarios for tool selection.
- evolve_tool.py CLI orchestrator with Phase 2 gate (cross-tool non-regression).
- Live gate passed on read_file with qwen2.5:7b: selection 0.600 -> 0.800.
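
One way to read the selection score above is as accuracy over positive and negative scenarios. The sketch below assumes that interpretation; `selection_accuracy`, `keyword_selector`, and the example tasks are hypothetical, and the keyword stub merely stands in for a model call.

```python
from typing import Callable

Example = tuple[str, bool]  # (task text, whether the target tool should be selected)


def selection_accuracy(
    examples: list[Example],
    select_tool: Callable[[str], str],
    target_tool: str,
) -> float:
    """Fraction of scenarios where selecting (or not selecting) the target tool was correct."""
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    for task, should_select in examples:
        selected = select_tool(task) == target_tool
        if selected == should_select:
            correct += 1
    return correct / len(examples)


def keyword_selector(task: str) -> str:
    """Stand-in for the model: pick read_file only when the task mentions 'read'."""
    return "read_file" if "read" in task.lower() else "other_tool"


EXAMPLES: list[Example] = [
    ("Read the contents of config.yaml", True),
    ("Please read README.md and summarize it", True),
    ("Show me what is inside notes.txt", True),  # the stub misses this positive
    ("Search the web for Python release dates", False),
    ("Delete the temporary build directory", False),
]

accuracy = selection_accuracy(EXAMPLES, keyword_selector, "read_file")
```

Negative scenarios matter here: a description that makes the tool win every selection would score well on positives alone but fail the negatives.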

Phase 3 (system-prompt sections, evolution/prompts/):
- prompt_loader.py extracts DEFAULT_AGENT_IDENTITY / MEMORY_GUIDANCE /
  SESSION_SEARCH_GUIDANCE / SKILLS_GUIDANCE and explodes PLATFORM_HINTS per
  platform key; also adds an identity_traits_present() helper.
- prompt_module.py PromptSectionModule + behavioral_fitness_metric
  (keyword overlap) + make_llm_judge_metric (LLM-as-judge with optional
  overlap fallback) and clean_evolved_section.
- prompt_dataset.py SyntheticPromptScenarioBuilder with section-specific
  behavioral briefs.
- evolve_prompt_section.py CLI with --metric overlap/judge/hybrid,
  --eval-dataset-size, --optimizer-auto, and a non-baseline-winner fallback
  when MIPROv2's best candidate equals the baseline.
- Phase 3 gate adds identity-trait preservation check on DEFAULT_AGENT_IDENTITY.
- Live gate passed on MEMORY_GUIDANCE with qwen2.5:7b hybrid judge:
  behavioral 0.513 -> 0.579 (+12.82% relative / +0.066 absolute);
  section size 1185 -> 1102 chars; all 7 gate checks green.
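
The non-baseline-winner fallback can be sketched as below, assuming the optimizer exposes its candidates as (text, score) pairs. The `Candidate` type and `pick_candidate` name are hypothetical and do not reflect how DSPy or this PR represents candidates.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    text: str
    score: float


def pick_candidate(candidates: list[Candidate], baseline_text: str) -> Optional[Candidate]:
    """Return the best candidate, or the best non-baseline one if the winner is the baseline."""
    if not candidates:
        return None
    baseline = baseline_text.strip()
    best = max(candidates, key=lambda c: c.score)
    if best.text.strip() != baseline:
        return best
    # The optimizer's winner is the unchanged baseline: fall back to the
    # highest-scoring candidate that actually differs from it.
    non_baseline = [c for c in candidates if c.text.strip() != baseline]
    if not non_baseline:
        return None  # nothing new to evaluate
    return max(non_baseline, key=lambda c: c.score)


BASELINE = "Use memory for durable facts."
pool = [
    Candidate(BASELINE, 0.60),
    Candidate("Store durable facts in memory; skip transient details.", 0.58),
    Candidate("Remember things.", 0.41),
]
chosen = pick_candidate(pool, BASELINE)
```

The fallback candidate may score below the baseline on the optimizer's own metric, so it still has to clear the phase gate on the evaluation set before it counts as an improvement.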

Shared (evolution/core/):
- reproducibility.py polymorphic manifests (skill / tool / prompt section).
- phase_gate.py evaluate_phase2_gate (cross-tool regression) +
  evaluate_phase3_gate (identity traits).
- rollout_policy.py 'tool_description' + 'system_prompt' rollout tiers.
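
A cross-tool non-regression check of the kind evaluate_phase2_gate adds could be expressed as follows. This is a sketch only: the function name, the per-tool score dictionaries, and the `tolerance` parameter are assumptions rather than the PR's interface.

```python
from typing import Optional


def cross_tool_non_regression(
    baseline_scores: dict[str, float],
    evolved_scores: dict[str, float],
    tolerance: float = 0.0,
) -> tuple[bool, dict[str, tuple[float, Optional[float]]]]:
    """Pass only if no tool scores worse after evolving one tool's description."""
    regressions: dict[str, tuple[float, Optional[float]]] = {}
    for tool, base in baseline_scores.items():
        new = evolved_scores.get(tool)
        # A missing score is treated as a regression rather than silently ignored.
        if new is None or new < base - tolerance:
            regressions[tool] = (base, new)
    return (not regressions, regressions)


# Hypothetical scores: read_file improves, but write_file gets confused with it.
before = {"read_file": 0.60, "write_file": 0.70, "search": 0.80}
after = {"read_file": 0.80, "write_file": 0.65, "search": 0.80}

passed, regressions = cross_tool_non_regression(before, after)
```

The point of the check is that a description can win more selections for its own tool by stealing them from neighbors, which a single-tool score would not reveal.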

Tests: tests/tools/ and tests/prompts/ add coverage for loaders, modules,
metrics, judge score parsing, and per-phase gates. 229 passing, ruff clean.

.gitignore now excludes output/ run trees and stray Windows NUL files.
…assed

Phase 4 adds LLM-driven code mutation for hermes-agent tool files:
- CodeOrganism: git worktree per candidate, SHA256 tracking
- signature_freeze: AST-level invariant gates (signatures, registry calls, try/except count)
- WorktreeSandbox: subprocess timeout + path confinement (soft isolation)
- InternalMutator: DSPy-driven proposer with fenced-code stripping
- ExternalDarwinianEvolverMutator: AGPL stub for future shell-out
- CompositeFitness: pytest 100% hard gate, ruff penalty, freeze gate, bug-repro transition
- evolve_tool_code CLI: --tool, --bug-brief, --iterations, --engine, --pytest-target
- CodeReproducibilityManifest + Phase 4 gate evaluation
- 266 tests passing, ruff clean
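
The AST-level invariant idea behind signature_freeze can be illustrated with the sketch below. The helper names are hypothetical, only argument lists are compared (not return annotations or registry calls), and the try/except count is merely reported since the commit message does not say how the gate compares it.

```python
import ast


def function_signatures(source: str) -> dict[str, str]:
    """Map each function name to a dump of its argument list."""
    signatures: dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.dump omits line/column attributes by default, so moving a
            # function within the file does not change its fingerprint.
            # Functions sharing a name overwrite each other in this sketch.
            signatures[node.name] = ast.dump(node.args)
    return signatures


def count_try_blocks(source: str) -> int:
    """Number of try statements anywhere in the source."""
    return sum(isinstance(node, ast.Try) for node in ast.walk(ast.parse(source)))


def signatures_frozen(before: str, after: str) -> bool:
    """True when the mutated source keeps every function's argument list intact."""
    return function_signatures(before) == function_signatures(after)


BEFORE = "def validate(path, root=None):\n    return path\n"
AFTER = (
    "def validate(path, root=None):\n"
    "    try:\n"
    "        return path\n"
    "    except PermissionError:\n"
    "        raise\n"
)
CHANGED = "def validate(path):\n    return path\n"
```

Here `signatures_frozen(BEFORE, AFTER)` holds because only the body changed, while `CHANGED` drops a parameter and would be rejected.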

Live demo against hermes-agent/path_security with ollama/qwen2.5:7b:
  - Bug: validate_within_dir PermissionError on junctions / UNC paths
  - Result: gate PASSED, fitness 0.50, error handling 1->2 try/except
  - Candidate: surgical nested try/except + PermissionError catch
  - No auto-merge; candidate emitted as patch for human review
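
For context on the bug class in this demo, a path-confinement check that tolerates unresolvable paths might look like the sketch below. The name `is_within_dir` is hypothetical and this is not the project's validate_within_dir or the emitted patch; treating resolution errors as "outside" is an assumed policy.

```python
from pathlib import Path


def is_within_dir(candidate: str, root: str) -> bool:
    """Return True if candidate resolves to root itself or a path beneath it."""
    try:
        resolved = Path(candidate).resolve()
        root_resolved = Path(root).resolve()
    except (OSError, RuntimeError):
        # PermissionError is a subclass of OSError. Assumed policy: a path that
        # cannot be resolved is rejected instead of crashing the caller.
        return False
    return resolved == root_resolved or root_resolved in resolved.parents


inside = is_within_dir("/tmp/sandbox/work/file.txt", "/tmp/sandbox")
escaped = is_within_dir("/tmp/sandbox/../etc/passwd", "/tmp/sandbox")
```

Comparing resolved paths rather than raw strings is what catches the `..` escape in the second call.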

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
breakneo added a commit to breakneo/hermes-agent-self-evolution that referenced this pull request May 18, 2026
…earch#86

Phase 2 (tools), Phase 3 (prompts), Phase 4 (code Darwinian evolver),
shared core infrastructure, 62 new test files, run scripts.

Source: hotepfederales-creator feat/phases-2-3-evolution
Conflicting files left for manual merge step.
breakneo added a commit to breakneo/hermes-agent-self-evolution that referenced this pull request May 18, 2026
…ools + prompts + code)

Cherry-picked from hotepfederales-creator feat/phases-2-3-evolution:
- Phase 2: tool description evolution (evolution/tools/)
- Phase 3: system prompt section evolution (evolution/prompts/)
- Phase 4: Darwinian code evolver (evolution/code/)
- Shared: phase_gate, reproducibility, rollout_policy, stop_loss, governance

Merged with our PR NousResearch#52 (git_pr_automation, report_artifact, benchmark_gate,
hermes_eval). Resolved conflicts in evolve_skill.py, config.py, __init__.py,
fitness.py, skill_module.py, test_evolve_skill.py.

Test status: 302 passed, 24 failed
- 3 sandbox failures: macOS portability (cmd → bash)
- 21 evolve_skill orchestration test failures: PR NousResearch#52 helper functions
  need re-wiring into new phase_gate/reproducibility architecture
  (known debt — core modules fully intact, all 39 core tests pass)
breakneo added a commit to breakneo/hermes-agent-self-evolution that referenced this pull request May 18, 2026