feat: Phase 2 tool description + Phase 3 prompt section evolution by hotepfederales-creator · Pull Request #86 · NousResearch/hermes-agent-self-evolution

hotepfederales-creator · 2026-05-18T00:42:55Z

Summary

Extends self-evolution beyond skills (Phase 1) to two additional artifact types using the same KPI-gated harness pattern:

Phase 2 - tool descriptions (evolution/tools/): AST-load *_SCHEMA dicts from hermes-agent/tools/*.py, wrap the description as predictor.signature.instructions, optimize with MIPROv2/GEPA against a positive/negative tool-selection dataset, enforce a 500-char budget with sentence-bounded truncation, gate with cross-tool non-regression.
Phase 3 - system-prompt sections (evolution/prompts/): AST-load DEFAULT_AGENT_IDENTITY / MEMORY_GUIDANCE / SESSION_SEARCH_GUIDANCE / SKILLS_GUIDANCE and explode PLATFORM_HINTS per-platform, optimize with overlap / LLM-judge / hybrid metrics, gate with identity-trait preservation check, with a non-baseline-winner fallback for when MIPROv2's best candidate equals the baseline.

Shared infrastructure: polymorphic reproducibility manifests, per-phase gate evaluators, tool_description and system_prompt rollout policy tiers.
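
For readers unfamiliar with AST loading, here is a minimal sketch of how `*_SCHEMA` dicts could be read from a tool file without importing it. It is not the PR's `tool_loader.py`: the function name `load_schema_dicts` and the sample source are hypothetical, and it assumes the schemas are plain literals assigned at module level.

```python
import ast


def load_schema_dicts(source: str) -> dict[str, dict]:
    """Return {name: value} for module-level assignments named *_SCHEMA.

    Hypothetical helper: the module is parsed, never imported or executed.
    """
    schemas = {}
    tree = ast.parse(source)
    for node in tree.body:
        if not isinstance(node, ast.Assign):
            continue  # annotated assignments are ignored in this sketch
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id.endswith("_SCHEMA"):
                try:
                    schemas[target.id] = ast.literal_eval(node.value)
                except (ValueError, TypeError):
                    # Schema built from calls or names, not a pure literal: skip.
                    pass
    return schemas


# Made-up tool module source, for illustration only.
EXAMPLE_SOURCE = '''
READ_FILE_SCHEMA = {
    "name": "read_file",
    "description": "Read a text file and return its contents.",
}
DYNAMIC_SCHEMA = build_schema()
HELPER = 3
'''

schemas = load_schema_dicts(EXAMPLE_SOURCE)
print(schemas["READ_FILE_SCHEMA"]["description"])
```

Parsing instead of importing keeps the loader free of the tool module's import-time side effects, which is presumably why an AST-based approach was chosen.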

Live KPI Gate Results

Phase	Artifact	Baseline	Evolved	Delta	Gate
1	github-code-review skill	0.473	0.518	+9.7%	PASS
2	read_file tool description	0.600	0.800	+33% selection acc	PASS
3	MEMORY_GUIDANCE section	0.513	0.579	+12.82% rel / +0.066 abs	PASS

All three runs used local ollama/qwen2.5:7b end-to-end. Constraints, manifest reproducibility, identity-trait preservation, and rollout safety checks all green per phase.
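
The Delta column mixes absolute and relative improvements, so a short sketch of the arithmetic may help. The helper name `deltas` is hypothetical. Recomputing from the rounded table values reproduces the reported percentages only approximately (for example about +9.5% and +12.9% for Phases 1 and 3), presumably because the reported deltas were computed from unrounded scores.

```python
def deltas(baseline: float, evolved: float) -> tuple[float, float]:
    """Return (absolute, relative) improvement of evolved over baseline."""
    absolute = evolved - baseline
    relative = absolute / baseline
    return absolute, relative


# Scores copied from the table above (already rounded to three decimals).
rows = {
    "github-code-review skill": (0.473, 0.518),
    "read_file tool description": (0.600, 0.800),
    "MEMORY_GUIDANCE section": (0.513, 0.579),
}

for name, (base, evo) in rows.items():
    abs_d, rel_d = deltas(base, evo)
    print(f"{name}: {abs_d:+.3f} abs, {rel_d:+.1%} rel")
```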

Tests

229 tests passing (up from 167 pre-PR)
python -m ruff check . is clean
New test packages: tests/tools/ (loader, module, phase2 gate) and tests/prompts/ (loader, module, judge metric, phase3 gate)

Files of Interest

evolution/tools/tool_module.py - clean_evolved_description(max_chars) strips optimizer-inlined few-shot examples and enforces a sentence-bounded budget.
evolution/prompts/prompt_module.py - make_llm_judge_metric adds LLM-as-judge fitness with optional keyword-overlap fallback weight.
evolution/prompts/evolve_prompt_section.py - falls back to the highest-scoring non-baseline MIPROv2 candidate when the optimizer's best candidate is the baseline itself.
evolution/core/phase_gate.py - evaluate_phase2_gate adds cross-tool regression check, evaluate_phase3_gate adds identity-trait preservation.
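
The sentence-bounded budget mentioned for `clean_evolved_description` can be sketched as follows. This is an illustration of the idea only, not the PR's implementation: the name `truncate_to_sentences`, the sentence-splitting regex, and the hard-cut fallback are assumptions, and the few-shot stripping step is omitted.

```python
import re


def truncate_to_sentences(text: str, max_chars: int = 500) -> str:
    """Keep whole sentences from the start of `text` while staying within max_chars."""
    text = " ".join(text.split())  # normalize whitespace
    if len(text) <= max_chars:
        return text

    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = []
    length = 0
    for sentence in sentences:
        extra = len(sentence) + (1 if kept else 0)  # +1 for the joining space
        if length + extra > max_chars:
            break
        kept.append(sentence)
        length += extra

    if not kept:
        # Assumed fallback: the first sentence alone exceeds the budget, so hard-cut.
        return text[:max_chars].rstrip()
    return " ".join(kept)


sample = "One two. Three four. Five six."
print(truncate_to_sentences(sample, max_chars=20))
```

Cutting at sentence boundaries avoids shipping a description that ends mid-clause, at the cost of sometimes using less than the full budget.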

Out of Scope

Phase 4 (Darwinian Evolver: code-as-organism, sandboxed pytest, composite fitness, strict review) - separate effort.
Phase 5 (continuous loop with budget caps + cron) - separate effort.
GEPA signature mismatch (max_steps rejected by the installed DSPy version): runs currently fall back to MIPROv2; a fix is deferred until we pin a version-correct wrapper.
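
One possible shape for a version-tolerant wrapper is to drop keyword arguments the installed constructor does not accept. The sketch below is an assumption about how that could look, not what the PR does (the PR falls back to MIPROv2). `FakeOptimizer` is a stand-in class, not a DSPy API, and `supported_kwargs` is a hypothetical helper.

```python
import inspect


def supported_kwargs(callable_obj, kwargs: dict) -> dict:
    """Return only the kwargs that callable_obj's signature accepts."""
    params = inspect.signature(callable_obj).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)  # accepts **kwargs, so nothing needs filtering
    return {k: v for k, v in kwargs.items() if k in params}


class FakeOptimizer:
    """Stand-in for an optimizer whose constructor lacks max_steps."""

    def __init__(self, metric, auto="light"):
        self.metric = metric
        self.auto = auto


requested = {"metric": len, "auto": "light", "max_steps": 10}
accepted = supported_kwargs(FakeOptimizer, requested)
optimizer = FakeOptimizer(**accepted)

# Report what was dropped so the run log shows the mismatch instead of hiding it.
dropped = sorted(set(requested) - set(accepted))
print("dropped kwargs:", dropped)
```

Silently dropping an argument can change optimizer behavior, so logging the dropped names (or failing loudly) is probably the safer default.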

Repository Hygiene

.gitignore now excludes output/ run trees and stray Windows NUL 0-byte files.
Per-run outputs under output/{skills,tools,prompts}/<name>/<timestamp>/ are not committed.

Extends self-evolution beyond skills (Phase 1) to two additional artifact types, each with the same KPI-gated harness pattern: AST loader -> DSPy module exposing artifact text via signature.instructions -> synthetic dataset -> MIPROv2/GEPA optimization -> constraint validation -> reproducibility manifest -> phase gate.

Phase 2 (tool descriptions, evolution/tools/):
- tool_loader.py parses *_SCHEMA dicts from hermes-agent/tools/*.py
- tool_module.py wraps a tool description as predictor.signature.instructions, with clean_evolved_description that strips MIPROv2-inlined few-shot examples and enforces a max_chars budget by sentence boundaries.
- tool_dataset.py builds positive/negative scenarios for tool selection.
- evolve_tool.py CLI orchestrator with Phase 2 gate (cross-tool non-regression).
- Live gate passed on read_file with qwen2.5:7b: selection 0.600 -> 0.800.

Phase 3 (system-prompt sections, evolution/prompts/):
- prompt_loader.py extracts DEFAULT_AGENT_IDENTITY / MEMORY_GUIDANCE / SESSION_SEARCH_GUIDANCE / SKILLS_GUIDANCE and explodes PLATFORM_HINTS per platform key. identity_traits_present() helper.
- prompt_module.py PromptSectionModule + behavioral_fitness_metric (keyword overlap) + make_llm_judge_metric (LLM-as-judge with optional overlap fallback) and clean_evolved_section.
- prompt_dataset.py SyntheticPromptScenarioBuilder with section-specific behavioral briefs.
- evolve_prompt_section.py CLI with --metric overlap/judge/hybrid, --eval-dataset-size, --optimizer-auto, and a non-baseline-winner fallback when MIPROv2's best candidate equals the baseline.
- Phase 3 gate adds identity-trait preservation check on DEFAULT_AGENT_IDENTITY.
- Live gate passed on MEMORY_GUIDANCE with qwen2.5:7b hybrid judge: behavioral 0.513 -> 0.579 (+12.82% relative / +0.066 absolute); section size 1185 -> 1102 chars; all 7 gate checks green.

Shared (evolution/core/):
- reproducibility.py polymorphic manifests (skill / tool / prompt section).
- phase_gate.py evaluate_phase2_gate (cross-tool regression) + evaluate_phase3_gate (identity traits).
- rollout_policy.py 'tool_description' + 'system_prompt' rollout tiers.

Tests: tests/tools/ and tests/prompts/ add coverage for loaders, modules, metrics, judge score parsing, and per-phase gates. 229 passing, ruff clean.

.gitignore now excludes output/ run trees and stray Windows NUL files.
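
The non-baseline-winner fallback can be sketched as a ranking step that skips candidates identical to the baseline. This is a minimal illustration under assumed data shapes: `Candidate` and `pick_winner` are hypothetical names, and the real candidate objects returned by MIPROv2 are not modeled here.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    text: str     # evolved section text
    score: float  # fitness on the evaluation set


def pick_winner(candidates: list[Candidate], baseline_text: str) -> Optional[Candidate]:
    """Return the highest-scoring candidate whose text differs from the baseline.

    Returns None when every candidate equals the baseline.
    """
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    for cand in ranked:
        if cand.text.strip() != baseline_text.strip():
            return cand
    return None


# Made-up scores, for illustration only.
pool = [
    Candidate("base", 0.60),
    Candidate("variant one", 0.55),
    Candidate("variant two", 0.58),
]
winner = pick_winner(pool, baseline_text="base")
print(winner)
```

A fallback winner may score below the baseline, as in this made-up pool, so it would still need to go through the phase gate before rollout.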

…assed

Phase 4 adds LLM-driven code mutation for hermes-agent tool files:
- CodeOrganism: git worktree per candidate, SHA256 tracking
- signature_freeze: AST-level invariant gates (signatures, registry calls, try/except count)
- WorktreeSandbox: subprocess timeout + path confinement (soft isolation)
- InternalMutator: DSPy-driven proposer with fenced-code stripping
- ExternalDarwinianEvolverMutator: AGPL stub for future shell-out
- CompositeFitness: pytest 100% hard gate, ruff penalty, freeze gate, bug-repro transition
- evolve_tool_code CLI: --tool, --bug-brief, --iterations, --engine, --pytest-target
- CodeReproducibilityManifest + Phase 4 gate evaluation
- 266 tests passing, ruff clean

Live demo against hermes-agent/path_security with ollama/qwen2.5:7b:
- Bug: validate_within_dir PermissionError on junctions / UNC paths
- Result: gate PASSED, fitness 0.50, error handling 1->2 try/except
- Candidate: surgical nested try/except + PermissionError catch
- No auto-merge; candidate emitted as patch for human review

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
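
An AST-level freeze gate of the kind listed above might look roughly like the sketch below. It is not the PR's `signature_freeze` module: `fingerprint` and `freeze_violations` are hypothetical names, registry-call checks are omitted, and the rule that the try/except count may grow but not shrink is an assumption based on the demo's 1->2 result.

```python
import ast


def fingerprint(source: str) -> dict:
    """Collect function signatures and the number of try statements in source."""
    signatures = {}
    try_count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.dump omits line numbers, so only the argument structure is compared.
            # Same-named methods in different classes collide in this sketch.
            signatures[node.name] = ast.dump(node.args)
        elif isinstance(node, ast.Try):
            try_count += 1
    return {"signatures": signatures, "try_count": try_count}


def freeze_violations(original: str, candidate: str) -> list[str]:
    """List invariant violations introduced by candidate relative to original."""
    before, after = fingerprint(original), fingerprint(candidate)
    problems = []
    for name, sig in before["signatures"].items():
        if name not in after["signatures"]:
            problems.append(f"removed function: {name}")
        elif after["signatures"][name] != sig:
            problems.append(f"signature changed: {name}")
    if after["try_count"] < before["try_count"]:
        problems.append("try/except count decreased")
    return problems


ORIGINAL = (
    "def validate(path, base):\n"
    "    try:\n"
    "        return True\n"
    "    except OSError:\n"
    "        return False\n"
)
# Candidate that changes the public signature and drops error handling.
BAD_CANDIDATE = "def validate(path):\n    return True\n"

print(freeze_violations(ORIGINAL, BAD_CANDIDATE))
```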

…earch#86

Phase 2 (tools), Phase 3 (prompts), Phase 4 (code Darwinian evolver), shared core infrastructure, 62 new test files, run scripts.
Source: hotepfederales-creator feat/phases-2-3-evolution
Conflicting files left for manual merge step.

…ools + prompts + code)

Cherry-picked from hotepfederales-creator feat/phases-2-3-evolution:
- Phase 2: tool description evolution (evolution/tools/)
- Phase 3: system prompt section evolution (evolution/prompts/)
- Phase 4: Darwinian code evolver (evolution/code/)
- Shared: phase_gate, reproducibility, rollout_policy, stop_loss, governance

Merged with our PR NousResearch#52 (git_pr_automation, report_artifact, benchmark_gate, hermes_eval). Resolved conflicts in evolve_skill.py, config.py, __init__.py, fitness.py, skill_module.py, test_evolve_skill.py.

Test status: 302 passed, 24 failed
- 3 sandbox failures: macOS portability (cmd → bash)
- 21 evolve_skill orchestration test failures: PR NousResearch#52 helper functions need re-wiring into new phase_gate/reproducibility architecture (known debt; core modules fully intact, all 39 core tests pass)

Claude Agent and others added 2 commits May 17, 2026 19:10

breakneo added a commit to breakneo/hermes-agent-self-evolution that referenced this pull request May 18, 2026

merge: integrate upstream PR NousResearch#86 — Phase 2-4 evolution

d8bd3b0

hotepfederales-creator wants to merge 2 commits into NousResearch:main from hotepfederales-creator:feat/phases-2-3-evolution
