feat: Phase 2 tool description + Phase 3 prompt section evolution #86
Open
hotepfederales-creator wants to merge 2 commits into
Extends self-evolution beyond skills (Phase 1) to two additional artifact types, each with the same KPI-gated harness pattern: AST loader -> DSPy module exposing artifact text via signature.instructions -> synthetic dataset -> MIPROv2/GEPA optimization -> constraint validation -> reproducibility manifest -> phase gate.

Phase 2 (tool descriptions, evolution/tools/):
- tool_loader.py parses *_SCHEMA dicts from hermes-agent/tools/*.py.
- tool_module.py wraps a tool description as predictor.signature.instructions, with clean_evolved_description that strips MIPROv2-inlined few-shot examples and enforces a max_chars budget by sentence boundaries.
- tool_dataset.py builds positive/negative scenarios for tool selection.
- evolve_tool.py CLI orchestrator with Phase 2 gate (cross-tool non-regression).
- Live gate passed on read_file with qwen2.5:7b: selection 0.600 -> 0.800.

Phase 3 (system-prompt sections, evolution/prompts/):
- prompt_loader.py extracts DEFAULT_AGENT_IDENTITY / MEMORY_GUIDANCE / SESSION_SEARCH_GUIDANCE / SKILLS_GUIDANCE and explodes PLATFORM_HINTS per platform key. Includes an identity_traits_present() helper.
- prompt_module.py provides PromptSectionModule, behavioral_fitness_metric (keyword overlap), make_llm_judge_metric (LLM-as-judge with optional overlap fallback), and clean_evolved_section.
- prompt_dataset.py provides SyntheticPromptScenarioBuilder with section-specific behavioral briefs.
- evolve_prompt_section.py CLI with --metric overlap/judge/hybrid, --eval-dataset-size, --optimizer-auto, and a non-baseline-winner fallback when MIPROv2's best candidate equals the baseline.
- Phase 3 gate adds identity-trait preservation check on DEFAULT_AGENT_IDENTITY.
- Live gate passed on MEMORY_GUIDANCE with qwen2.5:7b hybrid judge: behavioral 0.513 -> 0.579 (+12.82% relative / +0.066 absolute); section size 1185 -> 1102 chars; all 7 gate checks green.

Shared (evolution/core/):
- reproducibility.py polymorphic manifests (skill / tool / prompt section).
- phase_gate.py evaluate_phase2_gate (cross-tool regression) + evaluate_phase3_gate (identity traits).
- rollout_policy.py 'tool_description' + 'system_prompt' rollout tiers.

Tests: tests/tools/ and tests/prompts/ add coverage for loaders, modules, metrics, judge score parsing, and per-phase gates. 229 passing, ruff clean.

.gitignore now excludes output/ run trees and stray Windows NUL files.
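The commit message says clean_evolved_description enforces a max_chars budget by sentence boundaries. The sketch below illustrates that idea only; it is not the PR's implementation. The function name truncate_at_sentence, the regex used to detect sentence ends, and the hard-cut fallback are all assumptions made for illustration.

```python
import re

# Hypothetical sentence splitter: breaks after ., ! or ? followed by whitespace.
# The real clean_evolved_description may detect boundaries differently.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def truncate_at_sentence(text: str, max_chars: int = 500) -> str:
    """Keep whole leading sentences of `text` that fit within `max_chars`."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= max_chars:
        return text

    kept = ""
    for sentence in _SENTENCE_END.split(text):
        candidate = f"{kept} {sentence}".strip()
        if len(candidate) > max_chars:
            break  # adding this sentence would exceed the budget
        kept = candidate

    # Assumed fallback: hard cut when even the first sentence is over budget.
    return kept or text[:max_chars].rstrip()


if __name__ == "__main__":
    description = (
        "Read a file. Returns its contents as text. Use for small files only."
    )
    print(truncate_at_sentence(description, max_chars=45))
```

Cutting at a sentence boundary keeps the evolved description grammatical, which matters because the text is shown to the model verbatim as tool documentation.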
…assed

Phase 4 adds LLM-driven code mutation for hermes-agent tool files:
- CodeOrganism: git worktree per candidate, SHA256 tracking
- signature_freeze: AST-level invariant gates (signatures, registry calls, try/except count)
- WorktreeSandbox: subprocess timeout + path confinement (soft isolation)
- InternalMutator: DSPy-driven proposer with fenced-code stripping
- ExternalDarwinianEvolverMutator: AGPL stub for future shell-out
- CompositeFitness: pytest 100% hard gate, ruff penalty, freeze gate, bug-repro transition
- evolve_tool_code CLI: --tool, --bug-brief, --iterations, --engine, --pytest-target
- CodeReproducibilityManifest + Phase 4 gate evaluation
- 266 tests passing, ruff clean

Live demo against hermes-agent/path_security with ollama/qwen2.5:7b:
- Bug: validate_within_dir PermissionError on junctions / UNC paths
- Result: gate PASSED, fitness 0.50, error handling 1->2 try/except
- Candidate: surgical nested try/except + PermissionError catch
- No auto-merge; candidate emitted as patch for human review

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
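The signature_freeze gate above is described as an AST-level invariant check over signatures, registry calls, and try/except count. The following is a minimal sketch of two of those checks using only the standard ast module. Every name here (signature_fingerprint, count_try_blocks, freeze_violations) is hypothetical, and the rule that the try/except count may grow but not shrink is an assumption inferred from the demo result (1->2 passed the gate), not something the commit states.

```python
import ast


def signature_fingerprint(source: str) -> dict[str, str]:
    """Map each function name to a dump of its argument list.

    Simplification: same-named methods in different classes collide.
    """
    fingerprint = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            fingerprint[node.name] = ast.dump(node.args)
    return fingerprint


def count_try_blocks(source: str) -> int:
    """Count try statements anywhere in the module."""
    return sum(isinstance(n, ast.Try) for n in ast.walk(ast.parse(source)))


def freeze_violations(original: str, candidate: str) -> list[str]:
    """Return human-readable invariant violations (empty list means pass)."""
    before = signature_fingerprint(original)
    after = signature_fingerprint(candidate)
    problems = []
    for name, signature in before.items():
        if name not in after:
            problems.append(f"removed function: {name}")
        elif after[name] != signature:
            problems.append(f"changed signature: {name}")
    # Assumed rule: error handling may be added but not removed.
    old_try, new_try = count_try_blocks(original), count_try_blocks(candidate)
    if new_try < old_try:
        problems.append(f"try/except count dropped: {old_try} -> {new_try}")
    return problems


if __name__ == "__main__":
    original = (
        "def validate(path, root):\n"
        "    try:\n"
        "        return True\n"
        "    except OSError:\n"
        "        return False\n"
    )
    mutated = "def validate(path):\n    return True\n"
    print(freeze_violations(original, mutated))
```

Comparing parsed ASTs rather than source text means a mutation can reformat or rewrite a function body freely while the public call surface stays fixed.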
breakneo added a commit to breakneo/hermes-agent-self-evolution that referenced this pull request on May 18, 2026

…earch#86

Phase 2 (tools), Phase 3 (prompts), Phase 4 (code Darwinian evolver), shared core infrastructure, 62 new test files, run scripts.
Source: hotepfederales-creator feat/phases-2-3-evolution
Conflicting files left for manual merge step.
breakneo added a commit to breakneo/hermes-agent-self-evolution that referenced this pull request on May 18, 2026

…ools + prompts + code)

Cherry-picked from hotepfederales-creator feat/phases-2-3-evolution:
- Phase 2: tool description evolution (evolution/tools/)
- Phase 3: system prompt section evolution (evolution/prompts/)
- Phase 4: Darwinian code evolver (evolution/code/)
- Shared: phase_gate, reproducibility, rollout_policy, stop_loss, governance

Merged with our PR NousResearch#52 (git_pr_automation, report_artifact, benchmark_gate, hermes_eval). Resolved conflicts in evolve_skill.py, config.py, __init__.py, fitness.py, skill_module.py, test_evolve_skill.py.

Test status: 302 passed, 24 failed
- 3 sandbox failures: macOS portability (cmd → bash)
- 21 evolve_skill orchestration test failures: PR NousResearch#52 helper functions need re-wiring into new phase_gate/reproducibility architecture (known debt; core modules fully intact, all 39 core tests pass)
breakneo added a commit to breakneo/hermes-agent-self-evolution that referenced this pull request on May 18, 2026
Summary
Extends self-evolution beyond skills (Phase 1) to two additional artifact types using the same KPI-gated harness pattern:
- Phase 2 (tool descriptions, `evolution/tools/`): AST-load `*_SCHEMA` dicts from `hermes-agent/tools/*.py`, wrap the description as `predictor.signature.instructions`, optimize with MIPROv2/GEPA against a positive/negative tool-selection dataset, enforce a 500-char budget with sentence-bounded truncation, gate with cross-tool non-regression.
- Phase 3 (system-prompt sections, `evolution/prompts/`): AST-load `DEFAULT_AGENT_IDENTITY` / `MEMORY_GUIDANCE` / `SESSION_SEARCH_GUIDANCE` / `SKILLS_GUIDANCE` and explode `PLATFORM_HINTS` per platform, optimize with overlap / LLM-judge / hybrid metrics, gate with an identity-trait preservation check, with a non-baseline-winner fallback for when MIPROv2's best candidate equals the baseline.
- Shared infrastructure: polymorphic reproducibility manifests, per-phase gate evaluators, `tool_description` and `system_prompt` rollout policy tiers.

Live KPI Gate Results
All three runs used local `ollama/qwen2.5:7b` end-to-end. Constraints, manifest reproducibility, identity-trait preservation, and rollout safety checks all green per phase.

Tests
- `python -m ruff check .` clean
- `tests/tools/` (loader, module, phase2 gate) and `tests/prompts/` (loader, module, judge metric, phase3 gate)

Files of Interest
- `evolution/tools/tool_module.py` - `clean_evolved_description(max_chars)` strips optimizer-inlined few-shot examples and enforces a sentence-bounded budget.
- `evolution/prompts/prompt_module.py` - `make_llm_judge_metric` adds LLM-as-judge fitness with optional keyword-overlap fallback weight.
- `evolution/prompts/evolve_prompt_section.py` - falls back to the highest-scoring non-baseline MIPROv2 candidate when the optimizer's best winner is the baseline itself.
- `evolution/core/phase_gate.py` - `evaluate_phase2_gate` adds cross-tool regression check, `evaluate_phase3_gate` adds identity-trait preservation.

Out of Scope
- …`max_steps` rejected by installed DSPy version) currently falls back to MIPROv2; deferred until we pin a version-correct wrapper.

Repository Hygiene
- `.gitignore` now excludes `output/` run trees and stray Windows `NUL` 0-byte files.
- `output/{skills,tools,prompts}/<name>/<timestamp>/` not committed.
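
The non-baseline-winner fallback mentioned under Files of Interest can be pictured as a small selection rule over scored candidates. The sketch below is illustrative only: the `Candidate` type and `pick_winner` function are hypothetical, and it assumes candidate texts and scores have already been collected from the optimizer run (how `evolve_prompt_section.py` extracts them from MIPROv2 is not shown in this PR page).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record of one optimizer proposal and its metric score."""
    text: str
    score: float


def pick_winner(baseline_text: str, candidates: list[Candidate]) -> Candidate:
    """Prefer the top-scoring candidate; skip it if it is just the baseline."""
    if not candidates:
        raise ValueError("no candidates to choose from")

    def is_baseline(candidate: Candidate) -> bool:
        return candidate.text.strip() == baseline_text.strip()

    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    for candidate in ranked:
        if not is_baseline(candidate):
            return candidate  # highest-scoring text that differs from baseline
    return ranked[0]  # every candidate equals the baseline; nothing evolved


if __name__ == "__main__":
    baseline = "Save durable facts to memory."
    pool = [
        Candidate(baseline, 0.60),
        Candidate("Save durable user facts to memory; skip transient details.", 0.58),
        Candidate("Remember things.", 0.40),
    ]
    print(pick_winner(baseline, pool).text)
```

A fallback like this only chooses which candidate goes forward; the phase gate still decides whether that candidate is acceptable.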