feat: consolidate evolution ingestion and safety gates by steezkelly · Pull Request #68 · NousResearch/hermes-agent-self-evolution

steezkelly · 2026-05-09T03:39:29Z

Summary

Consolidates the active evolution/ingestion gate PR stack into one locally-tested PR to reduce merge/review overhead.

Requested stack:

- #55 (feat: add ingestion reports and promotion gates): ingestion reports and promotion gates
- #56 (fix: update GEPA construction for DSPy 3): DSPy 3 GEPA construction update
- #57 (fix: use LLM judge feedback for skill fitness): LLM judge feedback for skill fitness
- #58 (fix: enforce run-tests evolution gate): enforce --run-tests evolution gate
- #59 (fix: reject empty holdout datasets): reject empty holdout datasets

Also included so the repo does not keep parallel overlapping consolidation PRs open:

- #60 (fix: declare reportlab dependency): declare reportlab>=4.0 and report import regression tests
- #61 (fix: fail fast on invalid baseline skills): fail fast on invalid baseline skill constraints
- #67 (feat: consolidate issue 54 ingestion and promotion gates): prior consolidation PR is superseded by this broader superstack

Why consolidate

gh showed #55-#59 were individually open, non-draft, and mergeable against main, but GitHub reported no checks for the stack (statusCheckRollup length 0 on each PR). Local integration found real overlap in evolution/skills/evolve_skill.py helper declarations:

- #56 (fix: update GEPA construction for DSPy 3) adds _create_gepa_optimizer(...)
- #58 (fix: enforce run-tests evolution gate) adds _run_pytest_gate_if_requested(...) and _save_failed_variant(...)
- #59 (fix: reject empty holdout datasets) adds _require_non_empty_holdout(...)
- #61 (fix: fail fast on invalid baseline skills) adds _require_constraints_pass(...) and _validate_baseline_constraints(...)

The consolidated branch keeps all helpers and avoids making maintainers resolve those conflicts piecemeal.

Local verification evidence

Evidence preserved locally at:

/home/steve/repos/hermes-agent-self-evolution/issue-55-61-superstack-local-evidence.md

Commands run in local venv:

. .venv-review/bin/activate
pip install -e '.[dev]'
pytest tests/test_generate_report.py tests/skills/test_evolve_skill_constraint_gates.py tests/core/test_issue54_ingestion.py tests/core/test_issue54_promotion.py tests/skills/test_evolve_skill_gepa.py tests/core/test_fitness.py tests/core/test_constraints.py tests/skills/test_evolve_skill_gates.py tests/skills/test_evolve_skill_dataset_gates.py -q
pytest -q

Results:

Targeted stack tests: 41 passed, 11 warnings in 2.34s
Full suite: 164 passed, 11 warnings in 2.44s

Warnings were DSPy dependency deprecation warnings about prefix in InputField / OutputField; no test failures.

Superseded PRs

Supersedes #55, #56, #57, #58, #59, #60, #61, and #67.

Closes #54.
Fixes #10.
Fixes #12.
Partially addresses #33.

…sResearch#24, NousResearch#26, NousResearch#35)

- PR NousResearch#24: skill_module.py stores skill body as InputField → signature.instructions
- _load_skill_body() splits frontmatter from body, body becomes instruction
- _extract_evolved_instructions() extracts from signature.instructions (not wrapper)
- constraint_validator.py: body/frontmatter separation, validate body has substance
- dataset_builder.py: robust JSON parsing with 6 fallback strategies
- PR NousResearch#26: GEPA wiring fix, reflection_lm passed to GEPA
- PR NousResearch#35: constraint validator for GEPA args, max_metric_calls not mixed with auto

Note: GEPA still falls back to MIPROv2 due to DSPy 3.2.0 API: max_metric_calls conflicts with auto='light'. Use max_metric_calls alone (fixed).
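The frontmatter/body split mentioned for _load_skill_body() can be illustrated with a minimal sketch. It assumes frontmatter is delimited by a leading `---` line and a closing `---` line; the function name `split_frontmatter` and the sample document are hypothetical and are not taken from skill_module.py.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a SKILL.md-style document into (frontmatter, body).

    Hypothetical helper: assumes '---' delimiter lines. Returns ("", text)
    when no complete frontmatter block is found.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).strip()
            return frontmatter, body
    # Unterminated frontmatter: treat the whole text as body.
    return "", text


doc = "---\nname: demo\ndescription: example\n---\n\n# Demo\n\nDo the thing.\n"
frontmatter, body = split_frontmatter(doc)
```

Only `body` would then be handed to the optimizer as the instruction text, which keeps the YAML metadata out of the mutation surface.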

…traint validator, JSON parsing robustness

Combined patch applying upstream PRs NousResearch#24/NousResearch#26/NousResearch#35:

- skill_module.py: embed skill body in signature instructions via HTML sentinel
- evolve_skill.py: HTML sentinel extraction with fallback, GEPA max_metric_calls fix, improved messaging
- constraints.py: validate YAML frontmatter + substantive body content separately
- dataset_builder.py: 6-strategy JSON parser for LLM output resilience
- sentinel collision: replaced \n\n---\n\n (appears in skill bodies) with

…<20KB) +50%, large +20%. Fixes companion-interview-workflow rejection (+28.5% bloat was genuine operational detail, not bloat). Also cap pre-filter to 20 candidates in RelevanceFilter to prevent 30+ minute timeouts.

Completes v2.1 build phase:

1. GEPA/MIPROv2 logger (Cassian's #1 production risk)
   - Logs optimizer type (GEPA vs MIPROv2) after compile in evolve_skill.py
   - Added optimizer_type field to stats CSV schema
2. Router (evolution/core/router.py)
   - 3-action classification: fix / extend / abstain
   - Heuristic-based (no LLM calls): failure pattern detection by reason keyword, structural change detection via conditional counts, confidence scaling
   - All thresholds labeled as unvalidated novel design per Aris Thorne review
3. Backtrack Controller (evolution/core/backtrack.py)
   - 3-iteration sliding window plateau detection
   - Float-epsilon threshold comparison (fixes IEEE 754 precision edge case)
   - Walk-back: finds last adjacent improvement > 1%, returns checkpoint before it
   - Force-archive after N consecutive backtracks
   - Resets backtrack count after any improvement
4. Robustness Checkers (evolution/core/constraints_v2.py)
   - ConfigDriftChecker: frontmatter name/description stability
   - SkillRegressionChecker: holdout score retains 90%+ of baseline
   - ScopeCreepChecker: length-normalized term frequency drift detection
   - Small-baseline (<3 meaningful words) gracefully skipped
5. Pareto Selector (evolution/core/pareto_selector.py)
   - Multi-objective: holdout score (primary) + skill size delta (secondary)
   - min_improvement_delta=0.03 noise floor (evaluation noise guard)
   - growth_threshold cap prevents 400%+ bloat with small gains
   - Robustness gate: failed check = baseline retained regardless
6. Shared Types (evolution/core/types.py)
   - 5 dataclasses: EvolutionSnapshot, RouterDecision, BacktrackDecision, ComputeBudget, EvolutionReport
7. Tests: 30 new tests, all passing
   - Router: 6 tests (empty extend, edge case fix, low budget, structural, confidence scaling, all-pass)
   - Backtrack: 6 tests (insufficient data, plateau, improving, force archive, reset, walk-back)
   - Pareto: 5 tests (better, noise floor, robustness fail, growth penalty, zero growth)
   - Constraints: 9 tests (5 config drift, 4 scope creep)
   - Integration: 4 tests (dry run, no-gepa skeleton, backtrack integration, Pareto integration)
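The sliding-window plateau check with an epsilon-tolerant comparison could look roughly like the sketch below. The function name `is_plateau`, the `EPS` tolerance, the 1% `min_gain` default, and the choice to require `window + 1` scores are all assumptions made for illustration; the actual logic in evolution/core/backtrack.py may differ.

```python
EPS = 1e-9  # hypothetical tolerance for float comparison


def is_plateau(scores: list[float], window: int = 3, min_gain: float = 0.01) -> bool:
    """Return True when none of the last `window` steps gained more than `min_gain`.

    Gains are compared against `min_gain + EPS` so that a step of exactly
    `min_gain` is classified the same way regardless of binary rounding
    (for example 0.51 - 0.50 is not exactly 0.01 in IEEE 754 doubles).
    """
    if len(scores) < window + 1:
        return False  # insufficient data: never report a plateau
    recent = scores[-(window + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(gain <= min_gain + EPS for gain in gains)


flat = is_plateau([0.50, 0.50, 0.50, 0.50])
improving = is_plateau([0.50, 0.60, 0.70, 0.80])
borderline = is_plateau([0.50, 0.51, 0.52, 0.53])
```

Without the tolerance, the borderline trajectory could flip between "plateau" and "improving" depending on how each subtraction happens to round.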

Adds the top-level integration layer that connects v2.1 modules to the live evolution pipeline:

1. gepa_v2_dispatch.py (437 lines)
   - Wraps v1's GEPA loop with v2.1 decision gates
   - Top-level backtrack: re-runs GEPA if ParetoSelector rejects, up to N attempts (3 or iterations//5)
   - Runs ConfigDriftChecker, SkillRegressionChecker, ScopeCreepChecker on each evolved candidate
   - Captures per-scenario holdout results for Router classification
   - Returns EvolutionReport with deploy/review/reject recommendation
   - Saves output to output/<skill>/v2_<timestamp>/ with report.json
2. --v2 CLI flag
   - python -m evolution.skills.evolve_skill --skill X --v2
   - Dispatches through v2_dispatch() instead of v1 evolve()
   - v1 path unchanged when --v2 is absent
3. EvolutionReport simplified
   - Replaced 10 fields (baseline_score, evolved_score, budget, etc.) with 8 focused fields: skill_name, n_iterations_executed, improvement, recommendation, details, router_decision, backtrack_decision, elapsed_seconds
   - All dependent modules (evolve_skill_v2.py, types.py, tests) updated to match
4. Backtrack checkpoint_for_score convenience method
   - Records EvolutionSnapshot from raw score/body/iteration values
5. Tests: 162 passing (3 pre-existing failures)
   - 3 new dispatch tests: dry_run, no_skill, report_type
   - 30 v2.1 unit tests still passing
   - Integration tests updated for new EvolutionReport shape

1. EvolutionRouter fixes (threshold validation):
   - Fixed priority order: structural checked BEFORE coverage (not after)
   - Fixed structural check: uses evolution_history[-1] (needs >=1, not >=2)
   - coverage check now actually uses coverage_cluster_ratio (was defined but never implemented; classified ANY multi-reason failure set as coverage)
   - Added coverage_min_failures=3 guard (1 failure isn't a coverage pattern)
   - Added _dominant_category helper for logging which category dominates
2. PostHocAnalyzer (evolution/core/posthoc_analyzer.py)
   - Fits power-law curve: score = a * iteration^c + b via scipy.optimize.curve_fit
   - Scipy fallback: log-log linear regression with R² estimate
   - Phase classification: early_discovery (c>0.2), diminishing_returns (0.05<c<0.2), plateau (c<0.05)
   - Crossover detection: finds iteration where marginal gain < min_improvement_delta
   - Predicted score at 2x iterations
   - Pure analytical, no API calls
   - scipy added to project dependencies (needed for curve_fit)
3. Router Benchmark (tests/core/router_benchmark.py)
   - 11 synthetic test cases: edge_case, coverage, structural, noise, all_pass, low_budget, edge_case ratio sweep, empty, no_history, zero_hard, mixed_priority
   - All 11 passing
   - Run standalone: python tests/core/router_benchmark.py
4. Tests: 175 passing (3 pre-existing failures)
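The log-log regression fallback and the phase thresholds can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PostHocAnalyzer code: the offset `b` is treated as known (for instance the baseline score), the function names are hypothetical, and values of `c` that fall exactly on 0.05 or 0.2 are assigned to the lower phase because the commit message leaves those boundaries unspecified.

```python
import math


def fit_power_law_loglog(iterations, scores, b):
    """Fit score = a * iteration**c + b by least squares on
    log(score - b) = log(a) + c * log(iteration). Returns (a, c, r2)."""
    pts = [(math.log(i), math.log(s - b))
           for i, s in zip(iterations, scores) if i > 0 and s > b]
    if len(pts) < 2:
        raise ValueError("need at least two points above the offset b")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("all iterations are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    c = sxy / sxx
    log_a = mean_y - c * mean_x
    ss_tot = sum((y - mean_y) ** 2 for _, y in pts)
    ss_res = sum((y - (log_a + c * x)) ** 2 for x, y in pts)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(log_a), c, r2


def classify_phase(c):
    """Map the fitted exponent to the phases named in the commit message."""
    if c > 0.2:
        return "early_discovery"
    if c > 0.05:
        return "diminishing_returns"
    return "plateau"


# Synthetic trajectory generated from a = 0.1, c = 0.5, b = 0.5.
iters = [1, 2, 3, 4, 5]
scores = [0.5 + 0.1 * i ** 0.5 for i in iters]
a, c, r2 = fit_power_law_loglog(iters, scores, b=0.5)
phase = classify_phase(c)
```

Fixing `b` up front is what makes the problem linear in log space; the three-parameter fit described in the commit needs a nonlinear solver such as curve_fit.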

…n test

Bugs surfaced by running the full gate pipeline against real skill data:

1. SkillRegressionChecker.check() interface mismatch
   - Filesystem-based check() takes (skill_name, threshold), not inline scores
   - Added check_score(evolved_score, baseline_score) for the direct score comparison that v2_dispatch and tests use
   - Fixed v2_dispatch.py call → check_score()
2. SelectionResult missing 'reason' field
   - ParetoSelector selected evolved vs baseline but didn't explain WHY
   - Added reason: str field with human-readable explanations at every branch
   - All selection paths now log: robustness failure, noise floor, weighted win, growth penalty, and improvement
   - Growth penalty now appears in reason string for size-override decisions
3. ParetoSelector reason edge case: growth info missing
   - When size penalty was the deciding factor (400% growth → penalty=1.0), the reason only said 'baseline wins on weighted score' without mentioning growth or the penalty value
   - Fixed: all weighted-score reasons now include growth ratio and penalty
4. Test fixes:
   - ConfigDrift: used different descriptions (which correctly triggers drift); changed to different tags (which correctly does not)
   - Regression: asserted r2[0] == 'pass' but check_score returns (bool, str)
   - Pipeline tests: 4/4 passing against real companion-workflows skill data
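A simplified sketch of a selector that attaches a reason to every branch is shown below. It is deliberately not the ParetoSelector implementation: the weighted score and penalty formula are not described in enough detail to reproduce, so this version substitutes a hard growth cap, and the `select` signature, the `winner` field, and the 4.0 (400%) `growth_threshold` default are hypothetical. Only `min_improvement_delta=0.03` and the `reason` field come from the commit messages.

```python
from dataclasses import dataclass


@dataclass
class SelectionResult:
    winner: str  # "baseline" or "evolved"
    reason: str  # human-readable explanation for the decision


def select(baseline_score, evolved_score, baseline_size, evolved_size,
           robustness_passed, min_improvement_delta=0.03, growth_threshold=4.0):
    """Pick baseline or evolved and always explain why."""
    delta = evolved_score - baseline_score
    growth = (evolved_size - baseline_size) / baseline_size
    if not robustness_passed:
        return SelectionResult("baseline", "robustness check failed; baseline retained")
    if delta < min_improvement_delta:
        return SelectionResult(
            "baseline",
            f"improvement {delta:+.3f} is below noise floor {min_improvement_delta:.3f}")
    if growth > growth_threshold:
        return SelectionResult(
            "baseline",
            f"growth {growth:.0%} exceeds cap {growth_threshold:.0%} "
            f"despite improvement {delta:+.3f}")
    return SelectionResult(
        "evolved", f"improvement {delta:+.3f} with growth {growth:.0%}")


accepted = select(0.50, 0.60, 1000, 1100, robustness_passed=True)
noisy = select(0.50, 0.51, 1000, 1100, robustness_passed=True)
bloated = select(0.50, 0.60, 1000, 6000, robustness_passed=True)
fragile = select(0.50, 0.90, 1000, 1000, robustness_passed=False)
```

Including the growth figure in every size-related reason mirrors the fix described in item 3, where a size-driven rejection previously gave no hint that size was the cause.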

- PostHocAnalyzer runs after GEPA loop completes, before Router classification, using the per-attempt score trajectory
- Shows power-law phase classification and recommended action in console
- Appends posthoc analysis to report.json output
- Adds Phase and Power-Law c rows to summary table
- Import PostHocReport type and PostHocAnalyzer class in dispatch
- All 17 posthoc + pipeline integration tests passing

1. test_skill_over_limit: 20KB input didn't exceed 50KB max_skill_size → increased to 60KB to actually trigger the limit
2. test_excessive_growth: 30% growth on 1KB baseline was within the 100% dynamic allowance for small skills (<5KB) → changed to 30% growth on 25KB baseline (max 20% for >20KB skills)
3. test_valid_skill: minimal body lacked 2-of-3 structural checks → added substantive body with steps, headings, and >100 chars
4. PostHoc integration: fixed spurious PostHocReport import from types.py (doesn't exist there; PostHocAnalyzer resolves its own dependencies)

Full suite: 189/189 passing, all tests clean
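The size-dependent growth allowance that these test fixes rely on can be sketched as a tiered lookup. The tiers below (100% under 5KB, 50% under 20KB, 20% otherwise) are pieced together from the commit messages; the function names, the 1024-byte kilobyte, and the placement of a skill of exactly 20KB in the strictest tier are assumptions.

```python
KB = 1024  # assumption: the commit messages do not say 1000 vs 1024


def max_growth_ratio(baseline_bytes: int) -> float:
    """Return the allowed fractional growth for a baseline of this size."""
    if baseline_bytes < 5 * KB:
        return 1.00  # small skills may double
    if baseline_bytes < 20 * KB:
        return 0.50
    return 0.20      # large skills are held to 20%


def growth_within_limit(baseline_bytes: int, evolved_bytes: int) -> bool:
    """True when the evolved skill's growth stays within its tier's allowance."""
    if baseline_bytes <= 0:
        raise ValueError("baseline size must be positive")
    growth = (evolved_bytes - baseline_bytes) / baseline_bytes
    return growth <= max_growth_ratio(baseline_bytes)


# The two cases from the test fix: 30% growth on 1KB vs 30% growth on 25KB.
small_ok = growth_within_limit(1 * KB, 1331)
large_ok = growth_within_limit(25 * KB, 33280)
```

This makes the original test failure easy to see: 30% growth only trips the limit once the baseline is large enough to fall in the 20% tier.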

Bug: score_trajectory extracted avg_baseline (always the same value) instead of tracking best score over time. This meant PostHoc never had enough variance for power-law fitting and always returned None. Fix: starts with the initial baseline score, then for each attempt takes max(avg_baseline, avg_evolved) compared against the running best. This produces a non-decreasing trajectory that the power-law fitter can actually analyze. 189/189 tests passing.
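A minimal sketch of the running-best trajectory described in the fix follows. The function name and the `(avg_baseline, avg_evolved)` tuple shape are assumptions for illustration; only the max-against-running-best rule comes from the commit message.

```python
def build_score_trajectory(initial_baseline, attempts):
    """Build a non-decreasing trajectory of best scores.

    `attempts` is a list of (avg_baseline, avg_evolved) pairs, one per attempt.
    """
    trajectory = [initial_baseline]
    best = initial_baseline
    for avg_baseline, avg_evolved in attempts:
        # Keep the best score seen so far, so the curve never dips.
        best = max(best, avg_baseline, avg_evolved)
        trajectory.append(best)
    return trajectory


trajectory = build_score_trajectory(0.5, [(0.5, 0.48), (0.5, 0.55), (0.5, 0.52)])
```

Here the trajectory is `[0.5, 0.5, 0.55, 0.55]`, whereas recording `avg_baseline` alone would have produced a constant series with nothing to fit.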

Part A: captured-skill plugin (Hermes Agent gateway):

- plugin.yaml: registers on_session_end hook
- __init__.py: loads session data, builds candidates, slash commands
- capture.py: core logic: is_capturable heuristics, tool sequence extraction, domain tagging, skill body generation, overlap detection (Jaccard word similarity, no embeddings needed)

Hooks:

- on_session_end: runs after every completed session with 3+ tool calls, extracts task description, tool sequence, domain tags, and success pattern, saves to ~/.hermes/captured/<name>.json

Slash command:

- /captured: list, show, inspect, validate, stats

Part B: ingest-captured CLI (self-evolution repo):

python -m evolution.tools.ingest_captured
  list [status]     List captured candidates
  validate <file>   Validate candidate structure and overlap
  deploy <file>     Deploy a validated candidate as ~/.hermes/skills/<name>/SKILL.md
  auto              Bulk validate + deploy all pending candidates
  evolve <file>     Run v2 evolution pipeline then deploy if improved
  stats             Capture statistics

Validation: body length > 50 chars, frontmatter or heading structure, overlap with existing skills (blocked at J > 0.5)

196/196 tests passing (7 new + 189 existing)
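The Jaccard word-similarity overlap check can be sketched as follows. The tokenization (lowercased alphanumeric runs) and the function names are assumptions; only the J > 0.5 blocking threshold is taken from the commit message.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; a hypothetical tokenization choice."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over word sets."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def overlaps_existing(candidate: str, existing: list[str], threshold: float = 0.5) -> bool:
    """True when the candidate is too similar to any existing skill body."""
    return any(jaccard(candidate, body) > threshold for body in existing)


similarity = jaccard("search files by name", "search files by content")
blocked = overlaps_existing("search files by name", ["search files by content"])
```

The two sample strings share 3 of 5 distinct words, giving J = 0.6, so the candidate is blocked. Set-based similarity needs no embedding model, which matches the "no embeddings needed" note above.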

…n (rejected), nous_auth module, v2 entry point, evolution stats/tools catalog

New modules:

- evolution/tools/tool_module.py: ToolDescriptionStore (loads from Hermes registry) + ToolDescriptionModule (DSPy Predict wrapper)
- evolution/tools/tool_dataset_builder.py: (task, tool_name) eval dataset from synthetic templates + SessionDB mining
- evolution/tools/tool_description_v2.py: GEPA v2 pipeline with BacktrackController, PostHocAnalyzer, ParetoSelector, EvolutionRouter
- evolution/tools/evolve_tool_description.py: CLI entry point (--tool, --iterations, --eval-source, --dry-run)

Architecture:

- Tool selection as classification: given task description → predict correct tool
- GEPA optimizes tool descriptions (≤500 chars) to maximize classification accuracy
- v2 pipeline wraps v1 GEPA with decision gates (reject/accept/review)
- Constraint validator enforces 500-char description limit
- Output: output/tool_descriptions/<tool>_<timestamp>/report.json

CLI: python -m evolution.tools.evolve_tool_description --tool search_files --iterations 10

…itness, parallel scoring

- MultiComponentSkillModule: section-level mutation via split_into_sections/reconstruct
- PurposePreservationChecker (4th hard gate): blocks type-changing mutations via keyword survival + TF-IDF cosine similarity + consultant-prompt structure detection
- ContentSemanticScorer: sklearn TfidfVectorizer (unigrams+bigrams, sublinear TF)
- ParallelRelevanceFilter: ThreadPoolExecutor for LLM relevance calls (13s→30s for 20)
- Ollama Cloud thinking wrapper fix: promotes reasoning_content→content when empty
- evolve_skill.py: multi_component_extract() replaces _extract_evolved_skill_body()
- gepa_v2_dispatch: uses MultiComponentSkillModule, fixed total_improvement calc
- Test coverage: test_constraints_v2.py (7 test cases)
- CLI entry points: run_batch_evolution.py, run_deep_evolution.py, run_v2_validate*.py

…04-30

…on tests

…eration

- Fix syntax error in seed_to_skill.py: Python versions before 3.12 don't allow backslash escapes inside f-string expressions (PEP 701 lifted this restriction in 3.12). Extracted coherence_issues_escaped and timestamp to variables before the multi-line return statement.
- Add run_batch_seed_generation.py: generates skills from seeds for Phase 3 of skill-generation-from-seed plan. Seeds: personal-osint-audit, exploratory-data-analysis, research-planning
- Kanban: move 3 regression skills to STALE: companion-personas (plateau, best=+0.1988, latest=-0.1247), companion-system-orchestration (plateau, best=+0.0418), github-code-review (noise-level changes, best=+0.0000). Root cause: evolving existing 600-line skills hits plateaus. These should use seed-based generation instead.
- Batch seed generation running in background (proc_ce421498c3d0)
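The extraction fix for seed_to_skill.py can be illustrated with a small sketch. Hoisting any expression that contains a backslash out of the f-string keeps the code valid on every Python 3 version, since backslashes inside replacement fields were rejected by the parser before 3.12. The function name, the report layout, and the sample values are hypothetical; only the idea of precomputing the escaped text and the timestamp comes from the commit message.

```python
def render_report(coherence_issues: list[str], timestamp: str) -> str:
    """Render a short multi-line report without backslashes inside {...}."""
    # Compute the joined text first; "\n".join(...) placed directly inside
    # an f-string replacement field is a SyntaxError before Python 3.12.
    coherence_issues_text = "\n".join(f"- {issue}" for issue in coherence_issues)
    return f"""Generated: {timestamp}
Coherence issues:
{coherence_issues_text}
"""


report = render_report(["missing steps section", "title too long"], "<timestamp>")
```

The same pattern applies to any escaped or quoted value: bind it to a name on its own line, then reference only the name inside the f-string.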

- seed_to_skill.py: add --timestamp CLI arg to prevent cross-run arXiv pollution
- run_batch_seed_fast.py: switch eval model from broken minimax/minimax-m2.7 → deepseek/deepseek-v4-flash
- gepa_kanban: add 3 new skills (hermes-agent-author, design-a-multi-agent-companion-coordinat, github-pr-review) in VALIDATING; mark old skills STALE

New skills installed to ~/.hermes/skills/:

- companion-system/hermes-agent-author/ (replaces companion-personas)
- companion-system/design-a-multi-agent-companion-coordinat/ (replaces companion-system-orchestration)
- github/github-pr-review/ (replaces github-code-review)

All 3 skills: 0 arXiv refs, all 5 sections exit=0, coherence PASS. 228 tests still passing.

Results:

- hermes-agent-author: 0.500 (INCOMPARABLE: generator vs old persona catalog)
- design-a-multi-agent-companion-coordinat: 0.621 vs old 0.731 (-0.11 regression)
- github-pr-review: 0.578 vs old 0.650 (-0.07 regression)

Key insight: hermes-agent-author is a fundamentally different task from the old companion-personas (generates personas vs. provides a fixed catalog). The other two are modest regressions: the seed skills are more focused/narrow than the original 11-section skills they replaced.

Scripts added: score_new_skills.py, score_compare.py, score_baseline.py

Cleanup: removed bad arXiv-polluted seed dirs from previous failed run.

GEPA ran 5 iterations on:

- design-a-multi-agent-companion-coordinat: 0.5->0.5, all mutations rejected
- github-pr-review: 0.5->0.5, all mutations rejected
- hermes-agent-author: failed to load skill

Root cause: seed skills are ~50% smaller (5 sections) vs old archived skills (9-15 sections). The seed generates a GEPA-friendly skeleton but lacks the depth/complexity that made originals effective. GEPA can't hallucinate in missing sections in 5 iterations.

Key finding: seed-to-skill creates useful starting skeletons, not direct replacements for highly-refined multi-section skills. The pipeline is working correctly; the gap is in seed density.

NEW SKILLS (Phase 3 - no regression baseline):

- research-synthesis: 0.560 (web/arxiv/wiki research report synthesis)
- linear-issue-creator: 0.341 (natural language to Linear issue creation)
- codebase-metrics: 0.527 (codebase metrics via pygount)

PHASE 2 REPLACEMENTS (with old baselines):

- hermes-agent-author: 0.528 vs 0.627 old (-0.099, INCOMPARABLE - different task)
- design-a-multi-agent-companion-coordinat: 0.622 vs 0.731 old (-0.109)
- github-pr-review: 0.585 vs 0.650 old (-0.065)

GEPA PHASE 4 EVOLUTION:

- All 3 seed skills: 0 improvement (5 iterations, all mutations rejected)
- Root cause: seed skills are 5-section skeletons vs old 9-15 section skills
- Pipeline is working correctly; seed density is the bottleneck

Phase 3 scripts: run_batch_seed_phase3.py, updated score_new_skills.py

Datasets: research-synthesis, linear-issue-creator, codebase-metrics

…0% improvement

- linear-issue-creator: GEPA 5 iter, valset 72%, holdout 50%, no improvement
- codebase-metrics: GEPA 5 iter, valset 55.8%, holdout 50%, no improvement
- Both seeds locally optimal on synthetic eval; seed density ceiling confirmed
- research-synthesis killed (22% valset, poor synthetic fit)
- Updated card-registry.json + baseline_scores_20260501.json

…er, same 0.5, same 10544 chars)

- EvalExample: add tool_sequence, complexity_score, session_id, success_pattern
- EvalDataset: add merge() dedup and save_atomic() write-temp-then-rename
- CapturedExampleEnricher: rule-based rubric extraction from first section
- assign_split(): deterministic MD5 hash -> exactly one split (train/val/holdout)
- enrich_and_merge(): single-split append with dedup, replaces save_as_sessiondb_example
- Capture plugin: on_session_end hook + /captured slash command (plugin.yaml + __init__.py)
- Verification script: docs/phase5_verification.py
- Tests: 34 passing across dataset_builder, ingest_captured, capture_plugin

Design decisions:

- D1 (3+ tools) enforced in both plugin _is_capturable and _save_candidate
- D2 (rule-based rubric): no LLM call, first post-frontmatter section + fallback
- D3 (silent failure): errors logged to ~/.hermes/capture_errors/<date>.jsonl

Fixes 6 gaps from Phase 5 gap audit:

- Gap B (data leakage): single split assignment
- Gap C (rubric mismatch): structured expected_behavior with Task/Expected tools/Procedure
- Gap F (field loss): metadata preserved to dataset
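Deterministic split assignment and write-temp-then-rename saving can be sketched as below. The signatures, the 60/20/20 split ratios, and the JSON file format are assumptions for illustration; the commit message only states that an MD5 hash maps each example to exactly one split and that save_atomic() writes a temp file and renames it.

```python
import hashlib
import json
import os
import tempfile


def assign_split(example_id: str, train: float = 0.6, val: float = 0.2) -> str:
    """Map an ID to exactly one split using a stable MD5-derived bucket."""
    digest = hashlib.md5(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform-ish value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "holdout"


def save_atomic(path: str, records: list) -> None:
    """Write to a temp file in the target directory, then rename over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(records, handle)
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "dataset.json")
    save_atomic(target, [{"id": "session-1", "split": assign_split("session-1")}])
    with open(target, encoding="utf-8") as handle:
        loaded = json.load(handle)
```

Because the split depends only on the ID, re-ingesting the same session can never move an example from train into holdout, which is the leakage that Gap B describes.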

steezkelly added 30 commits April 24, 2026 22:44

Phase 5 session wrap: v2.1 fixes, companion-workflows v2 evolution run (rejected), nous_auth module, v2 entry point, evolution stats/tools catalog

519cc69

GEPA kanban: card-registry.json + board-state.md snapshot as of 2026-04-30

d0d5346

Apr 29 evening reports + batch evolution results

1370669

README: update Phase status markers to 2026-04-29

1b2a5d9

Update metrics.json timestamps for v1 evolution run outputs

e931aa6

V2 pipeline validation: new run outputs, safe wrapper, tool description tests

2a3b050

Final: kanban state + hermes-agent Apr29 085253 run output

6f05e28

GEPA: design-a-multi-agent-companion-coordinat — 0% improvement (5 iter, same 0.5, same 10544 chars)

2abdba0

GEPA: add design-a-multi-agent-companion-coordinat baseline scores

40a9296

GEPA: github-pr-review — 0% improvement (5 iter, same 0.5, 8374 chars)

fb8f815

GEPA: add github-pr-review baseline scores

06e5e64

steezkelly added 19 commits May 3, 2026 02:26

Phase 5 P1: --enrich flag, integration tests, full 266/266 green

add2d5b

fix: declare scikit-learn dependency

a7117d2

docs: clarify fork project direction

5be2631

docs: establish agent evolution lab identity

7a40906

feat: add ingestion reports and promotion gates

1360d7f

fix: update GEPA construction for DSPy 3

06f8abe

fix: use LLM judge feedback for skill fitness

ba60e2d

fix: enforce run-tests evolution gate

e35598c

fix: reject empty holdout datasets

98b7835

fix: declare reportlab dependency

2d8f518

fix: fail fast on invalid baseline skills

db43472

Merge branch 'pr-55' into review-55-59-integration

3af8913

Merge branch 'pr-56' into review-55-59-integration

6db33cc

Merge branch 'pr-57' into review-55-59-integration

fe04c1f

merge PR 58 into PR 55-57 integration review

c3db0b6

merge PR 59 into PR 55-58 integration review

60b9b34

Merge branch 'pr-60' into review-55-61-superstack

387abe2

merge PR 61 into PR 55-60 integration review

74f5da3

fix: rebase evolution gates onto active lab main

7764ce6

steezkelly closed this May 9, 2026

steezkelly deleted the consolidate/55-61-evolution-gates branch May 9, 2026 03:58

steezkelly mentioned this pull request May 9, 2026

feat: consolidate evolution gates stack (#55-#61) #70

Closed

steezkelly wants to merge 53 commits into NousResearch:main from steezkelly:consolidate/55-61-evolution-gates
