feat: consolidate evolution ingestion and safety gates #68
Closed
steezkelly wants to merge 53 commits into
…sResearch#24, NousResearch#26, NousResearch#35)
- PR NousResearch#24: skill_module.py stores skill body as InputField → signature.instructions
- _load_skill_body() splits frontmatter from body, body becomes instruction
- _extract_evolved_instructions() extracts from signature.instructions (not wrapper)
- constraint_validator.py: body/frontmatter separation, validate body has substance
- dataset_builder.py: robust JSON parsing with 6 fallback strategies
- PR NousResearch#26: GEPA wiring fix, reflection_lm passed to GEPA
- PR NousResearch#35: constraint validator for GEPA args, max_metric_calls not mixed with auto
Note: GEPA still falls back to MIPROv2 due to DSPy 3.2.0 API: max_metric_calls conflicts with auto='light'. Use max_metric_calls alone (fixed).
…traint validator, JSON parsing robustness
Combined patch applying upstream PRs NousResearch#24/NousResearch#26/NousResearch#35:
- skill_module.py: embed skill body in signature instructions via HTML sentinel
- evolve_skill.py: HTML sentinel extraction with fallback, GEPA max_metric_calls fix, improved messaging
- constraints.py: validate YAML frontmatter + substantive body content separately
- dataset_builder.py: 6-strategy JSON parser for LLM output resilience
- sentinel collision: replaced \n\n---\n\n (appears in skill bodies) with <!-- ___SKILL_EVOLUTION_SENTINEL___ -->
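The sentinel approach described above can be sketched as follows. This is a minimal illustration, not the code in skill_module.py or evolve_skill.py: the function names `embed_skill_body` and `extract_skill_body` and the exact layout around the sentinel are assumptions. Only the sentinel string itself comes from the commit message.

```python
# Sentinel string taken from the commit message; everything else is hypothetical.
SENTINEL = "<!-- ___SKILL_EVOLUTION_SENTINEL___ -->"


def embed_skill_body(wrapper: str, body: str) -> str:
    """Join wrapper instructions and the skill body with the sentinel between them."""
    return f"{wrapper}\n{SENTINEL}\n{body}"


def extract_skill_body(instructions: str, fallback: str) -> str:
    """Return the text after the sentinel, or `fallback` if it is missing or empty."""
    _, found, tail = instructions.partition(SENTINEL)
    body = tail.strip()
    return body if found and body else fallback


# A body that contains the old "\n\n---\n\n" separator survives the round trip,
# because the HTML comment is unlikely to occur in ordinary skill text.
skill_body = "# Title\n\nStep 1\n\n---\n\nStep 2"
instructions = embed_skill_body("Follow the skill below.", skill_body)
recovered = extract_skill_body(instructions, fallback="(original body)")
```

The fallback argument mirrors the "extraction with fallback" wording: if an optimizer rewrites the instructions and drops the sentinel, the caller gets a known-good body back instead of an empty string.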
…<20KB) +50%, large +20%. Fixes companion-interview-workflow rejection (the +28.5% growth was genuine operational detail, not bloat). Also caps the pre-filter at 20 candidates in RelevanceFilter to prevent 30+ minute timeouts.
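A size-tiered growth allowance of this kind might look like the sketch below. The +50% and +20% tiers come from this commit message, and the 100% allowance for skills under 5KB is taken from a later test-fix commit on this page. The function names, the use of 1024-byte kilobytes, and the treatment of a baseline of exactly 20KB as "large" are assumptions.

```python
def max_growth_ratio(baseline_bytes: int) -> float:
    """Allowed relative growth for an evolved skill, by baseline size tier."""
    if baseline_bytes < 5 * 1024:      # small skills: up to +100%
        return 1.00
    if baseline_bytes < 20 * 1024:     # medium skills: up to +50%
        return 0.50
    return 0.20                        # large skills: up to +20%


def growth_ok(baseline_bytes: int, evolved_bytes: int) -> bool:
    """True if the evolved skill stays within its tier's growth allowance."""
    if baseline_bytes <= 0:
        raise ValueError("baseline size must be positive")
    growth = (evolved_bytes - baseline_bytes) / baseline_bytes
    return growth <= max_growth_ratio(baseline_bytes)
```

Under these assumptions a 30% increase passes on a 1KB baseline but fails on a 25KB baseline, which matches the behaviour the later test fixes describe.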
Completes v2.1 build phase:
1. GEPA/MIPROv2 logger (Cassian's #1 production risk)
- Logs optimizer type (GEPA vs MIPROv2) after compile in evolve_skill.py
- Added optimizer_type field to stats CSV schema
2. Router (evolution/core/router.py)
- 3-action classification: fix / extend / abstain
- Heuristic-based (no LLM calls): failure pattern detection by reason keyword, structural change detection via conditional counts, confidence scaling
- All thresholds labeled as unvalidated novel design per Aris Thorne review
3. Backtrack Controller (evolution/core/backtrack.py)
- 3-iteration sliding window plateau detection
- Float-epsilon threshold comparison (fixes IEEE 754 precision edge case)
- Walk-back: finds last adjacent improvement > 1%, returns checkpoint before it
- Force-archive after N consecutive backtracks
- Resets backtrack count after any improvement
4. Robustness Checkers (evolution/core/constraints_v2.py)
- ConfigDriftChecker: frontmatter name/description stability
- SkillRegressionChecker: holdout score retains 90%+ of baseline
- ScopeCreepChecker: length-normalized term frequency drift detection
- Small-baseline (<3 meaningful words) gracefully skipped
5. Pareto Selector (evolution/core/pareto_selector.py)
- Multi-objective: holdout score (primary) + skill size delta (secondary)
- min_improvement_delta=0.03 noise floor (evaluation noise guard)
- growth_threshold cap prevents 400%+ bloat with small gains
- Robustness gate: failed check = baseline retained regardless
6. Shared Types (evolution/core/types.py)
- 5 dataclasses: EvolutionSnapshot, RouterDecision, BacktrackDecision, ComputeBudget, EvolutionReport
7. Tests: 30 new tests, all passing
- Router: 6 tests (empty extend, edge case fix, low budget, structural, confidence scaling, all-pass)
- Backtrack: 6 tests (insufficient data, plateau, improving, force archive, reset, walk-back)
- Pareto: 5 tests (better, noise floor, robustness fail, growth penalty, zero growth)
- Constraints: 9 tests (5 config drift, 4 scope creep)
- Integration: 4 tests (dry run, no-gepa skeleton, backtrack integration, Pareto integration)
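To illustrate the Backtrack Controller's sliding-window plateau check and why a float epsilon matters, here is a small sketch. It is not the code in evolution/core/backtrack.py: the function name `is_plateau`, the 1% threshold applied as an absolute score gain, and the epsilon value are all assumptions made for the example.

```python
def is_plateau(scores, window=3, threshold=0.01, eps=1e-9):
    """Return True if none of the last `window` steps improved by more than `threshold`.

    `scores` is the per-iteration score history, oldest first. With fewer than
    window + 1 points there is not enough data, so no plateau is reported.
    """
    if len(scores) < window + 1:
        return False
    recent = scores[-(window + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    # 0.51 - 0.50 is slightly larger than 0.01 in IEEE 754 doubles, so a bare
    # `gain > threshold` would treat an exactly-at-threshold step as progress.
    return all(gain <= threshold + eps for gain in gains)


steady_creep = [0.50, 0.51, 0.52, 0.53]   # each step is nominally exactly 1%
real_progress = [0.50, 0.60, 0.70, 0.80]
```

With the epsilon in place, `steady_creep` counts as a plateau while `real_progress` does not. Whether at-threshold steps should count as progress is a design choice; this sketch assumes they should not.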
Adds the top-level integration layer that connects v2.1 modules
to the live evolution pipeline:
1. gepa_v2_dispatch.py (437 lines)
- Wraps v1's GEPA loop with v2.1 decision gates
- Top-level backtrack: re-runs GEPA if ParetoSelector rejects,
up to N attempts (3 or iterations//5)
- Runs ConfigDriftChecker, SkillRegressionChecker, ScopeCreepChecker
on each evolved candidate
- Captures per-scenario holdout results for Router classification
- Returns EvolutionReport with deploy/review/reject recommendation
- Saves output to output/<skill>/v2_<timestamp>/ with report.json
2. --v2 CLI flag
- python -m evolution.skills.evolve_skill --skill X --v2
- Dispatches through v2_dispatch() instead of v1 evolve()
- v1 path unchanged when --v2 is absent
3. EvolutionReport simplified
- Replaced 10 fields (baseline_score, evolved_score, budget, etc.)
with 8 focused fields: skill_name, n_iterations_executed,
improvement, recommendation, details, router_decision,
backtrack_decision, elapsed_seconds
- All dependent modules (evolve_skill_v2.py, types.py, tests)
updated to match
4. Backtrack checkpoint_for_score convenience method
- Records EvolutionSnapshot from raw score/body/iteration values
5. Tests: 162 passing (3 pre-existing failures)
- 3 new dispatch tests: dry_run, no_skill, report_type
- 30 v2.1 unit tests still passing
- Integration tests updated for new EvolutionReport shape
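For orientation, the simplified EvolutionReport could be modelled roughly as below. Only the eight field names come from the commit message; the field types, the defaults, and the choice of plain dictionaries for the router and backtrack decisions are assumptions, and the real class in types.py may differ.

```python
import json
from dataclasses import asdict, dataclass, field, fields
from typing import Any, Optional


@dataclass
class EvolutionReport:
    """Hypothetical shape of the 8-field report; types are illustrative."""
    skill_name: str
    n_iterations_executed: int
    improvement: float
    recommendation: str                      # "deploy", "review", or "reject"
    details: dict[str, Any] = field(default_factory=dict)
    router_decision: Optional[dict[str, Any]] = None
    backtrack_decision: Optional[dict[str, Any]] = None
    elapsed_seconds: float = 0.0


report = EvolutionReport(
    skill_name="example-skill",              # placeholder name
    n_iterations_executed=5,
    improvement=0.04,
    recommendation="review",
)
report_json = json.dumps(asdict(report), indent=2)  # suitable for a report.json file
```

Keeping the report a flat dataclass makes it straightforward to serialize with `asdict` and to assert on its shape in tests.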
1. EvolutionRouter fixes (threshold validation):
- Fixed priority order: structural checked BEFORE coverage (not after)
- Fixed structural check: uses evolution_history[-1] (needs >=1, not >=2)
- coverage check now actually uses coverage_cluster_ratio (was defined but
never implemented — classified ANY multi-reason failure set as coverage)
- Added coverage_min_failures=3 guard (1 failure isn't a coverage pattern)
- Added _dominant_category helper for logging which category dominates
2. PostHocAnalyzer (evolution/core/posthoc_analyzer.py)
- Fits power-law curve: score = a * iteration^c + b via scipy.optimize.curve_fit
- Scipy fallback: log-log linear regression with R² estimate
- Phase classification: early_discovery (c>0.2), diminishing_returns (0.05<c<0.2),
plateau (c<0.05)
- Crossover detection: finds iteration where marginal gain < min_improvement_delta
- Predicted score at 2x iterations
- Pure analytical — no API calls
- scipy added to project dependencies (needed for curve_fit)
3. Router Benchmark (tests/core/router_benchmark.py)
- 11 synthetic test cases: edge_case, coverage, structural, noise,
all_pass, low_budget, edge_case ratio sweep, empty, no_history,
zero_hard, mixed_priority
- All 11 passing
- Run standalone: python tests/core/router_benchmark.py
4. Tests: 175 passing (3 pre-existing failures)
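The log-log fallback and phase classification described for PostHocAnalyzer can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's code: it fits gain = a * iteration^c where gain is the improvement over a fixed baseline (so the offset b is assumed known), the function names are hypothetical, and the handling of the exact boundary values c = 0.05 and c = 0.2 is a guess, since the commit message leaves them unspecified.

```python
import math


def fit_power_law(gains):
    """Least-squares fit of log(gain) = log(a) + c * log(iteration).

    `gains[i]` is the improvement over baseline at iteration i + 1 and must be
    positive. Returns (a, c, r_squared).
    """
    if len(gains) < 2 or any(g <= 0 for g in gains):
        raise ValueError("need at least two positive gains")
    xs = [math.log(i) for i in range(1, len(gains) + 1)]
    ys = [math.log(g) for g in gains]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c = sxy / sxx
    log_a = mean_y - c * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (log_a + c * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(log_a), c, r_squared


def classify_phase(c):
    """Map the fitted exponent to a phase; boundary handling is an assumption."""
    if c > 0.2:
        return "early_discovery"
    if c > 0.05:
        return "diminishing_returns"
    return "plateau"


# Synthetic gains that follow 0.1 * t^0.5 exactly.
a, c, r2 = fit_power_law([0.1 * t ** 0.5 for t in range(1, 9)])
phase = classify_phase(c)
```

On real trajectories the fit will be noisy, so the R² value is worth checking before trusting the phase label.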
…n test
Bugs surfaced by running the full gate pipeline against real skill data:
1. SkillRegressionChecker.check() interface mismatch
- Filesystem-based check() takes (skill_name, threshold) — not inline scores
- Added check_score(evolved_score, baseline_score) for the direct score
comparison that v2_dispatch and tests use
- Fixed v2_dispatch.py call → check_score()
2. SelectionResult missing 'reason' field
- ParetoSelector selected evolved vs baseline but didn't explain WHY
- Added reason: str field with human-readable explanations at every branch
- All selection paths now log: robustness failure, noise floor, weighted win,
growth penalty, and improvement
- Growth penalty now appears in reason string for size-override decisions
3. ParetoSelector reason edge case: growth info missing
- When size penalty was the deciding factor (400% growth → penalty=1.0),
the reason only said 'baseline wins on weighted score' without mentioning
growth or the penalty value
- Fixed: all weighted-score reasons now include growth ratio and penalty
4. Test fixes:
- ConfigDrift: the test used different descriptions (which correctly triggers drift);
changed it to use different tags (which correctly does not)
- Regression: asserted r2[0] == 'pass' but check_score returns (bool, str)
- Pipeline tests: 4/4 passing against real companion-workflows skill data
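The selection branches and reason strings described above might be organised as in the sketch below. It is a hedged illustration, not the ParetoSelector implementation: the weighting formula, the `size_weight` parameter, and the linear penalty that reaches 1.0 at 400% growth are assumptions chosen to be consistent with the commit messages, while `min_improvement_delta=0.03` and the baseline-on-robustness-failure rule come from them directly.

```python
from dataclasses import dataclass


@dataclass
class SelectionResult:
    selected: str   # "evolved" or "baseline"
    reason: str     # human-readable explanation for the decision


def select(baseline_score, evolved_score, baseline_size, evolved_size,
           robustness_passed, min_improvement_delta=0.03,
           growth_threshold=4.0, size_weight=0.1):
    """Pick evolved vs baseline and always say why (weights are hypothetical)."""
    if not robustness_passed:
        return SelectionResult("baseline", "robustness check failed; baseline retained")
    delta = evolved_score - baseline_score
    if delta < min_improvement_delta:
        return SelectionResult(
            "baseline",
            f"improvement {delta:+.3f} is below noise floor {min_improvement_delta}",
        )
    growth = max(0.0, (evolved_size - baseline_size) / baseline_size)
    penalty = min(1.0, growth / growth_threshold)   # 400% growth -> penalty 1.0
    weighted = delta - size_weight * penalty
    detail = f"growth={growth:.0%}, penalty={penalty:.2f}"
    if weighted <= 0:
        return SelectionResult("baseline", f"baseline wins on weighted score ({detail})")
    return SelectionResult("evolved", f"evolved wins on weighted score ({detail})")


bloated = select(0.50, 0.55, baseline_size=1000, evolved_size=5000,
                 robustness_passed=True)
```

In this sketch `bloated.reason` includes both the growth ratio and the penalty, which is the property the third fix asks for.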
- PostHocAnalyzer runs after GEPA loop completes, before Router classification, using the per-attempt score trajectory
- Shows power-law phase classification and recommended action in console
- Appends posthoc analysis to report.json output
- Adds Phase and Power-Law c rows to summary table
- Import PostHocReport type and PostHocAnalyzer class in dispatch
- All 17 posthoc + pipeline integration tests passing
1. test_skill_over_limit: 20KB input didn't exceed 50KB max_skill_size → increased to 60KB to actually trigger the limit
2. test_excessive_growth: 30% growth on 1KB baseline was within the 100% dynamic allowance for small skills (<5KB) → changed to 30% growth on 25KB baseline (max 20% for >20KB skills)
3. test_valid_skill: minimal body lacked 2-of-3 structural checks → added substantive body with steps, headings, and >100 chars
4. PostHoc integration: fixed spurious PostHocReport import from types.py (doesn't exist there; PostHocAnalyzer resolves its own dependencies)
Full suite: 189/189 passing; all tests clean
Bug: score_trajectory extracted avg_baseline (always the same value) instead of tracking best score over time. This meant PostHoc never had enough variance for power-law fitting and always returned None.
Fix: starts with the initial baseline score, then for each attempt takes max(avg_baseline, avg_evolved) compared against the running best. This produces a non-decreasing trajectory that the power-law fitter can actually analyze.
189/189 tests passing.
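The fixed trajectory construction amounts to a running maximum. A minimal sketch, assuming each attempt is available as an (avg_baseline, avg_evolved) pair and using a hypothetical function name:

```python
def build_score_trajectory(initial_baseline, attempts):
    """Return a non-decreasing best-score-so-far series.

    `attempts` is a list of (avg_baseline, avg_evolved) pairs, one per attempt.
    The first point is the initial baseline score.
    """
    trajectory = [initial_baseline]
    best = initial_baseline
    for avg_baseline, avg_evolved in attempts:
        # Take the better of the two scores, then compare with the running best.
        best = max(best, avg_baseline, avg_evolved)
        trajectory.append(best)
    return trajectory


trajectory = build_score_trajectory(0.50, [(0.50, 0.48), (0.50, 0.56), (0.50, 0.53)])
```

Here a regressing attempt (0.48) leaves the series flat rather than pulling it down, so the fitter sees variance only where the best score actually moved.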
Part A — captured-skill plugin (Hermes Agent gateway):
plugin.yaml — registers on_session_end hook
__init__.py — loads session data, builds candidates, slash commands
capture.py — core logic: is_capturable heuristics, tool sequence
extraction, domain tagging, skill body generation, overlap
detection (Jaccard word similarity, no embeddings needed)
Hooks: on_session_end — runs after every completed session with
3+ tool calls, extracts task description, tool sequence, domain
tags, and success pattern, saves to ~/.hermes/captured/<name>.json
Slash command: /captured — list, show, inspect, validate, stats
Part B — ingest-captured CLI (self-evolution repo):
python -m evolution.tools.ingest_captured
list [status] List captured candidates
validate <file> Validate candidate structure and overlap
deploy <file> Deploy a validated candidate as ~/.hermes/skills/<name>/SKILL.md
auto Bulk validate + deploy all pending candidates
evolve <file> Run v2 evolution pipeline then deploy if improved
stats Capture statistics
Validation: body length > 50 chars, frontmatter or heading structure,
overlap with existing skills (blocked at J > 0.5)
196/196 tests passing (7 new + 189 existing)
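The overlap check (Jaccard word similarity, blocked at J > 0.5) can be sketched as follows. The tokenization rule and the function names are assumptions; only the similarity measure and the 0.5 threshold come from the description above.

```python
import re


def word_set(text):
    """Lowercased set of alphanumeric word tokens (tokenization is an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over word sets; 0.0 if both are empty."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def find_overlaps(candidate_body, existing_skills, threshold=0.5):
    """Names of existing skills whose similarity to the candidate exceeds the threshold."""
    return sorted(
        name for name, body in existing_skills.items()
        if jaccard(candidate_body, body) > threshold
    )


existing = {
    "git-rebase": "rebase branch onto main then force push",
    "csv-report": "load csv compute totals write report",
}
blocked_by = find_overlaps("rebase branch onto main then push", existing)
```

Because the comparison is strict, a candidate sitting exactly at J = 0.5 is not blocked in this sketch.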
…n (rejected), nous_auth module, v2 entry point, evolution stats/tools catalog
New modules:
- evolution/tools/tool_module.py: ToolDescriptionStore (loads from Hermes registry) + ToolDescriptionModule (DSPy Predict wrapper)
- evolution/tools/tool_dataset_builder.py: (task, tool_name) eval dataset from synthetic templates + SessionDB mining
- evolution/tools/tool_description_v2.py: GEPA v2 pipeline with BacktrackController, PostHocAnalyzer, ParetoSelector, EvolutionRouter
- evolution/tools/evolve_tool_description.py: CLI entry point (--tool, --iterations, --eval-source, --dry-run)
Architecture:
- Tool selection as classification: given task description → predict correct tool
- GEPA optimizes tool descriptions (≤500 chars) to maximize classification accuracy
- v2 pipeline wraps v1 GEPA with decision gates (reject/accept/review)
- Constraint validator enforces 500-char description limit
- Output: output/tool_descriptions/<tool>_<timestamp>/report.json
CLI:
python -m evolution.tools.evolve_tool_description --tool search_files --iterations 10
…itness, parallel scoring
- MultiComponentSkillModule: section-level mutation via split_into_sections/reconstruct
- PurposePreservationChecker (4th hard gate): blocks type-changing mutations via keyword survival + TF-IDF cosine similarity + consultant-prompt structure detection
- ContentSemanticScorer: sklearn TfidfVectorizer (unigrams+bigrams, sublinear TF)
- ParallelRelevanceFilter: ThreadPoolExecutor for LLM relevance calls (13s→30s for 20)
- Ollama Cloud thinking wrapper fix: promotes reasoning_content→content when empty
- evolve_skill.py: multi_component_extract() replaces _extract_evolved_skill_body()
- gepa_v2_dispatch: uses MultiComponentSkillModule, fixed total_improvement calc
- Test coverage: test_constraints_v2.py (7 test cases)
- CLI entry points: run_batch_evolution.py, run_deep_evolution.py, run_v2_validate*.py
…eration
- Fix syntax error in seed_to_skill.py: Python versions before 3.12 don't allow backslashes inside f-string expressions (PEP 701 lifted this restriction in 3.12). Extracted coherence_issues_escaped and timestamp to variables before the multi-line return statement.
- Add run_batch_seed_generation.py: generates skills from seeds for Phase 3 of skill-generation-from-seed plan. Seeds: personal-osint-audit, exploratory-data-analysis, research-planning
- Kanban: move 3 regression skills to STALE: companion-personas (plateau, best=+0.1988, latest=-0.1247), companion-system-orchestration (plateau, best=+0.0418), github-code-review (noise-level changes, best=+0.0000)
Root cause: evolving existing 600-line skills hits plateaus. These should use seed-based generation instead.
- Batch seed generation running in background (proc_ce421498c3d0)
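The f-string fix described above (moving backslash-containing work out of the braces and into variables) follows a pattern like the one below. The function name `render_report` and the output layout are hypothetical; only the variable names coherence_issues_escaped and timestamp come from the commit message.

```python
from datetime import datetime, timezone


def render_report(coherence_issues, now=None):
    """Build a multi-line report without any backslash inside an f-string expression."""
    # Anything that needs "\n" is computed first, outside the f-string braces.
    coherence_issues_escaped = "\n".join(f"- {issue}" for issue in coherence_issues)
    timestamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        f"generated: {timestamp}\n"
        f"coherence issues:\n{coherence_issues_escaped}\n"
    )


sample = render_report(
    ["missing heading", "empty section"],
    now=datetime(2026, 5, 1, tzinfo=timezone.utc),
)
```

Backslashes in the literal part of an f-string are fine; only the expression between the braces is affected. Writing it this way is valid on every Python version that has f-strings, regardless of how a given interpreter treats backslashes inside the braces.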
- seed_to_skill.py: add --timestamp CLI arg to prevent cross-run arXiv pollution
- run_batch_seed_fast.py: switch eval model from broken minimax/minimax-m2.7 → deepseek/deepseek-v4-flash
- gepa_kanban: add 3 new skills (hermes-agent-author, design-a-multi-agent-companion-coordinat, github-pr-review) in VALIDATING; mark old skills STALE
New skills installed to ~/.hermes/skills/:
companion-system/hermes-agent-author/ (replaces companion-personas)
companion-system/design-a-multi-agent-companion-coordinat/ (replaces companion-system-orchestration)
github/github-pr-review/ (replaces github-code-review)
All 3 skills: 0 arXiv refs, all 5 sections exit=0, coherence PASS. 228 tests still passing.
Results:
- hermes-agent-author: 0.500 (INCOMPARABLE - generator vs old persona catalog)
- design-a-multi-agent-companion-coordinat: 0.621 vs old 0.731 (-0.11 regression)
- github-pr-review: 0.578 vs old 0.650 (-0.07 regression)
Key insight: hermes-agent-author is a fundamentally different task from the old companion-personas (generates personas vs. provides a fixed catalog). The other two are modest regressions - the seed skills are more focused/narrow than the original 11-section skills they replaced.
Scripts added: score_new_skills.py, score_compare.py, score_baseline.py
Cleanup: removed bad arXiv-polluted seed dirs from previous failed run.
GEPA ran 5 iterations on:
- design-a-multi-agent-companion-coordinat: 0.5->0.5, all mutations rejected
- github-pr-review: 0.5->0.5, all mutations rejected
- hermes-agent-author: failed to load skill
Root cause: seed skills are ~50% smaller (5 sections) vs old archived skills (9-15 sections). The seed generates a GEPA-friendly skeleton but lacks the depth/complexity that made the originals effective. GEPA can't fill in the missing sections in 5 iterations.
Key finding: seed-to-skill creates useful starting skeletons, not direct replacements for highly-refined multi-section skills. The pipeline is working correctly; the gap is in seed density.
NEW SKILLS (Phase 3 - no regression baseline):
- research-synthesis: 0.560 (web/arxiv/wiki research report synthesis)
- linear-issue-creator: 0.341 (natural language to Linear issue creation)
- codebase-metrics: 0.527 (codebase metrics via pygount)
PHASE 2 REPLACEMENTS (with old baselines):
- hermes-agent-author: 0.528 vs 0.627 old (-0.099, INCOMPARABLE - different task)
- design-a-multi-agent-companion-coordinat: 0.622 vs 0.731 old (-0.109)
- github-pr-review: 0.585 vs 0.650 old (-0.065)
GEPA PHASE 4 EVOLUTION:
- All 3 seed skills: 0 improvement (5 iterations, all mutations rejected)
- Root cause: seed skills are 5-section skeletons vs old 9-15 section skills
- Pipeline is working correctly; seed density is the bottleneck
Phase 3 scripts: run_batch_seed_phase3.py, updated score_new_skills.py
Datasets: research-synthesis, linear-issue-creator, codebase-metrics
…0% improvement
- linear-issue-creator: GEPA 5 iter, valset 72%, holdout 50%, no improvement
- codebase-metrics: GEPA 5 iter, valset 55.8%, holdout 50%, no improvement
- Both seeds locally optimal on synthetic eval; seed density ceiling confirmed
- research-synthesis killed (22% valset, poor synthetic fit)
- Updated card-registry.json + baseline_scores_20260501.json
…er, same 0.5, same 10544 chars)
- EvalExample: add tool_sequence, complexity_score, session_id, success_pattern
- EvalDataset: add merge() dedup and save_atomic() write-temp-then-rename
- CapturedExampleEnricher: rule-based rubric extraction from first section
- assign_split(): deterministic MD5 hash -> exactly one split (train/val/holdout)
- enrich_and_merge(): single-split append with dedup, replaces save_as_sessiondb_example
- Capture plugin: on_session_end hook + /captured slash command (plugin.yaml + __init__.py)
- Verification script: docs/phase5_verification.py
- Tests: 34 passing across dataset_builder, ingest_captured, capture_plugin
Design decisions:
- D1 (3+ tools) enforced in both plugin _is_capturable and _save_candidate
- D2 (rule-based rubric): no LLM call, first post-frontmatter section + fallback
- D3 (silent failure): errors logged to ~/.hermes/capture_errors/<date>.jsonl
Fixes 6 gaps from Phase 5 gap audit:
- Gap B (data leakage): single split assignment
- Gap C (rubric mismatch): structured expected_behavior with Task/Expected tools/Procedure
- Gap F (field loss): metadata preserved to dataset
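Two of the mechanisms listed above, deterministic hash-based split assignment and write-temp-then-rename saving, can be sketched with the standard library. This is an illustration only: the split fractions, the choice of key passed to the hash, the JSON file format, and the signatures are assumptions, and the real assign_split() and save_atomic() may differ.

```python
import hashlib
import json
import os
import tempfile


def assign_split(example_key, train=0.70, val=0.15):
    """Map a key to exactly one of train/val/holdout, stably across runs.

    MD5 is used only as a stable hash here, not for security.
    """
    digest = hashlib.md5(example_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # uniform value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "holdout"


def save_atomic(path, records):
    """Write JSON to a temp file in the target directory, then rename over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(records, handle)
        os.replace(tmp_path, path)               # readers never see a partial file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because the split depends only on the key, re-ingesting the same captured session cannot place it in a second split, which is the data-leakage property Gap B refers to.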
This was referenced May 9, 2026
Summary
Consolidates the active evolution/ingestion gate PR stack into one locally-tested PR to reduce merge/review overhead.
Requested stack:
--run-tests evolution gate
Also included so the repo does not keep parallel overlapping consolidation PRs open:
reportlab>=4.0 and report import regression tests
Why consolidate
gh showed #55-#59 were individually open, non-draft, and mergeable against main, but GitHub reported no checks for the stack (statusCheckRollup length 0 on each PR). Local integration found real overlap in evolution/skills/evolve_skill.py helper declarations:
- _create_gepa_optimizer(...)
- _run_pytest_gate_if_requested(...) and _save_failed_variant(...)
- _require_non_empty_holdout(...)
- _require_constraints_pass(...) and _validate_baseline_constraints(...)
The consolidated branch keeps all helpers and avoids making maintainers resolve those conflicts piecemeal.
Local verification evidence
Evidence preserved locally at:
/home/steve/repos/hermes-agent-self-evolution/issue-55-61-superstack-local-evidence.md
Commands run in local venv:
Results:
41 passed, 11 warnings in 2.34s
164 passed, 11 warnings in 2.44s
Warnings were DSPy dependency deprecation warnings about prefix in InputField/OutputField; no test failures.
Superseded PRs
Supersedes #55, #56, #57, #58, #59, #60, #61, and #67.
Closes #54.
Fixes #10.
Fixes #12.
Partially addresses #33.