
feat: consolidate evolution ingestion and safety gates#68

Closed
steezkelly wants to merge 53 commits into NousResearch:main from steezkelly:consolidate/55-61-evolution-gates

Conversation

@steezkelly

Summary

Consolidates the active evolution/ingestion gate PR stack into one locally-tested PR to reduce merge/review overhead.

Requested stack:

Also included so the repo does not keep parallel overlapping consolidation PRs open:

Why consolidate

gh showed #55-#59 were individually open, non-draft, and mergeable against main, but GitHub reported no checks for the stack (statusCheckRollup length 0 on each PR). Local integration found real overlap in evolution/skills/evolve_skill.py helper declarations:

The consolidated branch keeps all helpers and avoids making maintainers resolve those conflicts piecemeal.

Local verification evidence

Evidence preserved locally at:

/home/steve/repos/hermes-agent-self-evolution/issue-55-61-superstack-local-evidence.md

Commands run in local venv:

. .venv-review/bin/activate
pip install -e '.[dev]'
pytest tests/test_generate_report.py tests/skills/test_evolve_skill_constraint_gates.py tests/core/test_issue54_ingestion.py tests/core/test_issue54_promotion.py tests/skills/test_evolve_skill_gepa.py tests/core/test_fitness.py tests/core/test_constraints.py tests/skills/test_evolve_skill_gates.py tests/skills/test_evolve_skill_dataset_gates.py -q
pytest -q

Results:

  • Targeted stack tests: 41 passed, 11 warnings in 2.34s
  • Full suite: 164 passed, 11 warnings in 2.44s

Warnings were DSPy dependency deprecation warnings about prefix in InputField / OutputField; no test failures.

Superseded PRs

Supersedes #55, #56, #57, #58, #59, #60, #61, and #67.

Closes #54.
Fixes #10.
Fixes #12.
Partially addresses #33.

…sResearch#24, NousResearch#26, NousResearch#35)

- PR NousResearch#24: skill_module.py stores skill body as InputField → signature.instructions
  - _load_skill_body() splits frontmatter from body, body becomes instruction
  - _extract_evolved_instructions() extracts from signature.instructions (not wrapper)
  - constraint_validator.py: body/frontmatter separation — validate body has substance
  - dataset_builder.py: robust JSON parsing with 6 fallback strategies

- PR NousResearch#26: GEPA wiring fix — reflection_lm passed to GEPA

- PR NousResearch#35: constraint validator for GEPA args, max_metric_calls not mixed with auto

Note: GEPA still falls back to MIPROv2 due to DSPy 3.2.0 API — max_metric_calls
conflicts with auto='light'. Use max_metric_calls alone (fixed).
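
The six fallback strategies in dataset_builder.py are not listed in this message. As a minimal sketch of the general idea, assuming LLM output that may wrap JSON in prose or a code fence, a fallback chain could look like the following. The name `parse_llm_json` and the three strategies shown are illustrative assumptions, not the repository's code.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # a Markdown code fence, built here to avoid a literal one
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_llm_json(text: str) -> Optional[Any]:
    """Try progressively looser strategies; return None if all of them fail."""
    # Strategy 1 (assumed): the whole reply is already valid JSON.
    candidates = [text.strip()]

    # Strategy 2 (assumed): the JSON sits inside a fenced code block.
    candidates.extend(match.strip() for match in FENCE_RE.findall(text))

    # Strategy 3 (assumed): take the outermost {...} or [...] span from prose.
    for open_ch, close_ch in (("{", "}"), ("[", "]")):
        start, end = text.find(open_ch), text.rfind(close_ch)
        if 0 <= start < end:
            candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None
```

Returning None rather than raising lets a caller skip one malformed example without aborting a whole dataset build.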
…traint validator, JSON parsing robustness

Combined patch applying upstream PRs NousResearch#24/NousResearch#26/NousResearch#35:
- skill_module.py: embed skill body in signature instructions via HTML sentinel
- evolve_skill.py: HTML sentinel extraction with fallback, GEPA max_metric_calls fix, improved messaging
- constraints.py: validate YAML frontmatter + substantive body content separately
- dataset_builder.py: 6-strategy JSON parser for LLM output resilience
- sentinel collision: replaced \n\n---\n\n (appears in skill bodies) with <!-- ___SKILL_EVOLUTION_SENTINEL___ -->
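
A minimal sketch of the sentinel round trip described above. The function names and the exact fallback order are assumptions for illustration; only the sentinel string and the old `\n\n---\n\n` delimiter come from this message.

```python
SENTINEL = "<!-- ___SKILL_EVOLUTION_SENTINEL___ -->"
LEGACY_SEPARATOR = "\n\n---\n\n"  # old delimiter; also appears inside skill bodies


def embed_skill_body(wrapper: str, body: str) -> str:
    """Place the skill body after the sentinel inside the instruction text."""
    return f"{wrapper}\n{SENTINEL}\n{body}"


def extract_skill_body(instructions: str) -> str:
    """Recover the skill body from (possibly evolved) instruction text."""
    if SENTINEL in instructions:
        return instructions.split(SENTINEL, 1)[1].strip()
    # Assumed fallback: text written before the sentinel existed.
    if LEGACY_SEPARATOR in instructions:
        return instructions.split(LEGACY_SEPARATOR, 1)[1].strip()
    # Last resort (assumed): treat the whole text as the body.
    return instructions.strip()
```

Because the sentinel is an HTML comment that is unlikely to occur in a skill body, a body that itself contains `---` survives the round trip intact.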
…<20KB) +50%, large +20%. Fixes companion-interview-workflow rejection (+28.5% bloat was genuine operational detail, not bloat). Also cap pre-filter to 20 candidates in RelevanceFilter to prevent 30+ minute timeouts.
Completes v2.1 build phase:

1. GEPA/MIPROv2 logger (Cassian's #1 production risk)
   - Logs optimizer type (GEPA vs MIPROv2) after compile in evolve_skill.py
   - Added optimizer_type field to stats CSV schema

2. Router (evolution/core/router.py)
   - 3-action classification: fix / extend / abstain
   - Heuristic-based (no LLM calls): failure pattern detection by reason keyword,
     structural change detection via conditional counts, confidence scaling
   - All thresholds labeled as unvalidated novel design per Aris Thorne review

3. Backtrack Controller (evolution/core/backtrack.py)
   - 3-iteration sliding window plateau detection
   - Float-epsilon threshold comparison (fixes IEEE 754 precision edge case)
   - Walk-back: finds last adjacent improvement > 1%, returns checkpoint before it
   - Force-archive after N consecutive backtracks
   - Resets backtrack count after any improvement

4. Robustness Checkers (evolution/core/constraints_v2.py)
   - ConfigDriftChecker: frontmatter name/description stability
   - SkillRegressionChecker: holdout score retains 90%+ of baseline
   - ScopeCreepChecker: length-normalized term frequency drift detection
   - Small-baseline (<3 meaningful words) gracefully skipped

5. Pareto Selector (evolution/core/pareto_selector.py)
   - Multi-objective: holdout score (primary) + skill size delta (secondary)
   - min_improvement_delta=0.03 noise floor (evaluation noise guard)
   - growth_threshold cap prevents 400%+ bloat with small gains
   - Robustness gate: failed check = baseline retained regardless

6. Shared Types (evolution/core/types.py)
   - 5 dataclasses: EvolutionSnapshot, RouterDecision, BacktrackDecision,
     ComputeBudget, EvolutionReport

7. Tests: 30 new tests, all passing
   - Router: 6 tests (empty extend, edge case fix, low budget, structural,
     confidence scaling, all-pass)
   - Backtrack: 6 tests (insufficient data, plateau, improving, force archive,
     reset, walk-back)
   - Pareto: 5 tests (better, noise floor, robustness fail, growth penalty,
     zero growth)
   - Constraints: 9 tests (5 config drift, 4 scope creep)
   - Integration: 4 tests (dry run, no-gepa skeleton, backtrack integration,
     Pareto integration)
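
The Backtrack Controller's plateau detection and walk-back (item 3 above) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in evolution/core/backtrack.py: the function names, the epsilon value, and treating "1%" as an absolute score difference of 0.01 are all assumed.

```python
from typing import Optional, Sequence

EPS = 1e-9  # assumed tolerance for float comparisons


def is_plateau(scores: Sequence[float], window: int = 3,
               threshold: float = 0.01) -> bool:
    """True when the last `window` scores gained no more than `threshold`."""
    if len(scores) < window:
        return False  # insufficient data
    recent = scores[-window:]
    gain = max(recent) - recent[0]
    # 0.51 - 0.50 evaluates to slightly more than 0.01 in IEEE 754 doubles,
    # so compare with a tolerance instead of a bare `<=`.
    return gain <= threshold + EPS


def walk_back_index(scores: Sequence[float],
                    threshold: float = 0.01) -> Optional[int]:
    """Index of the checkpoint just before the last adjacent improvement."""
    for i in range(len(scores) - 1, 0, -1):
        if scores[i] - scores[i - 1] > threshold + EPS:
            return i - 1
    return None  # no adjacent improvement exceeded the threshold
```

For example, `is_plateau([0.50, 0.505, 0.51])` is treated as a plateau here, whereas a bare `gain <= 0.01` comparison would miss it because of floating-point rounding.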
Adds the top-level integration layer that connects v2.1 modules
to the live evolution pipeline:

1. gepa_v2_dispatch.py (437 lines)
   - Wraps v1's GEPA loop with v2.1 decision gates
   - Top-level backtrack: re-runs GEPA if ParetoSelector rejects,
     up to N attempts (3 or iterations//5)
   - Runs ConfigDriftChecker, SkillRegressionChecker, ScopeCreepChecker
     on each evolved candidate
   - Captures per-scenario holdout results for Router classification
   - Returns EvolutionReport with deploy/review/reject recommendation
   - Saves output to output/<skill>/v2_<timestamp>/ with report.json

2. --v2 CLI flag
   - python -m evolution.skills.evolve_skill --skill X --v2
   - Dispatches through v2_dispatch() instead of v1 evolve()
   - v1 path unchanged when --v2 is absent

3. EvolutionReport simplified
   - Replaced 10 fields (baseline_score, evolved_score, budget, etc.)
     with 8 focused fields: skill_name, n_iterations_executed,
     improvement, recommendation, details, router_decision,
     backtrack_decision, elapsed_seconds
   - All dependent modules (evolve_skill_v2.py, types.py, tests)
     updated to match

4. Backtrack checkpoint_for_score convenience method
   - Records EvolutionSnapshot from raw score/body/iteration values

5. Tests: 162 passing (3 pre-existing failures)
   - 3 new dispatch tests: dry_run, no_skill, report_type
   - 30 v2.1 unit tests still passing
   - Integration tests updated for new EvolutionReport shape

1. EvolutionRouter fixes (threshold validation):
   - Fixed priority order: structural checked BEFORE coverage (not after)
   - Fixed structural check: uses evolution_history[-1] (needs >=1, not >=2)
   - coverage check now actually uses coverage_cluster_ratio (was defined but
     never implemented — classified ANY multi-reason failure set as coverage)
   - Added coverage_min_failures=3 guard (1 failure isn't a coverage pattern)
   - Added _dominant_category helper for logging which category dominates

2. PostHocAnalyzer (evolution/core/posthoc_analyzer.py)
   - Fits power-law curve: score = a * iteration^c + b via scipy.optimize.curve_fit
   - Scipy fallback: log-log linear regression with R² estimate
   - Phase classification: early_discovery (c>0.2), diminishing_returns (0.05<c<0.2),
     plateau (c<0.05)
   - Crossover detection: finds iteration where marginal gain < min_improvement_delta
   - Predicted score at 2x iterations
   - Pure analytical — no API calls
   - scipy added to project dependencies (needed for curve_fit)

3. Router Benchmark (tests/core/router_benchmark.py)
   - 11 synthetic test cases: edge_case, coverage, structural, noise,
     all_pass, low_budget, edge_case ratio sweep, empty, no_history,
     zero_hard, mixed_priority
   - All 11 passing
   - Run standalone: python tests/core/router_benchmark.py

4. Tests: 175 passing (3 pre-existing failures)
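
As a rough illustration of the PostHocAnalyzer's non-scipy fallback (item 2 above), the sketch below fits a power law by least squares in log-log space and applies the phase thresholds listed. It is a simplification under stated assumptions: the offset b is fixed at 0, iterations are numbered from 1, all scores must be positive, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_power_law_loglog(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Fit score ~ a * iteration**c; return (a, c, r_squared)."""
    if len(scores) < 2 or any(s <= 0 for s in scores):
        raise ValueError("need at least two positive scores")
    xs = [math.log(i) for i in range(1, len(scores) + 1)]
    ys = [math.log(s) for s in scores]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c = sxy / sxx                      # slope in log-log space = exponent
    log_a = mean_y - c * mean_x        # intercept = log of the scale factor
    ss_res = sum((y - (log_a + c * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(log_a), c, r_squared


def classify_phase(c: float) -> str:
    """Thresholds from the message; boundary handling is an assumption."""
    if c > 0.2:
        return "early_discovery"
    if c > 0.05:
        return "diminishing_returns"
    return "plateau"
```

The message leaves c == 0.2 and c == 0.05 unspecified; this sketch assigns them to the lower phase in each case.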
…n test

Bugs surfaced by running the full gate pipeline against real skill data:

1. SkillRegressionChecker.check() interface mismatch
   - Filesystem-based check() takes (skill_name, threshold) — not inline scores
   - Added check_score(evolved_score, baseline_score) for the direct score
     comparison that v2_dispatch and tests use
   - Fixed v2_dispatch.py call → check_score()

2. SelectionResult missing 'reason' field
   - ParetoSelector selected evolved vs baseline but didn't explain WHY
   - Added reason: str field with human-readable explanations at every branch
   - All selection paths now log: robustness failure, noise floor, weighted win,
     growth penalty, and improvement
   - Growth penalty now appears in reason string for size-override decisions

3. ParetoSelector reason edge case: growth info missing
   - When size penalty was the deciding factor (400% growth → penalty=1.0),
     the reason only said 'baseline wins on weighted score' without mentioning
     growth or the penalty value
   - Fixed: all weighted-score reasons now include growth ratio and penalty

4. Test fixes:
   - ConfigDrift: the test used different descriptions (which correctly triggers
     drift); changed it to use different tags (which correctly does not)
   - Regression: asserted r2[0] == 'pass' but check_score returns (bool, str)
   - Pipeline tests: 4/4 passing against real companion-workflows skill data
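
To make the selection paths in items 2 and 3 concrete, here is a minimal sketch of a selector that returns a reason on every branch. It is not the ParetoSelector implementation: the weighting formula, `size_weight`, and the `winner` field are assumptions, while the 0.03 noise floor, the robustness gate, and the penalty of 1.0 at 400% growth come from the commit messages.

```python
from dataclasses import dataclass


@dataclass
class SelectionResult:
    winner: str  # "evolved" or "baseline" (assumed field)
    reason: str  # human-readable explanation


def select(baseline_score: float, evolved_score: float,
           baseline_size: int, evolved_size: int, robustness_passed: bool,
           min_improvement_delta: float = 0.03,
           growth_threshold: float = 4.0,   # 400% growth -> penalty 1.0
           size_weight: float = 0.1) -> SelectionResult:  # assumed weight
    if not robustness_passed:
        return SelectionResult("baseline",
                               "robustness check failed; baseline retained")
    improvement = evolved_score - baseline_score
    if improvement < min_improvement_delta:
        return SelectionResult(
            "baseline",
            f"improvement {improvement:+.3f} is below the noise floor "
            f"{min_improvement_delta}")
    growth = (evolved_size - baseline_size) / baseline_size
    penalty = min(max(growth, 0.0) / growth_threshold, 1.0)
    weighted = improvement - size_weight * penalty
    winner = "evolved" if weighted > 0 else "baseline"
    return SelectionResult(
        winner,
        f"{winner} wins on weighted score "
        f"(growth {growth:.0%}, penalty {penalty:.2f})")
```

Including the growth ratio and penalty in every weighted-score reason is what makes a size-driven rejection distinguishable from a plain score loss when reading logs.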
- PostHocAnalyzer runs after GEPA loop completes, before Router
  classification, using the per-attempt score trajectory
- Shows power-law phase classification and recommended action in console
- Appends posthoc analysis to report.json output
- Adds Phase and Power-Law c rows to summary table
- Import PostHocReport type and PostHocAnalyzer class in dispatch
- All 17 posthoc + pipeline integration tests passing

1. test_skill_over_limit: 20KB input didn't exceed 50KB max_skill_size
   → increased to 60KB to actually trigger the limit

2. test_excessive_growth: 30% growth on 1KB baseline was within the
   100% dynamic allowance for small skills (<5KB)
   → changed to 30% growth on 25KB baseline (max 20% for >20KB skills)

3. test_valid_skill: minimal body lacked 2-of-3 structural checks
   → added substantive body with steps, headings, and >100 chars

4. PostHoc integration: fixed spurious PostHocReport import from types.py
   (doesn't exist there; PostHocAnalyzer resolves its own dependencies)

Full suite: 189/189 passing — all tests clean
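
The size-dependent growth allowance that test fixes 1 and 2 rely on can be sketched as below. The tier values (100% under 5KB, 50% under 20KB, 20% above) come from the commit messages; the function names, 1KB = 1024 bytes, and the handling of a baseline of exactly 20KB are assumptions.

```python
KB = 1024  # assumed unit; the messages only say "KB"


def max_growth_ratio(baseline_bytes: int) -> float:
    """Allowed relative growth for a skill of the given baseline size."""
    if baseline_bytes < 5 * KB:
        return 1.00  # small skills may double
    if baseline_bytes < 20 * KB:
        return 0.50  # medium skills may grow by half
    return 0.20      # large skills are held to 20%


def growth_allowed(baseline_bytes: int, evolved_bytes: int) -> bool:
    """True when the evolved skill stays within its tier's allowance."""
    growth = (evolved_bytes - baseline_bytes) / baseline_bytes
    return growth <= max_growth_ratio(baseline_bytes)
```

This mirrors the corrected test: 30% growth on a 1KB baseline passes, while 30% growth on a 25KB baseline (`growth_allowed(25 * KB, 33280)`) is rejected.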
Bug: score_trajectory extracted avg_baseline (always the same value)
instead of tracking best score over time. This meant PostHoc never had
enough variance for power-law fitting and always returned None.

Fix: starts with the initial baseline score, then for each attempt
takes max(avg_baseline, avg_evolved) compared against the running best.
This produces a non-decreasing trajectory that the power-law fitter
can actually analyze.
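
A minimal sketch of the fixed trajectory logic, assuming each attempt is available as an `(avg_baseline, avg_evolved)` pair. The function name and input shape are hypothetical.

```python
from typing import Iterable, List, Tuple


def build_score_trajectory(initial_baseline: float,
                           attempts: Iterable[Tuple[float, float]]) -> List[float]:
    """Running best score, starting from the initial baseline."""
    trajectory = [initial_baseline]
    best = initial_baseline
    for avg_baseline, avg_evolved in attempts:
        # Keep the best value seen so far, so the series never decreases.
        best = max(best, avg_baseline, avg_evolved)
        trajectory.append(best)
    return trajectory
```

With attempts `[(0.5, 0.55), (0.5, 0.52), (0.5, 0.61)]` and an initial baseline of 0.5, this yields `[0.5, 0.55, 0.55, 0.61]`, which has the variance a curve fit needs.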

189/189 tests passing.
Part A — captured-skill plugin (Hermes Agent gateway):
  plugin.yaml — registers on_session_end hook
  __init__.py — loads session data, builds candidates, slash commands
  capture.py — core logic: is_capturable heuristics, tool sequence
    extraction, domain tagging, skill body generation, overlap
    detection (Jaccard word similarity, no embeddings needed)

  Hooks: on_session_end — runs after every completed session with
    3+ tool calls, extracts task description, tool sequence, domain
    tags, and success pattern, saves to ~/.hermes/captured/<name>.json

  Slash command: /captured — list, show, inspect, validate, stats

Part B — ingest-captured CLI (self-evolution repo):
  python -m evolution.tools.ingest_captured
    list [status]     List captured candidates
    validate <file>   Validate candidate structure and overlap
    deploy <file>     Deploy a validated candidate as ~/.hermes/skills/<name>/SKILL.md
    auto              Bulk validate + deploy all pending candidates
    evolve <file>     Run v2 evolution pipeline then deploy if improved
    stats             Capture statistics

  Validation: body length > 50 chars, frontmatter or heading structure,
    overlap with existing skills (blocked at J > 0.5)
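
The overlap check can be sketched with plain Jaccard similarity over word sets. The tokenization rule and function names below are assumptions; only the J > 0.5 blocking threshold is taken from the description above.

```python
import re
from typing import Iterable, Set


def word_set(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """|A intersect B| / |A union B| over the two word sets."""
    wa, wb = word_set(a), word_set(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def is_blocked(candidate: str, existing_bodies: Iterable[str],
               threshold: float = 0.5) -> bool:
    """Block a candidate that overlaps too much with any existing skill."""
    return any(jaccard(candidate, body) > threshold for body in existing_bodies)
```

Because this compares sets of words rather than embeddings, it needs no model call and runs in time linear in the text length.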

196/196 tests passing (7 new + 189 existing)
…n (rejected), nous_auth module, v2 entry point, evolution stats/tools catalog
New modules:
- evolution/tools/tool_module.py: ToolDescriptionStore (loads from Hermes registry) + ToolDescriptionModule (DSPy Predict wrapper)
- evolution/tools/tool_dataset_builder.py: (task, tool_name) eval dataset from synthetic templates + SessionDB mining
- evolution/tools/tool_description_v2.py: GEPA v2 pipeline with BacktrackController, PostHocAnalyzer, ParetoSelector, EvolutionRouter
- evolution/tools/evolve_tool_description.py: CLI entry point (--tool, --iterations, --eval-source, --dry-run)

Architecture:
- Tool selection as classification: given task description → predict correct tool
- GEPA optimizes tool descriptions (≤500 chars) to maximize classification accuracy
- v2 pipeline wraps v1 GEPA with decision gates (reject/accept/review)
- Constraint validator enforces 500-char description limit
- Output: output/tool_descriptions/<tool>_<timestamp>/report.json

CLI: python -m evolution.tools.evolve_tool_description --tool search_files --iterations 10
…itness, parallel scoring

- MultiComponentSkillModule: section-level mutation via split_into_sections/reconstruct
- PurposePreservationChecker (4th hard gate): blocks type-changing mutations via keyword
  survival + TF-IDF cosine similarity + consultant-prompt structure detection
- ContentSemanticScorer: sklearn TfidfVectorizer (unigrams+bigrams, sublinear TF)
- ParallelRelevanceFilter: ThreadPoolExecutor for LLM relevance calls (13s→30s for 20)
- Ollama Cloud thinking wrapper fix: promotes reasoning_content→content when empty
- evolve_skill.py: multi_component_extract() replaces _extract_evolved_skill_body()
- gepa_v2_dispatch: uses MultiComponentSkillModule, fixed total_improvement calc
- Test coverage: test_constraints_v2.py (7 test cases)
- CLI entry points: run_batch_evolution.py, run_deep_evolution.py, run_v2_validate*.py
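
As an illustration of the ParallelRelevanceFilter idea, the sketch below fans relevance calls out over a thread pool while preserving input order. The function name, the `score_fn` callback, and the worker count are hypothetical; the cap of 20 candidates follows the pre-filter limit mentioned in an earlier commit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def score_in_parallel(candidates: Sequence[str],
                      score_fn: Callable[[str], float],
                      max_workers: int = 8,
                      max_candidates: int = 20) -> List[Tuple[str, float]]:
    """Score candidates concurrently; results keep the input order."""
    capped = list(candidates)[:max_candidates]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of its inputs.
        scores = list(pool.map(score_fn, capped))
    return list(zip(capped, scores))
```

Threads suit this workload because each call mostly waits on network I/O, so the interpreter lock is not the bottleneck.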
…eration

- Fix syntax error in seed_to_skill.py: Python versions before 3.12 don't allow
  backslash escapes inside f-string expressions. Extracted coherence_issues_escaped
  and timestamp to variables before the multi-line return statement.

- Add run_batch_seed_generation.py: generates skills from seeds for
  Phase 3 of skill-generation-from-seed plan.
  Seeds: personal-osint-audit, exploratory-data-analysis, research-planning

- Kanban: move 3 regression skills to STALE:
  companion-personas (plateau, best=+0.1988, latest=-0.1247),
  companion-system-orchestration (plateau, best=+0.0418),
  github-code-review (noise-level changes, best=+0.0000)
  Root cause: evolving existing 600-line skills hits plateaus.
  These should use seed-based generation instead.
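
The seed_to_skill.py fix at the top of this message follows a common pattern: do any work that needs a backslash before the f-string, so the expression part holds only a plain name. A minimal sketch, in which everything except the name `coherence_issues_escaped` is hypothetical:

```python
from typing import List


def render_report(coherence_issues: List[str], timestamp: str) -> str:
    """Build a small report without backslashes in f-string expressions."""
    # Escaping and joining happen here, outside the f-string expression.
    coherence_issues_escaped = "\\n".join(
        issue.replace('"', '\\"') for issue in coherence_issues
    )
    return (
        f"generated: {timestamp}\n"
        f'coherence_issues: "{coherence_issues_escaped}"\n'
    )
```

Backslashes in the literal part of an f-string, such as the trailing `\n` above, are fine; only the part inside the braces is restricted on some Python versions.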

- Batch seed generation running in background (proc_ce421498c3d0)
- seed_to_skill.py: add --timestamp CLI arg to prevent cross-run arXiv pollution
- run_batch_seed_fast.py: switch eval model from broken minimax/minimax-m2.7 → deepseek/deepseek-v4-flash
- gepa_kanban: add 3 new skills (hermes-agent-author, design-a-multi-agent-companion-coordinat, github-pr-review) in VALIDATING; mark old skills STALE

New skills installed to ~/.hermes/skills/:
  companion-system/hermes-agent-author/ (replaces companion-personas)
  companion-system/design-a-multi-agent-companion-coordinat/ (replaces companion-system-orchestration)
  github/github-pr-review/ (replaces github-code-review)

All 3 skills: 0 arXiv refs, all 5 sections exit=0, coherence PASS.
228 tests still passing.
Results:
- hermes-agent-author: 0.500 (INCOMPARABLE - generator vs old persona catalog)
- design-a-multi-agent-companion-coordinat: 0.621 vs old 0.731 (-0.11 regression)
- github-pr-review: 0.578 vs old 0.650 (-0.07 regression)

Key insight: hermes-agent-author is a fundamentally different task from
the old companion-personas (generates personas vs. provides a fixed catalog).
The other two are modest regressions - the seed skills are more focused/narrow
than the original 11-section skills they replaced.

Scripts added: score_new_skills.py, score_compare.py, score_baseline.py
Cleanup: removed bad arXiv-polluted seed dirs from previous failed run.

GEPA ran 5 iterations on:
- design-a-multi-agent-companion-coordinat: 0.5->0.5, all mutations rejected
- github-pr-review: 0.5->0.5, all mutations rejected
- hermes-agent-author: failed to load skill

Root cause: seed skills are ~50% smaller (5 sections) than the old archived
skills (9-15 sections). The seed generates a GEPA-friendly skeleton
but lacks the depth/complexity that made the originals effective. GEPA
can't fill in the missing sections within 5 iterations.

Key finding: seed-to-skill creates useful starting skeletons, not
direct replacements for highly-refined multi-section skills. The
pipeline is working correctly — the gap is in seed density.

NEW SKILLS (Phase 3 - no regression baseline):
- research-synthesis: 0.560 — web/arxiv/wiki research report synthesis
- linear-issue-creator: 0.341 — natural language to Linear issue creation
- codebase-metrics: 0.527 — codebase metrics via pygount

PHASE 2 REPLACEMENTS (with old baselines):
- hermes-agent-author: 0.528 vs 0.627 old (-0.099, INCOMPARABLE - different task)
- design-a-multi-agent-companion-coordinat: 0.622 vs 0.731 old (-0.109)
- github-pr-review: 0.585 vs 0.650 old (-0.065)

GEPA PHASE 4 EVOLUTION:
- All 3 seed skills: 0 improvement (5 iterations, all mutations rejected)
- Root cause: seed skills are 5-section skeletons vs old 9-15 section skills
- Pipeline is working correctly; seed density is the bottleneck

Phase 3 scripts: run_batch_seed_phase3.py, updated score_new_skills.py
Datasets: research-synthesis, linear-issue-creator, codebase-metrics
…0% improvement

- linear-issue-creator: GEPA 5 iter, valset 72%, holdout 50%, no improvement
- codebase-metrics: GEPA 5 iter, valset 55.8%, holdout 50%, no improvement
- Both seeds locally optimal on synthetic eval — seed density ceiling confirmed
- research-synthesis killed (22% valset, poor synthetic fit)
- Updated card-registry.json + baseline_scores_20260501.json

steezkelly added 19 commits May 3, 2026 02:26
- EvalExample: add tool_sequence, complexity_score, session_id, success_pattern
- EvalDataset: add merge() dedup and save_atomic() write-temp-then-rename
- CapturedExampleEnricher: rule-based rubric extraction from first section
- assign_split(): deterministic MD5 hash -> exactly one split (train/val/holdout)
- enrich_and_merge(): single-split append with dedup, replaces save_as_sessiondb_example
- Capture plugin: on_session_end hook + /captured slash command (plugin.yaml + __init__.py)
- Verification script: docs/phase5_verification.py
- Tests: 34 passing across dataset_builder, ingest_captured, capture_plugin
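
The write-temp-then-rename approach named for `save_atomic()` can be sketched as follows. This is a standalone illustration, not the EvalDataset method: the free-function form and the JSON Lines file format are assumptions.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def save_atomic(path: str, records: List[Dict[str, Any]]) -> None:
    """Write records to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)  # atomic replacement of the target
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader therefore sees either the old file or the complete new one, never a half-written dataset, even if the process dies mid-write.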

Design decisions:
- D1 (3+ tools) enforced in both plugin _is_capturable and _save_candidate
- D2 (rule-based rubric) — no LLM call, first post-frontmatter section + fallback
- D3 (silent failure) — errors logged to ~/.hermes/capture_errors/<date>.jsonl

Fixes 6 gaps from Phase 5 gap audit:
- Gap B (data leakage): single split assignment
- Gap C (rubric mismatch): structured expected_behavior with Task/Expected tools/Procedure
- Gap F (field loss): metadata preserved to dataset
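
The single-split assignment behind the Gap B fix can be sketched as a deterministic hash bucket. The 70/15/15 ratios, the choice of key, and the signature are assumptions; only the use of an MD5 hash and the three split names come from the notes above.

```python
import hashlib


def assign_split(example_id: str, train: float = 0.70, val: float = 0.15) -> str:
    """Map an example to exactly one split, stable across runs."""
    digest = hashlib.md5(example_id.encode("utf-8")).hexdigest()
    # Use the first 32 bits of the digest as a fraction in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "holdout"
```

Since the split depends only on the key, re-ingesting the same captured example can never place it in a second split, which is what prevents train/holdout leakage.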
@steezkelly steezkelly closed this May 9, 2026
@steezkelly steezkelly deleted the consolidate/55-61-evolution-gates branch May 9, 2026 03:58