feat: add Perses plugin scaffolds, hooks, and agent-comparison autoresearch#209
Merged
PR #204 was merged to main while this branch was being developed. All conflicts resolved in favor of the clean rework versions (ours):
- SKILL.md: review/export approach over cherry-pick
- optimization-guide.md: snapshot review terminology
- eval_viewer.html: radio selection, setActivePage helper, optimization-only mode
- eval_compare.py: standalone is_optimization_data() validator
…view issues
- Migrate generate_variant.py and improve_description.py from Anthropic SDK to claude -p subprocess invocation
- Add beam search optimization with configurable width, candidates per parent, and frontier retention to optimize_loop.py
- Add beam search parameters display and empty-state UX in eval_viewer.html
- Update SKILL.md and optimization-guide.md for beam search documentation
- Migrate skill-eval run_loop and rules-distill to use claude -p
- Add test coverage for beam search, model flag omission, and claude -p flow
Fixes from review:
- Fix misplaced test_writes_pending_json_in_live_mode (back in TestFullPipeline)
- Remove dead round_keeps variable from optimize_loop.py
- Fix timeout mismatch (120s outer vs 300s inner → 360s outer)
- Clarify --max-iterations help text (rounds, not individual iterations)
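The beam search described above can be sketched roughly as follows. This is an illustration only, not the code in optimize_loop.py: `expand`, `score`, and the toy helpers are hypothetical, and "frontier retention" is interpreted here as letting parents compete with their children, which is an assumption.

```python
from typing import Callable, List, Tuple


def beam_search(
    seed: str,
    expand: Callable[[str, int], List[str]],
    score: Callable[[str], float],
    beam_width: int = 2,
    candidates_per_parent: int = 3,
    rounds: int = 3,
) -> Tuple[str, float]:
    """Keep the best `beam_width` variants after each round of expansion."""
    frontier = [(score(seed), seed)]
    for _ in range(rounds):
        # Assumed meaning of frontier retention: parents stay in the pool.
        pool = list(frontier)
        for _, parent in frontier:
            for child in expand(parent, candidates_per_parent):
                pool.append((score(child), child))
        # Highest score first, then truncate to the beam width.
        pool.sort(key=lambda item: item[0], reverse=True)
        frontier = pool[:beam_width]
    best_score, best = frontier[0]
    return best, best_score


# Toy usage (hypothetical): grow a string toward a target length of 5.
def toy_expand(parent: str, n: int) -> List[str]:
    return [parent + "a" * (i + 1) for i in range(n)]


def toy_score(text: str) -> float:
    return -abs(len(text) - 5)


best, best_score = beam_search("", toy_expand, toy_score)
```

With `beam_width=1` and `candidates_per_parent=1` the same loop degenerates into single-path hill climbing.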
Critical fixes:
- Temp file collision in beam search: embed iteration_counter in filename
- rules-distill.py: log errors on claude -p failure and JSONDecodeError
- _run_trigger_rate: always print subprocess errors, not just under --verbose
- _generate_variant_output: add cwd and env (strip CLAUDECODE)
Important fixes:
- _find_project_root: warn on silent cwd fallback in generate_variant and improve_description
- improve_description: warn when <new_description> tags not found
- search_strategy: emit "hill_climb" for single-path runs (beam_width=1, candidates=1)
- rules-distill: log exception in broad except clause
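Two of the fixes above (stripping CLAUDECODE from the child environment and labelling single-path runs) might look something like this. The function names and the "beam" label for the non-single-path case are assumptions made for illustration; only the CLAUDECODE variable and the "hill_climb" value come from the commit text.

```python
import os
from typing import Dict, Mapping, Optional


def child_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and drop CLAUDECODE before spawning a child."""
    env = dict(os.environ if base is None else base)
    env.pop("CLAUDECODE", None)  # no error when the variable is absent
    return env


def search_strategy(beam_width: int, candidates_per_parent: int) -> str:
    """Report "hill_climb" for single-path runs; "beam" is an assumed label."""
    if beam_width == 1 and candidates_per_parent == 1:
        return "hill_climb"
    return "beam"


# The resulting dict would typically be passed along with an explicit cwd,
# e.g. subprocess.run(cmd, cwd=project_root, env=child_env()).
```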
…x task-file leak
Critical fixes:
- Wrap json.loads in _run_trigger_rate with try/except JSONDecodeError (exits-0-but-invalid-JSON no longer crashes the entire optimization run)
- Move task_file assignment before json.dump so finally block can always clean up the temp file on disk
Also: document _run_claude_code soft-fail contract in rules-distill.py
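The two patterns behind these fixes can be sketched as below. The helper names `parse_result` and `write_task_file` are hypothetical; the point is only that the JSON parse fails softly and that the temp-file path is bound before anything is written, so the finally block can always remove it.

```python
import json
import os
import tempfile
from typing import Any, Callable, Optional


def parse_result(stdout: str) -> Optional[Any]:
    """Return parsed JSON, or None when the output is not valid JSON."""
    try:
        return json.loads(stdout)
    except json.JSONDecodeError as exc:
        print(f"invalid JSON from subprocess: {exc}")
        return None


def write_task_file(task: dict, consume: Callable[[str], None]) -> None:
    """Write `task` to a temp file, pass its path to `consume`, always clean up."""
    task_file = None
    try:
        # The path is assigned before json.dump, so a failed dump still
        # leaves a known path for the finally block to delete.
        fd, task_file = tempfile.mkstemp(suffix=".json")
        with os.fdopen(fd, "w") as handle:
            json.dump(task, handle)
        consume(task_file)
    finally:
        if task_file is not None and os.path.exists(task_file):
            os.unlink(task_file)
```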
…anup guard
- Add subprocess.TimeoutExpired to caught exceptions in variant generation loop (prevents unhandled crash when claude -p hits 360s timeout)
- Move temp_target.write_text() inside try/finally block so partial writes are cleaned up on disk-full or permission errors
- Fix import block ordering in test_eval_compare_optimization.py (ruff I001)
- Fix formatting in test_skill_eval_claude_code.py and eval_compare.py (ruff format)
Add _run_behavioral_eval() to optimize_loop.py that runs
`claude -p "/do {query}"` and checks for ADR artifact creation,
enabling direct testing of /do's creation protocol compliance.
Trigger-rate optimization was proven inapplicable for /do (scored
0.0 across all 32 tasks) because /do is slash-invoked, not
description-discovered. Behavioral eval via headless /do is the
correct approach — confirmed that `claude -p "/do create..."` works
but does NOT produce ADRs, validating the compliance gap.
Changes:
- Add _run_behavioral_eval() with artifact snapshot/diff detection
- Add _is_behavioral_task() for eval_mode detection
- Update _validate_task_set() for behavioral task format
- Wire behavioral path into assess_target()
- Add DO NOT OPTIMIZE markers to /do SKILL.md (Phase 2-5 protected)
- Create 32-task benchmark set (16 positive, 16 negative, 60/40 split)
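The artifact snapshot/diff detection listed above can be illustrated with a small sketch. It assumes artifacts are found by glob patterns relative to a project root; the function names and any pattern passed in (for example `adr/*.md`) are hypothetical, not taken from optimize_loop.py.

```python
from pathlib import Path
from typing import Iterable, Set


def snapshot(root: Path, patterns: Iterable[str]) -> Set[Path]:
    """Collect every file under `root` that matches any glob pattern."""
    found: Set[Path] = set()
    for pattern in patterns:
        found.update(p for p in root.glob(pattern) if p.is_file())
    return found


def new_artifacts(before: Set[Path], root: Path, patterns: Iterable[str]) -> Set[Path]:
    """Files that exist now but were absent from the `before` snapshot."""
    return snapshot(root, patterns) - before
```

A task would count as triggered when `new_artifacts(...)` is non-empty after the headless run.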
Add explicit Creation Request Detection block to Phase 1 CLASSIFY, immediately before the Gate line. The block scans for creation verbs, domain object targets, and implicit creation patterns, then flags the request as [CREATION REQUEST DETECTED] so Phase 4 Step 0 is acknowledged before routing decisions consume model attention. This is ADR-133 Prong 2, Option A. Moving detection to Phase 1 addresses the root cause: the creation protocol was buried in Phase 4 where it competed with agent dispatch instructions and was frequently skipped.
Soft-warns when an Agent dispatch appears to be for a creation task but no recent .adr-session.json is present (stale = >900s or missing). Exit 0 only — never blocks. Prong 2 / Option B of ADR-133.
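A minimal sketch of the staleness check, assuming that age is measured from the file's modification time (the commit message does not say how age is determined). The function name is hypothetical; the 900-second threshold is the one stated above.

```python
import os
import time
from typing import Optional

STALE_AFTER_SECONDS = 900  # threshold stated in the commit message


def adr_session_is_stale(path: str, now: Optional[float] = None) -> bool:
    """True when the session file is missing or older than the threshold."""
    if not os.path.exists(path):
        return True
    current = time.time() if now is None else now
    # Assumption: the file's mtime stands in for the session's age.
    return (current - os.path.getmtime(path)) > STALE_AFTER_SECONDS
```

An advisory hook built on this would print its warning and still exit 0.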
Three agents (kotlin-general-engineer, php-general-engineer, swift-general-engineer) existed on disk but were missing from agents/INDEX.json, making them invisible to the routing system. Added all three entries with triggers, pairs_with, complexity, and category sourced directly from each agent's frontmatter. Also fixes the pre-existing golang-general-engineer-compact ordering bug as a side effect of re-sorting the index alphabetically.
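A check of this kind could catch agents that exist on disk but are absent from the index. The sketch assumes INDEX.json has a top-level "agents" mapping keyed by agent name and that each agent is a `<name>.md` file in the same directory; both are assumptions about the layout, not confirmed by the commit.

```python
import json
from pathlib import Path
from typing import List


def agents_missing_from_index(agents_dir: Path) -> List[str]:
    """Names of agent files on disk that have no entry in INDEX.json."""
    index = json.loads((agents_dir / "INDEX.json").read_text())
    indexed = set(index.get("agents", {}))  # assumed index layout
    on_disk = {path.stem for path in agents_dir.glob("*.md")}
    return sorted(on_disk - indexed)
```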
…meoutExpired
Two fixes to _run_behavioral_eval():
1. Default timeout 120s -> 240s: headless /do creation sessions frequently exceed 120s when they dispatch agents that write files, create plans, etc.
2. Check artifact glob after TimeoutExpired: the subprocess may have written artifacts before the timeout fired. The old code set triggered=False on any timeout, causing false FAIL for tasks that completed their artifact writes but ran over time.
E2E baseline results (6-task subset, 240s timeout):
- Creation recall: 1/3 (33%); implicit-create-rails passed (ADR-135 created)
- Non-creation precision: 3/3 (100%)
- build-agent-rust: genuine compliance gap (completed, no ADR)
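The second fix can be sketched as follows: the timeout is caught, but the artifact check still runs afterwards. The function name and the injectable `run` parameter are hypothetical conveniences for illustration; the real function checks a glob rather than a single path.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def run_and_check(
    cmd: Sequence[str],
    artifact: Path,
    timeout: float = 240,
    run: Callable = subprocess.run,
) -> bool:
    """Return True when `artifact` exists after the command, even on timeout."""
    try:
        run(cmd, capture_output=True, timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        # The child may have written its artifact before the timeout fired,
        # so fall through to the existence check instead of returning False.
        pass
    return artifact.exists()
```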
1. behavioral eval: always print claude exit code (not only in verbose mode); silent failures would produce phantom 50% accuracy, corrupting optimization
2. behavioral eval: clean up created artifacts between tasks to prevent stale before-snapshots in multi-round optimization runs
3. creation-protocol-enforcer: expand keyword set to match SKILL.md vocabulary; 'build a', 'add new', 'new feature', 'i need a/an', 'we need a/an' previously covered <50% of the benchmark creation queries
4. SKILL.md Phase 1: move [CREATION REQUEST DETECTED] output to the Gate condition so LLM cannot proceed to Phase 2 without acknowledging the flag
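Keyword matching of the kind item 3 refers to could be sketched like this. The phrases are the ones quoted in the commit message; the regex form and the function name are assumptions, and the actual hook may match differently.

```python
import re

# Phrases quoted in the commit message, written as word-bounded patterns.
CREATION_PATTERNS = [
    r"\bbuild a\b",
    r"\badd new\b",
    r"\bnew feature\b",
    r"\bi need an?\b",
    r"\bwe need an?\b",
]
_CREATION_RE = re.compile("|".join(CREATION_PATTERNS), re.IGNORECASE)


def looks_like_creation(prompt: str) -> bool:
    """True when the prompt contains one of the creation phrases."""
    return _CREATION_RE.search(prompt) is not None
```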
… selection, add behavioral-runs-per-task param
- Fix 1: _run_behavioral_eval now snapshots agents/*.md, scripts/*.py, skills/**/SKILL.md, and pipelines/**/SKILL.md before each task run. New files in those dirs are deleted after the run (including on TimeoutExpired), preventing cross-task snapshot pollution.
- Fix 2: After the main loop, if test_tasks exist, selects the KEEP iteration with the highest held-out test score rather than highest training score (anti-Goodhart). Falls back to a single final test eval on best_content when no holdout-checked KEEP exists. Adds best_test_score to the result dict.
- Fix 3: Adds --behavioral-runs-per-task (default: 1) and --behavioral-trigger-threshold (default: 0.5) CLI params. When runs_per_task > 1, each task is run sequentially N times; triggered = (sum(runs) / N) >= threshold. Mirrors Anthropic's runs_per_query=3 / trigger_threshold=0.5 pattern. Params thread through run_optimization_loop → assess_target → _run_behavioral_eval.
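Fix 2 and Fix 3 reduce to two small decisions, sketched here. The `triggered` formula is the one given above; the dictionary keys `verdict` and `test_score` are hypothetical field names chosen for illustration.

```python
from typing import Dict, List, Optional


def is_triggered(runs: List[bool], threshold: float = 0.5) -> bool:
    """triggered = (sum(runs) / N) >= threshold, as stated in Fix 3."""
    if not runs:
        return False
    return (sum(runs) / len(runs)) >= threshold


def pick_best_keep(iterations: List[Dict]) -> Optional[Dict]:
    """Choose the KEEP iteration with the highest held-out test score."""
    kept = [
        it for it in iterations
        if it.get("verdict") == "KEEP" and it.get("test_score") is not None
    ]
    if not kept:
        return None  # caller falls back to a single final test eval
    return max(kept, key=lambda it: it["test_score"])
```

Selecting on the held-out score means an iteration that overfits the training tasks cannot win on training score alone.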
… ADR reminder
Introduces creation-request-enforcer-userprompt.py which fires at UserPromptSubmit time, before the model begins routing, to catch creation requests that lack a recent ADR session. Complements the existing PreToolUse:Agent creation-protocol-enforcer by moving the advisory injection earlier in the pipeline.
Adds PostToolUse hook that scans written/edited code files for SQL injection anti-patterns (string concat, .format(), f-strings with SQL keywords, += concatenation). Advisory-only, never blocks. Fixes ruff format violation in test file that would have failed CI.
Also adds:
- learning_db_v2: v3 migration with timestamp/cohort indexes for query perf
- usage_db: composite indexes on (skill_name, timestamp) and (agent_type, timestamp)
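A line-based heuristic for the four anti-patterns named above might look like the sketch below. The regexes, the keyword list, and the `scan` function are all assumptions written for illustration; they are deliberately crude and are not the hook's actual rules.

```python
import re
from typing import List, Tuple

# Assumed keyword list; a line is only inspected when it mentions SQL.
SQL = re.compile(r"\b(SELECT|INSERT|UPDATE|DELETE|WHERE)\b", re.IGNORECASE)

# One hypothetical pattern per anti-pattern, checked in this order.
CONSTRUCTS = {
    "f-string": re.compile(r"""\bf["'].*\{.*\}"""),
    ".format()": re.compile(r"""["']\s*\.format\("""),
    "+= concatenation": re.compile(r"\+="),
    "string concat": re.compile(r"""["']\s*\+(?!=)|\+\s*["']"""),
}


def scan(source: str) -> List[Tuple[int, str]]:
    """Return (line number, anti-pattern) for the first match on each line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if not SQL.search(line):
            continue
        for name, pattern in CONSTRUCTS.items():
            if pattern.search(line):
                findings.append((lineno, name))
                break  # advisory only: one finding per line is enough
    return findings
```

A parameterized query such as `"... WHERE id = %s"` passes untouched because none of the string-building constructs appear on the line.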
… and React component
Scaffolds an end-to-end Perses panel plugin demonstrating the full authoring lifecycle: CUE schema with close() constraints, canonical JSON examples, TypeScript React component, Module Federation build config, and tsconfig.
Adds gap-filling files missing from initial scaffold:
- example-panel.cue and example-panel.json (canonical percli naming convention)
- testdata/full-config.json (all-optional-fields fixture for percli test-schemas)
- tsconfig.json (TypeScript compilation)
- rsbuild.config.ts (Module Federation entrypoint, mirrors custom-panel reference)
- Updated package.json with devDependencies and rsbuild scripts
Relates-to: ADR-137
Remove extra blank line after import block that ruff treats as malformed import formatting (I001).
Fix 8 ruff errors: 6x ARG005 unused lambda args in test stubs (prefix with _), 2x SIM910 redundant None default in .get(). Auto-format 4 files with ruff format. Add new agent-comparison task sets, perses plugin scaffolds, and skill variant.
…estore trailing newline
- Remove duplicate spec.cue (identical to example-panel.cue) from perses-plugin-example
- Remove redundant display.json from perses-plugin-example schemas
- Add missing index.ts plugin registration for plugins/custom-panel
- Restore trailing newline in .claude/settings.json
…t-schemas script
- team-config-loader: compare version as string to handle both PyYAML int and fallback parser string returns
- plugins/custom-panel: add test-schemas script to package.json for percli schema validation parity with perses-plugin-example
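The version comparison fix amounts to normalizing both sides to strings before comparing. In this sketch the function name and the expected value "1" are hypothetical; the commit does not state which version the loader expects.

```python
def version_matches(raw_version, expected: str = "1") -> bool:
    """Compare as strings so that 1 (int) and "1" (str) are treated alike."""
    return str(raw_version).strip() == expected
```

This way a YAML parser that returns the integer 1 and a fallback parser that returns the text "1" take the same branch.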
PR #205 merged the same autoresearch content. Keep our branch versions for files with additional review-round fixes. Apply our two INDEX.json edits (deprecation wording, pairs_with simplification) on top of main's formatting.
notque added a commit that referenced this pull request Mar 30, 2026
Summary
- … perses-plugin-example/, plugins/custom-panel/, plugins/example-panel/) with CUE schemas, React components, rsbuild config, and Module Federation setup
- … skill_eval scripts from direct Anthropic SDK to `claude -p` subprocess, removing hard anthropic dependency
- … optimize_loop.py, generate_variant.py) with beam search, Goodhart divergence detection, and HTML reporting
Test plan
- `ruff check .` passes (all checks passed)
- `ruff format --check .` passes (251 files already formatted)
- `pytest` passes (1444 passed)