feat: add Perses plugin scaffolds, hooks, and agent-comparison autoresearch by notque · Pull Request #209 · notque/claude-code-toolkit

notque · 2026-03-30T04:11:33Z

Summary

- Add Perses panel plugin scaffolds (perses-plugin-example/, plugins/custom-panel/, plugins/example-panel/) with CUE schemas, React components, rsbuild config, and Module Federation setup
- Add SQL injection detector PostToolUse hook with 7 pattern categories (string concatenation, format injection, sprintf, f-string) and comprehensive test suite
- Add creation protocol enforcer hooks (UserPromptSubmit + PreToolUse:Agent) for early ADR compliance detection
- Add team-config-loader SessionStart hook for multi-environment team configuration injection
- Migrate skill_eval scripts from direct Anthropic SDK to claude -p subprocess, removing hard anthropic dependency
- Add agent-comparison autoresearch optimization loop (optimize_loop.py, generate_variant.py) with beam search, Goodhart divergence detection, and HTML reporting
- Register kotlin, php, and swift agent entries in INDEX.json
- Add DB performance indexes for learning_db_v2 (v3 migration) and usage_db
- Strengthen /do SKILL.md Phase 1 creation request detection

Test plan

- ruff check . passes (all checks passed)
- ruff format --check . passes (251 files already formatted)
- pytest passes (1444 passed)
- 3 rounds of PR review completed with fixes committed
- SQL injection detector hook tests pass (12 tests)
- Agent comparison optimize_loop tests pass (24 tests)
- Skill eval claude code migration tests pass (13 tests)

PR #204 was merged to main while this branch was being developed. All conflicts resolved in favor of the clean rework versions (ours):
- SKILL.md: review/export approach over cherry-pick
- optimization-guide.md: snapshot review terminology
- eval_viewer.html: radio selection, setActivePage helper, optimization-only mode
- eval_compare.py: standalone is_optimization_data() validator

…view issues

- Migrate generate_variant.py and improve_description.py from Anthropic SDK to claude -p subprocess invocation
- Add beam search optimization with configurable width, candidates per parent, and frontier retention to optimize_loop.py
- Add beam search parameters display and empty-state UX in eval_viewer.html
- Update SKILL.md and optimization-guide.md for beam search documentation
- Migrate skill-eval run_loop and rules-distill to use claude -p
- Add test coverage for beam search, model flag omission, and claude -p flow

Fixes from review:
- Fix misplaced test_writes_pending_json_in_live_mode (back in TestFullPipeline)
- Remove dead round_keeps variable from optimize_loop.py
- Fix timeout mismatch (120s outer vs 300s inner → 360s outer)
- Clarify --max-iterations help text (rounds, not individual iterations)
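The beam search added to optimize_loop.py is not shown on this page, so the following is only a minimal sketch of the general technique under stated assumptions. The names `beam_search`, `expand`, and `score` are hypothetical, and "frontier retention" is interpreted here as letting parents compete with their children for the next frontier; the real implementation may define it differently.

```python
from typing import Callable, List, Tuple


def beam_search(
    seed: str,
    expand: Callable[[str, int], List[str]],
    score: Callable[[str], float],
    rounds: int,
    beam_width: int = 2,
    candidates_per_parent: int = 2,
) -> Tuple[str, float]:
    """Return the best (candidate, score) pair seen across all rounds."""
    frontier = [(score(seed), seed)]
    best = frontier[0]
    for _ in range(rounds):
        children = []
        for _, parent in frontier:
            # Each parent proposes a fixed number of variants.
            for variant in expand(parent, candidates_per_parent):
                children.append((score(variant), variant))
        if not children:
            break
        # Assumed meaning of "frontier retention": parents stay in the pool,
        # so one bad round cannot discard a good candidate.
        pool = frontier + children
        pool.sort(key=lambda pair: pair[0], reverse=True)
        frontier = pool[:beam_width]
        if frontier[0][0] > best[0]:
            best = frontier[0]
    return best[1], best[0]
```

With `beam_width=1` and `candidates_per_parent=1` the loop keeps a single path and behaves like plain hill climbing.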

Critical fixes:
- Temp file collision in beam search: embed iteration_counter in filename
- rules-distill.py: log errors on claude -p failure and JSONDecodeError
- _run_trigger_rate: always print subprocess errors, not just under --verbose
- _generate_variant_output: add cwd and env (strip CLAUDECODE)

Important fixes:
- _find_project_root: warn on silent cwd fallback in generate_variant and improve_description
- improve_description: warn when <new_description> tags not found
- search_strategy: emit "hill_climb" for single-path runs (beam_width=1, candidates=1)
- rules-distill: log exception in broad except clause
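As an illustration of the `claude -p` invocation with `CLAUDECODE` stripped from the environment, here is a small sketch. The helper name `build_claude_invocation` is hypothetical and not taken from the PR; it only builds the argument list and environment, leaving the actual `subprocess.run` call to the caller.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple


def build_claude_invocation(
    prompt: str,
    model: Optional[str] = None,
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Build argv and env for a headless `claude -p` call (illustrative helper)."""
    argv = ["claude", "-p", prompt]
    # Omit the model flag entirely when no model is requested.
    if model:
        argv += ["--model", model]
    env = dict(os.environ if base_env is None else base_env)
    # Strip CLAUDECODE so the child process does not inherit it.
    env.pop("CLAUDECODE", None)
    return argv, env
```

A caller would then pass the result to something like `subprocess.run(argv, env=env, cwd=project_root, capture_output=True, text=True, timeout=360)`, where `project_root` is whatever directory the script resolved.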

…x task-file leak

Critical fixes:
- Wrap json.loads in _run_trigger_rate with try/except JSONDecodeError (exits-0-but-invalid-JSON no longer crashes the entire optimization run)
- Move task_file assignment before json.dump so finally block can always clean up the temp file on disk

Also: document _run_claude_code soft-fail contract in rules-distill.py
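The two critical fixes follow a common pattern that can be sketched as below. This is not the code from optimize_loop.py; `parse_json_or_none` and `with_task_file` are hypothetical names, and the file naming scheme is an assumption that merely echoes the earlier iteration_counter fix.

```python
import json
import os
import tempfile
from typing import Any, Callable, Optional


def parse_json_or_none(raw: str) -> Optional[Any]:
    """Soft-fail on invalid JSON instead of crashing the whole run."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        print(f"invalid JSON from subprocess: {exc}")
        return None


def with_task_file(task: dict, iteration_counter: int, use: Callable[[str], Any]) -> Any:
    """Write task to a temp file, call use(path), and always remove the file."""
    # The path is assigned before any write, so the finally block can
    # clean up even if json.dump raises halfway through.
    fd, task_file = tempfile.mkstemp(prefix=f"task-{iteration_counter}-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(task, handle)
        return use(task_file)
    finally:
        if os.path.exists(task_file):
            os.remove(task_file)
```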

…anup guard

- Add subprocess.TimeoutExpired to caught exceptions in variant generation loop (prevents unhandled crash when claude -p hits 360s timeout)
- Move temp_target.write_text() inside try/finally block so partial writes are cleaned up on disk-full or permission errors

- Fix import block ordering in test_eval_compare_optimization.py (ruff I001)
- Fix formatting in test_skill_eval_claude_code.py and eval_compare.py (ruff format)

Add _run_behavioral_eval() to optimize_loop.py that runs `claude -p "/do {query}"` and checks for ADR artifact creation, enabling direct testing of /do's creation protocol compliance.

Trigger-rate optimization was proven inapplicable for /do (scored 0.0 across all 32 tasks) because /do is slash-invoked, not description-discovered. Behavioral eval via headless /do is the correct approach: confirmed that `claude -p "/do create..."` works but does NOT produce ADRs, validating the compliance gap.

Changes:
- Add _run_behavioral_eval() with artifact snapshot/diff detection
- Add _is_behavioral_task() for eval_mode detection
- Update _validate_task_set() for behavioral task format
- Wire behavioral path into assess_target()
- Add DO NOT OPTIMIZE markers to /do SKILL.md (Phase 2-5 protected)
- Create 32-task benchmark set (16 positive, 16 negative, 60/40 split)

Add explicit Creation Request Detection block to Phase 1 CLASSIFY, immediately before the Gate line. The block scans for creation verbs, domain object targets, and implicit creation patterns, then flags the request as [CREATION REQUEST DETECTED] so Phase 4 Step 0 is acknowledged before routing decisions consume model attention. This is ADR-133 Prong 2, Option A. Moving detection to Phase 1 addresses the root cause: the creation protocol was buried in Phase 4 where it competed with agent dispatch instructions and was frequently skipped.

Soft-warns when an Agent dispatch appears to be for a creation task but no recent .adr-session.json is present (stale = >900s or missing). Exit 0 only — never blocks. Prong 2 / Option B of ADR-133.
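The staleness rule can be sketched as follows. This assumes freshness is judged by the file's modification time, which the commit message does not confirm; the function name `adr_session_is_fresh` is hypothetical.

```python
import os
import time
from typing import Optional

STALE_AFTER_SECONDS = 900  # threshold quoted in the commit message


def adr_session_is_fresh(
    path: str,
    now: Optional[float] = None,
    max_age: float = STALE_AFTER_SECONDS,
) -> bool:
    """Return True when the session file exists and is at most max_age seconds old."""
    try:
        modified = os.path.getmtime(path)  # assumption: mtime marks the session
    except OSError:
        return False  # a missing file counts as stale
    current = time.time() if now is None else now
    return (current - modified) <= max_age
```

A hook built on this check would print its advisory and still exit with status 0, so it never blocks the dispatch.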

Three agents (kotlin-general-engineer, php-general-engineer, swift-general-engineer) existed on disk but were missing from agents/INDEX.json, making them invisible to the routing system. Added all three entries with triggers, pairs_with, complexity, and category sourced directly from each agent's frontmatter. Also fixes the pre-existing golang-general-engineer-compact ordering bug as a side effect of re-sorting the index alphabetically.

…meoutExpired

Two fixes to _run_behavioral_eval():
1. Default timeout 120s -> 240s: headless /do creation sessions frequently exceed 120s when they dispatch agents that write files, create plans, etc.
2. Check artifact glob after TimeoutExpired: the subprocess may have written artifacts before the timeout fired. The old code set triggered=False on any timeout, causing false FAIL for tasks that completed their artifact writes but ran over time.

E2E baseline results (6-task subset, 240s timeout):
- Creation recall: 1/3 (33%); implicit-create-rails passed (ADR-135 created)
- Non-creation precision: 3/3 (100%)
- build-agent-rust: genuine compliance gap (completed, no ADR)
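The second fix amounts to diffing the artifact snapshot even when the run times out. A minimal sketch of that idea, with hypothetical names (`snapshot`, `run_and_diff`) and the subprocess call abstracted into a callable so the logic is easy to test:

```python
import glob
import os
import subprocess
from typing import Callable, Iterable, Set, Tuple


def snapshot(root: str, patterns: Iterable[str]) -> Set[str]:
    """Collect every path under root matching any of the glob patterns."""
    found: Set[str] = set()
    for pattern in patterns:
        found.update(glob.glob(os.path.join(root, pattern), recursive=True))
    return found


def run_and_diff(
    run: Callable[[], None],
    root: str,
    patterns: Iterable[str],
) -> Tuple[bool, Set[str], bool]:
    """Return (triggered, created_paths, timed_out) for one task run."""
    patterns = list(patterns)
    before = snapshot(root, patterns)
    timed_out = False
    try:
        run()
    except subprocess.TimeoutExpired:
        # Do not give up here: artifacts may already be on disk.
        timed_out = True
    created = snapshot(root, patterns) - before
    return bool(created), created, timed_out
```

In real use `run` would wrap something like `subprocess.run([...], timeout=240)`; the returned `created` set is also what a cleanup step would delete between tasks.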

1. behavioral eval: always print claude exit code (not only in verbose mode); silent failures would produce phantom 50% accuracy, corrupting optimization
2. behavioral eval: clean up created artifacts between tasks to prevent stale before-snapshots in multi-round optimization runs
3. creation-protocol-enforcer: expand keyword set to match SKILL.md vocabulary ('build a', 'add new', 'new feature', 'i need a/an', 'we need a/an'); the previous set covered <50% of the benchmark creation queries
4. SKILL.md Phase 1: move [CREATION REQUEST DETECTED] output to the Gate condition so LLM cannot proceed to Phase 2 without acknowledging the flag
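For illustration, keyword matching over the phrases quoted in item 3 might look like the sketch below. The real hook's full keyword set is not shown on this page, so this covers only the listed phrases; `looks_like_creation_request` is a hypothetical name.

```python
import re

# Only the phrases quoted in the commit message; the real set is larger.
_PHRASES = (
    "build a",
    "add new",
    "new feature",
    "i need a",
    "i need an",
    "we need a",
    "we need an",
)

# Word boundaries keep "build a" from matching "build all".
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(phrase) for phrase in _PHRASES) + r")\b",
    re.IGNORECASE,
)


def looks_like_creation_request(prompt: str) -> bool:
    """Return True when the prompt contains one of the creation phrases."""
    return _PATTERN.search(prompt) is not None
```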

… selection, add behavioral-runs-per-task param

- Fix 1: _run_behavioral_eval now snapshots agents/*.md, scripts/*.py, skills/**/SKILL.md, and pipelines/**/SKILL.md before each task run. New files in those dirs are deleted after the run (including on TimeoutExpired), preventing cross-task snapshot pollution.
- Fix 2: After the main loop, if test_tasks exist, selects the KEEP iteration with the highest held-out test score rather than highest training score (anti-Goodhart). Falls back to a single final test eval on best_content when no holdout-checked KEEP exists. Adds best_test_score to the result dict.
- Fix 3: Adds --behavioral-runs-per-task (default: 1) and --behavioral-trigger-threshold (default: 0.5) CLI params. When runs_per_task > 1, each task is run sequentially N times; triggered = (sum(runs) / N) >= threshold. Mirrors Anthropic's runs_per_query=3 / trigger_threshold=0.5 pattern. Params thread through run_optimization_loop → assess_target → _run_behavioral_eval.
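Fix 2 and Fix 3 can be sketched in a few lines. The aggregation formula is taken from the commit message; the iteration dictionary keys (`verdict`, `test_score`) and both function names are hypothetical stand-ins for whatever optimize_loop.py actually uses.

```python
from typing import List, Optional, Sequence


def aggregate_trigger(runs: Sequence[bool], threshold: float = 0.5) -> bool:
    """triggered = (sum(runs) / N) >= threshold, as described in Fix 3."""
    if not runs:
        return False
    return (sum(runs) / len(runs)) >= threshold


def select_by_holdout(iterations: List[dict]) -> Optional[dict]:
    """Pick the KEEP iteration with the highest held-out test score (Fix 2).

    Returns None when no KEEP iteration has a test score, in which case the
    caller would fall back to a single final test eval.
    """
    kept = [
        item
        for item in iterations
        if item.get("verdict") == "KEEP" and item.get("test_score") is not None
    ]
    if not kept:
        return None
    return max(kept, key=lambda item: item["test_score"])
```

Selecting on the held-out score rather than the training score means a variant that overfits the training tasks does not win by default.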

… ADR reminder

Introduces creation-request-enforcer-userprompt.py which fires at UserPromptSubmit time, before the model begins routing, to catch creation requests that lack a recent ADR session. Complements the existing PreToolUse:Agent creation-protocol-enforcer by moving the advisory injection earlier in the pipeline.

Adds PostToolUse hook that scans written/edited code files for SQL injection anti-patterns (string concat, .format(), f-strings with SQL keywords, += concatenation). Advisory-only, never blocks. Fixes ruff format violation in test file that would have failed CI.

Also adds:
- learning_db_v2: v3 migration with timestamp/cohort indexes for query perf
- usage_db: composite indexes on (skill_name, timestamp) and (agent_type, timestamp)
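A line-based scanner for the four anti-patterns named above might look like this sketch. The regular expressions and category names are illustrative assumptions, not the hook's actual patterns, and they are intentionally loose since the check is advisory.

```python
import re
from typing import List, Tuple

_SQL = r"\b(?:SELECT|INSERT|UPDATE|DELETE|WHERE)\b"
# A single- or double-quoted literal on one line containing an SQL keyword.
_DQ = rf'"[^"\n]*{_SQL}[^"\n]*"'
_SQ = rf"'[^'\n]*{_SQL}[^'\n]*'"
_LITERAL = rf"(?:{_DQ}|{_SQ})"

PATTERNS = {
    # "SELECT ..." + user_input
    "string_concat": re.compile(rf"{_LITERAL}\s*\+(?!=)", re.IGNORECASE),
    # "SELECT ... {}".format(user_input)
    "format_call": re.compile(rf"{_LITERAL}\s*\.format\(", re.IGNORECASE),
    # f"SELECT ... {user_input}"
    "f_string": re.compile(
        rf"""\bf(?:"[^"\n]*{_SQL}[^"\n]*\{{|'[^'\n]*{_SQL}[^'\n]*\{{)""",
        re.IGNORECASE,
    ),
    # query += " WHERE ..."
    "augmented_concat": re.compile(rf"""\+=\s*f?["'][^\n]*{_SQL}""", re.IGNORECASE),
}


def scan_source(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, category) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

A parameterized call such as `cur.execute("SELECT * FROM t WHERE id = ?", (uid,))` produces no findings, which is the behavior an advisory check wants to encourage.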

… and React component

Scaffolds an end-to-end Perses panel plugin demonstrating the full authoring lifecycle: CUE schema with close() constraints, canonical JSON examples, TypeScript React component, Module Federation build config, and tsconfig.

Adds gap-filling files missing from initial scaffold:
- example-panel.cue and example-panel.json (canonical percli naming convention)
- testdata/full-config.json (all-optional-fields fixture for percli test-schemas)
- tsconfig.json (TypeScript compilation)
- rsbuild.config.ts (Module Federation entrypoint, mirrors custom-panel reference)
- Updated package.json with devDependencies and rsbuild scripts

Relates-to: ADR-137

Remove extra blank line after import block that ruff treats as malformed import formatting (I001).

Fix 8 ruff errors: 6x ARG005 unused lambda args in test stubs (prefix with _), 2x SIM910 redundant None default in .get(). Auto-format 4 files with ruff format. Add new agent-comparison task sets, perses plugin scaffolds, and skill variant.

…estore trailing newline

- Remove duplicate spec.cue (identical to example-panel.cue) from perses-plugin-example
- Remove redundant display.json from perses-plugin-example schemas
- Add missing index.ts plugin registration for plugins/custom-panel
- Restore trailing newline in .claude/settings.json

…t-schemas script

- team-config-loader: compare version as string to handle both PyYAML int and fallback parser string returns
- plugins/custom-panel: add test-schemas script to package.json for percli schema validation parity with perses-plugin-example
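The version comparison fix can be illustrated with a short sketch. The expected value `"1"` and the function name `version_matches` are assumptions made for the example; the loader's real schema is not shown here.

```python
from typing import Any, Mapping

SUPPORTED_VERSION = "1"  # hypothetical expected value


def version_matches(config: Mapping[str, Any], expected: str = SUPPORTED_VERSION) -> bool:
    """Compare the version as a string so int and str inputs behave the same."""
    raw = config.get("version")
    if raw is None:
        return False
    # PyYAML yields an int for `version: 1`, while a plain-text fallback
    # parser may yield "1" (possibly still quoted), so normalize both.
    return str(raw).strip().strip("\"'") == expected
```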

PR #205 merged the same autoresearch content. Keep our branch versions for files with additional review-round fixes. Apply our two INDEX.json edits (deprecation wording, pairs_with simplification) on top of main's formatting.

notque added 22 commits March 28, 2026 19:17

feat(agent-comparison): add autoresearch optimization review flow

79d2733

style: fix import sort order and formatting

926bedf

- Fix import block ordering in test_eval_compare_optimization.py (ruff I001)
- Fix formatting in test_skill_eval_claude_code.py and eval_compare.py (ruff format)

feat(adr-133): add creation-protocol-enforcer PreToolUse hook

c25f6a7

Soft-warns when an Agent dispatch appears to be for a creation task but no recent .adr-session.json is present (stale = >900s or missing). Exit 0 only — never blocks. Prong 2 / Option B of ADR-133.

fix(lint): fix ruff I001 import sort in team-config-loader.py

f9c18e0

Remove extra blank line after import block that ruff treats as malformed import formatting (I001).

merge: resolve conflicts with origin/main (PR #205 overlap)

9fdc523

PR #205 merged the same autoresearch content. Keep our branch versions for files with additional review-round fixes. Apply our two INDEX.json edits (deprecation wording, pairs_with simplification) on top of main's formatting.

notque merged commit e57f29f into main Mar 30, 2026
4 checks passed

notque deleted the feat/perses-plugin-example branch March 30, 2026 04:18

notque added a commit that referenced this pull request Mar 30, 2026

merge: resolve conflicts with main after PR #209 merge

3da524a

notque merged 22 commits into main from feat/perses-plugin-example
