
feat: add Perses plugin scaffolds, hooks, and agent-comparison autoresearch#209

Merged
notque merged 22 commits into main from feat/perses-plugin-example on Mar 30, 2026

@notque notque commented Mar 30, 2026

Summary

  • Add Perses panel plugin scaffolds (perses-plugin-example/, plugins/custom-panel/, plugins/example-panel/) with CUE schemas, React components, rsbuild config, and Module Federation setup
  • Add SQL injection detector PostToolUse hook with 7 pattern categories (string concatenation, format injection, sprintf, f-string) and comprehensive test suite
  • Add creation protocol enforcer hooks (UserPromptSubmit + PreToolUse:Agent) for early ADR compliance detection
  • Add team-config-loader SessionStart hook for multi-environment team configuration injection
  • Migrate skill_eval scripts from direct Anthropic SDK to claude -p subprocess, removing hard anthropic dependency
  • Add agent-comparison autoresearch optimization loop (optimize_loop.py, generate_variant.py) with beam search, Goodhart divergence detection, and HTML reporting
  • Register kotlin, php, and swift agent entries in INDEX.json
  • Add DB performance indexes for learning_db_v2 (v3 migration) and usage_db
  • Strengthen /do SKILL.md Phase 1 creation request detection

Test plan

  • ruff check . passes (all checks passed)
  • ruff format --check . passes (251 files already formatted)
  • pytest passes (1444 passed)
  • 3 rounds of PR review completed with fixes committed
  • SQL injection detector hook tests pass (12 tests)
  • Agent comparison optimize_loop tests pass (24 tests)
  • Skill eval claude code migration tests pass (13 tests)

notque added 22 commits March 28, 2026 19:17
PR #204 was merged to main while this branch was being developed.
All conflicts resolved in favor of the clean rework versions (ours):
- SKILL.md: review/export approach over cherry-pick
- optimization-guide.md: snapshot review terminology
- eval_viewer.html: radio selection, setActivePage helper, optimization-only mode
- eval_compare.py: standalone is_optimization_data() validator
…view issues

- Migrate generate_variant.py and improve_description.py from Anthropic SDK
  to claude -p subprocess invocation
- Add beam search optimization with configurable width, candidates per parent,
  and frontier retention to optimize_loop.py
- Add beam search parameters display and empty-state UX in eval_viewer.html
- Update SKILL.md and optimization-guide.md for beam search documentation
- Migrate skill-eval run_loop and rules-distill to use claude -p
- Add test coverage for beam search, model flag omission, and claude -p flow
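
The beam search described above can be sketched roughly as follows. This is a minimal illustration, not the code in optimize_loop.py: the `propose` and `score` callables, the function name, and the `rounds` parameter are hypothetical stand-ins for variant generation and evaluation. Frontier retention is modelled by letting parents compete with their children for a slot in the next beam.

```python
from typing import Callable, List, Tuple


def beam_search(
    seed: str,
    propose: Callable[[str, int], List[str]],
    score: Callable[[str], float],
    beam_width: int = 2,
    candidates_per_parent: int = 2,
    rounds: int = 3,
) -> Tuple[str, float]:
    """Return the best (content, score) pair found (hypothetical sketch)."""
    frontier = [(score(seed), seed)]
    best = frontier[0]
    for _ in range(rounds):
        # Frontier retention: parents stay in the pool next to their children.
        pool = list(frontier)
        for _, parent in frontier:
            for child in propose(parent, candidates_per_parent):
                pool.append((score(child), child))
        # Keep only the top `beam_width` entries for the next round.
        pool.sort(key=lambda pair: pair[0], reverse=True)
        frontier = pool[:beam_width]
        if frontier[0][0] > best[0]:
            best = frontier[0]
    return best[1], best[0]
```

With `beam_width=1` and `candidates_per_parent=1` this degenerates into a single-path search.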

Fixes from review:
- Fix misplaced test_writes_pending_json_in_live_mode (back in TestFullPipeline)
- Remove dead round_keeps variable from optimize_loop.py
- Fix timeout mismatch (120s outer vs 300s inner → 360s outer)
- Clarify --max-iterations help text (rounds, not individual iterations)

Critical fixes:
- Temp file collision in beam search: embed iteration_counter in filename
- rules-distill.py: log errors on claude -p failure and JSONDecodeError
- _run_trigger_rate: always print subprocess errors, not just under --verbose
- _generate_variant_output: add cwd and env (strip CLAUDECODE)
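
A rough sketch of what a `claude -p` subprocess call with an explicit `cwd`, a cleaned environment, and visible error reporting might look like. The function names are hypothetical and the real helpers in the repository may differ; the 360s timeout is the outer value mentioned in the earlier fix.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional


def build_child_env(parent_env: Mapping[str, str]) -> Dict[str, str]:
    """Copy the environment without CLAUDECODE (hypothetical helper)."""
    return {key: value for key, value in parent_env.items() if key != "CLAUDECODE"}


def run_claude_prompt(prompt: str, cwd: str, timeout_s: int = 360) -> Optional[str]:
    """Run `claude -p <prompt>` and return stdout, or None on any failure."""
    try:
        proc = subprocess.run(
            ["claude", "-p", prompt],
            capture_output=True,
            text=True,
            cwd=cwd,
            env=build_child_env(os.environ),
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError) as exc:
        # Always report the failure, not only in a verbose mode.
        print(f"claude -p failed: {exc!r}", file=sys.stderr)
        return None
    if proc.returncode != 0:
        print(f"claude -p exited {proc.returncode}: {proc.stderr.strip()}", file=sys.stderr)
        return None
    return proc.stdout
```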

Important fixes:
- _find_project_root: warn on silent cwd fallback in generate_variant and improve_description
- improve_description: warn when <new_description> tags not found
- search_strategy: emit "hill_climb" for single-path runs (beam_width=1, candidates=1)
- rules-distill: log exception in broad except clause
…x task-file leak

Critical fixes:
- Wrap json.loads in _run_trigger_rate with try/except JSONDecodeError
  (exits-0-but-invalid-JSON no longer crashes the entire optimization run)
- Move task_file assignment before json.dump so finally block can always
  clean up the temp file on disk
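
The two fixes above follow a common pattern, sketched here under assumptions: `run_with_task_file` and the `runner` callable are hypothetical names, not the signature of `_run_trigger_rate`. The temp-file path is bound before anything is written, so the `finally` block can always remove it, and invalid JSON yields `None` rather than an exception.

```python
import json
import os
import sys
import tempfile
from typing import Any, Callable, Optional


def run_with_task_file(task: dict, runner: Callable[[str], str]) -> Optional[Any]:
    """Write `task` to a temp file, run `runner` on it, and parse its output."""
    task_file = None
    try:
        # Bind the path first so cleanup works even if json.dump raises.
        fd, task_file = tempfile.mkstemp(suffix=".json")
        with os.fdopen(fd, "w") as handle:
            json.dump(task, handle)
        raw = runner(task_file)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            # An exit-0-but-invalid-JSON reply is logged instead of crashing the run.
            print(f"invalid JSON from subprocess: {exc}", file=sys.stderr)
            return None
    finally:
        if task_file and os.path.exists(task_file):
            os.remove(task_file)
```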

Also: document _run_claude_code soft-fail contract in rules-distill.py
…anup guard

- Add subprocess.TimeoutExpired to caught exceptions in variant generation
  loop (prevents unhandled crash when claude -p hits 360s timeout)
- Move temp_target.write_text() inside try/finally block so partial writes
  are cleaned up on disk-full or permission errors
- Fix import block ordering in test_eval_compare_optimization.py (ruff I001)
- Fix formatting in test_skill_eval_claude_code.py and eval_compare.py (ruff format)

Add _run_behavioral_eval() to optimize_loop.py that runs
`claude -p "/do {query}"` and checks for ADR artifact creation,
enabling direct testing of /do's creation protocol compliance.

Trigger-rate optimization was proven inapplicable for /do (scored
0.0 across all 32 tasks) because /do is slash-invoked, not
description-discovered. Behavioral eval via headless /do is the
correct approach — confirmed that `claude -p "/do create..."` works
but does NOT produce ADRs, validating the compliance gap.

Changes:
- Add _run_behavioral_eval() with artifact snapshot/diff detection
- Add _is_behavioral_task() for eval_mode detection
- Update _validate_task_set() for behavioral task format
- Wire behavioral path into assess_target()
- Add DO NOT OPTIMIZE markers to /do SKILL.md (Phase 2-5 protected)
- Create 32-task benchmark set (16 positive, 16 negative, 60/40 split)
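
Artifact snapshot/diff detection can be illustrated with a small sketch. The helper names and the `adr/*.md` pattern used in the usage note are assumptions for illustration; the actual glob used by `_run_behavioral_eval()` is not shown in this PR description.

```python
import glob
import os
from typing import Iterable, Set


def snapshot_artifacts(root: str, patterns: Iterable[str]) -> Set[str]:
    """Return every path under `root` matching any of the glob patterns."""
    found: Set[str] = set()
    for pattern in patterns:
        found.update(glob.glob(os.path.join(root, pattern), recursive=True))
    return found


def created_since(before: Set[str], after: Set[str]) -> Set[str]:
    """Paths present in the later snapshot but not in the earlier one."""
    return after - before
```

A task would count as triggered when `created_since(before, after)` is non-empty, for example with `patterns=["adr/*.md"]` (hypothetical).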
Add explicit Creation Request Detection block to Phase 1 CLASSIFY,
immediately before the Gate line. The block scans for creation verbs,
domain object targets, and implicit creation patterns, then flags the
request as [CREATION REQUEST DETECTED] so Phase 4 Step 0 is acknowledged
before routing decisions consume model attention.

This is ADR-133 Prong 2, Option A. Moving detection to Phase 1 addresses
the root cause: the creation protocol was buried in Phase 4 where it
competed with agent dispatch instructions and was frequently skipped.

Soft-warns when an Agent dispatch appears to be for a creation task but
no recent .adr-session.json is present (stale = >900s or missing).
Exit 0 only — never blocks. Prong 2 / Option B of ADR-133.
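
One possible reading of the staleness rule, as a sketch. It assumes freshness is judged by the file's modification time; the hook might instead read a timestamp stored inside .adr-session.json. The function name is hypothetical.

```python
import os
import time
from typing import Optional

STALE_AFTER_S = 900  # threshold taken from the commit message


def adr_session_is_fresh(path: str, now: Optional[float] = None) -> bool:
    """True when the session file exists and is at most 900s old (by mtime)."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False  # a missing file counts as stale
    current = time.time() if now is None else now
    return (current - mtime) <= STALE_AFTER_S
```

An advisory hook built on this would print its warning and still exit 0 in either case.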
Three agents (kotlin-general-engineer, php-general-engineer,
swift-general-engineer) existed on disk but were missing from
agents/INDEX.json, making them invisible to the routing system.

Added all three entries with triggers, pairs_with, complexity, and
category sourced directly from each agent's frontmatter. Also fixes
the pre-existing golang-general-engineer-compact ordering bug as a
side effect of re-sorting the index alphabetically.
…meoutExpired

Two fixes to _run_behavioral_eval():
1. Default timeout 120s -> 240s: headless /do creation sessions frequently
   exceed 120s when they dispatch agents that write files, create plans, etc.
2. Check artifact glob after TimeoutExpired: the subprocess may have written
   artifacts before the timeout fired. The old code set triggered=False on
   any timeout, causing a false FAIL for tasks that completed their artifact
   writes but ran over time.
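
The second fix amounts to checking for artifacts regardless of whether the run timed out. A minimal sketch, with hypothetical names and injected callables so it can be exercised without a real subprocess:

```python
import subprocess
from typing import Callable, Tuple


def evaluate_with_timeout(
    run_task: Callable[[], None],
    artifact_created: Callable[[], bool],
) -> Tuple[bool, bool]:
    """Return (triggered, timed_out); artifacts are checked even after a timeout."""
    timed_out = False
    try:
        run_task()
    except subprocess.TimeoutExpired:
        # Do not assume failure: the artifact may already be on disk.
        timed_out = True
    return artifact_created(), timed_out
```

In practice `run_task` could wrap `subprocess.run([...], timeout=240)`, which raises `subprocess.TimeoutExpired` when the limit is hit.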

E2E baseline results (6-task subset, 240s timeout):
  - Creation recall: 1/3 (33%) — implicit-create-rails passed (ADR-135 created)
  - Non-creation precision: 3/3 (100%)
  - build-agent-rust: genuine compliance gap (completed, no ADR)

1. behavioral eval: always print claude exit code (not only in verbose mode)
   — silent failures would produce phantom 50% accuracy, corrupting optimization
2. behavioral eval: clean up created artifacts between tasks to prevent
   stale before-snapshots in multi-round optimization runs
3. creation-protocol-enforcer: expand keyword set to match SKILL.md vocabulary
   — 'build a', 'add new', 'new feature', 'i need a/an', 'we need a/an'
   previously covered <50% of the benchmark creation queries
4. SKILL.md Phase 1: move [CREATION REQUEST DETECTED] output to the Gate
   condition so the LLM cannot proceed to Phase 2 without acknowledging the flag
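
Keyword matching of the kind described in item 3 might look like the sketch below. Only the phrases quoted in the commit message are included; the enforcer's full keyword set and its matching strategy are not shown here, so the regex and function name are assumptions.

```python
import re

# Phrases quoted in the commit message; the hook's full list may be longer.
CREATION_RE = re.compile(
    r"\b(?:build a|add new|new feature|(?:i|we) need an?)\b"
)


def looks_like_creation_request(prompt: str) -> bool:
    """Case-insensitive, whitespace-normalised phrase match."""
    text = " ".join(prompt.lower().split())
    return CREATION_RE.search(text) is not None
```

The word boundaries keep "build and test" or "I need another look" from matching.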
… selection, add behavioral-runs-per-task param

- Fix 1: _run_behavioral_eval now snapshots agents/*.md, scripts/*.py,
  skills/**/SKILL.md, and pipelines/**/SKILL.md before each task run.
  New files in those dirs are deleted after the run (including on
  TimeoutExpired), preventing cross-task snapshot pollution.

- Fix 2: After the main loop, if test_tasks exist, selects the KEEP
  iteration with the highest held-out test score rather than highest
  training score (anti-Goodhart). Falls back to a single final test
  eval on best_content when no holdout-checked KEEP exists.
  Adds best_test_score to the result dict.

- Fix 3: Adds --behavioral-runs-per-task (default: 1) and
  --behavioral-trigger-threshold (default: 0.5) CLI params.
  When runs_per_task > 1, each task is run sequentially N times;
  triggered = (sum(runs) / N) >= threshold. Mirrors Anthropic's
  runs_per_query=3 / trigger_threshold=0.5 pattern.
  Params thread through run_optimization_loop → assess_target →
  _run_behavioral_eval.
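
Fixes 2 and 3 can be sketched as two small functions. The thresholding formula follows the commit message; the dictionary keys `verdict` and `test_score` and both function names are hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Sequence


def is_triggered(runs: Sequence[bool], threshold: float = 0.5) -> bool:
    """triggered = (sum(runs) / N) >= threshold, with N = len(runs)."""
    if not runs:
        return False
    return (sum(runs) / len(runs)) >= threshold


def pick_by_holdout(iterations: List[dict]) -> Optional[dict]:
    """Pick the KEEP iteration with the highest held-out test score.

    Returns None when no KEEP iteration has a test score, in which case
    the caller would fall back to a single final test evaluation.
    """
    kept = [
        item for item in iterations
        if item.get("verdict") == "KEEP" and item.get("test_score") is not None
    ]
    if not kept:
        return None
    return max(kept, key=lambda item: item["test_score"])
```

Selecting on the held-out score rather than the training score means a variant that overfits the training tasks is not chosen by default.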
… ADR reminder

Introduces creation-request-enforcer-userprompt.py which fires at
UserPromptSubmit time, before the model begins routing, to catch creation
requests that lack a recent ADR session. Complements the existing
PreToolUse:Agent creation-protocol-enforcer by moving the advisory
injection earlier in the pipeline.

Adds PostToolUse hook that scans written/edited code files for SQL injection
anti-patterns (string concat, .format(), f-strings with SQL keywords, +=
concatenation). Advisory-only, never blocks. Fixes ruff format violation in
test file that would have failed CI.
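
To make the anti-patterns concrete, here is a deliberately simplified line-based scanner. The regexes, keyword list, and category names are illustrative assumptions and cover only three of the categories; they are not the hook's actual patterns.

```python
import re
from typing import List, Tuple

SQL_KEYWORD = r"(?:SELECT|INSERT|UPDATE|DELETE)\b"

# Simplified heuristics: a quoted SQL string followed by a risky construct.
PATTERNS = {
    "string_concat": re.compile(rf"""["'][^"']*{SQL_KEYWORD}[^"']*["']\s*\+""", re.I),
    "format_call": re.compile(rf"""["'][^"']*{SQL_KEYWORD}[^"']*["']\s*\.format\(""", re.I),
    "f_string": re.compile(rf"""\bf["'][^"']*{SQL_KEYWORD}[^"']*\{{""", re.I),
}


def scan_source(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, pattern_name) for each suspicious line."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings
```

An advisory hook would print these findings and exit 0. Parameterised queries such as `cur.execute("SELECT ... WHERE id = ?", (uid,))` are not flagged by these patterns.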

Also adds:
- learning_db_v2: v3 migration with timestamp/cohort indexes for query perf
- usage_db: composite indexes on (skill_name, timestamp) and (agent_type, timestamp)
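
A sketch of the composite indexes, assuming the usage database is SQLite (the PR description does not say). The table names `skill_usage` and `agent_usage` and the index names are hypothetical; only the column pairs come from the commit message.

```python
import sqlite3


def ensure_usage_indexes(conn: sqlite3.Connection) -> None:
    """Create the two composite indexes if they do not exist yet."""
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_skill_ts "
        "ON skill_usage (skill_name, timestamp)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_agent_ts "
        "ON agent_usage (agent_type, timestamp)"
    )
    conn.commit()
```

`IF NOT EXISTS` makes the call safe to repeat on every startup or migration run.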
… and React component

Scaffolds an end-to-end Perses panel plugin demonstrating the full authoring
lifecycle: CUE schema with close() constraints, canonical JSON examples,
TypeScript React component, Module Federation build config, and tsconfig.

Adds gap-filling files missing from initial scaffold:
- example-panel.cue and example-panel.json (canonical percli naming convention)
- testdata/full-config.json (all-optional-fields fixture for percli test-schemas)
- tsconfig.json (TypeScript compilation)
- rsbuild.config.ts (Module Federation entrypoint, mirrors custom-panel reference)
- Updated package.json with devDependencies and rsbuild scripts

Relates-to: ADR-137

Remove extra blank line after import block that ruff treats as
malformed import formatting (I001).

Fix 8 ruff errors: 6x ARG005 unused lambda args in test stubs (prefix
with _), 2x SIM910 redundant None default in .get(). Auto-format 4
files with ruff format. Add new agent-comparison task sets, perses
plugin scaffolds, and skill variant.
…estore trailing newline

- Remove duplicate spec.cue (identical to example-panel.cue) from perses-plugin-example
- Remove redundant display.json from perses-plugin-example schemas
- Add missing index.ts plugin registration for plugins/custom-panel
- Restore trailing newline in .claude/settings.json
…t-schemas script

- team-config-loader: compare version as string to handle both PyYAML int
  and fallback parser string returns
- plugins/custom-panel: add test-schemas script to package.json for percli
  schema validation parity with perses-plugin-example
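
The version comparison fix can be illustrated as below. The expected value "1" and the function name are assumptions; the point is only that normalising to a string makes an int from PyYAML and a string from a fallback parser compare equal.

```python
def version_matches(raw_version: object, expected: str = "1") -> bool:
    """Compare a config version as a string, whatever type the parser returned."""
    return str(raw_version).strip() == expected
```

Note that a float such as `1.0` would stringify to "1.0" and not match, so this sketch assumes integer-style versions.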
PR #205 merged the same autoresearch content. Keep our branch versions
for files with additional review-round fixes. Apply our two INDEX.json
edits (deprecation wording, pairs_with simplification) on top of main's
formatting.
@notque notque merged commit e57f29f into main Mar 30, 2026
4 checks passed
@notque notque deleted the feat/perses-plugin-example branch March 30, 2026 04:18