Commit 250c351

feat(agent-comparison): add autoresearch optimization review flow (#205)
* feat(agent-comparison): add autoresearch optimization review flow

* feat(autoresearch): migrate SDK to claude -p, add beam search, fix review issues

- Migrate generate_variant.py and improve_description.py from Anthropic SDK to claude -p subprocess invocation
- Add beam search optimization with configurable width, candidates per parent, and frontier retention to optimize_loop.py
- Add beam search parameters display and empty-state UX in eval_viewer.html
- Update SKILL.md and optimization-guide.md for beam search documentation
- Migrate skill-eval run_loop and rules-distill to use claude -p
- Add test coverage for beam search, model flag omission, and claude -p flow

Fixes from review:
- Fix misplaced test_writes_pending_json_in_live_mode (back in TestFullPipeline)
- Remove dead round_keeps variable from optimize_loop.py
- Fix timeout mismatch (120s outer vs 300s inner → 360s outer)
- Clarify --max-iterations help text (rounds, not individual iterations)

* fix(review-round-1): address 8 findings from PR review

Critical fixes:
- Temp file collision in beam search: embed iteration_counter in filename
- rules-distill.py: log errors on claude -p failure and JSONDecodeError
- _run_trigger_rate: always print subprocess errors, not just under --verbose
- _generate_variant_output: add cwd and env (strip CLAUDECODE)

Important fixes:
- _find_project_root: warn on silent cwd fallback in generate_variant and improve_description
- improve_description: warn when <new_description> tags not found
- search_strategy: emit "hill_climb" for single-path runs (beam_width=1, candidates=1)
- rules-distill: log exception in broad except clause

* fix(review-round-2): handle JSON parse error in _run_trigger_rate, fix task-file leak

Critical fixes:
- Wrap json.loads in _run_trigger_rate with try/except JSONDecodeError (exits-0-but-invalid-JSON no longer crashes the entire optimization run)
- Move task_file assignment before json.dump so finally block can always clean up the temp file on disk

Also: document _run_claude_code soft-fail contract in rules-distill.py

* fix(review-round-3): catch TimeoutExpired, move write_text inside cleanup guard

- Add subprocess.TimeoutExpired to caught exceptions in variant generation loop (prevents unhandled crash when claude -p hits 360s timeout)
- Move temp_target.write_text() inside try/finally block so partial writes are cleaned up on disk-full or permission errors

* style: fix import sort order and formatting

- Fix import block ordering in test_eval_compare_optimization.py (ruff I001)
- Fix formatting in test_skill_eval_claude_code.py and eval_compare.py (ruff format)

* feat(adr-132): add behavioral eval mode and creation compliance task set

Add _run_behavioral_eval() to optimize_loop.py that runs `claude -p "/do {query}"` and checks for ADR artifact creation, enabling direct testing of /do's creation protocol compliance.

Trigger-rate optimization was proven inapplicable for /do (scored 0.0 across all 32 tasks) because /do is slash-invoked, not description-discovered. Behavioral eval via headless /do is the correct approach — confirmed that `claude -p "/do create..."` works but does NOT produce ADRs, validating the compliance gap.

Changes:
- Add _run_behavioral_eval() with artifact snapshot/diff detection
- Add _is_behavioral_task() for eval_mode detection
- Update _validate_task_set() for behavioral task format
- Wire behavioral path into assess_target()
- Add DO NOT OPTIMIZE markers to /do SKILL.md (Phase 2-5 protected)
- Create 32-task benchmark set (16 positive, 16 negative, 60/40 split)

* feat(adr-133): strengthen Phase 1 creation detection in /do SKILL.md

Add explicit Creation Request Detection block to Phase 1 CLASSIFY, immediately before the Gate line. The block scans for creation verbs, domain object targets, and implicit creation patterns, then flags the request as [CREATION REQUEST DETECTED] so Phase 4 Step 0 is acknowledged before routing decisions consume model attention.

This is ADR-133 Prong 2, Option A. Moving detection to Phase 1 addresses the root cause: the creation protocol was buried in Phase 4 where it competed with agent dispatch instructions and was frequently skipped.

* feat(adr-133): add creation-protocol-enforcer PreToolUse hook

Soft-warns when an Agent dispatch appears to be for a creation task but no recent .adr-session.json is present (stale = >900s or missing). Exit 0 only — never blocks. Prong 2 / Option B of ADR-133.

* fix(index): register kotlin, php, and swift agent entries in INDEX.json

Three agents (kotlin-general-engineer, php-general-engineer, swift-general-engineer) existed on disk but were missing from agents/INDEX.json, making them invisible to the routing system. Added all three entries with triggers, pairs_with, complexity, and category sourced directly from each agent's frontmatter.

Also fixes the pre-existing golang-general-engineer-compact ordering bug as a side effect of re-sorting the index alphabetically.

* fix(behavioral-eval): raise timeout to 240s, check artifacts after TimeoutExpired

Two fixes to _run_behavioral_eval():

1. Default timeout 120s -> 240s: headless /do creation sessions frequently exceed 120s when they dispatch agents that write files, create plans, etc.
2. Check artifact glob after TimeoutExpired: the subprocess may have written artifacts before the timeout fired. The old code set triggered=False on any timeout, causing false FAIL for tasks that completed their artifact writes but ran over time.

E2E baseline results (6-task subset, 240s timeout):
- Creation recall: 1/3 (33%) — implicit-create-rails passed (ADR-135 created)
- Non-creation precision: 3/3 (100%)
- build-agent-rust: genuine compliance gap (completed, no ADR)

* fix(review-round-1): address 4 findings from PR review

1. behavioral eval: always print claude exit code (not only in verbose mode) — silent failures would produce phantom 50% accuracy, corrupting optimization
2. behavioral eval: clean up created artifacts between tasks to prevent stale before-snapshots in multi-round optimization runs
3. creation-protocol-enforcer: expand keyword set to match SKILL.md vocabulary — 'build a', 'add new', 'new feature', 'i need a/an', 'we need a/an' previously covered <50% of the benchmark creation queries
4. SKILL.md Phase 1: move [CREATION REQUEST DETECTED] output to the Gate condition so LLM cannot proceed to Phase 2 without acknowledging the flag
1 parent 3bdf3cd commit 250c351
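
The behavioral-eval fixes in the commit message (snapshot/diff artifact detection, checking artifacts after TimeoutExpired, and cleaning up created artifacts between tasks) can be illustrated with a minimal sketch. This is not the code of _run_behavioral_eval(), which is not shown on this page; the names snapshot and run_and_detect, the glob pattern, and the shape of the result dict are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def snapshot(root, pattern):
    """Return the set of files under root that currently match the glob pattern."""
    return set(Path(root).glob(pattern))


def run_and_detect(cmd, root, pattern, timeout):
    """Run cmd in root and report which artifacts appeared, even after a timeout.

    Hypothetical helper: a sketch of the snapshot/diff idea only.
    """
    before = snapshot(root, pattern)
    timed_out = False
    exit_code = None
    try:
        proc = subprocess.run(
            cmd, cwd=root, capture_output=True, text=True, timeout=timeout
        )
        exit_code = proc.returncode
    except subprocess.TimeoutExpired:
        # The child may have written its artifacts before the timeout fired,
        # so fall through to the diff instead of reporting a failure here.
        timed_out = True

    created = sorted(snapshot(root, pattern) - before)

    # Remove the new artifacts so the next task starts from a clean
    # "before" snapshot.
    for path in created:
        path.unlink(missing_ok=True)

    return {
        "triggered": bool(created),
        "timed_out": timed_out,
        "exit_code": exit_code,
        "created": [p.name for p in created],
    }
```

Under these assumptions, a command that writes adr/demo.md and then overruns its timeout still yields triggered=True, which is the false-FAIL case the commit message describes.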

19 files changed: +1899 -517 lines changed

agents/INDEX.json

Lines changed: 115 additions & 20 deletions
@@ -115,23 +115,6 @@
       "complexity": "Medium",
       "category": "meta"
     },
-    "golang-general-engineer-compact": {
-      "file": "golang-general-engineer-compact.md",
-      "short_description": "Use this agent for focused Go development with tight context budgets",
-      "triggers": [
-        "go",
-        "golang",
-        "tight context",
-        "compact",
-        "focused go"
-      ],
-      "pairs_with": [
-        "go-pr-quality-gate",
-        "go-testing"
-      ],
-      "complexity": "Medium-Complex",
-      "category": "language"
-    },
     "golang-general-engineer": {
       "file": "golang-general-engineer.md",
       "short_description": "Use this agent when you need expert assistance with Go development, including implementing features,\ndebugging issues, reviewing code quality, optimizing performance, or answering technical questions\nabout Go codebases",
@@ -151,6 +134,23 @@
       "complexity": "Medium-Complex",
       "category": "language"
     },
+    "golang-general-engineer-compact": {
+      "file": "golang-general-engineer-compact.md",
+      "short_description": "Use this agent for focused Go development with tight context budgets",
+      "triggers": [
+        "go",
+        "golang",
+        "tight context",
+        "compact",
+        "focused go"
+      ],
+      "pairs_with": [
+        "go-pr-quality-gate",
+        "go-testing"
+      ],
+      "complexity": "Medium-Complex",
+      "category": "language"
+    },
     "hook-development-engineer": {
       "file": "hook-development-engineer.md",
       "short_description": "Use this agent when developing Python hooks for Claude Code's event-driven system",
@@ -171,6 +171,34 @@
       "complexity": "Comprehensive",
       "category": "meta"
     },
+    "kotlin-general-engineer": {
+      "file": "kotlin-general-engineer.md",
+      "short_description": "Use this agent when you need expert assistance with Kotlin development, including implementing features, debugging issues, reviewing code quality, optimizing coroutine usage, or answering technical questions about Kotlin codebases",
+      "triggers": [
+        "kotlin",
+        "ktor",
+        "koin",
+        "coroutine",
+        "suspend fun",
+        "kotlin flow",
+        "StateFlow",
+        "kotest",
+        "mockk",
+        "gradle-kts",
+        "detekt",
+        "ktlint",
+        "ktfmt",
+        "android kotlin",
+        "kotlin-multiplatform"
+      ],
+      "pairs_with": [
+        "systematic-debugging",
+        "verification-before-completion",
+        "systematic-code-review"
+      ],
+      "complexity": "Medium-Complex",
+      "category": "language"
+    },
     "kubernetes-helm-engineer": {
       "file": "kubernetes-helm-engineer.md",
       "short_description": "Use this agent for Kubernetes and Helm deployment management, troubleshooting, and cloud-native infrastructure",
@@ -354,6 +382,38 @@
       "complexity": "Medium-Complex",
       "category": "development"
     },
+    "php-general-engineer": {
+      "file": "php-general-engineer.md",
+      "short_description": "Use this agent when you need expert assistance with PHP development, including implementing features, debugging issues, reviewing code quality, enforcing security posture, or answering technical questions about PHP codebases",
+      "triggers": [
+        "php",
+        "laravel",
+        "symfony",
+        "composer",
+        "artisan",
+        "eloquent",
+        "blade",
+        "twig",
+        "phpunit",
+        "pest",
+        "psr-12",
+        "psr standards",
+        "hybris",
+        "sapcc",
+        ".php files",
+        "doctrine",
+        "php-cs-fixer",
+        "phpstan",
+        "psalm"
+      ],
+      "pairs_with": [
+        "systematic-debugging",
+        "verification-before-completion",
+        "systematic-code-review"
+      ],
+      "complexity": "Medium-Complex",
+      "category": "language"
+    },
     "pipeline-orchestrator-engineer": {
       "file": "pipeline-orchestrator-engineer.md",
       "short_description": "Use this agent when building new pipelines that require coordinated creation\nof agents, skills, and hooks",
@@ -792,7 +852,7 @@
     },
     "reviewer-meta-process": {
       "file": "reviewer-meta-process.md",
-      "short_description": "Meta-analysis of system design decisions \u2014 examines whether the SYSTEM ITSELF is creating\nproblems",
+      "short_description": "Meta-analysis of system design decisions examines whether the SYSTEM ITSELF is creating\nproblems",
       "triggers": [
         "meta-process review",
         "system design review",
@@ -907,7 +967,7 @@
         "hot paths",
         "N+1 queries",
         "allocations",
-        "O(n\u00b2)",
+        "O(n²)",
         "caching",
         "slow code",
         "performance optimization"
@@ -1083,6 +1143,41 @@
       "complexity": "Medium",
       "category": "language"
     },
+    "swift-general-engineer": {
+      "file": "swift-general-engineer.md",
+      "short_description": "Use this agent when you need expert assistance with Swift development, including implementing features for iOS, macOS, watchOS, tvOS, visionOS, or server-side Swift, debugging issues, reviewing code quality, or answering technical questions about Swift codebases",
+      "triggers": [
+        "swift",
+        "ios",
+        "macos",
+        "xcode",
+        "swiftui",
+        "uikit",
+        "appkit",
+        "watchos",
+        "tvos",
+        "visionos",
+        "vapor",
+        "spm",
+        "swift-package-manager",
+        "swiftlint",
+        "swiftformat",
+        "xctest",
+        "swift-testing",
+        "swift actor",
+        "swift sendable",
+        "swift-combine",
+        "swiftdata",
+        "coredata"
+      ],
+      "pairs_with": [
+        "systematic-debugging",
+        "verification-before-completion",
+        "systematic-code-review"
+      ],
+      "complexity": "Medium-Complex",
+      "category": "language"
+    },
     "system-upgrade-engineer": {
       "file": "system-upgrade-engineer.md",
       "short_description": "Use this agent for systematic upgrades to the agent/skill/hook ecosystem when\nClaude Code ships updates, user goals change, or retro learnings accumulate",
@@ -1236,4 +1331,4 @@
       "category": "language"
     }
   }
-}
+}
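
The fix(index) entry in the commit message describes agents that existed on disk but were missing from agents/INDEX.json, plus an ordering bug. A check along the following lines could catch both conditions. It is a sketch under stated assumptions: the helper names find_unregistered and keys_are_sorted are hypothetical, the entries argument is assumed to be the mapping of agent name to an object with a "file" key (as in the diff above), and every *.md file in the agent directory is assumed to be an agent definition.

```python
from pathlib import Path


def find_unregistered(agent_dir, entries):
    """Return names of *.md files in agent_dir that no index entry points to.

    entries is assumed to map agent name -> {"file": "<name>.md", ...}.
    """
    registered = {entry.get("file") for entry in entries.values()}
    on_disk = sorted(p.name for p in Path(agent_dir).glob("*.md"))
    return [name for name in on_disk if name not in registered]


def keys_are_sorted(entries):
    """Return True if the entry names appear in plain string sort order."""
    names = list(entries)
    return names == sorted(names)
```

With plain string sorting, "golang-general-engineer" sorts before "golang-general-engineer-compact", which matches the order the diff moves the entries into. A directory that also holds non-agent Markdown files (a README, for instance) would need an exclusion list that this sketch does not include.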
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
+#!/usr/bin/env python3
+# hook-version: 1.0.0
+"""
+PreToolUse:Agent Hook: Creation Protocol Enforcer
+
+Soft-warns when an Agent dispatch appears to be for a creation request
+but no ADR has been written yet this session (i.e. .adr-session.json
+does not exist or was last modified more than 900 seconds ago).
+
+This is a SOFT WARN — exit 0 only (never blocks).
+
+Detection logic:
+- Tool is Agent
+- tool_input["prompt"] contains creation keywords
+- .adr-session.json in project root either does not exist or is stale (>900s)
+
+Allow-through conditions:
+- Tool is not Agent
+- No creation keywords found in prompt
+- .adr-session.json exists and was modified within the last 900 seconds
+- ADR_PROTOCOL_BYPASS=1 env var
+"""
+
+import json
+import os
+import sys
+import time
+import traceback
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent / "lib"))
+from stdin_timeout import read_stdin
+
+_BYPASS_ENV = "ADR_PROTOCOL_BYPASS"
+_ADR_SESSION_FILE = ".adr-session.json"
+_STALENESS_THRESHOLD_SECONDS = 900
+
+_CREATION_KEYWORDS = [
+    "create",
+    "scaffold",
+    "build a new",
+    "build a ",
+    "add a new",
+    "add new",
+    "new agent",
+    "new skill",
+    "new pipeline",
+    "new hook",
+    "new feature",
+    "new workflow",
+    "new plugin",
+    "implement new",
+    "i need a ",
+    "i need an ",
+    "we need a ",
+    "we need an ",
+]
+
+_WARNING_LINES = [
+    "[creation-protocol-enforcer] Creation request detected but no recent ADR session found.",
+    "/do Phase 4 Step 0 requires: (1) Write ADR at adr/{name}.md, (2) Register via adr-query.py register, THEN dispatch agent.",
+    "If ADR was already written, set ADR_PROTOCOL_BYPASS=1 to suppress this warning.",
+]
+
+
+def _has_creation_keywords(prompt: str) -> bool:
+    """Return True if the prompt contains any creation keyword (case-insensitive)."""
+    lower = prompt.lower()
+    return any(kw in lower for kw in _CREATION_KEYWORDS)
+
+
+def _adr_session_is_recent(base_dir: Path) -> bool:
+    """Return True if .adr-session.json exists and was modified within the threshold."""
+    adr_session_path = base_dir / _ADR_SESSION_FILE
+    if not adr_session_path.exists():
+        return False
+    try:
+        mtime = os.path.getmtime(adr_session_path)
+        age = time.time() - mtime
+        return age <= _STALENESS_THRESHOLD_SECONDS
+    except OSError:
+        return False
+
+
+def main() -> None:
+    """Run the creation protocol enforcement check."""
+    debug = os.environ.get("CLAUDE_HOOKS_DEBUG")
+
+    raw = read_stdin(timeout=2)
+    try:
+        event = json.loads(raw)
+    except (json.JSONDecodeError, ValueError):
+        sys.exit(0)
+
+    # Filter: only act on Agent tool dispatches.
+    tool_name = event.get("tool_name", "")
+    if tool_name != "Agent":
+        sys.exit(0)
+
+    # Bypass env var.
+    if os.environ.get(_BYPASS_ENV) == "1":
+        if debug:
+            print(
+                f"[creation-protocol-enforcer] Bypassed via {_BYPASS_ENV}=1",
+                file=sys.stderr,
+            )
+        sys.exit(0)
+
+    tool_input = event.get("tool_input", {})
+    prompt = tool_input.get("prompt", "")
+    if not prompt:
+        sys.exit(0)
+
+    # Check for creation keywords.
+    if not _has_creation_keywords(prompt):
+        if debug:
+            print(
+                "[creation-protocol-enforcer] No creation keywords found — allowing through",
+                file=sys.stderr,
+            )
+        sys.exit(0)
+
+    # Resolve project root.
+    cwd_str = event.get("cwd") or os.environ.get("CLAUDE_PROJECT_DIR", ".")
+    base_dir = Path(cwd_str).resolve()
+
+    # Check whether a recent ADR session exists.
+    if _adr_session_is_recent(base_dir):
+        if debug:
+            print(
+                "[creation-protocol-enforcer] Recent .adr-session.json found — allowing through",
+                file=sys.stderr,
+            )
+        sys.exit(0)
+
+    # No recent ADR session — emit soft warning to stdout (context injection).
+    print("\n".join(_WARNING_LINES))
+    sys.exit(0)
+
+
+if __name__ == "__main__":
+    try:
+        main()
+    except SystemExit:
+        raise
+    except Exception as e:
+        if os.environ.get("CLAUDE_HOOKS_DEBUG"):
+            traceback.print_exc(file=sys.stderr)
+        else:
+            print(
+                f"[creation-protocol-enforcer] Error: {type(e).__name__}: {e}",
+                file=sys.stderr,
+            )
+        # Fail open — never exit non-zero on unexpected errors.
+        sys.exit(0)
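
The two checks in the hook above (substring keyword matching and the 900-second staleness test) can be exercised in isolation. The sketch below restates them outside the hook so it runs without lib/stdin_timeout. The function names, the shortened keyword list, and the now parameter are assumptions for illustration, not part of the commit.

```python
import os
import time
from pathlib import Path

# A subset of the hook's keyword list, copied here for illustration only.
KEYWORDS = ["create", "scaffold", "build a ", "new agent", "i need a "]
STALE_AFTER_SECONDS = 900  # same threshold the hook uses


def has_creation_keywords(prompt):
    """Case-insensitive substring match, mirroring the hook's approach."""
    lower = prompt.lower()
    return any(kw in lower for kw in KEYWORDS)


def session_is_recent(base_dir, now=None):
    """Return True if .adr-session.json exists and is at most 900 seconds old."""
    path = Path(base_dir) / ".adr-session.json"
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # Missing or unreadable file counts as "no recent session".
        return False
    now = time.time() if now is None else now
    return now - mtime <= STALE_AFTER_SECONDS
```

Because the match is a plain substring test, a prompt such as "Review the recreated index" also counts as a creation request ("recreated" contains "create"). For a warning that never blocks, that over-matching is probably an acceptable trade-off, but it is worth knowing when reading the hook's output.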
