From f50cc28a199b209093617b8cab6d8e4b0b661066 Mon Sep 17 00:00:00 2001 From: rdwj Date: Fri, 8 May 2026 07:53:42 -0500 Subject: [PATCH] retro: subagent-as-tool v1 shakedown MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit End-to-end retrospective on the 2026-05-08 session: triage of subagent- as-tool issues, design synthesis for the foundation underneath #163/#164/#166/#168 (#182), demo build (calculus-coordinator delegating to calculus-agent specialist via subagent-as-tool v1), three-model scorecard, and cluster cleanup. Champion model: Ministral 3 14B — only candidate with 6/6 delegation correctness. Granite 4.1 8B partial; Granite 4.0 H-Small blocked on a vLLM hermes-parser bug. Two framework bugs surfaced and filed (#185 observers drop subagent events, #186 transport URL doubling). Patterns to start: sanity-check load-bearing dependencies before declaring end-to-end works; file at the moment of diagnosis. Patterns to stop: batching CLAUDE.md updates at session-close; JSON-Patch index paths against positional arg arrays. Assisted-by: Claude Code (Opus 4.7) --- .../RETRO.md | 83 +++++++++++++++++++ 1 file changed, 83 insertions(+) create mode 100644 retrospectives/2026-05-08_subagent-as-tool-v1-shakedown/RETRO.md diff --git a/retrospectives/2026-05-08_subagent-as-tool-v1-shakedown/RETRO.md b/retrospectives/2026-05-08_subagent-as-tool-v1-shakedown/RETRO.md new file mode 100644 index 0000000..eb8a751 --- /dev/null +++ b/retrospectives/2026-05-08_subagent-as-tool-v1-shakedown/RETRO.md @@ -0,0 +1,83 @@ +# Retrospective: Subagent-as-tool v1 shakedown + +**Date:** 2026-05-08 +**Effort:** Triage of issues opened with subagent-as-tool, design synthesis for the foundation underneath #163/#164/#166/#168, end-to-end demo build (calculus-coordinator → calculus-agent specialist via subagent-as-tool), and three-model scorecard to pick a champion coordinator. 
+**Issues:** closed #165 (v1 retroactively); opened #179, #180, #181 (v2 follow-ups), #182 (Phase 0 foundation tracker), #184 (ad-hoc spawn_agent companion), #185, #186 (framework bugs found during shakedown) +**PRs:** agent-template #183, ui-template #31, examples #31 +**Commits (this work):** d2dea07, 639578b (agent-template); e5a6d79, df8d17c (ui-template); ee53ba8, 32fa895 (examples) + +## What We Set Out To Do + +The session opened as exploratory triage — "let's discuss our open issues, particularly the ones we opened today." It widened twice on user direction: + +1. After triaging #165, into design work on the #163/#164 cluster (Question tool + per-tool permission policy). +2. After "deploy to cluster, do all and only stop if you run into a problem," into build-and-ship: scaffold a coordinator, deploy the full stack, evaluate three models, scale down to a single L40S champion. + +The arc was triage → design → ship. Unusual to span all three in one session. + +## What Changed + +| Change | Type | Rationale | +|--------|------|-----------| +| Closed #165 entirely (vs commenting) | Good pivot | v1 was already shipped via PR #173. The PR body's "Closes the bulk of #165" is not a recognized closing keyword, so the issue stayed open. Carved cleanly into v2 issues #179/#180/#181. | +| Foundation tracker #182 + design doc | Good pivot (scope addition) | Noticed mid-discussion that #163/#164/#166/#168 share schema concerns. Pinning the contract once via design doc avoids re-spec'ing 4×. Stable message IDs, pending-state columns, fork lineage all defined once. | +| Compaction research via subagent | Good pivot | Findings shaped the design doc concretely (marker-based rolling summary, tool-call/result pairing as a documented LLM failure class, client-side enforcement). Research → architecture, not research → Slack message. | +| gpt-oss-20b → Ministral coordinator swap | Forced pivot | gpt-oss's RLHF prior overrode strict prompts on trivial calculus. 
Nemotron's vLLM tool-call parser is named `_no_streaming.py` and doesn't surface tool calls in streaming mode. Ministral was the third option. | +| Add Granite 4.0 H-Small + 4.1 8B to evaluation | Scope addition | External recommendation forwarded mid-session. Useful detour — produced reproducible scorecard methodology. | +| Deploy `calculus-helper` MCP into `calculus-mcp` namespace | Missed requirement caught late | Assumed the existing course-v11 calculus-agent was a working specialist. Its image expected an MCP namespace that didn't exist on the cluster. The specialist had been hallucinating. Discovery cost an iteration. | +| Coordinator nodeSelector to `g6e.4xlarge` | Good pivot | First reschedule landed Ministral on the OTHER team's eval-gpu node. Adding the nodeSelector ensured it could only land on gpu-cluster nodes. | + +## What Went Well + +- **Foundation caught before piecemeal implementation.** Design doc + #182 lock the schema contract once; #163/#164/#166/#168 reference it instead of re-spec'ing. +- **Research → design was tight.** Compaction research's findings ended up as actual contract decisions in the design doc with primary-source citations. +- **Scorecard methodology was reproducible.** `/tmp/scorecard.py` exists; raw timings, token counts via `stream_options.include_usage`, six prompts in fixed order. Champion picked on numbers, not vibes. The TTFC vs TTFT distinction (with Ministral conceptual at 42 ms) directly addressed the rounding-to-zero risk. +- **Multi-repo coordination clean.** Three PRs, each scoped to its own concern, no leaked changes between them. CLAUDE.md updates eventually landed on each repo's branch. +- **Two real bugs filed with reproductions.** #185 (observers drop subagent events), #186 (URL doubling). Concrete repro steps, proposed fixes. +- **Surgical cluster cleanup.** `cluster-api-delete-machine` annotation pattern + nodeSelector pin kept the eval-gpu scaledown from disturbing the parallel test on the same machineset. 
+- **Diagnostic flow once provoked.** Specialist hallucination was caught only after a user question, but the path from "did it really work?" → confirmed hallucination → traced to dead namespace → deployed missing MCP was tight (~10 min). + +## Gaps Identified + +| Gap | Type | Resolution | +|-----|------|------------| +| MCP backend assumed working without sanity check; specialist was hallucinating until the user provoked the question | Process problem | Mitigation: hit each load-bearing endpoint with a curl before declaring the end-to-end demo working. | +| Visual UI verification skipped after final Ministral relocation — curl-only confirmation | Fix now | User did the visual verification post-retro and reported "still some bugs" → tracked for next session, not blocking. | +| vLLM Granite 4.0 H-Small hermes parser double-encoding bug — diagnosed cleanly but no upstream issue filed | Owner: user | User will file separately. | +| fipsagents non-streaming `usage: null` bug — diagnosed but no fipsagents issue filed | Owner: user | User will file separately. | +| Three CLAUDE.md updates batched at session-close instead of landing with the work | Process slop | Fixed as part of session-close skill; pattern for next session: doc updates land in the same commit as the behavior change. | +| Six-revision prompt-engineering thrash on gpt-oss-20b before testing model behavior with a minimal harness | Process problem | Mitigation: when a model resists a strict prompt, run a minimal curl-with-tool-only harness first to bound model-can't vs prompt-can't. | +| Two off-by-one errors in `oc patch` JSON-Patch index paths against args arrays | One-off | Default to replacing the whole list when mutating positional args. | +| Demo MCP grounding fragile — depends on `calculus-mcp` namespace existing, no alerting | Documentation gap | Resolved in `examples/calculus-coordinator/CLAUDE.md` "Demo dependencies" section (commit 32fa895).
| +| Final system prompt lacked rationale for its strict shape | Documentation gap | Resolved by frontmatter comment in `prompts/system.md` (commit 32fa895). | +| Demo URL + helm `--set` reference values existed only in conversation | Discoverability | Resolved in `examples/calculus-coordinator/CLAUDE.md` "Demo dependencies" section (commit 32fa895). | +| "Diagnose-but-don't-file" pattern: #165's verbal closing keyword + two un-filed framework bugs treated as filed | Recurring pattern | Mitigation: open the issue in the same step as the diagnosis. Verbally agreeing to file is not a filed issue. | + +## Action Items + +- [x] Document demo dependencies in `calculus-coordinator/CLAUDE.md` (commit 32fa895, examples PR #31) +- [x] Comment system-prompt rationale in `prompts/system.md` (commit 32fa895, examples PR #31) +- [ ] Visual verification reported "still some bugs" — debug in next session. +- [ ] User to file vLLM Granite 4.0 H-Small parser bug separately. +- [ ] User to file fipsagents non-streaming `usage: null` bug separately. + +## Patterns + +**Start:** + +- Sanity-check load-bearing dependencies (specialist endpoints, MCP backends, model availability) with one curl before declaring an end-to-end demo working. Cheaper than discovering a hallucinating specialist after a scorecard run. +- File issues at the moment of diagnosis. "Verbal commitment to file" is the failure mode that produced #165 staying open + two un-filed framework bugs in this session. +- Visual UI verification on every UI-touching demo, even when curl-equivalent SSE confirms the wire is correct. +- When a model resists a strict prompt, isolate "model can't" from "prompt can't" with a 30-second curl harness against the model directly. + +**Stop:** + +- Batching CLAUDE.md updates at session-close. Doc landings happen with the work. +- Patching positional-arg arrays via JSON-Patch index paths. Replace the whole list when mutating it. 
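That last stop item is easiest to see concretely. A minimal sketch, using a toy RFC 6902 replace-only apply in stdlib Python; the deployment-shaped spec and arg values below are illustrative, not this session's actual `oc patch` targets:

```python
import copy

def apply_patch(doc, ops):
    """Apply a minimal subset of RFC 6902 (replace ops only) to nested dicts/lists."""
    doc = copy.deepcopy(doc)
    for op in ops:
        *parents, leaf = op["path"].strip("/").split("/")
        target = doc
        for key in parents:
            target = target[int(key)] if isinstance(target, list) else target[key]
        idx = int(leaf) if isinstance(target, list) else leaf
        target[idx] = op["value"]
    return doc

spec = {"containers": [{"args": ["--model", "old-model", "--port", "8000"]}]}

# Fragile: an index path targets whatever happens to sit at that slot.
# If the args array gains or loses an entry, index 1 silently hits the
# wrong flag -- the shape of the two off-by-one errors this session.
by_index = apply_patch(spec, [
    {"op": "replace", "path": "/containers/0/args/1", "value": "new-model"},
])

# Safer default: replace the whole list atomically.
whole_list = apply_patch(spec, [
    {"op": "replace", "path": "/containers/0/args",
     "value": ["--model", "new-model", "--port", "8000"]},
])

print(by_index["containers"][0]["args"])
# ['--model', 'new-model', '--port', '8000'] -- same result here, but
# only while the index assumption holds
```

Whole-list replacement bakes no assumption about element positions into the patch; index paths do, and `oc patch` will happily apply them to the wrong element.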
+ +**Continue:** + +- Foundation-design-before-piecemeal-implementation when a cluster of issues shares a contract surface. The payoff exceeds the design-doc cost. +- Reproducible scorecard methodology (raw timings, captured tokens via `stream_options.include_usage`, fixed prompt set). Generalizable to future model evaluations. +- Surgical cluster operations (annotation-driven machine deletion, node-affinity pinning) instead of "scale and hope" patterns. +- Cross-linking issues + PRs with prose explaining *why*, not just `Closes #N` keywords (which several models in this evaluation also fumble — turns out humans can too).
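One sketch worth keeping alongside the scorecard pattern: the TTFT vs TTFC split over streaming chunks. The event shapes below follow the OpenAI-style streaming delta format, but the stream is simulated; this is not the actual `/tmp/scorecard.py`:

```python
def first_times(events):
    """Return (ttft, ttfc): elapsed seconds to the first content token and
    to the first tool-call delta in an OpenAI-style streaming response.
    `events` yields (elapsed_seconds, chunk_dict) pairs."""
    ttft = ttfc = None
    for elapsed, chunk in events:
        delta = chunk.get("choices", [{}])[0].get("delta", {})
        if ttft is None and delta.get("content"):
            ttft = elapsed
        if ttfc is None and delta.get("tool_calls"):
            ttfc = elapsed
    return ttft, ttfc

# Simulated stream: the model emits a tool call before any prose, so
# TTFC lands well ahead of TTFT -- the distinction that keeps a fast
# delegator from rounding to zero on a TTFT-only metric.
stream = [
    (0.042, {"choices": [{"delta": {"tool_calls": [{"id": "call_1"}]}}]}),
    (0.310, {"choices": [{"delta": {"content": "The derivative of"}}]}),
]
print(first_times(stream))  # (0.31, 0.042)
```

Tracking the two separately is what made the "conceptual at 42 ms" reading of Ministral meaningful rather than a measurement artifact.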