A defense-in-depth agent framework for Claude Code — built around one obsession: proving that work is actually done, not that it merely looks done.
87 subagents · 16 vendored skills · /agileteam v3 orchestrator · /concilium four-body council · Reality-Ledger QA · empirically benchmarked
#AIAgents #AgenticAI #ClaudeCode #MultiAgentSystems #AgentEngineering #TDD #AutonomousCoding #DefenseInDepth #LLMOps
Then we shipped the honest result anyway. That is Plumbline.
▶ Live demo · explore all 87 agents in your browser, nothing to install
A plumb line is the oldest tool humanity has for checking whether something is truly straight — not whether it looks straight. You hang a weight on a string, and gravity gives you one honest reference that never lies.
That is exactly what this framework is for. It was born from a real failure: a feature whose tests were all green, yet the actual integration was a no-op — "tests pass" had been mistaken for "it works." Plumbline exists to hold every piece of agent-produced work against one honest reference: does it hang true?
In carpentry, "true" means both correct and perfectly plumb. That double meaning is the whole philosophy in one word.
Plumbline is an end-to-end product-building team framework. Its core invariant is staying true to confirmed human customer value across every quality gate. It does not treat green tests, completed tasks, or agent consensus as enough. Every gate must re-check whether the work remains real, useful, usable, production-grounded, and aligned with the user's confirmed Product Vision. In one line: Plumbline does not optimize for finishing. Plumbline optimizes for staying true to confirmed human customer value; finishing is valid only when the line remains true.
Most agent collections are prompt libraries. Plumbline is a prompt library plus a measured, falsifiable claim about agent quality — and we did the experiments to back it.
We suspected a clever QA prompt ("always check you reached the real boundary, not a fake") would make agents catch the "green-but-broken" class of bug. So we built mutation-oracle benchmarks: give two agent variants the same task, let them write tests, then secretly sabotage the code and count which tests turn red (caught) vs. stay green (escaped). Deterministic. No vibes.
Across four independently-designed oracle corpora (metrics/corpus/), the honest result surprised us:
| What we measured | Result |
|---|---|
| QA prompt-discipline at the test-planning stage | 5× recall at equal precision — real, kept |
| Same discipline at the build-and-test stage | outcome-neutral — the act of building already forces the agent to look |
| The decisive "provided-fake" trap (mirrors the original incident) | only Opus catches it (0/3 escaped); with Sonnet and Haiku it escapes 3/3 |
The lesson: whether an agent's tests reach reality is governed by model capability, not prompt cleverness. A stronger prompt cannot give a weaker model that judgment. This is documented end-to-end — including the bugs the instrument caught in itself — in metrics/SUMMARY-2026-05-30-dna-investigation.md.
That intellectual honesty — measuring our own framework instead of marketing it — is the spirit of Plumbline.
- Reality Ledger: every requirement carries an evidence class (`unit-fake` → `integration-fake` → `real-boundary-smoke` → `production-verified`). Anything touching I/O, a remote, an external API, or a UI that stays `*-fake` is RED regardless of green tests, and that RED cannot be silently downgraded.
- Wired-in-prod check: a feature with a real implementation but no test through the production composition root is not satisfiable. The two costliest real-world misses ("exists in tests, never composed in prod") die here.
- "Kritische semantische Glättung" ("critical semantic smoothing"): a cheap, gated 3-beat QA reflex (thesis → counter-thesis → the one test that kills it) that fires only on genuine boundary features, never crying wolf on pure logic.
Plumbline even ships its own honesty as commands: /honest-status (separate looks done from is done, including what's unverified) and /bench-oracle (measure a change with a deterministic mutation oracle instead of asserting it works). The framework holds itself to its own plumb line.
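The Reality Ledger and wired-in-prod rules described above can be read as a small decision function. The sketch below is an illustration only: the `LedgerEntry` fields, the `status` function, and the `GREEN` label are assumptions made for this example; only the four evidence classes and the RED rule come from the description above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Evidence(IntEnum):
    """Evidence classes, ordered from weakest to strongest."""
    UNIT_FAKE = 1
    INTEGRATION_FAKE = 2
    REAL_BOUNDARY_SMOKE = 3
    PRODUCTION_VERIFIED = 4


@dataclass(frozen=True)
class LedgerEntry:
    # Field names are hypothetical; they are not the framework's schema.
    req_id: str
    evidence: Evidence
    touches_boundary: bool  # I/O, a remote, an external API, or a UI
    tests_green: bool
    wired_in_prod: bool     # tested through the production composition root


def status(entry: LedgerEntry) -> str:
    """Return "RED" or "GREEN" for one requirement."""
    if not entry.tests_green:
        return "RED"
    # Green tests are not enough: boundary work backed only by fakes stays RED.
    if entry.touches_boundary and entry.evidence < Evidence.REAL_BOUNDARY_SMOKE:
        return "RED"
    # Assumption for this sketch: the wired-in-prod check applies to every entry.
    if not entry.wired_in_prod:
        return "RED"
    return "GREEN"
```

A boundary-touching requirement with `INTEGRATION_FAKE` evidence returns `"RED"` here even when `tests_green` is true, which is the "looks done" versus "is done" distinction in miniature.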
| Area | Count | Purpose |
|---|---|---|
| `core/` | 5 | Base roles: coder, planner, researcher, reviewer, tester |
| `agileteam/` | 6 | /agileteam v3 workflow roles: requirements, spec-audit, PO, security, retro, context |
| `github/` | 13 | PR / issue / release / repo / workflow / multi-repo automation |
| `flow-nexus/` | 9 | Platform agents: sandbox, swarm, workflow, auth, payments, neural, … |
| `templates/` | 9 | Reusable agent templates and scaffolds |
| `consensus/` | 7 | Distributed-systems patterns: Byzantine, Raft, Gossip, CRDT, Quorum, … |
| `hive-mind/` | 5 | Queen / worker / scout / memory collective-intelligence patterns |
| `optimization/` | 5 | Performance, topology, resources, load-balancing, benchmarking |
| `sparc/` | 4 | SPARC phases: specification, pseudocode, architecture, refinement |
| `swarm/` | 3 | Swarm topologies: adaptive, hierarchical, mesh |
| `goal/`, `reasoning/`, `testing/` | 6 | GOAP planners, reasoning variants, TDD-London + production validation |
| domain specialists | 8 | analysis, architecture, ML, backend, CI/CD, API-docs, neural, mobile |
| `concilium/` | 4 | Four-body idea+team council: market-realist · tech-arbiter · skeptic · distribution-realist |
| `config/claude/skills/` | 16 | Vendored skills so workflows stay portable without external packs |
| `config/claude/commands/` | 7 | /agileteam, /agileteam-bench, /concilium, /honest-status, /bench-oracle, /reflect, /reflect-skills |
Browse them all visually in the Agent Explorer (see below).
Screenshots: colour-coded categories with instant filtering; per-agent detail with tools, triggers, and source link.
agent-explorer.html is a self-contained, dependency-free snapshot of the whole
collection — a dark terminal-style UI with colour-coded categories, full-text search
over names/tools/keywords, schema filters, and a per-agent detail drawer that links
straight to the source on GitHub. Try the live demo →
or open agent-explorer.html in any browser; nothing to install.
Regenerate it after editing agents:
```bash
./build-explorer.sh   # re-extracts frontmatter → rebuilds the bundle + docs/index.html (the live demo)
```

`/agileteam <feature>` orchestrates a full delivery pipeline of independent agents. The governing stance: there is no "100% safe" (Rice's theorem), so chain many diverse, independent checks, such that a defect would have to survive several uncorrelated gates.
- Product Canvas: a mandatory upstream value-alignment gate covering problem, target user, value proposition, success signal, core use case, non-goals, risks, and evidence needed. It is saved to `docs/canvas/<feature>.canvas.md` and explicitly user-confirmed before the PRD is finalized or development starts (no agent may self-confirm it).
- Requirements: PRD, REQ-IDs, acceptance criteria, traceability matrix
- Spec sanity: ultrathink + confabulation audit (claim-provenance check)
- Planning: architecture, atomic tasks, sequence
- TDD loop: coder writes the failing test first, then minimal impl
- Independent review: reviewer sees diff + spec, never the coder's reasoning
- Security review: SAST / deps / secrets / threat + injection surface
- Validation: per-REQ pass/fail against the matrix, with evidence
- Judgment gate (product-owner): did we build the right thing?
- Human acceptance: sign-off stays explicitly human
- Retro / learning loop: process improvements, persisted only under guardrails
Independence invariant: whoever writes code does not review it; whoever derives tests does not implement them.
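The independence invariant lends itself to a mechanical check before a run starts. The following is a hedged sketch, not the orchestrator's actual logic: the role names, the `MUST_DIFFER` table, and `independence_violations` are hypothetical and exist only to show the shape of the rule.

```python
from typing import Dict, List, Tuple

# Pairs of roles that must never be filled by the same agent.
# Role names are illustrative, not the framework's identifiers.
MUST_DIFFER: List[Tuple[str, str]] = [
    ("coder", "reviewer"),             # whoever writes code does not review it
    ("test_deriver", "implementer"),   # whoever derives tests does not implement them
]


def independence_violations(assignment: Dict[str, str]) -> List[str]:
    """Return one message per role pair that shares an agent."""
    problems: List[str] = []
    for first, second in MUST_DIFFER:
        if first in assignment and second in assignment:
            if assignment[first] == assignment[second]:
                problems.append(
                    f"{first} and {second} are both '{assignment[first]}'"
                )
    return problems
```

An empty list means the assignment respects the invariant; any message is a reason to refuse the run rather than to proceed with a warning.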
Per the benchmark above, the reach-the-real-boundary judgment lives in model
capability. The orchestrator therefore defaults all roles to your session model
(/model), discloses once at run start that the GBrain-class safety net on the
checking gates is only guaranteed on Opus, and — only if you opt in — dispatches just
those five gates on Opus. No silent up- or down-grading. (We also verified that
per-agent model: frontmatter is not applied by the current Claude Code runtime;
only an explicit dispatch parameter takes effect — so control lives in the
orchestrator, transparently.)
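The dispatch policy in the paragraph above can be summarised as a pure function. This is a sketch under assumptions: `resolve_models`, its parameters, and the `"opus"` placeholder string are hypothetical, and the set of checking gates is passed in by the caller because this section does not enumerate the five gates.

```python
from typing import Dict, Iterable


def resolve_models(
    roles: Iterable[str],
    session_model: str,
    checking_gates: Iterable[str],
    opus_opt_in: bool = False,
    opus_model: str = "opus",  # placeholder identifier, an assumption
) -> Dict[str, str]:
    """Map each role to the model it would be dispatched on.

    Default: every role runs on the session model.
    Opt-in: only the checking gates are moved to the Opus model.
    """
    gates = set(checking_gates)
    plan: Dict[str, str] = {}
    for role in roles:
        if opus_opt_in and role in gates:
            plan[role] = opus_model
        else:
            plan[role] = session_model
    return plan
```

Because the plan is computed up front and returned as plain data, it can be printed at run start, which keeps any model change visible rather than silent.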
| Mode | Goal | Self-modification |
|---|---|---|
| `core` (default) | Safe, runnable baseline | None: learnings stay human-gated |
| `full` | Autonomous evolution (canary + auto-revert) | Only once a `metrics/runs.jsonl` baseline exists |
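The self-modification rule for the two modes could be guarded by a check like the one below. It is a minimal sketch: `self_modification_allowed` is a hypothetical helper, and treating "a baseline exists" as "the file has at least one non-empty line" is an assumption of this example.

```python
from pathlib import Path


def self_modification_allowed(mode: str, repo_root: Path) -> bool:
    """Return True only in full mode with a non-empty metrics/runs.jsonl."""
    if mode not in ("core", "full"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "core":
        return False  # core mode: learnings stay human-gated
    baseline = repo_root / "metrics" / "runs.jsonl"
    if not baseline.is_file():
        return False
    # Assumption: one recorded run is enough to count as a baseline.
    with baseline.open(encoding="utf-8") as handle:
        return any(line.strip() for line in handle)
```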
```bash
git clone https://github.com/DYAI2025/Plumbline plumbline
cd plumbline
./config/claude/install.sh   # symlinks repo → ~/.claude/agents, installs commands/skills/hook
                             # add --copy on Windows / if you prefer copies
```

Then, in any project:

```text
/agileteam "add OAuth2 login with refresh-token rotation"
```

Requirements: git, bash, python3, and jq (for hook registration). Full portability, web-session bootstrap, and per-project gate tooling are covered in SETUP.md.
```bash
# validate every agent's frontmatter (parse errors / missing description / duplicate names)
bash config/claude/tests/run_all.sh
```

The CI suite checks frontmatter, metrics scripts, settings JSON, the stop-hook, the web bootstrap, and (if installed) shell scripts via shellcheck.
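To make the three frontmatter checks concrete (parse errors, missing description, duplicate names), here is a small standard-library sketch. It is not the code behind `run_all.sh`: `parse_frontmatter` and `validate_agents` are hypothetical, and the parser is deliberately simplified to top-level `key: value` pairs instead of full YAML.

```python
from pathlib import Path
from typing import Dict, List, Optional


def parse_frontmatter(text: str) -> Optional[Dict[str, str]]:
    """Parse top-level `key: value` pairs between leading '---' fences.

    Returns None when no well-formed frontmatter block is found.
    Simplification: nested values and list items are skipped, not parsed.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        if not line.strip() or line[0] in " \t#-":
            continue  # blank, comment, indented, or list line
        key, sep, value = line.partition(":")
        if not sep:
            return None
        fields[key.strip()] = value.strip()
    return None  # closing fence is missing


def validate_agents(root: Path) -> List[str]:
    """Collect parse errors, missing descriptions, and duplicate names."""
    errors: List[str] = []
    seen: Dict[str, Path] = {}
    for path in sorted(root.rglob("*.md")):
        meta = parse_frontmatter(path.read_text(encoding="utf-8"))
        if meta is None:
            errors.append(f"{path}: frontmatter parse error")
            continue
        if not meta.get("description"):
            errors.append(f"{path}: missing description")
        name = meta.get("name", "")
        if name and name in seen:
            errors.append(f"{path}: duplicate name '{name}' (also in {seen[name]})")
        elif name:
            seen[name] = path
    return errors
```

Calling `validate_agents(Path("core"))` would return a list of human-readable problems, with an empty list meaning that all three checks passed for that directory.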
```text
.
├── core/                       # coder, planner, researcher, reviewer, tester
├── agileteam/                  # /agileteam workflow roles
├── github/ swarm/ hive-mind/   # automation + coordination agents
├── consensus/ sparc/ …         # distributed-systems + methodology agents
├── config/claude/commands/     # slash commands (/agileteam, /reflect, …)
├── config/claude/skills/       # 16 vendored fallback skills
├── config/claude/hooks/        # SessionStart + learning-loop Stop hook
├── metrics/                    # the benchmark corpora + the honest write-ups
├── explorer/                   # source for agent-explorer.html
├── docs/                       # /agileteam spec v3 + governance
├── README.md SETUP.md CLAUDE.md
```
- Evidence over vibes: claims must be backed by code, tests, logs, or an explicit assumption; missing tooling is marked `MISSING`, never fantasised as passing.
- Roles stay sharp: a good agent has one crisp job, not a generic "do everything" identity.
- Independence matters: review, test, security and product judgment must not just echo the coder's perspective.
- Human gates stay: especially for requirements, product decisions, and persistent self-improvement.
- Version prompts like code: every agent change gets a diff, review, and validation.
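The "marked `MISSING`, never fantasised as passing" principle can be shown with a tiny wrapper around a tool-backed check. This is an illustrative sketch: `run_gate` and the `PASS` / `FAIL` labels are assumptions of this example; only the `MISSING` label comes from the principle above.

```python
import shutil
import subprocess
from typing import List


def run_gate(command: List[str]) -> str:
    """Return "PASS", "FAIL", or "MISSING" for one tool-backed check."""
    if shutil.which(command[0]) is None:
        # Tool not installed: report that fact instead of assuming success.
        return "MISSING"
    completed = subprocess.run(command, capture_output=True, text=True)
    return "PASS" if completed.returncode == 0 else "FAIL"
```

For example, `run_gate(["shellcheck", "install.sh"])` would yield `"MISSING"` on a machine without shellcheck, so a report can separate "not checked" from "checked and clean".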
MIT © 2026 DYAI2025.
The agent base is derived in part from Claude Flow by ruvnet (MIT, © ruvnet) — the repo path ruvnet/claude-flow now points to ruvnet/ruflo. Keep this attribution and the MIT notice when redistributing forks or major rewrites.
Plumbline — if you only need a single prompt, this is overkill. If you want to build, inspect, and evolve auditable agent systems that prove they hang true: welcome to the machine room.
#AIEngineering #AgentOrchestration #PromptEngineering #AutonomousAgents #CollectiveIntelligence #AgenticWorkflow #ClaudeAgents #FutureOfSoftwareDevelopment


