diff --git a/agents-multi-repo.md b/agents-multi-repo.md
new file mode 100644
index 0000000..cb268da
--- /dev/null
+++ b/agents-multi-repo.md
@@ -0,0 +1,120 @@
+# Agents for Multi-Repo Changes
+
+> Documents the new agents introduced to handle multi-repo PRDs, what each one does, and how they integrate into the existing eng-team workflow.
+
+---
+
+## Updated Workflow
+
+```
+Orchestrator
+ └── Tech Lead (multi-repo plan + blast radius analysis)
+ └── Contract Agent (locks interface changes before anyone implements)
+ └── Engineer × N (parallel, one per repo, gated on contract finalization)
+ └── Reviewer (cross-repo, reads all diffs together)
+```
+
+The Orchestrator drives the full pipeline. The core sequence — Tech Lead → Engineer → Reviewer — stays the same. Two things change: the Contract Agent is inserted between planning and implementation, and the Engineer step becomes parallel across repos.
+
+---
+
+## Existing Agents — What Changes
+
+### Orchestrator (extended, not replaced)
+
+The Orchestrator already handles sequencing and agent coordination. For multi-repo changes it needs two behavioral additions:
+
+- **Parallel engineer dispatch** — spins up one Engineer agent per repo rather than always one
+- **Contract gate enforcement** — Engineer agents that depend on a contract change are blocked from starting until the Contract Agent has finalized and published the contract artifact
+
+No new agent is needed here. This is a logic and configuration extension.
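
The two Orchestrator additions can be sketched with `asyncio`, assuming each agent runs as a coroutine. The function names (`run_contract_agent`, `run_engineer`, `dispatch`) and the log strings are hypothetical and only illustrate the gating; they are not the eng-team implementation.

```python
import asyncio


async def run_contract_agent(gate: asyncio.Event, log: list) -> None:
    # Hypothetical stand-in for the Contract Agent: publish, then open the gate.
    await asyncio.sleep(0)  # placeholder for the real contract work
    log.append("contract:published")
    gate.set()


async def run_engineer(repo: str, needs_contract: bool,
                       gate: asyncio.Event, log: list) -> None:
    # Only contract-dependent engineers wait; independent ones start at once.
    if needs_contract:
        await gate.wait()
    log.append(f"engineer:start:{repo}")


async def dispatch(repos: dict) -> list:
    # repos maps repo name -> whether that repo depends on the contract change.
    log = []
    gate = asyncio.Event()
    jobs = [run_engineer(name, dep, gate, log) for name, dep in repos.items()]
    if any(repos.values()):
        jobs.append(run_contract_agent(gate, log))
    await asyncio.gather(*jobs)
    return log
```

Calling `asyncio.run(dispatch({"billing": True, "docs": False}))` starts the `docs` engineer without waiting, while the `billing` engineer starts only after the contract is published.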
+
+### Tech Lead (extended)
+
+In addition to its existing responsibilities, the Tech Lead must:
+
+- Read the **system-level architecture document** (service inventory, dependency graph, cross-cutting conventions) before writing the plan
+- Identify which service boundaries the PRD crosses
+- Produce a **cross-repo plan artifact** that maps each acceptance criterion to a specific repo and lists every file expected to change per repo
+- Flag any public API surface, event schema, or shared type that will be modified — this is the signal that triggers the Contract Agent
+
+The plan artifact is the Orchestrator's input for deciding whether to invoke the Contract Agent and how many Engineer agents to spin up.
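
One possible shape for the cross-repo plan artifact, as a minimal sketch. The class and field names (`RepoPlan`, `CrossRepoPlan`, `interface_changes`) are assumptions made for illustration; the document does not define a schema.

```python
from dataclasses import dataclass, field


@dataclass
class RepoPlan:
    repo: str
    acceptance_criteria: list          # criteria mapped to this repo
    files_to_change: list              # every file expected to change
    # Flagged public API / event schema / shared type changes (hypothetical field).
    interface_changes: list = field(default_factory=list)


@dataclass
class CrossRepoPlan:
    prd_id: str
    repos: list

    def needs_contract_agent(self) -> bool:
        # Any flagged interface change is the signal that triggers the Contract Agent.
        return any(plan.interface_changes for plan in self.repos)

    def engineer_count(self) -> int:
        # One Engineer agent per repo in the plan.
        return len(self.repos)
```

With this shape the Orchestrator needs only `needs_contract_agent()` and `engineer_count()` to make its two decisions.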
+
+### Engineer (extended)
+
+Each Engineer agent operates on a single repo, same as before. The changes are:
+
+- Multiple instances run in parallel, one per repo
+- Instances that depend on a contract change receive the finalized contract artifact as additional context before starting
+- Instances working on independent repos (no shared contract dependency) start immediately in parallel
+
+### Reviewer (extended)
+
+The Reviewer receives all diffs across all repos simultaneously and adds three cross-repo checks to its existing checklist:
+
+- Does Service B's usage of the new API match what Service A implemented?
+- Is there any service that calls the changed interface that was not included in the plan?
+- Are all contract changes backward compatible, or is there a coordinated breaking change with an explicit migration plan?
+
+The Reviewer requires the dependency graph from the system-level architecture document to know which services to check — it cannot discover blast radius from diffs alone.
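
The "missed consumer" check reduces to a set difference over the dependency graph. The sketch below assumes the graph is available as a mapping from each service to the services that call it; that representation and the function name are hypothetical.

```python
def missed_consumers(dependency_graph: dict,
                     changed_services: set,
                     planned_services: set) -> set:
    """Return services that call a changed interface but are absent from the plan.

    dependency_graph maps a service name to the set of services that consume it
    (an assumed representation of the system-level dependency graph).
    """
    consumers = set()
    for service in changed_services:
        consumers |= dependency_graph.get(service, set())
    return consumers - planned_services
```

A non-empty result is exactly the case the diffs alone cannot reveal: a consumer with no diff at all.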
+
+---
+
+## New Agent: Contract Agent
+
+### When it is invoked
+
+The Orchestrator invokes the Contract Agent when either of the following is true:
+
+- The PRD touches more than one repo and the Tech Lead's plan flags a change to a contract those repos share
+- The PRD touches a single repo but the Tech Lead's plan flags a change to a public API surface, event schema, or shared library interface
+
+For purely internal single-repo changes (business logic, UI, infra config with no public interface change) and for multi-repo changes to independent services with no shared contract, the Contract Agent is skipped entirely.
+
+### What it does
+
+The Contract Agent owns the interface boundary between services. Its job is to produce a **contract artifact** — a precise, versioned definition of what is changing at the service boundary — and lock it before any Engineer agent starts implementing.
+
+Specifically it:
+
+1. Reads the Tech Lead's plan artifact and the current interface definitions (OpenAPI specs, proto files, shared types, event schemas) for all affected services
+2. Produces a diff of the contract change — what is being added, modified, or removed at the interface boundary
+3. Checks backward compatibility — flags breaking changes and requires an explicit migration or versioning plan if any exist
+4. Publishes the finalized contract artifact so downstream Engineer agents can use it as a source of truth
+5. Blocks the Orchestrator from starting any dependent Engineer agent until the artifact is published
+
+### What it does not do
+
+- It does not write implementation code
+- It does not modify business logic
+- It does not review the final diffs — that is the Reviewer's job
+
+### Output
+
+A contract artifact containing:
+- The precise interface change (structured diff of the API surface)
+- Backward compatibility assessment (compatible / breaking + migration plan)
+- A list of all services that consume the changed interface, derived from the dependency graph
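
A minimal sketch of the contract artifact and its publish rule. It assumes an interface can be summarized as a set of field names and treats any removed field as breaking. Real compatibility checks on OpenAPI or proto definitions also cover type and required-ness changes, which this sketch ignores. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContractArtifact:
    version: str
    added: frozenset                 # fields introduced at the boundary
    removed: frozenset               # fields dropped at the boundary
    consumers: tuple                 # services that consume the interface
    migration_plan: Optional[str] = None

    @property
    def breaking(self) -> bool:
        # Simplifying assumption: removing a field is the only breaking change.
        return bool(self.removed)

    def publishable(self) -> bool:
        # Breaking changes need an explicit migration plan before publishing.
        return not self.breaking or bool(self.migration_plan)


def build_artifact(version, old_fields, new_fields, consumers, migration_plan=None):
    old, new = frozenset(old_fields), frozenset(new_fields)
    return ContractArtifact(version, new - old, old - new,
                            tuple(sorted(consumers)), migration_plan)
```

Making the artifact frozen mirrors the "lock it before implementation" rule: once built, it cannot be mutated by a downstream agent.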
+
+---
+
+## Sequencing Rules
+
+| Condition | Contract Agent | Engineer agents |
+|---|---|---|
+| Single-repo, no public interface change | Skipped | One agent, starts immediately |
+| Single-repo, public interface change | Invoked | Starts after contract is finalized |
+| Multi-repo, independent services (no shared contract) | Skipped | All agents start in parallel immediately |
+| Multi-repo, shared contract change | Invoked | Contract-dependent agents wait; independent agents start immediately |
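
The table above can be read as a two-input decision function. This is a sketch with hypothetical names; the returned strings simply restate the table cells.

```python
def sequencing(repo_count: int, contract_change: bool) -> tuple:
    """Return (contract_agent, engineer_dispatch) for one row of the table."""
    if not contract_change:
        engineers = ("one agent, starts immediately" if repo_count == 1
                     else "all agents start in parallel immediately")
        return "skipped", engineers
    if repo_count == 1:
        return "invoked", "starts after contract is finalized"
    return "invoked", ("contract-dependent agents wait; "
                       "independent agents start immediately")
```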
+
+---
+
+## Summary
+
+One new agent is introduced: the **Contract Agent**. It fills a gap in the current workflow: no role is responsible for locking interface changes before implementation begins. Without it, parallel Engineer agents make independent assumptions about the same interface, and mismatches only surface at review or integration.
+
+Everything else — Orchestrator, Tech Lead, Engineer, Reviewer — retains its existing role and gains scoped extensions to handle multi-repo context and parallel execution.
+
+---
+
+*Document authored from eng-team architectural discussion — May 2026.*
diff --git a/docs/.nojekyll b/docs/.nojekyll
new file mode 100644
index 0000000..e69de29
diff --git a/docs/code-as-agent-harness.html b/docs/code-as-agent-harness.html
new file mode 100644
index 0000000..4b6cd43
--- /dev/null
+++ b/docs/code-as-agent-harness.html
@@ -0,0 +1,1201 @@
+
+
+
In agentic systems, code is not only output — it is the operational harness.
+
+
+
+ The survey argues that code is the executable substrate for reasoning, acting, environment modeling, and verification.
+ A good harness makes behavior executable, inspectable, stateful, and verifiable over long horizons.
+ Progress depends as much on harness engineering (tools, memory, oracles, control loops, multi-agent shared state) as on the base model.
+
+
+
+
+
⚡
Executable
+
🔍
Inspectable
+
💾
Stateful
+
✓
Verifiable
+
+
+
+
+
+
The Shift: Prompt Orchestration → Harness Engineering
+
Drag the slider to see how eng-team should evolve according to the paper.
+
+
+
+ Prompt orchestration that usually works
+ Harness engineering with proof
+
+
+
+ Today: eng-team is already a code-centric harness for PRD → spec → implementation → review → merge-ready PR. Orchestrator phases, bounded loops, and diff-based review align with the paper's philosophy.
+
+
+
+
+
+
+
How eng-team Already Fits
+
+ eng-team is a code-centric agent harness. Click each row to see how the paper's layers map to today's implementation.
+
+ The harness interface is how agents read and write state: CLAUDE.md for conventions, technical_spec for the plan artifact, Engineer edits for implementation, and Reviewer git diff for output-based judgment. Aligns with PHILOSOPHY.md — bottom-up trust, narrow insertion point, diff over intent.
+
+
+
+ The article does not suggest replacing this design. It names what to harden next:
+ oracle quality,
+ shared state discipline,
+ harness telemetry, and
+ governed iteration.
+
+
+
+
+
+
What to Keep (Already Strong)
+
These foundations align with the paper and should not be replaced.
Filter by priority to focus on what to implement first.
+
+
+
+
+
+
+
+
+
+
+
+
+
Priority
+
Change
+
Paper Ref
+
+
+
+
+
P0
+
Evidence bundle + untested_regions on approve
+
§5.2.2
+
+
+
P0
+
Commit pins + spec_version on scratchpad
+
§4.2, §5.2.4
+
+
+
P1
+
Trajectory / harness metrics in every task JSON
+
§5.2.1
+
+
+
P1
+
Failure-type routing in orchestrator
+
§3.4
+
+
+
P2
+
Engineer file-scope + bash policy enforcement
+
§2.2
+
+
+
P2
+
acceptance_checks for complex specs
+
§2.1
+
+
+
P3
+
Cross-task .eng_team/learnings.json
+
§3.2
+
+
+
P3
+
Golden-repo harness regression tests
+
§5.2.3
+
+
+
+
+
+
+
+
+
Bottom Line
+
+ eng-team is already a code-as-harness system for software engineering.
+ The survey's main push is to evolve from prompt orchestration that usually works
+ to harness engineering: every approval carries proof, every phase carries versioned assumptions,
+ and harness failures improve the system with regression discipline — without widening scope beyond the PRD → PR slice until trust is earned.
+
+
+
+
+
diff --git a/docs/code-as-agent-harness.md b/docs/code-as-agent-harness.md
new file mode 100644
index 0000000..504924d
--- /dev/null
+++ b/docs/code-as-agent-harness.md
@@ -0,0 +1,141 @@
+# Code as Agent Harness — Implications for eng-team
+
+Summary of how [Code as Agent Harness](https://arxiv.org/abs/2605.18747) (Ning et al., 2026) relates to **eng-team**, and a prioritized backlog for strengthening the harness.
+
+---
+
+## Article in one paragraph
+
+The survey argues that in agentic systems, **code is not only output** — it is the **operational harness**: the executable substrate for reasoning, acting, environment modeling, and verification. A good harness makes behavior **executable, inspectable, stateful, and verifiable** over long horizons. Progress depends as much on harness engineering (tools, memory, oracles, control loops, multi-agent shared state) as on the base model.
+
+---
+
+## How eng-team already fits
+
+eng-team is a **code-centric agent harness** for the slice PRD → spec → implementation → review → merge-ready PR:
+
+| Paper layer | eng-team today |
+|-------------|----------------|
+| **Harness interface** | `CLAUDE.md`, `technical_spec`, Engineer edits, Reviewer `git diff` |
+| **Harness mechanisms** | Orchestrator phases, bounded loops, `repo_context`, test/lint gates |
+| **Multi-agent over code** | Tech Lead → Engineer → Reviewer via `.eng_team/task_*.json` (orchestrator-only; no peer chat) |
+| **Verifiable closure** | Tests + linter + structured review checklist |
+
+This aligns with **PHILOSOPHY.md**: bottom-up trust, narrow insertion point, diff-based review (output over intent).
+
+The article does **not** suggest replacing this design. It names what to harden next: **oracle quality**, **shared state discipline**, **harness telemetry**, and **governed iteration**.
+
+---
+
+## Key upgrades (article → eng-team)
+
+### 1. Scratchpad as program state
+
+Extend `.eng_team/task_*.json` beyond narrative logging:
+
+- `verification_evidence` (tests run, linter result, diff stats)
+- `assumptions[]` with `verified_by` (test / diff hunk / reviewer item)
+- Per-phase `read_set` / `write_set`
+- Commit pins: `base_commit`, `spec_version`, `impl_commit`
+
+*Paper: §2.3, §4.2, §5.2.4 — transactional shared program state.*
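
A sketch of what "scratchpad as program state" could look like, assuming the task JSON is handled as a plain dict. The field names come from the list above; the helper functions, the `phases` and `text` keys, and the stale-spec check are hypothetical illustrations.

```python
import json


def new_scratchpad(base_commit: str, spec_version: int) -> dict:
    return {
        "base_commit": base_commit,      # commit pins
        "spec_version": spec_version,
        "impl_commit": None,
        "assumptions": [],               # items: {"text": ..., "verified_by": ...}
        "phases": {},                    # per-phase read_set / write_set
        "verification_evidence": {},
    }


def record_phase(task: dict, phase: str, read_set, write_set, spec_version: int) -> None:
    # Refuse to record a phase that worked from a different spec version.
    if spec_version != task["spec_version"]:
        raise ValueError(f"{phase} used spec v{spec_version}, "
                         f"scratchpad is at v{task['spec_version']}")
    task["phases"][phase] = {"read_set": sorted(read_set),
                             "write_set": sorted(write_set)}


def unverified_assumptions(task: dict) -> list:
    return [a["text"] for a in task["assumptions"] if not a.get("verified_by")]


def serialize(task: dict) -> str:
    return json.dumps(task, indent=2, sort_keys=True)
```

The version check is what separates program state from a narrative log: a phase that read an outdated spec fails loudly instead of appending another entry.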
+
+### 2. Verification stack (not only “tests passed”)
+
+On approve, require an **evidence bundle** and explicit limits:
+
+- What was checked (unit / integration / security hints / coverage on touched files)
+- `untested_regions[]` — what the oracle does **not** prove
+- For `complex` tasks: runnable `acceptance_checks` or test skeletons in the spec
+
+*Paper: §5.2.1–5.2.2 — oracle adequacy and semantic verification beyond executable feedback.*
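
The approve-time requirement can be enforced mechanically. The sketch below assumes the evidence bundle is a dict; the `checks` and `complexity` keys and the error strings are hypothetical, while `untested_regions` and `acceptance_checks` follow the list above.

```python
def approval_errors(bundle: dict) -> list:
    """Return reasons an approval must be refused; an empty list means it may proceed."""
    errors = []
    if not bundle.get("checks"):
        errors.append("no checks recorded")
    # The key must be present even when empty, so the limits are always stated.
    if "untested_regions" not in bundle:
        errors.append("untested_regions missing")
    if bundle.get("complexity") == "complex" and not bundle.get("acceptance_checks"):
        errors.append("complex task without acceptance_checks")
    return errors
```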
+
+### 3. Harness-level evaluation
+
+Log per-run **trajectory metrics** in the scratchpad:
+
+- Phase durations, clarification/review cycles
+- Recovery: each `critical_issue` linked to a fix commit
+- `oracle_strength` (trivial vs full checklist, targeted re-review scope)
+
+*Paper: §5.2.1 — evaluate the harness, not only final task success.*
+
+### 4. Failure-type routing in the orchestrator
+
+Route feedback by signal type:
+
+| Signal | Action |
+|--------|--------|
+| `spec_gaps` | Tech Lead (max 1 cycle — existing) |
+| Test failure | Engineer fix mode |
+| Lint only | Engineer, narrow scope |
+| Behavior vs spec | Tech Lead, not blind Engineer patch |
+| Security/perf | Reviewer targeted re-review |
+
+*Paper: §3.4 — plan → execute → verify with feedback-driven control.*
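
The routing table maps naturally onto a lookup with per-signal cycle limits. In this sketch only `spec_gaps` is a name taken from the document. The other signal keys, the mode strings, and the fallback to human escalation are assumptions.

```python
# signal -> (agent, mode); keys other than "spec_gaps" are hypothetical names.
ROUTES = {
    "spec_gaps": ("tech_lead", "clarify_spec"),
    "test_failure": ("engineer", "fix"),
    "lint_only": ("engineer", "narrow_fix"),
    "behavior_vs_spec": ("tech_lead", "revisit_spec"),
    "security_perf": ("reviewer", "targeted_re_review"),
}

# Bounded loops: spec_gaps allows at most one cycle, per the table.
MAX_CYCLES = {"spec_gaps": 1}


def route(signal: str, cycles_used: int = 0) -> tuple:
    if signal not in ROUTES:
        return ("human", "escalate")      # assumed fallback for unknown signals
    if cycles_used >= MAX_CYCLES.get(signal, float("inf")):
        return ("human", "escalate")      # loop budget exhausted
    return ROUTES[signal]
```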
+
+### 5. Action validation (lightweight harness boundary)
+
+Pre-flight before Engineer acts:
+
+- Edits only under `files_to_modify` / `files_to_create`
+- No edits on `base_branch`
+- Bash allowlist from `CLAUDE.md` (no destructive or secret-leaking commands)
+
+*Paper: §2.2 — code mediates intent; filter invalid actions before execution.*
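
A minimal pre-flight sketch for the three rules above. It assumes the allowed files and the bash allowlist have already been read from the spec and `CLAUDE.md`. The function names are hypothetical, and rejecting every command that contains a shell metacharacter is a deliberately conservative simplification.

```python
import posixpath
import shlex

# Characters that could chain or redirect commands; any of them rejects the command.
SHELL_META = set(";|&`$<>\n")


def edit_allowed(path: str, allowed_files: set) -> bool:
    # Normalize so "src/../secrets.env" cannot pass as a path under src/.
    allowed = {posixpath.normpath(p) for p in allowed_files}
    return posixpath.normpath(path) in allowed


def command_allowed(command: str, allowlist: set) -> bool:
    if SHELL_META & set(command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:                    # unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in allowlist


def preflight(edits, allowed_files, branch, base_branch, commands, allowlist) -> list:
    problems = []
    if branch == base_branch:
        problems.append(f"edits on base branch: {base_branch}")
    problems += [f"out-of-scope edit: {p}" for p in edits
                 if not edit_allowed(p, allowed_files)]
    problems += [f"blocked command: {c}" for c in commands
                 if not command_allowed(c, allowlist)]
    return problems
```

An empty list from `preflight` means the Engineer may act; anything else is filtered before execution rather than caught at review.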
+
+### 6. Human gates as durable state
+
+Scratchpad fields: `human_gates` (`prd_approved`, `spec_approved`, `merge_approved`), `human_resolution` on escalation so later runs do not repeat the same failure.
+
+*Paper: §5.2.5; **PHILOSOPHY.md** — the gate that stays human.*
+
+### 7. Cross-task memory (optional, later)
+
+`.eng_team/learnings.json` for recurring reviewer findings, flaky areas, repo-specific patterns — opt-in, governed.
+
+*Paper: §3.2 — memory and context engineering.*
+
+### 8. Harness evolution with regression discipline
+
+Golden fixture repos + expected scratchpad phases; prompt/checklist changes only with held-out regression tasks and explicit change contracts.
+
+*Paper: §5.2.3 — self-evolving harnesses without regression.*
+
+---
+
+## What to keep (already strong)
+
+- Bottom-up, verifiable slice (code → tests → diff review)
+- Orchestrator-owned control flow; bounded loops; targeted re-review
+- Role/tool separation (Tech Lead no Edit; Reviewer judges diff not intent)
+- `/eng-team-context` as environment bootstrapping
+- Scratchpad as audit trail
+
+---
+
+## Prioritized backlog
+
+| Priority | Change | Paper reference |
+|----------|--------|-----------------|
+| **P0** | Evidence bundle + `untested_regions` on approve | §5.2.2 |
+| **P0** | Commit pins + `spec_version` on scratchpad | §4.2, §5.2.4 |
+| **P1** | Trajectory / harness metrics in every task JSON | §5.2.1 |
+| **P1** | Failure-type routing in orchestrator | §3.4 |
+| **P2** | Engineer file-scope + bash policy enforcement | §2.2 |
+| **P2** | `acceptance_checks` for `complex` specs | §2.1 |
+| **P3** | Cross-task `.eng_team/learnings.json` | §3.2 |
+| **P3** | Golden-repo harness regression tests | §5.2.3 |
+
+---
+
+## Bottom line
+
+eng-team is already a **code-as-harness** system for software engineering. The survey’s main push is to evolve from **prompt orchestration that usually works** to **harness engineering**: every approval carries proof, every phase carries versioned assumptions, and harness failures improve the system with regression discipline — without widening scope beyond the PRD → PR slice until trust is earned.
+
+---
+
+## Reference
+
+- **Paper:** [Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems](https://arxiv.org/abs/2605.18747)
+- **Related repo docs:** `PHILOSOPHY.md`, `README.md`, `.claude/commands/eng-team.md`
diff --git a/docs/executive-presentation.html b/docs/executive-presentation.html
new file mode 100644
index 0000000..55dfb2f
--- /dev/null
+++ b/docs/executive-presentation.html
@@ -0,0 +1,1432 @@
+
+
+
+
+
+ Eng-Team — Executive Presentation
+
+
+
+
+
+
+
+
+
+
+
Executive Briefing · May 2026
+
Autonomous Engineering with Quality Gates
+
+ From a working AI engineering team today — to trusted, verifiable delivery across single and multi-repo environments.
+
+
eng-team · Engineering AI Team
+
+
+
+
+
The Opportunity
+
Most AI in engineering is still faster autocomplete
+
+
Today: a human accepts or rejects every AI output
+
The ceiling: individual speed, not autonomous delivery
+
The real opportunity: hand the system a requirement, get back working, tested, reviewed code
+
+
+
We start where AI output can actually be verified — tests pass or they don't, diffs make sense or they don't.
+
+
+
+
+
+
Where We Are Today
+
eng-team: PRD → merge-ready PR, autonomously
+
+ A Claude Code workflow that turns a plain-English feature request into committed, reviewed code — without changing how product or leadership work.
+
+
+
3
Core agents today
+
1
Human gate (PRD in)
+
0
Peer chat between agents
+
+
+
+
+
+
Current Workflow
+
Forward path + three bounded iteration loops
+
+
+
+
Orchestrator
+
/eng-team · coordinates all phases & loops
+
+
+
+
+
+
+
+
Human
Provides PRD
+
+
+
+ Autonomous
+
+
Tech Lead
Technical spec
AI agent
+
+
Engineer
Build + test
AI agent
+
+
Reviewer
Diff review
AI agent
+
+
+
+
+
Merge-ready PR
Human merge decision
+
+
+
+
Human input / gate
+
Autonomous agent pipeline
+
Human-reviewed output
+
+
+
+
+
+
+
+ ⇄
+
+
+
+
Human ⇄ Tech Lead
+
PRD clarification · intent gate
+
+
+
+
+
+ ⇄
+
+
+
+
Engineer ⇄ Tech Lead
+
Spec gaps → clarify spec · max 1 cycle
+
+
+
+
+
+ ⇄
+
+
+
+
Engineer ⇄ Reviewer
+
Reject → fix → targeted re-review · max 1 cycle
+
+
+
+
+
+
All agents read/write .eng_team/task_*.json — shared scratchpad & full audit trail · Reviewer judges the diff, not intent
+
+
+
+
+
+
The Challenge
+
Why autonomous agents need quality gates
+
+
+
🔍
+
Too little context
+
Agent invents patterns it doesn't know exist in the repo.
+
+
+
📈
+
Scope creep
+
Nothing stops over-engineering when there's no plan boundary.
+
+
+
🔄
+
No verification loop
+
Code handed off without proving it works or matches intent.
+
+
+
🪞
+
Self-review bias
+
The agent that wrote the code also "reviewed" it.
+
+
+
+ Without gates, AI PRs erode trust faster than they create value.
+
+
+
+
+
+
Quality Gates Framework
+
Five stages — catch failures where they're cheapest
+
+
STAGE 1
Pre-flight
+
STAGE 2
In-flight
+
STAGE 3
Post-impl
+
STAGE 4
Adversarial Review
+
STAGE 5
Trust Loop
+
+
+
No single gate is enough. The combination of pre-flight discipline, mechanical checks, and an independent reviewer covers the three biggest failure modes.
+
+
+
+
+
+
Single-Repo Environment
+
Quality assurance for one codebase
+
+
+
Before code is written
+
+
Plan artifact — files, LOC estimate, approach mapped to acceptance criteria
+
Scope check — flag over-engineering before it starts
+
CLAUDE.md — repo constitution all agents must read
+
+
+
+
Before PR opens
+
+
Tests pass — full suite, enforced mechanically
+
Diff coverage — new code must have tests
+
Lint + diff audit — zero tolerance; flag unexpected file changes
+
+
+
+
+
+
+
+
Highest-Trust Gate
+
The agent that writes code must never be the sole reviewer
+
+
+
Separate agent with fresh context and no attachment to the implementation
+
Structured checklist: every acceptance criterion covered? Extra code? Pattern divergence?
+
Output: review report attached to the PR — human reviewer sees AI assessment + diff
+
Reduces cognitive load; focuses human attention where it matters
+
+
+
Trust over time
+
Capture every human correction as a CLAUDE.md update or annotated example.
+
Periodically re-evaluate: would a fresh agent implement the same way? Divergence signals drift.
+
+
+
+
+
+
+
Single-Repo Priority
+
Implement in this order
+
+
#
Gate
Impact
+
+
1
CLAUDE.md with explicit conventions
Repo context for all agents
+
2
Plan artifact + scope check
Catch over-engineering early
+
3
Full test suite enforcement
Non-negotiable quality floor
+
4
Adversarial reviewer agent
Builds human trust directly
+
5
Diff size + blast radius audit
Catch subtle scope creep
+
6
Feedback capture loop
Compounds quality over time
+
+
+
+
+
+
+
Multi-Repo Reality
+
Single-repo gates break down in microservices
+
+
Blind spots
Full context on Service A, no idea Service B exists.
+
Silent breaks
Shared library change breaks three consumers undetected.
+
Parallel drift
Two agents modify the same contract independently.
+
Interaction bugs
Reviewer sees one diff; bug lives between services.
+
+
+ Need an org-level architecture layer — service inventory, dependency graph, cross-cutting conventions — partially auto-generated from API specs and import graphs.
+
+
+
+
+
+
Multi-Repo Gates
+
Four gates that don't exist in single-repo
+
+
+
1 · Contract-first planning
+
Declare API/event/schema changes upfront. Contracts locked before any engineer starts.
+
+
+
2 · Sequenced parallel execution
+
Engineers work in parallel — but dependent services wait until contracts are finalized.
+
+
+
3 · Cross-repo adversarial reviewer
+
Reads all diffs together. Checks interface compatibility and missed consumers.
+
+
+
4 · Contract tests
+
Consumer-driven tests against producer implementations — the cross-service quality floor.
+
+
+
+
+
+
+
New Agent Required
+
Contract Agent — locks interfaces before implementation
+
+
+
When invoked
+
+
PRD touches more than one repo
+
PRD touches a public API, event schema, or shared library
+
Skipped for purely internal single-repo changes
+
+
What it does NOT do
+
+
Write implementation code
+
Review final diffs (Reviewer's job)
+
+
+
+
Contract artifact output
+
+
Precise interface change (structured diff)
+
Backward compatibility assessment
+
List of all consuming services
+
+
Without it, parallel engineers make independent assumptions — mismatches surface only at integration.
+
+
+
+
+
+
+
Target Multi-Repo Workflow
+
Extended pipeline — one new agent, parallel engineers
+
+
🎯
Orchestrator
+
→
+
📋
Tech Lead
Cross-repo plan
+
→
+
📜
Contract Agent
New
+
→
+
+
⚙️
Engineer
+
⚙️
Engineer
+
+
→
+
🔍
Reviewer
All diffs
+
+
+
Orchestrator Extended
Parallel dispatch + contract gate enforcement
+
Tech Lead Extended
System architecture doc + cross-repo plan
+
Contract Agent New
Locks interface changes pre-implementation
+
+
+
+
+
+
Coordination Rules
+
When does the Contract Agent run?
+
+
Scenario
Contract Agent
Engineer Agents
+
+
Single-repo, no public interface change
Skipped
One agent, starts immediately
+
Single-repo, public interface change
Invoked
Starts after contract finalized
+
Multi-repo, independent services
Skipped
All start in parallel immediately
+
Multi-repo, shared contract change
Invoked
Dependent agents wait; independent start now
+
+
+
+
+
+
+
Where We Want to Reach
+
From prompt orchestration to harness engineering
+
+
+
Today
+
+
Agents follow instructions; state is mostly a log
+
Verification is binary: tests pass or fail
+
Failures repeat across runs
+
+
+
→
+
+
Target
+
+
Every approval carries evidence + untested regions
+
Scratchpad is versioned program state with commit pins
+
Harness failures improve the system with regression discipline
+
+
+
+
+
+
+
+
Implementation Roadmap
+
Phased path from here to there
+
+
+
Now
+
Working eng-team pipeline
Tech Lead → Engineer → Reviewer. CLAUDE.md bootstrapping. Diff-based review.
+
+
+
Phase 1
+
Single-repo quality gates
Plan artifact, test enforcement, adversarial reviewer, evidence bundle on approve.
+
+
+
Phase 2
+
Harness hardening
Scratchpad as program state, failure-type routing, harness telemetry, human gates as durable state.
+
+
+
Phase 3
+
Multi-repo extension
System architecture doc, Contract Agent, sequenced parallel execution, contract tests.
Automating the verifiable slice so humans focus on judgment.
+
Not automating everything
Narrow insertion point. One human gate stays.
+
Not a big-bang rollout
Each phase earns trust before the next begins.
+
+
+
+
+
+
Bottom Line
+
Trusted autonomous delivery is a harness problem, not a model problem
+
+
Today: eng-team delivers PRD → merge-ready PR with a proven three-agent pipeline
+
Next: quality gates ensure agents stay on scope, produce verifiable output, and build human trust
+
Then: Contract Agent + multi-repo gates extend the same philosophy across services
+
Always: start verifiable, earn trust upward, keep the human intent gate
+
+
+
The teams that build AI trust incrementally — starting with what's verifiable — will pull ahead of those that either ignore AI or try to automate everything at once.
+
+
+
+
+
+
Discussion
+
Questions?
+
eng-team · Autonomous Engineering with Quality Gates
+ An AI engineering team that turns a plain-English feature request into committed, tested, reviewed code — without changing how product or leadership work.
+
Four specialized agents orchestrated by /eng-team. Each reads from a shared scratchpad — no agent talks directly to another.
+
+
+
+
📋
+
Tech Lead
+
Reads the PRD and codebase, produces a precise Technical Spec — file paths, acceptance criteria, approach. Never writes code.
+ Spec only
+
+
+
⚙️
+
Engineer
+
Implements the spec, writes tests inline, runs the linter, and commits to a feature branch.
+ Build + test
+
+
+
🔍
+
Reviewer
+
Diffs against main for correctness, security, and performance. Approves with a PR description — or rejects with actionable fixes.
+ Diff review
+
+
+
🚀
+
DevOps
+
Deploys locally via Docker Compose. Separate /devops command after merge.
+ Deploy
+
+
+
+
+
+
+
+
+
How it works
+
Human provides the PRD. The autonomous pipeline runs spec → implementation → review, with bounded iteration loops when needed.
+
+
+
+
Human PRD in
+
→
+
+ Autonomous
+
+
Tech Lead
+
→
+
Engineer
+
→
+
Reviewer
+
+
+
→
+
Merge-ready PR Human merges
+
+
Bounded loops: Human ⇄ Tech Lead (PRD clarification) · Engineer ⇄ Tech Lead (spec gaps) · Engineer ⇄ Reviewer (reject/fix) · All state in .eng_team/task_*.json
+
+
+
+
+
+
+
+
Get started in four steps
+
Copy .claude/ into your project, bootstrap context, and run your first PRD.
+
+
+
+
Add eng-team to your repo
+
Copy the .claude/ directory to your project root. Requires Claude Code.
+
+
+
Generate CLAUDE.md
+
Run /eng-team-context ./ to analyze your repo and generate agent-optimized context automatically.
+
+
+
Run a PRD
+
/eng-team Add rate limiting to the cart API — max 100 req/min per user, Redis-backed, fail-open
+
+
+
Push & merge
+
Push the branch, open a PR, paste PR_DESCRIPTION.md. Deploy locally with /devops deploy local.
+
+
+
+
+
+
+
+
+
Interactive guides
+
Architecture deep-dives, quality gates, and executive presentations — built for teams adopting autonomous engineering.
+ How to ensure eng-team agents produce high-quality code, stay on scope, and build human trust in AI-generated PRs — extended for microservices and multi-repo environments.
+
+
+
+
+
+
Why AI Code Quality Degrades
+
+ Before picking gates, name the failure modes. Each gate in this system targets one or more root causes. Click a card to see which gates address it.
+
+
+
+
+
🔍
+
Too Little Context
+
Agent invents patterns because it doesn't know repo conventions.
+
+
+
📈
+
Scope Discipline
+
Nothing stops over-engineering when there's no plan boundary.
+
+
+
🔄
+
No Verification Loop
+
Code is handed off without checking if it works or matches intent.
+
+
+
🪞
+
No Adversarial Review
+
The same agent that wrote the code also "reviewed" it.
+
+
+
+
+
Addressed by: CLAUDE.md + Plan Artifact
+
+ A well-maintained CLAUDE.md acts as a constitution all agents must read and cite. The plan artifact (files to change, LOC estimate, approach mapped to acceptance criteria) surfaces risks before any code is written.
+
+
+
+
+
+
+
Five Stages of Quality Gates
+
+ No single gate is sufficient — failure modes differ at each stage. Explore each stage to see what happens and when.
+
+
+
+
+
+
+
+
+
+
+
+
✈️ Stage 1 — Pre-flight (Before a Line Is Written)
+
+ The highest-leverage point. The agent produces a plan artifact before implementation.
+
+
+
Scope reasonableness — flag if file/LOC count exceeds threshold for a small feature
+
Repo structure alignment — plan follows module boundaries, naming, and patterns from CLAUDE.md
+
Test-first commitment — agent declares tests before writing implementation
An adversarial reviewer (independent judgment on correctness and fit)
+
+
+
+
+
+
+
Why Single-Repo Gates Aren't Enough
+
+ In microservices, failure modes multiply. Each repo keeps its own CLAUDE.md, but multi-repo changes need an org-level architecture layer.
+
+
+
+
+
Blind to Service B
+
Agent has full context on Service A but no idea Service B exists.
+
+
+
Silent Schema Breaks
+
A shared library change silently breaks three consumers.
+
+
+
Parallel Contract Drift
+
Two engineers modify overlapping contracts with no coordination.
+
+
+
Interaction Bugs
+
Reviewer sees one diff; the bug lives between two services.
+
+
+
+
+
System-Level Context Layer
+
+ Partially auto-generated from API specs, import graphs, and event bus subscriptions — not a hand-maintained wiki.
+ Covers service inventory, dependency graph, and cross-cutting conventions (auth, errors, events).
+
+
+
+
+
+
+
Multi-Repo Quality Gates
+
Four additional gates that don't exist in the single-repo model. Expand each to learn more.
+
+
+
+ 1 Contract-First Planning
+
+ The tech-lead's plan must answer: which service boundaries does this change cross?
+ Any API contract, event schema, or shared type change is declared upfront — before any engineer starts.
+ Enforces contracts first, implementations second.
+
+
+
+ 2 Sequenced Parallel Execution
+
+ Multiple engineer agents work in parallel — but only after the contract is settled.
+ No agent touches a dependent service until the contract change is finalized.
+ Violating this means independent assumptions that only surface at integration time.
+
+
+
+ 3 Cross-Repo Adversarial Reviewer
+
+ Reads all diffs together. Checks backward compatibility, whether Service B's usage matches Service A's implementation,
+ and whether any calling service was missed. Requires the dependency graph — blast radius can't be discovered from diffs alone.
+
+
+
+ 4 Contract Tests as Quality Floor
+
+ In multi-repo, contract tests replace the single-repo test suite as the cross-service quality floor.
+ Consumer-driven tests against the producer's implementation — the only automated check for interface mismatches.
+
+
+
+
+
+
+
+
Priority
+
Gate
+
What It Addresses
+
+
+
+
1
System-level architecture document
Cross-service context
+
2
Per-repo CLAUDE.md
Local conventions
+
3
Contract-first plan artifact
Incompatible parallel implementations
+
4
Sequenced parallel execution
Contract-before-consumer ordering
+
5
Test suite + contract tests
Automated quality floor
+
6
Cross-repo adversarial reviewer
Interface mismatches across diffs
+
7
Feedback capture loop
Compounding quality over time
+
+
+
+
+
+
+
Multi-Repo Agent Workflow
+
+ One new agent (Contract Agent) fills the gap for locking interface changes. Everything else is extended, not replaced.
+ Click each node to explore its role.
+
+
+
+
+
+
🎯
+
Orchestrator
+
+
→
+
+
📋
+
Tech Lead
+
+
→
+
+
📜
+
Contract Agent
+
+
→
+
+
+
⚙️
+
Engineer ×N
+
+
+
→
+
+
🔍
+
Reviewer
+
+
+
+
+
Orchestrator (extended)
+
Drives the full pipeline with two behavioral additions for multi-repo:
+
+
Parallel engineer dispatch — one Engineer agent per repo
+
Contract gate enforcement — blocks dependent engineers until contract is finalized
+
+
+
+
+
+
+
+ 🎯 Orchestrator (Extended)
+
+ - Parallel engineer dispatch across repos
+ - Contract gate enforcement before dependent engineers start
+ - No new agent: a logic and configuration extension
+
+ 📋 Tech Lead (Extended)
+
+ - Reads system-level architecture document first
+ - Identifies which service boundaries the PRD crosses
+ - Produces cross-repo plan artifact per repo
+ - Flags API/event/shared type changes → triggers Contract Agent
+
+ 📜 Contract Agent (New)
+
+ - Invoked when PRD touches multiple repos or public interfaces
+ - Produces versioned contract artifact before implementation
+ - Checks backward compatibility; requires migration plan for breaking changes
+ - Does NOT write implementation code or review final diffs
+ - Output: interface diff, compatibility assessment, consumer list
+
+ ⚙️ Engineer (Extended)
+
+ - One instance per repo, running in parallel
+ - Contract-dependent instances receive finalized artifact as context
+ - Independent repos start immediately without waiting
+
+ 🔍 Reviewer (Extended)
+
+ - Receives all diffs across repos simultaneously
+ - Checks Service B usage matches Service A implementation
+ - Verifies all calling services are included in the plan
+ - Requires dependency graph for blast radius analysis
+
+
+ Sequencing Rules
+
+ The scenario determines whether the Contract Agent runs and how Engineer agents are dispatched. For the scenario shown:
+
+ - Contract Agent: Skipped
+ - Engineer Agents: One agent, starts immediately
+
+ The Unsolved Problem
+
+ The system-level architecture document is only as good as its maintenance discipline.
+ The real answer is auto-generation from API specs, import graphs, and event subscriptions.
+ Until that tooling exists, treat it as the best available context and have agents flag unknown service references.
diff --git a/quality-gates-multi-repo.md b/quality-gates-multi-repo.md
new file mode 100644
index 0000000..27d3cf9
--- /dev/null
+++ b/quality-gates-multi-repo.md
@@ -0,0 +1,96 @@
+# Quality Gates in a Multi-Repo / Microservices Environment
+
+> Extending the single-repo quality philosophy to systems where logic is split across services, shared libraries, and utilities — and where one PRD may touch more than one repo.
+
+---
+
+## Why Single-Repo Gates Are Not Enough
+
+The quality gates designed for a single repo assume one CLAUDE.md, one test suite, one diff to review. In a microservices environment, the failure modes multiply:
+
+- An agent has full context on Service A but no idea Service B even exists
+- A schema change in a shared library silently breaks three consumers
+- Two engineer agents modify overlapping contracts in parallel with no coordination
+- The reviewer only sees one diff but the bug lives in the interaction between two services
+
+Each of these requires a gate that simply doesn't exist in the single-repo model.
+
+---
+
+## New Layer Required: System-Level Context
+
+Each repo keeps its own `CLAUDE.md` for local conventions. But multi-repo changes require an additional layer — an org-level architecture document that every agent reads before planning.
+
+This document covers:
+
+- **Service inventory** — what each service owns, its public API surface, who calls it
+- **Dependency graph** — which services depend on which, where contracts live (OpenAPI specs, proto files, shared types)
+- **Cross-cutting conventions** — auth patterns, error formats, event schemas — things that must be consistent across all services
+
+This is not a living wiki maintained by hand. It should be partially auto-generated from actual API specs, import graphs, and event bus subscriptions — so it reflects the real system, not someone's memory of it.
+
+---
+
+## Gate 1: Contract-First Planning
+
+For any multi-repo PRD, the Tech Lead's plan artifact must answer: *which service boundaries does this change cross?*
+
+Any change that touches an API contract, event schema, or shared type must be declared upfront — before any engineer agent starts writing code. The plan names the contract change explicitly. All downstream service changes are derived from it.
+
+This enforces the right order of operations: **contracts first, implementations second.** Agents cannot drift into incompatible assumptions if the contract is locked before they start.
+
+---
+
+## Gate 2: Sequenced Parallel Execution
+
+Multiple engineer agents can work in parallel on separate services — but only after the contract is settled.
+
+The coordination rule: **no agent touches a service that depends on a contract change until that contract change is finalized.**
+
+This is a sequencing constraint, not a quality check. Violating it means two agents make independent assumptions about the same interface, and both may be wrong in ways that only surface at integration time.
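+
+The gating rule can be sketched in a few lines of Python. This is a minimal illustration, not the orchestrator's actual logic: `RepoTask`, `ready_to_dispatch`, and the repo and contract names are all hypothetical, and it assumes the plan lists, per repo, the contract changes that repo depends on.
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass(frozen=True)
+class RepoTask:
+    """One Engineer assignment; `needs` names the contracts it depends on (hypothetical format)."""
+    repo: str
+    needs: frozenset = field(default_factory=frozenset)
+
+
+def ready_to_dispatch(tasks, finalized):
+    """Return the repos whose contract dependencies are all finalized.
+
+    Repos with no contract dependency are always ready, so they start immediately.
+    """
+    done = set(finalized)
+    return [t.repo for t in tasks if t.needs <= done]
+
+
+tasks = [
+    RepoTask("billing-api", frozenset({"invoice-v2"})),
+    RepoTask("web-client", frozenset({"invoice-v2"})),
+    RepoTask("docs-site"),  # independent of the contract change
+]
+
+print(ready_to_dispatch(tasks, finalized=[]))              # ['docs-site']
+print(ready_to_dispatch(tasks, finalized=["invoice-v2"]))  # all three repos
+```
+
+Expressing the gate as a subset check keeps independent repos unblocked while dependent ones wait for finalization.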
+
+---
+
+## Gate 3: Cross-Repo Adversarial Reviewer
+
+The single-repo reviewer reads one diff. In multi-repo, the reviewer must read all diffs together and specifically check:
+
+- Are all contract changes backward compatible — or is there a coordinated breaking change with a migration plan?
+- Does Service B's usage of the new API actually match what Service A implemented?
+- Is there a service that calls the changed interface that wasn't included in the plan?
+
+This reviewer requires the dependency graph from the system-level context layer to know which services to check. It cannot discover blast radius from the diffs alone.
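+
+As a rough sketch of the third check, the reviewer could compare the dependency graph against the plan as follows. The graph shape (service name mapped to the set of services it calls) and the service names are assumptions for illustration, and only direct callers are considered.
+
+```python
+def unplanned_callers(dependency_graph, changed_services, planned_repos):
+    """Return services that call a changed service but are missing from the plan.
+
+    `dependency_graph` maps each service to the set of services it calls
+    (a hypothetical shape for the system-level dependency graph).
+    """
+    changed = set(changed_services)
+    planned = set(planned_repos)
+    return sorted(
+        caller
+        for caller, callees in dependency_graph.items()
+        if callees & changed and caller not in planned
+    )
+
+
+graph = {
+    "checkout": {"payments", "inventory"},
+    "admin-ui": {"payments"},
+    "payments": set(),
+    "inventory": set(),
+}
+
+missing = unplanned_callers(
+    graph, changed_services=["payments"], planned_repos=["payments", "checkout"]
+)
+print(missing)  # ['admin-ui']
+```
+
+A non-empty result means the plan missed a consumer, which is exactly what the diffs alone cannot reveal.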
+
+---
+
+## Gate 4: Contract Tests as the Quality Floor
+
+In a single repo, the test suite is the quality floor. In multi-repo, the equivalent is **contract tests** — consumer-driven tests that run against the producer's implementation.
+
+Every service that publishes an API should have contract tests defined by its consumers. These are the only automated checks that can catch cross-service incompatibilities before integration. Unit tests and linting within each service will not surface interface mismatches.
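+
+To make the idea concrete, here is a deliberately simplified consumer-driven check in plain Python. It is a sketch only: real setups typically use a dedicated contract-testing tool, and `get_invoice`, the consumer names, and the field lists are invented for illustration.
+
+```python
+def get_invoice(invoice_id):
+    """Stand-in for the producer's real handler (hypothetical)."""
+    return {"id": invoice_id, "total_cents": 1250, "currency": "USD", "status": "paid"}
+
+
+# Each consumer declares only the fields it reads and the types it expects.
+CONSUMER_EXPECTATIONS = {
+    "web-client": {"id": str, "total_cents": int, "currency": str},
+    "reporting": {"id": str, "status": str, "paid_at": str},
+}
+
+
+def contract_violations(response, expectations):
+    """List every consumer expectation the producer's response fails to meet."""
+    problems = []
+    for consumer, fields in expectations.items():
+        for name, expected_type in fields.items():
+            if name not in response:
+                problems.append(f"{consumer}: missing field '{name}'")
+            elif not isinstance(response[name], expected_type):
+                problems.append(f"{consumer}: '{name}' is not {expected_type.__name__}")
+    return problems
+
+
+violations = contract_violations(get_invoice("inv_1"), CONSUMER_EXPECTATIONS)
+print(violations)  # ["reporting: missing field 'paid_at'"]
+```
+
+The producer's own unit tests would pass here; only the consumer's declared expectation exposes the missing field.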
+
+---
+
+## Revised Implementation Priority (Multi-Repo)
+
+| Priority | Gate | What it addresses |
+|---|---|---|
+| 1 | System-level architecture document | Gives agents cross-service context |
+| 2 | Per-repo `CLAUDE.md` | Gives agents local conventions |
+| 3 | Contract-first plan artifact | Prevents incompatible parallel implementations |
+| 4 | Sequenced parallel execution | Enforces contract-before-consumer ordering |
+| 5 | Full test suite per repo + contract tests | Sets the automated quality floor |
+| 6 | Cross-repo adversarial reviewer | Catches interface mismatches across diffs |
+| 7 | Feedback capture loop | Compounds quality improvements over time |
+
+---
+
+## The Unsolved Problem
+
+The system-level architecture document is only as good as its maintenance discipline. In a fast-moving microservices environment, the dependency graph goes stale quickly.
+
+The real answer is that this document needs to be auto-generated — derived from actual API specs, import graphs, and event bus subscriptions — not maintained by hand. Until that tooling exists, the document is a useful approximation, not a guarantee. Treat it as the best available context, and build agents that flag when they encounter service references not covered by it.
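+
+One possible shape for such a flagging step is sketched below. It assumes, purely for illustration, that services are addressed by internal hostnames of the form `<name>.internal`; the inventory contents and the regular expression are hypothetical and would need to match how services are actually referenced.
+
+```python
+import re
+
+KNOWN_SERVICES = {"payments", "checkout", "inventory"}  # hypothetical service inventory
+
+# Assumption: services are referenced by hostnames like `http://<name>.internal`.
+SERVICE_URL = re.compile(r"https?://([a-z0-9-]+)\.internal")
+
+
+def unknown_service_references(source_text, known_services):
+    """Return referenced service names that the architecture document does not list."""
+    referenced = set(SERVICE_URL.findall(source_text))
+    return sorted(referenced - set(known_services))
+
+
+snippet = '''
+resp = client.get("http://payments.internal/v2/charges")
+audit = client.post("https://ledger.internal/entries")
+'''
+
+print(unknown_service_references(snippet, KNOWN_SERVICES))  # ['ledger']
+```
+
+Anything returned here is a signal that the architecture document is stale for the code being changed.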
+
+---
+
+*Document authored from eng-team architectural discussion — May 2026.*
diff --git a/quality-gates.md b/quality-gates.md
new file mode 100644
index 0000000..6ebf697
--- /dev/null
+++ b/quality-gates.md
@@ -0,0 +1,130 @@
+# Quality Gates for Autonomous Engineering Teams
+
+> How to ensure eng-team agents always produce high-quality code, stay on scope, and build human trust in AI-generated PRs.
+
+---
+
+## Why AI Code Quality Degrades
+
+Before picking gates, it helps to name the failure modes:
+
+- **Too little context** — the agent doesn't know the repo's conventions, so it invents patterns.
+- **Too little scope discipline** — the agent over-engineers because nothing stops it.
+- **No verification loop** — the agent writes code and hands it off without checking if it actually works or matches intent.
+- **No adversarial review** — the same agent that wrote the code also "reviewed" it.
+
+Each gate in this document targets one or more of these root causes.
+
+---
+
+## Stage 1: Pre-flight (Before a Single Line Is Written)
+
+The highest-leverage point is *before* implementation starts. The agent must produce a **plan artifact** — a structured document that states:
+
+- Which files will change
+- Rough line count estimate
+- Implementation approach
+- How the approach maps to each acceptance criterion in the PRD
+
+This costs almost nothing and surfaces the biggest risks before any compute is wasted on implementation.
+
+The plan is checked against:
+
+**Scope reasonableness** — If the plan touches more than a threshold number of files or LOC for a small feature, that's a flag to surface before implementation begins.
+
+**Repo structure alignment** — Does the plan follow existing module boundaries, naming conventions, and architectural patterns? A well-maintained `CLAUDE.md` is the primary mechanism here — treat it as a constitution that all agents must read and cite in their plan.
+
+**Test-first commitment** — The agent declares what tests it will write before writing any implementation. This forces real thinking about the contract, not just the code.
+
+---
+
+## Stage 2: In-flight Controls (While Implementing)
+
+**Incremental, reviewable commits** — Rather than one giant diff at the end, each logical chunk (a new function, a schema change, a new component) should be a discrete commit. This makes the diff auditable incrementally and makes it far easier to spot drift.
+
+**Self-critique step** — After writing each logical unit, the agent reads its own diff and answers:
+- Is this the minimum change needed?
+- Does it follow the pattern used elsewhere in the codebase?
+- Am I introducing anything that wasn't in the PRD?
+
+Catching drift mid-implementation is dramatically cheaper than catching it at review.
+
+---
+
+## Stage 3: Post-implementation Gates (Before PR Is Opened)
+
+These are the mechanical, automated checks that form the quality floor.
+
+### Tests Must Pass
+The full existing test suite must pass before a PR is opened. If the agent breaks tests, the PR does not open. This is enforced mechanically, not left to the agent's judgment.
+
+### Test Coverage on New Code
+The agent is required to write tests for its own additions. Coverage thresholds apply to the **diff** — not just overall repo coverage — to catch cases where the agent ships logic with zero tests.
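+
+A minimal sketch of a diff-scoped coverage calculation follows. It assumes the added lines and the covered lines have already been extracted into per-file sets of line numbers; how they are produced (diff parser, coverage tool) is outside this example, and the file paths are illustrative.
+
+```python
+def diff_coverage(added_lines, covered_lines):
+    """Fraction of added lines that the test run executed.
+
+    Both arguments map a file path to a set of line numbers.
+    """
+    total = sum(len(lines) for lines in added_lines.values())
+    if total == 0:
+        return 1.0  # nothing was added, so there is nothing to cover
+    hit = sum(
+        len(lines & covered_lines.get(path, set()))
+        for path, lines in added_lines.items()
+    )
+    return hit / total
+
+
+added = {"src/invoice.ts": {10, 11, 12, 13}, "src/util.ts": {5}}
+covered = {"src/invoice.ts": {1, 2, 10, 11, 12}}
+
+ratio = diff_coverage(added, covered)
+print(f"{ratio:.0%}")  # 60%
+```
+
+Lines 1 and 2 of `src/invoice.ts` are covered but were not added, so they do not count toward the ratio; this is what distinguishes diff coverage from overall repo coverage.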
+
+### Static Analysis and Linting
+TypeScript strict mode, ESLint, formatters, and any other repo-configured tools must pass with zero errors and zero warnings. The agent runs these tools locally and fixes every finding before the PR opens.
+
+### Diff Size Audit
+Compare the size of the PR (files changed, LOC) against the stated complexity of the PRD. A one-sentence feature request that produces a 1200-line PR is a signal worth surfacing — it doesn't mean the PR is wrong, but it should trigger human scrutiny before merge.
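+
+The audit could be as simple as the following sketch. The size labels, the LOC budgets, and the file limit are placeholder values chosen for illustration; real thresholds are a team decision.
+
+```python
+# Hypothetical LOC budgets per PRD size label.
+LOC_BUDGET = {"small": 150, "medium": 600, "large": 2000}
+
+
+def diff_size_flag(prd_size, files_changed, loc_changed, max_files=15):
+    """Return a human-readable flag when a PR exceeds its budget, else None."""
+    budget = LOC_BUDGET[prd_size]
+    reasons = []
+    if loc_changed > budget:
+        reasons.append(f"{loc_changed} LOC exceeds the {budget}-LOC budget for a {prd_size} PRD")
+    if files_changed > max_files:
+        reasons.append(f"{files_changed} files changed exceeds the limit of {max_files}")
+    return "; ".join(reasons) or None
+
+
+print(diff_size_flag("small", files_changed=4, loc_changed=1200))
+print(diff_size_flag("small", files_changed=4, loc_changed=90))  # None
+```
+
+The function returns a message rather than raising, since an oversized diff is a prompt for human scrutiny and not an automatic failure.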
+
+### File Blast Radius Check
+Which files were modified? If the agent touched a shared utility, a config file, or anything outside the expected module scope, that must be explicitly flagged in the PR description. Unexpected file changes are one of the most common sources of subtle regressions.
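+
+One way to implement this check is to compare the changed files against path patterns declared in the plan artifact. The sketch below assumes the plan exposes such patterns; the paths and the `unexpected_files` helper are hypothetical.
+
+```python
+from fnmatch import fnmatch
+
+
+def unexpected_files(changed_files, expected_globs):
+    """Return changed files that match none of the plan's expected path patterns."""
+    return [
+        path
+        for path in changed_files
+        if not any(fnmatch(path, pattern) for pattern in expected_globs)
+    ]
+
+
+changed = [
+    "src/invoices/create.ts",
+    "src/invoices/create.test.ts",
+    "src/shared/money.ts",
+    "tsconfig.json",
+]
+expected = ["src/invoices/*"]
+
+print(unexpected_files(changed, expected))  # ['src/shared/money.ts', 'tsconfig.json']
+```
+
+The returned list is what would be called out explicitly in the PR description.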
+
+---
+
+## Stage 4: The Adversarial Reviewer Agent
+
+This is the highest-trust gate and the most important one to get right.
+
+**The agent that writes the code must never be the sole reviewer.**
+
+A separate agent instance — with fresh context and no attachment to the implementation — reads the PRD and the diff, then answers a structured checklist:
+
+- Does every acceptance criterion have corresponding code and a test?
+- Is there any code that wasn't required by the PRD?
+- Are there patterns that diverge from the existing codebase?
+- Are there obvious edge cases not handled?
+- Is the PR description accurate and complete?
+
+The output is a **structured review report** attached to the PR. When the human reviewer opens the PR, they see the AI reviewer's assessment alongside the diff — surfacing disagreements, flags, and open questions. This reduces the cognitive load on the human reviewer and focuses their attention where it matters.
+
+---
+
+## Stage 5: Building Trust Over Time
+
+The gates above catch bad output in the moment. Sustained trust requires a feedback loop.
+
+**Capture human corrections** — Every time a human reviewer modifies an AI-generated PR, that change should be captured — as an annotated example or a `CLAUDE.md` update. This creates a growing library of "this is what we do here and why," progressively calibrating future agents to the team's standards.
+
+**Retrospective evals** — Periodically sample merged AI PRs, strip context, and ask a fresh agent: "How would you implement this PRD given this codebase?" If the approach diverges significantly from what was merged, the agents are drifting from what the team actually wants. Use those diffs to improve the `CLAUDE.md` and agent prompts.
+
+---
+
+## Implementation Priority
+
+Implement in this order for the best return on investment:
+
+| Priority | Gate | What it addresses |
+|---|---|---|
+| 1 | `CLAUDE.md` with explicit conventions | Gives agents repo context |
+| 2 | Plan artifact + scope check | Catches over-engineering before it happens |
+| 3 | Full test suite enforcement | Sets a non-negotiable quality floor |
+| 4 | Adversarial reviewer agent | Builds human trust most directly |
+| 5 | Diff size + blast radius audit | Catches subtle scope creep |
+| 6 | Feedback capture loop | Compounds quality improvements over time |
+
+---
+
+## Summary
+
+No single gate is sufficient because the failure modes are different at each stage. The combination of:
+
+- A strong **pre-flight** (scope discipline + plan artifact)
+- **Mechanical post-implementation gates** (tests, linting, diff audit)
+- An **adversarial reviewer** (independent judgment on correctness and fit)
+
+...covers the three biggest failure modes. The rest is refinement and iteration as the team builds its feedback corpus.
+
+---
+
+*Document authored from eng-team architectural discussion — May 2026.*