An agentic SDLC harness that transforms requirements into merged PRs with forensic evidence. For: Platform engineers and staff+ operators who need repeatable, auditable automation (not demos).
Docs: docs/INDEX.md | Getting started: docs/GETTING_STARTED.md | Golden runs: docs/GOLDEN_RUNS.md | Roadmap: docs/ROADMAP_3_0.md | Changelog: CHANGELOG.md
The job is moving up the stack. Again.
Punchcards → Assembly → High-level languages → Now
Each transition followed the same pattern: what was once skilled craft becomes mechanical, and humans move to higher-leverage work. Programmers stopped managing memory addresses. Then stopped thinking in registers. Now they stop grinding on first-draft implementation.
The shift: Models write nearly-working code at 1,000+ tokens/second. The bottleneck isn't generation—it's trust. Can a human review and trust the output in 30 minutes instead of spending a week doing it themselves?
Flow Studio addresses that constraint. It produces forensic evidence alongside code. You audit the evidence, escalate verification at hotspots where doubt exists, and ship—or bounce it back for another iteration.
The machine does the implementation. You do the architecture, the intent, the judgment.
Just like every transition before.
Current Version: v3.0.0-rc.1 (Release Candidate)
See RELEASE_NOTES_3_0_0_RC1.md for details on the V3 "Intelligent Factory" update.
This release is stable for evaluation. It includes the new Pack System, Stepwise Orchestrator, and Resilient DB.
Flow Studio orchestrates 9 flows that transform a requirement into a merged PR and manage the lifecycle:
| Flow | Transformation | Output |
|---|---|---|
| 1: Signal | Raw input → structured problem | Requirements, BDD scenarios, risk assessment |
| 2: Plan | Requirements → architecture | ADR, contracts, work plan, test plan |
| 3: Build | Plan → working code | Implementation + tests via adversarial loops |
| 4: Review | Draft PR → Ready PR | Harvest feedback, apply fixes |
| 5: Gate | Code → merge decision | Audit receipts, policy check, recommendation |
| 6: Deploy | Approved → production | Merge, verify health, audit trail |
| 7: Wisdom | Artifacts → learnings | Pattern detection, feedback loops |
| 8: Reset | Clean state | Infrastructure wipe, cache purge |
| 9: Demo | Simulation | Walkthrough of the Stepwise Orchestrator |
Each flow produces receipts (proof of execution) and evidence (test results, coverage, lint output). Kill the process anytime—resume from the last checkpoint with zero data loss.
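The checkpoint contract behind "kill anytime, resume with zero data loss" can be sketched as an atomic write (temp file plus rename). The function names and state shape here are illustrative assumptions, not the kernel's actual API:

```python
import json
import os
import tempfile

def write_checkpoint(path: str, state: dict) -> None:
    # Write to a temp file, then rename: a kill mid-write can never
    # leave a half-written checkpoint behind (illustrative sketch).
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows

def resume(path: str) -> dict:
    # Resume from the last committed checkpoint, or start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"flow": 1, "step": 0}
```

Killing the process loses at most the in-flight step; everything committed before it replays identically.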
| Approach | Cost | Output |
|---|---|---|
| Developer implements feature | 5 days of salary | Code you hope works |
| Flow Studio runs overnight | ~$30 compute | Code + tests + receipts + evidence panel + hotspot list |
The receipts are the product. The code is a side effect.
Requirements:
- Python 3.13+
- uv (required)
- GNU Make (Linux/macOS: included; Windows: use WSL2 or MSYS2)
- Node.js 20+ (optional, for UI development)
Install:
git clone https://github.com/EffortlessMetrics/flow-studio-swarm.git
cd flow-studio-swarm
uv sync --extra dev

Verify:

make dev-check # Should pass all checks
make demo-run # Populate example artifacts
make flow-studio # Start UI → http://localhost:5000

Open: http://localhost:5000/?run=demo-health-check&mode=operator
You'll see:
- Left: 9 flows (Signal → Plan → Build → Review → Gate → Deploy → Wisdom → Reset → Demo)
- Center: Step graph showing what ran and what it produced
- Right: Evidence, artifacts, and agent details
The demo shows a complete run—all nine flows indexed, with artifacts captured. Click around. This is what "done" looks like.
Every completed run produces:
| Artifact | Purpose |
|---|---|
| Receipts | Forensic proof of what happened—commands run, exit codes, timing |
| Evidence panel | Multi-metric dashboard (tests, coverage, lint, security) that resists gaming |
| Hotspots | The 3-8 files a reviewer should actually look at |
| Bounded diff | The change itself, with clear scope |
| Explicit unknowns | What wasn't measured, what's still risky |
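A receipt is small and mechanical: command, exit code, timing, bounded output. A minimal sketch (the field names and helper below are assumptions, not the system's real schema):

```python
import subprocess
import sys
import time

def run_with_receipt(cmd: list[str]) -> dict:
    # Capture the forensic facts about one command execution
    # (hypothetical helper; real receipts carry more fields).
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "command": cmd,
        "exit_code": proc.returncode,
        "duration_s": round(time.monotonic() - start, 3),
        "stdout_tail": proc.stdout[-2000:],  # bounded evidence, not a full log
    }

receipt = run_with_receipt([sys.executable, "-c", "print('42 tests passed')"])
```

The point of the shape: every field is measured, none is reported by the agent.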
A reviewer answers three questions in under 5 minutes:
- Does evidence exist and is it fresh?
- Does the panel of metrics agree?
- Where would I escalate verification?
If the evidence holds up, approve. If it contradicts itself, investigate. Either way, the system did the grinding.
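Those three questions are mechanical enough to sketch in code. The evidence schema below (generated_at, panel, hotspots) is hypothetical, chosen only to make the decision order concrete:

```python
import time

def quick_review(evidence: dict, max_age_s: float = 24 * 3600) -> str:
    # Question 1: does evidence exist and is it fresh?
    if not evidence or time.time() - evidence["generated_at"] > max_age_s:
        return "investigate: evidence missing or stale"
    # Question 2: does the panel of metrics agree?
    if not all(evidence["panel"].values()):
        return "investigate: panel metrics disagree"
    # Question 3: where would I escalate verification?
    if evidence["hotspots"]:
        return "escalate: start with the hotspot files"
    return "approve"
```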
Don't anthropomorphize the AI as a "copilot." View it as a manufacturing plant.
| Component | Role | Behavior |
|---|---|---|
| Python Kernel | Factory Foreman | Deterministic. Manages time, disk, budget. Never guesses. |
| Agents | Enthusiastic Interns | Brilliant, tireless. Will claim success to please you. Need boundaries. |
| Disk | Ledger | If it isn't written, it didn't happen. |
| Receipts | Audit Trail | The actual product. |
The foreman's job:
- Don't ask interns if they succeeded—measure the bolt
- Don't give them everything—curate what they need
- Don't trust their prose—trust their receipts
This is why the system doesn't rely on agent claims and runs forensic scanners. Exit codes don't lie. Git diffs don't hallucinate.
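In code, "measure the bolt" means the agent's claim never enters the decision; only the measured exit code does. A sketch with hypothetical names:

```python
import subprocess
import sys

def step_passed(agent_claims_success: bool, check_cmd: list[str]) -> bool:
    # The claim is accepted as input but deliberately ignored;
    # only the measured exit code decides (illustrative sketch).
    del agent_claims_success
    return subprocess.run(check_cmd).returncode == 0

# An enthusiastic claim of success cannot make a failing check pass:
result = step_passed(True, [sys.executable, "-c", "raise SystemExit(1)"])
```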
Three planes, cleanly separated:
| Plane | Component | What it does |
|---|---|---|
| Control | Python kernel | State machine, budgets, atomic disk commits |
| Execution | Claude Agent SDK | Autonomous work in a sandbox |
| Projection | DuckDB | Queryable index for the UI |
The kernel is deterministic. The agent is stochastic. The database is ephemeral (rebuildable from the event journal).
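Because the database is only a projection, dropping it costs nothing: replaying the journal rebuilds it. A toy sketch (the event shapes are hypothetical, and the real projection is DuckDB, not a dict):

```python
import json

def rebuild_projection(journal_lines: list[str]) -> dict:
    # Replay append-only events into a fresh queryable index.
    steps: dict = {}
    for line in journal_lines:
        event = json.loads(line)
        if event["type"] == "step_started":
            steps[event["step"]] = {"status": "running"}
        elif event["type"] == "step_finished":
            steps[event["step"]] = {"status": "done", "exit_code": event["exit_code"]}
    return steps

journal = [
    '{"type": "step_started", "step": "build"}',
    '{"type": "step_finished", "step": "build", "exit_code": 0}',
]
```

Replay is deterministic: the same journal always yields the same index, so the index itself never needs a backup.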
Step lifecycle:
- Work — Agent executes with full autonomy
- Finalize — Structured handoff envelope extracted from hot context
- Route — Next step determined from forensic evidence, not prose
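Routing on forensic evidence rather than prose can be sketched as a pure function of the receipt (the field names and thresholds here are assumptions for illustration):

```python
def route(receipt: dict) -> str:
    # The router never reads the agent's summary; only measured fields.
    if receipt["tests_exit_code"] != 0:
        return "bounce_to_author"      # a critic found a real failure
    if receipt["coverage"] < 0.80:
        return "request_more_tests"
    return "advance"
```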
Flow Studio orchestrates work in repos of any language. It's implemented in Python (kernel) and TypeScript (UI).
| Principle | What it means |
|---|---|
| Forensics over narrative | Trust the git diff, the test log, the receipt. Not the agent's claim. |
| Verification is the product | Output is code + the evidence needed to trust it. |
| Steps, not sessions | Each step has one job in fresh context. No 100k-token confusion. |
| Adversarial loops | Critics find problems. Authors fix them. They never agree to be nice. |
| Resumable by default | Kill anytime. Resume from last checkpoint. Zero data loss. |
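The adversarial-loop principle reduces to a small control structure: the critic must come back empty before the draft advances. The callables and the toy fix-one-problem-per-round behavior below are illustrative, not the system's actual agents:

```python
def adversarial_loop(author, critic, max_rounds: int = 3):
    # Author drafts; critic attacks; repeat until the critic finds
    # nothing or the round budget runs out. They never agree to be nice.
    draft = author(None)
    for _ in range(max_rounds):
        problems = critic(draft)
        if not problems:
            return draft, "clean"
        draft = author(problems)
    return draft, "budget_exhausted"

# Toy roles: the author fixes one problem per round; the critic keeps
# objecting until the draft reaches quality level 2.
def toy_author(feedback):
    return 0 if feedback is None else feedback["draft"] + 1

def toy_critic(draft):
    return None if draft >= 2 else {"draft": draft}
```

The budget matters: without it, a stubborn critic and a stuck author would loop forever, which is exactly the failure mode the kernel's budgets exist to cap.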
make dev-check # Validate swarm health (run before commits)
make selftest # Full 16-step validation
make kernel-smoke # Fast kernel check (~300ms)
make stepwise-sdlc-stub # Zero-cost demo run
make help # All commands

| Time | Document | What you'll learn |
|---|---|---|
| 10 min | GETTING_STARTED.md | Run the demo, see it work |
| 20 min | TOUR_20_MIN.md | Understand the full system |
| 5 min | MARKET_SNAPSHOT.md | Why this approach, why now |
| Topic | Document |
|---|---|
| Flow Studio UI | FLOW_STUDIO.md |
| Stepwise execution | STEPWISE_BACKENDS.md |
| Reviewing PRs | REVIEWING_PRS.md |
| Adopting for your repo | ADOPTION_PLAYBOOK.md |
| Full reference | CLAUDE.md |
| Topic | Document |
|---|---|
| The AgOps philosophy | AGOPS_MANIFESTO.md — Steven Zimmerman |
| What this system is | TRUST_COMPILER.md |
| 15 lessons learned | META_LEARNINGS.md |
| 12 emergent laws | EMERGENT_PHYSICS.md |
Flow Studio is built for teams who want evidence, replayability, and change control around agentic work. You'll have a better time if:
- You can run Python locally and in CI (the repo assumes uv)
- You have an opinionated place to store run artifacts (logs, snapshots, traces)
- You have a stance on secrets (source, access, rotation)
- You treat "automation with guardrails" as a feature: validation, schemas, and CI checks
- You have a human review path when the system is uncertain
Start with a known-good reference execution: GOLDEN_RUNS.md.
See ADOPTION_PLAYBOOK.md for the full onboarding guide.
Contributions are welcome. Before submitting:
- Run make dev-check to validate the swarm
- Run make selftest for full validation
- Follow existing patterns in swarm/ and .claude/
See CLAUDE.md for the full reference on how the system works.
Something broken? Open an issue.
- EffortlessMetrics/demo-swarm — Portable .claude/ swarm pack for your own repo
Apache-2.0 or MIT