Chipagosfinest/software-factory


Software Factory

A research corpus on agent-native software delivery. Nine topology patterns, 50+ curated references, production systems analysis (Stripe, Spotify, Ramp, Coinbase, Brex), and a working reference implementation.

Core thesis: Agents produce working but low-quality code, and instruction-based constraints (CLAUDE.md, AGENTS.md) don't enforce quality. The solution is programmatic verification — deterministic checks + LLM judge + bounded retries. How you wire the agents together matters more than how smart they are.


Table of Contents

  1. Why Software Factories
  2. Design Dimensions
  3. Topology Walkthroughs — 9 patterns, side by side
  4. Choosing a Topology — decision matrix by problem shape
  5. What Doesn't Work (Failure Modes)
  6. Reference Implementation — architecture, agents, safety layers
  7. Verification & Quality
  8. Context Engineering
  9. Agent Security — emerging category (CrabTrap, OpenShell, etc.)
  10. Research Corpus — production systems, papers, platforms
  11. License

Why Software Factories

More than twenty companies have proven the pattern at scale and published the receipts. All converged on the same primitives: isolated sandboxes, PRs as review gates, and human review before merge.

Company System Scale Key Insight
Uber Minion / uReview / Autocover / Shepherd 92% monthly eng adoption, 11% of PRs opened by agents, uReview analyzes 90% of ~65K weekly diffs, Autocover saved 21K dev-hours Four-layer agent stack with specialized internal products
Cloudflare Internal AI eng stack (self-hosted on their own products) 93% R&D adoption, MRs grew 5,600 → 8,700/wk in 11 months, 51B+ tokens/mo Dogfooding AI Gateway + Workers + Sandboxes as the platform
Meta Capacity Efficiency AI + MetaMateCR paper Capacity AI compressed ~10hr investigations to ~30min, recovered hundreds of megawatts; separate MetaMateCR paper reports ~68% exact-match patches Unified agent marketplace with MCP-layer tools shared across defense/offense agents
Anthropic Claude Code dogfooding + Managed Agents 200% growth in output per engineer, substantive reviews 16% → 54%, 84% of 1000+ line PRs get findings "Decouple brain from hands" harness thesis
LinkedIn CAPT (Contextual Agent Playbooks & Tools) 1,000+ engineers, 500+ playbooks, 70% triage time drop, 3x data workflows Organizational context as the agent primitive
Stripe Minions 1,300 PRs/week Goose fork + devboxes (10s spin-up) + 400 MCP tools
Spotify Honk 1,500+ AI-generated PRs from Honk; separately, their Fleet Management org reports automating ~50% of internal PRs across all tooling Containerized K8s + verification loops + LLM judge
Ramp Inspect ~50% of merged PRs Modal snapshot warm pools + multiplayer sessions
Coinbase Forge (née Cloudbot) 5% of merged PRs, 150h → 15h cycle time Cloud sandboxes + Slack-first + agent councils
Block AI-Assisted Development at Block 95% eng adoption, Champions program (50 devs × 30% time) AI-readiness architecture for 40K-file monorepo
DoorDash Collaborative AI Ecosystem 25K hrs saved via Alteryx automation alone 4-stage progression: workflows → single agents → deep agents → swarms
Cognition How Cognition Uses Devin to Build Devin Hundreds of Devin PRs merged per week internally (vendor-reported) Agent-to-agent REST API triggers
Nubank Devin deployment 100K data class migrations; 12x efficiency claimed (vendor-reported via Cognition case study) Fine-tuned Devin on the Nubank codebase
OpenAI Harness Engineering 1M lines, 1,500 PRs Environment-first, 3.5 PRs/engineer/day
Airbnb Large-scale test migration with LLMs 3.5K files Enzyme → RTL in 6 weeks, 75% auto in 4 hrs, 97% after 4 days Canonical batch migration harness pattern
Google Jules — async coding agent 140K code improvements during beta Ephemeral VM per task, GitHub-issue driven
Pinterest MCP Ecosystem + Tricorder 66K MCP invocations/mo, 844 MAUs Central MCP registry + observability agent
Datadog Bits AI SRE + Dev Agent Observability-driven auto-PR Eval platform for trusting autonomous agents at scale
Shopify Roast framework Open-sourced Ruby DSL Claude Code as a "CodingAgent" cog in structured workflows
Databricks Genie Code 2x a leading coding agent on internal benchmarks Data-specific coding agent with proactive prod monitoring
Vercel AEO tracking for agents Coding agents run in ephemeral Firecracker microVMs as probe fleet

The convergent primitives: isolated sandboxes, queue-based dispatch, PRs as review gates, verification loops, bounded retries, cost governance, MCP tool layer, organizational context injection.

Cross-company survey: Ry Walker's In-House Coding Agents: Build vs Buy (Feb 2026) aggregates many of these systems in one place — useful as a starting overview.


Design Dimensions (Not a Single Spectrum)

Earlier versions of this doc collapsed these into one axis labeled "prescriptive ↔ autonomous." That framing is wrong — it conflates at least four independent dimensions. When picking a topology, score your problem on each:

Dimension Low High
Workflow determinism Agent decides path (Ratchet, One-Shot Tree) Path is a hardcoded graph (Deterministic Graph, Pipeline)
Delegation structure Flat / peer (Mesh, One-Shot Tree) Hierarchical (Org Chart, Hierarchical Mesh)
State coupling Isolated per-agent (One-Shot Tree) Shared workspace (Mesh)
Search / retry freedom Single-shot (One-Shot Tree) Unbounded loops (Ratchet)

A system can be high on determinism but flat on delegation (Deterministic Graph), or flat on delegation but high on state coupling (Mesh). Org Chart and One-Shot Tree both have low search freedom but sit at opposite poles on delegation structure. There is no clean linear ordering — pick the topology whose dimensional profile matches your problem.

Heuristic for scoring your problem:

  • Task well-defined? → higher workflow determinism wins
  • Failures expensive? → lower search freedom, higher determinism
  • Verification cheap? → higher search freedom is affordable
  • Multi-disciplinary work? → hierarchical delegation helps
  • Humans join mid-flight? → shared state coupling is required
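
The scoring heuristic above can be sketched directly. A minimal TypeScript illustration, assuming a yes/no answer to each question; the dimension names come from the table, but the weightings are invented for illustration:

```typescript
// Illustrative: map yes/no answers about a problem onto the four
// design dimensions. The weights are invented; only the dimension
// names come from the table above.
interface Problem {
  wellDefined: boolean;
  failuresExpensive: boolean;
  verificationCheap: boolean;
  multiDisciplinary: boolean;
  humansJoinMidFlight: boolean;
}

interface DimensionProfile {
  workflowDeterminism: number; // 0 = agent decides path, 1 = hardcoded graph
  delegationHierarchy: number; // 0 = flat/peer, 1 = hierarchical
  stateCoupling: number;       // 0 = isolated per-agent, 1 = shared workspace
  searchFreedom: number;       // 0 = single-shot, 1 = unbounded loops
}

function profileProblem(p: Problem): DimensionProfile {
  return {
    // well-defined tasks and expensive failures both push toward determinism
    workflowDeterminism:
      (p.wellDefined ? 0.5 : 0) + (p.failuresExpensive ? 0.5 : 0),
    delegationHierarchy: p.multiDisciplinary ? 1 : 0,
    stateCoupling: p.humansJoinMidFlight ? 1 : 0,
    // cheap verification affords search; expensive failures remove it
    searchFreedom:
      (p.verificationCheap ? 0.5 : 0) + (p.failuresExpensive ? 0 : 0.5),
  };
}
```

Pick the topology whose dimensional profile is closest to the result, per the walkthroughs below.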

Topology Walkthroughs

Nine topology patterns observed across production and research systems, each with tradeoffs, pros/cons, and suitability guidance. A tenth pattern — the Tiered Gate (deterministic fast path + LLM slow path) — is orthogonal to topology (it's a decision/verification pattern that composes inside any topology), so it's covered in Verification & Quality and Agent Security.


1. One-Shot Tree (Stripe Minions)

                    ┌─────────┐
                    │ Dispatch│
                    └────┬────┘
               ┌─────────┼─────────┐
               ▼         ▼         ▼
          ┌────────┐ ┌────────┐ ┌────────┐
          │Agent A │ │Agent B │ │Agent C │
          │(fix)   │ │(fix)   │ │(fix)   │
          └───┬────┘ └───┬────┘ └───┬────┘
              ▼          ▼          ▼
           [PR #1]    [PR #2]    [PR #3]

How it works: Fire-and-forget. Dispatcher fans out issues to independent agents. Each agent receives full context, produces one PR, and dies. No iteration, no feedback between siblings.

Pros Cons
Massively parallel — scales horizontally No cross-agent learning
Simple failure model (each agent is independent) Partial success is the norm — expect first-pass failures
Easy to cost-cap per agent Wasted work on failures — agent re-does context
No coordination overhead Can't handle tasks requiring iteration
  • When to use: Large batches of independent, well-scoped tasks (CI failures, dependency bumps, formatter passes)
  • When NOT to use: Tasks requiring refinement, cross-file reasoning, or multi-step negotiation
  • Production example: Stripe Minions — 1,300 PRs/week
  • Key metric: Throughput (PRs/hour)
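
A minimal sketch of the fire-and-forget dispatch described above. `runAgent` is a hypothetical stand-in for a sandboxed agent invocation; nothing here is taken from Stripe's actual implementation:

```typescript
// Illustrative one-shot fan-out: each task gets an independent agent
// run; failures are tolerated rather than retried, so one failure
// never blocks its siblings (partial success is the norm).
interface Task { id: string; description: string }
interface Result { taskId: string; ok: boolean; prUrl?: string }

async function dispatch(
  tasks: Task[],
  runAgent: (t: Task) => Promise<string>, // stand-in; resolves to a PR URL
): Promise<Result[]> {
  return Promise.all(
    tasks.map(async (t): Promise<Result> => {
      try {
        return { taskId: t.id, ok: true, prUrl: await runAgent(t) };
      } catch {
        return { taskId: t.id, ok: false }; // no retry, no sibling feedback
      }
    }),
  );
}
```

Cost-capping per agent fits naturally here: wrap `runAgent` with a budget check before invocation.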

2. Pipeline (Spotify Honk)

  ┌───────┐    ┌────────┐    ┌───────┐    ┌────────┐    ┌───────┐
  │ Parse │───▶│ Reason │───▶│  Fix  │───▶│ Verify │───▶│ Judge │
  │  logs │    │ about  │    │  code │    │ (tests)│    │(LLM)  │
  └───────┘    └────────┘    └───────┘    └────────┘    └───────┘
                                               │              │
                                               │   ✗ veto     │
                                               ◀──────────────┘
                                          (max 2 retries)

How it works: Linear with one feedback loop. Each stage transforms output and passes it forward. Judge can veto back to Fix (max 2 iterations). Convergence detection stops the loop if the same error repeats.

Pros Cons
Deterministic stages — easy to debug Sequential → throughput limited
LLM judge catches scope creep and phantom fixes (Spotify Honk reports a meaningful veto rate) Judge can become Goodhart's law target
Bounded retries prevent runaway cost Rigid — can't skip stages
Each stage can be optimized independently Single point of failure at judge
  • When to use: Well-understood task types with a clear success signal (CI debug, security patches, incident response)
  • When NOT to use: Novel problem shapes, greenfield feature building, tasks with ambiguous "done" criteria
  • Production example: Spotify Honk (Part 3 reports judge catching a meaningful share of scope-creep / phantom-fix attempts)
  • Key metric: First-pass merge rate + veto catch rate
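
The veto loop with convergence detection can be sketched as follows. The stage functions are hypothetical stand-ins, and the convergence rule (stop when the judge repeats the same veto reason) is a simplified reading of the description above:

```typescript
// Illustrative fix -> verify -> judge loop with bounded retries and
// convergence detection. All three stage functions are stand-ins.
interface Verdict { pass: boolean; reason: string }

async function pipelineFix(
  fix: (feedback?: string) => Promise<string>, // returns a candidate patch
  verify: (patch: string) => Promise<boolean>, // hard gate: build + tests
  judge: (patch: string) => Promise<Verdict>,  // soft gate: LLM judge
  maxRetries = 2,
): Promise<{ patch: string | null; attempts: number }> {
  let lastReason: string | undefined;
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    const patch = await fix(lastReason);
    if (!(await verify(patch))) { lastReason = "tests failed"; continue; }
    const verdict = await judge(patch);
    if (verdict.pass) return { patch, attempts: attempt };
    // convergence detection: the same veto twice means we're stuck
    if (verdict.reason === lastReason) break;
    lastReason = verdict.reason;
  }
  return { patch: null, attempts: maxRetries + 1 }; // flag for a human
}
```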

3. Org Chart (Paperclip)

                 ┌──────────────┐
                 │     CEO      │
                 │(Orchestrator)│
                 │  Budget: $50 │
                 └──────┬───────┘
              ┌─────────┼─────────────┐
              ▼         ▼             ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │ Eng Lead │ │ QA Lead  │ │ Ops Lead │
        │ $20 budg │ │ $15 budg │ │ $15 budg │
        └────┬─────┘ └────┬─────┘ └────┬─────┘
          ┌──┴──┐      ┌──┴──┐      ┌──┴──┐
          ▼     ▼      ▼     ▼      ▼     ▼
        [dev] [dev]  [qa]  [qa]  [sre] [sre]

How it works: Hierarchical delegation. Parent assigns tasks to children, children report upward. Budget flows down — each level gets a sub-allocation enforced by the orchestrator.

Pros Cons
Natural cost governance (budget per branch) Coordination overhead (messages traverse hierarchy)
Clear accountability chain Bottlenecks at parent nodes
Maps to existing org structures Slow to adapt (requires re-delegation)
Easy to add new "departments" Parent becomes context window bottleneck
  • When to use: Complex multi-disciplinary projects where goal decomposition is natural and budget control matters
  • When NOT to use: Fast-moving iterative work, highly parallel independent tasks
  • Production example: Paperclip (57K+ stars as of April 2026)
  • Key metric: Budget adherence + mission→task traceability
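
Budget flowing down the hierarchy is easy to express as an invariant: no node may allocate more to its children than it holds. A small illustrative check, not taken from Paperclip:

```typescript
// Illustrative org-chart budget invariant: each node's children may
// not be allocated more, in total, than the node's own budget.
interface OrgNode {
  name: string;
  budget: number; // dollars allocated to this subtree
  children: OrgNode[];
}

function budgetIsConsistent(node: OrgNode): boolean {
  const allocated = node.children.reduce((sum, c) => sum + c.budget, 0);
  if (allocated > node.budget) return false; // over-allocated at this level
  return node.children.every(budgetIsConsistent);
}
```

An orchestrator enforcing this invariant before each delegation is what gives the topology its natural cost governance.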

4. Mesh (Ramp Inspect)

        ┌────────┐       ┌────────┐
        │Agent A │◀─────▶│Agent B │
        │(review)│       │(fix)   │
        └───┬────┘       └───┬────┘
            │    ╲       ╱   │
            │     ╲     ╱    │
            │   shared state │
            │     ╱     ╲    │
            │    ╱       ╲   │
        ┌───┴────┐       ┌───┴────┐
        │Agent C │◀─────▶│Agent D │
        │(test)  │       │(deploy)│
        └────────┘       └────────┘

How it works: Peer-to-peer. Agents share state through a common workspace (filesystem snapshots, shared memory). No central coordinator. Warm pools mean agents spin up in <2s from snapshots. Humans can join live sessions and co-edit.

Pros Cons
No orchestrator bottleneck Race conditions on shared state
Warm pools → fast spin-up (seconds, not minutes) Hard to reason about global behavior
Humans join mid-session Debugging requires distributed tracing
Good fit for interactive workflows Requires robust snapshot/merge primitives
  • When to use: Interactive multi-agent collaboration, fast experimentation, sessions that humans join mid-flight
  • When NOT to use: Strict audit requirements, regulated domains, fully-batched workloads
  • Production example: Ramp Inspect — ~50% of merged PRs
  • Key metric: Session latency + shared-state conflict rate

5. Ratchet (Karpathy Autoresearch)

       ┌──────────────────────────────────────────┐
       │            NEVER STOP LOOP               │
       │                                          │
       │  ┌──────┐    ┌──────┐    ┌───────┐      │
       │  │ Read │───▶│Modify│───▶│Commit │      │
       │  │state │    │ code │    │(git)  │      │
       │  └──────┘    └──────┘    └───┬───┘      │
       │                              ▼          │
       │                        ┌──────────┐     │
       │                        │   Run    │     │
       │                        │experiment│     │
       │                        └────┬─────┘     │
       │                             ▼           │
       │                      ┌────────────┐     │
       │               ┌──yes─┤ Improved?  ├─no─┐│
       │               ▼      └────────────┘    ▼│
       │          ┌─────────┐            ┌───────┐│
       │          │  KEEP   │            │ RESET ││
       │          │(advance │            │(git   ││
       │          │ branch) │            │ reset)││
       │          └────┬────┘            └───┬───┘│
       │               └────────┬────────────┘    │
       │                        ▼                 │
       │                   LOOP BACK              │
       └──────────────────────────────────────────┘
            ~12 experiments/hour, ~100 overnight
            git history = full audit trail

How it works: Self-directed with no external coordinator. One binary metric (improved / not improved) eliminates ambiguity. Accepts or rejects via git — history becomes the audit trail.

Pros Cons
Runs unattended overnight Only works when metric is clean/binary
Git history = built-in audit Goodhart's law if metric is gameable
Captures tacit optimization knowledge Requires cheap evaluation loop
No coordination cost No multi-objective reasoning
  • When to use: Single-objective optimization with cheap evaluation (performance tuning, hyperparameter search, self-improving code)
  • When NOT to use: Multi-objective work, tasks where "better" is subjective, expensive-to-evaluate metrics
  • Production example: Karpathy — 700 experiments in 2 days, 11% training speedup; Shopify — 19% gain overnight
  • Key metric: Improvement rate × experiment throughput
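
The keep/reset step can be sketched without git by treating state as a value; in the real pattern "keep" advances a branch and "reset" is a `git reset --hard`. Illustrative only:

```typescript
// Illustrative ratchet step: accept a candidate only if the single
// metric improved, otherwise discard it. The history array plays the
// role that git history plays in the real pattern (the audit trail).
interface RatchetState<T> { best: T; bestScore: number; history: number[] }

function ratchetStep<T>(
  state: RatchetState<T>,
  candidate: T,
  score: (s: T) => number, // the one clean metric
): RatchetState<T> {
  const s = score(candidate);
  const improved = s > state.bestScore;
  return {
    best: improved ? candidate : state.best,  // KEEP or RESET
    bestScore: improved ? s : state.bestScore,
    history: [...state.history, s],           // every experiment is recorded
  };
}
```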

6. Sequential Multi-Agent (LangChain Open SWE)

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │ MANAGER  │────▶│ PLANNER  │────▶│PROGRAMMER│────▶│ REVIEWER │
  │          │     │          │     │          │     │          │
  │ Route    │     │ Research │     │ Code in  │     │ Quality  │
  │ task     │     │ codebase │     │ sandbox  │     │ check    │
  └──────────┘     └──────────┘     └──────────┘     └──────────┘

How it works: Different specialized agents with distinct roles. Unlike Pipeline (same agent through stages), each agent has its own system prompt, tools, and reasoning mode. The Planner never writes code; the Programmer never reviews.

Pros Cons
Role specialization → higher quality Context loss at handoffs
Each agent gets scoped system prompt Fixed sequence = slow for simple tasks
Easier to swap out one role Debugging requires role-level tracing
+13.7pp measured harness gain (LangChain) Handoff format is a contract — rigid
  • When to use: Complex feature building where planning, coding, and reviewing require different mindsets
  • When NOT to use: Simple bug fixes (role overhead isn't worth it), highly-parallel batch work
  • Production example: LangChain Open SWE (March 2026) — distills Stripe Minions + Ramp Inspect + Coinbase Forge patterns
  • Key metric: End-to-end PR quality + handoff success rate
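
The handoff contract can be made explicit in types, which is where the "rigid contract" tradeoff comes from. All type and function names below are invented for illustration:

```typescript
// Illustrative role contracts: each role has its own input/output
// type, so "the Planner never writes code; the Programmer never
// reviews" is enforced by the signatures themselves.
interface Plan { steps: string[]; filesToTouch: string[] }
interface Patch { diff: string; planStepsCovered: number }
interface Review { approved: boolean; comments: string[] }

type Planner = (task: string) => Plan;
type Programmer = (plan: Plan) => Patch;               // cannot review
type Reviewer = (patch: Patch, plan: Plan) => Review;  // cannot write code

function runSequence(
  task: string,
  plan: Planner,
  code: Programmer,
  review: Reviewer,
): Review {
  const p = plan(task);       // research the codebase, produce a plan
  const patch = code(p);      // implement in a sandbox
  return review(patch, p);    // quality check against the plan
}
```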

7. Dynamic DAG (AgentConductor)

   Input ─▶ ┌─────────────────┐
            │ RL Orchestrator │  generates topology per task
            └────────┬────────┘
                     ▼
        Topology varies by task:
        ┌────┐     ┌─────────────┐     ┌──────────┐
        │ T1 │ vs  │    T2       │ vs  │    T3    │
        │A─▶B│     │   A─┐       │     │   A──┐   │
        │    │     │   B─┤─▶D    │     │   │  │   │
        │    │     │   C─┘       │     │   ▼  ▼   │
        └────┘     └─────────────┘     │   B  C   │
                                        │   └──┬   │
                                       │      ▼   │
                                       │      D   │
                                       └──────────┘

How it works: Reinforcement-learning-trained orchestrator generates an optimal topology per task. Different tasks produce different agent graphs. Learns from outcomes to improve routing.

Pros Cons
Adapts topology to task shape Training data required
+14.6% on APPS, 68% cost reduction Hard to audit ("why this graph?")
Emergent efficiency gains RL training is expensive
Handles heterogeneous task mix Behavior can drift over time
  • When to use: Heterogeneous task mix at scale where a static topology would over/under-provision most tasks
  • When NOT to use: Small task volume, regulated domains needing deterministic behavior
  • Production example: AgentConductor research — +14.6% APPS benchmark gain
  • Key metric: Task-type weighted quality × cost

8. Deterministic Workflow Graph (Fabro)

   graph workflow {
     lint -> test -> implement -> review -> merge
     implement -> {sandbox, typecheck} [parallel]
     review -> implement [loop, max: 2]
     review -> HUMAN_GATE [approval]
   }

How it works: Human defines a DOT graph with branching, loops, parallelism, and approval gates. CSS-like stylesheets route steps to appropriate models (Opus for implementation, smaller models for linting). Git commits at every stage create checkpoints.

Pros Cons
Fully reproducible + auditable Rigid — can't adapt to novel inputs
Easy to reason about (graph is the spec) Requires manual graph authoring
Trivial to add approval gates Changes require graph edits + deploy
Great for regulated domains Doesn't capture tacit knowledge
  • When to use: Regulated workflows (finance, healthcare), compliance-driven builds, reproducible research
  • When NOT to use: Rapidly-evolving product requirements, exploratory work
  • Production example: Fabro, deterministic graph engines
  • Key metric: Audit completeness + reproducibility rate
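
A minimal sketch of executing the edge list above, with the bounded review → implement loop and the human approval gate. The step names follow the DOT example; the engine is illustrative, not Fabro's:

```typescript
// Illustrative walk of the deterministic graph above: linear stages,
// a review -> implement loop capped at 2 iterations, and a human
// approval gate before merge. `reviewPasses` stands in for the real
// review outcome at each attempt.
type Step = "lint" | "test" | "implement" | "review" | "merge" | "HUMAN_GATE";

function walk(
  reviewPasses: (attempt: number) => boolean,
  maxLoops = 2,
): Step[] {
  const trail: Step[] = ["lint", "test", "implement"];
  for (let attempt = 1; ; attempt++) {
    trail.push("review");
    if (reviewPasses(attempt)) break;
    if (attempt > maxLoops) return trail; // loop budget exhausted: stop here
    trail.push("implement");              // review -> implement [loop, max: 2]
  }
  trail.push("HUMAN_GATE", "merge");      // approval gate precedes merge
  return trail;
}
```

The trail itself is the audit artifact: every run of the same graph on the same inputs produces the same sequence.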

9. Hierarchical Mesh (OpenAI Symphony-style / Cloudflare Mesh)

                   ┌─────────────────┐
                   │  Orchestrator   │  (reconciliation loop)
                   │ (state machine) │
                   └────────┬────────┘
                           │ polls
              ┌────────────┴────────────┐
              ▼                         ▼
     ┌──────────────┐           ┌──────────────┐
     │Task Cluster A│◀─mesh────▶│Task Cluster B│
     │  ┌──┐  ┌──┐ │           │  ┌──┐  ┌──┐  │
     │  │A1│  │A2│ │           │  │B1│  │B2│  │
     │  └──┘  └──┘ │           │  └──┘  └──┘  │
     └──────────────┘           └──────────────┘
                           ▲
                           │
                    ┌──────┴───────┐
                    │ Convergence  │
                    │  detection   │
                    └──────────────┘

How it works: Orchestrator reconciles desired state against observed state. Within each cluster, agents mesh. Clusters mesh with each other. Exponential backoff + convergence detection prevent runaway retries.

Pros Cons
Handles partial failure gracefully Orchestrator is high-stakes single point
State machine is auditable Complex to implement correctly
Scales across clusters Debugging cross-cluster failures is hard
Cloudflare Mesh-style — federated deploys Requires robust convergence logic
  • When to use: Long-running multi-agent deployments, federated compute across regions, distributed state reconciliation
  • When NOT to use: Simple batch work, small teams
  • Production example: OpenAI Symphony, Cloudflare Mesh (April 2026), internal orchestrators at scale
  • Key metric: Convergence rate + partial-failure recovery
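
The reconciliation loop can be sketched as: observe, diff against desired state, act, back off, stop on convergence. `observe` and `act` are hypothetical stand-ins, and the backoff schedule is recorded rather than slept here for brevity:

```typescript
// Illustrative reconciliation loop: compare desired vs observed state,
// act on the difference, and stop once they match (convergence) or the
// round budget runs out. A real loop would sleep for each backoff.
async function reconcile(
  desired: Set<string>,
  observe: () => Promise<Set<string>>,
  act: (missing: string[]) => Promise<void>,
  maxRounds = 5,
): Promise<{ converged: boolean; rounds: number; backoffMs: number[] }> {
  const backoffMs: number[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    const observed = await observe();
    const missing = [...desired].filter((d) => !observed.has(d));
    if (missing.length === 0) return { converged: true, rounds: round, backoffMs };
    await act(missing);
    backoffMs.push(100 * 2 ** (round - 1)); // exponential backoff schedule
  }
  return { converged: false, rounds: maxRounds, backoffMs };
}
```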

Choosing a Topology

Match the topology to the problem shape. Three key questions: (1) is the task well-defined? (2) how expensive are failures? (3) how cheap is verification?

Decision Matrix

If you have... Use Why
A large queue of well-scoped, independent tasks One-Shot Tree Maximum parallelism, no coordination cost
A repeatable task type with clear success signal Pipeline Deterministic stages, LLM judge catches drift
Complex multi-disciplinary work with budget constraints Org Chart Natural cost governance + accountability
Interactive collaboration with humans joining mid-flight Mesh Shared state, warm pools, low latency
Single-objective optimization with cheap eval Ratchet Self-directed, git-audited, runs overnight
Feature building requiring planning + coding + review Sequential Multi-Agent Role specialization, scoped prompts
Heterogeneous task mix at scale Dynamic DAG Topology adapts per task
Regulated/compliance-critical workflows Deterministic Graph Full audit + reproducibility
Long-running distributed deployments Hierarchical Mesh State reconciliation + convergence

Anti-Patterns

Don't do this Because
One-Shot Tree for tasks requiring iteration Agents can't refine — they burn context on failure
Pipeline for greenfield feature building Rigid stages don't fit ambiguous "done" criteria
Org Chart for fast iteration Coordination overhead dominates
Mesh for audited workflows Shared state is hard to reconstruct post-hoc
Ratchet with multi-objective metrics No binary accept/reject → no ratcheting
Dynamic DAG at small task volume Training cost exceeds savings

Combining Topologies

Production systems rarely use one topology — they compose. Common combos:

  • One-Shot Tree + Pipeline: Dispatcher fans out to agents, each agent runs a pipeline internally (Stripe Minions does this)
  • Mesh + Ratchet: Agents mesh during the day, individual agents ratchet overnight (Shopify autoresearch pattern)
  • Deterministic Graph + Sequential Multi-Agent: Graph defines the sequence, agents specialize within each node (common in enterprise SDLC)
  • Hierarchical Mesh + Pipeline: Orchestrator reconciles, each agent runs a pipeline (OpenAI Symphony-style)
  • Pipeline + Tiered Gate verification: Pipeline at each stage uses a deterministic-first / LLM-fallback verifier (Spotify Honk's verification, this repo's 3-layer quality loop)
  • One-Shot Tree + Tiered Gate security: Fan-out agents, each request they emit passes through a deterministic-first / LLM-fallback security gate (Brex CrabTrap in front of a fleet)

See docs/potent-combos.md for the full combinatorial analysis.


What Doesn't Work (Failure Modes)

The corpus above over-indexes on successes because companies only publish wins. Honest failure patterns to know about:

  • Devin critique (Answer.AI, 2025) — Analysis of Devin's actual completion rate on real tasks (well below marketing claims). Key lesson: vendor-reported agent benchmarks routinely don't reproduce.
  • Alibaba SWE-CI benchmark — 75% of agents break working code across consecutive PRs. Quality regresses over agent-turns.
  • Klarna's AI customer service walkback (2024) — Klarna reversed its "AI replaces 700 agents" story quietly by rehiring humans; widely cited as a cautionary tale for over-optimistic agent deployment.
  • CodeRabbit analysis of 470 AI PRs — AI PRs have 1.7x more issues than human PRs; visible quality lag.
  • LLM judge gaming / Goodhart's law — Agents optimized to satisfy an LLM judge will game readability without substance; why tests must remain the hard gate.
  • Context rot — Chroma Research shows context quality degrades over agent turns; long-horizon agents need active summarization or reset.
  • Ona escape / veto bypass patterns — Agents finding ways around verification gates (commit bypassing, ignoring lint with comments, etc.); covered in docs/ corpus notes.

If your topology choice has no theory of failure, it's not a real design — it's optimism.

Reference Implementation

The codebase in this repo implements a Pipeline topology with One-Shot Tree dispatch — GitHub webhooks fan out to independent agents, each of which runs through parse → reason → fix → verify → judge.

Architecture

GitHub Webhooks / Cron / Linear Issues
              │
        Event Router (src/router.ts)
              │
     ┌────────┼────────┬──────────┬──────────┐
     │        │        │          │          │
  PR Review  CI Debug  Security  Incident  Merge
   Agent      Agent    Agent     Agent     Agent
     │        │        │          │          │
     └────────┴────────┴──────────┴──────────┘
              │                   │
        Governance Layer    GitHub API
       (gate → budget →    (PRs/comments)
        breaker → timeout)
              │
        Human Review Gate
       (all output = PRs)

Two execution paths:

  1. Webhook-driven (reactive) — GitHub Webhook → EventRouter → BullMQ → Worker → Agent → GitHub API. Triggers: PR opened/updated, CI failure, Dependabot alert.
  2. Orchestrator-driven (proactive) — Linear Issue → Orchestrator Poll → Reconciler → Workspace → BullMQ → Agent → GitHub API. Symphony-style, disabled by default.

Both paths converge at the BullMQ queue and share the same Agent Runner.
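
The routing step can be sketched as a switch over webhook event names. The GitHub event names are real; the agent identifiers and routing table are illustrative, loosely mirroring the agent list in this README, not the actual `src/router.ts`:

```typescript
// Illustrative event router: map an incoming webhook event to the
// agent that owns it. Unrouted events are dropped, never guessed.
type AgentName =
  | "pr-reviewer"
  | "ci-debugger"
  | "security-patcher"
  | "incident-responder"
  | "merge-resolver";

function routeEvent(
  event: string,
  payload: Record<string, unknown>,
): AgentName | null {
  switch (event) {
    case "pull_request":
      return "pr-reviewer";
    case "check_suite":
      // only failed check suites trigger the CI debugger
      return payload["conclusion"] === "failure" ? "ci-debugger" : null;
    case "dependabot_alert":
      return "security-patcher";
    default:
      return null;
  }
}
```

The routed name would then become the BullMQ job type that a worker picks up.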

Agents

Agent Trigger Output Constraints
PR Reviewer pull_request.opened / synchronize Review comments + approve/request changes Cannot create PRs
CI Debugger check_suite.completed (failure) Diagnosis comment + fix PR Max 10 files, 200 lines
Security Patcher dependabot_alert.created Patch PR Lockfiles only
Incident Responder PagerDuty / custom alert RCA + fix PR Max 10 files, 200 lines
Merge Resolver PR with conflict label Conflict resolution commit Max 20 files, 500 lines

Safety & Governance

Five layers checked in order on every agent run:

  1. Global Daily Budget — Hard cap across all agents (default $20/day). Warns at 80%.
  2. Executor Gate — Kill switch via executor_gate.json. Hot-reloaded on every check.
  3. Per-Agent Governance — File patterns, max files/lines changed, cost limit, PR creation rights.
  4. Circuit Breaker — Per-API (OpenRouter, GitHub, Linear). Opens after 3 consecutive failures, half-open test after 60s.
  5. Timeout — Agent runs race against configurable timeout (default 300s).
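
Layer 4 can be sketched directly from the parameters in the text (opens after 3 consecutive failures, half-open after 60s). The class below is an illustrative implementation, not the repo's actual code; the injectable clock exists only to make it testable:

```typescript
// Illustrative per-API circuit breaker: opens after `threshold`
// consecutive failures, allows a half-open probe after `cooldownMs`,
// and closes again on the next success.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold = 3,
    private readonly cooldownMs = 60_000,
    private readonly now: () => number = Date.now, // injectable clock
  ) {}

  canRequest(): boolean {
    if (this.openedAt === null) return true;              // closed
    return this.now() - this.openedAt >= this.cooldownMs; // half-open probe
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null; // close the circuit
  }

  recordFailure(): void {
    this.failures++;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

One instance per upstream API (OpenRouter, GitHub, Linear) keeps a flaky dependency from consuming the whole daily budget.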

Tech Stack

Layer Technology Why
Runtime Node.js + TypeScript Type safety, ecosystem
Server Hono Lightweight, fast
LLM OpenRouter Model-agnostic (Claude, GPT, Gemini)
GitHub Octokit + GitHub App auth PR creation, review comments
Queue BullMQ + Redis Reliable event processing
Local DB SQLite (better-sqlite3) Audit logs, agent state
External DB Supabase (PostgreSQL) Signals, validations
Sandbox Docker / git worktrees Isolated execution per agent

Quick Start

git clone https://github.com/Chipagosfinest/software-factory.git
cd software-factory
npm install
cp .env.example .env
# Configure: GitHub App credentials, OpenRouter API key, Redis URL
npm run dev

Verification & Quality

The quality problem is industry-wide. Alibaba's SWE-CI benchmark shows 75% of agents break working code. AI PRs have 1.7x more issues than human PRs (CodeRabbit, 470 PRs). Three layers address this:

  Agent generates code
         │
  ┌──────▼──────────────────────────────────────────┐
  │  Layer 1: Deterministic Checks (free, fast)     │
  │  AST complexity, duplication, linting, types    │
  └──────┬──────────────────────────────────────────┘
         │ all pass
  ┌──────▼──────────────────────────────────────────┐
  │  Layer 2: Test Execution (hard gate)            │
  │  Build succeeds? Unit tests pass?               │
  └──────┬──────────────────────────────────────────┘
         │ all pass
  ┌──────▼──────────────────────────────────────────┐
  │  Layer 3: LLM-as-Judge (soft signal)            │
  │  Scope check, readability, architecture         │
  └──────┬──────────────────────────────────────────┘
         │
    pass ├──────▶ Open PR for human review
         │
    veto ├──▶ retry (max 2) ──▶ flag for human
  • Layer 1 is the cheap filter — static analysis catches a large class of issues without any LLM call
  • Layer 2 is the hard gate — if it doesn't build/test, it doesn't ship
  • Layer 3 catches subtle quality issues (scope creep, readability)
  • Goodhart's Law risk: agents optimized for LLM judge approval will game the metric — hard constraints are the real gate

The Tiered Gate pattern. Layers 1→3 form a deterministic fast path + LLM slow path — the same shape Brex CrabTrap uses for network security. Static rules handle predictable cases cheaply; LLM judge handles the long tail. This is a verification pattern, not a topology: it composes inside any of the 9 topologies above.
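
The three layers compose into a single short-circuiting function. A sketch, with all three check functions as hypothetical stand-ins:

```typescript
// Illustrative tiered gate: cheap deterministic checks first, tests as
// the hard gate, LLM judge last. Each layer short-circuits so the
// expensive LLM call only runs on patches that already build and test.
interface GateResult { shipped: boolean; stoppedAt?: "static" | "tests" | "judge" }

async function tieredGate(
  patch: string,
  staticChecks: (p: string) => boolean,      // layer 1: free, fast
  runTests: (p: string) => Promise<boolean>, // layer 2: hard gate
  llmJudge: (p: string) => Promise<boolean>, // layer 3: soft signal
): Promise<GateResult> {
  if (!staticChecks(patch)) return { shipped: false, stoppedAt: "static" };
  if (!(await runTests(patch))) return { shipped: false, stoppedAt: "tests" };
  if (!(await llmJudge(patch))) return { shipped: false, stoppedAt: "judge" };
  return { shipped: true }; // open PR for human review
}
```

Note the asymmetry: a judge veto can trigger a retry, but a test failure is never overridable, which is what keeps the Goodhart risk bounded.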

Leaderboard: See the official SWE-bench leaderboard for current rankings. As of April 2026, the Verified split is dominated by Claude Opus 4.7, GPT-5.3-Codex, and Gemini 3.1 Pro (verify exact numbers at the official source — third-party aggregators often lag).


Context Engineering

From Spotify Honk's Part 2 — context engineering is the single biggest lever for agent quality.

Prompt design:

  • Describe the end state, not step-by-step instructions
  • State preconditions — tell the agent when NOT to act
  • Use concrete examples — code snippets heavily influence output quality
  • One change at a time — combining changes exhausts context and produces partial results
  • Define success as tests — "make this code better" fails; "these tests should pass" succeeds

Tool strategy:

  • Spotify: minimal tools (verify MCP, Git, restricted Bash). Context goes in the prompt.
  • Stripe: 400+ MCP tools via Toolshed, pre-hydrated before agent starts.
  • Start constrained, add tools only when prompts aren't enough.

Conditional rules (Stripe pattern):

  • Global agent rules don't work in large codebases — they conflict across domains
  • Apply rules conditionally by subdirectory
  • Use scoped config files (CLAUDE.md, AGENTS.md) per directory
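
Per-directory scoping can be sketched as a walk from the changed file up to the repo root, collecting the nearest rules first. Illustrative, not Stripe's implementation; the directory names in the usage are invented:

```typescript
// Illustrative scoped-config lookup: for a changed file, collect every
// rules file (e.g. CLAUDE.md / AGENTS.md content) found on the path
// from its directory up to the repo root, nearest directory first.
function scopedConfigsFor(
  filePath: string,
  configs: Record<string, string>, // dir path -> rules content
): string[] {
  const found: string[] = [];
  const parts = filePath.split("/").slice(0, -1); // drop the file name
  for (let i = parts.length; i >= 0; i--) {
    const dir = parts.slice(0, i).join("/") || ".";
    if (dir in configs) found.push(configs[dir]);
  }
  return found; // nearest rules first, repo-root rules last
}
```

Nearest-first ordering is what lets a subdirectory's rules override conflicting global ones.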

The context files debate (2026): Princeton study found AGENTS.md adoption reduces runtime 28.6%, saves 16.6% tokens. ETH Zurich/LogicStar counter-study (Gloaguen et al. 2026) found verbose context files hurt performance. The truth is probably: concise, scoped, hierarchical context wins; bloated global rules lose.


Agent Security

An emerging category. As agents gain real credentials and make real API calls, the security surface expands beyond traditional guardrails. April 2026 saw an explosion of entrants — Brex CrabTrap, Permiso SandyClaw, Capsule Security ($7M seed), iron-proxy, agentcage, ToolMesh.

Comparison Matrix

System Layer Approach Best For
CrabTrap (Brex) Transport (HTTP/S proxy) LLM-as-a-judge + static rules Framework-agnostic egress control
NVIDIA OpenShell Kernel (Landlock, seccomp) Docker sandbox + YAML policies Kernel-level isolation
Deconvolute Protocol (MCP) MCP session firewall MCP-heavy architectures
ClawShield Application 5 specialized AI agents + YAML policy PII redaction, prompt injection scanning
Permiso SandyClaw Skill (dynamic detonation) Detonates agent skills in sandbox Cross-framework skill safety
Capsule Security Runtime Behavior control at runtime Enterprise runtime governance
iron-proxy Transport + credential MITM egress + proxy tokens Real secrets never enter sandbox
ToolMesh Protocol (MCP) Self-hosted MCP gateway Enterprise MCP control plane
Superserve Transport (credential) Injects API keys at network level Credential isolation (not request filtering)

CrabTrap Deep Dive

CrabTrap (Brex, open-sourced April 21, 2026) sits between AI agents and external APIs. Two-tier evaluation:

  1. Static rules — Deterministic URL pattern matching (prefix, exact, glob). Deny rules always win. Microsecond execution.
  2. LLM-as-a-judge — When no static rule matches, full request context goes to an LLM with a natural-language security policy. Returns structured ALLOW/DENY with reasoning.
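
Tier 1 can be sketched as deterministic matching with deny-always-wins semantics. The rule kinds (prefix, exact, glob) follow the text; the matcher, including the simplified glob translation, is illustrative rather than CrabTrap's actual code:

```typescript
// Illustrative static rule layer: deterministic URL matching where a
// deny match always wins, and no match at all falls through to tier 2
// (the LLM judge).
interface Rule {
  kind: "exact" | "prefix" | "glob";
  pattern: string;
  action: "allow" | "deny";
}

function escapeRe(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function matches(rule: Rule, url: string): boolean {
  if (rule.kind === "exact") return url === rule.pattern;
  if (rule.kind === "prefix") return url.startsWith(rule.pattern);
  // glob: translate each * to a regex wildcard (simplified)
  const re = new RegExp("^" + rule.pattern.split("*").map(escapeRe).join(".*") + "$");
  return re.test(url);
}

function staticVerdict(rules: Rule[], url: string): "allow" | "deny" | null {
  const hits = rules.filter((r) => matches(r, url));
  if (hits.some((r) => r.action === "deny")) return "deny"; // deny always wins
  if (hits.some((r) => r.action === "allow")) return "allow";
  return null; // no static match: escalate to the LLM judge
}
```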

Production findings from Brex:

  • Policies derived from observed traffic beat hand-written rules
  • LLM judge fires on <3% of requests (static rules catch the rest)
  • Audit trail became a discovery tool for tightening agents themselves
  • Prompt injection defense: JSON-encoded payloads, 4KB header cap, 16KB body truncation

Architecture: Go + TypeScript, MIT license, PostgreSQL audit trail, AWS Bedrock with Anthropic models, React admin dashboard.

Limitations: Outbound only (not a WAF), no approval queue, doesn't filter responses, fundamentally probabilistic on the LLM layer.

Sources: GitHub · Brex Blog · HN Discussion


Research Corpus

Internal Engineering Deployments (first-party eng blogs)

These are first-person accounts from company engineering teams — the richest source of real numbers and architecture details. Companies publish them as recruiting signals, so the bylines are real engineers.

Tier 1 — Deep case studies with numbers:

Tier 2 — Strong patterns + production numbers:

Production Systems (agent frameworks + case studies)

| System | Scale | Key Insight |
| --- | --- | --- |
| Stripe Minions Part 1 | 1,300 PRs/week | One-shot tree, 400+ MCP tools, warm EC2 |
| Stripe Minions Part 2 | Zero human-written code | Devboxes, conditional rules, max 2 CI retries |
| Spotify Honk Part 1 | 650+ PRs/month | Containerized K8s fleet management |
| Spotify Honk Part 2 | 60-90% time savings | Context engineering, Claude Code as top agent |
| Spotify Honk Part 3 | Judge catches scope-creep / phantom fixes | Verification loops, LLM judge |
| Spotify x Anthropic (April 2026) | Slack-@mention-driven | Backstage evolving to agent-first MCP platform |
| Ramp Inspect | ~50% of merged PRs | Modal snapshot-based warm pools |
| Ramp on Modal | 1,000 Datadog monitors | Self-maintaining code, auto-generated monitoring |
| OpenAI Harness Engineering | 1M lines, 1,500 PRs | Environment-first, 3.5 PRs/engineer/day |
| OpenAI Codex follow-up posts (Feb–Mar 2026) | | Codex harness evolution: standalone app server + Responses API with computer environment (verify exact URLs on openai.com) |
| Nubank + Devin | 100K migrations | 12x efficiency, 20x cost savings |
| Nubank Agent Infra (Mar 2026) | 131M customers | Clojure-based internal agent infrastructure |
| Coinbase Forge (née Cloudbot) | 5% of merged PRs | 150h → 15h cycle time, agent councils |
| LangChain Open SWE (Mar 2026) | | Open-source framework distilling Minions + Inspect + Forge |

Architecture & Patterns

Research Foundations

The papers that shaped how technical audiences think about coding agents. Missing any of these is a red flag.

IDE / Pair-Programming Agents

These sit on a different axis from background agents — human-in-the-loop editors. Covered here because they shape how developers expect coding agents to feel.

  • Cursor — AI-native IDE, dominant category incumbent
  • Windsurf — Cascade agent mode, Codeium successor IDE
  • Cline — Open-source VS Code autonomous assistant, Plan+Act modes
  • Continue.dev — Open-source IDE extensions (VS Code, JetBrains)
  • JetBrains Junie + Air — Standalone coding agent + structural code awareness
  • Cognition Devin — Cloud IDE with autonomous agent
  • Factory.ai — Droids across 6+ surfaces
  • Replit Agent — Browser-based autonomous agent

Platforms & Ecosystems

Agent Memory & Knowledge

  • Karpathy Autoresearch — Overnight research loop: prepare → train → evaluate → accept/reject
  • Karpathy LLM Wiki Pattern (April 2026) — 18M views, spawned memory-layer discourse
  • QMD (Tobi Lutke) — v2.1.0 (April 2026), 21K+ stars, 96% token reduction on agentic retrieval
  • Napkin — Progressive disclosure memory, BM25, sql.js, 99.8% recall on HotpotQA
  • TigerFS — Mount databases as POSIX directories
  • AgentFS (Turso) — Database-as-directory for agent workspaces

Strategy & Cost

Full Research Corpus

The docs/ directory contains 34+ research documents covering sandbox architectures, agent memory systems, harness engineering, context engineering, agent topologies, competitive landscape, and more. See AGENTS.md for a navigable index.


License

MIT
