From a3d49a36f80ac44e0daf24d63b294a34129904dd Mon Sep 17 00:00:00 2001
From: Preetam Dwivedi <preetam@uber.com>
Date: Mon, 8 Jun 2026 11:47:57 -0700
Subject: [PATCH] docs(rfc): stovepipe post-merge trunk-validation workflow
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

### Why?

SubmitQueue can no longer prove every change green before merge at the monorepo's current throughput, so it now merges directly to a single `main` that may be temporarily broken. Stovepipe is the post-merge service that validates `main`, records per-commit health, and drives recovery. There was no design doc describing its pipeline; this RFC fills that gap, mirroring `doc/rfc/submitqueue/workflow.md` so the two services read as siblings.

### What?

Adds `doc/rfc/stovepipe/workflow.md` describing the end-to-end pipeline: ingest trunk push events (external webhooks plus a fallback reconciliation poller, deduped on commit SHA) → start → validate → batch commits since the last known green → speculate / build / buildsignal → on green record `succeeded`; on failure bisect to the offending commit → invoke a pluggable remediation extension whose external backend lands a fix or revert via SQ.

Documents the three commit states (`unknown` / `succeeded` / `failed`), the SHA-as-identity / batch-as-validation-unit tracking model (bisect owns termination), and two gateway-owned sinks the orchestrator publishes to and the gateway consumes: `status` (the commit-status store callers query) and `log` (an append-only event log, the analogue of SQ's request log). Also covers fail-closed DLQ reconciliation, ownership by service, and the status/log ownership invariant. Links the new doc from `doc/rfc/index.md` under a new `## Stovepipe` section.
---
 doc/rfc/index.md              |   4 +
 doc/rfc/stovepipe/workflow.md | 169 ++++++++++++++++++++++++++++++++++
 2 files changed, 173 insertions(+)
 create mode 100644 doc/rfc/stovepipe/workflow.md

diff --git a/doc/rfc/index.md b/doc/rfc/index.md
index 1305a909..a16eedfa 100644
--- a/doc/rfc/index.md
+++ b/doc/rfc/index.md
@@ -10,3 +10,7 @@ Design documents and technical proposals, grouped by scope. Shared/cross-cutting
 
 - [Orchestrator Workflow](submitqueue/workflow.md) - Queue-driven controller pipeline from gateway entry through batching, scoring, build, merge, and conclude
 - [Build Runner](submitqueue/build-runner.md) - Vendor-agnostic BuildRunner interface, provider-neutral BuildStatus lifecycle, and how the orchestrator wires it into the build stage
+
+## Stovepipe
+
+- [Stovepipe Workflow](stovepipe/workflow.md) - Post-merge trunk-validation pipeline: ingest trunk push events (webhook + fallback poll), batch since last green, build to validate, record per-commit health, bisect to the offending commit, hand off to a remediation extension
diff --git a/doc/rfc/stovepipe/workflow.md b/doc/rfc/stovepipe/workflow.md
new file mode 100644
index 00000000..69fc896e
--- /dev/null
+++ b/doc/rfc/stovepipe/workflow.md
@@ -0,0 +1,169 @@
+# Stovepipe Workflow
+
+Stovepipe is the post-merge trunk-validation service: it consumes a stream of commits pushed to the trunk (`main`), validates them in batches *after* they land, and records a per-commit health state that downstream systems gate on. It exists because SubmitQueue (SQ) can no longer afford to prove every change green *before* merge at the throughput the monorepo now sees — so SQ merges directly to a `main` that may be temporarily broken, and Stovepipe is the system that finds the breakages, names the offending commit, and drives recovery.
+
+Like SQ, the orchestrator is a queue-driven pipeline of small, single-purpose controllers. Each controller consumes one topic, advances a commit or batch, and publishes to the next topic. Most hops carry only an ID — the controller fetches the entity from storage — while the entry point carries the full payload because there is no row to fetch yet. The pipeline has two cycles: `speculate → build → buildsignal → bisect → speculate` (the build / bisection loop that narrows a failure to a single offending commit) and `conclude → batch` (advance to the next range once a green is established). `conclude` is the only stage that assigns a commit its terminal status. `status` and `log` are two gateway-owned sinks the orchestrator publishes to and the gateway consumes: `status` carries commit-health transitions into the commit-status store, and `log` is an append-only event log of what happened to each commit — the direct analogue of SQ's request-log sink.
+
+## Commit states
+
+Every trunk commit Stovepipe tracks is in one of three states. Callers — deployment systems and developer tooling — read this state to decide whether a commit is safe to act on.
+
+- **`unknown`** — the commit has landed on `main` but has not yet been validated. This is the default the moment a commit is ingested. Most commits between two validated points sit here, because validation is batched rather than per-commit.
+- **`succeeded`** (green) — the relevant targets build and test successfully at this commit.
+- **`failed`** (not green) — a target is broken at this commit.
+
+"Green" is ultimately *subjective per target/project*: a commit can be green for one team's targets and broken for another's. Stovepipe starts with a binary repo-level state and evolves toward per-target/per-project granularity; the state machine and the caller contract are the same either way.
+
+## Identity and tracking
+
+Identity is established at the gateway on ingest, but Stovepipe does not mint a synthetic per-event request ID: the **commit SHA** (scoped by repository and branch) is the identity and the dedup key. Keying on the SHA is what makes ingestion idempotent — a commit announced by both a webhook and a poll backfill resolves to the same record and is processed once.
+
+A validation attempt over a contiguous range of commits is a **Batch**, identified by a BatchID. The batch is the unit that carries a build and, through it, a pass/fail result.
+
+Bisection needs no separate tracking machinery: when a batch's build fails, `bisect` splits the range into smaller sub-ranges, each of which is just another Batch driven through the same `speculate → build → buildsignal` loop. The state of the search lives in those ordinary batch results, and `bisect` — not `buildsignal` — owns the decision of when the search is over:
+
+- A probe that builds **green** does *not* end the search and does *not* advance the trunk; it only proves its commits good and shrinks the suspect range, so the result returns to `bisect` for the next probe.
+- A probe that builds **red** narrows the suspect range to its lower half.
+- When the suspect range is a **single commit** — including the trivial case where the failing range was one commit to begin with, so there is nothing left to split — that commit is the offender. `bisect` routes it to `conclude`, which marks it `failed` and hands it to `remediate`.
+
+The commits proven good along the way are marked `succeeded`, letting the green pointer advance to the last good one; the commits after the offender stay `unknown` until a fix lands and re-validation reaches them.
+
+## Ingestion and completeness
+
+Trunk push events arrive as **external webhook events**, modeled as messages on the queue: when SQ merges a commit to `main`, a webhook notifies Stovepipe, which records the commit (`unknown`) and hands it into the pipeline. Webhooks give low-latency ingestion in the common case.
+
+Webhooks are a latency *optimization*, not a completeness *guarantee*. They can be delayed for hours, arrive out of order, or be dropped entirely — and a missed commit means a hole in trunk coverage that no one notices until something gates on an `unknown` that should have been validated. So ingestion does not depend on webhook reliability. A **fallback reconciliation poller** periodically diffs the last-ingested trunk SHA against the actual `main` HEAD and backfills any gap, publishing the missing commits into the same entry path. With the poller running on a fixed cadence, no landed commit is missed even if webhooks are fully down.
+
+The two producers — webhook and poller — converge through the SHA-keyed idempotency described above, so nothing downstream assumes a commit is seen only once. Commits are processed in trunk order (committer-timestamp / topological), and a batch is a contiguous range of commits since the last known green. Because the green pointer and per-commit state are persisted, the system must be resilient to history rewrites — a previously validated commit that is no longer present on the branch — and converge rather than wedge when that happens.
+
+## Workflow
+
+```
+   push events ─┐                                     ┌──────────────────────┐
+   (webhook)    │   ┌─────────────────────────────┐   │ gateway: status      │
+                ├──►│ gateway: webhook + poll     │┌─►│ Commit-status store  │ ◄─ GetStatus (RPC):
+   main HEAD  ──┘   │ Ingest pushes; fallback     ││  └──────────────────────┘    deployment & dev
+   (poll)           │ poll backfills missed SHAs  ││  ┌──────────────────────┐    tooling query it
+                    └──────────────┬──────────────┘│  │ gateway: log         │
+                                   │ PushEvent     ├─►│ Append-only event log│
+                                   ▼               │  └──────────────────────┘
+                    ┌─ orchestrator ──────────────┐│
+                    │ start                       ├┘  orchestrator publishes status
+                    │ Record Commit (unknown) by  │   + log events (any stage)
+                    │ SHA; emit status + log      │
+                    └──────────────┬──────────────┘
+                                   │ SHA
+                                   ▼
+                    ┌─────────────────────────────┐
+                    │ validate                    │
+                    │ Resolve commit metadata     │
+                    │ for ordering & batching     │
+                    └──────────────┬──────────────┘
+                                   │ SHA
+        ┌─────────────────────────►▼
+        │ BatchID  ┌─────────────────────────────┐
+        │(advance) │ batch                       │
+        │          │ Aggregate commits since green│
+        │          └──────────────┬──────────────┘
+        │                         │ BatchID
+        │             ┌───────────►▼
+        │             │ ┌──────────────────────────┐
+        │        next │ │ speculate  (stub)        │
+        │       probe │ │ Prepare (sub-)range build│
+        │             │ └────────────┬─────────────┘
+        │             │              │ BatchID
+        │             │              ▼
+        │             │ ┌──────────────────────────┐
+        │             │ │ build                    │
+        │             │ │ Build changed targets    │
+        │             │ └────────────┬─────────────┘
+        │             │       Build  │
+        │             │              ▼
+        │             │ ┌──────────────────────────┐
+        │             │ │ buildsignal              │
+        │             │ │ Record build result      │
+        │             │ └───┬──────────────────┬───┘
+        │             │fail/│                  │ full-range
+        │             │probe│                  │ pass
+        │             │     ▼                  │
+        │             │ ┌──────────────────┐   │
+        │             └─┤ bisect  (stub)   │   │
+        │               │ Narrow to the    │   │
+        │               │ offender         │   │
+        │               └────────┬─────────┘   │
+        │           isolated fail│              │
+        │                        ▼              ▼
+        │               ┌────────────────────────┐
+        │      advance  │ conclude               │
+        └───────────────┤ pass → succeeded,      │
+                        │   advance next batch    │
+                        │ fail → failed,          │
+                        │   then remediate        │
+                        └───────────┬────────────┘
+                                    │ SHA (offender)
+                                    ▼
+                         ┌─────────────────────────┐
+                         │ remediate               │┄┄► remediation
+                         │ Invoke remediation      │   extension →
+                         │ extension for the commit│   external fix /
+                         └─────────────────────────┘   revert → SQ
+```
+
+Any orchestrator controller can also publish a `log` event (via a `PublishLog` helper) recording what it did; the gateway is the sole consumer that persists those events to the event log. The `status` and `log` sinks are drawn once at the top right to keep the pipeline readable, but they receive events from across the pipeline, not only from `start`.
+
+## Per-controller summary
+
+| Controller | In | Out | One-line role |
+|---|---|---|---|
+| **gateway/webhook** | push event (RPC/HTTP) | start | Receive a trunk push event, publish to the start topic, hand off async |
+| **gateway/poll** | (timer) | start | Fallback reconciler: diff last-ingested SHA vs `main` HEAD, backfill any gap |
+| **gateway/GetStatus** | RPC | — | Read path: callers query a commit's status (optionally scoped to a target/project) |
+| **start** | PushEvent | validate, status, log | Record the Commit as `unknown` keyed by SHA (dedup), emit Recorded status |
+| **validate** | SHA | batch | Resolve the commit metadata (parent, committer time) that ordering and batching need |
+| **batch** | SHA | speculate | Aggregate commits since the last known green into a validation Batch (commit range) |
+| **speculate** (stub) | BatchID | build | Decide the validation strategy and prepare the build for the full range or the next bisection sub-range |
+| **build** | BatchID | buildsignal | Build the batch's changed targets (target analysis happens here) |
+| **buildsignal** | Build | conclude, bisect | Record the build result; a clean full-range build → conclude (green), any failure or bisection probe → bisect |
+| **bisect** (stub) | BatchID | speculate, conclude | Narrow a failing range via sub-batch probes; when the failure is isolated to a single commit, conclude it `failed`, otherwise probe the next sub-range |
+| **conclude** | BatchID | batch, remediate, status, log | Green: mark commits `succeeded` and advance the next batch. Failure: mark the offending commit `failed` and hand off to remediate |
+| **remediate** | SHA | — (extension) | Invoke the remediation extension for the offending commit; an external fix/revert lands via SQ |
+| **status** | StatusEvent | — | Gateway-owned sink: persist the authoritative commit-status store |
+| **log** | LogEvent | — | Gateway-owned sink: persist the append-only event log (audit trail) |
+
+Any controller may publish to `log` (the append-only event log) via a `PublishLog` helper, exactly as in SQ; the table lists it only on the stages that most clearly emit it. There is deliberately no changed-target stage and no scoring stage. Target analysis belongs to `build`, which already needs the changed-target set to know what to compile and test, so a separate stage would only pre-compute what `build` must derive anyway. And commits are validated in trunk order rather than reordered by priority, so there is nothing to score; bisection may eventually use a suspicion-weighted heuristic to place its probes (build the commits most likely to be the culprit first), but that is an optional input to `bisect`, not a stage of its own.
+
+## Remediation handoff
+
+When `bisect` isolates the offending commit, `conclude` marks it `failed` and publishes its SHA to the `remediate` topic — the same decoupled publish-then-consume hop the rest of the pipeline uses, not an inline call. The `remediate` controller consumes that topic and invokes a **remediation extension**: a vendor-agnostic, pluggable interface that is Stovepipe's integration boundary with whatever external system produces the fix. The extension hands the offending commit to that system, which generates a revert or fix and lands it through SQ like any other change.
+
+Stovepipe's responsibility ends at invoking the extension. It does not author or land the fix, and it does not block waiting for one — there is no synchronous "wait for green" stage. The fix lands on `main` as an ordinary commit and re-enters Stovepipe through the normal ingest-and-validate path, where it is validated like anything else and the trunk returns to green. This keeps the pipeline non-blocking and the external remediation system fully decoupled behind the extension.
+
+## DLQ reconciliation
+
+Every *consumed* primary pipeline topic above is paired with a `{topic}_dlq` subscription consumed by a dedicated DLQ controller. The `status` and `log` topics are the exception: the orchestrator only publishes to them (the gateway is the sole consumer that persists commit status and the event log), so they have no orchestrator-side subscription and therefore no DLQ. The consumer framework moves a message to its DLQ once the primary controller returns a non-retryable error or exhausts retries on a retryable one.
+
+The stovepipe-specific risk a DLQ must close: a validation that can never complete must not leave a commit stuck non-terminal. A commit wedged at `unknown` forever is not a neutral outcome — callers gate on status, and an unvalidatable commit that silently stays `unknown` blocks the trunk's green pointer from advancing past it. So the DLQ controllers do not re-attempt the failed work; they decode the payload to recover the affected commit SHA or `BatchID` and drive the entity to a **conservative, not-green terminal state**, so gating stays safe (fail closed, never falsely green) and the pipeline can move on. State writes use the same optimistic-locking CAS as the primary pipeline, so a late primary-pipeline update wins cleanly and a version mismatch is asked back for redelivery.
+
+DLQ consumers are wired with `errs.AlwaysRetryableProcessor` and a very high `Retry.MaxAttempts`, with their own DLQ disabled — the same effectively-non-droppable posture SQ uses. The trade-off is identical: a genuinely unprocessable DLQ message (typically a malformed payload) must be removed by an operator. See `submitqueue/orchestrator/controller/dlq/README.md` for the shared design constraints (simplest possible implementation, reconcile-only, no recovery).
+
+## Ownership by service
+
+Each service owns its own data; the gateway and orchestrator never touch each other's, and the only thing they share is the messaging queue.
+
+### Gateway
+
+The gateway is the boundary of the system and the owner of the commit-status store and the event log. It ingests trunk push events — both from external webhooks and from the fallback poller — and hands them to the orchestrator over the queue. It serves the status query RPC that downstream systems call. And it owns the record of each commit's health and history: it is the only service that reads or writes the commit-status store and the log, writing them both directly as commits are ingested and by consuming the status and log events the orchestrator emits.
+
+### Orchestrator
+
+The orchestrator runs the pipeline that takes a landed commit from `unknown` to a terminal state. It owns the working state of that pipeline — in-flight commits, batches, builds, and bisection bookkeeping — and is the only service that writes it. It drives a batch through validation, re-entering speculation as build results arrive and as bisection narrows a failing range, advances to the next range once a green is established, and hands an isolated offending commit off through the remediation extension. It never persists commit status or log entries itself; it only emits status and log events for the gateway to record.
+
+### Shared: the messaging queue
+
+The two services communicate only through the messaging queue. It is pluggable infrastructure kept in its own database, separate from either service's application data: it carries external push events in, the internal pipeline topics between orchestrator stages, and the status and log events the orchestrator publishes for the gateway to consume.
+
+## Status and log ownership invariant
+
+The commit-status store and the event log have exactly one owner: the **gateway**. The orchestrator only emits status and log events onto the queue; it never persists them. The gateway is the sole consumer of those events and the only writer of both the commit-status store and the log.
+
+This keeps all status and log writes in one service: the orchestrator stays a pure pipeline that emits events, and the gateway owns the records — the health state callers query and the history of what happened — end to end. It is the direct analogue of SQ's request-log ownership invariant.