Commit e45cd9a

docs(rfc): stovepipe post-merge trunk-validation workflow
## Summary

### Why?

SubmitQueue can no longer prove every change green before merge at the monorepo's current throughput, so it now merges directly to a single `main` that may be temporarily broken. Stovepipe is the post-merge service that validates `main`, records per-commit health, and drives recovery. There was no design doc describing its pipeline; this RFC fills that gap, mirroring `doc/rfc/submitqueue/workflow.md` so the two services read as siblings.

### What?

Adds `doc/rfc/stovepipe/workflow.md` describing the end-to-end pipeline: ingest trunk push events (external webhooks plus a fallback reconciliation poller, deduped on commit SHA) → batch commits since the last known green → speculate / build / buildsignal → on green record `succeeded`; on failure bisect to the offending commit → invoke a pluggable remediation extension whose external backend lands a fix or revert via SQ.

Documents the three commit states (`unknown` / `succeeded` / `failed`), the SHA-as-identity and batch-as-validation-unit tracking model (including how bisection reuses the same build loop), the gateway-as-sole-owner of the commit-status store invariant (mirrors SQ's request-log invariant), fail-closed DLQ reconciliation, and ownership by service.

Links the new doc from `doc/rfc/index.md` under a new `## Stovepipe` section.
1 parent 98f76d5 commit e45cd9a

2 files changed

Lines changed: 150 additions & 0 deletions

doc/rfc/index.md

Lines changed: 4 additions & 0 deletions
@@ -10,3 +10,7 @@ Design documents and technical proposals, grouped by scope. Shared/cross-cutting
- [Orchestrator Workflow](submitqueue/workflow.md) - Queue-driven controller pipeline from gateway entry through batching, scoring, build, merge, and conclude
- [Build Runner](submitqueue/build-runner.md) - Vendor-agnostic BuildRunner interface, provider-neutral BuildStatus lifecycle, and how the orchestrator wires it into the build stage

## Stovepipe

- [Stovepipe Workflow](stovepipe/workflow.md) - Post-merge trunk-validation pipeline: ingest trunk push events (webhook + fallback poll), batch since last green, build to validate, record per-commit health, bisect to the offending commit, hand off to a remediation extension

doc/rfc/stovepipe/workflow.md

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
# Stovepipe Workflow

Stovepipe is the post-merge trunk-validation service: it consumes a stream of commits pushed to the trunk (`main`), validates them in batches *after* they land, and records a per-commit health state that downstream systems gate on. It exists because SubmitQueue (SQ) can no longer afford to prove every change green *before* merge at the throughput the monorepo now sees. SQ therefore merges directly to a `main` that may be temporarily broken, and Stovepipe is the system that finds the breakages, names the offending commit, and drives recovery.

Like SQ's orchestrator, the Stovepipe orchestrator is a queue-driven pipeline of small, single-purpose controllers. Each controller consumes one topic, advances a commit or batch, and publishes to the next topic. Most hops carry only an ID (the controller fetches the entity from storage), while the entry point carries the full payload because there is no row to fetch yet. The pipeline has two cycles: `speculate → build → buildsignal → bisect → speculate` (the build / bisection loop that isolates an offending commit) and `conclude → batch` (advance to the next range once a green is established). On the primary path, `conclude` is the only stage that assigns a commit its terminal status; the DLQ controllers described under "DLQ reconciliation" are the fail-closed exception. `status` is the gateway-owned sink that any controller can publish to, mirroring SQ's request-log sink.
## Commit states

Every trunk commit Stovepipe tracks is in one of three states. Callers (deployment systems and developer tooling) read this state to decide whether a commit is safe to act on.

- **`unknown`**: the commit has landed on `main` but has not yet been validated. This is the default from the moment a commit is ingested. Most commits between two validated points sit here, because validation is batched rather than per-commit.
- **`succeeded`** (green): the relevant targets build and test successfully at this commit.
- **`failed`** (not green): a target is broken at this commit.

"Green" is ultimately *subjective per target/project*: a commit can be green for one team's targets and broken for another's. Stovepipe starts with a binary repo-level state and evolves toward per-target/per-project granularity; the state machine and the caller contract are the same either way.
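As a minimal sketch of the caller contract, the three states can be modeled as an enum with a gate on top. The names `CommitState`, `is_terminal`, and `safe_to_act_on` are hypothetical and not part of Stovepipe's API; the sketch assumes the binary repo-level state described above.

```python
from enum import Enum


class CommitState(Enum):
    """Hypothetical model of the three per-commit health states."""

    UNKNOWN = "unknown"      # landed on main, not yet validated (default on ingest)
    SUCCEEDED = "succeeded"  # green: relevant targets build and test successfully
    FAILED = "failed"        # not green: a target is broken at this commit


def is_terminal(state: CommitState) -> bool:
    """A commit is terminal once validation has produced a verdict."""
    return state is not CommitState.UNKNOWN


def safe_to_act_on(state: CommitState) -> bool:
    """Assumed caller-side gate: only a commit proven green is safe.

    `unknown` is treated like `failed` here, which keeps the gate fail-closed.
    """
    return state is CommitState.SUCCEEDED
```

A deployment system following this sketch would refuse both `unknown` and `failed` commits, so a commit that has not been validated yet is never mistaken for a green one.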
## Identity and tracking

Identity is established at the gateway on ingest, but Stovepipe does not mint a synthetic per-event request ID: the **commit SHA** (scoped by repository and branch) is both the identity and the dedup key. Keying on the SHA is what makes ingestion idempotent: a commit announced by both a webhook and a poll backfill resolves to the same record and is processed once.

A validation attempt over a contiguous range of commits is a **Batch**, identified by a BatchID. The batch is the unit that carries a build and, through it, a pass/fail result.

Bisection needs no separate tracking machinery: when a batch's build fails, `bisect` splits the range into smaller sub-ranges, each of which is just another Batch driven through the same `speculate → build → buildsignal` loop. Bisection state is therefore tracked as ordinary batch results (each sub-batch that builds green clears its commits, and each that fails narrows the suspect range) until a single commit is isolated as the boundary between the last green build and the first failing one. That commit's record is marked `failed`; the commits proven green along the way are marked `succeeded`, letting the green pointer advance to the last good one.
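The bisection described above can be sketched as a binary search over a failing range. This is an illustration under stated assumptions, not Stovepipe's implementation: `bisect_failing_batch` and `build_green_at` are hypothetical names, each probe stands in for a sub-batch built at its tip, and the sketch assumes a single culprit whose breakage persists for every later commit in the range.

```python
from typing import Callable, Dict, Sequence


def bisect_failing_batch(
    commits: Sequence[str],
    build_green_at: Callable[[str], bool],
) -> Dict[str, str]:
    """Isolate the first failing commit in a range whose full build failed.

    `commits` are trunk-ordered SHAs after the last known green, and the
    build at the last one is already known to be red. `build_green_at(sha)`
    is a stand-in for driving a sub-batch ending at `sha` through the
    speculate -> build -> buildsignal loop.
    """
    if not commits:
        raise ValueError("a failing batch must contain at least one commit")

    states = {sha: "unknown" for sha in commits}
    lo, hi = 0, len(commits) - 1  # invariant: first failing commit is in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if build_green_at(commits[mid]):
            # Sub-batch commits[lo..mid] built green: clear its commits.
            for sha in commits[lo : mid + 1]:
                states[sha] = "succeeded"
            lo = mid + 1
        else:
            # Sub-batch failed: the culprit is at or before mid.
            hi = mid
    states[commits[lo]] = "failed"
    return states
```

Commits after the isolated culprit stay `unknown` in this sketch: they were only ever built on top of a broken commit, so nothing has been proven about them.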
## Ingestion and completeness

Trunk push events arrive as **external webhook events**, modeled as messages on the queue: when SQ merges a commit to `main`, a webhook notifies Stovepipe, which records the commit (`unknown`) and hands it into the pipeline. Webhooks give low-latency ingestion in the common case.

Webhooks are a latency *optimization*, not a completeness *guarantee*. They can be delayed for hours, arrive out of order, or be dropped entirely, and a missed commit means a hole in trunk coverage that no one notices until something gates on an `unknown` that should have been validated. So ingestion does not depend on webhook reliability. A **fallback reconciliation poller** periodically diffs the last-ingested trunk SHA against the actual `main` HEAD and backfills any gap, publishing the missing commits into the same ingest path. With the poller running on a fixed cadence, a landed commit may be ingested late but is never missed, even if webhooks are fully down.

The two producers (webhook and poller) converge through the SHA-keyed idempotency described above, so nothing downstream assumes a commit is seen only once. Commits are processed in trunk order (committer-timestamp / topological), and a batch is a contiguous range of commits since the last known green. Because the green pointer and per-commit state are persisted, the system must be resilient to history rewrites (a previously validated commit that is no longer present on the branch) and must converge rather than wedge when that happens.
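A rough sketch of how the two producers can converge on one SHA-keyed ingest path follows. `CommitStore`, `ingest`, and `reconcile` are hypothetical names, and `trunk_log` is a stand-in for whatever query returns the SHAs on `main` in trunk order. Rescanning from the start when the last-ingested SHA is no longer on the branch is one assumed way to converge after a history rewrite, not a documented behavior.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str, str]  # (repository, branch, sha)


class CommitStore:
    """Hypothetical in-memory record store keyed by repository, branch, and SHA."""

    def __init__(self) -> None:
        self._states: Dict[Key, str] = {}

    def ingest(self, repo: str, branch: str, sha: str) -> bool:
        """Record a commit as `unknown`; return False if it was already known."""
        key = (repo, branch, sha)
        if key in self._states:
            return False  # duplicate announcement: idempotent no-op
        self._states[key] = "unknown"
        return True


def reconcile(
    store: CommitStore,
    repo: str,
    branch: str,
    trunk_log: List[str],
    last_ingested: Optional[str],
) -> List[str]:
    """Backfill commits on the branch that the webhook path never delivered.

    `trunk_log` lists SHAs oldest first. Commits already ingested (for
    example by an out-of-order webhook) are skipped by the SHA-keyed dedup.
    """
    if last_ingested in trunk_log:
        start = trunk_log.index(last_ingested) + 1
    else:
        start = 0  # assumption: rescan everything if the anchor is gone
    return [sha for sha in trunk_log[start:] if store.ingest(repo, branch, sha)]
```

Because both paths call the same `ingest`, a commit announced twice is recorded once, and running the poller repeatedly is harmless.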
## Workflow

```
push events ─┐                                        ┌─ GetStatus (RPC):
(webhook)    │   ┌─────────────────────────────┐      │ deployment systems
             ├──►│ gateway: webhook + poll     │      │ & developer tooling
main HEAD ───┘   │ Ingest pushes; fallback     │      │ read commit status
(poll)           │ poll backfills missed SHAs  │      ▼
                 └──────────────┬──────────────┘  ┌──────────────────────┐
                                │ PushEvent       │ gateway: status      │
                                ▼                 │ Commit-status store  │
                 ┌─────────────────────────────┐  │ (sole writer/reader) │
                 │ ingest / start              │  └──────────┬───────────┘
                 │ Record Commit (unknown)     │             ▲
                 │ keyed by SHA, emit Recorded │─────────────┤
                 └──────────────┬──────────────┘      status │
                                │ SHA                 events │
          ┌────────────────────►▼                 (any stage)│
          │      ┌─────────────────────────────┐             │
 BatchID  │      │ batch                       │             │
 (advance)│      │Aggregate commits since green│             │
          │      └──────────────┬──────────────┘             │
          │                     │ BatchID                    │
          │                     ▼                            │
          │        ┌──────────────────────────┐              │
          │    ┌──►│ speculate (stub)         │◄───┐         │
          │    │   │ Full range vs. sub-range │    │         │
          │    │   └────────────┬─────────────┘    │         │
          │    │ BatchID        │ BatchID          │         │
          │    │                ▼                  │         │
          │    │   ┌──────────────────────────┐    │         │
          │    │   │ build                    │    │         │
          │    │   │ Build the batch's        │    │         │
          │    │   │ changed targets          │    │         │
          │    │   └────────────┬─────────────┘    │         │
          │    │          Build │                  │         │
          │    │                ▼                  │         │
          │    │   ┌──────────────────────────┐    │         │
          │    │   │ buildsignal              │    │         │
          │    │   │ Feed build result back   │    │         │
          │    │   └───┬──────────────────┬───┘    │         │
          │    │  fail │                  │ pass   │         │
          │    │       ▼                  ▼        │         │
          │    │ ┌──────────────┐   ┌────────────────┐       │
          │    └─│ bisect (stub)│   │ conclude       │       │
          │      │ Split range  │   │ Green: mark    │       │
          │      └──────────────┘   │ succeeded,     │       │
          └─────────────────────────│ advance batch. │       │
                                    │ Fail: mark     │       │
                                    │ commit failed. │───────┘
                                    └───────┬────────┘  status
                                            │ SHA       events
                                            ▼
                               ┌─────────────────────────┐
                               │ remediate               │┄┄► remediation
                               │ Invoke remediation      │    extension →
                               │ extension for the commit│    external fix /
                               └─────────────────────────┘    revert → SQ
```
## Per-controller summary

| Controller | In | Out | One-line role |
|---|---|---|---|
| **gateway/webhook** | push event (RPC/HTTP) | ingest | Receive a trunk push event, publish it to the ingest topic, and hand off asynchronously |
| **gateway/poll** | (timer) | ingest | Fallback reconciler: diff last-ingested SHA vs `main` HEAD, backfill any gap |
| **gateway/GetStatus** | RPC | (none) | Read path: callers query a commit's status (optionally scoped to a target/project) |
| **ingest / start** | PushEvent | batch, status | Record the Commit as `unknown` keyed by SHA (dedup), emit Recorded status |
| **batch** | SHA, BatchID (advance) | speculate | Aggregate commits since the last known green into a validation Batch (commit range) |
| **speculate** (stub) | BatchID | build | Decide the validation strategy: full range vs. a bisection sub-range |
| **build** | BatchID | buildsignal | Build the batch's changed targets (target analysis happens here) |
| **buildsignal** | Build | conclude, bisect | Feed the build result back; route pass → conclude, fail → bisect |
| **bisect** (stub) | BatchID | speculate | Split a failing range into sub-batches to isolate the offending commit |
| **conclude** | BatchID | batch, remediate, status | Green: mark commits `succeeded` and advance to the next batch. Failure: mark the offending commit `failed` and hand off to remediate |
| **remediate** | SHA | (none; extension) | Invoke the remediation extension for the offending commit; an external fix/revert lands via SQ |
| **status** | StatusEvent | (none) | Gateway-owned sink: persist status events to the authoritative commit-status store |

There is deliberately no changed-target stage and no scoring stage. Target analysis belongs to `build`, which already needs the changed-target set to know what to compile and test, so a separate stage would only pre-compute what `build` must derive anyway. Commits are also validated in trunk order rather than reordered by priority, so there is nothing to score. Bisection may eventually use a suspicion-weighted heuristic to place its probes (build the commits most likely to be the culprit first), but that is an optional input to `bisect`, not a stage of its own.
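The topology in the table can also be written down as data, which makes the two cycles easy to check mechanically. This is only an illustrative sketch: `ROUTES` transcribes the "Out" column above, while `route_build_result` and `reachable` are hypothetical helpers, not Stovepipe code.

```python
from collections import deque
from typing import Dict, List, Set

# Topic graph transcribed from the "Out" column of the controller table.
ROUTES: Dict[str, List[str]] = {
    "ingest": ["batch", "status"],
    "batch": ["speculate"],
    "speculate": ["build"],
    "build": ["buildsignal"],
    "buildsignal": ["conclude", "bisect"],
    "bisect": ["speculate"],
    "conclude": ["batch", "remediate", "status"],
    "remediate": [],  # leaves the pipeline through the extension
    "status": [],     # gateway-owned sink
}


def route_build_result(passed: bool) -> str:
    """buildsignal's routing rule: pass goes to conclude, fail goes to bisect."""
    return "conclude" if passed else "bisect"


def reachable(start: str) -> Set[str]:
    """Every topic a message can eventually reach from `start` (breadth-first)."""
    seen: Set[str] = set()
    frontier = deque([start])
    while frontier:
        for nxt in ROUTES[frontier.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen
```

A topic that can reach itself sits on a cycle: `bisect` does so through `speculate → build → buildsignal`, and `conclude` does so through `batch`.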
## Remediation handoff

When `bisect` isolates the offending commit, `conclude` marks it `failed` and publishes its SHA to the `remediate` topic. This is the same decoupled publish-then-consume hop the rest of the pipeline uses, not an inline call. The `remediate` controller consumes that topic and invokes a **remediation extension**: a vendor-agnostic, pluggable interface that is Stovepipe's integration boundary with whatever external system produces the fix. The extension hands the offending commit to that system, which generates a revert or fix and lands it through SQ like any other change.

Stovepipe's responsibility ends at invoking the extension. It does not author or land the fix, and it does not block waiting for one: there is no synchronous "wait for green" stage. The fix lands on `main` as an ordinary commit and re-enters Stovepipe through the normal ingest-and-validate path, where it is validated like anything else and the trunk returns to green. This keeps the pipeline non-blocking and the external remediation system fully decoupled behind the extension.
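One way to picture the extension boundary is as a structural interface that the `remediate` controller calls and then forgets about. The names below (`RemediationExtension`, `RecordingExtension`, `remediate_controller`) and the message shape are assumptions made for illustration; the RFC does not define the interface's signature.

```python
from typing import Dict, List, Protocol


class RemediationExtension(Protocol):
    """Hypothetical pluggable boundary to the external fix/revert system."""

    def remediate(self, repo: str, branch: str, sha: str) -> None:
        """Hand the offending commit to the external system and return."""
        ...


class RecordingExtension:
    """Toy backend that only records what it was asked to remediate."""

    def __init__(self) -> None:
        self.handed_off: List[str] = []

    def remediate(self, repo: str, branch: str, sha: str) -> None:
        self.handed_off.append(sha)


def remediate_controller(message: Dict[str, str], extension: RemediationExtension) -> None:
    """Consume one message from the remediate topic and invoke the extension.

    The controller does not wait for a fix to land; the fix re-enters the
    pipeline later as an ordinary commit on main.
    """
    extension.remediate(message["repo"], message["branch"], message["sha"])
```

Swapping `RecordingExtension` for a real backend would not change the controller, which is the point of keeping the external system behind the interface.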
## DLQ reconciliation

Every *consumed* primary pipeline topic above is paired with a `{topic}_dlq` subscription consumed by a dedicated DLQ controller. The `status` topic is the exception: the orchestrator only publishes to it (the gateway is the sole consumer that persists commit status), so it has no orchestrator-side subscription and therefore no DLQ. The consumer framework moves a message to its DLQ once the primary controller returns a non-retryable error or exhausts retries on a retryable one.

The Stovepipe-specific risk a DLQ must close: a validation that can never complete must not leave a commit stuck non-terminal. A commit wedged at `unknown` forever is not a neutral outcome. Callers gate on status, and an unvalidatable commit that silently stays `unknown` blocks the trunk's green pointer from advancing past it. So the DLQ controllers do not re-attempt the failed work; they decode the payload to recover the affected commit SHA or `BatchID` and drive the entity to a **conservative, not-green terminal state**, so gating stays safe (fail closed, never falsely green) and the pipeline can move on. State writes use the same optimistic-locking CAS as the primary pipeline, so a late primary-pipeline update wins cleanly, and a write that hits a version mismatch is returned to the queue for redelivery.

DLQ consumers are wired with `errs.AlwaysRetryableProcessor` and a very high `Retry.MaxAttempts`, with their own DLQ disabled. This is the same effectively-non-droppable posture SQ uses. The trade-off is identical: a genuinely unprocessable DLQ message (typically a malformed payload) must be removed by an operator. See `submitqueue/orchestrator/controller/dlq/README.md` for the shared design constraints (simplest possible implementation, reconcile-only, no recovery).
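The fail-closed rule combined with optimistic locking can be sketched as follows. Everything here is hypothetical (`CommitRecord`, `WorkingStore`, `VersionMismatch`, `reconcile_dlq_message`), and the sketch assumes the not-green terminal state is `failed`; it illustrates the shape of the rule rather than the actual DLQ controllers.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CommitRecord:
    state: str = "unknown"
    version: int = 0


class VersionMismatch(Exception):
    """Raised when a CAS write loses to a concurrent update."""


class WorkingStore:
    """Hypothetical store with compare-and-set on a per-record version."""

    def __init__(self) -> None:
        self.records: Dict[str, CommitRecord] = {}

    def get(self, sha: str) -> CommitRecord:
        return self.records[sha]

    def compare_and_set(self, sha: str, expected_version: int, new_state: str) -> None:
        if self.records[sha].version != expected_version:
            raise VersionMismatch(sha)
        self.records[sha] = CommitRecord(new_state, expected_version + 1)


def reconcile_dlq_message(store: WorkingStore, sha: str) -> str:
    """Drive a dead-lettered commit to a not-green terminal state.

    The failed work is not retried. If the primary pipeline already reached
    a terminal state, that result wins and nothing is written. A
    VersionMismatch propagates so the framework can redeliver the message.
    """
    snapshot = store.get(sha)
    if snapshot.state != "unknown":
        return "already-terminal"
    store.compare_and_set(sha, snapshot.version, "failed")
    return "failed-closed"
```

The important property is that no path through `reconcile_dlq_message` can produce `succeeded`: the controller either leaves an existing verdict alone or fails closed.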
## Ownership by service

Each service owns its own data; the gateway and orchestrator never touch each other's, and the only thing they share is the messaging queue.

### Gateway

The gateway is the boundary of the system and the owner of the commit-status store. It ingests trunk push events, both from external webhooks and from the fallback poller, and hands them to the orchestrator over the queue. It serves the status query RPC that downstream systems call. And it owns the record of each commit's health: it is the only service that reads or writes the commit-status store, writing it both directly as commits are ingested and by consuming the status events the orchestrator emits.

### Orchestrator

The orchestrator runs the pipeline that takes a landed commit from `unknown` to a terminal state. It owns the working state of that pipeline (in-flight commits, batches, builds, and bisection bookkeeping) and is the only service that writes it. It drives a batch through validation, re-entering speculation as build results arrive and as bisection narrows a failing range, advances to the next range once a green is established, and hands an isolated offending commit off through the remediation extension. It never persists commit status itself; it only emits status events for the gateway to record.

### Shared: the messaging queue

The two services communicate only through the messaging queue. It is pluggable infrastructure kept in its own database, separate from either service's application data: it carries external push events in, the internal pipeline topics between orchestrator stages, and the status events the orchestrator publishes for the gateway to consume.
## Commit-status ownership invariant

The commit-status store has exactly one owner: the **gateway**. The orchestrator only emits status events onto the queue; it never persists them. The gateway is the sole consumer of those events and the only writer of the commit-status store.

This keeps all status writes in one service: the orchestrator stays a pure pipeline that emits events, and the gateway owns the record callers query end to end. It is the direct analogue of SQ's request-log ownership invariant.
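The invariant amounts to a single-writer pattern, sketched below with an in-memory queue standing in for the `status` topic. `Gateway`, `consume_status_events`, `get_status`, and `orchestrator_emit` are hypothetical names chosen for the example; the real topic and store are separate infrastructure.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

StatusEvent = Tuple[str, str]  # (sha, state)


class Gateway:
    """Sole owner of the commit-status store in this sketch."""

    def __init__(self, status_topic: Deque[StatusEvent]) -> None:
        self._status_topic = status_topic
        self._commit_status: Dict[str, str] = {}  # private: nothing else writes it

    def consume_status_events(self) -> int:
        """Drain the status topic into the store; return how many were applied."""
        applied = 0
        while self._status_topic:
            sha, state = self._status_topic.popleft()
            self._commit_status[sha] = state
            applied += 1
        return applied

    def get_status(self, sha: str) -> Optional[str]:
        """Read path used by callers; None means the commit is not recorded."""
        return self._commit_status.get(sha)


def orchestrator_emit(status_topic: Deque[StatusEvent], sha: str, state: str) -> None:
    """The orchestrator's only interaction with status: publish an event."""
    status_topic.append((sha, state))
```

Until the gateway consumes an event, callers do not see it, which is the expected consequence of routing every status write through one service.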
