| 1 | +# Stovepipe Workflow |
| 2 | + |
| 3 | +Stovepipe is the post-merge trunk-validation service: it consumes a stream of commits pushed to the trunk (`main`), validates them in batches *after* they land, and records a per-commit health state that downstream systems gate on. It exists because SubmitQueue (SQ) can no longer afford to prove every change green *before* merge at the throughput the monorepo now sees — so SQ merges directly to a `main` that may be temporarily broken, and Stovepipe is the system that finds the breakages, names the offending commit, and drives recovery. |
| 4 | + |
| 5 | +Like SQ, the orchestrator is a queue-driven pipeline of small, single-purpose controllers. Each controller consumes one topic, advances a commit or batch, and publishes to the next topic. Most hops carry only an ID, and the controller fetches the entity from storage; the entry point carries the full payload because there is no row to fetch yet. The pipeline has two cycles: `speculate → build → buildsignal → bisect → speculate` (the build / bisection loop that isolates an offending commit) and `batch → speculate → build → buildsignal → conclude → batch` (the advance loop, closed by the `conclude → batch` hop once a green is established). `conclude` is the only primary-pipeline stage that assigns a commit its terminal status (the DLQ controllers described below are the fallback when validation cannot complete); `status` is the gateway-owned sink that any controller can publish to, mirroring SQ's request-log sink. |
| 6 | + |
| 7 | +## Commit states |
| 8 | + |
| 9 | +Every trunk commit Stovepipe tracks is in one of three states. Callers — deployment systems and developer tooling — read this state to decide whether a commit is safe to act on. |
| 10 | + |
| 11 | +- **`unknown`** — the commit has landed on `main` but has not yet been validated. This is the default the moment a commit is ingested. Most commits between two validated points sit here, because validation is batched rather than per-commit. |
| 12 | +- **`succeeded`** (green) — the relevant targets build and test successfully at this commit. |
| 13 | +- **`failed`** (not green) — a target is broken at this commit. |
| 14 | + |
| 15 | +"Green" is ultimately *subjective per target/project*: a commit can be green for one team's targets and broken for another's. Stovepipe starts with a binary repo-level state and evolves toward per-target/per-project granularity; the state machine and the caller contract are the same either way. |
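The three states and the caller contract can be sketched as follows. This is a minimal illustration, not Stovepipe's actual API: `CommitState` and `is_safe_to_act_on` are hypothetical names, and treating `unknown` as not safe is an assumption of this sketch (chosen to match the fail-closed posture described later in this document).

```python
from enum import Enum


class CommitState(Enum):
    """The three commit states named in this section."""

    UNKNOWN = "unknown"      # landed on main, not yet validated
    SUCCEEDED = "succeeded"  # green
    FAILED = "failed"        # not green


def is_safe_to_act_on(state: CommitState) -> bool:
    """Hypothetical caller-side gate: only a validated green commit passes.

    Assumption of this sketch: `unknown` is gated exactly like `failed`.
    """
    return state is CommitState.SUCCEEDED
```

A deployment system would call the gate with whatever state the status query returns, so an unvalidated commit is never deployed by accident.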
| 16 | + |
| 17 | +## Identity and tracking |
| 18 | + |
| 19 | +Identity is established at the gateway on ingest, but Stovepipe does not mint a synthetic per-event request ID: the **commit SHA** (scoped by repository and branch) is the identity and the dedup key. Keying on the SHA is what makes ingestion idempotent — a commit announced by both a webhook and a poll backfill resolves to the same record and is processed once. |
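A minimal sketch of SHA-keyed idempotent ingestion, assuming an in-memory store. `CommitKey`, `CommitStore`, and `record_if_absent` are hypothetical names introduced for illustration; the real store and its schema are not described in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitKey:
    """Identity of a tracked commit: the SHA scoped by repository and branch."""

    repo: str
    branch: str
    sha: str


class CommitStore:
    """In-memory stand-in for the commit record store (hypothetical)."""

    def __init__(self) -> None:
        self._state: dict[CommitKey, str] = {}

    def record_if_absent(self, key: CommitKey) -> bool:
        """Record the commit as `unknown`; return True only on first sight."""
        if key in self._state:
            return False  # duplicate announcement: nothing to do
        self._state[key] = "unknown"
        return True

    def state_of(self, key: CommitKey) -> str:
        return self._state[key]
```

Because the key is a frozen dataclass, two announcements of the same commit (one from a webhook, one from a poll backfill) compare equal and resolve to the same record.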
| 20 | + |
| 21 | +A validation attempt over a contiguous range of commits is a **Batch**, identified by a BatchID. The batch is the unit that carries a build and, through it, a pass/fail result. |
| 22 | + |
| 23 | +Bisection needs no separate tracking machinery: when a batch's build fails, `bisect` splits the range into smaller sub-ranges, each of which is just another Batch driven through the same `speculate → build → buildsignal` loop. The bisection state is therefore tracked as ordinary batch results — each sub-batch that builds green clears its commits, each that fails narrows the suspect range — until a single commit is isolated as the boundary between the last green build and the first failing one. That commit's record is marked `failed`; the commits proven green along the way are marked `succeeded`, letting the green pointer advance to the last good one. |
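The narrowing described above behaves like a binary search over the failing range. The sketch below is one way to express it, under two stated assumptions: the range contains a single breakage (every commit at or after the culprit fails), and `builds_green_at` is a hypothetical stand-in for driving one sub-batch through `speculate → build → buildsignal`. It is not the `bisect` controller's actual implementation, which this document marks as a stub.

```python
from typing import Callable, Sequence


def isolate_first_failing(
    commits: Sequence[str],
    builds_green_at: Callable[[str], bool],
) -> tuple[list[str], str]:
    """Binary-search a failing range for the first bad commit.

    `commits` is a contiguous range in trunk order that starts right after
    the last known green and whose tip is known to fail. Each probe stands
    in for one sub-batch build. Returns (commits proven green, culprit).
    """
    if not commits:
        raise ValueError("a failing range must contain at least one commit")
    lo, hi = 0, len(commits) - 1  # the culprit's index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if builds_green_at(commits[mid]):
            lo = mid + 1  # everything up to mid is green; culprit is later
        else:
            hi = mid      # culprit is at mid or earlier
    return list(commits[:lo]), commits[lo]
```

The first element of the result corresponds to the commits that would be marked `succeeded`, and the second to the commit marked `failed`.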
| 24 | + |
| 25 | +## Ingestion and completeness |
| 26 | + |
| 27 | +Trunk push events arrive as **external webhook events**, modeled as messages on the queue: when SQ merges a commit to `main`, a webhook notifies Stovepipe, which records the commit (`unknown`) and hands it into the pipeline. Webhooks give low-latency ingestion in the common case. |
| 28 | + |
| 29 | +Webhooks are a latency *optimization*, not a completeness *guarantee*. They can be delayed for hours, arrive out of order, or be dropped entirely — and a missed commit means a hole in trunk coverage that no one notices until something gates on an `unknown` that should have been validated. So ingestion does not depend on webhook reliability. A **fallback reconciliation poller** periodically diffs the last-ingested trunk SHA against the actual `main` HEAD and backfills any gap, publishing the missing commits into the same ingest path. With the poller running on a fixed cadence, no landed commit is missed even if webhooks are fully down. |
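The poller's diff step can be sketched as a pure function over the branch history. `commits_to_backfill` is a hypothetical name, and the history is modeled as a plain list in trunk order ending at HEAD; how the real poller reads `main` is not specified here. History rewrites are deliberately out of scope for this sketch, so a missing anchor simply raises.

```python
def commits_to_backfill(trunk: list[str], last_ingested: str) -> list[str]:
    """Return the SHAs on `main` that landed after `last_ingested`, oldest first.

    `trunk` is the branch history in trunk order, ending at HEAD.
    """
    if last_ingested not in trunk:
        # The anchor is gone (for example after a history rewrite); handling
        # that case is left to the caller in this sketch.
        raise LookupError(f"{last_ingested} is not on the branch")
    return trunk[trunk.index(last_ingested) + 1:]
```

Each returned SHA would be published into the same ingest path as a webhook event, so a commit that the webhook also delivered is deduplicated by its SHA.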
| 30 | + |
| 31 | +The two producers — webhook and poller — converge through the SHA-keyed idempotency described above, so nothing downstream assumes a commit is seen only once. Commits are processed in trunk order (committer-timestamp / topological), and a batch is a contiguous range of commits since the last known green. Because the green pointer and per-commit state are persisted, the system must be resilient to history rewrites — a previously validated commit that is no longer present on the branch — and converge rather than wedge when that happens. |
| 32 | + |
| 33 | +## Workflow |
| 34 | + |
| 35 | +``` |
| 36 | + push events ─┐ ┌─ GetStatus (RPC): |
| 37 | + (webhook) │ ┌─────────────────────────────┐ │ deployment systems |
| 38 | + ├──►│ gateway: webhook + poll │ │ & developer tooling |
| 39 | + main HEAD ──┘ │ Ingest pushes; fallback │ │ read commit status |
| 40 | + (poll) │ poll backfills missed SHAs │ ▼ |
| 41 | + └──────────────┬──────────────┘ ┌──────────────────────┐ |
| 42 | + │ PushEvent │ gateway: status │ |
| 43 | + ▼ │ Commit-status store │ |
| 44 | + ┌─────────────────────────────┐ │ (sole writer/reader) │ |
| 45 | + │ ingest / start │ └──────────┬───────────┘ |
| 46 | + │ Record Commit (unknown) │ ▲ |
| 47 | + │ keyed by SHA, emit Recorded │─────────────┤ |
| 48 | + └──────────────┬──────────────┘ status │ |
| 49 | + │ SHA events │ |
| 50 | + ┌──────────►▼ (any stage)│ |
| 51 | + │ ┌─────────────────────────────┐ │ |
| 52 | + BatchID │ │ batch │ │ |
| 53 | + (advance)│ │ Aggregate commits since green│ │ |
| 54 | + │ └──────────────┬──────────────┘ │ |
| 55 | + │ │ BatchID │ |
| 56 | + │ ▼ │ |
| 57 | + │ ┌──────────────────────────┐ │ |
| 58 | + │ ┌──►│ speculate (stub) │◄───┐ │ |
| 59 | + │ │ │ Full range vs. sub-range │ │ │ |
| 60 | + │ │ └────────────┬─────────────┘ │ │ |
| 61 | + │ │ BatchID │ BatchID │ │ |
| 62 | + │ │ ▼ │ │ |
| 63 | + │ │ ┌──────────────────────────┐ │ │ |
| 64 | + │ │ │ build │ │ │ |
| 65 | + │ │ │ Build the batch's │ │ │ |
| 66 | + │ │ │ changed targets │ │ │ |
| 67 | + │ │ └────────────┬─────────────┘ │ │ |
| 68 | + │ │ Build │ │ │ |
| 69 | + │ │ ▼ │ │ |
| 70 | + │ │ ┌──────────────────────────┐ │ │ |
| 71 | + │ │ │ buildsignal │ │ │ |
| 72 | + │ │ │ Feed build result back │ │ │ |
| 73 | + │ │ └───┬──────────────────┬───┘ │ │ |
| 74 | + │ │ fail │ │ pass │ │ |
| 75 | + │ │ ▼ ▼ │ │ |
| 76 | + │ │ ┌──────────────┐ ┌────────────────┐ │ |
| 77 | + │ └─│ bisect (stub)│ │ conclude │ │ |
| 78 | + │ │ Split range │ │ Green: mark │ │ |
| 79 | + │ └──────────────┘ │ succeeded, │ │ |
| 80 | + └─────────────────────│ advance batch. │ │ |
| 81 | + │ Fail: mark │ │ |
| 82 | + │ commit failed. │─┘ |
| 83 | + └───────┬────────┘ status |
| 84 | + │ SHA events |
| 85 | + ▼ |
| 86 | + ┌─────────────────────────┐ |
| 87 | + │ remediate │┄┄► remediation |
| 88 | + │ Invoke remediation │ extension → |
| 89 | + │ extension for the commit│ external fix / |
| 90 | + └─────────────────────────┘ revert → SQ |
| 91 | +``` |
| 92 | + |
| 93 | +## Per-controller summary |
| 94 | + |
| 95 | +| Controller | In | Out | One-line role | |
| 96 | +|---|---|---|---| |
| 97 | +| **gateway/webhook** | push event (RPC/HTTP) | ingest | Receive a trunk push event, publish to the ingest topic, hand off async | |
| 98 | +| **gateway/poll** | (timer) | ingest | Fallback reconciler: diff last-ingested SHA vs `main` HEAD, backfill any gap | |
| 99 | +| **gateway/GetStatus** | RPC | — | Read path: callers query a commit's status (optionally scoped to a target/project) | |
| 100 | +| **ingest / start** | PushEvent | batch, status | Record the Commit as `unknown` keyed by SHA (dedup), emit Recorded status | |
| 101 | +| **batch** | SHA (from ingest), BatchID (advance, from conclude) | speculate | Aggregate commits since the last known green into a validation Batch (commit range) | |
| 102 | +| **speculate** (stub) | BatchID | build | Decide the validation strategy: full range vs. a bisection sub-range | |
| 103 | +| **build** | BatchID | buildsignal | Build the batch's changed targets (target analysis happens here) | |
| 104 | +| **buildsignal** | Build | conclude, bisect | Feed the build result back; route pass → conclude, fail → bisect | |
| 105 | +| **bisect** (stub) | BatchID | speculate | Split a failing range into sub-batches to isolate the offending commit | |
| 106 | +| **conclude** | BatchID | batch, remediate, status | Green: mark commits `succeeded` and advance to the next batch. Failure: mark the offending commit `failed` and hand off to remediate | |
| 107 | +| **remediate** | SHA | — (extension) | Invoke the remediation extension for the offending commit; an external fix/revert lands via SQ | |
| 108 | +| **status** | StatusEvent | — | Gateway-owned sink: persist the authoritative commit-status store | |
| 109 | + |
| 110 | +There is deliberately no changed-target stage and no scoring stage. Target analysis belongs to `build`, which already needs the changed-target set to know what to compile and test, so a separate stage would only pre-compute what `build` must derive anyway. And commits are validated in trunk order rather than reordered by priority, so there is nothing to score; bisection may eventually use a suspicion-weighted heuristic to place its probes (build the commits most likely to be the culprit first), but that is an optional input to `bisect`, not a stage of its own. |
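As a sanity check on the table, the routing can be written down as data and the two cycles verified mechanically. The edges below are copied from the "Out" column (with `status` omitted, since any stage may publish to it); `ROUTES` and `has_cycle_through` are illustrative names, not part of the orchestrator.

```python
# Out-edges per controller, taken from the "Out" column of the table.
ROUTES = {
    "ingest": ["batch"],
    "batch": ["speculate"],
    "speculate": ["build"],
    "build": ["buildsignal"],
    "buildsignal": ["conclude", "bisect"],
    "bisect": ["speculate"],
    "conclude": ["batch", "remediate"],
    "remediate": [],
}


def has_cycle_through(path: list[str]) -> bool:
    """True if each stage in `path` routes to the next, and the last to the first."""
    successors = path[1:] + path[:1]
    return all(nxt in ROUTES[cur] for cur, nxt in zip(path, successors))
```

Checking `["speculate", "build", "buildsignal", "bisect"]` confirms the bisection loop, and `["batch", "speculate", "build", "buildsignal", "conclude"]` confirms the advance loop.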
| 111 | + |
| 112 | +## Remediation handoff |
| 113 | + |
| 114 | +When `bisect` isolates the offending commit, `conclude` marks it `failed` and publishes its SHA to the `remediate` topic — the same decoupled publish-then-consume hop the rest of the pipeline uses, not an inline call. The `remediate` controller consumes that topic and invokes a **remediation extension**: a vendor-agnostic, pluggable interface that is Stovepipe's integration boundary with whatever external system produces the fix. The extension hands the offending commit to that system, which generates a revert or fix and lands it through SQ like any other change. |
| 115 | + |
| 116 | +Stovepipe's responsibility ends at invoking the extension. It does not author or land the fix, and it does not block waiting for one — there is no synchronous "wait for green" stage. The fix lands on `main` as an ordinary commit and re-enters Stovepipe through the normal ingest-and-validate path, where it is validated like anything else and the trunk returns to green. This keeps the pipeline non-blocking and the external remediation system fully decoupled behind the extension. |
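The extension boundary might look like the sketch below. Everything here is hypothetical: the document does not define the interface, so `RemediationExtension`, its `remediate` method, and `handle_remediate_message` are assumed names chosen to show the fire-and-return shape of the handoff.

```python
from typing import Protocol


class RemediationExtension(Protocol):
    """Hypothetical shape of the pluggable integration boundary."""

    def remediate(self, sha: str) -> None:
        """Hand the offending commit to the external system; do not wait for a fix."""


class RecordingExtension:
    """Test double that only remembers which commits it was handed."""

    def __init__(self) -> None:
        self.handed_off: list[str] = []

    def remediate(self, sha: str) -> None:
        self.handed_off.append(sha)


def handle_remediate_message(sha: str, extension: RemediationExtension) -> None:
    """Consume one message from the remediate topic: invoke the extension, then return."""
    extension.remediate(sha)
```

The handler returns as soon as the extension has been invoked, which mirrors the point above: there is no stage that waits for the fix to land.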
| 117 | + |
| 118 | +## DLQ reconciliation |
| 119 | + |
| 120 | +Every *consumed* primary pipeline topic above is paired with a `{topic}_dlq` subscription consumed by a dedicated DLQ controller. The `status` topic is the exception: the orchestrator only publishes to it (the gateway is the sole consumer that persists commit status), so it has no orchestrator-side subscription and therefore no DLQ. The consumer framework moves a message to its DLQ once the primary controller returns a non-retryable error or exhausts retries on a retryable one. |
| 121 | + |
| 122 | +The Stovepipe-specific risk a DLQ must close: a validation that can never complete must not leave a commit stuck non-terminal. A commit wedged at `unknown` forever is not a neutral outcome: callers gate on status, and an unvalidatable commit that silently stays `unknown` blocks the trunk's green pointer from advancing past it. So the DLQ controllers do not re-attempt the failed work; they decode the payload to recover the affected commit SHA or `BatchID` and drive the entity to a **conservative, not-green terminal state**, so gating stays safe (fail closed, never falsely green) and the pipeline can move on. State writes use the same optimistic-locking CAS as the primary pipeline, so a late primary-pipeline update wins cleanly and a DLQ write that hits a version mismatch is returned to the queue for redelivery. |
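A minimal sketch of the reconcile-only behavior with a versioned compare-and-swap. The record shape, the `cas_write` helper, and the choice of `failed` as the not-green terminal state are assumptions of this sketch; the actual storage layer and its CAS primitive are not shown in this document.

```python
from dataclasses import dataclass


@dataclass
class CommitRecord:
    state: str    # "unknown", "succeeded", or "failed"
    version: int  # bumped on every successful write


class VersionMismatch(Exception):
    """Raised when a compare-and-swap loses; the message should be redelivered."""


def cas_write(store: dict[str, CommitRecord], sha: str,
              expected_version: int, new_state: str) -> None:
    """Write `new_state` only if the record is still at `expected_version`."""
    current = store[sha]
    if current.version != expected_version:
        raise VersionMismatch(sha)
    store[sha] = CommitRecord(new_state, expected_version + 1)


def reconcile_dlq_message(store: dict[str, CommitRecord], sha: str) -> None:
    """Fail closed: drive a still-unvalidated commit to `failed`; never retry the work."""
    record = store[sha]
    if record.state != "unknown":
        return  # the primary pipeline already reached a terminal state; leave it
    cas_write(store, sha, record.version, "failed")
```

If the primary pipeline writes between the read and the CAS, `cas_write` raises `VersionMismatch`; on redelivery the reconciler sees the terminal state and does nothing, which is how the late primary update wins.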
| 123 | + |
| 124 | +DLQ consumers are wired with `errs.AlwaysRetryableProcessor` and a very high `Retry.MaxAttempts`, with their own DLQ disabled — the same effectively-non-droppable posture SQ uses. The trade-off is identical: a genuinely unprocessable DLQ message (typically a malformed payload) must be removed by an operator. See `submitqueue/orchestrator/controller/dlq/README.md` for the shared design constraints (simplest possible implementation, reconcile-only, no recovery). |
| 125 | + |
| 126 | +## Ownership by service |
| 127 | + |
| 128 | +Each service owns its own data; the gateway and orchestrator never touch each other's, and the only thing they share is the messaging queue. |
| 129 | + |
| 130 | +### Gateway |
| 131 | + |
| 132 | +The gateway is the boundary of the system and the owner of the commit-status store. It ingests trunk push events — both from external webhooks and from the fallback poller — and hands them to the orchestrator over the queue. It serves the status query RPC that downstream systems call. And it owns the record of each commit's health: it is the only service that reads or writes the commit-status store, writing it both directly as commits are ingested and by consuming the status events the orchestrator emits. |
| 133 | + |
| 134 | +### Orchestrator |
| 135 | + |
| 136 | +The orchestrator runs the pipeline that takes a landed commit from `unknown` to a terminal state. It owns the working state of that pipeline — in-flight commits, batches, builds, and bisection bookkeeping — and is the only service that writes it. It drives a batch through validation, re-entering speculation as build results arrive and as bisection narrows a failing range, advances to the next range once a green is established, and hands an isolated offending commit off through the remediation extension. It never persists commit status itself; it only emits status events for the gateway to record. |
| 137 | + |
| 138 | +### Shared: the messaging queue |
| 139 | + |
| 140 | +The two services communicate only through the messaging queue. It is pluggable infrastructure kept in its own database, separate from either service's application data: it carries external push events in, the internal pipeline topics between orchestrator stages, and the status events the orchestrator publishes for the gateway to consume. |
| 141 | + |
| 142 | +## Commit-status ownership invariant |
| 143 | + |
| 144 | +The commit-status store has exactly one owner: the **gateway**. The orchestrator only emits status events onto the queue; it never persists them. The gateway is the sole consumer of those events and the only writer of the commit-status store. |
| 145 | + |
| 146 | +This keeps all status writes in one service: the orchestrator stays a pure pipeline that emits events, and the gateway owns the record callers query end to end. It is the direct analogue of SQ's request-log ownership invariant. |