diff --git a/decisions/0015-spaces-key-as-iac-resource.md b/decisions/0015-spaces-key-as-iac-resource.md new file mode 100644 index 00000000..f7d2096f --- /dev/null +++ b/decisions/0015-spaces-key-as-iac-resource.md @@ -0,0 +1,108 @@ +# 0015: Spaces key as IaC resource — two-phase fix (canonical-schema + Hybrid resource) + +- **Date:** 2026-05-09 +- **Status:** Accepted + +## Context + +DigitalOcean Spaces access keys (`access_key` + `secret_key` pairs) are +provisioned and stored by the canonical bootstrap path +(`bootstrapSecrets` in `cmd/wfctl/infra_bootstrap.go`) when a config +declares a `secrets.generate` entry of `type: provider_credential` with +`source: digitalocean.spaces`. The bootstrap layer mints the key via +`generateDOSpacesKey` (in `secrets/generators.go`), then stores the two +sub-keys (`_access_key`, `_secret_key`) as GitHub Actions +Secrets per `providerCredentialSubKeys`. + +Two failure classes had accumulated: + +1. **The doubled-create anti-pattern** — operators sometimes wrote two + separate `secrets.generate` entries (`SPACES_access_key` AND + `SPACES_secret_key`), each `type: provider_credential`, each with + the same `name`. Each entry triggered a separate cloud credential + creation; the second silently orphaned the first's access_key. After + N CI runs, the DO account accumulated N orphaned keys (the leak + surface that motivated this plan). + +2. **No IaC lifecycle for purpose-scoped keys** — when downstream code + needed a credential scoped to a single bucket (e.g., a customer- + data export job), there was no way to declare it in `infra.yaml` as + a regular IaC resource alongside the bucket. Operators had to mint + one out-of-band and paste the access_key into a secret, which + defeated drift detection + state correlation. + +The design considered two approaches: + +- **Approach A**: Pure schema correction. Block the doubled-create + anti-pattern at lint (R-A9), rename existing-broken configs to the + canonical single-entry shape, leave purpose-scoped keys to the + out-of-band path. Smallest change; doesn't address (2). + +- **Approach B**: Two-phase fix. Phase 1 = same as A (block + migrate); + Phase 2 adds a new `infra.spaces_key` IaC resource type with its own + `ResourceDriver.Create/Read/Update/Delete` so purpose-scoped keys + enter state and participate in plan/apply/drift like any other + resource. Larger change; addresses both failure classes. + +## Decision + +Take **Approach B** — two-phase fix. Phase 1 ships the schema +correction (R-A9 severity flip + canonical-schema migration of broken +configs + storage-filter fix for sidecar metadata). Phase 2 ships +`infra.spaces_key` as a first-class IaC resource type via a new +`SpacesKeyDriver` in `workflow-plugin-digitalocean`. + +The two phases are sequenced: Phase 1 lands first (PR0+PR1+PR2+PR4a in +this plan), then Phase 2 (PR4b+PR5+PR6) once a workflow tag containing +Phase 1 is cut. + +## Consequences + +- The doubled-create anti-pattern is now blocked at `wfctl infra + align --strict` with severity `ERROR` (always exits non-zero, not just + under `--strict`). See ADR for R-A9 (in plan rev3). + +- A new resource type `infra.spaces_key` exists. Its driver writes the + two GH Secrets at Create time using the same naming convention + (`_access_key`, `_secret_key`) so downstream code that + references `${NAME_access_key}` works identically whether the key + came from canonical bootstrap or the new IaC driver. + +- The bootstrap key (the canonical credential `bootstrapSecrets` mints + before any plan runs) is **explicitly NOT** an `infra.spaces_key` — + it lives outside IaC for chicken-and-egg reasons (state-backend + creds need to exist before state can be loaded). See ADR 0018. + +- Drift detection for purpose-scoped keys is now meaningful: if an + operator deletes a key out-of-band, `infra drift` flags it. + +- The new resource type implies an `EnumeratorAll` interface on the DO + provider (so `wfctl infra audit-keys` / `prune` can find every key + in the account, not just those in state). See ADR 0016. + +## Alternatives considered + +- **Approach A (pure schema correction)**: rejected because it leaves + purpose-scoped keys in the out-of-band path. Operators would still + hand-paste access_keys into secrets; drift detection would silently + miss orphans. + +- **Single-phase fix (combine 1 + 2 in one PR)**: rejected because the + blast radius is too large for one review pass. The schema correction + has security urgency (block the leak vector); the IaC driver is + additive and can take a second review cycle. + +- **Tag-based key correlation** (use DO's per-key tag field to match + state): rejected by adversarial review #2 — DO tags overload with + Spaces *bucket* tags and cause noise in cleanup queries; the DO API + doesn't enforce uniqueness on the tag namespace. + +## Related + +- `docs/plans/2026-05-08-spaces-key-iac-resource-design.md` — full + design doc with the Approach A vs B trade study. +- `docs/plans/2026-05-08-spaces-key-iac-resource-design.adversarial-review-1.md` + through `-7.md` — review history that converged on Approach B. +- ADR 0016 (EnumeratorAll), ADR 0017 (prune CLI two-key opt-in), + ADR 0018 (bootstrap-key exemption), ADR 0020 (storage-filter sidecar + metadata). diff --git a/decisions/0016-enumerator-all-interface.md b/decisions/0016-enumerator-all-interface.md new file mode 100644 index 00000000..1879dfe1 --- /dev/null +++ b/decisions/0016-enumerator-all-interface.md @@ -0,0 +1,100 @@ +# 0016: EnumeratorAll interface for non-tag-supporting resources + +- **Date:** 2026-05-09 +- **Status:** Accepted + +## Context + +The existing `interfaces.Enumerator` interface (added in earlier +plans for `wfctl infra cleanup --tag`) returns `[]ResourceRef` for +resources matching a tag query. It works for resource types where the +cloud API exposes a per-resource tag field that can be queried server- +side: DO Droplets, DO Volumes, DO Domains, AWS EC2 instances, etc. + +Several resource types do NOT support tagging at all — DO Spaces +access keys are the canonical example. The DO API exposes `name`, +`access_key`, `created_at` on each key but no tag field. Building +`audit-keys` / `prune` against `Enumerator.EnumerateByTag` would have +required either: + +- A spurious tag-querying step that returns "tags unsupported for + resource type" — operators would have to know which types support + tags before invoking. +- A per-type fallback path inside the CLI that branches between + Enumerator + a hand-rolled list call. + +Both options leak provider-API quirks into the CLI layer. + +## Decision + +Add a new optional interface `interfaces.EnumeratorAll`: + +```go +type EnumeratorAll interface { + EnumerateAll(ctx context.Context, resourceType string) ([]*ResourceOutput, error) +} +``` + +Providers that can list resources of `resourceType` regardless of tag +implement this interface. The DO provider implements it for +`infra.spaces_key`. Future providers can implement it for any type +whose API supports listing without filter. + +Returns `[]*ResourceOutput` (full metadata) rather than `[]ResourceRef` +(name + type + provider_id) because audit + prune both need the +metadata to render / filter — and the DO List API returns the full +shape in one call, so re-Reading would be wasteful. + +CLI dispatchers type-assert at the boundary; providers that don't +implement it surface a structured error (`audit-keys: no loaded +provider implements EnumeratorAll`) rather than crashing. + +## Consequences + +- Existing `Enumerator` (tag-based) is unchanged. New interface is + purely additive — older provider plugins continue to work without + modification. + +- DO provider (`workflow-plugin-digitalocean`) implements + `EnumerateAll` for `infra.spaces_key` with transparent pagination + (DO's `/v2/spaces/keys` API page-size limits). Per-page calls happen + inside the implementation; callers see one slice. + +- The CLI dispatchers (`runInfraAuditKeysCmd`, + `runInfraPruneCmd`, `runInfraRotateAndPruneCmd`) iterate loaded + iac.provider modules and pick the first one implementing + `EnumeratorAll`. Multi-provider configs need to ensure at most one + provider per `--type`. + +- Returning `[]*ResourceOutput` ties the interface to the + `ResourceOutput` shape (`Outputs map[string]any`). Providers must + populate the `Outputs` map with the fields callers expect — for + spaces keys: `name`, `access_key`, `created_at`. This is the + documented contract. + +## Alternatives considered + +- **Add `EnumerateAll` to the existing `Enumerator` interface** as a + second method. Rejected because it would force every existing + Enumerator implementer to grow a second method or stop satisfying + the interface — backwards-incompatible. + +- **Return `[]ResourceRef` from `EnumerateAll`**. Rejected because + audit + prune need the metadata; forcing callers to round-trip back + to `Read` for every result would double the API calls. + +- **Centralize listing in `wfctl` core** (manage list-by-type as a + generic operation against `iac.state` only). Rejected because + state-only listing misses resources that exist in the cloud but not + in state — exactly the orphan keys this plan is built to find. + +## Related + +- `interfaces/iac_provider.go` (commit c8e4c4e8) — `EnumeratorAll` + declaration. +- `workflow-plugin-digitalocean/internal/providers/digitalocean.go` + — DO implementation with pagination. +- `cmd/wfctl/infra_audit_keys.go`, `cmd/wfctl/infra_prune.go` — CLI + consumers. +- ADR 0015 (spaces-key as IaC resource — this interface enables the + audit + prune CLIs needed to operationalize it). diff --git a/decisions/0017-prune-cli-two-key-opt-in.md b/decisions/0017-prune-cli-two-key-opt-in.md new file mode 100644 index 00000000..c499dd4e --- /dev/null +++ b/decisions/0017-prune-cli-two-key-opt-in.md @@ -0,0 +1,125 @@ +# 0017: `wfctl infra prune` UX — discriminator + two-key opt-in + +- **Date:** 2026-05-09 +- **Status:** Accepted + +## Context + +`wfctl infra prune` is destructive — it deletes cloud resources by +calling the provider's `ResourceDriver.Delete`. The user it serves +most often is post-leak cleanup: an operator has rotated a credential +and needs to delete every older key so the leaked one stops being +accepted. This is a high-stakes operation; a typo can nuke the live +key and cause downtime. + +The design considered three discriminator shapes: + +- **Name regex as the primary filter**: rejected by adversarial + review #2. A regex protects *everything matching the regex*, + including any leaked key whose name happens to match. The leak + surface is exactly what we're trying to remove. + +- **Per-key tag**: rejected because spaces keys don't support tags + (motivated ADR 0016). + +- **Time + access_key** (this ADR): only resources whose + `created_at` is OLDER than `--created-before` are eligible, AND the + resource whose `access_key` matches `--exclude-access-key` is always + preserved. + +The time + access_key shape pairs naturally with the rotation +primitive: rotating a credential mints a new key with a fresh +`created_at` and a known `access_key`. Pass those values and the +filter is unambiguous: "everything older than the new key, except +the new key itself". + +The opt-in shape considered three levels: + +- **Just `--confirm` flag** (single per-invocation): too easy to + recover from `~/.bash_history`. +- **Just env var**: too easy to ship in a Makefile / CI step. +- **Both required (this ADR)**: the operator must export the env var + AND pass the flag, so neither a stale shell history line nor a + misconfigured Makefile alone is enough to fire. + +## Decision + +`wfctl infra prune` requires: + +1. `--type ` (required) — single resource type per invocation. Rejects + if missing. +2. `--created-before ` (required) — only resources older than + this are eligible. +3. `--exclude-access-key ` (required) — this access_key is preserved + no matter what (paranoia rail). +4. `--confirm` flag — explicit per-invocation consent. +5. `WFCTL_CONFIRM_PRUNE=1` environment variable — two-key authorization. +6. Interactive `y/N` prompt — skipped only with `--non-interactive`. + +If ANY of (1)-(5) is missing, prune exits with a structured error and +NO cloud calls are made. + +`--preserve-names ` is offered as a secondary, additive filter — +it preserves resources whose `name` matches the regex on top of the +`--exclude-access-key` exclusion. The verb-in-the-name (`preserve` vs. +the original `--allowlist`) is deliberate: on a destructive command, +"allowlist" reads ambiguously (some operators map it to "list of +resources allowed to be deleted", which would invert the semantics +catastrophically); "preserve-names" only reads one way. + +`--recovery-from-last-rotation` is the recovery-mode shortcut: reads +the discriminator from the recovery file written by `infra +rotate-and-prune` (see ADR for that flow) so an operator who just had +a partial-failure rotation can finish cleanup without manually +copy-pasting timestamps + access_keys. + +## Consequences + +- Operators must capture the rotation result before pruning. The + `infra rotate-and-prune` all-in-one command does this automatically; + the multi-step variant requires reading stderr after `infra + bootstrap --force-rotate`. + +- The two-key opt-in adds one extra step (`export + WFCTL_CONFIRM_PRUNE=1`) for routine operator use. Acceptable + friction for a destructive operation. CI workflows that need to + prune (e.g., scheduled hygiene jobs) export the env var explicitly. + +- The feature flag this ADR codifies (`WFCTL_CONFIRM_PRUNE`) has a + retirement milestone: v0.28.0 makes it always-on (effectively + removing the env var step) IF operational telemetry shows zero + unintended fires for 60 days post-release. Tracked in plan rev3 as + the "feature-flag retirement" sub-task; not in this ADR's scope. + +- A typo in `--exclude-access-key` is caught by the value-equality + filter — if the typo'd value doesn't match any real key, every + eligible key (including the live one) is targeted. This is why the + interactive `y/N` prompt exists as the third opt-in: the operator + reviews the list of keys-to-be-deleted before confirming. + +## Alternatives considered + +- **Soft-delete via tag-and-cron**: rejected because DO Spaces keys + don't support tags + cron-based delayed-delete adds operational + state we don't want. + +- **Allowlist-only filter** (no `--exclude-access-key`): rejected per + adversarial review #2. The rotation use case needs an *exclusion* + semantic (preserve N, delete N-1), not an *inclusion* semantic + (delete N matching the regex). Inclusion-by-regex is dangerous + because typos default to "match nothing" which is safe; exclusion- + by-regex defaults to "match everything" which deletes the live key. + +- **Single opt-in (env var XOR flag)**: rejected. Either alone is + insufficient defense in depth — env vars persist in shell history + + CI logs; flags persist in shell history. + +## Related + +- `cmd/wfctl/infra_prune.go` — implementation. +- `cmd/wfctl/infra_prune_test.go` — `TestInfraPrune_RequiresTwoKeyOptIn`, + `TestInfraPrune_RequiresExcludeAccessKey`, + `TestInfraPrune_FiltersByTimeAndExcludesAccessKey`. +- `docs/runbooks/spaces-key-prune.md` — operator-facing runbook. +- ADR 0015 (two-phase plan), ADR 0016 (EnumeratorAll — what prune + enumerates against). diff --git a/decisions/0018-bootstrap-key-exemption.md b/decisions/0018-bootstrap-key-exemption.md new file mode 100644 index 00000000..f6c8184c --- /dev/null +++ b/decisions/0018-bootstrap-key-exemption.md @@ -0,0 +1,109 @@ +# 0018: Bootstrap key exemption from `infra.spaces_key` + +- **Date:** 2026-05-09 +- **Status:** Accepted + +## Context + +ADR 0015 introduced `infra.spaces_key` as a first-class IaC resource +type so purpose-scoped DO Spaces credentials can participate in +plan/apply/drift like any other resource. The natural follow-on +question is: should the *bootstrap key* — the canonical credential +that `bootstrapSecrets` mints before any plan runs — also be modeled +as an `infra.spaces_key` resource? + +Migrating the bootstrap key to `infra.spaces_key` would unify the two +code paths (`bootstrapSecrets` and the IaC driver) and let drift +detection catch out-of-band rotations of the bootstrap credential. + +But the bootstrap key has two characteristics that make it +fundamentally different from purpose-scoped keys: + +1. **Chicken-and-egg with state**: `bootstrapSecrets` must run + *before* the IaC state backend is loaded (in many configs, the + state backend itself is hosted on DO Spaces — accessing it + requires the bootstrap key). If the bootstrap key were managed by + `infra.spaces_key`, the IaC subsystem would need that key to load + state, but the key wouldn't be in state until apply ran. Circular. + +2. **Lifecycle separation**: the bootstrap key is rotated by + `wfctl infra bootstrap --force-rotate ` — a flow that + reaches into the upstream provider (DO API) directly, mints a new + key, then revokes the old one via + `interfaces.ProviderCredentialRevoker` (ADR 0012). It does NOT go + through the resource driver Plan/Apply cycle. Forcing it through + the IaC pipeline would either require reimplementing the rotation + primitive there, or having two competing rotation paths that + could disagree. + +## Decision + +The bootstrap key is **explicitly NOT** an `infra.spaces_key` +resource. It remains managed by the canonical bootstrap path +(`bootstrapSecrets` + `--force-rotate`) and lives outside the IaC +graph. The `wfctl infra audit-keys` CLI surfaces it (because it +appears in the cloud account's key list), so operators can SEE it +during an audit, but `wfctl infra prune` will skip it as long as +operators specify the canonical key's `access_key` in +`--exclude-access-key`. + +This carve-out is explicitly documented in: + +- The R-A9 align rule (which fires `ERROR` only on `provider_credential` + entries that violate the canonical-shape contract — bootstrap keys + always satisfy it). +- The runbook (`docs/runbooks/spaces-key-prune.md`) — operators are + reminded to exclude the bootstrap key's access_key when running + `prune` against `infra.spaces_key` if they have one in their + account. +- The plan's "Out of scope" section — bootstrap-key rotation reaper + deferred to ADR 0019 (future plan). + +## Consequences + +- Two code paths exist for spaces-key lifecycle: the canonical + bootstrap path (for the bootstrap key) and the new + `SpacesKeyDriver` (for purpose-scoped keys). They share the GH + Secrets naming convention (`_access_key` / `_secret_key`) + so downstream consumer code is identical. + +- Drift detection does NOT cover the bootstrap key. If an operator + rotates the bootstrap credential through the DO console, IaC + doesn't notice. This is acceptable because the credential is + required to load state at all — any out-of-band rotation that breaks + this is operationally visible immediately. + +- A future plan (ADR 0019, deferred) can add a bootstrap-key rotation + reaper that watches for stale keys + reports them via a separate + CLI surface. Out of scope for this plan. + +- Operators running `wfctl infra audit-keys --type infra.spaces_key` + will see the bootstrap key in the output. This is intentional — + visibility is good even when management is out-of-band. The runbook + notes this so they don't try to delete it. + +## Alternatives considered + +- **Migrate bootstrap key to `infra.spaces_key`**: rejected for the + chicken-and-egg + lifecycle reasons above. Would require either + bootstrapping a state backend without managed credentials (defeats + the bootstrap workflow's purpose) or two competing rotation paths. + +- **Block the bootstrap key from `audit-keys` output entirely** (so + operators don't see it): rejected. Visibility is good even when + the key is out-of-IaC-scope. Hiding it would create a different + failure mode where an operator forgets it exists. + +- **Track the bootstrap key in `infra.yaml` as a `data` source** (a + read-only IaC reference): rejected as overengineering for the + current need. Revisit when ADR 0019 lands. + +## Related + +- ADR 0012 — Provider credential rotation (the primitive + `bootstrapSecrets --force-rotate` uses). +- ADR 0015 — Spaces key as IaC resource (introduces the resource + type this ADR carves out from). +- ADR 0019 (deferred) — Bootstrap-key rotation reaper. +- `docs/runbooks/spaces-key-prune.md` — runbook with the + bootstrap-key carve-out called out. diff --git a/decisions/0020-storage-filter-sidecar-metadata.md b/decisions/0020-storage-filter-sidecar-metadata.md new file mode 100644 index 00000000..5d0856a0 --- /dev/null +++ b/decisions/0020-storage-filter-sidecar-metadata.md @@ -0,0 +1,126 @@ +# 0020: Storage filter for sidecar metadata in `bootstrapSecrets` + +- **Date:** 2026-05-09 +- **Status:** Accepted + +## Context + +The DO Spaces credential creation API +(`POST /v2/spaces/keys`) returns a JSON object with at least three +useful fields: + +```json +{ + "access_key": "DO00...", + "secret_key": "...", + "created_at": "2026-05-08T11:00:00Z" +} +``` + +Pre-this-plan, `generateDOSpacesKey` returned only the `access_key` + +`secret_key` (concatenated as a JSON string), and `bootstrapSecrets` +JSON-decoded that string + stored each key as a separate GH Secret +named `_` (e.g., `SPACES_access_key`, `SPACES_secret_key`). + +The `wfctl infra prune` design (ADR 0017) needs the rotation result +to include `created_at` so the time-based discriminator +(`--created-before`) can be derived without a second API call. The +natural fix: have `generateDOSpacesKey` return ALL three fields +(`access_key` + `secret_key` + `created_at`) and let +`bootstrapSecrets` extract `created_at` from the result. + +The naive fix is dangerous, though. `bootstrapSecrets`'s storage +loop iterates EVERY field in the JSON map and writes each as a GH +Secret named `_`. Adding `created_at` would create a +phantom `SPACES_created_at` GH Secret on every bootstrap run — a +shape that makes no sense semantically (`created_at` is metadata, +not a credential), pollutes the audit-keys / prune output (the +phantom secret has no corresponding cloud key), and confuses +downstream consumers expecting only `_access_key` / `_secret_key`. + +## Decision + +Make two changes together (must land in the same merge commit per +this ADR's "same-commit constraint"): + +1. `generateDOSpacesKey` returns a JSON map with `access_key` + + `secret_key` + `created_at` (full DO API response shape, not just + the credential pair). + +2. `bootstrapSecrets` storage loop **filters** the JSON map by the + `providerCredentialSubKeys[source]` allow-list before writing GH + Secrets. For `digitalocean.spaces`, that's + `["access_key", "secret_key"]` — `created_at` is silently ignored + at the storage layer. + +3. `created_at` is surfaced to operators via stderr sidecar emission + (`WFCTL_NEW_KEY_CREATED_AT=` line printed alongside the + existing rotation log) so the multi-step prune workflow can + capture it from the bootstrap output. + +4. `RotationResult` (the in-process return shape) includes + `CreatedAt` so the all-in-one `rotate-and-prune` doesn't need to + parse stderr — it gets the metadata directly. + +The "same-commit constraint" is critical: shipping (1) without (2) +creates a phantom GH Secret on every bootstrap; shipping (2) without +(1) is a no-op (nothing to filter). Splitting them across commits +would leave main in a broken state for any window between merges. + +## Consequences + +- All future provider_credential generators that return additional + metadata (DO Spaces precedent) MUST be filtered by the per-source + sub-key allow-list. New providers register their canonical sub-keys + in `providerCredentialSubKeys` (`cmd/wfctl/infra_bootstrap.go`). + +- `audit-keys` / `prune` for spaces keys can rely on `created_at` + being available in `ResourceOutput.Outputs` (via `EnumerateAll`) + AND in `RotationResult.CreatedAt` (in-process) — no second API + call needed. + +- The stderr sidecar (`WFCTL_NEW_KEY_CREATED_AT=`) is part of the + documented contract for the multi-step variant of the prune + workflow (see `docs/runbooks/spaces-key-prune.md`). Format + prefixed with `WFCTL_` so operators can `grep`/parse reliably. + +- Backward compatibility: existing canonical-bootstrap consumers + (configs that already use single-entry `secrets.generate` shape) + continue to work without change. The storage filter is invisible + to them — they only ever stored `_access_key` + `_secret_key` + before, and that's still all they get. + +- Rejected approach (sidecar return as a separate `[]Sidecar` + parameter on `secrets.GenerateSecret`) would have changed the + public package signature. This filter approach keeps the public + signature stable. + +## Alternatives considered + +- **Sidecar return parameter** on `secrets.GenerateSecret` (a new + `[]Sidecar` slice for non-credential metadata). Rejected because + it changes the public package signature, breaking every downstream + consumer of `secrets.GenerateSecret`. The storage-filter approach + achieves the same result without ABI churn. + +- **Filter at the `generateDOSpacesKey` callsite** instead of in + `bootstrapSecrets`. Rejected because it would force every future + provider_credential generator to remember to filter — putting the + filter in `bootstrapSecrets` makes it the system's invariant, not + a per-generator obligation. + +- **Don't return `created_at` at all** from `generateDOSpacesKey` and + re-fetch via a separate `Read` call when prune needs it. Rejected + for the second-API-call cost and for opening a TOCTOU window + between rotation and prune. + +## Related + +- ADR 0012 — Provider credential rotation (the rotation primitive + this metadata flows from). +- ADR 0017 — `wfctl infra prune` UX (the consumer of `created_at`). +- `cmd/wfctl/infra_bootstrap.go` — `bootstrapSecrets` storage loop + + `providerCredentialSubKeys` map. +- `secrets/generators.go` — `generateDOSpacesKey` return shape. +- ADR 0021 — `rewriteTransport` (test stubbing fix landed in PR4a; + same area but unrelated decision). diff --git a/docs/runbooks/spaces-key-prune.md b/docs/runbooks/spaces-key-prune.md new file mode 100644 index 00000000..f225ec60 --- /dev/null +++ b/docs/runbooks/spaces-key-prune.md @@ -0,0 +1,251 @@ +# Spaces Key Prune Runbook + +When DigitalOcean Spaces accumulates leaked or stale access keys (post-leak +cleanup, periodic hygiene, scheduled rotation), use `wfctl infra +rotate-and-prune` to rotate the canonical credential AND prune older keys +in one atomic flow. Inspect-then-prune and recovery flows are documented +below for the cases where the all-in-one is not the right tool. + +This runbook assumes: + +- The workflow CLI is built from a release containing the `infra + audit-keys` / `infra prune` / `infra rotate-and-prune` subcommands + (`wfctl infra --help` lists all three). +- The DO plugin is loaded for the target env (`infra audit-keys` exits + with `no loaded provider implements EnumeratorAll` if not). +- The operator has shell access where `WFCTL_CONFIRM_PRUNE=1` can be + exported and `${WFCTL_STATE_DIR:-$HOME/.wfctl}` is writable. + +## Overview + +| Tool | Use when | +|------|----------| +| `infra audit-keys` | Read-only — list every cloud-side key for a `--type` to compare against IaC state. Always safe. | +| `infra prune` | Destructive — delete cloud-side keys older than `--created-before` except `--exclude-access-key`. Three-step opt-in (see below). | +| `infra rotate-and-prune` | Destructive — mint new credential, then prune older keys with the new one as exclusion target. Three-step opt-in + recovery file. | + +Both destructive subcommands require **three** opt-ins to fire. The +first two are the **two-key** authorization (env var + flag — both +required, named after the two-person rule from physical security +contexts); the third is the runtime confirmation: + +1. The `--confirm` flag on the CLI invocation (per-invocation consent). +2. `WFCTL_CONFIRM_PRUNE=1` exported in the environment (per-session + authorization). Together with (1) this is the "two-key opt-in" + referred to throughout the rest of this runbook + ADR 0017. +3. An interactive `y/N` prompt — skipped only with `--non-interactive` + (CI workflows MUST opt in explicitly). + +The `--exclude-access-key` flag is mandatory for `prune` (paranoia rail — +prevents a typo from nuking the live key); `rotate-and-prune` derives it +from the rotation result so it's auto-populated and not operator-provided. + +## Happy path: rotate-and-prune + +Recommended for routine hygiene + post-leak cleanup. One command, atomic +intent: the new credential is minted and persisted before any deletion +touches the cloud, and the recovery file is removed only after the prune +step succeeds. + +```bash +export WFCTL_CONFIRM_PRUNE=1 +wfctl infra rotate-and-prune \ + --type infra.spaces_key \ + --name coredump-deploy-key \ + --confirm +``` + +The flow is: + +1. **Rotate** the canonical `coredump-deploy-key` credential — mints a + new key on DO, stores `coredump-deploy-key_access_key` + + `coredump-deploy-key_secret_key` as GH Secrets, then revokes the old + credential at the upstream provider per ADR 0012. +2. **Persist** a recovery record at + `${WFCTL_STATE_DIR:-$HOME/.wfctl}/last-rotation.json` (perms `0600`) + BEFORE any deletion. The record contains + `{type, name, access_key, created_at, source, rotated_at}`. +3. **List** every other `infra.spaces_key` resource in the account whose + `created_at` is older than the new key's `created_at`. +4. **Prompt** for confirmation — skipped under `--non-interactive`. +5. **Delete** each older key via the DO API. +6. **Remove** the recovery file on full success. On any partial + failure the recovery file is RETAINED (see "Recovery from partial + failure" below). + +Add `--preserve-names` to skip keys whose names match a regex even if +they're older than the cutoff (see "Preserving hand-created keys"). + +## Multi-step variant: audit then prune + +Use this when: + +- You want to manually inspect keys before any destructive call. +- The rotation step needs to run separately (e.g., the secrets backend + requires elevated credentials only available in CI). +- You're recovering from a leak where the rotation already happened + out-of-band (e.g., the credential was rotated through the DO console). + +```bash +# Step 1: list all cloud-side keys (read-only, always safe). +wfctl infra audit-keys --type infra.spaces_key +``` + +The output is a fixed-width table with `NAME`, `ACCESS_KEY`, +`CREATED_AT` columns. Compare against IaC state (`wfctl infra outputs`) +to identify orphans. + +```bash +# Step 2: rotate the canonical credential (separate command — no prune). +wfctl infra bootstrap --force-rotate SPACES --env staging +# Capture the stderr sidecar line — only sidecar metadata fields emit +# WFCTL_NEW_KEY_= markers (per the storage-filter contract, +# ADR 0020). The new credential's access_key is NOT in the stderr +# stream — it's stored as the SPACES_access_key GH Secret instead: +# wfctl: rotated provider_credential SPACES (replaced existing value at ) +# WFCTL_NEW_KEY_CREATED_AT= +``` + +`access_key` is canonical credential data, not sidecar metadata, so +it goes through the same code path as `secret_key` (stored as a GH +Secret named `SPACES_access_key`, never logged). To recover it for +the prune step, pick whichever lookup matches your environment: + +```bash +# Option A: read directly from the GH Secrets store where bootstrap put it. +gh secret view SPACES_access_key --repo / + +# Option B: list cloud-side keys and identify the new one by created_at +# (the WFCTL_NEW_KEY_CREATED_AT= line you captured above). +wfctl infra audit-keys --type infra.spaces_key +# → grep the row whose CREATED_AT matches; its ACCESS_KEY column +# is the value to pass as --exclude-access-key. +``` + +```bash +# Step 3: prune with the captured discriminator. +export WFCTL_CONFIRM_PRUNE=1 +wfctl infra prune \ + --type infra.spaces_key \ + --created-before \ + --exclude-access-key \ + --confirm +``` + +If the rotation happened out-of-band (DO console, not bootstrap), get +both `access_key` and `created_at` from the `audit-keys` output for +the new key and use those values directly in step 3. + +## Recovery from partial failure + +If `rotate-and-prune` fails mid-flow — most commonly because the prune +step hit an API rate limit or transient network failure — the rotate +step has ALREADY succeeded (a new credential is live in the secrets +store) but some older keys remain. The recovery file at +`${WFCTL_STATE_DIR:-$HOME/.wfctl}/last-rotation.json` is RETAINED in +this case so you can finish the prune without re-rotating. + +```bash +export WFCTL_CONFIRM_PRUNE=1 +wfctl infra prune \ + --type infra.spaces_key \ + --recovery-from-last-rotation \ + --confirm +``` + +This reads the recovery file and applies the SAME discriminator +(`--created-before` + `--exclude-access-key`) the failed +`rotate-and-prune` would have applied. On success the recovery file is +deleted; on another failure it stays for the next attempt. + +**Do NOT re-run `rotate-and-prune` directly to recover.** It would mint +yet another new credential — leaking another key to the audit log and +making the cleanup harder. The `--recovery-from-last-rotation` path is +the only correct recovery. + +## Preserving hand-created keys + +If your DO account has hand-created keys outside the IaC graph that +must be preserved regardless of age (e.g., a vendor integration key, a +contractor's access key with no managed lifecycle), pass +`--preserve-names` with a regex that matches their `name`: + +```bash +wfctl infra rotate-and-prune \ + --type infra.spaces_key \ + --name coredump-deploy-key \ + --preserve-names '^manual-' \ + --confirm +``` + +Names matching the regex are skipped during the prune phase even if +their `created_at` is older than the cutoff. The regex is matched +against the resource `name` (cloud-side `name` field, not the +`access_key`). + +`--preserve-names` is orthogonal to `--exclude-access-key`: the active +credential's access_key is always preserved, AND any name matching the +regex is preserved on top. + +> **Why "preserve-names" and not "allowlist"?** On a destructive +> command the verb has to be unambiguous. `--allowlist '^manual-'` +> reads to some operators as "delete only manual-* keys" and to +> others as "preserve manual-* keys". The opposite mental models +> would yield opposite outcomes — one of them deleting every +> production key. `--preserve-names` only reads one way. Per ADR 0017. + +## GH Secrets convention for managed `infra.spaces_key` resources + +When you declare an `infra.spaces_key` module in `infra.yaml`: + +```yaml +modules: + - name: api-scoped-key + type: infra.spaces_key + config: + name: api-scoped-key + grants: + - bucket: api-uploads + permission: readwrite +``` + +The DO provider's `SpacesKeyDriver.Create` writes two GH Secrets at +provision time: + +- `api-scoped-key_access_key` +- `api-scoped-key_secret_key` + +Same naming convention as the canonical bootstrap path +(`bootstrapSecrets` for `provider_credential` types, ADR 0015). Module +names that already end in `_access_key` / `_secret_key` are caught by +the `R-A9` align rule (severity `ERROR`) so the doubled-create +anti-pattern can't reach `apply`. + +## Frequently asked + +**Q: Why two-key opt-in (env var + flag)?** Defense in depth against +a stuck-in-history shell line in someone's `~/.bash_history` running +under their UID. The env var is operator-provided per session; the +flag is operator-provided per invocation. Either alone is insufficient. +This is opt-ins (1) + (2) from the "Overview" three-step list; the +interactive y/N prompt (3) is the third gate, skippable only with +`--non-interactive`. ADR 0017. + +**Q: What happens if I omit `--exclude-access-key`?** `prune` exits +with `prune: --created-before AND --exclude-access-key are both +required (paranoia rail)` and code 1. No cloud calls are made. + +**Q: Can I run this against multiple types in one invocation?** No. +`--type` is single-valued. Run `prune` once per type if you have keys +from multiple resource types to clean up. (`audit-keys` is also single- +type for the same reason.) + +**Q: Does the recovery file leak the secret value?** No. It stores +only `{type, name, access_key, created_at, source, rotated_at}`. The +`secret_key` is in the GH Secrets store, never on disk. The file is +written `0600` to be doubly safe. + +**Q: What if the secrets store rejects the new credential write +mid-rotation?** `bootstrapSecrets` returns the error before the old +credential is revoked, so the system stays on the old key. Re-run the +command after fixing the secrets store; nothing was leaked or lost.