Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
108 changes: 108 additions & 0 deletions decisions/0015-spaces-key-as-iac-resource.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
# 0015: Spaces key as IaC resource — two-phase fix (canonical-schema + Hybrid resource)

- **Date:** 2026-05-09
- **Status:** Accepted

## Context

DigitalOcean Spaces access keys (`access_key` + `secret_key` pairs) are
provisioned and stored by the canonical bootstrap path
(`bootstrapSecrets` in `cmd/wfctl/infra_bootstrap.go`) when a config
declares a `secrets.generate` entry of `type: provider_credential` with
`source: digitalocean.spaces`. The bootstrap layer mints the key via
`generateDOSpacesKey` (in `secrets/generators.go`), then stores the two
sub-keys (`<KEY>_access_key`, `<KEY>_secret_key`) as GitHub Actions
Secrets per `providerCredentialSubKeys`.

Two failure classes had accumulated:

1. **The doubled-create anti-pattern** — operators sometimes wrote two
separate `secrets.generate` entries (`SPACES_access_key` AND
`SPACES_secret_key`), each `type: provider_credential`, each with
the same `name`. Each entry triggered a separate cloud credential
creation; the second silently orphaned the first's access_key. After
N CI runs, the DO account accumulated N orphaned keys (the leak
surface that motivated this plan).

2. **No IaC lifecycle for purpose-scoped keys** — when downstream code
needed a credential scoped to a single bucket (e.g., a customer-
data export job), there was no way to declare it in `infra.yaml` as
a regular IaC resource alongside the bucket. Operators had to mint
one out-of-band and paste the access_key into a secret, which
defeated drift detection + state correlation.

The design considered two approaches:

- **Approach A**: Pure schema correction. Block the doubled-create
anti-pattern at lint (R-A9), rename existing-broken configs to the
canonical single-entry shape, leave purpose-scoped keys to the
out-of-band path. Smallest change; doesn't address (2).

- **Approach B**: Two-phase fix. Phase 1 = same as A (block + migrate);
Phase 2 adds a new `infra.spaces_key` IaC resource type with its own
`ResourceDriver.Create/Read/Update/Delete` so purpose-scoped keys
enter state and participate in plan/apply/drift like any other
resource. Larger change; addresses both failure classes.

## Decision

Take **Approach B** — two-phase fix. Phase 1 ships the schema
correction (R-A9 severity flip + canonical-schema migration of broken
configs + storage-filter fix for sidecar metadata). Phase 2 ships
`infra.spaces_key` as a first-class IaC resource type via a new
`SpacesKeyDriver` in `workflow-plugin-digitalocean`.

The two phases are sequenced: Phase 1 lands first (PR0+PR1+PR2+PR4a in
this plan), then Phase 2 (PR4b+PR5+PR6) once a workflow tag containing
Phase 1 is cut.

## Consequences

- The doubled-create anti-pattern is now blocked at `wfctl infra
align --strict` with severity `ERROR` (always exits non-zero, not just
under `--strict`). See ADR for R-A9 (in plan rev3).

- A new resource type `infra.spaces_key` exists. Its driver writes the
two GH Secrets at Create time using the same naming convention
(`<NAME>_access_key`, `<NAME>_secret_key`) so downstream code that
references `${NAME_access_key}` works identically whether the key
came from canonical bootstrap or the new IaC driver.

- The bootstrap key (the canonical credential `bootstrapSecrets` mints
before any plan runs) is **explicitly NOT** an `infra.spaces_key` —
it lives outside IaC for chicken-and-egg reasons (state-backend
creds need to exist before state can be loaded). See ADR 0018.

- Drift detection for purpose-scoped keys is now meaningful: if an
operator deletes a key out-of-band, `infra drift` flags it.

- The new resource type implies an `EnumeratorAll` interface on the DO
provider (so `wfctl infra audit-keys` / `prune` can find every key
in the account, not just those in state). See ADR 0016.

## Alternatives considered

- **Approach A (pure schema correction)**: rejected because it leaves
purpose-scoped keys in the out-of-band path. Operators would still
hand-paste access_keys into secrets; drift detection would silently
miss orphans.

- **Single-phase fix (combine 1 + 2 in one PR)**: rejected because the
blast radius is too large for one review pass. The schema correction
has security urgency (block the leak vector); the IaC driver is
additive and can take a second review cycle.

- **Tag-based key correlation** (use DO's per-key tag field to match
state): rejected by adversarial review #2 — DO tags overload with
Spaces *bucket* tags and cause noise in cleanup queries; the DO API
doesn't enforce uniqueness on the tag namespace.

## Related

- `docs/plans/2026-05-08-spaces-key-iac-resource-design.md` — full
design doc with the Approach A vs B trade study.
- `docs/plans/2026-05-08-spaces-key-iac-resource-design.adversarial-review-1.md`
through `-7.md` — review history that converged on Approach B.
Comment on lines +102 to +105
- ADR 0016 (EnumeratorAll), ADR 0017 (prune CLI two-key opt-in),
ADR 0018 (bootstrap-key exemption), ADR 0020 (storage-filter sidecar
metadata).
100 changes: 100 additions & 0 deletions decisions/0016-enumerator-all-interface.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
# 0016: EnumeratorAll interface for non-tag-supporting resources

- **Date:** 2026-05-09
- **Status:** Accepted

## Context

The existing `interfaces.Enumerator` interface (added in earlier
plans for `wfctl infra cleanup --tag`) returns `[]ResourceRef` for
resources matching a tag query. It works for resource types where the
cloud API exposes a per-resource tag field that can be queried server-
side: DO Droplets, DO Volumes, DO Domains, AWS EC2 instances, etc.

Several resource types do NOT support tagging at all — DO Spaces
access keys are the canonical example. The DO API exposes `name`,
`access_key`, `created_at` on each key but no tag field. Building
`audit-keys` / `prune` against `Enumerator.EnumerateByTag` would have
required either:

- A spurious tag-querying step that returns "tags unsupported for
resource type" — operators would have to know which types support
tags before invoking.
- A per-type fallback path inside the CLI that branches between
Enumerator + a hand-rolled list call.

Both options leak provider-API quirks into the CLI layer.

## Decision

Add a new optional interface `interfaces.EnumeratorAll`:

```go
type EnumeratorAll interface {
EnumerateAll(ctx context.Context, resourceType string) ([]*ResourceOutput, error)
}
```

Providers that can list resources of `resourceType` regardless of tag
implement this interface. The DO provider implements it for
`infra.spaces_key`. Future providers can implement it for any type
whose API supports listing without filter.

Returns `[]*ResourceOutput` (full metadata) rather than `[]ResourceRef`
(name + type + provider_id) because audit + prune both need the
metadata to render / filter — and the DO List API returns the full
shape in one call, so re-Reading would be wasteful.

CLI dispatchers type-assert at the boundary; providers that don't
implement it surface a structured error (`audit-keys: no loaded
provider implements EnumeratorAll`) rather than crashing.

## Consequences

- Existing `Enumerator` (tag-based) is unchanged. New interface is
purely additive — older provider plugins continue to work without
modification.

- DO provider (`workflow-plugin-digitalocean`) implements
`EnumerateAll` for `infra.spaces_key` with transparent pagination
(DO's `/v2/spaces/keys` API page-size limits). Per-page calls happen
inside the implementation; callers see one slice.

- The CLI dispatchers (`runInfraAuditKeysCmd`,
`runInfraPruneCmd`, `runInfraRotateAndPruneCmd`) iterate loaded
iac.provider modules and pick the first one implementing
`EnumeratorAll`. Multi-provider configs need to ensure at most one
provider per `--type`.
Comment on lines +63 to +67

Comment on lines +63 to +68
- Returning `[]*ResourceOutput` ties the interface to the
`ResourceOutput` shape (`Outputs map[string]any`). Providers must
populate the `Outputs` map with the fields callers expect — for
spaces keys: `name`, `access_key`, `created_at`. This is the
documented contract.

## Alternatives considered

- **Add `EnumerateAll` to the existing `Enumerator` interface** as a
second method. Rejected because it would force every existing
Enumerator implementer to grow a second method or stop satisfying
the interface — backwards-incompatible.

- **Return `[]ResourceRef` from `EnumerateAll`**. Rejected because
audit + prune need the metadata; forcing callers to round-trip back
to `Read` for every result would double the API calls.

- **Centralize listing in `wfctl` core** (manage list-by-type as a
generic operation against `iac.state` only). Rejected because
state-only listing misses resources that exist in the cloud but not
in state — exactly the orphan keys this plan is built to find.

## Related

- `interfaces/iac_provider.go` (commit c8e4c4e8) — `EnumeratorAll`
declaration.
- `workflow-plugin-digitalocean/internal/providers/digitalocean.go`
— DO implementation with pagination.
- `cmd/wfctl/infra_audit_keys.go`, `cmd/wfctl/infra_prune.go` — CLI
consumers.
Comment on lines +97 to +98
- ADR 0015 (spaces-key as IaC resource — this interface enables the
audit + prune CLIs needed to operationalize it).
125 changes: 125 additions & 0 deletions decisions/0017-prune-cli-two-key-opt-in.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
# 0017: `wfctl infra prune` UX — discriminator + two-key opt-in

- **Date:** 2026-05-09
- **Status:** Accepted

## Context

`wfctl infra prune` is destructive — it deletes cloud resources by
calling the provider's `ResourceDriver.Delete`. The user it serves
most often is post-leak cleanup: an operator has rotated a credential
and needs to delete every older key so the leaked one stops being
accepted. This is a high-stakes operation; a typo can nuke the live
key and cause downtime.

The design considered three discriminator shapes:

- **Name regex as the primary filter**: rejected by adversarial
review #2. A regex protects *everything matching the regex*,
including any leaked key whose name happens to match. The leak
surface is exactly what we're trying to remove.

- **Per-key tag**: rejected because spaces keys don't support tags
(motivated ADR 0016).

- **Time + access_key** (this ADR): only resources whose
`created_at` is OLDER than `--created-before` are eligible, AND the
resource whose `access_key` matches `--exclude-access-key` is always
preserved.

The time + access_key shape pairs naturally with the rotation
primitive: rotating a credential mints a new key with a fresh
`created_at` and a known `access_key`. Pass those values and the
filter is unambiguous: "everything older than the new key, except
the new key itself".

The opt-in shape considered three levels:

- **Just `--confirm` flag** (single per-invocation): too easy to
recover from `~/.bash_history`.
- **Just env var**: too easy to ship in a Makefile / CI step.
- **Both required (this ADR)**: the operator must export the env var
AND pass the flag, so neither a stale shell history line nor a
misconfigured Makefile alone is enough to fire.

## Decision

`wfctl infra prune` requires:

1. `--type <T>` (required) — single resource type per invocation. Rejects
if missing.
2. `--created-before <RFC3339>` (required) — only resources older than
this are eligible.
3. `--exclude-access-key <AK>` (required) — this access_key is preserved
no matter what (paranoia rail).
4. `--confirm` flag — explicit per-invocation consent.
5. `WFCTL_CONFIRM_PRUNE=1` environment variable — two-key authorization.
6. Interactive `y/N` prompt — skipped only with `--non-interactive`.

If ANY of (1)-(5) is missing, prune exits with a structured error and
NO cloud calls are made.
Comment on lines +47 to +60

`--preserve-names <regex>` is offered as a secondary, additive filter —
it preserves resources whose `name` matches the regex on top of the
`--exclude-access-key` exclusion. The verb-in-the-name (`preserve` vs.
the original `--allowlist`) is deliberate: on a destructive command,
"allowlist" reads ambiguously (some operators map it to "list of
resources allowed to be deleted", which would invert the semantics
catastrophically); "preserve-names" only reads one way.

`--recovery-from-last-rotation` is the recovery-mode shortcut: reads
the discriminator from the recovery file written by `infra
rotate-and-prune` (see ADR for that flow) so an operator who just had
a partial-failure rotation can finish cleanup without manually
copy-pasting timestamps + access_keys.

## Consequences

- Operators must capture the rotation result before pruning. The
`infra rotate-and-prune` all-in-one command does this automatically;
the multi-step variant requires reading stderr after `infra
bootstrap --force-rotate`.

- The two-key opt-in adds one extra step (`export
WFCTL_CONFIRM_PRUNE=1`) for routine operator use. Acceptable
friction for a destructive operation. CI workflows that need to
prune (e.g., scheduled hygiene jobs) export the env var explicitly.

- The feature flag this ADR codifies (`WFCTL_CONFIRM_PRUNE`) has a
retirement milestone: v0.28.0 makes it always-on (effectively
removing the env var step) IF operational telemetry shows zero
unintended fires for 60 days post-release. Tracked in plan rev3 as
the "feature-flag retirement" sub-task; not in this ADR's scope.

- A typo in `--exclude-access-key` is caught by the value-equality
filter — if the typo'd value doesn't match any real key, every
eligible key (including the live one) is targeted. This is why the
interactive `y/N` prompt exists as the third opt-in: the operator
reviews the list of keys-to-be-deleted before confirming.

## Alternatives considered

- **Soft-delete via tag-and-cron**: rejected because DO Spaces keys
don't support tags + cron-based delayed-delete adds operational
state we don't want.

- **Allowlist-only filter** (no `--exclude-access-key`): rejected per
adversarial review #2. The rotation use case needs an *exclusion*
semantic (preserve N, delete N-1), not an *inclusion* semantic
(delete N matching the regex). Inclusion-by-regex is dangerous
because typos default to "match nothing" which is safe; exclusion-
by-regex defaults to "match everything" which deletes the live key.

- **Single opt-in (env var XOR flag)**: rejected. Either alone is
insufficient defense in depth — env vars persist in shell history
+ CI logs; flags persist in shell history.

## Related

- `cmd/wfctl/infra_prune.go` — implementation.
- `cmd/wfctl/infra_prune_test.go` — `TestInfraPrune_RequiresTwoKeyOptIn`,
`TestInfraPrune_RequiresExcludeAccessKey`,
Comment on lines +119 to +121
`TestInfraPrune_FiltersByTimeAndExcludesAccessKey`.
Comment on lines +119 to +122
- `docs/runbooks/spaces-key-prune.md` — operator-facing runbook.
- ADR 0015 (two-phase plan), ADR 0016 (EnumeratorAll — what prune
enumerates against).
Loading
Loading