docs: workflow#640 Phase 2 design — gRPC contract extension + 5-repo hard-cutover#692
docs: workflow#640 Phase 2 design — gRPC contract extension + 5-repo hard-cutover#692intel352 wants to merge 7 commits into
Conversation
…hard-cutover 10-deliverable Phase 2 design per ADR 0040 + ADR 0024 constraints: Engine side (workflow PR): 1. iac.proto: extend ApplyResult with `repeated ActionResult actions` field + new ActionStatus enum (SUCCESS/ERROR/DELETE_FAILED/COMPENSATED/ COMPENSATION_FAILED + UNSPECIFIED reject case) 2. Regenerate iac.pb.go 3. interfaces.ApplyResult: add Actions []ActionOutcome field 4. cmd/wfctl/iac_typed_adapter.go::applyResultFromPB: decode new field + REJECT UNSPECIFIED status + REJECT len(actions) != len(plan.actions) 5. iac/wfctlhelpers/apply.go: engine populates result.Actions per dispatch 6. cmd/wfctl/deploy_providers.go: schema-validated manifest load (Phase 1 Assumption 8 — closes silent-v1-fallback gap) Plugin side (4 plugin PRs): 7. aws Apply: custom-loop populates result.Actions per iteration 8. gcp Apply: same 9. azure Apply: same 10. DO: workflow-pin + minEng bump only (already canonical-delegate) Coordinated cutover sequence: - workflow v0.54.0 + aws v1.2.0 + gcp v1.2.0 + azure v1.2.0 + DO v1.2.0 all ship in same release window (~30 min) - Per ADR 0024 NO graceful fallback; old plugin tags permanently incompatible with workflow v0.54.0+ Self-challenge surfaced 3 doubts: 1. UNSPECIFIED rejection breaks in-flight upgrade — intentional per ADR 0024 2. output_keys ambiguity for replace actions — needs writing-plans resolution 3. Manifest validation gate expands Phase 2 scope — Phase 1 Assumption 8 documented 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Design-only PR adding a Phase 2 design document for extending the IaC gRPC contract (ApplyResult.actions) and coordinating a 5-repo hard-cutover release per ADR 0024/0040. No code changes.
Changes:
- Adds
docs/plans/2026-05-16-v2-lifecycle-phase2-design.mddescribing proto extension, Go interface additions, wfctl decoder validation, engine-side population, and per-plugin migration patterns. - Outlines coordinated cutover sequence, rollback strategy, assumptions, and self-challenge notes.
- Flags ADR 0041 to be landed alongside.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
⏱ Benchmark Results✅ No significant performance regressions detected. benchstat comparison (baseline → PR)
|
…1 findings Critical fixes (3): - C-1: Pattern B (custom-loop for aws/gcp/azure) was ARCHITECTURALLY BROKEN — hooks only fire on v2 dispatch path (wfctlhelpers.ApplyPlanWithHooks); custom-loop plugins take v1 path; wfctl never reads Actions field even if populated. Fix: all 4 plugins declare compute_plan_version="v2" in CapabilitiesResponse → wfctl routes through v2 dispatch → engine-side population handles all 4 uniformly. Pattern A + B COLLAPSE into single pattern; existing IaCProvider.Apply impls on aws/gcp/azure become dead code (kept in-tree for Phase 2 to minimize blast radius; Phase 2.5 cleanup may delete). Deliverables shrink 10 → 8. - C-2: applyResultFromPB(r *pb.ApplyResult) doesn't receive plan; design's pseudo-code plan.GetActions() was a compile error. Fix: length validation moves to caller (typedIaCAdapter.Apply has both plan + result); applyResultFromPB only decodes + rejects UNSPECIFIED. - C-3: pluginmanifest.ValidateBytes package doesn't exist; was speculative hidden scope (~200 LOC + new dependency). Fix: defer manifest validation to Phase 2.1 follow-up PR (file followup issue as deliverable #6); reduces Phase 2 5-repo cutover scope. Phase 2.1 can ship simpler check (verify computePlanVersion ∈ {"v1","v2"} OR empty) without full JSON- schema infrastructure. Important + timing fixes: - "Atomic 30-min release window" claim was unrealistic given Copilot review cycles + GoReleaser windows. Revised to 2-3 hour empirical estimate based on this session's plugin-sweep observations. - v1.2.0 semver decision documented (compatibility ratcheting via minEngineVersion, not API break of plugin's own API; matches prior sweep precedent). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er autonomous mandate, at 2-revision-limit) Cycle 2 adversarial review surfaced 2 Crit + 2 Imp + 3 Minor. At skill's 2-revision-cycle limit; per autonomous mandate, applying surgical text- edits and proceeding to writing-plans WITHOUT cycle 3 re-review. Critical fixes: - C-1 (phantom status values COMPENSATED + COMPENSATION_FAILED): RESERVED tag 4 + 5 in proto comment for Phase 2.3 future; NOT defined in Phase 2 enum. Engine code has no compensation path today; defining unreachable enum values creates linter exhaustiveness violations. - C-2 (stale Approaches A description + ADR Consequences contradict cycle-1 Pattern B collapse): UPDATED both passages to reflect single uniform pattern (all 4 plugins declare v2; engine-side population). Important fixes (reviewer's strong recommendation applied — Option): - I-1 (output_keys map<string,string> vs Outputs map[string]any type mismatch): DROP output_keys field from ActionResult entirely. Hook firing only requires action_index + status. Per-resource outputs already in ApplyResult.resources (existing aggregated field). Simultaneously resolves m-2 (Replace output attribution ambiguity). - I-2 (length validation placed in wrong call site): MOVED to iac/wfctlhelpers/apply.go::applyPlanWithEnvProviderAndHooks (engine path) AFTER dispatch loop completes. typedIaCAdapter.Apply (v1 path) no longer enforces — v1 plugins legitimately return empty Actions. Minor: m-1 (manifest validation deferral language) — design explicitly recommends option (c) defer to Phase 2.1; ADR Consequences updated to match. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… design 13 tasks across 5 PRs: - PR 1 (workflow v0.54.0): proto + pb.go regen + interfaces + decoder + engine populate + Phase 2.1 followup issue + tag (Tasks 1-7) - PR 2-5 (4 plugins parallel): each declares ComputePlanVersion=v2 + bumps workflow pin + bumps minEng + releases v1.2.0 (Tasks 8-11) - Tasks 12-13 (team-lead non-PR): cross-plugin smoke + memory update Universal per-plugin pattern reuses cloud-SDK sweep precedent (5-step branch + ff-pull → bump → tidy/build/test → minEng update → commit/PR/ admin-merge/tag/release). Per-task includes Files (path:line), TDD test step, Verification per change class, Rollback note. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Critical fixes: - C-1 (engine populate misses pre-dispatch continue paths): rewrote Task 4 with deferred-closure pattern inside loop body so EVERY exit path (jit error, driver-resolve error, dispatch error, success) appends ActionOutcome. Added second test covering driver-resolve pre-dispatch error to lock the invariant. - C-2 (Task 4 test used non-existent fakeProvider.drivers map): rewrote to use real fakeProvider/newFakeProvider() single-driver API + added driverErr-based pre-dispatch test variant. - C-3 (Universal plugin template wrong Capabilities signature + filename): template now references actual file (internal/iacserver.go), actual receiver pattern (*<plugin>IaCServer), actual param type (*pb.CapabilitiesRequest); instruction is to change ONLY the return- statement struct literal, not the function signature. Important fixes: - I-1 (Tasks 1+2 broken intermediate commit): COLLAPSED proto edit + regen into single Task 1 atomic commit. Tasks renumbered. - I-2 (Task 6 misclassified as implementer task): MOVED to Pre-dispatch setup step 2 (team-lead action). Task 6 removed; remaining tasks renumbered. PR Count stays 5 but Task count 13→11. - I-3 (Task 3 GetOutputKeys lingering): verified clean in current text (no GetOutputKeys reference present). Minor: M-3 memory file path (project_cloud_sdk_extraction_complete.md vs new project_v2_lifecycle_phase2.md) — leaving as-is; existing file is the canonical thread for the larger #640 effort. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ent-check FAIL fix) plan-scope-check.sh flagged Tasks 10+11 as not in PR Grouping table. Tasks 10 (smoke) + 11 (memory) are team-lead post-cascade operational actions, not implementer tasks creating PRs. Moved to a 'Post-cascade closeout' section after the task list; Scope Manifest task count 11→9. Now: PR Count=5, Tasks=9, all sequential 1-9, all in PR Grouping table. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- plan-scope-check.sh PASS exit 0 - adversarial-design-review: design cycles 1+2 fixed; design cycle-2 polish per autonomous mandate (skill 2-revision-limit) - adversarial-plan-review cycle 1: 3 Crit + 3 Imp all addressed in surgical revision; cycle 2 skipped per autonomous mandate (revisions are direct cycle-1 fixes; no new architectural decisions) - Per skill spec: subagent-driven-development is fresh-session work given conversation context budget; this scope-lock is the pickup point for the next session 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Copilot reviewed 3 out of 3 changed files in this pull request and generated 13 comments.
Comments suppressed due to low confidence (1)
docs/plans/2026-05-16-v2-lifecycle-phase2.md:720
- Task 11 Step 1 refers to "tracking issue filed in Task 6", but in the current plan Task 6 is the aws plugin PR — the issue-filing now lives in the "Pre-dispatch setup" section, not in any numbered task. Update the cross-reference so the operator looks in the right place when closing out Phase 2.
| actions = append(actions, interfaces.ActionOutcome{ | ||
| ActionIndex: a.GetActionIndex(), | ||
| Status: mapPBStatusToInterface(a.GetStatus()), | ||
| Outputs: a.GetOutputKeys(), |
| ActionStatusCompensated | ||
| ActionStatusCompensationFailed |
| **Phase 2 scope (REVISED per cycle-1 adversarial review: Pattern B collapsed — 8 deliverables, not 10):** | ||
|
|
||
| The cycle-1 review correctly identified that "Pattern B (custom-loop)" is architecturally broken: aws/gcp/azure don't declare `compute_plan_version="v2"` in their CapabilitiesResponse, so wfctl takes the v1 dispatch path (`provider.Apply(ctx, &plan)` direct call). Even if those plugins populate the new `Actions` wire field, wfctl never reads it on the v1 path. The intended OnResourceApplied / OnResourceDeleted hook firing CANNOT happen via Pattern B. | ||
|
|
||
| **Correct architecture**: have ALL 4 plugins DECLARE `compute_plan_version="v2"` in their CapabilitiesResponse. wfctl then routes through `wfctlhelpers.ApplyPlanWithHooks` for all 4 — engine-side population handles Actions UNIFORMLY (Pattern A becomes the ONLY pattern). The plugins' existing `IaCProvider.Apply` impls become unused dead code (engine dispatches via `provider.ResourceDriver(action.Resource.Type)` per action instead). | ||
|
|
||
| **8 deliverables:** | ||
|
|
||
| 1. Extend `plugin/external/proto/iac.proto` `ApplyResult` message with `repeated ActionResult actions` field + new `enum ActionStatus` (workflow PR) | ||
| 2. Regenerate `iac.pb.go` from updated proto (workflow PR) | ||
| 3. Extend `interfaces.ApplyResult` Go struct with `Actions []ActionOutcome` field (workflow PR) | ||
| 4. Update `cmd/wfctl/iac_typed_adapter.go::applyResultFromPB` to decode the new field. **REVISED per cycle-1 C-2**: function signature must thread plan-action-count for length validation. Either (a) add `expectedActionCount uint32` parameter to applyResultFromPB OR (b) move length validation to caller after applyResultFromPB returns. | ||
| 5. Update `iac/wfctlhelpers/apply.go::applyPlanWithEnvProviderAndHooks` to populate `result.Actions` during dispatch (workflow PR) — the engine ALREADY iterates plan.Actions via `dispatchAction`; Phase 2 just appends ActionOutcome per iteration | ||
| 6. **REVISED per cycle-1 C-3**: manifest validation gate scope. Two options: (a) implement new `plugin/external/manifest` package with `ValidateBytes(bytes []byte) error` + create `schema/plugin_manifest.json` (hidden scope, ~200 LOC). (b) Use existing `schema/` package's JSON schema infrastructure if it covers plugin.json (probe required at writing-plans). (c) DEFER manifest validation to Phase 2.1 separate PR. Decision needed at writing-plans phase. Phase 2 design now LISTS this explicitly as deliverable #6 with scope-decision sub-task. | ||
| 7. **All 4 plugins (aws/gcp/azure/DO): declare `compute_plan_version="v2"` in CapabilitiesResponse** + bump workflow pin + bump minEngineVersion. ~5-line change per plugin. PRs in parallel. Plugins' existing `IaCProvider.Apply` impls become dead-code; they could be deleted in a Phase 2.5 cleanup but kept for Phase 2 to avoid blast radius. | ||
| 8. Cross-plugin smoke: install each plugin into wfctl + run sample apply + verify ActionResults populated AND that v2 dispatch path was taken (not v1 fallback). | ||
|
|
||
| **Coordinated cutover:** workflow PR + 4 plugin PRs land in same release window. Workflow tag (v0.54.0 candidate) carries the engine change. 4 plugin tags (aws v1.2.0 / gcp v1.2.0 / azure v1.2.0 / DO v1.2.0) carry the per-plugin Apply impls. All 5 ship in lockstep. |
|
|
||
| 1. **`ACTION_STATUS_UNSPECIFIED` rejection at decoder layer is strict; could break in-flight upgrade window.** If wfctl is upgraded to v0.54.0 while a stale-tagged plugin (pre-v1.2.0) is still loaded, the plugin emits no actions field → decoder reads len(actions)==0 vs plan having N actions → REJECTS. This is intentional per ADR 0024 (no graceful fallback). But it means operators MUST upgrade plugins immediately after upgrading workflow. | ||
|
|
||
| 2. **Per-action `output_keys` semantic ambiguity** — for delete actions, what goes in output_keys? Spec says "empty for delete actions." But ResourceReplacer (delete-then-create) may surface new outputs from the recreate phase. Decision needed: which action is the replace recorded under — the delete or the create-after-delete? |
| 1. **Proto wire format is forwards-compatible AT THE FIELD LEVEL** — new `actions` field at tag 7 doesn't conflict with existing fields. Verified by checking iac.proto:295 (ApplyResult tags 1-6 used). New tag 7 + reserved-tags discipline = safe. | ||
| 2. **All 4 IaC plugins use the same `iac.proto` definition** — verified plan-2 + plan-1 sweep work. Plugins re-import workflow/plugin/external/proto via go module path. | ||
| 3. **`wfctlhelpers.ApplyPlanWithHooks` is the engine-side path that populates `result.Actions`** — direct path: dispatchAction → action result evidence → append to result.Actions. Verified during Phase 1: ApplyPlanWithHooks is the ONLY engine entrypoint after Phase 1+4 migration. | ||
| 4. **`pluginmanifest.ValidateBytes` (or equivalent) exists or can be added** — verified existence of `schema/` package with JSON schema generation. Phase 2 adds the runtime validator if not present. |
| Per the cloud-SDK plugin sweep precedent — team-lead actions BEFORE dispatching implementers: | ||
|
|
||
| 1. Verify ADR 0024 + 0040 are still binding (read decisions/0024 + decisions/0040; confirm no override in flight). | ||
| 2. **File `workflow#640-phase-2.1` follow-up tracking issue** (moved from Task 6 per cycle-1 plan-review I-2 — this is a team-lead action, not an implementer task; the tracking issue body references the Phase 2 PR which doesn't exist yet at implementer-time). Issue body: "Phase 2.1 follow-up to #640 Phase 2 (PRs land via Phase 2 cascade) — add manifest validation gate at cmd/wfctl/deploy_providers.go::findIaCPluginDir per Phase 1 Assumption 8. Three implementation options recorded in Phase 2 design doc (full pluginmanifest package; reuse existing schema/; lightweight computePlanVersion enum check). Pick at design time." |
| In `iac/wfctlhelpers/apply.go`, the dispatch loop has multiple `continue` exits (verified: lines 224 jit-error, 234 driver-resolve-error, 261 dispatchAction-error, 287, 313). **EVERY continue path must append an ActionOutcome** OR the post-loop length-assert false-fails on legitimate plans with errors. | ||
|
|
||
| Cleanest implementation: deferred closure inside the loop body that records the ActionOutcome on every exit path: | ||
|
|
||
| ```go | ||
| for i := range plan.Actions { | ||
| action := plan.Actions[i] | ||
| var dispatchErr error | ||
| var loopErr error // captures the actual error of this iteration | ||
|
|
||
| // Deferred closure: runs on EVERY exit from this iteration (continue or fall-through). | ||
| // Guarantees 1-to-1 correspondence between plan.Actions and result.Actions | ||
| // regardless of which continue/branch the code took. | ||
| func() { | ||
| defer func() { | ||
| status := mapDispatchErrToStatus(loopErr, action.Action) | ||
| errStr := "" | ||
| if loopErr != nil { errStr = loopErr.Error() } | ||
| result.Actions = append(result.Actions, interfaces.ActionOutcome{ | ||
| ActionIndex: uint32(i), | ||
| Status: status, | ||
| Error: errStr, | ||
| }) | ||
| }() | ||
|
|
||
| // ctx cancellation check | ||
| if err := ctx.Err(); err != nil { loopErr = err; return } | ||
|
|
||
| // Existing JIT substitution at apply.go:217 | ||
| resolved, err := jitsubst.ResolveSpec(action.Resource, result.ReplaceIDMap, syncedOutputs, os.LookupEnv) | ||
| if err != nil { | ||
| // ... existing result.Errors append for JIT error ... | ||
| loopErr = fmt.Errorf("jit substitution: %w", err) | ||
| return | ||
| } | ||
|
|
||
| // Existing driver resolution at apply.go:228 | ||
| d, err := p.ResourceDriver(action.Resource.Type) | ||
| if err != nil { | ||
| // ... existing result.Errors append for driver-resolve error ... | ||
| loopErr = err | ||
| return | ||
| } | ||
|
|
||
| // Existing dispatchAction call at apply.go:251 | ||
| if err := dispatchAction(ctx, d, resolved, result, actionHooks, deleteHookActive); err != nil { | ||
| // ... existing result.Errors handling ... | ||
| loopErr = err | ||
| return | ||
| } | ||
| // Success path — loopErr stays nil; deferred closure records ActionStatusSuccess. | ||
| }() | ||
| } | ||
| ``` | ||
|
|
||
| The implementer should RESTRUCTURE the existing loop body to fit this shape — the deferred closure pattern preserves the existing best-effort continue-on-error semantics while guaranteeing the ActionOutcome append on every path. |
|
|
||
| **Step 1: Edit iac.proto — add ActionStatus enum + ActionResult message** | ||
|
|
||
| Add the following BEFORE `message ApplyResult` (line 295): |
| Edit **`internal/iacserver.go`** (verified file location across all 4 plugins via cycle-1 plan-review). The Capabilities method signature is identical across all 4 plugins: | ||
|
|
||
| ```go | ||
| // Existing pattern (verified per-plugin — receiver varies: awsIaCServer / gcpIaCServer / azureIaCServer / doIaCServer): | ||
| func (s *<plugin>IaCServer) Capabilities(_ context.Context, _ *pb.CapabilitiesRequest) (*pb.CapabilitiesResponse, error) { | ||
| // ... existing IaCCapabilityDeclaration build into `out` ... | ||
| return &pb.CapabilitiesResponse{Capabilities: out}, nil // CURRENT | ||
| } | ||
| ``` | ||
|
|
||
| **Change ONLY the return-statement struct literal** to add ComputePlanVersion field: | ||
|
|
||
| ```go | ||
| return &pb.CapabilitiesResponse{ | ||
| Capabilities: out, | ||
| ComputePlanVersion: "v2", // NEW Phase 2: declare v2 dispatch per ADR 0040 | ||
| }, nil | ||
| ``` | ||
|
|
||
| Do NOT rewrite the function body, receiver type, parameter types, or parameter names. |
| // Default; must NEVER be emitted by plugins. wfctl rejects ApplyResponse | ||
| // with any action_index whose status == UNSPECIFIED (catches plugin bugs | ||
| // where action result was forgotten). |
|
Superseded by actual Phase 2 implementation (#691 Phase 1, #694 Phase 2 v2 hooks-over-gRPC, #697 Phase 2.5 IaCProviderFinalizer, #700 Phase 2.3 ActionStatus enums all merged). The design-only PR was the planning artifact; the implementation cascaded via different PRs without first landing this design doc. Per project_v2_lifecycle_phase2_shipped memory: Phase 2 fully SHIPPED 2026-05-17 + Phase 2.3 + Phase 2.5 + Phase 2.5+ cleanup. Closing as superseded. |
Summary
Phase 2 of workflow#640. Design doc for the gRPC contract extension that delivers v2 hooks-over-gRPC per ADR 0040's 5 provider compatibility expectations, shipped as a HARD-CUTOVER 5-repo coordinated PR cascade per ADR 0024.
This PR is design-only — lands the design doc + ADR 0041 (in followup commit if not bundled). The actual implementation (10 deliverables across 5 repos) is the next session's pickup point.
Scope (10 deliverables)
Engine side (workflow PR — v0.54.0):
iac.proto: extendApplyResultwithrepeated ActionResult actionsfield + newActionStatusenumiac.pb.gointerfaces.ApplyResult: addActions []ActionOutcomefieldcmd/wfctl/iac_typed_adapter.go::applyResultFromPB: decode + REJECT UNSPECIFIED status + REJECT len(actions) != len(plan.actions)iac/wfctlhelpers/apply.go: engine populatesresult.Actionsper dispatchcmd/wfctl/deploy_providers.go: schema-validated manifest load (Phase 1 Assumption 8)Plugin side (4 plugin PRs — v1.2.0 each):
7. aws Apply: custom-loop populates
result.Actions8. gcp Apply: same
9. azure Apply: same
10. DO: workflow-pin + minEng bump only (already canonical-delegate)
Coordinated cutover
workflow v0.54.0 + aws v1.2.0 + gcp v1.2.0 + azure v1.2.0 + DO v1.2.0 ship in same release window (~30 min). Per ADR 0024 NO graceful fallback; old plugin tags permanently incompatible with workflow v0.54.0+.
Phase 1 context (already shipped)
// Deprecated:marker onwfctlhelpers.ApplyPlanApplyPlanWithHooks)Pipeline state
Test plan
🤖 Generated with Claude Code