Add safe plugin/config hot-reload try-activate rollback #667

@intel352

Problem

workflow-compute agent/provider updates need a Workflow-owned hot-reload path that can safely stage a plugin/config update, probe it, and roll back without crashing or unregistering the currently active plugin surface.

Current observations from workflow:

  • plugin/external/manager.go implements ReloadPlugin as unload-then-load. If the candidate binary/manifest fails to load, the old plugin process is already killed.
  • cmd/server/main.go startup discovery loads external plugins through a local manager, while the management API creates a separate ExternalPluginManager. The API can manage subprocesses but does not obviously own the already-loaded engine plugin registrations.
  • cmd/server/main.go full config reload stops the current engine before building/starting the replacement. Failure after stop can leave the running process degraded.
  • Existing docs claim reload support, but they do not define try-activate, health probe, rollback, or crash-safe handoff semantics.

Required contract

Add a Workflow/wfctl-owned safe reload contract for plugin/config updates:

  1. Stage candidate plugin binary/config without replacing the current active marker.
  2. Start candidate plugin process and perform handshake/strict contract validation.
  3. Build/probe candidate engine/config before stopping the current engine when possible.
  4. Swap active pointers only after probe success.
  5. On candidate load/probe failure, kill candidate and keep current engine/plugin active.
  6. Emit observable reload result/status for operators and agent update managers.
  7. Keep package/update artifact trust outside Workflow core; Workflow should consume already-staged local artifacts/config, not fetch arbitrary release URLs.

Acceptance

  • Unit tests prove ReloadPlugin failure preserves the old plugin client/process registration.
  • Unit/integration tests prove config reload failure keeps the prior engine active.
  • HTTP/API or wfctl surface exposes a dry-run/try-activate result with enough status to drive workflow-compute update campaigns.
  • Docs explicitly distinguish legacy unload/load reload from safe try-activate rollback.

Downstream reference

GoCodeAlone/workflow-compute SPEC task T199, invariant V387: Workflow plugin hot-reload upgrade path supports try-activate, health probe, rollback, and crash-safe config/plugin binary handoff.
