38 changes: 38 additions & 0 deletions pkg/controller/machine/AGENTS.md
@@ -0,0 +1,38 @@
# Machine Controller

**Machine lifecycle management across cloud providers**

## OVERVIEW
Creates/deletes provider instances, tracks Machine phases, drains nodes before deletion.

## WHERE TO LOOK

| Task | File | Notes |
|------|------|-------|
| Reconcile loop | `machine_controller.go` | Core lifecycle logic |
| Instance creation | `machine_instance.go` | Provider instance provisioning |
| Node draining | `machine_drain.go` | Pre-delete node drain |
| Phase tracking | `machine_phase.go` | Provisioning/Running/Failed states |
| Test suite | `machine_suite_test.go` | Ginkgo envtest setup |

## CONVENTIONS

- Machine phases: `Provisioning` → `Running` → `Failed`/`Terminating`
- Always check `Machine.Status.Phase` before operations
- Node drain: may be skipped for master machines, though skipping is not recommended
- Instance deletion: Always wait for provider cleanup
- ProviderID: Required for Node ↔ Machine linking

## ANTI-PATTERNS (THIS PACKAGE)

- **NEVER** change Machine.Spec (has no effect)
- **NEVER** skip draining when deleting worker machines
- **NEVER** remove finalizer manually (causes orphaned VMs)
- **ALWAYS** check `Machine.DeletionTimestamp` before provisioning
- **ALWAYS** set `Machine.Status.ProviderID` after instance creation

## NOTES

- Provider logic lives in `machine-api-provider-*` repos (except vSphere)
- Machine controller is platform-agnostic; actuators are external
- Use `machine.openshift.io/cluster-api-cluster` label for cluster scoping
37 changes: 37 additions & 0 deletions pkg/controller/machineset/AGENTS.md
@@ -0,0 +1,37 @@
# MachineSet Controller

**Replica management for MachineSet CRD**

## OVERVIEW
Ensures expected number of Machine replicas with consistent provider config; implements delete policies (Random/Oldest/Newest).

## WHERE TO LOOK

| Task | File | Notes |
|------|------|-------|
| Reconcile loop | `machineset_controller.go` | Main reconciliation logic |
| Delete policy | `machineset_delete_policy.go` | Random/Oldest/Newest implementations |
| Replica sync | `machineset_replica.go` | Scale up/down logic |
| Machine selection | `machineset_selector.go` | Label-based machine matching |
| Test suite | `machineset_suite_test.go` | Ginkgo envtest setup |

## CONVENTIONS

- Prefer komega over plain Gomega
- Each test uses `*_suite_test.go` pattern
- Tests run with `--race --trace --timeout=10m`
- Delete policy: Never delete machines with `Deleting` finalizer
- Replica sync: Always check MachineSet status before scaling

## ANTI-PATTERNS (THIS PACKAGE)

- **NEVER** skip status update after replica change
- **NEVER** delete machines without checking delete policy
- **ALWAYS** use `komega.Eventually()` for async state checks
- **ALWAYS** respect `MachineSet.DeletionTimestamp` during reconciliation

## NOTES

- MachineSet controller does NOT create instances (Machine controller does)
- Delete policy only applies when scaling down
- Machines with `machine.openshift.io/deletion-in-progress` are protected
47 changes: 47 additions & 0 deletions pkg/controller/vsphere/AGENTS.md
@@ -0,0 +1,47 @@
# vSphere Controller

**vSphere-specific Machine actuator (only provider in this repo)**

## OVERVIEW
Implements Machine lifecycle for vSphere infrastructure; manages VM creation, deletion, and state tracking.

## WHERE TO LOOK

| Task | File | Notes |
|------|------|-------|
| Reconcile loop | `vsphere_controller.go` | Main vSphere reconciliation |
| VM creation | `vm_provision.go` | vSphere VM provisioning |
| Session management | `session/` | vCenter connection handling |
| MachineSet support | `machineset/` | vSphere-specific MachineSet logic |
| Test suite | `vsphere_suite_test.go` | Ginkgo envtest setup |

## STRUCTURE

```
vsphere/
├── session/ # vCenter connection/session management
├── machineset/ # vSphere-specific MachineSet controller
└── [root] # Core vSphere actuator
```

## CONVENTIONS

- vCenter connection: Always use session pool from `session/`
- VM naming: `{cluster}-{machine-name}-{uuid}`
- Power state: Check before operations (powered on/off/suspended)
- Network: Always attach to configured port groups
- Disk: Use thin provisioning by default
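
The `{cluster}-{machine-name}-{uuid}` naming convention can be sketched as a one-liner; `vmName` is an illustrative helper, not the repo's actual naming code.

```go
package main

import (
	"fmt"
	"strings"
)

// vmName builds the `{cluster}-{machine-name}-{uuid}` form described in
// the conventions above.
func vmName(cluster, machine, uuid string) string {
	return strings.Join([]string{cluster, machine, uuid}, "-")
}

func main() {
	fmt.Println(vmName("prod", "worker-0", "1a2b3c")) // prod-worker-0-1a2b3c
}
```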

## ANTI-PATTERNS (THIS PACKAGE)

- **NEVER** create VMs outside Machine API (causes orphaned resources)
- **NEVER** skip VM power state checks
- **ALWAYS** use session pool (not direct vCenter connections)
- **ALWAYS** handle vSphere API rate limiting
- **ALWAYS** clean up orphaned VMs on Machine deletion
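
The power-state check this package insists on can be sketched as below. The string constants mirror the vSphere API enum values (`types.VirtualMachinePowerState*`) in simplified form, and `needsPowerOn` is an illustrative guard reflecting this PR's review discussion: both powered-off and suspended VMs need a power-on before reconciliation proceeds.

```go
package main

import "fmt"

// Simplified mirrors of the vSphere power-state enum values.
const (
	poweredOn  = "poweredOn"
	poweredOff = "poweredOff"
	suspended  = "suspended"
)

// needsPowerOn reports whether a VM in the given state must be powered
// on before the Machine can progress.
func needsPowerOn(state string) bool {
	return state == poweredOff || state == suspended
}

func main() {
	fmt.Println(needsPowerOn(poweredOff)) // true
	fmt.Println(needsPowerOn(suspended))  // true
	fmt.Println(needsPowerOn(poweredOn))  // false
}
```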

## NOTES

- Only cloud provider implemented directly in MAO repo
- Other providers (AWS/GCP/Azure) live in `machine-api-provider-*` repos
- Session pool prevents vCenter connection exhaustion
89 changes: 85 additions & 4 deletions pkg/controller/vsphere/reconciler.go
@@ -177,6 +177,11 @@ func (r *Reconciler) create() error {

moTask, err := r.session.GetTask(r.Context, r.providerStatus.TaskRef)
if err != nil {
// Task may have been cleaned up by vCenter. Check if VM exists and needs to be powered on.
if isRetrieveMONotFound(r.providerStatus.TaskRef, err) {
klog.V(2).Infof("%v: task %s not found, checking if VM exists and needs power-on", r.machine.GetName(), r.providerStatus.TaskRef)
return r.handleTaskNotFound()
}
metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{
Name: r.machine.Name,
Namespace: r.machine.Namespace,
@@ -186,9 +191,10 @@ func (r *Reconciler) create() error {
}

if moTask == nil {
// Possible eventual consistency problem from vsphere
// TODO: change error message here to indicate this might be expected.
return fmt.Errorf("unexpected moTask nil")
// Task object is nil, may indicate eventual consistency or cleanup.
// Check if VM exists and needs to be powered on.
klog.V(2).Infof("%v: task %s returned nil, checking if VM exists and needs power-on", r.machine.GetName(), r.providerStatus.TaskRef)
return r.handleTaskNotFound()
}

if taskIsFinished, err := taskIsFinished(moTask); err != nil {
@@ -246,7 +252,7 @@ func (r *Reconciler) create() error {
if statusError != nil {
return fmt.Errorf("failed to set provider status: %w", err)
}
		return err
}
return setProviderStatus(task, conditionSuccess(), r.machineScope, nil)
}
@@ -255,6 +261,81 @@ func (r *Reconciler) create() error {
return nil
}

// handleTaskNotFound handles the case where the task reference exists but the task
// object is no longer available in vCenter (e.g., cleaned up). It checks if the VM
// exists and if it's powered off, attempts to power it on. This handles the race
// condition where clone completes but power-on hasn't been triggered yet.
func (r *Reconciler) handleTaskNotFound() error {
// Try to find the VM
vmRef, err := findVM(r.machineScope)
if err != nil {
if isNotFound(err) {
// VM doesn't exist yet, this is unexpected if we have a TaskRef
// Clear the TaskRef and let the next reconcile attempt to clone
klog.V(2).Infof("%v: VM not found after task cleanup, clearing TaskRef and requeuing", r.machine.GetName())
r.providerStatus.TaskRef = ""
if err := r.machineScope.PatchMachine(); err != nil {
return fmt.Errorf("failed to patch machine after clearing TaskRef: %w", err)
}
return fmt.Errorf("%v: VM not found, will retry clone operation", r.machine.GetName())
}
return fmt.Errorf("%v: failed to find VM after task cleanup: %w", r.machine.GetName(), err)
}

// VM exists, check its power state
vm := &virtualMachine{
Context: r.machineScope.Context,
Obj: object.NewVirtualMachine(r.machineScope.session.Client.Client, vmRef),
Ref: vmRef,
}

powerState, err := vm.getPowerState()
if err != nil {
return fmt.Errorf("%v: failed to get VM power state after task cleanup: %w", r.machine.GetName(), err)
}

klog.V(2).Infof("%v: VM found with power state %s after task cleanup", r.machine.GetName(), powerState)

// If VM is powered off, we need to power it on
if powerState == types.VirtualMachinePowerStatePoweredOff {
klog.Infof("%v: VM exists but is powered off after task cleanup, powering on", r.machine.GetName())
task, err := vm.powerOnVM()
if err != nil {
Comment on lines +301 to +304

⚠️ Potential issue | 🟠 Major

Use the existing powerOn() helper here to preserve DRS override behavior.

Line 303 calls vm.powerOnVM() directly, bypassing the existing powerOn(r.machineScope) path used elsewhere in the create flow. That can diverge behavior for environments relying on the datacenter power-on path.

⚙️ Proposed fix
-		task, err := vm.powerOnVM()
+		task, err := powerOn(r.machineScope)

metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{
Name: r.machine.Name,
Namespace: r.machine.Namespace,
Reason: "PowerOn task finished with error",
})
conditionFailed := conditionFailed()
conditionFailed.Message = err.Error()
statusError := setProviderStatus(task, conditionFailed, r.machineScope, vm)
if statusError != nil {
return fmt.Errorf("failed to set provider status: %w", statusError)
}
return fmt.Errorf("%v: failed to power on machine: %w", r.machine.GetName(), err)
}

// Clear the old TaskRef and set the new power-on task
r.providerStatus.TaskRef = task
klog.Infof("%v: Successfully initiated power-on, tracking new task %s", r.machine.GetName(), task)
return setProviderStatus(task, conditionSuccess(), r.machineScope, vm)
}

// VM is already powered on (or in another state), clear the old TaskRef
// and let the next reconcile proceed to update
klog.V(2).Infof("%v: VM is already powered on (%s), clearing old TaskRef", r.machine.GetName(), powerState)
Comment on lines +301 to +327

⚠️ Potential issue | 🟠 Major

Handle suspended VMs explicitly instead of clearing TaskRef as success.

Current logic only powers on when the state is PoweredOff. A Suspended VM falls through to line 328 and clears TaskRef, which can advance reconciliation without actually restoring runtime state.

🛠️ Proposed fix
-	if powerState == types.VirtualMachinePowerStatePoweredOff {
+	if powerState == types.VirtualMachinePowerStatePoweredOff ||
+		powerState == types.VirtualMachinePowerStateSuspended {

r.providerStatus.TaskRef = ""
if err := r.machineScope.PatchMachine(); err != nil {
return fmt.Errorf("failed to patch machine after clearing TaskRef: %w", err)
}

// Return nil to indicate success, next reconcile will call update()
return nil
}

// update finds a vm and reconciles the machine resource status against it.
func (r *Reconciler) update() error {
if err := validateMachine(*r.machine); err != nil {
38 changes: 38 additions & 0 deletions pkg/operator/AGENTS.md
@@ -0,0 +1,38 @@
# Operator

**Platform detection, controller deployment, ClusterOperator status**

## OVERVIEW
Deploys platform-specific controllers, manages ClusterOperator status, handles images configuration.

## WHERE TO LOOK

| Task | File | Notes |
|------|------|-------|
| Platform detection | `platform.go` | Detects cloud provider (vSphere, kubemark, etc.) |
| Controller deployment | `deployment.go` | Deploys machine-controllers Deployment |
| ClusterOperator status | `clusteroperator.go` | Manages MAO ClusterOperator status |
| Images config | `images.go` | Parses images.json ConfigMap |
| Test suite | `operator_suite_test.go` | Ginkgo envtest setup |

## CONVENTIONS

- Platform detection: Check `Infrastructure.cluster` status
- Controller deployment: Always set image from `images.json` ConfigMap
- ClusterOperator: Update status on every reconciliation
- Leader election: Single active operator instance
- Feature gates: Check OpenShift featuregates before enabling
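
The "validate `images.json` before using" rule can be sketched as below. The JSON keys (`machineAPIOperator`, `machineController`) are illustrative assumptions — the real schema is whatever the images ConfigMap ships — and `parseImages` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Images holds illustrative fields; the real images.json schema is
// defined by the images ConfigMap.
type Images struct {
	MachineAPIOperator string `json:"machineAPIOperator"`
	MachineController  string `json:"machineController"`
}

// parseImages validates before use: an empty image reference is rejected
// instead of silently deployed.
func parseImages(data []byte) (*Images, error) {
	var imgs Images
	if err := json.Unmarshal(data, &imgs); err != nil {
		return nil, fmt.Errorf("invalid images.json: %w", err)
	}
	if imgs.MachineController == "" {
		return nil, fmt.Errorf("images.json missing machineController image")
	}
	return &imgs, nil
}

func main() {
	imgs, err := parseImages([]byte(`{"machineAPIOperator":"quay.io/example/mao:v1","machineController":"quay.io/example/mc:v1"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(imgs.MachineController) // quay.io/example/mc:v1
}
```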

## ANTI-PATTERNS (THIS PACKAGE)

- **NEVER** hardcode container images (use images.json)
- **NEVER** skip leader election
- **ALWAYS** update ClusterOperator status on errors
- **ALWAYS** check feature gates before deploying optional components
- **ALWAYS** validate `images.json` before using

## NOTES

- Images ConfigMap: `0000_30_machine-api-operator_01_images.configmap.yaml`
- CRDs deployed from `/install` (synced via `make crds-sync`)
- Operator manages: machine-controllers, machine-api-tests-ext deployments
38 changes: 38 additions & 0 deletions pkg/webhooks/AGENTS.md
@@ -0,0 +1,38 @@
# Webhooks

**Admission webhooks for Machine/MachineSet validation and mutation**

## OVERVIEW
Validates and mutates Machine/MachineSet resources on create/update; enforces cluster scoping and default labels.

## WHERE TO LOOK

| Task | File | Notes |
|------|------|-------|
| Machine validation | `admission_machine.go` | ValidateMachine webhook |
| Machine mutation | `admission_machine.go` | MutateMachine webhook |
| MachineSet validation | `admission_machineset.go` | ValidateMachineSet webhook |
| Webhook server | `webhook.go` | Server setup, TLS config |
| Test suite | `webhook_suite_test.go` | Ginkgo envtest setup |

## CONVENTIONS

- Validation: Check cluster label, required fields, immutable fields
- Mutation: Add default labels, validate provider config
- TLS: Always use valid certificates (not self-signed in production)
- Timeout: Webhooks must respond within 10 seconds
- Error messages: Always actionable, reference CRD docs
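
The cluster-label validation with actionable error messages can be sketched as below. The label key comes from the Machine controller notes in this repo; `validateMachineLabels` itself is an illustrative sketch, not the webhook's actual validation function, which covers far more fields.

```go
package main

import "fmt"

const clusterLabel = "machine.openshift.io/cluster-api-cluster"

// validateMachineLabels checks the cluster-scoping label and returns
// actionable errors, per the conventions above.
func validateMachineLabels(labels map[string]string) []error {
	var errs []error
	if labels[clusterLabel] == "" {
		errs = append(errs, fmt.Errorf("label %q is required for cluster scoping; see the Machine CRD docs", clusterLabel))
	}
	return errs
}

func main() {
	fmt.Println(len(validateMachineLabels(map[string]string{})))                   // 1
	fmt.Println(len(validateMachineLabels(map[string]string{clusterLabel: "c"}))) // 0
}
```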

## ANTI-PATTERNS (THIS PACKAGE)

- **NEVER** skip cluster label validation
- **NEVER** allow Machine.Spec changes after creation
- **NEVER** block deletions in validation webhook
- **ALWAYS** return detailed validation errors
- **ALWAYS** set default labels in mutation (not validation)

## NOTES

- Webhooks run as separate Deployment (machine-webhook)
- Certificates managed by cert-manager or operator
- Use `--kubeconfig` for local testing; in-cluster, admission requests go through the webhook service