diff --git a/pkg/controller/machine/AGENTS.md b/pkg/controller/machine/AGENTS.md
new file mode 100644
index 000000000..80d7d162e
--- /dev/null
+++ b/pkg/controller/machine/AGENTS.md
@@ -0,0 +1,38 @@
+# Machine Controller
+
+**Machine lifecycle management across cloud providers**
+
+## OVERVIEW
+Creates/deletes provider instances, tracks Machine phases, drains nodes before deletion.
+
+## WHERE TO LOOK
+
+| Task | File | Notes |
+|------|------|-------|
+| Reconcile loop | `machine_controller.go` | Core lifecycle logic |
+| Instance creation | `machine_instance.go` | Provider instance provisioning |
+| Node draining | `machine_drain.go` | Pre-delete node drain |
+| Phase tracking | `machine_phase.go` | Provisioning/Running/Failed states |
+| Test suite | `machine_suite_test.go` | Ginkgo envtest setup |
+
+## CONVENTIONS
+
+- Machine phases: `Provisioning` → `Running` → `Failed`/`Terminating`
+- Always check `Machine.Status.Phase` before operations
+- Node drain: skipped only for master machines (skipping for workers is not recommended)
+- Instance deletion: Always wait for provider cleanup
+- ProviderID: Required for Node ↔ Machine linking
+
+## ANTI-PATTERNS (THIS PACKAGE)
+
+- **NEVER** mutate `Machine.Spec` after creation (changes have no effect on the provisioned instance)
+- **NEVER** skip draining when deleting worker machines
+- **NEVER** remove the finalizer manually (causes orphaned VMs)
+- **ALWAYS** check `Machine.DeletionTimestamp` before provisioning
+- **ALWAYS** set `Machine.Status.ProviderID` after instance creation
+
+## NOTES
+
+- Provider logic lives in `machine-api-provider-*` repos (except vSphere)
+- Machine controller is platform-agnostic; actuators are external
+- Use `machine.openshift.io/cluster-api-cluster` label for cluster scoping
\ No newline at end of file
diff --git a/pkg/controller/machineset/AGENTS.md b/pkg/controller/machineset/AGENTS.md
new file mode 100644
index 000000000..d9954f1c6
--- /dev/null
+++ b/pkg/controller/machineset/AGENTS.md
@@ -0,0 +1,37 @@
+# MachineSet Controller
+
+**Replica management for the MachineSet CRD**
+
+## OVERVIEW
+Ensures the expected number of Machine replicas exists with consistent provider config; implements delete policies (Random/Oldest/Newest).
+
+## WHERE TO LOOK
+
+| Task | File | Notes |
+|------|------|-------|
+| Reconcile loop | `machineset_controller.go` | Main reconciliation logic |
+| Delete policy | `machineset_delete_policy.go` | Random/Oldest/Newest implementations |
+| Replica sync | `machineset_replica.go` | Scale up/down logic |
+| Machine selection | `machineset_selector.go` | Label-based machine matching |
+| Test suite | `machineset_suite_test.go` | Ginkgo envtest setup |
+
+## CONVENTIONS
+
+- Prefer komega over plain Gomega
+- Each test uses the `*_suite_test.go` pattern
+- Tests run with `--race --trace --timeout=10m`
+- Delete policy: Never delete machines with the `Deleting` finalizer
+- Replica sync: Always check MachineSet status before scaling
+
+## ANTI-PATTERNS (THIS PACKAGE)
+
+- **NEVER** skip the status update after a replica change
+- **NEVER** delete machines without checking the delete policy
+- **ALWAYS** use `komega.Eventually()` for async state checks
+- **ALWAYS** respect `MachineSet.DeletionTimestamp` during reconciliation
+
+## NOTES
+
+- MachineSet controller does NOT create instances (the Machine controller does)
+- Delete policy only applies when scaling down
+- Machines with `machine.openshift.io/deletion-in-progress` are protected
\ No newline at end of file
diff --git a/pkg/controller/vsphere/AGENTS.md b/pkg/controller/vsphere/AGENTS.md
new file mode 100644
index 000000000..3bee19b34
--- /dev/null
+++ b/pkg/controller/vsphere/AGENTS.md
@@ -0,0 +1,47 @@
+# vSphere Controller
+
+**vSphere-specific Machine actuator (the only provider in this repo)**
+
+## OVERVIEW
+Implements the Machine lifecycle for vSphere infrastructure; manages VM creation, deletion, and state tracking.
+
+## WHERE TO LOOK
+
+| Task | File | Notes |
+|------|------|-------|
+| Reconcile loop | `vsphere_controller.go` | Main vSphere reconciliation |
+| VM creation | `vm_provision.go` | vSphere VM provisioning |
+| Session management | `session/` | vCenter connection handling |
+| MachineSet support | `machineset/` | vSphere-specific MachineSet logic |
+| Test suite | `vsphere_suite_test.go` | Ginkgo envtest setup |
+
+## STRUCTURE
+
+```
+vsphere/
+├── session/    # vCenter connection/session management
+├── machineset/ # vSphere-specific MachineSet controller
+└── [root]      # Core vSphere actuator
+```
+
+## CONVENTIONS
+
+- vCenter connection: Always use the session pool from `session/`
+- VM naming: `{cluster}-{machine-name}-{uuid}`
+- Power state: Check before operations (powered on/off/suspended)
+- Network: Always attach to configured port groups
+- Disk: Use thin provisioning by default
+
+## ANTI-PATTERNS (THIS PACKAGE)
+
+- **NEVER** create VMs outside the Machine API (causes orphaned resources)
+- **NEVER** skip VM power state checks
+- **ALWAYS** use the session pool (not direct vCenter connections)
+- **ALWAYS** handle vSphere API rate limiting
+- **ALWAYS** clean up orphaned VMs on Machine deletion
+
+## NOTES
+
+- Only cloud provider implemented directly in the MAO repo
+- Other providers (AWS/GCP/Azure) live in `machine-api-provider-*` repos
+- The session pool prevents vCenter connection exhaustion
\ No newline at end of file
diff --git a/pkg/controller/vsphere/reconciler.go b/pkg/controller/vsphere/reconciler.go
index 7075de326..4f6cd5023 100644
--- a/pkg/controller/vsphere/reconciler.go
+++ b/pkg/controller/vsphere/reconciler.go
@@ -177,6 +177,11 @@ func (r *Reconciler) create() error {
 
 	moTask, err := r.session.GetTask(r.Context, r.providerStatus.TaskRef)
 	if err != nil {
+		// The task may have been cleaned up by vCenter. Check if the VM exists and needs to be powered on.
+		if isRetrieveMONotFound(r.providerStatus.TaskRef, err) {
+			klog.V(2).Infof("%v: task %s not found, checking if VM exists and needs power-on", r.machine.GetName(), r.providerStatus.TaskRef)
+			return r.handleTaskNotFound()
+		}
 		metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{
 			Name:      r.machine.Name,
 			Namespace: r.machine.Namespace,
@@ -186,9 +191,10 @@ func (r *Reconciler) create() error {
 	}
 
 	if moTask == nil {
-		// Possible eventual consistency problem from vsphere
-		// TODO: change error message here to indicate this might be expected.
-		return fmt.Errorf("unexpected moTask nil")
+		// The task object is nil, which may indicate vSphere eventual consistency or task cleanup.
+		// Check if the VM exists and needs to be powered on.
+		klog.V(2).Infof("%v: task %s returned nil, checking if VM exists and needs power-on", r.machine.GetName(), r.providerStatus.TaskRef)
+		return r.handleTaskNotFound()
 	}
 
 	if taskIsFinished, err := taskIsFinished(moTask); err != nil {
@@ -246,7 +252,7 @@ func (r *Reconciler) create() error {
 			if statusError != nil {
 				return fmt.Errorf("failed to set provider status: %w", err)
 			}
-				return err
+			return err
 		}
 		return setProviderStatus(task, conditionSuccess(), r.machineScope, nil)
 	}
@@ -255,6 +261,78 @@ func (r *Reconciler) create() error {
 	return nil
 }
 
+// handleTaskNotFound handles the case where the task reference exists but the task
+// object is no longer available in vCenter (e.g., cleaned up). It checks if the VM
+// exists and, if it is powered off, attempts to power it on. This handles the race
+// condition where the clone completes but power-on hasn't been triggered yet.
+func (r *Reconciler) handleTaskNotFound() error {
+	// Try to find the VM.
+	vmRef, err := findVM(r.machineScope)
+	if err != nil {
+		if isNotFound(err) {
+			// The VM doesn't exist yet, which is unexpected if we have a TaskRef.
+			// Clear the TaskRef and let the next reconcile attempt the clone.
+			klog.V(2).Infof("%v: VM not found after task cleanup, clearing TaskRef and requeuing", r.machine.GetName())
+			r.providerStatus.TaskRef = ""
+			if err := r.machineScope.PatchMachine(); err != nil {
+				return fmt.Errorf("failed to patch machine after clearing TaskRef: %w", err)
+			}
+			return fmt.Errorf("%v: VM not found, will retry clone operation", r.machine.GetName())
+		}
+		return fmt.Errorf("%v: failed to find VM after task cleanup: %w", r.machine.GetName(), err)
+	}
+
+	// The VM exists, check its power state.
+	vm := &virtualMachine{
+		Context: r.machineScope.Context,
+		Obj:     object.NewVirtualMachine(r.machineScope.session.Client.Client, vmRef),
+		Ref:     vmRef,
+	}
+
+	powerState, err := vm.getPowerState()
+	if err != nil {
+		return fmt.Errorf("%v: failed to get VM power state after task cleanup: %w", r.machine.GetName(), err)
+	}
+
+	klog.V(2).Infof("%v: VM found with power state %s after task cleanup", r.machine.GetName(), powerState)
+
+	// If the VM is powered off, we need to power it on.
+	if powerState == types.VirtualMachinePowerStatePoweredOff {
+		klog.Infof("%v: VM exists but is powered off after task cleanup, powering on", r.machine.GetName())
+		task, err := vm.powerOnVM()
+		if err != nil {
+			metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{
+				Name:      r.machine.Name,
+				Namespace: r.machine.Namespace,
+				Reason:    "PowerOn task finished with error",
+			})
+			conditionFailed := conditionFailed()
+			conditionFailed.Message = err.Error()
+			statusError := setProviderStatus(task, conditionFailed, r.machineScope, vm)
+			if statusError != nil {
+				return fmt.Errorf("failed to set provider status: %w", statusError)
+			}
+			return fmt.Errorf("%v: failed to power on machine: %w", r.machine.GetName(), err)
+		}
+
+		// Clear the old TaskRef and track the new power-on task.
+		r.providerStatus.TaskRef = task
+		klog.Infof("%v: Successfully initiated power-on, tracking new task %s", r.machine.GetName(), task)
+		return setProviderStatus(task, conditionSuccess(), r.machineScope, vm)
+	}
+
+	// The VM is already powered on (or in another state): clear the old TaskRef
+	// and let the next reconcile proceed to update().
+	klog.V(2).Infof("%v: VM is already powered on (%s), clearing old TaskRef", r.machine.GetName(), powerState)
+	r.providerStatus.TaskRef = ""
+	if err := r.machineScope.PatchMachine(); err != nil {
+		return fmt.Errorf("failed to patch machine after clearing TaskRef: %w", err)
+	}
+
+	// Return nil to indicate success; the next reconcile will call update().
+	return nil
+}
+
 // update finds a vm and reconciles the machine resource status against it.
 func (r *Reconciler) update() error {
 	if err := validateMachine(*r.machine); err != nil {
diff --git a/pkg/operator/AGENTS.md b/pkg/operator/AGENTS.md
new file mode 100644
index 000000000..12e2e30a2
--- /dev/null
+++ b/pkg/operator/AGENTS.md
@@ -0,0 +1,38 @@
+# Operator
+
+**Platform detection, controller deployment, ClusterOperator status**
+
+## OVERVIEW
+Deploys platform-specific controllers, manages ClusterOperator status, handles images configuration.
+
+## WHERE TO LOOK
+
+| Task | File | Notes |
+|------|------|-------|
+| Platform detection | `platform.go` | Detects cloud provider (vSphere, kubemark, etc.) |
+| Controller deployment | `deployment.go` | Deploys machine-controllers Deployment |
+| ClusterOperator status | `clusteroperator.go` | Manages MAO ClusterOperator status |
+| Images config | `images.go` | Parses images.json ConfigMap |
+| Test suite | `operator_suite_test.go` | Ginkgo envtest setup |
+
+## CONVENTIONS
+
+- Platform detection: Check the status of the `Infrastructure` resource named `cluster`
+- Controller deployment: Always set images from the `images.json` ConfigMap
+- ClusterOperator: Update status on every reconciliation
+- Leader election: Single active operator instance
+- Feature gates: Check OpenShift feature gates before enabling features
+
+## ANTI-PATTERNS (THIS PACKAGE)
+
+- **NEVER** hardcode container images (use images.json)
+- **NEVER** skip leader election
+- **ALWAYS** update ClusterOperator status on errors
+- **ALWAYS** check feature gates before deploying optional components
+- **ALWAYS** validate `images.json` before using it
+
+## NOTES
+
+- Images ConfigMap: `0000_30_machine-api-operator_01_images.configmap.yaml`
+- CRDs deployed from `/install` (synced via `make crds-sync`)
+- Operator manages: machine-controllers, machine-api-tests-ext deployments
\ No newline at end of file
diff --git a/pkg/webhooks/AGENTS.md b/pkg/webhooks/AGENTS.md
new file mode 100644
index 000000000..3210e133e
--- /dev/null
+++ b/pkg/webhooks/AGENTS.md
@@ -0,0 +1,38 @@
+# Webhooks
+
+**Admission webhooks for Machine/MachineSet validation and mutation**
+
+## OVERVIEW
+Validates and mutates Machine/MachineSet resources on create/update; enforces cluster scoping and default labels.
+
+## WHERE TO LOOK
+
+| Task | File | Notes |
+|------|------|-------|
+| Machine validation | `admission_machine.go` | ValidateMachine webhook |
+| Machine mutation | `admission_machine.go` | MutateMachine webhook |
+| MachineSet validation | `admission_machineset.go` | ValidateMachineSet webhook |
+| Webhook server | `webhook.go` | Server setup, TLS config |
+| Test suite | `webhook_suite_test.go` | Ginkgo envtest setup |
+
+## CONVENTIONS
+
+- Validation: Check the cluster label, required fields, and immutable fields
+- Mutation: Add default labels, validate provider config
+- TLS: Always use valid certificates (not self-signed in production)
+- Timeout: Webhooks must respond within 10 seconds
+- Error messages: Always actionable; reference the CRD docs
+
+## ANTI-PATTERNS (THIS PACKAGE)
+
+- **NEVER** skip cluster label validation
+- **NEVER** allow Machine.Spec changes after creation
+- **NEVER** block deletions in the validation webhook
+- **ALWAYS** return detailed validation errors
+- **ALWAYS** set default labels in mutation (not validation)
+
+## NOTES
+
+- Webhooks run as a separate Deployment (machine-webhook)
+- Certificates are managed by cert-manager or the operator
+- Use `--kubeconfig` for local testing; in-cluster requests go through the webhook server
\ No newline at end of file
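---

Review note: the `handleTaskNotFound` path added in this patch reduces to three outcomes, depending only on whether the VM exists and whether it is still powered off. A minimal, self-contained Go sketch of that decision table (the `Action` type and `decide` function are illustrative stand-ins, not the real machine-api-operator APIs, which return errors and patch the Machine instead):

```go
package main

import "fmt"

// Action names the three outcomes of the task-not-found path.
// These names are illustrative only.
type Action string

const (
	ActionRetryClone Action = "clear TaskRef, retry clone"
	ActionPowerOn    Action = "power on VM, track new task"
	ActionProceed    Action = "clear TaskRef, proceed to update()"
)

// decide models handleTaskNotFound: once the tracked task has vanished from
// vCenter, the next step depends only on whether the VM exists and, if so,
// whether it is still powered off.
func decide(vmFound, poweredOff bool) Action {
	if !vmFound {
		// No VM despite a TaskRef: the clone never completed, so drop the
		// stale TaskRef and re-attempt the clone on the next reconcile.
		return ActionRetryClone
	}
	if poweredOff {
		// Clone finished but power-on never ran: issue power-on and track
		// the new task reference.
		return ActionPowerOn
	}
	// VM is already running: drop the stale TaskRef and let the normal
	// update() path reconcile status.
	return ActionProceed
}

func main() {
	fmt.Println(decide(false, false))
	fmt.Println(decide(true, true))
	fmt.Println(decide(true, false))
}
```

This only models the control flow; the real function additionally records failure metrics, patches the Machine after clearing `TaskRef`, and stores the power-on task in `ProviderStatus.TaskRef`.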