38 changes: 38 additions & 0 deletions pkg/controller/machine/AGENTS.md
@@ -0,0 +1,38 @@
# Machine Controller

**Machine lifecycle management across cloud providers**

## OVERVIEW
Creates/deletes provider instances, tracks Machine phases, drains nodes before deletion.

## WHERE TO LOOK

| Task | File | Notes |
|------|------|-------|
| Reconcile loop | `machine_controller.go` | Core lifecycle logic |
| Instance creation | `machine_instance.go` | Provider instance provisioning |
| Node draining | `machine_drain.go` | Pre-delete node drain |
| Phase tracking | `machine_phase.go` | Provisioning/Running/Failed states |
| Test suite | `machine_suite_test.go` | Ginkgo envtest setup |

## CONVENTIONS

- Machine phases: `Provisioning` → `Running` → `Failed`/`Terminating`
- Always check `Machine.Status.Phase` before operations
- Node drain: may be skipped for master machines, though skipping is not recommended
- Instance deletion: Always wait for provider cleanup
- ProviderID: Required for Node ↔ Machine linking

## ANTI-PATTERNS (THIS PACKAGE)

- **NEVER** change Machine.Spec (has no effect)
- **NEVER** skip draining when deleting worker machines
- **NEVER** remove finalizer manually (causes orphaned VMs)
- **ALWAYS** check `Machine.DeletionTimestamp` before provisioning
- **ALWAYS** set `Machine.Status.ProviderID` after instance creation

## NOTES

- Provider logic lives in `machine-api-provider-*` repos (except vSphere)
- Machine controller is platform-agnostic; actuators are external
- Use `machine.openshift.io/cluster-api-cluster` label for cluster scoping
37 changes: 37 additions & 0 deletions pkg/controller/machineset/AGENTS.md
@@ -0,0 +1,37 @@
# MachineSet Controller

**Replica management for MachineSet CRD**

## OVERVIEW
Ensures expected number of Machine replicas with consistent provider config; implements delete policies (Random/Oldest/Newest).

## WHERE TO LOOK

| Task | File | Notes |
|------|------|-------|
| Reconcile loop | `machineset_controller.go` | Main reconciliation logic |
| Delete policy | `machineset_delete_policy.go` | Random/Oldest/Newest implementations |
| Replica sync | `machineset_replica.go` | Scale up/down logic |
| Machine selection | `machineset_selector.go` | Label-based machine matching |
| Test suite | `machineset_suite_test.go` | Ginkgo envtest setup |

## CONVENTIONS

- Prefer komega over plain Gomega
- Each test uses `*_suite_test.go` pattern
- Tests run with `--race --trace --timeout=10m`
- Delete policy: Never delete machines with `Deleting` finalizer
- Replica sync: Always check MachineSet status before scaling

## ANTI-PATTERNS (THIS PACKAGE)

- **NEVER** skip status update after replica change
- **NEVER** delete machines without checking delete policy
- **ALWAYS** use `komega.Eventually()` for async state checks
- **ALWAYS** respect `MachineSet.DeletionTimestamp` during reconciliation

## NOTES

- MachineSet controller does NOT create instances (Machine controller does)
- Delete policy only applies when scaling down
- Machines with `machine.openshift.io/deletion-in-progress` are protected
47 changes: 47 additions & 0 deletions pkg/controller/vsphere/AGENTS.md
@@ -0,0 +1,47 @@
# vSphere Controller

**vSphere-specific Machine actuator (only provider in this repo)**

## OVERVIEW
Implements Machine lifecycle for vSphere infrastructure; manages VM creation, deletion, and state tracking.

## WHERE TO LOOK

| Task | File | Notes |
|------|------|-------|
| Reconcile loop | `vsphere_controller.go` | Main vSphere reconciliation |
| VM creation | `vm_provision.go` | vSphere VM provisioning |
| Session management | `session/` | vCenter connection handling |
| MachineSet support | `machineset/` | vSphere-specific MachineSet logic |
| Test suite | `vsphere_suite_test.go` | Ginkgo envtest setup |

## STRUCTURE

```
vsphere/
├── session/ # vCenter connection/session management
├── machineset/ # vSphere-specific MachineSet controller
└── [root] # Core vSphere actuator
```

## CONVENTIONS

- vCenter connection: Always use session pool from `session/`
- VM naming: `{cluster}-{machine-name}-{uuid}`
- Power state: Check before operations (powered on/off/suspended)
- Network: Always attach to configured port groups
- Disk: Use thin provisioning by default
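
The `{cluster}-{machine-name}-{uuid}` naming convention can be sketched as a one-liner; `vmName` is an illustrative helper, not the repo's actual naming code.

```go
package main

import (
	"fmt"
	"strings"
)

// vmName builds the `{cluster}-{machine-name}-{uuid}` form described in
// the conventions above.
func vmName(cluster, machine, uuid string) string {
	return strings.Join([]string{cluster, machine, uuid}, "-")
}

func main() {
	fmt.Println(vmName("prod", "worker-0", "1a2b3c")) // prod-worker-0-1a2b3c
}
```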

## ANTI-PATTERNS (THIS PACKAGE)

- **NEVER** create VMs outside Machine API (causes orphaned resources)
- **NEVER** skip VM power state checks
- **ALWAYS** use session pool (not direct vCenter connections)
- **ALWAYS** handle vSphere API rate limiting
- **ALWAYS** clean up orphaned VMs on Machine deletion
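
The power-state check this package insists on can be sketched as below. The string constants mirror the vSphere API enum values (`types.VirtualMachinePowerState*`) in simplified form, and `needsPowerOn` is an illustrative guard reflecting this PR's review discussion: both powered-off and suspended VMs need a power-on before reconciliation proceeds.

```go
package main

import "fmt"

// Simplified mirrors of the vSphere power-state enum values.
const (
	poweredOn  = "poweredOn"
	poweredOff = "poweredOff"
	suspended  = "suspended"
)

// needsPowerOn reports whether a VM in the given state must be powered
// on before the Machine can progress.
func needsPowerOn(state string) bool {
	return state == poweredOff || state == suspended
}

func main() {
	fmt.Println(needsPowerOn(poweredOff)) // true
	fmt.Println(needsPowerOn(suspended))  // true
	fmt.Println(needsPowerOn(poweredOn))  // false
}
```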

## NOTES

- Only cloud provider implemented directly in MAO repo
- Other providers (AWS/GCP/Azure) live in `machine-api-provider-*` repos
- Session pool prevents vCenter connection exhaustion
89 changes: 85 additions & 4 deletions pkg/controller/vsphere/reconciler.go
@@ -177,6 +177,11 @@ func (r *Reconciler) create() error {

moTask, err := r.session.GetTask(r.Context, r.providerStatus.TaskRef)
if err != nil {
// Task may have been cleaned up by vCenter. Check if VM exists and needs to be powered on.
if isRetrieveMONotFound(r.providerStatus.TaskRef, err) {
klog.V(2).Infof("%v: task %s not found, checking if VM exists and needs power-on", r.machine.GetName(), r.providerStatus.TaskRef)
return r.handleTaskNotFound()
}
metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{
Name: r.machine.Name,
Namespace: r.machine.Namespace,
@@ -186,9 +191,10 @@ func (r *Reconciler) create() error {
}

if moTask == nil {
// Possible eventual consistency problem from vsphere
// TODO: change error message here to indicate this might be expected.
return fmt.Errorf("unexpected moTask nil")
// Task object is nil, may indicate eventual consistency or cleanup.
// Check if VM exists and needs to be powered on.
klog.V(2).Infof("%v: task %s returned nil, checking if VM exists and needs power-on", r.machine.GetName(), r.providerStatus.TaskRef)
return r.handleTaskNotFound()
}

if taskIsFinished, err := taskIsFinished(moTask); err != nil {
@@ -246,7 +252,7 @@ func (r *Reconciler) create() error {
if statusError != nil {
return fmt.Errorf("failed to set provider status: %w", err)
}
		return err
}
return setProviderStatus(task, conditionSuccess(), r.machineScope, nil)
}
@@ -255,6 +261,81 @@ func (r *Reconciler) create() error {
return nil
}

// handleTaskNotFound handles the case where the task reference exists but the task
// object is no longer available in vCenter (e.g., cleaned up). It checks if the VM
// exists and if it's powered off, attempts to power it on. This handles the race
// condition where clone completes but power-on hasn't been triggered yet.
func (r *Reconciler) handleTaskNotFound() error {
// Try to find the VM
vmRef, err := findVM(r.machineScope)
if err != nil {
if isNotFound(err) {
// VM doesn't exist yet, this is unexpected if we have a TaskRef
// Clear the TaskRef and let the next reconcile attempt to clone
klog.V(2).Infof("%v: VM not found after task cleanup, clearing TaskRef and requeuing", r.machine.GetName())
r.providerStatus.TaskRef = ""
if err := r.machineScope.PatchMachine(); err != nil {
return fmt.Errorf("failed to patch machine after clearing TaskRef: %w", err)
}
return fmt.Errorf("%v: VM not found, will retry clone operation", r.machine.GetName())
}
return fmt.Errorf("%v: failed to find VM after task cleanup: %w", r.machine.GetName(), err)
}

// VM exists, check its power state
vm := &virtualMachine{
Context: r.machineScope.Context,
Obj: object.NewVirtualMachine(r.machineScope.session.Client.Client, vmRef),
Ref: vmRef,
}

powerState, err := vm.getPowerState()
if err != nil {
return fmt.Errorf("%v: failed to get VM power state after task cleanup: %w", r.machine.GetName(), err)
}

klog.V(2).Infof("%v: VM found with power state %s after task cleanup", r.machine.GetName(), powerState)

// If VM is powered off, we need to power it on
if powerState == types.VirtualMachinePowerStatePoweredOff {
klog.Infof("%v: VM exists but is powered off after task cleanup, powering on", r.machine.GetName())
task, err := vm.powerOnVM()
if err != nil {
Comment on lines +301 to +304

⚠️ Potential issue | 🟠 Major

Use the existing powerOn() helper here to preserve DRS override behavior.

Line 303 calls vm.powerOnVM() directly, bypassing the existing powerOn(r.machineScope) path used elsewhere in the create flow. That can diverge behavior for environments relying on the datacenter power-on path.

⚙️ Proposed fix
-		task, err := vm.powerOnVM()
+		task, err := powerOn(r.machineScope)

metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{
Name: r.machine.Name,
Namespace: r.machine.Namespace,
Reason: "PowerOn task finished with error",
})
conditionFailed := conditionFailed()
conditionFailed.Message = err.Error()
statusError := setProviderStatus(task, conditionFailed, r.machineScope, vm)
if statusError != nil {
return fmt.Errorf("failed to set provider status: %w", statusError)
}
return fmt.Errorf("%v: failed to power on machine: %w", r.machine.GetName(), err)
}

// Clear the old TaskRef and set the new power-on task
r.providerStatus.TaskRef = task
klog.Infof("%v: Successfully initiated power-on, tracking new task %s", r.machine.GetName(), task)
return setProviderStatus(task, conditionSuccess(), r.machineScope, vm)
}

// VM is already powered on (or in another state), clear the old TaskRef
// and let the next reconcile proceed to update
klog.V(2).Infof("%v: VM is already powered on (%s), clearing old TaskRef", r.machine.GetName(), powerState)
Comment on lines +301 to +327

⚠️ Potential issue | 🟠 Major

Handle suspended VMs explicitly instead of clearing TaskRef as success.

Current logic only powers on when the state is PoweredOff. A Suspended VM falls through to line 328 and clears TaskRef, which can advance reconciliation without actually restoring runtime state.

🛠️ Proposed fix
-	if powerState == types.VirtualMachinePowerStatePoweredOff {
+	if powerState == types.VirtualMachinePowerStatePoweredOff ||
+		powerState == types.VirtualMachinePowerStateSuspended {

r.providerStatus.TaskRef = ""
if err := r.machineScope.PatchMachine(); err != nil {
return fmt.Errorf("failed to patch machine after clearing TaskRef: %w", err)
}

// Return nil to indicate success, next reconcile will call update()
return nil
}

// update finds a vm and reconciles the machine resource status against it.
func (r *Reconciler) update() error {
if err := validateMachine(*r.machine); err != nil {
38 changes: 38 additions & 0 deletions pkg/operator/AGENTS.md
@@ -0,0 +1,38 @@
# Operator

**Platform detection, controller deployment, ClusterOperator status**

## OVERVIEW
Deploys platform-specific controllers, manages ClusterOperator status, handles images configuration.

## WHERE TO LOOK

| Task | File | Notes |
|------|------|-------|
| Platform detection | `platform.go` | Detects cloud provider (vSphere, kubemark, etc.) |
| Controller deployment | `deployment.go` | Deploys machine-controllers Deployment |
| ClusterOperator status | `clusteroperator.go` | Manages MAO ClusterOperator status |
| Images config | `images.go` | Parses images.json ConfigMap |
| Test suite | `operator_suite_test.go` | Ginkgo envtest setup |

## CONVENTIONS

- Platform detection: Check `Infrastructure.cluster` status
- Controller deployment: Always set image from `images.json` ConfigMap
- ClusterOperator: Update status on every reconciliation
- Leader election: Single active operator instance
- Feature gates: Check OpenShift featuregates before enabling
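
The "validate `images.json` before using" rule can be sketched as below. The JSON keys (`machineAPIOperator`, `machineController`) are illustrative assumptions — the real schema is whatever the images ConfigMap ships — and `parseImages` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Images holds illustrative fields; the real images.json schema is
// defined by the images ConfigMap.
type Images struct {
	MachineAPIOperator string `json:"machineAPIOperator"`
	MachineController  string `json:"machineController"`
}

// parseImages validates before use: an empty image reference is rejected
// instead of silently deployed.
func parseImages(data []byte) (*Images, error) {
	var imgs Images
	if err := json.Unmarshal(data, &imgs); err != nil {
		return nil, fmt.Errorf("invalid images.json: %w", err)
	}
	if imgs.MachineController == "" {
		return nil, fmt.Errorf("images.json missing machineController image")
	}
	return &imgs, nil
}

func main() {
	imgs, err := parseImages([]byte(`{"machineAPIOperator":"quay.io/example/mao:v1","machineController":"quay.io/example/mc:v1"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(imgs.MachineController) // quay.io/example/mc:v1
}
```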

## ANTI-PATTERNS (THIS PACKAGE)

- **NEVER** hardcode container images (use images.json)
- **NEVER** skip leader election
- **ALWAYS** update ClusterOperator status on errors
- **ALWAYS** check feature gates before deploying optional components
- **ALWAYS** validate `images.json` before using

## NOTES

- Images ConfigMap: `0000_30_machine-api-operator_01_images.configmap.yaml`
- CRDs deployed from `/install` (synced via `make crds-sync`)
- Operator manages: machine-controllers, machine-api-tests-ext deployments
38 changes: 38 additions & 0 deletions pkg/webhooks/AGENTS.md
@@ -0,0 +1,38 @@
# Webhooks

**Admission webhooks for Machine/MachineSet validation and mutation**

## OVERVIEW
Validates and mutates Machine/MachineSet resources on create/update; enforces cluster scoping and default labels.

## WHERE TO LOOK

| Task | File | Notes |
|------|------|-------|
| Machine validation | `admission_machine.go` | ValidateMachine webhook |
| Machine mutation | `admission_machine.go` | MutateMachine webhook |
| MachineSet validation | `admission_machineset.go` | ValidateMachineSet webhook |
| Webhook server | `webhook.go` | Server setup, TLS config |
| Test suite | `webhook_suite_test.go` | Ginkgo envtest setup |

## CONVENTIONS

- Validation: Check cluster label, required fields, immutable fields
- Mutation: Add default labels, validate provider config
- TLS: Always use valid certificates (not self-signed in production)
- Timeout: Webhooks must respond within 10 seconds
- Error messages: Always actionable, reference CRD docs
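
The cluster-label validation with actionable error messages can be sketched as below. The label key comes from the Machine controller notes in this repo; `validateMachineLabels` itself is an illustrative sketch, not the webhook's actual validation function, which covers far more fields.

```go
package main

import "fmt"

const clusterLabel = "machine.openshift.io/cluster-api-cluster"

// validateMachineLabels checks the cluster-scoping label and returns
// actionable errors, per the conventions above.
func validateMachineLabels(labels map[string]string) []error {
	var errs []error
	if labels[clusterLabel] == "" {
		errs = append(errs, fmt.Errorf("label %q is required for cluster scoping; see the Machine CRD docs", clusterLabel))
	}
	return errs
}

func main() {
	fmt.Println(len(validateMachineLabels(map[string]string{})))                   // 1
	fmt.Println(len(validateMachineLabels(map[string]string{clusterLabel: "c"}))) // 0
}
```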

## ANTI-PATTERNS (THIS PACKAGE)

- **NEVER** skip cluster label validation
- **NEVER** allow Machine.Spec changes after creation
- **NEVER** block deletions in validation webhook
- **ALWAYS** return detailed validation errors
- **ALWAYS** set default labels in mutation (not validation)

## NOTES

- Webhooks run as separate Deployment (machine-webhook)
- Certificates managed by cert-manager or operator
- Use `--kubeconfig` for local testing; in-cluster, admission requests go through the webhook service