
Conversation

@gagan16k
Member

@gagan16k gagan16k commented Nov 28, 2025

What this PR does / why we need it:
This PR introduces node finalizers into the machine deletion workflow: it adds steps for adding and removing node finalizers, adds support for handling orphaned nodes, and updates the related tests.

  • Added a new constant NodeFinalizer for tagging node objects, and implemented a new step in the machine deletion flow that explicitly removes node finalizers before node deletion.
  • Renamed the node event handler methods (addNodeToMachine, updateNodeToMachine, deleteNodeToMachine) to (addNode, updateNode, deleteNode) and moved them from machine.go to node.go.
    • Added logic to enqueue nodes on the node queue on startup, so the finalizer is added if not already present.
    • Reconciliation adds finalizers to any nodes not marked as NotManagedByMCM, and triggers machine deletion if the node has a deletion timestamp (see the sketch below).
    • Removed the annotation-based machine deletion trigger; machine deletion is now triggered when a node has a deletion timestamp.
    • Triggers machine deletion if a node is force-deleted by manual removal of its finalizers.
  • Updated the machine deletion state machine (triggerDeletionFlow) to transition from VM deletion to node finalizer removal, and only then to node deletion. All related status messages and retry logic now reflect this new intermediate step.
  • In the node safety controller, orphaned nodes (not managed by MCM) are annotated and now have their MCM finalizer removed to allow garbage collection. Managed nodes are queued for reconciliation for finalizer handling.
  • Added and updated unit tests to cover the new node finalizer removal logic, a scenario for orphan node annotation and finalizer removal, and the new machine deletion workflow.

The changes in this PR deprecate the annotation-based machine deletion trigger (the node.machine.sapcloud.io/trigger-deletion-by-mcm annotation; see the release note below).
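
For illustration, here is a minimal sketch of the node reconciliation described above, assuming a plain client-go clientset. reconcileNode and triggerMachineDeletion are hypothetical names used only for this sketch; the finalizer and annotation keys are the ones that appear in the test outputs further below.

package node

import (
    "context"
    "slices"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

const (
    // NodeFinalizer is placed on Node objects managed by the machine controller.
    NodeFinalizer = "node.machine.sapcloud.io/machine-controller"
    // NotManagedByMCMAnnotation marks nodes that have no backing Machine object.
    NotManagedByMCMAnnotation = "node.machine.sapcloud.io/not-managed-by-mcm"
)

// triggerMachineDeletion is a stand-in for enqueuing the backing Machine for deletion.
func triggerMachineDeletion(_ context.Context, _ *v1.Node) error { return nil }

// reconcileNode ensures the MCM finalizer on managed nodes and reacts to a deletion timestamp.
func reconcileNode(ctx context.Context, client kubernetes.Interface, node *v1.Node) error {
    // Nodes explicitly marked as not managed by MCM are left untouched.
    if _, ok := node.Annotations[NotManagedByMCMAnnotation]; ok {
        return nil
    }
    // A deletion timestamp means `kubectl delete node` was issued; this now triggers
    // deletion of the backing Machine (the trigger-deletion annotation is gone).
    if node.DeletionTimestamp != nil {
        return triggerMachineDeletion(ctx, node)
    }
    // Otherwise make sure the finalizer is present so node deletion waits for MCM.
    if !slices.Contains(node.Finalizers, NodeFinalizer) {
        node.Finalizers = append(node.Finalizers, NodeFinalizer)
        _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
        return err
    }
    return nil
}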

Which issue(s) this PR fixes:
Fixes #1051

Special notes for your reviewer:

IT logs
Random Seed: 1764335203

Will run 10 of 10 specs
------------------------------
[BeforeSuite]
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/test/integration/controller/controller_test.go:47
  > Enter [BeforeSuite] TOP-LEVEL @ 11/28/25 18:36:53.711
  STEP: Checking for the clusters if provided are available @ 11/28/25 18:36:53.711
  2025/11/28 18:36:53 Control cluster kube-config - /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_control.yaml
  2025/11/28 18:36:53 Target cluster kube-config  - /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_target.yaml
  STEP: Killing any existing processes @ 11/28/25 18:36:56.426
  STEP: Checking Machine-Controller-Manager repo is available at: ../../../dev/mcm @ 11/28/25 18:36:56.638
  STEP: Scaledown existing machine controllers @ 11/28/25 18:36:56.638
  STEP: Starting Machine Controller  @ 11/28/25 18:36:56.84
  STEP: Starting Machine Controller Manager @ 11/28/25 18:36:56.847
  STEP: Cleaning any old resources @ 11/28/25 18:36:56.853
  2025/11/28 18:36:57 machinedeployments.machine.sapcloud.io "test-machine-deployment" not found
  2025/11/28 18:36:57 machines.machine.sapcloud.io "test-machine" not found
  2025/11/28 18:36:57 machineclasses.machine.sapcloud.io "test-mc-v1" not found
  2025/11/28 18:36:57 machineclasses.machine.sapcloud.io "test-mc-v2" not found
  STEP: Setup MachineClass @ 11/28/25 18:36:57.582
  STEP: Looking for machineclass resource in the control cluster @ 11/28/25 18:36:58.899
  STEP: Looking for secrets refered in machineclass in the control cluster @ 11/28/25 18:36:59.081
  STEP: Initializing orphan resource tracker @ 11/28/25 18:36:59.443
  2025/11/28 18:37:03 orphan resource tracker initialized
  < Exit [BeforeSuite] TOP-LEVEL @ 11/28/25 18:37:03.348 (9.638s)
[BeforeSuite] PASSED [9.638 seconds]
------------------------------
Machine controllers test machine resource creation should not lead to any errors and add 1 more node in target cluster
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:649
  > Enter [BeforeEach] Machine controllers test @ 11/28/25 18:37:03.348
  STEP: Checking machineController process is running @ 11/28/25 18:37:03.349
  STEP: Checking machineControllerManager process is running @ 11/28/25 18:37:03.349
  STEP: Checking nodes in target cluster are healthy @ 11/28/25 18:37:03.349
  < Exit [BeforeEach] Machine controllers test @ 11/28/25 18:37:03.966 (617ms)
  > Enter [It] should not lead to any errors and add 1 more node in target cluster @ 11/28/25 18:37:03.966
  STEP: Checking for errors @ 11/28/25 18:37:04.185
  STEP: Waiting until number of ready nodes is 1 more than initial nodes @ 11/28/25 18:37:04.366
  < Exit [It] should not lead to any errors and add 1 more node in target cluster @ 11/28/25 18:38:46.089 (1m42.124s)
• [102.741 seconds]
------------------------------
Machine controllers test machine resource deletion when machines available should not lead to errors and remove 1 node in target cluster
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:678
  > Enter [BeforeEach] Machine controllers test @ 11/28/25 18:38:46.089
  STEP: Checking machineController process is running @ 11/28/25 18:38:46.089
  STEP: Checking machineControllerManager process is running @ 11/28/25 18:38:46.089
  STEP: Checking nodes in target cluster are healthy @ 11/28/25 18:38:46.089
  < Exit [BeforeEach] Machine controllers test @ 11/28/25 18:38:46.507 (418ms)
  > Enter [It] should not lead to errors and remove 1 node in target cluster @ 11/28/25 18:38:46.507
  STEP: Checking for errors @ 11/28/25 18:38:47.434
  STEP: Waiting until test-machine machine object is deleted @ 11/28/25 18:38:47.619
  STEP: Waiting until number of ready nodes is equal to number of initial nodes @ 11/28/25 18:39:00.932
  < Exit [It] should not lead to errors and remove 1 node in target cluster @ 11/28/25 18:39:01.551 (15.044s)
• [15.463 seconds]
------------------------------
Machine controllers test machine resource deletion when machines are not available should keep nodes intact
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:717
  > Enter [BeforeEach] Machine controllers test @ 11/28/25 18:39:01.552
  STEP: Checking machineController process is running @ 11/28/25 18:39:01.552
  STEP: Checking machineControllerManager process is running @ 11/28/25 18:39:01.552
  STEP: Checking nodes in target cluster are healthy @ 11/28/25 18:39:01.552
  < Exit [BeforeEach] Machine controllers test @ 11/28/25 18:39:01.97 (418ms)
  > Enter [It] should keep nodes intact @ 11/28/25 18:39:01.97
  STEP: Skipping as there are machines available and this check can't be performed @ 11/28/25 18:39:02.152
  < Exit [It] should keep nodes intact @ 11/28/25 18:39:02.152 (182ms)
• [0.600 seconds]
------------------------------
Machine controllers test machine deployment resource creation with replicas=0, scale up with replicas=1 should not lead to errors and add 1 more node to target cluster
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:745
  > Enter [BeforeEach] Machine controllers test @ 11/28/25 18:39:02.152
  STEP: Checking machineController process is running @ 11/28/25 18:39:02.152
  STEP: Checking machineControllerManager process is running @ 11/28/25 18:39:02.152
  STEP: Checking nodes in target cluster are healthy @ 11/28/25 18:39:02.152
  < Exit [BeforeEach] Machine controllers test @ 11/28/25 18:39:02.568 (416ms)
  > Enter [It] should not lead to errors and add 1 more node to target cluster @ 11/28/25 18:39:02.568
  STEP: Checking for errors @ 11/28/25 18:39:02.784
  STEP: Waiting for Machine Set to be created @ 11/28/25 18:39:02.968
  STEP: Updating machineDeployment replicas to 1 @ 11/28/25 18:39:05.684
  STEP: Checking if machineDeployment's status has been updated with correct conditions @ 11/28/25 18:39:06.052
  STEP: Checking number of ready nodes==1 @ 11/28/25 18:41:04.217
  STEP: Fetching initial number of machineset freeze events @ 11/28/25 18:41:05.696
  < Exit [It] should not lead to errors and add 1 more node to target cluster @ 11/28/25 18:41:06.47 (2m3.902s)
• [124.319 seconds]
------------------------------
Machine controllers test machine deployment resource scale-up with replicas=6 should not lead to errors and add further 5 nodes to target cluster
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:813
  > Enter [BeforeEach] Machine controllers test @ 11/28/25 18:41:06.47
  STEP: Checking machineController process is running @ 11/28/25 18:41:06.47
  STEP: Checking machineControllerManager process is running @ 11/28/25 18:41:06.47
  STEP: Checking nodes in target cluster are healthy @ 11/28/25 18:41:06.47
  < Exit [BeforeEach] Machine controllers test @ 11/28/25 18:41:07.072 (602ms)
  > Enter [It] should not lead to errors and add further 5 nodes to target cluster @ 11/28/25 18:41:07.072
  STEP: Checking for errors @ 11/28/25 18:41:07.44
  STEP: Checking number of ready nodes are 6 more than initial @ 11/28/25 18:41:07.44
  < Exit [It] should not lead to errors and add further 5 nodes to target cluster @ 11/28/25 18:43:26.706 (2m19.635s)
• [140.237 seconds]
------------------------------
Machine controllers test machine deployment resource scale-down with replicas=2 should not lead to errors and remove 4 nodes from target cluster
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:843
  > Enter [BeforeEach] Machine controllers test @ 11/28/25 18:43:26.706
  STEP: Checking machineController process is running @ 11/28/25 18:43:26.706
  STEP: Checking machineControllerManager process is running @ 11/28/25 18:43:26.706
  STEP: Checking nodes in target cluster are healthy @ 11/28/25 18:43:26.706
  < Exit [BeforeEach] Machine controllers test @ 11/28/25 18:43:27.316 (609ms)
  > Enter [It] should not lead to errors and remove 4 nodes from target cluster @ 11/28/25 18:43:27.316
  STEP: Checking for errors @ 11/28/25 18:43:28.452
  STEP: Checking number of ready nodes are 2 more than initial @ 11/28/25 18:43:28.452
  ------------------------------
  Automatically polling progress:
    Machine controllers test machine deployment resource scale-down with replicas=2 should not lead to errors and remove 4 nodes from target cluster (Spec Runtime: 5m0.61s)
      /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:843
      In [It] (Node Runtime: 5m0.001s)
        /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:843
        At [By Step] Checking number of ready nodes are 2 more than initial (Step Runtime: 4m58.865s)
          /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:859

        Spec Goroutine
        goroutine 414 [select]
          github.com/onsi/gomega/internal.(*AsyncAssertion).match(0x140003232d0, {0x10731f5d0, 0x140006f1b90}, 0x1, {0x0, 0x0, 0x0})
            /Users/I765230/go/pkg/mod/github.com/onsi/gomega@v1.36.2/internal/async_assertion.go:546
          github.com/onsi/gomega/internal.(*AsyncAssertion).Should(0x140003232d0, {0x10731f5d0, 0x140006f1b90}, {0x0, 0x0, 0x0})
            /Users/I765230/go/pkg/mod/github.com/onsi/gomega@v1.36.2/internal/async_assertion.go:145
        > github.com/gardener/machine-controller-manager/pkg/test/integration/common.(*IntegrationTestFramework).ControllerTests.func2.3.1()
            /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:864
              | 	c.timeout,
              | 	c.pollingInterval).
              > 	Should(gomega.BeNumerically("==", initialNodes+2))
              | gomega.Eventually(
              | 	c.TargetCluster.GetNumberOfReadyNodes,
          github.com/onsi/ginkgo/v2/internal.extractBodyFunction.func3({0x140004e4600?, 0x0?})
            /Users/I765230/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.23.0/internal/node.go:475
          github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3()
            /Users/I765230/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.23.0/internal/suite.go:894
          github.com/onsi/ginkgo/v2/internal.(*Suite).runNode in goroutine 81
            /Users/I765230/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.23.0/internal/suite.go:881
  ------------------------------
  < Exit [It] should not lead to errors and remove 4 nodes from target cluster @ 11/28/25 18:48:45.526 (5m18.217s)
• [318.826 seconds]
------------------------------
Machine controllers test machine deployment resource scale-down with replicas=2 should freeze and unfreeze machineset temporarily
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:872
  > Enter [BeforeEach] Machine controllers test @ 11/28/25 18:48:45.527
  STEP: Checking machineController process is running @ 11/28/25 18:48:45.527
  STEP: Checking machineControllerManager process is running @ 11/28/25 18:48:45.527
  STEP: Checking nodes in target cluster are healthy @ 11/28/25 18:48:45.527
  < Exit [BeforeEach] Machine controllers test @ 11/28/25 18:48:45.931 (404ms)
  > Enter [It] should freeze and unfreeze machineset temporarily @ 11/28/25 18:48:45.931
  < Exit [It] should freeze and unfreeze machineset temporarily @ 11/28/25 18:48:47.289 (1.358s)
• [1.762 seconds]
------------------------------
Machine controllers test machine deployment resource updation to v2 machine-class and replicas=4 should upgrade machines and add more nodes to target
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:881
  > Enter [BeforeEach] Machine controllers test @ 11/28/25 18:48:47.289
  STEP: Checking machineController process is running @ 11/28/25 18:48:47.289
  STEP: Checking machineControllerManager process is running @ 11/28/25 18:48:47.289
  STEP: Checking nodes in target cluster are healthy @ 11/28/25 18:48:47.289
  < Exit [BeforeEach] Machine controllers test @ 11/28/25 18:48:48.093 (804ms)
  > Enter [It] should upgrade machines and add more nodes to target @ 11/28/25 18:48:48.093
  STEP: Checking for errors @ 11/28/25 18:48:48.499
  STEP: UpdatedReplicas to be 4 @ 11/28/25 18:48:48.499
  STEP: AvailableReplicas to be 4 @ 11/28/25 18:48:55.274
  STEP: Number of ready nodes be 4 more @ 11/28/25 18:50:36.542
  ------------------------------
  Automatically polling progress:
    Machine controllers test machine deployment resource updation to v2 machine-class and replicas=4 should upgrade machines and add more nodes to target (Spec Runtime: 5m0.805s)
      /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:881
      In [It] (Node Runtime: 5m0.001s)
        /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:881
        At [By Step] Number of ready nodes be 4 more (Step Runtime: 3m11.55s)
          /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:919

        Spec Goroutine
        goroutine 599 [select]
          github.com/onsi/gomega/internal.(*AsyncAssertion).match(0x14000323f10, {0x10731f5d0, 0x140006f52f0}, 0x1, {0x0, 0x0, 0x0})
            /Users/I765230/go/pkg/mod/github.com/onsi/gomega@v1.36.2/internal/async_assertion.go:546
          github.com/onsi/gomega/internal.(*AsyncAssertion).Should(0x14000323f10, {0x10731f5d0, 0x140006f52f0}, {0x0, 0x0, 0x0})
            /Users/I765230/go/pkg/mod/github.com/onsi/gomega@v1.36.2/internal/async_assertion.go:145
        > github.com/gardener/machine-controller-manager/pkg/test/integration/common.(*IntegrationTestFramework).ControllerTests.func2.4.1()
            /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:924
              | 	c.timeout,
              | 	c.pollingInterval).
              > 	Should(gomega.BeNumerically("==", initialNodes+4))
              | gomega.Eventually(
              | 	c.TargetCluster.GetNumberOfReadyNodes,
          github.com/onsi/ginkgo/v2/internal.extractBodyFunction.func3({0x140001f9b00?, 0x0?})
            /Users/I765230/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.23.0/internal/node.go:475
          github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3()
            /Users/I765230/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.23.0/internal/suite.go:894
          github.com/onsi/ginkgo/v2/internal.(*Suite).runNode in goroutine 81
            /Users/I765230/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.23.0/internal/suite.go:881
  ------------------------------
  Automatically polling progress:
    Machine controllers test machine deployment resource updation to v2 machine-class and replicas=4 should upgrade machines and add more nodes to target (Spec Runtime: 6m0.808s)
      /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:881
      In [It] (Node Runtime: 6m0.004s)
        /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:881
        At [By Step] Number of ready nodes be 4 more (Step Runtime: 4m11.553s)
          /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:919

        Spec Goroutine
        goroutine 599 [select]
          github.com/onsi/gomega/internal.(*AsyncAssertion).match(0x14000323f10, {0x10731f5d0, 0x140006f52f0}, 0x1, {0x0, 0x0, 0x0})
            /Users/I765230/go/pkg/mod/github.com/onsi/gomega@v1.36.2/internal/async_assertion.go:546
          github.com/onsi/gomega/internal.(*AsyncAssertion).Should(0x14000323f10, {0x10731f5d0, 0x140006f52f0}, {0x0, 0x0, 0x0})
            /Users/I765230/go/pkg/mod/github.com/onsi/gomega@v1.36.2/internal/async_assertion.go:145
        > github.com/gardener/machine-controller-manager/pkg/test/integration/common.(*IntegrationTestFramework).ControllerTests.func2.4.1()
            /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:924
              | 	c.timeout,
              | 	c.pollingInterval).
              > 	Should(gomega.BeNumerically("==", initialNodes+4))
              | gomega.Eventually(
              | 	c.TargetCluster.GetNumberOfReadyNodes,
          github.com/onsi/ginkgo/v2/internal.extractBodyFunction.func3({0x140001f9b00?, 0x0?})
            /Users/I765230/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.23.0/internal/node.go:475
          github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3()
            /Users/I765230/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.23.0/internal/suite.go:894
          github.com/onsi/ginkgo/v2/internal.(*Suite).runNode in goroutine 81
            /Users/I765230/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.23.0/internal/suite.go:881
  ------------------------------
  Automatically polling progress:
    Machine controllers test machine deployment resource updation to v2 machine-class and replicas=4 should upgrade machines and add more nodes to target (Spec Runtime: 7m0.813s)
      /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:881
      In [It] (Node Runtime: 7m0.009s)
        /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:881
        At [By Step] Number of ready nodes be 4 more (Step Runtime: 5m11.558s)
          /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:919

        Spec Goroutine
        goroutine 599 [sync.Cond.Wait]
          sync.runtime_notifyListWait(0x140007796c8, 0x2)
            /opt/homebrew/Cellar/go/1.24.5/libexec/src/runtime/sema.go:597
          sync.(*Cond).Wait(0x140007796b8)
            /opt/homebrew/Cellar/go/1.24.5/libexec/src/sync/cond.go:71
          golang.org/x/net/http2.(*pipe).Read(0x140007796b0, {0x140009f6000, 0x2000, 0x2000})
            /Users/I765230/go/pkg/mod/golang.org/x/net@v0.38.0/http2/pipe.go:76
          golang.org/x/net/http2.transportResponseBody.Read({0x140009d4000?}, {0x140009f6000?, 0x140008241b0?, 0x14000824090?})
            /Users/I765230/go/pkg/mod/golang.org/x/net@v0.38.0/http2/transport.go:2560
          io.ReadAll({0x13020f7b8, 0x14000779680})
            /opt/homebrew/Cellar/go/1.24.5/libexec/src/io/io.go:712
          k8s.io/client-go/rest.(*Request).transformResponse(0x140004ac000, 0x14000824090, 0x140004b8a00)
            /Users/I765230/go/pkg/mod/k8s.io/client-go@v0.31.0/rest/request.go:1237
          k8s.io/client-go/rest.(*Request).Do.func1(0x14000620d80?, 0x45?)
            /Users/I765230/go/pkg/mod/k8s.io/client-go@v0.31.0/rest/request.go:1203
          k8s.io/client-go/rest.(*Request).request.func3.1(...)
            /Users/I765230/go/pkg/mod/k8s.io/client-go@v0.31.0/rest/request.go:1178
          k8s.io/client-go/rest.(*Request).request.func3(0x14000824090, 0x1400006af60, {0x10732e818?, 0x14000620d80?}, 0x14000824090?, 0x0?, 0x140004b8a00, {0x0?, 0x0?}, 0x10508f5d0?)
            /Users/I765230/go/pkg/mod/k8s.io/client-go@v0.31.0/rest/request.go:1185
          k8s.io/client-go/rest.(*Request).request(0x140004ac000, {0x10732e138, 0x10874e100}, 0x1400006af60)
            /Users/I765230/go/pkg/mod/k8s.io/client-go@v0.31.0/rest/request.go:1187
          k8s.io/client-go/rest.(*Request).Do(0x140004ac000, {0x10732e138, 0x10874e100})
            /Users/I765230/go/pkg/mod/k8s.io/client-go@v0.31.0/rest/request.go:1202
          k8s.io/client-go/gentype.(*alsoLister[...]).list(0x10733aac0, {0x10732e138, 0x10874e100}, {{{0x0, 0x0}, {0x0, 0x0}}, {0x0, 0x0}, {0x0, ...}, ...})
            /Users/I765230/go/pkg/mod/k8s.io/client-go@v0.31.0/gentype/type.go:188
          k8s.io/client-go/gentype.(*alsoLister[...]).List(0x10733aac0, {0x10732e138, 0x10874e100}, {{{0x0, 0x0}, {0x0, 0x0}}, {0x0, 0x0}, {0x0, ...}, ...})
            /Users/I765230/go/pkg/mod/k8s.io/client-go@v0.31.0/gentype/type.go:170
        > github.com/gardener/machine-controller-manager/pkg/test/integration/common/helpers.(*Cluster).getNodes(0x1?)
            /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/helpers/nodes.go:22
              | // getNodes tries to retrieve the list of node objects in the cluster.
              | func (c *Cluster) getNodes() (*v1.NodeList, error) {
              > 	return c.Clientset.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
              | }
              |
        > github.com/gardener/machine-controller-manager/pkg/test/integration/common/helpers.(*Cluster).GetNumberOfNodes(...)
            /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/helpers/nodes.go:41
              | // GetNumberOfNodes tries to retrieve the list of node objects in the cluster.
              | func (c *Cluster) GetNumberOfNodes() int16 {
              > 	nodes, _ := c.getNodes()
              | 	return int16(len(nodes.Items)) //#nosec G115 (CWE-190) -- Test only
              | }
          reflect.Value.call({0x106e4bf40?, 0x1400027f7a0?, 0x1400006bc38?}, {0x1068563b5, 0x4}, {0x10874e100, 0x0, 0x1400006bc92?})
            /opt/homebrew/Cellar/go/1.24.5/libexec/src/reflect/value.go:584
          reflect.Value.Call({0x106e4bf40?, 0x1400027f7a0?, 0x20?}, {0x10874e100?, 0x1?, 0x0?})
            /opt/homebrew/Cellar/go/1.24.5/libexec/src/reflect/value.go:368
          github.com/onsi/gomega/internal.(*AsyncAssertion).buildActualPoller.func3()
            /Users/I765230/go/pkg/mod/github.com/onsi/gomega@v1.36.2/internal/async_assertion.go:325
          github.com/onsi/gomega/internal.(*AsyncAssertion).match(0x14000323f10, {0x10731f5d0, 0x140006f52f0}, 0x1, {0x0, 0x0, 0x0})
            /Users/I765230/go/pkg/mod/github.com/onsi/gomega@v1.36.2/internal/async_assertion.go:548
          github.com/onsi/gomega/internal.(*AsyncAssertion).Should(0x14000323f10, {0x10731f5d0, 0x140006f52f0}, {0x0, 0x0, 0x0})
            /Users/I765230/go/pkg/mod/github.com/onsi/gomega@v1.36.2/internal/async_assertion.go:145
        > github.com/gardener/machine-controller-manager/pkg/test/integration/common.(*IntegrationTestFramework).ControllerTests.func2.4.1()
            /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:924
              | 	c.timeout,
              | 	c.pollingInterval).
              > 	Should(gomega.BeNumerically("==", initialNodes+4))
              | gomega.Eventually(
              | 	c.TargetCluster.GetNumberOfReadyNodes,
          github.com/onsi/ginkgo/v2/internal.extractBodyFunction.func3({0x140001f9b00?, 0x0?})
            /Users/I765230/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.23.0/internal/node.go:475
          github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3()
            /Users/I765230/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.23.0/internal/suite.go:894
          github.com/onsi/ginkgo/v2/internal.(*Suite).runNode in goroutine 81
            /Users/I765230/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.23.0/internal/suite.go:881
  ------------------------------
  < Exit [It] should upgrade machines and add more nodes to target @ 11/28/25 18:55:51.119 (7m3.03s)
• [423.835 seconds]
------------------------------
Machine controllers test machine deployment resource deletion When there are machine deployment(s) available in control cluster should not lead to errors and list only initial nodes
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:935
  > Enter [BeforeEach] Machine controllers test @ 11/28/25 18:55:51.119
  STEP: Checking machineController process is running @ 11/28/25 18:55:51.119
  STEP: Checking machineControllerManager process is running @ 11/28/25 18:55:51.12
  STEP: Checking nodes in target cluster are healthy @ 11/28/25 18:55:51.12
  < Exit [BeforeEach] Machine controllers test @ 11/28/25 18:55:51.553 (433ms)
  > Enter [It] should not lead to errors and list only initial nodes @ 11/28/25 18:55:51.553
  STEP: Checking for errors @ 11/28/25 18:55:52.42
  STEP: Waiting until number of ready nodes is equal to number of initial  nodes @ 11/28/25 18:55:52.603
  < Exit [It] should not lead to errors and list only initial nodes @ 11/28/25 18:56:39.906 (48.353s)
• [48.787 seconds]
------------------------------
Machine controllers test orphaned resources when the hyperscaler resources are queried should have been deleted
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:972
  > Enter [BeforeEach] Machine controllers test @ 11/28/25 18:56:39.906
  STEP: Checking machineController process is running @ 11/28/25 18:56:39.906
  STEP: Checking machineControllerManager process is running @ 11/28/25 18:56:39.906
  STEP: Checking nodes in target cluster are healthy @ 11/28/25 18:56:39.906
  < Exit [BeforeEach] Machine controllers test @ 11/28/25 18:56:40.319 (413ms)
  > Enter [It] should have been deleted @ 11/28/25 18:56:40.319
  STEP: Querying and comparing @ 11/28/25 18:56:40.32
  < Exit [It] should have been deleted @ 11/28/25 18:56:44.437 (4.118s)
• [4.531 seconds]
------------------------------
[AfterSuite]
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/test/integration/controller/controller_test.go:49
  > Enter [AfterSuite] TOP-LEVEL @ 11/28/25 18:56:44.437
  STEP: Running Cleanup @ 11/28/25 18:56:44.438
  2025/11/28 18:57:04 machinedeployments.machine.sapcloud.io "test-machine-deployment" not found
  2025/11/28 18:57:04 machines.machine.sapcloud.io "test-machine" not found
  2025/11/28 18:57:04 deleting test-mc-v1 machineclass
  2025/11/28 18:57:05 machineclass deleted
  2025/11/28 18:57:05 deleting test-mc-v2 machineclass
  2025/11/28 18:57:06 machineclass deleted
  STEP: Killing any existing processes @ 11/28/25 18:57:06.227
  2025/11/28 18:57:06 controller_manager --control-kubeconfig=/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_control.yaml --target-kubeconfig=/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_target.yaml --namespace=shoot--i765230--demo --safety-up=2 --safety-down=1 --machine-safety-overshooting-period=300ms --leader-elect=false --v=3
  2025/11/28 18:57:06 stopMCM killed MCM process(es) with pid(s): [36382]
  2025/11/28 18:57:06 main --control-kubeconfig=/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_control.yaml --target-kubeconfig=/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_target.yaml --namespace=shoot--i765230--demo --machine-creation-timeout=20m --machine-drain-timeout=5m --machine-health-timeout=10m --machine-pv-detach-timeout=2m --machine-safety-apiserver-statuscheck-timeout=30s --machine-safety-apiserver-statuscheck-period=1m --machine-safety-orphan-vms-period=30m --leader-elect=false --v=3
  2025/11/28 18:57:06 stopMCM killed MCM process(es) with pid(s): [36380]
  STEP: Scale back the existing machine controllers @ 11/28/25 18:57:06.51
  < Exit [AfterSuite] TOP-LEVEL @ 11/28/25 18:57:07.109 (22.672s)
[AfterSuite] PASSED [22.672 seconds]
------------------------------

Ran 10 of 10 Specs in 1213.412 seconds
SUCCESS! -- 10 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Ginkgo ran 1 suite in 20m23.448949708s
Test Suite Passed
Integration tests completed successfully

Release note:

Users with delete permissions can simply use `kubectl delete node` to delete the backing Machine. The `node.machine.sapcloud.io/trigger-deletion-by-mcm` annotation on a Node is no longer supported for indirectly deleting a Machine.

@gardener-robot gardener-robot added the needs/review, size/XL, and needs/second-opinion labels Nov 28, 2025
@gagan16k gagan16k marked this pull request as ready for review November 28, 2025 14:11
@gagan16k gagan16k requested a review from a team as a code owner November 28, 2025 14:11
@gagan16k
Member Author

gagan16k commented Dec 3, 2025

Scenarios tested

Add node finalizer

  1. Verified that when a node is created, the finalizer is added to the node.
  2. Also verified that the finalizer is added when a pre-existing node is adopted by the machine controller.
  3. On manually deleting the finalizer from the node, verified that the finalizer is re-added (after one reconcile loop).
$ k get nodes ip-10-180-17-205.eu-west-1.compute.internal -oyaml | grep -A 1 finalizers
        finalizers:
        - node.machine.sapcloud.io/machine-controller

Remove node finalizer

  1. On deleting the machine, verified that the finalizer is removed from the node's finalizers list and the node is deleted successfully.

    (garden--aws-ha-external:garden shoot--i765230--demo) ~ k get mc
        NAME                                              STATUS    AGE    NODE
        shoot--i765230--demo-worker-cpu-z1-7b64b-z4npp    Running   4m3s   ip-10-180-26-109.eu-west-1.compute.internal
        shoot--i765230--demo-worker-etcd-z1-d9ffc-5ld2d   Running   87m    ip-10-180-129-100.eu-west-1.compute.internal
    
    (garden-i765230--demo-external:garden-i765230 default) ~ k get nodes
        NAME                                           STATUS   ROLES    AGE   VERSION
        ip-10-180-129-100.eu-west-1.compute.internal   Ready    worker   84m   v1.32.9
        ip-10-180-26-109.eu-west-1.compute.internal    Ready    worker   75s   v1.32.9
    
    (garden-i765230--demo-external:garden-i765230 default) ~ k get nodes ip-10-180-26-109.eu-west-1.compute.internal -oyaml | grep -A 1 finalizers
        finalizers:
        - node.machine.sapcloud.io/machine-controller
    
    (garden--aws-ha-external:garden shoot--i765230--demo) ~ k scale --replicas=0 mcd shoot--i765230--demo-worker-cpu-z1
        machinedeployment.machine.sapcloud.io/shoot--i765230--demo-worker-cpu-z1 scaled
    
    #AFTER MACHINE DELETION
    (garden-i765230--demo-external:garden-i765230 default) ~ k get nodes
        NAME                                           STATUS   ROLES    AGE   VERSION
        ip-10-180-129-100.eu-west-1.compute.internal   Ready    worker   87m   v1.32.9
  2. Similarly tested node deletion by manually deleting the node while the machine still exists.

    (garden--aws-ha-external:shoot--i765230--demo garden) ~ k get mc
        NAME                                              STATUS    AGE     NODE
        shoot--i765230--demo-worker-cpu-z1-58674-4pmx5    Running   151m    ip-10-180-15-50.eu-west-1.compute.internal
        shoot--i765230--demo-worker-cpu-z1-58674-76f77    Running   3m48s   ip-10-180-26-84.eu-west-1.compute.internal
        shoot--i765230--demo-worker-etcd-z1-7cf78-c9p22   Running   151m    ip-10-180-149-194.eu-west-1.compute.internal
    
    (garden-i765230--demo-external:garden-i765230 default) ~ k delete node ip-10-180-26-84.eu-west-1.compute.internal
        node "ip-10-180-26-84.eu-west-1.compute.internal" deleted
    
    (garden--aws-ha-external:shoot--i765230--demo garden) ~ k get mc
        NAME                                              STATUS        AGE     NODE
        shoot--i765230--demo-worker-cpu-z1-58674-4pmx5    Running       158m    ip-10-180-15-50.eu-west-1.compute.internal
        shoot--i765230--demo-worker-cpu-z1-58674-76f77    Terminating   7m11s   ip-10-180-26-84.eu-west-1.compute.internal
        shoot--i765230--demo-worker-cpu-z1-58674-sv27l                  5s
        shoot--i765230--demo-worker-etcd-z1-7cf78-c9p22   Running       158m    ip-10-180-149-194.eu-west-1.compute.internal

Deletion flow order

  1. Verified that on deleting a machine, the following operations happen in order (a sketch of the node finalizer removal step follows the logs below):

    • Machine phase set to 'Terminating'
    • Drain node
    • Delete VM
    • Remove finalizer from node
    • Delete Node object (which may already have been deleted once the finalizer was removed)
    • Delete machine finalizers
    I1203 01:20:51.214565   22101 machine.go:269] reconcileClusterMachineTermination: Start for "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk" with phase:"Terminating", description:"Drain successful. Initiate VM deletion"
    
    I1203 01:20:51.214643   22101 core.go:388] Machine deletion request has been received for "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk"
    I1203 01:20:51.214677   22101 machine.go:63] Skipping non-spec updates for machine shoot--i765230--demo-worker-cpu-z1-58674-m5ztk
    I1203 01:20:52.321826   22101 core.go:414] VM "aws:///eu-west-1/i-099327887bf499f77" for Machine "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk" was terminated successfully
    I1203 01:20:52.321881   22101 core.go:438] Machine deletion request has been processed for "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk"
    W1203 01:20:52.524987   22101 machine_util.go:863] Machine/status UPDATE failed for machine "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk". Retrying, error: Operation cannot be fulfilled on machines.machine.sapcloud.io "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk": the object has been modified; please apply your changes to the latest version and try again
    I1203 01:20:52.525052   22101 machine.go:130] Adding machine object to termination queue "shoot--i765230--demo/shoot--i765230--demo-worker-cpu-z1-58674-m5ztk" after 200ms, reason: Operation cannot be fulfilled on machines.machine.sapcloud.io "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk": the object has been modified; please apply your changes to the latest version and try again
    I1203 01:20:52.525101   22101 machine.go:292] reconcileClusterMachineTermination: Stop for "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk"
    I1203 01:20:52.525172   22101 machine.go:269] reconcileClusterMachineTermination: Start for "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk" with phase:"Terminating", description:"VM deletion was successful. Remove node finalizers"
    
    I1203 01:20:52.803582   22101 core.go:292] VM with Provider-ID "aws:///eu-west-1/i-0e076a92fef18b995", for machine "shoot--i765230--demo-worker-cpu-z1-58674-x2gww", nodeName: "ip-10-180-18-72.eu-west-1.compute.internal" should be visible to all AWS endpoints now
    I1203 01:20:52.803627   22101 core.go:293] VM with Provider-ID: "aws:///eu-west-1/i-0e076a92fef18b995" created for Machine: "shoot--i765230--demo-worker-cpu-z1-58674-x2gww"
    I1203 01:20:52.803658   22101 machine.go:427] Created new VM for machine: "shoot--i765230--demo-worker-cpu-z1-58674-x2gww" with ProviderID: "aws:///eu-west-1/i-0e076a92fef18b995" and backing node: "ip-10-180-18-72.eu-west-1.compute.internal"
    I1203 01:20:52.949225   22101 node.go:253] Removed finalizer from node "ip-10-180-5-94.eu-west-1.compute.internal"
    I1203 01:20:53.202210   22101 machine_util.go:865] Machine/status UPDATE for "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk"
    I1203 01:20:53.202272   22101 machine.go:130] Adding machine object to termination queue "shoot--i765230--demo/shoot--i765230--demo-worker-cpu-z1-58674-m5ztk" after 5s, reason: Machine deletion in process. Removal of finalizers from Node Object "ip-10-180-5-94.eu-west-1.compute.internal" is successful. Initiate node object deletion
    
    I1203 01:20:54.034713   22101 machine_util.go:1959] Deleting node "ip-10-180-5-94.eu-west-1.compute.internal" associated with machine "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk"
    W1203 01:20:54.034727   22101 machine_util.go:1972] No node object found for "ip-10-180-5-94.eu-west-1.compute.internal", continuing deletion flow. Initiate machine object finalizer removal
    
    I1203 01:20:54.237784   22101 machine.go:292] reconcileClusterMachineTermination: Stop for "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk"
    I1203 01:20:54.438877   22101 machine.go:269] reconcileClusterMachineTermination: Start for "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk" with phase:"Terminating", description:"No node object found for \"ip-10-180-5-94.eu-west-1.compute.internal\", continuing deletion flow. Initiate machine object finalizer removal"
    
    I1203 01:20:54.850090   22101 machine_util.go:1226] Removed finalizer to machine "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk" with providerID "aws:///eu-west-1/i-099327887bf499f77" and backing node "ip-10-180-5-94.eu-west-1.compute.internal"
    I1203 01:20:54.850139   22101 machine.go:725] Machine "shoot--i765230--demo-worker-cpu-z1-58674-m5ztk" with providerID "aws:///eu-west-1/i-099327887bf499f77" and nodeName "ip-10-180-5-94.eu-west-1.compute.internal" deleted successfully
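
    A minimal sketch of the new intermediate step (removing the MCM finalizer from the node between VM deletion and node object deletion), assuming a client-go clientset. removeNodeFinalizer and nodeFinalizer are names chosen for this sketch only, not necessarily how the step is implemented inside triggerDeletionFlow.

    package machine

    import (
        "context"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    const nodeFinalizer = "node.machine.sapcloud.io/machine-controller"

    // removeNodeFinalizer strips the MCM finalizer from the node backing a machine so
    // that a pending or subsequent node deletion can complete.
    func removeNodeFinalizer(ctx context.Context, client kubernetes.Interface, nodeName string) error {
        node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
        if apierrors.IsNotFound(err) {
            // Node is already gone; nothing left to clean up.
            return nil
        }
        if err != nil {
            return err
        }
        // Keep every finalizer except the MCM one.
        kept := node.Finalizers[:0]
        for _, f := range node.Finalizers {
            if f != nodeFinalizer {
                kept = append(kept, f)
            }
        }
        node.Finalizers = kept
        _, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
        return err
    }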
    

MCM pod restart

  1. Pause MCM → delete node → resume MCM → machine deletion triggered. (Covers nodes that already have a deletionTimestamp at MCM startup.)

    (garden--aws-ha-external:shoot--i765230--demo garden) ~ k get mc
        NAME                                              STATUS    AGE     NODE
        shoot--i765230--demo-worker-cpu-z1-58674-4pmx5    Running   3h15m   ip-10-180-15-50.eu-west-1.compute.internal
        shoot--i765230--demo-worker-cpu-z1-58674-x2gww    Running   17m     ip-10-180-18-72.eu-west-1.compute.internal
        shoot--i765230--demo-worker-etcd-z1-7cf78-c9p22   Running   3h15m   ip-10-180-149-194.eu-west-1.compute.internal
    
    #MCM STOPPED
    
    (garden-i765230--demo-external:garden-i765230 default) ~ k delete node ip-10-180-18-72.eu-west-1.compute.internal
        node "ip-10-180-18-72.eu-west-1.compute.internal" deleted
    (garden-i765230--demo-external:garden-i765230 default) ~ k get node ip-10-180-18-72.eu-west-1.compute.internal -oyaml | grep deletionTimestamp
        deletionTimestamp: "2025-12-02T20:13:10Z"
    (garden-i765230--demo-external:garden-i765230 default) ~ k get node ip-10-180-8-72.eu-west-1.compute.internal -oyaml | grep -A 1 finalizers
        finalizers:
        - node.machine.sapcloud.io/machine-controller
    
    #MCM RESTARTED
    
    (garden--aws-ha-external:shoot--i765230--demo garden) ~ k get mc
        NAME                                              STATUS        AGE     NODE
        shoot--i765230--demo-worker-cpu-z1-58674-4pmx5    Running       3h17m   ip-10-180-15-50.eu-west-1.compute.internal
        shoot--i765230--demo-worker-cpu-z1-58674-86bxk                  5s
        shoot--i765230--demo-worker-cpu-z1-58674-x2gww    Terminating   19m     ip-10-180-18-72.eu-west-1.compute.internal
        shoot--i765230--demo-worker-etcd-z1-7cf78-c9p22   Running       3h17m   ip-10-180-149-194.eu-west-1.compute.internal
  2. Deletion of a machine during an MCM crash is already handled by pre-existing code in the same manner as above.

    Any crash/restart of MCM during deletion is handled on MCM restart by the machine deletion flow, as the machine will still be in the 'Terminating' phase and the state machine continues from where it left off.

Force deletion

  1. Node finalizers are manually removed and the node is deleted outside MCM. Verified that the machine is marked for deletion successfully.
    (garden-i765230--demo-external:garden-i765230 default) ~ k get nodes
        NAME                                           STATUS   ROLES    AGE   VERSION
        ip-10-180-129-100.eu-west-1.compute.internal   Ready    worker   3h50m   v1.32.9
        ip-10-180-15-50.eu-west-1.compute.internal     Ready    worker   2h40m   v1.32.9
    
    (garden-i765230--demo-external:garden-i765230 default) ~ k get nodes ip-10-180-15-50.eu-west-1.compute.internal -oyaml | grep -A 1 finalizers
        finalizers:
        - node.machine.sapcloud.io/machine-controller
    
    (garden-i765230--demo-external:garden-i765230 default) ~ k patch node ip-10-180-15-50.eu-west-1.compute.internal -p '{"metadata":{"finalizers":[]}}' --type=merge
        node "ip-10-180-15-50.eu-west-1.compute.internal" patched
    
    (garden-i765230--demo-external:garden-i765230 default) ~ k delete node ip-10-180-15-50.eu-west-1.compute.internal
        node "ip-10-180-15-50.eu-west-1.compute.internal" deleted
    
    (garden--aws-ha-external:shoot--i765230--demo garden) ~ k get mc -w
        NAME                                              STATUS        AGE
        shoot--i765230--demo-worker-cpu-z1-58674-4pmx5    Terminating   2h45m

Orphan safety controller scenarios

  1. The machine is force-deleted by manually removing the finalizers from the machine object.

    (garden--aws-ha-external:shoot--i765230--demo garden) ~ k get mc
        NAME                                              STATUS    AGE     NODE
        shoot--i765230--demo-worker-cpu-z1-58674-4pmx5    Running   2h50m   ip-10-180-15-50.eu-west-1.compute.internal
        shoot--i765230--demo-worker-etcd-z1-7cf78-c9p22   Running   2h50m   ip-10-180-149-194.eu-west-1.compute.internal
    
    (garden--aws-ha-external:shoot--i765230--demo garden) ~ k get nodes ip-10-180-15-50.eu-west-1.compute.internal -oyaml | grep -A 1 finalizers
        finalizers:
        - node.machine.sapcloud.io/machine-controller
    
    (garden--aws-ha-external:shoot--i765230--demo garden) ~ k patch mc shoot--i765230--demo-worker-cpu-z1-58674-4pmx5 -p '{"metadata":{"finalizers":[]}}' --type=merge
        machine "shoot--i765230--demo-worker-cpu-z1-58674-4pmx5" patched
    
    (garden--aws-ha-external:shoot--i765230--demo garden) ~ k delete mc shoot--i765230--demo-worker-cpu-z1-58674-4pmx5
        machine.machine.sapcloud.io "shoot--i765230--demo-worker-cpu-z1-58674-4pmx5" deleted

    When the safety controller runs after some time, it detects that the node has no backing machine: it deletes the backing VM, annotates the node and removes the finalizer.

    I1203 11:31:36.295046   42368 machine_safety.go:191] Adding NotManagedByMCM annotation to Node "ip-10-180-15-50.eu-west-1.compute.internal"
    I1203 11:31:37.033457   42368 node.go:253] Removed finalizer from node "ip-10-180-15-50.eu-west-1.compute.internal"
    I1203 11:31:37.033501   42368 machine_safety.go:200] Removed MCM finalizer from orphan node "ip-10-180-15-50.eu-west-1.compute.internal" to allow deletion
    

    Once this is done, the kubelet of the node stops sending heartbeats (as there is no backing VM) and the node is removed from the cluster by the CCM's node controller.

    #After some time
    (garden-i765230--demo-external:garden-i765230 default) ~ k get node ip-10-180-15-50.eu-west-1.compute.internal
        Error from server (NotFound): nodes "ip-10-180-15-50.eu-west-1.compute.internal" not found
    
  2. When a node is annotated with node.machine.sapcloud.io/not-managed-by-mcm: "1", the safety controller verifies that the node has a backing machine, removes the annotation, and adds the finalizer back if it was removed. It also queues the node for reconciliation (a sketch follows after the outputs below).

    (garden-i765230--demo-external:garden-i765230 default) ~ k annotate node ip-10-180-129-100.eu-west-1.compute.internal node.machine.sapcloud.io/not-managed-by-mcm=1
        node "ip-10-180-129-100.eu-west-1.compute.internal" annotated
    
    (garden-i765230--demo-external:garden-i765230 default) ~ k patch node ip-10-180-129-100.eu-west-1.compute.internal -p '{"metadata":{"finalizers":[]}}' --type=merge
        node "ip-10-180-129-100.eu-west-1.compute.internal" patched

    When the safety controller runs after some time, it detects that the node has a backing machine: it removes the annotation and adds the finalizer back.

    I1203 11:39:54.889345   42368 machine_safety.go:208] Removing NotManagedByMCM annotation from Node "ip-10-180-129-100.eu-west-1.compute.internal" associated with Machine "shoot--i765230--demo-worker-cpu-z1-58674-6b5xt"
    I1203 11:39:55.315597   42368 node.go:210] Adding node object to queue "ip-10-180-129-100.eu-west-1.compute.internal" after 5s, reason: Node "ip-10-180-129-100.eu-west-1.compute.internal" is managed by MCM, reconciling
    I1203 11:39:55.315645   42368 machine_safety.go:48] reconcileClusterMachineSafetyOrphanVMs: End, reSync-Period: 10m0s
    I1203 11:40:00.738702   42368 node.go:240] Added finalizer to node "ip-10-180-129-100.eu-west-1.compute.internal"
    I1203 11:40:00.738723   42368 node.go:210] Adding node object to queue "ip-10-180-129-100.eu-west-1.compute.internal" after 10m0s, reason: periodic reconcile
    

    After this, verify that the annotation is removed and the finalizer is added back:

    (garden-i765230--demo-external:garden-i765230 default) ~ k get node ip-10-180-129-100.eu-west-1.compute.internal -oyaml | grep not-managed-by-mcm
        # (no output - annotation removed)
    
    (garden-i765230--demo-external:garden-i765230 default) ~ k get nodes ip-10-180-129-100.eu-west-1.compute.internal -oyaml | grep -A 1 finalizers
        finalizers:
        - node.machine.sapcloud.io/machine-controller
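
    A minimal sketch of the safety controller behaviour shown in the two scenarios above. reconcileOrphanNode and the managed flag (whether a backing Machine exists, looked up elsewhere) are assumptions made for this sketch; the annotation and finalizer keys come from the logs.

    package safety

    import (
        "context"
        "slices"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    const (
        nodeFinalizer             = "node.machine.sapcloud.io/machine-controller"
        notManagedByMCMAnnotation = "node.machine.sapcloud.io/not-managed-by-mcm"
    )

    // reconcileOrphanNode annotates nodes without a backing machine and clears the MCM
    // finalizer so the node can be garbage collected; for managed nodes it removes a
    // stale annotation, leaving the node queue reconcile to restore the finalizer.
    func reconcileOrphanNode(ctx context.Context, client kubernetes.Interface, node *v1.Node, managed bool) error {
        if !managed {
            // Orphan node: mark it and drop the finalizer so deletion can proceed.
            if node.Annotations == nil {
                node.Annotations = map[string]string{}
            }
            node.Annotations[notManagedByMCMAnnotation] = "1"
            node.Finalizers = slices.DeleteFunc(node.Finalizers, func(f string) bool {
                return f == nodeFinalizer
            })
        } else {
            // Managed node wrongly annotated: remove the annotation only.
            delete(node.Annotations, notManagedByMCMAnnotation)
        }
        _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
        return err
    }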

@aaronfern
Member

/assign

Member

@thiyyakat thiyyakat left a comment


Hi @gagan16k. Thank you for the PR. Just have a few small suggestions. Nitpicks mostly.

Member

@takoverflow takoverflow left a comment


Have only gone through half of the PR, had some suggestions, PTAL.

Thanks for extensively documenting the testing process!

@gardener-robot gardener-robot added the needs/changes label Dec 4, 2025
Member

@aaronfern aaronfern left a comment


Thanks for the PR!
Looks fine in general, just some comments

@gagan16k
Member Author

gagan16k commented Dec 5, 2025

Made changes to node.go, PTAL

Updated IT logs (AWS)
Random Seed: 1764925022

Will run 10 of 10 specs
------------------------------
[BeforeSuite]
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/test/integration/controller/controller_test.go:47
  > Enter [BeforeSuite] TOP-LEVEL @ 12/05/25 14:27:15.467
  STEP: Checking for the clusters if provided are available @ 12/05/25 14:27:15.468
  2025/12/05 14:27:15 Control cluster kube-config - /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_control.yaml
  2025/12/05 14:27:15 Target cluster kube-config  - /Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_target.yaml
  STEP: Killing any existing processes @ 12/05/25 14:27:18.146
  STEP: Checking Machine-Controller-Manager repo is available at: ../../../dev/mcm @ 12/05/25 14:27:18.347
  STEP: Scaledown existing machine controllers @ 12/05/25 14:27:18.347
  STEP: Starting Machine Controller  @ 12/05/25 14:27:18.534
  STEP: Starting Machine Controller Manager @ 12/05/25 14:27:18.542
  STEP: Cleaning any old resources @ 12/05/25 14:27:18.547
  2025/12/05 14:27:18 machinedeployments.machine.sapcloud.io "test-machine-deployment" not found
  2025/12/05 14:27:18 machines.machine.sapcloud.io "test-machine" not found
  2025/12/05 14:27:19 machineclasses.machine.sapcloud.io "test-mc-v1" not found
  2025/12/05 14:27:19 machineclasses.machine.sapcloud.io "test-mc-v2" not found
  STEP: Setup MachineClass @ 12/05/25 14:27:19.272
  STEP: Looking for machineclass resource in the control cluster @ 12/05/25 14:27:20.567
  STEP: Looking for secrets refered in machineclass in the control cluster @ 12/05/25 14:27:20.752
  STEP: Initializing orphan resource tracker @ 12/05/25 14:27:21.114
  2025/12/05 14:27:26 orphan resource tracker initialized
  < Exit [BeforeSuite] TOP-LEVEL @ 12/05/25 14:27:26.521 (11.054s)
[BeforeSuite] PASSED [11.054 seconds]
------------------------------
Machine controllers test machine resource creation should not lead to any errors and add 1 more node in target cluster
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:649
  > Enter [BeforeEach] Machine controllers test @ 12/05/25 14:27:26.522
  STEP: Checking machineController process is running @ 12/05/25 14:27:26.522
  STEP: Checking machineControllerManager process is running @ 12/05/25 14:27:26.522
  STEP: Checking nodes in target cluster are healthy @ 12/05/25 14:27:26.522
  < Exit [BeforeEach] Machine controllers test @ 12/05/25 14:27:27.402 (880ms)
  > Enter [It] should not lead to any errors and add 1 more node in target cluster @ 12/05/25 14:27:27.402
  STEP: Checking for errors @ 12/05/25 14:27:27.816
  STEP: Waiting until number of ready nodes is 1 more than initial nodes @ 12/05/25 14:27:28.002
  < Exit [It] should not lead to any errors and add 1 more node in target cluster @ 12/05/25 14:29:16.444 (1m49.043s)
• [109.924 seconds]
------------------------------
Machine controllers test machine resource deletion when machines available should not lead to errors and remove 1 node in target cluster
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:678
  > Enter [BeforeEach] Machine controllers test @ 12/05/25 14:29:16.445
  STEP: Checking machineController process is running @ 12/05/25 14:29:16.445
  STEP: Checking machineControllerManager process is running @ 12/05/25 14:29:16.445
  STEP: Checking nodes in target cluster are healthy @ 12/05/25 14:29:16.445
  < Exit [BeforeEach] Machine controllers test @ 12/05/25 14:29:16.885 (441ms)
  > Enter [It] should not lead to errors and remove 1 node in target cluster @ 12/05/25 14:29:16.886
  STEP: Checking for errors @ 12/05/25 14:29:17.814
  STEP: Waiting until test-machine machine object is deleted @ 12/05/25 14:29:18.004
  STEP: Waiting until number of ready nodes is equal to number of initial nodes @ 12/05/25 14:29:24.755
  < Exit [It] should not lead to errors and remove 1 node in target cluster @ 12/05/25 14:29:25.59 (8.704s)
• [9.145 seconds]
------------------------------
Machine controllers test machine resource deletion when machines are not available should keep nodes intact
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:717
  > Enter [BeforeEach] Machine controllers test @ 12/05/25 14:29:25.59
  STEP: Checking machineController process is running @ 12/05/25 14:29:25.59
  STEP: Checking machineControllerManager process is running @ 12/05/25 14:29:25.59
  STEP: Checking nodes in target cluster are healthy @ 12/05/25 14:29:25.59
  < Exit [BeforeEach] Machine controllers test @ 12/05/25 14:29:26.028 (438ms)
  > Enter [It] should keep nodes intact @ 12/05/25 14:29:26.028
  STEP: Skipping as there are machines available and this check can't be performed @ 12/05/25 14:29:26.395
  < Exit [It] should keep nodes intact @ 12/05/25 14:29:26.395 (367ms)
• [0.805 seconds]
------------------------------
Machine controllers test machine deployment resource creation with replicas=0, scale up with replicas=1 should not lead to errors and add 1 more node to target cluster
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:745
  > Enter [BeforeEach] Machine controllers test @ 12/05/25 14:29:26.396
  STEP: Checking machineController process is running @ 12/05/25 14:29:26.396
  STEP: Checking machineControllerManager process is running @ 12/05/25 14:29:26.396
  STEP: Checking nodes in target cluster are healthy @ 12/05/25 14:29:26.396
  < Exit [BeforeEach] Machine controllers test @ 12/05/25 14:29:27.03 (634ms)
  > Enter [It] should not lead to errors and add 1 more node to target cluster @ 12/05/25 14:29:27.03
  STEP: Checking for errors @ 12/05/25 14:29:27.244
  STEP: Waiting for Machine Set to be created @ 12/05/25 14:29:27.43
  STEP: Updating machineDeployment replicas to 1 @ 12/05/25 14:29:30.168
  STEP: Checking if machineDeployment's status has been updated with correct conditions @ 12/05/25 14:29:30.54
  STEP: Checking number of ready nodes==1 @ 12/05/25 14:31:33.335
  STEP: Fetching initial number of machineset freeze events @ 12/05/25 14:31:34.613
  < Exit [It] should not lead to errors and add 1 more node to target cluster @ 12/05/25 14:31:35.353 (2m8.324s)
• [128.958 seconds]
------------------------------
Machine controllers test machine deployment resource scale-up with replicas=6 should not lead to errors and add further 5 nodes to target cluster
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:813
  > Enter [BeforeEach] Machine controllers test @ 12/05/25 14:31:35.353
  STEP: Checking machineController process is running @ 12/05/25 14:31:35.353
  STEP: Checking machineControllerManager process is running @ 12/05/25 14:31:35.353
  STEP: Checking nodes in target cluster are healthy @ 12/05/25 14:31:35.353
  < Exit [BeforeEach] Machine controllers test @ 12/05/25 14:31:35.791 (438ms)
  > Enter [It] should not lead to errors and add further 5 nodes to target cluster @ 12/05/25 14:31:35.791
  STEP: Checking for errors @ 12/05/25 14:31:36.179
  STEP: Checking number of ready nodes are 6 more than initial @ 12/05/25 14:31:36.18
  < Exit [It] should not lead to errors and add further 5 nodes to target cluster @ 12/05/25 14:33:22.87 (1m47.08s)
• [107.518 seconds]
------------------------------
Machine controllers test machine deployment resource scale-down with replicas=2 should not lead to errors and remove 4 nodes from target cluster
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:843
  > Enter [BeforeEach] Machine controllers test @ 12/05/25 14:33:22.87
  STEP: Checking machineController process is running @ 12/05/25 14:33:22.87
  STEP: Checking machineControllerManager process is running @ 12/05/25 14:33:22.87
  STEP: Checking nodes in target cluster are healthy @ 12/05/25 14:33:22.87
  < Exit [BeforeEach] Machine controllers test @ 12/05/25 14:33:23.328 (458ms)
  > Enter [It] should not lead to errors and remove 4 nodes from target cluster @ 12/05/25 14:33:23.328
  STEP: Checking for errors @ 12/05/25 14:33:24.408
  STEP: Checking number of ready nodes are 2 more than initial @ 12/05/25 14:33:24.408
  < Exit [It] should not lead to errors and remove 4 nodes from target cluster @ 12/05/25 14:33:38.017 (14.688s)
• [15.147 seconds]
------------------------------
Machine controllers test machine deployment resource scale-down with replicas=2 should freeze and unfreeze machineset temporarily
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:872
  > Enter [BeforeEach] Machine controllers test @ 12/05/25 14:33:38.017
  STEP: Checking machineController process is running @ 12/05/25 14:33:38.017
  STEP: Checking machineControllerManager process is running @ 12/05/25 14:33:38.017
  STEP: Checking nodes in target cluster are healthy @ 12/05/25 14:33:38.017
  < Exit [BeforeEach] Machine controllers test @ 12/05/25 14:33:38.456 (439ms)
  > Enter [It] should freeze and unfreeze machineset temporarily @ 12/05/25 14:33:38.456
  < Exit [It] should freeze and unfreeze machineset temporarily @ 12/05/25 14:33:39.234 (778ms)
• [1.217 seconds]
------------------------------
Machine controllers test machine deployment resource updation to v2 machine-class and replicas=4 should upgrade machines and add more nodes to target
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:881
  > Enter [BeforeEach] Machine controllers test @ 12/05/25 14:33:39.234
  STEP: Checking machineController process is running @ 12/05/25 14:33:39.234
  STEP: Checking machineControllerManager process is running @ 12/05/25 14:33:39.234
  STEP: Checking nodes in target cluster are healthy @ 12/05/25 14:33:39.234
  < Exit [BeforeEach] Machine controllers test @ 12/05/25 14:33:39.871 (637ms)
  > Enter [It] should upgrade machines and add more nodes to target @ 12/05/25 14:33:39.871
  STEP: Checking for errors @ 12/05/25 14:33:40.275
  STEP: UpdatedReplicas to be 4 @ 12/05/25 14:33:40.276
  STEP: AvailableReplicas to be 4 @ 12/05/25 14:33:47.069
  STEP: Number of ready nodes be 4 more @ 12/05/25 14:35:50.397
  < Exit [It] should upgrade machines and add more nodes to target @ 12/05/25 14:35:52.025 (2m12.155s)
• [132.793 seconds]
------------------------------
Machine controllers test machine deployment resource deletion When there are machine deployment(s) available in control cluster should not lead to errors and list only initial nodes
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:935
  > Enter [BeforeEach] Machine controllers test @ 12/05/25 14:35:52.026
  STEP: Checking machineController process is running @ 12/05/25 14:35:52.026
  STEP: Checking machineControllerManager process is running @ 12/05/25 14:35:52.026
  STEP: Checking nodes in target cluster are healthy @ 12/05/25 14:35:52.026
  < Exit [BeforeEach] Machine controllers test @ 12/05/25 14:35:52.478 (452ms)
  > Enter [It] should not lead to errors and list only initial nodes @ 12/05/25 14:35:52.478
  STEP: Checking for errors @ 12/05/25 14:35:52.672
  STEP: Waiting until number of ready nodes is equal to number of initial  nodes @ 12/05/25 14:35:52.869
  < Exit [It] should not lead to errors and list only initial nodes @ 12/05/25 14:36:04.444 (11.966s)
• [12.418 seconds]
------------------------------
Machine controllers test orphaned resources when the hyperscaler resources are queried should have been deleted
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager/pkg/test/integration/common/framework.go:972
  > Enter [BeforeEach] Machine controllers test @ 12/05/25 14:36:04.444
  STEP: Checking machineController process is running @ 12/05/25 14:36:04.444
  STEP: Checking machineControllerManager process is running @ 12/05/25 14:36:04.444
  STEP: Checking nodes in target cluster are healthy @ 12/05/25 14:36:04.444
  < Exit [BeforeEach] Machine controllers test @ 12/05/25 14:36:04.885 (441ms)
  > Enter [It] should have been deleted @ 12/05/25 14:36:04.885
  STEP: Querying and comparing @ 12/05/25 14:36:04.885
  < Exit [It] should have been deleted @ 12/05/25 14:36:08.77 (3.886s)
• [4.327 seconds]
------------------------------
[AfterSuite]
/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/test/integration/controller/controller_test.go:49
  > Enter [AfterSuite] TOP-LEVEL @ 12/05/25 14:36:08.771
  STEP: Running Cleanup @ 12/05/25 14:36:08.771
  2025/12/05 14:36:28 machinedeployments.machine.sapcloud.io "test-machine-deployment" not found
  2025/12/05 14:36:29 machines.machine.sapcloud.io "test-machine" not found
  2025/12/05 14:36:29 deleting test-mc-v1 machineclass
  2025/12/05 14:36:29 machineclass deleted
  2025/12/05 14:36:30 deleting test-mc-v2 machineclass
  2025/12/05 14:36:30 machineclass deleted
  STEP: Killing any existing processes @ 12/05/25 14:36:30.728
  2025/12/05 14:36:30 controller_manager --control-kubeconfig=/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_control.yaml --target-kubeconfig=/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_target.yaml --namespace=shoot--i765230--demo --safety-up=2 --safety-down=1 --machine-safety-overshooting-period=300ms --leader-elect=false --v=3
  2025/12/05 14:36:30 stopMCM killed MCM process(es) with pid(s): [66163]
  2025/12/05 14:36:30 main --control-kubeconfig=/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_control.yaml --target-kubeconfig=/Users/I765230/go/src/github.com/gagan16k/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_target.yaml --namespace=shoot--i765230--demo --machine-creation-timeout=20m --machine-drain-timeout=5m --machine-health-timeout=10m --machine-pv-detach-timeout=2m --machine-safety-apiserver-statuscheck-timeout=30s --machine-safety-apiserver-statuscheck-period=1m --machine-safety-orphan-vms-period=30m --leader-elect=false --v=3
  2025/12/05 14:36:30 stopMCM killed MCM process(es) with pid(s): [66162]
  STEP: Scale back the existing machine controllers @ 12/05/25 14:36:30.956
  < Exit [AfterSuite] TOP-LEVEL @ 12/05/25 14:36:31.603 (22.833s)
[AfterSuite] PASSED [22.833 seconds]
------------------------------

Ran 10 of 10 Specs in 556.141 seconds
SUCCESS! -- 10 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Ginkgo ran 1 suite in 9m29.269957542s
Test Suite Passed
Integration tests completed successfully

Changes

Node event handlers

  • Removed deletion handling from addNode (it no longer checks DeletionTimestamp).
  • updateNode:
    • Handles node deletion early in the handler
    • Detects removal of the MCM finalizer (hasNodeFinalizerBeenRemoved) and requeues the node
  • deleteNode calls triggerMachineDeletion instead of fetching the Machine and deleting it manually (a hedged sketch of these handlers follows this list).
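
A minimal sketch of the handler shape these bullets describe, assuming a plain client-go informer/workqueue setup; the controller struct, the nodeFinalizer value, and all helper names other than hasNodeFinalizerBeenRemoved are illustrative assumptions, not the actual MCM implementation:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

// nodeFinalizer stands in for the MCM node finalizer key; the exact value is
// an assumption for this sketch.
const nodeFinalizer = "machine.sapcloud.io/machine-controller-manager"

// controller is a trimmed stand-in holding only the node workqueue.
type controller struct {
	nodeQueue workqueue.RateLimitingInterface
}

func hasFinalizer(node *corev1.Node, finalizer string) bool {
	for _, f := range node.Finalizers {
		if f == finalizer {
			return true
		}
	}
	return false
}

// hasNodeFinalizerBeenRemoved reports whether the old node carried the MCM
// finalizer while the updated node no longer does (e.g. a manual force-delete).
func hasNodeFinalizerBeenRemoved(oldNode, newNode *corev1.Node) bool {
	return hasFinalizer(oldNode, nodeFinalizer) && !hasFinalizer(newNode, nodeFinalizer)
}

// addNode only enqueues the node; it no longer inspects DeletionTimestamp.
func (c *controller) addNode(obj interface{}) {
	if node, ok := obj.(*corev1.Node); ok {
		c.nodeQueue.Add(node.Name)
	}
}

// updateNode handles node deletion first, then checks whether the MCM
// finalizer was removed out-of-band; in both cases the node is requeued so
// reconciliation can react.
func (c *controller) updateNode(oldObj, newObj interface{}) {
	oldNode, okOld := oldObj.(*corev1.Node)
	newNode, okNew := newObj.(*corev1.Node)
	if !okOld || !okNew {
		return
	}
	if newNode.DeletionTimestamp != nil {
		c.nodeQueue.Add(newNode.Name)
		return
	}
	if hasNodeFinalizerBeenRemoved(oldNode, newNode) {
		c.nodeQueue.Add(newNode.Name)
	}
}

// deleteNode delegates to machine deletion instead of fetching and deleting
// the Machine object itself.
func (c *controller) deleteNode(obj interface{}) {
	if node, ok := obj.(*corev1.Node); ok {
		_ = c.triggerMachineDeletion(node)
	}
}

// triggerMachineDeletion is stubbed here; a fuller sketch appears further
// below in this summary.
func (c *controller) triggerMachineDeletion(node *corev1.Node) error { return nil }
```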

reconcileClusterNodeKey

  • If node.DeletionTimestamp is set, calls triggerMachineDeletion and returns.
  • Periodic enqueues are removed; the update handler now enqueues the node with MediumRetry so the finalizer is re-added if it was removed (see the sketch after this list).
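
A hedged sketch of this reconcile path, assuming a typed corev1 client and a rate-limited workqueue; the mediumRetry duration, the notManagedByMCM flag, and the injected triggerMachineDeletion callback are assumptions made for illustration, not the real wiring:

```go
package example

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/workqueue"
)

// nodeFinalizer and mediumRetry stand in for the real constants
// (e.g. machineutils.MediumRetry); the values here are assumptions.
const (
	nodeFinalizer = "machine.sapcloud.io/machine-controller-manager"
	mediumRetry   = 3 * time.Minute
)

// reconcileNode mirrors the flow described above: trigger machine deletion
// when the node is being deleted, otherwise re-add the MCM finalizer on
// managed nodes and requeue with a medium retry on failure.
func reconcileNode(
	ctx context.Context,
	client kubernetes.Interface,
	queue workqueue.RateLimitingInterface,
	node *corev1.Node,
	notManagedByMCM bool,
	triggerMachineDeletion func(*corev1.Node) error,
) error {
	if node.DeletionTimestamp != nil {
		// The deletion timestamp, not an annotation, now triggers deletion
		// of the backing Machine.
		return triggerMachineDeletion(node)
	}
	if notManagedByMCM {
		// Nodes marked as not managed by MCM are left alone.
		return nil
	}
	for _, f := range node.Finalizers {
		if f == nodeFinalizer {
			return nil // finalizer already present, nothing to do
		}
	}
	// Finalizer missing (e.g. removed out-of-band): add it back.
	clone := node.DeepCopy()
	clone.Finalizers = append(clone.Finalizers, nodeFinalizer)
	if _, err := client.CoreV1().Nodes().Update(ctx, clone, metav1.UpdateOptions{}); err != nil {
		// Retry later instead of relying on periodic enqueues.
		queue.AddAfter(node.Name, mediumRetry)
		return err
	}
	return nil
}
```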

triggerMachineDeletion

  • Changed from returning (machineutils.RetryPeriod, error) to returning only error.
  • Does not enqueue the machine for termination if its deletionTimestamp is already set (see the sketch after this list).
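
A minimal sketch of the new signature, assuming the guard is on the Machine's deletionTimestamp; the trimmed Machine type and the injected lookup/delete callbacks are placeholders for the real MCM client types:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Machine is a trimmed stand-in for the MCM Machine type; only ObjectMeta is
// needed here to show the deletionTimestamp guard.
type Machine struct {
	metav1.ObjectMeta
}

// triggerMachineDeletion now returns only an error (previously
// (machineutils.RetryPeriod, error)) and skips re-triggering termination when
// the machine already carries a deletionTimestamp.
func triggerMachineDeletion(
	ctx context.Context,
	node *corev1.Node,
	machineForNode func(nodeName string) (*Machine, error),
	deleteMachine func(ctx context.Context, namespace, name string) error,
) error {
	machine, err := machineForNode(node.Name)
	if err != nil {
		return err
	}
	if machine.DeletionTimestamp != nil {
		// Machine is already terminating; do not enqueue it again.
		return nil
	}
	// Deleting the Machine object hands the rest of the work to the machine
	// deletion flow.
	return deleteMachine(ctx, machine.Namespace, machine.Name)
}
```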

@gagan16k gagan16k changed the title Remove node.machine.sapcloud.io/trigger-deletion-by-mcm annotation for security reasons Remove node.machine.sapcloud.io/trigger-deletion-by-mcm annotation for security reasons Dec 8, 2025
@gagan16k gagan16k changed the title Remove node.machine.sapcloud.io/trigger-deletion-by-mcm annotation for security reasons Remove node.machine.sapcloud.io/trigger-deletion-by-mcm annotation Dec 11, 2025
@takoverflow takoverflow removed their assignment Dec 12, 2025
@gagan16k gagan16k left a comment (marked as duplicate)

Requested changes

@aaronfern aaronfern left a comment

Thanks for the PR!
/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/changes Needs (more) changes needs/review Needs review needs/second-opinion Needs second review by someone else labels Dec 18, 2025
@gardener-robot gardener-robot added needs/second-opinion Needs second review by someone else and removed reviewed/lgtm Has approval for merging labels Dec 18, 2025
@gagan16k gagan16k merged commit 2609f37 into gardener:master Dec 18, 2025
12 checks passed
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Dec 18, 2025
@gagan16k gagan16k deleted the remove_annotation branch December 23, 2025 08:12