OCPBUGS-74960: prevent resource leak on deletion and handle DependencyViolation#7868

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from sdminonne:OCPBUGS-74960
Apr 2, 2026
@sdminonne
Contributor

@sdminonne sdminonne commented Mar 5, 2026

Summary

  • When getClients fails during deletion (e.g., after an operator restart), the controller now returns an error instead of logging and falling through to finalizer removal, which would permanently orphan AWS resources (security groups, VPC endpoints, DNS records)
  • On deletion, the controller now performs best-effort client initialization by listing HostedControlPlane resources in the namespace. After a controller restart the clientBuilder is uninitialized; if the HCP still exists, initializeWithHCP is called so that getClients can succeed and cleanup can proceed
  • Adds DependencyViolation error handling to the deleteSecurityGroup function — when AWS returns DependencyViolation during security group ingress/egress revocation or deletion, the controller returns a sentinel error that the caller translates into a controlled requeue (5s delay), allowing AWS to finish VPC endpoint cleanup before retrying
  • Extracts awsClientProvider interface from clientBuilder to enable mock injection in tests
  • Documents the remaining SharedVPC leak scenario: when the operator restarts during deletion and the HCP has already been deleted, the SharedVPC role ARNs (needed for cross-account AWS access) are lost. The fix preserves the finalizer, but retries will never succeed. A proper fix requires persisting the SharedVPC role ARNs in the AWSEndpointService status
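
The deletion flow described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the controller's actual code: `clientProvider`, `fakeProvider`, `awsClients`, and `reconcileDeletion` are hypothetical stand-ins, and HostedControlPlanes are reduced to a slice of names.

```go
package main

import (
	"errors"
	"fmt"
)

// awsClients is a hypothetical stand-in for the EC2 and Route53 clients.
type awsClients struct{ region string }

// clientProvider is a simplified, assumed shape of the awsClientProvider interface.
type clientProvider interface {
	initializeWithHCP(hcpName string)
	getClients() (*awsClients, error)
}

// fakeProvider mimics a builder that is uninitialized after a restart.
type fakeProvider struct{ initialized bool }

func (p *fakeProvider) initializeWithHCP(hcpName string) { p.initialized = true }

func (p *fakeProvider) getClients() (*awsClients, error) {
	if !p.initialized {
		return nil, errors.New("clients not initialized")
	}
	return &awsClients{region: "example-region"}, nil
}

// reconcileDeletion reports removeFinalizer=true only after cleanup has run.
func reconcileDeletion(p clientProvider, hcpsInNamespace []string, cleanup func(*awsClients) error) (bool, error) {
	// Best-effort initialization: only when exactly one HCP is found.
	if len(hcpsInNamespace) == 1 {
		p.initializeWithHCP(hcpsInNamespace[0])
	}
	clients, err := p.getClients()
	if err != nil {
		// Returning the error keeps the finalizer so a later retry can clean up.
		return false, fmt.Errorf("failed to get AWS clients during deletion: %w", err)
	}
	if err := cleanup(clients); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	noop := func(*awsClients) error { return nil }

	// Restart with no HCP left: error returned, finalizer preserved.
	remove, err := reconcileDeletion(&fakeProvider{}, nil, noop)
	fmt.Println(remove, err)

	// Restart with the HCP still present: cleanup proceeds.
	remove, err = reconcileDeletion(&fakeProvider{}, []string{"hcp-a"}, noop)
	fmt.Println(remove, err)
}
```

Because the reconciler depends only on the interface in this sketch, a test can substitute a fake or generated mock without touching AWS.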

Test plan

  • Unit tests added for deletion reconciliation (TestReconcileDeletion): successful cleanup, empty status, VPC endpoint failure, DependencyViolation requeue
  • Unit test for best-effort HCP initialization during deletion: verifies initializeWithHCP is called when the HCP exists in the namespace
  • Unit test reproducing the controller-restart bug (TestReconcileDeletion_AfterControllerRestart): verifies error is returned and finalizer is preserved when no HCP exists
  • Unit tests for deleteSecurityGroup covering all DependencyViolation paths (ingress, egress, delete), SG not found, empty describe results, no ingress/egress rules, other AWS errors
  • Unit tests documenting the SharedVPC leak scenario (TestReconcileDeletionSharedVPC): uninitialized client after restart, initialized client without role ARNs
  • All new test cases pass
  • Package builds successfully
  • Verify in a real cluster that orphaned security groups are eventually cleaned up on VPC endpoint deletion

Fixes: https://issues.redhat.com/browse/OCPBUGS-74960

🤖 Generated with Claude Code

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai
Contributor

coderabbitai bot commented Mar 5, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 6d2cd457-1206-47c8-be90-192e5a8fe95a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Introduces an awsClientProvider interface and a clientBuilder implementation for AWSEndpointServiceReconciler, swapping the reconciler's direct client builder field for the interface and adding a go:generate mock directive. Reconcile deletion now requires obtaining AWS clients (initializing with HCP for SharedVPC), converts AWS DependencyViolation errors into a sentinel (errDependencyViolation) and treats them as retryable to requeue deletion, and returns errors when clients cannot be obtained during deletion. Adds a new exported awsutil DependencyViolation error code and extensive unit tests covering deletion, controller-restart, SharedVPC scenarios, and dependency-aware security-group cleanup.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Controller & AWS client abstraction: control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go | Adds awsClientProvider interface, clientBuilder concrete type implementing it, replaces reconciler field type with the interface, and adds //go:generate mockgen directive and a verification line asserting clientBuilder implements the interface. |
| Deletion/control-flow & error handling: control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go | During reconcile deletion, initializes provider from HCP when applicable, requires AWS clients (EC2/Route53) and returns errors if unavailable, and adjusts deletion flow to obtain clients and proceed with cleanup deterministically. |
| DependencyViolation retry handling: control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go & support/awsutil/errorcode.go | Adds exported DependencyViolation error code and an errDependencyViolation sentinel; deleteSecurityGroup converts AWS DependencyViolation responses from Describe/Revoke/Delete operations into the sentinel so reconciler can requeue rather than fail immediately. |
| Unit tests & mocks: control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go | Adds extensive tests (TestReconcileDeletion, TestReconcileDeletion_AfterControllerRestart, TestDeleteSecurityGroup, TestReconcileDeletionSharedVPC) using gomock-generated AWS service mocks and controller-runtime fake client to exercise deletion flows, dependency violation retries, finalizer semantics, and SharedVPC initialization paths. |

Sequence Diagram(s)

sequenceDiagram
  participant Controller as Reconciler
  participant Provider as awsClientProvider
  participant EC2 as AWS EC2 API
  participant R53 as AWS Route53 API
  participant K8s as Kubernetes API

  Controller->>Provider: initializeWithHCP(log, hcp)   %% optional for SharedVPC
  Controller->>Provider: getClients(ctx)
  Provider-->>Controller: return EC2 and Route53 clients
  Controller->>EC2: DescribeSecurityGroups / DescribeVpcEndpoints
  alt resources exist
    Controller->>EC2: RevokeSecurityGroupIngress/Egress
    EC2-->>Controller: Success or DependencyViolation
    alt DependencyViolation
      Controller->>K8s: record errDependencyViolation -> requeue (retry)
    else Success
      Controller->>EC2: DeleteSecurityGroup / DeleteVpcEndpoints
      Controller->>R53: ChangeResourceRecordSets (cleanup DNS)
      Controller->>K8s: remove finalizer, update resource
    end
  else resources missing
    Controller->>K8s: remove finalizer, update resource
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Stable And Deterministic Test Names | ❌ Error | Test file contains dynamic test names in TestDiffPermissions function using loop indices instead of descriptive static strings. | Replace dynamic test name generation with descriptive static names by adding a name field to test struct and using t.Run(test.name, ...). |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ❓ Inconclusive | The test file awsprivatelink_controller_test.go could not be located or accessed in the repository despite extensive search attempts. | Verify the test file is committed and accessible in the repository, or provide the test file content directly for assessment. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically identifies the two main changes: preventing resource leak on deletion and handling DependencyViolation errors, both directly supported by the changeset. |
Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from jparrill and muraee March 5, 2026 20:52
@openshift-ci openshift-ci bot added the area/control-plane-operator and area/platform/aws labels and removed the do-not-merge/needs-area label Mar 5, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go (1)

1092-1124: ⚠️ Potential issue | 🟠 Major

DependencyViolation path currently bypasses the fixed-delay delete retry flow.

These branches return an error immediately, so the caller exits through the error path instead of the completed=false retry path that applies endpointServiceDeletionRequeueDuration. If the goal is to use the explicit 5s delete retry behavior, return a distinguishable retriable condition and translate it to completed=false, err=nil in delete(...).

Suggested approach
+var errDependencyViolation = errors.New("security group dependency violation")

 func (r *AWSEndpointServiceReconciler) deleteSecurityGroup(ctx context.Context, ec2Client ec2iface.EC2API, sgID string) error {
   ...
-      return fmt.Errorf("security group has dependencies, VPC endpoint deletion may still be in progress")
+      return errDependencyViolation
   ...
-      return fmt.Errorf("security group has dependencies, VPC endpoint deletion may still be in progress")
+      return errDependencyViolation
   ...
-    return fmt.Errorf("security group has dependencies, VPC endpoint deletion may still be in progress")
+    return errDependencyViolation
   ...
 }

 func (r *AWSEndpointServiceReconciler) delete(ctx context.Context, awsEndpointService *hyperv1.AWSEndpointService, ec2Client ec2iface.EC2API, route53Client awsapi.ROUTE53API) (bool, error) {
   ...
   if awsEndpointService.Status.SecurityGroupID != "" {
     if err := r.deleteSecurityGroup(ctx, ec2Client, awsEndpointService.Status.SecurityGroupID); err != nil {
+      if errors.Is(err, errDependencyViolation) {
+        return false, nil
+      }
       return false, err
     }
   }
   ...
 }
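
The sentinel-to-requeue translation suggested above can be tried in isolation with a sketch like the one below. It is only an illustration under stated assumptions: the AWS call is replaced by a stub that takes an error code string, `deleteResources` and `requeueAfter` are hypothetical helper names, and the 5s delay is taken from the PR summary rather than from the real `endpointServiceDeletionRequeueDuration` constant.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDependencyViolation mirrors the sentinel named in the suggestion.
var errDependencyViolation = errors.New("security group dependency violation")

// requeueDelay is an assumed value; the PR summary mentions a 5s delay.
const requeueDelay = 5 * time.Second

// deleteSecurityGroup is a stub: awsCode is the error code AWS would return.
func deleteSecurityGroup(awsCode string) error {
	switch awsCode {
	case "":
		return nil
	case "DependencyViolation":
		// Wrapping with %w keeps the sentinel detectable via errors.Is.
		return fmt.Errorf("deleting security group: %w", errDependencyViolation)
	default:
		return fmt.Errorf("deleting security group: AWS error %s", awsCode)
	}
}

// deleteResources reports completed=false, err=nil for the retryable case.
func deleteResources(awsCode string) (bool, error) {
	if err := deleteSecurityGroup(awsCode); err != nil {
		if errors.Is(err, errDependencyViolation) {
			return false, nil
		}
		return false, err
	}
	return true, nil
}

// requeueAfter converts the delete result into a fixed retry delay.
func requeueAfter(completed bool, err error) (time.Duration, error) {
	if err != nil {
		return 0, err
	}
	if !completed {
		return requeueDelay, nil
	}
	return 0, nil
}

func main() {
	for _, code := range []string{"", "DependencyViolation", "UnauthorizedOperation"} {
		delay, err := requeueAfter(deleteResources(code))
		fmt.Printf("code=%q requeueAfter=%s err=%v\n", code, delay, err)
	}
}
```

Only the DependencyViolation case yields a non-zero delay with a nil error; any other AWS error still surfaces as an error to the caller.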
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go`
around lines 1092 - 1124, The DependencyViolation branches in the AWSErrorCode
checks for RevokeSecurityGroupIngressWithContext,
RevokeSecurityGroupEgressWithContext and DeleteSecurityGroupWithContext
currently return an error immediately, bypassing the caller's fixed-delay retry
path; change those branches to return a distinguishable retriable signal (for
example a sentinel error like errDependencyViolationRequeue) instead of
fmt.Errorf, and update the outer delete(...) caller to detect that sentinel and
translate it to completed=false, err=nil so the existing
endpointServiceDeletionRequeueDuration requeue behavior is used; references:
supportawsutil.AWSErrorCode, RevokeSecurityGroupIngressWithContext,
RevokeSecurityGroupEgressWithContext, DeleteSecurityGroupWithContext,
delete(...), endpointServiceDeletionRequeueDuration.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b6b8cec0-0519-41dc-b428-8ed29f1d3eb9

📥 Commits

Reviewing files that changed from the base of the PR and between 687e33e and 9e41174.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go

Comment thread control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go Outdated
@sdminonne sdminonne marked this pull request as draft March 6, 2026 06:33
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 6, 2026
@sdminonne sdminonne marked this pull request as ready for review March 6, 2026 10:51
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 6, 2026
@openshift-ci openshift-ci bot requested review from devguyio and enxebre March 6, 2026 10:51
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go`:
- Around line 419-434: The deletion path calls awsClientBuilder.getClients in
the reconcile delete branch before running the non-deletion initializer
(initializeWithHCP), which can cause perpetual “clients not initialized”
failures after a restart; modify the reconcile delete logic to ensure clients
are initialized for deletion: detect the deletion-path and call the same
initialization code used by initializeWithHCP (or a new small helper that sets
up awsClientBuilder/clients) before invoking getClients and delete, or make
getClients resilient by lazily initializing clients when missing so delete(ctx,
awsEndpointService, ec2Client, route53Client) always receives valid clients;
update getClients, initializeWithHCP, or the delete-path prelude accordingly
(references: awsClientBuilder.getClients, initializeWithHCP, delete,
awsEndpointService).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f78b3d42-dbdb-4b71-a7ba-b6d7d9274cbe

📥 Commits

Reviewing files that changed from the base of the PR and between 9e41174 and d487dbd.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go

Comment thread control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go Outdated
@sdminonne sdminonne changed the title from "fix(OCPBUGS-74960): handle DependencyViolation when deleting security groups" to "fix(OCPBUGS-74960): prevent resource leak on deletion and handle DependencyViolation" Mar 6, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go (1)

431-434: Consider logging skipped deletion initialization paths for diagnosability.

The best-effort HCP list currently swallows list/multiplicity outcomes. Adding logs here would make “clients not initialized” failures much easier to triage.

💡 Suggested non-behavioral improvement
 hcpList := &hyperv1.HostedControlPlaneList{}
-if err := r.List(ctx, hcpList, &client.ListOptions{Namespace: req.Namespace}); err == nil && len(hcpList.Items) == 1 {
-	r.awsClientBuilder.initializeWithHCP(log, &hcpList.Items[0])
-}
+if err := r.List(ctx, hcpList, &client.ListOptions{Namespace: req.Namespace}); err != nil {
+	log.Error(err, "failed to list HostedControlPlanes for deletion initialization")
+} else if len(hcpList.Items) == 1 {
+	r.awsClientBuilder.initializeWithHCP(log, &hcpList.Items[0])
+} else if len(hcpList.Items) > 1 {
+	log.Info("skipping deletion initialization; unexpected HostedControlPlane count", "count", len(hcpList.Items))
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go`
around lines 431 - 434, The HCP listing path currently swallows errors and non-1
results; update the block around hcpList/HostedControlPlaneList so it logs why
initialization was skipped: log an error when r.List(ctx, hcpList,
&client.ListOptions{Namespace: req.Namespace}) returns an error and log a
debug/info message when the list returns 0 or >1 items before calling
r.awsClientBuilder.initializeWithHCP(log, &hcpList.Items[0]); include
req.Namespace and the observed length (or error) in the log to aid triage.
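
One way to make the skipped-initialization cases visible is to separate the decision from the logging, as in the sketch below. The helper `selectHCP` is hypothetical and HostedControlPlanes are reduced to a slice of names; the real code would work on `hyperv1.HostedControlPlaneList` and the controller's logger.

```go
package main

import (
	"errors"
	"fmt"
)

// errListFailed is an example list error used only for demonstration.
var errListFailed = errors.New("api unavailable")

// selectHCP returns the index of the HCP to initialize from, or -1 together
// with a human-readable reason explaining why initialization was skipped.
func selectHCP(namespace string, names []string, listErr error) (int, string) {
	switch {
	case listErr != nil:
		return -1, fmt.Sprintf("failed to list HostedControlPlanes in %s: %v", namespace, listErr)
	case len(names) == 0:
		return -1, fmt.Sprintf("no HostedControlPlane found in %s", namespace)
	case len(names) > 1:
		return -1, fmt.Sprintf("unexpected HostedControlPlane count %d in %s", len(names), namespace)
	default:
		return 0, ""
	}
}

func main() {
	cases := []struct {
		names []string
		err   error
	}{
		{names: []string{"hcp-a"}},
		{names: nil},
		{names: []string{"hcp-a", "hcp-b"}},
		{err: errListFailed},
	}
	for _, c := range cases {
		idx, reason := selectHCP("clusters-example", c.names, c.err)
		if idx < 0 {
			// In the controller this would be a log.Info or log.Error call.
			fmt.Println("skipping deletion initialization:", reason)
			continue
		}
		fmt.Println("initializing from", c.names[idx])
	}
}
```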
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go (1)

317-318: expectRequeue branch appears unused in this table.

Either add at least one case that asserts RequeueAfter, or drop the field/check to keep the test intent tighter.

Also applies to: 518-520

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go`
around lines 317 - 318, The table-driven test declares an unused expectRequeue
field; either add a test case that uses it or remove the field and related
assertion. Fix by updating the test cases in awsprivatelink_controller_test.go
to include at least one scenario where expectRequeue is true and assert the
reconcile result's RequeueAfter is > 0 (reference the expectRequeue field and
the reconcile result variable used in the test), or delete the expectRequeue
field and any RequeueAfter assertion to keep the table focused on used
expectations.
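
For illustration, a table that actually exercises an `expectRequeue` field might look like the sketch below. The `reconcile` stub, the `result` type, and the 5s delay are assumptions made for the example; the real test would call the reconciler and inspect the controller-runtime result.

```go
package main

import (
	"fmt"
	"time"
)

// result is a hypothetical stand-in for the reconcile result.
type result struct{ RequeueAfter time.Duration }

// reconcile is a stub: a dependency violation leads to a delayed requeue.
func reconcile(dependencyViolation bool) result {
	if dependencyViolation {
		return result{RequeueAfter: 5 * time.Second}
	}
	return result{}
}

type testCase struct {
	name                string
	dependencyViolation bool
	expectRequeue       bool
}

// runCases returns the names of cases whose requeue expectation was not met.
func runCases(cases []testCase) []string {
	var failures []string
	for _, tc := range cases {
		got := reconcile(tc.dependencyViolation)
		if (got.RequeueAfter > 0) != tc.expectRequeue {
			failures = append(failures, tc.name)
		}
	}
	return failures
}

func main() {
	cases := []testCase{
		{name: "successful cleanup", dependencyViolation: false, expectRequeue: false},
		{name: "dependency violation requeues", dependencyViolation: true, expectRequeue: true},
	}
	fmt.Println("failed cases:", runCases(cases))
}
```

With at least one case setting `expectRequeue: true`, the field and its assertion both carry weight in the table.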
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go`:
- Around line 855-867: The test creates a real clientBuilder which calls
clientBuilder.getClients() (awsprivatelink_controller.go:241) and can attempt to
create real AWS sessions; instead, create and use a mocked awsClientProvider via
setupMocks(gomock.Controller) and set an expectation that getClients() returns a
deterministic error for this test case (the case with clientInitialized=true and
empty assumeSharedVPCRole ARNs) so the deletion path fails hermetically; replace
the real clientBuilder instantiation in the test with the mock provider and
ensure the controller under test receives that mock.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a5beba26-78d6-49b6-812c-1c845d5f5603

📥 Commits

Reviewing files that changed from the base of the PR and between d487dbd and 33e897e.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go

@sdminonne
Contributor Author

/cc @jparrill

@sdminonne
Contributor Author

/assign @jparrill

Contributor

@jparrill jparrill left a comment

Thanks for the PR! Dropped a couple of comments.

Comment thread control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go Outdated
Comment thread control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go Outdated
@openshift-ci openshift-ci bot added the area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release label Mar 10, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go`:
- Around line 203-216: The controller currently reuses a single mutable
clientBuilder (awsClientProvider) across concurrent reconciles which causes
per-HCP state (role ARNs, hosted-zone ID) to leak between reconciles; instead
instantiate a fresh awsClientProvider for each Reconcile invocation and remove
shared mutable fields from the controller struct so state is not shared.
Concretely, stop storing clientBuilder on the controller struct and create a new
builder inside Reconcile before calling
initializeWithHCP/getClients/getLocalHostedZoneID/setLocalHostedZoneID (or add a
NewClientBuilder(...) factory and call it in each reconcile path including
delete handling); ensure getClients uses the per-reconcile provider instance so
hosted-zone and role ARN state cannot be overwritten by concurrent reconciles.
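
The per-reconcile isolation described in this comment can be sketched as follows. This is an assumption-laden illustration: `provider`, `newProvider`, and the string-based role ARN are hypothetical simplifications, and the real builder holds more state (for example the hosted zone ID).

```go
package main

import (
	"fmt"
	"sync"
)

// provider is a hypothetical, simplified client builder holding per-HCP state.
type provider struct{ roleARN string }

// newProvider is an assumed factory; each reconcile gets its own instance.
func newProvider() *provider { return &provider{} }

func (p *provider) initializeWithHCP(roleARN string) { p.roleARN = roleARN }

// reconcile builds a fresh provider, so no state is shared between calls.
func reconcile(hcpRoleARN string) string {
	p := newProvider()
	p.initializeWithHCP(hcpRoleARN)
	return p.roleARN
}

// reconcileAll runs one reconcile per ARN concurrently and collects the ARN
// each reconcile ended up using, indexed by input position.
func reconcileAll(arns []string) []string {
	out := make([]string, len(arns))
	var wg sync.WaitGroup
	for i, arn := range arns {
		wg.Add(1)
		go func(i int, arn string) {
			defer wg.Done()
			out[i] = reconcile(arn) // each goroutine writes a distinct element
		}(i, arn)
	}
	wg.Wait()
	return out
}

func main() {
	arns := []string{"arn:example:role/a", "arn:example:role/b", "arn:example:role/c"}
	for i, used := range reconcileAll(arns) {
		fmt.Printf("reconcile %d expected %s used %s\n", i, arns[i], used)
	}
}
```

Because nothing mutable lives on a shared struct in this sketch, concurrent reconciles cannot overwrite each other's role ARNs; the trade-off is that any state worth caching across reconciles would need an explicit, synchronized home.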

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a6604b92-e71e-4a92-a68a-377af583f2fd

📥 Commits

Reviewing files that changed from the base of the PR and between 33e897e and dd7db15.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go

@sdminonne
Contributor Author

/test e2e-azure-self-managed

@sdminonne sdminonne changed the title from "fix(OCPBUGS-74960): prevent resource leak on deletion and handle DependencyViolation" to "OCPBUGS-74960: prevent resource leak on deletion and handle DependencyViolation" Mar 11, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 11, 2026
…eletion

The AWSEndpointService deletion path silently skipped AWS resource
cleanup when getClients failed (e.g. after a controller restart with an
uninitialized clientBuilder). This caused security groups, VPC endpoints,
and DNS records to be orphaned because the finalizer was still removed.

Changes:
- Return an error instead of silently skipping cleanup when AWS clients
  cannot be initialized, preserving the finalizer for retry.
- Attempt best-effort client initialization during deletion by listing
  HostedControlPlanes in the namespace, so restarts can recover when
  the HCP still exists.
- Extract awsClientProvider interface from clientBuilder to enable
  unit testing the reconciler with mock AWS clients.
- Add errDependencyViolation sentinel error so that DependencyViolation
  from security group operations triggers a controlled requeue (with
  RequeueAfter) instead of an error-driven exponential backoff.
- Replace duplicate log.Error calls in deleteSecurityGroup with proper
  error wrapping using %w.
- Add DependencyViolation constant to support/awsutil/errorcode.go.
- Add comprehensive unit tests for deletion, restart recovery,
  DependencyViolation handling, and SharedVPC leak scenarios.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci openshift-ci bot removed the lgtm and needs-rebase labels Mar 27, 2026
@sdminonne
Contributor Author

/test e2e-azure-self-managed

@sdminonne
Contributor Author

/test e2e-aws

@sdminonne
Contributor Author

sdminonne commented Mar 29, 2026

About ci/prow/e2e-azure-self-managed failure
Error: timed out waiting for the condition on hostedclusters/2e64dec42e-mgmt

Summary: The e2e-azure-self-managed test failed during the pre-test infrastructure setup
phase (create-management-cluster step), NOT during actual e2e test execution. The Azure
management HostedCluster "2e64dec42e-mgmt" was created with all Azure infrastructure
(resource groups, VNet, NSG, private DNS zone, load balancer) provisioned successfully,
but the HostedCluster never reached the "Available" condition within the 30-minute timeout
(11:03:51 to 11:33:51 UTC). No e2e tests were ever executed. The failure analyzer step
confirmed: "No test failures detected — skipping analysis."

This failure is UNRELATED to the PR changes. PR #7868 (OCPBUGS-74960) only modifies
AWS PrivateLink controller files:
- hypershift-operator/controllers/awsprivatelink/awsprivatelink_controller.go
- hypershift-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
- support/awsutil/errorcode.go
These are exclusively AWS-specific changes and cannot affect Azure cluster provisioning.
This is a CI infrastructure / Azure platform flake where the management HostedCluster
failed to become ready in time.

@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-74960, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @zhfeng

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from zhfeng March 29, 2026 17:48
@sdminonne
Contributor Author

/test e2e-aks

@enxebre
Member

enxebre commented Mar 30, 2026

/lgtm

@jparrill
Contributor

/lgtm

@jparrill
Contributor

/retest-required

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 30, 2026
@openshift-ci-robot

Second-stage tests were triggered manually. The pipeline can only be controlled manually until HEAD changes. Use the command to trigger the second stage.

@zhfeng
Contributor

zhfeng commented Apr 2, 2026

/retest

zhfeng added a commit to zhfeng/release that referenced this pull request Apr 2, 2026
Add CONTROL_PLANE_OPERATOR_IMAGE env var to hypershift-aws-run-e2e-external
step to allow overriding the control plane operator image via
--e2e.control-plane-operator-image flag.

Add periodic job e2e-aws-private-sg-cleanup that runs TestCreateClusterPrivate
with a Konflux-built CPO image from openshift/hypershift#7868 to verify the
OCPBUGS-74960 fix: security groups for VPC endpoints are properly cleaned up
during cluster deletion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@zhfeng
Contributor

zhfeng commented Apr 2, 2026

CI Verification Result — SG Cleanup with Latest Commit

Ran the e2e-aws-private-sg-cleanup rehearsal job on openshift/release PR #76891 using the CPO image built from this PR's latest commit (08f463d).

Results

| Step | Result |
|------|--------|
| ipi-install-rbac | SUCCESS |
| create-management-cluster | SUCCESS |
| hypershift-install | SUCCESS |
| hypershift-aws-sg-baseline | SUCCESS |
| hypershift-aws-run-e2e-external | SUCCESS |
| hypershift-aws-verify-sg-cleanup | SUCCESS |
| dump-management-cluster | SUCCESS |
| destroy-management-cluster | SUCCESS |

SG Verify Output

  • Baseline: 3 vpce-private-router SGs before test
  • After test: 2 vpce-private-router SGs (no new SGs — our HC's SG was properly cleaned up)
  • Orphaned: 0
  • Result: ALL CHECKS PASSED
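
The baseline/after comparison reported here can be thought of as a set difference: a security group counts as orphaned if it exists after the test but was not in the baseline. The sketch below illustrates that idea only; the `orphaned` helper and the SG IDs are hypothetical, and the actual hypershift-aws-verify-sg-cleanup step may be implemented differently.

```go
package main

import (
	"fmt"
	"sort"
)

// orphaned returns the IDs present in after but absent from baseline,
// sorted for stable output.
func orphaned(baseline, after []string) []string {
	seen := make(map[string]struct{}, len(baseline))
	for _, id := range baseline {
		seen[id] = struct{}{}
	}
	var out []string
	for _, id := range after {
		if _, ok := seen[id]; !ok {
			out = append(out, id)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Placeholder IDs mirroring the counts above: 3 before, 2 after.
	baseline := []string{"sg-a", "sg-b", "sg-c"}
	after := []string{"sg-a", "sg-b"}
	fmt.Println("orphaned:", len(orphaned(baseline, after)))
}
```

A drop in the total count (3 to 2) does not affect the result, because only groups that are new relative to the baseline are treated as leaks.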

CPO Image Override Confirmed

--e2e.control-plane-operator-image=quay.io/redhat-user-workloads/crt-redhat-acm-tenant/control-plane-operator-main:on-pr-08f463d5e50e589939fec80c3efe1bbdba377230

Prow Job

View Job

The TestCreateClusterPrivate e2e test passed and the security group was properly cleaned up during HostedCluster teardown, confirming the fix works as expected.

@zhfeng
Contributor

zhfeng commented Apr 2, 2026

/verified by @zhfeng

Prow Job: rehearse-76891-periodic-ci-openshift-hypershift-release-4.23-periodics-e2e-aws-private-sg-cleanup

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 2, 2026
@openshift-ci-robot

@zhfeng: This PR has been marked as verified by @zhfeng.

@bryan-cox
Member

/pipeline required

@openshift-ci-robot

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@sdminonne
Contributor Author

/test e2e-aks

@codecov

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 83.87097% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 26.69%. Comparing base (c503233) to head (08f463d).
⚠️ Report is 114 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7868      +/-   ##
==========================================
+ Coverage   26.56%   26.69%   +0.13%     
==========================================
  Files        1087     1087              
  Lines      105041   105052      +11     
==========================================
+ Hits        27901    28047     +146     
+ Misses      74731    74580     -151     
- Partials     2409     2425      +16     

@openshift-ci
Contributor

openshift-ci bot commented Apr 2, 2026

@sdminonne: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 51af991 into openshift:main Apr 2, 2026
29 checks passed
@openshift-ci-robot

@sdminonne: Jira Issue OCPBUGS-74960 is in an unrecognized state (Verified) and will not be moved to the MODIFIED state.

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • area/platform/aws: PR/issue for AWS (AWSPlatform) platform
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria

8 participants