OCPBUGS-74960: prevent resource leak on deletion and handle DependencyViolation#7868
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Important Review skippedAuto reviews are limited based on label configuration. 🚫 Review skipped — only excluded labels are configured. (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughIntroduces an awsClientProvider interface and a clientBuilder implementation for AWSEndpointServiceReconciler, swapping the reconciler's direct client builder field for the interface and adding a go:generate mock directive. Reconcile deletion now requires obtaining AWS clients (initializing with HCP for SharedVPC), converts AWS DependencyViolation errors into a sentinel (errDependencyViolation) and treats them as retryable to requeue deletion, and returns errors when clients cannot be obtained during deletion. Adds a new exported awsutil DependencyViolation error code and extensive unit tests covering deletion, controller-restart, SharedVPC scenarios, and dependency-aware security-group cleanup. Changes
Sequence Diagram(s)sequenceDiagram
participant Controller as Reconciler
participant Provider as awsClientProvider
participant EC2 as AWS EC2 API
participant R53 as AWS Route53 API
participant K8s as Kubernetes API
Controller->>Provider: initializeWithHCP(log, hcp) %% optional for SharedVPC
Controller->>Provider: getClients(ctx)
Provider-->>EC2: return EC2 client
Provider-->>R53: return Route53 client
Controller->>EC2: DescribeSecurityGroups / DescribeVpcEndpoints
alt resources exist
Controller->>EC2: RevokeSecurityGroupIngress/Egress
EC2-->>Controller: Success or DependencyViolation
alt DependencyViolation
Controller->>K8s: record errDependencyViolation -> requeue (retry)
else Success
Controller->>EC2: DeleteSecurityGroup / DeleteVpcEndpoints
Controller->>R53: ChangeResourceRecordSets (cleanup DNS)
Controller->>K8s: remove finalizer, update resource
end
else resources missing
Controller->>K8s: remove finalizer, update resource
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go (1)
1092-1124:⚠️ Potential issue | 🟠 MajorDependencyViolation path currently bypasses the fixed-delay delete retry flow.
These branches return an error immediately, so the caller exits through the error path instead of the
completed=falseretry path that appliesendpointServiceDeletionRequeueDuration. If the goal is to use the explicit 5s delete retry behavior, return a distinguishable retriable condition and translate it tocompleted=false, err=nilindelete(...).Suggested approach
+var errDependencyViolation = errors.New("security group dependency violation") func (r *AWSEndpointServiceReconciler) deleteSecurityGroup(ctx context.Context, ec2Client ec2iface.EC2API, sgID string) error { ... - return fmt.Errorf("security group has dependencies, VPC endpoint deletion may still be in progress") + return errDependencyViolation ... - return fmt.Errorf("security group has dependencies, VPC endpoint deletion may still be in progress") + return errDependencyViolation ... - return fmt.Errorf("security group has dependencies, VPC endpoint deletion may still be in progress") + return errDependencyViolation ... } func (r *AWSEndpointServiceReconciler) delete(ctx context.Context, awsEndpointService *hyperv1.AWSEndpointService, ec2Client ec2iface.EC2API, route53Client awsapi.ROUTE53API) (bool, error) { ... if awsEndpointService.Status.SecurityGroupID != "" { if err := r.deleteSecurityGroup(ctx, ec2Client, awsEndpointService.Status.SecurityGroupID); err != nil { + if errors.Is(err, errDependencyViolation) { + return false, nil + } return false, err } } ... }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go` around lines 1092 - 1124, The DependencyViolation branches in the AWSErrorCode checks for RevokeSecurityGroupIngressWithContext, RevokeSecurityGroupEgressWithContext and DeleteSecurityGroupWithContext currently return an error immediately, bypassing the caller's fixed-delay retry path; change those branches to return a distinguishable retriable signal (for example a sentinel error like errDependencyViolationRequeue) instead of fmt.Errorf, and update the outer delete(...) caller to detect that sentinel and translate it to completed=false, err=nil so the existing endpointServiceDeletionRequeueDuration requeue behavior is used; references: supportawsutil.AWSErrorCode, RevokeSecurityGroupIngressWithContext, RevokeSecurityGroupEgressWithContext, DeleteSecurityGroupWithContext, delete(...), endpointServiceDeletionRequeueDuration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go`:
- Around line 1092-1124: The DependencyViolation branches in the AWSErrorCode
checks for RevokeSecurityGroupIngressWithContext,
RevokeSecurityGroupEgressWithContext and DeleteSecurityGroupWithContext
currently return an error immediately, bypassing the caller's fixed-delay retry
path; change those branches to return a distinguishable retriable signal (for
example a sentinel error like errDependencyViolationRequeue) instead of
fmt.Errorf, and update the outer delete(...) caller to detect that sentinel and
translate it to completed=false, err=nil so the existing
endpointServiceDeletionRequeueDuration requeue behavior is used; references:
supportawsutil.AWSErrorCode, RevokeSecurityGroupIngressWithContext,
RevokeSecurityGroupEgressWithContext, DeleteSecurityGroupWithContext,
delete(...), endpointServiceDeletionRequeueDuration.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b6b8cec0-0519-41dc-b428-8ed29f1d3eb9
📒 Files selected for processing (2)
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.gocontrol-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go`:
- Around line 419-434: The deletion path calls awsClientBuilder.getClients in
the reconcile delete branch before running the non-deletion initializer
(initializeWithHCP), which can cause perpetual “clients not initialized”
failures after a restart; modify the reconcile delete logic to ensure clients
are initialized for deletion: detect the deletion-path and call the same
initialization code used by initializeWithHCP (or a new small helper that sets
up awsClientBuilder/clients) before invoking getClients and delete, or make
getClients resilient by lazily initializing clients when missing so delete(ctx,
awsEndpointService, ec2Client, route53Client) always receives valid clients;
update getClients, initializeWithHCP, or the delete-path prelude accordingly
(references: awsClientBuilder.getClients, initializeWithHCP, delete,
awsEndpointService).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f78b3d42-dbdb-4b71-a7ba-b6d7d9274cbe
📒 Files selected for processing (2)
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.gocontrol-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go (1)
431-434: Consider logging skipped deletion initialization paths for diagnosability.The best-effort HCP list currently swallows list/multiplicity outcomes. Adding logs here would make “clients not initialized” failures much easier to triage.
💡 Suggested non-behavioral improvement
hcpList := &hyperv1.HostedControlPlaneList{} -if err := r.List(ctx, hcpList, &client.ListOptions{Namespace: req.Namespace}); err == nil && len(hcpList.Items) == 1 { - r.awsClientBuilder.initializeWithHCP(log, &hcpList.Items[0]) -} +if err := r.List(ctx, hcpList, &client.ListOptions{Namespace: req.Namespace}); err != nil { + log.Error(err, "failed to list HostedControlPlanes for deletion initialization") +} else if len(hcpList.Items) == 1 { + r.awsClientBuilder.initializeWithHCP(log, &hcpList.Items[0]) +} else if len(hcpList.Items) > 1 { + log.Info("skipping deletion initialization; unexpected HostedControlPlane count", "count", len(hcpList.Items)) +}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go` around lines 431 - 434, The HCP listing path currently swallows errors and non-1 results; update the block around hcpList/HostedControlPlaneList so it logs why initialization was skipped: log an error when r.List(ctx, hcpList, &client.ListOptions{Namespace: req.Namespace}) returns an error and log a debug/info message when the list returns 0 or >1 items before calling r.awsClientBuilder.initializeWithHCP(log, &hcpList.Items[0]); include req.Namespace and the observed length (or error) in the log to aid triage.control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go (1)
317-318:expectRequeuebranch appears unused in this table.Either add at least one case that asserts
RequeueAfter, or drop the field/check to keep the test intent tighter.Also applies to: 518-520
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go` around lines 317 - 318, The table-driven test declares an unused expectRequeue field; either add a test case that uses it or remove the field and related assertion. Fix by updating the test cases in awsprivatelink_controller_test.go to include at least one scenario where expectRequeue is true and assert the reconcile result's RequeueAfter is > 0 (reference the expectRequeue field and the reconcile result variable used in the test), or delete the expectRequeue field and any RequeueAfter assertion to keep the table focused on used expectations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go`:
- Around line 855-867: The test creates a real clientBuilder which calls
clientBuilder.getClients() (awsprivatelink_controller.go:241) and can attempt to
create real AWS sessions; instead, create and use a mocked awsClientProvider via
setupMocks(gomock.Controller) and set an expectation that getClients() returns a
deterministic error for this test case (the case with clientInitialized=true and
empty assumeSharedVPCRole ARNs) so the deletion path fails hermetically; replace
the real clientBuilder instantiation in the test with the mock provider and
ensure the controller under test receives that mock.
---
Nitpick comments:
In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go`:
- Around line 317-318: The table-driven test declares an unused expectRequeue
field; either add a test case that uses it or remove the field and related
assertion. Fix by updating the test cases in awsprivatelink_controller_test.go
to include at least one scenario where expectRequeue is true and assert the
reconcile result's RequeueAfter is > 0 (reference the expectRequeue field and
the reconcile result variable used in the test), or delete the expectRequeue
field and any RequeueAfter assertion to keep the table focused on used
expectations.
In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go`:
- Around line 431-434: The HCP listing path currently swallows errors and non-1
results; update the block around hcpList/HostedControlPlaneList so it logs why
initialization was skipped: log an error when r.List(ctx, hcpList,
&client.ListOptions{Namespace: req.Namespace}) returns an error and log a
debug/info message when the list returns 0 or >1 items before calling
r.awsClientBuilder.initializeWithHCP(log, &hcpList.Items[0]); include
req.Namespace and the observed length (or error) in the log to aid triage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a5beba26-78d6-49b6-812c-1c845d5f5603
📒 Files selected for processing (2)
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.gocontrol-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
|
/cc @jparrill |
|
/assign @jparrill |
jparrill
left a comment
There was a problem hiding this comment.
Thanks for the PR!. Dropped a couple of comments.
33e897e to
4c305f0
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go`:
- Around line 203-216: The controller currently reuses a single mutable
clientBuilder (awsClientProvider) across concurrent reconciles which causes
per-HCP state (role ARNs, hosted-zone ID) to leak between reconciles; instead
instantiate a fresh awsClientProvider for each Reconcile invocation and remove
shared mutable fields from the controller struct so state is not shared.
Concretely, stop storing clientBuilder on the controller struct and create a new
builder inside Reconcile before calling
initializeWithHCP/getClients/getLocalHostedZoneID/setLocalHostedZoneID (or add a
NewClientBuilder(...) factory and call it in each reconcile path including
delete handling); ensure getClients uses the per-reconcile provider instance so
hosted-zone and role ARN state cannot be overwritten by concurrent reconciles.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a6604b92-e71e-4a92-a68a-377af583f2fd
📒 Files selected for processing (2)
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.gocontrol-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
|
/test e2e-azure-self-managed |
…eletion The AWSEndpointService deletion path silently skipped AWS resource cleanup when getClients failed (e.g. after a controller restart with an uninitialized clientBuilder). This caused security groups, VPC endpoints, and DNS records to be orphaned because the finalizer was still removed. Changes: - Return an error instead of silently skipping cleanup when AWS clients cannot be initialized, preserving the finalizer for retry. - Attempt best-effort client initialization during deletion by listing HostedControlPlanes in the namespace, so restarts can recover when the HCP still exists. - Extract awsClientProvider interface from clientBuilder to enable unit testing the reconciler with mock AWS clients. - Add errDependencyViolation sentinel error so that DependencyViolation from security group operations triggers a controlled requeue (with RequeueAfter) instead of an error-driven exponential backoff. - Replace duplicate log.Error calls in deleteSecurityGroup with proper error wrapping using %w. - Add DependencyViolation constant to support/awsutil/errorcode.go. - Add comprehensive unit tests for deletion, restart recovery, DependencyViolation handling, and SharedVPC leak scenarios. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4ef1d5e to
08f463d
Compare
|
/test e2e-azure-self-managed |
|
/test e2e-aws |
|
About ci/prow/e2e-azure-self-managed failure Summary: The e2e-azure-self-managed test failed during the pre-test infrastructure setup This failure is UNRELATED to the PR changes. PR #7868 (OCPBUGS-74960) only modifies |
|
@sdminonne: This pull request references Jira Issue OCPBUGS-74960, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test e2e-aks |
|
/lgtm |
1 similar comment
|
/lgtm |
|
/retest-required |
|
Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage. |
|
/retest |
Add CONTROL_PLANE_OPERATOR_IMAGE env var to hypershift-aws-run-e2e-external step to allow overriding the control plane operator image via --e2e.control-plane-operator-image flag. Add periodic job e2e-aws-private-sg-cleanup that runs TestCreateClusterPrivate with a Konflux-built CPO image from openshift/hypershift#7868 to verify the OCPBUGS-74960 fix: security groups for VPC endpoints are properly cleaned up during cluster deletion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CI Verification Result — SG Cleanup with Latest CommitRan the Results
SG Verify Output
CPO Image Override ConfirmedProw JobThe |
|
@zhfeng: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/pipeline required |
|
Scheduling tests matching the |
|
/test e2e-aks |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #7868 +/- ##
==========================================
+ Coverage 26.56% 26.69% +0.13%
==========================================
Files 1087 1087
Lines 105041 105052 +11
==========================================
+ Hits 27901 28047 +146
+ Misses 74731 74580 -151
- Partials 2409 2425 +16 🚀 New features to boost your workflow:
|
|
@sdminonne: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@sdminonne: Jira Issue OCPBUGS-74960 is in an unrecognized state (Verified) and will not be moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Summary
getClientsfails during deletion (e.g., after an operator restart), the controller now returns an error instead of logging and falling through to finalizer removal, which would permanently orphan AWS resources (security groups, VPC endpoints, DNS records)HostedControlPlaneresources in the namespace. After a controller restart theclientBuilderis uninitialized; if the HCP still exists,initializeWithHCPis called so thatgetClientscan succeed and cleanup can proceedDependencyViolationerror handling to thedeleteSecurityGroupfunction — when AWS returnsDependencyViolationduring security group ingress/egress revocation or deletion, the controller returns a sentinel error that the caller translates into a controlled requeue (5s delay), allowing AWS to finish VPC endpoint cleanup before retryingawsClientProviderinterface fromclientBuilderto enable mock injection in testsTest plan
TestReconcileDeletion): successful cleanup, empty status, VPC endpoint failure, DependencyViolation requeueinitializeWithHCPis called when the HCP exists in the namespaceTestReconcileDeletion_AfterControllerRestart): verifies error is returned and finalizer is preserved when no HCP existsdeleteSecurityGroupcovering allDependencyViolationpaths (ingress, egress, delete), SG not found, empty describe results, no ingress/egress rules, other AWS errorsTestReconcileDeletionSharedVPC): uninitialized client after restart, initialized client without role ARNsFixes: https://issues.redhat.com/browse/OCPBUGS-74960
🤖 Generated with Claude Code