Fix leak managed/owned security group on Service update with BYO SG on CLB by mtulio · Pull Request #1209 · kubernetes/cloud-provider-aws

mtulio · 2025-07-15T17:23:51Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes a security group (SG) leak that occurs when a Service type-loadBalancer (CLB) is updated to add the BYO SG annotation (service.beta.kubernetes.io/aws-load-balancer-security-groups). The update replaces all SGs attached to the Load Balancer without removing the linked rules and without deleting the managed SG (created by the controller).
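As a rough illustration of the cleanup decision described above (not the code in this PR), the groups to remove can be thought of as those currently attached to the load balancer, absent from the BYO annotation, and tagged as owned by the cluster. The types and function names below are hypothetical, and no AWS API is called; only the `kubernetes.io/cluster/<name>` = `owned` tag convention comes from the logs in this thread.

```go
package main

import (
	"fmt"
	"sort"
)

// securityGroup is a hypothetical, simplified view of an EC2 security group.
type securityGroup struct {
	ID   string
	Tags map[string]string
}

// isOwnedByCluster reports whether the group carries the
// kubernetes.io/cluster/<clusterName>=owned tag.
func isOwnedByCluster(sg securityGroup, clusterName string) bool {
	return sg.Tags["kubernetes.io/cluster/"+clusterName] == "owned"
}

// managedGroupsToRemove returns the IDs of groups that are attached now,
// are not in the desired (BYO) set, and are owned by the cluster.
// Groups that are not owned are never returned, so user groups are left alone.
func managedGroupsToRemove(attached []securityGroup, desired []string, clusterName string) []string {
	want := make(map[string]struct{}, len(desired))
	for _, id := range desired {
		want[id] = struct{}{}
	}
	var out []string
	for _, sg := range attached {
		if _, keep := want[sg.ID]; keep {
			continue
		}
		if isOwnedByCluster(sg, clusterName) {
			out = append(out, sg.ID)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	attached := []securityGroup{
		{ID: "sg-managed", Tags: map[string]string{"kubernetes.io/cluster/test-cluster": "owned"}},
	}
	// The Service now requests only the user-provided group.
	fmt.Println(managedGroupsToRemove(attached, []string{"sg-byo"}, "test-cluster"))
}
```

Filtering on the ownership tag is the design point of the sketch: a group the user attached, or one shared with other resources, should not be deleted just because it dropped out of the desired set.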

Which issue(s) this PR fixes:

Fixes #1208

Special notes for your reviewer:

We decided to create isolated, dedicated methods to discover and remove the rules linked to the SG, aiming to:

enhance code maintenance
enhance unit tests
allow reusing the logic when NLB with SG is supported (future)
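To make the "discover linked rules" step concrete, here is a minimal sketch under stated assumptions. EC2 refuses to delete a security group while a rule in another group still references it, so those rules have to be found and revoked first. The structs and the `findLinkedRules` helper are hypothetical simplifications and do not reflect the method names or the AWS SDK types used in this PR.

```go
package main

import "fmt"

// ingressRule is a hypothetical, simplified ingress permission.
type ingressRule struct {
	Protocol       string
	FromPort       int
	ToPort         int
	SourceGroupIDs []string // security groups allowed as traffic sources
}

// securityGroup is a hypothetical, simplified security group.
type securityGroup struct {
	ID      string
	Ingress []ingressRule
}

// linkedRule identifies one rule, in another group, that references the target.
type linkedRule struct {
	GroupID  string
	Protocol string
	FromPort int
	ToPort   int
}

// findLinkedRules scans every group except the target itself and collects
// the ingress rules that name targetID as a source.
func findLinkedRules(groups []securityGroup, targetID string) []linkedRule {
	var linked []linkedRule
	for _, g := range groups {
		if g.ID == targetID {
			continue // only references from other groups are collected here
		}
		for _, r := range g.Ingress {
			for _, src := range r.SourceGroupIDs {
				if src == targetID {
					linked = append(linked, linkedRule{
						GroupID:  g.ID,
						Protocol: r.Protocol,
						FromPort: r.FromPort,
						ToPort:   r.ToPort,
					})
					break // one match is enough to record this rule
				}
			}
		}
	}
	return linked
}

func main() {
	groups := []securityGroup{
		{ID: "sg-managed"},
		{ID: "sg-nodes", Ingress: []ingressRule{
			{Protocol: "tcp", FromPort: 30000, ToPort: 32767, SourceGroupIDs: []string{"sg-managed"}},
			{Protocol: "tcp", FromPort: 22, ToPort: 22, SourceGroupIDs: []string{"sg-bastion"}},
		}},
	}
	for _, lr := range findLinkedRules(groups, "sg-managed") {
		fmt.Printf("revoke %s %d-%d on %s\n", lr.Protocol, lr.FromPort, lr.ToPort, lr.GroupID)
	}
}
```

Keeping discovery as a pure function over data, separate from the revoke and delete calls, is what makes this kind of logic easy to unit test and to reuse for another load balancer type.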

The unit tests and function documentation comments were written with assistance from Cursor AI (model claude-4-sonnet): AIA HAb SeCeNc Hin R v1.0

Does this PR introduce a user-facing change?:

Fixed security group leak when updating Classic Load Balancer (CLB) services with `service.beta.kubernetes.io/aws-load-balancer-security-groups` annotation. Controller-managed security groups are now properly cleaned up when switching CLB from managed security groups to user-specified security groups.

k8s-ci-robot · 2025-07-15T17:23:59Z

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-07-15T17:24:00Z

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

elmiko · 2025-07-15T20:45:15Z

/ok-to-test

mtulio · 2025-07-15T21:11:57Z

/test all

mtulio · 2025-07-15T22:10:40Z

Fixing doc strings and failed unit tests from previous unexpected behavior:

/test all

k8s-ci-robot · 2025-07-16T02:23:04Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

mtulio · 2025-07-16T03:37:24Z

/test pull-cloud-provider-aws-e2e-kubetest2

mtulio · 2025-07-16T03:37:48Z

/test all

mtulio · 2025-07-16T14:02:52Z

I can't find a connection between the failures in pull-cloud-provider-aws-e2e-kubetest2 and the existing changes.

I am going to convert this to a regular PR to ask for reviewers while we observe whether this is a CI flake.

PTAL?
/assign @kmala @elmiko @JoelSpeed

mtulio · 2026-01-08T03:03:04Z

Introduced test is passing:

[cloud-provider-aws-e2e] loadbalancer CLB with managed Security Group must update to BYO Security Group

The failed tests, both hairpinning traffic tests, are failing to resolve DNS:

DNS resolution failure - check if target hostname is resolvable

Checking whether this is transient ('cloudability' issues) or related to the e2e updates (which are mostly debug output and the new e2e test).

mtulio · 2026-01-08T03:03:15Z

/test pull-cloud-provider-aws-e2e

mtulio · 2026-01-08T15:08:05Z

In the last attempt, only the NLB test failed, for the same reason as before (timeout):

[It] [cloud-provider-aws-e2e] loadbalancer NLB internal should be reachable with hairpinning traffic

I wonder if I need to consider increasing the timeout, although I think it is already high.

/test pull-cloud-provider-aws-e2e

mtulio · 2026-01-08T19:55:15Z

@mtulio: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cloud-provider-aws-e2e | 63e7396 | link | true | /test pull-cloud-provider-aws-e2e |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

The hairpinning traffic test(s) keep flaking due to a timeout (waiting for the LB to exist or for its name to resolve). I had a good sync with @elmiko today; one approach would be increasing the timeout of the hairpinning traffic tests. I am also considering isolating the e2e improvements added in this PR into a dedicated one, so we can focus here on the fix while we investigate/isolate the e2e improvements. Open to thoughts.

mtulio · 2026-01-12T14:30:12Z

e2e timeout increased on the loadbalancer curler/poller to validate whether the CI issues are related. This is expected to decrease the flakes in the hairpin traffic tests (CLB and NLB).

mtulio · 2026-01-16T20:33:21Z

/test pull-cloud-provider-aws-e2e

mtulio · 2026-03-26T19:09:44Z

PR rebased. @mfbonfigli is working with me to get this PR validated. Thanks Federico!

mtulio · 2026-03-27T00:49:56Z

The e2e jobs are now passing. This PR is ready for a new round of review. Thanks!

mfbonfigli · 2026-03-27T12:04:20Z

Thanks @mtulio. I verified the issue and your fix and confirm it works.

Testing Results

I have verified this fix by:

Reproducing it on a real kubernetes cluster, confirming the security group leak
Running the patched AWS CCM built from this PR against the same cluster
Trying to reproduce it again and confirming the group was not leaked anymore
Reviewing the code and e2e test logic.

The steps/scripts I used to reproduce the issue are documented here:
https://github.com/mfbonfigli/aws-ccm-issue-1208

Note: possible edge case in case of network issues calling AWS APIs

While reproducing by running the AWS CCM locally on my machine (so AWS calls were going through my home network), I once hit a possible issue when multiple i/o timeout errors (i.e. more than the configured number of AWS SDK retries) occur while calling AWS to delete the old managed security group after it has been detached from the load balancer. In this case the group remains orphaned and, given it is already detached from the LB, my understanding is that it won't be cleaned up on subsequent updates either. Given the transient nature of the network connectivity problem, I have not been able to reproduce it again. I do not deem this edge case a blocker for the fix but wanted to call it out here.

Testing details

Reproducing the issue

These are the execution logs of reproduce-bug.sh script documented here [https://github.com/mfbonfigli/aws-ccm-issue-1208/blob/main/reproduce-bug.sh], against a real cluster running on AWS.

$ ./reproduce-bug.sh 
=====================================================================
Kubernetes Cloud Provider AWS Bug #1208 Reproduction Script
Security Group Leak When Adding BYO SG Annotation
=====================================================================

Step 1: Creating LoadBalancer service without custom security groups...
service/test-clb-service created
pod/test-pod created
Waiting for LoadBalancer to be provisioned (this may take 2-3 minutes)...
.✓ LoadBalancer provisioned: af3d3fbe5c2c84fab9ce3e187d3d3260-796580548.us-west-2.elb.amazonaws.com


Step 2: Identifying the auto-generated managed security group...
Load Balancer Name: af3d3fbe5c2c84fab9ce3e187d3d3260
Managed Security Group: sg-09d1ca177dbf8e761

Managed SG Tags:
--------------------------------------------------------
|                 DescribeSecurityGroups                |
+---------------------------------------------+--------+
|                    Key                      | Value  |
+---------------------------------------------+--------+
|  kubernetes.io/cluster/test-cluster-9927c   |  owned |
+---------------------------------------------+--------+

Step 3: Creating custom security group (BYO SG)...
VPC ID: vpc-xxxxxxxxxxxxxxxxx
Custom Security Group: sg-0019c9f550127f9d9
Adding ingress rules to custom SG...
{
    "Return": true,
    "SecurityGroupRules": [
        {
            "SecurityGroupRuleId": "sgr-03fa191cfb8ab45b7",
            "GroupId": "sg-0019c9f550127f9d9",
            "GroupOwnerId": "123456789012",
            "IsEgress": false,
            "IpProtocol": "tcp",
            "FromPort": 80,
            "ToPort": 80,
            "CidrIpv4": "0.0.0.0/0",
            "SecurityGroupRuleArn": "arn:aws:ec2:us-west-2:123456789012:security-group-rule/sgr-03fa191cfb8ab45b7"
        }
    ]

Step 4: Adding BYO security group annotation to the service...
service/test-clb-service annotated
Waiting for reconciliation (60 seconds)...

=====================================================================
Step 5: BUG VERIFICATION
=====================================================================

Current security groups on load balancer:
--------------------------
|  DescribeLoadBalancers |
+------------------------+
|  sg-0019c9f550127f9d9  |
+------------------------+

=== BUG VERIFICATION CHECKS ===

1. Checking if managed SG still exists:
   ✓ Managed SG still exists: sg-09d1ca177dbf8e761

2. Checking network interfaces attached to managed SG:
   ✓ BUG CONFIRMED: Managed SG has 0 network interfaces (orphaned)

3. Checking if managed SG is still on the load balancer:
   ✓ Managed SG is NOT on the LB anymore

4. Checking if custom SG is now on the load balancer:
   ✓ Custom SG is attached to LB: sg-0019c9f550127f9d9

=====================================================================
SUMMARY
=====================================================================
Managed SG ID: sg-09d1ca177dbf8e761
Custom SG ID:  sg-0019c9f550127f9d9

🐛 BUG CONFIRMED!

The managed security group is LEAKED:
  - It still exists in AWS
  - It has NO network interfaces attached
  - It is NOT attached to the load balancer
  - The custom SG has replaced it on the LB

This orphaned security group will remain until manually deleted.

=====================================================================
To clean up, run: ./cleanup.sh
=====================================================================

Environment state saved to .bug-reproduction-state

Testing the fix

To test, the PR was pulled and built locally, and the resulting AWS CCM was run locally, overriding the one in the cluster. This was done following the steps described here [https://github.com/mfbonfigli/aws-ccm-issue-1208/blob/main/how_to_test_locally.md]

The following are the execution logs of the reproduce-bug.sh script when using the AWS CCM patched by this PR:

$ ./reproduce-bug.sh 
=====================================================================
Kubernetes Cloud Provider AWS Bug #1208 Reproduction Script
Security Group Leak When Adding BYO SG Annotation
=====================================================================

Step 1: Creating LoadBalancer service without custom security groups...
service/test-clb-service created
pod/test-pod created
Waiting for LoadBalancer to be provisioned (this may take 2-3 minutes)...
..✓ LoadBalancer provisioned: a665be131fcf74277bb7175f64a6a2d1-179449977.us-west-2.elb.amazonaws.com


Step 2: Identifying the auto-generated managed security group...
Load Balancer Name: a665be131fcf74277bb7175f64a6a2d1
Managed Security Group: sg-0f219e88a9ac01a27

Managed SG Tags:
----------------------------------------------------------------------
|                        DescribeSecurityGroups                      |
+--------------------------------------------+-----------------------+
|                    Key                     |         Value         |
+--------------------------------------------+-----------------------+
|  kubernetes.io/cluster/test-cluster-9927c  |  owned                |
|  KubernetesCluster                         |  test-cluster-9927c   |
+--------------------------------------------+-----------------------+

Step 3: Creating custom security group (BYO SG)...
VPC ID: vpc-xxxxxxxxxxxxxxxxx
Custom Security Group: sg-0faf0103f6ea42c9b
Adding ingress rules to custom SG...

Step 4: Adding BYO security group annotation to the service...
service/test-clb-service annotated
Waiting for reconciliation (60 seconds)...

=====================================================================
Step 5: BUG VERIFICATION
=====================================================================

Current security groups on load balancer:
--------------------------
|  DescribeLoadBalancers |
+------------------------+
|  sg-0faf0103f6ea42c9b  |
+------------------------+

=== BUG VERIFICATION CHECKS ===

1. Checking if managed SG still exists:
   ✗ Managed SG was deleted (unexpected - bug NOT reproduced)

2. Checking network interfaces attached to managed SG:

3. Checking if managed SG is still on the load balancer:
   ✓ Managed SG is NOT on the LB anymore

4. Checking if custom SG is now on the load balancer:
   ✓ Custom SG is attached to LB: sg-0faf0103f6ea42c9b

=====================================================================
SUMMARY
=====================================================================
Managed SG ID: sg-0f219e88a9ac01a27
Custom SG ID:  sg-0faf0103f6ea42c9b

⚠️  BUG NOT REPRODUCED

The expected bug behavior was not observed.
This could mean the bug has been fixed or the environment is different.

=====================================================================
To clean up, run: ./cleanup.sh
=====================================================================

Environment state saved to .bug-reproduction-state

mtulio · 2026-03-27T16:51:24Z

> I once encountered a possible issue in case multiple (i.e. more than the configured number of AWS SDK retries) i/o timeout errors happen when calling AWS to delete the old, managed, security group after it has been detached from the load balancer. In this case the group remains orphaned and given it's already detached from the LB, from my understanding it seems it won't be cleaned up on subsequent updates either. Given the transient nature related to network connectivity,

Interesting finding, thanks for taking a look. Yeah, it looks a bit tricky to reproduce inside the VPC, which may have a stable network. I think addressing this edge case means improving retries with backoff by reviewing all API calls that could be enhanced (which is broader than this bug fix), especially because, afaict, the controller does not have a synchronization loop acting as that kind of "garbage collector" after retries are exhausted.

If you already have the controller logs from when it happened, or if we can reproduce it, that would be useful for a later enhancement. I think filing a new issue suggesting those improvements would be appropriate, to collect more ideas from colleagues in this project.
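For discussion in such a follow-up issue, a retry wrapper with exponential backoff could look roughly like the sketch below. This is an assumption-laden illustration, not the controller's or the AWS SDK's retry mechanism: `retryWithBackoff` and `errTransient` are hypothetical names, and the delete call is simulated by a closure.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a transient failure such as an i/o timeout.
var errTransient = errors.New("i/o timeout")

// retryWithBackoff calls op up to attempts times, doubling the delay after
// each failure. It returns nil on the first success, or the last error
// wrapped with context once all attempts are used.
func retryWithBackoff(attempts int, initial time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := initial
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated delete that fails twice before succeeding.
	err := retryWithBackoff(4, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

A wrapper like this only narrows the window. If every attempt fails, the group is still orphaned, which is why a periodic sweep for owned but unattached groups would be the complementary piece.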

mtulio · 2026-03-27T16:52:23Z

Hello all, would you mind taking a look at this bug fix? Thanks!

/assign @elmiko @JoelSpeed @kmala

mtulio · 2026-03-27T16:54:30Z

Hi @kmala, is there a way to refresh the bot so it re-checks the release note label? It looks like the release note is present, but this PR still has do-not-merge/release-note-label-needed.

mtulio · 2026-04-02T15:16:47Z

cc @damdo @nrb

nrb · 2026-04-02T15:28:10Z

/tide refresh

Fix the managed (controller-owned) security group leak when the user-provided security group (SG) annotation(1) is added to an existing Service type-loadBalancer Classic Load Balancer (CLB). Previously, the controller leaked the managed security group resource when the annotation was added to an existing Service loadBalancer CLB (default mode). This change detects the update correctly and triggers the removal of the SG and its dependencies, i.e. rules in other SGs that reference the managed security group being removed. Unit test functions were added to validate a Service update to BYO Security Group annotations from a managed SG state on CLB. Issue kubernetes#1208

mtulio · 2026-04-21T02:29:48Z

Hi folks, I've reduced the scope of this PR to focus on smaller, more reviewable changes. Here's what changed:

What's been removed: The new e2e test case for BYO Security Groups - this was triggering significant e2e library refactoring that's better handled separately

New plan:

E2E loadbalancer library updates will be addressed in WIP/test/e2e: export load balancer helpers #1381 (currently WIP), providing the scaffolding needed for future test cases
BYO SG e2e test case will be added in a follow-up PR once the library updates are merged (preview available here)

This PR now focuses solely on fixing the security group leak on CLB.

I believe this narrower scope will make the review more straightforward. Please take a look when you have a chance and let me know if you have any suggestions to improve the approach. Thanks!

mtulio · 2026-04-21T03:21:04Z

/test pull-cloud-provider-aws-e2e-kubetest2

k8s-ci-robot requested review from hakman and kishorj July 15, 2025 17:23

k8s-ci-robot added the needs-triage label (indicates an issue or PR lacks a `triage/foo` label and requires one) Jul 15, 2025

k8s-ci-robot added the needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) and size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) labels Jul 15, 2025

mtulio changed the title ~~Fix 1208 byosg update~~ fix leak managed/owned security group on Service update with BYO SG Jul 15, 2025

mtulio commented Jul 15, 2025

Comment thread pkg/providers/v1/aws_loadbalancer_test.go

Comment thread pkg/providers/v1/aws_loadbalancer_test.go Outdated

mtulio force-pushed the fix-1208-byosg-update branch from 03f9775 to 83c92f2 Compare July 15, 2025 20:41

k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Jul 15, 2025

mtulio commented Jul 15, 2025

Comment thread pkg/providers/v1/aws_loadbalancer.go Outdated

Comment thread pkg/providers/v1/aws_loadbalancer.go Outdated

mtulio force-pushed the fix-1208-byosg-update branch from 83c92f2 to 23ba0b3 Compare July 15, 2025 21:11

mtulio force-pushed the fix-1208-byosg-update branch from 23ba0b3 to 0fec46d Compare July 15, 2025 21:55

k8s-ci-robot added the do-not-merge/release-note-label-needed label (indicates that a PR should not merge because it's missing one of the release note labels) and removed the release-note label (denotes a PR that will be considered when it comes time to generate release notes) Jul 16, 2025

mtulio force-pushed the fix-1208-byosg-update branch from 0fec46d to 1907542 Compare July 16, 2025 03:34

k8s-ci-robot assigned elmiko Jul 16, 2025

mtulio force-pushed the fix-1208-byosg-update branch from 63e7396 to 3e497b2 Compare January 12, 2026 14:28

mtulio mentioned this pull request Jan 15, 2026

Managed security group leak after annotation service.beta.kubernetes.io/aws-load-balancer-security-groups added to existing service #1208

Open

k8s-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Mar 13, 2026

mtulio force-pushed the fix-1208-byosg-update branch from 3e497b2 to 1baa91f Compare March 26, 2026 19:07

k8s-ci-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Mar 26, 2026

mfbonfigli mentioned this pull request Apr 13, 2026

feat: Add BYO security group support for NLB #1379

Draft

mtulio changed the title ~~Fix leak managed/owned security group on Service update with BYO SG~~ Fix leak managed/owned security group on Service update with BYO SG on CLB Apr 21, 2026

mtulio force-pushed the fix-1208-byosg-update branch from 1baa91f to 622e013 Compare April 21, 2026 01:59

k8s-ci-robot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) and removed the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) Apr 21, 2026

mtulio force-pushed the fix-1208-byosg-update branch 2 times, most recently from eee3ea8 to 8461a63 Compare April 21, 2026 02:12

mtulio force-pushed the fix-1208-byosg-update branch from 8461a63 to b481672 Compare April 21, 2026 02:13

7 participants