Fix leak managed/owned security group on Service update with BYO SG on CLB#1209
mtulio wants to merge 1 commit into kubernetes:master
Conversation
This issue is currently awaiting triage. If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @mtulio. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 03f9775 to 83c92f2 (compare)
/ok-to-test
Force-pushed from 83c92f2 to 23ba0b3 (compare)
/test all
Force-pushed from 23ba0b3 to 0fec46d (compare)
Fixed doc strings and the unit tests that failed due to the previous unexpected behavior. /test all
Adding the `do-not-merge/release-note-label-needed` label because no release-note block was detected; please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 0fec46d to 1907542 (compare)
/test pull-cloud-provider-aws-e2e-kubetest2
/test all
I can't find a connection between the failures in pull-cloud-provider-aws-e2e-kubetest2 and these changes. I am going to convert this to a regular PR to ask for reviewers while we observe whether this is a CI flake. PTAL?
The introduced test is passing:
The failed tests, both hairpinning traffic, are failing to resolve DNS:
Checking whether this is transient ('cloudability' issues) or related to the e2e updates (which are mostly debug and new e2e).
/test pull-cloud-provider-aws-e2e
In the last attempt only the NLB test failed, for the same reason as before (timeout):
I wonder if I should consider increasing the timeout, although I think it is already high. /test pull-cloud-provider-aws-e2e
The hairpinning traffic test(s) keeps flaking due to a timeout (waiting for a LB / resolving its name). I had a good sync with @elmiko today; one approach would be increasing the timeout of the hairpinning traffic tests. I am also considering isolating the e2e improvements added in this PR into a dedicated one, so we can focus here on the fix itself while we investigate/isolate the e2e improvements. Open for thoughts.
Force-pushed from 63e7396 to 3e497b2 (compare)
Increased the e2e timeout on the load balancer curler/poller to validate whether the CI issues are related. Expected to decrease the flakes in the hairpin traffic tests (CLB and NLB).
/test pull-cloud-provider-aws-e2e
Force-pushed from 3e497b2 to 1baa91f (compare)
PR rebased. @mfbonfigli is working with me to get this PR validated. Thanks Federico!
e2e jobs are now passing. This PR is now open for a new round of review. Thanks!
Thanks @mtulio. I verified the issue and your fix and confirm it works.

Testing results

I have verified this fix by:
The steps/scripts I used to reproduce the issue are documented here:

Note: possible edge case in case of network issues calling AWS APIs

While reproducing by running the AWS CCM locally on my machine (so AWS calls were going through my home network), I once encountered a possible issue when multiple (i.e. more than the configured number of AWS SDK retries)

Testing details

Reproducing the issue

These are the execution logs of

Testing the fix

To test, the PR was pulled and built locally and the built AWS CCM was run locally, overriding the cluster one. This was done following the steps described here: https://github.com/mfbonfigli/aws-ccm-issue-1208/blob/main/how_to_test_locally.md

The following are the execution logs of the
Interesting finding, thanks for taking a look. Yeah, it looks a bit tricky to reproduce inside the VPC, which likely has a stable network. I think the challenge in this edge case is improving retries with backoff, by reviewing all API calls that could be enhanced (broader than this bug fix), especially because, afaict, the controller does not have a synchronization loop to act as that kind of "garbage collector" after exhausting retries. If you already have the controller logs from when it happened, or if we can try to reproduce it, that would be valuable for a later enhancement. I think filing a new issue suggesting those improvements would be appropriate to collect more ideas from colleagues in this project.
Hello all, would you mind taking a look at this bug fix? Thanks! /assign @elmiko @JoelSpeed @kmala
Hi @kmala, is there a way to refresh the bot to review the release-note label? Looks like the release note is present, but this PR still has the `do-not-merge/release-note-label-needed` label.
/tide refresh
Force-pushed from 1baa91f to 622e013 (compare)
Force-pushed from eee3ea8 to 8461a63 (compare)
Fix the managed (controller-owned) security group leak when the user-provided security group (SG) annotation(1) is added to an existing Service of type LoadBalancer backed by a Classic Load Balancer (CLB). Previously, the controller would leak a managed security group resource when the annotation was added to an existing LoadBalancer Service on CLB (default mode). This change detects the update correctly and triggers the removal of the managed SG and its dependencies: rules in other SGs referencing the managed security group that will be removed. Unit test functions were added to validate a Service update to BYO security group annotations from a managed-SG state on CLB. Issue kubernetes#1208
Force-pushed from 8461a63 to b481672 (compare)
Hi folks, I've reduced the scope of this PR to focus on smaller, more reviewable changes. Here's what changed:

What's been removed: the new e2e test case for BYO security groups. It was triggering significant e2e library refactoring that's better handled separately.

New plan:
This PR now focuses solely on fixing the security group leak on CLB. I believe this narrower scope will make the review more straightforward. Please take a look when you have a chance and let me know if you have any suggestions to improve the approach. Thanks!
/test pull-cloud-provider-aws-e2e-kubetest2
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR proposes a fix for a leaked security group (SG) when a Service of type LoadBalancer (CLB) is updated by adding the BYO SG annotation (service.beta.kubernetes.io/aws-load-balancer-security-groups), which replaces all SGs attached to the load balancer without removing linked rules and without deleting the managed SG (created by the controller).
Which issue(s) this PR fixes:
Fixes #1208
Special notes for your reviewer:
We decided to create isolated, dedicated methods to discover and remove the linked rules' SGs, aiming to:
The unit tests and documentation (function) comments have been assisted by Cursor AI (model claude-4-sonnet): AIA HAb SeCeNc Hin R v1.0
Does this PR introduce a user-facing change?: