fix: IAM role cleanup order to prevent orphaned roles #516
Issue description:
IAM roles are accumulating in our canary test account until it hits the role limit of 1000. Once the limit is reached, new sample app deployments fail, resulting in a sev-2 roughly every 5 days. Two patterns of orphaned roles were found:
- `eksctl-e2e-java-otlp-ocb-canary-test-addon-ia-Role1-*`
- `eks-s3-access-java-eks-otlp-ocb-*`

Root cause:
In the cleanup phase of `java-eks-otlp-ocb-test.yml`, the Kubernetes namespace is deleted before the IAM service accounts are deleted. When `eksctl delete iamserviceaccount` runs after the namespace is already gone, it cannot find the Kubernetes ServiceAccount, fails silently due to `continue-on-error: true`, and leaves the underlying CloudFormation stack and IAM role orphaned.

Description of changes:
Reorder the cleanup steps in `.github/workflows/java-eks-otlp-ocb-test.yml` so that IAM service accounts are deleted before the Kubernetes namespace. This addresses both patterns:
- `eks-s3-access-*` orphaned roles
- `eksctl-*-addon-ia-Role1-*` orphaned roles

Rollback procedure:
Yes, this commit can be safely reverted if needed. The change only affects the order of cleanup steps and does not modify any test logic or resource creation. Reverting would restore the previous behavior where orphaned IAM roles accumulate, but would not cause test failures or break any functionality.
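For reference, the reordering described under "Description of changes" might look roughly like the sketch below. It is illustrative only and not copied from the actual workflow: the step names, the environment variables (`SERVICE_ACCOUNT_NAME`, `SAMPLE_APP_NAMESPACE`, `CLUSTER_NAME`, `AWS_REGION`), and the use of `if: always()` are assumptions. The point is only the order: the IAM service account is deleted while the namespace still exists.

```yaml
# Hypothetical cleanup steps; names and variables are placeholders,
# not the real contents of java-eks-otlp-ocb-test.yml.
- name: Delete IAM service account (must run before namespace deletion)
  if: always()
  continue-on-error: true
  run: |
    eksctl delete iamserviceaccount \
      --name "$SERVICE_ACCOUNT_NAME" \
      --namespace "$SAMPLE_APP_NAMESPACE" \
      --cluster "$CLUSTER_NAME" \
      --region "$AWS_REGION"

- name: Delete Kubernetes namespace (runs after the IAM service account step)
  if: always()
  continue-on-error: true
  run: kubectl delete namespace "$SAMPLE_APP_NAMESPACE" --ignore-not-found=true
```

Reverting the commit would simply swap these two steps back to the previous order.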
Test Workflow Run:
https://github.com/yiyuan-he/aws-application-signals-test-framework/actions/runs/20116407133/job/57726811476
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.