@yiyuan-he
Contributor

@yiyuan-he yiyuan-he commented Dec 10, 2025

Issue description:

IAM roles are accumulating, causing our canary test account to hit the IAM role limit of 1000. Once the limit is reached, new sample app deployments fail, resulting in a sev-2 every ~5 days. Two patterns of orphaned roles were found:

  • eksctl-e2e-java-otlp-ocb-canary-test-addon-ia-Role1-*
  • eks-s3-access-java-eks-otlp-ocb-*

Root cause:

In the cleanup phase of java-eks-otlp-ocb-test.yml, the Kubernetes namespace is deleted before the IAM service accounts are deleted. When eksctl delete iamserviceaccount runs after the namespace is already gone, it cannot find the Kubernetes ServiceAccount and fails. Because the step sets continue-on-error: true, the failure is silent, and the underlying CloudFormation stack and IAM role are left orphaned.

Description of changes:

Reorder the cleanup steps in .github/workflows/java-eks-otlp-ocb-test.yml so that IAM service accounts are deleted before the Kubernetes namespace:

  1. Remove aws access service account (sa-$TESTING_ID) - fixes eks-s3-access-* orphaned roles
  2. Remove Application Signals Collector IAM service account (appsignals-collector) - fixes eksctl-*-addon-ia-Role1-* orphaned roles
  3. Remove cloudwatch-agent IAM service account (cloudwatch-agent) - extracted to its own step for clarity
  4. Terraform destroy
  5. Clean up namespaces - namespace deletion now happens last
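
For reviewers, the new ordering can be sketched roughly as below. This is an illustrative outline, not the literal workflow contents: the step names follow the list above, but the env variable names (CLUSTER_NAME, AWS_REGION, SAMPLE_APP_NAMESPACE, COLLECTOR_NAMESPACE, TESTING_ID), the `if: always()` conditions, and the Terraform working directory are assumptions made for the example.

```yaml
# Sketch of the reordered cleanup phase. All env.* names, the
# `if: always()` guards, and the working-directory value are
# hypothetical placeholders, not copied from the real workflow.

# 1. Delete IAM service accounts first, while the namespace and the
#    Kubernetes ServiceAccount objects still exist.
- name: Remove aws access service account
  if: always()
  continue-on-error: true
  run: |
    eksctl delete iamserviceaccount \
      --name "sa-${{ env.TESTING_ID }}" \
      --namespace "${{ env.SAMPLE_APP_NAMESPACE }}" \
      --cluster "${{ env.CLUSTER_NAME }}" \
      --region "${{ env.AWS_REGION }}"

- name: Remove Application Signals Collector IAM service account
  if: always()
  continue-on-error: true
  run: |
    eksctl delete iamserviceaccount \
      --name appsignals-collector \
      --namespace "${{ env.COLLECTOR_NAMESPACE }}" \
      --cluster "${{ env.CLUSTER_NAME }}" \
      --region "${{ env.AWS_REGION }}"

- name: Remove cloudwatch-agent IAM service account
  if: always()
  continue-on-error: true
  run: |
    eksctl delete iamserviceaccount \
      --name cloudwatch-agent \
      --namespace "${{ env.COLLECTOR_NAMESPACE }}" \
      --cluster "${{ env.CLUSTER_NAME }}" \
      --region "${{ env.AWS_REGION }}"

# 2. Tear down the Terraform-managed resources.
- name: Terraform destroy
  if: always()
  continue-on-error: true
  working-directory: terraform/path/placeholder  # hypothetical path
  run: terraform destroy -auto-approve

# 3. Delete the namespace last, so eksctl above could still find the
#    ServiceAccounts it needed to clean up.
- name: Clean up namespaces
  if: always()
  continue-on-error: true
  run: |
    kubectl delete namespace "${{ env.SAMPLE_APP_NAMESPACE }}" --ignore-not-found=true
```

The point of the sketch is the ordering only: every eksctl delete iamserviceaccount call runs before kubectl delete namespace, so the CloudFormation stack and IAM role behind each service account are removed while the ServiceAccount is still discoverable.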

Rollback procedure:

Yes, this commit can be safely reverted if needed. The change only affects the order of cleanup steps and does not modify any test logic or resource creation. Reverting would restore the previous behavior where orphaned IAM roles accumulate, but would not cause test failures or break any functionality.

Test Workflow Run:

https://github.com/yiyuan-he/aws-application-signals-test-framework/actions/runs/20116407133/job/57726811476

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@yiyuan-he yiyuan-he force-pushed the fix/iam-cleanup-order branch from 22cbe6d to bc74a04 Compare December 10, 2025 23:20
@yiyuan-he
Contributor Author

Successful test workflow run. Will merge tomorrow morning.

Note: The insufficient termination protection error is a known issue that was discovered recently. It will be fixed by adding the necessary permissions to the GitHub workflow via our test infra CDK.

@yiyuan-he yiyuan-he merged commit 57724b3 into aws-observability:main Dec 11, 2025
19 checks passed
@yiyuan-he yiyuan-he deleted the fix/iam-cleanup-order branch December 11, 2025 17:42