Skip to content

feat(plugin/eks): ACK Role CR additional inline policies (APTH-1704)#43

Merged
reoring merged 1 commit intomainfrom
feat/apth-1704-additional-inline-policies
May 7, 2026
Merged

feat(plugin/eks): ACK Role CR additional inline policies (APTH-1704)#43
reoring merged 1 commit intomainfrom
feat/apth-1704-additional-inline-policies

Conversation

@reoring
Copy link
Copy Markdown
Collaborator

@reoring reoring commented May 7, 2026

Summary

  • Add a new ClusterClass topology variable karpenter-node-role-additional-inline-policies (sister to karpenter-node-role-additional-policy-arns) so cluster operators can declare IAM inline policies on the karpenter node Role end-to-end through ACK Role CR's spec.inlinePolicies
  • Eliminates the ~6.5h race observed on pmc-local where ACK + kro v0.8.4 reconcile interpreted externally-attached inline policies as drift and issued DeleteRolePolicy, breaking cert-manager DNS01 challenges
  • Backward-compatible: cluster specs that omit the variable see no behavior change (writer skips spec.inlinePolicies entirely when the map is empty) — production clusters are byte-identical at the ACK CR level

Background

5/6 evening cert chain dive on pmc-local revealed that demo-ops/scripts/05-pmc-node-role-policies.sh attaches the IRSA-give-up bundle (cert-manager-route53-dns01-txt-only + external-dns-route53-record-management) directly via aws iam put-role-policy, but ACK reconcile silently removes them with DeleteRolePolicy because the ACK Role CR's spec.inlinePolicies is empty. CloudTrail pmc-local-node events confirm the cycle:

2026-05-06 21:12  PutRolePolicy x2     (manual 05 script run)
2026-05-07 03:45  DeleteRolePolicy x2  (~6.5h after, ACK reconcile)

userAgent = aws-controllers-k8s/iam.services.k8s.aws-1.6.2 ManagedBy/kro KROVersion/v0.8.4

The root fix is to make the inline policies declarative on the ACK Role CR through a new ClusterClass topology variable. See design/apth-1704-... in demo0521 for the full design.

Changes

  • constants.go: add topologyNodeRoleAdditionalInlinePoliciesVariableName next to its sister *PolicyARNs* const so future readers see the symmetric pair
  • karpenter_bootstrapper_controller.go: read the topology variable as map[string]string (new readTopologyStringMap helper, modelled on readTopologyStringSlice) and forward to a widened ensureIAMRoleForEC2(..., inlinePolicies map[string]string). Writer materializes spec.inlinePolicies only when the map is non-empty.
  • karpenter_bootstrapper_controller_test.go: cover the documented cases — (1) no inline = absent on spec, (2) inline declared = both entries preserved verbatim, (3) managed and inline coexist independently — plus a round-trip test on the new map reader
  • hack/capi/metadata.yaml: prepend v0.4 series entry for the upcoming release cut (clusterctl requires descending order)

Test plan

  • go build ./... (passing)
  • go vet ./... (passing)
  • go test ./... all package tests passing including envtest internal/controller/infrastructure
  • New unit tests pass: TestEKSKarpenterBootstrapperReconciler_EnsureIAMRoleForEC2_InlinePoliciesDeclaredOnSpec + TestReadTopologyStringMap_RoundTripsClusterClassVariable
  • Cut v0.4.0 release tag after merge → release pipeline auto-publishes the 5 clusterctl-provider artifacts
  • Follow-up appthrust/platform PR atomically lands: kany8s-eks-byo-clusterclass/templates/clusterclass.yaml (new variable schema), packages-0/.../capi-kany8s-installation.yaml (v0.4.0 pin x4), hack/local/clusters/boot-single.sh (vars.yaml render with inline policies sourced from lib/policies.sh)
  • reoring/demo0521 PR for demo-ops/scripts/02-route53-zones.sh (zone ID file cache) + 05-pmc-node-role-policies.sh deprecation header
  • Fresh boot rehearsal (APTH-1612) verifies 24h+ of zero DeleteRolePolicy events on pmc-local-node

Related

Add a new ClusterClass topology variable
`karpenter-node-role-additional-inline-policies` (sister to the existing
`karpenter-node-role-additional-policy-arns`) so cluster operators can
declare IAM inline policies on the karpenter node Role end-to-end through
the ACK Role CR's `spec.inlinePolicies` field. ACK + kro v0.8.4 then sees
the policies as declarative state instead of treating externally-attached
inline policies as drift to delete.

Concretely, this eliminates the ~6.5h race observed on pmc-local where
the IRSA-give-up bundle (cert-manager-route53-dns01-txt-only +
external-dns-route53-record-management) attached by demo-ops/scripts/
05-pmc-node-role-policies.sh was repeatedly removed by ACK reconcile,
causing cert-manager DNS01 challenges to permanently stall.

Changes:
- constants.go: add `topologyNodeRoleAdditionalInlinePoliciesVariableName`
  next to its sister `*PolicyARNs*` const so future readers see the
  symmetric pair.
- karpenter_bootstrapper_controller.go: read the topology variable as a
  map[string]string (new `readTopologyStringMap` helper, modelled on
  `readTopologyStringSlice`) and forward it to a widened
  `ensureIAMRoleForEC2(..., inlinePolicies map[string]string)`. The
  writer materializes `spec.inlinePolicies` only when the map is
  non-empty so existing clusters that do not opt in remain
  byte-identical at the ACK CR level.
- karpenter_bootstrapper_controller_test.go: cover the three documented
  cases (no inline = absent on spec, inline declared = both entries
  preserved verbatim, managed and inline coexist independently) plus a
  round-trip test on the new map reader.
- hack/capi/metadata.yaml: prepend a v0.4 series entry for the upcoming
  release cut (clusterctl requires descending order by major,minor).

Production cluster specs that omit the new variable see no behavior
change (the writer skips spec.inlinePolicies entirely when the map is
empty), so this is a backward-compatible feature suitable for a v0.4.0
minor release.

See design/apth-1704-... and impl/apth-1704-... in
github.com/reoring/demo0521 for the full design + implementation
hand-off.

Linear: APTH-1704
@reoring reoring merged commit 0c2aa6a into main May 7, 2026
5 of 9 checks passed
@reoring reoring deleted the feat/apth-1704-additional-inline-policies branch May 7, 2026 01:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant