CNTRLPLANE-1985: Azure Private Topology for Self-Managed HyperShift#1949
Conversation
@bryan-cox: This pull request references CNTRLPLANE-1985, which is a valid Jira issue.

In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-push: 800c79b → c07417d
/retest

/retest ci/prow/markdownlint
@bryan-cox: The Use

In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Adds an enhancement proposal for CNTRLPLANE-1985 to support private endpoint access for self-managed Azure HyperShift clusters using Azure Private Link Service. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-push: c07417d → d0f43de
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: sjenning. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
csrwng
left a comment
Some comments/questions. I also don't see any mention of how non-KAS services will be exposed (Konnectivity, Ignition, OAuth).
| | AWS | GCP | Azure |
| --- | --- | --- | --- |
| HO creates | VPC Endpoint Service | Service Attachment | PLS |
| CPO creates | VPC Endpoint + SG + DNS | PSC Endpoint + IP + DNS | PE + DNS |
> #### API Design: endpointAccess Struct
can we mention how this intersects with the `Services []ServicePublishingStrategyMapping` (`json:"services"`) knob?
Added a new section "Interaction with Services (ServicePublishingStrategyMapping)" that explains how `endpointAccess` and `Services` are complementary: `endpointAccess` controls visibility (public vs private) while `Services` controls how each service is exposed (Route, LoadBalancer, etc.). The two are independent and reconciled separately by the CPO, matching how AWS handles `EndpointAccess` and `Services` as independent fields (infra.go:198-297).
Key point: this enhancement requires KAS via Route (the default for Azure), which lets all services share a single internal LB and only one PLS per cluster. KAS via dedicated LoadBalancer would need 2 PLSes and is scoped as a Non-Goal.
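To make the independence concrete, here is a minimal sketch in plain Go. The type names echo the proposal, but these are simplified illustration-only shapes, not the real HyperShift API: visibility is read from `endpointAccess` alone, exposure from `Services` alone.

```go
package main

import "fmt"

// Illustrative stand-ins for the API types discussed above.
type AzureEndpointAccessType string

const (
	Public           AzureEndpointAccessType = "Public"
	PublicAndPrivate AzureEndpointAccessType = "PublicAndPrivate"
	Private          AzureEndpointAccessType = "Private"
)

type ServicePublishingStrategyMapping struct {
	Service string // e.g. "APIServer", "OAuthServer"
	Type    string // e.g. "Route", "LoadBalancer"
}

// isPrivateVisibility depends only on endpointAccess.
func isPrivateVisibility(access AzureEndpointAccessType) bool {
	return access == Private || access == PublicAndPrivate
}

// exposureFor depends only on the Services mapping.
func exposureFor(services []ServicePublishingStrategyMapping, name string) string {
	for _, s := range services {
		if s.Service == name {
			return s.Type
		}
	}
	return ""
}

func main() {
	mappings := []ServicePublishingStrategyMapping{{Service: "APIServer", Type: "Route"}}
	fmt.Println(isPrivateVisibility(PublicAndPrivate)) // true
	fmt.Println(exposureFor(mappings, "APIServer"))    // Route
}
```

Two reconcilers reading these two inputs never need to coordinate, which is the point of keeping the fields independent.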
Assisted by Claude Code
let's plan to lock the ServicePublishingStrategy mapping via CEL to only what we support, i.e. all routes.
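That lock could be expressed as a CEL `XValidation` rule on the `Services` field. The marker text below is a hypothetical sketch (field paths would need to match the real API), and the Go function mirrors what the rule checks:

```go
package main

import "fmt"

// Hypothetical CEL marker, roughly as it might appear on the Services field:
// +kubebuilder:validation:XValidation:rule="self.all(s, s.servicePublishingStrategy.type == 'Route')",message="only Route publishing is supported with private endpoint access"

type ServicePublishingStrategyMapping struct {
	Service string
	Type    string
}

// validateRouteOnly is the plain-Go equivalent of the CEL rule above:
// every mapped service must use the Route publishing strategy.
func validateRouteOnly(services []ServicePublishingStrategyMapping) error {
	for _, s := range services {
		if s.Type != "Route" {
			return fmt.Errorf("service %s uses %s; only Route is supported", s.Service, s.Type)
		}
	}
	return nil
}

func main() {
	ok := []ServicePublishingStrategyMapping{{"APIServer", "Route"}, {"OAuthServer", "Route"}}
	bad := []ServicePublishingStrategyMapping{{"APIServer", "LoadBalancer"}}
	fmt.Println(validateRouteOnly(ok))  // <nil>
	fmt.Println(validateRouteOnly(bad)) // error
}
```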
Switch from KAS-specific internal LB to KAS via Route on the private router. All private services (KAS, OAuth, Konnectivity, Ignition) share the router's single internal LB, requiring only 1 PLS per cluster. Update endpointAccess from immutable to mutable between PublicAndPrivate and Private, consistent with AWS behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detail how controllers avoid unnecessary Azure API calls and hot loops, following AWS/GCP patterns: exponential backoff (3s-30s), error-specific requeue intervals, idempotent check-before-create, status-based caching, drift detection (5min), and sanitized condition messages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Back up every claim about AWS/GCP behavior in the error handling section with specific file:line references from openshift/hypershift. Clarify where Azure follows GCP patterns vs AWS patterns (e.g., error-specific requeue intervals follow GCP's handleGCPError, not AWS's flat backoff). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The HO's federated managed identity for PLS operations is configured at operator deployment, not per-HostedCluster. This matches how AWS uses AWS_SHARED_CREDENTIALS_FILE env vars on the HO (controller.go:109-113) and GCP uses GCP_PROJECT/GCP_REGION env vars (privateserviceconnect_controller.go:55-73). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend Goal 4 to explicitly accommodate current and future managed services scenarios (e.g., ARO HCP) in addition to self-managed extensions like dedicated OAuth LB. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ARO HCP uses Swift, not Azure Private Link Service. Reword to avoid implying they are the same mechanism. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clarify that the CPO Observer populates guestSubnetID from the NodePool and the CPO Controller uses it when creating the PE, similar to how AWS collects subnet IDs from NodePools into AWSEndpointService.Spec.SubnetIDs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Swift is Azure's CNI infrastructure for VNet injection tied to AKS. Self-managed HyperShift runs on standard OpenShift, so PLS is the appropriate cross-VNet private connectivity mechanism. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document that the HostedCluster controller aggregates AzurePrivateLinkService CR conditions into HC status, following computeAWSEndpointServiceCondition (hostedcluster_controller.go:3116). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review feedback about the chicken-egg credential problem during deletion and the need to reject PE connections before PLS deletion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add golang type definitions for AzureEndpointAccessSpec, AzurePrivateConnectivityConfig, and AzurePrivateLinkService CRD alongside the prose API descriptions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IsPrivateHCP() returns true for ARO HCP with Swift, so using it alone would incorrectly activate PLS controllers for ARO clusters. Gate on the presence of endpointAccess on AzurePlatformSpec instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
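The gating logic could look like this (a sketch with a simplified stand-in for `AzurePlatformSpec`): the PLS controllers activate only when `endpointAccess` is set to a non-Public value, so ARO HCP clusters, which leave it unset and use Swift, are untouched.

```go
package main

import "fmt"

// Simplified stand-in for the real platform spec; illustration only.
type AzurePlatformSpec struct {
	EndpointAccess string // "", "Public", "PublicAndPrivate", "Private"
}

// shouldRunPLSControllers gates on the presence of a non-Public
// endpointAccess value rather than on IsPrivateHCP(), which would also
// match ARO HCP clusters using Swift.
func shouldRunPLSControllers(spec *AzurePlatformSpec) bool {
	if spec == nil || spec.EndpointAccess == "" {
		return false
	}
	return spec.EndpointAccess != "Public"
}

func main() {
	fmt.Println(shouldRunPLSControllers(&AzurePlatformSpec{EndpointAccess: "Private"})) // true
	fmt.Println(shouldRunPLSControllers(&AzurePlatformSpec{}))                          // false
}
```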
Explain how endpointAccess (visibility) and Services (ServicePublishingStrategyMapping) are complementary and independently reconciled, matching the AWS pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
csrwng
left a comment
some more comments, looks better
| | AWS | GCP | Azure |
| --- | --- | --- | --- |
| CRD | AWSEndpointService | GCPPrivateServiceConnect | AzurePrivateLinkService |
| Management-side resource | VPC Endpoint Service | Service Attachment | Private Link Service |
| Customer-side resource | VPC Endpoint + SG | PSC Endpoint (Forwarding Rule) | Private Endpoint |
| DNS | Route53 Private Zone | Cloud DNS | Private DNS Zone |
It would be good to also specify here how the private zone is created.
In AWS, the customer is expected to create their own private/local zone. Today our CLI creates it when creating hosted cluster infra resources.
In GCP, the controller creates it on behalf of the customer.
In Azure, it seems we're saying we'll follow the GCP pattern. It'd be good to state that.
Good call. Yes, Azure follows the GCP pattern here — the CPO controller creates the Private DNS Zone on behalf of the customer as part of step 6 in the workflow. The customer doesn't need to pre-create the zone.
I'll add a row to the comparison table clarifying this:
| DNS zone creation | CLI (create infra) pre-creates private zone | Controller creates Cloud DNS zone | Controller creates Private DNS Zone |
I'll also add a note in the workflow description (step 6) making this explicit: the CPO controller creates the Private DNS Zone automatically, so no customer pre-provisioning is required.
AI-assisted response via Claude Code
> both a public KAS endpoint and the private router path needs to be verified against the AWS `PublicAndPrivate` implementation. With the Route-based approach, the private path goes through the router's internal LB while the public KAS endpoint may remain via its existing public LB.
I would expect the public endpoint to also be served by the internal router just as in AWS. We have an internal LB for private link, and a separate public LB is created/destroyed depending on whether your cluster is Private or PublicAndPrivate. Both LBs point to the one router deployment. A load balancer for the KAS is not needed.
Agreed — the Azure design should follow the same pattern as AWS. The private router deployment with its internal LB handles all traffic (KAS, OAuth, Konnectivity, Ignition). For PublicAndPrivate, a separate public LB is created pointing to the same router deployment, and it's destroyed when transitioning to Private. No separate KAS-specific load balancer is needed.
I'll update the open question to reflect this resolved design and incorporate the pattern into the proposal section.
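The desired-state rule described above (internal LB always present, public LB added or removed as `endpointAccess` changes, both pointing at the one router deployment) can be sketched as a pure function; names here are illustrative:

```go
package main

import "fmt"

// desiredRouterLoadBalancers returns which LBs should front the single
// router deployment for a given endpointAccess value: the internal LB
// always exists for the private-link path, and a public LB is added only
// for PublicAndPrivate (and destroyed on transition to Private).
func desiredRouterLoadBalancers(endpointAccess string) []string {
	lbs := []string{"internal"}
	if endpointAccess == "PublicAndPrivate" {
		lbs = append(lbs, "public")
	}
	return lbs
}

func main() {
	fmt.Println(desiredRouterLoadBalancers("Private"))          // [internal]
	fmt.Println(desiredRouterLoadBalancers("PublicAndPrivate")) // [internal public]
}
```

A reconciler comparing this desired set against existing LBs naturally handles the PublicAndPrivate → Private transition by deleting the public LB.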
AI-assisted response via Claude Code
> - A Private Endpoint in the guest VNet's worker subnet (from the `guestSubnetID` populated by the CPO Observer) targeting the PLS
> - A Private DNS Zone with an A record mapping the KAS hostname to the PE's private IP
>
> It updates the CR status with the PE and DNS resource IDs.
Are additional permissions needed for the control plane workload identity to perform these operations?
Yes, the existing CPO workload identity will need additional Azure RBAC permissions to create Private Endpoints and Private DNS Zones in the guest subscription. Specifically:
- `Microsoft.Network/privateEndpoints/*`: to create and manage the Private Endpoint in the guest VNet
- `Microsoft.Network/privateDnsZones/*`: to create and manage the Private DNS Zone and A records
These are not part of the default CPO workload identity permissions today. I'll add a section documenting the required additional RBAC assignments for the CPO identity when private endpoint access is configured.
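A small sketch of how such a requirement could be checked. The two required actions come from the list above; the wildcard matching is a simplified prefix check for illustration, not Azure's full RBAC evaluation:

```go
package main

import (
	"fmt"
	"strings"
)

// The action scopes named above, narrowed to write as an example.
var requiredActions = []string{
	"Microsoft.Network/privateEndpoints/write",
	"Microsoft.Network/privateDnsZones/write",
}

// covered reports whether a granted action (possibly a wildcard such as
// "Microsoft.Network/privateEndpoints/*") covers a required one.
// Simplified prefix-wildcard semantics for illustration only.
func covered(granted, required string) bool {
	if strings.HasSuffix(granted, "/*") {
		return strings.HasPrefix(required, strings.TrimSuffix(granted, "*"))
	}
	return granted == required
}

// missingActions lists required actions not covered by any grant.
func missingActions(granted []string) []string {
	var missing []string
	for _, req := range requiredActions {
		ok := false
		for _, g := range granted {
			if covered(g, req) {
				ok = true
				break
			}
		}
		if !ok {
			missing = append(missing, req)
		}
	}
	return missing
}

func main() {
	fmt.Println(missingActions([]string{"Microsoft.Network/privateEndpoints/*"}))
}
```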
AI-assisted response via Claude Code
> 6. The CPO Controller sees the PLS alias in the CR status and creates:
>    - A Private Endpoint in the guest VNet's worker subnet (from the `guestSubnetID` populated by the CPO Observer) targeting the PLS
>    - A Private DNS Zone with an A record mapping the KAS hostname to the PE's private IP
For AWS, 2 records are created, one for the KAS, and another one for everything else:

- `api.[cluster-name].hypershift.local`
- `*.apps.[cluster-name].hypershift.local`
Good point. Since all private services (KAS, OAuth, Konnectivity, Ignition) go through the same private router and PLS, Azure should follow the same pattern and create two DNS records in the Private DNS Zone:
- `api.<cluster-name>.hypershift.local` → PE private IP
- `*.apps.<cluster-name>.hypershift.local` → PE private IP
Both resolve to the same Private Endpoint IP since everything routes through the single router. I'll update step 6 and the DNS documentation to reflect this.
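The record-name construction is mechanical; a minimal sketch (the `hypershift.local` base domain follows the AWS convention quoted above):

```go
package main

import "fmt"

// privateZoneRecords returns the two A-record names created in the
// Private DNS Zone, both resolving to the Private Endpoint's IP:
// one for the KAS, one wildcard for all other services routed through
// the same private router.
func privateZoneRecords(clusterName string) []string {
	return []string{
		fmt.Sprintf("api.%s.hypershift.local", clusterName),
		fmt.Sprintf("*.apps.%s.hypershift.local", clusterName),
	}
}

func main() {
	for _, r := range privateZoneRecords("example") {
		fmt.Println(r)
	}
	// api.example.hypershift.local
	// *.apps.example.hypershift.local
}
```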
AI-assisted response via Claude Code
> (`hypershift-operator/controllers/platform/gcp/privateserviceconnect_controller.go:55-73`)
>
> The identity is only required when the HO manages clusters with non-Public endpoint access.
What changes will be needed for the hypershift install command?
What environment variables will need to be set on the HO for it to use the managed identity?
The hypershift install command will need a new flag to configure the HO's federated managed identity for PLS operations. Following the existing patterns:

- New install flag: `--azure-pls-managed-identity-client-id` (or similar) to pass the client ID of the managed identity that has Network Contributor RBAC on the management resource group.
- HO deployment changes: The install command would configure the HO service account with Azure Workload Identity annotations for the PLS managed identity and set an environment variable (e.g., `AZURE_PLS_CLIENT_ID`) on the HO pod so the platform controller can authenticate to Azure.

This follows the same pattern as AWS (`AWS_SHARED_CREDENTIALS_FILE`, `AWS_REGION`) and GCP (`GCP_PROJECT`, `GCP_REGION`), where the install command configures operator-level credentials. The identity is only needed when the HO manages clusters with non-Public endpoint access.
I'll add an "Operator Installation" section documenting the specific install command changes.
AI-assisted response via Claude Code
Add a DNS zone creation row to the AWS/GCP/Azure comparison table showing that Azure follows the GCP pattern (controller-created) rather than the AWS pattern (CLI pre-created). Also clarify in step 6 that the CPO controller creates the Private DNS Zone automatically. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clarify that Azure follows the AWS pattern: the private router deployment handles all traffic via an internal LB. For PublicAndPrivate, a separate public LB points to the same router and is destroyed when transitioning to Private. No separate KAS-specific LB is needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CPO workload identity needs additional RBAC permissions to create Private Endpoints and Private DNS Zones in the guest subscription when private endpoint access is configured. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update step 6 to create two A records in the Private DNS Zone matching the AWS pattern: one for the KAS and a wildcard for all other services (OAuth, Konnectivity, Ignition). Both resolve to the same PE IP. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the new hypershift install flag and HO deployment changes needed to configure the federated managed identity for PLS operations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The guest cluster's own subscription (from AzurePlatformSpec.SubscriptionID) is automatically allowed on the PLS. The field is now optional and only for specifying additional subscriptions, following the same pattern as AWS's additionalAllowedPrincipals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/lgtm
@bryan-cox: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Summary
Adds an enhancement proposal for private endpoint access support on self-managed Azure HyperShift clusters using Azure Private Link Service (PLS).
Key points:

- `endpointAccess` and `privateConnectivity` fields on `AzurePlatformSpec`, a new `AzurePrivateLinkService` CRD, and a `PrivateLinkService` workload identity

Tracking: CNTRLPLANE-1985
Related: Self-Managed Azure EP
🤖 Generated with Claude Code