CNTRLPLANE-1985: Azure Private Topology for Self-Managed HyperShift #1949

Merged
openshift-merge-bot[bot] merged 20 commits into openshift:master from bryan-cox:azure-private-topology
Mar 4, 2026

@bryan-cox
Member

Summary

Adds an enhancement proposal for private endpoint access support on self-managed Azure HyperShift clusters using Azure Private Link Service (PLS).

Key points:

  • Delivers three endpoint access modes: Public (default), PublicAndPrivate, and Private — matching existing AWS and GCP private topology support
  • Introduces endpointAccess and privateConnectivity fields on AzurePlatformSpec, a new AzurePrivateLinkService CRD, and a PrivateLinkService workload identity
  • Follows the established split-controller pattern: CPO Observer creates CRD, HO creates PLS (management-side), CPO creates PE + DNS (customer-side)
  • API designed to accommodate future OAuth dedicated private LB without breaking changes
  • Self-managed Azure only — ARO HCP is unaffected

Tracking: CNTRLPLANE-1985

Related: Self-Managed Azure EP


🤖 Generated with Claude Code

@openshift-ci openshift-ci Bot requested review from csrwng and sjenning February 26, 2026 12:22
@bryan-cox bryan-cox changed the title Enhancement: Azure Private Topology for Self-Managed HyperShift CNTRLPLANE-1985: Azure Private Topology for Self-Managed HyperShift Feb 26, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 26, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 26, 2026

@bryan-cox: This pull request references CNTRLPLANE-1985 which is a valid jira issue.

@bryan-cox bryan-cox force-pushed the azure-private-topology branch 6 times, most recently from 800c79b to c07417d Compare February 26, 2026 14:59
@bryan-cox
Member Author

/retest

@bryan-cox
Member Author

/retest ci/prow/markdownlint

@openshift-ci
Contributor

openshift-ci Bot commented Feb 26, 2026

@bryan-cox: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test markdownlint

Use /test all to run all jobs.

Adds an enhancement proposal for CNTRLPLANE-1985 to support private
endpoint access for self-managed Azure HyperShift clusters using Azure
Private Link Service.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bryan-cox bryan-cox force-pushed the azure-private-topology branch from c07417d to d0f43de Compare March 2, 2026 17:54
@sjenning
Contributor

sjenning commented Mar 2, 2026

/approve

@openshift-ci
Contributor

openshift-ci Bot commented Mar 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sjenning

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 2, 2026
Contributor

@csrwng csrwng left a comment

Some comments/questions. I also don't see any mention of how non-kas services will be exposed (konnectivity, ignition, oauth)

Comment thread enhancements/hypershift/azure-private-topology.md
| HO creates | VPC Endpoint Service | Service Attachment | PLS |
| CPO creates | VPC Endpoint + SG + DNS | PSC Endpoint + IP + DNS | PE + DNS |

#### API Design: endpointAccess Struct
Member

Can we mention how this intersects with the Services []ServicePublishingStrategyMapping `json:"services"` knob?

Member Author

Added a new section "Interaction with Services (ServicePublishingStrategyMapping)" that explains how endpointAccess and Services are complementary — endpointAccess controls visibility (public vs private) while Services controls how each service is exposed (Route, LoadBalancer, etc.). The two are independent and reconciled separately by the CPO, matching how AWS handles EndpointAccess and Services as independent fields (infra.go:198-297).

Key point: this enhancement requires KAS via Route (the default for Azure), which lets all services share a single internal LB and only one PLS per cluster. KAS via dedicated LoadBalancer would need 2 PLSes and is scoped as a Non-Goal.

Assisted by Claude Code

Member

let's plan to lock the ServicePublishingStrategy mapping via CEL to only what we support, i.e. all routes.
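
The rule proposed here would live in a CEL validation on the API type. As a rough illustration of the rule itself, the sketch below expresses the same "all routes" check in plain Go rather than CEL. The type and constant names are simplified, hypothetical stand-ins, not the actual HyperShift API types.

```go
package main

import "fmt"

// PublishingStrategyType and ServiceMapping are simplified, hypothetical
// stand-ins for the real ServicePublishingStrategyMapping types.
type PublishingStrategyType string

const (
	Route        PublishingStrategyType = "Route"
	LoadBalancer PublishingStrategyType = "LoadBalancer"
)

type ServiceMapping struct {
	Service string
	Type    PublishingStrategyType
}

// validateAllRoutes returns an error naming the first service that is not
// published via Route, and nil when every service uses Route.
func validateAllRoutes(services []ServiceMapping) error {
	for _, s := range services {
		if s.Type != Route {
			return fmt.Errorf("service %q uses %q; private endpoint access requires Route", s.Service, s.Type)
		}
	}
	return nil
}

func main() {
	ok := []ServiceMapping{{"APIServer", Route}, {"OAuthServer", Route}}
	bad := []ServiceMapping{{"APIServer", LoadBalancer}, {"Ignition", Route}}
	fmt.Println(validateAllRoutes(ok))
	fmt.Println(validateAllRoutes(bad))
}
```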

bryan-cox and others added 13 commits March 3, 2026 08:25
Switch from KAS-specific internal LB to KAS via Route on the private
router. All private services (KAS, OAuth, Konnectivity, Ignition) share
the router's single internal LB, requiring only 1 PLS per cluster.

Update endpointAccess from immutable to mutable between PublicAndPrivate
and Private, consistent with AWS behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detail how controllers avoid unnecessary Azure API calls and hot loops,
following AWS/GCP patterns: exponential backoff (3s-30s), error-specific
requeue intervals, idempotent check-before-create, status-based caching,
drift detection (5min), and sanitized condition messages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
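
For readers who want a concrete picture of the requeue behavior this commit describes, here is a minimal sketch. The 3s base, 30s cap, and 5min drift interval come from the commit message; the doubling factor, the function name, and the failure-count input are assumptions made for illustration and are not taken from the proposal.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay     = 3 * time.Second  // lower bound from the commit message
	maxDelay      = 30 * time.Second // upper bound from the commit message
	driftInterval = 5 * time.Minute  // periodic drift detection
)

// requeueAfter returns the delay before the next reconcile attempt.
// failures counts consecutive failed attempts; zero means the last reconcile
// succeeded, so only the periodic drift check is scheduled.
// Doubling per failure is an assumed growth factor.
func requeueAfter(failures int) time.Duration {
	if failures <= 0 {
		return driftInterval
	}
	d := baseDelay
	for i := 1; i < failures; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for f := 0; f <= 6; f++ {
		fmt.Printf("failures=%d requeueAfter=%s\n", f, requeueAfter(f))
	}
}
```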
Back up every claim about AWS/GCP behavior in the error handling section
with specific file:line references from openshift/hypershift. Clarify
where Azure follows GCP patterns vs AWS patterns (e.g., error-specific
requeue intervals follow GCP's handleGCPError, not AWS's flat backoff).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The HO's federated managed identity for PLS operations is configured at
operator deployment, not per-HostedCluster. This matches how AWS uses
AWS_SHARED_CREDENTIALS_FILE env vars on the HO (controller.go:109-113)
and GCP uses GCP_PROJECT/GCP_REGION env vars
(privateserviceconnect_controller.go:55-73).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend Goal 4 to explicitly accommodate current and future managed
services scenarios (e.g., ARO HCP) in addition to self-managed
extensions like dedicated OAuth LB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ARO HCP uses Swift, not Azure Private Link Service. Reword to avoid
implying they are the same mechanism.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clarify that the CPO Observer populates guestSubnetID from the NodePool
and the CPO Controller uses it when creating the PE, similar to how AWS
collects subnet IDs from NodePools into AWSEndpointService.Spec.SubnetIDs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Swift is Azure's CNI infrastructure for VNet injection tied to AKS.
Self-managed HyperShift runs on standard OpenShift, so PLS is the
appropriate cross-VNet private connectivity mechanism.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document that the HostedCluster controller aggregates
AzurePrivateLinkService CR conditions into HC status, following
computeAWSEndpointServiceCondition (hostedcluster_controller.go:3116).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review feedback about the chicken-egg credential problem during
deletion and the need to reject PE connections before PLS deletion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add golang type definitions for AzureEndpointAccessSpec,
AzurePrivateConnectivityConfig, and AzurePrivateLinkService CRD
alongside the prose API descriptions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IsPrivateHCP() returns true for ARO HCP with Swift, so using it alone
would incorrectly activate PLS controllers for ARO clusters. Gate on
the presence of endpointAccess on AzurePlatformSpec instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
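
The gating change in this commit can be pictured roughly as follows. This is a sketch under stated assumptions: the struct shapes are simplified stand-ins, the field and function names are hypothetical, and treating an explicit Public value as "PLS controllers off" is an assumption based on the statement that the PLS identity is only needed for non-Public access.

```go
package main

import "fmt"

// AzureEndpointAccessSpec is named in the PR; its shape here is a
// simplified assumption.
type AzureEndpointAccessSpec struct {
	Type string // "Public", "PublicAndPrivate", or "Private"
}

// AzurePlatformSpec is reduced to the single field relevant to gating.
type AzurePlatformSpec struct {
	// EndpointAccess is nil on clusters that never set the new field,
	// which is assumed here to include ARO HCP clusters using Swift.
	EndpointAccess *AzureEndpointAccessSpec
}

// plsControllersEnabled gates on the presence of endpointAccess with a
// non-Public type, rather than on a general "is this HCP private" check.
func plsControllersEnabled(spec AzurePlatformSpec) bool {
	if spec.EndpointAccess == nil {
		return false
	}
	t := spec.EndpointAccess.Type
	return t == "PublicAndPrivate" || t == "Private"
}

func main() {
	aroLike := AzurePlatformSpec{} // private via Swift, field unset
	selfManaged := AzurePlatformSpec{EndpointAccess: &AzureEndpointAccessSpec{Type: "Private"}}
	fmt.Println(plsControllersEnabled(aroLike), plsControllersEnabled(selfManaged))
}
```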
Explain how endpointAccess (visibility) and Services
(ServicePublishingStrategyMapping) are complementary and independently
reconciled, matching the AWS pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@csrwng csrwng left a comment

some more comments, looks better

| CRD | AWSEndpointService | GCPPrivateServiceConnect | AzurePrivateLinkService |
| Management-side resource | VPC Endpoint Service | Service Attachment | Private Link Service |
| Customer-side resource | VPC Endpoint + SG | PSC Endpoint (Forwarding Rule) | Private Endpoint |
| DNS | Route53 Private Zone | Cloud DNS | Private DNS Zone |
Contributor

It would be good to also specify here how the private zone is created.
In AWS, the customer is expected to create their own private/local zone. Today our CLI creates it when creating hosted cluster infra resources.
In GCP, the controller creates it on behalf of the customer.
In Azure, it seems we're saying we'll follow the GCP pattern. It'd be good to state that.

Member Author

Good call. Yes, Azure follows the GCP pattern here — the CPO controller creates the Private DNS Zone on behalf of the customer as part of step 6 in the workflow. The customer doesn't need to pre-create the zone.

I'll add a row to the comparison table clarifying this:

| DNS zone creation | CLI (create infra) pre-creates private zone | Controller creates Cloud DNS zone | Controller creates Private DNS Zone |

I'll also add a note in the workflow description (step 6) making this explicit: the CPO controller creates the Private DNS Zone automatically, so no customer pre-provisioning is required.


AI-assisted response via Claude Code

both a public KAS endpoint and the private router path needs to be verified
against the AWS `PublicAndPrivate` implementation. With the Route-based
approach, the private path goes through the router's internal LB while the
public KAS endpoint may remain via its existing public LB.
Contributor

I would expect the public endpoint to also be served by the internal router just as in AWS. We have an internal LB for private link, and a separate public LB is created/destroyed depending on whether your cluster is Private or PublicAndPrivate. Both LBs point to the one router deployment. A load balancer for the KAS is not needed.

Member Author

Agreed — the Azure design should follow the same pattern as AWS. The private router deployment with its internal LB handles all traffic (KAS, OAuth, Konnectivity, Ignition). For PublicAndPrivate, a separate public LB is created pointing to the same router deployment, and it's destroyed when transitioning to Private. No separate KAS-specific load balancer is needed.

I'll update the open question to reflect this resolved design and incorporate the pattern into the proposal section.
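
As a sketch of the load balancer lifecycle agreed on in this thread, the function below maps an endpoint access mode to the load balancers that should exist in front of the single router deployment. The function name and return shape are hypothetical; only the two modes discussed here are handled, and anything else is reported as out of scope rather than guessed at.

```go
package main

import "fmt"

// routerLBsFor reports which load balancers should front the shared private
// router deployment for a given endpoint access mode.
//   - PublicAndPrivate: internal LB (for Private Link) plus a public LB
//   - Private: internal LB only; the public LB is removed
func routerLBsFor(mode string) (internal, public bool, err error) {
	switch mode {
	case "PublicAndPrivate":
		return true, true, nil
	case "Private":
		return true, false, nil
	default:
		return false, false, fmt.Errorf("mode %q is outside the private router design discussed here", mode)
	}
}

func main() {
	for _, m := range []string{"PublicAndPrivate", "Private"} {
		internal, public, _ := routerLBsFor(m)
		fmt.Printf("%s: internalLB=%v publicLB=%v\n", m, internal, public)
	}
}
```

A transition from PublicAndPrivate to Private then amounts to the public flag flipping to false while the internal LB and the router stay untouched.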


AI-assisted response via Claude Code

`guestSubnetID` populated by the CPO Observer) targeting the PLS
- A Private DNS Zone with an A record mapping the KAS hostname to the PE's
private IP
It updates the CR status with the PE and DNS resource IDs.
Contributor

Are additional permissions needed for the control plane workload identity to perform these operations?

Member Author

Yes, the existing CPO workload identity will need additional Azure RBAC permissions to create Private Endpoints and Private DNS Zones in the guest subscription. Specifically:

  • Microsoft.Network/privateEndpoints/* — to create and manage the Private Endpoint in the guest VNet
  • Microsoft.Network/privateDnsZones/* — to create and manage the Private DNS Zone and A records

These are not part of the default CPO workload identity permissions today. I'll add a section documenting the required additional RBAC assignments for the CPO identity when private endpoint access is configured.


AI-assisted response via Claude Code

6. The CPO Controller sees the PLS alias in the CR status and creates:
- A Private Endpoint in the guest VNet's worker subnet (from the
`guestSubnetID` populated by the CPO Observer) targeting the PLS
- A Private DNS Zone with an A record mapping the KAS hostname to the PE's
Contributor

For AWS, two records are created, one for the KAS and another for everything else:

api.[cluster-name].hypershift.local
*.apps.[cluster-name].hypershift.local

Member Author

Good point. Since all private services (KAS, OAuth, Konnectivity, Ignition) go through the same private router and PLS, Azure should follow the same pattern and create two DNS records in the Private DNS Zone:

api.<cluster-name>.hypershift.local → PE private IP
*.apps.<cluster-name>.hypershift.local → PE private IP

Both resolve to the same Private Endpoint IP since everything routes through the single router. I'll update step 6 and the DNS documentation to reflect this.
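
A minimal sketch of how the two record names could be derived, assuming the `hypershift.local` suffix shown above. The `aRecord` type and `privateDNSRecords` function are hypothetical helpers for illustration and do not correspond to any Azure SDK call.

```go
package main

import "fmt"

// aRecord is a hypothetical in-memory representation of a DNS A record.
type aRecord struct {
	Name string
	IP   string
}

// privateDNSRecords builds the KAS record and the wildcard record for a
// cluster; both point at the same Private Endpoint IP.
func privateDNSRecords(clusterName, peIP string) []aRecord {
	zone := clusterName + ".hypershift.local"
	return []aRecord{
		{Name: "api." + zone, IP: peIP},
		{Name: "*.apps." + zone, IP: peIP},
	}
}

func main() {
	// The cluster name and IP are example values only.
	for _, r := range privateDNSRecords("demo", "10.0.0.5") {
		fmt.Printf("%s -> %s\n", r.Name, r.IP)
	}
}
```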


AI-assisted response via Claude Code

(`hypershift-operator/controllers/platform/gcp/privateserviceconnect_controller.go:55-73`)

The identity is only required when the HO manages clusters with non-Public
endpoint access.
Contributor

What changes will be needed for the hypershift install command?
What environment variables will need to be set on the HO for it to use the managed identity?

Member Author

The hypershift install command will need a new flag to configure the HO's federated managed identity for PLS operations. Following the existing patterns:

  1. New install flag: --azure-pls-managed-identity-client-id (or similar) to pass the client ID of the managed identity that has Network Contributor RBAC on the management resource group.

  2. HO deployment changes: The install command would configure the HO service account with Azure Workload Identity annotations for the PLS managed identity and set an environment variable (e.g., AZURE_PLS_CLIENT_ID) on the HO pod so the platform controller can authenticate to Azure.

This follows the same pattern as AWS (AWS_SHARED_CREDENTIALS_FILE, AWS_REGION) and GCP (GCP_PROJECT, GCP_REGION) where the install command configures operator-level credentials. The identity is only needed when the HO manages clusters with non-Public endpoint access.

I'll add an "Operator Installation" section documenting the specific install command changes.
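
To make the flag-to-environment wiring concrete, here is a rough sketch using Go's standard `flag` package purely for self-containment; it is not the actual `hypershift install` code. The flag name and the `AZURE_PLS_CLIENT_ID` variable are the tentative names from this comment ("or similar"), and `hoEnvFromArgs` is a hypothetical helper.

```go
package main

import (
	"flag"
	"fmt"
)

// hoEnvFromArgs parses install-style arguments and returns the environment
// variables that would be set on the HO pod. The variable is only emitted
// when a client ID is supplied, matching the note that the identity is
// needed only for non-Public endpoint access.
func hoEnvFromArgs(args []string) (map[string]string, error) {
	fs := flag.NewFlagSet("install", flag.ContinueOnError)
	clientID := fs.String("azure-pls-managed-identity-client-id", "",
		"client ID of the managed identity used for PLS operations (tentative flag name)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	env := map[string]string{}
	if *clientID != "" {
		env["AZURE_PLS_CLIENT_ID"] = *clientID // tentative variable name
	}
	return env, nil
}

func main() {
	// The client ID below is a placeholder value.
	env, err := hoEnvFromArgs([]string{"--azure-pls-managed-identity-client-id=00000000-0000-0000-0000-000000000000"})
	fmt.Println(env, err)
}
```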


AI-assisted response via Claude Code

bryan-cox and others added 6 commits March 4, 2026 06:23
Add a DNS zone creation row to the AWS/GCP/Azure comparison table showing
that Azure follows the GCP pattern (controller-created) rather than the
AWS pattern (CLI pre-created). Also clarify in step 6 that the CPO
controller creates the Private DNS Zone automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clarify that Azure follows the AWS pattern: the private router deployment
handles all traffic via an internal LB. For PublicAndPrivate, a separate
public LB points to the same router and is destroyed when transitioning
to Private. No separate KAS-specific LB is needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CPO workload identity needs additional RBAC permissions to create
Private Endpoints and Private DNS Zones in the guest subscription when
private endpoint access is configured.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update step 6 to create two A records in the Private DNS Zone matching
the AWS pattern: one for the KAS and a wildcard for all other services
(OAuth, Konnectivity, Ignition). Both resolve to the same PE IP.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the new hypershift install flag and HO deployment changes
needed to configure the federated managed identity for PLS operations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The guest cluster's own subscription (from AzurePlatformSpec.SubscriptionID)
is automatically allowed on the PLS. The field is now optional and only
for specifying additional subscriptions, following the same pattern as
AWS's additionalAllowedPrincipals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
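
The allow-list behavior in this last commit can be sketched as below: the guest subscription is always included, and any additional subscriptions are appended. The function name is hypothetical, and skipping duplicates and empty entries is an assumed design choice, not something the proposal states.

```go
package main

import "fmt"

// allowedSubscriptions returns the subscription IDs permitted on the PLS:
// the guest cluster's own subscription first, then any additional ones.
// Empty strings and duplicates are dropped (an assumption of this sketch).
func allowedSubscriptions(guestSubscriptionID string, additional []string) []string {
	seen := map[string]bool{}
	out := []string{}
	for _, s := range append([]string{guestSubscriptionID}, additional...) {
		if s == "" || seen[s] {
			continue
		}
		seen[s] = true
		out = append(out, s)
	}
	return out
}

func main() {
	// Subscription IDs are placeholder values.
	fmt.Println(allowedSubscriptions("sub-guest", nil))
	fmt.Println(allowedSubscriptions("sub-guest", []string{"sub-extra", "sub-guest"}))
}
```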
@csrwng
Contributor

csrwng commented Mar 4, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 4, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Mar 4, 2026

@bryan-cox: all tests passed!

@openshift-merge-bot openshift-merge-bot Bot merged commit b2ad665 into openshift:master Mar 4, 2026
2 checks passed
@bryan-cox bryan-cox deleted the azure-private-topology branch March 5, 2026 13:29