Failing Test
Aspire.Deployment.EndToEnd.Tests.AksAzureKubernetesEnvironmentCertManagerDeploymentTests.DeployApiWithCertManagerToAzureKubernetesEnvironment
Symptom
The test fails at Step 15 ("Waiting for cert-manager to issue the certificate") after the full 10-minute wait window with the certificate never becoming Ready. From the kubectl describe certificate output captured on failure:
Conditions:
Type: Ready
Status: False
Reason: IncorrectIssuer
Message: Issuing certificate as Secret was previously issued by "Issuer.cert-manager.io/"
Type: Issuing
Status: False
Reason: Failed
Message: The certificate request has failed to complete and will be retried:
Failed to wait for order resource "api-gw-tls-1-755734676" to become ready:
order is in "errored" state:
Failed to fetch authorization: 404 urn:ietf:params:acme:error:malformed: No such authorization
Failed Issuance Attempts: 1
Last Failure Time: 2026-05-26T04:48:35Z
order.acme.cert-manager.io/api-gw-tls-1-755734676 errored 10m
Root Cause
This is an ACME / Let's Encrypt staging flake interacting with cert-manager's retry policy:
- cert-manager created the ACME
Order (api-gw-tls-1-755734676).
- While polling for the order's authorization status, Let's Encrypt staging returned
404 urn:ietf:params:acme:error:malformed: No such authorization. This is a known intermittent server-side issue (the authorization resource is either expired, GC'd, or not yet visible due to backend replication).
- cert-manager marked the
Order as errored and incremented failedIssuanceAttempts to 1.
- cert-manager's default retry interval for a failed
CertificateRequest/Order uses exponential backoff starting at ~1 hour, which is far beyond the test's 10-minute wait (WaitUntil(..., TimeSpan.FromMinutes(11)) at line 268).
Because cert-manager will not retry within the test timeout, the test inevitably times out once the first ACME attempt errors.
Affected Runs
This is the first observed failure of this specific test (the TypeScript variant has its own separate root cause tracked in #17427, which was actually fixed by #17434).
Possible Mitigations (need validation)
These are not high-confidence fixes — listing them for discussion:
- Force-retry on transient ACME failure: in Step 15's wait loop, detect
order in errored state and delete the failed CertificateRequest (or Order) so cert-manager immediately generates a new one. This sidesteps the 1-hour backoff but adds complexity, and consecutive ACME flakes would still time out.
- Lengthen the timeout to ~70 min so the natural cert-manager retry can run. Significantly extends the worst-case test runtime.
- Quarantine the test and treat it as best-effort until Let's Encrypt staging reliability improves or the test uses a self-signed/local ACME server (e.g., Pebble) for the HTTP-01 path.
No fix PR is being opened at this time — the underlying flake is on the ACME server side and there is no high-confidence change to the test or product code that is guaranteed to eliminate it.
This issue was created automatically by a workflow analyzing Deployment E2E test failures on main.
Failing Test
Aspire.Deployment.EndToEnd.Tests.AksAzureKubernetesEnvironmentCertManagerDeploymentTests.DeployApiWithCertManagerToAzureKubernetesEnvironmentSymptom
The test fails at Step 15 ("Waiting for cert-manager to issue the certificate") after the full 10-minute wait window with the certificate never becoming
Ready. From thekubectl describe certificateoutput captured on failure:Root Cause
This is an ACME / Let's Encrypt staging flake interacting with cert-manager's retry policy:
Order(api-gw-tls-1-755734676).404 urn:ietf:params:acme:error:malformed: No such authorization. This is a known intermittent server-side issue (the authorization resource is either expired, GC'd, or not yet visible due to backend replication).Orderaserroredand incrementedfailedIssuanceAttemptsto1.CertificateRequest/Orderuses exponential backoff starting at ~1 hour, which is far beyond the test's 10-minute wait (WaitUntil(..., TimeSpan.FromMinutes(11))at line 268).Because cert-manager will not retry within the test timeout, the test inevitably times out once the first ACME attempt errors.
Affected Runs
This is the first observed failure of this specific test (the TypeScript variant has its own separate root cause tracked in #17427, which was actually fixed by #17434).
Possible Mitigations (need validation)
These are not high-confidence fixes — listing them for discussion:
orderinerroredstate anddeletethe failedCertificateRequest(orOrder) so cert-manager immediately generates a new one. This sidesteps the 1-hour backoff but adds complexity, and consecutive ACME flakes would still time out.No fix PR is being opened at this time — the underlying flake is on the ACME server side and there is no high-confidence change to the test or product code that is guaranteed to eliminate it.