E2E Deployment Test Failure: Aspire.Deployment.EndToEnd.Tests.AksAzureKubernetesEnvironmentCertManagerDeploymentTests.DeployApiWithCertManagerToAzureKubernetesEnvironment

## Failing Test

`Aspire.Deployment.EndToEnd.Tests.AksAzureKubernetesEnvironmentCertManagerDeploymentTests.DeployApiWithCertManagerToAzureKubernetesEnvironment`

## Symptom

The test fails at **Step 15** ("Waiting for cert-manager to issue the certificate") after the full 10-minute wait window with the certificate never becoming `Ready`. From the `kubectl describe certificate` output captured on failure:

```
Conditions:
  Type:    Ready
  Status:  False
  Reason:  IncorrectIssuer
  Message: Issuing certificate as Secret was previously issued by "Issuer.cert-manager.io/"

  Type:    Issuing
  Status:  False
  Reason:  Failed
  Message: The certificate request has failed to complete and will be retried:
           Failed to wait for order resource "api-gw-tls-1-755734676" to become ready:
           order is in "errored" state:
           Failed to fetch authorization: 404 urn:ietf:params:acme:error:malformed: No such authorization

Failed Issuance Attempts: 1
Last Failure Time:         2026-05-26T04:48:35Z

order.acme.cert-manager.io/api-gw-tls-1-755734676   errored   10m
```

## Root Cause

This is an ACME / Let's Encrypt **staging** flake interacting with cert-manager's retry policy:

1. cert-manager created the ACME `Order` (`api-gw-tls-1-755734676`).
2. While polling for the order's authorization status, Let's Encrypt staging returned `404 urn:ietf:params:acme:error:malformed: No such authorization`. This is a known intermittent server-side issue (the authorization resource is either expired, GC'd, or not yet visible due to backend replication).
3. cert-manager marked the `Order` as `errored` and incremented `failedIssuanceAttempts` to `1`.
4. cert-manager's default retry interval for a failed `CertificateRequest`/`Order` uses exponential backoff starting at ~1 hour, which is **far beyond the test's 10-minute wait** (`WaitUntil(..., TimeSpan.FromMinutes(11))` at line 268).

Because cert-manager will not retry within the test timeout, the test inevitably times out once the first ACME attempt errors.

## Affected Runs

- https://github.com/microsoft/aspire/actions/runs/26432236139 (2026-05-26) — only failure in this nightly run

This is the first observed failure of this specific test (the TypeScript variant has its own separate root cause tracked in #17427, which was actually fixed by #17434).

## Possible Mitigations (need validation)

These are **not** high-confidence fixes — listing them for discussion:

1. **Force-retry on transient ACME failure**: in Step 15's wait loop, detect `order` in `errored` state and `delete` the failed `CertificateRequest` (or `Order`) so cert-manager immediately generates a new one. This sidesteps the 1-hour backoff but adds complexity, and consecutive ACME flakes would still time out.
2. **Lengthen the timeout** to ~70 min so the natural cert-manager retry can run. Significantly extends the worst-case test runtime.
3. **Quarantine** the test and treat it as best-effort until Let's Encrypt staging reliability improves or the test uses a self-signed/local ACME server (e.g., Pebble) for the HTTP-01 path.

No fix PR is being opened at this time — the underlying flake is on the ACME server side and there is no high-confidence change to the test or product code that is guaranteed to eliminate it.

> This issue was created automatically by a workflow analyzing Deployment E2E test failures on `main`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

E2E Deployment Test Failure: Aspire.Deployment.EndToEnd.Tests.AksAzureKubernetesEnvironmentCertManagerDeploymentTests.DeployApiWithCertManagerToAzureKubernetesEnvironment #17515

Failing Test

Symptom

Root Cause

Affected Runs

Possible Mitigations (need validation)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

E2E Deployment Test Failure: Aspire.Deployment.EndToEnd.Tests.AksAzureKubernetesEnvironmentCertManagerDeploymentTests.DeployApiWithCertManagerToAzureKubernetesEnvironment #17515

Description

Failing Test

Symptom

Root Cause

Affected Runs

Possible Mitigations (need validation)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions