Skip to content

E2E Deployment Test Failure: Aspire.Deployment.EndToEnd.Tests.AksAzureKubernetesEnvironmentCertManagerDeploymentTests.DeployApiWithCertManagerToAzureKubernetesEnvironment #17515

@mitchdenny

Description

@mitchdenny

Failing Test

Aspire.Deployment.EndToEnd.Tests.AksAzureKubernetesEnvironmentCertManagerDeploymentTests.DeployApiWithCertManagerToAzureKubernetesEnvironment

Symptom

The test fails at Step 15 ("Waiting for cert-manager to issue the certificate") after the full 10-minute wait window with the certificate never becoming Ready. From the kubectl describe certificate output captured on failure:

Conditions:
  Type:    Ready
  Status:  False
  Reason:  IncorrectIssuer
  Message: Issuing certificate as Secret was previously issued by "Issuer.cert-manager.io/"

  Type:    Issuing
  Status:  False
  Reason:  Failed
  Message: The certificate request has failed to complete and will be retried:
           Failed to wait for order resource "api-gw-tls-1-755734676" to become ready:
           order is in "errored" state:
           Failed to fetch authorization: 404 urn:ietf:params:acme:error:malformed: No such authorization

Failed Issuance Attempts: 1
Last Failure Time:         2026-05-26T04:48:35Z

order.acme.cert-manager.io/api-gw-tls-1-755734676   errored   10m

Root Cause

This is an ACME / Let's Encrypt staging flake interacting with cert-manager's retry policy:

  1. cert-manager created the ACME Order (api-gw-tls-1-755734676).
  2. While polling for the order's authorization status, Let's Encrypt staging returned 404 urn:ietf:params:acme:error:malformed: No such authorization. This is a known intermittent server-side issue (the authorization resource is either expired, GC'd, or not yet visible due to backend replication).
  3. cert-manager marked the Order as errored and incremented failedIssuanceAttempts to 1.
  4. cert-manager's default retry interval for a failed CertificateRequest/Order uses exponential backoff starting at ~1 hour, which is far beyond the test's 10-minute wait (WaitUntil(..., TimeSpan.FromMinutes(11)) at line 268).

Because cert-manager will not retry within the test timeout, the test inevitably times out once the first ACME attempt errors.

Affected Runs

This is the first observed failure of this specific test (the TypeScript variant has its own separate root cause tracked in #17427, which was actually fixed by #17434).

Possible Mitigations (need validation)

These are not high-confidence fixes — listing them for discussion:

  1. Force-retry on transient ACME failure: in Step 15's wait loop, detect order in errored state and delete the failed CertificateRequest (or Order) so cert-manager immediately generates a new one. This sidesteps the 1-hour backoff but adds complexity, and consecutive ACME flakes would still time out.
  2. Lengthen the timeout to ~70 min so the natural cert-manager retry can run. Significantly extends the worst-case test runtime.
  3. Quarantine the test and treat it as best-effort until Let's Encrypt staging reliability improves or the test uses a self-signed/local ACME server (e.g., Pebble) for the HTTP-01 path.

No fix PR is being opened at this time — the underlying flake is on the ACME server side and there is no high-confidence change to the test or product code that is guaranteed to eliminate it.

This issue was created automatically by a workflow analyzing Deployment E2E test failures on main.

Metadata

Metadata

Assignees

No one assigned

    Labels

    needs-area-labelAn area label is needed to ensure this gets routed to the appropriate area ownerstriage:bot-seenAspire triage bot has seen this issue

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions