Route promotes a Revision to 100% traffic while it is still below initial-scale (latestReadyRevision latches an under-scaled Revision) #16649

@juanpark-dandy

Description

/area autoscale
/area API

/kind spec

What version of Knative?

Knative Serving v1.20.0 (observed). Root-cause code is unchanged on main and in the latest
release knative-v1.22.1, so all released versions through v1.22.1 are affected.
Related (not duplicates): #11531, #2674, #11373.

Expected Behavior

Per the scale-bounds docs,
initial-scale is "the initial target scale a Revision must reach ... before it is marked as Ready."
A new Revision should not become Ready, should not be promoted to
Configuration.status.latestReadyRevisionName, and should not receive route traffic until it reaches
its initial-scale. Traffic should stay on the previous, fully-scaled Revision until then.

Actual Behavior

A new Revision can be marked Ready=True and latched as latestReadyRevisionName while still far below
its initial-scale, so the Route shifts 100% of traffic to an under-provisioned Revision. Because
latestReadyRevisionName is monotonic, the route never reverts — even after the Revision later flips
Ready=False (e.g. ProgressDeadlineExceeded). In production this abandoned a healthy fully-scaled
Revision for a new one running at ~15% of target replicas, which then returned 503/504s.
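
To illustrate the latching behavior described above, here is a minimal toy model in Go. It is a sketch under stated assumptions, not the Knative implementation: the `revision` and `config` types, the `observe` method, and the revision names `rt-00001` / `rt-00002` are all hypothetical, and "newer" is reduced to a plain integer generation.

```go
package main

import "fmt"

// revision is a toy stand-in for a Knative Revision (hypothetical type);
// only the fields needed for this illustration are modelled.
type revision struct {
	name       string
	generation int  // higher means newer
	ready      bool // Ready=True at the moment of observation
}

// config is a toy stand-in for Configuration.status (hypothetical type).
type config struct {
	latestReadyName string
	latestReadyGen  int
}

// observe advances the latch when a newer Revision is Ready. It never moves
// backwards: an older Revision cannot reclaim the slot, and a Revision that
// later becomes not-Ready is not removed from it.
func (c *config) observe(revs []revision) {
	for _, r := range revs {
		if r.ready && r.generation > c.latestReadyGen {
			c.latestReadyName = r.name
			c.latestReadyGen = r.generation
		}
	}
}

func main() {
	cfg := &config{}
	v1 := revision{name: "rt-00001", generation: 1, ready: true}
	v2 := revision{name: "rt-00002", generation: 2, ready: false}

	cfg.observe([]revision{v1, v2})
	fmt.Println(cfg.latestReadyName) // rt-00001: baseline holds the slot

	v2.ready = true // briefly Ready while still under-scaled
	cfg.observe([]revision{v1, v2})
	fmt.Println(cfg.latestReadyName) // rt-00002: latch advances

	v2.ready = false // e.g. ProgressDeadlineExceeded
	cfg.observe([]revision{v1, v2})
	fmt.Println(cfg.latestReadyName) // rt-00002: latch does not revert
}
```

In this model a single observation of Ready=True on the newer Revision is enough to move the latch permanently, which is why a transient, under-scaled Ready=True is so costly.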

Root cause (same on release-1.20 and main):

  • Revision readiness is not gated on initial-scale —
    pkg/apis/serving/v1/revision_lifecycle.go:
    revisionCondSet = apis.NewLivingConditionSet(ResourcesAvailable, ContainerHealthy) (no
    ScaleTargetInitialized/Active term).

  • ResourcesAvailable comes from the Deployment's Progressing condition, not Available:
    in pkg/reconciler/revision/reconcile_resources.go, PropagateDeploymentStatus (via
    TransformDeploymentStatus) reads {Progressing, ReplicaSetReady}, never appsv1.DeploymentAvailable.
    So ResourcesAvailable=True at any replica count, and ContainerHealthy=True once ReadyReplicas > 0.

  • The only step that holds Ready back below initial-scale (PropagateAutoscalerStatus in reconcilePA,
    which pulls ResourcesAvailable back to Unknown while !IsScaleTargetInitialized()) is skipped
    when an earlier phase errors. The reconcile is an early-return phase loop
    (pkg/reconciler/revision/revision.go):

    for _, phase := range []func(context.Context, *v1.Revision) error{
        c.reconcileDeployment,   // sets ResourcesAvailable=True (Progressing) + ContainerHealthy=True
        c.reconcileImageCache,   // if this errors, the loop returns...
        c.reconcilePA,           // ...so reconcilePA (which would reset ResourcesAvailable) is SKIPPED
    } {
        if err := phase(ctx, rev); err != nil { return err }
    }
    
    A transient reconcileImageCache error (e.g. createImageCache AlreadyExists from image-lister lag,
    common under heavy concurrent-write churn) persists the below-initial-scale Ready=True. Then
    pkg/reconciler/configuration/configuration.go findAndSetLatestReadyRevision advances
    latestReadyRevisionName to it and never reverts (monotonic); the Route follows.
    (Note: a reconcileDeployment Update conflict does NOT trigger this: it returns before
    PropagateDeploymentStatus, so ResourcesAvailable is never set True that cycle.)
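
The interaction of the three bullets above can be reproduced in a small self-contained Go model. This is a sketch only: the `rev` type, the `deploymentPhase` / `imageCachePhase` / `paPhase` functions and the `run` helper are hypothetical simplifications, and the model assumes (as the report describes) that status mutated before the failing phase is what ends up persisted.

```go
package main

import (
	"errors"
	"fmt"
)

// condition is a simplified condition status.
type condition string

const (
	condUnknown condition = "Unknown"
	condTrue    condition = "True"
)

// rev is a toy Revision carrying only what this sketch needs (hypothetical).
type rev struct {
	resourcesAvailable condition
	containerHealthy   condition
	scaleInitialized   bool // stand-in for IsScaleTargetInitialized()
}

// ready mirrors a two-term condition set: no scale term is involved.
func (r *rev) ready() bool {
	return r.resourcesAvailable == condTrue && r.containerHealthy == condTrue
}

type phase func(*rev) error

// reconcile runs phases in order and returns on the first error, so later
// phases never see the Revision in that cycle.
func reconcile(r *rev, phases []phase) error {
	for _, p := range phases {
		if err := p(r); err != nil {
			return err
		}
	}
	return nil
}

// deploymentPhase models reconcileDeployment marking both conditions True.
func deploymentPhase(r *rev) error {
	r.resourcesAvailable = condTrue
	r.containerHealthy = condTrue
	return nil
}

// imageCachePhase models reconcileImageCache, optionally failing.
func imageCachePhase(fail bool) phase {
	return func(*rev) error {
		if fail {
			return errors.New("image cache already exists")
		}
		return nil
	}
}

// paPhase models reconcilePA pulling ResourcesAvailable back to Unknown
// while the scale target is not yet initialized.
func paPhase(r *rev) error {
	if !r.scaleInitialized {
		r.resourcesAvailable = condUnknown
	}
	return nil
}

// run reconciles a fresh, under-scaled Revision and reports its readiness.
func run(imageCacheFails bool) (bool, error) {
	r := &rev{resourcesAvailable: condUnknown, containerHealthy: condUnknown}
	err := reconcile(r, []phase{deploymentPhase, imageCachePhase(imageCacheFails), paPhase})
	return r.ready(), err
}

func main() {
	fmt.Println(run(false)) // false <nil>
	fmt.Println(run(true))  // true image cache already exists
}
```

With all phases succeeding, the last phase resets readiness and the under-scaled Revision stays not-Ready. When the middle phase fails, the reset never runs and the Revision is left Ready in this model.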

Steps to Reproduce the Problem

In production this fires stochastically under churn. To make it deterministic, the steps below
(a) wedge a new Revision below initial-scale and (b) force reconcileImageCache to error via a
ResourceQuota (a stand-in for the transient AlreadyExists). Requires image caching enabled
(caching.internal.knative.dev Image CRD + controller).

  1. Label 3 schedulable nodes and create a namespace:
    kubectl label node <node-name> repro=true   # repeat for each of the 3 nodes
    kubectl create ns repro
  2. Deploy a healthy baseline Revision at scale 1, pinned to the labeled nodes; wait until Ready:
    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata: { name: rt, namespace: repro }
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/initial-scale: "1"
            autoscaling.knative.dev/min-scale: "1"
            autoscaling.knative.dev/max-scale: "1"
        spec:
          nodeSelector: { repro: "true" }
          containers:
          - image: ghcr.io/knative/helloworld-go:latest
            env: [{ name: TARGET, value: "v1" }]
  3. Cap the namespace's Image count at its current value so the next Revision's image cache can't be created:
    kubectl -n repro create quota capimg \
      --hard=count/images.caching.internal.knative.dev=$(kubectl -n repro get images.caching.internal.knative.dev --no-headers | wc -l)
  4. Roll a new Revision that wants initial-scale=6 but can only schedule 2 pods (one per labeled node;
    one node is held by the baseline) — wedging it at 2/6:

    Re-apply the same Service with:

    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/initial-scale: "6"
            autoscaling.knative.dev/min-scale: "6"
            autoscaling.knative.dev/max-scale: "6"
        spec:
          nodeSelector: { repro: "true" }
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels: { serving.knative.dev/service: rt }
          containers:
          - image: ghcr.io/knative/helloworld-go:latest
            env: [{ name: TARGET, value: "v2" }] # forces a new Revision
  5. Observe the new Revision become Ready=True and capture the route while below initial-scale:
    kubectl -n repro get revision # new rev READY=True
    kubectl -n repro get pods # 2 Running, 4 Pending (wedged at 2/6)
    kubectl -n repro get images.caching.internal.knative.dev # only the baseline's image exists
    kubectl -n repro get configuration rt -o jsonpath='{.status.latestReadyRevisionName}' # -> new rev
    kubectl -n repro get route rt -o jsonpath='{.status.traffic}' # -> 100% new rev

  6. The new Revision is Ready=True at 2/6 replicas with no image cache; latestReadyRevisionName and the
    Route both point at it, and never revert to the fully-scaled baseline.

Suggested fix

Make readiness reflect replica sufficiency: include the Deployment's Available condition in
TransformDeploymentStatus, or gate Revision Ready / findAndSetLatestReadyRevision advancement on the
PodAutoscaler's ScaleTargetInitialized (enforcing the documented initial-scale semantics). Either
prevents latestReadyRevisionName from advancing to an under-scaled Revision.
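
As a rough illustration of the second option, the sketch below contrasts ungated and gated readiness on the 2/6 state from the repro. It is a toy model under stated assumptions: the `revisionStatus` type and its methods are hypothetical, and `scaleTargetInitialized` is reduced to a replica-count comparison rather than the PodAutoscaler's actual condition.

```go
package main

import "fmt"

// revisionStatus is a toy model; the field names are illustrative and do not
// match the real Knative types.
type revisionStatus struct {
	resourcesAvailable bool
	containerHealthy   bool
	readyReplicas      int
	initialScale       int
}

// scaleTargetInitialized is a simplified stand-in for the PodAutoscaler's
// ScaleTargetInitialized condition: true once initial-scale has been reached.
func (s revisionStatus) scaleTargetInitialized() bool {
	return s.readyReplicas >= s.initialScale
}

// readyUngated models the two-term readiness described in the root cause.
func (s revisionStatus) readyUngated() bool {
	return s.resourcesAvailable && s.containerHealthy
}

// readyGated adds the scale term, so promotion waits for initial-scale.
func (s revisionStatus) readyGated() bool {
	return s.readyUngated() && s.scaleTargetInitialized()
}

func main() {
	wedged := revisionStatus{
		resourcesAvailable: true,
		containerHealthy:   true,
		readyReplicas:      2,
		initialScale:       6,
	}
	fmt.Println(wedged.readyUngated(), wedged.readyGated()) // true false

	scaled := wedged
	scaled.readyReplicas = 6
	fmt.Println(scaled.readyUngated(), scaled.readyGated()) // true true
}
```

Because the gate is a term in the readiness predicate itself rather than a later reset step, it holds in this model regardless of which reconcile phase errors.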

Labels

kind/bug: Categorizes issue or PR as related to a bug.
