Description
When a spot/preemptible GPU node is preempted and GKE creates a replacement, all Knative services with existing revisions attempt to schedule pods — including services that were scaled to zero before the preemption event.
This is a continuation of #12538 (closed as stale, never fixed).
Environment
- GKE regional cluster (us-central1) with spot GPU node pool (max 1 node)
- Knative Serving (latest)
- Multiple Knative services sharing a single GPU node via a custom extended resource (`yubot.io/gpu-mem`)
- Services configured with `min-scale: "0"` and `initial-scale: "0"`
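For reference, a minimal Service manifest matching this setup might look like the following sketch. The `yubot.io/gpu-mem` resource name is our extended resource; the service name, image, and resource amount are placeholders. (Note that `initial-scale: "0"` also requires `allow-zero-initial-scale: "true"` in `config-autoscaler`.)

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-trainer  # placeholder name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/initial-scale: "0"
    spec:
      containers:
        - image: gcr.io/example/llm-trainer:latest  # placeholder image
          resources:
            limits:
              yubot.io/gpu-mem: "4"  # share of GPU memory on the single spot node
```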
Steps to Reproduce
- Deploy multiple Knative services on a spot GPU node, each requesting a portion of GPU memory
- Some services are actively serving (pods running), others are scaled to zero (no pods)
- Wait for GKE to preempt the spot node
- GKE provisions a replacement spot node
- All services — including those that were at zero replicas — attempt to schedule pods simultaneously
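To reproduce without waiting for a real preemption event, deleting the VM backing the spot node behaves similarly (instance name and zone below are placeholders for our setup):

```shell
# Find the node backing the spot pool, then delete its VM to simulate preemption
kubectl get nodes -l cloud.google.com/gke-spot=true
gcloud compute instances delete <instance-name> --zone=us-central1-a

# After GKE recreates the node, watch every revision schedule at once
kubectl get pods -o wide --sort-by=.status.startTime
```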
Expected Behavior
Services that were scaled to zero (no pods) before preemption should remain at zero after the replacement node comes up. Only services that had active pods should be restored.
Actual Behavior
All services with existing revisions attempt to create pods, regardless of their pre-preemption replica count. This causes resource contention — on a shared GPU node, low-priority services (e.g., trainers) can grab GPU memory slots before high-priority services (e.g., inference), leaving critical services stuck in Pending.
Evidence
After node recreation, all pods started at the exact same timestamp:
```
image-trainer-00004  2026-03-11T04:51:27Z  (was scaled to 0 before preemption)
llm-trainer-00004    2026-03-11T04:51:27Z  (was scaled to 0 before preemption)
stt-00005            2026-03-11T04:51:27Z  (was running before preemption)
tts-00006            2026-03-11T04:51:27Z  (was running before preemption)
llm-00008            Pending               (was running, now blocked by trainers)
```
All trainer revisions were created with initial-scale: "0" and had no active pods before the preemption. They should not have restarted.
Workarounds
- PriorityClass: Assign higher priority to critical services so K8s preempts low-priority pods. This mitigates but doesn't prevent the unnecessary scheduling.
- Delete ksvc when not in use: Services with no ksvc object can't be resurrected. But this defeats the purpose of scale-to-zero.
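For the PriorityClass workaround, a sketch of what we use (the class name, value, and description are ours, not Knative defaults; setting `priorityClassName` on the revision template also requires the `kubernetes.podspec-priorityclassname` feature flag in `config-features`):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-critical  # assumed name
value: 1000000
preemptionPolicy: PreemptLowerPriority
description: "Inference services preempt trainers on the shared GPU node."
```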
Root Cause Hypothesis
Knative's reconciler does not persist or respect the "scaled to zero" state across node disruptions. When pods are evicted, the revision controller sees "desired replicas > 0" (from the last reconciliation before scale-down) and recreates pods, rather than checking the autoscaler's current decision.
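One way to check this hypothesis is to compare the autoscaler's recorded decision against what the revision controller actually does, via the per-revision PodAutoscaler resource:

```shell
# If the autoscaler state survived the disruption, desiredScale should still
# be 0 for the trainer revisions while their pods are being recreated
kubectl get podautoscaler -o custom-columns=NAME:.metadata.name,DESIRED:.status.desiredScale,ACTUAL:.status.actualScale
```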
/kind bug