What happened?
Chart 0.56.0 (commit 9cc943b1, #3149, "KEDA 2.20.1+ unsupported job scale strategy 'default'") changed the default autoscaling.scaledJobOptions.scalingStrategy.strategy from default to accurate.
accurate has a long-known over-provisioning / "node pods stay without active session" problem for scalingType: job (#2133, #2068, #1904, kedacore/keda#4833). The historical guidance in those issues was to set strategy: default. That escape hatch is now gone: KEDA 2.20 removed default from the ScaledJob CRD scalingStrategy.strategy enum (it is now only custom | accurate | eager), so the previous fix can no longer be applied on KEDA ≥ 2.20.
Net effect: upgrading to chart 0.56.0 (which also moves to the KEDA 2.20.1-based image) silently switches every node ScaledJob to accurate and reintroduces the runaway, with no default fallback.
Observed in production (KEDA 2.20.1, chart 0.56.0, scalingType: job): chrome node jobs run away to maxReplicaCount (we hit our cap of 1000 pods) and never scale back to the idle baseline. During multi-hour windows with 0 active grid sessions, ~254 node pods stayed Running. Before the 0.56.0 rollout (same config, strategy: default), node counts reliably returned to a small idle baseline after each burst.
Root cause (KEDA v2.20.1 pkg/scaling/executor/scale_jobs.go)
default: effectiveMaxScale = maxScale - runningJobCount
accurate: effectiveMaxScale = maxScale - pendingJobCount (ignores running jobs)
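As a minimal sketch, the two formulas quoted above can be written as plain functions. This models only the arithmetic as stated in this issue, not the full KEDA code path, and the function names are hypothetical.

```python
def effective_max_scale_default(max_scale: int, running_job_count: int) -> int:
    # "default" as quoted above: subtract every running job.
    return max_scale - running_job_count


def effective_max_scale_accurate(max_scale: int, pending_job_count: int) -> int:
    # "accurate" as quoted above: subtract only jobs whose pod is not yet Running.
    return max_scale - pending_job_count


# One queued session, and one job whose pod is Running but has not yet
# registered with the Distributor (so it is neither pending nor serving).
max_scale, running_job_count, pending_job_count = 1, 1, 0

default_result = effective_max_scale_default(max_scale, running_job_count)
accurate_result = effective_max_scale_accurate(max_scale, pending_job_count)
print(default_result, accurate_result)  # prints "0 1"
```

With the same inputs, default asks for no new job while accurate asks for one more.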
pendingJobCount only counts jobs whose pod has not yet reached Running. A Selenium node pod reaches Running well before it registers with the Distributor and claims its queued session (node registration + browser startup). During that window:
- the session is still in the New Session Queue, so maxScale still counts it, but
- the job is no longer pending (pod is Running), so pendingJobCount does not count it.
So accurate computes maxScale - ~0 and, at a fast pollingInterval, KEDA keeps creating a fresh duplicate job for the same still-queued session on every poll, piling up to maxReplicaCount. default never did this because it subtracts runningJobCount, which includes those Running-but-not-yet-registered pods.
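The pile-up described above can be illustrated with a toy simulation of a single queued session. It is a simplified model under stated assumptions, not KEDA's implementation: every pod is assumed to be Running by the next poll, a node registers after a fixed number of polls, and no job finishes during the window. The function and parameter names are hypothetical.

```python
def jobs_created(strategy, polls, polls_until_registered, max_replica_count=1000):
    """Count jobs created for ONE queued session over `polls` polling intervals."""
    job_ages = []  # age, in polls, of every job created so far
    for _ in range(polls):
        # The session stays queued until some node has had time to register.
        session_queued = all(age < polls_until_registered for age in job_ages)
        max_scale = 1 if session_queued else 0

        # Assumption: pods reach Running before the next poll, so every
        # existing job counts as running and none as pending.
        running_job_count = len(job_ages)
        pending_job_count = 0

        if strategy == "default":
            effective = max_scale - running_job_count
        elif strategy == "accurate":
            effective = max_scale - pending_job_count
        else:
            raise ValueError(f"unknown strategy: {strategy}")

        room = max_replica_count - len(job_ages)
        to_create = max(0, min(effective, room))
        job_ages = [age + 1 for age in job_ages] + [0] * to_create
    return len(job_ages)


print(jobs_created("default", polls=6, polls_until_registered=4))   # prints 1
print(jobs_created("accurate", polls=6, polls_until_registered=4))  # prints 5
```

In this model the number of duplicate jobs per session grows with the ratio of registration latency to pollingInterval, which matches the aggravating factors reported in this issue.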
This is aggravated by large node pods on autoprovisioned nodes (long Pending → Running → registered path) and a small pollingInterval, but the underlying mismatch is that the Selenium scaler's queue metric is not compatible with accurate's "subtract only pending jobs" assumption, which is exactly the "calculation problem" the old # Change this to "accurate" when the calculation problem is fixed values comment referred to.
Suggested fix
Because default is no longer a valid enum value on KEDA ≥ 2.20, default the chart to custom reproducing default's formula:
    scaledJobOptions:
      scalingStrategy:
        strategy: custom
        customScalingQueueLengthDeduction: 0
        customScalingRunningJobPercentage: "1"  # maxScale - 0 - runningJobCount*1.0 == maxScale - runningJobCount
custom is accepted by every supported KEDA version and, with these parameters, reduces to the old default formula. At minimum, the README / values comment should document this as the replacement for the previous strategy: default guidance, since existing users upgrading past KEDA 2.20 will otherwise silently regress.
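A quick check of the arithmetic in the values comment above, as a sketch: it evaluates only the formula as written in that comment (maxScale - deduction - runningJobCount * percentage) and does not model any additional clamping KEDA may apply. The function names are hypothetical.

```python
def effective_default(max_scale, running_job_count):
    return max_scale - running_job_count


def effective_custom(max_scale, running_job_count,
                     queue_length_deduction, running_job_percentage):
    # Formula as written in the values comment, truncated to an integer.
    return (max_scale
            - queue_length_deduction
            - int(running_job_count * running_job_percentage))


# Compare both formulas over a grid of queue lengths and running-job counts,
# using the suggested parameters (deduction 0, percentage 1.0).
mismatches = [
    (max_scale, running)
    for max_scale in range(50)
    for running in range(50)
    if effective_custom(max_scale, running, 0, 1.0)
    != effective_default(max_scale, running)
]
print(len(mismatches))  # prints 0
```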
Relevant log output
KEDA scaleexecutor repeatedly creating jobs while sessions are already being served; node pods remain Running with 0 active sessions.
Environment
- Chart: selenium-grid 0.56.0 (image 4.45.0-20260606)
- KEDA: 2.20.1
- scalingType: job, SE_NODE_MAX_SESSIONS=1, SE_DRAIN_AFTER_SESSION_COUNT=1
- Kubernetes: GKE 1.3x
Related