Skip to content

Commit 35c56f1

Browse files
authored
feat(supervisor): add opt-in dequeue backpressure (#3836)
The supervisor can now pause dequeuing - and freeze consumer-pool scale-up - when a backpressure signal says the cluster can't place more work, then ramp dequeuing back up gradually once it clears. The signal is a verdict published to a Redis key by a cluster-side component; the supervisor reads it on a short refresh and gates `preDequeue` on it. Off by default (`TRIGGER_DEQUEUE_BACKPRESSURE_ENABLED`). Everything fails open: a missing, stale, or unreadable verdict never pins the brake, and the hot-path read is a synchronous cached lookup with no I/O. The scale-up freeze leaves scale-down untouched, and on release the resume is ramped so a deep queue isn't hammered all at once. Dry-run is on by default (`TRIGGER_DEQUEUE_BACKPRESSURE_DRY_RUN`): even once enabled it only logs what it would have done, and surfaces the computed state through metrics, until explicitly set to act. Prometheus: `supervisor_backpressure_engaged`, `_dry_run`, `_skipped_dequeues_total`. Refs TRI-5354
1 parent 85886b9 commit 35c56f1

13 files changed

Lines changed: 853 additions & 7 deletions
Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
---
2+
"@trigger.dev/core": patch
3+
---
4+
5+
Add optional `shouldPauseScaling` to the supervisor consumer pool scaling options to freeze scale-up while it returns true (scale-down stays allowed).
Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
---
2+
area: supervisor
3+
type: feature
4+
---
5+
6+
Add opt-in dequeue backpressure to the supervisor. When enabled, the supervisor reads a verdict from Redis and pauses dequeuing while the worker cluster is saturated, then resumes once capacity is available. Disabled by default - no behavior change for existing deployments.

apps/supervisor/package.json

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,13 +18,15 @@
1818
"@kubernetes/client-node": "^1.0.0",
1919
"@trigger.dev/core": "workspace:*",
2020
"dockerode": "^4.0.6",
21+
"ioredis": "^5.3.2",
2122
"p-limit": "^6.2.0",
2223
"prom-client": "^15.1.0",
2324
"socket.io": "4.7.4",
2425
"std-env": "^3.8.0",
2526
"zod": "3.25.76"
2627
},
2728
"devDependencies": {
29+
"@internal/testcontainers": "workspace:*",
2830
"@types/dockerode": "^3.3.33"
2931
}
3032
}
Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
import { Counter, Gauge, type Registry } from "prom-client";
2+
3+
/** Prometheus metrics for dequeue backpressure. */
4+
export class BackpressureMetrics {
5+
/** 1 while backpressure is engaged (computed signal, set even in dry-run). */
6+
readonly engaged: Gauge<string>;
7+
/** 1 when running in dry-run (gates inert). */
8+
readonly dryRun: Gauge<string>;
9+
/** Dequeue attempts the gate skipped - or would have, in dry-run (labelled). */
10+
readonly skipsTotal: Counter<string>;
11+
12+
constructor(opts: { register: Registry; prefix?: string }) {
13+
const prefix = opts.prefix ?? "supervisor_backpressure";
14+
15+
this.engaged = new Gauge({
16+
name: `${prefix}_engaged`,
17+
help: "1 while dequeue backpressure is engaged (computed signal, regardless of dry-run)",
18+
registers: [opts.register],
19+
});
20+
21+
this.dryRun = new Gauge({
22+
name: `${prefix}_dry_run`,
23+
help: "1 when dequeue backpressure is in dry-run mode (gates inert)",
24+
registers: [opts.register],
25+
});
26+
27+
this.skipsTotal = new Counter({
28+
name: `${prefix}_skipped_dequeues_total`,
29+
help: "Dequeue attempts skipped by backpressure (or would be, in dry-run)",
30+
labelNames: ["dry_run"],
31+
registers: [opts.register],
32+
});
33+
}
34+
}

0 commit comments

Comments
 (0)