feat(webapp,run-engine): mollifier drainer replay + stale sweep + cancelled-run engine API (#3754)
## Summary
The replay side of the mollifier:
- `DrainerHandler`: reads buffered snapshots and replays them through
`engine.trigger` to materialise PG rows.
- `RunEngine.createCancelledRun`: new public method the handler uses to
write CANCELED rows directly from snapshots (bypass queue + waitpoint,
emit `runCancelled`). Tolerates the cjson empty-table tags edge case
found during validation.
- Drainer fairness: org → env rotation so a heavy env doesn't starve
light ones in the same org.
- Stale-entry sweep + telemetry + alertable gauge so a stuck/offline
drainer surfaces in alerts.
Both the drainer and the sweep are off by default; neither runs unless
its flag is set (`TRIGGER_MOLLIFIER_DRAINER_ENABLED`,
`TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED`).
Stacked on the trigger-time decisions PR.
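
For reviewers, the org → env rotation described above could look roughly like the sketch below. It is not the drainer's actual code: `OrgQueues`, `EnvQueues` and `drainOrder` are hypothetical names, and it assumes one entry is taken per org per pass, with a per-org cursor that moves to the next env each time.

```typescript
// Hypothetical shapes; none of these names come from the PR.
type EnvQueues = Record<string, string[]>; // envId -> buffered run ids (FIFO)
type OrgQueues = Record<string, EnvQueues>; // orgId -> that org's envs

// Returns the order in which entries would be drained. Each pass visits every
// org once and takes a single entry from that org's next non-empty env, so a
// deep env queue cannot starve a shallow one in the same org.
// Note: this sketch consumes (mutates) the input queues.
function drainOrder(orgs: OrgQueues): string[] {
  const order: string[] = [];
  const cursors: Record<string, number> = {}; // per-org position in its env list
  let progressed = true;
  while (progressed) {
    progressed = false;
    for (const orgId of Object.keys(orgs)) {
      const envIds = Object.keys(orgs[orgId]);
      const start = cursors[orgId] || 0;
      for (let i = 0; i < envIds.length; i++) {
        const idx = (start + i) % envIds.length;
        const queue = orgs[orgId][envIds[idx]];
        if (queue.length === 0) continue; // skip drained envs
        order.push(queue.shift() as string);
        cursors[orgId] = idx + 1; // next visit starts at the following env
        progressed = true;
        break;
      }
    }
  }
  return order;
}

const exampleOrder = drainOrder({
  orgA: { heavy: ["h1", "h2", "h3", "h4"], light: ["l1"] },
  orgB: { only: ["b1", "b2"] },
});
// exampleOrder: ["h1", "b1", "l1", "b2", "h2", "h3", "h4"]
```

In the example, `l1` is drained third rather than after all four `heavy` entries, which is the starvation case the rotation is meant to avoid.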
## Test plan
- [x] `pnpm run typecheck --filter webapp` passes
- [x] `pnpm run test --filter webapp test/mollifierDrainerHandler.test.ts` passes
- [x] `pnpm run test --filter webapp test/mollifierStaleSweep.test.ts` passes
- [x] `pnpm run test --filter @internal/run-engine src/engine/tests/createCancelledRun.test.ts` passes
- [x] `pnpm run test --filter @trigger.dev/redis-worker packages/redis-worker/src/mollifier/drainer.test.ts` passes
---
## Ship-gate follow-up fix
**Drainer writes SYSTEM_FAILURE on max-attempts exhaustion.** Adds an
`onTerminalFailure` callback on `MollifierDrainerOptions` so the
customer's run lands a SYSTEM_FAILURE PG row even when the drainer
exhausts `MAX_ATTEMPTS` on a retryable PG error (previously
`buffer.fail()` was called with no row written → silent data loss). The
callback runs before `buffer.fail()` on every terminal path
(non-retryable AND max-attempts-exhausted), and re-throwing a retryable
error from the callback causes the drainer to requeue rather than fail.
Bumps `@trigger.dev/redis-worker` to a **minor** changeset (additive
option + new exported types). Includes 5 unit tests covering both
terminal causes plus the requeue-on-retryable-callback-failure path and
no-callback back-compat.
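
The terminal-path ordering described above can be sketched as follows. This is a simplified model rather than the code in `@trigger.dev/redis-worker`: only `onTerminalFailure`, `buffer.fail()` and the two cause strings come from this PR. `settleFailure`, `EntryBuffer`, `requeue`, `isRetryable` and the callback's argument list are hypothetical, the attempt counting is assumed, and the sketch is synchronous for brevity.

```typescript
type TerminalCause = "non-retryable" | "max-attempts-exhausted";

interface BufferEntry {
  runId: string;
  attempts: number; // assumed: attempts made so far, including the one that just failed
}

interface EntryBuffer {
  fail(runId: string): void;
  requeue(runId: string): void; // hypothetical name
}

interface DrainerOptionsSketch {
  maxAttempts: number;
  isRetryable: (err: unknown) => boolean; // hypothetical classifier
  onTerminalFailure?: (entry: BufferEntry, cause: TerminalCause, err: unknown) => void;
}

// Decides what happens to an entry after the handler threw `err`.
function settleFailure(
  entry: BufferEntry,
  err: unknown,
  buffer: EntryBuffer,
  opts: DrainerOptionsSketch
): "requeued" | "failed" {
  const retryable = opts.isRetryable(err);
  if (retryable && entry.attempts < opts.maxAttempts) {
    buffer.requeue(entry.runId); // retry budget left: not a terminal path
    return "requeued";
  }

  const cause: TerminalCause = retryable ? "max-attempts-exhausted" : "non-retryable";
  if (opts.onTerminalFailure) {
    try {
      // Runs before buffer.fail() so the SYSTEM_FAILURE row can be written first.
      opts.onTerminalFailure(entry, cause, err);
    } catch (callbackErr) {
      if (opts.isRetryable(callbackErr)) {
        buffer.requeue(entry.runId); // keep the entry instead of dropping it
        return "requeued";
      }
      // Assumption: a non-retryable callback error still ends in buffer.fail().
    }
  }
  buffer.fail(entry.runId); // no callback configured: same behaviour as before
  return "failed";
}
```

The point of the ordering is that the buffer entry is only deleted once the callback has had a chance to persist the failure, and a retryable callback error keeps the entry alive.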
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `onTerminalFailure` callback to `MollifierDrainerOptions` so the customer's run lands a SYSTEM_FAILURE PG row even when the drainer exhausts `maxAttempts` on a retryable PG error. Previously, retryable-error exhaustion called `buffer.fail()` directly, which atomically marks FAILED + DELs the entry hash with no PG write — silent data loss when PG was unreachable across the full retry budget. The callback fires before `buffer.fail()` on any terminal path (`cause: "non-retryable"` or `"max-attempts-exhausted"`); throwing a retryable error from the callback causes the drainer to requeue rather than fail.
Mollifier drainer replay: replay buffered entries into `engine.trigger`, stale-entry sweep, a drainer-health gauge, and run-engine cancelled/failed run APIs. Known limitation: stale-sweep runs per-webapp instance, so stale-entry counter metrics multiply by N webapps in HA until a distributed lease lands as follow-up.
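
A rough illustration of the known limitation above, assuming every webapp instance sweeps the same shared buffer and increments its own counter. `StaleCandidate`, `findStale` and `summedStaleCount` are hypothetical names, and the age-threshold check is an assumed definition of "stale", not one taken from the sweep's code.

```typescript
// Hypothetical entry shape; the real buffer layout is not shown in this PR.
interface StaleCandidate {
  runId: string;
  bufferedAtMs: number; // assumed: when the entry was written to the buffer
}

// Entries older than the threshold count as stale in this sketch.
function findStale(entries: StaleCandidate[], nowMs: number, thresholdMs: number): StaleCandidate[] {
  return entries.filter((e) => nowMs - e.bufferedAtMs > thresholdMs);
}

// Without a distributed lease, each instance runs its own sweep over the same
// entries, so a counter summed across instances reports N times the true value.
function summedStaleCount(
  entries: StaleCandidate[],
  instanceCount: number,
  nowMs: number,
  thresholdMs: number
): number {
  let total = 0;
  for (let i = 0; i < instanceCount; i++) {
    total += findStale(entries, nowMs, thresholdMs).length;
  }
  return total;
}
```

With two stale entries and three webapp instances, the summed counter would read six, so dashboards built on it would need to divide by the instance count (or take a max across instances) until the lease lands.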