feat(scheduler): force-resume gate for failed jobs #138
Open
truffle-dev wants to merge 1 commit into
Rate-limit storms (and any sustained error) drive a scheduled job to
status=failed with next_run_at=NULL once MAX_CONSECUTIVE_ERRORS = 10
trips. resumeJob refused to touch the row, so recovery meant a direct
SQLite UPDATE on the live DB.
Add an opt-in revival path:
scheduler.resumeJob(id, { force: true })
POST /ui/api/scheduler/:id/resume { "force": true }
The HTTP path returns 409 with a force-prompt message when the caller
omits force on a failed job, so the circuit-breaker still defaults to
no-op. paused → active stays unchanged; completed → active stays
forbidden even with force (one-shots may have already self-deleted).
Closes ghostwright#128
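
The gate described in this commit message can be sketched as a small pure function. This is an illustrative sketch only, not the code in `src/scheduler/service.ts`: the names `decideResume`, `ResumeDecision`, and the `reason` strings are hypothetical, and treating an `active` job the same as a `completed` one (not resumable) is an assumption.

```typescript
// Hypothetical sketch of the resume gate; names are illustrative, not from the PR.
type JobStatus = "active" | "paused" | "failed" | "completed";

interface ResumeOptions {
  force?: boolean;
}

type ResumeDecision =
  | { ok: true }
  | { ok: false; reason: "needs_force" | "not_resumable" };

function decideResume(status: JobStatus, opts: ResumeOptions = {}): ResumeDecision {
  // paused -> active: always allowed, force or not.
  if (status === "paused") {
    return { ok: true };
  }
  // failed -> active: only when the caller explicitly opts in.
  if (status === "failed") {
    return opts.force === true
      ? { ok: true }
      : { ok: false, reason: "needs_force" };
  }
  // completed (and, by assumption, active): never resumable, even with force.
  return { ok: false, reason: "not_resumable" };
}
```

Under this sketch, an HTTP handler would map `needs_force` to the 409 response, while the service method would simply leave the row untouched.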
Closes #128.

Rate-limit storms (and any sustained error) drive a scheduled job to `status=failed` with `next_run_at=NULL` once `MAX_CONSECUTIVE_ERRORS = 10` trips (`src/scheduler/executor.ts:8,67-69`). The current `resumeJob` (`src/scheduler/service.ts`) refuses to touch the row, so recovery means a direct SQLite UPDATE on the live DB. It has happened twice now (2026-04-30, 2026-05-05).

Change

`Scheduler.resumeJob(id, opts?: { force?: boolean })` now allows one extra transition:

- `paused → active`: always (unchanged).
- `failed → active`: requires `opts.force === true`.
- `completed → active`: never, even with force. A one-shot may already have deleted itself inline (`executor.ts` `delete_after_run` path); re-activating is a sharp edge.

The HTTP path mirrors the gate:

- `POST /ui/api/scheduler/:id/resume` with no body still revives a `paused` job.
- On a `failed` job without force, it returns 409 with a message that names the force opt.
- `POST .../resume {"force": true}` revives and audits the transition.

Both paths recompute `next_run_at` from the stored schedule and reset `consecutive_errors` to 0 so the revived job gets a clean retry budget.

Why opt-in

Failures the executor marks terminal are usually transient (model-provider rate limits, a brief Slack outage), but the executor cannot tell transient from broken. `force` opts in the operator: they have judged the underlying cause cleared. Without force the path stays a no-op so an accidental resume call cannot bypass the circuit-breaker.

Tests

`src/scheduler/__tests__/service.test.ts`

- `resumeJob without force is a no-op on a non-paused job (active, failed, completed)` (renamed)
- `resumeJob({force:true}) revives a failed job and resets the error counter`
- `resumeJob({force:true}) still refuses to revive a completed job`
- `resumeJob({force:true}) on a paused job behaves like the unforced path`

`src/ui/api/__tests__/scheduler.test.ts`

- `POST /:id/resume on a failed job without force returns 409`
- `POST /:id/resume on a failed job with {force:true} revives it` (also asserts the audit row)
- `POST /:id/resume tolerates an empty body for the paused → active path`

68 / 68 in the two suites; `bun run typecheck` clean; `bun run lint` clean.

Scope

Service-layer method + UI HTTP handler + tests. The `phantom_schedule` MCP tool doesn't expose `pause`/`resume` actions today, so I did not add `resume` there; that's a separate concern and a different surface to design. The recovery playbook in agent-notes for direct-UPDATE remains valid; this just gives operators a non-destructive path.
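
For reviewers, the row update performed on a forced revival (recompute `next_run_at`, reset `consecutive_errors`) might look roughly like the sketch below. It is an assumption-laden illustration rather than the PR's code: `JobRow`, `reviveRow`, and the `NextRunFn` callback are hypothetical names, the `id` and `schedule` fields are guessed shapes, and the real schedule parsing is left to the injected callback.

```typescript
// Hypothetical row shape; only status, next_run_at and consecutive_errors
// are named in the PR description. id and schedule are assumptions.
interface JobRow {
  id: string;
  status: "active" | "paused" | "failed" | "completed";
  schedule: string;
  next_run_at: string | null;
  consecutive_errors: number;
}

// Injected so the sketch stays independent of the real schedule format.
type NextRunFn = (schedule: string, now: Date) => Date;

function reviveRow(row: JobRow, now: Date, nextRun: NextRunFn): JobRow {
  return {
    ...row,
    status: "active",
    // Recompute from the stored schedule instead of reusing a stale value.
    next_run_at: nextRun(row.schedule, now).toISOString(),
    // Reset the counter so the job gets a fresh retry budget.
    consecutive_errors: 0,
  };
}
```

Returning a new object instead of mutating the input keeps the sketch easy to test; the actual implementation presumably issues an UPDATE against the SQLite row.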