feat: persist RecurringTaskRun before run and reconcile abandoned rows #219
Merged
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #219 +/- ##
==========================================
+ Coverage 97.29% 97.32% +0.03%
==========================================
Files 104 104
Lines 4466 4525 +59
==========================================
+ Hits 4345 4404 +59
Misses 121 121
emyller
previously approved these changes
Apr 29, 2026
Save the RecurringTaskRun row before dispatching to `_run_task` instead of after. The pre-saved row leaves a recoverable artefact if the worker process is killed mid-task — a row with `started_at` set, `finished_at` and `result` null — which a later run can reconcile when the SQL reaper unlocks the abandoned task. `_run_task` gains an optional `task_run` parameter; if supplied, it mutates that row instead of creating a new in-memory one. The post-run save in `run_recurring_tasks` becomes an UPDATE on the persisted row (`update_fields=["finished_at", "result", "error_details"]`).
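The ordering described in this commit can be sketched outside Django. The following is a minimal illustration, not the project's code: `TaskRun`, `InMemoryStore`, `run_with_presaved_row`, and `WorkerKilled` are hypothetical names, the `"SUCCESS"` result value is an assumption, and a `BaseException` subclass stands in for a process kill so that the post-run update is skipped.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


class WorkerKilled(BaseException):
    """Hypothetical stand-in for SIGKILL/OOM: not caught by `except Exception`."""


@dataclass
class TaskRun:
    # Hypothetical stand-in for RecurringTaskRun; not the real Django model.
    started_at: datetime
    finished_at: Optional[datetime] = None
    result: Optional[str] = None
    error_details: Optional[str] = None


class InMemoryStore:
    """Hypothetical persistence layer; each entry is a snapshot, like a DB row."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def insert(self, run: TaskRun) -> int:
        self.rows.append(vars(run).copy())
        return len(self.rows) - 1

    def update(self, row_id: int, run: TaskRun, update_fields: list[str]) -> None:
        # Mirrors an UPDATE restricted to the named columns.
        for name in update_fields:
            self.rows[row_id][name] = getattr(run, name)


def run_with_presaved_row(store: InMemoryStore, task: Callable[[], None]) -> TaskRun:
    run = TaskRun(started_at=datetime.now(timezone.utc))
    row_id = store.insert(run)  # persisted BEFORE the task body runs
    try:
        task()
        run.result = "SUCCESS"  # assumed value; the source only names FAILURE
    except Exception as exc:
        run.result = "FAILURE"
        run.error_details = repr(exc)
    run.finished_at = datetime.now(timezone.utc)
    # Only reached if the worker survived the task.
    store.update(row_id, run, ["finished_at", "result", "error_details"])
    return run
```

If `task` raises `WorkerKilled`, the stored row keeps `started_at` but has `finished_at` and `result` unset, which is the recoverable artefact the commit message describes.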
When a worker dies mid-task (SIGKILL, OOM, host crash), the RecurringTaskRun row persisted before `_run_task` is left with `result IS NULL`. The SQL reaper eventually unlocks the task because its `locked_at` exceeded the timeout grace window. On the next pickup, mark every such abandoned row as `FAILURE` with a distinguishing `error_details` message. The reconciliation lives on `RecurringTask.reconcile_abandoned_run` alongside `should_execute`/`unlock`, called once from `run_recurring_task` immediately after the task is fetched. There is at most one orphan row per task — `FOR UPDATE SKIP LOCKED` and the `is_locked` gate in `get_recurringtasks_to_process` ensure two workers can't both leave orphans for the same task — so the method uses `.first()` and falls through cheaply when no orphan exists. Marking abandoned runs as `FAILURE` lets `should_execute`'s existing failure-count branch count them toward backoff without any additional logic there.
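As a rough illustration of the reconciliation step, here is a sketch using `sqlite3` rather than the Django ORM. The table layout, the `ABANDONED_MESSAGE` wording, and the helper names `count_failures` and `make_demo_db` are assumptions made for the example; only the `result IS NULL` orphan condition, the `FAILURE` marking, and the single-row lookup come from the description above.

```python
import sqlite3

# Hypothetical wording; the commit only says error_details is "distinguishing".
ABANDONED_MESSAGE = "abandoned: no result recorded before the task was unlocked"

# Assumed minimal schema, not the real RecurringTaskRun table.
SCHEMA = """
CREATE TABLE task_run (
    id INTEGER PRIMARY KEY,
    task_id INTEGER NOT NULL,
    started_at TEXT NOT NULL,
    finished_at TEXT,
    result TEXT,
    error_details TEXT
)
"""


def reconcile_abandoned_run(conn: sqlite3.Connection, task_id: int) -> bool:
    """Mark the task's orphan run (result IS NULL) as FAILURE, if one exists."""
    orphan = conn.execute(
        "SELECT id FROM task_run WHERE task_id = ? AND result IS NULL "
        "ORDER BY started_at LIMIT 1",  # plays the role of .first()
        (task_id,),
    ).fetchone()
    if orphan is None:
        return False  # cheap fall-through: nothing to reconcile
    conn.execute(
        "UPDATE task_run SET result = 'FAILURE', error_details = ? WHERE id = ?",
        (ABANDONED_MESSAGE, orphan[0]),
    )
    return True


def count_failures(conn: sqlite3.Connection, task_id: int) -> int:
    """Count FAILURE rows, the kind of input a failure-based backoff would read."""
    row = conn.execute(
        "SELECT COUNT(*) FROM task_run WHERE task_id = ? AND result = 'FAILURE'",
        (task_id,),
    ).fetchone()
    return row[0]


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory DB with one finished run and one orphan (demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO task_run (task_id, started_at, finished_at, result) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, "2026-04-29T10:00:00", "2026-04-29T10:00:05", "SUCCESS"),
            (1, "2026-04-29T10:05:00", None, None),  # left by a killed worker
        ],
    )
    return conn
```

Calling `reconcile_abandoned_run(make_demo_db(), 1)` touches only the orphan row; a second call on the same connection finds nothing and returns `False`. Because the orphan becomes an ordinary `FAILURE` row, anything that counts failures picks it up without special handling.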
Zaimwa9
approved these changes
Apr 30, 2026
Stacked on #218.
Changes
Make worker crashes (SIGKILL, OOM) and DB connection failures during recurring tasks observable, and feed them into the existing backoff.
`run_recurring_task` now persists the `RecurringTaskRun` row before dispatching to `_run_task`, so any failure that prevents the post-execution save (process killed, host evicted, DB connection dropped during the UPDATE) leaves a recoverable orphan with `result IS NULL`. On the next pickup, `RecurringTask.reconcile_abandoned_run` marks that orphan as `FAILURE` with a distinguishing `error_details`, which `should_execute`'s failure-count branch then counts toward backoff.
How did you test this code?
- `make typecheck` clean.
- `test_recurring_task_reconcile_abandoned_run__no_abandoned_run__noop`, `test_recurring_task_reconcile_abandoned_run__finished_run_present__only_abandoned_touched`.
- `test_run_recurring_task__abandoned_run__reconciled_as_failure`.
- `default`.