Skip to content

feat: persist RecurringTaskRun before run and reconcile abandoned rows#219

Merged
gagantrivedi merged 3 commits into
mainfrom
refactor/persist-recurring-task-run-early
Apr 30, 2026
Merged

feat: persist RecurringTaskRun before run and reconcile abandoned rows#219
gagantrivedi merged 3 commits into
mainfrom
refactor/persist-recurring-task-run-early

Conversation

@gagantrivedi

Copy link
Copy Markdown
Member

Stacked on #218.

Changes

Make worker crashes (SIGKILL, OOM) and DB connection failures during recurring tasks observable, and feed them into the existing backoff. run_recurring_task now persists the RecurringTaskRun row before dispatching to _run_task, so any failure that prevents the post-execution save (process killed, host evicted, DB connection dropped during the UPDATE) leaves a recoverable orphan with result IS NULL. On the next pickup, RecurringTask.reconcile_abandoned_run marks that orphan as FAILURE with a distinguishing error_details, which should_execute's failure-count branch then counts toward backoff.

How did you test this code?

  • make typecheck clean.
  • New unit tests: test_recurring_task_reconcile_abandoned_run__no_abandoned_run__noop, test_recurring_task_reconcile_abandoned_run__finished_run_present__only_abandoned_touched.
  • New behavioural test: test_run_recurring_task__abandoned_run__reconciled_as_failure.
  • Pre-existing recurring-task suite still passes on default.

@codecov-commenter

codecov-commenter commented Apr 29, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.32%. Comparing base (fdfd22b) to head (fa0c6e0).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #219      +/-   ##
==========================================
+ Coverage   97.29%   97.32%   +0.03%     
==========================================
  Files         104      104              
  Lines        4466     4525      +59     
==========================================
+ Hits         4345     4404      +59     
  Misses        121      121              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gagantrivedi gagantrivedi force-pushed the refactor/persist-recurring-task-run-early branch 2 times, most recently from 2d9e097 to cd3b627 Compare April 29, 2026 10:21
@gagantrivedi gagantrivedi marked this pull request as ready for review April 29, 2026 10:23
@gagantrivedi gagantrivedi requested a review from a team as a code owner April 29, 2026 10:23
@gagantrivedi gagantrivedi requested review from Zaimwa9 and emyller and removed request for a team April 29, 2026 10:23
emyller
emyller previously approved these changes Apr 29, 2026

@emyller emyller left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Conflicts with main though. Seems an easy fix, and I might try if I find the time today.

Save the RecurringTaskRun row before dispatching to `_run_task` instead
of after. The pre-saved row leaves a recoverable artefact if the worker
process is killed mid-task — a row with `started_at` set, `finished_at`
and `result` null — which a later run can reconcile when the SQL reaper
unlocks the abandoned task.

`_run_task` gains an optional `task_run` parameter; if supplied, it
mutates that row instead of creating a new in-memory one. The post-run
save in `run_recurring_tasks` becomes an UPDATE on the persisted row
(`update_fields=["finished_at", "result", "error_details"]`).
When a worker dies mid-task (SIGKILL, OOM, host crash), the
RecurringTaskRun row persisted before `_run_task` is left with
`result IS NULL`. The SQL reaper eventually unlocks the task because
its `locked_at` exceeded the timeout grace window. On the next pickup,
mark every such abandoned row as `FAILURE` with a distinguishing
`error_details` message.

The reconciliation lives on `RecurringTask.reconcile_abandoned_run`
alongside `should_execute`/`unlock`, called once from
`run_recurring_task` immediately after the task is fetched. There is
at most one orphan row per task — `FOR UPDATE SKIP LOCKED` and the
`is_locked` gate in `get_recurringtasks_to_process` ensure two workers
can't both leave orphans for the same task — so the method uses
`.first()` and falls through cheaply when no orphan exists. Marking
abandoned runs as `FAILURE` lets `should_execute`'s existing
failure-count branch count them toward backoff without any additional
logic there.
@gagantrivedi gagantrivedi force-pushed the refactor/persist-recurring-task-run-early branch from 92eeb30 to 07dff4c Compare April 30, 2026 08:52
@gagantrivedi gagantrivedi changed the base branch from refactor/simplify-run-recurring-tasks to main April 30, 2026 08:53
@gagantrivedi gagantrivedi dismissed emyller’s stale review April 30, 2026 08:53

The base branch was changed.

@gagantrivedi gagantrivedi merged commit 7746773 into main Apr 30, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants