You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The public CRON-job paths in crates/api/src/handlers/jobs.rs write to the jobs table and then call register_pg_cron / unregister_pg_cron against a fresh pool connection — after the row write has committed and with the pg_cron error logged-and-swallowed. The API returns 2xx even when the pg_cron operation failed, so the persisted jobs row can end up out of sync with what pg_cron is actually scheduled to run.
Surfaced during review of #24 (thread). The reaper-provisioning path was fixed there; the public paths are deferred because the right fix is not just a tx scope change — it's also a behavior change around how the API reports pg_cron failures, which deserves an explicit decision.
The three affected sites
1. POST /workspaces/{w}/jobs with trigger: CRON (create)
crates/api/src/handlers/jobs.rs:187-215
letmut conn = scoped_connection(...).await?;// auto-commit, NOT a txlet job = create_cron(&mut*conn, ...).await?;// row committed immediatelyifletErr(e) = register_pg_cron(&state.pool, ...,job.job_id, cron_expr).await{
tracing::error!(job_id = %job.job_id,"Failed to register pg_cron job: {}", e);// swallowed — handler still returns 201 Created below}
Failure mode: row in jobs with status = 'ACTIVE', no pg_cron entry → job never fires. Client got 201 Created and a job_id.
letmut tx = scoped_transaction(...).await?;let created = retire_and_replace(&mut*tx, ...).await?;
tx.commit().await?;// old job RETIRED, new job ACTIVEifletErr(e) = unregister_pg_cron(...).await{ tracing::error!(...);}// swallowedifletErr(e) = register_pg_cron(...,created.job_id, ...).await{ tracing::error!(...);}// swallowed
Failure modes:
unregister fails: pg_cron still scheduled for the old job_id. The cron_command guards on status = 'ACTIVE', so it materializes no executions — but stays in cron.job as dead weight.
register fails: new version row has status = 'ACTIVE' but no schedule → never fires. Client got 200 OK.
3. DELETE /workspaces/{w}/jobs/{j} (cancel)
crates/api/src/handlers/jobs.rs:416-421
let cancelled = cancel(&mut*conn, ...).await?;// status flipped to RETIRED via auto-commitif job.trigger_type == "CRON"{ifletErr(e) = unregister_pg_cron(...).await{ tracing::error!(...);}// swallowed}
Failure mode: row RETIRED, pg_cron still scheduled. cron_command's status = 'ACTIVE' guard prevents bogus execution rows from being materialized, but the dead pg_cron entry stays put forever — current reaper sweep only catches jobs with cron_ends_at in the past, not cancelled ones.
Tx scope: swap register_pg_cron(pool, ...) for register_pg_cron_conn(conn, ...) (already added in Dogfood CRON reaper as internal kronos job #24 for the reaper path) and the equivalent for unregister. Run them on the surrounding tx. The CRON-create path additionally needs to switch from scoped_connection (auto-commit) to scoped_transaction.
Error propagation: stop swallowing the pg_cron error. Surface it as an AppError and let the surrounding tx roll back. This is an API contract change: today POST /jobs (CRON) succeeds whenever the row inserts; afterwards it can fail with a 5xx if pg_cron is unhealthy. Same for PATCH and DELETE on CRON jobs.
Wrapping in a tx without fixing the swallowing is cosmetic — the error path's if let Err(e) = ... would still let the tx commit. So the two changes have to ship together.
Proposed fix
Add unregister_pg_cron_conn(&mut PgConnection, ...) mirroring register_pg_cron_conn (added in Dogfood CRON reaper as internal kronos job #24) and unschedule_pg_cron_conn (already present).
In jobs.rs:
create: switch the CRON branch from scoped_connection to scoped_transaction; replace register_pg_cron(pool, ...) with register_pg_cron_conn(&mut *tx, ...); replace if let Err(e) = ... { tracing::error!(...); } with ? so the error bubbles up and rolls back the tx. The handler responds with whatever AppError mapping is appropriate (likely 5xx on pg_cron unhealthy, since the client can retry).
update: move unregister + register inside the tx that wraps retire_and_replace. Same ? substitution. Both succeed or both roll back together with the version flip.
cancel: wrap the cancel + unregister in a scoped_transaction. Same ? substitution.
(Optional, follow-up): consider whether the reaper sweep should also catch cancelled CRON jobs (status = 'RETIRED' without cron_ends_at) and unschedule their pg_cron entries — a defense-in-depth net for any path that left a row behind before this fix lands.
Summary
The public CRON-job paths in
crates/api/src/handlers/jobs.rswrite to thejobstable and then callregister_pg_cron/unregister_pg_cronagainst a fresh pool connection — after the row write has committed and with the pg_cron error logged-and-swallowed. The API returns2xxeven when the pg_cron operation failed, so the persistedjobsrow can end up out of sync with what pg_cron is actually scheduled to run.Surfaced during review of #24 (thread). The reaper-provisioning path was fixed there; the public paths are deferred because the right fix is not just a tx scope change — it's also a behavior change around how the API reports pg_cron failures, which deserves an explicit decision.
The three affected sites
1.
POST /workspaces/{w}/jobswithtrigger: CRON(create)crates/api/src/handlers/jobs.rs:187-215Failure mode: row in
jobswithstatus = 'ACTIVE', no pg_cron entry → job never fires. Client got201 Createdand ajob_id.2.
PATCH /workspaces/{w}/jobs/{j}(update — retire-and-replace)crates/api/src/handlers/jobs.rs:350-371Failure modes:
unregisterfails: pg_cron still scheduled for the old job_id. Thecron_commandguards onstatus = 'ACTIVE', so it materializes no executions — but stays incron.jobas dead weight.registerfails: new version row hasstatus = 'ACTIVE'but no schedule → never fires. Client got200 OK.3.
DELETE /workspaces/{w}/jobs/{j}(cancel)crates/api/src/handlers/jobs.rs:416-421Failure mode: row
RETIRED, pg_cron still scheduled.cron_command'sstatus = 'ACTIVE'guard prevents bogus execution rows from being materialized, but the dead pg_cron entry stays put forever — current reaper sweep only catches jobs withcron_ends_atin the past, not cancelled ones.Why we deferred this from #24
The shape of the fix is two-part, not one:
Tx scope: swap
register_pg_cron(pool, ...)forregister_pg_cron_conn(conn, ...)(already added in Dogfood CRON reaper as internal kronos job #24 for the reaper path) and the equivalent forunregister. Run them on the surrounding tx. The CRON-create path additionally needs to switch fromscoped_connection(auto-commit) toscoped_transaction.Error propagation: stop swallowing the pg_cron error. Surface it as an
AppErrorand let the surrounding tx roll back. This is an API contract change: todayPOST /jobs (CRON)succeeds whenever the row inserts; afterwards it can fail with a 5xx if pg_cron is unhealthy. Same forPATCHandDELETEon CRON jobs.Wrapping in a tx without fixing the swallowing is cosmetic — the error path's
if let Err(e) = ...would still let the tx commit. So the two changes have to ship together.Proposed fix
unregister_pg_cron_conn(&mut PgConnection, ...)mirroringregister_pg_cron_conn(added in Dogfood CRON reaper as internal kronos job #24) andunschedule_pg_cron_conn(already present).jobs.rs:scoped_connectiontoscoped_transaction; replaceregister_pg_cron(pool, ...)withregister_pg_cron_conn(&mut *tx, ...); replaceif let Err(e) = ... { tracing::error!(...); }with?so the error bubbles up and rolls back the tx. The handler responds with whateverAppErrormapping is appropriate (likely 5xx on pg_cron unhealthy, since the client can retry).unregister + registerinside the tx that wrapsretire_and_replace. Same?substitution. Both succeed or both roll back together with the version flip.scoped_transaction. Same?substitution.status = 'RETIRED'withoutcron_ends_at) and unschedule their pg_cron entries — a defense-in-depth net for any path that left a row behind before this fix lands.Out of scope