Skip to content

Public CRON job paths swallow pg_cron registration errors and skip transactional atomicity #32

@knutties

Description

@knutties

Summary

The public CRON-job paths in crates/api/src/handlers/jobs.rs write to the jobs table and then call register_pg_cron / unregister_pg_cron against a fresh pool connection — after the row write has committed and with the pg_cron error logged-and-swallowed. The API returns 2xx even when the pg_cron operation failed, so the persisted jobs row can end up out of sync with what pg_cron is actually scheduled to run.

Surfaced during review of #24 (thread). The reaper-provisioning path was fixed there; the public paths are deferred because the right fix is not just a tx scope change — it's also a behavior change around how the API reports pg_cron failures, which deserves an explicit decision.

The three affected sites

1. POST /workspaces/{w}/jobs with trigger: CRON (create)

crates/api/src/handlers/jobs.rs:187-215

let mut conn = scoped_connection(...).await?;          // auto-commit, NOT a tx
let job = create_cron(&mut *conn, ...).await?;          // row committed immediately

if let Err(e) = register_pg_cron(&state.pool, ..., job.job_id, cron_expr).await {
    tracing::error!(job_id = %job.job_id, "Failed to register pg_cron job: {}", e);
    // swallowed — handler still returns 201 Created below
}

Failure mode: row in jobs with status = 'ACTIVE', no pg_cron entry → job never fires. Client got 201 Created and a job_id.

2. PATCH /workspaces/{w}/jobs/{j} (update — retire-and-replace)

crates/api/src/handlers/jobs.rs:350-371

let mut tx = scoped_transaction(...).await?;
let created = retire_and_replace(&mut *tx, ...).await?;
tx.commit().await?;                                     // old job RETIRED, new job ACTIVE

if let Err(e) = unregister_pg_cron(...).await { tracing::error!(...); }  // swallowed
if let Err(e) = register_pg_cron(..., created.job_id, ...).await { tracing::error!(...); }  // swallowed

Failure modes:

  • unregister fails: pg_cron still scheduled for the old job_id. The cron_command guards on status = 'ACTIVE', so it materializes no executions — but stays in cron.job as dead weight.
  • register fails: new version row has status = 'ACTIVE' but no schedule → never fires. Client got 200 OK.

3. DELETE /workspaces/{w}/jobs/{j} (cancel)

crates/api/src/handlers/jobs.rs:416-421

let cancelled = cancel(&mut *conn, ...).await?;         // status flipped to RETIRED via auto-commit

if job.trigger_type == "CRON" {
    if let Err(e) = unregister_pg_cron(...).await { tracing::error!(...); }  // swallowed
}

Failure mode: row RETIRED, pg_cron still scheduled. cron_command's status = 'ACTIVE' guard prevents bogus execution rows from being materialized, but the dead pg_cron entry stays put forever — current reaper sweep only catches jobs with cron_ends_at in the past, not cancelled ones.

Why we deferred this from #24

The shape of the fix is two-part, not one:

  1. Tx scope: swap register_pg_cron(pool, ...) for register_pg_cron_conn(conn, ...) (already added in Dogfood CRON reaper as internal kronos job #24 for the reaper path) and the equivalent for unregister. Run them on the surrounding tx. The CRON-create path additionally needs to switch from scoped_connection (auto-commit) to scoped_transaction.

  2. Error propagation: stop swallowing the pg_cron error. Surface it as an AppError and let the surrounding tx roll back. This is an API contract change: today POST /jobs (CRON) succeeds whenever the row inserts; afterwards it can fail with a 5xx if pg_cron is unhealthy. Same for PATCH and DELETE on CRON jobs.

Wrapping in a tx without fixing the swallowing is cosmetic — the error path's if let Err(e) = ... would still let the tx commit. So the two changes have to ship together.

Proposed fix

  • Add unregister_pg_cron_conn(&mut PgConnection, ...) mirroring register_pg_cron_conn (added in Dogfood CRON reaper as internal kronos job #24) and unschedule_pg_cron_conn (already present).
  • In jobs.rs:
    • create: switch the CRON branch from scoped_connection to scoped_transaction; replace register_pg_cron(pool, ...) with register_pg_cron_conn(&mut *tx, ...); replace if let Err(e) = ... { tracing::error!(...); } with ? so the error bubbles up and rolls back the tx. The handler responds with whatever AppError mapping is appropriate (likely 5xx on pg_cron unhealthy, since the client can retry).
    • update: move unregister + register inside the tx that wraps retire_and_replace. Same ? substitution. Both succeed or both roll back together with the version flip.
    • cancel: wrap the cancel + unregister in a scoped_transaction. Same ? substitution.
  • (Optional, follow-up): consider whether the reaper sweep should also catch cancelled CRON jobs (status = 'RETIRED' without cron_ends_at) and unschedule their pg_cron entries — a defense-in-depth net for any path that left a row behind before this fix lands.

Out of scope

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions