Public CRON job paths swallow pg_cron registration errors and skip transactional atomicity

## Summary

The public CRON-job paths in `crates/api/src/handlers/jobs.rs` write to the `jobs` table and then call `register_pg_cron` / `unregister_pg_cron` against a fresh pool connection — *after* the row write has committed and with the pg_cron error logged-and-swallowed. The API returns `2xx` even when the pg_cron operation failed, so the persisted `jobs` row can end up out of sync with what pg_cron is actually scheduled to run.

Surfaced during review of #24 ([thread](https://github.com/juspay/kronos/pull/24#discussion_r3333525142)). The reaper-provisioning path was fixed there; the public paths are deferred because the right fix is not just a tx scope change — it's also a behavior change around how the API reports pg_cron failures, which deserves an explicit decision.

## The three affected sites

### 1. `POST /workspaces/{w}/jobs` with `trigger: CRON` (create)

`crates/api/src/handlers/jobs.rs:187-215`

```rust
let mut conn = scoped_connection(...).await?;          // auto-commit, NOT a tx
let job = create_cron(&mut *conn, ...).await?;          // row committed immediately

if let Err(e) = register_pg_cron(&state.pool, ..., job.job_id, cron_expr).await {
    tracing::error!(job_id = %job.job_id, "Failed to register pg_cron job: {}", e);
    // swallowed — handler still returns 201 Created below
}
```

Failure mode: row in `jobs` with `status = 'ACTIVE'`, no pg_cron entry → job never fires. Client got `201 Created` and a `job_id`.

### 2. `PATCH /workspaces/{w}/jobs/{j}` (update — retire-and-replace)

`crates/api/src/handlers/jobs.rs:350-371`

```rust
let mut tx = scoped_transaction(...).await?;
let created = retire_and_replace(&mut *tx, ...).await?;
tx.commit().await?;                                     // old job RETIRED, new job ACTIVE

if let Err(e) = unregister_pg_cron(...).await { tracing::error!(...); }  // swallowed
if let Err(e) = register_pg_cron(..., created.job_id, ...).await { tracing::error!(...); }  // swallowed
```

Failure modes:
- `unregister` fails: pg_cron still scheduled for the old job_id. The `cron_command` guards on `status = 'ACTIVE'`, so it materializes no executions — but stays in `cron.job` as dead weight.
- `register` fails: new version row has `status = 'ACTIVE'` but no schedule → never fires. Client got `200 OK`.

### 3. `DELETE /workspaces/{w}/jobs/{j}` (cancel)

`crates/api/src/handlers/jobs.rs:416-421`

```rust
let cancelled = cancel(&mut *conn, ...).await?;         // status flipped to RETIRED via auto-commit

if job.trigger_type == "CRON" {
    if let Err(e) = unregister_pg_cron(...).await { tracing::error!(...); }  // swallowed
}
```

Failure mode: row `RETIRED`, pg_cron still scheduled. `cron_command`'s `status = 'ACTIVE'` guard prevents bogus execution rows from being materialized, but the dead pg_cron entry stays put forever — current reaper sweep only catches jobs with `cron_ends_at` in the past, not cancelled ones.

## Why we deferred this from #24

The shape of the fix is two-part, not one:

1. **Tx scope:** swap `register_pg_cron(pool, ...)` for `register_pg_cron_conn(conn, ...)` (already added in #24 for the reaper path) and the equivalent for `unregister`. Run them on the surrounding tx. The CRON-create path additionally needs to switch from `scoped_connection` (auto-commit) to `scoped_transaction`.

2. **Error propagation:** stop swallowing the pg_cron error. Surface it as an `AppError` and let the surrounding tx roll back. This is an **API contract change**: today `POST /jobs (CRON)` succeeds whenever the row inserts; afterwards it can fail with a 5xx if pg_cron is unhealthy. Same for `PATCH` and `DELETE` on CRON jobs.

Wrapping in a tx without fixing the swallowing is cosmetic — the error path's `if let Err(e) = ...` would still let the tx commit. So the two changes have to ship together.

## Proposed fix

- Add `unregister_pg_cron_conn(&mut PgConnection, ...)` mirroring `register_pg_cron_conn` (added in #24) and `unschedule_pg_cron_conn` (already present).
- In `jobs.rs`:
  - **create:** switch the CRON branch from `scoped_connection` to `scoped_transaction`; replace `register_pg_cron(pool, ...)` with `register_pg_cron_conn(&mut *tx, ...)`; replace `if let Err(e) = ... { tracing::error!(...); }` with `?` so the error bubbles up and rolls back the tx. The handler responds with whatever `AppError` mapping is appropriate (likely 5xx on pg_cron unhealthy, since the client can retry).
  - **update:** move `unregister + register` inside the tx that wraps `retire_and_replace`. Same `?` substitution. Both succeed or both roll back together with the version flip.
  - **cancel:** wrap the cancel + unregister in a `scoped_transaction`. Same `?` substitution.
- (Optional, follow-up): consider whether the reaper sweep should also catch *cancelled* CRON jobs (`status = 'RETIRED'` without `cron_ends_at`) and unschedule their pg_cron entries — a defense-in-depth net for any path that left a row behind before this fix lands.

## Out of scope

- The reaper's own provisioning path (handled in #24).
- The scheduler's CRON materialization path, if any (this issue is scoped to the API handlers).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Public CRON job paths swallow pg_cron registration errors and skip transactional atomicity #32

Summary

The three affected sites

1. `POST /workspaces/{w}/jobs` with `trigger: CRON` (create)

2. `PATCH /workspaces/{w}/jobs/{j}` (update — retire-and-replace)

3. `DELETE /workspaces/{w}/jobs/{j}` (cancel)

Why we deferred this from #24

Proposed fix

Out of scope

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Public CRON job paths swallow pg_cron registration errors and skip transactional atomicity #32

Description

Summary

The three affected sites

1. POST /workspaces/{w}/jobs with trigger: CRON (create)

2. PATCH /workspaces/{w}/jobs/{j} (update — retire-and-replace)

3. DELETE /workspaces/{w}/jobs/{j} (cancel)

Why we deferred this from #24

Proposed fix

Out of scope

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. `POST /workspaces/{w}/jobs` with `trigger: CRON` (create)

2. `PATCH /workspaces/{w}/jobs/{j}` (update — retire-and-replace)

3. `DELETE /workspaces/{w}/jobs/{j}` (cancel)