Skip to content

Cancellable Query with AbortSignal option#3212

Open
boromisp wants to merge 4 commits into
brianc:masterfrom
boromisp:cancellation_experiment_3
Open

Cancellable Query with AbortSignal option#3212
boromisp wants to merge 4 commits into
brianc:masterfrom
boromisp:cancellation_experiment_3

Conversation

@boromisp

@boromisp boromisp commented May 12, 2024

Copy link
Copy Markdown
Contributor

A suggestion for a new cancellation API (based on the original request in #2774)

This query-specific cancellation API only makes sense because the library waits for each query to complete before it submits the next. With query pipelining it would be a lot messier - or simpler, because the API could offer less guarantees.

While the tests are turned off if AbortController is not available, the library code itself should still be compatible with older node versions.

Example

const r1 = await client.query('SELECT now()')

try {
  await client.query({
    text: 'SELECT pg_sleep(10)',
    signal: AbortSignal.timeout(1_000),
  })
} catch (err) {
  console.log(err.code, err.message) // 57014 canceling statement due to user request
}

const r2 = await client.query(
  'SELECT now() - $1 AS delay',
  [r1.rows[0].now],
)

console.log(r2.rows[0].delay) // PostgresInterval { seconds: 1, milliseconds: 9.592 }

First, in a separate commit (913967f) is a small change that moves the SSL initialization logic to the Connection.
This means the Connection no longer needs a separate an 'sslconnect' event.


Then, the Connection got a new con.cancelWithClone() method that tries to connect to the exact same server, and send a cancel request.

This required a bit of extra state in the Connection, to keep track of the host, port, config and the backendKeyData.

I'm not thrilled with duplicating state from the Client (or with the name of the method for that matter).

What I like about this approach is that all the logic around the cancellation can be implemented in the Submittable and doesn't depend on any special handling by the Client.


Finally the Query got a signal option that can be used to cancel it either before, or after it has been submitted to the server.

If the query has been canceled before submit, it will return the abort reason in submit.

If an actual cancel request has been sent, the query callback and the 'error' and 'end' events are delayed until the cancel request has finished.

This is done to avoid accidentally canceling a different query on the same connection.

If the cancel request isn't completed cleanly, the original connection is destroyed.

This might be triggered if you get an ECONNRESET after sending the cancel request? Honestly I'm not sure.
This should probably treat the errors after the cancel request has been sent differently from errors during connecting.

@charmander

Copy link
Copy Markdown
Collaborator

This query-specific cancellation API only makes sense because the library waits for each query to complete before it submits the next.

I’m pretty sure there are still unavoidable race conditions that make this not the right API.

@boromisp

Copy link
Copy Markdown
Contributor Author

I’m pretty sure there are still unavoidable race conditions that make this not the right API.

That was my original impression as well.
But I wanted to write up an implementation, to see the actual problems.

One possible issue I see with this is destroying the connection, if the cancellation connection closes with an error.

This is done, so we don't release the connection with a possible cancellation pending.

However, if the cancellation actually failed, we just left possibly long running query alive on the server. Explicitly sending an RST might be necessary in this case, assuming that is handled similarly as cancel requests by the server.

Even if all this works as I think it does, I'm not so sure if this API is a good idea.


The intent of the code:

  • If you cancel before submitting the query, you skip the query.
  • If you cancel after submitting the query, but before the query is processed, the query should be executed.
  • If you cancel while the query is being executed, the query should be canceled. I'm not so sure about the interaction with the extended query protocol.
  • If you cancel after the query has been executed, nothing should happen.
  • Additionally, if there is an error during cancellation, the connection is destroyed.

So depending on timing the query might be canceled, prevented from being executed or complete successfully.

But the connection should be left In a clean state, or be destroyed.

edobry added a commit to edobry/minsky that referenced this pull request Jun 3, 2026
…pit init later resolves

## Summary

mt#2244 added an init-timeout to `getSharedPersistenceService()`: a hung init is abandoned so the next caller retries. But the abandoned init keeps running, and if it **later resolves**, its provider connection pool leaks — nobody holds the orphaned `PersistenceService` to close it.

**The original spec planned to thread `AbortSignal` through the provider; research showed that's the wrong shape and infeasible for our driver.** porsager/`postgres` accepts **no** `AbortSignal`/`signal` option (its only cancel primitive is `.cancel()` on an executed query object — [README](https://github.com/porsager/postgres)); `AbortSignal` is a *node-postgres/pg* feature ([pg#3212](brianc/node-postgres#3212)). Also, `connect_timeout` already bounds the connection phase (TCP/protocol/auth) at 10s in our config, and the postgres provider's `initialize()` self-cleans its pool on any failure — so the real residual leak is narrow: a `SELECT 1`/migration hang that exceeds the deadline **and then succeeds**.

So the fix is **driver-agnostic and entirely in cockpit**: on timeout, attach a best-effort teardown to the orphaned init promise — if it resolves after the deadline, close it via the existing `PersistenceService.close()`.

## Key Changes

- `src/cockpit/shared-persistence.ts` — on the timeout branch, `void init.then(svc => svc?.close?.()).catch(() => {})`. Best-effort: a late rejection (provider already self-cleaned) and any `close()` error are swallowed and never mask the timeout rejection. Doc comment updated with the reframe + why `AbortSignal` was rejected.
- `src/cockpit/shared-persistence.test.ts` — +2 tests: late-**resolve** → `close()` called exactly once; late-**reject** → `close()` not called.

No `PersistenceService` / `PersistenceProvider` interface change.

## Execution evidence:

```
bun test src/cockpit/shared-persistence.test.ts
 9 pass, 0 fail
   (mt#2244: timeout+retry, cache-on-success, elapsed-ms, createService-hang,
    env-override ×3; mt#2248: late-resolve→close, late-reject→no-close)
```

`validate_typecheck`: pass. `validate_lint`: pass (repo-wide, 0/0).

## Task coordination

- Closes **mt#2248** (reframed 2026-06-03 from "AbortSignal through the domain" to the cockpit-local teardown after driver research; spec updated accordingly).
- Completes the cockpit-side wedge family: mt#2186 (investigation) → mt#2244 (init-timeout + reset) → mt#2245 (Octokit root-cause timeout) → **mt#2248 (orphan teardown)**.
- Out of scope / possible follow-up: a per-operation timeout on `SELECT 1` / `runMigrations` (porsager has no built-in per-query timeout) — only worth filing if the migration-hang class is observed in production.

Co-Authored-By: minsky-ai[bot] <minsky-ai[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants