Cancellable Query with AbortSignal option#3212
Conversation
…ted in the Connection class. Kept the client.ssl prop for now, as that is technically on the public API, but undocumented.
I’m pretty sure there are still unavoidable race conditions that make this not the right API. |
That was my original impression as well. One possible issue I see with this is destroying the connection, if the cancellation connection closes with an error. This is done, so we don't release the connection with a possible cancellation pending. However, if the cancellation actually failed, we just left possibly long running query alive on the server. Explicitly sending an RST might be necessary in this case, assuming that is handled similarly as cancel requests by the server. Even if all this works as I think it does, I'm not so sure if this API is a good idea. The intent of the code:
So depending on timing the query might be canceled, prevented from being executed or complete successfully. But the connection should be left In a clean state, or be destroyed. |
…pit init later resolves ## Summary mt#2244 added an init-timeout to `getSharedPersistenceService()`: a hung init is abandoned so the next caller retries. But the abandoned init keeps running, and if it **later resolves**, its provider connection pool leaks — nobody holds the orphaned `PersistenceService` to close it. **The original spec planned to thread `AbortSignal` through the provider; research showed that's the wrong shape and infeasible for our driver.** porsager/`postgres` accepts **no** `AbortSignal`/`signal` option (its only cancel primitive is `.cancel()` on an executed query object — [README](https://github.com/porsager/postgres)); `AbortSignal` is a *node-postgres/pg* feature ([pg#3212](brianc/node-postgres#3212)). Also, `connect_timeout` already bounds the connection phase (TCP/protocol/auth) at 10s in our config, and the postgres provider's `initialize()` self-cleans its pool on any failure — so the real residual leak is narrow: a `SELECT 1`/migration hang that exceeds the deadline **and then succeeds**. So the fix is **driver-agnostic and entirely in cockpit**: on timeout, attach a best-effort teardown to the orphaned init promise — if it resolves after the deadline, close it via the existing `PersistenceService.close()`. ## Key Changes - `src/cockpit/shared-persistence.ts` — on the timeout branch, `void init.then(svc => svc?.close?.()).catch(() => {})`. Best-effort: a late rejection (provider already self-cleaned) and any `close()` error are swallowed and never mask the timeout rejection. Doc comment updated with the reframe + why `AbortSignal` was rejected. - `src/cockpit/shared-persistence.test.ts` — +2 tests: late-**resolve** → `close()` called exactly once; late-**reject** → `close()` not called. No `PersistenceService` / `PersistenceProvider` interface change. ## Execution evidence: ``` bun test src/cockpit/shared-persistence.test.ts 9 pass, 0 fail (mt#2244: timeout+retry, cache-on-success, elapsed-ms, createService-hang, env-override ×3; mt#2248: late-resolve→close, late-reject→no-close) ``` `validate_typecheck`: pass. `validate_lint`: pass (repo-wide, 0/0). ## Task coordination - Closes **mt#2248** (reframed 2026-06-03 from "AbortSignal through the domain" to the cockpit-local teardown after driver research; spec updated accordingly). - Completes the cockpit-side wedge family: mt#2186 (investigation) → mt#2244 (init-timeout + reset) → mt#2245 (Octokit root-cause timeout) → **mt#2248 (orphan teardown)**. - Out of scope / possible follow-up: a per-operation timeout on `SELECT 1` / `runMigrations` (porsager has no built-in per-query timeout) — only worth filing if the migration-hang class is observed in production. Co-Authored-By: minsky-ai[bot] <minsky-ai[bot]@users.noreply.github.com>
A suggestion for a new cancellation API (based on the original request in #2774)
This query-specific cancellation API only makes sense because the library waits for each query to complete before it submits the next. With query pipelining it would be a lot messier - or simpler, because the API could offer less guarantees.
While the tests are turned off if
AbortControlleris not available, the library code itself should still be compatible with older node versions.Example
First, in a separate commit (913967f) is a small change that moves the SSL initialization logic to the
Connection.This means the
Connectionno longer needs a separate an'sslconnect'event.Then, the
Connectiongot a newcon.cancelWithClone()method that tries to connect to the exact same server, and send a cancel request.This required a bit of extra state in the
Connection, to keep track of thehost,port,configand thebackendKeyData.I'm not thrilled with duplicating state from the
Client(or with the name of the method for that matter).What I like about this approach is that all the logic around the cancellation can be implemented in the
Submittableand doesn't depend on any special handling by theClient.Finally the
Querygot asignaloption that can be used to cancel it either before, or after it has been submitted to the server.If the query has been canceled before submit, it will return the abort reason in submit.
If an actual cancel request has been sent, the query callback and the
'error'and'end'events are delayed until the cancel request has finished.This is done to avoid accidentally canceling a different query on the same connection.
If the cancel request isn't completed cleanly, the original connection is destroyed.
This might be triggered if you get an ECONNRESET after sending the cancel request? Honestly I'm not sure.
This should probably treat the errors after the cancel request has been sent differently from errors during connecting.