Bound the count query with a server-side statement timeout #1463
Can you explain the footgun exactly? If I change |
Fair. The footgun documented is based on my team's experience in the past few weeks. AI tools are confidently recommending |
`TaskJobConcern#on_start` calls `Task#count` to populate the progress
bar before iteration begins. On large or poorly-indexed collections
this query can run for tens of seconds and stall the task, even though
the count is only a UI hint and a failure here should not fail the
run. We've seen this hit production tasks repeatedly at Shopify.
Active Record has no cross-platform query timeout API, and
`Timeout.timeout` is unsafe to use around database calls — it raises
across thread boundaries at arbitrary bytecode points, which can leave
a connection mid-protocol and the server-side query still running.
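
To make the hazard concrete, here is a minimal sketch using a hypothetical `FakeConnection` (not a real driver). It assumes a connection that tracks whether a reply is still outstanding, and shows `Timeout.timeout` abandoning the call after the request was sent but before the reply was read:

```ruby
require "timeout"

# Hypothetical stand-in for a database connection; not a real driver.
class FakeConnection
  attr_reader :awaiting_response

  def initialize
    @awaiting_response = false
  end

  # Sends a "query", waits for the server, then consumes the reply.
  def query(server_delay)
    @awaiting_response = true   # request is on the wire
    sleep(server_delay)         # stands in for the blocking socket read
    @awaiting_response = false  # reply consumed, connection reusable
    :ok
  end
end

conn = FakeConnection.new

result =
  begin
    # The timeout fires while the call is blocked waiting for the reply.
    Timeout.timeout(0.05) { conn.query(0.5) }
  rescue Timeout::Error
    :timed_out
  end

puts "result: #{result}"
puts "awaiting_response: #{conn.awaiting_response}"
```

The caller gets control back, but the fake connection still believes a reply is pending. With a real driver the server would also keep executing the query, which is the state a server-side timeout avoids.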
Wrap the count call in a small adapter-aware helper that issues the
database's native cancellation:
- PostgreSQL/PostGIS: `SET LOCAL statement_timeout` inside a
  `transaction(requires_new: true)` so cleanup is automatic on
  commit or rollback.
- MySQL/Trilogy: `SET SESSION max_execution_time`, with the prior
  value restored in an `ensure`.
- SQLite/unknown: no-op.
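
The dispatch described in the list above could look roughly like the following. This is a sketch under stated assumptions, not the code in this PR: `with_statement_timeout` and `RecordingConnection` are hypothetical names, the connection is a stand-in that records SQL instead of talking to a database, and the adapter name strings are assumed to match what `connection.adapter_name` returns.

```ruby
# Hypothetical stand-in that records SQL instead of talking to a database.
class RecordingConnection
  attr_reader :adapter_name, :statements

  def initialize(adapter_name, max_execution_time: 0)
    @adapter_name = adapter_name
    @max_execution_time = max_execution_time
    @statements = []
  end

  def execute(sql)
    @statements << sql
  end

  # Mimics the shape of a connection's transaction(requires_new: true).
  def transaction(requires_new: false)
    @statements << "BEGIN"
    result = yield
    @statements << "COMMIT"
    result
  rescue StandardError
    @statements << "ROLLBACK"
    raise
  end

  # Only ever asked for the current MySQL session timeout in this sketch.
  def select_value(_sql)
    @max_execution_time
  end
end

# Hypothetical helper: runs the block under the database's own timeout.
def with_statement_timeout(connection, timeout_ms)
  # nil, 0, or negative values disable the timeout without touching the connection.
  return yield if timeout_ms.nil? || timeout_ms <= 0

  case connection.adapter_name
  when "PostgreSQL", "PostGIS"
    # SET LOCAL is scoped to the transaction, so commit or rollback resets it.
    connection.transaction(requires_new: true) do
      connection.execute("SET LOCAL statement_timeout = #{Integer(timeout_ms)}")
      yield
    end
  when "Mysql2", "Trilogy"
    previous = connection.select_value("SELECT @@SESSION.max_execution_time")
    begin
      connection.execute("SET SESSION max_execution_time = #{Integer(timeout_ms)}")
      yield
    ensure
      # Restore the prior session value even if the block raised.
      connection.execute("SET SESSION max_execution_time = #{Integer(previous)}")
    end
  else
    yield # SQLite and unknown adapters: no-op
  end
end

pg = RecordingConnection.new("PostgreSQL")
with_statement_timeout(pg, 5_000) { :counted }
puts pg.statements.inspect

mysql = RecordingConnection.new("Trilogy", max_execution_time: 0)
with_statement_timeout(mysql, 5_000) { :counted }
puts mysql.statements.inspect
```

Both timeouts are expressed in milliseconds, which is why one setting can drive either branch.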
When the database cancels the query it raises
`ActiveRecord::QueryCanceled`, which `safe_count` rescues. The run
starts with a `nil` count (the progress bar shows progress without a
percentage) and the cancellation is reported to `Rails.error` as a
handled warning so it remains visible without paging anyone.
Default is 5 seconds, configurable via
`MaintenanceTasks.count_timeout_ms` (set to `nil` or `0` to disable).
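
The rescue path can be sketched as follows. This is illustrative only and makes several assumptions: `QueryCanceled` stands in for `ActiveRecord::QueryCanceled`, `RecordingReporter` stands in for `Rails.error`, and the two task structs are hypothetical. The `safe_count` signature here is a guess at the shape, not the method in this PR.

```ruby
# Stand-in for ActiveRecord::QueryCanceled so the sketch runs without Rails.
class QueryCanceled < StandardError; end

# Stand-in for Rails.error; records reports instead of forwarding them.
class RecordingReporter
  attr_reader :reports

  def initialize
    @reports = []
  end

  def report(error, handled:, severity:)
    @reports << { error: error, handled: handled, severity: severity }
  end
end

# Returns the task's count, or nil when the database cancelled the query.
def safe_count(task, reporter)
  task.count
rescue QueryCanceled => error
  # Reported as a handled warning: visible, but not a failure of the run.
  reporter.report(error, handled: true, severity: :warning)
  nil
end

# Hypothetical task whose count query completes normally.
FastTask = Struct.new(:rows) do
  def count
    rows
  end
end

# Hypothetical task whose count query is cancelled by the database.
SlowTask = Struct.new(:rows) do
  def count
    raise QueryCanceled, "canceling statement due to statement timeout"
  end
end

reporter = RecordingReporter.new
fast = safe_count(FastTask.new(10), reporter)
slow = safe_count(SlowTask.new(10), reporter)

puts fast.inspect
puts slow.inspect
puts reporter.reports.length
```

Only the cancellation error is rescued, so any other failure in `count` still propagates.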
The `:no_count` → `@collection_enum.size` fallback is intentionally
preserved here; changing that is a separate behaviour change.
Co-Authored-By: pi <pi@shopify.com>
Why
`Task#count` populates the progress bar before iteration begins and is called from `TaskJobConcern#on_start`. On large or poorly-indexed collections this query can run for tens of seconds and stall the task, even though the count is only a UI hint and a failure here should not fail the run.
We've seen this hit production tasks repeatedly on my team, with count queries running long enough to time out the entire run.
Why not `Timeout.timeout`?
Active Record has no cross-platform query timeout API, so the obvious fix is `Timeout.timeout(5) { @task.count }`. That's unsafe to use around database calls: it raises across thread boundaries at arbitrary bytecode points, which can leave the connection mid-protocol with the server-side query still running on the database. That's the path #1388 currently uses for its count-preview endpoint, and it's the path I'd like to avoid for `on_start`.
What this does instead
Wrap the count call in a small adapter-aware helper that issues the database's native cancellation:
- PostgreSQL/PostGIS: `SET LOCAL statement_timeout` inside `transaction(requires_new: true)`, auto-reset on commit/rollback
- MySQL/Trilogy: `SET SESSION max_execution_time`, prior value restored in `ensure`
When the database cancels the query it raises `ActiveRecord::QueryCanceled`. `safe_count` rescues it; the run starts with a `nil` count (the progress bar shows progress without a percentage) and the cancellation is reported to `Rails.error` as a handled warning so it remains visible without paging anyone.
Default is 5 seconds, configurable via `MaintenanceTasks.count_timeout_ms` (`nil` or `0` to disable).
Tests
Unit tests use mocked connections to assert the right SQL is issued for each adapter, that prior values are restored, that errors propagate, and that `nil`/`0`/negative timeouts short-circuit before touching the connection.
Integration tests against a real PostgreSQL connection (the existing `gemfiles/postgresql.gemfile` CI matrix) verify that a slow query is actually cancelled, that fast queries pass through, and that `statement_timeout` is reset after both successful and failed blocks. They skip on SQLite.
Questions for reviewers
- The default could be `nil` (opt-in) if you'd prefer to avoid behaviour changes for existing users on first upgrade.
- The timeout is set on `ActiveRecord::Base.connection`. If a task's `#count` queries a model on a different connection, the timeout won't apply there. Documented as a caveat; could be extended to introspect the relation's connection if useful.