Make idle_in_transaction regression test deterministic via per-row COPY delay #8485
Draft
…tions during shard moves

In block_writes mode, LockShardListMetadataOnWorkers() opens coordinated transactions on all metadata workers to hold advisory shard metadata locks. These connections remain open for the entire duration of the shard move, but workers not involved in the data copy have no commands to execute, so they sit idle-in-transaction until the coordinated transaction commits.

For large shards, the data copy can take hours, easily exceeding common idle_in_transaction_session_timeout values. When the timeout fires on an uninvolved worker, PostgreSQL terminates the connection and the move fails.

Fix by sending SET LOCAL idle_in_transaction_session_timeout = 0 on each metadata worker connection before acquiring locks. SET LOCAL scopes the change to the current transaction only, so normal sessions are unaffected.
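As a rough illustration of the scoping behaviour the commit message relies on, the following sketch can be run in a single plain PostgreSQL session. It is not the code path Citus uses: the advisory lock call and its key `12345` are hypothetical stand-ins for the shard metadata locks, chosen only so the transaction has something to hold.

```sql
-- Sketch only: shows that SET LOCAL lasts until the end of the transaction.
-- The lock key 12345 is a made-up placeholder, not a real shard ID.
BEGIN;

-- Disable the idle-in-transaction timeout for this transaction only.
SET LOCAL idle_in_transaction_session_timeout = 0;

-- Stand-in for acquiring the shard metadata locks.
SELECT pg_advisory_xact_lock(12345);

-- Inside the transaction the overridden value is in effect.
SHOW idle_in_transaction_session_timeout;

COMMIT;

-- After COMMIT the session-level (or server-level) value applies again.
SHOW idle_in_transaction_session_timeout;
```

Because the override ends with the transaction, a session that later sits idle in some unrelated transaction is still subject to whatever timeout the server or session has configured.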
Use a NOT VALID check constraint that calls pg_sleep() per row so that COPY (which fires check constraints) reliably takes > 1s during the block_writes shard move. This ensures the idle_in_transaction timeout fires on uninvolved metadata worker connections, making the regression test effective at detecting the original failure mode.

Co-authored-by: emelsimsek <13130350+emelsimsek@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Address feedback on shard move failures in block_writes mode" to "Make idle_in_transaction regression test deterministic via per-row COPY delay" on Feb 25, 2026
DESCRIPTION: Add per-row sleep constraint to make idle_in_transaction shard move test deterministic
The regression test for the `idle_in_transaction_session_timeout` fix was ineffective: the 100-row shard copies in well under the 1 s timeout, so the test would pass even without the fix and the bug would never be caught.

Fix: Introduce a `NOT VALID` check constraint backed by a `pg_sleep` wrapper so PostgreSQL's `COPY FROM` (which fires check constraints per row) makes the data-copy phase reliably exceed the timeout:

- `NOT VALID` keeps the initial `INSERT` fast while still enforcing the constraint on rows written by `COPY` during the shard move (~25 rows × 0.1 s ≈ 2.5 s per shard, well over the 1 s timeout).
- `ALTER TABLE` propagated to workers resolves correctly regardless of worker `search_path`.
- `DROP SCHEMA … CASCADE` in cleanup also drops the function; expected output updated accordingly.
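The general shape of such a constraint can be sketched on plain PostgreSQL as follows. This is an assumption-laden illustration, not the test from this PR: the schema `idle_tx_demo`, the table `t`, the function `sleep_then_true`, and the constraint name `slow_check` are all hypothetical, and the table is not distributed.

```sql
-- Hypothetical names throughout; this is not the PR's actual test file.
CREATE SCHEMA idle_tx_demo;

-- Wrapper that sleeps and then reports success, so it can back a CHECK.
-- PL/pgSQL functions are VOLATILE by default, so the call is not
-- folded away at plan time.
CREATE FUNCTION idle_tx_demo.sleep_then_true(delay_seconds double precision)
RETURNS boolean
LANGUAGE plpgsql
AS $$
BEGIN
    PERFORM pg_sleep(delay_seconds);
    RETURN true;
END;
$$;

CREATE TABLE idle_tx_demo.t (id int);

-- Load the rows before the constraint exists, so this stays fast.
INSERT INTO idle_tx_demo.t SELECT generate_series(1, 100);

-- NOT VALID skips checking the rows already in the table; rows written
-- afterwards (by INSERT or COPY FROM) are still checked one by one.
-- The function is schema-qualified so it does not depend on search_path.
ALTER TABLE idle_tx_demo.t
    ADD CONSTRAINT slow_check
    CHECK (idle_tx_demo.sleep_then_true(0.1)) NOT VALID;

-- Each new row now costs roughly 0.1 s to write.
INSERT INTO idle_tx_demo.t VALUES (101);

-- Cleanup removes the table, its constraint, and the function.
DROP SCHEMA idle_tx_demo CASCADE;
```

With a 0.1 s delay per row, copying about 25 rows takes roughly 25 × 0.1 s = 2.5 s, which matches the per-shard estimate in the description.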