Make idle_in_transaction regression test deterministic via per-row COPY delay #8485

Draft: Copilot wants to merge 3 commits into main from copilot/sub-pr-8484

Copilot AI commented Feb 25, 2026

DESCRIPTION: Add per-row sleep constraint to make idle_in_transaction shard move test deterministic

The regression test for the idle_in_transaction_session_timeout fix was ineffective: the shards of the 100-row test table copy in well under the 1 s timeout, so the test passes even without the fix and would never catch the bug.

Fix: Introduce a NOT VALID check constraint backed by a pg_sleep wrapper so PostgreSQL's COPY FROM (which fires check constraints per-row) makes the data-copy phase reliably exceed the timeout:

-- Created before distribute + insert, propagated to all workers by Citus DDL propagation
CREATE FUNCTION sleep_and_true(float8) RETURNS boolean LANGUAGE plpgsql AS $$
BEGIN
    PERFORM pg_sleep($1);
    RETURN true;
END;
$$;

-- NOT VALID: skips checking 100 existing rows; still fires for every row COPY writes
ALTER TABLE test_move ADD CONSTRAINT slow_copy
    CHECK (blocking_move_idle_timeout.sleep_and_true(0.1)) NOT VALID;
  • NOT VALID keeps the initial INSERT fast while still enforcing the constraint on rows written by COPY during the shard move (~25 rows × 0.1 s ≈ 2.5 s per shard, well over the 1 s timeout).
  • Schema-qualified function name ensures the deparsed ALTER TABLE propagated to workers resolves correctly regardless of worker search_path.
  • DROP SCHEMA … CASCADE in cleanup also drops the function; expected output updated accordingly.
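The NOT VALID behavior the bullets rely on can be sketched on a scratch single-node PostgreSQL database. This is a minimal illustration, not the PR's test: the schema `demo_slow_copy` and the table `demo` are hypothetical names, and only `sleep_and_true` and `slow_copy` mirror the snippet above.

```sql
-- Hypothetical scratch schema; not part of the PR's regression test.
CREATE SCHEMA demo_slow_copy;

CREATE FUNCTION demo_slow_copy.sleep_and_true(float8) RETURNS boolean
LANGUAGE plpgsql AS $$
BEGIN
    PERFORM pg_sleep($1);
    RETURN true;
END;
$$;

CREATE TABLE demo_slow_copy.demo (id int);

-- No constraint exists yet, so loading 100 rows is fast.
INSERT INTO demo_slow_copy.demo SELECT generate_series(1, 100);

-- NOT VALID skips the scan of the 100 existing rows, so this does not sleep.
ALTER TABLE demo_slow_copy.demo ADD CONSTRAINT slow_copy
    CHECK (demo_slow_copy.sleep_and_true(0.1)) NOT VALID;

-- The catalog records the constraint as not yet validated.
SELECT conname, convalidated
FROM pg_constraint
WHERE conrelid = 'demo_slow_copy.demo'::regclass;

-- Newly written rows are still checked: 10 rows should sleep about 10 x 0.1 s.
INSERT INTO demo_slow_copy.demo SELECT generate_series(101, 110);

-- Clean up; CASCADE also drops the function and the table.
DROP SCHEMA demo_slow_copy CASCADE;
```

Timing the two INSERT statements (for example with psql's `\timing`) should show the difference: the first finishes almost immediately, while the second is slowed by the per-row sleep.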

codeforall and others added 2 commits February 20, 2026 17:45
…tions during shard moves

In block_writes mode, LockShardListMetadataOnWorkers() opens coordinated
transactions on all metadata workers to hold advisory shard metadata locks.
These connections remain open for the entire duration of the shard move,
but workers not involved in the data copy have no commands to execute,
so they sit idle in transaction until the coordinated transaction commits.

For large shards, the data copy can take hours, easily exceeding common
idle_in_transaction_session_timeout values. When the timeout fires on an
uninvolved worker, PostgreSQL terminates the connection and the move fails.
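The failure mode described above can be reproduced by hand in a single psql session. This is a sketch under stated assumptions: the 1 s value is chosen to match the regression test's timeout, and the pause between statements is done manually rather than by any Citus code path.

```sql
-- Session-level setting for this illustration only.
SET idle_in_transaction_session_timeout = '1s';

BEGIN;
SELECT 1;   -- the transaction is open; the session now goes idle in transaction

-- Leave the session idle for more than 1 s before sending the next statement.
-- The server then ends the session with:
--   FATAL:  terminating connection due to idle-in-transaction timeout

SELECT 1;   -- fails: the server has already closed the connection
```

An uninvolved metadata worker connection is in the same position during a long shard move: it holds its lock inside an open transaction and receives no further commands until commit.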

Fix by sending SET LOCAL idle_in_transaction_session_timeout = 0 on each
metadata worker connection before acquiring locks. SET LOCAL scopes the
change to the current transaction only, so normal sessions are unaffected.

Use a NOT VALID check constraint that calls pg_sleep() per row so that
COPY (which fires check constraints) reliably takes > 1s during the
block_writes shard move. This ensures the idle_in_transaction timeout
fires on uninvolved metadata worker connections, making the regression
test effective at detecting the original failure mode.
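The SET LOCAL scoping that the fix in the first commit message relies on can be checked in a plain psql session. This is a minimal sketch, assuming a session-level timeout of 1 s as in the test; it does not show the actual statements Citus sends to workers.

```sql
-- Session-level value, standing in for a cluster-wide default.
SET idle_in_transaction_session_timeout = '1s';

BEGIN;
-- Applies only until the end of this transaction.
SET LOCAL idle_in_transaction_session_timeout = 0;
SHOW idle_in_transaction_session_timeout;   -- reports 0 (timeout disabled)
COMMIT;

-- After commit the session-level value is back in effect.
SHOW idle_in_transaction_session_timeout;   -- reports 1s
```

Because the override disappears at commit or rollback, other transactions on the same connection keep the configured timeout.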

Co-authored-by: emelsimsek <13130350+emelsimsek@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Address feedback on shard move failures in block_writes mode" to "Make idle_in_transaction regression test deterministic via per-row COPY delay" on Feb 25, 2026
Copilot AI requested a review from emelsimsek February 25, 2026 14:08
Base automatically changed from muusama/idle_in_trans to main March 2, 2026 11:40