Shard move in block_writes mode fails with idle_in_transaction_sessio…#8495

Merged
codeforall merged 1 commit into release-14.0 from release-14.0-8484
Mar 6, 2026

Conversation

@codeforall
Contributor

…n_timeout on metadata workers (#8484)

Description

When performing a shard move using block_writes transfer mode (either directly via citus_move_shard_placement or through the background rebalancer), the operation can fail with:

   ERROR: terminating connection due to idle-in-transaction timeout
   CONTEXT: while executing command on <worker_host>:<worker_port>

The failing worker is a metadata worker that is neither the source nor the target of the shard move.

Root Cause

LockShardListMetadataOnWorkers() opens coordinated transactions on all metadata workers to acquire advisory shard metadata locks via SELECT lock_shard_metadata(...). These transactions remain open until the entire shard move completes and the coordinated transaction commits.

In block_writes mode, the data copy phase (CopyShardsToNode) runs synchronously between the source and target workers. Metadata workers not involved in the copy have no commands to execute and their connections sit completely idle-in-transaction for the entire duration of the data copy.

For large shards, the copy can take significantly longer than common idle_in_transaction_session_timeout values. When the timeout fires on an uninvolved worker, PostgreSQL terminates the connection, causing the shard move to fail.
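
The failure condition described above can be sketched as a small model. This is an illustration only, not Citus code: `ShardMove`, `idle_workers`, and `terminated_workers` are hypothetical names, and the model assumes that the source and target connections count as active for the whole copy while every other metadata worker connection is idle in transaction for the full copy duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardMove:
    """Hypothetical description of one block_writes shard move."""
    source: str          # worker the shard is copied from
    target: str          # worker the shard is copied to
    copy_seconds: float  # duration of the synchronous data copy


def idle_workers(metadata_workers, move):
    """Metadata workers that hold an open lock transaction but take no part in the copy."""
    return [w for w in metadata_workers if w not in (move.source, move.target)]


def terminated_workers(metadata_workers, move, timeout_seconds):
    """Workers whose connection would be terminated while the copy runs.

    A timeout of 0 models a disabled idle_in_transaction_session_timeout,
    which is how PostgreSQL interprets the value 0.
    """
    if timeout_seconds == 0 or move.copy_seconds <= timeout_seconds:
        return []
    return idle_workers(metadata_workers, move)


workers = ["w1", "w2", "w3", "w4"]
move = ShardMove(source="w1", target="w2", copy_seconds=1800)
print(terminated_workers(workers, move, timeout_seconds=600))  # ['w3', 'w4']
print(terminated_workers(workers, move, timeout_seconds=0))    # []
```

With a 30-minute copy and a 10-minute timeout, only the uninvolved workers are affected, which matches the observation that the failing worker is neither the source nor the target.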

This also affects shard splits, since they follow the same code path through LockShardListMetadataOnWorkers.

Fix

LockShardListMetadataOnWorkers() should send SET LOCAL idle_in_transaction_session_timeout = 0 on each metadata worker connection before acquiring the locks. SET LOCAL scopes the change to the current transaction only, so normal sessions on the workers are unaffected.
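
The command ordering described in the fix can be sketched as follows. This is a hedged illustration rather than the actual change: `lock_commands` is a hypothetical helper, the argument shape of `lock_shard_metadata` (a lock mode followed by an array of shard IDs) is an assumption, and the shard IDs and lock mode in the usage lines are made-up values. The sketch also assumes the coordinated transaction is already open on the connection, since `SET LOCAL` only takes effect inside a transaction block.

```python
def lock_commands(shard_ids, lock_mode):
    """Build the statements sent to one metadata worker, in order.

    Assumes a transaction is already open on the connection; outside a
    transaction block SET LOCAL has no lasting effect.
    """
    if not shard_ids:
        raise ValueError("at least one shard id is required")
    id_list = ", ".join(str(int(shard_id)) for shard_id in shard_ids)
    return [
        # 0 disables the timeout for this transaction only.
        "SET LOCAL idle_in_transaction_session_timeout = 0",
        # Assumed argument shape: (lock_mode, array of shard ids).
        f"SELECT lock_shard_metadata({lock_mode}, ARRAY[{id_list}])",
    ]


for statement in lock_commands([102008, 102009], lock_mode=7):
    print(statement)
```

Sending the `SET LOCAL` first means the timeout is already disabled by the time the connection starts waiting, and the setting reverts automatically when the transaction commits or aborts.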

@codeforall codeforall merged commit aaa369e into release-14.0 Mar 6, 2026
687 of 692 checks passed
@codeforall codeforall deleted the release-14.0-8484 branch March 6, 2026 19:06
@codecov

codecov bot commented Mar 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.73%. Comparing base (286da11) to head (bb6bfb2).
⚠️ Report is 5 commits behind head on release-14.0.

Additional details and impacted files
@@               Coverage Diff                @@
##           release-14.0    #8495      +/-   ##
================================================
- Coverage         88.73%   88.73%   -0.01%     
================================================
  Files               287      287              
  Lines             63233    63235       +2     
  Branches           7922     7921       -1     
================================================
+ Hits              56109    56110       +1     
- Misses             4857     4858       +1     
  Partials           2267     2267              
