Shard move in block_writes mode fails with idle_in_transaction_sessio…#8491

Merged: ihalatci merged 1 commit into release-13.2 from release-13.2-8484 on Mar 5, 2026

@codeforall
Contributor

…n_timeout on metadata workers (#8484)

Description

When performing a shard move using block_writes transfer mode (either directly via citus_move_shard_placement or through the background rebalancer), the operation can fail with:

   ERROR: terminating connection due to idle-in-transaction timeout
   CONTEXT: while executing command on <worker_host>:<worker_port>

The failing worker is a metadata worker that is neither the source nor the target of the shard move.

Root Cause

LockShardListMetadataOnWorkers() opens coordinated transactions on all metadata workers to acquire advisory shard metadata locks via SELECT lock_shard_metadata(...). These transactions remain open until the entire shard move completes and the coordinated transaction commits.

In block_writes mode, the data copy phase (CopyShardsToNode) runs synchronously between the source and target workers. Metadata workers not involved in the copy have no commands to execute, so their connections sit idle in transaction for the entire duration of the data copy.

For large shards, the copy can take significantly longer than common idle_in_transaction_session_timeout values. When the timeout fires on an uninvolved worker, PostgreSQL terminates the connection, causing the shard move to fail.
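
The failure condition can be modeled in a few lines. This is a minimal sketch, not Citus code: the function name and the example durations are hypothetical. It assumes only the PostgreSQL behavior that a session idle inside an open transaction for longer than idle_in_transaction_session_timeout is terminated, and that a value of 0 disables the timeout.

```python
def session_survives_copy(idle_timeout_ms: int, copy_duration_ms: int) -> bool:
    """Return True if an uninvolved metadata worker's connection survives.

    Models a connection that holds an open transaction and receives no
    commands while the copy runs (hypothetical helper, not part of Citus).
    """
    if idle_timeout_ms == 0:
        # 0 disables idle_in_transaction_session_timeout in PostgreSQL.
        return True
    # The session is terminated once its idle time exceeds the timeout.
    return copy_duration_ms <= idle_timeout_ms


# Illustrative numbers only: a 60 s timeout and a 10 minute copy.
print(session_survives_copy(60_000, 600_000))  # False
print(session_survives_copy(0, 600_000))       # True
```

With a non-zero timeout, whether the move succeeds depends on the shard size rather than on anything the uninvolved worker does.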

This also affects shard splits, since they follow the same code path through LockShardListMetadataOnWorkers.

Fix

LockShardListMetadataOnWorkers() should send SET LOCAL idle_in_transaction_session_timeout = 0 on each metadata worker connection before acquiring the locks. SET LOCAL scopes the change to the current transaction only, so normal sessions on the workers are unaffected.
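
The ordering and scoping described in the fix can be sketched as follows. This is an illustrative toy model rather than the actual patch: `commands_for_metadata_worker` and `WorkerSession` are hypothetical names, and the arguments of lock_shard_metadata are left to the caller because the description does not spell them out.

```python
from typing import List, Optional

DISABLE_TIMEOUT = "SET LOCAL idle_in_transaction_session_timeout = 0"


def commands_for_metadata_worker(lock_command: str) -> List[str]:
    """Statements sent inside the worker transaction, in order.

    lock_command is the SELECT lock_shard_metadata(...) statement the
    caller has already built; its arguments are not modeled here.
    """
    return [DISABLE_TIMEOUT, lock_command]


class WorkerSession:
    """Toy model of SET LOCAL scoping (hypothetical, not a real driver)."""

    def __init__(self, session_timeout_ms: int) -> None:
        self.session_timeout_ms = session_timeout_ms
        self._local_override: Optional[int] = None

    def set_local_timeout(self, value_ms: int) -> None:
        # SET LOCAL only affects the current transaction.
        self._local_override = value_ms

    def effective_timeout_ms(self) -> int:
        if self._local_override is not None:
            return self._local_override
        return self.session_timeout_ms

    def end_transaction(self) -> None:
        # COMMIT or ROLLBACK discards the transaction-local value.
        self._local_override = None


session = WorkerSession(session_timeout_ms=60_000)
session.set_local_timeout(0)
print(session.effective_timeout_ms())  # 0 while the transaction is open
session.end_transaction()
print(session.effective_timeout_ms())  # 60000 once the transaction ends
```

Sending the SET LOCAL statement first means the timeout is already disabled by the time the connection starts idling behind the advisory lock.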


@ihalatci ihalatci merged commit 111a9ea into release-13.2 Mar 5, 2026
532 of 535 checks passed
@ihalatci ihalatci deleted the release-13.2-8484 branch March 5, 2026 10:52
@codecov

codecov bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.89%. Comparing base (51c223a) to head (d02ed1a).
⚠️ Report is 1 commit behind head on release-13.2.

Additional details and impacted files
@@               Coverage Diff                @@
##           release-13.2    #8491      +/-   ##
================================================
- Coverage         88.93%   88.89%   -0.04%     
================================================
  Files               287      287              
  Lines             63184    63186       +2     
  Branches           7950     7951       +1     
================================================
- Hits              56193    56171      -22     
- Misses             4674     4699      +25     
+ Partials           2317     2316       -1     
