Add force-break mechanism for stuck advisory locks during deployments by KyleAMathews · Pull Request #3781 · electric-sql/electric

KyleAMathews · 2026-01-27T18:06:22Z

Summary

This PR adds a force-break mechanism to handle stuck advisory locks that aren't released properly during rolling deployments. After waiting 30 seconds to acquire a lock, the system will now forcefully terminate backends holding the lock that have been connected for more than 30 seconds, regardless of replication slot status.

Key Changes

- Lock wait tracking: Added a lock_wait_start field to track when lock acquisition began, allowing the system to detect when a lock is taking too long to acquire.
- Force-break threshold: Introduced @force_lock_break_threshold (30 seconds) to determine when to switch from safe mode (only breaking locks if the slot is inactive) to force mode (breaking any old backend holding the lock).
- Enhanced LockBreakerConnection:
  - Modified stop_backends_and_close/4 to accept a force option
  - Split lock-breaking logic into two strategies:
    - lock_breaker_query_safe/3: Original behavior, only terminates backends if the replication slot is inactive
    - lock_breaker_query_force/3: New behavior, terminates any backend holding the lock that has been connected for 30+ seconds
  - Added the @force_break_min_backend_age_seconds constant to prevent accidentally breaking locks from freshly started instances
- Improved shutdown order: Moved replication backend termination earlier in the shutdown sequence to ensure the advisory lock is released before other cleanup operations, which is critical for graceful handoff during rolling deployments.
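
For readers who want the escalation rule in one place, here is a minimal sketch of the safe-to-force decision described above. It is not the code from this PR: the module name `LockBreakSketch`, the function `mode/2`, and the choice of milliseconds as the unit are assumptions made for illustration. Only the 30-second threshold and the idea of a lock_wait_start timestamp come from the description.

```elixir
defmodule LockBreakSketch do
  # Hypothetical stand-in for @force_lock_break_threshold (30 seconds).
  # Expressing it in milliseconds is an assumption of this sketch.
  @force_lock_break_threshold_ms 30_000

  @doc """
  Returns :safe or :force given the time the wait began and the current
  time, both in milliseconds. A nil start means we are not waiting yet.
  """
  def mode(nil, _now_ms), do: :safe

  def mode(lock_wait_start_ms, now_ms)
      when now_ms - lock_wait_start_ms >= @force_lock_break_threshold_ms,
      do: :force

  def mode(_lock_wait_start_ms, _now_ms), do: :safe
end

start = System.monotonic_time(:millisecond)
LockBreakSketch.mode(start, start + 5_000)   # :safe
LockBreakSketch.mode(start, start + 31_000)  # :force
```

In this sketch the result would be what gets passed as the force option to stop_backends_and_close/4; how the real manager wires that up is not shown in this PR description.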

Implementation Details

The force-break mechanism is particularly useful on platforms like Neon where connections may not close cleanly. By checking backend connection age (backend_start), we avoid breaking locks from newly started Electric instances while still being able to recover from truly stuck locks held by dead processes.
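
To make the backend-age check concrete, the following is a rough sketch of what a force-mode query could look like. It is an illustration only and not the body of lock_breaker_query_force/3, which is not shown here. The module name `ForceBreakQuerySketch`, the function `force_query/1`, and the assumption that the advisory lock is taken with a single bigint key are all hypothetical. The catalog pieces are standard PostgreSQL: `pg_locks` reports a bigint advisory key with its high half in `classid`, its low half in `objid`, and `objsubid = 1`, and `pg_stat_activity.backend_start` records when the backend connected.

```elixir
defmodule ForceBreakQuerySketch do
  # Hypothetical stand-in for @force_break_min_backend_age_seconds.
  @min_backend_age_seconds 30

  @doc """
  Builds a parameterised SQL statement that terminates every backend
  holding the advisory lock identified by `lock_key` (a bigint), provided
  the backend has been connected for longer than the minimum age.
  Returns {sql, params}.
  """
  def force_query(lock_key) when is_integer(lock_key) do
    sql = """
    SELECT pg_terminate_backend(l.pid)
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE l.locktype = 'advisory'
      AND l.granted
      AND l.objsubid = 1
      AND l.database = (SELECT oid FROM pg_database WHERE datname = current_database())
      AND ((l.classid::bigint << 32) | l.objid::bigint) = $1::bigint
      AND a.backend_start < now() - interval '#{@min_backend_age_seconds} seconds'
    """

    {sql, [lock_key]}
  end
end
```

The returned tuple is shaped so it could be handed to a client call such as `Postgrex.query/3`. Unlike the safe strategy, nothing here consults `pg_replication_slots`; the age filter on `backend_start` is the only guard against terminating a freshly started instance.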

The lock wait timer is reset when transitioning to the configuration step, ensuring we only measure the time spent actually waiting for the lock acquisition.
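
The timer behaviour can be sketched as a small state transition. This is a simplified illustration under stated assumptions: the struct, the step names `:acquiring_lock` and `:configuring`, and the function names are hypothetical, and only the lock_wait_start field name comes from the PR.

```elixir
defmodule LockWaitTimerSketch do
  # Hypothetical state; only the lock_wait_start field name is from the PR.
  defstruct step: :acquiring_lock, lock_wait_start: nil

  # Record the start time on the first attempt only, so that retries
  # keep measuring from when the wait originally began.
  def begin_wait(%__MODULE__{lock_wait_start: nil} = state, now_ms),
    do: %{state | lock_wait_start: now_ms}

  def begin_wait(%__MODULE__{} = state, _now_ms), do: state

  # Moving on to configuration clears the timer, so a later wait starts
  # from zero instead of inheriting the old timestamp.
  def lock_acquired(%__MODULE__{} = state),
    do: %{state | step: :configuring, lock_wait_start: nil}
end
```

Clearing the field on the transition is what keeps the measurement limited to time spent waiting for the lock: without it, a later lock attempt could inherit a stale timestamp and escalate to force mode immediately.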

https://claude.ai/code/session_01K52FtgUQvbb6e1LKHZcqwu

This commit addresses container crashes during rolling deployments caused by PostgreSQL advisory lock contention. The issue occurred when:

1. Old Electric instance receives SIGTERM during deployment
2. Advisory lock isn't properly released (especially on Neon)
3. New instance can't acquire lock → returns 202 → ALB marks unhealthy
4. Cascade of task replacements with none able to acquire lock

Changes:

1. Reorder shutdown sequence (manager.ex):
   - Kill replication backend BEFORE replication client
   - This ensures pg_terminate_backend() runs first to release the lock
   - Previously the Elixir process was killed first, leaving the PG backend potentially orphaned with the lock still held
2. Add force mode to lock breaker (lock_breaker_connection.ex):
   - New `force: true` option for stop_backends_and_close/4
   - Force mode doesn't require the replication slot to be inactive
   - Only terminates backends connected for >30 seconds (safety check)
   - Prevents breaking locks from freshly started instances
3. Auto-escalate to force mode after timeout (manager.ex):
   - Track when we started waiting for the lock (lock_wait_start field)
   - After 30 seconds of waiting, switch to force mode
   - Logs warning when escalating to force mode

This should prevent the 2+ hour lock contention scenarios seen in production during rolling deployments.

https://claude.ai/code/session_01K52FtgUQvbb6e1LKHZcqwu

coderabbitai · 2026-01-27T18:06:34Z

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


codecov · 2026-01-27T18:08:48Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.56%. Comparing base (3f257aa) to head (38a3943).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3781      +/-   ##
==========================================
- Coverage   86.61%   86.56%   -0.05%     
==========================================
  Files          23       23              
  Lines        2039     2039              
  Branches      544      545       +1     
==========================================
- Hits         1766     1765       -1     
- Misses        271      272       +1     
  Partials        2        2

| Flag | Coverage Δ |
| --- | --- |
| packages/experimental | `87.73% <ø> (ø)` |
| packages/react-hooks | `86.48% <ø> (ø)` |
| packages/start | `82.83% <ø> (ø)` |
| packages/typescript-client | `92.12% <ø> (-0.08%)` ⬇️ |
| packages/y-electric | `56.05% <ø> (ø)` |
| typescript | `86.56% <ø> (-0.05%)` ⬇️ |
| unit-tests | `86.56% <ø> (-0.05%)` ⬇️ |

Flags with carried forward coverage won't be shown.


KyleAMathews wants to merge 1 commit into main from claude/investigate-container-crash-LYzpV
