Add force-break mechanism for stuck advisory locks during deployments #3781
KyleAMathews wants to merge 1 commit into main
This commit addresses container crashes during rolling deployments caused by
PostgreSQL advisory lock contention. The issue occurred when:
1. The old Electric instance receives SIGTERM during deployment
2. The advisory lock isn't properly released (especially on Neon)
3. The new instance can't acquire the lock → returns HTTP 202 → ALB marks it unhealthy
4. Task replacements cascade, with none able to acquire the lock
Changes:
1. Reorder shutdown sequence (manager.ex):
   - Kill replication backend BEFORE replication client
   - This ensures pg_terminate_backend() runs first to release the lock
   - Previously the Elixir process was killed first, leaving the PG backend potentially orphaned with the lock still held
2. Add force mode to lock breaker (lock_breaker_connection.ex):
   - New `force: true` option for stop_backends_and_close/4
   - Force mode doesn't require the replication slot to be inactive
   - Only terminates backends connected for >30 seconds (safety check)
   - Prevents breaking locks from freshly started instances
3. Auto-escalate to force mode after timeout (manager.ex):
   - Track when we started waiting for the lock (lock_wait_start field)
   - After 30 seconds of waiting, switch to force mode
   - Logs warning when escalating to force mode
This should prevent the 2+ hour lock contention scenarios seen in production
during rolling deployments.
https://claude.ai/code/session_01K52FtgUQvbb6e1LKHZcqwu
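The auto-escalation described in change 3 can be sketched roughly as follows. This is a minimal illustration, not the code from manager.ex: the module name `LockWaitSketch`, its function names, the millisecond unit, and the `>=` comparison are all assumptions made for the example. Only the `lock_wait_start` field and the 30-second threshold come from the commit message.

```elixir
defmodule LockWaitSketch do
  # Hypothetical stand-in for the 30-second threshold named in the PR.
  @force_lock_break_threshold_ms 30_000

  # Record when waiting began; keep the original timestamp on later retries.
  def start_waiting(%{lock_wait_start: nil} = state, now_ms),
    do: %{state | lock_wait_start: now_ms}

  def start_waiting(state, _now_ms), do: state

  # Clear the timer once the lock is acquired and the next step begins.
  def lock_acquired(state), do: %{state | lock_wait_start: nil}

  # :safe until the threshold has elapsed, :force afterwards.
  def break_mode(%{lock_wait_start: nil}, _now_ms), do: :safe

  def break_mode(%{lock_wait_start: started}, now_ms)
      when now_ms - started >= @force_lock_break_threshold_ms,
      do: :force

  def break_mode(_state, _now_ms), do: :safe
end
```

A caller would typically pass `System.monotonic_time(:millisecond)` as `now_ms`, so that wall-clock adjustments do not affect the measured wait.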
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #3781      +/-   ##
==========================================
- Coverage   86.61%   86.56%   -0.05%
==========================================
  Files          23       23
  Lines        2039     2039
  Branches      544      545       +1
==========================================
- Hits         1766     1765       -1
- Misses        271      272       +1
  Partials        2        2
Summary
This PR adds a force-break mechanism to handle stuck advisory locks that aren't released properly during rolling deployments. After waiting 30 seconds to acquire a lock, the system will now forcefully terminate backends holding the lock that have been connected for more than 30 seconds, regardless of replication slot status.
Key Changes
- Lock wait tracking: Added `lock_wait_start` field to track when lock acquisition began, allowing the system to detect when a lock is taking too long to acquire.
- Force-break threshold: Introduced `@force_lock_break_threshold` (30 seconds) to determine when to switch from safe mode (only breaking locks if the slot is inactive) to force mode (breaking any old backend holding the lock).
- Enhanced LockBreakerConnection:
  - Extended `stop_backends_and_close/4` to accept a `force` option
  - `lock_breaker_query_safe/3`: Original behavior - only terminates backends if the replication slot is inactive
  - `lock_breaker_query_force/3`: New behavior - terminates any backend holding the lock that's been connected for 30+ seconds
  - Added `@force_break_min_backend_age_seconds` constant to prevent accidentally breaking locks from freshly started instances
- Improved shutdown order: Moved replication backend termination earlier in the shutdown sequence to ensure the advisory lock is released before other cleanup operations, which is critical for graceful handoff during rolling deployments.
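To illustrate the backend-age safety check, here is a small sketch of the selection rule applied in plain Elixir. In the PR the check lives in the SQL built by `lock_breaker_query_force/3`; this example only mirrors the idea. The module `ForceBreakSketch`, the function `pids_to_terminate/2`, and the map shape are hypothetical. The strict `>` comparison follows the commit message's ">30 seconds" wording, which is an assumption given that this summary says "30+ seconds".

```elixir
defmodule ForceBreakSketch do
  # Hypothetical stand-in for @force_break_min_backend_age_seconds.
  @min_backend_age_seconds 30

  # `backends` is a list of maps with :pid and :backend_start (a DateTime),
  # mirroring the pid and backend_start columns of pg_stat_activity.
  # Returns the pids of lock holders old enough to terminate in force mode.
  def pids_to_terminate(backends, %DateTime{} = now) do
    for %{pid: pid, backend_start: started} <- backends,
        DateTime.diff(now, started, :second) > @min_backend_age_seconds,
        do: pid
  end
end
```

With this rule, a backend that connected a few seconds ago (for example, a freshly started Electric instance that has just taken the lock) is left alone, while a long-lived backend left over from a terminated instance is selected.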
Implementation Details
The force-break mechanism is particularly useful on platforms like Neon where connections may not close cleanly. By checking backend connection age (`backend_start`), we avoid breaking locks from newly started Electric instances while still being able to recover from truly stuck locks held by dead processes.

The lock wait timer is reset when transitioning to the configuration step, ensuring we only measure the time spent actually waiting for the lock acquisition.