
Add force-break mechanism for stuck advisory locks during deployments #3781

Open

KyleAMathews wants to merge 1 commit into main from claude/investigate-container-crash-LYzpV

Conversation

@KyleAMathews
Contributor

Summary

This PR adds a force-break mechanism to handle stuck advisory locks that aren't released properly during rolling deployments. After waiting 30 seconds to acquire a lock, the system will now forcefully terminate backends holding the lock that have been connected for more than 30 seconds, regardless of replication slot status.

Key Changes

  • Lock wait tracking: Added lock_wait_start field to track when lock acquisition began, allowing the system to detect when a lock is taking too long to acquire.

  • Force-break threshold: Introduced @force_lock_break_threshold (30 seconds) to determine when to switch from safe mode (only breaking locks if slot is inactive) to force mode (breaking any old backend holding the lock).

  • Enhanced LockBreakerConnection:

    • Modified stop_backends_and_close/4 to accept a force option
    • Split lock-breaking logic into two strategies (sketched just after this list):
      • lock_breaker_query_safe/3: Original behavior - only terminates backends if the replication slot is inactive
      • lock_breaker_query_force/3: New behavior - terminates any backend holding the lock that's been connected for 30+ seconds
    • Added @force_break_min_backend_age_seconds constant to prevent accidentally breaking locks from freshly started instances
  • Improved shutdown order: Moved replication backend termination earlier in the shutdown sequence to ensure the advisory lock is released before other cleanup operations, which is critical for graceful handoff during rolling deployments.
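
A minimal sketch of how the safe/force dispatch could hang together, assuming the caller passes in a query-running function. Only the function names, the `force` option, and the 30-second minimum backend age come from this PR; the `query_fun` plumbing, argument order, and the query stubs are illustrative assumptions:

```elixir
# Illustrative sketch only, not Electric's actual LockBreakerConnection code.
defmodule LockBreakerSketch do
  @force_break_min_backend_age_seconds 30

  def stop_backends_and_close(query_fun, slot_name, lock_key, opts \\ []) do
    sql =
      if Keyword.get(opts, :force, false) do
        # Force mode: terminate any old-enough backend holding the lock.
        lock_breaker_query_force(slot_name, lock_key, @force_break_min_backend_age_seconds)
      else
        # Safe mode (original behaviour): only terminate backends while the
        # replication slot is inactive.
        lock_breaker_query_safe(slot_name, lock_key, @force_break_min_backend_age_seconds)
      end

    query_fun.(sql)
  end

  defp lock_breaker_query_safe(_slot_name, _lock_key, _min_age_seconds),
    do: "-- slot-inactivity query elided"

  defp lock_breaker_query_force(_slot_name, _lock_key, _min_age_seconds),
    do: "-- age-based query, sketched under Implementation Details below"
end
```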

Implementation Details

The force-break mechanism is particularly useful on platforms like Neon where connections may not close cleanly. By checking backend connection age (backend_start), we avoid breaking locks from newly started Electric instances while still being able to recover from truly stuck locks held by dead processes.
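
For illustration, one way the force query could be expressed against standard PostgreSQL catalogs (pg_locks, pg_stat_activity.backend_start, pg_terminate_backend). How Electric actually encodes its advisory lock key and parameterizes the query is not shown in this PR, so treat the details below as assumptions:

```elixir
defmodule ForceQuerySketch do
  # Hypothetical body for lock_breaker_query_force/3. Values are interpolated
  # for brevity; a real implementation would bind parameters instead.
  def lock_breaker_query_force(_slot_name, lock_key, min_age_seconds) do
    """
    SELECT pg_terminate_backend(l.pid)
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE l.locktype = 'advisory'
      AND l.objid = #{lock_key}                -- assumed lock key encoding
      AND l.pid <> pg_backend_pid()            -- never terminate ourselves
      AND a.backend_start < now() - interval '#{min_age_seconds} seconds'
    """
  end
end
```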

The lock wait timer is reset when transitioning to the configuration step, ensuring we only measure the time spent actually waiting for the lock acquisition.
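
A small sketch of that escalation check, assuming lock_wait_start is kept in the manager state as a monotonic timestamp in milliseconds. The field and threshold names are from this PR; the helper itself is hypothetical:

```elixir
defmodule LockEscalationSketch do
  require Logger

  @force_lock_break_threshold :timer.seconds(30)

  # lock_wait_start is set when lock acquisition begins and reset on the
  # transition to the configuration step, so it measures only time spent
  # actually waiting for the lock.
  def force_break?(%{lock_wait_start: nil}), do: false

  def force_break?(%{lock_wait_start: started_at}) do
    waited_ms = System.monotonic_time(:millisecond) - started_at

    if waited_ms >= @force_lock_break_threshold do
      Logger.warning("Waited #{waited_ms}ms for advisory lock; escalating to force mode")
      true
    else
      false
    end
  end
end
```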

https://claude.ai/code/session_01K52FtgUQvbb6e1LKHZcqwu

Commit message

This commit addresses container crashes during rolling deployments caused by
PostgreSQL advisory lock contention. The issue occurred when:

1. Old Electric instance receives SIGTERM during deployment
2. Advisory lock isn't properly released (especially on Neon)
3. New instance can't acquire lock → returns 202 → ALB marks unhealthy
4. Cascade of task replacements with none able to acquire lock

Changes:

1. Reorder shutdown sequence (manager.ex; ordering sketched after this list):
   - Kill replication backend BEFORE replication client
   - This ensures pg_terminate_backend() runs first to release the lock
   - Previously the Elixir process was killed first, leaving the PG backend
     potentially orphaned with the lock still held

2. Add force mode to lock breaker (lock_breaker_connection.ex):
   - New `force: true` option for stop_backends_and_close/4
   - Force mode doesn't require the replication slot to be inactive
   - Only terminates backends connected for >30 seconds (safety check)
   - Prevents breaking locks from freshly started instances

3. Auto-escalate to force mode after timeout (manager.ex):
   - Track when we started waiting for the lock (lock_wait_start field)
   - After 30 seconds of waiting, switch to force mode
   - Logs warning when escalating to force mode
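
As referenced in change 1, a hedged sketch of the reordered shutdown. Only the ordering is taken from this commit; the helper name, the query function, and how the backend pid is known are assumptions:

```elixir
defmodule ShutdownOrderSketch do
  # Terminate the Postgres replication backend first so the advisory lock held
  # on that connection is released, and only then stop the local client.
  def shutdown(query_fun, replication_client_pid, backend_pid) do
    # 1. Release the lock on the database side.
    query_fun.("SELECT pg_terminate_backend(#{backend_pid})")

    # 2. Now the Elixir replication client can be stopped safely.
    GenServer.stop(replication_client_pid, :shutdown)
  end
end
```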

This should prevent the 2+ hour lock contention scenarios seen in production
during rolling deployments.

https://claude.ai/code/session_01K52FtgUQvbb6e1LKHZcqwu
@coderabbitai

coderabbitai bot commented Jan 27, 2026

Review skipped: auto reviews are disabled on this repository.

@codecov

codecov bot commented Jan 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.56%. Comparing base (3f257aa) to head (38a3943).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3781      +/-   ##
==========================================
- Coverage   86.61%   86.56%   -0.05%     
==========================================
  Files          23       23              
  Lines        2039     2039              
  Branches      544      545       +1     
==========================================
- Hits         1766     1765       -1     
- Misses        271      272       +1     
  Partials        2        2              
Flag                         Coverage Δ
packages/experimental        87.73% <ø> (ø)
packages/react-hooks         86.48% <ø> (ø)
packages/start               82.83% <ø> (ø)
packages/typescript-client   92.12% <ø> (-0.08%) ⬇️
packages/y-electric          56.05% <ø> (ø)
typescript                   86.56% <ø> (-0.05%) ⬇️
unit-tests                   86.56% <ø> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.
