fix: Handle replication connection loss (PGRES_COPY_BOTH) and reduce keepalive overhead #749

Open
ksohail22 wants to merge 8 commits into MeltanoLabs:main from ksohail22:fix/handle-replication-connection-loss

Conversation

@ksohail22
Contributor

Problem

Two issues observed in production after the long-lived replication loop was deployed:

1. PGRES_COPY_BOTH crashes terminate the entire sync

When the PostgreSQL server terminates the WAL sender mid-session (due to wal_sender_timeout, network interruption, RDS failover, or server-side load), read_message() raises:

psycopg2.DatabaseError: error with status PGRES_COPY_BOTH and no message from the libpq

This unhandled exception crashes the tap, losing all progress for the current stream and every subsequent stream in the job. The replication slot is not advanced, so the next run must re-scan the same WAL.

2. Excessive keepalive overhead

status_interval was set to 10 seconds, causing the client to send standby status feedback 6 times per minute. This is unnecessary overhead — PostgreSQL's default wal_sender_timeout is 60 seconds, so feedback every 30 seconds (2/min) is sufficient to keep the connection alive.
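The rates quoted above can be sanity-checked with a quick back-of-the-envelope calculation (the constant names here are illustrative, not from the tap's code):

```python
# Back-of-the-envelope check of the keepalive rates described above.
# All values are in seconds; 60 s is PostgreSQL's default wal_sender_timeout.
WAL_SENDER_TIMEOUT = 60

def feedback_per_minute(status_interval: int) -> float:
    """Standby status messages sent per minute for a given interval."""
    return 60 / status_interval

old_rate = feedback_per_minute(10)   # previous setting: 6 messages/min
new_rate = feedback_per_minute(30)   # proposed setting: 2 messages/min

# Safety margin before the server would time out the WAL sender if
# exactly one feedback message were missed.
margin = WAL_SENDER_TIMEOUT - 30

print(old_rate, new_rate, margin)  # 6.0 2.0 30
```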

Solution

Connection-loss resilience

The read_message() call is now wrapped in try/except psycopg2.DatabaseError. When the replication connection dies:

  1. The error is logged as a warning (not a crash) with the record count and elapsed time.
  2. The main loop exits cleanly via a connection_lost flag.
  3. A new recovery method _advance_slot_via_new_connection() opens a fresh replication connection, queries pg_current_wal_flush_lsn() for the current WAL tip, calls send_feedback() to advance the slot, and updates the stream bookmark.
  4. The dead cursor/connection close() calls are wrapped in try/except to prevent secondary exceptions.
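The steps above can be sketched as a small, self-contained simulation. `DatabaseError` here is a stand-in for psycopg2.DatabaseError and `FakeCursor` is an illustrative mock of a replication cursor whose connection dies mid-stream; neither is the tap's actual code:

```python
# Illustrative sketch of the connection-loss handling described above.
class DatabaseError(Exception):
    """Stand-in for psycopg2.DatabaseError (PGRES_COPY_BOTH case)."""

class FakeCursor:
    """Mock replication cursor: yields records, then the connection dies."""
    def __init__(self, messages):
        self._messages = iter(messages)

    def read_message(self):
        msg = next(self._messages)
        if msg is DatabaseError:
            raise DatabaseError("error with status PGRES_COPY_BOTH ...")
        return msg

def consume(cur):
    records = []
    connection_lost = False
    while True:
        try:
            msg = cur.read_message()
        except DatabaseError:
            # Steps 1-2: log a warning (elided here) and exit the loop
            # cleanly instead of crashing the whole sync.
            connection_lost = True
            break
        if msg is None:
            break  # idle; the real loop also checks max_time
        records.append(msg)
    # Step 3 would follow here: pick the slot-advancement path based on
    # the connection_lost flag.
    return records, connection_lost

records, lost = consume(FakeCursor(["r1", "r2", DatabaseError, "r3"]))
print(records, lost)  # ['r1', 'r2'] True
```

Records yielded before the failure survive, and the flag tells the caller to take the recovery path rather than the normal slot-advancement path.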

This ensures that even when a connection is lost:

  • Records already yielded before the crash are preserved by the loader.
  • The replication slot is still advanced, preventing WAL accumulation.
  • The stream's bookmark is updated so the next run resumes from the correct position.
  • Subsequent streams in the same job continue unaffected.

Reduced keepalive frequency

status_interval increased from 10 to 30 seconds. This halves the feedback message rate while maintaining a comfortable 30-second margin below the default 60-second wal_sender_timeout.

Changes

tap_postgres/client.py (PostgresLogBasedStream):

  • get_records():
    • status_interval raised from 10 → 30
    • read_message() wrapped in try/except psycopg2.DatabaseError
    • connection_lost flag controls which slot-advancement path is taken
    • The select.select() call now also catches OSError (broken socket)
    • Cursor/connection cleanup wrapped in try/except
  • _advance_slot_via_new_connection() — New method. Recovery path when the original replication connection is dead. Opens a fresh replication session at the WAL tip, sends feedback to advance the slot, and updates the bookmark.
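The broken-socket case that the select.select() change covers can be reproduced in isolation: selecting on a file descriptor that has already been closed raises OSError (EBADF). A minimal stand-alone demonstration, using a pipe in place of the replication cursor's socket:

```python
import os
import select

# Create a pipe and close the read end to simulate a dead socket.
r, w = os.pipe()
os.close(r)

connection_lost = False
try:
    # In the tap this is select.select([cur], [], [], timeout) on the
    # replication cursor; here the closed pipe end plays that role.
    select.select([r], [], [], 0.1)
except OSError:
    # EBADF: the descriptor is gone, so treat it as a lost connection.
    connection_lost = True
finally:
    os.close(w)

print(connection_lost)  # True
```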

Error handling flow

get_records() loop
    │
    ├── read_message() succeeds → yield record, continue
    │
    ├── read_message() raises DatabaseError
    │       │
    │       ├── Log warning
    │       ├── connection_lost = True
    │       └── break
    │
    └── loop exits (idle/max_time/connection_lost)
            │
            ├── connection OK  → _advance_slot_and_state() [existing cursor]
            │
            └── connection lost → _advance_slot_via_new_connection() [fresh connection]
                    │
                    ├── Query pg_current_wal_flush_lsn()
                    ├── Open new replication session at WAL tip
                    ├── send_feedback(flush_lsn=wal_tip)
                    └── Update bookmark
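One detail of the recovery path worth noting: psycopg2's send_feedback(flush_lsn=...) takes an integer LSN, while pg_current_wal_flush_lsn() queried over a plain connection comes back as text in X/Y form, so the recovery path needs a conversion along these lines (helper names are illustrative):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string like '0/16B3748' to the integer form
    that psycopg2's send_feedback(flush_lsn=...) expects."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def int_to_lsn(value: int) -> str:
    """Inverse conversion, handy for logging the advanced position."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"

wal_tip = lsn_to_int("0/16B3748")
print(wal_tip, int_to_lsn(wal_tip))  # 23803720 0/16B3748
```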

Risk assessment

  • Graceful degradation: A PGRES_COPY_BOTH error now results in a partial sync (records before the crash are kept) rather than a full job failure. The slot is still advanced. This is strictly better than the previous behavior.
  • Recovery connection failure: If the recovery connection also fails (e.g., server is completely down), the warning is logged and the bookmark is left at the last successfully yielded record. The next run will resume from there.
  • status_interval at 30s: Still well within the default wal_sender_timeout of 60 seconds. For servers with a custom lower timeout, the interval can be further adjusted in a future config option.

Test plan

  • Simulate connection loss (e.g., pg_terminate_backend on the WAL sender PID during sync) — verify the tap logs a warning, advances the slot via recovery, and the job does not crash
  • Verify subsequent streams in the same job continue normally after a connection loss on an earlier stream
  • Confirm status_interval=30 keeps the connection alive under normal operation (no wal_sender_timeout errors)
  • Verify slot advancement works correctly on both the normal and recovery paths

@ksohail22 ksohail22 changed the title Fix/handle replication connection loss fix: Handle replication connection loss (PGRES_COPY_BOTH) and reduce keepalive overhead Mar 13, 2026
@edgarrmondragon edgarrmondragon self-assigned this Mar 13, 2026