fix: Handle replication connection loss (PGRES_COPY_BOTH) and reduce keepalive overhead #749

Open
ksohail22 wants to merge 8 commits into MeltanoLabs:main from ksohail22:fix/handle-replication-connection-loss

Conversation

@ksohail22
Contributor

Problem

Two issues observed in production after the long-lived replication loop was deployed:

1. PGRES_COPY_BOTH crashes terminate the entire sync

When the PostgreSQL server terminates the WAL sender mid-session (due to wal_sender_timeout, network interruption, RDS failover, or server-side load), read_message() raises:

psycopg2.DatabaseError: error with status PGRES_COPY_BOTH and no message from the libpq

This unhandled exception crashes the tap, losing all progress for the current stream and every subsequent stream in the job. The replication slot is not advanced, so the next run must re-scan the same WAL.

2. Excessive keepalive overhead

status_interval was set to 10 seconds, causing the client to send standby status feedback 6 times per minute. This is unnecessary overhead — PostgreSQL's default wal_sender_timeout is 60 seconds, so feedback every 30 seconds (2/min) is sufficient to keep the connection alive.
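The rates quoted above can be sanity-checked with a quick back-of-the-envelope calculation (the constant names here are illustrative, not from the tap's code):

```python
# Back-of-the-envelope check of the keepalive rates described above.
# All values are in seconds; 60 s is PostgreSQL's default wal_sender_timeout.
WAL_SENDER_TIMEOUT = 60

def feedback_per_minute(status_interval: int) -> float:
    """Standby status messages sent per minute for a given interval."""
    return 60 / status_interval

old_rate = feedback_per_minute(10)   # previous setting: 6 messages/min
new_rate = feedback_per_minute(30)   # proposed setting: 2 messages/min

# Safety margin before the server would time out the WAL sender if
# exactly one feedback message were missed.
margin = WAL_SENDER_TIMEOUT - 30

print(old_rate, new_rate, margin)  # 6.0 2.0 30
```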

Solution

Connection-loss resilience

The read_message() call is now wrapped in try/except psycopg2.DatabaseError. When the replication connection dies:

  1. The error is logged as a warning (not a crash) with the record count and elapsed time.
  2. The main loop exits cleanly via a connection_lost flag.
  3. A new recovery method _advance_slot_via_new_connection() opens a fresh replication connection, queries pg_current_wal_flush_lsn() for the current WAL tip, calls send_feedback() to advance the slot, and updates the stream bookmark.
  4. The dead cursor/connection close() calls are wrapped in try/except to prevent secondary exceptions.
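The steps above can be sketched as a small, self-contained simulation. `DatabaseError` here is a stand-in for psycopg2.DatabaseError and `FakeCursor` is an illustrative mock of a replication cursor whose connection dies mid-stream; neither is the tap's actual code:

```python
# Illustrative sketch of the connection-loss handling described above.
class DatabaseError(Exception):
    """Stand-in for psycopg2.DatabaseError (PGRES_COPY_BOTH case)."""

class FakeCursor:
    """Mock replication cursor: yields records, then the connection dies."""
    def __init__(self, messages):
        self._messages = iter(messages)

    def read_message(self):
        msg = next(self._messages)
        if msg is DatabaseError:
            raise DatabaseError("error with status PGRES_COPY_BOTH ...")
        return msg

def consume(cur):
    records = []
    connection_lost = False
    while True:
        try:
            msg = cur.read_message()
        except DatabaseError:
            # Steps 1-2: log a warning (elided here) and exit the loop
            # cleanly instead of crashing the whole sync.
            connection_lost = True
            break
        if msg is None:
            break  # idle; the real loop also checks max_time
        records.append(msg)
    # Step 3 would follow here: pick the slot-advancement path based on
    # the connection_lost flag.
    return records, connection_lost

records, lost = consume(FakeCursor(["r1", "r2", DatabaseError, "r3"]))
print(records, lost)  # ['r1', 'r2'] True
```

Records yielded before the failure survive, and the flag tells the caller to take the recovery path rather than the normal slot-advancement path.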

This ensures that even when a connection is lost:

  • Records already yielded before the crash are preserved by the loader.
  • The replication slot is still advanced, preventing WAL accumulation.
  • The stream's bookmark is updated so the next run resumes from the correct position.
  • Subsequent streams in the same job continue unaffected.

Reduced keepalive frequency

status_interval increased from 10 to 30 seconds. This halves the feedback message rate while maintaining a comfortable 30-second margin below the default 60-second wal_sender_timeout.

Changes

tap_postgres/client.py (PostgresLogBasedStream):

  • get_records():
    • status_interval raised from 10 → 30
    • read_message() wrapped in try/except psycopg2.DatabaseError
    • connection_lost flag controls which slot-advancement path is taken
    • The select.select() call now also catches OSError (broken socket)
    • Cursor/connection cleanup wrapped in try/except
  • _advance_slot_via_new_connection() — New method. Recovery path when the original replication connection is dead. Opens a fresh replication session at the WAL tip, sends feedback to advance the slot, and updates the bookmark.
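The broken-socket case that the select.select() change covers can be reproduced in isolation: selecting on a file descriptor that has already been closed raises OSError (EBADF). A minimal stand-alone demonstration, using a pipe in place of the replication cursor's socket:

```python
import os
import select

# Create a pipe and close the read end to simulate a dead socket.
r, w = os.pipe()
os.close(r)

connection_lost = False
try:
    # In the tap this is select.select([cur], [], [], timeout) on the
    # replication cursor; here the closed pipe end plays that role.
    select.select([r], [], [], 0.1)
except OSError:
    # EBADF: the descriptor is gone, so treat it as a lost connection.
    connection_lost = True
finally:
    os.close(w)

print(connection_lost)  # True
```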

Error handling flow

get_records() loop
    │
    ├── read_message() succeeds → yield record, continue
    │
    ├── read_message() raises DatabaseError
    │       │
    │       ├── Log warning
    │       ├── connection_lost = True
    │       └── break
    │
    └── loop exits (idle/max_time/connection_lost)
            │
            ├── connection OK  → _advance_slot_and_state() [existing cursor]
            │
            └── connection lost → _advance_slot_via_new_connection() [fresh connection]
                    │
                    ├── Query pg_current_wal_flush_lsn()
                    ├── Open new replication session at WAL tip
                    ├── send_feedback(flush_lsn=wal_tip)
                    └── Update bookmark
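One detail of the recovery path worth noting: psycopg2's send_feedback(flush_lsn=...) takes an integer LSN, while pg_current_wal_flush_lsn() queried over a plain connection comes back as text in X/Y form, so the recovery path needs a conversion along these lines (helper names are illustrative):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string like '0/16B3748' to the integer form
    that psycopg2's send_feedback(flush_lsn=...) expects."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def int_to_lsn(value: int) -> str:
    """Inverse conversion, handy for logging the advanced position."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"

wal_tip = lsn_to_int("0/16B3748")
print(wal_tip, int_to_lsn(wal_tip))  # 23803720 0/16B3748
```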

Risk assessment

  • Graceful degradation: A PGRES_COPY_BOTH error now results in a partial sync (records before the crash are kept) rather than a full job failure. The slot is still advanced. This is strictly better than the previous behavior.
  • Recovery connection failure: If the recovery connection also fails (e.g., server is completely down), the warning is logged and the bookmark is left at the last successfully yielded record. The next run will resume from there.
  • status_interval at 30s: Still well within the default wal_sender_timeout of 60 seconds. For servers with a custom lower timeout, the interval can be further adjusted in a future config option.

Test plan

  • Simulate connection loss (e.g., pg_terminate_backend on the WAL sender PID during sync) — verify the tap logs a warning, advances the slot via recovery, and the job does not crash
  • Verify subsequent streams in the same job continue normally after a connection loss on an earlier stream
  • Confirm status_interval=30 keeps the connection alive under normal operation (no wal_sender_timeout errors)
  • Verify slot advancement works correctly on both the normal and recovery paths

@ksohail22 ksohail22 changed the title Fix/handle replication connection loss fix: Handle replication connection loss (PGRES_COPY_BOTH) and reduce keepalive overhead Mar 13, 2026
@edgarrmondragon edgarrmondragon self-assigned this Mar 13, 2026