Description
Summary
The FlushTracker in ShapeLogCollector is a passive data structure that relies entirely on external notifications to advance last_global_flushed_offset. When a shape enters tracking via handle_txn_fragment, the tracker expects one of two things to eventually happen:
- handle_flush_notification: the consumer reports that it has flushed the offset
- handle_shape_removed: the shape is cleaned up
If neither happens, that shape's entry becomes the minimum in min_incomplete_flush_tree and permanently blocks WAL flush advancement for the entire stack.
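To make the blocking behavior concrete, here is a minimal toy model of the tracker's semantics (a simplification for illustration, not the actual FlushTracker API; the struct fields and function shapes are invented):

```elixir
# Toy model only — simplifies FlushTracker's semantics, not its real API.
defmodule FlushTrackerModel do
  defstruct pending: %{}, highest_seen: 0, last_global_flushed_offset: 0

  # A shape enters tracking at the offset of its first unflushed fragment.
  def handle_txn_fragment(t, shape_id, offset) do
    %{t | pending: Map.put_new(t.pending, shape_id, offset),
          highest_seen: max(t.highest_seen, offset)}
  end

  # The consumer reports it flushed; the shape leaves tracking.
  def handle_flush_notification(t, shape_id),
    do: advance(%{t | pending: Map.delete(t.pending, shape_id)})

  # The shape is removed; same effect on the boundary.
  def handle_shape_removed(t, shape_id),
    do: advance(%{t | pending: Map.delete(t.pending, shape_id)})

  defp advance(%{pending: pending} = t) when map_size(pending) > 0 do
    # Blocked: the boundary stops just below the minimum incomplete offset,
    # no matter how far every other shape has flushed.
    %{t | last_global_flushed_offset: (pending |> Map.values() |> Enum.min()) - 1}
  end

  defp advance(t), do: %{t | last_global_flushed_offset: t.highest_seen}
end
```

In this model, if shape "a" is tracked at offset 10 and never triggers either callback, the boundary stays pinned at 9 even after every other shape flushes offset 20; removing "a" lets it jump forward.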
ShapeLogCollector does not monitor consumer processes. There is no timeout, no periodic sweep, and no liveness check for shapes tracked by FlushTracker. The system is entirely dependent on every consumer eventually calling notify_flushed or being properly removed — a property that is not guaranteed across all failure modes.
Impact
When last_global_flushed_offset stops advancing, the replication client never receives {:flush_boundary_updated, lsn}, Postgres never gets WAL flush acknowledgement past that point, and WAL accumulates indefinitely until disk is exhausted.
What's already addressed
- fix(sync-service): recover from dead consumers blocking WAL flush advancement #3975 fixes the case where a consumer dies during ConsumerRegistry.broadcast: the dead consumer is detected via :DOWN, cleaned from ETS, and a replacement is started. Events for shapes that were removed between routing and delivery are excluded from FlushTracker entirely.
What remains unaddressed
The child issues below describe specific failure modes that can leave shapes permanently tracked in FlushTracker with no path to resolution.
Suggested fix
Add a defense-in-depth mechanism in ShapeLogCollector to actively detect when FlushTracker-tracked shapes have stopped progressing. Two approaches:
Option A: Monitor consumer PIDs. When ConsumerRegistry.publish delivers events, capture the consumer PIDs. ShapeLogCollector monitors them. On :DOWN, call FlushTracker.handle_shape_removed. This is fast (immediate detection) but requires ShapeLogCollector to handle :DOWN messages and maintain a PID→shape mapping.
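A sketch of Option A, assuming ShapeLogCollector is a GenServer and that the publish path can expose the consumer PIDs it delivered to (the `delivered` return value and the `monitored` state field are hypothetical):

```elixir
# Sketch only: assumes publish can return {shape_id, pid} pairs and that
# state.monitored is a ref -> shape_id map maintained by ShapeLogCollector.
def handle_events(events, state) do
  delivered = ConsumerRegistry.publish(events)

  monitored =
    Enum.reduce(delivered, state.monitored, fn {shape_id, pid}, acc ->
      ref = Process.monitor(pid)
      Map.put(acc, ref, shape_id)
    end)

  {:noreply, %{state | monitored: monitored}}
end

def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
  {shape_id, monitored} = Map.pop(state.monitored, ref)

  tracker =
    if shape_id,
      do: FlushTracker.handle_shape_removed(state.flush_tracker, shape_id),
      else: state.flush_tracker

  {:noreply, %{state | flush_tracker: tracker, monitored: monitored}}
end
```

A real implementation would want to avoid stacking a fresh monitor on the same PID for every delivery, e.g. by keeping a pid → ref index alongside the ref → shape_id map.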
Option B: Periodic liveness sweep. On each incoming transaction (or on a timer), ShapeLogCollector iterates FlushTracker.last_flushed and checks whether each shape still has a live consumer via ConsumerRegistry.whereis + Process.alive?. Dead shapes get handle_shape_removed. This is simpler but detection is delayed until the next sweep.
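A sketch of Option B, assuming FlushTracker.last_flushed enumerates {shape_id, offset} entries and ConsumerRegistry.whereis/1 returns a local PID or nil (both assumptions; the real field and function names may differ):

```elixir
# Sketch only: invoked from a timer or per-transaction hook in ShapeLogCollector.
defp sweep_dead_shapes(state) do
  tracker =
    Enum.reduce(state.flush_tracker.last_flushed, state.flush_tracker, fn
      {shape_id, _offset}, tracker ->
        case ConsumerRegistry.whereis(shape_id) do
          pid when is_pid(pid) ->
            # Process.alive?/1 is valid here only for local PIDs.
            if Process.alive?(pid),
              do: tracker,
              else: FlushTracker.handle_shape_removed(tracker, shape_id)

          nil ->
            # No registered consumer at all: treat the shape as gone.
            FlushTracker.handle_shape_removed(tracker, shape_id)
        end
    end)

  %{state | flush_tracker: tracker}
end
```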
Option A catches all crash/kill scenarios immediately. Option B also catches the "alive but stuck" scenario (child issue #3) if extended with a staleness timeout.