Skip to content

FlushTracker has no active mechanism to detect shapes that stop reporting flush progress #3980

@alco

Description

@alco

Summary

The FlushTracker in ShapeLogCollector is a passive data structure that relies entirely on external notifications to advance last_global_flushed_offset. When a shape enters tracking via handle_txn_fragment, the tracker expects one of two things to eventually happen:

  1. handle_flush_notification — the consumer reports it flushed the offset
  2. handle_shape_removed — the shape is cleaned up

If neither happens, that shape's entry becomes the minimum in min_incomplete_flush_tree and permanently blocks WAL flush advancement for the entire stack.

ShapeLogCollector does not monitor consumer processes. There is no timeout, no periodic sweep, and no liveness check for shapes tracked by FlushTracker. The system is entirely dependent on every consumer eventually calling notify_flushed or being properly removed — a property that is not guaranteed across all failure modes.

Impact

When last_global_flushed_offset stops advancing, the replication client never receives {:flush_boundary_updated, lsn}, Postgres never gets WAL flush acknowledgement past that point, and WAL accumulates indefinitely until disk is exhausted.

What's already addressed

What remains unaddressed

The child issues below describe specific failure modes that can leave shapes permanently tracked in FlushTracker with no path to resolution.

Suggested fix

Add a defense-in-depth mechanism in ShapeLogCollector to actively detect when FlushTracker-tracked shapes have stopped progressing. Two approaches:

Option A: Monitor consumer PIDs. When ConsumerRegistry.publish delivers events, capture the consumer PIDs. ShapeLogCollector monitors them. On :DOWN, call FlushTracker.handle_shape_removed. This is fast (immediate detection) but requires ShapeLogCollector to handle :DOWN messages and maintain a PID→shape mapping.

Option B: Periodic liveness sweep. On each incoming transaction (or on a timer), ShapeLogCollector iterates FlushTracker.last_flushed and checks whether each shape still has a live consumer via ConsumerRegistry.whereis + Process.alive?. Dead shapes get handle_shape_removed. This is simpler but detection is delayed until the next sweep.

Option A catches all crash/kill scenarios immediately. Option B also catches the "alive but stuck" scenario (child issue #3) if extended with a staleness timeout.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions