Description
Summary
The FlushTracker in ShapeLogCollector is a passive data structure that relies entirely on external notifications to advance last_global_flushed_offset. When a shape enters tracking via handle_txn_fragment, the tracker expects one of two things to eventually happen:
- handle_flush_notification: the consumer reports that it has flushed the offset
- handle_shape_removed: the shape is cleaned up
If neither happens, that shape's entry becomes the minimum in min_incomplete_flush_tree and permanently blocks WAL flush advancement for the entire stack.
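To make the blocking behavior concrete, here is a minimal toy model of the tracker's semantics (a simplification for illustration, not the actual FlushTracker API; the struct fields and function shapes are invented):

```elixir
# Toy model only — simplifies FlushTracker's semantics, not its real API.
defmodule FlushTrackerModel do
  defstruct pending: %{}, highest_seen: 0, last_global_flushed_offset: 0

  # A shape enters tracking at the offset of its first unflushed fragment.
  def handle_txn_fragment(t, shape_id, offset) do
    %{t | pending: Map.put_new(t.pending, shape_id, offset),
          highest_seen: max(t.highest_seen, offset)}
  end

  # The consumer reports it flushed; the shape leaves tracking.
  def handle_flush_notification(t, shape_id),
    do: advance(%{t | pending: Map.delete(t.pending, shape_id)})

  # The shape is removed; same effect on the boundary.
  def handle_shape_removed(t, shape_id),
    do: advance(%{t | pending: Map.delete(t.pending, shape_id)})

  defp advance(%{pending: pending} = t) when map_size(pending) > 0 do
    # Blocked: the boundary stops just below the minimum incomplete offset,
    # no matter how far every other shape has flushed.
    %{t | last_global_flushed_offset: (pending |> Map.values() |> Enum.min()) - 1}
  end

  defp advance(t), do: %{t | last_global_flushed_offset: t.highest_seen}
end
```

In this model, if shape "a" is tracked at offset 10 and never triggers either callback, the boundary stays pinned at 9 even after every other shape flushes offset 20; removing "a" lets it jump forward.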
ShapeLogCollector does not monitor consumer processes. There is no timeout, no periodic sweep, and no liveness check for shapes tracked by FlushTracker. The system is entirely dependent on every consumer eventually calling notify_flushed or being properly removed — a property that is not guaranteed across all failure modes.
Impact
When last_global_flushed_offset stops advancing, the replication client never receives {:flush_boundary_updated, lsn}, Postgres never gets WAL flush acknowledgement past that point, and WAL accumulates indefinitely until disk is exhausted.
What's already addressed
- fix(sync-service): recover from dead consumers blocking WAL flush advancement #3975 fixes the case where a consumer dies during ConsumerRegistry.broadcast: the dead consumer is detected via :DOWN, cleaned from ETS, and a replacement is started. Events for shapes that were removed between routing and delivery are excluded from FlushTracker entirely.
What remains unaddressed
The child issues below describe specific failure modes that can leave shapes permanently tracked in FlushTracker with no path to resolution.
Suggested fix
Add a defense-in-depth mechanism in ShapeLogCollector to actively detect when FlushTracker-tracked shapes have stopped progressing. Two approaches:
Option A: Monitor consumer PIDs. When ConsumerRegistry.publish delivers events, capture the consumer PIDs. ShapeLogCollector monitors them. On :DOWN, call FlushTracker.handle_shape_removed. This is fast (immediate detection) but requires ShapeLogCollector to handle :DOWN messages and maintain a PID→shape mapping.
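A sketch of Option A, assuming ShapeLogCollector is a GenServer and that the publish path can expose the consumer PIDs it delivered to (the `delivered` return value and the `monitored` state field are hypothetical):

```elixir
# Sketch only: assumes publish can return {shape_id, pid} pairs and that
# state.monitored is a ref -> shape_id map maintained by ShapeLogCollector.
def handle_events(events, state) do
  delivered = ConsumerRegistry.publish(events)

  monitored =
    Enum.reduce(delivered, state.monitored, fn {shape_id, pid}, acc ->
      ref = Process.monitor(pid)
      Map.put(acc, ref, shape_id)
    end)

  {:noreply, %{state | monitored: monitored}}
end

def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
  {shape_id, monitored} = Map.pop(state.monitored, ref)

  tracker =
    if shape_id,
      do: FlushTracker.handle_shape_removed(state.flush_tracker, shape_id),
      else: state.flush_tracker

  {:noreply, %{state | flush_tracker: tracker, monitored: monitored}}
end
```

A real implementation would want to avoid stacking a fresh monitor on the same PID for every delivery, e.g. by keeping a pid → ref index alongside the ref → shape_id map.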
Option B: Periodic liveness sweep. On each incoming transaction (or on a timer), ShapeLogCollector iterates FlushTracker.last_flushed and checks whether each shape still has a live consumer via ConsumerRegistry.whereis + Process.alive?. Dead shapes get handle_shape_removed. This is simpler but detection is delayed until the next sweep.
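A sketch of Option B, assuming FlushTracker.last_flushed enumerates {shape_id, offset} entries and ConsumerRegistry.whereis/1 returns a local PID or nil (both assumptions; the real field and function names may differ):

```elixir
# Sketch only: invoked from a timer or per-transaction hook in ShapeLogCollector.
defp sweep_dead_shapes(state) do
  tracker =
    Enum.reduce(state.flush_tracker.last_flushed, state.flush_tracker, fn
      {shape_id, _offset}, tracker ->
        case ConsumerRegistry.whereis(shape_id) do
          pid when is_pid(pid) ->
            # Process.alive?/1 is valid here only for local PIDs.
            if Process.alive?(pid),
              do: tracker,
              else: FlushTracker.handle_shape_removed(tracker, shape_id)

          nil ->
            # No registered consumer at all: treat the shape as gone.
            FlushTracker.handle_shape_removed(tracker, shape_id)
        end
    end)

  %{state | flush_tracker: tracker}
end
```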
Option A catches all crash/kill scenarios immediately. Option B also catches the "alive but stuck" scenario (child issue #3) if extended with a staleness timeout.