Description
Parent: #3980
Scenario
A consumer process is alive (Process.alive? returns true) but has stopped making progress on flushing transactions. The FlushTracker tracks the shape, but no handle_flush_notification ever arrives because the consumer is stuck.
How this can happen
- Deadlock or infinite wait: The consumer is blocked waiting on a resource that will never become available (e.g., a GenServer.call to a dead process without a timeout, a storage backend that hangs on I/O).
- Infinite loop in event processing: A bug in change handling, move-in processing, or materializer interaction causes the consumer to loop without returning from handle_call.
- Message queue starvation: The consumer's mailbox is flooded with low-priority messages that are processed before the storage :flushed callback, effectively starving the flush path indefinitely.
Why this is distinct
This scenario cannot be detected by process monitoring (Option A in the parent issue) because the consumer process is alive. Process.alive? returns true, and no :DOWN message is ever sent.
Only a progress-based detection mechanism can catch this, for example by tracking the last time each shape in FlushTracker.last_flushed advanced and treating shapes that have not progressed within a timeout as stuck.
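To illustrate why monitoring alone misses this case, here is a minimal standalone sketch. The blocked process is a hypothetical stand-in for a stuck consumer, not the real consumer module: it stays alive while doing nothing, so the liveness check passes and the monitor stays silent.

```elixir
# Hypothetical stand-in for a stuck consumer: it waits for a message
# that nobody will ever send, so it never exits and never makes progress.
stuck =
  spawn(fn ->
    receive do
      :never_sent -> :ok
    end
  end)

ref = Process.monitor(stuck)

# The liveness check passes because the process has not terminated.
alive = Process.alive?(stuck)

# The monitor only fires on exit, so no :DOWN message shows up here.
down_received =
  receive do
    {:DOWN, ^ref, :process, ^stuck, _reason} -> true
  after
    100 -> false
  end

IO.inspect({alive, down_received})

# Clean up the blocked process once the demonstration is done.
Process.exit(stuck, :kill)
```

Both signals that Option A relies on report a healthy process here, which is why a separate progress signal is needed.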
Fix
Extend the liveness sweep approach (Option B in #3980) with a staleness timeout: if a shape has been in FlushTracker.last_flushed for longer than N seconds without its last_flushed offset advancing, treat it as stuck and call handle_shape_removed (or trigger a consumer restart).
This is the only scenario that requires timeout-based detection rather than monitor-based detection.
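The staleness check could look roughly like the sketch below. Everything in it is an assumption made for illustration: the StalenessSketch module, its function names, the shape of the progress map, and the use of integer offsets are hypothetical and are not the actual FlushTracker API or data layout.

```elixir
defmodule StalenessSketch do
  # Hypothetical helper, not part of FlushTracker.
  # `progress` maps shape_handle => {last_flushed_offset, last_advanced_at_ms}.

  # Record a flush notification. The timestamp moves only when the offset
  # actually advances, so repeated or stale notifications do not hide a stall.
  def record_flush(progress, shape, offset, now_ms) do
    case progress do
      %{^shape => {prev, _at}} when offset <= prev -> progress
      _ -> Map.put(progress, shape, {offset, now_ms})
    end
  end

  # Return the shapes whose offset has not advanced within `timeout_ms`.
  def stuck_shapes(progress, now_ms, timeout_ms) do
    for {shape, {_offset, at}} <- progress, now_ms - at > timeout_ms, do: shape
  end
end

# Timestamps are passed in explicitly to keep the example deterministic.
progress =
  %{}
  |> StalenessSketch.record_flush("shape-a", 10, 0)
  |> StalenessSketch.record_flush("shape-b", 10, 0)
  |> StalenessSketch.record_flush("shape-b", 20, 25_000)

# With a 30 second timeout at t = 40s, only "shape-a" has gone stale.
stuck = StalenessSketch.stuck_shapes(progress, 40_000, 30_000)
IO.inspect(stuck)
```

In a real sweep the timestamps would more likely come from System.monotonic_time(:millisecond) than from wall-clock time, and each shape returned would be passed to handle_shape_removed or used to trigger a consumer restart, as described above.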