-
Notifications
You must be signed in to change notification settings - Fork 318
Bug: Consumer sends unaligned flush offset to FlushTracker when storage flushes before commit fragment is processed #4063
Description
Root Cause
When a consumer uses write_unit: :txn_fragment, there is a race condition between the storage flush and the commit fragment processing that causes the consumer to send an unaligned offset to FlushTracker, permanently blocking the global flush offset from advancing.
The conditions for this issue became possible after the fix for #3985 landed in main: it didn't address the case where a consumer gets commit fragment with no changes but opened up the possibility for FlushTracker to get stuck after it starts tracking a shape from a non-commit fragment.
The race sequence
Consider a multi-fragment transaction where a non-commit fragment writes data that triggers a buffer-size flush (≥64KB) in WriteLoop:
-
SLC dispatches non-commit fragment to consumer (synchronous via
$gen_call). Duringappend_fragment_to_log!, the buffer exceeds 64KB →flush_buffer→send(self(), {Storage, :flushed, offset})goes to consumer's mailbox. Consumer replies to SLC. -
SLC updates FlushTracker: tracks shape with
last_sent = source_fragment.last_log_offset(which is the higher offset of the non-matching change). -
Consumer processes
:flushedmessage.txn_offset_mappingis still empty →align_offset_to_txn_boundaryreturns the raw storage offset. Consumer sendsnotify_flushed(shape, raw_offset)to SLC and it gets handled byFlushTracker.handle_flush_notification()but since the offset in the notification is lower than the one seen byFlushTracker(due to the other change in the txn fragment that didn't affect this consumer's shape),FlushTrackerkeeps tracking the shape. -
SLC dispatches commit fragment to consumer.
-
Consumer processes commit fragment:
maybe_complete_pending_txnpopulatestxn_offset_mapping. Consumer replies. -
SLC continues commit processing: FlushTracker updates
last_sentto the commit fragment'slast_log_offset. -
No further flushes arrive: the commit fragment didn't contain changes affecting consumer's shape → no more flushes from storage. Consumer keeps the
txn_offset_mappingentry andFlushTrackerkeeps waiting on the flush notification that will never arrive.
Production evidence
State dumps from the stuck consumer show:
pending_txn: nil
txn_offset_mapping: [{LogOffset.new(54926778248, 0), LogOffset.new(54926778248, 47520)}]
FlushTracker:
last_flushed: LogOffset.new(54926778248, 0)
last_sent: LogOffset.new(54926778248, 47520)
The consumer had only one relevant change at op_offset=0 in a transaction with 47521 total operations. The txn_offset_mapping entry was populated but never consumed because no subsequent flush arrived.
Conditions required
write_unit = :txn_fragment(the old:txnpath buffers everything inTransactionBuilderbefore writing, so storage has nothing to flush until after commit)- Non-commit fragment writes enough data to trigger a buffer-size flush (≥64KB), OR the flush timer fires between the non-commit and commit fragment processing
- The transaction touches multiple tables (so
last_log_offset > shape's last written offset)
Proposed fix directions
Option 1: Deferred flush notification
When processing {Storage, :flushed, offset} and pending_txn != nil, save the flushed offset instead of notifying immediately. Re-process it in maybe_complete_pending_txn after txn_offset_mapping is populated.
Option 2: Retroactive notification on commit
In maybe_complete_pending_txn, after populating txn_offset_mapping, check if the writer's last_persisted_offset ≥ latest_offset. If so, the data was already flushed — call align_offset_to_txn_boundary and send the correctly-aligned notification immediately.
Option 3: Re-send on txn_offset_mapping update
After appending to txn_offset_mapping in maybe_complete_pending_txn, if the storage reports it has already persisted past the new mapping's key offset, send the aligned notification.
Supersedes
This issue supersedes #4058 which hypothesized the root cause was a lost :flushed message. The message is not lost — it arrives and is processed, but at the wrong time relative to the commit fragment, causing align_offset_to_txn_boundary to return an unaligned offset.