Bug: Consumer sends unaligned flush offset to FlushTracker when storage flushes before commit fragment is processed #4063

@alco

Root Cause

When a consumer uses write_unit: :txn_fragment, there is a race condition between the storage flush and the commit fragment processing that causes the consumer to send an unaligned offset to FlushTracker, permanently blocking the global flush offset from advancing.

The conditions for this issue became possible after the fix for #3985 landed in main. That fix didn't address the case where a consumer receives a commit fragment with no changes, and it opened up the possibility for FlushTracker to get stuck once it starts tracking a shape from a non-commit fragment.

The race sequence

Consider a multi-fragment transaction where a non-commit fragment writes data that triggers a buffer-size flush (≥64KB) in WriteLoop:

  1. SLC dispatches non-commit fragment to consumer (synchronous via $gen_call). During append_fragment_to_log!, the buffer exceeds 64KB → flush_buffer → send(self(), {Storage, :flushed, offset}) goes to consumer's mailbox. Consumer replies to SLC.

  2. SLC updates FlushTracker: tracks shape with last_sent = source_fragment.last_log_offset (which is the higher offset of the non-matching change).

  3. Consumer processes the :flushed message. txn_offset_mapping is still empty, so align_offset_to_txn_boundary returns the raw storage offset. Consumer sends notify_flushed(shape, raw_offset) to SLC, where FlushTracker.handle_flush_notification() handles it. Because the offset in the notification is lower than the last_sent offset seen by FlushTracker (the fragment's other change did not affect this consumer's shape), FlushTracker keeps tracking the shape.

  4. SLC dispatches commit fragment to consumer.

  5. Consumer processes commit fragment: maybe_complete_pending_txn populates txn_offset_mapping. Consumer replies.

  6. SLC continues commit processing: FlushTracker updates last_sent to the commit fragment's last_log_offset.

  7. No further flushes arrive: the commit fragment didn't contain changes affecting consumer's shape → no more flushes from storage. Consumer keeps the txn_offset_mapping entry and FlushTracker keeps waiting on the flush notification that will never arrive.
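
The ordering problem in steps 3 and 5 can be reproduced with a small model of the alignment step. This is a sketch, not the actual implementation: AlignSketch and align/2 are hypothetical names, offsets are modelled as plain {tx_offset, op_offset} tuples instead of LogOffset structs, and the lookup rule (drop older entries, match the flushed offset exactly) is an assumption based on the behaviour described above.

```elixir
defmodule AlignSketch do
  # Offsets are modelled as {tx_offset, op_offset} tuples. Elixir compares
  # same-size tuples element by element, which gives the ordering needed here.

  # `mapping` is a list of {last_written_offset, txn_boundary_offset} pairs,
  # oldest first. Returns {remaining_mapping, offset_to_report}.
  def align(mapping, flushed) do
    # Assumption: entries below the flushed offset are no longer relevant.
    case Enum.drop_while(mapping, fn {written, _boundary} -> written < flushed end) do
      # The flush covers the shape's last write in this txn: report the boundary.
      [{^flushed, boundary} | rest] -> {rest, boundary}
      # No matching entry: the raw storage offset is reported unchanged.
      rest -> {rest, flushed}
    end
  end
end

flushed = {54_926_778_248, 0}
boundary = {54_926_778_248, 47_520}

# Step 3: the flush is handled before the commit fragment, so the mapping is empty.
{_, early} = AlignSketch.align([], flushed)

# Had the commit fragment been processed first, the same flush would align.
{_, late} = AlignSketch.align([{flushed, boundary}], flushed)

IO.inspect(early, label: "reported before commit")
IO.inspect(late, label: "reported after commit")
```

The same flushed offset yields two different notifications depending only on whether the mapping entry exists yet, which is the race described in the sequence.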

Production evidence

State dumps from the stuck consumer show:

pending_txn: nil
txn_offset_mapping: [{LogOffset.new(54926778248, 0), LogOffset.new(54926778248, 47520)}]

FlushTracker:

last_flushed: LogOffset.new(54926778248, 0)
last_sent: LogOffset.new(54926778248, 47520)

The consumer had only one relevant change at op_offset=0 in a transaction with 47521 total operations. The txn_offset_mapping entry was populated but never consumed because no subsequent flush arrived.
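
To see why these two values leave the tracker stuck, here is a minimal sketch of the comparison FlushTracker is described as making. TrackerSketch and handle_flush/2 are hypothetical names, and the rule "a shape is caught up once the notified offset reaches last_sent" is an assumption drawn from the race sequence, not taken from the FlushTracker source.

```elixir
defmodule TrackerSketch do
  # Offsets are {tx_offset, op_offset} tuples, compared structurally.
  # A shape is considered caught up only when the notified offset has
  # reached the last offset sent to it.
  def handle_flush(last_sent, flushed) when flushed >= last_sent, do: :caught_up
  def handle_flush(_last_sent, _flushed), do: :still_tracking
end

last_sent = {54_926_778_248, 47_520}

# The unaligned notification from the state dump: the shape stays tracked.
stuck = TrackerSketch.handle_flush(last_sent, {54_926_778_248, 0})

# An aligned notification would have released the shape.
released = TrackerSketch.handle_flush(last_sent, {54_926_778_248, 47_520})

IO.inspect({stuck, released})
```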

Conditions required

  • write_unit = :txn_fragment (the old :txn path buffers everything in TransactionBuilder before writing, so storage has nothing to flush until after commit)
  • Non-commit fragment writes enough data to trigger a buffer-size flush (≥64KB), OR the flush timer fires between the non-commit and commit fragment processing
  • The transaction touches multiple tables (so last_log_offset > shape's last written offset)

Proposed fix directions

Option 1: Deferred flush notification

When processing {Storage, :flushed, offset} and pending_txn != nil, save the flushed offset instead of notifying immediately. Re-process it in maybe_complete_pending_txn after txn_offset_mapping is populated.
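
A rough sketch of Option 1 under simplifying assumptions: the consumer state is a plain map, deferred_flush is a hypothetical new field, notifications are appended to a notified list rather than sent to SLC, and flush offsets are assumed to arrive in increasing order so that keeping only the latest parked offset is enough.

```elixir
defmodule DeferredFlushSketch do
  # All names are hypothetical; this models the idea, not the real consumer.
  def new do
    %{pending_txn: nil, txn_offset_mapping: [], deferred_flush: nil, notified: []}
  end

  def begin_fragment(state, txn_id), do: %{state | pending_txn: txn_id}

  # A flush that arrives mid-transaction is parked instead of reported.
  def on_flushed(%{pending_txn: txn} = state, offset) when txn != nil do
    %{state | deferred_flush: offset}
  end

  def on_flushed(state, offset), do: notify_aligned(state, offset)

  # Commit: record the boundary first, then replay any parked flush.
  def on_commit(state, last_written, boundary) do
    state = %{
      state
      | pending_txn: nil,
        txn_offset_mapping: state.txn_offset_mapping ++ [{last_written, boundary}]
    }

    case state.deferred_flush do
      nil -> state
      offset -> notify_aligned(%{state | deferred_flush: nil}, offset)
    end
  end

  # Same assumed alignment rule as align_offset_to_txn_boundary.
  defp notify_aligned(state, offset) do
    {mapping, aligned} =
      case Enum.drop_while(state.txn_offset_mapping, fn {w, _} -> w < offset end) do
        [{^offset, boundary} | rest] -> {rest, boundary}
        rest -> {rest, offset}
      end

    %{state | txn_offset_mapping: mapping, notified: state.notified ++ [aligned]}
  end
end

state =
  DeferredFlushSketch.new()
  |> DeferredFlushSketch.begin_fragment(:txn_1)
  |> DeferredFlushSketch.on_flushed({54_926_778_248, 0})
  |> DeferredFlushSketch.on_commit({54_926_778_248, 0}, {54_926_778_248, 47_520})

IO.inspect(state.notified, label: "notifications")
```

In this model the early flush produces no notification at all until the commit fragment is handled, at which point a single aligned notification is emitted and the mapping entry is consumed.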

Option 2: Retroactive notification on commit

In maybe_complete_pending_txn, after populating txn_offset_mapping, check if the writer's last_persisted_offset ≥ latest_offset. If so, the data was already flushed — call align_offset_to_txn_boundary and send the correctly-aligned notification immediately.
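
Option 2 could look roughly like the following. This is a hypothetical model: RetroactiveNotifySketch and on_commit/4 are illustrative names, last_persisted stands in for the writer's last_persisted_offset, latest stands in for the shape's latest_offset, and the function returns the notification instead of sending it.

```elixir
defmodule RetroactiveNotifySketch do
  # Offsets are {tx_offset, op_offset} tuples, compared structurally.
  # Returns {mapping, notification}, where notification is nil if nothing
  # should be sent yet.
  def on_commit(mapping, last_persisted, latest, boundary) do
    entry = {latest, boundary}
    mapping = mapping ++ [entry]

    if last_persisted >= latest do
      # Storage already flushed this shape's writes, so no further :flushed
      # message will follow. Consume the new entry and report the boundary now.
      {List.delete(mapping, entry), boundary}
    else
      # A flush is still outstanding; it will be aligned when it arrives.
      {mapping, nil}
    end
  end
end

written = {54_926_778_248, 0}
boundary = {54_926_778_248, 47_520}

# Case from this issue: the write was flushed before the commit fragment.
already_flushed = RetroactiveNotifySketch.on_commit([], written, written, boundary)

# Normal case: storage has only persisted up to an earlier transaction.
not_yet_flushed =
  RetroactiveNotifySketch.on_commit([], {54_926_778_247, 3}, written, boundary)

IO.inspect(already_flushed, label: "already flushed")
IO.inspect(not_yet_flushed, label: "not yet flushed")
```

Unlike Option 1, this variant keeps no extra state between the flush and the commit, but it relies on the writer exposing its persisted offset at commit time.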

Option 3: Re-send on txn_offset_mapping update

After appending to txn_offset_mapping in maybe_complete_pending_txn, if the storage reports it has already persisted past the new mapping's key offset, send the aligned notification.

Supersedes

This issue supersedes #4058, which hypothesized that the root cause was a lost :flushed message. The message is not lost: it arrives and is processed, but at the wrong time relative to the commit fragment, causing align_offset_to_txn_boundary to return an unaligned offset.
