Skip to content

fix: Resolve pending shapes in FlushTracker when consumer process receives commit fragment with no relevant changes#4064

Open
alco wants to merge 10 commits intomainfrom
alco/flush-tracker-notification-timing
Open

fix: Resolve pending shapes in FlushTracker when consumer process receives commit fragment with no relevant changes#4064
alco wants to merge 10 commits intomainfrom
alco/flush-tracker-notification-timing

Conversation

@alco
Copy link
Copy Markdown
Member

@alco alco commented Mar 26, 2026

Summary

Fix #4063

When a {Storage, :flushed, offset} message arrives at a Consumer process while it's in the middle of processing a multi-fragment transaction, the flush notification could be lost. This caused the FlushTracker in ShapeLogCollector to get stuck waiting for a flush that was already completed.

Changes

Consumer (consumer.ex, consumer/state.ex):

  • Defer flush notifications that arrive during a pending transaction by storing the max flushed offset in pending_flush_offset
  • Process deferred flush notifications when the pending transaction completes (commit, skip, or no relevant changes)

FlushTracker (flush_tracker.ex):

  • Simplify to commit-only tracking: remove non-commit fragment handling since Consumer now defers flush notifications during pending transactions
  • Remove shapes_with_changes parameter from handle_txn_fragment/3 — all affected shapes are tracked uniformly at commit time

ShapeLogCollector (shape_log_collector.ex):

  • Only call FlushTracker.handle_txn_fragment/3 on commit fragments
  • Remove shapes_with_changes computation that is no longer needed

Test plan

  • New regression test for stuck flush tracker scenario
  • Stricter assertions in EventRouterTest for txn fragment reslicing
  • Updated Consumer tests for deferred flush notification behavior
  • Existing test suite passes

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.67%. Comparing base (0af96e9) to head (b76d0d3).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4064   +/-   ##
=======================================
  Coverage   88.67%   88.67%           
=======================================
  Files          25       25           
  Lines        2438     2438           
  Branches      615      611    -4     
=======================================
  Hits         2162     2162           
  Misses        274      274           
  Partials        2        2           
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 93.81% <ø> (ø)
packages/y-electric 56.05% <ø> (ø)
typescript 88.67% <ø> (ø)
unit-tests 88.67% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

alco added 4 commits March 26, 2026 22:11
Non-commit fragments no longer register shapes in FlushTracker.
The Consumer will defer flush notifications until the commit fragment
is processed, so early registration is no longer needed.

- FlushTracker.handle_txn_fragment now takes 3 args (removed unused
  shapes_with_changes parameter)
- SLC's dropped-fragment path only calls FlushTracker for commit
fragments
- Remove misleading FlushTracker tests that claimed to test
cross-fragment
  tracking (that's EventRouter's responsibility, already tested there)
- Update handle_txn test helper for arity 3

Refs: #4063
When a {Storage, :flushed, offset} message arrives while a
multi-fragment
transaction is pending, the Consumer now saves the offset instead of
immediately notifying the ShapeLogCollector. After the commit fragment
populates txn_offset_mapping, the deferred offset is aligned and sent
as a single notification.

This fixes the race condition where the consumer sent an unaligned
flush offset to FlushTracker because txn_offset_mapping was empty
at the time of the storage flush.
Tests now verify that flush notifications are deferred during pending
transactions and sent only after the commit fragment is processed.

Also fix a compilation error in Consumer (cannot call remote function in
pattern match) and add a non-commit fragment clause to FlushTracker's
handle_txn_fragment/4.

Refs: #4063
@alco alco force-pushed the alco/flush-tracker-notification-timing branch from 4f143f8 to 7b56ab0 Compare March 26, 2026 21:23
alco and others added 2 commits March 26, 2026 22:25
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alco alco force-pushed the alco/flush-tracker-notification-timing branch from 87d9295 to 7cd24ee Compare March 26, 2026 21:26
@alco alco added the claude label Mar 26, 2026
@claude
Copy link
Copy Markdown

claude bot commented Mar 26, 2026

Claude Code Review

Summary

Fixes a real production bug (#4063) where the FlushTracker could get permanently stuck when a storage buffer-size flush fires during multi-fragment transaction processing. The approach of deferring flush notifications at the Consumer level and simplifying FlushTracker to commit-only tracking is clean and well-tested.

What's Working Well

  • Root cause analysis is excellent: the issue writeup precisely identifies the race sequence with production state dumps and proposed fix directions
  • Simplification is a net win: removing FlushTracker non-commit pre-registration complexity makes both components easier to reason about independently
  • Deferred flush notification covers all three terminal paths of a pending transaction cleanly
  • Double-notification in the no-changes path is correctly benign (second call is a safe no-op)
  • Regression test faithfully reproduces the production scenario
  • Changeset included

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

is_write_unit_txn_fragment guard defined but unused
File: packages/sync-service/lib/electric/shapes/consumer/state.ex:401
The guard was added but is not referenced anywhere in consumer.ex or the broader codebase. The @write_unit_txn_fragment module attribute is still matched directly via pattern matching at line 271. Either replace that pattern match with the guard or remove it to avoid confusion for future readers.

handle_txn_fragment/3 is now load-bearing for commit-only callers
File: packages/sync-service/lib/electric/replication/shape_log_collector/flush_tracker.ex:98
The function will crash with FunctionClauseError if a non-commit fragment is passed (old non-commit clause removed). Since the only callsite already guards with when not is_nil(commit), this is safe today. A short @doc note stating only commit fragments are accepted would protect future callers from discovering this at runtime.

Issue Conformance

Issue #4063 is well-specified with a precise root-cause analysis, exact race sequence, production state dumps, and three proposed fix directions. The PR implements Option 1 (deferred flush notification), the cleanest approach. Implementation fully addresses the race condition. No scope creep.

Previous Review Status

  • Fixed: Typo in consumer/state.ex comment (ultiple -> Multiple) — addressed in commit b76d0d3.
  • Open: is_write_unit_txn_fragment guard unused — not yet addressed.
  • Open: handle_txn_fragment/3 missing @doc note for commit-only contract — not yet addressed.

Review iteration: 2 | 2026-03-26

@alco alco marked this pull request as ready for review March 26, 2026 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Consumer sends unaligned flush offset to FlushTracker when storage flushes before commit fragment is processed

1 participant