FlushTracker stale entry debugging strategy#4050

Draft
alco wants to merge 2 commits into main from erik/flush-tracker-debugging-strategy

@alco commented Mar 24, 2026

Summary

  • Comprehensive debugging strategy for diagnosing FlushTracker stalling in production, caused by consumer processes dying without FlushTracker cleanup
  • Deep code analysis identifying the root cause gap: consumer termination with clean exit reasons (:shutdown, suspend) doesn't clean up FlushTracker entries
  • Analysis of production state dumps from two affected customers (edison and faraday)
  • 4-phase data gathering plan: runtime state inspection, targeted tracing, Honeycomb telemetry, ETS table cross-referencing
  • 4 hypothesized root causes with specific code path evidence
  • 3 proposed fix options once root cause is confirmed

Context

Part of electric-sql/alco-agent-tasks#8 (FlushTracker stalling when tracked consumer dies out-of-band).

This PR contains analysis and a strategy document, no code changes.

Key Finding

The critical gap is in ShapeCleaner.handle_writer_termination:

  • Consumer suspend (@shutdown_suspend) → only calls ConsumerRegistry.remove_consumer, NO FlushTracker cleanup
  • Consumer dies with :shutdown/:normal/:killed → does NOTHING at all
  • Only abnormal exits trigger full shape removal (which does clean FlushTracker)

This means any consumer that exits through one of the first two paths (suspend, or a :shutdown/:normal/:killed exit) while its shape is still tracked in FlushTracker leaves a stale entry behind, and that entry blocks WAL flush advancement forever.
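
To make the failure mode concrete, here is a minimal sketch of how a single stale entry can pin the flush position. It assumes the tracker can only acknowledge WAL up to just before the oldest position still pending for any tracked shape. ToyFlushTracker and all of its functions are hypothetical and are not Electric's actual FlushTracker API.

```elixir
defmodule ToyFlushTracker do
  # Hypothetical model, not Electric's FlushTracker: maps each shape handle
  # to the oldest WAL position that shape has not flushed yet.
  defstruct pending: %{}, last_seen: 0

  def new, do: %__MODULE__{}

  # A transaction at `lsn` touched these shapes; each of them must flush it.
  def track(%__MODULE__{} = t, lsn, shapes) do
    pending = Enum.reduce(shapes, t.pending, &Map.put_new(&2, &1, lsn))
    %{t | pending: pending, last_seen: max(t.last_seen, lsn)}
  end

  # A live consumer reports that it has flushed everything it was sent.
  def flushed(%__MODULE__{} = t, shape), do: %{t | pending: Map.delete(t.pending, shape)}

  # The cleanup step that, per the finding above, is skipped on these exits.
  def remove_shape(%__MODULE__{} = t, shape), do: flushed(t, shape)

  # WAL can be acknowledged only up to just before the oldest pending position.
  def safe_flush_lsn(%__MODULE__{pending: pending, last_seen: last_seen}) do
    case Map.values(pending) do
      [] -> last_seen
      lsns -> Enum.min(lsns) - 1
    end
  end
end

tracker =
  ToyFlushTracker.new()
  |> ToyFlushTracker.track(10, ["shape_a", "shape_b"])
  |> ToyFlushTracker.flushed("shape_a")
  |> ToyFlushTracker.track(20, ["shape_a"])
  |> ToyFlushTracker.flushed("shape_a")

# shape_b's consumer is gone and nobody removed its entry, so the safe
# position stays pinned just below LSN 10 while new transactions arrive.
stuck = ToyFlushTracker.safe_flush_lsn(tracker)

# Explicit cleanup on termination lets the position advance again.
unstuck =
  tracker
  |> ToyFlushTracker.remove_shape("shape_b")
  |> ToyFlushTracker.safe_flush_lsn()
```

In this toy model `stuck` is 9 and `unstuck` is 20: the gap between the two grows with every new transaction, which is consistent with the growing WAL gaps described in the state dumps.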

🤖 Generated with Claude Code

alco and others added 2 commits March 24, 2026 13:46
Comprehensive production debugging strategy for diagnosing FlushTracker
stalling caused by consumer processes dying without FlushTracker cleanup.

Includes:
- Deep code analysis identifying the root cause gap: consumer termination
  with clean exit reasons (:shutdown, suspend) doesn't clean up FlushTracker
- Analysis of production state dumps from two affected customers
- 4-phase data gathering plan (runtime inspection, tracing, telemetry, ETS)
- 4 hypothesized root causes with code path evidence
- 3 proposed fix options

Part of electric-sql/alco-agent-tasks#8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Incorporate findings from parallel research agents:
- Edison: 33 stuck shapes, 1.5GB WAL gap, 2 shapes flushed nothing
- Faraday: 402 stuck vs 115 active shapes, zero change between snapshots,
  16.4GB WAL gap growing. 7 shapes removed between snapshots were all active,
  confirming stuck shapes' consumers are permanently gone.
- Add Hypothesis 5 (stale PID in ConsumerRegistry from issue #4013)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d2bd920856

flush_tracker = slc_state.flush_tracker

# Get the consumer registry ETS table
registry_table = Electric.Shapes.ConsumerRegistry.ets_name(stack_id)

P1: Use exported API for ConsumerRegistry table lookup

The Phase 1 snippet calls Electric.Shapes.ConsumerRegistry.ets_name(stack_id), but ets_name/1 is a private function (defp) in packages/sync-service/lib/electric/shapes/consumer_registry.ex, so this command will raise UndefinedFunctionError when run from ECS/IEx and block the “immediate” data-gathering flow. The same private call is repeated later in this document, so the runbook currently cannot be executed as written during an incident.
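
The failure described here can be reproduced in isolation. The sketch below uses a hypothetical ToyRegistry module as a stand-in (it is not Electric's ConsumerRegistry) to show that a remote call to a `defp` function raises UndefinedFunctionError, and that `function_exported?/3` can serve as a pre-flight check before pasting a snippet into a remote shell.

```elixir
defmodule ToyRegistry do
  # Hypothetical stand-in for a module that keeps its table-name helper private.
  def table_info(stack_id), do: :ets.info(ets_name(stack_id))

  defp ets_name(stack_id), do: :"toy_registry:#{stack_id}"
end

# Private functions are not exported, so a call from outside the module
# fails at runtime with UndefinedFunctionError.
result =
  try do
    apply(ToyRegistry, :ets_name, ["stack-1"])
  rescue
    e in UndefinedFunctionError -> {:error, Exception.message(e)}
  end

# Pre-flight check: false for the private helper, true for the public wrapper.
helper_exported? = function_exported?(ToyRegistry, :ets_name, 1)
wrapper_exported? = function_exported?(ToyRegistry, :table_info, 1)
```

Running the same `function_exported?/3` check against each module and function named in a runbook is a cheap way to confirm the commands can actually be executed during an incident.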

:dbg.p(:all, :c)

# Trace ShapeCleaner decisions
:dbg.tp(Electric.ShapeCache.ShapeCleaner, :handle_writer_termination, 4, [])

P2: Trace ShapeCleaner with correct function arity

The tracing command uses :dbg.tp(Electric.ShapeCache.ShapeCleaner, :handle_writer_termination, 4, []), but handle_writer_termination is defined with arity 3, so this tracepoint will not attach to the target function. That means Phase 2.4 misses exactly the termination path this strategy is trying to verify, which can produce misleading debugging results.
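
One way to avoid this class of mistake is to look up the exported arities before setting a trace pattern. The sketch below uses a hypothetical ToyCleaner module whose function name and arity mirror the ones cited in this comment; the argument names are assumptions for illustration only.

```elixir
defmodule ToyCleaner do
  # Hypothetical stand-in with the function name and arity cited in the review.
  def handle_writer_termination(_stack_id, _shape_handle, _reason), do: :ok
end

# List every exported arity for the function name.
arities =
  Keyword.get_values(ToyCleaner.__info__(:functions), :handle_writer_termination)

# Arity 4 does not exist here, so a trace pattern on it would match nothing.
arity_3? = function_exported?(ToyCleaner, :handle_writer_termination, 3)
arity_4? = function_exported?(ToyCleaner, :handle_writer_termination, 4)
```

The arity found this way is the value to pass as the third argument of `:dbg.tp/4`.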

codecov bot commented Mar 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.67%. Comparing base (461576d) to head (970a0e0).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4050   +/-   ##
=======================================
  Coverage   88.67%   88.67%           
=======================================
  Files          25       25           
  Lines        2438     2438           
  Branches      616      611    -5     
=======================================
  Hits         2162     2162           
  Misses        274      274           
  Partials        2        2           
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 93.81% <ø> (ø)
packages/y-electric 56.05% <ø> (ø)
typescript 88.67% <ø> (ø)
unit-tests 88.67% <ø> (ø)

@alco marked this pull request as draft March 24, 2026 12:56