Comprehensive production debugging strategy for diagnosing FlushTracker stalling caused by consumer processes dying without FlushTracker cleanup.

Includes:
- Deep code analysis identifying the root cause gap: consumer termination with clean exit reasons (:shutdown, suspend) doesn't clean up FlushTracker
- Analysis of production state dumps from two affected customers
- 4-phase data gathering plan (runtime inspection, tracing, telemetry, ETS)
- 4 hypothesized root causes with code path evidence
- 3 proposed fix options

Part of electric-sql/alco-agent-tasks#8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Incorporate findings from parallel research agents:
- Edison: 33 stuck shapes, 1.5GB WAL gap, 2 shapes flushed nothing
- Faraday: 402 stuck vs 115 active shapes, zero change between snapshots, 16.4GB WAL gap growing. 7 shapes removed between snapshots were all active, confirming stuck shapes' consumers are permanently gone.
- Add Hypothesis 5 (stale PID in ConsumerRegistry from issue #4013)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d2bd920856
    flush_tracker = slc_state.flush_tracker

    # Get the consumer registry ETS table
    registry_table = Electric.Shapes.ConsumerRegistry.ets_name(stack_id)
Use exported API for ConsumerRegistry table lookup
The Phase 1 snippet calls Electric.Shapes.ConsumerRegistry.ets_name(stack_id), but ets_name/1 is a private function (defp) in packages/sync-service/lib/electric/shapes/consumer_registry.ex, so this command will raise UndefinedFunctionError when run from ECS/IEx and block the “immediate” data-gathering flow. The same private call is repeated later in this document, so the runbook currently cannot be executed as written during an incident.
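Before an incident, it may be worth checking that each runbook call targets an exported function. The sketch below uses a hypothetical stand-in module, `Demo.Registry` (not Electric's ConsumerRegistry), to show how `function_exported?/3` distinguishes a `defp` helper from a public function, and what happens when the private one is called from outside the module.

```elixir
defmodule Demo.Registry do
  # Hypothetical stand-in, not Electric's ConsumerRegistry.
  def table_for(stack_id), do: ets_name(stack_id)

  # Private: callable only from inside this module.
  defp ets_name(stack_id), do: :"demo_registry:#{stack_id}"
end

# function_exported?/3 only inspects loaded modules, so load it first.
{:module, Demo.Registry} = Code.ensure_loaded(Demo.Registry)

private_exported = function_exported?(Demo.Registry, :ets_name, 1)
public_exported = function_exported?(Demo.Registry, :table_for, 1)

# Calling the private function from outside raises UndefinedFunctionError.
call_result =
  try do
    apply(Demo.Registry, :ets_name, ["stack-1"])
  rescue
    e in UndefinedFunctionError -> {:error, Exception.message(e)}
  end

IO.inspect({private_exported, public_exported})
IO.inspect(call_result)
```

Running the same `function_exported?/3` check against the real module in IEx would confirm whether a given runbook command can execute as written.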
    :dbg.p(:all, :c)

    # Trace ShapeCleaner decisions
    :dbg.tp(Electric.ShapeCache.ShapeCleaner, :handle_writer_termination, 4, [])
Trace ShapeCleaner with correct function arity
The tracing command uses :dbg.tp(Electric.ShapeCache.ShapeCleaner, :handle_writer_termination, 4, []), but handle_writer_termination is defined with arity 3, so this tracepoint will not attach to the target function. That means Phase 2.4 misses exactly the termination path this strategy is trying to verify, which can produce misleading debugging results.
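One way to catch this kind of mismatch before tracing is to list the exported arities for the function name and compare them with the arity in the tracepoint. The sketch below is illustrative only: `Demo.Cleaner` and `TraceCheck` are hypothetical names, and the stand-in function simply mirrors the arity of 3 reported above.

```elixir
defmodule Demo.Cleaner do
  # Hypothetical stand-in with a 3-arity function.
  def handle_writer_termination(_stack_id, _shape_handle, _reason), do: :ok
end

defmodule TraceCheck do
  # Lists every exported arity for the given function name.
  def arities(mod, fun) do
    {:module, ^mod} = Code.ensure_loaded(mod)
    mod.__info__(:functions) |> Keyword.get_values(fun) |> Enum.sort()
  end

  # Returns :ok only when mod.fun/arity is exported and can be traced.
  def check(mod, fun, arity) do
    case arities(mod, fun) do
      [] -> {:error, :not_exported}
      found -> if arity in found, do: :ok, else: {:error, {:wrong_arity, found}}
    end
  end
end

wrong = TraceCheck.check(Demo.Cleaner, :handle_writer_termination, 4)
right = TraceCheck.check(Demo.Cleaner, :handle_writer_termination, 3)
IO.inspect({wrong, right})
```

The return value of `:dbg.tp` is also worth reading during a live session: it reports how many functions the pattern matched, so a count of zero is a hint that the module, name, or arity is off.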
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

    @@           Coverage Diff           @@
    ##             main    #4050   +/-   ##
    =======================================
      Coverage   88.67%   88.67%
    =======================================
      Files          25       25
      Lines        2438     2438
      Branches      616      611    -5
    =======================================
      Hits         2162     2162
      Misses        274      274
      Partials        2        2
Summary
- Root cause gap: consumer termination with clean exit reasons (:shutdown, suspend) doesn't clean up FlushTracker entries

Context
Part of electric-sql/alco-agent-tasks#8 (FlushTracker stalling when tracked consumer dies out-of-band).
This PR contains analysis and a strategy document, no code changes.
Key Finding
The critical gap is in ShapeCleaner.handle_writer_termination:
- Suspend exit (@shutdown_suspend) → only calls ConsumerRegistry.remove_consumer, NO FlushTracker cleanup
- :shutdown/:normal/:killed → does NOTHING at all

This means any consumer that exits cleanly while its shape is tracked in FlushTracker leaves a stale entry that blocks WAL flush advancement forever.
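To illustrate why a single stale entry is enough to stall flushing, here is a toy model. It is not Electric's FlushTracker: `ToyFlushTracker` and its functions are hypothetical, and it assumes the global flush position is bounded by the slowest tracked shape.

```elixir
defmodule ToyFlushTracker do
  # Toy model only, NOT Electric's FlushTracker. Each tracked shape maps to
  # the last offset its consumer reported as flushed.
  def new, do: %{}

  def track(tracker, shape, offset), do: Map.put(tracker, shape, offset)

  # A live consumer reports progress for its shape (no-op if untracked).
  def flushed(tracker, shape, offset), do: Map.replace(tracker, shape, offset)

  # The cleanup step a terminating consumer would need to trigger.
  def untrack(tracker, shape), do: Map.delete(tracker, shape)

  # Assumption: the global position can only reach the slowest tracked shape.
  def global_flush(tracker, latest) when map_size(tracker) == 0, do: latest
  def global_flush(tracker, _latest), do: tracker |> Map.values() |> Enum.min()
end

tracker =
  ToyFlushTracker.new()
  |> ToyFlushTracker.track("shape-a", 100)
  |> ToyFlushTracker.track("shape-b", 100)

# shape-a keeps flushing; shape-b's consumer exited without any cleanup.
tracker = ToyFlushTracker.flushed(tracker, "shape-a", 500)
stalled = ToyFlushTracker.global_flush(tracker, 500)

# Removing the stale entry lets the position advance again.
recovered =
  tracker
  |> ToyFlushTracker.untrack("shape-b")
  |> ToyFlushTracker.global_flush(500)

IO.inspect({stalled, recovered})
```

In this model the position stays at 100 for as long as "shape-b" remains tracked, no matter how far "shape-a" advances, which mirrors the growing WAL gap described in the findings.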
🤖 Generated with Claude Code