Subquery shapes error on first WAL event after restart: Materializer not available #3946

@alco

Summary

When Electric restarts and restores shapes with subqueries from disk, the first WAL event(s) that arrive trigger RuntimeError: Materializer for stack "..." and handle "..." is not available errors. This is a race condition between eager filter restoration and lazy consumer/materializer startup.

Root cause

On startup, two things happen in sequence:

  1. ShapeLogCollector restores filters eagerly for all persisted shapes, including shapes with subquery dependencies. After this, the filter knows about all shapes and can try to evaluate subquery WHERE clauses.

  2. Consumer and Materializer processes are started lazily — only when the first event arrives for a shape (via ConsumerRegistry).

When a WAL event arrives, the flow is:

ShapeLogCollector.handle_call
  → EventRouter.route_change
    → Filter.affected_shapes
      → WhereCondition.other_shapes_affected
        → refs_fun.(shape)
          → Materializer.get_all_as_refs
            → GenServer.call(materializer)  ← process doesn't exist yet!

The filter needs the Materializer to resolve subquery link values, but the Materializer is only started after Filter.affected_shapes returns and ConsumerRegistry.publish triggers lazy consumer startup.
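One way to close the race would be to treat a not-yet-started Materializer as a soft miss instead of raising. A minimal sketch, assuming a `whereis`-style lookup is possible for Materializer processes; `name/2` and `:get_link_values` are assumed names for illustration, not Electric's actual API:

```elixir
# Hypothetical sketch, not Electric's actual implementation.
def get_all_as_refs(stack_id, handle) do
  case GenServer.whereis(name(stack_id, handle)) do
    # Materializer not started yet (e.g. lazy startup after a restore):
    # signal "unknown" so the caller can fall back to notifying all shapes
    # for this subquery instead of raising.
    nil -> :not_available
    pid -> GenServer.call(pid, :get_link_values)
  end
end
```

An alternative would be to start Materializers eagerly at filter-restore time, at the cost of losing the lazy-startup savings for shapes that never see a WAL event.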

The error is caught by Filter.affected_shapes, which falls back to returning all shape IDs (safe but noisy).
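The safety net described above has roughly this shape (a paraphrase with assumed names, not the actual source; `do_affected_shapes/2` and `all_shape_ids/1` are illustrative):

```elixir
# Paraphrased sketch of the catch-all in Filter.affected_shapes.
def affected_shapes(filter, change) do
  try do
    do_affected_shapes(filter, change)
  rescue
    error ->
      Logger.error(
        "Unexpected error in Filter.WhereCondition.affected_shapes: " <>
          Exception.format(:error, error, __STACKTRACE__)
      )

      # Safe fallback: notify every shape rather than drop the change.
      all_shape_ids(filter)
  end
end
```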

Reproduction

  1. Create a table with some data:
CREATE TABLE hot_rows (
  id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  grp_a integer NOT NULL,
  grp_b integer NOT NULL,
  grp_c integer NOT NULL,
  payload text NOT NULL DEFAULT '',
  updated_at timestamptz NOT NULL DEFAULT now()
);
-- Insert rows with random group values, e.g.:
INSERT INTO hot_rows (grp_a, grp_b, grp_c)
SELECT (random() * 3)::int + 1, (random() * 3)::int + 1, (random() * 3)::int + 1
FROM generate_series(1, 100);
  2. Create a shape with a subquery WHERE clause:
where=id IN (SELECT id FROM hot_rows WHERE grp_a = 1)
  OR id IN (SELECT id FROM hot_rows WHERE grp_b = 1)
  OR id IN (SELECT id FROM hot_rows WHERE grp_c = 1)
  3. Restart Electric (shapes restore from disk).

  4. Update any row in the table, without any HTTP client connecting first:

UPDATE hot_rows SET payload = md5(random()::text), updated_at = now()
WHERE id = (SELECT id FROM hot_rows ORDER BY random() LIMIT 1);
  5. Observe the error flood: one error per subquery dependency shape, per change in the transaction. Increasing the number of modified rows in a single transaction, or the number of shapes, proportionally increases the number of logged errors.

Example output

Startup completes normally, shapes and filters are restored:

13:03:52.356 [notice] Restored publication filters in 3ms
13:03:52.380 [notice] Restored filters for 4 shapes in 28ms
13:03:52.380 [notice] Consumers ready in 39ms (4 shapes)

A single-row UPDATE arrives ~2 seconds later. The filter tries to evaluate the subquery WHERE clauses, but the Materializer processes don't exist yet. The errors come one per evaluation pass rather than one per subquery: the UpdatedRecord is evaluated twice, once for the old_record and once for the record, so the three subqueries produce two errors:

13:03:54.827 [info] Received relation "public"."hot_rows"

13:03:54.919 [error] Unexpected error in Filter.WhereCondition.affected_shapes:
** (RuntimeError) Materializer for stack "single_stack" and handle "16096921-1772539368604093" is not available
    (electric 1.4.11) lib/electric/shapes/consumer/materializer.ex:98: Electric.Shapes.Consumer.Materializer.genserver_get_link_values/1
    (electric 1.4.11) lib/electric/shapes/consumer/materializer.ex:111: anonymous fn/2 in Electric.Shapes.Consumer.Materializer.get_all_as_refs/2
    ...
    (electric 1.4.11) lib/electric/shapes/filter/where_condition.ex:210: anonymous fn/6 in Electric.Shapes.Filter.WhereCondition.other_shapes_affected/4
    ...
    (electric 1.4.11) lib/electric/shapes/event_router.ex:149: Electric.Shapes.EventRouter.find_affected_shapes_for_change/2

13:03:54.942 [error] Unexpected error in Filter.WhereCondition.affected_shapes:
** (RuntimeError) Materializer for stack "single_stack" and handle "16096921-1772539368604093" is not available
    ...

Only after the errors do consumer and materializer processes get lazily started:

13:03:54.943 [debug] Starting Electric.Shapes.Consumer for 16096921-1772539368604093
13:03:54.954 [debug] Started consumer for existing handle 16096921-1772539368604093
13:03:54.954 [debug] Starting Electric.Shapes.Consumer for 79160600-1772539368614088
13:03:54.955 [debug] Started consumer for existing handle 79160600-1772539368614088
13:03:54.955 [debug] Starting Electric.Shapes.Consumer for 95336946-1772539368614654
13:03:54.956 [debug] Started consumer for existing handle 95336946-1772539368614654
...
13:03:55.022 [debug] Starting Electric.Shapes.Consumer.Materializer for 16096921-1772539368604093
13:03:55.023 [debug] Starting Electric.Shapes.Consumer.Materializer for 79160600-1772539368614088
13:03:55.023 [debug] Starting Electric.Shapes.Consumer.Materializer for 95336946-1772539368614654
13:03:55.023 [debug] Starting Electric.Shapes.Consumer for 76149289-1772539368615417

Impact

  • The catch-all in Filter.affected_shapes returns all shape IDs when the error occurs, so no events are lost — but every shape gets notified of every change during the race window, which is wasteful.
  • Each change in the WAL produces one error per subquery dependency, so a multi-row transaction causes an error flood.
  • Rare in production because HTTP clients typically reconnect after a restart before the first WAL event arrives, which triggers eager consumer startup. Observed twice on the hertz instance (ELECTRIC-5D7, ELECTRIC-5PF).
