
fix(typescript-client): add exponential backoff to onError-driven retries #4054

Open
KyleAMathews wants to merge 8 commits into main from fix/onerror-retry-backoff

Conversation

@KyleAMathews
Contributor

Summary

Adds exponential backoff to onError-driven retries in ShapeStream to prevent tight infinite loops when onError returns {} on persistent 4xx errors (e.g., expired auth tokens returning 403).

Previously, the fetch backoff layer correctly skipped retrying 4xx errors, but when onError returned {} to signal "retry", the stream restarted immediately with zero delay — creating a tight loop that could hammer both Electric and the upstream database. A user reported this causing ~$200/day in Neon network egress from a development app with zero traffic.

Root Cause

The client has two layers of error handling:

  1. Fetch backoff (createFetchWithBackoff): Retries 5xx/429 with exponential backoff. Throws 4xx immediately.
  2. onError callback (#start): Called after fetch backoff gives up. Returns {} to retry, void to stop.

When onError returned an object, #start() recursively called itself with no delay. The simplest "keep syncing" pattern — onError: () => ({}) — became the most dangerous on persistent client errors.
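The tight loop can be modeled in a few lines. This is an illustrative toy, not the actual `#start` implementation: it only shows why an object return from `onError` turned a persistent error into a zero-delay retry loop.

```typescript
// Toy model of the pre-fix control flow (names are illustrative only).
// Returning an object from onError restarts immediately, so a persistent
// 4xx becomes a zero-delay loop until the handler gives up.
type RetrySignal = object | void

async function start(
  onError: () => RetrySignal,
  attempt = 0
): Promise<number> {
  if (attempt >= 5) return attempt // stand-in for "the 4xx keeps happening"
  const decision = onError()
  if (decision !== undefined && typeof decision === `object`) {
    return start(onError, attempt + 1) // restart with zero delay
  }
  return attempt // void return: stop retrying
}
```

With `onError: () => ({})` the loop spins through every attempt back to back; a void-returning handler stops on the first failure.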

Approach

  • Exponential backoff with full jitter on the onError retry path: 100ms base, 30s cap, same algorithm as existing fast-loop and SSE backoffs
  • Abort-aware delay: The setTimeout listens for the abort signal so stream.abort() / component unmount tears down immediately instead of blocking up to 30s
  • Console warning on 2nd+ retry: Logs the delay duration and error message so developers can diagnose "why is my stream not syncing?"
  • Reset on success: #onErrorRetryCount resets when the stream reaches up-to-date, so a successful auth token refresh isn't penalized on the next error
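The abort-aware delay in the second bullet can be sketched as a promise-wrapped `setTimeout` that races the timer against the abort signal. This is an assumed helper shape, not the exact code in `client.ts`:

```typescript
// Hypothetical abort-aware sleep. Resolves after `ms`, or immediately if
// the signal aborts first, so stream.abort() / component unmount never
// waits out a pending backoff delay.
function abortableSleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve) => {
    if (signal?.aborted) {
      resolve()
      return
    }
    const onAbort = (): void => {
      clearTimeout(timer)
      resolve()
    }
    const timer = setTimeout(() => {
      // Timer expired normally: drop the abort listener so closures do
      // not accumulate on long-lived streams with many recoverable errors.
      signal?.removeEventListener(`abort`, onAbort)
      resolve()
    }, ms)
    signal?.addEventListener(`abort`, onAbort, { once: true })
  })
}
```

Removing the listener on normal expiry matches the later commit that prevents closure accumulation on long-lived streams.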

Key Invariants

  • First retry: 0–100ms delay (fast enough for auth token refresh)
  • Exponential growth: 200ms, 400ms, 800ms... up to 30s cap
  • Abort always honored: no hanging teardown
  • Fast-loop detector stays independent (its state is still cleared on onError retry)
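The invariants above imply a full-jitter delay computation along these lines (constant and function names here are assumed for illustration, not taken from `client.ts`):

```typescript
// Full-jitter exponential backoff: pick a uniform delay below an
// exponentially growing ceiling. Matches the stated invariants:
// retry 1 -> 0-100ms, then ceilings of 200ms, 400ms, 800ms... up to 30s.
const BASE_DELAY_MS = 100
const MAX_DELAY_MS = 30_000

function onErrorRetryDelay(retryCount: number): number {
  // retryCount - 1 in the exponent keeps the first retry near-immediate,
  // fast enough for a legitimate auth token refresh.
  const ceiling = Math.min(
    MAX_DELAY_MS,
    BASE_DELAY_MS * 2 ** (retryCount - 1)
  )
  return Math.random() * ceiling
}
```

Full jitter (uniform in [0, ceiling) rather than a fixed exponential step) spreads simultaneous reconnects from many clients across the whole window.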

Non-goals

  • Changing the fetch backoff layer's 4xx handling (still throws immediately, by design)
  • Capping onError retries (the user's onError controls whether to give up)
  • Changing the onError API contract (returning {} still means "retry")

Verification

cd packages/typescript-client
pnpm vitest run test/stream.test.ts test/wake-detection.test.ts test/fetch.test.ts test/expired-shapes-cache.test.ts

All 73 unit tests pass.

Files changed

| File | Change |
| --- | --- |
| packages/typescript-client/src/client.ts | Add backoff fields, backoff + abort logic in #start() onError path, reset in #requestShape() success path |
| packages/typescript-client/test/wake-detection.test.ts | Advance fake timers by 200ms to account for new backoff delay |
| .changeset/add-onerror-retry-backoff.md | Changeset for patch release |

🤖 Generated with Claude Code

KyleAMathews and others added 2 commits March 24, 2026 18:03
…ries

When onError returns {} on persistent 4xx errors (e.g. expired auth
tokens returning 403), the stream retried immediately with zero delay,
creating a tight infinite loop that could hammer both Electric and the
upstream database.

Add exponential backoff with jitter (100ms base, 30s cap) to the
onError retry path. The backoff delay is abort-aware so stream teardown
remains responsive. Includes a console.warn on 2nd+ retry for
debuggability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@pkg-pr-new

pkg-pr-new bot commented Mar 25, 2026

Open in StackBlitz

npm i https://pkg.pr.new/@electric-sql/react@4054
npm i https://pkg.pr.new/@electric-sql/client@4054
npm i https://pkg.pr.new/@electric-sql/y-electric@4054

commit: 130dc9e

- Exponential backoff grows delay between retries on persistent 403s
- Stream tears down immediately when aborted during backoff delay
- Console warning emitted on 2nd+ retry attempt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov

codecov bot commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.77%. Comparing base (e4165ec) to head (130dc9e).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4054      +/-   ##
==========================================
+ Coverage   84.85%   88.77%   +3.92%     
==========================================
  Files          39       25      -14     
  Lines        2872     2459     -413     
  Branches      614      616       +2     
==========================================
- Hits         2437     2183     -254     
+ Misses        433      274     -159     
  Partials        2        2              
| Flag | Coverage | Δ |
| --- | --- | --- |
| electric-telemetry | ? | |
| elixir | ? | |
| packages/experimental | 87.73% <ø> | (ø) |
| packages/react-hooks | 86.48% <ø> | (ø) |
| packages/start | 82.83% <ø> | (ø) |
| packages/typescript-client | 93.88% <100.00%> | (+0.01%) ⬆️ |
| packages/y-electric | 56.05% <ø> | (ø) |
| typescript | 88.77% <100.00%> | (+0.05%) ⬆️ |
| unit-tests | 88.77% <100.00%> | (+3.92%) ⬆️ |


…hared-field analyzer warning

Use a local `retryCount` variable so the field is not read across the
async boundary in #start, satisfying the shape-stream-risks analyzer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@netlify

netlify bot commented Mar 25, 2026

Deploy Preview for electric-next ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | a7036ce |
| 🔍 Latest deploy log | https://app.netlify.com/projects/electric-next/deploys/69c3f15992f81400088619c6 |
| 😎 Deploy Preview | https://deploy-preview-4054--electric-next.netlify.app |

KyleAMathews and others added 4 commits March 25, 2026 08:31
Use retryCount - 1 in the exponent so the first onError retry has
minimal delay (for legitimate auth token refresh), with exponential
growth only kicking in on subsequent retries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lyzer

Remove the reset from #requestShape so #start is the sole writer,
satisfying the shared-instance-field analyzer check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace single-gap jitter-sensitive assertion with early-vs-late sum
  comparison that is robust against random jitter
- Clean up abort listener when backoff timer expires normally to prevent
  closure accumulation on long-lived streams with many recoverable errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that abort listeners are removed when the backoff timer
expires normally, preventing closure accumulation on long-lived streams.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@thruflo
Contributor

thruflo commented Mar 25, 2026

Is it worth slowing down the backoff increases? It seems quite fast: a transitory error lasting a few seconds results in an extra few seconds of delay.

If the behavior right now is zero delay, then perhaps a lower-gradient backoff curve solves the problem while still keeping the default fairly eager to reconnect?


I know this can be overridden with your own onError handler, and we don't like options, but I feel like making this configurable could be useful DX. Something like a small set of keywords for common strategies.

@KyleAMathews
Contributor Author

@thruflo the assumption is that generally people have configured their onError correctly, so the first request should go through; the backoff is just a failsafe.
