feat(sdk): retry transient network errors and rate limits by jakubno · Pull Request #1378 · e2b-dev/E2B

jakubno · 2026-06-02T07:56:23Z

Automatically retry requests on transient failures across the JS and Python SDKs. Retries connection errors and 429/502/503/504 responses using exponential backoff with jitter, and honors a server-provided Retry-After header so rate limiting (e.g. listing sandboxes) is handled transparently.

Retries are idempotency-aware: idempotent methods retry on any transient failure, while non-idempotent ones (e.g. Sandbox.create) only retry on "rejected" failures where the server provably did not process the request (throttling, connection-refused, DNS), avoiding duplicate side effects.
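The idempotency rule above can be sketched as a single predicate. This is an illustration only, not the SDK's code: the names `STATUS_KINDS`, `IDEMPOTENT_METHODS`, and `should_retry` are hypothetical, the status classification follows what the later commits in this PR describe (429 as "rejected", 502/503/504 as ambiguous), and the method set is the standard HTTP idempotent set, which may differ from what the SDK actually uses.

```python
from typing import Optional

REJECTED = "rejected"    # the server provably did not process the request
AMBIGUOUS = "ambiguous"  # the request may or may not have been processed

# Assumed classification, pieced together from the PR discussion.
STATUS_KINDS = {
    429: REJECTED,
    502: AMBIGUOUS,
    503: AMBIGUOUS,
    504: AMBIGUOUS,
}

# Standard HTTP idempotent methods (assumption: the SDK may use another set).
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE", "OPTIONS"})


def should_retry(method: str, kind: Optional[str]) -> bool:
    """Return True when a failure of the given kind may be retried."""
    if kind is None:
        # Not a transient failure at all (e.g. a 500 or a 4xx other than 429).
        return False
    # Rejected failures are safe for every method; ambiguous ones only
    # for methods that can be replayed without duplicating side effects.
    return kind == REJECTED or method.upper() in IDEMPOTENT_METHODS
```

With this table, a throttled `POST` is retried, while a `POST` that got a 503 is not, because the server may already have acted on it.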

Configure via the new retries option or E2B_MAX_RETRIES env var (default 3, set 0 to disable). The Python envd RPC retry now also uses backoff between attempts.
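A minimal sketch of how the retry count might be resolved, assuming the explicit option wins over the environment variable, which in turn wins over the default. The function name `resolve_max_retries` mirrors the JS helper discussed in review; ignoring an unparsable environment value and clamping negatives to 0 are assumptions made for this example.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_RETRIES = 3  # default stated in the PR description


def resolve_max_retries(
    retries: Optional[int] = None,
    env: Mapping[str, str] = os.environ,
) -> int:
    """Resolve the retry count: explicit option, then env var, then default."""
    if retries is not None:
        return max(0, retries)
    raw = env.get("E2B_MAX_RETRIES")
    if raw is not None:
        try:
            return max(0, int(raw))
        except ValueError:
            # Assumption: an unparsable value falls through to the default.
            pass
    return DEFAULT_MAX_RETRIES
```

Passing `env` explicitly keeps the helper easy to test without mutating the process environment.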

cursor · 2026-06-02T07:56:30Z

PR Summary

Medium Risk
Cross-cutting change to all API/envd/volume request paths; incorrect retry classification on non-idempotent calls could duplicate side effects, though the PR explicitly limits those to “rejected” failures only.

Overview
Both JS and Python SDKs now automatically retry control-plane, volume, and sandbox (envd) HTTP traffic on transient failures—connection errors and 429 / 408 / 502 / 503 / 504 (not 500)—using exponential backoff with jitter and honoring Retry-After.

Retries are idempotency-aware: safe methods retry on any transient failure; non-idempotent calls (e.g. Sandbox.create, envd POST RPCs) only retry when the failure is provably unprocessed (“rejected”, e.g. throttling, refused connection, DNS). Ambiguous mid-flight failures are not replayed for those calls. Large or streaming bodies are sent once without retry (JS buffers up to 1 MiB for replay).
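The "sent once without retry" rule for large or streaming bodies could be checked along these lines. This is a sketch under assumptions: `is_replayable` and `MAX_REPLAY_BYTES` are illustrative names, and applying the 1 MiB limit (stated above for JS) to a Python check is a choice made for this example.

```python
MAX_REPLAY_BYTES = 1024 * 1024  # 1 MiB, the JS buffering limit mentioned above


def is_replayable(body: object) -> bool:
    """Return True when a request body can safely be sent more than once."""
    if body is None:
        # No body: nothing is consumed by the first attempt.
        return True
    if isinstance(body, (bytes, bytearray)):
        # Fully buffered content can be resent as long as it is small.
        return len(body) <= MAX_REPLAY_BYTES
    # Iterators, generators and file-like objects are one-shot streams.
    return False
```

A request whose body is not replayable would then be sent a single time, regardless of the configured retry count.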

Configuration is unified via a new retries option and E2B_MAX_RETRIES (default 3, 0 disables). Python httpx transports and envd Connect unary RPCs use the shared policy; envd RPC backoff is added. wait() in JS respects AbortSignal during backoff.

Reviewed by Cursor Bugbot for commit 9bb851f. Bugbot is set up for automated code reviews on this repo.

changeset-bot · 2026-06-02T07:56:34Z

🦋 Changeset detected

Latest commit: 9bb851f

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages

Name	Type
e2b	Minor
@e2b/python-sdk	Minor



First retry now waits up to ~100ms (was ~500ms) before backing off exponentially, keeping the cap at 8s.
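To make the numbers concrete, here is a small sketch of the delay schedule, assuming "full jitter" (a uniform draw between 0 and the exponential ceiling). The 100 ms base and 8 s cap come from the message above; the jitter strategy and the function names are assumptions.

```python
import random
from typing import Optional

BACKOFF_BASE_MS = 100   # base from the commit message
BACKOFF_CAP_MS = 8_000  # cap from the commit message


def backoff_ceiling_ms(attempt: int) -> int:
    """Upper bound of the delay before retry number `attempt` (0-based)."""
    return min(BACKOFF_CAP_MS, BACKOFF_BASE_MS * 2 ** attempt)


def backoff_delay_ms(attempt: int, rng: Optional[random.Random] = None) -> float:
    """Full jitter (assumed): pick uniformly between 0 and the ceiling."""
    source = rng if rng is not None else random.Random()
    return source.uniform(0, backoff_ceiling_ms(attempt))


# Ceilings for the first eight retries, in milliseconds.
ceilings = [backoff_ceiling_ms(a) for a in range(8)]
# → [100, 200, 400, 800, 1600, 3200, 6400, 8000]
```

Under these assumptions the default of 3 retries adds at most 100 + 200 + 400 = 700 ms of backoff to a failing call.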

github-actions · 2026-06-02T08:36:13Z

Package Artifacts

Built from 16a13a2. Download artifacts from this workflow run.

JS SDK (e2b@2.27.2-feat-retry-transient-errors.0):

npm install ./e2b-2.27.2-feat-retry-transient-errors.0.tgz

CLI (@e2b/cli@2.10.4-feat-retry-transient-errors.0):

npm install ./e2b-cli-2.10.4-feat-retry-transient-errors.0.tgz

Python SDK (e2b==2.25.1+feat-retry-transient-errors):

pip install ./e2b-2.25.1+feat.retry.transient.errors-py3-none-any.whl

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 330a96d56d


- Treat 503 as ambiguous (not rejected) so non-idempotent POSTs are not replayed when the server may have processed the request (JS + Python)
- Correct misleading retry docs that referenced an idempotency-key mechanism that is not implemented
- Plumb config.retries through the Python envd Connect RPC client instead of a hardcoded count of 3
- Pass retries=config.retries to the Python envd HTTP transport and include it in the transport cache key

- Fix a deadlock in the JS retry body buffering: cancelling one branch of a teed request body (request.clone()) never resolves while the other branch is unread, hanging any >1MiB non-stream upload (volume PUT, filesystem POST). Send the pristine original once instead of cancelling.
- Collapse the duplicated response/error retry guards into a single shouldRetry predicate (JS) / _should_retry (Python) so the POST safety rule has one source of truth.
- Export and lock the classification tables with tests and cross-SDK sync notes to catch JS<->Python drift.
- Add edge tests: large non-replayable body sent once, abort-race.

…lay streamed DELETE

- Bound the whole retried operation by the request's timeout (a monotonic deadline + per-attempt clamp), instead of letting each attempt use the full timeout so N retries could run ~N*timeout. Mirrors the JS single-signal bound.
- _is_replayable now requires buffered content for all methods; a DELETE or OPTIONS carrying a one-shot streaming body is no longer treated as replayable.
- Add sync+async tests for both.
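The deadline bound described in this commit message might look roughly like the loop below. It is a sketch, not the SDK's implementation: `call_with_deadline`, `TransientError`, and the injectable `clock`/`sleep` parameters are hypothetical, and raising `TimeoutError` once the budget is spent is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure."""


def call_with_deadline(
    send: Callable[[float], T],
    timeout: float,
    max_retries: int,
    backoff: Callable[[int], float] = lambda attempt: 0.1 * 2 ** attempt,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `send` while keeping the whole operation within `timeout` seconds."""
    deadline = clock() + timeout  # one monotonic deadline for all attempts
    attempt = 0
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("retry budget exhausted")
        try:
            # Per-attempt clamp: each attempt only gets what is left.
            return send(remaining)
        except TransientError:
            if attempt >= max_retries:
                raise
            # Never sleep past the deadline.
            sleep(min(backoff(attempt), max(0.0, deadline - clock())))
            attempt += 1
```

Because every attempt receives only the remaining budget, N retries cannot stretch the call to roughly N times the configured timeout.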

cursor

Stale comment

Security review (run 2/2) complete on head f96449a49b11d4cf933a95bae4fba43bc18205b1.

No new security vulnerabilities were identified in the current diff.

I specifically re-checked the retry safety surface (replayability checks and timeout/deadline bounding in Python, and abort-aware backoff behavior in JS) and did not find a remaining security regression to report.

Sent by Cursor Security Agent: Security Reviewer

Match the JS SDK, which retries envd RPC through withRetry. Python unary RPC now retries on rejected failures — HTTP 429 (honoring Retry-After) and connection errors (ConnectError/ConnectTimeout) — in addition to the existing RemoteProtocolError handling. Ambiguous statuses (502/503/504) are not retried since RPC is a non-idempotent POST. Streaming RPC is left unwrapped, matching the JS isStreamLike pass-through.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit df035e9.

wj-e2b · 2026-06-03T19:06:03Z

+    const kind = RETRYABLE_ERROR_CODES.get(code)
+    if (kind) return kind
+    // undici low-level socket/transport errors are ambiguous mid-flight drops.
+    if (code.startsWith('und_err_') || code === 'fetch failed') {


Should probably check .message here, since "fetch failed" is the error's message and isn't set in .code.

https://github.com/nodejs/undici/blob/c995513094903c67151907213296e91179279b50/lib/web/fetch/index.js#L261

mishushakov

JS SDK reviewed

mishushakov · 2026-06-09T17:10:32Z

+    response?.headers.get('retry-after') ?? null
+  )
+  if (retryAfter !== undefined) {
+    return Math.min(retryAfter, policy.backoffCapMs * 4)


this doesn't make much sense?
if the server says to retry after x, but x > policy.backoffCapMs * 4, the clamped retry fires too early and will fail again. maybe it makes sense to mark the request as not retryable when retryAfter > policy.backoffCapMs * 4. wdyt?
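The two options in this thread can be compared side by side. The sketch below parses a Retry-After value (delay in seconds or an HTTP date) and then applies either the clamp from the diff or the reviewer's "give up" alternative. All names are hypothetical, and the 8 s cap is taken from the backoff commit message in this PR.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

BACKOFF_CAP_MS = 8_000
MAX_RETRY_AFTER_MS = BACKOFF_CAP_MS * 4  # the limit used in the diff above


def parse_retry_after_ms(
    value: Optional[str], now: Optional[datetime] = None
) -> Optional[float]:
    """Parse Retry-After as delay-seconds or an HTTP date; None if unusable."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return int(value) * 1000.0
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds() * 1000.0)


def clamp_policy(retry_after_ms: float) -> float:
    """Behaviour in the diff: wait at most the limit, then retry anyway."""
    return min(retry_after_ms, MAX_RETRY_AFTER_MS)


def give_up_policy(retry_after_ms: float) -> Optional[float]:
    """Reviewer's alternative: None means 'do not retry at all'."""
    return retry_after_ms if retry_after_ms <= MAX_RETRY_AFTER_MS else None
```

For a server asking for 60 s, the clamp waits 32 s and retries into a likely second 429, while the alternative surfaces the rate-limit error immediately.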

mishushakov · 2026-06-09T17:10:32Z

+ * its {@link FailureKind}) or `undefined` when it is not retryable. User/timeout
+ * aborts are explicitly not retryable.
+ */
+export function retryableErrorKind(err: unknown): FailureKind | undefined {


: bool, no undefined

mishushakov · 2026-06-09T17:10:33Z

+import { wait } from './utils'
+
+/** Default number of *retries* (i.e. attempts after the first). */
+export const DEFAULT_MAX_RETRIES = 3


it's exported but not imported anywhere else?

mishushakov · 2026-06-09T17:10:33Z

+  const codes: string[] = []
+  let current: unknown = err
+  // Walk the `cause` chain (undici wraps the real error in `cause`).
+  for (let i = 0; i < 5 && current; i++) {


i will have to check undici, but the for loop looks weird

mishushakov · 2026-06-09T17:10:33Z

+export const DEFAULT_MAX_RETRIES = 3
+
+/** Base for the exponential backoff, in milliseconds. */
+const DEFAULT_BACKOFF_BASE_MS = 100


should be higher imo, 500ms?

mishushakov · 2026-06-09T17:10:37Z

+    let attempt = 0
+    for (;;) {
+      try {
+        const response = await innerFetch(buildAttempt())


keep in mind, this might pollute the memory with Request objects

mishushakov · 2026-06-09T17:10:38Z

+        }
+
+        // Drain and discard the body so the connection can be reused.
+        await response.body?.cancel().catch(() => {})


isn't this handled by the V8 GC, not sure you need to manually drain it

mishushakov · 2026-06-09T17:10:38Z

+ * `E2B_MAX_RETRIES` environment variable and finally
+ * {@link DEFAULT_MAX_RETRIES}.
+ */
+export function resolveMaxRetries(retries?: number): number {


this should perhaps live on connectionConfig directly, not here

mishushakov · 2026-06-09T17:10:38Z

+      // Volume content uploads/downloads can be large; withRetry only retries
+      // small, replayable bodies.


i would check specific URL patterns, for example PUTs on the sandbox fs / volumes and mark these as non-retryable instead.
would also remove the need for the buffer logic altogether.

mishushakov · 2026-06-09T17:10:39Z

+ * Build a fake `fetch` that returns/throws the queued outcomes in order and
+ * records the requests it received.
+ */
+function fakeFetch(


this feels kinda cursed, i think we should instead use our actual fetch and mock the responses with MSW

mishushakov · 2026-06-09T17:10:32Z

merge python changeset into it

mishushakov · 2026-06-09T17:10:43Z

+  return kind === 'rejected' || idempotent
+}
+
+function isAbortError(err: unknown): boolean {


also, why have this function when you only call it from one place

cla-bot Bot added the cla-signed label Jun 2, 2026

cursor Bot reviewed Jun 2, 2026

Comment thread packages/python-sdk/e2b/connection_config.py

jakubno added 3 commits June 2, 2026 08:13

fix(sdk): lower retry backoff base to 100ms

f45e99b

First retry now waits up to ~100ms (was ~500ms) before backing off exponentially, keeping the cap at 8s.

fix(python-sdk): propagate retries through get_api_params

330a96d

jakubno force-pushed the feat/retry-transient-errors branch from 42725e5 to 330a96d June 2, 2026 08:35

jakubno marked this pull request as ready for review June 2, 2026 13:45

jakubno requested review from ValentaTomas and mishushakov as code owners June 2, 2026 13:45

chatgpt-codex-connector Bot reviewed Jun 2, 2026

Comment thread packages/js-sdk/src/retry.ts Outdated

Comment thread packages/python-sdk/e2b/_retry.py Outdated

claude Bot reviewed Jun 2, 2026

Comment thread packages/js-sdk/src/connectionConfig.ts

Comment thread packages/python-sdk/e2b_connect/client.py

Comment thread packages/python-sdk/e2b/api/client_async/__init__.py

jakubno added 3 commits June 2, 2026 14:40

refactor(js-sdk): make utils.wait abort-aware and reuse it in retry

94daff5