feat(sdk): retry transient network errors and rate limits#1378
feat(sdk): retry transient network errors and rate limits#1378jakubno wants to merge 11 commits into
Conversation
PR SummaryMedium Risk Overview Retries are idempotency-aware: safe methods retry on any transient failure; non-idempotent calls (e.g. Configuration is unified via a new Reviewed by Cursor Bugbot for commit 9bb851f. Bugbot is set up for automated code reviews on this repo. Configure here. |
🦋 Changeset detectedLatest commit: 9bb851f The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Automatically retry requests on transient failures across the JS and Python SDKs. Retries connection errors and 429/502/503/504 responses using exponential backoff with jitter, and honor a server-provided Retry-After header so rate limiting (e.g. listing sandboxes) is handled transparently. Retries are idempotency-aware: idempotent methods retry on any transient failure, while non-idempotent ones (e.g. Sandbox.create) only retry on "rejected" failures where the server provably did not process the request (throttling, connection-refused, DNS), avoiding duplicate side effects. Configure via the new `retries` option or E2B_MAX_RETRIES env var (default 3, set 0 to disable). The Python envd RPC retry now also uses backoff between attempts.
First retry now waits up to ~100ms (was ~500ms) before backing off exponentially, keeping the cap at 8s.
42725e5 to
330a96d
Compare
Package ArtifactsBuilt from 16a13a2. Download artifacts from this workflow run. JS SDK ( npm install ./e2b-2.27.2-feat-retry-transient-errors.0.tgzCLI ( npm install ./e2b-cli-2.10.4-feat-retry-transient-errors.0.tgzPython SDK ( pip install ./e2b-2.25.1+feat.retry.transient.errors-py3-none-any.whl |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 330a96d56d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Treat 503 as ambiguous (not rejected) so non-idempotent POSTs are not replayed when the server may have processed the request (JS + Python) - Correct misleading retry docs that referenced an idempotency-key mechanism that is not implemented - Plumb config.retries through the Python envd Connect RPC client instead of a hardcoded count of 3 - Pass retries=config.retries to the Python envd HTTP transport and include it in the transport cache key
- Fix a deadlock in the JS retry body buffering: cancelling one branch of a teed request body (request.clone()) never resolves while the other branch is unread, hanging any >1MiB non-stream upload (volume PUT, filesystem POST). Send the pristine original once instead of cancelling. - Collapse the duplicated response/error retry guards into a single shouldRetry predicate (JS) / _should_retry (Python) so the POST safety rule has one source of truth. - Export and lock the classification tables with tests and cross-SDK sync notes to catch JS<->Python drift. - Add edge tests: large non-replayable body sent once, abort-race.
…lay streamed DELETE - Bound the whole retried operation by the request's timeout (a monotonic deadline + per-attempt clamp), instead of letting each attempt use the full timeout so N retries could run ~N*timeout. Mirrors the JS single-signal bound. - _is_replayable now requires buffered content for all methods; a DELETE or OPTIONS carrying a one-shot streaming body is no longer treated as replayable. - Add sync+async tests for both.
There was a problem hiding this comment.
Stale comment
Security review (run 2/2) complete on head
f96449a49b11d4cf933a95bae4fba43bc18205b1.No new security vulnerabilities were identified in the current diff.
I specifically re-checked the retry safety surface (replayability checks and timeout/deadline bounding in Python, and abort-aware backoff behavior in JS) and did not find a remaining security regression to report.
Sent by Cursor Security Agent: Security Reviewer
Match the JS SDK, which retries envd RPC through withRetry. Python unary RPC now retries on rejected failures — HTTP 429 (honoring Retry-After) and connection errors (ConnectError/ConnectTimeout) — in addition to the existing RemoteProtocolError handling. Ambiguous statuses (502/503/504) are not retried since RPC is a non-idempotent POST. Streaming RPC is left unwrapped, matching the JS isStreamLike pass-through.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit df035e9. Configure here.
| const kind = RETRYABLE_ERROR_CODES.get(code) | ||
| if (kind) return kind | ||
| // undici low-level socket/transport errors are ambiguous mid-flight drops. | ||
| if (code.startsWith('und_err_') || code === 'fetch failed') { |
There was a problem hiding this comment.
Should probably check .message here since fetch failed isn't set in .code.
| response?.headers.get('retry-after') ?? null | ||
| ) | ||
| if (retryAfter !== undefined) { | ||
| return Math.min(retryAfter, policy.backoffCapMs * 4) |
There was a problem hiding this comment.
it doesn't make much sense?
if server says you should retry after x, but it's > policy.backoffCapMs * 4 when we retry it will fail again - maybe makes sense to mark the request as not retryable at the time if retryAfter > policy.backoffCapMs * 4 - wdyt?
| * its {@link FailureKind}) or `undefined` when it is not retryable. User/timeout | ||
| * aborts are explicitly not retryable. | ||
| */ | ||
| export function retryableErrorKind(err: unknown): FailureKind | undefined { |
| import { wait } from './utils' | ||
|
|
||
| /** Default number of *retries* (i.e. attempts after the first). */ | ||
| export const DEFAULT_MAX_RETRIES = 3 |
There was a problem hiding this comment.
it's exported but not imported anywhere else?
| const codes: string[] = [] | ||
| let current: unknown = err | ||
| // Walk the `cause` chain (undici wraps the real error in `cause`). | ||
| for (let i = 0; i < 5 && current; i++) { |
There was a problem hiding this comment.
i will have to check undici, but the for loop looks weird
| export const DEFAULT_MAX_RETRIES = 3 | ||
|
|
||
| /** Base for the exponential backoff, in milliseconds. */ | ||
| const DEFAULT_BACKOFF_BASE_MS = 100 |
There was a problem hiding this comment.
should be higher imo, 500ms?
| let attempt = 0 | ||
| for (;;) { | ||
| try { | ||
| const response = await innerFetch(buildAttempt()) |
There was a problem hiding this comment.
keep in mind, this might pollute the memory with Request objects
| } | ||
|
|
||
| // Drain and discard the body so the connection can be reused. | ||
| await response.body?.cancel().catch(() => {}) |
There was a problem hiding this comment.
isn't this handled by the V8 GC, not sure you need to manually drain it
| * `E2B_MAX_RETRIES` environment variable and finally | ||
| * {@link DEFAULT_MAX_RETRIES}. | ||
| */ | ||
| export function resolveMaxRetries(retries?: number): number { |
There was a problem hiding this comment.
this should perhaps live on connectionConfig directly, not here
| // Volume content uploads/downloads can be large; withRetry only retries | ||
| // small, replayable bodies. |
There was a problem hiding this comment.
i would check specific URL patterns, for example PUTs on the sandbox fs / volumes and mark these as non-retryable instead.
would also remove the need for the buffer logic altogether.
| * Build a fake `fetch` that returns/throws the queued outcomes in order and | ||
| * records the requests it received. | ||
| */ | ||
| function fakeFetch( |
There was a problem hiding this comment.
this feels kinda cursed, i think should instead use our actual fetch and mock the responses w MSW
There was a problem hiding this comment.
merge python changeset into it
| return kind === 'rejected' || idempotent | ||
| } | ||
|
|
||
| function isAbortError(err: unknown): boolean { |
There was a problem hiding this comment.
also, why have this function when you only call it from one place



Automatically retry requests on transient failures across the JS and Python SDKs. Retries connection errors and 429/502/503/504 responses using exponential backoff with jitter, and honor a server-provided Retry-After header so rate limiting (e.g. listing sandboxes) is handled transparently.
Retries are idempotency-aware: idempotent methods retry on any transient failure, while non-idempotent ones (e.g. Sandbox.create) only retry on "rejected" failures where the server provably did not process the request (throttling, connection-refused, DNS), avoiding duplicate side effects.
Configure via the new
retriesoption or E2B_MAX_RETRIES env var (default 3, set 0 to disable). The Python envd RPC retry now also uses backoff between attempts.