
fix(retry): add equal jitter to 429 retry backoff to prevent parallel lockstep #1349

Merged: Astro-Han merged 2 commits into dev from pawwork/retry-jitter-1348 on Jun 17, 2026

Conversation

@Astro-Han

Owner

Summary

  • Add equal jitter (50–100% of the computed delay) to the exponential backoff branch of delay() in packages/opencode/src/session/retry.ts, so parallel retrying callers spread their retries across different moments instead of retrying in lockstep.
  • Explicit Retry-After header values stay exact — only the fallback exponential branch is jittered.
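
A minimal sketch of what equal jitter on the fallback branch could look like. The PR names a `withJitter()` helper but does not show its body, so the signature, the injectable `random` parameter, the rounding, and the constants `BASE_DELAY_MS` and `BACKOFF_FACTOR` are all assumptions chosen to match the 2s → 4s → 8s schedule described in this PR.

```typescript
// Assumed constants: the PR only states the observed 2s -> 4s -> 8s schedule.
const BASE_DELAY_MS = 2000
const BACKOFF_FACTOR = 2

// Un-jittered exponential delay for a 1-based attempt number.
function exponentialDelay(attempt: number): number {
  return BASE_DELAY_MS * Math.pow(BACKOFF_FACTOR, attempt - 1)
}

// Equal jitter: keep half of the delay fixed and randomize the other half,
// so the result lies in [0.5 * ms, ms). The random source is injectable
// here only to make the sketch easy to test; that is not from the PR.
function withJitter(ms: number, random: () => number = Math.random): number {
  const half = ms / 2
  return Math.floor(half + random() * half)
}

// Illustrative stand-in for the exponential fallback branch of delay().
function jitteredDelay(attempt: number, random: () => number = Math.random): number {
  return withJitter(exponentialDelay(attempt), random)
}
```

With this shape, `jitteredDelay(2)` returns some value between 2000 and 4000 ms instead of a constant 4000 ms, while a caller that has an explicit Retry-After value would bypass `withJitter()` entirely.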

Why

When the main agent dispatches multiple parallel subagents against a low-quota provider (reproduced with xiaomi-token-plan-cn / mimo-v2.5-pro), all subagents receive HTTP 429 at the same instant, then retry on the identical 2s → 4s → 8s schedule (no jitter), re-triggering the rate limit together and exhausting retries as a group. Five subagents × 3 retries = 15+ requests inside a few seconds, all on the same beat.

The 429 itself is an external provider limit that PawWork cannot eliminate. But the lockstep retry is a PawWork-side amplifier that turns a rate-limit into a total subagent collapse. Equal jitter (AWS-recommended for parallel retries) spreads the retries so most can recover.
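
To make the lockstep effect concrete, here is a small simulation sketch. It is not code from this PR: the seeded generator (the well-known mulberry32 routine) is used only so the output is reproducible, and the equal-jitter formula is the assumed one of half fixed, half random.

```typescript
const SUBAGENTS = 5
const SECOND_RETRY_MS = 4000 // second step of the 2s -> 4s -> 8s schedule

// Small seeded PRNG (mulberry32) returning values in [0, 1).
// Used here purely for reproducibility; real code would use Math.random.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0
  return () => {
    state = (state + 0x6d2b79f5) >>> 0
    let t = state
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Without jitter, every subagent computes the same delay and retries together.
const lockstepDelays = Array.from({ length: SUBAGENTS }, () => SECOND_RETRY_MS)

// With equal jitter, each subagent lands somewhere in [2000, 4000).
const seededRandom = mulberry32(1348)
const jitteredDelays = Array.from({ length: SUBAGENTS }, () => {
  const half = SECOND_RETRY_MS / 2
  return half + seededRandom() * half
})

// Number of distinct retry moments in each scenario.
const distinctLockstep = new Set(lockstepDelays).size
const distinctJittered = new Set(jitteredDelays).size
```

In the un-jittered case there is a single retry moment shared by all five subagents, which is the "same beat" described above; in the jittered case the retries are spread over a two-second window.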

Closes #1348.

Root cause (evidence-locked)

  • Sidecar log data/pawwork/log/2026-06-17T124207.log: 16 ERROR service=llm mode=subagent entries across 5 sub-sessions in 22 seconds, all statusCode: 429, server: MiFE/3.4.34, responseBody: {"error":{"code":"429","message":"Too many requests","type":"limitation"}}.
  • packages/opencode/src/session/retry.ts:38-69: delay() exponential backoff had no jitter.
  • packages/opencode/src/session/retry.ts:218: safeRecoveryPolicy calls delay(attempt) without the error argument, so the Retry-After header-parsing branch is dead on the live path. (Not fixed in this PR: the reproducing provider does not send Retry-After, so it would not help this case. Tracked as follow-up.)
  • packages/opencode/src/session/subagent-run.ts:91: MAX_ACTIVE = 5 is hardcoded. (Not fixed in this PR: out of scope, tracked in [Feature] Add background subagent lifecycle v1.1 #341.)
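
The dead Retry-After branch can be illustrated with a simplified stand-in for delay(). This is a sketch under stated assumptions, not the code in retry.ts: the `RetryableError` shape, the header lookup, and the delta-seconds-only parsing are hypothetical, and jitter is left out of the fallback so the outputs stay exact.

```typescript
// Hypothetical error shape; the real type used by retry.ts is not shown in this PR.
interface RetryableError {
  responseHeaders?: Record<string, string>
}

const BASE_DELAY_MS = 2000 // assumed from the 2s -> 4s -> 8s schedule

// Two branches: an exact Retry-After branch and an exponential fallback.
// Only the delta-seconds form of Retry-After is handled in this sketch.
function delay(attempt: number, error?: RetryableError): number {
  const header = error?.responseHeaders?.["retry-after"]
  if (header !== undefined) {
    const seconds = Number(header)
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000
  }
  // Fallback branch: reached whenever no usable header is available,
  // which is always the case if the caller never passes the error.
  return BASE_DELAY_MS * Math.pow(2, attempt - 1)
}

const rateLimited: RetryableError = { responseHeaders: { "retry-after": "7" } }
const delayWithError = delay(1, rateLimited) // header branch: 7000
const delayWithoutError = delay(1) // fallback branch: 2000
```

A call site shaped like `delay(attempt)` can never reach the header branch, regardless of what the provider sends, which is why the follow-up has to change the caller rather than the parser.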

What this PR does NOT change

  • Retry-After header handling — unchanged (the reproducing provider doesn't send it; enabling it on the live path is a follow-up).
  • Subagent concurrency cap — unchanged (5, hardcoded; configurable cap tracked in [Feature] Add background subagent lifecycle v1.1 #341).
  • Retry count, classification, safety gate — unchanged.
  • Retry-After precise values — unchanged (server directives stay exact, no jitter applied).

How To Verify

TDD red→green:
  RED  — bun --cwd packages/opencode test test/session/retry.test.ts
         "exponential backoff without headers spreads parallel retries via jitter" FAILS
         (delay(2) returns constant 4000, Set.size === 1 < 3)
  GREEN — after adding withJitter(), same test PASSES

Full retry suite:  bun --cwd packages/opencode test test/session/retry.test.ts
  → 36 pass, 0 fail

Related suites (no regression):
  bun --cwd packages/opencode test test/session/retry-decision.test.ts \
    test/session/processor-effect.test.ts test/session/prompt-effect.test.ts
  → 110 pass, 0 fail

  bun --cwd packages/opencode test test/session/subagent-lifecycle-integration.test.ts \
    test/session/subagent-run.test.ts
  → 19 pass, 0 fail

Typecheck: (cd packages/opencode && bun run typecheck) → tsgo --noEmit passed

Risk Notes

Runtime behavior change: retry delays are no longer deterministic. Any code or test that asserted an exact delay value from the exponential branch will need to assert a range instead — that is the intended new contract. The five existing tests that pinned exact values were updated to range assertions in this PR. Retry-After header values remain exact and deterministic.
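
For callers or tests that previously pinned an exact value, a range check along these lines expresses the new contract. The helper name `isWithinEqualJitterRange` is hypothetical and not part of this PR; it assumes the documented 50–100% window.

```typescript
// Hypothetical helper: true when a jittered delay falls inside the
// equal-jitter window of 50% to 100% of the un-jittered delay.
function isWithinEqualJitterRange(actualMs: number, computedMs: number): boolean {
  return actualMs >= computedMs / 2 && actualMs <= computedMs
}

// Before: a test might have asserted that the second retry was exactly 4000 ms.
// After: assert that it falls between 2000 and 4000 ms instead.
const exampleJittered = 3120 // an arbitrary value a jittered call might return
const stillValid = isWithinEqualJitterRange(exampleJittered, 4000)
```

Delays taken from a Retry-After header are not covered by this check, since those stay exact.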

No OS-specific, packaging, or UI surfaces changed.

Checklist

  • Type label: bug
  • Routing labels: harness
  • Priority label: P2
  • Human Review Status: Pending
  • I linked the related issue ([Bug] Parallel subagents all fail on provider 429 (retry has no jitter, ignores Retry-After, concurrency hardcoded) #1348).
  • I described the review focus and risks.
  • I replaced the example block in How To Verify with the real verification steps.
  • I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope.
  • (conditional) No visible UI or copy changes — screenshots N/A.
  • (conditional) No platform/packaging surface touched.
  • (conditional) No docs, release notes, dependencies, permissions, credentials, or generated content touched.
  • I reviewed the final diff for unrelated changes and suspicious dependency changes.
  • Targeting dev, Conventional Commits title in English.

… lockstep

When multiple subagents hit a provider 429 at once (e.g. low-quota providers
like xiaomi-token-plan-cn / mimo-v2.5-pro), they all retried on the same
exponential schedule (2s → 4s → 8s, no jitter) and re-triggered the rate
limit together, exhausting retries as a group.

Add equal jitter (50-100% of the computed delay) to the exponential backoff
branch of delay() so parallel retrying callers spread across different
moments. Explicit Retry-After header values stay exact — only the fallback
exponential branch is jittered.

Root cause analysis: #1348
Verified: 36 pass / 0 fail in test/session/retry.test.ts, typecheck clean.
@Astro-Han added the labels bug (Something isn't working), P2 (Medium priority), and harness (Model harness, prompts, tool descriptions, and session mechanics) on Jun 17, 2026
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Contributor

Warning

Review limit reached

@Astro-Han, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 33 minutes and 20 seconds. Learn how PR review limits work.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 94d0ddee-e9ef-4f8c-a882-503fd8bc1f1d

📥 Commits

Reviewing files that changed from the base of the PR and between 15ad201 and 347c790.

📒 Files selected for processing (2)
  • packages/opencode/src/session/retry.ts
  • packages/opencode/test/session/retry.test.ts

@github-actions github-actions Bot left a comment


Suggested priority: P2 (includes non-doc, non-test paths outside the low-risk bucket).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces equal jitter (50-100% of the computed delay) to the exponential backoff retry mechanism in packages/opencode/src/session/retry.ts to prevent parallel retrying callers from executing in lockstep. The test suite in packages/opencode/test/session/retry.test.ts has been updated to validate the jittered delay ranges and ensure that concurrent retries spread out as expected. No review comments were provided, so there is no additional feedback to address.


Replace the probabilistic spread test (20 Math.random draws, assert
Set.size >= 3) with a deterministic one that stubs Math.random to 0,
0.5, and 0.999 and asserts the exact jittered delay at each boundary.

Addresses review feedback on #1349: the previous test relied on real
Math.random draws and could theoretically flake. The new test is fully
deterministic — verified by running 3× with identical results.

Verified: 36 pass / 0 fail, deterministic across 3 runs, typecheck clean.
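
The stubbing approach described in this commit can be sketched without a test framework as follows. The `withJitter()` body and the `withStubbedRandom()` helper are assumptions for illustration; the actual test in retry.test.ts is not shown in this conversation.

```typescript
// Assumed equal-jitter helper: half fixed, half scaled by Math.random().
function withJitter(ms: number): number {
  const half = ms / 2
  return half + Math.random() * half
}

// Hypothetical helper: run fn with Math.random pinned to a fixed value,
// restoring the original implementation afterwards even if fn throws.
function withStubbedRandom<T>(value: number, fn: () => T): T {
  const original = Math.random
  Math.random = () => value
  try {
    return fn()
  } finally {
    Math.random = original
  }
}

// The three boundary draws named in the commit message, applied to the
// un-jittered 4000 ms delay of the second retry.
const atLowerBound = withStubbedRandom(0, () => withJitter(4000))
const atMidpoint = withStubbedRandom(0.5, () => withJitter(4000))
const nearUpperBound = withStubbedRandom(0.999, () => withJitter(4000))
```

Under these assumptions the draws map to 2000 ms, 3000 ms, and a value just under 4000 ms, so every run produces the same three results and the test no longer depends on real random draws.
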
@Astro-Han Astro-Han merged commit ac1d1e1 into dev Jun 17, 2026
35 checks passed
@Astro-Han Astro-Han deleted the pawwork/retry-jitter-1348 branch June 17, 2026 14:35
