fix(retry): add equal jitter to 429 retry backoff to prevent parallel lockstep #1349
… lockstep
When multiple subagents hit a provider 429 at once (e.g. low-quota providers like xiaomi-token-plan-cn / mimo-v2.5-pro), they all retried on the same exponential schedule (2s → 4s → 8s, no jitter) and re-triggered the rate limit together, exhausting retries as a group.
Add equal jitter (50-100% of the computed delay) to the exponential backoff branch of delay() so parallel retrying callers spread across different moments. Explicit Retry-After header values stay exact; only the fallback exponential branch is jittered.
Root cause analysis: #1348
Verified: 36 pass / 0 fail in test/session/retry.test.ts, typecheck clean.
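The commit above describes equal jitter applied to the exponential branch only. A minimal sketch of that idea, assuming a 2 s initial delay and factor 2 (taken from the 2s → 4s → 8s schedule) plus a hypothetical 30 s cap. jitteredDelay, its retryAfterMs parameter, and the injectable random argument are illustrative names, not the actual signature of delay() in retry.ts:

```typescript
// Hypothetical constants: 2s initial delay and factor 2 match the 2s -> 4s -> 8s
// schedule; the 30s cap is an assumption. Real values in retry.ts may differ.
const INITIAL_DELAY_MS = 2_000;
const BACKOFF_FACTOR = 2;
const MAX_DELAY_MS = 30_000;

/** Deterministic exponential delay for a 1-based attempt, capped at MAX_DELAY_MS. */
function exponentialDelay(attempt: number): number {
  return Math.min(INITIAL_DELAY_MS * BACKOFF_FACTOR ** (attempt - 1), MAX_DELAY_MS);
}

/**
 * Equal jitter: keep half of the exponential delay fixed and randomize the other half,
 * so the result lands between base / 2 and base. A server-provided Retry-After value
 * (already converted to milliseconds) is returned unchanged, with no jitter.
 */
function jitteredDelay(
  attempt: number,
  retryAfterMs?: number,
  random: () => number = Math.random,
): number {
  if (retryAfterMs !== undefined) return retryAfterMs; // server directive stays exact
  const base = exponentialDelay(attempt);
  const half = base / 2;
  return Math.round(half + random() * half); // whole milliseconds
}

// Usage: repeated calls for the same attempt land at different points in the window.
for (let attempt = 1; attempt <= 3; attempt++) {
  console.log(`attempt ${attempt}: ${jitteredDelay(attempt)} ms`);
}
```

Keeping the fixed half guarantees a minimum wait, so jitter can shorten a retry by at most half of the base delay but never collapse it to zero. The injectable random source is what makes exact-value tests practical.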
Code Review
This pull request introduces equal jitter (50-100% of the computed delay) to the exponential backoff retry mechanism in packages/opencode/src/session/retry.ts to prevent parallel retrying callers from executing in lockstep. The test suite in packages/opencode/test/session/retry.test.ts has been updated to validate the jittered delay ranges and ensure that concurrent retries spread out as expected. No review comments were provided, so there is no additional feedback to address.
Replace the probabilistic spread test (20 Math.random draws, assert Set.size >= 3) with a deterministic one that stubs Math.random to 0, 0.5, and 0.999 and asserts the exact jittered delay at each boundary. Addresses review feedback on #1349: the previous test relied on real Math.random draws and could theoretically flake. The new test is fully deterministic — verified by running 3× with identical results. Verified: 36 pass / 0 fail, deterministic across 3 runs, typecheck clean.
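The deterministic test strategy can be sketched as follows. This assumes a plain TypeScript setup with node:assert; the real retry.test.ts may rely on its test runner's own mocking helpers instead, and jitteredDelay and withRandom are hypothetical stand-ins:

```typescript
import { strict as assert } from "node:assert";

// Hypothetical function under test: equal jitter over a 2s * 2^(attempt - 1) base.
function jitteredDelay(attempt: number): number {
  const base = 2_000 * 2 ** (attempt - 1);
  const half = base / 2;
  return Math.round(half + Math.random() * half);
}

/** Temporarily replace Math.random with a constant, always restoring the original. */
function withRandom<T>(value: number, fn: () => T): T {
  const original = Math.random;
  Math.random = () => value;
  try {
    return fn();
  } finally {
    Math.random = original;
  }
}

// Boundary draws: 0 gives the floor (base / 2), 0.5 the midpoint, 0.999 just under base.
for (const [r, expected] of [[0, 1000], [0.5, 1500], [0.999, 1999]] as const) {
  assert.equal(withRandom(r, () => jitteredDelay(1)), expected);
}
```

Restoring Math.random in a finally block keeps one failing assertion from leaking the stub into later tests.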
Summary
- Add equal jitter (50-100% of the computed delay) to the exponential backoff branch of delay() in packages/opencode/src/session/retry.ts, so parallel retrying callers spread their retries across different moments instead of retrying in lockstep.
- Explicit Retry-After header values stay exact; only the fallback exponential branch is jittered.
Why
When the main agent dispatches multiple parallel subagents against a low-quota provider (reproduced with xiaomi-token-plan-cn/mimo-v2.5-pro), all subagents receive HTTP 429 at the same instant, then retry on the identical 2s → 4s → 8s schedule (no jitter), re-triggering the rate limit together and exhausting retries as a group. Five subagents × 3 retries = 15+ requests inside a few seconds, all on the same beat.
The 429 itself is an external provider limit that PawWork cannot eliminate. But the lockstep retry is a PawWork-side amplifier that turns a rate limit into a total subagent collapse. Adding jitter, as AWS recommends for parallel retries, spreads the retries so most can recover; this PR uses the equal-jitter variant.
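To make the lockstep effect concrete, here is a small simulation of five callers that hit a 429 at the same instant and retry three times. It is an illustration under stated assumptions, not PawWork code: retryTimes and equalJitter are hypothetical, and evenly spaced factors replace Math.random() so the output is reproducible.

```typescript
// Cumulative retry timestamps (ms after the shared 429) for one caller.
// `nextDelay` maps a 1-based attempt to a delay; a hypothetical stand-in for delay().
function retryTimes(attempts: number, nextDelay: (attempt: number) => number): number[] {
  const times: number[] = [];
  let t = 0;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    t += nextDelay(attempt);
    times.push(t);
  }
  return times;
}

const base = (attempt: number) => 2_000 * 2 ** (attempt - 1); // 2s -> 4s -> 8s
const equalJitter = (attempt: number, r: number) =>
  Math.round(base(attempt) / 2 + r * (base(attempt) / 2));

const CALLERS = 5;
const ATTEMPTS = 3;

// Without jitter every caller computes exactly the same schedule.
const lockstep = Array.from({ length: CALLERS }, () => retryTimes(ATTEMPTS, base));

// With equal jitter each caller uses its own factor; fixed factors 0, 0.2, ..., 0.8
// stand in for Math.random() draws so the result is deterministic.
const jittered = Array.from({ length: CALLERS }, (_, i) =>
  retryTimes(ATTEMPTS, (attempt) => equalJitter(attempt, i / CALLERS)),
);

// Number of distinct moments at which the callers send a given retry.
const distinct = (schedules: number[][], attemptIndex: number) =>
  new Set(schedules.map((s) => s[attemptIndex])).size;

for (let a = 0; a < ATTEMPTS; a++) {
  console.log(
    `retry ${a + 1}: lockstep=${distinct(lockstep, a)} distinct, jittered=${distinct(jittered, a)} distinct`,
  );
}
// prints "retry N: lockstep=1 distinct, jittered=5 distinct" for N = 1, 2, 3
```

Real random draws will not be this evenly spaced, but they still scatter the retries across each window instead of stacking all five on one instant.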
Closes #1348.
Root cause (evidence-locked)
- data/pawwork/log/2026-06-17T124207.log: 16 ERROR service=llm mode=subagent entries across 5 sub-sessions in 22 seconds, all with statusCode: 429, server: MiFE/3.4.34, responseBody: {"error":{"code":"429","message":"Too many requests","type":"limitation"}}.
- packages/opencode/src/session/retry.ts:38-69: the delay() exponential backoff had no jitter.
- packages/opencode/src/session/retry.ts:218: safeRecoveryPolicy calls delay(attempt) without the error argument, so the Retry-After header-parsing branch is dead on the live path. (Not fixed in this PR: the reproducing provider does not send Retry-After, so it would not help this case. Tracked as a follow-up.)
- packages/opencode/src/session/subagent-run.ts:91: MAX_ACTIVE = 5 is hardcoded. (Not fixed in this PR: out of scope, tracked in [Feature] Add background subagent lifecycle v1.1 #341.)
What this PR does NOT change
- Retry-After header handling: unchanged (the reproducing provider doesn't send it; enabling it on the live path is a follow-up).
- Retry-After precise values: unchanged (server directives stay exact; no jitter is applied).
How To Verify
Risk Notes
Runtime behavior change: retry delays are no longer deterministic. Any code or test that asserted an exact delay value from the exponential branch will need to assert a range instead — that is the intended new contract. The five existing tests that pinned exact values were updated to range assertions in this PR.
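The range contract described above might look like this in a test. A sketch assuming the same hypothetical 2 s base and factor 2 with no cap; because every sample must fall inside a closed interval that the implementation guarantees, the check cannot flake the way a spread assertion can:

```typescript
import { strict as assert } from "node:assert";

// Hypothetical stand-in for the jittered exponential branch (2s initial, factor 2, no cap).
function jitteredDelay(attempt: number): number {
  const base = 2_000 * 2 ** (attempt - 1);
  const half = base / 2;
  return Math.round(half + Math.random() * half);
}

// Range contract: each delay lies in [base / 2, base] instead of equalling base exactly.
function assertJitterRange(attempt: number, samples = 1_000): void {
  const base = 2_000 * 2 ** (attempt - 1);
  for (let i = 0; i < samples; i++) {
    const d = jitteredDelay(attempt);
    assert.ok(d >= base / 2 && d <= base, `attempt ${attempt}: ${d} outside [${base / 2}, ${base}]`);
  }
}

for (const attempt of [1, 2, 3]) assertJitterRange(attempt);
```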
Retry-After header values remain exact and deterministic.
No OS-specific, packaging, or UI surfaces changed.
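The root-cause notes say the Retry-After branch is dead on the live path because safeRecoveryPolicy calls delay(attempt) without the error argument. A hedged sketch of what such a branch typically involves: RFC 9110 allows Retry-After to carry either delay-seconds or an HTTP-date. parseRetryAfterMs, delayFor, and the RetryableError shape are hypothetical and do not describe retry.ts:

```typescript
/**
 * Convert a Retry-After header value to milliseconds. The value is either a
 * non-negative integer number of seconds or an HTTP-date. Returns undefined when
 * absent or unparseable, so the caller can fall back to jittered backoff.
 */
function parseRetryAfterMs(
  value: string | null | undefined,
  nowMs: number = Date.now(),
): number | undefined {
  if (value == null) return undefined;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1_000; // delay-seconds
  const dateMs = Date.parse(trimmed); // HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}

// Hypothetical error shape: the header branch only runs when `error` is passed in.
interface RetryableError {
  headers?: Record<string, string>;
}

function delayFor(attempt: number, error?: RetryableError, random = Math.random): number {
  const fromHeader = parseRetryAfterMs(error?.headers?.["retry-after"]);
  if (fromHeader !== undefined) return fromHeader; // exact, no jitter
  const base = 2_000 * 2 ** (attempt - 1);
  return Math.round(base / 2 + random() * (base / 2)); // equal jitter fallback
}

console.log(delayFor(1, { headers: { "retry-after": "5" } })); // 5000: header honored
console.log(delayFor(1)); // error omitted, as on the live path: always jittered backoff
```

In this sketch, passing the error through is all it takes to activate the exact branch; the actual follow-up may involve more.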
Checklist
bug, harness, P2, dev; Conventional Commits title in English.