fix(monitor): cross-session rate-limit retry coordination (#3931) #4059

Merged: aegis-gh-agent[bot] merged 1 commit into develop from fix/3931-rate-limit-backoff on May 23, 2026

@OneStepAt4time
Owner

Fix: Parallel sessions competing for CC rate limit

Closes #3931

Problem

When 3+ parallel ag run sessions hit the CC rate limit, each retries independently with its own exponential backoff. The concurrent retries amplify rate-limit pressure instead of reducing it.

Solution

RateLimitCoordinator — a lightweight semaphore that serializes rate-limit retries across sessions:

  • Concurrency cap (default: 1): only one session retries at a time
  • Stagger delay (default: 2s): minimum gap between consecutive retry starts
  • Queue: sessions hitting rate limits acquire a slot; if full, they queue FIFO
  • Cleanup: sessions killed while queued are dequeued via removeSession()
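The four behaviors listed above can be sketched as a small class. This is a minimal illustration, not the merged src/rate-limit-coordinator.ts: the class name RateLimitCoordinatorSketch, the boolean result of acquire(), the activeCount and queuedCount getters, and the choice to reserve the slot before the stagger timer fires are all assumptions made for the example. Only the defaults (maxConcurrent=1, staggerMs=2000) and the method names acquire, release, and dequeue come from the PR text.

```typescript
// Hypothetical sketch: behaviour and names are assumptions, not the merged code.
type Waiter = { sessionId: string; settle: (granted: boolean) => void };

class RateLimitCoordinatorSketch {
  private readonly maxConcurrent: number;
  private readonly staggerMs: number;
  private readonly active = new Set<string>();
  private readonly queue: Waiter[] = [];

  // Defaults mirror the PR description: one retry at a time, 2s stagger.
  constructor(maxConcurrent: number = 1, staggerMs: number = 2000) {
    this.maxConcurrent = maxConcurrent;
    this.staggerMs = staggerMs;
  }

  get activeCount(): number {
    return this.active.size;
  }

  get queuedCount(): number {
    return this.queue.length;
  }

  /** Resolves true once the session holds a slot, false if it was dequeued first. */
  acquire(sessionId: string): Promise<boolean> {
    if (this.active.size < this.maxConcurrent) {
      this.active.add(sessionId);
      return Promise.resolve(true);
    }
    // At the cap: wait in FIFO order until release() or dequeue() settles us.
    return new Promise<boolean>((settle) => {
      this.queue.push({ sessionId, settle });
    });
  }

  /** Frees the slot and hands it to the oldest waiter after the stagger delay. */
  release(sessionId: string): void {
    if (!this.active.delete(sessionId)) return; // no-op when the session holds no slot
    const next = this.queue.shift();
    if (!next) return;
    this.active.add(next.sessionId); // reserve the slot so nobody can jump the queue
    setTimeout(() => next.settle(true), this.staggerMs);
  }

  /** Removes a killed session from the wait queue; returns whether it was queued. */
  dequeue(sessionId: string): boolean {
    const index = this.queue.findIndex((w) => w.sessionId === sessionId);
    if (index === -1) return false;
    const waiter = this.queue.splice(index, 1)[0];
    waiter?.settle(false); // unblock the pending acquire() without granting a slot
    return true;
  }
}
```

In this sketch a session killed during its stagger wait is no longer in the queue, so the caller would need to call release() for it instead of dequeue(); whether the merged coordinator handles that case the same way is not stated in the PR.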

Changes

New files:

  • src/rate-limit-coordinator.ts — coordinator (124 lines)
  • src/__tests__/rate-limit-coordinator.test.ts — 8 tests

Modified:

  • src/monitor.ts: handleRateLimitSignal() calls coordinator.acquire() before the delay and restart, then coordinator.release() on completion or failure. removeSession() calls coordinator.dequeue() for cleanup.
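The acquire, delay, restart, release sequence described for the monitor can be sketched with a try/finally so the slot is released on both the success and failure paths. This is an illustration under assumptions, not the code in src/monitor.ts: the RetryCoordinator interface, the retryAfterRateLimit function, the sleep helper, and the restart callback are hypothetical names introduced here.

```typescript
// Hypothetical sketch of the acquire -> delay -> restart -> release chain.
// Only the acquire/release method names come from the PR description.
interface RetryCoordinator {
  acquire(sessionId: string): Promise<unknown>;
  release(sessionId: string): void;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryAfterRateLimit(
  coordinator: RetryCoordinator,
  sessionId: string,
  backoffMs: number,
  restart: () => Promise<void>,
): Promise<void> {
  // Wait for a retry slot before starting this session's own backoff.
  await coordinator.acquire(sessionId);
  try {
    await sleep(backoffMs);
    await restart();
  } finally {
    // Runs whether restart() resolved or threw, so the slot is never leaked.
    coordinator.release(sessionId);
  }
}
```

Putting release() in a finally block is one way to guarantee that a failed restart does not block every queued session behind it.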

Tests (8 passing)

  • Acquire when under limit
  • Queue when at limit, proceed on release
  • FIFO ordering
  • Dequeue removes specific session
  • Stagger delay between releases
  • Active count tracking
  • Default config (maxConcurrent=1, staggerMs=2000)
  • Safe release when nothing active

Verification

tsc --noEmit → 0 errors
npm run build → OK

Parallel CC sessions hitting rate limits independently amplify the problem
by retrying concurrently. Add RateLimitCoordinator to serialize retries:

- Concurrency cap (default: 1) limits simultaneous rate-limit retries
- Stagger delay (default: 2s) spaces consecutive retry starts
- Queued sessions wait for a slot before retrying
- Sessions killed while queued are dequeued via removeSession hook

New files:
- src/rate-limit-coordinator.ts — semaphore-style coordinator
- src/__tests__/rate-limit-coordinator.test.ts — 8 tests

Modified:
- src/monitor.ts — handleRateLimitSignal uses coordinator.acquire/release
Contributor

@aegis-gh-agent (bot) left a comment

Review: PR #4059 — Rate-Limit Retry Coordination (#3931)

Verdict: ✅ Approved

Clean semaphore implementation. The coordinator correctly serializes rate-limit retries across sessions with concurrency cap and stagger delay. Integration into monitor is well-structured — acquire→delay→restart→release chain with proper cleanup on session kill.

Key Points

  • Semaphore pattern: maxConcurrent=1 default means only one session retries at a time
  • Stagger delay: 2s minimum between consecutive retry starts — prevents retry storms
  • Cleanup: dequeue on session kill, release on both success and error paths
  • Observability: structured logging at acquire/release/dequeue with session context
  • Tests: 8 unit tests covering concurrency, FIFO, dequeue, stagger, defaults, edge cases

Minor Nits (non-blocking)

  1. lastReleaseAt is set and then immediately read in release(), so elapsed is always ~0 and the stagger is effectively a fixed delay. Behavior is correct, but the variable name is misleading.
  2. No integration test for the monitor.ts acquire→restart→release chain. Acceptable given mocking complexity.
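The first nit can be illustrated with the arithmetic alone. The helper below is a hypothetical reconstruction of the pattern the reviewer describes, not the merged code: remainingStagger and its parameters are names invented for the example. It shows that when the timestamp is written and read in the same call, the computed wait always equals the full stagger, whereas a timestamp from an earlier call would shorten it.

```typescript
// Hypothetical helper: how long to wait so that at least staggerMs separates
// the recorded timestamp from the next retry start.
function remainingStagger(staggerMs: number, lastReleaseAt: number, now: number): number {
  const elapsed = now - lastReleaseAt;
  return Math.max(0, staggerMs - elapsed);
}

const now = 1000000; // arbitrary fixed clock value for the example

// Timestamp set and read in the same call: elapsed is 0, wait is the full stagger.
console.log(remainingStagger(2000, now, now)); // 2000

// Timestamp taken 1500 ms earlier: only the remainder is waited.
console.log(remainingStagger(2000, now - 1500, now)); // 500

// Timestamp older than the stagger: no extra wait.
console.log(remainingStagger(2000, now - 5000, now)); // 0
```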

9 Merge Gates — All Pass

  1. ✅ Review completed
  2. ✅ No conflicts — MERGEABLE
  3. ✅ CI green — all checks pass
  4. ✅ No regressions
  5. ✅ Unit tests (+8 new)
  6. ✅ E2E pass
  7. ✅ Documented
  8. ✅ Security clean
  9. ✅ Targets develop

@aegis-gh-agent (bot) merged commit 9d08eaf into develop on May 23, 2026
18 checks passed
@aegis-gh-agent (bot) deleted the fix/3931-rate-limit-backoff branch on May 23, 2026 05:50