fix: add timeout, retry, and fallback RPC to ADL fetchSlab (H4+H5)#87
fix: add timeout, retry, and fallback RPC to ADL fetchSlab (H4+H5)#870x-SquidSol wants to merge 1 commit intodcccrypto:mainfrom
Conversation
The ADL service called fetchSlab() with no timeout wrapper, no retry logic, and no fallback RPC. A hung primary RPC blocked the entire ADL service — which runs during market stress when RPC load is highest. Add fetchSlabWithRetry() helper matching the liquidation service pattern: - 15s timeout per RPC call (withTimeout + Promise.race) - 1 retry with fallback RPC on transient errors (429, timeout, socket) - Rate-limit token acquisition via acquireToken() - Exponential backoff between attempts Also fixes H5: sendWithRetryKeeper now uses getConnection() directly since the local connection variable was removed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughAdds RPC resilience utilities to the ADL service, implementing timeout and retry logic with fallback connection support. The changes wrap Changes
Sequence Diagram(s)sequenceDiagram
participant Service as AdlService
participant Util as fetchSlabWithRetry
participant Primary as getConnection()
participant Fallback as getFallbackConnection()
participant RPC as Slab RPC
Service->>Util: fetchSlabWithRetry(slabAddress)
Util->>Primary: Get primary connection
Util->>RPC: fetchSlab with 15s timeout (attempt 1)
alt Success
RPC-->>Util: Return slab data
Util-->>Service: Return data
else Timeout/Error
RPC-->>Util: Error (rate limit/timeout/502/503)
Util->>Fallback: Get fallback connection
Util->>RPC: fetchSlab with 15s timeout (attempt 2)
RPC-->>Util: Return slab data
Util-->>Service: Return data
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/services/adl.test.ts (1)
36-55: Add a focused test for retry/fallback wiring.This mock update is good, but there’s no assertion yet that the new resilience path actually retries (Line 36, Lines 53-55). Please add one test that forces a transient
fetchSlabfailure and verifiesacquireToken,backoffMs, andgetFallbackConnectionare invoked as expected.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/services/adl.test.ts` around lines 36 - 55, Add a focused unit test that simulates a transient failure in fetchSlab and asserts the retry/fallback wiring: mock fetchSlab to fail the first N attempts (throw) then succeed, run the code path that triggers fetchSlab, and verify that acquireToken, backoffMs, and getFallbackConnection were called the expected number of times (and that sendWithRetryKeeper or final success occurred). Locate the test suite around the existing mocks (referencing fetchSlab, acquireToken, backoffMs, getFallbackConnection, and sendWithRetryKeeper) and add assertions using vi.mock/spy to check call counts and call order for retries and fallback invocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/services/adl.ts`:
- Around line 70-95: The fetchSlabWithRetry retry detection misses the exact
timeout text emitted by withTimeout (“timed out”) so timeouts don't trigger
fallback retries; update the isRetryable logic inside fetchSlabWithRetry to also
detect the exact withTimeout timeout phrase (e.g., check for "timed out" in
addition to "timeout") or normalize error messages from
withTimeout/getErrorMessage so timeout errors are classified as retryable;
locate the check in fetchSlabWithRetry (references: fetchSlabWithRetry,
withTimeout, getErrorMessage, fetchSlab, RPC_TIMEOUT_MS) and add the extra match
for "timed out" (or otherwise map withTimeout errors to a retryable timeout
token) so the fallback/getFallbackConnection logic runs on true timeouts.
---
Nitpick comments:
In `@tests/services/adl.test.ts`:
- Around line 36-55: Add a focused unit test that simulates a transient failure
in fetchSlab and asserts the retry/fallback wiring: mock fetchSlab to fail the
first N attempts (throw) then succeed, run the code path that triggers
fetchSlab, and verify that acquireToken, backoffMs, and getFallbackConnection
were called the expected number of times (and that sendWithRetryKeeper or final
success occurred). Locate the test suite around the existing mocks (referencing
fetchSlab, acquireToken, backoffMs, getFallbackConnection, and
sendWithRetryKeeper) and add assertions using vi.mock/spy to check call counts
and call order for retries and fallback invocation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cc5e3ee9-18f2-48b0-9e30-c5ab510625ca
📒 Files selected for processing (2)
src/services/adl.tstests/services/adl.test.ts
| timer = setTimeout(() => reject(new Error(`${label}: timed out after ${ms}ms`)), ms); | ||
| }); | ||
| return Promise.race([promise, timeout]).finally(() => clearTimeout(timer)); | ||
| } | ||
|
|
||
| async function fetchSlabWithRetry(slabPubkey: PublicKey | string): Promise<Uint8Array> { | ||
| const pubkey = typeof slabPubkey === "string" ? new PublicKey(slabPubkey) : slabPubkey; | ||
| const maxRetries = 2; | ||
| let lastErr: unknown; | ||
| for (let attempt = 0; attempt < maxRetries; attempt++) { | ||
| const conn = attempt === 0 ? getConnection() : getFallbackConnection(); | ||
| try { | ||
| await acquireToken(); | ||
| return await withTimeout( | ||
| fetchSlab(conn, pubkey), | ||
| RPC_TIMEOUT_MS, | ||
| `fetchSlab(${pubkey.toBase58()})`, | ||
| ); | ||
| } catch (err) { | ||
| lastErr = err; | ||
| const msg = getErrorMessage(err).toLowerCase(); | ||
| const isRetryable = msg.includes("429") || msg.includes("too many requests") | ||
| || msg.includes("rate limit") || msg.includes("timeout") | ||
| || msg.includes("socket") || msg.includes("econnrefused") | ||
| || msg.includes("502") || msg.includes("503"); | ||
| if (!isRetryable || attempt >= maxRetries - 1) break; |
There was a problem hiding this comment.
Timeouts from withTimeout are not classified as retryable.
Line 70 emits "timed out", but Line 92 only checks "timeout". That can skip retry/fallback on the exact timeout path this helper creates.
💡 Proposed fix
- const isRetryable = msg.includes("429") || msg.includes("too many requests")
- || msg.includes("rate limit") || msg.includes("timeout")
+ const isRetryable = msg.includes("429") || msg.includes("too many requests")
+ || msg.includes("rate limit") || msg.includes("timeout")
+ || msg.includes("timed out") || msg.includes("time out")
|| msg.includes("socket") || msg.includes("econnrefused")
|| msg.includes("502") || msg.includes("503");🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/services/adl.ts` around lines 70 - 95, The fetchSlabWithRetry retry
detection misses the exact timeout text emitted by withTimeout (“timed out”) so
timeouts don't trigger fallback retries; update the isRetryable logic inside
fetchSlabWithRetry to also detect the exact withTimeout timeout phrase (e.g.,
check for "timed out" in addition to "timeout") or normalize error messages from
withTimeout/getErrorMessage so timeout errors are classified as retryable;
locate the check in fetchSlabWithRetry (references: fetchSlabWithRetry,
withTimeout, getErrorMessage, fetchSlab, RPC_TIMEOUT_MS) and add the extra match
for "timed out" (or otherwise map withTimeout errors to a retryable timeout
token) so the fallback/getFallbackConnection logic runs on true timeouts.
Summary
The ADL service called
fetchSlab()with no timeout, no retry, and no fallback RPC. A hung primary RPC node blocked the entire ADL service — which runs during market stress when RPC load is highest and ADL is most needed.H4:
fetchSlabingetAdlState()(line 319) andscanMarket()(line 361) were bare calls.H5:
sendWithRetryKeeperalways used the primary connection with no fallback path.Fix
Add
fetchSlabWithRetry()helper matching the established pattern fromliquidation.ts:withTimeout()+Promise.racegetFallbackConnection()on transient errorsacquireToken()before each RPC callTest plan
🤖 Generated with Claude Code
Summary by CodeRabbit