Skip to content

fix: add timeout, retry, and fallback RPC to ADL fetchSlab (H4+H5)#87

Open
0x-SquidSol wants to merge 1 commit intodcccrypto:mainfrom
0x-SquidSol:fix/h4-h5-adl-rpc-resilience
Open

fix: add timeout, retry, and fallback RPC to ADL fetchSlab (H4+H5)#87
0x-SquidSol wants to merge 1 commit intodcccrypto:mainfrom
0x-SquidSol:fix/h4-h5-adl-rpc-resilience

Conversation

@0x-SquidSol
Copy link
Copy Markdown
Contributor

@0x-SquidSol 0x-SquidSol commented Apr 9, 2026

Summary

The ADL service called fetchSlab() with no timeout, no retry, and no fallback RPC. A hung primary RPC node blocked the entire ADL service — which runs during market stress when RPC load is highest and ADL is most needed.

H4: fetchSlab in getAdlState() (line 319) and scanMarket() (line 361) were bare calls.
H5: sendWithRetryKeeper always used the primary connection with no fallback path.

Fix

Add fetchSlabWithRetry() helper matching the established pattern from liquidation.ts:

  • 15s timeout via withTimeout() + Promise.race
  • 1 retry with automatic fallback to getFallbackConnection() on transient errors
  • Rate-limit compliance via acquireToken() before each RPC call
  • Exponential backoff between attempts
  • Retryable error detection: 429, timeout, socket, ECONNREFUSED, 502, 503

Test plan

  • All 15 ADL tests pass (updated mocks for new shared imports)
  • All 133 tests pass across full suite

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Improvements
    • Enhanced reliability of market data fetching with automatic retry logic and timeouts.
    • Added fallback connection handling to mitigate transient network failures and rate-limiting issues.
    • Improved resilience against connection errors, timeouts, and temporary service disruptions.

The ADL service called fetchSlab() with no timeout wrapper, no retry
logic, and no fallback RPC. A hung primary RPC blocked the entire ADL
service — which runs during market stress when RPC load is highest.

Add fetchSlabWithRetry() helper matching the liquidation service
pattern:
- 15s timeout per RPC call (withTimeout + Promise.race)
- 1 retry with fallback RPC on transient errors (429, timeout, socket)
- Rate-limit token acquisition via acquireToken()
- Exponential backoff between attempts

Also fixes H5: sendWithRetryKeeper now uses getConnection() directly
since the local connection variable was removed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

Adds RPC resilience utilities to the ADL service, implementing timeout and retry logic with fallback connection support. The changes wrap fetchSlab calls with a 15-second timeout and up to 2 retry attempts, switching to fallback connections on failure and making retry decisions based on error contents.

Changes

Cohort / File(s) Summary
RPC Resilience Implementation
src/services/adl.ts
Added withTimeout() and fetchSlabWithRetry() utilities for handling RPC calls with 15s timeouts and 2 retry attempts; updated AdlService.getAdlState() and AdlService.scanMarket() to use these utilities instead of direct fetchSlab() calls; modified transaction submission to use sendWithRetryKeeper() instead of captured connection variables.
Test Mock Extensions
tests/services/adl.test.ts
Extended @percolator/shared mock with getFallbackConnection, acquireToken, backoffMs, and getErrorMessage functions to support new resilience features.

Sequence Diagram(s)

sequenceDiagram
    participant Service as AdlService
    participant Util as fetchSlabWithRetry
    participant Primary as getConnection()
    participant Fallback as getFallbackConnection()
    participant RPC as Slab RPC

    Service->>Util: fetchSlabWithRetry(slabAddress)
    Util->>Primary: Get primary connection
    Util->>RPC: fetchSlab with 15s timeout (attempt 1)
    
    alt Success
        RPC-->>Util: Return slab data
        Util-->>Service: Return data
    else Timeout/Error
        RPC-->>Util: Error (rate limit/timeout/502/503)
        Util->>Fallback: Get fallback connection
        Util->>RPC: fetchSlab with 15s timeout (attempt 2)
        RPC-->>Util: Return slab data
        Util-->>Service: Return data
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #42: Directly updates AdlService methods (getAdlState, scanMarket) to use new retry/timeout utilities and sendWithRetryKeeper patterns introduced in this PR.
  • PR #79: Introduces the withTimeout helper and applies 15-second RPC timeout logic that forms the foundation for the retry wrapper implemented here.

Poem

🐰 With timeouts ticking, retries dance,
Fallback connections get their chance,
Resilient rabbits hop with care,
Retrying errors here and there!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: adding timeout, retry, and fallback RPC mechanisms to the ADL fetchSlab functionality, with references to the specific issue identifiers (H4+H5) that are addressed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/services/adl.test.ts (1)

36-55: Add a focused test for retry/fallback wiring.

This mock update is good, but there’s no assertion yet that the new resilience path actually retries (Line 36, Lines 53-55). Please add one test that forces a transient fetchSlab failure and verifies acquireToken, backoffMs, and getFallbackConnection are invoked as expected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/services/adl.test.ts` around lines 36 - 55, Add a focused unit test
that simulates a transient failure in fetchSlab and asserts the retry/fallback
wiring: mock fetchSlab to fail the first N attempts (throw) then succeed, run
the code path that triggers fetchSlab, and verify that acquireToken, backoffMs,
and getFallbackConnection were called the expected number of times (and that
sendWithRetryKeeper or final success occurred). Locate the test suite around the
existing mocks (referencing fetchSlab, acquireToken, backoffMs,
getFallbackConnection, and sendWithRetryKeeper) and add assertions using
vi.mock/spy to check call counts and call order for retries and fallback
invocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/services/adl.ts`:
- Around line 70-95: The fetchSlabWithRetry retry detection misses the exact
timeout text emitted by withTimeout (“timed out”) so timeouts don't trigger
fallback retries; update the isRetryable logic inside fetchSlabWithRetry to also
detect the exact withTimeout timeout phrase (e.g., check for "timed out" in
addition to "timeout") or normalize error messages from
withTimeout/getErrorMessage so timeout errors are classified as retryable;
locate the check in fetchSlabWithRetry (references: fetchSlabWithRetry,
withTimeout, getErrorMessage, fetchSlab, RPC_TIMEOUT_MS) and add the extra match
for "timed out" (or otherwise map withTimeout errors to a retryable timeout
token) so the fallback/getFallbackConnection logic runs on true timeouts.

---

Nitpick comments:
In `@tests/services/adl.test.ts`:
- Around line 36-55: Add a focused unit test that simulates a transient failure
in fetchSlab and asserts the retry/fallback wiring: mock fetchSlab to fail the
first N attempts (throw) then succeed, run the code path that triggers
fetchSlab, and verify that acquireToken, backoffMs, and getFallbackConnection
were called the expected number of times (and that sendWithRetryKeeper or final
success occurred). Locate the test suite around the existing mocks (referencing
fetchSlab, acquireToken, backoffMs, getFallbackConnection, and
sendWithRetryKeeper) and add assertions using vi.mock/spy to check call counts
and call order for retries and fallback invocation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cc5e3ee9-18f2-48b0-9e30-c5ab510625ca

📥 Commits

Reviewing files that changed from the base of the PR and between 7ab8258 and aedccfe.

📒 Files selected for processing (2)
  • src/services/adl.ts
  • tests/services/adl.test.ts

Comment thread src/services/adl.ts
Comment on lines +70 to +95
timer = setTimeout(() => reject(new Error(`${label}: timed out after ${ms}ms`)), ms);
});
return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

async function fetchSlabWithRetry(slabPubkey: PublicKey | string): Promise<Uint8Array> {
const pubkey = typeof slabPubkey === "string" ? new PublicKey(slabPubkey) : slabPubkey;
const maxRetries = 2;
let lastErr: unknown;
for (let attempt = 0; attempt < maxRetries; attempt++) {
const conn = attempt === 0 ? getConnection() : getFallbackConnection();
try {
await acquireToken();
return await withTimeout(
fetchSlab(conn, pubkey),
RPC_TIMEOUT_MS,
`fetchSlab(${pubkey.toBase58()})`,
);
} catch (err) {
lastErr = err;
const msg = getErrorMessage(err).toLowerCase();
const isRetryable = msg.includes("429") || msg.includes("too many requests")
|| msg.includes("rate limit") || msg.includes("timeout")
|| msg.includes("socket") || msg.includes("econnrefused")
|| msg.includes("502") || msg.includes("503");
if (!isRetryable || attempt >= maxRetries - 1) break;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Timeouts from withTimeout are not classified as retryable.

Line 70 emits "timed out", but Line 92 only checks "timeout". That can skip retry/fallback on the exact timeout path this helper creates.

💡 Proposed fix
-      const isRetryable = msg.includes("429") || msg.includes("too many requests")
-        || msg.includes("rate limit") || msg.includes("timeout")
+      const isRetryable = msg.includes("429") || msg.includes("too many requests")
+        || msg.includes("rate limit") || msg.includes("timeout")
+        || msg.includes("timed out") || msg.includes("time out")
         || msg.includes("socket") || msg.includes("econnrefused")
         || msg.includes("502") || msg.includes("503");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/services/adl.ts` around lines 70 - 95, The fetchSlabWithRetry retry
detection misses the exact timeout text emitted by withTimeout (“timed out”) so
timeouts don't trigger fallback retries; update the isRetryable logic inside
fetchSlabWithRetry to also detect the exact withTimeout timeout phrase (e.g.,
check for "timed out" in addition to "timeout") or normalize error messages from
withTimeout/getErrorMessage so timeout errors are classified as retryable;
locate the check in fetchSlabWithRetry (references: fetchSlabWithRetry,
withTimeout, getErrorMessage, fetchSlab, RPC_TIMEOUT_MS) and add the extra match
for "timed out" (or otherwise map withTimeout errors to a retryable timeout
token) so the fallback/getFallbackConnection logic runs on true timeouts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant