
fix: Adding lock to init relayer instances #622

Merged
collins-w merged 21 commits into main from improvements-to-relayers-initialization on Mar 13, 2026

Conversation

@NicoMolinaOZ NicoMolinaOZ (Contributor) commented Jan 20, 2026

Summary

  • Adding lock to init relayer instances

Testing Process

Checklist

  • Add a reference to related issues in the PR description.
  • Add unit tests if applicable.

Note

If you are using Relayer in your stack, consider adding your team or organization to our list of Relayer Users in the Wild!

Summary by CodeRabbit

  • New Features

    • Distributed locking for relayer initialization and config processing to coordinate work across instances
    • Global staleness checks and metadata tracking for relayer last-sync and global-init to avoid redundant work
    • In-memory fallback when persistent storage is unavailable; wait/polling behavior to observe completion from other instances
    • New generic polling utility to support timed wait/retry patterns
  • Tests

    • Expanded unit and integration tests covering locking, polling, metadata, and initialization outcomes


@coderabbitai coderabbitai (bot) commented Jan 20, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a64cb625-7b54-4530-87c4-06534c319c9e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

This PR introduces distributed locking coordination for relayer initialization across multiple instances. It adds per-relayer staleness checks, implements locking-based initialization in persistent storage mode, and falls back to direct initialization in in-memory mode. Redis sync metadata utilities track initialization timestamps to prevent redundant syncs, while a new repository API exposes storage connection details.

Changes

Repository Connection Exposure
src/repositories/relayer/mod.rs, src/repositories/relayer/relayer_redis.rs, src/repositories/relayer/relayer_in_memory.rs
Added public connection_info() method to RelayerRepository trait returning storage connection details; implemented for RedisRelayerRepository to expose underlying ConnectionManager and key prefix; in-memory variant returns None via default trait implementation. Includes tests validating in-memory behavior.
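The default-trait-method pattern the summary describes (Redis exposes details, in-memory falls back to None) can be sketched as follows; `ConnectionInfo` and its fields are illustrative stand-ins here, not the crate's actual types:

```rust
// Hypothetical sketch of a connection_info() trait method with a None default.
// ConnectionInfo and the field names are assumptions for illustration only.

#[derive(Debug, Clone, PartialEq)]
struct ConnectionInfo {
    key_prefix: String, // the real method also exposes a ConnectionManager
}

trait RelayerRepository {
    /// Storage-backed repositories override this; the in-memory variant
    /// inherits the default and reports no connection details.
    fn connection_info(&self) -> Option<ConnectionInfo> {
        None
    }
}

struct InMemoryRelayerRepository;
impl RelayerRepository for InMemoryRelayerRepository {}

struct RedisRelayerRepository {
    key_prefix: String,
}
impl RelayerRepository for RedisRelayerRepository {
    fn connection_info(&self) -> Option<ConnectionInfo> {
        Some(ConnectionInfo { key_prefix: self.key_prefix.clone() })
    }
}

fn main() {
    // In-memory: no connection details, so callers take the non-locking path.
    assert!(InMemoryRelayerRepository.connection_info().is_none());
    // Redis-backed: details available, enabling the distributed-lock path.
    let redis = RedisRelayerRepository { key_prefix: "relayer:".into() };
    assert_eq!(redis.connection_info().unwrap().key_prefix, "relayer:");
    println!("ok");
}
```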
Relayer Initialization with Locking
src/bootstrap/initialize_relayers.rs
Branched initialization logic between persistent (distributed locking) and in-memory (direct init) modes. Introduced RelayerInitResult enum tracking outcomes (Initialized, SkippedRecentSync, SkippedLockHeld, Failed). Added per-relayer lock acquisition, recent-sync staleness checks, and aggregated failure reporting. New helper functions: initialize_relayers_with_locking, initialize_single_relayer_with_lock, initialize_relayers_without_locking, count_results, initialize_relayer_with_service.
Redis Sync Metadata Utilities
src/utils/redis.rs
New functions for tracking relayer last-sync timestamps: set_relayer_last_sync(), get_relayer_last_sync(), is_relayer_recently_synced(). Includes Redis hash operations and comprehensive tests validating set/get behavior and staleness thresholds.
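The staleness logic behind these utilities can be illustrated with an in-memory stand-in for the Redis hash, so it runs without a server; the struct, method names, and threshold handling below are assumptions, not the actual implementation:

```rust
// In-memory stand-in for the Redis hash used by set_relayer_last_sync() /
// is_relayer_recently_synced(). Illustrative only: the real code talks to Redis.

use std::collections::HashMap;
use std::time::{SystemTime, UNIX_EPOCH};

struct SyncMetadata {
    last_sync: HashMap<String, u64>, // relayer_id -> unix seconds of last sync
}

impl SyncMetadata {
    fn set_relayer_last_sync(&mut self, relayer_id: &str, now_secs: u64) {
        self.last_sync.insert(relayer_id.to_string(), now_secs);
    }

    fn is_relayer_recently_synced(&self, relayer_id: &str, now_secs: u64, threshold_secs: u64) -> bool {
        match self.last_sync.get(relayer_id) {
            Some(&t) => now_secs.saturating_sub(t) < threshold_secs,
            None => false, // never synced counts as stale
        }
    }
}

fn main() {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    let mut meta = SyncMetadata { last_sync: HashMap::new() };

    // No record yet: stale, so initialization should proceed.
    assert!(!meta.is_relayer_recently_synced("relayer-1", now, 60));
    meta.set_relayer_last_sync("relayer-1", now);
    // Just synced: recent, so this instance can skip (SkippedRecentSync).
    assert!(meta.is_relayer_recently_synced("relayer-1", now, 60));
    // A sync 120s ago is stale under a 60s threshold.
    meta.set_relayer_last_sync("relayer-2", now - 120);
    assert!(!meta.is_relayer_recently_synced("relayer-2", now, 60));
    println!("ok");
}
```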

Sequence Diagram(s)

sequenceDiagram
    participant Init as RelayerInitialization
    participant Repo as RelayerRepository
    participant Redis as Redis (Distributed Lock & Metadata)
    participant Service as RelayerService

    rect rgba(100, 150, 200, 0.5)
    note over Init,Service: Persistent Storage Mode (with Locking)
    Init->>Repo: connection_info()
    Repo-->>Init: Some((client, prefix))
    
    loop For each relayer
        Init->>Redis: is_relayer_recently_synced(prefix, relayer_id)
        Redis-->>Init: bool (recent sync check)
        
        alt Not Recently Synced
            Init->>Redis: Acquire per-relayer lock (TTL-based)
            alt Lock Acquired
                Redis-->>Init: Lock acquired
                Init->>Service: initialize(relayer_id)
                Service-->>Init: Result
                Init->>Redis: set_relayer_last_sync(prefix, relayer_id)
                Init->>Redis: Release lock
            else Lock Held by Other Instance
                Redis-->>Init: Lock contention
                Init->>Init: Skip (SkippedLockHeld)
            end
        else Recently Synced
            Init->>Init: Skip (SkippedRecentSync)
        end
    end
    end

    rect rgba(150, 100, 200, 0.5)
    note over Init,Service: In-Memory Mode (no Locking)
    Init->>Repo: connection_info()
    Repo-->>Init: None
    
    loop For each relayer
        Init->>Service: initialize(relayer_id)
        Service-->>Init: Result
    end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #618: Adds distributed locking and repository connection_info() support with modifications to the same Redis utilities for cross-instance coordination.

Suggested reviewers

  • collins-w
  • dylankilkenny
  • tirumerla

Poem

🐰 Locks and latches, timestamps too,
Relayers sync—once, not twice through!
Redis holds the wisdom of the herd,
Cross-instance harmony, every word.

🚥 Pre-merge checks: 3 passed, 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
  • Linked Issues check ⚠️ Warning: The PR description explicitly states that the checklist item 'Add a reference to related issues in the PR description' is unchecked, and no linked issues or issue numbers are mentioned anywhere in the description. Resolution: add a reference to the related issues or feature requests that motivated this locking mechanism, or link them via GitHub.
  • Description check ❓ Inconclusive: The PR description uses the correct template structure but is largely incomplete. It includes the required section headings (Summary, Testing Process, Checklist), but the Summary merely repeats the title and the Testing Process section is empty. Resolution: expand the Summary with what the lock solves and why it is needed, fill in the Testing Process section, and reference related issues or pull requests if applicable.

✅ Passed checks (3 passed)
  • Title check ✅ Passed: The title 'fix: Adding lock to init relayer instances' is directly related to the main changes in the PR, which introduce distributed locking coordination for relayer initialization.
  • Out of Scope Changes check ✅ Passed: All changes are focused on adding distributed locking coordination for relayer initialization, including supporting infrastructure (connection info retrieval, Redis metadata tracking, and result handling).
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, above the required 80.00% threshold.



@NicoMolinaOZ NicoMolinaOZ marked this pull request as ready for review January 20, 2026 20:21
@NicoMolinaOZ NicoMolinaOZ requested a review from a team as a code owner January 20, 2026 20:21
@codecov codecov (bot) commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 32.50975% with 1038 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.27%. Comparing base (f720040) to head (46959b9).
⚠️ Report is 1 commit behind head on main.

Files with missing lines (patch % | lines missing):
  • src/utils/redis.rs: 0.00% | 545 ⚠️
  • src/bootstrap/initialize_relayers.rs: 56.59% | 247 ⚠️
  • src/bootstrap/config_processor.rs: 13.45% | 238 ⚠️
  • src/repositories/relayer/mod.rs: 78.78% | 7 ⚠️
  • src/utils/polling.rs: 99.13% | 1 ⚠️
Additional details and impacted files
Flag coverage Δ:
  • ai: 0.27% <0.91%> (+<0.01%) ⬆️
  • dev: 90.26% <32.50%> (-0.71%) ⬇️
  • properties: 0.01% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

@@            Coverage Diff             @@
##             main     #622      +/-   ##
==========================================
- Coverage   90.97%   90.27%   -0.71%     
==========================================
  Files         288      289       +1     
  Lines      118548   120108    +1560     
==========================================
+ Hits       107852   108428     +576     
- Misses      10696    11680     +984     
Files with missing lines (coverage Δ):
  • src/repositories/relayer/relayer_in_memory.rs: 82.47% <ø> (ø)
  • src/utils/polling.rs: 99.13% <99.13%> (ø)
  • src/repositories/relayer/mod.rs: 81.65% <78.78%> (-1.25%) ⬇️
  • src/bootstrap/config_processor.rs: 82.95% <13.45%> (-15.60%) ⬇️
  • src/bootstrap/initialize_relayers.rs: 66.21% <56.59%> (-4.91%) ⬇️
  • src/utils/redis.rs: 13.60% <0.00%> (-15.68%) ⬇️

... and 8 files with indirect coverage changes


Comment thread src/bootstrap/initialize_relayers.rs Outdated

@coderabbitai coderabbitai (bot) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/bootstrap/config_processor.rs`:
- Around line 529-537: The current wait_for_config_processing_complete call
ignores poll_until's timeout result and always returns Ok(()), allowing startup
to continue without config; update wait_for_config_processing_complete to
propagate poll_until errors (or convert the timeout into an Err) instead of
unconditionally returning Ok so the process fails fast when is_redis_populated
polling times out; locate the poll_until invocation in
wait_for_config_processing_complete and return the poll_until result (or map its
timeout into a meaningful error) so the caller cannot proceed with empty
repositories.

In `@src/utils/time.rs`:
- Around line 24-28: The docstring for poll_until is incorrect: it claims the
function can return Err from the check closure, but poll_until logs errors from
the check closure and continues polling, only returning Ok(true) if condition
met or Ok(false) on timeout; update the documentation of poll_until to remove
the `Err(_)` return variant, explicitly state that errors from the `check`
closure are logged and ignored (do not stop polling), and clearly document the
actual return values (Ok(true) when condition met, Ok(false) on timeout).
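A synchronous stand-in for the documented behavior might look like the following; the real `poll_until` is presumably async, and the signature, log text, and label argument here are invented for illustration:

```rust
// Sketch of the poll_until semantics described above: errors from the check
// closure are logged and ignored (polling continues), and the call resolves to
// true when the condition is met or false on timeout. Synchronous std-only
// stand-in so it runs as-is; the real utility is async.

use std::time::{Duration, Instant};

fn poll_until<F>(mut check: F, max_wait: Duration, poll_interval: Duration, what: &str) -> bool
where
    F: FnMut() -> Result<bool, String>,
{
    let deadline = Instant::now() + max_wait;
    loop {
        match check() {
            Ok(true) => return true, // condition met
            Ok(false) => {}          // not yet, keep polling
            Err(e) => eprintln!("poll_until({what}): check failed, retrying: {e}"),
        }
        if Instant::now() >= deadline {
            return false; // timeout is a normal outcome, not an Err
        }
        std::thread::sleep(poll_interval);
    }
}

fn main() {
    // Errors from the closure do not stop polling; the condition is
    // eventually met on the third call.
    let mut calls = 0;
    let done = poll_until(
        || {
            calls += 1;
            if calls < 3 { Err("transient".into()) } else { Ok(true) }
        },
        Duration::from_secs(1),
        Duration::from_millis(10),
        "initialization",
    );
    assert!(done);
    // A condition that never holds yields false at the deadline.
    assert!(!poll_until(|| Ok(false), Duration::from_millis(50), Duration::from_millis(10), "never"));
    println!("ok");
}
```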
🧹 Nitpick comments (3)
src/bootstrap/initialize_relayers.rs (3)

169-196: Consider lock expiration during long-running initialization.

If initialization takes longer than BOOTSTRAP_LOCK_TTL_SECS, the lock could expire while initialization is still in progress, allowing another instance to start initializing concurrently. This is a trade-off: a longer TTL risks lock starvation if the holder crashes, while a shorter TTL risks concurrent initialization.

The current approach with graceful degradation (proceeding on lock errors) provides resilience, but you may want to document this behavior or consider implementing lock renewal for very large deployments.
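The TTL trade-off described above can be demonstrated with an in-memory stand-in for a Redis `SET NX EX`-style lock; the key names, owners, and fake clock below are illustrative, not the project's actual lock code:

```rust
// In-memory stand-in for a TTL-based distributed lock, showing the trade-off:
// once the TTL elapses, a second instance can acquire the lock even though the
// first holder never released it. All names here are illustrative.

use std::collections::HashMap;

struct LockTable {
    // lock key -> (owner, expiry on a fake clock, in seconds)
    locks: HashMap<String, (String, u64)>,
}

impl LockTable {
    /// Mimics SET key owner NX EX ttl: succeeds only if the key is absent
    /// or its previous TTL has expired.
    fn try_acquire(&mut self, key: &str, owner: &str, now: u64, ttl_secs: u64) -> bool {
        match self.locks.get(key) {
            Some((_, expiry)) if *expiry > now => false, // held and still live
            _ => {
                self.locks.insert(key.to_string(), (owner.to_string(), now + ttl_secs));
                true
            }
        }
    }
}

fn main() {
    let mut table = LockTable { locks: HashMap::new() };

    // Instance A acquires the per-relayer bootstrap lock with a 30s TTL.
    assert!(table.try_acquire("bootstrap:relayer-1", "instance-a", 0, 30));
    // While the TTL is live, instance B is refused (SkippedLockHeld).
    assert!(!table.try_acquire("bootstrap:relayer-1", "instance-b", 10, 30));
    // If initialization outlives the TTL, the lock silently changes hands,
    // allowing concurrent initialization: the risk discussed above.
    assert!(table.try_acquire("bootstrap:relayer-1", "instance-b", 31, 30));
    println!("ok");
}
```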


257-267: Consider adding concurrency limits for large deployments.

Using join_all runs all relayer initializations concurrently without bounds. For deployments with many relayers, this could overwhelm Redis connections or external services.

Consider using futures::stream::iter(...).buffer_unordered(n) to limit concurrent initializations:

♻️ Optional refactor with bounded concurrency
+use futures::stream::{self, StreamExt};
+
+const MAX_CONCURRENT_INITS: usize = 10;
+
 async fn run_initialization_batch<...>(...) -> Result<()> {
-    let futures = relayers.iter().map(|relayer| {
+    let results: Vec<_> = stream::iter(relayers.iter().map(|relayer| {
         let app_state = app_state.clone();
         let relayer_id = relayer.id.clone();
 
         async move {
             let result = initialize_relayer(relayer_id.clone(), app_state).await;
             (relayer_id, result)
         }
-    });
-
-    let results = futures::future::join_all(futures).await;
+    }))
+    .buffer_unordered(MAX_CONCURRENT_INITS)
+    .collect()
+    .await;

672-683: Consider using environment variable for test Redis URL.

The Redis URL is hardcoded to 127.0.0.1:6379. For CI/CD environments or developers using different Redis configurations, consider using an environment variable with a fallback:

♻️ Suggested improvement
 async fn create_test_redis_pool() -> Option<Arc<Pool>> {
-    let cfg = deadpool_redis::Config::from_url("redis://127.0.0.1:6379");
+    let redis_url = std::env::var("TEST_REDIS_URL")
+        .unwrap_or_else(|_| "redis://127.0.0.1:6379".to_string());
+    let cfg = deadpool_redis::Config::from_url(&redis_url);

Comment thread src/bootstrap/config_processor.rs Outdated
Comment thread src/utils/time.rs Outdated

@coderabbitai coderabbitai (bot) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/bootstrap/config_processor.rs`:
- Around line 507-545: The current wait_for_config_processing_complete uses
is_redis_populated to decide completion but that ignores API keys so minimal
configs can time out; update the completion check in
wait_for_config_processing_complete (or extend is_redis_populated) to also query
the ApiKeyRepositoryTrait (AKR) from the provided app_state to detect whether
any API keys exist (e.g., call a count/list method on the API key repo) and
treat presence of at least one API key as “populated”/complete; keep the
existing polling and timeout behaviour but return completed=true when API keys
are present to avoid false timeouts for empty relayer/signer/network/plugin
sets.
🧹 Nitpick comments (3)
src/bootstrap/initialize_relayers.rs (3)

222-239: Consider adding debug logging when wait completes successfully.

The function discards the boolean result from poll_until, which means successful completion and timeout both proceed silently at this level. While poll_until logs warnings on timeout, adding a debug log on success would improve observability.

♻️ Suggested improvement
-    poll_until(
+    let completed = poll_until(
         || is_global_init_recently_completed(&conn, &prefix, INIT_STALENESS_THRESHOLD_SECS),
         max_wait,
         poll_interval,
         "initialization",
     )
     .await?;
 
+    if completed {
+        debug!("Another instance completed initialization, proceeding");
+    }
+
     Ok(())

257-267: Consider bounded concurrency for large relayer counts.

Using join_all spawns all initialization tasks concurrently without limit. For deployments with many relayers, this could overwhelm connection pools or external services. Consider using buffer_unordered with a reasonable limit if large-scale deployments are expected.

♻️ Suggested approach (if needed in future)
use futures::stream::{self, StreamExt};

const MAX_CONCURRENT_INIT: usize = 10;

let results: Vec<_> = stream::iter(futures)
    .buffer_unordered(MAX_CONCURRENT_INIT)
    .collect()
    .await;

672-683: Consider using an environment variable for Redis URL.

The hardcoded redis://127.0.0.1:6379 works for local testing but could be made configurable via environment variable for flexibility in different test environments.

♻️ Suggested improvement
 async fn create_test_redis_pool() -> Option<Arc<Pool>> {
-    let cfg = deadpool_redis::Config::from_url("redis://127.0.0.1:6379");
+    let url = std::env::var("TEST_REDIS_URL").unwrap_or_else(|_| "redis://127.0.0.1:6379".to_string());
+    let cfg = deadpool_redis::Config::from_url(&url);

Comment thread src/bootstrap/config_processor.rs
@tirumerla tirumerla (Contributor) left a comment

LGTM, thanks. Added one comment.

Comment thread src/bootstrap/initialize_relayers.rs Outdated
@zeljkoX zeljkoX (Collaborator) left a comment

Just tested it locally with 2 instances. Looks good.

Sai made a good comment. Let's see how to address it.

Comment thread src/utils/time.rs Outdated
Comment thread src/bootstrap/initialize_relayers.rs Outdated
@zeljkoX zeljkoX (Collaborator) commented Feb 23, 2026

Hey @NicoMolinaOZ

I have merged the latest main and added changes to reuse the new env var DISTRIBUTED_MODE so the sync logic is only used when the flag is set.

Same approach is used at other places where locks are used.
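A minimal sketch of such a flag check, assuming the variable is read straight from the environment (the helper name and the set of accepted truthy values are guesses, not the project's actual parsing):

```rust
// Illustrative gate for the DISTRIBUTED_MODE env var mentioned above: the
// locking/sync path runs only when the flag is set. Helper name and accepted
// values are assumptions.

fn distributed_mode_enabled() -> bool {
    std::env::var("DISTRIBUTED_MODE")
        .map(|v| matches!(v.trim().to_ascii_lowercase().as_str(), "1" | "true" | "yes"))
        .unwrap_or(false) // unset or unreadable -> single-instance behavior
}

fn main() {
    std::env::remove_var("DISTRIBUTED_MODE");
    assert!(!distributed_mode_enabled()); // default: no locking path

    std::env::set_var("DISTRIBUTED_MODE", "true");
    assert!(distributed_mode_enabled()); // flag set: take the locking path

    std::env::set_var("DISTRIBUTED_MODE", "0");
    assert!(!distributed_mode_enabled());
    println!("ok");
}
```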

Comment thread src/bootstrap/config_processor.rs Dismissed
@collins-w collins-w merged commit 51df3c5 into main Mar 13, 2026
24 of 26 checks passed
@collins-w collins-w deleted the improvements-to-relayers-initialization branch March 13, 2026 12:03
@github-actions github-actions Bot locked and limited conversation to collaborators Mar 13, 2026


6 participants