feat: implement pubsub connection pooling #931

Merged
thlorenz merged 20 commits into master from thlorenz/websocket-pool-conections on Feb 17, 2026

Conversation

@thlorenz (Collaborator) commented Feb 7, 2026

Summary

Implement WebSocket connection pooling for PubSub subscriptions to improve resource efficiency
and connection management. Introduces a generic connection pool abstraction that manages
multiple concurrent subscriptions across pooled connections.

For the Helius pubsub provider, this caps subscriptions at 100 per stream.

Details

Connection Pooling Architecture

A new PubSubConnectionPool manages multiple WebSocket connections, distributing subscriptions
across them based on per-stream limits. This prevents connection saturation and enables
horizontal scaling of subscriptions.
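
As a rough illustration of the intended usage (only the Arc<PubSubConnectionPool<PubsubConnectionImpl>> type, the per_stream_subscription_limit option, and the pool-level account_subscribe entry point are confirmed by this PR; the constructor shape below is an assumption):

    // Hypothetical wiring sketch; real constructor arguments and
    // return types may differ from what is shown here.
    let pool: Arc<PubSubConnectionPool<PubsubConnectionImpl>> = Arc::new(
        PubSubConnectionPool::new(pubsub_url, per_stream_subscription_limit),
    );
    // Each subscribe call lands on a connection with spare capacity;
    // a new connection is opened only when all existing ones are full.
    let sub = pool.account_subscribe(&pubkey).await?;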

Abstraction Layer

Extracted pubsub connection logic into a trait-based design (PubsubConnection) allowing the ChainPubsubActor to work with generic connection implementations. This improves testability and allows for different connection strategies.
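
A minimal sketch of how such a trait might be shaped (the method set, signatures, and associated types below are illustrative assumptions, not the trait as merged):

    use solana_sdk::pubkey::Pubkey;

    // Illustrative sketch only (native async-fn-in-trait, Rust 1.75+).
    // `Error` and `AccountUpdateStream` are stand-ins for the real
    // error/stream types used by the PR.
    pub trait PubsubConnection: Send + Sync + Sized + 'static {
        type Error;
        type AccountUpdateStream;

        /// Open a fresh websocket connection to `url`.
        async fn new(url: String) -> Result<Self, Self::Error>;

        /// Subscribe to updates for one account, returning the stream.
        async fn account_subscribe(
            &self,
            pubkey: &Pubkey,
        ) -> Result<Self::AccountUpdateStream, Self::Error>;
    }

A mock implementation of such a trait is what lets the actor be unit-tested without a live websocket endpoint.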

Integration

  • ChainPubsubActor now uses the pooled connection interface instead of direct client access
  • Pool handles connection lifecycle, stream distribution, and reconnection logic
  • Simplified ChainPubsubClient by removing pooling responsibilities

Testing

Added comprehensive test suite covering:

  • Account subscription through pool
  • Program subscription through pool
  • Pool connection distribution
  • Mock connection implementations for unit testing

Summary by CodeRabbit

  • New Features

    • Connection pooling to distribute subscriptions across multiple websocket connections.
    • New config option for per-stream subscription limits with default detection for certain RPC providers.
  • Improvements

    • Reworked reconnection flow to operate at the pool level with serialized retries and verification checks.
    • Simplified pub/sub wiring to use the actor-based workflow and pool API.
  • Tests

    • Added tests validating pooling behavior, slot limits, and unsubscribe lifecycle.
  • Removals

    • Removed legacy single-connection wrapper and an automatic error-conversion impl.

@github-actions Bot commented Feb 7, 2026

Manual Deploy Available

You can trigger a manual deploy of this PR branch to testnet:

Deploy to Testnet 🚀

Alternative: Comment /deploy on this PR to trigger deployment directly.

⚠️ Note: Manual deploy requires authorization. Only authorized users can trigger deployments.

Comment updated automatically when the PR is synchronized.

@coderabbitai Bot commented Feb 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Replaces the single-connection PubSub abstraction with a new PubsubConnection trait and concrete PubsubConnectionImpl. Introduces a generic PubSubConnectionPool<T: PubsubConnection> that shards subscriptions across multiple websocket connections with per-connection limits and slot management. chain_pubsub_actor now stores Arc<PubSubConnectionPool<PubsubConnectionImpl>> and related function signatures were updated. The prior public PubSubConnection type and its client-layer lifecycle were removed from chain_pubsub_client. pubsub_common gained per_stream_subscription_limit and HELIUS_PER_STREAM_SUBSCRIPTION_LIMIT. A From<PubsubClientError> impl for RemoteAccountProviderError was removed.

Suggested reviewers

  • GabrielePicco
  • bmuddha

@coderabbitai Bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@magicblock-chainlink/src/remote_account_provider/pubsub_common.rs`:
- Around line 26-36: The current brittle provider detection uses
pubsub_url.to_lowercase().contains("helius"); change it to parse the URL host
and base the detection on the hostname only (e.g., use
url::Url::parse(pubsub_url.as_str()) and check url.host_str().map(|h|
h.contains("helius")) ), then set per_stream_subscription_limit to
Some(HELIUS_PER_STREAM_SUBSCRIPTION_LIMIT) only when the host matches; keep
existing symbols pubsub_url, per_stream_subscription_limit, and
HELIUS_PER_STREAM_SUBSCRIPTION_LIMIT and ensure the parsing fallback preserves
the previous None behavior if parsing fails.
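
A sketch of that host-based detection (the url crate calls are real; the wrapper function and its name are illustrative):

    use url::Url;

    const HELIUS_PER_STREAM_SUBSCRIPTION_LIMIT: usize = 100;

    /// Detect a Helius endpoint by hostname rather than by a substring
    /// match over the whole URL, which could false-positive on a path
    /// or query parameter that happens to contain "helius".
    fn detect_per_stream_limit(pubsub_url: &str) -> Option<usize> {
        let host_is_helius = Url::parse(pubsub_url)
            .ok()
            .and_then(|url| url.host_str().map(|h| h.contains("helius")))
            // Parse failure preserves the previous `None` behavior.
            .unwrap_or(false);
        host_is_helius.then_some(HELIUS_PER_STREAM_SUBSCRIPTION_LIMIT)
    }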

In `@magicblock-chainlink/src/remote_account_provider/pubsub_connection_pool.rs`:
- Around line 134-160: There’s a TOCTOU where two concurrent
find_or_create_connection calls can both miss capacity and each create a
connection; to avoid over-provisioning, after creating new_connection but before
pushing it, re-acquire the same lock and re-check for an available slot using
pick_connection (or otherwise reserve a slot): if pick_connection returns Some,
increment that slot’s sub_count and discard/close the newly-created connection
and return the existing one; otherwise push the new PooledConnection into
connections as you do now. Reference: find_or_create_connection,
pick_connection, PooledConnection, sub_count, connections. A sketch of this double-check pattern follows this list.
- Around line 128-130: The clear_connections method currently drops entries in
self.connections without unsubscribing active streams; add a doc comment on pub
fn clear_connections(&self) stating it does not perform graceful unsubscribe and
must only be called after subscriptions have been cancelled (e.g., by the
actor's try_reconnect), and mention the precondition that callers are
responsible for closing/cancelling streams to avoid abruptly terminating active
subscriptions; reference clear_connections and try_reconnect in the comment so
maintainers know the required call order.
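
A self-contained sketch of the double-checked pattern suggested for find_or_create_connection above (a Mutex<Vec<_>> pool is assumed purely for illustration; the increment race on sub_count is addressed separately in a later comment):

    use std::future::Future;
    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Arc;
    use tokio::sync::Mutex;

    struct PooledConnection<T> {
        connection: Arc<T>,
        sub_count: Arc<AtomicUsize>,
    }

    /// Create the candidate connection outside the lock, then re-check
    /// for capacity under the lock before pushing, so two racing callers
    /// don't both add a connection for a single free slot.
    async fn find_or_create_connection<T>(
        connections: &Mutex<Vec<PooledConnection<T>>>,
        limit: usize,
        connect: impl Future<Output = T>,
    ) -> Arc<T> {
        let pick = |pool: &[PooledConnection<T>]| {
            pool.iter()
                .find(|c| c.sub_count.load(Ordering::SeqCst) < limit)
                .map(|c| {
                    c.sub_count.fetch_add(1, Ordering::SeqCst);
                    Arc::clone(&c.connection)
                })
        };
        if let Some(existing) = pick(&*connections.lock().await) {
            return existing;
        }
        let new_conn = Arc::new(connect.await); // built without holding the lock
        let mut guard = connections.lock().await;
        // Re-check: capacity may have appeared while we were connecting.
        if let Some(existing) = pick(&guard) {
            return existing; // the redundant new_conn is simply dropped
        }
        guard.push(PooledConnection {
            connection: Arc::clone(&new_conn),
            sub_count: Arc::new(AtomicUsize::new(1)),
        });
        new_conn
    }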

In `@magicblock-chainlink/src/remote_account_provider/pubsub_connection.rs`:
- Around line 123-131: The current reconnect() handling silently returns Ok(())
when self.reconnect_guard.try_lock() fails, which misleads callers; change the
behavior so callers can observe the true outcome by either (A) returning a
distinct status/result when the lock is contended (e.g., an enum
ReconnectResult::InProgress) from the reconnect() method instead of Ok(()), or
(B) block on the mutex (use lock().await) so the caller shares the real
reconnect result; update the reconnect_guard usage and the callers of
reconnect() to handle the new result (refer to reconnect(), reconnect_guard,
try_lock, and RECONNECT_ATTEMPT_DELAY) and ensure the sleep path still delays
but returns an appropriate status/error rather than Ok(()). A sketch of option (A) appears after this list.
- Around line 121-151: The reconnect() implementation on PubsubConnection (the
async fn reconnect in pubsub_connection.rs) is currently dead code because
PubSubConnectionPool never invokes reconnect() and instead uses
clear_connections() and T::new(); either mark the method as intentionally unused
with #[allow(dead_code)] or add a short doc comment explaining its
reserved/future use so linters and readers understand why it exists, or remove
the method to shrink the trait surface if you confirm no external consumers need
it; reference the reconnect method, PubSubConnectionPool, clear_connections, and
T::new() when making the change.
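
For option (A), the shape could be (a self-contained illustration assuming a tokio Mutex guard; names other than reconnect_guard and RECONNECT_ATTEMPT_DELAY are stand-ins):

    use std::time::Duration;
    use tokio::sync::Mutex;

    const RECONNECT_ATTEMPT_DELAY: Duration = Duration::from_millis(200); // stand-in value

    /// Lets callers observe whether a reconnect actually ran or was
    /// skipped because another reconnect was already in flight.
    pub enum ReconnectResult {
        Completed,
        InProgress,
    }

    pub struct Conn {
        reconnect_guard: Mutex<()>,
    }

    impl Conn {
        pub async fn reconnect(&self) -> Result<ReconnectResult, std::io::Error> {
            // A contended guard now surfaces as InProgress instead of Ok(()).
            let Ok(_guard) = self.reconnect_guard.try_lock() else {
                return Ok(ReconnectResult::InProgress);
            };
            tokio::time::sleep(RECONNECT_ATTEMPT_DELAY).await;
            // ... re-establish the websocket here ...
            Ok(ReconnectResult::Completed)
        }
    }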

@coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@magicblock-chainlink/src/remote_account_provider/pubsub_connection_pool.rs`:
- Around line 77-138: account_subscribe and program_subscribe duplicate the same
find-or-create, error mapping, rollback, and wrap_unsub logic; introduce a
private helper (e.g., subscribe_with_pool) that calls
self.find_or_create_connection().await, maps errors to
PubsubClientError::SubscribeFailed, accepts an async closure (passed the chosen
connection Arc) to invoke either connection.account_subscribe or
connection.program_subscribe, and on success wraps the returned raw_unsub via
self.wrap_unsub(raw_unsub, sub_count) and on failure decrements sub_count
(sub_count.fetch_sub(1, Ordering::SeqCst)) before returning the error; then
refactor account_subscribe and program_subscribe to call this helper with the
appropriate subscribe closure. A sketch of such a helper appears after this list.
- Around line 150-159: The current non-atomic pattern in the Phase 1 block
(calling pick_connection(&guard) to get pooled_conn, then reading/subsequently
calling sub_count.fetch_add(1, Ordering::SeqCst)) can let concurrent callers
exceed per_connection_sub_limit; change the increment to an atomic CAS loop
(e.g., use sub_count.fetch_update or compare_exchange in a loop) that only
increments when the observed value is strictly less than
per_connection_sub_limit and otherwise treats the slot as full, so
pick_connection + atomic increment becomes a single atomic check-and-increment;
ensure you still return Ok((sub_count_handle,
Arc::clone(&pooled_conn.connection))) only after the CAS succeeds and fall back
to the existing creation/lookup paths when the CAS fails due to the limit. A sketch of the CAS loop also follows below.
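
The shared subscribe path could be factored roughly like this (a self-contained sketch; SubscribeError stands in for PubsubClientError::SubscribeFailed, and the slot tuple for the result of find_or_create_connection):

    use std::future::Future;
    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Arc;

    struct SubscribeError; // stand-in for PubsubClientError::SubscribeFailed

    /// Both account_subscribe and program_subscribe would route through
    /// this, passing only the connection-specific subscribe future.
    async fn subscribe_with_pool<C, S, Fut>(
        slot: (Arc<AtomicUsize>, Arc<C>),
        subscribe: impl FnOnce(Arc<C>) -> Fut,
    ) -> Result<S, SubscribeError>
    where
        Fut: Future<Output = Result<S, SubscribeError>>,
    {
        let (sub_count, connection) = slot;
        match subscribe(connection).await {
            // Real code would wrap the raw unsubscribe via wrap_unsub here.
            Ok(raw_unsub) => Ok(raw_unsub),
            Err(err) => {
                // Roll back the optimistic slot reservation on failure.
                sub_count.fetch_sub(1, Ordering::SeqCst);
                Err(err)
            }
        }
    }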
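
And the atomic check-and-increment from the second comment maps directly onto AtomicUsize::fetch_update (the helper name is illustrative):

    use std::sync::atomic::{AtomicUsize, Ordering};

    /// Atomically claim a subscription slot: increment only while the
    /// observed count is strictly below the limit, otherwise report the
    /// connection as full. This closes the load-then-fetch_add race.
    fn try_claim_slot(sub_count: &AtomicUsize, limit: usize) -> bool {
        sub_count
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| {
                (n < limit).then_some(n + 1)
            })
            .is_ok()
    }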

@coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@magicblock-chainlink/src/remote_account_provider/pubsub_connection_pool.rs`:
- Around line 151-189: The pool can grow unbounded and the trace calls
connections.len() which is O(n); modify find_or_create_connection to enforce a
configurable max (e.g., add a max_connections field on PubSubConnectionPool and
check it under the new_connection_guard before creating a connection, returning
a clear RemoteAccountProvider error when exhausted), and avoid calling
scc::Queue::len() in trace by maintaining a cheap counter (e.g., an AtomicUsize
connection_count incremented when pushing a PooledConnection and decremented on
drop) and log that counter instead of self.connections.len(); update references
to new_connection_guard, connections, find_or_create_connection, and the trace
call accordingly.
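
A sketch of both suggestions together (field names and the string error are illustrative simplifications; the real code would return a RemoteAccountProvider error):

    use std::sync::atomic::{AtomicUsize, Ordering};

    /// Cheap O(1) counter for logging plus a hard cap on pool growth,
    /// avoiding the O(n) scc::Queue::len() in the trace call.
    struct PoolLimits {
        connection_count: AtomicUsize,
        max_connections: usize,
    }

    impl PoolLimits {
        /// Reserve room for one more connection, or fail when exhausted.
        fn try_grow(&self) -> Result<usize, &'static str> {
            self.connection_count
                .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| {
                    (n < self.max_connections).then_some(n + 1)
                })
                .map(|prev| prev + 1) // the new count, cheap to log
                .map_err(|_| "connection pool exhausted")
        }
    }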

@coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@magicblock-chainlink/src/remote_account_provider/pubsub_connection_pool.rs`:
- Around line 148-174: The reconnect() method currently calls conn.reconnect()
right after creating a connection via T::new(self.url.clone()), which is
redundant; remove the conn.reconnect().await? call and instead use the freshly
created conn directly to construct the PooledConnection (PooledConnection {
connection: Arc::new(conn), sub_count: Arc::new(AtomicUsize::new(0)) }) and push
that into self.connections; ensure error handling still returns
PubsubClientError::ConnectionClosed on T::new failure and keep the rest of
reconnect() logic unchanged.

@coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs (1)

150-168: ⚠️ Potential issue | 🟠 Major

.expect() on mutex locks in new production code violates coding guidelines.

Lines 156 and 161 use .expect() which should be treated as a major issue per repository guidelines. While this pattern exists elsewhere in the file, new code should move toward proper error handling.

Additionally, this method holds both subscriptions and program_subs mutex locks simultaneously (the MutexGuard temporaries live until .collect() completes). Although there's no deadlock today—abort_and_signal_connection_issue acquires them sequentially—this dual-lock pattern is fragile. Consider draining each map separately to match the pattern used in drain_subscriptions.

Proposed fix: drain maps sequentially and handle poisoned locks
     fn unsubscribe_all(
         subscriptions: Arc<Mutex<HashMap<Pubkey, AccountSubscription>>>,
         program_subs: Arc<Mutex<HashMap<Pubkey, AccountSubscription>>>,
     ) {
-        let subs = subscriptions
-            .lock()
-            .expect("subscriptions lock poisoned")
-            .drain()
-            .chain(
-                program_subs
-                    .lock()
-                    .expect("program subs lock poisoned")
-                    .drain(),
-            )
-            .collect::<Vec<_>>();
-        for (_, sub) in subs {
-            sub.cancellation_token.cancel();
+        fn drain_and_cancel(
+            map: Arc<Mutex<HashMap<Pubkey, AccountSubscription>>>,
+            label: &str,
+        ) {
+            let Ok(mut lock) = map.lock() else {
+                error!("{label} lock poisoned during unsubscribe_all");
+                return;
+            };
+            for (_, sub) in lock.drain() {
+                sub.cancellation_token.cancel();
+            }
         }
+        drain_and_cancel(subscriptions, "subscriptions");
+        drain_and_cancel(program_subs, "program_subs");
     }

As per coding guidelines, {magicblock-*,programs,storage-proto}/**: Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.

🤖 Fix all issues with AI agents
In `@magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs`:
- Around line 776-777: In try_reconnect, the test-only unsubscribe() call is
missing the 2-second timeout used elsewhere and can hang on dead sockets; wrap
the unsubscribe().await invocation in the same
tokio::time::timeout(Duration::from_secs(2), unsubscribe()).await call (as done
at the other call-sites around lines 569 and 728) and ignore or log the timeout
result since unsubscribe resolves to (), ensuring the reconnect flow cannot
block indefinitely.
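
The bounded unsubscribe would look roughly like this (the wrapper function is illustrative; tokio::time::timeout and the 2-second bound come from the existing call-sites):

    use std::time::Duration;
    use tokio::time::timeout;

    /// Bound the unsubscribe so a dead socket can't stall reconnection.
    async fn unsubscribe_with_timeout(unsubscribe: impl std::future::Future<Output = ()>) {
        if timeout(Duration::from_secs(2), unsubscribe).await.is_err() {
            // unsubscribe resolves to (), so a timeout is only logged.
            log::warn!("unsubscribe timed out during reconnect");
        }
    }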

@bmuddha self-requested a review February 10, 2026 04:54
@bmuddha (Collaborator) left a comment


Overall solid, with some minor nitpicks that I feel need to be addressed before merging.

@bmuddha (Collaborator) left a comment


LGTM

@thlorenz merged commit 5fb82c1 into master Feb 17, 2026
18 checks passed
@thlorenz deleted the thlorenz/websocket-pool-conections branch February 17, 2026 07:16
thlorenz added a commit that referenced this pull request Feb 17, 2026
* master:
  feat: implement pubsub connection pooling (#931)
  fix: project ata from eata delegation update (#963)
@thlorenz restored the thlorenz/websocket-pool-conections branch February 17, 2026 07:19
@thlorenz deleted the thlorenz/websocket-pool-conections branch February 17, 2026 07:21
thlorenz added a commit that referenced this pull request Feb 17, 2026
…ational

* master:
  chore: improve subscription reconciler (#945)
  feat: implement pubsub connection pooling (#931)
  fix: project ata from eata delegation update (#963)