fix(pool): preserve Cache readiness with clones #297

Open
seanmonstar wants to merge 1 commit into master from sean/ukzklyvytvyt


@seanmonstar

Member

Previously, a Cache was considered ready if the shared list of services was not empty. But since Cache can be cloned, one clone could observe that the list was non-empty, a second clone could then take the cached service, and the first clone would find the list empty at call and have to invoke connector.call() without having polled the connector's readiness.

The proposed solution here is that if a Cache polls for readiness and sees that a service is available, it moves that service out of the list and into a per-clone slot. A later call can then be sure that either a service is in its owned slot, or it has called connector.poll_ready().
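To make the per-clone slot concrete, here is a minimal sketch of the idea. It is not the code in this PR: it drops futures and the tower traits entirely, and the names reserve, take_reserved, and demo are hypothetical stand-ins for the cached branch of poll_ready and for call.

```rust
use std::sync::{Arc, Mutex};

/// Simplified stand-in for the pool's Cache (assumption: no futures, no tower).
struct Cache<S> {
    /// Idle services shared by every clone.
    shared: Arc<Mutex<Vec<S>>>,
    /// Service reserved by this clone during its readiness check.
    slot: Option<S>,
}

impl<S> Clone for Cache<S> {
    fn clone(&self) -> Self {
        // A clone shares the list but never inherits a reservation.
        Cache { shared: Arc::clone(&self.shared), slot: None }
    }
}

impl<S> Cache<S> {
    fn new(idle: Vec<S>) -> Self {
        Cache { shared: Arc::new(Mutex::new(idle)), slot: None }
    }

    /// Models the cached branch of poll_ready: move one idle service into
    /// the per-clone slot. Returns false when nothing is cached, which is
    /// where the real code would fall through to the connector.
    fn reserve(&mut self) -> bool {
        if self.slot.is_none() {
            self.slot = self.shared.lock().unwrap().pop();
        }
        self.slot.is_some()
    }

    /// Models call: hand out whatever this clone reserved, if anything.
    fn take_reserved(&mut self) -> Option<S> {
        self.slot.take()
    }
}

/// Two clones compete for a single cached service.
fn demo() -> (Option<&'static str>, bool) {
    let mut a = Cache::new(vec!["conn-1"]);
    let mut b = a.clone();
    assert!(a.reserve()); // a sees the cached service and claims it
    let b_ready = b.reserve(); // b finds the shared list already empty
    (a.take_reserved(), b_ready)
}

fn main() {
    let (a_got, b_ready) = demo();
    println!("a got {:?}, b reserved from cache: {}", a_got, b_ready);
}
```

Because the reservation happens during the readiness check, the second clone learns at that point that nothing is cached, rather than discovering it later in call.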

cc @aajtodd

@aajtodd

aajtodd commented Jun 29, 2026

Contributor

I looked at this, and it does fix the tower contract violation, but I don't think it solves the underlying issue I hit when trying to put connection limits in the connect stack below the cache.

I had a connection limit layer in front of the connector that reserves a semaphore permit in poll_ready and releases it on connection drop. With this PR, a worker that hits the cap parks in the fallthrough:

// Cache::poll_ready
if let Some(svc) = self.shared.lock().unwrap().take() {
    self.ready = Ready::Cached(svc);
    return Poll::Ready(Ok(()));
}
self.connector.poll_ready(cx)   // <-- capped worker parks here

Under sustained reuse (512 concurrent, connection limit@64, H1) it deadlocks: zero throughput, 64/65 worker threads in futex_wait, no recovery. The problem is that the wakeup that would free the parked worker never comes. An idle connection returning wakes the waiters, which are registered over in call:

// Shared::put (idle return)
while let Some(tx) = self.waiters.pop() { ... }   // wakes a waiter

// Cache::call, miss path
locked.waiters.push(tx);   // only reachable AFTER poll_ready returned Ready

So a worker parked in connector.poll_ready only wakes when the cap frees a permit, which happens on a connection drop. But under keep-alive, connections go idle rather than being dropped, so every wakeup lands on the waiter queue, which is empty because nobody got past poll_ready to register. The two paths never meet.
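The missed wakeup can be reproduced in a toy, single-threaded model. This is only a sketch under stated assumptions: the Model type, its fields, and its methods are hypothetical, wakeups are represented by returning the id of the worker that would be notified, and it shows the missed notification rather than the full deadlock.

```rust
/// Toy model of the two wakeup sources (all names are hypothetical).
struct Model {
    permits: usize,      // free permits in the connection limit layer
    idle: usize,         // idle connections sitting in the cache
    waiters: Vec<usize>, // workers registered in the miss path of call
    parked: Vec<usize>,  // workers parked in connector.poll_ready
}

impl Model {
    /// Models Cache::poll_ready for worker `id`; true means Ready.
    fn poll_ready(&mut self, id: usize) -> bool {
        if self.idle > 0 {
            self.idle -= 1; // cached service moved into the clone's slot
            return true;
        }
        if self.permits > 0 {
            self.permits -= 1; // the limit layer grants a permit
            return true;
        }
        self.parked.push(id); // at the cap: parked, and not yet a waiter
        false
    }

    /// Models the miss path of call; only reachable after poll_ready was Ready.
    fn call_miss(&mut self, id: usize) {
        self.waiters.push(id);
    }

    /// A keep-alive connection goes idle: only registered waiters are woken.
    fn put_idle(&mut self) -> Option<usize> {
        match self.waiters.pop() {
            Some(id) => Some(id),
            None => {
                self.idle += 1; // nobody to wake; it just sits in the cache
                None
            }
        }
    }

    /// A connection is dropped: a permit is freed and a parked worker wakes.
    fn drop_conn(&mut self) -> Option<usize> {
        self.permits += 1;
        self.parked.pop()
    }
}

/// Returns (worker woken by the idle return, worker woken by a drop).
fn demo() -> (Option<usize>, Option<usize>) {
    let mut m = Model { permits: 1, idle: 0, waiters: Vec::new(), parked: Vec::new() };
    assert!(m.poll_ready(0)); // worker 0 takes the only permit
    assert!(!m.poll_ready(1)); // worker 1 parks at the cap
    let by_idle = m.put_idle(); // the connection returns idle under keep-alive
    let by_drop = m.drop_conn(); // only a drop reaches the parked worker
    (by_idle, by_drop)
}

fn main() {
    let (by_idle, by_drop) = demo();
    println!("woken by idle return: {:?}, woken by drop: {:?}", by_idle, by_drop);
}
```

In this model the idle return finds an empty waiter list even though worker 1 is blocked, because worker 1 never reached call_miss. It is notified only once a connection is actually dropped.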


This PR is right for what it fixes: a clone can't call a service it didn't reserve or poll ready. The gap is that a worker gated at poll_ready has no way to become a waiter, so it can't see the idle returns. What worked for me was driving the connector's readiness inside the connect race instead: poll_ready is always Ready, and the cap is resolved alongside the waiter in call.
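One possible reading of that alternative, sketched as a deterministic model rather than the actual implementation: poll_ready never gates, and a worker that finds no free permit in call registers as a waiter, so an idle return can reach it. The Pool and Checkout types are hypothetical, connections are plain u32 ids, and a fuller version would also wake the waiter when a permit is released.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical pool state: a permit count plus the waiter queue.
struct Pool {
    permits: usize,
    waiters: Vec<Sender<u32>>,
}

/// What call resolves to in this sketch.
enum Checkout {
    /// A permit was free: go and open a new connection.
    Connect,
    /// At the cap: wait for an idle connection to be handed over.
    Wait(Receiver<u32>),
}

impl Pool {
    /// Readiness no longer gates on the cap.
    fn poll_ready(&self) -> bool {
        true
    }

    /// The cap is resolved here, next to waiter registration.
    fn call(&mut self) -> Checkout {
        if self.permits > 0 {
            self.permits -= 1;
            return Checkout::Connect;
        }
        let (tx, rx) = channel();
        self.waiters.push(tx); // a capped worker can now see idle returns
        Checkout::Wait(rx)
    }

    /// Idle return: hand the connection to a live waiter, if there is one.
    fn put_idle(&mut self, conn: u32) -> bool {
        while let Some(tx) = self.waiters.pop() {
            if tx.send(conn).is_ok() {
                return true;
            }
        }
        false
    }
}

/// Returns (was the idle connection delivered, what the capped worker got).
fn demo() -> (bool, Option<u32>) {
    let mut pool = Pool { permits: 1, waiters: Vec::new() };
    assert!(pool.poll_ready());
    let first = pool.call(); // worker 0 takes the only permit
    assert!(matches!(first, Checkout::Connect));
    assert!(pool.poll_ready()); // worker 1 is not parked at the cap
    match pool.call() {
        Checkout::Wait(rx) => {
            let delivered = pool.put_idle(7); // a connection goes idle
            (delivered, rx.try_recv().ok())
        }
        Checkout::Connect => (false, None),
    }
}

fn main() {
    let (delivered, got) = demo();
    println!("delivered: {}, capped worker received: {:?}", delivered, got);
}
```

The design point is that the capped worker is always registered somewhere an idle return can find it, instead of being parked in a place that only a connection drop can wake.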


2 participants