Skip to content

fix(iroh): evict cached mapped addrs when a RemoteStateActor shuts down#4294

Open
drlukeangel wants to merge 3 commits into
n0-computer:mainfrom
drlukeangel:fix/churn-memory-leak
Open

fix(iroh): evict cached mapped addrs when a RemoteStateActor shuts down#4294
drlukeangel wants to merge 3 commits into
n0-computer:mainfrom
drlukeangel:fix/churn-memory-leak

Conversation

@drlukeangel
Copy link
Copy Markdown

@drlukeangel drlukeangel commented May 29, 2026

Description

While running iroh under sustained remote churn (remotes continuously appearing and going away) in
a downstream project, we saw the per-remote address cache grow without bound. AddrMap — the cache
of mapped endpoint/relay addresses behind mapped_addrs — has no eviction path and is never pruned.
When a RemoteStateActor shuts down cleanly, RemoteMap::remove_or_restart_actor removes its
senders entry but leaves that remote's cached addresses in mapped_addrs forever, so
endpoint_addrs/relay_addrs grow once per remote-ever-seen.

(Also: thank you for iroh — the endpoint/discovery/QUIC stack let us build a multi-node mesh without
hand-rolling NAT traversal or relay handling, which is exactly why we could focus on this one cache.)

Fix

Add remove() and retain() to AddrMap (keeping the forward addrs and reverse lookup maps
consistent), and call them in the clean-shutdown branch of remove_or_restart_actor to drop the
gone remote's cached endpoint and relay addresses. Eviction happens only on clean shutdown, so
active remotes are untouched. relay_addrs is keyed by (addr, remote_id), so it uses retain to
drop all entries for the departing remote.

Breaking Changes

None. remove/retain are pub(super); behavior change is limited to releasing state for remotes
that have already gone away.

Notes & open questions

  • Test: added a unit test for AddrMap::remove/retain covering forward removal and
    reverse-lookup consistency (the subtle part — keeping addrs and lookup in sync). The
    end-to-end "cache tracks only live remotes" behavior is churn-dependent; we validated it via a
    downstream heap profile under continuous remote kill/respawn, where the mapped_addrs entries
    stopped growing with remotes-ever-seen. A before/after dhat capture on your churn harness is the
    cleanest way to quantify it.
  • Open question: are there other intended lifetimes for mapped_addrs entries (e.g. deliberately
    cached across reconnects)? If so, the eviction point may want to be narrower — guidance welcome.

Checklist

  • Self-review
  • Documented the disjoint-borrow reasoning in retain
  • Tests — unit test for remove/retain; integration validation noted above
  • No breaking changes

Related

Found together with two other churn-related leak fixes in the iroh stack:

Resolves #4293

AddrMap (the per-remote endpoint/relay address cache behind mapped_addrs) has no
eviction path and is never pruned. On clean RemoteStateActor shutdown,
remove_or_restart_actor removed the senders entry but left the remote's cached
addresses in endpoint_addrs/relay_addrs forever, so they grew once per
remote-ever-seen under churn.

Add remove()/retain() to AddrMap (keeping the forward addrs and reverse lookup
maps consistent) and evict the departing remote's cached addresses on the
clean-shutdown path. Add a unit test for remove/retain forward+reverse
consistency.
@drlukeangel drlukeangel marked this pull request as ready for review May 29, 2026 17:53
@n0bot n0bot Bot added this to iroh May 29, 2026
@github-project-automation github-project-automation Bot moved this to 🚑 Needs Triage in iroh May 29, 2026
@flub flub added the c-iroh Functionality of the core iroh crate. label Jun 1, 2026
Copy link
Copy Markdown
Contributor

@flub flub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks reasonable.

Comment thread iroh/src/socket/mapped_addrs.rs Outdated
let mut inner = self.inner.lock().expect("poisoned");
// Disjoint borrows: `addrs.retain` holds `addrs` mutably while the
// closure also needs `lookup` — destructure so they aren't both
// routed through `inner` (avoids E0499).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to write such a comment, please remove

Comment thread iroh/src/socket/mapped_addrs.rs Outdated
Comment on lines +414 to +416
// The map is generic over the key, so a plain `u32` key exercises the
// forward/reverse bookkeeping; `EndpointIdMappedAddr::generate()` is a
// process-global counter with no other dependencies.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for such comments, please remove

@github-project-automation github-project-automation Bot moved this from 🚑 Needs Triage to 🏗 In progress in iroh Jun 1, 2026
@drlukeangel
Copy link
Copy Markdown
Author

Thanks for the feedback and for taking a look at this! 🙇‍♂️ I've removed the unnecessary comments from the PR as requested. Everything is updated and pushed. Let me know if there's anything else!

Comment thread .containerignore
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't commit files like these.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

c-iroh Functionality of the core iroh crate.

Projects

Status: 🏗 In progress

Development

Successfully merging this pull request may close these issues.

mapped_addrs cache is never evicted, grows per remote-ever-seen under churn

2 participants