rtppeer: fix UAF when disconnect() triggers self-removal via status_change_event#150
Open
ibiltari wants to merge 1 commit into
Open
rtppeer: fix UAF when disconnect() triggers self-removal via status_change_event#150ibiltari wants to merge 1 commit into
ibiltari wants to merge 1 commit into
Conversation
…nnect() rtppeer_t::disconnect() called send_goodbye() and then reset() on *this. send_goodbye() synchronously fires status_change_event, whose rtpserverpeer slot calls rtpserver_t::remove_peer(). remove_peer() does peers.erase() on the owning vector, destructing the rtpserverpeer_t and dropping the last shared_ptr<rtppeer_t> — freeing *this. Control then returned up the stack to reset(), reading and writing freed memory. ASAN confirmed (heap-use-after-free, WRITE of size 4 at rtppeer.cpp:55, free chain disconnect -> send_goodbye -> status_change_event -> rtpserverpeer_t::status_change -> rtpserver_t::remove_peer -> vector::erase -> ~rtpserverpeer_t -> ~shared_ptr<rtppeer_t> -> freed). Fix: take a local shared_ptr at entry via weak_from_this().lock(). When the rtppeer is shared-owned (the server path that originally crashed), the local keeps it alive across the signal storm. When it isn't (rtpclient_t embeds rtppeer_t by value at rtpclient.hpp:45; tests stack-allocate it in tests/test_rtppeer.cpp), no external owner can drop the last reference mid-call, so the UAF is structurally impossible — and weak_from_this().lock() yielding nullptr is harmless. In production this crashed ~317 times per boot under steady cluster activity (~15-20s interval). It also explains the libfmt-vformat crash signature previously reported on v24.12 — same root-cause UAF, the libfmt allocator just happened to be the next user of the freed slab.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
First, thank you for rtpmidid. It's been a small-but-load-bearing piece of a live-performance system we maintain, and it's been reliable across a lot of show hardware!.
Where this came from
CUEMS is our distributed cue-playback stack for live performance — audio, video, DMX, and timecode coordinated across a cluster of nodes. rtpmidid is what carries our MTC stream over the network. When the cluster is rolling, rtpmidid is on the critical path for every timecoded frame.
We hit this bug on debian 12 bookworm, previous releases of rtpmidid where working for us on debian 11. After migrating our system to 12, we found this issue. Whe were using rtpmidid release 24 and it hit there, and then updated to release 26 and it was still there. After the pach we are able to use it on debian 12 again with no problems, and it has been running consistently for about a week now with intensive testing (since rtpmidid is a key component in our system, nice work!).
Summary
Heap-use-after-free in
rtppeer_t::reset()triggered from insidertppeer_t::disconnect().send_goodbye()synchronously firesstatus_change_event, whosertpserverpeer_tslot callsrtpserver_t::remove_peer(), which erases the owningrtpserverpeer_tfrom the server's
peersvector — dropping the lastshared_ptr<rtppeer_t>and freeing*thiswhiledisconnect()isstill on the stack. Control returns and
disconnect()callsthis->reset()(lib/rtppeer.cpp:867)on the freed object.
Impact
Under steady-state operation on a 3-host LAN cluster (Debian 12,
glibc 2.36), this fires ~317 times per boot on the server-side
rtpmidid, with a ~15-20 s natural retrigger interval driven by avahipeer churn. Production symptom is
malloc(): unaligned tcache chunk detected; the libfmt-vformat crash signature previously reported onv24.12 is the same root cause — libfmt just happens to be the next
allocator user of the freed slab.
Reproducer (any one of)
systemctl restart rtpmididon a peer host with[connect_to]config.rtpmididon the LAN starting/stopping (avahiBROWSER_REMOVEtriggers the same cleanup path).~15 s as auto-reconnect after each crash triggers the next cascade.
ASAN+UBSAN build catches it on the first peer-disconnect of the run.
ASAN (key frames)
(Full report can be attached on request.)
Fix
Make
rtppeer_tinheritstd::enable_shared_from_this<rtppeer_t>and, in
disconnect(), take a localshared_ptrviaweak_from_this().lock()at entry. The local keeps*thisaliveacross the synchronous signal storm so the trailing
reset()runs onvalid memory.
weak_from_this().lock()rather thanshared_from_this()becausetwo paths construct
rtppeer_twithout a managingshared_ptr—rtpclient_t::peeris a value member (rtpclient.hpp:45), and theunit tests stack-allocate
rtpmidid::rtppeer_t peer("test")inseveral test cases.
shared_from_this()would throwbad_weak_ptron both. With
lock()it just returnsnullptrin those cases —which is safe, because no external owner exists to drop the last
reference during the signal storm anyway.
Deferring the
peers.erase()inremove_peer()was also consideredbut rejected:
~rtpserver_t()callssend_goodbye()directly oneach peer, which would enqueue a deferred lambda capturing
this(the dying server) and UAF on the next poller tick.Verified
ASAN+UBSAN build with the patch applied —
ctestgreen; productioncrash trigger no longer fires.
Thanks!!