rtppeer: fix UAF when disconnect() triggers self-removal via status_change_event #150

Open
ibiltari wants to merge 1 commit into davidmoreno:master from stagesoft:upstream-pr/uaf-fix

ibiltari commented May 12, 2026

First, thank you for rtpmidid. It's been a small-but-load-bearing piece of a live-performance system we maintain, and it's been reliable across a lot of show hardware!

Where this came from

CUEMS is our distributed cue-playback stack for live performance — audio, video, DMX, and timecode coordinated across a cluster of nodes. rtpmidid is what carries our MTC stream over the network. When the cluster is rolling, rtpmidid is on the critical path for every timecoded frame.

We hit this bug on Debian 12 (bookworm); previous releases of rtpmidid were working for us on Debian 11. After migrating our system to Debian 12, we found this issue. We were using rtpmidid release 24 when it first hit, and it was still there after updating to release 26. With the patch we are able to use it on Debian 12 again with no problems, and it has been running consistently for about a week now under intensive testing (rtpmidid is a key component in our system, nice work!).

Summary

Heap-use-after-free in rtppeer_t::reset() triggered from inside
rtppeer_t::disconnect(). send_goodbye() synchronously fires
status_change_event, whose rtpserverpeer_t slot calls
rtpserver_t::remove_peer(), which erases the owning rtpserverpeer_t
from the server's peers vector — dropping the last
shared_ptr<rtppeer_t> and freeing *this while disconnect() is
still on the stack. Control returns and disconnect() calls
this->reset() (lib/rtppeer.cpp:867)
on the freed object.
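
The ownership chain described above can be sketched in isolation. This is a minimal illustration under assumptions, not the rtpmidid code: peer_t, server_t, on_status_change and g_peers_destroyed are hypothetical stand-ins for rtppeer_t, rtpserver_t, status_change_event and the real destructor. To stay free of undefined behaviour, the sketch only detects that *this was destroyed during the callback and never touches a member afterwards.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-ins; these are not the real rtpmidid classes.
static int g_peers_destroyed = 0;

struct peer_t {
  std::function<void()> on_status_change; // plays the role of status_change_event

  ~peer_t() { ++g_peers_destroyed; }

  // Returns true if *this was destroyed while the callback was running.
  bool disconnect_unguarded() {
    const int before = g_peers_destroyed;
    auto slot = on_status_change; // stack copy, so the callable outlives *this
    slot();                       // like send_goodbye(): fires the slot synchronously
    // From here on, touching any member (as reset() does) would be a
    // use-after-free. Only the local and the global are read.
    return g_peers_destroyed != before;
  }
};

struct server_t {
  std::vector<std::shared_ptr<peer_t>> peers;

  peer_t *add_peer() {
    auto peer = std::make_shared<peer_t>();
    // Like remove_peer(): the slot erases the only owner of the peer.
    peer->on_status_change = [this] { peers.clear(); };
    peers.push_back(peer);
    return peer.get();
  }
};

bool destroyed_during_disconnect() {
  server_t server;
  peer_t *raw = server.add_peer();
  return raw->disconnect_unguarded();
}
```

Calling destroyed_during_disconnect() returns true in this sketch: the vector held the last shared_ptr, so the peer is gone before the callback returns to disconnect_unguarded().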

Impact

Under steady-state operation on a 3-host LAN cluster (Debian 12,
glibc 2.36), this fires ~317 times per boot on the server-side
rtpmidid, with a ~15-20 s natural retrigger interval driven by avahi
peer churn. The production symptom is "malloc(): unaligned tcache chunk detected". The libfmt-vformat crash signature previously reported on
v24.12 has the same root cause: libfmt just happens to be the next
allocator user of the freed slab.

Reproducer (any one of)

  1. systemctl restart rtpmidid on a peer host with [connect_to] config.
  2. Any other rtpmidid on the LAN starting/stopping (avahi
    BROWSER_REMOVE triggers the same cleanup path).
  3. Steady-state with 2-3 instances visible — fires on its own every
    ~15 s as auto-reconnect after each crash triggers the next cascade.

An ASAN+UBSAN build catches it on the first peer disconnect of the run.

ASAN (key frames)

WRITE of size 4 at heap-use-after-free
  #0 rtpmidid::rtppeer_t::reset()       lib/rtppeer.cpp:55
  #1 rtpmidid::rtppeer_t::disconnect()  lib/rtppeer.cpp:867
freed by thread T0 here:
  #9  ~rtpserverpeer_t                  lib/rtpserverpeer.cpp:85
  #14 rtpserver_t::remove_peer(int)     lib/rtpserver.cpp:202
  #15 rtpserverpeer_t::status_change(...) lib/rtpserverpeer.cpp:127
  #21 signal_t<...>::operator()         include/rtpmidid/signal.hpp:105
  #22 rtppeer_t::send_goodbye(...)      lib/rtppeer.cpp:724
  #23 rtpmidid::rtppeer_t::disconnect() lib/rtppeer.cpp:865
previously allocated by thread T0 here:
  #7  std::make_shared<rtpmidid::rtppeer_t, ...>
  #8  rtpserverpeer_t::rtpserverpeer_t  lib/rtpserverpeer.cpp:32

(Full report can be attached on request.)

Fix

Make rtppeer_t inherit std::enable_shared_from_this<rtppeer_t>
and, in disconnect(), take a local shared_ptr via
weak_from_this().lock() at entry. The local keeps *this alive
across the synchronous signal storm so the trailing reset() runs on
valid memory.
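
A minimal sketch of the keep-alive pattern, assuming C++17 for weak_from_this(). The names peer_t, server_t, on_status_change and keep_alive are hypothetical and only mirror the shape of the change; the actual patch may differ in detail.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for rtppeer_t.
struct peer_t : std::enable_shared_from_this<peer_t> {
  std::function<void()> on_status_change; // plays the role of status_change_event
  int status = 1;

  void reset() { status = 0; }

  void disconnect() {
    // Empty when no shared_ptr manages *this; otherwise it pins the
    // object until this function returns.
    auto keep_alive = weak_from_this().lock();
    auto slot = on_status_change; // stack copy of the callable
    if (slot) slot();             // may drop every external owner
    reset();                      // still valid: keep_alive holds a reference
  }
};

// Hypothetical stand-in for rtpserver_t.
struct server_t {
  std::vector<std::shared_ptr<peer_t>> peers;
};

// Returns true if the peer survived the callback and was released
// only after disconnect() returned.
bool peer_outlives_callback() {
  server_t server;
  auto peer = std::make_shared<peer_t>();
  std::weak_ptr<peer_t> observer = peer;
  bool alive_after_removal = false;

  peer->on_status_change = [&] {
    server.peers.clear(); // like remove_peer(): drops the owning reference
    alive_after_removal = !observer.expired();
  };
  server.peers.push_back(std::move(peer));

  peer_t *raw = server.peers.front().get();
  raw->disconnect();
  return alive_after_removal && observer.expired();
}
```

In this sketch the peer is destroyed when keep_alive goes out of scope at the end of disconnect(), after reset() has already run.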

weak_from_this().lock() is used rather than shared_from_this() because
two paths construct rtppeer_t without a managing shared_ptr:
rtpclient_t::peer is a value member (rtpclient.hpp:45), and the
unit tests stack-allocate rtpmidid::rtppeer_t peer("test") in
several test cases. shared_from_this() would throw bad_weak_ptr
on both. With lock() it just returns nullptr in those cases,
which is safe, because no external owner exists to drop the last
reference during the signal storm anyway.
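
The difference between the two calls can be checked with a small standalone sketch, assuming C++17 (where shared_from_this() on an unmanaged object is specified to throw std::bad_weak_ptr). peer_t and the helper names are hypothetical, not part of rtpmidid.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for rtppeer_t.
struct peer_t : std::enable_shared_from_this<peer_t> {
  // True when no shared_ptr manages *this.
  bool locks_to_null() { return weak_from_this().lock() == nullptr; }

  // True when shared_from_this() throws std::bad_weak_ptr.
  bool shared_from_this_throws() {
    try {
      auto self = shared_from_this();
      (void)self;
      return false;
    } catch (const std::bad_weak_ptr &) {
      return true;
    }
  }
};

// Stack-allocated, like the unit tests or a value member.
bool stack_peer_is_unmanaged() {
  peer_t peer;
  return peer.locks_to_null() && peer.shared_from_this_throws();
}

// Owned by a shared_ptr, like the server path.
bool shared_peer_is_managed() {
  auto peer = std::make_shared<peer_t>();
  return !peer->locks_to_null() && !peer->shared_from_this_throws();
}
```

Both helpers return true: the unmanaged object yields nullptr from lock() and an exception from shared_from_this(), while the shared-owned object yields a valid pointer from either call.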

Deferring the peers.erase() in remove_peer() was also considered
but rejected: ~rtpserver_t() calls send_goodbye() directly on
each peer, which would enqueue a deferred lambda capturing
this (the dying server) and UAF on the next poller tick.

Verified

ASAN+UBSAN build with the patch applied — ctest green; production
crash trigger no longer fires.

Thanks!!

…nnect()

rtppeer_t::disconnect() called send_goodbye() and then reset() on *this.
send_goodbye() synchronously fires status_change_event, whose
rtpserverpeer slot calls rtpserver_t::remove_peer(). remove_peer() does
peers.erase() on the owning vector, destructing the rtpserverpeer_t and
dropping the last shared_ptr<rtppeer_t> — freeing *this. Control then
returned up the stack to reset(), reading and writing freed memory.

ASAN confirmed (heap-use-after-free, WRITE of size 4 at rtppeer.cpp:55,
free chain disconnect -> send_goodbye -> status_change_event ->
rtpserverpeer_t::status_change -> rtpserver_t::remove_peer ->
vector::erase -> ~rtpserverpeer_t -> ~shared_ptr<rtppeer_t> -> freed).

Fix: take a local shared_ptr at entry via weak_from_this().lock(). When
the rtppeer is shared-owned (the server path that originally crashed),
the local keeps it alive across the signal storm. When it isn't
(rtpclient_t embeds rtppeer_t by value at rtpclient.hpp:45; tests
stack-allocate it in tests/test_rtppeer.cpp), no external owner can drop
the last reference mid-call, so the UAF is structurally impossible —
and weak_from_this().lock() yielding nullptr is harmless.

In production this crashed ~317 times per boot under steady cluster
activity (~15-20s interval). It also explains the libfmt-vformat crash
signature previously reported on v24.12 — same root-cause UAF, the
libfmt allocator just happened to be the next user of the freed slab.