rtpclient: harden connect path against port=0 endpoints and +1 port collisions#151
Open
ibiltari wants to merge 3 commits into
Open
rtpclient: harden connect path against port=0 endpoints and +1 port collisions#151ibiltari wants to merge 3 commits into
ibiltari wants to merge 3 commits into
Conversation
An [connect_to] block missing port= left settings.connect_to[].port as
the empty string. main.cpp::setup_network_rtpmidi_listener forwards
that to make_local_alsa_listener, which on first ALSA subscribe creates
a rtpclient_t with add_server_addresses({{hostname, ""}}). The
subsequent state_resolve_next_ip_port calls
network_address_list_t(hostname, "") -> getaddrinfo with service=NULL
returns sockaddrs with port=0; control_address ends up with port=0 and
every sendto fails EINVAL ("Error sending to <ip>:0"). The retry path
re-resolves the same broken endpoint, so the rtpclient_t never
recovers.
Defaulting to 5004 (IANA-registered standard RTP-MIDI control port)
matches what users almost certainly mean when they omit it, and
mirrors what [rtpmidi_announce] explicitly uses in every published
example INI. Existing INIs with port= set are unaffected (the value
is overwritten during parse).
When state_resolve_next_ip_port produced a sockaddr with port=0 (e.g. because the [connect_to] block's port= was missing and getaddrinfo resolved with service=NULL), control_address.dup() inherited port 0 and every subsequent sendto() returned EINVAL. The state machine's retry path (state_disconnect_because_cktimeout / state_error -> state_resolve_next_ip_port) faithfully re-resolves the same broken endpoint, so the rtpclient_t never recovers - it loops every 30s forever with "Error sending to <ip>:0". Refuse port-0 entries explicitly: log a clear ERROR and emit ResolveFailed so the state machine advances to the next resolved sockaddr (or, if none remain, exhausts the list and bubbles up to state_error normally). Pairs with the settings.hpp default (connect_to.port = "5004") as defense-in-depth: even if a downstream config tool or future endpoint source produces port 0 by some other path, this guard prevents the silent permanent failure.
state_connect_control opens control_peer at a kernel-assigned ephemeral (when local_base_port_str=="0"), then state_connect_midi binds local_base_port + 1 for the midi socket. The kernel ephemeral allocator can hand us a control port whose +1 is already owned by another process on the same host (observed in production: a cuems-node-engine NNG dial source port held UDP 48650 -> rtpmidid's kernel-assigned control port 48649 made midi bind 48650 fail with EADDRINUSE). Current behavior on +1 collision: ConnectFailed, 30s reconnect_timeout, re-resolve, almost certainly same kernel ephemeral, same collision, loop. Result: minutes-long MTC outage from a single port conflict. Fix: speculatively bind both control_peer and midi_peer in state_connect_control BEFORE sending IN to the master. If the +1 bind fails, close control_peer and retry with a fresh ephemeral pair up to MAX_PAIR_RETRIES=5 times. With ~30/28000 collision probability per attempt, 5 retries puts exhaustion at ~10^-15. Retry is gated on local_base_port_str=="0" (ephemeral mode). A user-configured fixed local_udp_port that collides on +1 is a config error - fail with a clear message after one attempt (no behavior change for that path). state_connect_midi no longer calls midi_peer.open() (it's already bound). Added is_open() sanity guard. The speculative midi_peer binding creates a new invariant: midi_peer must be closed on every exit from the "not-yet-fully-connected" window before bouncing back to ResolveNextIpPort. Three places now close midi_peer: - The control connect_timeout timer lambda - The control status_change !CONTROL_CONNECTED branch - state_disconnect_control (reached on ConnectMidi failure) Pairs with the earlier commits on this branch: - fix(settings): default connect_to.port to 5004 - fix(rtpclient): refuse resolved endpoints with destination port 0 Together these three commits make rtpclient_t resilient to: missing INI port, port-0 resolved addresses, and same-host ephemeral port collisions on the +1 midi bind.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hi! another small fix proposal:
What this PR fixes
Issue 1 — silent acceptance of
[connect_to]blocks withoutport=A
[connect_to]block in the INI without an explicitport=line leavessettings.connect_to[].portas the empty string. That flows all the way throughmain.cpp::setup_network_rtpmidi_listener→make_local_alsa_listener→local_alsa_listener_t::connect_to_remote_server→rtpclient_t::add_server_addresses({{hostname, ""}}).state_resolve_next_ip_portthen callsnetwork_address_list_t(hostname, "")→getaddrinfo(hostname, NULL), which returnssockaddrs with port 0.control_address.port()ends up 0, and every subsequentsendto()fails withEINVAL:The state machine's retry path re-resolves the same broken endpoint, so the client loops forever.
Fix (commit 1): default
connect_to_t::port = "5004"insettings.hpp. The IANA-registered RTP-MIDI control port matches what users almost certainly mean when they omit it (and what every published INI example uses for[rtpmidi_announce]).Issue 2 —
state_resolve_next_ip_portshould refuse port=0 entriesEven with the default in place, anyone constructing endpoints programmatically (or any other malformed config path) can still produce a resolved sockaddr with port 0. The state machine has no guard against this — it happily uses the entry, loops on EINVAL, and never recovers.
Fix (commit 2): in
state_resolve_next_ip_port, refuse entries whose resolved port is 0, emit a clear ERROR, and advance to the next resolved sockaddr viaResolveFailed. Belt-and-braces alongside commit 1.Issue 3 —
state_connect_midicollides with other processes oncontrol_port + 1state_connect_controlopenscontrol_peerat a kernel-assigned ephemeral (whenlocal_base_port_str == "0", which is whatlocal_alsa_listener_tandrtpmidiremotehandlerboth pass for avahi-discovered peers). Thenstate_connect_midicallsbind(local_base_port + 1)for the MIDI socket.The kernel ephemeral allocator can hand us a
control_portwhose+1is owned by another process. Production case we hit: another cuems daemon on the same host (cuems-node-engine) had a UDP source port at 48650 from an outbound NNG connection; rtpmidid was assignedcontrol = 48649→ MIDI bind on 48650 →EADDRINUSE→ConnectFailed→ 30 sreconnect_timeout. On retry, kernel reuses the same ephemeral, same conflict — loop.Fix (commit 3): speculatively bind both
control_peerandmidi_peerinstate_connect_controlbefore sending IN to the remote peer. If the+1bind fails, closecontrol_peerand retry with a fresh ephemeral pair, up toMAX_PAIR_RETRIES=5times.local_base_port_str == "0"(ephemeral mode). A user-configured fixedlocal_udp_portthat collides on+1is a config error — fail after one attempt with a clear message (unchanged behavior).state_connect_midino longer callsmidi_peer.open()(it's already bound). Anis_open()sanity guard catches state-machine misuse.midi_peeron every exit from the not-yet-fully-connected window. Three places now callmidi_peer.close(): the control connect_timeout timer lambda, the control status_change!CONTROL_CONNECTEDbranch, andstate_disconnect_control.Verified
Reproduction on a real two-node cluster (one controller rtpmidi server, one node rtpmidi client driving an MTC chain through
cuems-videocomposer):systemctl restart rtpmididreconnect_timeout+ handshake)Packet captures, journals, and aconnect snapshots from the test runs are available if useful for review.
Backward compatibility
None broken.
port=set in[connect_to]are unaffected (commit 1 only sets the default).local_udp_portset to a fixed value, the one-attempt + ERROR behavior matches pre-patch.Thanks!!