
rtpclient: harden connect path against port=0 endpoints and +1 port collisions #151

Open
ibiltari wants to merge 3 commits into davidmoreno:master from stagesoft:fix/connect-to-default-port-5004

ibiltari commented May 22, 2026

Hi! Another small fix proposal:

What this PR fixes

Issue 1 — silent acceptance of [connect_to] blocks without port=

A [connect_to] block in the INI without an explicit port= line leaves settings.connect_to[].port as the empty string. That flows all the way through main.cpp::setup_network_rtpmidi_listener → make_local_alsa_listener → local_alsa_listener_t::connect_to_remote_server → rtpclient_t::add_server_addresses({{hostname, ""}}).

state_resolve_next_ip_port then calls network_address_list_t(hostname, "") → getaddrinfo(hostname, NULL), which returns sockaddrs with port 0. control_address.port() ends up 0, and every subsequent sendto() fails with EINVAL:

[ERROR] udppeer.cpp:112 | Error sending to 169.254.9.204:0. This is UDP... so just lost! (Invalid argument)
[ERROR] rtpclient.cpp:87 | Client: Could not send all data to CONTROL_PORT. Sent -1. Invalid argument (22)
[ERROR] rtpclient.cpp:314 | Error at rtpclient_t. Will try to connect again in 30000ms

The state machine's retry path re-resolves the same broken endpoint, so the client loops forever.
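
The resolution step can be reproduced outside rtpmidid with a minimal sketch like the one below. It is not the project's network_address_list_t; first_resolved_port is a hypothetical helper, restricted to numeric IPv4 hosts so that no DNS lookup is needed, and it assumes (as described above) that an empty port string reaches getaddrinfo as a NULL service.

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string>

// Hypothetical helper (not part of rtpmidid): resolve a numeric IPv4 host and a
// service string, and return the port of the first result in host byte order.
// Returns -1 if resolution fails.
inline int first_resolved_port(const std::string &host, const std::string &service) {
  addrinfo hints{};
  hints.ai_family = AF_INET;
  hints.ai_socktype = SOCK_DGRAM;
  hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;  // no DNS, no service-name lookup

  // Assumption: an empty port string is handed to getaddrinfo as service=NULL.
  const char *svc = service.empty() ? nullptr : service.c_str();

  addrinfo *result = nullptr;
  if (getaddrinfo(host.c_str(), svc, &hints, &result) != 0 || result == nullptr) {
    return -1;
  }
  const auto *in = reinterpret_cast<const sockaddr_in *>(result->ai_addr);
  const int port = ntohs(in->sin_port);
  freeaddrinfo(result);
  return port;
}
```

Calling first_resolved_port("169.254.9.204", "5004") returns 5004. With an empty service string the call still succeeds, and on the setup described in this PR the resulting sockaddr carries port 0, which is what later makes sendto() fail.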

Fix (commit 1): default connect_to_t::port = "5004" in settings.hpp. The IANA-registered RTP-MIDI control port matches what users almost certainly mean when they omit it (and what every published INI example uses for [rtpmidi_announce]).
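
As a rough illustration of commit 1, the default can be expressed as a member initializer that the INI parser only overwrites when a port= key is actually present. The struct below is a simplified, hypothetical stand-in for connect_to_t (the real definition in settings.hpp may have different fields), and apply_ini_key is an invented helper, not the project's parser.

```cpp
#include <string>

// Simplified, hypothetical stand-in for connect_to_t in settings.hpp.
struct connect_to_t {
  std::string hostname;
  std::string port = "5004";  // used when the [connect_to] block has no port= line
  std::string name;
};

// Hypothetical parser hook: a key only changes the field it names, so a block
// without port= keeps the default above.
inline void apply_ini_key(connect_to_t &entry, const std::string &key,
                          const std::string &value) {
  if (key == "hostname") {
    entry.hostname = value;
  } else if (key == "port") {
    entry.port = value;
  } else if (key == "name") {
    entry.name = value;
  }
}
```

An entry built from a block that only sets hostname ends up with port "5004", while an explicit port= value still wins.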

Issue 2 — state_resolve_next_ip_port should refuse port=0 entries

Even with the default in place, anyone constructing endpoints programmatically (or any other malformed config path) can still produce a resolved sockaddr with port 0. The state machine has no guard against this — it happily uses the entry, loops on EINVAL, and never recovers.

Fix (commit 2): in state_resolve_next_ip_port, refuse entries whose resolved port is 0, emit a clear ERROR, and advance to the next resolved sockaddr via ResolveFailed. Belt-and-braces alongside commit 1.
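
The check itself is small. A minimal sketch of the idea, using plain POSIX sockaddr types rather than the project's own address wrapper (resolved_port and is_usable_destination are hypothetical names):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstdint>

// Port of an IPv4 or IPv6 sockaddr in host byte order; 0 for null or other families.
inline std::uint16_t resolved_port(const sockaddr *addr) {
  if (addr == nullptr) {
    return 0;
  }
  if (addr->sa_family == AF_INET) {
    return ntohs(reinterpret_cast<const sockaddr_in *>(addr)->sin_port);
  }
  if (addr->sa_family == AF_INET6) {
    return ntohs(reinterpret_cast<const sockaddr_in6 *>(addr)->sin6_port);
  }
  return 0;
}

// A destination with port 0 can never be sent to, so the caller should skip it
// (in the PR: log an ERROR and emit ResolveFailed) instead of connecting.
inline bool is_usable_destination(const sockaddr *addr) {
  return resolved_port(addr) != 0;
}
```

In the state machine, a false result would correspond to the ERROR log plus ResolveFailed path described above, so the next resolved sockaddr gets its turn.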

Issue 3 — state_connect_midi collides with other processes on control_port + 1

state_connect_control opens control_peer at a kernel-assigned ephemeral port (when local_base_port_str == "0", which is what local_alsa_listener_t and rtpmidiremotehandler both pass for avahi-discovered peers). Then state_connect_midi calls bind(local_base_port + 1) for the MIDI socket.

The kernel ephemeral allocator can hand us a control_port whose +1 is owned by another process. Production case we hit: another cuems daemon on the same host (cuems-node-engine) had a UDP source port at 48650 from an outbound NNG connection; rtpmidid was assigned control = 48649 → MIDI bind on 48650 → EADDRINUSE → ConnectFailed → 30 s reconnect_timeout. On retry, the kernel reuses the same ephemeral port and hits the same conflict, so the client loops.

Fix (commit 3): speculatively bind both control_peer and midi_peer in state_connect_control before sending IN to the remote peer. If the +1 bind fails, close control_peer and retry with a fresh ephemeral pair, up to MAX_PAIR_RETRIES=5 times.

  • Retry is gated on local_base_port_str == "0" (ephemeral mode). A user-configured fixed local_udp_port that collides on +1 is a config error — fail after one attempt with a clear message (unchanged behavior).
  • state_connect_midi no longer calls midi_peer.open() (it's already bound). An is_open() sanity guard catches state-machine misuse.
  • The new "midi_peer is bound before IN" invariant required closing midi_peer on every exit from the not-yet-fully-connected window. Three places now call midi_peer.close(): the control connect_timeout timer lambda, the control status_change !CONTROL_CONNECTED branch, and state_disconnect_control.
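
The pair-binding logic described in the bullets above can be sketched with plain POSIX sockets. This is an illustration under assumptions, not the patch itself: udp_pair_t, bind_udp, bound_port and bind_control_and_midi are hypothetical names, the sketch is IPv4-only, and MAX_PAIR_RETRIES is treated here as the total number of attempts in ephemeral mode.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>

constexpr int MAX_PAIR_RETRIES = 5;

struct udp_pair_t {
  int control_fd = -1;
  int midi_fd = -1;
  std::uint16_t control_port = 0;
};

// Bind a UDP socket to the given port (0 = kernel-assigned). Returns fd or -1.
inline int bind_udp(std::uint16_t port) {
  const int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) return -1;
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(port);
  if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) < 0) {
    close(fd);
    return -1;
  }
  return fd;
}

// Local port a socket is bound to, or 0 on error.
inline std::uint16_t bound_port(int fd) {
  sockaddr_in addr{};
  socklen_t len = sizeof(addr);
  if (getsockname(fd, reinterpret_cast<sockaddr *>(&addr), &len) < 0) return 0;
  return ntohs(addr.sin_port);
}

// Bind control on base_port and MIDI on control + 1 before any packet is sent.
// Only ephemeral mode (base_port == 0) retries; a fixed port gets one attempt.
inline bool bind_control_and_midi(std::uint16_t base_port, udp_pair_t &out) {
  const int attempts = (base_port == 0) ? MAX_PAIR_RETRIES : 1;
  for (int i = 0; i < attempts; ++i) {
    const int control_fd = bind_udp(base_port);
    if (control_fd < 0) continue;
    const std::uint16_t control_port = bound_port(control_fd);
    if (control_port != 0 && control_port != 65535) {
      const int midi_fd = bind_udp(static_cast<std::uint16_t>(control_port + 1));
      if (midi_fd >= 0) {
        out.control_fd = control_fd;
        out.midi_fd = midi_fd;
        out.control_port = control_port;
        return true;
      }
    }
    close(control_fd);  // +1 unavailable: drop this control socket and try a fresh pair
  }
  return false;
}
```

On success both sockets are already bound, so the later MIDI connect step only needs to check that the MIDI socket is open. On any failure path the caller owns closing both descriptors, which mirrors the midi_peer.close() calls listed above.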

Verified

Reproduction on a real two-node cluster (one controller rtpmidi server, one node rtpmidi client driving an MTC chain through cuems-videocomposer):

| State | MTC outage on controller systemctl restart rtpmidid |
| --- | --- |
| Pre-fix | permanent (the loop in Issue 1 never recovers without manual restart on both ends) |
| With commit 1 only | ~33 s (Issue 3 still hits, 30 s reconnect_timeout + handshake) |
| With all three commits | ~3 s (just the natural avahi REMOVE + rediscover latency) |

Packet captures, journals, and aconnect snapshots from the test runs are available if useful for review.

Backward compatibility

None broken.

  • Existing INIs with port= set in [connect_to] are unaffected (commit 1 only sets the default).
  • Existing endpoints with valid resolved ports are unaffected by commit 2 (only port=0 entries are refused).
  • Existing connect paths see commit 3 as a transparent retry on a rare-but-real failure mode. With local_udp_port set to a fixed value, the one-attempt + ERROR behavior matches pre-patch.

Thanks!!

ibiltari added 3 commits May 21, 2026 19:21
A [connect_to] block missing port= left settings.connect_to[].port as
the empty string. main.cpp::setup_network_rtpmidi_listener forwards
that to make_local_alsa_listener, which on first ALSA subscribe creates
a rtpclient_t with add_server_addresses({{hostname, ""}}). The
subsequent state_resolve_next_ip_port calls
network_address_list_t(hostname, "") -> getaddrinfo with service=NULL
returns sockaddrs with port=0; control_address ends up with port=0 and
every sendto fails EINVAL ("Error sending to <ip>:0"). The retry path
re-resolves the same broken endpoint, so the rtpclient_t never
recovers.

Defaulting to 5004 (IANA-registered standard RTP-MIDI control port)
matches what users almost certainly mean when they omit it, and
mirrors what [rtpmidi_announce] explicitly uses in every published
example INI. Existing INIs with port= set are unaffected (the value
is overwritten during parse).

When state_resolve_next_ip_port produced a sockaddr with port=0 (e.g.
because the [connect_to] block's port= was missing and getaddrinfo
resolved with service=NULL), control_address.dup() inherited port 0
and every subsequent sendto() returned EINVAL. The state machine's
retry path (state_disconnect_because_cktimeout / state_error ->
state_resolve_next_ip_port) faithfully re-resolves the same broken
endpoint, so the rtpclient_t never recovers - it loops every 30s
forever with "Error sending to <ip>:0".

Refuse port-0 entries explicitly: log a clear ERROR and emit
ResolveFailed so the state machine advances to the next resolved
sockaddr (or, if none remain, exhausts the list and bubbles up to
state_error normally). Pairs with the settings.hpp default
(connect_to.port = "5004") as defense-in-depth: even if a downstream
config tool or future endpoint source produces port 0 by some other
path, this guard prevents the silent permanent failure.

state_connect_control opens control_peer at a kernel-assigned ephemeral
(when local_base_port_str=="0"), then state_connect_midi binds
local_base_port + 1 for the midi socket. The kernel ephemeral allocator
can hand us a control port whose +1 is already owned by another process
on the same host (observed in production: a cuems-node-engine NNG dial
source port held UDP 48650 -> rtpmidid's kernel-assigned control port
48649 made midi bind 48650 fail with EADDRINUSE).

Current behavior on +1 collision: ConnectFailed, 30s reconnect_timeout,
re-resolve, almost certainly same kernel ephemeral, same collision,
loop. Result: minutes-long MTC outage from a single port conflict.

Fix: speculatively bind both control_peer and midi_peer in
state_connect_control BEFORE sending IN to the master. If the +1 bind
fails, close control_peer and retry with a fresh ephemeral pair up to
MAX_PAIR_RETRIES=5 times. With ~30/28000 collision probability per
attempt, 5 retries puts exhaustion at ~10^-15.

Retry is gated on local_base_port_str=="0" (ephemeral mode). A
user-configured fixed local_udp_port that collides on +1 is a config
error - fail with a clear message after one attempt (no behavior change
for that path).

state_connect_midi no longer calls midi_peer.open() (it's already
bound). Added is_open() sanity guard.

The speculative midi_peer binding creates a new invariant: midi_peer
must be closed on every exit from the "not-yet-fully-connected" window
before bouncing back to ResolveNextIpPort. Three places now close
midi_peer:
- The control connect_timeout timer lambda
- The control status_change !CONTROL_CONNECTED branch
- state_disconnect_control (reached on ConnectMidi failure)

Pairs with the earlier commits on this branch:
- fix(settings): default connect_to.port to 5004
- fix(rtpclient): refuse resolved endpoints with destination port 0

Together these three commits make rtpclient_t resilient to: missing
INI port, port-0 resolved addresses, and same-host ephemeral port
collisions on the +1 midi bind.