rtpclient: harden connect path against port=0 endpoints and +1 port collisions by ibiltari · Pull Request #151 · davidmoreno/rtpmidid

ibiltari · 2026-05-22T12:01:30Z

Hi! another small fix proposal:

What this PR fixes

Issue 1 — silent acceptance of `[connect_to]` blocks without `port=`

A [connect_to] block in the INI without an explicit port= line leaves settings.connect_to[].port as the empty string. That flows all the way through main.cpp::setup_network_rtpmidi_listener → make_local_alsa_listener → local_alsa_listener_t::connect_to_remote_server → rtpclient_t::add_server_addresses({{hostname, ""}}).

state_resolve_next_ip_port then calls network_address_list_t(hostname, "") → getaddrinfo(hostname, NULL), which returns sockaddrs with port 0. control_address.port() ends up 0, and every subsequent sendto() fails with EINVAL:

[ERROR] udppeer.cpp:112 | Error sending to 169.254.9.204:0. This is UDP... so just lost! (Invalid argument)
[ERROR] rtpclient.cpp:87 | Client: Could not send all data to CONTROL_PORT. Sent -1. Invalid argument (22)
[ERROR] rtpclient.cpp:314 | Error at rtpclient_t. Will try to connect again in 30000ms

The state machine's retry path re-resolves the same broken endpoint, so the client loops forever.

Fix (commit 1): default connect_to_t::port = "5004" in settings.hpp. The IANA-registered RTP-MIDI control port matches what users almost certainly mean when they omit it (and what every published INI example uses for [rtpmidi_announce]).

Issue 2 — `state_resolve_next_ip_port` should refuse port=0 entries

Even with the default in place, anyone constructing endpoints programmatically (or any other malformed config path) can still produce a resolved sockaddr with port 0. The state machine has no guard against this — it happily uses the entry, loops on EINVAL, and never recovers.

Fix (commit 2): in state_resolve_next_ip_port, refuse entries whose resolved port is 0, emit a clear ERROR, and advance to the next resolved sockaddr via ResolveFailed. Belt-and-braces alongside commit 1.

Issue 3 — `state_connect_midi` collides with other processes on `control_port + 1`

state_connect_control opens control_peer at a kernel-assigned ephemeral (when local_base_port_str == "0", which is what local_alsa_listener_t and rtpmidiremotehandler both pass for avahi-discovered peers). Then state_connect_midi calls bind(local_base_port + 1) for the MIDI socket.

The kernel ephemeral allocator can hand us a control_port whose +1 is owned by another process. Production case we hit: another cuems daemon on the same host (cuems-node-engine) had a UDP source port at 48650 from an outbound NNG connection; rtpmidid was assigned control = 48649 → MIDI bind on 48650 → EADDRINUSE → ConnectFailed → 30 s reconnect_timeout. On retry, kernel reuses the same ephemeral, same conflict — loop.

Fix (commit 3): speculatively bind both control_peer and midi_peer in state_connect_control before sending IN to the remote peer. If the +1 bind fails, close control_peer and retry with a fresh ephemeral pair, up to MAX_PAIR_RETRIES=5 times.

Retry is gated on local_base_port_str == "0" (ephemeral mode). A user-configured fixed local_udp_port that collides on +1 is a config error — fail after one attempt with a clear message (unchanged behavior).
state_connect_midi no longer calls midi_peer.open() (it's already bound). An is_open() sanity guard catches state-machine misuse.
The new "midi_peer is bound before IN" invariant required closing midi_peer on every exit from the not-yet-fully-connected window. Three places now call midi_peer.close(): the control connect_timeout timer lambda, the control status_change !CONTROL_CONNECTED branch, and state_disconnect_control.

Verified

Reproduction on a real two-node cluster (one controller rtpmidi server, one node rtpmidi client driving an MTC chain through cuems-videocomposer):

State	MTC outage on controller `systemctl restart rtpmidid`
Pre-fix	permanent (the loop in Issue 1 never recovers without manual restart on both ends)
With commit 1 only	~33 s (Issue 3 still hits, 30 s `reconnect_timeout` + handshake)
With all three commits	~3 s (just the natural avahi REMOVE + rediscover latency)

Packet captures, journals, and aconnect snapshots from the test runs are available if useful for review.

Backward compatibility

None broken.

Existing INIs with port= set in [connect_to] are unaffected (commit 1 only sets the default).
Existing endpoints with valid resolved ports are unaffected by commit 2 (only port=0 entries are refused).
Existing connect paths see commit 3 as a transparent retry on a rare-but-real failure mode. With local_udp_port set to a fixed value, the one-attempt + ERROR behavior matches pre-patch.

Thanks!!

An [connect_to] block missing port= left settings.connect_to[].port as the empty string. main.cpp::setup_network_rtpmidi_listener forwards that to make_local_alsa_listener, which on first ALSA subscribe creates a rtpclient_t with add_server_addresses({{hostname, ""}}). The subsequent state_resolve_next_ip_port calls network_address_list_t(hostname, "") -> getaddrinfo with service=NULL returns sockaddrs with port=0; control_address ends up with port=0 and every sendto fails EINVAL ("Error sending to <ip>:0"). The retry path re-resolves the same broken endpoint, so the rtpclient_t never recovers. Defaulting to 5004 (IANA-registered standard RTP-MIDI control port) matches what users almost certainly mean when they omit it, and mirrors what [rtpmidi_announce] explicitly uses in every published example INI. Existing INIs with port= set are unaffected (the value is overwritten during parse).

When state_resolve_next_ip_port produced a sockaddr with port=0 (e.g. because the [connect_to] block's port= was missing and getaddrinfo resolved with service=NULL), control_address.dup() inherited port 0 and every subsequent sendto() returned EINVAL. The state machine's retry path (state_disconnect_because_cktimeout / state_error -> state_resolve_next_ip_port) faithfully re-resolves the same broken endpoint, so the rtpclient_t never recovers - it loops every 30s forever with "Error sending to <ip>:0". Refuse port-0 entries explicitly: log a clear ERROR and emit ResolveFailed so the state machine advances to the next resolved sockaddr (or, if none remain, exhausts the list and bubbles up to state_error normally). Pairs with the settings.hpp default (connect_to.port = "5004") as defense-in-depth: even if a downstream config tool or future endpoint source produces port 0 by some other path, this guard prevents the silent permanent failure.

state_connect_control opens control_peer at a kernel-assigned ephemeral (when local_base_port_str=="0"), then state_connect_midi binds local_base_port + 1 for the midi socket. The kernel ephemeral allocator can hand us a control port whose +1 is already owned by another process on the same host (observed in production: a cuems-node-engine NNG dial source port held UDP 48650 -> rtpmidid's kernel-assigned control port 48649 made midi bind 48650 fail with EADDRINUSE). Current behavior on +1 collision: ConnectFailed, 30s reconnect_timeout, re-resolve, almost certainly same kernel ephemeral, same collision, loop. Result: minutes-long MTC outage from a single port conflict. Fix: speculatively bind both control_peer and midi_peer in state_connect_control BEFORE sending IN to the master. If the +1 bind fails, close control_peer and retry with a fresh ephemeral pair up to MAX_PAIR_RETRIES=5 times. With ~30/28000 collision probability per attempt, 5 retries puts exhaustion at ~10^-15. Retry is gated on local_base_port_str=="0" (ephemeral mode). A user-configured fixed local_udp_port that collides on +1 is a config error - fail with a clear message after one attempt (no behavior change for that path). state_connect_midi no longer calls midi_peer.open() (it's already bound). Added is_open() sanity guard. The speculative midi_peer binding creates a new invariant: midi_peer must be closed on every exit from the "not-yet-fully-connected" window before bouncing back to ResolveNextIpPort. Three places now close midi_peer: - The control connect_timeout timer lambda - The control status_change !CONTROL_CONNECTED branch - state_disconnect_control (reached on ConnectMidi failure) Pairs with the earlier commits on this branch: - fix(settings): default connect_to.port to 5004 - fix(rtpclient): refuse resolved endpoints with destination port 0 Together these three commits make rtpclient_t resilient to: missing INI port, port-0 resolved addresses, and same-host ephemeral port collisions on the +1 midi bind.

ibiltari added 3 commits May 21, 2026 19:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rtpclient: harden connect path against port=0 endpoints and +1 port collisions#151

rtpclient: harden connect path against port=0 endpoints and +1 port collisions#151
ibiltari wants to merge 3 commits into
davidmoreno:masterfrom
stagesoft:fix/connect-to-default-port-5004

ibiltari commented May 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ibiltari commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR fixes

Issue 1 — silent acceptance of [connect_to] blocks without port=

Issue 2 — state_resolve_next_ip_port should refuse port=0 entries

Issue 3 — state_connect_midi collides with other processes on control_port + 1

Verified

Backward compatibility

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ibiltari commented May 22, 2026 •

edited

Loading

Issue 1 — silent acceptance of `[connect_to]` blocks without `port=`

Issue 2 — `state_resolve_next_ip_port` should refuse port=0 entries

Issue 3 — `state_connect_midi` collides with other processes on `control_port + 1`