Bug: eud_handler_ssl accept loop locks up under concurrent or slow connections — port 8089 stops accepting clients #305

@Humble-Helper-96

Version: 1.7.10
Component: opentakserver/eud_handler/SocketServer.py, client_controller.py
Severity: High — results in complete loss of ATAK client connectivity until service restart


Summary

The SSL socket server (eud_handler_ssl, port 8089) periodically reaches a state where it stops accepting new client connections. The socket remains open and in LISTEN state, but the kernel's accept queue fills to capacity and all incoming connections are dropped at the kernel level before reaching the application. The only recovery is sudo systemctl restart eud_handler_ssl.

The root cause is two bugs in SocketServer.py that interact: a listen(0) call that restricts the kernel accept queue to a single pending connection, and an SSL handshake that blocks the single-threaded accept loop for the entire duration of each TLS exchange.


Observed Symptoms

  • ATAK clients on port 8089 fail to connect; port 8443 continues working normally.
  • sudo ss -tulpn | grep :8089 shows Recv-Q: 1, Send-Q: 0 while the service is running.
  • With Recv-Q at capacity, all new connection attempts are silently dropped by the kernel — clients see a timeout, not a refused connection.
  • sudo systemctl restart eud_handler_ssl resolves the issue immediately — Recv-Q returns to 0 and clients reconnect.
  • A full system reboot also resolves it.
  • The issue recurs after a period of operation, particularly after periods of higher client activity or reconnection bursts.

Root Cause Analysis

Bug 1: listen(0) in launch_ssl_server() — zero-length accept queue

File: opentakserver/eud_handler/SocketServer.py, line 87

def launch_ssl_server(self):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        context = self.get_ssl_context()
        sconn = context.wrap_socket(sock, server_side=True)
        sconn.bind(("0.0.0.0", self.port))
        sconn.listen(0)   # <-- BUG: backlog of 0
        return sconn

Compare with the TCP server on line 75:

def launch_tcp_server(self):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("0.0.0.0", self.port))
    s.listen(1)   # TCP uses 1
    return s

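On Linux, a backlog of 0 does not disable queuing: the kernel treats the accept queue as full only once its length exceeds the backlog, so listen(0) still lets exactly one fully established TCP connection wait. Any further incoming connection is dropped until accept() drains the queue. This is the direct explanation for the observed state. For a listening socket, ss reports the current accept-queue length as Recv-Q and the configured backlog as Send-Q, so Recv-Q: 1, Send-Q: 0 means one connection is queued, the queue is at capacity, and all others are discarded.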

The TCP server uses listen(1), which — while also very low for a server — at least isn't explicitly zero. The SSL server's listen(0) appears to be an unintentional value.
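
The effect of the backlog can be observed without TLS. The following is a minimal sketch, not code from OpenTAKServer: it opens a loopback listener that never calls accept() and counts how many client connects complete. The function name, the loopback address, and the 0.3-second connect timeout are assumptions chosen for illustration, and the exact count for a backlog of 0 depends on the operating system.

```python
import socket


def count_completed_connects(backlog, attempts=4, connect_timeout=0.3):
    """Count client connects that complete while the server never accepts.

    Hypothetical helper for illustration only.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(backlog)
    port = listener.getsockname()[1]

    clients = []
    completed = 0
    try:
        for _ in range(attempts):
            client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            client.settimeout(connect_timeout)
            clients.append(client)
            try:
                client.connect(("127.0.0.1", port))
                completed += 1
            except OSError:
                # Timed out or refused: the accept queue had no room.
                pass
    finally:
        for client in clients:
            client.close()
        listener.close()
    return completed


if __name__ == "__main__":
    for backlog in (0, 1, 16):
        print(f"listen({backlog}):", count_completed_connects(backlog))
```

On a Linux host the listen(0) case is expected to let very few connects through before the rest time out, which matches the client-side symptom described above (a timeout rather than a refusal).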


Bug 2: SSL handshake blocks the accept loop

File: opentakserver/eud_handler/SocketServer.py, lines 30 and 85

self.socket.settimeout(1.0)           # line 30 — timeout on server socket

# ...

sconn = context.wrap_socket(sock, server_side=True)   # line 85 — do_handshake_on_connect=True by default

SSLContext.wrap_socket() defaults to do_handshake_on_connect=True. For a server socket, this means that each call to accept() performs the complete TLS handshake before returning. The accept loop is single-threaded, so the loop is entirely frozen for the duration of every handshake.

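The settimeout(1.0) on line 30 is set on the listening socket, and a socket timeout bounds each individual socket operation, not the handshake as a whole. A TLS 1.2 exchange involves multiple round trips (ClientHello → ServerHello+Certificate → ClientKeyExchange → Finished), so even with a 1-second per-read timeout in effect, a single handshake can block accept() for 4+ seconds with a slow or mobile client. In addition, per the Python socket documentation, the socket returned by accept() does not inherit the listening socket's timeout unless socket.setdefaulttimeout() has been called. If no default timeout is set, the handshake inside accept() is not bounded by the 1-second timeout at all, and a stalled client can block the loop for much longer.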

During those 4+ seconds, any client that completes its TCP three-way handshake occupies Recv-Q. With a backlog of 1 (from Bug 1), the queue is immediately at capacity, and all subsequent clients are dropped.
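
When estimating how long a handshake inside accept() can block, it helps to check which timeout the accepted socket actually carries. The sketch below uses plain TCP on loopback (no TLS, and no OpenTAKServer code; the function name is hypothetical). It assumes no process-wide default timeout has been set with socket.setdefaulttimeout().

```python
import socket


def listener_and_accepted_timeouts(listener_timeout=1.0):
    """Return (listening socket timeout, accepted socket timeout)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(5)
        listener.settimeout(listener_timeout)  # same idea as line 30

        # Connect a client so that accept() has something to return.
        with socket.create_connection(listener.getsockname()):
            conn, _addr = listener.accept()
            with conn:
                return listener.gettimeout(), conn.gettimeout()


if __name__ == "__main__":
    listening, accepted = listener_and_accepted_timeouts()
    print("listening socket timeout:", listening)
    print("accepted socket timeout:", accepted)
```

If the accepted socket reports None, it is in blocking mode with no timeout, so any handshake performed on it before the application sets its own timeout can wait on a stalled peer indefinitely.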


Bug 3: do_handshake() called in ClientController.__init__() — main thread, after accept() already completed it

File: opentakserver/eud_handler/client_controller.py, line 84

if self.is_ssl:
    try:
        self.sock.do_handshake()    # <-- called in __init__, which runs in the main thread
        ...

ClientController.__init__() is called before new_thread.start() in the accept loop:

new_thread = ClientController(...)   # __init__ runs here, in the accept loop thread
new_thread.daemon = True
new_thread.start()

Because accept() with do_handshake_on_connect=True already completed the handshake, this do_handshake() call is a no-op. However, it runs in the main accept loop thread before the thread is started. The intended architecture — offloading the handshake to the per-client thread — is structurally present but non-functional because Bug 2 causes the handshake to complete before ClientController ever runs.
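
The threading behaviour behind this can be checked in isolation. The following sketch is independent of OpenTAKServer (the Worker class is hypothetical) and records which thread executes __init__() and which executes run() for a threading.Thread subclass.

```python
import threading


class Worker(threading.Thread):
    """Hypothetical stand-in for a per-client thread class."""

    def __init__(self):
        super().__init__(daemon=True)
        # Executed by whoever constructs the object, i.e. the accept loop.
        self.init_thread_id = threading.get_ident()
        self.run_thread_id = None

    def run(self):
        # Executed by the new thread once start() has been called.
        self.run_thread_id = threading.get_ident()


def init_and_run_in_caller_thread():
    """Return (init ran in caller thread, run ran in caller thread)."""
    caller_id = threading.get_ident()
    worker = Worker()
    worker.start()
    worker.join()
    return worker.init_thread_id == caller_id, worker.run_thread_id == caller_id


if __name__ == "__main__":
    print(init_and_run_in_caller_thread())
```

Only work placed in run() is offloaded from the caller. Anything blocking in __init__() is paid for by the thread that constructs the object.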


Bug 4: self.clients list grows unbounded

File: opentakserver/eud_handler/SocketServer.py, line 45

self.clients.append(new_thread)

Finished ClientController threads are appended to self.clients but never removed. Over time this list accumulates all historical connection threads, leaking memory proportional to the number of client connections since the service last started.


Proposed Fix

Fix for Bug 1 and Bug 2 together

In SocketServer.py, launch_ssl_server():

def launch_ssl_server(self):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

        context = self.get_ssl_context()

        # do_handshake_on_connect=False: accept() returns immediately without
        # performing the handshake. The handshake is then completed by the
        # do_handshake() call on line 84 of client_controller.py, which must
        # run in the per-client thread (run()) rather than in __init__() so
        # that it does not block the accept loop.
        sconn = context.wrap_socket(sock, server_side=True, do_handshake_on_connect=False)
        sconn.bind(("0.0.0.0", self.port))
        sconn.listen(5)   # was listen(0); 5 is a conventional minimum for servers
        return sconn

With do_handshake_on_connect=False, accept() returns as soon as the TCP connection is established, and the do_handshake() call on line 84 of client_controller.py becomes the one that actually performs the handshake. As noted under Bug 3, however, that call currently sits in ClientController.__init__(), which runs in the accept loop thread before new_thread.start(). For the handshake to stop blocking the accept loop, the call also has to move out of __init__() and into the thread's run() method. self.sock.settimeout(1.0) in ClientController.__init__() (line 45) already applies a per-operation timeout to the client socket, so a slow handshake in run() is bounded per read without freezing the accept loop.

With both changes, the do_handshake() call described under Bug 3 is no longer a no-op: the handshake happens in the per-client thread as intended.
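
Once the handshake runs in the per-client thread, it can also be given an overall deadline instead of only a per-operation timeout. The following is a sketch under stated assumptions, not part of the proposed patch: handshake_with_deadline and its 5-second default are hypothetical, and the loop follows the non-blocking handshake pattern from the Python ssl documentation (retry on SSLWantReadError / SSLWantWriteError after waiting with select).

```python
import select
import ssl
import time


def handshake_with_deadline(ssl_sock, deadline_s=5.0):
    """Complete a TLS handshake within an overall time budget.

    Hypothetical helper; raises TimeoutError if the budget is exceeded.
    """
    end = time.monotonic() + deadline_s
    previous_timeout = ssl_sock.gettimeout()
    ssl_sock.setblocking(False)
    try:
        while True:
            try:
                ssl_sock.do_handshake()
                return
            except ssl.SSLWantReadError:
                wait_for_read = True
            except ssl.SSLWantWriteError:
                wait_for_read = False

            remaining = end - time.monotonic()
            if remaining <= 0:
                raise TimeoutError("TLS handshake exceeded its deadline")

            # Sleep until the socket is ready or the budget runs out.
            if wait_for_read:
                select.select([ssl_sock], [], [], remaining)
            else:
                select.select([], [ssl_sock], [], remaining)
    finally:
        # Restore whatever timeout the caller had configured.
        ssl_sock.settimeout(previous_timeout)
```

Called at the top of the per-client thread's run() method, a helper like this would bound the total handshake time for a stalled client while leaving the accept loop untouched.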

Fix for Bug 4

In SocketServer.run(), prune dead threads before appending, or periodically:

self.clients = [c for c in self.clients if c.is_alive()]
self.clients.append(new_thread)
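
The pruning step can be exercised on its own with plain threads. This is a minimal sketch with hypothetical names (prune_and_register, demo); it only illustrates that finished threads drop out of the list while running ones and the newly registered one remain.

```python
import threading


def prune_and_register(clients, new_thread):
    """Return a list holding the still-running threads plus new_thread."""
    alive = [c for c in clients if c.is_alive()]
    alive.append(new_thread)
    return alive


def demo():
    """Return (tracked thread count, whether the finished thread is still tracked)."""
    stop = threading.Event()

    finished = threading.Thread(target=lambda: None)
    running = threading.Thread(target=stop.wait, daemon=True)
    finished.start()
    running.start()
    finished.join()  # this thread has now exited

    newcomer = threading.Thread(target=stop.wait, daemon=True)
    clients = prune_and_register([finished, running], newcomer)
    newcomer.start()

    result = (len(clients), finished in clients)

    # Let the remaining threads exit cleanly.
    stop.set()
    running.join()
    newcomer.join()
    return result


if __name__ == "__main__":
    print(demo())
```

Because the new thread is appended after the filter runs, it is kept even though it has not been started yet. It is only subject to pruning on a later pass, by which time start() has been called.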

Reproduction

Reliable reproduction involves two simultaneous slow SSL connections, but the issue occurs in normal operation with mobile ATAK clients on poor or intermittent networks (LTE, weak WiFi). The following confirms the stuck state without needing reproduction:

# Shows Recv-Q: 1 when stuck
sudo ss -tulpn | grep :8089

# Clears it
sudo systemctl restart eud_handler_ssl

# Confirm recovery
sudo ss -tulpn | grep :8089   # Recv-Q should return to 0

Environment

  • OpenTAKServer 1.7.10, native (non-Docker) installation
  • Ubuntu 24.04 LTS
  • Python SSL module — TLS 1.2, OTS_SSL_VERIFICATION_MODE = 2 (ssl.CERT_REQUIRED)
  • ATAK clients connecting from mobile devices over LTE and WiFi
