Bug: eud_handler_ssl accept loop locks up under concurrent or slow connections — port 8089 stops accepting clients
Version: 1.7.10
Component: opentakserver/eud_handler/SocketServer.py, client_controller.py
Severity: High — results in complete loss of ATAK client connectivity until service restart
Summary
The SSL socket server (eud_handler_ssl, port 8089) periodically reaches a state where it stops accepting new client connections. The socket remains open and in LISTEN state, but the kernel's accept queue fills to capacity and all incoming connections are dropped at the kernel level before reaching the application. The only recovery is sudo systemctl restart eud_handler_ssl.
The root cause is two bugs in SocketServer.py that interact: a listen(0) call that restricts the kernel accept queue to a single pending connection, and an SSL handshake that blocks the single-threaded accept loop for the entire duration of each TLS exchange.
Observed Symptoms
- ATAK clients on port 8089 fail to connect; port 8443 continues working normally.
- sudo ss -tulpn | grep :8089 shows Recv-Q: 1, Send-Q: 0 while the service is running.
- With Recv-Q at capacity, all new connection attempts are silently dropped by the kernel; clients see a timeout, not a refused connection.
- sudo systemctl restart eud_handler_ssl resolves the issue immediately: Recv-Q returns to 0 and clients reconnect.
- A full system reboot also resolves it.
- The issue recurs after a period of operation, particularly after periods of higher client activity or reconnection bursts.
Root Cause Analysis
Bug 1: listen(0) in launch_ssl_server() — zero-length accept queue
File: opentakserver/eud_handler/SocketServer.py, line 87
    def launch_ssl_server(self):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            context = self.get_ssl_context()
            sconn = context.wrap_socket(sock, server_side=True)
            sconn.bind(("0.0.0.0", self.port))
            sconn.listen(0)  # <-- BUG: backlog of 0
            return sconn
Compare with the TCP server on line 75:
    def launch_tcp_server(self):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("0.0.0.0", self.port))
        s.listen(1)  # TCP uses 1
        return s
On Linux, a backlog of 0 does not disable queuing: the kernel treats the accept queue as full only once its length exceeds the backlog, so listen(0) still lets one fully established TCP connection wait. Any further connection attempts are dropped until accept() drains that entry. This is the direct explanation for the observed Recv-Q: 1, Send-Q: 0 state. For a listening socket, ss reports the current accept queue length in Recv-Q and the configured backlog in Send-Q, so one connection is queued, the queue is at capacity, and all others are discarded.
The TCP server uses listen(1), which is also very low for a server but at least is not explicitly zero. The SSL server's listen(0) appears to be an unintentional value.
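The effect of a small backlog can be observed without OpenTAKServer at all. The following is a minimal sketch, assuming a host with a working loopback interface; count_accepted_connects is a name invented for this illustration. It opens a listener that never calls accept() and counts how many client connect() calls complete before the kernel starts dropping them. On Linux the count is expected to stay close to the backlog value, though the exact number depends on the operating system and kernel settings.

```python
import socket


def count_accepted_connects(backlog: int, attempts: int = 4, timeout: float = 0.3) -> int:
    """Count client connect() calls that complete against a listener that never accepts."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(backlog)
    port = server.getsockname()[1]

    clients = []
    connected = 0
    try:
        for _ in range(attempts):
            client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            client.settimeout(timeout)
            clients.append(client)  # keep a reference so queued connections stay open
            try:
                client.connect(("127.0.0.1", port))
                connected += 1
            except OSError:
                # Timeout or refusal: the kernel did not queue this connection.
                pass
    finally:
        for client in clients:
            client.close()
        server.close()
    return connected


if __name__ == "__main__":
    for backlog in (0, 1, 5):
        done = count_accepted_connects(backlog)
        print(f"listen({backlog}): {done} of 4 connects completed")
```

Clients that fail here do so by timing out, which mirrors what ATAK clients see when the queue on port 8089 is full.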
Bug 2: SSL handshake blocks the accept loop
File: opentakserver/eud_handler/SocketServer.py, lines 30 and 85
    self.socket.settimeout(1.0)  # line 30: timeout on server socket
    # ...
    sconn = context.wrap_socket(sock, server_side=True)  # line 85: do_handshake_on_connect=True by default
SSLContext.wrap_socket() defaults to do_handshake_on_connect=True. For a server socket, this means that each call to accept() performs the complete TLS handshake before returning. The accept loop is single-threaded, so the loop is entirely frozen for the duration of every handshake.
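The settimeout(1.0) on line 30 is set on the listening socket, so it bounds how long accept() waits for a new connection, not how long the handshake takes. In CPython, a socket returned by accept() does not inherit the listener's timeout unless a default has been set with socket.setdefaulttimeout(), so the handshake runs on a blocking socket. A TLS 1.2 exchange involves multiple round trips (ClientHello → ServerHello+Certificate → ClientKeyExchange → Finished). A slow or mobile client can therefore block accept() for several seconds, and a client that stalls mid-handshake can block it for as long as the connection stays open.
While the loop is blocked, any client that completes its TCP three-way handshake occupies Recv-Q. With room for only one queued connection (from Bug 1), the queue is immediately at capacity, and all subsequent clients are dropped.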
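Which socket a timeout applies to can be checked directly. The sketch below uses only plain TCP sockets (no TLS): it sets a 1-second timeout on a listener and reports the timeout of the socket that accept() returns. The helper name accepted_socket_timeout is made up for this illustration. In CPython the result is expected to be None (blocking) unless socket.setdefaulttimeout() has been called, which would mean the listener's settimeout(1.0) does not bound a handshake performed inside accept().

```python
import socket


def accepted_socket_timeout(listener_timeout: float = 1.0):
    """Return the timeout of a socket produced by accept() on a listener with a timeout."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(5)
    listener.settimeout(listener_timeout)  # bounds the wait inside accept()

    client = socket.create_connection(listener.getsockname(), timeout=2.0)
    conn, _addr = listener.accept()
    try:
        return conn.gettimeout()
    finally:
        conn.close()
        client.close()
        listener.close()


if __name__ == "__main__":
    print("listener timeout: 1.0")
    print("accepted socket timeout:", accepted_socket_timeout(1.0))
```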
Bug 3: do_handshake() called in ClientController.__init__() — main thread, after accept() already completed it
File: opentakserver/eud_handler/client_controller.py, line 84
    if self.is_ssl:
        try:
            self.sock.do_handshake()  # <-- called in __init__, which runs in the main thread
            ...
ClientController.__init__() is called before new_thread.start() in the accept loop:
    new_thread = ClientController(...)  # __init__ runs here, in the accept loop thread
    new_thread.daemon = True
    new_thread.start()
Because accept() with do_handshake_on_connect=True already completed the handshake, this do_handshake() call is a no-op. However, it runs in the main accept loop thread before the thread is started. The intended architecture — offloading the handshake to the per-client thread — is structurally present but non-functional because Bug 2 causes the handshake to complete before ClientController ever runs.
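The threading behaviour behind this can be shown in isolation. In the sketch below, WhereAmI is a hypothetical stand-in for ClientController, not the real class. It records the thread identifier seen by __init__() and by run(), which illustrates that only code reached from run() executes in the new thread.

```python
import threading


class WhereAmI(threading.Thread):
    """Hypothetical stand-in for ClientController that records which thread runs what."""

    def __init__(self):
        super().__init__(daemon=True)
        # __init__ executes in whichever thread constructs the object.
        self.init_thread = threading.get_ident()
        self.run_thread = None

    def run(self):
        # run() executes in the new thread, and only after start() is called.
        self.run_thread = threading.get_ident()


def demo():
    """Return (init ran in caller thread, run ran in caller thread)."""
    worker = WhereAmI()
    worker.start()
    worker.join()
    caller = threading.get_ident()
    return worker.init_thread == caller, worker.run_thread == caller


if __name__ == "__main__":
    init_in_caller, run_in_caller = demo()
    print("__init__ ran in caller thread:", init_in_caller)
    print("run() ran in caller thread:", run_in_caller)
```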
Bug 4: self.clients list grows unbounded
File: opentakserver/eud_handler/SocketServer.py, line 45
    self.clients.append(new_thread)
ClientController threads are appended to self.clients when they are created but are never removed after they finish. Over time the list accumulates every historical connection thread, leaking memory in proportion to the number of client connections since the service last started.
Proposed Fix
Fix for Bug 1 and Bug 2 together
In SocketServer.py, launch_ssl_server():
    def launch_ssl_server(self):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            context = self.get_ssl_context()
            # do_handshake_on_connect=False: accept() returns immediately without
            # performing the TLS handshake. The handshake is then completed by the
            # do_handshake() call on line 84 of client_controller.py, which is no
            # longer redundant.
            sconn = context.wrap_socket(sock, server_side=True, do_handshake_on_connect=False)
            sconn.bind(("0.0.0.0", self.port))
            sconn.listen(5)  # was listen(0); 5 is a conventional minimum for servers
            return sconn
With do_handshake_on_connect=False, accept() returns as soon as the TCP connection is established, and the do_handshake() call on line 84 of client_controller.py becomes the only handshake. As described under Bug 3, that call sits in ClientController.__init__(), which executes in the accept loop thread. For the handshake to stop blocking the accept loop entirely, the call also has to move into the thread's run() method (or code reached from it). Until then, self.sock.settimeout(1.0) in ClientController.__init__() (line 45) applies a per-operation timeout to that handshake, which limits how long a slow client can hold the loop.
Once the handshake runs in run(), Bug 3 becomes correct behavior rather than a no-op: the handshake happens in the per-client thread as intended.
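For the handshake to run in the per-client thread, the do_handshake() call has to execute inside run() rather than __init__(), since Bug 3 notes that __init__() runs in the accept loop thread. A minimal sketch of that arrangement follows. HandshakeInRun and its handshake_timeout parameter are hypothetical and are not taken from the OpenTAKServer source; the sock argument is assumed to be an SSL socket accepted with do_handshake_on_connect=False.

```python
import ssl
import threading


class HandshakeInRun(threading.Thread):
    """Hypothetical client thread: the TLS handshake happens in run(), not __init__()."""

    def __init__(self, sock, handshake_timeout=1.0):
        super().__init__(daemon=True)
        # Keep __init__ cheap: it executes in the accept loop's thread.
        self.sock = sock
        self.handshake_timeout = handshake_timeout
        self.handshake_ok = False

    def run(self):
        try:
            self.sock.settimeout(self.handshake_timeout)  # per-operation timeout
            self.sock.do_handshake()  # blocks only this client's thread
            self.handshake_ok = True
        except (ssl.SSLError, OSError):
            # Slow or broken client: drop it without affecting the accept loop.
            self.sock.close()
            return
        # Normal per-client processing would continue here.
```

With this shape the accept loop only constructs the object and calls start(), so a stalled client costs one idle thread rather than a frozen listener.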
Fix for Bug 4
In SocketServer.run(), prune dead threads before appending, or periodically:
    self.clients = [c for c in self.clients if c.is_alive()]
    self.clients.append(new_thread)
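The pruning idiom can be exercised on its own. The sketch below uses throwaway threads rather than ClientController instances, and prune_finished is a name chosen for this illustration. It starts three threads that exit at once and one that stays alive, then filters the list with is_alive().

```python
import threading


def prune_finished(threads):
    """Return only the threads that are still running."""
    return [t for t in threads if t.is_alive()]


def demo():
    """Return (threads tracked before pruning, threads kept after pruning)."""
    stop = threading.Event()
    short_lived = [threading.Thread(target=lambda: None) for _ in range(3)]
    long_lived = threading.Thread(target=stop.wait, daemon=True)
    clients = short_lived + [long_lived]
    for t in clients:
        t.start()
    for t in short_lived:
        t.join()  # these have exited, so is_alive() is now False
    remaining = prune_finished(clients)
    stop.set()  # let the long-lived thread finish
    long_lived.join()
    return len(clients), len(remaining)


if __name__ == "__main__":
    before, after = demo()
    print(f"tracked before pruning: {before}, after pruning: {after}")
```

Rebuilding the list on every accept costs time proportional to the number of tracked clients, which is why pruning periodically is listed as an alternative.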
Reproduction
Reliable reproduction involves two simultaneous slow SSL connections, but the issue occurs in normal operation with mobile ATAK clients on poor or intermittent networks (LTE, weak WiFi). The following confirms the stuck state without needing reproduction:
    # Shows Recv-Q: 1 when stuck
    sudo ss -tulpn | grep :8089
    # Clears it
    sudo systemctl restart eud_handler_ssl
    # Confirm recovery
    sudo ss -tulpn | grep :8089  # Recv-Q should return to 0
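The same check could be scripted for monitoring. The following is a rough sketch, assuming the usual ss column order (State, Recv-Q, Send-Q, Local Address:Port); the function names are invented for this example. It treats a listening socket as stuck when Recv-Q exceeds Send-Q, on the understanding that ss reports the current accept queue length and the configured backlog in those two columns for sockets in LISTEN state.

```python
import subprocess


def parse_listen_queues(ss_output: str, port: int):
    """Return (recv_q, send_q) for the first LISTEN row on `port`, or None if absent."""
    for line in ss_output.splitlines():
        fields = line.split()
        if "LISTEN" not in fields:
            continue
        i = fields.index("LISTEN")  # works with or without the leading Netid column
        if len(fields) > i + 3 and fields[i + 3].endswith(f":{port}"):
            return int(fields[i + 1]), int(fields[i + 2])
    return None


def accept_queue_looks_full(recv_q: int, send_q: int) -> bool:
    # For a listening socket: Recv-Q is the queued connection count,
    # Send-Q is the backlog passed to listen().
    return recv_q > send_q


def read_ss_output() -> str:
    """Run `ss -tln`; return an empty string if ss is unavailable or fails."""
    try:
        result = subprocess.run(["ss", "-tln"], capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return ""
    return result.stdout


if __name__ == "__main__":
    queues = parse_listen_queues(read_ss_output(), 8089)
    if queues is None:
        print("no listening socket found on port 8089")
    else:
        state = "FULL" if accept_queue_looks_full(*queues) else "ok"
        print(f"Recv-Q={queues[0]} Send-Q={queues[1]} -> {state}")
```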
Environment
- OpenTAKServer 1.7.10, native (non-Docker) installation
- Ubuntu 24.04 LTS
- Python SSL module, TLS 1.2, OTS_SSL_VERIFICATION_MODE = 2 (ssl.CERT_REQUIRED)
- ATAK clients connecting from mobile devices over LTE and WiFi
References
- [SSLContext.wrap_socket()](https://docs.python.org/3/library/ssl.html#ssl.SSLContext.wrap_socket), do_handshake_on_connect parameter
- listen(2) man page: backlog parameter behavior, minimum effective value of 1