Cancel pending futures and exit listener on EOF / unexpected exceptions#44
Cancel pending futures and exit listener on EOF / unexpected exceptions#44rhalutz wants to merge 1 commit into
Conversation
|
@rhalutz Can you rebase on top of master? It should allow us to create a test build |
There was a problem hiding this comment.
Pull request overview
This PR fixes a disconnect-handling edge case in the local TCP socket transport so the listener exits cleanly on EOF / socket-level failures and in-flight send_command() calls fail fast instead of waiting for the full timeout.
Changes:
- Treat graceful EOF and other socket read failures as fatal in
RiscoSocket._listen()and exit the listen loop. - Fail any pending command futures immediately when the connection is lost.
- Add unit tests to ensure the listen loop terminates and errors are surfaced via the socket queue.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
pyrisco/local/risco_socket.py |
Expands disconnect exception handling and attempts to fail pending command futures when the connection drops or the listener errors. |
tests/test_risco_socket.py |
Adds regression tests covering EOF/RST/OSError/unexpected exceptions ensuring _listen() exits and pending futures are resolved. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for i, fut in enumerate(self._futures): | ||
| if fut is not None and not fut.done(): | ||
| fut.set_exception(OperationError('Connection lost')) | ||
| self._futures[i] = None |
There was a problem hiding this comment.
Done — the slot is now cleared for every entry regardless of state, and set_exception is only called on the still-pending ones:
for i, fut in enumerate(self._futures):
if fut is not None and not fut.done():
fut.set_exception(OperationError('Connection lost'))
self._futures[i] = None| except Exception as error: | ||
| await self._queue.put(error) | ||
| break |
There was a problem hiding this comment.
Addressed by changing the approach: the generic except Exception branch no longer breaks the listener (I'd added that break in the first revision, but it was too aggressive). Recoverable, per-command errors — the OperationError deliberately raised on a panel error frame at line 68, CRC mismatches, a late-response InvalidStateError — are surfaced on the queue but must not tear down the connection, which matches the original behaviour.
So there's no longer a "listener has exited but futures are still pending" state for non-connection errors: the listener keeps running. The only paths that stop the listener are real connection loss (ConnectionError / IncompleteReadError), and those now fail-and-clear all pending futures before breaking. If a write inside _handle_incoming fails on a dead socket it raises a ConnectionError, which lands in that same branch.
| else: | ||
| await self._handle_incoming(cmd_id, command, crc) | ||
| except ConnectionResetError as error: | ||
| except (ConnectionResetError, asyncio.IncompleteReadError, EOFError, OSError) as error: |
There was a problem hiding this comment.
I think OSError is a bit too general. Can we have something more specific?
There was a problem hiding this comment.
Good point — narrowed it to (ConnectionError, asyncio.IncompleteReadError):
ConnectionErroris the stdlib base for the connection-death family (ConnectionResetError/ECONNRESET,ConnectionAbortedError/ECONNABORTED,ConnectionRefusedError/ECONNREFUSED,BrokenPipeError/EPIPE) — all subclasses ofOSError, but only the ones that mean "the socket is gone".asyncio.IncompleteReadErroris the concrete typeStreamReader.readuntilraises on a graceful FIN (and it's itself a subclass ofEOFError, so the previousEOFErrorin the tuple was redundant and is dropped).
I also reverted the break I'd added to the generic except Exception branch — see the reply on Copilot's comment below for the reasoning.
d66ed10 to
4a35f22
Compare
The _listen loop never exits when the panel drops the TCP session
gracefully, and pending send_command() calls wait the full 10s
asyncio.wait_for timeout instead of failing fast.
The only except branch that breaks is ConnectionResetError (RST).
A graceful EOF from the panel raises asyncio.IncompleteReadError,
which falls through to the generic handler that neither breaks the
loop nor cancels pending futures, so:
1. The listener keeps running on a dead socket.
2. The in-flight CLOCK keep-alive waits the full 10s timeout, then
raises OperationError('Timeout in command: CLOCK').
3. _keep_alive sleeps 5s and tries again -- the cycle repeats
forever.
Downstream effect in Home Assistant: a 15s "Error in Risco library /
Timeout in command: CLOCK" log storm, and the integration never
reconnects.
Catch the connection-loss family (ConnectionError, which covers
ECONNRESET / ECONNABORTED / ECONNREFUSED / EPIPE) together with
asyncio.IncompleteReadError (the graceful-EOF type readuntil raises),
cancel all pending futures with OperationError('Connection lost') so
send_command() callers fail in milliseconds instead of after 10s, and
break out of the loop. Recoverable per-command errors continue to be
surfaced on the queue without tearing down the listener, preserving
the existing behaviour for panel error frames and CRC mismatches.
4a35f22 to
1edd177
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
pyrisco/local/risco_socket.py:89
- PR description says the listener should exit on “unexpected exceptions too”, but the generic
except Exceptionbranch currently only queues the error and continues the loop. Either the description should be updated to match the implemented behavior (treating these as recoverable), or this handler should break/tear down consistently on truly-fatal unexpected exceptions.
except Exception as error:
await self._queue.put(error)
| self._futures[i] = None | ||
| await self._queue.put(error) | ||
| break |
Summary
The
_listenloop inRiscoSocketnever exits when the panel drops the TCPsession gracefully, and pending
send_command()calls wait the full 10sasyncio.wait_fortimeout instead of failing fast.The only
exceptbranch thatbreaks isConnectionResetError(RST). Agraceful EOF from the panel raises
asyncio.IncompleteReadError, which hitsthe bare
except Exception:branch — that branch doesn't break the loop anddoesn't cancel any pending futures, so:
CLOCKkeep-alive waits the full 10s timeout, then raisesOperationError('Timeout in command: CLOCK')._keep_alivesleeps 5s and tries again — the cycle repeats forever.Downstream effect in Home Assistant: a 15s
Error in Risco library / Timeout in command: CLOCKlog storm, and the integration never reconnects (see home-assistant/core#169279, home-assistant/core#167628, home-assistant/core#154062, and #36 in this repo).Changes
exceptto(ConnectionResetError, asyncio.IncompleteReadError, EOFError, OSError)so a graceful FIN exits the loop the same way an RST does.OperationError('Connection lost')sosend_command()callers fail in milliseconds, not after 10s.breakon unexpected exceptions too, so a bug in_handle_incomingcan't silently keep the listener spinning on a half-broken socket.Testing
Added
tests/test_risco_socket.py(first tests for the local socket) covering: graceful EOF, RST, OS error, and an unexpected exception. All verify the loop exits and that any pending future is resolved withOperationError. Each test wraps_listeninasyncio.wait_for(..., 1)so a regression to the spinning-loop bug fails fast in CI instead of hanging — without this patch all four tests hang.Verified end-to-end against a real Risco panel (pyrisco 0.6.8 + HA 2026.4 container): after the panel times out the session, the integration sees the connection drop immediately, HA reloads the entry, and the keep-alive resumes — instead of the previous indefinite CLOCK-timeout loop.
Related
_errorreload check beyondConnectionResetErrorwould let the integration self-heal cleanly — happy to file that separately.