
fix(serial): watchdog UI Pico CDC link; detect silent stalls #42

Open
mairas wants to merge 1 commit into main from fix/ui-pico-cdc-watchdog


mairas commented Apr 24, 2026

Summary

Closes #41.

When the USB CDC stream from the UI Pico stalls without closing the FD (observed after DUT BOOTSEL flash cycles on the shared xHCI hub on 2026-04-24), the reader thread blocks forever in readline() with no exception. /api/status keeps reporting ui_pico_connected: true while physical start / e-stop / switches silently stop emitting events — the runtime gap that triggered this issue.

Until now, the only recovery was a manual USB unbind/rebind: echo 1-1 | sudo tee /sys/bus/usb/drivers/usb/unbind; echo 1-1 | sudo tee .../bind.

Approach

A per-connection watchdog thread is started alongside the reader when the UI Pico connects:

  • PicoConnection.last_rx_monotonic is updated by the reader on every byte that arrives (events, responses, INFO lines — anything).
  • After UI_PICO_HEARTBEAT_INTERVAL seconds of silence, the watchdog writes PING\n to nudge the link. Any response updates last_rx_monotonic and defuses the watchdog for another interval. PINGs are rate-limited to one per interval.
  • After UI_PICO_HEARTBEAT_INTERVAL * UI_PICO_HEARTBEAT_STALL_FACTOR seconds of continuous silence, the watchdog closes the port. The reader loop hits OSError, marks _ui_pico = None, emits ui_pico_disconnected, and the existing _reconnect_loop reattaches within SERIAL_RECONNECT_INTERVAL.
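
The three rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the watchdog_loop function, its poll parameter, its string return values, and the shape of this PicoConnection stand-in are all assumptions made for the example. The port is any object with write() and close(), and write() is assumed not to block (for instance because a write timeout is configured on the real serial port).

```python
import threading
import time


class PicoConnection:
    """Hypothetical stand-in for the PR's connection object."""

    def __init__(self, port):
        self.port = port  # any object exposing write() and close()
        self.last_rx_monotonic = time.monotonic()  # refreshed by the reader
        self.stop_event = threading.Event()


def watchdog_loop(conn, interval=5.0, stall_factor=3.0, poll=None):
    """PING after `interval` s of silence; close after interval * stall_factor."""
    poll = interval / 4 if poll is None else poll
    last_ping = float("-inf")
    # Event.wait() returns True once stop_event is set, which ends the loop.
    while not conn.stop_event.wait(poll):
        now = time.monotonic()
        silence = now - conn.last_rx_monotonic
        if silence >= interval * stall_factor:
            # Full stall: closing the port unblocks the reader with an error,
            # which hands control to the existing reconnect path.
            conn.port.close()
            return "stalled"
        if silence >= interval and now - last_ping >= interval:
            # Rate-limited nudge: at most one PING per interval.
            try:
                conn.port.write(b"PING\n")
            except OSError:
                return "closed"  # port already gone; the reader will notice
            last_ping = now
    return "stopped"
```

In this sketch the watchdog never sets stop_event itself, so a reader that wakes up with OSError after the close treats it as a real disconnect rather than a shutdown.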

Defaults: interval 5s, factor 3.0 → ~15s to detect a full stall. Configurable via HALSPA_RUNNER_UI_PICO_HEARTBEAT_INTERVAL and HALSPA_RUNNER_UI_PICO_HEARTBEAT_STALL_FACTOR.
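
As a worked example of the arithmetic: the stall threshold is the product of the two settings, so the defaults give 5 × 3.0 = 15 seconds. The helper below is a hypothetical sketch that reads the two environment variables named above; how the runner actually loads its settings is not shown in this PR, and load_heartbeat_config is an invented name.

```python
import os


def load_heartbeat_config(env=None):
    """Return (interval_s, stall_factor, stall_threshold_s) from the environment."""
    env = os.environ if env is None else env
    interval = float(env.get("HALSPA_RUNNER_UI_PICO_HEARTBEAT_INTERVAL", "5"))
    factor = float(env.get("HALSPA_RUNNER_UI_PICO_HEARTBEAT_STALL_FACTOR", "3.0"))
    # The port is closed after this many seconds of continuous silence.
    return interval, factor, interval * factor
```

Setting the interval to 2 while leaving the factor at its default would, under this reading, shorten stall detection to about 6 seconds at the cost of more frequent PINGs on an idle link.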

Also in this PR

  • Reader loop now also catches TypeError and bails quietly when stop_event is set. Addresses the shutdown race where port.close() nulls the FD mid-os.read, which previously logged a scary traceback at every systemd restart.
  • Narrowed the reader's _ui_pico = None update to only nullify the slot if it still points at this connection (protects against a race where reconnect has already swapped in a fresh connection).
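
Both reader changes can be illustrated with a small sketch. The Conn and Runner classes, the events list, and the lock are hypothetical scaffolding for the example; only the _ui_pico slot, stop_event, last_rx_monotonic, and the ui_pico_disconnected event name come from the description above. The sketch also refreshes the timestamp per line rather than per byte, which is a simplification.

```python
import threading
import time


class Conn:
    """Hypothetical connection record; `port` is any object with readline()."""

    def __init__(self, port):
        self.port = port
        self.last_rx_monotonic = time.monotonic()
        self.stop_event = threading.Event()


class Runner:
    """Hypothetical owner of the `_ui_pico` slot."""

    def __init__(self):
        self._ui_pico = None
        self._lock = threading.Lock()
        self.events = []

    def reader_loop(self, conn):
        try:
            while not conn.stop_event.is_set():
                line = conn.port.readline()
                if line:
                    # Any traffic counts as proof of life for the watchdog.
                    conn.last_rx_monotonic = time.monotonic()
                    self.events.append(("line", line))
        except (OSError, TypeError):
            # OSError: port closed or unplugged.
            # TypeError: close() raced an in-flight read during shutdown.
            pass
        if conn.stop_event.is_set():
            return  # deliberate shutdown: exit quietly, nothing to report
        with self._lock:
            # Clear the slot only if it is still ours; the reconnect loop may
            # already have installed a newer connection.
            if self._ui_pico is conn:
                self._ui_pico = None
                self.events.append(("ui_pico_disconnected", None))
```

The identity check (is conn) is what prevents a late-exiting reader for an old connection from wiping out the fresh one.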

Test plan

  • New unit test: watchdog sends PING on idle and closes the port after stall threshold elapses (mocked serial, tight timings).
  • Full suite green (107 tests).
  • On Pi: run the DUT BOOTSEL flash cycle that previously wedged the link; confirm watchdog detects the stall and reconnects within ~15s without operator intervention.

When the USB CDC stream from the UI Pico stalls without closing the
FD (seen after DUT BOOTSEL flash cycles on the shared xHCI hub), the
reader thread blocks forever in readline() with no exception. Status
shows ui_pico_connected: true while physical buttons silently stop
working until someone notices and resets USB manually.

Add a per-connection watchdog that:

- Tracks last_rx_monotonic on every byte received.
- After UI_PICO_HEARTBEAT_INTERVAL seconds of silence, sends PING to
  poke the link. Any response (PONG or otherwise) updates last_rx
  and defuses the watchdog.
- After HEARTBEAT_INTERVAL * STALL_FACTOR seconds of continuous
  silence, closes the port so the reader exits with OSError and the
  existing reconnect loop reattaches.

Also harden the reader loop: catch TypeError from readline (seen
during shutdown race where port.close() nulls the fd mid-os.read)
and bail quietly when stop_event is set.

Refs #41
