Skip to content

RSDK-13862 Add watchdogs to speaker and microphone #39

Open
oliviamiller wants to merge 3 commits intoviam-modules:mainfrom
oliviamiller:speaker-stream-stall
Open

RSDK-13862 Add watchdogs to speaker and microphone #39
oliviamiller wants to merge 3 commits intoviam-modules:mainfrom
oliviamiller:speaker-stream-stall

Conversation

@oliviamiller
Copy link
Copy Markdown
Collaborator

@oliviamiller oliviamiller commented May 5, 2026

Summary
Adds a background StallWatchdog to Speaker and Microphone that detects
when the PortAudio callback stops firing and recovers automatically.

Audio callbacks can stop firing for a few reasons — USB device unplugged,
kernel re-enumeration, ALSA xrun escalation, driver hiccup. Before this PR
the only recovery path for the speaker was an inline staleness log (warn-only,
no restart), and the mic's recovery only ran while a get_audio call was
active.

  • A background watchdog on each component polls how long it's been since
    the audio callback last fired; if the gap exceeds the staleness threshold,
    it triggers a stream restart.
    • On restart, the configured device id is re-resolved against the current
      audio device state, so a USB device that came back at a different card
      index is found automatically.
    • After several consecutive failed restarts, the watchdog drops to a slow
      retry cadence instead of giving up permanently — when the device comes
      back, recovery happens on its own.
    • If the configured device isn't currently present, the restart is skipped
      outright to avoid noisy driver-level errors during the wait.

Manual Testing
USB hot-replug: unplug the device, wait, plug it back in. Logs show 3
fast retry attempts, then 5s backoff retries until the replug — at
which point the next retry resolves the new card index and succeeds.

@oliviamiller oliviamiller requested a review from seanavery May 5, 2026 15:36
Copy link
Copy Markdown

@seanavery seanavery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Just got done with my first pass here

Comment thread src/watchdog.hpp
// instead retries every BACKOFF_INTERVAL. This supports hot-replug scenarios: the device
// may come back later (USB plug-in, driver recovery) and we want to resume recovery
// without forcing the user to reconfigure.
constexpr std::chrono::milliseconds BACKOFF_INTERVAL{2000};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want 2 second or 5 second backoff interval?

Comment thread test/watchdog_test.cpp

// 1. Restart fires when the callback is stale and attempts are below the budget.
TEST(StallWatchdog, RestartFiresWhenStale) {
const auto ctx = make_context_stale_by(5000); // 5s past threshold
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be 2000 to match watchdog implementation?

Comment thread src/speaker.hpp
// Member variables
std::string device_name_;
double latency_;
int sample_rate_;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[maybe nit] should we also remove sample rate now that we have StreamParams attached to the class?

Comment thread src/speaker.hpp
std::string device_name_;
double latency_;
int sample_rate_;
int num_channels_;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need this?

Comment thread src/watchdog.hpp
}

private:
void loop() {
Copy link
Copy Markdown

@seanavery seanavery May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any race condition considerations between reconfigure and this loop that calls restart_stalled_stream?

Basically, do we need to reconsider locking strategy because of restart path moving from get_audio to poll?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants