Fix intermittent listener test hangs (LISTEN/UNLISTEN deadlock, panic, flaky tests)#174

Merged: chandr-andr merged 2 commits into main from fix/listener-ci-hang on Jun 28, 2026
@chandr-andr

Problem

python/tests/test_listener.py intermittently hung the GitHub runners. The hang was never reproducible locally.

Root causes

  1. AB-BA deadlock in the listener (the actual freeze). update_listen_query held channel_callbacks.read + listen_query.write while acquiring is_listened.write, whereas execute_listen held is_listened.write while acquiring listen_query.read. A clear_* or add_callback call running concurrently with the listen loop could therefore deadlock both tasks, which is exactly the pattern the clear tests exercise.
  2. Unbounded async iteration. test_listener_asynciterator awaited an async for fed by a fire-and-forget asyncio.create_task(notify(...)) whose exceptions were swallowed, so any failure in notify left the loop waiting forever.
  3. LISTEN/NOTIFY race. The listen() tests sent NOTIFY immediately after listen(), racing the background task that issues LISTEN lazily.
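
The lock-ordering problem in root cause 1 can be illustrated with a small sketch. This is not the listener's Rust code: it uses plain asyncio.Lock objects instead of Tokio read/write locks, and take_both and run_pair are hypothetical helper names. It only shows that two tasks taking the same pair of locks in opposite orders can block each other, while a single agreed order cannot.

```python
import asyncio


async def take_both(first: asyncio.Lock, second: asyncio.Lock) -> None:
    """Acquire two locks in the given order, yielding in between."""
    async with first:
        # Yield so the other task has time to grab its own first lock.
        await asyncio.sleep(0.01)
        async with second:
            pass


async def run_pair(consistent_order: bool) -> bool:
    """Return True if both tasks finish, False if they deadlock."""
    # Stand-ins for the two locks named in the description (assumption:
    # plain mutexes are enough to show the ordering problem).
    is_listened = asyncio.Lock()
    listen_query = asyncio.Lock()

    # "Updater" side: listen_query first, then is_listened.
    updater = take_both(listen_query, is_listened)
    if consistent_order:
        listener = take_both(listen_query, is_listened)
    else:
        # "Listen loop" side in the buggy order: is_listened first.
        listener = take_both(is_listened, listen_query)

    try:
        # The timeout turns a deadlock into a visible failure.
        await asyncio.wait_for(asyncio.gather(updater, listener), timeout=0.5)
        return True
    except asyncio.TimeoutError:
        return False


if __name__ == "__main__":
    print("opposite order finishes:", asyncio.run(run_pair(False)))
    print("same order finishes:", asyncio.run(run_pair(True)))
```

With opposite orders each task ends up holding the lock the other one needs, so only the timeout ends the wait. With one shared order the second task simply waits its turn.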

Changes

Rust (src/driver/listener/core.rs)

  • Replaced the listen_query string with applied_channels: HashSet<String>. execute_listen now reconciles desired vs applied channels and issues real UNLISTEN + LISTEN (previously cleared channels were never unlistened on the backend). mark_subscriptions_dirty takes only is_listened, so the lock order is now client → is_listened → channel_callbacks → applied_channels with no cycle.
  • The message-forwarding task no longer panic!s on a broken connection. It now breaks out of a manual poll_message loop, and it also stops cleanly when the receiver is dropped on shutdown.
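
The "desired vs applied" reconciliation described above amounts to two set differences. The following Python sketch shows the idea only. The actual change lives in Rust in src/driver/listener/core.rs, and the function names, identifier quoting, and statement ordering here are assumptions, not the driver's implementation.

```python
def quote_ident(name: str) -> str:
    """Quote a channel name as a PostgreSQL identifier (doubles any '"')."""
    return '"' + name.replace('"', '""') + '"'


def reconcile(desired: set[str], applied: set[str]) -> list[str]:
    """Build the statements that move the backend from `applied` to `desired`.

    Hypothetical helper: channels no longer wanted are unlistened,
    newly wanted channels are listened, and the rest are left alone.
    """
    to_unlisten = sorted(applied - desired)
    to_listen = sorted(desired - applied)
    statements = [f"UNLISTEN {quote_ident(ch)};" for ch in to_unlisten]
    statements += [f"LISTEN {quote_ident(ch)};" for ch in to_listen]
    return statements


if __name__ == "__main__":
    # "b" is already applied and still desired, so it produces no statement.
    for stmt in reconcile(desired={"a", "b"}, applied={"b", "c"}):
        print(stmt)
```

After the statements run successfully, the applied set would be replaced by the desired set, so a later call with unchanged subscriptions yields no statements.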

Tests (python/tests/test_listener.py)

  • wait_until_listening() polls pg_listening_channels() on the listener's own connection to confirm the subscription before notifying (removes the race deterministically).
  • Every wait is bounded by anyio.fail_after. Fixed asyncio.sleep(0.5) calls followed by an assert are replaced with poll-until-condition helpers. The iterator's notify runs in an anyio task group, so failures propagate instead of being swallowed.
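
A poll-until-condition helper of the kind described above might look like the sketch below. It is not the helper from test_listener.py: wait_until and its parameters are hypothetical, and it uses the standard library's asyncio.wait_for in place of anyio.fail_after so that it runs without extra dependencies.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_until(
    condition: Callable[[], Awaitable[bool]],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> None:
    """Poll `condition` until it returns True, or raise on timeout.

    Replaces a fixed sleep followed by an assert: the wait ends as soon
    as the condition holds, and it can never block forever.
    """

    async def poll() -> None:
        while not await condition():
            await asyncio.sleep(interval)

    # Bounded wait: raises asyncio.TimeoutError if the condition
    # never becomes true within `timeout` seconds.
    await asyncio.wait_for(poll(), timeout=timeout)


if __name__ == "__main__":
    state = {"polls": 0}

    async def subscribed() -> bool:
        # Stand-in for a real check such as querying
        # pg_listening_channels() on the listener's connection.
        state["polls"] += 1
        return state["polls"] >= 3

    asyncio.run(wait_until(subscribed, timeout=1.0, interval=0.01))
    print("condition met after", state["polls"], "polls")
```

In a real test the condition would query the database rather than a counter. The timeout value is a judgment call between tolerating slow CI runners and failing fast.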

CI backstop (tox.ini, pyproject.toml)

  • Added pytest-timeout with timeout = 120, so any future hang fails fast with a traceback instead of blocking the runner.

Verification

  • cargo clippy --all-features -- -W clippy::pedantic -D warnings and cargo fmt --check clean.
  • Listener suite passes 8/8, and ran 25× in a row under TOKIO_WORKER_THREADS=1 plus CPU load with no hangs.
  • Full suite (excluding test_ssl_mode, which needs CI certs): 261 passed.
  • Verified UNLISTEN removes channels from pg_listening_channels(), and that an abnormal backend termination no longer panics the worker thread.

chandr-andr and others added 2 commits June 27, 2026 21:41
…lock, end the message-forwarding task cleanly instead of panicking, and bound the listener tests with deterministic waits plus a pytest-timeout backstop.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chandr-andr chandr-andr merged commit e61f034 into main Jun 28, 2026
26 checks passed