Context
Current worker tests use threading.Thread with run_worker_loop: multiple ZnDraw clients in the same process, each polling jobs.listen(). This works well for claim isolation, lifecycle, and room scoping, but doesn't test true distributed scenarios such as a worker or server process dying and restarting.
Proposed
Add subprocess-based worker tests that exercise:
Worker kill/restart
- Worker picks up task → process killed mid-execution → task marked as failed (timeout/sweeper)
- Worker killed → new worker spawns → re-registers same jobs → picks up pending tasks
- Multiple workers killed simultaneously → remaining workers absorb workload
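The kill step in the scenarios above could be sketched roughly as follows. This is only an illustration of the process-control part: `FAKE_WORKER`, `spawn_worker`, and `kill_worker` are hypothetical names, and the child is a plain sleeping Python process standing in for a real worker entry point, which this issue does not specify.

```python
import subprocess
import sys

# Hypothetical stand-in for a real worker: it simply blocks, as if mid-task.
FAKE_WORKER = "import time; time.sleep(60)"


def spawn_worker() -> subprocess.Popen:
    """Start the stand-in worker as a separate OS process."""
    return subprocess.Popen([sys.executable, "-c", FAKE_WORKER])


def kill_worker(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Hard-kill the process (SIGKILL on POSIX) and return its exit code."""
    proc.kill()
    # wait() reaps the child so it does not linger as a zombie.
    return proc.wait(timeout=timeout)


if __name__ == "__main__":
    worker = spawn_worker()
    print("worker pid:", worker.pid)
    print("exit code after kill:", kill_worker(worker))
```

A real test would then assert on the server side that the task the worker had claimed ends up failed or reassigned; that part depends on the project's fixtures and is left out here.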
Server kill/restart
- Server dies while workers are connected → workers detect disconnect → reconnect when server returns
- Server restarts on same port (investigate: does SO_REUSEADDR / SO_REUSEPORT allow immediate rebind, or is there a TIME_WAIT issue?)
- Server restarts with fresh DB vs persistent DB — worker token invalidation behavior
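One way to investigate the rebind question in isolation, without involving the server at all, is a small socket probe. The sketch below is an assumption about how such a probe might look, not a statement about what uvicorn does; `make_listener` and `can_rebind_after_close` are hypothetical helper names. It closes the accepted connection from the server side first, so the server's end of the connection is the one that lingers, then tries to bind the same port again.

```python
import socket


def make_listener(port: int, reuse: bool) -> socket.socket:
    """Create a listening TCP socket on localhost, optionally with SO_REUSEADDR."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def can_rebind_after_close(reuse: bool) -> bool:
    """Open a listener, serve one connection, close it, then try to rebind."""
    first = make_listener(0, reuse)  # port 0: let the OS pick a free port
    port = first.getsockname()[1]
    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = first.accept()
    conn.close()    # server side closes first, so its end is the lingering one
    client.close()
    first.close()
    try:
        second = make_listener(port, reuse)
    except OSError:
        return False
    second.close()
    return True


if __name__ == "__main__":
    # The result without SO_REUSEADDR is platform dependent.
    print("rebind without SO_REUSEADDR:", can_rebind_after_close(reuse=False))
    print("rebind with SO_REUSEADDR:   ", can_rebind_after_close(reuse=True))
```

The option is set on both the original and the replacement listener here, because some platforms consider the flags of the lingering socket as well as those of the new one.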
Multi-worker subprocess patterns
- N workers as subprocesses, each with own ZnDraw client
- Submit M tasks → verify all M complete across N workers
- Kill worker subprocess → verify its claimed tasks get reassigned
Implementation considerations
- Use subprocess.Popen or multiprocessing.Process for workers
- Need reliable kill (SIGTERM/SIGKILL) and health-check patterns
- Server restart on same port: check if uvicorn's Server.shutdown() releases the socket immediately, or if a delay/retry is needed
- Consider pytest-timeout for tests involving process death
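For the reliable-kill and health-check points above, a minimal sketch could combine a SIGTERM-then-SIGKILL escalation with a generic polling helper. Both `stop_process` and `wait_until` are hypothetical helper names introduced for illustration, and the grace period and polling interval are arbitrary defaults.

```python
import subprocess
import time
from typing import Callable


def stop_process(proc: subprocess.Popen, grace: float = 2.0) -> int:
    """Ask the process to exit (SIGTERM); escalate to a hard kill if it does not."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.terminate()
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()


def wait_until(predicate: Callable[[], bool], timeout: float = 5.0,
               interval: float = 0.05) -> bool:
    """Poll predicate until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check at the deadline
```

In a test, `wait_until` could wrap whatever readiness check is available (for example, a lambda that tries to connect to the server port), while `stop_process` would run in fixture teardown so that no subprocess outlives the test even when an assertion fails.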
Related
- Current worker tests: tests/zndraw/worker/
- Worker resilience tests: tests/zndraw/worker/test_resilience.py (already tests SIO reconnect, but in-process)
- Server factory fixture: tests/zndraw/conftest.py (server_factory)