Raise RLIMIT_NOFILE on startup; make worker recycling opt-in#105
Open
thejchap wants to merge 7 commits into
Open
Raise RLIMIT_NOFILE on startup; make worker recycling opt-in#105thejchap wants to merge 7 commits into
thejchap wants to merge 7 commits into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR introduces automatic recycling of long-lived Python worker subprocesses after a fixed number of executed tests to bound file-descriptor/state accumulation in large suites, reusing the existing crash-recovery respawn + hook-replay flow.
Changes:
- Added per-worker test counting (
tests_completed) and a hard cap (MAX_TESTS_PER_WORKER = 128) to decide when a worker should be recycled. - Implemented recycle logic in the worker pool to kill and respawn workers once they exceed the cap.
- Added a Rust integration test that runs more than the cap and asserts the worker PID changes exactly as expected.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| crates/tryke_runner/src/worker.rs | Adds MAX_TESTS_PER_WORKER, tracks tests_completed, and exposes a should_recycle() helper. |
| crates/tryke_runner/src/pool.rs | Recycles workers after the threshold and adds a test validating PID turnover at the boundary. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Long-lived Python workers accumulate module-level state across imports — logging handlers, sqlite/ssl objects, atexit callbacks — that holds open file descriptors which `del sys.modules[name]` cannot reclaim because the FDs are owned by objects living outside the module dict. On macOS where the per-process FD soft limit is 256, this exhausts FDs when running tryke against large suites like homeassistant/core (~5,267 test files). WorkerProcess now tracks completed tests; the pool recycles a worker via the existing crash-recovery path (shutdown + ensure_worker respawn with cached hook replay) once the count reaches 128. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recycling inside `run_single_test` could fire mid-unit, dropping the live process before `handle_unit`'s `finalize_hooks` loop ran — and that loop is gated on `state.process.as_mut()`, so per="scope" fixture teardown was silently skipped (or run on a fresh worker's freshly re-imported fixtures, double-counting setup). Move the recycle check to after `finalize_hooks` in `handle_unit`. Adjusts the existing recycle test to submit one unit per test (so the end-of-unit check fires at the threshold) and adds a coverage test that crosses the threshold inside a single unit with a per="scope" fixture, asserting setup and teardown each run exactly once. Co-authored-by: Claude <noreply@anthropic.com>
Replaces the hardcoded `MAX_TESTS_PER_WORKER = 128` cap with
self-reported resource snapshots: every `run_test`/`run_doctest`
response carries a `WorkerHealthWire { rss_bytes, open_fds }` that the
runner consults at unit boundaries alongside a wall-clock age check.
Recycle decisions now name the tripped signal via a `RecycleReason`
enum, so debug logs can attribute drops to memory pressure vs FD
exhaustion vs slow drift.
Soft ceilings live on `WorkerLimits` (1 GiB RSS / 200 FDs / 10 min by
default, all `Option<u64>` so platforms without `/proc/self/fd` or
`resource` simply skip that signal). Tests use the new
`WorkerPool::with_python_path_and_limits` constructor with tiny caps
(or `WorkerLimits::unlimited`) to exercise recycle behaviour
deterministically; production call sites in `tryke`, `tryke_server`,
and `tryke_dev` are unchanged because `with_python_path` defaults to
`WorkerLimits::default()`.
Test coverage: end-to-end age-based recycle test (replaces the
test-count one); existing scope-fixture-teardown test now triggers via
age cap; new unit tests for `evaluate_recycle` priority order and
no-signal/no-cap fallbacks.
https://claude.ai/code/session_01PMbxzSuASTEYbDFEu4SQqy
Co-authored-by: Claude <noreply@anthropic.com>
Three CI test failures (ubuntu/macos/windows on slower Python versions) traced to the same flake: the integration test pinned the recycle count to exactly 1 (= 2 distinct worker pids), but on slow runners the first unit's end-of-unit age check already trips the 100 ms cap, producing an extra recycle and 3 pids. The "age recycle eventually fires" property still holds in that scenario, so relax the assertion to >= 2 distinct pids and document why. The "no recycle when under cap" property is already covered by evaluate_recycle_returns_none_when_no_signals_tripped. Also address two of the Copilot review comments on c20ca3b: - _measure_rss_bytes returns peak RSS (ru_maxrss), not current. Update the docstring to call that out and explain why peak is the right signal for recycling (transient spikes leave fragmentation/state we can't free). - _measure_open_fds was off by one because iterating /proc/self/fd or /dev/fd holds an extra dir FD that gets counted. Subtract it so the reading reflects FDs held by the worker, not the measurement. Regenerated reporter doc samples (v0.0.27 -> v0.0.28 + new scheduler warning) caught up by the generate-reporter-examples hook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recycling thresholds (1 GiB RSS / 200 FDs / 600 s) were tuned to mitigate the macOS 256-FD per-process soft default. That mitigation has costs: surprise process churn on suites that legitimately run long or hold many FDs, plus the lost Python startup/import cache. Address the underlying FD ceiling directly instead. On startup tryke now raises its own RLIMIT_NOFILE soft limit toward the inherited hard limit, following the systemd convention (1024/524288) Home Assistant OS 16 adopted in mid-2025: ship modest soft defaults, generous hard defaults, and let applications raise their own at startup. Worker subprocesses inherit the bumped rlimit, so the FD-exhaustion failure mode disappears without churning interpreters. `WorkerLimits` defaults all three caps to `None` (opt-in only). The recycling machinery itself is unchanged — the same code path is still exercised by tests, just not on every `tryke test` invocation. macOS quirk: kern.maxfilesperproc caps per-process FDs below the reported hard limit; on EINVAL we fall back to a 10240 target, which still dwarfs the 256-FD default. Docs: new "Resource limits" section in guides/configuration.md covering both the FD bump and the opt-in recycling knobs.
84c8101 to
dbca0ac
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 14 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
crates/tryke/Cargo.toml:33
[target.'cfg(unix)'.dependencies]currently includes tokio and multiple workspace crates. That makes these dependencies unix-only, which will break building thetrykecrate on non-Unix targets (e.g. Windows) because the code still imports/uses tokio and the other crates unconditionally. Keep onlylibcunder the unix target table, and move tokio / tryke_* deps back under the main[dependencies]table (or add a new[dependencies]section after the target table).
[target.'cfg(unix)'.dependencies]
libc = "0.2"
tokio = { workspace = true }
tokio-stream = { workspace = true }
tryke_config = { workspace = true }
tryke_discovery = { workspace = true }
tryke_reporter = { workspace = true }
tryke_runner = { workspace = true }
tryke_server = { workspace = true }
tryke_types = { workspace = true }
Comment on lines
+35
to
+36
| //! inside a sandbox that pinned it) we log at `debug` and proceed — | ||
| //! the user can still override via `ulimit -n` before launching. |
Comment on lines
+178
to
+183
| **Defaults: all caps disabled.** Out of the box no worker is ever | ||
| recycled — the runner pairs the FD-limit bump above with the assumption | ||
| that most suites do not need process churn, and recycling has costs of | ||
| its own (Python interpreter startup latency, cached fixture loss). If | ||
| you have a suite that leaks memory or FDs over hours of runtime, the | ||
| hooks are still there: |
The previous commit accidentally inserted `[target.'cfg(unix)'.dependencies]` mid-stream in the `[dependencies]` table, sweeping tokio, tokio-stream, and the six tryke_* workspace crates into the unix-only section. `uv sync` (which builds the maturin wheel) on Windows then failed with "unresolved dependencies" because none of those crates were available on the windows target. Move the unix-gated table to the end so it only contains `libc`, restoring the cross-platform dependency surface. Also reword the fdlimit module preamble per Copilot review: this module returns an `io::Result` rather than logging directly; the caller (main.rs) is what logs at warn level.
| @@ -44,6 +158,7 @@ impl WorkerProcess { | |||
| python_path: &[&Path], | |||
| root: &Path, | |||
| log_level: log::LevelFilter, | |||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Address the macOS 256-FD per-process soft default at its source — the runner's own rlimit — instead of papering over it with mandatory worker recycling.
RLIMIT_NOFILEbump on startup (crates/tryke/src/fdlimit.rs). On Unix, tryke now raises its own soft limit toward the inherited hard limit before any worker is spawned. Children inherit the bumped rlimit, so a single call lifts the FD ceiling for every Python interpreter the runner spawns. Follows the systemd convention (soft 1024 / hard 524288) that Home Assistant OS 16 adopted in mid-2025: ship modest soft defaults, generous hard defaults, and let each application raise its own soft limit at startup based on actual needs.kern.maxfilesperproccaps per-process FDs below the reported hard limit, so unconditionalsoft = hardreturnsEINVAL. On rejection we fall back to a 10 240 target — still dwarfs the 256-FD soft default that caused the failure mode. Windows: no-op (noRLIMIT_NOFILEanalogue).ulimit -nbefore launching.Worker recycling is now fully opt-in (
WorkerLimits::default()returns all-None). The same recycle machinery from c20ca3b is retained — peak RSS, open FDs, wall-clock age, in priority order memory > FDs > age, deferred to end-of-unit soper="scope"teardown runs. But the previous 1 GiB / 200 FDs / 600 s defaults are gone: they made the runner's behaviour depend on workload shape rather than user intent, and surprised suites that legitimately ran long or held many FDs. With the FD ceiling raised at startup the recycling motivation is mostly gone; users who still need it on long-lived suites can wire explicit caps throughWorkerLimits.Docs: new "Resource limits" section in
docs/guides/configuration.mdcovering both the FD bump (with the Home Assistant link) and the opt-in recycling knobs.Options summary
RLIMIT_NOFILE(process-wide)EINVAL; warning on other errorsWorkerLimits::max_rss_bytesNonegetrusage(RUSAGE_SELF).ru_maxrss(normalised to bytes) crosses cap. POSIX only.WorkerLimits::max_open_fdsNone/proc/self/fdor/dev/fdcount crosses cap. Linux + macOS only.WorkerLimits::max_ageNoneNoneon any worker-side cap means "never recycle on this signal". When multiple signals trip at once the runner reports the strongest in priority order memory > FDs > age, so debug logs attribute the recycle to the most user-visible failure mode.Test plan
fdlimit::tests::raise_returns_meaningful_outcome_on_unix— getrlimit succeeds, raise returnsRaised { from <= to }orSkipped { current > 0 }.fdlimit::tests::raise_is_idempotent— a second call never lowers the soft limit.pool::tests::worker_recycles_when_age_exceeds_limit— passes explicitWorkerLimitswith a tight age cap; recycle still fires.pool::tests::recycle_does_not_skip_scope_fixture_teardown— explicit limits, end-of-unit recycle still respectsper="scope"teardown.worker::tests::evaluate_recycle_*— priority / no-signal / no-reading / unlimited coverage on the pure decision helper.cargo nextest run --workspace --all-features— 629 passed.cargo clippy --workspace --all-targets --all-features -- -D warnings— clean.uvx prek run -a— clean.References