Background
Crawlee estimates process memory usage in get_memory_info() (src/crawlee/_utils/system.py). On Linux it sums PSS (Proportional Set Size) across the whole process tree (current_process.children(recursive=True)), specifically so that memory shared between the parent and its children is not counted multiple times. On non-Linux platforms it falls back to RSS. This estimate flows into the snapshotter → SystemStatus → AutoscaledPool, so it directly drives concurrency autoscaling decisions.
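To make the de-duplication concrete, here is a minimal arithmetic sketch. It is not Crawlee's code, and the helper names and numbers are made up for illustration. It assumes every process in the tree maps the same shared region, so PSS charges each process an equal share of it, while RSS charges each process for the full region.

```python
def summed_rss_kb(private_kb: list[int], shared_kb: int) -> float:
    """Sum of per-process RSS: every process is charged the full shared region."""
    return float(sum(private + shared_kb for private in private_kb))


def summed_pss_kb(private_kb: list[int], shared_kb: int) -> float:
    """Sum of per-process PSS: the shared region is split evenly among its users."""
    n_processes = len(private_kb)
    return float(sum(private + shared_kb / n_processes for private in private_kb))


# Hypothetical tree: a parent with 100 kB of private memory and two children
# with 50 kB each, all three mapping the same 300 kB of shared pages.
private = [100, 50, 50]
shared = 300

print(summed_rss_kb(private, shared))  # shared pages counted three times
print(summed_pss_kb(private, shared))  # shared pages counted once
```

With these numbers the RSS sum is 1100 kB, while the PSS sum is 500 kB: the 200 kB of private memory plus the 300 kB shared region counted once.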
The regression test test_memory_estimation_does_not_overestimate_due_to_shared_memory (tests/unit/_utils/test_system.py) was written to exercise exactly this shared-memory accounting, and it originally used a fork multiprocessing context — i.e. children that share copy-on-write pages with the parent, which is the scenario where PSS-based de-duplication actually matters.
Why this came up
PR #1960 silenced two CI warnings, one of them being:
DeprecationWarning: This process is multi-threaded, use of fork() may lead to deadlocks in the child
To do that, it switched the test suite default start method (and that specific test) from fork to spawn. Relevant facts:
- Forking a multi-threaded process has been deprecated since Python 3.12.
- Python 3.14 changes the default start method on Linux from fork to forkserver (not spawn). macOS/Windows already default to spawn. See the multiprocessing note in the Python 3.14 What's New.
- Crawlee's production code does not set any start method, so it inherits the interpreter default: fork on Linux ≤ 3.13, forkserver on Linux 3.14+, spawn on macOS/Windows.
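The defaults listed above can be written down as a small lookup, which may help when reasoning about which start method a given environment ends up with. This is an illustrative helper, not part of Crawlee; the function name is hypothetical, and it encodes only the three cases stated in this issue.

```python
import sys


def expected_default_start_method(platform: str, version: tuple[int, int]) -> str:
    """Return the interpreter default start method for the cases listed above.

    `platform` uses sys.platform values ("linux", "darwin", "win32").
    """
    if platform in ("darwin", "win32"):
        return "spawn"
    if platform == "linux":
        return "forkserver" if version >= (3, 14) else "fork"
    # Other platforms are not covered by the facts in this issue.
    raise ValueError(f"platform not covered by this sketch: {platform!r}")


if sys.platform in ("linux", "darwin", "win32"):
    print(expected_default_start_method(sys.platform, sys.version_info[:2]))
```

At runtime, multiprocessing.get_start_method() reports the method actually in effect for the current interpreter.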
So after #1960 we'd be testing memory estimation under spawn, while production historically ran under fork and now (3.14+) runs under forkserver. These three start methods differ precisely in how much memory is shared between parent and children — which is the exact thing the estimation is trying to account for.
The concern
The memory estimation may now behave differently (potentially be off / overestimate) under the new default forkserver, and nothing currently verifies that across start methods. Because the estimate feeds autoscaling, a wrong estimate could throttle or over-scale concurrency in real runs. This needs a careful look before we assume "spawn in tests is fine".
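One way to check this empirically is to start a few idle children under each start method and compare the summed PSS of the resulting processes. The sketch below is a rough, Linux-only illustration and not Crawlee's implementation: it reads PSS from /proc/<pid>/smaps_rollup directly instead of walking the process tree the way get_memory_info() does, and all helper names are hypothetical. It counts only the parent and the workers it starts, so helper processes such as the fork server are left out.

```python
import multiprocessing
import os
import time
from pathlib import Path


def parse_pss_kb(smaps_rollup_text: str) -> int:
    """Extract the 'Pss:' value (in kB) from the text of a smaps_rollup file."""
    for line in smaps_rollup_text.splitlines():
        # Match only the total, not breakdown lines such as 'Pss_Anon:'.
        if line.startswith("Pss:"):
            return int(line.split()[1])
    raise ValueError("no 'Pss:' line found")


def read_pss_kb(pid: int) -> int:
    """Read the PSS of one process from /proc (Linux only)."""
    return parse_pss_kb(Path(f"/proc/{pid}/smaps_rollup").read_text())


def _idle(seconds: float) -> None:
    """Child target: do nothing, just stay alive long enough to be measured."""
    time.sleep(seconds)


def tree_pss_kb(method: str, n_children: int = 2) -> int:
    """Summed PSS of this process plus n_children idle workers started via `method`."""
    ctx = multiprocessing.get_context(method)
    children = [ctx.Process(target=_idle, args=(30.0,)) for _ in range(n_children)]
    for child in children:
        child.start()
    try:
        time.sleep(1.0)  # crude wait so the children finish starting up
        pids = [os.getpid(), *(child.pid for child in children)]
        return sum(read_pss_kb(pid) for pid in pids)
    finally:
        for child in children:
            child.terminate()
            child.join()
```

To try it, call tree_pss_kb(method) for each entry of multiprocessing.get_all_start_methods() from a script, under an `if __name__ == "__main__":` guard so that spawn and forkserver can re-import the module safely.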
To verify / decide
- get_memory_info() PSS-based shared-memory accounting under fork, forkserver, and spawn: does the de-duplication still hold, or is the estimate skewed under non-fork methods?
- Whether the new default start method (forkserver) materially shifts the estimate for realistic workloads, particularly Playwright browser subprocesses (independent OS processes) and any ProcessPoolExecutor usage.
- […] spawn.
Notes
Filed from the discussion on #1960.
🤖 Generated with Claude Code