Verify memory-usage estimation across multiprocessing start methods (fork → forkserver in Python 3.14) #1968

@vdusek

Background

Crawlee estimates process memory usage in get_memory_info() (src/crawlee/_utils/system.py). On Linux it sums PSS (Proportional Set Size) across the whole process tree (current_process.children(recursive=True)), specifically so that memory shared between the parent and its children is not counted multiple times. On non-Linux platforms it falls back to RSS. This estimate flows through the Snapshotter into SystemStatus and AutoscaledPool, so it directly drives concurrency autoscaling decisions.
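To make the PSS vs. RSS distinction concrete, here is a small arithmetic sketch. The `ProcessMemory` class and all numbers are hypothetical and purely illustrative; this is not Crawlee's implementation, only the accounting rule that PSS charges each shared page 1/N to each of its N sharers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessMemory:
    """Hypothetical, simplified view of one process's resident memory (MB)."""

    private_mb: float  # pages mapped only by this process
    shared_mb: float   # pages also mapped by other processes
    sharers: int       # number of processes mapping those shared pages

    @property
    def rss_mb(self) -> float:
        # RSS counts every resident page in full, shared or not.
        return self.private_mb + self.shared_mb

    @property
    def pss_mb(self) -> float:
        # PSS charges each shared page 1/N to each of its N sharers.
        return self.private_mb + self.shared_mb / self.sharers


# Illustrative numbers only: a parent and two fork children sharing 300 MB.
tree = [
    ProcessMemory(private_mb=100, shared_mb=300, sharers=3),  # parent
    ProcessMemory(private_mb=10, shared_mb=300, sharers=3),   # child 1
    ProcessMemory(private_mb=10, shared_mb=300, sharers=3),   # child 2
]

rss_total = sum(p.rss_mb for p in tree)
pss_total = sum(p.pss_mb for p in tree)
print(f"sum of RSS: {rss_total:.0f} MB")  # 1020 MB, shared pages counted 3 times
print(f"sum of PSS: {pss_total:.0f} MB")  # 420 MB, shared pages counted once
```

In this toy setup, summing RSS reports 1020 MB for a tree that really occupies 420 MB, which is the kind of overestimate the PSS-based sum is meant to avoid.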

The regression test test_memory_estimation_does_not_overestimate_due_to_shared_memory (tests/unit/_utils/test_system.py) was written to exercise exactly this shared-memory accounting. It originally used a fork multiprocessing context, i.e. children that share copy-on-write pages with the parent, which is the scenario where PSS-based de-duplication actually matters.

Why this came up

PR #1960 silenced two CI warnings, one of them being:

DeprecationWarning: This process is multi-threaded, use of fork() may lead to deadlocks in the child

To do that, it switched the test suite default start method (and that specific test) from fork to spawn. Relevant facts:

  • Forking a multi-threaded process has been deprecated since Python 3.12, which is when the DeprecationWarning above was introduced.
  • Python 3.14 changes the default start method on Linux from fork to forkserver (not spawn). macOS/Windows already default to spawn. See the 3.14 What's New — multiprocessing note.
  • Crawlee's production code does not set any start method, so it inherits the interpreter default: fork on Linux ≤ 3.13, forkserver on Linux 3.14+, spawn on macOS/Windows.
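The defaults listed above can be written down as a small lookup and compared with what the running interpreter reports. `expected_default_start_method` is a hypothetical helper that only encodes the list above; `multiprocessing.get_start_method()` and `multiprocessing.get_all_start_methods()` are the standard-library calls.

```python
import multiprocessing
import sys


def expected_default_start_method(platform: str, version: tuple[int, int]) -> str:
    """Default start method per the list above (hypothetical helper).

    `platform` follows the `sys.platform` convention ("linux", "darwin", "win32").
    Assumption: only the platforms and versions discussed in this issue are covered.
    """
    if platform in ("darwin", "win32"):
        return "spawn"
    if platform == "linux":
        return "forkserver" if version >= (3, 14) else "fork"
    raise ValueError(f"platform not covered by this sketch: {platform!r}")


if __name__ == "__main__":
    # get_start_method() reports what this interpreter will actually use
    # when no start method has been set explicitly.
    actual = multiprocessing.get_start_method()
    print(f"{sys.platform} {sys.version_info[:2]}: default start method = {actual}")
    print("available:", multiprocessing.get_all_start_methods())
```

Running this on each CI target would show which start method production code inherits there, independent of what the test suite pins.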

So after #1960 we'd be testing memory estimation under spawn, while production historically ran under fork and now (3.14+) runs under forkserver. These three start methods differ precisely in how much memory is shared between parent and children — which is the exact thing the estimation is trying to account for.

The concern

The memory estimate may now be skewed (for example, overestimated) under the new default forkserver, and nothing currently verifies it across start methods. Because the estimate feeds autoscaling, an overestimate could throttle concurrency and an underestimate could over-scale it in real runs. This needs a careful look before we assume "spawn in tests is fine".

To verify / decide

  • Validate get_memory_info() PSS-based shared-memory accounting under fork, forkserver, and spawn — does the de-duplication still hold, or is the estimate skewed under non-fork methods?
  • Determine whether the 3.14 default change (forkserver) materially shifts the estimate for realistic workloads — particularly Playwright browser subprocesses (independent OS processes) and any ProcessPoolExecutor usage.
  • Decide whether the regression test should exercise the realistic production start method (or be parametrized across all three) rather than only spawn.
  • Decide whether Crawlee should explicitly assume / document / pin a start method for memory estimation, or make the estimation robust regardless.
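As a starting point for the first item, a rough sketch of how children's PSS could be compared across start methods. It reads /proc/<pid>/smaps_rollup directly rather than calling get_memory_info(), so it is an assumption-laden stand-in and not the Crawlee code path. All helper names are hypothetical, and it returns None wherever the /proc data is unavailable (for example on non-Linux platforms).

```python
import multiprocessing
from pathlib import Path
from typing import Optional


def parse_pss_kb(smaps_rollup_text: str) -> Optional[int]:
    """Extract the 'Pss:' value (in kB) from /proc/<pid>/smaps_rollup content."""
    for line in smaps_rollup_text.splitlines():
        if line.startswith("Pss:"):
            return int(line.split()[1])
    return None


def read_pss_kb(pid: int) -> Optional[int]:
    """Return one process's PSS in kB, or None when /proc data is unavailable."""
    try:
        return parse_pss_kb(Path(f"/proc/{pid}/smaps_rollup").read_text())
    except OSError:
        return None  # non-Linux platform, older kernel, or process already gone


def _idle_child(ready, stop) -> None:
    ready.release()        # tell the parent this child is fully started
    stop.wait(timeout=10)  # stay alive long enough to be sampled


def children_pss_kb(method: str, n_children: int = 2) -> Optional[int]:
    """Start idle children with the given start method and sum their PSS."""
    ctx = multiprocessing.get_context(method)
    ready, stop = ctx.Semaphore(0), ctx.Event()
    children = [
        ctx.Process(target=_idle_child, args=(ready, stop)) for _ in range(n_children)
    ]
    for child in children:
        child.start()
    try:
        if not all(ready.acquire(timeout=5) for _ in children):
            return None  # a child failed to start in time
        samples = [read_pss_kb(child.pid) for child in children]
    finally:
        stop.set()
        for child in children:
            child.join(timeout=5)
    if any(sample is None for sample in samples):
        return None
    return sum(samples)


if __name__ == "__main__":
    # Parent-side ballast: fork children share these pages copy-on-write,
    # while spawn and forkserver children never see them.
    ballast = b"x" * (50 * 1024 * 1024)
    for method in multiprocessing.get_all_start_methods():
        print(method, children_pss_kb(method), "kB")
```

The interesting comparison is how the per-method totals differ while the parent holds the ballast, since that gap is what the shared-memory de-duplication has to absorb. A parametrized version of the regression test could follow the same shape but assert on get_memory_info() instead.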

Notes

Filed from the discussion on #1960.

🤖 Generated with Claude Code

Labels: t-tooling (issues with this label are in the ownership of the tooling team)