Verify memory-usage estimation across multiprocessing start methods (fork → forkserver in Python 3.14)

## Background

Crawlee estimates process memory usage in [`get_memory_info()`](https://github.com/apify/crawlee-python/blob/master/src/crawlee/_utils/system.py) (`src/crawlee/_utils/system.py`). On Linux it sums **PSS (Proportional Set Size)** across the whole process tree (`current_process.children(recursive=True)`), specifically so that memory *shared* between the parent and its children is not counted multiple times. On non-Linux platforms it falls back to RSS. This estimate flows into the snapshotter → `SystemStatus` → `AutoscaledPool`, so it directly drives concurrency autoscaling decisions.
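To make the RSS vs. PSS difference concrete, here is a small arithmetic sketch. The function names and the numbers are hypothetical and are not taken from Crawlee's code. It assumes every process in the tree maps the same shared region and additionally holds some private memory.

```python
def summed_rss_mib(shared_mib: float, private_mib: float, n_procs: int) -> float:
    """Sum of RSS over the tree: each process counts the shared region in full."""
    return n_procs * (shared_mib + private_mib)


def summed_pss_mib(shared_mib: float, private_mib: float, n_procs: int) -> float:
    """Sum of PSS over the tree: the shared region is split evenly among its users."""
    return n_procs * (shared_mib / n_procs + private_mib)


# Hypothetical tree: 1 parent + 3 children, 100 MiB shared, 10 MiB private each.
rss_total = summed_rss_mib(100, 10, 4)  # 440: shared region counted four times
pss_total = summed_pss_mib(100, 10, 4)  # 140.0: shared region counted once
```

With these illustrative numbers, summing RSS reports roughly three times what the tree actually occupies, which is the kind of overestimate the PSS-based sum is meant to avoid.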

The regression test `test_memory_estimation_does_not_overestimate_due_to_shared_memory` (`tests/unit/_utils/test_system.py`) was written to exercise exactly this shared-memory accounting, and it originally used a **`fork`** multiprocessing context — i.e. children that share copy-on-write pages with the parent, which is the scenario where PSS-based de-duplication actually matters.

## Why this came up

PR #1960 silenced two CI warnings, one of them being:

```
DeprecationWarning: This process is multi-threaded, use of fork() may lead to deadlocks in the child
```

To do that, it switched the test suite default start method (and that specific test) from `fork` to **`spawn`**. Relevant facts:

- Since **Python 3.12**, calling `fork()` in a multi-threaded process is deprecated and emits the `DeprecationWarning` shown above.
- **Python 3.14 changes the default start method on Linux from `fork` to `forkserver`** (not `spawn`). macOS/Windows already default to `spawn`. See the [3.14 What's New — multiprocessing](https://docs.python.org/3.14/whatsnew/3.14.html#multiprocessing) note.
- Crawlee's **production code does not set any start method**, so it inherits the interpreter default: `fork` on Linux ≤ 3.13, `forkserver` on Linux 3.14+, `spawn` on macOS/Windows.
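As a quick way to confirm which start method a given interpreter and platform actually use, a minimal check with the standard library could look like the sketch below. The helper name `default_start_method` is made up for illustration. Note that `multiprocessing.get_start_method()` fixes the default for the process if none has been set yet, so this is meant for a throwaway diagnostic script, not for library code.

```python
import multiprocessing as mp
import sys


def default_start_method() -> str:
    """Return the start method in effect (fixes the default if none is set yet)."""
    return mp.get_start_method()


if __name__ == "__main__":
    version = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Python {version} on {sys.platform}")
    print(f"default start method:    {default_start_method()}")
    print(f"available start methods: {mp.get_all_start_methods()}")
```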

So after #1960 we'd be **testing memory estimation under `spawn`, while production historically ran under `fork` and now (3.14+) runs under `forkserver`.** These three start methods differ precisely in how much memory is shared between parent and children — which is the exact thing the estimation is trying to account for.

## The concern

Under the new default `forkserver`, the memory estimate may behave differently than it did under `fork` (for example, it may overestimate), and nothing currently verifies it across start methods. Because the estimate feeds autoscaling, a wrong value could throttle concurrency unnecessarily or let it over-scale in real runs. This needs a careful look before we assume that using `spawn` in tests is fine.

## To verify / decide

- [ ] Validate `get_memory_info()` PSS-based shared-memory accounting under **`fork`**, **`forkserver`**, and **`spawn`** — does the de-duplication still hold, or is the estimate skewed under non-fork methods?
- [ ] Determine whether the 3.14 default change (`forkserver`) materially shifts the estimate for realistic workloads — particularly Playwright browser subprocesses (independent OS processes) and any `ProcessPoolExecutor` usage.
- [ ] Decide whether the regression test should exercise the **realistic production start method** (or be parametrized across all three) rather than only `spawn`.
- [ ] Decide whether Crawlee should explicitly **assume / document / pin** a start method for memory estimation, or make the estimation robust regardless.
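One possible starting point for the first checklist item is sketched below. It is a standalone, Linux-only probe and not Crawlee's `get_memory_info()`: it reads PSS directly from `/proc/<pid>/smaps_rollup` (assumed to be available, i.e. a reasonably recent kernel) instead of going through Crawlee's code path. All helper names are hypothetical. It only counts the current process and the children it starts itself; helper processes that `forkserver` or `spawn` may add to the process tree are not included, so the result is not directly comparable to a recursive walk over all descendants.

```python
import multiprocessing as mp
import os
from pathlib import Path


def parse_pss_kib(smaps_rollup_text: str) -> int:
    """Extract the Pss value (KiB) from the text of /proc/<pid>/smaps_rollup."""
    for line in smaps_rollup_text.splitlines():
        # Match "Pss:" exactly; "Pss_Anon:" and similar lines are breakdowns.
        if line.startswith("Pss:"):
            return int(line.split()[1])
    raise ValueError("no 'Pss:' line found")


def read_pss_bytes(pid: int) -> int:
    """Read one process's PSS in bytes (Linux only)."""
    text = Path(f"/proc/{pid}/smaps_rollup").read_text()
    return parse_pss_kib(text) * 1024


def _hold(ready, done) -> None:
    """Child body: report readiness, then idle until the parent has measured."""
    ready.set()
    done.wait()


def measure_tree_pss(start_method: str, n_children: int = 2) -> int:
    """Sum PSS over this process and n_children idle children (hypothetical helper)."""
    ctx = mp.get_context(start_method)
    done = ctx.Event()
    readies = [ctx.Event() for _ in range(n_children)]
    procs = [ctx.Process(target=_hold, args=(r, done)) for r in readies]
    for proc in procs:
        proc.start()
    try:
        for ready in readies:
            ready.wait(timeout=30)
        pids = [os.getpid(), *(proc.pid for proc in procs)]
        return sum(read_pss_bytes(pid) for pid in pids)
    finally:
        # Always release and reap the children, even if a read failed.
        done.set()
        for proc in procs:
            proc.join()
```

To compare start methods, call `measure_tree_pss(method)` for each entry of `multiprocessing.get_all_start_methods()` from under an `if __name__ == "__main__":` guard (required for `spawn` and `forkserver`), ideally after the parent has allocated a sizeable object so that copy-on-write sharing under `fork` becomes visible in the totals.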

## Notes

- PR #1960 is being closed in favor of this issue; the warning-silencing change is intentionally deferred until the memory-estimation behavior across start methods is understood.

_Filed from the discussion on #1960._

🤖 Generated with [Claude Code](https://claude.com/claude-code)

