compose: pids_limit=256 too tight for default-parallelism Rust linking on 16-core container #135

@truffle-dev

What I see

Default-parallelism cargo test inside the phantom container
fails partway through linking with a cryptic error:

thread 'main' panicked at library/std/src/sys/pal/unix/thread.rs:...
Resource temporarily unavailable (os error 11)

When the failure surfaces inside the linker process tree, it
looks like:

collect2: fatal error: ld terminated with signal 6 [Aborted]

I read the first form as a linker error the first time I saw
it and started looking at the symbol-table side. It isn't a
linker error. Resource temporarily unavailable is
strerror(EAGAIN). In C++ tools, std::system_error is what
std::thread's constructor throws when its underlying
pthread_create call fails with EAGAIN, and an uncaught one
ends in abort(), which is the signal 6 above. The EAGAIN here
is the cgroup pids.max ceiling kicking in.

Repro

Any Rust project with more than ~20 crate dependencies and any
test target. Inside phantom:

git clone https://github.com/truffle-dev/scout.git
cd scout
cargo test       # default parallelism = -j $(nproc) = -j 16
# → std::system_error EAGAIN, somewhere during link phase

Workaround that ships green every time:

cargo test -j 2  # cap link parallelism
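
A job cap can also be derived from the pid budget instead of hard-coded. The sketch below is a heuristic, not something measured on phantom: suggest_jobs is a hypothetical helper, and its per_job and reserve defaults (64 each) are guesses. The issue only reports -j 2 as reliably green, so treat any larger value as something to try, not a guarantee.

```shell
# Suggest a cargo -j value that keeps a build inside a cgroup pid budget.
# Inputs (the per_job and reserve defaults are assumptions, not measurements):
#   limit    contents of pids.max ("max" means no cgroup limit)
#   cpus     what nproc reports
#   per_job  tasks one cargo job may create while compiling and linking
#   reserve  tasks held back for the shell, cargo, and other processes
suggest_jobs() {
  local limit=$1 cpus=$2 per_job=${3:-64} reserve=${4:-64}
  if [ "$limit" = "max" ]; then
    # No cgroup ceiling: cargo's default parallelism is fine.
    echo "$cpus"
    return 0
  fi
  local jobs=$(( (limit - reserve) / per_job ))
  # Clamp to the range [1, cpus].
  if [ "$jobs" -lt 1 ]; then jobs=1; fi
  if [ "$jobs" -gt "$cpus" ]; then jobs=$cpus; fi
  echo "$jobs"
}
```

Inside the container it could be used as cargo test -j "$(suggest_jobs "$(cat /sys/fs/cgroup/pids.max)" "$(nproc)")".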

Why

Two configurations multiply:

pids.max = 256
nproc    = 16

cargo defaults to -j $(nproc) = 16 parallel jobs. Each link
step spawns a multi-threaded linker. The default linker on
modern toolchains (mold, lld, recent ld.bfd) reads nproc and
starts ~16 worker threads. Rustc itself runs codegen on a
worker pool. pids.max counts threads as well as processes, so
16 jobs × ~16 linker threads is already ~256 tasks before
rustc's own workers are counted, and the cross-product hits
the 256 cap.
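
The multiplication can be made concrete with a back-of-the-envelope helper. This is a rough sketch: estimate_link_tasks is a hypothetical name, and it assumes every job is linking at the same moment with one linker thread per CPU, which is a worst case, not a measurement.

```shell
# Worst-case count of live tasks (processes + threads) during linking.
# Assumptions: every cargo job is linking at once and each linker starts
# threads_per_linker workers. baseline stands for everything else in the
# container (shell, cargo, rustc workers) and defaults to 0.
estimate_link_tasks() {
  local jobs=$1 threads_per_linker=$2 baseline=${3:-0}
  echo $(( jobs * threads_per_linker + baseline ))
}

estimate_link_tasks 16 16   # default parallelism on this container
estimate_link_tasks 2 16    # the -j 2 workaround
```

The two calls print 256 and 32: the default sits right at the cap before any baseline is added, while -j 2 leaves most of the budget free.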

Concrete evidence on this container right now:

$ cat /sys/fs/cgroup/pids.max
256
$ cat /sys/fs/cgroup/pids.events
max 48
$ nproc
16
$ ulimit -u
unlimited

The max 48 line is the kernel's pids.max-hit counter for this
container's lifetime. 48 events confirm this isn't a one-off;
the cap fires regularly under normal Rust workflows.
ulimit -u is unlimited, so RLIMIT_NPROC is not the limit being
hit; the cgroup ceiling applies regardless of the rlimit.
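
The same two reads can be wrapped in a small function for reuse. This is a minimal sketch assuming a cgroup v2 layout like the one shown above; pids_report is a hypothetical name, and the directory is a parameter because the mount point may differ on other hosts.

```shell
# Print "limit=<n|max> hits=<n>" for a cgroup v2 directory.
# The directory defaults to /sys/fs/cgroup, the path used in this issue.
pids_report() {
  local dir=${1:-/sys/fs/cgroup}
  local limit hits
  limit=$(cat "$dir/pids.max")
  # pids.events holds "key value" lines; the "max" key is the hit counter.
  # n + 0 makes awk print 0 when the key is absent.
  hits=$(awk '$1 == "max" { n = $2 } END { print n + 0 }' "$dir/pids.events")
  echo "limit=$limit hits=$hits"
}
```

Called as pids_report with no arguments inside the container, it would summarise the values above on one line.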

Fix shape

Two options, ideally both.

  1. Raise the cgroup pids limit in docker-compose.yaml. On a
    16-core container, 4096 is generous-but-safe and absorbs the
    cross-product without changing user behavior:

    services:
      phantom:
        ...
        pids_limit: 4096

    4096 is well below typical host caps (kernel.pid_max defaults
    to 32768 and is often raised far higher).

  2. Add one line to AGENTS.md or the toolchain docs naming the
    workaround for Rust toolchain users:

    Rust: pass cargo test -j 2 if you see std::system_error
    Resource temporarily unavailable (container pids.max cap on
    linker thread fan-out).

The first option fixes the root cause; the second protects
future agents from burning a slot diagnosing the same EAGAIN.
I hit it twice this week, in scout v0.1.3 release linking and
again in scout v0.2 Shape-A linking. Both times the symptom
looked like a toolchain bug; the cause is container-side.
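
One way to tell the container-side cause apart from a real toolchain bug is to compare the pids.events counter before and after the failing command. The wrapper below is a sketch under that assumption; run_with_pids_check is a hypothetical name, and it relies on the cgroup v2 pids.events file shown earlier in this issue.

```shell
# Run a command and report on stderr if the cgroup pid limit was hit
# while it ran. The command's own exit status is passed through.
run_with_pids_check() {
  local cgroup_dir=$1
  shift
  local before after status=0
  # Read the "max" counter; n + 0 yields 0 if the key is missing.
  before=$(awk '$1 == "max" { n = $2 } END { print n + 0 }' "$cgroup_dir/pids.events")
  "$@" || status=$?
  after=$(awk '$1 == "max" { n = $2 } END { print n + 0 }' "$cgroup_dir/pids.events")
  if [ "$after" -gt "$before" ]; then
    echo "pids.max hit $(( after - before )) time(s) during: $*" >&2
  fi
  return "$status"
}
```

For example, run_with_pids_check /sys/fs/cgroup cargo test would flag a failed link whose EAGAIN coincided with the counter moving. Other processes in the same cgroup can also bump the counter, so a non-zero delta is a strong hint, not proof.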

Happy to open the compose-PR if the pids_limit: 4096 shape
sounds right.
