Handle ECHILD in ProcessWaitState.TryReapChild instead of FailFast by adityamandaleeka · Pull Request #124124 · dotnet/runtime

adityamandaleeka · 2026-02-07T02:49:40Z

Summary

ProcessWaitState.TryReapChild calls Environment.FailFast when waitpid returns ECHILD, crashing the process. This change handles ECHILD gracefully by marking the child as exited.

Problem

The SIGCHLD handling path in CheckChildren does a two-step reap:

1. waitid(P_ALL, WEXITED | WNOHANG | WNOWAIT): peek at a terminated child without consuming its waitable status
2. waitpid(pid, WNOHANG): actually reap the specific child

Between steps 1 and 2, any code in the process that calls waitpid or wait can reap the child first. When this happens, our waitpid(pid) returns -1 with errno = ECHILD, and TryReapChild calls Environment.FailFast.
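The window between the two steps can be reproduced outside the runtime. The following is a minimal C sketch, not code from this PR: the function name demo_reap_race is hypothetical, the "other" reaper runs on the same thread so the outcome is deterministic, and the peek uses a blocking waitid(P_PID, ...) rather than P_ALL with WNOHANG so that it does not depend on timing.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the errno seen by the second reap attempt (expected: ECHILD),
 * 0 if that attempt unexpectedly succeeded, or -1 on setup failure. */
int demo_reap_race(void)
{
    /* Make sure children become waitable zombies even if the parent
     * process inherited SIGCHLD = SIG_IGN. */
    signal(SIGCHLD, SIG_DFL);

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(7);                       /* child: terminate immediately */

    /* Step 1: peek. WNOWAIT leaves the child in a waitable state. */
    siginfo_t info;
    info.si_pid = 0;
    if (waitid(P_PID, (id_t)child, &info, WEXITED | WNOWAIT) != 0)
        return -1;
    assert(info.si_pid == child);

    /* Stand-in for unrelated code (for example a native library calling
     * wait or waitpid) that consumes the status between the two steps. */
    int status;
    if (waitpid(child, &status, 0) != child)
        return -1;

    /* Step 2: the intended reap now has nothing left to wait for. */
    errno = 0;
    pid_t r = waitpid(child, &status, WNOHANG);
    return r == -1 ? errno : 0;
}
```

Called from a small main, demo_reap_race() is expected to return ECHILD on Linux, which is the condition that currently leads to FailFast.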

This is a known pattern — #33297 documents the same crash caused by a native library (WiringPi) calling wait(-1) which reaped children started by .NET. While that case was resolved by fixing the native library, the runtime should not FailFast for a race it cannot prevent in all cases. The checkAll path in CheckChildren (which blindly calls TryReapChild on all registered children when an unmanaged child is detected) is another scenario where waitpid can return ECHILD for an already-reaped child.

This has been observed intermittently in aspnetcore CI (example log) in tests that use RemoteExecutor.Invoke(), and recently became more frequent after a vstest version bump (17.12 → 18.0.1) that changed process lifecycle timing.

Fix

Handle ECHILD from WaitPidExitedNoHang by calling SetExited() without setting _exitCode, following the same pattern used by CheckForNonChildExit (which detects a process is gone via kill(pid, 0) but has no exit code available). The _exitCode field is int? — leaving it null causes Process.ExitCode to return the default value 0 (via UpdateHasExited).

Non-ECHILD errors from waitpid still trigger FailFast.
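The shape of this handling can be sketched in C. This is an illustration of the decision logic only, not the C# change in this PR: the names reap_result and try_reap_child are hypothetical, abort() stands in for FailFast, and mapping a signal death to 128 + signal number is an assumption borrowed from shell convention.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum {
    CHILD_RUNNING,         /* child has not exited yet */
    CHILD_REAPED,          /* we reaped it; *exit_code is valid */
    CHILD_ALREADY_REAPED   /* ECHILD: someone else reaped it; no exit code */
} reap_result;

reap_result try_reap_child(pid_t pid, int *exit_code)
{
    assert(pid > 0 && exit_code != NULL);

    int status;
    pid_t r;
    do {
        r = waitpid(pid, &status, WNOHANG);
    } while (r == -1 && errno == EINTR);

    if (r == pid) {
        *exit_code = WIFEXITED(status) ? WEXITSTATUS(status)
                                       : 128 + WTERMSIG(status);
        return CHILD_REAPED;
    }
    if (r == 0)
        return CHILD_RUNNING;
    if (errno == ECHILD)
        return CHILD_ALREADY_REAPED;   /* mark as exited, leave *exit_code alone */

    abort();                           /* any other error remains fatal */
}
```

The caller decides what to report when the result is CHILD_ALREADY_REAPED; in the PR that corresponds to leaving _exitCode as null.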

Why ECHILD is safe to handle

waitpid(pid) returns ECHILD when:

The pid is not a child of the calling process
The child was already reaped by another waitpid call
SIGCHLD is set to SIG_IGN

In our context, the pid comes from s_childProcessWaitStates, placed there by Process.Start after fork. It is definitively our child, so ECHILD means it was already reaped. The P/Invoke declaration for WaitPidExitedNoHang already documents this as a known return value:

"if pid is not a child or there are no unwaited-for children, -1 is returned (errno=ECHILD)"
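The third case in the list above is easy to overlook, so here is a small C sketch of it. It assumes the behavior described in the Linux waitpid(2) man page for SIGCHLD set to SIG_IGN (terminated children are not kept as zombies and waitpid fails with ECHILD); the function name echild_when_sigchld_ignored is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the errno from waitpid on our own child while SIGCHLD is ignored
 * (expected: ECHILD), 0 if waitpid succeeded, or -1 on setup failure. */
int echild_when_sigchld_ignored(void)
{
    struct sigaction ignore, saved;
    memset(&ignore, 0, sizeof ignore);
    ignore.sa_handler = SIG_IGN;
    sigemptyset(&ignore.sa_mask);
    if (sigaction(SIGCHLD, &ignore, &saved) != 0)
        return -1;

    int result = -1;
    pid_t child = fork();
    if (child == 0)
        _exit(0);                      /* child: terminate immediately */
    if (child > 0) {
        int status;
        pid_t r;
        do {
            errno = 0;
            r = waitpid(child, &status, 0);   /* returns once the child is gone */
        } while (r == -1 && errno == EINTR);
        result = (r == -1) ? errno : 0;
    }

    /* Restore the previous SIGCHLD disposition. */
    int rc = sigaction(SIGCHLD, &saved, NULL);
    assert(rc == 0);
    (void)rc;
    return result;
}
```

Here the pid is definitely a child of the caller and no other code reaped it, yet waitpid still reports ECHILD.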

Related issues

Inconsistent "Error while reaping child" when running multiple instances of Tesseract concurrently in Ubuntu via Process #33297 — same crash from native library reaping .NET's children
assert in ProcessWaitState on Linux arm64 #74795 — same crash in CI, closed by disabling tests on Mono arm64

adityamandaleeka · 2026-02-07T02:50:46Z

@tmds What do you think about this change?

Copilot

Pull request overview

Updates Unix child-process reaping in System.Diagnostics.Process to avoid crashing the entire process when waitpid fails with ECHILD due to a race (child already reaped by another waiter).

Changes:

Treat waitpid(...) = -1 with errno=ECHILD as a recoverable condition and mark the ProcessWaitState as exited (without an exit code).
Preserve existing FailFast behavior for non-ECHILD waitpid failures.
Ensure terminal bookkeeping is still updated for terminal-using children in the ECHILD path.

jkotas · 2026-02-07T03:36:04Z

This looks similar to #70705.

I do not understand the root cause of the problem from the description. Is this working around a bug in some other native library? If it is the case, we want to get the bug in that native library fixed. Replacing a fail-fast with an invalid behavior is not an improvement.

dotnet-policy-service · 2026-02-07T03:37:10Z

Tagging subscribers to this area: @dotnet/area-system-diagnostics-process
See info in area-owners.md if you want to be subscribed.

adityamandaleeka · 2026-02-07T03:43:59Z

@jkotas Yea, after I opened this I was pointed to the historical discussions about this that I missed when searching. I'm going to try to collect more info from aspnetcore CI to identify what else is running in the test process. Hopefully that will help identify the root cause, and if it turns out to be a bug in external code, we can fix it there.

adityamandaleeka · 2026-02-07T05:19:41Z

Hmm, so my CI job ran and this is the list of loaded libraries:

libhostfxr.so
libhostpolicy.so
libcoreclr.so
libclrjit.so
libSystem.Native.so
libSystem.IO.Compression.Native.so
libSystem.Security.Cryptography.Native.OpenSsl.so
ld-linux-x86-64.so.2
libc.so.6
libm.so.6
libpthread.so.0
librt.so.1
libdl.so.2
libgcc_s.so.1
libstdc++.so.6.0.30
libssl.so.3
libcrypto.so.3
libicudata.so.70.1
libicui18n.so.70.1
libicuuc.so.70.1
liblttng-ust.so.1.0.0
liblttng-ust-common.so.1.0.0
liblttng-ust-tracepoint.so.1.0.0
libmsquic.so.2.5.6
libmsquic.lttng.so.2.5.6
libnuma.so.1.0.0

(from https://helixr1107v0xdeko0k025g8.blob.core.windows.net/dotnet-aspnetcore-refs-pull-65355-merge-ab58f059f6074fb389/Microsoft.AspNetCore.Http.Extensions.Tests--net11.0/1/console.b6cf5cd5.log?helixlogtype=result&skoid=8eda00af-b5ec-4be9-b69b-0919a2338892&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-02-07T04%3A55%3A13Z&ske=2026-02-07T05%3A55%3A13Z&sks=b&skv=2024-11-04&sv=2024-11-04&se=2026-02-07T05%3A55%3A13Z&sr=b&sp=rl&sig=EOnOjn03PWFj0iqeU1tf8yHTFwh8oOldMlCtL2xUDuo%3D)

Nothing stands out there. I checked lttng and msquic and didn't see scary waitpid usage.

adityamandaleeka · 2026-02-07T05:31:19Z

Worth noting that we have other places in the runtime where ECHILD is handled without failing fast:

runtime/src/coreclr/pal/src/synchmgr/synchmanager.cpp

Line 2934 in 2d638dc

else if (ECHILD == errno)

runtime/src/coreclr/pal/src/thread/process.cpp

Line 3330 in 2d638dc

else if (ECHILD == errno)

jkotas · 2026-02-07T07:50:55Z

runtime/src/coreclr/pal is the legacy PAL that we are actively working towards deleting. The legacy PAL is not the best example of how to do things right on Unix. The only remaining uses of CreateProcess from the legacy PAL should be in superpmi (JIT testing tool) and maybe the debugger. It should not be used by the runtime itself. If there is a bug in that code, the best path forward is to figure out how to delete it and launch the process directly using the C runtime instead.

tmds · 2026-02-07T09:46:23Z

"Replacing a fail-fast with an invalid behavior is not an improvement."

@adityamandaleeka this is the reason for the FailFast.

@jkotas @adamsitnik note that we can't handle this better by using process handles, see #47631 (comment).

adityamandaleeka · 2026-02-07T17:18:26Z

I see. Thanks for the comments and links. I'm not pushing back (I understand the concern about silently hiding a failure) and I don't mind closing this PR if we decide not to change this behavior for now.

But I did also check how other runtimes handle ECHILD in their child-reaping paths and wanted to leave a breadcrumb here, if for no other reason than that future versions of us will see it the next time someone opens this issue 😆.

libuv handles ECHILD in its SIGCHLD handler path on Linux (https://github.com/libuv/libuv/blob/26a97ad4425ca2f0a911c6412f19b089b9dbf527/src/unix/process.c#L139-L144):

   if (pid == -1) {
       if (errno != ECHILD)
           abort();
       /* The child died, and we missed it. This probably means someone else
        * stole the waitpid from us. Handle this by not handling it at all. */
       continue;
   }

Only non-ECHILD errors abort. On ECHILD they skip the process: they neither fire the exit callback nor crash. This was added in libuv/libuv@bae2992 with the commit message: "Bug #3504 seems to affect more platforms than just OpenBSD. As this seems to be a race condition in these kernels, we do not want to fail because of it." libuv#3504 describes the same pattern: waitpid returning ECHILD for a known child.

OpenJDK HotSpot returns exit code 0 on ECHILD in os::fork_and_exec (https://github.com/openjdk/jdk/blob/9cd25d517c25477be6643bfb795843ca080d4e38/src/hotspot/os/posix/os_posix.cpp#L2138-L2143):

   while (::waitpid(pid, &status, 0) < 0) {
       switch (errno) {
       case ECHILD: return 0;
       case EINTR: break;
       default: return -1;
       }
   }

And in their managed Process API, when waitForProcessExit0 gets ECHILD it returns a NOT_A_CHILD sentinel to Java, which then polls isAlive0() until the process is gone and defaults the exit code to 0 (https://github.com/openjdk/jdk/blob/9cd25d517c25477be6643bfb795843ca080d4e38/src/java.base/share/classes/java/lang/ProcessHandleImpl.java#L148-L170):

   if (exitValue == NOT_A_CHILD) {
       // pid not alive or not a child of this process
       // If it is alive wait for it to terminate
       // ... polls isAlive0() with backoff ...
       exitValue = 0;
   }
   newCompletion.complete(exitValue);
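The same fallback (reap if possible, otherwise poll until the pid is gone and default the exit code to 0) can be approximated in C. This is a sketch under assumptions, not OpenJDK's code: it uses kill(pid, 0) as the liveness check, as CheckForNonChildExit is described as doing in the PR description, and the names pid_exists and wait_for_exit_or_default plus the max_polls bound are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Returns 1 if a process with this pid currently exists, 0 otherwise. */
int pid_exists(pid_t pid)
{
    assert(pid > 0);               /* pid <= 0 would address process groups */
    if (kill(pid, 0) == 0)
        return 1;
    return errno == EPERM;         /* exists, but we may not signal it */
}

/* Waits for pid to exit. Returns its exit code when it could be reaped,
 * 0 when waitpid reported ECHILD and the pid is gone (exit code unknown),
 * and -1 on any other error or if the pid is still alive after max_polls. */
int wait_for_exit_or_default(pid_t pid, int max_polls)
{
    int status;
    pid_t r;
    do {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);

    if (r == pid)
        return WIFEXITED(status) ? WEXITSTATUS(status)
                                 : 128 + WTERMSIG(status);
    if (errno != ECHILD)
        return -1;

    struct timespec delay = { 0, 1000000L };   /* start at 1 ms */
    for (int i = 0; i < max_polls; i++) {
        if (!pid_exists(pid))
            return 0;              /* gone; exit code unavailable */
        nanosleep(&delay, NULL);
        if (delay.tv_nsec < 100000000L)
            delay.tv_nsec *= 2;    /* exponential backoff, capped at 128 ms */
    }
    return -1;
}
```

Note that a liveness check based on the pid alone cannot rule out pid reuse, which is part of why returning a default exit code is a trade-off rather than a correct answer.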

Handle ECHILD in ProcessWaitState.TryReapChild instead of FailFast.

fd71777

Copilot AI review requested due to automatic review settings February 7, 2026 02:49

github-actions bot added the needs-area-label label Feb 7, 2026

dotnet-policy-service bot assigned adityamandaleeka Feb 7, 2026

Copilot started reviewing on behalf of adityamandaleeka February 7, 2026 02:50

Copilot AI reviewed Feb 7, 2026

jkotas added the area-System.Diagnostics.Process label and removed the needs-area-label label Feb 7, 2026

build-analysis bot mentioned this pull request Feb 7, 2026

Cannot find 'arm64-v8a' device dotnet/dnceng#2284


adityamandaleeka wants to merge 1 commit into dotnet:main from adityamandaleeka:echild


3 participants
