
Handle ECHILD in ProcessWaitState.TryReapChild instead of FailFast#124124

Open
adityamandaleeka wants to merge 1 commit into dotnet:main from adityamandaleeka:echild

Conversation

@adityamandaleeka
Member

Summary

ProcessWaitState.TryReapChild calls Environment.FailFast when waitpid returns ECHILD, crashing the process. This change handles ECHILD gracefully by marking the child as exited.

Problem

The SIGCHLD handling path in CheckChildren does a two-step reap:

  1. waitid(P_ALL, WEXITED | WNOHANG | WNOWAIT) — peek at a terminated child without consuming its waitable status
  2. waitpid(pid, WNOHANG) — actually reap the specific child

Between steps 1 and 2, any code in the process that calls waitpid or wait can reap the child first. When this happens, our waitpid(pid) returns -1 with errno = ECHILD, and TryReapChild calls Environment.FailFast.

This is a known pattern — #33297 documents the same crash caused by a native library (WiringPi) calling wait(-1) which reaped children started by .NET. While that case was resolved by fixing the native library, the runtime should not FailFast for a race it cannot prevent in all cases. The checkAll path in CheckChildren (which blindly calls TryReapChild on all registered children when an unmanaged child is detected) is another scenario where waitpid can return ECHILD for an already-reaped child.

This has been observed intermittently in aspnetcore CI (example log) in tests that use RemoteExecutor.Invoke(). It recently became more frequent after a vstest version bump (17.12 → 18.0.1) changed process lifecycle timing.

Fix

Handle ECHILD from WaitPidExitedNoHang by calling SetExited() without setting _exitCode, following the same pattern used by CheckForNonChildExit (which detects a process is gone via kill(pid, 0) but has no exit code available). The _exitCode field is int? — leaving it null causes Process.ExitCode to return the default value 0 (via UpdateHasExited).

Non-ECHILD errors from waitpid still trigger FailFast.
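
For readers less familiar with the managed code, the shape of the proposed handling can be sketched in C. This is an illustration only, not the code in this PR: `wait_state`, `reap_result`, and `try_reap_child` are hypothetical names, the "exited but no exit code" state stands in for leaving `_exitCode` null, and the `128 + signal` mapping for signaled children is an assumption borrowed from the usual shell convention.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the managed ProcessWaitState fields. */
typedef struct {
    bool exited;
    bool has_exit_code; /* false models a null exit code */
    int exit_code;
} wait_state;

typedef enum {
    REAP_STILL_RUNNING,
    REAP_EXITED,         /* we reaped the child and know its exit code */
    REAP_ALREADY_REAPED, /* ECHILD: someone else consumed the status */
    REAP_FATAL           /* any other waitpid error */
} reap_result;

reap_result try_reap_child(pid_t pid, wait_state *state)
{
    int status;
    pid_t r;
    do {
        r = waitpid(pid, &status, WNOHANG);
    } while (r < 0 && errno == EINTR);

    if (r == pid) {
        state->exited = true;
        state->has_exit_code = true;
        state->exit_code = WIFEXITED(status) ? WEXITSTATUS(status)
                                             : 128 + WTERMSIG(status);
        return REAP_EXITED;
    }
    if (r == 0)
        return REAP_STILL_RUNNING;
    if (errno == ECHILD) {
        /* Mark as exited, but leave the exit code unset. */
        state->exited = true;
        return REAP_ALREADY_REAPED;
    }
    return REAP_FATAL; /* the caller would still fail fast here */
}
```

Keeping the ECHILD outcome distinct from a normal reap lets the caller decide what to report when the exit code is unknown, rather than silently treating the two cases as identical.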

Why ECHILD is safe to handle

waitpid(pid) returns ECHILD when:

  • The pid is not a child of the calling process
  • The child was already reaped by another waitpid call
  • SIGCHLD is set to SIG_IGN

In our context, the pid comes from s_childProcessWaitStates, placed there by Process.Start after fork. It is definitively our child, so ECHILD means it was already reaped. The P/Invoke declaration for WaitPidExitedNoHang already documents this as a known return value:

"if pid is not a child or there are no unwaited-for children, -1 is returned (errno=ECHILD)"

Related issues

Copilot AI review requested due to automatic review settings February 7, 2026 02:49
@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Feb 7, 2026
@adityamandaleeka
Member Author

@tmds What do you think about this change?

Contributor

Copilot AI left a comment

Pull request overview

Updates Unix child-process reaping in System.Diagnostics.Process to avoid crashing the entire process when waitpid fails with ECHILD due to a race (child already reaped by another waiter).

Changes:

  • Treat waitpid(...) returning -1 with errno = ECHILD as a recoverable condition and mark the ProcessWaitState as exited (without an exit code).
  • Preserve existing FailFast behavior for non-ECHILD waitpid failures.
  • Ensure terminal bookkeeping is still updated for terminal-using children in the ECHILD path.

@jkotas
Member

jkotas commented Feb 7, 2026

This looks similar to #70705.

I do not understand the root cause of the problem from the description. Is this working around a bug in some other native library? If it is the case, we want to get the bug in that native library fixed. Replacing a fail-fast with an invalid behavior is not an improvement.

@jkotas jkotas added area-System.Diagnostics.Process and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Feb 7, 2026
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-diagnostics-process
See info in area-owners.md if you want to be subscribed.

@adityamandaleeka
Member Author

@jkotas Yea, after I opened this I was pointed to the historical discussions about this that I missed when searching. I'm going to try to collect more info from aspnetcore CI to identify what else is running in the test process. Hopefully that will help identify the root cause, and if it turns out to be a bug in external code, we can fix it there.

@adityamandaleeka
Member Author

Hmm, so my CI job ran and this is the list of loaded libraries:

  • libhostfxr.so
  • libhostpolicy.so
  • libcoreclr.so
  • libclrjit.so
  • libSystem.Native.so
  • libSystem.IO.Compression.Native.so
  • libSystem.Security.Cryptography.Native.OpenSsl.so
  • ld-linux-x86-64.so.2
  • libc.so.6
  • libm.so.6
  • libpthread.so.0
  • librt.so.1
  • libdl.so.2
  • libgcc_s.so.1
  • libstdc++.so.6.0.30
  • libssl.so.3
  • libcrypto.so.3
  • libicudata.so.70.1
  • libicui18n.so.70.1
  • libicuuc.so.70.1
  • liblttng-ust.so.1.0.0
  • liblttng-ust-common.so.1.0.0
  • liblttng-ust-tracepoint.so.1.0.0
  • libmsquic.so.2.5.6
  • libmsquic.lttng.so.2.5.6
  • libnuma.so.1.0.0

(from https://helixr1107v0xdeko0k025g8.blob.core.windows.net/dotnet-aspnetcore-refs-pull-65355-merge-ab58f059f6074fb389/Microsoft.AspNetCore.Http.Extensions.Tests--net11.0/1/console.b6cf5cd5.log?helixlogtype=result&skoid=8eda00af-b5ec-4be9-b69b-0919a2338892&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-02-07T04%3A55%3A13Z&ske=2026-02-07T05%3A55%3A13Z&sks=b&skv=2024-11-04&sv=2024-11-04&se=2026-02-07T05%3A55%3A13Z&sr=b&sp=rl&sig=EOnOjn03PWFj0iqeU1tf8yHTFwh8oOldMlCtL2xUDuo%3D)

Nothing stands out there. I checked lttng and msquic and didn't see scary waitpid usage.

@adityamandaleeka
Member Author

Worth noting that we have other places in the runtime where ECHILD is handled without failing fast:

else if (ECHILD == errno)

else if (ECHILD == errno)

@jkotas
Member

jkotas commented Feb 7, 2026

runtime/src/coreclr/pal is legacy PAL that we are actively working towards deleting. The legacy PAL is not the best example of how to do things right on Unix. The only remaining uses of CreateProcess from legacy PAL should be in superpmi (JIT testing tool) and maybe the debugger. It should not be used by the runtime itself. If there is a bug in that code, the best path forward is to figure out how to delete it and launch the process directly using the C runtime instead.

@tmds
Member

tmds commented Feb 7, 2026

"Replacing a fail-fast with an invalid behavior is not an improvement."

@adityamandaleeka this is the reason for the FailFast.

@jkotas @adamsitnik note that we can't handle this better by using process handles, see #47631 (comment).

@adityamandaleeka
Member Author

adityamandaleeka commented Feb 7, 2026

I see. Thanks for the comments and links. I'm not pushing back (I understand the concern about silently hiding a failure) and I don't mind closing this PR if we decide not to change this behavior for now.

But I did also check how other runtimes handle ECHILD in their child-reaping paths and wanted to leave a breadcrumb here, if for no other reason than that future versions of us will see it the next time someone opens this issue 😆.

libuv handles ECHILD in its SIGCHLD handler path on Linux (https://github.com/libuv/libuv/blob/26a97ad4425ca2f0a911c6412f19b089b9dbf527/src/unix/process.c#L139-L144):

   if (pid == -1) {
       if (errno != ECHILD)
           abort();
       /* The child died, and we missed it. This probably means someone else
        * stole the waitpid from us. Handle this by not handling it at all. */
       continue;
   }

Only non-ECHILD errors abort. On ECHILD, libuv skips the process: it neither fires the exit callback nor crashes. This was added in libuv/libuv@bae2992 with the commit message: "Bug #3504 seems to affect more platforms than just OpenBSD. As this seems to be a race condition in these kernels, we do not want to fail because of it." libuv#3504 describes the same pattern: waitpid returning ECHILD for a known child.

OpenJDK HotSpot returns exit code 0 on ECHILD in os::fork_and_exec (https://github.com/openjdk/jdk/blob/9cd25d517c25477be6643bfb795843ca080d4e38/src/hotspot/os/posix/os_posix.cpp#L2138-L2143):

   while (::waitpid(pid, &status, 0) < 0) {
       switch (errno) {
       case ECHILD: return 0;
       case EINTR: break;
       default: return -1;
       }
   }

And in their managed Process API, when waitForProcessExit0 gets ECHILD it returns a NOT_A_CHILD sentinel to Java, which then polls isAlive0() until the process is gone and defaults the exit code to 0 (https://github.com/openjdk/jdk/blob/9cd25d517c25477be6643bfb795843ca080d4e38/src/java.base/share/classes/java/lang/ProcessHandleImpl.java#L148-L170):

   if (exitValue == NOT_A_CHILD) {
       // pid not alive or not a child of this process
       // If it is alive wait for it to terminate
       // ... polls isAlive0() with backoff ...
       exitValue = 0;
   }
   newCompletion.complete(exitValue);
