Handle ECHILD in ProcessWaitState.TryReapChild instead of FailFast#124124
Handle ECHILD in ProcessWaitState.TryReapChild instead of FailFast#124124adityamandaleeka wants to merge 1 commit intodotnet:mainfrom
Conversation
|
@tmds What do you think about this change? |
There was a problem hiding this comment.
Pull request overview
Updates Unix child-process reaping in System.Diagnostics.Process to avoid crashing the entire process when waitpid fails with ECHILD due to a race (child already reaped by another waiter).
Changes:
- Treat
waitpid(...)= -1witherrno=ECHILDas a recoverable condition and mark theProcessWaitStateas exited (without an exit code). - Preserve existing
FailFastbehavior for non-ECHILDwaitpidfailures. - Ensure terminal bookkeeping is still updated for terminal-using children in the
ECHILDpath.
|
This looks similar to #70705 . I do not understand the root cause of the problem from the description. Is this working around a bug in some other native library? If it is the case, we want to get the bug in that native library fixed. Replacing a fail-fast with an invalid behavior is not an improvement. |
|
Tagging subscribers to this area: @dotnet/area-system-diagnostics-process |
|
@jkotas Yea, after I opened this I was pointed to the historical discussions about this that I missed when searching. I'm going to try to collect more info from aspnetcore CI to identify what else is running in the test process. Hopefully that will help identify the root cause, and if it turns out to be a bug in external code, we can fix it there. |
|
Hmm, so my CI job ran and this is the list of loaded libraries:
Nothing stands out there. I checked lttng and msquic and didn't see scary waitpid usage. |
|
Worth noting that we have other places in the runtime where ECHILD is handled without failing fast: runtime/src/coreclr/pal/src/thread/process.cpp Line 3330 in 2d638dc |
|
|
@adityamandaleeka this is the reason for the @jkotas @adamsitnik note that we can't handle this better by using process handles, see #47631 (comment). |
|
I see. Thanks for the comments and links. I'm not pushing back (I understand the concern about silently hiding a failure) and I don't mind closing this PR if we decide not to change this behavior for now. But I did also check how other runtimes handle ECHILD in their child-reaping paths and wanted to leave a breadcrumb here, if for no other reason than that future versions of us will see it the next time someone opens this issue 😆. libuv handles ECHILD in its SIGCHLD handler path on Linux (https://github.com/libuv/libuv/blob/26a97ad4425ca2f0a911c6412f19b089b9dbf527/src/unix/process.c#L139-L144): if (pid == -1) {
if (errno != ECHILD)
abort();
/* The child died, and we missed it. This probably means someone else
* stole the waitpid from us. Handle this by not handling it at all. */
continue;
}Only non-ECHILD errors abort. On ECHILD they skip the process and don't fire the exit callback or crash. This was added in libuv/libuv@bae2992] with the commit message: "Bug #3504 seems to affect more platforms than just OpenBSD. As this seems to be a race condition in these kernels, we do not want to fail because of it." libuv#3504 describes the same pattern: waitpid returning ECHILD for a known child. OpenJDK HotSpot returns exit code 0 on ECHILD in while (::waitpid(pid, &status, 0) < 0) {
switch (errno) {
case ECHILD: return 0;
case EINTR: break;
default: return -1;
}
}And in their managed Process API, when waitForProcessExit0 gets ECHILD it returns a NOT_A_CHILD sentinel to Java, which then polls isAlive0() until the process is gone and defaults the exit code to 0 (https://github.com/openjdk/jdk/blob/9cd25d517c25477be6643bfb795843ca080d4e38/src/java.base/share/classes/java/lang/ProcessHandleImpl.java#L148-L170): if (exitValue == NOT_A_CHILD) {
// pid not alive or not a child of this process
// If it is alive wait for it to terminate
// ... polls isAlive0() with backoff ...
exitValue = 0;
}
newCompletion.complete(exitValue); |
Summary
ProcessWaitState.TryReapChildcallsEnvironment.FailFastwhenwaitpidreturnsECHILD, crashing the process. This change handlesECHILDgracefully by marking the child as exited.Problem
The SIGCHLD handling path in
CheckChildrendoes a two-step reap:waitid(P_ALL, WEXITED | WNOHANG | WNOWAIT)— peek at a terminated child without consuming its waitable statuswaitpid(pid, WNOHANG)— actually reap the specific childBetween steps 1 and 2, any code in the process that calls
waitpidorwaitcan reap the child first. When this happens, ourwaitpid(pid)returns-1witherrno = ECHILD, andTryReapChildcallsEnvironment.FailFast.This is a known pattern — #33297 documents the same crash caused by a native library (WiringPi) calling
wait(-1)which reaped children started by .NET. While that case was resolved by fixing the native library, the runtime should notFailFastfor a race it cannot prevent in all cases. ThecheckAllpath inCheckChildren(which blindly callsTryReapChildon all registered children when an unmanaged child is detected) is another scenario wherewaitpidcan returnECHILDfor an already-reaped child.This has been observed intermittently in aspnetcore CI (example log) in tests that use
RemoteExecutor.Invoke(), recently made more frequent by a vstest version bump (17.12 → 18.0.1) that changed process lifecycle timing. Example CI log:Fix
Handle
ECHILDfromWaitPidExitedNoHangby callingSetExited()without setting_exitCode, following the same pattern used byCheckForNonChildExit(which detects a process is gone viakill(pid, 0)but has no exit code available). The_exitCodefield isint?— leaving itnullcausesProcess.ExitCodeto return the default value0(viaUpdateHasExited).Non-ECHILD errors from
waitpidstill triggerFailFast.Why ECHILD is safe to handle
waitpid(pid)returnsECHILDwhen:waitpidcallSIGCHLDis set toSIG_IGNIn our context, the pid comes from
s_childProcessWaitStates, placed there byProcess.Startafterfork. It is definitively our child, soECHILDmeans it was already reaped. The P/Invoke declaration forWaitPidExitedNoHangalready documents this as a known return value:Related issues