Conversation
There was a problem hiding this comment.
Review: Cherry-pick #557 + orphan cleanup
This is a solid fix for the CI stalling problem. The multi-layered approach (force-exit, child kill via /proc, orphan sweep, configurable timeout) is exactly what's needed when you can't trust Kit's shutdown path. The test parametrization refactor is a nice bonus — individual timeout/pass-fail per config is much better for CI debugging.
A few items worth considering:
1. Process-group kill gap in SimulationAppContext.__exit__
_kill_child_processes() only walks direct children (PPid == our PID). If a Kit child spawns its own grandchild before being killed, that grandchild can survive. Since start_new_session=True is set on the parent side, you could also os.killpg(os.getpgid(0), signal.SIGKILL) from the child side before the /proc walk — but only if you're confident the child is in its own session (which it is, via start_new_session). Worth considering as a belt-and-suspenders approach.
2. _kill_kit_orphans is a broad sweep
The function kills any process on the system matching the orphan patterns (omni.telemetry.transmitter, omni.kit, carb.windowing), not just descendants of the test runner. On a shared CI runner this is probably fine (one job at a time), but on a dev machine it could nuke a legitimate Kit session. A comment noting this is CI-only would be helpful.
3. test_zero_action_policy_gr1_open_microwave lost background=None
The old code passed background=None to run_policy_runner(). The new parametrized version drops it. If run_policy_runner has a background kwarg with a non-None default, this changes behavior. Worth verifying that the default is already None.
4. Race between TimeoutExpired and orphan cleanup
When subprocess.run() raises TimeoutExpired, the child process may still be running (Python only sends SIGKILL on timeout if using Popen with kill() — subprocess.run with timeout does kill the direct child but not the process group). Since start_new_session=True, the child's process group may survive. Consider adding os.killpg for the child's pgid in the TimeoutExpired handler before _kill_kit_orphans().
5. Minor: SubprocessError vs TimeoutExpired
The timeout handler catches TimeoutExpired and re-raises as a generic SubprocessError. This loses the original exception context (the partially captured output, the cmd). Consider raise SubprocessError(...) from e to preserve the chain, or just re-raise the TimeoutExpired since callers likely want to know it was a timeout specifically.
Overall this is good work and should unblock CI. The fixes are well-documented with clear comments explaining why each piece exists. 👍
| ps = subprocess.run( | ||
| ["ps", "-eo", "pid,ppid,comm,args"], | ||
| capture_output=True, | ||
| text=True, |
There was a problem hiding this comment.
This matches against the entire system process table, not just descendants of the test runner. On shared dev machines this could kill legitimate Kit sessions. Consider adding a comment that this is intended for CI use only, or narrowing the scan to processes whose original parent was this test runner (though I understand reparented orphans make that hard).
| result = subprocess.run( | ||
| cmd, | ||
| check=True, | ||
| env=env, |
There was a problem hiding this comment.
When TimeoutExpired is raised with start_new_session=True, subprocess.run kills the direct child but the process group may survive. Consider:
except subprocess.TimeoutExpired as e:
# Kill the entire process group
try:
os.killpg(os.getpgid(e.pid), signal.SIGKILL)
except (ProcessLookupError, PermissionError):
pass
...Also, raise SubprocessError(...) from e would preserve the exception chain.
| embodiment=embodiment, | ||
| object_name=object_name, | ||
| num_steps=NUM_STEPS, | ||
| ) |
There was a problem hiding this comment.
The old code passed background=None explicitly. Worth verifying that run_policy_runner's default for background is already None — otherwise this parametrized version silently changes behavior.
Summary
Short description of the change (max 50 chars)
Detailed description