Conversation
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
PR: #557 — Exp CI stalling | Author: @xyao-nv | Base: main ← xyao/exp/ci_stall
Summary
Clean, focused change that adds --durations=0 to all 5 pytest invocations in .github/workflows/ci.yml. This is a standard pytest flag that reports timing for every test (sorted slowest-first) at the end of each test suite run.
Assessment: ✅ Looks Good
What it does:
- Adds
--durations=0to Newton tests, PhysX tests (with/without cameras), subprocess-spawning tests, and GR00T policy tests - This will print a timing breakdown for every test at the end of each CI step, making it easy to identify which tests are slow or stalling
Why it's useful:
- Essential diagnostic for CI stall investigation — without timing data, you're flying blind
--durations=0reports all tests (vs--durations=Nwhich only shows the N slowest), which is the right choice for debugging
Performance impact: Negligible. The timing collection is built into pytest's core and adds microseconds of overhead per test.
No concerns:
- No functional change to test selection or execution
- Flag is well-placed (after
-sv, before marker filters) — consistent across all steps - The double-space in "subprocess-spawning PhysX tests" (line 189) predates this PR
Good debugging instrumentation. Once the stall is identified and fixed, you may want to switch to --durations=10 or --durations=20 to keep CI logs cleaner for routine runs, but --durations=0 is the right call for investigation.
Greptile SummaryThis PR fixes CI subprocess test hangs by replacing Confidence Score: 5/5Safe to merge — all remaining findings are minor style improvements that don't affect correctness. The core logic (Popen + process-group kill + forced exit) is sound and directly addresses the hang. Both inline comments are P2 quality nits. The env-mutation is a cosmetic concern given that FORCE_EXIT is always desirable from run_subprocess, and the getpgid race is astronomically unlikely given the process is a zombie-retained child. No P0/P1 findings. isaaclab_arena/tests/utils/subprocess.py — minor robustness improvements to the timeout kill path are worth considering
|
| Filename | Overview |
|---|---|
| isaaclab_arena/tests/utils/subprocess.py | Replaces subprocess.run with Popen+process-group kill and injects FORCE_EXIT env var; two minor robustness nits (env mutation, bare getpgid/killpg) |
| isaaclab_arena/utils/isaaclab_utils/simulation_app.py | Adds ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE check in exit to call os._exit(0) instead of app.close() for subprocess-spawned test children; clean implementation |
| isaaclab_arena/tests/test_policy_runner.py | Converts nested for-loops to @pytest.mark.parametrize; each config now gets its own pass/fail and timing — straightforward refactor |
| .github/workflows/ci.yml | Adds --durations=0 to all pytest invocations for timing stats; no functional or security concerns |
Sequence Diagram
sequenceDiagram
participant PT as pytest (parent)
participant RS as run_subprocess()
participant CH as Child Process (Isaac Sim)
participant SC as SimulationAppContext.__exit__
PT->>RS: run_subprocess(cmd, env)
RS->>RS: set FORCE_EXIT_ON_COMPLETE=1 in env
RS->>CH: Popen(cmd, start_new_session=True)
alt Normal completion within 600s
CH->>SC: __exit__(exc_type=None)
SC->>SC: check FORCE_EXIT_ON_COMPLETE
SC->>CH: os._exit(0) skips app.close()
CH->>RS: returncode = 0
RS->>PT: returns normally
else Subprocess raises exception
CH->>SC: __exit__(exc_type != None)
SC->>CH: os._exit(1)
CH->>RS: returncode != 0
RS->>PT: raises CalledProcessError
else Timeout exceeds 600s
RS->>RS: TimeoutExpired caught
RS->>CH: os.killpg(pgid, SIGKILL)
RS->>RS: _AT_LEAST_ONE_TEST_FAILED = True
RS->>PT: raises SubprocessError
end
Reviews (1): Last reviewed commit: "Merge branch 'main' into xyao/exp/ci_sta..." | Re-trigger Greptile
| if env is None: | ||
| env = os.environ.copy() |
There was a problem hiding this comment.
Caller's env dict is mutated in-place
If a caller passes their own env dict, it is mutated to include ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1. While this key is harmless to leave behind for the current callers, it can still cause surprising side-effects if the same dict is reused or inspected after the call. The env is None branch already copies os.environ, so the fix is to copy the caller-supplied dict as well.
| if env is None: | |
| env = os.environ.copy() | |
| if env is None: | |
| env = os.environ.copy() | |
| else: | |
| env = env.copy() | |
| env["ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE"] = "1" |
| returncode = process.wait(timeout=timeout_sec) | ||
| except subprocess.TimeoutExpired: | ||
| pgid = os.getpgid(process.pid) | ||
| sys.stderr.write( | ||
| f"\n[isaaclab-arena] Subprocess timed out after {timeout_sec}s — killing process group {pgid}\n" | ||
| ) | ||
| print(f"Command completed with return code: {result.returncode}") | ||
| except subprocess.CalledProcessError as e: | ||
| sys.stderr.write(f"Command failed with return code {e.returncode}: {e}\n") | ||
| os.killpg(pgid, signal.SIGKILL) | ||
| process.wait() |
There was a problem hiding this comment.
Unguarded
os.getpgid and os.killpg calls after timeout
If the timed-out child process coincidentally self-terminates in the narrow window between TimeoutExpired being raised and os.getpgid(process.pid) being called, getpgid will raise ProcessLookupError. Wrapping the kill block in a try/except OSError keeps the timeout-path robust — the process is gone either way.
| returncode = process.wait(timeout=timeout_sec) | |
| except subprocess.TimeoutExpired: | |
| pgid = os.getpgid(process.pid) | |
| sys.stderr.write( | |
| f"\n[isaaclab-arena] Subprocess timed out after {timeout_sec}s — killing process group {pgid}\n" | |
| ) | |
| print(f"Command completed with return code: {result.returncode}") | |
| except subprocess.CalledProcessError as e: | |
| sys.stderr.write(f"Command failed with return code {e.returncode}: {e}\n") | |
| os.killpg(pgid, signal.SIGKILL) | |
| process.wait() | |
| except subprocess.TimeoutExpired: | |
| try: | |
| pgid = os.getpgid(process.pid) | |
| sys.stderr.write( | |
| f"\n[isaaclab-arena] Subprocess timed out after {timeout_sec}s — killing process group {pgid}\n" | |
| ) | |
| os.killpg(pgid, signal.SIGKILL) | |
| except OSError: | |
| pass # process already gone | |
| process.wait() | |
| _AT_LEAST_ONE_TEST_FAILED = True | |
| raise subprocess.SubprocessError(f"Subprocess timed out after {timeout_sec}s: {cmd}") |
|
Most of the subprocess fixes are from Claude. I mainly babysitted the CI to check if it speeds up the timing. |
1fe581c to
86b09e2
Compare
bc9bb9d to
e80dca2
Compare
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot — Follow-up
PR: #557 — Fix CI subprocess test hangs | Reviewing commits since 8d8e198a
Summary of New Changes
The PR has grown significantly from the initial --durations=0 diagnostic patch into a comprehensive fix for CI subprocess hangs. Good evolution — the root cause analysis in the PR description is clear and the multi-pronged approach is sound.
New additions:
SimulationAppContextforce-exit path: skipsapp.close()(which hangs in Kit shutdown) and SIGKILLs child processes via/procbeforeos._exit(0)run_subprocess()now has configurable wall-clock timeouts, process isolation (start_new_session=True), and theISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETEenv var- AppLauncher startup timing marker for CI observability
- Test parametrization: nested loops →
@pytest.mark.parametrize(each config gets its own timeout/pass/fail/timing) - Temporary skip of the stalling gr00t eval_runner test
Assessment: 👍 Good approach, one gap worth addressing
The force-exit path is well-designed. The _kill_child_processes() approach of walking /proc to find direct children by PPid is careful — the comment explaining why os.killpg isn't safe (Kit's C-level SIGTERM handler) is appreciated. The exception handling is thorough.
One concern: orphaned processes on timeout path. When subprocess.run() hits the timeout, Python SIGKILLs the direct child, but with start_new_session=True, Kit's grandchildren (shader compiler, GPU workers) survive as orphans. Since the child was killed by SIGKILL, the SimulationAppContext.__exit__ cleanup never runs. Consider switching to Popen for the timeout path so you can os.killpg(pgid, signal.SIGKILL) the entire session:
process = subprocess.Popen(cmd, env=env, start_new_session=True, ...)
try:
process.wait(timeout=timeout_sec)
except subprocess.TimeoutExpired:
os.killpg(os.getpgid(process.pid), signal.SIGKILL)
process.wait()
...SIGKILL bypasses Kit's C-level signal handler entirely, so the os.killpg concern from the normal exit path doesn't apply here. The 900s CI timeout is a backstop, but cleaning up immediately avoids GPU resource contention for the next test subprocess.
Minor: The reapply_viewer_cfg refactor (moved from arena_env_builder.py to simulation_app.py utility module) is clean and correctly applied to all the imitation learning scripts that were missing it.
Test parametrization is a clear win — the old nested-loop approach meant one failure would skip remaining configs silently.
alexmillane
left a comment
There was a problem hiding this comment.
Thank you addressing this!
I'll spin the job a bunch of times to check that this solves the issue!
| ``start_new_session=True`` isolates the child into its own process group. | ||
| The child-side ``SimulationAppContext`` uses this to SIGTERM its entire | ||
| group before ``os._exit()``, preventing orphaned Kit children (shader | ||
| compiler, GPU workers, …) from holding GPU resources and blocking the | ||
| next subprocess. |
There was a problem hiding this comment.
Suggestion to start this comment block with:
"This function runs the child subprocess with start_new_session=True such that the child is isolated..."
Right now the comment looks like it's describing an argument to this function, which is a bit confusing because there is no such argument. Whereas it's actually describing an argument used inside the function.
Another option is to move this comment down below the subprocess call.
| @pytest.mark.parametrize("embodiment", ["franka_ik", "gr1_pink", "gr1_joint"]) | ||
| @pytest.mark.parametrize("object_name", ["cracker_box", "tomato_soup_can"]) | ||
| def test_zero_action_policy_kitchen_pick_and_place(embodiment, object_name): | ||
| # TODO(alexmillane, 2025.07.29): Get an exhaustive list of all scenes and embodiments | ||
| # from a registry when we have one. | ||
| example_environment = "kitchen_pick_and_place" | ||
| embodiments = ["franka_ik", "gr1_pink", "gr1_joint"] | ||
| object_names = ["cracker_box", "tomato_soup_can"] | ||
| for embodiment in embodiments: | ||
| for object_name in object_names: | ||
| run_policy_runner( | ||
| policy_type="zero_action", | ||
| example_environment=example_environment, | ||
| embodiment=embodiment, | ||
| object_name=object_name, | ||
| num_steps=NUM_STEPS, | ||
| ) | ||
| run_policy_runner( | ||
| policy_type="zero_action", | ||
| example_environment="kitchen_pick_and_place", | ||
| embodiment=embodiment, | ||
| object_name=object_name, | ||
| num_steps=NUM_STEPS, | ||
| ) |
|
|
||
|
|
||
| @pytest.mark.with_subprocess | ||
| @pytest.mark.skip(reason="Skipping because of CI stalling") |
There was a problem hiding this comment.
Is this still required? This MR solves this issue right?
There was a problem hiding this comment.
yeah I forgot. revert
Summary
Subprocess-spawning tests hang indefinitely on CI.
Causes & Fixes
Problems
From Lab:
Above issues apply to us, because tests hang during sub-process tests section, between the end of last test and the beginning of the next test.
Fixes
SimulationAppForce Exit: Skipsapp.close()(which can hang indefinitely in Kit's shutdown path) when the env varISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1is set.Calls a new
_kill_child_processes()helper that walks/proctoSIGKILLall direct children before doingos._exit(0), preventing orphaned Kit processes from holding GPU resources.run_subprocesshas a configuarable wall-clock timeouts and process isolation, such that when needed, it could trigger the force exit path above.Add wall-clock timing and logging inside the SimulationApp start method. Keep track of how much startup time is taking on CI.
Minor fixes
Add timing stats into pytest cmds such that it reports the slowests test func at the end of each section.
Parametrize multi-config tests: Convert nested for-loops in
test_zero_action_policy_kitchen_pick_and_place(6 configs) andtest_zero_action_policy_gr1_open_microwave(3 configs) into@pytest.mark.parametrize.Each config gets its own timeout, pass/fail, and timing.Skip gr00t eval_runner test for now to isolate CI stalling problem.
Before -- timeout
After