
Fix CI subprocess test hangs#557

Open
xyao-nv wants to merge 11 commits into main from xyao/exp/ci_stall

Conversation

@xyao-nv
Collaborator

@xyao-nv xyao-nv commented Apr 8, 2026

Summary

Subprocess-spawning tests hang indefinitely on CI.

Causes & Fixes

Problems

From Lab:

  1. Lab reports "AppLauncher doesn't quit properly after app.close(); app.quit() doesn't help either."
  2. Cold startup times for tests using IS can be upwards of 10 min on Lab CI machines.

These issues affect us as well: tests hang during the subprocess test section, between the end of one test and the beginning of the next.

Fixes

  1. SimulationApp Force Exit: Skips app.close() (which can hang indefinitely in Kit's shutdown path) when the env var ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1 is set.
    Calls a new _kill_child_processes() helper that walks /proc to SIGKILL all direct children before doing os._exit(0), preventing orphaned Kit processes from holding GPU resources.

  2. run_subprocess now has configurable wall-clock timeouts and process isolation, so that when needed it can trigger the force-exit path above.

  3. Add wall-clock timing and logging inside the SimulationApp start method, to track how long startup takes on CI.

Minor fixes

  1. Add timing stats to the pytest commands so that each section reports its slowest test functions at the end.

  2. Parametrize multi-config tests: Convert nested for-loops in test_zero_action_policy_kitchen_pick_and_place (6 configs) and test_zero_action_policy_gr1_open_microwave (3 configs) into @pytest.mark.parametrize. Each config gets its own timeout, pass/fail, and timing.

  3. Skip the gr00t eval_runner test for now to isolate the CI stalling problem.

Before -- timeout

[screenshot: CI run hitting the timeout]

After

[screenshots: CI runs completing after the fix]

@xyao-nv xyao-nv changed the title from "timing stats" to "Exp CI stalling" Apr 8, 2026

@isaaclab-review-bot isaaclab-review-bot bot left a comment


🤖 Isaac Lab Review Bot

PR: #557 — Exp CI stalling | Author: @xyao-nv | Base: main ← xyao/exp/ci_stall


Summary

Clean, focused change that adds --durations=0 to all 5 pytest invocations in .github/workflows/ci.yml. This is a standard pytest flag that reports timing for every test (sorted slowest-first) at the end of each test suite run.

Assessment: ✅ Looks Good

What it does:

  • Adds --durations=0 to Newton tests, PhysX tests (with/without cameras), subprocess-spawning tests, and GR00T policy tests
  • This will print a timing breakdown for every test at the end of each CI step, making it easy to identify which tests are slow or stalling

Why it's useful:

  • Essential diagnostic for CI stall investigation — without timing data, you're flying blind
  • --durations=0 reports all tests (vs --durations=N which only shows the N slowest), which is the right choice for debugging

Performance impact: Negligible. The timing collection is built into pytest's core and adds microseconds of overhead per test.

No concerns:

  • No functional change to test selection or execution
  • Flag is well-placed (after -sv, before marker filters) — consistent across all steps
  • The double-space in "subprocess-spawning PhysX tests" (line 189) predates this PR

Good debugging instrumentation. Once the stall is identified and fixed, you may want to switch to --durations=10 or --durations=20 to keep CI logs cleaner for routine runs, but --durations=0 is the right call for investigation.

@xyao-nv xyao-nv changed the title from "Exp CI stalling" to "Fix CI subprocess test hangs" Apr 8, 2026
@xyao-nv xyao-nv marked this pull request as ready for review April 8, 2026 20:42
@greptile-apps

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR fixes CI subprocess test hangs by replacing subprocess.run() with Popen in a new process group, adding a 600 s wall-clock timeout with os.killpg-based tree kill, and injecting ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1 so that SimulationAppContext.__exit__ calls os._exit(0) instead of the stalling app.close(). Tests are also refactored from nested loops to @pytest.mark.parametrize for per-config pass/fail reporting, and --durations=0 is added to all CI pytest invocations.
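The Popen-based flow that summary describes might look roughly like the following sketch. Names and defaults are illustrative; the actual helper lives in isaaclab_arena/tests/utils/subprocess.py and may differ in detail.

```python
import os
import signal
import subprocess


def run_subprocess(cmd, env=None, timeout_sec=600.0):
    """Run cmd in its own process group with a wall-clock timeout (sketch)."""
    env = (os.environ if env is None else env).copy()
    # Tell the child-side SimulationAppContext to os._exit() on completion
    # instead of calling app.close(), which can hang in Kit's shutdown path.
    env["ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE"] = "1"
    # start_new_session=True puts the child in its own session/process group,
    # so a timeout can SIGKILL the whole Kit process tree at once.
    process = subprocess.Popen(cmd, env=env, start_new_session=True)
    try:
        return process.wait(timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(os.getpgid(process.pid), signal.SIGKILL)
        except OSError:
            pass  # the group already exited in the race window
        process.wait()  # reap the zombie
        raise subprocess.SubprocessError(
            f"Subprocess timed out after {timeout_sec}s: {cmd}"
        )
```

Because SIGKILL is delivered to the whole group, Kit grandchildren (shader compilers, GPU workers) go down with the direct child instead of surviving as orphans.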

Confidence Score: 5/5

Safe to merge — all remaining findings are minor style improvements that don't affect correctness.

The core logic (Popen + process-group kill + forced exit) is sound and directly addresses the hang. Both inline comments are P2 quality nits. The env-mutation is a cosmetic concern given that FORCE_EXIT is always desirable from run_subprocess, and the getpgid race is astronomically unlikely given the process is a zombie-retained child. No P0/P1 findings.

isaaclab_arena/tests/utils/subprocess.py — minor robustness improvements to the timeout kill path are worth considering

Vulnerabilities

No security concerns identified. ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE is an internal env var and cannot be weaponized by untrusted input; os._exit(0) deliberately skips normal cleanup (not bypassing auth/ACLs). Process-group kill via os.killpg is scoped to the child's own session and cannot affect unrelated processes.

Important Files Changed

  • isaaclab_arena/tests/utils/subprocess.py: replaces subprocess.run with Popen + process-group kill and injects the FORCE_EXIT env var; two minor robustness nits (env mutation, bare getpgid/killpg)
  • isaaclab_arena/utils/isaaclab_utils/simulation_app.py: adds the ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE check in __exit__ to call os._exit(0) instead of app.close() for subprocess-spawned test children; clean implementation
  • isaaclab_arena/tests/test_policy_runner.py: converts nested for-loops to @pytest.mark.parametrize; each config now gets its own pass/fail and timing — straightforward refactor
  • .github/workflows/ci.yml: adds --durations=0 to all pytest invocations for timing stats; no functional or security concerns
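The env-gated force-exit in simulation_app.py might look roughly like this minimal sketch. It assumes a bare-bones context manager holding an app object with a close() method; the repository's actual class carries more state and logic.

```python
import os


class SimulationAppContext:
    """Illustrative context manager around a SimulationApp-like object."""

    def __init__(self, app):
        self.app = app

    def __enter__(self):
        return self.app

    def __exit__(self, exc_type, exc_value, traceback):
        if os.environ.get("ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE") == "1":
            # Skip app.close(), which can hang in Kit's shutdown path,
            # and terminate the interpreter immediately. os._exit() skips
            # atexit handlers and finalizers by design.
            os._exit(0 if exc_type is None else 1)
        self.app.close()  # normal interactive path
        return False
```

os._exit() bypasses normal Python teardown, which is exactly the point here: the hang lives in the shutdown path, so the only safe completion signal to the parent is the process exit code itself.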

Sequence Diagram

sequenceDiagram
    participant PT as pytest (parent)
    participant RS as run_subprocess()
    participant CH as Child Process (Isaac Sim)
    participant SC as SimulationAppContext.__exit__

    PT->>RS: run_subprocess(cmd, env)
    RS->>RS: set FORCE_EXIT_ON_COMPLETE=1 in env
    RS->>CH: Popen(cmd, start_new_session=True)

    alt Normal completion within 600s
        CH->>SC: __exit__(exc_type=None)
        SC->>SC: check FORCE_EXIT_ON_COMPLETE
        SC->>CH: os._exit(0) skips app.close()
        CH->>RS: returncode = 0
        RS->>PT: returns normally
    else Subprocess raises exception
        CH->>SC: __exit__(exc_type != None)
        SC->>CH: os._exit(1)
        CH->>RS: returncode != 0
        RS->>PT: raises CalledProcessError
    else Timeout exceeds 600s
        RS->>RS: TimeoutExpired caught
        RS->>CH: os.killpg(pgid, SIGKILL)
        RS->>RS: _AT_LEAST_ONE_TEST_FAILED = True
        RS->>PT: raises SubprocessError
    end

Reviews (1): Last reviewed commit: "Merge branch 'main' into xyao/exp/ci_sta..." | Re-trigger Greptile

Comment on lines +57 to +58

    if env is None:
        env = os.environ.copy()

P2 Caller's env dict is mutated in-place

If a caller passes their own env dict, it is mutated to include ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1. While this key is harmless to leave behind for the current callers, it can still cause surprising side-effects if the same dict is reused or inspected after the call. The env is None branch already copies os.environ, so the fix is to copy the caller-supplied dict as well.

Suggested change

    if env is None:
        env = os.environ.copy()
    else:
        env = env.copy()
    env["ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE"] = "1"

Comment on lines +71 to +78

        returncode = process.wait(timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        pgid = os.getpgid(process.pid)
        sys.stderr.write(
            f"\n[isaaclab-arena] Subprocess timed out after {timeout_sec}s — killing process group {pgid}\n"
        )
        os.killpg(pgid, signal.SIGKILL)
        process.wait()

P2 Unguarded os.getpgid and os.killpg calls after timeout

If the timed-out child process coincidentally self-terminates in the narrow window between TimeoutExpired being raised and os.getpgid(process.pid) being called, getpgid will raise ProcessLookupError. Wrapping the kill block in a try/except OSError keeps the timeout path robust — the process is gone either way.

Suggested change

    except subprocess.TimeoutExpired:
        try:
            pgid = os.getpgid(process.pid)
            sys.stderr.write(
                f"\n[isaaclab-arena] Subprocess timed out after {timeout_sec}s — killing process group {pgid}\n"
            )
            os.killpg(pgid, signal.SIGKILL)
        except OSError:
            pass  # process already gone
        process.wait()
        _AT_LEAST_ONE_TEST_FAILED = True
        raise subprocess.SubprocessError(f"Subprocess timed out after {timeout_sec}s: {cmd}")

@xyao-nv
Collaborator Author

xyao-nv commented Apr 8, 2026

Most of the subprocess fixes are from Claude. I mainly babysat the CI to check whether it sped up the timing.

@xyao-nv xyao-nv force-pushed the xyao/exp/ci_stall branch from 1fe581c to 86b09e2 Compare April 8, 2026 22:18
@xyao-nv xyao-nv force-pushed the xyao/exp/ci_stall branch from bc9bb9d to e80dca2 Compare April 8, 2026 23:12
@xyao-nv
Collaborator Author

xyao-nv commented Apr 9, 2026

@greptile-apps


@isaaclab-review-bot isaaclab-review-bot bot left a comment


🤖 Isaac Lab Review Bot — Follow-up

PR: #557 — Fix CI subprocess test hangs | Reviewing commits since 8d8e198a


Summary of New Changes

The PR has grown significantly from the initial --durations=0 diagnostic patch into a comprehensive fix for CI subprocess hangs. Good evolution — the root cause analysis in the PR description is clear and the multi-pronged approach is sound.

New additions:

  • SimulationAppContext force-exit path: skips app.close() (which hangs in Kit shutdown) and SIGKILLs child processes via /proc before os._exit(0)
  • run_subprocess() now has configurable wall-clock timeouts, process isolation (start_new_session=True), and the ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE env var
  • AppLauncher startup timing marker for CI observability
  • Test parametrization: nested loops → @pytest.mark.parametrize (each config gets its own timeout/pass/fail/timing)
  • Temporary skip of the stalling gr00t eval_runner test

Assessment: 👍 Good approach, one gap worth addressing

The force-exit path is well-designed. The _kill_child_processes() approach of walking /proc to find direct children by PPid is careful — the comment explaining why os.killpg isn't safe (Kit's C-level SIGTERM handler) is appreciated. The exception handling is thorough.

One concern: orphaned processes on timeout path. When subprocess.run() hits the timeout, Python SIGKILLs the direct child, but with start_new_session=True, Kit's grandchildren (shader compiler, GPU workers) survive as orphans. Since the child was killed by SIGKILL, the SimulationAppContext.__exit__ cleanup never runs. Consider switching to Popen for the timeout path so you can os.killpg(pgid, signal.SIGKILL) the entire session:

process = subprocess.Popen(cmd, env=env, start_new_session=True, ...)
try:
    process.wait(timeout=timeout_sec)
except subprocess.TimeoutExpired:
    os.killpg(os.getpgid(process.pid), signal.SIGKILL)
    process.wait()
    ...

SIGKILL bypasses Kit's C-level signal handler entirely, so the os.killpg concern from the normal exit path doesn't apply here. The 900s CI timeout is a backstop, but cleaning up immediately avoids GPU resource contention for the next test subprocess.

Minor: The reapply_viewer_cfg refactor (moved from arena_env_builder.py to simulation_app.py utility module) is clean and correctly applied to all the imitation learning scripts that were missing it.

Test parametrization is a clear win — the old nested-loop approach meant one failure would skip remaining configs silently.

Collaborator

@alexmillane alexmillane left a comment


Thank you for addressing this!

I'll spin the job a bunch of times to check that this solves the issue!

Comment on lines +45 to +49
``start_new_session=True`` isolates the child into its own process group.
The child-side ``SimulationAppContext`` uses this to SIGTERM its entire
group before ``os._exit()``, preventing orphaned Kit children (shader
compiler, GPU workers, …) from holding GPU resources and blocking the
next subprocess.
Collaborator


Suggestion to start this comment block with:

"This function runs the child subprocess with start_new_session=True such that the child is isolated..."

Right now the comment looks like it's describing an argument to this function, which is a bit confusing because there is no such argument. Whereas it's actually describing an argument used inside the function.

Another option is to move this comment down below the subprocess call.

Collaborator Author


done



Thanks! 🙏

Comment on lines +68 to +79

    +@pytest.mark.parametrize("embodiment", ["franka_ik", "gr1_pink", "gr1_joint"])
    +@pytest.mark.parametrize("object_name", ["cracker_box", "tomato_soup_can"])
    +def test_zero_action_policy_kitchen_pick_and_place(embodiment, object_name):
         # TODO(alexmillane, 2025.07.29): Get an exhaustive list of all scenes and embodiments
         # from a registry when we have one.
    -    example_environment = "kitchen_pick_and_place"
    -    embodiments = ["franka_ik", "gr1_pink", "gr1_joint"]
    -    object_names = ["cracker_box", "tomato_soup_can"]
    -    for embodiment in embodiments:
    -        for object_name in object_names:
    -            run_policy_runner(
    -                policy_type="zero_action",
    -                example_environment=example_environment,
    -                embodiment=embodiment,
    -                object_name=object_name,
    -                num_steps=NUM_STEPS,
    -            )
    +    run_policy_runner(
    +        policy_type="zero_action",
    +        example_environment="kitchen_pick_and_place",
    +        embodiment=embodiment,
    +        object_name=object_name,
    +        num_steps=NUM_STEPS,
    +    )
Collaborator


Nice!



    @pytest.mark.with_subprocess
    @pytest.mark.skip(reason="Skipping because of CI stalling")
Collaborator


Is this still required? This MR solves the issue, right?

Collaborator Author


yeah I forgot. revert



Sounds good! 👍
