
Fix CI subprocess test hangs#557

Open
xyao-nv wants to merge 11 commits into main from xyao/exp/ci_stall

Conversation

@xyao-nv
Collaborator

@xyao-nv xyao-nv commented Apr 8, 2026

Summary

Subprocess-spawning tests hang indefinitely on CI.

Causes & Fixes

Problems

From Lab:

  1. Lab reports "AppLauncher doesn't quit properly after app.close(); app.quit() doesn't help either."
  2. Cold startup times for tests using IS can be upwards of 10 min on Lab CI machines.

These issues affect us as well: tests hang during the subprocess test section, between the end of one test and the beginning of the next.

Fixes

  1. SimulationApp Force Exit: Skips app.close() (which can hang indefinitely in Kit's shutdown path) when the env var ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1 is set.
    Calls a new _kill_child_processes() helper that walks /proc to SIGKILL all direct children before doing os._exit(0), preventing orphaned Kit processes from holding GPU resources.

  2. run_subprocess now has configurable wall-clock timeouts and process isolation, so that when needed it can trigger the force-exit path above.

  3. Add wall-clock timing and logging inside the SimulationApp start method, to track how long startup takes on CI.

Minor fixes

  1. Add timing stats to the pytest commands so that each section reports its slowest test functions at the end.

  2. Parametrize multi-config tests: Convert nested for-loops in test_zero_action_policy_kitchen_pick_and_place (6 configs) and test_zero_action_policy_gr1_open_microwave (3 configs) into @pytest.mark.parametrize. Each config gets its own timeout, pass/fail, and timing.

  3. Skip the gr00t eval_runner test for now to isolate the CI stalling problem.

Before -- timeout

[screenshot: CI run hitting the timeout]

After

[screenshots: CI runs completing after the fix]

@xyao-nv xyao-nv changed the title from "timing stats" to "Exp CI stalling" Apr 8, 2026

@isaaclab-review-bot isaaclab-review-bot bot left a comment


🤖 Isaac Lab Review Bot

PR: #557 — Exp CI stalling | Author: @xyao-nv | Base: main ← xyao/exp/ci_stall


Summary

Clean, focused change that adds --durations=0 to all 5 pytest invocations in .github/workflows/ci.yml. This is a standard pytest flag that reports timing for every test (sorted slowest-first) at the end of each test suite run.

Assessment: ✅ Looks Good

What it does:

  • Adds --durations=0 to Newton tests, PhysX tests (with/without cameras), subprocess-spawning tests, and GR00T policy tests
  • This will print a timing breakdown for every test at the end of each CI step, making it easy to identify which tests are slow or stalling

Why it's useful:

  • Essential diagnostic for CI stall investigation — without timing data, you're flying blind
  • --durations=0 reports all tests (vs --durations=N which only shows the N slowest), which is the right choice for debugging

Performance impact: Negligible. The timing collection is built into pytest's core and adds microseconds of overhead per test.

No concerns:

  • No functional change to test selection or execution
  • Flag is well-placed (after -sv, before marker filters) — consistent across all steps
  • The double-space in "subprocess-spawning PhysX tests" (line 189) predates this PR

Good debugging instrumentation. Once the stall is identified and fixed, you may want to switch to --durations=10 or --durations=20 to keep CI logs cleaner for routine runs, but --durations=0 is the right call for investigation.

@xyao-nv xyao-nv changed the title from "Exp CI stalling" to "Fix CI subprocess test hangs" Apr 8, 2026
@xyao-nv xyao-nv marked this pull request as ready for review April 8, 2026 20:42
@greptile-apps

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR fixes CI subprocess test hangs by replacing subprocess.run() with Popen in a new process group, adding a 600 s wall-clock timeout with os.killpg-based tree kill, and injecting ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1 so that SimulationAppContext.__exit__ calls os._exit(0) instead of the stalling app.close(). Tests are also refactored from nested loops to @pytest.mark.parametrize for per-config pass/fail reporting, and --durations=0 is added to all CI pytest invocations.
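The Popen-based flow that summary describes might look roughly like the following sketch. Names and defaults are illustrative; the actual helper lives in isaaclab_arena/tests/utils/subprocess.py and may differ in detail.

```python
import os
import signal
import subprocess


def run_subprocess(cmd, env=None, timeout_sec=600.0):
    """Run cmd in its own process group with a wall-clock timeout (sketch)."""
    env = (os.environ if env is None else env).copy()
    # Tell the child-side SimulationAppContext to os._exit() on completion
    # instead of calling app.close(), which can hang in Kit's shutdown path.
    env["ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE"] = "1"
    # start_new_session=True puts the child in its own session/process group,
    # so a timeout can SIGKILL the whole Kit process tree at once.
    process = subprocess.Popen(cmd, env=env, start_new_session=True)
    try:
        return process.wait(timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(os.getpgid(process.pid), signal.SIGKILL)
        except OSError:
            pass  # the group already exited in the race window
        process.wait()  # reap the zombie
        raise subprocess.SubprocessError(
            f"Subprocess timed out after {timeout_sec}s: {cmd}"
        )
```

Because SIGKILL is delivered to the whole group, Kit grandchildren (shader compilers, GPU workers) go down with the direct child instead of surviving as orphans.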

Confidence Score: 5/5

Safe to merge — all remaining findings are minor style improvements that don't affect correctness.

The core logic (Popen + process-group kill + forced exit) is sound and directly addresses the hang. Both inline comments are P2 quality nits. The env-mutation is a cosmetic concern given that FORCE_EXIT is always desirable from run_subprocess, and the getpgid race is astronomically unlikely given the process is a zombie-retained child. No P0/P1 findings.

isaaclab_arena/tests/utils/subprocess.py — minor robustness improvements to the timeout kill path are worth considering

Vulnerabilities

No security concerns identified. ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE is an internal env var and cannot be weaponized by untrusted input; os._exit(0) deliberately skips normal cleanup (not bypassing auth/ACLs). Process-group kill via os.killpg is scoped to the child's own session and cannot affect unrelated processes.

Important Files Changed

  • isaaclab_arena/tests/utils/subprocess.py: replaces subprocess.run with Popen + process-group kill and injects the FORCE_EXIT env var; two minor robustness nits (env mutation, bare getpgid/killpg)
  • isaaclab_arena/utils/isaaclab_utils/simulation_app.py: adds the ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE check in __exit__ to call os._exit(0) instead of app.close() for subprocess-spawned test children; clean implementation
  • isaaclab_arena/tests/test_policy_runner.py: converts nested for-loops to @pytest.mark.parametrize; each config now gets its own pass/fail and timing — straightforward refactor
  • .github/workflows/ci.yml: adds --durations=0 to all pytest invocations for timing stats; no functional or security concerns
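The env-gated force-exit in simulation_app.py might look roughly like this minimal sketch. It assumes a bare-bones context manager holding an app object with a close() method; the repository's actual class carries more state and logic.

```python
import os


class SimulationAppContext:
    """Illustrative context manager around a SimulationApp-like object."""

    def __init__(self, app):
        self.app = app

    def __enter__(self):
        return self.app

    def __exit__(self, exc_type, exc_value, traceback):
        if os.environ.get("ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE") == "1":
            # Skip app.close(), which can hang in Kit's shutdown path,
            # and terminate the interpreter immediately. os._exit() skips
            # atexit handlers and finalizers by design.
            os._exit(0 if exc_type is None else 1)
        self.app.close()  # normal interactive path
        return False
```

os._exit() bypasses normal Python teardown, which is exactly the point here: the hang lives in the shutdown path, so the only safe completion signal to the parent is the process exit code itself.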

Sequence Diagram

sequenceDiagram
    participant PT as pytest (parent)
    participant RS as run_subprocess()
    participant CH as Child Process (Isaac Sim)
    participant SC as SimulationAppContext.__exit__

    PT->>RS: run_subprocess(cmd, env)
    RS->>RS: set FORCE_EXIT_ON_COMPLETE=1 in env
    RS->>CH: Popen(cmd, start_new_session=True)

    alt Normal completion within 600s
        CH->>SC: __exit__(exc_type=None)
        SC->>SC: check FORCE_EXIT_ON_COMPLETE
        SC->>CH: os._exit(0) skips app.close()
        CH->>RS: returncode = 0
        RS->>PT: returns normally
    else Subprocess raises exception
        CH->>SC: __exit__(exc_type != None)
        SC->>CH: os._exit(1)
        CH->>RS: returncode != 0
        RS->>PT: raises CalledProcessError
    else Timeout exceeds 600s
        RS->>RS: TimeoutExpired caught
        RS->>CH: os.killpg(pgid, SIGKILL)
        RS->>RS: _AT_LEAST_ONE_TEST_FAILED = True
        RS->>PT: raises SubprocessError
    end

Reviews (1): Last reviewed commit: "Merge branch 'main' into xyao/exp/ci_sta..." | Re-trigger Greptile

Comment on lines +57 to +58

    if env is None:
        env = os.environ.copy()

P2 Caller's env dict is mutated in-place

If a caller passes their own env dict, it is mutated to include ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1. While this key is harmless to leave behind for the current callers, it can still cause surprising side-effects if the same dict is reused or inspected after the call. The env is None branch already copies os.environ, so the fix is to copy the caller-supplied dict as well.

Suggested change

    if env is None:
        env = os.environ.copy()
    else:
        env = env.copy()
    env["ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE"] = "1"

Comment on lines +71 to +78

        returncode = process.wait(timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        pgid = os.getpgid(process.pid)
        sys.stderr.write(
            f"\n[isaaclab-arena] Subprocess timed out after {timeout_sec}s — killing process group {pgid}\n"
        )
        os.killpg(pgid, signal.SIGKILL)
        process.wait()

P2 Unguarded os.getpgid and os.killpg calls after timeout

If the timed-out child process coincidentally self-terminates in the narrow window between TimeoutExpired being raised and os.getpgid(process.pid) being called, getpgid will raise ProcessLookupError. Wrapping the kill block in a try/except OSError keeps the timeout path robust — the process is gone either way.

Suggested change

    except subprocess.TimeoutExpired:
        try:
            pgid = os.getpgid(process.pid)
            sys.stderr.write(
                f"\n[isaaclab-arena] Subprocess timed out after {timeout_sec}s — killing process group {pgid}\n"
            )
            os.killpg(pgid, signal.SIGKILL)
        except OSError:
            pass  # process already gone
        process.wait()
        _AT_LEAST_ONE_TEST_FAILED = True
        raise subprocess.SubprocessError(f"Subprocess timed out after {timeout_sec}s: {cmd}")

@xyao-nv
Collaborator Author

xyao-nv commented Apr 8, 2026

Most of the subprocess fixes are from Claude. I mainly babysat the CI to check whether it sped up the timing.

@xyao-nv xyao-nv force-pushed the xyao/exp/ci_stall branch from 1fe581c to 86b09e2 Compare April 8, 2026 22:18
@xyao-nv xyao-nv force-pushed the xyao/exp/ci_stall branch from bc9bb9d to e80dca2 Compare April 8, 2026 23:12
@xyao-nv
Collaborator Author

xyao-nv commented Apr 9, 2026

@greptile-apps


@isaaclab-review-bot isaaclab-review-bot bot left a comment


🤖 Isaac Lab Review Bot — Follow-up

PR: #557 — Fix CI subprocess test hangs | Reviewing commits since 8d8e198a


Summary of New Changes

The PR has grown significantly from the initial --durations=0 diagnostic patch into a comprehensive fix for CI subprocess hangs. Good evolution — the root cause analysis in the PR description is clear and the multi-pronged approach is sound.

New additions:

  • SimulationAppContext force-exit path: skips app.close() (which hangs in Kit shutdown) and SIGKILLs child processes via /proc before os._exit(0)
  • run_subprocess() now has configurable wall-clock timeouts, process isolation (start_new_session=True), and the ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE env var
  • AppLauncher startup timing marker for CI observability
  • Test parametrization: nested loops → @pytest.mark.parametrize (each config gets its own timeout/pass/fail/timing)
  • Temporary skip of the stalling gr00t eval_runner test

Assessment: 👍 Good approach, one gap worth addressing

The force-exit path is well-designed. The _kill_child_processes() approach of walking /proc to find direct children by PPid is careful — the comment explaining why os.killpg isn't safe (Kit's C-level SIGTERM handler) is appreciated. The exception handling is thorough.

One concern: orphaned processes on timeout path. When subprocess.run() hits the timeout, Python SIGKILLs the direct child, but with start_new_session=True, Kit's grandchildren (shader compiler, GPU workers) survive as orphans. Since the child was killed by SIGKILL, the SimulationAppContext.__exit__ cleanup never runs. Consider switching to Popen for the timeout path so you can os.killpg(pgid, signal.SIGKILL) the entire session:

process = subprocess.Popen(cmd, env=env, start_new_session=True, ...)
try:
    process.wait(timeout=timeout_sec)
except subprocess.TimeoutExpired:
    os.killpg(os.getpgid(process.pid), signal.SIGKILL)
    process.wait()
    ...

SIGKILL bypasses Kit's C-level signal handler entirely, so the os.killpg concern from the normal exit path doesn't apply here. The 900s CI timeout is a backstop, but cleaning up immediately avoids GPU resource contention for the next test subprocess.

Minor: The reapply_viewer_cfg refactor (moved from arena_env_builder.py to simulation_app.py utility module) is clean and correctly applied to all the imitation learning scripts that were missing it.

Test parametrization is a clear win — the old nested-loop approach meant one failure would skip remaining configs silently.

Collaborator

@alexmillane alexmillane left a comment


Thank you for addressing this!

I'll spin the job a bunch of times to check that this solves the issue!

Comment on lines +45 to +49
``start_new_session=True`` isolates the child into its own process group.
The child-side ``SimulationAppContext`` uses this to SIGTERM its entire
group before ``os._exit()``, preventing orphaned Kit children (shader
compiler, GPU workers, …) from holding GPU resources and blocking the
next subprocess.
Collaborator


Suggestion to start this comment block with:

"This function runs the child subprocess with start_new_session=True such that the child is isolated..."

Right now the comment looks like it's describing an argument to this function, which is a bit confusing because there is no such argument. Whereas it's actually describing an argument used inside the function.

Another option is to move this comment down below the subprocess call.

Collaborator Author


done



Thanks! 🙏

Comment on lines +68 to +79

    +@pytest.mark.parametrize("embodiment", ["franka_ik", "gr1_pink", "gr1_joint"])
    +@pytest.mark.parametrize("object_name", ["cracker_box", "tomato_soup_can"])
    +def test_zero_action_policy_kitchen_pick_and_place(embodiment, object_name):
         # TODO(alexmillane, 2025.07.29): Get an exhaustive list of all scenes and embodiments
         # from a registry when we have one.
    -    example_environment = "kitchen_pick_and_place"
    -    embodiments = ["franka_ik", "gr1_pink", "gr1_joint"]
    -    object_names = ["cracker_box", "tomato_soup_can"]
    -    for embodiment in embodiments:
    -        for object_name in object_names:
    -            run_policy_runner(
    -                policy_type="zero_action",
    -                example_environment=example_environment,
    -                embodiment=embodiment,
    -                object_name=object_name,
    -                num_steps=NUM_STEPS,
    -            )
    +    run_policy_runner(
    +        policy_type="zero_action",
    +        example_environment="kitchen_pick_and_place",
    +        embodiment=embodiment,
    +        object_name=object_name,
    +        num_steps=NUM_STEPS,
    +    )
Collaborator


Nice!



    @pytest.mark.with_subprocess
    @pytest.mark.skip(reason="Skipping because of CI stalling")
Collaborator


Is this still required? This MR solves the issue, right?

Collaborator Author


yeah I forgot. revert



Sounds good! 👍
