Skip to content

cherrypick 557 and add orphan cleanup#569

Closed
xyao-nv wants to merge 2 commits intomainfrom
xyao/exp/ci_stalling_2
Closed

cherrypick 557 and add orphan cleanup#569
xyao-nv wants to merge 2 commits intomainfrom
xyao/exp/ci_stalling_2

Conversation

@xyao-nv
Copy link
Copy Markdown
Collaborator

@xyao-nv xyao-nv commented Apr 10, 2026

Summary

Short description of the change (max 50 chars)

Detailed description

  • What was the reason for the change?
  • What has been changed?
  • What is the impact of this change?

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Cherry-pick #557 + orphan cleanup

This is a solid fix for the CI stalling problem. The multi-layered approach (force-exit, child kill via /proc, orphan sweep, configurable timeout) is exactly what's needed when you can't trust Kit's shutdown path. The test parametrization refactor is a nice bonus — individual timeout/pass-fail per config is much better for CI debugging.

A few items worth considering:

1. Process-group kill gap in SimulationAppContext.__exit__

_kill_child_processes() only walks direct children (PPid == our PID). If a Kit child spawns its own grandchild before being killed, that grandchild can survive. Since start_new_session=True is set on the parent side, you could also os.killpg(os.getpgid(0), signal.SIGKILL) from the child side before the /proc walk — but only if you're confident the child is in its own session (which it is, via start_new_session). Worth considering as a belt-and-suspenders approach.

2. _kill_kit_orphans is a broad sweep

The function kills any process on the system matching the orphan patterns (omni.telemetry.transmitter, omni.kit, carb.windowing), not just descendants of the test runner. On a shared CI runner this is probably fine (one job at a time), but on a dev machine it could nuke a legitimate Kit session. A comment noting this is CI-only would be helpful.

3. test_zero_action_policy_gr1_open_microwave lost background=None

The old code passed background=None to run_policy_runner(). The new parametrized version drops it. If run_policy_runner has a background kwarg with a non-None default, this changes behavior. Worth verifying that the default is already None.

4. Race between TimeoutExpired and orphan cleanup

When subprocess.run() raises TimeoutExpired, the child process may still be running (Python only sends SIGKILL on timeout if using Popen with kill()subprocess.run with timeout does kill the direct child but not the process group). Since start_new_session=True, the child's process group may survive. Consider adding os.killpg for the child's pgid in the TimeoutExpired handler before _kill_kit_orphans().

5. Minor: SubprocessError vs TimeoutExpired

The timeout handler catches TimeoutExpired and re-raises as a generic SubprocessError. This loses the original exception context (the partially captured output, the cmd). Consider raise SubprocessError(...) from e to preserve the chain, or just re-raise the TimeoutExpired since callers likely want to know it was a timeout specifically.

Overall this is good work and should unblock CI. The fixes are well-documented with clear comments explaining why each piece exists. 👍

ps = subprocess.run(
["ps", "-eo", "pid,ppid,comm,args"],
capture_output=True,
text=True,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches against the entire system process table, not just descendants of the test runner. On shared dev machines this could kill legitimate Kit sessions. Consider adding a comment that this is intended for CI use only, or narrowing the scan to processes whose original parent was this test runner (though I understand reparented orphans make that hard).

result = subprocess.run(
cmd,
check=True,
env=env,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When TimeoutExpired is raised with start_new_session=True, subprocess.run kills the direct child but the process group may survive. Consider:

except subprocess.TimeoutExpired as e:
    # Kill the entire process group
    try:
        os.killpg(os.getpgid(e.pid), signal.SIGKILL)
    except (ProcessLookupError, PermissionError):
        pass
    ...

Also, raise SubprocessError(...) from e would preserve the exception chain.

embodiment=embodiment,
object_name=object_name,
num_steps=NUM_STEPS,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code passed background=None explicitly. Worth verifying that run_policy_runner's default for background is already None — otherwise this parametrized version silently changes behavior.

@xyao-nv xyao-nv closed this Apr 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant