Add per-actor parent-main inheritance opt-out#434

Merged
goodboy merged 14 commits into goodboy:main from
mahmoudhas:add-replay-parent-main-opt-out
Apr 11, 2026

Conversation

@mahmoudhas
Contributor

@mahmoudhas mahmoudhas commented Apr 6, 2026

Summary

  • add an inherit_parent_main flag to ActorNursery.start_actor()
    and ActorNursery.run_in_actor() while keeping the existing
    default behavior
  • let _mp_fixup_main._mp_figure_out_main(inherit_parent_main=...)
    decide whether parent __main__ bootstrap data should be captured,
    keeping the opt-out localized instead of threading extra state
    through spawn setup
  • cover the trio spawn behavior with a regression test that checks
    the child actor's __main__ with and without parent-main
    inheritance

Test plan

  • python -m pytest tests/test_spawning.py -k "test_start_actor_can_skip_parent_main_inheritance or test_run_in_actor_same_func_in_child" -q
  • downstream usage in the Baseten inference server

For this to land..


Follow up,

See tracking issue #441 for follow-up items from this PR.

@mahmoudhas mahmoudhas changed the title Add per-actor parent-main replay opt-out Add per-actor parent-main inheritance opt-out Apr 6, 2026
@mahmoudhas
Contributor Author

@goodboy would you please take a look?

I'm using tractor in a setup where the parent process is bootstrapped by MPI, so I don't control the parent's __main__. When a child actor inherits/replays that parent __main__, it ends up importing/initializing MPI in the child as well, which causes errors in some libraries like ucxx.

This PR adds a per-actor opt-out for inheriting the parent's __main__.
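To illustrate the failure mode, here is a minimal stdlib-only sketch, not tractor's actual code path. It assumes the child replays the parent's `__main__` the way `multiprocessing.spawn` does for a script path, by re-executing the file with `runpy.run_path(..., run_name='__mp_main__')`. The script contents and the `FAKE_MPI_INIT_COUNT` / `FAKE_MAIN_GUARD_RAN` variables are hypothetical stand-ins for an MPI bootstrap.

```python
import os
import runpy
import tempfile
import textwrap

# Hypothetical parent script: the fake "MPI init" runs at module level,
# outside any `if __name__ == '__main__':` guard, so a replay re-runs it.
PARENT_MAIN_SRC = textwrap.dedent('''
    import os
    count = int(os.environ.get('FAKE_MPI_INIT_COUNT', '0'))
    os.environ['FAKE_MPI_INIT_COUNT'] = str(count + 1)

    if __name__ == '__main__':
        os.environ['FAKE_MAIN_GUARD_RAN'] = '1'
''')


def replay_parent_main(inherit_parent_main: bool) -> int:
    '''
    Re-execute the fake parent script as a child would (unless opted
    out) and return how many times the fake "MPI init" ran.
    '''
    os.environ.pop('FAKE_MPI_INIT_COUNT', None)
    os.environ.pop('FAKE_MAIN_GUARD_RAN', None)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'parent_main.py')
        with open(path, 'w') as f:
            f.write(PARENT_MAIN_SRC)
        if inherit_parent_main:
            # Same run-name the stdlib uses when fixing up `__main__`.
            runpy.run_path(path, run_name='__mp_main__')
    return int(os.environ.get('FAKE_MPI_INIT_COUNT', '0'))


inherited_count = replay_parent_main(inherit_parent_main=True)
guard_ran = 'FAKE_MAIN_GUARD_RAN' in os.environ
isolated_count = replay_parent_main(inherit_parent_main=False)
```

The `__main__` guard protects only the code inside it; anything at module level in a parent script you don't control still runs on replay, which is why skipping the replay entirely is the safer option here.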

@goodboy
Owner

goodboy commented Apr 6, 2026

@mahmoudhas hey sorry just saw this now.

i will look ASAP!

Thanks for the patch and will get back to you shortly 👍🏼

@goodboy goodboy requested a review from Copilot April 6, 2026 20:44
@goodboy goodboy self-requested a review April 6, 2026 20:48
@goodboy goodboy added the spawning of processes, (shm) threads, tasks on varied (OS-specific) backends label Apr 6, 2026

Copilot AI left a comment

Pull request overview

Adds a per-child-actor switch to opt out of inheriting (and re-running) the parent process __main__ bootstrap when using Tractor’s trio spawn backend, while preserving the existing default behavior.

Changes:

  • Added inherit_parent_main: bool = True to ActorNursery.start_actor() and ActorNursery.run_in_actor() and threaded it into child Actor construction.
  • Updated _mp_fixup_main._mp_figure_out_main() to optionally return no parent __main__ bootstrap data when inheritance is disabled.
  • Added a trio-backend regression test verifying child __main__ behavior with and without parent-main inheritance.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tractor/spawn/_mp_fixup_main.py | Adds inherit_parent_main opt-out to parent __main__ bootstrap capture. |
| tractor/runtime/_supervise.py | Exposes the opt-out flag on start_actor / run_in_actor and forwards it to Actor. |
| tractor/runtime/_runtime.py | Stores parent-main bootstrap data conditionally based on the new flag. |
| tests/test_spawning.py | Adds regression coverage for trio spawn child __main__ behavior. |

Comment on lines +207 to +212
def test_start_actor_can_skip_parent_main_inheritance(
    start_method,
    reg_addr,
    monkeypatch,
):
    if start_method != 'trio':

Copilot AI Apr 6, 2026

Test name says start_actor but the test exercises run_in_actor (which indirectly calls start_actor). Consider renaming the test to reflect the exercised API (or add an explicit start_actor assertion) to keep intent and coverage clear.

Owner

Meh, i don't think this is critical, though it's likely good to demo the daemon API working as well even though one is just a wrapper for the other.

Yeah, i'd say for now it's unnecessary, other than that longer run we'd like to remove the Portal.run_in_actor() API from the ActorNursery, so it saves some tech debt now.

See #290

Owner

🤖 response authored by claude-code

Non-critical per repo owner. The broader run_in_actor → start_actor API consolidation is tracked in #290.

@goodboy
Owner

goodboy commented Apr 6, 2026

Another thing I thought of immediately is we haven't synced this helper with the stdlib in a long time 🫣

Looking at the changes it seems there's some other stuff we might want to include as well?

https://github.com/python/cpython/blob/main/Lib/multiprocessing/spawn.py#L160

As it stands however i don't see any issue with landing this asap if you need it @mahmoudhas 👍🏼

@goodboy
Owner

goodboy commented Apr 6, 2026

@mahmoudhas one more thing, if you don't mind i'm going to augment this PR's description with some follow up which I'd like to see come out of this based on your mentioning the desired usage alongside MPI and/or UCX.

Particularly i think adding some examples/integrations/* scripts which demo how tractor can be used alongside/on-top-of such grander (or more general purpose?), pre-existing IPC frameworks and libs to potentially enforce SC would be pretty stellar no 🤔 ?

I'm happy to go through composing these examples if you (or whomever else there might be) is not interested. It would be super handy to get pointed at some basic examples to demo,

  • how to use multiple actors to conduct inter-proc IPC over each lib, ideally with some ode to cancellation and error propagation up a process tree.

  • some demo of how you would normally not be able to (as easily) enforce strict SC supervision without tractor if using those libs alone?

anyway lmk your thoughts @mahmoudhas !

@mahmoudhas
Contributor Author

@goodboy thanks for the swift response. I've addressed the comments around the test clarity and typing.

And yes, please feel free to edit the PR description.

I agree that examples of using tractor alongside broader IPC frameworks like MPI/UCX would be very valuable. In our specific case though, I think it would be difficult to turn the integration into a clean upstream example or test.

Our use case is real-time extraction of hidden states from TensorRT-LLM for training draft models for speculative decoding. It’s somewhat similar in spirit to TorchSpec, but built for TRT-LLM rather than vLLM/SGLang, and designed to be careful about transfer scheduling so the overhead stays low, and using fewer dependencies (no Ray; uses UCX instead of MoonCake for simplicity)

The tricky part is that TRT-LLM already owns the MPI side and uses it to manage the executor threads/processes. We’re layering a tractor scope on top of that in order to supervise and offload the UCX transfer logic. Since the MPI setup is provided by TRT-LLM rather than by us, it’s not easy to extract a minimal standalone integration example from the full system.

That said, we will likely open-source this project once it stabilizes, and at that point I’d be happy to point at a more concrete example.

await isolated.cancel_actor()

trio.run(main)
assert _run_parent_main_inheritance_script(
Owner

Lol, ok well this is definitely testing the "round-tripping" of the whole sys as it were 😂 but, i wonder if maybe just asserting the values in the subactors is enough?

Not sure if you've tried out the remote exc propagation much but that would also be an easy way to ensure the subactor's values are as expected?

If you really would rather audit running the whole runtime as a subproc i'm also fine with that, i just don't know if it's truly necessary?

I'll comment above with an alt approach for this suite.

Owner

@goodboy goodboy Apr 9, 2026

🤖 response authored by claude-code

Done — both run_in_actor and start_actor tests now assert inside the child via check_parent_main_inheritance(). No subprocess, no JSON marshalling.

📎 fixed in e2ba75a0

from spawn_test_support.parent_main_inheritance_support import get_main_mod_name


async def main(api: str, output_path: str) -> None:
Owner

Yeah, again see my other comments below, but if you can use the approach i suggest by having the subactor do the assert in the child, then you can move all this code back where it was in the original test_spawning mod 😉

NOTE, this only presumes doing so will in fact provide the coverage needed for your use case (but i'm pretty sure it will).

Owner

@goodboy goodboy Apr 9, 2026

🤖 response authored by claude-code

Done — tests/spawn_test_support/ deleted entirely, all test code is back in test_spawning.py as a self-contained suite.

📎 fixed in e2ba75a0

Owner

Turns out there was some finagling to do with how pytest deals with __main__ overriding, so claude wrote a better check for us 😎

@goodboy
Owner

goodboy commented Apr 7, 2026

Oh weird, and oddly the CI hasn't run yet?

I guess i gotta tweak some settings for fork PRs?

(ahh here it is https://docs.github.com/en/actions/how-tos/manage-workflow-runs/approve-runs-from-forks#approving-workflow-runs-on-a-pull-request-from-a-public-fork)

((wow, i don't see that "and in the section "n workflow(s) awaiting approval", click Approve workflows to run." anywhere..))

@goodboy
Owner

goodboy commented Apr 7, 2026

@goodboy thanks for the swift response. I've addressed the comments around the test clarity and typing.

left you a few more that might help simplify the new test suite 🫣
But definitely push back on me if you prefer to have this super duper end-2-end coverage 👍🏼

And yes, please feel free to edit the PR description.

kk, i'm going to write a follow up section that points to a new issue with a todo around what i wrote below.

I agree that examples of using tractor alongside broader IPC frameworks like MPI/UCX would be very valuable. In our specific case though, I think it would be difficult to turn the integration into a clean upstream example or test.
...
That said, we will likely open-source this project once it stabilizes, and at that point I’d be happy to point at a more concrete example.

In terms of trying to write a follow-up issue for at least adding some simple integration examples, tell me if i'm on the right track with these,


Appendix,

Our use case is real-time extraction of hidden states from TensorRT-LLM for training draft models for speculative decoding. It’s somewhat similar in spirit to TorchSpec, but built for TRT-LLM rather than vLLM/SGLang, and designed to be careful about transfer scheduling so the overhead stays low, and using fewer dependencies (no Ray; uses UCX instead of MoonCake for simplicity)

Sounds super fancy 😂?

i've never even heard of MoonCake 👀, gonna check it. Also wasn't familiar with TorchSpec; thx also for sending.

The tricky part is that TRT-LLM already owns the MPI side and uses it to manage the executor threads/processes. We’re layering a tractor scope on top of that in order to supervise and offload the UCX transfer logic. Since the MPI setup is provided by TRT-LLM rather than by us, it’s not easy to extract a minimal standalone integration example from the full system.

So am i reading this right that you're trying to speed up inference but need a more SC-rigorous and minimal supervision layer for the scheduler in charge of, i guess, whatever goes into this spec-decoding stuff? (anyway don't feel like you have to explain it here, i'm just genuinely interested in the use case. feel free to jump on the normally quiet matrix channel if you want to educate me/us 👍🏼)

@goodboy
Owner

goodboy commented Apr 7, 2026

Oh weird, and oddly the CI hasn't run yet?

I guess i gotta tweak some settings for fork PRs?

(ahh here it is https://docs.github.com/en/actions/how-tos/manage-workflow-runs/approve-runs-from-forks#approving-workflow-runs-on-a-pull-request-from-a-public-fork)

((wow, i don't see that "and in the section "n workflow(s) awaiting approval", click Approve workflows to run." anywhere..))

@mahmoudhas i mentioned it as a todo bullet in the descr but, we need to adjust the ci.yml to make this happen. I'm also happy to bring in a PR beforehand if you don't feel like doing it here, just lmk 👍🏼

@goodboy
Owner

goodboy commented Apr 7, 2026

@mahmoudhas if i don't hear from you by this aft i'll try patching the test suite the way i was hoping and then we can go from there ya?

@mahmoudhas
Contributor Author

I managed to repro the issue I was seeing using just mpi4py (no ucxx). added an integration example here

@goodboy
Owner

goodboy commented Apr 10, 2026

@mahmoudhas maybe one final thing is to rebase to main to ensure the tests that were erroneously being skipped prior but now are running (per #439 landing) are run a final time for our sanity? 🙏🏼

@goodboy
Copy link
Copy Markdown
Owner

goodboy commented Apr 10, 2026

I managed to repro the issue I was seeing using just mpi4py (no ucxx). added an integration example here

Ahh bootyful!

So i assume this is resolved with this patchset though right?

@goodboy goodboy linked an issue Apr 10, 2026 that may be closed by this pull request
4 tasks
getting local rank failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS

Passing ``inherit_parent_main=False`` and placing RPC functions in a
Owner

Ahh nice so then this all works given the #434 patchset yah?

@@ -0,0 +1,5 @@
import os
Owner

Ahh nice, it might be worth tasking us in follow up to be able to dynamically write modules like this from an internal API much like what pytest offers?

something like their tmp_path fixture maybe,

Also relates to our wishlist for #360

Owner

Maybe give #441 a quick lookover to be sure i didn't miss/get-anything-wrong if you can as well 🙏🏼

@goodboy goodboy added the integration Optional/loose support for 3rd party libs/apps/projects label Apr 10, 2026
@goodboy goodboy linked an issue Apr 10, 2026 that may be closed by this pull request
1 task
@goodboy
Owner

goodboy commented Apr 11, 2026

@mahmoudhas i think just a sanity rebase and then we'll land this 🏄🏼

i'm not gonna expect @guilledk to respond before weekend 😂

@goodboy
Owner

goodboy commented Apr 11, 2026

2026-04-10T20:55:05,843639319-04:00

Dunno if i trust that button 😂

mahmoudhas and others added 14 commits April 10, 2026 20:58
Let actor callers skip replaying the parent __main__ during child startup so downstream integrations can avoid inheriting incompatible bootstrap state without changing the default spawn behavior.
Keep trio child bootstrap data in the spawn handshake instead of stashing it on Actor state so the replay opt-out stays explicit and avoids stale-looking runtime fields.
Keep actor-owned parent-main capture and let `_mp_figure_out_main()` decide whether to return `__main__` bootstrap data, avoiding the extra SpawnSpec plumbing while preserving the per-actor flag.
Use `inherit_parent_main` across the actor APIs and helper to better describe the behavior, and restore the reviewer note at child bootstrap where the inherited `__main__` data is copied from `SpawnSpec`.
Clean up mutable defaults, give parent-main bootstrap data a named type, and add direct start_actor coverage so the opt-out change is clearer to review.
Move the subprocess probe into dedicated spawn test support files so the inheritance tests cover the real __main__ replay path without monkeypatching or inline script strings.
Replace the subproc-based test harness with inline
`tractor.open_nursery()` calls that directly check
`actor._parent_main_data` instead of comparing
`__main__.__name__` across a process boundary
(which is a no-op under pytest bc the parent
`__main__` is `pytest.__main__`).

Deats,
- delete `tests/spawn_test_support/` pkg (3 files)
- add `check_parent_main_inheritance()` helper fn
  that asserts on `_parent_main_data` emptiness
- rewrite both `run_in_actor` and `start_actor`
  parent-main tests as inline async fns
- drop `tmp_path` fixture and unused imports

Review: PR goodboy#434 (goodboy, Copilot)
goodboy#434

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
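A rough sketch of what such an in-child check could look like. The helper name and the `_parent_main_data` attribute come from the commit message above; the signature, the `expect_inherited` parameter, and the `SimpleNamespace` stand-ins for the child's actor object are assumptions made for illustration only.

```python
from types import SimpleNamespace


def check_parent_main_inheritance(actor, expect_inherited: bool) -> bool:
    '''
    Assert that the actor's parent-main data is non-empty exactly when
    inheritance is expected. In the real suite this would run inside
    the child actor.
    '''
    data = actor._parent_main_data
    if expect_inherited:
        assert data, 'expected inherited parent `__main__` data'
    else:
        assert not data, f'unexpected parent `__main__` data: {data!r}'
    return True


# Hypothetical stand-ins for the actor object seen inside each child.
inheriting_child = SimpleNamespace(
    _parent_main_data={'init_main_from_name': 'pytest.__main__'},
)
isolated_child = SimpleNamespace(_parent_main_data={})

ok_inherited = check_parent_main_inheritance(
    inheriting_child, expect_inherited=True,
)
ok_isolated = check_parent_main_inheritance(
    isolated_child, expect_inherited=False,
)

# A mismatch fails loudly instead of silently passing.
try:
    check_parent_main_inheritance(inheriting_child, expect_inherited=False)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```

Asserting on the captured data rather than on `__main__.__name__` sidesteps the pytest issue noted in the commit message, where the parent's `__main__` is `pytest.__main__` either way.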
Note the opt-out only applies to the trio spawn
backend; `multiprocessing` `spawn`/`forkserver`
reconstruct `__main__` via stdlib bootstrap.

Review: PR goodboy#438 (Copilot)
goodboy#438

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Drop hand-copied `_fixup_main_from_name()` and `_fixup_main_from_path()`
in favor of direct re-exports from `multiprocessing.spawn`. Simplify
`_mp_figure_out_main()` to call stdlib's `get_preparation_data()`
instead of reimplementing `__main__` module inspection inline.

Also,
- drop `ORIGINAL_DIR` global and `os`, `sys`, `platform`, `types`,
  `runpy` imports.
- pop `authkey` from prep data (unserializable and unneeded by our spawn
  path).
- update mod docstring to reflect delegation.

Review: PR goodboy#438 (Copilot)
goodboy#438

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
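The delegation described in this commit can be sketched with the stdlib alone. This is an approximation under stated assumptions, not the merged `_mp_figure_out_main()`: the function name `figure_out_main`, the `'sketch-actor'` process name, and the `PARENT_MAIN_KEYS` tuple (standing in for the PR's `ParentMainData` keys) are all hypothetical. Only `multiprocessing.spawn.get_preparation_data()` and its `init_main_from_name` / `init_main_from_path` / `authkey` keys are taken from the stdlib.

```python
from multiprocessing.spawn import get_preparation_data

# Assumed key set for the PR's `ParentMainData` type.
PARENT_MAIN_KEYS = ('init_main_from_name', 'init_main_from_path')


def figure_out_main(inherit_parent_main: bool = True) -> dict:
    '''
    Return only the parent `__main__` bootstrap entries from the
    stdlib's preparation data, or `{}` when the actor opted out.
    '''
    if not inherit_parent_main:
        return {}

    prep = get_preparation_data('sketch-actor')
    # `authkey` is not needed here; default of `None` avoids a KeyError.
    prep.pop('authkey', None)
    return {key: prep[key] for key in PARENT_MAIN_KEYS if key in prep}


opted_out = figure_out_main(inherit_parent_main=False)
inherited = figure_out_main()
```

Which of the two keys shows up in `inherited` (if any) depends on how the parent process was launched, so callers should not assume a particular one is present.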
Parametrize `test_loglevel_propagated_to_subactor`
across `'debug'`, `'cancel'`, `'critical'` levels
(was hardcoded to just `'critical'`) and move it
above the parent-main tests for logical grouping.

Also,
- add `start_method: str` annotations throughout
- use `portal.wait_for_result()` in
  `test_most_beautiful_word` (replaces `.result()`)
- expand mod docstring to describe test coverage
- reformat `check_parent_main_inheritance` docstr

Review: PR goodboy#438 (Copilot)
goodboy#438

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Use walrus `:=` to combine the assignment and
truthiness check for `_parent_main_data` into the
`if` condition, cleanly skipping the fixup block
when `inherit_parent_main=False` yields `{}`.

Review: PR goodboy#438 (Copilot)
goodboy#438

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
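The walrus pattern from this commit, shown in isolation. This is a simplified sketch: `child_bootstrap_steps`, the plain-dict `spawn_spec`, and the recorded step strings are hypothetical, and the real code would perform the `__main__` fixup rather than just record it.

```python
def child_bootstrap_steps(spawn_spec: dict) -> list[str]:
    '''
    Record which bootstrap steps a child would take for a given
    (hypothetical, dict-shaped) spawn spec.
    '''
    steps: list[str] = []

    # Bind and truth-test in one expression: an empty `{}` (opt-out)
    # or a missing key skips the whole fixup block.
    if parent_main_data := spawn_spec.get('_parent_main_data'):
        if 'init_main_from_name' in parent_main_data:
            steps.append('fixup from name')
        elif 'init_main_from_path' in parent_main_data:
            steps.append('fixup from path')

    steps.append('start runtime')
    return steps


skipped = child_bootstrap_steps({'_parent_main_data': {}})
by_name = child_bootstrap_steps(
    {'_parent_main_data': {'init_main_from_name': 'pkg.__main__'}},
)
by_path = child_bootstrap_steps(
    {'_parent_main_data': {'init_main_from_path': '/tmp/script.py'}},
)
```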
- "spawing" → "spawning", close unbalanced
  backtick on `` `start_method='trio'` ``
- "uneeded" → "unneeded", "deats" → "details"
- Remove double `d` annotation; filter
  `get_preparation_data()` result into only
  `ParentMainData` keys before returning
- Use `pop('authkey', None)` for safety

Review: PR #1 (Copilot)
#1 (review)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- "boostrap" → "bootstrap" in mod docstring
- replace deprecated `portal.result()` with
  `portal.wait_for_result()` + value assertion
  inside the nursery block

Review: PR #1 (Copilot)
#1 (review)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
@mahmoudhas mahmoudhas force-pushed the add-replay-parent-main-opt-out branch from f7160a8 to ca1b01f Compare April 11, 2026 00:58
@goodboy
Owner

goodboy commented Apr 11, 2026

Final sanity CI 🐎

Owner

@goodboy goodboy left a comment

Yeah for me this is definitely more than good enough 😎

THANKS again @mahmoudhas for struggling through those final refinements with me 🫣 🤣

@goodboy goodboy merged commit 3a9f4ea into goodboy:main Apr 11, 2026
4 checks passed

Labels

integration Optional/loose support for 3rd party libs/apps/projects IPC and transport messaging messaging patterns and protocols spawning of processes, (shm) threads, tasks on varied (OS-specific) backends testing

3 participants