[2.7] Fix external kill before server download by YuanTingHsieh · Pull Request #4275 · NVIDIA/NVFlare

YuanTingHsieh · 2026-03-06T21:49:35Z

Fix: subprocess cell torn down before server tensor download (large-model deadlock)

Problem

big_model_4g CI test reliably failed with:

RuntimeError: failed to download from site-X_active

after 1–3 hours of retry loops (300 s SubmitUpdate timeout × 3 download retries).

Root cause 1 — ordering bug in _finalize_external_execution():

The subprocess client API works by keeping the subprocess's CellNet connection alive
after it sends its result, so the server can pull large tensors directly from the
subprocess's DownloadService (the "reverse PASS_THROUGH" path). To prevent the
subprocess from exiting before the download completes, _do_submit_result() blocks
on a download_done.wait(1800 s) event that fires only after the server has
finished downloading all tensors.

However, _finalize_external_execution() called stop_task() synchronously,
which sends SIGTERM to the subprocess immediately after execute() receives the
result — before ClientRunner had even sent SubmitUpdate to the server, let alone
before the server had started downloading. This tore down the subprocess cell,
causing every server download attempt to hit "no connection to site-X_active" /
"cannot forward req: no path".

The subprocess-side wait was therefore unreachable: the process was killed
externally before it could block.

Root cause 2 — launch_once=True subprocess exits after round 1:

_do_submit_result() called os._exit(0) unconditionally at the end of the
CellPipe + pass_through_on_send path. For launch_once=False (one subprocess per
round) this is correct — the process should exit immediately after the download so
the deferred-stop poller unblocks. But for launch_once=True (one subprocess for
all rounds, e.g. np-loop-cell-pipe, pt_client_api_launch_once), the subprocess
was killed after round 1, leaving rounds 2–N unhandled and the job hanging.

Fix

1. Deferred stop_task() (launcher_executor.py)

Instead of calling stop_task() synchronously, _finalize_external_execution()
now starts a background thread (when _stop_task_wait_timeout > 0) that polls
check_run_status() until the subprocess exits naturally (i.e. after
download_done fires and the subprocess unblocks), then calls stop_task().
execute() returns immediately, so ClientRunner can send SubmitUpdate and the
server can connect to the still-alive subprocess cell to download tensors.

2. Round-boundary coordination (_deferred_stop_event)

Without additional synchronization, the deferred thread from round N could fire
after round N+1's launch_task() call. Since SubprocessLauncher guards
_start_external_process() with if self._process is None, seeing a not-yet-cleared
reference to the exited round-N process causes it to skip starting a new subprocess.
Round N+1 then sees COMPLETE_SUCCESS immediately and fails with
"External process has not called flare.init and run status becomes success".

A threading.Event (_deferred_stop_event, initially set) coordinates the two:

Cleared in _finalize_external_execution() just before the deferred thread starts.
Set unconditionally in a finally block when the deferred thread completes.
_initialize_external_execution() waits on this event before calling
launch_task(), with a forced stop_task() fallback if it times out.

In practice the wait is 0–1 s (subprocess exits naturally after download, well
before the server completes global aggregation and dispatches the next round's task),
so there is no meaningful latency impact.

3. ClientAPILauncherExecutor opt-in (client_api_launcher_executor.py)

_stop_task_wait_timeout is set to download_complete_timeout (default 1800 s) in
ClientAPILauncherExecutor.__init__(), enabling the deferred path only for the
subprocess client API where large-tensor downloads are expected. The base
LauncherExecutor defaults to 0.0 (original synchronous behaviour, no change).

4. launch_once-aware subprocess exit (flare_agent.py, config.py, ex_process/api.py)

prepare_config_for_launch() now writes launch_once (derived from
launcher.needs_deferred_stop()) into the subprocess config file. The subprocess
reads it via ClientConfig.get_launch_once() and passes it to FlareAgent.
_do_submit_result() branches on _launch_once:

`launch_once`	Behaviour after download gate
`False` (one subprocess per round, e.g. `pt-client-api`)	`os._exit(0)` called directly — deferred-stop poller unblocks immediately (original behaviour preserved)
`True` (one subprocess for all rounds, e.g. `np-loop-cell-pipe`)	`atexit.register(os._exit, 0)` registered once — subprocess continues to next round; `os._exit` fires only when `main()` finally returns, bypassing non-daemon CoreCell thread cleanup

_resolve_launch_once() safely fetches the launcher even when self.launcher is
still None at initialize() time (resolved directly from the engine component
registry).

5. Pipe handler identity guard and safe close (task_exchanger.py)

Pipe handler creation is refactored into _create_pipe_handler(), which binds a
per-handler status callback that checks self.pipe_handler is _h before acting.
This prevents a late PEER_GONE from round N's (now-stale) handler from stopping
round N+1's handler. stop(close_pipe=False) is used because CellPipe.close()
is irreversible — closing it in the status callback would prevent the next round
from communicating. An explicit self.pipe.close() is added in END_RUN instead.

6. Defensive logging and heartbeat error handling (pipe_handler.py)

_send_to_pipe now logs asked_to_stop and abort_triggered status when a send
is suppressed. Peer-gone detection logs the elapsed time since the last heartbeat.
Heartbeat sends are wrapped in try/except so a broken pipe sets asked_to_stop
and breaks the heartbeat loop cleanly instead of propagating an unhandled exception.

7. New unit tests (test_download_complete_gating.py, test_download_initiated_gating.py)

Two new test files covering the subprocess download-gating behaviour (the
download_done wait, the validate-path fast return, and launch_once). Both use FlareAgent.__new__() to construct minimal agent
stubs. To prevent os._exit(0) from killing the pytest-xdist worker:

An autouse _no_os_exit fixture patches nvflare.client.flare_agent.os._exit
to a no-op for every test in the file.
_make_agent() sets agent._launch_once = False (the per-round path that calls
os._exit directly, making the fixture's patch the active guard).

Files changed

File	Change
`nvflare/app_common/executors/launcher_executor.py`	Deferred `stop_task()` background thread + `_deferred_stop_event` round-boundary coordination
`nvflare/app_common/executors/client_api_launcher_executor.py`	Set `_stop_task_wait_timeout = download_complete_timeout`; add `_resolve_launch_once()`; write `LAUNCH_ONCE` to subprocess config
`nvflare/app_common/abstract/launcher.py`	`needs_deferred_stop()` abstract method + idempotency/thread-safety note on `stop_task()`
`nvflare/app_common/launchers/subprocess_launcher.py`	Implement `needs_deferred_stop()`; add info logging for process start/stop
`nvflare/app_common/executors/task_exchanger.py`	Refactor pipe handler creation into `_create_pipe_handler()` with identity-checking callback; use `close_pipe=False` to prevent irreversible `CellPipe.close()`
`nvflare/app_common/widgets/metric_relay.py`	Include `msg.data` in pipe status log message
`nvflare/fuel/utils/pipe/pipe_handler.py`	Enhanced logging (send failures, peer-gone elapsed time); heartbeat send error handling
`nvflare/client/config.py`	Add `LAUNCH_ONCE` config key + `get_launch_once()`
`nvflare/client/flare_agent.py`	`launch_once`-aware `_do_submit_result()`: direct `os._exit` vs `atexit`
`nvflare/client/ex_process/api.py`	Pass `launch_once` to `FlareAgentWithFLModel`
`tests/unit_test/client/test_download_complete_gating.py`	New — tests for DOWNLOAD_COMPLETE_CB registration, download-done wait, timeout, status logging, and cleanup
`tests/unit_test/client/test_download_initiated_gating.py`	New — tests for thread-local download-initiation detection (validate-path fast return, no spurious 1800 s wait)
`tests/unit_test/app_common/executors/client_api_launcher_executor_test.py`	New tests for deferred stop, `_deferred_stop_event`, `_stop_task_wait_timeout`

Types of changes

Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Quick tests passed locally by running ./runtest.sh.
In-line docstrings updated.
Documentation updated.

greptile-apps · 2026-03-06T21:52:43Z

Greptile Summary

This PR fixes a reliable deadlock in the big_model_4g CI test where large-tensor downloads failed with "no connection to site-X_active" because _finalize_external_execution() called stop_task() synchronously — sending SIGTERM to the subprocess before the server had a chance to connect to its DownloadService and pull tensors.

The fix introduces a deferred stop_task() background thread (opt-in via _stop_task_wait_timeout in ClientAPILauncherExecutor) that polls until the subprocess exits naturally (after the download_done event fires), then calls stop_task(). A threading.Event (_deferred_stop_event) coordinates the deferred thread with the next round's launch_task() call to prevent skipped subprocess spawning. A complementary fix in FlareAgent makes _do_submit_result() launch_once-aware so per-round subprocesses exit promptly via os._exit(0) while long-lived (launch_once=True) subprocesses register an atexit handler instead. TaskExchanger is refactored to create a fresh PipeHandler per round with an identity-checking callback that prevents stale PEER_GONE events from a prior round's handler from killing the current one.

Key changes and observations:

launcher_executor.py: Deferred-stop thread with bounded polling, _deferred_stop_event round-boundary coordination, monitor-thread guard to skip false abort during download window. Thread-safe coordination of deferred stop (previous round) and launch_task (new round) is correctly implemented with event gating and task name capture.
flare_agent.py: launch_once-aware _do_submit_result() correctly branches between per-round os._exit(0) and multi-round atexit registration. Minor dead-code and documentation issues noted in detailed review.
client_api_launcher_executor.py, ex_process/api.py, config.py: launch_once configuration plumbing correctly connects launcher config to subprocess behavior.
New unit tests in client_api_launcher_executor_test.py, test_download_complete_gating.py, and test_download_initiated_gating.py provide solid coverage of the new coordination logic.

Confidence Score: 4/5

This PR is safe to merge. The deferred-stop mechanism correctly solves the subprocess premature-teardown deadlock with proper thread coordination and abort-signal handling.
The core deferred-stop fix is logically sound: _deferred_stop_event correctly gates round-boundary transitions (cleared before deferred thread starts, set in finally block guarantees next round waits for cleanup). Abort signal is polled every 1 second in the deferred-wait loop, enabling prompt shutdown. Task name capture (_deferred_stop_task_name) ensures timeout fallback stops the correct previous-round task. Monitor-thread guard prevents false-abort detection during the deferred window. Comprehensive unit tests validate the coordination logic, abort handling, timeout scenarios, and multi-round (launch_once=True) behavior. Minor deductions for dead-code issue (unreachable return True after os._exit()) and missing docstring for the new launch_once parameter — neither affects correctness.
nvflare/client/flare_agent.py (minor: dead code and missing docstring; no blocking issues)

Sequence Diagram

sequenceDiagram
    participant CR as ClientRunner
    participant LE as LauncherExecutor
    participant SL as SubprocessLauncher
    participant SP as Subprocess (FlareAgent)
    participant SV as Server

    Note over CR,SV: Round N — launch_once=False (deferred-stop path)

    CR->>LE: execute(task_name)
    LE->>LE: _initialize_external_execution()<br/>waits _deferred_stop_event (set → OK)
    LE->>SL: launch_task()
    SL->>SP: spawn subprocess
    SP-->>LE: flare.init() / heartbeat (pipe up)
    LE->>LE: _wait_external_setup() → True
    LE->>SP: send task shareable via pipe
    SP->>SP: train / compute result
    SP->>LE: send result via CellPipe (pass_through_on_send)
    Note over SP: registers DOWNLOAD_COMPLETE_CB<br/>blocks on download_done.wait()
    LE->>LE: check_output_shareable() → sets _received_result
    LE->>LE: _finalize_external_execution():<br/>deferred path → clear _deferred_stop_event,<br/>start deferred-stop thread, return immediately
    LE-->>CR: return result (execute() returns)
    CR->>SV: SubmitUpdate (subprocess cell still alive!)
    SV->>SP: download tensors from DownloadService
    SP->>SP: download_done fires → DOWNLOAD_COMPLETE_CB
    SP->>SP: os._exit(0) — subprocess exits
    Note over LE: deferred-stop thread polls check_run_status
    LE->>SL: stop_task() [deferred thread]
    SL->>SL: _stop_external_process() → _process = None
    LE->>LE: _deferred_stop_event.set()

    Note over CR,SV: Round N+1
    CR->>LE: execute(task_name)
    LE->>LE: _initialize_external_execution()<br/>waits _deferred_stop_event (now set → OK)
    LE->>LE: reset_peer_is_up_or_dead()
    LE->>SL: launch_task() → new subprocess

Comments Outside Diff (2)

nvflare/client/flare_agent.py, line 445-446 (link)

Unreachable code: os._exit() terminates the process immediately and never returns, making the return True on line 446 dead code. This will trigger warnings in linters/static-analysis tools that are aware os._exit is NoReturn.

Consider removing the statement:
nvflare/client/flare_agent.py, line 482-504 (link)

Missing launch_once parameter in docstring. The new parameter (line 482) controls whether the subprocess handles multiple rounds or exits after each round — a critical behavioral change that should be documented alongside the other parameters.

Add to the docstring:

_{Last reviewed commit: 073e71d}

Copilot

Pull request overview

Fixes a deadlock/failure mode in the subprocess Client API path where the client subprocess was being torn down (via stop_task()) before the server could connect back to the subprocess DownloadService to pull large tensors.

Changes:

Add an opt-in deferred stop_task() mechanism in LauncherExecutor that waits for the external process to exit naturally (or times out) before stopping/cleaning up.
Add round-boundary synchronization (_deferred_stop_event) so a deferred cleanup from round N can’t interfere with launching round N+1.
Enable the deferred-stop behavior for ClientAPILauncherExecutor by wiring _stop_task_wait_timeout to download_complete_timeout.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
`nvflare/app_common/executors/launcher_executor.py`	Adds deferred stop thread + event coordination to avoid killing the subprocess before server-side downloads complete.
`nvflare/app_common/executors/client_api_launcher_executor.py`	Opts the Client API executor into deferred-stop behavior using `download_complete_timeout`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

YuanTingHsieh · 2026-03-06T22:34:12Z

@greptile review

YuanTingHsieh · 2026-03-06T22:42:07Z

@greptile review

YuanTingHsieh · 2026-03-07T00:30:51Z

@greptile review

YuanTingHsieh · 2026-03-07T01:22:16Z

@greptile review

YuanTingHsieh · 2026-03-07T01:44:35Z

For Issue 2 — handler started before launch_task()

"This one is not a real problem in practice. The concern is that the fresh PipeHandler starts its heartbeat clock before the subprocess connects. heartbeat_timeout=60s by default, and _ensure_pipe_handler_alive() is called only nanoseconds before launch_task(). The subprocess only needs to connect within 60s of launch_task() — which is already enforced by _external_pre_init_timeout. There's no new window being opened.

Even if PEER_GONE fired early (which would require the subprocess to take >60s just to call flare.init()), _wait_external_setup would detect it: peer_is_up_or_dead() returns True (set by PEER_GONE), _wait_external_setup returns, then execute() immediately sees pipe_handler.asked_to_stop=True and returns TASK_ABORTED. It's a detectable failure, not a silent hang."

not going to change code

YuanTingHsieh · 2026-03-07T05:09:42Z

/build

chesterxgchen

few comments

pcnudde · 2026-03-09T20:29:27Z

Codex review found two issues in this PR.

In nvflare/client/flare_agent.py, the no-download fast path returns from _do_submit_result() before the new launch_once/per-round shutdown block runs. That means:

launch_once=False skips the os._exit(0) path, so the CJ deferred-stop poller can sit until timeout before killing the subprocess
launch_once=True skips the atexit(os._exit, 0) registration, so the shutdown hang risk on non-daemon CoreCell threads is still present for jobs that only hit the no-download path

I added a minimal local reproducer test for this against the PR branch and it fails in both modes.

In nvflare/app_common/executors/task_exchanger.py, the new pipe status callback permanently stops the current PipeHandler, but later rounds reuse the same handler and only call start() on it. After stop() latches asked_to_stop=True and the worker threads exit, PipeHandler.start() does not recreate a fresh handler. So a late PEER_GONE/END from round N can leave round N+1 with a dead task/result transport. User-visible consequence: the next round can fail immediately with pipe-stopped / peer-not-accepting-task style errors instead of running.

These findings are from Codex review and local validation on PR #4275.

YuanTingHsieh · 2026-03-09T22:58:32Z

/build

## Fix: subprocess cell torn down before server tensor download (large-model deadlock) ### Problem `big_model_4g` CI test reliably failed with: ``` RuntimeError: failed to download from site-X_active ``` after 1–3 hours of retry loops (300 s SubmitUpdate timeout × 3 download retries). **Root cause 1 — ordering bug in `_finalize_external_execution()`:** The subprocess client API works by keeping the subprocess's CellNet connection alive after it sends its result, so the server can pull large tensors directly from the subprocess's `DownloadService` (the "reverse PASS_THROUGH" path). To prevent the subprocess from exiting before the download completes, `_do_submit_result()` blocks on a `download_done.wait(1800 s)` event that fires only after the server has finished downloading all tensors. However, `_finalize_external_execution()` called `stop_task()` **synchronously**, which sends `SIGTERM` to the subprocess immediately after `execute()` receives the result — before `ClientRunner` had even sent `SubmitUpdate` to the server, let alone before the server had started downloading. This tore down the subprocess cell, causing every server download attempt to hit `"no connection to site-X_active"` / `"cannot forward req: no path"`. The subprocess-side wait was therefore unreachable: the process was killed externally before it could block. **Root cause 2 — `launch_once=True` subprocess exits after round 1:** `_do_submit_result()` called `os._exit(0)` unconditionally at the end of the `CellPipe + pass_through_on_send` path. For `launch_once=False` (one subprocess per round) this is correct — the process should exit immediately after the download so the deferred-stop poller unblocks. But for `launch_once=True` (one subprocess for all rounds, e.g. `np-loop-cell-pipe`, `pt_client_api_launch_once`), the subprocess was killed after round 1, leaving rounds 2–N unhandled and the job hanging. --- ### Fix **1. Deferred `stop_task()` (`launcher_executor.py`)** Instead of calling `stop_task()` synchronously, `_finalize_external_execution()` now starts a background thread (when `_stop_task_wait_timeout > 0`) that polls `check_run_status()` until the subprocess exits naturally (i.e. after `download_done` fires and the subprocess unblocks), then calls `stop_task()`. `execute()` returns immediately, so `ClientRunner` can send `SubmitUpdate` and the server can connect to the still-alive subprocess cell to download tensors. **2. Round-boundary coordination (`_deferred_stop_event`)** Without additional synchronization, the deferred thread from round N could fire *after* round N+1's `launch_task()` call. Since `SubprocessLauncher` guards `_start_external_process()` with `if self._process is None`, seeing a not-yet-cleared reference to the exited round-N process causes it to skip starting a new subprocess. Round N+1 then sees `COMPLETE_SUCCESS` immediately and fails with `"External process has not called flare.init and run status becomes success"`. A `threading.Event` (`_deferred_stop_event`, initially set) coordinates the two: - **Cleared** in `_finalize_external_execution()` just before the deferred thread starts. - **Set** unconditionally in a `finally` block when the deferred thread completes. - `_initialize_external_execution()` **waits** on this event before calling `launch_task()`, with a forced `stop_task()` fallback if it times out. In practice the wait is 0–1 s (subprocess exits naturally after download, well before the server completes global aggregation and dispatches the next round's task), so there is no meaningful latency impact. **3. `ClientAPILauncherExecutor` opt-in (`client_api_launcher_executor.py`)** `_stop_task_wait_timeout` is set to `download_complete_timeout` (default 1800 s) in `ClientAPILauncherExecutor.__init__()`, enabling the deferred path only for the subprocess client API where large-tensor downloads are expected. The base `LauncherExecutor` defaults to `0.0` (original synchronous behaviour, no change). **4. `launch_once`-aware subprocess exit (`flare_agent.py`, `config.py`, `ex_process/api.py`)** `prepare_config_for_launch()` now writes `launch_once` (derived from `launcher.needs_deferred_stop()`) into the subprocess config file. The subprocess reads it via `ClientConfig.get_launch_once()` and passes it to `FlareAgent`. `_do_submit_result()` branches on `_launch_once`: | `launch_once` | Behaviour after download gate | |---|---| | `False` (one subprocess per round, e.g. `pt-client-api`) | `os._exit(0)` called directly — deferred-stop poller unblocks immediately (original behaviour preserved) | | `True` (one subprocess for all rounds, e.g. `np-loop-cell-pipe`) | `atexit.register(os._exit, 0)` registered once — subprocess continues to next round; `os._exit` fires only when `main()` finally returns, bypassing non-daemon CoreCell thread cleanup | `_resolve_launch_once()` safely fetches the launcher even when `self.launcher` is still `None` at `initialize()` time (resolved directly from the engine component registry). **5. Pipe handler identity guard and safe close (`task_exchanger.py`)** Pipe handler creation is refactored into `_create_pipe_handler()`, which binds a per-handler status callback that checks `self.pipe_handler is _h` before acting. This prevents a late `PEER_GONE` from round N's (now-stale) handler from stopping round N+1's handler. `stop(close_pipe=False)` is used because `CellPipe.close()` is irreversible — closing it in the status callback would prevent the next round from communicating. An explicit `self.pipe.close()` is added in `END_RUN` instead. **6. Defensive logging and heartbeat error handling (`pipe_handler.py`)** `_send_to_pipe` now logs `asked_to_stop` and `abort_triggered` status when a send is suppressed. Peer-gone detection logs the elapsed time since the last heartbeat. Heartbeat sends are wrapped in `try/except` so a broken pipe sets `asked_to_stop` and breaks the heartbeat loop cleanly instead of propagating an unhandled exception. **7. New unit tests (`test_download_complete_gating.py`, `test_download_initiated_gating.py`)** Two new test files covering the subprocess download-gating behaviour (the `download_done` wait, the validate-path fast return, and `launch_once`). Both use `FlareAgent.__new__()` to construct minimal agent stubs. To prevent `os._exit(0)` from killing the pytest-xdist worker: - An `autouse` `_no_os_exit` fixture patches `nvflare.client.flare_agent.os._exit` to a no-op for every test in the file. - `_make_agent()` sets `agent._launch_once = False` (the per-round path that calls `os._exit` directly, making the fixture's patch the active guard). --- ### Files changed | File | Change | |------|--------| | `nvflare/app_common/executors/launcher_executor.py` | Deferred `stop_task()` background thread + `_deferred_stop_event` round-boundary coordination | | `nvflare/app_common/executors/client_api_launcher_executor.py` | Set `_stop_task_wait_timeout = download_complete_timeout`; add `_resolve_launch_once()`; write `LAUNCH_ONCE` to subprocess config | | `nvflare/app_common/abstract/launcher.py` | `needs_deferred_stop()` abstract method + idempotency/thread-safety note on `stop_task()` | | `nvflare/app_common/launchers/subprocess_launcher.py` | Implement `needs_deferred_stop()`; add info logging for process start/stop | | `nvflare/app_common/executors/task_exchanger.py` | Refactor pipe handler creation into `_create_pipe_handler()` with identity-checking callback; use `close_pipe=False` to prevent irreversible `CellPipe.close()` | | `nvflare/app_common/widgets/metric_relay.py` | Include `msg.data` in pipe status log message | | `nvflare/fuel/utils/pipe/pipe_handler.py` | Enhanced logging (send failures, peer-gone elapsed time); heartbeat send error handling | | `nvflare/client/config.py` | Add `LAUNCH_ONCE` config key + `get_launch_once()` | | `nvflare/client/flare_agent.py` | `launch_once`-aware `_do_submit_result()`: direct `os._exit` vs `atexit` | | `nvflare/client/ex_process/api.py` | Pass `launch_once` to `FlareAgentWithFLModel` | | `tests/unit_test/client/test_download_complete_gating.py` | **New** — tests for DOWNLOAD_COMPLETE_CB registration, download-done wait, timeout, status logging, and cleanup | | `tests/unit_test/client/test_download_initiated_gating.py` | **New** — tests for thread-local download-initiation detection (validate-path fast return, no spurious 1800 s wait) | | `tests/unit_test/app_common/executors/client_api_launcher_executor_test.py` | New tests for deferred stop, `_deferred_stop_event`, `_stop_task_wait_timeout` | ### Types of changes  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.

### Description Cherry-pick #4225 #4231 #4247 #4250 #4260 #4259 #4272 #4270 #4249 #4275 #4296 #4297 #4309 #4307 #4312 #4314 ### Types of changes  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Peter Cnudde <pcnudde@nvidia.com>

(large-model deadlock) `big_model_4g` CI test reliably failed with: ``` RuntimeError: failed to download from site-X_active ``` after 1–3 hours of retry loops (300 s SubmitUpdate timeout × 3 download retries). **Root cause 1 — ordering bug in `_finalize_external_execution()`:** The subprocess client API works by keeping the subprocess's CellNet connection alive after it sends its result, so the server can pull large tensors directly from the subprocess's `DownloadService` (the "reverse PASS_THROUGH" path). To prevent the subprocess from exiting before the download completes, `_do_submit_result()` blocks on a `download_done.wait(1800 s)` event that fires only after the server has finished downloading all tensors. However, `_finalize_external_execution()` called `stop_task()` **synchronously**, which sends `SIGTERM` to the subprocess immediately after `execute()` receives the result — before `ClientRunner` had even sent `SubmitUpdate` to the server, let alone before the server had started downloading. This tore down the subprocess cell, causing every server download attempt to hit `"no connection to site-X_active"` / `"cannot forward req: no path"`. The subprocess-side wait was therefore unreachable: the process was killed externally before it could block. **Root cause 2 — `launch_once=True` subprocess exits after round 1:** `_do_submit_result()` called `os._exit(0)` unconditionally at the end of the `CellPipe + pass_through_on_send` path. For `launch_once=False` (one subprocess per round) this is correct — the process should exit immediately after the download so the deferred-stop poller unblocks. But for `launch_once=True` (one subprocess for all rounds, e.g. `np-loop-cell-pipe`, `pt_client_api_launch_once`), the subprocess was killed after round 1, leaving rounds 2–N unhandled and the job hanging. --- **1. Deferred `stop_task()` (`launcher_executor.py`)** Instead of calling `stop_task()` synchronously, `_finalize_external_execution()` now starts a background thread (when `_stop_task_wait_timeout > 0`) that polls `check_run_status()` until the subprocess exits naturally (i.e. after `download_done` fires and the subprocess unblocks), then calls `stop_task()`. `execute()` returns immediately, so `ClientRunner` can send `SubmitUpdate` and the server can connect to the still-alive subprocess cell to download tensors. **2. Round-boundary coordination (`_deferred_stop_event`)** Without additional synchronization, the deferred thread from round N could fire *after* round N+1's `launch_task()` call. Since `SubprocessLauncher` guards `_start_external_process()` with `if self._process is None`, seeing a not-yet-cleared reference to the exited round-N process causes it to skip starting a new subprocess. Round N+1 then sees `COMPLETE_SUCCESS` immediately and fails with `"External process has not called flare.init and run status becomes success"`. A `threading.Event` (`_deferred_stop_event`, initially set) coordinates the two: - **Cleared** in `_finalize_external_execution()` just before the deferred thread starts. - **Set** unconditionally in a `finally` block when the deferred thread completes. - `_initialize_external_execution()` **waits** on this event before calling `launch_task()`, with a forced `stop_task()` fallback if it times out. In practice the wait is 0–1 s (subprocess exits naturally after download, well before the server completes global aggregation and dispatches the next round's task), so there is no meaningful latency impact. **3. `ClientAPILauncherExecutor` opt-in (`client_api_launcher_executor.py`)** `_stop_task_wait_timeout` is set to `download_complete_timeout` (default 1800 s) in `ClientAPILauncherExecutor.__init__()`, enabling the deferred path only for the subprocess client API where large-tensor downloads are expected. The base `LauncherExecutor` defaults to `0.0` (original synchronous behaviour, no change). **4. `launch_once`-aware subprocess exit (`flare_agent.py`, `config.py`, `ex_process/api.py`)** `prepare_config_for_launch()` now writes `launch_once` (derived from `launcher.needs_deferred_stop()`) into the subprocess config file. The subprocess reads it via `ClientConfig.get_launch_once()` and passes it to `FlareAgent`. `_do_submit_result()` branches on `_launch_once`: | `launch_once` | Behaviour after download gate | |---|---| | `False` (one subprocess per round, e.g. `pt-client-api`) | `os._exit(0)` called directly — deferred-stop poller unblocks immediately (original behaviour preserved) | | `True` (one subprocess for all rounds, e.g. `np-loop-cell-pipe`) | `atexit.register(os._exit, 0)` registered once — subprocess continues to next round; `os._exit` fires only when `main()` finally returns, bypassing non-daemon CoreCell thread cleanup | `_resolve_launch_once()` safely fetches the launcher even when `self.launcher` is still `None` at `initialize()` time (resolved directly from the engine component registry). **5. Pipe handler identity guard and safe close (`task_exchanger.py`)** Pipe handler creation is refactored into `_create_pipe_handler()`, which binds a per-handler status callback that checks `self.pipe_handler is _h` before acting. This prevents a late `PEER_GONE` from round N's (now-stale) handler from stopping round N+1's handler. `stop(close_pipe=False)` is used because `CellPipe.close()` is irreversible — closing it in the status callback would prevent the next round from communicating. An explicit `self.pipe.close()` is added in `END_RUN` instead. **6. Defensive logging and heartbeat error handling (`pipe_handler.py`)** `_send_to_pipe` now logs `asked_to_stop` and `abort_triggered` status when a send is suppressed. Peer-gone detection logs the elapsed time since the last heartbeat. Heartbeat sends are wrapped in `try/except` so a broken pipe sets `asked_to_stop` and breaks the heartbeat loop cleanly instead of propagating an unhandled exception. **7. New unit tests (`test_download_complete_gating.py`, `test_download_initiated_gating.py`)** Two new test files covering the subprocess download-gating behaviour (the `download_done` wait, the validate-path fast return, and `launch_once`). Both use `FlareAgent.__new__()` to construct minimal agent stubs. To prevent `os._exit(0)` from killing the pytest-xdist worker: - An `autouse` `_no_os_exit` fixture patches `nvflare.client.flare_agent.os._exit` to a no-op for every test in the file. - `_make_agent()` sets `agent._launch_once = False` (the per-round path that calls `os._exit` directly, making the fixture's patch the active guard). --- | File | Change | |------|--------| | `nvflare/app_common/executors/launcher_executor.py` | Deferred `stop_task()` background thread + `_deferred_stop_event` round-boundary coordination | | `nvflare/app_common/executors/client_api_launcher_executor.py` | Set `_stop_task_wait_timeout = download_complete_timeout`; add `_resolve_launch_once()`; write `LAUNCH_ONCE` to subprocess config | | `nvflare/app_common/abstract/launcher.py` | `needs_deferred_stop()` abstract method + idempotency/thread-safety note on `stop_task()` | | `nvflare/app_common/launchers/subprocess_launcher.py` | Implement `needs_deferred_stop()`; add info logging for process start/stop | | `nvflare/app_common/executors/task_exchanger.py` | Refactor pipe handler creation into `_create_pipe_handler()` with identity-checking callback; use `close_pipe=False` to prevent irreversible `CellPipe.close()` | | `nvflare/app_common/widgets/metric_relay.py` | Include `msg.data` in pipe status log message | | `nvflare/fuel/utils/pipe/pipe_handler.py` | Enhanced logging (send failures, peer-gone elapsed time); heartbeat send error handling | | `nvflare/client/config.py` | Add `LAUNCH_ONCE` config key + `get_launch_once()` | | `nvflare/client/flare_agent.py` | `launch_once`-aware `_do_submit_result()`: direct `os._exit` vs `atexit` | | `nvflare/client/ex_process/api.py` | Pass `launch_once` to `FlareAgentWithFLModel` | | `tests/unit_test/client/test_download_complete_gating.py` | **New** — tests for DOWNLOAD_COMPLETE_CB registration, download-done wait, timeout, status logging, and cleanup | | `tests/unit_test/client/test_download_initiated_gating.py` | **New** — tests for thread-local download-initiation detection (validate-path fast return, no spurious 1800 s wait) | | `tests/unit_test/app_common/executors/client_api_launcher_executor_test.py` | New tests for deferred stop, `_deferred_stop_event`, `_stop_task_wait_timeout` |  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.

YuanTingHsieh added 2 commits March 6, 2026 13:24

Fix external process killed before server downloads

d36513c

enhance deferred stop event handling

5e1aafe

Copilot AI review requested due to automatic review settings March 6, 2026 21:49

Copilot started reviewing on behalf of YuanTingHsieh March 6, 2026 21:50 View session

greptile-apps Bot reviewed Mar 6, 2026

View reviewed changes

Comment thread nvflare/app_common/executors/launcher_executor.py

Comment thread nvflare/app_common/executors/launcher_executor.py

Comment thread nvflare/app_common/executors/launcher_executor.py

Copilot AI reviewed Mar 6, 2026

View reviewed changes

Comment thread nvflare/app_common/executors/launcher_executor.py

Comment thread nvflare/app_common/executors/launcher_executor.py Outdated

Comment thread nvflare/app_common/executors/launcher_executor.py Outdated

Comment thread nvflare/app_common/executors/launcher_executor.py Outdated

Address comments

da1c2fe

greptile-apps Bot reviewed Mar 6, 2026

View reviewed changes

Comment thread nvflare/app_common/widgets/tensorboard.py Outdated

Comment thread nvflare/app_common/widgets/tensorboard.py Outdated

Comment thread nvflare/app_common/widgets/tensorboard.py Outdated

remove extra files

672d6a0

greptile-apps Bot reviewed Mar 6, 2026

View reviewed changes

Comment thread nvflare/app_common/executors/launcher_executor.py

address comments

da600a3

greptile-apps Bot reviewed Mar 6, 2026

View reviewed changes

Comment thread nvflare/app_common/executors/launcher_executor.py Outdated

Address comments

cf79de9

YuanTingHsieh changed the title ~~Fix external kill before server download~~ [2.7] Fix external kill before server download Mar 7, 2026

Address comments

986ba2e

Address comments

88053d5

YuanTingHsieh added 8 commits March 6, 2026 18:11

Fix issue

daf2d50

fix issue

ce44808

remove recreate pipe handler

8d0550d

add logs

cf2ce73

add debug

5549f14

add atexit

0f9545c

fix issue

96ff8bc

remove debug logging

d0f11d2

chesterxgchen reviewed Mar 8, 2026

View reviewed changes

Comment thread nvflare/app_common/executors/launcher_executor.py

chesterxgchen reviewed Mar 8, 2026

View reviewed changes

Comment thread nvflare/app_common/executors/task_exchanger.py

chesterxgchen reviewed Mar 8, 2026

View reviewed changes

Comment thread nvflare/app_common/launchers/subprocess_launcher.py

chesterxgchen reviewed Mar 8, 2026

View reviewed changes

YuanTingHsieh added 8 commits March 8, 2026 20:54

Fix issue

7425b1d

fix test

d852837

fix atexit bug

e31a024

fix issue

a2b52d5

fix issue

31d216b

fix unit test

0676616

edit test configs

5a9fba3

Fix review and test

0bec755

YuanTingHsieh added 4 commits March 9, 2026 13:53

fix review comments

b8d8c70

fix issue

1d49f81

fix issue

e83f708

Merge branch '2.7' into fix_external_kill_before_server_download

073e71d

YuanTingHsieh requested review from chesterxgchen and pcnudde March 9, 2026 23:03

pcnudde approved these changes Mar 9, 2026

View reviewed changes

chesterxgchen merged commit e1a1d38 into NVIDIA:2.7 Mar 9, 2026
19 checks passed

YuanTingHsieh deleted the fix_external_kill_before_server_download branch March 9, 2026 23:33

YuanTingHsieh mentioned this pull request Mar 17, 2026

Cherry pick changes from 2.7 #4327

Merged

6 tasks

chesterxgchen mentioned this pull request Mar 29, 2026

Port remaining 2.7.2 changes to main #4376

Closed

Conversation

YuanTingHsieh commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix: subprocess cell torn down before server tensor download (large-model deadlock)

Problem

Fix

Files changed

Types of changes

Uh oh!

greptile-apps Bot commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Sequence Diagram

Comments Outside Diff (2)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

YuanTingHsieh commented Mar 6, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

YuanTingHsieh commented Mar 6, 2026

Uh oh!

Uh oh!

Uh oh!

YuanTingHsieh commented Mar 7, 2026

Uh oh!

YuanTingHsieh commented Mar 7, 2026

Uh oh!

YuanTingHsieh commented Mar 7, 2026

Uh oh!

YuanTingHsieh commented Mar 7, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chesterxgchen left a comment

Choose a reason for hiding this comment

Uh oh!

pcnudde commented Mar 9, 2026

Uh oh!

YuanTingHsieh commented Mar 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

YuanTingHsieh commented Mar 6, 2026 •

edited

Loading

greptile-apps Bot commented Mar 6, 2026 •

edited

Loading