Merged
1 change: 1 addition & 0 deletions docs.json
Original file line number Diff line number Diff line change
@@ -305,6 +305,7 @@
"models/sweeps/parallelize-agents",
"models/sweeps/visualize-sweep-results",
"models/sweeps/pause-resume-and-cancel-sweeps",
"models/sweeps/signal-handling-sweep-runs",
"models/sweeps/useful-resources",
"models/sweeps/local-controller",
"models/sweeps/troubleshoot-sweeps",
28 changes: 16 additions & 12 deletions models/runs/resuming.mdx
@@ -106,18 +106,22 @@ If you can not share a filesystem, specify the `WANDB_RUN_ID` environment variab


## Resume preemptible Sweeps runs
Automatically requeue interrupted [sweep](/models/sweeps/) runs. This is particularly useful if you run a sweep agent in a compute environment that is subject to preemption such as a SLURM job in a preemptible queue, an EC2 spot instance, or a Google Cloud preemptible VM.

Use the [`mark_preempting`](/models/ref/python/experiments/run#mark_preempting) function to automatically requeue interrupted sweep runs. For example:
When you handle preemption correctly, interrupted [sweep](/models/sweeps/) runs can be requeued automatically so another agent can pick them up. That pattern is especially helpful when the sweep agent runs in a preemptible environment, such as a SLURM job in a preemptible queue, an EC2 Spot instance, or a Google Cloud preemptible VM.

```python
with wandb.init() as run:
    run.mark_preempting()
```
The following table outlines how W&B handles runs based on the exit status of a sweep run.
The behavior below applies when you run sweep agents with the [`wandb agent`](/models/ref/cli/wandb-agent) CLI, which starts your training program as a **subprocess**. It does not fully apply when you use only the Python API [`wandb.agent()`](/models/ref/python/functions/agent), because that path runs your training function in a thread rather than a separate process, so OS signal delivery and forwarding do not match the CLI agent model.

**Recommended pattern:** Register a signal handler for the preemption signal your scheduler or platform uses (for example `SIGUSR1` or `SIGTERM`). In the handler, call [`mark_preempting()`](/models/ref/python/experiments/run#mark_preempting) when a run is active, perform any cleanup (such as saving a checkpoint), then exit with a non-zero code (a common convention is `128 + signum` for signal termination). Do **not** call `mark_preempting()` unconditionally immediately after `wandb.init()`. Doing so can mark every failure, including code bugs, as preemption and requeue the run repeatedly.

For runnable examples, `--forward-signals` on the CLI agent, and a full reference table for different uses of `mark_preempting()`, see [Signal handling and sweep runs](/models/sweeps/signal-handling-sweep-runs).

When you follow that pattern, W&B records run state roughly as follows:

| Scenario | Run state |
| --- | --- |
| Run completes normally with exit code 0 | FINISHED |
| Run fails with a non-zero exit code | FAILED |
| Run receives an unhandled signal (for example `SIGKILL`) | CRASHED after about five minutes |
| Run receives a handled preemption signal (for example `SIGTERM` or `SIGUSR1`), the handler calls `mark_preempting()`, and the process exits non-zero | PREEMPTED; the run is queued for the next agent request |

|Status| Behavior |
|------| ---------|
|Status code 0| Run is considered to have terminated successfully and it will not be requeued. |
|Nonzero status| W&B automatically appends the run to a run queue associated with the sweep.|
|No status| Run is added to the sweep run queue. Sweep agents consume runs off the run queue until the queue is empty. Once the queue is empty, the sweep queue resumes generating new runs based on the sweep search algorithm.|
Sweep agents drain the run queue before the sweep generates new hyperparameter combinations from the search algorithm. Once the queue is empty, the sweep resumes normal scheduling.
2 changes: 1 addition & 1 deletion models/sweeps/pause-resume-and-cancel-sweeps.mdx
@@ -43,7 +43,7 @@ W&B does not terminate active [sweeps](/models/sweeps) or agents when you delete

### Cancel a sweep

Cancel a sweep to immediately kill all running runs and stop creating new runs. This is the only sweep command that forcibly terminates existing runs. Use the [`wandb sweep --cancel`](/models/ref/cli/wandb-sweep) command to cancel a sweep. Provide the sweep ID that you want to cancel.
Cancel a sweep to immediately kill all running runs and stop creating new runs. This is the only sweep command that forcibly terminates existing runs. Runs are terminated abruptly; the running processes have no chance to run user-defined signal handlers. Use the [`wandb sweep --cancel`](/models/ref/cli/wandb-sweep) command to cancel a sweep. Provide the sweep ID that you want to cancel. For more on signals and sweep runs, see [Signal handling and sweep runs](/models/sweeps/signal-handling-sweep-runs).

```bash
wandb sweep --cancel entity/project/sweep_ID
132 changes: 132 additions & 0 deletions models/sweeps/signal-handling-sweep-runs.mdx
@@ -0,0 +1,132 @@
---
description: Learn how W&B Sweeps handle UNIX signals, exit codes, and preemption in sweep runs.
title: Signal handling and sweep runs
---

This page explains how W&B Sweeps handle system signals and process exit codes so that you can run sweeps reliably in preemptible environments such as SLURM, EC2 Spot, or Google Cloud preemptible VMs. It covers how to interrupt runs cleanly from the keyboard and how to predict when a run is requeued. For details about how runs are requeued when preempted, see [Resume preemptible Sweeps runs](/models/runs/resuming#resume-preemptible-sweeps-runs).

## Exit status and signals

W&B uses the training process exit status to decide whether a run is requeued and how run state is recorded.

**Exit code contract:**

- **Exit code 0**: The run is considered to have completed successfully and is not requeued.
- **Non-zero exit code**: The run is treated as failed or preempted. When you use [`mark_preempting()`](/models/ref/python/experiments/run#mark_preempting), W&B requeues the run so another agent (or the same agent after restart) can resume it.

This applies whether the process exits from a signal handler, from an exception, or from an explicit `sys.exit()` call. The contract matters most in preemptible or cluster environments, where requeue behavior depends on it.

When the process exits due to a [**catchable** signal](#catchable-signals-and-preemption), your handler can run, call [`wandb.run.mark_preempting()`](/models/ref/python/experiments/run#mark_preempting) if you want the run requeued, perform cleanup (for example, save a checkpoint), then exit with a non-zero code. A common convention is `sys.exit(128 + signum)` for termination by signal. W&B records that exit code and the same [requeue rules](/models/runs/resuming#resume-preemptible-sweeps-runs) apply. When the process is killed by the operating system kernel with [**`SIGKILL`**](#sigkill-uncatchable), the process cannot run exit hooks, so no final summary is written and the run may appear as crashed or killed; the agent still starts the next run.
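
To make the `128 + signum` convention concrete, the following sketch decodes an exit status into a readable description. The helper name `describe_exit` is hypothetical and is not part of the W&B API. It assumes a POSIX system and relies on two general behaviors: Python's `subprocess` reports death by an uncaught signal as a negative return code, and a handler that follows the convention above exits with `128 + signum`.

```python
import signal


def _signal_name(number):
    """Return the signal name for `number`, or None if it is not a signal."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return None


def describe_exit(returncode):
    """Describe a child's exit status (hypothetical helper, not a W&B API)."""
    if returncode == 0:
        return "exited cleanly (0)"
    if returncode < 0:
        # subprocess.Popen.returncode is -N when signal N killed the child.
        return f"killed by {_signal_name(-returncode) or -returncode}"
    if returncode > 128:
        # By convention, a handler that exits on signal N uses 128 + N.
        name = _signal_name(returncode - 128)
        if name is not None:
            return f"exited with {returncode} (128 + {name})"
    return f"exited with code {returncode}"
```

A launcher could log `describe_exit(proc.returncode)` after each training process ends, which makes it easier to tell a handled preemption from an ordinary failure.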

## Stale runs and server-side timeouts

If a run neither finishes nor posts new metrics for a long time (on the order of about five minutes), the W&B server marks the run as **crashed**. That often happens when the training process hangs, stops logging, or is terminated without a clean exit (for example after `SIGKILL`). Logging metrics on a steady cadence or exiting with a defined code helps keep run state aligned with what actually happened.
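
One way to keep a steady logging cadence is a small time-based wrapper around whatever logging call you use. The sketch below is an illustration only: `SteadyLogger` is a hypothetical helper, the 60-second default is an arbitrary assumption, and `log_fn` stands in for a call such as `run.log`.

```python
import time


class SteadyLogger:
    """Call `log_fn` whenever `interval` seconds have passed (hypothetical helper)."""

    def __init__(self, log_fn, interval=60.0, clock=time.monotonic):
        self._log_fn = log_fn      # for example, run.log
        self._interval = interval  # seconds between logs (assumed value)
        self._clock = clock        # injectable so the class is easy to test
        self._last = None

    def maybe_log(self, metrics):
        """Log `metrics` if the interval has elapsed; return True if logged."""
        now = self._clock()
        if self._last is not None and now - self._last < self._interval:
            return False
        self._log_fn(metrics)
        self._last = now
        return True
```

Calling `maybe_log({"step": step})` from the inner training loop keeps metrics flowing without logging on every iteration, provided a single iteration takes less time than the interval.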

## Catchable signals and preemption

You can register custom signal handlers in your training script. When a catchable signal is delivered, your handler runs; metrics already sent to W&B are preserved, and the agent detects the process exit and starts the next run.

**Best practices:**

- Register handlers early (for example, before entering the main training loop).
- In the handler, call [`wandb.run.mark_preempting()`](/models/ref/python/experiments/run#mark_preempting) when you intend the run to be requeued after preemption, perform cleanup (for example, save a checkpoint), then exit with a non-zero code.

The following example registers handlers for `SIGUSR1` (a typical cluster preemption signal) and `SIGTERM`. It leaves `SIGINT` free for interactive use (for example, manual cancellation from the terminal). The handler calls `wandb.run.mark_preempting()` and exits using `128 + signum`:

```python
import signal
import sys
import wandb


def signal_handler(signum, frame):
    if wandb.run is not None:
        # Optional: save a model checkpoint, flush buffers, and so on.
        print(f"Preempted with signal: {signal.Signals(signum).name}.")
        wandb.run.mark_preempting()
    # Exit non-zero even if no run is active (128 + signum by convention).
    sys.exit(128 + signum)


def train():
    # Register handlers before the training loop starts.
    signal.signal(signal.SIGUSR1, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)

    with wandb.init() as run:
        config = wandb.config
        for epoch in range(100):
            # Training step; wandb.log(...) as needed
            pass


if __name__ == "__main__":
    train()
```
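
If saving a checkpoint from inside the handler is awkward (for example, because the signal can arrive in the middle of an optimizer step), a possible variation is to have the handler only record the signal and let the training loop act on it at a safe point. This is a sketch under stated assumptions, not a W&B-provided pattern: `PreemptionFlag`, `exit_if_preempted`, and `save_checkpoint` are hypothetical names, and `run` is expected to expose `mark_preempting()` as described above.

```python
import signal
import sys


class PreemptionFlag:
    """Record the first preemption signal; the training loop polls it."""

    def __init__(self, signals=(signal.SIGUSR1, signal.SIGTERM)):
        self.signum = None
        for sig in signals:
            signal.signal(sig, self._on_signal)

    def _on_signal(self, signum, frame):
        # Keep the handler tiny: only remember which signal arrived.
        if self.signum is None:
            self.signum = signum

    @property
    def requested(self):
        return self.signum is not None


def exit_if_preempted(flag, run, save_checkpoint):
    """Call at a safe point, for example at the end of each epoch."""
    if not flag.requested:
        return
    save_checkpoint()          # hypothetical user-supplied callable
    if run is not None:
        run.mark_preempting()  # request a requeue, as described above
    sys.exit(128 + flag.signum)
```

The trade-off is latency: the process exits only when the loop next reaches `exit_if_preempted`, so the interval between safe points should be shorter than any grace period your scheduler allows before it escalates to `SIGKILL`.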

## `SIGKILL` (uncatchable)

`SIGKILL` cannot be caught or ignored. The process terminates immediately with no chance to run handlers or atexit callbacks. W&B cannot write a final summary for the run. The agent still recovers and continues the sweep, but run data for that run is incomplete. Use `SIGKILL` only as a last resort; prefer `SIGTERM` or `SIGINT` when you need graceful shutdown.

## Forwarding signals from agent to child

When you use the [`wandb agent`](/models/ref/cli/wandb-agent) CLI, the agent runs your training script as a **child process**. When you interrupt the **agent** (for example, with Ctrl+C or when a scheduler sends `SIGTERM` to the job), the **child** (training process) does not receive the signal by default; the training script cannot run its handler or call `mark_preempting()`. This is described in [GitHub #3667](https://github.com/wandb/wandb/issues/3667).

To let the child shut down gracefully and call `wandb.run.mark_preempting()` in a handler, run the CLI agent with `--forward-signals`:

```bash
wandb agent --forward-signals entity/project/sweep_ID
```

Signal forwarding is **not** supported for [`wandb.agent()`](/models/ref/python/functions/agent) in the Python API. That path runs your training function in a thread, not as a separate child process, so the same forwarding behavior does not apply.

When the CLI agent receives `SIGINT` or `SIGTERM` with forwarding enabled, it relays the signal to the child. Your training script's handler can then call `wandb.run.mark_preempting()`, call [`wandb.finish()`](/models/ref/python/experiments/run#finish) with a non-zero exit code if needed, and exit with a non-zero code. If you press Ctrl+C twice on the agent process, the agent receives `SIGTERM` by default. With `--forward-signals`, `SIGINT` can be forwarded to the child so your handler runs.
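
To illustrate what relaying a signal from a parent to a child means in general, here is a minimal stand-alone sketch. It is not the W&B agent's implementation; `run_and_forward` is a hypothetical function, and the sketch assumes a POSIX system and that it is called from the main thread (Python only allows signal handlers to be installed there).

```python
import signal
import subprocess
import time


def run_and_forward(cmd, signals=(signal.SIGINT, signal.SIGTERM)):
    """Run `cmd` as a child process and relay `signals` to it.

    Returns the child's exit code. Illustrative only: this is not how
    the W&B agent is implemented.
    """
    child = subprocess.Popen(cmd)

    def relay(signum, frame):
        # Pass the signal on so the child's own handler can run.
        child.send_signal(signum)

    previous = {sig: signal.signal(sig, relay) for sig in signals}
    try:
        # Poll instead of blocking so Python can run `relay` promptly.
        while child.poll() is None:
            time.sleep(0.1)
    finally:
        # Restore whatever handlers were installed before.
        for sig, handler in previous.items():
            if handler is not None:
                signal.signal(sig, handler)
    return child.returncode
```

Without the `relay` handler, a `SIGTERM` sent to the parent would end the parent while the child kept running, which is the situation that `--forward-signals` is meant to avoid for the CLI agent.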

See the [wandb agent](/models/ref/cli/wandb-agent) CLI reference for details.

## Preemptible clusters like `SLURM`

On preemption, the **training process** must receive the signal, mark the run as preempting, and exit with a non-zero code so the run is requeued. A new agent (or the same agent after the job is requeued) can then resume the run.

**Ensure the training process receives the signal:**

1. **When the scheduler signals the agent**: Run the agent with `wandb agent --forward-signals` so that when the scheduler (or user) sends a signal to the agent, the agent forwards it to the child. The child's handler can then call `wandb.run.mark_preempting()`, [`wandb.finish(exit_code=...)`](/models/ref/python/experiments/run#finish) with a non-zero code, and `sys.exit(128 + signum)` (or another non-zero exit code).
2. **When the scheduler signals the launch script (not the agent directly)**: Have the launch script send the preemption signal directly to the training process. For example, the training script writes its process ID to a file; the launch script traps the cluster signal (for example `SIGUSR1`) and runs `kill -SIGUSR1 $(cat $PID_FILE)` so the training process's handler runs.
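
The PID-file approach in the second option can be sketched in Python as follows. The function names and the idea of writing the launcher in Python are assumptions for illustration; a shell launcher with `trap` and `kill` works the same way. The sketch assumes a POSIX system and a filesystem path that both processes can read.

```python
import os
import signal
from pathlib import Path


def write_pid_file(path):
    """Training side: publish this process's PID.

    Call this after registering signal handlers, so a signal that
    arrives right away is already handled.
    """
    Path(path).write_text(str(os.getpid()))


def signal_pid_file(path, signum=signal.SIGUSR1):
    """Launcher side: send `signum` to the PID recorded in `path`."""
    pid = int(Path(path).read_text().strip())
    os.kill(pid, signum)
    return pid


def install_relay(path, signum=signal.SIGUSR1):
    """Launcher side: when `signum` arrives, pass it to the training process."""
    signal.signal(signum, lambda received, frame: signal_pid_file(path, received))
```

In this arrangement the training script calls `write_pid_file(...)` once at startup, and the launcher calls `install_relay(...)` with the same path before it waits on the job.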

**In the training script:** Register a handler for the signal your cluster uses (for example `SIGTERM` or `SIGUSR1`). In the handler, call `wandb.run.mark_preempting()` if a run is active, then finish the run with a non-zero exit code and `sys.exit(128 + signum)` (or another non-zero code) so the run is requeued. See [Resume preemptible Sweeps runs](/models/runs/resuming#resume-preemptible-sweeps-runs) for when runs are requeued and how that interacts with `mark_preempting()`.

**Sweep state:** Run `wandb sweep entity/project/sweep_ID --resume` before starting the agent so the sweep is in resume mode and will hand out requeued runs.

**Multi-agent coordination:** When many agents run at once (for example, SLURM array jobs), they can race to claim the same preempted run. This is a known limitation. To reduce the chance of a collision, stagger agent startup or coordinate agents externally, for example with locks.
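
As one possible way to stagger startup, each array task can sleep for a delay derived from its task index before launching its agent. The helper names and the default delays below are assumptions; `SLURM_ARRAY_TASK_ID` is the environment variable SLURM sets for array jobs. Staggering lowers the chance of a collision but does not remove it.

```python
import os
import random
import time


def startup_delay(task_index, step_seconds=5.0, jitter_seconds=1.0, rng=random):
    """Seconds to wait before starting agent number `task_index`.

    Hypothetical helper: a fixed step per task plus a little random jitter.
    """
    return task_index * step_seconds + rng.uniform(0.0, jitter_seconds)


def stagger_from_slurm_env(step_seconds=5.0, jitter_seconds=1.0, sleep=time.sleep):
    """Sleep based on SLURM_ARRAY_TASK_ID (0 if unset); return the delay used."""
    # Array task IDs need not start at 0; subtract your first ID if they do not.
    task_index = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    delay = startup_delay(task_index, step_seconds, jitter_seconds)
    sleep(delay)
    return delay
```

A launch script would call `stagger_from_slurm_env()` once, immediately before starting the agent.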

## `wandb sweep --cancel`

You cancel a sweep using the W&B API, not an OS signal. Run a command like `wandb sweep --cancel entity/project/sweep_ID`. The server tells the agent to exit, and the agent then terminates running child processes and stops. There can be a short delay (on the order of the agent's API polling interval) before cancellation takes effect.

Cancellation delivers **`SIGKILL`** to runs. Child processes have no chance to run user-defined signal handlers. The same applies when you use the **Cancel** control in the Sweeps UI. Use `--cancel` when you want to stop the entire sweep and mark it cancelled. For graceful shutdown of the current run, send a catchable signal to the run (or use `--forward-signals` with the CLI agent and signal the agent). For graceful sweep completion, use [`wandb sweep --stop`](/models/sweeps/pause-resume-and-cancel-sweeps#stop-a-sweep) instead of `--cancel`.

See [Manage sweeps](/models/sweeps/pause-resume-and-cancel-sweeps) for pause, resume, stop, and cancel options.

## Killing the agent vs the run

If you send a signal to the **agent** process (not the child training process), the agent may exit while the child continues running as an orphan. The orphan may keep printing to your terminal, and the shell may not show a new prompt until you press Enter.

Unless you use `--forward-signals` with the CLI agent, stopping the agent does not guarantee the child training process stops.

To confirm the agent has exited, use an OS command like `ps -p <agent_pid>` or `pgrep -f "wandb agent"` instead of relying on prompt appearance.
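
The same check can be done from Python by sending signal `0`, which tests whether a process exists without delivering anything to it. `pid_is_running` is a hypothetical helper name, and the sketch assumes a POSIX system.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A zombie process that its parent has not yet reaped still counts as existing, so a `True` result right after termination does not always mean the process is still doing work.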

## Reference: `mark_preempting()` and final run state

The table below summarizes how run state depends on **when** you call `mark_preempting()` and how the process exits. It assumes you use the [`wandb agent`](/models/ref/cli/wandb-agent) CLI with your training program as a subprocess.

| Scenario | No `mark_preempting()` | Signal handler calls `mark_preempting()` and exits non-zero | `mark_preempting()` always called right after `init()` |
| --- | --- | --- | --- |
| Run completes normally with exit code 0 | FINISHED | FINISHED | FINISHED |
| Run fails with non-zero exit code | FAILED | FAILED | PREEMPTED |
| Run receives `SIGKILL` | CRASHED after about five minutes | CRASHED after about five minutes (uncatchable) | PREEMPTED after about five minutes |
| Run receives `SIGINT` | KILLED | PREEMPTED (with a `SIGINT` handler) | PREEMPTED |
| Run receives another signal (for example `SIGTERM` or `SIGUSR1`) | CRASHED after about five minutes | PREEMPTED (with a matching handler) | PREEMPTED after about five minutes |

If you only call `mark_preempting()` inside a signal handler, you do not cover cases where the handler never runs, such as `SIGKILL`.

If you always call `mark_preempting()` immediately after `wandb.init()`, any failure can be treated as preemption and the run may be requeued repeatedly, including for bugs or bad configuration.

For environments with a well-defined preemption signal, the usual approach is a **signal handler** that calls `mark_preempting()` and exits non-zero, not an unconditional call after `init()`.
4 changes: 4 additions & 0 deletions models/sweeps/start-sweep-agents.mdx
@@ -35,6 +35,8 @@ Copy and paste the code snippet below and replace `sweep_id` with your sweep ID:
```bash
wandb agent sweep_id
```

For graceful shutdown when you interrupt the agent (for example, with Ctrl+C), use `wandb agent --forward-signals sweep_id` so the current run receives the signal and can shut down cleanly. See [Signal handling and sweep runs](/models/sweeps/signal-handling-sweep-runs) for details.
</Tab>
<Tab title="Python script or notebook">
Use [`wandb.agent()`](/models/ref/python/functions/agent) to start a sweep. Provide the sweep ID that W&B returns when you initialized the sweep along with the name of the function that is the entrypoint to your training script.
@@ -45,6 +47,8 @@ Copy and paste the code snippet below and replace `<sweep_id>` with your sweep I
wandb.agent(sweep_id="<sweep_id>", function="<function_name>")
```

Signal forwarding from the agent to the training run is only supported when you use the CLI (`wandb agent --forward-signals`). It is not supported for `wandb.agent()` in Python because the training function runs in a thread, not as a child process. See [Signal handling and sweep runs](/models/sweeps/signal-handling-sweep-runs) for details.

See [Python script or notebook tab](/models/sweeps/add-w-and-b-to-your-code#python-script-or-notebook) in Add W&B to your code for an example of how to set up your training script if you use this method.

<Warning>