2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -21,7 +21,7 @@ docs/_build/
env/
logs/*
.venv/
lightning_logs/*
lightning_logs*
public/
tests/__pycache__
tests/data/*
170 changes: 170 additions & 0 deletions docs/development/cluster.md
@@ -0,0 +1,170 @@
# Cluster Distributed Runs

This page shows supported patterns for running DeepForest across multiple GPUs and multiple nodes on a Slurm-managed cluster (for example HiPerGator).

## Slurm: `sbatch` and `srun`

`sbatch` requests the allocation (nodes, GPUs, tasks, memory, time). `srun` inside that batch script starts a **job step** within the same allocation. It does **not** submit a second job, and the allocation is not charged twice.

Match `#SBATCH --ntasks-per-node` to `devices` (one Slurm task per GPU) and `#SBATCH --nodes` to `num_nodes`. For multi-GPU DDP, launch with `srun`. For a single GPU, the cluster train script runs the command directly in the batch step.
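
The matching rule above can be checked before training starts. The following is a minimal sketch, not part of the DeepForest launchers: `check_slurm_matches` is a hypothetical helper, and it assumes the standard Slurm variables `SLURM_NTASKS_PER_NODE` (exported only when `--ntasks-per-node` was requested) and `SLURM_JOB_NUM_NODES` are visible to the process.

```python
import os


def check_slurm_matches(devices, num_nodes, env=None):
    """Return a list of mismatch messages (empty when everything lines up).

    Hypothetical helper for illustration; not a DeepForest API.
    """
    env = os.environ if env is None else env
    problems = []

    # Slurm exports SLURM_NTASKS_PER_NODE only if --ntasks-per-node was given.
    tasks = env.get("SLURM_NTASKS_PER_NODE")
    if tasks is not None and int(tasks) != devices:
        problems.append(f"--ntasks-per-node={tasks} but devices={devices}")

    # SLURM_JOB_NUM_NODES is the number of nodes in the allocation.
    nodes = env.get("SLURM_JOB_NUM_NODES")
    if nodes is not None and int(nodes) != num_nodes:
        problems.append(f"--nodes={nodes} but num_nodes={num_nodes}")

    return problems


if __name__ == "__main__":
    for message in check_slurm_matches(devices=2, num_nodes=2):
        print("Mismatch:", message)
```

Outside a Slurm job neither variable is set, so the helper reports nothing rather than failing.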

Example launchers live under `src/deepforest/scripts/HPC/`.

## Shared Settings

Use the same launch pattern for `train`, `evaluate`, and `predict`:

- `devices=<gpus_per_node>` is the number of GPUs on each node
- `num_nodes=<nnodes>` is the total number of nodes
- `strategy=ddp` enables distributed data parallel execution (use `auto` for single-GPU jobs)
- `workers=0` is required for large-tile prediction with `dataloader_strategy="window"`
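
As a rough illustration of how the settings above fit together, the sketch below builds the override strings from a GPU-per-node count and a node count. `launch_overrides` is a hypothetical helper, not part of DeepForest, and it assumes `strategy` is passed in `key=value` form; the launch examples on this page pass it as `--strategy ddp` instead.

```python
def launch_overrides(gpus_per_node, nnodes):
    """Build override strings for a distributed run.

    Hypothetical helper for illustration; not a DeepForest API.
    """
    # Total number of processes across the job (one per GPU).
    world_size = gpus_per_node * nnodes
    # DDP only matters with more than one process; single-GPU jobs use "auto".
    strategy = "ddp" if world_size > 1 else "auto"
    return [
        "accelerator=gpu",
        f"devices={gpus_per_node}",
        f"num_nodes={nnodes}",
        f"strategy={strategy}",
    ]


if __name__ == "__main__":
    print(" ".join(launch_overrides(gpus_per_node=2, nnodes=2)))
```

Deriving both values from the same two numbers used in the `#SBATCH` lines keeps the Slurm request and the trainer settings from drifting apart.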

## Environment

```bash
ml conda
eval "$(conda shell.bash hook)"
conda activate predict
cd /path/to/DeepForest
mkdir -p slurm_logs
```

## Train

Use `src/deepforest/scripts/HPC/run_cluster_train.sbatch` for production training and smoke tests. The launcher script is `run_cluster_train.sh`.

### Production training (single GPU)

Defaults use `TRAIN_MODE=train` and `CONFIG_NAME=bird`. Submit from the repo root:

```bash
sbatch src/deepforest/scripts/HPC/run_cluster_train.sbatch
```

Hydra overrides and resume:

```bash
export COMET_EXPERIMENT_NAME="exp_lr_0.0005"
sbatch src/deepforest/scripts/HPC/run_cluster_train.sbatch train.lr=0.0005 train.epochs=80

RESUME_CKPT=/path/to/last.ckpt sbatch src/deepforest/scripts/HPC/run_cluster_train.sbatch
```

Multi-GPU or multi-node training: set Slurm resources at submit time and pass matching Hydra settings if needed. The script infers `SCENARIO` from the allocation.

```bash
sbatch --nodes=2 --ntasks-per-node=2 --gpus-per-node=2 --cpus-per-task=8 --mem=128G --time=15:00:00 \
src/deepforest/scripts/HPC/run_cluster_train.sbatch \
--strategy ddp devices=2 num_nodes=2
```

### Smoke tests

Smoke tests use bundled OSBS sample data (`TRAIN_MODE=smoke`, `CONFIG_NAME=smoke`, 1 epoch). Set `SCENARIO` and match `#SBATCH` resources:

```bash
# 1 GPU
TRAIN_MODE=smoke SCENARIO=1gpu sbatch --nodes=1 --ntasks-per-node=1 --gpus-per-node=1 \
--cpus-per-task=8 --mem=32G --time=00:30:00 \
src/deepforest/scripts/HPC/run_cluster_train.sbatch

# Multi-GPU (one node)
TRAIN_MODE=smoke SCENARIO=multigpu GPUS_PER_NODE=2 sbatch --nodes=1 --ntasks-per-node=2 --gpus-per-node=2 \
--cpus-per-task=8 --mem=64G --time=00:45:00 \
src/deepforest/scripts/HPC/run_cluster_train.sbatch

# Multi-node
TRAIN_MODE=smoke SCENARIO=multinode GPUS_PER_NODE=2 NNODES=2 sbatch --nodes=2 --ntasks-per-node=2 --gpus-per-node=2 \
--cpus-per-task=8 --mem=64G --time=01:00:00 \
src/deepforest/scripts/HPC/run_cluster_train.sbatch
```

Optional: `export COMET_EXPERIMENT_NAME="my-smoke-run"` before `sbatch`. Disable Comet with `USE_COMET=0`.

### Train directly in a batch script

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-node=2

srun uv run deepforest train \
--strategy ddp \
accelerator=gpu \
devices=2 \
num_nodes=2 \
train.csv_file=/path/to/train.csv \
train.root_dir=/path/to/train_images \
validation.csv_file=/path/to/val.csv \
validation.root_dir=/path/to/val_images
```

## Evaluate

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-node=2

srun uv run deepforest evaluate \
/path/to/ground_truth.csv \
--root-dir /path/to/images \
--save-predictions eval_preds.csv \
-o eval_metrics.csv \
--strategy ddp \
accelerator=gpu \
devices=2 \
num_nodes=2
```

## Predict From CSV

To run the cluster regression test, which also serves as an example launcher, submit from the repo root:

```bash
sbatch src/deepforest/scripts/HPC/run_cluster_predict_test.sbatch
```

To run your own CSV prediction job directly:

```bash
srun uv run deepforest predict \
/path/to/images.csv \
--mode csv \
--root-dir /path/to/images \
-o predictions.csv \
--strategy ddp \
accelerator=gpu \
devices=2 \
num_nodes=2
```

## Predict A Large Tile

For large rasters on a cluster, prefer `predict_tile(..., dataloader_strategy="window")`.

The ready-to-run test launcher is:

```bash
sbatch src/deepforest/scripts/HPC/run_cluster_predict_tile_test.sbatch
```

To run a tiled prediction job directly:

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-node=2

srun uv run python tests/cluster_predict_tile_driver.py \
--input-path /path/to/tile.tif \
--output-path tile_predictions.csv \
--model-name weecology/everglades-bird-species-detector \
--patch-size 1500 \
--patch-overlap 0 \
--dataloader-strategy window \
--devices 2 \
--num-nodes 2
```

See also the [multi-GPU and multi-node guide](../user_guide/distributed.md).
1 change: 1 addition & 0 deletions docs/development/index.md
@@ -5,6 +5,7 @@
```{toctree}
:maxdepth: 1

cluster
authors
contributing
code_of_conduct
6 changes: 3 additions & 3 deletions docs/user_guide/07_scaling.md
@@ -1,5 +1,7 @@
# Scaling DeepForest using PyTorch Lightning

For concise launch recipes, see the [multi-GPU and multi-node guide](distributed.md). If you are using a Slurm-managed cluster, see the [cluster developer guide](../development/cluster.md).

## Increase batch size

It is more efficient to run a larger batch size on a single GPU. This is because the overhead of loading data and moving data between the CPU and GPU is relatively large. By running a larger batch size, we can reduce the overhead of these operations.
@@ -27,9 +29,7 @@ A few notes that can trip up those less used to multi-gpu training. These are fo

2. Each device gets its own portion of the dataset. This means that they do not interact during forward passes.

3. Make sure to use srun when combining with SLURM! This is an easy one to miss and will cause training to hang without error. Documented here

https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#troubleshooting.
3. On SLURM, launch with **`srun`**. Match `#SBATCH --ntasks-per-node` to `devices` and `#SBATCH --nodes` to `num_nodes`. See the [multi-GPU and multi-node guide](distributed.md) and [Lightning SLURM troubleshooting](https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#troubleshooting).
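
Note 2 above can be pictured with a small, library-free sketch. It shows a strided split of sample indices across ranks. This is a simplification: `shard_indices` is a hypothetical function, and real distributed samplers may also shuffle and pad so that every rank receives the same number of samples.

```python
def shard_indices(n_samples, world_size, rank):
    """Indices one rank would see under a simple strided split.

    Simplified illustration only: no shuffling, no padding.
    """
    return list(range(rank, n_samples, world_size))


if __name__ == "__main__":
    world_size = 4  # e.g. devices=2 on num_nodes=2
    for rank in range(world_size):
        print(rank, shard_indices(10, world_size, rank))
```

Each index lands on exactly one rank, which is why per-device predictions need to be gathered before they cover the full dataset.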


## Prediction
23 changes: 18 additions & 5 deletions docs/user_guide/09_configuration_file.md
@@ -151,17 +151,30 @@ The number of cpus/gpus to use during model training. Deepforest has been tested

### accelerator

Most commonly, `cpu`, `gpu` or `tpu` as well as other [options](https://pytorch-lightning.readthedocs.io/en/1.4.0/advanced/multi_gpu.html) listed:
Most commonly, `cpu`, `gpu` or `tpu` as well as other [options](https://lightning.ai/docs/pytorch/stable/accelerators/gpu.html).

If `gpu`, it can be helpful to specify the data parallelization strategy. This can be done using the `strategy` arg in `main.create_trainer()`
### num_nodes

Number of machines for distributed training. Default is `1`. Set this to your Slurm node count for multi-node jobs. See [Scaling](07_scaling.md) and [distributed runs](distributed.md).

### strategy

Distributed training strategy passed to the Lightning `Trainer`. Default is `auto` (appropriate for single-GPU runs). Use `ddp` for multi-GPU or multi-node training.

Set in the config file, as Hydra overrides (`strategy=ddp`), or via `create_trainer(strategy="ddp")`. CLI and `create_trainer` kwargs override the config file.

```python
from deepforest import model as m
from deepforest import main

m.create_trainer(logger=comet_logger, strategy="ddp")
m = main.deepforest()
m.config.accelerator = "gpu"
m.config.devices = 4
m.config.num_nodes = 2
m.config.strategy = "ddp"
m.create_trainer(logger=comet_logger)
```

This is passed to the pytorch-lightning trainer, documented in the link above for multi-gpu training.
On Slurm clusters, launch with `srun` so Lightning can read the job environment. Details are in [distributed runs](distributed.md).

### batch_size

36 changes: 14 additions & 22 deletions docs/user_guide/11_training.md
@@ -526,38 +526,28 @@ Usually creating this object does not cost too much computational time.

#### Training across multiple nodes on a HPC system

We have heard that this error can appear when trying to deep copy the pytorch lightning module. The trainer object is not pickleable.
For example, on multi-gpu environments when trying to scale the deepforest model the entire module is copied leading to this error.
Setting the trainer object to None and directly using the pytorch object is a reasonable workaround.
On Slurm clusters, launch the training command with `srun` inside your batch script and set `devices`, `num_nodes`, and `strategy=ddp` to match your `#SBATCH` allocation. See [Scaling](07_scaling.md) and [distributed runs](distributed.md).

Replace
If you see **Weakly referenced objects** when scaling across GPUs, the trainer object may not pickle cleanly when the module is copied. A workaround is to construct a `Trainer` directly:

```python
m = main.deepforest()
m.create_trainer()
m.trainer.fit(m)
```

with

```python
m.trainer = None
from pytorch_lightning import Trainer

trainer = Trainer(
accelerator="gpu",
strategy="ddp",
devices=model.config.devices,
enable_checkpointing=False,
max_epochs=model.config.train.epochs,
logger=comet_logger
)
trainer = Trainer(
accelerator="gpu",
strategy="ddp",
devices=m.config.devices,
num_nodes=m.config.num_nodes,
enable_checkpointing=False,
max_epochs=m.config.train.epochs,
logger=comet_logger,
)
trainer.fit(m)
```

The added benefits of this is more control over the trainer object.
The downside is that it doesn't align with the .config pattern where a user now has to look into the config to create the trainer.
We are open to changing this to be the default pattern in the future and welcome input from users.
We are open to making this the default pattern and welcome input from users.

#### Visualization during training

@@ -598,6 +588,8 @@ We provide a basic script to trigger a training run via CLI. This script is inst
If you are using `uv` to manage your Python environment, remember to prefix these commands with `uv run`, for example: `uv run deepforest predict`.
```

On a Slurm cluster, wrap the command in `srun` inside your batch script (see [Scaling](07_scaling.md) and [distributed runs](distributed.md)).

```bash
deepforest train batch_size=8 train.csv_file=your_labels.csv train.root_dir=some/path
```
84 changes: 84 additions & 0 deletions docs/user_guide/distributed.md
@@ -0,0 +1,84 @@
# Multi-GPU and Multi-Node Runs

DeepForest uses PyTorch Lightning distributed execution. For most multi-GPU and multi-node runs, these settings matter:

- `accelerator=gpu`
- `devices=<gpus_per_node>`
- `num_nodes=<nnodes>`
- `strategy=ddp`

On Slurm clusters, launch with **`srun`** inside your job allocation. Match `#SBATCH --ntasks-per-node` to `devices` and `#SBATCH --nodes` to `num_nodes`. See the [cluster developer guide](../development/cluster.md).

Single-GPU jobs can keep the default `strategy=auto`.

## Train

```bash
#SBATCH --nodes=<nnodes>
#SBATCH --ntasks-per-node=<gpus_per_node>
#SBATCH --gres=gpu:<gpus_per_node>

srun uv run deepforest train \
--strategy ddp \
accelerator=gpu \
devices=<gpus_per_node> \
num_nodes=<nnodes> \
train.csv_file=/path/to/train.csv \
train.root_dir=/path/to/train_images \
validation.csv_file=/path/to/val.csv \
validation.root_dir=/path/to/val_images
```

## Evaluate

```bash
srun uv run deepforest evaluate \
/path/to/ground_truth.csv \
--root-dir /path/to/images \
--save-predictions eval_preds.csv \
-o eval_metrics.csv \
--strategy ddp \
accelerator=gpu \
devices=<gpus_per_node> \
num_nodes=<nnodes>
```

## Predict From CSV

```bash
srun uv run deepforest predict \
/path/to/images.csv \
--mode csv \
--root-dir /path/to/images \
-o predictions.csv \
--strategy ddp \
accelerator=gpu \
devices=<gpus_per_node> \
num_nodes=<nnodes>
```

## Predict A Large Tile

For large geospatial rasters, use `predict_tile(..., dataloader_strategy="window")` instead of the simple CLI tile mode.

```python
from deepforest.main import deepforest

m = deepforest()
m.load_model("weecology/everglades-bird-species-detector")
m.config.accelerator = "gpu"
m.config.devices = 2
m.config.num_nodes = 2
m.config.strategy = "ddp"
m.config.workers = 0
m.create_trainer()

results = m.predict_tile(
path="/path/to/tile.tif",
patch_size=1500,
patch_overlap=0,
dataloader_strategy="window",
)
```

Launch that script with the same `srun` Slurm pattern and trainer settings. For a complete cluster example, see the [cluster developer guide](../development/cluster.md).
1 change: 1 addition & 0 deletions docs/user_guide/index.md
@@ -18,6 +18,7 @@ The User Guide covers the core DeepForest package usage and functionalities.
05_model_architecture
06_multi_species
07_scaling
distributed
08_visualizations
09_configuration_file
10_better