weecology · henrykironde · Mar 17, 2026 · Apr 3, 2026 · Apr 3, 2026 · Apr 4, 2026
diff --git a/.gitignore b/.gitignore
@@ -21,7 +21,7 @@ docs/_build/
 env/
 logs/*
 .venv/
-lightning_logs/*
+lightning_logs*
 public/
 tests/__pycache__
 tests/data/*

diff --git a/docs/development/cluster.md b/docs/development/cluster.md
@@ -0,0 +1,170 @@
+# Cluster Distributed Runs
+
+This page shows supported patterns for running DeepForest across multiple GPUs and multiple nodes on a Slurm-managed cluster (for example, HiPerGator).
+
+## Slurm: `sbatch` and `srun`
+
+`sbatch` requests the allocation (nodes, GPUs, tasks, memory, time). `srun` inside that batch script starts a **job step** within the same allocation. It does **not** submit a second job, so the allocation is not charged twice.
+
+Match `#SBATCH --ntasks-per-node` to `devices` (one Slurm task per GPU) and `#SBATCH --nodes` to `num_nodes`. For multi-GPU DDP, launch with `srun`. For a single GPU, the cluster train script runs the command directly in the batch step.
+
+Example launchers live under `src/deepforest/scripts/HPC/`.
+
+## Shared Settings
+
+Use the same launch pattern for `train`, `evaluate`, and `predict`:
+
+- `devices=<gpus_per_node>` is the number of GPUs on each node
+- `num_nodes=<nnodes>` is the total number of nodes
+- `strategy=ddp` enables distributed data parallel execution (use `auto` for single-GPU jobs)
+- `workers=0` is required for large-tile prediction with `dataloader_strategy="window"`
+
+## Environment
+
+```bash
+ml conda
+eval "$(conda shell.bash hook)"
+conda activate predict
+cd /path/to/DeepForest
+mkdir -p slurm_logs
+```
+
+## Train
+
+Use `src/deepforest/scripts/HPC/run_cluster_train.sbatch` for production training and smoke tests. The launcher script is `run_cluster_train.sh`.
+
+### Production training (single GPU)
+
+Defaults use `TRAIN_MODE=train` and `CONFIG_NAME=bird`. Submit from the repo root:
+
+```bash
+sbatch src/deepforest/scripts/HPC/run_cluster_train.sbatch
+```
+
+Hydra overrides and resume:
+
+```bash
+export COMET_EXPERIMENT_NAME="exp_lr_0.0005"
+sbatch src/deepforest/scripts/HPC/run_cluster_train.sbatch train.lr=0.0005 train.epochs=80
+
+RESUME_CKPT=/path/to/last.ckpt sbatch src/deepforest/scripts/HPC/run_cluster_train.sbatch
+```
+
+Multi-GPU or multi-node training: set Slurm resources at submit time and pass matching Hydra settings if needed. The script infers `SCENARIO` from the allocation.
+
+```bash
+sbatch --nodes=2 --ntasks-per-node=2 --gpus-per-node=2 --cpus-per-task=8 --mem=128G --time=15:00:00 \
+  src/deepforest/scripts/HPC/run_cluster_train.sbatch \
+  --strategy ddp devices=2 num_nodes=2
+```
+
+### Smoke tests
+
+Smoke tests use bundled OSBS sample data (`TRAIN_MODE=smoke`, `CONFIG_NAME=smoke`, 1 epoch). Set `SCENARIO` and request matching Slurm resources:
+
+```bash
+# 1 GPU
+TRAIN_MODE=smoke SCENARIO=1gpu sbatch --nodes=1 --ntasks-per-node=1 --gpus-per-node=1 \
+  --cpus-per-task=8 --mem=32G --time=00:30:00 \
+  src/deepforest/scripts/HPC/run_cluster_train.sbatch
+
+# Multi-GPU (one node)
+TRAIN_MODE=smoke SCENARIO=multigpu GPUS_PER_NODE=2 sbatch --nodes=1 --ntasks-per-node=2 --gpus-per-node=2 \
+  --cpus-per-task=8 --mem=64G --time=00:45:00 \
+  src/deepforest/scripts/HPC/run_cluster_train.sbatch
+
+# Multi-node
+TRAIN_MODE=smoke SCENARIO=multinode GPUS_PER_NODE=2 NNODES=2 sbatch --nodes=2 --ntasks-per-node=2 --gpus-per-node=2 \
+  --cpus-per-task=8 --mem=64G --time=01:00:00 \
+  src/deepforest/scripts/HPC/run_cluster_train.sbatch
+```
+
+Optional: `export COMET_EXPERIMENT_NAME="my-smoke-run"` before `sbatch`. Disable Comet with `USE_COMET=0`.
+
+### Train directly in a batch script
+
+```bash
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=2
+#SBATCH --gpus-per-node=2
+
+srun uv run deepforest train \
+  --strategy ddp \
+  accelerator=gpu \
+  devices=2 \
+  num_nodes=2 \
+  train.csv_file=/path/to/train.csv \
+  train.root_dir=/path/to/train_images \
+  validation.csv_file=/path/to/val.csv \
+  validation.root_dir=/path/to/val_images
+```
+
+## Evaluate
+
+```bash
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=2
+#SBATCH --gpus-per-node=2
+
+srun uv run deepforest evaluate \
+  /path/to/ground_truth.csv \
+  --root-dir /path/to/images \
+  --save-predictions eval_preds.csv \
+  -o eval_metrics.csv \
+  --strategy ddp \
+  accelerator=gpu \
+  devices=2 \
+  num_nodes=2
+```
+
+## Predict From CSV
+
+To run the cluster regression test, which also serves as an example launcher, submit from the repo root:
+
+```bash
+sbatch src/deepforest/scripts/HPC/run_cluster_predict_test.sbatch
+```
+
+To run your own CSV prediction job directly:
+
+```bash
+srun uv run deepforest predict \
+  /path/to/images.csv \
+  --mode csv \
+  --root-dir /path/to/images \
+  -o predictions.csv \
+  --strategy ddp \
+  accelerator=gpu \
+  devices=2 \
+  num_nodes=2
+```
+
+## Predict A Large Tile
+
+For large rasters on a cluster, prefer `predict_tile(..., dataloader_strategy="window")`.
+
+The ready-to-run test launcher is:
+
+```bash
+sbatch src/deepforest/scripts/HPC/run_cluster_predict_tile_test.sbatch
+```
+
+To run a tiled prediction job directly:
+
+```bash
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=2
+#SBATCH --gpus-per-node=2
+
+srun uv run python tests/cluster_predict_tile_driver.py \
+  --input-path /path/to/tile.tif \
+  --output-path tile_predictions.csv \
+  --model-name weecology/everglades-bird-species-detector \
+  --patch-size 1500 \
+  --patch-overlap 0 \
+  --dataloader-strategy window \
+  --devices 2 \
+  --num-nodes 2
+```
+
+See also the [multi-GPU and multi-node guide](../user_guide/distributed.md).
diff --git a/docs/development/index.md b/docs/development/index.md
@@ -5,6 +5,7 @@
 ```{toctree}
 :maxdepth: 1
 
+cluster
 authors
 contributing
 code_of_conduct

diff --git a/docs/user_guide/07_scaling.md b/docs/user_guide/07_scaling.md
@@ -1,5 +1,7 @@
 # Scaling DeepForest using PyTorch Lightning
 
+For concise launch recipes, see the [multi-GPU and multi-node guide](distributed.md). If you are using a Slurm-managed cluster, see the [cluster developer guide](../development/cluster.md).
+
 ## Increase batch size
 
 It is more efficient to run a larger batch size on a single GPU. This is because the overhead of loading data and moving data between the CPU and GPU is relatively large. By running a larger batch size, we can reduce the overhead of these operations.
@@ -27,9 +29,7 @@ A few notes that can trip up those less used to multi-gpu training. These are fo
 
 2. Each device gets its own portion of the dataset. This means that they do not interact during forward passes.
 
-3. Make sure to use srun when combining with SLURM! This is an easy one to miss and will cause training to hang without error. Documented here
-
-https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#troubleshooting.
+3. On Slurm, launch with **`srun`**; without it, training can hang without an error. Match `#SBATCH --ntasks-per-node` to `devices` and `#SBATCH --nodes` to `num_nodes`. See the [multi-GPU and multi-node guide](distributed.md) and [Lightning SLURM troubleshooting](https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#troubleshooting).
 
 
 ## Prediction

diff --git a/docs/user_guide/09_configuration_file.md b/docs/user_guide/09_configuration_file.md
@@ -151,17 +151,30 @@ The number of cpus/gpus to use during model training. Deepforest has been tested
 
 ### accelerator
 
-Most commonly, `cpu`, `gpu` or `tpu` as well as other [options](https://pytorch-lightning.readthedocs.io/en/1.4.0/advanced/multi_gpu.html) listed:
+The hardware backend to run on, most commonly `cpu`, `gpu`, or `tpu`. Other [options](https://lightning.ai/docs/pytorch/stable/accelerators/gpu.html) are described in the Lightning documentation.
 
-If `gpu`, it can be helpful to specify the data parallelization strategy. This can be done using the `strategy` arg in `main.create_trainer()`
+### num_nodes
+
+Number of machines for distributed training. Default is `1`. Set this to your Slurm node count for multi-node jobs. See [Scaling](07_scaling.md) and [distributed runs](distributed.md).
+
+### strategy
+
+Distributed training strategy passed to the Lightning `Trainer`. Default is `auto` (appropriate for single-GPU runs). Use `ddp` for multi-GPU or multi-node training.
+
+Set it in the config file, as a Hydra override (`strategy=ddp`), or via `create_trainer(strategy="ddp")`. CLI overrides and `create_trainer` keyword arguments take precedence over the config file.
 
 ```python
-from deepforest import model as m
+from deepforest import main
 
-m.create_trainer(logger=comet_logger, strategy="ddp")
+m = main.deepforest()
+m.config.accelerator = "gpu"
+m.config.devices = 4
+m.config.num_nodes = 2
+m.config.strategy = "ddp"
+m.create_trainer(logger=comet_logger)  # comet_logger: a logger you created earlier
 ```
 
-This is passed to the pytorch-lightning trainer, documented in the link above for multi-gpu training.
+On Slurm clusters, launch with `srun` so Lightning can read the job environment. Details are in [distributed runs](distributed.md).
 
 ### batch_size
 

diff --git a/docs/user_guide/11_training.md b/docs/user_guide/11_training.md
@@ -526,38 +526,28 @@ Usually creating this object does not cost too much computational time.
 
 #### Training across multiple nodes on a HPC system
 
-We have heard that this error can appear when trying to deep copy the pytorch lightning module. The trainer object is not pickleable.
-For example, on multi-gpu environments when trying to scale the deepforest model the entire module is copied leading to this error.
-Setting the trainer object to None and directly using the pytorch object is a reasonable workaround.
+On Slurm clusters, launch the command with `srun` inside your batch script and set `devices`, `num_nodes`, and `strategy=ddp` to match your `#SBATCH` allocation. See [Scaling](07_scaling.md) and [distributed runs](distributed.md).
 
-Replace
+If you see a **Weakly referenced objects** error when scaling across GPUs, the likely cause is that the trainer object cannot be pickled when the module is copied. A workaround is to set `m.trainer = None` and construct a `Trainer` directly:
 
 ```python
 m = main.deepforest()
-m.create_trainer()
-m.trainer.fit(m)
-```
-
-with
-
-```python
 m.trainer = None
 from pytorch_lightning import Trainer
 
-    trainer = Trainer(
-        accelerator="gpu",
-        strategy="ddp",
-        devices=model.config.devices,
-        enable_checkpointing=False,
-        max_epochs=model.config.train.epochs,
-        logger=comet_logger
-    )
+trainer = Trainer(
+    accelerator="gpu",
+    strategy="ddp",
+    devices=m.config.devices,
+    num_nodes=m.config.num_nodes,
+    enable_checkpointing=False,
+    max_epochs=m.config.train.epochs,
+    logger=comet_logger,
+)
 trainer.fit(m)
 ```
 
-The added benefits of this is more control over the trainer object.
-The downside is that it doesn't align with the .config pattern where a user now has to look into the config to create the trainer.
-We are open to changing this to be the default pattern in the future and welcome input from users.
+We are open to making this the default pattern and welcome input from users.
 
 #### Visualization during training
 
@@ -598,6 +588,8 @@ We provide a basic script to trigger a training run via CLI. This script is inst
 If you are using `uv` to manage your Python environment, remember to prefix these commands with `uv run`, for example: `uv run deepforest predict`.
 ```
 
+On a Slurm cluster, wrap the command in `srun` inside your batch script (see [Scaling](07_scaling.md) and [distributed runs](distributed.md)).
+
 ```bash
 deepforest train batch_size=8 train.csv_file=your_labels.csv train.root_dir=some/path
 ```

diff --git a/docs/user_guide/distributed.md b/docs/user_guide/distributed.md
@@ -0,0 +1,84 @@
+# Multi-GPU and Multi-Node Runs
+
+DeepForest uses PyTorch Lightning distributed execution. For most multi-GPU and multi-node runs, these settings matter:
+
+- `accelerator=gpu`
+- `devices=<gpus_per_node>`
+- `num_nodes=<nnodes>`
+- `strategy=ddp`
+
+On Slurm clusters, launch with **`srun`** inside your job allocation. Match `#SBATCH --ntasks-per-node` to `devices` and `#SBATCH --nodes` to `num_nodes`. See the [cluster developer guide](../development/cluster.md).
+
+Single-GPU jobs can keep the default `strategy=auto`.
+
+## Train
+
+```bash
+#SBATCH --nodes=<nnodes>
+#SBATCH --ntasks-per-node=<gpus_per_node>
+#SBATCH --gres=gpu:<gpus_per_node>
+
+srun uv run deepforest train \
+  --strategy ddp \
+  accelerator=gpu \
+  devices=<gpus_per_node> \
+  num_nodes=<nnodes> \
+  train.csv_file=/path/to/train.csv \
+  train.root_dir=/path/to/train_images \
+  validation.csv_file=/path/to/val.csv \
+  validation.root_dir=/path/to/val_images
+```
+
+## Evaluate
+
+```bash
+srun uv run deepforest evaluate \
+  /path/to/ground_truth.csv \
+  --root-dir /path/to/images \
+  --save-predictions eval_preds.csv \
+  -o eval_metrics.csv \
+  --strategy ddp \
+  accelerator=gpu \
+  devices=<gpus_per_node> \
+  num_nodes=<nnodes>
+```
+
+## Predict From CSV
+
+```bash
+srun uv run deepforest predict \
+  /path/to/images.csv \
+  --mode csv \
+  --root-dir /path/to/images \
+  -o predictions.csv \
+  --strategy ddp \
+  accelerator=gpu \
+  devices=<gpus_per_node> \
+  num_nodes=<nnodes>
+```
+
+## Predict A Large Tile
+
+For large geospatial rasters, use `predict_tile(..., dataloader_strategy="window")` instead of the simple CLI tile mode.
+
+```python
+from deepforest.main import deepforest
+
+m = deepforest()
+m.load_model("weecology/everglades-bird-species-detector")
+m.config.accelerator = "gpu"
+m.config.devices = 2
+m.config.num_nodes = 2
+m.config.strategy = "ddp"
+m.config.workers = 0
+m.create_trainer()
+
+results = m.predict_tile(
+    path="/path/to/tile.tif",
+    patch_size=1500,
+    patch_overlap=0,
+    dataloader_strategy="window",
+)
+```
+
+Launch that script with `srun`, using the same Slurm pattern and trainer settings as the examples above. For a complete cluster example, see the [cluster developer guide](../development/cluster.md).
diff --git a/docs/user_guide/index.md b/docs/user_guide/index.md
@@ -18,6 +18,7 @@ The User Guide covers the core DeepForest package usage and functionalities.
 05_model_architecture
 06_multi_species
 07_scaling
+distributed
 08_visualizations
 09_configuration_file
 10_better