refactor recall_precision metric for better DDP support #1343
bw4sz merged 1 commit into weecology:main from
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff           @@
##             main    #1343      +/-  ##
==========================================
- Coverage   87.34%   86.78%   -0.56%
==========================================
  Files          24       24
  Lines        2978     3005      +27
==========================================
+ Hits         2601     2608       +7
- Misses        377      397      +20
```

Flags with carried forward coverage won't be shown.
The majority of the missing lines here are due to DDP code paths, which won't run on GitHub Actions. This ties into #1273. It seems we might also be lacking coverage for some multi-class cases, where we would expect class precision/recall to be returned.
Force-pushed from a4e7f12 to 901a6a3.
This is basically good for review, but needs testing on a multi-GPU system to ensure that sync works properly in the metric, and when gathering predictions from
@bw4sz it would be helpful if you could have a look to see if this solves your issues, and act as an independent test that things are working. I'll also run the same benchmark on HiPerGator.
Force-pushed from 901a6a3 to ce8be7a.
Check on HPG looks good. As usual, if reporting results for publication, use a single worker to avoid any strangeness with batches that are padded up to a fixed size across nodes, but a bit of variation here (4x GPU) is to be expected. @bw4sz if you approve, before we merge this please could you double check on one of your bird runs? I did find a small annoyance with the

```bash
#!/bin/bash
#SBATCH --job-name=df-eval-check
#SBATCH --nodes=1
#SBATCH --partition=hpg-b200
#SBATCH --cpus-per-task=16
#SBATCH --mem=196GB
#SBATCH --time=14:00:00
#SBATCH --gpus=4
#SBATCH --output=./slurm_logs/%A.out
#SBATCH --error=./slurm_logs/%A.err
#SBATCH --ntasks-per-node=4

printenv | grep -i slurm | sort
source .venv/bin/activate

srun uv run deepforest evaluate neon_eval.csv validation.root_dir=./data/NEON_benchmark/images batch_size=4
```
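The padding caveat above can be demonstrated without a GPU. This is a minimal pure-Python mimic of how `torch.utils.data.DistributedSampler` pads indices by default when `drop_last=False` (the function name here is mine, not DeepForest's):

```python
import math

def padded_rank_indices(n_samples, world_size, rank):
    """Mimic DistributedSampler's default behavior: pad the index list by
    wrapping around so every rank receives the same number of samples."""
    per_rank = math.ceil(n_samples / world_size)
    total = per_rank * world_size
    indices = list(range(n_samples))
    indices += indices[: total - n_samples]  # repeated samples pad the tail
    return indices[rank:total:world_size]

# 10 images across 4 GPUs are padded to 12, so two images are evaluated
# twice -- hence the advice to use a single worker for publication numbers.
all_ranks = [padded_rank_indices(10, 4, r) for r in range(4)]
print(sum(len(ix) for ix in all_ranks))  # 12
```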
bw4sz left a comment
This looks right. I'm going to check it out, submit a BOEM detection multi-GPU run that was previously failing, and see.
Pull request overview
Refactors DeepForest’s validation-time box recall/precision metric to better support DDP by performing per-image matching during update() and minimizing end-of-epoch work, while also moving empty-frame handling into the same metric.
Changes:
- Refactored `RecallPrecision` to accept `(preds, targets, image_names)` and compute matching eagerly during `update()`, gathering match results across ranks for later access.
- Updated `validation_step`/`evaluate()` to use the new metric API and to gather prediction outputs across ranks during `evaluate()`.
- Adjusted/expanded tests to validate the returned evaluation results DataFrame and updated a test for the new `metric.update` signature.
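For intuition, here is a hedged sketch of that pattern: per-image matching done eagerly in `update()`, so the end-of-epoch `compute()` is a cheap reduction over a few counters. This is a toy stand-in, not the actual `RecallPrecision` implementation; the box format, greedy matching strategy, and threshold are assumptions.

```python
class RecallPrecisionSketch:
    """Toy stand-in: match boxes per image inside update() so compute()
    only divides counters (cheap to sync across DDP ranks)."""

    def __init__(self, iou_threshold=0.4):
        self.iou_threshold = iou_threshold
        self.true_positives = 0
        self.num_preds = 0
        self.num_targets = 0

    @staticmethod
    def _iou(a, b):
        # Boxes as (xmin, ymin, xmax, ymax) tuples
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter) if inter else 0.0

    def update(self, preds, targets, image_names):
        # Eager greedy matching per image, done as batches arrive
        for pred, target in zip(preds, targets):
            self.num_preds += len(pred["boxes"])
            self.num_targets += len(target["boxes"])
            unmatched = list(target["boxes"])
            for p in pred["boxes"]:
                best = max(unmatched, key=lambda t: self._iou(p, t), default=None)
                if best is not None and self._iou(p, best) >= self.iou_threshold:
                    self.true_positives += 1
                    unmatched.remove(best)

    def compute(self):
        recall = self.true_positives / self.num_targets if self.num_targets else 0.0
        precision = self.true_positives / self.num_preds if self.num_preds else 0.0
        return {"box_recall": recall, "box_precision": precision}

m = RecallPrecisionSketch()
m.update(
    [{"boxes": [(0, 0, 10, 10), (20, 20, 30, 30)]}],
    [{"boxes": [(0, 0, 10, 10)]}],
    ["img1.png"],
)
print(m.compute())  # {'box_recall': 1.0, 'box_precision': 0.5}
```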
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `tests/test_main.py` | Updates tests for new metric API and adds assertions about evaluation outputs (labels, image_path presence, row counts). |
| `src/deepforest/metrics.py` | Major refactor of `RecallPrecision`: new update signature, eager matching, distributed gathering of match results, and empty-frame accounting. |
| `src/deepforest/main.py` | Wires new metric update signature into validation, removes standalone empty-frame accuracy metric, and gathers predictions in `evaluate()` under DDP. |
The review comment refers to this hunk:

```python
self.num_images += 1

n_pred = len(pred["boxes"])
n_target = len(target["boxes"])

# Early exit for prediction/target base cases.
is_empty_frame = n_target == 0 or torch.all(target["boxes"] == 0)
if is_empty_frame:
    self.num_empty_frames += 1
    if n_pred == 0:
        self.correct_empty_predictions += 1
    else:
        # Predictions in an empty frame are all FP: precision = 0
        self.num_images_with_predictions += 1
    return
```

A second hunk from elsewhere in the diff was interleaved here:

```python
# Expand image names to one entry per box
predictions["image_path"] = [
    self.index_to_path[int(idx.item())]
    for idx in torch.cat(self.image_indices)
]

results = __evaluate_wrapper__(
    predictions=predictions,
```
num_images is incremented before detecting/handling empty frames, so empty frames currently contribute to the denominator for box_recall (and, when predictions exist, also affect box_precision via num_images_with_predictions). This differs from the existing evaluation semantics in evaluate.evaluate_geometry(), which filters out empty ground-truth boxes/frames before computing per-image recall/precision. Consider tracking a separate counter for non-empty ground-truth images (and using that as the recall denominator), and excluding empty-frame-only samples from the recall/precision aggregation while still reporting empty_frame_accuracy.
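A sketch of the suggested fix (the counter names are hypothetical, not the PR's actual state variables): empty frames feed only `empty_frame_accuracy`, while recall and precision draw their denominators from non-empty-image counters.

```python
def update_counters(c, n_pred, n_target, target_is_all_zero):
    """Hypothetical per-image counter update following the suggestion above."""
    if n_target == 0 or target_is_all_zero:
        # Empty ground truth: contributes only to empty_frame_accuracy,
        # and is excluded from the recall/precision denominators below.
        c["num_empty_frames"] += 1
        if n_pred == 0:
            c["correct_empty_predictions"] += 1
        return c
    c["num_nonempty_images"] += 1               # recall denominator
    if n_pred > 0:
        c["num_images_with_predictions"] += 1   # precision denominator
    return c

c = dict.fromkeys(
    ["num_empty_frames", "correct_empty_predictions",
     "num_nonempty_images", "num_images_with_predictions"], 0)
update_counters(c, n_pred=0, n_target=0, target_is_all_zero=False)  # correct empty
update_counters(c, n_pred=3, n_target=0, target_is_all_zero=False)  # FP-only frame
update_counters(c, n_pred=2, n_target=5, target_is_all_zero=False)  # normal frame
print(c)
```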
You can see the trainer get the correct ranks and start training, then a core dump. I'm wondering how we can make a reproducible failure here. I believe it's the giant test set. Can you just make a very large test set by concatenating the normal testing set many, many times? I wonder if that will be enough to trigger it.
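One way to build that stress set (a sketch: `pd.concat` with `ignore_index=True` is the relevant trick; the inline DataFrame and the output filename stand in for the real annotations CSV):

```python
import pandas as pd

# Stand-in for the normal test annotations; a real run would read them,
# e.g. pd.read_csv("annotations.csv") (path is a placeholder).
df = pd.DataFrame({
    "image_path": ["a.png", "b.png"],
    "xmin": [0, 5], "ymin": [0, 5], "xmax": [10, 15], "ymax": [10, 15],
    "label": ["Tree", "Tree"],
})

# Concatenate the same annotations many times to inflate the test set;
# repeated images are fine for triggering scale-dependent hangs.
big = pd.concat([df] * 500, ignore_index=True)
big.to_csv("annotations_x500.csv", index=False)
print(len(big))  # 1000
```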
Do you have the profile from Comet showing the GPU and CPU usage for the run? Is it still computing when the timeout happens, or is one process hanging? Or maybe try with some of the CUDA debug flags on. We can definitely try with a duplicated dataset though. Might need to hack it so it doesn't load
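For the debug flags, a hedged starting set would be the standard PyTorch/NCCL environment variables (nothing DeepForest-specific; in practice these would go in the sbatch script as `export` lines):

```shell
# Make CUDA errors surface at the offending call instead of asynchronously
export CUDA_LAUNCH_BLOCKING=1
# Verbose collective/communication logging from NCCL and torch.distributed
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# then rerun the srun ... deepforest evaluate job as before
```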
Could we also try the other way? Can you binary search through
Maybe it's as simple as bumping up the timeout to something much larger.
@bw4sz I can't replicate your issue. I've cherry-picked the metric changes into my treeformer branch, and it runs fine with the LIDAR dataset; validation takes about 25 minutes on a single GPU (and this is 20k+ images). So I wonder if something specific to your data is causing validation to be pathologically slow? It's possible that keypoint eval is fast. I'll report back when this run is complete, but it looks like aggregation is fine comparing a 2-GPU run vs 1. My suggestion would be to get this merged, because the strategy is definitely better than what's in main, and we can try to identify the edge cases next.
bw4sz left a comment
Agreed. I'm glad you checked out a different branch. I'm not surprised there is something specific about the bird detector.

Description

- Same `update` signature as other metrics.
- Matching happens during `update` so that the final reduction stage doesn't time out the sync between ranks during DDP joining. `it/s` during eval might be a bit slower, but there won't be a big chunk of compute at the end.

Tests:

- Expanded `test_main::test_evaluate` to confirm that the output dataframe is what we expect + some comments on the assertions.

Related Issue(s)
AI-Assisted Development
Rewrote most of the metric and then got CC to check some of the logic and the test suite.
AI tools used (if applicable):
Claude Code Opus/Sonnet 4.6