refactor recall_precision metric for better DDP support#1343

Merged
bw4sz merged 1 commit into weecology:main from jveitchmichaelis:torchmetrics-refactor
Mar 20, 2026

Conversation

@jveitchmichaelis
Collaborator

@jveitchmichaelis jveitchmichaelis commented Mar 7, 2026

Description

  • Refactors the RecallPrecision metric to use (mostly) the same update signature as the other metrics.
  • Ground truth/prediction matching is performed eagerly during update, so the final reduction stage doesn't time out the sync between ranks during DDP joining.
  • This should be faster on multi-node machines, because each rank only performs 1/N of the work. There are also some small savings from not having to search the dataframe to match image_path against predictions/ground truth. The it/s during eval might be slightly lower, but there is no longer a big chunk of compute at the end.
  • The metric has to keep track of the matching results so they can be accessed later, which is a little awkward, but I've tried to keep it clean (at least for users...).
  • I moved empty frame accuracy here too, since we have to handle those checks during evaluation anyway, and it makes main a bit cleaner.
  • The metrics are calculated against the dataset targets directly (instead of loading the CSV and passing it raw). This has the benefit that adding geometric augmentation to the validation dataset (if we did that for some reason) wouldn't break eval.

Tests:

  • Added a sanity check in test_main::test_evaluate to confirm that the output dataframe is what we expect, plus some comments on the assertions.
  • One fixup in the tests for the new metric.update signature.

Related Issue(s)

AI-Assisted Development

Rewrote most of the metric and then got Claude Code to check some of the logic and the test suite.

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting
  • I have reviewed and validated all AI-generated code

AI tools used (if applicable):
Claude Code Opus/Sonnet 4.6

@codecov

codecov bot commented Mar 7, 2026

Codecov Report

❌ Patch coverage is 85.00000% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.78%. Comparing base (ecf3a1c) to head (ce8be7a).
⚠️ Report is 16 commits behind head on main.

Files with missing lines Patch % Lines
src/deepforest/metrics.py 84.61% 12 Missing ⚠️
src/deepforest/main.py 86.36% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1343      +/-   ##
==========================================
- Coverage   87.34%   86.78%   -0.56%     
==========================================
  Files          24       24              
  Lines        2978     3005      +27     
==========================================
+ Hits         2601     2608       +7     
- Misses        377      397      +20     
Flag Coverage Δ
unittests 86.78% <85.00%> (-0.56%) ⬇️

Flags with carried forward coverage won't be shown.


@jveitchmichaelis
Collaborator Author

jveitchmichaelis commented Mar 7, 2026

The majority of the missing lines here are due to DDP code paths which won't run on GitHub Actions. This ties into #1273.

It seems we might also be lacking coverage for some multi-class cases, where we would expect per-class precision/recall to be returned.

@jveitchmichaelis jveitchmichaelis force-pushed the torchmetrics-refactor branch 2 times, most recently from a4e7f12 to 901a6a3 on March 7, 2026 06:43
@jveitchmichaelis
Collaborator Author

jveitchmichaelis commented Mar 7, 2026

This is basically ready for review, but it needs testing on a multi-GPU system to ensure that sync works properly in the metric, and when gathering predictions from evaluate. Results from calling deepforest evaluate on NeonTreeEval are identical to main:

Metric          main      torchmetrics-refactor
box_precision   0.61659   0.61659
box_recall      0.76480   0.76480
iou             0.67113   0.67113
map             0.18450   0.18450
val_loss        0.64963   0.64963

@bw4sz it would be helpful if you could take a look to see if this solves your issues, and act as an independent test that things are working. I'll also run the same benchmark on HiPerGator.

@jveitchmichaelis
Collaborator Author

jveitchmichaelis commented Mar 10, 2026

The check on HPG looks good. As usual, if reporting results for publication, use a single worker to avoid any strangeness with batches that are padded up to a fixed size across nodes; a bit of variation here (4x GPU) is to be expected.

@bw4sz if you approve, could you please double-check on one of your bird runs before we merge this?

I did find a small annoyance with the evaluate command. The CSV file should probably be an optional parameter rather than positional, because if you pass a hydra override first, like validation.csv_file, the parser seems to get confused.
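For concreteness, this is a hypothetical illustration of the argument-order problem (the exact parsing behavior would need confirming against the CLI; only the commands shown elsewhere in this thread are real):

```shell
# Positional CSV first: parses as intended.
deepforest evaluate neon_eval.csv validation.root_dir=./data/NEON_benchmark/images

# Override first: the override string can be consumed as the positional CSV path,
# which is the confusion described above.
deepforest evaluate validation.root_dir=./data/NEON_benchmark/images neon_eval.csv
```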

#!/bin/bash
#SBATCH --job-name=df-eval-check
#SBATCH --nodes=1
#SBATCH --partition=hpg-b200
#SBATCH --cpus-per-task=16
#SBATCH --mem=196GB
#SBATCH --time=14:00:00
#SBATCH --gpus=4
#SBATCH --output=./slurm_logs/%A.out
#SBATCH --error=./slurm_logs/%A.err
#SBATCH --ntasks-per-node=4

printenv | grep -i slurm | sort

source .venv/bin/activate

srun uv run deepforest evaluate neon_eval.csv validation.root_dir=./data/NEON_benchmark/images batch_size=4
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│       box_precision       │    0.6181755065917969     │
│        box_recall         │    0.7618706226348877     │
│            iou            │    0.6717678904533386     │
│         iou/cl_0          │    0.6717677712440491     │
│            map            │    0.18482036888599396    │
│          map_50           │    0.5014185905456543     │
│          map_75           │    0.08611200004816055    │
│         map_large         │    0.32269594073295593    │
│        map_medium         │    0.24427038431167603    │
│       map_per_class       │           -1.0            │
│         map_small         │    0.1242641806602478     │
│           mar_1           │   0.015420975163578987    │
│          mar_10           │    0.10335303097963333    │
│          mar_100          │     0.263633668422699     │
│     mar_100_per_class     │           -1.0            │
│         mar_large         │    0.4178861677646637     │
│        mar_medium         │    0.33035221695899963    │
│         mar_small         │    0.19376103579998016    │
│    val_bbox_regression    │    0.3602587580680847     │
│    val_classification     │    0.29568910598754883    │
│         val_loss          │    0.6559478044509888     │
└───────────────────────────┴───────────────────────────┘

@jveitchmichaelis jveitchmichaelis marked this pull request as ready for review March 10, 2026 05:14
@jveitchmichaelis jveitchmichaelis requested a review from bw4sz March 10, 2026 05:17
Collaborator

@bw4sz bw4sz left a comment


This looks right. I'm going to check it out, submit a BOEM detection multi-GPU run that was previously failing, and see.


Copilot AI left a comment


Pull request overview

Refactors DeepForest’s validation-time box recall/precision metric to better support DDP by performing per-image matching during update() and minimizing end-of-epoch work, while also moving empty-frame handling into the same metric.

Changes:

  • Refactored RecallPrecision to accept (preds, targets, image_names) and compute matching eagerly during update(), gathering match results across ranks for later access.
  • Updated validation_step / evaluate() to use the new metric API and to gather prediction outputs across ranks during evaluate().
  • Adjusted/expanded tests to validate the returned evaluation results DataFrame and updated a test for the new metric.update signature.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

tests/test_main.py: Updates tests for the new metric API and adds assertions about evaluation outputs (labels, image_path presence, row counts).
src/deepforest/metrics.py: Major refactor of RecallPrecision: new update signature, eager matching, distributed gathering of match results, and empty-frame accounting.
src/deepforest/main.py: Wires the new metric update signature into validation, removes the standalone empty-frame accuracy metric, and gathers predictions in evaluate() under DDP.


Comment thread src/deepforest/metrics.py
Comment on lines +87 to +101
self.num_images += 1

n_pred = len(pred["boxes"])
n_target = len(target["boxes"])

# Early exit for prediction/target base cases.
is_empty_frame = n_target == 0 or torch.all(target["boxes"] == 0)
if is_empty_frame:
    self.num_empty_frames += 1
    if n_pred == 0:
        self.correct_empty_predictions += 1
    else:
        # Predictions in an empty frame are all FP: precision = 0
        self.num_images_with_predictions += 1
    return

Copilot AI Mar 10, 2026


num_images is incremented before detecting/handling empty frames, so empty frames currently contribute to the denominator for box_recall (and, when predictions exist, also affect box_precision via num_images_with_predictions). This differs from the existing evaluation semantics in evaluate.evaluate_geometry(), which filters out empty ground-truth boxes/frames before computing per-image recall/precision. Consider tracking a separate counter for non-empty ground-truth images (and using that as the recall denominator), and excluding empty-frame-only samples from the recall/precision aggregation while still reporting empty_frame_accuracy.
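The counter split the comment suggests could look like this plain-Python sketch (hypothetical names, not DeepForest code): empty frames feed empty_frame_accuracy only, while the recall denominator counts non-empty ground-truth images.

```python
def summarize(frames):
    """frames: list of (n_targets, n_preds, n_recalled) tuples, one per image.

    Empty-GT frames contribute only to empty_frame_accuracy; recall is
    aggregated over non-empty frames, matching evaluate_geometry() semantics.
    """
    num_empty = correct_empty = recalled_total = target_total = 0
    for n_targets, n_preds, n_recalled in frames:
        if n_targets == 0:
            num_empty += 1
            # Correct only if the model also predicted nothing.
            correct_empty += n_preds == 0
        else:
            target_total += n_targets
            recalled_total += n_recalled
    return {
        "box_recall": recalled_total / max(target_total, 1),
        "empty_frame_accuracy": correct_empty / max(num_empty, 1),
    }
```

With two empty frames (one correctly left empty) and one fully recalled frame, this reports recall 1.0 and empty-frame accuracy 0.5, rather than letting the empty frames drag recall down.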

Comment thread src/deepforest/metrics.py
@bw4sz
Collaborator

bw4sz commented Mar 11, 2026

You can see the trainer gets the correct ranks:

`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2]

then start training.

Unfortunately this still yields

[Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

and a core dump.

I'm wondering how we can make a reproducible failure here. I believe it's the giant test set. Can you make a very large test set by concatenating the normal test set many times? I wonder if that would be enough to trigger it.

@jveitchmichaelis
Collaborator Author

jveitchmichaelis commented Mar 11, 2026

Do you have the profile from Comet showing the GPU and CPU usage for the run? Is it still computing when the timeout happens, or is one process hanging?

Or maybe try with some of the CUDA debug flags on.

We can definitely try with a duplicated dataset, though we might need to hack it so it doesn't load unique(images).
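For reference, these are the standard PyTorch/NCCL debug environment variables I'd reach for here (nothing DeepForest-specific; expect a large slowdown with CUDA_LAUNCH_BLOCKING enabled):

```shell
export NCCL_DEBUG=INFO                    # per-rank NCCL collective logging
export TORCH_DISTRIBUTED_DEBUG=DETAIL     # flag mismatched collectives across ranks
export CUDA_LAUNCH_BLOCKING=1             # synchronous kernel launches for accurate tracebacks
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1  # surface NCCL timeouts as exceptions instead of hangs
```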

@jveitchmichaelis
Collaborator Author

Could we also try the other way: can you binary-search through limit_val_batches until you see where it breaks?

Maybe it's as simple as bumping the timeout up to something much larger.
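Raising the timeout is a one-liner in the launch script, assuming a Lightning trainer (a sketch, not our actual config): DDPStrategy forwards `timeout` to `torch.distributed.init_process_group`, so a slow single-rank reduction no longer trips the NCCL watchdog.

```python
from datetime import timedelta

from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DDPStrategy

# Default process-group timeout is 30 minutes; give validation plenty of headroom.
strategy = DDPStrategy(timeout=timedelta(hours=2))
trainer = Trainer(strategy=strategy, accelerator="gpu", devices=4)
```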

@jveitchmichaelis
Collaborator Author

jveitchmichaelis commented Mar 19, 2026

@bw4sz I can't replicate your issue. I've cherry-picked the metric changes into my treeformer branch, and it runs fine with the LIDAR dataset; validation takes about 25 mins on a single GPU (and this is 20k+ images). So I wonder if something specific to your data is causing validation to be pathologically slow? It's possible that keypoint eval is just fast.

I'll report back when this run is complete, but it looks like aggregation is fine comparing a 2-GPU run vs 1.

My suggestion would be to get this merged, because the strategy is definitely better than what's in main, and we can try to identify the edge cases next.

@bw4sz bw4sz self-requested a review March 20, 2026 13:56
Collaborator

@bw4sz bw4sz left a comment


Agreed. I'm glad you checked out a different branch. I'm not surprised there is something specific about the bird detector.

@bw4sz bw4sz merged commit 46908f5 into weecology:main Mar 20, 2026
11 checks passed
@bw4sz bw4sz mentioned this pull request Mar 20, 2026
@jveitchmichaelis
Collaborator Author

[image: metric curves comparing the 1x and 2x GPU runs]

Ignoring the LR difference, this was a 1x vs 2x GPU run over the same time, and the P/R looks close enough that I'm satisfied (the issue we had before was P/R being reported at 1/N for the different dataset shards).
