refactor recall_precision metric for better DDP support #1343
bw4sz merged 1 commit into weecology:main from
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff           @@
##             main    #1343      +/-  ##
==========================================
- Coverage   87.34%   86.78%   -0.56%
==========================================
  Files          24       24
  Lines        2978     3005      +27
==========================================
+ Hits         2601     2608       +7
- Misses        377      397      +20
```

Flags with carried forward coverage won't be shown.
The majority of the missing lines here are due to DDP code paths, which won't run on GitHub Actions. This ties into #1273. It seems we might also be lacking coverage for some multi-class cases, where we would expect class precision/recall to be returned.
Force-pushed from a4e7f12 to 901a6a3.
This is basically good for review, but needs testing on a multi-GPU system to ensure that sync works properly in the metric, and when gathering predictions from
@bw4sz it would be helpful if you could have a look to see if this solves your issues, and act as an independent test that things are working. I'll also run the same benchmark on HiPerGator.
Force-pushed from 901a6a3 to ce8be7a.
Check on HPG looks good. As usual, if reporting results for publication, use a single worker to avoid any strangeness with batches that are padded up to a fixed size across nodes, but a bit of variation here (4x GPU) is to be expected. @bw4sz if you approve, before we merge this please could you double check on one of your bird runs? I did find a small annoyance with the

```bash
#!/bin/bash
#SBATCH --job-name=df-eval-check
#SBATCH --nodes=1
#SBATCH --partition=hpg-b200
#SBATCH --cpus-per-task=16
#SBATCH --mem=196GB
#SBATCH --time=14:00:00
#SBATCH --gpus=4
#SBATCH --output=./slurm_logs/%A.out
#SBATCH --error=./slurm_logs/%A.err
#SBATCH --ntasks-per-node=4

printenv | grep -i slurm | sort
source .venv/bin/activate

srun uv run deepforest evaluate neon_eval.csv validation.root_dir=./data/NEON_benchmark/images batch_size=4
```
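The padding caveat above can be demonstrated without a GPU. This is a minimal pure-Python mimic of how `torch.utils.data.DistributedSampler` pads indices by default when `drop_last=False` (the function name here is mine, not DeepForest's):

```python
import math

def padded_rank_indices(n_samples, world_size, rank):
    """Mimic DistributedSampler's default behavior: pad the index list by
    wrapping around so every rank receives the same number of samples."""
    per_rank = math.ceil(n_samples / world_size)
    total = per_rank * world_size
    indices = list(range(n_samples))
    indices += indices[: total - n_samples]  # repeated samples pad the tail
    return indices[rank:total:world_size]

# 10 images across 4 GPUs are padded to 12, so two images are evaluated
# twice -- hence the advice to use a single worker for publication numbers.
all_ranks = [padded_rank_indices(10, 4, r) for r in range(4)]
print(sum(len(ix) for ix in all_ranks))  # 12
```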
bw4sz left a comment
This looks right. I'm going to check it out, submit a BOEM detection multi-GPU run that was previously failing, and see.
Pull request overview
Refactors DeepForest’s validation-time box recall/precision metric to better support DDP by performing per-image matching during update() and minimizing end-of-epoch work, while also moving empty-frame handling into the same metric.
Changes:
- Refactored `RecallPrecision` to accept `(preds, targets, image_names)` and compute matching eagerly during `update()`, gathering match results across ranks for later access.
- Updated `validation_step`/`evaluate()` to use the new metric API and to gather prediction outputs across ranks during `evaluate()`.
- Adjusted/expanded tests to validate the returned evaluation results DataFrame and updated a test for the new `metric.update` signature.
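For intuition, here is a hedged sketch of that pattern: per-image matching done eagerly in `update()`, so the end-of-epoch `compute()` is a cheap reduction over a few counters. This is a toy stand-in, not the actual `RecallPrecision` implementation; the box format, greedy matching strategy, and threshold are assumptions.

```python
class RecallPrecisionSketch:
    """Toy stand-in: match boxes per image inside update() so compute()
    only divides counters (cheap to sync across DDP ranks)."""

    def __init__(self, iou_threshold=0.4):
        self.iou_threshold = iou_threshold
        self.true_positives = 0
        self.num_preds = 0
        self.num_targets = 0

    @staticmethod
    def _iou(a, b):
        # Boxes as (xmin, ymin, xmax, ymax) tuples
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter) if inter else 0.0

    def update(self, preds, targets, image_names):
        # Eager greedy matching per image, done as batches arrive
        for pred, target in zip(preds, targets):
            self.num_preds += len(pred["boxes"])
            self.num_targets += len(target["boxes"])
            unmatched = list(target["boxes"])
            for p in pred["boxes"]:
                best = max(unmatched, key=lambda t: self._iou(p, t), default=None)
                if best is not None and self._iou(p, best) >= self.iou_threshold:
                    self.true_positives += 1
                    unmatched.remove(best)

    def compute(self):
        recall = self.true_positives / self.num_targets if self.num_targets else 0.0
        precision = self.true_positives / self.num_preds if self.num_preds else 0.0
        return {"box_recall": recall, "box_precision": precision}

m = RecallPrecisionSketch()
m.update(
    [{"boxes": [(0, 0, 10, 10), (20, 20, 30, 30)]}],
    [{"boxes": [(0, 0, 10, 10)]}],
    ["img1.png"],
)
print(m.compute())  # {'box_recall': 1.0, 'box_precision': 0.5}
```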
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `tests/test_main.py` | Updates tests for new metric API and adds assertions about evaluation outputs (labels, image_path presence, row counts). |
| `src/deepforest/metrics.py` | Major refactor of `RecallPrecision`: new update signature, eager matching, distributed gathering of match results, and empty-frame accounting. |
| `src/deepforest/main.py` | Wires new metric update signature into validation, removes standalone empty-frame accuracy metric, and gathers predictions in `evaluate()` under DDP. |
The review comment refers to this hunk:

```python
self.num_images += 1

n_pred = len(pred["boxes"])
n_target = len(target["boxes"])

# Early exit for prediction/target base cases.
is_empty_frame = n_target == 0 or torch.all(target["boxes"] == 0)
if is_empty_frame:
    self.num_empty_frames += 1
    if n_pred == 0:
        self.correct_empty_predictions += 1
    else:
        # Predictions in an empty frame are all FP: precision = 0
        self.num_images_with_predictions += 1
    return
```

A second hunk from elsewhere in the diff was interleaved here:

```python
# Expand image names to one entry per box
predictions["image_path"] = [
    self.index_to_path[int(idx.item())]
    for idx in torch.cat(self.image_indices)
]

results = __evaluate_wrapper__(
    predictions=predictions,
```
num_images is incremented before detecting/handling empty frames, so empty frames currently contribute to the denominator for box_recall (and, when predictions exist, also affect box_precision via num_images_with_predictions). This differs from the existing evaluation semantics in evaluate.evaluate_geometry(), which filters out empty ground-truth boxes/frames before computing per-image recall/precision. Consider tracking a separate counter for non-empty ground-truth images (and using that as the recall denominator), and excluding empty-frame-only samples from the recall/precision aggregation while still reporting empty_frame_accuracy.
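A sketch of the suggested fix (the counter names are hypothetical, not the PR's actual state variables): empty frames feed only `empty_frame_accuracy`, while recall and precision draw their denominators from non-empty-image counters.

```python
def update_counters(c, n_pred, n_target, target_is_all_zero):
    """Hypothetical per-image counter update following the suggestion above."""
    if n_target == 0 or target_is_all_zero:
        # Empty ground truth: contributes only to empty_frame_accuracy,
        # and is excluded from the recall/precision denominators below.
        c["num_empty_frames"] += 1
        if n_pred == 0:
            c["correct_empty_predictions"] += 1
        return c
    c["num_nonempty_images"] += 1               # recall denominator
    if n_pred > 0:
        c["num_images_with_predictions"] += 1   # precision denominator
    return c

c = dict.fromkeys(
    ["num_empty_frames", "correct_empty_predictions",
     "num_nonempty_images", "num_images_with_predictions"], 0)
update_counters(c, n_pred=0, n_target=0, target_is_all_zero=False)  # correct empty
update_counters(c, n_pred=3, n_target=0, target_is_all_zero=False)  # FP-only frame
update_counters(c, n_pred=2, n_target=5, target_is_all_zero=False)  # normal frame
print(c)
```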
You can see the trainer get the correct ranks and start training, then a core dump. I'm wondering how we can make a reproducible failure here. I believe it's the giant test set. Can you just make a very large test set by concatenating the normal testing set many, many times? I wonder if that will be enough to trigger it.
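One way to build that stress set (a sketch: `pd.concat` with `ignore_index=True` is the relevant trick; the inline DataFrame and the output filename stand in for the real annotations CSV):

```python
import pandas as pd

# Stand-in for the normal test annotations; a real run would read them,
# e.g. pd.read_csv("annotations.csv") (path is a placeholder).
df = pd.DataFrame({
    "image_path": ["a.png", "b.png"],
    "xmin": [0, 5], "ymin": [0, 5], "xmax": [10, 15], "ymax": [10, 15],
    "label": ["Tree", "Tree"],
})

# Concatenate the same annotations many times to inflate the test set;
# repeated images are fine for triggering scale-dependent hangs.
big = pd.concat([df] * 500, ignore_index=True)
big.to_csv("annotations_x500.csv", index=False)
print(len(big))  # 1000
```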
Do you have the profile from Comet showing the GPU and CPU usage for the run? Is it still computing when the timeout happens, or is one process hanging? Or maybe try with some of the CUDA debug flags on. We can definitely try with a duplicated dataset though. Might need to hack it so it doesn't load
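For the debug flags, a hedged starting set would be the standard PyTorch/NCCL environment variables (nothing DeepForest-specific; in practice these would go in the sbatch script as `export` lines):

```shell
# Make CUDA errors surface at the offending call instead of asynchronously
export CUDA_LAUNCH_BLOCKING=1
# Verbose collective/communication logging from NCCL and torch.distributed
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# then rerun the srun ... deepforest evaluate job as before
```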
Could we also try the other way? Can you binary search through
Maybe it's as simple as bumping up the timeout to something much larger.
@bw4sz I can't replicate your issue. I've cherry-picked the metric changes into my treeformer branch, and it runs fine with the LIDAR dataset; validation takes about 25 minutes on a single GPU (and this is 20k+ images). So I wonder if something specific to your data is causing validation to be pathologically slow? It's possible that keypoint eval is fast. I'll report back when this run is complete, but it looks like aggregation is fine comparing a 2-GPU run vs 1. My suggestion would be to get this merged, because the strategy is definitely better than what's in main, and we can try to identify the edge cases next.
bw4sz left a comment
Agreed. I'm glad you checked out a different branch. I'm not surprised there is something specific about the bird detector.

Description

- Same `update` signature as other metrics.
- Matching happens during `update` so that the final reduction stage doesn't time out the sync between ranks during DDP joining. `it/s` during eval might be a bit slower, but there won't be a big chunk of compute at the end.

Tests:

- Expanded `test_main::test_evaluate` to confirm that the output dataframe is what we expect + some comments on the assertions.

Related Issue(s)
AI-Assisted Development
Rewrote most of the metric and then got CC to check some of the logic and the test suite.
AI tools used (if applicable):
Claude Code Opus/Sonnet 4.6