Average Precision and Average Recall metrics reported by COCOeval seem to be incorrect #672

Description

@tinybike

I'm not sure if this is the right place to report issues with https://github.com/ppwwyyxx/cocoapi -- that repo doesn't have its own Issues tab, so I'm opening an issue here instead.

I'm confused by how pycocotools calculates the average precision and average recall metrics reported in the summary. I'm not sure whether this is actually a bug or whether I'm just fundamentally misunderstanding how the calculations are done under the hood. So I wrote a minimal test case: two ground-truth bboxes and two predicted bboxes that overlap them perfectly, passed into COCOeval:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Two ground-truth boxes and two predictions that overlap them exactly,
# given as [x1, y1, x2, y2] corner coordinates.
actual_boxes = [[50, 50, 150, 150], [200, 200, 300, 300]]
predicted_boxes = [[50, 50, 150, 150], [200, 200, 300, 300]]
scores = [1.0, 1.0]

coco_actual = COCO()
coco_predicted = COCO()

# Build annotation dicts, converting each box to COCO's [x, y, w, h] format.
actual_annotations_list = []
predicted_annotations_list = []
for id, box in enumerate(actual_boxes):
    actual_annotations_list.append({
        "id": id,
        "image_id": 1,
        "category_id": 1,
        "bbox": [box[0], box[1], box[2] - box[0], box[3] - box[1]],
        "area": (box[2] - box[0]) * (box[3] - box[1]),
        "iscrowd": 0,
    })
for id, box in enumerate(predicted_boxes):
    predicted_annotations_list.append({
        "id": id,
        "image_id": 1,
        "category_id": 1,
        "bbox": [box[0], box[1], box[2] - box[0], box[3] - box[1]],
        "area": (box[2] - box[0]) * (box[3] - box[1]),
        "iscrowd": 0,
        "score": scores[id],
    })

# Register the ground truth and the detections with two in-memory COCO objects.
coco_actual.dataset = {
    "images": [{"id": 1}],
    "annotations": actual_annotations_list,
    "categories": [{"id": 1, "name": "object"}],
}
coco_actual.createIndex()
coco_predicted.dataset = {
    "images": [{"id": 1}],
    "annotations": predicted_annotations_list,
    "categories": [{"id": 1, "name": "object"}],
}
coco_predicted.createIndex()

# Run the standard bbox evaluation.
coco_eval = COCOeval(coco_actual, coco_predicted, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()

Here is the output:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.252
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.252
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.252
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.252
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.500
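
If I'm reading cocoeval.py correctly, these twelve numbers are also stored in coco_eval.stats after summarize() runs, in the same order as the printed lines, so the two values in question can be pulled out directly:

# coco_eval.stats is a 12-element array, in the same order as the summary above.
print(coco_eval.stats[0])  # AP @ IoU=0.50:0.95, area=all, maxDets=100 -> ~0.252
print(coco_eval.stats[8])  # AR @ IoU=0.50:0.95, area=all, maxDets=100 -> 0.5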

Both boxes are 100x100 (area 10000), so I believe they fall into the "large" area range, and the summary reports AP=0.252 and AR=0.500. These numbers don't make sense to me: the predictions are identical to the ground truth, so I'd expect both average precision and average recall to be 1.0. Am I misunderstanding something, or is there a bug in how these metrics are calculated?
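
For completeness, here's a quick sanity check (my own snippet, separate from the repro above) that the IoUs really are 1.0 according to pycocotools' own routine -- as far as I can tell this is the same maskUtils.iou call that COCOeval.computeIoU uses for bboxes:

from pycocotools import mask as maskUtils

# Same boxes as above, converted from [x1, y1, x2, y2] to COCO's [x, y, w, h].
gt = [[50, 50, 100, 100], [200, 200, 100, 100]]
dt = [[50, 50, 100, 100], [200, 200, 100, 100]]

# iou(dt, gt, iscrowd) returns a len(dt) x len(gt) matrix of IoUs.
ious = maskUtils.iou(dt, gt, [0] * len(gt))
print(ious)  # diagonal entries are 1.0, i.e. each prediction matches its ground truth exactly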
