# -*- coding: utf-8 -*-
"""Cnn_object_detection.ipynb
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1BkvQWxRK3QD6R08mIcy3y15ZHR3IpHTt
### **Faster R-CNN for Real-Time Object Detection and Tracking**
# 1. Research and Model Selection for this project
To perform real-time object detection and tracking with a balanced trade-off between speed and accuracy, a pre-trained Faster R-CNN model is a suitable choice for this project.
The choice of Faster R-CNN over alternatives such as YOLO (You Only Look Once) or SSD (Single Shot MultiBox Detector) is justified by its balance between speed and accuracy, which is crucial for our requirements. Although models like YOLO and SSD offer faster inference times, which are beneficial for real-time detection tasks, Faster R-CNN tends to achieve higher accuracy.
In the first stage, Faster R-CNN uses a Region Proposal Network (RPN) to generate potential bounding boxes in an image that are likely to contain objects. In the second stage, it refines these bounding boxes and classifies the objects within them. This two-step approach allows for more precise detections and a lower false positive rate, as it explicitly differentiates between background and foreground, making it ideal for applications where precision is paramount.
Furthermore, the ability of Faster R-CNN to integrate with deep feature extractors like ResNet makes it highly adaptable for a wide range of object detection tasks, as these deep networks can capture intricate details and subtle features of objects. This is particularly advantageous in scenarios where the dataset is complex and contains objects that require detailed feature extraction for accurate detection and classification.
For the dataset used here, which is based on the COCO 2017 challenge, Faster R-CNN offers a robust and tested solution. The dataset contains a diverse set of images with complex scenes and multiple object categories, where the model's accuracy and ability to detect small to medium-sized objects align well with our objectives. While it might not provide the real-time inference capabilities of YOLO or SSD, the project prioritizes accuracy and thorough detection over speed, justifying the choice of Faster R-CNN.
Additionally, given the project's focus on tracking, the accurate detection of objects frame by frame is critical. Faster R-CNN’s precise bounding box predictions facilitate more reliable tracking across frames, which is beneficial for maintaining the identity of objects over time.
In summary, while YOLO and SSD are suitable for scenarios where speed is a critical factor, the selection of Faster R-CNN for this project is justified by its superior accuracy, reliable performance on benchmark datasets, and its suitability for the project's specific needs around precise object detection and tracking.
# 2. Data Collection and Preparation
For a project involving object detection and tracking, the chosen dataset should not only include diverse instances of the objects of interest but also provide continuity for tracking. The COCO dataset is therefore appropriate for this project.
The Common Objects in Context (COCO) dataset is widely used in object detection, segmentation, and captioning projects. It offers a large variety of images with objects in their natural contexts and includes annotations for object detection tasks.
FiftyOne was used to load and work with the COCO 2017 dataset, as downloading it directly from the official COCO website required a lot of storage. FiftyOne provides a high-level interface for loading datasets, visualizing data, and integrating with various machine learning models and datasets.
"""
# Commented out IPython magic to ensure Python compatibility.
# !pip install fiftyone
import fiftyone as fo
import fiftyone.zoo as foz
# Loading COCO 2017 Training Dataset
train_dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="train",
    dataset_name="coco-2017-train",
)

# Loading COCO 2017 Validation Dataset
validation_dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",
    dataset_name="coco-2017-validation",
)
import torchvision.transforms as transforms

# Defining the augmentation pipeline; ToTensor and Normalize are included
# so outputs match the input statistics expected by the pre-trained model
augmentations = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
"""# 3. Implementation
A pre-trained Faster R-CNN model is loaded using PyTorch, configured to run on a GPU if available.
The model is set to evaluation mode, which is standard procedure when performing inference, as it switches layers such as Dropout and BatchNorm to their inference-time behavior.
A random subset of 100 samples is selected from the training dataset to which predictions will be added.
"""
import torch
import torchvision
# Running the model on GPU only if it is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Loading a pre-trained Faster R-CNN model
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.to(device)
model.eval()
print("Model ready")
# Choosing a random subset of 100 samples to add predictions to
predictions_view = train_dataset.take(100, seed=51)
"""**Object Detection:**
Adding predictions to each randomly selected sample. This involves loading images, converting them to tensors, and feeding them into the model.
The outputs from the model (labels, scores, and bounding boxes) are converted into the FiftyOne format, which is used for visualization and analysis in the FiftyOne app.
"""
from PIL import Image
from torchvision.transforms import functional as func
import fiftyone as fo
# Getting class list from the dataset
classes = train_dataset.default_classes
# Adding predictions to samples
with fo.ProgressBar() as pb:
    for sample in pb(predictions_view):
        # Loading image
        image = Image.open(sample.filepath)
        image = func.to_tensor(image).to(device)
        c, h, w = image.shape

        # Performing inference
        preds = model([image])[0]
        labels = preds["labels"].cpu().detach().numpy()
        scores = preds["scores"].cpu().detach().numpy()
        boxes = preds["boxes"].cpu().detach().numpy()

        # Converting detections to FiftyOne format
        detections = []
        for label, score, box in zip(labels, scores, boxes):
            # Converting to [top-left-x, top-left-y, width, height]
            # in relative coordinates in [0, 1] x [0, 1]
            x1, y1, x2, y2 = box
            rel_box = [x1 / w, y1 / h, (x2 - x1) / w, (y2 - y1) / h]
            detections.append(
                fo.Detection(
                    label=classes[label],
                    bounding_box=rel_box,
                    confidence=score,
                )
            )

        # Saving predictions to dataset
        sample["predictions"] = fo.Detections(detections=detections)
        sample.save()
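The absolute-to-relative box conversion used in the loop above can be checked in isolation; a self-contained sketch with illustrative pixel values:

```python
# Illustrative image size and model output box in [x1, y1, x2, y2] pixels
w, h = 640, 480
x1, y1, x2, y2 = 64, 48, 320, 240

# FiftyOne expects [top-left-x, top-left-y, width, height] in [0, 1]
rel_box = [x1 / w, y1 / h, (x2 - x1) / w, (y2 - y1) / h]
print(rel_box)  # [0.1, 0.1, 0.4, 0.4]
```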
"""Used FiftyOne tool to load and work with the dataset"""
import fiftyone as fo
session = fo.launch_app()
session.view = predictions_view
"""For inference after loading predictions_view in the App to visualize the predictions that we added. In this image from what appears to be the FiftyOne app where an individual, seemingly a man, making a speech or presentation. The man is wearing a formal suit with a red patterned bow tie and a name tag, and he's holding a microphone., you can see that there are several tags or labels in the sidebar pertaining to the machine learning dataset: ground_truth, predictions.
**Ground Truth (2):** There are 2 annotations provided as ground truth for the objects within the image. Ground truth annotations are considered the correct answer or the benchmark for model predictions. These could be bounding boxes or other forms of labels that indicate the presence and position of objects in the image that the model is meant to detect.
**Predictions (24):** A trained model has produced 24 predictions for this particular image. These predictions are the model's attempt to detect objects based on what it has learned during training. The number of predictions exceeds the ground truth, suggesting that the model may have identified multiple potential objects that need further validation to determine their accuracy. This could indicate a high recall rate, but potentially also a lower precision if many of these predictions are false positives.
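A common next step is to discard low-confidence detections before comparing against the ground truth; a minimal sketch with hypothetical scores:

```python
# Hypothetical confidence scores for 24 predictions on one image
scores = [0.98, 0.91, 0.85, 0.40, 0.33] + [0.10] * 19

# Keeping only detections above a chosen threshold trims likely false positives
threshold = 0.5
kept = [s for s in scores if s > threshold]
print(len(kept), "of", len(scores), "predictions kept")  # 3 of 24
```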
**Object Tracking:**
A CentroidTracker class is defined and implemented, which updates the tracker with detections for each frame.
The centroid tracker is used to maintain the identity of objects across frames.
It supports registering new objects, deregistering objects that have disappeared, and updating matches on each frame.
"""
import numpy as np
from scipy.spatial import distance

class CentroidTracker:
    def __init__(self, maxDisappeared=50):
        # Initializing the next unique object ID along with two dictionaries
        self.nextObjectID = 0
        self.objects = {}
        self.disappeared = {}

        # Storing the maximum number of consecutive frames a given object is
        # marked as "disappeared" until we deregister the object from tracking
        self.maxDisappeared = maxDisappeared

    def register(self, centroid):
        # When registering an object, use the next available object ID to store
        # the centroid coordinates of the object
        self.objects[self.nextObjectID] = centroid
        self.disappeared[self.nextObjectID] = 0
        self.nextObjectID += 1

    def deregister(self, objectID):
        # To deregister an object ID, delete the object ID from both of our
        # respective dictionaries
        del self.objects[objectID]
        del self.disappeared[objectID]

    def update(self, rects):
        # Checking to see if the list of input bounding box rectangles is empty
        if len(rects) == 0:
            # Looping over any existing tracked objects and marking them as disappeared
            for objectID in list(self.disappeared.keys()):
                self.disappeared[objectID] += 1

                # Deregistering any object that has been marked as missing for
                # the maximum number of consecutive frames
                if self.disappeared[objectID] > self.maxDisappeared:
                    self.deregister(objectID)

            # Returning early as there are no centroids or tracking info to update
            return self.objects

        # Initializing an array of input centroids for the current frame
        inputCentroids = np.zeros((len(rects), 2), dtype="int")

        # Looping over the bounding box rectangles
        for (i, (startX, startY, endX, endY)) in enumerate(rects):
            # Using the bounding box coordinates to derive the centroid
            cX = int((startX + endX) / 2.0)
            cY = int((startY + endY) / 2.0)
            inputCentroids[i] = (cX, cY)

        # If we are currently not tracking any objects, take the input centroids
        # and register each of them
        if len(self.objects) == 0:
            for i in range(0, len(inputCentroids)):
                self.register(inputCentroids[i])

        # Otherwise, we are currently tracking objects so we need to try to match
        # the input centroids to existing object centroids
        else:
            # Getting the set of object IDs and corresponding centroids
            objectIDs = list(self.objects.keys())
            objectCentroids = list(self.objects.values())

            # Computing the distance between each pair of object centroids and
            # input centroids -- the goal is to match an input centroid to an
            # existing object centroid
            D = distance.cdist(np.array(objectCentroids), inputCentroids)

            # Matching greedily: sorting rows by their smallest distance, then
            # pairing each row with its nearest unclaimed column
            rows = D.min(axis=1).argsort()
            cols = D.argmin(axis=1)[rows]
            usedRows, usedCols = set(), set()
            for (row, col) in zip(rows, cols):
                if row in usedRows or col in usedCols:
                    continue
                objectID = objectIDs[row]
                self.objects[objectID] = inputCentroids[col]
                self.disappeared[objectID] = 0
                usedRows.add(row)
                usedCols.add(col)

            # Marking unmatched existing objects as disappeared and registering
            # unmatched input centroids as new objects
            unusedRows = set(range(D.shape[0])) - usedRows
            unusedCols = set(range(D.shape[1])) - usedCols
            for row in unusedRows:
                objectID = objectIDs[row]
                self.disappeared[objectID] += 1
                if self.disappeared[objectID] > self.maxDisappeared:
                    self.deregister(objectID)
            for col in unusedCols:
                self.register(inputCentroids[col])

        return self.objects
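The distance matrix computed by `cdist` drives the matching step; a self-contained sketch of nearest-centroid matching with illustrative values:

```python
import numpy as np
from scipy.spatial import distance

# Previous frame's tracked centroids (object IDs 0 and 1) and the current
# frame's detected centroids, slightly shifted (illustrative values)
objectCentroids = np.array([[20, 20], [120, 120]])
inputCentroids = np.array([[122, 118], [22, 24]])

# Pairwise distances: row i = existing object i, column j = new detection j
D = distance.cdist(objectCentroids, inputCentroids)

# Greedy matching: sort rows by their closest distance, then pair each row
# with its nearest column
rows = D.min(axis=1).argsort()
cols = D.argmin(axis=1)[rows]
matches = dict(zip(rows.tolist(), cols.tolist()))
print(matches)  # object 0 -> detection 1, object 1 -> detection 0
```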
# Initializing the tracker
tracker = CentroidTracker()
# Looping to process images, perform detection, update the tracker, and store results in FiftyOne
with fo.ProgressBar() as pb:
    for sample in pb(predictions_view):
        # Loading image
        image = Image.open(sample.filepath)
        image = func.to_tensor(image).to(device)
        c, h, w = image.shape

        # Performing inference
        preds = model([image])[0]
        labels = preds["labels"].cpu().detach().numpy()
        scores = preds["scores"].cpu().detach().numpy()
        boxes = preds["boxes"].cpu().detach().numpy()

        # Updating the tracker with detections from the current frame;
        # the tracker expects rectangles in (startX, startY, endX, endY)
        # format, which matches the model's [x1, y1, x2, y2] output
        rects = [(box[0], box[1], box[2], box[3]) for box in boxes]
        tracker.update(rects)

        # Converting detections to FiftyOne format
        model_detections = []
        for label, score, box in zip(labels, scores, boxes):
            x1, y1, x2, y2 = box
            rel_box = [x1 / w, y1 / h, (x2 - x1) / w, (y2 - y1) / h]
            model_detections.append(
                fo.Detection(
                    label=classes[label],
                    bounding_box=rel_box,
                    confidence=score,
                )
            )

        # Adding model detections to the sample
        sample["predictions"] = fo.Detections(detections=model_detections)

        # Preparing tracking results for FiftyOne; the centroid tracker only
        # stores (cX, cY) centroids, so a fixed-size placeholder box is drawn
        # around each centroid purely for visualization
        tracking_results = []
        for objectID, centroid in tracker.objects.items():
            cX, cY = centroid
            width, height = 50, 50  # Placeholder box size around the centroid
            x1 = max(0, cX - width / 2)
            y1 = max(0, cY - height / 2)

            # Converting the centroid box to FiftyOne's relative format
            rel_box = [x1 / w, y1 / h, width / w, height / h]
            tracking_results.append(
                fo.Detection(
                    label=str(objectID),
                    bounding_box=rel_box,
                )
            )

        # Adding tracking results to the sample
        sample["tracking"] = fo.Detections(detections=tracking_results)
        sample.save()
# Launching FiftyOne session
session = fo.launch_app()
# Using the dataset with added predictions and tracking
session.dataset = train_dataset
# Creating a view that includes samples with tracking labels
view = train_dataset.exists("tracking")
session.view = view
"""For inference after applying tracking analyzed this image where social scene with people engaged in an activity, likely at an event or gathering,
In this image from what appears to be the FiftyOne app, you can see that there are several tags or labels in the sidebar pertaining to the machine learning dataset: `ground_truth`, `predictions`, and `tracking`.
- **Ground Truth (18)**: There are 18 ground truth annotations on this image. Ground truth refers to the actual, manually labeled data indicating the correct answers for the training or evaluation of a machine learning model.
- **Predictions (95)**: The model has made 95 predictions for this image. This is typically the output from a trained machine learning model, where it tries to predict the ground truth labels based on the learned patterns.
- **Tracking (9)**: The tracking implementation has been applied to this image, resulting in 9 tracking annotations. In the context of object detection and tracking, this likely indicates that the model has identified and tracked 9 distinct objects over time (or across different frames if this is a part of a sequence of images).
The discrepancy in the number of ground truth labels and predictions suggests that the model may have either made multiple predictions for the same object or detected additional objects that were not labeled in the ground truth.
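Duplicate predictions for one object are commonly reduced with non-maximum suppression (NMS); a minimal, self-contained sketch with illustrative boxes and scores:

```python
def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, drop heavy overlaps, repeat
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate boxes on one object plus one distinct box (illustrative)
boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```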
# Challenges
Reflecting on the project details and the nature of object detection and tracking, here are the challenges faced, along with potential future directions:
### Challenges Encountered:
1. **Dataset Limitations**: While the COCO dataset is comprehensive, it did not cover all edge cases or specific scenarios needed for advanced object detection and tracking. Also, downloading the official dataset from the COCO website required a lot of storage, so FiftyOne was used to load the COCO dataset instead.
Referred from: https://docs.voxel51.com/recipes/adding_detections.html
2. **Model Complexity and Speed**: Faster R-CNN, while accurate, is computationally intensive, which may not be suitable for real-time applications. Balancing model complexity with the need for speed is a constant challenge.
3. **Generalization to New Domains**: The model might perform well on the dataset it was trained on but did not generalize well to new datasets or domains.
By addressing the challenges listed above, the project can be pushed towards a more robust, accurate, and versatile object detection and tracking system.
"""