# YOLO: You Only Look Once

YOLO stands for You Only Look Once, a clever convolutional neural network (CNN) used for real-time object detection.

But before anything else, let's backtrack a bit.

## A Convolutional Approach to Sliding Windows
Let’s assume we have a 16x16x3 image, like the one shown below. This means the image has a size of 16 by 16 pixels and has 3 channels, corresponding to RGB.


Let’s now select a window size of 10x10 pixels as shown below:

If we use a stride of 2 pixels, it will take 16 windows to cover the entire image, as we can see below.
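
As a quick sanity check, here is the window-count arithmetic in a few lines of Python (the variable names are just for illustration):

```python
# Number of window positions along one dimension: (image - window) / stride + 1
image_size, window_size, stride = 16, 10, 2
positions_per_dim = (image_size - window_size) // stride + 1  # (16 - 10) / 2 + 1 = 4
print(positions_per_dim ** 2)  # 4 x 4 = 16 windows in total
```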

In the original Sliding Windows approach, each of these 16 windows will have to be passed individually through a CNN. Let’s assume that CNN has the following architecture:

The CNN takes as input a 10x10x3 image. It then applies five 7x7x3 filters, followed by a 2x2 max pooling layer, then 128 2x2x5 filters, then 128 1x1x128 filters, and finally 8 1x1x128 filters that represent a softmax output.
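
A minimal sketch of this architecture, assuming PyTorch (the framework choice and variable names are mine, not part of the original exercise):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 5, kernel_size=7),      # five 7x7x3 filters:  10x10x3 -> 4x4x5
    nn.MaxPool2d(kernel_size=2),         # 2x2 max pooling:     4x4x5   -> 2x2x5
    nn.Conv2d(5, 128, kernel_size=2),    # 128 2x2x5 filters:   2x2x5   -> 1x1x128
    nn.Conv2d(128, 128, kernel_size=1),  # 128 1x1x128 filters: 1x1x128 -> 1x1x128
    nn.Conv2d(128, 8, kernel_size=1),    # 8 1x1x128 filters:   1x1x128 -> 1x1x8
)                                        # (a softmax over the 8 channels would follow)

window = torch.randn(1, 3, 10, 10)       # one 10x10x3 window
print(model(window).shape)               # torch.Size([1, 8, 1, 1])
```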

What will happen if we change the input of the above CNN from 10x10x3, to 16x16x3? The result is shown below:

As we can see, this CNN architecture is the same as the one shown before except that it takes as input a 16x16x3 image. The sizes of each layer change because the input image is larger, but the same filters as before have been applied. If we follow the region of the image that corresponds to the first window through this new CNN, we see that the result is the upper-left corner of the last layer (see image above). 

Similarly, if we follow the section of the image that corresponds to the second window through this new CNN, we see the corresponding result in the last layer:

Likewise, if we follow the section of the image that corresponds to the third window through this new CNN, we see the corresponding result in the last layer, as shown in the image below:

Finally, if we follow the section of the image that corresponds to the fourth window through this new CNN, we see the corresponding result in the last layer, as shown in the image below:

In fact, if we follow all the windows through the CNN, we see that all 16 of them are contained within the last layer of this new CNN. Therefore, passing the 16 windows individually through the old CNN is exactly equivalent to passing the whole image only once through this new CNN.
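
We can verify this with the sketch above: feeding the full 16x16x3 image through the same `model` produces a 4x4x8 output, and each of its 16 cells matches the output of the corresponding cropped window (the pooling layer gives the network an effective stride of 2, matching the window stride):

```python
image = torch.randn(1, 3, 16, 16)
full_output = model(image)              # shape: [1, 8, 4, 4] -- one 8-vector per window
print(full_output.shape)

first_window = image[:, :, 0:10, 0:10]  # upper-left 10x10 window
print(torch.allclose(model(first_window)[:, :, 0, 0],
                     full_output[:, :, 0, 0]))  # True: same result, computed once
```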

This is how you can apply sliding windows with a CNN, and it makes the whole process much more efficient. However, the technique has a downside: the positions of the bounding boxes are not going to be very accurate, because it is quite unlikely that a given window size and stride will match the objects in the image perfectly. In order to increase the accuracy of the bounding boxes, YOLO uses a grid instead of sliding windows, together with two other techniques, known as Intersection Over Union and Non-Maximal Suppression.

The combination of the above techniques is part of the reason the YOLO algorithm works so well. Before diving into how YOLO puts all these techniques together, we will look first at each technique individually.

We will be using the latest version of YOLO, known as YOLOv3, for this exercise. 

## Setting the Non-Maximal Suppression Threshold
YOLO uses Non-Maximal Suppression (NMS) to keep only the best bounding boxes. The first step in NMS is to remove all the predicted bounding boxes that have a detection probability less than a given NMS threshold. For this exercise, we set this NMS threshold to 0.6. This means that all predicted bounding boxes that have a detection probability less than 0.6 will be removed.
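
The exercise's actual code isn't reproduced here, but this first step amounts to a simple filter. A minimal sketch, with made-up boxes and scores:

```python
import torch

nms_thresh = 0.6
# Hypothetical predictions: boxes as (x1, y1, x2, y2), one detection probability each.
boxes = torch.tensor([[ 50.,  50., 120., 120.],
                      [ 55.,  48., 125., 118.],
                      [200., 210., 260., 270.]])
scores = torch.tensor([0.90, 0.75, 0.40])

keep = scores >= nms_thresh                # drop anything below the NMS threshold
boxes, scores = boxes[keep], scores[keep]
print(scores)                              # tensor([0.9000, 0.7500]) -- the 0.40 box is gone
```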

## Setting the Intersection Over Union Threshold
After removing all the predicted bounding boxes that have a low detection probability, the second step in NMS is to select the bounding box with the highest detection probability and eliminate all the remaining bounding boxes whose Intersection Over Union (IOU) value with respect to it is higher than a given IOU threshold. For this exercise, we set this IOU threshold to 0.4. This means that all predicted bounding boxes that have an IOU value greater than 0.4 with respect to the best bounding boxes will be removed.
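
Again, the original code is not shown here, so the following is only a sketch of the idea: a plain-Python IOU helper plus a greedy loop that keeps the highest-scoring box and discards any remaining box overlapping it by more than the 0.4 threshold. All names and sample values are hypothetical:

```python
def iou(a, b):
    """Intersection Over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_maximal_suppression(boxes, scores, iou_thresh=0.4):
    """Greedy NMS: repeatedly keep the best box, drop heavy overlaps."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(50, 50, 120, 120), (55, 48, 125, 118), (200, 210, 260, 270)]
scores = [0.90, 0.75, 0.85]
print(non_maximal_suppression(boxes, scores))  # [0, 2] -- box 1 overlaps box 0 too much
```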

I get the results below. Voilà! Although we'd have to introduce cans to YOLOv3.


Again, this is part of Udacity's Nanodegree on Computer Vision! 
