Object Detection Review: R-CNN, Fast R-CNN, Faster R-CNN and YOLO

1. Abstract

In this report, I first give an overall review of object detection and then introduce the mainstream deep convolutional neural network (DCNN) methods for this task, including R-CNN[5], Fast R-CNN[4], Faster R-CNN[10] and YOLO[9].

2. Overview of Object Detection

Object detection[2] is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Traditional machine learning approaches to object detection include the Viola-Jones method based on Haar features, the scale-invariant feature transform (SIFT), and histogram of oriented gradients (HOG) features. With the rapid growth of deep learning, deep convolutional neural network (DCNN) methods were proposed that perform much better than these traditional approaches. They can be divided into two groups: two-stage methods, such as R-CNN and its improvements, and one-stage methods, such as YOLO and SSD. This report mainly focuses on the DCNN methods.

3. Review of Object Detection Networks

3.1 Regions with CNN features (R-CNN)

R-CNN[5] is the pioneer in using a DCNN instead of traditional methods for feature extraction in the object detection task. Its main workflow is to propose a number of regions of interest (ROI) and then use a CNN to extract features for a support vector machine (SVM) classifier.

Algorithm

Take an input image:
Region proposal: about 1K∼2K candidate regions are generated per image by the selective search algorithm[8].
Feature extraction: for each region proposal, a deep convolutional network extracts features. The R-CNN paper uses AlexNet pre-trained on ImageNet.
Classification: the features are sent to one SVM classifier per class to determine whether the region belongs to that class.
Position refinement: regressors fine-tune the candidate box positions.

Selective Search

Selective Search[8] is a region proposal algorithm used in object detection. It first segments the input image into many small regions, then repeatedly merges similar regions based on color, texture, size and shape compatibility; the regions produced along the way form the 1k∼2k proposals. The grouping rules are described as follows.

Color: merge regions with similar color histograms.
Texture: merge regions with similar texture, where texture features are represented by gradient histograms.
Size: size similarity encourages small regions to merge early, which prevents one large region from swallowing the smaller ones one by one.
Shape: shape compatibility measures how well two regions fit into each other. If one fits into the other, they should be merged to fill gaps; if they do not even touch, they should not be merged.
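
The grouping rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation: the function names are hypothetical, histograms are assumed to be L1-normalised lists, region sizes are assumed to be pixel counts, and boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples. Texture similarity would apply the same histogram intersection to gradient histograms.

```python
def histogram_similarity(hist_a, hist_b):
    """Histogram intersection; 1.0 for identical L1-normalised histograms."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def size_similarity(size_a, size_b, image_size):
    """Higher when both regions are small, so small regions merge first."""
    return 1.0 - (size_a + size_b) / image_size


def shape_similarity(box_a, box_b, size_a, size_b, image_size):
    """Higher when the two regions leave little empty space in their joint box."""
    x_min = min(box_a[0], box_b[0])
    y_min = min(box_a[1], box_b[1])
    x_max = max(box_a[2], box_b[2])
    y_max = max(box_a[3], box_b[3])
    joint_area = (x_max - x_min) * (y_max - y_min)
    return 1.0 - (joint_area - size_a - size_b) / image_size


# Two touching 10x10 regions in a 100x100 image fit together without gaps.
left, right = (0, 0, 10, 10), (10, 0, 20, 10)
print(shape_similarity(left, right, 100, 100, 10000))
# The same region paired with a distant one leaves a large gap and scores lower.
far = (80, 80, 90, 90)
print(shape_similarity(left, far, 100, 100, 10000))
```

In a full grouping loop, the pair of neighbouring regions with the highest combined similarity would be merged first, and the process repeated until one region remains.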

After region proposal and feature extraction, the features are used as input vectors to train a bounding-box (B-Box) refinement model. Since this is a task of fitting a transform function, a linear regression model can solve it. Suppose $P^i=(P_x^i,P_y^i,P_w^i,P_h^i)$ specifies the pixel coordinates of the center of proposal $P^i$'s bounding box together with its width and height in pixels, and $G=(G_x,G_y,G_w,G_h)$ specifies the ground truth in the same way. We define four transfer functions, one for each property:

$$\overline{G}_x = P_w d_x(P)+P_x$$
$$\overline{G}_y = P_h d_y(P)+P_y$$
$$\overline{G}_w = P_w\exp(d_w(P))$$
$$\overline{G}_h = P_h\exp(d_h(P))$$

The objective function is:

$$w_* = \arg\min_{\overline{w}_*}\sum_i^N\left(t_*^i - \overline{w}_*^{T}\,\mathrm{Pooling}_5(P^i)\right)^2 + \lambda\lVert\overline{w}_*\rVert^2$$

Where $*$ is one of $\{x,y,w,h\}$, $t_*^i$ is the regression target computed from the ground truth by inverting the transfer functions above, $\mathrm{Pooling}_5(P^i)$ is the $pool_5$ feature of $P^i$ in the CNN, and $w_*$ is a vector of learnable model parameters, so that $d_*(P) = w_*^{T}\,\mathrm{Pooling}_5(P)$. The term $\lambda\lVert\overline{w}_*\rVert^2$ is an $L_2$ regularization that avoids overfitting and helps the optimization.
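
To make the regression targets concrete, the sketch below encodes a ground-truth box as targets $t$ relative to a proposal and then applies predicted offsets to a proposal. It follows the parameterisation of the R-CNN paper, in which the x offset is scaled by the proposal width and the y offset by the proposal height. The function names and the example boxes are assumptions made for illustration.

```python
import math


def encode_targets(proposal, ground_truth):
    """Targets t = (t_x, t_y, t_w, t_h) for boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return ((gx - px) / pw,        # centre shift, scale-invariant
            (gy - py) / ph,
            math.log(gw / pw),     # log-space size change
            math.log(gh / ph))


def apply_offsets(proposal, offsets):
    """Apply predicted offsets d = (d_x, d_y, d_w, d_h) to a proposal."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = offsets
    return (pw * dx + px,
            ph * dy + py,
            pw * math.exp(dw),
            ph * math.exp(dh))


proposal = (50.0, 50.0, 20.0, 40.0)
ground_truth = (55.0, 60.0, 30.0, 20.0)
targets = encode_targets(proposal, ground_truth)
# A perfect regressor would output exactly the targets and recover the ground truth.
print(apply_offsets(proposal, targets))
```

Because the two functions are inverses of each other, a regressor that predicts the targets exactly maps the proposal onto the ground-truth box.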

3.2 Spatial Pyramid Pooling Network (SPP Net)

Before introducing Fast R-CNN, we need to look at the spatial pyramid pooling (SPP)[6] structure. R-CNN has to warp every region proposal to a fixed size with anisotropic scaling, and the resulting distorted or incomplete regions can hurt CNN training. SPP was proposed to solve this problem: it pools the feature map at three grid sizes to extract features at different scales and concatenates the results into a fixed-length vector.

As the figure of the SPP structure shows, the feature map is divided into grids of three sizes, 1 ∗ 1, 2 ∗ 2 and 4 ∗ 4, for the pooling process, which gives 1 + 4 + 16 = 21 spatial bins.
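
The following is a minimal sketch of this pooling for a single channel, assuming the feature map is a plain list of rows and that max pooling is used in every bin. The function name `spp_max_pool` is hypothetical. Whatever the input size, the output always has 21 values per channel.

```python
import math


def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """Pool one H x W channel into sum(n * n for n in levels) values."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin boundaries: floor at the start, ceiling at the end,
            # so every bin covers at least one row and one column.
            r0 = math.floor(i * height / n)
            r1 = math.ceil((i + 1) * height / n)
            for j in range(n):
                c0 = math.floor(j * width / n)
                c1 = math.ceil((j + 1) * width / n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


small = [[r * 4 + c for c in range(4)] for r in range(4)]      # 4 x 4 input
large = [[r * 13 + c for c in range(13)] for r in range(9)]    # 9 x 13 input
print(len(spp_max_pool(small)), len(spp_max_pool(large)))      # same length for both
```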

3.3 Fast R-CNN

R-CNN requires a lot of computing resources, since it runs the CNN separately on every ROI. In fact, the computation can be shared: because convolution preserves the spatial layout of the image, we can compute the feature map only once for the whole image and then locate each ROI on the feature map from its position in the input image. This is the main idea of Fast R-CNN[4]. Fig.4 shows the structure of Fast R-CNN.

ROI Pooling

As noted, Fast R-CNN uses the same feature map for all ROIs. Since convolution preserves spatial positions, we can calculate the ROI position on the feature map:

$$\text{Left, Top: } \overline{x} = \lfloor x/S \rfloor + 1$$
$$\text{Right, Bottom: } \overline{x} = \lceil x/S \rceil - 1$$

Where $S$ is the product of all strides in the CNN up to that feature map.
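
As a small worked example of this projection, the sketch below maps an ROI from image coordinates to feature-map coordinates. The function name and the total stride of 16 are assumptions chosen for illustration; the actual value of $S$ depends on the backbone network.

```python
import math


def project_roi(roi, stride):
    """Map an image-space ROI (x0, y0, x1, y1) onto the feature map."""
    x0, y0, x1, y1 = roi
    left = math.floor(x0 / stride) + 1
    top = math.floor(y0 / stride) + 1
    right = math.ceil(x1 / stride) - 1
    bottom = math.ceil(y1 / stride) - 1
    return left, top, right, bottom


# An ROI in a network whose total stride S is assumed to be 16.
print(project_roi((32, 48, 200, 170), stride=16))
```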

To standardize the size of different ROIs, Fast R-CNN uses a special case of the SPP structure: ROI pooling uses only a single grid size, $7*7$, for the pooling process.
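
A minimal single-channel sketch of ROI pooling is given below. It assumes the ROI is already expressed in feature-map coordinates with inclusive bounds, and that each of the 7 ∗ 7 bins is max-pooled; the function names are hypothetical.

```python
def _ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_size=7):
    """Max-pool an ROI (left, top, right, bottom; inclusive) to out_size x out_size."""
    left, top, right, bottom = roi
    height = bottom - top + 1
    width = right - left + 1
    pooled = []
    for i in range(out_size):
        # Rows covered by bin i; neighbouring bins may overlap for small ROIs.
        r0 = top + (i * height) // out_size
        r1 = top + _ceil_div((i + 1) * height, out_size)
        row = []
        for j in range(out_size):
            c0 = left + (j * width) // out_size
            c1 = left + _ceil_div((j + 1) * width, out_size)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature_map = [[r * 20 + c for c in range(20)] for r in range(20)]
pooled = roi_max_pool(feature_map, roi=(3, 4, 12, 10))
print(len(pooled), len(pooled[0]))  # fixed output size regardless of ROI size
```

Because the output size is fixed, ROIs of any shape can be fed to the same fully connected layers.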

Multi-task Loss

Instead of using SVMs for classification, Fast R-CNN sets up a multi-task network: one branch performs classification with a softmax function, and the other branch uses linear regression for B-Box refinement. Let $p = (p_0,p_1,…,p_K)$ be the classification output (including background) and $t^u = (t^u_x,t^u_y,t^u_w,t^u_h)$ the B-Box refinement output. We define the loss functions $L_{cls}, L_{loc}$ as follows:

$$L_{cls}(p,u)=-\log p_u$$
$$L_{loc}(t^u,v)=\sum_{i\in\{x,y,w,h\}}\mathrm{smooth}_{L_1}(t^u_i-v_i)$$
$$\mathrm{smooth}_{L_1}(x)= \begin{cases}
0.5x^2&|x|<1\\
|x|-0.5 &\mbox{otherwise}
\end{cases}$$

Where $u$ is the true class, and $v$ is the ground truth for bounding box.

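Thus the final loss is:
$$
L_{\text{f-rcnn}} = L_{cls} + \lambda [u \ge 1] L_{loc}
$$
where the indicator $[u \ge 1]$ is 1 for object classes and 0 for the background class $u = 0$, which has no ground-truth box, and $\lambda$ balances the two tasks.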
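
The loss for a single ROI can be sketched as follows. The function names are hypothetical, class index 0 is assumed to be background (following the convention of the Fast R-CNN paper, which skips the localization term for background ROIs), and `lam=1.0` is an assumed default for the balancing weight.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for large errors (less sensitive to outliers)."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def fast_rcnn_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Multi-task loss for one ROI; class 0 is assumed to be background."""
    loss_cls = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background ROIs have no ground-truth box, so only classify them.
        return loss_cls
    loss_loc = sum(smooth_l1(t - v) for t, v in zip(pred_box, target_box))
    return loss_cls + lam * loss_loc


probs = [0.1, 0.9]                       # softmax output: background, object
print(fast_rcnn_loss(probs, 1, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0)))
```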

3.4 Faster R-CNN

SPPnet and Fast R-CNN run much faster than R-CNN, but they still rely on selective search to get region proposals. Can we get the ROIs from a neural network too? Faster R-CNN[10] answers yes: it proposes a region proposal network (RPN) that shares full-image CNN features with the detection network, which makes region proposal nearly cost-free.

Region Proposal Network

First, the region proposal network (RPN) uses an anchor system, shown in the figure below. It places 9 anchors at each position of the W∗H feature map produced by the CNN; each anchor is a rectangle with a scale in {8, 16, 32} and an aspect ratio in {0.5, 1, 2}. The RPN has two branches, one for classification and the other for anchor refinement.
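
The anchor layout can be sketched as below. This is an illustration under stated assumptions, not the original implementation: the base size and stride of 16 pixels, the interpretation of the ratio as height divided by width, and the function names are all assumed here.

```python
import math


def base_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1, 2)):
    """Return 9 (width, height) pairs; ratio is assumed to mean height / width."""
    shapes = []
    for ratio in ratios:
        for scale in scales:
            side = base_size * scale           # side of the square anchor
            width = side / math.sqrt(ratio)    # keep the area, change the shape
            height = side * math.sqrt(ratio)
            shapes.append((width, height))
    return shapes


def grid_anchors(feat_w, feat_h, stride=16):
    """Place the 9 anchor shapes at the centre of every feature-map position."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for width, height in base_anchors():
                anchors.append((cx - width / 2, cy - height / 2,
                                cx + width / 2, cy + height / 2))
    return anchors


print(len(grid_anchors(feat_w=4, feat_h=3)))  # 9 * 4 * 3 anchors
```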

First, the RPN applies a 3∗3 convolution to integrate regional features. The classification branch uses a 1∗1 convolution with 18 output channels to reduce the input to (W, H, 18) and reshapes it to (2, 9∗W∗H), where 9∗W∗H is the total number of anchors (9 per position) and 2 stands for the two classes, with or without an object. The RPN then applies softmax for classification and reshapes the result back to (W, H, 18) for the region output. The anchor refinement branch uses a 1∗1 convolution with 36 output channels to reduce the dimension to (W, H, 4∗9), that is, four box parameters for each of the 9 anchors at every position. Like the B-Box refinement model in R-CNN and Fast R-CNN, it uses linear regression.

Training

1. Train the RPN, initialized from a model pre-trained on ImageNet.
2. Collect proposals using the RPN trained in step 1.
3. Train the Fast R-CNN network using the RPN proposals from step 2; Fast R-CNN is also initialized from an ImageNet pre-trained model.
4. Set the learning rate of the shared convolution layers to 0 and fine-tune the RPN.
5. Use the RPN trained in step 4 to collect region proposals.
6. Fine-tune Fast R-CNN.

3.5 You Only Look Once (YOLO)

You Only Look Once (YOLO)[9] is a one-stage model for object detection. It frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.

Anchor System

YOLO uses a CNN structure based on GoogLeNet and sets up a grid-based anchor system for detection and classification. Since the convolution operation can be thought of as a sliding window, the CNN output corresponds spatially to the input image. Suppose the CNN divides the image into S∗S grid cells: the cell that contains the center of an object is responsible for detecting it, and the box size is a regression output. The class-specific confidence score of each box is:

$$\Pr(\mathrm{class}_i \mid \mathrm{object}) \cdot \Pr(\mathrm{object}) \cdot \mathrm{IOU}^{truth}_{pred} = \Pr(\mathrm{class}_i) \cdot \mathrm{IOU}^{truth}_{pred}$$

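Where $\Pr(\mathrm{object})$ is the probability that the grid cell contains an object, $\Pr(\mathrm{class}_i \mid \mathrm{object})$ is the class probability conditioned on an object being present, and IOU (intersection over union) measures how well the predicted B-Box matches the ground truth: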
$$
IOU(A,B) = \frac{|A\cap B|}{|A\cup B|}
$$
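
A minimal sketch of the IOU computation for axis-aligned boxes follows. The `(x_min, y_min, x_max, y_max)` box format and the function name are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = ((ax1 - ax0) * (ay1 - ay0)
             + (bx1 - bx0) * (by1 - by0)
             - intersection)
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # overlap of 1 over a union of 7
```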

Multi-task Loss

The output of YOLO is a 7∗7∗30 tensor, where 7∗7 is the grid size and 30 is the size of the prediction vector for each cell: 20 dimensions for the class probabilities $Pr(class_i|object)$, 2 dimensions for the object confidences $Pr(object)$ (one per box), and 8 dimensions for the two B-Boxes (2 ∗ (x, y, w, h)).
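
The sketch below decodes the 30-dimensional vector of one grid cell. The memory layout used here (two groups of x, y, w, h, confidence followed by the 20 class probabilities) is an assumption made for illustration and may differ from the layout used by an actual implementation; the function name is hypothetical.

```python
S, B, NUM_CLASSES = 7, 2, 20
CELL_SIZE = B * 5 + NUM_CLASSES          # 2 * (x, y, w, h, confidence) + 20 classes


def decode_cell(cell):
    """Split one cell vector into boxes with class-specific confidence scores."""
    assert len(cell) == CELL_SIZE
    class_probs = cell[B * 5:]
    best_class = max(range(NUM_CLASSES), key=lambda c: class_probs[c])
    boxes = []
    for b in range(B):
        x, y, w, h, confidence = cell[b * 5:(b + 1) * 5]
        boxes.append({
            "box": (x, y, w, h),
            "class": best_class,
            # Class probability times box confidence, as in the score formula.
            "score": class_probs[best_class] * confidence,
        })
    return boxes


cell = [0.5, 0.5, 0.2, 0.3, 0.8,         # box 1
        0.1, 0.1, 0.4, 0.4, 0.1]         # box 2
cell += [0.0] * NUM_CLASSES
cell[B * 5 + 3] = 0.5                    # class 3 is the most likely class
print(S * S * CELL_SIZE, decode_cell(cell)[0]["score"])
```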

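$$\begin{aligned}
L_{yolo} ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^B 1_{ij}^{obj}\left[(x_i-\overline{x}_i)^2+(y_i-\overline{y}_i)^2\right]\\
& + \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^B 1_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\overline{w}_i})^2+(\sqrt{h_i}-\sqrt{\overline{h}_i})^2\right]\\
& + \sum_{i=0}^{S^2}\sum_{j=0}^B 1_{ij}^{obj}(C_i-\overline{C}_i)^2\\
& + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^B 1_{ij}^{noobj}(C_i-\overline{C}_i)^2\\
& + \sum_{i=0}^{S^2} 1_{i}^{obj}\sum_{c\in \mathrm{classes}}(p_i(c)-\overline{p}_i(c))^2
\end{aligned}$$

Where $S*S$ is the grid size, $B$ is the number of boxes predicted per grid cell, $1_{ij}^{obj}$ is 1 if the $j$-th box of cell $i$ is responsible for an object and 0 otherwise ($1_{ij}^{noobj}$ is its complement), $1_{i}^{obj}$ is 1 if an object appears in cell $i$, $C_i$ is the confidence that an object exists, $p_i(c)$ is the probability of class $c$, and $\lambda_{coord}$ and $\lambda_{noobj}$ weight the coordinate terms and the no-object confidence term. Overlined symbols denote the network's predictions.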
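
The square roots in the size term make the same absolute error cost more for a small box than for a large one. The short sketch below illustrates this with hypothetical box sizes given as fractions of the image; the function name is an assumption.

```python
import math


def size_loss(w, h, w_pred, h_pred):
    """Width/height term of the loss for one box, using square roots."""
    return ((math.sqrt(w) - math.sqrt(w_pred)) ** 2
            + (math.sqrt(h) - math.sqrt(h_pred)) ** 2)


# Both predictions are off by 0.05 in width and height.
small_box = size_loss(0.10, 0.10, 0.15, 0.15)
large_box = size_loss(0.80, 0.80, 0.85, 0.85)
print(small_box > large_box)   # the small box is penalised more
```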

3.6 Comparison

4. References

[2] Object detection. https://en.wikipedia.org/wiki/Object_detection. Accessed: 2019-08-24.

[3] Yolo. https://github.com/pjreddie/darknet. Accessed: 2019-08-24.

[4] Ross Girshick. Fast R-CNN. ICCV, 2015.

[5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. TPAMI, 2015.

[7] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-Excitation Networks. CVPR, 2018.

[8] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective Search for Object Recognition. IJCV, 2013.

[9] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. CVPR, 2016.

[10] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS, 2015.