Keyword [YOLOv1]

Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 779-788.

1. Overview

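This paper proposes YOLO, a unified, real-time object detector that predicts bounding boxes and class probabilities from the full image in a single network evaluation.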

1.1. Network

1) Input: $448 \times 448$.
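2) Output: $S \times S \times (B \times 5 + CLS) = 7 \times 7 \times 30$. ($S=7$, $B=2$, $CLS=20$).
3) The first 20 Conv layers are pretrained on ImageNet; 4 Conv + 2 FC layers with randomly initialized weights are then added for detection.
4) Fast YOLO has only 9 Conv layers (instead of 24) and fewer filters in those layers.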
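
As a quick check of the output size in item 2, the per-cell depth can be computed from $S$, $B$ and $CLS$. This is a minimal sketch; the function name `yolo_output_shape` is hypothetical and not taken from the paper.

```python
def yolo_output_shape(S=7, B=2, CLS=20):
    """Return the shape of the prediction tensor: S x S x (B*5 + CLS)."""
    # Each of the B boxes carries (x, y, w, h, C); the CLS class
    # probabilities are predicted once per cell, not once per box.
    per_cell = B * 5 + CLS
    return (S, S, per_cell)


shape = yolo_output_shape()
num_values = shape[0] * shape[1] * shape[2]
print(shape, num_values)  # (7, 7, 30) 1470
```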

1.2. Definition

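1) Split the image into an $S \times S$ grid of Grid Cells. The cell that contains the center of an object is responsible for detecting that object.
2) Each Grid Cell predicts $B$ Bounding Boxes.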
3) Each Bounding Box needs to predict $(x, y, w, h, C)$.

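$C$. Confidence: $Pr(obj)*IOU_{pred}^{truth}$, predicted directly. (If no obj exists in the cell, the target is $C=0$; otherwise the target is the IOU between the predicted box and the GT.)
$(x, y)$. Center of the box relative to the bounds of the Grid Cell ($\in [0, 1]$).
$(w, h)$. Width and height of the box relative to the whole image ($\in [0, 1]$).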
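
The parameterization of $(x, y, w, h)$ can be illustrated by encoding a ground-truth box given in pixels. This is a sketch under assumptions: the helper `encode_box` and its argument names are hypothetical, and the clamp for centers that fall exactly on the right or bottom image edge is a choice made here, not something the paper specifies.

```python
def encode_box(cx, cy, bw, bh, img_w, img_h, S=7):
    """Map a pixel-space box (center cx, cy; size bw, bh) to YOLO-style targets.

    Returns (row, col, x, y, w, h): the grid cell that owns the box center,
    the center offset inside that cell, and the size relative to the image.
    """
    # Center in grid units, e.g. 3.5 means halfway through column 3.
    gx = cx * S / img_w
    gy = cy * S / img_h
    # Clamp so that a center exactly on the right/bottom edge stays in range
    # (an assumption of this sketch).
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    x = gx - col           # offset inside the cell, in [0, 1]
    y = gy - row
    w = bw / img_w         # size relative to the whole image, in [0, 1]
    h = bh / img_h
    return row, col, x, y, w, h


target = encode_box(cx=224, cy=160, bw=112, bh=224, img_w=448, img_h=448)
print(target)  # (2, 3, 0.5, 0.5, 0.25, 0.5)
```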

4) Each Grid Cell predicts $CLS$ conditional class probabilities $Pr(Class_i|obj)$, one set per cell regardless of $B$.
5) At test time, the class-specific confidence score for each box is $Pr(Class_i|obj)*Pr(obj)*IOU_{pred}^{truth}=Pr(Class_i)*IOU_{pred}^{truth}$.
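
The test-time product in item 5 can be sketched for a single Grid Cell as follows. The function name `class_scores` is hypothetical, and the example uses 3 classes instead of 20 only to keep the numbers readable.

```python
def class_scores(box_confidences, class_probs):
    """Multiply each box confidence by the cell's conditional class probabilities.

    box_confidences: B predicted confidences, each approximating Pr(obj) * IOU.
    class_probs:     CLS values of Pr(Class_i | obj), shared by all boxes in the cell.
    Returns a B x CLS nested list of class-specific confidence scores.
    """
    return [[c * p for p in class_probs] for c in box_confidences]


# One cell with B = 2 boxes and 3 classes.
scores = class_scores([0.8, 0.1], [0.5, 0.25, 0.25])

# Index of the box and class with the highest class-specific score.
best_box = max(range(len(scores)), key=lambda b: max(scores[b]))
best_cls = max(range(len(scores[best_box])), key=lambda k: scores[best_box][k])
print(scores, best_box, best_cls)
```

Because the class probabilities are shared within a cell, both boxes rank the classes in the same order; only the box confidence changes the magnitude of the scores.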

1.3. Loss Function

1) Increase the weight of the BBox coordinate loss ($\lambda_{coord}=5$). (Plain sum-squared error weights localization and classification error equally, which is not ideal.)
2) Decrease the weight of the Confidence loss for boxes that contain no obj ($\lambda_{noobj}=0.5$). (Most grid cells contain no obj; pushing their confidence toward 0 can overpower the gradient from cells that do contain obj and make training diverge early.)
3) Predict the square root of $(w, h)$ instead of $(w, h)$ directly. (Small deviations in large boxes should matter less than in small boxes.)
4) At training time, one predictor is assigned to be ‘responsible’ for each object: the one whose prediction currently has the highest IOU with the GT.
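
The four points above can be combined into a per-cell loss. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names (`to_image_box`, `iou`, `cell_loss`) are hypothetical, predicted widths and heights are assumed to be non-negative, and each cell is assumed to hold at most one ground-truth object.

```python
import math

LAMBDA_COORD = 5.0   # weight of the coordinate terms
LAMBDA_NOOBJ = 0.5   # weight of the no-object confidence term


def to_image_box(box, row, col, S=7):
    """Convert (x, y, w, h) with cell-relative x, y into image-relative (cx, cy, w, h)."""
    x, y, w, h = box
    return ((col + x) / S, (row + y) / S, w, h)


def iou(a, b):
    """IoU of two boxes given as image-relative (cx, cy, w, h)."""
    iw = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    ih = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    inter = max(0.0, iw) * max(0.0, ih)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def cell_loss(preds, pred_cls, row, col, truth=None, truth_cls=None, S=7):
    """Sum-squared loss of one grid cell.

    preds:     B predictions (x, y, w, h, C), with w, h >= 0.
    pred_cls:  CLS predicted values of Pr(Class_i | obj).
    truth:     ground-truth (x, y, w, h), or None if no object center lies in the cell.
    truth_cls: one-hot class vector of the ground truth.
    """
    if truth is None:
        # Empty cell: only confidences are penalized, with target 0 and reduced weight.
        return LAMBDA_NOOBJ * sum(p[4] ** 2 for p in preds)

    gt = to_image_box(truth, row, col, S)
    ious = [iou(to_image_box(p[:4], row, col, S), gt) for p in preds]
    j = max(range(len(preds)), key=lambda k: ious[k])  # the 'responsible' predictor

    x, y, w, h, c = preds[j]
    tx, ty, tw, th = truth
    # Coordinate terms, with square roots applied to width and height.
    loss = LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
    loss += LAMBDA_COORD * ((math.sqrt(w) - math.sqrt(tw)) ** 2
                            + (math.sqrt(h) - math.sqrt(th)) ** 2)
    # Confidence of the responsible predictor: the target is its IoU with the GT.
    loss += (c - ious[j]) ** 2
    # Confidence of the other predictors: target 0, reduced weight.
    loss += LAMBDA_NOOBJ * sum(p[4] ** 2 for k, p in enumerate(preds) if k != j)
    # Class probabilities are only penalized when the cell contains an object.
    loss += sum((pc - tc) ** 2 for pc, tc in zip(pred_cls, truth_cls))
    return loss


truth = (0.5, 0.5, 0.4, 0.4)
preds = [(0.5, 0.5, 0.4, 0.4, 1.0), (0.2, 0.2, 0.1, 0.1, 0.0)]
print(cell_loss(preds, [1.0, 0.0], row=3, col=3, truth=truth, truth_cls=[1.0, 0.0]))
print(cell_loss(preds, [1.0, 0.0], row=3, col=3))
```

In the first call the responsible predictor matches the GT, so the loss is close to 0. In the second call the cell is treated as empty, so only the confidence of 1.0 is penalized, at half weight.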

1.4. Limitations of YOLO

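1) Each cell predicts only $B$ boxes and can only have one class, so the model struggles with small obj that appear in groups, such as flocks of birds.
2) It uses relatively coarse features for predicting BBox (the input is downsampled multiple times).
3) The loss function treats errors the same in small BBox versus large BBox, although a small error in a small box has a much greater effect on IOU.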
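
Limitation 3 can be made concrete with a small numerical illustration: the same center offset gives the same squared coordinate error for any box size, yet the resulting IOU differs strongly. The box sizes and the offset below are arbitrary values chosen for this sketch, and the helper names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (cx, cy, w, h) in image-relative units."""
    iw = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    ih = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    inter = max(0.0, iw) * max(0.0, ih)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(size, shift):
    """IoU between a square box of the given size and a copy shifted horizontally."""
    box = (0.5, 0.5, size, size)
    moved = (0.5 + shift, 0.5, size, size)
    return iou(box, moved)


shift = 0.02                       # identical center error for both boxes
squared_error = shift ** 2         # identical contribution to the coordinate loss
small = shifted_iou(0.05, shift)   # small box: 5% of the image side
large = shifted_iou(0.5, shift)    # large box: 50% of the image side
print(f"squared error={squared_error:.4f}  IoU small={small:.2f}  IoU large={large:.2f}")
```

With these values the IOU drops to roughly 0.43 for the small box but stays near 0.92 for the large one, even though both incur the same coordinate penalty.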