(CVPR 2017) YOLO9000: Better, Faster, Stronger

Keyword [YOLOv2] [YOLO9000]

Redmon J, Farhadi A. YOLO9000: better, faster, stronger[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 7263-7271.

1. Overview

This paper proposes YOLOv2 (and its jointly trained variant, YOLO9000) for object detection.

  • Better: BN, High Resolution Classifier, Conv with anchor box (instead of grid cell), Multi-Scale Training.
  • Faster: Darknet-19.
  • Stronger: Jointly training on classification (ImageNet) and detection (COCO) data.

2. Better

2.1. Batch Normalization

Add BN to every Conv layer in YOLO, which makes it possible to remove dropout. (+2% mAP)

2.2. High Resolution Classifier

1) Fine-tune the classification network at $448\times 448$ resolution.
2) Then fine-tune the resulting network on detection. (+4% mAP)

2.3. Conv with Anchor Box

1) Remove the FC layers and one pooling layer from YOLO, so that the network downsamples by a factor of 32.
2) Use anchor boxes to predict bounding boxes.
3) Change the input size from $448 \times 448$ to $416 \times 416$, which gives a $13 \times 13$ feature map with a single center cell.
4) Predict the class ($Pr(class_i|obj)$) and the objectness ($IOU_{pred}^{truth}$) for every anchor box.
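The arithmetic behind the input-size change can be checked with a few lines. This is a minimal sketch; `grid_size` is a hypothetical helper name, and the stride of 32 is the downsampling factor stated above.

```python
def grid_size(input_size: int, stride: int = 32) -> int:
    """Side length of the output feature map for a square input."""
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    return input_size // stride


for size in (448, 416):
    g = grid_size(size)
    parity = "odd" if g % 2 else "even"
    print(f"{size}x{size} input -> {g}x{g} grid ({parity})")
```

An odd grid side (13) has exactly one center cell, whereas an even side (14) places the image center on the corner shared by four cells.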

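Using anchor boxes raises two issues, which are solved as follows:
1) Hand-picked box dimensions are a poor starting point. Run k-means clustering on the training-set bounding boxes to select good priors for the box dimensions.
2) Anchor boxes make training unstable, especially early on. So predict the location relative to the grid cell, constraining the box center to its cell so that it cannot be offset to an arbitrary location in the image.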
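The clustering step can be sketched as follows. The paper clusters box sizes with the distance $d = 1 - IOU(box, centroid)$ rather than Euclidean distance, so that large boxes do not dominate the error. The function names, the toy box list, and the mean-based centroid update are assumptions made for illustration, not the authors' code.

```python
import random


def iou_wh(box, centroid):
    """IOU of two (w, h) boxes, assuming they share the same center."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_priors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with distance 1 - IOU and return k prior sizes."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest distance 1 - IOU is the same as largest IOU.
            best = max(range(k), key=lambda j: iou_wh(box, centroids[j]))
            clusters[best].append(box)
        new_centroids = []
        for j, members in enumerate(clusters):
            if not members:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[j])
                continue
            w = sum(m[0] for m in members) / len(members)
            h = sum(m[1] for m in members) / len(members)
            new_centroids.append((w, h))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


# Toy (w, h) data: three small, roughly square boxes and three tall ones.
boxes = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (4.0, 8.0), (4.2, 7.5), (3.8, 8.4)]
priors = kmeans_priors(boxes, k=2)
print(priors)  # one small prior and one tall prior
```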

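The network predicts $(t_x, t_y, t_w, t_h, t_o)$ for each box, decoded as $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$, and $Pr(obj) \cdot IOU(b, obj) = \sigma(t_o)$, where $(c_x, c_y)$ is the cell's offset from the top-left corner of the image and $(p_w, p_h)$ is the size of the prior.

1) $\sigma$ is the sigmoid function, which keeps the predicted center inside its cell.
2) Predict 5 BBoxes at each cell, where 5 is the number of priors chosen by k-means clustering. (+5% mAP)
3) Each BBox has 5 values (4 coordinates and 1 objectness score) and 20 class scores.
4) So the output layer has $5 \times (5 + 20) = 125$ filters, i.e. 125 values per cell.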
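A minimal sketch of how one raw prediction could be decoded under the paper's box parameterization, together with the filter count. The function names and the example cell and prior values are hypothetical; coordinates are in grid-cell units.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Map raw outputs (tx, ty, tw, th, to) to (bx, by, bw, bh, objectness)."""
    tx, ty, tw, th, to = t
    cx, cy = cell    # column and row index of the grid cell
    pw, ph = prior   # prior (anchor) width and height
    bx = sigmoid(tx) + cx    # center x stays inside the cell
    by = sigmoid(ty) + cy    # center y stays inside the cell
    bw = pw * math.exp(tw)   # width scales the prior
    bh = ph * math.exp(th)   # height scales the prior
    return bx, by, bw, bh, sigmoid(to)


def num_output_filters(num_anchors: int = 5, num_classes: int = 20) -> int:
    """Each anchor predicts 4 coordinates, 1 objectness score, and the class scores."""
    return num_anchors * (5 + num_classes)


box = decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(6, 6), prior=(3.0, 2.0))
print(box)                   # (6.5, 6.5, 3.0, 2.0, 0.5)
print(num_output_filters())  # 125
```

All-zero outputs place the box at the center of its cell with exactly the prior's size, which is why well-chosen priors give the network a good starting point.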

2.4. Multi-Scale Training

1) Every 10 batches, randomly choose a new input image size from {320,352,…,608}.
2) Because the same network predicts well across a variety of input sizes, it is easy to trade off speed against accuracy.
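The resizing schedule can be sketched as a generator. This assumes the listed sizes are the multiples of 32 from 320 to 608 (32 being the network's downsampling factor); `size_schedule` and its parameters are hypothetical names, and the actual image resizing and network update are left out.

```python
import random

SIZES = list(range(320, 608 + 1, 32))  # 320, 352, ..., 608


def size_schedule(num_batches: int, every: int = 10, seed: int = 0):
    """Yield the input size for each batch, re-drawn every `every` batches."""
    rng = random.Random(seed)
    size = rng.choice(SIZES)
    for batch_idx in range(num_batches):
        if batch_idx > 0 and batch_idx % every == 0:
            size = rng.choice(SIZES)
        yield size


schedule = list(size_schedule(30))
for start in range(0, 30, 10):
    s = schedule[start]
    print(f"batches {start}-{start + 9}: {s}x{s} input, {s // 32}x{s // 32} grid")
```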

2.5. Comparison

3. Faster

Darknet-19 has 19 convolutional layers and 5 max-pooling layers.
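The layer counts can be checked against a layer list. The list below is a sketch of the classification network recalled from the paper's Darknet-19 table (alternating $3 \times 3$ and $1 \times 1$ convolutions, doubling the filters after each pooling step); treat the exact filter numbers as an assumption and verify them against the paper before relying on them.

```python
# ("conv", filters, kernel size); "M" marks a 2x2 max-pool with stride 2.
DARKNET19 = [
    ("conv", 32, 3), "M",
    ("conv", 64, 3), "M",
    ("conv", 128, 3), ("conv", 64, 1), ("conv", 128, 3), "M",
    ("conv", 256, 3), ("conv", 128, 1), ("conv", 256, 3), "M",
    ("conv", 512, 3), ("conv", 256, 1), ("conv", 512, 3),
    ("conv", 256, 1), ("conv", 512, 3), "M",
    ("conv", 1024, 3), ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 1000, 1),  # 1000-way head, followed by global average pooling and softmax
]

num_conv = sum(1 for layer in DARKNET19 if layer != "M")
num_pool = DARKNET19.count("M")
print(num_conv, num_pool, 2 ** num_pool)  # 19 5 32
```

The five stride-2 pooling layers account for the total downsampling factor of 32.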

4. Stronger

1) Train jointly on classification and detection data.
2) Use hierarchical classification (WordTree) to handle categories that are not mutually exclusive across the two datasets.

3) For classification purposes, assume that the image contains an object, i.e. $Pr(obj)=1$. For a classification image, simply find the BBox that predicts the highest probability for that class and compute the loss on just its predicted tree.
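In the hierarchical scheme, each node predicts a probability conditioned on its parent, and the absolute probability of a class is the product of the conditionals along the path to the root. A minimal sketch, in which the tree, the probability values, and the function name are all hypothetical:

```python
# Toy tree as child -> parent; the root is "physical object".
PARENT = {
    "animal": "physical object",
    "dog": "animal",
    "terrier": "dog",
    "Norfolk terrier": "terrier",
}

# Made-up conditional probabilities Pr(node | parent).
COND = {"animal": 0.9, "dog": 0.8, "terrier": 0.5, "Norfolk terrier": 0.6}


def absolute_prob(node: str, p_root: float = 1.0) -> float:
    """Multiply conditionals from `node` up to the root.

    p_root plays the role of Pr(obj): 1.0 for a classification image,
    the predicted objectness for a detection.
    """
    prob = p_root
    while node in PARENT:
        prob *= COND[node]
        node = PARENT[node]
    return prob


p = absolute_prob("Norfolk terrier")
print(round(p, 3))  # 0.216
```

With this formulation, a dog-breed label from ImageNet and a generic "dog" label from COCO sit on the same path of the tree rather than competing as exclusive classes.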