(ECCV 2018) Group normalization

Keyword [Group Normalization]

Wu Y, He K. Group normalization[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 3-19.

1. Overview

1.1. Motivation

  • Normalizing along the batch dimension causes problems: as the batch size gets smaller, BN’s error increases rapidly because the batch statistics are estimated inaccurately
  • BN helps convergence (the stochastic uncertainty of the batch statistics acts as a regularizer, which benefits generalization), but it performs worse with small batches
  • Classical features such as SIFT and HOG are group-wise features that involve group-wise normalization
  • BN’s statistics are computed on each GPU separately and are not synchronized across GPUs

This paper proposes Group Normalization (GN), which

  • is independent of the batch size
  • divides the channels into groups and computes μ, σ within each group

1.2. Related Work

1.2.1. Normalization

  • Local Response Normalization (LRN). computes the statistics in a small neighbourhood around each pixel
  • BN
  • Layer Normalization (LN)
  • Instance Normalization (IN)
  • Weight Normalization (WN)
    LN, IN, WN. independent of the batch dimension.
    LN, IN. successful in training RNN and GAN models.

1.2.2. Addressing Small Batch

  • Batch Renormalization (BR). two extra parameters constrain the estimated μ, σ of BN
  • Synchronized BN. μ, σ computed across multiple GPUs
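
The idea behind Synchronized BN can be sketched as follows: each device shares only its element count, sum, and sum of squares per channel, and the global μ and variance are recovered from those. This is a minimal NumPy illustration under assumed names (`sync_batch_stats`, and a list of arrays standing in for GPUs); it is not the implementation of any particular framework.

```python
import numpy as np

def sync_batch_stats(per_device_batches):
    """Combine per-device statistics into global per-channel mean and variance.

    per_device_batches: list of arrays shaped (N_d, C, H, W), one per
    simulated device (a stand-in for real GPUs, which is an assumption here).
    """
    count = 0
    total = 0.0
    total_sq = 0.0
    for x in per_device_batches:
        # Each device only has to communicate these three small quantities.
        count += x.shape[0] * x.shape[2] * x.shape[3]
        total = total + x.sum(axis=(0, 2, 3))
        total_sq = total_sq + (x ** 2).sum(axis=(0, 2, 3))
    mean = total / count
    var = total_sq / count - mean ** 2  # biased variance: E[x^2] - E[x]^2
    return mean, var

rng = np.random.default_rng(0)
shards = [rng.normal(size=(2, 3, 4, 4)) for _ in range(4)]  # 4 simulated GPUs
mean, var = sync_batch_stats(shards)

# Reference: statistics of the full batch gathered on one device.
full = np.concatenate(shards, axis=0)
```

The aggregated `mean` and `var` should match the statistics of `full` computed over the (N, H, W) axes, which is what makes the effective batch size the global one rather than the per-GPU one.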

1.2.3. Group-wise Computation

  • group convolution. AlexNet
  • ResNeXt
  • depth-wise convolution. MobileNet, Xception
  • ShuffleNet

1.3. Dataset

  • ImageNet. Classification
  • COCO. Object Detection, Segmentation
  • Kinetics. Video Classification

1.4. Group Normalization

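Relations within a group (factors that can make the channels of one group related)

  1. horizontal flipping (a filter and its mirrored copy)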
  2. orientation
  3. frequency
  4. shape
  5. illumination
  6. texture

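  • general formulation. x̂_i = (x_i − μ_i) / σ_i, with μ_i and σ_i the mean and standard deviation over a set of positions S_i; a per-channel scale γ and shift β are then applied. The methods differ only in how S_i is chosen

  • BN (along NHW, per channel)

  • LN (along CHW, per sample)

  • GN (along HW and the C/G channels of one group, per sample)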

G. number of groups; C/G. channels per group

  • (G=1) → LN (assumes all channels in a layer make similar contributions, which is more restrictive than GN)
  • (G=C) → IN (does not exploit channel dependence)
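
The formulation and the two extreme cases can be checked with a short NumPy sketch. It assumes an (N, C, H, W) layout and ε = 1e-5; the helper names `group_norm`, `layer_norm`, and `instance_norm` are chosen for this illustration. LN is computed here over (C, H, W) per sample and IN over (H, W) per sample and channel.

```python
import numpy as np

def group_norm(x, num_groups, gamma=None, beta=None, eps=1e-5):
    """Group Normalization for an array shaped (N, C, H, W)."""
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("C must be divisible by the number of groups")
    # Expose the group axis: each group holds C/G channels.
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    # Statistics over the channels of one group and all spatial positions.
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    out = ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    # Optional per-channel scale and shift.
    if gamma is not None:
        out = out * gamma.reshape(1, c, 1, 1)
    if beta is not None:
        out = out + beta.reshape(1, c, 1, 1)
    return out

def layer_norm(x, eps=1e-5):
    """Reference LN: statistics over (C, H, W) for each sample."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    """Reference IN: statistics over (H, W) for each sample and channel."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))  # N=2, C=8, H=W=4

same_as_ln = np.allclose(group_norm(x, 1), layer_norm(x))     # G = 1
same_as_in = np.allclose(group_norm(x, 8), instance_norm(x))  # G = C
```

Nothing in `group_norm` reduces over the batch axis, so the output for one sample does not depend on the other samples in the batch.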

1.5. Future Works

  • investigate GN in reinforcement learning (RL)

2. Experiments

2.1. Ablation Study

2.1.1. Batch Size (Classification)

  • Linear Learning Rate Scaling Rule. LR 0.1 for batch size 32, LR 0.1N/32 for batch size N.
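
As a small worked example of the scaling rule, the sketch below computes 0.1N/32 for a few batch sizes. The function name `scaled_learning_rate` and the list of batch sizes are illustrative assumptions, not values taken from these notes.

```python
def scaled_learning_rate(batch_size, base_lr=0.1, base_batch_size=32):
    """Linear scaling rule: the learning rate is proportional to batch size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return base_lr * batch_size / base_batch_size

# Learning rates for a few illustrative batch sizes.
rates = {n: scaled_learning_rate(n) for n in (32, 16, 8, 4, 2)}
```

Halving the batch size halves the learning rate, so a batch size of 16 gives 0.05 and a batch size of 2 gives 0.00625.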

2.1.2. Batch Size (Video Classification)

2.1.3. Group & Channel Number

2.1.4. Distribution

2.2. Comparison

2.2.1. Classification

2.2.2. Detection & Segmentation

  • replace BN* (frozen BN) with GN during fine-tuning; a weight decay of 0 for γ and β is important for good detection results
  • RoIs in a batch are sampled from the same image, so their distribution is not i.i.d., which degrades BN’s estimation of batch statistics
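
The weight-decay note in the first bullet amounts to treating γ and β differently from the other parameters. The sketch below shows one way to express that, assuming a hypothetical naming convention in which normalization parameters end in `.gamma` or `.beta`; the parameter names and the decay value 1e-4 are illustrative only.

```python
def split_weight_decay(param_names, weight_decay):
    """Map each parameter name to its weight decay; γ and β get zero decay.

    Assumes (hypothetically) that normalization scale and shift parameters
    are named with a trailing ".gamma" or ".beta".
    """
    decay = {}
    for name in param_names:
        is_norm_param = name.endswith((".gamma", ".beta"))
        decay[name] = 0.0 if is_norm_param else weight_decay
    return decay

# Hypothetical parameter names for illustration.
names = ["conv1.weight", "gn1.gamma", "gn1.beta", "fc.weight"]
decay = split_weight_decay(names, weight_decay=1e-4)
```

In a real training setup the same split would typically be passed to the optimizer as two parameter groups, one with the regular decay and one with a decay of 0.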

2.2.3. Video Classification