Keyword [Group Normalization]
Wu Y, He K. Group normalization[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 3-19.
1. Overview
1.1. Motivation
- Normalizing along the batch dimension causes problems: as the batch size gets smaller, BN’s error increases rapidly
- BN helps convergence (the stochastic uncertainty of the batch statistics acts as a regularizer and benefits generalization), but it works worse with small batches
- SIFT, HOG. classical group-wise features with group-wise normalization
- BN’s statistics are computed separately on each GPU, not synchronized across all GPUs
This paper proposes Group Normalization (GN):
- independent of batch
- divide channels into groups and compute μ, σ of each group
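A minimal NumPy sketch of the two properties above, assuming an (N, C, H, W) layout and leaving out the learned scale and shift (γ, β). The function names and tensor sizes are illustrative, not taken from the paper. It checks that a sample gets the same GN output whether it is normalized alone or inside a larger batch, which is not the case for training-mode BN.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize x of shape (N, C, H, W) per sample and per channel group."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "C must be divisible by the number of groups"
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)  # one mean per (sample, group)
    var = g.var(axis=(2, 3, 4), keepdims=True)    # one variance per (sample, group)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

def batch_norm(x, eps=1e-5):
    """Training-mode BN: one mean/variance per channel, shared across the batch."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 3, 3))

# The first sample, normalized alone vs. as part of the batch of 8.
gn_alone, gn_in_batch = group_norm(x[:1], 2), group_norm(x, 2)[:1]
bn_alone, bn_in_batch = batch_norm(x[:1]), batch_norm(x)[:1]

gn_is_batch_independent = np.allclose(gn_alone, gn_in_batch)
bn_is_batch_independent = np.allclose(bn_alone, bn_in_batch)
```

Here `gn_is_batch_independent` comes out true and `bn_is_batch_independent` false, because BN's statistics change with the other samples in the batch.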
1.2. Related Works
1.2.1. Normalization
- Local Response Normalization (LRN). computes the statistics in a small neighbourhood for each pixel
- BN
- Layer Normalization (LN)
- Instance Normalization (IN)
- Weight Normalization (WN)
LN, IN, WN. independent of the batch dimension.
LN, IN. effective for training RNN and GAN models.
1.2.2. Addressing Small Batch
- Batch Renormalization (BR). two extra parameters constrain the estimated μ, σ of BN
- Synchronized BN. μ,σ computed across multiple GPUs
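A rough sketch of the idea behind synchronized statistics, simplified to a single channel: each device reports its count, mean, and biased variance, and these are merged into the statistics of the full batch. The helper `combine_stats` and the four simulated "GPUs" are hypothetical. Real Synchronized BN implementations may exchange different quantities.

```python
import numpy as np

def combine_stats(counts, means, variances):
    """Merge per-device (count, mean, biased variance) into global mean/variance."""
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    total = counts.sum()
    mean = (counts * means).sum() / total
    # E[x^2] on each device is var + mean^2; average it, then subtract the global mean^2.
    second_moment = (counts * (variances + means ** 2)).sum() / total
    return mean, second_moment - mean ** 2

rng = np.random.default_rng(0)
# Four hypothetical devices holding four values each, with different local means.
shards = [rng.normal(loc=i, size=4) for i in range(4)]

mean, var = combine_stats(
    [s.size for s in shards], [s.mean() for s in shards], [s.var() for s in shards]
)
full = np.concatenate(shards)  # what a single device would see with the whole batch
```

The merged `mean` and `var` agree with `full.mean()` and `full.var()`, while each shard's own statistics differ from them.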
1.2.3. Group-wise Computation
- group convolution. AlexNet
- ResNeXt
- depth-wise. MobileNet, Xception
- ShuffleNet
1.3. Dataset
- ImageNet. Classification
- COCO. obj detection, Segmentation
- Kinetics. Video Classification
1.4. Group Normalization
Relations within a group (factors that can make channels group together)
- horizontal flipping
- orientation
- frequency
- shape
- illumination
- texture
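General formulation. x̂ = (x − μ) / σ; the methods differ only in the axes over which μ, σ are computed
- BN (along N, H, W)
- LN (along C, H, W)
- IN (along H, W)
- GN (along H, W and the C/G channels of one group)
G. number of groups; C/G. channels per group
- (G=1)→ LN (assumes all channels make similar contributions, more restrictive than GN)
- (G=C)→ IN (does not exploit channel dependence)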
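The formulation can be sketched with one shared `normalize` helper, where each method only changes the axes used for the statistics. This assumes an (N, C, H, W) layout and omits the learned γ, β. The helper name and tensor sizes are illustrative. The last two lines check the two limiting cases, G=1 and G=C.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the std computed over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x):      # statistics over (N, H, W), per channel
    return normalize(x, (0, 2, 3))

def layer_norm(x):      # statistics over (C, H, W), per sample
    return normalize(x, (1, 2, 3))

def instance_norm(x):   # statistics over (H, W), per sample and channel
    return normalize(x, (2, 3))

def group_norm(x, num_groups):  # statistics over (C/G, H, W), per sample and group
    n, c, h, w = x.shape
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    return normalize(grouped, (2, 3, 4)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 4, 4))  # (N, C, H, W)

g1_matches_ln = np.allclose(group_norm(x, 1), layer_norm(x))
gc_matches_in = np.allclose(group_norm(x, 6), instance_norm(x))
```

Both checks come out true. An intermediate setting such as `group_norm(x, 3)` shares statistics among 2 channels at a time, which sits between the LN and IN extremes.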
1.5. Future Works
- investigate GN in reinforcement learning (RL)
2. Experiments
2.1. Ablation Study
2.1.1. Batch Size (Classification)
- Linear learning rate scaling rule. LR 0.1 for batch size 32, LR 0.1N/32 for batch size N.
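The scaling rule above is a one-line computation. A small sketch, with the function name and the list of example batch sizes chosen here for illustration:

```python
def scaled_learning_rate(batch_size, base_lr=0.1, base_batch_size=32):
    """Linear scaling rule: the learning rate grows in proportion to the batch size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return base_lr * batch_size / base_batch_size

# Learning rates for a few example batch sizes.
schedule = {n: scaled_learning_rate(n) for n in (32, 16, 8, 4, 2)}
```

Halving the batch size halves the learning rate, so the per-sample step size stays roughly comparable across the batch-size ablation.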
2.1.2. Batch Size (Video Classification)
2.1.3. Group & Channel Number
2.1.4. Distribution
2.2. Comparison
2.2.1. Classification
2.2.2. Detection & Segmentation
- replace BN* (frozen BN) with GN; when fine-tuning, a weight decay of 0 for γ and β is important for good detection results
- RoIs sampled from the same image are not i.i.d., and this non-i.i.d. batch distribution degrades BN’s estimation of batch statistics
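One way to read the weight-decay note is as a per-parameter decay setting in the optimizer. The sketch below is a plain SGD step in NumPy that skips L2 decay for parameters whose names end in `gamma` or `beta`. The parameter names, learning rate, and decay value are hypothetical and are not the paper's training configuration.

```python
import numpy as np

def sgd_step(params, grads, lr=0.01, weight_decay=1e-4, no_decay=("gamma", "beta")):
    """One SGD step with L2 weight decay skipped for the named parameter kinds."""
    updated = {}
    for name, value in params.items():
        # str.endswith accepts a tuple, so this matches either suffix.
        decay = 0.0 if name.endswith(no_decay) else weight_decay
        updated[name] = value - lr * (grads[name] + decay * value)
    return updated

params = {
    "conv1.weight": np.ones((2, 2)),
    "gn1.gamma": np.ones(2),
    "gn1.beta": np.ones(2),  # set to 1 here only so that any decay would be visible
}
# Zero gradients isolate the effect of weight decay on each parameter.
zero_grads = {name: np.zeros_like(value) for name, value in params.items()}
after = sgd_step(params, zero_grads)
```

With zero gradients, `conv1.weight` shrinks slightly while `gn1.gamma` and `gn1.beta` are left unchanged.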