(2017) Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer

Keyword [Attention Map]

Zagoruyko S, Komodakis N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer[J]. arXiv preprint arXiv:1612.03928, 2016.

1. Overview

1.1. Motivation

  • different observers, with different knowledge and goals, adopt different attentional strategies
  • can a teacher network improve the performance of a student network by telling it where the teacher looks?
    The paper improves the student network by forcing it to mimic the attention maps of a powerful teacher network.
  • two kinds of attention maps are considered: activation-based and gradient-based

1.2. Contribution

  • uses attention as a mechanism for transferring knowledge from one network to another
  • proposes activation-based (performs better and can be combined with knowledge distillation) and gradient-based spatial attention maps

1.3. Related Work

  • Attention Mechanism
    • image caption
    • VQA
    • weakly-supervised object localization
    • classification
  • Gradient-Based
  • Knowledge Distillation
    • shallow networks have been shown to approximate deeper ones without loss in accuracy
  • Network
    • after a certain depth, the improvements come mostly from the increased capacity of the network (number of parameters)
    • a wide 16-layer ResNet can learn as well as, or better than, a very thin 1000-layer one

1.4. Dataset

  • ImageNet. classification, localization
  • COCO. object detection, face recognition and fine-grained recognition

2. Attention Transfer

2.1. Activation-Based Attention Transfer

  • the attention map is computed from the feature maps (activations) of a layer
  • first layers. high activation at low-level gradient points
  • middle layers. high activation at the most discriminative regions (eyes, wheels)
  • top layers. the map reflects the full object

2.1.1. Three Methods

  • stronger networks have peaks in attention where weak networks don’t
  • F_sum^p (with p > 1) puts more weight than F_sum on the spatial locations that correspond to the neurons with the highest activations
  • F_max^p considers only the single largest activation at each spatial location, rather than the sum over all channels
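
The three mapping functions compared above can be sketched in NumPy. This is a minimal illustration, assuming an activation tensor A of shape (C, H, W) and the definitions F_sum(A) = Σ_i |A_i|, F_sum^p(A) = Σ_i |A_i|^p and F_max^p(A) = max_i |A_i|^p, where i runs over channels. The function name `attention_maps` and the toy tensor are made up for this note.

```python
import numpy as np

def attention_maps(A, p=2):
    """Collapse a (C, H, W) activation tensor into three (H, W) attention maps."""
    abs_A = np.abs(A)
    return {
        "sum": abs_A.sum(axis=0),           # F_sum: sum of absolute values over channels
        "sum_p": (abs_A ** p).sum(axis=0),  # F_sum^p: sum of absolute values to the power p
        "max_p": (abs_A ** p).max(axis=0),  # F_max^p: largest absolute value to the power p
    }

# Toy tensor with 3 channels and 2 spatial locations (hypothetical values).
# Location 0: three weak activations; location 1: a single strong activation.
A = np.array([[[1.0, 3.0]],
              [[1.0, 0.0]],
              [[1.0, 0.0]]])
maps = attention_maps(A, p=2)
print(maps["sum"])    # both locations get the same weight
print(maps["sum_p"])  # the strongly activated location dominates
print(maps["max_p"])  # only the largest channel per location counts
```

In this toy case F_sum cannot tell the two locations apart, while F_sum^p and F_max^p both favor the location with the single strong activation.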

2.1.2. Cases of Student and Teacher Networks

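  • same depth. an attention transfer layer can be placed after every residual block
  • different depth. attention is transferred on the output activations of each group of residual blocks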

2.1.3. Loss Function

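  • L(W, x). task loss
  • I. the indices of all student-teacher activation layer pairs whose attention maps are transferred
  • Each vectorized attention map is l2-normalized before the student and teacher maps are compared; this normalization is important for student training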
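
The penalty described by these terms can be sketched as follows. This is a minimal NumPy sketch, assuming the form L_AT = L(W_S, x) + (β/2) Σ_{j∈I} || Q_S^j/||Q_S^j||_2 − Q_T^j/||Q_T^j||_2 ||_2, where Q^j is the vectorized attention map of layer pair j, here computed with F_sum^p. The function names, the value of β and the random activations are placeholders; only the penalty term is computed and would be added to the task loss.

```python
import numpy as np

def normalized_attention(A, p=2):
    """Vectorized, l2-normalized attention map of a (C, H, W) activation tensor."""
    q = (np.abs(A) ** p).sum(axis=0).ravel()
    return q / (np.linalg.norm(q) + 1e-12)  # small constant guards against division by zero

def attention_transfer_penalty(student_acts, teacher_acts, beta=1.0, p=2):
    """(beta / 2) * sum over layer pairs of the l2 distance between normalized maps."""
    total = 0.0
    for A_s, A_t in zip(student_acts, teacher_acts):
        diff = normalized_attention(A_s, p) - normalized_attention(A_t, p)
        total += np.linalg.norm(diff)
    return 0.5 * beta * total

# Hypothetical activations: channel counts may differ, spatial sizes must match.
rng = np.random.default_rng(0)
teacher = [rng.normal(size=(8, 4, 4))]
student = [rng.normal(size=(4, 4, 4))]

penalty = attention_transfer_penalty(student, teacher)
self_penalty = attention_transfer_penalty(teacher, teacher)
scaled_penalty = attention_transfer_penalty(student, [10.0 * teacher[0]])
```

Because of the normalization, rescaling the teacher activations leaves the penalty unchanged, so the student is pushed to match where the teacher looks rather than how large its activations are.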

Attention transfer can also be combined with knowledge distillation.
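
One way such a combination might look is sketched below. It assumes the usual temperature-softened distillation term (KL divergence between softened teacher and student outputs, scaled by T^2) blended with cross-entropy, plus an attention penalty computed elsewhere. The weighting scheme, the values of `alpha` and `T`, and all names are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()  # shift for numerical stability
    return z - np.log(np.exp(z).sum())

def kd_term(student_logits, teacher_logits, T=4.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    log_p_s = log_softmax(student_logits / T)
    log_p_t = log_softmax(teacher_logits / T)
    return (T ** 2) * np.sum(np.exp(log_p_t) * (log_p_t - log_p_s))

def combined_loss(student_logits, teacher_logits, label, at_penalty, alpha=0.9, T=4.0):
    """Cross-entropy and distillation blended by alpha, plus an attention penalty."""
    ce = -log_softmax(student_logits)[label]
    kd = kd_term(student_logits, teacher_logits, T)
    return (1.0 - alpha) * ce + alpha * kd + at_penalty

# Hypothetical logits for a 3-class problem; at_penalty is a placeholder value.
s = np.array([1.0, 0.5, -0.5])
t = np.array([2.0, 0.1, -1.0])
loss = combined_loss(s, t, label=0, at_penalty=0.05)
```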

2.2. Gradient-Based Attention Transfer

If small changes at a pixel can have a large effect on the network output, then it is logical to assume that the network is “paying attention” to that pixel.

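  • flip-invariant version. the loss can additionally require the gradient attention map to be invariant to horizontal flips of the input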
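
Both ideas can be illustrated on a toy model. The sketch below assumes a linear softmax classifier, so that the gradient of the loss with respect to the input has a closed form. It computes a transfer penalty (β/2) ||J_S − J_T||_2 between student and teacher input gradients, and a flip-invariance penalty (β/2) ||flip(∂L/∂x(W, flip(x))) − ∂L/∂x(W, x)||_2. The model, the names and the random data are assumptions for illustration only.

```python
import numpy as np

def input_gradient(W, x, label):
    """d(cross-entropy)/dx for logits_k = sum(W[k] * x); W: (K, H, Wd), x: (H, Wd)."""
    logits = np.tensordot(W, x, axes=([1, 2], [0, 1]))  # shape (K,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    probs[label] -= 1.0                                 # softmax minus one-hot
    return np.tensordot(probs, W, axes=([0], [0]))      # shape (H, Wd)

def flip(a):
    """Horizontal flip of an (H, Wd) array."""
    return a[:, ::-1]

def gradient_transfer_penalty(W_s, W_t, x, label, beta=1.0):
    """(beta / 2) * || J_S - J_T ||_2 on input gradients."""
    diff = input_gradient(W_s, x, label) - input_gradient(W_t, x, label)
    return 0.5 * beta * np.linalg.norm(diff)

def flip_penalty(W, x, label, beta=1.0):
    """(beta / 2) * || flip(grad at flip(x)) - grad at x ||_2."""
    g = input_gradient(W, x, label)
    g_flipped = flip(input_gradient(W, flip(x), label))
    return 0.5 * beta * np.linalg.norm(g_flipped - g)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4, 4))         # hypothetical weights, 3 classes
x = rng.normal(size=(4, 4))            # hypothetical single-channel input
W_sym = 0.5 * (W + W[:, :, ::-1])      # weights made symmetric under horizontal flip

asym = flip_penalty(W, x, label=0)     # positive for generic weights
sym = flip_penalty(W_sym, x, label=0)  # vanishes (up to rounding) for symmetric weights
```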

3. Experiments

3.1. Activation-Based

  • training with all attention transfer losses works better than training with only one of them
  • F_sum-based attention maps work better than F_max-based ones

4. Compared with Knowledge Distillation