(CVPR 2019) Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing

Keyword [MAttNet]

Liu X, Wang Z, Shao J, et al. Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing[J]. arXiv preprint arXiv:1903.00839, 2019.

1. Overview

1.1. Motivation

  • Previous attention models focus only on the most dominant features of both modalities

This paper proposes a novel cross-modal attention-guided erasing approach, which aims to:

  • generate difficult training samples online
  • make full use of latent correspondences between training pairs
  • avoid over-reliance on specific words or visual concepts

Three types of erasing

  • Image-aware query sentence erasing: replace a word with an "unknown" token
  • Sentence-aware subject region erasing: erase spatial features of the subject region
  • Sentence-aware context object erasing: erase a dominant context region

1.2. Dataset

  • RefCOCO
  • RefCOCO+
  • RefCOCOg

2. Cross-modal Attention-guided Erasing

2.1. Overview of Attention-guided Erasing

Query Sentence Erasing ($Q^*$).
Visual Erasing ($O^*$).

For visual erasing, first sample a module from $Multinomial(3, [w_{subj}, w_{loc}, w_{rel}])$, then apply one of:

  • subject region erasing on feature maps
  • context object erasing to discard features of a context object
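
The module-sampling step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_module`, the `MODULES` tuple, and the use of NumPy are assumptions, and the weights are normalized here in case they do not already sum to 1.

```python
import numpy as np

# Module names in the same order as the weights [w_subj, w_loc, w_rel].
MODULES = ("subj", "loc", "rel")

def sample_module(weights, rng):
    """Draw one module name with probability proportional to its weight."""
    w = np.asarray(weights, dtype=float)
    p = w / w.sum()  # normalize so the weights form a distribution
    idx = int(rng.choice(len(MODULES), p=p))
    return MODULES[idx]

rng = np.random.default_rng(0)
module = sample_module([0.6, 0.1, 0.3], rng)
```

Because the draw is stochastic, a module with a small weight is still occasionally selected, so erasing is not always applied to the single most dominant module.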

Loss Function

2.2. Image-aware Query Sentence Erasing

  • encode the whole image, then feed the encoded feature into the LSTM
  • sample a word from Multinomial(T, [α_1, …, α_T])
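
A minimal sketch of the word-erasing step, assuming the word attention weights α_1, …, α_T are already available as an array. The names `erase_word` and `UNK`, and the literal token string `"<unk>"`, are hypothetical and chosen only for illustration.

```python
import numpy as np

UNK = "<unk>"  # assumed spelling of the "unknown" token

def erase_word(tokens, attention, rng):
    """Sample one position t ~ Multinomial(T, alpha) and replace that word with UNK."""
    alpha = np.asarray(attention, dtype=float)
    alpha = alpha / alpha.sum()          # make sure the weights sum to 1
    t = int(rng.choice(len(tokens), p=alpha))
    erased = list(tokens)                # copy so the original query is kept
    erased[t] = UNK
    return erased, t

rng = np.random.default_rng(0)
query = ["man", "in", "red", "shirt"]
erased_query, position = erase_word(query, [0.1, 0.05, 0.6, 0.25], rng)
```

The sentence length is unchanged, so the erased query can be fed to the same language encoder as the original one.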

2.3. Sentence-aware Subject Region Erasing

  • v_j: a feature point
  • erase a contiguous region of size k×k (k=3)
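
A minimal sketch of subject region erasing on a single feature map. The assumptions here are not confirmed by these notes: the feature point v_j is sampled from a spatial attention map, the k×k window is centered on it, the window is clipped at the border, and erasing means setting features to zero. The name `erase_subject_region` is hypothetical.

```python
import numpy as np

def erase_subject_region(feat, attention, rng, k=3):
    """feat: (C, H, W) feature map; attention: (H, W) non-negative weights."""
    _, H, W = feat.shape
    p = np.asarray(attention, dtype=float).reshape(-1)
    p = p / p.sum()
    j = int(rng.choice(H * W, p=p))      # index of the sampled feature point v_j
    r, c = divmod(j, W)
    half = k // 2
    # Assumed: window centered on v_j and clipped to the map boundaries.
    r0, r1 = max(r - half, 0), min(r + half + 1, H)
    c0, c1 = max(c - half, 0), min(c + half + 1, W)
    erased = feat.copy()
    erased[:, r0:r1, c0:c1] = 0.0        # zero out the k x k region in all channels
    return erased, (r, c)

rng = np.random.default_rng(0)
feat = np.ones((4, 7, 7))
attention = rng.random((7, 7))
erased_feat, center = erase_subject_region(feat, attention, rng)
```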

2.4. Sentence-aware Context Object Erasing

  • c_k. context region features
  • m = {loc, rel}

Different from MAttNet

  • In the relationship module, MAttNet assumes that only one context object contributes to recognizing the subject
  • This paper considers all context objects and attends to the important ones

Finally, sample a context object from Multinomial(K, [α_1, …, α_K]) and replace its features with zeros.
(The module has already been chosen by sampling from Multinomial(3, [w_{subj}, w_{loc}, w_{rel}]).)
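
A minimal sketch of the context object erasing step, assuming the K context region features c_k are stacked into a (K, D) array and the attention weights α_1, …, α_K are given. The function name `erase_context_object` is hypothetical.

```python
import numpy as np

def erase_context_object(context_feats, attention, rng):
    """context_feats: (K, D) features c_k; attention: (K,) weights alpha_k."""
    p = np.asarray(attention, dtype=float)
    p = p / p.sum()
    k = int(rng.choice(len(p), p=p))     # k ~ Multinomial(K, alpha)
    erased = context_feats.copy()
    erased[k] = 0.0                      # replace the sampled object's features with zeros
    return erased, k

rng = np.random.default_rng(0)
context = np.ones((5, 8))
erased_context, erased_index = erase_context_object(
    context, [0.05, 0.1, 0.6, 0.2, 0.05], rng
)
```

With the most attended context object removed, the model has to rely on the remaining objects to ground the expression.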

2.5. Details

  • Faster R-CNN with ResNet-101 as backbone to extract image features
  • For each candidate object proposal, 7x7 feature maps are fed into subject module