(CVPR 2018) MAttNet: Modular Attention Network for Referring Expression Comprehension

Keyword [MAttNet]

Yu L, Lin Z, Shen X, et al. Mattnet: Modular attention network for referring expression comprehension[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 1307-1315.

1. Overview

1.1. Motivation

  • Most recent work treats the expression as a single unit, ignoring its compositional structure
  • methods that rely on an external language parser can suffer from parsing errors

To address this, the paper proposes MAttNet:

  • decomposes the expression into three components:
    • subject appearance
    • location
    • relationship
  • applies two types of attention:
    • language-based attention: weights for the three modules
    • word/phrase attention: the words each module should focus on

Different visual processing modules are triggered based on what information is present in the referring expression.
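This weighted combination can be sketched in a few lines; all numbers below (logits, per-module scores) are made-up illustrations, not values from the paper:

```python
import math

def softmax(logits):
    # numerically stable softmax over a plain list
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical logits for [subj, loc, rel] predicted from the expression,
# and hypothetical per-module matching scores for one candidate box.
module_weights = softmax([2.0, 0.5, -1.0])
module_scores = [0.8, 0.3, 0.1]

# Overall score: module matching scores combined with language-predicted weights.
overall = sum(w * s for w, s in zip(module_weights, module_scores))
print(round(overall, 3))
```

An expression with no relationship phrase drives the relationship weight toward zero, so that module barely affects the overall score.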

1.2. Problem Definition

Select the best region from a set of proposals/objects {$o_i$} in image $I$, given an input expression $r$.

1.3. Novelty

  • MAttNet is designed for general referring expressions, covering subject, location, and relationship
  • MAttNet learns to parse the expression automatically through soft attention, without an external parser
  • Different visual attention techniques per module: in-box attention for the subject, out-of-box attention for relationships

The only supervision needed is the object proposals and the referring pairs $(o_i, r_i)$.

1.4. Dataset

  • RefCOCO
  • RefCOCO+
  • RefCOCOg

2. Model

Given a candidate object $o_i$ and a referring expression $r$:

2.1 Language Attention Network

$u_t$. hidden vector for each word
$e_t$. one-hot word embedding
$f_m$. trainable vectors, $m \in \{\text{subj}, \text{loc}, \text{rel}\}$


  • language attention embedding $q^m$ for each module
  • attention weight of each module
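A minimal numpy sketch of this language attention, with toy dimensions and random stand-ins for the per-word hidden vectors $u_t$ (in the paper they come from a bi-LSTM):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                       # toy sizes: T words, embedding dim d
e = rng.normal(size=(T, d))       # word embeddings e_t
u = rng.normal(size=(T, d))       # per-word hidden vectors u_t (stand-ins)
f = rng.normal(size=(3, d))       # trainable vectors f_m, m = subj, loc, rel

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

# word attention per module: a[m, t] proportional to exp(f_m . u_t)
a = softmax(f @ u.T, axis=1)      # (3, T)
# module phrase embedding: q^m = sum_t a[m, t] * e_t
q = a @ e                         # (3, d)
# module weights from the first/last hidden vectors (hypothetical FC layer W)
W = rng.normal(size=(3, 2 * d))
w = softmax(W @ np.concatenate([u[0], u[-1]]))
print(q.shape, w.round(3))
```

Each module thus gets its own phrase embedding $q^m$ plus a scalar weight, both derived from the same expression.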

2.2 Visual Modules

  • forward the whole image through Faster R-CNN
  • crop the C3 feature map for each $o_i$
  • further compute the C4 feature

C3. lower-level cues including colors and shapes
C4. higher-level cues for category prediction
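The per-candidate cropping step can be sketched with a crude nearest-neighbor ROI crop (the actual system uses Faster R-CNN's ROI pooling; shapes and box values here are illustrative):

```python
import numpy as np

def crop_roi(feature_map, box, out_size=14):
    # Nearest-neighbor crop of a (C, H, W) feature map to `box`
    # (x1, y1, x2, y2 in feature-map coordinates) -- a stand-in for the
    # ROI pooling applied to the C3/C4 maps.
    C, H, W = feature_map.shape
    x1, y1, x2, y2 = box
    ys = np.clip(np.linspace(y1, y2, out_size).round().astype(int), 0, H - 1)
    xs = np.clip(np.linspace(x1, x2, out_size).round().astype(int), 0, W - 1)
    return feature_map[:, ys][:, :, xs]

fmap = np.arange(2 * 32 * 32, dtype=float).reshape(2, 32, 32)
roi = crop_roi(fmap, (4, 4, 20, 20))
print(roi.shape)   # (2, 14, 14)
```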

2.2.1 Subject Module

Given the C3 and C4 features of $o_i$:

Attribute Prediction
run a language parser over the expressions to extract color and generic attribute words as labels.

Phrase-guided Attentional Pooling

  • in-box attention over the candidate region
  • attention grid $G = 14 \times 14$
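A sketch of the in-box attentional pooling: each of the $14 \times 14$ cells inside the candidate box is weighted by its relevance to the subject phrase embedding (dimensions and values below are toy stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
d, G = 16, 14 * 14                 # feature dim, G = 14x14 grid cells
V = rng.normal(size=(d, G))        # spatial features inside one candidate box
q_subj = rng.normal(size=d)        # subject phrase embedding (stand-in)

def softmax(x):
    x = x - x.max()
    ex = np.exp(x)
    return ex / ex.sum()

# in-box attention: weight each grid cell by its match to the phrase
att = softmax(q_subj @ V)          # (G,)
v_subj = V @ att                   # attended subject representation, (d,)
print(v_subj.shape)
```

Because the attention only ranges over cells inside the box, irrelevant background within the region is down-weighted when matching "man in the red shirt"-style subject phrases.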

Matching Function
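The matching function can be sketched as a small MLP that scores a visual representation against a module phrase embedding; the layer sizes, weight names, and normalization placement below are assumptions for illustration:

```python
import numpy as np

def matching_score(v, q, W1, W2):
    # Concatenate visual representation v and phrase embedding q,
    # apply FC + ReLU + L2 normalization, then project to a scalar
    # score S(o_i | q^m).
    x = np.concatenate([v, q])
    h = np.maximum(W1 @ x, 0.0)
    h = h / (np.linalg.norm(h) + 1e-8)
    return float(W2 @ h)

rng = np.random.default_rng(2)
d, hdim = 16, 32                              # toy dimensions
W1 = rng.normal(size=(hdim, 2 * d))
W2 = rng.normal(size=hdim)
s = matching_score(rng.normal(size=d), rng.normal(size=d), W1, W2)
print(type(s).__name__)
```

Each module (subject, location, relationship) uses its own matching function of this form, and the three scores are then combined with the module weights.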

2.2.2 Location Module

  • encode up to five surrounding objects of the same category
  • use relative position and size offsets
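The location encodings above can be written out directly: a normalized 5-d position/size feature for the candidate box, and relative offsets to each same-category neighbor:

```python
def location_feature(box, img_w, img_h):
    # 5-d location feature: [x_tl/W, y_tl/H, x_br/W, y_br/H, (w*h)/(W*H)]
    x1, y1, x2, y2 = box
    return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
            (x2 - x1) * (y2 - y1) / (img_w * img_h)]

def relative_offset(box_i, box_j):
    # offset to a same-category neighbor:
    # [dx_tl/w_i, dy_tl/h_i, dx_br/w_i, dy_br/h_i, (w_j*h_j)/(w_i*h_i)]
    x1, y1, x2, y2 = box_i
    a1, b1, a2, b2 = box_j
    w, h = x2 - x1, y2 - y1
    return [(a1 - x1) / w, (b1 - y1) / h, (a2 - x2) / w, (b2 - y2) / h,
            (a2 - a1) * (b2 - b1) / (w * h)]

print(location_feature((10, 20, 110, 220), 640, 480))
```

These features let the module ground expressions like "second dog from the left" without any appearance cue.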

2.2.3 Relationship Module
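The relationship module uses out-of-box attention: the relationship phrase is matched against the surrounding context objects, and the best-matching one determines the score. A minimal sketch, with a toy dot-product matcher standing in for the learned matching function (in the paper the context representation also includes the neighbor's relative offset):

```python
import numpy as np

def relationship_score(q_rel, context_feats, match):
    # Out-of-box attention (sketch): match the relationship phrase embedding
    # against each surrounding object and keep the best score.
    return max(match(v, q_rel) for v in context_feats)

rng = np.random.default_rng(3)
d = 8
q_rel = rng.normal(size=d)                         # relationship phrase embedding
context = [rng.normal(size=d) for _ in range(5)]   # up to five context objects
dot = lambda v, q: float(v @ q)                    # toy matcher
s_rel = relationship_score(q_rel, context, dot)
print(round(s_rel, 3))
```

Taking the max means the module only needs the single most relevant neighbor (e.g. the cat in "chair the cat is sitting on") to fire.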

2.3 Loss Function

For each positive pair $(o_i, r_i)$, randomly sample two negative pairs: $(o_i, r_j)$ (the same object with a different expression) and $(o_k, r_i)$ (a different object with the same expression), and train with a hinge-based ranking loss.
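The resulting ranking loss can be sketched as a combined hinge over the two negatives; the margin and weighting values below are illustrative, not the paper's hyperparameters:

```python
def hinge(x):
    return max(0.0, x)

def ranking_loss(s_pos, s_neg_expr, s_neg_obj, margin=0.1, lam1=1.0, lam2=1.0):
    # Combined hinge loss over the two sampled negatives:
    #   (o_i, r_j): same object, other expression  -> s_neg_expr
    #   (o_k, r_i): other object, same expression  -> s_neg_obj
    return (lam1 * hinge(margin + s_neg_expr - s_pos)
            + lam2 * hinge(margin + s_neg_obj - s_pos))

# Loss is zero once the positive pair outscores both negatives by the margin.
print(ranking_loss(0.9, 0.4, 0.5))
print(ranking_loss(0.5, 0.6, 0.3))
```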