Keyword [MAttNet]
Yu L, Lin Z, Shen X, et al. MAttNet: Modular attention network for referring expression comprehension[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 1307-1315.
1. Overview
1.1. Motivation
- Most recent work treats the expression as a single unit.
- Methods that rely on an external parser can suffer from parsing errors.
The paper proposes MAttNet, which
- decomposes the expression into three components: subject appearance, location, and relationship.
- uses language-based attention to learn the module weights.
- uses word/phrase attention to learn which words each module should focus on.
Different visual processing modules are triggered based on what information is present in the referring expression.
1.2. Problem Definition
Given an input expression $r$, select the best region from a set of proposals/objects $\{o_i\}$ in image $I$.
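The selection step can be sketched as an argmax over per-candidate scores. This is a minimal sketch, assuming each of the three modules yields a scalar matching score per candidate and that the overall score is the module-weighted sum of those scores; the names `overall_score` and `select_region` and all numbers are hypothetical.

```python
from typing import Dict, List

MODULES = ("subj", "loc", "rel")


def overall_score(module_scores: Dict[str, float],
                  module_weights: Dict[str, float]) -> float:
    """Weighted sum of the per-module matching scores of one candidate."""
    return sum(module_weights[m] * module_scores[m] for m in MODULES)


def select_region(candidates: List[Dict[str, float]],
                  module_weights: Dict[str, float]) -> int:
    """Return the index of the candidate with the highest overall score."""
    scores = [overall_score(c, module_weights) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical module weights and per-candidate module scores.
weights = {"subj": 0.6, "loc": 0.3, "rel": 0.1}
candidates = [
    {"subj": 0.2, "loc": 0.9, "rel": 0.1},
    {"subj": 0.8, "loc": 0.4, "rel": 0.3},
    {"subj": 0.5, "loc": 0.5, "rel": 0.5},
]
best = select_region(candidates, weights)
```

With these toy values the second candidate (index 1) wins, because the subject module carries most of the weight.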
1.3. Novelty
- MAttNet is designed for general referring expressions, covering subject, location, and relationship.
- MAttNet learns to parse the expression automatically through soft attention.
- Different visual attention techniques per module: in-box attention for the subject, out-of-box attention for the relationship.
The only supervision is the pairing of an object proposal with its referring expression, $(o_i, r_i)$.
1.4. Dataset
- RefCOCO
- RefCOCO+
- RefCOCOg
2. Model
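Given a candidate object $o_i$ and a referring expression $r$, the model computes a matching score for the pair.
2.1. Language Attention Network
$u_t$. the $t$-th word of the expression
$e_t$. embedding of word $u_t$ (one-hot word embedding)
$f_m$. trainable vectors, one per module, $m \in \{\text{subj}, \text{loc}, \text{rel}\}$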
Output
- language attention embedding $q_m$ of each module.
- attention weight of each module.
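One way to picture the two outputs is the sketch below. It assumes a sequence encoder (not shown) has already produced one hidden vector per word, called `hidden` here; the word attention for module $m$ is taken as a softmax over $f_m \cdot h_t$, and $q_m$ as the attention-weighted sum of the word embeddings $e_t$. The module-weight logits are a hypothetical input that a learned layer would produce.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def word_attention(f_m: List[float], hidden: List[List[float]]) -> List[float]:
    """Attention of module m over the words: softmax_t(f_m . h_t)."""
    return softmax([dot(f_m, h_t) for h_t in hidden])


def phrase_embedding(attn: List[float],
                     embeddings: List[List[float]]) -> List[float]:
    """q_m as the attention-weighted sum of the word embeddings e_t."""
    dim = len(embeddings[0])
    return [sum(a * e[d] for a, e in zip(attn, embeddings)) for d in range(dim)]


# Toy three-word expression; all values are made up for illustration.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # assumed encoder states h_t
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # e_t
f_subj = [2.0, 0.0]                                  # trainable vector f_m

attn_subj = word_attention(f_subj, hidden)
q_subj = phrase_embedding(attn_subj, embeddings)

# Module weights: softmax over three logits (hypothetical values here).
w_subj, w_loc, w_rel = softmax([1.0, 0.5, -0.5])
```

Because each module has its own $f_m$, the same expression yields three different attention patterns and therefore three different phrase embeddings.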
2.2. Visual Modules
- forward the whole image through Faster R-CNN
- crop the C3 feature for each $o_i$
- further compute its C4 feature
C3. lower-level cues including colors and shapes
C4. higher-level cues for category prediction
2.2.1. Subject Module
Input: the C3 and C4 features of $o_i$.
Attribute Prediction
Run a parser over the expressions to extract color and generic attribute words, which serve as attribute labels.
Phrase-guided Attentional Pooling
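- in-box attention: attend over spatial locations inside the candidate's box, guided by the subject phrase.
- $G = 14 \times 14$ grid locations.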
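The pooling idea can be illustrated as follows. This is only a sketch: the attention logits use a plain dot product between each grid feature and the phrase embedding as a hypothetical stand-in for whatever learned scoring function the model uses, and the feature dimension is shrunk to 2 for readability.

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentional_pooling(grid_feats: List[List[float]],
                        phrase: List[float]) -> Tuple[List[float], List[float]]:
    """Attend over the G grid locations and return (attention, pooled feature).

    The dot-product logit is an assumption made for this sketch.
    """
    logits = [sum(v * q for v, q in zip(feat, phrase)) for feat in grid_feats]
    attn = softmax(logits)
    dim = len(grid_feats[0])
    pooled = [sum(a * feat[d] for a, feat in zip(attn, grid_feats))
              for d in range(dim)]
    return attn, pooled


G = 14 * 14                               # number of in-box grid locations
grid = [[0.0, 1.0] for _ in range(G)]     # toy background features
grid[5] = [3.0, 0.0]                      # one location that matches the phrase
phrase = [2.0, 0.0]                       # toy subject phrase embedding

attn, pooled = attentional_pooling(grid, phrase)
```

In this toy case the attention mass concentrates on location 5, so the pooled feature is dominated by that location's feature.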
Matching Function
MLP: FC-ReLU-FC-ReLU
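A matching function with this MLP layout might look like the sketch below. The notes give only the layer pattern; the rest is an assumption made here: one MLP for the visual feature and one for the phrase embedding, L2 normalization of both outputs, and an inner product as the similarity. Weights are toy identity matrices.

```python
import math
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight matrix, bias)


def relu(v: List[float]) -> List[float]:
    return [max(0.0, x) for x in v]


def linear(W: List[List[float]], b: List[float], v: List[float]) -> List[float]:
    """Fully connected layer: W v + b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def mlp(v: List[float], layers: List[Layer]) -> List[float]:
    """FC-ReLU-FC-ReLU when given two layers."""
    for W, b in layers:
        v = relu(linear(W, b, v))
    return v


def l2_normalize(v: List[float], eps: float = 1e-12) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def match_score(visual: List[float], phrase: List[float],
                visual_layers: List[Layer], phrase_layers: List[Layer]) -> float:
    """Inner product of the two L2-normalized MLP outputs (assumed design)."""
    v = l2_normalize(mlp(visual, visual_layers))
    q = l2_normalize(mlp(phrase, phrase_layers))
    return sum(a * b for a, b in zip(v, q))


# Toy 2-d example with identity weights and zero biases.
identity: List[Layer] = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])] * 2
score_same = match_score([1.0, 2.0], [2.0, 4.0], identity, identity)
score_orth = match_score([1.0, 0.0], [0.0, 1.0], identity, identity)
```

Since both MLPs end in a ReLU, the normalized vectors are non-negative and the score in this sketch lies in [0, 1].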
2.2.2. Location Module
- consider up to five surrounding objects of the same category
- encode the position of each surrounding object relative to the candidate
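The relative-position encoding can be sketched as below. The exact feature layout is an assumption: corner offsets normalized by the candidate's width and height, plus an area ratio. Picking neighbors by nearest box center is also an assumption of this sketch; the notes only say "up to five".

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def relative_offset(box_i: Box, box_j: Box) -> List[float]:
    """Offset of neighbor j relative to candidate i (assumed 5-d layout)."""
    xi1, yi1, xi2, yi2 = box_i
    xj1, yj1, xj2, yj2 = box_j
    wi, hi = xi2 - xi1, yi2 - yi1
    wj, hj = xj2 - xj1, yj2 - yj1
    return [
        (xj1 - xi1) / wi,        # top-left x offset, in candidate widths
        (yj1 - yi1) / hi,        # top-left y offset, in candidate heights
        (xj2 - xi2) / wi,        # bottom-right x offset
        (yj2 - yi2) / hi,        # bottom-right y offset
        (wj * hj) / (wi * hi),   # area ratio
    ]


def center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def context_offsets(candidate: Box, same_category: List[Box],
                    max_neighbors: int = 5) -> List[List[float]]:
    """Offsets to the nearest same-category boxes, at most max_neighbors."""
    cx, cy = center(candidate)

    def sq_dist(box: Box) -> float:
        bx, by = center(box)
        return (bx - cx) ** 2 + (by - cy) ** 2

    nearest = sorted(same_category, key=sq_dist)[:max_neighbors]
    return [relative_offset(candidate, b) for b in nearest]


candidate = (100.0, 100.0, 200.0, 200.0)
neighbors = [(150.0 + 60.0 * k, 100.0, 250.0 + 60.0 * k, 200.0) for k in range(6)]
offsets = context_offsets(candidate, neighbors)
```

Here six same-category boxes are available, so only the five nearest are kept; the nearest one is shifted half a candidate width to the right and has the same area.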
2.2.3. Relationship Module
2.3. Loss Function
For each positive pair $(o_i, r_i)$, randomly sample two negative pairs: $(o_i, r_j)$, where $r_j$ is an expression for a different object, and $(o_k, r_i)$, where $o_k$ is a different object.
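Sampling one positive and two negatives suggests a margin-based ranking loss, sketched below. The hinge form and the values of `margin`, `lambda1`, and `lambda2` are assumptions for illustration; the scores would come from the model's matching function.

```python
def ranking_loss(s_pos: float, s_neg_expr: float, s_neg_obj: float,
                 margin: float = 0.1,
                 lambda1: float = 1.0, lambda2: float = 1.0) -> float:
    """Hinge ranking loss for one positive pair and its two negatives.

    s_pos:      score of the positive pair (o_i, r_i)
    s_neg_expr: score of (o_i, r_j), same object with a mismatched expression
    s_neg_obj:  score of (o_k, r_i), same expression with a mismatched object
    margin, lambda1, lambda2 are placeholder hyperparameters.
    """
    loss_expr = max(0.0, margin + s_neg_expr - s_pos)
    loss_obj = max(0.0, margin + s_neg_obj - s_pos)
    return lambda1 * loss_expr + lambda2 * loss_obj


# Positive pair clearly ahead of both negatives: no penalty.
loss_easy = ranking_loss(0.9, 0.2, 0.3)
# Mismatched expression outscores the positive pair: penalized.
loss_hard = ranking_loss(0.5, 0.6, 0.1)
```

The loss is zero only when the positive pair beats both negatives by at least the margin, which pushes the model to rank the correct object and the correct expression first.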