Keyword [MAttNet]

Yu L, Lin Z, Shen X, et al. MAttNet: Modular attention network for referring expression comprehension[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 1307-1315.

1. Overview

1.1. Motivation

Most recent work treats the referring expression as a single unit.
Methods that rely on an external parser to decompose the expression can suffer from parsing errors.

This paper proposes MAttNet (Modular Attention Network).

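MAttNet decomposes the expression into three components:
- subject appearance
- location
- relationship to other objects
It uses two types of attention:
- language-based attention: learns the module weights and the words/phrases each module should focus on
- visual attention: lets the subject and relationship modules focus on the relevant image regions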

Different visual processing modules are triggered based on what information is present in the referring expression.
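The language-based attention can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the word vectors stand in for the outputs of a learned encoder, `module_queries` and `weight_rows` are hypothetical learned parameters, and summarizing the expression by the mean word vector is an assumption made only to keep the sketch short.

```python
import math

MODULES = ("subject", "location", "relationship")

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def language_attention(word_vecs, module_queries, weight_rows):
    """Return per-module word attention, phrase embeddings, and module weights.

    word_vecs:      one vector per word (stand-in for encoder states)
    module_queries: module -> query vector (hypothetical parameter)
    weight_rows:    module -> vector scoring the expression summary (hypothetical)
    """
    dim = len(word_vecs[0])
    # Word/phrase attention: each module gets its own distribution over words.
    word_attn = {
        m: softmax([dot(module_queries[m], h) for h in word_vecs])
        for m in MODULES
    }
    # Phrase embedding per module: attention-weighted sum of the word vectors.
    phrase = {
        m: [sum(a * h[d] for a, h in zip(word_attn[m], word_vecs)) for d in range(dim)]
        for m in MODULES
    }
    # Module weights: softmax over one score per module.
    # Assumption: the expression is summarized by its mean word vector.
    summary = [sum(h[d] for h in word_vecs) / len(word_vecs) for d in range(dim)]
    weights = softmax([dot(weight_rows[m], summary) for m in MODULES])
    return word_attn, phrase, dict(zip(MODULES, weights))
```

Because both the word attention and the module weights are softmax outputs, a module is never switched off completely; an expression with no location words should simply end up with a small location weight.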

1.2. Problem Definition

Select the best region from a set of proposals/objects {$o_i$} in image $I$, given an input expression $r$.
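A small sketch of the selection step, assuming each proposal already has one matching score per module and that the overall score is the module-weighted sum of those scores. The function names, region ids, and numbers are illustrative only.

```python
def overall_score(module_scores, module_weights):
    # Weighted sum of the per-module matching scores for one proposal.
    return sum(module_weights[m] * module_scores[m] for m in module_weights)

def select_region(proposals, module_weights):
    """proposals: list of (region_id, {module: score}) pairs.

    Returns the id of the proposal with the highest overall score.
    """
    best = max(proposals, key=lambda p: overall_score(p[1], module_weights))
    return best[0]

# Illustrative values: the expression leans mostly on subject appearance.
weights = {"subject": 0.5, "location": 0.3, "relationship": 0.2}
proposals = [
    ("o1", {"subject": 0.9, "location": 0.1, "relationship": 0.0}),
    ("o2", {"subject": 0.4, "location": 0.8, "relationship": 0.5}),
]
chosen = select_region(proposals, weights)
```

Here `o2` wins even though `o1` has the higher subject score, because the location and relationship scores still contribute through their weights.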

1.3. Novelty

MAttNet is designed for general referring expressions, covering subject, location, and relationship.
MAttNet learns to parse the expression automatically through soft attention, without an external parser.
It uses a different visual attention technique per module: in-box attention for the subject, out-of-box attention for the relationship.

The only supervision is the set of (object proposal, referring expression) pairs $(o_i, r_i)$.
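These notes do not record the training objective. One way to train from pair-level supervision alone is a hinge ranking loss that pushes the matched pair's score above mismatched pairs; the sketch below assumes that form, and the margin and the two loss weights are placeholder values, not numbers from the paper.

```python
def ranking_loss(pos, neg_expr, neg_obj, margin=0.1, lambda1=1.0, lambda2=1.0):
    """Hinge ranking loss for one training pair (assumed form).

    pos:      score of the matched pair (o_i, r_i)
    neg_expr: score of the same object with a mismatched expression (o_i, r_j)
    neg_obj:  score of a different object with the same expression (o_k, r_i)
    """
    # Each term is zero once the matched pair beats the negative by the margin.
    expr_term = max(0.0, margin + neg_expr - pos)
    obj_term = max(0.0, margin + neg_obj - pos)
    return lambda1 * expr_term + lambda2 * obj_term
```

No word-level or module-level labels appear anywhere in this loss, which is consistent with the attention being learned only from the pairs.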