(CVPR 2019) Cross-Modal Relationship Inference for Grounding Referring Expressions

Yang S, Li G, Yu Y. Cross-Modal Relationship Inference for Grounding Referring Expressions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4145-4154.

1. Overview

1.1. Motivation

  • existing work cannot extract the multi-order relationships among objects that complex referring expressions require

This paper proposes:
1) Cross-Modal Relationship Extractor (CMRE)
2) Gated Graph Convolutional Network (GGCN)

Graph construction:
1) spatial relation graph $G^s=(V, E, X^s)$
2) language-guided visual relation graph $G^v=(V, E, X, P^v, P^e)$
3) multimodal (language, visual and spatial information) relation graph $G^m=(V, E, X^m, P^v, P^e)$
4) propagate with gated operation

1.2. Dataset

  • RefCOCO
  • RefCOCO+
  • RefCOCOg

2. Cross-Modal Relationship Inference Network

2.1. CMRE

2.1.1. Spatial Relation Graph

1) Node. visual feature $v_i$ for each object region, extracted by a pretrained CNN
2) Edge. each edge $e_{ij}$ is assigned one of $N_e=11$ spatial relation types
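A rough sketch of how a directed spatial relation type between two boxes could be assigned. Note this is an illustrative rule set with fewer categories than the paper's $N_e=11$ types; the label names and thresholds are assumptions, not the paper's exact scheme.

```python
def spatial_relation(box_i, box_j):
    """Classify the directed spatial relation from box_i to box_j.

    Boxes are (x1, y1, x2, y2). Illustrative only: the paper defines
    N_e = 11 relation types; here we sketch a smaller set based on
    containment and the relative center position.
    """
    xi1, yi1, xi2, yi2 = box_i
    xj1, yj1, xj2, yj2 = box_j
    # containment checks first
    if xi1 <= xj1 and yi1 <= yj1 and xj2 <= xi2 and yj2 <= yi2:
        return "cover"    # box_i covers box_j
    if xj1 <= xi1 and yj1 <= yi1 and xi2 <= xj2 and yi2 <= yj2:
        return "inside"   # box_i lies inside box_j
    # otherwise: direction of box_j's center relative to box_i's center
    dx = (xj1 + xj2) / 2 - (xi1 + xi2) / 2
    dy = (yj1 + yj2) / 2 - (yi1 + yi2) / 2
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "below" if dy > 0 else "above"
```

The full graph's edge labels would come from running such a classifier over every ordered pair of detected objects.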

2.1.2. Language Context

1) softly classify each word in the expression into four types (entity words, relation words, absolute location words, and unnecessary words)
2) for word $t$, the type distribution is $m_t = [m_t^{(0)}, m_t^{(1)}, m_t^{(2)}, m_t^{(3)}]$
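The soft classification can be sketched as a per-word softmax over four type logits. `word_feats` and `W` are hypothetical names (e.g. bi-LSTM word states and a learned projection); the paper's exact parameterization may differ.

```python
import numpy as np

def word_type_distribution(word_feats, W):
    """Softly classify each word into 4 types.

    word_feats: (T, d) per-word features (assumed, e.g. LSTM states)
    W: (d, 4) hypothetical learned projection
    Returns m: (T, 4); each row m_t is a softmax distribution over
    {entity, relation, absolute location, unnecessary}.
    """
    logits = word_feats @ W
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)
```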

2.1.3. Language-Guided Visual Relation Graph

Highlight the vertices and edges of $G^s$ that are relevant to the referring expression, yielding $G^v$:
1) a gate $p_i^v$ for each vertex $v_i$, computed from the expression
2) a gate $p_j^e$ for each edge type $j = 1, \dots, N_e$

$P^v = \{p_i^v\}_{i=1}^{K}$
$P^e = \{p_j^e\}_{j=1}^{N_e}$
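A minimal sketch of language-guided gating, assuming a pooled expression feature and hypothetical projections `Wv`, `We` (the real CMRE computes these with learned attention over the words):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relation_gates(lang_feat, node_feats, Wv, We):
    """Language-guided gates for vertices and edge types.

    lang_feat: (L,) pooled expression feature (assumed)
    node_feats: (K, d) object features
    Wv: (d, L), We: (N_e, L) hypothetical learned projections
    Returns p_v in (0,1)^K and p_e in (0,1)^{N_e}.
    """
    # vertex gate: relevance of each object to the expression
    p_v = sigmoid(node_feats @ Wv @ lang_feat)   # (K,)
    # edge-type gate: relevance of each spatial relation type
    p_e = sigmoid(We @ lang_feat)                # (N_e,)
    return p_v, p_e
```

Gates near 0 suppress vertices/relation types the expression never mentions, so later message passing focuses on the referred subgraph.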

2.2. Multimodal Context Modeling

generate $G^m$.

2.2.1. Language-Vision Feature

$x_i^m = [x_i^s, c_i]$, the concatenation of the node feature $x_i^s$ from $G^s$ with its language context $c_i$

2.2.2. Semantic Context Modeling


1) $\rightarrow$, $\leftarrow$. features encoding out-going and in-coming relationships of each node, respectively
2) finally obtain the semantic contexts $X^c = \{x_i^c\}_{i=1}^{K}$
3) for each node $v_i$, the final feature is $x_i = [W_p p_i, x_i^c]$, where $p_i$ is the encoded spatial (location) feature
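One gated graph-convolution step over the multimodal graph can be sketched as below. All names and shapes are illustrative assumptions; the paper's GGCN additionally separates the in- and out-edge directions, which this sketch collapses into a single pass.

```python
import numpy as np

def gated_graph_conv(X, adj, edge_type, p_v, p_e, W):
    """One gated message-passing step (rough sketch of GGCN).

    X: (K, d) node features; adj: (K, K) 0/1 adjacency matrix;
    edge_type: (K, K) int relation-type index per edge;
    p_v: (K,) vertex gates; p_e: (N_e,) edge-type gates;
    W: (d, d) shared weight (all assumed parameterizations).
    """
    K = X.shape[0]
    msg = X @ W
    out = np.zeros_like(X)
    for i in range(K):
        for j in range(K):
            if adj[i, j]:
                # message from j to i, weighted by the language gates
                gate = p_v[j] * p_e[edge_type[i, j]]
                out[i] += gate * msg[j]
    # residual + ReLU keeps each node's own information
    return np.maximum(X + out, 0.0)
```

Stacking several such steps lets information propagate over multi-hop paths, which is how multi-order relationships get encoded into the semantic contexts.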

2.3. Loss Function