Cao Q, Liang X, Li B, et al. Visual question reasoning on general dependency tree[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7249-7257.
1. Overview
1.1. Motivation
- recent work tries to make the compositional reasoning process explicit by assembling the multiple sub-tasks embedded in a question, but relies on hand-crafted rules or extra annotations
This paper proposes ACMN (Adversarial Composition Modular Network)
- adversarial attention module. mines local visual evidence for each word
- residual composition module. composes the hidden features of child nodes into the parent
- construct a dependency tree for each question
- clausal predicate relation
- modifier relation
- enforce each parent node to explore new regions, rather than re-attend old ones, by masking out the attended regions of its child nodes at each step
1.2. Contribution
- an interpretable VQA reasoning system
- adversarial attention module. enforces efficient visual evidence mining for modifier relations
- residual composition module. integrates the knowledge of child nodes for clausal predicate relations
1.3. Related Work
1.3.1. VQA
- CNN-LSTM
- attention, stacked attention, co-attention
- joint embedding
- compact bilinear method
1.3.2. Reasoning Models
- database queries
2. Adversarial Composition Modular Network
- generate a dependency tree for each question with the universal Stanford Parser
- prune leaf nodes that are not nouns
- M. modifier relation
- P. clausal predicate relation
- x. node
- x_c1, x_c2, …, x_cn. the n child nodes of x
- v. spatial features extracted by a pre-trained network
- w. word embedding produced by a Bi-LSTM
- set of modules f. applied at each word node; weights shared across all nodes
- input of f. (v, w, outputs of the child nodes)
- output of f. (attention map att_out, hidden feature h_out)
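The notation above implies a simple per-node data structure. Below is a minimal sketch in Python; the class name `TreeNode` and its fields are assumptions for illustration, not the authors' code. Note that one module f (one set of weights) is shared across all nodes.

```python
# Hypothetical node structure for the pruned dependency tree (illustrative only).
class TreeNode:
    """One word node x of the pruned dependency tree."""
    def __init__(self, word_emb, relation, children=None):
        self.word_emb = word_emb        # w: Bi-LSTM embedding of this word
        self.relation = relation        # edge type to the parent: 'M' or 'P'
        self.children = children or []  # the child nodes x_c1, ..., x_cn
        self.att_out = None             # attention map produced by f at this node
        self.h_out = None               # hidden feature produced by f at this node
```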
2.1. Adversarial Attention Module
- select the child nodes connected by modifier relations (∈ M)
- sum the attention maps of these child nodes
- mask = 1 − summed attention map
- apply the mask to v (element-wise)
- compute the new attention map att_out on the masked features
- compute the attended feature h' from att_out
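A sketch of these steps as a PyTorch module, assuming spatial features v of shape (N, v_dim). The soft-attention scorer and layer sizes are my own stand-ins; only the masking logic (1 − summed child attention) follows the steps above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdversarialAttention(nn.Module):
    """Sketch of Sec. 2.1: attend to regions not covered by modifier children."""
    def __init__(self, v_dim, w_dim, h_dim):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, h_dim)   # assumed projection layers
        self.proj_w = nn.Linear(w_dim, h_dim)
        self.score = nn.Linear(h_dim, 1)
        self.out = nn.Linear(v_dim, h_dim)

    def forward(self, v, w, child_att_maps):
        # v: (N, v_dim) spatial features; w: (w_dim,) word embedding
        # child_att_maps: list of (N,) attention maps from children in M
        if child_att_maps:
            summed = torch.stack(child_att_maps).sum(dim=0).clamp(max=1.0)
        else:
            summed = torch.zeros(v.size(0), device=v.device)
        mask = 1.0 - summed                          # suppress already-attended regions
        masked_v = v * mask.unsqueeze(-1)            # mask x v
        joint = torch.tanh(self.proj_v(masked_v) + self.proj_w(w))
        att_out = F.softmax(self.score(joint).squeeze(-1), dim=0)   # new attention map
        h_prime = self.out((att_out.unsqueeze(-1) * v).sum(dim=0))  # attended feature h'
        return att_out, h_prime
```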
2.2. Residual Composition Module
- sum the hidden features h of the child nodes with clausal predicate relations (∈ P)
- concatenate this summed h with h'
- add the result to the summed h of all child nodes (residual connection)
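A matching sketch of the composition step; the linear fusion layer is an assumed stand-in for whatever transform the paper applies to the concatenation, and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ResidualComposition(nn.Module):
    """Sketch of Sec. 2.2: fuse h' with the hidden features of P children."""
    def __init__(self, h_dim):
        super().__init__()
        self.fuse = nn.Linear(2 * h_dim, h_dim)  # assumed fusion layer

    def forward(self, h_prime, child_hiddens):
        # h_prime: (h_dim,) from the attention module
        # child_hiddens: list of (h_dim,) features from children in P
        if child_hiddens:
            h_sum = torch.stack(child_hiddens).sum(dim=0)
        else:
            h_sum = torch.zeros_like(h_prime)
        fused = torch.relu(self.fuse(torch.cat([h_sum, h_prime])))
        return fused + h_sum   # residual: add the children's knowledge back
```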
2.3. Overview
- the nodes with modifier relations M can modify their parent node by referring to a more specific object
- the nodes with clausal predicate relations P can enhance the representation
- the output feature of the root, h_root, is fed into a 3-layer MLP to predict the answer y
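Putting the pieces together, the whole network is a post-order traversal of the tree followed by an answer classifier. A sketch, assuming the `TreeNode`, `AdversarialAttention`, and `ResidualComposition` sketches above (MLP sizes are illustrative):

```python
import torch.nn as nn

def evaluate(node, v, attend, compose):
    """Apply the shared module f bottom-up; returns (att_out, h_out) for node."""
    m_atts, p_hiddens = [], []
    for child in node.children:
        att, h = evaluate(child, v, attend, compose)
        if child.relation == 'M':      # modifier children drive the adversarial mask
            m_atts.append(att)
        else:                          # clausal-predicate children feed composition
            p_hiddens.append(h)
    att_out, h_prime = attend(v, node.word_emb, m_atts)
    node.att_out, node.h_out = att_out, compose(h_prime, p_hiddens)
    return node.att_out, node.h_out

def answer_head(h_dim, num_answers):
    """3-layer MLP mapping h_root to answer logits y."""
    return nn.Sequential(
        nn.Linear(h_dim, h_dim), nn.ReLU(),
        nn.Linear(h_dim, h_dim), nn.ReLU(),
        nn.Linear(h_dim, num_answers),
    )
```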
3. Experiments
3.1. Dataset
- CLEVR
- Sort-of-CLEVR
- VQAv2
3.2. Details
- 224×224 input images
- maximum tree height of 13 on CLEVR