Cao Q, Liang X, Li B, et al. Visual question reasoning on general dependency tree[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7249-7257.
1. Overview
1.1. Motivation
- recent work tries to make the compositional reasoning process explicit by assembling the multiple sub-tasks embedded in a question, but relies on hand-crafted rules or extra annotations
This paper proposes ACMN (Adversarial Composition Modular Network)
- adversarial attention module. mines local visual evidence for each word
- residual composition module. composes the hidden features of child nodes into the parent
- construct a dependency tree for each question
- clausal predicate relation
- modifier relation
- enforce each parent node to explore new regions, rather than re-attend old ones, by masking out the attended regions of its child nodes at each step
1.2. Contribution
- an interpretable VQA reasoning system
- adversarial attention module. enforces efficient visual evidence mining for modifier relations
- residual composition module. integrates the knowledge of child nodes for clausal predicate relations
1.3. Related Work
1.3.1. VQA
- CNN-LSTM
- attention, stacked attention, co-attention
- joint embedding
- compact bilinear method
1.3.2. Reasoning Models
- database queries
2. Adversarial Composition Modular Network
- generate a dependency tree for each question with the universal Stanford Parser
- prune leaf nodes that are not nouns
- M. modifier relation
- P. clausal predicate relation
- x. node
- x_c1, x_c2, …, x_cn. the n child nodes of x
- v. spatial features extracted by a pre-trained network
- w. word embedding produced by a Bi-LSTM
- set of modules f. applied at each word node; weights shared across all nodes
- input of f. (v, w, outputs of the child nodes)
- output of f. (attention map att_out, hidden feature h_out)
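The notation above implies a simple per-node data structure. Below is a minimal sketch in Python; the class name `TreeNode` and its fields are assumptions for illustration, not the authors' code. Note that one module f (one set of weights) is shared across all nodes.

```python
# Hypothetical node structure for the pruned dependency tree (illustrative only).
class TreeNode:
    """One word node x of the pruned dependency tree."""
    def __init__(self, word_emb, relation, children=None):
        self.word_emb = word_emb        # w: Bi-LSTM embedding of this word
        self.relation = relation        # edge type to the parent: 'M' or 'P'
        self.children = children or []  # the child nodes x_c1, ..., x_cn
        self.att_out = None             # attention map produced by f at this node
        self.h_out = None               # hidden feature produced by f at this node
```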
2.1. Adversarial Attention Module
- select the child nodes connected by modifier relations (∈ M)
- sum the attention maps of these child nodes
- mask = 1 − summed attention map
- apply the mask to v (element-wise)
- compute the new attention map att_out on the masked features
- compute the attended feature h' from att_out
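A sketch of these steps as a PyTorch module, assuming spatial features v of shape (N, v_dim). The soft-attention scorer and layer sizes are my own stand-ins; only the masking logic (1 − summed child attention) follows the steps above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdversarialAttention(nn.Module):
    """Sketch of Sec. 2.1: attend to regions not covered by modifier children."""
    def __init__(self, v_dim, w_dim, h_dim):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, h_dim)   # assumed projection layers
        self.proj_w = nn.Linear(w_dim, h_dim)
        self.score = nn.Linear(h_dim, 1)
        self.out = nn.Linear(v_dim, h_dim)

    def forward(self, v, w, child_att_maps):
        # v: (N, v_dim) spatial features; w: (w_dim,) word embedding
        # child_att_maps: list of (N,) attention maps from children in M
        if child_att_maps:
            summed = torch.stack(child_att_maps).sum(dim=0).clamp(max=1.0)
        else:
            summed = torch.zeros(v.size(0), device=v.device)
        mask = 1.0 - summed                          # suppress already-attended regions
        masked_v = v * mask.unsqueeze(-1)            # mask x v
        joint = torch.tanh(self.proj_v(masked_v) + self.proj_w(w))
        att_out = F.softmax(self.score(joint).squeeze(-1), dim=0)   # new attention map
        h_prime = self.out((att_out.unsqueeze(-1) * v).sum(dim=0))  # attended feature h'
        return att_out, h_prime
```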
2.2. Residual Composition Module
- sum the hidden features h of the child nodes with clausal predicate relations (∈ P)
- concatenate this summed h with h'
- add the result to the summed h of all child nodes (residual connection)
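A matching sketch of the composition step; the linear fusion layer is an assumed stand-in for whatever transform the paper applies to the concatenation, and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ResidualComposition(nn.Module):
    """Sketch of Sec. 2.2: fuse h' with the hidden features of P children."""
    def __init__(self, h_dim):
        super().__init__()
        self.fuse = nn.Linear(2 * h_dim, h_dim)  # assumed fusion layer

    def forward(self, h_prime, child_hiddens):
        # h_prime: (h_dim,) from the attention module
        # child_hiddens: list of (h_dim,) features from children in P
        if child_hiddens:
            h_sum = torch.stack(child_hiddens).sum(dim=0)
        else:
            h_sum = torch.zeros_like(h_prime)
        fused = torch.relu(self.fuse(torch.cat([h_sum, h_prime])))
        return fused + h_sum   # residual: add the children's knowledge back
```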
2.3. Overview
- the nodes with modifier relations M can modify their parent node by referring to a more specific object
- the nodes with clausal predicate relations P can enhance the representation
- the output feature of the root, h_root, is fed into a 3-layer MLP to predict the answer y
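Putting the pieces together, the whole network is a post-order traversal of the tree followed by an answer classifier. A sketch, assuming the `TreeNode`, `AdversarialAttention`, and `ResidualComposition` sketches above (MLP sizes are illustrative):

```python
import torch.nn as nn

def evaluate(node, v, attend, compose):
    """Apply the shared module f bottom-up; returns (att_out, h_out) for node."""
    m_atts, p_hiddens = [], []
    for child in node.children:
        att, h = evaluate(child, v, attend, compose)
        if child.relation == 'M':      # modifier children drive the adversarial mask
            m_atts.append(att)
        else:                          # clausal-predicate children feed composition
            p_hiddens.append(h)
    att_out, h_prime = attend(v, node.word_emb, m_atts)
    node.att_out, node.h_out = att_out, compose(h_prime, p_hiddens)
    return node.att_out, node.h_out

def answer_head(h_dim, num_answers):
    """3-layer MLP mapping h_root to answer logits y."""
    return nn.Sequential(
        nn.Linear(h_dim, h_dim), nn.ReLU(),
        nn.Linear(h_dim, h_dim), nn.ReLU(),
        nn.Linear(h_dim, num_answers),
    )
```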
3. Experiments
3.1. Dataset
- CLEVR
- Sort-of-CLEVR
- VQAv2
3.2. Details
- 224×224 input images
- maximum tree height of 13 on CLEVR