(CVPR 2016) Deep Compositional Question Answering with Neural Module Networks

Keyword [Neural Module Networks]

Andreas J, Rohrbach M, Darrell T, et al. Neural module networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 39-48.

1. Overview

1.1. Motivation

1) no single best network exists for all tasks
2) a prefix of a network trained for classification is reused in many vision tasks
So while network structures are not universal, they are at least empirically modular.

This paper seeks to exploit both the representational capacity of deep neural networks and the compositional linguistic structure of questions.

  • decompose questions into linguistic sub-structures
  • use the sub-structures to dynamically instantiate modular networks
  • jointly trained
  • propose a new dataset of complex questions about abstract shapes

1) different kinds of modules are shown in different colors (in the paper's figures)

  • attention modules (e.g., for "dogs"), labeling modules (e.g., for "where")

2) all modules in NMN are independent and composable

1.2. Dataset

  • VQA. natural images
  • SHAPES. new dataset

2. NMN

2.1. Overview

training sample (w, x, y)

  • w. question
  • x. image
  • y. answer

1) the model instantiates a network based on the parse $P(w)$
2) passes x (and possibly w again) as input, and obtains a distribution over labels
3) the model is $p(y \mid w, x; \theta)$
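The three steps above can be sketched as follows. This is purely illustrative: `forward`, the `layouts` table, and the `nets` table are invented stand-ins for the paper's parser and module networks, not its actual code.

```python
# Hypothetical end-to-end sketch of one forward pass for p(y | w, x; theta).
def forward(w, x, parse, assemble):
    layout = parse(w)             # step 1: map the question w to a layout via P(w)
    network = assemble(layout)    # step 2: instantiate a network for that layout
    return network(x, w)          # step 3: distribution over labels y

# Minimal invented stand-ins for the parser and the assembled network:
layouts = {"what color is the truck": "classify[color](attend[truck])"}
nets = {"classify[color](attend[truck])": lambda x, w: {"red": 0.9, "blue": 0.1}}

dist = forward("what color is the truck", None, layouts.get, nets.get)
```

In the real model, `assemble` builds the network from trained modules and `network` consumes actual image features; here the answer distribution is hard-coded only to show the data flow.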

2.2. Modules

Three basic types:

  • images
  • unnormalized attentions
  • labels

Notation: TYPE[INSTANCE](ARG_1, …), e.g. attend[red]

  • TYPE. high-level module type (attention, classification, …)
  • weights may be shared at both the type and instance level
  • modules with no arguments implicitly take the image as input
  • higher-level modules may also inspect the image directly
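A toy illustration of sharing weights at both the type and instance level. The registry layout and `get_module_weights` are invented for this note; the point is only that attend[red] and attend[dog] share type-level parameters while keeping separate instance-level ones.

```python
# Invented weight registry: one shared block per TYPE, one block per INSTANCE.
weights = {}

def get_module_weights(mtype, instance, dim=4):
    # Type-level weights are shared by every instantiation of the type;
    # instance-level weights belong to one instantiation, e.g. attend[red].
    if mtype not in weights:
        weights[mtype] = {"type": [0.0] * dim, "instances": {}}
    if instance not in weights[mtype]["instances"]:
        weights[mtype]["instances"][instance] = [0.0] * dim
    return weights[mtype]["type"], weights[mtype]["instances"][instance]

t_red, i_red = get_module_weights("attend", "red")
t_dog, i_dog = get_module_weights("attend", "dog")
# t_red and t_dog are the very same object; i_red and i_dog are distinct.
```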

2.3. String to Networks

Two steps:

  • map questions to layouts (a set of modules and the connections between them)
    done with standard pre-trained tools (future work might learn, or at least fine-tune, this prediction process jointly with the rest of the model)
  • assemble the network based on the layout

2.3.1. Parsing

  • Stanford Parser
  • perform basic lemmatization to reduce sparsity
    e.g. kites → kite, were → be


  • what is standing in the field → what(stand)
  • what color is the truck → color(truck)
  • is there a circle next to a square → is(circle, next-to(square))
  • what type of cakes were they? / what type of cake is it? → type(cake)
    (function words are stripped away)
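A toy version of this normalization step, so that surface variants collapse to one query. The lemma table, stop-word list, and `simplify` are invented for illustration and far cruder than the Stanford Parser pipeline the paper actually uses.

```python
# Hypothetical post-processing: lemmatize a few forms, drop function words,
# and render the remaining content words as head(argument).
LEMMAS = {"cakes": "cake", "kites": "kite", "were": "be", "is": "be", "was": "be"}
STOP = {"what", "of", "they", "it", "a", "the", "be"}

def simplify(question):
    tokens = [LEMMAS.get(t, t) for t in question.lower().replace("?", "").split()]
    content = [t for t in tokens if t not in STOP]
    head, args = content[0], content[1:]
    return "%s(%s)" % (head, ", ".join(args))
```

With this sketch, both phrasings of the cake question land on the same query, e.g. `simplify("what type of cakes were they?")` and `simplify("what type of cake is it?")` both yield `type(cake)`.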

2.3.2. Layout

  • Leaf. attend module
  • Root. measure module or classify module
  • Other. re-attend module or combine module
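These placement rules can be made concrete with a toy 1-D "image" of labeled cells. All four modules below are hand-written stand-ins invented for this note, not the paper's learned modules.

```python
def attend(image, concept):
    # attend[concept] (leaf): unnormalized attention over image cells
    return [1.0 if cell == concept else 0.0 for cell in image]

def re_attend_next_to(att):
    # re-attend[next-to] (inner node): shift attention one cell left in 1-D
    return att[1:] + [0.0]

def combine_and(a, b):
    # combine[and] (inner node): intersect two attentions
    return [x * y for x, y in zip(a, b)]

def measure_is(att):
    # measure[is] (root): map an attention to a yes/no answer
    return "yes" if max(att) > 0.5 else "no"

# layout for "is there a circle next to a square":
# measure[is](combine[and](attend[circle], re-attend[next-to](attend[square])))
image = ["circle", "square", "triangle"]
att = combine_and(attend(image, "circle"),
                  re_attend_next_to(attend(image, "square")))
answer = measure_is(att)
```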

Networks which have the same high-level structure but different instantiations of individual modules can be processed in the same batch, resulting in efficient computation.
For example: classify[color](attend[cat]) and classify[where](attend[truck])

2.3.3. Generalizations

SQL-like queries:

  • IS(cat) AND NOT(IS(dog))
  • IS(cat) and date > 2014-11-5

2.4. Answering

geometric average of two distributions, dynamically reweighted using both text and image features

  • prediction from NMN
  • prediction from LSTM (1024 hidden units)
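The reweighted geometric average can be sketched as follows; `alpha` here is a fixed stand-in for the mixing weight that the model would predict dynamically from text and image features, and `geometric_mix` is an invented helper.

```python
def geometric_mix(p_nmn, p_lstm, alpha):
    # elementwise weighted geometric average, renormalized to a distribution
    mix = [a ** alpha * b ** (1 - alpha) for a, b in zip(p_nmn, p_lstm)]
    z = sum(mix)
    return [m / z for m in mix]

# NMN and LSTM answer distributions over three candidate answers
p = geometric_mix([0.7, 0.2, 0.1], [0.5, 0.3, 0.2], alpha=0.6)
```

With `alpha` near 1 the mixture trusts the NMN; near 0 it trusts the LSTM prior.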


  • the parser performs aggressive simplification, so more information from the original question is needed
    e.g. both what is flying? and what are flying? → what(fly)
  • it captures semantic regularities when image data is missing or low-quality
    it is reasonable to guess that the answer to what color is the bear? is brown rather than green

2.5. Training

  • the dynamic network structure results in some weights being updated more frequently than others
  • adaptive per-weight learning rates performed substantially better than simple gradient descent
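Why per-weight adaptivity helps here can be seen with a minimal AdaGrad-style update (AdaGrad is used as an illustrative stand-in; the paper's exact optimizer choice may differ). A weight belonging to a rarely instantiated module accumulates little gradient history, so it keeps a large effective step size when its module finally fires.

```python
import math

def adagrad_step(w, g, cache, lr=0.1, eps=1e-8):
    # each weight accumulates its own squared-gradient history; rarely
    # updated weights keep a larger effective learning rate
    new_cache = [c + gi * gi for c, gi in zip(cache, g)]
    new_w = [wi - lr * gi / (math.sqrt(c) + eps)
             for wi, gi, c in zip(w, g, new_cache)]
    return new_w, new_cache

w, cache = [0.0, 0.0], [0.0, 0.0]
# weight 0 gets frequent gradients (a common module), weight 1 a single
# rare gradient of the same magnitude (a rarely instantiated module)
for _ in range(10):
    w, cache = adagrad_step(w, [1.0, 0.0], cache)
w, cache = adagrad_step(w, [1.0, 1.0], cache)
# the rare weight's one step (~0.1) is larger than the common weight's 11th
```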

3. Experiments