Keyword [Neural Module Networks]

Andreas J, Rohrbach M, Darrell T, et al. Neural module networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 39-48.

1. Overview

1.1. Motivation

1) there is no single best network for all tasks
2) a prefix of a network trained for classification is reused in many vision tasks
So while network structures are not universal, they are at least empirically modular.

The paper seeks to exploit both the representational capacity of deep neural networks and the compositional linguistic structure of questions.

decompose questions into linguistic sub-structure
use sub-structures to dynamically instantiate modular networks
jointly trained
propose a new dataset of complex questions about abstract shapes

1) different kinds of modules are shown in different colors

attention module (like dogs), labeling modules (like where)

2) all modules in NMN are independent and composable

1.2. Dataset

VQA. natural images
SHAPES. new dataset

2. NMN

2.1. Overview

training sample (w, x, y)

w. question
x. image
y. answer

1) the model instantiates a network based on $P(w)$, where $P$ maps the question string to a network layout
2) pass x (and possibly w again) as inputs, and obtain a distribution over labels
3) model $p(y \mid w, x; \theta)$

2.2. Modules

Modules operate on three basic data types:

images
unnormalized attentions
labels

Notation: TYPE[INSTANCE](ARG_1, …). Example: attend[red]

TYPE. high-level module type (attention, classification, …)
weights may be shared at both the type and instance level
modules with no arguments implicitly take the image as input
higher-level arguments may also inspect the image
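
The TYPE[INSTANCE](ARG_1, …) notation can be made concrete with a small sketch. The `Module` dataclass and the `parse_layout` helper are hypothetical names introduced here for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Module:
    """One layout node written as TYPE[INSTANCE](ARG_1, ...)."""
    kind: str                       # high-level module type, e.g. "attend"
    instance: Optional[str] = None  # instance label, e.g. "red"
    args: List["Module"] = field(default_factory=list)

    def __str__(self) -> str:
        head = self.kind + (f"[{self.instance}]" if self.instance else "")
        if not self.args:
            return head  # no arguments: the module reads the image directly
        return head + "(" + ", ".join(str(a) for a in self.args) + ")"


def parse_layout(text: str) -> Module:
    """Parse e.g. 'classify[color](attend[cat])' into a Module tree."""
    pos = 0

    def peek() -> str:
        return text[pos] if pos < len(text) else ""

    def expect(ch: str) -> None:
        nonlocal pos
        if peek() != ch:
            raise ValueError(f"expected {ch!r} at position {pos}")
        pos += 1

    def skip_spaces() -> None:
        nonlocal pos
        while peek() == " ":
            pos += 1

    def read_name() -> str:
        nonlocal pos
        start = pos
        while peek() and (peek().isalnum() or peek() in "-_"):
            pos += 1
        if start == pos:
            raise ValueError(f"expected a name at position {pos}")
        return text[start:pos]

    def read_module() -> Module:
        skip_spaces()
        node = Module(read_name())
        if peek() == "[":            # optional instance label
            expect("[")
            node.instance = read_name()
            expect("]")
        if peek() == "(":            # optional argument list
            expect("(")
            node.args.append(read_module())
            skip_spaces()
            while peek() == ",":
                expect(",")
                node.args.append(read_module())
                skip_spaces()
            expect(")")
        return node

    root = read_module()
    skip_spaces()
    if pos != len(text):
        raise ValueError(f"unexpected text at position {pos}")
    return root


layout = parse_layout("classify[color](attend[cat])")
print(layout.kind, layout.instance, [str(a) for a in layout.args])
```

Keeping `kind` and `instance` as separate fields mirrors the note that weights may be shared at either level: a lookup keyed on `kind` alone shares across instances, while one keyed on `(kind, instance)` does not.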

2.3. String to Networks

Two steps:

map questions to layouts (set of module and connections)
standard pre-trained tools are used. (Future work might focus on learning, or at least fine-tuning, this prediction process jointly with the rest of the system.)
assemble network based on layouts

2.3.1. Parsing

Stanford Parser
perform basic lemmatization to reduce sparsity
e.g. kites → kite; were → be

Examples:

what is standing in the field → what(stand)
what color is the truck → color(truck)
is there a circle next to a square → is(circle, next-to(square))
what type of cakes were they? (what type of cake is it?) → type(cake)
strip away function words

2.3.2. Layout

Leaf. attend module
Root. measure module or classify module
Other. re-attend module or combine module
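
The position rules above can be sketched as a small recursive function. The notes do not say how to choose between the two options at the root or at internal nodes, so this sketch assumes the root becomes `measure` for yes/no questions and `classify` otherwise, and that internal nodes become `re-attend` with one child and `combine` with more. The `Parse` alias and `to_layout` are hypothetical names. How a root with several children is wired is not covered in these notes; the sketch simply lists the children as arguments.

```python
from typing import List, Tuple

# A simplified parse: (head word, list of child parses).
Parse = Tuple[str, List["Parse"]]


def to_layout(node: Parse, is_yes_no: bool, is_root: bool = True) -> str:
    """Assign a module type to every node based on its position in the tree."""
    head, children = node
    if is_root:
        # Assumption: yes/no questions end in measure, all others in classify.
        kind = "measure" if is_yes_no else "classify"
    elif not children:
        kind = "attend"          # leaf
    elif len(children) == 1:
        kind = "re-attend"       # assumption: chosen by arity
    else:
        kind = "combine"
    if not children:
        return f"{kind}[{head}]"
    args = ", ".join(to_layout(c, is_yes_no, is_root=False) for c in children)
    return f"{kind}[{head}]({args})"


# "what color is the truck" parses to color(truck) in the examples above.
print(to_layout(("color", [("truck", [])]), is_yes_no=False))
```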

Networks which have the same high-level structure but different instantiations of individual modules can be processed in the same batch, resulting in efficient computation.
For example: classify[color](attend[cat]) and classify[where](attend[truck])
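
One way to find layouts that can share a batch is to strip the instance labels and group by what remains. This is a minimal sketch under that assumption; `skeleton` and `group_by_structure` are hypothetical helpers, and the notes do not describe how the batching is actually implemented.

```python
import re
from collections import defaultdict
from typing import Dict, List


def skeleton(layout: str) -> str:
    """Drop every [INSTANCE] label, keeping only the module types and nesting."""
    return re.sub(r"\[[^\]]*\]", "", layout)


def group_by_structure(layouts: List[str]) -> Dict[str, List[int]]:
    """Map each structural skeleton to the indices of the layouts that share it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, layout in enumerate(layouts):
        groups[skeleton(layout)].append(index)
    return dict(groups)


layouts = [
    "classify[color](attend[cat])",
    "classify[where](attend[truck])",
    "measure[is](attend[dog])",
]
for structure, indices in group_by_structure(layouts).items():
    print(structure, indices)
```

The first two layouts fall into the same group because they differ only in instance labels, which is the case the example above describes.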

2.3.3. Generalizations

SQL-like queries:

IS(cat) AND NOT(IS(dog))
IS(cat) and date > 2014-11-5

2.4. Answering

geometric average of two distributions, dynamically reweighted using both text and image features

prediction from NMN
prediction from LSTM (1024 hidden units)

Reasons for adding the LSTM

aggressive simplification in the parser discards information, so more of the original question is needed
both what is flying? and what are flying? → what(fly)
capture semantic regularities when image data is missing or low-quality
reasonable to guess that the answer to what color is the bear? is brown rather than green
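
The geometric average of the two predictions can be sketched as follows. The notes say the weighting is computed dynamically from text and image features but do not give its form, so this sketch substitutes a fixed scalar `weight` as a stand-in; `geometric_combine` is a hypothetical name.

```python
import math
from typing import List, Sequence


def geometric_combine(
    p_nmn: Sequence[float],
    p_lstm: Sequence[float],
    weight: float = 0.5,   # assumption: fixed scalar instead of a learned reweighting
    eps: float = 1e-12,    # guards against log(0)
) -> List[float]:
    """Weighted geometric mean of two label distributions, renormalized to sum to 1."""
    if len(p_nmn) != len(p_lstm):
        raise ValueError("distributions must cover the same labels")
    # Work in log space: weight * log p_nmn + (1 - weight) * log p_lstm.
    log_scores = [
        weight * math.log(a + eps) + (1.0 - weight) * math.log(b + eps)
        for a, b in zip(p_nmn, p_lstm)
    ]
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


combined = geometric_combine([0.8, 0.2], [0.5, 0.5])
print(combined)
```

Unlike an arithmetic average, a label that either model considers nearly impossible stays unlikely in the combined distribution.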

2.5. Training

dynamic network structure results in some weights being updated more frequently than others.
adaptive per-weight learning rates performed substantially better than simple gradient descent
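
To illustrate why per-weight rates help when update frequency is uneven, here is a toy sketch using Adagrad as one example of an adaptive method. These notes do not name the optimizer used, so Adagrad is an assumption chosen for brevity, and `adagrad_step` and the module names in the demo are hypothetical.

```python
import math
from typing import Dict


def adagrad_step(
    weights: Dict[str, float],
    grads: Dict[str, float],
    accum: Dict[str, float],
    lr: float = 0.1,
    eps: float = 1e-8,
) -> None:
    """Update only the weights that received a gradient, each with its own rate."""
    for name, g in grads.items():
        accum[name] = accum.get(name, 0.0) + g * g
        weights[name] -= lr * g / (math.sqrt(accum[name]) + eps)


# A frequently used module gets a gradient every step, a rare one only once.
weights = {"attend[dog]": 0.0, "attend[kite]": 0.0}
accum: Dict[str, float] = {}
for step in range(10):
    grads = {"attend[dog]": 1.0}
    if step == 0:
        grads["attend[kite]"] = 1.0
    adagrad_step(weights, grads, accum)

# Effective rate for the next update: lr / sqrt(accumulated squared gradient).
for name in weights:
    print(name, 0.1 / math.sqrt(accum[name]))
```

The rarely updated weight keeps a larger effective step size, so the few gradients it receives are not drowned out by a global rate tuned for the frequently updated ones.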

(CVPR 2016) Deep Compositional Question Answering with Neural Module Networks