(ICCV 2017) Inferring and executing programs for visual reasoning

Keyword [Neural Module Networks] [IEP]

Johnson J, Hariharan B, van der Maaten L, et al. Inferring and executing programs for visual reasoning[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2989-2998.

1. Overview

1.1. Motivation

1) Existing methods use black-box architecture without explicitly modeling the underlying reasoning process
2) Black-box models often learn to exploit biases in the data rather than reasoning

This paper proposes a model for visual reasoning with two components:

  • Program generator
    reads the question and produces a plan (program) for answering it by composing functions from a function dictionary
  • Execution engine
    implements each function as a small neural module and executes the resulting module network to produce the answer.
  • The two parts are trained with a combination of backpropagation and REINFORCE.
  • With only a small amount of ground-truth (GT) program supervision (semi-supervised), the model outperforms the state of the art.

1) NMN: the program generator is hand-tuned and based on syntactic parsing.
2) This paper: only the function vocabulary and the universal module architecture are defined by hand.

1.2. Dataset

1) CLEVR: all questions are equipped with GT programs.
2) A new dataset of human-posed, free-form natural language questions about CLEVR images

  • finetuned on this dataset without additional program supervision

2. Method

Training sample (x, q, a, z)

  • x: image
  • q: question
  • a: answer
  • z: program (may or may not be available)

Program generator. $z = \pi(q)$
Execution engine. $a=\phi(x,z)$
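The factorization $a = \phi(x, \pi(q))$ can be sketched end to end on a toy example. The lookup-table generator, the tiny function vocabulary, and the object-list scene below are all illustrative stand-ins: in the paper, $\pi$ is a learned sequence-to-sequence model and $\phi$ operates on CNN image features.

```python
# Toy sketch of the two-stage factorization a = phi(x, pi(q)).
# All names and the toy scene are illustrative, not from the paper.

def pi(question):
    """Program generator: map a question to a (prefix-serialized) program."""
    # A real model is a learned seq2seq network; a lookup table stands in here.
    table = {
        "how many red things are there?": ["count", "filter_red", "scene"],
    }
    return table[question]

def phi(x, z):
    """Execution engine: run program z (prefix order) against scene x."""
    stack = []
    for f in reversed(z):          # evaluate from the leaves inward
        if f == "scene":
            stack.append(x)        # Scene returns all objects
        elif f == "filter_red":
            stack.append([o for o in stack.pop() if o["color"] == "red"])
        elif f == "count":
            stack.append(len(stack.pop()))
    return stack.pop()

scene = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]
answer = phi(scene, pi("how many red things are there?"))  # -> 2
```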

2.1. Programs

1) pre-specify a vocabulary of functions $f$, each with a fixed arity of 1 or 2
2) add a special constant Scene to the function vocabulary
3) represent valid programs $z$ as syntax trees, where each node contains a function $f$.
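A minimal sketch of this representation; the arity table lists a few CLEVR-style functions for illustration, with Scene as the arity-0 constant:

```python
# Sketch of programs as syntax trees over a fixed-arity function vocabulary.
# The arity table is illustrative, not the paper's full CLEVR vocabulary.
ARITY = {"scene": 0, "filter_red": 1, "count": 1, "intersect": 2}

class Node:
    def __init__(self, fn, children=()):
        # A program is valid only if every node has exactly arity(fn) children.
        assert len(children) == ARITY[fn], f"{fn} expects {ARITY[fn]} inputs"
        self.fn, self.children = fn, list(children)

    def prefix(self):
        """Serialize the tree by prefix traversal (Sec. 2.2)."""
        out = [self.fn]
        for c in self.children:
            out += c.prefix()
        return out

tree = Node("count", [Node("filter_red", [Node("scene")])])
tree.prefix()  # -> ["count", "filter_red", "scene"]
```

Because every arity is fixed, this prefix serialization is uniquely decodable back into a tree.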

2.2. Program Generator

1) use prefix traversal to serialize the syntax tree
2) at test time, simply take the argmax function at each time step
3) if the predicted sequence is too short to form a complete program, pad it with Scene constants
4) if the sequence is too long, discard the unused functions at the end

5) the generator is a sequence-to-sequence model with an encoder and a decoder
6) at each timestep, it computes a distribution over all possible functions.
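Since every function has a known arity, a decoded function sequence can be turned back into a tree deterministically. The sketch below also implements the two repairs described above (pad a too-short sequence with Scene; discard trailing unused functions); the arity table is illustrative:

```python
# Sketch of decoding a predicted function sequence into a valid program:
# pad short sequences with Scene, drop unused trailing functions.
ARITY = {"scene": 0, "filter_red": 1, "count": 1, "intersect": 2}

def to_tree(seq):
    """Consume a prefix-serialized sequence; return (tree, leftover)."""
    if not seq:                       # too short: pad with a Scene constant
        return ("scene", []), []
    fn, rest = seq[0], seq[1:]
    children = []
    for _ in range(ARITY[fn]):        # fixed arity makes decoding deterministic
        child, rest = to_tree(rest)
        children.append(child)
    return (fn, children), rest       # leftover = unused functions, discarded

tree, unused = to_tree(["count", "filter_red", "scene", "scene"])
# tree uses the first three tokens; the trailing "scene" is unused
```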

2.3. Execution Engine

1) implemented as a neural module network (NMN)
2) the program $z$ is used to assemble a question-specific network
3) leaves of the tree are Scene functions
4) the root of the tree is a function corresponding to one of the question types in the CLEVR dataset
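A sketch of assembling and running the question-specific network from a program tree. The "modules" here are plain Python functions over a toy object list, purely for illustration; in the paper each module is a small neural network operating on image features.

```python
# Sketch of the execution engine: each function name indexes a module,
# and the program tree wires the modules into a network.
# Module names and the toy scene are illustrative.
MODULES = {
    "scene": lambda scene: scene,                                  # leaf: all objects
    "filter_red": lambda scene, inp: [o for o in inp if o["color"] == "red"],
    "count": lambda scene, inp: len(inp),
}

def execute(tree, scene):
    """Recursively run the module network assembled from `tree`."""
    fn, children = tree
    if fn == "scene":                      # leaves are Scene functions
        return MODULES["scene"](scene)
    inputs = [execute(c, scene) for c in children]
    return MODULES[fn](scene, *inputs)

x = [{"color": "red"}, {"color": "blue"}]
z = ("count", [("filter_red", [("scene", [])])])
execute(z, x)  # -> 1
```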

2.4. Training

1) at training time, the argmax is non-differentiable and blocks backpropagation; it is replaced with sampling, and the generator is trained with REINFORCE
2) semi-supervised training

a) use a small set of GT programs to train the program generator
b) fix the program generator and train the execution engine on the large dataset
c) use REINFORCE to jointly finetune both components
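The REINFORCE step can be illustrated with a toy bandit-style version: the "policy" picks one of two candidate programs, the reward is 1 when the chosen program answers correctly and 0 otherwise, and the score-function gradient nudges the policy toward rewarded samples. The two-program setup, learning rate, and constant baseline below are all illustrative, not the paper's configuration.

```python
# Toy REINFORCE update for a program-choosing policy (illustrative only).
import math, random

random.seed(0)
programs = [["count", "filter_red", "scene"], ["count", "scene"]]
logits = [0.0, 0.0]                      # policy parameters

def softmax(ls):
    m = max(ls)
    e = [math.exp(l - m) for l in ls]
    s = sum(e)
    return [x / s for x in e]

def step(correct_idx, lr=0.5, baseline=0.5):
    probs = softmax(logits)
    i = random.choices(range(2), weights=probs)[0]   # sample, not argmax
    reward = 1.0 if i == correct_idx else 0.0
    # REINFORCE: grad of log p(i) w.r.t. logits, scaled by (reward - baseline)
    for j in range(2):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * (reward - baseline) * grad

for _ in range(200):
    step(correct_idx=0)
softmax(logits)[0]  # probability of the rewarded program rises toward 1
```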