
Page 1

Neural Modular Networks

Joseph E. Gonzalez, Co-director of the RISE Lab

[email protected]

Page 2

Deep Compositional Question Answering with Neural Module Networks

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
{jda,rohrbach,trevor,klein}@{cs,eecs,eecs,cs}.berkeley.edu

Abstract

Visual question answering is fundamentally compositional in nature—a question like where is the dog? shares substructure with questions like what color is the dog? and where is the cat? This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions. We describe a procedure for constructing and learning neural module networks, which compose collections of jointly-trained neural "modules" into deep networks for question answering. Our approach decomposes questions into their linguistic substructures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.). The resulting compound networks are jointly trained. We evaluate our approach on two challenging datasets for visual question answering, achieving state-of-the-art results on both the VQA natural image dataset and a new dataset of complex questions about abstract shapes.

1. Introduction

This paper describes an approach to visual question answering based on neural module networks (NMNs). We answer natural language questions about images using collections of jointly-trained neural "modules", dynamically composed into deep networks based on linguistic structure.

Concretely, given an image and an associated question (e.g. where is the dog?), we wish to predict a corresponding answer (e.g. on the couch, or perhaps just couch) (Figure 1). The visual QA task has significant applications to human-robot interaction, search, and accessibility, and has been the subject of a great deal of recent research attention [2, 7, 20, 22, 25, 32]. The task requires sophisticated understanding of both visual scenes and natural language. Recent successful approaches represent questions as bags of words, or encode the question using a recurrent neural network [22] and train a simple classifier on the encoded question and image.

[Figure 1 diagram: CNN features of the image and a layout produced by the parser from the question "Where is the dog?" are composed into a module network and combined with an LSTM encoding to predict the answer "couch".]

Figure 1: A schematic representation of our proposed model—the shaded gray area is a neural module network of the kind introduced in this paper. Our approach uses a natural language parser to dynamically lay out a deep network composed of reusable modules. For visual question answering tasks, an additional sequence model provides sentence context and learns common-sense knowledge.

In contrast to these monolithic approaches, another line of work for textual QA [18] and image QA [21] uses semantic parsers to decompose questions into logical expressions. These logical expressions are evaluated against a purely logical representation of the world, which may be provided directly or extracted from an image [16].

In this paper we draw from both lines of research, presenting a technique for integrating the representational power of neural networks with the flexible compositional structure afforded by symbolic approaches to semantics. Rather than relying on a monolithic network structure to answer all questions, our approach assembles a network on the fly from a collection of specialized, jointly-learned modules (Figure 1). Rather than using logic to reason over truth values, we remain entirely in the domain of visual features and attentions.

Our approach first analyzes each question with a semantic parser, and uses this analysis to determine the basic com-

arXiv:1511.02799v4 [cs.CV] 24 Jul 2017

Today's paper

Page 3

What Problem is being solved?
Ø Problem Domain: Visual Question Answering
Ø "Visual Turing test"

Correct answers (top row):
how many different lights in various different shapes and sizes? → measure[count](attend[light]) → four (four)
what is the color of the horse? → classify[color](attend[horse]) → brown (brown)
what color is the vase? → classify[color](attend[vase]) → green (green)
is the bus full of passengers? → measure[is](combine[and](attend[bus], attend[full])) → yes (yes)
is there a red shape above a circle? → measure[is](combine[and](attend[red], re-attend[above](attend[circle]))) → no (no)

Mistakes (bottom row):
what is stuffed with toothbrushes wrapped in plastic? → classify[what](attend[stuff]) → container (cup)
where does the tabby cat watch a horse eating hay? → classify[where](attend[watch]) → pen (barn)
what material are the boxes made of? → classify[material](attend[box]) → leather (cardboard)
is this a clock? → measure[is](attend[clock]) → yes (no)
is a red shape blue? → measure[is](combine[and](attend[red], attend[blue])) → yes (no)

Figure 3: Example output from our approach on different visual QA tasks. The top row shows correct answers, while the bottom row shows mistakes (correct answers are given in parentheses).

…answering, performing especially well on questions answered by an object or an attribute. Additionally, we have introduced a new dataset of highly compositional questions about simple arrangements of shapes, and shown that our approach substantially outperforms previous work.

So far we have maintained a strict separation between predicting network structures and learning network parameters. It is easy to imagine that these two problems might be solved jointly, with uncertainty maintained over network structures throughout training and decoding. This might be accomplished either with a monolithic network, by using some higher-level mechanism to "attend" to relevant portions of the computation, or else by integrating with existing tools for learning semantic parsers [16].

The fact that our neural module networks can be trained to produce predictable outputs—even when freely composed—points toward a more general paradigm of "programs" built from neural networks. In this paradigm, network designers (human or automated) have access to a standard kit of neural parts from which to construct models for performing complex reasoning tasks. While visual question answering provides a natural testbed for this approach, its usefulness is potentially much broader, extending to queries about documents and structured knowledge bases or more general signal processing and function approximation.

Page 4

Prior State of the Art
Ø Semantic parsing and logic:
Ø Dependent on pre-trained computer vision models to populate database

(a) Perception f_per produces a logical knowledge base Γ from the environment d using an independent classifier for each category and relation, e.g. Γ = {mug(1), mug(3), blue(1), table(4), on-rel(1, 4), on-rel(3, 4), ...}.

(b) Semantic parsing f_prs maps language z, e.g. "blue mug on table", to a logical form ℓ = λx.∃y. blue(x) ∧ mug(x) ∧ on-rel(x, y) ∧ table(y).

(c) Evaluation f_eval evaluates a logical form ℓ on a logical knowledge base Γ to produce a grounding g and denotation γ. Here the intermediate denotations are blue(x) → {1}, mug(x) → {1, 3}, on-rel(x, y) → {(1, 4), (3, 4)}, table(y) → {4}, giving grounding g = {(1, 4)} and denotation γ = {1}.

Figure 2: Overview of Logical Semantics with Perception (LSP).

et al., 2010; Dzifcak et al., 2009; Cantrell et al., 2010; Chen and Mooney, 2011). All of this work assumes that the formal environment representation is given, while LSP learns to produce this formal representation from raw sensor input.

Most similar to LSP is work on simultaneously understanding natural language and perceiving the environment. This problem has been addressed in the context of robot direction following (Kollar et al., 2010; Tellex et al., 2011) and visual attribute learning (Matuszek et al., 2012). However, this work is less semantically expressive than LSP and trained using more supervision. The G3 model (Kollar et al., 2010; Tellex et al., 2011) assumes a one-to-one mapping from noun phrases to entities and is trained using full supervision, while LSP allows one-to-many mappings from noun phrases to entities and can be trained using minimal annotation. Matuszek et al. (2012) learns only one-argument categories ("attributes") and requires a fully supervised initial training phase. In contrast, LSP models two-argument relations and allows for weakly supervised training throughout.

3 Logical Semantics with Perception

Logical Semantics with Perception (LSP) is a model for grounded language acquisition. LSP accepts as input a natural language statement and an environment and outputs the objects in the environment denoted by the statement. The LSP model has three components: perception, parsing and evaluation (see Figure 2). The perception component constructs logical knowledge bases from low-level feature-based representations of environments. The parsing component semantically parses natural language into lambda calculus queries against the constructed knowledge base. Finally, the evaluation component deterministically executes this query against the knowledge base to produce LSP's output.

The output of LSP can be either a denotation or a grounding. A denotation is the set of entity referents for the phrase as a whole, while a grounding is the set of entity referents for each component of the phrase. The distinction between these two outputs is shown in Figure 1b. In this example, the denotation is the set of "things to the right of the blue mug," which does not include the blue mug itself. On the other hand, the grounding includes both the referents of "things" and "blue mug." Only denotations are used during training, so we ignore groundings in the following model description. However, groundings are used in our evaluation, as they are a more complete description of the model's understanding.
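To make the evaluation component concrete, here is a minimal sketch (not code from the LSP paper; the knowledge base and query are the ones in Figure 2 above, and all helper names are illustrative) that executes the logical form λx.∃y. blue(x) ∧ mug(x) ∧ on-rel(x, y) ∧ table(y) against the knowledge base and recovers the grounding {(1, 4)} and denotation {1}:

# Minimal sketch of LSP-style deterministic evaluation (f_eval): a logical form
# is executed against a logical knowledge base to produce a grounding and a
# denotation. Knowledge base taken from Figure 2; everything else is illustrative.

kb = {                              # knowledge base Γ: predicates as sets of tuples
    "mug":    {(1,), (3,)},
    "blue":   {(1,)},
    "table":  {(4,)},
    "on-rel": {(1, 4), (3, 4)},
}
entities = {1, 2, 3, 4}

def evaluate_blue_mug_on_table(kb, entities):
    """Evaluate λx.∃y. blue(x) ∧ mug(x) ∧ on-rel(x, y) ∧ table(y)."""
    grounding = set()
    for x in entities:
        for y in entities:
            if ((x,) in kb["blue"] and (x,) in kb["mug"]
                    and (x, y) in kb["on-rel"] and (y,) in kb["table"]):
                grounding.add((x, y))
    denotation = {x for (x, y) in grounding}
    return grounding, denotation

g, d = evaluate_blue_mug_on_table(kb, entities)
print(g)  # {(1, 4)} : referents for every component of the phrase
print(d)  # {1}      : referents for the phrase as a whole

In LSP this evaluation step is deterministic; the learning happens in the perception and parsing components that produce Γ and ℓ.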

Formally, LSP is a linear model f that predicts a denotation γ given a natural language statement z in an environment d. As shown in Figure 3, the structure of LSP factors into perception (f_per), semantic parsing (f_prs) and evaluation (f_eval) components using several latent variables:

f(γ, Γ, ℓ, t, z, d; θ) = f_per(Γ, d; θ_per) + f_prs(ℓ, t, z; θ_prs) + f_eval(γ, Γ, ℓ)

LSP assumes access to a set of predicates that take either one argument, called categories (c ∈ C), or two arguments, called relations (r ∈ R).² These predicates are the interface between LSP's perception and parsing components. The perception function f_per takes an environment d and produces a log-

² The set of predicates is derived from our training data. See Section 5.3.


Jointly Learning to Parse and Perceive: Connecting Natural Language to the Physical World

Page 5

Prior State of the Art
Ø Deep Embeddings
Ø Learned end-to-end but image representation is independent of the question.

Image Question Answering: A Visual Semantic Embedding Model and a New Dataset

DQ2355: What is the color of roll of tissue paper? Ground truth: white. Comment: This question is easy because toilet tissue paper is always white.

DQ1466: What is on the night stand? Ground truth: paper. Comment: This question is very hard because the object is too small to focus.

DQ2010: What is to the right of the table? Ground truth: sink. Comment: There are too many cluttered objects.

Figure 2. Sample questions from DAQUAR, showing a variety of difficulty levels (last line contains our own editorial comments).

[Figure 3. VIS+LSTM Model: the CNN image feature passes through a linear map into the word-embedding space and is fed, along with the question words ("how many books ..."), to an LSTM with dropout; a softmax over candidate answers (one, two, ..., red, bird) gives the prediction.]
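As a rough illustration of this family of models, the sketch below wires together the ingredients visible in the VIS+LSTM diagram: a precomputed CNN image feature is projected by a linear map into the word-embedding space and fed to an LSTM together with the question words, with a softmax over candidate answers on top. This is an approximation for illustration only, not the authors' code; the layer sizes and the class and parameter names are assumptions.

import torch
import torch.nn as nn

class VisLSTM(nn.Module):
    """Sketch of a VIS+LSTM-style model: the projected image feature is treated
    as the first 'word' of the question; the final LSTM state predicts the answer."""

    def __init__(self, vocab_size, num_answers, img_feat_dim=4096, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)      # word embedding
        self.img_proj = nn.Linear(img_feat_dim, emb_dim)    # linear map: CNN feature -> embedding space
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.dropout = nn.Dropout(p=0.5)
        self.classifier = nn.Linear(hid_dim, num_answers)   # softmax over answer vocabulary

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_feat_dim) precomputed CNN feature (independent of the question)
        # question_ids: (B, T) word indices of the question
        img_tok = self.img_proj(img_feat).unsqueeze(1)      # (B, 1, emb_dim)
        words = self.embed(question_ids)                    # (B, T, emb_dim)
        seq = torch.cat([img_tok, words], dim=1)            # image acts as the first token
        _, (h, _) = self.lstm(seq)
        return self.classifier(self.dropout(h[-1]))         # logits over candidate answers

logits = VisLSTM(vocab_size=10000, num_answers=500)(torch.randn(2, 4096),
                                                    torch.randint(0, 10000, (2, 8)))
print(logits.shape)  # torch.Size([2, 500])

As the slide notes, the image representation here is computed independently of the question.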

4.2. Question-Answer Generation

The currently available DAQUAR dataset contains approximately 1500 images and 7000 questions on 37 common object classes, which might not be enough for training large complex models. Another problem with the current dataset is that simply guessing the mode can yield very good accuracy. We aim to create another dataset, to produce a much larger number of QA pairs and a more even distribution of answers. While collecting human generated QA pairs is one possible approach, and another is to generate questions based on image labellings, we instead propose to automatically convert descriptions into QA form.

As a starting point we used the Microsoft-COCO dataset (Lin et al., 2014), but the same method can be applied to any other image description dataset, such as Flickr (Hodosh et al., 2013), SBU (Ordonez et al., 2011), or even the internet. Another advantage of using image descriptions is that they are generated by humans in the first place. Hence most objects mentioned in the descriptions are easier to notice than the ones in DAQUAR's human generated questions,

and synthetic QAs from ground truth labellings. This allows the model to rely more on common sense and rough image understanding without any logic reasoning. Lastly the conversion process preserves the language variability in the original description and will result in more human-like questions than questions generated from image labellings.

Question generation is still an open-ended topic. We are not trying to solve a linguistic problem but instead aim to create a usable image QA dataset. We adopt a conservative approach to generating questions: because we have the luxury of large publicly-available image description datasets, we can afford to reject many possible questions in an attempt to create high-quality questions.

4.2.1. COMMON STRATEGIES

1. Compound sentences to simple sentences. Here we only consider a simple case, where two sentences are joined together with a conjunctive word. We split the original sentence into two independent sentences. For example, "There is a cat and the cat is running." will be split as "There is a cat." and "The cat is running." (See the sketch after this list.)

2. Indefinite determiners to definite determiners. Asking questions on a specific instance of the subject requires changing the determiner into definite form "the". For example, "A boy is playing baseball." will have "the" instead of "a" in its question form: "What is the boy playing?".

3. Wh-movement constraints. In English, questions tend to start with interrogative words such as "what". The algorithm needs to move the verb as well as the "wh-" constituent to the front of the sentence. In this work we consider the following two simple constraints:
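The first two strategies above are simple enough to sketch as toy rules (a purely illustrative sketch, not the authors' pipeline; the regular expressions and function names are assumptions):

import re

def split_compound(sentence):
    """Strategy 1: 'There is a cat and the cat is running.' -> two sentences."""
    parts = re.split(r"\s+and\s+", sentence.rstrip("."), maxsplit=1)
    return [p.strip().capitalize() + "." for p in parts]

def make_definite(sentence):
    """Strategy 2: 'A boy is playing baseball.' -> 'The boy is playing baseball.'"""
    return re.sub(r"^(A|An)\s", "The ", sentence)

print(split_compound("There is a cat and the cat is running."))
# ['There is a cat.', 'The cat is running.']
print(make_definite("A boy is playing baseball."))
# 'The boy is playing baseball.'

A real converter would simply reject sentences that such naive rules mishandle, in line with the conservative approach described above.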


Figure 2: Illustration of the mQA model architecture. We input an image and a question about the image (i.e. "What is the cat doing?") to the model. The model is trained to generate the answer to the question (i.e. "Sitting on the umbrella"). The weight matrix in the word embedding layers of the two LSTMs (one for the question and one for the answer) are shared. In addition, as in [25], this weight matrix is also shared, in a transposed manner, with the weight matrix in the Softmax layer. Different colors in the figure represent different components of the model. (Best viewed in color.)

There are some concurrent and independent works on this topic: [1, 23, 32]. [1] propose a large-scale dataset also based on MS COCO. They also provide some simple baseline methods on this dataset. Compared to them, we propose a stronger model for this task and evaluate our method using human judges. Our dataset also contains two different kinds of language, which can be useful for other tasks, such as machine translation. Because we use a different set of annotators and different requirements of the annotation, our dataset and the [1] can be complementary to each other, and lead to some interesting topics, such as dataset transferring for visual question answering.

Both [23] and [32] use a model containing a single LSTM and a CNN. They concatenate the question and the answer (for [32], the answer is a single word; [23] also prefer a single word as the answer), and then feed them to the LSTM. Different from them, we use two separate LSTMs for questions and answers respectively in consideration of the different properties (e.g. grammar) of questions and answers, while allowing the sharing of the word-embeddings. For the dataset, [23] adopt the dataset proposed in [22], which is much smaller than our FM-IQA dataset. [32] utilize the annotations in MS COCO and synthesize a dataset with four pre-defined types of questions (i.e. object, number, color, and location). They also synthesize the answer with a single word. Their dataset can also be complementary to ours.

3 The Multimodal QA (mQA) Model

We show the architecture of our mQA model in Figure 2. The model has four components: (I) a Long Short-Term Memory (LSTM [12]) for extracting semantic representation of a question, (II) a deep Convolutional Neural Network (CNN) for extracting the image representation, (III) an LSTM to extract representation of the current word in the answer and its linguistic context, and (IV) a fusing component that incorporates the information from the first three parts together and generates the next word in the answer. These four components can be jointly trained together.³ The details of the four model components are described in Section 3.1. The effectiveness of the important components and strategies are analyzed in Section 5.3.

The inputs of the model are a question and the reference image. The model is trained to generate the answer. The words in the question and answer are represented by one-hot vectors (i.e. binary vectors with the length of the dictionary size N that have only one non-zero element indicating the word's index in the dictionary). We add a ⟨BOA⟩ sign and a ⟨EOA⟩ sign, as two special words in the word dictionary, at the beginning and the end of the training answers respectively. They will be used for generating the answer to the question in the testing stage.

In the testing stage, we input an image and a question about the image into the model first. To generate the answer, we start with the start sign ⟨BOA⟩ and use the model to calculate the probability distribution of the next word. We then use a beam search scheme that keeps the best K candidates

³ In practice, we fix the CNN part because the gradient returned from LSTM is very noisy. Finetuning the CNN takes a much longer time than just fixing it, and does not improve the performance significantly.
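A minimal sketch of that testing-stage decoding loop is shown below. It is not the authors' code: model_step is a hypothetical stand-in for the fused CNN/LSTM components that returns a probability distribution over the next answer word; decoding starts from the ⟨BOA⟩ sign and keeps a beam of the best K partial answers until ⟨EOA⟩ is generated.

import math

def beam_search(model_step, image, question, K=3, max_len=10):
    beams = [(["<BOA>"], 0.0)]                      # (answer so far, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            probs = model_step(image, question, words)      # dict: word -> probability
            for w, p in probs.items():
                candidates.append((words + [w], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)   # keep the best K candidates
        beams = []
        for words, score in candidates[:K]:
            (finished if words[-1] == "<EOA>" else beams).append((words, score))
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[1])
    return best[0][1:]                              # drop <BOA>

def dummy_step(image, question, words):             # toy stand-in model
    return {"sitting": 0.7, "<EOA>": 0.2, "cat": 0.1} if words[-1] == "<BOA>" \
        else {"<EOA>": 0.9, "sitting": 0.1}

print(beam_search(dummy_step, image=None, question="what is the cat doing?"))
# ['sitting', '<EOA>']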


Are you Talking to a Machine? Datasets and Methods for Multilingual Image Question Answering

Image Question Answering: A Visual Semantic Embedding Model and a New Dataset

Page 6

Proposed Solution


Ø Use parser to extract logical expression from question.
Ø Compose a neural network to answer the question.
Ø Reuse network components across problems.

Page 7


Ø Objective: Convert question into logical expression.
Ø Conceptually → inducing a program from a question
Ø Also probably the more brittle part of the work
Ø Addressed in follow-up paper
Ø Alternative solution: user writes logical expression → programming

Question: Is there a circle next to a square?
Logical Expression: is(circle, next-to(square))
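To make the target representation concrete, here is a small sketch (an illustration only, not the paper's parser, which derives the layout from a dependency parse of the question) that turns a layout string such as is(circle, next-to(square)) into a nested (head, arguments) tree:

def parse_layout(s):
    """Parse e.g. "is(circle, next-to(square))" into a (head, children) tree."""
    s = s.strip()
    if "(" not in s:
        return (s, [])                               # leaf, e.g. "square"
    head, rest = s.split("(", 1)
    args, depth, buf = [], 0, ""
    for ch in rest[:-1]:                             # drop the final ')'
        if ch == "," and depth == 0:
            args.append(parse_layout(buf)); buf = ""
            continue
        depth += ch == "("
        depth -= ch == ")"
        buf += ch
    args.append(parse_layout(buf))
    return (head.strip(), args)

print(parse_layout("is(circle, next-to(square))"))
# ('is', [('circle', []), ('next-to', [('square', [])])])

A tree like this is what later gets instantiated as a composed network of neural modules.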

Page 8


Neural Modules

Attention
attend : Image → Attention
(implemented as a convolution; e.g. attend[dog])
An attention module attend[c] convolves every position in the input image with a weight vector (distinct for each c) to produce a heatmap or unnormalized attention. So, for example, the output of the module attend[dog] is a matrix whose entries should be large in regions of the image containing dogs, and small everywhere else.

Re-attention
re-attend : Attention → Attention
(implemented as FC → ReLU → FC; e.g. re-attend[above])
A re-attention module re-attend[c] is essentially just a multilayer perceptron with rectified nonlinearities (ReLUs), performing a fully-connected mapping from one attention to another. Again, the weights for this mapping are distinct for each c. So re-attend[above] should take an attention and shift the regions of greatest activation upward, while re-attend[not] should move attention away from the active regions. For the experiments in this paper, the first fully-connected (FC) layer produces a vector of size 32, and the second is the same size as the input.

Combination
combine : Attention × Attention → Attention
(implemented as stack → convolution → ReLU; e.g. combine[except])
A combination module combine[c] merges two attentions into a single attention. For example, combine[and] should be active only in the regions that are active in both inputs, while combine[except] should be active where the first input is active and the second is inactive.

Classification
classify : Image × Attention → Label
(implemented as attend → FC → softmax, producing e.g. "couch")
A classification module classify[c] takes an attention and the input image and maps them to a distribution over labels. For example, classify[color] should return a distribution over colors in the region attended to.

Measurement
measure : Attention → Label
(implemented as FC → ReLU → FC → softmax, producing e.g. "yes")
A measurement module measure[c] takes an attention alone and maps it to a distribution over labels. Because attentions passed between modules are unnormalized, measure is suitable for evaluating the existence of a detected object, or counting sets of objects.

4.2. From strings to networks
Having built up an inventory of modules, we now need to assemble them into the layout specified by the question. The transformation from a natural language question to an instantiated neural network takes place in two steps. First we map from natural language questions to layouts, which specify both the set of modules used to answer a given question, and the connections between them. These layouts are then used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguistic resources to obtain structured representations of questions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of the system.

Parsing: We begin by parsing each question with the Stanford Parser [14] to obtain a universal dependency representation [4]. Dependency parses express grammatical relations between parts of a sentence (e.g. between objects and their attributes, or events and their participants), and provide a lightweight abstraction away from the surface form

Separate weights for each argument e.g., [dog]

“Learned Sub-routines/Functions”
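To give a feel for what these typed, reusable pieces might look like in code, here is a rough PyTorch sketch of the five module types with separate weights per argument c. The feature-map shape, layer sizes, and the attention-weighted pooling in classify are assumptions for illustration; this is not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

D, H, W = 256, 14, 14            # assumed CNN feature map (D, H, W); attention maps are (1, H, W)

class Attend(nn.Module):         # attend : Image -> Attention (1x1 convolution)
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(D, 1, kernel_size=1)
    def forward(self, image):
        return self.conv(image)

class ReAttend(nn.Module):       # re-attend : Attention -> Attention (FC -> ReLU -> FC)
    def __init__(self, hidden=32):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(H * W, hidden), nn.Linear(hidden, H * W)
    def forward(self, att):
        return self.fc2(F.relu(self.fc1(att.flatten(1)))).view(-1, 1, H, W)

class Combine(nn.Module):        # combine : Attention x Attention -> Attention (stack, conv, ReLU)
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)
    def forward(self, att1, att2):
        return F.relu(self.conv(torch.cat([att1, att2], dim=1)))

class Classify(nn.Module):       # classify : Image x Attention -> Label (attend, FC, softmax)
    def __init__(self, num_labels):
        super().__init__()
        self.fc = nn.Linear(D, num_labels)
    def forward(self, image, att):
        weights = torch.softmax(att.flatten(2), dim=-1)     # normalize the attention
        pooled = (image.flatten(2) * weights).sum(-1)       # attention-weighted pooling
        return F.log_softmax(self.fc(pooled), dim=-1)

class Measure(nn.Module):        # measure : Attention -> Label (FC, ReLU, FC, softmax)
    def __init__(self, num_labels, hidden=128):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(H * W, hidden), nn.Linear(hidden, num_labels)
    def forward(self, att):
        return F.log_softmax(self.fc2(F.relu(self.fc1(att.flatten(1)))), dim=-1)

# Separate weights for each argument: one instance per word, reused across questions.
attend = {c: Attend() for c in ["dog", "cat", "circle", "square"]}
re_attend = {c: ReAttend() for c in ["above", "not"]}

img = torch.randn(1, D, H, W)
att = attend["dog"](img)                      # (1, 1, 14, 14) unnormalized attention
print(Measure(num_labels=2)(att).shape)       # torch.Size([1, 2])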


Page 9


Composition!


“Learned programs”
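Composition then amounts to recursively applying modules according to the layout tree. The sketch below is a self-contained toy: the "modules" are set-based stand-ins (attentions are just sets of grid cells), chosen only to show the compositional control flow, whereas in the paper each module is a jointly trained neural network. It evaluates the layout for "is there a red shape above a circle?" bottom-up.

# Layout tree for measure[is](combine[and](attend[red], re-attend[above](attend[circle])))
layout = ("measure[is]",
          [("combine[and]",
            [("attend[red]", []),
             ("re-attend[above]", [("attend[circle]", [])])])])

# Toy "image": grid cells occupied by each attribute/shape (illustrative data).
image = {"red": {(2, 3)}, "circle": {(2, 1)}, "square": {(0, 0)}}

modules = {
    "attend":    lambda arg: lambda img, ins: img.get(arg, set()),
    "re-attend": lambda arg: lambda img, ins: {(x, y + 1) for (x, y) in ins[0]} if arg == "above" else set(),
    "combine":   lambda arg: lambda img, ins: ins[0] & ins[1] if arg == "and" else set(),
    "measure":   lambda arg: lambda img, ins: "yes" if ins[0] else "no",
}

def instantiate(node, img):
    """Recursively evaluate a (head, children) layout tree against an image."""
    head, children = node
    kind, _, arg = head.partition("[")            # "attend[red]" -> kind "attend", arg "red"
    arg = arg.rstrip("]")
    inputs = [instantiate(child, img) for child in children]
    return modules[kind](arg)(img, inputs)

print(instantiate(layout, image))   # 'no'

The same recursion over the learned neural modules yields a deep network whose structure differs from question to question while its parameters are reused across questions.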


Attention

attend : Image ! Attention

An attention module attend[c] convolves every positionin the input image with a weight vector (distinct for eachc) to produce a heatmap or unnormalized attention. So,for example, the output of the module attend[dog] is amatrix whose entries should be in regions of the imagecontaining cats, and small everywhere else, as shown above.

Re-attention

re-attend : Attention ! Attention

A re-attention module re-attend[c] is essentially just amultilayer perceptron with rectified nonlinearities (ReLUs),performing a fully-connected mapping from one attentionto another. Again, the weights for this mapping are distinctfor each c. So re-attend[above] should take an attentionand shift the regions of greatest activation upward (asabove), while re-attend[not] should move attention awayfrom the active regions. For the experiments in this paper,the first fully-connected (FC) layer produces a vector ofsize 32, and the second is the same size as the input.

Combination

combine : Attention ⇥ Attention ! Attention

A combination module combine[c] merges two attentionsinto a single attention. For example, combine[and] shouldbe active only in the regions that are active in both inputs,while combine[except] should be active where the first in-put is active and the second is inactive.

Classification

classify : Image ⇥ Attention ! Label

Attend FC Softmax couch

A classification module classify[c] takes an attention andthe input image and maps them to a distribution over labels.For example, classify[color] should return a distributionover colors in the region attended to.

Measurement

measure : Attention ! Label

A measurement module measure[c] takes an attention aloneand maps it to a distribution over labels. Because atten-tions passed between modules are unnormalized, measure issuitable for evaluating the existence of a detected object, orcounting sets of objects.

4.2. From strings to networksHaving built up an inventory of modules, we now need

to assemble them into the layout specified by the question.The transformation from a natural language question to aninstantiated neural network takes place in two steps. Firstwe map from natural language questions to layouts, whichspecify both the set of modules used to answer a given ques-tion, and the connections between them. Next we use theselayouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguis-tic resources to obtained structured representations of ques-tions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of thesystem.

Parsing We begin by parsing each question with the Stan-ford Parser [14]. to obtain a universal dependency represen-tation [4]. Dependency parses express grammatical rela-tions between parts of a sentence (e.g. between objects andtheir attributes, or events and their participants), and pro-vide a lightweight abstraction away from the surface form

Attention

attend : Image ! Attention

An attention module attend[c] convolves every positionin the input image with a weight vector (distinct for eachc) to produce a heatmap or unnormalized attention. So,for example, the output of the module attend[dog] is amatrix whose entries should be in regions of the imagecontaining cats, and small everywhere else, as shown above.

Re-attention

re-attend : Attention ! Attention

A re-attention module re-attend[c] is essentially just amultilayer perceptron with rectified nonlinearities (ReLUs),performing a fully-connected mapping from one attentionto another. Again, the weights for this mapping are distinctfor each c. So re-attend[above] should take an attentionand shift the regions of greatest activation upward (asabove), while re-attend[not] should move attention awayfrom the active regions. For the experiments in this paper,the first fully-connected (FC) layer produces a vector ofsize 32, and the second is the same size as the input.

Combination

combine : Attention ⇥ Attention ! Attention

A combination module combine[c] merges two attentionsinto a single attention. For example, combine[and] shouldbe active only in the regions that are active in both inputs,while combine[except] should be active where the first in-put is active and the second is inactive.

Classification

classify : Image ⇥ Attention ! Label

A classification module classify[c] takes an attention andthe input image and maps them to a distribution over labels.For example, classify[color] should return a distributionover colors in the region attended to.

Measurement

measure : Attention ! Label

FC ReLU Softmax yesFC

A measurement module measure[c] takes an attention aloneand maps it to a distribution over labels. Because atten-tions passed between modules are unnormalized, measure issuitable for evaluating the existence of a detected object, orcounting sets of objects.

4.2. From strings to networksHaving built up an inventory of modules, we now need

to assemble them into the layout specified by the question.The transformation from a natural language question to aninstantiated neural network takes place in two steps. Firstwe map from natural language questions to layouts, whichspecify both the set of modules used to answer a given ques-tion, and the connections between them. Next we use theselayouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguis-tic resources to obtained structured representations of ques-tions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of thesystem.

Parsing We begin by parsing each question with the Stan-ford Parser [14]. to obtain a universal dependency represen-tation [4]. Dependency parses express grammatical rela-tions between parts of a sentence (e.g. between objects andtheir attributes, or events and their participants), and pro-vide a lightweight abstraction away from the surface form

(a) NMN for answering the question What color is his tie? The attend[tie] module first predicts a heatmap corresponding to the location of the tie. Next, the classify[color] module uses this heatmap to produce a weighted average of image features, which are finally used to predict an output label (here yellow).

(b) NMN for answering the question Is there a red shape above a circle? The two attend modules locate the red shapes and circles, the re-attend[above] module shifts the attention above the circles, the combine module computes their intersection, and the measure[is] module inspects the final attention and determines that it is non-empty (here yes).

Figure 2: Sample NMNs for question answering about natural images and shapes. For both examples, layouts, attentions, and answers are real predictions made by our model.
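For concreteness, the snippet below shows one way to obtain such a dependency analysis from Python. The paper itself used the (Java) Stanford Parser; the stanza library here is a stand-in assumption, chosen only because it exposes Universal Dependencies relations and lemmas directly.

# Illustrative only: stanza as a stand-in for the Stanford Parser.
# Requires a one-time stanza.download("en") to fetch the English models.
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")
doc = nlp("What color is the truck?")

for word in doc.sentences[0].words:
    head = doc.sentences[0].words[word.head - 1].text if word.head > 0 else "ROOT"
    print(f"{word.text:>6}  lemma={word.lemma:<6}  deprel={word.deprel:<10}  head={head}")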

Next, we filter the set of dependencies to those connected to the wh-word in the question (the exact distance we traverse varies depending on the task). This gives a simple symbolic form expressing (the primary) part of the sentence’s meaning. For example, what is standing in the field becomes what(stand); what color is the truck becomes color(truck), and is there a circle next to a square becomes is(circle, next-to(square)). In the process we also strip away function words like determiners and modals, so what type of cakes were they? and what type of cake is it? both get converted to type(cake). The code for transforming parse trees to structured queries will be provided in the accompanying software package.
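These symbolic forms are easy to manipulate programmatically. A toy representation (our own, not the paper's released code) is a nested tuple whose head is the predicate and whose remaining elements are its arguments; rendering it reproduces the notation above.

def render(query):
    """Pretty-print a nested (head, *args) query; leaves are bare strings."""
    if isinstance(query, str):
        return query
    head, *args = query
    return f"{head}({', '.join(render(a) for a in args)})"

print(render(("what", "stand")))                        # what(stand)
print(render(("color", "truck")))                       # color(truck)
print(render(("is", "circle", ("next-to", "square"))))  # is(circle, next-to(square))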

These representations bear a certain resemblance to pieces of a combinatory logic [18]: every leaf is implicitly a function taking the image as input, and the root represents the final value of the computation. But our approach, while compositional and combinatorial, is crucially not logical: the inferential computations operate on continuous representations produced by neural networks, becoming discrete only in the prediction of the final answer.

Layout These symbolic representations already determine the structure of the predicted networks, but not the identities of the modules that compose them. This final assignment of modules is fully determined by the structure of the parse. All leaves become attend modules, all internal nodes become re-attend or combine modules dependent on their arity, and root nodes become measure modules for yes/no questions and classify modules for all other question types.
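The rule above is mechanical enough to state as code. The sketch below applies it to the nested-tuple queries from the previous snippet; merging a two-argument root through a combine[and] module is our reading of Figure 2(b) rather than an explicit rule in the text.

def to_layout(query, yes_no=False, is_root=True):
    """Map a nested (head, *args) query to a module layout (illustrative)."""
    if isinstance(query, str):                 # leaves -> attend
        return ("attend", query)
    head, *args = query
    children = [to_layout(a, yes_no=yes_no, is_root=False) for a in args]
    if not is_root:                            # internal nodes, by arity
        return ("re-attend" if len(args) == 1 else "combine", head, *children)
    # Root: merge multiple arguments (as in Figure 2(b)), then measure/classify.
    merged = children[0] if len(children) == 1 else ("combine", "and", *children)
    return ("measure" if yes_no else "classify", head, merged)

print(to_layout(("color", "truck")))
# ('classify', 'color', ('attend', 'truck'))
print(to_layout(("is", "circle", ("next-to", "square")), yes_no=True))
# ('measure', 'is', ('combine', 'and', ('attend', 'circle'),
#                    ('re-attend', 'next-to', ('attend', 'square'))))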

Given the mapping from queries to network layouts described above, we have for each training example a network structure, an input image, and an output label. In many cases, these network structures are different, but have tied parameters. Networks which have the same high-level structure but different instantiations of individual modules (for example what color is the cat?—classify[color](attend[cat]) and where is the truck?—classify[where](attend[truck])) can be processed in the same batch, resulting in efficient computation.
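One simple way to realize this batching, sketched under the assumption that layouts are the nested tuples produced above, is to bucket training examples by the shape of their layout while ignoring the per-instance arguments such as color versus where:

from collections import defaultdict

def layout_shape(layout):
    """Structure of a layout with the per-instance arguments stripped out."""
    module, _arg, *children = layout
    return (module, tuple(layout_shape(c) for c in children))

def group_by_shape(examples):
    """examples: iterable of (layout, image, label) triples."""
    batches = defaultdict(list)
    for layout, image, label in examples:
        batches[layout_shape(layout)].append((layout, image, label))
    return batches

# classify[color](attend[cat]) and classify[where](attend[truck]) fall into
# the same bucket, so one instantiated structure can process the whole batch.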

As noted above, parts of this conversion process are task-specific—we found that relatively simple expressions were best for the natural image questions, while the shapes questions (by design) required deeper structures. Some summary statistics are provided in Table 1.

Generalizations It is easy to imagine applications where the input to the layout stage comes from something other than a natural language parser. Users of an image database, for example, might write SQL-like queries directly in order to specify their requirements precisely, e.g.

IS(cat) AND NOT(IS(dog))

or even mix visual and non-visual specifications in their queries:

IS(cat) and date > 2014-11-5

Indeed, it is possible to construct this kind of “visual SQL” using precisely the approach described in this paper—once our system is trained, the learned modules for attention, classification, etc. can be assembled by any kind of outside user, without relying on natural language specifically.
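Under that view, a hand-written query can be compiled straight into a module layout, bypassing the parser. The rendering below of IS(cat) AND NOT(IS(dog)) as a combine[except]-then-measure[is] layout is one plausible reading on our part, not a construction given in the paper; the non-visual date predicate from the second query would simply be evaluated outside the network.

# Hypothetical layout for IS(cat) AND NOT(IS(dog)), reusing the trained modules.
visual_sql_layout = (
    "measure", "is",
    ("combine", "except",          # active where the cat attention is active
        ("attend", "cat"),         # ...and the dog attention is not
        ("attend", "dog")),
)
print(visual_sql_layout)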

“What color is his tie?”

Page 10: Neural Modular Networks - GitHub Pages...The visual QA task has significant significant applications to human-robot interaction, search, and accessibility, and has been the subject

Deep Compositional Question Answering with Neural Module Networks

Jacob Andreas Marcus Rohrbach Trevor Darrell Dan KleinDepartment of Electrical Engineering and Computer Sciences

University of California, Berkeley{jda,rohrbach,trevor,klein}@{cs,eecs,eecs,cs}.berkeley.edu

Abstract

Visual question answering is fundamentally composi-

tional in nature—a question like where is the dog? shares

substructure with questions like what color is the dog? and

where is the cat? This paper seeks to simultaneously exploit

the representational capacity of deep networks and the com-

positional linguistic structure of questions. We describe a

procedure for constructing and learning neural module net-works, which compose collections of jointly-trained neural

“modules” into deep networks for question answering. Our

approach decomposes questions into their linguistic sub-

structures, and uses these structures to dynamically instan-

tiate modular networks (with reusable components for rec-

ognizing dogs, classifying colors, etc.). The resulting com-

pound networks are jointly trained. We evaluate our ap-

proach on two challenging datasets for visual question an-

swering, achieving state-of-the-art results on both the VQA

natural image dataset and a new dataset of complex ques-

tions about abstract shapes.

1. IntroductionThis paper describes an approach to visual question an-

swering based on neural module networks (NMNs). We an-swer natural language questions about images using collec-tions of jointly-trained neural “modules”, dynamically com-posed into deep networks based on linguistic structure.

Concretely, given an image and an associated question(e.g. where is the dog?), we wish to predict a correspondinganswer (e.g. on the couch, or perhaps just couch) (Figure 1).The visual QA task has significant significant applicationsto human-robot interaction, search, and accessibility, andhas been the subject of a great deal of recent research at-tention [2, 7, 20, 22, 25, 32]. The task requires sophisti-cated understanding of both visual scenes and natural lan-guage. Recent successful approaches represent questionsas bags of words, or encode the question using a recurrentneural network [22] and train a simple classifier on the en-coded question and image. In contrast to these monolithic

wherecount color ...

dog standing ...

LSTM couch

cat

CNN

Where is the dog?

LayoutParser

Figure 1: A schematic representation of our proposedmodel—the shaded gray area is a neural module network ofthe kind introduced in this paper. Our approach uses a nat-ural language parser to dynamically lay out a deep networkcomposed of reusable modules. For visual question answer-ing tasks, an additional sequence model provides sentencecontext and learns common-sense knowledge.

approaches, another line of work for textual QA [18] andimage QA [21] uses semantic parsers to decompose ques-tions into logical expressions. These logical expressionsare evaluated against a purely logical representation of theworld, which may be provided directly or extracted from animage [16].

In this paper we draw from both lines of research,presenting a technique for integrating the representationalpower of neural networks with the flexible compositionalstructure afforded by symbolic approaches to semantics.Rather than relying on a monolithic network structure toanswer all questions, our approach assembles a network onthe fly from a collection of specialized, jointly-learned mod-ules (Figure 1). Rather than using logic to reason over truthvalues, we remain entirely in the domain of visual featuresand attentions.

Our approach first analyzes each question with a seman-tic parser, and uses this analysis to determine the basic com-

arX

iv:1

511.

0279

9v4

[cs.C

V]

24 Ju

l 201

7

Composition!

Attention

attend : Image ! Attention

Convolution

attend[dog]

An attention module attend[c] convolves every positionin the input image with a weight vector (distinct for eachc) to produce a heatmap or unnormalized attention. So,for example, the output of the module attend[dog] is amatrix whose entries should be in regions of the imagecontaining cats, and small everywhere else, as shown above.

Re-attention

re-attend : Attention ! Attention

A re-attention module re-attend[c] is essentially just amultilayer perceptron with rectified nonlinearities (ReLUs),performing a fully-connected mapping from one attentionto another. Again, the weights for this mapping are distinctfor each c. So re-attend[above] should take an attentionand shift the regions of greatest activation upward (asabove), while re-attend[not] should move attention awayfrom the active regions. For the experiments in this paper,the first fully-connected (FC) layer produces a vector ofsize 32, and the second is the same size as the input.

Combination

combine : Attention ⇥ Attention ! Attention

A combination module combine[c] merges two attentionsinto a single attention. For example, combine[and] shouldbe active only in the regions that are active in both inputs,while combine[except] should be active where the first in-put is active and the second is inactive.

Classification

classify : Image ⇥ Attention ! Label

A classification module classify[c] takes an attention andthe input image and maps them to a distribution over labels.For example, classify[color] should return a distributionover colors in the region attended to.

Measurement

measure : Attention ! Label

A measurement module measure[c] takes an attention aloneand maps it to a distribution over labels. Because atten-tions passed between modules are unnormalized, measure issuitable for evaluating the existence of a detected object, orcounting sets of objects.

4.2. From strings to networksHaving built up an inventory of modules, we now need

to assemble them into the layout specified by the question.The transformation from a natural language question to aninstantiated neural network takes place in two steps. Firstwe map from natural language questions to layouts, whichspecify both the set of modules used to answer a given ques-tion, and the connections between them. Next we use theselayouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguis-tic resources to obtained structured representations of ques-tions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of thesystem.

Parsing We begin by parsing each question with the Stan-ford Parser [14]. to obtain a universal dependency represen-tation [4]. Dependency parses express grammatical rela-tions between parts of a sentence (e.g. between objects andtheir attributes, or events and their participants), and pro-vide a lightweight abstraction away from the surface form

“Learned programs”

Attention

attend : Image ! Attention

An attention module attend[c] convolves every positionin the input image with a weight vector (distinct for eachc) to produce a heatmap or unnormalized attention. So,for example, the output of the module attend[dog] is amatrix whose entries should be in regions of the imagecontaining cats, and small everywhere else, as shown above.

Re-attention

re-attend : Attention ! Attention

FC ReLU

re-attend[above]

×2

A re-attention module re-attend[c] is essentially just amultilayer perceptron with rectified nonlinearities (ReLUs),performing a fully-connected mapping from one attentionto another. Again, the weights for this mapping are distinctfor each c. So re-attend[above] should take an attentionand shift the regions of greatest activation upward (asabove), while re-attend[not] should move attention awayfrom the active regions. For the experiments in this paper,the first fully-connected (FC) layer produces a vector ofsize 32, and the second is the same size as the input.

Combination

combine : Attention ⇥ Attention ! Attention

A combination module combine[c] merges two attentionsinto a single attention. For example, combine[and] shouldbe active only in the regions that are active in both inputs,while combine[except] should be active where the first in-put is active and the second is inactive.

Classification

classify : Image ⇥ Attention ! Label

A classification module classify[c] takes an attention andthe input image and maps them to a distribution over labels.For example, classify[color] should return a distributionover colors in the region attended to.

Measurement

measure : Attention ! Label

A measurement module measure[c] takes an attention aloneand maps it to a distribution over labels. Because atten-tions passed between modules are unnormalized, measure issuitable for evaluating the existence of a detected object, orcounting sets of objects.

4.2. From strings to networksHaving built up an inventory of modules, we now need

to assemble them into the layout specified by the question.The transformation from a natural language question to aninstantiated neural network takes place in two steps. Firstwe map from natural language questions to layouts, whichspecify both the set of modules used to answer a given ques-tion, and the connections between them. Next we use theselayouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguis-tic resources to obtained structured representations of ques-tions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of thesystem.

Parsing We begin by parsing each question with the Stan-ford Parser [14]. to obtain a universal dependency represen-tation [4]. Dependency parses express grammatical rela-tions between parts of a sentence (e.g. between objects andtheir attributes, or events and their participants), and pro-vide a lightweight abstraction away from the surface form

Attention

attend : Image ! Attention

An attention module attend[c] convolves every positionin the input image with a weight vector (distinct for eachc) to produce a heatmap or unnormalized attention. So,for example, the output of the module attend[dog] is amatrix whose entries should be in regions of the imagecontaining cats, and small everywhere else, as shown above.

Re-attention

re-attend : Attention ! Attention

A re-attention module re-attend[c] is essentially just amultilayer perceptron with rectified nonlinearities (ReLUs),performing a fully-connected mapping from one attentionto another. Again, the weights for this mapping are distinctfor each c. So re-attend[above] should take an attentionand shift the regions of greatest activation upward (asabove), while re-attend[not] should move attention awayfrom the active regions. For the experiments in this paper,the first fully-connected (FC) layer produces a vector ofsize 32, and the second is the same size as the input.

Combination

combine : Attention ⇥ Attention ! Attention

Stack Conv. ReLU

combine[except]

A combination module combine[c] merges two attentionsinto a single attention. For example, combine[and] shouldbe active only in the regions that are active in both inputs,while combine[except] should be active where the first in-put is active and the second is inactive.

Classification

classify : Image ⇥ Attention ! Label

A classification module classify[c] takes an attention andthe input image and maps them to a distribution over labels.For example, classify[color] should return a distributionover colors in the region attended to.

Measurement

measure : Attention ! Label

A measurement module measure[c] takes an attention aloneand maps it to a distribution over labels. Because atten-tions passed between modules are unnormalized, measure issuitable for evaluating the existence of a detected object, orcounting sets of objects.

4.2. From strings to networksHaving built up an inventory of modules, we now need

to assemble them into the layout specified by the question.The transformation from a natural language question to aninstantiated neural network takes place in two steps. Firstwe map from natural language questions to layouts, whichspecify both the set of modules used to answer a given ques-tion, and the connections between them. Next we use theselayouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguis-tic resources to obtained structured representations of ques-tions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of thesystem.

Parsing We begin by parsing each question with the Stan-ford Parser [14]. to obtain a universal dependency represen-tation [4]. Dependency parses express grammatical rela-tions between parts of a sentence (e.g. between objects andtheir attributes, or events and their participants), and pro-vide a lightweight abstraction away from the surface form

Attention

attend : Image ! Attention

An attention module attend[c] convolves every positionin the input image with a weight vector (distinct for eachc) to produce a heatmap or unnormalized attention. So,for example, the output of the module attend[dog] is amatrix whose entries should be in regions of the imagecontaining cats, and small everywhere else, as shown above.

Re-attention

re-attend : Attention ! Attention

A re-attention module re-attend[c] is essentially just amultilayer perceptron with rectified nonlinearities (ReLUs),performing a fully-connected mapping from one attentionto another. Again, the weights for this mapping are distinctfor each c. So re-attend[above] should take an attentionand shift the regions of greatest activation upward (asabove), while re-attend[not] should move attention awayfrom the active regions. For the experiments in this paper,the first fully-connected (FC) layer produces a vector ofsize 32, and the second is the same size as the input.

Combination

combine : Attention ⇥ Attention ! Attention

A combination module combine[c] merges two attentionsinto a single attention. For example, combine[and] shouldbe active only in the regions that are active in both inputs,while combine[except] should be active where the first in-put is active and the second is inactive.

Classification

classify : Image ⇥ Attention ! Label

Attend FC Softmax couch

A classification module classify[c] takes an attention andthe input image and maps them to a distribution over labels.For example, classify[color] should return a distributionover colors in the region attended to.

Measurement

measure : Attention ! Label

A measurement module measure[c] takes an attention aloneand maps it to a distribution over labels. Because atten-tions passed between modules are unnormalized, measure issuitable for evaluating the existence of a detected object, orcounting sets of objects.

4.2. From strings to networksHaving built up an inventory of modules, we now need

to assemble them into the layout specified by the question.The transformation from a natural language question to aninstantiated neural network takes place in two steps. Firstwe map from natural language questions to layouts, whichspecify both the set of modules used to answer a given ques-tion, and the connections between them. Next we use theselayouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguis-tic resources to obtained structured representations of ques-tions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of thesystem.

Parsing We begin by parsing each question with the Stan-ford Parser [14]. to obtain a universal dependency represen-tation [4]. Dependency parses express grammatical rela-tions between parts of a sentence (e.g. between objects andtheir attributes, or events and their participants), and pro-vide a lightweight abstraction away from the surface form

Attention

attend : Image ! Attention

An attention module attend[c] convolves every positionin the input image with a weight vector (distinct for eachc) to produce a heatmap or unnormalized attention. So,for example, the output of the module attend[dog] is amatrix whose entries should be in regions of the imagecontaining cats, and small everywhere else, as shown above.

Re-attention

re-attend : Attention ! Attention

A re-attention module re-attend[c] is essentially just amultilayer perceptron with rectified nonlinearities (ReLUs),performing a fully-connected mapping from one attentionto another. Again, the weights for this mapping are distinctfor each c. So re-attend[above] should take an attentionand shift the regions of greatest activation upward (asabove), while re-attend[not] should move attention awayfrom the active regions. For the experiments in this paper,the first fully-connected (FC) layer produces a vector ofsize 32, and the second is the same size as the input.

Combination

combine : Attention ⇥ Attention ! Attention

A combination module combine[c] merges two attentionsinto a single attention. For example, combine[and] shouldbe active only in the regions that are active in both inputs,while combine[except] should be active where the first in-put is active and the second is inactive.

Classification

classify : Image ⇥ Attention ! Label

A classification module classify[c] takes an attention andthe input image and maps them to a distribution over labels.For example, classify[color] should return a distributionover colors in the region attended to.

Measurement

measure : Attention ! Label

FC ReLU Softmax yesFC

A measurement module measure[c] takes an attention aloneand maps it to a distribution over labels. Because atten-tions passed between modules are unnormalized, measure issuitable for evaluating the existence of a detected object, orcounting sets of objects.

4.2. From strings to networksHaving built up an inventory of modules, we now need

to assemble them into the layout specified by the question.The transformation from a natural language question to aninstantiated neural network takes place in two steps. Firstwe map from natural language questions to layouts, whichspecify both the set of modules used to answer a given ques-tion, and the connections between them. Next we use theselayouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguis-tic resources to obtained structured representations of ques-tions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of thesystem.

Parsing We begin by parsing each question with the Stan-ford Parser [14]. to obtain a universal dependency represen-tation [4]. Dependency parses express grammatical rela-tions between parts of a sentence (e.g. between objects andtheir attributes, or events and their participants), and pro-vide a lightweight abstraction away from the surface form

“Is there a red shape above a circle?”

(a) NMN for answering the question What color is his

tie? The attend[tie] module first predicts a heatmapcorresponding to the location of the tie. Next, theclassify[color] module uses this heatmap to pro-duce a weighted average of image features, which arefinally used to predict an output label.

re-attend[above]

yes

attend[circle]

combine[and]

attend[red]

measure[is]

(b) NMN for answering the question Is there a red shape above a cir-

cle? The two attend modules locate the red shapes and circles, there-attend[above] shifts the attention above the circles, the combine mod-ule computes their intersection, and the measure[is] module inspects thefinal attention and determines that it is non-empty.

Figure 2: Sample NMNs for question answering about natural images and shapes. For both examples, layouts, attentions,and answers are real predictions made by our model.

of the sentence. The parser also performs basic lemmati-zation, for example turning kites into kite and were into be.This reduces sparsity of module instances.

Next, we filter the set of dependencies to those con-nected the wh-word in the question (the exact distance wetraverse varies depending on the task). This gives a sim-ple symbolic form expressing (the primary) part of the sen-tence’s meaning. For example, what is standing in the

field becomes what(stand); what color is the truck becomescolor(truck), and is there a circle next to a square becomesis(circle, next-to(square)). In the process we also stripaway function words like determiners and modals, so what

type of cakes were they? and what type of cake is it? bothget converted to type(cake). The code for transformingparse trees to structured queries will be provided in the ac-companying software package.

These representations bear a certain resemblance topieces of a combinatory logic [18]: every leaf is implicitlya function taking the image as input, and the root representsthe final value of the computation. But our approach, whilecompositional and combinatorial, is crucially not logical:the inferential computations operate on continuous repre-sentations produced by neural networks, becoming discreteonly in the prediction of the final answer.

Layout These symbolic representations already deter-mine the structure of the predicted networks, but not theidentities of the modules that compose them. This final as-signment of modules is fully determined by the structureof the parse. All leaves become attend modules, all inter-nal nodes become re-attend or combine modules dependenton their arity, and root nodes become measure modules foryes/no questions and classify modules for all other ques-tion types.

Given the mapping from queries to network layouts de-scribed above, we have for each training example a net-work structure, an input image, and an output label. Inmany cases, these network structures are different, buthave tied parameters. Networks which have the samehigh-level structure but different instantiations of indi-vidual modules (for example what color is the cat?—classify[color](attend[cat]) and where is the truck?—classify[where](attend[truck])) can be processed in thesame batch, resulting in efficient computation.

As noted above, parts of this conversion process are task-specific—we found that relatively simple expressions werebest for the natural image questions, while the shapes ques-tion (by design) required deeper structures. Some summarystatistics are provided in Table 1.

Generalizations It is easy to imagine applications wherethe input to the layout stage comes from something otherthan a natural language parser. Users of an image database,for example, might write SQL-like queries directly in orderto specify their requirements precisely, e.g.

IS(cat) AND NOT(IS(dog))

or even mix visual and non-visual specifications in theirqueries:

IS(cat) and date > 2014-11-5

Indeed, it is possible to construct this kind of “visualSQL” using precisely the approach described in this paper—once our system is trained, the learned modules for atten-tion, classification, etc. can be assembled by any kind ofoutside user, without relying on natural language specifi-cally.

Page 11: Neural Modular Networks - GitHub Pages...The visual QA task has significant significant applications to human-robot interaction, search, and accessibility, and has been the subject

Deep Compositional Question Answering with Neural Module Networks

Jacob Andreas Marcus Rohrbach Trevor Darrell Dan KleinDepartment of Electrical Engineering and Computer Sciences

University of California, Berkeley{jda,rohrbach,trevor,klein}@{cs,eecs,eecs,cs}.berkeley.edu

Abstract

Visual question answering is fundamentally composi-

tional in nature—a question like where is the dog? shares

substructure with questions like what color is the dog? and

where is the cat? This paper seeks to simultaneously exploit

the representational capacity of deep networks and the com-

positional linguistic structure of questions. We describe a

procedure for constructing and learning neural module net-works, which compose collections of jointly-trained neural

“modules” into deep networks for question answering. Our

approach decomposes questions into their linguistic sub-

structures, and uses these structures to dynamically instan-

tiate modular networks (with reusable components for rec-

ognizing dogs, classifying colors, etc.). The resulting com-

pound networks are jointly trained. We evaluate our ap-

proach on two challenging datasets for visual question an-

swering, achieving state-of-the-art results on both the VQA

natural image dataset and a new dataset of complex ques-

tions about abstract shapes.

1. IntroductionThis paper describes an approach to visual question an-

swering based on neural module networks (NMNs). We an-swer natural language questions about images using collec-tions of jointly-trained neural “modules”, dynamically com-posed into deep networks based on linguistic structure.

Concretely, given an image and an associated question(e.g. where is the dog?), we wish to predict a correspondinganswer (e.g. on the couch, or perhaps just couch) (Figure 1).The visual QA task has significant significant applicationsto human-robot interaction, search, and accessibility, andhas been the subject of a great deal of recent research at-tention [2, 7, 20, 22, 25, 32]. The task requires sophisti-cated understanding of both visual scenes and natural lan-guage. Recent successful approaches represent questionsas bags of words, or encode the question using a recurrentneural network [22] and train a simple classifier on the en-coded question and image. In contrast to these monolithic

wherecount color ...

dog standing ...

LSTM couch

cat

CNN

Where is the dog?

LayoutParser

Figure 1: A schematic representation of our proposedmodel—the shaded gray area is a neural module network ofthe kind introduced in this paper. Our approach uses a nat-ural language parser to dynamically lay out a deep networkcomposed of reusable modules. For visual question answer-ing tasks, an additional sequence model provides sentencecontext and learns common-sense knowledge.

approaches, another line of work for textual QA [18] andimage QA [21] uses semantic parsers to decompose ques-tions into logical expressions. These logical expressionsare evaluated against a purely logical representation of theworld, which may be provided directly or extracted from animage [16].

In this paper we draw from both lines of research,presenting a technique for integrating the representationalpower of neural networks with the flexible compositionalstructure afforded by symbolic approaches to semantics.Rather than relying on a monolithic network structure toanswer all questions, our approach assembles a network onthe fly from a collection of specialized, jointly-learned mod-ules (Figure 1). Rather than using logic to reason over truthvalues, we remain entirely in the domain of visual featuresand attentions.

Our approach first analyzes each question with a seman-tic parser, and uses this analysis to determine the basic com-

arX

iv:1

511.

0279

9v4

[cs.C

V]

24 Ju

l 201

7

Training

Attention

attend : Image ! Attention

Convolution

attend[dog]

An attention module attend[c] convolves every positionin the input image with a weight vector (distinct for eachc) to produce a heatmap or unnormalized attention. So,for example, the output of the module attend[dog] is amatrix whose entries should be in regions of the imagecontaining cats, and small everywhere else, as shown above.

Re-attention

re-attend : Attention ! Attention

A re-attention module re-attend[c] is essentially just amultilayer perceptron with rectified nonlinearities (ReLUs),performing a fully-connected mapping from one attentionto another. Again, the weights for this mapping are distinctfor each c. So re-attend[above] should take an attentionand shift the regions of greatest activation upward (asabove), while re-attend[not] should move attention awayfrom the active regions. For the experiments in this paper,the first fully-connected (FC) layer produces a vector ofsize 32, and the second is the same size as the input.

Combination

combine : Attention ⇥ Attention ! Attention

A combination module combine[c] merges two attentionsinto a single attention. For example, combine[and] shouldbe active only in the regions that are active in both inputs,while combine[except] should be active where the first in-put is active and the second is inactive.

Classification

classify : Image ⇥ Attention ! Label

A classification module classify[c] takes an attention andthe input image and maps them to a distribution over labels.For example, classify[color] should return a distributionover colors in the region attended to.

Measurement

measure : Attention ! Label

A measurement module measure[c] takes an attention aloneand maps it to a distribution over labels. Because atten-tions passed between modules are unnormalized, measure issuitable for evaluating the existence of a detected object, orcounting sets of objects.

4.2. From strings to networksHaving built up an inventory of modules, we now need

to assemble them into the layout specified by the question.The transformation from a natural language question to aninstantiated neural network takes place in two steps. Firstwe map from natural language questions to layouts, whichspecify both the set of modules used to answer a given ques-tion, and the connections between them. Next we use theselayouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguis-tic resources to obtained structured representations of ques-tions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of thesystem.

Parsing We begin by parsing each question with the Stan-ford Parser [14]. to obtain a universal dependency represen-tation [4]. Dependency parses express grammatical rela-tions between parts of a sentence (e.g. between objects andtheir attributes, or events and their participants), and pro-vide a lightweight abstraction away from the surface form

Attention

attend : Image ! Attention

An attention module attend[c] convolves every positionin the input image with a weight vector (distinct for eachc) to produce a heatmap or unnormalized attention. So,for example, the output of the module attend[dog] is amatrix whose entries should be in regions of the imagecontaining cats, and small everywhere else, as shown above.

Re-attention

re-attend : Attention ! Attention

FC ReLU

re-attend[above]

×2

A re-attention module re-attend[c] is essentially just amultilayer perceptron with rectified nonlinearities (ReLUs),performing a fully-connected mapping from one attentionto another. Again, the weights for this mapping are distinctfor each c. So re-attend[above] should take an attentionand shift the regions of greatest activation upward (asabove), while re-attend[not] should move attention awayfrom the active regions. For the experiments in this paper,the first fully-connected (FC) layer produces a vector ofsize 32, and the second is the same size as the input.

Combination

combine : Attention ⇥ Attention ! Attention

A combination module combine[c] merges two attentionsinto a single attention. For example, combine[and] shouldbe active only in the regions that are active in both inputs,while combine[except] should be active where the first in-put is active and the second is inactive.

Classification

classify : Image ⇥ Attention ! Label

A classification module classify[c] takes an attention andthe input image and maps them to a distribution over labels.For example, classify[color] should return a distributionover colors in the region attended to.

Measurement

measure : Attention ! Label

A measurement module measure[c] takes an attention aloneand maps it to a distribution over labels. Because atten-tions passed between modules are unnormalized, measure issuitable for evaluating the existence of a detected object, orcounting sets of objects.

4.2. From strings to networksHaving built up an inventory of modules, we now need

to assemble them into the layout specified by the question.The transformation from a natural language question to aninstantiated neural network takes place in two steps. Firstwe map from natural language questions to layouts, whichspecify both the set of modules used to answer a given ques-tion, and the connections between them. Next we use theselayouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguis-tic resources to obtained structured representations of ques-tions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of thesystem.

Parsing We begin by parsing each question with the Stan-ford Parser [14]. to obtain a universal dependency represen-tation [4]. Dependency parses express grammatical rela-tions between parts of a sentence (e.g. between objects andtheir attributes, or events and their participants), and pro-vide a lightweight abstraction away from the surface form

Attention

attend : Image ! Attention

An attention module attend[c] convolves every positionin the input image with a weight vector (distinct for eachc) to produce a heatmap or unnormalized attention. So,for example, the output of the module attend[dog] is amatrix whose entries should be in regions of the imagecontaining cats, and small everywhere else, as shown above.

Re-attention

re-attend : Attention ! Attention

A re-attention module re-attend[c] is essentially just amultilayer perceptron with rectified nonlinearities (ReLUs),performing a fully-connected mapping from one attentionto another. Again, the weights for this mapping are distinctfor each c. So re-attend[above] should take an attentionand shift the regions of greatest activation upward (asabove), while re-attend[not] should move attention awayfrom the active regions. For the experiments in this paper,the first fully-connected (FC) layer produces a vector ofsize 32, and the second is the same size as the input.

Combination

combine : Attention ⇥ Attention ! Attention

Stack Conv. ReLU

combine[except]

A combination module combine[c] merges two attentionsinto a single attention. For example, combine[and] shouldbe active only in the regions that are active in both inputs,while combine[except] should be active where the first in-put is active and the second is inactive.

Classification

classify : Image ⇥ Attention ! Label

A classification module classify[c] takes an attention andthe input image and maps them to a distribution over labels.For example, classify[color] should return a distributionover colors in the region attended to.

Measurement

measure : Attention ! Label

A measurement module measure[c] takes an attention aloneand maps it to a distribution over labels. Because atten-tions passed between modules are unnormalized, measure issuitable for evaluating the existence of a detected object, orcounting sets of objects.

4.2. From strings to networksHaving built up an inventory of modules, we now need

to assemble them into the layout specified by the question.The transformation from a natural language question to aninstantiated neural network takes place in two steps. Firstwe map from natural language questions to layouts, whichspecify both the set of modules used to answer a given ques-tion, and the connections between them. Next we use theselayouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguis-tic resources to obtained structured representations of ques-tions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of thesystem.

Parsing We begin by parsing each question with the Stan-ford Parser [14]. to obtain a universal dependency represen-tation [4]. Dependency parses express grammatical rela-tions between parts of a sentence (e.g. between objects andtheir attributes, or events and their participants), and pro-vide a lightweight abstraction away from the surface form

Attention

attend : Image ! Attention

An attention module attend[c] convolves every positionin the input image with a weight vector (distinct for eachc) to produce a heatmap or unnormalized attention. So,for example, the output of the module attend[dog] is amatrix whose entries should be in regions of the imagecontaining cats, and small everywhere else, as shown above.

Re-attention

re-attend : Attention ! Attention

A re-attention module re-attend[c] is essentially just amultilayer perceptron with rectified nonlinearities (ReLUs),performing a fully-connected mapping from one attentionto another. Again, the weights for this mapping are distinctfor each c. So re-attend[above] should take an attentionand shift the regions of greatest activation upward (asabove), while re-attend[not] should move attention awayfrom the active regions. For the experiments in this paper,the first fully-connected (FC) layer produces a vector ofsize 32, and the second is the same size as the input.

Combination

combine : Attention ⇥ Attention ! Attention

A combination module combine[c] merges two attentionsinto a single attention. For example, combine[and] shouldbe active only in the regions that are active in both inputs,while combine[except] should be active where the first in-put is active and the second is inactive.

Classification

classify : Image ⇥ Attention ! Label

Attend FC Softmax couch

A classification module classify[c] takes an attention andthe input image and maps them to a distribution over labels.For example, classify[color] should return a distributionover colors in the region attended to.

Measurement

measure : Attention ! Label

A measurement module measure[c] takes an attention aloneand maps it to a distribution over labels. Because atten-tions passed between modules are unnormalized, measure issuitable for evaluating the existence of a detected object, orcounting sets of objects.

4.2. From strings to networksHaving built up an inventory of modules, we now need

to assemble them into the layout specified by the question.The transformation from a natural language question to aninstantiated neural network takes place in two steps. Firstwe map from natural language questions to layouts, whichspecify both the set of modules used to answer a given ques-tion, and the connections between them. Next we use theselayouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguis-tic resources to obtained structured representations of ques-tions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of thesystem.

Parsing We begin by parsing each question with the Stan-ford Parser [14]. to obtain a universal dependency represen-tation [4]. Dependency parses express grammatical rela-tions between parts of a sentence (e.g. between objects andtheir attributes, or events and their participants), and pro-vide a lightweight abstraction away from the surface form

Attention

attend : Image ! Attention

An attention module attend[c] convolves every positionin the input image with a weight vector (distinct for eachc) to produce a heatmap or unnormalized attention. So,for example, the output of the module attend[dog] is amatrix whose entries should be in regions of the imagecontaining cats, and small everywhere else, as shown above.

Re-attention

re-attend : Attention ! Attention

A re-attention module re-attend[c] is essentially just amultilayer perceptron with rectified nonlinearities (ReLUs),performing a fully-connected mapping from one attentionto another. Again, the weights for this mapping are distinctfor each c. So re-attend[above] should take an attentionand shift the regions of greatest activation upward (asabove), while re-attend[not] should move attention awayfrom the active regions. For the experiments in this paper,the first fully-connected (FC) layer produces a vector ofsize 32, and the second is the same size as the input.

Combination

combine : Attention ⇥ Attention ! Attention

A combination module combine[c] merges two attentionsinto a single attention. For example, combine[and] shouldbe active only in the regions that are active in both inputs,while combine[except] should be active where the first in-put is active and the second is inactive.

Classification

classify : Image ⇥ Attention ! Label

A classification module classify[c] takes an attention andthe input image and maps them to a distribution over labels.For example, classify[color] should return a distributionover colors in the region attended to.

Measurement

measure : Attention ! Label

FC ReLU Softmax yesFC

A measurement module measure[c] takes an attention aloneand maps it to a distribution over labels. Because atten-tions passed between modules are unnormalized, measure issuitable for evaluating the existence of a detected object, orcounting sets of objects.

4.2. From strings to networksHaving built up an inventory of modules, we now need

to assemble them into the layout specified by the question.The transformation from a natural language question to aninstantiated neural network takes place in two steps. Firstwe map from natural language questions to layouts, whichspecify both the set of modules used to answer a given ques-tion, and the connections between them. Next we use theselayouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguis-tic resources to obtained structured representations of ques-tions. Future work might focus on learning (or at least fine-tuning) this prediction process jointly with the rest of thesystem.

Parsing We begin by parsing each question with the Stanford Parser [14] to obtain a universal dependency representation [4]. Dependency parses express grammatical relations between parts of a sentence (e.g. between objects and their attributes, or events and their participants), and provide a lightweight abstraction away from the surface form of the sentence. The parser also performs basic lemmatization, for example turning kites into kite and were into be. This reduces sparsity of module instances.

(a) NMN for answering the question What color is his tie? The attend[tie] module first predicts a heatmap corresponding to the location of the tie. Next, the classify[color] module uses this heatmap to produce a weighted average of image features, which are finally used to predict an output label (yellow).

(b) NMN for answering the question Is there a red shape above a circle? The two attend modules locate the red shapes and circles, the re-attend[above] module shifts the attention above the circles, the combine module computes their intersection, and the measure[is] module inspects the final attention and determines that it is non-empty (yes).

Figure 2: Sample NMNs for question answering about natural images and shapes. For both examples, layouts, attentions, and answers are real predictions made by our model.

Next, we filter the set of dependencies to those connected to the wh-word in the question (the exact distance we traverse varies depending on the task). This gives a simple symbolic form expressing (the primary) part of the sentence's meaning. For example, what is standing in the field becomes what(stand); what color is the truck becomes color(truck), and is there a circle next to a square becomes is(circle, next-to(square)). In the process we also strip away function words like determiners and modals, so what type of cakes were they? and what type of cake is it? both get converted to type(cake). The code for transforming parse trees to structured queries will be provided in the accompanying software package.
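The actual parse-to-query code ships with the accompanying software package; the snippet below is only a toy illustration of the target symbolic forms described above, written as nested tuples.

    # Toy examples of the symbolic query forms described in the text,
    # written as nested (head, arg1, arg2, ...) tuples.
    QUERIES = {
        "what is standing in the field":      ("what", "stand"),
        "what color is the truck":            ("color", "truck"),
        "is there a circle next to a square": ("is", "circle", ("next-to", "square")),
        "what type of cakes were they":       ("type", "cake"),
    }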

These representations bear a certain resemblance to pieces of a combinatory logic [18]: every leaf is implicitly a function taking the image as input, and the root represents the final value of the computation. But our approach, while compositional and combinatorial, is crucially not logical: the inferential computations operate on continuous representations produced by neural networks, becoming discrete only in the prediction of the final answer.

Layout These symbolic representations already determine the structure of the predicted networks, but not the identities of the modules that compose them. This final assignment of modules is fully determined by the structure of the parse. All leaves become attend modules, all internal nodes become re-attend or combine modules depending on their arity, and root nodes become measure modules for yes/no questions and classify modules for all other question types.
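A simplified sketch of this layout rule, applied to the nested-tuple queries from the previous snippet; the implicit combine[and] for multi-argument nodes follows the example in Figure 2b, and the yes_no flag stands in for the question-type check.

    def layout(query, is_root=True, yes_no=False):
        """Assign module types to a nested-tuple query (simplified sketch).
        Returns a nested (module_type, concept, children) structure."""
        if isinstance(query, str):                 # leaves -> attend modules
            return ("attend", query, [])
        head, *args = query
        children = [layout(a, is_root=False) for a in args]
        if len(children) > 1:                      # merge multiple arguments
            children = [("combine", "and", children)]
        if is_root:                                # roots -> measure / classify
            return ("measure" if yes_no else "classify", head, children)
        return ("re-attend", head, children)       # other internal nodes

    # "is there a red shape above a circle?" -> is(red, above(circle))
    print(layout(("is", "red", ("above", "circle")), yes_no=True))
    # ('measure', 'is', [('combine', 'and', [('attend', 'red', []),
    #   ('re-attend', 'above', [('attend', 'circle', [])])])])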

Given the mapping from queries to network layouts described above, we have for each training example a network structure, an input image, and an output label. In many cases, these network structures are different, but have tied parameters. Networks which have the same high-level structure but different instantiations of individual modules (for example what color is the cat?, i.e. classify[color](attend[cat]), and where is the truck?, i.e. classify[where](attend[truck])) can be processed in the same batch, resulting in efficient computation.
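A small sketch of how such batching might be organized: two layouts can share a batch when their module-type skeletons match, even if the instantiated concepts differ. The example-dictionary fields are assumptions for illustration.

    from collections import defaultdict

    def structure_signature(node):
        """Module-type skeleton of a layout, ignoring concept labels, so that
        classify[color](attend[cat]) and classify[where](attend[truck])
        produce the same signature."""
        module_type, _concept, children = node
        return (module_type, tuple(structure_signature(c) for c in children))

    def group_for_batching(examples):
        """Group {'layout', 'image', 'label'} training examples by structure."""
        batches = defaultdict(list)
        for example in examples:
            batches[structure_signature(example["layout"])].append(example)
        return batches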

As noted above, parts of this conversion process are task-specific: we found that relatively simple expressions were best for the natural image questions, while the shapes questions (by design) required deeper structures. Some summary statistics are provided in Table 1.

Generalizations It is easy to imagine applications where the input to the layout stage comes from something other than a natural language parser. Users of an image database, for example, might write SQL-like queries directly in order to specify their requirements precisely, e.g.

IS(cat) AND NOT(IS(dog))

or even mix visual and non-visual specifications in their queries:

IS(cat) and date > 2014-11-5

Indeed, it is possible to construct this kind of "visual SQL" using precisely the approach described in this paper: once our system is trained, the learned modules for attention, classification, etc. can be assembled by any kind of outside user, without relying on natural language specifically.


Train multiple graphs at once with shared modules.

Individual models learn through their composition.

No pre-training

Page 12

Evaluation Metrics and Results

Ø Accuracy on VQA benchmarks
Ø Existing benchmarks only require limited reasoning…

Ø Introduce new Shapes Benchmark

                 size 4   size 5   size 6   All
    Majority      64.4     62.5     61.7    63.0
    VIS+LSTM      71.9     62.5     61.7    65.3
    NMN           89.7     92.4     85.2    90.6
    NMN (easy)    97.7     91.1     89.7    90.8

Table 2: Results on the SHAPES dataset. Here "size" is the number of modules needed to instantiate an appropriate NMN. Our model achieves high accuracy and outperforms a baseline from previous work, especially on highly compositional questions. "NMN (easy)" is trained on a modified training set with no size-6 questions; these results demonstrate that our model is able to generalize to questions more complicated than it has ever seen at training time.

dataset, and we hope that SHAPES will continue to be used in conjunction with natural image datasets.

To produce an initial set of image features, we pass the input image through the convolutional portion of a LeNet [17] which is jointly trained with the question-answering part of the model. We compare our approach to a reimplementation of the VIS+LSTM baseline similar to the one described by [26], again swapping out the pre-trained image embedding with a LeNet.
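A sketch of such a convolutional front end; the layer sizes and the 30×30 input are illustrative guesses, and in the model this stack is trained jointly with the modules rather than pre-trained.

    import torch
    import torch.nn as nn

    # LeNet-style convolutional feature extractor (sizes are illustrative).
    lenet_conv = nn.Sequential(
        nn.Conv2d(3, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    )

    image = torch.randn(1, 3, 30, 30)   # a small SHAPES-style image
    features = lenet_conv(image)        # spatial feature map, e.g. (1, 16, 4, 4)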

As can be seen in Table 2, our model achieves excellent performance on this dataset, while the VIS+LSTM baseline fares little better than a majority guesser. Moreover, the color detectors and attention transformations behave as expected (Figure 2b), indicating that our joint training procedure correctly allocates responsibilities among modules. This confirms that our approach is able to model complex compositional phenomena outside the capacity of previous approaches to visual question answering.

We perform an additional experiment on a modified version of the training set, which contains no size-6 questions (i.e. questions whose corresponding NMN has 6 modules). Here our performance does not suffer at all, and perhaps increases slightly; this demonstrates that our model is able to generalize to questions even more complicated than those it has seen during training. Using only linguistic information, the model extrapolates simple visual patterns it has learned to even harder questions.

7. Experiments: natural images

Next we consider the model's ability to handle hard perceptual problems involving natural images. Here we evaluate on the recently-released VQA dataset. This is the largest resource of its kind, consisting of more than 200,000 images, each paired with three questions and ten answers per question. Data was generated by human annotators, in contrast to previous work, which has generated questions automatically from captions [26]. We learn our model using the standard train/test split, training only with those answers marked as high confidence. The visual input to the NMN is the conv5 layer of a 16-layer VGGNet [27] after max-pooling. We do not fine-tune the VGGNet.

                      test-dev                              test
                  Yes/No   Number   Other   All             All
    LSTM [2]       78.20    35.7     26.6    48.8             –
    VIS+LSTM [2]   78.9     35.2     36.4    53.7            54.1
    NMN            69.38    30.7     22.7    42.7             –
    NMN+LSTM       77.7     37.2     39.3    54.8            55.1

Table 3: Results on the VQA test server. NMN+LSTM is the full model shown in Figure 1, while NMN is an ablation experiment with no whole-question LSTM. The full model outperforms previous approaches, scoring particularly well on questions not involving a binary decision. Baseline numbers are as reported in previous work.
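A sketch of the frozen feature extraction described above, using torchvision's VGG-16; loading ImageNet weights and the 448×448 input resolution are assumptions for illustration.

    import torch
    import torchvision

    # Convolutional part of a 16-layer VGGNet; in practice the ImageNet-pretrained
    # weights would be loaded. The network is frozen (not fine-tuned).
    vgg_conv = torchvision.models.vgg16().features.eval()
    for p in vgg_conv.parameters():
        p.requires_grad = False

    image = torch.randn(1, 3, 448, 448)  # input resolution is illustrative
    features = vgg_conv(image)           # spatial feature map, e.g. (1, 512, 14, 14)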

Results are shown in Table 3. As can be seen, we outperform the best published results on this task. A breakdown of our questions by answer type reveals that our model performs especially well on questions answered by an object, attribute, or number, but worse than a sequence baseline in the yes/no category. Inspection of training-set accuracies suggests that performance on yes/no questions is due to overfitting. An ensemble with a sequence-only system might achieve even better results; future work within the NMN framework should focus on redesigning the measure module to reduce effects from overfitting.

Inspection of parser outputs also suggests that there is substantial room to improve the system using a better parser. A hand inspection of the first 50 parses in the training set suggests that most (80–90%) of questions asking for simple properties of objects are correctly analyzed, but more complicated questions are more prone to picking up irrelevant predicates. For example, are these people most likely experiencing a work day? is parsed as be(people, likely), when the desired analysis is is(people, work). Parser errors of this kind could be fixed with joint learning.

Figure 3 is broadly suggestive of the kinds of prediction errors made by the system, including plausible semantic confusions (cardboard interpreted as leather, round windows interpreted as clocks), normal lexical variation (container for cup), and use of answers that are a priori plausible but unrelated to the image (describing a horse as located in a pen rather than a barn).

8. Conclusions and future work

In this paper, we have introduced neural module networks, which provide a general-purpose framework for learning collections of neural modules which can be dynamically assembled into arbitrary deep networks. We have demonstrated that this approach achieves state-of-the-art performance on existing datasets for visual question answering.


Page 13

Correct predictions:
    "how many different lights in various different shapes and sizes?"  measure[count](attend[light])  → four (four)
    "what is the color of the horse?"  classify[color](attend[horse])  → brown (brown)
    "what color is the vase?"  classify[color](attend[vase])  → green (green)
    "is the bus full of passengers?"  measure[is](combine[and](attend[bus], attend[full]))  → yes (yes)
    "is there a red shape above a circle?"  measure[is](combine[and](attend[red], re-attend[above](attend[circle])))  → no (no)

Mistakes (correct answers in parentheses):
    "what is stuffed with toothbrushes wrapped in plastic?"  classify[what](attend[stuff])  → container (cup)
    "where does the tabby cat watch a horse eating hay?"  classify[where](attend[watch])  → pen (barn)
    "what material are the boxes made of?"  classify[material](attend[box])  → leather (cardboard)
    "is this a clock?"  measure[is](attend[clock])  → yes (no)
    "is a red shape blue?"  measure[is](combine[and](attend[red], attend[blue]))  → yes (no)

Figure 3: Example output from our approach on different visual QA tasks. The top row shows correct answers, while the bottom row shows mistakes (correct answers are given in parentheses).

This approach performs especially well on questions answered by an object or an attribute. Additionally, we have introduced a new dataset of highly compositional questions about simple arrangements of shapes, and shown that our approach substantially outperforms previous work.

So far we have maintained a strict separation between predicting network structures and learning network parameters. It is easy to imagine that these two problems might be solved jointly, with uncertainty maintained over network structures throughout training and decoding. This might be accomplished either with a monolithic network, by using some higher-level mechanism to "attend" to relevant portions of the computation, or else by integrating with existing tools for learning semantic parsers [16].

The fact that our neural module networks can be trained to produce predictable outputs, even when freely composed, points toward a more general paradigm of "programs" built from neural networks. In this paradigm, network designers (human or automated) have access to a standard kit of neural parts from which to construct models for performing complex reasoning tasks. While visual question answering provides a natural testbed for this approach, its usefulness is potentially much broader, extending to queries about documents and structured knowledge bases or more general signal processing and function approximation.

Qualitative Results

Page 14

Impact

Ø Over 300 citations (pretty good)

Ø Follow-up work "Learning to Reason: End-to-End Module Networks for Visual Question Answering" addresses limitations of parsing.
Ø Uses a policy RNN to predict the composition (trained using RL); see the sketch below.
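A very rough sketch of that idea (not the authors' implementation): an RNN policy over the question samples a sequence of module tokens, and a REINFORCE-style gradient uses answer correctness as the reward. The module names echo Figure 5; all sizes and names are illustrative.

    import torch
    import torch.nn as nn

    MODULE_TOKENS = ["find", "relocate", "filter", "count", "compare", "<end>"]

    class LayoutPolicy(nn.Module):
        """Encodes a question with an LSTM and emits a sequence of module
        tokens; sampled sequences define the network layout."""
        def __init__(self, vocab_size: int, hidden: int = 256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)
            self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
            self.decoder = nn.LSTMCell(hidden, hidden)
            self.out = nn.Linear(hidden, len(MODULE_TOKENS))

        def forward(self, question_tokens: torch.Tensor, max_steps: int = 6):
            _, (h, c) = self.encoder(self.embed(question_tokens))
            h, c = h[0], c[0]                    # final encoder state
            step_input = torch.zeros_like(h)
            tokens, log_probs = [], []
            for _ in range(max_steps):
                h, c = self.decoder(step_input, (h, c))
                dist = torch.distributions.Categorical(logits=self.out(h))
                tok = dist.sample()
                tokens.append(tok)
                log_probs.append(dist.log_prob(tok))
                step_input = h
            return tokens, torch.stack(log_probs).sum(dim=0)

    # REINFORCE-style update: reward is 1 when the assembled network answers
    # correctly, 0 otherwise; a baseline reduces gradient variance.
    policy = LayoutPolicy(vocab_size=1000)
    question = torch.randint(0, 1000, (1, 12))   # a toy tokenized question
    layout_tokens, total_log_prob = policy(question)
    reward, baseline = 1.0, 0.5
    loss = -(reward - baseline) * total_log_prob.mean()
    loss.backward()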

Question: "How many other things are of the same size as the green matte ball?"
    Predicted layout: find[0] → relocate[1] → count[2]
    Answer: 4

Question: "Does the blue cylinder have the same material as the big block on the right side of the red metallic thing?"
    Predicted layout: find[0], find[1] → relocate[2] → filter[3] → compare[4]
    Answer: yes

Figure 5: Question answering examples on the CLEVR dataset. On the left, it can be seen that the model successfully locates the matte green ball, attends to all the other objects of the same size, and then correctly identifies that there are 4 such objects (excluding the initial ball). On the right, it can be seen that the various modules similarly assume intuitive semantics. Of particular interest is the second find module, which picks up the word right in addition to metallic red thing: this suggests that the model can use the fact that downstream computation will look to the right of the detected object to focus its initial search in the left half of the image, a behavior supported by our attentive approach but not a conventional linguistic analysis of the question.

                                                    Compare Integer      Query Attribute               Compare Attribute
    Method                    Overall Exist Count   equal  less  more    size  color material shape    size  color material shape
    CNN+BoW [26]               48.4   59.5  38.9     50     54    49      56    32     58      47       52    52     51      52
    CNN+LSTM [4]               52.3   65.2  43.7     57     72    69      59    32     58      48       54    54     51      53
    CNN+LSTM+MCB [9]           51.4   63.4  42.1     57     71    68      59    32     57      48       51    52     50      51
    CNN+LSTM+SA [25]           68.5   71.1  52.2     60     82    74      87    81     88      85       52    55     51      51
    NMN (expert layout) [3]    72.1   79.3  52.5     61.2   77.9  75.2    84.2  68.9   82.6    80.2     80.7  74.4   77.6    79.3
    ours - policy search
      from scratch             69.0   72.7  55.1     71.6   85.1  79.0    88.1  74.0   86.6    84.1     50.1  53.9   48.6    51.1
    ours - cloning expert      78.9   83.3  63.3     68.2   87.2  85.4    90.5  80.2   88.9    88.3     89.4  52.5   85.4    86.7
    ours - policy search
      after cloning            83.7   85.7  68.5     73.8   89.7  87.7    93.1  84.8   91.5    90.6     92.6  82.8   89.6    90.0

Table 3: Evaluation of our method and previous work on the CLEVR test set. With policy search after cloning, the accuracies are consistently improved on all question types, with large improvement on some question types like compare color.

We evaluate our model on the test set of CLEVR. Table 3 shows the detailed performance of our model and previous methods on each question type, where "ours - policy search from scratch" is the baseline using pure reinforcement learning without resorting to the expert, "ours - cloning expert" is the supervised behavioral cloning from the constructed expert policy in the first stage, and "ours - policy search after cloning" is our model further trained for the second training stage. It can be seen that without using any expert demonstrations, our method with policy optimization from scratch already achieves higher performance than most previous work, and our model trained in the first behavioral cloning stage outperforms the previous approaches by a large margin in overall accuracy. This indicates that our neural modules are capable of reasoning for complex questions in the dataset like "does the block that is to the right of the big cyan sphere have the same material as the large blue thing?"

Our model also outperforms the NMN baseline [3] trained on the same expert layout as used in our model¹. This shows that our soft attention module parameterization is better than the hard-coded textual parameters in NMN. Figure 5 shows some question answering examples with our model.

By comparing "ours - policy search after cloning" with "ours - cloning expert" in Table 3, it can be seen that the performance consistently improves after end-to-end training with policy search using reinforcement learning in the second training stage, with especially large improvement on the compare color type of questions, indicating that the original expert policy is not optimal, and we can improve upon it with policy search over the entire layout space. Figure 6 shows an example before and after end-to-end optimization.

¹ The question parsing in the original NMN implementation does not work on the CLEVR dataset, as confirmed in [15]. For fair comparison with NMN, we train NMN using the same expert layout as our model.

Page 15

Points to a bigger opportunity…

Ø Composition of learned modules

Ø Conjecture: Increasingly, "non-experts" will compose existing ML models to solve new complex problems.
    Ø Organizations will develop and reuse model components in multiple tasks
    Ø Training will span many different neural module programs

Ø Needed?
    Ø Abstractions for individual components
    Ø Mechanisms for composition and joint training