
Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision

Chen Liang∗, Jonathan Berant†, Quoc Le, Kenneth D. Forbus, Ni Lao

Northwestern University, Evanston, IL
Tel-Aviv University, Tel Aviv-Yafo, Israel
Google Inc., Mountain View, CA

{chenliang2013,forbus}@u.northwestern.edu, [email protected], {qvl,nlao}@google.com

Abstract

Harnessing the statistical power of neural networks to perform language understanding and symbolic reasoning is difficult when it requires executing efficient discrete operations against a large knowledge-base. In this work, we introduce a Neural Symbolic Machine (NSM), which contains (a) a neural “programmer”, i.e., a sequence-to-sequence model that maps language utterances to programs and utilizes a key-variable memory to handle compositionality, and (b) a symbolic “computer”, i.e., a Lisp interpreter that performs program execution and helps find good programs by pruning the search space. We apply REINFORCE to directly optimize the task reward of this structured prediction problem. To train with weak supervision and improve the stability of REINFORCE, we augment it with an iterative maximum-likelihood training process. NSM outperforms the state-of-the-art on the WEBQUESTIONSSP dataset when trained from question-answer pairs only, without requiring any feature engineering or domain-specific knowledge.

1 Introduction

Deep neural networks have achieved impressive performance in supervised classification and structured prediction tasks such as speech recognition (Hinton et al., 2012), machine translation (Bahdanau et al., 2014; Wu et al., 2016) and more. However, training neural networks for semantic parsing (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2011) or program induction, where language is mapped to a symbolic representation that is executed by an executor, through weak supervision remains challenging. This is because the model must interact with a symbolic executor through non-differentiable operations to search over a large program space.

∗Work done while the author was interning at Google.
†Work done while the author was visiting Google.

[Figure 1 diagram: the question x: “Largest city in the US” is mapped to the answer y: NYC through the program v0 ← (USA), v1 ← (Hop v0 CityIn), v2 ← (Argmax v1 Population). Panel “Compositionality” shows the chained variables; panel “Large Search Space” shows the competing token choices (Argmax or Hop; v1 or v0; Population, Size or Elevation).]

Figure 1: The main challenges of training a semantic parser from weak supervision: (a) compositionality: we use variables (v0, v1, v2) to store execution results of intermediate generated programs. (b) search: we prune the search space and augment REINFORCE with pseudo-gold programs.

In semantic parsing, recent work handled this (Dong and Lapata, 2016; Jia and Liang, 2016) by training from manually annotated programs and avoiding program execution at training time. However, annotating programs is known to be expensive and scales poorly. In program induction, attempts to address this problem (Graves et al., 2014; Reed and de Freitas, 2016; Kaiser and Sutskever, 2015; Graves et al., 2016b; Andreas et al., 2016) either utilized low-level memory (Zaremba and Sutskever, 2015), or required memory to be differentiable (Neelakantan et al., 2015; Yin et al., 2015) so that the model can be trained with backpropagation. This makes it difficult to use the efficient discrete operations and memory of a traditional computer, and has limited the application to synthetic or small knowledge bases.

arXiv:1611.00020v4 [cs.CL] 23 Apr 2017

In this paper, we propose to utilize the memory and discrete operations of a traditional computer in a novel Manager-Programmer-Computer (MPC) framework for neural program induction, which integrates three components:

1. A “manager” that provides weak supervision (e.g., ‘NYC’ in Figure 1) through a reward indicating how well a task is accomplished. Unlike full supervision, weak supervision is easy to obtain at scale (Section 3.1).

2. A “programmer” that takes natural language as input and generates a program that is a sequence of tokens (Figure 2). The programmer learns from the reward and must overcome the hard search problem of finding correct programs (Section 2.2).

3. A “computer” that executes programs in a high level programming language. Its non-differentiable memory enables abstract, scalable and precise operations, but makes training more challenging (Section 2.3). To help the “programmer” prune the search space, it provides a friendly neural computer interface, which detects and eliminates invalid choices (Section 2.1).

Within this framework, we introduce the Neural Symbolic Machine (NSM) and apply it to semantic parsing. NSM contains a neural sequence-to-sequence (seq2seq) “programmer” (Sutskever et al., 2014) and a symbolic non-differentiable Lisp interpreter (“computer”) that executes programs against a large knowledge-base (KB).

Our technical contribution in this work is three-fold. First, to support language compositionality, we augment the standard seq2seq model with a key-variable memory to save and reuse intermediate execution results (Figure 1). This is a novel application of pointer networks (Vinyals et al., 2015) to compositional semantics.

Second, to alleviate the search problem of finding correct programs when training from question-answer pairs, we use the computer to execute partial programs and prune the programmer’s search space by checking the syntax and semantics of generated programs. This generalizes the weakly supervised semantic parsing framework (Liang et al., 2011; Berant et al., 2013) by leveraging semantic denotations during structural search.

Third, to train from weak supervision and directly maximize the expected reward, we turn to the REINFORCE (Williams, 1992) algorithm. Since learning from scratch is difficult for REINFORCE, we combine it with an iterative maximum likelihood (ML) training process, where beam search is used to find pseudo-gold programs, which are then used to augment the objective of REINFORCE.

On the WEBQUESTIONSSP dataset (Yih et al., 2016), NSM achieves new state-of-the-art results with weak supervision, significantly closing the gap between weak and full supervision for this task. Unlike prior works, it is trained end-to-end, and does not require feature engineering or domain-specific knowledge.

2 Neural Symbolic Machines

We now introduce NSM by first describing the “computer”, a non-differentiable Lisp interpreter that executes programs against a large KB and provides code assistance (Section 2.1). We then propose a seq2seq model (“programmer”) that supports compositionality using a key-variable memory to save and reuse intermediate results (Section 2.2). Finally, we describe a training procedure that is based on REINFORCE, but is augmented with pseudo-gold programs found by an iterative ML training procedure (Section 2.3).

Before diving into details, we define the semantic parsing task: given a knowledge base K, and a question x = (w1, w2, ..., wm), produce a program or logical form z that when executed against K generates the right answer y. Let E denote a set of entities (e.g., ABELINCOLN),¹ and let P denote a set of properties (e.g., PLACEOFBIRTH). A knowledge base K is a set of assertions or triples (e1, p, e2) ∈ E × P × E, such as (ABELINCOLN, PLACEOFBIRTH, HODGENVILLE).

2.1 Computer: Lisp Interpreter with Code Assistance

Semantic parsing typically requires using a set of operations to query the knowledge base and process the results. Operations learned with neural networks such as addition and sorting do not perfectly generalize to inputs that are larger than the ones observed in the training data (Graves et al., 2014; Reed and de Freitas, 2016). In contrast, operations implemented in high level programming languages are abstract, scalable, and precise, and thus generalize perfectly to inputs of arbitrary size. Based on this observation, we implement operations necessary for semantic parsing with an ordinary programming language instead of trying to learn them with a neural network.

¹ We also consider numbers (e.g., “1.33”) and date-times (e.g., “1999-1-1”) as entities.

We adopt a Lisp interpreter as the “computer”. A program C is a list of expressions (c1 ... cN), where each expression is either a special token “Return” indicating the end of the program, or a list of tokens enclosed by parentheses “( F A1 ... AK )”. F is a function, which takes as input K arguments of specific types. Table 1 defines the semantics of each function and the types of its arguments (either a property p or a variable r). When a function is executed, it returns an entity list that is the expression’s denotation in K, and saves it to a new variable.

By introducing variables that save the intermediate results of execution, the program naturally models language compositionality and describes from left to right a bottom-up derivation of the full meaning of the natural language input, which is convenient in a seq2seq model (Figure 1). This is reminiscent of the floating parser (Wang et al., 2015; Pasupat and Liang, 2015), where a derivation tree that is not grounded in the input is incrementally constructed.

The set of programs defined by our functions is equivalent to the subset of λ-calculus presented in (Yih et al., 2015). We did not use the full Lisp programming language here, because constructs like control flow and loops are unnecessary for most current semantic parsing tasks, and it is simple to add more functions to the model when necessary.
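To make the four interpreter functions concrete, the following is a minimal Python sketch over an in-memory list of triples. It is an illustration under stated assumptions, not the authors’ interpreter: the toy KB, the entity names, the population figures and the `Country` property are all made up, and `arg_max`/`arg_min` follow the usage in Figure 1 (return the entities of r whose value of p is largest or smallest).

```python
from typing import Hashable, Iterable, Set, Tuple

Triple = Tuple[Hashable, str, Hashable]  # (e1, p, e2)


def hop(kb: Iterable[Triple], r: Set, p: str) -> Set:
    """( Hop r p ): every e2 such that (e1, p, e2) is in the KB for some e1 in r."""
    return {e2 for (e1, q, e2) in kb if q == p and e1 in r}


def _arg_extreme(kb: Iterable[Triple], r: Set, p: str, pick) -> Set:
    # Collect (value, entity) pairs for the entities in r that have property p.
    pairs = [(e2, e1) for (e1, q, e2) in kb if q == p and e1 in r]
    if not pairs:
        return set()
    best = pick(value for value, _ in pairs)
    return {entity for value, entity in pairs if value == best}


def arg_max(kb, r, p):
    """( ArgMax r p ): entities in r with the largest value of p (assumed reading)."""
    return _arg_extreme(kb, r, p, max)


def arg_min(kb, r, p):
    """( ArgMin r p ): entities in r with the smallest value of p (assumed reading)."""
    return _arg_extreme(kb, r, p, min)


def filter_(kb, r1, r2, p):
    """( Filter r1 r2 p ): entities e1 in r1 linked by p to some e2 in r2."""
    return {e1 for (e1, q, e2) in kb if q == p and e1 in r1 and e2 in r2}


# Toy knowledge base; all facts and numbers are illustrative only.
KB = [
    ("USA", "CityIn", "NYC"),
    ("USA", "CityIn", "LA"),
    ("NYC", "Population", 8_000_000),
    ("LA", "Population", 4_000_000),
    ("NYC", "Country", "USA"),
]

v0 = {"USA"}
v1 = hop(KB, v0, "CityIn")          # intermediate result saved as a variable
v2 = arg_max(KB, v1, "Population")  # reuses v1, as in Figure 1
```

Each call returns a plain set that can be bound to a new variable, which mirrors how the program in Figure 1 chains v0, v1 and v2.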

To create a friendly neural computer interface, the interpreter provides code assistance to the programmer by producing a list of valid tokens at each step. First, a valid token should not cause a syntax error: e.g., if the previous token is “(”, the next token must be a function name, and if the previous token is “Hop”, the next token must be a variable. More importantly, a valid token should not cause a semantic (run-time) error: this is detected using the denotation saved in the variables. For example, if the previously generated tokens were “( Hop r”, the next available token is restricted to properties {p | ∃e, e′ : e ∈ r, (e, p, e′) ∈ K} that are reachable from entities in r in the KB. These checks are enabled by the variables and can be derived from the definition of the functions in Table 1. The interpreter prunes the “programmer”’s search space by orders of magnitude, and enables learning from weak supervision on a large KB.
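The three checks named in this paragraph can be sketched as follows. This is a hedged illustration rather than the actual interface: `valid_next_tokens`, `valid_properties`, the toy KB and the variable table are hypothetical names and data, and only the cases mentioned in the text are handled.

```python
FUNCTIONS = {"Hop", "ArgMax", "ArgMin", "Filter"}

# Toy KB and variable denotations; illustrative values only.
KB = [
    ("USA", "CityIn", "NYC"),
    ("USA", "CityIn", "LA"),
    ("NYC", "Population", 8_000_000),
    ("LA", "Population", 4_000_000),
]
VARIABLES = {"v0": {"USA"}, "v1": {"NYC", "LA"}}


def valid_properties(kb, r):
    """{p | exists e, e': e in r, (e, p, e') in K}: properties reachable from r."""
    return {p for (e, p, _e2) in kb if e in r}


def valid_next_tokens(prefix, kb, variables):
    """Return the tokens that cannot cause a syntax or run-time error next."""
    last = prefix[-1]
    if last == "(":
        # Syntax check: an opening parenthesis must be followed by a function.
        return set(FUNCTIONS)
    if last == "Hop":
        # Syntax check: the first argument of Hop must be a variable.
        return set(variables)
    if len(prefix) >= 2 and prefix[-2] == "Hop" and last in variables:
        # Semantic check: only properties that exist on entities in the variable.
        return valid_properties(kb, variables[last])
    raise NotImplementedError("only the cases named in the text are sketched")
```

For the prefix `["(", "Hop", "v1"]` the candidate set contains only `Population`, which illustrates how the denotation stored in a variable shrinks the decoder’s choices.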

2.2 Programmer: Seq2seq Model with Key-Variable Memory

Given the “computer”, the “programmer” needs to map natural language into a program, which is a sequence of tokens that reference operations and values in the “computer”. We base our programmer on a standard seq2seq model with attention, but extend it with a key-variable memory that allows the model to learn to represent and refer to program variables (Figure 2).

Sequence-to-sequence models consist of two RNNs, an encoder and a decoder. We used a 1-layer GRU (Cho et al., 2014) for both the encoder and decoder. Given a sequence of words $w_1, w_2, \ldots, w_m$, each word $w_t$ is mapped to an embedding $q_t$ (embedding details are in Section 3). Then, the encoder reads these embeddings and updates its hidden state step by step using $h_{t+1} = \mathrm{GRU}(h_t, q_t, \theta_{Encoder})$, where $\theta_{Encoder}$ are the GRU parameters. The decoder updates its hidden states $u_t$ by $u_{t+1} = \mathrm{GRU}(u_t, c_{t-1}, \theta_{Decoder})$, where $c_{t-1}$ is the embedding of last step’s output token $a_{t-1}$, and $\theta_{Decoder}$ are the GRU parameters. The last hidden state of the encoder $h_T$ is used as the decoder’s initial state. We also adopt a dot-product attention similar to Dong and Lapata (2016). The tokens of the program $a_1, a_2, \ldots, a_n$ are generated one by one using a softmax over the vocabulary of valid tokens at each step, as provided by the “computer” (Section 2.1).
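The last step, a softmax restricted to the valid tokens, can be sketched in a few lines of plain Python. The function name `masked_softmax` and the dictionary-based interface are assumptions made for illustration; the paper does not specify how the restriction is implemented.

```python
import math


def masked_softmax(scores, valid):
    """Softmax over `scores` (token -> logit), restricted to tokens in `valid`."""
    allowed = {tok: s for tok, s in scores.items() if tok in valid}
    if not allowed:
        raise ValueError("the interpreter returned no valid token")
    top = max(allowed.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - top) for tok, s in allowed.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}
```

Tokens outside the valid set receive no probability mass at all, so a high logit on an invalid token cannot push the decoder into a program that fails to execute.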

To achieve compositionality, the decoder must learn to represent and refer to intermediate variables whose value was saved in the “computer” after execution. Therefore, we augment the model with a key-variable memory, where each entry has two components: a continuous embedding key $v_i$, and a corresponding variable token $R_i$ referencing the value in the “computer” (see Figure 2). During encoding, we use an entity linker to link text spans (e.g., “US”) to KB entities. For each linked entity we add a memory entry where the key is the average of GRU hidden states over the entity span, and the variable token (R1) is the name of a variable in the computer holding the linked entity (m.USA) as its value. During decoding, when a full expression is generated (i.e., the decoder generates “)”), it gets executed, and the result is stored as the value of a new variable in the “computer”. This variable is keyed by the GRU hidden state at that step. When a new variable R1 with key embedding v1 is added into the key-variable memory, the token R1 is added into the decoder vocabulary with v1 as its embedding. The final answer returned by the “programmer” is the value of the last computed variable.

( Hop r p ) ⇒ {e2 | e1 ∈ r, (e1, p, e2) ∈ K}
( ArgMax r p ) ⇒ {e1 | e1 ∈ r, ∃e2 ∈ E : (e1, p, e2) ∈ K, ∀e : (e1, p, e) ∈ K, e2 ≥ e}
( ArgMin r p ) ⇒ {e1 | e1 ∈ r, ∃e2 ∈ E : (e1, p, e2) ∈ K, ∀e : (e1, p, e) ∈ K, e2 ≤ e}
( Filter r1 r2 p ) ⇒ {e1 | e1 ∈ r1, ∃e2 ∈ r2 : (e1, p, e2) ∈ K}

Table 1: Interpreter functions. r represents a variable, p a property in Freebase. ≥ and ≤ are defined on numbers and dates.

[Figure 2 diagram: the encoder reads “Largest city in US”; an entity resolver links “US” to m.USA, which is stored in the key-variable memory as R1 with key v1. After the GO token, the decoder emits ( Hop R1 !CityIn ), which is executed and stored as R2 (list of US cities) with key v2. It then emits ( Argmax R2 Population ), which is executed and stored as R3 (m.NYC) with key v3. The Return token outputs m.NYC.]

Figure 2: Semantic Parsing with NSM. The key embeddings of the key-variable memory are the output of the sequence model at certain encoding or decoding steps. For illustration purposes, we also show the values of the variables in parentheses, but the sequence model never sees these values, and only references them with the name of the variable (“R1”). A special token “GO” indicates the start of decoding, and “Return” indicates the end of decoding.

Similar to pointer networks (Vinyals et al., 2015), the key embeddings for variables are dynamically generated for each example. During training, the model learns to represent variables by backpropagating gradients from a time step where a variable is selected by the decoder, through the key-variable memory, to an earlier time step when the key embedding was computed. Thus, the encoder/decoder learns to generate representations for variables such that they can be used at the right time to construct the correct program.

While the key embeddings are differentiable, the values referenced by the variables (lists of entities), stored in the “computer”, are symbolic and non-differentiable. This distinguishes the key-variable memory from other memory-augmented neural networks that use continuous differentiable embeddings as the values of memory entries (Weston et al., 2014; Graves et al., 2016a).
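The split between differentiable keys and symbolic values can be pictured with a small data-structure sketch. Everything here is hypothetical: the class name `KeyVariableMemory`, the 2-dimensional keys, and the use of a dot product between a decoder query and each key are assumptions for illustration, not details confirmed by the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class KeyVariableMemory:
    """Keys live with the neural model; values live in the symbolic computer."""
    keys: List[List[float]] = field(default_factory=list)      # continuous embeddings
    tokens: List[str] = field(default_factory=list)            # variable tokens R1, R2, ...
    computer: Dict[str, Set[str]] = field(default_factory=dict)  # token -> entity set

    def add(self, key: List[float], value: Set[str]) -> str:
        """Register a new variable: the key becomes the token's embedding."""
        token = f"R{len(self.tokens) + 1}"
        self.keys.append(list(key))
        self.tokens.append(token)
        self.computer[token] = set(value)  # never exposed to the neural model
        return token

    def scores(self, query: List[float]) -> Dict[str, float]:
        """Score each variable token against a decoder query (assumed dot product)."""
        return {
            tok: sum(q * k for q, k in zip(query, key))
            for tok, key in zip(self.tokens, self.keys)
        }
```

The decoder would only ever see the tokens and their key embeddings, while the entity sets in `computer` stay on the symbolic side, which matches the distinction drawn in the paragraph above.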

2.3 Training NSM with Weak Supervision

NSM executes non-differentiable operations against a KB, and thus end-to-end backpropagation is not possible. Therefore, we base our training procedure on REINFORCE (Williams, 1992; Norouzi et al., 2016). When the reward signal is sparse and the search space is large, it is common to utilize some full supervision to pre-train REINFORCE (Silver et al., 2016). To train from weak supervision, we suggest an iterative ML procedure for finding pseudo-gold programs that will bootstrap REINFORCE.

REINFORCE. We can formulate training as a reinforcement learning problem: given a question $x$, the state, action and reward at each time step $t \in \{0, 1, \ldots, T\}$ are $(s_t, a_t, r_t)$. Since the environment is deterministic, the state is defined by the question $x$ and the action sequence: $s_t = (x, a_{0:t-1})$, where $a_{0:t-1} = (a_0, \ldots, a_{t-1})$ is the history of actions at time $t$. A valid action at time $t$ is $a_t \in A(s_t)$, where $A(s_t)$ is the set of valid tokens given by the “computer”. Since each action corresponds to a token, the full history $a_{0:T}$ corresponds to a program. The reward $r_t = I[t = T] \cdot F_1(x, a_{0:T})$ is non-zero only at the last step of decoding, and is the F1 score computed by comparing the gold answer and the answer generated by executing the program $a_{0:T}$. Thus, the cumulative reward of a program $a_{0:T}$ is

$$R(x, a_{0:T}) = \sum_t r_t = F_1(x, a_{0:T}).$$
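As a concrete reading of this reward, the sketch below computes F1 between the set of entities returned by a program and the gold answer set. It is a minimal illustration, not the official evaluation script; in particular, returning 0.0 when either set is empty is an assumption.

```python
def f1_reward(predicted, gold):
    """F1 between the executed program's answer set and the gold answer set."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        # Covers empty predictions, empty gold sets and disjoint sets (assumed 0.0).
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For instance, a program that returns two cities when only one is correct gets precision 0.5 and recall 1.0, hence a reward of 2/3 rather than 1, so over-generating programs are penalized even when they contain the right answer.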

The agent’s decision making procedure at each time is defined by a policy, $\pi_\theta(s, a) = P_\theta(a_t = a \mid x, a_{0:t-1})$, where $\theta$ are the model parameters. Since the environment is deterministic, the probability of generating a program $a_{0:T}$ is

$$P_\theta(a_{0:T} \mid x) = \prod_t P_\theta(a_t \mid x, a_{0:t-1}).$$

We can define our objective to be the expected cumulative reward and use policy gradient methods such as REINFORCE for training. The objective and gradient are:

$$J^{RL}(\theta) = \sum_x \mathbb{E}_{P_\theta(a_{0:T} \mid x)}[R(x, a_{0:T})],$$

$$\nabla_\theta J^{RL}(\theta) = \sum_x \sum_{a_{0:T}} P_\theta(a_{0:T} \mid x) \cdot [R(x, a_{0:T}) - B(x)] \cdot \nabla_\theta \log P_\theta(a_{0:T} \mid x),$$

where $B(x) = \sum_{a_{0:T}} P_\theta(a_{0:T} \mid x) R(x, a_{0:T})$ is a baseline that reduces the variance of the gradient estimation without introducing bias. Having a separate network to predict the baseline is an interesting future direction.

While REINFORCE assumes a stochastic policy, we use beam search for gradient estimation. Thus, in contrast with common practice of approximating the gradient by sampling from the model, we use the top-k action sequences (programs) in the beam with normalized probabilities. This allows training to focus on sequences with high probability, which are on the decision boundaries, and reduces the variance of the gradient.
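The beam-based estimate can be illustrated by computing, for each program in the beam, the scalar that multiplies its log-probability gradient: the normalized probability times the reward minus the baseline. The helper name `reinforce_weights` is hypothetical, and the sketch leaves out the gradient computation itself.

```python
def reinforce_weights(probs, rewards):
    """Per-program weights p_j * (R_j - B) for the top-k programs in a beam.

    `probs` are unnormalized model probabilities of the programs in the beam;
    they are renormalized to sum to one, as described in the text.
    """
    total = sum(probs)
    normalized = [p / total for p in probs]
    baseline = sum(p * r for p, r in zip(normalized, rewards))
    return [p * (r - baseline) for p, r in zip(normalized, rewards)]
```

With two equally likely programs and rewards 1 and 0, the weights are +0.25 and -0.25. If every program in the beam has zero reward, the baseline is zero and all weights vanish, so such a beam contributes no learning signal.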

Empirically (and in line with prior work), REINFORCE converged slowly and often got stuck in local optima (see Section 3). The difficulty of training resulted from the sparse reward signal in the large search space, which caused model probabilities for programs with non-zero reward to be very small at the beginning. If the beam size k is small, good programs fall off the beam, leading to zero gradients for all programs in the beam. If the beam size k is large, training is very slow, and the normalized probabilities of good programs when the model is untrained are still very small, leading to (1) near zero baselines, and thus near zero gradients on “bad” programs, and (2) near zero gradients on good programs due to the low probability $P_\theta(a_{0:T} \mid x)$. To combat this, we present an alternative training strategy based on maximum-likelihood.

Iterative ML. If we had gold programs, we could directly optimize their likelihood. Since we do not have gold programs, we can perform an iterative procedure (similar to hard Expectation-Maximization (EM)), where we search for good programs given fixed parameters, and then optimize the probability of the best program found so far. We do decoding on an example with a large beam size and declare $a^{best}_{0:T}(x)$ to be the pseudo-gold program, which achieved highest reward with shortest length among the programs decoded on $x$ in all previous iterations. Then, we can optimize the ML objective:

$$J^{ML}(\theta) = \sum_x \log P_\theta(a^{best}_{0:T}(x) \mid x) \qquad (1)$$

A question $x$ is not included if we did not find any program with positive reward.
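The bookkeeping for the pseudo-gold program can be sketched as a small update rule that keeps, per question, the program with the highest reward and breaks ties by shorter length. The function name `update_pseudo_gold` and the representation of a program as a list of tokens are assumptions for illustration.

```python
def update_pseudo_gold(best, candidates):
    """Keep the highest-reward, then shortest, program seen so far.

    `best` is None or a (program, reward) pair from earlier iterations;
    `candidates` are (program, reward) pairs from the current beam.
    Programs with zero reward are never selected, so a question with no
    positive-reward program stays excluded from ML training.
    """
    for program, reward in candidates:
        if reward <= 0:
            continue
        if best is None:
            best = (program, reward)
            continue
        best_program, best_reward = best
        shorter_tie = reward == best_reward and len(program) < len(best_program)
        if reward > best_reward or shorter_tie:
            best = (program, reward)
    return best
```

Calling this once per decoding round with the previous result as `best` gives the "in all previous iterations" behaviour described above, since the stored pair only changes when a strictly better candidate appears.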

Training with iterative ML is fast because there is at most one program per example and the gradient is not weighted by model probability. While decoding with a large beam size is slow, we could train for multiple epochs after each decoding. This iterative process has a bootstrapping effect: a better model leads to a better program $a^{best}_{0:T}(x)$ through decoding, and a better program $a^{best}_{0:T}(x)$ leads to a better model through training.

Even with a large beam size, some programs are hard to find because of the large search space. A common solution to this problem is to use curriculum learning (Zaremba and Sutskever, 2015; Reed and de Freitas, 2016). The size of the search space is controlled by both the set of functions used in the program and the program length. We apply curriculum learning by gradually increasing both these quantities (see details in Section 3) when performing iterative ML.

Nevertheless, iterative ML uses only pseudo-gold programs and does not directly optimize the objective we truly care about. This has two adverse effects: (1) The best program $a^{best}_{0:T}(x)$ could be a spurious program that accidentally produces the correct answer (e.g., using the property PLACEOFBIRTH instead of PLACEOFDEATH when the two places are the same), and thus does not generalize to other questions. (2) Because training does not observe full negative programs, the model often fails to distinguish between tokens that are related to one another. For example, differentiating PARENTSOF vs. SIBLINGSOF vs. CHILDRENOF can be challenging. We now present learning where we combine iterative ML with REINFORCE.

Augmented REINFORCE. To bootstrap REINFORCE, we can use iterative ML to find pseudo-gold programs, and then add these programs to the beam with a reasonably large probability. This is similar to methods from imitation learning (Ross et al., 2011; Jiang et al., 2012) that define a proposal distribution by linearly interpolating the model distribution and an oracle.

Algorithm 1 IML-REINFORCE
Input: question-answer pairs D = {(x_i, y_i)}, mix ratio α, reward function R(·), training iterations N_ML, N_RL, and beam sizes B_ML, B_RL.
Procedure:
  Initialize C*_x = ∅, the best program so far for x
  Initialize model θ randomly                              ▷ Iterative ML
  for n = 1 to N_ML do
    for (x, y) in D do
      C ← Decode B_ML programs given x
      for j in 1...|C| do
        if R_{x,y}(C_j) > R_{x,y}(C*_x) then C*_x ← C_j
    θ ← ML training with D_ML = {(x, C*_x)}
  Initialize model θ randomly                              ▷ REINFORCE
  for n = 1 to N_RL do
    D_RL ← ∅ is the RL training set
    for (x, y) in D do
      C ← Decode B_RL programs from x
      for j in 1...|C| do
        if R_{x,y}(C_j) > R_{x,y}(C*_x) then C*_x ← C_j
      C ← C ∪ {C*_x}
      for j in 1...|C| do
        p_j ← (1−α) · p_j / Σ_{j′} p_{j′}, where p_j = P_θ(C_j | x)
        if C_j = C*_x then p_j ← p_j + α
        D_RL ← D_RL ∪ {(x, C_j, p_j)}
    θ ← REINFORCE training with D_RL

Algorithm 1 describes our overall training procedure. We first run iterative ML for N_ML iterations and record the best program found for every example x_i. Then, we run REINFORCE, where we normalize the probabilities of the programs in the beam to sum to (1−α) and add α to the probability of the best found program C∗(x_i). Consequently, the model always puts a reasonable amount of probability on a program with high reward during training. Note that we randomly initialized the parameters for REINFORCE, since initializing from the final ML parameters tended to get stuck in a local optimum and produced worse results.
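The probability mixing in the inner loop of Algorithm 1 can be sketched as follows. The function name `augmented_probs` is hypothetical, programs are represented as hashable tuples of tokens for convenience, and the caller is assumed to have already scored the best program with the model so that it appears in `model_probs`.

```python
def augmented_probs(model_probs, best_program, alpha):
    """Rescale beam probabilities to sum to (1 - alpha), then add alpha to C*.

    `model_probs` maps every program in the beam, including the best program
    found so far, to its model probability P_theta(C | x).
    """
    if best_program not in model_probs:
        raise KeyError("score the best program and add it to the beam first")
    total = sum(model_probs.values())
    mixed = {c: (1.0 - alpha) * p / total for c, p in model_probs.items()}
    mixed[best_program] += alpha
    return mixed
```

The result is still a probability distribution over the beam, and the best program is guaranteed at least α of the mass however small its model probability is.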

On top of imitation learning, our approach is related to the common practice in reinforcement learning (Schaul et al., 2016) to replay rare successful experiences to reduce the training variance and improve training efficiency. This is also similar to recent developments (Wu et al., 2016) in machine translation, where ML and RL objectives are linearly combined, because anchoring the model to some high-reward outputs stabilizes training.

3 Experiments and Analysis

We now empirically show that NSM can learn a semantic parser from weak supervision over a large KB. We evaluate on WEBQUESTIONSSP, a challenging semantic parsing dataset with strong baselines. Experiments show that NSM achieves new state-of-the-art performance on WEBQUESTIONSSP with weak supervision, and significantly closes the gap between weak and full supervision for this task.

3.1 The WEBQUESTIONSSP dataset

The WEBQUESTIONSSP dataset (Yih et al., 2016) contains full semantic parses for a subset of the questions from WEBQUESTIONS (Berant et al., 2013), because 18.5% of the original dataset were found to be “not answerable”. It consists of 3,098 question-answer pairs for training and 1,639 for testing, which were collected using the Google Suggest API, and the answers were originally obtained using Amazon Mechanical Turk workers. They were updated in (Yih et al., 2016) by annotators who were familiar with the design of Freebase and added semantic parses. We further separated out 620 questions from the training set as a validation set. For query pre-processing we used an in-house named entity linking system to find the entities in a question. The quality of the entity linker is similar to that of (Yih et al., 2015), with 94% of the gold root entities being included. Similar to Dong and Lapata (2016), we replaced named entity tokens with a special token “ENT”. For example, the question “who plays meg in family guy” is changed to “who plays ENT in ENT ENT”. This helps reduce overfitting, because instead of memorizing the correct program for a specific entity, the model has to focus on other context words in the sentence, which improves generalization.
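The anonymization step can be sketched as a token-level replacement driven by the entity linker’s output. The function name `anonymize` and the half-open `(start, end)` span format are assumptions; the in-house linker’s actual interface is not described.

```python
def anonymize(tokens, entity_spans, placeholder="ENT"):
    """Replace every token inside a linked entity span with the placeholder.

    `entity_spans` holds half-open (start, end) token index ranges, a
    hypothetical format for the entity linker's output.
    """
    out = list(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            out[i] = placeholder
    return out


question = "who plays meg in family guy".split()
# "meg" is tokens [2, 3) and "family guy" is tokens [4, 6).
anonymized = " ".join(anonymize(question, [(2, 3), (4, 6)]))
```

Each token of a multi-word entity is replaced separately, which reproduces the example in the text where “family guy” becomes “ENT ENT”.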

Following (Yih et al., 2015) we used the last publicly available snapshot of Freebase (Bollacker et al., 2008). Since NSM training requires random access to Freebase during decoding, we preprocessed Freebase by removing predicates that are not related to world knowledge (starting with “/common/”, “/type/”, “/freebase/”),² and removing all text valued predicates, which are rarely the answer. Out of all 27K relations, 434 relations are removed during preprocessing. This results in a graph that fits in memory with 23K relations, 82M nodes, and 417M edges.
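The prefix-based part of this preprocessing can be sketched as a simple predicate filter. The names `EXCLUDED_PREFIXES`, `KEPT_PREDICATES` and `keep_predicate` are hypothetical, and the sketch does not cover the separate removal of text valued predicates, which would need type information about each predicate’s objects.

```python
EXCLUDED_PREFIXES = ("/common/", "/type/", "/freebase/")
# The one exception mentioned in the footnote.
KEPT_PREDICATES = {"/common/topic/notable_types"}


def keep_predicate(predicate):
    """True if the predicate survives the prefix-based filtering step."""
    if predicate in KEPT_PREDICATES:
        return True
    return not predicate.startswith(EXCLUDED_PREFIXES)


predicates = [
    "/people/person/parents",
    "/type/object/name",
    "/common/topic/notable_types",
    "/freebase/valuenotation/has_value",
]
kept = [p for p in predicates if keep_predicate(p)]
```

`str.startswith` accepts a tuple of prefixes, so the three excluded namespaces are checked in a single call.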

3.2 Model Details

For pre-trained word embeddings, we used the 300-dimensional GloVe word embeddings trained on 840B tokens (Pennington et al., 2014). On the encoder side, we added a projection matrix to transform the embeddings into 50 dimensions. On the decoder side, we used the same GloVe embeddings to construct an embedding for each property using its Freebase id, and also added a projection matrix to transform this embedding to 50 dimensions. A Freebase id contains three parts: domain, type, and property. For example, the Freebase id for PARENTSOF is “/people/person/parents”. “people” is the domain, “person” is the type and “parents” is the property. The embedding is constructed by concatenating the average of word embeddings in the domain and type name to the average of word embeddings in the property name. For example, if the embedding dimension is 300, the embedding dimension for “/people/person/parents” will be 600. The first 300 dimensions will be the average of the embeddings for “people” and “person”, and the second 300 dimensions will be the embedding for “parents”.

² We kept “/common/topic/notable_types”.
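The construction of a property embedding from its Freebase id can be sketched with plain Python lists. The helper names are hypothetical, the 2-dimensional vectors are toy stand-ins for the 300-dimensional GloVe vectors, and splitting multi-word names on underscores is an assumption, as is ignoring out-of-vocabulary words.

```python
def average(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def property_embedding(freebase_id, word_vectors):
    """Concatenate avg(domain + type words) with avg(property words)."""
    domain, type_name, prop = freebase_id.strip("/").split("/")
    context_words = domain.split("_") + type_name.split("_")
    property_words = prop.split("_")
    context = average([word_vectors[w] for w in context_words])
    name = average([word_vectors[w] for w in property_words])
    return context + name  # list concatenation doubles the dimension


# Toy 2-dimensional "word embeddings"; real GloVe vectors have 300 dimensions.
toy_vectors = {"people": [1.0, 0.0], "person": [0.0, 1.0], "parents": [2.0, 2.0]}
embedding = property_embedding("/people/person/parents", toy_vectors)
```

With 2-dimensional inputs the result has 4 dimensions, the same doubling that turns 300-dimensional word vectors into a 600-dimensional property embedding before projection.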

The dimensions of the encoder hidden state, decoder hidden state and key embeddings are all 50. The embeddings for the functions and special tokens (e.g., “UNK”, “GO”) are randomly initialized by a truncated normal distribution with mean=0.0 and stddev=0.1. All the weight matrices are initialized with a uniform distribution in $[-\frac{\sqrt{3}}{d}, \frac{\sqrt{3}}{d}]$, where $d$ is the input dimension. Dropout rate is set to 0.5, and we see a clear tendency for larger dropout rate to produce better performance, indicating overfitting is a major problem for learning.

3.3 Training Details

In iterative ML training, the decoder uses a beam of size k = 100 to update the pseudo-gold programs and the model is trained for 20 epochs after each decoding step. We use the Adam optimizer (Kingma and Ba, 2014) with initial learning rate 0.001. In our experiment, this process usually converges after a few (5-8) iterations.

For REINFORCE training, the best hyperparameters are chosen using the validation set. We use a beam of size k = 5 for decoding, and α is set to 0.1. Because the dataset is small and some relations are only used once in the whole training set, we train the model on the entire training set for 200 iterations with the best hyperparameters. Then we train the model with learning rate decay until convergence. Learning rate is decayed as $g_t = g_0 \times \beta^{\max(0,\, t - t_s)/m}$, where $g_0 = 0.001$, $\beta = 0.5$, $m = 1000$, and $t_s$ is the number of training steps at the end of iteration 200.
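The decay schedule is easy to check numerically. The sketch below evaluates it with the constants from the text; the function name `learning_rate` and the example value of `t_s` are illustrative only.

```python
def learning_rate(t, t_s, g0=0.001, beta=0.5, m=1000):
    """g_t = g0 * beta ** (max(0, t - t_s) / m)."""
    return g0 * beta ** (max(0, t - t_s) / m)


t_s = 5000  # hypothetical number of steps at the end of iteration 200
schedule = [learning_rate(t, t_s) for t in (0, t_s, t_s + 1000, t_s + 2000)]
```

The rate stays at its initial value until step `t_s` and then halves every `m = 1000` steps, so the four sampled values are 0.001, 0.001, 0.0005 and 0.00025.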

Since decoding needs to query the knowledge base (KB) constantly, the speed bottleneck for training is decoding. We address this problem in our implementation by partitioning the dataset, and using multiple decoders in parallel to handle each partition. We use 100 decoders, which query 50 KG servers, and one trainer. The neural network model is implemented in TensorFlow. Since the model is small, we did not see a significant speedup from using a GPU, so all the decoders and the trainer use CPU only.

Inspired by the staged generation process in Yih et al. (2015), curriculum learning includes two steps. We first run iterative ML for 10 iterations with programs constrained to only use the “Hop” function and the maximum number of expressions is 2. Then, we run iterative ML again, but use both “Hop” and “Filter”. The maximum number of expressions is 3, and the relations used by “Hop” are restricted to those that appeared in $a^{best}_{0:T}(q)$ in the first step.

3.4 Results and discussion

We evaluate performance using the official evaluation script for WEBQUESTIONSSP. Because the answer to a question may contain multiple entities or values, precision, recall and F1 are computed based on the output of each individual question, and average F1 is reported as the main evaluation metric. Accuracy measures the proportion of questions that are answered exactly.

A comparison to STAGG, the previous state-of-the-art model (Yih et al., 2016, 2015), is shown in Table 2. Our model beats STAGG with weak supervision by a significant margin on all metrics, while relying on no feature engineering or hand-crafted rules. When STAGG is trained with strong supervision it obtains an F1 of 71.7, and thus NSM closes roughly half of the gap between training with weak and full supervision.

Model   Prec.   Rec.   F1     Acc.
STAGG   67.3    73.1   66.8   58.8
NSM     70.8    76.0   69.0   59.5

Table 2: Results on the test set. Average F1 is the main evaluation metric and NSM outperforms STAGG with no domain-specific knowledge or feature engineering.

Four key ingredients lead to the final performance of NSM. The first one is the neural computer interface that provides code assistance by checking for syntax and semantic errors. We find that semantic checks are very effective for open-domain KBs with a large number of properties. For our task, the average number of choices is reduced from 23K per step (all properties) to less than 100 (the average number of properties connected to an entity).

The second ingredient is augmented REINFORCE training. Table 3 compares augmented REINFORCE, REINFORCE, and iterative ML on the validation set. REINFORCE gets stuck in a local optimum and performs poorly. Iterative ML training does not directly optimize the F1 measure, and achieves sub-optimal results. In contrast, augmented REINFORCE is able to bootstrap using pseudo-gold programs found by iterative ML and achieves the best performance on both the training and validation sets.

Settings              Train F1   Valid F1
Iterative ML          68.6       60.1
REINFORCE             55.1       47.8
Augmented REINFORCE   83.0       67.2

Table 3: Average F1 on the validation set for augmented REINFORCE, REINFORCE, and iterative ML.

The third ingredient is curriculum learning during iterative ML. We compare the performance of the best programs found with and without curriculum learning in Table 4. We find that the best programs found with curriculum learning are substantially better than those found without it on every metric.

Settings        Prec.   Rec.   F1     Acc.
No curriculum   79.1    91.1   78.5   67.2
Curriculum      88.6    96.1   89.5   79.8

Table 4: Evaluation of the programs with the highest F1 score in the beam ($a^{best}_{0:T}$) with and without curriculum learning.

The last important ingredient is reducing overfitting. Given the small size of the dataset, overfitting is a major problem for training neural network models. We show the contributions of different techniques for controlling overfitting in Table 5. Note that after all the techniques have been applied, the model is still overfitting with training F1@1=83.0% and validation F1@1=67.2%.

Settings                            ∆ F1@1
−Pretrained word embeddings         −5.5
−Pretrained property embeddings     −2.7
−Dropout on GRU input and output    −2.4
−Dropout on softmax                 −1.1
−Anonymize entity tokens            −2.0

Table 5: Contributions of different overfitting techniques on the validation set.

Among the programs generated by the model, a significant portion (36.7%) uses more than one expression. From Table 6, we can see that the performance does not decrease much as the compositional depth increases, indicating that the model is effective at capturing compositionality. We observe that programs with three expressions use a more limited set of properties, mainly focusing on answering a few types of questions such as “who plays meg in family guy”, “what college did jeff corwin go to” and “which countries does russia border”. In contrast, programs with two expressions use a more diverse set of properties, which could explain the lower performance compared to programs with three expressions.

#Expressions   0      1       2       3
Percentage     0.4%   62.9%   29.8%   6.9%
F1             0.0    73.5    59.9    70.3

Table 6: Percentage and performance of model generated programs with different complexity (number of expressions).

Error analysis. Error analysis on the validation set shows two main sources of errors:

1. Search failure: Programs with high reward are not found during search for pseudo-gold programs, either because the beam size is not large enough, or because the set of functions implemented by the interpreter is insufficient. The 89.5% F1 score in Table 4 indicates that at least 10% of the questions are of this kind.

2. Ranking failure: Programs with high reward exist in the beam, but are not ranked at the top during decoding. Because the training error is low, this is largely due to overfitting or spurious programs. The 67.2% F1 score in Table 3 indicates that about 20% of the questions are of this kind.
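The two estimates above follow from simple arithmetic on the reported scores, sketched below. The function name `error_budget` is hypothetical, and the sketch follows the text in treating F1 points as a rough proxy for the share of questions.

```python
from typing import Tuple


def error_budget(best_in_beam_f1: float, top1_f1: float) -> Tuple[float, float]:
    """Split the gap to a perfect score into search and ranking failures."""
    # Search failure: no high-reward program is in the beam at all.
    search_failure = 100.0 - best_in_beam_f1
    # Ranking failure: a good program is in the beam but is not ranked first.
    ranking_failure = best_in_beam_f1 - top1_f1
    return search_failure, ranking_failure


if __name__ == "__main__":
    # 89.5 is the best-in-beam F1 from Table 4; 67.2 is the validation F1 from Table 3.
    search, ranking = error_budget(89.5, 67.2)
    print(f"search failure ~ {search:.1f}, ranking failure ~ {ranking:.1f}")
```

This gives roughly 10.5 and 22.3 points, consistent with the "at least 10%" and "about 20%" figures quoted in the list.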

4 Related work

Among deep learning models for program induction, Reinforcement Learning Neural Turing Machines (RL-NTMs) (Zaremba and Sutskever, 2015) are the most similar to NSM, as a non-differentiable machine is controlled by a sequence model. Therefore, both models rely on REINFORCE for training. The main difference between the two is the abstraction level of the programming language. RL-NTM uses lower level operations such as memory address manipulation and byte reading/writing, while NSM uses a high level programming language over a large knowledge base that includes operations such as following properties from entities, or sorting based on a property, which is more suitable for representing semantics. Earlier works such as OOPS (Schmidhuber, 2004) have desirable characteristics, for example, the ability to define new functions. These remain future improvements for NSM.
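To make the "high level" operations concrete, here is a minimal sketch of two of them, following a property from a set of entities and selecting the entities with the largest value of a property, over a toy knowledge base. The names `hop` and `argmax_by`, the data, and the signatures are assumptions made for illustration, not the interpreter's actual function set.

```python
from typing import Dict, Set

# Toy knowledge base (hypothetical): entity -> property -> set of values.
KB: Dict[str, Dict[str, Set[str]]] = {
    "usa": {"city": {"nyc", "la"}},
    "nyc": {"population": {"8400000"}},
    "la": {"population": {"3900000"}},
}


def hop(entities: Set[str], prop: str) -> Set[str]:
    """Follow a property from every entity and collect the results."""
    result: Set[str] = set()
    for entity in entities:
        result |= KB.get(entity, {}).get(prop, set())
    return result


def argmax_by(entities: Set[str], prop: str) -> Set[str]:
    """Return the entities whose numeric value of `prop` is largest."""
    scored = {
        e: max(float(v) for v in KB[e][prop])
        for e in entities
        if KB.get(e, {}).get(prop)
    }
    if not scored:
        return set()
    best = max(scored.values())
    return {e for e, value in scored.items() if value == best}


if __name__ == "__main__":
    cities = hop({"usa"}, "city")
    print(argmax_by(cities, "population"))
```

Each call operates on whole sets of entities, which is the contrast with address-level and byte-level operations drawn in the paragraph above.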

We formulate NSM training as an instance of reinforcement learning (Sutton and Barto, 1998) in order to directly optimize the task reward of the structured prediction problem (Norouzi et al., 2016; Li et al., 2016; Yu et al., 2017). Compared to imitation learning methods (Daume et al., 2009; Ross et al., 2011) that interpolate a model distribution with an oracle, NSM needs to solve a challenging search problem of training from weak supervision in a large search space. Our solution employs two techniques: (a) a symbolic “computer” helps find good programs by pruning the search space; (b) an iterative ML training process, where beam search is used to find pseudo-gold programs. Wiseman and Rush (2016) proposed a max-margin approach to train a sequence-to-sequence scorer. However, their training procedure is more involved, and we did not implement it in this work. MIXER (Ranzato et al., 2015) also proposed to combine ML training and REINFORCE, but they only considered tasks with full supervision. Berant and Liang (2015) applied imitation learning to semantic parsing, but still require hand-crafted grammars and features.

NSM is similar to Neural Programmer (Neelakantan et al., 2015) and Dynamic Neural Module Network (Andreas et al., 2016) in that they all solve the problem of semantic parsing from structured data, and generate programs using similar semantics. The main difference between these approaches is how an intermediate result (the memory) is represented. Neural Programmer and Dynamic-NMN chose to represent results as vectors of weights (row selectors and attention vectors), which enables backpropagation and search through all possible programs in parallel. However, their strategy is not applicable to a large KB such as Freebase, which contains about 100M entities, and more than 20k properties. Instead, NSM chooses a more scalable approach, where the “computer” saves intermediate results, and the neural network only refers to them with variable names (e.g., “R1” for all cities in the US).
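The symbolic half of this design can be sketched as a small variable store: results stay on the "computer" side and only their names are exposed. The class `VariableStore` and its methods are hypothetical, and the neural side (how the network represents and selects a variable name) is deliberately left out.

```python
from typing import Dict, Set


class VariableStore:
    """Keeps intermediate results and hands back short variable names."""

    def __init__(self) -> None:
        self._values: Dict[str, Set[str]] = {}

    def save(self, value: Set[str]) -> str:
        # Name results R1, R2, ... in the order they are produced.
        name = f"R{len(self._values) + 1}"
        self._values[name] = value
        return name

    def load(self, name: str) -> Set[str]:
        return self._values[name]


if __name__ == "__main__":
    store = VariableStore()
    cities = store.save({"nyc", "la"})  # e.g. all cities in the US
    print(cities, len(store.load(cities)))
```

Because the network only ever handles a name such as "R1", the cost on the neural side does not grow with the size of the result, however many entities it contains.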

NSM is similar to the Path Ranking Algorithm (PRA) (Lao et al., 2011) in that semantics is encoded as a sequence of actions, and denotations are used to prune the search space during learning. NSM is more powerful than PRA by 1) allowing more complex semantics to be composed through the use of a key-variable memory; 2) controlling the search procedure with a trained neural network, while PRA only samples actions uniformly; 3) allowing input questions to express complex relations, and then dynamically generating action sequences. PRA can combine multiple semantic representations to produce the final prediction, which remains future work for NSM.

5 Conclusion

We propose the Manager-Programmer-Computer framework for neural program induction. It integrates neural networks with a symbolic non-differentiable computer to support abstract, scalable and precise operations through a friendly neural computer interface. Within this framework, we introduce the Neural Symbolic Machine, which integrates a neural sequence-to-sequence “programmer” with key-variable memory, and a symbolic Lisp interpreter with code assistance. Because the interpreter is non-differentiable, and in order to directly optimize the task reward, we apply REINFORCE and use pseudo-gold programs found by an iterative ML training process to bootstrap training. NSM achieves new state-of-the-art results on a challenging semantic parsing dataset with weak supervision, and significantly closes the gap between weak and full supervision. It is trained end-to-end, and does not require any feature engineering or domain-specific knowledge.

Acknowledgements

We thank Arvind Neelakantan, Mohammad Norouzi, Tom Kwiatkowski, Eugene Brevdo, Lukasz Kaiser, Thomas Strohmann, Yonghui Wu, Zhifeng Chen, Alexandre Lacoste, and John Blitzer for discussions and help. The second author is partially supported by the Israel Science Foundation, grant 942/16.

References

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for question answering. CoRR abs/1601.01705.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP. volume 2, page 6.

Jonathan Berant and Percy Liang. 2015. Imitation learning of agenda-based semantic parsers. TACL 3:545–558.

K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In International Conference on Management of Data (SIGMOD). pages 1247–1250.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1724–1734. http://www.aclweb.org/anthology/D14-1179.

H. Daume, J. Langford, and D. Marcu. 2009. Search-based structured prediction. Machine Learning 75:297–325.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Association for Computational Linguistics (ACL).

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio G. Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià P. Badia, Karl M. Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. 2016a. Hybrid computing using a neural network with dynamic external memory. Nature advance online publication. https://doi.org/10.1038/nature20101.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. 2016b. Hybrid computing using a neural network with dynamic external memory. Nature.

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6):82–97.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Association for Computational Linguistics (ACL).

J. Jiang, A. Teichert, J. Eisner, and H. Daume. 2012. Learned prioritization for trading off accuracy and speed. In Advances in Neural Information Processing Systems (NIPS).

Łukasz Kaiser and Ilya Sutskever. 2015. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.

Ni Lao, Tom Mitchell, and William W Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 529–539.

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL). pages 590–599.

Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. 2015. Neural programmer: Inducing latent programs with gradient descent. CoRR abs/1511.04834.

Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, and Dale Schuurmans. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems (NIPS).

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In ACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.


Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Scott Reed and Nando de Freitas. 2016. Neural programmer-interpreters. In ICLR.

S. Ross, G. Gordon, and A. Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Artificial Intelligence and Statistics (AISTATS).

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2016. Prioritized experience replay. In International Conference on Learning Representations. Puerto Rico.

Jurgen Schmidhuber. 2004. Optimal ordered problem solver. Machine Learning 54(3):211–254.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. pages 3104–3112.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems (NIPS).

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Association for Computational Linguistics (ACL).

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine Learning. pages 229–256.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. CoRR abs/1606.02960. http://arxiv.org/abs/1606.02960.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144. http://arxiv.org/abs/1609.08144.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Association for Computational Linguistics (ACL).

Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Association for Computational Linguistics (ACL).

Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. 2015. Neural enquirer: Learning to query tables. arXiv preprint arXiv:1512.00965.

Adam Yu, Hongrae Lee, and Quoc Le. 2017. Learning to skim text. In ACL.

Wojciech Zaremba and Ilya Sutskever. 2015. Reinforcement learning neural Turing machines. arXiv preprint arXiv:1505.00521.

M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI). pages 1050–1055.

L. S. Zettlemoyer and M. Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI). pages 658–666.








A Supplementary Material

A.1 Extra Figures

[Figure 3 diagram. Components: Manager, Programmer (neural), Computer (symbolic, non-differentiable). Labels: weak supervision, question & reward, program, code assist, answer, built-in functions, knowledge base (built-in data).]

Figure 3: Manager-Programmer-Computer framework

[Figure 4 diagram. Labels: Encoder, Decoder, Attention (dot product), SoftMax, Linear Projection (300*50=15k), GRU (2*50*3*50=15k, shown twice), q0, q1, ..., qm, a0, a1, ..., an.]

Figure 4: Seq2seq model architecture with dot-product attention and dropout at GRU input, output, and softmax layers.
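The parameter counts printed in Figure 4 can be checked with a few lines of arithmetic. Reading the factors as a 300-dimensional input projected to 50 dimensions, and a GRU with 3 weight blocks, each mapping a concatenated input and hidden vector of size 2*50 to 50 units, is an assumption; the figure only gives the products. Bias terms are ignored, as they are in the figure.

```python
def projection_params(input_dim: int = 300, output_dim: int = 50) -> int:
    """Weights of a linear projection (assumed dimensions, no bias)."""
    return input_dim * output_dim


def gru_params(hidden_dim: int = 50, num_blocks: int = 3) -> int:
    """Weights of a GRU whose input and hidden sizes both equal hidden_dim (no bias)."""
    # Each block maps the concatenated [input; hidden] vector to hidden_dim units.
    return 2 * hidden_dim * num_blocks * hidden_dim


if __name__ == "__main__":
    print(projection_params(), gru_params())
```

Both products come to 15,000, matching the "15k" labels in the figure.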

[Figure 5 diagram. Components: KG server 1, ..., KG server m; Decoder 1, Decoder 2, ..., Decoder n; Trainer. Labels: QA pairs 1, 2, ..., n; Solutions 1, 2, ..., n; Model checkpoint.]

Figure 5: System Architecture. 100 decoders, 50 KB servers and 1 trainer.
