
Proceedings of NAACL-HLT 2019, pages 1942–1954, Minneapolis, Minnesota, June 2 – June 7, 2019. ©2019 Association for Computational Linguistics


Value-based Search in Execution Space for Mapping Instructions to Programs

Dor Muhlgay¹  Jonathan Herzig¹  Jonathan Berant¹,²

¹School of Computer Science, Tel-Aviv University
²Allen Institute for Artificial Intelligence

{dormuhlg@mail,jonathan.herzig@cs,joberant@cs}.tau.ac.il

Abstract

Training models to map natural language instructions to programs, given target world supervision only, requires searching for good programs at training time. Search is commonly done using beam search in the space of partial programs or program trees, but as the length of the instructions grows, finding a good program becomes difficult. In this work, we propose a search algorithm that uses the target world state, known at training time, to train a critic network that predicts the expected reward of every search state. We then score search states on the beam by interpolating their expected reward with the likelihood of the programs represented by the search state. Moreover, we search not in the space of programs but in the more compressed space of program executions, augmented with recent entities and actions. On the SCONE dataset, we show that our algorithm dramatically improves performance on all three domains compared to standard beam search and other baselines.

1 Introduction

Training models that can understand natural language instructions and execute them in a real-world environment is of paramount importance for communicating with virtual assistants and robots, and therefore has attracted considerable attention (Branavan et al., 2009; Vogel and Jurafsky, 2010; Chen and Mooney, 2011). A prominent approach is to cast the problem as semantic parsing, where instructions are mapped to a high-level programming language (Artzi and Zettlemoyer, 2013; Long et al., 2016; Guu et al., 2017). Because annotating programs at scale is impractical, it is desirable to train a model from instructions, an initial world state, and a target world state only, letting the program itself be a latent variable.

Learning from such weak supervision results in a difficult search problem at training time. The model must search for a program that, when executed, leads to the correct target state. Early work employed lexicons and grammars to constrain the search space (Clarke et al., 2010; Liang et al., 2011; Krishnamurthy and Mitchell, 2012; Berant et al., 2013; Artzi and Zettlemoyer, 2013), but the recent success of sequence-to-sequence models (Sutskever et al., 2014) shifted most of the burden to learning. Search is often performed simply using beam search, where program tokens are emitted from left to right, or program trees are generated top-down (Krishnamurthy et al., 2017; Yin and Neubig, 2017; Cheng et al., 2017; Rabinovich et al., 2017) or bottom-up (Liang et al., 2017; Guu et al., 2017; Goldman et al., 2018). Nevertheless, when instructions are long and complex and reward is sparse, the model may never find enough correct programs, and training will fail.

In this paper, we propose a novel search algorithm for mapping a sequence of natural language instructions to a program, which extends standard beam search in two ways. First, we capitalize on the target world state being available at training time and train a critic network that, given the language instructions, current world state, and target world state, estimates the expected future reward for each search state. In contrast to traditional beam search, where states representing partial programs are scored based on their likelihood only, we also consider expected future reward, leading to a more targeted search at training time. Second, rather than search in the space of programs, we search in a more compressed execution space, where each state is defined by a partial program's execution result.

We evaluated our method on the SCONE dataset, which includes three different domains where long sequences of 5 instructions are mapped to programs. We show that while standard beam search gets stuck in local optima and is unable to discover good programs for many examples, our model is able to bootstrap, improving final performance by 20 points on average. We also perform extensive analysis and show that both value-based search and searching in execution space contribute to the final performance. Our code and data are available at http://gitlab.com/tau-nlp/vbsix-lang2program.

2 Background

Mapping instructions to programs invariably involves a context, such as a database or a robotic environment, in which the program (or logical form) is executed. The goal is to train a model given a training set $\{(x^{(j)} = (c^{(j)}, u^{(j)}), y^{(j)})\}_{j=1}^{N}$, where $c$ is the context, $u$ is a sequence of natural language instructions, and $y$ is the target state of the environment after following the instructions, which we refer to as the denotation. The model is trained to map the instructions $u$ to a program $z$ such that executing $z$ in the context $c$ results in the denotation $y$, which we denote by $[\![z]\!]_c = y$. Thus, the program $z$ is a latent variable we must search for at both training and test time. When the sequence of instructions is long, search becomes hard, particularly in the early stages of training.

Recent work tackled the training problem using variants of reinforcement learning (RL) (Suhr and Artzi, 2018; Liang et al., 2018) or maximum marginal likelihood (MML) (Guu et al., 2017; Goldman et al., 2018). We now briefly describe MML training, which we base our training procedure on; it outperformed RL in past work under comparable conditions (Guu et al., 2017).

We denote by $\pi_\theta(\cdot)$ a model, parameterized by $\theta$, that generates the program $z$ token by token from left to right. The model $\pi_\theta$ receives the context $c$, the instructions $u$, and the previously predicted tokens $z_{1\ldots t-1}$, and returns a distribution over the next program token $z_t$. The probability of a program prefix is defined to be $p_\theta(z_{1\ldots t} \mid u, c) = \prod_{i=1}^{t} \pi_\theta(z_i \mid u, c, z_{1\ldots i-1})$. The model is trained to maximize the MML objective:

$$J_{MML} = \log \prod_{(c,u,y)} p_\theta(y \mid c, u) = \sum_{(c,u,y)} \log \Big( \sum_{z} p_\theta(z \mid u, c) \cdot R(z) \Big),$$

where $R(z) = 1$ if $[\![z]\!]_c = y$, and $0$ otherwise (for brevity, we omit $c$ and $y$ from $R(\cdot)$).

Figure 1: Example from the SCENE domain in SCONE (3 utterances), where people with different hats and shirts are located in different positions. The example contains (from top to bottom): an initial world x (the people's properties, from left to right: a blue shirt with an orange hat, a blue shirt with a red hat, and a green shirt with a yellow hat), a sequence of natural language instructions ("1. The person in an orange hat moves to the left of the person in a red hat. 2. He then disappears. 3. Then a person in orange appears on the far right."), and a target world y (the people's properties, from left to right: a blue shirt with a red hat, a green shirt with a yellow hat, and an orange shirt).

Typically, the MML objective is optimized with stochastic gradient ascent, where the gradient for an example $(c, u, y)$ is:

$$\nabla_\theta J_{MML} = \sum_{z} q(z) \cdot R(z)\, \nabla_\theta \log p_\theta(z \mid c, u), \qquad q(z) := \frac{R(z) \cdot p_\theta(z \mid c, u)}{\sum_{z'} R(z') \cdot p_\theta(z' \mid c, u)}.$$
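In practice (as described next), the sum over all programs is approximated by the candidate programs found by search. Below is a minimal sketch of computing the weights $q(\cdot)$ over such a candidate set; the function and variable names are ours, not those of the released implementation:

```python
import numpy as np

def mml_weights(log_probs, rewards):
    """Compute q(z) over a candidate set of programs.

    log_probs: model log-probabilities log p_theta(z | c, u) per candidate.
    rewards:   R(z) in {0, 1} per candidate.
    The sum over all programs is approximated by the candidates found by search.
    """
    log_probs = np.asarray(log_probs, dtype=np.float64)
    rewards = np.asarray(rewards, dtype=np.float64)
    # Subtracting the max before exponentiating leaves the ratio unchanged
    # but avoids numerical underflow for long programs.
    weighted = rewards * np.exp(log_probs - log_probs.max())
    total = weighted.sum()
    return weighted / total if total > 0 else np.zeros_like(weighted)
```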

The search problem arises because it is impossible to enumerate the set of all programs, and thus the sum over programs is approximated by a small set of high-probability programs, which have high weights $q(\cdot)$ that dominate the gradient. Search is commonly done with beam search, an iterative algorithm that builds an approximation of the highest-probability programs according to the model. At each time step $t$, the algorithm constructs a beam $B_t$ of at most $K$ program prefixes of length $t$. Given a beam $B_{t-1}$, $B_t$ is constructed by selecting the $K$ most likely continuations of prefixes in $B_{t-1}$, according to $p_\theta(z_{1\ldots t} \mid \cdot)$. The algorithm runs $L$ iterations and returns all complete programs discovered.
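The following is a minimal sketch of this standard beam search over program prefixes. The helpers `score_prefix`, `continuations`, and `is_complete` are hypothetical placeholders for the model's prefix log-probability, its valid-token expansion, and the end-of-program test:

```python
def beam_search(score_prefix, continuations, is_complete, K, L):
    """Standard beam search over program prefixes.

    score_prefix(prefix)  -> log p_theta(prefix | u, c)
    continuations(prefix) -> list of valid next tokens (empty if complete)
    is_complete(prefix)   -> True if the prefix is a full program
    """
    beam = [()]          # start from the empty prefix
    complete = []
    for _ in range(L):
        candidates = []
        for prefix in beam:
            for token in continuations(prefix):
                candidates.append(prefix + (token,))
        # keep the K most likely prefixes according to the model
        candidates.sort(key=score_prefix, reverse=True)
        beam = candidates[:K]
        complete.extend(p for p in beam if is_complete(p))
    return complete
```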

In this paper, our focus is the search problem that arises at training time when training from denotations, i.e., finding programs that execute to the right denotation. Thus, we would like to focus on scenarios where programs are long and standard beam search fails. We next describe the SCONE dataset, which provides such an opportunity.


Figure 2: An illustration of a state transition in SCONE, where $\pi_\theta$ predicted the action token MOVE. Before the transition, the command history is empty and the program stack contains the arguments 3 and 1, which were computed in previous steps; the current utterance is u1 ("The person in the yellow hat moves to the left of the person in blue"). Note that the state does not include the partial program that computed those arguments. After the transition, the executor popped the arguments from the program stack and applied the action MOVE 3 1: the man in position 3 (a person with a yellow hat and a red shirt) moved to position 1 (to the left of the person with the blue shirt), and the action was added to the execution history with its arguments. Since this action terminated the command, $\pi_\theta$ advanced to the next utterance (u2, "Then he disappears").

The SCONE dataset  Long et al. (2016) presented the SCONE dataset, where a sequence of instructions needs to be mapped to a program consisting of a sequence of executable commands. The dataset has three domains, where each domain includes several objects (such as people or beakers), each with different properties (such as shirt color or chemical color). SCONE provides a good environment for stress-testing search algorithms because a long sequence of instructions needs to be mapped to a program. Figure 1 shows an example from the SCENE domain.

Formally, the context in SCONE is a world specified by a list of positions, where each position may contain an object with certain properties. A formal language is defined to interact with the world. The formal language contains constants (e.g., numbers and colors), functions that allow querying the world and referring to objects and intermediate computations, and actions, which are functions that modify the world state. Each command is composed of a single action and several arguments constructed recursively from constants and functions. E.g., the command MOVE(HASHAT(YELLOW), LEFTOF(HASSHIRT(BLUE))) contains the action MOVE, which moves a person to a specified position. The person is computed by HASHAT(YELLOW), which queries the world for the position of the person with a yellow hat, and the target position is computed by LEFTOF(HASSHIRT(BLUE)). We refer to Guu et al. (2017) for a full description of the language.

Our goal is to train a model that, given an initial world $w_0$ and a sequence of natural language utterances $u = (u_1, \ldots, u_M)$, will map each utterance $u_i$ to a command $z_i$ such that applying the program $z = (z_1, \ldots, z_M)$ to $w_0$ results in the target world, i.e., $[\![z]\!]_{w_0} = w_M = y$.

3 Markov Decision Process Formulation

To present our algorithm, we first formulate the problem as a Markov Decision Process (MDP), which is a tuple $(\mathcal{S}, \mathcal{A}, R, \delta)$. To define the state set $\mathcal{S}$, we assume all program prefixes are executable, which can easily be done as we show below. The execution result of a prefix $z$ in the context $c$, denoted by $[\![z]\!]_c^{ex}$, contains its denotation and additional information stored in the executor. Let $Z_{pre}$ be the set of all valid program prefixes. The set of states is defined to be $\mathcal{S} = \{(x, [\![z]\!]_c^{ex}) \mid z \in Z_{pre}\}$, i.e., the input paired with all possible execution results.

The action set $\mathcal{A}$ includes all possible program tokens,¹ and the transition function $\delta : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is computed by the executor. Last, the reward $R(s, a) = 1$ iff the action $a$ ends the program and leads to a state $\delta(s, a)$ whose denotation is equal to the target $y$. The model $\pi_\theta(\cdot)$ is a parameterized policy that provides a distribution over the program vocabulary at each time step.

SCONE as an MDP  We define the partial execution result $[\![z]\!]_c^{ex}$ in SCONE as described by Guu et al. (2017). We assume that SCONE's formal language is written in postfix notation, e.g., the command MOVE(HASHAT(YELLOW), LEFTOF(HASSHIRT(BLUE))) is written as YELLOW HASHAT BLUE HASSHIRT LEFTOF MOVE.

¹ Decoding is constrained to valid program tokens only.


With this notation, a partial program can be executed left-to-right by maintaining a program stack, $\psi$. The executor pushes constants (YELLOW) onto $\psi$, and applies functions (HASHAT) by popping their arguments from $\psi$ and pushing back the computed result. Actions (MOVE) are applied by popping their arguments from $\psi$ and performing the action in the current world.
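A minimal sketch of such a left-to-right postfix executor is shown below. The `constants`, `functions`, and `actions` arguments are hypothetical stand-ins for SCONE's actual formal language definitions:

```python
def execute_prefix(tokens, world, constants, functions, actions):
    """Execute a postfix program prefix left to right with a program stack.

    constants: set of constant tokens (e.g., colors, numbers).
    functions: dict token -> (arity, fn(world, *args) -> value).
    actions:   dict token -> (arity, fn(world, *args) -> new world).
    Returns (world, stack, history), i.e., the execution result of the prefix.
    """
    stack, history = [], []
    for token in tokens:
        if token in constants:
            stack.append(token)                        # push constants
        elif token in functions:
            arity, fn = functions[token]
            args = [stack.pop() for _ in range(arity)][::-1]
            stack.append(fn(world, *args))             # push computed result
        elif token in actions:
            arity, act = actions[token]
            args = [stack.pop() for _ in range(arity)][::-1]
            world = act(world, *args)                  # modify the world
            history.append((token, tuple(args)))       # record in the history
        else:
            raise ValueError(f"unknown token: {token}")
    return world, stack, history
```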

To handle references to previously executed commands, SCONE's formal language includes functions that provide access to the actions and arguments of previous commands. To this end, the executor maintains an execution history, $h_i = (e_1, \ldots, e_i)$, a list of executed actions and their arguments. Thus, the execution result of a program prefix is $[\![z]\!]_{w_0}^{ex} = (w_{i-1}, \psi, h_{i-1})$.

We adopt the model from Guu et al. (2017) (architecture details are in Appendix A): the policy $\pi_\theta$ observes $\psi$ and $u_i$, the current utterance being parsed, and predicts a token. When the model predicts an action token that terminates a command, the model moves to the next utterance (until all utterances have been processed). The model uses functions to query the world $w_{i-1}$ and the history $h_{i-1}$. Thus, each MDP state in SCONE is a pair $s = (u_i, [\![z]\!]_{w_0}^{ex})$. Figure 2 illustrates a state transition in the SCENE domain. Importantly, the state does not store the full program prefix, and many different prefixes may lead to the same state. Next, we describe a search algorithm for this MDP.
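For concreteness, a search state in this formulation can be represented roughly as follows (a sketch with our own field names, not the released data structures). Making the state hashable is what later lets execution-space search merge different prefixes that reach the same state:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SconeState:
    """A search state: the current utterance index plus the execution result."""
    utterance_idx: int   # index of the utterance being parsed
    world: Tuple         # current world state w_{i-1}
    stack: Tuple         # program stack psi
    history: Tuple       # execution history h_{i-1}

    # Two different program prefixes that produce the same fields compare
    # equal, which is exactly the compression exploited by execution space.
```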

4 Searching in Execution Space

Model improvement relies on generating correct programs given a possibly weak model. Standard beam search explores the space of all program token sequences up to some fixed length. We propose two technical contributions to improve search: (a) we simplify the search problem by searching for correct executions rather than correct programs; (b) we use the target denotation at training time to better estimate partial program scores in the search space. We describe these next.

4.1 Reducing program search to execution search

Program space can be formalized as a directed tree $T = (V_T, E_T)$, where vertices $V_T$ are program prefixes, and labeled edges $E_T$ represent prefix continuations: an edge $e = (z, z')$ labeled with the token $a$ represents a continuation of the prefix $z$ with the token $a$, which yields the prefix $z'$. The root of the tree represents the empty sequence.

Figure 3: A set of commands represented in program space (left) and execution space (right). The commands relate to the first world in Figure 2. Since multiple prefixes have the same execution result (e.g., YELLOW HASHAT and RED HASSHIRT), the execution space is smaller.

Similarly, execution space is a directed graph $G = (V_G, E_G)$ induced from the MDP described in Section 3. Vertices $V_G$ represent MDP states, which express execution results, and labeled edges $E_G$ represent transitions: an edge $(s_1, s_2)$ labeled by the token $a$ means that $\delta(s_1, a) = s_2$. Since multiple programs have the same execution result, execution space is a compressed representation of program space. Figure 3 shows a few commands in both program space and execution space. Execution space is smaller, and so searching in it is easier.
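A sketch of how prefixes collapse into execution-space vertices: expanding states and keying them by their (hashable) execution result merges all prefixes that share that result. The helper names below are ours; `transition` stands for the MDP's δ, computed by the executor:

```python
def expand_execution_graph(frontier, valid_tokens, transition):
    """One layer of execution-space expansion.

    frontier:     set of current SconeState vertices.
    valid_tokens: fn(state) -> iterable of valid program tokens.
    transition:   fn(state, token) -> next state (the MDP's delta).
    Returns the next layer of vertices and the labeled edges discovered.
    """
    next_layer, edges = set(), []
    for state in frontier:
        for token in valid_tokens(state):
            nxt = transition(state, token)
            edges.append((state, token, nxt))
            next_layer.add(nxt)   # duplicates merge: same execution result
    return next_layer, edges
```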

Each path in execution space represents a different program prefix, and the path's final state represents its execution result. Program search can therefore be reduced to execution search: given an example $(c, u, y)$ and a model $\pi_\theta$, we can use $\pi_\theta$ to explore execution space, discover correct terminal states, i.e., states corresponding to correct full programs, and extract the paths leading to those states, as sketched below. As the number of paths may be exponential in the size of the graph, we use beam search to extract the most probable correct programs (according to the model) from the discovered graph.
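A rough sketch of this program-extraction step, i.e., a small beam search restricted to the edges discovered during exploration (our own helper names; this corresponds to line 3 of Algorithm 1 below):

```python
def extract_programs(edges, start, terminals, prefix_log_prob,
                     beam_size=8, max_len=50):
    """Beam search restricted to the discovered execution-space graph.

    edges:     (state, token, next_state) triples discovered during search.
    terminals: set of correct terminal states.
    prefix_log_prob(tokens) -> log p_theta of the token sequence.
    Returns correct programs (token sequences), most probable first.
    """
    by_source = {}
    for s, a, t in edges:
        by_source.setdefault(s, []).append((a, t))

    beam, found = [((), start)], []
    for _ in range(max_len):
        candidates = []
        for prefix, state in beam:
            for token, nxt in by_source.get(state, []):
                new_prefix = prefix + (token,)
                if nxt in terminals:
                    found.append(new_prefix)
                else:
                    candidates.append((new_prefix, nxt))
        candidates.sort(key=lambda c: prefix_log_prob(c[0]), reverse=True)
        beam = candidates[:beam_size]
        if not beam:
            break
    return sorted(found, key=prefix_log_prob, reverse=True)
```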

Our approach is similar to the DPD algorithm (Pasupat and Liang, 2016), where CKY-style search is performed in denotation space, followed by search in a pruned space of programs. However, DPD was used without learning, and the search was not guided by a trained model, which is a major part of our algorithm.

4.2 Value-based Beam Search in Execution Space

We propose Value-based Beam Search in eXecution space (VBSIX), a variant of beam search modified for searching in execution space that scores search states with a value-based network. VBSIX is formally defined in Algorithm 1, which we refer to throughout this section.

Algorithm 1 Program Search with VBSIX
1:  function PROGRAMSEARCH(c, u, y, πθ, Vφ)
2:      G, T ← VBSIX(c, u, y, πθ, Vφ)
3:      Z ← find paths in G that lead to states in T with beam search
4:      return Z
5:  function VBSIX(c, u, y, πθ, Vφ)
6:      T ← ∅, P ← {}                      ▷ init terminal states and DP chart
7:      s0 := the empty program parsing state
8:      B0 ← {s0}, G = ({s0}, ∅)           ▷ init beam and graph
9:      P0[s0] ← 1                         ▷ the probability of s0 is 1
10:     for t ∈ [1 . . . L] do
11:         Bt ← ∅
12:         for s ∈ Bt−1, a ∈ A do
13:             s′ ← δ(s, a)
14:             add edge (s, s′) to G, labeled with a
15:             if s′ is a correct terminal then
16:                 T ← T ∪ {s′}
17:             else
18:                 Pt[s′] ← Pt[s′] + Pt−1[s] · πθ(a | s)
19:                 Bt ← Bt ∪ {s′}
20:         sort Bt by AC-SCORER(·)
21:         Bt ← keep the top-K states in Bt
22:     return G, T
23: function AC-SCORER(s, Pt, Vφ, y)
24:     return Pt[s] + Vφ(s, y)

Standard beam search is a breadth-first traversal of the program-space tree, where a fixed number of vertices are kept in the beam at every level of the tree. The selection of vertices is done by scoring their corresponding prefixes according to $p_\theta(z_{1\ldots t} \mid u, c)$. VBSIX applies the same traversal in execution space (lines 10–21). However, since each vertex in execution space represents an execution result and not a particular prefix, we need to modify the scoring function.

Let $s$ be a vertex discovered in iteration $t$ of the search. We propose two scores for ranking $s$. The first is the actor score, the probability of reaching vertex $s$ after $t$ iterations² according to the model $\pi_\theta$. The second and more novel score is the value-based critic score, an estimate of the state's expected reward. The AC-score is the sum of these two scores (lines 23–24).

The actor score, $p_\theta^t(s)$, is the sum of the probabilities of all prefixes of length $t$ that reach $s$ (rather than the probability of a single prefix, as in beam search). VBSIX approximates $p_\theta^t(s)$ by performing the summation only over paths that reach $s$ via states in the beam $B_{t-1}$, which can be done efficiently with a dynamic programming (DP) chart $P_t[s]$ that keeps actor-score approximations in each iteration (line 18). This lower-bounds the true $p_\theta^t(s)$, since some prefixes of length $t$ that reach $s$ might not have been discovered.

² We score paths in different iterations independently to avoid a bias toward shorter paths. An MDP state that appears in multiple iterations will get a different score in each iteration.

Figure 4: An illustration of scoring and pruning states in step t of VBSIX. Discovered edges are drawn with full lines, undiscovered edges are dashed, correct terminal states are dashed and in yellow, and states that are kept in the beam are in dark grey. The actor scores of the highlighted vertex (0.5) and its parents (0.6, 0.1) are in blue, and its critic score (0.1) is in green.

Contrary to standard beam search, we want to score search states also with a critic score $E_{p_\theta}[R(s)]$, which is the sum of the probabilities of the suffixes that lead from $s$ to a correct terminal state:

$$E_{p_\theta}[R(s)] = \sum_{\tau(s)} p_\theta(\tau(s) \mid s) \cdot R(\tau(s)),$$

where $\tau(s)$ ranges over all possible trajectories starting from $s$ and $R(\tau(s))$ is the reward observed when taking the trajectory $\tau(s)$ from $s$. Enumerating all trajectories $\tau(s)$ is intractable, and so we approximate $E_{p_\theta}[R(s)]$ with a trained value network $V_\phi(s, y)$, parameterized by $\phi$. Importantly, because we are searching at training time, we can condition $V_\phi$ on both the current state $s$ and the target denotation $y$. At test time we use $\pi_\theta$ only, which does not need access to $y$.

Since the value function and DP chart are used for efficient ranking, the asymptotic run-time complexity of VBSIX is the same as standard beam search, $O(K \cdot |\mathcal{A}| \cdot L)$. The beam search in line 3, where we extract programs from the constructed execution-space graph, can be done with a small beam size, since it operates over a small space of correct programs. Thus, its contribution to the algorithm's complexity is negligible.
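The core ranking step of VBSIX (lines 12–21 of Algorithm 1) can be sketched as follows, combining the DP-chart update for the actor score with the learned critic score. The names `policy_prob`, `value_net`, and the other helpers are our placeholders, not the released implementation:

```python
def vbsix_step(beam, prev_chart, valid_tokens, transition, policy_prob,
               value_net, target, is_correct_terminal, K):
    """One VBSIX iteration: expand, update the DP chart, score, and prune.

    prev_chart: dict state -> approximate actor score P_{t-1}[s].
    policy_prob(state, token) -> pi_theta(token | state).
    value_net(state, target)  -> estimated expected reward V_phi(s, y).
    Returns the pruned beam, the new chart, correct terminals, and edges.
    """
    chart, terminals, edges = {}, set(), []
    for s in beam:
        for a in valid_tokens(s):
            s_next = transition(s, a)
            edges.append((s, a, s_next))
            if is_correct_terminal(s_next, target):
                terminals.add(s_next)
            else:
                # actor score: sum of prefix probabilities reaching s_next
                chart[s_next] = chart.get(s_next, 0.0) + prev_chart[s] * policy_prob(s, a)

    def ac_score(state):
        return chart[state] + value_net(state, target)   # actor + critic

    new_beam = sorted(chart, key=ac_score, reverse=True)[:K]
    return new_beam, chart, terminals, edges
```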

Figure 4 visualizes scoring and pruning with the actor-critic score in iteration $t$. Vertices in $B_t$ are discovered by expanding vertices in $B_{t-1}$, and each vertex is ranked by the AC-scorer. The highlighted vertex's score is 0.6, the sum of its actor score (0.5) and its critic score (0.1). The actor score is the sum over its prefixes ($(0.05 + 0.55) \cdot 0.7 + 0.1 \cdot 0.8 = 0.5$), and the critic score is a value-network estimate of the sum of probabilities of outgoing trajectories reaching correct terminal states ($0.02 + 0.08 = 0.1$). Only the top-$K$ states are kept in the beam ($K = 4$ in the figure).

VBSIX leverages execution space in a number of ways. First, since each vertex in execution space compactly represents multiple prefixes, a beam in VBSIX effectively holds more prefixes than in standard beam search. Second, running beam search over a graph rather than a tree is less greedy, as the same vertex can surface back even if it fell out of the beam.

The value-based approach has several advantages as well. First, evaluating the probability of outgoing trajectories provides look-ahead that is missing from standard beam search. Second (and most importantly), $V_\phi$ is conditioned on $y$, which $\pi_\theta$ does not have access to; this allows finding correct programs with low model probability, which $\pi_\theta$ can then learn from. We note that our two contributions are orthogonal: the critic score can be used in program space, and search in execution space can be done with actor scores only.

5 Training

We train the model $\pi_\theta$ and the value network $V_\phi$ jointly (Algorithm 2). $\pi_\theta$ is trained using MML over the discovered correct programs (line 4, Algorithm 1). The value network is trained as follows. Given a training example $(c, u, y)$, we generate a set of correct programs $Z_{pos}$ with VBSIX. The value network needs negative examples, and so for every incorrect terminal state $z_{neg}$ found during search with VBSIX we create a single program leading to $z_{neg}$. We then construct a set of training examples $D_v$, where each example $((s, y), l)$ labels a state encountered while generating the programs $z \in Z$ with the probability mass of the correct program suffixes that extend it, i.e., $l = \sum_{z_t} p_\theta(z_{t\ldots|z|})$, where $z_t$ ranges over all $z \in Z$ and $t \in [1 \ldots |z|]$. Finally, we train $V_\phi$ to minimize the log-loss objective:

$$\sum_{((s,y),l) \in D_v} l \cdot \log V_\phi(s, y) + (1 - l) \cdot \log\big(1 - V_\phi(s, y)\big).$$

Similar to actor-score estimation, the labeling of examples for $V_\phi$ is affected by beam-search errors: the labels lower-bound the true expected reward. However, since search is guided by the model, the undiscovered programs are likely to have low probability. Moreover, estimates from $V_\phi$ are based on multiple examples, compared to the probabilities in the DP chart, and are thus more robust to search errors.

Algorithm 2 Actor-Critic Training
1:  procedure TRAIN()
2:      initialize θ and φ randomly
3:      while πθ not converged do
4:          (x := (c, u), y) ← select a random example
5:          Zpos ← PROGRAMSEARCH(c, u, y, πθ, Vφ)
6:          Zneg ← programs leading to incorrect terminal states
7:          Dv ← BUILDVALUEEXAMPLES(Zpos ∪ Zneg, c, u, y)
8:          update φ using Dv; update θ using (x, Zpos, y)
9:  function BUILDVALUEEXAMPLES(Z, c, u, y)
10:     for z ∈ Z do
11:         for t ∈ [1 . . . |z|] do
12:             s ← ⟦z1...t⟧^ex_c
13:             L[s] ← L[s] + pθ(zt...|z| | c, u) · R(z)
14:     Dv ← {((s, y), L[s])}_{s ∈ L}
15:     return Dv
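A sketch of BUILDVALUEEXAMPLES in Python. The helpers `execute_prefix_state` (mapping a prefix to its MDP state) and `suffix_prob` (returning $p_\theta(z_{t\ldots|z|} \mid c, u)$) are our own hypothetical names:

```python
from collections import defaultdict

def build_value_examples(programs, rewards, execute_prefix_state, suffix_prob):
    """Label each visited state with the probability mass of correct suffixes.

    programs: list of token sequences (correct and incorrect programs).
    rewards:  R(z) in {0, 1} for each program.
    Returns (state, label) pairs for training V_phi; the target y is fixed
    per training example and is supplied to the network separately.
    """
    labels = defaultdict(float)
    for z, r in zip(programs, rewards):
        for t in range(1, len(z) + 1):
            s = execute_prefix_state(z[:t])        # state reached after z_{1..t}
            labels[s] += suffix_prob(z, t) * r     # add p(z_{t..|z|}) * R(z)
    return list(labels.items())
```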

Neural network architecture  We adapt the model proposed by Guu et al. (2017) for SCONE. The model receives the current utterance $u_i$ and the program stack $\psi$, and returns a distribution over the next token. Our value network receives the same input, but also the next utterance $u_{i+1}$, the world state $w_i$, and the target world state $y$, and outputs a scalar. Appendix A provides a full description.

6 Experiments

6.1 Experimental setup

We evaluated our method on the three domains of SCONE with the standard accuracy metric, i.e., the proportion of test examples for which the predicted program has the correct denotation. We trained with VBSIX, and used standard beam search ($K = 32$) at test time to generate programs. Each test example contains 5 utterances, and, similar to prior work, we reported the model accuracy on all 5 utterances as well as on the first 3 utterances. We ran each experiment 6 times with different random seeds and reported the average accuracy and standard deviation.

In contrast to prior work on SCONE (Long et al., 2016; Guu et al., 2017; Suhr and Artzi, 2018), where models were trained on all sequences of 1 or 2 utterances and were thus exposed during training to all gold intermediate states, we trained from longer sequences, keeping intermediate states latent. This leads to a harder search problem that was not addressed previously, but makes our results incomparable to previous results.³ In SCENE and TANGRAM, we used the first 4 and 5 utterances as examples. In ALCHEMY, we used the first utterance and 5 utterances.

³ For completeness, we show the performance on these datasets from prior work in Appendix C.


                   Beam   SCENE                 ALCHEMY               TANGRAM
                          3 utt      5 utt      3 utt      5 utt      3 utt      5 utt
MML                32     8.4±2.0    7.2±1.3    41.9±22.8  33.2±20.1  32.5±20.7  16.8±14.1
MML                64     15.4±12.6  12.3±9.6   44.6±23.7  36.3±20.6  45.6±18.0  25.8±12.6
EXPERT-MML         32     1.8±1.5    1.6±1.2    29.4±22.7  23.1±18.8  2.4±0.8    1.2±0.5
VBSIX              32     34.2±27.5  28.2±20.7  74.5±1.1   64.8±1.5   65.0±0.8   43.0±1.3

Table 1: Test accuracy and standard deviation of VBSIX compared to MML baselines (top) and our training methods (bottom). We evaluate the same model over the first 3 and 5 utterances in each domain.

Search space   Value   SCENE                 ALCHEMY               TANGRAM
                       3 utt      5 utt      3 utt      5 utt      3 utt      5 utt
Program        No      5.5±0.5    3.8±0.6    36.4±26.5  25.4±23.0  34.4±18.3  15.6±12.8
Execution      No      7.4±10.4   4.0±5.8    41.3±28.6  28.2±23.27 33.5±15.5  12.7±10.4
Program        Yes     7.6±8.3    3.4±2.9    78.5±1.0   72.8±1.3   66.8±1.5   42.8±1.9
Execution      Yes     31.0±24.7  22.6±19.6  81.9±1.3   75.2±2.9   68.6±2.0   44.2±2.1

Table 2: Validation accuracy when ablating the different components of VBSIX. The first line presents MML, the last line is VBSIX, and the intermediate lines examine execution space and value-based networks separately.

Training details  To warm-start the value network, we trained it for a few thousand steps, and only then started re-ranking with its predictions. Moreover, we gain efficiency by first returning $K_0$ (= 128) states with the actor score, and then re-ranking with the actor-critic score, returning $K$ (= 32) states. Last, we use the value network only in the last two utterances of every example, since we found it has less effect in earlier utterances, where future uncertainty is large. We used the Adam optimizer (Kingma and Ba, 2014) and fixed GloVe embeddings (Pennington et al., 2014) for utterance words.
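A sketch of this two-stage ranking (our own names; `actor_score` corresponds to the DP-chart entry and `critic_score` to $V_\phi(s, y)$):

```python
def two_stage_rank(states, actor_score, critic_score, k0=128, k=32):
    """Cheap actor-only pre-ranking followed by actor-critic re-ranking.

    The (more expensive) value network is evaluated only on the top-k0 states.
    """
    pre = sorted(states, key=actor_score, reverse=True)[:k0]
    return sorted(pre, key=lambda s: actor_score(s) + critic_score(s),
                  reverse=True)[:k]
```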

Baselines  We evaluated the following training methods (hyper-parameters are in Appendix B):
1. MML: our main baseline, where search is done with beam search and training with MML. We used randomized beam search, which adds ε-greedy exploration to beam search; it was proposed by Guu et al. (2017) and performed better.⁴
2. EXPERT-MML: an alternative way of using the target denotation $y$ at training time, based on imitation learning (Daume et al., 2009; Ross et al., 2011; Berant and Liang, 2015), is to train an expert policy $\pi_\theta^{expert}$, which receives $y$ as input in addition to the parsing state, and train it with the MML objective. Then, our policy $\pi_\theta$ is trained using programs found by $\pi_\theta^{expert}$. The intuition is that the expert can use $y$ to find good programs that the policy $\pi_\theta$ can train from.
3. VBSIX: our proposed training algorithm.

⁴ We did not include meritocratic updates (Guu et al., 2017), since they performed worse in initial experiments.

We also evaluated REINFORCE, where Monte-Carlo sampling is used as the search strategy (Williams, 1992; Sutton et al., 1999). We followed the implementation of Guu et al. (2017), who performed variance reduction with a constant baseline and added ε-greedy exploration. We found that REINFORCE fails to discover any correct programs to bootstrap from.

6.2 Results

Table 1 reports the test accuracy of VBSIX compared to the baselines. First, VBSIX outperforms all baselines in all cases. MML is the strongest baseline, but even with an increased beam ($K = 64$), VBSIX ($K = 32$) surpasses it by a large margin (more than 20 points on average). On top of the improvement in accuracy, in ALCHEMY and TANGRAM the standard deviation of VBSIX is lower than that of the other baselines across the 6 random seeds, showing the robustness of our model.

EXPERT-MML performs worse than MML in all cases. We hypothesize that using the denotation $y$ as input to the expert policy $\pi_\theta^{expert}$ results in many spurious programs, i.e., programs that are unrelated to the utterance meaning. This is because the expert can learn to perform actions that take it to the target world state while ignoring the utterances completely. Such programs lead to bad generalization of $\pi_\theta$. Using a critic at training time eliminates this problem, since its scores depend on $\pi_\theta$.

Ablations  We performed ablations to examine the benefit of our two technical contributions: (a) execution space and (b) value-based search. Table 2 presents accuracy on the validation set when each component is used separately, when both of them are used (VBSIX), and when none are used (beam search). We find that both contributions are important for performance, as the full system achieves the highest accuracy in all domains. In SCENE, each component has only a slight advantage over beam search, and therefore both are required to achieve a significant improvement. However, in ALCHEMY and TANGRAM most of the gain is due to the value network.

Figure 5: Training hit accuracy on examples with 5 utterances in the SCENE, ALCHEMY, and TANGRAM domains, comparing VBSIX to baselines with ablated components (execution space only, value only, and beam search). The results are averaged over 6 runs with different random seeds (best viewed in color).

We also directly measured the hit accuracy at training time, i.e., the proportion of training examples where the beam produced by the search algorithm contains a program with the correct denotation, showing the effectiveness of search at training time. In Figure 5, we report train hit accuracy at each training step, averaged across 6 random seeds. The graphs illustrate the performance of each search algorithm in every domain throughout training. The validation accuracy results are correlated with the improvement in train hit accuracy.

6.3 Analysis

Execution Space  We empirically measured two quantities that we expect to reflect the advantage of execution-space search. First, we measured the number of programs stored in the execution-space graph, compared to beam search, which holds $K$ programs. Second, we counted the average number of states that are connected to correct terminal states in the discovered graph but fell out of the beam during the search. This property reflects the gain from running search over a graph structure, where the same vertex can resurface. We performed the analysis on VBSIX over 5-utterance training examples in all 3 domains. The following table summarizes the results:

Property         SCENE    ALCHEMY   TANGRAM
Paths in beam    143903   5892      678
Correct pruned   18.5     11.2      3.8

We found that the measured properties and the contribution of execution space in each domain are correlated, as seen in the ablations. Differences between domains are due to the different complexities of their formal languages. As the formal language becomes more expressive, the execution space is more compressed, as each state can be reached in more ways. In particular, the formal language in SCENE contains more functions compared to the other domains, and so it benefits the most from execution-space search.

Value Network  We analyzed the accuracy of the value network at training time by measuring, for each state, the difference between its expected reward (estimated from the discovered paths) and the value network's prediction. Figure 6 shows the average difference at each training step for all encountered states (in blue) and for high-reward states only (states with expected reward larger than 0.7, in orange). Those metrics are averaged across 6 runs with different random seeds.

The accuracy of the value network improves during training, except when the policy changes substantially, in which case the value network needs to re-evaluate the policy. When the value network converges, the difference between its predictions and the expected reward is 0.15–0.2 on average. However, for high-reward states the difference is higher (∼0.3). This indicates that the value network has a bias toward lower values, which is expected, as most states lead to low rewards. Since VBSIX uses the value network as a beam-search ranker, the value network does not need to be exact as long as it assigns higher values to states with higher expected rewards. Further analysis is provided in Appendix D.


Figure 6: The difference between the prediction of the value network and the expected reward (estimated from the discovered paths) during training, in the SCENE, ALCHEMY, and TANGRAM domains. We report the average difference for all states (blue) and for high-reward states only (> 0.7, orange). The results are averaged over 6 runs with different random seeds (best viewed in color).

7 Related Work

Training from denotations has been extensively investigated (Kwiatkowski et al., 2013; Pasupat and Liang, 2015; Bisk et al., 2016), with a recent emphasis on neural models (Neelakantan et al., 2016; Krishnamurthy et al., 2017). Improving beam search has been investigated by proposing specialized objectives (Wiseman and Rush, 2016), stopping criteria (Yang et al., 2018), and continuous relaxations (Goyal et al., 2018).

Bahdanau et al. (2017) and Suhr and Artzi (2018) proposed ways to evaluate intermediate predictions from a sparse reward signal. Bahdanau et al. (2017) used a critic network to estimate expected BLEU in translation, while Suhr and Artzi (2018) used the edit distance between the current world and the goal for SCONE. However, those works assumed stronger supervision: Bahdanau et al. (2017) utilized the gold sequences, and Suhr and Artzi (2018) used intermediate world states. Moreover, intermediate evaluations were used to compute gradient updates rather than to guide search.

Guiding search with both policy and value networks was done in Monte-Carlo Tree Search (MCTS) for tasks with a sparse reward (Silver et al., 2017; Anthony et al., 2017; Shen et al., 2018). In MCTS, value-network evaluations are refined with backup updates to improve policy scores. In this work, we gain this advantage by using the target denotation. The use of an actor and a critic is also reminiscent of A*, where states are scored by past cost and an admissible heuristic for future cost (Klein and Manning, 2003; Pauls and Klein, 2009; Lee et al., 2016). In semantic parsing, Misra et al. (2018) recently proposed a critic distribution to improve the policy, which is based on prior domain knowledge (that is not learned).

8 Conclusions

In this work, we propose a new training algorithm for mapping instructions to programs given denotation supervision only. Our algorithm exploits the denotation at training time to train a critic network used to rank search states on the beam, and performs search in a compact execution space rather than in the space of programs. We evaluated on three different domains from SCONE, and found that our algorithm dramatically improves performance compared to strong baselines across all domains.

VBSIX is applicable to any task that supports graph-search exploration, specifically tasks that can be formulated as an MDP with a deterministic transition function, which allows efficient execution of multiple partial trajectories. Such tasks include a wide range of instruction-mapping (Branavan et al., 2009; Vogel and Jurafsky, 2010; Anderson et al., 2018) and semantic parsing tasks (Dahl et al., 1994; Iyyer et al., 2017; Yu et al., 2018). Therefore, evaluating VBSIX on other domains is a natural next step for our research.

Acknowledgments

We thank the anonymous reviewers for their constructive feedback. This work was completed in fulfillment of the M.Sc. degree of the first author. This research was partially supported by The Israel Science Foundation (grant 942/16), the Blavatnik Computer Science Research Fund, and The Yandex Initiative for Machine Learning.


References

P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sunderhauf, I. Reid, S. Gould, and A. van den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Computer Vision and Pattern Recognition (CVPR).

Y. Artzi and L. Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics (TACL), 1:49–62.

D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. 2017. An actor-critic algorithm for sequence prediction. In International Conference on Learning Representations (ICLR).

J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).

J. Berant and P. Liang. 2015. Imitation learning of agenda-based semantic parsers. Transactions of the Association for Computational Linguistics (TACL), 3:545–558.

Y. Bisk, D. Yuret, and D. Marcu. 2016. Natural language communication with robots. In North American Association for Computational Linguistics (NAACL).

S. Branavan, H. Chen, L. S. Zettlemoyer, and R. Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 82–90.

D. L. Chen and R. J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Association for the Advancement of Artificial Intelligence (AAAI), pages 859–865.

J. Cheng, S. Reddy, V. Saraswat, and M. Lapata. 2017. Learning structured natural language representations for semantic parsing. In Association for Computational Linguistics (ACL).

J. Clarke, D. Goldwasser, M. Chang, and D. Roth. 2010. Driving semantic parsing from the world's response. In Computational Natural Language Learning (CoNLL), pages 18–27.

D. A. Dahl, M. Bates, M. Brown, W. Fisher, K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky, and E. Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Workshop on Human Language Technology, pages 43–48.

H. Daume, J. Langford, and D. Marcu. 2009. Search-based structured prediction. Machine Learning, 75:297–325.

D. Fried, J. Andreas, and D. Klein. 2018. Unified pragmatic models for generating and following instructions. In North American Association for Computational Linguistics (NAACL).

O. Goldman, V. Latcinnik, U. Naveh, A. Globerson, and J. Berant. 2018. Weakly-supervised semantic parsing with abstract examples. In Association for Computational Linguistics (ACL).

K. Goyal, G. Neubig, C. Dyer, and T. Berg-Kirkpatrick. 2018. A continuous relaxation of beam search for end-to-end training of neural sequence models. In Association for the Advancement of Artificial Intelligence (AAAI).

K. Guu, P. Pasupat, E. Z. Liu, and P. Liang. 2017. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In Association for Computational Linguistics (ACL).

S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

M. Iyyer, W. Yih, and M. Chang. 2017. Search-based neural structured learning for sequential question answering. In Association for Computational Linguistics (ACL), pages 1821–1831.

D. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

D. Klein and C. Manning. 2003. A* parsing: Fast exact Viterbi parse selection. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL).

J. Krishnamurthy, P. Dasigi, and M. Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Empirical Methods in Natural Language Processing (EMNLP).

J. Krishnamurthy and T. Mitchell. 2012. Weakly supervised training of semantic parsers. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL), pages 754–765.

T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Empirical Methods in Natural Language Processing (EMNLP).

K. Lee, M. Lewis, and L. Zettlemoyer. 2016. Global neural CCG parsing with optimality guarantees. In Empirical Methods in Natural Language Processing (EMNLP).

C. Liang, J. Berant, Q. Le, K. D. Forbus, and N. Lao. 2017. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In Association for Computational Linguistics (ACL).

C. Liang, M. Norouzi, J. Berant, Q. Le, and N. Lao. 2018. Memory augmented policy optimization for program synthesis with generalization. In Advances in Neural Information Processing Systems (NIPS).

P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599.

R. Long, P. Pasupat, and P. Liang. 2016. Simpler context-dependent logical forms via model projections. In Association for Computational Linguistics (ACL).

D. Misra, M. Chang, X. He, and W. Yih. 2018. Policy shaping and generalized update equations for semantic parsing from denotations. arXiv preprint arXiv:1809.01299.

A. Neelakantan, Q. V. Le, and I. Sutskever. 2016. Neural programmer: Inducing latent programs with gradient descent. In International Conference on Learning Representations (ICLR).

P. Pasupat and P. Liang. 2015. Compositional semantic parsing on semi-structured tables. In Association for Computational Linguistics (ACL).

P. Pasupat and P. Liang. 2016. Inferring logical forms from denotations. In Association for Computational Linguistics (ACL).

A. Pauls and D. Klein. 2009. K-best A* parsing. In Association for Computational Linguistics (ACL), pages 958–966.

J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

M. Rabinovich, M. Stern, and D. Klein. 2017. Abstract syntax networks for code generation and semantic parsing. In Association for Computational Linguistics (ACL).

S. Ross, G. Gordon, and A. Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Artificial Intelligence and Statistics (AISTATS).

Y. Shen, J. Chen, P. Huang, Y. Guo, and J. Gao. 2018. ReinforceWalk: Learning to walk in graph with Monte Carlo tree search. In International Conference on Learning Representations (ICLR).

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. 2017. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359.

A. Suhr and Y. Artzi. 2018. Situated mapping of sequential instructions to actions with single-step reward observation. In Association for Computational Linguistics (ACL).

I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.

R. Sutton, D. McAllester, S. Singh, and Y. Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS).

T. Anthony, Z. Tian, and D. Barber. 2017. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems (NIPS).

A. Vogel and D. Jurafsky. 2010. Learning to follow navigational directions. In Association for Computational Linguistics (ACL), pages 806–814.

R. J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.

S. Wiseman and A. M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Empirical Methods in Natural Language Processing (EMNLP).

Y. Yang, L. Huang, and M. Ma. 2018. Breaking the beam search curse: A study of (re-)scoring methods and stopping criteria for neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP).

P. Yin and G. Neubig. 2017. A syntactic neural model for general-purpose code generation. In Association for Computational Linguistics (ACL), pages 440–450.

T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887.

A Neural Network Architecture

We adopt the model $\pi_\theta(\cdot)$ proposed by Guu et al. (2017). The model receives the current utterance $u_i$ and the program stack $\psi$. A bidirectional LSTM (Hochreiter and Schmidhuber, 1997) is used to embed $u_i$, while $\psi$ is embedded by concatenating the embeddings of the stack elements. The embedded input is then fed to a feed-forward network with attention over the LSTM hidden states, followed by a softmax layer that predicts a program token. Our value network $V_\phi(\cdot)$ shares the input layer of $\pi_\theta(\cdot)$. In addition, it receives the next utterance $u_{i+1}$, the current world state $w_i$, and the target world state $w_M$. The utterance $u_{i+1}$ is embedded with an additional BiLSTM, and world states are embedded by concatenating embeddings of SCONE elements. The inputs are concatenated and fed to a feed-forward network, followed by a sigmoid layer that outputs a scalar.


Figure 7: The model proposed by Guu et al. (2017) (top), and our value network (bottom).
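A minimal PyTorch-style sketch of the value network described above; the dimensions, layer choices, and input embeddings are our own assumptions, not the released architecture:

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Estimates V_phi(s, y): the expected reward of a search state."""

    def __init__(self, word_dim=100, hidden_dim=128, state_dim=64):
        super().__init__()
        # BiLSTMs embed the current and next utterances.
        self.utt_lstm = nn.LSTM(word_dim, hidden_dim, bidirectional=True,
                                batch_first=True)
        self.next_utt_lstm = nn.LSTM(word_dim, hidden_dim, bidirectional=True,
                                     batch_first=True)
        # Stack, current world, and target world are assumed pre-embedded.
        in_dim = 4 * hidden_dim + 3 * state_dim
        self.ff = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                nn.Linear(hidden_dim, 1))

    def forward(self, utt, next_utt, stack_emb, world_emb, target_emb):
        _, (h1, _) = self.utt_lstm(utt)          # final hidden states
        _, (h2, _) = self.next_utt_lstm(next_utt)
        u1 = torch.cat([h1[0], h1[1]], dim=-1)   # concat both directions
        u2 = torch.cat([h2[0], h2[1]], dim=-1)
        x = torch.cat([u1, u2, stack_emb, world_emb, target_emb], dim=-1)
        return torch.sigmoid(self.ff(x)).squeeze(-1)   # scalar in (0, 1)
```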

B Hyper-parameters

Table 3 contains the hyper-parameter setting for each experiment. Hyper-parameters of REINFORCE and MML were taken from Guu et al. (2017). In all experiments the learning rate was 0.001 and the mini-batch size was 8. We explicitly define the following hyper-parameters, which are not self-explanatory:
1. Training steps: the number of training steps taken.
2. Sample size: the number of samples drawn from $p_\theta$ in REINFORCE.
3. Baseline: a constant subtracted from the reward for variance reduction.
4. Execution beam size: $K$ in Algorithm 1.
5. Program beam size: the size of the beam in line 3 of Algorithm 1.
6. Value ranking start step: the step at which we start ranking states using the critic score.
7. Value re-rank size: the size of the beam $K_0$ returned by the actor score before re-ranking with the actor-critic score.

C Prior Work

In prior work on SCONE, models were trained on sequences of 1 or 2 utterances, and thus were exposed during training to all gold intermediate states (Long et al., 2016; Guu et al., 2017; Suhr and Artzi, 2018). Fried et al. (2018) assumed access to the full annotated logical form. In contrast, we trained from longer sequences, keeping the logical form and intermediate states latent. We report the test accuracy as reported by prior work and in this paper, noting that our results are an average of 6 runs, while prior work reports the median of 5 runs.

Naturally, our results are lower compared to prior work that uses much stronger supervision. This is because our setup poses a hard search problem at training time and also requires overcoming spuriousness: the fact that even incorrect programs sometimes lead to high reward.

D Value Network Analysis

We analyzed the ability of the value network to predict expected reward. The reward of a state depends on two properties: (a) connectivity, whether there is a trajectory from this state to a correct terminal state, and (b) model likelihood, the probability the model assigns to those trajectories. We collected a random set of 120 states in the SCENE domain where the real expected reward was very high (> 0.7) or very low (= 0.0), and the value network predicted it well (less than 0.2 deviation) or poorly (more than 0.5 deviation). For ease of analysis we only look at states from the final utterance.

To analyze connectivity, we looked at states that cannot reach a correct terminal state with a single action (since states in the last utterance can perform one action only, the expected reward is 0). Those are states where either the current and target world differ in too many ways, or the stack content is not relevant to the differences between the worlds. We find that when there are many differences between the current and target world, the value network correctly estimates low expected reward in 87.0% of the cases. However, when there is just one mismatch between the current and target world, the value network tends to ignore it and erroneously predicts high reward in 78.9% of the cases.

To analyze whether the value network can predict the success of the trained policy, we consider states from which there is an action that leads to the target world. While it is challenging to fully interpret the value network, we notice that the network predicts a value that is > 0.5 in 86.1% of the cases where the number of people in the world is no more than 2, and a value that is < 0.5 in 82.1% of the cases where the number of people in the world is more than 2. This indicates that the value network believes that more complex worlds, involving many people, are harder for the policy.


System      SCENE                           ALCHEMY                         TANGRAM
REINFORCE   Training steps = 22.5k          Training steps = 31.5k          Training steps = 40k
            Sample size = 32                Sample size = 32                Sample size = 32
            ε = 0.2                         ε = 0.2                         ε = 0.2
            Baseline = 10⁻⁵                 Baseline = 10⁻²                 Baseline = 10⁻³
MML         Training steps = 22.5k          Training steps = 31.5k          Training steps = 40k
            Beam size = 32                  Beam size = 32                  Beam size = 32
            ε = 0.15                        ε = 0.15                        ε = 0.15
VBSIX       Training steps = 22.5k          Training steps = 31.5k          Training steps = 40k
            Execution beam size = 32        Execution beam size = 32        Execution beam size = 32
            Program beam size = 8           Program beam size = 8           Program beam size = 8
            ε = 0.15                        ε = 0.15                        ε = 0.15
            Value ranking start step = 5k   Value ranking start step = 5k   Value ranking start step = 10k
            Value re-rank size = 128        Value re-rank size = 128        Value re-rank size = 128

Table 3: Hyper-parameter settings.

                         SCENE           ALCHEMY         TANGRAM
                         3 utt   5 utt   3 utt   5 utt   3 utt   5 utt
Long et al. (2016)       23.2    14.7    56.8    52.3    64.9    27.6
Guu et al. (2017)        64.8    46.2    66.9    52.9    65.8    37.1
Suhr and Artzi (2018)    73.9    62.0    74.2    62.7    80.8    62.4
Fried et al. (2018)      –       72.7    –       72.0    –       69.6
VBSIX                    34.2    28.2    74.5    64.8    65.0    43.0

Table 4: Test accuracy comparison to prior work.