Learning to Learn Programs from Examples: Going Beyond Program Structure



<ul><li><p>Learning to Learn Programs from Examples: Going Beyond Program Structure</p><p>Content areas: program synthesis, machine learning, programming by examples</p><p>Abstract</p><p>Programming-by-example technologies let end users construct and run new programs by providing examples of the intended program behavior. But the few provided examples seldom uniquely determine the intended program. Previous approaches to picking a program used a bias toward shorter or more naturally structured programs. Our work here gives a machine learning approach for learning to learn programs that departs from previous work by relying upon features that are independent of the program structure, instead relying upon a learned bias over program behaviors, and more generally over program execution traces. Our approach leverages abundant unlabeled data for semisupervised learning, and incorporates simple kinds of world knowledge for common-sense reasoning during program induction. These techniques are evaluated in two programming-by-example domains, improving the accuracy of program learners.</p><p>1 Introduction</p><p>Billions of people own computers, yet vanishingly few know how to program. Imagine an end user wishing to extract the years from a table of data, like in Table 1. What would be a trivial regular expression for a coder is impossible for the vast majority of computer users. But anyone can show a computer what to do by giving examples, an observation that has motivated a long line of work on the problem of programming by examples (PBE), a paradigm where end users give examples of intended behavior and the system responds by inducing and running a program [Lieberman, 2001]. A core problem in PBE is determining which single program the user intended within the vast space of all programs consistent with the examples. Users would like to provide only one or a few examples, leaving the intended behavior highly ambiguous. Consider a user who provides just the first input/output example in Table 1.
Did they mean to extract the first number of the input? The last number? The first number after a comma? Or did they intend to just produce "1993" for each input? In real-world scenarios we could encounter on the order of 10<sup>100</sup> distinct programs consistent with the examples. Getting the right program from fewer examples means less effort for users and more adoption of PBE technology. This concern is practical: Microsoft refused to ship the recent PBE system Flash Fill [Gulwani, 2011] until common scenarios were learned from only one example.</p><table><tr><th>Input table</th><th>Desired output table</th></tr><tr><td>Missing page numbers, 1993</td><td>1993</td></tr><tr><td>64-67, 1995</td><td>1995</td></tr><tr><td>1992 (1-27)</td><td>1992</td></tr></table><p>Table 1: An everyday computer task trivial for programmers but inaccessible for nonprogrammers: given the input table of strings, automatically extract the year to produce the desired output table on the right.</p><p>We develop a new inductive bias for resolving the ambiguity that is inherent when learning programs from few examples. Prior inductive biases in PBE use features of the program's syntactic structure, picking either the smallest program consistent with the examples, or the one that looks the most natural according to some learned criterion [Liang et al., 2010; Menon et al., 2013; Singh and Gulwani, 2015; Lin et al., 2014]. In contrast, we look at the outputs and execution traces of a program, which we will show can sometimes predict program correctness even better than if we could examine the program itself. Intuitively, we ask "what do typically intended programs compute?" rather than "what do typically intended programs look like?"
Returning to Table 1, we prefer the program extracting years because its outputs look like an intended behavior, even though extracting the first number is a shorter program.</p><p>We apply our technique in two different PBE domains: a string transformation domain, which enriches Flash Fill-style problems (e.g., Table 1) with semantic transformations, like the ability to parse and transform times and dates [Singh and Gulwani, 2012a] and numbers [Singh and Gulwani, 2012b]; and a text extraction domain, where the goal is to learn a program that extracts structured tables out of a log file [Le and Gulwani, 2014]. Flash Fill, now a part of Microsoft Excel, motivated a series of other PBE systems, which coalesced into a software library called PROSE [Polozov and Gulwani, 2015]. In PROSE, one provides the hypothesis space (programming language) and gets a PBE tool for free. But PROSE does not solve the ambiguity problem, instead using a hand-engineered inductive bias over programs.</p></li><li><p>Our work integrates into PROSE and provides a better inductive bias. Although we worked with existing PROSE implementations of the string transformation and text extraction domains, the broad approach is domain-agnostic. We take as a goal to improve PROSE's inductive bias, and use the phrase "PROSE" to refer to the current PROSE implementations of these domains, in contrast to our augmented system.</p><p>1.1 Our contribution: picking the correct program</p><p>We develop two new contributions to PBE technology:</p><p>Predictive features</p><p>Predicting program correctness based on its syntactic structure is perhaps the oldest and most successful idea in program induction [Solomonoff, 1964]. This general family of approaches uses what we call program features to bias the learner. But the correctness of a program goes beyond its appearance. We develop two new classes of features that are invariant to program structure:</p><p>Output features. Some sets of outputs are a priori more likely to be produced from valid programs.
In PBE scenarios the user typically labels few inputs by providing outputs but has many unlabeled inputs; the candidate outputs on the unlabeled inputs give a semisupervised learning signal that leverages the typically larger set of unlabeled data. See Tables 2 and 3. In Table 2, the system considers programs that either append a bracket (a simple program) or ensure correct bracketing (a complex program). PROSE opts for the simple program, but our system notices that this program predicts an output too dissimilar from the labeled example. Instead we prefer the program without this outlier in its outputs.</p><p>Execution trace features. Going beyond the final outputs of a candidate program, we show how to consider the entire execution trace. Our model learns a bias over sequences of computations, which allows us to disprefer seemingly natural programs with pathological behavior on the provided inputs.</p><table><tr><th>Input</th><th>Output (PROSE)</th><th>Output (ours)</th></tr><tr><td>[CPT-00350</td><td>[CPT-00350]</td><td>[CPT-00350]</td></tr><tr><td>[CPT-00340</td><td>[CPT-00340]</td><td>[CPT-00340]</td></tr><tr><td>[CPT-115]</td><td>[CPT-115]]</td><td>[CPT-115]</td></tr></table><p>Table 2: Learning a program from one example (top row) and applying it to other inputs (bottom rows, outputs italicized). Our semisupervised approach lets us get the last row correct.</p><p>Machine learning framework</p><p>We develop a framework for learning to learn programs from a corpus of problems that improves on prior work as follows:</p><p>Weak supervision. We require no explicitly provided ground-truth programs, in contrast with e.g. [Menon et al., 2013; Liang et al., 2010]. This helps automate the engineering of PBE systems because the engineer need not manually annotate solutions to potentially hundreds of problems.</p><p>The modeling paradigm. We introduce a discriminative probabilistic model, in contrast with [Liang et al., 2010; Menon et al., 2013; Singh and Gulwani, 2015]. A discriminative approach leads to higher predictive accuracy, while a probabilistic framing lets us learn with simple and tractable gradient-guided search.</p><table><tr><th>Input</th><th>Output (PROSE)</th><th>Output (ours)</th></tr><tr><td>Brenda Everroad</td><td>Brenda</td><td>Brenda</td></tr><tr><td>Dr. Catherine Ramsey</td><td>Catherine</td><td>Catherine</td></tr><tr><td>Judith K. Smith</td><td>Judith K.</td><td>Judith</td></tr><tr><td>Cheryl J. Adams and Binnie Phillips</td><td>Cheryl J. Adams and Binnie</td><td>Cheryl</td></tr></table><p>Table 3: Learning a program from one example (top row) and applying it to other inputs (bottom rows, outputs italicized). Our semisupervised approach uses simple common-sense reasoning, knowing about names, places, words, dates, etc., letting us get the last two rows correct.</p><p>1.2 Notation</p><p>We consider PBE problems where the program, written p, is drawn from a domain-specific language (DSL), written 𝓛. We have one DSL for string transformation and a different DSL for text extraction. DSLs are described using a grammar that constrains the ways in which program components may be combined. We learn a p ∈ 𝓛 consistent with L labeled input/output examples, with inputs {x<sub>i</sub>}<sup>L</sup><sub>i=1</sub> (collectively X<sub>L</sub>) and user-labeled outputs {y<sub>i</sub>}<sup>L</sup><sub>i=1</sub> (collectively Y<sub>L</sub>). We write p(x) for the output of p on input x, so consistency with the labeled examples means that y<sub>i</sub> = p(x<sub>i</sub>) for 1 ≤ i ≤ L. We write N for the total number of inputs on which the user intends to run the program, so N ≥ L. All of these inputs are written {x<sub>i</sub>}<sup>N</sup><sub>i=1</sub> (collectively X). When a program p ∈ 𝓛 is clear from context, we write {y<sub>i</sub>}<sup>N</sup><sub>i=1</sub> (collectively Y) for the outputs p predicts on the inputs X. We write {y<sub>i</sub>}<sup>N</sup><sub>i=L+1</sub> for the predictions of p on the unlabeled inputs (collectively Y<sub>U</sub>).</p><p>For each DSL 𝓛, we assume a simple hand-crafted scoring function that assigns higher scores to shorter or simpler programs. PROSE can enumerate the top K programs under this scoring function, where K is large but manageable (for K up to 10<sup>4</sup>). We call these top K programs the frontier, and we write F<sub>K</sub>(X, Y<sub>L</sub>) to mean the frontier of size K for inputs X and labeled outputs Y<sub>L</sub> (so if p ∈ F<sub>K</sub>(X, Y<sub>L</sub>) then p(x<sub>i</sub>) = y<sub>i</sub> for i ≤ L). The existing PROSE approach is to predict the single program in F<sub>1</sub>(X, Y<sub>L</sub>).
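</p>
<p>As a concrete, purely illustrative sketch of this notation, the code below filters candidate programs down to a frontier: the top-K programs, under a simplicity score, that reproduce every labeled input/output example. The candidate programs and scores are invented for illustration; PROSE's actual frontier enumeration works over the DSL's grammar and is far more sophisticated.</p>

```python
import re

# Hypothetical sketch of the frontier F_K(X, Y_L): keep the top-K candidate
# programs (ranked by a simplicity score) that reproduce every labeled
# input/output example. All names here are illustrative.
def frontier(candidates, labeled_inputs, labeled_outputs, K):
    """candidates: list of (program, score) pairs, where program maps x -> y."""
    consistent = [
        (p, score) for p, score in candidates
        if all(p(x) == y for x, y in zip(labeled_inputs, labeled_outputs))
    ]
    consistent.sort(key=lambda ps: ps[1], reverse=True)  # best score first
    return [p for p, _ in consistent[:K]]

# Two toy candidates for Table 1's single labeled example:
extract_year = lambda x: re.search(r"\b(19|20)\d\d\b", x).group(0)
first_number = lambda x: re.search(r"\d+", x).group(0)

top = frontier([(extract_year, 0.4), (first_number, 0.6)],
               ["Missing page numbers, 1993"], ["1993"], K=2)
# Both programs agree on the labeled example, so both enter the frontier,
# yet they disagree on the unlabeled input "64-67, 1995".
```
<p>Here F<sub>1</sub> would return only the single best-scoring consistent program, which is the baseline PROSE prediction.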
We write φ(·) to mean some kind of feature extractor, and use the variable θ to mean weights placed on those features.</p><p>For ease of exposition we will draw our examples from string transformation, where the goal is to learn a program that takes as input a vector of strings (so x<sub>i</sub> is a vector of strings) and produces as output a string (so y<sub>i</sub> is a string).</p><p>2 Extracting predictive features</p><p>2.1 Features of program structure</p><p>A common intuition in the program induction literature is that one should prefer short, simple programs over long, complicated programs. Many old and modern approaches [Solomonoff, 1964; Liang et al., 2010; Polozov and Gulwani, 2015; Lau, 2001] realize this intuition by first modeling the set of all programs consistent with the examples, and then picking the program in that set maximizing a measure of program simplicity.</p></li><li><p>The only way in which the examples participate in these program induction approaches is by excluding impossible programs.</p><p>We model these program feature-style approaches by defining a feature extractor for programs, φ<sub>program</sub>(p). The learner predicts the program p (consistent with the examples) maximizing a linear combination of these features:</p><p>p* = argmax<sub>p consistent with examples</sub> θ · φ<sub>program</sub>(p)&nbsp;&nbsp;&nbsp;(1)</p><p>This framework models several lines of work: (1) if the scoring function is likelihood under a probabilistic grammar, then φ<sub>program</sub>(p) are counts of the grammar productions used in p and θ are log production probabilities (e.g., [Menon et al., 2013]); (2) if the grammar's structure is unknown then φ<sub>program</sub>(p) are counts of all program fragments used (e.g., [Liang et al., 2010]); or (3) if the scoring function is the size of the program then φ<sub>program</sub>(p) is the one-dimensional count of the size of the syntax tree (e.g., [Lin et al., 2014]).</p><p>Our φ<sub>program</sub> counted occurrences of different program primitives, so our model could mimic the inductive bias of a probabilistic grammar.
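</p>
<p>A minimal sketch of the linear ranker in Equation (1), with invented structural features and weights (real systems derive φ<sub>program</sub> from the DSL's grammar), is:</p>

```python
# Sketch of Equation (1): among programs consistent with the examples,
# pick the one maximizing a linear combination theta . phi_program(p).
# The feature set and weights here are illustrative assumptions.
def phi_program(p):
    # Toy structural features: syntax-tree size and number of constants.
    return [p["size"], p["num_constants"]]

def pick_program(consistent_programs, theta):
    score = lambda p: sum(w * f for w, f in zip(theta, phi_program(p)))
    return max(consistent_programs, key=score)

consistent = [
    {"name": "extract_year", "size": 5, "num_constants": 0},
    {"name": "first_number", "size": 3, "num_constants": 0},
]
theta = [-1.0, -2.0]  # negative weights prefer smaller, constant-free programs
best = pick_program(consistent, theta)
# Under this purely structural bias the shorter program wins, illustrating
# how a syntactic preference can pick first_number over extract_year.
```
<p>In our system, φ<sub>program</sub> played exactly this structural role.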
It also detected the presence of domain-specific code templates, for example counting the number of times that a prefix of the input is extracted, or the number of times that an input is parsed as a date. These domain-specific choices are motivated by past models that learn a bias towards useful code fragments [Liang et al., 2010], an idea which has been usefully deployed in string transformation domains [Singh and Gulwani, 2015]. But our contribution is not a more sophisticated preference over programs. Instead, we go beyond this approach by turning to features of program behaviors, as the next two sections describe.</p><p>2.2 Features of program trace</p><p>Imagine a spreadsheet of professor names: Rebecca, Oliver, etc. One thing you might want a PBE system to do is put the title "Dr." in front of each of these names. So, you give the system an example of "Dr." being prepended to the string "Rebecca". This should be a trivial learning problem, and the system should induce a program that just puts the constant "Dr." in front of the input. However, PROSE failed on this simple case; see Table 4. Although the system can represent the intended program, it instead prefers a program that extracts the first character from "Rebecca" to produce the "r" in "Dr.", with unintended consequences for "Oliver".</p><p>Why does PROSE prefer a program that extracts the first character? In general, programs with more constants are less plausible; this is related to the intuition that we should prefer programs with shorter description lengths. Furthermore, the first character of the input is very commonly extracted, so PROSE was tuned to prefer programs that extract prefixes. These two inductive biases conspired to steer the system toward the wrong program.</p><p>This failure is not an artifact of the fact that PROSE's inductive bias was written by hand rather than being learned from data. With a learned prior over program structures, the model made the exact same error. The program structure alone simply does not provide a strong signal that "Oliver" should be "Dr. Oliver", rather than "Do. Oliver".</p><table><tr><th>Input</th><th>Output</th></tr><tr><td>Rebecca</td><td>Dr. Rebecca</td></tr><tr><td>Oliver</td><td>Do. Oliver</td></tr></table><p>Table 4: A string transformation problem; the user provided the first output and an incorrect program produced the italicized second output.</p><p>[Figure 1: Execution trace for the erroneous program with the behavior shown in Table 4. Notice the overlapping substring extractions.]</p><p>By looking at the execution trace of the program we discovered a new kind of signal for program correctness. Returning to our motivating example, the erroneous program first extracts a region of the input and then extracts an overlapping region (see Figure 1). Accessing overlapping regions of data is seldom intended: usually programs pull out the data they want and then do something with it, rather than extracting some parts of the data multiple times. Simply introducing an inductive bias against accessing overlapping regions of the input is enough to disprefer the erroneous program in Table 4.</p><p>More generally, one can learn an inductive bias for execution traces by fitting a probabilistic model to traces from intended programs. This scheme could work for any DSL, with the system using the model to steer the learner towards intended programs.</p><p>With these intuitions in hand, we now want an inductive bias over execution traces that strongly penalizes these pathological behaviors. An inductive bias based on only three features sufficed. Feature 1: did substring extractions overlap? Correct programs usually pull out the intended data only once, so this feature strongly predicted program incorrectness. Feature 2: were substring extractions repeated? This is a weaker signal of incorrectness. Featu...</p></li></ul>
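<p>Feature 1 above can be sketched as a simple check over the regions an execution trace extracted. The trace representation here (a list of character spans) is our assumption for illustration, not PROSE's internal format:</p>

```python
# Hypothetical sketch of trace Feature 1: did substring extractions overlap?
# A trace is modeled as a list of half-open (start, end) spans into the input.
def extractions_overlap(spans):
    spans = sorted(spans)
    return any(prev_end > start
               for (_, prev_end), (start, _) in zip(spans, spans[1:]))

# Erroneous program from Table 4 on "Rebecca": it extracts the first
# character (span (0, 1)) and then the whole overlapping input (span (0, 7)).
assert extractions_overlap([(0, 1), (0, 7)])

# The intended program extracts the input only once, after a constant prefix:
assert not extractions_overlap([(0, 7)])
```
<p>A learner can use this boolean as one feature: traces with overlapping extractions are strongly penalized, dispreferring the erroneous program in Table 4.</p>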