
Thesis Proposal
Knowledge-Aware Natural Language Understanding

Pradeep Dasigi

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Eduard Hovy (Chair)
Chris Dyer
William Cohen
Luke Zettlemoyer

Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy.

July 12, 2017 (Draft)

Copyright © 2017 Pradeep Dasigi

Keywords: natural language understanding, knowledge, neural networks, end-to-end models, semantic parsing, question answering


Abstract

Natural Language Understanding (NLU) systems need to encode human generated text (or speech) and reason over it at a deep semantic level. Any NLU system typically involves two main components: the first is an encoder, which composes words (or other basic linguistic units) within the input utterances to compute encoded representations that are then used as features in the second component, a predictor, to reason over the encoded inputs and produce the desired output. We argue that performing these two steps over the utterances alone is seldom sufficient for understanding language, as the utterances themselves do not contain all the information needed for understanding them. We identify two kinds of additional knowledge needed to fill the gaps: background knowledge and contextual knowledge. The goal of this thesis is to build end-to-end NLU systems that encode inputs along with relevant background knowledge, and reason about them in the presence of contextual knowledge.

The first part of the thesis deals with background knowledge. While distributional methods for encoding inputs have been used to represent the meaning of words in the context of other words in the input, there are other aspects of semantics that are out of their reach. These are related to commonsense or real world information which is part of shared human knowledge but is not explicitly present in the input. We address this limitation by having the encoders also encode background knowledge, and present two approaches for doing so. The first is modeling the selectional restrictions verbs place on their semantic role fillers. We use this model to encode events, and show that these event representations are useful in detecting newswire anomalies. Our second approach towards augmenting distributional methods is to use external knowledge bases like WordNet. We compute ontology-grounded token-level representations of words and show that they are useful in predicting prepositional phrase attachments and textual entailment.

The second part of the thesis focuses on contextual knowledge. Machine comprehension tasks require interpreting input utterances in the context of other structured or unstructured information. This can be challenging for multiple reasons. Firstly, given some task-specific data, retrieving the relevant contextual knowledge from it can be a serious problem. Secondly, even when the relevant contextual knowledge is provided, reasoning over it might require executing a complex series of operations depending on the structure of the context and the compositionality of the input language. To handle reasoning over contexts, we first describe a type-constrained neural semantic parsing framework for question answering (QA). We achieve state-of-the-art performance on WIKITABLEQUESTIONS, a dataset with highly compositional questions over semi-structured tables. Proposed work in this area includes the application of this framework to QA in other domains with weaker supervision. To address the challenge of retrieval, we propose to build neural network models with explicit memory components that can adaptively reason and learn to retrieve relevant context given a question.


Contents

1 Introduction
  1.1 Natural Language Understanding
  1.2 Knowledge
      1.2.1 Background Knowledge
      1.2.2 Contextual Knowledge
  1.3 Knowledge-Aware NLU
  1.4 Challenges
  1.5 Thesis Outline

I Encoding Background Knowledge

2 Leveraging Selectional Preferences for Event Understanding
  2.1 Introduction
  2.2 Understanding Events
  2.3 Related Work
  2.4 Model for Semantic Anomaly Detection
      2.4.1 Training
  2.5 Data
      2.5.1 Annotation
      2.5.2 Language Model Separability
  2.6 Results
  2.7 Conclusion

3 Sentence Understanding with Background Knowledge from Ontologies
  3.1 Introduction
  3.2 Ontology-Aware Token Embeddings
  3.3 Related Work
      3.3.1 Multi-prototype Word Vectors
      3.3.2 Hybrid Models of Knowledge-based and Distributional Semantics
      3.3.3 WordNet as a source for Selectional Preferences
  3.4 WordNet-Grounded Context-Sensitive Token Embeddings
      3.4.1 WordNet Grounding
      3.4.2 Context-Sensitive Token Embeddings
  3.5 PP Attachment
      3.5.1 Problem Definition
      3.5.2 Model Definition
      3.5.3 Experiments
  3.6 Textual Entailment
      3.6.1 Data preprocessing
      3.6.2 Problem Definition
      3.6.3 Model Definition
      3.6.4 Experiments
      3.6.5 Analysis
  3.7 Conclusion

II Reasoning over Contextual Knowledge

4 Learning Semantic Parsers over Question Answer Pairs
  4.1 Type Constrained Neural Semantic Parser
      4.1.1 Encoder
      4.1.2 Decoder
  4.2 Experiments with WIKITABLEQUESTIONS
      4.2.1 Context and Logical Form Representation
      4.2.2 DPD Training
      4.2.3 Experimental Setup
      4.2.4 Results
  4.3 Proposed Work
      4.3.1 Handling more operations and improved entity linking
      4.3.2 Extending to datasets with weaker supervision

5 Proposed Work: Context Retrieval for Memory Networks
  5.1 Memory Networks
  5.2 Retrieval for Question Answering
      5.2.1 Proposed Algorithm
      5.2.2 Potential Issues

6 Summary and Timeline
  6.1 Summary
  6.2 Timeline

Bibliography

List of Figures

1.1 Example of a task requiring reasoning over semi-structured context
1.2 Example of an artificial task requiring reasoning over context, part of the bAbI dataset
1.3 Snippet from a science textbook followed by a question
1.4 Generic NLU pipeline and thesis overview

2.1 Anomaly Classification Pipeline with Neural Event Model
2.2 Comparison of perplexity scores for different labels
2.3 ROC curves for the LSTM baseline and NEM

3.1 Example of ontology-aware lexical representation
3.2 An example grounding for the word pool
3.3 Steps for computing the context-sensitive token embedding for the word 'pool', as described in Section 3.4.2
3.4 Two sentences illustrating the importance of lexicalization in PP attachment decisions
3.5 Effect of training data size on test accuracies of OntoLSTM-PP and LSTM-PP
3.6 Visualization of soft attention weights assigned by OntoLSTM to various synsets in Preposition Phrase Attachment
3.7 Visualization of soft attention weights assigned by OntoLSTM to various synsets in Textual Entailment

4.1 Overview of our type-constrained neural semantic parsing model
4.2 The derivation of a logical form using the type-constrained grammar. The holes in the left column have empty scope, while holes in the right column have scope Γ = {(x : r)}
4.3 Example from NLVR Dataset

5.1 Schematic showing a generic view of memory networks


List of Tables

2.1 Annotation Statistics
2.2 Classification Performance and Comparison with Baselines
2.3 Anomaly Detection Competence

3.1 Prepositional Phrase Attachment Results with OntoLSTM
3.2 Results from RBG dependency parser with features coming from various PP attachment predictors and oracle attachments
3.3 Effect of removing sense priors and context sensitivity (attention) from the model
3.4 Results of OntoLSTM on the SNLI dataset

4.1 Development and test set accuracy of our semantic parser compared to prior work on WIKITABLEQUESTIONS


Chapter 1

Introduction

1.1 Natural Language Understanding

A Natural Language Understanding (NLU) system should process human generated text (or speech) at a deep semantic level, represent the semantics of the processed inputs, and reason over them to perform well at a given task. The effectiveness of NLU systems is often measured in terms of their performance at the intended task. The inherent assumption[1] behind taking a task-oriented approach towards building NLU systems is that an improvement in task performance correlates with an improvement in the quality of the semantic representation, or in other words, the understanding capability of the computational system.

NLU is thus an umbrella term that includes several tasks that can be defined in terms of their end goals. For example, Question Answering (QA) refers to answering questions about a span of text, or other structured knowledge representations like knowledge bases, or even images. QA systems are required to encode the meaning of the provided inputs such that the relevant bits of information can be retrieved to answer questions. Recognizing Textual Entailment (RTE), another NLU task, refers to the problem of identifying whether the truth value of some text provided as a hypothesis follows from that of another text provided as a premise; it is usually done by extracting relevant features from the hypothesis and the premise to see if there is enough overlap between them in the right direction. Another NLU task is Sentiment Analysis, which requires automatically categorizing the opinions expressed in the input utterances towards specific targets. This involves extracting appropriate affective states and subjective information, such that a sentiment classifier can be built using that information as features. To perform well at each of these tasks, the NLU system should encode the semantics of the input to support the kind of reasoning appropriate for the task.

In each of the examples above, it can be noticed that there are two common components: an encoder and a predictor. The encoder extracts task-relevant features from the input, and the predictor performs the appropriate computation on top of the features given by the encoder to produce the desired result. A generic NLU system can thus be succinctly described using the following two equations.

[1] We rely on this assumption throughout this thesis. While interpreting the encoded representations to test this assumption is also an active research topic, it is beyond the scope of this thesis.

e = encode(I)    (1.1)
o = predict(e)    (1.2)

where I is the set of textual inputs to the NLU system. For example, I are single sentences in Sentiment Analysis and pairs of sentences in RTE. e are intermediate encoded representations of the inputs (which may or may not be task specific), and o are the final task-specific predictions. For example, in Sentiment Analysis or RTE, o are categorical labels indicating the sentiment or entailment respectively, and in the case of Semantic Parsing, they are structured outputs, called parses.
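As a toy illustration of this two-step view (our own simplification for exposition, not a model from this thesis), the following Python sketch instantiates encode as a bag-of-words feature extractor and predict as a lexicon-based sentiment decision; the two lexicons are made-up assumptions.

from collections import Counter

POSITIVE = {"good", "great", "excellent"}   # toy lexicon (assumed)
NEGATIVE = {"bad", "awful", "terrible"}     # toy lexicon (assumed)

def encode(I):
    # e = encode(I): bag-of-words features (Equation 1.1).
    return Counter(I)

def predict(e):
    # o = predict(e): a trivial prediction over the encoded features (Equation 1.2).
    score = sum(e[w] for w in POSITIVE) - sum(e[w] for w in NEGATIVE)
    return "positive" if score >= 0 else "negative"

print(predict(encode("the movie was great".split())))  # -> positive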

In older feature-rich methods for NLU, Equation 1.1 is a mapping of the inputs to a hand-designed feature space, typically containing patterns over word classes based on part-of-speech (Corley and Mihalcea, 2005) or WordNet synsets (Moldovan and Rus, 2001); shallow linguistic features like dependencies (Bos and Markert, 2005); named entity information (Tatu and Moldovan, 2005); or other features depending on the task. The choice of features was left to the discretion of the ML practitioners designing the NLU systems. In such systems, the modeling emphasis was more on the prediction component, and the encoding component did not involve any learning. More recent systems (Bahdanau et al., 2014; Weston et al., 2014; Hermann et al., 2015; Xiong et al., 2016; Bowman et al., 2016; Yang et al., 2016, among many others) use representation learning or deep learning techniques to also learn the parameters of the encode function, and typically this is done jointly with learning the parameters of the predict function, to ensure task-relevancy of the learned representations.

1.2 Knowledge

A computational system that seeks to understand human utterances cannot do so given the utterances alone. This is because the utterances themselves seldom contain all the information needed to comprehend them. The missing information is either part of the background knowledge about the world humans inherently possess, or contained in the surrounding context in which the utterance is spoken or written. NLU systems thus need to access or infer this knowledge to comprehend human languages.

In this thesis, we draw a distinction between the two kinds of knowledge described here: background and contextual. We now describe them in the context of NLU.

1.2.1 Background Knowledge

An issue with the formulation of the problem in Equation 1.1 is that it is missing a key input: background knowledge. Since human language is aimed at other humans who share the same background knowledge, a lot of information is often not explicitly stated for the sake of brevity. The implicit knowledge required to understand language varies from simple commonsense in the case of basic conversations, to domain-specific knowledge about concepts in more esoteric or technical communications.


The notion of implicit knowledge can be illustrated using the following example. Consider the events described by the following three sentences:

Cops shoot a man.

Man shoots a dog.

Dog shoots a man.

Clearly, one would find the third sentence more unusual compared to the first and the second, the reason being that a dog performing a shooting action does not agree with our knowledge about the general state of affairs. While this information is not clearly written down anywhere, humans know it implicitly. An NLU system that is designed to differentiate between these sentences should be provided with the information that dog in the role of the agent for shooting is unlikely, while it can be the patient for shooting, and that man fits equally well as agent and patient for shooting. The notion that some combinations of actions and their arguments are more semantically likely than others is referred to as Selectional Preference (Wilks, 1973), and we describe it in greater detail in Chapter 2.

Consider another example of the need for encoding commonsense information, that of solving the textual entailment problem given the following premise and candidate hypothesis:

Premise: Children and parents are splashing water at the pool.

Hypothesis: Families are playing outside.

The knowledge a human uses to correctly predict that the hypothesis can be inferred from the premise is that children and parents typically form families, splashing water is a kind of playing, and a pool is expected to be outside. An entailment model trained on texts like the ones shown above, with just the sentences as inputs, will not have access to the background knowledge that humans possess. Such a model will at best memorize co-occurrence patterns in the input words. Using pre-trained vectors to represent input words provides additional knowledge to the model in the form of co-occurrence statistics obtained from large corpora. However, it is unclear whether distributional information at the word level alone can substitute for the kind of knowledge described above. Fortunately, background knowledge of this kind is encoded to some extent in machine readable ontologies and knowledge bases.

In this thesis, we will show how such background knowledge can be encoded either by leveraging Selectional Preferences or by incorporating external knowledge sources, and used in conjunction with distributional models to obtain better input representations for NLU systems.

1.2.2 Contextual Knowledge

A second issue with the generic NLU formulation, particularly with Equation 1.2, is that it does not take into account the context in which the input utterance occurs. Typically, context includes the circumstances that form the setting for the input utterances to be fully understood, and hence it significantly contributes to the meaning of the inputs. In this thesis, we will refer to this additional information as contextual knowledge. Unlike background knowledge, this is not implicit in the text being read. However, reasoning in the presence of contextual knowledge still poses significant challenges. An automated reading comprehension system that reasons over this additional information is expected to effectively do the following: encode the context in such a way that the bits of information relevant to the input utterances can be aligned to the input; and encode the inputs in such a way that they can be executed against the encoded context to produce the desired output. Approaches for encoding inputs and contexts, and linking them, depend on the nature of the contextual information.

Figure 1.1: Example of a task requiring reasoning over semi-structured context

In the case of structured contexts, such as knowledge graphs, linking contexts to inputs is referred to as entity linking, and encoding inputs is a Semantic Parsing problem, where the natural language utterances are translated into logical forms that can be executed against the knowledge graph to produce the desired output. When the contexts are unstructured, the problem boils down to encoding the context in a way that facilitates extracting input-specific information from it. This is usually referred to as Information Extraction. Examples include Hill et al. (2015); Richardson et al. (2013); Penas et al. (2013); Breck et al. (2001). In this thesis, we focus on both structured and unstructured contexts.

Figure 1.1 shows a question taken from the Wikitables dataset (Pasupat and Liang, 2015), and an associated table from a Wikipedia article, which provides the necessary context to answer it. This is an example of a QA task where the context is (semi-)structured. As can be seen from the example, answering this question requires doing multi-step reasoning over the table.

Figure 1.2 shows an example where the context is unstructured. This is from the bAbI dataset (Weston et al., 2015), which is a collection of artificially generated toy tasks. The first six sentences in the example form the context, and the final line is a question followed by the answer and pointers to the relevant sentences in the context. A QA system is expected to learn to find the appropriate context given the question and answer it correctly.

Figure 1.3 shows a relatively more difficult QA example from the Aristo science question answering dataset (Clark, 2015). Simple techniques such as measuring lexical overlap do not work well for this task because all four options have significant overlap with the sentences in context. As we show in Chapter 5, this task requires modeling complex entailment to be able to match the information present in the context with that in the question and option pairs.


Figure 1.2: Example of an artificial task requiring reasoning over context, part of the bAbI dataset

Figure 1.3: Snippet from a science textbook followed by a question


1.3 Knowledge-Aware NLU

Given that knowledge, both background and contextual, plays an important role in several real-world language understanding tasks, we need to reconsider our definition of the generic NLU pipeline. Without background knowledge, the encoding of inputs will be incomplete, and the NLU systems might not generalize beyond the specific patterns seen in the training data to unseen test cases. Without contextual knowledge, and an effective method to link encoded inputs to contexts, the NLU systems will simply not have sufficient information to produce the desired results. We thus modify our NLU equations as follows.

e = encode_with_background(I, K_e)    (1.3)
e_{K_p} = encode_context(K_p)    (1.4)
o = predict_in_context(e, e_{K_p})    (1.5)

where K_e and K_p represent the knowledge required for encoding and prediction respectively. They come from different sources and augment the inputs in different ways.
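As a toy, runnable illustration of these equations (again our own simplification, not one of the thesis models), the sketch below treats K_e as a hypernym lookup that augments the input encoding, and K_p as a small key-value context table that the predictor consults; the real instantiations are the neural models of Chapters 2 through 5.

def encode_with_background(I, K_e):
    # Equation 1.3: augment each input token with a hypernym from K_e when known.
    return [(token, K_e.get(token, token)) for token in I]

def encode_context(K_p):
    # Equation 1.4: the context here is already a key-value table, so encoding
    # is the identity; real models learn a representation instead.
    return K_p

def predict_in_context(e, e_Kp):
    # Equation 1.5: answer by matching augmented tokens against the context.
    for token, hypernym in e:
        if hypernym in e_Kp:
            return e_Kp[hypernym]
    return "unknown"

K_e = {"pool": "body-of-water"}      # toy background knowledge
K_p = {"body-of-water": "outside"}   # toy contextual knowledge
e = encode_with_background(["pool"], K_e)
print(predict_in_context(e, encode_context(K_p)))  # -> outside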

Obtaining relevant background or contextual knowledge can be challenging as well, depending on the task being solved. In this thesis, we do not put a lot of emphasis on these problems, except in Chapter 5, where we focus on context retrieval. In the remaining chapters, we will focus on tasks and datasets where obtaining appropriate knowledge is easy.

Better encoding with background knowledge. K_e is additional knowledge used to fill in the semantic gaps in the inputs, and thus obtain better encoded representations of them. One source of such background knowledge is knowledge bases and ontologies, which encode it explicitly. Examples include hypernym trees from WordNet, for incorporating sense and generalization information about concepts while composing sentences, and subgraph features from Freebase, for encoding relations between entities seen in the input text. Moldovan and Rus (2001) and Krymolowski and Roth are examples of feature-rich systems that encoded input sentences in the context of external knowledge. Both systems used WordNet features and other related information about the semantic classes of the words in the input in NLU tasks. While Krymolowski and Roth built a system based on SNoW (Carlson et al., 1999) for predicting Prepositional Phrase Attachment, Moldovan and Rus (2001) built a QA system. In Chapter 3 we describe a representation-learning application of knowledge-aware encoding where we show the advantages of incorporating WordNet information in recurrent neural networks for encoding sentences.

Such explicit sources of knowledge are often incomplete, or may not contain the kind of knowledge needed to fill the gaps and obtain representations needed for the task at hand. An alternative source of background knowledge is a model of selectional preferences (Wilks, 1973), given an appropriate structured representation of the inputs. We exploit the idea that predicates place certain constraints on the semantic content of their arguments, and that some predicate-argument structures are more common than others. Accordingly, we model the co-occurrences within the appropriate structures to capture these preferences. In Chapter 2, we explain this idea in greater detail and leverage selectional preferences in SRL structures to build an NLU system that understands events in the newswire domain.


Reasoning with contextual knowledge. The K_p inputs allow the prediction step to reason using additional contextual knowledge. K_p can either be structured or unstructured. In cases where the prediction component needs to learn to perform a sequence of operations (or a program) to arrive at the required output, prediction becomes a Semantic Parsing problem. The contexts are often structured in this case. There exist both traditional (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005, 2007, among others) and neural network based methods (Dong and Lapata, 2016; Andreas et al., 2016; Liang et al., 2016; Neelakantan et al., 2016) for solving these problems. In this thesis, we build on top of this work on Semantic Parsing, and describe a type-driven neural semantic parsing framework in Chapter 4, where we assume only weak supervision in the form of question answer pairs, and not the programs.

Machine reading comprehension typically involves encoding some context given as a paragraph and using it to produce an answer from an encoded question. Examples of task-targeted datasets built to evaluate that capability include MCTest (Richardson et al., 2013), the CNN/Daily Mail dataset (Hermann et al., 2015), SQuAD (Rajpurkar et al., 2016) and TriviaQA (Joshi et al., 2017), among others. Systems built for these tasks, including attentive readers (Hermann et al., 2015), variants of memory networks (Weston et al., 2014; Sukhbaatar et al., 2015a; Xiong et al., 2016) and the more recent Bidirectional Attention Flow (Seo et al., 2016), can be viewed as different instantiations of the predict_in_context function, where K_p is some background text that needs to be understood to answer a question about it. In Chapter 5, we describe a generic formulation of the problem.

1.4 Challenges

Adding external knowledge inputs to NLU systems is not straightforward. Firstly, while linking the text being read to some structured background knowledge in a KB, an automated system usually faces considerable ambiguity. For example, with lexical ontologies like WordNet, we get useful type hierarchies like parent is-a ancestor is-a person and pool is-a body-of-water and so on, but one has to deal with sense ambiguity: pool can also be a game. Moreover, the fact that the two sources of semantic information are fundamentally different makes this challenging. While distributional approaches encode meaning in a continuous and abstract fashion, meaning in KBs is usually symbolic and discrete. Chapter 3 is dedicated to learning distributions over the discrete concepts of the KB conditioned on the context, to deal with exceptions in language.
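This sense ambiguity is easy to observe directly in WordNet. The snippet below, a sketch using NLTK's WordNet interface (it assumes the wordnet corpus has been downloaded), lists a few noun senses of pool along with the tails of their hypernym paths:

# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("pool", pos=wn.NOUN)[:4]:
    # hypernym_paths() walks from the root of the hierarchy down to the synset.
    path = synset.hypernym_paths()[0]
    print(synset.name(), ":", " > ".join(s.name() for s in path[-4:]))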

When reasoning in the presence of context, finding the relevant parts of the provided information given the text being processed is a challenge. The goal of this thesis is to find the limitations of various knowledge sources, structured and unstructured, and to use this information to build hybrid NLU systems that can successfully incorporate real world knowledge in deep learning models. We now describe how this thesis is organized.

1.5 Thesis Outline

Figure 1.4 shows the generic pipeline described so far, along with the breakdown of the chapters related to each component of the pipeline.


Figure 1.4: Generic NLU pipeline and thesis overview


In Chapter 2 we encode events as predicate-argument structures derived from a semantic role labeler. In this chapter, we rely on selectional preferences between verbs and their semantic role fillers as the background knowledge, and show how they are useful in predicting anomalous events. Our event encoder, the Neural Event Model (NEM), captures semantics deeper than surface-level fluency. We describe an annotation effort that resulted in newswire headlines manually labeled with the degree of surprise associated with them, and show that NEM outperforms a baseline LSTM model that encodes whole sentences at anomaly detection.

In Chapter 3, we describe a method to incorporate implicit knowledge from WordNet towards obtaining better sentence representations. Our model looks at WordNet synsets and at the hypernym hierarchies of the words being processed, making the encoder aware of their different senses and the corresponding type information. The ontology-aware LSTM (OntoLSTM) learns to attend to the appropriate sense, and the relevant type (either the most specific concept or a generalization by choosing a hypernym of the word), conditioned on the context and the end-objective. We show that the sentence representations produced by OntoLSTM are better than those produced by LSTMs when used with identical prediction components for predicting textual entailment and prepositional phrase attachment. We also visualize the attention scores assigned to the hypernyms of words in the input sentences and show that OntoLSTM is indeed learning useful generalizations of words that help the learned representations perform better at the end task.

Chapter 4 deals with reasoning over semi-structured contexts. We describe a type-driven neural semantic parser aimed at answering compositional questions about Wikipedia tables. It is an encoder-decoder model that generates well-typed logical forms that can be executed against graph representations of the tables in context. The parser also includes entity embedding and linking modules that are trained jointly using QA supervision. This parser achieves state-of-the-art results on the WIKITABLEQUESTIONS dataset. Proposed work includes improving the entity linking module, and applying this parser to QA in other domains.

The focus of Chapter 5 is context retrieval methods for neural network models with explicit memory components. We start by describing a generalized deep learning architecture for reasoning over unstructured contexts. The pipeline follows the generic specification in Section 1.2 and has configurable components for encoding and prediction. We identify that the memory network models in Weston et al. (2014); Sukhbaatar et al. (2015a); Xiong et al. (2016), the attentive reader model from Hermann et al. (2015), and the Gated Attention Reader from Dhingra et al. (2016) are specific instantiations of our generic knowledge-guided reasoning model. Unlike prior work that evaluated these models on tasks where the contexts are limited and well-defined, we deal with tasks that require retrieving relevant context from large corpora. Our preliminary experiments show that the choice of retrieval method greatly affects the final QA results. Based on this insight, we propose to build models where retrieval is tightly integrated into the QA pipeline.


Part I

Encoding Background Knowledge


Chapter 2

Leveraging Selectional Preferences for Event Understanding

2.1 Introduction

In this chapter, we describe a system that models background knowledge by exploiting the structure in its inputs. As described in Chapter 1, our goal is to fill the gaps in the input utterances by explicitly encoding commonsense information, which is usually omitted by humans. Here, we do not rely on external sources of knowledge, but instead exploit the information provided by the co-occurrence of fillers of slots given by the structure to capture this knowledge. This notion was called Selectional Preference (Wilks, 1973) in earlier literature (see Section 2.3 for more information). While this notion was initially restricted to verbs and their arguments in dependency structures, we hypothesize that it extends to other kinds of structures as well, and consequently, that it can be used for modeling appropriate implicit semantic knowledge.

This chapter shows one instantiation of background knowledge: we select appropriate structures to model selectional preferences. The task presented here is that of differentiating between normal news headlines and strange or weird ones, and the implicit knowledge required for this task is fundamental to understanding events. Hence, we obtain the structured representations of the input utterances from a Semantic Role Labeler, a tool which identifies the eventive predicate in a given input sentence and its corresponding role fillers.

2.2 Understanding Events

Events are fundamental linguistic elements in human language. Thus understanding events is a fundamental prerequisite for deeper semantic analysis of language, and any computational system of human language should have a model of events. One can view an event as an occurrence of a certain action caused by an agent, affecting a patient at a certain time and place. It is the combination of the entities filling the said roles that defines an event. Furthermore, certain combinations of these role fillers agree with the general state of the world, and others do not. For example, consider the events described by the following sentences.

Man recovering after being bitten by his dog.


Man recovering after being shot by cops.

The agent of the biting action is dog in the first sentence, and that of the shooting action in the second sentence is cops. The patient in both sentences is man. Both combinations of role-fillers result in events that agree with most people's world view. That is not the case with the event in the following sentence.

Man recovering after being shot by his dog.

This combination does not agree with our expectations about the general state of affairs, and consequently one might find this event strange. While all three sentences above are equally valid syntactically, it is our knowledge about the role fillers, both individually and specifically in combination, that enables us to differentiate between normal and anomalous events. Hence we hypothesize that anomaly is a result of an unexpected or unusual combination of semantic role fillers.

In this chapter, we introduce the problem of automatic detection of anomalous events and propose a novel event model that can learn to differentiate between normal and anomalous events. We generally define anomalous events as those that are unusual compared to the general state of affairs and might invoke surprise when reported. An automatic anomaly detection algorithm has to encode the goodness of semantic role filler coherence. This is a hard problem, since determining what a good combination of role fillers is requires deep semantic and pragmatic knowledge. Moreover, manual judgment of anomaly itself may be difficult, and people often may not agree with each other in this regard. We describe the difficulty in human judgment in greater detail in Section 2.5.1. Automatic detection of anomaly requires encoding complex information, which poses the challenge of sparsity due to the discrete representations of meaning, namely words. These problems range from polysemy and synonymy at the lexical semantic level to entity and event coreference at the discourse level.

We define an event as the pair (V, A), where V is the predicate or a semantic verb[1], and A is the set of its semantic arguments like agent, patient, time, location, and so on. Our aim is to obtain a representation of the event that is composed from those of the individual words, while being explicitly guided by the semantic role structure. This representation can be understood as an embedding of the event in an event space.

2.3 Related Work

Selectional preference, a notion introduced by Wilks (1973), refers to the paradigm of modeling semantics of natural language in terms of the restrictions a verb places on its arguments. For example, the knowledge required to identify one of the following sentences as meaningless can be encoded as preferences (or restrictions) of the verb.

Man eats pasta.

Poetry eats computers.

The restrictions include eat requiring its subject to be animate, its object to be edible, and so on. Though the notion was originally proposed for a verb and its dependents in the context of dependency grammars, it is equally applicable to other kinds of semantic relations. In this work, we use this idea in the context of verbs and their semantic role fillers.

[1] By semantic verb, we mean an action word whose syntactic category is not necessarily a verb. For example, in "Terrorist attacks on the World Trade Center...", attacks is not a verb but is still an action word.

The idea is that identifying the right sense of a word can be guided by the preferences of the other words it is related to. Resnik (1997) illustrates this through the example of disambiguating the sense of the word letter in the phrase write a letter. While letter has multiple senses, as an object, the verb write "prefers" the sense closer to reading matter more than others. At the phrasal level, selectional preferences are useful in encoding complex phenomena like metaphor. Metaphorical usage of words usually involves violating selectional restrictions. For example, the phrase kill an idea is a metaphor, because kill does not usually take abstract concepts as objects. The phrase itself is common, and one can easily attribute a metaphorical sense to the verb kill and resolve the violation. Wilks (2007a), Wilks (2007b), Krishnakumaran and Zhu (2007), and Shutova et al. (2013), among others, discuss the selectional preference view of metaphor identification.

Automatic acquisition of selectional preferences from text is a well-studied problem. Resnik (1996) proposed the first broad-coverage computational model of selectional preferences. The approach is based on using WordNet's hierarchy to generalize across words. It measures the selectional association strength of a triple (v, r, c), with verb v, relation r and argument class c, as the class's contribution to the Kullback-Leibler divergence between the distributions P(c|v, r) and P(c|r), normalized over all classes of arguments that occur with the (v, r) pair. Abe and Li (1996) also use WordNet, and model selectional preferences by minimizing the length of the tree cut through the noun hierarchy. Ciaramita and Johnson (2000) encode the hierarchy as a Bayesian network and learn selectional preferences using probabilistic graphical methods. Rooth et al. (1999) adopt a distributional approach, and use latent variable models to induce classes of noun-verb pairs, and thus learn their preferences. The methods of Erk (2007) and Erk et al. (2010) are also distributional, and are described below. Seaghdha (2010) proposed the use of Latent Dirichlet Allocation (LDA) to model preferences as topics. Van de Cruys (2009) used non-negative tensor factorization, and modeled subject-verb-object triples as a three-way tensor. The tensor is populated with co-occurrence frequencies, and the selectional preferences are measured from decomposed representations of the three words in the triple. Van de Cruys (2014) trains neural networks to score felicitous triples higher than infelicitous ones.
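Concretely, Resnik's two quantities can be written as follows (a standard rendering of the definitions just summarized, in our notation; the proposal states them only in prose):

% Selectional preference strength of the pair (v, r): the KL divergence
% between P(c | v, r) and P(c | r).
S(v, r) = \sum_{c} P(c \mid v, r) \, \log \frac{P(c \mid v, r)}{P(c \mid r)}

% Selectional association of (v, r, c): the contribution of class c to
% S(v, r), normalized so that the associations sum to one over classes.
A(v, r, c) = \frac{1}{S(v, r)} \, P(c \mid v, r) \, \log \frac{P(c \mid v, r)}{P(c \mid r)}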

Erk et al. (2010) also model selectional preferences using vector spaces. They measure the goodness of the fit of a noun with a verb in terms of the similarity between the vector of the noun and some "exemplar" nouns taken by the verb in the same argument role. Baroni and Lenci (2010) also measure selectional preference similarly, but instead of exemplar nouns, they calculate a prototype vector for that role based on the vectors of the most common nouns occurring in that role for the given verb. Lenci (2011) builds on this work and models the phenomenon that the expectations of the verb or its role-fillers change dynamically given other role fillers.

2.4 Model for Semantic Anomaly Detection

The Neural Event Model (NEM) is a supervised model that learns to differentiate between anomalous and normal events by classifying dense representations of events, otherwise known as event embeddings. We treat the problem as a binary classification problem, with normal and anomalous being the two classes. The event embeddings are computed using a structured composition model that represents an event as the composition of five slot-fillers: action, agent, patient, time and location. Figure 2.1 shows the pipeline for anomalous event detection using NEM. We first identify the fillers for the five slots mentioned above by running a PropBank (Palmer et al., 2005) style Semantic Role Labeler (SRL). We use the following mapping of roles to obtain the event structure (a code sketch of this mapping follows the list):

• V → action
• A0 → agent
• A1 → patient
• AM-TMP → time
• AM-LOC → location

This structured input is then passed to NEM. The model has three components:

Argument Composition. This component encodes all the role fillers as vectors in the same space. This is accomplished by embedding the words in each role filler, and then passing them through a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) cells. Note that the same LSTM is used for all the slots. Concretely,

a_s = LSTM(I_s)    (2.1)

where I_s denotes the ordered set of word vector representations in the filler of slot s, and a_s represents the vector representation of that slot filler.

Event Composition. The slot-filler representations are then composed using slot-specific projection weights to obtain a single event representation. Concretely,

ā_s = tanh(W_sᵀ a_s + b_s),  where s ∈ {c, g, p, t, l}    (2.2)

e = ā_c + ā_g + ā_p + ā_t + ā_l    (2.3)

where W_s ∈ ℝ^{d×d} and b_s ∈ ℝ^d are slot-specific projection weights and biases respectively, with s denoting one of action, agent, patient, time and location, and d is the dimensionality of the word embeddings. Event composition involves performing non-linear projections of all slot-filler representations, and finally obtaining an event representation e.

Anomaly Classification. The event representation is finally passed through a softmax layer to obtain the predicted probabilities for the two classes.

p_e = softmax(W_eᵀ e + b_e)    (2.4)

where W_e ∈ ℝ^{d×2} and b_e ∈ ℝ^2 are the weight and bias values of the softmax layer respectively. As can be seen, the layer projects the event representation into a two-dimensional space before applying the softmax non-linearity.


Figure 2.1: Anomaly Classification Pipeline with Neural Event Model

2.4.1 Training

Given the output from NEM, we define

p_e^1(θ) = P(anomalous | e; θ)    (2.6)
p_e^0(θ) = P(normal | e; θ)    (2.7)

where θ denotes the parameters of the model. We compute the label loss, based on the prediction at the end of anomaly classification, as a cross-entropy loss as follows:

J(θ) = −(l log p_e^1(θ) + (1 − l) log p_e^0(θ))    (2.8)

where l is the reference binary label indicating whether the event is normal or anomalous. The three components of the model described above, argument composition, event composition and anomaly classification, are all trained jointly. That is, we optimize the following objective:

θ* = arg min_θ J(θ)    (2.9)

where θ includes all W_s and b_s from all the slots, W_e and b_e from event composition, and also all the LSTM parameters from argument composition. We optimize this objective using ADAM (Kingma and Ba, 2014). The model is implemented[2] using Keras (Chollet, 2015). Note that the model described here is an extension of the one described in our previously published paper (Dasigi and Hovy, 2014). The main improvements here include using an LSTM-RNN as the encoder in argument composition instead of a simple RNN, and jointly training argument composition, event composition and anomaly classification, whereas the older model included training argument composition separately using an unsupervised objective.

[2] The code is publicly available at https://github.com/pdasigi/neural-event-model
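A minimal sketch of the three components in Keras follows. It mirrors Equations 2.1 through 2.4 and the training objective, but the hyperparameters (dimensionality, vocabulary size) and layer choices are illustrative assumptions, not the released implementation:

from tensorflow.keras import layers, Model

d, vocab_size = 100, 20000        # assumed dimensionality and vocabulary size
slots = ["action", "agent", "patient", "time", "location"]

embed = layers.Embedding(vocab_size, d, mask_zero=True)
shared_lstm = layers.LSTM(d)      # one LSTM shared by all slots (Eq. 2.1)

inputs, projected = [], []
for s in slots:
    inp = layers.Input(shape=(None,), dtype="int32", name=s)
    a_s = shared_lstm(embed(inp))                       # argument composition
    a_bar = layers.Dense(d, activation="tanh",
                         name="project_" + s)(a_s)      # slot projection (Eq. 2.2)
    inputs.append(inp)
    projected.append(a_bar)

e = layers.Add()(projected)                     # event representation (Eq. 2.3)
p_e = layers.Dense(2, activation="softmax")(e)  # anomaly classification (Eq. 2.4)

model = Model(inputs, p_e)
model.compile(optimizer="adam",                 # ADAM, as in the text
              loss="categorical_crossentropy")  # cross-entropy loss (Eq. 2.8)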


Total number of annotators               22
Normal annotations                       56.3%
Strange annotations                      28.6%
Highly unusual annotations               10.3%
Cannot Say annotations                   4.8%
Avg. events annotated per worker         344
4-way inter-annotator agreement (α)      0.34
3-way inter-annotator agreement (α)      0.56

Table 2.1: Annotation Statistics

2.5 Data

We crawl 3684 "weird news" headlines available publicly on the website of NBC News[3], such as the following:

India weaponizes world’s hottest chili.

Man recovering after being shot by his dog.

Thai snake charmer puckers up to 19 cobras.

For training, we assume that the events extracted from this source, called NBC Weird Events (NWE) henceforth, are anomalous. NWE contains 4271 events extracted using SENNA's (Collobert et al., 2011) SRL. We use 3771 of those events as our negative training data. Similarly, we also extract events from headlines in the AFE section of Gigaword, called Gigaword Events (GWE) henceforth. We assume these events are normal. To use as positive examples for training event composition, we sample roughly the same number of events from GWE as our negative examples from NWE. From the two sets, we uniformly sample 1003 events as the test set and validate the labels by getting crowd-sourced annotations.

2.5.1 Annotation

We post the annotation of the test set containing 1003 events as Human Intelligence Tasks (HITs) on Amazon Mechanical Turk (AMT). We break the task into 20 HITs and ask the workers to select one of four options for each event: highly unusual, strange, normal, and cannot say. We ask them to select highly unusual when the event seems too strange to be true, strange if it seems unusual but still plausible, and cannot say only if the information present in the event is not sufficient to make a decision. We publicly release this set of 1003 annotated events for evaluating future research.

Table 2.1 shows some statistics of the annotation task. We compute the Inter-Annotator Agreement (IAA) in terms of Krippendorff's alpha (Krippendorff, 1980). The advantage of using this measure instead of the more popular Kappa is that the former can deal with missing information, which is the case with our task since annotators work on different overlapping subsets of the test set. The 4-way IAA shown in the table corresponds to agreement over the original 4-way decision (including cannot say), while the 3-way IAA is measured after merging the highly unusual and strange decisions.

[3] http://www.nbcnews.com/html/msnbc/3027113/3032524/4429950/4429950_1.html

Figure 2.2: Comparison of perplexity scores for different labels

Additionally, we use MACE (Hovy et al., 2013) to assess the quality of the annotation. MACE models the annotation task as a generative process of producing the observed labels conditioned on the true labels and the competence of the annotators, and predicts both sets of latent variables. The average competence of the annotators, a value that ranges from 0 to 1, is 0.49 for the 4-way decision and 0.59 for the 3-way decision on our task.

We generate the true label predictions produced by MACE, discard the events for which the prediction remains cannot say, and use the rest as the final labeled test set. This leaves 949 events as our test dataset, of which only 41% of the labels are strange or highly unusual. It has to be noted that even though our test set has equal-size samples from both NWE and GWE, the true distribution is not uniform.

2.5.2 Language Model Separability

Given the annotations, we test whether the sentences corresponding to anomalous events can be separated from normal events using simpler features. We build an n-gram language model from the training data set used for argument composition and measure the perplexity of the sentences in the test set. Figure 2.2 shows a comparison of the perplexity scores for different labels. If n-gram features were enough to separate the different classes of sentences, one would expect the sentences corresponding to the strange and highly unusual labels to have higher perplexity ranges than normal sentences, because the language model is built from a dataset in which the majority of sentences contain normal events. As can be seen in Figure 2.2, except for a few outliers, most data points in all the categories are in similar perplexity ranges. Hence, sentences with different labels cannot be separated based on n-gram language model features.
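The following sketch shows how such a check can be run with NLTK's language-model module; the n-gram order and the smoothing method are our assumptions, since the text does not specify them:

from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

ORDER = 3  # assumed trigram model

def train_lm(tokenized_sents):
    train_data, vocab = padded_everygram_pipeline(ORDER, tokenized_sents)
    lm = Laplace(ORDER)  # add-one smoothing keeps unseen n-grams finite
    lm.fit(train_data, vocab)
    return lm

def perplexity(lm, tokens):
    test_ngrams = list(ngrams(pad_both_ends(tokens, n=ORDER), ORDER))
    return lm.perplexity(test_ngrams)

lm = train_lm([["man", "bites", "dog"], ["dog", "bites", "man"]])
print(perplexity(lm, ["dog", "shoots", "man"]))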


                  Accuracy    F-Score
PMI Baseline      45.2%       45.1%
LSTM Baseline     82.5%       77.4%
NEM               84.9%       79.7%

Table 2.2: Classification Performance and Comparison with Baselines

2.6 Results

We merged the two anomaly classes (strange and highly unusual) and calculated the accuracy and F-score (for the anomaly class) of the binary classifier. For the sake of comparison, we also define the following two baseline models.

PMI Baseline We compare the performance of our model against a baseline based on how well the semantic arguments in the event match the selectional preferences of the predicate. We measure selectional preference using Pointwise Mutual Information (PMI) (Church and Hanks, 1990) between the head word of each semantic argument and the predicate. The baseline model is built as follows. We perform dependency parsing using MaltParser (Nivre et al., 2007) on the sentences in the training data used in the first phase of training to obtain the head words of the semantic arguments. We then calculate the PMI values of all the pairs ⟨h_A, p⟩, where h_A is the head word of argument A and p is the predicate of the event. For training our baseline classifier, we use the labeled training data from the event composition phase. The features to this classifier are the PMI measures of the ⟨h_A, p⟩ pairs estimated from the larger dataset. The classifier thus trained to distinguish between anomalous and normal events is applied to the test set.
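The following is a minimal sketch of the PMI estimation step on toy data; the head-word extraction (via MaltParser) and the downstream classifier are omitted.

```python
import math
from collections import Counter

def pmi_table(arg_pred_pairs):
    """Estimate PMI(h, p) for head-word/predicate pairs from a corpus of events."""
    pair_counts = Counter(arg_pred_pairs)
    head_counts = Counter(h for h, _ in arg_pred_pairs)
    pred_counts = Counter(p for _, p in arg_pred_pairs)
    total = len(arg_pred_pairs)
    pmi = {}
    for (h, p), c in pair_counts.items():
        p_joint = c / total
        pmi[(h, p)] = math.log(
            p_joint / ((head_counts[h] / total) * (pred_counts[p] / total)))
    return pmi

# Toy corpus of (head-word, predicate) pairs extracted from dependency parses.
pairs = [("man", "eat"), ("pasta", "eat"), ("man", "sleep"), ("rock", "eat")]
print(pmi_table(pairs)[("pasta", "eat")])
```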

LSTM Baseline To investigate the importance of our proposed event representation, we also implemented an LSTM baseline that directly encodes sentences. The following equations describe the model.

w = LSTM(I)  (2.10)
w̄ = tanh(W_w^⊤ w + b_w)  (2.11)
p_u = softmax(W_u^⊤ w̄ + b_u)  (2.12)

Note that this model is comparable in complexity and has the same number of non-linear transformations as NEM. w represents a word-level composition instead of the argument-level composition in NEM, and p_u is a probability distribution defined over the unstructured composition as features. Training of this baseline model is done in the same way as for NEM.
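The following PyTorch sketch shows one possible implementation of Eqs. 2.10-2.12; the dimensions and the binary output are illustrative assumptions, not the experimental settings.

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    """Sentence-level anomaly classifier: LSTM encoding followed by a tanh
    projection and a softmax output layer (Eqs. 2.10-2.12)."""
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=50, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, hidden_dim)  # W_w, b_w
        self.output = nn.Linear(hidden_dim, num_classes)  # W_u, b_u

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))     # w = LSTM(I)
        w_bar = torch.tanh(self.project(h_n[-1]))          # Eq. 2.11
        return torch.softmax(self.output(w_bar), dim=-1)   # Eq. 2.12

model = LSTMBaseline(vocab_size=10000)
probs = model(torch.randint(0, 10000, (4, 20)))  # batch of 4 sentences
```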

Table 2.2 shows the results and a comparison with the two baselines. The accuracy of the PMI baseline classifier is lower than 50%, the expected accuracy of a classifier that assigns labels randomly. As seen in the table, the LSTM model proves to be a strong baseline, but it underperforms in comparison with NEM, showing the value of our proposed event composition model. The difference between the performance of NEM and the LSTM baseline is statistically significant with p < 0.0001.


Figure 2.3: ROC curves for the LSTM baseline and NEM

Human highest   0.896
Human average   0.633
Human lowest    0.144
Random          0.054
LSTM Baseline   0.743
NEM             0.797

Table 2.3: Anomaly Detection Competence

Figure 2.3 shows the ROC curves for NEM and the LSTM baseline.

To further compare NEM with human annotators, we give MACE the binary predictions produced by NEM and those of the LSTM baseline along with the human annotations, and measure their competence. For the sake of comparison, we also give MACE a list of random binary labels as one of the annotations, to measure the competence of a hypothetical worker that makes random choices. These results are reported in Table 2.3. Unsurprisingly, the competence of the random classifier is very low, and is worse than that of the least competent human. It can be seen that MACE predicts both the LSTM baseline and NEM to be more competent than the human average. This is an encouraging result, especially given that this is a hard problem even for humans.

2.7 Conclusion

In this chapter, we have looked at how selectional preferences can be leveraged to model events and represent them in such a way as to distinguish anomalous events from normal events. Results from our controlled experiments comparing the proposed model with one that directly encodes whole sentences have shown the value of the proposed structured composition.


The model we have shown in this chapter is just one instantiation of using selectional preferences to capture implicit knowledge. In addition to some of the traditional uses of selectional preferences in dependency structures described in Section 2.3, there also exists a large body of work on applying selectional preferences to relations among entities. Knowledge base completion and relation extraction (Mintz et al., 2009; Sutskever et al., 2009; Nickel et al., 2011; Bordes et al., 2011; Socher et al., 2013; Gardner et al., 2014, among others) fall under this category.

There are limitations to the kind of implicit knowledge that selectional preferences over processed inputs can capture. There is often an unmet need for encoding factual commonsense knowledge. For example, one important aspect of anomaly that is currently not handled by NEM is the level of generality of the concepts the events contain. Usually, more general concepts cause events to be more normal since they convey less information. For example, an American soldier shooting another American soldier may be considered unusual, while a soldier shooting another soldier may not be as unusual, and at the highest level of generalization, a person shooting another person is normal. This information about generality can be incorporated into the event model only by using real-world knowledge from knowledge bases like WordNet (Miller, 1995). We pursue that direction in Chapter 3.


Chapter 3

Sentence Understanding with Background Knowledge from Ontologies

3.1 Introduction

As described in Chapter 1, the utterances provided as input to an NLU system do not contain all the information needed to fully comprehend them, and an important missing piece is the common-sense information that humans usually leave out for the sake of brevity. In this chapter, we show how we can leverage knowledge bases to encode this missing information. As opposed to the model described in Chapter 2, the one described here assumes that the background knowledge is explicitly present in a symbolic knowledge base.

The model shown here is one specific kind of knowledge-augmented encoder. We restrict ourselves to lexical background knowledge, specifically of the kind that specifies sense and conceptual information of words. We use WordNet (Miller, 1995) to obtain this information. For example, WordNet contains the information that the word bugle as a noun has three senses, one of which is a brass instrument and another a type of herb. It also shows that clarion, trombone, French horn, etc. are also types of brass instruments. Clearly, when the word is used in an utterance, all this information is omitted because the reader is expected to know a lot of it, or at least be able to infer it from the context. However, this is not true of computational systems that process human language. The goal of this chapter is to develop a computational model of language that incorporates information from WordNet to obtain better word representations.

Our model is an ontology-aware recurrent neural network, an encoder that computes token-level word representations grounded in an ontology (WordNet). The kind of background knowledge we deal with here is that of the various senses of words, and the generalizations of the concepts that correspond to each sense. We encode that information by using WordNet to find the collection of semantic concepts manifested by a word type, and represent a word type as a collection of concept embeddings. We show how to integrate the proposed ontology-aware lexical representation with recurrent neural networks (RNNs) to model token sequences. Section 3.5.3 and Section 3.6.4 provide empirical evidence that the WordNet-augmented representations encode lexical background knowledge useful for two tasks that require semantic understanding: predicting prepositional phrase attachment and textual entailment.

3.2 Ontology-Aware Token Embeddings

Types vs. Tokens In accordance with standard terminology, we make the following distinction between types and tokens in this chapter: by word types, we mean the surface form of the word, whereas by tokens we mean the instantiation of the surface form in a context. For example, the same word type 'pool' occurs as two different tokens in the sentences "He sat by the pool." and "He swam in the pool."

Type-level word embeddings map a word type (i.e., a surface form) to a dense vector of realnumbers such that similar word types have similar embeddings. When pre-trained on a largecorpus of unlabeled text, they provide an effective mechanism for generalizing statistical modelsto words which do not appear in the labeled training data for a downstream task. Most wordembedding models define a single vector for each word type. However, a fundamental flaw inthis design is their inability to distinguish between different meanings and abstractions of thesame word. For example, pool has a different sense in the sentence “He played a game of pool.”,compared to the two examples shown above, but type-level word embeddings typically assignthe same representation for all three of them. Similarly, the fact that ‘pool’ and ‘lake’ are bothkinds of water bodies is not explicitly incorporated in most type-level embeddings. Furthermore,it has become a standard practice to tune pre-trained word embeddings as model parametersduring training for an NLP task (e.g., Chen and Manning, 2014; Lample et al., 2016), potentiallyallowing the parameters of a frequent word in the labeled training data to drift away from relatedbut rare words in the embedding space.

In this chapter, we represent a word token in a given context by estimating a context-sensitive probability distribution over relevant concepts in WordNet (Miller, 1995) and use the expected value (i.e., weighted sum) of the concept embeddings as the token representation. Effectively, we map each word type to a grid of concept embeddings (see Fig. 3.1), which are shared across many words. The word representation is computed as a distribution over the concept embeddings from the word's grid. We show how these distributions can be learned conditioned on the context when the representations are plugged into RNNs. Intuitively, commonsense knowledge encoded as WordNet relations could potentially help with language understanding tasks, but mapping tokens to entities in WordNet is a challenge. One needs to at least disambiguate the sense of the token before being able to use the relevant information. Similarly, not all the hypernyms defined by WordNet may be useful for the task at hand, as some may be too general to be informative.

We take a task-centric approach towards doing this, and learn the token representations jointly with the task-specific parameters. In addition to providing context-sensitive token embeddings, the proposed method implicitly regularizes the embeddings of related words by forcing related words to share similar concept embeddings. As a result, the representation of a rare word which does not appear in the training data for a downstream task benefits from all the updates to related words which share one or more concept embeddings. While the results presented in this thesis are from a model that relies on WordNet, and we exploit the order of senses given by WordNet, the approach is, in principle, applicable to any ontology, with appropriate modifications. Here, we do not assume the inputs are sense tagged.


Figure 3.1: Example of ontology-aware lexical representation

We evaluate the proposed embeddings in two applications. The first is predicting prepositional phrase (PP) attachments (see Section 3.5), a challenging problem which emphasizes the selectional preferences between words in the PP and each of the candidate head words. The second application is textual entailment (see Section 3.6), a task that benefits from hypernymy features from WordNet, which provide generalization information. Our empirical results and detailed analysis (see Section 3.5.3) show that the proposed embeddings effectively use WordNet to improve the accuracy of PP attachment predictions.

3.3 Related Work

This work is related to various lines of research within the NLP community: dealing with synonymy and homonymy in word representations, both in the context of distributed embeddings and more traditional vector spaces; hybrid models of distributional and knowledge-based semantics; and selectional preferences and their relation to syntactic and semantic relations.

3.3.1 Multi-prototype Word Vectors

The need for going beyond a single vector per word type has been well established for a while, and many efforts have focused on building multi-prototype vector space models of meaning (Reisinger and Mooney, 2010; Huang et al., 2012; Chen et al., 2014; Jauhar et al., 2015; Neelakantan et al., 2015; Arora et al., 2016, etc.). However, the target of all these approaches is obtaining multi-sense word vector spaces, either by incorporating sense-tagged information or other kinds of external context. The number of vectors learned is still fixed, based on the preset number of senses. In contrast, our focus is on learning a context-dependent distribution over those concept representations.


Other work not necessarily related to multi-sense vectors, but still related to ours, includes that of Belanger and Kakade (2015), who proposed a Gaussian linear dynamical system for estimating token-level word embeddings, and that of Vilnis and McCallum (2015), who proposed mapping each word type to a density instead of a point in a space to account for uncertainty in meaning. These approaches do not make use of lexical ontologies and are not amenable to joint training with a downstream NLP task.

Related to the idea of concept embeddings is the work of Rothe and Schutze (2015), which estimated WordNet synset representations given pre-trained type-level word embeddings. In contrast, our work focuses on estimating token-level word embeddings as context-sensitive distributions over concept embeddings.

3.3.2 Hybrid Models of Knowledge-based and Distributional Semantics

There is a large body of work that has tried to improve word embeddings using external resources. Yu and Dredze (2014) extended the CBOW model (Mikolov et al., 2013) by adding an extra term to the training objective for generating words conditioned on similar words according to a lexicon. Jauhar et al. (2015) extended the skip-gram model (Mikolov et al., 2013) by representing word senses as latent variables in the generation process, and used a structured prior based on the ontology. Faruqui et al. (2015) used belief propagation to update pre-trained word embeddings on a graph that encodes lexical relationships in the ontology. Similarly, Johansson and Pina (2015) improved word embeddings by representing each sense of a word in a way that reflects the topology of the semantic network it belongs to, and then representing words as convex combinations of their senses. In contrast to previous work aimed at improving type-level word representations, we propose an approach for obtaining context-sensitive embeddings at the token level, while jointly optimizing the model parameters for the NLP task of interest.

3.3.3 WordNet as a source for Selectional Preferences

Resnik (1993) showed the applicability of semantic classes and selectional preferences to resolving syntactic ambiguity. Zapirain et al. (2013) applied models of selectional preferences, automatically learned from WordNet and distributional information, to the problem of semantic role labeling. Resnik (1993); Brill and Resnik (1994); Agirre (2008) and others have used WordNet information towards improving prepositional phrase attachment predictions.

3.4 WordNet-Grounded Context-Sensitive Token Embeddings

In this section, we focus on defining our context-sensitive token embeddings. We first describeour grounding of word types using WordNet concepts. Then, we describe our model of context-sensitive token-level embeddings as a weighted sum of WordNet concept embeddings.


3.4.1 WordNet Grounding

We use WordNet to map each word type to a set of synsets, including possible generalizations or abstractions. Among the labeled relations defined in WordNet between different synsets, we focus on the hypernymy relation, which helps model generalization and selectional preferences between words, and is especially important for predicting PP attachments (Resnik, 1993). To ground a word type, we identify the set of (direct and indirect) hypernyms of the WordNet senses of that word. A simplified grounding of the word 'pool' is illustrated in Figure 3.2. This grounding is key to our model of token embeddings, described in the following subsections.
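The following sketch shows how such a grounding can be read off WordNet with NLTK's interface; the shortest-path heuristic mirrors the choice we describe in Section 3.5.3, and the function name is hypothetical.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def ground_word(word, pos=wn.NOUN):
    """Map a word type to its WordNet senses and their (direct and indirect)
    hypernyms, mirroring the grounding illustrated in Figure 3.2."""
    grounding = {}
    for synset in wn.synsets(word, pos=pos):
        # Follow the shortest hypernym path up to the root.
        path = min(synset.hypernym_paths(), key=len)
        # Include the synset itself, as in footnote 1 below.
        grounding[synset.name()] = [s.name() for s in reversed(path)]
    return grounding

for sense, hypernyms in ground_word("pool").items():
    print(sense, "->", hypernyms)
```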

3.4.2 Context-Sensitive Token Embeddings

Our goal is to define a context-sensitive model of token embeddings which can be used as a drop-in replacement for traditional type-level word embeddings.

Notation. Let Senses(w) be the list of synsets defined as possible word senses of a given word type w in WordNet, and Hypernyms(s) be the list of hypernyms for a synset s.1 Figure 3.2 shows an example. In the figure, solid arrows represent possible senses and dashed arrows represent hypernym relations. Note that the same set of concepts is used to ground the word 'pool' regardless of its context; other WordNet senses of 'pool' were removed from the figure for simplicity. According to the figure,

Senses(pool) = [pond.n.01, pool.n.09], and
Hypernyms(pond.n.01) = [pond.n.01, lake.n.01, body_of_water.n.01, entity.n.01]

Each WordNet synset s is associated with a set of parameters v_s ∈ R^n which represent its embedding. This parameterization is similar to that of Rothe and Schutze (2015).

Embedding model. Given a sequence of tokens t and their corresponding word types w, let u_i ∈ R^n be the embedding of the word token t_i at position i. Unlike most embedding models, the token embeddings u_i are not parameters. Rather, u_i is computed as the expected value of the concept embeddings used to ground the word type w_i corresponding to the token t_i:

u_i = Σ_{s ∈ Senses(w_i)} Σ_{s′ ∈ Hypernyms(s)} p(s, s′ | t, w, i) v_{s′}  (3.1)

such that

Σ_{s ∈ Senses(w_i)} Σ_{s′ ∈ Hypernyms(s)} p(s, s′ | t, w, i) = 1

The distribution that governs the expectation over synset embeddings factorizes into two components:

p(s, s′ | t, w, i) ∝ λ_{w_i} exp(−λ_{w_i} rank(s, w_i)) × MLP([v_{s′}; context(i, t)])  (3.2)

1For notational convenience, we assume that s ∈ Hypernyms(s).


Figure 3.2: An example grounding for the word pool

The first component, λ_{w_i} exp(−λ_{w_i} rank(s, w_i)), is a sense prior which reflects the prominence of each word sense for a given word type. Here, we exploit2 the fact that WordNet senses are ordered in descending order of their frequencies, obtained from sense-tagged corpora, and parameterize the sense prior like an exponential distribution. rank(s, w_i) denotes the rank of sense s for the word type w_i; thus rank(s, w_i) = 0 corresponds to s being the first sense of w_i. The scalar parameter λ_{w_i} controls the decay of the probability mass, and is learned along with the other parameters in the model. Note that sense priors are defined for each word type w_i, and are shared across all tokens which have the same word type.

MLP([v_{s′}; context(i, t)]), the second component, is what makes the token representations context-sensitive. It scores each concept in the WordNet grounding of w_i by feeding the concatenation of the concept embedding and a dense vector that summarizes the textual context into a multilayer perceptron (MLP) with two tanh layers followed by a softmax layer. This component is inspired by the soft attention often used in neural machine translation (Bahdanau et al., 2014).3 The definition of the context function depends on the encoder used to encode the context. We describe specific instantiations of this function in Section 3.5 and Section 3.6.

To summarize, Figure 3.3 illustrates how to compute the embedding of a word token t_i = 'pool' in a given context:

1. compute a summary of the context, context(i, t),

2. enumerate related concepts for t_i,

3. compute p(s, s′ | t, w, i) for each pair (s, s′), and

4. compute u_i = E[v_{s′}].

A minimal sketch of these steps, for a single token, is shown below.
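This PyTorch sketch is an assumption-laden illustration: it handles one token with a padded (senses × hypernyms) grid, learns a single decay parameter instead of one λ per word type, and uses a one-tanh-layer attention MLP rather than the two tanh layers of the full model.

```python
import torch
import torch.nn as nn

class OntoAwareEmbedding(nn.Module):
    """Sketch of Eqs. 3.1-3.2: a token embedding as the expectation over
    concept embeddings, weighted by a sense prior and a context attention."""
    def __init__(self, num_synsets, dim, context_dim):
        super().__init__()
        self.synset_embed = nn.Embedding(num_synsets, dim)
        # One decay parameter here; the full model learns one per word type.
        self.log_lambda = nn.Parameter(torch.zeros(1))
        self.attention = nn.Sequential(
            nn.Linear(dim + context_dim, dim), nn.Tanh(),
            nn.Linear(dim, 1))

    def forward(self, synset_ids, context):
        # synset_ids: (senses, hypernyms) grid of concept ids for one token;
        # context: (context_dim,) summary of the textual context.
        S, H = synset_ids.shape
        v = self.synset_embed(synset_ids)                    # (S, H, dim)
        lam = self.log_lambda.exp()
        ranks = torch.arange(S, dtype=torch.float32).unsqueeze(1)
        sense_prior = lam * torch.exp(-lam * ranks)          # (S, 1)
        ctx = context.view(1, 1, -1).expand(S, H, -1)
        att = self.attention(torch.cat([v, ctx], dim=-1)).squeeze(-1)  # (S, H)
        scores = sense_prior * att.exp()        # unnormalized p(s, s' | t, w, i)
        p = scores / scores.sum()
        return (p.unsqueeze(-1) * v).sum(dim=(0, 1))         # u_i = E[v_s']

embedder = OntoAwareEmbedding(num_synsets=1000, dim=50, context_dim=50)
u = embedder(torch.randint(0, 1000, (3, 5)), torch.randn(50))
```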

In the following section, we describe our model for predicting PP attachments, including our definition of context for this problem.

2Note that for ontologies where such information is not available, our method is still applicable, but without this component. We show the effect of using a uniform sense prior in Section 3.5.3.

3Although the soft attention mechanism is typically used to explicitly represent the importance of each item in a sequence, it can also be applied to non-sequential items.

Figure 3.3: Steps for computing the context-sensitive token embedding for the word 'pool', as described in Section 3.4.2.

3.5 PP Attachment

Disambiguating PP attachments is an important and challenging NLP problem. Since modelinghypernymy and selectional preferences is critical for successful prediction of PP attachments(Resnik, 1993), it is a good fit for evaluating our WordNet-grounded context-sensitive embed-dings.

Figure 3.4, reproduced from Belinkov et al. (2014), illustrates an example of the PP attachment prediction problem. In the top sentence, the PP 'with butter' attaches to the noun 'spaghetti'. In the bottom sentence, the PP 'with chopsticks' attaches to the verb 'ate'. The accuracy of a competitive English dependency parser at predicting the head word of an ambiguous prepositional phrase is 88.5%, significantly lower than the overall unlabeled attachment accuracy of the same parser (94.2%).4

This section formally defines the problem of PP attachment disambiguation, describes ourbaseline model, then shows how to integrate the token-level embeddings in the model.

4See Table 3.2 for detailed results.


Figure 3.4: Two sentences illustrating the importance of lexicalization in PP attachment deci-sions.

3.5.1 Problem Definition

We follow Belinkov et al. (2014)’s definition of the PP attachment problem. Given a prepositionp and its direct dependent d in the prepositional phrase (PP), our goal is to predict the correcthead word for the PP among an ordered list of candidate head words h. Each example in thetrain, validation, and test sets consists of an input tuple 〈h, p, d〉 and an output index k to identifythe correct head among the candidates in h. Note that the order of words that form each 〈h, p, d〉is the same as that in the corresponding original sentence.

3.5.2 Model Definition

Both our proposed and baseline models for PP attachment use a bidirectional RNN with LSTM cells (bi-LSTM) to encode the sequence t = 〈h_1, . . . , h_K, p, d〉.

We score each candidate head by feeding the concatenation of the output bi-LSTM vectors for the head h_k, the preposition p, and the direct dependent d through an MLP, with a fully connected tanh layer to obtain a non-linear projection of the concatenation, followed by a fully connected softmax layer:

p(h_k is head) = MLP_attach([lstm_out(h_k); lstm_out(p); lstm_out(d)])  (3.3)

To train the model, we use cross-entropy loss at the output layer for each candidate head in the training set. At test time, we predict the candidate head with the highest probability according to the model in Eq. 3.3, i.e.,

k̂ = argmax_k p(h_k is head).  (3.4)
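The following PyTorch sketch illustrates the candidate-head scoring of Eqs. 3.3 and 3.4; the hidden size and the softmax over candidates are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttachmentScorer(nn.Module):
    """Sketch of Eq. 3.3: score each candidate head from the concatenated
    bi-LSTM output vectors of the head, preposition, and dependent."""
    def __init__(self, lstm_dim, hidden_dim=100):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * lstm_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1))

    def forward(self, head_vecs, prep_vec, dep_vec):
        # head_vecs: (K, lstm_dim) for K candidate heads.
        K = head_vecs.size(0)
        features = torch.cat(
            [head_vecs, prep_vec.expand(K, -1), dep_vec.expand(K, -1)], dim=-1)
        scores = self.mlp(features).squeeze(-1)
        return torch.softmax(scores, dim=-1)  # p(h_k is head)

scorer = AttachmentScorer(lstm_dim=200)
p = scorer(torch.randn(4, 200), torch.randn(200), torch.randn(200))
k_hat = p.argmax()  # Eq. 3.4
```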


This model is inspired by the Head-Prep-Child-Ternary model of Belinkov et al. (2014). The main difference is that we replace the input features for each token with the output bi-RNN vectors.

We now describe the difference between the proposed and the baseline models. Generally, let lstm_in(t_i) and lstm_out(t_i) represent the input and output vectors of the bi-LSTM for each token t_i ∈ t in the sequence. The outputs at each timestep are obtained by concatenating those of the forward and backward LSTMs.

Baseline model. In the baseline model, we use type-level word embeddings to represent the input vector lstm_in(t_i) for a token t_i in the sequence. The word embedding parameters are initialized with pre-trained vectors, then tuned along with the parameters of the bi-LSTM and MLP_attach. We call this model LSTM-PP.

Proposed model. In the proposed model, we use token-level word embeddings as described in Section 3.4 as the input to the bi-LSTM, i.e., lstm_in(t_i) = u_i. The context used for the attention component is simply the hidden state from the previous timestep. However, since we use a bi-LSTM, the model essentially has two RNNs, and accordingly we have two context vectors and associated attentions. That is, context_f(i, t) = lstm_in(t_{i−1}) for the forward RNN and context_b(i, t) = lstm_in(t_{i+1}) for the backward RNN. Consequently, each token gets two representations, one from each RNN. The synset embedding parameters are initialized with pre-trained vectors and tuned along with the sense decay (λ_w) and MLP parameters from Eq. 3.2, the parameters of the bi-LSTM, and those of MLP_attach. We call this model OntoLSTM-PP.

3.5.3 Experiments

Dataset and evaluation. We used the English PP attachment dataset created and made available by Belinkov et al. (2014). The training and test splits contain 33,359 and 1,951 labeled examples, respectively. As explained in Section 3.5.1, the input for each example is 1) an ordered list of candidate head words, 2) the preposition, and 3) the direct dependent of the preposition. The head words are either nouns or verbs and the dependent is always a noun. All examples in this dataset have at least two candidate head words. As discussed in Belinkov et al. (2014), this dataset is a more realistic PP attachment task than the RRR dataset (Ratnaparkhi et al., 1994). The RRR dataset is a binary classification task with exactly two head word candidates in all examples. The context for each example in the RRR dataset is also limited, which defeats the purpose of our context-sensitive embeddings.

Model specifications and hyperparameters. For efficient implementation, we use mini-batch updates with the same number of senses and hypernyms for all examples, padding with zeros and truncating senses and hypernyms as needed. For each word type, we use a maximum of S senses and H indirect hypernyms from WordNet. In our initial experiments on a held-out development set (10% of the training data), we found that values greater than S = 3 and H = 5 did not improve performance. We also used the development set to tune the number of layers in MLP_attach separately for OntoLSTM-PP and LSTM-PP, and the number of layers in the attention MLP in OntoLSTM-PP.


System        Initialization   Embedding  Resources         Test Acc.
HPCD (full)   Syntactic-SG     Type       WordNet, VerbNet  88.7
LSTM-PP       GloVe            Type       -                 84.3
LSTM-PP       GloVe-retro      Type       WordNet           84.8
OntoLSTM-PP   GloVe-extended   Token      WordNet           89.7

Table 3.1: Prepositional Phrase Attachment Results with OntoLSTM

When a synset has multiple hypernym paths, we use the shortest one. Finally, word types which do not appear in WordNet are assumed to have one unique sense per word type, with no hypernyms. Since the POS tag for each word is included in the dataset, we exclude WordNet synsets which are incompatible with the POS tag. The synset embedding parameters are initialized using the synset vectors obtained by running AutoExtend (Rothe and Schutze, 2015) on 100-dimensional GloVe (Pennington et al., 2014) vectors for WordNet 3.1. We refer to this embedding as GloVe-extended. Representations for the OOV word types in LSTM-PP and OOV synset types in OntoLSTM-PP were randomly drawn from a uniform 100-dimensional distribution. Initial sense prior parameters (λ_w) were also drawn from a uniform 1-dimensional distribution.

Baselines. In our experiments, we compare our proposed model, OntoLSTM-PP, with three baselines: LSTM-PP initialized with GloVe embeddings, LSTM-PP initialized with GloVe vectors retrofitted to WordNet using the approach of Faruqui et al. (2015) (henceforth referred to as GloVe-retro), and finally the best performing standalone PP attachment system from Belinkov et al. (2014), referred to as HPCD (full) in that paper. HPCD (full) is a neural network model that learns to compose the vector representations of each of the candidate heads with those of the preposition and the dependent, and predicts attachments. The input representations are enriched using syntactic context information, POS, WordNet, and VerbNet (Kipper et al., 2008) information, and the distance of the head word from the PP is explicitly encoded in the composition architecture. In contrast, we do not use syntactic context, VerbNet, or distance information, and do not explicitly encode POS information.

PP Attachment Results

Table 3.1 shows that our proposed token-level embedding scheme OntoLSTM-PP outperforms the better variant of our baseline LSTM-PP (with GloVe-retro initialization) by an absolute accuracy difference of 4.9%, or a relative error reduction of 32%. OntoLSTM-PP also outperforms HPCD (full), the previous best result on this dataset. The HPCD (full) result is from the original paper, and it uses syntactic SkipGram embeddings. GloVe-retro is GloVe vectors retrofitted (Faruqui et al., 2015) to WordNet 3.1, and GloVe-extended refers to the synset embeddings obtained by running AutoExtend (Rothe and Schutze, 2015) on GloVe.

Initializing the word embeddings with GloVe-retro (which uses WordNet as described in Faruqui et al. (2015)) instead of GloVe amounts to a small improvement compared to the improvements obtained using OntoLSTM-PP. This result illustrates that our approach of dynamically choosing a context-sensitive distribution over synsets is a more effective way of making use of WordNet.


System              Full UAS  PPA Acc.
RBG                 94.17     88.51
RBG + HPCD (full)   94.19     89.59
RBG + LSTM-PP       94.14     86.35
RBG + OntoLSTM-PP   94.30     90.11
RBG + Oracle PP     94.60     98.97

Table 3.2: Results from the RBG dependency parser with features coming from various PP attachment predictors and oracle attachments.

Effect on dependency parsing. Following Belinkov et al. (2014), we used the RBG parser (Lei et al., 2014), and modified it by adding a binary feature indicating the PP attachment predictions from our model.

We compare four ways to compute the additional binary features: 1) the predictions of thebest standalone system HPCD (full) in Belinkov et al. (2014), 2) the predictions of our baselinemodel LSTM-PP, 3) the predictions of our improved model OntoLSTM-PP, and 4) the gold labelsOracle PP.

Table 3.2 shows the effect of using the PP attachment predictions as features within a dependency parser. We note there is a relatively small difference in unlabeled attachment accuracy for all dependencies (not only PP attachments), even when gold PP attachments are used as additional features to the parser. However, when gold PP attachments are used, we note a large potential improvement of 10.46 points in PP attachment accuracy (between the PPA accuracy for RBG and RBG + Oracle PP), which confirms that adding PP predictions as features is an effective approach. Our proposed model RBG + OntoLSTM-PP recovers 15% of this potential improvement, while RBG + HPCD (full) recovers 10%, which illustrates that PP attachment remains a difficult problem with plenty of room for improvement, even when using a dedicated model to predict PP attachments and using its predictions in a dependency parser.

We also note that, although we use the same predictions of the HPCD (full) model as Belinkov et al. (2014)5, we report different results than Belinkov et al. (2014). For example, the unlabeled attachment scores (UAS) of the baselines RBG and RBG + HPCD (full) are 94.17 and 94.19, respectively, in Table 3.2, compared to 93.96 and 94.05, respectively, in Belinkov et al. (2014). This is due to the use of different versions of the RBG parser.6

Analysis

We now analyze different aspects of our model in order to develop a better understanding of itsbehavior.

Effect of context sensitivity and sense priors. We now show some results that indicate the relative strengths of the two components of our context-sensitive token embedding model.

5The authors kindly provided their predictions for 1942 test examples (out of 1951 examples in the full test set).In Table 3.2, we use the same subset of 1942 test examples.

6We use the latest commit (SHA: e07f74) on the GitHub repository of the RBG parser.


Figure 3.5: Effect of training data size on test accuracies of OntoLSTM-PP and LSTM-PP.

Model           PPA Acc.
full            89.7
- sense priors  88.4
- attention     87.5

Table 3.3: Effect of removing sense priors and context sensitivity (attention) from the model.

The second row in Table 3.3 shows the test accuracy of a system trained without sense priors (that is, making p(s|w_i) from Eq. 3.1 a uniform distribution), and the third row shows the effect of making the token representations context-insensitive by giving a similar attention score to all related concepts, essentially making them type-level representations, but still grounded in WordNet. As can be seen, removing context sensitivity has an adverse effect on the results. This illustrates the importance of the sense priors and the attention mechanism.

It is interesting that, even without sense priors and attention, the results with WordNet grounding are still higher than those of the two LSTM-PP systems in Table 3.1. This result illustrates the regularization behavior of sharing concept embeddings across multiple words, which is especially important for rare words.

Effect of training data size. Since OntoLSTM-PP uses external information, the gap between the model and LSTM-PP is expected to be more pronounced when the training data sizes are smaller. To test this hypothesis, we trained the two models with different amounts of training data and measured their accuracies on the test set. The plot is shown in Figure 3.5. As expected, the gap tends to be larger at smaller data sizes. Surprisingly, even with 2000 sentences in the training dataset, OntoLSTM-PP outperforms LSTM-PP trained with the full dataset. When both models are trained with the full dataset, LSTM-PP reaches a training accuracy of 95.3%, whereas OntoLSTM-PP reaches 93.5%. The fact that LSTM-PP overfits the training data more indicates the regularization capability of OntoLSTM-PP.

Qualitative analysis. To better understand the effect of WordNet grounding, we took a sample of 100 sentences from the test set whose PP attachments were correctly predicted by OntoLSTM-PP but not by LSTM-PP.


Figure 3.6: Visualization of soft attention weights assigned by OntoLSTM to various synsets inPreposition Phrase Attachment

A common pattern observed was that those sentences contained words not seen frequently in the training data. Figure 3.6 shows two such cases where OntoLSTM-PP predicts the head correctly and LSTM-PP does not, along with the weights assigned by OntoLSTM-PP to synsets that contribute to the token representations of infrequent word types. The prepositions are shown in bold, LSTM-PP's predictions in red, and OntoLSTM-PP's predictions in green. Words that are not candidate heads or dependents are shown in brackets. In both cases, the weights assigned by OntoLSTM-PP to infrequent words are also shown. The word types soapsuds and buoyancy do not occur in the training data, but OntoLSTM-PP was able to leverage the parameters learned for the synsets that contributed to their token representations. Another important observation is that the word type buoyancy has four senses in WordNet (we consider the first three), none of which is the metaphorical sense that is applicable to markets, as in the example here. Selecting a combination of relevant hypernyms from various senses may have helped OntoLSTM-PP make the right prediction. This shows the value of using hypernymy information from WordNet. Moreover, this indicates the strength of the hybrid nature of the model, which lets it augment ontological information with distributional information.

Parameter space. We note that the vocabulary sizes in OntoLSTM-PP and LSTM-PP are com-parable as the synset types are shared across word types. In our experiments with the full PPattachment dataset, we learned embeddings for 18k synset types with OntoLSTM-PP and 11kword types with LSTM-PP. Since the biggest contribution to the parameter space comes fromthe embedding layer, the complexities of both the models are comparable.


3.6 Textual Entailment

We evaluate the capability of OntoLSTM to learn useful generalizations of concepts by testing it on the task of textual entailment. We use the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015).

3.6.1 Data preprocessing

Since the dataset, unlike the PP attachment dataset, does not contain POS tag information, we run Stanford's POS tagger (Toutanova et al., 2003) on it. To construct the tensor representation, we map each POS-tagged word to the first two synsets in WordNet (i.e., S = 2), and extract the five most direct hypernyms (i.e., H = 5). When a synset has multiple hypernym paths, we use the shortest one. In preliminary experiments, we found that using more word senses or more hypernyms per word sense does not improve performance. As with the PPA dataset, words which do not appear in WordNet are assumed to have one unique synset per word type, with no hypernyms.
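The following sketch shows this preprocessing with NLTK; the PTB-to-WordNet POS mapping and the naming of the OOV placeholder sense are assumptions made for illustration.

```python
from nltk.corpus import wordnet as wn

# Assumed mapping from PTB tag prefixes to WordNet POS categories.
POS_MAP = {"NN": wn.NOUN, "VB": wn.VERB, "JJ": wn.ADJ, "RB": wn.ADV}

def ground_token(word, ptb_tag, max_senses=2, max_hypernyms=5):
    """Map a POS-tagged token to at most S=2 senses and H=5 hypernyms each.
    OOV words get a single unique placeholder sense with no hypernyms."""
    pos = POS_MAP.get(ptb_tag[:2])
    synsets = wn.synsets(word, pos=pos)[:max_senses] if pos else []
    if not synsets:
        return {word + ".unk.01": []}
    grounding = {}
    for s in synsets:
        path = min(s.hypernym_paths(), key=len)  # shortest hypernym path
        # Most direct hypernyms first (the synset itself is excluded here).
        hypernyms = [h.name() for h in reversed(path[:-1])][:max_hypernyms]
        grounding[s.name()] = hypernyms
    return grounding

print(ground_token("pool", "NN"))
```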

3.6.2 Problem Definition

Given a pair of sentences (premise and hypothesis), the model predicts one of three possible relationships between the two sentences: entailment, contradiction, or neutral. We use the standard train, development, and test splits of the SNLI corpus, which consist of 549K, 10K, and 10K labeled sentence pairs, respectively. For example, the following sentence pair is labeled entailment:

• Premise: Children and parents are splashing water at the pool.
• Hypothesis: Families are playing outside.

3.6.3 Model Definition

Following Bowman et al. (2016), our model for textual entailment uses two LSTMs with tied parameters to read the premise and the hypothesis of each example, then computes the vector h which summarizes the sentence pair as the following concatenation (originally proposed by Mou et al. (2016)):

h = [h_pre; h_hyp; h_pre − h_hyp; h_pre ∗ h_hyp]

where h_pre is the output hidden layer at the end of the premise token sequence and h_hyp is the output hidden layer at the end of the hypothesis token sequence. The summary h then feeds into two fully connected ReLU layers of size 1024, followed by a softmax layer of size 3 to predict the label. The word embeddings and concept embeddings are of size 300, and the hidden layers of size 150.

The proposed model uses the WordNet-grounded token embeddings, and we call it OntoLSTM-TE. We compare it with two baselines: the first is a simpler variant of OntoLSTM-TE that does not use attention over synsets, which we call OntoLSTM-TEuni; the second uses type-level word representations, and we call it LSTM-TE.


Model                  GloVe  Train  Test
LSTM-TE                No     90.5   79.7
OntoLSTM-TEuni         No     90.6   81.0
OntoLSTM-TE            No     90.5   81.1
LSTM (Bowman et al.)   Yes    83.9   80.6

Table 3.4: Results of OntoLSTM on the SNLI dataset

Figure 3.7: Visualization of soft attention weights assigned by OntoLSTM to various synsets inTextual Entailment

The models are trained to maximize the log-likelihood of correct labels in the training data, using ADAM (Kingma and Ba, 2014) with early stopping (up to twenty epochs).

3.6.4 Experiments

Table 3.4 shows the classification results. We also report previously published results of a similar neural architecture as an extra baseline: the 300-dimensional LSTM RNN encoder model in Bowman et al. (2016).7 They use GloVe as pre-trained word embeddings, while we jointly learn word/concept embeddings in the same model. Bowman et al. (2016)'s LSTM outperforms our LSTM-TE model by 0.9 absolute points on the test set. This may be due to the difference in input representations. Since we learn the synset representations in OntoLSTM-TE, comparison with our LSTM implementations is more sensible. OntoLSTM-TE outperforms LSTM-TE by 1.4 absolute points on the test set, illustrating the utility of the ontology-aware word representations in a controlled experiment.

3.6.5 Analysis

Generalization attention: Fig. 3.7 shows one example from the test set of the SNLI dataset where hypernymy information is useful. The LSTM-TE model shares no parameters between the lexical representations of book and manuals, and fails to predict the correct relationship (i.e., entailment). However, OntoLSTM-TE leverages the fact that book and manuals share the common hypernyms book.n.01 and publication.n.01, and assigns relatively high attention probabilities to these hypernyms, resulting in similar lexical representations of the two words, and a correct prediction of this example.

7We note this is not the state of the art on this dataset; it is the simplest sentence-encoding-based model for this task.

The following is an example that LSTM-TE gets right and OntoLSTM-TE does not:

• Premise: Women of India performing with blue streamers, in beautiful blue costumes.
• Hypothesis: Indian women perform together in gorgeous costumes.

Indian in the second sentence is an adjective, and India in the first is a noun. By design, WordNet does not link words of different parts of speech. Moreover, the adjective hierarchies in WordNet are very shallow, and gorgeous and beautiful belong to two different synsets, both without any hypernyms. Hence OntoLSTM-TE could not use any generalization information for this example.

Parameter space: As mentioned earlier in the case of the PP attachment problem, the model sizes of LSTM-TE and both variants of OntoLSTM-TE are comparable (11.1M and 12.7M parameters, respectively).

3.7 Conclusion

In this chapter, we showed a grounding of lexical items which acknowledges the semantic ambiguity of word types, using WordNet, and a method to learn a context-sensitive distribution over their representations. We also showed how to integrate the proposed representation with recurrent neural networks for modeling sentences, showing that the proposed WordNet-grounded context-sensitive token embeddings outperform standard type-level embeddings for predicting PP attachments and textual entailment.

We encoded a specific kind of background knowledge using WordNet in this work, but the methods described here can be extended to other kinds of structured knowledge sources as well. For example, one can effectively encode entities in text using structured information from Freebase. Such a model may help tasks like factual question answering.

One potential issue with the approach suggested in this chapter is linking ambiguity: that is, deciding which entities in the background knowledge base, or concepts in the background ontology, can be linked to the tokens in the input utterance. In this chapter, we did not have to deal with this issue because linking words to WordNet concepts is a trivial process. However, when extending this method to other kinds of knowledge bases, one may need to perform explicit entity or concept linking.


Part II

Reasoning over Contextual Knowledge


Chapter 4

Learning Semantic Parsers over Question Answer Pairs

In this chapter, we turn to the problem of reasoning over structured contexts, towards comprehending compositional language. We take a semantic parsing approach to question answering over (semi-)structured domains, and describe an encoder-decoder framework where we set explicit type constraints on the decoder. Following the overarching theme of this thesis, we seek to build end-to-end models for these QA problems as well. Accordingly, we assume that the only direct supervision we get comes from the answers at the end (i.e., our semantic parser is weakly supervised).

4.1 Type-Constrained Neural Semantic Parser

This section describes our semantic parsing model. The input to our model is a natural languagequestion and a context in which it is to be answered. The model predicts the answer to thequestion by semantically parsing it to a logical form then executing it against the context.

Figure 4.1: Overview of our type-constrained neural semantic parsing model


Our model follows an encoder-decoder architecture, using recurrent neural networks withLong Short Term Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997). The input ques-tion and table entities are first encoded as vectors that are then decoded into a logical form (Figure4.1). We make two significant additions to the standard encoder-decoder architecture. First, theencoder includes a special entity embedding and linking module that produces a link embeddingfor each question token that represents the table entities it links to (Section 4.1.1). These linkembeddings are concatenated with word embeddings for each question token, then encoded witha bidirectional LSTM. Second, the action space of the decoder is defined by a type-constrainedgrammar which guarantees that generated logical forms satisfy type constraints (Section 4.1.2).The decoder architecture is simply an LSTM with attention that predicts a sequence of generationactions within this grammar.

We train the parser using question-answer pairs as supervision, using an objective based onenumerating logical forms via dynamic programming on denotations (Pasupat and Liang, 2016)(Section 4.2.2). This objective function makes it possible to train neural models with question-answer supervision, which is otherwise difficult for efficiency and gradient variance reasons.

4.1.1 Encoder

The encoder network is a standard bidirectional LSTM augmented with an entity embedding andlinking module.

Notation. Throughout this section, we denote entities as e, and their corresponding types as τ(e). The i-th token in a question is denoted q_i. We use v_w to denote a learned vector representation (embedding) of word w; e.g., v_{q_i} denotes the vector representation of the i-th question token. Finally, we denote the set of all entities as E, and all entities belonging to a type τ as E_τ. The entities E include all of the entities from the table, as well as numeric entities detected in the question by NER.

Entity Embedding. The encoder first constructs an embedding for each entity in the knowledge graph given its type and position in the graph. Let W(e) denote the set of words in the name of entity e and N(e) the neighbors of entity e in the knowledge graph. Each entity's embedding r_e is a nonlinear projection of a type vector v_{τ(e)} and a neighbor vector v_{N(e)}:

v_{N(e)} = Σ_{e′ ∈ N(e)} Σ_{w ∈ W(e′)} v_w  (4.1)

r_e = tanh(P_τ v_{τ(e)} + P_N v_{N(e)})  (4.2)

The type vector v_{τ(e)} is a one-hot vector for τ(e), with dimension equal to the number of entity types in the grammar. The neighbor vector v_{N(e)} is simply an average of the word vectors in the names of e's neighbors. P_τ and P_N are learned parameter matrices for combining these two vectors.


Entity Linking. This module generates a link embedding l_i for each question token, representing the entities it links to. The first part of this module generates an entity linking score s(e, i) for each entity e and token index i:

s(e, i) = max_{w ∈ W(e)} v_w^⊤ v_{q_i} + ψ^⊤ φ(e, i)  (4.3)

This score has two terms. The first represents similarity in word embedding space between the token and the entity name, computed as a function of the embeddings of words in W(e) and the word embedding of the i-th token, v_{q_i}. The second represents a linear classifier with parameters ψ on features φ(e, i). The feature function φ contains only a few features: exact token match, lemma match, edit distance, an NER indicator feature, and a bias feature. It also includes token and lemma features coming from the neighbors of the node that originally matches the token. We found that features were an effective way to address sparsity in the entity name tokens, many of which appear too infrequently to learn embeddings for.

Finally, the entity embeddings and linking scores are combined to produce a link embedding for each token. The scores s(e, i) are fed into a softmax layer over all entities e of the same type, and the link embedding l_i is an average of entity vectors r_e weighted by the resulting distribution. We include a null entity, ∅, in each softmax layer to permit the model to identify tokens that do not refer to an entity. The null entity's embedding is the all-zero vector and its score s(∅, ·) = 0.

p(e | i, τ) = exp s(e, i) / Σ_{e′ ∈ E_τ ∪ {∅}} exp s(e′, i)  (4.4)

l_i = Σ_τ Σ_{e ∈ E_τ} r_e p(e | i, τ)  (4.5)
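The following sketch computes Eqs. 4.4 and 4.5 for a single token; the explicit loop over types and the list-based type index are illustrative simplifications.

```python
import torch

def link_embedding(linking_scores, entity_vecs, type_of_entity, types):
    """Sketch of Eqs. 4.4-4.5: a per-type softmax over entity linking scores,
    with a zero-scored null entity, produces the link embedding l_i for one
    token. linking_scores: (E,) values of s(e, i); entity_vecs: (E, dim)."""
    l_i = torch.zeros(entity_vecs.size(1))
    for tau in types:
        idx = [k for k, t in enumerate(type_of_entity) if t == tau]
        # Append the null entity's score s(null, .) = 0 before the softmax.
        scores = torch.cat([linking_scores[idx], torch.zeros(1)])
        p = torch.softmax(scores, dim=0)[:-1]  # drop the null entity's mass
        l_i += (p.unsqueeze(1) * entity_vecs[idx]).sum(dim=0)
    return l_i

l = link_embedding(torch.randn(5), torch.randn(5, 50),
                   ["row", "row", "cell", "cell", "cell"], {"row", "cell"})
```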

Bidirectional LSTM. We concatenate the link embedding l_i and the word embedding v_{q_i} of each token in the question, and feed them into a bidirectional LSTM:

x_i = [l_i; v_{q_i}]  (4.6)

(o_i^f, f_i) = LSTM(f_{i−1}, x_i)  (4.7)

(o_i^b, b_i) = LSTM(b_{i+1}, x_i)  (4.8)

o_i = [o_i^f; o_i^b]  (4.9)

This process produces an encoded vector representation of each token, o_i. The final LSTM hidden states f_n and b_{−1} are concatenated and used to initialize the decoder.

4.1.2 Decoder

The decoder is an LSTM with attention that selects parsing actions from a grammar over well-typed logical forms.


Figure 4.2: The derivation of a logical form using the type-constrained grammar. The holes inthe left column have empty scope, while holes in the right column have scope Γ = {(x : r)}

Type-Constrained Grammar. The parser maintains a state at each step of decoding that con-sists of a logical form with typed holes. A hole is a tuple [τ,Γ] of a type τ and a scope Γ thatcontains typed variable bindings, (x : α) ∈ Γ. The scope is used to store and generate the argu-ments of lambda expressions. The grammar consists of a collection of four kinds of productionrules on holes:

1. Application [τ, Γ] → ([〈β, τ〉, Γ] [β, Γ]) rewrites a hole of type τ by applying a function from β to τ to an argument of type β. We also permit applications with more than one argument.

2. Constant [τ, Γ] → const, where constant const has type τ. This rule generates both context-independent operations such as argmax and context-specific entities such as united states.

3. Lambda [〈α, τ〉, Γ] → λx. [τ, Γ ∪ {(x : α)}] generates a lambda expression where the argument has type α. x is a fresh variable name.

4. Variable [τ, Γ] → x, where (x : τ) ∈ Γ. This rule generates a variable bound in a previously-generated lambda expression that is currently in scope.

We instantiate each of the four rules above by replacing the type variables τ, α, β with concrete types, producing, e.g., [c, Γ] → ([〈r, c〉, Γ] [r, Γ]) from the application rule. The set of instantiated rules is automatically derived from a corpus of logical forms, which we in turn produce by running dynamic programming on denotations (see Section 4.2.2). Every logical form can be derived in exactly one way using the four kinds of rules above; this derivation is combined with the (automatically-assigned) type of each of the logical form's subexpressions to instantiate the type variables in each rule. We then filter out constant rules that generate context-specific entities (which are handled specially by the decoder) to produce a context-independent grammar.
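The following Python sketch illustrates typed holes and the application and lambda rules; the string encoding of functional types (with naive parsing that would not handle nesting) and the variable-naming scheme are simplifying assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hole:
    """A typed hole [tau, Gamma]: a type plus a scope of (variable, type) bindings."""
    type: str
    scope: tuple = ()

def apply_rule(hole, arg_type):
    # Application: [tau, G] -> ([<beta, tau>, G] [beta, G])
    return [Hole(f"<{arg_type},{hole.type}>", hole.scope),
            Hole(arg_type, hole.scope)]

def lambda_rule(hole):
    # Lambda: [<alpha, tau>, G] -> lambda x. [tau, G + {(x: alpha)}]
    alpha, tau = hole.type[1:-1].split(",", 1)  # naive, non-nested parsing
    var = f"x{len(hole.scope)}"  # fresh variable name
    return var, Hole(tau, hole.scope + ((var, alpha),))

def variable_actions(hole):
    # Variable: [tau, G] -> x for every (x: tau) in the scope
    return [x for x, t in hole.scope if t == hole.type]

root = Hole("c")                           # root type: cells
fn_hole, arg_hole = apply_rule(root, "r")  # ([<r,c>, {}] [r, {}])
var, body = lambda_rule(fn_hole)           # lambda x0. [c, {(x0: r)}]
print(variable_actions(Hole("r", body.scope)))  # -> ['x0']
```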

The first action of the parser is to predict a root type for the logical form, and then decoding proceeds according to the production rules above.


Each time step of decoding fills the leftmost hole in the logical form, and decoding terminates when no holes remain. Figure 4.2 shows the sequence of decoder actions used to generate a logical form.

Network Architecture. The decoder is an LSTM that outputs a distribution over grammar actions using an attention mechanism over the encoded question tokens. The decoder also uses a copy-like mechanism on the entity linking scores to generate entities. Say that, during the j-th time step, the current hole has type τ. The decoder generates a score for each grammar action whose left-hand side is τ using the following equations:

(y_j, h_j) = LSTM(h_{j−1}, [g_{j−1}; o_{j−1}])  (4.10)

a_j = softmax(O W^a y_j)  (4.11)

o_j = (a_j)^⊤ O  (4.12)

s_j = W²_τ ReLU(W¹ [y_j; o_j] + b¹) + b²_τ  (4.13)

s_j(e_k) = Σ_i s(e_k, i) a_i^j  (4.14)

p_j = softmax([s_j; s_j(e_1); s_j(e_2); . . .])  (4.15)

The input to the LSTM, g_{j-1}, is a grammar action embedding for the action chosen in the previous time step. g_0 is a learned parameter vector, and h_0 is the concatenation of the hidden states of the encoder LSTMs. The matrix O contains the encoded token vectors o_1, ..., o_n from the encoder. Equations (4.10) to (4.12) above perform a softmax attention over O using a learned parameter matrix W^a. Equation (4.13) generates scores s_j for the context-independent grammar rules applicable to type τ using a multilayer perceptron with weights W^1, b^1, W^2_τ, b^2_τ. Equation (4.14) generates a score for each entity e_k with type τ by averaging the entity linking scores with the current attention a_j. Finally, the context-independent and context-dependent scores are concatenated and softmaxed in Equation (4.15) to produce a probability distribution p_j over grammar actions. If a context-independent action is chosen, g_j is a learned parameter vector for that action. Otherwise g_j = g_τ, which is a learned parameter representing the selection of an entity with type τ.
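A minimal numpy sketch of one decoding step, following equations (4.10)-(4.15), is given below. The lstm_step helper, the weight shapes, and the variable names are illustrative assumptions; any LSTM cell returning an output and a new state would fit.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(lstm_step, h_prev, g_prev, o_prev, O, W_a,
                 W1, b1, W2_tau, b2_tau, linking_scores):
    """One decoding step. O: (n, d) encoded question tokens;
    linking_scores: (num_entities, n), with linking_scores[k, i] = s(e_k, i).
    lstm_step(state, input) -> (output, new_state) is a stand-in LSTM cell."""
    y, h = lstm_step(h_prev, np.concatenate([g_prev, o_prev]))              # (4.10)
    a = softmax(O @ (W_a @ y))                                              # (4.11)
    o = a @ O                                                               # (4.12)
    s = W2_tau @ np.maximum(0, W1 @ np.concatenate([y, o]) + b1) + b2_tau   # (4.13)
    s_ent = linking_scores @ a                                              # (4.14)
    p = softmax(np.concatenate([s, s_ent]))                                 # (4.15)
    return p, h, o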

4.2 Experiments with WIKITABLEQUESTIONS

We now describe an application of our framework to WIKITABLEQUESTIONS, a challenging dataset for question answering against semi-structured Wikipedia tables (Pasupat and Liang, 2015). An example question from this dataset is shown in Figure 1.1.


4.2.1 Context and Logical Form Representation

We follow Pasupat and Liang (2015) in using the same table structure representation and λ-DCS language for expressing logical forms. In this representation, tables are expressed as knowledge graphs over 6 types of entities: cells, cell parts, rows, columns, numbers and dates. Each entity also has a name, which is typically a string value in the table. Our parser uses both the entity names and the knowledge graph structure to construct embeddings for each entity. Specifically, the neighbors of a column are the cells it contains, and the neighbors of a cell are the columns it belongs to.

The logical form language consists of a collection of named sets and entities, along with operators on them. The named sets are used to select table cells, e.g., united states is the set of cells that contain the text "united states". The operators include functions from sets to sets, e.g., the next operator maps a row to the next row. Columns are treated as functions from cells to their rows, e.g., (country united states) generates the rows whose country column contains "united states". Other operators include reversing relations (e.g., in order to map rows to cells in a certain column), relations that interpret cells as numbers and dates, and set and arithmetic operations. The language also includes aggregation and quantification operations such as count and argmax, along with λ abstractions that can be used to join binary relations.

Our parser also assigns a type to every λ-DCS expression, which is used to enforce type constraints on generated logical forms. The base types are cells c, parts p, rows r, numbers i, and dates d. Columns such as country have the functional type ⟨c, r⟩, representing functions from cells c to rows r. Other operations have more complex functional types, e.g., reverse has type ⟨⟨c, r⟩, ⟨r, c⟩⟩, which enables us to write (reverse country).1 The parser assigns every λ-DCS constant a type, then applies standard programming language type inference algorithms (Pierce, 2002) to automatically assign types to larger expressions.
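As an illustration of how types propagate through applications, the following is a minimal sketch of bottom-up type assignment, assuming types are either base strings or ("fun", arg, ret) tuples. The constant type table is illustrative, and reverse is given one concrete instance of its polymorphic type, matching the footnote below.

CONST_TYPES = {
    "country": ("fun", "c", "r"),                              # column: cells -> rows
    "reverse": ("fun", ("fun", "c", "r"), ("fun", "r", "c")),  # one concrete instance
    "united_states": "c",
}

def infer_type(expr):
    """expr is a constant name or a two-element list (f, arg) denoting application."""
    if isinstance(expr, str):
        return CONST_TYPES[expr]
    f, arg = expr
    f_type, arg_type = infer_type(f), infer_type(arg)
    assert f_type[0] == "fun" and f_type[1] == arg_type, "type error"
    return f_type[2]

print(infer_type(["reverse", "country"]))        # ("fun", "r", "c"): rows -> cells
print(infer_type(["country", "united_states"]))  # "r": a set of rows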

4.2.2 DPD Training

Our parser is trained from question-answer pairs, treating logical forms as a latent variable. We use a new loglikelihood objective function for this process that first automatically enumerates a set of correct logical forms for each example, then trains on these logical forms. This objective simplifies the search problem during training and is well-suited to training our neural model.

The training data consists of a collection of n question-answer-table triples, {(q^i, a^i, T^i)}_{i=1}^n. We first run dynamic programming on denotations (Pasupat and Liang, 2016) on each table T^i and answer a^i to generate a set of logical forms ℓ ∈ L^i that execute to the correct answer.

Dynamic programming on denotations. DPD is an automatic procedure for enumerating logical forms that execute to produce a particular value; it leverages the observation that there are fewer denotations than logical forms to enumerate this set relatively efficiently. It is a kind of chart parsing algorithm where partial parses are grouped together in cells by their output category, size, and denotation. This is an improvement over the previous work by Pasupat and Liang (2015), where the logical forms were grouped based on only the output category and size. Given this chart, DPD performs two forward passes. In the first pass, the algorithm finds all the cells that lead to the correct denotation. In the second pass, all the rule combinations which lead to the cell corresponding to the correct denotation are listed, and logical forms are generated from those cells, using only the rule combinations that are known to lead to the final cells.

1 Technically, reverse has the parametric polymorphic type ⟨⟨α, β⟩, ⟨β, α⟩⟩, where α and β are type variables that can be any type. This type allows reverse to reverse any function. However, this is a detail that can largely be ignored. We only use parametric polymorphism when typing logical forms to generate the type-constrained grammar; the grammar itself does not have type variables, but rather a fixed number of concrete instances, such as ⟨⟨c, r⟩, ⟨r, c⟩⟩, of the above polymorphic type.

Given the logical forms generated by DPD, we train the model with the following objective:

O(θ) = Σ_{i=1}^{n} log Σ_{ℓ ∈ L^i} P(ℓ | q^i, T^i; θ)             (4.16)

We optimize this objective function using stochastic gradient descent. If |L^i| is small, e.g., 5-10, the gradient of the i-th example can be computed exactly by simply replicating the parser's network architecture |L^i| times, once per logical form. (Note that this takes advantage of the property that each logical form has a unique derivation in the decoder's grammar.) However, |L^i| often contains many thousands of logical forms, which makes the above computation infeasible. We address this problem by truncating L^i to the m = 100 shortest logical forms, then using a beam search with a beam of k = 5 to approximate the sum.
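Below is a minimal sketch of the truncated marginal loglikelihood for one example. It is a simplification: logprob_fn is a stand-in for running the decoder over a logical form's unique derivation, and the real beam search operates over decoder actions rather than the post-hoc top-k used here.

import numpy as np

def example_log_likelihood(logprob_fn, logical_forms, m=100, k=5):
    """Approximate log Σ_{ℓ in L_i} P(ℓ | q_i, T_i; θ) for one example."""
    shortest = sorted(logical_forms, key=len)[:m]   # truncate L_i to m shortest forms
    logps = np.array([logprob_fn(l) for l in shortest])
    top_k = np.sort(logps)[-k:]                     # crude beam approximation
    return np.logaddexp.reduce(top_k)               # log of the (partial) sum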

We briefly contrast this objective function with two other commonly-used approaches. The first approach is commonly used in prior semantic parsing work with loglinear models (Liang et al., 2011; Pasupat and Liang, 2015) and uses a similar loglikelihood objective. The gradient computation for this objective requires running a wide beam search, generating, e.g., 300 logical forms, executing each one to identify which are correct, then backpropagating through a term for each. This process would be very expensive with a neural model due to the cost of each backpropagation pass. Another approach is to train the network with REINFORCE (Williams, 1992), which essentially samples a logical form instead of using beam search. This approach is known to be difficult to apply when the space of latent variables is large and the reward signal sparse, as it is in semantic parsing. Our objective improves on these by precomputing correct logical forms in order to avoid searching for them during the gradient computation.

4.2.3 Experimental Setup

We used the standard train/test splits of WIKITABLEQUESTIONS. The training set consists of 14,152 examples and the test set consists of 4,344 examples. The training set comes divided into 5 cross-validation folds for development using an 80/20 split. All data sets are constructed so that the development and test tables are not present in the training set. We report question answering accuracy measured using the official evaluation script, which performs some simple normalization of numbers, dates, and strings before comparing predictions and answers. When generating answers from a model's predictions, we skip logical forms that do not execute (which may occur for some baseline models) or answer with the empty string (which is never correct). All reported accuracy numbers are an average of 5 parsers, each trained on one training fold, using the respective development set to perform early stopping.


Model                        Ensemble Size   Dev.   Test
Neelakantan et al. (2016)    1               34.1   34.2
Haug et al. (2017)           1               -      34.8
Pasupat and Liang (2015)     1               37.0   37.1
Neelakantan et al. (2016)    15              37.5   37.7
Haug et al. (2017)           15              -      38.7
Our Parser                   1               42.7   43.3
Our Parser                   5               -      45.9

Table 4.1: Development and test set accuracy of our semantic parser compared to prior work on WIKITABLEQUESTIONS

We trained our parser with 20 epochs of stochastic gradient descent. We used 200-dimensional word embeddings for the question and entity tokens, mapping all tokens that occurred < 3 times in the training questions to UNK. (We tried using a larger vocabulary that included frequent tokens in tables, but this caused the parser to seriously overfit the training examples.) The hidden and output dimensions of the forward/backward encoder LSTMs were set to 100, such that the concatenated representations were also 200-dimensional. The decoder LSTM uses 100-dimensional action embeddings and has a 200-dimensional hidden state and output. The action selection MLP has a hidden layer dimension of 100. We used a dropout probability of 0.5 on the output of both the encoder and decoder LSTMs, as well as on the hidden layer of the action selection MLP. At test time, we decode with a beam size of 10.
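For reference, the stated hyperparameters are collected below as a configuration sketch; the key names are ours, not taken from any released implementation.

CONFIG = {
    "epochs": 20,
    "optimizer": "sgd",
    "word_embedding_dim": 200,          # question and entity tokens
    "min_token_count": 3,               # rarer tokens mapped to UNK
    "encoder_lstm_dim": 100,            # per direction; concatenated -> 200
    "decoder_action_embedding_dim": 100,
    "decoder_lstm_dim": 200,            # hidden state and output
    "action_mlp_hidden_dim": 100,
    "dropout": 0.5,                     # encoder/decoder outputs and MLP hidden layer
    "test_beam_size": 10,
}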

4.2.4 Results

Table 4.1 compares the accuracy of our semantic parser to prior work on WIKITABLEQUESTIONS. We distinguish between single models and ensembles, as we expect ensembling to improve accuracy, but not all prior work has used it. Prior work on this data set includes a loglinear semantic parser (Pasupat and Liang, 2015), that same parser with a neural, paraphrase-based reranker (Haug et al., 2017), and a neural programmer that answers questions by predicting a sequence of table operations (Neelakantan et al., 2016). We find that our parser outperforms the best prior result on this data set by 4.6%, even though that prior result used a 15-model ensemble. An ensemble of 5 parsers, one per training fold, improves accuracy by an additional 2.6%, for a total improvement of 7.2%.

4.3 Proposed Work

The initial results with our framework on a difficult dataset show that this is a promising direction for building weakly supervised semantic parsers for reasoning over structured or semi-structured domains. We now describe the directions for proposed work in this chapter.


4.3.1 Handling more operations and improved entity linking

While the results in Table 4.1 show that the proposed framework outperforms previous systems, the absolute accuracy of the system shows that there is still room for improvement. The following changes could further improve performance on this dataset.

Our entity embedding and linking module described in Section 4.1.1 is fairly simple, and does not account for paraphrases of predicates seen in questions (like "go to" referring to "attended" in the context of attending an academic institution) or abbreviations used for entities (like "DNQ" for "did not qualify").

Moreover, on close examination of the errors made by the parser on a sample of 100 questions from the WIKITABLEQUESTIONS dataset, we found that a major class of errors (26.5%) involves questions that place explicit set membership constraints on outputs. Following are two examples of such questions:
• In 2008, in track and field events, who broke more world records, Usain Bolt or Haile Gebrselassie?
• What vehicle maker other than Dodge has the most vehicles in the roster?
Both questions above specify sets of entities, and the parser should operate on those sets instead of all the entities in the context. One way of achieving this is by introducing set membership operations in the grammar.

4.3.2 Extending to datasets with weaker supervision

The method described in this chapter relies strongly on DPD. An assumption we make here is that supervision in the form of the right answer is sufficient to enumerate the set of correct logical forms. However, this may not be true in all cases. For example, the Cornell Natural Language Visual Reasoning (NLVR)2 (Suhr et al., 2017) dataset is a set of natural language utterances grounded in (synthetically generated) images, and the task is to answer whether the utterance is true or false. Figure 4.3 shows an example. Since the supervision is available only in the form of a binary label, it may be significantly more challenging to run DPD on this dataset. We expect the output thus generated will contain a very large number of spurious logical forms (those that are incorrect, but execute to the correct denotation), and hence will affect our training process adversely.

To address this problem, we will investigate constraining DPD to generate logical forms grounded in the input utterance. As described in Section 4.2.2, DPD consists of two passes, and the second pass involves using all rule combinations that lead to the cells with correct denotations. One possibility to constrain this process is to score the rule combinations conditioned on the input utterance.

2 http://lic.nlp.cornell.edu/nlvr/


Figure 4.3: Example from NLVR Dataset


Chapter 5

Proposed Work: Context Retrieval for Memory Networks

We consider the problem of reasoning over unstructured contexts in this chapter, and propose to build a model that can jointly retrieve and reason over them.

5.1 Memory Networks

Memory Networks (MemNets) are a class of models that combine inference with long-term memory. Unlike Recurrent Neural Networks (RNNs) that model language (Mikolov), and their variants with Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), MemNets have an explicit memory component with read and write functions. While the original MemNet model proposed by Weston et al. (2014), MemNN, required explicit supervision for selecting the relevant parts of the memory, Sukhbaatar et al. (2015b) proposed an end-to-end variant (MemN2N) where the memory selection component is trained jointly with the rest of the network. These models were previously used for answering questions that require reasoning over multiple context sentences, both in simulated (Bordes et al.) and large-scale (Fader et al.) scenarios.

In this chapter, we use the term memory network, or the abbreviation MemNet, to refer to any neural network model with an explicit memory component that can be read from or written to. Our focus is mostly on the general class of models. Wherever necessary, we refer to the original memory network model proposed by Weston et al. (2014) as MemNN and the end-to-end model by Sukhbaatar et al. (2015b) as MemN2N.

Figure 5.1 shows the setup of a generic MemNet. It takes as inputs:
• a set of N context sentences, indexed as {context_i}_{i=1}^N, such that context_i is a vector containing the indices of the words in the i-th sentence;
• a query, a single sentence indexed in the same way as the context sentences.
The sentences are then encoded using the EncodeContext and EncodeQuery functions to produce the matrix C and the vector q respectively.
The memory network then reasons over the context, represented by C, given the query, represented by q, over multiple memory layers, corresponding to multiple hops i ∈ {1, . . . , h}. At each hop, a memory layer performs two actions:


Figure 5.1: Schematic showing a generic view of memory networks

• ReadMemory gets as input either the context (C) (in the first hop) or an updated memory representation (M_i) (in subsequent hops), and the query representation (q), to produce a summary of the memory, S_{i+1}.

• UpdateMemory produces an updated memory representation (M_{i+1}) given the summary (S_i) and the query (q).

Finally, an answer is predicted by passing the query encoding q and the memory state at the last hop (M_h) to the Predict function.
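The generic interface can be summarized with the following minimal Python sketch; the encode, read, update, and predict functions are abstract and would be supplied by a concrete model, and the choice of initial memory state is model-specific (here it is taken to be the context encoding, following the description above).

def memnet_forward(context, query, encode_context, encode_query,
                   read_memory, update_memory, predict, hops):
    C = encode_context(context)       # (N, d) encoded context sentences
    q = encode_query(query)           # (d,) encoded query
    M = C                             # first-hop memory is the context encoding
    for _ in range(hops):
        S = read_memory(q, M, C)      # summarize memory given the query
        M = update_memory(q, S, C)    # produce the next memory state
    return predict(q, M)              # answer distribution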

It can be seen that MemN2N fits into this setup with the following configuration:

EncodeQuery(query) = BOWEncoder_q(query) ∈ R^d                     (5.1)

EncodeContext(context) = BOWEncoder_c(context) ∈ R^{N×d}           (5.2)

ReadMemory(q, M_{i-1}, C) = softmax(C · M_{i-1}) · C ∈ R^d          (5.3)

UpdateMemory(q, S_i, C) = q + S_i ∈ R^d                             (5.4)

Predict(q, M_h) = softmax(W · M_h) ∈ R^o                            (5.5)

where BOWEncoder_q(.) and BOWEncoder_c(.) are simply bag of words models that aggregate the vector representations of all the words given by the indices in the input and average them. W ∈ R^{o×d} is a parameter of the answer prediction function, causing the softmax to be over the output vocabulary size o. Note that in MemN2N, q is not used in ReadMemory and Predict, and C is not used in UpdateMemory, since MemN2N is one of the simplest forms of MemNets.
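A minimal numpy sketch of this configuration follows. It is a simplification of the published model: a single embedding matrix is shared across hops, and the initial memory state is taken to be the query encoding, consistent with equations (5.3)-(5.4) where ReadMemory attends over C with the current state.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bow_encode(sentences, embeddings):
    """Average the embedding vectors of the word indices in each sentence."""
    return np.stack([embeddings[idx].mean(axis=0) for idx in sentences])

def memn2n(context, query, embeddings, W, hops=3):
    C = bow_encode(context, embeddings)        # (5.2): (N, d)
    q = bow_encode([query], embeddings)[0]     # (5.1): (d,)
    M = q                                      # initial memory state (assumption)
    for _ in range(hops):
        S = softmax(C @ M) @ C                 # (5.3): attention-weighted read
        M = q + S                              # (5.4): residual update
    return softmax(W @ M)                      # (5.5): distribution over answers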

In contrast, the Gated Attention Reader (Dhingra et al., 2016) is relatively more complex, and uses the following configuration:


EncodeQuery(query) = Bi-GRU_q(query) ∈ R^d                          (5.6)

EncodeContext(context) = Bi-GRU_c(context) ∈ R^{N_w×d}              (5.7)

ReadMemory(q, M_{i-1}, C) = softmax(M_{i-1} · q) · M_{i-1} ∈ R^{N_w×d}   (5.8)

UpdateMemory(q, S_i, C) = Bi-GRU_c(S_i · q) ∈ R^{N_w×d}             (5.9)

Predict(q, M_h) = softmax(M_h · q) ∈ R^{N_w}                        (5.10)

The encoders used here are bidirectional Recurrent Neural Networks with Gated Recurrent Units (Cho et al., 2014). Note that the output of Bi-GRU_c is a matrix because it returns outputs from all time steps as a sequence. N_w here denotes the number of tokens in all N sentences in the context. The output produced here is a probability distribution over all tokens in the context, because this model was proposed for tasks that involve directly selecting a token from the context.
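A small numpy sketch of one hop under this configuration is given below. Since the dot in (5.8) and (5.9) is overloaded, we adopt one plausible reading: attention over tokens in (5.8), and an elementwise multiplicative gate with the query in (5.9); bi_gru_c is a stand-in for a bidirectional GRU returning per-token vectors.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ga_hop(M, q, bi_gru_c):
    a = softmax(M @ q)                 # (5.8): per-token attention weights, (N_w,)
    S = a[:, None] * M                 # attention-scaled token memories, (N_w, d)
    return bi_gru_c(S * q[None, :])    # (5.9): gate with the query, then re-encode

def ga_predict(M_h, q):
    return softmax(M_h @ q)            # (5.10): distribution over context tokens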

Both the models described here assume that the context is well defined and small enough to fit in memory. In this chapter, we focus on problems where these assumptions do not hold, and where the context must be retrieved from a large corpus. Accordingly, we propose to add an additional RetrieveContext module to our generic MemNet architecture. In particular, we are interested in developing MemNet models that can figure out whether the available context is sufficient to make a prediction. We first describe a target dataset for our model, and then list the issues we propose to address.

5.2 Retrieval for Question Answering

The datasets we use here are significantly different from the QA datasets previously used to test memory networks, and require more complex reasoning. One example of such a question is shown below.
• Astronauts weigh more on Earth than they do on the moon because
(a) they have less mass on the moon
(b) their density decreases on the moon
(c) the moon has less gravity than Earth
(d) the moon has less friction than Earth

The text relevant to questions like this may contain a single sentence that has the information needed to answer the question, like this:
• People weigh more on some planets than others because of differences in gravity.
Or it may span multiple sentences, like this:
• The moon's gravity is less than that of Earth.
• The weight of a person depends on the gravity of the planet.
Or, in other cases, the text may not contain the relevant answer at all. Hence, this setup requires a module, in addition to the ones mentioned, that retrieves relevant content from the corpus. However, as the size of the context increases, it becomes more and more expensive to reason over it. Increased context size may also add noise to the prediction process. Hence, it is important to retrieve context conservatively, and the model should reason about whether the retrieved context is sufficient to produce an answer.

Question Answering as Textual Entailment. In our proposed setup, we cast the problem of deciding whether the given context is sufficient to answer the question as a textual entailment problem. That is, given a multiple choice question with answer options, we convert the combination of the question and each of the options into a statement, and check whether the statement can be entailed from relevant background information. Given this setup, the Predict function essentially becomes an entailment function. If none of the question-option combinations can be entailed from (or contradicted by) the retrieved context, we retrieve more context.

5.2.1 Proposed Algorithm

Based on the ideas described so far, we now present the proposed algorithm, ADAPTIVEREAD, which reads and reasons while choosing to retrieve or predict depending on the sufficiency of the available context.

Input: L: large corpus, q: question
Output: a: answer
ContextSize = k;
C = RetrieveContext(q, L, ContextSize), the top k relevant sentences from the corpus;
while ContextSize < MaxContextSize do
    if EntailsOrContradicts(C, q) then
        a = Predict(C, q);
        return a
    else
        ContextSize += k;
        C = RetrieveContext(q, L, ContextSize);
    end
end

Algorithm 1: ADAPTIVEREAD algorithm that learns when to stop retrieving context
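The following is a runnable Python sketch of Algorithm 1. The retrieve_context, entails_or_contradicts, and predict arguments are stand-ins for the learned retrieval, entailment, and prediction modules; the fall-back prediction at the size cap is our assumption, since the pseudocode leaves that case unspecified.

def adaptive_read(corpus, question, retrieve_context, entails_or_contradicts,
                  predict, k=5, max_context_size=50):
    context_size = k
    # top-k relevant sentences from the corpus
    context = retrieve_context(question, corpus, context_size)
    while context_size < max_context_size:
        if entails_or_contradicts(context, question):
            return predict(context, question)   # context deemed sufficient
        context_size += k                       # otherwise, retrieve more
        context = retrieve_context(question, corpus, context_size)
    return predict(context, question)           # fall back at the cap (assumption)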

This algorithm shares some similarities with the ReasoNet model proposed by Shen et al. (2016), but the main difference is that while ReasoNet learns to stop performing additional hops, the proposed algorithm learns to stop retrieving additional context.

5.2.2 Potential Issues

We identify the following potential issues in building ADAPTIVEREAD.

1. Scalability to large corpora: Depending on the complexity of ReadMemory, reasoning over large corpora may prove to be intractable. Earlier work on this problem involves building hierarchical models (Chandar et al., 2016; Choi et al., 2016). While Chandar et al. (2016) build a hierarchical memory network that uses a simple dot product for RetrieveContext and ReadMemory, and thus can use maximum inner product search (MIPS) to perform them efficiently, Choi et al. (2016) perform summarization to retrieve context, followed by answer prediction. We can use some of their ideas to make RetrieveContext a simpler process, and ReadMemory a more expensive one. A sketch of such dot-product retrieval appears below.
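As a concrete illustration of the dot-product retrieval option, the sketch below scores sentences by inner product with the query; with this scorer, the top-k step could be served by an approximate MIPS index, though exact numpy top-k is used here for simplicity (and k is assumed smaller than the number of sentences).

import numpy as np

def retrieve_context(query_vec, sentence_matrix, k):
    """sentence_matrix: (num_sentences, d) precomputed sentence encodings."""
    scores = sentence_matrix @ query_vec        # inner-product relevance scores
    top_k = np.argpartition(-scores, k)[:k]     # indices of the k best sentences
    return top_k[np.argsort(-scores[top_k])]    # sorted by decreasing score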

2. Difficulty in training: Clearly, ADAPTIVEREAD makes a hard choice between retrieving context and predicting an answer, thus making it impossible to train the model end-to-end using back-propagation. Shen et al. (2016) and Choi et al. (2016) get around similar problems that involve hard choices by training with the REINFORCE algorithm (Williams, 1992). However, REINFORCE is known to be unstable due to the high variance induced by sampling.


Chapter 6

Summary and Timeline

6.1 Summary

In this thesis we identified two classes of knowledge relevant to language understanding: background and contextual.

We defined background knowledge as the implicit shared human knowledge that is often omitted in human generated text or speech. We presented two ways to incorporate this kind of knowledge in NLU systems. In Chapter 2, we showed how selectional preferences in structures relevant to the task at hand can be leveraged to encode background knowledge. We showed that this is indeed more effective than encoding whole sentences using comparable neural network architectures. In Chapter 3, we linked text to a structured knowledge base (WordNet) by building an ontology-aware encoder that exploited the WordNet hierarchies to learn context-sensitive token representations of words. We showed that the proposed model helps textual entailment and prepositional phrase attachment tasks. By giving the encoder access to the hypernym paths of relevant concepts, we showed that we can learn useful task-dependent generalizations using an attention mechanism.

We defined contextual knowledge as the explicit additional information that reading comprehension systems need to reason over during prediction. In this thesis, we proposed to pursue two research goals towards effectively reasoning over semi-structured and unstructured contextual knowledge. First, as described in Chapter 4, we will explore type-driven neural network based semantic parsers to reason over semi-structured contexts for QA tasks where only the question-answer pairs are given. Our preliminary experiments with such a parser show promising results on the WIKITABLEQUESTIONS dataset. Second, we propose to augment explicit memory based neural network models with a stochastic retrieval module in Chapter 5. While there has been a lot of progress in explicit memory based models for reading comprehension, to the best of our knowledge, all of this work deals with problems where the context is finite and well defined. We will look at tasks where the relevant context needs to be retrieved before any reasoning can be performed over it; our proposed approach involves jointly training the retrieval and reasoning components, and determining when to stop retrieving contexts depending on the inputs.


6.2 Timeline

• Leveraging Selectional Preferences to Model Events: Done
  – Published at COLING 2014 (joint work with Ed Hovy)
• Ontology-Aware Token Embeddings: Done
  – Accepted at ACL 2017 (joint work with Waleed Ammar, Chris Dyer, and Ed Hovy)
• Type-Driven Neural Semantic Parsing: Partially done
  – Initial model accepted at EMNLP 2017 (joint work with Jayant Krishnamurthy and Matt Gardner)
  – Target: Improved model as an ICLR 2018 or TACL submission
• Retrieval for Memory Networks: Not started
  – Target: ACL 2018
• Thesis writing and defense: End of Spring 2018


Bibliography

Naoki Abe and Hang Li. Learning word association norms using tree cut pair models. arXiv preprint cmp-lg/9605029, 1996.

Eneko Agirre. Improving parsing and PP attachment performance with sense information. In ACL. Citeseer, 2008.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. In HLT-NAACL, 2016.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. arXiv preprint arXiv:1601.03764, 2016.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

Marco Baroni and Alessandro Lenci. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721, 2010.

David Belanger and Sham M. Kakade. A linear dynamical system model for text. In ICML, 2015.

Yonatan Belinkov, Tao Lei, Regina Barzilay, and Amir Globerson. Exploring compositional architectures and word vector representations for prepositional phrase attachment. Transactions of the Association for Computational Linguistics, 2:561–572, 2014.

Antoine Bordes, Nicolas Usunier, Ronan Collobert, and Jason Weston. Towards understanding situated natural language.

Antoine Bordes, Jason Weston, Ronan Collobert, Yoshua Bengio, et al. Learning structured embeddings of knowledge bases. In AAAI, 2011.

Johan Bos and Katja Markert. Recognising textual entailment with logical inference. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 628–635. Association for Computational Linguistics, 2005.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In EMNLP, 2015.

Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. A fast unified model for parsing and sentence understanding. arXiv preprint arXiv:1603.06021, 2016.

Eric Breck, Marc Light, Gideon S. Mann, Ellen Riloff, Brianne Brown, Pranav Anand, Mats Rooth, and Michael Thelen. Looking under the hood: Tools for diagnosing your question answering engine. In Proceedings of the Workshop on Open-Domain Question Answering - Volume 12, pages 1–8. Association for Computational Linguistics, 2001.

Eric Brill and Philip Resnik. A rule-based approach to prepositional phrase attachment disambiguation. In Proceedings of the 15th Conference on Computational Linguistics - Volume 2, pages 1198–1204. Association for Computational Linguistics, 1994.

A. Carlson, C. Cumby, J. Rosen, and D. Roth. The SNoW learning architecture. Technical Report UIUCDCS-R-99-2101, May 1999. URL http://cogcomp.cs.illinois.edu/papers/CCRR99.pdf.

Sarath Chandar, Sungjin Ahn, Hugo Larochelle, Pascal Vincent, Gerald Tesauro, and Yoshua Bengio. Hierarchical memory networks. arXiv preprint arXiv:1605.07427, 2016.

Danqi Chen and Christopher D. Manning. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750, 2014.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. A unified model for word sense representation and disambiguation. In EMNLP, pages 1025–1035, 2014.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Eunsol Choi, Daniel Hewlett, Alexandre Lacoste, Illia Polosukhin, Jakob Uszkoreit, and Jonathan Berant. Hierarchical question answering for long documents. arXiv preprint arXiv:1611.01839, 2016.

François Chollet. Keras. https://github.com/fchollet/keras, 2015.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Massimiliano Ciaramita and Mark Johnson. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th Conference on Computational Linguistics - Volume 1, pages 187–193. Association for Computational Linguistics, 2000.

Peter Clark. Elementary school science and math tests as a driver for AI: Take the Aristo challenge! 2015.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.

Courtney Corley and Rada Mihalcea. Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 13–18. Association for Computational Linguistics, 2005.

Pradeep Dasigi and Eduard Hovy. Modeling newswire events using neural networks for anomaly detection. In COLING 2014, volume 25, pages 1414–1422. Dublin City University and Association for Computational Linguistics, 2014.

Bhuwan Dhingra, Hanxiao Liu, William W. Cohen, and Ruslan Salakhutdinov. Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549, 2016.

Li Dong and Mirella Lapata. Language to logical form with neural attention. CoRR, abs/1601.01280, 2016.

Katrin Erk. A simple, similarity-based model for selectional preferences. Annual Meeting of the Association for Computational Linguistics, 45(1):216, 2007.

Katrin Erk, Sebastian Padó, and Ulrike Padó. A flexible, corpus-driven model of regular and inverse selectional preferences. Computational Linguistics, 36(4):723–763, 2010.

Anthony Fader, Luke S. Zettlemoyer, and Oren Etzioni. Paraphrase-driven learning for open question answering. Citeseer.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard H. Hovy, and Noah A. Smith. Retrofitting word vectors to semantic lexicons. In NAACL, 2015.

Matt Gardner, Partha Pratim Talukdar, Jayant Krishnamurthy, and Tom Mitchell. Incorporating vector space similarity in random walk inference over knowledge bases. 2014.

Till Haug, Octavian-Eugen Ganea, and Paulina Grnarova. Neural multi-step reasoning for question answering on semi-structured tables, 2017.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701, 2015.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301, 2015.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. Learning whom to trust with MACE. In Proceedings of NAACL-HLT, pages 1120–1130, 2013.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873–882. Association for Computational Linguistics, 2012.

Sujay Kumar Jauhar, Chris Dyer, and Eduard H. Hovy. Ontologically grounded multi-sense representation learning for semantic vector space models. In NAACL, 2015.

Richard Johansson and Luis Nieto Piña. Embedding a semantic network in a word space. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies. Citeseer, 2015.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. 2017.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. A large-scale classification of English verbs. Language Resources and Evaluation, 42(1):21–40, 2008.

Klaus Krippendorff. Content analysis: An introduction to its methodology. Sage Publications (Beverly Hills), 1980.

Saisuresh Krishnakumaran and Xiaojin Zhu. Hunting elusive metaphors using lexical resources. In Proceedings of the Workshop on Computational Approaches to Figurative Language, pages 13–20. Association for Computational Linguistics, 2007.

Yuval Krymolowski and Dan Roth. Incorporating knowledge in natural language learning: A case study. Urbana, 51:61801.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. In NAACL, 2016.

Tao Lei, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. Low-rank tensors for scoring dependency structures. In ACL. Association for Computational Linguistics, 2014.

Alessandro Lenci. Composing and updating verb argument expectations: A distributional semantic model. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pages 58–66. Association for Computational Linguistics, 2011.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. CoRR, abs/1611.00020, 2016.

Percy Liang, Michael I. Jordan, and Dan Klein. Learning dependency-based compositional semantics. In ACL, 2011.

Tomas Mikolov. Recurrent neural network based language model.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.

George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, pages 1003–1011. Association for Computational Linguistics, 2009.

Dan I. Moldovan and Vasile Rus. Logic form transformation of WordNet and its applicability to question answering. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 402–409. Association for Computational Linguistics, 2001.

Lili Mou, Men Rui, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. Recognizing entailment and contradiction by tree-based convolution. In Proc. of ACL, 2016.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654, 2015.

Arvind Neelakantan, Quoc V. Le, Martín Abadi, Andrew McCallum, and Dario Amodei. Learning a natural language interface with neural programmer. CoRR, abs/1611.08945, 2016.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 809–816, 2011.

Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryiğit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(02):95–135, 2007.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106, 2005.

Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305, 2015.

Panupong Pasupat and Percy Liang. Inferring logical forms from denotations. arXiv preprint arXiv:1606.06900, 2016.

Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, and Roser Morante. QA4MRE 2011-2013: Overview of question answering for machine reading evaluation. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 303–320. Springer, 2013.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

Benjamin C. Pierce. Types and programming languages. 2002.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.

Adwait Ratnaparkhi, Jeff Reynar, and Salim Roukos. A maximum entropy model for prepositional phrase attachment. In Proceedings of the Workshop on Human Language Technology, 1994.

Joseph Reisinger and Raymond J. Mooney. Multi-prototype vector-space models of word meaning. In HLT-ACL, 2010.

Philip Resnik. Semantic classes and syntactic ambiguity. In Proceedings of the Workshop on Human Language Technology. Association for Computational Linguistics, 1993.

Philip Resnik. Selectional constraints: An information-theoretic model and its computational realization. Cognition, 61(1):127–159, 1996.

Philip Resnik. Selectional preference and sense disambiguation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How, pages 52–57. Washington, DC, 1997.

Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP, 2013.

Mats Rooth, Stefan Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil. Inducing a semantically annotated lexicon via EM-based clustering. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pages 104–111. Association for Computational Linguistics, 1999.

Sascha Rothe and Hinrich Schütze. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In ACL, 2015.

Diarmuid Ó Séaghdha. Latent variable models of selectional preference. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 435–444. Association for Computational Linguistics, 2010.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603, 2016.

Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. ReasoNet: Learning to stop reading in machine comprehension. arXiv preprint arXiv:1609.05284, 2016.

Ekaterina Shutova, Simone Teufel, and Anna Korhonen. Statistical metaphor processing. Computational Linguistics, 39(2):301–353, 2013.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934, 2013.

Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. A corpus of natural language for visual reasoning. 2017.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In NIPS, 2015a.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015b.

Ilya Sutskever, Joshua B. Tenenbaum, and Ruslan R. Salakhutdinov. Modelling relational data using Bayesian clustered tensor factorization. In Advances in Neural Information Processing Systems, pages 1821–1828, 2009.

Marta Tatu and Dan Moldovan. A semantic approach to recognizing textual entailment. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 371–378. Association for Computational Linguistics, 2005.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL, 2003.

Tim Van de Cruys. A non-negative tensor factorization model for selectional preference induction. GEMS: GEometrical Models of Natural Language Semantics, page 83, 2009.

Tim Van de Cruys. A neural network approach to selectional preference acquisition. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 26–35, 2014.

Luke Vilnis and Andrew McCallum. Word representations via Gaussian embedding. In ICLR, 2015.

Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.

Yorick Wilks. Preference semantics. Technical report, Stanford University Department of Computer Science, 1973.

Yorick Wilks. Making preferences more active. In Words and Intelligence I, pages 141–166. Springer, 2007a.

Yorick Wilks. A preferential, pattern-seeking, semantics for natural language inference. In Words and Intelligence I, pages 83–102. Springer, 2007b.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.

Mo Yu and Mark Dredze. Improving lexical embeddings with semantic knowledge. In ACL, 2014.

Beñat Zapirain, Eneko Agirre, Lluís Màrquez, and Mihai Surdeanu. Selectional preferences for semantic role classification. Computational Linguistics, 2013.

John M. Zelle and Raymond J. Mooney. Learning to parse database queries using inductive logic programming. In AAAI/IAAI, Vol. 2, 1996.

Luke S. Zettlemoyer and Michael Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI, 2005.

Luke S. Zettlemoyer and Michael Collins. Online learning of relaxed CCG grammars for parsing to logical form. In EMNLP-CoNLL, pages 678–687, 2007.