
Grounded Knowledge Bases for Scientific Domains

Thesis Proposal

Dana Movshovitz-Attias

November 2014

Computer Science Department
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]

Thesis Committee:
William W. Cohen (chair)
Tom Mitchell
Roni Rosenfeld
Alon Halevy (Google Research)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2014 Dana Movshovitz-Attias


Keywords: grounded language learning, knowledge base construction, knowledge representation, ontology, unsupervised learning, semi-supervised learning, bootstrapping, information extraction


Abstract

This thesis is focused on building knowledge bases (KBs) for scientific domains. We are interested in the process of representing domain-specific information in a structured format, using unsupervised and semi-supervised learning methods. This work is inspired by the recent advances in the creation of knowledge bases based on Web text. However, in the technical domains we consider here, we have “grounded” data about the objects named by text entities. For example, biomedical entities such as proteins can be described by a 3D structure, and there is experimental information about their interactions. In the software realm, we can consider the implementation of classes in a code repository, and we can observe the way they are being used statically or dynamically in programs. The additional resources available in technical domains present an opportunity for learning not only how entities are discussed in text, but also what their real-world properties are.

We ask three main research questions in the context of learning a knowledge base for a technical domain, which concentrate on the following topics: (1) Knowledge representation: How should a domain-specific KB be structured, and what are the algorithms used to map input data into the KB's internal representation? (2) Grounding: What type of input data can be used to formulate the KB structure and to populate it? (3) Applications: What applications can benefit from using a structured KB?

We construct an open information extraction system for biomedical text based on NELL, a system designed for extraction from Web text. NELL uses a coupled semi-supervised bootstrapping approach to learn new facts from text, given an initial ontology and a small number of seeds for each ontology category. We propose a process for automatically deriving an ontology and seeds from existing resources. We then show that NELL's bootstrapping algorithm is susceptible to ambiguous seeds and propose a method for assessing seed quality, based on a large corpus of data derived from the Web. In our method, seed quality is assessed at each iteration of the bootstrapping process.

We present a grounded approach for the detection of semantic relationships between entities from the software domain that refer to Java classes. Usually, relations are found by examining corpus statistics associated with text entities. Here, we develop a similarity measure for entities that refer to Java classes using distributional information about how they are used in software, which we combine with corpus statistics on the distribution of contexts in which the classes appear in text. By aggregating predicted coordinate pairs, we are able to construct a software taxonomy which highlights functional software components.

We explore an application of statistical language models to the software domain, illustrating the potential of structured knowledge in assisting downstream software understanding tasks. We predict comments from Java source files of open source projects, using topic models and n-grams, and we analyze the performance of the models given varying amounts of background data on the project being predicted.

Finally, we propose a topic model framework for learning a complete ontological structure, including a hierarchy of semantic classes, seed examples of each class, and relations between them. We suggest ways in which the current framework can be extended to include grounded data from a specific domain. Additionally, we explore unsupervised and semi-supervised approaches to training the model, in a way that will allow us to incorporate pre-existing knowledge of the domain of interest.


Contents

1 Introduction
  1.1 Completed Work
  1.2 Ongoing and Proposed Work
      1.2.1 Interesting Extensions to Our Work
  1.3 Overview
  1.4 Timeline

2 Bootstrapping Biomedical Ontologies for Scientific Text using NELL
  2.1 Introduction
  2.2 Related Work
  2.3 Main Results
      2.3.1 Experiment: Learning Category Dictionaries
      2.3.2 Application: Named-Entity Recognition using a Learned Lexicon

3 Grounded Discovery of Coordinate Term Relationships between Software Entities
  3.1 Introduction
  3.2 Related Work
  3.3 Main Results
      3.3.1 Entity Linking
      3.3.2 Code Distributional Similarity
      3.3.3 Code Hierarchies and Organization
      3.3.4 Experiment: Learning Coordinate Pairs

4 Natural Language Models for Predicting Programming Comments
  4.1 Introduction and Related Work
  4.2 Main Results
      4.2.1 Models
      4.2.2 Prediction Methodology
      4.2.3 Experiment: Within- and Cross-Project Comment Prediction

5 Preliminary and Proposed Work
  5.1 Learning an Ontology with Relations using a Topic Model Framework
  5.2 Improving Software Language Modeling with a Software Ontology
  5.3 Grounding a Learned Ontology
  5.4 Learning a Complete Ontology for the Biomedical Domain
  5.5 Semi-Supervised Ontology Learning

Bibliography



Chapter 1

Introduction

Algorithmic advances in the fields of Natural Language Processing and Machine Learning have enabled a surge in the construction of large-scale knowledge bases (KBs) from Web resources. KBs such as YAGO, DBpedia, and Freebase are derived mainly from Wikipedia, while others are learned from large corpora of Web pages, including NELL, Knowledge Vault, and TextRunner [2, 11, 19, 72, 73, 84]. These systems transform unstructured input into some structured representation, which includes large collections of entities, a mapping of entities to semantic classes, and relations among them. In the Open Information Extraction paradigm, used for example in TextRunner, the entire system is extracted using a single pass over the corpora. Conversely, systems such as NELL use an iterative bootstrapping approach, where each iteration improves the existing KB. The existence of this large variety of KBs has promoted development of applications that are based on semantics and can take advantage of this type of structured knowledge, which did not previously exist in large scale. This in turn is contributing to the creation of improved KBs, such as YAGO2, which now integrates a spatio-temporal dimension [34].

The KBs described above draw their strength from the plentitude of available Web data. They describe interesting topics that are discussed and written about by people. Still, some information, in particular technical and scientific knowledge, is not always conveyed through natural discourse. Here, we consider the challenges in building KBs for technical domains, specifically the biomedical and software domains, where in addition to text corpora we have access to the objects named by text entities, as well as data associated with those objects. For example, biomedical entities such as proteins can be described by a 3D structure, and there is experimental information about their interactions. In the software realm, we can consider the implementation of classes in a code repository, and we can observe the way they are being used statically or dynamically in programs. The additional resources available in these technical domains present an opportunity for learning not only how entities are discussed, but also what their real-world properties are. Some of these can be learned from text, but others are only available through other sources; hence, in this thesis, we strive for a combined approach.

The main contribution of this thesis is in answering the following questions in the context of learning a knowledge base for a technical domain: (1) Knowledge representation: How should a domain-specific KB be structured, and what are the algorithms used to map input data into the KB's internal representation? (2) Grounding: What type of input data can be used to formulate the KB structure and to populate it? How can domain-specific data about the learned entities be combined with information from text corpora? (3) Applications: What applications can benefit from using a structured KB? Table 1.1 summarizes the different areas of contribution (current and proposed work) in the context of these three research areas of interest, and our work is described in more detail in the following sections.


1.1 Completed Work

Bootstrapping Biomedical Ontologies for Scientific Text using NELL [51, 54]: Motivated by recent advances in knowledge base systems extracted from the Web, in this work we describe an open information extraction system for biomedical text. Leveraging the large available collections of biomedical data, in our system an initial ontology and set of example seeds are automatically derived from existing resources. This knowledge base is then populated using a coupled semi-supervised bootstrapping approach, based on NELL, which uses multiple set expansion techniques that are combined by a joint learning objective. As part of this work, we show that NELL's bootstrapping method is susceptible to ambiguous starting seeds, and can quickly lead to an accumulation of erroneous terms and contexts when learning a semantic class. We address this problem by introducing a method for assessing seed quality at each bootstrapping iteration, using point-wise mutual information.

We analyzed open biomedical categories learned with our system, based on dictionaries taken from Freebase, and we show that the proposed algorithm produces significantly more precise classes. Additionally, we used learned gene lexicons to improve annotations in a named-entity recognition task.

Grounded Discovery of Coordinate Term Relationships between Software Entities [53]: Discovering semantic relations between text entities is a key task in natural language understanding, which is a critical component that enables the success of knowledge representation systems. We examine coordinate relations between text entities that potentially refer to Java classes. Usually, relations are found by examining corpus statistics associated with text entities. Here we explore the idea of grounding the relation discovery with information about the class implementation, and coupling this with corpus statistics. To this end, we develop a similarity measure for Java classes using distributional information about how they are used in software, which we combine with corpus statistics on the distribution of contexts in which the classes appear in text that discusses code. Our results are verified by human labeling and are also compared with coordinate relations extracted using high-precision Hearst patterns.

We see this work as a first step towards building a knowledge representation system for the software domain, in which text entities refer to elements from a software code base, including classes, methods, applications, and programming languages. As an initial step, we combine the predicted coordinate pairs from this study by aggregating them into a graph, where entities are nodes and edges are determined by a coordinate term relation. Using methods for community detection on graphs, we discover that highly connected components in the resulting graph correspond to functional software groups, such as UI elements, utility classes, exceptions, and graphic objects. This hierarchy highlights class interactions that cannot be directly recovered through traditional software taxonomies, such as the type hierarchy or class namespace.
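The following is a minimal sketch of this aggregation step, assuming a small list of predicted coordinate pairs. The pair list is illustrative, and since the proposal does not name the community detection algorithm used, greedy modularity maximization from networkx serves here as a stand-in:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical output of the coordinate-term classifier: entity pairs.
coordinate_pairs = [
    ("JButton", "JLabel"), ("JButton", "JPanel"), ("JLabel", "JPanel"),
    ("FileInputStream", "FileOutputStream"),
    ("FileInputStream", "BufferedReader"),
    ("BufferedReader", "FileOutputStream"),
]

# Entities are nodes; an edge marks a predicted coordinate term relation.
G = nx.Graph()
G.add_edges_from(coordinate_pairs)

# Each detected community approximates one functional software group.
for i, group in enumerate(greedy_modularity_communities(G)):
    print(f"group {i}: {sorted(group)}")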

Natural Language Models for Predicting Programming Comments [52]: We consider an application of statistical language modeling to the software domain, illustrating the potential of structured knowledge in assisting downstream software understanding tasks. Given an implementation of a Java class, we are interested in predicting a natural language description matching the implementation. Beyond a summarization of the conceptual idea behind the code, this type of description can be viewed as a form of document expansion, providing significant terms relevant to the implementation. We experiment with LDA, link-LDA and n-gram models to model the correlations between code segments and the comments that describe them. With these we model local syntactic dependencies, or term relevancy based on the topic of the code, which are used to predict the main class comment of Java classes.

We evaluate our models based on their comment-completion capability in a setting similar to code completion tools that are built into standard code editors, and show that they can save a significant amount of typing. We also implemented a plugin for the Eclipse IDE based on one of the models, which assists in comment completion in real time.
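As a rough illustration of the comment-completion setting, the sketch below trains a toy trigram model over comment tokens and ranks candidate next tokens given a two-token context. The tokenization and add-one smoothing are our own simplifications, not the exact models of [52]:

from collections import Counter, defaultdict

def train_trigrams(tokens):
    # Count next-token frequencies for every two-token context.
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def complete(counts, context, vocab_size, k=3):
    # Rank the top-k completions with add-one smoothed probabilities.
    options = counts.get(tuple(context), Counter())
    total = sum(options.values())
    return [(tok, (n + 1) / (total + vocab_size))
            for tok, n in options.most_common(k)]

tokens = "parses the xml file and returns the xml document tree".split()
model = train_trigrams(tokens)
print(complete(model, ["the", "xml"], vocab_size=len(set(tokens))))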


1.2 Ongoing and Proposed Work

Learning a Complete Ontology with Relations using a Topic Model Framework: Ontology design involves assembling a set of interesting categories, organized in a meaningful hierarchical structure, and providing representative examples for each category. We are often also interested in relations between categories. Designing a new ontology for a technical domain is difficult. We have previously constructed a simple software hierarchy by aggregating learned pairwise relations. In this work, we describe a topic model which jointly learns an ontological structure, including categories, seeds and relations, in an unsupervised way. The model currently includes two main components: the first learns category-to-category relations based on hypernym-hyponym pairs, and the second learns subject-verb-object relations which provide a link between categories. We have so far experimented with using this model to build a software hierarchy describing concepts related to Java programming, and another describing the field of malware software.

The framework proposed here currently suffers from a lack of scalability. This is due mainly to the fact that the model parameters are resolved using a Gibbs sampling update, which sequentially iterates over the input examples to the model, a process which does not gracefully scale (a detailed discussion of this is included in Section 5.1). We are interested in scaling the current model in a way that will allow using significantly more input data as well as added components. We believe that the basic model proposed here can be enhanced in many interesting ways, and we describe some suggestions in detail below and in Chapter 5; however, all of these extensions depend on the ability to handle more data as a basic requirement.
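To make the bottleneck concrete, the sketch below shows a collapsed Gibbs sweep for a plain LDA-style topic assignment (a stand-in, not the full ontology model): every token is resampled sequentially, because each update conditions on all current assignments. The toy documents and hyperparameters are illustrative:

import random

def gibbs_sweep(docs, z, ndk, nkw, nk, K, V, alpha=0.1, beta=0.01):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):        # inherently sequential per-token loop
            k_old = z[d][i]                # remove the current assignment
            ndk[d][k_old] -= 1; nkw[k_old][w] -= 1; nk[k_old] -= 1
            # P(k | rest) is proportional to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta)
            weights = [(ndk[d][k] + alpha) * (nkw[k].get(w, 0) + beta)
                       / (nk[k] + V * beta) for k in range(K)]
            k_new = random.choices(range(K), weights=weights)[0]
            z[d][i] = k_new                # add the new assignment back
            ndk[d][k_new] += 1
            nkw[k_new][w] = nkw[k_new].get(w, 0) + 1
            nk[k_new] += 1

docs = [[0, 1, 2, 1], [2, 3, 3, 0]]        # toy documents over a 4-word vocabulary
K, V = 2, 4
z = [[random.randrange(K) for _ in doc] for doc in docs]
ndk = [[0] * K for _ in docs]; nkw = [{} for _ in range(K)]; nk = [0] * K
for d, doc in enumerate(docs):             # initialize the count tables
    for i, w in enumerate(doc):
        k = z[d][i]; ndk[d][k] += 1; nkw[k][w] = nkw[k].get(w, 0) + 1; nk[k] += 1
for _ in range(10):
    gibbs_sweep(docs, z, ndk, nkw, nk, K, V)
print(z)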

Semi-Supervised Ontology Learning: The current ontology topic model is fully unsupervised. In some domains, however, we have existing pre-defined knowledge that can help guide ontology learning; for example, we may have access to an incomplete ontology, or we might be interested in representing specified sets of objects. We are interested in exploring a semi-supervised variant of this model, which takes in “hints” of potential areas in the ontology and expands on them. Initial experiments in this area have included modifying the Gibbs sampling process such that it starts by addressing only the provided information, and with every iteration extends to the most relevant connected terms. In effect, this process simulates the idea of bootstrapping, with the additional advantage of doing so while jointly considering ontology and relation constraints.

1.2.1 Interesting Extensions to Our Work

We describe possible interesting extensions and applications to the proposed work. We believe the suggested extensions can greatly enhance the quality of the learned ontology, and the applications can provide value in understanding the generalizability and testing the effectiveness of the work. We acknowledge, however, that the timeline proposed for the remainder of this thesis limits the scope of extended research. We therefore provide a formal and detailed description of these exploratory ideas (detailed in Chapter 5), and we propose that a subset of these will be completed as part of this thesis.

Grounding a Learned Ontology: The proposed ontology topic model can be intuitively extended to include grounded data from a target domain. Grounded components can model information taken directly from a grounded source. Some opportunities include modeling the distributional similarity of classes as seen in code (which we have formalized in previous work), or incorporating our previously learned coordinate relations, which indicate sets of instances that belong in the same hierarchy sub-tree, based on both code and corpus statistics.

Learning an Ontology for the Biomedical Domain: In previous work in the biomedical domain we leveraged existing resources to construct an ontology. We are interested in the ability of the topic model framework to derive a complete ontology including relations. Using our proposed framework, it is also possible to combine existing resources in the learned model, in the form of grounded components.


It would be interesting to see how a derived ontology compares with existing ones, which have been manually constructed, and whether the addition of relations improves the resulting knowledge base.

Improving Software Language Modeling with a Software Ontology: Given an ontology for the software domain, it will be interesting to see if we can improve a popular task in this domain. Recently, there have been numerous efforts in creating a language model which describes software code. Such models can be used to assist a programming workflow by making real-time suggestions to the programmer or by creating frequently used templates which represent common design patterns. To address potential sparsity in the software language models, for example in modeling infrequently used objects, one can use a higher-level ontology category of the object as a backoff approach. It would be interesting to explore the usefulness of our derived ontology in this setting.
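A minimal sketch of this backoff idea, assuming a hypothetical map from class names to ontology categories; the counts, category sizes, and rarity threshold are all illustrative:

# When a class token is too rare for a reliable estimate, back off to its
# (hypothetical) ontology category and spread the category's probability
# mass over its member classes.
CATEGORY = {"JButton": "UI_ELEMENT", "JSlider": "UI_ELEMENT",
            "ArrayList": "COLLECTION", "ConcurrentSkipListMap": "COLLECTION"}
token_count = {"JButton": 900, "ArrayList": 1200, "ConcurrentSkipListMap": 2}
category_count = {"UI_ELEMENT": 2500, "COLLECTION": 3100}
category_size = {"UI_ELEMENT": 40, "COLLECTION": 60}   # classes per category
TOTAL = 100_000                                        # total training tokens

def backoff_prob(token, min_count=5):
    if token_count.get(token, 0) >= min_count:         # frequent: direct estimate
        return token_count[token] / TOTAL
    cat = CATEGORY.get(token)
    if cat is not None:                                # rare: category backoff
        return category_count[cat] / (category_size[cat] * TOTAL)
    return 1 / TOTAL                                   # unseen fallback

print(backoff_prob("ConcurrentSkipListMap"))           # backs off to COLLECTION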

1.3 Overview

Table 1.1 gives an overview of this proposal with the completed and proposed work in the context of our research areas of interest.

Knowledge Representation
  Software:
    • Grounded Discovery of Coordinate Term Relationships between Software Entities [53]
    • Natural Language Models for Predicting Programming Comments [52]
    • Learning a Complete Ontology with Relations using a Topic Model Framework *
    • Semi-Supervised Ontology Learning *
  Biomedical:
    • Bootstrapping Biomedical Ontologies for Scientific Text using NELL [54]
    • Learning a Complete Ontology for the Biomedical Domain *
    • Semi-Supervised Ontology Learning *

Grounding
  Software:
    • Grounded Language Models [52, 53]
    • Grounding a Learned Ontology *
  Biomedical:
    • Grounding a Biomedical Ontology *

Application
  Software:
    • Natural Language Models for Predicting Programming Comments [52]
    • Improving Software Language Modeling with a Software Ontology *
  Biomedical:
    • Named Entity Recognition [54]

Table 1.1: Proposal overview. Completed work is cited; proposed work is marked with an asterisk (*).

1.4 Timeline

Below is a timeline for the completion of my proposed work. I plan to graduate in the Summer of 2015.

1. Fall 2014: Thesis Proposal.

2. Fall 2014: Scaling the Ontology Topic Model Framework.

3. Fall 2014: Grounding a Learned Ontology (with possible application to Biomedical Domain).

4. Spring 2015: Semi-Supervision in a Learned Ontology.

5. Spring 2015: Improving Software Language Modeling with a Software Ontology.

6. Summer 2015: Thesis writing and defense.


Chapter 2

Bootstrapping Biomedical Ontologies for Scientific Text using NELL

2.1 Introduction

NELL (the Never-Ending Language Learner) is a semi-supervised learning system, designed for extraction of information from the Web. The system uses a coupled semi-supervised bootstrapping approach to learn new facts from text, given an initial ontology and a small number of 'seeds', i.e., labeled examples for each ontology category. The new facts are stored in a growing structured knowledge base. One of the concerns about gathering data from the Web is that it comes from various unauthoritative sources, and may not be reliable. This is especially true when gathering scientific information. In contrast to Web data, scientific text is potentially more reliable, as it is guided by peer review. Open access archives make this information available for all. In fact, the production rate of available scientific data far exceeds the ability of researchers to manually process it, and there is a growing need for the automation of this process.

The biomedical field presents a great potential for text mining applications. An integral part of life science research involves the production and publication of large collections of data by curators, and as part of collaborative community effort. Prominent examples include the publication of genomic sequence data, such as the Human Genome Project, and online collections of three-dimensional coordinates of protein structures. Important resources, initiated as a means of enforcing data standardization, are the ontologies describing biological, chemical and medical terms, which are heavily used by the research community. With this wealth of available data, the biomedical field holds many information extraction opportunities.

We describe an open information extraction system adapting NELL to the biomedical domain. We present an implementation of our approach, named BioNELL, which uses three main sources of information: (1) a public corpus of biomedical scientific text, (2) commonly used biomedical ontologies, and (3) a corpus of Web documents. NELL's ontology, including categories and seeds, was manually designed during the system's development. Ontology design involves assembling a set of interesting categories, organized in a meaningful hierarchical structure, and providing representative seeds for each category. Redesigning an ontology for a technical domain is difficult without non-trivial knowledge of the domain. We describe a process of merging source ontologies into one structure of categories with seed examples.

However, as we will show, NELL's bootstrapping algorithm, when used to extract facts from a biomedical corpus, is susceptible to noisy and ambiguous terms. Such ambiguities are common in biomedical terminology, and some ambiguous terms are heavily used in the literature. For example, in the sentence “We have cloned an induced white mutation and characterized the insertion sequence responsible for the mutant phenotype”, white is an ambiguous term referring to the name of a gene.


In NELL, ambiguity is limited using coupled semi-supervised learning [10]: if two categories in the ontology are declared mutually exclusive, instances of one category are used as negative examples for the other, and the two categories cannot share any instances. To resolve the ambiguity of white with mutual exclusion, we would have to include a Color category in the ontology, and declare it mutually exclusive with the Gene category. Then, instances of Color will not be able to refer to genes in the KB. It is hard to estimate what additional categories should be added, and building a “complete” ontology tree is practically infeasible.

NELL also includes a polysemy resolution component that acknowledges that one term, e.g., white, may refer to two distinct concepts that map to different ontology categories, such as Color and Fly Gene [39]. By including a Color category, this component can identify that white is both a color and a gene. The polysemy resolver performs word sense induction and synonym resolution based on relations defined between categories in the ontology, and labeled synonym examples. However, at present, BioNELL's ontology does not contain relation definitions, so we cannot include this component in our experiments. Additionally, it is unclear how to avoid the use of polysemous terms as category seeds, and no method has been suggested for selecting seeds that are representative of a single specific category. To address the problem of ambiguity, we introduce a method for assessing the desirability of noun phrases to be used as seeds for a specific target category. We propose ranking seeds using a Pointwise Mutual Information based collocation measure for a seed and a category name. Collocation is measured based on a large corpus of domain-independent data derived from the Web, accounting for uses of the seed in many different contexts.

NELL's bootstrapping algorithm uses the morphological and semantic features of seeds to propose new facts, which are added to the KB and used as seeds in the next iteration to learn more facts. Ambiguous terms may, therefore, be added at any learning iteration. Since white really is the name of a gene, it is sometimes used in the same semantic context as other genes, and may be added to the KB despite not being used as an initial seed. We therefore propose measuring seed quality with the following methodology: after every iteration, we rank all the instances that are added to the KB by their quality as category seeds. High-ranking instances are used as seeds in the next iteration. Low-ranking instances are stored in the KB and “remembered” as true facts, but are not used to learn new information. This is in contrast to NELL's approach (and most other bootstrapping systems), in which there is no distinction between acquired facts and facts that are used for learning.

2.2 Related Work

Biomedical Information Extraction systems have traditionally targeted the recognition of a few distinct biological entities, focusing mainly on genes [12]. Few systems have been developed for fact-extraction of many biomedical predicates, and these are relatively small scale [81], or they account for limited sub-domains [18]. We suggest a more general approach, using bootstrapping to extend existing ontologies, including a wide range of sub-domains and many categories. To the best of our knowledge, such large-scale biomedical bootstrapping has not been done before.

Bootstrap Learning and Semantic Drift. Carlson et al. [11] use coupled semi-supervised bootstrap learning to learn a large set of category classifiers with high precision. One drawback of using iterative bootstrapping is the sensitivity of this method to the set of initial seeds [58]. Ambiguous seeds can lead to semantic drift, an accumulation of erroneous terms and contexts when learning a semantic class. Bootstrapping environments reduce this problem by adding boundaries or limiting the learning process, such as learning mutual terms and contexts [63] and using mutual exclusion examples [16]. McIntosh and Curran [48] propose a metric for measuring the semantic drift introduced by a learned term, favoring terms different from the most recent m learned terms and similar to the first n (shown for n=20 and n=100), following the assumption that semantic drift develops in late iterations.


Learning System    Precision   Correct   Total
BioNELL            .83         109       132
NELL               .29         186       651
BioNELL+Random     .73         248       338

Table 2.1: Precision, total number of instances, and correct instances of gene lexicons learned with BioNELL and NELL. BioNELL significantly improves the precision of the learned lexicon compared with NELL.

As we will show, for biomedical categories, semantic drift can occur within a handful of iterations (< 5); however, according to the authors, using low values for n produces inadequate results. In fact, effective n and m parameters may not only be a function of the data, but of the specific category, and it is unclear how to automatically tune them.

Seed Set Refinement. Vyas et al. [78] suggest a method for reducing ambiguity in seeds provided by human experts, by selecting the tightest seed clusters based on context similarity, a method described for on the order of 10 seeds. In an ontology containing hundreds of seeds per class, however, it is unclear how to estimate the right number of clusters. Another approach, suggested by Kozareva et al. [38], is using only constrained contexts where both the seed and the class are present in a sentence. Extending this idea, we consider a more general collocation metric, looking at entire documents that include both the seed and its category.

2.3 Main Results

BioNELL's ontology is composed of six base ontologies, covering a wide range of biomedical sub-domains: the Gene Ontology [1], describing gene attributes; the NCBI Taxonomy for model organisms [65]; Chemical Entities of Biological Interest [17], a dictionary of small chemical compounds; the Sequence Ontology [20]; the Cell Type Ontology [4]; and the Human Disease Ontology [57]. We merge the base ontologies into one ontology tree by grouping them under one root, producing a tree of over 1 million entities. We then separate these into potential categories and potential seeds. Categories are unambiguous nodes (nodes with one parent in the tree) with over 100 descendants; the descendants are the potential seeds. This results in 4188 category nodes. In these experiments we selected only the top 20 categories in the tree of each base ontology, leaving 109 categories. Leaf categories are given seeds from their descendants in the full tree, giving a total of 1 million potential seeds. Seed set refinement is described below. The seeds of leaf categories are later extended by the bootstrapping process.
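A minimal sketch of this category selection rule, assuming the merged ontology is given as a parent-child edge list (the representation is our own):

from collections import defaultdict

def select_categories(edges, min_descendants=100):
    # Categories are unambiguous nodes (a single parent) with more than
    # min_descendants descendants; the descendants are the potential seeds.
    children, parents = defaultdict(list), defaultdict(set)
    for parent, child in edges:
        children[parent].append(child)
        parents[child].add(parent)

    def descendants(node):
        stack, seen = list(children[node]), set()
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(children.get(n, []))
        return seen

    categories = {}
    for node in list(children):
        if len(parents[node]) <= 1:            # unambiguous node
            desc = descendants(node)
            if len(desc) > min_descendants:    # enough potential seeds
                categories[node] = desc
    return categories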

Next, we define a seed quality metric based on a large corpus of Web data. Let s and c be a seed and a target category, respectively, e.g., s = ’white’, the name of a gene of the fruit-fly, and c = ’fly gene’. Now, let D be a document corpus, and let Dc be a subset of the documents containing a mention of the category name. We measure the collocation of the seed and the category by the number of times s appears in Dc, |Occur(s, Dc)|. The overall occurrence of s in the corpus is given by |Occur(s, D)|. Following the formulation of Church and Hanks [13], we compute the PMI-rank of s and c as

    PMI(s, c) = |Occur(s, Dc)| / |Occur(s, D)|

In our example, as white is a highly ambiguous gene name, we find that it appears in many documents that do not discuss the fruit fly, resulting in a PMI rank close to 0. This ranking is sensitive to the descriptive name given to categories. For a more robust ranking, we use a combination of rankings of the seed with several of its ancestors in the ontology hierarchy. In [51] we describe the hierarchical ranking and additionally explore the use of the binomial log-likelihood ratio test as an alternative collocation measure.
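A direct sketch of this ranking, assuming the corpus is given as a list of tokenized documents; the toy data and the single-token category name are illustrative:

def pmi_rank(seed, category, corpus):
    # PMI(s, c) = |Occur(s, Dc)| / |Occur(s, D)|, where Dc is the subset
    # of documents that mention the category name.
    occur_in_d = sum(doc.count(seed) for doc in corpus)
    occur_in_dc = sum(doc.count(seed) for doc in corpus if category in doc)
    return occur_in_dc / occur_in_d if occur_in_d else 0.0

corpus = [["white", "mutant", "fly", "gene"],   # toy documents
          ["white", "paint", "color"],
          ["white", "snow"]]
print(pmi_rank("white", "fly", corpus))         # 1/3: the ambiguous seed ranks low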

We further note that some biomedical terms follow nomenclature rules that make them identifiable as category specific. These terms may not be frequent in a general Web context, leading to a low PMI rank under the proposed method. Given such a set of high-confidence seeds from a reliable source, one can enforce their inclusion in the learning process, and specialized seeds can additionally be identified by high-confidence patterns, if such exist. However, the scope of this work involves selecting seeds from an ambiguous source, biomedical ontologies, thus we do not include an analysis for these specialized cases.


Lexicon               Precision   Correct   Total
BioNELL               .90         18        20
NELL                  .02         5         268
BioNELL+Random        .03         3         82
Complete Dictionary   .09         153       1616

Table 2.2: Precision, total number of predicted genes, and correct predictions, in a named-entity recognition task using a complete lexicon, and lexicons learned with BioNELL and NELL.

[Figure 2.1: Performance per learning iteration for gene lexicons learned using BioNELL, NELL, and BioNELL+Random, over 50 iterations: (a) precision, (b) cumulative correct lexicon items, (c) cumulative incorrect lexicon items.]

We incorporate PMI ranking into BioNELL using a Rank-and-Learn bootstrapping methodology. After each iteration, we rank the instances that have been added to the KB. High-ranking instances are added to the collection of seeds that are used in the next learning iteration. Instances with low PMI are stored in the KB but are not used for learning. We consider a high-ranking instance to have PMI higher than 0.25.
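A minimal sketch of this Rank-and-Learn loop, reusing pmi_rank from the sketch above; here, expand is a placeholder for NELL's coupled set-expansion step, which we do not reimplement:

def rank_and_learn(initial_seeds, category, corpus, expand,
                   iterations=50, threshold=0.25):
    seeds, kb = set(initial_seeds), set(initial_seeds)
    for _ in range(iterations):
        new = expand(seeds) - kb        # candidate facts from this iteration
        kb |= new                       # every learned fact is kept in the KB...
        seeds |= {s for s in new        # ...but only high-PMI instances are
                  if pmi_rank(s, category, corpus) > threshold}  # used as seeds
    return kb, seeds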

2.3.1 Experiment: Learning Category Dictionaries

We used BioNELL to learn the lexicon of a closed category, representing genes of the fruit-fly, a model organism used to study genetics and developmental biology. We measured the precision, total number of instances, and correct instances of the learned lexicons against the full dictionary of genes (Table 2.1). BioNELL, initialized with PMI-ranked seeds, significantly improved the precision of the learned lexicon over NELL. In fact, the two learning systems using Rank-and-Learn bootstrapping resulted in higher precision lexicons, suggesting that constrained bootstrapping using iterative seed ranking successfully eliminates noisy and ambiguous seeds. Lexicons learned using BioNELL show high precision throughout 50 learning iterations, even when initiated with random seeds (Figure 2.1A). By the final iteration, all systems stop accumulating further significant amounts of correct gene instances (Figure 2.1B). Systems that use PMI-based Rank-and-Learn bootstrapping also stop learning incorrect instances. This is in contrast to NELL, which continues learning incorrect examples (Figure 2.1C).

2.3.2 Application: Named-Entity Recognition using a Learned Lexicon

We examined the use of learned gene lexicons for the task of recognizing concepts in free text, using a simple strategy of matching words in the text with terms from the lexicon. We use data from the BioCreative challenge, which includes text abstracts and the IDs of genes that appear in each abstract. We evaluated annotators that were given as input the complete fly-genes dictionary, or lexicons learned using BioNELL and NELL. We show that BioNELL's lexicon achieves both higher precision and recall in this task than NELL's. We report the average precision (over 108 text abstracts) and number of total and correct predictions of gene mentions, compared with the labeled annotations for each text (Table 2.2).
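A minimal sketch of this matching strategy, with toy data; the whitespace tokenization and the scoring helper are our simplifications:

def annotate(abstract, lexicon):
    # Mark a gene mention wherever a token matches a lexicon term.
    return {tok for tok in abstract.split() if tok in lexicon}

def precision_recall(predicted, gold):
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

lexicon = {"white", "rosy", "vestigial"}       # toy learned gene lexicon
gold = {"white", "vestigial"}                  # labeled mentions for one abstract
pred = annotate("the white gene and the vestigial mutant", lexicon)
print(precision_recall(pred, gold))            # (1.0, 1.0) on this toy text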


Chapter 3

Grounded Discovery of Coordinate Term Relationships between Software Entities

3.1 Introduction

Discovering semantic relations between text entities is a key task in natural language understanding. It is a critical component which enables the success of knowledge representation systems such as TextRunner [84], ReVerb [22], and NELL [11], which in turn are useful for a variety of NLP applications, including temporal scoping [75], semantic parsing [41] and entity linking [47]. In this work, we examine coordinate relations between words. According to the WordNet glossary, X and Y are defined as coordinate terms if they share a common hypernym [23, 49]. This is a symmetric relation that indicates a semantic similarity, meaning that X and Y are “a type of the same thing”, since they share at least one common ancestor in some hypernym taxonomy (to paraphrase the definition of Snow et al. [70]).
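The definition can be checked mechanically against WordNet; the sketch below treats X and Y as coordinate terms if any of their senses share a direct hypernym. It assumes nltk and its WordNet data are installed, and the example word pair is illustrative:

from nltk.corpus import wordnet as wn

def are_coordinate_terms(x, y):
    # X and Y are coordinate terms if some of their senses share a hypernym.
    for sx in wn.synsets(x):
        for sy in wn.synsets(y):
            if set(sx.hypernyms()) & set(sy.hypernyms()):
                return True
    return False

print(are_coordinate_terms("apple", "pear"))   # both are kinds of edible fruit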

Semantic similarity relations are normally discovered by comparing corpus statistics associated with the entities: for instance, if X and Y usually appear in similar contexts, they are likely to be semantically similar [15, 59, 61]. In some technical domains, we have access to additional information about the real-world objects named by the entities: e.g., we might have biographical data about a person entity, or a 3D structural encoding of a protein entity. In such situations, it seems plausible that a “grounded” NLP method, in which corpus statistics are coupled with data on the real-world referents of X and Y, might lead to improved methods for relation discovery.

Here we explore the idea of grounded relation discovery in the domain of software. We consider the detection of coordinate relations between entities that potentially refer to Java classes. We use a software domain text corpus derived from StackOverflow, where users ask and answer questions about software development, and we extract posts which have been labeled by users as Java related. From this data, we collected a small set of entity pairs that are labeled as coordinate terms (or not) based on high-precision Hearst patterns and frequency statistics, and we attempt to label these pairs using information available from higher-recall approaches based on distributional similarity.

We describe an entity linking method in order to map a given text entity to an underlying class type implementation from the Java standard libraries. Next, we describe the corpus- and code-based information that we use for the relation discovery task. Corpus-based methods include distributional similarity and string matching similarity. Additionally, we use two sources of code-based information: (1) we define the class-context of a Java class in a given code repository, and are therefore able to calculate a code-based distributional similarity measure for classes, and (2) we consider the hierarchical organization of classes, described by the Java class type and namespace hierarchies.
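A minimal sketch of the code-based distributional similarity, under the simplifying assumption that a class-context is the bag of classes a class co-occurs with in source files; the context data below is illustrative:

import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two bag-of-contexts vectors.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Hypothetical class-contexts mined from a code repository.
contexts = {
    "FileReader": Counter({"BufferedReader": 9, "IOException": 6, "File": 4}),
    "FileWriter": Counter({"BufferedWriter": 8, "IOException": 7, "File": 5}),
    "JButton":    Counter({"ActionListener": 9, "JPanel": 7}),
}

print(cosine(contexts["FileReader"], contexts["FileWriter"]))  # shared I/O contexts
print(cosine(contexts["FileReader"], contexts["JButton"]))     # 0.0: disjoint usage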


[Figure: The graph of predicted coordinate-term pairs between Java class entities. Highly connected components correspond to functional software groups, such as UI elements, utility classes, exceptions, and graphic objects.]

/RFDO9DULDEOH7\SH7DEOH/RFDO9DULDEOH7DEOH&KDUVHW(QFRGHU&KDUVHW'HFRGHU

JUHDWHU2U(TXDO7KDQOHVVHU2U(TXDO7KDQ

'RFXPHQW%XLOGHU)DFWRU\'RFXPHQW%XLOGHU

%RROHDQ$WWULEXWH6WULQJ$WWULEXWH

3RUWDEOH5HPRWH2EMHFW8QLFDVW5HPRWH2EMHFW

'206RXUFH6WUHDP6RXUFH7KUHDG*URXSV7KUHDG*URXS

)RFXV/LVWQHU$FWLRQ/LVWQHU

-&RPER%R[�-&RPER%R[�7UHH&HOO5HQGHUHU7UHH&HOO(GLWRU

3.&6�(QFRGHG.H\6SHF;���(QFRGHG.H\6SHF

3LSHG5HDGHU3LSHG:ULWHU

7\SH0LUURU7\SH(OHPHQW

6RFNHW,PSO)DFWRU\6RFNHW,PSO

VHW,FRQ,PDJHVVHW,FRQ,PDJH

3URSHUW\&KDQJH6XSSRUW3URSHUW\&KDQJH/LVWHQHU3URSHUW\&KDQJH(YHQW

,QSXW6WUHDPV/LQH5HDGHU2XWSXW6WUHDPV

7DEOH&HOO(GLWRU

7DEOH&HOO5HQGHUHU7DEOH0RGHO/LVWHQHU

7UHH0RGHO7UHH1RGH0XWDEOH7UHH1RGH

,QSXW6WUHDP5HDGHUV)LOH5HDGHUV6WULQJ:ULWHUV

5HVRXUFH%XQGOH/LVW5HVRXUFH%XQGOH3URSHUW\5HVRXUFH%XQGOH

*=,3,QSXW6WUHDP*=,32XWSXW6WUHDP'HIODWHU2XWSXW6WUHDP

*UDSKLFV'HYLFH*UDSKLFV(QYLURQPHQW*UDSKLFV&RQILJXUDWLRQ

1XOO3RLQWHU([FHSWLRQV,2([FHSWLRQV

8QNQRZQ2EMHFW([FHSWLRQV

64/([FHSWLRQV

6LPSOH'DWH)RUPDW'DWH)RUPDW

*UHJRULDQ&DOHQGDU

7LPH=RQH

5HHQWUDQW/RFN

$WRPLF,QWHJHU$WRPLF,QWHJHU$UUD\$WRPLF/RQJ

5HFWDQJOH�'7H[WXUH3DLQW/LQH�'

(OOLSVH�'

6ZLQJ8WLWOLHV6ZLQJ:RUNHU6ZLQJ8WLOLWLHV(YHQW4XHXH

'HIDXOW7UHH&HOO5HQGHUHU

'HIDXOW7DEOH&HOO(GLWRU'HIDXOW7DEOH&HOO5HQGHUHU

'HIDXOW/LVW&HOO5HQGHUHU

&KHFN%R[

7H[W)LHOG

&RPER%R[

7H[W$UHD

7H[W)LOHG6FUROO3DQH

%R[/D\RXW

*ULG%DJ/D\RXW

%RUGHU/D\RXW

)ORZ/D\RXW*URXS/D\RXW

6SULQJ/D\RXW*ULG%DJ&RQVWUDLQWV2SWLRQV3DQHO

-)RUPDWWHG7H[W)LHOG1XPEHU)RUPDW

%LJ,QWHJHU

%LJ'HFLPDO

'HFLPDO)RUPDW

0DWK&RQWH[W

0HVVDJH)RUPDW)LHOG3RVLWLRQ

&ODVV/RDGHU

85/&ODVV/RDGHU

&ODVV/RDGHUV

&ODVV3DWK

=LS2XWSXW6WUHDP=LS,QSXW6WUHDP

-DU,QSXW6WUHDP

-DU2XWSXW6WUHDP

=LS)LOH

-DU)LOH

%XIIHUHG,PDJH

,PDJH,2

*UDSKLFV�'$IILQH7UDQVIRUP

7H[W/D\RXW

,PDJH,FRQ

)RQW0HWULFV

:ULWDEOH5DVWHU,PDJH%XIIHU

$OSKD&RPSRVLWH9RODWLOH,PDJH5HQGHUHG,PDJH

&RORU0RGHO'DWD%XIIHU,QW

$IILQH7UDQVIRUP2S

6DPSOH0RGHO

,OOHJDO$UJXPHQW([FHSWLRQ1XOO3RLQWHU([FHSWLRQ

&ODVV&DVW([FHSWLRQ,OOHJDO$FFHVV([FHSWLRQ

64/([FHSWLRQ,OOHJDO6WDWH([FHSWLRQ5XQWLPH([FHSWLRQ

$ULWKPHWLF([FHSWLRQ

&RQFXUUHQW0RGLILFDWLRQ([FHSWLRQ1XPEHU)RUPDW([FHSWLRQ

)LOH1RW)RXQG([FHSWLRQ

,2([FHSWLRQ

2XW2I0HPRU\(UURU,QWHUQDO(UURU

1R&ODVV'HI)RXQG(UURU

6RIW5HIHUHQFH

&RUUXSWHG6WUHDP([FHSWLRQ6RFNHW([FHSWLRQ

(2)([FHSWLRQ6HFXULW\([FHSWLRQ

,QVWDQWLDWLRQ([FHSWLRQ([FHSWLRQ,Q,QLWLDOL]HU(UURU&ODVV1RW)RXQG([FHSWLRQ

:HDN5HIHUHQFH

6$;([FHSWLRQV

5HIHUHQFH4XHXH

:LQGRZ/LVWHQHU

-$SSOHW

-'LDORJ-%XWWRQV-)UDPH

-3DQHOV

-/DEHOV

-&RPSRQHQWV

-0HQX

-7DEEHG3DQH-:LQGRZ

-2SWLRQ3DQH

-0HQX,WHPV

-7H[W)LHOGV

-0HQXV

-0HQX%DU

7LPHU7DVN

-'LDORJV

/D\RXW0DQDJHU-3DQOH

6FKHGXOHG([HFXWRU6HUYLFH

-,QWHUQDO)UDPHV

-)UDPHV

-2SWLRQ3DQHV

/D\RXW0DQDJHUV

-$SSOHWV

,PDJH,FRQV

-7H[W$UHDV

-0HQX,WHP

-&RPER%R[V

-3RSXS0HQX

-7RRO%DU

.H\%LQGLQJV

7DEEHG3DQH

+DVK0DS

7UHH0DS

$UUD\/LVW/LQNHG/LVW7UHH6HW

+DVK0DSV$UUD\/LVWV

+DVK6HW

&RQFXUUHQW+DVK0DS

6RUWHG0DS

:HDN+DVK0DS

/LQNHG+DVK0DS

$UUD\%ORFNLQJ4XHXH$UUD\'HTXH

7KUHDG3RRO([HFXWRU$EVWUDFW/LVW&RS\2Q:ULWH$UUD\/LVW/LQNHG+DVK6HW

'HIDXOW/LVW0RGHO3ULRULW\4XHXH7UHH6HWV6RUWHG6HW

+DVK6HWV%ORFNLQJ4XHXH

([HFXWRU6HUYLFH

7KUHDG3RRO

)XWXUH7DVN

/LQNHG/LVWV

$WRPLF5HIHUHQFH

%LW6HW&RQFXUUHQW6NLS/LVW6HW

&RQFXUUHQW6NLS/LVW0DS

1DYLJDEOH0DS

&RQFXUUHQW/LQNHG4XHXH

:HDN5HIHUHQFHV6RIW5HIHUHQFHV

%ORFN4XHXH

'DWD,QSXW6WUHDP%XIIHUHG5HDGHU

)LOH,QSXW6WUHDP

'DWD2XWSXW6WUHDP

,QSXW6WUHDP5HDGHU%XIIHUHG,QSXW6WUHDP

2EMHFW,QSXW6WUHDP

%\WH%XIIHU

)LOH5HDGHU

6WULQJ%XIIHU

,QSXW6WUHDP

3ULQW:ULWHU

+WWS85/&RQQHFWLRQ

%XIIHUHG:ULWHU

)LOWHU5HDGHU

85/&RQQHFWLRQ

2XWSXW6WUHDP

)LOH:ULWHU)LOH2XWSXW6WUHDP

2EMHFW2XWSXW6WUHDP

)LOH&KDQQHO

5DQGRP$FFHVV)LOH

%\WH$UUD\2XWSXW6WUHDP

6WULQJ%XLOGHU

6WULQJ:ULWHU$XGLR,QSXW6WUHDP

3ULQW6WUHDP

)LOWHU2XWSXW6WUHDP

6RFNHW,QSXW6WUHDP,QSXW6RXUFH

%XIIHUHG2XWSXW6WUHDP

2XWSXW6WUHDP:ULWHU

6WULQJ5HDGHU

,QW%XIIHU

3LSHG2XWSXW6WUHDP

+WWSV85/&RQQHFWLRQ

;���&HUWLILFDWH

'DWD,QSXW6RFNHW&KDQQHO

&KDU%XIIHU)ORDW%XIIHU%\WH2UGHU

'DWD2XWSXW

3LSHG,QSXW6WUHDP

6HUYHU6RFNHW&KDQQHO

)LOWHU,QSXW6WUHDP

6WULQJ%XIIHU,QSXW6WUHDP

-7H[W)LHOG

$FWLRQ/LVWHQHU

-3DQHO

-%XWWRQ

0RXVH/LVWHQHU

.H\/LVWHQHU

$FWLRQ(YHQW,WHP/LVWHQHU

-/D\HUHG3DQH

-7UHH

-,QWHUQDO)UDPH

-&RPSRQHQW

-6OLGHU

-&KHFN%R[

-/DEHO

-7DEOH-(GLWRU3DQH

-/LVW -6FUROO3DQH

-5DGLR%XWWRQ

-&RPSQHQW

-7DEOH+HDGHU

0HQX%DU-6FUROO3DQHO

-'HVNWRS3DQH

'HIDXOW7UHH6HOHFWLRQ0RGHO8,0DQDJHU

-5RRW3DQH-&RPER%R[

-7H[W$UHD-7H[W3DQH

-)LOH&KRRVHU

-6SLQQHU$EVWUDFW%XWWRQ

-3URJUHVV%DU

-6HSDUDWRU

-7RJJOH%XWWRQ

-5DGLR%XWWRQ0HQX

0RXVH0RWLRQ/LVWHQHU0RXVH:KHHO/LVWHQHU

7DEOH0RGHO'HIDXOW7DEOH0RGHO

7DEOH5RZ6RUWHU

'HIDXOW6W\OHG'RFXPHQW+70/(GLWRU.LW

&HOO5HQGHUHU

&RPER%R[0RGHO'HIDXOW&RPER%R[0RGHO

6W\OH&RQVWDQWV-7DEOHV

/LVW6HOHFWLRQ/LVWHQHU/LVW0RGHO

.H\(YHQW)RFXV/LVWHQHU

-5DGLR%XWWRQV

6LPSOH$WWULEXWH6HW

-6FUROO%DU

,WHP(YHQWV

)LOH'LDORJ

:LQGRZ)RFXV/LVWHQHU

3URJUHVV0RQLWRU

5RZ)LOWHU

6WULQJ%XLOGHUV6WULQJ%XIIHUV

(YHQW/LVWHQHU3UR[\(YHQW/LVWHQHU/LVW

7UD\,FRQ6\VWHP7UD\

.H\(YHQWV0RXVH(YHQWV

$FWLRQ0DS,QSXW0DS

(QXP6HW(QXP0DS

%XIIHUHG:ULWHUV%XIIHUHG5HDGHUV

)LOH&KDQQHOV%\WH%XIIHUV

M7H[W$UHD��M&RPER%R[��

;0/6WUHDP:ULWHU'HIDXOW+DQGOHU

5ROH/LVW$UU\D/LVW

7DEOH9LHZHU9LHZ)LOWHU

'DWDJUDP3DFNHW'DWDJUDP6RFNHW

([HFXWLRQ([FHSWLRQ,QWHUUXSWHG([FHSWLRQ

,PDJH:ULWHU,PDJH5HDGHU

$FWLRQ/LVWHQHUV)RFXV/LVWHQHUV

,QYDOLG3DUDPHWHU([FHSWLRQ,OOHJDO$UJXPHQW([FSWLRQ'RFXPHQW/LVWHQHU'RFXPHQW)LOWHU

1R&ODVV'HI)RXQG(UURUV&ODVV1RW)RXQG([FHSWLRQV

6HFXULW\0DQDJHU$FFHVVLEOH2EMHFW3LSH2XWSXW6WUHDP3LSH,QSXW6WUHDP

5PL&RQQHFWRU&OLHQW5PL&RQQHFWRU$GGUHVV

;0/'HFRGHU;0/(QFRGHU

3ULYDWH.H\3XEOLF.H\

8QNQRZQ+RVW([FHSWLRQ8QNQRZQ([FHSWLRQ

$FWLRQ0DSV,QSXW0DSV2EMHFW,QSXW2EMHFW2XWSXW

5RRW3DQH/D\HUHG3DQH

,,2,PDJH,,20HWD'DWD

/RFDO9DULDEOH7\SH7DEOH/RFDO9DULDEOH7DEOH&KDUVHW(QFRGHU&KDUVHW'HFRGHU

JUHDWHU2U(TXDO7KDQOHVVHU2U(TXDO7KDQ

'RFXPHQW%XLOGHU)DFWRU\'RFXPHQW%XLOGHU

%RROHDQ$WWULEXWH6WULQJ$WWULEXWH

3RUWDEOH5HPRWH2EMHFW8QLFDVW5HPRWH2EMHFW

'206RXUFH6WUHDP6RXUFH7KUHDG*URXSV7KUHDG*URXS

)RFXV/LVWQHU$FWLRQ/LVWQHU

-&RPER%R[�-&RPER%R[�7UHH&HOO5HQGHUHU7UHH&HOO(GLWRU

3.&6�(QFRGHG.H\6SHF;���(QFRGHG.H\6SHF

3LSHG5HDGHU3LSHG:ULWHU

7\SH0LUURU7\SH(OHPHQW

6RFNHW,PSO)DFWRU\6RFNHW,PSO

VHW,FRQ,PDJHVVHW,FRQ,PDJH

3URSHUW\&KDQJH6XSSRUW3URSHUW\&KDQJH/LVWHQHU3URSHUW\&KDQJH(YHQW

,QSXW6WUHDPV/LQH5HDGHU2XWSXW6WUHDPV

7DEOH&HOO(GLWRU

7DEOH&HOO5HQGHUHU7DEOH0RGHO/LVWHQHU

7UHH0RGHO7UHH1RGH0XWDEOH7UHH1RGH

,QSXW6WUHDP5HDGHUV)LOH5HDGHUV6WULQJ:ULWHUV

5HVRXUFH%XQGOH/LVW5HVRXUFH%XQGOH3URSHUW\5HVRXUFH%XQGOH

*=,3,QSXW6WUHDP*=,32XWSXW6WUHDP'HIODWHU2XWSXW6WUHDP

*UDSKLFV'HYLFH*UDSKLFV(QYLURQPHQW*UDSKLFV&RQILJXUDWLRQ

1XOO3RLQWHU([FHSWLRQV,2([FHSWLRQV

8QNQRZQ2EMHFW([FHSWLRQV

64/([FHSWLRQV

6LPSOH'DWH)RUPDW'DWH)RUPDW

*UHJRULDQ&DOHQGDU

7LPH=RQH

5HHQWUDQW/RFN

$WRPLF,QWHJHU$WRPLF,QWHJHU$UUD\$WRPLF/RQJ

5HFWDQJOH�'7H[WXUH3DLQW/LQH�'

(OOLSVH�'

6ZLQJ8WLWOLHV6ZLQJ:RUNHU6ZLQJ8WLOLWLHV(YHQW4XHXH

'HIDXOW7UHH&HOO5HQGHUHU

'HIDXOW7DEOH&HOO(GLWRU'HIDXOW7DEOH&HOO5HQGHUHU

'HIDXOW/LVW&HOO5HQGHUHU

&KHFN%R[

7H[W)LHOG

&RPER%R[

7H[W$UHD

7H[W)LOHG6FUROO3DQH

%R[/D\RXW

*ULG%DJ/D\RXW

%RUGHU/D\RXW

)ORZ/D\RXW*URXS/D\RXW

6SULQJ/D\RXW*ULG%DJ&RQVWUDLQWV2SWLRQV3DQHO

-)RUPDWWHG7H[W)LHOG1XPEHU)RUPDW

%LJ,QWHJHU

%LJ'HFLPDO

'HFLPDO)RUPDW

0DWK&RQWH[W

0HVVDJH)RUPDW)LHOG3RVLWLRQ

&ODVV/RDGHU

85/&ODVV/RDGHU

&ODVV/RDGHUV

&ODVV3DWK

=LS2XWSXW6WUHDP=LS,QSXW6WUHDP

-DU,QSXW6WUHDP

-DU2XWSXW6WUHDP

=LS)LOH

-DU)LOH

%XIIHUHG,PDJH

,PDJH,2

*UDSKLFV�'$IILQH7UDQVIRUP

7H[W/D\RXW

,PDJH,FRQ

)RQW0HWULFV

:ULWDEOH5DVWHU,PDJH%XIIHU

$OSKD&RPSRVLWH9RODWLOH,PDJH5HQGHUHG,PDJH

&RORU0RGHO'DWD%XIIHU,QW

$IILQH7UDQVIRUP2S

6DPSOH0RGHO

,OOHJDO$UJXPHQW([FHSWLRQ1XOO3RLQWHU([FHSWLRQ

&ODVV&DVW([FHSWLRQ,OOHJDO$FFHVV([FHSWLRQ

64/([FHSWLRQ,OOHJDO6WDWH([FHSWLRQ5XQWLPH([FHSWLRQ

$ULWKPHWLF([FHSWLRQ

&RQFXUUHQW0RGLILFDWLRQ([FHSWLRQ1XPEHU)RUPDW([FHSWLRQ

)LOH1RW)RXQG([FHSWLRQ

,2([FHSWLRQ

2XW2I0HPRU\(UURU,QWHUQDO(UURU

1R&ODVV'HI)RXQG(UURU

6RIW5HIHUHQFH

&RUUXSWHG6WUHDP([FHSWLRQ6RFNHW([FHSWLRQ

(2)([FHSWLRQ6HFXULW\([FHSWLRQ

,QVWDQWLDWLRQ([FHSWLRQ([FHSWLRQ,Q,QLWLDOL]HU(UURU&ODVV1RW)RXQG([FHSWLRQ

:HDN5HIHUHQFH

6$;([FHSWLRQV

5HIHUHQFH4XHXH

:LQGRZ/LVWHQHU

-$SSOHW

-'LDORJ-%XWWRQV-)UDPH

-3DQHOV

-/DEHOV

-&RPSRQHQWV

-0HQX

-7DEEHG3DQH-:LQGRZ

-2SWLRQ3DQH

-0HQX,WHPV

-7H[W)LHOGV

-0HQXV

-0HQX%DU

7LPHU7DVN

-'LDORJV

/D\RXW0DQDJHU-3DQOH

6FKHGXOHG([HFXWRU6HUYLFH

-,QWHUQDO)UDPHV

-)UDPHV

-2SWLRQ3DQHV

/D\RXW0DQDJHUV

-$SSOHWV

,PDJH,FRQV

-7H[W$UHDV

-0HQX,WHP

-&RPER%R[V

-3RSXS0HQX

-7RRO%DU

.H\%LQGLQJV

7DEEHG3DQH

+DVK0DS

7UHH0DS

$UUD\/LVW/LQNHG/LVW7UHH6HW

+DVK0DSV$UUD\/LVWV

+DVK6HW

&RQFXUUHQW+DVK0DS

6RUWHG0DS

:HDN+DVK0DS

/LQNHG+DVK0DS

$UUD\%ORFNLQJ4XHXH$UUD\'HTXH

7KUHDG3RRO([HFXWRU$EVWUDFW/LVW&RS\2Q:ULWH$UUD\/LVW/LQNHG+DVK6HW

'HIDXOW/LVW0RGHO3ULRULW\4XHXH7UHH6HWV6RUWHG6HW

+DVK6HWV%ORFNLQJ4XHXH

([HFXWRU6HUYLFH

7KUHDG3RRO

)XWXUH7DVN

/LQNHG/LVWV

$WRPLF5HIHUHQFH

%LW6HW&RQFXUUHQW6NLS/LVW6HW

&RQFXUUHQW6NLS/LVW0DS

1DYLJDEOH0DS

&RQFXUUHQW/LQNHG4XHXH

:HDN5HIHUHQFHV6RIW5HIHUHQFHV

%ORFN4XHXH

'DWD,QSXW6WUHDP%XIIHUHG5HDGHU

)LOH,QSXW6WUHDP

'DWD2XWSXW6WUHDP

,QSXW6WUHDP5HDGHU%XIIHUHG,QSXW6WUHDP

2EMHFW,QSXW6WUHDP

%\WH%XIIHU

)LOH5HDGHU

6WULQJ%XIIHU

,QSXW6WUHDP

3ULQW:ULWHU

+WWS85/&RQQHFWLRQ

%XIIHUHG:ULWHU

)LOWHU5HDGHU

85/&RQQHFWLRQ

2XWSXW6WUHDP

)LOH:ULWHU)LOH2XWSXW6WUHDP

2EMHFW2XWSXW6WUHDP

)LOH&KDQQHO

5DQGRP$FFHVV)LOH

%\WH$UUD\2XWSXW6WUHDP

6WULQJ%XLOGHU

6WULQJ:ULWHU$XGLR,QSXW6WUHDP

3ULQW6WUHDP

)LOWHU2XWSXW6WUHDP

6RFNHW,QSXW6WUHDP,QSXW6RXUFH

%XIIHUHG2XWSXW6WUHDP

2XWSXW6WUHDP:ULWHU

6WULQJ5HDGHU

,QW%XIIHU

3LSHG2XWSXW6WUHDP

+WWSV85/&RQQHFWLRQ

;���&HUWLILFDWH

'DWD,QSXW6RFNHW&KDQQHO

&KDU%XIIHU)ORDW%XIIHU%\WH2UGHU

'DWD2XWSXW

3LSHG,QSXW6WUHDP

6HUYHU6RFNHW&KDQQHO

)LOWHU,QSXW6WUHDP

6WULQJ%XIIHU,QSXW6WUHDP

-7H[W)LHOG

$FWLRQ/LVWHQHU

-3DQHO

-%XWWRQ

0RXVH/LVWHQHU

.H\/LVWHQHU

$FWLRQ(YHQW,WHP/LVWHQHU

-/D\HUHG3DQH

-7UHH

-,QWHUQDO)UDPH

-&RPSRQHQW

-6OLGHU

-&KHFN%R[

-/DEHO

-7DEOH-(GLWRU3DQH

-/LVW -6FUROO3DQH

-5DGLR%XWWRQ

-&RPSQHQW

-7DEOH+HDGHU

0HQX%DU-6FUROO3DQHO

-'HVNWRS3DQH

'HIDXOW7UHH6HOHFWLRQ0RGHO8,0DQDJHU

-5RRW3DQH-&RPER%R[

-7H[W$UHD-7H[W3DQH

-)LOH&KRRVHU

-6SLQQHU$EVWUDFW%XWWRQ

-3URJUHVV%DU

-6HSDUDWRU

-7RJJOH%XWWRQ

-5DGLR%XWWRQ0HQX

0RXVH0RWLRQ/LVWHQHU0RXVH:KHHO/LVWHQHU

7DEOH0RGHO'HIDXOW7DEOH0RGHO

7DEOH5RZ6RUWHU

'HIDXOW6W\OHG'RFXPHQW+70/(GLWRU.LW

&HOO5HQGHUHU

&RPER%R[0RGHO'HIDXOW&RPER%R[0RGHO

6W\OH&RQVWDQWV-7DEOHV

/LVW6HOHFWLRQ/LVWHQHU/LVW0RGHO

.H\(YHQW)RFXV/LVWHQHU

-5DGLR%XWWRQV

6LPSOH$WWULEXWH6HW

-6FUROO%DU

,WHP(YHQWV

)LOH'LDORJ

:LQGRZ)RFXV/LVWHQHU

3URJUHVV0RQLWRU

5RZ)LOWHU

6WULQJ%XLOGHUV6WULQJ%XIIHUV

(YHQW/LVWHQHU3UR[\(YHQW/LVWHQHU/LVW

7UD\,FRQ6\VWHP7UD\

.H\(YHQWV0RXVH(YHQWV

$FWLRQ0DS,QSXW0DS

(QXP6HW(QXP0DS

%XIIHUHG:ULWHUV%XIIHUHG5HDGHUV

)LOH&KDQQHOV%\WH%XIIHUV

M7H[W$UHD��M&RPER%R[��

;0/6WUHDP:ULWHU'HIDXOW+DQGOHU

5ROH/LVW$UU\D/LVW

7DEOH9LHZHU9LHZ)LOWHU

'DWDJUDP3DFNHW'DWDJUDP6RFNHW

([HFXWLRQ([FHSWLRQ,QWHUUXSWHG([FHSWLRQ

,PDJH:ULWHU,PDJH5HDGHU

$FWLRQ/LVWHQHUV)RFXV/LVWHQHUV

,QYDOLG3DUDPHWHU([FHSWLRQ,OOHJDO$UJXPHQW([FSWLRQ'RFXPHQW/LVWHQHU'RFXPHQW)LOWHU

1R&ODVV'HI)RXQG(UURUV&ODVV1RW)RXQG([FHSWLRQV

6HFXULW\0DQDJHU$FFHVVLEOH2EMHFW3LSH2XWSXW6WUHDP3LSH,QSXW6WUHDP

5PL&RQQHFWRU&OLHQW5PL&RQQHFWRU$GGUHVV

;0/'HFRGHU;0/(QFRGHU

3ULYDWH.H\3XEOLF.H\

8QNQRZQ+RVW([FHSWLRQ8QNQRZQ([FHSWLRQ

$FWLRQ0DSV,QSXW0DSV2EMHFW,QSXW2EMHFW2XWSXW

5RRW3DQH/D\HUHG3DQH

,,2,PDJH,,20HWD'DWD

/RFDO9DULDEOH7\SH7DEOH/RFDO9DULDEOH7DEOH&KDUVHW(QFRGHU&KDUVHW'HFRGHU

JUHDWHU2U(TXDO7KDQOHVVHU2U(TXDO7KDQ

'RFXPHQW%XLOGHU)DFWRU\'RFXPHQW%XLOGHU

%RROHDQ$WWULEXWH6WULQJ$WWULEXWH

3RUWDEOH5HPRWH2EMHFW8QLFDVW5HPRWH2EMHFW

'206RXUFH6WUHDP6RXUFH7KUHDG*URXSV7KUHDG*URXS

)RFXV/LVWQHU$FWLRQ/LVWQHU

-&RPER%R[�-&RPER%R[�7UHH&HOO5HQGHUHU7UHH&HOO(GLWRU

3.&6�(QFRGHG.H\6SHF;���(QFRGHG.H\6SHF

3LSHG5HDGHU3LSHG:ULWHU

7\SH0LUURU7\SH(OHPHQW

6RFNHW,PSO)DFWRU\6RFNHW,PSO

VHW,FRQ,PDJHVVHW,FRQ,PDJH

3URSHUW\&KDQJH6XSSRUW3URSHUW\&KDQJH/LVWHQHU3URSHUW\&KDQJH(YHQW

,QSXW6WUHDPV/LQH5HDGHU2XWSXW6WUHDPV

7DEOH&HOO(GLWRU

7DEOH&HOO5HQGHUHU7DEOH0RGHO/LVWHQHU

7UHH0RGHO7UHH1RGH0XWDEOH7UHH1RGH

,QSXW6WUHDP5HDGHUV)LOH5HDGHUV6WULQJ:ULWHUV

5HVRXUFH%XQGOH/LVW5HVRXUFH%XQGOH3URSHUW\5HVRXUFH%XQGOH

*=,3,QSXW6WUHDP*=,32XWSXW6WUHDP'HIODWHU2XWSXW6WUHDP

*UDSKLFV'HYLFH*UDSKLFV(QYLURQPHQW*UDSKLFV&RQILJXUDWLRQ

1XOO3RLQWHU([FHSWLRQV,2([FHSWLRQV

8QNQRZQ2EMHFW([FHSWLRQV

64/([FHSWLRQV

6LPSOH'DWH)RUPDW'DWH)RUPDW

*UHJRULDQ&DOHQGDU

7LPH=RQH

5HHQWUDQW/RFN

$WRPLF,QWHJHU$WRPLF,QWHJHU$UUD\$WRPLF/RQJ

5HFWDQJOH�'7H[WXUH3DLQW/LQH�'

(OOLSVH�'

6ZLQJ8WLWOLHV6ZLQJ:RUNHU6ZLQJ8WLOLWLHV(YHQW4XHXH

'HIDXOW7UHH&HOO5HQGHUHU

'HIDXOW7DEOH&HOO(GLWRU'HIDXOW7DEOH&HOO5HQGHUHU

'HIDXOW/LVW&HOO5HQGHUHU

&KHFN%R[

7H[W)LHOG

&RPER%R[

7H[W$UHD

7H[W)LOHG6FUROO3DQH

%R[/D\RXW

*ULG%DJ/D\RXW

%RUGHU/D\RXW

)ORZ/D\RXW*URXS/D\RXW

6SULQJ/D\RXW*ULG%DJ&RQVWUDLQWV2SWLRQV3DQHO

-)RUPDWWHG7H[W)LHOG1XPEHU)RUPDW

%LJ,QWHJHU

%LJ'HFLPDO

'HFLPDO)RUPDW

0DWK&RQWH[W

0HVVDJH)RUPDW)LHOG3RVLWLRQ

&ODVV/RDGHU

85/&ODVV/RDGHU

&ODVV/RDGHUV

&ODVV3DWK

=LS2XWSXW6WUHDP=LS,QSXW6WUHDP

-DU,QSXW6WUHDP

-DU2XWSXW6WUHDP

=LS)LOH

-DU)LOH

%XIIHUHG,PDJH

,PDJH,2

*UDSKLFV�'$IILQH7UDQVIRUP

7H[W/D\RXW

,PDJH,FRQ

)RQW0HWULFV

:ULWDEOH5DVWHU,PDJH%XIIHU

$OSKD&RPSRVLWH9RODWLOH,PDJH5HQGHUHG,PDJH

&RORU0RGHO'DWD%XIIHU,QW

$IILQH7UDQVIRUP2S

6DPSOH0RGHO

,OOHJDO$UJXPHQW([FHSWLRQ1XOO3RLQWHU([FHSWLRQ

&ODVV&DVW([FHSWLRQ,OOHJDO$FFHVV([FHSWLRQ

64/([FHSWLRQ,OOHJDO6WDWH([FHSWLRQ5XQWLPH([FHSWLRQ

$ULWKPHWLF([FHSWLRQ

&RQFXUUHQW0RGLILFDWLRQ([FHSWLRQ1XPEHU)RUPDW([FHSWLRQ

)LOH1RW)RXQG([FHSWLRQ

,2([FHSWLRQ

2XW2I0HPRU\(UURU,QWHUQDO(UURU

1R&ODVV'HI)RXQG(UURU

6RIW5HIHUHQFH

&RUUXSWHG6WUHDP([FHSWLRQ6RFNHW([FHSWLRQ

(2)([FHSWLRQ6HFXULW\([FHSWLRQ

,QVWDQWLDWLRQ([FHSWLRQ([FHSWLRQ,Q,QLWLDOL]HU(UURU&ODVV1RW)RXQG([FHSWLRQ

:HDN5HIHUHQFH

6$;([FHSWLRQV

5HIHUHQFH4XHXH

:LQGRZ/LVWHQHU

-$SSOHW

-'LDORJ-%XWWRQV-)UDPH

-3DQHOV

-/DEHOV

-&RPSRQHQWV

-0HQX

-7DEEHG3DQH-:LQGRZ

-2SWLRQ3DQH

-0HQX,WHPV

-7H[W)LHOGV

-0HQXV

-0HQX%DU

7LPHU7DVN

-'LDORJV

/D\RXW0DQDJHU-3DQOH

6FKHGXOHG([HFXWRU6HUYLFH

-,QWHUQDO)UDPHV

-)UDPHV

-2SWLRQ3DQHV

/D\RXW0DQDJHUV

-$SSOHWV

,PDJH,FRQV

-7H[W$UHDV

-0HQX,WHP

-&RPER%R[V

-3RSXS0HQX

-7RRO%DU

.H\%LQGLQJV

7DEEHG3DQH

+DVK0DS

7UHH0DS

$UUD\/LVW/LQNHG/LVW7UHH6HW

+DVK0DSV$UUD\/LVWV

+DVK6HW

&RQFXUUHQW+DVK0DS

6RUWHG0DS

:HDN+DVK0DS

/LQNHG+DVK0DS

$UUD\%ORFNLQJ4XHXH$UUD\'HTXH

7KUHDG3RRO([HFXWRU$EVWUDFW/LVW&RS\2Q:ULWH$UUD\/LVW/LQNHG+DVK6HW

'HIDXOW/LVW0RGHO3ULRULW\4XHXH7UHH6HWV6RUWHG6HW

+DVK6HWV%ORFNLQJ4XHXH

([HFXWRU6HUYLFH

7KUHDG3RRO

)XWXUH7DVN

/LQNHG/LVWV

$WRPLF5HIHUHQFH

%LW6HW&RQFXUUHQW6NLS/LVW6HW

&RQFXUUHQW6NLS/LVW0DS

1DYLJDEOH0DS

&RQFXUUHQW/LQNHG4XHXH

:HDN5HIHUHQFHV6RIW5HIHUHQFHV

%ORFN4XHXH

'DWD,QSXW6WUHDP%XIIHUHG5HDGHU

)LOH,QSXW6WUHDP

'DWD2XWSXW6WUHDP

,QSXW6WUHDP5HDGHU%XIIHUHG,QSXW6WUHDP

2EMHFW,QSXW6WUHDP

%\WH%XIIHU

)LOH5HDGHU

6WULQJ%XIIHU

,QSXW6WUHDP

3ULQW:ULWHU

+WWS85/&RQQHFWLRQ

%XIIHUHG:ULWHU

)LOWHU5HDGHU

85/&RQQHFWLRQ

2XWSXW6WUHDP

)LOH:ULWHU)LOH2XWSXW6WUHDP

2EMHFW2XWSXW6WUHDP

)LOH&KDQQHO

5DQGRP$FFHVV)LOH

%\WH$UUD\2XWSXW6WUHDP

6WULQJ%XLOGHU

6WULQJ:ULWHU$XGLR,QSXW6WUHDP

3ULQW6WUHDP

)LOWHU2XWSXW6WUHDP

6RFNHW,QSXW6WUHDP,QSXW6RXUFH

%XIIHUHG2XWSXW6WUHDP

2XWSXW6WUHDP:ULWHU

6WULQJ5HDGHU

,QW%XIIHU

3LSHG2XWSXW6WUHDP

+WWSV85/&RQQHFWLRQ

;���&HUWLILFDWH

'DWD,QSXW6RFNHW&KDQQHO

&KDU%XIIHU)ORDW%XIIHU%\WH2UGHU

'DWD2XWSXW

3LSHG,QSXW6WUHDP

6HUYHU6RFNHW&KDQQHO

)LOWHU,QSXW6WUHDP

6WULQJ%XIIHU,QSXW6WUHDP

-7H[W)LHOG

$FWLRQ/LVWHQHU

-3DQHO

-%XWWRQ

0RXVH/LVWHQHU

.H\/LVWHQHU

$FWLRQ(YHQW,WHP/LVWHQHU

-/D\HUHG3DQH

-7UHH

-,QWHUQDO)UDPH

-&RPSRQHQW

-6OLGHU

-&KHFN%R[

-/DEHO

-7DEOH-(GLWRU3DQH

-/LVW -6FUROO3DQH

-5DGLR%XWWRQ

-&RPSQHQW

-7DEOH+HDGHU

0HQX%DU-6FUROO3DQHO

-'HVNWRS3DQH

'HIDXOW7UHH6HOHFWLRQ0RGHO8,0DQDJHU

-5RRW3DQH-&RPER%R[

-7H[W$UHD-7H[W3DQH

-)LOH&KRRVHU

-6SLQQHU$EVWUDFW%XWWRQ

-3URJUHVV%DU

-6HSDUDWRU

-7RJJOH%XWWRQ

-5DGLR%XWWRQ0HQX

0RXVH0RWLRQ/LVWHQHU0RXVH:KHHO/LVWHQHU

7DEOH0RGHO'HIDXOW7DEOH0RGHO

7DEOH5RZ6RUWHU

'HIDXOW6W\OHG'RFXPHQW+70/(GLWRU.LW

&HOO5HQGHUHU

&RPER%R[0RGHO'HIDXOW&RPER%R[0RGHO

6W\OH&RQVWDQWV-7DEOHV

/LVW6HOHFWLRQ/LVWHQHU/LVW0RGHO

.H\(YHQW)RFXV/LVWHQHU

-5DGLR%XWWRQV

6LPSOH$WWULEXWH6HW

-6FUROO%DU

,WHP(YHQWV

)LOH'LDORJ

:LQGRZ)RFXV/LVWHQHU

3URJUHVV0RQLWRU

5RZ)LOWHU

6WULQJ%XLOGHUV6WULQJ%XIIHUV

(YHQW/LVWHQHU3UR[\(YHQW/LVWHQHU/LVW

7UD\,FRQ6\VWHP7UD\

.H\(YHQWV0RXVH(YHQWV

$FWLRQ0DS,QSXW0DS

(QXP6HW(QXP0DS

%XIIHUHG:ULWHUV%XIIHUHG5HDGHUV

)LOH&KDQQHOV%\WH%XIIHUV

M7H[W$UHD��M&RPER%R[��

;0/6WUHDP:ULWHU'HIDXOW+DQGOHU

5ROH/LVW$UU\D/LVW

7DEOH9LHZHU9LHZ)LOWHU

'DWDJUDP3DFNHW'DWDJUDP6RFNHW

([HFXWLRQ([FHSWLRQ,QWHUUXSWHG([FHSWLRQ

,PDJH:ULWHU,PDJH5HDGHU

$FWLRQ/LVWHQHUV)RFXV/LVWHQHUV

,QYDOLG3DUDPHWHU([FHSWLRQ,OOHJDO$UJXPHQW([FSWLRQ'RFXPHQW/LVWHQHU'RFXPHQW)LOWHU

1R&ODVV'HI)RXQG(UURUV&ODVV1RW)RXQG([FHSWLRQV

6HFXULW\0DQDJHU$FFHVVLEOH2EMHFW3LSH2XWSXW6WUHDP3LSH,QSXW6WUHDP

5PL&RQQHFWRU&OLHQW5PL&RQQHFWRU$GGUHVV

;0/'HFRGHU;0/(QFRGHU

3ULYDWH.H\3XEOLF.H\

8QNQRZQ+RVW([FHSWLRQ8QNQRZQ([FHSWLRQ

$FWLRQ0DSV,QSXW0DSV2EMHFW,QSXW2EMHFW2XWSXW

5RRW3DQH/D\HUHG3DQH

,,2,PDJH,,20HWD'DWD

/RFDO9DULDEOH7\SH7DEOH/RFDO9DULDEOH7DEOH&KDUVHW(QFRGHU&KDUVHW'HFRGHU

JUHDWHU2U(TXDO7KDQOHVVHU2U(TXDO7KDQ

'RFXPHQW%XLOGHU)DFWRU\'RFXPHQW%XLOGHU

%RROHDQ$WWULEXWH6WULQJ$WWULEXWH

3RUWDEOH5HPRWH2EMHFW8QLFDVW5HPRWH2EMHFW

'206RXUFH6WUHDP6RXUFH7KUHDG*URXSV7KUHDG*URXS

)RFXV/LVWQHU$FWLRQ/LVWQHU

-&RPER%R[�-&RPER%R[�7UHH&HOO5HQGHUHU7UHH&HOO(GLWRU

3.&6�(QFRGHG.H\6SHF;���(QFRGHG.H\6SHF

3LSHG5HDGHU3LSHG:ULWHU

7\SH0LUURU7\SH(OHPHQW

6RFNHW,PSO)DFWRU\6RFNHW,PSO

VHW,FRQ,PDJHVVHW,FRQ,PDJH

3URSHUW\&KDQJH6XSSRUW3URSHUW\&KDQJH/LVWHQHU3URSHUW\&KDQJH(YHQW

,QSXW6WUHDPV/LQH5HDGHU2XWSXW6WUHDPV

7DEOH&HOO(GLWRU

7DEOH&HOO5HQGHUHU7DEOH0RGHO/LVWHQHU

7UHH0RGHO7UHH1RGH0XWDEOH7UHH1RGH

,QSXW6WUHDP5HDGHUV)LOH5HDGHUV6WULQJ:ULWHUV

5HVRXUFH%XQGOH/LVW5HVRXUFH%XQGOH3URSHUW\5HVRXUFH%XQGOH

*=,3,QSXW6WUHDP*=,32XWSXW6WUHDP'HIODWHU2XWSXW6WUHDP

*UDSKLFV'HYLFH*UDSKLFV(QYLURQPHQW*UDSKLFV&RQILJXUDWLRQ

1XOO3RLQWHU([FHSWLRQV,2([FHSWLRQV

8QNQRZQ2EMHFW([FHSWLRQV

64/([FHSWLRQV

6LPSOH'DWH)RUPDW'DWH)RUPDW

*UHJRULDQ&DOHQGDU

7LPH=RQH

5HHQWUDQW/RFN

$WRPLF,QWHJHU$WRPLF,QWHJHU$UUD\$WRPLF/RQJ

5HFWDQJOH�'7H[WXUH3DLQW/LQH�'

(OOLSVH�'

6ZLQJ8WLWOLHV6ZLQJ:RUNHU6ZLQJ8WLOLWLHV(YHQW4XHXH

'HIDXOW7UHH&HOO5HQGHUHU

'HIDXOW7DEOH&HOO(GLWRU'HIDXOW7DEOH&HOO5HQGHUHU

'HIDXOW/LVW&HOO5HQGHUHU

&KHFN%R[

7H[W)LHOG

&RPER%R[

7H[W$UHD

7H[W)LOHG6FUROO3DQH

%R[/D\RXW

*ULG%DJ/D\RXW

%RUGHU/D\RXW

)ORZ/D\RXW*URXS/D\RXW

6SULQJ/D\RXW*ULG%DJ&RQVWUDLQWV2SWLRQV3DQHO

-)RUPDWWHG7H[W)LHOG1XPEHU)RUPDW

%LJ,QWHJHU

%LJ'HFLPDO

'HFLPDO)RUPDW

0DWK&RQWH[W

0HVVDJH)RUPDW)LHOG3RVLWLRQ

&ODVV/RDGHU

85/&ODVV/RDGHU

&ODVV/RDGHUV

&ODVV3DWK

=LS2XWSXW6WUHDP=LS,QSXW6WUHDP

-DU,QSXW6WUHDP

-DU2XWSXW6WUHDP

=LS)LOH

-DU)LOH

%XIIHUHG,PDJH

,PDJH,2

*UDSKLFV�'$IILQH7UDQVIRUP

7H[W/D\RXW

,PDJH,FRQ

)RQW0HWULFV

:ULWDEOH5DVWHU,PDJH%XIIHU

$OSKD&RPSRVLWH9RODWLOH,PDJH5HQGHUHG,PDJH

&RORU0RGHO'DWD%XIIHU,QW

$IILQH7UDQVIRUP2S

6DPSOH0RGHO

,OOHJDO$UJXPHQW([FHSWLRQ1XOO3RLQWHU([FHSWLRQ

&ODVV&DVW([FHSWLRQ,OOHJDO$FFHVV([FHSWLRQ

64/([FHSWLRQ,OOHJDO6WDWH([FHSWLRQ5XQWLPH([FHSWLRQ

$ULWKPHWLF([FHSWLRQ

&RQFXUUHQW0RGLILFDWLRQ([FHSWLRQ1XPEHU)RUPDW([FHSWLRQ

)LOH1RW)RXQG([FHSWLRQ

,2([FHSWLRQ

2XW2I0HPRU\(UURU,QWHUQDO(UURU

1R&ODVV'HI)RXQG(UURU

6RIW5HIHUHQFH

&RUUXSWHG6WUHDP([FHSWLRQ6RFNHW([FHSWLRQ

(2)([FHSWLRQ6HFXULW\([FHSWLRQ

,QVWDQWLDWLRQ([FHSWLRQ([FHSWLRQ,Q,QLWLDOL]HU(UURU&ODVV1RW)RXQG([FHSWLRQ

:HDN5HIHUHQFH

6$;([FHSWLRQV

5HIHUHQFH4XHXH

:LQGRZ/LVWHQHU

-$SSOHW

-'LDORJ-%XWWRQV-)UDPH

-3DQHOV

-/DEHOV

-&RPSRQHQWV

-0HQX

-7DEEHG3DQH-:LQGRZ

-2SWLRQ3DQH

-0HQX,WHPV

-7H[W)LHOGV

-0HQXV

-0HQX%DU

7LPHU7DVN

-'LDORJV

/D\RXW0DQDJHU-3DQOH

6FKHGXOHG([HFXWRU6HUYLFH

-,QWHUQDO)UDPHV

-)UDPHV

-2SWLRQ3DQHV

/D\RXW0DQDJHUV

-$SSOHWV

,PDJH,FRQV

-7H[W$UHDV

-0HQX,WHP

-&RPER%R[V

-3RSXS0HQX

-7RRO%DU

.H\%LQGLQJV

7DEEHG3DQH

+DVK0DS

7UHH0DS

$UUD\/LVW/LQNHG/LVW7UHH6HW

+DVK0DSV$UUD\/LVWV

+DVK6HW

&RQFXUUHQW+DVK0DS

6RUWHG0DS

:HDN+DVK0DS

/LQNHG+DVK0DS

$UUD\%ORFNLQJ4XHXH$UUD\'HTXH

7KUHDG3RRO([HFXWRU$EVWUDFW/LVW&RS\2Q:ULWH$UUD\/LVW/LQNHG+DVK6HW

'HIDXOW/LVW0RGHO3ULRULW\4XHXH7UHH6HWV6RUWHG6HW

+DVK6HWV%ORFNLQJ4XHXH

([HFXWRU6HUYLFH

7KUHDG3RRO

)XWXUH7DVN

/LQNHG/LVWV

$WRPLF5HIHUHQFH

%LW6HW&RQFXUUHQW6NLS/LVW6HW

&RQFXUUHQW6NLS/LVW0DS

1DYLJDEOH0DS

&RQFXUUHQW/LQNHG4XHXH

:HDN5HIHUHQFHV6RIW5HIHUHQFHV

%ORFN4XHXH

'DWD,QSXW6WUHDP%XIIHUHG5HDGHU

)LOH,QSXW6WUHDP

'DWD2XWSXW6WUHDP

,QSXW6WUHDP5HDGHU%XIIHUHG,QSXW6WUHDP

2EMHFW,QSXW6WUHDP

%\WH%XIIHU

)LOH5HDGHU

6WULQJ%XIIHU

,QSXW6WUHDP

3ULQW:ULWHU

+WWS85/&RQQHFWLRQ

%XIIHUHG:ULWHU

)LOWHU5HDGHU

85/&RQQHFWLRQ

2XWSXW6WUHDP

)LOH:ULWHU)LOH2XWSXW6WUHDP

2EMHFW2XWSXW6WUHDP

)LOH&KDQQHO

5DQGRP$FFHVV)LOH

%\WH$UUD\2XWSXW6WUHDP

6WULQJ%XLOGHU

6WULQJ:ULWHU$XGLR,QSXW6WUHDP

3ULQW6WUHDP

)LOWHU2XWSXW6WUHDP

6RFNHW,QSXW6WUHDP,QSXW6RXUFH

%XIIHUHG2XWSXW6WUHDP

2XWSXW6WUHDP:ULWHU

6WULQJ5HDGHU

,QW%XIIHU

3LSHG2XWSXW6WUHDP

+WWSV85/&RQQHFWLRQ

;���&HUWLILFDWH

'DWD,QSXW6RFNHW&KDQQHO

&KDU%XIIHU)ORDW%XIIHU%\WH2UGHU

'DWD2XWSXW

3LSHG,QSXW6WUHDP

6HUYHU6RFNHW&KDQQHO

)LOWHU,QSXW6WUHDP

6WULQJ%XIIHU,QSXW6WUHDP

-7H[W)LHOG

$FWLRQ/LVWHQHU

-3DQHO

-%XWWRQ

0RXVH/LVWHQHU

.H\/LVWHQHU

$FWLRQ(YHQW,WHP/LVWHQHU

-/D\HUHG3DQH

-7UHH

-,QWHUQDO)UDPH

-&RPSRQHQW

-6OLGHU

-&KHFN%R[

-/DEHO

-7DEOH-(GLWRU3DQH

-/LVW -6FUROO3DQH

-5DGLR%XWWRQ

-&RPSQHQW

-7DEOH+HDGHU

0HQX%DU-6FUROO3DQHO

-'HVNWRS3DQH

'HIDXOW7UHH6HOHFWLRQ0RGHO8,0DQDJHU

-5RRW3DQH-&RPER%R[

-7H[W$UHD-7H[W3DQH

-)LOH&KRRVHU

-6SLQQHU$EVWUDFW%XWWRQ

-3URJUHVV%DU

-6HSDUDWRU

-7RJJOH%XWWRQ

-5DGLR%XWWRQ0HQX

0RXVH0RWLRQ/LVWHQHU0RXVH:KHHO/LVWHQHU

7DEOH0RGHO'HIDXOW7DEOH0RGHO

7DEOH5RZ6RUWHU

'HIDXOW6W\OHG'RFXPHQW+70/(GLWRU.LW

&HOO5HQGHUHU

&RPER%R[0RGHO'HIDXOW&RPER%R[0RGHO

6W\OH&RQVWDQWV-7DEOHV

/LVW6HOHFWLRQ/LVWHQHU/LVW0RGHO

.H\(YHQW)RFXV/LVWHQHU

-5DGLR%XWWRQV

6LPSOH$WWULEXWH6HW

-6FUROO%DU

,WHP(YHQWV

)LOH'LDORJ

:LQGRZ)RFXV/LVWHQHU

3URJUHVV0RQLWRU

5RZ)LOWHU

Figure 3.1: Visualization of predicted coordinate term pairs, where each pair of coordinate classes is connected by an edge. Highly connected components are labeled by edge color, and it can be noted that they contain classes with similar functionality. Some areas containing a functional class group have been magnified for easier readability.

We see this work as a first step towards building a knowledge representation system for the software domain, in which text entities refer to elements from a software code base, such as classes, methods, applications and programming languages. Structured understanding of software entities, classes and relations can enable higher reasoning capabilities in NLP applications for the software domain [9, 52, 80, 82] and improve a variety of code assisting applications, including code refactoring and token completion [5, 31, 35, 66]. Figure 3.1 shows a visualization based on our predicted coordinate pairs. Java classes with similar functionality are highly connected in this graph, indicating that our method can be used to construct an interesting code taxonomy.

3.2 Related Work

Semantic Relation Discovery. Previous work on semantic relation discovery, in particular coordinate term discovery, has used two main approaches. The first is based on the insight that certain lexical patterns indicate a semantic relationship with high precision, as initially observed by Hearst [32]. For example, the conjunction pattern "X and Y" indicates that X and Y are coordinate terms. Other pattern-based classifiers have been introduced for meronyms [28], synonyms [46], and general analogy relations [77]. The second approach relies on the notion that words that appear in a similar context are likely to be semantically similar. In contrast to pattern-based classifiers, context distributional similarity approaches are normally higher in recall [15, 59, 61, 69]. In this work we attempt to label samples extracted with high-precision Hearst patterns, using information from higher-recall methods.

Grounded Language Learning. The aim of grounded language learning methods is to learn a mapping


between natural language (words and sentences) and the observed world [29, 68, 85]. Recent work includes grounding language to the physical world [40], and grounding of entire discourses [50]. Early work in this field relied on supervised aligned sentence-to-meaning data [27, 86]. However, in later work the supervision constraint has been gradually relaxed [36, 45]. Relative to prior work on grounded language acquisition, we use a very rich and complex representation of entities and their relationships (through software code). Here, we consider a very constrained language task, namely coordinate term discovery.

Statistical Language Models for Software. In recent work by NLP and software engineering researchers, statistical language models have been adapted for modeling software code. NLP models have been used to enhance a variety of software development tasks such as code and comment token completion [31, 35, 52, 66], analysis of code variable names [5, 44], and mining software repositories [26]. This has been complemented by work from the programming language research community on structured prediction of code syntax trees [56]. To the best of our knowledge, there is no prior work on discovering semantic relations for software entities.

3.3 Main Results

Given a software domain text corpus (StackOverflow) and a code repository (the Java Standard Libraries), our goal is to predict a coordinate relation for $\langle X, Y \rangle$, where $X$ and $Y$ are nouns which potentially refer to Java classes.

Corpus Distributional Similarity: As an initial baseline we calculate the corpus distributional similarity of nouns $\langle X, Y \rangle$, following the assumption that words with similar context are likely to be semantically similar. Our implementation follows Pereira et al. [61]. We calculate the empirical context distribution for noun $X$ as $p_X(c) = f(c, X) / \sum_{c'} f(c', X)$, where $f(c, X)$ is the frequency of occurrence of noun $X$ in context $c$. We can measure the similarity of nouns $X$ and $Y$ using the Kullback-Leibler divergence, $D(p_X \| p_Y) = \sum_z p_X(z) \log \frac{p_X(z)}{p_Y(z)}$. Finally, we consider the symmetric distributional similarity of $X$ and $Y$ as $D(p_X \| p_Y) + D(p_Y \| p_X)$.
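As a concrete illustration, the following minimal Python sketch computes this symmetric divergence from raw context counts. The add-epsilon smoothing is our own assumption (the text does not specify how zero counts are handled, but some smoothing is needed for the KL divergence to remain finite):

    import math
    from collections import Counter

    def context_distribution(counts, vocab, eps=1e-9):
        # Normalize raw counts f(c, X) into p_X over a shared context vocabulary.
        # The eps-smoothing is an assumption, added so KL stays finite on zero counts.
        total = sum(counts.get(c, 0) + eps for c in vocab)
        return {c: (counts.get(c, 0) + eps) / total for c in vocab}

    def kl(p, q):
        # D(p || q) = sum_z p(z) log(p(z) / q(z))
        return sum(p[z] * math.log(p[z] / q[z]) for z in p)

    def symmetric_similarity(counts_x, counts_y):
        # Symmetric divergence D(p_X || p_Y) + D(p_Y || p_X); lower = more similar.
        vocab = set(counts_x) | set(counts_y)
        p_x = context_distribution(counts_x, vocab)
        p_y = context_distribution(counts_y, vocab)
        return kl(p_x, p_y) + kl(p_y, p_x)

    # Toy context counts f(c, X) for two nouns (illustrative values only):
    print(symmetric_similarity(Counter({"iterate over": 10, "add to": 7}),
                               Counter({"iterate over": 8, "add to": 5, "sort": 1})))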

String Similarity: Due to naming convention standards, many related classes exhibit some morphological closeness. For example, classes that provide Input/Output access to the file system will often contain the suffix Stream or Buffer. Likewise, many classes extend the names of their super classes. We therefore include a second baseline which attempts to label the noun pair $\langle X, Y \rangle$ as coordinate terms according to their string matching similarity. We use the SecondString open source Java toolkit¹. Each string is tokenized by camel case (such that ArrayList is represented as Array List). We consider the SoftTFIDF distance of the tokenized strings, as defined by Cohen et al. [14].
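A minimal sketch of the camel-case tokenization step, with a simple Jaccard token overlap standing in for the SoftTFIDF distance (the actual experiments use SecondString's SoftTFIDF; the regex and the Jaccard stand-in are our own simplifications):

    import re

    def camel_case_tokens(name):
        # 'ArrayList' -> ['Array', 'List']; acronym runs like 'GZIP' stay intact.
        return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", name)

    def token_overlap(x, y):
        # Crude stand-in for SoftTFIDF: Jaccard overlap of lower-cased tokens.
        tx = {t.lower() for t in camel_case_tokens(x)}
        ty = {t.lower() for t in camel_case_tokens(y)}
        return len(tx & ty) / len(tx | ty)

    print(camel_case_tokens("GZIPInputStream"))                  # ['GZIP', 'Input', 'Stream']
    print(token_overlap("FileInputStream", "FileOutputStream"))  # 0.5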

3.3.1 Entity Linking

In order to draw code-based information on text entities, we define a mapping function between words and class types. Our goal is to find $p(C|W)$, where $C$ is a specific class implementation and $W$ is a word. This mapping is ambiguous, for example, since users are less likely to mention the qualified class name (e.g., java.lang.String), and usually use the class label, meaning the name of the class not including its package (e.g., String). As an example, the terms java.lang.String and java.util.Vector appear 37 and 1 times respectively in our corpus, versus the terms String and Vector which appear 35K and 1.6K times. Additionally, class names appear with several variations, including case-insensitive versions, spelling mistakes, or informal names (e.g., array instead of ArrayList).

¹ http://secondstring.sourceforge.net/


ARG-Method: Class is being passed as an argument to Method. We count an occurrence of this context once for the method definition, Method(Class class, ...), and for each method invocation, Method(class, ...). For example, given the statement str = toString(i); where i is an Integer, we would count an occurrence for this class in the context ARG-toString.

API-Method: Class provides the API method Method. We count an occurrence of this context once for the method definition, and for every occurrence of the method invocation, e.g., class.Method(...). For example, given the statement s = map.size(); where map is a HashMap, we would count an occurrence for this class in the context API-size.

Table 3.1: Definition of code-contexts for a class type, Class, or an instantiation of that type (e.g., class).
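To make the two context definitions concrete, here is a small sketch that counts them over pre-extracted call sites. The tuple representation (method name, receiver class, argument classes) is a hypothetical intermediate format; a real pipeline would produce it with a Java parser:

    from collections import Counter

    def code_contexts(call_sites):
        # Each call site: (method, receiver_class_or_None, [argument_classes]).
        contexts = Counter()
        for method, receiver, args in call_sites:
            if receiver is not None:
                contexts[(receiver, "API-" + method)] += 1  # class provides the method
            for arg in args:
                contexts[(arg, "ARG-" + method)] += 1       # class passed as an argument
        return contexts

    # The two examples from Table 3.1: 'str = toString(i)' and 's = map.size()'.
    sites = [("toString", None, ["Integer"]), ("size", "HashMap", [])]
    print(code_contexts(sites))
    # Counter({('Integer', 'ARG-toString'): 1, ('HashMap', 'API-size'): 1})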

Therefore, in order to approximate $p(C, W)$ in $p(C|W) = \frac{p(C,W)}{p(W)}$, we estimate a word to class-type mapping that is mediated through the class label, $L$, as

$$p(C, W) = p(C, L) \cdot p(L, W) \quad (3.1)$$

Since $p(C, L) = p(C|L)\,p(L)$, this can be estimated by the corresponding MLEs

$$p(C, L) = p(C|L) \cdot p(L) = \frac{f(C)}{\sum_{C' \in L} f(C')} \cdot \frac{f(L)}{\sum_{L'} f(L')} \quad (3.2)$$

where $f(\cdot)$ is the frequency function. Note that since $\sum_{C' \in L} f(C') = f(L)$, we get that $p(C, L) = p(C)$, as the class label is uniquely determined by the class qualified name (the opposite does not hold, since multiple class types may correspond to the same label). Finally, the term $p(L, W)$ is estimated by the symmetric string distance between the two strings, as described above. We consider the linking probability of $\langle X, Y \rangle$ to be $p(X'|X) \cdot p(Y'|Y)$, where $X'$ is the best matching class for $X$, s.t. $X' = \arg\max_C p(C|X)$, and similarly for $Y'$.

3.3.2 Code Distributional Similarity

Corpus distributional similarity evaluates the occurrence of words in particular semantic contexts. By defining the class-context of a Java class, we can similarly calculate a code distributional similarity between classes. Our definition of class context is based on the usage of a class as an argument to methods and on the API which the class provides, and it is detailed in Table 3.1. We observe over 23K unique contexts in our code repository. Based on these definitions we can compute the distributional similarity measure between classes $X'$ and $Y'$ based on their code-context distributions, as previously described for the corpus distributional similarity. For the code-based case, we calculate the empirical context distribution of $X'$ using $f(c, X')$, the occurrence frequency of class $X'$ in context $c$, where $c$ is one of the ARG-Method or API-Method contexts (defined in Table 3.1) for methods observed in the code repository. The distributional similarity of $\langle X', Y' \rangle$ is then taken, using the relative entropy, as $D(p_{X'} \| p_{Y'}) + D(p_{Y'} \| p_{X'})$.

3.3.3 Code Hierarchies and Organization

We define an ancestry relation between words $X$ and $Y$, as another measure of whether they belong in the same taxonomy, based on the following two code taxonomies.

Package Taxonomy. A package is the standard way for defining namespaces in the Java language. It is a mechanism for organizing sets of classes which normally share a common functionality.


Method               Coord        Coord-PMI
Code & Corpus        85.3         88
Corpus Dist. Sim.    57.8         58.2
Code Dist. Sim.      67 (60.2)    67.2 (59)
All Corpus           64.7         60.9
All Code             80.1         81.1

Table 3.2: Cross-validation accuracy results for the coordinate term SVM classifier (Code & Corpus), as well as baselines using corpus distributional similarity, code distributional similarity, all corpus-based features, or all code-based features. The weighted version of the code-based features is in parentheses. Results are shown for both the Coord and Coord-PMI datasets.

Figure 3.2: Manual labeling results. F1 of the top 1000 predicted coordinate terms by rank, for the Full (Code & Corpus), Code Dist. Sim., and Corpus Dist. Sim. classifiers. The final data point in each line is labeled with the F1 score at rank 1000: 0.86, 0.56, and 0.28, respectively.

Packages are organized in a hierarchical structure which can be easily inferred from the class name. For example, the class java.lang.String belongs to the java.lang package, which belongs to the java package.

Type Taxonomy. The inheritance structure of classes and interfaces in the Java language defines a type hierarchy, such that class A is the ancestor of class B if B extends or implements A.

We define type-ancestry and package-ancestry relations between classes $\langle X', Y' \rangle$, based on the above taxonomies. For the type taxonomy, $A^n_{\text{type}}(X', Y')$ = {number of common ancestors that $X'$ and $Y'$ share within the $n$ highest levels of the type taxonomy}, for $n$ from 1 to 6. $A^n_{\text{package}}$ is defined similarly for the package taxonomy. As an example, $A^2_{\text{package}}(\text{ArrayList}, \text{Vector}) = 2$, as these classes both belong in the package java.util, and therefore their common level-2 ancestors are java and java.util.
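The package-ancestry feature can be read directly off qualified class names; a minimal sketch:

    def package_ancestry(class_x, class_y, n):
        # A^n_package: common ancestors of the two classes within the top n
        # levels of the package taxonomy, inferred from the qualified names.
        px = class_x.split(".")[:-1]  # packages only, e.g. ['java', 'util']
        py = class_y.split(".")[:-1]
        common = 0
        for level in range(min(n, len(px), len(py))):
            if px[:level + 1] == py[:level + 1]:
                common += 1
        return common

    # Both classes live in java.util; their level-2 ancestors are java and java.util.
    print(package_ancestry("java.util.ArrayList", "java.util.Vector", 2))  # 2

The type-taxonomy variant is analogous but walks the extends/implements hierarchy instead of the package path.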

3.3.4 Experiment: Learning Coordinate Pairs

In Table 3.2 we report the cross-validation accuracy of the coordinate term classifier (Code & Corpus) as well as baseline classifiers using corpus and code distributional similarity, all corpus features, or all code features. Note that using only code features is significantly more successful on this data than any other baseline. When using both data sources, performance improves even further.

Evaluation by Manual Labeling: The cross-validation results above are based on labels extracted using Hearst conjunction patterns. In Figure 3.2 we provide an additional analysis based on manual human labeling of samples from the Coord-PMI dataset, following a procedure similar to prior researchers exploring semi-supervised methods for relation discovery [11, 43]. After development was complete, we labeled the top 1000 coordinate pairs according to our full classifier (Code & Corpus) and the top 1000 pairs predicted by the classifiers based on code and corpus distributional similarities only. We report the F1 results of each classifier by the rank of the predicted samples. It is interesting that the F1 score using corpus-based distributional similarity degrades quickly after the 100th top pair. The combination of code and text based signals, however, remains more stable, reaching 86% at rank 1000.

Taxonomy Construction: We visualize the coordinate term pairs predicted using our method, using a graph where entities are nodes and edges are determined by a coordinate relation (Figure 3.1). Graph edges are colored using the Louvain method [8] for community detection, and an entity label's size is determined by its betweenness centrality degree. High-level communities in this graph correspond to class functionality, indicating that our method can be used to create an interesting code taxonomy.
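A sketch of this pipeline using networkx (our choice here; the text does not name the tooling used to produce Figure 3.1), assuming networkx >= 2.8 for the built-in Louvain implementation:

    import networkx as nx
    from networkx.algorithms.community import louvain_communities

    def taxonomy_graph(coordinate_pairs):
        # Nodes are classes; edges are predicted coordinate relations.
        g = nx.Graph()
        g.add_edges_from(coordinate_pairs)
        communities = louvain_communities(g, seed=0)  # edge colors in Figure 3.1
        centrality = nx.betweenness_centrality(g)     # drives label size
        return g, communities, centrality

    pairs = [("ArrayList", "LinkedList"), ("ArrayList", "Vector"),
             ("FileReader", "FileWriter"), ("FileReader", "BufferedReader")]
    g, comms, cent = taxonomy_graph(pairs)
    print(comms)  # two functional groups: collection classes vs. I/O classes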


Chapter 4

Natural Language Models for Predicting Programming Comments

4.1 Introduction and Related Work

Statistical language models have traditionally been used to analyze natural language documents. Recently, software engineering researchers have adopted the use of language models for modeling software code. Hindle et al. [33] observe that, as code is created by humans, it is repetitive and predictable, similar to natural language. NLP models have thus been used for a variety of software development tasks such as code token completion [31, 35], analysis of names in code [5, 44] and mining software repositories [26].

An important part of code programming and maintenance lies in documentation, which may come in the form of tutorials, or inline comments provided by the programmer. The documentation provides a high level description of the task performed by the code, and may include examples of common use-cases, or define identifiers. Well documented code is easier to read and maintain in the long run, but writing comments is a laborious and often overlooked task. Code commenting not only provides a summarization of the conceptual idea behind the code [71], but can also be viewed as a form of document expansion where comments contain significant terms relevant to the described code. Accurately predicted comment words can therefore be used for a variety of linguistic uses including improved search over code bases using natural language queries, and code categorization [42, 62, 67, 76, 79]. A related and well studied NLP task is that of predicting natural language captions and commentary for images and videos [6, 24, 25, 83].

In this work, our goal is to apply statistical language models to the prediction of class comments. We show that n-gram models are extremely successful in this task, and can lead to a saving of up to 47% in comment typing. This is expected, as n-grams have been shown to be a strong model for language and speech prediction that is hard to improve upon [64]. In some cases, however, such as a document expansion task, we wish to extract important terms relevant to the code regardless of local syntactic dependencies. We hence also evaluate the use of LDA [7] and link-LDA [21] topic models, which are more relevant for this scenario. We find that topic model performance can be improved by distinguishing code and text tokens in the code.

4.2 Main Results

4.2.1 Models

We train n-gram models (n=1, 2, 3) over source code documents containing sequences of combined code and text tokens from multiple training datasets. We use the Berkeley Language Model package [60] with


absolute discounting (Kneser-Ney smoothing; [37]), which includes a backoff strategy to lower-order n-grams. Next we use LDA topic models trained on the same data, with 1, 5, 10, and 20 topics. The joint distribution of a topic mixture $\theta$, and a set of topics $z$, for a single source code document with $N$ observed word tokens, $d = \{w_i\}_{i=1}^{N}$, given the Dirichlet parameters $\alpha$ and $\beta$, is therefore

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{w} p(z \mid \theta)\, p(w \mid z, \beta) \quad (4.1)$$

Under the models described so far, there is no distinction between text and code tokens. Finally, we consider documents as having a mixed membership of two entity types, code and text tokens, $d = (\{w^{code}_i\}_{i=1}^{C_n}, \{w^{text}_i\}_{i=1}^{T_n})$. Text words include comments and string literals, and code words include programming language syntax tokens (e.g., public and for) and identifiers. We train link-LDA models with 1, 5, 10, and 20 topics. The joint distribution of a topic mixture, words and topics is then

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \cdot \prod_{w^{text}} p(z^{text} \mid \theta)\, p(w^{text} \mid z^{text}, \beta) \cdot \prod_{w^{code}} p(z^{code} \mid \theta)\, p(w^{code} \mid z^{code}, \beta) \quad (4.2)$$

where $\theta$ is the joint topic distribution, $w$ is the set of observed document words, $z^{text}$ is a topic associated with a text word, and $z^{code}$ a topic associated with a code word. We use Gibbs sampling [30] for topic inference, based on the implementation of Balasubramanyan and Cohen [3].
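For concreteness, a compact collapsed Gibbs sampler for the single-entity LDA case of equation (4.1) is sketched below. The actual experiments use the implementation of Balasubramanyan and Cohen [3], and the link-LDA variant additionally keeps separate counts for code and text tokens; this is a minimal didactic version:

    import numpy as np

    def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
        # docs: list of documents, each a list of integer word ids.
        rng = np.random.default_rng(seed)
        n_dt = np.zeros((len(docs), n_topics))   # document-topic counts
        n_tw = np.zeros((n_topics, vocab_size))  # topic-word counts
        n_t = np.zeros(n_topics)                 # topic totals
        z = []
        for d, doc in enumerate(docs):           # random initialization
            zd = rng.integers(n_topics, size=len(doc))
            z.append(zd)
            for w, t in zip(doc, zd):
                n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    t = z[d][i]                  # remove the current assignment
                    n_dt[d, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
                    # p(z = t | rest): doc-topic term times topic-word term
                    p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + vocab_size * beta)
                    t = rng.choice(n_topics, p=p / p.sum())
                    z[d][i] = t
                    n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
        return n_dt, n_tw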

4.2.2 Prediction Methodology

Our goal is to predict the tokens of the Java class comment (the one preceding the class definition) in each of the test files. Each of the models described above assigns a probability to the next comment token. In the case of n-grams, the probability of a token word $w_i$ is given by considering previous words, $p(w_i \mid w_{i-1}, \ldots, w_0)$, and is estimated given the previous $n-1$ tokens as $p(w_i \mid w_{i-1}, \ldots, w_{i-(n-1)})$.

For the topic models, we separate the document tokens into the class definition and the comment we wish to predict. The set of tokens of the class comment, $w^c$, are all considered as text tokens. The rest of the tokens in the document, $w^r$, are considered to be the class definition, and they may contain both code and text tokens (from string literals and other comments in the source file). We then compute the posterior probability of document topics by solving the following inference problem conditioned on the $w^r$ tokens:

$$p(\theta, z^r \mid w^r, \alpha, \beta) = \frac{p(\theta, z^r, w^r \mid \alpha, \beta)}{p(w^r \mid \alpha, \beta)} \quad (4.3)$$

This gives an estimate of the document distribution $\theta$, then used to infer the probability of comment tokens:

$$p(w^c \mid \theta, \beta) = \sum_{z} p(w^c \mid z, \beta)\, p(z \mid \theta) \quad (4.4)$$

Following Blei et al. [7], for the case of a single-entity LDA, the inference problem from equation (4.3) can be solved by considering $p(\theta, z, w \mid \alpha, \beta)$, as in equation (4.1), and by taking the marginal distribution of the document tokens as a continuous mixture distribution for the set $w = w^r$, by integrating over $\theta$ and summing over the set of topics $z$:

$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \cdot \left( \prod_{w} \sum_{z} p(z \mid \theta)\, p(w \mid z, \beta) \right) d\theta \quad (4.5)$$

For the case of link-LDA, where the document is comprised of two entity types, we can consider the mixed-membership joint distribution $\theta$, as in equation (4.2), and similarly the marginal distribution $p(w \mid \alpha, \beta)$ over both code and text tokens from $w^r$. Since comment words in $w^c$ are all considered as text tokens, they are sampled using text topics, namely $z^{text}$, in equation (4.4).


Model        n-gram                    LDA                               Link-LDA
n / topics   1       2       3         20      10      5       1         20      10      5       1
IN           33.05   43.27   47.1      34.20   33.93   33.63   33.05     35.76   35.81   35.37   34.59
             (3.62)  (5.79)  (6.87)    (3.63)  (3.67)  (3.67)  (3.62)    (3.95)  (4.12)  (3.98)  (3.92)
OUT          26.6    31.52   32.96     26.79   26.8    26.86   26.6      28.03   28      28      27.82
             (3.37)  (4.17)  (4.33)    (3.26)  (3.36)  (3.44)  (3.37)    (3.60)  (3.56)  (3.67)  (3.62)
SO           27.8    33.29   34.56     27.25   27.22   27.34   27.8      28.08   28.12   27.94   27.9
             (3.51)  (4.40)  (4.78)    (3.67)  (3.44)  (3.55)  (3.51)    (3.48)  (3.58)  (3.56)  (3.45)

Table 4.1: Average percentage of characters saved per comment using n-gram, LDA and link-LDA models trained on the sets IN, OUT, and SO. The results are averaged over nine Java projects (standard deviations in parentheses).

4.2.3 Experiment: Within- and Cross-Project Comment Prediction

We use source code from nine open source Java projects. Files from each project are divided into train and test sets. Then, for each project, we consider three training scenarios using the following datasets. IN: To emulate a scenario in which we are predicting comments in the middle of project development, we can use data from the same project, which we name the in-project training dataset. OUT: Alternatively, if we train a comment prediction model at the beginning of development, we need to use source files from other projects. To analyze this scenario, for each project above we train models using an out-of-project dataset containing data from the other eight projects. SO: Typically, source code files contain a greater amount of code versus comments. Since we are interested in predicting comments, we consider a third training data source which contains more English text as well as some code segments. For this purpose we downloaded posts from StackOverflow, and we used only posts that are tagged as Java related questions and answers.

Since our models are trained using various data sources, the vocabularies used by each of them are different, making the comment likelihood given by each model incomparable due to different sets of out-of-vocabulary tokens. We thus evaluate models using a character saving metric which aims at quantifying the characters that can be saved by using the model in a word-completion setting, similar to code completion tools. For a comment word with $n$ characters, $w = w_1, \ldots, w_n$, we predict the two most likely words given each model, filtered by the first $0, \ldots, n$ characters of $w$. Let $k$ be the minimal $k_i$ for which $w$ is in the top two predicted word tokens when tokens are filtered by the first $k_i$ characters. Then, the number of saved characters for $w$ is $n - k$. In Table 4.1 we report the average percentage of saved characters per comment using each of the above models, averaged over the nine input projects.
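The metric itself is easy to state in code. The toy predictor below is a hypothetical stand-in for a trained model; only the ranking interface matters:

    def saved_characters(word, top_two):
        # k = smallest prefix length at which `word` is in the model's top two
        # completions; the saving for the word is then len(word) - k.
        n = len(word)
        for k in range(n + 1):
            if word in top_two(word[:k]):
                return n - k
        return 0

    # Hypothetical completion model: a fixed vocabulary ranked by score.
    vocab = {"string": 5, "stream": 4, "socket": 2, "size": 1}
    def top_two(prefix):
        matches = [w for w in sorted(vocab, key=vocab.get, reverse=True)
                   if w.startswith(prefix)]
        return matches[:2]

    print(saved_characters("socket", top_two))  # the prefix 'so' suffices: saves 4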

Models trained on in-project data perform significantly better than those trained on another data source, regardless of the model type, with an average saving of 47.1% characters using a trigram model. This is expected, as files from the same project are likely to contain similar comments and common identifier names that appear in comments. Clearly, in-project data should be used when available, as it improves comment prediction, leading to an average increase of between 6%-14%. Of the out-of-project data sources, models using a greater amount of text (SO) out-performed ones based on more code (OUT). This increase in performance, however, comes at a cost of greater run-time due to the larger dictionary associated with SO tokens. The trigram model shows the best overall performance. Amongst topic models, link-LDA performs consistently better than simple LDA. Note that in this work, topic models are based on unigram tokens; therefore their results are most comparable with the unigram in Table 4.1, which does not benefit from the backoff strategy used by higher n-grams. By this comparison, link-LDA proves more successful in this task than the simpler models which do not distinguish code and text tokens. Using n-grams without backoff leads to results significantly worse than any of the presented models (not shown).


Chapter 5

Preliminary and Proposed Work

5.1 Learning an Ontology with Relations using a Topic Model Framework

Ontology design involves assembling a set of interesting categories, organized in a meaningful hierarchical structure, and providing seeds, i.e., representative examples for each category. We are often also interested in relations between categories. Designing a new ontology for a technical domain is difficult and often requires non-trivial knowledge of the domain, e.g., which are the main concepts and relations of interest. We have previously constructed a simple software hierarchy by aggregating learned pairwise relations [53]. However, this hierarchy does not distinguish between categories and examples. While it was shown that highly connected components in this hierarchy map to functional code units, the units in this case had to be manually identified and named.

In this work, we describe a topic model which jointly learns an ontological structure, including categories, seeds and relations, in an unsupervised way. The model, presented in Figure 5.1 with notation in Table 5.1, currently includes two main components: the first learns instance-to-instance relations ('Ontology' component) based on hypernym-hyponym pairs, and the second learns subject-verb-object (SVO) relations ('Relations' component) which provide a link between instances. Instance topics learned by this model can serve as a basis for an ontology, whereas the probability of association of an instance with a category can determine a set of potential seeds. SVO topics provide a link between ontology categories; thereby, with this model we can learn a complete knowledge base (KB) structure, and we appropriately name it KB-LDA.

Let $N_{T_I}$ be the number of latent instance topics, and $N_{T_R}$ be the number of latent relation topics we wish to recover. The generative process is then as follows:

1. Generate topics: For each $t_I \in 1, \ldots, N_{T_I}$, sample $\sigma_{t_I} \sim \text{Dirichlet}(\gamma_I)$, the topic-specific instance distribution, and for each $t_R \in 1, \ldots, N_{T_R}$, sample $\delta_{t_R} \sim \text{Dirichlet}(\gamma_R)$, the topic-specific relation distribution.

2. Generate ontology: Sample $\pi_O \sim \text{Dirichlet}(\alpha_O)$, the instance topic distribution, composing the main structure of the ontology. For each concept-instance pair $\langle C_i, I_i \rangle$, $i \in 1, \ldots, N_O$, sample a topic pair $\langle z_{C_i}, z_{I_i} \rangle \sim \text{Multinomial}(\pi_O)$, and then sample instances $C_i \sim \text{Multinomial}(\sigma_{z_{C_i}})$ and $I_i \sim \text{Multinomial}(\sigma_{z_{I_i}})$.

3. Generate relations: Sample $\pi_R \sim \text{Dirichlet}(\alpha_R)$, the relation topic distribution, describing relations among pairs of instance topics. For each SVO tuple $\langle S_j, V_j, O_j \rangle$, $j \in 1, \ldots, N_R$, sample a topic tuple $\langle z_{S_j}, z_{V_j}, z_{O_j} \rangle \sim \text{Multinomial}(\pi_R)$, then sample instances $S_j \sim \text{Multinomial}(\sigma_{z_{S_j}})$ and $O_j \sim \text{Multinomial}(\sigma_{z_{O_j}})$, and sample a relation $V_j \sim \text{Multinomial}(\delta_{z_{V_j}})$.


Figure 5.1: Plate diagram of KB-LDA. Proposed components ('Coordinate Terms' and 'Tables') are shaded (see Section 5.3).

Given the hyperparameters $\alpha_O$, $\alpha_R$, $\gamma_I$, and $\gamma_R$, the joint distribution over the concept-instance pairs, the SVO tuples, the topics and the topic assignments is given by

$$p(\pi_O, \sigma, \delta, \langle C, I \rangle, \langle z_C, z_I \rangle, \langle S, O, V \rangle, \langle z_S, z_O, z_V \rangle, \pi_R \mid \alpha_O, \alpha_R, \gamma_I, \gamma_R) = \prod_{t_I=1}^{N_{T_I}} \mathrm{Dir}(\sigma_{t_I} \mid \gamma_I) \times \prod_{t_R=1}^{N_{T_R}} \mathrm{Dir}(\delta_{t_R} \mid \gamma_R) \times \mathrm{Dir}(\pi_O \mid \alpha_O) \prod_{i=1}^{N_O} \pi_O^{\langle z_{C_i}, z_{I_i} \rangle} \sigma_{z_{C_i}}^{C_i} \sigma_{z_{I_i}}^{I_i} \times \mathrm{Dir}(\pi_R \mid \alpha_R) \prod_{j=1}^{N_R} \pi_R^{\langle z_{S_j}, z_{O_j}, z_{V_j} \rangle} \sigma_{z_{S_j}}^{S_j} \sigma_{z_{O_j}}^{O_j} \delta_{z_{V_j}}^{V_j} \quad (5.1)$$

Due to the intractability of exact inference in the KB-LDA model, a collapsed Gibbs sampler is used to perform approximate inference in order to query the topic distributions and assignments. It samples a latent topic pair for a concept-instance pair in the corpus conditioned on the assignments to all other concept-instance pairs and SVO tuples, using the following expression, after collapsing $\pi_O$:

$$p(z_{CI_i} = \langle z_{C_i}, z_{I_i} \rangle \mid \langle C_i, I_i \rangle, z_{CI}^{\neg i}, z_{SVO}, \langle C, I \rangle^{\neg i}, \alpha_O, \gamma_I) \propto \left( n^{O \neg i}_{\langle z_{C_i}, z_{I_i} \rangle} + \alpha_O \right) \cdot \frac{(n^{I \neg i}_{z_{C_i}, C_i} + \gamma_I)(n^{I \neg i}_{z_{I_i}, I_i} + \gamma_I)}{(\sum_{C} n^{I \neg i}_{z_{C_i}, C} + N_{T_I} \gamma_I)(\sum_{I} n^{I \neg i}_{z_{I_i}, I} + N_{T_I} \gamma_I)} \quad (5.2)$$

We similarly sample a topic tuple for each SVO tuple conditioned on the assignments to all other SVO tuples and concept-instance pairs, using the following expression, after collapsing $\pi_R$:

$$p(z_{SVO_j} = \langle z_{S_j}, z_{O_j}, z_{V_j} \rangle \mid \langle S_j, O_j, V_j \rangle, z_{SVO}^{\neg j}, z_{CI}, \langle S, O, V \rangle^{\neg j}, \alpha_R, \gamma_I, \gamma_R) \propto \left( n^{R \neg j}_{\langle z_{S_j}, z_{O_j}, z_{V_j} \rangle} + \alpha_R \right) \cdot \frac{(n^{I \neg j}_{z_{S_j}, S_j} + \gamma_I)(n^{I \neg j}_{z_{O_j}, O_j} + \gamma_I)(n^{R \neg j}_{z_{V_j}, V_j} + \gamma_R)}{(\sum_{I} n^{I \neg j}_{z_{S_j}, I} + N_{T_I} \gamma_I)(\sum_{I} n^{I \neg j}_{z_{O_j}, I} + N_{T_I} \gamma_I)(\sum_{V} n^{R \neg j}_{z_{V_j}, V} + N_{T_R} \gamma_R)} \quad (5.3)$$


$\pi_O$ - multinomial distribution over ontological topic pairs, with Dirichlet prior $\alpha_O$.
$\pi_R$ - multinomial distribution over relation topic tuples, with Dirichlet prior $\alpha_R$.
$\sigma_{t_I}$ - multinomial over instances for topic $t_I$, with Dirichlet prior $\gamma_I$.
$\delta_{t_R}$ - multinomial over relations for topic $t_R$, with Dirichlet prior $\gamma_R$.
$\langle C_i, I_i \rangle$ - the $i$-th ontological assignment pair.
$\langle S_j, O_j, V_j \rangle$ - the $j$-th relation assignment tuple.
$z_{CI_i} = \langle z_{C_i}, z_{I_i} \rangle$ - topic pair chosen for the $i$-th ontological assignment pair.
$z_{SVO_j} = \langle z_{S_j}, z_{O_j}, z_{V_j} \rangle$ - topic tuple chosen for the $j$-th relation assignment pair.
$n^I_{z,i}$ - the number of times instance $i$ is observed under topic $z$ (in either $z_{CI}$ or $z_{SVO}$).
$n^C_{z,c}$ - the number of times concept $c$ is observed under topic $z$ (in $z_{CI}$).
$n^R_{z,r}$ - the number of times relation $r$ is observed under topic $z$ (in $z_{SVO}$).
$n^O_{\langle z_c, z_i \rangle}$ - count of ontological pairs assigned the topic pair $\langle z_c, z_i \rangle$ (in $z_{CI}$).
$n^R_{\langle z_s, z_o, z_v \rangle}$ - count of relation tuples assigned the topic tuple $\langle z_s, z_o, z_v \rangle$ (in $z_{SVO}$).

Table 5.1: KB-LDA notation.

where the $n$'s are the counts of observations from the training set, described in Table 5.1. The topic multinomial parameters and the topic distributions of the ontology and relations are recovered using their MAP estimates after inference, using the counts of observations:

$$\sigma^{I}_{t_I} = \frac{n^I_{t_I, I} + \gamma_I}{\sum_{I'} n^I_{t_I, I'} + N_{T_I} \gamma_I}, \qquad \delta^{R}_{t_R} = \frac{n^R_{t_R, R} + \gamma_R}{\sum_{R'} n^R_{t_R, R'} + N_{T_R} \gamma_R}, \qquad \pi_O^{\langle z_C, z_I \rangle} = \frac{n^O_{\langle z_C, z_I \rangle} + \alpha_O}{\sum_{\langle z'_C, z'_I \rangle} n^O_{\langle z'_C, z'_I \rangle} + N_{T_I}^2 \alpha_O},$$

$$\text{and} \qquad \pi_R^{\langle z_S, z_O, z_V \rangle} = \frac{n^R_{\langle z_S, z_O, z_V \rangle} + \alpha_R}{\sum_{\langle z'_S, z'_O, z'_V \rangle} n^R_{\langle z'_S, z'_O, z'_V \rangle} + N_{T_R} N_{T_I}^2 \alpha_R}$$
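Given the Gibbs count tables, these MAP estimates reduce to a few lines of array arithmetic; a sketch with numpy, where the count-array shapes are our assumed layout:

    import numpy as np

    def map_estimates(n_inst, n_rel, n_pair, n_tuple, gamma_I, gamma_R, alpha_O, alpha_R):
        # n_inst: (N_TI x #instances) topic-instance counts,
        # n_rel:  (N_TR x #relations) topic-relation counts,
        # n_pair: (N_TI x N_TI) ontology topic-pair counts,
        # n_tuple: (N_TI x N_TI x N_TR) relation topic-tuple counts.
        N_TI, N_TR = n_inst.shape[0], n_rel.shape[0]
        sigma = (n_inst + gamma_I) / (n_inst.sum(axis=1, keepdims=True) + N_TI * gamma_I)
        delta = (n_rel + gamma_R) / (n_rel.sum(axis=1, keepdims=True) + N_TR * gamma_R)
        pi_O = (n_pair + alpha_O) / (n_pair.sum() + N_TI ** 2 * alpha_O)
        pi_R = (n_tuple + alpha_R) / (n_tuple.sum() + N_TR * N_TI ** 2 * alpha_R)
        return sigma, delta, pi_O, pi_R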

We have so far performed experiments using this model to build a software hierarchy describing concepts related to Java programming, and another describing the field of malware software. We parsed a corpus of relevant StackOverflow posts with the MALT parser [55]. Concept-instance pairs were extracted from the parsed corpus using a small set of IsA patterns, including "X is a Y", which indicate X is an instance and Y is a concept, and these were used to train the ontology component of KB-LDA. Similarly, the parsed sentences were scanned for Subject-Verb-Object relations (similar to the work done in [74]), which were used to train the relations component.

We learn 15 instance and 15 relation topics using this model on the SO Java corpus. High probability words from sample topics are shown in Table 5.2. Selected topics are also shown in Figure 5.2, along with $\sigma^{t}_{z_t}$, the probability of the token $t$ under its topic $z_t$, for the top 15 tokens per topic. One of the instance topics includes the concept of programming languages and lists a set of examples of that concept. Another topic describes programming environments (IDEs), and yet another deals with class and object APIs. Similarly, relation topics seem to capture common software related actions. We can then build a concept hierarchy by treating the concept-instance topic multinomial, $\pi_O$, as a weighted adjacency matrix, and finding a maximum spanning tree. We can assign relations in this hierarchy by considering high probability links found in $\pi_R$. A sample ontology is shown in Figure 5.3, where the ontology has been rooted with the dummy 'Everything' topic, and the edges are labeled with the topic-topic ($\pi_O^{\langle z_1, z_2 \rangle}$) or topic-relation-topic ($\pi_R^{\langle z_1, z_2, z_3 \rangle}$) probabilities.

Proposed Work: The KB-LDA model proposed here is useful in finding interesting domain-specific relations and topics. However, the hierarchy currently produced by the model is relatively flat, mainly due to the small number of learned topics. As a result, the model learns few of the new, indirect concept-concept links that we would hope to discover in a deeper ontological tree.


Instance Topics                        | Relation Topics
---------------------------------------|---------------------------------------------
Ant, Maven, Eclipse, Spring, Tomcat    | has, contains, provides, defines, declares
Eclipse, Netbeans, Nexus               | enter, send, close, click, accept, allocate
class, interface, object, API          | implements, extends, contains
pointers, references, operators, lock  | returns, throws, generate, catches
PHP, Python, Scala, Groovy, Perl       | works, is running, holding
GC, garbage collector, JVM             |

Table 5.2: Sample topics learned with KB-LDA. Each row includes high probability words from a single topic.

Figure 5.2: Sample instance and relation topics with token-to-topic ($\sigma^{t}_{z_t}$) probabilities.

One of the major current bottlenecks is scalability, as learning a large set of topics is slow when using a Gibbs sampling inference process. We are interested in scaling up the inference of this model, as well as significantly increasing the size of the training data. The extensions to this model, which are detailed in the following sections, will also greatly benefit from the ability to train it more quickly and on a larger set of inputs. We have so far modeled different software sub-domains with separate ontologies, but it seems reasonable that modeling a single software hierarchy (e.g., using the full data available from StackOverflow) would produce a more interesting hierarchy of concepts.

We are planning to evaluate the learned semantic classes, and the connections between those classes (i.e., relations), as a collection of facts that can be individually evaluated using platforms such as Amazon Mechanical Turk. The most interesting facts to evaluate will be those that can be inferred from our model but were not provided directly as input, for example, transitive relations found through the learned hierarchical structure, or relations that span a path of several concepts.


[Figure: an ontology tree rooted at the dummy 'Everything' topic. Nodes are topics labeled by their top words, e.g., {resources, data, files}, {languages, frameworks, IDE}, {Java, Eclipse, Spring}, {Tomcat, Maven, http}, {class, interface, project}, and {exception, error, care}; edges carry topic-topic or topic-relation-topic probabilities such as 1.000, 0.322, 0.290, and 0.260.]

Figure 5.3: Sample Java ontology with selected instance and relation topics.

5.2 Improving Software Language Modeling with a Software Ontology

Given an ontology for the software domain, such as the one described above, it will be interesting to see whether we can use it to improve a popular task in this domain. Recently, there have been numerous efforts to create language models which describe software code. As an example, a simple model may use n-grams of code tokens trained over a code repository, as in [66]. In this case, given a history of code tokens $h_i$, the model would predict the next token to be the $t_i$ which satisfies $t_i = \arg\max_{t_i} p(t_i|h_i)$. A toy version of such a model is sketched below.
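
The following toy sketch illustrates the prediction rule above with a maximum-likelihood trigram model over a token stream; the corpus and helper names are illustrative assumptions, not the cited model [66].

```python
from collections import Counter, defaultdict

# Maximum-likelihood trigram counts over a stream of code tokens.
def train_trigrams(tokens):
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

# t_i = argmax_t p(t | h_i), where h_i is the last two tokens of the history.
def predict_next(counts, history):
    dist = counts.get(tuple(history[-2:]))
    return dist.most_common(1)[0][0] if dist else None

code = "for ( int i = 0 ; i < n ; i ++ )".split()
model = train_trigrams(code)
print(predict_next(model, ["i", "<"]))  # -> n
```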

Such models can be used to assist a programming workflow by making real-time suggestions to the programmer or by creating frequently used templates which represent common design patterns. To address potential sparsity in software language models, for example in modeling infrequently used objects, or in modeling variable names which are not consistent between projects, one can use a higher-level ontology category of an object as a backoff approach. In the context of language modeling, for some tokens, instead of considering $p(t_i|h_i)$ we might consider $p(c_i|h_i)$, where $c_i$ is the category of $t_i$ in some ontology. These types of backoff strategies have been explored before for natural language, where $c_i$ represented some semantic classification of the token $t_i$, such as its part-of-speech tag [64]. One way of assigning such a semantic context in the software domain is to ground the examined tokens in a pre-defined ontology. It would therefore be interesting to explore the usefulness of our derived ontology in this setting; a sketch of the backoff idea follows.
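
This is a minimal sketch of the backoff idea, assuming a hypothetical token-to-category map; in the proposal, that role would be played by the learned software ontology. When a token-level context is unseen, the model falls back to a category-level prediction.

```python
from collections import Counter, defaultdict

# Hypothetical token-to-category map standing in for the ontology.
CATEGORY = {"ArrayList": "CONTAINER", "HashSet": "CONTAINER", "Deque": "CONTAINER"}

def train_bigrams(tokens):
    counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):  # bigrams keep the sketch short
        counts[a][b] += 1
    return counts

def predict_with_backoff(token_model, class_model, prev):
    if token_model.get(prev):                   # p(t_i | h_i) if the context is seen
        return token_model[prev].most_common(1)[0][0]
    cat = CATEGORY.get(prev, prev)              # otherwise back off to p(c_i | h_i)
    return class_model[cat].most_common(1)[0][0] if class_model.get(cat) else None

code = "ArrayList list = new ArrayList ( )".split()
token_model = train_bigrams(code)
class_model = train_bigrams([CATEGORY.get(t, t) for t in code])
print(predict_with_backoff(token_model, class_model, "HashSet"))  # -> list
```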

5.3 Grounding a Learned Ontology

The KB-LDA model described above currently draws information only from corpus statistics. There are, however, several advantages to learning an ontology based on additional, grounded, sources of information, such as the implementation of classes and systems, software traces which represent run-time method calls, or other representations of the software structure, such as the type hierarchy. As we have seen in previous studies [52, 53], the way people talk about software is apparently quite different from the way it is used and implemented. This is evident, for example, in the fact that coordinate terms learned from corpus-based distributional similarity differ from those learned by similar code-based statistics. A learned ontology which takes into account both of these sources of information, and possibly additional ones, should therefore be more accurate, and probably also more robust, as the learned relations are backed by more evidence. Another important advantage of grounded resources is that they may contain "common sense" knowledge: facts that are sufficiently known, clear, or common that they are not mentioned in natural discourse. For example, sets and queues are types of containers, and integers and booleans are examples of primitive Java data types; these basic and obvious relations are known to any Java programmer, and they can be easily learned from the Java type hierarchy, but not all of them may be enumerated in any given text corpus.

The proposed KB-LDA model can be intuitively extended to include grounded data from a target domain, where grounded components model information taken directly from a grounded source. A possible extension is shown in Figure 5.1 (shaded component), which incorporates coordinate term relations into the model. Coordinate relations indicate sets of instances that belong in the same hierarchy sub-tree, and as we have shown before, they can be learned by a combination of text- and code-based information [53]. In the suggested model, coordinate pairs are assigned a single topic and can therefore drive the model toward learning coherent functional groups, based on information coming directly from the code. This formulation suggests the following change to the joint distribution given in Equation 5.1:

$$p(\sigma, \delta, \pi_O, \pi_R, \pi_{CT}, \langle C, I\rangle, \langle z_C, z_I\rangle, \langle S, O, V\rangle, \langle z_S, z_O, z_V\rangle, \langle CT_1, CT_2\rangle, z_{CT} \mid \alpha_O, \alpha_R, \alpha_{CT}, \gamma_I, \gamma_R) =$$

$$\prod_{t_I=1}^{N_{T_I}} \mathrm{Dir}(\sigma_{t_I}\mid\gamma_I) \times \prod_{t_R=1}^{N_{T_R}} \mathrm{Dir}(\delta_{t_R}\mid\gamma_R) \times \mathrm{Dir}(\pi_{CT}\mid\alpha_{CT}) \prod_{k=1}^{N_{CT}} \pi^{CT}_{z_{CT_k}} \sigma^{CT1_k}_{z_{CT_k}} \sigma^{CT2_k}_{z_{CT_k}} \times$$

$$\mathrm{Dir}(\pi_O\mid\alpha_O) \prod_{i=1}^{N_O} \pi^{O}_{\langle z_{C_i}, z_{I_i}\rangle} \sigma^{C_i}_{z_{C_i}} \sigma^{I_i}_{z_{I_i}} \times \mathrm{Dir}(\pi_R\mid\alpha_R) \prod_{j=1}^{N_R} \pi^{R}_{\langle z_{S_j}, z_{O_j}, z_{V_j}\rangle} \sigma^{S_j}_{z_{S_j}} \sigma^{O_j}_{z_{O_j}} \delta^{V_j}_{z_{V_j}} \tag{5.4}$$

Similarly to the suggested coordinate terms component, we can consider an additional component based on tables, which aggregate a larger set of terms belonging to a single topic (Figure 5.1). We can additionally include components, similar to the Ontology component described in Section 5.1, which draw directly from the available type and package hierarchies. We omit here the derivation of the complete joint distribution and Gibbs update rules for the suggested components.

We note that combining these components in a single learning framework allows us a certain amount of control over the significance and weight given to each input resource, which can be tuned using the parameters of the model. This level of control is much harder to achieve when working directly with the corpus or grounded statistics.

5.4 Learning a Complete Ontology for the Biomedical Domain

In previous work in the biomedical domain, we leveraged existing resources to construct an ontology which included a concept hierarchy and suggested low-ambiguity seeds, but did not include any relations [54]. We are interested in the ability of the topic model framework suggested above to derive a complete ontology for this domain, including relations, based on a corpus of biomedical articles similar to the one used previously.

There are several advantages to learning an ontology, even in a domain where many manual ones already exist. First, the existing ontologies are the product of considerable time and effort spent by experts in the domain. This construction process cannot easily be replicated, for example by organizations who may be interested in tuning ontologies for their needs, and more importantly, the task of maintaining and updating these manually constructed resources is expensive and slow. This means that ontologies which rely on manual upkeep will inevitably suffer from a gap between the knowledge they represent and what is publicly available. Finally, learning algorithms benefit from advances in distributed computation, which allow the learned ontologies to scale, while manual ontologies are limited by the amount of human effort that is put to the task.

Using our suggested framework, it is possible to learn an ontology while benefiting from the existing resources, for example, by introducing them as grounded components in the model. It would be interesting to see how a derived learned ontology compares with existing, manually constructed ontologies: to evaluate whether it can reconstruct important semantic classes, such as those that have been previously defined manually, and whether the addition of relations improves the resulting knowledge base. Finally, we are curious to discover what areas of the learned ontology will be missing from the manually constructed ones; in other words, can we identify topics whose value is reflected in the data sources but which have not yet been formally defined and structured by a professional community or organization?

5.5 Semi-Supervised Ontology Learning

The KB-LDA topic model is fully unsupervised. The learning pipeline starts with the extraction of concept-instance and subject-verb-object relations from a text corpus, and continues by learning topics over this data using the model. In some domains, however, we have existing pre-defined knowledge that can help guide ontology learning: for example, we may have access to an incomplete ontology, or we might be particularly interested in representing specified sets of objects. We have started exploring a semi-supervised variant of this model, which takes in "hints" of interesting areas in the ontology and expands on them. Our initial experiments in this area have involved modifying the Gibbs sampling process, specifically the update equations 5.2 and 5.3 above. One possibility for introducing supervision here is to start the topic update process by addressing only the supervised (provided) terms, and then, with every update iteration, extend the set of addressed terms to include the most relevant terms connected to the current set. This means that the set of updated terms grows in each iteration according to the connections presented by the input examples. This is in contrast to the normal Gibbs sampling process, where all term topics are updated in every iteration. In effect, this process simulates the idea of bootstrapping, with the additional advantage of doing so while jointly considering ontology and relation constraints. A minimal sketch of this growing update schedule follows.
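
The sketch below illustrates the growing update schedule in isolation: the topic resampling itself is elided, and the term graph and seed set are illustrative assumptions.

```python
# Widen the set of terms whose topics are resampled, starting from the
# supervised seed terms and following links from the extracted
# concept-instance / subject-verb-object tuples.
def growing_update_sets(seeds, edges, n_iters):
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    active = set(seeds)
    for _ in range(n_iters):
        yield active  # resample topics only for the currently active terms
        active = active | {n for t in active for n in neighbors.get(t, ())}

edges = [("Java", "language"), ("Scala", "language"), ("language", "compiles")]
for it, terms in enumerate(growing_update_sets({"Java"}, edges, 3)):
    print(it, sorted(terms))
# 0 ['Java']
# 1 ['Java', 'language']
# 2 ['Java', 'Scala', 'compiles', 'language']
```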

Our semi-supervised results are currently similar to the unsupervised ones, mainly due to the fact that most terms are added to the update process within a few iterations. We hypothesize that this could, at least in part, be due to the small scale of the training data, which again motivates an investigation into the scalability question discussed above.


Bibliography

[1] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, et al. Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25, 2000.

[2] Soren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. Springer, 2007.

[3] Ramnath Balasubramanyan and William W. Cohen. Block-LDA: Jointly modeling entity-annotated text and entity-entity links. In Proceedings of the 7th SIAM International Conference on Data Mining, 2011.

[4] J. Bard, S.Y. Rhee, and M. Ashburner. An ontology for cell types. Genome Biology, 6(2):R21, 2005.

[5] Dave Binkley, Matthew Hearn, and Dawn Lawrie. Improving identifier informativeness using part of speech information. In Proceedings of the Working Conference on Mining Software Repositories. ACM, 2011.

[6] David M. Blei and Michael I. Jordan. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2003.

[7] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.

[8] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008.

[9] S.R.K. Branavan, Luke S. Zettlemoyer, and Regina Barzilay. Reading between the lines: Learning to map high-level instructions to commands. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. ACL, 2010.

[10] A. Carlson, J. Betteridge, E.R. Hruschka Jr., and T.M. Mitchell. Coupling semi-supervised learning of categories and relations. In Semi-supervised Learning for Natural Language Processing, page 1, 2009.

[11] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R. Hruschka Jr., and T.M. Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the Twenty-Fourth Conference on Artificial Intelligence (AAAI 2010), 2010.

[12] J.T. Chang, H. Schutze, and R.B. Altman. GAPSCORE: finding gene and protein names one word at a time. Bioinformatics, 20(2):216, 2004.

[13] K.W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

[14] William W. Cohen, Pradeep D. Ravikumar, Stephen E. Fienberg, et al. A comparison of string distance metrics for name-matching tasks. In IIWeb, 2003.

[15] James Richard Curran. From distributional to semantic similarity. PhD thesis, University of Edinburgh, School of Informatics, 2004.

[16] J.R. Curran, T. Murphy, and B. Scholz. Minimising semantic drift with mutual exclusion bootstrapping. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, 2007.

[17] K. Degtyarenko, P. De Matos, M. Ennis, J. Hastings, M. Zbinden, A. McNaught, R. Alcantara, M. Darsow, M. Guedj, and M. Ashburner. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research, 36(suppl 1):D344, 2008.

[18] A. Dolbey, M. Ellsworth, and J. Scheffczyk. BioFrameNet: A domain-specific FrameNet extension with links to biomedical ontologies. In Proceedings of KR-MED, pages 87–94. Citeseer, 2006.

[19] Xin Luna Dong, K. Murphy, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In KDD, 2014.

[20] K. Eilbeck, S.E. Lewis, C.J. Mungall, M. Yandell, L. Stein, R. Durbin, and M. Ashburner. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology, 6(5):R44, 2005.

[21] Elena Erosheva, Stephen Fienberg, and John Lafferty. Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences of the United States of America, 2004.

[22] Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. ACL, 2011.

[23] Christiane Fellbaum. WordNet: An electronic lexical database, 1998.

[24] Yansong Feng and Mirella Lapata. How many words is a picture worth? Automatic caption generation for news images. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. ACL, 2010.

[25] Yansong Feng and Mirella Lapata. Automatic caption generation for news images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

[26] Mark Gabel and Zhendong Su. Javert: fully automatic mining of general temporal properties from dynamic traces. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2008.

[27] Ruifang Ge and Raymond J. Mooney. A statistical semantic parser that integrates syntax and semantics. In Computational Natural Language Learning. ACL, 2005.

[28] Roxana Girju, Adriana Badulescu, and Dan Moldovan. Learning semantic constraints for the automatic discovery of part-whole relations. In North American Chapter of the Association for Computational Linguistics on Human Language Technology. ACL, 2003.

[29] Peter Gorniak and Deb Roy. Situated language understanding as filtering perceived affordances. Cognitive Science, 2007.


[30] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 2004.

[31] Sangmok Han, David R. Wallace, and Robert C. Miller. Code completion from abbreviated input. In Automated Software Engineering. IEEE, 2009.

[32] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics. ACL, 1992.

[33] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the naturalness of software. In 34th International Conference on Software Engineering (ICSE). IEEE, 2012.

[34] Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[35] Ferosh Jacob and Robert Tairas. Code template inference using language models. In Southeast Regional Conference. ACM, 2010.

[36] Rohit J. Kate and Raymond J. Mooney. Learning language semantics from ambiguous supervision. In AAAI, 2007.

[37] Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing (ICASSP-95). IEEE, 1995.

[38] Z. Kozareva and E. Hovy. Not all seeds are equal: measuring the quality of text mining seeds. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 618–626. Association for Computational Linguistics, 2010.

[39] J. Krishnamurthy and T.M. Mitchell. Which noun phrases denote which concepts? In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2011.

[40] Jayant Krishnamurthy and Thomas Kollar. Jointly learning to parse and perceive: Connecting natural language to the physical world. TACL, 2013.

[41] Jayant Krishnamurthy and Tom M. Mitchell. Weakly supervised training of semantic parsers. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. ACL, 2012.

[42] Naveen Kumar and Benjamin Carterette. Time based feedback and query expansion for Twitter search. In Advances in Information Retrieval. Springer, 2013.

[43] Ni Lao, Tom Mitchell, and William W. Cohen. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.

[44] Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. What's in a name? A study of identifiers. In 14th IEEE International Conference on Program Comprehension (ICPC 2006), 2006.

[45] Percy Liang, Michael I. Jordan, and Dan Klein. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009.

[46] Dekang Lin, Shaojun Zhao, Lijuan Qin, and Ming Zhou. Identifying synonyms among distributionally similar words. In IJCAI, 2003.


[47] Thomas Lin, Oren Etzioni, et al. Entity linking at web scale. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction. ACL, 2012.

[48] T. McIntosh and J.R. Curran. Reducing semantic drift with bagging and distributional similarity. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pages 396–404. Association for Computational Linguistics, 2009.

[49] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 1995.

[50] Thang Luong Minh, Michael C. Frank, and Mark Johnson. Parsing entire discourses as very long strings: Capturing topic continuity in grounded language learning. TACL, 2013.

[51] Dana Movshovitz-Attias and William W. Cohen. Bootstrapping biomedical ontologies for scientific text using NELL. Technical report, Carnegie Mellon University, CMU-ML-12-101, 2012.

[52] Dana Movshovitz-Attias and William W. Cohen. Natural language models for predicting programming comments. In ACL. Association for Computational Linguistics, August 2013.

[53] Dana Movshovitz-Attias and William W. Cohen. Grounded discovery of coordinate term relationships between software entities. ArXiv e-prints, May 2015.

[54] Dana Movshovitz-Attias and William W. Cohen. Bootstrapping biomedical ontologies for scientific text using NELL. In BioNLP: Biomedical Natural Language Processing at NAACL, pages 11–19, Montreal, Canada, June 2012. Association for Computational Linguistics.

[55] Joakim Nivre, Johan Hall, and Jens Nilsson. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC, 2006.

[56] Cyrus Omar. Structured statistical syntax tree prediction. In Proceedings of the 2013 Companion Publication for Conference on Systems, Programming, & Applications: Software for Humanity. ACM, 2013.

[57] J. Osborne, J. Flatow, M. Holko, S. Lin, W. Kibbe, L. Zhu, M. Danila, G. Feng, and R. Chisholm. Annotating the human genome with disease ontology. BMC Genomics, 10(Suppl 1):S6, 2009.

[58] Patrick Pantel, Eric Crestan, Arkady Borkovsky, Ana-Maria Popescu, and Vishnu Vyas. Web-scale distributional similarity and entity set expansion. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2. Association for Computational Linguistics, 2009.

[59] Patrick Andre Pantel. Clustering by committee. PhD thesis, Department of Computing Science, University of Alberta, 2003.

[60] Adam Pauls and Dan Klein. Faster and smaller n-gram language models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.

[61] Fernando Pereira, Naftali Tishby, and Lillian Lee. Distributional clustering of English words. In ACL, 1993.

[62] Sarah Rastkar, Gail C. Murphy, and Alexander W.J. Bradley. Generating natural language summaries for crosscutting source code concerns. In 27th IEEE International Conference on Software Maintenance (ICSM). IEEE, 2011.

[63] E. Riloff and R. Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the National Conference on Artificial Intelligence (AAAI-99), pages 474–479, 1999.

[64] Ronald Rosenfeld. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 2000.

[65] E.W. Sayers, T. Barrett, D.A. Benson, S.H. Bryant, K. Canese, V. Chetvernin, D.M. Church, M. DiCuccio, R. Edgar, S. Federhen, M. Feolo, L.Y. Geer, W. Helmberg, Y. Kapustin, D. Landsman, D.J. Lipman, T.L. Madden, D.R. Maglott, V. Miller, I. Mizrachi, J. Ostell, K.D. Pruitt, G.D. Schuler, E. Sequeira, S.T. Sherry, M. Shumway, K. Sirotkin, A. Souvorov, G. Starchenko, T.A. Tatusova, L. Wagner, E. Yaschenko, and J. Ye. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 37:5–15, Jan 2009.

[66] Peter Schulam, Roni Rosenfeld, and Premkumar Devanbu. Building statistical language models of code. In Proceedings of DAPSE. IEEE, 2013.

[67] David Shepherd, Zachary P. Fry, Emily Hill, Lori Pollock, and K. Vijay-Shanker. Using natural language program analysis to locate and understand action-oriented concerns. In Proceedings of the 6th International Conference on Aspect-Oriented Software Development. ACM, 2007.

[68] Jeffrey Mark Siskind. A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 1996.

[69] Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. Learning syntactic patterns for automatic hypernym discovery. In NIPS, 2004.

[70] Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. Semantic taxonomy induction from heterogenous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.

[71] Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K. Vijay-Shanker. Towards automatically generating summary comments for Java methods. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering. ACM, 2010.

[72] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. YAGO: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM, 2007.

[73] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. YAGO: A large ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

[74] Partha Pratim Talukdar, Derry Wijaya, and Tom Mitchell. Acquiring temporal constraints between relations. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 992–1001. ACM, 2012.

[75] Partha Pratim Talukdar, Derry Wijaya, and Tom Mitchell. Coupled temporal scoping of relational facts. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. ACM, 2012.

[76] Yuen-Hsien Tseng and Da-Wei Juang. Document-self expansion for text categorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2003.

[77] Peter Turney, Michael L. Littman, Jeffrey Bigham, and Victor Shnayder. Combining independent modules to solve multiple-choice synonym and analogy problems. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, 2003.

[78] V. Vyas, P. Pantel, and E. Crestan. Helping editors choose better seed sets for entity set expansion. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, 2009.

[79] Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. Single document summarization with document expansion. In Proceedings of the National Conference on Artificial Intelligence. AAAI Press, 2007.

[80] Xiaoyin Wang, David Lo, Jing Jiang, Lu Zhang, and Hong Mei. Extracting paraphrases of technical terms from noisy parallel software corpora. In Proceedings of the ACL-IJCNLP. ACL, 2009.

[81] T. Wattarujeekrit, P. Shah, and N. Collier. PASBio: predicate-argument structures for event extraction in molecular biology. BMC Bioinformatics, 5(1):155, 2004.

[82] Markus Weimer, Iryna Gurevych, and Max Muhlhauser. Automatically assessing the post quality in online discussions on software. In Proceedings of the 45th Annual Meeting of the ACL. ACL, 2007.

[83] Roung-Shiunn Wu and Po-Chun Li. Video annotation using hierarchical Dirichlet process mixture model. Expert Systems with Applications, 2011.

[84] Alexander Yates, Michael Cafarella, Michele Banko, Oren Etzioni, Matthew Broadhead, and Stephen Soderland. TextRunner: open information extraction on the web. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 25–26. Association for Computational Linguistics, 2007.

[85] Chen Yu and Dana H. Ballard. On the integration of grounding language and learning objects. In AAAI, 2004.

[86] Luke S. Zettlemoyer and Michael Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence, 2005.
