


Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 2341–2351, Melbourne, Australia, July 15-20, 2018. © 2018 Association for Computational Linguistics


Document Similarity for Texts of Varying Lengths via Hidden Topics

Hongyu Gong*, Tarek Sakakini*, Suma Bhat*, Jinjun Xiong†

*University of Illinois at Urbana-Champaign, USA    †T. J. Watson Research Center, IBM

*{hgong6, sakakini, spbhat2}@illinois.edu    †[email protected]

Abstract

Measuring similarity between texts is an important task for several applications. Available approaches to measure document similarity are inadequate for document pairs that have non-comparable lengths, such as a long document and its summary. This is because of the lexical, contextual and abstraction gaps between a long document of rich details and its concise summary of abstract information. In this paper, we present a document matching approach that bridges this gap by comparing the texts in a common space of hidden topics. We evaluate the matching algorithm on two matching tasks and find that it consistently and widely outperforms strong baselines. We also highlight the benefits of incorporating domain knowledge into text matching.

1 Introduction

Measuring the similarity between documents is of key importance in several natural language processing applications, including information retrieval (Salton and Buckley, 1988), book recommendation (Gopalan et al., 2014), news categorization (Ontrup and Ritter, 2002) and essay scoring (Landauer, 2003). A range of document similarity approaches have been proposed and effectively used in recent applications (Lai et al., 2015; Bordes et al., 2015). Central to the tasks discussed above is the assumption that the documents being compared are of comparable lengths.

Advances in language processing approaches to transform natural language understanding, such as text summarization and recommendation, have generated new requirements for comparing documents. For instance, summarization techniques

Table 1: A Sample Concept-Project Matching

Concept
Heredity: Inheritance and Variation of Traits
All cells contain genetic information in the form of DNA molecules. Genes are regions in the DNA that contain the instructions that code for the formation of proteins.

Project
Pedigree Analysis: A Family Tree of Traits
Do you have the same hair color or eye color as your mother? When we look at members of a family it is easy to see that some physical characteristics or traits are shared. To start this project, you should draw a pedigree showing the different members of your family. Ideally you should include multiple people from at least three generations.

(extractive and abstractive) are capable of automatically generating textual summaries by converting a long document of several hundred words into a condensed text of only a few words while preserving the core meaning of the original text (Kedzie and McKeown, 2016). Conceivably, a related aspect of summarization is the task of bidirectional matching of a summary and a document or a set of documents, which is the focus of this study. The document similarity considered in this paper is between texts that have significant differences not only in length, but also in the abstraction level (such as a definition of an abstract concept versus a detailed instance of that abstract concept).

As an illustration, consider the task of matching a Concept with a Project as shown in Table 1. Here a Concept is a grade-level science curriculum item and represents the summary. A Project, listed in a collection of science projects, represents the document. Projects typically are long texts including an introduction, materials and procedures, whereas science concepts are much shorter in comparison, having a title and a concise and abstract description. The concepts and projects are described in detail in Section 5.1. The matching



task here is to automatically suggest a hands-on project for a given concept in the curriculum, such that the project can help reinforce a learner's basic understanding of the concept. Conversely, given a science project, one may need to identify the concept it covers by matching it to a listed concept in the curriculum. This would be conceivable in the context of an intelligent tutoring system.

Challenges to the matching task mentioned above include: 1) The mismatch in the relative lengths of the documents being compared – a long piece of text (henceforth termed document) and a short piece of text (termed summary) – gives rise to the vocabulary mismatch problem, where the document and the summary do not share a majority of terms. 2) The context mismatch problem arises because a document provides a reasonable amount of text to infer the contextual meaning of a term, but a summary only provides a limited context, which may or may not involve the same terms considered in the document. These challenges render existing approaches to comparing documents inadequate – for instance, those that rely on document representations such as Doc2Vec (Le and Mikolov, 2014) – because the predominance of non-topic words in the document introduces noise into its representation, while the summary is relatively noise-free.

Our approach to the matching problem is to allow a multi-view generalization of the document, where multiple hidden topics are used to establish a common ground to capture as much information of the document and the summary as possible, and to use this to score the relevance of the pair. We empirically validate our approach on two tasks – that of project-concept matching in grade-level science and that of scientific paper-summary matching – using both custom-made and publicly available datasets. The main contributions of this paper are:
1. We propose an embedding-based hidden topic model to extract topics and measure their importance in long documents.
2. We present a novel geometric approach to compare documents with differing modality (a long document to a short summary) and validate its performance relative to strong baselines.
3. We explore the use of domain-specific word embeddings for the matching task and show the explicit benefit of incorporating domain knowledge in the algorithm.
4. We make available the first dataset¹ on project-concept matching in the science domain to help further research in this area.

2 Related Works

Document similarity approaches quantify the degree of relatedness between two pieces of text of comparable lengths and thus enable matching between documents. Traditionally, statistical approaches (e.g., (Metzler et al., 2007)) and vector-space-based methods (including the robust Latent Semantic Analysis (LSA) (Dumais, 2004)) have been used for text similarity. More recently, neural network-based methods have been used for document representation; these include average word embeddings (Mikolov et al., 2013), Doc2Vec (Le and Mikolov, 2014), Skip-Thought vectors (Kiros et al., 2015), recursive neural network-based methods (Socher et al., 2014), LSTM architectures (Tai et al., 2015), and convolutional neural networks (Blunsom et al., 2014).

Considering works that avoid using an explicit document representation for comparing documents, the state-of-the-art method is Word Mover's Distance (WMD), which relies on pretrained word embeddings (Kusner et al., 2015). Given these embeddings, WMD defines the distance between two documents as the best transport cost of moving all words from one document to the other within the space of word embeddings. The advantages of WMD are that it is hyperparameter-free and achieves high retrieval accuracy on document classification tasks with documents of comparable lengths. However, it is computationally expensive for long documents (Kusner et al., 2015).

Clearly, what is lacking in the prior literature is a study of document similarity approaches that match documents with widely different sizes. It is this gap in the literature that we expect to fill by way of this study.

Latent Variable Models. Latent variable models, including count-based and probabilistic models, have been studied in many previous works. Count-based models such as Latent Semantic Indexing (LSI) compare two documents based on their combined vocabulary (Deerwester et al., 1990). When

¹ Our code and data are available at: https://github.com/HongyuGong/Document-Similarity-via-Hidden-Topics.git



Figure 1: (a) word geometry of general embeddings; (b) word geometry of science-domain embeddings. Two key words "forces" and "matters" are shown in red and blue respectively. Red words represent different senses of "forces", and blue words carry senses of "matters". "forces" mainly refers to "army" and "matters" refers to "issues" in the general embeddings of (a), whereas "forces" shows its sense of "gravity" and "matters" shows the sense of "solids" in the science-domain embeddings of (b).

documents have highly mismatched vocabularies, such as those that we study, relevant documents might be classified as irrelevant. Our model is built upon word embeddings, which are more robust to such a vocabulary mismatch.

Probabilistic models such as Latent Dirichlet Allocation (LDA) define topics as distributions over words (Blei et al., 2003). In our model, topics are low-dimensional real-valued vectors (more details in Section 4.2).

3 Domain Knowledge

Domain information pertaining to specific areas of knowledge is made available in texts by the use of words with domain-specific meanings or senses. Consequently, domain knowledge has been shown to be critical in many NLP applications such as information extraction and multi-document summarization (Cheung and Penn, 2013a), spoken language understanding (Chen et al., 2015), aspect extraction (Chen et al., 2013) and summarization (Cheung and Penn, 2013b).

As will be described later, our distance metric for comparing a document and a summary relies on word embeddings. We show in this work that embeddings trained on a science-domain corpus lead to better performance than embeddings trained on a general corpus (WikiCorpus). Towards this, we extract a science-domain sub-corpus from the WikiCorpus; the corpus extraction is detailed in Section 5.

To motivate the domain-specific behavior of polysemous words, we qualitatively explore how domain-specific embeddings differ from the general embeddings on two polysemous science terms: forces and matters. Considering the fact that the meaning of a word is dictated by its neighbors, for each set of word embeddings, we plot the neighbors of these two terms in Figure 1, projected onto 2 dimensions using Locally Linear Embedding (LLE), which preserves word distances (Roweis and Saul, 2000). We then analyze the senses of the focus terms – here, forces and matters.

From Figure 1(a), we see that for the word forces, its general embedding is close to army, soldiers, allies, indicating that it is related to violence and power in the general domain. Shifting our attention to Figure 1(b), we see that for the same term, its science embedding is closer to torque, gravity, acceleration, implying that its science sense is more about physical interactions. Likewise, for the word matters, its general embedding is surrounded by affairs and issues, whereas its science embedding is closer to particles and material, suggesting that it represents substances. Thus, we conclude that domain-specific embeddings (here, science) are capable of incorporating domain knowledge into word representations. We use this observation in our document-summary matching system, to which we turn next.

4 Model

Our model that performs the matching between a document and a summary is depicted in Figure 2. It is composed of three modules that perform preprocessing, document topic generation, and relevance measurement between a document and a summary. Each of these modules is discussed below.



Figure 2: The system for document-summary matching

4.1 Preprocessing

The preprocessing module tokenizes texts and removes stop words and prepositions. This step allows our system to focus on the content words without impacting the meaning of the original texts.
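The preprocessing step can be sketched as follows. This is an illustrative stand-in, not the authors' code: the function name is ours, and the tiny stop-word set below merely hints at the full stop-word and preposition lists the module would use.

```python
# Minimal preprocessing sketch: tokenize, lowercase, drop stop words.
# STOP_WORDS here is a tiny illustrative subset, not the paper's full list.
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "to", "and"}

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Genes are regions in the DNA."))  # ['genes', 'regions', 'dna']
```

In the full system, the surviving content words are then replaced by their pretrained word vectors before topic generation.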

4.2 Topic Generation from Documents

We assume that a document (a long text) is a structured collection of words, with the 'structure' brought about by the composition of topics. In some sense, this 'structure' is represented as a set of hidden topics. Thus, we assume that a document is generated from certain hidden "topics", analogous to the modeling assumption in LDA. However, unlike in LDA, the "topics" here are neither specific words nor distributions over words, but are essentially a set of vectors. In turn, this means that the words (represented as vectors) constituting the document can be generated from the hidden topic vectors.

Introducing some notation, the word vectors in a document are $\{w_1, \ldots, w_n\}$, and the hidden topic vectors of the document are $\{h_1, \ldots, h_K\}$, where $w_i, h_k \in \mathbb{R}^d$, with $d = 300$ in our experiments.

Linear operations using word embeddings have been empirically shown to approximate their compositional properties (e.g., the embedding of a phrase is nearly the sum of the embeddings of its component words) (Mikolov et al., 2013). This motivates the linear reconstruction of the words from the document's hidden topics while minimizing the reconstruction error. We stack the $K$ topic vectors as a topic matrix $H = [h_1, \ldots, h_K]$ ($K < d$). We define the reconstructed word vector $\tilde{w}_i$ for the word $w_i$ as the optimal linear approximation given by the topic vectors: $\tilde{w}_i = H\tilde{\alpha}_i$, where

$$\tilde{\alpha}_i = \operatorname*{argmin}_{\alpha_i \in \mathbb{R}^K} \|w_i - H\alpha_i\|_2^2. \quad (1)$$

The reconstruction error $E$ for the whole document is the sum of each word's reconstruction error and is given by $E = \sum_{i=1}^{n} \|w_i - \tilde{w}_i\|_2^2$. This being a function of the topic vectors, our goal is to find the optimal $H^*$ so as to minimize the error $E$:

$$H^* = \operatorname*{argmin}_{H \in \mathbb{R}^{d \times K}} E(H) = \operatorname*{argmin}_{H \in \mathbb{R}^{d \times K}} \sum_{i=1}^{n} \min_{\alpha_i} \|w_i - H\alpha_i\|_2^2, \quad (2)$$

where $\|\cdot\|$ is the Frobenius norm of a matrix. Without loss of generality, we require the topic vectors $\{h_i\}_{i=1}^{K}$ to be orthonormal, i.e., $h_i^T h_j = \mathbb{1}(i = j)$. As we can see, the optimization problem (2) describes an optimal linear space spanned by the topic vectors, so the norm and the linear dependency of the vectors do not matter. With the orthonormal constraints, we simplify the form of the reconstructed vector $\tilde{w}_i$ as:

$$\tilde{w}_i = HH^T w_i. \quad (3)$$

We stack the word vectors in the document as a matrix $W = [w_1, \ldots, w_n]$. The equivalent formulation of problem (2) is:

$$\min_{H} \|W - HH^T W\|_2^2 \quad \text{s.t. } H^T H = I, \quad (4)$$

where $I$ is an identity matrix. The problem can be solved by Singular Value Decomposition (SVD), using which the matrix $W$ can be decomposed as $W = U\Sigma V^T$, where $U^T U = I$, $V^T V = I$, and $\Sigma$ is a diagonal matrix whose diagonal elements are arranged in decreasing order of absolute value. We show in the supplementary material that the first $K$ vectors in the matrix $U$ are exactly the solution $H^* = [h_1^*, \ldots, h_K^*]$.

We find the optimal topic vectors $H^* = [h_1^*, \ldots, h_K^*]$ by solving problem (4). We note that these topic vectors are not equally important; we say that one topic is more important than another if it can reconstruct the words with smaller error. Define $E_k$ as the reconstruction error when we only use topic vector $h_k^*$ to reconstruct the document:

$$E_k = \|W - h_k^* h_k^{*T} W\|_2^2. \quad (5)$$

Now define $i_k$ as the importance of topic $h_k^*$, which measures the topic's ability to reconstruct the words in a document:

$$i_k = \|h_k^{*T} W\|_2^2. \quad (6)$$

We show in the supplementary material that the higher the importance $i_k$ is, the smaller the reconstruction error $E_k$ is. Now we normalize $i_k$ as $\bar{i}_k$ so that the importance does not scale with the norm of the word matrix $W$, and so that the importances of the $K$ topics sum to 1. Thus,

$$\bar{i}_k = i_k \Big/ \sum_{j=1}^{K} i_j. \quad (7)$$

The number of topics $K$ is a hyperparameter in our model. A small $K$ may not cover the key ideas of the document, whereas a large $K$ may keep trivial and noisy information. Empirically we find that $K = 15$ captures the most important information from the document.

4.3 Topic Mapping to Summaries

We have extracted $K$ topic vectors $\{h_k^*\}_{k=1}^{K}$ from the document matrix $W$, whose importance is reflected by $\{\bar{i}_k\}_{k=1}^{K}$. In this module, we measure the relevance of a document-summary pair. A summary that matches the document should also be closely related to the "topics" of that document. Suppose the vectors of the words in a summary are stacked as a $d \times m$ matrix $S = [s_1, \ldots, s_m]$, where $s_j$ is the vector of the $j$-th word in the summary. Similar to the reconstruction of the document, the summary can also be reconstructed from the document's topic vectors as shown in Eq. (3). Let $\tilde{s}_j^k$ be the reconstruction of the summary word $s_j$ given by one topic $h_k^*$: $\tilde{s}_j^k = h_k^* h_k^{*T} s_j$.

Let $r(h_k^*, s_j)$ be the relevance between a topic vector $h_k^*$ and a summary word $s_j$. It is defined as the cosine similarity between $\tilde{s}_j^k$ and $s_j$:

$$r(h_k^*, s_j) = s_j^T \tilde{s}_j^k \big/ \left(\|s_j\|_2 \cdot \|\tilde{s}_j^k\|_2\right). \quad (8)$$

Furthermore, let $r(h_k^*, S)$ be the relevance between a topic vector and the summary, defined as the average similarity between the topic vector and the summary words:

$$r(h_k^*, S) = \frac{1}{m} \sum_{j=1}^{m} r(h_k^*, s_j). \quad (9)$$

The relevance between a topic vector and a summary is a real value between 0 and 1.

As we have shown, the topics extracted from a document are not equally important. Naturally, a summary relevant to more important topics is more likely to better match the document. Therefore, we define $r(W, S)$ as the relevance between the document $W$ and the summary $S$, given by the sum of topic-summary relevances weighted by the importance of each topic:

$$r(W, S) = \sum_{k=1}^{K} \bar{i}_k \cdot r(h_k^*, S), \quad (10)$$

where $\bar{i}_k$ is the importance of topic $h_k^*$ as defined in (7). The higher $r(W, S)$ is, the better the summary matches the document.

We provide a visual representation of the documents as shown in Figure 3 to illustrate the notion of hidden topics. The two documents are from science projects: a genetics project, Pedigree Analysis: A Family Tree of Traits (ScienceBuddies, 2017a), and a weather project, How Do the Seasons Change in Each Hemisphere (ScienceBuddies, 2017b). We project all embeddings to a three-dimensional space for ease of visualization.

As seen in Figure 3, the hidden topics reconstruct the words in their respective documents to the extent possible. This means that the words of a document lie roughly on the plane formed by their corresponding topic vectors. We also notice that the summary words (heredity and weather respectively for the two projects under consideration) lie very close to the plane formed by the hidden topics of the relevant project while remaining away from the plane of the irrelevant project. This shows that the words in the summary (and hence the summary itself) can also be reconstructed from the hidden topics of documents that match the summary (and are hence 'relevant' to the summary). Figure 3 visually explains the geometric relations between the summaries, the hidden topics and the documents. It also validates the representation power of the extracted hidden topic vectors.



Figure 3: Words mode and genes from the document on genetics and words storm and atmospheric from the document on weather are represented by pink and blue points respectively. The linear space of hidden topics in genetics forms the pink plane, where the summary word heredity (the red point) roughly lies. Topic vectors of the document on weather form the blue plane, and the summary word weather (the dark blue point) lies almost on the same plane.

5 Experiments

In this section, we evaluate our document-summary matching approach on two specific applications where texts of different sizes are compared. One application is that of concept-project matching useful in science education, and the other is that of summary-research paper matching.

Word Embeddings. Two sets of 300-dimensional word embeddings were used in our experiments. They were trained by the Continuous Bag-of-Words (CBOW) model in word2vec (Mikolov et al., 2013) but on different corpora. One training corpus is the full English WikiCorpus of size 9 GB (Al-Rfou et al., 2013). The second consists of science articles extracted from the WikiCorpus. To extract these science articles, we manually selected the science categories in Wikipedia and considered all subcategories within a depth of 3 from these manually selected root categories. We then extracted all articles in the aforementioned science categories, resulting in a science corpus of size 2.4 GB. The word vectors used for documents and summaries are both from the pretrained word2vec embeddings.

Baselines. We include two state-of-the-art methods of measuring document similarity for comparison, using their implementations available in gensim (Rehurek and Sojka, 2010).

(1) Word movers’ distance (WMD) (Kusneret al., 2015). WMD quantifies the distance be-tween a pair of documents based on word em-beddings as introduced previously (c.f. RelatedWork). We take the negative of their distance asa measure of document similarity (here between adocument and a summary).(2) Doc2Vec (Le and Mikolov, 2014). Documentrepresentations have been trained with neural net-works. We used two versions of doc2vec: onetrained on the full English Wikicorpus and a sec-ond trained on the science corpus, same as the cor-pora used for word embedding training. We usedthe cosine similarity between two text vectors tomeasure their relevance.

For a given document-summary pair, we compare the scores obtained using the above two methods with that obtained using our method.

5.1 Concept-Project Matching

Science projects are valuable resources for learners to instigate knowledge creation via experimentation and observation. The need for matching a science concept with a science project arises when learners intending to delve deeper into certain concepts search for projects that match a given concept. Additionally, they may want to identify the concepts with which a set of projects are related.

We note that in this task, science concepts are highly concise summaries of the core ideas in projects, whereas projects are detailed instructions of the experimental procedures, including an introduction, materials and a description of the procedure, as shown in Table 1. Our matching method provides a way to bridge the gap between abstract concepts and detailed projects. The format of the concepts and the projects is discussed below.

Concepts. For the purpose of this study we use the concepts available in the Next Generation Science Standards (NGSS) (NGSS, 2017). Each concept is accompanied by a short description. For example, one concept in life science is Heredity: Inheritance and Variation of Traits. Its description is All cells contain genetic information in the form of DNA molecules. Genes are regions in the DNA that contain the instructions that code for the formation of proteins. Typical lengths of concepts are around 50 words.

Projects. The website Science Buddies (ScienceBuddies, 2017c) provides a list of projects from a variety of science and engineering disciplines, such



Table 2: Classification results for the Concept-Project Matching task. All performance differences were statistically significant at p = 0.01.

method           | precision     | recall        | fscore
topic_science    | 0.758 ± 0.012 | 0.885 ± 0.071 | 0.818 ± 0.028
topic_wiki       | 0.750 ± 0.009 | 0.842 ± 0.010 | 0.791 ± 0.007
wmd_science      | 0.643 ± 0.070 | 0.735 ± 0.119 | 0.679 ± 0.022
wmd_wiki         | 0.568 ± 0.055 | 0.661 ± 0.119 | 0.595 ± 0.020
doc2vec_science  | 0.615 ± 0.055 | 0.843 ± 0.066 | 0.695 ± 0.019
doc2vec_wiki     | 0.661 ± 0.084 | 0.737 ± 0.149 | 0.681 ± 0.032

as physical sciences, life sciences and social sciences. A typical project consists of an abstract, an introduction, a description of the experiment and the associated procedures. A project typically has more than 1000 words.

Dataset. We prepared a representative dataset of 537 pairs of projects and concepts involving 53 unique concepts from NGSS and 230 unique projects from Science Buddies. Engineering undergraduate students annotated each pair with the decision whether it was a good match or not and received research credit. As a result, each concept-project pair received at least three annotations, and upon consolidation, we considered a concept-project pair to be a good match when a majority of the annotators agreed; otherwise it was not considered a good match. The ratio between good matches and bad matches in the collected data was 44:56.

Classification Evaluation. Annotations from students provided the ground truth labels for the classification task. We randomly split the dataset into tuning and test instances with a ratio of 1:9. A threshold score was tuned on the tuning data, and concept-project pairs with scores higher than this threshold were classified as good matches during testing. We performed 10-fold cross validation, and report the average precision, recall, F1 score and their standard deviations in Table 2.
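The threshold-tuning step can be sketched as follows. The scores and labels are toy values, and using the observed scores themselves as the candidate-threshold grid is our simplification; the paper does not specify its search procedure.

```python
# Sketch: tune a relevance-score threshold on a tuning split by maximizing F1,
# then classify pairs scoring at or above the threshold as good matches.

def f1_at_threshold(scores, labels, t):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def tune_threshold(scores, labels):
    # Candidate thresholds: the observed scores (an illustrative choice).
    return max(scores, key=lambda t: f1_at_threshold(scores, labels, t))

tune_scores = [0.1, 0.4, 0.6, 0.9]            # toy relevance scores
tune_labels = [False, False, True, True]      # toy annotator labels
t = tune_threshold(tune_scores, tune_labels)  # separates the two classes here
```

At test time, the tuned `t` is held fixed and applied to unseen concept-project pairs.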

Our topic-based metric is denoted as "topic", and the general-domain and science-domain embeddings are denoted as "wiki" and "science" respectively. We show the performance of our method against the two baselines while varying the underlying embeddings, thus resulting in 6 different combinations. For example, "topic_science" refers to our method with science embeddings. From the table (column 1) we notice the following: 1) Our method significantly outperforms the two baselines by a wide margin (≈10%) in both the general-domain setting and the domain-specific setting. 2) Using science domain-specific word embeddings instead of the general word embeddings results in the best performance across all algorithms. This performance was observed despite the word embeddings being trained on a significantly smaller corpus than the general-domain corpus.

Besides the classification metrics, we also evaluated the directed matching from concepts to projects with ranking metrics.

Ranking Evaluation. Our collected dataset resulted in a many-to-many matching between concepts and projects. This is because the same concept was found to be a good match for multiple projects, and the same project was found to match many concepts. The previously described classification task evaluated the bidirectional concept-project matching. Next we evaluated the directed matching from concepts to projects, to see how relevant the top-ranking projects are to a given input concept. Here we use precision@k (Radlinski and Craswell, 2010) as the evaluation metric, considering the percentage of relevant ones among the top-ranking projects.

For this part, we only considered the methods using science-domain embeddings, as they showed superior performance in the classification task. For each concept, we compute the precision@k of the matched projects and place the concept in one of k+1 bins accordingly. For example, for k=3, if only two of the three top-ranked projects are a correct match, the concept is placed in the bin corresponding to 2/3. In Figure 4, we show the percentage of concepts that fall into each bin for the three different algorithms, for k=1, 3, 6.
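The binning scheme just described can be sketched as follows. This is our own minimal illustration; the function and variable names are hypothetical, not from the paper:

```python
def precision_at_k(ranked_projects, relevant, k):
    """Fraction of the top-k ranked projects that are relevant."""
    top = ranked_projects[:k]
    return sum(1 for p in top if p in relevant) / k

def bin_concepts(rankings, relevance, k):
    """Place each concept into one of k+1 bins (0/k ... k/k) by its
    precision@k, and return the fraction of concepts in each bin.

    rankings:  dict mapping concept -> list of projects, best first
    relevance: dict mapping concept -> set of truly matching projects
    """
    bins = [0] * (k + 1)
    for concept, ranked in rankings.items():
        hits = round(precision_at_k(ranked, relevance[concept], k) * k)
        bins[hits] += 1
    total = len(rankings)
    return [b / total for b in bins]
```

With k=3, a concept whose top three projects contain two correct matches lands in the 2/3 bin, exactly as in the example above.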

We observe that recommendations using the hidden-topic approach fall into the high-value bins more often than those of the others, performing consistently better than the two strong baselines. The advantage becomes more obvious at precision@6. It is worth mentioning that "wmd science" falls behind "doc2vec science" in the classification task, while it outperforms "doc2vec science" in the ranking task.

5.2 Text Summarization

The task of matching summaries and documents is commonly seen in real life. For example, we


Figure 4: Ranking Performance of All Methods: (a) Precision@1, (b) Precision@3, (c) Precision@6

use an event summary "Google's AlphaGo beats Korean player Lee Sedol in Go" to search for relevant news, or use the summary of a scientific paper to look for related research publications. Such matching constitutes an ideal task for evaluating our matching method on texts of different sizes.

Dataset. We use a dataset from the CL-SciSumm Shared Task (Jaidka et al., 2016). The dataset consists of 730 ACL Computational Linguistics research papers covering 50 categories in total. Each category consists of a reference paper (RP) and around 10 citing papers (CP) that contain citations to the RP. A human-generated summary of the RP is provided, and we treat the 10 CPs as relevant to the summary. The matching task here is between the summary and all CPs in each category.

Evaluation. For each paper, we keep all of its content except the experiments and acknowledgement sections (these sections were omitted because their content is often less related to the topic of the summary). The typical summary length is about 100 words, while a paper has more than 2000 words. For each topic, we rank all 730 papers in terms of their relevance as generated by our method and the baselines, using both sets of embeddings. For evaluation, we use the information-retrieval measure precision@k, which considers the number of relevant matches among the top-k matchings (Manning et al., 2008). For each combination of text similarity approach and embeddings, we show precision@k for different k's in Figure 5. We observe that our method with science embeddings achieves the best performance compared to the baselines, once again showing not only the benefits of our method but also those of incorporating domain knowledge.

6 Discussion

Analysis of Results. From the results of the two tasks we observe that our method outperforms two strong baselines. The reason for WMD's poor performance could be that the many uninformative words (those unrelated to the central topic) make WMD overestimate the distance between the document-summary pair. As for doc2vec, its single-vector representation may not be able to capture all the key topics of a document. A project can contain multifaceted information; e.g., a project studying how climate change affects grain production is related to both environmental science and agricultural science.

Figure 5: Summary-Article Matching

Effect of Topic Number. The number of hidden topics K is a hyperparameter in our setting. We empirically evaluate the effect of the topic number in the task of concept-project matching. Figure 6 shows the F1 scores and standard deviations at different K. As we can see, the optimal K is 18. When K is too small, the hidden topics are too few to capture the key information in projects; thus increasing the topic number from 3 to 6 brings a large improvement in performance. Topic numbers larger than the optimal value degrade performance, since the extra topics incorporate noisy information. We note that the performance changes are mild when the number of topics is in the range [18, 31]: since topics are weighted by their importance, the effect of noisy information from extra hidden topics is mitigated.

Figure 6: F1 score on concept-project matching with different topic numbers K

Interpretation of Hidden Topics. We consider summary-paper matching as an example, with around 10 papers per category. We extracted the hidden topics from each paper, reconstructed words with these topics as shown in Eq. (3), and selected the words with the smallest reconstruction errors. These words are closely related to the hidden topics, so we call them topic words and use them as an interpretation of the hidden topics. We visualize the cloud of such topic words for the set of papers about word sense disambiguation in Figure 7. We see that the words selected based on the hidden topics cover key ideas such as disambiguation, represent, classification and sentence. This qualitatively validates the representational power of the hidden topics. More examples are available in the supplementary material.
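The topic-word selection can be sketched as follows. This is our own illustration, not the released code; it assumes the hidden topics are given as orthonormal vectors and measures a word's reconstruction error as the residual of projecting its embedding onto the topic subspace:

```python
import numpy as np

def topic_words(word_vecs, topics, n_words=10):
    """Select the words best reconstructed by the hidden topics.

    word_vecs: dict mapping word -> 1-D embedding array
    topics:    (K, d) array whose rows are orthonormal hidden-topic vectors
    Returns the n_words words with the smallest relative reconstruction
    error, i.e. whose embeddings lie closest to the topic subspace.
    """
    errors = {}
    for word, v in word_vecs.items():
        recon = topics.T @ (topics @ v)  # projection onto the topic span
        errors[word] = np.linalg.norm(v - recon) / (np.linalg.norm(v) + 1e-12)
    return sorted(errors, key=errors.get)[:n_words]
```

Words whose embeddings lie almost entirely in the topic subspace surface as topic words, while words orthogonal to it are filtered out.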

We interpret this to mean that the proposed idea of multiple hidden topics captures the key information of a document. The extracted "hidden topics" represent the essence of documents, suggesting the appropriateness of our relevance metric for measuring similarity between texts of different sizes. Even though our focus in this study was the science domain, we point out that the results are more generally valid, since we made no domain-specific assumptions.

Figure 7: Topic words from papers on word sense disambiguation

Varying Sensitivity to Domain. As shown in the results, the science-domain embeddings improved the F1 score of concept-project matching for the topic-based method by 2%, for WMD by 8%, and for doc2vec by 1%, underscoring the importance of domain-specific word embeddings.

Doc2vec is less sensitive to the domain because it provides a document-level representation. Even if some words cannot be disambiguated due to the lack of domain knowledge, other words in the same document provide complementary information, so the document embedding does not deviate too much from its true meaning.

Our method, although also a word-embedding method, is not as sensitive to the domain as WMD. It is robust to polysemous words with domain-sensitive semantics, since the hidden topics are extracted at the document level: broader contexts beyond individual words provide complementary information for word sense disambiguation.

7 Conclusion

We propose a novel approach to matching documents and summaries. The challenge we address is to bridge the gap between a detailed long text and its abstraction via hidden topics. We incorporate domain knowledge into the matching system to gain a further performance improvement. Our approach beats two strong baselines in two downstream applications: concept-project matching and summary-research paper matching.

Acknowledgments

This work is supported by the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM AI Horizons Network. We thank the anonymous ACL reviewers for their constructive suggestions. We thank Mathew Monfort for helping deploy the annotation tasks, and Jong Yoon Lee for the dataset collection.


References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Phil Blunsom, Edward Grefenstette, and Nal Kalchbrenner. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Yun-Nung Chen, William Yang Wang, Anatole Gershman, and Alexander I Rudnicky. 2015. Matrix factorization with knowledge graph propagation for unsupervised spoken language understanding. In ACL (1), pages 483–494.

Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. 2013. Exploiting domain knowledge in aspect extraction. In EMNLP, pages 1655–1667.

Jackie Chi Kit Cheung and Gerald Penn. 2013a. Probabilistic domain modelling with contextualized distributional semantic vectors. In ACL (1), pages 392–401.

Jackie Chi Kit Cheung and Gerald Penn. 2013b. Towards robust abstractive multi-document summarization: A caseframe analysis of centrality and domain. In ACL (1), pages 1233–1242.

Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391.

Susan T Dumais. 2004. Latent semantic analysis. Annual Review of Information Science and Technology, 38(1):188–230.

Prem K Gopalan, Laurent Charlin, and David Blei. 2014. Content-based recommendations with Poisson factorization. In Advances in Neural Information Processing Systems, pages 3176–3184.

Kokil Jaidka, Muthu Kumar Chandrasekaran, Sajal Rustagi, and Min-Yen Kan. 2016. Overview of the CL-SciSumm 2016 shared task. In Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries (BIRNDL 2016).

Chris Kedzie and Kathleen McKeown. 2016. Extractive and abstractive event summarization over streaming web text. In IJCAI, pages 4002–4003.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267–2273.

Thomas K Landauer. 2003. Automatic essay assessment. Assessment in Education: Principles, Policy & Practice, 10(3):295–308.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196.

Christopher D Manning, Prabhakar Raghavan, Hinrich Schutze, et al. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge.

Donald Metzler, Susan Dumais, and Christopher Meek. 2007. Similarity measures for short segments of text. In European Conference on Information Retrieval, pages 16–27. Springer.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

NGSS. 2017. Available at: https://www.nextgenscience.org. Accessed: 2017-06-30.

Jorg Ontrup and Helge Ritter. 2002. Hyperbolic self-organizing maps for semantic navigation. In Advances in Neural Information Processing Systems, pages 1417–1424.

Filip Radlinski and Nick Craswell. 2010. Comparing the sensitivity of information retrieval metrics. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 667–674. ACM.

Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Sam T Roweis and Lawrence K Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.

ScienceBuddies. 2017a. Available at: https://www.sciencebuddies.org/science-fair-projects/project-ideas/Genom_p010/genetics-genomics/pedigree-analysis-a-family-tree-of-traits. Accessed: 2017-06-30.

ScienceBuddies. 2017b. Available at: https://www.sciencebuddies.org/science-fair-projects/project-ideas/Weather_p006/weather-atmosphere/how-do-the-seasons-change-in-each-hemisphere. Accessed: 2017-06-30.

ScienceBuddies. 2017c. Available at: http://www.sciencebuddies.org. Accessed: 2017-06-30.

Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218.

Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1556–1566.