
Research Collection

Master Thesis

WikiMining: Summarising Wikipedia using submodular function maximisation

Author(s): Ungureanu, Victor

Publication Date: 2014

Permanent Link: https://doi.org/10.3929/ethz-a-010144394

Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library


WikiMining – Summarising Wikipedia using submodular function maximisation

Master Thesis

Victor Ungureanu

April 17, 2014

Advisors: Baharan Mirzasoleiman, Dr. Amin Karbasi, Prof. Dr. Andreas Krause

Department of Computer Science, ETH Zurich


Abstract

As a result of the volume of content that exists today on the internet, it has become increasingly harder for content creators and consumers alike to manage this content in a centralised manner. One such example is Wikipedia, which is very hard to analyse and digest at a general level, mainly because of its prohibitively large size. We aim to discover which are the most important Wikipedia articles and how articles evolve in popularity and influence over time. At a smaller scale, we investigate these same problems on specific categories.

Attempts to analyse Wikipedia have been made in the past using quantitative measures and semantic coverage, but these studies were done back when Wikipedia was a lot smaller than it is today, and none of these papers deals with detecting influential articles (using unsupervised methods). More often, Wikipedia has been used in natural language processing, especially for computing semantic relatedness based on the Wikipedia categories, but these papers deal only indirectly with analysing Wikipedia itself.

In order to study the evolution of Wikipedia's network of articles and find representative subsets, we investigate different solutions using submodular function maximisation. We adapt existing submodular functions from the literature and define new functions to solve this problem. We devise a framework – WikiMining – that can scale up to Wikipedia's size, using the Distributed Submodular Maximisation (GreeDi) protocol.


Acknowledgements

I would like to thank Baharan Mirzasoleiman for her continuous support and help throughout my thesis, Dr. Amin Karbasi and Prof. Andreas Krause for their valuable advice, and Mahmoudreza Babaei for his help extracting Wikipedia's revisions data.

I am also grateful to Ruben Sipos and Adith Swaminathan for providing me with their ACL and NIPS datasets and part of their code.


Contents

Contents iii

1 Introduction 1

2 Related work 3

3 Preliminaries 5
  3.1 Term frequency - inverse document frequency 5
  3.2 Cosine similarity 5
  3.3 Locality-sensitive hashing 6

4 Submodularity, coverage, summarisation 7
  4.1 Submodular functions 7
    4.1.1 Definitions 7
    4.1.2 Examples and properties 8
  4.2 Submodular function maximisation 9
    4.2.1 Problem statement 9
    4.2.2 Greedy maximisation 9
    4.2.3 GreeDi protocol 9
  4.3 Word coverage 10
    4.3.1 Definition 11
    4.3.2 Rationale 12
  4.4 Document influence 12
    4.4.1 Definition 12
    4.4.2 Rationale 13

5 Massive corpus summarisation 15
  5.1 Scaling from thousands to millions 15
  5.2 Scaling influential documents 16
  5.3 Graph coverage 16
    5.3.1 Definition 17
    5.3.2 Rationale 17
  5.4 LSH buckets 18
    5.4.1 Definition 18
    5.4.2 Rationale 18
  5.5 Beyond word coverage 19
    5.5.1 Definitions 19
    5.5.2 Rationale 21
  5.6 Combining multiple submodular functions 22
    5.6.1 Normalisation 23

6 WikiMining framework design 25
  6.1 External libraries 25
    6.1.1 Java Wikipedia Library 25
    6.1.2 Hadoop 26
    6.1.3 Cloud9 27
    6.1.4 Mahout 27
  6.2 System architecture 28
    6.2.1 Base data types 28
    6.2.2 Submodular functions 29
    6.2.3 Submodular function maximisation 30
    6.2.4 Coverage MapReduces 31
    6.2.5 Influence MapReduces 32
    6.2.6 Selection and evaluation 33
    6.2.7 Input, output 33

7 Experiments 35
  7.1 Datasets and metrics 35
    7.1.1 Datasets 35
    7.1.2 Metrics 36
    7.1.3 Interpreting the results 36
  7.2 Baselines 37
    7.2.1 Random 37
    7.2.2 Word coverage 38
    7.2.3 Document influence 40
  7.3 Graph coverage 40
  7.4 Locality Sensitive Hashing (LSH) buckets and word coverage 42
  7.5 Beyond word coverage 42
  7.6 Running time 51

8 Conclusion 53

Bibliography 55


Chapter 1

Introduction

As a result of the volume of content that exists today on the internet, it has become increasingly harder for content creators and consumers alike to manage this content in a centralised manner. One such example is Wikipedia, which is very hard to analyse and digest at a general level, mainly because of its prohibitively large size. We aim to discover which are the most important Wikipedia articles and how articles evolve in popularity and influence over time. At a smaller scale, we investigate these same problems on specific categories.

Attempts to analyse Wikipedia have also been made in the past: quantitative measures [17], semantic coverage [3]; but these studies were made back when Wikipedia was a lot smaller than it is today, and none of these papers deals with detecting influential articles (using submodular function maximisation). More often, Wikipedia has been used in natural language processing, especially for computing semantic relatedness based on the Wikipedia categories [2], but these kinds of papers deal only indirectly with analysing Wikipedia itself.

We define our problem as follows: given a very large corpus of documents – such as Wikipedia – find a way to pick the most representative articles that best encompass the most important topics. One of the main challenges is that the nature of the problem is subjective, because the notion of importance varies from person to person. Some more concrete objectives are to find popular articles – how visited, how interlinked a page is – or debated articles – how much the article has changed or, in the case of Wikipedia, how many revisions it has.

We can interpret the problem defined in the previous paragraph as a variant of multi-document summarisation, but its setting is different in various regards. First of all, the corpus is much bigger than the classical datasets. Secondly, the number of selected documents is very small – a fraction of 1/10000th or less of the whole corpus. For Wikipedia, we select between 40 and 130 articles from over 1.3 million human-written pages. We call the problem of selecting very few documents from an extremely large corpus massive corpus summarisation. In order to study the evolution of the Wikipedia network and find representative subsets, we investigate different solutions using submodular function maximisation. We adapt existing submodular functions from the literature and define new functions to solve this problem.

Another challenge is scaling our algorithms to deal with such a large dataset. We devise a framework – WikiMining – that can scale up to Wikipedia's size, using the Distributed Submodular Maximisation (GreeDi) protocol [8] over MapReduce [1]. To compare the different approaches we devise a simple procedure to evaluate the quality of our algorithms.

Contributions

We have the following main contributions:

• We define novel monotone submodular functions, which capture document importance, to summarise massive corpora better than existing functions;

• We create a framework that scales to millions of documents, using GreeDi [8] and MapReduce [1];

• We extend, for the first time, multi-document summarisation to a massive corpus – such as Wikipedia – using submodular function maximisation;

• We use simple evaluation metrics to cross-check the different submodular functions.


Chapter 2

Related work

As presented in the introduction, our topic relates to two main fields of interest: one is analysing Wikipedia and the other is multi-document summarisation. Most Wikipedia analyses are quantitative and deal with looking at the pages from a data analysis perspective. On the other hand, multi-document summarisation is more closely related to our work, but deals with smaller sets of documents and has a different objective.

On the topic of data analysis, Voss [17] does an excellent job of looking at Wikipedia's growth, article and meta-page distribution, article size distribution, author statistics and graph structure in multiple languages (English, German, Japanese and others). This is a quantitative analysis, helpful to better understand Wikipedia's structure and to find possible submodular functions that can capture informativeness. On the same topic, but looking into Wikipedia's semantic coverage, Holloway et al [3] analyse Wikipedia based on the semantic structures one can discover from the articles' graphs, the categories' interlinkage and revision statistics. More closely related to natural language processing, but further away from our interest of multi-document summarisation, Gabrilovich et al [2] look into explicit semantic analysis to discover the meaning of words using Wikipedia.

On the topic of multi-document summarisation using submodular function maximisation, an important reference that provided us with a starting point is the paper of Sipos et al [16] on using the expansion of document influence over time in order to capture the importance of papers. This method has encouraging results on conference corpora of published papers, but does not manage to transfer its results to a massive corpus like Wikipedia.

As far as we know, there has not been any previous work on summarising massive corpora using submodular function maximisation.


Chapter 3

Preliminaries

3.1 Term frequency - inverse document frequency

Term frequency - inverse document frequency (tf-idf) [13] is a popular measure in information retrieval that captures the importance of a word within a given document (from a corpus). It is computed based on two different weights:

Term frequency – tf(d, w) measures the frequency of a word in a document (usually normalised by taking the square root, the logarithm or more complex methods);

Inverse document frequency – idf(d, w) measures the rarity of a term within the corpus – it is defined as the total number of documents divided by the number of documents in which the word appears (usually normalised by taking the logarithm).

The final tf-idf value is obtained by multiplying the two weights defined above:

tf-idf(d, w) = tf(d, w) · idf(d, w).
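As an illustration, a minimal sketch (not WikiMining code) that computes tf-idf over a small in-memory corpus, assuming square-root term frequency and logarithmic inverse document frequency:

```java
// Minimal sketch (not WikiMining code): tf-idf with square-root term frequency
// and logarithmic inverse document frequency, over a small in-memory corpus.
import java.util.List;

public class TfIdfSketch {
  // Each document is represented as the list of its (tokenised) words.
  public static double tfIdf(List<String> doc, String word, List<List<String>> corpus) {
    long countInDoc = doc.stream().filter(word::equals).count();
    double tf = Math.sqrt(countInDoc);                            // tf(d, w), square-root normalised
    long docsWithWord = corpus.stream().filter(d -> d.contains(word)).count();
    double idf = Math.log((double) corpus.size() / Math.max(1, docsWithWord));
    return tf * idf;                                              // tf-idf(d, w) = tf(d, w) · idf(d, w)
  }
}
```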

3.2 Cosine similarity

The cosine similarity of two vectors is defined as the cosine of the angle made by the two vectors. Formally, cosine similarity is defined as:

cos(θ) = (u · v) / (‖u‖ · ‖v‖)

or, if the vectors are normalised, simply as the dot product of the two vectors:

cos(θ) = u · v.
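A minimal sketch (not WikiMining code) of cosine similarity for two dense vectors of equal length:

```java
// Minimal sketch (not WikiMining code): cosine similarity of two dense vectors
// of equal length.
public class CosineSimilarity {
  public static double similarity(double[] u, double[] v) {
    double dot = 0, normU = 0, normV = 0;
    for (int i = 0; i < u.length; i++) {
      dot += u[i] * v[i];
      normU += u[i] * u[i];
      normV += v[i] * v[i];
    }
    return dot / (Math.sqrt(normU) * Math.sqrt(normV));   // (u · v) / (‖u‖ · ‖v‖)
  }
}
```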

5

Page 13: Rights / License: Research Collection In Copyright - Non …8578/eth... · The final tf-idf value is obtained by multiplying the two weights defined above: tf-idf(d,w) = tf(d,w)idf(d,w)

3. Preliminaries

3.3 Locality-sensitive hashing

LSH [12] is a method to put similar elements into the same bucket with high probability. This is achieved by using hash functions that, instead of trying to uniformly distribute the elements among all buckets, are specifically designed to hash similar documents to the same value with high probability. A lot of similarity measures (or distances) have an associated locality-sensitive hash function. For cosine similarity, we use the following hash function based on random projections:

hash(v) = sgn(v · r),

where v is the input vector and r is a random projection vector.
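A minimal sketch (not WikiMining code) of one random-projection hash bit; in practice several such bits are concatenated into a band, and the band value is the bucket key:

```java
// Minimal sketch (not WikiMining code): one random-projection hash bit for
// cosine-similarity LSH. Concatenating several such bits yields the bucket key.
import java.util.Random;

public class RandomProjectionLsh {
  private final double[] r;                        // random projection vector

  public RandomProjectionLsh(int dimensions, long seed) {
    Random random = new Random(seed);
    r = new double[dimensions];
    for (int i = 0; i < dimensions; i++) {
      r[i] = random.nextGaussian();
    }
  }

  // Returns sgn(v · r) as a single hash bit.
  public int hash(double[] v) {
    double dot = 0;
    for (int i = 0; i < v.length; i++) {
      dot += v[i] * r[i];
    }
    return dot >= 0 ? 1 : 0;
  }
}
```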


Chapter 4

Submodularity, coverage, summarisation

In this chapter we only provide an overview of what submodularity is, why it is useful and how it can be applied to summarisation. If you are interested in a deeper understanding of submodular functions and their many other use-cases, you can read the survey [5] on submodular function maximisation.

4.1 Submodular functions

In this section we introduce what submodular functions are and why they are important. We also offer a couple of examples to shed some light on how these functions behave. Note that we will refer to these examples from some of the other sections.

4.1.1 Definitions

Definition 4.1 (Submodularity) A set function f : 2^D → R is submodular iff ∀S, T such that S ⊆ T ⊆ D and ∀d ∈ D \ T we have

f(S ∪ d) − f(S) ≥ f(T ∪ d) − f(T).

Intuitively this means that a new element's impact can never be higher in the future than it currently is, an effect also known as diminishing returns.

Definition 4.2 (Monotonicity) A set function f : 2^D → R is monotone iff ∀S, T such that S ⊆ T ⊆ D we have

f(S) ≤ f(T) and f(∅) = 0.

From Definition 4.1 and Definition 4.2 we derive Proposition 4.3 [9].


Figure 4.1: Visual representation of submodularity. The area contribution of d is larger when added to the set S than when added to the larger set T. The right-most image shows the equality case.

Proposition 4.3 (Monotone submodular) A set function f : 2^D → R is monotone submodular iff ∀S, T such that S ⊆ T ⊆ D and ∀d ∈ D we have

f(S ∪ d) − f(S) ≥ f(T ∪ d) − f(T).

Note that in this case we also allow d ∈ T. In Figure 4.1 we offer a visual representation of what submodularity means.

4.1.2 Examples and properties

In this subsection we will discuss only monotone submodular functions – used in the other sections – and some of the submodular functions' properties.

Let D1, D2, . . . , Dn be subsets of a universe U and let D = {1, 2, . . . , n}. We can define several functions f : 2^D → R that are monotone submodular [5].

Definition 4.4 (Set coverage function) f(S) := |⋃_{i∈S} Di|.

More generally we can extend the above function as follows.

Definition 4.5 (Weighted set coverage function) Let w : U → R+ be a non-negative weight function. Then f(S) := w(⋃_{i∈S} Di) = ∑_{u ∈ ⋃_{i∈S} Di} w(u).

This differs from Definition 4.4 in that we can sum non-constant weights that depend on the selected elements.

A very useful property of submodular functions is that the class of submodular functions is closed under non-negative linear combinations [5].

Proposition 4.6 (Closedness under non-negative linear combinations) Let g1, g2, . . . , gn : 2^D → R be submodular functions and λ1, λ2, . . . , λn ∈ R+. Then

f(S) := ∑_{i=1}^{n} λi gi(S)

is submodular.


This property is important because it allows us to easily construct new submodular functions by combining multiple simpler submodular functions.

4.2 Submodular function maximisation

4.2.1 Problem statement

Given a submodular function f we are interested in maximising its value on a set S, given some constraints on S. A common constraint on S is the cardinality constraint, which limits the size of the set S. Formally, we are interested in computing:

max_{S⊆D} f(S) subject to |S| ≤ k, for some k    (4.1)

Most of the time we are actually interested in computing the set S that maximises our function f, so Equation 4.1 becomes:

arg max_{S⊆D} f(S) subject to |S| ≤ k, for some k    (4.2)

4.2.2 Greedy maximisation

Optimally solving Equations 4.1 and 4.2 for a general submodular function f is NP-hard [5]. Fortunately, we can devise a greedy algorithm whose solution is guaranteed to be within a factor of 1 − 1/e of the optimum when maximising a fixed monotone submodular function f [9]. We present the required steps in Algorithm 4.1 and offer a visual representation in Figure 4.2. To speed up the selection process one can use a max-heap structure to keep track of the remaining document candidates; this version of the greedy algorithm is known as lazy greedy [7].

Algorithm 4.1 Greedy submodular function maximisation
  S ← ∅
  while |S| < k do
    d* ← arg max_{d∈D} [f(S ∪ d) − f(S)]
    S ← S ∪ d*
  end while
  Answer S

Figure 4.2: Visual representation of greedy maximisation. At each step we select the document that adds the highest area contribution.
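A minimal sketch of lazy greedy selection under a cardinality constraint (assumed interfaces; WikiMining's actual implementation is the SfoGreedyLazy class described in Chapter 6):

```java
// Minimal sketch (assumed interfaces, not WikiMining's actual classes):
// lazy greedy maximisation with a max-heap of cached marginal gains.
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.function.Function;

public class LazyGreedySketch {
  private record Candidate(int docId, double cachedGain) {}

  // f evaluates the submodular objective on a set of document ids; k is the budget.
  public static Set<Integer> select(List<Integer> docIds, Function<Set<Integer>, Double> f, int k) {
    Set<Integer> selected = new HashSet<>();
    double currentValue = f.apply(selected);
    PriorityQueue<Candidate> heap =
        new PriorityQueue<>((a, b) -> Double.compare(b.cachedGain(), a.cachedGain()));
    for (int id : docIds) {
      heap.add(new Candidate(id, Double.POSITIVE_INFINITY));    // optimistic initial gains
    }
    while (selected.size() < k && !heap.isEmpty()) {
      Candidate top = heap.poll();
      Set<Integer> withTop = new HashSet<>(selected);
      withTop.add(top.docId());
      double gain = f.apply(withTop) - currentValue;            // re-evaluate the stale gain
      if (heap.isEmpty() || gain >= heap.peek().cachedGain()) {
        selected.add(top.docId());                              // still the best candidate: take it
        currentValue += gain;
      } else {
        heap.add(new Candidate(top.docId(), gain));             // push back with the fresh gain
      }
    }
    return selected;
  }
}
```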

4.2.3 GreeDi protocol

Given that in this thesis we are interested in applying submodular function maximisation to a large corpus, we need to find a way to transform the sequential Algorithm 4.1 to run distributively. Fortunately, there exists a greedy distributed submodular maximisation protocol, described in Algorithm 4.2, that partitions the data into subsets and then runs Algorithm 4.1 on each individual partition. This approach has the benefit that it gracefully degrades the approximation guarantees based on the number of partitions and, more importantly, for many types of data it offers approximation guarantees close to the ones offered by the sequential version, with strong experimental results [8] – results that are almost identical or very similar to those of the sequential algorithm.

Algorithm 4.2 Greedy Distributed Submodular Maximisation (GreeDi). Adapted from [8] with l = k
  D := set of all elements
  p := number of partitions
  k := number of selected elements
  1: Partition D into p sets: D1, D2, . . . , Dp
  2: Run Greedy Algorithm 4.1 on each set Di to select k elements in Ti
  3: Merge the answers: T = ⋃_{i=1}^{p} Ti
  4: Run Greedy Algorithm 4.1 on T to select the final k elements in S
  5: Answer S
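A minimal sketch of the GreeDi structure, run sequentially for clarity (in WikiMining each first-stage greedy run happens inside a MapReduce reducer); the greedy argument stands in for Algorithm 4.1, for example the lazy greedy sketch above:

```java
// Minimal sketch of the GreeDi two-stage structure, run sequentially for clarity.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;

public class GreeDiSketch {
  public static Set<Integer> greeDi(List<Integer> allDocs, int partitions, int k,
      BiFunction<List<Integer>, Integer, Set<Integer>> greedy) {
    // 1: Partition the documents into p roughly equal sets.
    List<List<Integer>> parts = new ArrayList<>();
    for (int i = 0; i < partitions; i++) {
      parts.add(new ArrayList<>());
    }
    for (int i = 0; i < allDocs.size(); i++) {
      parts.get(i % partitions).add(allDocs.get(i));
    }
    // 2-3: Select k elements per partition with the greedy algorithm and merge the answers.
    List<Integer> merged = new ArrayList<>();
    for (List<Integer> part : parts) {
      merged.addAll(greedy.apply(part, k));
    }
    // 4: Run the greedy algorithm again on the merged candidates for the final k elements.
    return greedy.apply(merged, k);
  }
}
```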

4.3 Word coverage

One of the basic ways of finding significant multi-document summaries is to find a good measure of a document's information coverage. One such proposed metric is word coverage, a method which argues that covering words is a good indication of covering information. This method has been used before in document summarisation and it extends naturally to multi-document summarisation [16].


Figure 4.3: Visual representation of max_{d∈S} ϕ(d, w). The X axis represents the words and the Y axis represents the value of ϕ for a given document and word. Given two documents, the result of taking for each word the maximum value of the two is shown in the max plot.

4.3.1 Definition

Sipos et al [16] propose a way to adapt a known submodular function to use word coverage as an information measure. We define this function in Definition 4.7.

Definition 4.7 (Word coverage) Let:

  D be a set of documents,
  W be the set of all words from all documents in D.

Then we define:

  f : 2^D → R+
  f(S) := ∑_{w∈W} θ(w) max_{d∈S} ϕ(d, w)   (∀S ⊆ D)

where:

  θ : W → R+ represents the importance of word w,
  ϕ : D × W → R+ represents the coverage of word w in document d and is usually chosen to be tf-idf(d, w).

Remember that we described term frequency - inverse document frequency (tf-idf) in Section 3.1 on page 5. In Figure 4.3 we visually explain the behaviour of taking the maximum of ϕ(d, w) over the selected documents.

Proposition 4.8 Function f from Definition 4.7 is monotone submodular.

Proof It is easily proven: for each word w, taking the maximum of ϕ(d, w) over the selected documents is monotone submodular, and the weighted sum of submodular functions is again submodular according to Proposition 4.6. □
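A minimal sketch of evaluating this objective on sparse tf-idf rows (assumed in-memory data layout, not WikiMining's actual classes):

```java
// Minimal sketch of the word coverage objective f(S) = Σ_w θ(w) max_{d∈S} ϕ(d, w),
// with tf-idf rows stored as sparse maps from word id to weight.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class WordCoverageSketch {
  private final Map<Integer, Map<Integer, Double>> tfIdf;  // docId -> (wordId -> ϕ(d, w))
  private final Map<Integer, Double> wordImportance;       // wordId -> θ(w)

  public WordCoverageSketch(Map<Integer, Map<Integer, Double>> tfIdf,
                            Map<Integer, Double> wordImportance) {
    this.tfIdf = tfIdf;
    this.wordImportance = wordImportance;
  }

  public double compute(Set<Integer> selectedDocIds) {
    Map<Integer, Double> maxCoverage = new HashMap<>();    // wordId -> max_{d∈S} ϕ(d, w)
    for (int docId : selectedDocIds) {
      for (Map.Entry<Integer, Double> entry : tfIdf.getOrDefault(docId, Map.of()).entrySet()) {
        maxCoverage.merge(entry.getKey(), entry.getValue(), Math::max);
      }
    }
    double score = 0;
    for (Map.Entry<Integer, Double> entry : maxCoverage.entrySet()) {
      score += wordImportance.getOrDefault(entry.getKey(), 1.0) * entry.getValue();
    }
    return score;
  }
}
```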


4.3.2 Rationale

The word coverage function defined in Definition 4.7 on the previous page promotes word diversity while trying to minimise eccentricity. Intuitively, maximising ϕ(d, w), seen as tf-idf, prefers selecting documents that have a lot of words that are rarer in the other documents. This promotes diversity, but also increases the eccentricity of the selected documents. To dampen the eccentricity of the picked documents we introduce θ(w), so that we include the importance of the word in the selection process. This counterbalances the eccentricity of words by preferring more common significant words (not to be confused with stop words).

4.4 Document influence

While word coverage captures information well enough, it does not explain how documents influenced each other nor how the corpus evolved over time. By capturing document influence in the submodular function, one can use this new measure to select the most important documents.

4.4.1 Definition

Sipos et al [16] propose combining two complementary notions in a single submodular function that takes into account document influence when scoring an article. We define this function in Definition 4.9.

Let:

  D be a set of documents,
  W be the set of all words from all documents in D,
  Y be the set of all publishing dates of the documents in D,
  t(d) be a function that gives the publishing date of document d ∈ D.

Definition 4.9 (Document influence) We define:

  f : 2^D → R+
  f(S) := ∑_{w∈W} ∑_{y∈Y} θ(w, y) max_{d ∈ {d′∈S | t(d′)<y}} ν(d, w)   (∀S ⊆ D)

where:

  θ : W × Y → R+ represents the importance of word w in year y and is used to measure the spread of the word,
  ν : D × W → R+ represents the novelty of word w in document d.



The same authors [16] also define the word spread and the document novelty used above in Definition 4.9. Let:

  ϕ : D × W → R+ be the coverage of word w in document d, usually chosen to be tf-idf(d, w).

Definition 4.10 (Word spread) We define:

  θ : W × Y → R+
  θ(w, y) := ∑_{d ∈ {d′∈D | t(d′)=y}} ϕ(d, w)

Definition 4.11 (Document novelty) We define:

  ν : D × W → R+
  ν(d, w) := max{0, min_{d′∈N(d)} {ϕ(d, w) − ϕ(d′, w)}}

where:

  N(d) is the set of k-nearest neighbours of document d from the given corpus D.

Proposition 4.12 Function f from Definition 4.9 is monotone submodular.

Proof Similar to Proposition 4.8 [16]. □
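A minimal sketch of computing document novelty from sparse tf-idf rows, assuming the neighbour set N(d) has already been computed (WikiMining approximates it with a single LSH nearest neighbour, see Section 5.2):

```java
// Minimal sketch: document novelty ν(d, w) = max{0, min_{d'∈N(d)} (ϕ(d, w) − ϕ(d', w))},
// computed from sparse tf-idf rows; the neighbour set N(d) is assumed to be given.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocumentNoveltySketch {
  // Returns the novelty of every word that appears in the document.
  public static Map<Integer, Double> novelty(
      Map<Integer, Double> docRow,                  // wordId -> ϕ(d, w)
      List<Map<Integer, Double>> neighbourRows) {   // one sparse row per neighbour d'
    Map<Integer, Double> result = new HashMap<>();
    for (Map.Entry<Integer, Double> entry : docRow.entrySet()) {
      if (neighbourRows.isEmpty()) {                // no neighbours given: fall back to ϕ(d, w) (an assumption)
        result.put(entry.getKey(), entry.getValue());
        continue;
      }
      double minDifference = Double.POSITIVE_INFINITY;
      for (Map<Integer, Double> neighbour : neighbourRows) {
        double difference = entry.getValue() - neighbour.getOrDefault(entry.getKey(), 0.0);
        minDifference = Math.min(minDifference, difference);
      }
      result.put(entry.getKey(), Math.max(0.0, minDifference));
    }
    return result;
  }
}
```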

4.4.2 Rationale

In order to measure document influence we have to find suitable metrics that achieve this exact task. Sipos et al [16] propose two metrics:

• word spread

• document novelty

that together achieve our purpose of measuring influence and, as such, of finding the most important, valuable documents. In Figure 4.4 we offer a visual representation of the document influence function.

Figure 4.4: Visual representation of the document influence submodular function. Image source: [16]

Word spread

The idea of word spread stems from viewing words not just as static concepts, but as ideas that evolve over time. To be able to do this, one has to extend the concept of word importance introduced in Definition 4.7 on page 11 to depend not only on the word itself, but also on the year in which that word is considered. This is a valuable insight because, as humans, when we talk about a field – for example, neural networks – we value that term differently depending on the year we are in. In 1980 it would have been an exciting new idea that would manage to solve all the difficult artificial intelligence problems; in the 1990s it would have been an idea that cannot solve any of the harder problems; while today we view the neural networks field as very promising again. In a nutshell, word spread aims to capture which concepts are important at each particular date. It is important to note that this date is open to different granularities: anything from hour granularity (or smaller) up to year granularity (or larger).

Document novelty

While the formal definition of document novelty might look a little more complicated, the intuition behind it is really simple: we want to avoid counting the same concepts – mentioned in multiple documents – multiple times. For example, when considering published papers, if we have already selected 3 very good papers, we want to lower the novelty score of a survey paper that discusses the same 3 papers. This is true for Wikipedia as well, where one can find (human-written) general articles that offer an overview of a broad topic and where each section links to another page that discusses only that particular sub-topic.


Chapter 5

Massive corpus summarisation

In this chapter we present novel submodular functions that can be applied to find the most important documents out of millions of interconnected documents, such as parts of the web (or even the whole web). Our approach expands on the ideas of multi-document summarisation using submodular word coverage [16] presented in Chapter 4 on page 7. This chapter presents the author's novel contributions to the problem of massive corpus summarisation.

5.1 Scaling from thousands to millions

Sipos et al [16] present an interesting way to summarise a document corpus using submodular functions. However, their focus is on the different ways one can view summarisation, and their methods are hard or even impossible to scale beyond tens of thousands of documents. In this thesis we concentrate on finding the most important articles out of a massive corpus of interconnected documents. This can be viewed as a multi-document summarisation task, but given the small ratio between the number of selected documents and the total size of the corpus it also bears some differences from the classical view on multi-document summarisation. Concretely, we aim to select tens of Wikipedia articles from about 1.3 million Wikipedia pages (the total number of human-written articles in Wikipedia before October 2012).

This problem is twofold. On the one hand, we have to adapt the submodular functions so that they yield satisfying results for such a large 'compression' ratio of about 1/50000 (as it is in the case of Wikipedia). On the other hand, we can no longer run sequentially; so, in turn, we write all algorithms in a distributed MapReduce [1] fashion. If you are interested in more details about the framework we implemented, please refer to Chapter 6 on page 25.


5.2 Scaling influential documents

In Section 4.4 on page 12 we described the document influence function proposed in [16]. While we can apply the function as it is to small corpora (up to a couple of tens of thousands of documents), in order to make it scale to Wikipedia's size – around 1.3 million human-written articles – we have to make changes to parts of the submodular function.

Most of the general issues we had to solve are conceptual or engineering problems, such as:

1. finding a suitable creation date / year;

2. finding the k-nearest neighbours for each document;

3. computing the two component functions:

• θ(w, y) – word spread;

• ν(d, w) – document novelty.

Creation date Finding a suitable creation date in the case of web pages (more specifically, Wikipedia pages) is a much harder task than finding the publishing date of a paper. In the case of Wikipedia we settled on using the last modified date as an imperfect measure of the creation date. A better approach would be to find the most important revision and use that as the page's creation date (we discuss this as future work in Chapter 8 on page 53).

K-nearest-neighbours Given the massive size of the corpus, we cannot run an exact k-nearest neighbours (kNN) computation. As a result, we use LSH to approximately compute the 1-nearest neighbour of each page. The authors of [16] observe – in their experimental results – that selecting only one neighbour is sufficient for the purposes of this submodular function. Furthermore, using LSH instead of an exact kNN approach does not influence our results either.

Component functions As a result of the size of the tf-idf matrix, computing the two functions – ν(d, w) and θ(w, y) – has to be done in separate MapReduces as part of the preprocessing steps. The same holds true for retrieving the creation date of each article.

5.3 Graph coverage

In this section we present a submodular function defined on the inlink graph structure of the considered set of documents. Instead of only evaluating the selected documents, we argue that each node expands its influence among its neighbouring articles.

Figure 5.1: Visual representation of the graph coverage submodular function.

5.3.1 Definition

Definition 5.1 (Graph coverage) Let G = (D, E) be a graph where:

  D := set of document vertices,
  E := set of edges.

Then we define:

  f : 2^D → R+
  f(S) := |⋃_{d∈S} Vd|

where:

  Vd := {d} ∪ {v ∈ D | (v, d) ∈ E}.

Proposition 5.2 Graph coverage from Definition 5.1 is monotone submodular.

Proof Easily proven as it can be reduced to the coverage function presented in Definition 4.4 on page 8. □

In Figure 5.1 we offer a visual representation of this function.
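A minimal sketch of evaluating this function with an in-memory inlink adjacency list (assumed data layout, not WikiMining's actual classes):

```java
// Minimal sketch of graph coverage f(S) = |∪_{d∈S} Vd| using an in-memory
// inlink adjacency list.
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GraphCoverageSketch {
  // inlinks.get(d) holds the documents that link to d.
  public static int compute(Set<Integer> selectedDocIds, Map<Integer, List<Integer>> inlinks) {
    Set<Integer> covered = new HashSet<>();
    for (int docId : selectedDocIds) {
      covered.add(docId);                                       // the document itself
      covered.addAll(inlinks.getOrDefault(docId, List.of()));   // plus its in-neighbours
    }
    return covered.size();
  }
}
```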

5.3.2 Rationale

Graph coverage is a simple function that aims to measure a document's coverage in terms of the vertices it covers, while keeping the selected documents diversified. We argue that each document manifests an aura of influence among the articles that link to it. Consequently, we conjecture that adding the source document – the page that adds the highest number of new inlinks – captures a topic that is both important and sufficiently different from the already selected articles. As a simple extension, Definition 5.1 on the preceding page can be modified to consider neighbours at a radius larger than one (for example, to take into account all nodes that are at most two edges away from the selected document).

5.4 LSH buckets

In this section we present a submodular function defined on the buckets that result from applying Locality Sensitive Hashing to the documents' tf-idf vectors. This function aims to increase the topic diversity of the selected documents by selecting articles that cover multiple different buckets.

5.4.1 Definition

Definition 5.3 (LSH buckets) Let:

  D := set of documents,
  B := {Bi ⊆ D | Bi is LSH bucket i} = {B1, B2, . . . , Bb}.

Then we define:

  f : 2^D → R+
  f(S) := ∑_{i=1}^{b} |Bi| · g(|S ∩ Bi|),

where:

  g : N → R+ is concave.

Proposition 5.4 LSH buckets from Definition 5.3 is monotone submodular.

Proof Easily provable using the properties of submodular functions [5]. □

In Figure 5.2 we offer a visual representation of this function.

Figure 5.2: Visual representation of the LSH buckets submodular function.
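A minimal sketch of evaluating this objective, using g = √ as the concave function (an assumed choice for illustration):

```java
// Minimal sketch of the LSH buckets objective f(S) = Σ_i |Bi| · g(|S ∩ Bi|),
// with g = sqrt as the concave function (assumed choice).
import java.util.List;
import java.util.Set;

public class LshBucketsSketch {
  public static double compute(Set<Integer> selectedDocIds, List<Set<Integer>> buckets) {
    double score = 0;
    for (Set<Integer> bucket : buckets) {
      long intersection = selectedDocIds.stream().filter(bucket::contains).count();
      score += bucket.size() * Math.sqrt(intersection);   // |Bi| · g(|S ∩ Bi|)
    }
    return score;
  }
}
```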

5.4.2 Rationale

LSH buckets is a simple function designed to be combined with other submodular functions. The idea behind this function is that buckets group similar documents together (using a measure such as cosine similarity, for example).

In this regard, buckets that have more articles represent denser clusters of pages, which we argue define an important area or field. As a consequence, we give higher importance – a higher weight – to dense buckets, multiplying by the size of the bucket: the term |Bi|. On the other hand, we also value documents that are in multiple buckets: the term |S ∩ Bi|. Ideally, these pages cover more than one topic and, as such, are an important source of information if they are part of the summary.

Describing the function as a whole, our objective is to achieve diversity by picking pages from different buckets, while gauging both the importance of a bucket and the topic coverage of the articles themselves.

5.5 Beyond word coverage

In Section 4.3 on page 10 we introduced word coverage as a measure of information coverage based on words that tries to avoid eccentric documents. However, as you can see from our results in Chapter 7 on page 35, word coverage performs worse than expected when scaled to Wikipedia's size (millions of documents). In this chapter we modify the function such that it takes into account more information about the documents when comparing them.

5.5.1 Definitions

We remind the reader that in Definition 4.7 on page 11 we described a submodular function that uses word coverage as follows:

  f : 2^D → R+
  f(S) := ∑_{w∈W} θ(w) max_{d∈S} ϕ(d, w)   (∀S ⊆ D)

For details about the notation used we advise the reader to return to Definition 4.7 on page 11.

In this section we propose explicit definitions for the functions θ(w) and ϕ(d, w) such that they yield desirable results even for large-scale corpora. Let:

  D := set of documents,
  W := set of all words from all documents in D,
  G = (D, E) := the documents' graph.

Word importance

Word importance is the weight we give to a word in terms of its meaning. As we mentioned before, there has been significant work done on how to properly define it [16] [6] [15]. Here we use some of the basic metrics, such that it easily scales to a massive corpus.

Definition 5.5 (Word importance) We define word importance

  θ : W → R+

in one of the following four (non-equivalent) ways:

  1. θ(w) := √wc(w)
  2. θ(w) := ln(wc(w))
  3. θ(w) := √dc(w)
  4. θ(w) := ln(dc(w))

where:

  wc, dc : W → R+
  wc(w) := ∑_{d∈D} count(w, d),
  count(w, d) := number of times word w appears in document d,
  dc(w) := |{d ∈ D | w ∈ d}|.

Functions wc(w) and dc(w) are the functions needed to define tf-idf [13]:

• wc(w) stands for word count and represents the number of times word w appears in the whole corpus;

• dc(w) stands for document count and represents the number of documents (from the corpus) in which word w appears.

Coverage of a word

The coverage of a word in a document is a measure of how much a word covers a specific concept in that document. In our case we view the coverage of a word as the meaningfulness of that word in a document, in the context of all the other words and documents from the corpus. Here we define novel functions that consider some of the different metrics one can use for this task.

Definition 5.6 (Coverage of a word) We define the coverage of a word

  ϕ : D × W → R+

in one of the following (non-equivalent) ways:

  ϕ(d, w) := tf-idf(d, w) [· #inlinks(d)] [· #revisions(d)] [· revisions-volume(d)],

where:

  #inlinks, #revisions, revisions-volume : D → R+
  #inlinks(d) := |{v ∈ D | (v, d) ∈ E}|,
  #revisions(d) := number of edits of document d,
  revisions-volume(d) := total size of the edits of document d.

Note that [·] marks an optional term. As such we get twelve different possible functions in total.

Also note that using any of the explicit forms for word importance from Definition 5.5 and for coverage of a word from Definition 5.6 keeps the word coverage function from Definition 4.7 on page 11 monotone submodular.

5.5.2 Rationale

We were unsatisfied with the results of using only tf-idf for the coverage of a word in a document, so we thought about natural ways to extend ϕ(d, w) to capture more of the available information. One of the main problems was that, although we take into account the word importance in Definition 4.7, this is not enough to dampen the eccentricity of the articles. In the following paragraphs we explain how each added term from Definition 5.6 fits in and what the reasoning behind it is. We warn the reader that we offer only intuitive explanations that were empirically confirmed, but no strict formalisms about their validity.

Inlinks

The idea for inlinks has two justifications behind it that are similar, but distinct concepts.

One of them is merging the graph coverage presented in Section 5.3 with word coverage. This can be achieved – we present the way it can be done in Section 5.6 – but we could not find appropriate parameters to use that approach.

The other is seeing inlinks as a citation measure. This has been used somewhat successfully in web ranking and it has since evolved into PageRank and other more sophisticated, hybrid approaches, but here we are using it in its plain form, viewing Wikipedia as different from a (competitive) game. However, using PageRank or a more advanced measure might prove to be a useful tweak to our current approach.

Revisions

While our goal is to find time-agnostic important articles, and as such we would like to stay away from transiently popular articles, we consider that adding a measure of popularity and/or controversy into our functions improves our results considerably.

While combining the number of revisions and the revision volume – the total size of all revisions of an article – is validated entirely in an empirical fashion, through experiments, each individual term has an interpretation behind it.

Number of revisions We can view the number of revisions foremost as a controversy measure, as articles with a high edit count were changed many times (by different persons); this means there existed different users with diverging opinions on the given topic. For example, this can happen in the case of a celebrity's death.

Revision volume On the other hand, we can interpret the revision volume as a measure of popular interest. Articles with high revision volumes have both a sizeable length and a considerable percentage of big changes made by users passionate about the given topic.

5.6 Combining multiple submodular functions

In Section 4.1 on page 7 we mentioned Proposition 4.6, which allows us to combine multiple submodular functions together. However, there are two main problems that arise:

1. choosing the λi parameters;

2. the functions have to be of the same magnitude – ideally they should even grow similarly.

For the first problem we empirically try different hand-picked combinations. While there is an entire field of machine learning (ML) that deals with parameter selection, we considered this to be outside the scope of this thesis. For the second problem, we present our solution in the next subsection.


5.6.1 Normalisation

Here we present a way to normalise the functions we defined in the sections above, so that one can (more) easily combine them using linear combinations. It is important to note that we can only divide the functions' expressions by constants with respect to S, so that the functions remain monotone submodular.

Word coverage and document influence are already in a good spot because the tf-idf vectors are both normalised and very sparse and, as a consequence, the values remain small (< 1) even when summing them together.

LSH buckets In Definition 5.3 on page 18, we defined LSH buckets as f(S) := ∑_{i=1}^{b} |Bi| · g(|S ∩ Bi|). Each term g(|S ∩ Bi|) is bounded by g(|Bi|), and we can then take the weighted average with respect to |Bi|. This results in:

  f(S) := (1 / ∑_{i=1}^{b} |Bi|) · ∑_{i=1}^{b} [ |Bi| · g(|S ∩ Bi|) / g(|Bi|) ]

which is always ≤ 1.

Graph coverage In Definition 5.1 on page 17, we defined graph coverage as f(S) := |⋃_{d∈S} Vd|. This function is bounded by the number of vertices (documents), |D|. This results in:

  f(S) := |⋃_{d∈S} Vd| / |D|

which is always ≤ 1.


Chapter 6

WikiMining framework design

The framework is in an alpha development state. While all the described functionality is implemented and works accordingly, you might sometimes find yourself needing to change the code in order to tweak some of the behaviour, or find that it is not straightforward to extend the framework with some specific complex functionality. However, we intend to bring it to a quality standard that will allow you to perform both of the above tasks with ease.

6.1 External libraries

As part of WikiMining we use multiple external libraries, which we briefly describe in the following subsections.

6.1.1 Java Wikipedia Library

Java Wikipedia Library (JWPL) is an open-source library that facilitates access to a lot of information concerning Wikipedia, including, but not limited to:

• page link graph

• category link graph

• page ↔ category link graph

• page content (MediaWiki and plain-text formats)

How does it work? Given a Wikipedia dump – which can be downloaded from http://dumps.wikimedia.org/ (accessed on 17.04.2014) – it generates 11 text files that can then be imported into a MySQL database and accessed through the JWPL Application Programming Interface (API). You can find more details about using JWPL on their website 1.

The library is publicly available 2 and its authors wrote a paper [18] providing further insights into how it works. For the purposes of this thesis we use JWPL 0.9.2.

6.1.2 Hadoop

Hadoop is an open-source project that aims to offer solutions for 'reliable, scalable, distributed computing' 3. Although the whole system is more complex, we mainly use two of its components:

• MapReduce

• Hadoop Distributed File System

MapReduce

MapReduce (MR) is a two-phase process that borrows two concepts from functional programming:

• map;

• reduce;

and applies them to distributed computing. The main steps are as follows:

1. In the first stage – the map stage – the data is partitioned and spread across multiple processes that perform the same task and output multiple (key, value) pairs.

2. In-between the two phases, the results from the first stage are collected and sorted by key in a distributed fashion.

3. In the second stage – the reduce stage – all pairs that share the same key are sent to the same process, and this task outputs a result based on the received pairs (that share the same key).

MapReduce (MR) is a very powerful technique for doing distributed computations because it provides an easy way to write distributed code that is both fault-tolerant and free from other distributed computing problems (such as synchronisation concerns, deadlocks etc.). If you are interested in learning more about MapReduce (MR), we recommend reading the paper [1] written by its proponents.
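To make the two stages concrete, a minimal sketch of the classic word count job in Hadoop's Java API (an illustrative example, not part of WikiMining):

```java
// Minimal sketch: the classic Hadoop word count job, illustrating the
// map and reduce stages described above (not part of WikiMining).
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map stage: emit (word, 1) for every word of every input line.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce stage: all pairs with the same word arrive together; sum their counts.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```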

1 https://code.google.com/p/jwpl/wiki/DataMachine (accessed on 17.04.2014)
2 https://code.google.com/p/jwpl/ (accessed on 17.04.2014)
3 http://hadoop.apache.org/ (accessed on 17.04.2014)


Hadoop Distributed File System

Hadoop Distributed File System (HDFS) is, as its name says, a distributed file-system that offers the file-system back-end for writing MapReduce jobs and using other Hadoop-based frameworks. Its purpose is to store large-scale data reliably and distributively, such that it is easy to stream this data at a high bandwidth to applications running on multiple machines [14].

For the purposes of this thesis we use Hadoop 1.0.4, and Hadoop 2.2.0 for Cloud9.

6.1.3 Cloud9

Cloud9 is an open-source library for processing big data that runs over Hadoop 2. Cloud9 provides the following features:

• frequently used Hadoop data types;

• basic graph algorithms: breadth-first search, PageRank;

• a Wikipedia API.

In this thesis we use the Wikipedia API to convert the Extensible Markup Language (XML) dumps to Hadoop sequence files indexed by document id. This allows us to easily find specific Wikipedia pages and process their content.

If you want to find out more about Cloud9, visit their website 4 or check their Wikipedia API 5. In this thesis we use Cloud9 version 1.5.0.

6.1.4 Mahout

Mahout is a scalable machine learning library designed to run over very large data sets using Hadoop MapReduce. While it offers distributed algorithms for a lot of machine learning problems, such as:

• classification

• clustering

• recommendations

we are only using Mahout for indexing Wikipedia and creating the corresponding tf-idf vectors.

4 http://lintool.github.io/cloud (accessed on 17.04.2014)
5 http://lintool.github.io/cloud/docs/content/wikipedia.html (accessed on 17.04.2014)


Figure 6.1: WikiMining architecture overview.

Generating tf-idf vectors Creating tf-idf vectors from a set of plain-text documents 6 is done using mahout seq2sparse. This tool has parameters that allow us to filter out rare words and stop-words, and even perform stemming if needed.

If you are interested in finding out more about Mahout, you can read the book Mahout in Action [10] or visit their website 7. For the purposes of this thesis we use Mahout version 0.9.

6 Stored in an HDFS-specific format called Sequence Files.
7 https://mahout.apache.org/ (accessed on 17.04.2014)

6.2 System architecture

In this section, we describe how we designed and implemented the WikiMining library for summarising Wikipedia using submodular function maximisation. In Figure 6.1 we present the general architecture of WikiMining.

6.2.1 Base data types

We use the following data type classes, defined in package ch.ethz.las.wikimining.base:

ArrayWritable Different subclasses for serialising an array of different types:

• IntArrayWritable - for integers: IntWritable;

• TextArrayWritable - for strings: Text;

or that offer more functionality, such as printing their content in plain text using a toString() method.

DocumentWithVector An (Integer document id, Vector) pair used in a lot of MapReduces to partition the data as part of:

• the GreeDi protocol (see Section 4.2 on page 9);


• Locality Sensitive Hashing (see Section 3.3 on page 6).

It also comes in a Writable form so that it can be serialised by Hadoop MapReduce.

HashBandWritable A (hash, band) pair used for LSH.

SquaredNearestNeighbour A collection that gets the nearest neighbour of an article (based on cosine similarity) from among the documents that were written before the input article. This is used to compute document influence.

Vector Class from Mahout that represents a double-valued mathematical vector. It comes in two flavours:

• DenseVector - stored as a normal array

• RandomAccessSparseVector - stored as a hash-map of indexes to values.

In addition we have two classes for:

• storing the default parameters - Defaults

• defining the program arguments - Fields

6.2.2 Submodular functions

Given that the GreeDi protocol has only general, high-level restrictions (such as being able to partition the data, for example), we can easily implement submodular functions similarly to implementing them in any other programming language. Also, the submodular function maximisation algorithms are implemented such that they can correctly work with any submodular function that implements the ObjectiveFunction interface – which is very non-restrictive.

WikiMining provides the following submodular functions in package ch.ethz.las.wikimining.functions:

CombinerWordCoverage Implements the function described in Section 5.5 on page 19, whose coverage of a word function uses tf-idf, the number of inlinks, the number of revisions and the revision volume.

CutFunction Given a graph split, it computes the value of the cut. It is not used with Wikipedia; instead we used it to test the correctness of our implementation against the SFO Matlab toolbox [4].

DocumentInfluence Computes the influence of a document as described in Section 4.4 on page 12 and Section 5.2 on page 16, given the document creation dates, a document novelty index and a yearly word spread index.

GraphCoverage Computes the graph coverage described in Section 5.3 on page 16, given the graph as an adjacency list.


LSHBuckets Computes the LSH-buckets function described in Section 5.4 on page 18, given the LSH buckets.

ObjectiveCombiner Linearly combines multiple objective functions as described in Section 5.6 on page 22, given a list of functions and weights.

RevisionsSummer A modular function that just sums the number of revisions of each selected document.

WeightedWordCoverage Implements the word coverage function where:

• word importance is defined as in Section 5.5 on page 19;

• coverage of a word uses just the tf-idf, as described in Section 4.3 on page 10.

WordCoverageFromMahout Implements the same word coverage function as above (WeightedWordCoverage), except that it ignores the word importance, which is considered to be the constant 1.

All functions that need the tf-idf vectors expect them as a HashMap from document ids to Mahout Vectors.

If you are interested in implementing your own submodular functions, you just need to implement the ObjectiveFunction interface such that your class provides a compute method which returns the double score of a set of document ids.
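For instance, a hypothetical custom function could look like the sketch below; the interface shape (a compute method taking a set of document ids and returning a double) is assumed from the description above, so treat it as illustrative rather than the framework's exact API.

```java
// Hypothetical sketch: the interface shape below is assumed for illustration
// and is not necessarily WikiMining's exact API.
import java.util.Map;
import java.util.Set;

interface ObjectiveFunction {                        // assumed: one method returning the score
  double compute(Set<Integer> docIds);
}

// A toy modular objective: the total length (in words) of the selected documents.
public class LengthSummer implements ObjectiveFunction {
  private final Map<Integer, Integer> docLengths;    // docId -> number of words

  public LengthSummer(Map<Integer, Integer> docLengths) {
    this.docLengths = docLengths;
  }

  @Override
  public double compute(Set<Integer> docIds) {
    double score = 0;
    for (int docId : docIds) {
      score += docLengths.getOrDefault(docId, 0);
    }
    return score;
  }
}
```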

6.2.3 Submodular function maximisation

WikiMining provides three types of greedy submodular function maximisation (SFO) algorithms in package ch.ethz.las.wikimining.sfo:

SfoGreedyLazy Implements a lazy greedy SFO algorithm that uses a non-stable sorting mechanism, that is a heap, to speed up the computations as described in Section 4.1 on page 7 (a simplified sketch follows this list). This is the only maximisation algorithm that we use in all our non-testing code. It has a complexity of O(k · avgN), where k is the number of selected documents and avgN is the average number of times the current top value of the heap is not the real maximum score.

SfoGreedyNonLazy Implements a brute-force (non-lazy) greedy SFO algorithm – it tries all possible documents at each iteration – as described in Section 4.1 on page 7.

SfoGreedyStableLazy Implements a stable-sort lazy greedy SFO algorithm. The complexity of this algorithm is the same as that of the non-lazy version: O(k · N), where k is the desired number of selected documents and N is the total number of documents; as a result, it is not really useful in practice, but we used it for testing and debugging.
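
The following is a minimal, self-contained sketch of the lazy greedy idea behind SfoGreedyLazy: cached marginal gains live in a max-heap and only the top element is re-evaluated. It reuses the simplified ObjectiveFunction interface from the sketch in the previous section and is not WikiMining's actual implementation.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.PriorityQueue;
    import java.util.Set;

    // Sketch of lazy greedy selection: cache (possibly stale) marginal gains in a
    // max-heap and only re-evaluate the element currently at the top of the heap.
    class LazyGreedySketch {
      static List<Integer> select(ObjectiveFunction f, List<Integer> allIds, int k) {
        Set<Integer> selected = new HashSet<>();
        List<Integer> order = new ArrayList<>();
        double currentScore = f.compute(selected);

        // Heap entries: {documentId, cachedMarginalGain}, largest gain first.
        PriorityQueue<double[]> heap =
            new PriorityQueue<>((a, b) -> Double.compare(b[1], a[1]));
        for (int id : allIds) {
          Set<Integer> single = new HashSet<>();
          single.add(id);
          heap.add(new double[] {id, f.compute(single) - currentScore});
        }

        while (order.size() < k && !heap.isEmpty()) {
          double[] top = heap.poll();
          int id = (int) top[0];

          // Refresh this element's marginal gain with respect to the current set.
          Set<Integer> candidate = new HashSet<>(selected);
          candidate.add(id);
          double gain = f.compute(candidate) - currentScore;

          if (heap.isEmpty() || gain >= heap.peek()[1]) {
            // By submodularity gains only shrink, so if the refreshed gain still
            // beats the best cached gain, this element is the true maximiser.
            selected.add(id);
            order.add(id);
            currentScore += gain;
          } else {
            // Otherwise push it back with its refreshed gain and try again.
            heap.add(new double[] {id, gain});
          }
        }
        return order;
      }
    }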

If you need to implement your own submodular function maximisation algorithm, you need to implement the SfoGreedyAlgorithm interface such that your class provides the two synonymous run methods, which take a list of ids and the number of elements to select and return the selected document ids. If, instead, you are interested in reusing more of our already implemented code and structure, you can consider extending the AbstractSfoGreedy abstract class, which already provides a (score, document id) pair abstraction.
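
For orientation, the interface could look roughly like the sketch below; the exact parameter and return types are assumptions, not WikiMining's actual signatures.

    import java.util.List;
    import java.util.Set;

    // Rough sketch of the SfoGreedyAlgorithm contract described above; the
    // exact signatures in WikiMining may differ.
    interface SfoGreedyAlgorithm {
      // Selects `toSelect` document ids from the given list of ids.
      Set<Integer> run(List<Integer> allDocIds, int toSelect);

      // Synonymous variant that takes the candidate pool as a set.
      Set<Integer> run(Set<Integer> candidateDocIds, int toSelect);
    }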

6.2.4 Coverage MapReduces

This part refers to the three categories of classes that deal with coverage, found in package ch.ethz.las.wikimining.mr.coverage[.h104]:

1. preprocessing classes;

2. GreeDi protocol classes;

3. reducer classes.

The preprocessing classes are as follows:

GraphEdgesConvertor Converts a plain-text list of edges to an adjacency list saved in a sequence file.

GraphInlinksCounter Given the inverted adjacency lists, it computes the number of inlinks of each node (the core idea is sketched after this list).

RevisionsConvertor Converts a list of revision statistics (number of revisions, revision volume) from plain text to sequence files;

WikiToPlainText Converts an XML Wikipedia dump to plain-text sequence files, using Cloud9.
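
To make the inlink computation concrete, the snippet below counts inlinks from an in-memory inverted adjacency list; the real GraphInlinksCounter is a MapReduce job over sequence files, so this only captures the core idea.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Core idea behind GraphInlinksCounter: with an inverted adjacency list
    // (node -> the nodes that link to it), the number of inlinks of a node is
    // simply the size of its inverted list.
    class InlinksCounterSketch {
      static Map<Integer, Integer> countInlinks(Map<Integer, List<Integer>> invertedAdjacency) {
        Map<Integer, Integer> inlinks = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> entry : invertedAdjacency.entrySet()) {
          inlinks.put(entry.getKey(), entry.getValue().size());
        }
        return inlinks;
      }
    }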

The two stages of the GreeDi protocol are implemented as the classes GreeDiFirst and GreeDiSecond. These two classes are very similar in implementation and they deal with:

• parsing the program arguments;

• reading the main input files - the tf-idf vectors;

• setting up the desired reducer.

They also include the mapper implementations, which are simple, as their only job is to partition the data (a minimal sketch of such a mapper is given below).
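
A hypothetical GreeDi-style mapper might look as follows; the class name, key types and number of partitions are illustrative assumptions rather than WikiMining's actual code, but the core idea – hashing each document id into a partition so that each reducer can greedily select from its own partition – is the same.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.VectorWritable;

    // Hypothetical GreeDi-style mapper: its only job is to assign each document
    // to one of NUM_PARTITIONS buckets so that each reducer can run the greedy
    // selection on its own partition.
    public class PartitionMapper
        extends Mapper<Text, VectorWritable, IntWritable, Text> {

      private static final int NUM_PARTITIONS = 50;
      private final IntWritable partition = new IntWritable();

      @Override
      protected void map(Text docId, VectorWritable tfIdf, Context context)
          throws IOException, InterruptedException {
        // Hash the document id into a stable partition number.
        partition.set((docId.hashCode() & Integer.MAX_VALUE) % NUM_PARTITIONS);
        context.write(partition, docId);
      }
    }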

While the mappers for the GreeDi protocol are simple and almost always the same, independent of the submodular function being used, the reducers have to deal with reading all the additional data needed by the more complex submodular functions (for example, the graph or the revision statistics). We implement the following reducers:

CombinerGreeDiReducer Reads multiple statistics – such as word counts, LSH buckets, the graph, revisions – and uses submodular functions that combine them together (see Section 5.5 on page 19);

CoverageGreeDiReducer Reads as little as possible in order to apply either the simple word coverage with a word importance of 1, or word coverage with the word importance given by word count or document count (see Section 5.5 on page 19);

GraphGreeDiReducer Reads the graph and applies graph coverage on its own (see Section 5.3 on page 16);

LshBucketsGreeDiReducer Reads the LSH buckets and applies the LSH buckets function (see Section 5.4 on page 18).

6.2.5 Influence MapReduces

The influence classes, from package ch.ethz.las.wikimining.mr.influence[.h104], can also be split into the same three categories as the coverage classes:

1. preprocessing classes;

2. GreeDi protocol classes;

3. reducer classes.

The preprocessing classes are as follows:

DocumentDate Given Wikipedia’s XML dump, it retrieves the last modified date of each article;

GenerateBasisMatrix Generates the (bands, rows) random projection matrix needed for the cosine-similarity LSH (a small sketch of this hashing scheme follows the list);

TfidfNovelty Computes the document novelty score (see Section 4.4 on page 12 and Section 5.2 on page 16);

TfIdfRemoveDuplicates Removes duplicates (possibly) generated by TfIdfNovelty;

TfIdfWordSpread Computes the yearly word spread (see Section 4.4 on page 12).
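
For illustration, the sketch below shows the standard random-hyperplane (cosine-similarity) LSH scheme that such a basis matrix feeds into: each band hashes a vector into the bit pattern of the signs of `rows` random projections. The plain arrays and class name are stand-ins, not WikiMining's actual GenerateBasisMatrix output format.

    import java.util.Random;
    import org.apache.mahout.math.Vector;

    // Sketch of random-hyperplane (cosine-similarity) LSH. For each band we use
    // `rows` random hyperplanes; the band hash is the bit pattern of the signs of
    // the dot products, so similar vectors collide in the same (hash, band) bucket.
    class CosineLshSketch {
      private final double[][][] basis; // [band][row][dimension], cf. GenerateBasisMatrix

      CosineLshSketch(int bands, int rows, int dimensions, long seed) {
        Random random = new Random(seed);
        basis = new double[bands][rows][dimensions];
        for (int b = 0; b < bands; b++) {
          for (int r = 0; r < rows; r++) {
            for (int d = 0; d < dimensions; d++) {
              basis[b][r][d] = random.nextGaussian();
            }
          }
        }
      }

      // Returns the hash of the given tf-idf vector for one band; together with
      // the band index it plays the role of a (hash, band) pair such as
      // HashBandWritable.
      int hash(Vector tfIdf, int band) {
        int signature = 0;
        for (int r = 0; r < basis[band].length; r++) {
          double dot = 0;
          for (int d = 0; d < basis[band][r].length; d++) {
            dot += basis[band][r][d] * tfIdf.getQuick(d);
          }
          signature = (signature << 1) | (dot >= 0 ? 1 : 0);
        }
        return signature;
      }
    }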

The two stages of the GreeDi protocol are implemented as the classes GreeDiFirst and GreeDiSecond and are very similar to their coverage counterparts discussed above.

We implement the following reducers:

GreeDiReducer Applies the SFO greedy algorithm for influential documents as a Reduce stage, part of the GreeDi protocol;

TfIdfNoveltyReducer Used by TfIdfNovelty to compute the documents’ nearest neighbours (kNN) and then use them to compute the document novelty score;

TfIdfNoveltyIdentityReducer Outputs the buckets themselves so that we can use them as part of other objective functions.

6.2.6 Selection and evaluation

Classes from package ch.ethz.las.wikimining.evaluate deal with selecting single or multiple Wikipedia categories and offer basic capabilities to analyse the results.

The most important classes of this package are:

WikiRandomSelect Given the tf-idf vectors file – needed to get the document ids – randomly samples (with replacement) a given number of Wikipedia articles;

WikiExtractor Given the category names as arguments, it extracts the requested categories and saves them in sequence file format;

WikiCountInlinks Converts a list of article ids into a list of (article name, number of inlinks) pairs, using JWPL.

6.2.7 Input, output

Classes from package ch.ethz.las.wikimining.mr.io.h104 simplify the way one can deal with Hadoop input/output (IO). The package offers a SetupHelper class to easily set the desired input and output formats for Hadoop and a number of SequenceFileProcessor classes to read (and write) different sequence files. Some examples are listed below, followed by a minimal reading sketch:

BucketsSequenceFileReader Reads (HashBandWritable, IntArrayWritable) pairs, used for the LSH buckets function (see Section 5.4 on page 18);

IntLongSequenceFileReader Reads (IntWritable, LongWritable) pairs, used in multiple classes;

TextVectorSequenceFileReader Reads (Text, VectorWritable) pairs, used in multiple classes to read the tf-idf vectors;

TextVectorSequenceFileWriter Writes (Text, VectorWritable) pairs, used to output the novelty vectors (see Section 4.4 on page 12 and Section 5.2 on page 16).
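
As an illustration, the sketch below reads such (Text, VectorWritable) pairs directly with the plain Hadoop SequenceFile API into a map keyed by document id; it is a simplified stand-in for the SequenceFileProcessor classes, not their actual implementation.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Minimal sketch of reading (Text, VectorWritable) pairs, i.e. the tf-idf
    // vectors, from a Hadoop sequence file into a map keyed by document id.
    class TfIdfVectorsReader {
      static Map<String, Vector> read(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Map<String, Vector> tfIdf = new HashMap<>();

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(path), conf);
        try {
          Text docId = new Text();
          VectorWritable vector = new VectorWritable();
          while (reader.next(docId, vector)) {
            // Copy the key and value, since the reader reuses the same objects.
            tfIdf.put(docId.toString(), vector.get().clone());
          }
        } finally {
          reader.close();
        }
        return tfIdf;
      }
    }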

Chapter 7

Experiments

In this chapter we present our results and compare them to existing approaches.

7.1 Datasets and metrics

7.1.1 Datasets

To get the Wikipedia pages, we use the dump from 01 October 2012 that has only the latest revision for each article. The dump version we use should not impact the results considerably; we happen to use the 2012 version just because it was easily available at the time. For this Wikipedia version we extract the articles corresponding to a category (or to multiple categories) recursively – we parse the category graph downwards starting from the given root category. After extracting the articles we keep only non-stub, human-written articles and we exclude all meta pages. For more information about the extraction process see Section 6.2 on page 28.

We tested our approach on the following recursively extracted categories:

Classical composers Contains 7246 articles representing all classical music composers – this is our biggest category apart from using all of Wikipedia;

Game artificial intelligence Contains 471 articles about the use of artificial intelligence (AI) in games (for example, chess, go, video games);

Machine learning Contains 735 articles about the machine learning field;

Vectors Contains 341 articles about mathematical vectors – this is our most abstract category;

and merges of them:

Last 3 Contains the 1547 articles of the last three categories described above (Game AI, Machine learning, Vectors);

All 4 Contains the 8793 articles of all the above categories (Composers, Game AI, Machine learning, Vectors).

For the final results we use the 2012 Wikipedia dump, which has slightly over 1.3 million non-stub, human-written articles (excluding meta pages). Out of space considerations, we only present the results we obtain on the whole of Wikipedia.

In addition to the Wikipedia subsets, we use the Neural Information Processing Systems (NIPS) dataset – containing 1955 papers – and the Association for Computational Linguistics (ACL) dataset – containing 18041 papers – to test our baseline implementations against those of their authors [16].

7.1.2 Metrics

We looked for metrics suitable to our task of finding the most important, valuable Wikipedia articles, but we could not find a reliable method that also fits our use case well. Some of the possible metrics are:

Number of inlinks This metric is similar to the use of the number of citations presented in [16], but extended to web pages. While it does not have the same limitations as citations, it is hard to guarantee its validity;

PageRank A common way to extend the use of plain inlinks, which solves a lot of the problems with inlink counts [11];

Condensed Wikipedia Count the number of selected documents that appear in a condensed Wikipedia version used in education.

For the purposes of this thesis we use the number of inlinks, as we view Wikipedia as a non-competitive environment (different from the game between search engines and malevolent web masters).

7.1.3 Interpreting the results

For the experimental part, we use each submodular function to select the top 40 articles out of all of Wikipedia.

Graphs

To compare the different functions, we score the 40 articles selected by a given function with all the other functions and plot the scores as a bar graph. A function value of 1 means that the given set performs just as well as the set selected by maximising that function; formally, the plotted value of a set S under a function F is F(S)/F(S_F), where S_F is the set greedily selected by maximising F. In other words, if the set selected by graph coverage has a score of 0.77 under word coverage, it performs 77% as well as the set selected by maximising word coverage. The X axis shows the names of the functions used to compute the scores; the names refer to the following submodular functions:

base word coverage;

influence document influence;

graph graph coverage;

base+lsh-buckets word coverage linearly combined with LSH buckets;

and a series of functions from Definition 5.6 from Section 5.5 on page 19, where we vary the definition of ϕ(d, w):

revcount tf-idf · number-of-revisions;

revcount-revvolume tf-idf · number-of-revisions · revisions-volume;

inlinks-revvolume tf-idf · number-of-inlinks · revisions-volume;

inlinks-revcount tf-idf · number-of-inlinks · number-of-revisions;

inlinks tf-idf · number-of-inlinks;

revvolume tf-idf · revisions-volume;

inlinks-revcount-revvolume tf-idf · number-of-inlinks · number-of-revisions · revisions-volume.

Tables

In the tables we show the top 20 articles among the selected 40. The columns have the following meaning, in order from left to right:

1. article rank;

2. article name;

3. number of inlinks.

7.2 Baselines

We consider three baselines:

1. random;

2. word coverage;

3. document influence.

7.2.1 Random

To generate the random sets, we sample without replacement 40 articles from among the 1.3 million human-written Wikipedia articles. We repeat the process 10 times and report the mean along with one-standard-deviation bars.

Figure 7.1: The scores of the randomly selected sets as given by the other submodular functions.

Looking at Figure 7.1, we see that methods that rely solely on tf-idf – even if in more processed forms, as in the case of document influence – give high scores to the randomly selected sets:

• document influence ≈ 0.44(±0.037);

• word coverage ≈ 0.58(±0.025);

• word coverage + LSH buckets ≈ 0.64(±0.021);

While all the selected sets have scores higher than random for all of the functions mentioned, the high scores that random receives from these functions hint that they do not represent a great measure for Wikipedia – in other words, their informativeness does not scale up to massive corpora. Apart from these functions, revisions volume also seems to be a less informative measure, as a lot of articles have a high volume of revisions. The random sets receive the following score from the inlinks – revisions volume function: ≈ 0.24 (±0.025).

7.2.2 Word coverage

We present the function scores in Figure 7.2 and the top 20 articles in Table 7.1.

Figure 7.2: The score of the word coverage set as given by the other submodular functions.

Table 7.1: Top 20 articles selected by the word coverage function.

1  History of Western civilization  115
2  Stephen Tompkinson  78
3  Timeline of United States inventions (1890–1945)  381
4  November 2013  102
5  Education in the United States  1098
6  American Civil War bibliography  320
7  Races of the Mass Effect universe  42
8  October 2011 in sports  98
9  Prince (musician)  2458
10  Stowe House  109
11  Outline of science  12
12  2000 New Year Honours  77
13  Seasons in Scottish football  856
14  Lists of state leaders by year  2563
15  The Idler (1758–1760)  32
16  National Football League  22939
17  Disasters of the Century  8
18  Sustainable Services  1
19  Australian of the Year  232
20  Largest organisms  41

Figure 7.3: The scores of the document influence set as given by the other submodular functions.

7.2.3 Document influence

We present the function scores in Figure 7.3 and the top 20 articles in Table 7.2.

As can be seen from the graphs and tables for word coverage and document influence, the selected documents are not very informative of any significant topic, and the scores are high only for the functions we argued are not very informative: the tf-idf-based and revisions-volume-based functions. In the following sections we show how the results can be improved significantly (for Wikipedia).

7.3 Graph coverage

We present the function scores in Figure 7.4 and the top 20 articles in Table 7.3. Graph coverage is a great improvement over the previous methods, but it carries a risk similar to the one associated with inlink measures. For example, it picks articles like Citation needed, which is clearly unimportant but linked to a lot.

Table 7.2: Top 20 articles selected by the document influence function.

1  History of virtual learning environments in the 1990s  1
2  The Idler (1758–1760)  32
3  Economic history of China (1949–present)  0
4  History of the Royal Australian Navy  58
5  Timeline of events in Hamilton, Ontario  257
6  Technical features new to Windows Vista  23
7  Outline of culture  34
8  Seasons in Scottish football  856
9  Principles and Standards for School Mathematics  65
10  Łeka  0
11  December 2003  88
12  Coleman Station Historic District  10
13  Constitution of Fiji: Chapter 4  195
14  Jandek  128
15  Partners in Flight  4
16  Outline of software  3
17  Dunorlan Park  102
18  Bob and Carol Look for Treasure  9
19  Facility information model  2
20  Nationality Rooms  833

Figure 7.4: The scores of the graph coverage set as given by the other submodular functions.

Table 7.3: Top 20 articles selected by the graph coverage function.

1  Geographic coordinate system  759441
2  Citation needed  637433
3  United States  508227
4  International Standard Book Number  343361
5  Biological classification  217533
6  Music genre  209333
7  Internet Movie Database  183739
8  Orphan  171883
9  Association football  96447
10  United Kingdom  181525
11  Germany  157768
12  Canada  149119
13  Digital object identifier  109622
14  England  159822
15  France  179483
16  India  114069
17  Public domain  53717
18  Japan  113741
19  Australia  125675
20  Given name  27087

7.4 LSH buckets and word coverage

We present the function scores in Figure 7.5 and the top 20 articles in Table 7.4. As we mentioned in Section 5.6 on page 22, finding proper weights to linearly combine multiple submodular functions is hard and, as such, word coverage + LSH buckets performs similarly to using just word coverage.

7.5 Beyond word coverage

We present the tf-idf · number-of-revisions in Figure 7.6 and the top 20 articles in Table 7.5.

We present the tf-idf · number-of-revisions · revisions-volume in Figure 7.7 and the top 20 articles in Table 7.6.

We present the tf-idf · number-of-inlinks · revisions-volume in Figure 7.8 and the top 20 articles in Table 7.7.

We present the tf-idf · number-of-inlinks · number-of-revisions in Figure 7.9 and the top 20 articles in Table 7.8.

We present the tf-idf · number-of-inlinks in Figure 7.10 and the top 20 articles in Table 7.9.

Figure 7.5: The scores of the set selected by LSH buckets linearly combined with word coverage as given by the other submodular functions.

Table 7.4: Top 20 articles selected by the LSH buckets function.

1  History of Western civilization  115
2  Stephen Tompkinson  78
3  Timeline of United States inventions (1890–1945)  381
4  November 2003  102
5  Education in the United States  1098
6  2000 New Year Honours  77
7  Races of the Mass Effect universe  42
8  November 2006 in sports  101
9  Timeline of musical events  69
10  The Idler (1758–1760)  32
11  Systems engineering  874
12  Long Beach, California  4273
13  Seasons in Scottish football  856
14  Asimov’s Biographical Encyclopedia of Science and Technology  0
15  Tears for Fears  577
16  Disasters of the Century  8
17  The False Prince and the True  4
18  Economic history of China (1949–present)  0
19  2012 in film  736
20  Independence National Historical Park  359

Figure 7.6: The scores of the tf-idf – revisions count set as given by the other submodular functions.

Table 7.5: Top 20 articles selected by the tf-idf – revisions count function.

1  George W. Bush  11525
2  United States  508227
3  Jesus  7065
4  Adolf Hitler  7148
5  Wii  3860
6  Michael Jackson  6661
7  World War II  104037
8  Hurricane Katrina  3875
9  RuneScape  475
10  United Kingdom  181525
11  2006 Lebanon War  793
12  Britney Spears  3584
13  The Beatles  10395
14  2007  723
15  New York City  69267
16  India  114069
17  Islam  17455
18  Global warming  3832
19  PlayStation 3  3696
20  2006 FIFA World Cup  3072

Figure 7.7: The scores of the tf-idf – revisions count – revisions volume set as given by the other submodular functions.

Table 7.6: Top 20 articles selected by the tf-idf – revisions count – revisions volume function.

1  George W. Bush  11525
2  United States  508227
3  Jesus  7065
4  Adolf Hitler  7148
5  Michael Jackson  6661
6  World War II  104037
7  Wii  3860
8  Hurricane Katrina  3875
9  RuneScape  475
10  United Kingdom  181525
11  Britney Spears  3584
12  The Beatles  10395
13  2006 Lebanon War  793
14  New York City  69267
15  Islam  17455
16  India  114069
17  2007  723
18  PlayStation 3  3696
19  Global warming  3832
20  Christianity  17145

Figure 7.8: The scores of the tf-idf – inlinks count – revisions volume set as given by the other submodular functions.

Table 7.7: Top 20 articles selected by the tf-idf – inlinks count – revisions volume function.

1  United States  508227
2  Geographic coordinate system  759441
3  France  179483
4  United Kingdom  181525
5  Time zone  225085
6  England  159822
7  International Standard Book Number  343361
8  Germany  157768
9  Internet Movie Database  183739
10  Music genre  209333
11  Record label  195384
12  Poland  108573
13  Canada  149119
14  India  114069
15  Australia  125675
16  World War II  104037
17  Italy  115868
18  Daylight saving time  139888
19  Animal  133644
20  Association football  96447

Figure 7.9: The scores of the tf-idf – inlinks count – revisions count set as given by the other submodular functions.

Table 7.8: Top 20 articles selected by the tf-idf – inlinks count – revisions count function.

1  United States  508227
2  United Kingdom  181525
3  World War II  104037
4  France  179483
5  Germany  157768
6  India  114069
7  England  159822
8  Canada  149119
9  Japan  113741
10  New York City  69267
11  Australia  125675
12  Italy  115868
13  Association football  96447
14  Russia  84692
15  George W. Bush  11525
16  London  85152
17  Poland  108573
18  Spain  90482
19  World War I  47384
20  Brazil  73972

Figure 7.10: The scores of the tf-idf – inlinks count set as given by the other submodular functions.

Table 7.9: Top 20 articles selected by the tf-idf – inlinks count function.

1  United States  508227
2  Geographic coordinate system  759441
3  France  179483
4  Citation needed  637433
5  International Standard Book Number  343361
6  United Kingdom  181525
7  Time zone  225085
8  Biological classification  217533
9  Record label  195384
10  England  159822
11  Internet Movie Database  183739
12  Music genre  209333
13  Germany  157768
14  Poland  108573
15  Orphan  171883
16  Daylight saving time  139888
17  India  114069
18  Canada  149119
19  Record producer  119573
20  Italy  115868

Figure 7.11: The scores of the tf-idf – revisions volume set as given by the other submodular functions.

We present the tf-idf · revisions-volume in Figure 7.11 and the top 20 articles in Table 7.10.

We present the tf-idf · number-of-inlinks · number-of-revisions · revisions-volume in Figure 7.12 and the top 20 articles in Table 7.11.

Looking at the graphs, we can see that all functions have advantages and disadvantages and that there is no clear winner. It is important to note that, while inlink-based metrics most of the time include significant (highly cited) articles, the pages selected by the revision-based functions concern very debatable topics.

We observe that functions that use the number of inlinks as part of their metric perform similarly to one another and similarly to graph coverage. Also, we can confirm that using revisions volume is not very informative, as the scores are similar to those of the equivalent functions that do not use the revisions volume value at all. More importantly, by looking at the bar graphs and at the top 20 selected documents, we conclude that the other functions outperform the existing word coverage and document influence functions.

We argued before that the importance of an article is subjective by its very nature. However, in our opinion, one of the best sets of articles is the one in Table 7.9, selected by the tf-idf – inlinks count function.

Table 7.10: Top 20 articles selected by the tf-idf – revisions volume function.

1  George W. Bush  11525
2  World War I  47384
3  United States  508227
4  Michael Jackson  6661
5  England  159822
6  Jesus  7065
7  Jack Thompson (activist)  35
8  Internet  11878
9  Baseball  22619
10  Gray wolf  916
11  The Beatles  10395
12  2007  723
13  Water  8390
14  France  179483
15  Hurricane Katrina  3875
16  Punk rock  12159
17  Microsoft  9842
18  Pope John Paul II  5314
19  Association football  96447
20  Evolution  3867

Figure 7.12: The scores of the tf-idf – inlinks count – revisions count – revisions volume set as given by the other submodular functions.

Table 7.11: Top 20 articles selected by the tf-idf – inlinks count – revisions count – revisions volume function.

1  United States  508227
2  United Kingdom  181525
3  World War II  104037
4  France  179483
5  Germany  157768
6  England  159822
7  Canada  149119
8  India  114069
9  Japan  113741
10  New York City  69267
11  Australia  125675
12  Italy  115868
13  George W. Bush  11525
14  Association football  96447
15  Russia  84692
16  World War I  47384
17  London  85152
18  Poland  108573
19  Spain  90482
20  Brazil  73972

7.6 Running time

In this section, we report the running times for selecting 40 documents with Hadoop on 50 machine cores. We exclude the extraction of plain text and the creation of the tf-idf vectors from the reported times.

• Word-coverage based methods ≈ 20m;

• Graph coverage ≈ 19m;

• Word coverage linearly combined with LSH buckets ≈ 22m:

– Preprocess ≈ 5m;

– GreeDi ≈ 17m;

• Document influence ≈ 60m:

– (Novelty tf-idf preprocessing ≈ 1h30m)

– Other preprocessing ≈ 7m;

– GreeDi ≈ 53m;

Chapter 8

Conclusion

We introduce previously suggested methods for multi-document summarisation, such as word coverage and document influence, and show how we can scale them to massive corpora that have over a million articles. We define novel submodular functions that capture graph coverage and LSH buckets, and we extend word coverage to take into account the number of inlinks and the revision statistics. We also experiment with different non-negative linear combinations of multiple submodular functions. We devise a framework – WikiMining – that can easily be used to test different submodular functions on massive corpora and can also be extended to analyse large datasets, such as Wikipedia. We offer comparisons of different submodular functions (both novel and from other papers) that provide a baseline for further research in finding the most important documents in a massive corpus.

Future work

An important area that has to be addressed is finding better evaluation methods that are easy to visualise while remaining computationally feasible. One reliable choice is to crowd-source the scoring of different chosen sets of documents. However, one must investigate what the best setting is in the case of Wikipedia, as evaluation is a multi-faceted problem.

Another important area involves finding new submodular functions and improving the existing ones. For example, it is computationally expensive to retrieve a reliable creation date for Wikipedia’s articles, as one has to go through all the revisions. Also, there are better ways to compute the importance of a word and of a document – heuristically [6] or by learning [15] – as suggested by Sipos et al. [16].

Last, but not least, using MapReduce for machine learning tasks is a poor choice, as MapReduce was designed for passing through the data a small number of times – once or twice. A much better choice is Spark, which builds on top of Hadoop and HDFS and provides the ability to cache the data in random-access memory (RAM) between different MapReduce stages, among many other tricks. Put together, Spark’s optimisations significantly speed up the computations, especially in our case of a high number of MapReduce stages.

Bibliography

[1] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[2] Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI, volume 7, pages 1606–1611, 2007.

[3] Todd Holloway, Miran Bozicevic, and Katy Borner. Analyzing and visualizing the semantic coverage of Wikipedia and its authors. Complexity, 12(3):30–40, 2007.

[4] Andreas Krause. SFO: A toolbox for submodular function optimization. The Journal of Machine Learning Research, 11:1141–1144, 2010.

[5] Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability: Practical Approaches to Hard Problems, 3, 2012.

[6] Hui Lin and Jeff Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 912–920. Association for Computational Linguistics, 2010.

[7] Michel Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, pages 234–243. Springer, 1978.

[8] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, pages 2049–2057, 2013.

[9] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions – I. Mathematical Programming, 14(1):265–294, 1978.

[10] Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. Manning, 2011.

[11] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. 1999.

[12] Anand Rajaraman and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2012.

[13] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.

[14] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10. IEEE, 2010.

[15] Ruben Sipos, Pannaga Shivaswamy, and Thorsten Joachims. Large-margin learning of submodular summarization models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 224–233. Association for Computational Linguistics, 2012.

[16] Ruben Sipos, Adith Swaminathan, Pannaga Shivaswamy, and Thorsten Joachims. Temporal corpus summarization using submodular word coverage. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 754–763. ACM, 2012.

[17] Jakob Voss. Measuring Wikipedia. International Conference of the International Society for Scientometrics and Informetrics, 10, 2005.

[18] Torsten Zesch, Christof Muller, and Iryna Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.
