Unsupervised Text Mining: Effective Similarity Calculation

with Ranking and Matrix Factorization

Wathsala Anupama Mohotti

B.Sc.(Hons) & M.Sc. in Information Technology

Submitted in Fulfilment

of the Requirements

for the Degree of

Doctor of Philosophy

Queensland University of Technology

School of Computer Science

Science and Engineering Faculty

2020

This thesis is dedicated to my loving parents

N. Mohotti and G. Nagahawaththa

Statement of Original Authorship

The work contained in this thesis has not been previously submitted for a degree

or diploma at any other higher educational institution. To the best of my knowl-

edge and belief, the thesis contains no material previously published or written

by another person except where due reference is made.

Name: Wathsala Anupama Mohotti

Signature: QUT Verified Signature

Date: 27/03/2020

Acknowledgements

It is my pleasure to express my appreciation and gratitude to everyone who has

been a part of my PhD journey. First, I would like to express my heartfelt

gratitude to my principal supervisor, Associate Professor Richi Nayak, for her

continuous support and constant guidance over the past few years. Her advice,

encouragement, feedback, motivation, direction, and patience made my journey

possible. Also, I would like to thank my associate supervisor, Associate Profes-

sor Shlomo Geva, for his support throughout my PhD journey. I would like to

acknowledge the financial support provided by QUT throughout my PhD by the

QUT Postgraduate Research Award (QUTPRA) and the QUT HDR Tuition Fee

Sponsorship.

I would like to acknowledge the QUT high-performance computing (HPC) team,

Big Data Laboratory and QUT digital observatory for providing the necessary

infrastructure during the course of my PhD. Also, I would like to thank the

staff of EECS School for their administrative support during my candidature. I

acknowledge the services of professional editor, Diane Kolomeitz, who provided

copyediting and proofreading services, according to the guidelines laid out in the

university-endorsed national “Guidelines for editing research theses”. I extend my sincere gratitude to past and present members of the Applied Data Mining Research Group (ADMRG) and all my friends at QUT, as well as my housemates, for

their valuable support throughout my journey. I would particularly like to thank

Dr Sarasi Munasinghe, Gayani Tennakoon, Dr Noor Ifada, Dr Taufik Sutanto, and Dinusha Wijedasa for providing much-needed support during my studies.

Finally, I would like to express my heartfelt gratitude to my loving parents for

everything they have done for me. Their unconditional support, sacrifice and the

positive influence they have had throughout my life has taken me to the place

I am at today. I would also like to thank my brothers for their support,

encouragement, and love. Thank you all for being there for me.

Abstract

Advancements in digital processing techniques have led to exponential growth in

the size of text data collections. Text data have been used primarily in social me-

dia platforms, document repositories, news broadcasting services, websites, and

blogs as an effective communication medium. Text mining is a popular approach

to discover meaningful information such as clusters, outliers and evolution in clus-

ters from the text collections. The unavailability of ground-truths in real-world

collections creates the demand for conducting these analyses in an unsupervised

setting.

Multiple approaches have been explored to identify text similarity for finding

clusters, outliers, and evolution in text clusters. However, the high dimensional

nature of text data and the associated sparseness in document representation

present challenges for text mining methods to identify similarity within text data.

The distance calculation, density estimation, and other approximation techniques

become ineffective in identifying accurate information. This presents a need for

developing methods that can handle high dimensionality and related problems in

text data for knowledge discovery.

The thesis proposes a set of methods to identify text similarity mainly using rank-

ing and matrix factorization. It proposes methods for finding document clusters,

outliers, and changing dynamics of the clusters based on these novel similarity

concepts. More specifically, the proposed methods (1) use ranking concepts to

exploit nearest neighbors in determining text similarities and dissimilarities ef-

ficiently; (2) accurately learn dense patches in naturally sparse data; (3) enrich

documents to avoid extreme sparseness in short text data; and (4) represent

high dimensional text with lower rank representation using matrix factorization

minimizing the information loss.

Firstly, this thesis presents two novel text clustering methods, RDDC (Ranking-

Centered Density-Based Document Clustering Method) and CCNMF (Consensus

and Complementary Non-negative Matrix Factorization for Document Cluster-

ing), and two specific methods to identify clusters with short text for the appli-

cation areas of community detection and concept mining.

• In RDDC, a Shared Nearest Neighbor (SNN) graph is built based on

the ranked documents using an Information Retrieval system, and clus-

ters are identified with density estimation from the SNN and the frequent

neighborhood-based hubs. Empirical analysis shows RDDC to be accurate and efficient due to its use of document neighborhoods, generated from the relevant document sets returned by an IR system, which form relatively uniform regions in the text collection and help differentiate varying densities.

• In CCNMF, the vector space model is integrated with the neighborhood

information, preserving geometric structures, to compensate for the infor-

mation loss in NMF. Empirical analysis shows that CCNMF is able to accu-

rately identify clusters as it uses complementary and consensus information

from the input data, especially with local neighborhood affinity through

pairwise calculation and global neighborhood affinity through IR ranking.

• The two short-text methods, corpus-based augmented media posts with density-based clustering for community detection and concept mining in online forums using self-corpus-based augmented text clustering, use document expansion to handle the extreme sparseness in short text posts. The document expansion method approximates topic vectors using NMF to obtain virtual words for post expansion, which improves word co-occurrence in the sparse text in line with the semantics of the collection. These enriched documents are shown to be accurate in community detection with density-based clustering on heterogeneous social media text, while concept mining on homogeneous forum text has shown better performance with distance-based clustering.

Secondly, this thesis presents four novel outlier detection algorithms based on the

novel concepts of rare frequency of terms and ranking.

• OIDF (Outlier detection based on Inverse Document Frequency) proposes the simple concept of using the inverse document frequency of terms to identify documents that deviate from the inlier groups, in settings where the high dimensionality of text vectors impairs concepts such as distance, density or dimensionality reduction.

• ORFS (Outlier detection based on Ranking Function Score) proposes an

outlier score for a document based on the inverse of the ranking scores

given for response documents by an IR system that are considered as nearest

neighbors.

• ORNC (Outlier detection based on Ranked Neighborhood k-occurrences Count) proposes to calculate the reverse neighbor count in the response lists for documents in the entire collection to define an outlier score. Documents with low counts, which are anti-hubs, receive high outlier scores.

• ORDG (Outliers by Ranking based Density Graphs) proposes outlier de-

tection by identifying documents that do not exist in the mutual nearest

neighbor graph that is meant to include inliers. Empirical analysis shows

ORDG to be accurate and efficient through the nearest neighbors identified

with the IR system to generate the mutual neighbor graph and identified

frequent nearest neighbors (hubs) attached to the graph.

These four algorithms have been shown to be efficient due to the use of IR ranking

concepts in modeling nearest neighbors compared to pairwise calculations that are

expensive for the large document collections. The outlier candidates generated

with ORFS, ORNC, and ORDG algorithms are sequentially and independently

combined with outlier candidates of OIDF to develop ensemble methods to obtain

higher accuracy.

Lastly, this thesis proposes a novel method for identifying the changing dynamics

of text clusters, named CaCE (Cluster Association-aware matrix factorization

for discovering Cluster Evolution). CaCE tracks major lifecycle states of birth,

death, split and merge of the clusters to discover emergence, persistence, growth

and decay patterns using both intra-cluster and inter-cluster associations with

NMF. In CaCE, the use of both these relationships has been shown to be accurate in identifying cluster groups and compensating for the information loss in dimensionality reduction. CaCE proposes to use density estimation with term weights

to refine the cluster assignment to groups as a further compensating mechanism

for the information loss. In CaCE, evolution is represented by drawing edges in

a k-partite graph between consecutive time intervals if the clusters possess the

same level of density and belong to the same group. This visualization technique

aids in the interpretability of lifecycle states and patterns in clusters.

In summary, the thesis makes a substantial contribution to the fundamental task

of effective text similarity identification needed for the development of text clus-

tering, text outlier detection and tracking text cluster evolution methods. This

thesis advances the fields of data mining, machine learning and document en-

gineering by successfully dealing with the high dimensionality of text vectors

and associated problems that have been repetitively discussed in the academic

literature and commonly faced in real-world applications.

Keywords

Unsupervised Learning, Text Mining, Text Similarity, Clustering, Outlier Detec-

tion, Cluster Evolution, Ranking, Nearest Neighbors, Density Estimation, Docu-

ment Expansion, Non-negative Matrix Factorization, Shared Nearest Neighbors,

Mutual Neighbor graph, Hubs, Anti-hubs, Skip-Gram with Negative Sampling

Contents

Abstract
Keywords
List of Tables
List of Figures
List of Publications
Acronyms & Abbreviations

Chapter 1 Introduction
1.1 Background
1.2 Problem Statement and Motivation
1.2.1 Problem Statement
1.2.2 Motivation
1.3 Research Questions
1.4 Research Aim and Objectives
1.5 Research Contributions
1.6 Publications Resulting from Research
1.7 Research Significance
1.8 High Level Overview of the Thesis

Chapter 2 Literature Review and Background
2.1 Text Mining
2.1.1 Text Mining Process
2.1.2 Text Feature Representation
2.2 Text similarity
2.2.1 Distinct Text Characteristics
2.2.2 Text Similarity Measures
2.3 Unsupervised Text Mining Methods
2.3.1 Text Clustering
2.3.2 Text Outlier Detection
2.3.3 Text Cluster Evolution
2.4 Research Gaps
2.4.1 Text Clustering
2.4.2 Text Outlier Detection
2.4.3 Text Cluster Evolution

Chapter 3 Text Clustering
Paper 1: An Efficient Ranking-Centered Density-Based Document Clustering Method
Paper 2: Consensus and Complementary Non-negative Matrix Factorization for Document Clustering
Paper 3: Corpus-based Augmented Media Posts with Density-based Clustering for Community Detection
Paper 4: Concept Mining in Online Forums using Self-corpus-based Augmented Text Clustering

Chapter 4 Text Outlier Detection
Paper 5: Efficient Outlier Detection in Text Corpus Using Rare Frequency and Ranking
Paper 6: Text Outlier Detection using a Ranking-based Mutual Graph

Chapter 5 Text Cluster Evolution
Paper 7: Discovering Cluster Evolution Patterns with the Cluster Association-aware Matrix Factorization

Chapter 6 Conclusion and Future Directions
6.1 Summary of Contributions
6.2 Summary of Findings
6.2.1 Clustering
6.2.2 Outlier Detection
6.2.3 Cluster Evolution
6.3 Future Work
6.3.1 Stream mining
6.3.2 Community discovery considering both structure and content information
6.3.3 Deep learning
6.3.4 Short text clustering
6.3.5 Soft clustering
6.3.6 Complete text mining framework
6.3.7 Pre-trained models for document representation

Appendix A: Case Studies

Appendix B: Matrix Factorization for Community Detection using a Coupled Matrix
1 Introduction
2 Problem and Motivation
3 Related Work
4 Approach
5 Results and Contributions

Bibliography

List of Tables

2.1 Internet traffic report by Alexa on August 15th, 2019
2.2 Summary of the major outlier detection methods
2.3 Categories in dynamic text evolution
4.1 Proposed outlier detection methods

List of Figures

1.1 Sparseness in text with higher dimensional representation [178] and the distance concentration problem [52]
1.2 Examples for text types and nature of the vectors
1.3 Architecture of the thesis for unsupervised text mining
1.4 Overview for unsupervised text mining methods
2.1 General text mining process
2.2 Skewness of k-NN [154]
2.3 Skewness of hubs [175]
2.4 Clustering with distance to the centroid and clustering with hub-similarity [174]
2.5 Mutual Neighbors that share common documents
2.6 An overview of text clustering methods
2.7 The use of ranking for clustering [174]
3.1 Overview of the Chapter 3 contributions



List of Publications

Wathsala Anupama Mohotti and Richi Nayak.: An Efficient Ranking-

Centered Density-Based Document Clustering Method. Pacific-Asia Conference

on Knowledge Discovery and Data Mining, pp. 439-451. Springer (2018) (Will

form part of Chapter 3).

Wathsala Anupama Mohotti and Richi Nayak.: Consensus and Comple-

mentary Non-negative Matrix Factorization for Document Clustering. Elsevier

Knowledge-Based Systems journal (Under Review). (Will form part of Chapter

3).

Wathsala Anupama Mohotti and Richi Nayak.: Corpus-Based Augmented

Media Posts with Density-Based Clustering for Community Detection. Inter-

national Conference on Tools with Artificial Intelligence (ICTAI), pp. 379-386.

IEEE (2018) (Will form part of Chapter 3).

Wathsala Anupama Mohotti and Darren Christopher Lukas and Richi

Nayak.: Concept Mining in Online Forums using Self-corpus-based Augmented

Text Clustering. Pacific Rim International Conference on Artificial Intelligence

(PRICAI), pp. 397-402. Springer (2019) (Will form part of Chapter 3).

Wathsala Anupama Mohotti and Richi Nayak.: Efficient Outlier Detection in

Text Corpus Using Rare Frequency and Ranking. ACM Transactions on Knowl-

edge Discovery from Data (TKDD) (Accepted with Major Revision). (Will form

part of Chapter 4).

Wathsala Anupama Mohotti and Richi Nayak.: Text Outlier Detection using

a Ranking-based Mutual Graph. Data & Knowledge Engineering Journal (Under

Review). (Will form part of Chapter 4).

Wathsala Anupama Mohotti and Richi Nayak.: Discovering Cluster Evolu-

tion Patterns with the Cluster Association-aware Matrix Factorization. Springer

Knowledge and Information Systems (KAIS) (Under Review). (Will form Chap-

ter 5).

Acronyms & Abbreviations

NN      Nearest Neighbors
SNN     Shared Nearest Neighbors
VSM     Vector Space Model
IR      Information Retrieval
BOW     Bag of Words
NMF     Non-negative Matrix Factorization
SGNS    Skip-Gram model with Negative Sampling
RDDC    Ranking-Centered Density-Based Document Clustering
CCNMF   Consensus and Complementary Non-negative Matrix Factorization
OIDF    Outlier detection based on Inverse Document Frequency
ORFS    Outlier detection based on Ranking Function Score
ORNC    Outlier detection based on Ranked Neighborhood k-occurrences Count
ORDG    Outliers by Ranking based Density Graphs
CaCE    Cluster Association-aware matrix factorization for discovering Cluster Evolution
WMD     Word Mover’s Distance
GAN     Generative Adversarial Network

Chapter 1

Introduction

This chapter presents the overview of the research conducted in this thesis, in-

cluding background, problem, questions, aims, and objectives of the research.

The overall research significance and limitations are described accordingly. The

structure of the thesis is presented at the end of the chapter.

1.1 Background

Text data, widespread in social media platforms and document repositories such

as news broadcasting platforms and document indexing systems, has emerged

as a powerful means of communication among people and organizations [3, 44].

The process of discovering useful information from text document collections is

known as text mining [3]. Text mining has a significant impact on diverse applica-

tions such as social media analytics [79], opinion mining [3] and recommendation

systems [118]. The real-world scenarios, where labeled data (i.e., data with cat-

egories attached) is not available, have made unsupervised text mining popular.

This topic has been studied for decades in many fields such as clustering, outlier

detection, sentiment analysis, topic modeling and evolution analysis [3, 8, 96].

This thesis focuses on the identification of similarity/dissimilarity among text in-

stances effectively in order to learn the clusters, outliers and changing dynamics of

clusters in text document collections. The process of finding natural groups in the

document collection based on their similarities is known as document clustering

[8]. In contrast, finding documents that show a set of different terms that deviate

from the common terms in the collection is known as text outlier detection [96].

In addition to clusters and outliers, identifying dynamic changes to clusters and

the evolutionary pattern of clusters over time (or domains) based on similarity

among clusters is an emerging area that is aided by the strength of text mining

for knowledge discovery from text data collections [63].

Figure 1.1: Sparseness in text with higher dimensional representation [178] and the distance concentration problem [52]

A common challenge faced by all these text mining methods is how to identify the similarity/dissimilarity between text instances. Accurately identifying the similarity among documents is challenged by the sparseness of text representation [3]. A popular data model for text representation is the vector space model (VSM), which records the (weighted) presence/count of a term within the document [53]. Different types of data sources form different sizes of text vectors [95, 138] (Fig. 1.2). Short text, which appears on social media platforms, forms short text vectors that are usually extremely sparse compared to other text [79]. All other text data face the usual sparseness of high dimensional data. Text from sources such as Wikipedia forms very large text vectors [95] that need high processing power because of the high number of dimensions to process. News data forms medium-sized text vectors [138] compared to the other two types.
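
As a rough illustration of the VSM and its sparseness, the sketch below builds a raw term-count matrix for three toy documents and reports the fraction of zero entries. The documents, the whitespace tokenizer, and the helper names `build_vsm` and `sparsity` are assumptions made for this example; they are not taken from the thesis.

```python
from collections import Counter


def build_vsm(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(tokens).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix


def sparsity(matrix):
    """Fraction of zero entries in the document-term matrix."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


# Hypothetical toy collection: two longer documents and one short post.
docs = [
    "text mining finds clusters in text",
    "outlier detection finds rare documents",
    "great match tonight",
]
vocab, matrix = build_vsm(docs)
print(len(vocab), round(sparsity(matrix), 2))
```

Even in this tiny collection most entries are zero, and the short post shares no term with the other two documents, which is the situation the paragraph above describes for short text.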

The high dimensional nature of text data forms a sparse VSM representation due

to fewer word co-occurrences. Consequently, many existing methods become less

effective in determining similarity. Identifying similar text instances is fundamen-

tal to text mining methods. In high dimensional representation, identifying near

and far points is problematic using the distance measurements. This phenomenon

is known as the distance concentration problem [205]. As shown in Fig. 1.1, the

distance differences among instances become negligible with the sparseness. This

leads to blurring the border between nearest neighbors and farthest neighbors (as

shown in Fig. 1.1.c). The sparse data representation is also a problem for identi-

fying the similarity based on density estimation [12]. Due to the lack of density

variations (spikes), it is hard to identify the subgroups. In order to identify the

similarity among text instances, it is essential to develop effective methods that

overcome sparseness in text for large document collections.
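
The distance concentration effect can be observed numerically. The following minimal sketch assumes uniformly random points (not real text vectors) and measures the relative contrast, (d_max - d_min) / d_min, between the farthest and nearest neighbor of a random query point as the number of dimensions grows. The function name `relative_contrast` and the chosen sizes are illustrative assumptions.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """(d_max - d_min) / d_min for Euclidean distances from one random query."""
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)
contrasts = {d: relative_contrast(200, d, rng) for d in (2, 10, 100, 1000)}
for n_dims, contrast in contrasts.items():
    print(n_dims, round(contrast, 3))
```

Under these assumptions the contrast shrinks as the dimensionality increases, so the nearest and farthest points become hard to tell apart by distance alone.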

A wide range of text mining methods have been proposed to address the sparse-

ness in high dimensional text representation [3, 8]. Distance-based methods

[30, 36] aim to identify the text similarity based on distance differences. Nearest

Figure 1.2: Examples for text types and nature of the vectors

neighbor based similarity [68, 78, 193] is used as an extension to this in recent

research. These methods use the frequent nearest neighbors in document col-

lections and similarity of other documents to them for identifying the groups.

Though this is used in cluster identification as well as outlier detection where the

deviated points are identified from frequent neighbors, it still faces challenges.

The process of calculating frequent nearest neighbor sets, as well as calculating

similarity to them, is expensive. Density-based methods [29, 33, 55, 201] also

fail in handling sparse text without sophisticated designs. Among the matrix

factorization methods that are used to approximate higher dimensional represen-

tation using lower rank factors, Non-negative Matrix Factorization [96, 110, 181]

is shown to be effective for text as term representations in the text are always

positive. Probabilistic methods [25, 49, 180] also perform the dimensionality re-

duction using probability calculation for a document to be in lower dimensional

space. All these dimensionality reduction methods, however, face the problem of

information loss [3].
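
To make the lower rank approximation concrete, the sketch below factorizes a small non-negative document-term matrix X into W and H with the standard multiplicative update rules for the Frobenius objective. It is a generic NMF illustration, not one of the methods proposed in this thesis; the toy matrix, the rank, the iteration count and the function name `nmf` are assumptions for the example.

```python
import numpy as np


def nmf(X, rank, n_iter=500, eps=1e-9, seed=0):
    """Approximate X (docs x terms) as W @ H with non-negative factors."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, rank))
    H = rng.random((rank, n_terms))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical counts: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
X = np.array(
    [
        [2, 1, 0, 0],
        [3, 1, 0, 0],
        [0, 0, 1, 2],
        [0, 0, 2, 3],
    ],
    dtype=float,
)
W, H = nmf(X, rank=2)
error = np.linalg.norm(X - W @ H)  # residual: the information lost at rank 2
clusters = W.argmax(axis=1)        # assign each document to its strongest factor
print(clusters, round(float(error), 3))
```

The non-zero residual is a small-scale view of the information loss discussed above: a rank-2 factorization cannot reproduce every entry of a matrix whose rank is higher.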

These issues are common to clustering, outlier detection and cluster evolution detection, all of which deal with sparse, high dimensional text representation. There is a need to develop effective methods that consider the nature of the associated text vectors.

1.2 Problem Statement and Motivation

1.2.1 Problem Statement

Unsupervised text mining is an important process of deriving useful information such as groups, patterns, and trends from digital document collections. For instance, social media platforms generate text data that are short in length and extremely sparse. News data and web pages also contain high dimensional text that forms sparse text vectors. Generally, all text data show few word co-occurrences among text pairs, which results in a sparse term representation. Identifying subgroups, deviated documents in a document collection, or dynamic changes in text clusters requires effective methods to compare the similarity between text pairs. The sparseness leads to problems in distance-based, density-based, probability-based and matrix factorization-based methods. This thesis focuses on unsupervised text mining to identify text similarity/dissimilarity for finding clusters, outliers and dynamic changes to clusters effectively, while minimizing the problems associated with the sparseness of high-dimensional text.

1.2.2 Motivation

The popularity of the internet increases the availability of digital text in social

media, online forums or message boards, email services, news broadcasting ser-

vices, web blogs and websites. Text mining is an effective approach to extract

concepts, clusters, user communities, deviated themes and dynamic changes in

those text collections using machine learning approaches [3]. The real-world sce-

narios with less/zero availability of ground truth data create the need to use

unsupervised learning methods in finding these useful patterns. Usually, text

data is high in dimensions and shows a sparse vector representation due to fewer

word co-occurrences [3]. Particularly, sources such as social media contain short

text that forms comparatively short size text vectors and show a limited word

co-occurrence with the extreme sparseness in vectors [79]. This fact leads most

of the existing state-of-the-art methods to be ineffective in identifying similarity

among text instances [95, 159].

Density-based methods which usually identify the subgroups or deviated docu-

ments based on density patches are unable to accurately estimate the density

differences in the sparse text representation [29, 33, 55, 201]. Matrix factoriza-

tion [96, 110] or other dimensionality reduction methods [96, 110], commonly

used for higher dimensional data, are challenged by the information loss in lower

dimensional approximation. Distance-based methods [30, 36] face the distance

concentration problem in higher dimensions showing a blurred border between

near and far instances, as illustrated in Fig. 1.1 [177]. This has a similar effect on

hierarchical clustering due to the requirement of multiple pairwise computations

at each step of decision making.

Nearest neighbor-based methods have been used to handle higher dimensional text in recent research, identifying neighbors in addition to traditional similarity measures [68, 78, 173, 193]. Researchers use the nearest neighbors with graphs to identify dense patches and outliers [56, 193]. Higher dimensional data are known to show the hub phenomenon, where “the distribution of number of times some points appear among k nearest neighbors of other points is highly skewed” [177]. Text mining methods use this concept of frequent nearest neighbors to identify the similarity of text pairs, which is useful for identifying clusters, outliers or dynamic changes of clusters in text collections [155, 173]. However, pairwise comparison for determining nearest neighbors is not accurate in higher dimensions, and is not time efficient for large text collections. The IR ranking concept has been used as an efficient alternative to identify hubs in recent research [173]. Furthermore, IR document querying is used to identify the documents in a cluster by giving the cluster center as the query point [30]. This thesis explores the novel concept of IR ranking in clustering as well as in outlier detection. It proposes to develop effective methods to build neighborhood graphs for density estimation in finding uniformly dense subgroups and filtering outlier documents.
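
The notion of hubs can be sketched by counting how often each document appears in the top-k neighbor lists of the other documents (its k-occurrence, or reverse neighbor count). In the example below, ranking by Jaccard overlap of term sets is only a stand-in for an IR system's ranking function, and the toy documents and the names `top_k_neighbors` and `k_occurrences` are assumptions made for illustration.

```python
from collections import Counter


def top_k_neighbors(docs, k):
    """Rank other documents by Jaccard overlap of term sets; keep the top k."""
    term_sets = [set(doc.lower().split()) for doc in docs]
    neighbors = {}
    for i, terms_i in enumerate(term_sets):
        scored = []
        for j, terms_j in enumerate(term_sets):
            if i == j:
                continue
            score = len(terms_i & terms_j) / len(terms_i | terms_j)
            if score > 0:  # documents with no shared term are not returned
                scored.append((score, j))
        # Sort by descending score, breaking ties by document index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors[i] = [j for _, j in scored[:k]]
    return neighbors


def k_occurrences(neighbors):
    """Count how often each document appears in other documents' lists."""
    counts = Counter({i: 0 for i in neighbors})
    for ranked in neighbors.values():
        counts.update(ranked)
    return dict(counts)


docs = [
    "text clustering with ranking",
    "text clustering with density",
    "ranking for text outlier detection",
    "density based text clustering",
    "football match tonight",
]
neighbors = top_k_neighbors(docs, k=2)
counts = k_occurrences(neighbors)
print(counts)
```

In this toy collection, documents that appear in many lists behave like hubs, while documents that appear in none behave like anti-hubs and would be natural outlier candidates.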

In addition, dimensionality reduction methods, especially NMF with its strict non-negativity constraint, are used in text mining to obtain a lower rank representation that enables the identification of groups [96, 110, 181]. However, this higher to lower dimension approximation destroys the geometric structure of the data [146]. In the projected lower order space, points that are neighbors in high dimensions do not remain close, which leads to information loss. The thesis identifies the need to compensate for this loss with the additional information of nearby/close points [93, 95]. It investigates the use of nearest neighbor-assisted NMF in cluster identification, as well as the use of inter-cluster association assistance in identifying cluster dynamics.

The extreme sparseness in short text is a distinctive problem in text mining, which many state-of-the-art methods handle with the assistance of different non-content information or by using terms from external sources [17, 80, 95]. However, the semantic characteristics of short text can mismatch with the assistance given by the other information, and the structural incoherence between the external source and the original data leads to a poor outcome. This thesis explores an effective method to alleviate the extreme sparseness in short text using corpus-based expansion.

In summary, this thesis deals with the above-mentioned challenges in identifying

text-similarity in an effective manner, mainly using Ranking and Matrix Factor-

ization. It aims to propose methods, taking advantage of the ranking concepts,

density estimation with ranking, NMF-based learning and document expansion

with NMF, to learn the accurate clusters, outlier documents, and cluster evolution

according to the nature of the text as detailed in Figure 1.3.

[Figure 1.3 diagram. Data types (vector size): short size text vectors, medium size text vectors, large size text vectors. Text similarity identification concepts: density estimation, ranking concepts, non-negative matrix factorization, document expansion. Output: clusters, outliers, cluster evolution.]

Figure 1.3: Architecture of the thesis for unsupervised text mining

1.3 Research Questions

In unsupervised text mining, the similarity calculation among documents is a

fundamental and critical step. Unsupervised text mining methods for learning

subgroups, deviated documents and dynamic changes to clusters primarily rely

upon the processes that they employ for similarity identification. However, the

high dimensional nature of the text poses several challenges. The primary objec-

tive of the thesis is to explore effective ways of similarity identification between

text pairs. This thesis extends the similarity identification concept in implement-

ing clustering, outlier detection and cluster evolution methods. More specifically,

the thesis explores the solutions for the following research questions.

1. Clustering: To identify subgroups in a text corpus, how can the similarity

calculation among documents be conducted with the novel ranking and

matrix factorization concepts?

(a) In sparse data where density difference is not able to identify the sub-

groups, how can the graph-based methods with ranking be used for

effective density estimation?

(b) Instead of expensive pairwise comparisons, how can the IR ranking-

based neighbors be employed to identify the subgroups?

(c) How can the associated information loss be minimised in matrix fac-

torization to approximate the lower rank factors and to identify sub-

groups?

2. Outlier Detection: How can the concept of ranking and density, used in

finding text similarity, be extended in detecting outliers in a text collection?

3. Cluster Evolution: How can the matrix decomposition and identified

factors be used to understand the cluster similarity and changing dynamics

of text clusters in text collections?

1.4 Research Aim and Objectives

The overarching aim of this thesis is to design, develop and evaluate unsupervised
text mining methods that effectively identify the similarity among text instances
for learning clusters, outliers and cluster evolution in document collections. The
objectives of this research are as follows:

RO.1. Developing text mining methods that are able to accurately

identify clusters in document collections

The main focus is to explore the problems associated with the high dimensionality of
text vectors that challenge existing methods, especially pairwise neighbor
identification, which is impaired in this setting. This thesis investigates novel concepts such

as ranking-based neighborhoods and ranking-based neighborhood graphs. It ex-

plores the use of these concepts in density estimation in the sparse text data

as a key objective. Further, it investigates the use of ranking-based neighbor

information to assist matrix factorization to accurately cluster documents.

• RO.1.1. The short text data shows distinct characteristics with extremely

sparse representation due to short vector length. Effectively learning doc-

ument similarity in short text becomes challenging. This thesis focuses on

identifying a novel corpus-based document expansion method to deal with

this issue.

RO.2. Developing text mining methods that are able to accurately and

efficiently identify outliers in a text collection

The high dimensional and sparse vector representation challenges traditional

methods in differentiating deviated documents from the inlier subgroups. Gen-

erally, outlier detection methods rank the observations based on deviations. The

majority of them show higher computational complexity with large text collec-

tions. This thesis investigates the novel term weighting-based and ranking-based

concepts to identify the outliers accurately and efficiently. It proposes methods

that use ranking-based neighbors and ranking based on rare term frequency to

deal with high dimensional text representation and associated problems.

Developing the text outlier detection methods responding to these challenges,

considering the size of the text vectors, is another focus of the thesis.

RO.3. Developing a text mining method that is able to correctly iden-

tify the cluster evolution in text collections

Identifying all the life-cycle states of clusters and their evolutionary patterns is
another focus of this thesis. It studies a method to capture evolution patterns over
time/domain with matrix factorization using high dimensional text cluster
representations. Matrix factorization naturally leads to information loss in the
higher-to-lower dimensional projection. The use of different relationships within
clusters and term distributions is investigated to compensate for this loss. The
majority of existing methods consider only local relationships or a subset of the
data space in tracking evolution. Developing a method that identifies the global
dynamics of text clusters in response to these challenges is the objective of the
thesis.

1.5 Research Contributions

This thesis has developed several methods for identifying text clusters, outliers,

and cluster evolution, which address the ineffectiveness of existing measures in

identifying text similarity.

RC.1. Text clustering methods

• RC.1.1. Ranking-Centered Density-Based Document Clustering Method

(RDDC)

RDDC has been developed to gain accuracy and time efficiency in text clustering by
avoiding the pairwise nearest neighbor calculation. The IR ranking concept is used
to generate nearest neighbors: a document query, which statistically represents a
document, is posed against an inverted index data structure, and the relevant
documents returned are treated as nearest neighbors. These responses are shown to be
relevant to each other and to belong to the same cluster, which indicates that they
are semantically coherent. As a novel contribution, the generated nearest neighbors
are used to build a shared nearest neighbor graph that reveals uniformly dense
regions in the sparse text. Another contribution of RDDC is the identification of
hubs (i.e., frequent nearest neighbors) that exist in high dimensional data using
the shared neighbor graph. It efficiently calculates the similarity to hubs using
relevancy scores provided by the IR system to increase the percentage of documents
that are clustered into the correct group. This research is published in the 22nd
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).

• RC.1.2. Consensus and Complementary Non-negative Matrix Factorization

for Document Clustering (CCNMF)

Conjecturing that IR can be used to accurately generate nearest neighbors,

CCNMF is an NMF-based method that uses nearest neighbors generated with IR ranking
as a document affinity matrix. Its novel contribution is to combine nearest
neighbors, which preserve the geometric structure, with the document
representation. This allows the document cluster assignment to be approximated
accurately while minimizing information loss in the lower dimensional
approximation. CCNMF assigns clusters by using consensus and complementary
information, which are common and specific to the inputs respectively. Empirical
analysis validates that combining IR-based global neighbor affinity and pairwise
similarity-based local neighbor affinity with the VSM document representation
results in more accurate clusters in the lower-order approximation of the
high-dimensional text. This research has been submitted to the Elsevier
Knowledge-Based Systems (KBS) journal and is under review.

• RC.1.3. Corpus-based Augmented Media Posts with Density-based Cluster-

ing for Community Detection

In this method, a novel document expansion approach that improves word
co-occurrences is proposed to deal with extremely sparse short text. Virtual topic
terms, based on the topic vectors identified in the corpus, are added to documents
so that they align with the semantics of the corpus itself. NMF-based topic vector
approximation is proposed to obtain the virtual terms. Another contribution is the
identification of user communities from this enriched text, which represents users
in social media platforms, using density estimation and a centroid-based
fine-tuning process that boosts the cluster assignments. Empirical analysis
confirms that enriching social media text, which is heterogeneous, is able to
minimize the sparseness in short text and support the learning process of
term-based density differences. This work led to a conference paper published
in the 30th International Conference on Tools with Artificial Intelligence

(ICTAI).

• RC.1.4. Concept Mining in Online Forums using Self-corpus-based Aug-

mented Text Clustering

The corpus-based document enrichment method has been applied in another

application of concept mining. The NMF-based topic vector approximation

is used to enrich the forum posts using topic words as virtual words. Additionally,
a centroid-based text clustering method is proposed
to handle the homogeneous nature of the forum text. This work led to a

conference paper and was published in the 16th Pacific Rim International

Conference on Artificial Intelligence (PRICAI).
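
The ranking-based neighborhood idea described in RC.1.1 can be illustrated with a minimal sketch. It is not the RDDC implementation: the scoring function (a sum of tf*idf weights over shared terms), the toy corpus, the value of k and the function names are all assumptions chosen for illustration, and a real IR system would typically use its own ranking function. Each document is posed as a query against an inverted index, its top-k responses are taken as nearest neighbors, and two documents are linked in a shared nearest neighbor graph when their neighbor lists overlap.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(docs):
        for term, freq in Counter(tokens).items():
            index[term][doc_id] = freq
    return index


def ranked_neighbors(doc_id, docs, index, k):
    """Pose a document as a query and return its top-k ranked responses."""
    scores = Counter()
    for term in set(docs[doc_id]):
        postings = index[term]
        # Assumed ranking function: term frequency weighted by log(N / df).
        idf = math.log(len(docs) / len(postings))
        for other_id, freq in postings.items():
            if other_id != doc_id:
                scores[other_id] += freq * idf
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [other_id for other_id, score in ranked[:k] if score > 0]


def shared_neighbor_graph(neighbors):
    """Link two documents when their neighbor lists overlap.

    The edge weight is the number of shared neighbors.
    """
    graph = {}
    for a, b in combinations(sorted(neighbors), 2):
        shared = len(set(neighbors[a]) & set(neighbors[b]))
        if shared:
            graph[(a, b)] = shared
    return graph


# Hypothetical toy corpus with two topics (finance and football).
docs = [
    "stock market shares rise".split(),
    "stock market shares fall".split(),
    "market shares rise today".split(),
    "football match final score".split(),
    "football team wins match".split(),
    "team wins final match".split(),
]
index = build_inverted_index(docs)
neighbors = {i: ranked_neighbors(i, docs, index, k=2) for i in range(len(docs))}
graph = shared_neighbor_graph(neighbors)
```

In this toy corpus the graph links documents only within the same topic, which mirrors the dense regions the method looks for. The index is queried once per document, so no explicit all-pairs comparison is needed.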

RC.2. Outlier detection methods

• RC.2.1. Outlier Detection in Text Corpus Using Rare Frequency and Rank-

ing

This thesis proposes a set of novel algorithms OIDF, ORFS, and ORNC

using the concepts of ranking-based neighborhood and/or rare document

frequencies to identify the deviated documents from the inlier groups in

the corpus. The methods developed based on these categories of algorithms

contribute to a research area of much-needed attention. The simple concept

of inverse document frequency of terms is proposed as the first contribu-

tion to identify the outlier candidates in sparse text representation with

the OIDF algorithm. Empirical analysis shows that the use of this term

weighting-based ranking, to assign an outlier score for a document, accu-

rately identifies how deviated the document is from the common subgroups

in the corpus.

Additionally, the ranking scores generated by the IR system in response to

a document query are proposed to be used in a reverse manner to identify the

level of deviation of the document in the ORFS algorithm. Moreover, the

ORNC algorithm identifies the sub-dense hubs in high dimensional data

using the IR ranking responses with the k-occurrences and the anti-hubs

are proposed as outliers. A set of ensemble approaches, which combine the

concepts in OIDF with ORFS and ORNC, are proposed as optimal solutions

that boost accuracy, efficiency, and scalability of text outlier detection. This

research was submitted to the ACM Transactions on Knowledge Discovery

from Data (TKDD) journal and was accepted with major revision.

• RC.2.2. Text Outlier Detection using a Ranking-based Mutual Graph

(ORDG)

ORDG proposes an incremental graph-based method to identify the outliers

avoiding sparseness in the text representation. Using the inverse document

frequency of terms in a document, it first identifies the level of deviation

of a document from the inlier groups in the collection and identifies outlier

candidates. ORDG then presents a novel method to identify the outliers

that are deviated documents from a dense mutual neighbor graph, gener-

ated using IR ranking concept. The novel approach to construct the mutual

neighbor graph using IR results considering shared neighbors among docu-

ments is able to contain documents in inlier subgroups and forms hubs in

high dimensional data through shared nearest neighbors. ORDG proposes

the documents that are excluded from the graph, as well as those that do

not show similarity to hubs, as the next set of outlier candidates. The com-

mon outlier candidates identified by both these steps are proposed as the

final outliers in ORDG. This research has resulted in a journal paper, which

has been submitted to the Data & Knowledge Engineering Journal.
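
The intuition behind term weighting-based outlier scoring in RC.2.1 can be illustrated with a short sketch. The exact OIDF scoring function is not given in this chapter, so the score below (the mean inverse document frequency of the distinct terms in a document, with idf = log(N / df)) is an assumption used only for illustration, as are the function name and the toy corpus. A document made up of terms that are rare in the collection receives a high score.

```python
import math
from collections import Counter


def idf_outlier_scores(docs):
    """Score each non-empty document by the mean idf of its distinct terms."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        terms = set(doc)
        scores.append(sum(math.log(n_docs / df[t]) for t in terms) / len(terms))
    return scores


# Hypothetical corpus: three related documents and one deviating document.
docs = [
    "stock market shares rise".split(),
    "stock market shares fall".split(),
    "market shares rise today".split(),
    "volcano eruption ash cloud".split(),
]
scores = idf_outlier_scores(docs)
# Documents ordered from most to least outlying.
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
```

Here the fourth document shares no terms with the others, so every one of its terms has the maximum idf and it is ranked first.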

RC.3. Text cluster evolution method (CaCE).

In this method, a novel global text cluster evolution approach is proposed to track
the full cluster life cycle over time/domain. Based on the concept that information
loss in matrix factorization can be compensated by incorporating additional
information, CaCE proposes an NMF-based method to identify the cluster groups in a
corpus using both inter- and intra-cluster associations. The semantic assistance
obtained from the additional inter-cluster association makes it possible to
accurately identify the cluster groups along with the birth, death, split and merge
dynamics of clusters. CaCE uses a concept of density based on the term frequencies
of a cluster to measure the strength of association between a cluster and its
cluster group; loosely attached clusters are separated from the group to enhance
the accuracy of the detected cluster dynamics. Another important contribution of
CaCE is to display clusters in the same group with links in a progressive k-partite
graph over k time intervals to discover emergence, persistence, growth and decay
patterns in clusters. This research has resulted in a journal paper, which has been
submitted to the Springer Knowledge and Information Systems (KAIS) journal.

1.6 Publications Resulting from Research

A list of published/accepted/under review papers, included as part of the chapters

in this thesis, is given below.

• Paper 1. Wathsala Anupama Mohotti and Richi Nayak: An Efficient

Ranking-Centered Density-Based Document Clustering Method. Pacific-

Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp.

439-451. Springer (2018) (Will form part of Chapter 3)

• Paper 2. Wathsala Anupama Mohotti and Richi Nayak: Consensus and

Complementary Non-negative Matrix Factorization for Document Cluster-

ing. Elsevier Knowledge-Based Systems journal (Under Review). (Will

form part of Chapter 3)

• Paper 3. Wathsala Anupama Mohotti and Richi Nayak: Corpus-Based

Augmented Media Posts with Density-Based Clustering for Community De-

tection. International Conference on Tools with Artificial Intelligence (IC-

TAI), pp. 379-386. IEEE (2018) (Will form part of Chapter 3)

• Paper 4. Wathsala Anupama Mohotti and Darren Christopher Lukas

and Richi Nayak: Concept Mining in Online Forums using Self-corpus-

based Augmented Text Clustering. Pacific Rim International Conference

on Artificial Intelligence (PRICAI), pp. 397-402. Springer (2019) (Will

form part of Chapter 3)

• Paper 5. Wathsala Anupama Mohotti and Richi Nayak: Efficient Out-

lier Detection in Text Corpus Using Rare Frequency and Ranking. ACM

Transactions on Knowledge Discovery from Data (TKDD) (Accepted with

Major Revision). (Will form part of Chapter 4)

• Paper 6. Wathsala Anupama Mohotti and Richi Nayak: Text Out-

lier Detection using a Ranking-based Mutual Graph. Journal of Data &

Knowledge Engineering (Under Review). (Will form part of Chapter 4)

• Paper 7. Wathsala Anupama Mohotti and Richi Nayak: Discovering

Cluster Evolution Patterns with the Cluster Association-aware Matrix Fac-

torization. Springer Knowledge and Information Systems (KAIS) (Under

Review). (Will form Chapter 5)

1.7 Research Significance

Text is the natural means of communication used by people in many digital
applications. All the methods in this thesis fall into the category of unsupervised
machine learning, which works in the absence of ground-truth data and is therefore
practically suited to real-world contexts. In an unsupervised setting, identifying
text similarity is a significant yet challenging step due to the high number of
dimensions and the sparseness of the text representation. The techniques developed
in the thesis successfully deal with this problem and contribute to three major
application areas. Further, they can be used in various domains.

Firstly, the thesis has advanced the popular field of document clustering by de-

veloping effective methods to discover subgroups from text document collections.

Theoretically, these methods propose a new perspective for the high-dimensional

text clustering. (1) They show a new direction of using IR-based neighborhood

to identify text similarity and density distribution with mutual neighborhood

graphs in naturally sparse data. (2) They provide efficient methods for text

similarity calculation that overcome the sparse representation through IR-based

frequent neighbors (hubs) or document expansion. (3) They show the importance

of learning more accurate cluster assignments, by incorporating nearest neighbor

information with document representation in dimensionality reduction to identify

the text similarity. Practically, a clustering method is useful to organize the text

data based on similarity in many applications, such as information retrieval [3],

social media analytics [79], opinion mining [3] and recommendation systems [118].

Additionally, this thesis proposes two methods for short text analysis in discov-

ering user communities in social media and concepts discussed in online forums.

Community detection in social media analysis is useful in identifying groups of

users with common interests to assist in viral and targeted marketing, political

campaigning, customized health programs, event identification, and many other

applications [88, 144, 147]. Concept mining that extracts participants’ cognitive

grouping is useful in improving e-learning and e-marketing [76, 120].

Secondly, the thesis has advanced the much-needed field of unsupervised text

outlier detection by developing effective methods to detect anomalies in the text

data. The methods in the thesis formally define a realistic text outlier detection

problem where the presence of outliers is identified from a number of subgroups

(instead of the entire documents). Furthermore, they propose an innovative view-

point of ranking and neighborhood concepts to identify these deviations/outliers

based on text dissimilarity. The evaluation measures proposed in the thesis would

be useful to categorize the effectiveness of methods based on the error in out-

lier/inlier detection specifically. These measures should also be applicable in

traditional outlier detection methods. Outlier detection in static text data is

beneficial in many application domains for decision-making, such as web, blog

and news article management to identify the unusual/uncommon page or news

[96] as well as in dynamic settings to detect unusual events from the social media

posts that can be early warnings [79].

Last but not least, the thesis contributes to an emerging field of tracking the

dynamic changes in a text collection. Theoretically, the method proposes a novel

approach to use additional relationship information in handling the sparse text

representation to identify the evolving patterns in the clusters through cluster

similarity. It shows the importance of additional assistance in avoiding information
loss in dimensionality reduction. This NMF method is applicable to different
domains and proposes advancements in the popular matrix factorization methods.

With the popularity of big data in the last decade, tracking of document collec-

tions over a period is helpful in several applications such as finding dynamics of

terminologies, identifying concept drift, and emerging and evolving trends [63].

Tracking evolution across different domains provides insight into how the same

concept has been used over diverse domains. This is useful for policymakers and
project planners to adjust their decisions, while discovering cluster dynamics over
time in a specific field is useful for researchers, academics, and students in
that field to plan their publications, strategies, and research [73].

1.8 High Level Overview of the Thesis

This section connects the proposed methods and the common core concepts. It

then relates them to published/under-review papers.

[Figure 1.4 content. Concepts/Methods: Density Estimation, Ranking Concepts, Non-negative Matrix Factorization, Document Expansion, all supporting Effective Text Similarity Calculation. Output: Clusters (Paper 1 RDDC; Paper 2 CCNMF; Paper 3 Augmented text for Community Detection; Paper 4 Augmented text for Concept Mining, as an application), Outliers (Paper 5 OIDF, ORFS and ORNC; Paper 6 ORDG), Cluster Evolution (Paper 7 CaCE).]

Figure 1.4: Overview for unsupervised text mining methods

As shown in Figure 1.4, this thesis aims to identify similarity/dissimilarity be-

tween text instances dealing with the challenges in high dimensional text represen-

tation and obtain three outputs: clusters, outliers and cluster evolution patterns.

The proposed unsupervised text mining methods deal with the sparseness of text

representation using the novel concepts such as ranking or rare term frequencies,

ranking-based neighborhood graphs for density estimation, NMF with additional

information and self-corpus based document expansion.

Firstly, a set of text clustering methods are proposed to identify the subgroups

within a text collection. RDDC (Ranking-Centered Density-Based Document

Clustering Method) proposed in Paper 1 mainly aims to handle the high dimen-

sional and sparse nature of text using an IR ranking-based shared nearest neighbor

graph to identify the dense patches. CCNMF (Consensus and Complementary

Non-negative Matrix Factorization for Document Clustering Method) in Paper 2

uses IR-based nearest neighbors together with pairwise nearest neighbors to
compensate for the information loss in NMF. Corpus-based document expansion/augmentation

is proposed in Paper 3 for the problem of community detection considering text

posts of users in social media. Extreme sparseness in short text is avoided with

this expansion done through topic vectors in the text collection identified via

NMF. Another application of this concept is proposed in Paper 4 for concept

mining in online forums.
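
The corpus-based expansion described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of Paper 3 or Paper 4: it uses a plain NMF with Lee-Seung multiplicative updates written in pure Python, assigns each document to its dominant topic, and appends the top terms of that topic as virtual terms. The number of topics, the number of virtual terms, the iteration count, the toy corpus and all function names are hypothetical.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(V, rank, iterations=200, seed=0):
    """Approximate V (docs x terms) by W (docs x rank) times H (rank x terms)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    eps = 1e-9  # avoids division by zero
    for _ in range(iterations):
        # Multiplicative update for H: H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(Wt, matmul(W, H))
        H = [[H[r][j] * num[r][j] / (den[r][j] + eps) for j in range(n_cols)]
             for r in range(rank)]
        # Multiplicative update for W: W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(matmul(W, H), Ht)
        W = [[W[i][r] * num[i][r] / (den[i][r] + eps) for r in range(rank)]
             for i in range(n_rows)]
    return W, H


def expand_documents(docs, n_topics=2, n_virtual_terms=2):
    """Append the top terms of each document's dominant topic as virtual terms."""
    vocab = sorted({term for doc in docs for term in doc})
    V = [[float(doc.count(term)) for term in vocab] for doc in docs]
    W, H = nmf(V, n_topics)
    expanded = []
    for doc, doc_topics in zip(docs, W):
        topic = max(range(n_topics), key=lambda r: doc_topics[r])
        top = sorted(range(len(vocab)), key=lambda j: -H[topic][j])[:n_virtual_terms]
        virtual = [vocab[j] for j in top if vocab[j] not in doc]
        expanded.append(doc + virtual)
    return expanded


# Hypothetical short posts on two topics.
docs = [
    "stock market rise".split(),
    "market shares fall".split(),
    "stock shares rise".split(),
    "football match win".split(),
    "team match final".split(),
    "football team win".split(),
]
expanded = expand_documents(docs)
```

Because the virtual terms come from topic vectors learned on the same corpus, the expanded documents stay within the corpus vocabulary, which is the property that distinguishes this kind of expansion from enrichment with external sources.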

Secondly, a set of algorithms, namely OIDF (Outlier Detection Based On In-

verse Document Frequency), ORFS (Outlier Detection Based On Ranking Func-

tion Score), and ORNC (Outlier Detection Based On Ranked Neighborhood k-

Occurrences Count) are proposed in Paper 5 for outlier detection in text col-

lections. OIDF presents the core concept of using ranked terms based on inverse

document frequency, to identify the outliers, which usually contain uncommon

terms in the collection. ORFS identifies outlier candidates based on the IR
ranking-based nearest neighbors, where their ranking scores are used inversely as
an indicator for calculating the outliers. In line with the IR ranking-based
nearest neighbors, ORNC treats a low number of occurrences of a document in the
nearest neighbor lists of other documents as an indicator of an outlier. In Paper
6, IR ranking responses identified as the nearest neighbors are used to construct a
mutual neighbor graph and to identify the hubs. This hub concept is used together
with a density estimation process on the mutual neighbor graph to identify the
inliers. It isolates the outliers, which are not part of the graph or deviate from the graph,

with ORDG.
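
The k-occurrence idea used by ORNC can be illustrated as follows. This sketch assumes that ranked neighbor lists are already available (here they are hard-coded, hypothetical IR responses) and that a document is flagged when its occurrence count does not exceed a threshold; the threshold and function names are assumptions, not the parameters used in Paper 5.

```python
from collections import Counter


def k_occurrence_counts(neighbors):
    """Count how often each document appears in other documents' neighbor lists."""
    counts = Counter({doc_id: 0 for doc_id in neighbors})
    for doc_id, ranked in neighbors.items():
        for other in ranked:
            if other != doc_id:
                counts[other] += 1
    return counts


def anti_hubs(neighbors, max_count=0):
    """Return documents whose k-occurrence count is at most max_count."""
    counts = k_occurrence_counts(neighbors)
    return sorted(doc_id for doc_id, count in counts.items() if count <= max_count)


# Hypothetical top-2 ranked responses for four documents.
neighbors = {
    "d1": ["d2", "d3"],
    "d2": ["d1", "d3"],
    "d3": ["d1", "d2"],
    "d4": ["d1", "d2"],
}
counts = k_occurrence_counts(neighbors)
outlier_candidates = anti_hubs(neighbors)
```

Document d4 retrieves neighbors but is never retrieved by any other document, so it is the only anti-hub in this example, while d1 and d2 behave as hubs.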

Lastly, the CaCE (Cluster Association-aware matrix factorization for discovering

Cluster Evolution) method in Paper 7 aims to identify the full cluster life cycle

and evolutionary patterns within clusters in a text collection. The core concept in

CaCE is an NMF-based approach to identify cluster groups with high dimensional

text cluster representation. These identified cluster groups are displayed across
time/domain using a k-partite graph to identify the evolving patterns globally.

In summary, this “thesis by publication” consists of the following six chapters.

• Chapter 1 provides a general overview of the thesis, including research ques-

tions, objectives, and significance.

• Chapter 2 reviews unsupervised text mining problems by focusing on in-

effectiveness of the existing methods in identifying text similarity due to

high dimensionality of text in the areas of clustering, outlier detection, and

cluster evolution. This chapter contains sections covering the general text mining
process, associated challenges, characteristics of the text and different types of
methods. A list of research gaps concludes the chapter and leads to the development
of the proposed methods in other chapters.

• Chapter 3 focuses on the problem of text clustering. An IR ranking-based
neighborhood is proposed in Paper 1 and Paper 2, with density estimation and matrix
factorization respectively, to handle the high dimensional nature of the text that
leads to sparseness. Extreme sparseness in

short text vectors is handled with document expansion in Paper 3 and

validated with another application in Paper 4. These four papers form

Chapter 3.

• Chapter 4 is about text outlier detection. Multiple approaches to calculate

outlier scores are proposed in Paper 5 using the efficient ranking concept.

IR ranking-based neighbors and ranking documents based on inverse doc-

ument frequency are proposed to cope with the sparse text representation.

Extension of this ranking based outlier detection for a graph-based method

is proposed in Paper 6. These two papers form Chapter 4.

• Chapter 5 concentrates on identifying cluster dynamics in a text collec-

tion. The CaCE method in Paper 7 is proposed to identify cluster life

cycles and evolution patterns using matrix factorization. The use of both

intra-cluster and inter-cluster association assists in dealing with sparse text

cluster representation for identifying cluster groups that are used in repre-

senting evolving patterns.

• Chapter 6 summarizes the thesis, including its significant results and findings,
in line with the research objectives and the research gaps identified in Chapters 1
and 2. It concludes with recommendations for future research directions.

Chapter 2

Literature Review and

Background

This chapter provides an overview of the current literature on unsupervised text

mining, giving focus to text similarity identification in clustering, outlier detec-

tion, and cluster evolution. The first part of the literature review (Section 2.1)

presents the importance of text data analytics, text mining process and the nature

of the text data with term modelling. The next section (Section 2.2) highlights

the key concept of finding similarity among text documents that is fundamental

to text mining methods. The main focus of this thesis is to propose alternative

text similarity calculation techniques and develop a set of novel unsupervised text

mining methods (e.g., clustering, outlier detection and cluster evolution). The

subsequent sections present more details on clustering, outlier detection and clus-

ter evolution methods. Section 2.3.1 provides traditional and recent developments

in text clustering methods. Outlier detection methods and their applicability in

high-dimensional data, including text, are provided in Section 2.3.2. The final

section presents methods in detecting dynamic changes to the text, which can

be related to cluster evolution. This chapter is concluded by highlighting the

research gaps, with regard to the main focus areas of the thesis.

2.1 Text Mining

The advancement of digital technology in the current era has resulted in exponen-

tial growth in text data. Reports suggest that 95% of the unstructured digital

data appears in text form [86]. For instance, most of the human interactions

with digital systems are in the form of free text such as emails, wikis, blogs and

digital news feeds [60]. Social media platforms disseminate trending information

based on users’ short-text communication over time. Search engines are another
popular internet medium that stores (or indexes) large text collections. These
text sources play an important role in several applications. Table 2.1 reports the
top 10 websites according to the internet traffic statistics of Alexa (see footnote 1).
It can be seen that five of them (as highlighted in bold) are primarily driven by
text media.

Table 2.1: Internet traffic report by Alexa on August 15th, 2019

Rank  Website
1     Google
2     YouTube
3     Tmall.com (Chinese shopping site)
4     Baidu (Chinese search engine)
5     Facebook
6     Qq.com (Chinese internet service portal)
7     Sohu.com (Chinese shopping site)
8     Taobao.com (Chinese market place)
9     Wikipedia
10    Yahoo

Footnote 1: https://www.alexa.com/topsites

Text mining is a process of discovering useful information from text document
collections that has diverse applications [3]. For instance, content management

that facilitates efficient and effective information retrieval from document reposi-

tories [106] relies on organizing the content with the use of clustering. In opinion

mining or concept mining, clustering is used to extract the set of related terms

that represent cognitive groupings [150]. In social media analytics, community
detection and recommendation systems use text mining to identify users with similar
interests based on their text communications [79, 147]. Moreover, suspicious
content detection, which identifies fake news or unusual events in social media
communication, uses text mining methods to identify deviations from the norm [45].
Furthermore, terminologies or concepts in text repositories change over time or
across domains and show varying trends. It is useful for practitioners of diverse
disciplines to mine these data to identify decaying, current and emerging concepts,
which facilitates trend analysis [4].

2.1.1 Text Mining Process

A standard text mining process follows a series of activities as shown in Fig. 2.1.

Text data generated from different sources is initially cleaned using the pre-

processing steps such as stop word removal, stemming or lemmatizing to keep

only the important information [3]. A short text that appears in social media

shows unstructured phrases and abundant information such as URLs and hash-

tags, which require special pre-processing [79]. The text is then transformed into

a data model with each document represented as a vector of terms that make it

suitable for performing mining. Primarily, documents can be represented as a bag

of words (BOW), considering the number of occurrences of each term but ignoring

the order [3]. This results in a Vector Space Model (VSM) that can be augmented

using different term weighting models such as binary, tf, idf and tf ∗ idf [37]. The

purpose of a term weighting model or a feature learning technique is to identify

the important features in the text mining process.

Figure 2.1: General text mining process. The stages shown are: Text Documents (web pages, news articles, social media posts); Text Preprocessing (text cleaning, tokenization); Text Transformation (bag of words, vector space); Feature Selection and Representation (term weighting, feature learning); Data Mining using Text Similarity Identification (clustering, outlier detection, evolution tracking); Interpretation/Evaluation (qualitative, quantitative).

A text collection usually contains a large set of terms that show little word

co-occurrence among documents [77]. This results in a sparse VSM [3]. A myriad

of text mining methods have been developed to deal with the sparse and high-

dimensional data matrices in identifying text similarity. These methods explore

the interesting patterns such as clusters, outliers or evolution in the text collec-

tions. Finally, the text mining results are evaluated with several quantitative

and qualitative methods. Accuracy, F1-score, Normalized Mutual Information

(NMI), False Positive Rate (FPR) and False Negative Rate (FNR) are popular

extrinsic evaluation measures [133]. In addition, intrinsic measurements such as

silhouette index for clustering and topic coherence for topic modeling are used

to quantitatively measure the validity of results [134]. Furthermore, case stud-

ies with top word analysis or word-cloud visualization are utilised for qualitative

interpretations [72].


2.1.2 Text Feature Representation

To perform the similarity calculation between text pairs for data mining, useful

features of text data need to be represented numerically. The frequencies of terms

in a document provide powerful insight for text mining to identify these useful

features. Generally, different weighting techniques are used to represent term

importance in a document yielding a numerical value in order to improve feature

importance [53]. The simplest weighting technique is term frequency (tf), which

assigns each term t a weight equal to its number of occurrences in the document d

[133]. The tf weighting technique considers each term with equal importance and

treats terms with little or no discriminating power in different groups with similar

priority. The inverse document frequency (idf) weighting technique solves this

issue by considering the document frequency (df), which represents the number

of documents in the collection that contain term t [133]. The idf scales down the

frequency of terms to discriminate between documents in a document collection

of size N as given below.

$$\mathrm{idf}_t = \log\left(\frac{N}{\mathrm{df}_t}\right) \tag{2.1}$$

This weighting model gives high values to rare terms and lower values to frequent

terms. The most popular weighting model used in text representation is tf ∗ idf,

which combines term frequency and inverse document frequency as given below.

$$\mathrm{tf} \ast \mathrm{idf} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t \tag{2.2}$$
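As a minimal sketch of Eqs. 2.1 and 2.2, the following Python snippet computes tf ∗ idf weights for a toy tokenized collection. The function name `tf_idf`, the toy documents and the use of the natural logarithm are assumptions made for illustration; the source does not fix the logarithm base.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # df_t: number of documents that contain term t at least once.
    df = Counter(term for doc in docs for term in set(doc))
    # idf_t = log(N / df_t) as in Eq. 2.1 (natural log is an assumption).
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_{t,d}: raw count of term t in document d
        # Eq. 2.2: weight = tf_{t,d} * idf_t
        weights.append({term: count * idf[term] for term, count in tf.items()})
    return weights


# Hypothetical toy collection of three tokenized documents.
docs = [
    ["text", "mining", "text"],
    ["text", "clustering"],
    ["outlier", "detection"],
]
weights = tf_idf(docs)
```

In this toy collection, "text" occurs in two of the three documents and therefore receives a lower idf than "mining", which occurs in only one, matching the behaviour described for rare and frequent terms.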

However, the basic term weighting methods neglect the semantic relatedness be-

tween different words [42]. A novel perspective for term weighting is used in re-

cent text mining research to assign weights considering the context of the terms


to address this problem [117, 136, 168]. Initially proposed methods use the Skip

Gram model to learn the distributed word embedding. The Skip Gram model is

a training method for neural networks to learn neighbors or the context of a word

in a corpus for word embedding [136, 137]. It predicts surrounding words of a

specific word in a fixed window. This concept is used to obtain a dense document

representation for a document considering the most co-occurring words [115].

Extending this concept, modeling the contextual information of words with Skip-Gram

with Negative-Sampling (SGNS) was proposed [117, 168]. The word association

relationships used in these methods are similar to the Skip-Gram and

SGNS modeling in word embedding. In [117, 168], the word association matrix S

is modeled with SGNS to highlight the weight of words that are closely associated.

This uses other words in the vocabulary as contexts for a specific word. If w and

c denote a word and one of its contexts respectively, where #(w, c) denotes the

number of (w, c) pairs in the collection, each element Sw,c of S is defined as follows [168]:

$$S_{w,c} = \log\left(\frac{\#(w,c) \times T}{\#(w) \times \#(c)}\right) - \log(k) \tag{2.3}$$

Here, T is the total number of word-context pairs and k is taken as the

total number of negative samples, in line with word embedding. This k is 1

for a considered word-context pair and negative sampling tries to maximize the

probability of observed word-context pairs to be 1 (i.e., P (S = 1|w, c)) while min-

imizing unobserved word-context pairs to be 0 (i.e., P (S = 0|w, c)) within the

word association matrix. In [117] this SGNS modeling is proved to be equivalent

to factorizing a (shifted) word correlation matrix. It shows that SGNS is im-

plicitly factorizing a word-context matrix, whose cells are the point-wise mutual

information of the respective word and context pairs. Thereby, SGNS effectively

covers the entire collection and gives meaningful weight to the contexts of a word.
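The word association weighting described above can be sketched as a shifted point-wise mutual information computation, i.e., log(#(w, c) × T / (#(w) × #(c))) − log(k) for every observed pair. This is only an illustration under stated assumptions: the function name `sgns_association` is hypothetical, every other word in the same document is treated as a context of a word, and unobserved pairs are simply left out of the result.

```python
import math
from collections import Counter


def sgns_association(docs, k=1):
    """Shifted PMI weight for every observed (word, context) pair."""
    pair_counts = Counter()
    for doc in docs:
        for i, w in enumerate(doc):
            for j, c in enumerate(doc):
                if i != j:  # assumption: all other words in the document are contexts
                    pair_counts[(w, c)] += 1
    total = sum(pair_counts.values())  # T: total number of word-context pairs
    word_counts = Counter()     # #(w): pairs in which w is the word
    context_counts = Counter()  # #(c): pairs in which c is the context
    for (w, c), n in pair_counts.items():
        word_counts[w] += n
        context_counts[c] += n
    return {
        (w, c): math.log(n * total / (word_counts[w] * context_counts[c])) - math.log(k)
        for (w, c), n in pair_counts.items()
    }


# Hypothetical toy collection.
docs = [["a", "b"], ["a", "b"], ["c", "d"]]
S = sgns_association(docs, k=1)
```

With k = 1 the shift term vanishes and each weight is the plain point-wise mutual information of the pair; larger k lowers every weight by log(k).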


2.2 Text similarity

The primary aim of text mining is to analyze digital text data sources to discover

interesting patterns. In this context, text similarity plays a major role in identi-

fying similar text patterns or deviated text patterns to identify clusters, outliers

or trends.

2.2.1 Distinct Text Characteristics

This section details some specific characteristics of text data that occur due to

their sparse and high-dimensional nature and that affect the process of text

mining in identifying text similarity.

Distance Concentration

The high-dimensional nature of text leads to different issues in analyzing document

collections [3]. If text document pairs (represented as large vectors) are

compared with the Euclidean distance measure, there is little difference in the

distance between different pairs due to the associated sparsity in the vectors.

The distance difference between far and near data points becomes negligible, as

shown in Fig. 1.1 (c), which is known as distance concentration [205]. This poses

a major challenge for text mining methods in differentiating similar and dissimilar

text data based on shared terms.
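Distance concentration can be observed with a small experiment. The sketch below uses uniformly random dense vectors rather than real sparse text vectors, and it measures the relative contrast (farthest distance minus nearest distance, divided by the nearest distance) from a query point. Both the data and the contrast measure are assumptions chosen for illustration; the function name `relative_contrast` is hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min of Euclidean distances from a random query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)


low_dim_contrast = relative_contrast(dim=2)
high_dim_contrast = relative_contrast(dim=1000)
```

The contrast in 1000 dimensions is far smaller than in 2 dimensions, so the nearest and farthest points become hard to tell apart by distance alone.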


Figure 2.2: Skewness of k-NN [154]. Panels (a)-(c): empirical distribution of N5 with Euclidean (l2), fractional l0.5 (proposed for higher-dimensional data) and cosine (cos) distance functions, where d represents the number of dimensions. N5 represents the number of times a point occurs among the k=5 nearest neighbors of all other points in the dataset.

Figure 2.3: Skewness of hubs [175]

Hubness Property

Text data has been shown to experience the Hub phenomenon which is evident in

high dimensional data, i.e., “the number of times some points appear among k-NN

of other points is highly skewed” [177], as illustrated in Fig. 2.2. As dimensionality

increases from Fig. 2.2 (a) to (c), the observed distributions of k-NN deviate from the

random graph model and become more skewed to the right. These characteristics


can be shown using the reverse neighbor count, which indicates the number of

times a point appears among nearest neighbors of the entire collection [155]. The

reverse neighbor count of hub points shows significant skewness as depicted in

Fig. 2.3. It shows two extremes in the high-dimensional case: (a) more pairs that

co-occur very rarely and (b) more pairs that co-occur very frequently.

These frequent nearest neighbors of the collection are hubs. Most importantly,

the data points in high-dimensional data tend to be closer to these hubs than to

the cluster mean [176]. This property has been used in recent text mining methods

to avoid distance concentration and sparseness-related issues in determining text

similarity [78, 173]. These methods assign documents to clusters by checking the

closest hub point, instead of comparing the cluster centers, as shown in Fig. 2.4.
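The reverse neighbor count used to identify hubs can be sketched as follows. This is a brute-force illustration with Euclidean distance on toy points; the function name `reverse_neighbor_counts` and the data are hypothetical, and the pairwise scan is exactly the cost that efficient methods try to avoid.

```python
import math


def reverse_neighbor_counts(points, k):
    """N_k for each point: how often it appears in the k-NN lists of the others."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort the remaining points by distance to point i and keep the k nearest.
        others.sort(key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            counts[j] += 1
    return counts


# Hypothetical one-dimensional toy data; the last point is isolated.
points = [(0.0,), (1.0,), (2.5,), (10.0,)]
counts = reverse_neighbor_counts(points, k=1)
```

Points with a high count are candidate hubs, while points with a count of zero never appear in any neighbor list.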

Figure 2.4: Clustering with distance to the centroid and clustering with hub-similarity [174]


2.2.2 Text Similarity Measures

Pairwise Text Similarity

Pairwise text comparison using terms in VSM is one of the most common techniques

in identifying text similarity. Cosine similarity is a popular measure based

on the cosine of the angle between two vectors. Let Vd1 and Vd2 be the term

vectors that numerically represent two documents d1 and d2. The

cosine similarity between these documents can be computed as below.

$$\text{Cosine similarity}(V_{d_1}, V_{d_2}) = \cos(\theta) = \frac{V_{d_1} \cdot V_{d_2}}{|V_{d_1}|\,|V_{d_2}|} \tag{2.4}$$

Using the Euclidean distance between text vectors as the pairwise comparison

measure has been found to give rise to the distance concentration issue [205]. Some

other popular measures used for pairwise comparisons are Jaccard similarity,

Pearson coefficient and KL divergence [81]. The Jaccard coefficient compares the

number of shared terms to the number of terms that are present in either of the two

documents. Let td1 and td2 be the set of terms in

d1 and d2 respectively. The Jaccard coefficient between these documents can be

computed as below.

$$\text{J similarity}(d_1, d_2) = \frac{|t_{d_1} \cap t_{d_2}|}{|t_{d_1}| + |t_{d_2}| - |t_{d_1} \cap t_{d_2}|} \tag{2.5}$$

Pearson’s correlation coefficient is another measure based on vector statistics. Let

the term set T = {t1, t2, ..., tm} and wti,d1 represent the weight of ti ∈ d1. There

are different forms in defining this coefficient; the most commonly used form is

as follows [81].

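$$\text{P similarity}(d_1, d_2) = \frac{m \sum_{i=1}^{m} w_{t_i,d_1} \times w_{t_i,d_2} - TF_{d_1} \times TF_{d_2}}{\sqrt{\left[m \sum_{i=1}^{m} w_{t_i,d_1}^2 - TF_{d_1}^2\right]\left[m \sum_{i=1}^{m} w_{t_i,d_2}^2 - TF_{d_2}^2\right]}} \tag{2.6}$$

where $TF_{d_1} = \sum_{i=1}^{m} w_{t_i,d_1}$ and $TF_{d_2} = \sum_{i=1}^{m} w_{t_i,d_2}$.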

In KL divergence, corresponding probability distributions of the documents are

considered for identifying similarity. Let wti,d1 represent the weight of ti ∈ d1.

The divergence between two distributions of words in d1 and d2 will be:

$$\text{KL similarity}(d_1 \| d_2) = \sum_{i=1}^{m} w_{t_i,d_1} \times \log\left(\frac{w_{t_i,d_1}}{w_{t_i,d_2}}\right) \tag{2.7}$$

These pairwise computations are known to possess high time complexity for larger

datasets [145]. This challenges traditional text mining to learn patterns such as

clusters or dynamic changes to clusters in larger datasets.
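The four pairwise measures in Eqs. 2.4 to 2.7 can be sketched in a few lines of Python. The function names are hypothetical, the vectors are assumed to be aligned over the same term set, and the KL computation assumes that the second distribution is non-zero wherever the first is (e.g., after smoothing), which the source does not discuss.

```python
import math


def cosine(v1, v2):
    """Eq. 2.4: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norms


def jaccard(terms1, terms2):
    """Eq. 2.5: shared terms over terms present in either document (sets)."""
    shared = len(terms1 & terms2)
    return shared / (len(terms1) + len(terms2) - shared)


def pearson(v1, v2):
    """Eq. 2.6 with TF_d defined as the sum of the term weights of d."""
    m = len(v1)
    tf1, tf2 = sum(v1), sum(v2)
    num = m * sum(a * b for a, b in zip(v1, v2)) - tf1 * tf2
    den = math.sqrt((m * sum(a * a for a in v1) - tf1 ** 2)
                    * (m * sum(b * b for b in v2) - tf2 ** 2))
    return num / den


def kl_divergence(p, q):
    """Eq. 2.7; terms with zero weight in p contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)
```

Each call compares a single pair of documents, so applying any of these measures to all pairs in a collection of N documents requires on the order of N² evaluations.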

Shared Nearest Neighbor for Text Similarity

As an alternative to the aforementioned syntactic approaches, similarity between text

documents can be modeled with the concept of Shared Nearest Neighbor (SNN)

to effectively identify the density distribution [92]. The SNN concept defines

the similarity between documents based on the number of neighbors they share

[55]. Two mutual neighboring documents are represented by adjacent nodes,

and the edge weight between them represents the number of neighbors those two

documents share [55], as depicted in Fig. 2.5. This allows mutually

connected documents to be identified as similar documents based on their connectedness.

However, the number of pairwise comparisons needed to identify the mutual neighbors

based on the shared neighbors becomes very high in large collections.
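The SNN graph construction can be sketched as below. The sketch assumes Euclidean k-nearest-neighbor lists computed by brute force on toy points; the function names `knn_lists` and `snn_graph` are hypothetical, and the all-pairs neighbor search is the expensive step noted above.

```python
import math


def knn_lists(points, k):
    """Brute-force k nearest neighbors (as index sets) for every point."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors


def snn_graph(points, k):
    """Edges join mutual neighbors; the weight is the number of shared neighbors."""
    nn = knn_lists(points, k)
    edges = {}
    for i in range(len(points)):
        for j in nn[i]:
            if i < j and i in nn[j]:  # keep only mutual neighbor pairs, once each
                edges[(i, j)] = len(nn[i] & nn[j])
    return edges


# Hypothetical toy data with two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = snn_graph(points, k=2)
```

In this toy example, edges appear only within each group, so the connected components of the graph correspond to the two groups.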

Ranking functions for Text Similarity

Information Retrieval (IR) is an established field that uses the document simi-

larity concept to provide the ranked results in response to the user query [59].


Figure 2.5: Mutual Neighbors that share common documents (diagram labels: shared nearest neighbors, mutual neighbor nodes, edge weight)

An IR system is able to process a keyword/document query efficiently with an

inverted index data structure and retrieve a list of matched (or similar) docu-

ments [174]. IR systems use different ranking functions to find the best matched

set of responses. A ranking function considers how important an individual word

is to the document and within the document collection, as well as the document

length [53] to have a statistical comparison between a query and the returned

documents based on text similarity. There is a wide variety of retrieval functions,

starting from language models such as tf ∗ idf to BM25, which focuses on probabilistic

retrieval [59]. While functions based on language models use concepts similar

to those in term weighting models, Okapi Best Matching 25 (BM25) and its

newer variations such as BM25f judge how relevant a specific document is to a query

[59].

Recently, the ranking concept has been used in finding similar documents in a

document collection [30, 173, 174]. This handful of research is based on the clus-

ter hypothesis [91] stating that “associated documents appear in a returned result

set of a query”. Several studies have validated this fact by showing that cluster-


ing can improve the ranking or retrieval performance [156, 171]. The optimum

clustering framework [59] reversed this cluster hypothesis and stated that “the

returned documents in response to a query will appear in the same cluster”. This

hypothesis has recently been used in clustering methods based on the conjecture

that the document set returned in response to a query (i.e., a document representation)

can be considered as nearest neighbors [174]. Methods [30, 173] used

the ranking function employed in an IR system to generate a document neighborhood

using the relevant document set, without the expensive pairwise document

comparisons. In addition, these concepts in the cluster hypothesis and reverse

cluster hypothesis show the embedding of semantic relationships and statistical

perception in IR-based text similarity identification. However, these methods

utilize only the ranked set of response documents as neighborhood and ignore

the associated ranking scores of those documents given by IR systems. These

scores show important information about the level of similarity of the response

documents to the document query.
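The idea of generating a document neighborhood from IR ranking can be sketched with an inverted index and a BM25-style scoring function. This is an illustration only, not the implementation used in the cited methods: the function names, the parameter values k1 = 1.2 and b = 0.75, and the particular idf variant are assumptions, and the whole document is used as the query.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, doc in enumerate(docs):
        for term, tf in Counter(doc).items():
            index[term][doc_id] = tf
    return index


def bm25_neighbors(query_id, docs, index, top_n=2, k1=1.2, b=0.75):
    """Rank the other documents against document `query_id` used as a query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    scores = defaultdict(float)
    for term in set(docs[query_id]):
        postings = index[term]
        # Non-negative idf variant (an assumption, one of several in common use).
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        # Only documents that share this term are touched.
        for doc_id, tf in postings.items():
            if doc_id == query_id:
                continue
            norm = tf + k1 * (1 - b + b * len(docs[doc_id]) / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]  # (doc_id, score) pairs, so ranking scores are kept


# Hypothetical toy collection.
docs = [
    ["text", "mining", "cluster"],
    ["text", "mining", "rank"],
    ["cluster", "hub"],
    ["deep", "network"],
]
index = build_index(docs)
neighbors = bm25_neighbors(0, docs, index)
```

Because the posting lists are followed only for the query terms, documents that share no term with the query are never compared, and the returned scores keep the level of similarity that a plain neighbor list would discard.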

Generally, the methods that use a few keywords to form queries for identifying

the text similarity neglect the underlying semantics of the terms [196]. It is im-

portant to consider semantic representations of words from local co-occurrences

in sentences to have coherent document similarity [196]. The methods [173, 174]

that used IR systems to identify the relevant documents for clustering, address

this by forming document queries considering the statistical distribution of terms.

In particular, those document queries consider all the terms in the document and

systematically retrieve the most probable terms to represent the documents. These

document-driven queries yield a semantically related set of documents as relevant

documents, in line with the cluster hypothesis [91] and the reverse cluster hypothesis [59].


Semantic Information for Text Similarity

Word Mover’s Distance (WMD) is a recently proposed metric that targets both

semantic and syntactic approaches to get similarity between text documents [111].

It utilizes the property of word vector embedding and treats text documents as

a weighted point cloud of embedded words. In WMD, the distance between two

text documents is calculated by the minimum cumulative distance that words

from one text document need to travel, to match the point cloud of another text

document [111]. However, the time complexity of WMD grows cubically with the

number of unique words in the documents. The Relaxed Word Mover’s Distance

(RWMD) is a relaxed variant of WMD proposed in [111] that reduces

this time complexity from cubic to quadratic with a limited loss in accuracy

compared to WMD. Nevertheless, all these methods are expensive compared to

other similarity measures.
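The relaxation behind RWMD can be sketched as follows: each word moves all of its weight to the closest word of the other document, the cost is computed in both directions, and the larger of the two is taken as the distance. The two-dimensional toy embeddings and the function names are hypothetical; real use would rely on pre-trained word vectors.

```python
import math
from collections import Counter


def relaxed_cost(doc_a, doc_b, vectors):
    """Cost of moving every word of doc_a to its nearest word in doc_b."""
    weights = Counter(doc_a)
    total = len(doc_a)
    targets = set(doc_b)
    return sum(
        (count / total) * min(math.dist(vectors[w], vectors[t]) for t in targets)
        for w, count in weights.items()
    )


def rwmd(doc_a, doc_b, vectors):
    """Relaxed distance: the larger of the two one-directional costs."""
    return max(relaxed_cost(doc_a, doc_b, vectors),
               relaxed_cost(doc_b, doc_a, vectors))


# Hypothetical two-dimensional word embeddings.
vectors = {"king": (0.0, 0.0), "queen": (0.0, 1.0), "apple": (5.0, 5.0)}
distance = rwmd(["king", "apple"], ["queen"], vectors)
```

Each direction needs one nearest-word search per unique word, which is where the quadratic cost in the number of unique words comes from.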

2.3 Unsupervised Text Mining Methods

Text mining methods broadly follow two approaches: (1) supervised learning,

when training data with labels is provided [88, 105], and (2) unsupervised learning,

when labeled data is not available [68, 96]. The latter case is common in

real-world scenarios as text mining methods are employed in digital document

repositories to identify natural sub-groups or clusters [8] and exceptional doc-

uments [96] without manually annotated data. Similarly, the use of supervised

learning in identifying dynamic changes of clusters in text collections is infeasible.

Some research has used fully supervised [105] or semi-supervised approaches

[173] in identifying text clusters and text outliers. However, this thesis

focuses on the more complex problem of text mining where fully unsupervised ap-

proaches for text clustering, outlier detection, and cluster evolution are explored


with the focus of handling sparseness and high dimensional nature in text for

identifying the similarity between text pairs.

2.3.1 Text Clustering

Text clustering, which aims to extract useful information from unlabelled data

by finding natural groups based on data similarities, is a major paradigm in text

mining [3]. An overview of text clustering methods is provided in Fig. 2.6.

Traditional Clustering Methods

Major traditional clustering methods can be classified as partitional, hierarchical,

dimensionality reduction and density-based clustering [8]. These methods face

different challenges when being applied to higher dimensional data such as text.

The centroid based partitional methods such as k-means are known to suffer

from the distance concentration problem when the dimensionality is high and the

distribution is sparse [177]. Hierarchical clustering suffers from the same problem

due to the requirement of multiple pairwise computations at each step of decision

making [8]. Besides, hierarchical methods are among the most computationally

expensive methods [8].

Density-based methods such as DBSCAN [57] and OPTICS [9, 179] have been

found highly efficient in spatial data. They generate diverse shapes of clusters

naturally around the density spikes without taking the required number of clusters

as an input. Though this is a desirable requirement for document clustering, the

sparse nature of text representation makes the application of these methods for

text clustering hard. They are unable to identify the dense patches in the sparse

text representation with fewer word co-occurrences to form clusters. Additionally,

methods such as DBSCAN identify the neighborhood region of core dense points

with distance measurements such as Euclidean distance. This technique,

employed for neighborhood inquiry to expand clusters, does not scale well to

high dimensional feature space. Furthermore, this neighborhood inquiry process

consumes high memory, as evident in the experiments presented in [174].

Figure 2.6: An overview of text clustering methods (diagram labels: Text Clustering; Traditional; Recent Developments; Partitioning; Hierarchical; Density-based; SNN; Dimensionality reduction; Probabilistic; Matrix Factorization; Spectral; Hub-based; IR Ranking-based; Semantic assistance-driven; Self corpus-based document Expansion; Deep Learning; Semi-Supervised feature learning; Active Learning)

The SNN concept identifies mutual neighbor documents based on the shared

neighbors. This concept enables relatively uniform regions to form a graph and

to identify clusters by differentiating varying densities [92]. In document clus-

tering where data representation is naturally sparse, this is an ideal solution to

identify dense regions [55]. However, the computation of an SNN graph or mu-

tual neighbor graph is expensive due to the high number of pairwise comparisons

required to identify the nearest neighbors. This prompts investigation into

more efficient methods to identify the nearest neighbors for building an SNN graph.

Dimension reduction methods such as generative probabilistic clustering, random

projection or matrix factorization are also commonly used in finding clusters in

the high dimensional text by approximating their low-dimensional representation

[3]. Latent topic models such as Latent Semantic Indexing (LSI), Probabilistic

Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA) apply

dimension reduction to find the semantic space and its relationship to

the high dimensional BOW representation. The new representation in semantic

space reveals the topical structure of the corpus more clearly than the original

representation [3]. Methods such as LSI identify semantic space through lower-

rank approximation via matrix factorization [20] while methods such as PLSA

and LDA use probability estimation to predict the lower-dimensional space [202].

In all these methods, information loss is inevitable while projecting from higher

to lower dimension as they are unable to maintain geometric structures in higher

order. In addition, the required resources for the lower-order approximation of

a text collection through optimization or iterative probability approximation in-

crease with the size of the datasets [3, 8]. These challenges open the requirement

for an improved dimensionality reduction method with minimum information loss.

Spectral clustering is another dimensionality reduction method that identifies the

non-convex geometric structures in the data [143]. Spectral methods project

original data into the new coordinate space by encoding information about how

near data points are. This transformation reduces the dimensionality of space and

pre-clusters the data into orthogonal dimensions. However, these methods depend

on the selected eigenvectors in the Laplacian matrix generated from the affinity

matrix to perform clustering [141]. The selected eigenvectors of a Laplacian

matrix generated from the original data matrix cannot successfully cluster

datasets that contain structures at different scales in size and density, as in text

[141]. Furthermore, this two-step process in spectral methods results in high time

complexity. Effective approaches to incorporate geometric structures inherent in

a document collection, and use them in clustering, need to be investigated.

Non-negative Matrix Factorization (NMF) is a special variation of dimensionality

reduction that is suitable for the text domain [3]. The VSM representing


text data, which naturally shows non-negativity, is decomposed into two (non-

negative) lower-rank factor matrices [8]. The lower-rank factor matrices represent

the groups in terms and the groups in documents on the basis of shared terms

where the reduced rank can represent the number of clusters in the data [116].

However, information loss is inevitable in this lower-rank approximation as well.

The neighboring points in high dimensions do not remain close points in the

projected lower-order space, which destroys the geometric structure. This highlights

the need for enforcing geometrical relationships with NMF.
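The basic NMF decomposition of a document-term matrix into two non-negative lower-rank factors can be sketched with the standard multiplicative update rules for the squared reconstruction error. This is a plain NMF illustration without any geometric regularization; the pure-Python helpers, the function names, the random initialization and the toy matrix are all assumptions made to keep the example self-contained.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Squared reconstruction error ||X - WH||^2."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def nmf(x, rank, iters=200, seed=0, eps=1e-9):
    """Factorize X (docs x terms) into W (docs x rank) and H (rank x terms)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        # H <- H * (W^T X) / (W^T W H), element-wise
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        # W <- W * (X H^T) / (W H H^T), element-wise
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical document-term matrix with two groups of documents.
x = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
w, h = nmf(x, rank=2)
# A document can be assigned to the factor with the largest weight in its row of W.
labels = [row.index(max(row)) for row in w]
```

The factors stay non-negative because each update only multiplies by non-negative ratios. Nothing in these updates keeps documents that were neighbors in the original space close in the reduced space, which is the gap that the geometric constraints discussed above aim to fill.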

Recent Developments in Clustering

In high-dimensional data such as text data, the hubness property - where some

data points occur more frequently in k-nearest neighbor lists than others - has

been used to find similarity and to determine the cluster label of a point [176, 177].

Documents are compared to hubs instead of cluster centers. This overcomes the

difficulty of distinguishing distances between data points as faced in partitional

document clustering [78]. It avoids the tendency of distances between all pairs

of points in high-dimensional document clustering to become almost equal when

using a centroid approach [78]. Researchers have attempted to improve the hub-

based clustering by changing the hub selection approach such as weighted relative

hubness or Silhouette information based hubness [70]. However, the conventional

hub calculation and hub similarity calculation are not extensible to larger docu-

ment collections as the similarity between top hub points in all clusters should be

calculated to determine the correct clusters [78]. The high computational complexity

of this concept is a bottleneck in text clustering. This creates the need for

efficient ways to calculate the hubs and hub similarity.

In parallel, proven methods of IR ranking-based query-document matching have

recently been used in document clustering to achieve efficiency in finding similar


documents [30, 173]. The main computational bottleneck in k-means is the need

to recompute the nearest centroid for every data point at every iteration [30].

IR ranking is used in this similarity calculation to reduce the cost by using the

centroid document as a query to choose documents through a responses list [30].

Inversely, the document that needs to be assigned is used as a query to select

the relevant clusters [172] as in Fig. 2.7 (a). It improves the clustering perfor-

mance by comparing the documents with the most relevant cluster centers. These

most relevant clusters are generated using IR responses and reduce the need for

calculating all the pairwise distances. In [173], hub points that are frequent

nearest neighbors were generated using IR ranking responses. This method creates

hubs as dynamic cluster representations, called Loci, for a target document using

ranking results and uses them in assigning a data point to a cluster considering the

closest Loci, as given in Fig. 2.7 (b). However, this is a semi-supervised approach

where the initial cluster label assigned to the documents is used to guide the

cluster assignment through hubs [173]. All the existing ranking-based clustering

works explore the applicability of ranking-based document similarity in partitional

document clustering, and there is a lack of research that investigates its

applicability to other approaches such as density and matrix factorization.

Assistance-driven matrix factorization [130, 168], which aims to identify

text similarity effectively, is another recent development in text clustering. A set of methods

has been proposed to assist the factorization of the VSM of documents with additional

information to enhance the clustering decision. In [168], NMF-based document

clustering is assisted by the semantic information given by the term adjacency

within the corpus. Another set of methods uses manifold learning in assisting NMF

[31, 130, 198]. The inclusion of neighborhood information that highlights geo-

metric structures among documents improves the accuracy of lower-dimensional

approximation [130]. Besides that, co-clustering is another branch of methods

that assists matrix-based document clustering by a two-way process [66]. Co-clustering

simultaneously clusters documents and words to improve the clustering

solution, where word clustering induces document clustering, while document

clustering induces word clustering [44]. These interesting extensions of assisting

NMF to minimize information loss need to be thoroughly explored for finding

ways to generate neighborhood information effectively and model them in the

factorization process.

Figure 2.7: The use of ranking for clustering [174]

In addition, different semi-supervised approaches have been used to improve text

clustering methods [173, 201]. In density-based methods, active learning ap-

proaches by enforcing different levels of constraints have been used in document

clustering with DBSCAN [201].


Short Text Clustering

Microblogging services are popular social networking platforms, where people en-

gage with others with text communications. Similarly, online forums assemble

views written by the participants. They use only short text for communication

[202]. These text sources introduce a distinct type of text that creates

additional problems in text mining for identifying the similarity between data

pairs. The short length in those posts leads to extremely sparse text vectors

compared to general text [79]. Moreover, the nature and vocabulary of short text

in social media is drastically different from the usual text [79]. This creates a

requirement for text mining methods to do additional text pre-processing to han-

dle unstructured phrases and abundant information attached with a short text.

Besides, social media contains a larger number of sub-groups compared to the

usual cases [174]. This is evident in social media text analytics and creates the

need for fine-grained text mining solutions [174].

Community detection with social media text, for identifying users with common

interests based on what they communicate, is challenged by extreme sparseness

of short text [144, 147]. Existing community detection methods that rely on tra-

ditional distance-based clustering [147] face distance concentration due to sparse-

ness, whereas probabilistic approximation methods [144] face information loss

due to higher-to-lower approximation. Similarly, text mining for understanding

online discussion forums has to deal with the short nature of text [13]. Some of

these analysis methods use supervised approaches that depend on ground-truths

in online forum data to handle short text [124]. In [124], a classification model is

used for discovering genres in a Learning Management System to automatically

code posts. This method used a supervised approach to classify the forum thread

to a code that was manually mapped. The unsupervised text mining approach

[120], for grouping the forum text into various clusters, followed a centroid-based


approach similar to k-means. This approach faces distance concentration due

to sparseness in text. However, the unavailability of ground-truths in online fo-

rum data creates the demand for unsupervised methods that can overcome the

sparseness in the text representation.

Document expansion has been proposed as an effective way to solve the spar-

sity issue of feature vectors by expanding short texts [17, 80, 93, 95, 202].

Many researchers have used external knowledge sources for document expan-

sion [17, 80, 95]. Short texts are expanded to long texts by using Wikipedia

[17], WordNet [80], Web search results [162] and other user-constructed ontologies

[95]. Short text expansion is also done with pre-trained word vectors such

as Word2Vec [137], which uses local context windows, or GloVe [148], which combines

global word-word co-occurrence counts and local context windows. Based

on these word embeddings that learn semantic representations for words from a

large corpus, short texts are aggregated into long pseudo-texts [152, 185]. However,

because social media texts have unstructured text patterns, enriching them with

these static external sources provides inadequate or inaccurate information due to

semantic incoherence and leads to incomplete enrichment.

Alternatively, self corpus-based expansion is proposed as a more effective and

semantically aligned method to handle short text [93, 202]. Some approaches

identify concepts in the collections for augmentation using methods similar to

k-means [93] while others [202] identify topics in the collection considering the

term frequency probabilities. However, centroid-based or probability-based calculations

have shown inferior outcomes due to sparseness in high-dimensional text

data. In summary, all these document expansion methods for short text face

challenges in dealing with high-dimensions. With effective expansion methods,

the status of social media applications relying on sub-grouping documents based

on text similarity can be improved.


An alternative approach to handle the extreme sparseness in short text represen-

tation is to utilize effective representation learning for the short text clustering

with the emerging field of deep learning [188, 189, 190, 194]. This family of meth-

ods uses deep neural networks to automatically learn the representations needed

for discovering subgroups from the original high dimensional sparse text. The use

of deep learning is a promising solution to learn non-linear mappings for feature

selection and present the reduced feature space [6]. It allows embedding text rep-

resentation into a more semantically coherent representation. This is especially

used in short text clustering to address the extreme sparseness [188, 189, 190, 194]

where a deep neural network is used to have deep feature representation from a

raw text representation. In [189, 190], pre-trained word vectors are fed into a

convolutional neural network [114] to learn deep feature representation, which is an

expensive process. Similarly, deep learning has been used for feature selection by

learning statistical dependencies between features [194]. However, it depends on

external semantic dictionaries to identify initial relationships of words [194]. All

these methods are similar to supervised learning and controlled by the guidance

given with external sources. Moreover, these methods apply standard semantic

dictionaries to the short text, hence face semantic incoherence.

In [188], microblog-specific semantic knowledge is utilized to expand the short

text based on the cosine similarity of terms with the aim of avoiding this issue.

However, this pre-processing step is computationally expensive. In addition, it

uses hash-tagging and retweeting as must-link ground-truth information for

minimizing the reconstruction error. The knowledge derived from

hash-tagging and retweeting is not pure, as they are overused on the Twitter

platform, forming heterogeneous relationships.

Recently, the Generative Adversarial Network (GAN), a type of deep learning

architecture, has been successfully used in short text mining using pre-training data

[100, 182, 184]. A GAN model includes two networks, where a generative network

is used to generate candidates and a discriminative network is used to evaluate their

validity [121, 128]. The GAN-based methods can work in an unsupervised setting

without relying on ground-truth labels of the data [100, 182, 184]. However, they

require a known dataset used as the initial training data for the discriminator

network [121], hence their accuracy highly depends on this pre-training phase.

The candidates generated by using the pre-training data should be closely re-

lated with original data as this data needs to be synthesized by the generator

network to be correctly evaluated by the discriminator network. Therefore, the

data used for pre-training the discriminator network needs to be closely related

to the underlying problem domain to maintain the semantic coherence.

Finally, these methods apply a clustering method on the learned enriched fea-

tures to obtain the clusters. Applying these methods to a real-world context is

challenging due to their supervised or semi-supervised nature, as it is difficult to

find semantically coherent datasets for short text learning.

Summary: Text Clustering

The identification of a set of sub-groups in a document collection has to deal with

the challenges generated by the sparse text representation in identifying similarity

between text pairs. Specifically, traditional text clustering methods face prob-

lems with sparse and high-dimensional vector representation while calculating

similarity using density distribution, distance measures or lower-dimensional ap-

proximation. In addition, existing methods are challenged when the collection size

is large. For instance, cosine angle-based pairwise similarity or shared-neighbors

based mutual similarity, as well as the recently introduced hub-based clustering,

become stagnant due to the computational complexity required for large docu-

ment collections. Recently, researchers have proposed the concept of ranking to


improve centroid-based text clustering. This handful of ranking-based methods

is limited and has only been explored using a semi-supervised approach

and/or a partitional method to identify the clusters. These methods show poten-

tial and the use of ranking concept in clustering deserves more attention. It will

be interesting to investigate different ways to use the ranking concept (i.e. the

response documents as well as the ranking scores given for documents) in other

clustering methods as an alternative text similarity calculation technique. The

possibility to use this concept in accurate and efficient mutual neighbor iden-

tification for density estimation, as well as in hub identification, is promising

and fruitful. Furthermore, the possibility of using a ranking-based neighborhood

concept to assist matrix factorization-based clustering also requires attention.

Short text clustering is another important problem in text mining. In addi-

tion to sparseness generated by high dimensional vector representation, short

text faces extreme sparseness due to the short length that challenges identifying

the similarity between text pairs. Different document expansion approaches and

dimensionality reduction approaches have been proposed to address this prob-

lem. Most of them depend on external information. External source-based docu-

ment expansion results in semantic incoherence while deep learning methods learn

lower-dimensional features through external sources with a supervised or semi-

supervised approach. Though there are a few self-corpus based strategies that

deal with probability approximation or distance concepts, no existing work inves-

tigates the applicability of other methods, such as matrix factorization, that have

commonly been used to project the high-dimensional data in a low-dimensional

search space to address this problem.


2.3.2 Text Outlier Detection

This section first introduces the general outlier detection problem, covering meth-

ods that deal with few dimensions as well as high dimensions. Secondly, this sec-

tion specifically focuses on the text outlier detection problem and applied meth-

ods. Finally, this section presents the evaluation measures used for assessing the

outlier detection methods and the associated problems.

General Outlier Detection Methods

A myriad of outlier detection methods exist for traditional structured data [1].

Table 2.2 lists the major categories these methods fall into. The majority of

these works identify outliers by separating the deviations based on the Hawkins

definition [69] given below.

Definition 2.1: An outlier is an observation which has deviated so much from

the other observations that it has aroused suspicions that it was generated by a

different mechanism.

Outlier detection broadly follows two approaches: (1) supervised learning, when

training data with labels of normal and abnormal data is provided [105]; and

(2) unsupervised learning, when labeled data is not available, which is

common in real-world scenarios [68]. Unsupervised approaches based on traditional

methods such as distribution-based, distance-based, density-based, and cluster-

based have been used commonly to identify outliers due to unavailability of labels

[29, 68]. These methods, as listed in Table 2.2, which deal with numerical and

few-dimensional data, face several challenges when applied to the high dimen-

sional and sparse text in identifying dissimilarities between data pairs.


Table 2.2: Summary of the major outlier detection methods

Category                          Bottleneck                  Data Domain
Distribution-based [89]           Pre-assumptions             Few dimensional
Distance-based [68]               Distance concentration      Few dimensional
Density-based [29]                Sparseness,                 Few dimensional
                                  Distance concentration
Cluster-based [51, 71]            Sparseness                  Few dimensional
Graph-based [68, 84]              Distance concentration,     Few dimensional
                                  Computational Complexity
Angle-based [109]                 Computational Complexity    High dimensional
Subspace-based [2, 108]           Computational Complexity    High dimensional
Projection-based [11, 96, 126]    Information Loss            High dimensional
k-occurrence-based [58, 155]      Computational Complexity    High dimensional

Distribution-based and Distance-based Methods: Distribution-based

methods identify outliers as observations that do not fit a normal model. These

methods depend on the assumed data distribution and on learning the normal

model when identifying deviations [89]. They are known not to be scalable to

high-dimensional data [16]. Distance-based methods define data points as outliers

if they are far from many other points in the dataset considering a minimum

distance threshold [68]. However, use of this approach in high-dimensional data

is challenged by the distance concentration [126]. Further, the computational

complexity of these methods for larger datasets makes them less effective for big

datasets such as digitized text corpora with many documents or web repositories.
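The distance-based notion described above can be illustrated with a minimal sketch. It assumes Euclidean distance on toy 2-D points; the function name and the parameters `min_dist` and `min_fraction` are hypothetical and not taken from [68]. The nested loop over all pairs also shows where the quadratic cost for large collections comes from.

```python
import math

def distance_based_outliers(points, min_dist, min_fraction):
    """Flag points for which at least `min_fraction` of the other points
    lie farther away than `min_dist` (Euclidean distance)."""
    flags = []
    for i, p in enumerate(points):
        # Pairwise comparison against every other point: O(n^2) overall.
        far = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) > min_dist
        )
        flags.append(far / (len(points) - 1) >= min_fraction)
    return flags

# Toy data (illustrative only): four points near the origin, one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
flags = distance_based_outliers(points, min_dist=1.0, min_fraction=0.9)
```

On this toy input only the isolated point is flagged. In high-dimensional text vectors the distances tend to concentrate, so choosing a threshold such as `min_dist` becomes much harder than in this example.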

Density-based Methods: These methods define outliers considering density

distribution in a dataset. The well-known Local Outlier Factor (LOF) method

identifies outliers using the relative density of a point, which is measured by

comparing its neighbors’ density with its own density as a ratio [29]. A point is labeled

an outlier if the density around its k nearest neighbors (k-NN) is high with respect to the

density around the point itself (i.e., a point with a high LOF value). However, this density notion is

challenged by the “sparseness” in high dimensional data. Furthermore, this tech-


nique also depends on distance calculation for the k-NN identification and faces

the problem of distance concentration.
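The density ratio behind LOF can be sketched as follows. This is a simplified illustration, assuming Euclidean distance, distinct points, and exactly k neighbors per point (ties at the k-th distance are ignored); the function names are hypothetical and the code is not the reference implementation of [29].

```python
import math

def knn(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbors of point i."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]

def local_outlier_factors(points, k):
    n = len(points)
    neighbors = [knn(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neighbors]  # distance to the k-th neighbor

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(i, j) = max(k-distance(j), d(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

# Toy data (illustrative only): a unit square plus one distant point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
lof = local_outlier_factors(points, k=2)
```

Points inside the square obtain a LOF value close to 1, while the distant point obtains a much larger value. Both the neighbor search and the density estimate rely on distances, which is why the method inherits the distance concentration problem noted above.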

There are a few cluster-based methods that extend the “density” concept to iden-

tify dense clusters to filter the outliers [51, 71]. For example, clustering methods

such as DBSCAN [57] and OPTICS [9] are well-known for naturally detecting

outliers in spatial data that fall into sparse regions. The majority of these meth-

ods are parameter dependent. In high-dimensional sparse text data, adoption

of density-based methods for outlier detection is difficult as distinguishing high-

density regions from the low-density regions is complicated due to fewer term

co-occurrences.

Graph-based Methods: Researchers have proposed solutions based on

nearest-neighbor graphs for outlier detection. Nearest-neighbor is an important

concept used in identifying similarity/dissimilarity among observations [68, 84].

A set of methods uses nearest-neighbor graphs to determine the outliers via

in-degree numbers [68]. The exclusion from mutual proximity, derived from the

nearest neighbors, has also been used to calculate outlier scores [84].

Nevertheless, the nearest-neighbor calculation is not known to be a scalable solution for

larger document collections and higher-dimensional data due to the problem of

distance concentration [164].

Angle-based Methods: These outlier detection methods are introduced as

a successful remedy to the distance concentration problem in high dimensional

data [109]. High-dimensional data is usually represented in the form of vectors,

where the closeness of points can be effectively captured with the angle between

vectors. This is well suited to the text domain, where a document is represented

with its feature vector, and cosine similarity can be used in comparing document


similarity [186]. However, the high computational complexity for larger datasets,

due to the larger number of pairwise comparisons, makes these methods less

effective.

Subspace-based Methods: In contrast to these traditional methods,

subspace-based methods naturally identify outliers in high-dimensional data.

These methods identify a subset of dimensions with rarely occurring patterns using

brute-force searching to obtain outlier candidates [2]. This leads to high

computational complexity; moreover, the local patterns identified in subspaces as outliers

may not be outliers in the full feature space [108].

Projection-based Methods: Lower-dimensional projection is an alternative

approach that is specifically introduced as a remedy to distance concentration in

high-dimensional data. The degree of deviation of each observation from the original

point after projecting it to the lower-dimensional space is measured to identify the

outliers [126]. However, the information loss in this approach, when projecting

data from higher to lower dimension, is inevitable.

K-occurrences-based Methods: The hub concept in higher dimensions used

in clustering has been used inversely in anomaly detection. Researchers have used

the reverse neighbor count or the k-occurrences count to determine outliers that

are away from hub points [155]. In [51], connections are made considering

reverse k nearest neighbors, and nodes with a lower in-degree are identified

as possible outliers. Similar work has proposed using “anti-hubs” found

in sparse high-dimensional data as possible outlier candidates [58]. Although

the frequent nearest neighbor-based hub concept is successful in handling the higher

dimensions, scalability of these methods for larger datasets is questionable due to

computational complexity with pairwise comparisons. All these high-dimensional


outlier detection methods are summarized in the latter part of Table 2.2.
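The k-occurrence idea can be illustrated with a short sketch that counts how often each point appears in the k-NN lists of the other points. It assumes Euclidean distance on toy data, and the function name is hypothetical; the cited methods [58, 155] define their own scoring on top of this count.

```python
import math
from collections import Counter

def k_occurrences(points, k):
    """Count how often each point appears in other points' k-NN lists
    (its reverse neighbor count)."""
    counts = Counter()
    for i, p in enumerate(points):
        nearest = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        counts.update(nearest)
    return [counts[i] for i in range(len(points))]

# Toy data (illustrative only): a unit square plus one distant point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
occ = k_occurrences(points, k=2)
anti_hub = occ.index(min(occ))  # the point that is least often a neighbor
```

The distant point is never selected as a nearest neighbor, so its count is zero and it is the anti-hub candidate. Building every k-NN list requires comparing all pairs of points, which is the source of the scalability concern raised above.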

Text Outlier Detection Methods

Text data is a special variation of high-dimensional data. There are limited

studies specifically focused on the text domain to identify documents that deviate

from the common theme [4, 85, 96]. Text outliers need to be identified using

(dis)similarities between text pairs. In [85], an outlier text on the web is defined

as follows:

Definition 2.2: Given a set of web texts $T_i$ ($i = 1, 2, \ldots, n$) on a topic

$M$, let $W_{ij}$ ($j = 1, 2, \ldots, m$) be the top $m$ keywords on topic $M$ in $T_i$, with

$T_i = (W_{i1}, W_{i2}, \ldots, W_{im})$. If the relative weight of text $T_i$ is greater/smaller than the

ones that are similar to other texts, then the Web text $T_i$ constitutes a Web text

outlier.

This outlier definition depends on a specific topic or class in the collection to filter

the outliers based on the deviation from it. However, text outliers should be detectable

in a natural setting, i.e., where a corpus contains multiple related groups/topics

and each group contains inliers. Let a set of classes 1 to c represent the inlier

groups. An outlier should be identifiable by recognizing its dissimilarity to all of

the multiple related inlier classes (1 to c) in the collection, with which it shares a lesser

number of terms.

A common process of identifying an outlier is to compare documents within the

collection and determine dissimilarity to decide an outlier score. Cosine similar-

ity is the standard text similarity measurement used to calculate the similarity

between a document pair. This is used in an inverse manner to identify the dis-

similarity (i.e., 1 − similarity) [85]. However, these pairwise comparisons are


expensive for a large text collection. In [4], n-grams (which is an efficient way

to determine the similarity between different related words in text processing)

are used in outlier detection. This is based on the hypothesis that documents

on the same topic should have similar n-gram frequency distribution [4]. The

n-gram frequency distribution for each document is generated and dissimilarities

are computed as the angle between the document vectors. However, this process

leads to high computational complexity.
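The inverse use of cosine similarity can be sketched as below. The sketch assumes raw term-frequency vectors built by whitespace tokenization and uses the mean dissimilarity to all other documents as the outlier score; both choices, as well as the function names, are illustrative assumptions rather than the scoring used in [85] or [4].

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def outlier_scores(docs):
    """Score each document by its mean dissimilarity (1 - cosine) to the rest."""
    vectors = [Counter(d.lower().split()) for d in docs]
    scores = []
    for i, v in enumerate(vectors):
        others = [
            1.0 - cosine_similarity(v, w)
            for j, w in enumerate(vectors) if j != i
        ]
        scores.append(sum(others) / len(others))
    return scores

# Toy collection (illustrative only): three related documents and one unrelated.
docs = [
    "text clustering groups similar documents",
    "document clustering groups similar text",
    "similar text documents form clustering groups",
    "the football match ended in a draw",
]
scores = outlier_scores(docs)
```

The unrelated document shares no terms with the others and therefore receives the highest score. Each score needs a comparison with every other document, so the cost grows quadratically with the collection size.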

A recent study on text outlier detection has been proposed using Non-negative

Matrix Factorization (NMF) to measure deviation [96]. NMF is assumed to

preserve semantic structure within lower dimensions while decomposing the original

document-term matrix into two matrices, a document-cluster matrix and a

term-cluster matrix [3]. The learning error calculated with sum-of-squares differences

to the original matrix is used to identify the outliers in [96]. However, a dataset

with a larger number of clusters/groups misleads this reconstruction process.

NMF is designed to approximate the high-dimensional original data to a lower rank r,

where r is the number of natural groups in the collection [47]. When the rank r is

high, it is not easy to differentiate between the groups, and NMF produces higher

reconstruction errors for many data points other than the outliers. Therefore, this method

becomes ineffective for datasets with fine-grained clusters. Applicability of this

method for accurate and scalable outlier detection in Web content, which

often contains a large number of document categories, is questionable. This

results in the requirement for exploring scalable methods that will efficiently and

accurately identify outliers in higher dimensional text document collections.
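The reconstruction-error idea described above can be written compactly. The following is a generic sketch of NMF-based scoring and not necessarily the exact objective of [96]; the symbols $X$ (document-term matrix with $n$ documents and $m$ terms), $W$ (document-cluster matrix), $H$ (term-cluster matrix), and $e_i$ (error of document $i$) are introduced here for illustration.

```latex
X \approx W H^{\top}, \qquad
X \in \mathbb{R}_{\ge 0}^{n \times m},\;
W \in \mathbb{R}_{\ge 0}^{n \times r},\;
H \in \mathbb{R}_{\ge 0}^{m \times r}

e_i = \sum_{j=1}^{m} \bigl( X_{ij} - (W H^{\top})_{ij} \bigr)^{2}
```

Under this sketch, documents with a large $e_i$ are poorly explained by the $r$ learned groups and become outlier candidates. When $r$ is large, many inliers may also have a large $e_i$, which is the limitation discussed above.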

All these above-mentioned methods rank documents to identify the outliers. For

example, a distance-based method ranks the documents according to their degree of

deviation from neighbors and assigns highly deviated observations a higher

score [126]. Similarly, a graph-based method ranks documents based on the in-degree

of document nodes, and documents with a lower in-degree get a higher rank with

the possibility of being outliers [51]. The reverse neighbor count has been used

to rank outliers of high dimensional data [155]. In documents, term weights have

been used to rank the importance of terms in the documents considering their

appearance in the document and collection [133]. Though it is intuitive to use

this term weighting to rank the documents for consideration of outliers, there

exists no work showing this. IR systems also have been known for ranking the

documents in the collections with respect to the posed query document consid-

ering term weights to obtain ranked results [59]. The possibility of using this IR

ranking to identify the outliers in document collections is a promising direction

that is yet to be explored for text outlier detection.

Evaluation Measures

Evaluating the performance of outlier detection methods is an important problem.

Accuracy (ACC) [84] is the most popular measurement used in many text mining

methods to assess effectiveness. It considers the total correct predictions against

the total observations in the context. This measurement completely disregards

the incorrect predictions [84]. For instance, an outlier detection method that

detects all observations as inliers will yield high accuracy due to class skewness.

However, such a method also shows very high false inlier prediction. Alternatively, the area

under the Receiver Operating Characteristic (ROC) curve is used with predictive

methods as well as outlier detection methods [1, 155] to overcome this issue. The ROC

curve shows the True Positive Rate (TPR) against the False Positive Rate (FPR),

where P and N denote outliers and inliers, respectively.

$$\mathrm{TPR} = \frac{TP}{TP + FN} = \frac{TP}{P} \tag{2.8}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} = \frac{FP}{N} \tag{2.9}$$

The area covered by the ROC curve at an optimum threshold indicates how

much a model is capable of distinguishing between inliers and outliers. However,

a detailed analysis in [32] showed that the Area Under the Curve (AUC) also

reflects a performance bias toward true predictions at the optimal threshold. Assessing

an outlier detection method requires investigation of false predictions (i.e., both

outliers and inliers). FPR and False Negative Rate (FNR) are directly used by

some researchers to report false predictions [65, 101]. FPR and FNR measure the

effectiveness of outlier detection with the error in predicting inliers and outliers.

$$\mathrm{FNR} = \frac{FN}{TP + FN} = \frac{FN}{P} \tag{2.10}$$

FPR denotes inliers detected as outliers against the total inliers, ranging the

values from 0 to 1 as in Eq. 2.9. Similarly, FNR denotes outliers detected

as inliers against the total outliers, ranging the values from 0 to 1 as in Eq.

2.10. Though FPR and FNR measures indicate the poor performance of outlier

detection methods with higher values recognizing incorrect inliers and outliers

respectively, they are not able to clearly categorize effectiveness of the methods

based on their capacity for false (inlier/outlier) detection. The capability of

outlier detection methods in terms of false detection requires a differentiable

measure to recognize their direct applicability.
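The effect of class skewness on these measures can be checked with a small worked example. The sketch below computes ACC together with TPR, FPR, and FNR as defined in Eqs. 2.8 to 2.10, treating outliers as the positive class; the function name and the 95/5 class split are illustrative assumptions.

```python
def outlier_rates(y_true, y_pred):
    """Return ACC, TPR, FPR and FNR, treating outliers (label 1) as positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    positives, negatives = tp + fn, fp + tn  # P and N in Eqs. 2.8-2.10
    return {
        "ACC": (tp + tn) / len(pairs),
        "TPR": tp / positives,
        "FPR": fp / negatives,
        "FNR": fn / positives,
    }

# 95 inliers (0) and 5 outliers (1); the detector labels everything as an inlier.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
rates = outlier_rates(y_true, y_pred)
```

Here the accuracy is 0.95 although no outlier is found: TPR is 0 and FNR is 1, while FPR is 0. This illustrates why accuracy alone is misleading for outlier detection and why the false prediction rates need to be reported.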

Summary: Text Outlier Detection

An outlier in a document collection is a deviated document in the collection

compared to others. A myriad of methods exist that calculate


(dis)similarity between text pairs to determine this deviation. Traditional out-

lier detection methods, which deal with few dimensions, are challenged by the

distance concentration, sparseness and approximation errors in identifying text

outliers. Subspace-based analysis is also not guaranteed to find an

optimal solution for high-dimensional outlier detection. In addition to the high

dimensionality, the larger size of the document collections also creates issues for

angle-based, nearest neighbor-based and anti-hub-based methods due to high

computational complexity in similarity calculation.

There is a lack of work focusing on text outlier detection and no formal definition

considering the multiple groups of inliers in document collections, specialized

methods and meaningful evaluation measures. When having multiple groups

within inlier documents, it is hard to identify outliers that are dissimilar to those

inlier classes and share a lesser number of common terms with them. The

possibility of using ranking concepts to improve the accuracy and efficiency of text

outlier identification shows high potential and needs to be studied in depth. It

is interesting to investigate the possibility of using term weighting-based ranking

as well as IR system-based ranking responses and ranking scores in defining out-

lier scores. Moreover, there is a demand for clear evaluation measures to indicate

the error in detecting outliers as well as inliers, to select effective outlier detection

methods.

2.3.3 Text Cluster Evolution

The topics, associated terminologies or concepts in text repositories change over

time as well as across the domains and show a varying trend. Researchers have

explored these dynamic changes to text for finding decaying, current and emerging

topics, events, communities or concepts [35, 41, 63, 73, 82, 119, 180]. However,

research in this area is in its infancy. Existing methods have to deal with the common

problem of high-dimensional and sparse vector representation in the text data for

identifying the similarity between (text) cluster pairs [8]. One set of methods

uses the naive approach of term-based similarity [63], while some other methods

use probabilistic [180] or factorization approaches [98].

The similarity between clusters is determined by the term intersections using

Jaccard coefficient to define the persistence, merging and splitting of clusters

[63]. However, Jaccard similarity is not efficient in comparing similarity of text

as it only considers the common terms in sparse data [129]. The probabilistic topic

modeling is used to track the topic occurrences over time [180], while NMF is used

to identify a set of steady topics through minimizing learning error [98]. However,

information loss is inevitable with any lower-dimensional approximation [8]. This

emphasizes a need to explore effective methods in handling the high-dimensional

sparse text representation to identify dynamic changes in clusters. Specifically,

how to use dimensionality reduction methods to handle higher dimensions in

text cluster representation, and which techniques effectively compensate for the

associated information loss, need to be studied.
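The term-intersection comparison mentioned above can be made concrete with the Jaccard coefficient between the term sets of two clusters. The sketch below is a minimal illustration; the function name and the example terms are hypothetical and not taken from [63].

```python
def jaccard(terms_a, terms_b):
    """Jaccard coefficient between two sets of cluster terms."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Term sets of one cluster at two consecutive timestamps (illustrative only).
cluster_t1 = {"election", "vote", "poll", "candidate"}
cluster_t2 = {"election", "vote", "result", "winner"}
similarity = jaccard(cluster_t1, cluster_t2)  # 2 shared terms out of 6 distinct
```

The coefficient depends only on which terms are shared and ignores how strongly each term is weighted in a cluster, which is one reason it is a coarse measure for sparse text.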

Tracking evolution across different domains or time has been widely used

in social network analysis [41, 102, 115, 123]. There are two main models

used in the community evolution of social networks, namely the snapshot model [115]

and the temporal smoothness model [40]. The snapshot model keeps track of a

fixed number of communities [123] or focuses only on a pre-determined community

structure [115] over time. In contrast, the temporal smoothness model analyzes

a continuous stream of changes to the considered networks to derive communities

over time [40]. However, all these methods that deal with community evolution

consider user clusters that are identified based on the network structure analysis

[102, 115, 123]. None of these works deal with the problem associated with text


analysis.

Methods that explore the changes in text structure to characterize the evolution-

ary events, concepts or terminologies have been developed with one of the three

major objectives: (1) cluster evolution [63, 73]; (2) topic evolution [35, 41, 180];

or (3) event evolution [82, 119]. In comparison to cluster evolution, topic

evolution is done in a much smaller data space, where the number of extracted topics is

much smaller than the size of the entire document collection. Similarly, the vocabulary

associated with topics (i.e., terms highly probable to be in the topics) in a collection is much

smaller than the complete vocabulary of the collection. Event evolution

detection also considers a set of selected events, targeting a much smaller data scope

compared to the original data space. The summary of these existing methods is

given in Table 2.3.

Table 2.3: Categories in dynamic text evolution

Category           Approach                                 Bottleneck
Cluster evolution  Citation network analysis [73]           Consider only the local relations
                   Term intersection analysis [63]
Topic evolution    LDA-based approaches [49, 180]           Unable to identify the complex cluster dynamics
                   NMF-based approaches [98]
                   Graph-based approaches [41]              Study a fixed set of terms and neglect new formations
Event evolution    Text similarity-based approaches [119]   Limited only to novel event identification
                   Topic Modeling-based approaches [195]    Study a fixed set of events and neglect new formations

Cluster Evolution: With the aim of identifying cluster dynamics, a simple

approach of analyzing the citation network is used on publication data [73]. A

network of bibliographic coupling is generated using direct and co-citation analysis

to identify the current trends and emerging concepts. In another study, TextLuas [63], a

software tool, is developed to model each cluster solution with the respective terms at

each timestamp. It considers the similarity between consecutive

clusters with the Jaccard coefficient. However, these methods do not consider

concept shift with the full document-term space over the entire period in

determining similarity and/or are limited to local relations between two consecutive

timestamps in defining evolution.

Topic Evolution: Topic evolution analysis has been used to identify the

content shift through a discovered subset of topics. The majority of these methods

rely on generative probabilistic approaches [49, 180]. The LDA-related approach

used in [180] only identifies the topic occurrences in different time dimensions

with the calculated respective probabilities. This is not capable of identifying

topic evolution with splits and merges. Another probabilistic approach used in

[49] determines topic cluster evolution based on the changes to term probability

within topics. This study considers a fixed vocabulary limiting the set of terms

to appear in the topics and is not able to track the new formation of topics.

Another topic evolution tracking method uses NMF to identify a set of steady

topics through minimizing learning error [98]. Although it identifies the emerging

topics over time, it is not able to detect complex topic structure changes such

as diminishing or growing. In [41], a graph-theoretic approach is used to track

persistent and diminishing topics using the term frequency for each topic cluster

solution. This method is unable to identify the complex dynamics of topics such

as merge and split to cover the complete cluster lifecycle.

Event Evolution: Event evolution is usually studied in social media to keep

track of event clusters that appear over time and to identify novel events or shifts

that deviate from the existing event clusters [119, 195]. A novelty score is

assigned to each event cluster for identifying new events considering the (cosine)


text-similarity in [119]. Topic modeling has been used to identify events across

time [195]. Though it identifies the emerging events through deviations from

previously existing events, it fails to identify the complex dynamics such as growth

and decay due to its assumption of a fixed set of events within a dataset.

There exists no work that focuses on identifying the full cluster life-cycle in the

original data. They are all restricted to a subset of the data due to various

limitations.

Summary: Text Cluster Evolution

Existing text evolution tracking methods neither consider full data space nor all

the dynamic changes over the time/domain in evolving patterns identification.

The topic or event evolution methods are limited to a smaller selected set of

data. There are some methods that consider full data space in evolution tracking;

however, these methods are limited to consecutive time stamps only in identifying

cluster similarity. An effective cluster evolution method that considers all the

changes over the considered time period/domains is needed to identify the full

cluster life-cycle with persistence, emergence, growth, and decay patterns. It

is important to accurately handle the high-dimensional nature of the text in

identifying the similarity relationship between clusters. Consequently, existing

naive comparisons between clusters or probabilistic methods face difficulties in

accurately identifying the full cluster life-cycle. This creates the requirement

to investigate a novel text mining method to identify the cluster dynamics in

an unsupervised setting.

2.4 Research Gaps

This chapter has reviewed the literature relevant to clustering, outlier detection

and evolution methods focusing on text data and the challenges they face in

identifying text similarity. The following research gaps have been identified.

2.4.1 Text Clustering

As summarized in Section 2.3.1, the high-dimensional nature of text representation

and the associated sparseness challenge the process of identifying text similarity

in existing text clustering methods to find the clusters in a document collection.

In high dimensional data, data points are known to be closer to frequent near-

est neighbours (i.e., hubs) than cluster mean [176]. A handful of methods have

started to use emerging concepts such as ranking, hubs in higher dimensionality,

and neighborhood for effective text similarity identification in clustering. This

thesis identifies the following research gaps in those methods and aims to explore

these promising concepts in detail.

• Though the IR concepts such as indexing and ranking have been used with

partitional clustering for identifying similar text, they have yet to be ex-

ploited in density-based clustering, which is known to identify diverse shapes

of clusters without taking the number of clusters as an input. The success

of IR ranking concepts in identifying nearest neighbors motivates them to

be used in identifying density differences in a document collection.

• The mutual neighbor identification and Shared Nearest Neighbor graph

construction in forming dense representation for sparse text have been found

useful but computationally expensive. Utilizing IR ranking to make these

concepts efficient is a promising direction that needs to be exploited.

• The closeness to hubs is determined by pairwise neighborhood calculations.

This makes the existing clustering algorithms unscalable to large document

collections. The use of IR ranking score for calculating closeness to hubs is

a potential direction that needs to be studied.

• Assisting matrix factorization-based text clustering through neighborhood

information is an emerging research topic. There is no specific work exploit-

ing accurate ways to use IR ranking results in overcoming information loss

in matrix factorization. The success of IR ranking concepts in identifying

nearest neighbors shows the necessity of using them for assisting matrix

factorization.

• The extreme sparseness in short text is handled by document expansion

or feature learning using external sources; however, these methods result in

semantic incoherence. It will be interesting to study how to utilize dimensionality

reduction methods such as matrix factorization, which are successful

in identifying groups in terms, for self-corpus-based enrichment.

2.4.2 Text Outlier Detection

There exist only a handful of methods specifically designed for text outlier detection.

Similar to text clustering, the high dimensional sparse representation of

text is the major challenge for outlier detection in identifying (dis)similarity between text

pairs, as summarized in Section 2.3.2. The following research gaps are identified

associated with this problem.

• Document collections taken from social media with a large number of groups

show the necessity of considering outliers among multiple inlier groups.

Existing outlier detection methods neither define text outliers in the presence

of multiple inlier groups nor propose solutions that can identify outlier

documents that share fewer terms with inlier groups.

• Inverse document frequency term weighting is successful in identifying

the rareness of terms. The possibility of utilizing this simple concept of

term weighting to rank the documents in a collection and identify devia-

tions/dissimilarities is promising and needs to be investigated.

• There is no specific work exploiting IR ranking results and ranking scores in

text outlier detection. Using IR ranking concepts for mutual

neighborhood identification and anti-hub identification in text outlier

detection is a promising but unexplored direction.

• The evaluation of outlier detection methods is challenging due to the

bias of existing measures towards true predictions. There is a lack of measures

that clearly differentiate the effectiveness of methods based on both inlier

and outlier prediction errors, although both types of error need to be

identified.

2.4.3 Text Cluster Evolution

As summarized in Section 2.3.3, the problem of cluster evolution in text corpora

has not been studied effectively. Due to the challenges faced by high dimensional text in

identifying the similarity between pairs, the majority of existing methods limit

their analysis to a few local patterns while some methods are based only on topics

and events. There exist the following research gaps in this research problem.

• The global evolution of clusters over time/domain must be tracked to

identify trends. There is no specific work exploiting the global cluster

evolution over the time/domain with the objective of identifying all the clus-

ter states such as birth, death, split and merge to track the cluster evolution

patterns based on cluster similarity. “Birth” of clusters denotes an emerg-

ing pattern, “split” identifies a growth pattern, “death” and “merge” reflect

a decay pattern and a consistently appearing cluster across time/domain

signifies a persistence pattern.

• Matrix factorization is a successful solution to identify groups in input data.

None of the existing cluster evolution works explores using matrix factorization

to identify groups of similar clusters for tracking text dynamics.

There is a loss of information in the projection from higher to lower dimensions.

The possibility of using different information and modeling

techniques to minimize this problem needs to be carefully studied.

Chapter 3

Text Clustering

This chapter introduces the primary contribution of the thesis, which is a set

of novel document clustering methods to identify groups of similar documents

in a document collection. Clustering is a popular unsupervised data

mining technique that groups similar documents together based on

term co-occurrences. Generally, document collections that show few word

co-occurrences between documents form sparse data matrices for analysis. The

sparse representation of text data challenges traditional text clustering meth-

ods such as partitional, hierarchical, density-based and dimensionality reduction

approaches to identify the text similarities [3, 8].

Apart from the sparseness that directly affects clustering methods, distance concentration,

in which the distance differences between far and near points become negligible,

is another major issue in high dimensional data clustering [3]. Besides, the majority

of the dimensionality reduction methods that project higher dimensional text

data into a lower dimensional space face information loss [8]. Though researchers

have explored effective partitional clustering methods using IR ranking concepts

and Hub concepts based on frequent nearest neighbors to identify the similarity

between text pairs dealing with these issues [30, 78, 173], there is no prior work

that investigates the applicability of these concepts to density-based methods

or matrix factorization to accurately cluster documents. In recent years, these

methods have shown high potential in text clustering [30, 174].

In addition to the sparseness created by the high dimensional representation and

other related issues faced by general text, short text clustering, which

has become popular with sources such as social media, is challenged by the extreme

sparseness of the data [79]. The extremely small number of word co-occurrences

in short text challenges traditional clustering methods. Although

various enrichment approaches based on sophisticated designs exist to

handle this issue [17, 80, 95], there is no prior work that explores corpus-based

enrichment using matrix factorization to minimize the extreme sparseness.

Figure 3.1: Overview of the Chapter 3 contributions

Fig. 3.1 outlines the high-level overview of the contributions made in this chapter

to effectively identify the text similarity for clustering. This chapter explores the

effectiveness of utilizing the IR ranking concept in the nearest neighbor calcula-

tion and different ways of using this information to calculate similarity between

high-dimensional text pairs for clustering. It introduces a ranking-based, mutual

neighborhood graph for density calculations and the ranking for hub formation to

improve the quality of clustering solutions. Further, this chapter emphasizes the

effectiveness of using nearest neighbor information with proper modelling tech-

niques for assisting matrix factorization to identify the similarity between text

pairs in text clustering methods to avoid information loss. In addition, another

focus of this chapter is to explore the effectiveness of using corpus-based document

enrichment/augmentation for handling extreme sparseness in short text clustering

with topics derived through matrix factorization. It is followed by an application

of the corpus-based document augmentation method for concept-mining in online

forums.

This chapter comprises four papers relating to these contributions.

• Paper 1. Wathsala Anupama Mohotti and Richi Nayak.: An Efficient

Ranking-Centered Density-Based Document Clustering Method. Pacific-

Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp.

439-451. Springer (2018)

• Paper 2. Wathsala Anupama Mohotti and Richi Nayak.: Consensus and

Complementary Non-negative Matrix Factorization for Document Cluster-

ing. Elsevier Knowledge-Based Systems journal (Under Review).

• Paper 3. Wathsala Anupama Mohotti and Richi Nayak.: Corpus-Based

Augmented Media Posts with Density-Based Clustering for Community Detection.

International Conference on Tools with Artificial Intelligence (ICTAI),

pp. 379-386. IEEE (2018)

• Paper 4. Wathsala Anupama Mohotti and Darren Christopher Lukas

and Richi Nayak.: Concept Mining in Online Forums using Self-corpus-

based Augmented Text Clustering. Pacific Rim International Conference

on Artificial Intelligence (PRICAI), pp. 397-402. Springer (2019)

Paper 1 proposes a novel ranking-centered density-based document clustering

method, RDDC. It uses the top-10 ranked documents generated by a search engine,

utilizing its tf*idf ranking function [54] against document-driven queries that

statistically represent documents, to build a graph of shared nearest neighbors

(SNN). The top-10 results have been shown to possess sufficient information richness [199].

A region of the graph is estimated to be high-density if it contains a considerable number

of documents within a boundary given by an edge weight of that number; such regions

form the initial clusters. This threshold (i.e., alpha) is set based on

experiments. The remaining documents are then assigned to clusters by comparing them with

the SNN sets, which are identified as multiple hubs in the graph. The hub similarity calculation

is done using the ranking score. Empirical analysis using several document corpora,

including the popular NewsGroup data and Social Event Detection data, reveals

that RDDC is able to accurately identify clusters based on density differences

in the SNN graph built using the ranking responses and ranking scores. However,

RDDC takes more time because of the SNN graph construction and hub similarity

calculation steps, which will have an adverse effect on extremely large datasets.

These steps lead to an overall time complexity of O(ndkm) for RDDC, where n, d, k and

m are the dimensionality, the size of the collection, the number of nearest

neighbors considered and the number of mutual neighbor sets, respectively.

Paper 2 explores the effectiveness of using nearest neighbor information to com-

pensate for the information loss in lower-dimensional approximation. The consen-

sus and complementary non-negative matrix factorization-based document clus-

tering method, CCNMF, is proposed based on the IR ranking-based document

similarity as well as the pairwise similarity to form adjacency matrices that hold

the geometric information. CCNMF uses the Skip-Gram with Negative Sam-

pling to accurately model the adjacency information by assigning probabilities

based on neighborhood information. It assigns high coefficients to document

pairs that show higher presence with respect to any neighborhood. The hypothe-

sis of this paper is that combining the common and specific information given by

each document affinity matrix, together with the information in document-term

representation, assists the lower-dimensional approximation in NMF with geometric

information. Several experiments have been conducted with multiple datasets,

including well-known public NewsGroup datasets, covering short-to-medium-size

text vectors. Evaluation using extrinsic measures validates that CCNMF

is able to produce accurate clustering solutions compared to state-of-the-art

benchmarking methods.

The method RDDC which is proposed in Paper 1 uses hub-based cluster similar-

ity through SNN sets, in addition to the density concept for identifying cluster

similarity. The comparison between the results of RDDC and CCNMF shows

that RDDC produces better outcomes than CCNMF in fine-grained

clustering scenarios. Therefore, RDDC can produce superior results for Social

Event Detection datasets, which include over a thousand clusters. In general

document clustering, which deals with a small number of clusters, CCNMF, which

factorizes the input matrix into lower-rank matrices, produces better results than

RDDC.

The other part of this chapter relates to Paper 3 and Paper 4, which present a

document expansion-based approach using the self-corpus to address the extremely

few word co-occurrences within short text. The approach proposed in Paper 3

is to project the high-dimensional term space to a low-dimensional space

and infer the topic proportion vectors using the associated semantic structure to

identify virtual terms for enrichment. The most probable virtual terms are selected

based on the mean and standard deviation of the coefficients in the topic×term

matrix in a threshold-independent manner. The experiments done with three Twitter

datasets in Paper 3 show that expanding posts using topic words improves the

word co-occurrences of important terms in short text. Furthermore, they reveal

that the expanded text allows text-based communities in social media to be identified

through density-based clustering. This density-based clustering uses a distance

parameter (i.e., α) to identify neighbor posts based on pairwise distances, which are

measured as the Euclidean distance with tf as the weight, and identifies the core

dense points that can be further expanded to form clusters.
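The selection of virtual terms from the topic×term matrix can be illustrated with a small Python sketch. The text only says that the mean and standard deviation of the coefficients are used; the concrete rule below (keep a term when its coefficient exceeds the row mean plus `n_std` standard deviations) is an assumption made for illustration, and the function name `select_virtual_terms` and the parameter `n_std` are hypothetical.

```python
from statistics import mean, pstdev

def select_virtual_terms(topic_term, vocabulary, n_std=1.0):
    """Pick candidate virtual terms for each topic.

    topic_term: list of rows, one row of term coefficients per topic.
    vocabulary: list of terms, aligned with the columns of topic_term.
    Assumed rule (not stated in the source): a term is kept when its
    coefficient is greater than mean(row) + n_std * std(row).
    """
    selected = []
    for row in topic_term:
        cutoff = mean(row) + n_std * pstdev(row)
        selected.append([vocabulary[i] for i, w in enumerate(row) if w > cutoff])
    return selected

# Toy topic x term matrix with two topics and four terms.
vocab = ["goal", "tax", "vote", "rain"]
matrix = [[0.9, 0.1, 0.1, 0.1],
          [0.0, 0.0, 1.0, 0.0]]
virtual_terms = select_virtual_terms(matrix, vocab)
```

Because the cutoff is derived from each topic's own coefficient distribution, no fixed global threshold has to be supplied, which is one way to read "threshold-independent" here.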

Paper 4 shows another application of this topic word-based document expansion

in the area of online forums for concept mining. It also adds the most probable

terms in topics, which are marked by higher coefficients in the topic×term matrix, as

the coefficients of terms in a topic vector are comparable to the weights of terms in

topics. The forum posts consist of short text, where self-corpus-based expansion

is able to mitigate the extremely few word co-occurrences among posts. The

experiments done on the QUT ESP forum data show improved performance on

intrinsic measures in obtaining the themes within the posts. The qualitative

evaluation validates these themes. However, forum posts are not as short

as tweets and are homogeneous in nature. Because of this, fewer terms corresponding

to the topics need to be added to expand the forum posts to identify concept

clusters, as evidenced by the experiments in the paper.

Next, the chapter will present four papers. Since this is a thesis by publication,

each original paper is presented in alignment with the thesis format. Due to the

papers’ different formats, there will be some minor format differences from the

published articles. However, these do not alter the content of the original papers.

Paper 1: An Efficient Ranking-Centered Density-

Based Document Clustering Method

Wathsala Anupama Mohotti* and Richi Nayak*

*School of Electrical Engineering and Computer Science, Queensland University

of Technology, GPO BOX 2434, Brisbane, Australia

Published In: The Pacific-Asia Conference on Knowledge Discovery and Data

Mining (PAKDD), 3-6 June 2018, Melbourne, VIC, Australia

Statement of Contribution of Co-Authors

The authors of the papers have certified that:

1. They meet the criteria for authorship in that they have participated in

the conception, execution, or interpretation, of at least that part of the

publication in their field of expertise;

2. They take public responsibility for their part of the publication, except for

the responsible author who accepts overall responsibility for the publication;

3. There are no other authors of the publication according to these criteria;

4. Potential conflicts of interest have been disclosed to (a) granting bodies, (b)

the editor or publisher of journals or other publications, and (c) the head

of the responsible academic unit, and

5. They agree to the use of the publication in the student’s thesis and its

publication on the QUT ePrints database consistent with any limitations

set by publisher requirements.

Contributor: Wathsala Anupama Mohotti

Statement of contribution*: Conceived the idea, designed and conducted experiments, analyzed data, wrote the paper and addressed the supervisor and reviewers’ comments to improve the quality of paper

Signature: Mohotti

Date: 27/03/2020

Contributor: A/Prof Richi Nayak

Statement of contribution*: Provided critical comments in a supervisory capacity on the design and formulation of the concepts, method and experiments, edited and reviewed the paper

Signature: Nayak

Date: 26/03/2020



ABSTRACT: Document clustering is a popular method for discovering useful

information from text data. This paper proposes an innovative hybrid document

clustering method based on the novel concepts of ranking, density and shared

neighborhood. We utilize ranked documents generated from a search engine to

effectively build a graph of shared relevant documents. The high density regions

in the graph are processed to form initial clusters. The clustering decisions are

further refined using the shared neighborhood information. Empirical analysis

shows that the proposed method is able to produce an accurate and efficient solution

compared to relevant benchmarking methods.

KEYWORDS: Density estimation; Ranking function; Graph-based clustering

1 Introduction

Document clustering is a popular method to discover useful information from

the text corpuses [8]. It has been used to organize the data based on similarity

in many applications such as social media analytics, opinion mining and recom-

mendation systems. A myriad of clustering methods exist that can be classified

into the popular categories of partitional, hierarchical, matrix factorization, and

density based clustering [8, 201]. The centroid based partitional methods such as

k-means are known to suffer from the data concentration problem when dimen-

sionality is high and the data distribution is sparse. Specifically, the difference

between data points becomes negligible [177]. Hierarchical clustering suffers from

the same problem due to the requirement of multiple pairwise computation at

each step of decision making [8]. Matrix factorization, a dimension reduction

method for high dimensional text, is also commonly used in finding clusters in

low dimension data. In these methods, information loss is inevitable [8], and the

time required for low-rank approximation of large text data through

optimization increases with the size of the dataset.

Density-based methods such as DBSCAN and OPTICS have been found highly

efficient in traditional data [57]. They generate diverse shapes of clusters without

taking the number of clusters as an input – the desired requirements for document

datasets [201]. Moreover, text data has been shown to exhibit the hub phenomenon,

i.e., “the number of times some points appear among k nearest neighbors of other

points is highly skewed” [177]. A density-based clustering method should be ideal

for identifying these naturally spread dense sub-regions made of frequent nearest

neighbors that assist in estimating density. However, this approach has hardly been

explored in document clustering, for several reasons [56].

Firstly, density-based methods stagnate in high dimensional data clustering,

as document datasets exhibit varying densities due to sparse text representation

and the density definition cannot identify core points to form clusters

[8]. Secondly, techniques employed for efficient neighborhood inquiry to expand

clusters do not scale well to high dimensional feature space [8]. A handful of

solutions have been proposed by using different shapes, sizes, density functions

and applying constraints in the high dimensional data [55, 201]. Semi-supervised

and active learning approaches have been used in density-based document clustering

with DBSCAN to obtain improved clustering performance [201].

The majority of density-based methods utilize the concept of Shared Nearest Neighbor

(SNN) [92], whereby the similarity between points is defined based on the

number of neighbors they share [55, 56]. The SNN concept allows relatively

uniform regions to form a graph and clusters to be identified by differentiating

varying densities. In a document clustering setting, where the data representation is

naturally sparse, this is an ideal solution for identifying dense regions. However, the

computation of an SNN graph is expensive due to the high number of pairwise

comparisons required.

In this paper, we propose a novel and effective method called Ranking-centered

Density-based Document Clustering (RDDC). It first builds the SNN graph based

on the concepts of inverted index and ranking and then iteratively forms clusters

by finding dense regions within the shared boundary of documents in the SNN

graph.

Information Retrieval (IR) is an established field that uses the document simi-

larity concept to provide the ranked results in response to the user query [59].

An IR system is able to process queries per second on collections of millions of

documents using an efficient inverted index data structure on a traditional desktop

computer [200]. Given a query and the documents organized in the form

of an inverted index on a standard desktop machine, a search engine will efficiently

retrieve the related documents ranked in order of relevancy to the query. We

conjecture that a document neighborhood can be generated using the set of relevant

documents found by an IR system, without expensive pairwise document

comparisons. In RDDC, we propose to explore this neighborhood of relevant documents

to build the SNN graph effectively, which, in turn, reveals the core dense

points and forms clusters.

Conventional density clustering methods are known for not covering all data

points in clusters and leaving a high number of documents un-clustered [55].

To deal with this problem, we identify multiple hubs in the shared neighborhood

sets and reassign these un-clustered documents to the closest hub based on previously

calculated relevancy scores. Empirical analysis using several document corpuses

reveals that RDDC is able to cluster a high percentage of documents accurately

and efficiently compared to other state-of-the-art methods.

More specifically, in this paper we propose a novel density-based clustering

method, RDDC, for sparse text data. RDDC explores the dense patches in a high

dimensional setting using a shared nearest neighbor graph built with the ranked results

of an IR system. RDDC further enhances the clustering decision using these

shared nearest neighbors as hubs in higher dimensionality. It efficiently calculates

the similarity for hubs using relevancy scores provided by the IR system. These

approaches to cluster allocation enable RDDC to obtain improved accuracy and

efficiency for document clustering.

To the best of our knowledge, RDDC is the first method that extends the IR

concepts of inverted index and ranking to density-based document clustering. Recently,

a couple of researchers have applied the ranking concept to partitional document

clustering, to produce relevant clusters instead of all clusters in semi-supervised

clustering [173] and to select centroids using ranked retrieval in k-means [30].

However, the approach employed in RDDC is entirely different from these two

works. RDDC needs neither a user-defined cluster number k nor the expensive

centroid update steps of these methods. RDDC finds the density regions

in the SNN graph, which is built efficiently using the document ranking scores

obtained from the text data through an IR system.

2 Ranking-centered Density Document Clustering (RDDC)

Let $D = \{d_1, d_2, d_3, \ldots, d_N\}$ be a document corpus and $d_i$ be a document

represented by a set of $M$ distinct terms $\{t_1, t_2, t_3, \ldots, t_M\}$. RDDC uses an

IR system to index all documents in $D$ based on their terms and frequencies.

The indexed documents become the input to the clustering process, which includes

three main steps. (1) Firstly, the nearest neighbor sets that possess common

documents, $D_{SNN} \subseteq D$, are identified using the document ranking scores obtained

from the IR system in order to build the SNN graph. (2) Secondly, the graph

$G_{SNN}$ is built using the documents in $D_{SNN}$ as vertices and the corresponding number

of shared relevant documents as the edge weight. Dense regions are found in the graph

and a distinct cluster label in $C = \{c_1, c_2, c_3, \ldots, c_l\}$ is assigned to documents

in high-density regions. Another set of documents, $D_{O2}$, that appear in low-density

regions is separated out. (3) Lastly, RDDC assigns cluster labels to $d_i \in D_{O2}$

according to their maximum affinity to a hub residing within a cluster that was

identified in the previous step.

2.1 Obtaining Nearest Neighbors as Relevant Documents

Document Querying. Given a document $d_i \in D$ as a query and $D$ organized

as an inverted index, an IR system generates the most relevant documents

ranked in order of relevancy to $d_i$. A query representing the document,

$q = \{t_1, t_2, t_3, \ldots, t_s\} \in d_i$, should be generated such that the most accurate nearest

neighbors are obtained. RDDC represents the document as a query using the

top-$s$ terms ranked in order of term frequency, with $s$ set according to the length of the

document. A set of $s$ distinct terms with $0 \le s \le M$ is obtained as:

$$s = \frac{|d_i|}{k}, \quad i = 1, 2, \ldots, N \tag{1}$$

If the length of the document is less than $s$, all the terms in the document are used

as the query. The factor $k$ controls the query length for documents of various sizes. A

smaller $k$ value (e.g., $k = 15$) is used for large text data, yielding larger queries, while

a larger value (e.g., $k = 35$) is used for short text data, yielding smaller queries. The

value of $k$ is learned empirically in the range 3 to 50 for each corpus.
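The query-generation step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_query` is hypothetical, $|d_i|$ is taken to be the token count of the document, $s$ is rounded down with a floor of one term, and ties in term frequency are broken alphabetically.

```python
from collections import Counter

def build_query(doc_tokens, k):
    """Return the top-s most frequent distinct terms of a document (Eq. 1).

    Assumptions (not fixed by the source): |d_i| is the number of tokens,
    s = |d_i| // k with a minimum of 1, and frequency ties are broken
    alphabetically so that the result is deterministic.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    counts = Counter(doc_tokens)
    s = max(1, len(doc_tokens) // k)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Slicing returns every distinct term when fewer than s are available.
    return [term for term, _ in ranked[:s]]

long_doc_query = build_query("a a a b b c d e f".split(), k=3)
short_doc_query = build_query(["x", "y"], k=35)
```

With a small `k` the query keeps more terms, and with a large `k` it shrinks towards a single term, which mirrors the role of the factor $k$ described above.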

Document Ranking. Given the document query $q$, a set of the $m$ most relevant

documents with their ranking score vector $r$ is obtained using the ranking function

$R_f$ as in Eq. 2. A number of ranking functions, such as Term Frequency-Inverse

Document Frequency ($tf * idf$) and Okapi Best Matching 25 (BM25), can be used

to calculate the relevancy score [59]. RDDC uses the $tf * idf$ ranking function

given in Eq. 3, where $tf$ represents how often a term appears in the document,

$idf$ represents how often the term appears in the document collection, and field

length normalization ($norm$) accounts for the length of the field that contains the term.

$$R_f : q \rightarrow D^q = \{(d^q_j, r_j)\} : j = 1, 2, \ldots, m \tag{2}$$

$$r_{d_j} = \mathrm{score}(q, d_j) = \sum_{t \in q} \left( \sqrt{tf_{t,d_j}} \times idf_t^{2} \times \mathrm{norm}(t, d_j) \right) \tag{3}$$
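To make the shape of Eq. 3 concrete, the following Python sketch scores one document against a query. The source does not define $idf_t$ or $\mathrm{norm}(t, d_j)$ numerically, so the formulas used here ($idf = 1 + \ln(N / (df + 1))$ and $norm = 1/\sqrt{\text{number of tokens}}$) are assumptions modelled on a common Lucene-style choice; the function name `tfidf_score` and its arguments are hypothetical.

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc_tokens, doc_freq, num_docs):
    """Score a document against a query following the structure of Eq. 3.

    doc_freq maps a term to the number of documents that contain it.
    The idf and norm definitions below are assumptions, not taken
    from the source.
    """
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)
    norm = 1.0 / math.sqrt(len(doc_tokens))  # assumed field-length norm
    score = 0.0
    for t in set(query_terms):
        if tf[t] == 0:
            continue  # terms absent from the document contribute nothing
        idf = 1.0 + math.log(num_docs / (doc_freq.get(t, 0) + 1))  # assumed idf
        score += math.sqrt(tf[t]) * idf ** 2 * norm
    return score

# sqrt(tf) = 2, idf = 1 + ln(2 / 2) = 1, norm = 1 / 2, so the score is 1.0
example_score = tfidf_score(["a"], ["a", "a", "a", "a"], {"a": 1}, num_docs=2)
```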

Claim 1 shows that the ranking results obtained by an IR system using a ranking

function provide the relevant neighbors to $d_i$ with a reduced computational time

and high accuracy, in comparison to pairwise document comparison.

Claim 1 Let $N(d_i)$ be the neighborhood documents calculated from the pairwise

document comparisons of document $d_i$ with the rest of the documents in the collection

$D$ of size $N$, obtained with $\delta_1$ time and $\partial_1$ level of accuracy. Let $R(d_i)$ be the

IR ranked result of document $d_i$, obtained with $\delta_2$ time and $\partial_2$ level of accuracy.

$R(d_i) \subset N(d_i)$ will be built with $\delta_2 (< \delta_1)$ time and $\partial_2 (> \partial_1)$ level of accuracy.

Proof:

• In order to obtain $N(d_i)$, the (cosine) similarity has to be obtained by comparing

$d_i$ with every document in $D$. This process consists of $N - 1$ steps,

which take $\delta_1$ time, and yields the top-$k$ neighbours, where $k \ge 1$,

according to similarity values with $\partial_1$ level of accuracy.

• $R(d_i)$ is obtained using the inverted indexed documents in $D$ and a ranking

function such as $tf * idf$ [59]. The $tf * idf$ ranking function only computes

similarity scores for documents containing high $tf * idf$ weights for query

terms. This is a one-step process, which takes $\delta_2$ time and gives the most relevant

neighbour documents with $\partial_2$ level of accuracy.

• The cluster hypothesis [91] states that “associated documents appear in a

returned result set of a query”. The reversed cluster hypothesis in the optimum

clustering framework [59] further states that “the returned documents

in response to a query will be in the same cluster”, so they can be considered

as nearest neighbours.

• Since $R(d_i) \subset N(d_i) \subset D$, $R(d_i)$ will contain neighbours with $\partial_2 (> \partial_1)$

level of accuracy obtained with $\delta_2 (< \delta_1)$ time.
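The time argument in Claim 1 rests on the inverted index: a query only touches the posting lists of its own terms, so documents that share no term with the query are never scored. The toy Python sketch below illustrates this idea; it is not the search engine used in the paper, and the names `build_inverted_index` and `candidate_docs` are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term].add(doc_id)
    return index

def candidate_docs(index, query_terms, exclude=None):
    """Return only the documents sharing at least one term with the query."""
    found = set()
    for term in query_terms:
        found |= index.get(term, set())
    found.discard(exclude)  # drop the query document itself, if given
    return found

docs = {0: ["cat", "dog"], 1: ["dog", "fish"], 2: ["car", "road"]}
index = build_inverted_index(docs)
# Document 0 posed as a query: document 2 shares no term and is never visited.
candidates = candidate_docs(index, docs[0], exclude=0)
```

A pairwise approach would compare document 0 with both other documents, whereas the index lookup returns a single candidate to score.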

2.2 Graph based Clustering

The IR ranked results should contain the relevant neighborhood documents of the

query document. RDDC uses the top-10 ranked documents as the nearest neighborhood

set, as they possess sufficient information richness [199]. A set $D_{SNN} \subseteq D$

is identified by calculating the common documents for each $d_i \in D$ with its top-10

retrieved documents. Let the retrieved result sets of $d_i$ and $d_j$ be $R(d_i)$ and $R(d_j)$

respectively, where $d_j \in R(d_i)$. If $|R(d_i) \cap R(d_j)| > 2$, documents $d_i$ and $d_j$

($d_i, d_j \in D_{SNN}$) become vertices in the graph $G_{SNN}$ and the corresponding number

of shared relevant documents, $|R(d_i) \cap R(d_j)|$, becomes the edge weight. The $G_{SNN}$

construction leaves out a set of orphan documents $D_{O1}$ ($D = D_{SNN} \cup D_{O1}$) that

do not appear in $D_{SNN}$.

$$D_{O1} = \{(d_i \in D) \cap (d_i \notin D_{SNN})\} : i = 1, 2, \ldots, N \tag{4}$$
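The graph construction described above can be sketched as follows. This is an illustrative Python version under assumptions: `retrieved` is a hypothetical mapping from each document id to the ids of its top-ranked results (top-10 in the paper), the query document itself is assumed to be excluded from its own result list, and the function name `build_snn_graph` is not from the source.

```python
def build_snn_graph(retrieved, min_shared=2):
    """Build SNN edges from ranked retrieval results.

    retrieved: {doc_id: list of top-ranked neighbor ids}.
    An edge (d_i, d_j) is created when d_j is in R(d_i) and
    |R(d_i) ∩ R(d_j)| > min_shared; the edge weight is that count.
    Returns the edges, the vertex set D_SNN and the orphans D_O1 (Eq. 4).
    """
    neighbor_sets = {d: set(r) for d, r in retrieved.items()}
    edges = {}
    for di, r_i in neighbor_sets.items():
        for dj in r_i:
            if dj == di or dj not in neighbor_sets:
                continue
            shared = len(r_i & neighbor_sets[dj])
            if shared > min_shared:
                edges[frozenset((di, dj))] = shared
    d_snn = {d for edge in edges for d in edge}
    orphans = set(retrieved) - d_snn
    return edges, d_snn, orphans

toy_results = {
    "a": ["b", "c", "d", "e"],
    "b": ["a", "c", "d", "e"],
    "c": ["a", "b"],
    "z": ["a"],
}
edges, d_snn, orphans = build_snn_graph(toy_results)
```

No pairwise similarity is computed here; only the intersections of already retrieved result sets are counted.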

The next task is to identify dense nodes in $G_{SNN}$. A dense node is one for which the

number of documents connected in its region is higher than a threshold, with

edge weights higher than a threshold. These nodes are defined as core points in

$G_{SNN}$ that become initial cluster representatives. Each cluster boundary is then

expanded to include documents with the same edge weight. This process gives

us a set of documents with cluster labels $C$, and it also identifies the documents

$D_{O2}$ that do not fit into any cluster boundary.

$$D_{O2} = \{(d_i \in D) \cap (d_i \in D_{SNN}) \cap (d_i \notin C)\} : i = 1, 2, \ldots, N \tag{5}$$

Algorithm 1 details the process of obtaining density-based clusters. Claim 2 shows

that the SNN graph can be built accurately using ranking results.

Claim 2 Let the SNN graph created with the $k$ nearest neighbourhoods $N_k(D)$ in the

document collection $D$ be $G_{N_k(D)}$ and the SNN graph created with the IR ranked results

$R(D)$ be $G_{R(D)}$. If $R(d_i) \subset N(d_i)$ for $d_i \in D$, then graph $G_{R(D)} \subseteq G_{N_k(D)}$ and

$G_{R(D)}$ contains $\partial_2 (> \partial_1)$ level of accuracy, where $\partial_1$ is the accuracy level of $G_{N_k(D)}$.

Proof:

• Let $V(R(D))$, $V(N_k(D))$ be the vertices of the two graphs $G_{R(D)}$ and $G_{N_k(D)}$,

represented by the documents in $R(D)$ and $N_k(D)$ respectively, and

$E(R(D))$, $E(N_k(D))$ be the edges, represented by the number of shared

documents within document pairs in $R(D)$ and $N_k(D)$ respectively.

• For document $d_i$, to obtain the $k$ relevant neighbourhoods $N_k(d_i)$ we have to

prune meaningless neighbourhood levels. Hence, $N_k(d_i) \subset N(d_i)$.

• If $N_k(d_i) \subset N(d_i)$ and $R(d_i) \subset N(d_i)$, then $R(d_i) \subseteq N_k(d_i)$, as $R(d_i)$ contains

only the most relevant neighbours according to the optimum clustering

framework [59]. Thus $R(D) \subseteq N_k(D)$.

• Therefore $V(R(D)) \subseteq V(N_k(D))$ and $E(R(D)) \subseteq E(N_k(D))$. This proves

$G_{R(D)} \subseteq G_{N_k(D)}$.

• The IR ranked results $R(d_i)$ of $d_i$ contain the documents relevant to $d_i$ with

$\partial_2 (> \partial_1)$ level of accuracy, as shown in Claim 1. Hence, $G_{R(D)}$ contains all the

required document information to represent the SNN graph with $\partial_2 (> \partial_1)$ level

of accuracy.

In this phase, a repository $H = \{H_1, H_2, H_3, \ldots, H_\varnothing\}$ is also built to store the

shared relevant documents, where each node $H_j \in H$ contains a set of shared

documents $\{d_i, d_j, \ldots, d_k\}$ and $\varnothing$ is the total number of sets of shared documents.

Usually, $|H| > |C|$ and a node $H_j$ contains documents from the same cluster.

The set of relevant nodes within a cluster is comparable to the concept of hubs

in high dimensionality [177]. These hubs represent the dense sub-regions

within clusters. RDDC accurately clusters a higher percentage of documents using the

affinity calculation for these hub nodes and avoids the problem of a high number

of un-clustered documents seen in many other density-based methods.

Figure 1: Algorithm for RDDC

2.3 Relevancy Based Clustering

Algorithm 2 details the process of clustering documents DO2 that remain un-

clustered in the first phase, based on the maximum relevancy to the set of docu-

ments in the repository H. In the high-dimensional data such as text, a cluster is

shown to contain multiple hubs of documents instead of a uniform spread across

the cluster [177]. In RDDC, the sets of shared relevant documents present in H

are considered as hubs within a cluster. We envisage that the hubs of documents

stored in the repository H will share higher affinity to di ∈ DO2 instead of a

cluster represented as mean (centroid) vectors. For each di ∈ DO2 , we calculate

its affinity with each node in H as follows.

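AS(d_i) = \left\{ \frac{\sum_{u=1}^{\mathrm{sizeof}(H_j)} \mathrm{score}(d_i, d_u)}{\mathrm{sizeof}(H_j)} \;:\; j = 1, 2, \ldots, \varnothing \right\}, \quad \text{with } \mathrm{score}(d_i, d_u) \text{ calculated as per Eq. 3} \qquad (6)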

The affinity score of hub node Hj is calculated using the ranking score of each

document that it contains, when the orphan document di was posed as a query.

Usually, the hub calculation in existing clustering methods is found very expensive

due to the need of pairwise computation between all documents within a cluster

[78, 173]. However, RDDC uses (already calculated) relevancy scores in IR ranked

results to measure hub affinity and makes the process computationally efficient.

Document di is then assigned with the cluster label of the maximum relevant

node Hj ∈ H that yields the largest affinity score to di .
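The hub affinity of Eq. 6 and the label assignment described above can be sketched as follows. This is a minimal illustration rather than the RDDC implementation: the function names, the dictionary layout of the repository H, and the convention that a hub member missing from the ranked results contributes a score of 0 are all assumptions made for the example.

```python
def hub_affinity(scores, hubs):
    """Average ranking score of an orphan document against each hub (Eq. 6).

    scores: dict mapping document id -> relevancy score obtained when the
            orphan document was posed as a query. Documents that were not
            retrieved are assumed to score 0.0 (an assumption of this sketch).
    hubs:   dict mapping hub id -> (cluster label, list of member document ids).
    """
    affinity = {}
    for hub_id, (_, members) in hubs.items():
        total = sum(scores.get(doc_id, 0.0) for doc_id in members)
        affinity[hub_id] = total / len(members)
    return affinity


def assign_orphan(scores, hubs):
    """Return the cluster label of the hub with the largest affinity score."""
    affinity = hub_affinity(scores, hubs)
    best_hub = max(affinity, key=affinity.get)
    return hubs[best_hub][0]


# Hypothetical ranking scores returned for one orphan document.
scores = {"d1": 4.0, "d2": 2.0, "d7": 1.0}
# Hypothetical repository H with two hubs from two different clusters.
hubs = {
    "H1": ("sport", ["d1", "d2"]),           # mean score (4 + 2) / 2
    "H2": ("politics", ["d7", "d8", "d9"]),  # mean score (1 + 0 + 0) / 3
}
label = assign_orphan(scores, hubs)
```

Because the scores are read from the ranked results already retrieved for the orphan document, no additional pairwise comparison is needed in this step.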

3 Empirical Analysis

We used multiple datasets with varying dimensionality such as 20 Newsgroups,

Reuters 21578, Media Eval Social Event Detection (SED) 2013 and SED 2014,


and the TDT5 English corpus, as reported in Table 1. We created smaller subsets
of the 20 Newsgroups dataset to compare the RDDC performance with

the density-based document clustering method by Zhao et al. [201]. Additionally,

several other density-based clustering methods including SNN based DBSCAN

[55], SNN based clustering for coherent topic clustering [56] and DBSCAN [57]

as well as the well-known matrix factorization method, NMF [122] were used for

benchmarking.

Table 1: Summary of the datasets in the experiment.

Dataset                     # Docs   # Terms   Avg. # terms   Std. dev. terms   # Clusters
                                               per doc        per corpus        (ground truth)
20 Newsgroups - 20ng DS1       300      6595   104            88                3
20 Newsgroups - 20ng DS2      2000     22841   100            119               20
20 Newsgroups - 20ng DS3      7528     43946   97             104               20
Reuters - R52 DS              9100     19479   46             41                52
SED 13 - SED13 DS            99989     61806   17             23                3711
SED 14 - SED14 DS           120000     64056   16             16                3875
TDT 5 - TDT5 DS               3905     38631   172            124               40

Each dataset was pre-processed to remove stop words and to stem the remaining words.
Document term frequency was selected as the weighting schema for the query
representation after extensive experiments. In all the experiments, the query length
was set to the optimum query length according to Eq. 1. The minimum number of
documents and the minimum weight for the graph, α, were set to 3 for all the datasets,
based on experiments and prior research [67]. Experiments were run using Python
3.5 on a 1.2 GHz 64-bit processor with 264 GB of memory. Elasticsearch with
fast bulk indexing was used as the search engine to obtain relevant documents.


Standard pairwise F1-score and Normalized Mutual Information (NMI) were used

as cluster evaluation measures [201].
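For reference, the two evaluation measures can be computed as in the sketch below. It is an illustration only: the pairwise F1-score counts a document pair as positive when both documents fall in the same cluster, and the NMI is normalised here by the arithmetic mean of the two entropies, which is one common variant and an assumption of this example (the normalisation used in the experiments is not stated).

```python
import math
from collections import Counter
from itertools import combinations


def pairwise_f1(truth, pred):
    """Pairwise F1: a pair is positive when both documents share a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_pred and same_truth:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_truth:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def entropy(counts, n):
    """Shannon entropy of a labelling given its label counts."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def nmi(truth, pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(truth)
    count_t, count_p = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mutual = sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(count_t, n) + entropy(count_p, n)) / 2
    return mutual / denom if denom > 0 else 1.0


# A perfect clustering up to label renaming, and a single-cluster solution.
truth = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]
merged = [0, 0, 0, 0]
```

The single-cluster solution illustrates why both measures are reported: it keeps every true pair together (recall of 1) but shares no information with the ground truth.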

3.1 Accuracy Analysis

Results in Table 2 and Fig. 2 (a) show the comparative performance of RDDC
against the benchmarking methods. As shown by the average performance over all datasets
in Table 2, RDDC produced much higher accuracy than the benchmarking
methods. Results in Fig. 2 (a) ascertain that RDDC forms tight natural
clusters. It is able to identify sub-clusters within the specified clusters, as shown
by the higher number of clusters found (Fig. 2 (a)), while still producing a higher NMI
(Table 2). Sometimes, this leads to a low F1-score. Density-based clustering
is known not to cover every data point in clusters, due to the requirement
of fitting the clustered objects into a density region [57]. Fig. 2 (a) shows that
RDDC is able to assign a large share of documents to clusters with high accuracy
due to the inclusion of relevancy-based clustering in the third step. Moving from
the graph-based clustering to the relevancy-based clustering, RDDC shows a
two-fold increase in the percentage of documents clustered, with 52% and 17%
increases in NMI and F1-score respectively. In some datasets, DBSCAN covers more
documents than RDDC; however, a closer investigation reveals that it produces
only a few large clusters that hold a large number of documents, yielding a poor
clustering solution.

Zhao et al. [201] used DBSCAN in a semi-supervised setting for document
clustering. We created 20ng DS1, 20ng DS2, and TDT5 DS according to the
explanation given in their paper, as we were unable to find an implementation of
the method. RDDC is an unsupervised method and can be considered equivalent to the zero

constraint level of the method in [201]. As shown in Table 3, results produced


Table 2: Performance comparison of different datasets, methods, and metrics

             F1-score                          NMI
Dataset      RD     SD     ST     DB     MF     RD     SD     ST     DB     MF
20ng DS3     0.28   -      -      0.00   0.18   0.28   -      -      0.09   0.14
R52 DS       0.36   0.04   -      0.07   0.38   0.41   0.38   -      0.43   0.26
SED13 DS     0.87   -      -      0.00   -      0.66   -      -      0.00   -
SED14 DS     0.87   -      -      0.00   -      0.65   -      -      0.00   -
TDT5 DS      0.61   0.36   0.21   0.22   0.70   0.35   0.32   0.25   0.22   0.54
Average      0.60   0.20   0.21   0.06   0.42   0.47   0.35   0.25   0.15   0.31

Methods: RDDC (RD), SNN based DBSCAN (SD), SNN based topic clustering (ST), DBSCAN (DB) and NMF (MF)
Note: “-” denotes out of run-time or memory

Table 3: Performance comparison with semi supervised clustering [201]

             RDDC                               Semi-supervised DBSCAN
Dataset      NMI    F1-score   # constraints    NMI    F1-score   # constraints
20ng DS1     0.66   0.75       0                0.62   0.62       25
20ng DS2     0.40   0.32       0                0.22   0.42       50
TDT5 DS      0.61   0.35       0                0.22   0.31       75

by unsupervised RDDC are mostly superior to semi-supervised DBSCAN [201].

These results show the effectiveness of using relevancy scores obtained with the

concepts of ranking and inverted index, in building SNN graph, finding dense

regions and forming clusters.

3.2 Scalability and Complexity Analysis

Fig. 2 (b) shows that the traditional SNN based methods failed to scale with

large datasets due to the computational complexity introduced by the number of

comparisons made for k NN search. It is O (nk + nd) where n is the number of

instances in dataset, d is feature dimensionality and k is the number of nearest

neighbors. In contrast, the relevant document calculation of RDDC has a
computational complexity of O (m + n) where m is the query length used to obtain relevant
neighbors. RDDC consumes more time than DBSCAN, as shown in Fig. 2 (b), due to
the additional graph construction and maximum relevancy calculation. However, this
tradeoff is well justified, as RDDC achieves increases of 0.54 and 0.32 in average
F1-score and NMI respectively over DBSCAN (Table 2). Incremental
sampling on the SED13 collection is used to demonstrate the scalability of
RDDC. Fig. 3 (a) shows that RDDC exhibits a near-linear increase in time with
the size of the corpus, whereas traditional SNN based methods are not scalable,
as shown by the runtime in Table 2. Further, the performance of RDDC with
increased feature dimensionality in Fig. 3 (b) shows that the RDDC runtime
stabilizes after an initial linear increase with dimensionality.

Figure 2: Performance comparison with percentage of clustered documents and time taken

3.3 Sensitivity Analysis

Figure 3: Scalability performance of RDDC using the SED 13 and 20ng DS3 dataset

The parameter sensitivity is analyzed by obtaining two independent samples of
different sizes from each dataset in Table 2. The document model for query
formation was evaluated using the tf , idf and tf ∗ idf weighting schemas. Fig. 4
(a) shows that the tf representation outperformed the others in many datasets. This is
justified, as the important terms that determine the theme of the document have
higher weights in this scheme.

Figure 4: Term Weighting Schema and Ranking Functions
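The three weighting schemas compared for query formation can be illustrated with the sketch below. It is not the RDDC code: the function names, the unsmoothed idf definition log(N / df), the alphabetical tie-breaking and the toy corpus are assumptions made for the example, and the query length is passed in directly rather than derived from Eq. 1.

```python
import math
from collections import Counter


def weight_terms(doc_tokens, corpus_tokens, scheme="tf"):
    """Weight the distinct terms of one document under tf, idf or tf*idf."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    weights = {}
    for term, freq in Counter(doc_tokens).items():
        idf = math.log(n_docs / doc_freq[term]) if doc_freq[term] else 0.0
        if scheme == "tf":
            weights[term] = float(freq)
        elif scheme == "idf":
            weights[term] = idf
        elif scheme == "tf*idf":
            weights[term] = freq * idf
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return weights


def build_query(doc_tokens, corpus_tokens, length, scheme="tf"):
    """Keep the `length` highest-weighted terms as the document query."""
    weights = weight_terms(doc_tokens, corpus_tokens, scheme)
    # Ties are broken alphabetically so that the result is deterministic.
    ranked = sorted(weights, key=lambda term: (-weights[term], term))
    return ranked[:length]


# Hypothetical pre-processed corpus (stop words removed, terms stemmed).
corpus = [
    ["rugby", "ball", "rugby", "scrum"],
    ["cricket", "ball", "players"],
    ["rugby", "team"],
]
query_tf = build_query(corpus[0], corpus, length=2, scheme="tf")
query_idf = build_query(corpus[0], corpus, length=2, scheme="idf")
query_tfidf = build_query(corpus[0], corpus, length=2, scheme="tf*idf")
```

In this toy case the tf query keeps the repeated theme term "rugby", whereas the idf query prefers the rare term "scrum", which mirrors the trade-off discussed above.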

The success of RDDC relies on obtaining accurate nearest neighbors of a given
document to build the SNN graph. This depends on how a query is presented
and on the ranking function used. The two most popular ranking functions are
tf ∗ idf and BM25 [59]. Fig. 4 (b) shows that higher performance in terms of
NMI and F1-score is obtained by using tf ∗ idf , so it is used in all experiments
and can be set as the default.

Next, we explored the relationship between query length and the text size in
document corpora. Fig. 5 (a) shows that there is a linear relationship between
query length and document size. Corpora with smaller documents need shorter queries,
while corpora with larger documents need longer queries. In RDDC, the factor k is
included to control the query length for documents in a corpus. We analyze
the parameter k against the size of the datasets in Fig. 5 (b). The factor k shows
an inverse linear relationship; that is, a large k value should be set for short text
data and a smaller k value should be set for large text data. The parameter k
adjusts the query size according to the length of the document.

Figure 5: Query length and Parameter k

The threshold α in RDDC denotes the minimum number of documents that must be
shared between two documents for them to be defined as similar, and the number of documents
required for a region to be considered dense. Prior research has shown that this value can be set to 3 [67].


As shown in Fig. 6, the best performance (i.e. the maximum number of documents
clustered) is obtained when α is set to 3. For future use, the default value can
be set to 3.

Figure 6: Parameter Alpha vs. Clustered documents

4 Conclusion

This paper was inspired by the conjecture that, since text documents have a sparse data
representation, we should leverage the techniques that suit that representation.
We proposed a novel ranking-centered density-based document clustering
method, RDDC, based on the concepts of density estimation, inverted indexing,
ranking and hubs. RDDC introduces the innovative concept of finding nearest
neighbors using the document relevancy ranking scores to construct an SNN
graph, and finds the dense regions to form the clusters. We showed that the use
of the document ranking score is more effective than calculating the pairwise
similarity between data points in text data, as it reduces the computational
complexity and improves accuracy. We also introduced a refinement phase to
increase the percentage of clustered documents by assigning orphan documents
to hubs within clusters, rather than to the cluster itself. The hubness affinity
calculation utilizes the previously calculated relevancy ranking scores and thus does not incur
any overheads. We proved that closeness to shared relevant neighbors can improve
the performance of text clustering due to the existence of multiple hubs
in a text cluster. Empirical results on several datasets, benchmarked
against several clustering methods, show that RDDC overcomes the issues attached
to sparse vectors and clusters text data with considerably higher performance,
in terms of both accuracy and scalability.


Paper 2: Consensus and Complementary Non-

negative Matrix Factorization for Document

Clustering




Under Review in: Knowledge-Based Systems (KBS Journal)



















Date:



Date: edited and reviewed the paper

hi Nayak

26/03/2020

Mohotti

27/03/2020




ABSTRACT: Document clustering, for grouping the documents with similar con-

cepts, has been found useful in many applications of information retrieval and

knowledge discovery. High dimensionality and sparsity of text vector represen-

tation challenge state-of-the-art methods. We propose a novel method using the

non-negative matrix factorization framework, which utilizes the ranking-based

similarity and nearest neighbor-based affinity to form the document adjacency

matrices to overcome this issue. Empirical analysis using several datasets shows

that the proposed method is able to produce accurate clustering solutions as

compared to relevant benchmarking methods.

KEYWORDS: Ranking; Nearest Neighbors; Document Affinity; Non-negative

Matrix Factorization

1 Introduction

With the advancement in digital technology, text data has grown exponentially

[86]. The process of grouping the documents with similar concepts, known as

document mining, has become significant in diverse applications such as social

media analytics, opinion mining and recommendation systems [3, 140]. A myr-

iad of clustering methods such as partitional, hierarchical, density-based, matrix

factorization and spectral have been developed [8]. However, the majority of ex-

isting methods are challenged with high dimensionality and sparsity associated

with the text data.

Partitional clustering methods face distance concentration where distance dif-

ference between far and near data points becomes negligible due to the high

dimensionality of text vectors and associated sparsity [205]. Hierarchical clus-

tering suffers from the scalability problem due to the requirement of multiple


pairwise computations at each step of making a clustering decision [3]. Density-based
clustering suffers from the sparseness of the text data representation, which
makes it challenging to find density differences [140].

Matrix factorization, which maps the high-dimensional text representation into lower
dimensions, has become popular due to its capability to find groups in the
mapped low-dimensional space. Non-negative Matrix Factorization (NMF) has

been successfully used in data, such as images, spectrograms, and documents for

multivariate analysis [46]. NMF has become especially successful in document

clustering due to its requirement of representing/processing data with positive

values [39, 50]. NMF learns two low-rank factor matrices that represent docu-

ment and term clusters by decomposing a high-dimensional document × term

matrix. However, it has been reported that, in high-dimensional sparse data,

dimensionality reduction fails to capture the geometric structure of original data

[146]. Consequently, neighboring points in high-dimension do not remain as close

points in the projected low-dimensional space. This information loss adversely

affects a cluster solution [3].
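The basic decomposition described above can be sketched with the standard multiplicative update rules for the Frobenius-norm objective. This is plain NMF on a document × term matrix only, not the CCNMF objective proposed in this paper; the helper names, the random initialisation, the iteration count and the toy matrix are assumptions made for the illustration.

```python
import random


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def frobenius_error(a, h, w):
    """Squared Frobenius norm of a - h w."""
    approx = matmul(h, w)
    return sum((a[i][j] - approx[i][j]) ** 2
               for i in range(len(a)) for j in range(len(a[0])))


def nmf(a, n_clusters, n_iter=500, seed=0, eps=1e-9):
    """Factorize a (documents x terms) into h (documents x clusters)
    and w (clusters x terms) with multiplicative updates."""
    rng = random.Random(seed)
    n_docs, n_terms = len(a), len(a[0])
    h = [[rng.random() + 0.1 for _ in range(n_clusters)] for _ in range(n_docs)]
    w = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_clusters)]
    for _ in range(n_iter):
        # h <- h * (a w^T) / (h w w^T), element-wise
        wt = transpose(w)
        num, den = matmul(a, wt), matmul(h, matmul(w, wt))
        h = [[h[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_clusters)]
             for i in range(n_docs)]
        # w <- w * (h^T a) / (h^T h w), element-wise
        ht = transpose(h)
        num, den = matmul(ht, a), matmul(matmul(ht, h), w)
        w = [[w[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_clusters)]
    return h, w


# Toy document x term count matrix with two groups of documents
# that use disjoint vocabularies.
A = [
    [2, 1, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 3],
]
H, W = nmf(A, n_clusters=2)
labels = [max(range(2), key=lambda k: row[k]) for row in H]
```

Since the updates only multiply non-negative quantities, both factors stay non-negative throughout, which is the property that makes the rows of H readable as cluster memberships.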

Manifold learning has been proposed as a solution to preserve the geometric

structure in original data and ensure close points remain close in lower-dimension

space [22]. The majority of these methods preserve the local geometry of the

data by building a graph based on local neighborhood information in the dataset

[131]. This category of methods is known as spectral methods [203]. They face

higher computation cost and information loss due to calculating adjacency for

each document using the affinity graph and projecting original data into new

coordinate space [8]. There also exist global manifold learning methods that

preserve geometric information at different scales [203]. However, the distance

calculation in these methods is computationally expensive. Researchers have also

attempted to combine both local and global properties of the manifold in hybrid


methods [203].

Recently, researchers have explored the ranking concept, commonly used in

Information Retrieval (IR), to find nearest neighbors in document clustering

[30, 140, 173]. IR is a well-established field, which uses inverted index data struc-

ture and the ranking concept to retrieve a list of matched (or similar) documents

for a query [30]. Researchers have shown the improved accuracy and scalabil-

ity in partitional clustering by utilising the ranking results, generated from an IR

system, to assign a document to a cluster [30, 173], as well as in density-based clus-

tering where the Shared Nearest Neighbor graph is constructed with IR ranked

document sets [140]. IR ranking shows the potential of calculating global
neighborhood information considering the entire document collection. However, this
approach has not been explored in matrix factorization-based document clustering.

Figure 1: Overview of the proposed CCNMF method
(A: document × term matrix; S1: NN-based symmetric document affinity matrix; S2: IR-based symmetric document affinity matrix; H1, H2, H: document × cluster matrices; W: cluster × term matrix; H is the final document × cluster matrix.)

In this paper, we propose a novel method, Consensus and Complementary Non-

negative Matrix Factorization (CCNMF), that aims to preserve the geometric

structure of the data by combining the nearest neighbor (NN) information with

the document-term representation. Firstly, the local affinity between documents

is obtained by embedding NNs calculated using pairwise document similarity. The

top k-NNs are chosen for each document and a symmetric matrix representation


is generated to encode the NN information. Secondly, the global affinity between

documents is obtained using an IR system to form another symmetric matrix to

represent the top k-NNs. The hybrid manifold learning approach employed in

CCNMF empowers it to effectively take more reliable neighborhood information

with IR while using higher representation capability of pairwise NNs. A novel

clustering objective function is proposed by combining these two matrices with

the document × term matrix in the NMF framework to accurately obtain the

document clusters.

CCNMF learns the optimum document cluster representation by iteratively
approximating the two symmetric matrices and the document-term representation matrix
while minimizing the learning error, as shown in Figure 1. Along with the complementary

global and local NN specific information given by the matrices, CCNMF inter-

nally combines consensus information in the data during the factorization using

the sequential update rules. We conjecture that the consensus and complemen-

tary information provided with local and global NNs is able to minimize the

information loss that occurs with higher-to-lower order document × term matrix

approximation.

To the best of our knowledge, CCNMF is the first such method that extends

the ranking concept to matrix factorization-based document clustering. Empiri-

cal analysis using several document corpora of varying sizes shows that manifold

learning in CCNMF results in providing a more accurate clustering solution com-

pared to relevant state-of-the-art methods. More specifically, this paper brings

several novel contributions to document clustering.

• A novel manner to utilise manifold-learning in NMF by combining the

document-term matrix factorization with the nearest-neighbor information

that preserves the geometric structure of the data, in order to improve the

accuracy of document clustering


• The use of local document affinity represented with pairwise document sim-

ilarity and global document affinity represented with the ranked results for

text clustering

• A novel approach to use consensus and complementary information in local

and global NNs to improve the NMF-based document clustering

The rest of the paper is organized as follows. Section 2 reviews the work related to

document clustering. The proposed approach and implementation are elaborated

in Section 3. A comprehensive empirical study and benchmarks on several public

datasets are provided in Section 4. Concluding remarks are presented in

Section 5.

2 Related Work

NMF has been used in clustering the text documents successfully. It approximates

the high-order non-negative documents × terms matrix into lower rank factor

matrices, which represent the groups of terms and the groups of documents on

the basis of shared terms [116] where the reduced rank can represent the number

of clusters in the data. This dimensionality reduction of document Vector Space

Model (VSM) based on the underlying semantic relationships is able to handle

sparsity in higher order representation [149].

With the increase in dimension, the distance difference between far and near

points becomes negligible [205] as in Figure 2. The distance concentration prob-

lem is evident in high-dimensional sparse text data and challenges the traditional

methods based on centroids and hierarchies in deciding the matching clusters

using distance. Density differences can also not be traced in sparse data without

sophisticated designs [140]. On the other hand, NMF, which encodes the text
data and projects it to low-rank matrices by retaining natural data non-negativity
and semantic relatedness, is proposed as an effective solution to find clusters in
the newly mapped low-dimensional space. It eliminates the need for the
subtractive basis vector and encoding calculations present in other dimensionality
reduction techniques such as spectral clustering [166], which is expensive.

Figure 2: Distance concentration in higher dimensions [52]
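The distance concentration effect discussed in this section can be observed with a small simulation. The sketch below is an illustration under assumed settings (uniformly random points in the unit hypercube, Euclidean distance, 200 points, a fixed seed) and is not taken from the experiments in this paper; it measures the relative contrast between the farthest and nearest point from a random query.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min from a random query to uniform random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        squared = sum((q - p) ** 2 for q, p in zip(query, point))
        distances.append(math.sqrt(squared))
    return (max(distances) - min(distances)) / min(distances)


# The contrast between far and near points shrinks as dimensionality grows.
contrast_low = relative_contrast(dim=2)
contrast_high = relative_contrast(dim=1000)
```

A contrast close to zero means that the nearest and farthest points are almost equally distant from the query, which is what makes distance-based cluster decisions unreliable in high dimensions.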

Spectral clustering identifies subgroups with non-convex geometric structures

compared to traditional clustering methods such as k-means [143]. The origi-

nal data is projected into the new coordinate space, which encodes information

about how nearby data points are. The eigenvalues (through the first k
eigenvectors) of the Laplacian matrix of the data are used to perform dimensionality
reduction. The similarity transformation reduces the dimensionality of space and

pre-clusters the data into orthogonal dimensions [143]. This pre-clustering is

non-linear and allows for arbitrarily connected non-convex geometries. However,

spectral clustering faces the fundamental limitation of depending on the selected

eigenvectors [141]. These selected values cannot successfully cluster datasets that

contain structures at different scales in size and density such as the text data

[141]. In contrast, NMF is able to directly encode text data, preserving natu-

ral non-negativity into the two lower rank factors, which automatically represent


groups in documents and terms. Therefore, this paper proposes an NMF-based

approach to represent the document cluster assignment.

However, NMF and other dimensionality reduction methods face information loss

while compressing high to low dimensions [149]. This dimensionality reduction

fails to capture the geometric relationships in original data and neighboring points

in high-dimension do not remain as close points in the projected low-dimensional

space [146]. Researchers have explored different approaches to assist traditional

NMF to avoid this information loss. The term adjacency matrix has been used

to assist factorization of document × term matrix to semantically enhance the

clustering decision [168]. It learns the relationships of terms to its context for

enforcing geometric relationships.

Manifold learning is another approach to combine with the NMF framework to

discover and maintain the geometric structure of data when projecting it to a low-

dimensional space [131]. Manifold learning algorithms vary based on the type of

the geometry they attempt to preserve, such as local, global or hybrid [203].

Local methods, also known as spectral clustering, encode the local geometry of

the data, which shows the high representational ability by building an affinity

graph that incorporates neighborhood information [31]. Global methods give a

more reliable embedding by preserving geometric relationships at different levels

[203] though they are computationally expensive. However, hybrid methods that

combine positive capabilities of both aforementioned approaches have given better

performance in manifold learning [203]. Therefore, in this paper, we are proposing

a clustering algorithm, which uses both local and global NN relationships.

Co-clustering is another parallel branch of methods that uses the dual informa-

tion within the rows and columns of the matrix data simultaneously for clustering

[66]. Traditional methods focus on one-side clustering, i.e., clustering data based

on features of the data. In contrast, co-clustering groups data points based on


their distribution on features while concurrently grouping features based on their

distribution on the data points [66, 167]. This concept is used in document clus-

tering to assist each other in improving the quality of the clustering solution. It

simultaneously clusters documents and words to find a global clustering solution,
where word clustering induces document clustering and document clustering
induces word clustering [44]. These interesting extensions with different information

assistance show the capability to minimize the information loss in dimensionality

reduction.

Generally, geometric relationships are considered with nearest neighbors [143].

Simple pairwise comparisons between points, calculated locally considering
the VSM, are used to calculate the NNs. This information of nearby data points is

used in the clustering process to group them [143]. Distinct from this document

affinity, which only relies on the local information, Information Retrieval (IR)

systems are capable of producing global NNs. Given a document query against

the entire document collection organized in the form of the inverted index, a

search engine retrieves the related documents ranked by the relevancy order to

the query maintaining near and far relationships [59]. In this paper, we propose to

utilize this novel concept to generate NNs by encoding global information present

in the data.

The use of IR for clustering is an emerging area. In our prior works, we have

utilized this concept, IR-based NNs for partitional and density-based document

clustering methods [140, 173]. The frequent nearest neighbors for points in high

dimensions are known as Hubs [173]. Data points in high-dimensional data tend

to be closer to these hubs than cluster mean. These hub points were generated

using IR ranking responses in [173] and used in assigning a data point into a

cluster considering the closest hub. In [140], a Shared Nearest neighbor graph

is constructed using the ranked results and the density estimation is done on


the graph to assign documents to clusters. The approach employed in CCNMF

is distinct from these works. We use the local NNs generated with a simple

pairwise similarity calculation as well as the global NNs calculated using the

ranked results to incorporate the geometrical structure of the data during the

matrix factorization process.

This paper proposes a novel approach to assist non-negative document × term

matrix factorization using document adjacency matrices with locally and globally

generated nearest neighbor information in the document collection. In contrast

to semantic assistance given in [168] and manifold learning in [131], CCNMF

generates local NNs with pairwise document comparison and global NNs with

IR ranking to assist NMF with the geometric structures. Our hypothesis in CC-

NMF is to incorporate both consensus and complementary information in docu-

ment clustering through NNs and document-term distribution. CCNMF clusters

documents utilizing adjacency distribution on documents while simultaneously

clustering documents based on term distribution. Within each iteration in the

optimization process, CCNMF exchanges this information to induce document

clustering.

In summary, CCNMF uses

• NMF-based clustering [39, 50] on naturally non-negative text data to iden-

tify the groups in a document collection.

• geometric relationships between data points as in manifold learning methods

[143, 203], for assisting the clustering process.

• global NNs calculated with the IR system, as used in partitional and
density-based clustering [140, 173], to have a more faithful clustering solution.


3 Consensus and Complementary Non-negative

Matrix Factorization (CCNMF)

3.1 Overview of CCNMF

Let D = {d1, d2, ...dN} be a document collection of N documents that contain

a set of distinct terms {t1, t2, ..tM}. A document di is represented as a set of v

distinct terms {t1, t2, ..tv} where v ≪ M . Let matrix A represent the M × N

term-document matrix with the entries as term counts. Let D be also organized

in the form of an inverted indexed data structure. The inverted index keeps a

dictionary of terms, together with a posting list that indicates which documents

the term occurs in [133]. A search engine can efficiently rank a document with

respect to the query using its inverted index [133].
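The inverted index and the ranking step can be sketched as follows. This is a simplified stand-in for a search engine, not the scoring used by the IR system in this paper: the function names, the smoothed idf term log(1 + N / df) and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term frequency) pairs."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            index[term].append((doc_id, freq))
    return index


def rank(query_terms, index, n_docs, k):
    """Score documents with a simple tf*idf sum and return the top k.

    Only documents that appear in the posting list of at least one query
    term are touched, which is what makes the look-up efficient.
    """
    scores = defaultdict(float)
    for term in set(query_terms):
        postings = index.get(term, [])
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, freq in postings:
            scores[doc_id] += freq * idf
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical pre-processed documents.
docs = {
    "d1": ["rugby", "players", "ball"],
    "d2": ["rugby", "ball", "oval"],
    "d3": ["cricket", "round", "ball"],
}
index = build_inverted_index(docs)
result = rank(["rugby", "oval"], index, n_docs=len(docs), k=2)
```

Here "d3" is never scored because it shares no term with the query, so the cost of a query depends on the length of the posting lists visited rather than on a comparison with every document.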

CCNMF calculates the local NNs using a pairwise distance calculation and the

global NNs using an IR system to form the adjacency matrix S1 and S2 respec-

tively. The symmetric affinity matrices S1 and S2 are modeled with Skip-Gram

with Negative Sampling (SGNS) [117] weighting to make an affinity value for

point pairs considering their closer existence as neighbors in the entire collection.

This encodes geometric information of data points in the entire collection, i.e.,

how nearby the points are.

CCNMF uses a novel NMF-based approach to decompose the input data matrix

A and two affinity matrices S1 and S2 to identify W , H, H1 and H2. The over-

all process of CCNMF is shown in Figure 1. The overarching aim is to obtain

the optimum document-cluster matrix H ∈ R^{N×G} that harmonizes the
information given by each input matrix. CCNMF enables complementary information

through different metrics in the matrix factorization process while maintaining


consensus information with the update rules of the factor matrices. CCNMF

learns a cluster label for each di ∈ D by assigning it to the highest coefficient

cluster g ∈ G in H.
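The final label assignment amounts to a row-wise arg-max over H. A minimal sketch, with a hypothetical 3 × 3 document × cluster matrix and ties resolved in favour of the lowest cluster index (an assumption of this example):

```python
def assign_clusters(h):
    """Return, for each document row of h, the index of its largest coefficient."""
    return [max(range(len(row)), key=row.__getitem__) for row in h]


# Hypothetical document x cluster coefficients for three documents.
H = [
    [0.90, 0.10, 0.00],
    [0.20, 0.70, 0.10],
    [0.05, 0.15, 0.80],
]
labels = assign_clusters(H)
```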

3.2 Nearest Neighbor Calculation with Skip-Gram with

Negative Sampling

Nearest Neighbors

To obtain the top-k local NNs using distance differences and the top-k global NNs
using common as well as rare terms, two symmetric affinity matrices S1 and
S2 are generated considering all the documents in D. The local NNs in S1
are generated using distance differences. The cluster hypothesis [91] and the reverse
cluster hypothesis [59] show the embedding of semantic relationships in ranking-based
similarity identification. In line with them, a ranking function employed
in a search engine is used to identify the global NNs in S2.

Local Nearest neighbors The documents in D are represented using Vector

Space Model. A pairwise document comparison using Euclidean distance is car-

ried out on all pairs in the collection. Let {l1, l2, ...lN} be the list that includes

pairwise distance between di ∈ D and all N documents in the collection. A set

of closest k documents to di which show the lowest distances is considered as the

local k-NNs. This set of documents DlNN of di ∈ D can be represented as follows.

DlNN = {dp : p = 1, 2, . . . , k} ← top k (sort ({l1, l2, . . . , lN})) (1)


Let S1 be a N × N document-document matrix where each row in the matrix

represents the DlNN that the document di ∈ D has by showing the value 1.

S1_{(i,j)} = \begin{cases} 1, & \text{if } d_j \in D_{lNN}(d_i) \\ 0, & \text{if } d_j \notin D_{lNN}(d_i) \end{cases} \qquad (2)
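Eqs. 1 and 2 can be sketched as below. The function names and the toy term-count vectors are assumptions made for the illustration; a document is at distance 0 from itself and is therefore counted among its own k nearest neighbors here, which is consistent with the diagonal entries of the matrices in Figure 3.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def local_nn_matrix(vectors, k):
    """Binary N x N matrix: entry (i, j) is 1 when d_j is among the k
    documents closest to d_i under Euclidean distance (Eqs. 1 and 2)."""
    n = len(vectors)
    s1 = [[0] * n for _ in range(n)]
    for i in range(n):
        distances = [(euclidean(vectors[i], vectors[j]), j) for j in range(n)]
        # Sorting by (distance, index) keeps ties deterministic.
        for _, j in sorted(distances)[:k]:
            s1[i][j] = 1
    return s1


# Toy term-count vectors: two documents about one topic, two about another.
vectors = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 3],
]
S1 = local_nn_matrix(vectors, k=2)
```

The pairwise loop makes the quadratic cost of the local NN calculation explicit, which is the cost that the IR-based global NN calculation avoids.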

Global Nearest neighbors Let di ∈ D be posed as a document query using

a set of distinct terms {t1, t2, ...tt} contained in di where t ≤ v. A set of retrieved

top-ranked documents by employing a ranking function Rf in the IR system is

considered as the global NNs. The ranking function Rf employed in the IR system

extracts the documents in D with the respective relevancy scores vector r for a

given query document q in the relevancy order. The most relevant k documents

among them are selected to be global NN documents DgNN of di ∈ D .

Rf : q → DgNN = {(dp, rp) : p = 1, 2, . . . , k} (3)

Let S2 be a N × N document-document matrix where each row in the matrix

represents the DgNN that the document di ∈ D has by showing the value 1.

S2_{(i,j)} = \begin{cases} 1, & \text{if } d_j \in D_{gNN}(d_i) \\ 0, & \text{if } d_j \notin D_{gNN}(d_i) \end{cases} \qquad (4)
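Eqs. 3 and 4 can be sketched in the same way once the ranked results are available. The data layout below (a dictionary from a document index to its ranked list of (document index, relevancy score) pairs, best first) and the scores themselves are hypothetical; in practice they would be returned by the IR system for each document query.

```python
def global_nn_matrix(ranked_results, n_docs, k):
    """Binary N x N matrix: entry (i, j) is 1 when d_j is among the top-k
    documents retrieved when d_i is posed as a query (Eqs. 3 and 4)."""
    s2 = [[0] * n_docs for _ in range(n_docs)]
    for i, results in ranked_results.items():
        for j, _score in results[:k]:
            s2[i][j] = 1
    return s2


# Hypothetical ranked results for four document queries. A query may
# return fewer than k documents, as for document 2 here.
ranked_results = {
    0: [(0, 5.1), (1, 3.2), (3, 0.4)],
    1: [(1, 4.8), (0, 3.0)],
    2: [(2, 6.0)],
    3: [(3, 5.5), (2, 1.1), (1, 0.2)],
}
S2 = global_nn_matrix(ranked_results, n_docs=4, k=2)
```

Only the membership in the top-k list is kept in the matrix; the relevancy scores themselves are discarded at this stage.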

Example: The toy document collection in Fig. 3 shows the local NNs obtained

using pairwise distance calculation and the global NNs obtained using IR ranking

similarity. For the first three documents, the distance-based approach is able to

identify the NNs more accurately in comparison to the IR system. In contrast,

the IR system identifies the NNs accurately for the last two documents. This

shows the need for using both the local and global NNs in identifying clusters.



ID   Document
d1   Rugby players are in the ground with a ball
d2   Rugby ball is oval shape
d3   Rugby team contains fifteen players
d4   Cricket use round ball
d5   Cricket can play outdoor as well as indoor
d6   Cricket plays with twelve players

3-Nearest Neighbours based on Pairwise Distance (local NNs)

      d1  d2  d3  d4  d5  d6
d1     1   1   1   0   0   0
d2     1   1   1   0   0   0
d3     1   1   1   0   0   0
d4     0   1   0   1   1   0
d5     0   0   0   1   1   1
d6     0   0   1   0   1   1

3-Nearest Neighbours based on IR ranking Similarity (global NNs)

      d1  d2  d3  d4  d5  d6
d1     1   1   0   1   0   0
d2     1   1   0   1   0   0
d3     0   0   1   0   0   0
d4     0   1   0   1   0   1
d5     0   0   0   1   1   1
d6     0   0   0   1   1   1

The unique information given by each matrix show the requirement of using both local NNs and global NNs in identifying clusters

Figure 3: Example showing the importance of incorporating both local and global NNs

Skip-Gram with Negative Sampling modeling (SGNS)

CCNMF aims to capture the NN distribution in S1 and S2 effectively with SGNS modeling [168]. SGNS is used to highlight the documents that appear together more often in comparison to their neighborhood. The Skip-Gram model is one of the most popular neural-network-based techniques for learning word embedding representations [137]. It is able to capture the context of a word in a corpus, whereas the continuous bag-of-words model fails to do so [117]. The concept of negative (word, context) sampling is used with the Skip-Gram model to maximize the probability of an observed pair while minimizing the probability of unobserved pairs in distributed word representation [168]. SGNS has been shown to be equivalent to factorizing a (shifted) word correlation matrix whose cells are the point-wise mutual information of the respective word and context pairs [117]. We propose to model S1 and S2 with the Skip-Gram model in CCNMF to represent neighboring document pairs while considering other neighborhood pairs. The negative sampling concept in the Skip-Gram model enables the affinity matrices S1 and S2 to maximize the probability of document pairs that show high presence in comparison to their neighborhoods while minimizing the probability of document pairs that show less presence. We conjecture that SGNS will help to preserve the closeness of objects while projecting the original data to low-dimensional matrices.

Let c(di,dj) be the original cell value that represents the existence (i.e., 1) or

non-existence (i.e., 0) of NN relationship between di and dj in S1 or S2. The

SGNS weighting model considers only the cell values that show an NN relationship between di and dj with 1, together with how many documents di and dj have within their nearest neighborhood, counted through the number of 1's in the di row and dj column. By dividing the original cell value c(di,dj) by these neighborhood association values, CCNMF positions the NN relationship between di and dj with respect to any neighborhood, as in Eq. 5. This way of updating the affinity between the document pair di and dj in S1 and S2 gives more information than the binary representation, which only indicates whether a document is a neighbor or not.

$$\left[ S_1(d_i,d_j) \mid S_2(d_i,d_j) \right] = \log \left[ \frac{c(d_i,d_j) \times T}{\sum_{d_a \in D} c(d_a,d_i) \times \sum_{d_a \in D} c(d_a,d_j)} \right] \qquad (5)$$

where T is the total number of document pairs that appear as nearest neighbors

in the entire affinity matrix.

In line with negative sampling, cell values S1(di,dj) or S2(di,dj) that are less than 0 after taking the logarithm are converted to 0, to minimize the probability of document pairs that show less presence, as in [117].
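The weighting of Eq. 5 followed by the clipping step can be sketched as below, assuming a binary NumPy affinity matrix C as input. The function name `sgns_weight` is hypothetical. The sketch follows Eq. 5 literally and uses the column sums for both d_i and d_j; for a symmetric affinity matrix this coincides with the row and column counts described in the text. Cells with no NN relationship (value 0) are left at 0.

```python
import numpy as np

def sgns_weight(C):
    """Apply the SGNS-style weighting of Eq. 5 to a 0/1 affinity matrix C."""
    C = np.asarray(C, dtype=float)
    T = C.sum()                    # total number of NN pairs in the matrix
    col = C.sum(axis=0)            # sum over d_a of c(d_a, d_j) for every j
    denom = np.outer(col, col)     # product of the two neighborhood counts
    S = np.zeros_like(C)
    mask = (C > 0) & (denom > 0)   # only cells that show an NN relationship
    S[mask] = np.log(C[mask] * T / denom[mask])
    # Negative values are converted to 0, in line with negative sampling.
    return np.maximum(S, 0.0)
```

A pair whose documents both appear in many neighborhoods receives a lower weight than a pair that is specific to a small neighborhood, which is the effect the weighting is meant to capture.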



3.3 Matrix Factorization

The aim of CCNMF is to decompose the term × document matrix A modeled

as VSM with NMF utilising the documents association matrices S1 and S2. It

learns the final document-cluster membership matrix H using distinct information

attached with each input matrix. We call it complementary, as it utilises the

information from each input matrix independently.

Definition 1 Complementary Non-negative Matrix Factorization: It

is the process of learning two non-negative factor matrices that approximate each

input matrix (A, S1 and S2) and achieve the final document-cluster matrix by

emphasizing the characteristics specific to each input matrix.

CCNMF decomposes matrix $A \in \mathbb{R}^{M \times N}$ into two non-negative factor matrices $W \in \mathbb{R}^{M \times G}$ and $H \in \mathbb{R}^{N \times G}$, where G is a lower rank that represents the number of clusters [116]. It is formulated as follows.

$$A \approx WH^{T} \qquad (6)$$

The W and H matrices learn the term clusters and document clusters respectively. In order to preserve the associations between documents, CCNMF uses the local NNs and global NNs information with matrix A. The symmetric matrices S1 and S2, which carry the association information, are also decomposed into non-negative $H_1 \in \mathbb{R}^{N \times G}$ and $H_2 \in \mathbb{R}^{N \times G}$ respectively with matrix H as follows.

$$S_1 \approx HH_1^{T} \qquad (7)$$

$$S_2 \approx HH_2^{T} \qquad (8)$$

We propose the following objective function to reduce the learning error (in Frobenius norm) to approximate the input matrix by incorporating specific information given with each input matrix.

$$\min_{W,H \ge 0} \|A - WH^{T}\|_F + \min_{H,H_1 \ge 0} \|S_1 - HH_1^{T}\|_F + \min_{H,H_2 \ge 0} \|S_2 - HH_2^{T}\|_F \qquad (9)$$

This learning process obtains the final document-cluster assignment matrix H by

considering specific information from each input matrix. Thus, CCNMF can be

considered a complementary NMF for document clustering, as in Definition 1.

Solving the optimization problem

The NMF process combines the complementary information provided by input

matrices A, S1 and S2 to learn the document-cluster matrix H. At the same

time, it harmonizes the compatible information given by each input to achieve

the optimum H during the factorization process.

Definition 2 Consensus Non-negative Matrix Factorization: It is the

process of combining the inter-dependent information present in each input matrix

for learning the non-negative factor matrices to support the approximation of the

optimum document-cluster matrix.

CCNMF learns the inter-dependent factor matrices interactively, exchanging in-

formation within them. In each iteration of the optimization process, CCNMF

updates matrix entries sequentially. It solves these interdependent sub-problems

sequentially starting from W using the Block Coordinate Descent (BCD) algo-

rithm [103]. The BCD algorithm divides the matrix members into several disjoint

subgroups and iteratively minimizes the objective function with respect to the

members of each subgroup g ∈ G at a time.

$$W_{(:,g)} \leftarrow \left[ W_{(:,g)} + \frac{(AH)_{(:,g)} - (WH^{T}H)_{(:,g)}}{(H^{T}H)_{(g,g)}} \right] \qquad (10)$$



When BCD solves sub-problems that depend on each other, they have to be

computed sequentially to make use of the most recent values of the associated

factor matrices. In CCNMF, the most recent values of members at first iteration

are set to zeros at the initialization. Firstly, the BCD update rule has been used

for finding W in the NMF optimization using the term-document matrix A and

initial matrix H as in Eq. 10.

Secondly, matrix H is updated using the current values of W and other members

as follows:

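$$H_{(:,g)} \leftarrow \left[ H_{(:,g)} + \frac{(A^{T}W)_{(:,g)} + (S_1H_1)_{(:,g)} + (S_2H_2)_{(:,g)}}{(W^{T}W)_{(g,g)} + (H_1^{T}H_1)_{(g,g)} + (H_2^{T}H_2)_{(g,g)}} - \frac{(HH_1^{T}H_1)_{(:,g)} + (HH_2^{T}H_2)_{(:,g)} + (HW^{T}W)_{(:,g)}}{(W^{T}W)_{(g,g)} + (H_1^{T}H_1)_{(g,g)} + (H_2^{T}H_2)_{(g,g)}} \right] \qquad (11)$$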

Then, S1, the matrix representing local NNs-based affinity and most recent values

of H are used in updating H1 as in Eq. 12.

$$H_{1(:,g)} \leftarrow \left[ H_{1(:,g)} + \frac{(S_1H)_{(:,g)} - (H_1H^{T}H)_{(:,g)}}{(H^{T}H)_{(g,g)}} \right] \qquad (12)$$

Finally, H2 is updated using S2, the matrix representing global NNs-based affinity

and most recent values of H as in Eq. 13.

$$H_{2(:,g)} \leftarrow \left[ H_{2(:,g)} + \frac{(S_2H)_{(:,g)} - (H_2H^{T}H)_{(:,g)}}{(H^{T}H)_{(g,g)}} \right] \qquad (13)$$

Most recent values of the matrix entries are used in updating other factor matrices

in this iterative optimization process. This interdependent information exchange

between H ↔ H1 and H ↔ H2 allows NMF to combine information present in

each input matrix for learning final matrix representation of H. Thus, CCNMF

uses a consensus NMF process as in Definition 2.

The overall algorithm of CCNMF is given in Algorithm 1. The final document-cluster assignment matrix H represents the probability coefficients of each document being assigned to each cluster g ∈ G. We choose the cluster that possesses the highest coefficient within H as the cluster assigned to a specific document. CCNMF follows a hard clustering approach with the assumption that a document belongs to only one cluster.

Algorithm 1 Consensus and Complementary Non-negative Matrix Factorization
Input : Term-Document matrix A
        Local NN affinity matrix S1
        Global NN affinity matrix S2
        Number of Clusters G
Output: Final Document-Cluster matrix H
Init: W ≥ 0, H ≥ 0, H1 ≥ 0, H2 ≥ 0 random real numbers
while Eq. 9 has not converged do
    foreach g = 1 : G do
        Compute W using Eq. 10
        Compute H using Eq. 11
        Compute H1 using Eq. 12
        Compute H2 using Eq. 13
    end
end
Convergence: old error − new error < 1e-3 OR number of iterations > 100
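Algorithm 1 with the column-wise updates of Eqs. 10 to 13 can be sketched in NumPy as follows. Several details are assumptions made for this sketch and are not stated in the text: the outer bracket in each update is read as a projection onto non-negative values, a small constant `eps` guards the divisions, the tracked error is the sum of squared Frobenius norms (the least-squares form that these column updates decrease), and the function name `ccnmf` and the fixed random seed are illustrative. S1 and S2 are taken to be symmetric, as stated above.

```python
import numpy as np

def ccnmf(A, S1, S2, G, max_iter=100, tol=1e-3, seed=0):
    """Sketch of Algorithm 1. A is M x N (terms x documents); S1, S2 are N x N."""
    rng = np.random.default_rng(seed)
    M, N = A.shape
    W, H = rng.random((M, G)), rng.random((N, G))
    H1, H2 = rng.random((N, G)), rng.random((N, G))
    eps = 1e-12  # assumption: avoids division by zero for empty columns

    def error():
        # Assumption: squared Frobenius norms of the three approximations.
        return (np.linalg.norm(A - W @ H.T) ** 2
                + np.linalg.norm(S1 - H @ H1.T) ** 2
                + np.linalg.norm(S2 - H @ H2.T) ** 2)

    history = [error()]
    for _ in range(max_iter):
        for g in range(G):
            HtH = H.T @ H
            # Eq. 10: update column g of W from A and the current H.
            W[:, g] = np.maximum(
                0.0, W[:, g] + (A @ H[:, g] - W @ HtH[:, g]) / (HtH[g, g] + eps))
            # Eq. 11: update column g of H from A, S1 and S2 together.
            Q = W.T @ W + H1.T @ H1 + H2.T @ H2
            num = A.T @ W[:, g] + S1 @ H1[:, g] + S2 @ H2[:, g] - H @ Q[:, g]
            H[:, g] = np.maximum(0.0, H[:, g] + num / (Q[g, g] + eps))
            # Eqs. 12 and 13: update H1 and H2 with the most recent H.
            HtH = H.T @ H
            H1[:, g] = np.maximum(
                0.0, H1[:, g] + (S1 @ H[:, g] - H1 @ HtH[:, g]) / (HtH[g, g] + eps))
            H2[:, g] = np.maximum(
                0.0, H2[:, g] + (S2 @ H[:, g] - H2 @ HtH[:, g]) / (HtH[g, g] + eps))
        history.append(error())
        if history[-2] - history[-1] < tol:  # convergence test of Algorithm 1
            break
    # Hard clustering: each document takes the cluster with the highest coefficient.
    return H, H.argmax(axis=1), history
```

The returned `history` list makes it possible to inspect how the approximation error evolves over the iterations.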

4 Experiments

4.1 Datasets and Experiment setup

We used four publicly available English text datasets and one permission-accessible dataset, as reported in Table 1. DS1¹ consists of webpages collected by the WebKB project of the CMU text learning group. DS2 and DS3² are popular news data collections known as Reuters 21578 (R52) and 20Newsgroup, respectively. DS4³ is a news dataset released for the topic detection and tracking task. DS5⁴ is a healthcare service dataset obtained from Kaggle. These datasets show diverse characteristics in terms of the number of clusters and size of the collection that facilitate the analysis of CCNMF.

Table 1: Summary of the datasets used in the experiments

Dataset                       # of documents   Mean & median terms per document   # of unique terms   # of clusters
WebKb (DS1)                   4199             77, 59                             7668                4
R52 (DS2)                     9100             46, 33                             19479               52
20Newsgroup (DS3)             18821            97, 73                             69610               20
TDT5 (DS4)                    27468            157, 133                           113708              40
HealthServicesTickets (DS5)   50000            4, 4                               12106               21

¹ http://ana.cachopo.org/datasets-for-single-label-text-categorization
² http://ana.cachopo.org/datasets-for-single-label-text-categorization

Datasets have been preprocessed with word stemming and stop-word removal. Matrix A is represented as a Vector Space Model (VSM) with term frequency as the weighting. In all experiments, when a document is posed as a query to the IR system, it is represented with its top-10 terms in the order of term frequency, as in [173]. This paper uses the Elasticsearch 2.4 search engine as the IR system and obtains the top-m (m=10) documents returned in response to a document query; these have been shown to possess sufficient information richness [199]. The tf ∗ idf ranking function is used to measure the relevancy between the query document and the responses. Experiments were done using Python 3.5 on a single processor of a 1.2 GHz Intel (R) Xeon (R) with 264 GB of shared memory.

³ https://catalog.ldc.upenn.edu/LDC2006T18
⁴ https://www.kaggle.com/


Table 2: Performance comparison: CCNMF with standard and latest baselines

        F1-score                                 NMI
        DS1   DS2   DS3   DS4   DS5   Avg        DS1   DS2   DS3   DS4   DS5   Avg
M1      0.58  0.44  0.41  0.36  0.25  0.41       0.33  0.51  0.51  0.43  0.22  0.4
M2      0.46  0.27  0.18  0.23  0.21  0.27       0.27  0.41  0.23  0.35  0.23  0.3
M3      0.44  0.33  0.34  0.23  -     0.34       0.01  0.41  0.44  0.35  -     0.3
M4      0.37  0.18  0.1   0.13  0.27  0.21       0.16  0.31  0.13  0.25  0.2   0.21
M5      0.45  0.26  0.18  0.25  0.24  0.28       0.25  0.46  0.23  0.39  0.2   0.31
M6      0.5   0.55  0.1   0.2   0.25  0.32       0.18  0.29  0.01  0.07  0.2   0.15
M7      0.18  0.28  0.18  0.17  0.25  0.21       0.13  0.34  0.1   0.39  0.05  0.2

Note - M1: CCNMF, M2: NMF, M3: Spectral, M4: Coclustering, M5: k-means, M6: seaNMF, M7: RDDC

4.2 Benchmarking methods and evaluation measures

The state-of-the-art clustering methods k-means [8], NMF [3], spectral clustering

[8] based on k-NNs, and co-clustering [44] are used as the standard baselines.

Recent relevant methods, namely the IR ranking and density based RDDC [140] and

the term co-occurrences-based semantic NMF known as SeaNMF [168], have also

been used. Additionally, several variations of CCNMF have been evaluated to

investigate the inclusion of consensus and complementary components of NMF

as well as the use of local and global NN information. Standard pairwise F1-

score, which calculates the harmonic average of the precision and recall, and

Normalized Mutual Information (NMI), which measures the purity against the

number of clusters, are used as evaluation measures [140].
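The two evaluation measures can be sketched in plain Python as below. The exact variants used in the experiments are not specified in the text, so two assumptions are made: the pairwise F1-score counts a document pair as a true positive when it shares both a cluster and a ground-truth class, and NMI is normalized by the arithmetic mean of the two entropies. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pairwise_f1(truth, pred):
    """Harmonic mean of pairwise precision and recall over all document pairs."""
    tp = same_pred = same_true = 0
    for i, j in combinations(range(len(truth)), 2):
        p = pred[i] == pred[j]      # pair placed in the same cluster
        t = truth[i] == truth[j]    # pair shares the same ground-truth class
        same_pred += p
        same_true += t
        tp += p and t
    if tp == 0:
        return 0.0
    precision, recall = tp / same_pred, tp / same_true
    return 2 * precision * recall / (precision + recall)

def nmi(truth, pred):
    """Mutual information normalized by the mean entropy (an assumption)."""
    n = len(truth)

    def entropy(labels):
        return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

    ct, cp = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    denom = (entropy(truth) + entropy(pred)) / 2
    return mi / denom if denom > 0 else 1.0
```

Both measures are invariant to a relabeling of the clusters, so a clustering that matches the ground truth up to a permutation of cluster ids scores 1 on each.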

4.3 Accuracy

Comparison with baseline methods: Results in Table 2 show that CC-

NMF is able to achieve much higher accuracy as compared to all bench-marking

methods. Spectral clustering, which preserves the geometric structures between


documents, is the second-best method. Density-based RDDC shows the least

performance as the density concept fails in sparse text data. The semantic NMF,

seaNMF, also produces lower NMI. It is interesting to note that it even performs

more poorly than the traditional NMF for NMI values. In comparison to CC-

NMF, which obtains the geometric relationship by using association relationships

between documents, seaNMF [168] uses the term co-occurrences to provide se-

mantic information during NMF. However, this semantic assistance given with

term association in seaNMF is able to produce much higher F1 score for col-

lections with overlapping clusters such as in DS2. In contrast, CCNMF shows

52% increase in F1 score and 33% increase in NMI compared to normal NMF.

This improved performance shows the importance of using complementary and

consensus information for assisting matrix factorization to preserve the geometric

relationship.
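The reported gains over normal NMF can be checked against the average columns of Table 2 (M1 versus M2). The short sketch below only reproduces that arithmetic; the helper name `relative_gain` is hypothetical.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# Average F1-score and NMI of CCNMF (M1) and NMF (M2) from Table 2.
f1_gain = relative_gain(0.41, 0.27)
nmi_gain = relative_gain(0.40, 0.30)
print(round(f1_gain), round(nmi_gain))
```

The rounded values agree with the 52% and 33% increases stated above.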

DS5 contains short text compared to the other datasets of news or web blog data, which show medium text size. The short nature of the text vectors hinders spectral clustering from finding valid eigenvectors through the few NNs identified in DS5. CCNMF also relies on the NN concept in identifying accurate clusters. Having few NNs directly impacts CCNMF; therefore, CCNMF is unable to gain much performance improvement on this dataset.

Comparing the use of different ways of obtaining NNs In CCNMF, we

incorporate both local and global NNs with the input term-document matrix

during factorization, to identify the accurate cluster representation. The next ex-

periments were conducted to find which combination is the most effective. Three

different settings were tested.

1. The factorization of matrix S2 was excluded from Eq. 9 ⇒ (A + S1) ⇒ NMF on term-document and local NN association matrices

2. The factorization of matrix S1 was excluded from Eq. 9 ⇒ (A + S2) ⇒ NMF on term-document and global NN association matrices

3. The factorization of matrix A was excluded from Eq. 9 ⇒ (S1 + S2) ⇒ NMF on local NN and global NN association matrices

Table 3: Performance comparisons: Incorporating NNs

        CCNMF          CCNMF with only        CCNMF with only         NMF with local and
                       local NNs (A + S1)     global NNs (A + S2)     global NNs (S1 + S2)
        F1     NMI     F1     NMI             F1     NMI              F1     NMI
DS1     0.58   0.33    0.51   0.51            0.51   0.21             0.54   0.24
DS2     0.44   0.51    0.31   0.31            0.31   0.48             0.36   0.42
DS3     0.41   0.51    0.31   0.31            0.31   0.38             0.35   0.41
DS4     0.36   0.43    0.25   0.25            0.25   0.33             0.28   0.37
DS5     0.25   0.22    0.23   0.19            0.28   0.19             0.24   0.19
Avg.    0.41   0.4     0.32   0.31            0.33   0.32             0.35   0.33

Results in Table 3 show that using only local NNs or only global NNs with the input data matrix is inferior to combining both types of complementary information as in CCNMF. They also show that combining the two types of NNs alone is not sufficient to obtain superior clustering performance; the term-document representation of the text documents is also required. The use of term distribution can represent the semantic relationships between documents and uplift the clustering quality. However, CCNMF can also outperform the combinations that use either local or global NNs with the term-document representation. In DS1, the smallest dataset with four classes, the assistance of local NNs gives superior performance to global NNs, while in DS2 - DS4, which contain more documents and clusters, the assistance of global NNs is superior. This validates that global NNs need to be considered for preserving the geometric relationships at different levels. Furthermore, IR ranking, which identifies the relevant nearest neighbors through an inverted index data structure, accurately forms the NN affinity (global


NNs) minimizing the distance concentration as evident in superior performance in

many datasets compared to local NNs. Results in DS5 show combining only global

NNs is able to produce much higher results than having both due to significant

distance concentration in the extremely sparse short text. Thus, poor local NN

information reduces the overall results of CCNMF.

Analysis of the objective function in CCNMF In CCNMF we use com-

plementary information given by S1 and S2 by approximating them using factor

matrices H1, H2 and H with minimum error as given in Eq. 9. It uses the

consensus information to form matrix H through updating H, H1 and H2 inter-

changeably. The next set of experiments was conducted to find the best consensus

and complementary technique. Two different settings are tested.

1. CCNMF without consensus information in the objective function,

$$\min_{W,H_a \ge 0} \|A - WH_a^{T}\|_F + \min_{H,H_1 \ge 0} \|S_1 - H_bH_1^{T}\|_F + \min_{H,H_2 \ge 0} \|S_2 - H_cH_2^{T}\|_F$$

As this is without consensus, this factorization process is not learning a common H. Instead it learns different H - i.e., Ha, Hb and Hc from the inputs and then takes the average of them to obtain the final H.

2. CCNMF without using complementary information in the objective function,

$$\min_{W,H \ge 0} \|A - WH^{T}\|_F + \min \|HH_1^{T} - HH_2^{T}\|_F$$

This process uses the same update rule as CCNMF to have consensus information. However, this only maintains the least difference between local ($S_1 \approx HH_1^{T}$) and global ($S_2 \approx HH_2^{T}$) NNs within the factorization process and does not directly consider the approximation of S1 and S2 for optimization.


Table 4: Performance comparisons: Variations in factorization process

        CCNMF          CCNMF without     CCNMF without
                       consensus         complementary
        F1     NMI     F1     NMI        F1     NMI
DS1     0.58   0.33    0.41   0.02       0.52   0.29
DS2     0.44   0.51    0.21   0.10       0.18   0.32
DS3     0.41   0.51    0.09   0.02       0.31   0.4
DS4     0.36   0.43    0.11   0.07       0.07   0.07
DS5     0.25   0.22    0.15   0.05       0.19   0.19
Avg.    0.41   0.4     0.19   0.05       0.25   0.25

We analyzed all these options of approximating factor matrices to confirm that

the way we used in CCNMF is the best.

Results in Table 4 show that proposed CCNMF which uses both complementary

and consensus information, is superior to only using consensus information or

complementary information. CCNMF without consensus does not focus on a

common H and combines different variations of H (i.e., Ha − Hc) to get the

final H. In contrast, the proposed CCNMF incorporates consensus information

through interdependent update rules. Specifically, exchanging inter-dependent

information in obtaining final H can give a more accurate clustering solution.

Combining separately learned cluster assignments based on NNs and document

representation is not able to assist the document matrix factorization process. In

fact, it degrades the performance of the original NMF process. CCNMF without

complementary information only minimizes the difference between S1 and S2

within the factorization. This approach does not consider specific information

given by S1 and S2 as in the proposed CCNMF. These results confirmed that

specific information given by local NNs and global NNs have to be considered,

with NMF in achieving better performance.


[Figure 4 panels, one per dataset (DS1, DS2, DS3, DS4, DS5): relative approximation error (y-axis) against number of iterations, 0 to 99 (x-axis)]

Figure 4: Optimization in CCNMF

Figure 5: Performance Comparison with and without using SGNS weighting


In order to achieve the optimized solution of CCNMF, we iteratively minimize

the approximation error in the factorization process over 100 cycles. Figure 4

shows that CCNMF approaches the optimum solution within the 10-40 iteration

range for all datasets.

The SGNS weighting used for matrices that represent the NNs is one of the major strengths of CCNMF. We empirically validate this concept by using binary representation for the NN entries against SGNS. Figure 5 shows that the use of SGNS weighting for the S1 and S2 matrices consistently provides superior results for all the datasets except DS5. There is no improvement in DS5 with this weighting as it is a short text dataset where only a few NNs can be identified due to having only a few terms in each document. It produces fewer word co-occurrence patterns, which are the basis of finding NNs and showing the NN association in SGNS. Using SGNS, CCNMF can capture the geometric relationship between documents more accurately, as shown by the improved clustering performance.

[Figure 6 panels: (a) NMI per dataset (DS1 to DS5) for the TFIDF and BM25 ranking functions, with the F1-score displayed on top of each bar; (b) NMI against the number of NNs (10 to 130) for DS1 to DS5, with the number of NNs that gave the highest NMI and F1-score marked]

Figure 6: Performance Comparison with parameters - different ranking functions and number of NNs

The success of CCNMF relies on the IR ranking function used to obtain global NNs as well as the number of NNs used in CCNMF. The two most popular ranking functions are tf*idf and BM25 [173]. Figure 6(a) shows that the tf ∗ idf ranking function gives slightly better performance in most of the datasets. Therefore, tf ∗ idf is set as the default for all the experiments. Figure 6(b) shows how performance varies with the selected number of NNs. We select the number of NNs that gives the highest NMI and F1 score to represent the S1 and S2 affinity matrices for each dataset.


Figure 7: Performance Comparison with different cluster numbers k

Another crucial factor in CCNMF is the low-rank dimension G, used in matrix

factorization process to project the data from high-dimension to low-dimension.

We set this as the number of classes/clusters within the dataset. Figure 7 validates

that CCNMF can produce the best performance when setting G as the number of

clusters. It shows that 4, 52, 20, 40 and 21, which are the exact cluster numbers according to the ground truth, produce the best NMI and F1 scores for DS1 - DS5 respectively. It can be noted that the curves in Figure 7 resemble those of the elbow method. This indicates that the elbow method or the average silhouette score [10] can be used to infer the low-rank number G in real-world scenarios where the cluster label is unknown. For example, Figure 8 shows that G should be set to 4 based on the highest average silhouette score for the WebKb dataset.
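The average silhouette score mentioned above can be sketched as follows, assuming Euclidean distances, dense document vectors, and at least two clusters. The function name `mean_silhouette` and the convention of scoring singleton clusters as 0 are assumptions of this sketch. To choose G, one would cluster the collection for several candidate values and keep the value with the highest score.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette score of a clustering (requires >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum() - 1          # other members of the own cluster
        if n_same == 0:
            scores.append(0.0)           # convention for singleton clusters
            continue
        a = dist[i, same].sum() / n_same  # mean distance within own cluster
        # Mean distance to the closest other cluster.
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Scores close to 1 indicate compact, well-separated clusters, while negative scores indicate documents that sit closer to another cluster than to their own.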

4.5 Complexity Analysis

Figure 8: Selecting the low-rank dimension G based on average silhouette score

This section explores the computational complexity of all the considered methods excluding the input preparation time. The objective in CCNMF is to obtain higher clustering accuracy by overcoming the information loss in the NMF process. We want to obtain a similar complexity to NMF-based clustering methods without incurring a high cost for additional processes. Consider a collection with N documents and M feature dimensions. As expected, the complexity of CCNMF (i.e., O(N^2)) is lower than that of spectral clustering (O(N^3)) but higher than that of k-means (O(N)) and RDDC (O(NM)) for larger datasets. All the NMF-related methods have O(N^2) complexity. However, seaNMF, which combines two matrices in finding a final solution, consumes more time than the general NMF. Similarly, CCNMF, which combines three matrices, consumes more time than seaNMF.

5 Conclusion

This paper proposes a novel document clustering method CCNMF, leveraging the

use of local NN affinity and ability of IR ranking to identify relevant documents

as global NNs. CCNMF combines local and global NNs to preserve geometric

structure together with documents represented with a vector space model in Non-

negative Matrix Factorization to deal with the sparseness in high dimensional

text vectors. We conjecture that the technique used in CCNMF to combine

complementary and consensus information can approximate lower dimensional


factor matrices of high dimensional text to accurately determine the clusters for

documents. We show that the Skip-Gram with Negative-Sampling weighting

used in the NN representation can boost the clustering accuracy by capturing

the presence of a document pair with respect to any neighborhood.

Empirical results conducted on several datasets, benchmarked with several clus-

tering methods, show that CCNMF overcomes the issues attached with sparse

vectors and provides the clustering solution with consistently higher accuracy

than all relevant baseline methods. The use of both local and global NN affinity

shows superior results preserving the geometric relationships in original data com-

pared to other dimensionality reduction methods such as normal NMF, seaNMF

and spectral clustering. Further, this paper validates the superiority of using

consensus and complementary information as in CCNMF.

However, using the pairwise comparison to generate local NNs is an expensive

process. In the future, we aim to investigate an effective local NN calculation

process. A problem such as community detection needs to handle user-content as

well as user-user relationships that are modeled with friendship information and

interaction information, which represent local and global associations respectively.

Extending this approach to meaningful community detection for this type of

context is also open for future investigation.


Paper 3: Corpus-based Augmented Media Posts

with Density-based Clustering for Community

Detection




Published In: IEEE International Conference on Tools with Artificial Intelli-

gence (ICTAI), 05-07 November 2018, Volos, Greece



















Date:


Signature:on the design and formulationof the concepts, method and experiments,


Nayak

26/03/2020

a Mohotti

27/03/2020




ABSTRACT: This paper proposes a corpus-based media posts expansion tech-

nique with a density-based clustering method for community detection. To enrich

the user content information, firstly all (short-text) media posts of a user are com-

bined with hash tags and URLs available with the posts. The expanded content

view is further augmented by the virtual words inferred using the novel concept

of matrix factorization based topic proportion vector approximation. This ex-

pansion technique deals with the extreme sparseness of short text data, which

otherwise leads to insufficient word co-occurrence and, hence, inaccurate out-

comes. We then propose to group these augmented posts which represent users

by identifying the density patches and form user communities. The remaining

isolated users are then assigned to communities to which they are found most

similar using a distance measure. Experimental results using several Twitter

datasets show that the proposed approach is able to deal with common issues at-

tached with (short-text) media posts to form meaningful communities and attain

high accuracy compared to relevant benchmarking methods.

KEYWORDS: community detection; corpus-based expansion; clustering; short

text; text mining

1 Introduction

Microblogging services are popular social networks which disseminate trending

information and assemble social views of users based on their short-text com-

munication. Clustering algorithms have been popularly used in these services

to promote applications such as topic detection, answering service recommenda-

tions, community detection and image/video tagging [93]. Community detection

in social media analysis has been found useful in identifying groups of users with

common interests to assist in viral and targeted marketing, political campaign-


ing, customized health programs, event identification and many other applications

[88, 144, 147].

Community detection is usually done via two means: (1) structure analysis and,

(2) content analysis. Structural analysis has been explored heavily to construct

a network representation based on user interactions and to find cohesive groups

by applying clustering to the graph model [151, 158, 169]. Researchers have

attempted to enrich this network representation by incorporating additional in-

formation such as hashtags and URLs [119]. Interpretation of similar groups

based on the network structure is challenging due to its complex and messy na-

ture. Users who belong to a common group make a connection with different

groups through the friendship information or other connections. They may fol-

low users belonging to different groups based on their own desires and emotions

[97]. This heterogeneous network structure analysis results in close ties that allow

the exchange of fine-grained information and is unable to produce high-level user

groups [62]. A network-based analysis, which considers users who write messages

as entities, is unable to give insight into what the community is interested in

[142].

Alternatively, the content analysis focuses on the pure insight of the communities

based on what they write. However, it has not yet been explored in detail due

to difficulties of handling high dimensional short text data. A handful of studies

exist that have used supervised and unsupervised learning methods to identify

commonalities among social posts. Clustering has been used to identify mental

health communities focusing on anxiety, depression, and PTSD from the Reddit

forum posts [147]. Supervised learning [88] and textual semantic similarity [144]

have been used to identify sub-topics in political tweets and events respectively.

However, these methods have been reported to suffer the problem of extreme

sparseness in short text data and result in poor outcomes.


The majority of the content-based community detection methods rely on text clus-

tering [147, 202]. A distance-based clustering method was used to assign each

user to the closest community considering the distance between texts which rep-

resent users [147]. However, it performed poorly due to distance concentration in

high dimensionality [90]. Specifically, distance differences between all data points

tend to become harder to distinguish as dimensionality increases [176]. Authors in

[144, 202] used a generative probabilistic clustering method to approximate prob-

abilities of the user being in pre-considered communities and derive the maximum

probable community. However, information loss is inevitable in these methods

due to the projection of high dimension to low dimension, and result in inac-

curate communities. Most importantly, these methods require the number of

communities to be provided as an input parameter.
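The distance concentration effect referred to above can be illustrated with a small NumPy experiment on uniformly random points. The function name `relative_contrast`, the sample size, and the seed are illustrative choices and do not come from the cited works. The relative contrast compares the gap between the farthest and nearest neighbor of a point with the distance to the nearest one.

```python
import numpy as np

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of the distances from one point to all others."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, dim))
    d = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from the first point
    return (d.max() - d.min()) / d.min()

low_dim = relative_contrast(2)
high_dim = relative_contrast(1000)
print(low_dim, high_dim)
```

In low dimensions the nearest and farthest points differ widely, whereas in high dimensions all distances become nearly equal, which is what makes distance-based assignment of users to communities unreliable for high-dimensional text.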

In contrast, density-based clustering methods such as DBSCAN [57] naturally find highly dense patches in a dataset and automatically identify clusters without the requirement to provide the cluster number a priori. These characteristics make a density-based method ideal for clustering text data. The density estimation process allows the identification of clusters with different shapes. However, these methods are limited in handling high-dimensional text data, where the feature space is usually sparse without much term co-occurrence, and face difficulty in distinguishing high-density regions from low-density regions [90]. This sparseness has been addressed by the concept of finding Shared Nearest Neighbours (SNN) and finding a varying dense text representation [56]. However, the short length of social media posts impairs the identification of SNNs because they share few common words [79]. Furthermore, a density-based method usually results in incomplete clustering and leaves some objects un-assigned to any cluster. This portion of un-assigned objects is relatively high in text data due to sparseness. A density-based method cannot be directly applied to community detection, where each user should be matched to a community according to the expression of their ideas/opinions online.


To deal with these issues, in this paper, we propose a novel hybrid clustering

approach that relies on the density concept to naturally identify clusters of ar-

bitrary shapes and uses the distance concept to place un-assigned objects to the

nearest clusters. In addition, to deal with the sparsity issue caused by the lack of word

co-occurrence, we propose to apply document expansion to augment the media

posts to be used in clustering. Document expansion, which is inherited from the

Information Retrieval field, has been used in clustering to deal with sparseness

[202]. External information sources have mainly been utilised to enrich the text

[93, 95]. However, context mismatch and structural incoherence between

the external source and the original data lead to poorer outcomes [202].

In this paper, we propose a novel media posts augmentation method that incor-

porates semantic characteristics of short-text using word-occurrences of the self-

corpus without using external resources. Firstly, we propose to use non-content

information such as hashtags and URLs to enrich the content of media posts.

This information has been used as separate views [5] or with each type of information as a

singleton view [18] in clustering to avoid extreme sparseness. We believe that

combining this information into a consolidated view would be beneficial for the

latent terms learning task. We then project the high-dimensional term space to a

low-dimensional space and infer the topic proportion vectors using the associated

semantic structure to identify virtual terms for posts. We propose to use the

Non-negative Matrix Factorization (NMF) based dimensionality reduction which

considers context based term weights to form topic vectors. The coefficients in

the topic matrix are used to statistically derive the top-n terms from topic vectors

to augment media posts.

This media post-expansion improves the word co-occurrences of important terms

in short text. Based on the augmented users' posts, the proposed density and

centroid based complete hard clustering mechanism is then used to group the


users to a single community. Quantitative analysis using several Twitter datasets

which belong to several groups reveals that the proposed approach is superior to

the state-of-the-art content-based community detection methods. In addition, a

case study was conducted and the qualitative analysis confirms the ability of the

proposed method to detect meaningful cohesive online communities. In comparison

to the many sub-communities identified using network analysis of disseminated

information, the proposed content-based method identifies fewer, more cohesive

communities.


Figure 1: Inference of virtual words for media post-expansion

More specifically, the contributions of the paper are as follows: (1) We put forward

the concept of document expansion to handle the sparsity issue of media posts.

We propose to use the abundant information available on social platforms, as well

as the corpus itself, to obtain additional terms inferred from the topics to resolve

the sparseness. (2) We propose a hybrid hard clustering method with density

and centroid concepts to naturally detect the meaningful exclusive communities

based on the augmented media posts.

To the best of our knowledge, this is the first work that uses NMF for augmenting

short text using topic vectors with a density-based clustering method for

community detection.

2 Community Detection With Hybrid Clustering

2.1 Preliminaries and Overview

Let there be N distinct users who are defined as {u1, u2, ..., uN} to be assigned

to a community. Each user ui has written n number of posts {p1, p2, ..., pn}.

Let Pi be a combined post representing all the posts {p1, p2, ..., pn} of the user

ui with the associated URLs and hashtags used by that user in the posts. Let

UP = {P1, P2, P3, ..., PN} be the enriched media post collection to represent all

N users' combined posts. Let the collection UP consist of M distinct terms

{t1, t2, t3, ..., tM}. The matrix representation of UP is usually extremely sparse. The

low-rank approximation of topic distribution over terms, generated using NMF

on the UP matrix, is used to obtain top-n terms associated with each topic as

in Fig. 1. Each text post representing a user in UP is expanded with the virtual

words taken from an appropriate topic vector to form the enriched corpus UP ′.

The augmented text collection UP ′ becomes input to the clustering process that

includes two steps. Firstly, dense regions are found in the UP ′ search space

and a distinct cluster label in C = {c1, c2, c3, ..., cl} is assigned to users/posts in

highly dense regions. Another set of posts P o that appears in low-density regions is

separated out. Finally, for each post Pi ∈ P o, a cluster label is assigned by finding

the closest dense region using a distance metric. Thereby, each user represented

by a post is assigned to a community.


Figure 2: The Content-based Community Detection Method

2.2 Augment Media Posts with Semantically Related Words

Basic media posts can be enriched using external sources or the corpus itself to

narrow the semantic gap created by their short length. The majority of previous works

utilise external knowledge sources such as Wikipedia, WordNet, Web search

results and other user-constructed knowledge bases [93, 95]. When social media

texts, which have frequent real-time updates, are enriched using static sources

such as Wikipedia or WordNet, these sources provide inadequate or inaccurate

information due to structural incoherence and lead to incomplete enrichment.

Corpus-based document enrichment [202] is an efficient way to

avoid these problems, because enrichment is based on the data itself, which

follows the same semantic structure. Latent Dirichlet Allocation (LDA) topic

modelling has been used previously to model the relationship between topics

and words, with virtual words sampled from the topics [202]. LDA is a

probabilistic generative model based on word counts; therefore, it cannot

effectively capture context, because it does not consider the importance of terms

within the document and the document corpus [3].

Latent Semantic Indexing (LSI) is another approach used to derive the latent con-

cepts by performing a matrix decomposition based on the term co-occurrences

[20]. LSI uses Singular Value Decomposition (SVD) to identify patterns in the

relationships between the terms and concepts contained in an unstructured

collection of text. SVD allows factors to contain both positive and negative entries.

However, the vector space model (VSM) of a document collection, which captures the

contextual importance of terms through non-negative term weights, forms a

non-negative matrix. The factors therefore need to be non-negative to directly model

the physical connection between terms and topics. As a remedy, our augmenting

technique identifies semantically related words through associations learned

under a non-negativity constraint.

NMF is a dimensionality reduction method which transforms high-dimensional

features to a lower dimension by enforcing the non-negative constraint in the ma-

trix decomposition to generate the non-negative factors that yield the lower rank

approximation [110]. In a high dimensional document model where we represent

the document-term relationship with weights, this approximation directly

corresponds to topics. NMF-based topic generation within the lower-dimensional space,

which considers term weights, is more accurate than term-count-based probabilistic

approximation and consumes less time than generative probabilistic models such

as LDA [110].

Let A be the M × N term-post matrix. Using NMF, we model A as the product

of two factor matrices W and H, as in Eq. 1. We use the Frobenius norm as the

objective function to obtain a stable approximation and iteratively

determine the optimum W and H with the minimum sum-of-squares error over all

elements of the approximation, as in Eq. 2, where W is an M × k non-negative

matrix, H is a k × N non-negative matrix and k < min(M, N) [110].

A ≈ WH (1)

$$\min_{W,H \ge 0} \frac{1}{2}\|A - WH\|_F^2 = \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(A_{i,j} - (WH)_{i,j}\right)^2 \qquad (2)$$

Topic membership in each media post is obtained with H, considering the maxi-

mum probability of a post belonging to a topic. This associated topic is used to

identify the virtual terms for each post using W where topic proportion vectors

are maintained. W in Eq. 3 envisages the likelihood of each term in the given

topic, from where we obtain top-n terms to be the probable terms of the given

topic. We have set this n in a parameter-independent way considering the coefficients

of W (as described in sensitivity analysis later). Each text post Pi ∈ UP which

represents a user is updated with the virtual terms that correspond to its topic

vector and the augmented dataset UP ′ is obtained.

$$W = \sum_{i=1}^{M}\sum_{K=1}^{k} p(t_i, K) \qquad (3)$$
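
The augmentation step described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the factorization uses the standard multiplicative update rules for the Frobenius objective, which the text does not specify, and the names `nmf`, `augment_posts` and `top_n` are hypothetical. The number of virtual terms is fixed here for simplicity.

```python
import numpy as np

def nmf(A, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative M x N matrix A into W (M x k) and H (k x N).

    Assumption: multiplicative updates minimizing the Frobenius error;
    the source does not state which solver was used.
    """
    rng = np.random.default_rng(seed)
    M, N = A.shape
    W = rng.random((M, k)) + eps
    H = rng.random((k, N)) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

def augment_posts(A, terms, k, top_n=2):
    """For each post (column of A), return the top_n terms of its dominant topic."""
    W, H = nmf(A, k)
    post_topic = H.argmax(axis=0)                 # dominant topic of each post
    top_terms = np.argsort(-W, axis=0)[:top_n].T  # (k, top_n) term indices per topic
    return [[terms[i] for i in top_terms[t]] for t in post_topic]

# Toy term-post matrix: rows are terms, columns are combined user posts.
terms = ["cancer", "chemo", "football", "goal"]
A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 2, 1],
              [0, 0, 1, 2]], dtype=float)
print(augment_posts(A, terms, k=2))
```

The returned virtual terms would be appended to the corresponding post before clustering.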

2.3 Hybrid (Hard) Clustering Method

The augmented dataset UP ′ is analysed to identify natural dense patches and to

form communities/clusters. Some posts/users that are not part of a dense patch

remain un-clustered. These isolated users are assigned to communities based on

a distance-based method. Algorithm 1 in Fig. 2 explains the hybrid clustering

method proposed for content-based community detection. The proposed method

needs no user-defined cluster number k and does not include

the expensive centroid-update steps, which are bottlenecks in community

detection. This method is also capable of identifying clusters of different shapes

with varying densities, unlike centroid-based methods, which produce

spherical clusters only.

The first step of this method is to identify dense communities based on the dense

patches that naturally exist in the data. Each post Pi ∈ UP ′, which represents a

user, is checked to determine whether it is a core dense data point, provided it is

not already a member of a dense community. A post Pi ∈ UP ′ is defined as a core

dense point when the region marked by the distance α around Pi contains at least r

data points. Then the region around a core dense point is expanded using the data

points in its region if they also satisfy the condition to be core dense data points.

This allows forming arbitrary shapes of communities/clusters according to their

close neighborhood.

This density-based community detection leaves some users unassigned to any of

the communities as they lie in less dense regions (i.e. P o). In sparse text data,

a traditional density-based method leaves a considerably high number of points

out as noise. The post augmentation done in the previous step addresses this

problem to some extent and creates density variations among clusters. However,

the number of unassigned points still varies according to the distance parameter

α. To avoid this dependency on the parameter, we design our method to identify

a cluster for each user based on distance. Each user is included in a community

using a pair-wise comparison of the user post

vector with the cluster centre of each dense community formed previously. Each

user Pi ∈ P o is assigned to the community with the minimum pair-wise distance

value.

Algorithm 1 details the process of content-based community detection which as-

signs each user to a community.
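
The two phases described above can be sketched as follows. This is a minimal illustration and not Algorithm 1 itself: it assumes Euclidean distance, counts a point as its own neighbour, and uses a brute-force distance matrix; the function name `hybrid_cluster` is hypothetical.

```python
import numpy as np

def hybrid_cluster(X, alpha, r):
    """Density step followed by nearest-centroid assignment of leftover points.

    X: (n, d) array of post vectors; alpha: neighbourhood radius;
    r: minimum number of points (including the point itself) in a dense region.
    """
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= alpha) for i in range(n)]
    is_core = np.array([len(nb) >= r for nb in neighbours])

    labels = np.full(n, -1)                # -1 marks an un-assigned post
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        frontier = [i]
        while frontier:                    # expand the region through core points
            p = frontier.pop()
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:
                        frontier.append(q)
        cluster += 1

    if cluster > 0:
        # Centroids are computed once, from the dense members only.
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(cluster)])
        for i in np.flatnonzero(labels == -1):
            labels[i] = np.linalg.norm(centroids - X[i], axis=1).argmin()
    return labels

X = np.array([[0, 0], [0, 0.1], [0.1, 0],
              [5, 5], [5, 5.1], [5.1, 5],
              [1.5, 1.5]])
print(hybrid_cluster(X, alpha=0.3, r=3))
```

In this toy input, the last point lies in no dense region and is placed in the community whose centroid is nearest, so no user is left unassigned.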


3 EMPIRICAL ANALYSIS

Datasets: We used several Twitter datasets obtained from Trisma [140] spanning

the Cancer, Health and Sports domains, as reported in Table 1. We have

chosen a set of groups under these domains where we can identify Twitter ac-

counts to collect posts. Each group in each domain is considered as the ground

truth community to benchmark the algorithmic outcome. The tweets of a user

within a given account are combined to form a single media post that represents

the user.

Table 1: Summary of the datasets in the experiments

Dataset        Tweets   Users   Classes    # of    Avg. post length (terms)
                                (groups)   Terms   Before expansion   After expansion
Cancer (DS1)    43730   14368   8           8050   16                 134
Health (DS2)    53255   11306   6           9733   20                 159
Sports (DS3)   230447   25243   8          24316   29                 267

Benchmarks and Experimental setting: The state-of-the-art clustering

methods, k-means [90], NMF [110], DBSCAN [57] and SNN-based DBSCAN [56],

are used to benchmark the concepts in the proposed method. Document term fre-

quency was selected as the weighting scheme to represent the vector space model

for each corpus. In all the experiments, for density-based clustering methods,

the minimum number of posts to form a dense patch was set to 3 based on prior

research as it shows the minimum requirement of a hub point [139].

The parameter α which represents the local radius for expanding clusters was set

to 0.7, 0.9 and 0.7 for DS1-DS3 based on the experiments. All the experiments

were done using Python 3.5 on a standard desktop with 3.40GHz-64bit processor


and 16 GB memory. The standard pairwise harmonic average of the precision

and recall (F1-score) and Normalized Mutual Information (NMI) were used as

the evaluation measures [202].
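
For reference, the two evaluation measures can be sketched in plain Python. This is a minimal sketch under assumptions: the pairwise F1-score is computed over all pairs of users, and NMI is normalized by the geometric mean of the two entropies, which is one common variant; the text does not state which normalization was used. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt

def pairwise_f1(truth, pred):
    """Harmonic mean of pairwise precision and recall (O(n^2) pairs)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        tp += same_class and same_cluster
        fp += same_cluster and not same_class
        fn += same_class and not same_cluster
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def nmi(truth, pred):
    """Mutual information divided by sqrt(H(truth) * H(pred)) (assumed variant)."""
    n = len(truth)

    def entropy(labels):
        return -sum(c / n * log(c / n) for c in Counter(labels).values())

    ct, cp = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mi = sum(c / n * log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    denom = sqrt(entropy(truth) * entropy(pred))
    return mi / denom if denom > 0 else 0.0

truth = [0, 0, 1, 1]
print(pairwise_f1(truth, [1, 1, 0, 0]), nmi(truth, [1, 1, 0, 0]))
print(pairwise_f1(truth, [0, 0, 0, 0]), nmi(truth, [0, 0, 0, 0]))
```

Both measures ignore the actual label values, so a clustering that matches the ground truth up to renaming scores 1 on each.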

3.1 Effect of the Expansion Steps on Clustering Performance

The short-text posts were first augmented with non-content information available

with the text. Generally, tweets are accompanied by URLs, hashtags, and

emoticons. We treated each URL and hashtag attached to a tweet as a term

in that tweet. This technique allowed us to use additional information available

on a social media platform to improve the word co-occurrence for accurate latent

topic term learning. Fig. 3 shows the performance improvement measured by

NMI and F1-score using this additional data.

Figure 3: Performance difference when augmented with URLs and hash tags
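
Treating each URL and hashtag as a term, as described above, can be sketched with a simple tokenizer. The regular expression, the lowercasing, and the function names are illustrative assumptions; the preprocessing actually used in the experiments is not specified in the text.

```python
import re

# URLs are matched first so that they are not split at punctuation;
# hashtags keep their leading '#'. Both patterns are assumptions.
TOKEN_PATTERN = re.compile(r"https?://\S+|#\w+|\w+")

def tokenize_post(text):
    """Split a post into terms, keeping each URL and hashtag as a single term."""
    return [tok.lower() for tok in TOKEN_PATTERN.findall(text)]

def combine_user_posts(posts):
    """Concatenate the terms of all posts written by one user."""
    terms = []
    for post in posts:
        terms.extend(tokenize_post(post))
    return terms

print(combine_user_posts(["Great news #Health https://t.co/abc today",
                          "More on #health"]))
```

Two posts that share a hashtag or a link then share a term even when their wording differs, which is how this step raises term co-occurrence.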

For comparison, we benchmarked the performance against commonly used term

expansion methods: (1) the generative probabilistic topic modelling (LDA) based

approach (LD), (2) the Latent Semantic Indexing based approach (SI), (3) the top-ten

frequent words from k-means clustering (KMT) and (4) the WordNet synonyms based

method (WN).

Table 2: Different Document Expansion Methods

Measure              Method   DS1    DS2    DS3
F1-score             AC       0.7    0.56   0.66
                     LD       0.44   0.53   0.6
                     SI       0.57   0.50   0.63
                     KMT      0.74   0.50   0.66
                     WN       0.19   0.47   0.06
NMI                  AC       0.8    0.75   0.79
                     LD       0.54   0.75   0.73
                     SI       0.45   0.01   0.46
                     KMT      0.62   0.01   0.58
                     WN       0.24   0.13   0.14
Number of Clusters   AC       10     4      7
                     LD       9      3      8
                     SI       8      2      7
                     KMT      163    15     465
                     WN       527    369    716

Augmenting Methods: Proposed NMF (AC), LDA (LD), LSI (SI), Top 10 Words using k-means (KMT), WordNet-based (WN)

Table 2 shows the clustering performance on the datasets augmented with the

different expansion methods. The use of topic words estimated in the lower-dimensional

space through NMF for expansion (AC) is found to be better than the word-count-based

topic words obtained with LDA. NMF takes the context of the terms in the corpus into

account through term weights and is able to provide a stronger topic distribution over

terms. In comparison to the generic matrix decomposition in LSI, NMF enforces a

non-negativity constraint on the factorization and is able to capture the topic

terms accurately. Furthermore, the simple use of the top-10 words in k-means clusters

for augmenting was found to be ineffective. This may be due to the inability of k-means

to derive themes over terms using a distance metric; hence, it is unable to identify the

correct number of cohesive communities. For the WordNet-based method (WN), we

extracted synonyms of the top-10 frequent terms in a combined post that represents a

user and added them to the post as closely related words for enrichment. This way of

augmenting adds many unrelated terms to a post due to the structural mismatch between

terms in media posts and the WordNet taxonomy.

In the k-means based (KMT) and WordNet-based (WN) augmenting, the sparseness

of the text representation further increases due to the addition of unrelated terms.

The density estimation process becomes weak for these representations and results in

a larger number of small clusters.

Figure 4: Time taken for different post augmenting methods

As depicted in Fig. 4, the matrix factorization based methods take the least time

to augment media posts, due to their linear approximation process

of factors. The proposed NMF-based method consumes more time than the LSI-based

approach because it imposes non-negativity constraints. The WordNet-based

approach relies on term search in the external information source and is

expensive.


3.2 Accuracy Comparison

Results in Tables 3 and 4 show the performance of the proposed augmented hy-

brid clustering method (AC) benchmarked with other popular methods, without

and with topic vector based augmentation, respectively. Table 3 reveals that

the performance of density-based algorithms, including the proposed one, is in-

ferior to other algorithms as density notion fails in sparse content, before term

expansion. In this homogeneous setting, centroid-based k-means is the best

approach. However, it should be noted that all methods except the density-based ones

require the number of classes as an external input for clustering, whereas density-based

methods find natural clusters without being given the number of classes. Consequently,

the density-based methods perform poorly by forming many more subgroups than the

actual number of communities. Moreover, DBSCAN, which is designed to separate noise

from dense regions, leaves a huge number of users un-assigned to any

community because the sparseness of text scatters the data points.

Shared-nearest-neighbour based DBSCAN, which was introduced to address

sparseness in high dimensionality, also fails to deal with short text because posts

share a minimal number of terms, which prevents shared neighbours from being

identified accurately.

Performance of all methods using the augmented inputs with NMF based topic

terms, given in Table 4, shows that media post-expansion boosts the accuracy of

density-based methods, as depicted by the higher F1-score and NMI. The performance

boost in density-based methods is much larger than that of the other algorithms

because the augmentation step increases the data density. Virtual terms added

to media posts create uniform dense regions, minimizing the sparseness of text, which

favors density-based methods. These uniform clusters allow high-level groups closer

to the actual groups to be identified without identifying many dense patches. Most

importantly, the proposed method is able to assign a community to each user,


Table 3: Performance Comparison of Different Methods Before Expansion

Measure              Method   DS1    DS2    DS3     Average
F1-score             AC       0.42   0.45   0.43    0.43
                     KM       0.69   0.62   0.63    0.65
                     MF       0.58   0.54   0.61    0.58
                     DB       0.5    0.45   0.52    0.49
                     SD       0.45   0.49   0.45    0.46
NMI                  AC       0.4    0.43   0.31    0.38
                     KM       0.66   0.51   0.62    0.60
                     MF       0.51   0.43   0.57    0.50
                     DB       0.34   0.32   0.17    0.28
                     SD       0.06   0.00   0.02    0.03
Number of Clusters   AC       547    519    739
                     KM       8      6      8
                     MF       8      6      8
                     DB       547    519    739
                     SD       9      1      23
Un-assigned users    AC       0      0      0
                     KM       0      0      0
                     MF       0      0      0
                     DB       7210   5490   20528
                     SD       795    59     2332

Proposed Augmented Hybrid Clustering Method (AC), K-Means (KM), NMF (MF), DBSCAN (DB) and SNN-based DBSCAN (SD)

whereas DBSCAN and SNN-based DBSCAN leave about 2% and 70% of users

unassigned even after expansion.

Moreover, structure-based community detection methods [14, 161] are generally

known to form many more communities in comparison to the proposed method.

3.3 Time Efficiency Analysis

The time taken by the proposed method and the benchmarking methods after

augmenting the media posts is given in Fig. 5. K-means consumes more time due to the expensive step


Table 4: Performance Comparison of Different Methods After Expansion

Measure              Method   DS1    DS2    DS3     Average
F1-score             AC       0.8    0.75   0.79    0.78
                     KM       0.69   0.66   0.68    0.68
                     MF       0.69   0.65   0.69    0.68
                     DB       0.8    0.75   0.79    0.78
                     SD       0.59   0.72   0.46    0.59
NMI                  AC       0.7    0.56   0.66    0.64
                     KM       0.63   0.53   0.58    0.58
                     MF       0.63   0.52   0.59    0.58
                     DB       0.69   0.56   0.65    0.63
                     SD       0.33   0.46   0.03    0.27
Number of Clusters   AC       10     4      7
                     KM       8      6      8
                     MF       8      6      8
                     DB       10     4      7
                     SD       19     5      35
Un-assigned users    AC       0      0      0
                     KM       0      0      0
                     MF       0      0      0
                     DB       101    11     202
                     SD       903    71     1848

Proposed Augmented Hybrid Clustering Method (AC), K-Means (KM), NMF (MF), DBSCAN (DB) and SNN-based DBSCAN (SD)

of centroid updates in each assignment. NMF consumes the least time due to its

document-topic approximation process. However, the trade-off for achieving an

increase of 10% in NMI and 15% in F1-score with the proposed method is well

justified against NMF. SNN-based DBSCAN consumes the most time due to the

pairwise comparison among points to identify the shared nearest neighbours.

DBSCAN, which also uses the density-based clustering approach, consumes slightly

less time than our method. However, DBSCAN leaves some users unassigned

to communities, while our proposed hybrid method assigns each user to its closest

community. The complexity of the total process is O(n^2), where n is the number of

users. This is higher than that of the proposed clustering method alone (O(n log n))

due to the inclusion of the augmentation step.

Figure 5: Time comparison of different cluster/community detection methods


The experimental settings used for augmenting media posts and for the clustering

method are analysed here. For the datasets that we use in testing, the required number

of communities to be identified was known. We use these numbers to set the number of

topics in NMF for media post expansion. However, this would not be the case in the

real world. To set a default value in NMF for unknown cases, we explore the relationship

between the number of topics and F1-score. Fig. 6 (a) shows that the accuracy

becomes stable after about 10 topics. Consequently, 10 can be set as the

default number of topics to be obtained for post expansion.

The next experiment conducted was to find out the number of terms that should

be augmented in a media post. Fig. 6 (b) shows a pattern where accuracy

increases until a specific number of words are added and then starts to decline

with further additions. This happens because additional terms, which have a low

probability of being on the topic, become unrelated to the document themes. For


Figure 6: Performance w.r.t. number of topics, number of virtual terms, term weighting methods and parameter alpha

all the three datasets, the best performance is achieved with about 10 terms. If

the number of communities within the dataset is high, we need more terms, as in

DS1 and DS3. Fig. 6 (b) confirms that the threshold for the number of virtual

terms depends on the dataset. The dataset should possess sufficient information

to set it statistically, and this threshold can be set in a parameter-independent way.

Fig. 6 (c) shows that the best F1-score and NMI are obtained by setting the threshold

to “mean+standard deviation”, with the least time consumption as shown in Fig.

6 (d). This threshold setting allows the method to select the most probable words

for each topic, namely those whose coefficients exceed the overall average by more

than one standard deviation.
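
The “mean+standard deviation” rule can be sketched as follows. The sketch assumes that the threshold is computed over the coefficients of one topic vector; the text does not say whether the statistics are taken per topic or over the whole matrix, and the function name `select_virtual_terms` is hypothetical.

```python
import numpy as np

def select_virtual_terms(topic_weights, terms):
    """Keep the terms whose coefficient exceeds mean + standard deviation.

    topic_weights: coefficients of one topic over all terms (one column of W).
    Assumption: statistics are computed per topic vector.
    """
    weights = np.asarray(topic_weights, dtype=float)
    threshold = weights.mean() + weights.std()
    keep = np.flatnonzero(weights > threshold)
    keep = keep[np.argsort(-weights[keep])]   # strongest terms first
    return [terms[i] for i in keep]

terms = ["pension", "budget", "movie", "ticket", "water", "vote"]
weights = [0.9, 0.8, 0.1, 0.05, 0.05, 0.1]
print(select_virtual_terms(weights, terms))
```

Because the cut-off follows the spread of the coefficients, the number of virtual terms adapts to each topic rather than being fixed in advance.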

The media posts are organized in a Vector Space Model (VSM) to find initial

dense regions. We explore the relationship between different weighting schemes

that can be used to model the VSM. Fig. 6 (e) shows that Term Frequency (TF)

is the best weighting scheme for the proposed method. TF, which gives high

weights to frequent terms in a post, allows dense patches that share a

common theme to be formed.


The proposed clustering method, in line with a density-based clustering method,

uses a distance parameter to determine the search space. This parameter denotes

the region in the data space where the algorithm checks to define a particular data

point as a dense point. Fig. 6 (f) shows that increasing this parameter beyond a

particular value diminishes the accuracy by including all the points in one cluster.

3.5 CASE STUDY: COMMUNITY MINING

In the case study, we have explored the ability of the proposed community mining

method to find meaningful communities using Australian tweets of the

“NationalSeniors” Twitter account. The objective of this case study was to identify

sub-groups in a senior community with “alike” users based on their tweet posts.

Fig. 7 shows the word cloud1 generated for the entire tweet dataset. It can be

noted that Pension, Australian Politics (Auspol), Votes, Care and Finance are

popular discussion areas among seniors in Australia. However, it does not give

any more insight. Fig. 8 represents the 6 senior sub-communities identified by the

proposed method without setting any prior number. The word cloud generated

for tweet posts of each sub-community is able to confirm that they provide more

meaningful information.

Figure 7: Word Cloud for total tweets obtained from “NationalSeniors”

1 The Voyant Tool [170] is used to illustrate the word clouds.


Community 1 is about the discussion on policies of reforms on older people

and discriminations. Discussions of community 2 focus on pension-related cuts,

changes for their finance and care. Community 3 users specifically talk about

older workers, votes, and candidates. Community 4 users seem to be concerned

with daily water supply. Community 5 was about an event organised for seniors

called “Senior Wednesday” and associated social engagements such as members,

movie, and tickets. Community 6 discussed the Australian Politics, budget and

associated matters for pensioners. This case study depicts the power of the

content-based community detection method to find meaningful communities

according to their communicated text. It enables decision-makers to identify seniors'

political viewpoints and their concerns.

This information can be used in multiple ways. An example is to customize the

advertising strategies to each of the targeted groups. For example, users in com-

munity 5 can be the focus of advertisement in social gatherings. Furthermore,

this community information highlights the current events (e.g.: budget, Senior-

wednesday, and reforms), thus can be used for event detection if considered with

the timestamp.

Further, the users in “NationalSeniors” dataset are analysed using the retweets

network within the group to identify structure based communities using the well-

known Louvain algorithm [14]. This resulted in 94 sub-communities, because the

network of information disseminated among users relies on closer ties. This validates

the capability of our content-based method to produce meaningful, cohesive user groups

rather than splitting users into a larger number of fine-grained clusters.


Figure 8: Word Cloud for total tweets obtained from “NationalSeniors”


3.6 CONCLUSION

In this paper, we propose a content-based community detection method to iden-

tify similar user groups that share similar content on social media platforms. Our

approach deals with the sparseness of text by augmenting media posts. Abundant

information available on social media platforms is used as a simple solution to

increase term co-occurrence and to aid media post expansion along with NMF-based

topic vectors. We conjecture that the NMF-based dimensionality reduction

method, which considers the context of terms to infer topic proportion vectors,

gives high coefficients to most relevant terms of the topics in the approximation

process. Thereby it is able to derive the probable virtual words for augmenting

media posts.

The proposed density based complete clustering method naturally identifies com-

munities using these augmented media posts where each post represents a user.

The method initially forms highly dense patches in the data and leaves some users

in low-density regions un-assigned to any community. The refinement phase, which adds

centroid-based clustering, uses pre-calculated cluster centroids to group

un-assigned users. Extensive experiments on different Twitter datasets and the case

study confirm that the proposed approach with corpus-based expansion

significantly enhances the performance of short text-based community detection.

In future work, we will investigate extending this approach, which naturally assigns

each user to one community, to handle users with multiple interests using soft

clustering, and handling dynamic temporal context with improved time efficiency for

event detection.


Paper 4: Concept Mining in Online Forums using

Self-corpus-based Augmented Text Clustering

Wathsala Anupama Mohotti*, Darren Christopher Lukas* and Richi Nayak*



Published In: IEEE Pacific Rim International Conference on Artificial Intelli-

gence (PRICAI), 26-30 August 2019, Cuvu, Fiji

















Wathsala Anupama Mohotti Conceived the idea and research design, analyzed data, wrote the paper and


Date:

Darren Christopher Lukas, Conducted experiments, analyzed data

Signature:

Date:




26/03/2020

hi Nayak

26/03/2020


27/03/2020





ABSTRACT: This paper proposes a self-corpus-based text augmentation tech-

nique with clustering for concept mining in a discussion forum. Sparseness in

text data, which challenges the distance and density measures in determining the

concepts in a corpus, is handled through self-corpus-based document expansion

via matrix factorization. Experiments with a real-world dataset show that the

proposed method is able to infer useful concepts.

KEYWORDS: Concept Mining; Corpus-based augmentation; Clustering

1 Introduction

An online forum is a formal mechanism that a community uses to exchange information

through posted messages that are organized into “threads” [125]. The

forums can reflect concepts, themes, and concerns of online societies in diverse

fields such as education, marketing and politics [125, 132]. A handful of studies

have applied data and text mining methods to explore the predictive power of

the forum data [120, 132]. In the education domain, discussion forums have been

analyzed to assess interactivity over a period of time to predict early warnings for

students at-risk [132]. In marketing, online forum data is used to identify product

defects [125] with predictive models. However, these works neglect the natural

text content used in the online discussion. A few studies have applied text min-

ing in online forums for sentiment analysis [120] with supervised approaches to

classify forum threads. However, the unavailability of ground-truths in online fo-

rum data creates the demand for conducting the analysis in unsupervised setting

[120].

In this paper, we propose a concept mining method that can extract concepts

based on text discussions in the unsupervised setting. Concept mining of online


forums data faces the same challenges as traditional text mining methods [3].

The sparse nature of text vectors and the high number of dimensions cause distance-

and density-based methods to perform poorly due to distance concentration [3].

Specifically, distance differences between far and near points become negligible in

higher dimensions [3]. In addition, density-based methods are unable to identify

dense patches in sparse text data. Moreover, forum data is usually homoge-

neous where a minor variation in the distance/density measures will determine

groupings. Probabilistic and matrix factorization based approaches have been

introduced to handle higher dimensions in text [3]. However, information loss in

these dimensionality reduction methods is evident.

Figure 1: Clustering Algorithm for Concept Mining: ConMine

Distinct from these works, we introduce a novel approach for concept mining in

online forums using clustering and document expansion, named ConMine, to

understand the main concepts and themes present in user discussions. The self-

corpus based document expansion [139] in ConMine, via Non-negative Matrix

Factorization (NMF), learns virtual terms from the same corpus that semantically

match the applied domain. Centroid-based clustering is then applied to

the expanded text to differentiate the concepts. ConMine automatically learns

the number of clusters to be produced within the augmentation process. Finally,

we synthesize meaningful concepts with the help of experts via word-cloud

visualization. The ConMine approach is evaluated on real-world data taken from the

Queensland University of Technology (QUT), Australia. The empirical analysis

shows that ConMine is able to handle the sparse and homogeneous nature of text in

discussion forums and identify concepts more accurately than the benchmarks.

2 Concept Mining with Self-corpus-based Augmentation

The proposed three-step ConMine algorithm is outlined in Fig. 1. Consider

a collection of online forum corpora D = {D1, D2, ..., Di, ..., Ds} over s time periods, where

Di represents the corpus at time i. Let Di be a collection of N distinct posts,

{P1, P2, ..., PN}, that contain a total of M distinct terms {t1, t2, ..., tM}.

Self-corpus-based Augmentation with Matrix Factorization: In contrast

to using external knowledge bases [93], we conjecture that the self-corpus based

augmentation is well suited for augmenting text as it follows the forums’ text patterns.

Let A be the M × N matrix representation of Di. We decompose A using NMF

into the lower-rank matrices W and H, which are non-negative and of size

M × k and k × N respectively, with the low-order rank k set as the number

of topics. The value of k is learned using the intrinsic topic coherence measure. The

matrix factorization process iteratively approximates W and H such that they

can represent the high-dimensional A with the least error, as in Eq. 1.

$$\min_{W,H \geq 0} \; \frac{1}{2}\,\|A - WH\|_F^2 = \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(A_{i,j} - (WH)_{i,j}\right)^2 \qquad (1)$$

Topic membership of each post in Di is obtained considering the maximum coef-

ficient value in H for a post. This associated topic is used to identify the virtual

terms for each post using W . The coefficients in W are sorted in decreasing

order. The coefficients that yield a higher value than the mean plus one standard deviation of

the distribution become the terms to represent a topic as in [139]. Each text post

of Di is expanded using the most probable terms as virtual terms that correspond

to its topic vector and form D′i.
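
The expansion step described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes W (terms × topics) and H (topics × posts) have already been produced by any NMF routine, it uses the population standard deviation for the threshold, and it appends virtual terms even when a post already contains them. The function names `topic_terms` and `augment_posts` and the toy data are hypothetical.

```python
from statistics import mean, pstdev

def topic_terms(W, vocab):
    """For each topic (column of W), keep the terms whose coefficient exceeds
    the column mean plus one standard deviation, in decreasing order."""
    k = len(W[0])
    selected = []
    for topic in range(k):
        column = [row[topic] for row in W]
        threshold = mean(column) + pstdev(column)  # assumption: population std
        ranked = sorted(range(len(vocab)), key=lambda t: column[t], reverse=True)
        selected.append([vocab[t] for t in ranked if column[t] > threshold])
    return selected

def augment_posts(posts, W, H, vocab):
    """Append the virtual terms of each post's dominant topic to that post."""
    terms_per_topic = topic_terms(W, vocab)
    augmented = []
    for j, tokens in enumerate(posts):
        # Topic membership: row index of the largest coefficient in column j of H.
        topic = max(range(len(H)), key=lambda r: H[r][j])
        augmented.append(tokens + terms_per_topic[topic])
    return augmented

# Toy factors for 6 terms, 2 topics and 2 posts (illustrative values only).
vocab = ["thesis", "writing", "review", "meeting", "supervisor", "milestone"]
W = [[0.9, 0.0], [0.8, 0.1], [0.1, 0.0], [0.0, 0.2], [0.1, 0.9], [0.0, 0.1]]
H = [[0.7, 0.1], [0.2, 0.9]]
posts = [["thesis", "review"], ["meeting"]]
augmented = augment_posts(posts, W, H, vocab)
```

In this toy case the first post is assigned to the first topic and the second post to the second topic, so each post grows by the terms that pass the threshold for its own topic.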

Augmented Text Clustering: The data matrix D′i with augmented posts is

represented with a weighted term × post matrix to partition into k clusters. We

use the centroid-based clustering as it is reported to produce an accurate outcome

for homogeneous data [43]. As the online forum data shows a homogeneous

nature, we partition the N posts into k clusters (with k obtained through the previous

step) using k-means. The initial k cluster centers are randomly chosen. Then each

post P′a ∈ D′i is compared with each of the k centers and assigned to the closest one.

This process updates the respective cluster center in each iteration.

Knowledge Synthesis for meaningful Communities: Within this step, we

generate the m concepts that are meaningful to the domain in k clusters after

doing further post-processing and consultation with domain experts. We analyze

terms in each cluster through visualization and the highly occurring common

words are removed. This is an iterative quality checking process that includes

manual intervention. This process results in the m (≤ k) meaningful concepts


discussed in a forum.


Datasets: The dataset is obtained from the online forum, Essential Supervisory

Practices (ESP), a 5-week training program for higher-degree research supervisors

at QUT between 2015 and 2017. The posts from all years have been combined on

a weekly basis, resulting in five datasets as in Table 1. We consider each post,

regardless of its type (i.e., original or reply), as a single document after applying

standard text pre-processing steps. After comparing the experimental results with

multiple weighting schemes, posts are organized in a vector space model (VSM) with

the tf*idf weighting scheme to derive the topics, while the augmented posts are

represented using tf for clustering.


Dataset   Number of   Number of      Average post length (in terms)
          posts       unique terms   Before augmentation   After augmentation

W1        1664        7090           154                   165
W2        1495        7385           177                   194
W3        1416        7145           155                   165
W4        1402        7057           161                   174
W5        1568        6893           145                   153

Benchmarks and Evaluation Measures: The proposed NMF based ap-

proach for document expansion using topics in ConMine is evaluated against

probabilistic LDA (pLDA) [3] and Latent Semantic Indexing (LSI) [3]. The state-

of-the-art clustering methods of DBSCAN [139], LDA [3], LSI [3] and NMF [3]

are used for benchmarking the concepts of clustering in ConMine. The accuracy of

topic vector formation and of the clustering process was evaluated with the intrinsic

measures of topic coherence [134] and Silhouette score [134], respectively.
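
For readers unfamiliar with the Silhouette score, the following is a small sketch of the standard definition: for each point, a is the mean distance to the other members of its own cluster, b is the lowest mean distance to any other cluster, and the coefficient is (b − a) / max(a, b). The Euclidean distance and the toy points are assumptions made for illustration; the thesis does not state which distance was used.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: coefficient is taken as 0
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
tight = silhouette_score(points, [0, 0, 1, 1])  # clusters match the two groups
loose = silhouette_score(points, [0, 1, 0, 1])  # clusters cut across the groups
```

A value close to 1 indicates tight, well-separated clusters, while a negative value indicates that points sit closer to another cluster than to their own.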


Figure 2: Results of the Experiments

Accuracy: ConMine with NMF is found to be the best in terms of topic coherence (Fig.

2(a)). LSI, which approximates factors with both positive and negative entries, is

not able to provide stronger topic distribution in VSM which is represented with

strictly positive entries. pLDA, which approximates topics using the probability

of terms, considers only the term count and neglects the context of the words

and frequencies, and thus provides inferior results. We empirically learn the number

of topics that produces highly cohesive topics, as shown in Fig. 2(b). This

number is used in deriving topics for the post augmentation and is also set as

k in the clustering process. Fig. 2(c) compares clustering in ConMine with and

without post augmentation. Increased tightness of the clusters, indicated by a

higher silhouette score after augmentation in each method, confirms the benefit

of augmentation by handling the sparseness in high-dimensional text via added

terms. ConMine shows the highest increase in silhouette value compared to all

the baselines. In the homogeneous data, the density concept (DBSCAN) creates

contiguity-based clusters where very different data items may end up in the same

cluster, giving the worst results. LDA, which uses term count-based probability,

is unable to predict the correct cluster because it neglects the context of the

terms. However, NMF as a clustering method performs similarly to ConMine, with a

marginal difference, showing the importance of mapping a higher-dimensional to a lower-dimensional

space. The identified concepts for each week are given in Table 2.

Table 2: Concepts identified for datasets

Dataset   Identified Concepts by ConMine

W1        Research skill, Milestones, Supervisors, Meetings, Publications
W2        Experience in supervising, Relationship between student and supervisor
W3        Writing thesis, Writing literature review, Plagiarism and research issues
W4        Emotional issues, Completion, Strategy for unsatisfactory progress
W5        Examiner comments, Final submission and Seminar practice

4 Conclusion

We proposed and evaluated a concept mining method, ConMine, on real-world

forum data for understanding the discussions that are held on online forums.

To handle the sparsity and high dimensionality in text, we use NMF (which

approximates topic vectors in a linear manner considering the context of terms) to

obtain virtual words for post-expansion. Leveraging the intrinsic measurements,

we learn the optimal number of topics k, which is further used in centroid-based

clustering to obtain the clusters/concepts within the augmented text. Results

show that ConMine can deal with the sparse and homogeneous nature of online

forum data to obtain some useful concepts.

Chapter 4

Text Outlier Detection

This chapter introduces the second major contribution of the thesis, a set of novel

document outlier detection methods to identify the deviated documents from a

set of subgroups that cover common concepts in the document collection based

on effective text (dis)similarity calculation techniques. Outlier detection in text

data has not gained as much attention from the research community as clustering

[1]. Only a few research works focus on the text domain [4, 96]. The

majority of outlier detection methods can deal with only a few dimensions

[29, 68]. With the increasing number of dimensions, similar to clustering, outlier

detection methods face sparseness-related issues such as distance concentration

[126] or information loss [96] in dimensionality reduction methods in identifying

text similarity. There is some research that addresses the issues of high dimen-

sionality with angle differences [109], subspace analysis [108], and anti-hubs [58],

which identify the deviated (dissimilar) data points. However, the computational

complexity of these approaches is high. There is a lack of efficient approaches to

accurately identify the outlier documents in a document collection.

Fig. 4.1 outlines the high-level overview of the contributions discussed in this



chapter to effectively identify the text (dis)similarity for outlier detection. This

chapter explores the different ways to use ranking concepts for identifying out-

liers in document collections to avoid the issues with a sparse high-dimensional

text representation in identifying similarity. The primary hypothesis is to use

inverse document frequency that ranks the rare terms with higher importance to

determine outlier scores. This concept can filter outliers due to the higher average

of rare terms in an outlier document. Besides that, this chapter introduces

the use of the IR ranking concept, which identifies relevant documents and relevancy

scores through an inverted index data structure, for outlier detection. The proposed

methods in this thesis inversely use this information to identify the outliers. In

addition, neighborhood information identified through the IR system is proposed

to build a mutual neighbor graph efficiently. A method based on density estima-

tion and hub identification on the graph is proposed to filter the outliers for short


text [79], while directly obtaining excluded documents from the graph as outliers

for other cases.

This chapter comprises two papers relating to these contributions.

• Paper 5. Wathsala Anupama Mohotti and Richi Nayak.: Efficient Out-

lier Detection in Text Corpus Using Rare Frequency and Ranking. ACM

Transactions on Knowledge Discovery from Data (TKDD) (Accepted with

Major Revision).

• Paper 6. Wathsala Anupama Mohotti and Richi Nayak.: Text Out-

lier Detection using a Ranking-based Mutual Graph. Journal of Data &

Knowledge Engineering (DKE) (Under Review).

Paper 5 aims to identify a higher number of true outliers reducing false identifica-

tion using ranking concepts. It proposes ranking documents using rare document

frequency and IR ranking-based neighborhoods to identify dissimilar documents

to inliers as outliers with three main algorithms namely, Outlier detection based

on Inverse Document Frequency (OIDF), Outlier detection based on Ranking

Function Score (ORFS) and Outlier detection based on Ranked Neighborhood k-

occurrences Count (ORNC). OIDF proposes that outlier documents should have

a higher average inverse document frequency across their terms, as

they usually contain a higher number of rare terms. ORFS identifies the outlier

scores by using the inverse of the relevancy scores obtained for the top-10

relevant documents through the IR system instead of using the average of rare

term weights. In addition, ORNC identifies the anti-hubs in the collection with

relevant documents given by an IR system for the documents in the collection.

Documents with fewer k-occurrences within the ranking responses are anti-hubs, which are

potential outliers.
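
The intuition behind OIDF can be illustrated with a short sketch. It is not the algorithm from Paper 5: it assumes the common textbook form idf(t) = log(N / df(t)), averages over the distinct terms of each document, and simply reports the document with the highest average rather than applying the paper's own decision rule. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def idf_weights(docs):
    """idf(t) = log(N / df(t)), where df(t) counts documents containing t."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def mean_idf_scores(docs):
    """Average IDF over the distinct terms of each document (assumed scoring)."""
    idf = idf_weights(docs)
    scores = []
    for doc in docs:
        terms = set(doc)
        scores.append(sum(idf[t] for t in terms) / len(terms) if terms else 0.0)
    return scores

docs = [
    ["thesis", "supervisor", "meeting"],
    ["thesis", "supervisor", "milestone"],
    ["supervisor", "meeting", "milestone"],
    ["thesis", "meeting", "milestone"],
    ["football", "weather", "thesis"],  # mostly rare terms
]
scores = mean_idf_scores(docs)
most_outlying = max(range(len(docs)), key=scores.__getitem__)
```

The last document is dominated by terms that occur nowhere else in the collection, so its average IDF is the highest. Computing the scores needs only one pass to collect document frequencies and one pass to average them, which is consistent with the efficiency attributed to OIDF.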


This paper explores sequential and independent ensemble strategies using OIDF

with ORNC and ORFS to obtain higher accuracy in outlier prediction with less

false prediction. In addition, two new evaluation measures are introduced in

this paper to reveal false predictions of inliers and outliers in a meaningful manner.

Experiments are done on five datasets drawn from Wikipedia, NewsGroup and

Social Event Detection data, covering all sizes of text vectors. Experimental

results show that the proposed strategies are accurate and efficient in all the

datasets compared to baselines. In the Wikipedia dataset that consists of large

text vectors, all the baselines fail due to their time/memory complexity while

OIDF that is based on the simple concept of using rare terms gives the best

performance among the proposed ones consuming less time. Ensemble methods

based on ORFS reduce the false detection and perform better than others among

proposed algorithms for NewsGroup data due to its efficient outlier score calcu-

lation with ranking scores. However, ensemble ORNC performs better for Social

Event Detection data due to its anti-hub-based filtering that deals with extreme

sparseness in short text.

Paper 6 identifies the dissimilarities in a document collection using the graph-

based approach, Outliers by Ranking-based Density Graphs (ORDG). It follows

an incremental approach, which starts with rare term frequency-based outlier

detection as in OIDF. The sparseness in text representation, which challenges the

identification of text similarity, is addressed using a mutual neighbor graph constructed

with IR ranking results. The larger and medium text vectors, which show sufficient

word co-occurrences compared to the short documents, allow the connected

graph to include inliers while leaving out outliers. The extreme sparseness of short text

only allows inclusion of a few inliers in the graph. Thus, ORDG estimates uni-

formly dense regions on the graph and thereby identifies the attached hubs with

these inlier regions. ORDG identifies the other inlier documents that are not

included in the graph by leaving documents dissimilar to hubs as outliers. This


hub similarity calculation is done with ranking scores given by the IR system.

Final outliers are determined by combining outlier candidates generated by the

rare frequency with this mutual graph-based approach. Experiments are done

with the same datasets as in Paper 5. Experimental results show that ORDG is

accurate and scalable compared to baseline methods. It shows a substantial performance

improvement for datasets with short text vectors.

OIDF is the most efficient method among all the proposed outlier detection methods

due to its use of the simple rare document weighting concept. ORDG, which is

proposed in Paper 6 and uses hub-based inlier filtering, outperforms ORNC from

Paper 5, which is also proposed as an accurate method for short text. However, the

sequential ORNC method is more efficient than ORDG for short text, as ORDG

uses multiple steps in identifying outliers. As per the experimental results of the four

outlier detection methods, OIDF, ORFS, ORNC and ORDG, the suitability of

each method with respect to the data type can be summarised as in Table 4.1.

Table 4.1: Proposed outlier detection methods

Method   Concept                  Suitable data type                      Significant characteristics

OIDF     Rare Frequency           Collections with large text vectors     Accuracy, Efficiency
ORFS     IR Ranking Score         Collections with medium text vectors    Accuracy, Efficiency
ORNC     IR Ranking Results       Collections with short text vectors     Accuracy
                                  Collections with medium text vectors    Accuracy
                                  that have overlapping terms
ORDG     IR Ranking-based Graph   Collections with short text vectors     Accuracy

Paper 5 proposes two new measures to identify false inliers and false outliers:

Inlier Prediction Error (IPE) and Outlier Prediction Error (OPE).

They are superior to FPR and FNR, which provide relative values, and are capable

of differentiating methods based on their ability to deal with false alarms.

Next, the chapter will present these two papers. Since this is a thesis by publica-

tion, each original paper is presented aligning with the thesis format. Due to the

papers’ different formats, there may be some minor format differences. However,

these do not alter the content of the original papers.


Paper 5: Efficient Outlier Detection in Text Corpus Using Rare Frequency and Ranking




Accepted with Major Revision In: ACM Transactions on Knowledge Discovery from Data (TKDD Journal)



















Date:




Nayak

26/03/2020

Mohotti

27/03/2020




ABSTRACT: Outlier detection in text data collections has become significant

due to the need to find anomalies in the myriad of text data sources. High

feature dimensionality, together with the larger size of these document collec-

tions, presents a growing need for developing accurate outlier detection methods

with high efficiency. Traditional outlier detection methods face several challenges

including data sparseness, distance concentration and the presence of a larger

number of sub-groups when dealing with text data. In this paper, we propose to

address these issues by developing novel concepts such as presenting documents

using rare document frequency, ranking-based neighborhood for similarity com-

putation and identifying sub-dense local neighborhoods in high dimensions. We

present a set of novel ensemble approaches using the ranking concept to reduce

the false identifications while identifying a higher number of true outliers, in order

to improve the proposed primary method based on rare document frequency.

Extensive empirical analysis shows that the proposed method and its variations

are scalable compared to relevant benchmarking methods, as well as improving

the quality of outlier detection in document repositories.

KEYWORDS: Outlier detection; high dimensional data; k-occurrences; ranking

function; term-weighting

1 Introduction

With the advances in data processing technology, digital data have witnessed ex-

ponential growth [86]. Outlier detection plays a vital role in identifying anoma-

lies in massive data, and some examples are credit card fraud detection, criminal

activity detection in e-commerce and abnormal weather prediction [126]. The

general idea of outlier detection is to identify patterns that do not conform to

general behavior, referred to as anomalies, deviants or abnormalities [1]. There


exist different supervised and unsupervised machine learning methods that have

been used to identify exceptional points from different types of data, such as

numerical, spatial and categorical.

Outlier detection in text data is gaining attention due to the generation of a

vast amount of text through big data systems. Reports suggest that 95% of the

unstructured digital data appears in text form [86]. An outlier text document has

content that is different from the rest of the documents in the corpus that share

a few similarities amongst them [4]. Text outlier detection is frequently used in

stream data for event detection and first story detection to track the evolution of

an event [1]. However, detecting anomalies in static text data is also beneficial in

many application domains for decision-making such as web, blog and news article

management [96]. An unusual web page on a website or a web content deviating

from the theme in a blog, if discovered, will draw useful insight for administrative

purposes. Similarly, detecting an unusual news article from a collection of news

documents may help to flag it as exceptional or fake news. The detection of unusual

events from (short-length) social media data can provide early warnings

[45].

These applications of identifying text outliers face several challenges: (1) Unavailability

or limited availability of labeled data is the primary challenge for real-world

outlier detection methods and it creates the requirement for unsupervised meth-

ods. (2) Text data show fewer co-occurrences of terms among documents and form

sparse representation that challenges document similarity calculation to identify

the deviations [96]. (3) In special categories of text data such as social

media text, the number of discriminative terms and common terms shared by

related texts is small, making the data extremely sparse [139, 191]. The number of groups or

topics in social media is considerably high, challenging text mining methods to

identify the outliers that deviate from all these groups. Therefore, text outlier

detection methods face additional challenges in handling short-length social

media posts. (4) Moreover, larger sizes of text collections generated by big data

systems create a need to explore efficient outlier detection methods.

Studies on general outlier detection commonly use distribution, distance, and

density-based unsupervised proximity learning methods for real-world applica-

tions, where training data with class labels are not available. Most of these

methods suffer from efficiency problems due to the high volume in large datasets

[155]. The effectiveness of these outlier detection methods on high dimensional

data is also challenged by the well-known curse of high dimensionality [126].

Specifically, the distance difference between near and far points becomes negligible,

and these proximity-based methods are unable to compute the similarity among

documents [96]. This challenge is further amplified when the data collection

exhibits numerous distinct groups within the collection [173].
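
The distance concentration effect mentioned above can be observed with a small simulation. This is only an illustrative sketch: it draws dense, uniformly random points rather than sparse text vectors, and the helper name `relative_contrast`, the sample size and the seed are arbitrary choices.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

low = relative_contrast(dim=2)     # few dimensions: near and far differ widely
high = relative_contrast(dim=500)  # many dimensions: distances look alike
```

As the number of dimensions grows, the gap between the nearest and the farthest point shrinks relative to the nearest distance, which is why proximity-based outlier scores lose their discriminative power.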

Researchers have developed subspace and angle-based methods to address these

high dimensional issues. These methods are computationally expensive due to

the larger numbers of comparisons required. Moreover, the subspace analysis

cannot guarantee that relevant subspaces are aligned with extreme values in full

dimensionality [108, 126]. Another set of solutions is proposed based on nearest-neighbor

and “anti-hub” concepts for high-dimensional data, such as graphs and

genes [58, 68, 84, 154]. However, the nearest-neighbor (NN) calculation

is known to present scalability issues for larger datasets [164]. In this paper,

we conjecture that the relevant document set retrieved by a search engine, in

response to a document (posed as a query), is a promising alternative solution to

generate the neighborhood of the document. IR systems effectively calculate the

similarity between documents through the inverted index data structure avoiding

the issues with sparse data representation. We present a set of novel methods to

calculate outlier scores based on this ranking-based neighborhood concept using


the scalable search engine technology.
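
The ranking-based neighborhood idea can be sketched as follows. The function `top_k_neighbors` is a hypothetical stand-in for a search engine: it ranks the other documents by the number of shared terms, whereas a real IR system would use an inverted index and a ranking function such as BM25. The k-occurrence count then records how often each document is returned in the top-k results of the other documents; documents that are rarely or never returned are anti-hub candidates.

```python
from collections import Counter

def top_k_neighbors(docs, k):
    """Stand-in for an IR engine: rank the other documents by shared-term count."""
    term_sets = [set(doc) for doc in docs]
    neighbors = []
    for i, query in enumerate(term_sets):
        ranked = sorted(
            (j for j in range(len(docs)) if j != i),
            key=lambda j: (-len(query & term_sets[j]), j),  # ties broken by index
        )
        neighbors.append(ranked[:k])
    return neighbors

def k_occurrences(neighbors, n_docs):
    """How often each document appears in the top-k lists of other documents."""
    counts = Counter(j for ranked in neighbors for j in ranked)
    return [counts.get(i, 0) for i in range(n_docs)]

docs = [
    ["thesis", "supervisor", "meeting"],
    ["thesis", "supervisor", "milestone"],
    ["supervisor", "meeting", "milestone"],
    ["thesis", "meeting", "milestone"],
    ["football", "weather", "thesis"],
]
neighbors = top_k_neighbors(docs, k=3)
occurrences = k_occurrences(neighbors, len(docs))
anti_hub = min(range(len(docs)), key=occurrences.__getitem__)
```

Here the last document is never retrieved as a neighbor of any other document, so it has the lowest k-occurrence count. How such counts are turned into a final outlier decision is specific to the proposed methods and is not reproduced in this sketch.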

There are only limited studies in the outlier detection literature that specifically focus

on the text domain and deal with the sparseness of the document feature vector. Traditional

outlier detection methods fail to capture the document similarity due to

ineffective distance/density computation [3] within the sparse vector space model.

Therefore, dimensionality reduction-based methods have been applied to text outlier

detection [11, 16], and they have to deal with information loss in lower-order

mapping. Recently, the use of sum-of-square differences in matrix factorization

is proposed to determine outlier scores in text data [96]. In real-world scenarios

such as social media, the number of groups/topics inherent in the data is large

and creates the need for distinguishing the fine-grained sub-groups while identi-

fying outliers. However, the fine-grained problem dealing with a larger number

of inliers (i.e., normal data) and outlier classes presents issues for the aforemen-

tioned matrix decomposition as the iterative lower-rank matrix approximation

process increases the level of error (as evident in our experiments). A term

weighting scheme such as Inverse Document Frequency (IDF) is a common statistical

measure in Information Retrieval (IR) that selects the intrinsic dimensionality

of a text document by representing how important a term is in a collection [37].

Inspired by this, in this paper, we propose a novel, rare-frequency-based method

for high-dimensional document outlier detection.

Many data mining problems such as classification and clustering improve the ro-

bustness of the primary solution using an ensemble mechanism [61, 187]. This

is a promising solution to reduce the false alarm rate of outlier detection meth-

ods. Ensemble approaches, broadly classified as sequential or independent, have

been successfully used in prior work to improve the quality of an outlier detec-

tion algorithm [1]. We explore and present an ensemble strategy by combining

rare document frequency with the ranking-based neighborhood to improve the


accuracy of outlier detection by reducing false positives.

Overall, this paper presents novel outlier detection methods based on the concepts

of rare document frequency and ranking. Firstly, we propose that semantic term

clusters can effectively be used to detect deviations or anomalous documents

through meaningful term weighting. We then propose to exploit the ranking-

based retrieval techniques employed in search engines to provide similar docu-

ments in comparison to the conventional big data analytics methods that require

significant investments. Thereby, we propose to use the local sub-dense neigh-

borhood concept (Hubs), evident in high dimensional text data, through rank-

ing. We combine the neighborhood-based methods with the primary rare-term

weighting-based method to form ensemble approaches and reduce the potential

of false outlier detections. Unlike the state-of-the-art methods, we present these

methods as non-parametric and address the bottleneck of setting the user-defined

threshold to assess a document score. Lastly, the paper discusses the need for an

outlier-focused evaluation mechanism to report false positives (i.e., false outliers)

and false negatives (i.e., false inliers) in outlier detection.

In summary, this paper brings several novel contributions to the area of document

outlier detection, listed as:

• The use of rare frequency in document representation for outlier detection

to demarcate the border between common and rare documents. This novel

concept contributes to the primary OIDF algorithm.

• The concept of finding relevant neighbors using a scalable IR system that

consumes less computation cost. Two novel algorithms (ORFS and ORNC)

are developed to detect the level of deviations between documents.

• A set of ensemble approaches (ORFS (I), ORFS (S), and ORNC (S)) fo-

cusing on improving accuracy (i.e., reduced false outliers) with efficiency.

These approaches do not depend on a user-defined parameter as an outlier

threshold.

• Envisaging the requirement of meaningful evaluation measures, namely,

OPE and IPE, to highlight false detection.

The rest of this paper is organized as follows. Section 2 provides the motivation

and related work on outlier detection, term weighting, and IR ranking

concepts. The proposed approaches for text outlier detection based on rare term

weighting and ranking are detailed in Section 3. A comprehensive empirical

study with benchmarking on several datasets covering various length text data

is provided in Section 4, with a summary that provides useful insight on all

approaches. Finally, concluding remarks are summarized in Section 5.

2 Related Work

In the current era, most human interactions appear and are collected in the form

of free text such as emails, wikis, blogs and social media feeds. Outlier detection

is useful for finding interesting and suspicious text within the collection. A text

collection usually contains a high-dimensional set of terms that results in a sparse

representation [3]. Different text representation models based on term frequency

have been used, with the Vector Space Model (VSM) being a primary model [77].

There are different term weighting schemes used in IR to give an importance

level to terms such as TF, IDF, TF∗IDF, and BM25 [160] in a data model. The

Inverse Document Frequency (IDF) scheme favors rare terms in the collection [37].

Several prior works in the field of outlier detection use Hawkins’s definition to

define an outlier, stating that “An outlier is an observation which deviates so much

from the other observations as to arouse suspicions that it was generated by a

different mechanism” [69]. These deviations can be identified by using rare terms

in the calculation. In this paper, we conjecture that the use of IDF, which values

the importance of rare words, in presenting a dataset will highlight the outlier

documents.

Outlier detection broadly follows two approaches: (1) supervised learning, when

training data with labels of normal and abnormal data is provided [105]; and (2)

unsupervised learning, when labeled data is not available, which is common in

real-world scenarios [68]. Neural network-based methods that use deep feature

extraction [38], and Generative Adversarial Network-based active learning meth-

ods in outlier detection [127] are recent supervised and semi-supervised methods

that predict the outliers based on training data directly or indirectly. The performance

of these methods is fully or partially affected by the supervision given

by labeled data. Unsupervised learning methods follow proximity approaches such

as distance-based, density-based, distribution-based and cluster-based [68]. The

majority of the outlier detection work deals with low-dimensional numerical data,

where overfitting in terms of distance or density distribution clearly separates

outliers. Distribution-based methods use different statistical fundamentals to determine

the anomalies that occur outside of the normal model [16, 89]. These

methods depend on the assumptions about data representation and the measures used,

and can be affected by overfitting to normal data. The poor scalability of this approach

for high-dimensional data further makes it less effective for text outlier

detection.

Distance and density-based approaches have been extensively used in outlier de-

tection due to their simplicity in implementation [126]. Conventional distance-

based methods identify outliers that highly deviate from the remaining data in

the collection using distance differences [104]. Alternatively, neighborhood infor-

mation is used for outlier detection [157, 197]. Nearest Neighbors (NN) has been


used as an effective method to measure the distance differences. The distances

between each point and its k-NN are considered, and the top n farthest points are

labeled as outliers [157]. Text data presents challenges to this approach, as distance

differences become negligible due to sparseness in high dimensions (known

as distance concentration) [126]. Document collections are usually large in size

and contain multiple groups. This creates a scalability issue for nearest-neighbor,

calculation-based approaches [173].

As a remedy to the distance concentration problem, similarity calculation based

on the angle between vectors is proposed to determine the deviation [109]. This

approach can be adapted to the text domain as cosine similarity can be used

to measure angle differences in text feature vectors [186]. However, the number

of pairwise comparisons needed for larger datasets increases the computational

complexity and makes this approach infeasible to apply to large-scale data.

In contrast to the distance-based methods that identify far-away points glob-

ally, density-based methods identify less dense points locally as outliers. These

methods derive a density distribution of data and identify nearest neighbors by

handling varying density patches. The relative density of a point is compared to

neighbors and a Local Outlier Factor (LOF) is defined to determine the degree of

outlier [29]. A point receives a higher LOF value when the ratio of the density around its k nearest neighbors to the density of its own local neighborhood is high. It is then labeled as an outlier candidate [109]. Density-based methods are known to have difficulty dealing with the high dimensionality inherent to text data, due to distance concentration.

Density-based clustering methods such as DBSCAN can naturally detect outliers

in the dataset by considering them as points in sparse regions [9, 57]. Several cluster-based outlier detection methods have been proposed that consider the tightness of the clusters [51, 71]. Although these methods are capable of detecting


outlier clusters, they highly depend on threshold parameters [51]. Moreover,

these methods cannot be directly adapted to text data, as the text data exists in

patches and it becomes highly difficult to detect outliers.

In high-dimensional data, identifying outliers that deviate from the rest of the collection is hard with distance and density methods, because the curse of dimensionality makes similarity calculation between points less effective [2]. Different multi-dimensionality scaling techniques have been used to deal with this issue [27] and identify the outliers in reduced dimensions. However, loss of information in the higher-to-lower order approximation is inevitable. Alternatively, subspace-based outlier detection methods play a vital role in managing high dimensionality. They combine local data pattern analysis with subspace analysis. However, finding a subset of dimensions with rarely existing patterns using brute-force search incurs a high computational cost [2]. Furthermore, these approaches use the behavior of selected subspaces to identify outliers [108]. Deviations in such an embedded subspace cannot be guaranteed to determine outliers globally in the full-dimensional space.

Text data has been shown to exhibit the hub phenomenon that is evident in high dimensions, i.e., “the number of times some points appear among k-NN of other points is highly skewed” [154]. These local NNs, which form sub-dense regions (i.e., hubs), have been used to address sparseness-related problems in high dimensionality effectively [58, 140, 173]. This concept has been used inversely in

outlier detection. In a graph-based method where each data point represents a

graph node, connections are made considering reverse k-NN [68, 84] and a lower

in-degree number identifies potential outlier nodes. Similarly, researchers have

used the concept of “anti-hubs” as potential outlier candidates [155]. Although

these local NNs-based approaches successfully handle the higher dimensions, the

scalability of these methods for larger datasets remains questionable, due to the


need for calculating a hub for each data point.

Table 1 summarizes the outlier detection methods that have been applied in nu-

merical and textual data sets. Limited studies are available focused specifically

on the text domain [16]. Given the fact that random projection approximately

preserves the distance between points in the lower dimensional space, the ran-

dom projection has been applied to text data to identify outliers [11]. Loss of

information is inevitable in projection and ultimately reduces accuracy. In order

to improve the accuracy of text outlier detection, the significance of terms should

be interpreted related to the structure of the document [1]. Non-negative Matrix

Factorization (NMF) has been used to decompose the text collection into a set of

semantic term clusters and document clusters considering document structures

[96]. The term clusters allowed the method to learn the outlier degree of each

document by ranking the sum-of-squares of differences with the original matrix.

However, an increased number of groups within the collection impairs this learning process. Outlier detection in fine-grained scenarios is not practical with an NMF-based method in terms of both accuracy and scalability.

IR systems have shown the capacity to manage text data successfully [3]. Search engines are well-known IR systems capable of efficiently finding relevant documents from a document collection organized in an inverted index data structure [200]. They are known to handle big data collections [200]. In this paper, we propose novel ensemble approaches based on the rare frequency and ranking concepts in IR, and identify the NNs as well as local sub-dense neighborhoods in text data to determine the deviations. To the best of our knowledge, this is the first extensive outlier detection work in text mining that uses the concepts of rare frequency, ranking and an ensemble approach.


Table 1: Summary of the major outlier detection methods used in high dimensional data

| Category | Methods | Applied Domain |
|---|---|---|
| Ranking-based | Neighborhood-based outlier detection [126] | Numerical data |
| k-occurrence-based | Anti-hub based outlier detection in [155] | Numerical data |
| | Hubness aware outlier detection in [58] | Numerical data |
| Graph-based | k-NN graph based outlier detection in [68] | Numerical data |
| | Natural-neighbor graph-based method in [84] | Numerical data |
| Subspace-based | Evolutionary algorithms in [2] | Numerical data |
| | Subspace outlier detection in [108] | Numerical data |
| Projection-based | Random projection based outlier detection [11] | Text data |
| | NMF based outlier detection [96] | Text data |
| Angle-based | Angle variance-based method in [109] | Numerical data |

3 Outlier Detection in Text Data: Proposed Methods

3.1 Preliminaries

In this paper, the objective of outlier detection is to identify data points distinct

from the rest of the collection as outliers, by separating them from the inliers

that are cohesive points forming sub-groups. Consider a document collection

D = {d1, d2, ..., dN} where di ∈ D is represented using a set of distinct terms {t1, t2, ..., tv}. Let D contain a set of distinct terms {t1, t2, ..., tn}, n ≫ v, covering all the terms in the collection. Let D be divided into a set of sub-groups C = {c1, c2, ..., cl} where l ≪ n and l < N. Each cg ∈ C contains a set of similar

documents that share related terms. We formally define document di ∈ D as

outlier or inlier as follows.


Definition 1 - Outlier: A document di ∈ D that shows high deviation, based

on term distribution, from all distinct sets of similar documents C is considered

an outlier.

Definition 2 - Inlier: A document di ∈ D that shows high similarity, based

on terms distributions, with any distinct set of similar documents, cg ∈ C is

considered an inlier.

Example: Fig. 2(a) shows a toy document collection. It contains two (sport)

groups of documents considered as inliers, and the document on weather infor-

mation, d11 is an outlier. The outlier document shows a set of different terms

that deviates from the common terms in the collection related to subcategories

of sports.

Term weighting: We use the VSM model [201] to represent the collection.

A document is represented as a point in multidimensional space by vector di =

{w1, w2, ..., wv}, where wj is the weight of a term tj in the document. We use

IDF [163] as the weighting scheme that statistically weights the rare terms higher.

We conjecture that IDF is more informative to differentiate a document from the

collection, instead of the term frequency (TF) that weights common terms with

higher weights. The weight of the term tj is given as:

$$w_j = idf_j = \log\left(\frac{|D|}{df_j}\right) \tag{1}$$

where dfj is the document frequency of term tj, i.e., the number of documents that contain the term. In this paper, we calculate IDF after applying the standard pre-processing steps of stop-word removal and stemming.
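As a minimal sketch of Eq. 1 (not the implementation used in this work), the IDF weight of every term can be computed from tokenised documents as below. The toy documents, the function name `idf_weights` and the use of the natural logarithm are assumptions made for illustration; the log base is not stated in the text.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return {term: log(|D| / df_term)} for a list of tokenised documents."""
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Hypothetical pre-processed collection (stop words removed, terms stemmed).
docs = [
    ["cricket", "bat", "ball"],
    ["cricket", "wicket", "ball"],
    ["rugby", "ball", "tackle"],
    ["rain", "storm", "forecast"],
]
idf = idf_weights(docs)
for term in ("ball", "cricket", "rain"):
    print(term, round(idf[term], 3))
```

Terms that occur in a single document ("rain") receive the largest weight, while a term shared by most documents ("ball") receives a weight close to zero.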


Nearest neighbors: The cluster hypothesis proposed in IR [91] states, “the

linked sets of documents are likely to be relevant to the same request”. This led to the reversed cluster hypothesis [59], shown theoretically, that “documents relevant to the same query should occur in the same cluster”. Prior research

then empirically showed that when a document di is posed as a query to an IR

system, the retrieved document set can be considered similar to di [140, 173]. In

this paper, we propose to use the top-m retrieved set as NNs of di.

Let the document collection D be organized in the form of an Inverted indexed

data structure stored in an IR system. Let di ∈ D be posed as a document query

q using a set of distinct terms {t1, t2, ..., ts} where s ≤ v to the IR system. Given

the query document q, a ranking function Rf employed in the IR system returns

the most relevant m document set, Dq as:

$$R_f : q \rightarrow D_q = \{(d_p, r_p)\} : p = 1, 2, \ldots, m \tag{2}$$

where the relevancy score rp of a document dp can be calculated as:

$$\mathrm{score}(q, d_p) = r_p = \sum_{t \in q} \left( \sqrt{tf_{t,d_p}} \times idf_t^{2} \times \mathrm{norm}(t, d_p) \right) \tag{3}$$

Definition 3 - Nearest Neighbors (NN): A set of top-ranked documents

retrieved by employing a ranking function Rf in an IR system can be considered

as NN of di.

In this paper, we use the Elasticsearch search engine as the IR system and obtain

top-m (m = 10) documents as k-NN (k = 10) for each document in the collection.

Prior research shows that precision at top-10 documents in the ranked list for a

query is high, due to tight coupling with the topic and these top documents

possess sufficient information richness [199]. When a document is posed as a

query to the IR system, it is represented with the top-s (s = 10) terms as in [173] ranked in the order of IDF. There exist several ranking functions such as tf∗idf, BM25, BM25P and LM-JM [54, 173]. We propose to use the widely applied tf∗idf ranking function to measure the relevance between a document and a query as in Eq. 3.
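The retrieval step can be illustrated with a small in-memory stand-in for the search engine. This is only a sketch under stated assumptions: it does not call Elasticsearch, the helper names (`build_idf`, `query_terms`, `score`, `top_m`) are hypothetical, and norm(t, dp) is assumed here to be a simple length normalisation of 1/sqrt(document length), which the text does not specify.

```python
import math
from collections import Counter


def build_idf(docs):
    """IDF of Eq. 1 (natural log assumed) over tokenised documents."""
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(n_docs / c) for t, c in df.items()}


def query_terms(doc, idf, s=10):
    """Document query: the top-s distinct terms of the document ranked by IDF."""
    return sorted(set(doc), key=lambda t: (-idf[t], t))[:s]


def score(query, doc, idf):
    """Eq. 3 with an ASSUMED norm(t, d) = 1 / sqrt(len(d))."""
    tf = Counter(doc)
    norm = 1.0 / math.sqrt(len(doc)) if doc else 0.0
    return sum(math.sqrt(tf[t]) * idf.get(t, 0.0) ** 2 * norm
               for t in query if t in tf)


def top_m(doc_id, docs, idf, m=10, s=10):
    """Eq. 2: the m highest-scoring (document id, relevancy score) pairs."""
    q = query_terms(docs[doc_id], idf, s)
    scored = [(p, score(q, d, idf)) for p, d in enumerate(docs)]
    scored = [(p, r) for p, r in scored if r > 0]  # keep matching documents only
    scored.sort(key=lambda pr: (-pr[1], pr[0]))
    return scored[:m]


docs = [
    ["cricket", "bat", "ball"],
    ["cricket", "wicket", "ball"],
    ["rugby", "ball", "tackle"],
    ["rain", "storm", "forecast"],
]
idf = build_idf(docs)
print(top_m(0, docs, idf, m=3))
print(top_m(3, docs, idf, m=3))
```

In this toy collection the weather document retrieves only itself, whereas the sports documents retrieve each other, which is the behaviour the later outlier scores rely on.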

Reverse neighbors: This is the count of how often a data object appears in

k-NNs of every other data object [155]. This is defined as retrievability of a

document in IR literature [15]. A document that rarely appears in any other

k-NNs will have a high chance of being an outlier. This can be considered as an

alternative way to determine hubs of documents in the collection.

Definition 4 - k-occurrences: The number of times di ∈ D occurs in the k-

NN set of other documents is defined as the number of k-occurrences of di. It is

denoted as Nk(di) and referred to as the reverse neighbor count of di.

Example: Consider the document collection in Fig. 2(a) that is indexed in an

IR system to obtain the set of all relevant results, as shown in Fig. 5 (a). The

k-occurrences of document d1 is Nk(d1) = 4 in this collection as it appears in

k-NN lists of documents d1, d2, d6 and d9. This is the reverse neighbor count of

d1.

Using these basic concepts and definitions, we propose three novel algorithms to identify outliers in a document collection, as detailed in Table 2, and map them to the categories developed in the literature on (numerical) high-dimensional outlier detection. These methods capture outlier documents with rare terms using different document ranking techniques. (1) OIDF ranks documents by the average IDF weight of their terms. Because the IDF weighting scheme gives high importance to rare terms, it is able to deal with the high-dimensional text representation and identify the outliers efficiently. (2) ORFS uses an IR ranking function to retrieve ranking scores for the nearest-neighbor documents, and the reciprocal of the average similarity score is used to rank a document by how dissimilar it is to its nearest neighbors, thereby identifying the deviated documents. Following the cluster hypothesis [92], ORFS relies on scalable IR systems to deal with large, sparse text collections and to identify a set of related documents as the nearest neighbors of a given document. (3) ORNC extends this concept and uses the IR ranking responses to capture the k-occurrences of documents among the nearest neighbors in the entire collection, thereby identifying the hubs, i.e., the local sub-dense regions in high-dimensional data. ORNC ranks the documents by the inverse of the k-occurrence count to identify the outliers as anti-hubs with low k-occurrence counts.

Table 2: Summary of the proposed algorithms

| Category | Algorithm Concept |
|---|---|
| Rare Frequency based | OIDF: Outlier Detection based on Inverse Document Frequency |
| Ranking based | ORFS: Outlier Detection based on Ranking Function Score |
| k-occurrence based | ORNC: Outlier Detection based on Ranked Neighborhood k-occurrences Count |

3.2 Outlier Detection based on Inverse Document Frequency: OIDF

Figure 1: Algorithm 1 - OIDF

Generally, IDF measures how much information a term carries in the collection and is able to distinguish a term as distinct within the collection. According to Eq. 1, the IDF value of a rare term should be high. We conjecture that an outlier document will contain terms that deviate from the majority in the collection. We propose to use the average IDF weight of a document, combining all its terms, as a measure to detect outliers. An outlier score OSidf is assigned to every document di = {w1, w2, ..., wv}, represented by the weights of its v terms, based on the average IDF weight as follows.

$$OS_{idf}^{d_i} = \frac{1}{|v|} \sum_{j=1}^{v} w_j \tag{4}$$

It is expected that the OSidf score, which captures rare term frequencies, is high

in outlier documents compared to inliers. This is explained in more detail in

Appendix A. A document is defined as an outlier if the outlier score OSidf is

greater than a control parameter T1. The experiment section discusses how this control parameter is set systematically and automatically. Algorithm 1 presents all the steps of the OIDF method.

Example: Consider the same example in Fig. 2(a) that consists of eleven doc-

uments related to two sports: Cricket and Rugby, and an outlier document. Fig.

2(b) shows the IDF vector coefficients for each document with the average IDF

value of each vector. It reveals that the average IDF value of the outlier - d11 is

much higher than the rest of the collection.
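The scoring rule of Eq. 4 can be sketched as follows. This is an illustrative approximation rather than Algorithm 1 itself: the toy documents, the function names and the threshold values passed as `t1` are hypothetical, and the natural logarithm is assumed for IDF.

```python
import math
from collections import Counter


def oidf_scores(docs):
    """Average IDF weight of the distinct terms of each document (Eq. 4)."""
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n_docs / c) for t, c in df.items()}
    scores = []
    for d in docs:
        terms = set(d)
        scores.append(sum(idf[t] for t in terms) / len(terms) if terms else 0.0)
    return scores


def oidf_outliers(docs, t1):
    """Indices of documents whose OIDF score exceeds the control parameter T1."""
    return [i for i, s in enumerate(oidf_scores(docs)) if s > t1]


docs = [
    ["cricket", "bat", "ball"],
    ["cricket", "wicket", "ball"],
    ["rugby", "ball", "tackle"],
    ["rain", "storm", "forecast"],
]
print([round(s, 3) for s in oidf_scores(docs)])
print(oidf_outliers(docs, t1=1.2))  # strict threshold
print(oidf_outliers(docs, t1=1.0))  # looser threshold admits a false outlier
```

With the looser threshold the single rugby document is also flagged, because its terms are rare in this tiny collection even though it is a sports document. This is the kind of false positive that motivates combining OIDF with a second ranking-based test.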


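Figure 2: Example document collection with IDF VSM: (a) the toy document collection; (b) the IDF vector coefficients with the average IDF value of each document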

OIDF is a simple algorithm that identifies potential outlier candidates; however,

it also generates a large number of false positives. We present two ranking-based

algorithms that we propose to combine with OIDF to form ensemble approaches.

These ensemble approaches are able to drastically reduce the search space for

outliers and reduce false outliers (as evident in experiments).

3.3 Outlier Detection Based On Ranking Function Score: ORFS

The ranking concept can be used in outlier detection to assign outlier scores to observations based on a ranked list, on the assumption that the observations at the top receive higher outlier scores. In this paper, we propose to assign an outlier score to each document using the relevancy scores generated by the IR system. As per Definition 3, an IR system uses a ranking function, as shown in Eq. 2, to determine the most relevant documents ranked by the relevancy score calculated in Eq. 3. The relevancy score represents the level of relevancy of a retrieved document to the query document as compared to the whole collection. We utilize the relevancy scores, rp, of the top-m (m = 10) relevant documents for a given document di to show how consistent the given document is within the collection. The relevancy score of a document dp for a query document (score(q, dp) or rp) represents how similar that document is to the query document. In contrast, the reciprocal of the relevancy score indicates how dissimilar (i.e., deviated) the two documents are. We propose to calculate the outlier score OSr for a document di ∈ D as the reciprocal of the average relevancy score given by the search engine for the top-10 relevant documents. It represents the degree of deviation as follows.

$$OS_{r}^{d_i} = \frac{m}{\left| \sum_{p=1}^{m} r_p \right|}, \quad d_i \neq d_p, \ \text{where } r_p \geq 0 \tag{5}$$

A document is defined as an outlier if the outlier score OSr is greater than the

control parameter T2.
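A minimal sketch of Eq. 5 is given below. It assumes the retrieved set is available as a list of (document id, relevancy score) pairs, as in Eq. 2. The function name `orfs_score` is hypothetical, and two details are assumptions rather than statements from the text: m is taken as the number of neighbors actually retrieved after excluding the query document, and a document with no other retrieved neighbors is given an infinite score.

```python
import math


def orfs_score(doc_id, retrieved):
    """Reciprocal of the average relevancy score of the retrieved neighbours.

    `retrieved` is a list of (doc_id, relevancy_score) pairs with scores >= 0.
    The query document itself is skipped, following d_i != d_p in Eq. 5.
    """
    neighbours = [r for p, r in retrieved if p != doc_id]
    total = sum(neighbours)
    if total <= 0:
        # ASSUMPTION: nothing similar was retrieved, so treat as maximally deviated.
        return math.inf
    return len(neighbours) / total


# Hypothetical ranked responses for three query documents.
print(orfs_score(0, [(0, 5.0), (1, 2.0), (2, 2.0)]))  # well-connected inlier
print(orfs_score(1, [(0, 0.25), (2, 0.25)]))          # weakly related neighbours
print(orfs_score(3, [(3, 4.0)]))                      # retrieves only itself
```

A document whose neighbors all have low relevancy scores obtains a large outlier score, and it would be flagged once the score exceeds the control parameter T2.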

Ensembles ORFS(I) and ORFS(S): Combining OIDF and ORFS.

We propose to combine ORFS with OIDF to create an ensemble method to

achieve robust outlier detection as in Fig. 4. Previous researchers have built

the ensemble models in two ways: independent and sequential ensembles [1]. We

explored both approaches to reduce false positives.

Following the independent ensemble approach, both ORFS and OIDF algorithms

generate the outlier candidates (i.e., DI = D) and the common candidates in

both sets have been identified as final outliers, as in Eq. 6. This reduces the

number of false positives.

$$D_{f}^{o} = D_{idf}^{o} \cap D_{r}^{o} \tag{6}$$

Figure 3: Algorithm 2 - ORFS

Figure 4: Ensemble approaches of ORFS and OIDF for outlier detection

Following the sequential ensemble approach, OIDF is first used to generate outlier candidates, and those candidates are then tested through ORFS to calculate the outlier score OSr. The final set of outliers ($D_{f}^{o} = D_{r}^{o}$) is obtained using threshold T2. This approach allows ORFS to search for outliers in a much smaller search space (i.e., DI ⊂ D). Algorithm 2 explains all the steps in this approach of relevancy score-based outlier detection. Combining the outliers generated by OIDF, which directly uses IDF weights, with the outliers generated by ORFS, which considers IDF through the reciprocal retrieval score, reduces false detections, as detailed in the experimental analysis section.
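The difference between the two ensemble strategies can be sketched as below, assuming the OIDF and ORFS scores are already available. The function names, document identifiers, scores and thresholds are all hypothetical; in the sequential variant the ORFS score is supplied as a callable so that it is only evaluated for the OIDF candidates.

```python
def independent_ensemble(oidf_scores, orfs_scores, t1, t2):
    """Eq. 6: intersect the outlier sets produced separately by OIDF and ORFS."""
    idf_outliers = {d for d, s in oidf_scores.items() if s > t1}
    rank_outliers = {d for d, s in orfs_scores.items() if s > t2}
    return idf_outliers & rank_outliers


def sequential_ensemble(oidf_scores, orfs_score_fn, t1, t2):
    """OIDF proposes candidates; ORFS is evaluated on those candidates only."""
    candidates = [d for d, s in oidf_scores.items() if s > t1]
    return {d for d in candidates if orfs_score_fn(d) > t2}


# Hypothetical scores for three documents.
oidf = {"d1": 0.8, "d2": 1.1, "d3": 1.4}
orfs = {"d1": 0.3, "d2": 0.4, "d3": 2.5}

queried = []  # records which documents were sent to the (mock) IR system


def mock_orfs(doc_id):
    queried.append(doc_id)
    return orfs[doc_id]


print(independent_ensemble(oidf, orfs, t1=1.0, t2=1.0))
print(sequential_ensemble(oidf, mock_orfs, t1=1.0, t2=1.0))
print(sorted(queried))
```

Both variants return the same final outlier here, but the sequential variant queries the ranking step for two documents instead of three, which illustrates the reduced search space.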


Figure 5: Outlier scores based on ranking scores and k-occurrences

Example: Fig. 5 (a) shows the outlier scores calculated using ranking scores

given by the Elasticsearch search engine for the same toy example. The outlier document d11 has an outlier score ≫ 1, which highlights it as the most likely outlier. As shown in Fig. 2 (b), OIDF also assigns the highest outlier score to d11, making it the most suitable outlier candidate under both the independent and sequential ensemble approaches after combining with ORFS.

3.4 Outlier Detection Based On Ranked Neighborhood k-Occurrences Count: ORNC

Figure 6: Algorithm 3 - ORNC

In high-dimensional data, hubs are known to form local sub-dense neighborhoods instead of uniform distributions in a cluster [154]. We conjecture that outlier points are less likely to be included in these hub regions and should have fewer k-occurrences in the nearest-neighbor lists. In an indexed document collection, we obtain all sets of relevant documents using document queries to form the initial search space. For each document, we use the neighborhood documents calculated using Eq. 2 with the tf∗idf function in Eq. 3 to obtain the lists of nearest neighbors. The k-occurrences count is measured within all the retrieved relevant documents (i.e., nearest-neighbor sets) and used to define outlier scores based on the inverse of the count.

Let the documents retrieved in response to document query di on D be Ddi where

Ddi is obtained using Eq. 2. The outlier score OSc for do ∈ D is calculated as:

$$OS_{c}^{d_o} = 1 \Big/ \left( \sum_{i=1}^{|D|} \left| [d_o \in D_{d_i}] \right| \right) \tag{7}$$

Algorithm 3 describes the overall process of ORNC where each document is as-

signed with an outlier score OSc. If the score is greater than the control parameter

T3, the document is classified as an outlier.
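Eq. 7 can be sketched with a simple counter over the retrieved neighbor lists. The dictionary `knn_lists`, the function names and the toy lists are hypothetical. The sketch assumes each document appears in its own retrieved list, as in the example of Definition 4, and assigns an infinite score to a document that is never retrieved, which is a choice made here and not stated in the text.

```python
import math
from collections import Counter


def k_occurrences(knn_lists):
    """N_k(d): number of retrieved lists in which each document appears."""
    counts = Counter({doc_id: 0 for doc_id in knn_lists})
    for neighbours in knn_lists.values():
        counts.update(set(neighbours))  # count a document once per list
    return counts


def ornc_scores(knn_lists):
    """Eq. 7: inverse of the k-occurrence count (inf if never retrieved)."""
    counts = k_occurrences(knn_lists)
    return {d: (1.0 / c if c else math.inf) for d, c in counts.items()}


# Hypothetical top-m responses keyed by the query document.
knn_lists = {
    "d1": ["d1", "d2"],
    "d2": ["d2", "d1", "d3"],
    "d3": ["d3", "d1"],
    "d4": ["d4"],
}
print(dict(k_occurrences(knn_lists)))
print(ornc_scores(knn_lists))
```

The hub-like document d1 receives the lowest score, while d4, which is retrieved only by itself, receives the maximum score of 1 and would be reported as an anti-hub if that score exceeds T3.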

Ensemble ORNC(S): Combining OIDF and ORNC.

Identifying a neighborhood is an expensive operation due to the need for pairwise comparisons [204]. Even in a smaller dataset, measuring k-occurrences by analyzing the nearest-neighbor lists is highly expensive. Therefore, we propose to use only a selective set of outlier candidates to achieve effectiveness through the sequential ensemble approach. The initial set of outlier candidates is obtained using OIDF, and the number of k-occurrences is measured, within all the retrieved relevant documents, for those candidate documents only. This sequential ensemble approach of OIDF with ORNC is able to identify the final outlier documents in less time and with higher accuracy by reducing the search space, as in Fig. 7.

Figure 7: The ensemble approach of OIDF and ORNC for outlier detection

Example: Fig. 5(b) shows the outlier scores calculated using reverse neighbor

count (k-occurrences) for all documents in the example document collection. The

highest outlier score of 1 is given to the outlier candidate d11 proposed by OIDF.

It shows that the proposed method can identify the actual outlier.

In summary, the core concept used in three ensemble methods is the rare

frequency-based outlier detection (OIDF). By using various IR ranking concepts,

the quality of outlier detection of OIDF is improved by reducing false outliers.

The ranking-based neighborhoods were used to provide outlier scores using pre-

calculated relevancy scores in ORFS and k-occurrences in the response sets in

ORNC.



| Datasets | # of Docs | # of Unique Terms | # of Total Terms | # of Avg. Terms per doc | # of Outliers |
|---|---|---|---|---|---|
| Wikipedia (DS1) | 11521 | 305827 | 9206250 | 799 | 100 |
| 20News groups (DS2) | 4909 | 27882 | 374642 | 76 | 50 |
| Reuters (DS3) | 5050 | 13438 | 200482 | 40 | 50 |
| SED2013 (DS4) | 81228 | 46548 | 1583073 | 19 | 840 |
| SED2014 (DS5) | 91670 | 46031 | 1816840 | 20 | 976 |


In this section, we present the experimental evaluation of the proposed primary

method OIDF and its ensemble approaches ORFS(I), ORFS(S) and ORNC(S)

for accuracy, efficiency, and scalability. We performed the experiments on a single

processor of 1.2 GHz Intel(R) Xeon(R) with a 264 GB shared memory. Algorithms

were implemented in Python 3.5. Elasticsearch 2.4 was used as a search engine

to provide relevant documents. First, we present the description of the real-

world datasets used and the standard evaluation measures used to determine the

accuracy of outlier detection. We show that the commonly used measures do not

evaluate the outliers effectively; hence, we present a new evaluation criterion to

report false predictions. The next few sections present the empirical analyses.

4.1 Datasets

Three categories of collections, with documents of short, medium and large length, are used in the experiments, as shown by the column of average terms per document in Table 3. Wikipedia data, which has an average of about 800 terms per document, is used to validate the outlier detection behavior on a collection with large documents. The well-known 20News group data and Reuters data, which have about 80 and 40 terms on average respectively, are used to validate the outlier detection behavior on collections with medium documents. The MediaEval Social Event Detection 2013 and 2014 datasets, with about 20 terms on average, are used to analyze outlier detection on collections with short documents. These figures are averages; each collection also contains documents that are larger or smaller than the average. These datasets have ground-truth values that were used to measure the methods’ effectiveness extrinsically. They were designed and selected to evaluate the performance of the proposed methods against various challenges that exist for text outlier detection: (1) different text vector sizes, (2) different collection sizes, (3) different numbers of classes, and (4) high vocabulary overlap between inlier and outlier classes.

Our approach is distinct from, and more complex than, existing methods, as the document sets contain multiple classes of documents, both inliers and outliers. Existing methods [96] do not include a diverse set of classes in their datasets, which makes the outlier detection process simpler and unnatural. They usually have one class of documents and outliers that do not belong to this class. We select a set of inlier and outlier classes, and attempt to identify outliers that differ from all these inlier classes and show less term overlap with them. DS1 contains inliers from multiple Wikipedia subcategories under ‘War’ and outliers from 10 other categories not included in ‘War’. DS2 contains inliers from five classes related to ‘Computers’ and outliers from five other classes in 20News groups. DS3 contains inliers from two classes and outliers from 25 other classes in the Reuters dataset. This dataset has classes with overlapping vocabulary, presenting a more complex outlier detection scenario with a considerably high number of overlapping terms between inlier and outlier classes. Inliers in the short datasets (DS4 and DS5) are collected from classes that have at least 100 documents, while two outliers per class are collected from all other classes. These short datasets consist of more than 800 inlier and 400 outlier classes, which allows us to explore fine-grained scenarios as well as large collection sizes. Table 3 shows a summary of these datasets.

4.2 Evaluation Measures

Accuracy is a well-known measure to define the effectiveness of outlier detec-

tion. Accuracy analyses the percentage of correctness in predictions [84]. Let

TP ,TN ,FP ,FN denote the correct outliers, correct inliers, incorrect outliers,

and incorrect inliers respectively where P , N denote the total number of outliers

and inliers. The metric accuracy (ACC) is calculated as:

$$ACC = \frac{TP + TN}{TP + FP + FN + TN} = \frac{\text{total correct predictions}}{\text{total observations}} \tag{8}$$

However, the ACC measure disregards the false outliers and false inliers. An

explanation is provided in Appendix B.

Alternatively, the effectiveness of outlier detection is measured using the area

under the Receiver Operating Characteristics (ROC) curve (AUC) [1, 126, 155].

The ROC curve shows the TP rate (TPR) against FP rate (FPR). Let TPR

and FPR be:

$$TPR = \frac{TP}{TP + FN} = \frac{TP}{P} \tag{9}$$

$$FPR = \frac{FP}{FP + TN} = \frac{FP}{N} \tag{10}$$

When T denotes a threshold to control outliers, AUC of ROC curve can be

defined as:

$$AUC = \int_{0}^{1} ROC(T)\, dT \tag{11}$$

However, AUC also focuses only on correct predictions, which leads to a mis-

leading picture of outlier detection without considering incorrect predictions. An


explanation is provided in Appendix C. Consequently, there is a need for ana-

lyzing false positives and false negative predictions. The inverse of ACC, which

represents the total number of incorrect predictions against total observations,

is not a clear measure of the false outlier and false inlier predictions. Some re-

searchers have used FPR or FNR to report these. FPR considers false outliers

(FP ) against inliers in the dataset and shows linear variation within the range of

0 to 1 for the gradual increase of FP . FNR considers false inliers (FN) against

outliers in the dataset and shows linear variation within the range of 0 to 1 for

the gradual increase of FN . However, FPR and FNR provide relative values

and are not able to differentiate the capability of a method to deal with false

alarms. We propose two measures Outlier Prediction Error (OPE) and Inlier

Prediction Error (IPE) to emphasize false predictions with respect to true pre-

dictions. OPE reports false outliers (FP ) against true inliers (TN) and IPE

reports false inliers (FN) against true outlier (TP ).

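OPE is defined as:

$$OPE = \frac{FP}{TN + \varepsilon}, \quad \varepsilon = 1 \text{ if } TN = 0, \text{ else } \varepsilon = 0 \tag{12}$$

IPE is defined as:

$$IPE = \frac{FN}{TP + \varepsilon}, \quad \varepsilon = 1 \text{ if } TP = 0, \text{ else } \varepsilon = 0 \tag{13}$$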

The OPE (or IPE) measure varies in the range of 0 to N (or P ) and can be

divided into two ranges: 0 to 1; and 1 to N (or P ). The value of OPE is within

0 to 1 if a method detects more true inliers than false outliers. However, a value greater than 1 indicates that an outlier detection method is producing more false outliers than true inliers. Similarly, an IPE value greater than 1 indicates a less effective method.
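The measures in Eqs. 8 to 10, 12 and 13 can be computed directly from the four confusion counts, as in the sketch below. The function name `detection_metrics` and the two confusion-count scenarios are hypothetical and are not results from the experiments; returning 0 for TPR or FPR when the corresponding class is empty is a choice made here for the sketch.

```python
def detection_metrics(tp, tn, fp, fn):
    """ACC, TPR, FPR (Eqs. 8-10) and OPE, IPE (Eqs. 12-13) from confusion counts."""
    total = tp + tn + fp + fn
    p, n = tp + fn, fp + tn  # total outliers and total inliers
    return {
        "ACC": (tp + tn) / total,
        "TPR": tp / p if p else 0.0,
        "FPR": fp / n if n else 0.0,
        "OPE": fp / (tn if tn else 1),  # epsilon = 1 only when TN = 0
        "IPE": fn / (tp if tp else 1),  # epsilon = 1 only when TP = 0
    }


# Hypothetical collection of 1000 documents with 10 true outliers.
all_inlier = detection_metrics(tp=0, tn=990, fp=0, fn=10)  # flags nothing
noisy = detection_metrics(tp=8, tn=900, fp=90, fn=2)       # many false alarms
print(all_inlier)
print(noisy)
```

A method that labels every document as an inlier still reaches an ACC of 0.99 in this scenario, while its IPE of 10 makes the missed outliers visible. The second scenario has a lower ACC but small OPE and IPE values, because most true outliers are found.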


4.3 Baseline Algorithms

As primary baselines, we have chosen unsupervised outlier detection methods

from the major categories in the existing literature to compare against the pro-

posed unsupervised outlier detection methods. The benchmarking algorithms

listed below were used as unsupervised baselines.

• Outlier detection using k-nearest neighbors (KNNO) [157]: This is a

distance-based method where distance is calculated between each object

and their k-NNs. Objects are then ranked based on the distance to k-NNs

where the top n objects are declared as outliers with a user-defined n. In this baseline method, we set the number of outlier documents to 1% of each collection and k to 10.

• Outlier detection using local density estimation (LOFO) [29]: This is a

density-based method where a degree known as local outlier factor (LOF)

is assigned to each object considering how isolated an object is with respect

to k-NNs (k is set to 10). LOF is defined as the ratio of the average density of the neighbors to the density of the object. Objects with high-ranked LOF values are defined as outliers. The threshold that governs the boundary between inliers and outliers is set to 1, in line with past research [8].

• Outlier detection using Non-Negative Matrix Factorization (NMFO) [96]:

This is a recently developed matrix factorization-based approach specifically

designed for text outlier detection. The l2 norm assigned in the learning

process of document-term matrix factorization is used as the outlier score

for each document. The documents that get high-rank outlier scores are

defined as outliers. This method depends on several control parameters

such as k, α, β. They are tuned to the best possible values after several

parameter-tuning attempts. Best parameter values k, α, β in DS2 and DS3


were set to (20, 179, 0) and (5, 23, 0) respectively while DS1, DS4 and DS5

were set to (20, 11, 0) following the description in [96] and yielding best

results in multiple experimental settings.

• Mutual nearest-neighbor graph-based clustering method for detecting out-

liers (MNCO) [55]: This method is designed to cluster high dimensional

sparse data by creating a considerably dense mutual neighbor graph. The

points that do not belong to a cluster in the graph are considered noise or

outliers. The two control parameters to define core dense regions in this

baseline are set to 3 as they satisfy the minimum requirement, to be dense

[140].

In addition, we compare the term frequency-based IR ranking approach used for document similarity identification in the proposed methods against semantic embedding-based document similarity identification using the doc2vec representation [113]. The set of similar documents and the similarity scores given by doc2vec for a document are used in the ORFS and ORNC algorithms for comparison with the IR-based ORFS and ORNC. Recently, neural network-based approaches have become popular in text mining with full or weak supervision [107, 127, 135]. Although our proposed methods are fully unsupervised, we have conducted experiments with a convolutional neural network for text classification [107] and generative adversarial active learning for outlier detection [127]. We follow the standard practice of using a dense word representation with reduced dimensionality, obtained using Global Vectors for Word Representation (GloVe) [148], as the input to the neural network in the experiments [107]. Methods based on training a neural network are not effective, and are extremely time consuming, when used with the full-dimensional space.


4.4 Accuracy Comparison

Accuracy of the proposed methods for large, medium and short text document

collections is analyzed with the standard measures of ACC, ROC curve, and

AUC as well as the proposed measures of OPE and IPE.

Accuracy

In general, the proposed methods improve on the majority of baselines,
especially when the dimensionality of the vector is high and the dataset is
large. As detailed in Table 4, the high ACC values of OIDF and its ensemble
approaches show that they outperform all baselines except KNNO. However, KNNO
does not scale to the high-dimensional Wikipedia document collection (DS1).
Moreover, KNNO requires the number of outlier documents as a control parameter,
which is a major limitation because its outcome depends on that parameter.

Table 4: Accuracy measure for different methods against datasets

           Our Methods                          Baseline Methods
Dataset    OIDF   ORFS(I)  ORFS(S)  ORNC(S)     KNNO   LOFO   NMFO   MNCO
DS1        0.85   0.92     0.93     0.93        ∗      ∗      ∗      -
DS2        0.87   0.93     0.94     0.94        0.99   0.01   0.98   0.06
DS3        0.82   0.93     0.9      0.91        0.98   0.02   0.69   0.17
DS4        0.82   0.9      0.9      0.93        0.99   ∗      0.01   -
DS5        0.82   0.9      0.91     0.93        0.99   ∗      0.01   -
Avg.       0.84   0.92     0.92     0.93        0.99   0.02   0.42   0.12

Note: (S), (I), "∗" and "-" denote the sequential ensemble approach, the
independent ensemble approach, aborted operations, and memory or runtime error
respectively.

Within the proposed approaches, the basic OIDF algorithm yields the lowest
accuracy; the ensemble methods are able to reduce the false positives generated
by OIDF. The ensemble methods based on the IR ranking score, ORFS(I) and
ORFS(S), perform similarly. As per ACC, ORNC(S) is the best approach among the
proposed methods. ORNC(S), based on the hub concept, achieves a high level of
performance even on extremely sparse short text data such as social media text
(DS4, DS5). In the Reuters dataset (DS3), where the classes in the collection
overlap heavily, it is hard to separate outliers using the terms in the VSM
representation. This dataset yields relatively lower accuracy for most of the
methods.

ROC and AUC

ACC considers the total correctness of predictions and, unlike AUC, does not
analyze true outliers and true inliers individually (see Appendix B for more
detail). Hence, we explored the ROC curves using the fixed control parameters
proposed for our algorithms (the sensitivity analysis provides more detail on
these parameters), and the optimum threshold is used for each baseline. As
shown by the graphs in Fig. 8, Fig. 9 and Fig. 10, OIDF provides the highest
AUC values, except on the short text datasets (DS4, DS5), due to its capacity
to distinguish documents according to rare frequencies. This capacity gives
OIDF a higher TP rate, and therefore a higher AUC, because AUC analyzes TP
(true outliers) and TN (true inliers) separately, in contrast to ACC, which
represents total correct predictions. Both ranking score-based ensemble
methods, ORFS(I/S), perform similarly. The k-occurrences-based ensemble
approach ORNC(S) outperforms all the others on short text data due to its
hub-based concept, which is applicable to higher dimensions.

More specifically, Fig. 8 shows the ROC curves of the Wikipedia document
collection (DS1) derived by OIDF and its ensemble approaches. No baseline
method could be executed on this dataset due to the large text size, which
confirms the scalable nature of OIDF and its improved variations. Moreover, the
basic rare term-weight-based method OIDF outperforms the ensemble methods. This
demonstrates the power of a simple term weighting model to differentiate terms
meaningfully in large documents, where the occurrence of terms is considerably
high within a document as well as in the respective collection. As depicted in
Fig. 8, the IR ranking concept dilutes the effectiveness of this simple method
by reducing the true outliers (TP) when TP and TN (true inliers) are analyzed
separately with AUC.

On the medium-sized collections (the 20 Newsgroups dataset (DS2) and the
Reuters dataset (DS3)), the proposed methods give higher AUC than the
baselines, as shown in Fig. 9. As with the large text collection, basic OIDF,
which simply considers the average rare frequency values of terms, yields the
highest AUC because it identifies a higher number of TPs. KNNO, which requires
the number of outliers as a control parameter, is the best of the baselines on
DS2, while NMFO performs like a random assignment. LOFO, which measures the
density around a point with respect to the density of its neighbors, shows the
lowest performance on text data, which is naturally sparse due to fewer term
co-occurrences among documents. It cannot differentiate the density around
points in this sparse setting. LOFO identifies the majority of the inliers as
outliers, which results in lower ACC. MNCO, which uses a mutual nearest
neighbor graph, outperforms the other baselines on DS3, which contains
overlapping class labels. On overlapping datasets, the ranking-based ORFS(I/S)
and ORNC(S) methods perform worse than on normal medium-sized datasets.

The ROC curves for document collections with short term vectors are presented
in Fig. 10. Similar documents in these collections share very few
discriminative terms compared to the other two dataset categories.
Consequently, on these extremely sparse datasets, the ranking-based ORNC(S)
ensemble approach outperforms the basic OIDF method, due to the inclusion of a
local sub-dense neighborhood (hub) concept, which is known to work for higher
dimensions. Likewise, the distance-based KNNO, which compares pairwise distance
differences, does not perform well on these datasets, as on the other datasets,
due to the distance concentration problem.
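The distance concentration problem mentioned above can be illustrated with a small experiment. This is a sketch on synthetic uniform random points, not on the thesis datasets; the function name, the sample size and the seed are assumptions made for the illustration. It measures the relative contrast, (max − min) / min, of the distances from one query point to a set of random points.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of the distances from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(dim=2)       # low-dimensional space
high = relative_contrast(dim=1000)   # high-dimensional space
print(f"dim=2: {low:.2f}, dim=1000: {high:.2f}")
```

In the high-dimensional case the nearest and farthest points lie at almost the same distance from the query, which is why pairwise distance comparisons lose their discriminating power.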

Furthermore, we compare the AUC results of the ensemble methods ORFS(I/S) and
ORNC(S), obtained using IR ranking-based text similarity computed from term
occurrences and their respective frequencies, against semantic embedding-based
text similarity. Distributed Representations of Documents (doc2vec) [113] is an
unsupervised learning algorithm that produces dense vector representations of
documents by considering syntactic and semantic word relationships within a
corpus. doc2vec is used to obtain the set of similar documents in the corpus,
with similarity scores, for a document, in the same way as the IR ranking
function, and we modified ORFS and ORNC to identify outliers from this output,
giving the baselines in Table 5. The results in Table 5 show that semantic
embedding-based ORFS performs the same as IR ranking score-based ORFS on
average. However, for document collections with short term vectors, semantic
embedding provides more accurate results. In short text, where vectors are
extremely sparse, semantic embedding can identify text similarity more
effectively than an IR ranking function. However, doc2vec-based ORNC is less
effective than IR ranking function-based ORNC(S) at identifying hub points, or
local sub-dense points, in high-dimensional text data. Theoretically, the
cluster hypothesis and its reverse [59, 91] also showed that an IR function
returns, in response to a query document, a set of similar documents that
reside in the same cluster. They show that IR ranking can be used to identify
the documents in the same cluster, and we use it in outlier detection to
identify the hubs in clusters within ORNC(S).

Figure 8: ROC curve and AUC for document collection with larger size term
vectors ("∗" denotes aborted operations, memory or runtime error)

The results in Table 6 show the performance of neural network-based methods on
the datasets. The convolutional neural network used for supervised text
classification [107] provides almost the same accuracy as ORNC(S) on average
when compared with the results in Table 4. However, the results show that,
except on DS2, which is small in collection size and has many classes,
supervision based on training provides superior results. This confirms the
superiority of supervised methods over unsupervised methods when enough data is
available for training. Generative adversarial active learning for outlier
detection [127] is a novel Generative Adversarial Network (GAN)-based
semi-supervised approach to outlier detection. A GAN model includes two
networks: a generative network generates candidates and a discriminative
network evaluates their validity [121, 128]. Although the GAN-based method in
[127] works in an unsupervised setting without relying on ground-truth labels
of the data [100, 184], it follows a semi-supervised approach with active
learning to generate initial outliers with reference to real data for the
discriminator network. The results in Table 6 show that, with weak supervision
on sparse text data, it performs close to a random method and is inferior to
our proposed fully unsupervised methods.

Outlier Prediction Error and Inlier Prediction Error

To focus on false outliers and inliers, we next present the results with OPE
and IPE. The results in Table 7 support the conjecture that OIDF should be used
as a basic method and that the ranking-based algorithms ORFS or ORNC should be
used to make an ensemble method with OIDF. ORFS as a standalone method
generates a high level of false outliers (i.e., its OPE value is closer to 1
than 0) while giving more false inliers than OIDF on average. ORNC cannot be
used as a basic method due to its high time complexity, since the number of
comparisons increases with the size of the dataset.

Figure 9: ROC curve and AUC for document collections with medium size term
vectors

Figure 10: ROC curve and AUC for document collections with short size term
vectors ("∗" denotes aborted operations, memory or runtime error)

As shown in Table 8, the ensemble methods combining OIDF and the ranking-based
algorithms show a significant reduction in false outliers, making them suitable
for real-world scenarios. The sequential ensemble approach ORNC(S) gives the
fewest false outliers because it works on filtered outlier candidates, and it
is the best among our methods in terms of OPE. ORFS(I/S) also shows a reduction
in false outliers. This confirms the importance of ensemble approaches, which
reduce false detection compared to OIDF, which directly uses IDF weights for
outlier detection, or ORFS and ORNC, which use IDF weights within a ranking
function. The benchmark method KNNO shows good performance as it takes the
specified number of outliers as an external input and obtains a controlled set
of outlier documents. All other baseline methods generate a high number of
false outliers, and some of them, such as the mutual neighbor graph-based MNCO
and LOFO algorithms, fail to scale to large and high-dimensional datasets. In
this set-up, NMFO produces the worst performance. This may be due to the need
for rigorous parameter tuning. The fine-grained nature of the large SED
datasets (i.e., DS4 and DS5) impaired this process, and we were unable to find
realistic parameters even after a great effort.

Table 5: AUC comparisons against semantic word embedding-based ranking

           Our Methods                     Baseline Methods with doc2vec similarity scores
Dataset    ORFS(I)  ORFS(S)  ORNC(S)       ORFS with doc2vec   ORNC with doc2vec
DS1        0.70     0.69     0.71          0.72                0.56
DS2        0.85     0.83     0.79          0.65                0.60
DS3        0.7      0.69     0.7           0.67                0.60
DS4        0.65     0.65     0.77          0.75                0.72
DS5        0.65     0.65     0.78          0.75                0.72
Avg.       0.71     0.70     0.75          0.71                0.64

Note: (S) and (I) denote the sequential ensemble approach and the independent
ensemble approach respectively.

Table 6: Performance given by Neural network-based methods

Dataset    Supervised CNN      GAN-based Active Learning
           Accuracy (ACC)      Area Under the Curve (AUC)
DS1        0.99                0.52
DS2        0.74                0.56
DS3        0.98                0.52
DS4        0.97                0.50
DS5        0.97                0.50
Avg.       0.93                0.52


Table 7: AUC, OPE and IPE for our pure ranking based outlier detection
approaches

           OIDF                 ORFS                 ORNC
Dataset    AUC   OPE   IPE      AUC   OPE   IPE      AUC   OPE   IPE
DS1        0.77  0.17  0.47     0.58  0.99  0.49     ∗     ∗     ∗
DS2        0.88  0.15  0.14     0.68  0.56  0.28     ∗     ∗     ∗
DS3        0.72  0.22  0.61     0.66  0.99  0.2      ∗     ∗     ∗
DS4        0.76  0.22  0.44     0.57  0.98  0.54     ∗     ∗     ∗
DS5        0.77  0.22  0.38     0.56  0.99  0.6      ∗     ∗     ∗
Avg.       0.78  0.2   0.41     0.61  0.90  0.42     ∗     ∗     ∗

Note: "∗" denotes aborted operations. None of the above methods is recommended
for use as a standalone method.

Table 8: OPE for different methods against datasets (a lower value near 0 is
better)

           Our Methods                          Baseline Methods
Dataset    OIDF   ORFS(I)  ORFS(S)  ORNC(S)     KNNO   LOFO     NMFO       MNCO
DS1        0.17   0.08     0.08     0.07        ∗      ∗        ∗          -
DS2        0.15   0.07     0.07     0.06        0.01   372.77   0.01       20.13
DS3        0.22   0.06     0.1      0.1         0.01   23.51    0.44       5.13
DS4        0.22   0.1      0.1      0.07        0      ∗        80388      -
DS5        0.22   0.11     0.1      0.07        0      ∗        90694      -
Avg.       0.2    0.08     0.09     0.07        0.01   198.14   42770.61   12.63

Note: (S), (I), "∗" and "-" denote the sequential ensemble approach, the
independent ensemble approach, aborted operations, and memory or runtime error
respectively.

Table 9 shows the inlier prediction error using IPE. OIDF shows the fewest
false inliers among the proposed methods. KNNO and NMFO become ineffective,
showing very high IPE. The baseline LOFO and MNCO methods outperform the
proposed methods on the two datasets they could process by producing fewer
false inliers; however, they do not scale well to larger and high-dimensional
document collections. A closer investigation of these two methods with OPE
reveals that, although they do not produce false inliers, they flag a large
portion of inliers as outliers (i.e., FP is extremely high). In contrast, our
proposed methods give a balanced performance, with reduced false outliers and
false inliers.

Table 9: IPE for different methods against datasets (a lower value near 0 is
better)

           Our Methods                          Baseline Methods
Dataset    OIDF   ORFS(I)  ORFS(S)  ORNC(S)     KNNO    LOFO   NMFO    MNCO
DS1        0.47   1.13     1.17     1.08        ∗       ∗      ∗       -
DS2        0.14   0.32     0.39     0.56        1.5     0      49      0
DS3        0.61   1.17     1.08     1.08        49      0      1.63    0
DS4        0.44   1.63     1.63     0.64        1       ∗      0       -
DS5        0.38   1.48     1.58     0.6         0.94    ∗      0       -
Avg.       0.41   1.15     1.17     0.79        13.11   0      12.66   0

Note: (S), (I), "∗" and "-" denote the sequential ensemble approach, the
independent ensemble approach, aborted operations, and memory or runtime error
respectively.

AUC and OPE in combination give the complete picture of the effectiveness of

the proposed methods in outlier prediction. Further, IPE gives an indication of

false inlier prediction, which has been neglected in most of the outlier detection

work. The OPE and IPE measures in Table 8 and Table 9 depict the higher

quality of outlier prediction, with reduced false positives and false negatives, in

our approaches compared to baseline methods.

4.5 Scalability and Computational Performance Analysis

The time taken by each method is shown in Fig. 11 (a). The results confirm that
the proposed methods consume less time than the benchmarking methods, in
addition to the improved accuracy shown in the previous sections. OIDF
outperforms all the methods due to the simple rare document frequency-based
calculation it uses for outlier filtering. Among the ensemble approaches,
ORNC(S) shows the highest time consumption because it requires a larger number
of comparisons, though it is able to execute on all datasets thanks to the much
smaller search space generated by the potential OIDF outliers.


Figure 11: Time and memory consumption for the proposed and benchmarkingmethods

The independent ensemble approach in ORFS (i.e., ORFS (I)) shows a high time

requirement due to additional iterations over the complete dataset.

The proposed methods outperform all baseline methods, except on DS5, where the
matrix factorization-based NMFO consumes slightly less time than ORNC(S).
However, as shown in the previous sections, NMFO produces inferior outcomes to
ORNC(S). When data dimensionality is high, as in the Wikipedia dataset (DS1),
the benchmarking methods were aborted due to exceptionally high time
consumption. Additionally, LOFO and MNCO could not handle document collections
with a large number of instances, such as SED 2013 (DS4) and SED 2014 (DS5).

Fig. 11 (b) shows the memory consumption of each method. The rare
frequency-based OIDF and the ranking score-based ORFS(I/S) ensemble approaches
consume less memory than the ranked neighborhood-based ORNC(S), which considers
k-occurrences, similar to sub-dense local neighborhoods (hubs) in high
dimensionality. Fig. 11 (b) clearly highlights that the proposed methods
consume less memory than the baseline methods. KNNO, which achieves high
accuracy in ACC, shows higher memory and time consumption. Due to heavy time
and memory requirements, all baseline methods are impaired when dealing with
large term vectors such as Wikipedia, leading to resource starvation.

Figure 12: Scalability of methods using incremental samples

We further explore the scalability of the proposed methods using incremental
samples of SED 2013, which consists of short term vectors. Fig. 12 shows that
the log of the computation time of all proposed methods is near-linear in the
data size. The simplest method, rare frequency-based OIDF, takes the least
time, while the k-occurrences-based ensemble approach (i.e., ORNC(S)) consumes
the most time among our methods. Among the baseline methods, distance-based
KNNO is the only one that scaled up to the largest sample used in the
experiment. However, the success of KNNO depends on the number of outliers
given as an external input, and it is not successful in terms of AUC, which
analyses inliers and outliers separately. Further, IPE, which represents the
inlier prediction error, is high for KNNO.

Table 10 shows the computational complexities of the proposed algorithms
against the baseline methods. Among the proposed algorithms, ORNC, which
defines the outlier score based on the number of times a particular document
appears in IR search results, has the highest computational complexity. The
sequential ensemble approach of this method, ORNC(S), cuts down this complexity
by reducing the search space n. The baseline algorithms show comparatively
higher computational complexity. This explains why LOFO and MNCO did not work
for larger datasets while OIDF and ORFS worked efficiently.

Table 10: Summary of the proposed methods

               Our Methods                      Baselines
               OIDF    ORFS     ORNC            KNNO        LOFO        NMFO       MNCO
Big-O          O(nd)   O(ndk)   O(n^2 dk)       O(n^2 dk)   O(n^3 dk)   O(n^2 d)   O(n^3 dk)
complexity

Note: n - size of the document collection, d - dimensionality and
k - considered number of nearest neighbors.


OIDF and its ensemble approaches use a threshold parameter similar to prior
outlier detection algorithms [68, 157]. We set these control parameters
automatically using internal characteristics of the dataset; therefore, the
proposed methods can be called parameter free. All parameters that govern
outlier scores were explored using intrinsic data characteristics such as the
mean, median and standard deviation. The control parameter T1 of OIDF is set as
the combination of the median and the standard deviation. The median, which
removes the effect of noise, is boosted by adding the standard deviation in
order to detect outliers that make up a small portion of the document
collection. It yields more true outliers, as shown in Fig. 13 (a), for all the
datasets except Reuters (DS3), which contains overlapping class labels for
documents.
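The data-driven threshold described above can be sketched as follows. This is a minimal illustration, assuming that "combination" means the sum of the median and the standard deviation, that the population standard deviation is used, and that documents scoring above the threshold are the outlier candidates; the scores are hypothetical average-IDF values, not results from the thesis.

```python
import statistics

def oidf_threshold(scores):
    """T1 sketch: median of the outlier scores plus their standard deviation."""
    return statistics.median(scores) + statistics.pstdev(scores)

# Hypothetical per-document outlier scores (average IDF of the terms).
scores = [1.0, 1.1, 0.9, 1.2, 1.0, 3.5]
t1 = oidf_threshold(scores)
candidates = [i for i, s in enumerate(scores) if s > t1]
print(round(t1, 3), candidates)
```

Because the median is insensitive to the single extreme score, the threshold stays close to the bulk of the data, and only the document with the unusually high score is retained as a candidate.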

The control parameters T2 in ORFS(I/S) and T3 in ORNC(S) are set as the median
value. Fig. 13 (b), Fig. 13 (c) and Fig. 13 (d) show how the quality of
prediction varies amongst descriptive statistical measures. The median, which
removes the unusual bursts in the outlier scores, gives the highest AUC except
for DS3 in ORFS(S), which contains overlapping class labels.

Figure 13: The sensitivity of the control parameters T1, T2 and T3

Figure 14: The sensitivity of the ranking functions in the IR system (three
panels, ORFS(I), ORFS(S) and ORNC(S), each plotting AUC on DS1 to DS5 for BM25
and TF*IDF)

An IR system can employ different ranking functions, such as LM Jelinek-Mercer
Smoothing (LM-JM), LM Dirichlet Smoothing (LM-Dirichlet) and Okapi BM25, in
addition to tf*idf [23]. However, LM-JM assigns negative scores to terms that
have fewer occurrences, and LM-Dirichlet captures important patterns in the
text while leaving out the noise [54]. Therefore, they are less effective in
highlighting outliers that have rare terms. In contrast, the BM25 and tf*idf
ranking functions can capture deviated documents with rare terms using the IDF
of terms. Figure 14 shows the results provided by each proposed method with
these two ranking functions. It shows that, in general, tf*idf identifies
outliers more accurately than BM25 for the proposed methods, and we used it as
the default.
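To make the two ranking functions concrete, the following sketch scores one tokenized document against another with a plain tf*idf sum and with Okapi BM25. It is an illustration only: the IDF form log(N / df) is an assumption shared by both scorers (search engines typically use smoothed variants), `k1 = 1.2` and `b = 0.75` are commonly used BM25 defaults rather than values taken from the thesis, and the four toy documents are hypothetical.

```python
import math
from collections import Counter

def idf(term, docs):
    """Plain IDF, log(N / df); an assumed form used by both scorers below."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_score(query, doc, docs):
    """Sum of tf * idf over the query terms that occur in doc."""
    tf = Counter(doc)
    return sum(tf[t] * idf(t, docs) for t in set(query))

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 with document length normalization."""
    tf = Counter(doc)
    avgdl = sum(len(d) for d in docs) / len(docs)
    norm = k1 * (1 - b + b * len(doc) / avgdl)
    return sum(idf(t, docs) * tf[t] * (k1 + 1) / (tf[t] + norm)
               for t in set(query))

docs = [
    ["cricket", "match", "score"],
    ["cricket", "team", "score"],
    ["cricket", "match", "team"],
    ["volcano", "eruption", "ash"],   # hypothetical outlier document
]

def best_neighbor_score(i, scorer):
    """Highest relevance score document i obtains against any other document."""
    return max(scorer(docs[i], docs[j], docs)
               for j in range(len(docs)) if j != i)

for name, scorer in (("tf*idf", tfidf_score), ("BM25", bm25_score)):
    print(name, [round(best_neighbor_score(i, scorer), 3)
                 for i in range(len(docs))])
```

Under both functions the document made up only of rare terms obtains no relevance from any other document, while the three related documents retrieve one another with positive scores.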

4.7 Discussion

This paper proposes a basic method based on the concept of weighted term
vectors, and its ensemble approaches based on the ranking of relevant documents
obtained through an IR system, in order to achieve accurate and scalable
outlier detection in document collections. An extensive empirical analysis
provides insight into the proposed algorithms. We summarize the interesting
observations as follows:

• The basic algorithm OIDF, based on the simple concept of using rare terms

in a document, which can be emphasized through IDF schema, shows high

competence to detect deviations even for large document collections con-

suming less time. However, as shown by ACC and OPE results, OIDF is

adversely affected by the higher number of false positives produced.

• The use of search engine ranking provides the advantage of obtaining rel-

evant documents as similar documents from a large document collection

for a document posed as a query. Reported results confirm the success

of this approach in ensemble ORFS(I/S) and ORNC(S) methods. The

ORFS(I/S) algorithms estimate outliers based on how a document deviates

from the relevancy scores of relevant neighborhoods while the ORNC(S) al-

gorithm estimates the degree of an outlier from the reverse neighbor count


(k-occurrences) within the relevant neighborhoods.

• The higher accuracy achieved with ORNC(S) compared to OIDF and
ORFS(I/S) can be attributed to identifying and using the sub-dense local
neighborhoods present in higher dimensions. The count of k-occurrences
allows identifying "hubs" and "anti-hubs"; the latter lie away from local
sub-dense neighborhoods and have a lower k-occurrences count. These
anti-hub points become probable outlier candidates. ORNC(S) even produces
the best outcome for short text data, where complexity is increased due to
less term co-occurrence. However, it consumes substantial time because it
requires a large number of comparisons within each document neighborhood,
and it will be less time efficient for datasets with a larger number of
documents.

• The strategy of combining different outlier detection approaches affects the
effectiveness of outlier prediction. According to the ACC and OPE measures,
ORFS(I) and ORFS(S) outperform the basic singleton OIDF method by reducing
false positives. As both produce nearly the same level of accuracy, the
improved time efficiency of ORFS(S) favors the sequential ensemble method
over the independent ensemble method.

• Compared with the baselines, KNNO shows a higher ACC than the proposed
approaches. The input parameter specifying the number of outliers is the
reason behind this behavior. However, AUC, which independently analyses
inlier and outlier prediction in detail, confirms that the effectiveness of
KNNO is not up to the level of our methods on all the datasets because it
reports a high number of false inliers. The high memory consumption and
false inlier prediction of KNNO make it a weak text outlier detection
method.

• Furthermore, the reported results show that the state-of-the-art methods
are not scalable to document collections with large term vectors. A mutual
NN graph building process using k-NN calculation is not scalable to larger
datasets due to the high number of pairwise comparisons required, as is
evident with MNCO. NMFO is a recent method, proposed to handle the problem
of high-dimensional term vectors through the error that results from
dimensionality reduction. However, our experiments with large term vectors
(DS1) reveal that it cannot handle a collection with large text sizes.
Additionally, experiments on datasets that consist of many groups (i.e.,
DS4 and DS5) show that the sum of squared errors in non-negative matrix
factorization is impaired in handling fine-grained problems, as is evident
from the OPE measure.

Finally, we summarize the proposed methods according to their suitability on

different types of data in Table 11.

Table 11: Applicability of the proposed methods

Category               Method    Nature of the documents in                Functionality
                                 outlier detection applications
Rare frequency based   OIDF      Large text size collections               Accuracy, Efficiency
                                 such as Wikipedia
Ranking based          ORFS(S)   Medium text size collections              Accuracy, Efficiency
                                 such as newsgroup data
Ranking based          ORFS(I)   Medium text size collections              Accuracy
                                 such as newsgroup data
k-occurrence based     ORNC(S)   Short text data such as social media,     Accuracy
                                 which deal with extreme sparseness


5 Conclusion

This paper deals with the important topic of high-dimensional text outlier detec-

tion. In this data domain, the traditional distance or density-based outlier detec-

tion methods are challenged due to the distance concentration problem. Most of

the state-of-the-art methods are impaired when the number of groups within a

document collection is high, as it becomes difficult to generalize common patterns

to identify deviation for outliers.

This paper proposes a simple method of outlier detection based on the use of the

IDF weighting scheme, OIDF. It effectively uses the notion of rare terms to iden-

tify the documents that deviate from the majority of documents in the collection.

This method, however, suffers from generating high false positives and requires

additional processing to improve accuracy. To handle efficacy and efficiency, we

propose a number of ensemble approaches with OIDF using the ranking concept

in IR systems, which has already been proven to handle high dimensional larger

document collections with reduced computational complexity. An IR system is

used to retrieve the relevant documents for each document in the collection and

the top-n relevant documents are considered to be the neighborhood of the doc-

ument. ORFS uses the relevancy scores and ORNC uses the relevant document

count to identify outliers.
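The relevant document count used by ORNC can be sketched as a reverse neighbor count over the IR results. This is a minimal sketch, not the thesis implementation: `rankings` is a hypothetical mapping from each query document to the top-n document ids an IR system returned for it, and the document ids are invented for the illustration.

```python
def reverse_neighbor_counts(rankings):
    """Count how often each document appears in other documents' results.

    rankings[q] is the list of document ids retrieved for query document q.
    """
    counts = {doc: 0 for doc in rankings}
    for query, retrieved in rankings.items():
        for doc in retrieved:
            if doc != query:
                counts[doc] = counts.get(doc, 0) + 1
    return counts

# Hypothetical top-2 results per document.
rankings = {
    "d1": ["d2", "d3"],
    "d2": ["d1", "d3"],
    "d3": ["d1", "d2"],
    "d4": ["d1", "d2"],   # d4 retrieves others, but nobody retrieves d4
}
counts = reverse_neighbor_counts(rankings)
anti_hub = min(counts, key=counts.get)
print(counts, anti_hub)
```

Documents that are retrieved often behave like hubs, while a document that never appears in the results of other documents is an anti-hub and a probable outlier candidate.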

We explore the most effective ensemble approach (i.e., independent or
sequential) for combining ORFS and ORNC with OIDF. The sequential approach uses
the outlier candidates identified by OIDF to reduce the search space and
improve the quality of outlier detection. The ability of rare document
frequency to identify outliers in OIDF is enhanced by the IR concepts in ORFS
and ORNC, which reduces false positives compared to OIDF alone. In the
independent ensemble approach, both the ORFS and OIDF algorithms generate
outlier candidates, and the candidates common to both sets are taken as the
final outliers. ORNC is not used in the independent approach to generate
outliers due to its high time complexity.
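The two combination strategies can be contrasted in a short sketch. The function names, the candidate sets and the `second_stage` predicate are hypothetical; the sketch only mirrors the description above, in which the independent ensemble keeps the candidates common to both detectors and the sequential ensemble runs the second detector on the OIDF candidates only.

```python
def independent_ensemble(oidf_candidates, orfs_candidates):
    """Final outliers are the candidates flagged by both detectors."""
    return set(oidf_candidates) & set(orfs_candidates)

def sequential_ensemble(oidf_candidates, second_stage):
    """Run the second detector only on the OIDF candidates."""
    return {doc for doc in oidf_candidates if second_stage(doc)}

# Hypothetical candidate sets produced over the full collection.
oidf = {"d3", "d7", "d9"}
orfs = {"d7", "d9", "d12"}
independent = independent_ensemble(oidf, orfs)

# Hypothetical second-stage decisions, computed only for the OIDF candidates.
is_outlier = {"d3": False, "d7": True, "d9": True}
sequential = sequential_ensemble(oidf, is_outlier.get)
print(sorted(independent), sorted(sequential))
```

The sequential variant never evaluates the second detector on documents outside the OIDF candidate set, which is where its smaller search space comes from.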

The empirical analysis is conducted on diverse datasets, including large,
medium and short term vector sizes, with different numbers of classes and
different levels of vocabulary overlap. The proposed methods are benchmarked
against several state-of-the-art distance-based, density-based, NMF-based and
graph-based outlier detection methods. The empirical analysis shows that the
proposed methods are capable of detecting outliers in high-dimensional document
collections with considerably high performance in both accuracy and efficiency.
These approaches are designed in a threshold-independent way by setting the
control parameter autonomously based on the internal characteristics of the
text collection.

This paper presents substantial work in the area of text outlier detection.
However, identifying outliers in dynamic text streams with limited memory and
time is important for novelty detection. Therefore, future directions include
applying the proposed algorithms to dynamic temporal text data for outlier and
insight detection. Parallelizing these algorithms, with possible improvements
in run time and memory, is also left for future investigation.

Appendix A: Rationale for the outliers’ score

Let $OS^{d_o}_{idf}$ and $OS^{d_i}_{idf}$ be the average of the IDF values of
all terms in an outlier document $d_o \in D$ and an inlier document
$d_i \in D$ respectively in a document collection $D$. We believe that
$OS^{d_o}_{idf} > OS^{d_i}_{idf}$ is valid for an outlier and inlier pair due
to the following reasons.

• For a generic document $d_k \in D$, the IDF weight for a term can be
calculated as in Eq. 1, where rare terms get high IDF values due to their low
document frequency (df) compared to common terms.

• An outlier document $d_o$, with the average IDF weight of its terms
$OS^{d_o}_{idf}$ calculated using Eq. 4, will get a higher value than the
inlier document $d_i$, as $d_o$ consists of a set of rare terms within $D$. It
indicates a deviation from the majority.

• In contrast, an inlier document $d_i$ will possess common terms that
represent intrinsic themes of $D$ and will thereby hold a lower average IDF,
$OS^{d_i}_{idf}$, for its terms.

• An $OS^{d_o}_{idf}$, which is dominated by (rare) deviated terms, should be
higher than an $OS^{d_i}_{idf}$, which is led by common terms within $D$.
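The rationale can be checked on a toy collection. This is a minimal sketch under stated assumptions: the IDF of a term is taken as log(N / df), which may differ from the exact form of Eq. 1, the score averages over the distinct terms of a document, and the four small documents are hypothetical.

```python
import math

def average_idf(doc, docs):
    """Mean IDF of the distinct terms in doc, with IDF = log(N / df)."""
    n = len(docs)
    idfs = [math.log(n / sum(1 for d in docs if t in d)) for t in set(doc)]
    return sum(idfs) / len(idfs)

docs = [
    {"market", "shares", "profit"},
    {"market", "profit", "bank"},
    {"shares", "bank", "market"},
    {"volcano", "lava", "market"},   # shares only one term with the rest
]
scores = [average_idf(d, docs) for d in docs]
print([round(s, 3) for s in scores])
```

The last document is dominated by terms that appear nowhere else in the collection, so its average IDF is the highest of the four, in line with the argument above.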

Appendix B: Weakness in ACC Measurement

ACC measures the effectiveness of predictions in terms of correct predictions and

does not consider false predictions. Consequently, it disregards the false outliers

and false inliers.

• $ACC = \frac{TP + TN}{TP + FP + TN + FN}$

• $ACC = \frac{TP + TN}{P + N}$

ACC considers truly predicted instances against the total observations as a ratio

and highlights only correct predictions. It can be considered a biased evaluation

that neglects the incorrect predictions made by a method. Hence, 1-ACC can

be used as an indirect indication of incorrect predictions, which represents total

false predictions against total observations. However, ACC does not separately

evaluate FP and FN that represent false outliers and false inliers to determine

the error in outlier identification and inlier identification of a method.
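A small numeric example shows why ACC alone can mislead on imbalanced outlier data. The confusion counts below are hypothetical: a collection of 1000 documents with 50 true outliers, one detector that flags nothing, and one that finds 40 outliers at the cost of 60 false outliers.

```python
def acc(tp, fp, tn, fn):
    """Accuracy: correct predictions over all observations."""
    return (tp + tn) / (tp + fp + tn + fn)

# Detector A flags no document at all.
acc_a = acc(tp=0, fp=0, tn=950, fn=50)
# Detector B finds 40 of the 50 outliers but raises 60 false outliers.
acc_b = acc(tp=40, fp=60, tn=890, fn=10)
print(acc_a, acc_b)
```

The detector that finds no outliers obtains the higher ACC, because the measure pools FP and FN into a single error count and is dominated by the large number of inliers.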


Figure 15: ROC curve generated by a binary outlier detection scenario

Appendix C: Weakness in AUC Measurement

AUC that considers a trade-off between TPR and FPR does not properly focus

on false outliers and false inliers, and masks the false positives and false negatives.

Let’s use a binary outlier detection scenario to produce the ROC curve as shown

in Fig. 15. AUC is the sum of the areas of A, B, and C.

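• $AUC = A + B + C$, as proved in [32]

• $AUC = \frac{1}{2} \cdot TPR \cdot FPR + (1 - FPR) \cdot TPR + \frac{1}{2} (1 - TPR)(1 - FPR)$

• $AUC = \frac{1}{2} \cdot TPR \cdot FPR + \frac{1}{2} (1 - FPR)(TPR + 1)$

• $AUC = \frac{1}{2} (TPR + 1 - FPR)$

• $AUC = \frac{1}{2} \left( \frac{TP}{P} + 1 - \frac{FP}{FP + TN} \right)$

• $AUC = \frac{1}{2} \left( \frac{TP}{P} + \frac{TN}{FP + TN} \right)$

• $AUC = \frac{1}{2} \left( \frac{TP}{P} + \frac{TN}{N} \right)$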

AUC reports true outliers out of total outliers and true inliers out of total inliers. Specifically, it details the correctly predicted outlier ratio and inlier ratio separately. It does not report false outliers or false inliers efficiently, as it does not treat false positives or false negatives with special care.
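The derivation above can be checked numerically. The following is a minimal sketch, assuming a single-threshold (binary) detector whose ROC curve is the polyline (0,0) → (FPR,TPR) → (1,1); the function names and the confusion-matrix counts are illustrative and do not come from the experiments.

```python
def auc_single_point(tp, fp, tn, fn):
    """Closed form from the last steps of the derivation: 1/2 * (TP/P + TN/N)."""
    tpr = tp / (tp + fn)   # TP / P
    tnr = tn / (fp + tn)   # TN / N
    return 0.5 * (tpr + tnr)


def auc_by_areas(tp, fp, tn, fn):
    """Area under the polyline, split as in the second and third derivation steps."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    left_triangle = 0.5 * tpr * fpr               # area left of the operating point
    right_trapezoid = 0.5 * (1 - fpr) * (tpr + 1)  # area right of the operating point
    return left_triangle + right_trapezoid


# Illustrative counts: 10 outliers (8 found), 100 inliers (10 wrongly flagged).
closed_form = auc_single_point(tp=8, fp=10, tn=90, fn=2)
by_areas = auc_by_areas(tp=8, fp=10, tn=90, fn=2)
```

Both functions return the same value for any valid counts, which illustrates why this form of AUC only exposes the correctly predicted ratios of each class.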


Paper 6: Text Outlier Detection using a Ranking-based Mutual Graph




Under Review In: Data & Knowledge Engineering Journal



















Date:




Nayak

26/03/2020

Mohotti

27/03/2020




ABSTRACT: Identification of unusual text instances in text corpora is highly

beneficial for several applications such as content management, emerging and

suspicious pattern detection, etc. Extreme sparseness, distance concentration

and the presence of a large number of subgroups in a text corpus are some of

the issues that challenge the traditional outlier detection methods. In this paper,

we address these issues in a novel fashion by modeling the documents using rare frequency weighting, building a ranking-based mutual neighbor graph and identifying outliers through density estimation. The proposed graph-based incremental outlier detection method effectively reduces false identifications. Experimental results show that the proposed method is scalable compared to relevant benchmarking methods and improves the quality of outlier detection in text corpora.

KEYWORDS: Outlier Detection; Density Estimation; Graph-based Clustering;

Data Mining; Mining methods and algorithms

1 Introduction

With the advancement in digital data technologies, text data has grown expo-

nentially [86]. The process of discovering useful information from the text docu-

ment corpora, known as text mining, has become significant and leads to diverse

applications [3]. Content management that facilitates efficient and effective in-

formation retrieval [106], emerging concept identification that facilitates trend

analysis [4] and suspicious content detection that identifies fake news [94] or un-

usual events [45] are some of them. Outlier detection plays a vital role in this

context to identify abnormalities out of a massive data collection that is usually

heterogeneous in nature and includes multiple subgroups.

Figure 1: Text outlier detection and associated problems

The general idea behind outlier detection is to identify patterns that do not comply with general behavior, also referred to as anomalies, deviants or abnormalities [1]. An outlier text document has content that is different from the rest of the

documents in the collection that share some similarities amongst them [4]. Fig.

1(a) illustrates a typical scenario of text outliers in comparison to inliers in a

collection. There exist several applications where anomaly detection plays a ma-

jor role. (1) In Wikipedia where pages are required to be placed into relevant

categories based on their contents, identifying outliers is essential for organizing

the collection for effective information retrieval. (2) Identifying an unusual text

deviating from the theme in a blog will draw useful insight for administrative

purposes [96]. (3) In order to flag exceptional news, it is important to identify

unusual news articles from a collection of news documents [94]. (4) Detecting

unusual events that can be early warnings has high importance in social media

data [45]. (5) Identifying outliers leads to revealing the emerging business trends

and competitors on e-commerce applications [4].

Several supervised and unsupervised methods exist for discovery of outliers from

different types of data such as numerical, spatial and categorical [1]. As most

of the real-world data is unlabelled, unsupervised methods become a natural


choice. Outlier detection methods based on unsupervised learning concepts such as distribution, distance, and density, as used on traditional data, face a scalability issue when applied to large text sources [1]. There are only a handful of studies specific to the text domain [4, 96] that deal with the sparseness of document vector representation (shown in Fig. 1(b)). The well-known curse of dimensionality [126] in text data, as shown in Fig. 1(c), where the concept of distance or density diminishes, leads to incorrect outcomes (i.e., a high number of false predictions) [3].

A recent study used matrix factorization to project the high-dimensional text data

to lower order and calculated outliers by ranking the learning errors [96]. This

method accurately identifies outliers in the homogeneous data where the majority

of documents belong to the same topic. However, it fails to identify outliers in the

data where a large number of subgroups exists [173]. A large number of subgroups

within big datasets [173] also challenges traditional methods when applied to text

collections such as social media data.

In this paper, we focus on detecting text outliers with the aim of reducing false

detection using the well-known IR concepts in a novel fashion. We propose to use

(1) the “rare” term importance in detecting deviations in documents represented

as term vectors and (2) the concept of ranking to identify the local sub-dense neighborhoods, called Hubs, evident in high dimensional data. We present a novel graph-based method, named Outliers by Ranking based Density Graphs (ORDG), where a mutual neighbor graph is constructed using the relevant neighborhoods. Documents that are not included in the mutual neighbor graph and are away from the local sub-dense neighborhoods are treated as outliers. Empirical analysis using several document corpora reveals that ORDG is able to detect outliers in large document corpora accurately and efficiently compared to state-of-the-art methods [29, 96, 157].

To the best of our knowledge, ORDG is the first method that extends the IR concepts of term weighting and ranking to document outlier detection together with mutual neighbor graphs. More specifically, this paper makes the following novel contributions to the area of outlier detection:

• Introduces a novel outlier detection algorithm (ORDG) which combines

the concepts of rare document frequency in document representation and

mutual neighbor graph.

• Proposes to construct the mutual neighbor graph based on the concept of relevant neighbors using a scalable IR system that incurs lower computation cost to identify the deviations.

The rest of the paper is organized as follows. Section 2 details related work on

traditional and high-dimensional outlier detection. The proposed approach and

implementation are elaborated in Section 3. A comprehensive empirical study

and benchmarks on several public datasets with well-known outlier detection

algorithms are provided in Section 4. The final concluding remarks are presented

in Section 5.

2 Related Work

It is estimated that 95% of unstructured data is dominated by digital text col-

lections [60]. Detecting outliers or anomalies in these large document collections

is useful for finding the interesting as well as suspicious text [4].


2.1 Outlier detection methods for structured data

Traditional unsupervised methods are based on the proximity concepts of distance, density, distribution, and cluster [1, 2]. In the numerical data domain, distribution-based methods use a statistical measure to determine the anomalies that occur outside of the normal model [16, 89]. However, this approach is highly dependent on assumptions about data representation and leads to poor scalability, making it less effective in text outlier detection.

The distance and density-based approaches successfully handle the anomalies in

numerical data with limited dimensions where outliers are easy to identify in

terms of distance or density distribution. These are extensively used in outlier

detection due to their simple implementation [126]. The concept of nearest neigh-

bors has been used to measure the distance differences. A distance based method

calculates the difference between each point and k nearest neighbors, and the

top-n points are ranked as outliers [157, 197]. A density-based method calcu-

lates the ratio between density around k-Nearest-Neighbors of a point and its

local neighborhood [109]. A point is ranked as an outlier candidate if its relative density, known as Local Outlier Factor (LOF), is high [29].

The reverse neighbor count [155] that indicates the number of times a point

appears among nearest neighbors of the entire collection has also been used in

outlier detection [155]. The reverse neighbor count of some points shows sig-

nificant skewness, forming sub-dense regions known as Hubs. The “Anti-Hubs”

points have been identified as outliers [154]. With reverse k-NNs, a graph-based method is used to identify the outlier nodes that have low in-degree values [87].

Nearest Neighbor (NN)-based mutual proximity and the k-NNs have also been

used to calculate outlier scores [58, 84].

In text data, identifying neighborhoods using distance measures is challenging due


to distance concentration in high dimensionality [104]. All the pairwise distances

(dissimilarities) yield a similar value where distance differences between far and

near points become negligible [24]. For document collections such as the Web, which are large and exhibit multiple sub-groups, the nearest-neighbor calculation poses a scalability problem due to the large number of pairwise comparisons [173].

Density based clustering has been successfully used in spatial data by isolating

outlier points with the density approximation [9, 57]. However, in the text where

data is sparse, applying the density notion to separate outliers is challenging as

data already exists in patches.

Traditional outlier detection methods are impaired in high dimensionality data

due to sparseness [1]. The angle between vectors can be successfully used to

identify the deviations in this context. This approach can be well suited to text

data, which are in the form of feature vectors, and cosine similarity can be used to measure angle differences [37]. However, the number of pairwise comparisons

needed for larger datasets increases the computational complexity and makes this

approach infeasible to apply to large-scale data.

Subspace analysis is an alternative method to detect outliers in high dimensional

data. However, the problem of finding a subset of dimensions, with rarely existing

patterns, using brute-force searching mechanisms poses extreme computational

complexity [1]. Therefore, lower-dimensional projections have been used as a rem-

edy. The degree of deviation of each observation after projecting it to the lower-

dimensional space by dimensionality reduction is used to determine outliers [126].

In [27], Multi-Dimensional Scaling (MDS) is used in identifying outliers in high

dimensional data. It reduces the number of dimensions while preserving pairwise distances between points, and identifies outliers in the embedded space using a heuristic that captures deviants. However, the information loss in these approaches when


projecting data from higher to lower dimension makes it unsuitable to determine

extreme values in full dimensionality.

2.2 Outlier detection methods for text data

There are limited studies specifically focused on the text domain that identify documents deviating from the common theme [4, 96]. In the text domain, where data is high

dimensional, matrix factorization is proposed as a solution that projects the high

dimensional search space to a lower space with preserved original relationships in

the newly mapped space [11]. In a recent study, the sum-of-square of differences

with the original matrix while projecting to a lower order with the Non-negative

Matrix Factorization (NMF) is measured and observations with higher rank for

learning error are identified as outliers [96]. This method attempts to use semantic similarity while learning text outliers. However, an increased number of groups within the collection impairs this learning process. This method may fail to detect outliers accurately or in a scalable fashion in Web content that often contains many document categories.

Deep neural network [38] and Generative Adversarial Network (GAN) with active learning [127] are the latest supervised approaches used for outlier detection in text data. They have been used with the dense representation of text data as

they are unable to work with high-dimensional sparse data. Supervised deep

network-based methods use the labeled data for training and learn the patterns

of the text data that classify outliers and inliers. GAN methods generate infor-

mative potential outliers based on the mini-max game between a generator and

a discriminator network [127]. Authors in [127] used active learning (i.e., a weak

supervision approach) to generate potential outliers with a reasonable reference

distribution for the small labelled data with GAN. The accuracy of these methods relies on the labelled data. However, it is difficult to provide labelled data for training due to the unknown nature of anomalies. Therefore, a common approach is

to utilise unsupervised methods to find objects or patterns that are uncommon

based on data distribution.

2.3 IR Concepts: How can they be used in outlier detection

According to the well-known Hawkins definition, an outlier is “an observation which deviates so much from the other observations as to arouse suspicions that

it was generated by a different mechanism” [84]. We conjecture that documents

with rare terms will exhibit this characteristic and the use of a rare term weighting

technique in document representation can reveal outliers. A Vector Space Model

(VSM) is used to represent a document by a vector where each term appears

as a co-efficient to represent the term weight considering its frequency within

the document and/or collection [3]. There are different term weighting schemes

used in IR to rank the terms such as TF, IDF, TF*IDF and BM25 [160]. Term

Frequency (TF) gives high weights for frequently occurring terms by favoring

common and long documents [37] while Inverse Document Frequency (IDF) favors

the rare terms in the collection [37]. In this paper, we use the concept of term

weightings with IDF to measure the importance of rare words in a novel fashion

to detect text outliers.

IR systems have been shown to be a scalable and efficient solution in handling high dimensional text data [3]. There exist advanced IR technologies including inverted

index data structure and ranking that allow a search engine to find related documents in a large document collection for a given user query [200]. In this paper, we use the concepts of ranking in IR systems for outlier detection in a novel fashion. The proposed method generates a mutual NN graph based on the retrieved (relevant) documents for the search queries instead of an expensive NN calculation and identifies outliers based on the density of the graph. The outliers generated by these ranking-based mutual neighbor graphs are combined with the outliers generated by the rare term frequency-based method to reduce the high number of false positives, a significant problem in outlier detection [2].

Figure 2: Architecture of the proposed ORDG method. Phase 1: outlier candidates from rare frequency term modelling. Phase 2: outlier candidates dissimilar to a mutual neighbor graph. Final outliers: common outliers for both phases.

3 ORDG: Outliers By Ranking-based Density Graphs

The proposed ORDG method is an ensemble approach that combines the outliers

generated by two processes as in Fig. 2. Phase 1 includes the process of obtaining

probable outlier candidates through the rare frequency term weighting model.

Phase 2 includes construction of the mutual neighbor graph and removal of outlier

candidates based on density of inliers in the graph. We define mutual neighbors

using ranking results generated by an IR system in a scalable and efficient manner.

The final list of outliers is generated by reporting the common outliers for both

processes.


3.1 Preliminaries

Consider a document collection $D = \{t_1, t_2, \ldots, t_n\}$ that contains a total of $n$ terms, where a document $d_i \in D$ is represented using a set of unique terms $\{t_1, t_2, \ldots, t_t\}$ in $D$. Let $D$ consist of a set of groups $C = \{c_1, c_2, \ldots, c_N\}$, where each $c_g \in C$ contains a set of similar documents that share related terms.

Definition 1 Outlier A document $d_i \in D$ that shows high deviation, based on term distributions, from all sets of similar documents $c_g \in C$ is considered an outlier.

Definition 2 Inlier A document $d_i \in D$ that shows high similarity, based on term distributions, to a set of similar documents $c_g \in C$ is considered an inlier.

3.2 Finding nearest neighbours

Given a document di ∈ D, a vector space model (VSM) represents a document as

a point vector in multi-dimensional space by assigning weights to each respective

term as di = {w1, w2, w3, ..., wt}. These weights in a vector emphasize the impor-

tance of the document within the collection using a weighting scheme. Inverse

Document Frequency (IDF) weighting scheme differentiates whether the term v

is common or rare considering document frequency dfv as per Eq. 1.

$w_v = idf_v = \log\left( \frac{|D|}{df_v} \right)$ (1)

A $w_v \in d_i$ modeled with IDF gives a higher score to rare terms. To emphasize the rare terms appearing in a document, the documents in a collection are represented with the IDF weighting scheme.
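Eq. 1 can be illustrated with a minimal sketch. It assumes documents are already tokenised and pre-processed, and it uses the natural logarithm, since the base is not stated in the text; the function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return {term: log(|D| / df_term)} for a list of tokenised documents."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Toy collection (illustrative only): three sport documents and one deviating document.
docs = [
    ["cricket", "bat", "ball"],
    ["cricket", "wicket", "ball"],
    ["rugby", "ball", "tackle"],
    ["election", "ballot"],
]
idf = idf_weights(docs)
```

In this toy collection the term "election" (one document) receives a larger weight than "cricket" (two documents), which in turn outweighs "ball" (three documents).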


Each document di ∈ D is treated as a query document represented with top-s

(s = 10) terms ranked in the order of IDF. We use Elasticsearch search engine as

the IR system and obtain top-k documents given in response to the query as k-

Nearest Neighbors. We have set k = 10 as P@10 (Precision at top-10 documents)

in the ranked list returned for a topic is considered high due to tight coupling with

the topic [139]. Thus the top-10 documents that possess sufficient information

richness [173] are chosen as the NNs.
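The query construction described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the indexed field name `text` and both function names are hypothetical, and the request body uses the standard Elasticsearch `match` query with the `size` parameter to request the top-k hits.

```python
def top_idf_terms(tokens, idf, s=10):
    """Pick the s highest-IDF unique terms of a document (ties broken alphabetically)."""
    ranked = sorted(set(tokens), key=lambda term: (-idf.get(term, 0.0), term))
    return ranked[:s]


def build_knn_query(tokens, idf, s=10, k=10, field="text"):
    """Build a search request body that asks for the top-k documents matching the query terms."""
    query_text = " ".join(top_idf_terms(tokens, idf, s))
    return {"size": k, "query": {"match": {field: query_text}}}


# Illustrative IDF values and document.
example_idf = {"senate": 2.0, "ballot": 1.0, "ball": 0.5}
body = build_knn_query(["ball", "senate", "ballot", "senate"], example_idf, s=2, k=10)
```

The resulting dictionary could be passed to a search client; the hits and their scores would then play the role of the ranked neighbor list.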

Let Rf be the ranking function employed in an IR system that extracts the

most relevant k documents as nearest neighbor documents Dq, for a given query

document q, where r is the relevancy scores vector for query q as follows.

$R_f : q \rightarrow D_q = \{(d_p, r_p)\} : p = 1, 2, \ldots, k$ (2)

There exist several ranking functions employed in search engines such as tf*idf,

BM25 and LM Jelinek-Mercer smoothing to calculate the relevant documents

[173]. We use the widely applied tf*idf ranking function to measure the relevancy

between a document dp and a query q where the relevancy score rp is given as:

$score(q, d_p) = r_p = \sum_{t \in q} \left( \sqrt{tf_{t,d_p}} \times idf_t^{2} \times norm(t, d_p) \right)$ (3)

Let $D_{d_i}$ and $D_{d_j}$ be the ranking results considered as nearest neighbors of $d_i, d_j \in D$ respectively through the ranking function $R_f$. These ranking results are used in defining mutual neighbors. Two documents $d_i$ and $d_j$ are considered mutual neighbors if $d_j \in D_{d_i}$, $d_i \in D_{d_j}$ and $|D_{d_i} \cap D_{d_j}| > 2$.
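The mutual neighbor condition can be expressed directly as a predicate. The sketch below assumes the ranking results are available as a dictionary from document id to its list of retrieved neighbor ids; the names and the toy neighbor lists are hypothetical.

```python
def are_mutual_neighbors(neighbors, di, dj, min_shared=2):
    """True if each document retrieves the other and they share more than min_shared neighbors."""
    if dj not in neighbors[di] or di not in neighbors[dj]:
        return False
    shared = set(neighbors[di]) & set(neighbors[dj])
    return len(shared) > min_shared


# Toy ranking results (illustrative only).
neighbors = {
    "d1": ["d2", "d3", "d4", "d5"],
    "d2": ["d1", "d3", "d4", "d5"],
    "d3": ["d1", "d2"],
    "d9": ["d1"],
}
```

Here `d1` and `d2` retrieve each other and share three neighbors, so they qualify; `d1` and `d3` retrieve each other but share only one neighbor; `d9` is never retrieved by `d1`.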


3.3 Phase 1: Outliers by the Term Weights

We conjecture that an outlier document will contain more rare terms than an

inlier document in the corpus. For each document represented in IDF weighting, an average weight is calculated by summing the weights of all terms present in the document and dividing by the number of terms. We propose to filter the probable outlier documents in $D$ whose average weights exceed a threshold value $T_{idf}$, as in Eq. 4.

$D^{o}_{idf} \leftarrow d_i : \text{where } \left\{ \frac{\sum_{v=1}^{t} (w_v \in d_i)}{t} \right\} > T_{idf}$ (4)

We set this control parameter independently, using internal statistics such as the median and standard deviation of the term weights of the datasets, as detailed in the sensitivity analysis section, to form the optimal threshold to filter the outliers.

Let $OS^{d_o}_{idf}$ and $OS^{d_i}_{idf}$ be the average of IDF values of all terms in an outlier document $d_o \in D$ and inlier document $d_i \in D$ respectively in the document collection $D$.

Claim 1 Given an inlier and outlier document pair, $OS^{d_o}_{idf} > OS^{d_i}_{idf}$ is valid.

• For a generic document dk ∈ D, IDF weight for a term can be calculated as

in Eq. 1 where rare terms get high IDF values due to their low document

frequency (df) compared to common terms.

• Since the outlier document $d_o$ consists of a rare set of terms within collection $D$, the average IDF weight of its terms, $OS^{d_o}_{idf}$, will get a higher value compared to the inlier document $d_i$. A higher score indicates deviation from the majority.

• In contrast, an inlier $d_i$ will possess common terms that represent one of the intrinsic themes of $D$; thereby, it will hold a lower average IDF, $OS^{d_i}_{idf}$, for its terms.

• Any $OS^{d_o}_{idf}$ that is dominated by rare deviated terms should be higher than any $OS^{d_i}_{idf}$ that is led by the common terms within $D$.
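A minimal sketch of the Phase 1 filter in Eq. 4 follows. It assumes precomputed IDF weights and non-empty documents; the threshold value of 1.2 is hand-picked for this toy data and is not the threshold selection procedure of the paper, and all names and numbers are illustrative.

```python
def average_idf(tokens, idf):
    """Average IDF weight over the unique terms of one (non-empty) document."""
    terms = set(tokens)
    return sum(idf[term] for term in terms) / len(terms)


def phase1_candidates(docs, idf, threshold):
    """Return ids of documents whose average IDF exceeds the threshold (Eq. 4)."""
    return {doc_id for doc_id, tokens in docs.items()
            if average_idf(tokens, idf) > threshold}


# Illustrative IDF weights: common terms are low, rare terms are high.
idf = {"ball": 0.2, "cricket": 0.9, "bat": 1.6, "ballot": 1.6, "senate": 1.6}
docs = {
    "d1": ["cricket", "bat", "ball"],
    "d2": ["cricket", "ball"],
    "d3": ["ballot", "senate"],
}
candidates = phase1_candidates(docs, idf, threshold=1.2)
```

Only `d3`, which consists entirely of rare terms, exceeds the threshold, in line with Claim 1.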

3.4 Phase 2: Outliers by the Ranking-based Mutual graph

A (mutual neighbour) graph is constructed where two mutual neighbor documents

are represented by the adjacent nodes and the edge weight between them is the

number of neighbors the two documents share.

Let $D_{d_i}$ represent the top-10 relevant neighbor documents of $d_i$ obtained using Eq. 2. Let document $d_j \in D_{d_i}$ and its top-10 relevant neighbor documents be $D_{d_j}$. If $d_i$ and $d_j$ are found to be mutual neighbors because they share more than two common documents, they are included as vertices of the graph $G_M$ with the edge weight of $|D_{d_i} \cap D_{d_j}|$. Repeating this process for all documents in the collection, a mutual-neighbor graph $G_M(V,E,w)$ is formed, where the vertices $V$ represent the document nodes and the edges $E$ with the set of weights $w$ represent the number of mutually shared neighbors. All the mutual documents in $G_M$ form a set $D_{MN}$. This process separates the set of outlier documents $D^o_G$ that are not part of the connected graph. Algorithm 1 in Fig. 3 represents the ORDG algorithm for building the mutual graph.
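The graph construction can be sketched as below. This is an illustrative reading of the description, not a transcription of Algorithm 1: it assumes the ranking results are held in a dictionary of neighbor lists, stores edges in a plain dictionary, and uses hypothetical names throughout.

```python
def build_mutual_graph(neighbors, min_shared=2):
    """Return (edges, d_mn, d_og) from {doc_id: list of retrieved neighbor ids}.

    edges maps a sorted document pair to the number of shared neighbors;
    d_mn is the set of documents in the graph; d_og is the left-out set.
    """
    edges = {}
    for di, di_neighbors in neighbors.items():
        for dj in di_neighbors:
            if dj == di or dj not in neighbors or di not in neighbors[dj]:
                continue  # the pair must retrieve each other
            shared = set(di_neighbors) & set(neighbors[dj])
            if len(shared) > min_shared:
                edges[tuple(sorted((di, dj)))] = len(shared)
    d_mn = {doc for pair in edges for doc in pair}
    d_og = set(neighbors) - d_mn
    return edges, d_mn, d_og


# Toy ranking results (illustrative only).
neighbors = {
    "a": ["b", "c", "d", "e"],
    "b": ["a", "c", "d", "e"],
    "c": ["a", "b", "d", "e"],
    "d": ["a", "b", "c"],
    "e": ["a", "b", "c"],
    "x": ["a", "b"],
}
edges, d_mn, d_og = build_mutual_graph(neighbors)
```

Iterating only over each document's retrieved neighbors keeps the work proportional to the number of documents times k, rather than to all document pairs.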

The document collections with medium to large size term vectors (e.g., news

stories, reviews, etc.) contain sufficient co-occurring terms and allow identification of local density regions to form mutual neighbors. Due to the high number of documents included in the mutual neighbor graph $G_M(V,E,w)$, the outliers can be effectively identified from the left-out document set $D^o_G$ in these collections. However, this approach poses a challenge to short documents (such as social media posts) or sparse documents, as only a few documents show mutually shared documents. Short document collections that hold extremely sparse vector representations share very few common terms. It becomes hard to discriminate amongst documents and, eventually, many inlier documents are left out from the graph construction process.

Figure 3: ORDG: Mutual Graph Building Algorithm

We refine outlier discovery for the short documents by defining outliers based on dissimilarity to Hubs in the graph. Initial dense regions on the graph are formed based on a region where the minimum edge weight is $c$ and the region contains at least $c$ document nodes. Identified dense inlier neighborhoods are further expanded to include documents with the same edge weight, forming uniform dense regions. All the document nodes in each dense region are identified as inliers as they hold the property in Definition 2. All other nodes are identified as outlier candidates, $D^o_G$. This process further refines the outlier filtering considering dissimilarity to the graph. The set of shared neighbor documents $D_{MN}$ identified in Algorithm 1 in Fig. 3 and attached to $G_M$ are labeled as inliers $l$ if they are attached to the dense regions and the documents embedded in them are not outliers. These document sets in $D_{MN}$ identified within mutual graph construction can be considered as Hubs in clusters (i.e., dense regions), which we propose to use to separate outlier candidates through dissimilarity. A normalized similarity score $S_h$ is calculated against each Hub $h \in D_{MN}$ for each outlier candidate $d_o \in D^o_G$ for identifying the dissimilarity. The similarity score $S_h$ calculation utilises the ranking scores derived through Eq. 3 for obtaining relevant documents of $d_o$ if they appear in $h$ as:

$S^{d_o}_{h} = \frac{1}{|h|} \sum_{i=1}^{|h|} score(q, d_o)$ where $q$ is each document in Hub $h$ (5)

Figure 4: ORDG: Dissimilarity with hubs (for document collections with short text vectors)

This refinement step finds the most similar hub of each outlier candidate $d_o$ as in Eq. 6 and removes $d_o$ from the outlier candidate list if the assigned hub is associated with an inlier label $l$. This process, given in Algorithm 2 in Fig. 4, is a two-step approach where each step provides a better understanding of the document collection to enable a more refined execution.

$h \leftarrow \max(S^{d_o}_{h})$ (6)
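Eqs. 5 and 6 can be sketched as follows. This is an interpretation under stated assumptions rather than a transcription of Algorithm 2: each candidate's ranking results are assumed to be a dictionary from retrieved document id to relevancy score, hub documents missing from those results contribute zero, and a candidate with zero similarity to every hub is kept as an outlier. That last rule and all names are assumptions introduced here.

```python
def hub_similarity(ranked, hub):
    """Eq. 5 sketch: ranking scores of hub documents retrieved for the candidate, averaged over |h|."""
    return sum(ranked.get(q, 0.0) for q in hub) / len(hub)


def refine_candidates(candidate_rankings, hubs, inlier_hubs):
    """Drop candidates whose most similar hub (Eq. 6) carries an inlier label."""
    remaining = set()
    for candidate, ranked in candidate_rankings.items():
        sims = {name: hub_similarity(ranked, docs) for name, docs in hubs.items()}
        best = max(sims, key=sims.get)
        if sims[best] > 0 and best in inlier_hubs:
            continue  # absorbed by an inlier hub, so no longer a candidate
        remaining.add(candidate)
    return remaining


# Toy hubs and ranking scores (illustrative only).
hubs = {"h1": ["a", "b"], "h2": ["c", "d"]}
inlier_hubs = {"h1"}
candidate_rankings = {
    "x": {"a": 2.0, "b": 1.0, "c": 0.5},  # closest to h1
    "y": {"c": 3.0},                      # closest to h2, which is not labeled inlier
    "z": {},                              # similar to no hub
}
outlier_candidates = refine_candidates(candidate_rankings, hubs, inlier_hubs)
```

In this toy case `x` is removed from the candidate list, while `y` and `z` remain outlier candidates.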

3.5 Phase 3: Ensemble – Combining Outliers

We propose to use the independent ensemble approach to combine outliers de-

tected by the first two phases. In prior work, these ensemble methods have been

successfully used to improve the quality of an outlier detection algorithm [1].

This addresses the problem of the high number of false positives generated by a single method [2]. The final set of outliers is produced as the common outlier documents identified in Phase 1, $D^o_{idf}$, and in Phase 2, $D^o_G$, as:

$D^o_f = D^o_{idf} \cap D^o_G$ (7)
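Eq. 7 amounts to a set intersection. A minimal sketch with hypothetical document ids shows how a document flagged by only one phase is discarded:

```python
def ensemble_outliers(d_o_idf, d_o_g):
    """Eq. 7: keep only the documents flagged as outliers by both phases."""
    return set(d_o_idf) & set(d_o_g)


# Illustrative candidate sets from the two phases.
phase1 = {"d7", "d11"}         # high average IDF
phase2 = {"d3", "d11", "d42"}  # left out of the mutual neighbor graph
final_outliers = ensemble_outliers(phase1, phase2)
```

Documents `d7`, `d3` and `d42` are each supported by a single phase only and are therefore not reported.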

An example: Consider an example document set in Fig. 5, which consists of eleven documents related to two sports: Cricket and Rugby, and an outlier document. The example document set clearly depicts that inlier documents share the common theme, sport, and the outlier document is a deviation from both of these sports categories. Fig. 6 shows the IDF weights of terms for each document after standard pre-processing together with the average IDF value of each vector. It reveals that the average IDF value of the document d11 is much higher than the rest of the collection. The first phase of ORDG identifies the possible outlier candidate d11.

Figure 5: Example document collection

Figure 6: IDF weights of terms in documents with the average IDF weight


Figure 7: List of relevant documents given by the search engine for the example document collection

Phase 2 of ORDG calculates the mutual neighbors to build the graph as in Fig.

8 considering the shared documents within the ranking results as given in Fig. 7.

The graph is able to isolate the outlier documents from the collection as shown

in Fig. 9.

This example document collection forms a considerably dense VSM model by showing high term co-occurrences. This may be different from a real-world text outlier detection problem. In high dimensional text, where term co-occurrences are usually extremely low, a single method tends to produce more false outliers [2]. We address this problem by combining outliers detected from phase 1

(using IDF term weights) with phase 2 (using mutual neighbor graphs). ORDG

identifies a document as an outlier only if both phases detected it as an outlier.


Figure 8: Mutual Neighbor Calculation with IR search results (panels: IR ranking results of each of the two documents; shared neighbors of the pair; check for the minimum number of shared neighbors, 3; the two documents as vertices with an edge weight)

Figure 9: Mutual graph construction in ORDG (mutual neighbor graph with edge weights ranging from 3 to 7)


4.1 Datasets: size, sparsity and classes of Inliers and outliers

We used multiple datasets with varying dimensionality such as 20 Newsgroups,

Reuters 21578, MediaEval Social Event Detection (SED) 2013 & SED 2014, and

Wikipedia in evaluation, as reported in Table 1. The Wikipedia dataset (DS1), which has about 800 terms per document on average, is used to validate the outlier detection behavior on a large document set. The well-known 20 Newsgroups dataset (DS2) and the Reuters dataset (DS3), which have 40-80 terms on average, were used to validate the outlier detection behavior on a medium document set. The MediaEval Social Event Detection 2013 (DS4) and 2014 (DS5) datasets, with 20 terms on average, were used to analyze short document collections. The ground-truth values with the class/category labels in the datasets were used to measure the methods’ effectiveness extrinsically.

Table 1: Summary of datasets used in experiments.

Datasets               # of Docs   # of Unique Terms   # of Total Terms   # of Avg. Terms   # of Outliers
Wikipedia (DS1)            11521              305827            9206250               799             100
20News groups (DS2)         4909               27882             374642                76              50
Reuters (DS3)               5050               13438             200482                40              50
SED2013 (DS4)              81228               46548            1583073                19             840
SED2014 (DS5)              91670               46031            1816840                20             976

We allow the document sets to contain several classes of documents - both inliers

and outliers. Specifically, DS1 contains inliers from the multiple subclasses under

the Wikipedia category “War” while containing outliers from 10 other categories.

DS2 contains inliers from five classes related to “Computers” and outliers from five

other categories. Similarly, DS3 contains inliers from two classes while outliers

are taken from 25 other classes. Inliers in the short datasets (DS4 and DS5) are collected from classes that have at least 100 documents, while two outliers are taken from each class that is not an inlier class within the same dataset. These short document collections of DS4 and DS5 are built to contain more than 400 groups in both inlier and outlier classes to explore the fine-grained scenarios. Generally,

all the datasets were created such that they contain nearly one percent of outlier

documents, which belong to several classes and inlier documents also belong to a

diverse set of classes.


4.2 Experimental setting, Benchmarks and Evaluation Measures

Experiments were conducted using Python 3.5 on a 1.2 GHz 64-bit processor with 264 GB of (shared) memory. All datasets were preprocessed using standard

text pre-processing such as stop-word removal and stemming. Elasticsearch was

used as the search engine. Inverted indexes were generated for all datasets. For

each document in a collection, top-10 relevant documents were obtained using

the ranking process employed in Elasticsearch.

There exist only a handful of text outlier detection methods, both supervised

[107, 127] and unsupervised [96]. We compare ORDG with Non-negative Matrix

factorization based unsupervised method [96] as well as traditional unsupervised

methods adapted for text data including k-nearest neighbor based method in [157]

(KNNO), density-based local outlier factor method in [29] (LOFO) and Pairwise

mutual neighbor graph method in [55] (MNCO).

Neural network-based approaches have recently become popular in text mining with full or weak supervision [107, 127]. Although ORDG is a fully unsupervised method, we have conducted experiments with a supervised method based on a Convolutional Neural Network (CNN) [107] and a semi-supervised method based on generative adversarial active learning [127]. With deep learning, we follow the standard

practice of using dense word representation with reduced dimensionality obtained

with Global Vectors for Word Representation (GloVe) [148] as the input to the

neural network experiments [107]. These methods based on training a neural

network with full-dimensional space are extremely time-consuming.

Standard outlier evaluation measures including Accuracy (ACC) [84], Area Under

the ROC Curve (AUC) [1] and False Negative Rate (FNR) [1] are used to report


the results.

Let TP, TN, FP and FN denote the correct outliers, correct inliers, incorrect outliers and incorrect inliers respectively, where P and N denote the total outliers and inliers. Accuracy (ACC) is calculated as:

$$ ACC = \frac{TP + TN}{TP + FP + FN + TN} = \frac{\text{total correct predictions}}{\text{total observations}} \tag{8} $$

ACC measures the effectiveness of predictions in terms of correct predictions and does not consider false predictions. Consequently, it may disregard the effect of false inliers by giving higher importance to true inliers. This is misleading in a general outlier detection scenario, where there is a massive class skew due to a few classes of outliers and a larger number of inlier classes. Therefore, we also use FNR to measure the effectiveness of outlier detection, highlighting the error in predicting outliers. The False Negative Rate (FNR) is calculated as:

$$ FNR = \frac{FN}{TP + FN} = \frac{FN}{P} \tag{9} $$
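The effect of class skew on ACC can be seen with a small worked example. The counts are hypothetical and are not taken from the experiments: suppose a collection holds $P = 50$ outliers and $N = 950$ inliers, and a method labels every document as an inlier, so that $TP = 0$, $FN = 50$, $TN = 950$ and $FP = 0$.

```latex
ACC = \frac{TP + TN}{TP + FP + FN + TN} = \frac{0 + 950}{1000} = 0.95,
\qquad
FNR = \frac{FN}{P} = \frac{50}{50} = 1.00
```

The accuracy looks high even though every outlier is missed, which is exactly the failure that FNR exposes.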

The Area under the Receiver Operating Characteristics (ROC) curve has been used in prior outlier detection work to evaluate accuracy [1, 126, 155]. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), which addresses the problem with skewed classes. Let TPR and FPR be:

$$ TPR = \frac{TP}{TP + FN} = \frac{TP}{P} \tag{10} $$

$$ FPR = \frac{FP}{FP + TN} = \frac{FP}{N} \tag{11} $$

With T denoting the threshold that controls outliers, AUC can be defined as:

$$ AUC = \int_{0}^{1} ROC(T)\, dT \tag{12} $$
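The measures in Eqs. (8) to (12) can be sketched in a few lines of Python. The function names are illustrative, and the AUC helper follows the common practice of integrating TPR over FPR with the trapezoidal rule across the thresholds that were evaluated; this is an assumption about how the integral is approximated, not a description of the evaluation code used in the experiments.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return ACC, FNR, TPR and FPR from the four confusion counts."""
    p = tp + fn  # total outliers
    n = fp + tn  # total inliers
    return {
        "ACC": (tp + tn) / (p + n),
        "FNR": fn / p,
        "TPR": tp / p,
        "FPR": fp / n,
    }


def auc_trapezoid(roc_points):
    """Approximate the area under a ROC curve with the trapezoidal rule.

    roc_points is an iterable of (FPR, TPR) pairs, one per threshold.
    The end points (0, 0) and (1, 1) are added if they are missing.
    """
    pts = sorted(set(roc_points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical counts: 50 outliers, 950 inliers.
m = confusion_metrics(tp=30, tn=900, fp=50, fn=20)
# A single operating point at FPR = 0.2, TPR = 0.6.
area = auc_trapezoid([(0.2, 0.6)])
```

A method that scores no better than chance traces the diagonal from (0, 0) to (1, 1), for which this helper returns 0.5.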

4.3 Experimental Results: Accuracy Analysis

Accuracy (ACC): The accuracy results reported in Table 2 reveal that ORDG outperformed all (unsupervised) baselines by a large margin, except KNNO. The performance of ORDG is similar to that of KNNO on all datasets except DS3, where classes in the corpus are highly overlapping. It is hard to separate outliers considering terms in the VSM representation due to the overlapping class behavior; hence, ORDG yields lower accuracy. KNNO compares all document pairs to produce k-NNs and calculates the differences between each observation and its NNs, to rank the top-p points as outliers for a given p. Due to these intricate comparisons, it produces high accuracy; however, it does not scale to the high-dimensional Wikipedia document collection (DS1) and fails to produce results within the time limit. Furthermore, KNNO requires the number of outlier documents as a control parameter, which directly induces high performance as compared to the other methods.
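The pairwise nature of a k-NN based outlier ranking can be illustrated with the following sketch. It is a generic distance-based scorer, assuming Euclidean distance and the mean distance to the k nearest neighbors as the score; the exact scoring used by the KNNO baseline in [157] may differ.

```python
import math


def knn_outlier_scores(vectors, k):
    """Score each vector by its mean distance to its k nearest neighbors.

    Every vector is compared with every other vector, so the number of
    distance computations grows quadratically with the collection size.
    """
    scores = []
    for i, v in enumerate(vectors):
        dists = sorted(
            math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))
            for j, w in enumerate(vectors)
            if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


def top_p_outliers(vectors, k, p):
    """Return the indices of the p highest-scoring vectors."""
    scores = knn_outlier_scores(vectors, k)
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:p]


# Four points on a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outliers = top_p_outliers(points, k=2, p=1)
```

The parameter `p` plays the role of the number of outliers that must be supplied in advance, which is the control parameter noted above.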

In addition to producing poor quality outcomes, LOFO and MNCO are not scal-

able to big datasets DS4 and DS5 due to their requirement of a large number

of pairwise comparisons. Though NMFO can scale with the size, it shows poor

performance, especially in DS4 and DS5, as it is unable to deal with a larger

number of groups because of the iterative factorization process designed to work

with lower rank in NMF. It is interesting to note that MNCO, a mutual neighbor graph method, is unable to deal with the sparseness in high dimensionality, whereas ORDG builds a mutual neighbor graph utilizing a scalable IR system to obtain neighbors and can deal with sparse and large datasets.

Table 2: Performance comparison of different datasets and methods (Accuracy - ACC)

Dataset   ORDG   KNNO   LOFO   NMFO   MNCO
DS1       0.98   *      *      *      -
DS2       0.95   0.99   0.01   0.98   0.06
DS3       0.91   0.98   0.02   0.69   0.17
DS4       0.97   0.99   *      0.01   -
DS5       0.97   0.99   *      0.01   -
Avg.      0.96   0.99   0.02   0.42   0.12

Note: “*” and “-” denote aborted operations (after 100 minutes) and memory/runtime error respectively.
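The mutual-neighbor test that underlies the graph-based methods discussed above can be sketched as follows, taking ranked result lists (such as those returned by a search engine) as input instead of pairwise distances. The function names and the toy document ids are illustrative, and the sketch shows only the mutual-neighbor edge rule, not the full second phase of ORDG.

```python
def mutual_neighbor_graph(ranked):
    """Build an undirected mutual-neighbor graph.

    ranked maps each document id to the list of ids retrieved for it.
    An edge (a, b) is kept only when b is in a's list and a is in b's.
    """
    graph = {doc: set() for doc in ranked}
    for doc, hits in ranked.items():
        for other in hits:
            if other != doc and doc in ranked.get(other, ()):
                graph[doc].add(other)
    return graph


def isolated_documents(graph):
    """Documents with no mutual neighbor at all."""
    return sorted(doc for doc, nbrs in graph.items() if not nbrs)


# Toy ranked lists: d4 retrieves d1 and d2, but neither retrieves d4.
ranked_lists = {
    "d1": ["d2", "d3"],
    "d2": ["d1", "d3"],
    "d3": ["d1", "d2"],
    "d4": ["d1", "d2"],
}
graph = mutual_neighbor_graph(ranked_lists)
isolated = isolated_documents(graph)
```

Because each document contributes only its short result list, the work grows with the number of documents times the list length, with no all-pairs comparison.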

Area Under the Curve (AUC): Next, we analyze the results in the form of ROC and AUC, which relate TPR to FPR. We explored the ROC curve using the fixed control parameter proposed in our algorithm (the sensitivity analysis provides more details on this parameter), while the optimum threshold is used for each baseline. None of the baseline methods could be executed on DS1 due to its larger text size, as confirmed by Fig. 10(a).

As depicted by Fig. 10(b) and Fig. 10(c), ORDG gives the highest AUC compared to the baselines on DS2 and DS3, where term vectors are of medium size. KNNO, which requires the number of outliers as a control parameter, is the best amongst the other baselines for DS2, though it does not perform similarly on DS3, which contains overlapping class labels. MNCO, which uses a mutual neighbor graph, outperforms the other baselines, which use simple nearest neighbors, on DS3.

Figure 10: ROC curve and AUC for document collections, panels (a) to (e) (“∗” denotes aborted operations, memory or runtime error)

The ROC curves for document collections with short term vectors are given in

Fig. 10(d) and Fig. 10(e). Due to the large collection size, LOFO and MNCO

were not able to execute. Comparatively, ORDG succeeds by consuming less

memory and time due to the efficient IR ranking-based neighborhood generation

process. ORDG outperforms KNNO due to the inclusion of the local sub-dense neighborhood concept. NMFO performs no better than a random method on these datasets, which contain a large number of groups: the iterative lower-rank matrix approximation process increases the level of error in factorization and is impaired in handling the fine-grained data.

Table 3: FNR for different methods against datasets (False Negative Rate - FNR)

Dataset   ORDG   KNNO   LOFO   NMFO   MNCO
DS1       0.75   *      *      *      -
DS2       0.30   0.60   0.00   0.98   0.00
DS3       0.56   0.98   0.00   0.62   0.00
DS4       0.44   0.50   *      0.00   -
DS5       0.41   0.49   *      0.00   -
Avg.      0.49   0.64   0.00   0.40   0.00

The smaller the value, the better the performance. Note: “*” and “-” denote aborted operations and memory or runtime error respectively.

With ACC and AUC, we have assessed how good the methods are at identifying outliers. However, they are yet to be assessed for making false predictions. In outlier detection especially, the majority of documents are inliers and only a few are outliers. A method that identifies a large proportion of those few outliers as inliers (i.e. higher FNR values) can be considered ineffective. Table 3 reports the FNR (false negative rate), which relates the false inliers (FN) to the total number of outliers in the data. These results reveal that KNNO predicts many outliers as false inliers. On the other hand, LOFO, NMFO and MNCO produce lower accuracy in identifying true outliers (i.e. low ACC and AUC values) but report fewer false inliers (i.e. low FNR). This is mainly because they identify a larger portion of documents as outliers.

In general, ORDG shows a consistent level of performance, including on the short document collections DS4 and DS5, due to the additional concept of Hub-based inlier removal included for short collections. All baselines fail to produce results when the dimensionality of the vector is high and the dataset is large, as in DS1; ORDG, however, handles it by using the IR concepts effectively.


Table 4: Performance given by Neural network-based methods

Dataset   Supervised CNN     GAN-based Active Learning
          (Accuracy - ACC)   (Area Under the Curve - AUC)
DS1       0.99               0.52
DS2       0.74               0.56
DS3       0.98               0.52
DS4       0.97               0.50
DS5       0.97               0.50
Avg.      0.93               0.52

Supervised or Semi-supervised Baselines: Experiments have also been conducted to check the performance of the latest deep learning methods on outlier detection. A CNN has been used in the supervised setting [107], together with the semi-supervised GAN-based active learning method for outlier detection [127]. A GAN includes two networks: a generative network is used to generate candidates and a discriminative network is used to evaluate their validity in an unsupervised manner. The method in [127] follows a semi-supervised approach with active learning to generate initial outliers with reference to real data. Results in Table 4 show that the supervised CNN-based method, which predicts outliers using the knowledge learned from the labeled dataset, is unable to outperform ORDG. In DS2 especially, where the document collection is short in size and has many classes, the training phase is unable to give adequate supervision. The semi-supervised GAN method performs almost like a random method, producing an AUC value close to 0.5 on average. The data used for supervision should closely match the actual datasets to obtain higher performance with GAN methods.


Figure 11: Time and memory consumption for different methods

4.4 Experimental Results: Scalability and Complexity Analysis

Time taken by each method is presented in Fig. 11 (a). It shows that the

baseline methods must be aborted when data dimensionality is high as in the

Wikipedia dataset (DS1). Similarly, the larger document collections such as SED

2013 (DS4) and SED 2014 (DS5) cannot be handled by methods such as LOFO

and MNCO. Although the matrix factorization based NMFO takes slightly less time than ORDG on the larger datasets DS4 and DS5, the performance increment of 96% in ACC and 27% in AUC gained by ORDG justifies its additional time.

In addition, we compare the memory consumption of each method as in Fig. 11

(b). It clearly highlights that ORDG consumes the least memory in comparison

to baseline methods. All the baseline methods are impaired when dealing with

large term vectors such as Wikipedia (DS1) due to resource starvation.

Table 5 shows the computational complexities of ORDG against the baseline methods. This validates the experimental results, where LOFO and MNCO fail to produce output on large datasets due to their cubic time complexity, whereas ORDG, which uses an ensemble approach combining possible outlier candidates from two methods, works efficiently due to its linear complexity.

Table 5: Computational complexity of ORDG and the baseline methods

Method       ORDG      KNNO        LOFO        NMFO       MNCO
Complexity   O(ndkm)   O(n^2 dk)   O(n^3 dk)   O(n^2 d)   O(n^3 dk)

Note: n - the size of the document collection, d - dimensionality, m - number of mutual neighbor sets and k - considered number of nearest neighbors.

Table 6: Number of False Positives (FP) given by each phase

Dataset   ORDG Phase 1   ORDG Phase 2   Full ORDG   % improvement by ensemble approach
DS1       1665           1113           159         86%
DS2       624            1775           268         57%
DS3       909            2406           422         54%
DS4       14683          7854           2251        71%
DS5       16223          7401           2349        68%


First, we explore the effectiveness of the ensemble solution in ORDG for reducing the false positives, as in Table 6. Both phases, the rare term weighting based first phase of ORDG and the mutual neighbor graph based second phase, yield a high number of false outliers individually. However, reporting only the outliers that are identified in both phases, as the ensemble approach does, improves the quality of outlier detection.
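The ensemble rule and the improvement column of Table 6 can be checked with a short sketch. The intersection rule follows the text; the improvement formula is our reading of the table, namely the reduction in false positives relative to the better (lower-FP) of the two single phases. The source does not state the formula, but this reading reproduces the reported percentages after rounding.

```python
def ensemble_outliers(phase1_candidates, phase2_candidates):
    """Keep only the documents flagged as outliers by both phases."""
    return set(phase1_candidates) & set(phase2_candidates)


def fp_improvement(fp_phase1, fp_phase2, fp_full):
    """Percentage reduction in false positives relative to the better
    single phase (an assumed baseline, see the lead-in)."""
    best_single = min(fp_phase1, fp_phase2)
    return 100.0 * (best_single - fp_full) / best_single


# False-positive counts from Table 6: (Phase 1, Phase 2, Full ORDG).
table6 = {
    "DS1": (1665, 1113, 159),
    "DS2": (624, 1775, 268),
    "DS3": (909, 2406, 422),
    "DS4": (14683, 7854, 2251),
    "DS5": (16223, 7401, 2349),
}
improvements = {ds: round(fp_improvement(*row)) for ds, row in table6.items()}
```

For DS1, for example, the better single phase yields 1113 false positives and the ensemble 159, a reduction of (1113 - 159) / 1113, or about 86%.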

Similar to many other outlier detection algorithms [87, 157], ORDG uses a threshold to determine the top-ranked observations as outliers. We propose to set the threshold automatically and in a user-independent way, by utilizing the internal characteristics of the dataset. The control threshold T_idf, which governs the filtering process of outlier candidates through average IDF weights in a document, is set as the combination of the median and the standard deviation. It yields more outliers, as shown in Fig. 12, for all the datasets except Reuters (DS3), which contains overlapping class labels for documents. With T_idf set this way, the median, which removes the effect of noise, is boosted by adding the standard deviation in order to detect the outliers that form a smaller portion of the document collections.

Figure 12: Sensitivity of the control threshold T_idf
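The threshold rule described above can be sketched as follows. Two details are assumptions rather than facts from the source: the population standard deviation is used, and a document is treated as an outlier candidate when its average IDF weight lies above the threshold (on the reasoning that rare terms carry high IDF). The function names are illustrative.

```python
import statistics


def idf_threshold(avg_idf_per_doc):
    """T_idf as median plus standard deviation of the per-document
    average IDF weights (population standard deviation is assumed)."""
    return statistics.median(avg_idf_per_doc) + statistics.pstdev(avg_idf_per_doc)


def outlier_candidates(avg_idf_per_doc):
    """Indices of documents whose average IDF exceeds T_idf.

    The direction of the comparison is an assumption; the source only
    says the threshold governs the filtering of outlier candidates.
    """
    t = idf_threshold(avg_idf_per_doc)
    return [i for i, w in enumerate(avg_idf_per_doc) if w > t]


# Hypothetical average IDF weights for five documents.
weights = [1.0, 1.0, 1.0, 1.0, 5.0]
t_idf = idf_threshold(weights)
candidates = outlier_candidates(weights)
```

Using the median instead of the mean keeps the centre of the threshold stable when a few documents have extreme weights, which matches the stated motivation of removing the effect of noise.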

The premise of ORDG is obtaining nearest neighbors by the IR technology in-

stead of the pair-wise document comparisons as used in traditional methods. The

performance of ORDG in obtaining nearest neighbors depends on two factors: (1)

the weighting scheme used in query and document representation in order to re-

trieve the relevant neighbors; (2) the ranking function employed in the IR system

to measure the document similarity. Documents in a corpus can be represented

using different weighting schemes such as term frequency (TF), inverse document

frequency (IDF) and term frequency-inverse document frequency (TF-IDF) [37].

According to the AUC results given in Fig. 13 (a), for outlier detection, the IDF and TF-IDF weighting schemes used for document query representation are more effective than the TF scheme, and IDF shows slightly better performance. As validated by Claim 1, IDF directs a high focus to rare terms, which are the key to capturing deviations of documents represented as vectors of weighted terms. Therefore, the IDF representation is used in ORDG to form document queries to retrieve relevant documents, which gives precise nearest neighbors that can be used to differentiate outliers.

Figure 13: Performance with weighting schema and ranking functions. (a) Different document query representation techniques (TF, IDF, TF*IDF); (b) different ranking functions of IR systems (BM25, tf*idf). Both panels plot AUC against the document collections DS1 to DS5.
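The three weighting schemes compared above can be sketched as follows. The sketch assumes the plain logarithmic form idf(t) = log(N / df(t)); the exact IDF variant used in the experiments is not stated in this section, and the function names are illustrative.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Compute idf(t) = log(N / df(t)) over a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def represent(doc, idf, scheme="idf"):
    """Represent one document of the collection under a given scheme."""
    tf = Counter(doc)
    if scheme == "tf":
        return dict(tf)
    if scheme == "idf":
        return {t: idf[t] for t in tf}
    if scheme == "tfidf":
        return {t: tf[t] * idf[t] for t in tf}
    raise ValueError("unknown scheme: " + scheme)


# Toy collection: "apple" is in every document, "durian" in only one.
docs = [
    ["apple", "banana"],
    ["apple", "cherry"],
    ["apple", "banana", "durian"],
]
idf = idf_weights(docs)
query_vector = represent(docs[2], idf, scheme="idf")
```

Under the IDF scheme a term that occurs in every document receives weight zero, while the rarest term dominates the query vector, which is the emphasis on rare terms that the text attributes to IDF.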

Figure 13 (b) shows the AUC results for the BM25 and tf*idf ranking functions of the Elasticsearch search engine with ORDG. IR systems use different ranking functions such as LM Jelinek-Mercer Smoothing (LM-JM), LM Dirichlet Smoothing (LM-Dirichlet), Okapi BM25 and tf*idf [23]. The BM25 and tf*idf ranking functions give importance to rare terms, as required for outlier detection, in contrast to LM-JM, which assigns negative scores to terms with fewer occurrences, and LM-Dirichlet, which captures important patterns in the text while leaving out the noise [54]. Results show that both ranking functions have similar performance in document collections with larger and medium-size text vectors. However, BM25 [59], which calculates the relevancy score of documents in relation to a query in addition to terms in the documents, shows higher performance for document collections with short text vectors.
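To make the role of rare terms and document length in BM25 concrete, the following sketch implements the textbook Okapi BM25 score with a Lucene-style IDF. The parameter values k1 = 1.2 and b = 0.75 are commonly used defaults and are an assumption here; the sketch is not the search engine's own implementation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document (a token list) against the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in set(query):
            if t not in tf:
                continue
            # Rare terms (small df) receive a larger IDF.
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Term frequency saturates and is normalized by length.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["rare", "common"],
    ["common", "common"],
    ["common", "other"],
]
scores = bm25_scores(["rare", "common"], docs)
```

In this toy collection the first document ranks highest because it contains the rare query term, while repeating the common term in the second document adds comparatively little.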


5 Conclusion

This paper proposes a novel text outlier detection method based on ranking and a mutual neighbor graph (ORDG). Phase 1 of ORDG indicates that rare terms in a document, which can be emphasized through the IDF weighting scheme, show higher competence in detecting deviations in a document collection. Sparseness in high-dimensional text data, where the traditional distance and density-based concepts fail, is handled by the mutual neighbors in Phase 2 of ORDG. Mutual neighbors facilitate relatively uniform denseness inside the corpus through the shared neighbors. A normal mutual nearest neighbor graph built using k-Nearest Neighbors calculation is not scalable to larger datasets due to the high number of pairwise comparisons required. In contrast, ORDG, which calculates nearest neighborhoods using relevant documents obtained through a scalable search engine, can construct mutual nearest neighbor graphs for larger datasets effectively. The local sub-dense neighborhood (Hub) concept in high dimensionality is brought to ORDG together with the density approximation to separate outliers. It conjectures that documents that are not attached to a graph of sub-dense local neighborhoods are possible outliers.

Extensive empirical analysis has been conducted on diverse datasets with large, medium and short term-vector sizes. ORDG is benchmarked against several state-of-the-art distance-based, density-based, graph-based and matrix factorization-based outlier detection methods. Results show that ORDG is capable of detecting outliers in high-dimensional document collections with considerably higher performance, in terms of both accuracy and efficiency. The ensemble approach of ORDG reduces the false outliers and inliers. Applying ORDG to dynamic temporal text data for outlier detection is left for future investigation.

Chapter 5

Text Cluster Evolution

This chapter introduces the last contribution of the thesis, a novel document cluster evolution method that identifies the dynamic changes to text clusters over time or across domains using text cluster similarity. Analyzing text-based communications over time or across domains is important for understanding how concepts have evolved. It reveals which clusters are emerging, persistent, growing and diminishing. This information is important in planning events, publications, advertising and much more. Evolution tracking is more popular in network analysis for identifying community evolution [41, 115, 123]. There exists very little research on text-based evolution tracking.

The majority of text-based evolution research focuses on topic evolution [35, 41, 180] or event evolution [82, 119], which deal with a much smaller data space than the original data space. In addition, existing text evolution methods are limited to comparing only consecutive timestamps [63] or to monitoring a few evolution patterns such as emerging concepts [98]. There is no prior work that considers a global cluster evolution that is able to show the full cluster life cycle with all the evolution patterns in the original data space with all the terms.

Fig. 5.1 shows the main concepts used in the proposed method, Cluster Association-aware matrix factorization for discovering Cluster Evolution (CaCE), to identify the text cluster similarities and track the cluster evolution. This chapter presents CaCE, which introduces NMF to identify the groups of similar clusters over time/domain using intra- and inter-cluster similarity to handle the issues attached to high-dimensional text. The method is based on the conjecture that the assistance given by inter-cluster association is able to address the information loss that occurs with high- to low-dimensional projection. Further, this chapter introduces Skip-Gram with Negative Sampling (SGNS) to accurately learn the context by maximizing the probability of closely associated cluster pairs within the considered time period/domains, while minimizing that of the loosely associated cluster pairs.

This chapter is formed by Paper 7 in its original form.

• Paper 7. Wathsala Anupama Mohotti and Richi Nayak.: Discovering

Cluster Evolution Patterns with the Cluster Association-aware Matrix Fac-

torization. Springer Knowledge and Information Systems (KAIS) (Under

Review).

Paper 7 proposes a novel method named CaCE to discover cluster evolution when

each cluster solution is given for each time-stamp/domain. Thus it works on

static cluster solutions and is able to identify the groups within the clusters over

the time/domain. Specifically, it identifies evolution patterns with the cluster

association-aware Matrix Factorization that identifies cluster groups with similar

text clusters. It uses an NMF-based method with graph-based visualization to

identify the changing dynamics of text clusters over the time/domain.

CaCE models inter-cluster associations with the number of overlapping terms between clusters, using SGNS modelling to improve the accuracy. Specifically, it captures the similarity between each pair of clusters, which carries important information to assist the global evolution: even small values represent the initial stage of links between clusters that could develop into growth in upcoming years/domains. This information therefore semantically assists matrix factorization for cluster-group discovery. A density concept based on the term frequency is used to maintain the uniform term distribution within a cluster group and to separate less cohesive clusters from it. CaCE tracks four major lifecycle states of clusters, namely birth, death, split and merge, to discover their emergence, persistence, growth and decay. It uses the bipartite graph concept to effectively visualize this cluster evolution as a progressive k-partite graph across the k temporal dimensions or domains. A NewsGroup dataset, a patent abstract dataset and two Twitter datasets are used for experiments. Experiments are conducted both quantitatively and qualitatively to demonstrate the validity of CaCE.

Paper 7: Discovering Cluster Evolution Patterns with the Cluster Association-aware Matrix Factorization




Under Review in: Knowledge and Information Systems (KAIS Journal)























ABSTRACT: Tracking of document collections over a period is helpful in several

applications such as finding dynamics of terminologies, identifying concept drift,

emerging and evolving trends, etc. We propose a novel “cluster association-

aware” Non-negative Matrix Factorization (NMF)-based method with graph-

based visualization to identify the changing dynamics of text clusters over time.

NMF is used to find associations among terms of the clusters within a collection

over the time. The novel concepts of “cluster associations” and term frequency

based “cluster density” have been used to improve the quality of evolution trend.

The cluster evolution is visualized using a k-partite graph to display the birth,

death, split and merge of clusters across time. Empirical analysis with text data shows that the proposed method is able to produce an accurate and efficient solution compared to the state-of-the-art methods.

KEYWORDS: Cluster Evolution; Text Mining; Matrix Factorization

1 Introduction

Text data, widespread in social media platforms and document repositories such as news broadcasting platforms and research publications, has emerged as a powerful means of communication among people and organizations [60]. Text repositories contain data spanning domains and/or time [7]. Social networks include opinions expressed on diverse concepts over time. Search engines are another popular internet medium that stores (or indexes) a large collection. Topics (or concepts) and associated terminologies in these text repositories change over time as well as across domains and show a varying trend.

It is useful for scholars, journalists, and practitioners of diverse disciplines to mine these data, spanning time or domains, for finding decaying, current and emerging concepts [64, 73, 75]. A term analysis tool such as Google Trends can track how the popularity of a term changes over time, based on query log analysis [34]. With the rise of big data and the dependence amongst terms/concepts, it is appropriate to analyze the formation and evolution of concepts (or clusters) instead of individual terms in dynamic text corpora. Over time, a cluster can go through the states of birth, death, split and merge, indicating the persistence, growth and decay of concepts [63].

Tracking evolution across different domains provides insight into how the same concept has been used across diverse domains. For example, consumer behavior is a well-known concept mainly used in the economics domain, but it is also important for the political and agriculture domains. Identifying how this concept evolves in the agriculture domain helps businesses establish marketing strategies. Further, the trends shown by the dynamics of this concept in the political domain create opportunities for political parties and governments to adjust their campaigns. Similarly, discovering cluster dynamics over time in a specific field is useful for researchers, academics, and students in that field to plan their publications, strategies and research. These trends also provide insight for businesses and governments to set up policies accordingly. Tracking of concepts over domain or time can also provide insight to historians and social scientists to understand how a concept or theory has evolved [112].

In order to find common concepts, text clustering faces challenges due to the com-

plex nature of text data resulting in high-dimensional and sparse vector repre-

sentation [8]. Matrix Factorization (MF), which maps high-dimensional to lower-

dimensional space, is one of the effective solutions [103]. However, information

loss is inevitable in this family of methods that may result in poor outcome [8].

Researchers have introduced term-based semantics to assist factorization with

additional information to identify topic clusters highlighting concepts [117, 168];


particularly the use of Non-negative MF (NMF) has been found effective [117].

Only a handful of research studies exist that study cluster/topic evolution. Most

of these methods only deal with identifying emerging or novel topics [98, 99].

There are only a couple of studies that focus on identifying emerging, persis-

tence and diminishing topics [41, 63]. The method in [41] identifies some of these

patterns by measuring how the term frequency changes over time; however, it

is not able to track the individual state differences in topics such as split and

merge. The method in [63] performs similarity calculation between clusters using

overlapping terms in each consecutive time stamp, to visualize various states.

However, it disregards the global evolution over the time and focuses only on

adjacent time stamps to determine similarity. Identifying all states of cluster

evolution globally over the time is challenging for these types of methods as they

consider a consecutive time-interval pair at a time. Other methods [115, 123]

assume fixed skeleton structures of clusters over time to identify their evolution

and fail to consider new formations or the changes in the structure of the clus-

ters. In contrast, the proposed method considers the time/domain-wise clusters

(presented by the representative terms) for naturally identifying the emergence,

persistence, growth and decay of concepts over a period.

This paper proposes a novel and accurate method of Cluster Association-aware

matrix factorization for discovering Cluster Evolution, called CaCE. It can track

four major lifecycle states of clusters namely birth, death, split and merge to

discover their emergence, persistence, growth and decay. It includes an NMF-

based process to identify the groups of similar clusters that are formed over the

time or domains, based on inter and intra-cluster association relationships defined

using terms in the clusters. Specifically, inter-cluster associations modeled with

the number of overlapping terms between clusters, semantically assist matrix

factorization for cluster-group discovery. To separate less cohesive clusters from


a cluster group, we introduce a novel concept of density based on uniform term

frequency distribution within the group using a pre-defined threshold. Finally,

the paper proposes to use the concept of bipartite graph to effectively visualize

the cluster evolution as a progressive k-partite graph in a novel fashion across

the k temporal dimensions. The evolution is represented by drawing edges in

a k-partite graph between consecutive time intervals if the clusters possess the

same level of density and belong to the same group in this time interval.
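The edge rule for the k-partite evolution graph described above can be sketched as follows. Each cluster is represented by a small record whose keys (`id`, `time`, `group`, `dense`) are hypothetical names chosen for this illustration; `dense` stands in for the outcome of the density check, whose exact definition is given later in the paper.

```python
from collections import defaultdict


def evolution_edges(clusters):
    """Link clusters at consecutive time steps that share a group.

    Only clusters that passed the density check (dense == True) are
    kept as nodes. Returns a list of (earlier_id, later_id) edges.
    """
    by_time = defaultdict(list)
    for c in clusters:
        if c["dense"]:
            by_time[c["time"]].append(c)
    edges = []
    for t in sorted(by_time):
        for a in by_time[t]:
            for b in by_time.get(t + 1, []):
                if a["group"] == b["group"]:
                    edges.append((a["id"], b["id"]))
    return edges


clusters = [
    {"id": "A1", "time": 0, "group": 0, "dense": True},
    {"id": "B1", "time": 1, "group": 0, "dense": True},
    {"id": "B2", "time": 1, "group": 0, "dense": True},
    {"id": "B3", "time": 1, "group": 1, "dense": True},
    {"id": "C1", "time": 2, "group": 0, "dense": False},
]
edges = evolution_edges(clusters)
```

In this toy input, A1 links to both B1 and B2, which could be read as a split, while B3 has no predecessor and could be read as a birth; these readings are illustrative interpretations of the resulting graph.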

More specifically, this paper brings several novel contributions to the area of

cluster evolution listed as:

• An NMF based approach with inter and intra-cluster associations to identify

the cluster groups.

• A term frequency-based concept of density to remove the loosely connected

clusters in the cluster groups.

• A progressive k-partite graph-based approach to display evolution of clus-

ters in the cluster groups.
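As a minimal illustration of the first contribution, the following sketch builds an inter-cluster association matrix from the number of overlapping terms between clusters. It shows only the raw overlap counts; any further modelling applied to these counts before factorization is not reproduced here, and setting the diagonal to zero is an assumption made for the illustration.

```python
def overlap_matrix(cluster_terms):
    """Count shared terms between every pair of clusters.

    cluster_terms is a list of term collections, one per cluster, with
    the clusters of all time stamps or domains stacked in one list.
    The diagonal is set to zero (an assumption of this sketch).
    """
    sets = [set(terms) for terms in cluster_terms]
    return [
        [len(a & b) if i != j else 0 for j, b in enumerate(sets)]
        for i, a in enumerate(sets)
    ]


# Toy clusters represented by their terms.
cluster_terms = [
    ["tax", "vote", "party"],
    ["vote", "party", "poll"],
    ["crop", "soil"],
]
association = overlap_matrix(cluster_terms)
```

The resulting matrix is symmetric and non-negative, which makes it a natural companion input to a non-negative factorization.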

To the best of our knowledge, CaCE is the first method that considers the cluster

association using an inter-cluster matrix built with overlapping terms for discov-

ering cluster evolution. Empirical analyses using several document corpuses over

the varying number of time stamps and the varying number of clusters reveal

that CaCE can discover cluster evolution accurately and efficiently compared to

other state-of-the-art cluster/topic evolution methods.

The rest of the paper is organized as follows. Section 2 reviews related work and presents the motivation behind this research. Section 3 introduces the problem definition, followed by the proposed CaCE method. Experiments are discussed in Section 4 with two real-world case studies. Concluding remarks are given in Section 5.

2 Related Work

Approaches that attempt to address the dynamic text over time can be seen

as the discovery methods of cluster evolution [63, 73], topic evolution [35, 41,

180] or event detection [82, 119]. All these paradigms focus on tracking content

shift and identifying emerging trends in dynamic text datasets. These methods

explore the change in cluster/topic structure over time through textual content

associated with clusters/topics to characterize the evolutionary events, concepts

or terminologies. In comparison to cluster evolution, topic evolution is done in

much smaller data space (i.e., topic space) as depicted by Fig. 1 (a) and Fig.

1 (b). The number of extracted topics is much less in topic evolution than the

entire document collection, and associated vocabulary with topic clusters in the

collection is much smaller than the complete vocabulary of the collection. This

is the same for event detection work, which considers the set of selected events

in tracking evolution. Community evolution [102, 115, 123] given in Fig. 1 (c),

is another paradigm in tracking cluster dynamics, which considers user groups as

clusters.

Figure 1: Comparison between existing evolution approaches

Research in (text) cluster evolution is in its infancy, with only simple approaches in existence [63, 73]. A survey-based study [73] was carried out to identify the evolution of concepts in clusters of publications using bibliometric tools. It only considers the citation network in tracking evolution. TextLuas [63] models each cluster solution with the respective terms at each time stamp and considers the similarity between consecutive clusters, as determined by the term intersections. It applies a threshold to the Jaccard coefficient between clusters to define the persistence, merging and splitting of clusters on a timeline. It considers only the local relations between two consecutive time stamps in defining evolution. In contrast, the proposed method CaCE globally identifies the cluster groups over a period and visualizes the entire evolution among time stamps using a k-partite graph.
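The local, pairwise linking used by Jaccard-based approaches can be sketched as follows. The threshold value of 0.3 and the function names are hypothetical; the sketch illustrates the general idea of thresholding the Jaccard coefficient between clusters of consecutive time stamps and is not the implementation of [63].

```python
def jaccard(a, b):
    """Jaccard coefficient between two term collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_consecutive(prev_clusters, next_clusters, threshold=0.3):
    """Link clusters of two consecutive time stamps whose Jaccard
    coefficient reaches the (hypothetical) threshold."""
    links = []
    for i, p in enumerate(prev_clusters):
        for j, q in enumerate(next_clusters):
            if jaccard(p, q) >= threshold:
                links.append((i, j))
    return links


# One cluster at time t whose terms divide into two clusters at t + 1.
prev_clusters = [["a", "b", "c", "d"]]
next_clusters = [["a", "b"], ["c", "d"], ["x", "y"]]
links = link_consecutive(prev_clusters, next_clusters)
```

Because each call sees only two adjacent time stamps, a cluster that disappears and reappears later cannot be linked, which is the limitation of local linking noted above.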

Topic modeling is another powerful paradigm for the semantic analysis of large

collections of documents. Topic models have been used as formalization of the

conversational understanding through identifying subsets (i.e.,topics) [98, 180].

Several researchers have attempted to identify evolution of topics in larger doc-

ument collections using extensions of LDA [28, 48, 74, 180]. In [180], a proba-

bilistic topic modeling approach is used to track the topic occurrence over the

time. This generative probabilistic approach only identifies topic occurrence in

different time dimensions with the calculated respective probabilities and is found

incapable of identifying topic evolution with splits and merge. Authors in [49]

determine text cluster evolution based on the changes to term probability within

topics. This proposal was limited by the fixed vocabulary constraint where only

a general set of terms in the topics was studied and neglected the tracking of new

topic formations. In [98], NMF is used to identify a set of steady topics through

minimizing learning error. The emerging topics are obtained by filtering deviated


topics. However, discovering only these changes is insufficient, as they do not give

the complete insight of persistence, diminishing and growing concepts. Similarly,

topic models have been used in understanding the topic dynamics across temporal

dimensions [35, 41] in social media domain. Authors in [41] extended these topic

trends to track persistent and diminishing topics using the term frequency-based

energy concept defined for each cluster solution. The “density” concept, which is used to determine the consistent cluster groups in CaCE, is inspired by this energy concept. However, these topic evolution methods are limited to identifying a few states in the cluster lifecycle. Identifying complex dynamics of topics such as merge

and split, detailing a complete cluster lifecycle, is challenging without additional

information due to the sparseness of text representation.

Event detection methods have been applied in social media communication to

find novel or trending events [82, 119, 195]. This stream of methods keeps track

of event clusters (far fewer than the clusters in the original space) that

appear across time to identify novel events or shifts that deviate from the

existing event clusters. In [119], a novelty score based on tweet similarity is assigned to each event cluster to

identify new events in a Twitter dataset. Identifying

events in Twitter data across time is handled in [195] with topic modeling.

This research is limited in tracking evolution and fails to identify growth and

decay of clusters. It attempts to identify emerging events through deviations to

previously existing events with the assumption of a fixed set of events within a

dataset.

Researchers have studied the community evolution in social networks focusing

on structural properties of communities [102, 115, 123]. In the area of network-

based community detection, clusters consist of users instead of text as depicted

by Fig. 1 (c). The “snapshot model” [192] considers different snapshots of the

network at different time steps to find communities or clusters, and then tracks


clusters over time in order to interpret their evolution. However, the majority

of community detection methods assume a fixed number of communities across

time, disregarding new formation and dissolution [123], or rely on a pre-

determined community structure [115]. The “temporal smoothness model” [40] is

used to analyze a continuous stream of atomic changes to the considered networks to

derive communities over time. It can be considered similar to the fixed vocabulary

constraint in some of the text evolution analysis methods [49]. However,

network evolution based on user interactions is a completely different domain

compared to text cluster evolution.

In text clustering, the sparse nature of the data results in poor outcomes [3]. There

are a few recent studies that use additional semantic information to assist the sparse text

clustering problem [117, 139, 168]. They

use word association relationships, Skip-Gram and Skip-Gram with Negative-

Sampling (SGNS), similar to the concept of word embedding. The Skip-Gram

model is a training method for neural networks to learn neighbors or the context

of a word in a corpus for word embedding [137]. In [168], the term × term

association matrix modeled with SGNS is used to semantically assist the NMF

in short text clustering for topic discovery. Negative sampling tries to maximize

the probability of observed term pairs to be 1 and unobserved term pairs to

be 0 within the term association matrix. Inheriting these concepts to cluster

evolution, we propose the use of SGNS to model the inter-cluster association

using overlapping terms. We conjecture that, by learning the context of terms, clusters

that share similar concepts and terms can be grouped together.

Table 1 summarizes the existing cluster, topic/event and community detection

methods with their major drawbacks in accurate identification of text cluster evolution.

Distinct from these works, CaCE utilises the higher to lower dimensional

mapping via matrix factorization to identify the cluster associations and track all

of their states over the time.

Table 1: Summary of existing evolution detection methods

Category               | Applied data domain | Major drawback
-----------------------|---------------------|------------------------------------------------------------
Cluster Evolution      | Text data           | Neglect global evolution patterns due to consecutive time-stamps analysis [63]
Topic/Event Evolution  | Text data           | Unable to identify complex cluster dynamics [41, 98, 180]
                       |                     | Study changes to fixed set of terms and neglect new formations [49, 119, 195]
Community Evolution    | Network data        | Study changes to fixed set of structures and neglect new formations [40, 115]
                       |                     | Assume fixed number of communities over time [123]

3 Cluster Association-aware Matrix Factorization method of Cluster Evolution

3.1 Preliminaries and Definitions

Consider a document collection D = {D1, D2, ..., Dk} over a time period k or a

set of k domains. Let {t1, t2, ..., ts, ..., tk} be the considered time period or set of

domains with k consecutive instances. Let C = {C1, C2, ..., Cs, ..., Ck} be the set

of respective cluster solutions in D. Each time-stamp/domain dataset creates a

cluster solution Cs = {c1, c2, ..., cm} with m clusters, where m > 1 and the value

of m can vary among the cluster solutions.

Given a text data collection spanned across the time/domain, the proposed


method aims to identify the cluster evolution over a period of time or domains,

as stated in Definition 1.

Definition 1 : Individual clusters in the set of cluster solutions C at each time-

stamp or domain hold a lifecycle state that can assist in displaying a cluster

evolution for the document collection stored over the time or domains. Following

are the types of states that can be assigned to cluster ci at timestamp ts that

reveal the evolution patterns.

• Birth: if cluster ci that appears in time/domain ts does not have any

similar cluster in time/domain ts−1, it marks the birth of ci.

• Death: if cluster ci that appears in time/domain ts does not have any

similar cluster in time/domain ts+1, it marks the death of ci.

• Split: if cluster ci that appears in time/domain ts has multiple similar

clusters in time/domain ts+1, it marks the split of ci.

• Merge: if cluster ci that appears in time/domain ts has multiple

similar clusters in time/domain ts−1, it marks the merge of ci.
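
The four states of Definition 1 can be read directly off a list of "similar" cluster pairs between consecutive time stamps. The following is a minimal sketch under stated assumptions: the function name `lifecycle_states`, the input format and the cluster ids are hypothetical, and clusters in the first time stamp are not marked as births (and clusters in the last one not as deaths) because no neighbouring time stamp exists to compare against.

```python
from collections import defaultdict


def lifecycle_states(clusters_by_time, similar_pairs):
    """Assign birth/death/split/merge states in the spirit of Definition 1.

    clusters_by_time: list of lists of cluster ids, one inner list per time stamp.
    similar_pairs: set of (earlier_id, later_id) pairs for clusters judged
        similar in consecutive time stamps (hypothetical input format).
    """
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for earlier, later in similar_pairs:
        successors[earlier].add(later)
        predecessors[later].add(earlier)

    states = defaultdict(set)
    last = len(clusters_by_time) - 1
    for s, clusters in enumerate(clusters_by_time):
        for c in clusters:
            # Assumption: boundary time stamps cannot produce births or deaths.
            if s > 0 and not predecessors[c]:
                states[c].add("birth")
            if s < last and not successors[c]:
                states[c].add("death")
            if len(successors[c]) > 1:
                states[c].add("split")
            if len(predecessors[c]) > 1:
                states[c].add("merge")
    return dict(states)


# Toy input loosely modelled on the university-archive example (ids are made up).
timeline = [["math1", "arch1a", "arch1b"], ["math2", "arch2", "it2"], ["math3", "it3a", "it3b"]]
pairs = {("math1", "math2"), ("math2", "math3"), ("arch1a", "arch2"),
         ("arch1b", "arch2"), ("it2", "it3a"), ("it2", "it3b")}
print(lifecycle_states(timeline, pairs))
```

Clusters that simply continue with one predecessor and one successor receive no state here; they correspond to the persistent case.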

We propose an NMF-based solution to define the similarity between individual

clusters within the set of cluster solutions {C1, C2, ...Cs, ...Ck} based on cluster

associations and discover the latent relationships between clusters by projecting

them to a lower-order dimension. We then assign a unique cluster-group to each

cluster and refine these cluster-group assignments using the term weight-based

density concept to form uniform term distribution within a group. A cluster

with insufficient density value is excluded from the group, indicating that the

cluster does not share enough matching terms with the group to be a member of

the group. The following evolution patterns can be identified based on the final

cluster similarities given by the proposed method.



• Persistence: if cluster ci ∈ Cs has a similar cluster in each consecutive

clustering solution until cluster solution Cp where p ≤ k, cluster ci will

display a persistent evolution pattern within time/domain s to p.

• Growth: if cluster ci ∈ Cs has a gradual increase in the number of splits

until the cluster solution Cp where p ≤ k, cluster ci will display a growth

evolution pattern within time/domain s to p.

• Decay: if cluster ci ∈ Cs has a gradual decrease in the number of merges

until the cluster solution Cp where p ≤ k, cluster ci will display a decay

evolution pattern within time/domain s to p.

• Emerging: if cluster ci ∈ Cs has been born in time/domain s it displays

an emerging pattern in time/domain s.

Let the set of cluster solutions C over the k time-stamps or domains consist of

a total number of N clusters {c1, c2, ...cN} that contain the total number of M

terms {w1, w2, ...wM}. Let matrix S represent the “Intra-cluster association” with

term × cluster relationship modeling N clusters with M terms. The matrix S

is modeled with the traditional bag-of-words model with each term count. This

is accompanied by the symmetric matrix A that represents “Inter-cluster associ-

ation” with cluster × cluster relationship using a number of overlapping terms

between clusters. The matrix A is modeled with Skip-Gram with Negative-

Sampling (SGNS) [117] weighting so that the probability of an observed cluster

association is high. The Skip-Gram model is a popular training approach for

neural networks to learn distributed word representation. The Skip-Gram model

predicts neighbors or the context for a considered word in a corpus in comparison

to the continuous Bag-of-Words model, which uses context to predict the word

[137]. The concept of negative sampling is used to push the probability of an

observed (word, context) pair towards 1 and that of an unobserved pair towards

0 [168]. In [117], SGNS is shown to be equivalent to factorizing a (shifted) word

correlation matrix: SGNS implicitly factorizes a word-context

matrix whose cells are the point-wise mutual information of the respective word

and context pairs. In CaCE, the inter-cluster association matrix, modeled following

the Skip-Gram model, semantically assists NMF in learning the context of

the clusters. The use of the SGNS concept in CaCE increases the probability of

accurately learning the context of clusters.

We propose to utilize SGNS in CaCE with the objective of maximizing probability

P (A = 1|ci, cj) for closely associated cluster pairs (ci,cj) within the observed k

time stamps while minimizing P (A = 0|ci, cj) for loosely associated cluster pairs

(ci,cj). The inter-cluster association matrix A is represented with the SGNS of

the observed set of clusters using the number of term co-occurrences as:

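\[
A_{c_i c_j} = \log\left[ \frac{\#\{w_{c_i} \cap w_{c_j}\} \times V}{\sum_{c_b \in C} \#\{w_{c_b} \cap w_{c_i}\} \times \sum_{c_b \in C} \#\{w_{c_b} \cap w_{c_j}\}} \right] \tag{1}
\]

where $w_{c_i}$ and $w_{c_j}$ are the sets of terms in clusters ci and cj respectively, $\#\{w_{c_i} \cap w_{c_j}\}$ is the number of overlapping terms between them, and $V = \sum_{(c_i,c_j) \in C} \#\{w_{c_i} \cap w_{c_j}\}$ is the

total number of overlapping terms among all the cluster pairs.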

Entries of A that are less than 0 after taking the logarithm in Eq. 1 are set to zero to minimize the probability

of unobserved pairs. This modelling with

#{wci ∩ wcj} × V represents the inter-cluster similarity of each pair

with respect to the total count of term overlaps among clusters over the time/domain

in a normalized manner.
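
To make Eq. 1 concrete, the sketch below builds the association matrix from plain term sets. It is an illustration only: the function name `association_matrix` and the toy clusters are hypothetical, the diagonal is left at zero, and V is summed over ordered cluster pairs, neither of which is specified in the text.

```python
import math


def association_matrix(cluster_terms):
    """SGNS-style inter-cluster association in the spirit of Eq. 1.

    cluster_terms: list of term sets, one per cluster (hypothetical input).
    Returns an N x N list of lists with negative entries clipped to zero.
    """
    n = len(cluster_terms)
    # overlap[i][j] = number of terms shared by clusters i and j.
    # Assumption: a cluster's overlap with itself is ignored.
    overlap = [[len(cluster_terms[i] & cluster_terms[j]) if i != j else 0
                for j in range(n)] for i in range(n)]
    row_totals = [sum(row) for row in overlap]
    # V: total overlap over all ordered cluster pairs (assumption).
    total = sum(row_totals)

    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if overlap[i][j] == 0:
                continue  # log(0) is undefined; unobserved pairs stay at zero
            value = math.log(overlap[i][j] * total / (row_totals[i] * row_totals[j]))
            a[i][j] = max(value, 0.0)  # clip negative associations to zero
    return a


clusters = [
    {"game", "team", "player"},
    {"game", "team", "hockey"},
    {"gun", "weapon"},
    {"gun", "weapon", "team"},
]
A = association_matrix(clusters)
for row in A:
    print([round(x, 3) for x in row])
```

Pairs that share more terms than their overall overlap counts would suggest receive a positive weight, while weakly related pairs are clipped to zero.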



Figure 2: Overall process in CaCE

3.2 Overview of CaCE

CaCE includes three main phases for discovering cluster evolution, as depicted by

Fig. 2. (1) Firstly, it uses NMF to identify the groups of similar clusters over the

time/domain using the inter- and intra-cluster associations. This allows identify-

ing similar clusters within the cluster solutions C spanned across the time/domain

k. (2) Secondly, the loosely attached clusters in a cluster group are separated if

they do not contain sufficient density to be included in the group based on the

term frequencies of the cluster with respect to the maximum term frequency of

the cluster group. This allows the cluster groups to be tightly cohesive based on

the common terms that they share. (3) Finally, CaCE visualizes the global cluster

evolution patterns of emergence, persistence, growth and decay across time using

a k-partite graph where nodes represent clusters and edges represent relationships

between clusters such as persistence, split and merge considering cluster groups.

A cluster evolution with all state changes of a cluster lifecycle (i.e., birth, death,

split and merge) can be tracked with this visualization.


Figure 3: Example Cluster Evolution in Education domain (clusters Mathematics, Archeology and IT over time stamps t1, t2 and t3)

Example: Consider a document collection in a university archive collected over

three years. Application of CaCE shows an example of the evolution in clusters

in this corpus with the internal cluster state changes as displayed in Fig. 3. It

shows Mathematics as a persistent cluster over the considered period of time by

showing the progression of a similar cluster in each time stamp. IT, which is

born in t2, shows an emerging pattern. It then shows growth through a split into

two similar clusters at t3. In contrast, Archeology shows decay with a

merge between t1 and t2, followed by death at t2 as it has no similar cluster

in t3.

3.3 Cluster association-aware Matrix Factorization

Matrix Factorization

The aim of CaCE is to identify the global cluster evolution showing the trends,

how the group of terms have evolved over the time/domain. The first step is to

identify groups of common clusters in the high-dimensional sparse “intra-cluster

association” matrix S using the lower dimensional approximation. NMF, which

takes fewer parameters and produces coherent topics compared to other popular

dimensionality reduction methods such as LDA [21], is used in this approximation.

In traditional NMF [3], the sparse matrix $S \in \mathbb{R}^{M \times N}$ is approximated by learning

$W \in \mathbb{R}^{M \times g}$ and $H \in \mathbb{R}^{N \times g}$, where g is the number of cluster groups, as follows.

\[
S \approx W H^{T} \tag{2}
\]

In order to find the best groupings of clusters in intra-cluster association matrix S,

we propose to utilize the latent information within the inter-cluster association

matrix $A \in \mathbb{R}^{N \times N}$. In this way, we take advantage of co-clustering, finding

commonalities amongst the terms based on the clusters in which they appear

as well as finding commonalities amongst the clusters based on the terms they

share. The symmetric NMF [83] is applied to A for generating two commutative

matrices, $H_c \in \mathbb{R}^{N \times g}$ and $H \in \mathbb{R}^{N \times g}$, where g is the number of cluster groups, as

follows.

\[
A \approx H H_c^{T} \tag{3}
\]

Objective Function

CaCE proposes to use both these learning processes to discover cluster groups,

as defined in the following objective function:

\[
\min_{W,H \ge 0} \| S - W H^{T} \|_F + \min_{H,H_c \ge 0} \| A - H H_c^{T} \|_F + \alpha \| W \|_1 \tag{4}
\]

We approximate the intra-cluster association matrix S and inter-cluster association

matrix A with the minimum learning error. We introduce L1 regularization

on the factor matrix W to promote sparsity, control overfitting and

highlight the distinguishing terms. This can be considered as sparse dictionary

learning, which models the sparse input data representation using only a

few (important) terms of the dictionary learned from the data itself [19]. Prior

research on traditional NMF has found this constraint to be effective for detecting

deviations or novelty in text data [99]. We conjecture that this constraint will be

able to discriminate cluster groups more effectively.


Solving the optimization problem

We propose to use the Block Coordinate Descent (BCD) algorithm [103] to optimize

the objective function in Eq. 4. The BCD algorithm divides the matrix

members into several disjoint subgroups and iteratively minimizes the objective

function with respect to the members of one subgroup at a time. It relies on the

most recent values of the members for solving the sub-problems related to their updates.

When sub-problems depend on each other, they must be computed

sequentially to make use of the most recent values.

CaCE solves these interdependent sub-problems sequentially starting from W .

The most recent values of members for the first iteration are zeros set at the

initialization. Firstly, the BCD update rule has been used for finding W in the

NMF optimization using the intra-cluster association matrix S and initial matrix

H. The matrix H is then updated using the current values of W and other

members. Finally, Hc is updated using the inter-cluster association matrix A and

the most recent values of H. This is done for each g′ ∈ g.

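\[
W_{(:,g')} \leftarrow \left[ W_{(:,g')} + \frac{(SH)_{(:,g')} - (W H^{T} H)_{(:,g')}}{(H^{T} H)_{(g',g')}} \right] \tag{5}
\]

\[
H_{(:,g')} \leftarrow \left[ H_{(:,g')} + \frac{(S^{T} W)_{(:,g')} + (A H_c)_{(:,g')}}{(W^{T} W)_{(g',g')} + (H_c^{T} H_c)_{(g',g')}} - \frac{(H H_c^{T} H)_{(:,g')} + (H W^{T} W)_{(:,g')}}{(W^{T} W)_{(g',g')} + (H_c^{T} H_c)_{(g',g')}} \right] \tag{6}
\]

\[
H_{c(:,g')} \leftarrow \left[ H_{c(:,g')} + \frac{(AH)_{(:,g')} - (H_c H^{T} H)_{(:,g')}}{(H^{T} H)_{(g',g')}} \right] \tag{7}
\]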

This enables the decomposition process to include both inter- and intra-cluster

associations. In each iteration, these sequential updates of the factor

matrices W, H and Hc allow CaCE to minimize the objective function in Eq. 4.
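
The sequential W, H, Hc updates can be sketched with NumPy as follows. This is an illustrative approximation, not the thesis implementation: the function name `cace_factorize`, the random non-negative initialization, the projection onto non-negative values, the squared Frobenius norms and the way the L1 penalty enters the W step are all assumptions. In particular, the H step uses the term H Hc^T Hc obtained by differentiating the squared second term of Eq. 4.

```python
import numpy as np


def cace_factorize(S, A, g, alpha=0.1, iters=100, seed=0, eps=1e-10):
    """Jointly factorize S ~ W H^T and A ~ H Hc^T with column-wise updates.

    Minimizes ||S - W H^T||_F^2 + ||A - H Hc^T||_F^2 + alpha * ||W||_1
    (the squared norms are an assumption of this sketch). A must be symmetric.
    """
    rng = np.random.default_rng(seed)
    M, N = S.shape
    # Assumption: random non-negative start, since an all-zero start would
    # make every denominator below vanish.
    W, H, Hc = rng.random((M, g)), rng.random((N, g)), rng.random((N, g))
    history = []
    for _ in range(iters):
        for k in range(g):
            # W step: intra-cluster term, with an L1 shrinkage of alpha / 2.
            HtH = H.T @ H
            grad = (S @ H)[:, k] - (W @ HtH)[:, k] - alpha / 2
            W[:, k] = np.maximum(0.0, W[:, k] + grad / max(HtH[k, k], eps))
            # H step: uses both the intra- and inter-cluster matrices.
            WtW, HctHc = W.T @ W, Hc.T @ Hc
            grad = ((S.T @ W)[:, k] + (A @ Hc)[:, k]
                    - (H @ WtW)[:, k] - (H @ HctHc)[:, k])
            H[:, k] = np.maximum(0.0, H[:, k] + grad / max(WtW[k, k] + HctHc[k, k], eps))
            # Hc step: inter-cluster term only (relies on A being symmetric).
            HtH = H.T @ H
            grad = (A @ H)[:, k] - (Hc @ HtH)[:, k]
            Hc[:, k] = np.maximum(0.0, Hc[:, k] + grad / max(HtH[k, k], eps))
        loss = (np.linalg.norm(S - W @ H.T) ** 2
                + np.linalg.norm(A - H @ Hc.T) ** 2 + alpha * W.sum())
        history.append(loss)
    return W, H, Hc, history


# Toy data: 6 terms, 4 clusters, 2 cluster groups (values are random placeholders).
rng = np.random.default_rng(1)
S = rng.random((6, 4))
B = rng.random((4, 4))
A = (B + B.T) / 2  # symmetric inter-cluster matrix
W, H, Hc, history = cace_factorize(S, A, g=2, iters=50)
print(history[0], history[-1])
```

Because every column update solves its own sub-problem exactly given the most recent values of the other factors, the recorded loss should not increase from one sweep to the next.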

The factorization process generates two perspectives of cluster × group matrices

H and Hc in lower dimensional space. This lower rank approximation of the higher

dimensional cluster × term matrix gives a dense representation compared to the original.

It is conjectured that the lower dimensional representation that has high

co-occurrences is able to battle the sparseness related issues in high dimensional

data clustering. CaCE forms a final cluster group matrix HF based on the max-

imum pairwise coefficient of H and Hc. This allows us to identify the similarity

of a cluster with groups compensating weaknesses in the learning process of each

single perspective.

\[
H_F = \max(H, H_c) \tag{8}
\]

The final cluster assignment vector hf is defined using the hard cluster assignment

policy. A cluster group that possesses the highest coefficient within HF is used

as the group for a specific cluster.

\[
h_f = \operatorname{argmax} \sum_{i=1}^{g} \left( H_{F(:,i)} \right) \tag{9}
\]
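
The combination in Eq. 8 and the hard assignment in Eq. 9 can be illustrated with a few lines of plain Python. The function name `assign_groups` and the toy matrices are hypothetical, and Eq. 9 is read here as a per-cluster argmax over the g group coefficients, which is an assumption about its intent.

```python
def assign_groups(H, Hc):
    """Combine two cluster x group matrices and pick one group per cluster.

    H, Hc: lists of rows, one row of g coefficients per cluster.
    """
    # Eq. 8: element-wise maximum of the two perspectives.
    HF = [[max(a, b) for a, b in zip(row_h, row_hc)]
          for row_h, row_hc in zip(H, Hc)]
    # Eq. 9, read as a per-cluster argmax over groups (assumption);
    # ties go to the lowest group index.
    return [max(range(len(row)), key=row.__getitem__) for row in HF]


H_toy = [[0.9, 0.1], [0.2, 0.3], [0.4, 0.4]]
Hc_toy = [[0.1, 0.2], [0.1, 0.8], [0.5, 0.1]]
print(assign_groups(H_toy, Hc_toy))
```

The third cluster shows the benefit of combining both perspectives: H alone is undecided between the two groups, while Hc tips the assignment.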

3.4 Cohesive cluster groups based on term density

The above matrix factorization process forces each cluster in a cluster solution

to be included in a cluster group. As a result, loosely connected clusters may

reside within a cluster group even though they share few terms with the others in

the group. To handle this, we propose the density concept that determines the

strength between a cluster and its associated cluster group considering the term

frequencies. More specifically, the density value of a cluster ci is defined as the

ratio of the term frequencies within the cluster using each term wj ∈ ci to the


maximum term frequency of the corresponding cluster group gz ∈ g, as follows.

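\[
\mathrm{Den}_{c_i} = \frac{\sum_{j=1}^{|w_{c_i}|} tf(w_j)}{\max\left[ \forall_{x=1}^{|g_z|}\, tf\left( \sum_{j=1}^{|w_{c_x}|} (w_j) \right) \right] \times |w_{c_i}|} \tag{10}
\]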

Density values that fall below the first quantile (‘mean - standard deviation’) within

a group indicate the clusters with the least densities. CaCE uses this threshold to

separate the loosely connected clusters ensuring uniform term distribution within

a group. This allows identification of a set of cohesive cluster groups over the

time. A cluster that receives the density value less than the set threshold is

considered ‘inconsistent’ and its density value is set to zero. A cluster with zero

density value is indicated as a new singleton cluster group within the visualization

step.
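
The density filter of this section can be sketched as below. The sketch rests on an interpretation of Eq. 10 as the average term frequency of a cluster divided by the largest term frequency observed in its cluster group. The function names, the input format and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def cluster_density(term_freqs, group_max_tf):
    """Eq. 10 read as: average term frequency of the cluster divided by the
    largest term frequency in its cluster group (assumption)."""
    return sum(term_freqs.values()) / (group_max_tf * len(term_freqs))


def filter_group(group):
    """group: dict cluster_id -> {term: frequency} for one cluster group.

    Returns dict cluster_id -> density, with loosely attached clusters set to 0.
    """
    group_max_tf = max(tf for freqs in group.values() for tf in freqs.values())
    density = {c: cluster_density(freqs, group_max_tf) for c, freqs in group.items()}
    values = list(density.values())
    # Threshold 'mean - standard deviation' of the densities in the group;
    # the population standard deviation is an assumption.
    threshold = mean(values) - pstdev(values)
    return {c: (d if d >= threshold else 0.0) for c, d in density.items()}


toy_group = {
    "c1": {"game": 10, "team": 8},
    "c2": {"game": 9, "team": 9},
    "c3": {"game": 1, "bike": 1},
}
print(filter_group(toy_group))
```

In this toy group the third cluster falls below the threshold, so its density is set to zero and it would be shown as a new singleton cluster group.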

3.5 Visualization of Cluster evolution with a k-partite graph

CaCE proposes to visualize all cluster dynamics including birth, death, split and

merge within a k-partite graph. The set of clusters within the cluster solution

in time ts is represented with the respective partite s and each distinct cluster

group across the k partites is uniquely identified with a color code. Each cluster in

a partite s > 1 of the graph is compared with each cluster in its predecessor

partite, and an edge is added between two clusters to mark them as similar if they belong

to the same group.

A cluster pair in two successive partites is eligible to have an edge between them

for being similar, if:

• they belong to the same group and either both of them possess zero density

values or both of them possess non-zero density values.



Figure 4: Algorithms of CaCE

In contrast, a cluster pair in two successive partites is not eligible to have an

edge between them, if:

• one of them has zero density value, though they belong to the same group.

The color code of the cluster with zero density is updated with a color not yet used

in the graph to separate it from the current cluster group. However, a cluster

pair in nonconsecutive partites is not considered to have an edge between them.


This process continues in an incremental manner to represent the cluster evolu-

tion spanned across time t1 to tk. The k-partite graph allows CaCE to identify

birth, death, as well as growing and decaying patterns in clusters, within the

period through colors and edges. Application of Definition 1 on the drawn edges

identifies the corresponding patterns:

• a cluster that appears in time ts (s ≤ k) that does not have any edge to

a cluster in time ts−1 marks the birth of that cluster, which represents an

emerging pattern,

• a cluster that appears in time ts (s < k) that does not have any edge to a

cluster in time ts+1 marks the death of that cluster,

• a cluster that appears in time ts (s < k) with multiple edges to clusters in

time ts+1 marks the split of that cluster showing a growth pattern,

• a cluster that appears in time ts (s ≤ k) with multiple edges to clusters in

time ts−1 marks the merge of that cluster showing a decay pattern,

• a cluster that is born in time ts (s < k) and continues across time with a single

edge to the succeeding time stamp ts+1 shows a persistent pattern.

This is further assisted by the colors to uniquely identify the similar clusters that

belong to the same group. Fig. 4 shows the algorithms of CaCE for discovering

the cluster evolution in a document corpus.
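
The edge rules of the k-partite graph can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_edges`, the dictionaries and the cluster ids are hypothetical, and it is not the algorithm listed in Fig. 4.

```python
def build_edges(partites, group, density):
    """Connect similar clusters in consecutive partites of the k-partite graph.

    partites: list of lists of cluster ids, one list per time stamp.
    group: dict cluster_id -> cluster-group label.
    density: dict cluster_id -> density value (0 marks an 'inconsistent' cluster).
    Returns a list of (earlier_id, later_id) edges.
    """
    edges = []
    # Only consecutive partites are compared; nonconsecutive pairs get no edge.
    for prev, curr in zip(partites, partites[1:]):
        for a in prev:
            for b in curr:
                same_group = group[a] == group[b]
                # Eligible only if both densities are zero or both are non-zero.
                same_density_state = (density[a] == 0) == (density[b] == 0)
                if same_group and same_density_state:
                    edges.append((a, b))
    return edges


toy_partites = [["a1", "b1"], ["a2", "b2"], ["a3"]]
toy_group = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
toy_density = {"a1": 0.8, "a2": 0.7, "a3": 0.9, "b1": 0.6, "b2": 0.0}
print(build_edges(toy_partites, toy_group, toy_density))
```

Counting the incoming and outgoing edges of each node in the result then gives the birth, death, split and merge states listed above.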


We evaluate three phases of CaCE to show its effectiveness. The quantita-

tive comparison against baselines using ground-truths evaluates the 1st phase


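Table 2: Summary of the datasets used for the experiments.

Name               | # of clusters for each time-stamp cluster solution | Ground truth evolution
-------------------|----------------------------------------------------|-----------------------
20Newsgroup (DS1)  | t0: 3, t1: 5, t2: 4                                | (diagram)
Patent (DS2)       | t0: 5, t1: 5, t2: 5                                | (diagram)
Health (DS3)       | t0: 5, t1: 5, t2: 5, t3: 5                         | (diagram)
Sports (DS4)       | t0: 4, t1: 3, t2: 2, t3: 4, t4: 4                  | (diagram)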

of CaCE, which uses inter-cluster association to measure the accuracy of cluster

group identification. The impact of 2nd phase with “density” in obtaining co-

hesive cluster groups for accurate cluster groups identification is evaluated with

and without using the density concept quantitatively. The 3rd phase, which shows

the evolution patterns of clusters through edges on the k-partite graph visualiza-

tion, is compared against baseline methods that are able to visualize the cluster

evolution qualitatively. Further, we compare the time efficiency and computa-

tional complexity of CaCE against different cluster group identification methods

as detailed in Section 4.2. Other different concepts used in the proposed method,

together with the parameters/thresholds, are analyzed in the sensitivity analysis

section. Finally, we conduct two case studies to qualitatively interpret the power

of CaCE in identifying cluster evolution in real-time data with the large number

of clusters that span across a larger period of time.


Datasets: We use two types of datasets with medium length text vectors (con-

taining < 150 terms on an average, i.e., DS1 and DS2) and short length text

vectors (containing < 50 characters on an average, i.e., DS3 and DS4). As shown

in Table 2, for each dataset, a few categories (or domains) spanned across the

time have been selected/created to have the ground truth information, in terms

of the number of clustering solutions and the number of clusters in each clustering

solution.

• For the 20Newsgroup dataset (DS1), we selected four categories (Social,

Talk, Recreational and Computer) and spread them across three time periods.

• For the Patent abstract dataset (DS2), four categories (Distributed Pro-

duction, Microbiota, Computer Vision and Block Chain) of abstracts were

collected during the three months of 2017 to make clusters.

• For the Health-related tweets (DS3), media posts sent to six disease-specific

Twitter groups (Diabetes, Mental, Kidney, Lung, Heart, Cancer) within a

four-year period (2014-2017) were selected to make clusters.

• For the Sport-related tweets (DS4), media posts sent to four sports-specific

Twitter groups (Cycling, Netball, Cricket and Soccer) within a five-year

period (2010-2014) were selected to make clusters.

These clusters were placed in such a way as to show emerging, persistent, growth

and decay patterns over time as in Table 2. We have made these datasets available

to researchers 1.

Baselines: Several benchmarking methods were used to evaluate the accuracy

of cluster group identification : (1) general NMF [103] on intra-cluster association

1 https://drive.google.com/open?id=1gHoEm-R9S2OkiN9LRVNk3JLVeGpRdWXn


matrix S; (2) the state-of-the-art clustering evolution method TextLuas [63] which

uses Jaccard coefficient to determine the cluster similarity within cluster pairs in

consecutive timestamps; and (3) a variation of CaCE (named as CaCE-CS) that

uses cosine similarity for an inter-cluster association matrix instead of SGNS

representation based on the number of overlapping terms. Additionally, the topic

evolution method proposed in [41] for social media with short text is used to

compare with CaCE in identifying the evolution patterns. Experiments were

done using Python 3.5 on a 1.2 GHz 64-bit processor with 16 GB of memory.

Evaluation Measures: The standard pairwise harmonic average of the preci-

sion and recall (F1-score) and Normalized Mutual Information (NMI) were used

as the evaluation measures to identify the quality of cluster groups [165]. Evolution

patterns of clusters, including emerging, persistent, decay and growth, indicated

through state changes, are automatically identified within the visualization

using top-frequent terms in each cluster.
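
For reference, the two measures can be computed from a pair of label lists as sketched below. The function names are hypothetical, and the normalization of NMI by the geometric mean of the two entropies is an assumption, since the text does not state which normalization is used.

```python
import math
from collections import Counter
from itertools import combinations


def pairwise_f1(true_labels, pred_labels):
    """Harmonic mean of pairwise precision and recall over all item pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def nmi(true_labels, pred_labels):
    """Mutual information normalized by sqrt(H(true) * H(pred)) (assumption)."""
    n = len(true_labels)
    ct, cp = Counter(true_labels), Counter(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    h_true = -sum(c / n * math.log(c / n) for c in ct.values())
    h_pred = -sum(c / n * math.log(c / n) for c in cp.values())
    if h_true == 0 or h_pred == 0:
        return 1.0 if h_true == h_pred else 0.0
    return mi / math.sqrt(h_true * h_pred)


truth = [0, 0, 1, 1]
print(pairwise_f1(truth, [1, 1, 0, 0]), nmi(truth, [1, 1, 0, 0]))
print(pairwise_f1(truth, [0, 0, 0, 1]))
```

Both measures ignore the actual label names, so a grouping that matches the ground truth up to renaming scores 1.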

4.1 Accuracy Analysis

Quantitative Interpretation

Results in Table 3 show that CaCE is able to produce higher accuracy in cluster

groups identification spanned across time/domain compared to all other methods

due to the use of inter-cluster association information in the matrix factorization

using the number of common terms with SGNS. Next in line is the modified

version, CaCE-CS, which uses cosine similarity to identify the inter-cluster

association using representative terms. This normalizes the similarity value to the 0-1

range and fails to maximize the probability of closely associated clusters as the original

CaCE does by using the number of overlapping terms with SGNS. Cosine


Table 3: Performance comparisons in identifying cluster groups accurately with different datasets, methods, and metrics

         |            F1-score             |               NMI
Dataset  | CaCE  CaCE-CS  NMF   TextLuas   | CaCE  CaCE-CS  NMF   TextLuas
---------|---------------------------------|--------------------------------
DS1      | 0.84  0.75     0.60  0.60       | 0.82  0.75     0.48  0.67
DS2      | 0.68  0.68     0.65  0.56       | 0.68  0.68     0.37  0.61
DS3      | 0.58  0.57     0.34  0.58       | 0.57  0.57     0.17  0.51
DS4      | 0.74  0.66     0.51  0.53       | 0.65  0.54     0.06  0.46
Average  | 0.71  0.67     0.53  0.57       | 0.68  0.64     0.27  0.56

similarity, which measures the cosine angle between vectors that represent the

clusters, is inferior in modeling inter-cluster association to cardinality of term

set intersection between clusters. TextLuas, which employs Jaccard similarity

coefficient based on the number of common terms in clusters, links the clusters

in consecutive time stamps if this goes beyond a threshold (set as 0.5). However,

this naive approach is inferior in identifying global evolution over the considered

period. The proposed NMF with intra- and inter-cluster associations used in

CaCE is able to accurately project the high dimensional term × cluster repre-

sentation into a lower dimensional space for identifying global cluster groups. In

contrast, when the original NMF is used on the term × cluster matrix, it is not able to capture

the cluster groups within the projected lower dimensional space, which results in

lower accuracy. This impact is worse when the number of clusters varies

significantly across cluster solutions, as in DS4 (shown in Table 2). As

shown by the results, CaCE is capable of handling both varying cluster numbers and

uniformly distributed clustering solutions over multiple time stamps.
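
The local, pairwise linking that the Jaccard-based baseline relies on can be pictured with the sketch below. It is a simplified stand-in written for illustration, not the TextLuas implementation: the function names, the input format and the toy clusters are hypothetical, and only the 0.5 threshold is taken from the text.

```python
def jaccard(a, b):
    """Jaccard coefficient of two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_consecutive(solutions, threshold=0.5):
    """Link clusters in consecutive time stamps whose Jaccard coefficient
    exceeds the threshold.

    solutions: list (one entry per time stamp) of dicts cluster_id -> term set.
    """
    links = []
    for prev, curr in zip(solutions, solutions[1:]):
        for a, terms_a in prev.items():
            for b, terms_b in curr.items():
                if jaccard(terms_a, terms_b) > threshold:
                    links.append((a, b))
    return links


toy_solutions = [
    {"g0": {"game", "team", "player", "hockey"}},
    {"g1": {"game", "team", "player", "baseball"}, "r1": {"jesus", "church"}},
]
print(link_consecutive(toy_solutions))
```

Each decision here looks at one pair of clusters in two adjacent time stamps, which is why such linking cannot account for groups that only become apparent over the whole period.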

Fig. 5(a) shows the impact of applying regularization to the objective function

in Eq. (4) for identifying cluster groups accurately. L1 regularization on W in

reconstruction error promotes sparsity in the factor matrix W , which represents

the term × cluster groups. This has been shown to be more effective for identify-

ing distinct cluster groups for all the datasets based on the representative terms


Figure 5: Impact of regularization and density concept

as depicted by higher F1-score and NMI in Fig. 5(a).

We also analyze the effectiveness of the term frequency based density concept

used in CaCE for identifying accurate cluster groups over the time. The density

defined as in Eq.10 based on the term frequencies is capable of filtering out loosely

attached clusters to a group. CaCE uses the density value with a threshold

(explained in the sensitivity analysis section) to separate the loosely connected

clusters by setting the less dense values to zero and forms new singleton cluster

groups from those clusters. This ensures uniform term distribution within a group

compared to CaCE that operates without this density-based filtering. In general,

this allows us to identify a set of cohesive cluster groups over the time as shown by

the improved performance in Fig. 5 (b). In dataset DS4, where the top frequent

terms of the clusters show fewer common terms, the density-based

filtering results in slightly poorer performance.


Figure 6: Visualization of Cluster Evolution in DS1 with CaCE (k-partite graph over time stamps t0, t1 and t2; each node is labeled with the top frequent terms of its cluster)

Qualitative Interpretation

Fig. 6 - Fig. 9 give insight into the evolution patterns obtained by CaCE, in which

similar clusters in a group share a unique color. We label each cluster with

its top 5-6 frequent terms to represent the included concept. According to the

derived evolution patterns in Fig. 6 for DS1, (1) a persistent cluster related to

‘computer technology’ appears in blue color, (2) there is decay in information

related to ‘games’ as revealed by merging of clusters in green color from t1 to t2

and (3) it also identifies another cluster group which is a mix with ‘religion’ and

‘war’ in yellow color, which shows both split and merge of clusters. This reveals

a growth pattern within t0 to t1 through the split while showing a decay pattern

within t1 to t2 through the merge. It also identifies a cluster in red color as a

separate isolated cluster group from the rest. It should be noted from the results in

Table 3 that although CaCE achieves the highest accuracy, it is not 100%.

It fails to identify the similarity of this cluster (marked as red) to the cluster

group ‘game’ (marked as green), which seems to be highly similar according to


[Figure 7: cluster evolution graph for the Patent dataset (DS2) over time stamps t0, t1 and t2; each node is labeled with the top frequent terms of its cluster]


the top-frequent terms of two clusters. However, an investigation of the cluster

vector shows that this cluster includes many other terms that are not part of the

(green colored) cluster and only these few terms are shared amongst the two.

Fig. 7 represents the cluster evolution identified in the Patent dataset (DS2). (1)

It shows that CaCE is able to capture the growth of ‘block chain’ related cluster

group in yellow color as revealed by their splits between t1 and t2. (2) It identifies

the ‘computer vision’ related cluster group in green color as a persistent pattern

within t0 to t2 which should have been shown as a decay of clusters according

to ground-truth. The top-terms within non-linked clusters show the evidence for

this deviated pattern as they show slight variations. (3) CaCE correctly identifies

the birth of the cluster in grey color showing an emerging pattern, which is under

the ‘Microbiota’ cluster group according to ground-truth. This Patent dataset

shows several related clusters as separate groups. A close investigation reveals

that these clusters are related, but contain several unrelated terms. Therefore,

CaCE identifies them as new groups with unique colors.


[Figure: cluster evolution graph over time stamps t0 - t3; clusters are labelled with their top frequent terms (e.g., heart, kidney, cancer, mental health and diabetes clusters)]


Table 3 shows that DS3 has the lowest performance in identifying cluster similarity

for the group formation. Visualization of the patterns obtained by CaCE is shown

in Fig. 8. (1) The growth of ‘mental’ health-related clusters in blue color is

identified similar to ground truth values, through the splits. (2) It identifies the

birth of ‘diabetes’ cluster with pink color in t3 as an emerging pattern. However, it

fails to identify exact evolution as per the ground-truth. (3) CaCE shows a mixed

group with different types of clusters in yellow color as a pattern that decays over

t0 to t2 through the merges. A closer investigation of top frequent terms reveals

that common terms are found in many diseases with high frequency in this group.

This prevents CaCE from recognizing the different cluster groups separately.

Fig. 9 shows the evolution of clusters in DS4 displayed by CaCE. (1) It correctly

identifies the ‘soccer’ cluster group in blue color, which is persistent over the

time through its continuous appearance in each consecutive time stamp. (2)

The growth pattern of ‘cricket’-related clusters in green color within t3 to t4

and the decay pattern of ‘cycling’-related clusters in yellow within t0 and t1 are


[Figure: cluster evolution graph over time stamps t0 - t4; clusters are labelled with their top frequent terms (e.g., soccer, cricket and cycling clusters)]


also partially identified as per the ground-truth in Table 2. The deviations

result from terms that differ between clusters, which lead CaCE to

treat these clusters as unmatched patterns.

Comparison with state-of-the-art TextLuas. Fig. 10 - Fig. 13 show the

visualization of cluster evolution given by TextLuas. In DS1, it fails to identify

the persistence pattern of ‘computer technology’ and the decay pattern of ‘games’

identified by CaCE. TextLuas, based on local evolution patterns between

cluster pairs in consecutive time stamps, is not capable of identifying these cluster

dynamics accurately.

Fig. 11 shows the evolution of clusters in DS2 according to TextLuas. It could

not identify the growth pattern of the cluster group ‘block chain’ or birth of ‘Mi-

crobiota’ in DS2 compared to CaCE. However, it identifies the decay pattern of

‘computer vision’, which CaCE identifies as a persistence pattern. A closer inves-

tigation on this pattern reveals that TextLuas identifies this cluster group mixing


Figure 10: Visualization of Cluster Evolution in DS1 with TextLuas


with the other groups. This shows the inability of simple Jaccard similarity-based

cluster comparison in identifying cluster groups accurately compared to CaCE,

which relies on both inter and intra cluster associations.



Fig. 12 and Fig. 13 show the cluster evolution pattern given by TextLuas for

DS3 and DS4. Both of them clearly show the mix of cluster groups compared to

CaCE. This confirms two facts: (1) the global cluster evolution patterns cannot be

accurately identified through local connection analysis; and (2) Jaccard similarity,

which relies on intra cluster similarity, is not sufficient in cluster association

identifications.
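To make the local comparison concrete, the following minimal sketch computes the Jaccard coefficient between the top-term sets of two clusters, which is the kind of term-intersection measure attributed to TextLuas above. The function name and the choice of returning 0.0 for two empty clusters are assumptions made for illustration; this is not the TextLuas implementation.

```python
def jaccard(terms_a, terms_b):
    """Jaccard coefficient between two collections of cluster terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # convention assumed here for two empty clusters
    return len(a & b) / len(a | b)

# Top terms of two clusters in consecutive time stamps (illustrative only).
c_t0 = ["image", "object", "set", "feature", "point", "plurality"]
c_t1 = ["image", "object", "signal", "feature", "plurality", "location"]
score = jaccard(c_t0, c_t1)
```

Because the score depends only on the terms the two clusters share, it carries no information about how either cluster relates to the remaining clusters in the corpus, which is consistent with the limitation noted above.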

Benchmarking with other techniques. We compare our visualization results

with a recent method for emerging topic detection for short text [41]. This method

uses a set of heuristics such as energy based on term frequencies to identify terms

that have become important in the current time period and then creates a directed

term-correlation graph and identifies the topics from the previous time window

that persist in the current time window. Iterative graph traversal in this method

is able to identify the topics that are emerging, and track them over time. Table

4 shows the emerging topics identified using topic words in [41]. In DS1, it

identifies the ‘talk’ related topic as an emerging topic in 20NewsGroup dataset



Table 4: Results obtained by [41]: Comparative Outcome

Dataset   Emerging topics
DS1       (1) people war fbi
DS2       (1) invention blockchain relates
DS3       (1) today sign reach (2) cardiac (3) womenshearts
DS4       (1) Clarke (2) Grella (3) career (4) Mr (5) mrcricket (6) Veteran BREAKING 2014

Note: As topic terms are derived from full tweet messages in DS3 and DS4, they include hash tags as well.

with the terms: people, war and fbi, while identifying ‘blockchain’ as an emerging

topic in DS2. In DS3, it is capable of identifying the ‘heart’ related topics while

identifying ‘cricket’ and ‘soccer’ related topics in DS4. However, this method

based on a graph theoretic temporal topic model [41] shows all the identified

topics as emerging topics for these datasets. In contrast, CaCE is able to identify

emergent, persistent and diminishing concepts, as depicted in Fig. 8 and Fig. 9.

In summary, CaCE shows higher performance in identifying similar clusters over

the period (i.e., correct cluster groups) as given in Table 3 as compared to bench-


Figure 14: Time taken by each method for identifying the evolution of clusters

marking methods. This confirms the superiority of CaCE in identifying evolution

patterns (i.e., persistence, growth and decay) globally, which rely on accurate

cluster group identification over the considered period. As revealed by results in

Table 3 and Fig. 6 - Fig. 9, CaCE misses some evolution patterns due to some

common terms appearing in many cluster groups and sub groups within cluster

groups. Having said that, CaCE is the first method that details the comprehensive

global evolution patterns with high accuracy and informs the lifecycle of the main

clusters (concepts) inherent in a corpus, as displayed across time/domain.

4.2 Efficiency and Complexity Analysis

Time comparison illustrated in Fig. 14 shows that the least time consumption

is by the traditional NMF, which considers a single matrix. It is obvious that

CaCE consumes more time than the traditional NMF due to the inclusion of

the additional inter-cluster association matrix. The modified version, CaCE-CS, consumes

much more time, due to the additional step of cosine similarity calculation between

clusters. The naive approach of calculating the Jaccard coefficient from the

term intersection in TextLuas also consumes less time on average. The higher

performance with 152% and 21% increase of average NMI in CaCE as per Table 3

compared to NMF and TextLuas respectively, well justifies the 2 - 6 times higher

consumption in time. The computational complexity of CaCE, which is based on

NMF, is O(n^2), where n is the number of clusters. Similarly, CaCE-CS possesses

the same computational complexity. However, the time consumption varies according

to the additional matrices and steps included in the approaches. TextLuas has a

linear computational complexity of O(rm), where r is the number of time stamps

and m ≤ n is the number of clusters in a generic time-stamp.


One of the strengths in CaCE is modelling the inter-cluster association matrix us-

ing Skip-Gram with Negative-Sampling (SGNS). Empirically, we validate this as

in Fig. 15, by comparing the matrix A modelled using only the number of overlapping

terms between clusters with the same association in A modelled with SGNS

based on probability. It shows that cluster associations modeled with SGNS,

which is able to predict the neighbors correctly, assists the sparse term × cluster

matrix factorization process in forming lower dimensional cluster × group matrix

accurately as depicted by the results. The inter-cluster association given with

the cluster × cluster association matrix using the number of overlapping terms

(without any weighting) between clusters gives lower performance as it fails to

boost the distinction between clusters. As an exception to the general results,

there is not much gain in DS4 by using SGNS where the number of clusters con-

siderably varies within time stamps. We conjecture that in this case, maximizing

probability of close cluster pairs is not making much difference due to the heterogeneous

nature of the inter-cluster association matrix that is formed with the


Figure 15: Effectiveness of modeling as SGNS

number of overlapping terms.
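The contrast between the two ways of building A can be sketched as follows. The first function counts overlapping terms between clusters; the second re-weights those counts with a shifted positive pointwise mutual information, which is a commonly used matrix view of SGNS. Treating SGNS this way, the shift parameter k, the zero diagonal and all names here are assumptions made for illustration, not the exact formulation used in CaCE.

```python
import math

def overlap_matrix(clusters):
    """Cluster x cluster matrix of raw overlapping-term counts (diagonal left at 0)."""
    sets = [set(c) for c in clusters]
    n = len(sets)
    return [[len(sets[i] & sets[j]) if i != j else 0 for j in range(n)]
            for i in range(n)]

def shifted_ppmi(counts, k=1):
    """Shifted positive PMI: max(log(c_ij * total / (row_i * col_j)) - log k, 0)."""
    n = len(counts)
    row = [sum(r) for r in counts]
    col = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    total = sum(row)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c = counts[i][j]
            if c > 0:
                pmi = math.log(c * total / (row[i] * col[j]))
                out[i][j] = max(pmi - math.log(k), 0.0)
    return out

# Hypothetical clusters, each given by a few top terms.
clusters = [
    ["transaction", "blockchain", "block", "network"],
    ["transaction", "blockchain", "key", "network"],
    ["image", "object", "feature", "camera"],
    ["image", "object", "video", "network"],
]
A_counts = overlap_matrix(clusters)
A_sgns = shifted_ppmi(A_counts, k=1)
```

In this toy input, the single generic term shared by the first and last clusters yields a count of 1 in the raw matrix but a weight of 0 after re-weighting, while the strongly overlapping pairs keep positive weights. This is one way to read the remark that unweighted counts fail to boost the distinction between clusters.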

The threshold used to determine the consistency of a cluster group using the

term frequency-based density is analyzed in Fig. 16 (a). It shows that a

density value less than ‘mean - standard deviation’ gives the best result in all

datasets other than DS3. Density values that fall within the first quantile (mean -

standard deviation) within a group indicate the clusters with the least densities. Thus,

this threshold is able to identify the less cohesive clusters in terms of density. Due

to the higher occurrence of terms common to many clusters, which act as noise,

the median shows the highest performance in DS3. CaCE uses ‘mean - standard

deviation’ as the default threshold value for determining uniform density in a

cluster group.
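The default threshold can be sketched in a few lines. The density values below are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption; the sketch only illustrates flagging clusters whose density falls below 'mean - standard deviation' within a group.

```python
import statistics

def inconsistent_clusters(densities):
    """Return ids of clusters whose density is below mean - standard deviation."""
    values = list(densities.values())
    threshold = statistics.mean(values) - statistics.pstdev(values)
    flagged = [cid for cid, d in densities.items() if d < threshold]
    return flagged, threshold

# Hypothetical term frequency-based densities of the clusters in one group.
group = {"c1": 0.42, "c2": 0.40, "c3": 0.45, "c4": 0.10, "c5": 0.43}
flagged, threshold = inconsistent_clusters(group)
```

Swapping the threshold for `statistics.median(values)` would give the variant that performs best on DS3 according to the analysis above.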

CaCE focuses on identifying four major cluster dynamics (i.e., birth, death, split

and merge). It is natural to identify the four groups of clusters aligning with

them, so the evolution patterns within the groups can be studied. However, in

order to empirically verify this number of cluster groups, we also experimented

with a different number of cluster groups. As shown in Fig. 16 (b), the maximum


Figure 16: Parameter sensitivity for density and number of cluster groups

performance is obtained with setting four as the desired group number for each

dataset, confirming the conjecture of CaCE.

4.4 Case Studies : Research and Job Trend Analysis

Case Study I. We conducted a case study using the DBLP-ACM publication

data (footnote 2) to confirm the capability of CaCE to accurately detect evolution patterns

over a considerably large period of time (10 time stamps). We consider the

DBLP-ACM bibliographic titles related to Data Science within the period of

1994-2003. The purpose of this case study was to display the effectiveness of

cluster evolution regardless of the primary clustering solution over a large time

period; we use traditional NMF for generating the primary cluster solutions with

three clusters per time stamp, and the interpretation of clusters is done with

the top-3 terms of each cluster. Fig. 17 depicts the discovered evolution patterns

in this dataset where clusters are represented with top-3 frequent terms.

Footnote 2: https://www.openicpsr.org/openicpsr/project/100843/version/V2/view


Figure 17: Cluster evolution illustrated using publication data

Fig. 17 shows that research on database query and language technologies

centered in 1994 to 1995 was at its peak (yellow color clusters, with the

growth pattern revealed by splits). There exist variations of database and man-

agement within this period such as data replication (in pink color) and emerging

pattern of large object databases (in green color) shown with the new-born clus-

ter. Later from 1995 to 1996, this group of clusters, which showed a growth

earlier, shows a decay through merges. Remarkably, it shows the re-emergence of

this concept in 2001 in the form of XML and web semantic languages.

From the period 1997 to 1998, database technologies with commercial applica-

tions, such as multidimensional databases and related querying algorithms in

green color, show decay revealed by the merging of those concepts. A special concept

of transaction management information in orange color born in 1997 is an

emerging pattern.

Data mining, which emerged in 1998, grows into distributed database architecture and

data warehouse concepts within 1999, which is depicted through the split in

red color clusters. These concepts, in combination with query processing, form


a separate cluster group in blue that shows the decay of clusters with merges

within 2000 to 2001. In 2003, CaCE captures the birth of a new cluster, query

optimization, deviating from the rest of the concepts, which is identified as an

emerging concept.

Generally, this case study shows the progression of data science from

simple database management to web/XML-based databases, through data mining

and warehousing, over those years.

Case Study II. The aim of the second case study is to show the capability of

CaCE in handling a larger number of clusters within a time stamp for the identification

of accurate cluster evolution. The study uses the online job posting data

on the Kaggle website (footnote 3) posted through the Armenian human resource portal ‘CareerCenter’.

We consider a subset of job postings, which span across 2004-2006, and

primary clusters in each clustering solution for each time stamp are obtained

using NMF by fixing the number of clusters to 10. Then CaCE is applied to

identify the global evolution of these clusters over the years. Fig. 18 depicts the

interesting evolution patterns revealed by CaCE for this dataset, where clusters

are represented with top-5 frequent terms. The study is able to reveal the de-

mand and changes in certain professions over the years and shows evolution of

necessary skills that are most frequently required by employers.

In general, it shows how the demand for administrative, coordination, sales

and software related job positions evolves over these years. The “administra-

tive/director positions” revealed by the yellow color cluster groups indicate the

changes in the scope and skills of the position over time. Over the years

2004-2006, it shows a growth pattern with splits between 2004-2005 and 2005-2006.

Director positions are posted for accounting/finance skills and program

Footnote 3: https://www.kaggle.com/madhab/jobposts


Figure 18: Cluster evolution illustrated using online job posting dataset

implementation skills separately in 2004. In contrast, director positions require

both of these skills in 2005 as qualifications. This again changed to different job

positions in 2006, as shown by the splits (i.e., consultant, finance officers, director

supervision, program coordinator, etc.).

The cluster group depicted by the green color shows a mix of “administrative

positions” and “software developer” positions in 2004. CaCE understandably

fails to separate them, due to terms used by both these groups such as ‘design’

and ‘implementation’. Thus, it shows as a decay pattern over the years 2004-2005

with the merge of clusters. The post “software developer” is persistent within

2005-2006. Furthermore, CaCE identifies a persistent pattern attached with jobs

related to area specific programs over 2004-2005 in red color. In 2005, it marks

the death of those positions.

CaCE identifies the demand for “customer care” positions in 2004 as marked by

the birth of pink color cluster. However, “customer sales” or “product sales”

related positions appear in 2005 with slight variations to the skills required as

a new position (i.e., emerging pattern) showing in the blue color cluster group.

Over the years 2005-2006, this shows a decay pattern with a change to necessary

skills (i.e., ability to handle social and international activities). Furthermore,

CaCE discloses two emerging positions in 2005 with the birth of “community

coordination” and “rural program supervision” related clusters in olive color and

orange color.

The general interesting observation in this cluster evolution is that the skills required by

a job position improve over time, while sometimes creating a subset of job

positions with specific skills. This case study confirms the effectiveness of CaCE in

identifying evolution in the presence of a considerably larger number of clusters

within a time-stamp.

5 Conclusion

This paper proposes a novel Non-negative Matrix Factorization-based method

(CaCE) to discover the evolution of clusters across time/domain using inter-

and intra-cluster associations. CaCE uses inter-cluster relations to assist

the matrix factorization process on the sparse term × cluster matrix. The

inter-cluster association matrix is built with overlapping terms between clusters

modeled with Skip-Gram with Negative-Sampling (SGNS). Further, we conjec-

ture that term frequency based density of clusters can be used to identify the

inconsistent clusters in these cluster groups and thereby we form tight cluster

groups. We then visualize the evolution of clusters using a k-partite graph over

the time considering the important cluster dynamics of birth, death, split and

merge through the identified cluster groups. An extensive experimental study

has been conducted with both qualitative and quantitative evaluation. Empirical

results obtained on several datasets, benchmarked against relevant methods,

show that CaCE discovers the emergence, persistence, growth and decay of clusters

with considerably higher accuracy. Extending this approach to be

independent of the clustering method used for each time stamp is left for future

investigation.

Chapter 6

Conclusion and Future Directions

The exponential growth of text collections creates the need to identify subgroups

and deviated documents within the corpus, as well as to track dynamic changes

in a corpus over the time or domain. Text mining leads to many applications

such as effective information retrieval, community detection, concept mining,

fake news detection, emerging concept detection and many more [45, 79, 106,

147, 150]. Text mining research is challenged by the high-dimensional nature

of the text and large collection sizes, which lead to poor accuracy, efficiency and

scalability issues in identifying similarity within text pairs. This thesis focuses

on unsupervised text mining methods using ranking concepts, effective density

estimation with ranking, matrix factorization, and matrix factorization-based

document expansion to minimize those challenges.

The main objective of this research work is to deal with the sparseness of text

representation, which results from higher-dimensional vector representation, to accurately

identify the text similarity/dissimilarity for finding the clusters, outliers

and dynamic changes of clusters. With this objective, the thesis proposes novel

algorithms for ranking-centered document clustering and outlier detection methods.

Furthermore, it presents a corpus-based document expansion for short text

clustering with NMF, and NMF-based subgroup identification with the assistance

of additional information to identify clusters and the cluster dynamics over

time.

6.1 Summary of Contributions

Based on the literature review detailed in Chapter 2, the following gaps were

identified.

1. Lack of alternative research approaches to find documents’ neighborhood,

such as IR ranking, for identifying text similarity in clustering, and lack of

text clustering methods dealing with short text documents.

2. Lack of research to solve the text outlier detection problem, especially in

the context of multiple classes of inliers. None of the existing works uses

term weighting-based ranking or IR ranking concepts in determining outlier

scores based on text dissimilarity.

3. Lack of research to explore the global text cluster evolution over the

time/domain to identify all the cluster states and patterns accurately deal-

ing with the higher dimensional vector representation based on the cluster

similarity.

This thesis aims to fill these research gaps, by developing effective approaches

to identify the text similarity in order to present accurate and novel unsupervised

text mining algorithms for clustering, outlier detection and cluster evolution

tracking. These methods effectively apply the ranking concept to identify

document neighborhoods and extend it to text-similarity/dissimilarity identification.

Methods proposed in this thesis follow the “cluster hypothesis” [91], which

states that linked sets of documents are relevant to the same request, and the

“reversed cluster hypothesis” [59], which theoretically proved that these documents

should occur in the same cluster. They confirm that an IR system internally

adheres to semantic relationships, as it is able to obtain document responses

belonging to the same group.

In comparison to a keyword-based search used with traditional IR systems, which

only consider syntactic similarity in obtaining relevant documents, methods pro-

posed in this thesis use document-driven queries that statistically represent the

whole document and are able to retrieve the relevant documents more accurately.

These relevant documents obtained via an IR system are explored for accurate

density estimation and to maintain the geometry structures among documents

for clustering. Further, the proposed methods use refinement steps to obtain the final

solutions: (1) incorporate hubs - a set of small groups with frequent neighbor

documents to identify the similarities or (2) model data using Skip-Gram with

Negative Sampling considering the context. These steps, employed together with

IR ranking, comply with semantic embedding and minimize issues with the syn-

tactic nature of VSM representation.

In addition, the use of inverse document frequency term ranking is exploited in the

thesis to define outliers, together with IR ranking responses and ranking scores.

As the other concept to effectively identify text similarity, the proposed methods

in the thesis effectively apply the matrix factorization and identify the groups in

lower-dimensional space. Clusters go through different stages of their life cycle

over time which is important to monitor in decision making. Therefore, this thesis

explores the global evolution of cluster dynamics in the higher dimensional text

with matrix factorization using additional relationships among clusters to avoid

the information loss. Also, the use of topic vectors and terms obtained through

matrix factorization for corpus-based document expansion is explored to solve

the extreme sparseness in short text.

Clustering: The first research contribution is presenting a set of novel text

clustering methods that identify text similarity accurately. The IR ranking re-

sponses are used in constructing a mutual neighbor graph through shared neigh-

bors for density estimation. Hubs, which are evident in higher dimensional data,

are identified with shared neighbor sets on the graph. Expensive hub similar-

ity calculation is efficiently performed using ranking scores provided by the IR

system to improve the performance of the density-based method. Similarly, in

another method, IR ranking as well as pairwise neighbors are used in constructing

affinity document matrices to represent the nearest neighbors that enforce geo-

metric structures. The consensus and complementary information enforced by

this neighboring information and the document representations are used to assist

higher to lower-dimensional projection in document clustering. These ranking

and neighborhood-based clustering approaches show higher accuracy compared

to state-of-the-art methods in handling sparse high dimensional data. The effec-

tiveness of these two methods is validated with real-world data that consists of

medium and short text vectors.

A corpus-based document expansion is implemented through NMF-based topic

modeling and using topic terms as virtual terms. This concept is used in commu-

nity detection and concept mining as short text clustering applications, where the

effectiveness is evaluated using social media datasets and forum data respectively.

Extrinsic, intrinsic measurements and case studies are used accordingly.

Outlier Detection: Secondly, the thesis presents a set of novel outlier detection

methods. It defines the outliers as dissimilar documents that show significant

difference through terms compared to a set of inlier groups. The inverse document

frequency weighting model, which ranks rare terms with high priority, is used to

define an outlier score for a document with the assumption that outliers contain

more rare terms. Also, IR ranking scores of relevant documents are used to

define an outlier score for a document. The ranking score, which informs the

similarity, is inversely used to identify how dissimilar/deviated that document is.

Additionally, within ranking responses of all the documents, the reverse neighbor

count is calculated to identify the hubs, and anti-hubs, which possess lower

k-occurrences, are proposed as outliers. The concept of an IR ranking-based mutual

neighbor graph, which forms uniform dense regions in a document collection

efficiently, is used to filter the outliers. These mutual graphs are used together

with the hubs that are found on the graph, to identify the documents that are

not part of the graph and/or are dissimilar to the hubs attached with the inlier

groups as outliers. All these methods are evaluated using real-world text data

covering all the vector sizes against the state-of-the-art methods. New evaluation

measures are introduced to calculate the prediction error of inliers and outliers,

which are able to categorize the effectiveness of an outlier detection method.
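The assumption that outliers contain more rare terms can be illustrated with a small sketch. Here the outlier score of a document is taken to be the average inverse document frequency of its distinct terms; this particular scoring rule, the function names and the toy documents are assumptions for illustration and should not be read as the exact score defined in the thesis.

```python
import math

def idf_table(docs):
    """idf(t) = log(N / df(t)) over a list of tokenized documents."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {t: math.log(n / c) for t, c in df.items()}

def outlier_score(doc, idf):
    """Average idf of the distinct terms of a document (higher = more rare terms)."""
    terms = set(doc)
    return sum(idf[t] for t in terms) / len(terms) if terms else 0.0

# Hypothetical collection: three related documents and one deviating document.
docs = [
    ["heart", "disease", "woman", "risk"],
    ["heart", "disease", "research", "risk"],
    ["heart", "woman", "disease", "research"],
    ["cricket", "wicket", "captain", "test"],
]
idf = idf_table(docs)
scores = [outlier_score(d, idf) for d in docs]
```

The last document shares no terms with the others, so all of its terms receive the maximum idf and it obtains the highest score in this toy collection.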

Cluster Evolution: The third contribution of this thesis is to present a method

that is able to track the dynamic evolution of text clusters over the time/domain

based on the cluster similarity. The potential of using matrix factorization-based

dimensionality reduction for identifying cluster groups within high dimensional

text cluster representation is explored. It uses inter-cluster association, represented

with SGNS modeling to highlight close cluster associations, to compensate for the

information loss in lower-dimensional projection. The thesis aims to identify

the global cluster evolution over the time/domain through these groups, and the

term frequency-based cohesiveness of cluster groups is used to filter loosely attached

clusters. The proposed method represents the evolution patterns through

a k-partite graph that spans across the k time stamps or domains. Real-world

data consisting of medium and short text vectors are used to evaluate

the performance of the proposed method, compared to state-of-the-art methods.

Quantitative evaluation is used for measuring the performance of identifying the

cluster groups and qualitative validation is done through visualization of evolu-

tion patterns with the cluster content. In addition, two case studies are used for

qualitative analysis.

6.2 Summary of Findings

This section discusses the main findings for the research questions presented in

Chapter 1. The neighborhood information obtained by the concepts of ranking

and matrix factorization-based projection has been found accurate in calculating

the similarity among text pairs; thus, it is able to deal with the high-dimensional

nature of the text representation and associated challenges. IR systems

have shown the ability to get relevant documents in response to a document-driven

query that statistically represents a document accurately and efficiently. The assistance

of neighborhood information has shown the ability to identify the similarity

among text pairs accurately with non-negative matrix factorization, minimizing its

associated information loss. Using these concepts in identifying clusters, outliers

and cluster evolution across time has resulted in effective outcomes as shown by

the empirical analyses in the previous three chapters.

6.2.1 Clustering

This section presents the findings for the first research question about finding

similarity in text corpus to identify the subgroups/clusters.

In response to the question: How can graph-based methods with rank-

ing be used for effective density estimation in sparse data, where den-

sity difference could not be used in identifying the subgroups?

IR systems can be used to effectively generate neighborhood information for doc-

uments in a collection by posing document-driven queries that could represent

the documents systematically. The IR-generated relevant documents of

document pairs, analyzed to identify their shared neighbors, can build

a shared/mutual nearest neighbor graph more accurately and efficiently than k-NN

analysis of document pairs. This IR-based shared/mutual nearest neighbor graph

can give a dense representation for sparse text data where, generally, density-based

methods fail. A core dense point on the graph is identified when a minimum of

three documents are connected to it, each sharing at

least three neighbor documents with it; this effectively captures the minimum requirement

for being a dense point in sparse text data. Expanding the boundaries of the

core points on the graph is able to accurately identify the varying dense patches

respective to the subgroups in text data.
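The graph construction described above can be sketched as follows, under stated assumptions: `neighbors` is a hypothetical mapping from each document to the documents returned by the IR system for its query, two documents are linked when each retrieves the other and they share at least three retrieved documents, and a core point is a document with at least three such links. The exact linking rule used in the thesis may differ in detail.

```python
def shared_neighbor_graph(neighbors, min_shared=3):
    """Link documents that retrieve each other and share >= min_shared neighbors."""
    sets = {d: set(n) for d, n in neighbors.items()}
    graph = {d: set() for d in sets}
    for a, retrieved in sets.items():
        # Only pairs that occur in a ranked response are examined,
        # so no all-pairs comparison is required.
        for b in retrieved:
            if b in sets and a in sets[b] and len(retrieved & sets[b]) >= min_shared:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def core_points(graph, min_links=3):
    """Documents connected to at least min_links other documents."""
    return sorted(d for d, linked in graph.items() if len(linked) >= min_links)

# Hypothetical IR responses: d1-d5 form a dense patch, d6 and d7 do not.
neighbors = {
    "d1": ["d2", "d3", "d4", "d5"],
    "d2": ["d1", "d3", "d4", "d5"],
    "d3": ["d1", "d2", "d4", "d5"],
    "d4": ["d1", "d2", "d3", "d5"],
    "d5": ["d1", "d2", "d3", "d4"],
    "d6": ["d1", "d2", "d7"],
    "d7": ["d6", "d1", "d2"],
}
graph = shared_neighbor_graph(neighbors)
cores = core_points(graph)
```

Documents d6 and d7 retrieve each other but share only two neighbors, so they stay unlinked and would be candidates for later assignment or outlier handling.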


In response to the question: Instead of expensive pairwise compar-

isons, how can IR ranking-based neighbors be employed to identify

the subgroups?

A ranking function employed in an IR system can be used to accurately and effi-

ciently find the relevant documents from a document collection organized in the

form of an inverted index data structure for a given document query [173]. In

comparison to other ranking functions available in IR systems, the tf∗idf function,

which considers both common and rare terms that appear in the collection, is

found effective in similarity identification. The nature of the document queries

that represent the documents is found to be another important factor for identifying

neighbors accurately with IR systems. The terms with high frequencies in a

document show the ability to accurately identify the relevant documents. Generally,

the top-10 terms are found effective in identifying relevant documents. The accuracy

can be further improved by choosing the query size depending on the characteristics

of the document collection: comparatively larger queries can be used for

collections that have larger text vectors on average, and vice versa.
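A document-driven query over an inverted index can be illustrated with the short sketch below. It is a simplified stand-in for a real IR system: the whitespace tokenizer, the function names, and the summed tf∗idf score are assumptions made for illustration, and production ranking functions usually add length normalization.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def document_query(text, size=10):
    """Use the `size` most frequent terms of a document as its query."""
    return [term for term, _ in Counter(text.lower().split()).most_common(size)]

def rank(query, index, n_docs, exclude=None):
    """Score documents by summed tf*idf over the query terms (one simple variant)."""
    scores = defaultdict(float)
    for term in query:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            if doc_id != exclude:
                scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "d1": "matrix factorization text clustering matrix",
    "d2": "matrix factorization for topic models",
    "d3": "football match results and football news",
}
index = build_index(docs)
results = rank(document_query(docs["d1"], size=3), index, len(docs), exclude="d1")
```

Only the postings of the query terms are visited, which is why this lookup avoids comparing the query document against every other document.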

In addition to effectively forming the mutual neighbor graphs, the relevant docu-

ments obtained in this way from an IR system can be used to efficiently identify

the hubs - the frequent nearest neighbors in higher dimensionality. In the mutual

neighbor graphs generated with IR ranking results, the attached set of shared

neighborhoods can be used as the multiple hubs without any additional

calculations. They can be used to assign cluster labels to unclustered documents,

based on maximum relevancy/affinity. This affinity value for each hub can be

efficiently calculated using the ranking scores of the included documents, which were

obtained a priori when the document was posed as the query to form neighborhoods.

Using IR ranking responses and ranking scores in this way to identify the subgroups

in text data is accurate and efficient compared to expensive pairwise comparisons.
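The hub-based label assignment can be sketched as below. The aggregation is an assumption: affinity is taken here as the sum of the stored ranking scores of the hub members that appear in a document's own ranked results, and the function and variable names are hypothetical.

```python
def assign_by_hub_affinity(ranking_scores, hubs):
    """Return the label of the hub with the highest affinity, or None if no hub matches."""
    best_label, best_affinity = None, 0.0
    for label, members in hubs.items():
        # Reuse scores obtained when the document was posed as a query.
        affinity = sum(score for doc, score in ranking_scores.items() if doc in members)
        if affinity > best_affinity:
            best_label, best_affinity = label, affinity
    return best_label

hubs = {"cluster_a": {"d1", "d2", "d3"}, "cluster_b": {"d7", "d8"}}
scores_for_d9 = {"d2": 1.4, "d3": 0.9, "d7": 1.1}  # ranking scores stored for d9
label = assign_by_hub_affinity(scores_for_d9, hubs)
```

No new similarity computation is needed at this step, since every score was already produced during neighborhood formation.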

In response to the question: How can associated information loss be

minimised in matrix factorization to approximate the lower rank fac-

tors and to identify subgroups?

The neighborhood information that can preserve geometric structures within data

points is found effective in minimizing the information loss in NMF while project-

ing data from higher to lower order. The neighborhood information generated

with both pairwise comparisons (i.e., local neighborhoods) as well as IR ranking

responses (i.e., global neighborhoods) through document affinity matrices can

accurately assist the factorization of a document-term matrix. In modeling these

affinity matrices, the SGNS modeling technique is found more effective than binary

modeling in highlighting the document pairs that show a higher presence with

respect to any neighborhood, giving higher accuracy. This use of both local and

global nearest neighbors can handle datasets with different sizes, scales, and

densities accurately. Moreover, considering the common and specific (i.e., consensus

and complementary) information given by the document-term matrix as well as the

neighborhood affinity matrices in the factorization process is found more accurate

in identifying subgroups than using either of them alone.
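The difference between SGNS-style and binary affinity modeling can be illustrated as follows. The sketch assumes that SGNS modeling is realised as a shifted positive pointwise mutual information weighting, the quantity that skip-gram with negative sampling is known to factorize implicitly; the exact formulation used in the thesis may differ, and `counts` is a hypothetical matrix of how often two documents occur in the same neighborhood.

```python
import math

def sgns_affinity(counts, negative_samples=1):
    """Shifted positive PMI over a symmetric co-occurrence count matrix."""
    n = len(counts)
    row = [sum(r) for r in counts]
    total = sum(row)
    shift = math.log(negative_samples)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if counts[i][j] > 0:
                pmi = math.log(counts[i][j] * total / (row[i] * row[j]))
                affinity[i][j] = max(0.0, pmi - shift)
    return affinity

def binary_affinity(counts):
    """1 when two documents co-occur in any neighborhood, else 0."""
    return [[1.0 if c > 0 else 0.0 for c in r] for r in counts]

counts = [
    [0, 4, 1],
    [4, 0, 1],
    [1, 1, 0],
]
weighted = sgns_affinity(counts)
binary = binary_affinity(counts)
```

With these toy counts the frequently co-occurring pair (0, 1) receives a larger weight than the pair (0, 2), whereas the binary model treats both pairs identically.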

Additionally, for short text, NMF is found to be an accurate and efficient

approach for identifying virtual terms for document expansion. Topic vectors

identified using NMF can capture the topics represented within a short document

collection accurately, due to its alignment with the natural non-negativity of text

representation. Including the highly probable terms of the corresponding topic

derived with NMF as virtual words for a short document can minimize the extreme

sparseness, in alignment with the semantic structure of the corpus, when identifying

the subgroups. This corpus-based document expansion was found accurate and

efficient in identifying subgroups in short text, compared to external source-based

expansion.
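A minimal sketch of corpus-based expansion is given below, assuming NumPy is available. It uses plain multiplicative-update NMF rather than the specific factorization of the thesis, picks the dominant topic of a document from its row in the document factor, and appends that topic's top terms as virtual words; the function names, the toy vocabulary, and the number of virtual terms are illustrative assumptions.

```python
import numpy as np

def nmf(X, rank, iters=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF: X (docs x terms) ~ W (docs x rank) @ H (rank x terms)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + eps
    H = rng.random((rank, X.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def expand_document(terms, doc_row, W, H, vocab, n_virtual=2):
    """Append the top terms of the document's dominant topic as virtual words."""
    topic = int(np.argmax(W[doc_row]))
    ranked = [vocab[j] for j in np.argsort(-H[topic])]
    virtual = [t for t in ranked if t not in terms][:n_virtual]
    return terms + virtual

vocab = ["goal", "match", "team", "vote", "party", "election"]
X = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
], dtype=float)
W, H = nmf(X, rank=2)
doc0 = [vocab[j] for j in np.flatnonzero(X[0])]
expanded = expand_document(doc0, 0, W, H, vocab)
```

The expanded document keeps its original terms and gains terms drawn from the collection itself, so no external knowledge source is consulted.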

Ultimately, among the proposed text clustering methods, the graph-based method

with ranking and density concepts gives better performance for fine-grained

clustering compared to the matrix factorization-based method. The latter category

of methods works with smaller cluster numbers, which are used as the lower rank

for factorization.

6.2.2 Outlier Detection

This section presents the findings in response to the second research question,

about identifying outliers.

In response to the question: How can the concept of ranking and

density used in identifying text similarity be extended in identifying

outliers in a text collection?

Primarily, term ranking based on inverse document frequency shows higher

effectiveness in identifying outliers by considering text dissimilarity. This simple

concept was found memory and time efficient for all types of text vectors.

IR-based relevant neighbors and the associated relevancy score, which indicates the

level of similarity, can also be used to provide outlier scores in a scalable and

efficient manner. The inverse of the average relevancy score of the relevant

documents can accurately define an outlier score for a document. This is efficient

and scalable compared to baselines, due to the direct use of efficient IR systems.

Additionally, the reverse neighbor (k-occurrences) count of documents within ranking

responses indicates the hubness of documents, which can accurately differentiate the

outlier documents when considering anti-hubs. This is found especially accurate

for short text, which shows too few word co-occurrences to identify

text similarity/dissimilarity by other concepts.
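The two ranking-based outlier signals described above can be sketched in a few lines. The names are hypothetical, the ranked lists and scores are toy values, and treating a document with no retrieved neighbors as maximally outlying is an assumption of this sketch.

```python
from collections import Counter

def relevancy_outlier_score(result_scores):
    """Inverse of the average relevancy score of a document's retrieved neighbors."""
    total = sum(result_scores)
    if total <= 0:
        return float("inf")  # assumption: nothing relevant retrieved, maximally outlying
    return len(result_scores) / total

def k_occurrences(ranked_lists):
    """Reverse-neighbor count: how often each document appears in others' results."""
    counts = Counter({doc: 0 for doc in ranked_lists})
    for doc, results in ranked_lists.items():
        counts.update(r for r in results if r != doc)
    return counts

ranked_lists = {
    "d1": ["d2", "d3"],
    "d2": ["d1", "d3"],
    "d3": ["d1", "d2"],
    "d4": ["d1"],
}
occurrences = k_occurrences(ranked_lists)
anti_hubs = [doc for doc, count in occurrences.items() if count == 0]
```

Here d4 never appears in another document's results, so it is the only anti-hub, while a document whose neighbors score low on average receives a high relevancy-based outlier score.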

In addition, the IR ranking-based mutual neighbor graph and the density estimation

process on the graph that was used for subgroup identification can be used

accurately to identify the outliers that are not attached to the subgroups. Large

and medium-size text vectors show higher word co-occurrences than short text, so

a higher portion of documents become mutual neighbors in the graph; here, the

algorithm can accurately and efficiently identify outliers directly as the documents

that are not included in the dense mutual neighbor graph. Short text documents

show fewer term co-occurrences among them, so only a few documents are included

in the mutual neighbor graph; here, the multiple hubs identified on the graph with

the density estimation can be used to refine the inliers. This algorithm is found

accurate in identifying outliers dissimilar to the hubs using the previously

calculated relevancy scores, without compromising efficiency.

Ensemble methods proposed by combining term ranking and IR ranking-based

algorithms sequentially or independently are found accurate, with fewer false

positives. Among them, ensemble methods that combine ranking function score-based

outlier detection with term ranking, sequentially and independently, are found

efficient compared to the k-occurrences count-based sequential method or the

graph-based method, due to the direct use of the ranking score for identifying

deviations/dissimilarities. However, all these approaches were found time and memory

efficient compared to existing baselines for all types of text vectors. In

particular, these ranking-based methods were found accurate and efficient for larger

text vectors, where many methods fail.

These inverse document frequency-based outlier ranking and IR ranking concepts

result in the methods summarised in Table 4.1. OIDF, which uses only inverse

document frequency-based outlier ranking, is efficient for document collections with

large text vectors. In contrast, ORFS and ORNC, which combine term frequency-based

ranking with IR ranking, are better in accuracy compared to OIDF. For datasets with

short text vectors, k-occurrences-based ORNC as well as graph-based ORDG show higher

accuracy. However, ORDG, which uses hub-based inlier filtering, is superior among

them.

6.2.3 Cluster Evolution

This section presents the findings in response to the third question, about iden-

tifying cluster evolution.

In response to the question: How can the matrix decomposition and

identified factors be used to understand the cluster similarity and

changing dynamics of text clusters in text collections?

The proposed CaCE method for cluster evolution identification accurately and

efficiently identifies the lower rank groups in the clusters and associated terms,

with the assistance of cluster association information using NMF from high di-

mensional text cluster representations. This additional assistance can minimize

the information loss in NMF. This inter-cluster association information, modeled

using the SGNS technique, maximizes the probability of closely associated cluster

pairs while minimizing loosely associated pairs and thereby improves the accuracy

of cluster group identification. Moreover, CaCE identifies cohesive cluster groups

by enforcing a consistent term density distribution across the group, which is found

effective in cluster dynamic identification. The final cluster groups obtained this

way can effectively capture the similarity in the clusters across the time/domain

globally, compared to identifying similarity among consecutive time stamps. The

changing dynamics of the clusters can be completely identified by linking clusters

in these cohesive groups via a k-partite graph that visualizes the cluster span across

k domains. Due to this visualization, CaCE can show the full cluster lifecycle states

(birth, death, split and merge of clusters) and all the evolution patterns (emergence,

persistence, growth, and decay) across time/domain, rather than only a few cluster

dynamics. Birth can indicate the emergence pattern of a cluster, split can indicate

the growth pattern of a cluster, merge and death can indicate the decay pattern of

a cluster, and consistent appearance across the time/domain indicates the persistence

pattern of a cluster.
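Reading lifecycle states off the links between two adjacent layers of such a graph can be sketched as follows. The rules are an illustrative assumption rather than the CaCE procedure itself: a cluster with no outgoing link is taken as a death, one with no incoming link as a birth, several outgoing links as a split, several incoming links as a merge, and a one-to-one link as persistence. Cluster names are assumed to be unique across the two layers.

```python
from collections import defaultdict

def lifecycle_states(prev_clusters, next_clusters, links):
    """Classify clusters from the links between two adjacent time/domain layers."""
    out_links, in_links = defaultdict(set), defaultdict(set)
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)
    states = {}
    for cluster in prev_clusters:
        if not out_links[cluster]:
            states[cluster] = "death"
        elif len(out_links[cluster]) > 1:
            states[cluster] = "split"
    for cluster in next_clusters:
        if not in_links[cluster]:
            states[cluster] = "birth"
        elif len(in_links[cluster]) > 1:
            states[cluster] = "merge"
    for src, dst in links:
        if len(out_links[src]) == 1 and len(in_links[dst]) == 1:
            states[src] = states[dst] = "persist"
    return states

prev_layer = ["a1", "a2", "a3", "a4", "a5"]
next_layer = ["b1", "b2", "b3", "b4", "b5"]
links = [("a1", "b1"), ("a2", "b2"), ("a2", "b3"), ("a3", "b4"), ("a4", "b4")]
states = lifecycle_states(prev_layer, next_layer, links)
```

Applying the same function to every pair of adjacent layers covers all k domains; the evolution patterns then follow from the states as described in the text.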

6.3 Future Work

This thesis presents novel clustering, outlier detection, and cluster evolution

methods with effective text similarity identification techniques. There are various

improvements that can directly be applied to the proposed methods and exten-

sions that can apply to them for solving other related problems. These potential

future research directions are presented in this section.

6.3.1 Stream mining

All the clustering, outlier detection and evolution tracking methods proposed in

the thesis, focus on the static document collections in identifying groups with

similar documents, deviated documents and evolutionary patterns. However,

popular online social media streams such as Facebook, Twitter, and LinkedIn, with

their frequent updates, would benefit from dynamic stream mining methods. Extending

the proposed concepts to a context with limited computing and storage capabilities,

where clusters, outliers, and cluster dynamics need to be identified from a rapidly

arriving continuous stream of text, is worth studying in the future.

6.3.2 Community discovery considering both structure

and content information

The community discovery problem is addressed in the thesis with document

expansion, considering the text messages that show what users communicate. However,

how users are connected, as shown through the network representation, provides a

useful piece of information that could be used to improve the community detection

methods. We have conducted a pilot study to explore the use of an NMF approach,

with the assistance of additional information that is used in subgroup

identification, for this community detection problem. The results of this study

were presented as a poster at the Hopper Down Under event1, showing that the

assistance of structural/network information for the NMF applied on the content

is able to improve the accuracy of community detection in many cases. However, in

a dense network representation, combining that information with content shows

inferior performance, highlighting the requirement of treating content and structure

with different weights/importance according to the nature of the data. It would be

interesting to study how this information could be combined with different weights

to detect communities with improved results, rather than using a single piece of

information.

1https://community.anitab.org/event/hopper-down-under/


6.3.3 Deep learning

Deep learning has become extremely popular, with successful applications in im-

age processing and other machine learning research [183]. It is used in short

text clustering literature as a feature learning technique [189]. However, it has

the bottleneck of requiring ground-truth related information to guide the training

process. One of the works uses retweets or hashtags as links that must hold to guide

the training process without direct ground truth [188]. The use of IR ranking

responses to guide this process could be investigated as future work. This would

eliminate the need for a supervised or semi-supervised approach, as the optimal

clustering framework has shown that “documents relevant to the same queries

should occur in the same cluster” [59].

6.3.4 Short text clustering

This thesis proposes document expansion based on self-corpus for short text clus-

tering to minimize extreme sparseness. With the interesting observation of higher

performance gained through the IR ranking concepts for short text clustering and

outlier detection, it would be useful to apply these concepts in document expan-

sion. Future work can explore the use of IR ranking neighborhoods to derive

the virtual terms for the expansion. It requires investigation of how to select the

neighbors to use for this expansion, and the role of frequent neighbors or higher

k-occurring neighbors in this context.


6.3.5 Soft clustering

The methods proposed in this thesis use hard clustering, in line with the datasets

used and their ground truth labels. However, real-world text documents, such as

text in social media, tend to belong to multiple groups. It would be interesting to

study how to identify the cluster labels for text that belongs to multiple clusters,

and to evaluate the accuracy of the methods.

6.3.6 Complete text mining framework

This thesis proposes a set of methods for text clustering, outlier detection and

cluster evolution identification, evaluated with a few different datasets. Future work

can explore the applicability of these methods for all types of datasets (i.e., short,

medium and long text vectors) in detail and propose solutions to deal with any

type of text vector. It would result in a complete text mining framework that finds

clusters, outliers and evolution patterns for any given dataset.

6.3.7 Pre-trained models for document representation

The methods proposed in the thesis used the VSM-based representation to statis-

tically model documents. However, word-embedding based document representa-

tion is known to provide a dense vector representation for sparse text efficiently

considering semantic similarities [115]. It would be interesting to know how these

techniques for document representation and document query representation can

improve the proposed text mining methods, considering semantic embedding.

Appendix A

Case Studies

National Senior Communities

The results of this case study, which applies the method proposed in [139], were

presented at the official launch of the QUT Digital Observatory

(https://www.qut.edu.au/institute-for-future-environments/facilities/digital-observatory),

and the related video can be found at https://www.youtube.com/watch?v=BgoJ495X5so.

QSuper communities

QSuper is an Australian superannuation fund based in Brisbane, Australia. This

case study also uses the method proposed in [139] to understand concerns of users

regarding QSuper. The dataset used for the study was also obtained from the QUT

Digital Observatory and includes 1091 tweets from Australian Twitter accounts

that contain the ‘qsuper’ keyword.


Figure 1: Word Cloud for total tweets obtained from “qsuper”

Fig. 1 shows the word cloud2 generated for the entire tweet dataset. It can be

noted that the users talk about multiple things related to benefits, their families,

members, retirement issues as well as errors in the system, which could not be

captured separately.

The experiments are done with different values of α in [139]. The most meaningful

communities are given for α = 1, which generates 8 communities as shown in Fig. 2.

The first community of users talks generally about superannuation and Unisuper

news. The second community focuses on members of QSuper and related

facts. The focus of the third community is mainly on the funds as an asset

or pension. Community 4 talks about women and qbr2018 teams. The fifth and

sixth communities focus on investments and awards respectively. Community 7

is about administrative errors related to QSuper. The last community of

users generally talks about the Brisbane, Queensland community.

2The Voyant Tool [170] is used to illustrate the word clouds.


Figure 2: Word Cloud for derived communities from “qsuper” (panels: Community 1 to Community 8)

Appendix B

Matrix Factorization for Community Detection

using a Coupled Matrix

This case study explores the applicability of the NMF method proposed in CaCE

(Chapter 5) to the social media community detection problem. Instead of the

inter and intra cluster association matrices, the user×user network structure and

user×term content matrices are used. The datasets include the tweets and

retweet interactions collected from the QUT Digital Observatory3. The results of

this pilot study were presented as a poster at the Hopper Down Under conference

(https://community.anitab.org/event/hopper-down-under/).

1 Introduction

Social media platforms are a popular networking mechanism for people which

allow them to disseminate information and assemble social views based on short-

text communication [79]. Community detection in these platforms has been found

useful in identifying the groups of users with common interests. It creates

opportunities for political parties, businesses, and government organizations to target

certain user groups for their campaigns, customized programs and events [88, 147].

3https://www.qut.edu.au/institute-for-future-environments/facilities/digital-observatory

Two popular unsupervised learning methods to discover communities are network

analysis using graph partitioning [26, 161], and content analysis using clustering

and topic modeling [139, 147]. Network analysis methods, which group users

based on their connections, are challenged by the sparseness of the network

and the heterogeneity of the interactions. Content analysis methods, which

group users based on their written posts, produce inferior outcomes due to the

curse of dimensionality in text vectors [3]. In this paper, we propose a novel

approach, named CS-NMF, that utilizes both types of data using coupled

matrix factorization in a fully unsupervised manner.

CS-NMF learns the consensus user-community matrix using Non-negative Matrix

Factorization (NMF) by coupling the high-dimensional content and structure

related matrices iteratively. We empirically evaluate CS-NMF using three Twitter

datasets and benchmark it against state-of-the-art clustering and network analysis

methods. Results show that coupling the complementary information generated

by both structure and content data can minimize the issues raised by sparseness

when each is used separately.

2 Problem and Motivation

Community detection is a well-studied research area with graph-based models,

where the network structure is analyzed to see how users are connected through

social media. In contrast, there are methods that consider what users communicate

via text messages in community discovery. However, all these methods become

ineffective due to sparseness in the structural and textual representations.

Furthermore, discovering communities in a fully unsupervised manner is an essential

requirement in many real-world applications. Disseminating information related

to sales promotions, political campaigns and any special event or program needs

the identification of an interested group of users where prior knowledge on the

group is unavailable. In this paper, we explore how to overcome the sparsity

associated with social media data to obtain accurate communities, and how to

incorporate structure with content for community detection in an unsupervised

setting.

3 Related Work

Community detection is usually done via two means: (1) network analysis and,

(2) content analysis. A larger proportion of research explores the connectedness

in the user interaction network through graph-based models for community

identification [26, 161]. However, this network representation is sparse and complex

to analyze. Users who belong to a common group make connections with different

groups based on friendship, creating heterogeneous networks.

Content analysis, which relies on the text messages written by users for

communication, identifies similar users based on what they share [79, 147].

Generally, text mining faces the curse of dimensionality due to high dimensionality.

In high-dimensional data, the distance difference between near and far points

becomes negligible [3] and many state-of-the-art clustering methods fail to identify

communities accurately. Additionally, social media text is short in length, which

causes extreme sparseness in the data with a lack of word co-occurrences [79].

A handful of studies have attempted to enrich the outcome of community

detection with both content and structural data. Additional information available

with social media such as URLs and hashtags is incorporated with network repre-

sentation to identify users with similar interests [119]. A few researchers use text

messages together with network structures in learning user communities [153].

However, they require label information fed as input to accurately detect com-

munities. We propose to use a coupled matrix combining content and structure

generated by NMF to accurately represent the communities in an unsupervised

setting.

4 Approach

Let there be N users to be assigned to G communities. Let $S \in \mathbb{R}^{N \times N}$ denote

the user interaction matrix, with each cell representing the number of interactions

between the corresponding two users. Let $C \in \mathbb{R}^{M \times N}$ denote the user content

matrix, where the short text messages written by the N users consist of M distinct

terms. The proposed CS-NMF takes the normalized matrices as input to

the NMF process and iteratively attempts to learn an optimum coupled matrix

representing user community assignment as a factor matrix in a novel fashion.

Thereby, each user is assigned to a community using both structure and content

information in an unsupervised setting.

CS-NMF

The proposed CS-NMF factorizes the high dimensional content matrix C into

two factor matrices $W \in \mathbb{R}^{M \times G}$ and $H \in \mathbb{R}^{N \times G}$, where G is the number of

communities. It simultaneously identifies H and $H_c \in \mathbb{R}^{N \times G}$ as the lower rank

matrices for S. It learns the coupled matrix H iteratively by minimizing the

learning errors in the factorization of matrices S and C as follows.

$$\min_{W, H \ge 0} \lVert C - W H^{T} \rVert_F + \min_{H, H_c \ge 0} \lVert S - H H_c^{T} \rVert_F \tag{1}$$

We update each matrix W, H and $H_c$ sequentially for each community index

$g' \in \{1, \dots, G\}$ within each iteration as follows.

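$$W_{(:,g')} \leftarrow \left[ W_{(:,g')} + \frac{(C H)_{(:,g')} - (W H^{T} H)_{(:,g')}}{(H^{T} H)_{(g',g')}} \right] \tag{2}$$

$$H_{(:,g')} \leftarrow \left[ H_{(:,g')} + \frac{(C^{T} W)_{(:,g')} + (S H_c)_{(:,g')}}{(W^{T} W)_{(g',g')} + (H_c^{T} H_c)_{(g',g')}} - \frac{(H H_c^{T} H_c)_{(:,g')} + (H W^{T} W)_{(:,g')}}{(W^{T} W)_{(g',g')} + (H_c^{T} H_c)_{(g',g')}} \right] \tag{3}$$

$$H_{c\,(:,g')} \leftarrow \left[ H_{c\,(:,g')} + \frac{(S H)_{(:,g')} - (H_c H^{T} H)_{(:,g')}}{(H^{T} H)_{(g',g')}} \right] \tag{4}$$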

CS-NMF is able to effectively use the complementary information available in

user-communicated text messages and interactions, as the learning process for the

coupled matrix (H) incorporates both C and S.
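The column-wise updates can be sketched in NumPy as below. This is a minimal sketch under stated assumptions, not the implementation behind the reported results: H and Hc are stored as N×G matrices so that the products W Hᵀ and H Hcᵀ are defined, the structure term of the H update uses H HcᵀHc (the gradient of the structure loss), each updated column is clipped at a small positive `eps` to keep the factors non-negative, S is assumed symmetric, input normalization is omitted, and taking the largest entry of each row of H as the community label is one simple reading of the factor. The function name `cs_nmf` and the toy matrices are hypothetical.

```python
import numpy as np

def cs_nmf(C, S, n_communities, iters=100, seed=0, eps=1e-9):
    """Coupled factorization sketch: C ~ W @ H.T and S ~ H @ Hc.T with shared H."""
    rng = np.random.default_rng(seed)
    M, N = C.shape
    W = rng.random((M, n_communities))
    H = rng.random((N, n_communities))
    Hc = rng.random((N, n_communities))
    for _ in range(iters):
        for g in range(n_communities):
            # Update column g of W from the content residual.
            HtH = H.T @ H
            step = ((C @ H)[:, g] - (W @ HtH)[:, g]) / (HtH[g, g] + eps)
            W[:, g] = np.maximum(eps, W[:, g] + step)
            # Update column g of H, which is pulled towards both C and S.
            WtW, HctHc = W.T @ W, Hc.T @ Hc
            denom = WtW[g, g] + HctHc[g, g] + eps
            gain = (C.T @ W)[:, g] + (S @ Hc)[:, g]
            loss = (H @ HctHc)[:, g] + (H @ WtW)[:, g]
            H[:, g] = np.maximum(eps, H[:, g] + (gain - loss) / denom)
            # Update column g of Hc from the structure residual.
            HtH = H.T @ H
            step = ((S @ H)[:, g] - (Hc @ HtH)[:, g]) / (HtH[g, g] + eps)
            Hc[:, g] = np.maximum(eps, Hc[:, g] + step)
    return W, H, Hc

# Toy data: 5 terms x 4 users, with two groups of two users each.
C = np.array([[2, 2, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1], [0, 0, 1, 2], [0, 0, 2, 2]], dtype=float)
S = np.array([[0, 3, 0, 0], [3, 0, 0, 0], [0, 0, 0, 2], [0, 0, 2, 0]], dtype=float)
W, H, Hc = cs_nmf(C, S, n_communities=2)
labels = H.argmax(axis=1)  # one community index per user
```

Because the same H appears in both reconstructions, a user whose interaction row is nearly empty can still be placed through the content term, and vice versa.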

Table 1: Summary of the datasets

Datasets       # of Users   # of Interactions   # of Tweets   Unique Terms   # of Groups
DS1: Cancer    1585         1174                8260          2975           8
DS2: Health    2073         2191                19758         5444           6
DS3: Sport     5531         19699               12044         3558           6

Empirical Analysis

Experiments were carried out to evaluate (1) the accuracy improvement gained

by combining content and structure against using them individually, and (2) the

effectiveness of CS-NMF against state-of-the-art methods, to test the efficacy

of this way of combination. We have used the NMF, LDA and k-means clustering

methods [3] and the Louvain network analysis method [26] as baseline methods, with

F1-score (F1) and NMI as evaluation measures [3].

We used three Twitter datasets focusing on Cancer, Health and Sports domains

as reported in Table 1. We have chosen a set of groups under these domains

where we can identify Twitter accounts to collect tweets and user interactions.

Each subgroup is considered as the ground-truth community to benchmark the

outcome.

5 Results and Contributions

Experimental results show that combining - what users communicate through text

messages with how they connect with each other - is able to improve the accuracy

of community detection compared to using each of the information individually.

This confirms that coupling content and structure in learning the community

assignment through CS-NMF is an effective approach. Thus, the use of CS-NMF

in identifying users with similar interests would be useful in applications such as

targeted marketing or campaigns.

Results

Results in Table 2 show that CS-NMF is superior to applying the clustering methods

NMF, LDA, and k-means on structure or content separately in DS1 and DS2.

Applying the network analysis-based Louvain method also gave an inferior outcome.

There is a slight variation in DS3: though it confirms that combining structure and

content is able to accurately discover communities, the structure-based grouping by

Louvain achieves the best performance. As shown in Table 1, DS3 has a higher number

of user interactions compared to the others, which creates a considerably denser

structure matrix. This indicates that when a structure matrix is dense, a network

analysis method is able to accurately discover the communities, while a sparse

network representation requires coupling with content.

Table 2: Accuracy analysis

                  DS1               DS2               DS3
Methods           F1-Score  NMI     F1-Score  NMI     F1-Score  NMI
CS-NMF            0.78      0.76    0.69      0.62    0.48      0.35
NMF for C∗        0.62      0.58    0.55      0.46    0.35      0.07
NMF for S∗        0.36      0.19    0.42      0.15    0.43      0.31
LDA for C∗        0.26      0.02    0.39      0.01    0.31      0.00
LDA for S∗        0.19      0.09    0.28      0.12    0.48      0.38
k-means for C∗    0.74      0.72    0.59      0.50    0.36      0.07
k-means for S∗    0.26      0.02    0.40      0.03    0.32      0.01
Louvain           0.40      0.32    0.40      0.24    0.49      0.44

Note: C∗ and S∗ stand for the content and structure matrices
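The density argument can be checked with simple arithmetic on the figures reported in Table 1. The sketch below computes interactions per user and interactions per cell of the N×N structure matrix; both are rough proxies for density chosen for this illustration, not measures defined in the study.

```python
# Users and interactions as reported in Table 1.
datasets = {
    "DS1: Cancer": {"users": 1585, "interactions": 1174},
    "DS2: Health": {"users": 2073, "interactions": 2191},
    "DS3: Sport": {"users": 5531, "interactions": 19699},
}

def interactions_per_user(stats):
    return stats["interactions"] / stats["users"]

def interactions_per_cell(stats):
    """Interactions divided by the number of cells in the N x N structure matrix."""
    return stats["interactions"] / stats["users"] ** 2

ratios = {name: interactions_per_user(stats) for name, stats in datasets.items()}
densest = max(datasets, key=lambda name: interactions_per_cell(datasets[name]))
```

DS3 has roughly 3.56 interactions per user, against about 0.74 for DS1 and 1.06 for DS2, and it also has the most interactions per cell, which is in line with Louvain performing best on that dataset.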

Contributions

The contributions of this work are:

• We put forward the concept of combining content and structure for

community detection in a fully unsupervised manner to address data

sparsity, which otherwise results in an inferior outcome.

• We propose a Non-negative Matrix Factorization based coupled matrix to

accurately learn the user communities with content and structure.

Bibliography

[1] C. C. Aggarwal, “Outlier analysis,” in Data mining, pp. 237–263, Springer,

2015.

[2] C. C. Aggarwal and P. S. Yu, “Outlier detection for high dimensional data,”

in ACM Sigmod Record, vol. 30, pp. 37–46, ACM, 2001.

[3] C. C. Aggarwal and C. Zhai, Mining text data. Springer Science & Business

Media, 2012.

[4] M. Agyemang, K. Barker, and R. S. Alhajj, “Wcond-mine: algorithm for

detecting web content outliers from web documents,” in 10th IEEE Sym-

posium on Computers and Communications (ISCC’05), pp. 885–890, IEEE,

2005.

[5] M. Akbari and T.-S. Chua, “Leveraging behavioral factorization and prior

knowledge for community discovery and profiling,” in Proceedings of the

Tenth ACM International Conference on Web Search and Data Mining,

pp. 71–79, ACM, 2017.

[6] E. Aljalbout, V. Golkov, Y. Siddiqui, M. Strobel, and D. Cremers, “Clus-

tering with deep learning: Taxonomy and new methods,” arXiv preprint

arXiv:1801.07648, 2018.


[7] A. Amado, P. Cortez, P. Rita, and S. Moro, “Research trends on big data

in marketing: A text mining and topic modeling based literature analysis,”

European Research on Management and Business Economics, vol. 24, no. 1,

pp. 1–7, 2018.

[8] D. C. Anastasiu, A. Tagarelli, and G. Karypis, “Document clustering: The

next frontier,” 2013.

[9] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “Optics: ordering

points to identify the clustering structure,” in ACM Sigmod record, vol. 28,

pp. 49–60, ACM, 1999.

[10] M. Antunes, D. Gomes, and R. L. Aguiar, “Knee/elbow estimation based

on first derivative threshold,” in 2018 IEEE Fourth International Conference

on Big Data Computing Service and Applications (BigDataService), pp. 237–

240, IEEE, 2018.

[11] M. Aouf and L. A. Park, “Approximate document outlier detection using

random spectral projection,” in Australasian Joint Conference on Artificial

Intelligence, pp. 579–590, Springer, 2012.

[12] W. Ashour and S. Sunoallah, “Multi density dbscan,” in International Con-

ference on Intelligent Data Engineering and Automated Learning, pp. 446–

453, Springer, 2011.

[13] Y. Awuor and R. Oboko, “Automatic assessment of online discussions using

text mining,” International Journal of Machine Learning and Applications,

vol. 1, no. 1, p. 7, 2012.

[14] T. Aynaud, “Community detection for networkx’s documentation,” 2018.

[15] L. Azzopardi and V. Vinay, “Retrievability: an evaluation measure for higher

order information access tasks,” in Proceedings of the 17th ACM conference

on Information and knowledge management, pp. 561–570, ACM, 2008.


[16] L. D. Baker, T. Hofmann, A. McCallum, and Y. Yang, “A hierarchical prob-

abilistic model for novelty detection in text,” in Proceedings of International

Conference on Machine Learning, 1999.

[17] S. Banerjee, K. Ramanathan, and A. Gupta, “Clustering short texts using

wikipedia,” in Proceedings of the 30th annual international ACM SIGIR

conference on Research and development in information retrieval, pp. 787–

788, ACM, 2007.

[18] P. Bansal, R. Bansal, and V. Varma, “Towards deep semantic analysis of

hashtags,” in European conference on information retrieval, pp. 453–464,

Springer, 2015.

[19] C. Bao, H. Ji, Y. Quan, and Z. Shen, “Dictionary learning for sparse cod-

ing: Algorithms and convergence analysis,” IEEE transactions on pattern

analysis and machine intelligence, vol. 38, no. 7, pp. 1356–1369, 2016.

[20] B. V. Barde and A. M. Bainwad, “An overview of topic modeling methods

and tools,” in 2017 International Conference on Intelligent Computing and

Control Systems (ICICCS), pp. 745–750, IEEE, 2017.

[21] M. Belford, B. Mac Namee, and D. Greene, “Stability of topic modeling via

matrix factorization,” Expert Systems with Applications, vol. 91, pp. 159–

169, 2018.

[22] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geomet-

ric framework for learning from labeled and unlabeled examples,” Journal

of machine learning research, vol. 7, no. Nov, pp. 2399–2434, 2006.

[23] G. Bennett, F. Scholer, and A. Uitdenbogerd, “A comparative study of prob-

abilistic and language models for information retrieval,” in Proceedings of the

nineteenth conference on Australasian database-Volume 75, pp. 65–74, Aus-

tralian Computer Society, Inc., 2008.


[24] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is “near-

est neighbor” meaningful?,” in International conference on database theory,

pp. 217–235, Springer, 1999.

[25] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal

of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.

[26] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast un-

folding of communities in large networks,” Journal of statistical mechanics:

theory and experiment, vol. 2008, no. 10, p. P10008, 2008.

[27] L. Blouvshtein and D. Cohen-Or, “Outlier detection for robust multi-

dimensional scaling,” IEEE transactions on pattern analysis and machine

intelligence, 2018.

[28] L. Bolelli, S. Ertekin, and C. L. Giles, “Topic and trend detection in text

collections using latent dirichlet allocation,” in European Conference on In-

formation Retrieval, pp. 776–780, Springer, 2009.

[29] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: identifying

density-based local outliers,” in ACM sigmod record, vol. 29, pp. 93–104,

ACM, 2000.

[30] A. Broder, L. Garcia-Pueyo, V. Josifovski, S. Vassilvitskii, and S. Venkate-

san, “Scalable k-means by ranked retrieval,” in Proceedings of the 7th ACM

international conference on Web search and data mining, pp. 233–242, ACM,

2014.

[31] D. Cai, X. He, J. Han, and T. S. Huang, “Graph regularized nonnegative

matrix factorization for data representation,” IEEE transactions on pattern

analysis and machine intelligence, vol. 33, no. 8, pp. 1548–1560, 2010.

[32] S. B. Cantor and M. W. Kattan, “Determining the area under the roc

curve for a binary diagnostic test,” Medical Decision Making, vol. 20, no. 4,

pp. 468–470, 2000.

[33] F. Cao, M. Ester, W. Qian, and A. Zhou, “Density-based clustering over

an evolving data stream with noise,” in Proceedings of the 2006 SIAM in-

ternational conference on data mining, pp. 328–339, SIAM, 2006.

[34] H. A. Carneiro and E. Mylonakis, “Google trends: a web-based tool for real-

time surveillance of disease outbreaks,” Clinical infectious diseases, vol. 49,

no. 10, pp. 1557–1564, 2009.

[35] M. Cataldi, L. Di Caro, and C. Schifanella, “Emerging topic detection on

twitter based on temporal and social terms evaluation,” in Proceedings of the

tenth international workshop on multimedia data mining, p. 4, ACM, 2010.

[36] M. E. Celebi, Partitional clustering algorithms. Springer, 2014.

[37] N. Cercone, F. Yasmeen, and Y. Gonzalez-Fernandez, “Information retrieval

and the vector space model.” University Lecture, 2014.

[38] D. Chakraborty, V. Narayanan, and A. Ghosh, “Integration of deep feature

extraction and ensemble learning for outlier detection,” Pattern Recognition,

vol. 89, pp. 161–171, 2019.

[39] Y. Chen, H. Zhang, R. Liu, Z. Ye, and J. Lin, “Experimental explorations

on short text topic mining between lda and nmf based schemes,” Knowledge-

Based Systems, vol. 163, pp. 1–13, 2019.

[40] Y. Chi, X. Song, D. Zhou, K. Hino, and B. L. Tseng, “Evolutionary spectral

clustering by incorporating temporal smoothness,” in Proceedings of the 13th

ACM SIGKDD international conference on Knowledge discovery and data

mining, pp. 153–162, ACM, 2007.

[41] R. Churchill, L. Singh, and C. Kirov, “A temporal topic model for noisy

mediums,” in Pacific-Asia Conference on Knowledge Discovery and Data

Mining, pp. 42–53, Springer, 2018.

[42] C. De Boom, S. Van Canneyt, T. Demeester, and B. Dhoedt, “Representa-

tion learning for very short texts using weighted word embedding aggrega-

tion,” Pattern Recognition Letters, vol. 80, pp. 150–156, 2016.

[43] S. Dehuri, C. Mohapatra, A. Ghosh, and R. Mall, “Comparative study of

clustering algorithms,” Information Technology Journal, 2006.

[44] I. S. Dhillon, “Co-clustering documents and words using bipartite spectral

graph partitioning,” in Proceedings of the seventh ACM SIGKDD interna-

tional conference on Knowledge discovery and data mining, pp. 269–274,

ACM, 2001.

[45] J. DiGrazia, K. McKelvey, J. Bollen, and F. Rojas, “More tweets, more

votes: Social media as a quantitative indicator of political behavior,” PloS

one, vol. 8, no. 11, p. e79449, 2013.

[46] C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegative matrix

t-factorizations for clustering,” in Proceedings of the 12th ACM SIGKDD

international conference on Knowledge discovery and data mining, pp. 126–135, ACM, 2006.

[47] B. Dong, M. M. Lin, and M. T. Chu, “Nonnegative rank factorization via

rank reduction,” preprint, 2008.

[48] L. Du, W. Buntine, H. Jin, and C. Chen, “Sequential latent dirichlet allo-

cation,” Knowledge and information systems, vol. 31, no. 3, pp. 475–503,

2012.

[49] N. Du, M. Farajtabar, A. Ahmed, A. J. Smola, and L. Song, “Dirichlet-

hawkes processes with applications to clustering continuous-time document

streams,” in Proceedings of the 21th ACM SIGKDD International Confer-

ence on Knowledge Discovery and Data Mining, pp. 219–228, ACM, 2015.

[50] R. Du, D. Kuang, B. Drake, and H. Park, “Dc-nmf: nonnegative matrix

factorization based on divide-and-conquer for fast clustering and topic mod-

eling,” Journal of Global Optimization, vol. 68, no. 4, pp. 777–798, 2017.

[51] L. Duan, L. Xu, Y. Liu, and J. Lee, “Cluster-based outlier detection,” Annals

of Operations Research, vol. 168, no. 1, pp. 151–168, 2009.

[52] A. Egg, “Locality-sensitive hashing (lsh),” 2017.

[53] I. A. El-Khair, “Term weighting,” Encyclopedia of Database Systems,

pp. 3037–3040, 2009.

[54] Elasticsearch, “Similarity module,” 2019.

[55] L. Ertoz, M. Steinbach, and V. Kumar, “Finding clusters of different sizes,

shapes, and densities in noisy, high dimensional data,” in Proceedings of the

2003 SIAM international conference on data mining, pp. 47–58, SIAM, 2003.

[56] L. Ertoz, M. Steinbach, and V. Kumar, “Finding topics in collections of

documents: A shared nearest neighbor approach,” in Clustering and Infor-

mation Retrieval, pp. 83–103, Springer, 2004.

[57] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., “A density-based algorithm

for discovering clusters in large spatial databases with noise,” in KDD, vol. 96,

pp. 226–231, 1996.

[58] A. Flexer, “Hubness-aware outlier detection for music genre recognition,”

in Proceedings of the 19th international conference on digital audio effects,

2016.

[59] N. Fuhr, M. Lechtenfeld, B. Stein, and T. Gollub, “The optimum clustering

framework: implementing the cluster hypothesis,” Information Retrieval,

vol. 15, no. 2, pp. 93–115, 2012.

[60] A. Gandomi and M. Haider, “Beyond the hype: Big data concepts, methods,

and analytics,” International journal of information management, vol. 35,

no. 2, pp. 137–144, 2015.

[61] J. Ghosh and A. Acharya, “Cluster ensembles,” Wiley Interdisciplinary Re-

views: Data Mining and Knowledge Discovery, vol. 1, no. 4, pp. 305–315,

2011.

[62] E. Giuliani and C. Pietrobelli, “Social network analysis methodologies for

the evaluation of cluster development programs,” tech. rep., Inter-American

Development Bank, 2011.

[63] D. Greene, D. Archambault, V. Belak, and P. Cunningham, “Textluas:

tracking and visualizing document and term clusters in dynamic text data,”

arXiv preprint arXiv:1502.04609, 2014.

[64] D. Greene and J. P. Cross, “Exploring the political agenda of the european

parliament using a dynamic topic modeling approach,” Political Analysis,

vol. 25, no. 1, pp. 77–94, 2017.

[65] X. Gu and H. Wang, “Online anomaly prediction for robust cluster systems,”

in 2009 IEEE 25th International Conference on Data Engineering, pp. 1000–

1011, IEEE, 2009.

[66] Q. Gu and J. Zhou, “Co-clustering on manifolds,” in Proceedings of the 15th

ACM SIGKDD international conference on Knowledge discovery and data

mining, pp. 359–368, ACM, 2009.

[67] B. Hajek, “Adaptive transmission strategies and routing in mobile radio

networks,” in Proceedings of the Conference on Information Sciences and

Systems, vol. 17, p. 373, Department of Electrical Engineering, Johns Hop-

kins University., 1983.

[68] V. Hautamäki, I. Kärkkäinen, and P. Fränti, “Outlier detection using k-

nearest neighbour graph,” in Proceedings of the 17th International Confer-

ence on Pattern Recognition (ICPR 2004), vol. 3, pp. 430–433, IEEE,

2004.

[69] D. M. Hawkins, Identification of outliers, vol. 11. Springer, 1980.

[70] Z. He, “Hub selection for hub based clustering algorithms,” in 2014 11th In-

ternational Conference on Fuzzy Systems and Knowledge Discovery (FSKD),

pp. 479–484, IEEE, 2014.

[71] Z. He, X. Xu, and S. Deng, “Discovering cluster-based local outliers,” Pat-

tern Recognition Letters, vol. 24, no. 9-10, pp. 1641–1650, 2003.

[72] F. Heimerl, S. Lohmann, S. Lange, and T. Ertl, “Word cloud explorer: Text

analytics based on word clouds,” in 2014 47th Hawaii International Confer-

ence on System Sciences, pp. 1833–1842, IEEE, 2014.

[73] J.-L. Hervas-Oliver, G. Gonzalez, P. Caja, and F. Sempere-Ripoll, “Clusters

and industrial districts: Where is the literature going? identifying emerging

sub-fields of research,” European Planning Studies, vol. 23, no. 9, pp. 1827–

1872, 2015.

[74] M. Hoffman, F. R. Bach, and D. M. Blei, “Online learning for latent dirichlet

allocation,” in advances in neural information processing systems, pp. 856–

864, 2010.

[75] L. Hong and B. D. Davison, “Empirical study of topic modeling in twitter,”

in Proceedings of the first workshop on social media analytics, pp. 80–88,

ACM, 2010.

[76] T. Hong, T. Lee, and J. Li, “Development of sentiment analysis model for

the hot topic detection of online stock forums,” Journal of Intelligence and

Information Systems, vol. 22, no. 1, pp. 187–204, 2016.

[77] A. Hotho, A. Nürnberger, and G. Paaß, “A brief survey of text mining,” in

LDV Forum, vol. 20, pp. 19–62, Citeseer, 2005.

[78] J. Hou and R. Nayak, “The heterogeneous cluster ensemble method using

hubness for clustering text documents,” in International Conference on Web

Information Systems Engineering, pp. 102–110, Springer, 2013.

[79] X. Hu and H. Liu, “Text analytics in social media,” in Mining text data,

pp. 385–414, Springer, 2012.

[80] X. Hu, N. Sun, C. Zhang, and T.-S. Chua, “Exploiting internal and ex-

ternal semantics for the clustering of short texts using world knowledge,”

in Proceedings of the 18th ACM conference on Information and knowledge

management, pp. 919–928, ACM, 2009.

[81] A. Huang, “Similarity measures for text document clustering,” in Proceed-

ings of the sixth new zealand computer science research student conference

(NZCSRSC2008), Christchurch, New Zealand, vol. 4, pp. 9–56, 2008.

[82] G. Huang, J. He, Y. Zhang, W. Zhou, H. Liu, P. Zhang, Z. Ding, Y. You,

and J. Cao, “Mining streams of short text for analysis of world-wide event

evolutions,” World Wide Web, vol. 18, no. 5, pp. 1201–1217, 2015.

[83] K. Huang, N. D. Sidiropoulos, and A. Swami, “Non-negative matrix factor-

ization revisited: Uniqueness and algorithm for symmetric decomposition,”

IEEE Transactions on Signal Processing, vol. 62, no. 1, pp. 211–224, 2014.

[84] J. Huang, Q. Zhu, L. Yang, and J. Feng, “A non-parameter outlier detection

algorithm based on natural neighbor,” Knowledge-Based Systems, vol. 92,

pp. 71–77, 2016.

[85] X. Huosong, F. Zhaoyan, and P. Liuyan, “Chinese web text outlier min-

ing based on domain knowledge,” in 2010 Second WRI Global Congress on

Intelligent Systems, vol. 2, pp. 73–77, IEEE, 2010.

[86] IBM, “Big data and analytics hub,” 2017.

[87] K. Ismo et al., “Outlier detection using k-nearest neighbour graph,” in

Proceedings of the 17th International Conference on Pattern Recognition

(ICPR 2004), vol. 3, pp. 430–433, IEEE, 2004.

[88] R. Iyer, J. Wong, W. Tavanapong, and D. A. Peterson, “Identifying policy

agenda sub-topics in political tweets based on community detection,” in

Proceedings of the 2017 IEEE/ACM International Conference on Advances

in Social Networks Analysis and Mining 2017, pp. 698–705, ACM, 2017.

[89] D. A. Jackson and Y. Chen, “Robust principal component analysis and out-

lier detection with ecological data,” Environmetrics: The official journal of

the International Environmetrics Society, vol. 15, no. 2, pp. 129–139, 2004.

[90] A. K. Jain, “Data clustering: 50 years beyond k-means,” Pattern recognition

letters, vol. 31, no. 8, pp. 651–666, 2010.

[91] N. Jardine and C. J. van Rijsbergen, “The use of hierarchic clustering in in-

formation retrieval,” Information storage and retrieval, vol. 7, no. 5, pp. 217–

240, 1971.

[92] R. A. Jarvis and E. A. Patrick, “Clustering using a similarity measure

based on shared near neighbors,” IEEE Transactions on computers, vol. 100,

no. 11, pp. 1025–1034, 1973.

[93] C. Jia, M. B. Carson, X. Wang, and J. Yu, “Concept decompositions for

short text clustering by identifying word communities,” Pattern Recognition,

vol. 76, pp. 691–703, 2018.

[94] M. Jiang, P. Cui, and C. Faloutsos, “Suspicious behavior detection: Cur-

rent trends and future directions,” IEEE Intelligent Systems, vol. 31, no. 1,

pp. 31–39, 2016.

[95] O. Jin, N. N. Liu, K. Zhao, Y. Yu, and Q. Yang, “Transferring topical

knowledge from auxiliary long texts for short text clustering,” in Proceedings

of the 20th ACM international conference on Information and knowledge

management, pp. 775–784, ACM, 2011.

[96] R. Kannan, H. Woo, C. C. Aggarwal, and H. Park, “Outlier detection for

text data: An extended version,” arXiv preprint arXiv:1701.01325, 2017.

[97] A. Kappas, “Social regulation of emotion: messy layers,” Frontiers in psy-

chology, vol. 4, p. 51, 2013.

[98] S. P. Kasiviswanathan, P. Melville, A. Banerjee, and V. Sindhwani, “Emerg-

ing topic detection using dictionary learning,” in Proceedings of the 20th

ACM international conference on Information and knowledge management,

pp. 745–754, ACM, 2011.

[99] S. P. Kasiviswanathan, H. Wang, A. Banerjee, and P. Melville, “Online

l1-dictionary learning with application to novel document detection,” in Ad-

vances in Neural Information Processing Systems, pp. 2258–2266, 2012.

[100] P. Ke, F. Huang, M. Huang, and X. Zhu, “Araml: A stable adversarial

training framework for text generation,” arXiv preprint arXiv:1908.07195,

2019.

[101] I. Khalil, Z. Dou, and A. Khreishah, “Your credentials are compromised, do

not panic: You can be well protected,” in Proceedings of the 11th ACM on

Asia Conference on Computer and Communications Security, pp. 925–930,

ACM, 2016.

[102] M.-S. Kim and J. Han, “A particle-and-density based evolutionary cluster-

ing method for dynamic networks,” Proceedings of the VLDB Endowment,

vol. 2, no. 1, pp. 622–633, 2009.

[103] J. Kim, Y. He, and H. Park, “Algorithms for nonnegative matrix and tensor

factorizations: a unified view based on block coordinate descent framework,”

Journal of Global Optimization, vol. 58, no. 2, pp. 285–319, 2014.

[104] E. M. Knox and R. T. Ng, “Algorithms for mining distance-based outliers in

large datasets,” in Proceedings of the international conference on very large

data bases, pp. 392–403, Citeseer, 1998.

[105] S. Kokkula and N. M. Musti, “Classification and outlier detection based

on topic based pattern synthesis,” in International Workshop on Machine

Learning and Data Mining in Pattern Recognition, pp. 99–114, Springer,

2013.

[106] R. Kosala and H. Blockeel, “Web mining research: A survey,” ACM Sigkdd

Explorations Newsletter, vol. 2, no. 1, pp. 1–15, 2000.

[107] K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and

D. Brown, “Text classification algorithms: A survey,” Information, vol. 10,

no. 4, p. 150, 2019.

[108] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek, “Outlier detection in

axis-parallel subspaces of high dimensional data,” in Pacific-Asia Conference

on Knowledge Discovery and Data Mining, pp. 831–838, Springer, 2009.

[109] H.-P. Kriegel, M. Schubert, and A. Zimek, “Angle-based outlier detection

in high-dimensional data,” in Proceedings of the 14th ACM SIGKDD inter-

national conference on Knowledge discovery and data mining, pp. 444–452,

ACM, 2008.

[110] D. Kuang, J. Choo, and H. Park, “Nonnegative matrix factorization for in-

teractive topic modeling and document clustering,” in Partitional Clustering

Algorithms, pp. 215–243, Springer, 2015.

[111] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word embeddings

to document distances,” in International conference on machine learning,

pp. 957–966, 2015.

[112] S. Kutty, R. Nayak, P. Turnbull, R. Chernich, G. Kennedy, and K. Ray-

mond, “Paperminer—a real-time spatiotemporal visualization for newspaper

articles,” Digital Scholarship in the Humanities, 2019.

[113] Q. Le and T. Mikolov, “Distributed representations of sentences and doc-

uments,” in International conference on machine learning, pp. 1188–1196,

2014.

[114] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., “Gradient-based learning

applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11,

pp. 2278–2324, 1998.

[115] P. Lee, L. V. Lakshmanan, and E. E. Milios, “Incremental cluster evolution

tracking from highly dynamic network data,” in 2014 IEEE 30th Interna-

tional Conference on Data Engineering, pp. 3–14, IEEE, 2014.

[116] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factoriza-

tion,” in Advances in neural information processing systems, pp. 556–562,

2001.

[117] O. Levy and Y. Goldberg, “Neural word embedding as implicit matrix fac-

torization,” in Advances in neural information processing systems, pp. 2177–

2185, 2014.

[118] Y. Li, J. Nie, Y. Zhang, B. Wang, B. Yan, and F. Weng, “Contextual recom-

mendation based on text mining,” in Proceedings of the 23rd International

Conference on Computational Linguistics: Posters, pp. 692–700, Association

for Computational Linguistics, 2010.

[119] Q. Li, A. Nourbakhsh, S. Shah, and X. Liu, “Real-time novel event detection

from social media,” in 2017 IEEE 33rd International Conference on Data

Engineering (ICDE), pp. 1129–1139, IEEE, 2017.

[120] N. Li and D. D. Wu, “Using text mining and sentiment analysis for online

forums hotspot detection and forecast,” Decision support systems, vol. 48,

no. 2, pp. 354–368, 2010.

[121] S. Liang, “Unsupervised semantic generative adversarial networks for ex-

pert retrieval,” in The World Wide Web Conference, pp. 1039–1050, ACM,

2019.

[122] C.-J. Lin, “Projected gradient methods for nonnegative matrix factoriza-

tion,” Neural computation, vol. 19, no. 10, pp. 2756–2779, 2007.

[123] Y.-R. Lin, Y. Chi, S. Zhu, H. Sundaram, and B. L. Tseng, “Facetnet: a

framework for analyzing communities and their evolutions in dynamic net-

works,” in Proceedings of the 17th international conference on World Wide

Web, pp. 685–694, ACM, 2008.

[124] F.-R. Lin, L.-S. Hsieh, and F.-T. Chuang, “Discovering genres of online

discussion threads via text mining,” Computers & Education, vol. 52, no. 2,

pp. 481–495, 2009.

[125] Y. Liu, C. Jiang, and H. Zhao, “Using contextual features and multi-view

ensemble learning in product defect identification from online discussion fo-

rums,” Decision Support Systems, vol. 105, pp. 1–12, 2018.

[126] H. Liu, X. Li, J. Li, and S. Zhang, “Efficient outlier detection for high-

dimensional data,” IEEE Transactions on Systems, Man, and Cybernetics:

Systems, vol. 48, no. 12, pp. 2451–2461, 2017.

[127] Y. Liu, Z. Li, C. Zhou, Y. Jiang, J. Sun, M. Wang, and X. He, “Gener-

ative adversarial active learning for unsupervised outlier detection,” IEEE

Transactions on Knowledge and Data Engineering, 2019.

[128] L. Liu, Y. Lu, M. Yang, Q. Qu, J. Zhu, and H. Li, “Generative adver-

sarial network for abstractive text summarization,” in Thirty-second AAAI

conference on artificial intelligence, 2018.

[129] N. Ljubesic, D. Boras, N. Bakaric, and J. Njavro, “Comparing measures

of semantic similarity,” in ITI 2008-30th International Conference on Infor-

mation Technology Interfaces, pp. 675–682, IEEE, 2008.

[130] K. Luong, T. Balasubramaniam, and R. Nayak, “A novel technique of using

coupled matrix and greedy coordinate descent for multi-view data represen-

tation,” in International Conference on Web Information Systems Engineer-

ing, pp. 285–300, Springer, 2018.

[131] K. Luong and R. Nayak, “Clustering multi-view data using non-negative

matrix factorization and manifold learning for effective understanding: A

survey paper,” in Linking and Mining Heterogeneous and Multi-view Data,

pp. 201–227, Springer, 2019.

[132] L. P. Macfadyen and S. Dawson, “Mining lms data to develop an “early

warning system” for educators: A proof of concept,” Computers & education,

vol. 54, no. 2, pp. 588–599, 2010.

[133] C. Manning, P. Raghavan, and H. Schutze, “Introduction to information

retrieval,” Natural Language Engineering, vol. 16, no. 1, pp. 100–103, 2010.

[134] V. Mehta, R. S. Caceres, and K. M. Carter, “Evaluating topic quality using

model clustering,” in 2014 IEEE Symposium on Computational Intelligence

and Data Mining (CIDM), pp. 178–185, IEEE, 2014.

[135] Y. Meng, J. Shen, C. Zhang, and J. Han, “Weakly-supervised hierarchi-

cal text classification,” in Proceedings of the AAAI Conference on Artificial

Intelligence, vol. 33, pp. 6826–6833, 2019.

[136] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of

word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.

[137] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Dis-

tributed representations of words and phrases and their compositionality,”

in Advances in neural information processing systems, pp. 3111–3119, 2013.

[138] M. Mohler and R. Mihalcea, “Text-to-text semantic similarity for automatic

short answer grading,” in Proceedings of the 12th Conference of the Euro-

pean Chapter of the Association for Computational Linguistics, pp. 567–575,

Association for Computational Linguistics, 2009.

[139] W. A. Mohotti and R. Nayak, “Corpus-based augmented media posts with

density-based clustering for community detection,” in 2018 IEEE 30th Inter-

national Conference on Tools with Artificial Intelligence (ICTAI), pp. 379–

386, IEEE, 2018.

[140] W. A. Mohotti and R. Nayak, “An efficient ranking-centered density-based

document clustering method,” in Pacific-Asia Conference on Knowledge

Discovery and Data Mining, pp. 439–451, Springer, 2018.

[141] B. Nadler and M. Galun, “Fundamental limitations of spectral clustering,”

in Advances in neural information processing systems, pp. 1017–1024, 2007.

[142] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi, “Bad news travel

fast: A content-based analysis of interestingness on twitter,” in Proceedings

of the 3rd international web science conference, p. 8, ACM, 2011.

[143] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis

and an algorithm,” in Advances in neural information processing systems,

pp. 849–856, 2002.

[144] D. Nolleke, C. G. Grimmer, and T. Horky, “News sources and follow-up

communication: Facets of complementarity between sports journalism and

social media,” Journalism Practice, vol. 11, no. 4, pp. 509–526, 2017.

[145] N. Oikonomakou and M. Vazirgiannis, “A review of web document cluster-

ing approaches,” in Data mining and knowledge discovery handbook, pp. 921–

943, Springer, 2005.

[146] T. Pang, F. Nie, and J. Han, “Flexible orthogonal neighborhood preserving

embedding,” in IJCAI, pp. 2592–2598, 2017.

[147] A. Park, M. Conway, and A. T. Chen, “Examining thematic similarity,

difference, and membership in three online mental health communities from

reddit: a text mining and visualization approach,” Computers in human

behavior, vol. 78, pp. 98–112, 2018.

[148] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word

representation,” in Proceedings of the 2014 conference on empirical methods

in natural language processing (EMNLP), pp. 1532–1543, 2014.

[149] R. Peter, G. Shivapratap, G. Divya, and K. Soman, “Evaluation of svd and

nmf methods for latent semantic analysis,” International Journal of Recent

Trends in Engineering, vol. 1, no. 3, p. 308, 2009.

[150] W. M. Pottenger and T.-h. Yang, “Detecting emerging concepts in textual

data mining,” Computational information retrieval, vol. 100, no. 1, pp. 89–

105, 2001.

[151] T. Puranik and L. Narayanan, “Community detection in evolving net-

works,” in Proceedings of the 2017 IEEE/ACM International Conference

on Advances in Social Networks Analysis and Mining 2017, pp. 385–390,

ACM, 2017.

[152] J. Qiang, P. Chen, T. Wang, and X. Wu, “Topic modeling over short texts

by incorporating word embeddings,” in Pacific-Asia Conference on Knowl-

edge Discovery and Data Mining, pp. 363–374, Springer, 2017.

[153] M. Qin, D. Jin, K. Lei, B. Gabrys, and K. Musial-Gabrys, “Adaptive com-

munity detection incorporating topology and content in social networks,”

Knowledge-Based Systems, vol. 161, pp. 342–356, 2018.

[154] M. Radovanovic, A. Nanopoulos, and M. Ivanovic, “Hubs in space: Popular

nearest neighbors in high-dimensional data,” Journal of Machine Learning

Research, vol. 11, no. Sep, pp. 2487–2531, 2010.

[155] M. Radovanovic, A. Nanopoulos, and M. Ivanovic, “Reverse nearest neigh-

bors in unsupervised distance-based outlier detection,” IEEE transactions

on knowledge and data engineering, vol. 27, no. 5, pp. 1369–1382, 2014.

[156] F. Raiber and O. Kurland, “Exploring the cluster hypothesis, and cluster-

based retrieval, over the web,” in Proceedings of the 21st ACM interna-

tional conference on Information and knowledge management, pp. 2507–

2510, ACM, 2012.

[157] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient algorithms for mining

outliers from large data sets,” in ACM Sigmod Record, vol. 29, pp. 427–438,

ACM, 2000.

[158] M. Ramezani, A. Khodadadi, and H. R. Rabiee, “Community detection

using diffusion information,” ACM Transactions on Knowledge Discovery

from Data (TKDD), vol. 12, no. 2, p. 20, 2018.

[159] A. Rangrej, S. Kulkarni, and A. V. Tendulkar, “Comparative study of clus-

tering techniques for short text documents,” in Proceedings of the 20th in-

ternational conference companion on World wide web, pp. 111–112, ACM,

2011.

[160] T. Roelleke and J. Wang, “Tf-idf uncovered: a study of theories and prob-

abilities,” in Proceedings of the 31st annual international ACM SIGIR con-

ference on Research and development in information retrieval, pp. 435–442,

ACM, 2008.

[161] M. Rosvall and C. T. Bergstrom, “Maps of random walks on complex net-

works reveal community structure,” Proceedings of the National Academy of

Sciences, vol. 105, no. 4, pp. 1118–1123, 2008.

[162] M. Sahami and T. D. Heilman, “A web-based kernel function for measuring

the similarity of short text snippets,” in Proceedings of the 15th international

conference on World Wide Web, pp. 377–386, ACM, 2006.

[163] G. Salton and C. Buckley, “Term-weighting approaches in automatic text

retrieval,” Information processing & management, vol. 24, no. 5, pp. 513–

523, 1988.

[164] E. Schubert, A. Zimek, and H.-P. Kriegel, “Fast and scalable outlier detec-

tion with approximate nearest neighbor ensembles,” in International Con-

ference on Database Systems for Advanced Applications, pp. 19–36, Springer,

2015.

[165] H. Schutze, C. D. Manning, and P. Raghavan, Introduction to information

retrieval, vol. 39. Cambridge University Press, 2008.

[166] F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons, “Document

clustering using nonnegative matrix factorization,” Information Processing

& Management, vol. 42, no. 2, pp. 373–386, 2006.

[167] F. Shang, L. Jiao, and F. Wang, “Graph dual regularization non-negative

matrix factorization for co-clustering,” Pattern Recognition, vol. 45, no. 6,

pp. 2237–2250, 2012.

[168] T. Shi, K. Kang, J. Choo, and C. K. Reddy, “Short-text topic modeling via

non-negative matrix factorization enriched with local word-context correla-

tions,” in Proceedings of the 2018 World Wide Web Conference, pp. 1105–

1114, International World Wide Web Conferences Steering Committee, 2018.

[169] W. Silva, A. Santana, F. Lobato, and M. Pinheiro, “A methodology for

community detection in twitter,” in Proceedings of the International Con-

ference on Web Intelligence, pp. 1006–1009, ACM, 2017.

[170] S. Sinclair and G. Rockwell, “the voyant tools team,” 2012.

[171] M. D. Smucker and J. Allan, “A new measure of the cluster hypothesis,” in

Conference on the Theory of Information Retrieval, pp. 281–288, Springer,

2009.

[172] T. Sutanto and R. Nayak, “The ranking based constrained document clus-

tering method and its application to social event detection,” in Interna-

tional Conference on Database Systems for Advanced Applications, pp. 47–

60, Springer, 2014.

[173] T. Sutanto and R. Nayak, “Semi-supervised document clustering via

loci,” in International Conference on Web Information Systems Engineering,

pp. 208–215, Springer, 2015.

[174] T. Sutanto and R. Nayak, “Fine-grained document clustering via ranking

and its application to social media analytics,” Social Network Analysis and

Mining, vol. 8, no. 1, p. 29, 2018.

[175] N. Tomasev and D. Mladenic, “Hub co-occurrence modeling for robust high-

dimensional knn classification,” in Joint European Conference on Machine

Learning and Knowledge Discovery in Databases, pp. 643–659, Springer,

2013.

[176] N. Tomasev, M. Radovanovic, D. Mladenic, and M. Ivanovic, “The role of

hubness in clustering high-dimensional data,” IEEE Transactions on Knowl-

edge and Data Engineering, vol. 26, no. 3, pp. 739–751, 2013.

[177] N. Tomasev, M. Radovanovic, D. Mladenic, and M. Ivanovic, “Hubness-

based clustering of high-dimensional data,” in Partitional clustering algo-

rithms, pp. 353–386, Springer, 2015.

[178] P. University, “Predictive modeling & machine learning laboratory,” 2016.

[179] T. Wagner, R. Feger, and A. Stelzer, “Modifications of the optics clus-

tering algorithm for short-range radar tracking applications,” in 2018 15th

European Radar Conference (EuRAD), pp. 91–94, IEEE, 2018.

[180] X. Wang and A. McCallum, “Topics over time: a non-markov continuous-

time model of topical trends,” in Proceedings of the 12th ACM SIGKDD

international conference on Knowledge discovery and data mining, pp. 424–

433, ACM, 2006.

[181] H. Wang, F. Nie, H. Huang, and F. Makedon, “Fast nonnegative matrix

tri-factorization for large-scale data co-clustering,” in Twenty-Second Inter-

national Joint Conference on Artificial Intelligence, 2011.

[182] H. Wang, Z. Qin, and T. Wan, “Text generation based on generative adver-

sarial nets with latent variables,” in Pacific-Asia Conference on Knowledge

Discovery and Data Mining, pp. 92–103, Springer, 2018.

[183] N. Wang and D.-Y. Yeung, “Learning a deep compact image representation

for visual tracking,” in Advances in neural information processing systems,

pp. 809–817, 2013.

[184] R. Wang, D. Zhou, and Y. He, “Open event extraction from online text

using a generative adversarial network,” arXiv preprint arXiv:1908.09246,

2019.

[185] L. Wensen, C. Zewen, W. Jun, and W. Xiaoyi, “Short text classification

based on wikipedia and word2vec,” in 2016 2nd IEEE International Con-

ference on Computer and Communications (ICCC), pp. 1195–1200, IEEE,

2016.

[186] S. M. Wong, W. Ziarko, and P. C. Wong, “Generalized vector spaces model

in information retrieval,” in Proceedings of the 8th annual international ACM

SIGIR conference on Research and development in information retrieval,

pp. 18–25, ACM, 1985.

[187] M. Wozniak, M. Grana, and E. Corchado, “A survey of multiple classifier

systems as hybrid systems,” Information Fusion, vol. 16, pp. 3–17, 2014.

[188] L. Xu, C. Jiang, Y. Ren, and H.-H. Chen, “Microblog dimensionality re-

duction—a deep learning approach,” IEEE Transactions on Knowledge and

Data Engineering, vol. 28, no. 7, pp. 1779–1789, 2016.

[189] J. Xu, W. Peng, T. Guanhua, X. Bo, Z. Jun, W. Fangyuan, H. Hongwei,

et al., “Short text clustering via convolutional neural networks,” in Pro-

ceedings of the Annual Conference of the North American Chapter of the

Association for Computational Linguistics, pp. 62–69, Association for Com-

putational Linguistics, 2015.

[190] J. Xu, B. Xu, P. Wang, S. Zheng, G. Tian, and J. Zhao, “Self-taught

convolutional neural networks for short text clustering,” Neural Networks,

vol. 88, pp. 22–31, 2017.

[191] Y. Yan, R. Huang, C. Ma, L. Xu, Z. Ding, R. Wang, T. Huang, and B. Liu,

“Improving document clustering for short texts by long documents via a

Dirichlet multinomial allocation model,” in Asia-Pacific Web (APWeb) and

Web-Age Information Management (WAIM) Joint Conference on Web and

Big Data, pp. 626–641, Springer, 2017.

[192] T. Yang, Y. Chi, S. Zhu, Y. Gong, and R. Jin, “Detecting communities and

their evolutions in dynamic social networks—a bayesian approach,” Machine

learning, vol. 82, no. 2, pp. 157–189, 2011.

[193] P. Yang and B. Huang, “KNN based outlier detection algorithm in large

dataset,” in 2008 International Workshop on Education Technology and

Training & 2008 International Workshop on Geoscience and Remote Sens-

ing, vol. 1, pp. 611–613, IEEE, 2008.

[194] J. Yi, Y. Zhang, X. Zhao, and J. Wan, “A novel text clustering approach

using deep-learning vocabulary network,” Mathematical Problems in Engi-

neering, vol. 2017, 2017.

[195] Y. You, G. Huang, J. Cao, E. Chen, J. He, Y. Zhang, and L. Hu, “GEAM: A

general and event-related aspects model for Twitter event detection,” in

International Conference on Web Information Systems Engineering, pp. 319–332,

Springer, 2013.

[196] B. Yu, “Research on information retrieval model based on ontology,”

EURASIP Journal on Wireless Communications and Networking, vol. 2019,

no. 1, p. 30, 2019.

[197] Z. Yuan, X. Zhang, and S. Feng, “Hybrid data-driven outlier detection

based on neighborhood information entropy and its developmental mea-

sures,” Expert Systems with Applications, vol. 112, pp. 243–257, 2018.

[198] X. Zhang, H. Gao, G. Li, J. Zhao, J. Huo, J. Yin, Y. Liu, and L. Zheng,

“Multi-view clustering based on graph-regularized nonnegative matrix factorization

for object recognition,” Information Sciences, vol. 432, pp. 463–

478, 2018.

[199] B. Zhang, H. Li, Y. Liu, L. Ji, W. Xi, W. Fan, Z. Chen, and W.-Y. Ma, “Im-

proving web search results using affinity graph,” in Proceedings of the 28th

annual international ACM SIGIR conference on Research and development

in information retrieval, pp. 504–511, ACM, 2005.

[200] J. Zhang, X. Long, and T. Suel, “Performance of compressed inverted list

caching in search engines,” in Proceedings of the 17th international confer-

ence on World Wide Web, pp. 387–396, ACM, 2008.

[201] W. Zhao, Q. He, H. Ma, and Z. Shi, “Effective semi-supervised document

clustering via active learning with instance-level constraints,” Knowledge

and Information Systems, vol. 30, no. 3, pp. 569–587, 2012.

[202] C. T. Zheng, C. Liu, and H. San Wong, “Corpus-based topic diffusion for

short text clustering,” Neurocomputing, vol. 275, pp. 2444–2458, 2018.

[203] N. Zheng and J. Xue, “Manifold learning,” in Statistical Learning and Pat-

tern Analysis for Image and Video Processing, pp. 87–119, Springer, 2009.

[204] P. Zhu, X. Zhan, and W. Qiu, “Efficient k-nearest neighbors search in high

dimensions using MapReduce,” in 2015 IEEE Fifth International Conference

on Big Data and Cloud Computing, pp. 23–30, IEEE, 2015.

[205] A. Zimek, “Clustering high-dimensional data,” in Data Clustering, pp. 201–

230, Chapman and Hall/CRC, 2018.