Unsupervised Text Mining: Effective Similarity Calculation
with Ranking and Matrix Factorization
Wathsala Anupama Mohotti
B.Sc.(Hons) & M.Sc. in Information Technology
Submitted in Fulfilment
of the Requirements
for the Degree of
Doctor of Philosophy
Queensland University of Technology
School of Computer Science
Science and Engineering Faculty
2020
This thesis is dedicated to my loving parents
N. Mohotti and G. Nagahawaththa
Statement of Original Authorship
The work contained in this thesis has not been previously submitted for a degree
or diploma at any other higher educational institution. To the best of my knowl-
edge and belief, the thesis contains no material previously published or written
by another person except where due reference is made.
Name:
Signature:
Date: 27/03/2020
Wathsala Anupama Mohotti
QUT Verified Signature
Acknowledgements
It is my pleasure to express my appreciation and gratitude to everyone who has
been a part of my PhD journey. First, I would like to express my heartfelt
gratitude to my principal supervisor, Associate Professor Richi Nayak, for her
continuous support and constant guidance over the past few years. Her advice,
encouragement, feedback, motivation, direction, and patience made my journey
possible. Also, I would like to thank my associate supervisor, Associate Profes-
sor Shlomo Geva, for his support throughout my PhD journey. I would like to
acknowledge the financial support provided by QUT throughout my PhD by the
QUT Postgraduate Research Award (QUTPRA) and the QUT HDR Tuition Fee
Sponsorship.
I would like to acknowledge the QUT high-performance computing (HPC) team,
Big Data Laboratory and QUT digital observatory for providing the necessary
infrastructure during the course of my PhD. Also, I would like to thank the
staff of EECS School for their administrative support during my candidature. I
acknowledge the services of professional editor, Diane Kolomeitz, who provided
copyediting and proofreading services, according to the guidelines laid out in the
university-endorsed national “Guidelines for editing research theses”. I extend my
sincere gratitude to past and present lab members of the Applied Data Mining
Research Group (ADMRG) and all my friends in QUT as well as my housemates for
their valuable support throughout my journey. I would particularly like to thank
Dr Sarasi Munasinghe, Gayani Tennakoon, Dr Noor Ifada, Dr Taufik Sutanto and
Dinusha Wijedasa for providing much-needed support during my studies.
Finally, I would like to express my heartfelt gratitude to my loving parents for
everything they have done for me. Their unconditional support, sacrifice and the
positive influence they have had throughout my life has taken me to the place
I am at today. I would also like to thank my brothers for their support,
encouragement, and love. Thank you all for being there for me.
Abstract
Advancements in digital processing techniques have led to exponential growth in
the size of text data collections. Text data have been used primarily in social me-
dia platforms, document repositories, news broadcasting services, websites, and
blogs as an effective communication medium. Text mining is a popular approach
to discover meaningful information such as clusters, outliers and evolution in clus-
ters from the text collections. The unavailability of ground-truths in real-world
collections creates the demand for conducting these analyses in an unsupervised
setting.
Multiple approaches have been explored to identify text similarity for finding
clusters, outliers, and evolution in text clusters. However, the high dimensional
nature of text data and the associated sparseness in document representation
present challenges for text mining methods to identify similarity within text data.
The distance calculation, density estimation, and other approximation techniques
become ineffective in identifying accurate information. This presents a need for
developing methods that can handle high dimensionality and related problems in
text data for knowledge discovery.
The thesis proposes a set of methods to identify text similarity mainly using rank-
ing and matrix factorization. It proposes methods for finding document clusters,
outliers, and changing dynamics of the clusters based on these novel similarity
concepts. More specifically, the proposed methods (1) use ranking concepts to
exploit nearest neighbors in determining text similarities and dissimilarities ef-
ficiently; (2) accurately learn dense patches in naturally sparse data; (3) enrich
documents to avoid extreme sparseness in short text data; and (4) represent
high dimensional text with lower rank representation using matrix factorization
minimizing the information loss.
Firstly, this thesis presents two novel text clustering methods, RDDC (Ranking-
Centered Density-Based Document Clustering Method) and CCNMF (Consensus
and Complementary Non-negative Matrix Factorization for Document Cluster-
ing), and two specific methods to identify clusters with short text for the appli-
cation areas of community detection and concept mining.
• In RDDC, a Shared Nearest Neighbor (SNN) graph is built based on
the ranked documents using an Information Retrieval system, and clus-
ters are identified with density estimation from the SNN and the frequent
neighborhood-based hubs. Empirical analysis shows RDDC to be accurate
and efficient due to the use of document neighborhoods, generated using the
relevant documents sets from an IR system, that form relatively uniform
regions in text collection to differentiate varying densities.
• In CCNMF, the vector space model is integrated with the neighborhood
information, preserving geometric structures, to compensate for the infor-
mation loss in NMF. Empirical analysis shows that CCNMF is able to accu-
rately identify clusters as it uses complementary and consensus information
from the input data, especially with local neighborhood affinity through
pairwise calculation and global neighborhood affinity through IR ranking.
• The two proposed short-text methods, corpus-based augmented media posts with
density-based clustering for community detection and concept mining in online
forums using self-corpus-based augmented text clustering, use
document expansion to handle the extreme sparseness in short text posts.
The document expansion method approximates topic vectors using NMF to
obtain virtual words for post-expansion to improve the word co-occurrence
in the sparse text aligning with the semantics of the collection. These en-
riched documents are shown to be accurate in community detection with
a density-based clustering on heterogeneous social media text, while concept
mining on homogeneous forum text has shown better performance with
distance-based clustering.
Secondly, this thesis presents four novel outlier detection algorithms based on the
novel concepts of rare frequency of terms and ranking.
• OIDF (Outlier detection based on Inverse Document Frequency) proposes
the simple concept of using inverse document frequency of terms to identify
documents that deviate from the set of inlier groups, where the high
dimensionality of text vectors impairs concepts such as distance, density
or dimensionality reduction.
• ORFS (Outlier detection based on Ranking Function Score) proposes an
outlier score for a document based on the inverse of the ranking scores
given for response documents by an IR system that are considered as nearest
neighbors.
• ORNC (Outlier detection based on Ranked Neighborhood k-occurrences
Count) proposes to calculate the reverse neighbor count in response lists for
documents in the entire collection to define an outlier score. This assigns
high outlier scores to documents with low counts, which are anti-hubs.
• ORDG (Outliers by Ranking based Density Graphs) proposes outlier de-
tection by identifying documents that do not exist in the mutual nearest
neighbor graph that is meant to include inliers. Empirical analysis shows
ORDG to be accurate and efficient through the nearest neighbors identified
with the IR system to generate the mutual neighbor graph and identified
frequent nearest neighbors (hubs) attached to the graph.
These four algorithms have been shown to be efficient due to the use of IR ranking
concepts in modeling nearest neighbors compared to pairwise calculations that are
expensive for the large document collections. The outlier candidates generated
with ORFS, ORNC, and ORDG algorithms are sequentially and independently
combined with outlier candidates of OIDF to develop ensemble methods to obtain
higher accuracy.
Lastly, this thesis proposes a novel method for identifying the changing dynamics
of text clusters, named CaCE (Cluster Association-aware matrix factorization
for discovering Cluster Evolution). CaCE tracks major lifecycle states of birth,
death, split and merge of the clusters to discover emergence, persistence, growth
and decay patterns using both intra-cluster and inter-cluster associations with
NMF. In CaCE, the use of both these relationships has been shown to be accurate in
identifying cluster groups and compensating for the information loss in dimen-
sionality reduction. CaCE proposes to use density estimation with term weights
to refine the cluster assignment to groups as a further compensating mechanism
for the information loss. In CaCE, evolution is represented by drawing edges in
a k-partite graph between consecutive time intervals if the clusters possess the
same level of density and belong to the same group. This visualization technique
aids in the interpretability of lifecycle states and patterns in clusters.
In summary, the thesis makes a substantial contribution to the fundamental task
of effective text similarity identification needed for the development of text clus-
tering, text outlier detection and tracking text cluster evolution methods. This
thesis advances the fields of data mining, machine learning and document en-
gineering by successfully dealing with the high dimensionality of text vectors
and associated problems that have been repetitively discussed in the academic
literature and commonly faced in real-world applications.
Keywords
Unsupervised Learning, Text Mining, Text Similarity, Clustering, Outlier Detec-
tion, Cluster Evolution, Ranking, Nearest Neighbors, Density Estimation, Docu-
ment Expansion, Non-negative Matrix Factorization, Shared Nearest Neighbors,
Mutual Neighbor graph, Hubs, Anti-hubs, Skip-Gram with Negative Sampling
Contents
Abstract
Keywords
List of Tables
List of Figures
List of Publications
Acronyms & Abbreviations
Chapter 1 Introduction
1.1 Background
1.2 Problem Statement and Motivation
1.2.1 Problem Statement
1.2.2 Motivation
1.3 Research Questions
1.4 Research Aim and Objectives
1.5 Research Contributions
1.6 Publications Resulting from Research
1.7 Research Significance
1.8 High Level Overview of the Thesis
Chapter 2 Literature Review and Background
2.1 Text Mining
2.1.1 Text Mining Process
2.1.2 Text Feature Representation
2.2 Text similarity
2.2.1 Distinct Text Characteristics
2.2.2 Text Similarity Measures
2.3 Unsupervised Text Mining Methods
2.3.1 Text Clustering
2.3.2 Text Outlier Detection
2.3.3 Text Cluster Evolution
2.4 Research Gaps
2.4.1 Text Clustering
2.4.2 Text Outlier Detection
2.4.3 Text Cluster Evolution
Chapter 3 Text Clustering
Paper 1: An Efficient Ranking-Centered Density-Based Document Clustering Method
Paper 2: Consensus and Complementary Non-negative Matrix Factorization for Document Clustering
Paper 3: Corpus-based Augmented Media Posts with Density-based Clustering for Community Detection
Paper 4: Concept Mining in Online Forums using Self-corpus-based Augmented Text Clustering
Chapter 4 Text Outlier Detection
Paper 5: Efficient Outlier Detection in Text Corpus Using Rare Frequency and Ranking
Paper 6: Text Outlier Detection using a Ranking-based Mutual Graph
Chapter 5 Text Cluster Evolution
Paper 7: Discovering Cluster Evolution Patterns with the Cluster Association-aware Matrix Factorization
Chapter 6 Conclusion and Future Directions
6.1 Summary of Contributions
6.2 Summary of Findings
6.2.1 Clustering
6.2.2 Outlier Detection
6.2.3 Cluster Evolution
6.3 Future Work
6.3.1 Stream mining
6.3.2 Community discovery considering both structure and content information
6.3.3 Deep learning
6.3.4 Short text clustering
6.3.5 Soft clustering
6.3.6 Complete text mining framework
6.3.7 Pre-trained models for document representation
Appendix A: Case Studies
Appendix B: Matrix Factorization for Community Detection using a Coupled Matrix
1 Introduction
2 Problem and Motivation
3 Related Work
4 Approach
5 Results and Contributions
Bibliography
List of Tables
2.1 Internet traffic report by Alexa on August 15th, 2019
2.2 Summary of the major outlier detection methods
2.3 Categories in dynamic text evolution
4.1 Proposed outlier detection methods
List of Figures
1.1 Sparseness in text with higher dimensional representation [178] and the distance concentration problem [52]
1.2 Examples for text types and nature of the vectors
1.3 Architecture of the thesis for unsupervised text mining
1.4 Overview for unsupervised text mining methods
2.1 General text mining process
2.2 Skewness of k-NN [154]
2.3 Skewness of hubs [175]
2.4 Clustering with distance to the centroid and clustering with hub-similarity [174]
2.5 Mutual Neighbors that share common documents
2.6 An overview of text clustering methods
2.7 The use of ranking for clustering [174]
3.1 Overview of the Chapter 3 contributions
4.1 Overview of the Chapter 4 contributions
5.1 Overview of the Chapter 5 contributions
List of Publications
Wathsala Anupama Mohotti and Richi Nayak.: An Efficient Ranking-
Centered Density-Based Document Clustering Method. Pacific-Asia Conference
on Knowledge Discovery and Data Mining, pp. 439-451. Springer (2018) (Will
form part of Chapter 3).
Wathsala Anupama Mohotti and Richi Nayak.: Consensus and Comple-
mentary Non-negative Matrix Factorization for Document Clustering. Elsevier
Knowledge-Based Systems journal (Under Review). (Will form part of Chapter
3).
Wathsala Anupama Mohotti and Richi Nayak.: Corpus-Based Augmented
Media Posts with Density-Based Clustering for Community Detection. Inter-
national Conference on Tools with Artificial Intelligence (ICTAI), pp. 379-386.
IEEE (2018) (Will form part of Chapter 3).
Wathsala Anupama Mohotti and Darren Christopher Lukas and Richi
Nayak.: Concept Mining in Online Forums using Self-corpus-based Augmented
Text Clustering. Pacific Rim International Conference on Artificial Intelligence
(PRICAI), pp. 397-402. Springer (2019) (Will form part of Chapter 3).
Wathsala Anupama Mohotti and Richi Nayak.: Efficient Outlier Detection in
Text Corpus Using Rare Frequency and Ranking. ACM Transactions on Knowl-
edge Discovery from Data (TKDD) (Accepted with Major Revision). (Will form
part of Chapter 4).
Wathsala Anupama Mohotti and Richi Nayak.: Text Outlier Detection using
a Ranking-based Mutual Graph. Data & Knowledge Engineering Journal (Under
Review). (Will form part of Chapter 4).
Wathsala Anupama Mohotti and Richi Nayak.: Discovering Cluster Evolu-
tion Patterns with the Cluster Association-aware Matrix Factorization. Springer
Knowledge and Information Systems (KAIS) (Under Review). (Will form Chap-
ter 5).
Acronyms & Abbreviations
NN Nearest Neighbors
SNN Shared Nearest Neighbors
VSM Vector Space Model
IR Information Retrieval
BOW Bag of Words
NMF Non-negative Matrix Factorization
SGNS Skip-Gram model with Negative Sampling
RDDC Ranking-Centered Density-Based Document Clustering
CCNMF Consensus and Complementary Non-negative Matrix Factorization
OIDF Outlier detection based on Inverse Document Frequency
ORFS Outlier detection based on Ranking Function Score
ORNC Outlier detection based on Ranked Neighborhood k-occurrences Count
ORDG Outliers by Ranking based Density Graphs
CaCE Cluster Association-aware matrix factorization for discovering Cluster Evolution
WMD Word Mover’s Distance
GAN Generative Adversarial Network
Chapter 1
Introduction
This chapter presents the overview of the research conducted in this thesis, in-
cluding background, problem, questions, aims, and objectives of the research.
The overall research significance and limitations are described accordingly. The
structure of the thesis is presented at the end of the chapter.
1.1 Background
Text data, widespread in social media platforms and document repositories such
as news broadcasting platforms and document indexing systems, has emerged
as a powerful means of communication among people and organizations [3, 44].
The process of discovering useful information from text document collections is
known as text mining [3]. Text mining has a significant impact on diverse applica-
tions such as social media analytics [79], opinion mining [3] and recommendation
systems [118]. The real-world scenarios, where labeled data (i.e., data with cat-
egories attached) is not available, have made unsupervised text mining popular.
This topic has been studied for decades in many fields such as clustering, outlier
detection, sentiment analysis, topic modeling and evolution analysis [3, 8, 96].
This thesis focuses on the identification of similarity/dissimilarity among text in-
stances effectively in order to learn the clusters, outliers and changing dynamics of
clusters in text document collections. The process of finding natural groups in the
document collection based on their similarities is known as document clustering
[8]. In contrast, finding documents that show a set of different terms that deviate
from the common terms in the collection is known as text outlier detection [96].
In addition to clusters and outliers, identifying dynamic changes to clusters and
the evolutionary pattern of clusters over time (or domains) based on similarity
among clusters is an emerging area that is aided by the strength of text mining
for knowledge discovery from text data collections [63].
Figure 1.1: Sparseness in text with higher dimensional representation [178] and the distance concentration problem [52]
A common challenge faced by all these text mining methods is how to identify
the similarity/dissimilarity between text instances. Accurately identifying the
similarity among documents is challenged by the sparseness of text representation
[3]. A popular data model for text representation is the vector space model
(VSM) that records (weighted) presence/count of a term within the document
[53]. Different types of data sources form different sizes of text vectors [95, 138]
(Fig. 1.2). Short text, which appears on social media platforms, forms short
text vectors that are usually extremely sparse compared to other text [79]. All
other text data face the usual sparseness of high dimensional data.
Text data from sources such as Wikipedia form very large text
vectors [95] that need high processing power because of the large number of
dimensions to be processed. News data form medium size text vectors [138]
compared to the other two types.
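As a rough illustration of the vector space model and its sparseness, the following minimal sketch builds a term-count matrix from a few toy documents and measures the fraction of zero entries. The function names `build_vsm` and `sparsity`, the whitespace tokenisation, and the example documents are assumptions made for illustration only; they are not taken from the thesis.

```python
from collections import Counter


def build_vsm(docs):
    """Build a document-term count matrix from whitespace-tokenised text."""
    tokenised = [doc.lower().split() for doc in docs]
    # The vocabulary defines the dimensions of every document vector.
    vocab = sorted({term for tokens in tokenised for term in tokens})
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # A term that is absent from the document contributes a zero.
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def sparsity(rows):
    """Return the fraction of zero entries in the document-term matrix."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return zeros / total


# Hypothetical toy collection: a short post, a news-like sentence, another post.
docs = [
    "great game tonight",
    "stock markets fall as investors worry about inflation",
    "the team won the game",
]
vocab, rows = build_vsm(docs)
print(len(vocab), round(sparsity(rows), 3))
```

Even in this tiny collection most entries are zero, and the proportion of zeros grows as the vocabulary grows with more documents.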
The high dimensional nature of text data forms a sparse VSM representation due
to fewer word co-occurrences. Consequently, many existing methods become less
effective in determining similarity. Identifying similar text instances is fundamen-
tal to text mining methods. In high dimensional representation, identifying near
and far points is problematic using the distance measurements. This phenomenon
is known as the distance concentration problem [205]. As shown in Fig. 1.1, the
distance differences among instances become negligible with the sparseness. This
leads to blurring the border between nearest neighbors and farthest neighbors (as
shown in Fig. 1.1.c). The sparse data representation is also a problem for identi-
fying the similarity based on density estimation [12]. Due to the lack of density
variations (spikes), it is hard to identify the subgroups. In order to identify the
similarity among text instances, it is essential to develop effective methods that
overcome sparseness in text for large document collections.
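The distance concentration effect can be observed with a small experiment. The sketch below is an illustration under stated assumptions, not the thesis's own analysis: it draws uniformly random points (real text vectors are sparse and distributed differently) and reports the relative contrast between the farthest and nearest neighbor of a query point. The name `relative_contrast` and the sample sizes are hypothetical.

```python
import math
import random


def relative_contrast(n_points, dim, seed=0):
    """Return (farthest - nearest) / nearest for one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # Euclidean distance between the query and the sampled point.
        dists.append(math.dist(query, point))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(200, dim), 3))
```

As the number of dimensions grows, the contrast shrinks, so the nearest and farthest neighbors become harder to tell apart by distance alone.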
Figure 1.2: Examples for text types and nature of the vectors
A wide range of text mining methods have been proposed to address the sparseness
in high dimensional text representation [3, 8]. Distance-based methods
[30, 36] aim to identify text similarity based on distance differences. Nearest
neighbor based similarity [68, 78, 193] is used as an extension to this in recent
research. These methods use the frequent nearest neighbors in document collections
and the similarity of other documents to them for identifying the groups.
Though this is used in cluster identification as well as outlier detection, where the
deviated points are identified from frequent neighbors, it still faces challenges.
The process of calculating frequent nearest neighbor sets, as well as calculating
similarity to them, is expensive. Density-based methods [29, 33, 55, 201] also
fail in handling sparse text without sophisticated designs. Among the matrix
factorization methods that are used to approximate higher dimensional representation
using lower rank factors, Non-negative Matrix Factorization [96, 110, 181]
is shown to be effective for text, as term representations in the text are always
non-negative. Probabilistic methods [25, 49, 180] also perform dimensionality
reduction by calculating the probability of a document belonging to a lower dimensional
space. All these dimensionality reduction methods, however, face the problem of
information loss [3].
These issues are common in clustering, outlier detection and cluster
evolution detection, which deal with sparse, high dimensional text representation.
There is a need to develop effective methods considering the nature of the
associated text vectors.
1.2 Problem Statement and Motivation
1.2.1 Problem Statement
Unsupervised text mining is an important process of deriving useful informa-
tion such as groups, patterns, and trends in the digital document collection.
For instance, social media platforms generate text data that are short in length
with extreme sparseness. News data and web pages also contain high dimensional
text that forms sparse text vectors. Generally, all text data show
few word co-occurrences among text pairs, which results in a sparse term representation.
Identifying subgroups, deviated documents in a document collection, or
dynamic changes in text clusters needs effective methods to compare the similarity
between text pairs. Sparseness makes this comparison difficult and leads to problems in
distance-based, density-based,
probability-based and matrix factorization-based methods. This thesis focuses
on unsupervised text mining to identify text similarity/dissimilarity for finding
clusters, outliers and dynamic changes to clusters effectively, while minimizing
the problems associated with sparseness of high-dimensional text.
1.2.2 Motivation
The popularity of the internet increases the availability of digital text in social
media, online forums or message boards, email services, news broadcasting ser-
vices, web blogs and websites. Text mining is an effective approach to extract
concepts, clusters, user communities, deviated themes and dynamic changes in
those text collections using machine learning approaches [3]. The real-world sce-
narios with less/zero availability of ground truth data create the need to use
unsupervised learning methods in finding these useful patterns. Usually, text
data is high in dimensions and shows a sparse vector representation due to fewer
word co-occurrences [3]. Particularly, sources such as social media contain short
text that forms comparatively short size text vectors and show a limited word
co-occurrence with the extreme sparseness in vectors [79]. This fact leads most
of the existing state-of-the-art methods to be ineffective in identifying similarity
among text instances [95, 159].
Density-based methods which usually identify the subgroups or deviated docu-
ments based on density patches are unable to accurately estimate the density
differences in the sparse text representation [29, 33, 55, 201]. Matrix factoriza-
tion [96, 110] or other dimensionality reduction methods [96, 110], commonly
used for higher dimensional data, are challenged by the information loss in lower
dimensional approximation. Distance-based methods [30, 36] face the distance
concentration problem in higher dimensions showing a blurred border between
near and far instances, as illustrated in Fig. 1.1 [177]. This has a similar effect on
hierarchical clustering due to the requirement of multiple pairwise computations
at each step of decision making.
Nearest neighbor-based methods have been used in handling higher dimensional
text in recent research to identify neighbors in addition to traditional similarity
measures [68, 78, 173, 193]. Researchers use the nearest neighbors with graphs to
identify dense patches and outliers [56, 193]. Higher dimensional data have been
known to show the hub phenomenon, where “the distribution of number of times
some points appear among k nearest neighbors of other points is highly skewed”
[177]. Text mining methods use this concept of frequent nearest neighbors in identifying the
similarity of text pairs, which is useful for identifying clusters, outliers
or dynamic changes of clusters in text collections [155, 173]. However, pairwise
comparison in determining nearest neighbors is not accurate in higher dimensions,
and it is not time efficient for large text collections. The IR ranking concept
has been used as an efficient alternative for identifying hubs in recent research
[173]. Furthermore, IR document querying has been used to identify documents in a
cluster by giving the cluster center as the query point [30]. This thesis explores the novel
concept of IR ranking in clustering as well as in outlier detection. This thesis
proposes to develop effective methods to build neighborhood graphs for density
estimation in finding uniformly dense subgroups and filtering outlier documents.
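The k-occurrence count behind the hub phenomenon can be sketched as follows. This is a brute-force illustration that uses pairwise cosine similarity, which is exactly the expensive step that the IR ranking approach is meant to avoid; it is not the method proposed in the thesis. The names `cosine` and `k_occurrences` and the toy vectors are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def k_occurrences(vectors, k):
    """Count how often each point appears in the k-NN lists of the other points."""
    counts = [0] * len(vectors)
    for i, u in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        # Rank the other points by decreasing similarity to point i.
        others.sort(key=lambda j: cosine(u, vectors[j]), reverse=True)
        for j in others[:k]:
            counts[j] += 1
    return counts


# Hypothetical toy document vectors over a five-term vocabulary.
vectors = [
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 3],
]
print(k_occurrences(vectors, k=2))
```

Points with high counts behave as hubs, and points with counts near zero behave as anti-hubs. The nested comparison costs quadratic time in the number of documents, which is why a ranked retrieval result is attractive as a substitute for the neighbor lists.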
In addition, dimensionality reduction methods, especially NMF with its
non-negativity constraint, are used in text mining to obtain a lower rank representation
that enables groups to be identified [96, 110, 181]. However, this higher to lower
dimension approximation destroys the geometric structure of the data [146]. In the
projected lower order space, points that are neighbors in high dimensions do not remain
close, which leads to information loss. The thesis identifies the need
to compensate for this loss with additional information about
nearby/close points [93, 95]. It investigates the use of nearest neighbor-assisted
NMF in cluster identification, as well as the use of inter-cluster association
assistance in identifying cluster dynamics.
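For readers unfamiliar with NMF, a plain version with the standard multiplicative update rules can be sketched in a few lines. This is generic NMF minimising the squared reconstruction error, without the neighborhood or cluster-association terms that the thesis adds; the helper names, the rank, the iteration count, and the toy matrix are assumptions for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - WH, i.e. the reconstruction loss."""
    wh = matmul(w, h)
    return sum((xv - av) ** 2 for xr, ar in zip(x, wh) for xv, av in zip(xr, ar))


def nmf(x, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorise non-negative X (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (X H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical document-term counts with two visible topic blocks.
x = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 2, 3],
    [0, 0, 3, 2],
]
w, h = nmf(x, rank=2)
print(round(frobenius_error(x, w, h), 4))
```

Each row of `w` gives a document's weight on the two latent factors, which can be read as a soft cluster assignment. Any residual error that remains is the information lost by the low rank approximation.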
The extreme sparseness in short text is a distinctive problem in text mining, which
has been handled with the assistance of different non-content information or using
terms from external sources in many state-of-the-art methods [17, 80, 95]. However,
the semantic characteristics of short text can mismatch with the assistance given by
such additional information, and structural incoherence between the external source
and the original data leads to poor outcomes. This thesis explores an effective
method to alleviate the extreme sparseness in short text using corpus-based
expansion.
In summary, this thesis deals with the above-mentioned challenges in identifying
text-similarity in an effective manner, mainly using Ranking and Matrix Factor-
ization. It aims to propose methods, taking advantage of the ranking concepts,
density estimation with ranking, NMF-based learning and document expansion
with NMF, to learn the accurate clusters, outlier documents, and cluster evolution
according to the nature of the text as detailed in Figure 1.3.
[Figure 1.3 layers. Output: Clusters, Outliers, Cluster Evolution. Text Similarity Identification Concepts: Density Estimation, Ranking Concepts, Non-negative Matrix Factorization, Document Expansion. Data types (Vector size): Short size text vectors, Medium size text vectors, Large size text vectors.]
Figure 1.3: Architecture of the thesis for unsupervised text mining
1.3 Research Questions
In unsupervised text mining, the similarity calculation among documents is a
fundamental and critical step. Unsupervised text mining methods for learning
subgroups, deviated documents and dynamic changes to clusters primarily rely
upon the processes that they employ for similarity identification. However, the
high dimensional nature of the text poses several challenges. The primary objec-
tive of the thesis is to explore effective ways of similarity identification between
text pairs. This thesis extends the similarity identification concept in implement-
ing clustering, outlier detection and cluster evolution methods. More specifically,
the thesis explores the solutions for the following research questions.
1. Clustering: To identify subgroups in a text corpus, how can the similarity
calculation among documents be conducted with the novel ranking and
matrix factorization concepts?
(a) In sparse data where density difference is not able to identify the sub-
groups, how can the graph-based methods with ranking be used for
effective density estimation?
(b) Instead of expensive pairwise comparisons, how can the IR ranking-
based neighbors be employed to identify the subgroups?
(c) How can the associated information loss be minimised in matrix fac-
torization to approximate the lower rank factors and to identify sub-
groups?
2. Outlier Detection: How can the concept of ranking and density, used in
finding text similarity, be extended in detecting outliers in a text collection?
3. Cluster Evolution: How can the matrix decomposition and identified
factors be used to understand the cluster similarity and changing dynamics
of text clusters in text collections?
1.4 Research Aim and Objectives
The overarching aim of this thesis is to design, develop and evaluate
unsupervised text mining methods that can effectively identify the similarity
among text instances for learning clusters, outliers and cluster evolution in
document collections. The objectives of this research are listed as follows:
RO.1. Developing text mining methods that are able to accurately
identify clusters in document collections
The main focus is to explore the problems associated with the high
dimensionality of text vectors that challenge existing methods, especially
pairwise neighbor identification, which is impaired in this setting. This
thesis investigates novel concepts such
as ranking-based neighborhoods and ranking-based neighborhood graphs. It ex-
plores the use of these concepts in density estimation in the sparse text data
as a key objective. Further, it investigates the use of ranking-based neighbor
information to assist matrix factorization to accurately cluster documents.
• RO.1.1. The short text data shows distinct characteristics with extremely
sparse representation due to short vector length. Effectively learning doc-
ument similarity in short text becomes challenging. This thesis focuses on
identifying a novel corpus-based document expansion method to deal with
this issue.
RO.2. Developing text mining methods that are able to accurately and
efficiently identify outliers in a text collection
The high dimensional and sparse vector representation challenges traditional
methods in differentiating deviated documents from the inlier subgroups. Gen-
erally, outlier detection methods rank the observations based on deviations. The
majority of them show higher computational complexity with large text
collections. This thesis investigates novel term weighting-based and
ranking-based concepts to identify the outliers accurately and efficiently. It
proposes methods that use ranking-based neighbors and ranking based on rare term
frequency to deal with high dimensional text representation and associated
problems.
Developing the text outlier detection methods responding to these challenges,
considering the size of the text vectors, is another focus of the thesis.
RO.3. Developing a text mining method that is able to correctly iden-
tify the cluster evolution in text collections
Identifying all the life-cycle states of clusters and their evolutionary patterns is
another focus of this thesis. It studies a method to capture the evolution patterns
over the time/domain with matrix factorization using the high dimensional text
cluster representations. Matrix factorization naturally leads to information loss in
higher-to-lower dimensional projection. The use of different relationships within
clusters and term distributions are investigated to compensate for this loss. The
majority of the existing methods consider local relationships or consider a subset
of the data space in tracking evolution. Developing a method to identify the
global dynamics of text clusters responding to these challenges is the objective of
the thesis.
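As an illustration of the information loss mentioned above, the following minimal sketch factorizes a small non-negative document-term matrix V into W and H with the standard multiplicative update rules and reports the squared reconstruction error ||V - WH||^2. It is not the method developed in this thesis; the toy matrix, the function names (`nmf`, `frobenius_loss`) and the parameter values are assumptions chosen only for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_loss(v, w, h):
    """Squared Frobenius norm of V - WH (the reconstruction error)."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF; returns W, H and the loss per iteration."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    losses = [frobenius_loss(v, w, h)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
        losses.append(frobenius_loss(v, w, h))
    return w, h, losses

# Hypothetical 4-document x 4-term matrix with two topical blocks.
V = [[2, 1, 0, 0],
     [1, 2, 0, 0],
     [0, 0, 3, 1],
     [0, 0, 1, 3]]

W, H, losses = nmf(V, rank=2)
print("initial loss:", round(losses[0], 3), "final loss:", round(losses[-1], 3))
```

Because the toy matrix has rank 4 and the factorization uses rank 2, the final error stays above zero. This residual is the kind of loss that additional relationship information is intended to compensate for.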
1.5 Research Contributions
This thesis has developed several methods for identifying text clusters, outliers,
and cluster evolution, which address the ineffectiveness in existing measures in
identifying text similarity.
RC.1. Text clustering methods
• RC.1.1. Ranking-Centered Density-Based Document Clustering Method
(RDDC)
RDDC has been developed to gain accuracy and time efficiency in text
clustering by avoiding the pairwise nearest neighbor calculation. The IR
ranking concept is used to generate nearest neighbors: a document query that
statistically represents a document is posed against an inverted index data
structure, and the relevant documents returned are treated as its nearest
neighbors. These responses to a document are shown to be relevant to each
other and to lie in the same cluster, which indicates that they are
semantically coherent. These generated nearest neighbors are
used in generating a shared nearest neighbor graph that shows uniformly
dense regions in the sparse text as a novel contribution. Another contribu-
tion of RDDC is the identification of hubs that exist in high dimensional
data (i.e., frequent nearest neighbors) using the shared neighbor graph. It
efficiently calculates the similarity for hubs using relevancy scores provided
by the IR system to enhance the percentage of documents that are clustered
to the correct group. This research is published in the 22nd Pacific-Asia
Conference on Knowledge Discovery and Data Mining (PAKDD).
• RC.1.2. Consensus and Complementary Non-negative Matrix Factorization
for Document Clustering (CCNMF)
Conjecturing that IR can be used to accurately generate nearest neighbors,
CCNMF is an NMF-based method that uses nearest neighbors generated
with IR ranking as a document affinity matrix. The novel contribution
of combining nearest neighbors that preserve the geometric structure with
document representation, is able to accurately approximate the document
cluster assignment, minimizing information loss in lower dimensional ap-
proximation. CCNMF assigns clusters by using consensus and complemen-
tary information that are common and specific to inputs respectively. Em-
pirical analysis validates that combining IR-based global neighbor affinity
and pairwise similarity-based local neighbor affinity with the VSM docu-
ment representation results in finding more accurate clusters in lower-order
dimension approximation of the high-dimensional text. This research has
been submitted to the Elsevier Knowledge-Based Systems (KBS) journal and is
under review.
• RC.1.3. Corpus-based Augmented Media Posts with Density-based Cluster-
ing for Community Detection
In this method, a novel approach of document expansion to improve the
word co-occurrences has been proposed to deal with extremely sparse short
text. The virtual topic terms are included in documents aligning with the
semantics of the corpus itself, based on the topic vectors identified in the
corpus. NMF-based topic vector approximation is proposed to obtain vir-
tual terms. Another contribution is to identify user communities using this
enriched text, which represents users in social media platforms using the
density estimation and centroid-based fine tuning process, which boosts
the cluster assignments. Empirical analysis confirms that the enrichment
of text in social media that includes heterogeneous text is able to minimize
the sparseness in short text and support the learning process of term-based
density differences. This work led to a conference paper and was published
in the 30th International Conference on Tools with Artificial Intelligence
(ICTAI).
• RC.1.4. Concept Mining in Online Forums using Self-corpus-based Aug-
mented Text Clustering
The corpus-based document enrichment method has been applied in another
application of concept mining. The NMF-based topic vector approximation
is used to enrich the forum posts using topic words as virtual words.
Additionally, a centroid-based text clustering method is proposed in this method
to handle the homogeneous nature of the forum text. This work led to a
conference paper and was published in the 16th Pacific Rim International
Conference on Artificial Intelligence (PRICAI).
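The ranking-based neighborhood idea behind RDDC can be illustrated with a small sketch. The ranked neighbor lists below are hard-coded and only stand in for the responses an IR engine would return for each document query; the document identifiers, the function names and the `min_shared` threshold are hypothetical, and the code is a simplified illustration rather than the RDDC algorithm itself. It builds a shared nearest neighbor graph by comparing each document only with the documents in its own ranked list, and counts k-occurrences to find hub candidates.

```python
from collections import Counter

# Hypothetical top-2 ranked neighbor lists (stand-ins for IR engine responses).
neighbors = {
    "d1": ["d2", "d3"],
    "d2": ["d1", "d3"],
    "d3": ["d1", "d2"],
    "d4": ["d5", "d3"],
    "d5": ["d4", "d3"],
}

def shared_neighbor_graph(neighbors, min_shared=1):
    """Link a document to a ranked neighbor when their lists overlap enough."""
    edges = {}
    for doc, ranked in neighbors.items():
        for other in ranked:
            if other == doc:
                continue
            shared = len(set(ranked) & set(neighbors.get(other, [])))
            if shared >= min_shared:
                # Undirected edge weighted by the number of shared neighbors.
                edges[tuple(sorted((doc, other)))] = shared
    return edges

def k_occurrences(neighbors):
    """Count how often each document appears in other documents' lists."""
    counts = Counter({doc: 0 for doc in neighbors})
    for doc, ranked in neighbors.items():
        for other in ranked:
            if other != doc:
                counts[other] += 1
    return counts

edges = shared_neighbor_graph(neighbors)
counts = k_occurrences(neighbors)
hub, hub_count = counts.most_common(1)[0]
print(edges)
print("hub candidate:", hub, "k-occurrences:", hub_count)
```

In this toy input, d3 appears in four neighbor lists and is the hub candidate, while d4 and d5 are linked only to each other. No all-pairs comparison is performed: each document is compared only with the entries of its own ranked list.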
RC.2. Outlier detection methods
• RC.2.1. Outlier Detection in Text Corpus Using Rare Frequency and Rank-
ing
This thesis proposes a set of novel algorithms, OIDF, ORFS, and ORNC,
using the concepts of ranking-based neighborhood and/or rare document
frequencies to identify the deviated documents from the inlier groups in
the corpus. The methods developed based on these categories of algorithms
contribute to a research area of much-needed attention. The simple concept
of inverse document frequency of terms is proposed as the first contribu-
tion to identify the outlier candidates in sparse text representation with
the OIDF algorithm. Empirical analysis shows that the use of this term
weighting-based ranking, to assign an outlier score for a document, accu-
rately identifies how deviated the document is from the common subgroups
in the corpus.
Additionally, the ranking scores generated by the IR system in response to
a document query are proposed to be used in a reverse manner to identify the
level of deviation of the document in the ORFS algorithm. Moreover, the
ORNC algorithm identifies the sub-dense hubs in high dimensional data
using the IR ranking responses with the k-occurrences and the anti-hubs
are proposed as outliers. A set of ensemble approaches, which combine the
concepts in OIDF with ORFS and ORNC, are proposed as optimal solutions
that boost accuracy, efficiency, and scalability of text outlier detection. This
research was submitted to the ACM Transactions on Knowledge Discovery
from Data (TKDD) journal and was accepted with major revision.
• RC.2.2. Text Outlier Detection using a Ranking-based Mutual Graph
(ORDG)
ORDG proposes an incremental graph-based method to identify the outliers
avoiding sparseness in the text representation. Using the inverse document
frequency of terms in a document, it first identifies the level of deviation
of a document from the inlier groups in the collection and identifies outlier
candidates. ORDG then presents a novel method to identify the outliers
that are deviated documents from a dense mutual neighbor graph, gener-
ated using IR ranking concept. The novel approach to construct the mutual
neighbor graph using IR results considering shared neighbors among docu-
ments is able to contain documents in inlier subgroups and forms hubs in
high dimensional data through shared nearest neighbors. ORDG proposes
the documents that are excluded from the graph, as well as those that do
not show similarity to hubs, as the next set of outlier candidates. The com-
mon outlier candidates identified by both these steps are proposed as the
final outliers in ORDG. This research has resulted in a journal paper, which
has been submitted to the Data & Knowledge Engineering Journal.
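The idea of scoring documents by the rarity of their terms can be sketched as follows. The exact scoring function of OIDF is not given in this section, so the mean idf per document used here, the idf form log(N/df), the toy corpus and the function names are all assumptions made for illustration only.

```python
import math

# Hypothetical pre-processed documents (lists of terms); d4 is off-topic.
docs = {
    "d1": ["text", "mining", "cluster"],
    "d2": ["text", "cluster", "graph"],
    "d3": ["text", "mining", "graph"],
    "d4": ["recipe", "oven", "text"],
}

def idf_weights(docs):
    """Inverse document frequency, assumed here as log(N / df)."""
    n_docs = len(docs)
    df = {}
    for terms in docs.values():
        for term in set(terms):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}

def idf_outlier_scores(docs):
    """Score each document by the mean idf of its distinct terms (assumption)."""
    idf = idf_weights(docs)
    scores = {}
    for doc_id, terms in docs.items():
        unique = set(terms)
        scores[doc_id] = sum(idf[t] for t in unique) / len(unique) if unique else 0.0
    return scores

scores = idf_outlier_scores(docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print("most deviated document:", ranking[0])
```

Documents dominated by terms that are rare in the collection receive higher scores, so the off-topic document d4 is ranked first in this toy corpus.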
RC.3. Text cluster evolution method (CaCE).
In this method, a novel global text cluster evolution approach is proposed to
track the full cluster life cycle over the time/domain. Based on the concept that
information loss in matrix factorization can be compensated by incorporating
additional information, CaCE proposes an NMF-based method to identify the
cluster groups in a corpus using both inter- and intra-cluster associations. The
semantic assistance obtained from the additional inter-cluster association makes
it possible to accurately identify the cluster groups along with the birth,
death, split and merge dynamics of clusters. CaCE presents the concept of
density, using term frequencies of the cluster, to identify the strength of the
association of clusters to a cluster group; loosely attached clusters are
separated from the group to enhance the accuracy of the detected cluster
dynamics. Another important contribution of the proposed CaCE is to display
clusters in the same group with links in a progressive k-partite graph over k
time intervals to discover emergence, persistence, growth and decay patterns in
clusters. This research has resulted in a journal paper,
which has been submitted to the Springer Knowledge and Information Systems
(KAIS) journal.
1.6 Publications Resulting from Research
A list of published/accepted/under review papers, included as part of the chapters
in this thesis, is given below.
• Paper 1. Wathsala Anupama Mohotti and Richi Nayak: An Efficient
Ranking-Centered Density-Based Document Clustering Method. Pacific-
Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp.
439-451. Springer (2018) (Will form part of Chapter 3)
• Paper 2. Wathsala Anupama Mohotti and Richi Nayak: Consensus and
Complementary Non-negative Matrix Factorization for Document Cluster-
ing. Elsevier Knowledge-Based Systems journal (Under Review). (Will
form part of Chapter 3)
• Paper 3. Wathsala Anupama Mohotti and Richi Nayak: Corpus-Based
Augmented Media Posts with Density-Based Clustering for Community De-
tection. International Conference on Tools with Artificial Intelligence (IC-
TAI), pp. 379-386. IEEE (2018) (Will form part of Chapter 3)
• Paper 4. Wathsala Anupama Mohotti and Darren Christopher Lukas
and Richi Nayak: Concept Mining in Online Forums using Self-corpus-
based Augmented Text Clustering. Pacific Rim International Conference
on Artificial Intelligence (PRICAI), pp. 397-402. Springer (2019) (Will
form part of Chapter 3)
• Paper 5. Wathsala Anupama Mohotti and Richi Nayak: Efficient Out-
lier Detection in Text Corpus Using Rare Frequency and Ranking. ACM
Transactions on Knowledge Discovery from Data (TKDD) (Accepted with
Major Revision). (Will form part of Chapter 4)
• Paper 6. Wathsala Anupama Mohotti and Richi Nayak: Text Out-
lier Detection using a Ranking-based Mutual Graph. Journal of Data &
Knowledge Engineering (Under Review). (Will form part of Chapter 4)
• Paper 7. Wathsala Anupama Mohotti and Richi Nayak: Discovering
Cluster Evolution Patterns with the Cluster Association-aware Matrix Fac-
torization. Springer Knowledge and Information Systems (KAIS) (Under
Review). (Will form Chapter 5)
1.7 Research Significance
Text is the natural way of communication used by people in many digital ap-
plications. All the methods in this thesis fall into the category of unsupervised
machine learning, which works in the absence of ground-truth data and
practically suits a real-world context. In an unsupervised setting, identifying
text similarity is a significant yet challenging step due to the high number of
dimensions and the sparseness of the text representation. The developed techniques
in the thesis successfully deal with this problem and contribute to three major
application areas. Further, they are able to be used in various domains.
Firstly, the thesis has advanced the popular field of document clustering by de-
veloping effective methods to discover subgroups from text document collections.
Theoretically, these methods propose a new perspective for the high-dimensional
text clustering. (1) They show a new direction of using IR-based neighborhood
to identify text similarity and density distribution with mutual neighborhood
graphs in naturally sparse data. (2) They provide efficient methods for text
similarity calculation that overcome the sparse representation through IR-based
frequent neighbors (hubs) or document expansion. (3) They show the importance
of learning more accurate cluster assignments, by incorporating nearest neighbor
information with document representation in dimensionality reduction to identify
the text similarity. Practically, a clustering method is useful to organize the text
data based on similarity in many applications, such as information retrieval [3],
social media analytics [79], opinion mining [3] and recommendation systems [118].
Additionally, this thesis proposes two methods for short text analysis in discov-
ering user communities in social media and concepts discussed in online forums.
Community detection in social media analysis is useful in identifying groups of
users with common interests to assist in viral and targeted marketing, political
campaigning, customized health programs, event identification, and many other
applications [88, 144, 147]. Concept mining that extracts participants’ cognitive
grouping is useful in improving e-learning and e-marketing [76, 120].
Secondly, the thesis has advanced the much-needed field of unsupervised text
outlier detection by developing effective methods to detect anomalies in the text
data. The methods in the thesis formally define a realistic text outlier
detection problem where the presence of outliers is identified from a number of
subgroups (instead of the entire document set). Furthermore, they propose an
innovative viewpoint of ranking and neighborhood concepts to identify these
deviations/outliers
based on text dissimilarity. The evaluation measures proposed in the thesis would
be useful to categorize the effectiveness of methods based on the error in out-
lier/inlier detection specifically. These measures should also be applicable in
traditional outlier detection methods. Outlier detection in static text data is
beneficial in many application domains for decision-making, such as web, blog
and news article management to identify the unusual/uncommon page or news
[96] as well as in dynamic settings to detect unusual events from the social media
posts that can be early warnings [79].
Last but not least, the thesis contributes to an emerging field of tracking the
dynamic changes in a text collection. Theoretically, the method proposes a novel
approach to use additional relationship information in handling the sparse text
representation to identify the evolving patterns in the clusters through cluster
similarity. It shows the importance of having assistance to avoid the information
loss in dimensional reduction. This NMF method is applicable to different do-
mains and proposes advancements in the popular matrix factorization methods.
With the popularity of big data in the last decade, tracking of document collec-
tions over a period is helpful in several applications such as finding dynamics of
terminologies, identifying concept drift, and emerging and evolving trends [63].
Tracking evolution across different domains provides insight into how the same
concept has been used over diverse domains. This is useful for policymakers and
project planners to mend their decisions, while discovering cluster dynamics over
the time in a specific field is useful for researchers, academics, and students in
that field to set up their publications, strategies, and research [73].
1.8 High Level Overview of the Thesis
This section connects the proposed methods and the common core concepts. It
then relates them to published/under-review papers.
[Figure 1.4 (diagram). Labels: Concepts/Methods: Density Estimation, Ranking Concepts, Non-negative Matrix Factorization, Document Expansion. Effective Text Similarity Calculation: Paper 1 (RDDC), Paper 2 (CCNMF), Paper 3 (augmented text for community detection), Paper 4 (augmented text for concept mining), Paper 5 (OIDF, ORFS and ORNC), Paper 6 (ORDG), Paper 7 (CaCE). Output: Clusters, Outliers, Cluster Evolution.]
Figure 1.4: Overview for unsupervised text mining methods
As shown in Figure 1.4, this thesis aims to identify similarity/dissimilarity be-
tween text instances dealing with the challenges in high dimensional text represen-
tation and obtain three outputs: clusters, outliers and cluster evolution patterns.
The proposed unsupervised text mining methods deal with the sparseness of text
representation using the novel concepts such as ranking or rare term frequencies,
ranking-based neighborhood graphs for density estimation, NMF with additional
information and self-corpus based document expansion.
Firstly, a set of text clustering methods are proposed to identify the subgroups
within a text collection. RDDC (Ranking-Centered Density-Based Document
Clustering Method) proposed in Paper 1 mainly aims to handle the high dimen-
sional and sparse nature of text using an IR ranking-based shared nearest neighbor
graph to identify the dense patches. CCNMF (Consensus and Complementary
Non-negative Matrix Factorization for Document Clustering Method) in Paper 2
uses IR-based nearest neighbors together with pairwise nearest neighbors to
compensate for the information loss in NMF. Corpus-based document expansion/augmentation
is proposed in Paper 3 for the problem of community detection considering text
posts of users in social media. Extreme sparseness in short text is avoided with
this expansion done through topic vectors in the text collection identified via
NMF. Another application of this concept is proposed in Paper 4 for concept
mining in online forums.
Secondly, a set of algorithms, namely OIDF (Outlier Detection Based On In-
verse Document Frequency), ORFS (Outlier Detection Based On Ranking Func-
tion Score), and ORNC (Outlier Detection Based On Ranked Neighborhood k-
Occurrences Count) are proposed in Paper 5 for outlier detection in text col-
lections. OIDF presents the core concept of using ranked terms based on inverse
document frequency, to identify the outliers, which usually contain uncommon
terms in the collection. ORFS aims to identify the outlier candidates using
the IR ranking-based nearest neighbors, where their ranking scores are used
inversely as an indicator of outlierness. Aligning with IR ranking-based
nearest neighbors, ORNC proposes fewer occurrences of a document among the
nearest neighbors of other documents as an indicator of an outlier. In Paper 6,
IR ranking responses identified as the nearest neighbors are used to construct a
mutual neighbor graph and to identify the hubs. This hub concept is used
together with a density estimation process on the mutual neighbor graph to
identify the inliers. ORDG then isolates the outliers, which are either not
part of the graph or deviate from it.
Lastly, the CaCE (Cluster Association-aware matrix factorization for discovering
Cluster Evolution) method in Paper 7 aims to identify the full cluster life cycle
and evolutionary patterns within clusters in a text collection. The core concept in
CaCE is an NMF-based approach to identify cluster groups with high dimensional
text cluster representation. These identified cluster groups are displayed across
the time/domain using k-partite graph to identify the evolving patterns globally.
In summary, this “thesis by publication” consists of the following six chapters.
• Chapter 1 provides a general overview of the thesis, including research ques-
tions, objectives, and significance.
• Chapter 2 reviews unsupervised text mining problems by focusing on in-
effectiveness of the existing methods in identifying text similarity due to
high dimensionality of text in the areas of clustering, outlier detection, and
cluster evolution. This chapter contains sections covering the general
text mining process, associated challenges, characteristics of the text and
different types of methods. A list of research gaps concludes the chapter and
leads to the development of the proposed methods in other chapters.
• Chapter 3 focuses on dealing with the problem of text clustering. An IR
ranking-based neighborhood is proposed in Paper 1 and Paper 2, with density
estimation and matrix factorization respectively, to handle the high
dimensional nature of the text that leads to sparseness. Extreme sparseness in
short text vectors is handled with document expansion in Paper 3 and
validated with another application in Paper 4. These four papers form
Chapter 3.
• Chapter 4 is about text outlier detection. Multiple approaches to calculate
outlier scores are proposed in Paper 5 using the efficient ranking concept.
IR ranking-based neighbors and ranking documents based on inverse doc-
ument frequency are proposed to cope with the sparse text representation.
Extension of this ranking based outlier detection for a graph-based method
is proposed in Paper 6. These two papers form Chapter 4.
• Chapter 5 concentrates on identifying cluster dynamics in a text collec-
tion. The CaCE method in Paper 7 is proposed to identify cluster life
cycles and evolution patterns using matrix factorization. The use of both
intra-cluster and inter-cluster association assists in dealing with sparse text
cluster representation for identifying cluster groups that are used in repre-
senting evolving patterns.
• Chapter 6 summarizes the thesis, presenting the significant results and
findings in line with the research objectives and identified research gaps
from Chapters 1 and 2. It concludes with recommendations for future
research directions.
Chapter 2
Literature Review and
Background
This chapter provides an overview of the current literature on unsupervised text
mining, giving focus to text similarity identification in clustering, outlier detec-
tion, and cluster evolution. The first part of the literature review (Section 2.1)
presents the importance of text data analytics, text mining process and the nature
of the text data with term modelling. The next section (Section 2.2) highlights
the key concept of finding similarity among text documents that is fundamental
to text mining methods. The main focus of this thesis is to propose alternative
text similarity calculation techniques and develop a set of novel unsupervised text
mining methods (e.g., clustering, outlier detection and cluster evolution). The
subsequent sections present more details on clustering, outlier detection and clus-
ter evolution methods. Section 2.3.1 provides traditional and recent developments
in text clustering methods. Outlier detection methods and their applicability in
high-dimensional data, including text, are provided in Section 2.3.2. The final
section presents methods for detecting dynamic changes to the text, which can
be related to cluster evolution. This chapter concludes by highlighting the
research gaps with regard to the main focus areas of the thesis.
2.1 Text Mining
The advancement of digital technology in the current era has resulted in exponen-
tial growth in text data. Reports suggest that 95% of the unstructured digital
data appears in text form [86]. For instance, most of the human interactions
with digital systems are in the form of free text such as emails, wikis, blogs and
digital news feeds [60]. Social media platforms disseminate trending information
based on users’ short-text communication over time. Search engines are another
popular internet medium that store (or index) large text collections. These
text sources play an important role in several applications. Table 2.1 reports the
top 10 websites according to the internet traffic statistics of Alexa¹. It can be
seen that five of them (as highlighted in bold) are primarily driven by text
media.
Table 2.1: Internet traffic report by Alexa on August 15th, 2019
Rank  Website
1     Google
2     Youtube
3     Tmall.com (Chinese shopping site)
4     Baidu (Chinese search engine)
5     Facebook
6     Qq.com (Chinese internet service portal)
7     Sohu.com (Chinese shopping site)
8     Taobao.com (Chinese market place)
9     Wikipedia
10    Yahoo
¹ https://www.alexa.com/topsites
Text mining is a process of discovering useful information from text document
collections that has diverse applications [3]. For instance, content management
that facilitates efficient and effective information retrieval from document reposi-
tories [106] relies on organizing the content with the use of clustering. In opinion
mining or concept mining, clustering is used to extract the set of related terms
that represent cognitive groupings [150]. In social media analytics, community
detection and recommendation systems use text mining to identify users with
similar interests based on their text communications [79, 147]. Moreover,
suspicious content detection, which identifies fake news or unusual events in
social media communication, uses text mining methods to identify deviations
from the norm [45]. In addition, associated terminologies or concepts in text
repositories change over time or across domains and show varying trends. It is
useful for practitioners of diverse disciplines to mine these data to identify
decaying, current and emerging concepts, which facilitates trend analysis [4].
2.1.1 Text Mining Process
A standard text mining process follows a series of activities as shown in Fig. 2.1.
Text data generated from different sources is initially cleaned using the pre-
processing steps such as stop word removal, stemming or lemmatizing to keep
only the important information [3]. A short text that appears in social media
shows unstructured phrases and abundant information such as URLs and hash-
tags, which require special pre-processing [79]. The text is then transformed into
a data model with each document represented as a vector of terms that make it
suitable for performing mining. Primarily, documents can be represented as a bag
of words (BOW), considering the number of occurrences of each term but ignoring
the order [3]. This results in a Vector Space Model (VSM) that can be augmented
using different term weighting models such as binary, tf, idf and tf∗idf [37]. The
purpose of a term weighting model or a feature learning technique is to identify
the important features in the text mining process.
[Figure 2.1 (diagram). Stages: Text Documents (web pages, news articles, social media posts); Text Preprocessing (text cleaning, tokenization); Text Transformation (bag of words, vector space); Feature Selection and Representation (term weighting, feature learning); Data Mining using Text Similarity Identification (clustering, outlier detection, evolution tracking); Interpretation/Evaluation (qualitative, quantitative).]
Figure 2.1: General text mining process
A text collection usually contains a large set of terms with little word
co-occurrence among documents [77]. This results in a sparse VSM [3]. A myriad
of text mining methods have been developed to deal with sparse and
high-dimensional data matrices in identifying text similarity. These methods explore
interesting patterns such as clusters, outliers or evolution in text collections.
Finally, the text mining results are evaluated with several quantitative
and qualitative methods. Accuracy, F1-score, Normalized Mutual Information
(NMI), False Positive Rate (FPR) and False Negative Rate (FNR) are popular
extrinsic evaluation measures [133]. In addition, intrinsic measures such as the
silhouette index for clustering and topic coherence for topic modeling are used
to quantitatively measure the validity of results [134]. Furthermore, case studies
with top word analysis or word-cloud visualization are utilised for qualitative
interpretations [72].
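As a minimal illustrative sketch of one of the extrinsic measures named above, NMI between
ground-truth labels and predicted cluster labels can be computed as follows. The geometric-mean
normalization, the natural logarithm and the function names are assumptions made for this
illustration; other normalizations are also in use.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(true_labels, pred_labels):
    """Mutual information normalized by sqrt(H(true) * H(pred)) (assumed variant)."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    mi = 0.0
    for (t, p), c in joint.items():
        mi += (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
    h_true, h_pred = entropy(true_labels), entropy(pred_labels)
    if h_true == 0.0 or h_pred == 0.0:
        # Degenerate case: at least one labeling puts everything in one group.
        return 1.0 if h_true == h_pred else 0.0
    return mi / math.sqrt(h_true * h_pred)

# NMI ignores the names of the cluster labels, only the grouping matters.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```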
2.1.2 Text Feature Representation
To perform the similarity calculation between text pairs for data mining, useful
features of text data need to be represented numerically. The frequencies of terms
in a document provide powerful insight for identifying these useful
features. Generally, different weighting techniques are used to represent the
importance of a term in a document as a numerical value, so that important
features are emphasized [53]. The simplest weighting technique is term frequency (tf),
which assigns each term t a weight equal to its number of occurrences in the document d
[133]. The tf weighting technique treats every term as equally important, so terms
with little or no discriminating power between groups receive the same priority as
discriminative terms. The inverse document frequency (idf) weighting technique solves this
issue by considering the document frequency (df), which represents the number
of documents in the collection that contain term t [133]. The idf scales down the
weight of frequent terms to discriminate between documents in a document collection
of size N, as given below.
\[ idf_t = \log\left( \frac{N}{df_t} \right) \tag{2.1} \]
This weighting model gives high values to rare terms and lower values to frequent
terms. The most popular weighting model used in text representation is tf ∗ idf,
which combines term frequency and inverse document frequency as given below.
\[ tf \ast idf = tf_{t,d} \times idf_t \tag{2.2} \]
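The two weighting formulas (Eq. 2.1 and Eq. 2.2) can be illustrated with a short Python sketch.
It assumes documents are already tokenized, uses the natural logarithm (the base is not fixed
in the text) and uses the hypothetical function name tf_idf.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf * idf} dict per document."""
    n_docs = len(docs)                                        # N in Eq. 2.1
    df = Counter(term for doc in docs for term in set(doc))   # df_t
    idf = {term: math.log(n_docs / df_t) for term, df_t in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)                                     # tf_{t,d}
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights

docs = [["text", "mining", "text"], ["text", "clustering"], ["outlier", "detection"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this toy collection the rare term "mining" receives a higher idf than the more frequent
term "text", and a term that occurs in every document would receive a weight of zero.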
However, the basic term weighting methods neglect the semantic relatedness
between different words [42]. To address this problem, recent text mining research
assigns weights by considering the context of the terms [117, 136, 168]. The
initially proposed methods use the Skip-Gram model to learn distributed word
embeddings. The Skip-Gram model is a training method for neural networks that
learns the neighbors, or context, of a word in a corpus for word embedding
[136, 137]. It predicts the surrounding words of a specific word within a fixed
window. This concept is used to obtain a dense document representation by
considering the most frequently co-occurring words [115].
Extending this concept, modeling the contextual information of words with Skip-Gram
with Negative-Sampling (SGNS) was proposed [117, 168]. The word association
relationships used in these methods are similar to those in Skip-Gram and
SGNS modeling for word embedding. In [117, 168], the word association matrix S
is modeled with SGNS to highlight the weight of words that are closely associated.
This uses other words in the vocabulary as contexts for a specific word. If w and
c denote a word and one of its contexts respectively, where #(w, c) denotes the
number of (w, c) pairs in the collection, each element Sw,c of S is defined as
follows [168]:
\[ S_{w,c} = \left[ \log\left\{ \frac{\#(w,c) \times T}{\#(w) \times \#(c)} \right\} - \log(k) \right] \tag{2.3} \]
Here, T is the total number of word-context pairs and k is the number of negative
samples, in line with word embedding. This k is 1 for a considered word-context
pair, and negative sampling tries to maximize the probability of observed
word-context pairs being 1 (i.e., P (S = 1|w, c)) while driving unobserved
word-context pairs towards 0 (i.e., P (S = 0|w, c)) within the word association
matrix. In [117], SGNS modeling is shown to be equivalent to factorizing a
(shifted) word correlation matrix: SGNS implicitly factorizes a word-context
matrix whose cells are the point-wise mutual information of the respective word
and context pairs. Thereby, SGNS effectively
covers the entire collection and gives meaningful weight to the contexts of a word.
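A small sketch can make the word association matrix concrete. It reads Eq. 2.3 as the
point-wise mutual information of a word-context pair shifted by log(k). Several choices are
assumptions for illustration only: contexts are taken to be the other words in the same
document, #(w) is the number of pairs in which w appears, the natural logarithm is used, and
unobserved pairs are simply left out of the result.

```python
import math
from collections import Counter

def sgns_association(docs, k=1):
    """Return {(w, c): S_wc} for observed word-context pairs (illustrative sketch)."""
    pair_counts = Counter()                      # #(w, c)
    for doc in docs:
        terms = sorted(set(doc))
        for w in terms:
            for c in terms:
                if w != c:
                    pair_counts[(w, c)] += 1
    total = sum(pair_counts.values())            # T
    word_counts = Counter()                      # #(w); equals #(c) as pairs are symmetric
    for (w, _c), n in pair_counts.items():
        word_counts[w] += n
    return {
        (w, c): math.log(n * total / (word_counts[w] * word_counts[c])) - math.log(k)
        for (w, c), n in pair_counts.items()
    }

docs = [["hub", "neighbor"], ["hub", "neighbor"], ["topic", "model"]]
for pair, weight in sorted(sgns_association(docs).items()):
    print(pair, round(weight, 3))
```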
2.2 Text similarity
The primary aim of text mining is to analyze digital text data sources to discover
interesting patterns. In this context, text similarity plays a major role in
identifying similar or deviating text patterns in order to find clusters, outliers
or trends.
2.2.1 Distinct Text Characteristics
This section details some specific characteristics of text data that arise from
its sparse and high-dimensional nature and that affect the process of identifying
text similarity in text mining.
Distance Concentration
The high-dimensional nature of text leads to different issues in analyzing
document collections [3]. If text document pairs (represented as large vectors)
are compared with Euclidean distance measures, there is little difference
in the distance between different pairs due to the sparsity of the vectors.
The difference in distance between far and near data points becomes negligible, as
shown in Fig. 1.1 (c), which is known as distance concentration [205]. This poses
a major challenge for text mining methods in differentiating similar and dissimilar
text data based on the terms they share.
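Distance concentration can be observed in a small simulation. The sketch below is only an
illustration under simplifying assumptions: it uses uniformly random dense vectors rather than
real sparse text vectors, and it measures the relative contrast (farthest minus nearest distance,
divided by the nearest distance) from a random query point. The function name and parameters
are hypothetical.

```python
import math
import random

def relative_contrast(dim, n_points=100, seed=0):
    """(d_max - d_min) / d_min of Euclidean distances from one random query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)

# The contrast between the nearest and farthest point shrinks as dimensionality grows.
for dim in (2, 20, 500):
    print(dim, round(relative_contrast(dim), 3))
```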
Figure 2.2: Skewness of k-NN [154] (panels (a), (b), (c))
Note: N5 represents the number of times a point occurs among the k=5 nearest neighbors of all other points in the dataset. Empirical distribution of N5 with Euclidean (l2), fractional l0.5 (proposed for higher-dimensional data) and cosine (cos) distance functions, where d represents the number of dimensions.
Figure 2.3: Skewness of hubs [175]
Hubness Property
Text data has been shown to exhibit the hub phenomenon, which is evident in
high-dimensional data, i.e., “the number of times some points appear among k-NN
of other points is highly skewed” [177], as illustrated in Fig. 2.2. As dimensionality
increases across Fig. 2.2 (a)-(c), the observed distributions of k-NN deviate from the
random graph model and become more skewed to the right. These characteristics
can be shown using the reverse neighbor count, which indicates the number of
times a point appears among the nearest neighbors of the other points in the
collection [155]. The reverse neighbor count of hub points shows significant
skewness, as depicted in Fig. 2.3. It shows the two extremes in the
high-dimensional case: (a) more very rarely co-occurring pairs and (b) more very
frequently co-occurring pairs. The frequent nearest neighbors of the collection
are hubs. Most importantly, the data points in high-dimensional data tend to be
closer to these hubs than to the cluster mean [176]. This property has been used in
recent text mining methods to avoid distance concentration and sparseness-related
issues in determining text similarity [78, 173]. These methods assign documents to
clusters by checking the closest hub point, instead of comparing the cluster
centers, as shown in Fig. 2.4.
Figure 2.4: Clustering with distance to the centroid and clustering with hub-similarity [174]
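The reverse neighbor count described above can be sketched in a few lines of Python. This is
an illustrative brute-force version that assumes Euclidean distance and random Gaussian data;
the function name k_occurrences is hypothetical. Points with a count well above k would be the
hub candidates.

```python
import math
import random

def k_occurrences(points, k=5):
    """N_k: how often each point appears in the k-NN lists of the other points."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for _dist, j in others[:k]:
            counts[j] += 1
    return counts

rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(100)]
n5 = k_occurrences(points, k=5)
# The mean of N_k always equals k; skewness shows up as a few points with large counts.
print("max N5:", max(n5), "points never chosen:", n5.count(0))
```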
2.2.2 Text Similarity Measures
Pairwise Text Similarity
Pairwise text comparison using terms in the VSM is one of the most common
techniques for identifying text similarity. Cosine similarity is a popular measure
based on the cosine of the angle between two vectors. Let Vd1 and Vd2 be the
term vectors that numerically represent the two documents d1 and d2. The
cosine similarity between these documents can be computed as below.
\[ \text{Cosine similarity}(V_{d_1}, V_{d_2}) = \cos(\theta) = \frac{V_{d_1} \cdot V_{d_2}}{|V_{d_1}|\,|V_{d_2}|} \tag{2.4} \]
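Eq. 2.4 translates directly into code. The sketch below assumes both documents are given as
equal-length lists of term weights over the same vocabulary; returning 0 for an all-zero vector
is an assumption made to avoid division by zero.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two term-weight vectors (Eq. 2.4)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm1 * norm2)

# Vectors pointing in the same direction are maximally similar regardless of length.
print(cosine_similarity([1, 2, 0], [2, 4, 0]))
```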
Using the Euclidean distance between text vectors as the pairwise comparison
measure is found to give rise to the distance concentration issue [205]. Some
other popular measures used for pairwise comparisons are Jaccard similarity,
the Pearson coefficient and KL divergence [81]. The Jaccard coefficient compares the
number of shared terms to the number of distinct terms that are present in either
of the two documents. Let td1 and td2 be the sets of terms in
d1 and d2 respectively. The Jaccard coefficient between these documents can be
computed as below.
\[ \text{J similarity}(d_1, d_2) = \frac{|t_{d_1} \cap t_{d_2}|}{|t_{d_1}| + |t_{d_2}| - |t_{d_1} \cap t_{d_2}|} \tag{2.5} \]
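A minimal sketch of Eq. 2.5 on term sets is shown below. Treating two empty documents as
having zero similarity is an assumption for illustration.

```python
def jaccard_similarity(terms1, terms2):
    """Shared terms divided by all distinct terms of the two documents (Eq. 2.5)."""
    t1, t2 = set(terms1), set(terms2)
    shared = len(t1 & t2)
    denominator = len(t1) + len(t2) - shared
    return shared / denominator if denominator else 0.0

# Two of the four distinct terms are shared.
print(jaccard_similarity(["text", "mining", "hub"], ["mining", "hub", "rank"]))  # 0.5
```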
Pearson’s correlation coefficient is another measure based on vector statistics. Let
the term set T = {t1, t2, ..., tm} and wti,d1 represent the weight of ti ∈ d1. There
are different forms in defining this coefficient; the most commonly used form is
as follows [81].
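\[ \text{P similarity}(d_1, d_2) = \frac{m \sum_{i=1}^{m} w_{t_i,d_1} \times w_{t_i,d_2} - TF_{d_1} \times TF_{d_2}}{\sqrt{\left[ m \sum_{i=1}^{m} w_{t_i,d_1}^{2} - TF_{d_1}^{2} \right] \left[ m \sum_{i=1}^{m} w_{t_i,d_2}^{2} - TF_{d_2}^{2} \right]}} \tag{2.6} \]
where \( TF_{d_1} = \sum_{i=1}^{m} w_{t_i,d_1} \) and \( TF_{d_2} = \sum_{i=1}^{m} w_{t_i,d_2} \).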
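Eq. 2.6 can be checked numerically with the following sketch, which assumes both documents
are given as weight lists over the same m terms. Returning 0 when a vector has no variance is an
assumption made to avoid division by zero.

```python
import math

def pearson_similarity(w1, w2):
    """Pearson correlation of two term-weight vectors, following Eq. 2.6."""
    m = len(w1)
    tf1, tf2 = sum(w1), sum(w2)                       # TF_d1 and TF_d2
    numerator = m * sum(a * b for a, b in zip(w1, w2)) - tf1 * tf2
    var1 = m * sum(a * a for a in w1) - tf1 ** 2
    var2 = m * sum(b * b for b in w2) - tf2 ** 2
    denominator = math.sqrt(var1 * var2)
    return numerator / denominator if denominator else 0.0

print(pearson_similarity([1, 2, 3], [2, 4, 6]))   # perfectly correlated weights
print(pearson_similarity([1, 2, 3], [3, 2, 1]))   # perfectly anti-correlated weights
```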
In KL divergence, corresponding probability distributions of the documents are
considered for identifying similarity. Let wti,d1 represent the weight of ti ∈ d1.
The divergence between two distributions of words in d1 and d2 will be:
\[ \text{KL similarity}(d_1 \| d_2) = \sum_{i=1}^{m} w_{t_i,d_1} \times \log\left( \frac{w_{t_i,d_1}}{w_{t_i,d_2}} \right) \tag{2.7} \]
These pairwise computations are known to possess high time complexity for larger
datasets [145]. This challenges traditional text mining to learn patterns such as
clusters or dynamic changes to clusters in larger datasets.
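The KL divergence of Eq. 2.7 can be sketched as follows. The sketch assumes the weights of each
document are normalized to a probability distribution over the same terms; the small constant
eps that replaces zero weights in the second document is a smoothing assumption, since the
logarithm is otherwise undefined.

```python
import math

def kl_divergence(w1, w2, eps=1e-12):
    """Sum of w1 * log(w1 / w2) over the shared term index (Eq. 2.7)."""
    total = 0.0
    for a, b in zip(w1, w2):
        if a > 0:                       # terms absent from d1 contribute nothing
            total += a * math.log(a / max(b, eps))
    return total

p, q = [0.5, 0.5], [0.9, 0.1]
# Unlike the other measures, KL divergence is not symmetric.
print(round(kl_divergence(p, q), 4), round(kl_divergence(q, p), 4))
```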
Shared Nearest Neighbor for Text Similarity
As an alternative to the aforementioned syntactic approaches, similarity between
text documents can be modeled with the concept of Shared Nearest Neighbor (SNN)
to effectively identify the density distribution [92]. The SNN concept defines
the similarity between documents based on the number of neighbors they share
[55]. Two mutually neighboring documents are represented by adjacent nodes,
and the edge weight between them represents the number of neighbors those two
documents share [55], as depicted in Fig. 2.5. This allows mutually
connected documents to be identified as similar documents based on their
connectedness. However, the number of pairwise comparisons needed to identify the
mutual neighbors based on the shared neighbors becomes very high in large collections.
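The SNN graph construction can be illustrated with a short sketch. It assumes the k-nearest
neighbor list of every document has already been computed (by any similarity measure) and is
passed in as a dictionary of sets; the function name and the toy neighbor lists are hypothetical.

```python
def snn_graph(neighbors):
    """neighbors: {doc_id: set of nearest-neighbor ids}.
    Returns {(a, b): number of shared neighbors} for mutual neighbors a < b."""
    edges = {}
    for a, nn_a in neighbors.items():
        for b in nn_a:
            # Keep an edge only if a and b appear in each other's neighbor lists.
            if a < b and a in neighbors.get(b, set()):
                edges[(a, b)] = len(nn_a & neighbors[b])
    return edges

knn = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 4},
    4: {3, 0, 1},
}
print(snn_graph(knn))
```

Documents 0 and 1 are mutual neighbors that share documents 2 and 3, so their edge weight is 2,
whereas document 4 lists documents 0 and 1 as neighbors without being listed by them and
therefore gets no edge to either.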
Ranking functions for Text Similarity
Information Retrieval (IR) is an established field that uses the document simi-
larity concept to provide the ranked results in response to the user query [59].
[Figure labels: Shared Nearest Neighbors; Mutual Neighbor Nodes; Edge weight]
Figure 2.5: Mutual Neighbors that share common documents
An IR system is able to process a keyword or document query efficiently with an
inverted index data structure and retrieve a list of matched (or similar)
documents [174]. IR systems use different ranking functions to find the best-matched
set of responses. A ranking function considers how important an individual word
is to the document and within the document collection, as well as the document
length [53], to provide a statistical comparison between a query and the returned
documents based on text similarity. There is a wide variety of retrieval functions,
ranging from language models such as tf ∗ idf to BM25, which focuses on
probabilistic retrieval [59]. While functions based on language models use
concepts similar to those in term weighting models, Okapi Best Matching 25 (BM25)
and its newer variations such as BM25f estimate how relevant a specific document
is to a query [59].
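To illustrate how a ranking function combines term importance, collection statistics and
document length, the following sketch scores documents with a commonly used form of BM25. It is
not the implementation of any cited system: the parameter values k1 = 1.2 and b = 0.75, the
smoothed idf variant and the linear scan over documents (instead of an inverted index) are all
assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (score, doc_index) pairs sorted from most to least relevant."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter(term for d in docs for term in set(d))
    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scored.append((score, idx))
    return sorted(scored, reverse=True)

docs = [
    ["text", "mining", "methods"],
    ["text", "clustering", "methods"],
    ["image", "retrieval"],
]
# A whole document can also be used as the query, as in document-driven queries.
print(bm25_rank(["text", "clustering"], docs))
```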
Recently, the ranking concept has been used to find similar documents in a
document collection [30, 173, 174]. These few studies are based on the cluster
hypothesis [91], which states that “associated documents appear in a returned result
set of a query”. Several studies have validated this by showing that clustering
can improve ranking or retrieval performance [156, 171]. The optimum
clustering framework [59] reversed this cluster hypothesis and stated that “the
returned documents in response to a query will appear in the same cluster”. This
hypothesis has recently been used in clustering methods based on the conjecture
that the document set returned in response to a query (i.e., a document
representation) can be considered as nearest neighbors [174]. The methods in
[30, 173] used the ranking function employed in an IR system to generate a
document neighborhood from the relevant document set, without expensive pairwise
document comparisons. In addition, the cluster hypothesis and the reverse
cluster hypothesis show that semantic relationships and statistical
perception are embedded in IR-based text similarity identification. However,
these methods utilize only the ranked set of response documents as the
neighborhood and ignore the associated ranking scores given to those documents by
IR systems. These scores carry important information about the level of similarity
of the response documents to the document query.
Generally, the methods that use a few keywords to form queries for identifying
text similarity neglect the underlying semantics of the terms [196]. It is
important to consider semantic representations of words from local co-occurrences
in sentences to obtain coherent document similarity [196]. The methods [173, 174]
that used IR systems to identify the relevant documents for clustering address
this by forming document queries that consider the statistical distribution of
terms. In particular, those document queries consider all the terms in the
document and systematically retrieve the most probable terms to represent the
document. These document-driven queries return a semantically related set of
documents as relevant documents, as stated in the cluster hypothesis [91] and the
reverse cluster hypothesis [59].
Semantic Information for Text Similarity
Word Mover’s Distance (WMD) is a recently proposed metric that combines semantic
and syntactic information to measure the similarity between text documents [111].
It utilizes the property of word vector embedding and treats text documents as
a weighted point cloud of embedded words. In WMD, the distance between two
text documents is calculated by the minimum cumulative distance that words
from one text document need to travel, to match the point cloud of another text
document [111]. However, WMD shows a cubic time complexity growth with the
number of unique words in the documents. The Relaxed Word Mover’s Distance
(RWMD) is an extended version of WMD that is proposed by [111] to reduce
this time complexity from cubic to quadratic with a limited loss in accuracy
compared to WMD. Nevertheless, all these methods are expensive compared to
other similarity measures.
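The idea behind the relaxed variant can be sketched briefly. In the sketch below each word of
one document moves all of its normalized frequency mass to the closest word of the other
document, and the larger of the two directional costs is returned. The two-dimensional toy
embeddings, the Euclidean word distance and the function name are assumptions for illustration,
not values taken from [111].

```python
import math
from collections import Counter

def relaxed_wmd(doc1, doc2, embeddings):
    """Relaxed WMD: max of the two one-sided nearest-word transport costs."""
    def one_sided(source, target):
        weights = Counter(source)
        total = sum(weights.values())
        cost = 0.0
        for word, count in weights.items():
            nearest = min(
                math.dist(embeddings[word], embeddings[other]) for other in set(target)
            )
            cost += (count / total) * nearest
        return cost
    return max(one_sided(doc1, doc2), one_sided(doc2, doc1))

# Hypothetical toy embeddings; real applications would use pre-trained word vectors.
toy_embeddings = {"king": [1.0, 0.0], "queen": [0.9, 0.1], "apple": [0.0, 1.0]}
print(relaxed_wmd(["king"], ["queen"], toy_embeddings))
print(relaxed_wmd(["king"], ["apple"], toy_embeddings))
```

Each document pair requires a nearest-word search over both vocabularies, which is why this
relaxation is still quadratic in the number of unique words.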
2.3 Unsupervised Text Mining Methods
Text mining methods broadly follow two approaches: (1) supervised learning,
when training data with labels is provided [88, 105], and (2) unsupervised
learning, when labeled data is not available [68, 96]. The latter case is common in
real-world scenarios, as text mining methods are employed in digital document
repositories to identify natural sub-groups or clusters [8] and exceptional
documents [96] without manually annotated data. Similarly, the use of supervised
learning to identify dynamic changes of clusters in text collections is infeasible.
Some research has used fully supervised [105] or semi-supervised
approaches [173] to identify text clusters and text outliers. However, this thesis
focuses on the more complex problem of text mining in which fully unsupervised
approaches for text clustering, outlier detection, and cluster evolution are explored,
with the focus on handling the sparse and high-dimensional nature of text in
identifying the similarity between text pairs.
2.3.1 Text Clustering
Text clustering, which aims to extract useful information from unlabelled data
by finding natural groups based on data similarities, is a major paradigm in text
mining [3]. An overview of text clustering methods is provided in Fig. 2.6.
Traditional Clustering Methods
Major traditional clustering methods can be classified as partitional, hierarchical,
dimensionality reduction and density-based clustering [8]. These methods face
different challenges when applied to high-dimensional data such as text.
Centroid-based partitional methods such as k-means are known to suffer
from the distance concentration problem when the dimensionality is high and the
distribution is sparse [177]. Hierarchical clustering suffers from the same problem
due to the requirement of multiple pairwise computations at each step of decision
making [8]. Besides, it is one of the most computationally expensive methods
[8].
Density-based methods such as DBSCAN [57] and OPTICS [9, 179] have been
found to be highly efficient on spatial data. They generate clusters of diverse
shapes naturally around the density spikes without taking the required number of
clusters as an input. Though this is a desirable property for document clustering,
the sparse nature of text representation makes these methods hard to apply to
text clustering. They are unable to identify the dense patches in the sparse
text representation, which has few word co-occurrences, to form clusters.
Additionally, methods such as DBSCAN identify the neighborhood region of core
dense points with distance measurements such as Euclidean distance. This technique,
employed for neighborhood inquiry to expand clusters, does not scale well to
high-dimensional feature spaces. Furthermore, this neighborhood inquiry process
consumes a large amount of memory, as is evident in the experiments presented in [174].
[Figure labels: Text Clustering; Traditional; Partitioning; Hierarchical; Dimensionality reduction; Density-based; Probabilistic; Matrix Factorization; Spectral; SNN; Recent Developments; Hub-based; IR Ranking-based; Semantic assistance-driven; Self corpus-based document expansion; Deep Learning; Semi-supervised feature learning; Active Learning]
Figure 2.6: An overview of text clustering methods
The SNN concept identifies mutual neighbor documents based on the shared
neighbors. This concept enables relatively uniform regions to form a graph and
clusters to be identified by differentiating varying densities [92]. In document
clustering, where data representation is naturally sparse, this is an ideal solution
for identifying dense regions [55]. However, the computation of an SNN graph or
mutual neighbor graph is expensive due to the high number of pairwise comparisons
required to identify the nearest neighbors. This prompts investigation into
more efficient methods to identify the nearest neighbors for building an SNN graph.
Dimension reduction methods such as generative probabilistic clustering, random
projection or matrix factorization are also commonly used to find clusters in
high-dimensional text by approximating its low-dimensional representation
[3]. Latent topic models such as Latent Semantic Indexing (LSI), Probabilistic
Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA) apply
dimension reduction to find the semantic space and its relationship to
the high-dimensional BOW representation. The new representation in the semantic
space reveals the topical structure of the corpus more clearly than the original
representation [3]. Methods such as LSI identify the semantic space through
lower-rank approximation via matrix factorization [20], while methods such as PLSA
and LDA use probability estimation to predict the lower-dimensional space [202].
In all these methods, information loss is inevitable when projecting from a higher
to a lower dimension, as they are unable to maintain the geometric structures that
exist in the higher-order space. In addition, the resources required for the
lower-order approximation of a text collection through optimization or iterative
probability approximation increase with the size of the dataset [3, 8]. These
challenges create the need for an improved dimensionality reduction method with
minimum information loss.
Spectral clustering is another dimensionality reduction method that identifies the
non-convex geometric structures in the data [143]. Spectral methods project the
original data into a new coordinate space by encoding information about how
near data points are to each other. This transformation reduces the dimensionality
of the space and pre-clusters the data into orthogonal dimensions. However, these
methods depend on the eigenvectors selected from the Laplacian matrix, which is
generated from the affinity matrix, to perform clustering [141]. The selected
eigenvectors of a Laplacian matrix generated from the original data matrix cannot
successfully cluster datasets that contain structures at different scales of size
and density, as in text [141]. Furthermore, this two-step process in spectral
methods results in high time complexity. Effective approaches to incorporate the
geometric structures inherent in a document collection, and to use them in
clustering, need to be investigated.
Non-negative Matrix Factorization (NMF) is a special variation of dimensionality
reduction that is suitable for the text domain [3]. The VSM representing
text data, which is naturally non-negative, is decomposed into two (non-negative)
lower-rank factor matrices [8]. The lower-rank factor matrices represent
the groups of terms and the groups of documents on the basis of shared terms,
where the reduced rank can represent the number of clusters in the data [116].
However, information loss is inevitable in this lower-rank approximation as well.
Neighboring points in the high-dimensional space do not remain close in the
projected lower-order space, which destroys the geometric structure. This highlights
the need for enforcing geometrical relationships with NMF.
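The decomposition described above can be sketched with the multiplicative update rules of Lee
and Seung for the squared reconstruction error, which are one standard way to fit NMF and not
necessarily the solver used in the cited works. The document-term matrix, the rank, the
iteration count and all function names are hypothetical choices for illustration.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix V (docs x terms) into W (docs x rank), H (rank x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h

# Toy document-term matrix: documents 0-1 and 2-3 use disjoint vocabularies.
v = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
w, h = nmf(v, rank=2)
# Assign each document to the factor with the largest coefficient in W.
print([row.index(max(row)) for row in w], round(frobenius_error(v, w, h), 3))
```

Because the rank is smaller than the original dimensionality, the reconstruction error generally
stays above zero, which corresponds to the information loss discussed above.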
Recent Developments in Clustering
In high-dimensional data such as text data, the hubness property, where some
data points occur more frequently in k-nearest neighbor lists than others, has
been used to find similarity and to determine the cluster label of a point [176, 177].
Documents are compared to hubs instead of cluster centers. This overcomes the
difficulty of distinguishing distances between data points that is faced in
partitional document clustering [78]. It avoids the tendency of distances between
all pairs of points in high-dimensional document clustering to become almost equal
when using a centroid approach [78]. Researchers have attempted to improve
hub-based clustering by changing the hub selection approach, for example with
weighted relative hubness or Silhouette information-based hubness [70]. However,
the conventional hub calculation and hub similarity calculation do not scale to
larger document collections, as the similarity between the top hub points in all
clusters should be calculated to determine the correct clusters [78]. The high
computational complexity of this concept is a bottleneck in text clustering. This
creates the need for efficient ways to calculate the hubs and hub similarity.
In parallel, proven methods of IR ranking-based query-document matching have
recently been used in document clustering to achieve efficiency in finding similar
documents [30, 173]. The main computational bottleneck in k-means is the need
to recompute the nearest centroid for every data point at every iteration [30].
IR ranking is used in this similarity calculation to reduce the cost by using the
centroid document as a query to choose documents through a response list [30].
Inversely, the document that needs to be assigned is used as a query to select
the relevant clusters [172], as in Fig. 2.7 (a). This improves the clustering
performance by comparing the documents with only the most relevant cluster centers.
These most relevant clusters are generated using IR responses, which reduces the
need for calculating all the pairwise distances. In [173], hub points that are
frequent nearest neighbors were generated using IR ranking responses. Hubs are
created as dynamic cluster representations, called Loci, for a target document using
ranking results and are used to assign a data point to a cluster by considering the
closest Loci, as given in Fig. 2.7 (b). However, this is a semi-supervised approach
in which the initial cluster labels assigned to the documents are used to guide the
cluster assignment through hubs [173]. All the existing ranking-based clustering
works explore the applicability of ranking-based document similarity in
partitional document clustering, and there is a lack of research investigating its
applicability to other approaches such as density-based clustering and matrix
factorization.
Assistance-driven matrix factorization [130, 168], which aims to identify text
similarity effectively, is another recent development in text clustering. A set of
methods has been proposed to assist the factorization of the VSM of documents with
additional information to enhance the clustering decision. In [168], NMF-based
document clustering is assisted by the semantic information given by term adjacency
within the corpus. Another set of methods uses manifold learning to assist NMF
[31, 130, 198]. The inclusion of neighborhood information that highlights
geometric structures among documents improves the accuracy of the lower-dimensional
approximation [130]. Besides that, co-clustering is another branch of methods
that assists matrix-based document clustering through a two-way process [66].
Co-clustering simultaneously clusters documents and words to improve the clustering
solution, where word clustering induces document clustering, while document
clustering induces word clustering [44]. These interesting extensions that assist
NMF to minimize information loss need to be thoroughly explored to find
ways to generate neighborhood information effectively and model it in the
factorization process.
Figure 2.7: The use of ranking for clustering [174]
In addition, different semi-supervised approaches have been used to improve text
clustering methods [173, 201]. In density-based methods, active learning ap-
proaches by enforcing different levels of constraints have been used in document
clustering with DBSCAN [201].
Short Text Clustering
Microblogging services are popular social networking platforms where people
engage with others through text communications. Similarly, online forums assemble
views written by the participants. They use only short text for communication
[202]. These text sources introduce a distinct type of text that creates
additional problems in text mining for identifying the similarity between data
pairs. The short length of those posts leads to extremely sparse text vectors
compared to general text [79]. Moreover, the nature and vocabulary of short text
in social media are drastically different from those of usual text [79]. This
requires text mining methods to perform additional text pre-processing to handle
the unstructured phrases and abundant information attached to short text.
Besides, social media contains a larger number of sub-groups compared to the
usual cases [174]. This is evident in social media text analytics, which creates
the need for fine-grained text mining solutions [174].
Community detection with social media text, which identifies users with common
interests based on what they communicate, is challenged by the extreme sparseness
of short text [144, 147]. Existing community detection methods that rely on
traditional distance-based clustering [147] face distance concentration due to
sparseness, whereas probabilistic approximation methods [144] face information loss
due to higher-to-lower approximation. Similarly, text mining for understanding
online discussion forums has to deal with the short nature of text [13]. Some of
these analysis methods use supervised approaches that depend on ground truths
in online forum data to handle short text [124]. In [124], a classification model is
used for discovering genres in a Learning Management System to automatically
code posts. This method used a supervised approach to classify each forum thread
into a manually mapped code. The unsupervised text mining approach
[120] for grouping the forum text into various clusters followed a centroid-based
approach similar to k-means. This approach faces distance concentration due
to sparseness in text. However, the unavailability of ground truths in online
forum data creates the demand for unsupervised methods that can overcome the
sparseness in the text representation.
Document expansion has been proposed as an effective way to solve the spar-
sity issue of feature vectors by expanding short texts [17, 80, 93, 95, 202].
Many researchers have used external knowledge sources for document expan-
sion [17, 80, 95]. Short texts are expanded to long texts by using Wikipedia
[17], WordNet [80], Web search results [162] and other user constructed ontolo-
gies [95]. Short text expansion is also done with pre-trained word vectors such
as Word2Vec [137] that use local context windows or Glob2Vec [148], which com-
bines global word-word co-occurrence counts and local context windows. Based
on these word embeddings that learn semantic representations for words from a
large corpus, short texts are aggregated into long pseudo-texts [152, 185].
However, when social media texts, which have unstructured text patterns, are
enriched using these static external sources, the added information is inadequate
or inaccurate due to semantic incoherence, which leads to incomplete enrichment.
Alternatively, self corpus-based expansion is proposed as a more effective and
semantically aligned method to handle short text [93, 202]. Some approaches
identify concepts in the collections for augmentation using methods similar to
k-means [93], while others [202] identify topics in the collection by considering
the term frequency probabilities. However, centroid-based or probability-based
calculations have shown inferior outcomes due to sparseness in high-dimensional text
data. In summary, all these document expansion methods for short text face
challenges in dealing with high dimensionality. With effective expansion methods,
the performance of social media applications that rely on sub-grouping documents
based on text similarity can be improved.
An alternative approach to handle the extreme sparseness in short text represen-
tation is to utilize effective representation learning for the short text clustering
with the emerging field of deep learning [188, 189, 190, 194]. This family of meth-
ods uses deep neural networks to automatically learn the representations needed
for discovering subgroups from the original high dimensional sparse text. The use
of deep learning is a promising solution to learn non-linear mappings for feature
selection and present the reduced feature space [6]. It allows embedding text rep-
resentation into a more semantically coherent representation. This is especially
used in short text clustering to address the extreme sparseness [188, 189, 190, 194]
where a deep neural network is used to have deep feature representation from a
raw text representation. In [189, 190], pre-trained word vectors are fed into a
convolutional neural network [114] to learn a deep feature representation, which is
an expensive process. Similarly, deep learning has been used for feature selection by
learning statistical dependencies between features [194]. However, it depends on
external semantic dictionaries to identify initial relationships of words [194]. All
these methods are similar to supervised learning and are controlled by the guidance
given by external sources. Moreover, these methods apply standard semantic
dictionaries to the short text and hence face semantic incoherence.
In [188], microblog-specific semantic knowledge is utilized to expand the short
text based on the cosine similarity of terms with the aim of avoiding this issue.
However, this pre-processing step is computationally expensive. In addition, it
uses hash-tagging and retweeting as must-link information, treated as ground truth
information, for minimizing the reconstruction error. The knowledge derived from
hash-tagging and retweeting is not pure, as they are overused on the Twitter
platform and form heterogeneous relationships.
Recently, the Generative Adversarial Network (GAN), a type of deep learning
architecture, has been successfully used in short text mining using pre-training
data [100, 182, 184]. A GAN model includes two networks, where a generative network
is used to generate candidates and a discriminative network is used to evaluate their
validity [121, 128]. The GAN-based methods can work in an unsupervised setting
without relying on ground-truth labels of the data [100, 182, 184]. However, they
require a known dataset used as the initial training data for the discriminator
network [121], hence their accuracy highly depends on this pre-training phase.
The candidates generated by using the pre-training data should be closely
related to the original data, as this data needs to be synthesized by the generator
network to be correctly evaluated by the discriminator network. Therefore, the
data used for pre-training the discriminator network needs to be closely related
to the underlying problem domain to maintain the semantic coherence.
Finally, these methods apply a clustering method on the learned enriched fea-
tures to obtain the clusters. Applying these methods to a real-world context is
challenging due to their supervised or semi-supervised nature, as it is difficult to
find semantically coherent datasets for short text learning.
Summary: Text Clustering
The identification of a set of sub-groups in a document collection has to deal with
the challenges generated by the sparse text representation in identifying similarity
between text pairs. Specifically, traditional text clustering methods face prob-
lems with sparse and high-dimensional vector representation while calculating
similarity using density distribution, distance measures or lower-dimensional ap-
proximation. In addition, existing methods are challenged when the collection size
is large. For instance, cosine angle-based pairwise similarity and shared-neighbor-based
mutual similarity, as well as the recently introduced hub-based clustering,
become impractical due to the computational complexity required for large document
collections. Recently, researchers have proposed the concept of ranking to
improve centroid-based text clustering. Only a handful of ranking-based methods
exist, and they have been explored only with a semi-supervised approach
and/or a partitional method to identify the clusters. These methods show potential,
and the use of the ranking concept in clustering deserves more attention. It will
be interesting to investigate different ways to use the ranking concept (i.e., the
response documents as well as the ranking scores given for documents) in other
clustering methods as an alternative text similarity calculation technique. The
possibility of using this concept for accurate and efficient mutual neighbor
identification for density estimation, as well as for hub identification, is
promising. Furthermore, the possibility of using a ranking-based neighborhood
concept to assist matrix factorization-based clustering also requires attention.
Short text clustering is another important problem in text mining. In addi-
tion to sparseness generated by high dimensional vector representation, short
text faces extreme sparseness due to the short length that challenges identifying
the similarity between text pairs. Different document expansion approaches and
dimensionality reduction approaches have been proposed to address this prob-
lem. Most of them depend on external information. External source-based docu-
ment expansion results in semantic incoherence while deep learning methods learn
lower-dimensional features through external sources with a supervised or semi-
supervised approach. Though there are a few self-corpus based strategies that
deal with probability approximation or distance concepts, no existing work inves-
tigates the applicability of other methods, such as matrix factorization, that have
commonly been used to project the high-dimensional data in a low-dimensional
search space to address this problem.
2.3.2 Text Outlier Detection
This section first introduces the general outlier detection problem, covering meth-
ods that deal with few dimensions as well as high dimensions. Secondly, this sec-
tion specifically focuses on the text outlier detection problem and applied meth-
ods. Finally, this section presents the evaluation measures used for assessing the
outlier detection methods and the associated problems.
General Outlier Detection Methods
A myriad of outlier detection methods exist for traditional structured data [1].
Table 2.2 lists the major categories these methods fall into. The majority of
these works identify outliers by separating the deviations based on the Hawkins
definition [69] given below.
Definition 2.1: An outlier is an observation which has deviated so much from
the other observations that it has aroused suspicions that it was generated by a
different mechanism.
Outlier detection broadly follows two approaches: (1) supervised learning, when
training data with labels of normal and abnormal data is provided [105]; and
(2) unsupervised learning, when labeled data is not available, which is common
in real-world scenarios [68]. Unsupervised approaches based on traditional
methods such as distribution-based, distance-based, density-based, and cluster-
based have been used commonly to identify outliers due to unavailability of labels
[29, 68]. These methods, as listed in Table 2.2, which deal with numerical and
few-dimensional data, face several challenges when applied to the high dimen-
sional and sparse text in identifying dissimilarities between data pairs.
Table 2.2: Summary of the major outlier detection methods
Category                       | Bottleneck                                       | Data Domain
Distribution-based [89]        | Pre-assumptions                                  | Few dimensional
Distance-based [68]            | Distance concentration                           | Few dimensional
Density-based [29]             | Sparseness; Distance concentration               | Few dimensional
Cluster-based [51, 71]         | Sparseness                                       | Few dimensional
Graph-based [68, 84]           | Distance concentration; Computational Complexity | Few dimensional
Angle-based [109]              | Computational Complexity                         | High dimensional
Subspace-based [2, 108]        | Computational Complexity                         | High dimensional
Projection-based [11, 96, 126] | Information Loss                                 | High dimensional
k-occurrence-based [58, 155]   | Computational Complexity                         | High dimensional
Distribution-based and Distance-based Methods: Distribution-based
methods identify outliers as observations that do not fit a model of normal data.
These methods depend on an assumed data distribution and on learning the normal
model when identifying deviations [89]. They are known not to scale to
high-dimensional data [16]. Distance-based methods define data points as outliers
if they are far from many other points in the dataset, considering a minimum
distance threshold [68]. However, the use of this approach in high-dimensional data
is challenged by distance concentration [126]. Further, the computational
complexity of these methods makes them less effective for big
datasets such as digitized text corpora with many documents or web repositories.
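
The distance-based notion can be illustrated with a minimal sketch. It is not taken from [68]; the function name `distance_based_outliers` and the parameters `radius` and `min_neighbors` are hypothetical, and Euclidean distance on a toy 2-D dataset is assumed. A point is flagged when too few other points lie within the distance threshold.

```python
import math

def distance_based_outliers(points, radius, min_neighbors):
    """Flag points that have fewer than `min_neighbors` other points within `radius`.

    Hypothetical helper for illustration; every pair of points is compared,
    which is the quadratic cost discussed in the text.
    """
    flags = []
    for i, p in enumerate(points):
        close = sum(
            1
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        flags.append(close < min_neighbors)
    return flags

# Four points in a tight group and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
flags = distance_based_outliers(points, radius=0.5, min_neighbors=2)
print(flags)
```

Only the isolated point is flagged here. In high dimensions the pairwise distances tend to become similar, so a single `radius` separates points far less cleanly than in this toy example.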
Density-based Methods: These methods define outliers considering the density
distribution in a dataset. The well-known Local Outlier Factor (LOF) method
identifies outliers using the relative density of a point, which is measured as the
ratio of its neighbors’ density to its own density [29]. A point is labeled an
outlier if the density around its k nearest neighbors (k-NN) is high with respect to
the density around the point itself (i.e., a point with a high LOF value). However,
this density notion is challenged by the “sparseness” in high-dimensional data.
Furthermore, this technique also depends on distance calculation for the k-NN
identification and faces the problem of distance concentration.
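
A simplified sketch of the LOF computation is given below to make the density ratio concrete. It follows the usual textbook formulation (k-distance, reachability distance, local reachability density) but is an assumption-laden illustration rather than the exact procedure of [29]: ties are broken by index, duplicate points are not handled, and the names `knn` and `lof_scores` are hypothetical.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of point i (ties broken by index)."""
    others = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: (math.dist(points[i], points[j]), j),
    )
    return others[:k]

def lof_scores(points, k):
    """Simplified LOF; assumes no duplicate points (avoids division by zero)."""
    n = len(points)
    neighbors = [knn(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = [math.dist(points[i], points[neighbors[i][-1]]) for i in range(n)]

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbor j.
        return max(k_dist[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [
        1.0 / (sum(reach_dist(i, j) for j in neighbors[i]) / k)
        for i in range(n)
    ]
    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
print(scores)
```

The four corner points obtain a score of 1 (their density matches that of their neighbors), while the isolated point obtains a much larger score. Both the k-NN search and the reachability distances rely on pairwise distances, which is where distance concentration affects the method.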
There are a few cluster-based methods that extend the “density” concept to iden-
tify dense clusters to filter the outliers [51, 71]. For example, clustering methods
such as DBSCAN [57] and OPTICS [9] are well-known for naturally detecting
outliers in spatial data that fall into sparse regions. The majority of these meth-
ods are parameter dependent. In high-dimensional sparse text data, adoption
of density-based methods for outlier detection is difficult as distinguishing high-
density regions from the low-density regions is complicated due to fewer term
co-occurrences.
Graph-based Methods: Researchers have proposed solutions based on
nearest-neighbor graphs for outlier detection. Nearest-neighbor is an important
concept used in identifying similarity/dissimilarity among observations [68, 84].
A set of methods uses nearest-neighbor graphs to determine the outliers via
in-degree numbers [68]. The exclusion from mutual proximity, derived from the
nearest neighbors, has also been used to calculate outlier scores [84].
Nevertheless, the nearest-neighbor calculation is not known to be a scalable
solution for larger document collections and higher-dimensional data due to the
problem of distance concentration [164].
Angle-based Methods: These outlier detection methods were introduced as
a successful remedy to the distance concentration problem in high-dimensional
data [109]. High-dimensional data is usually represented in the form of vectors,
where the closeness of points can be effectively captured by the angle between
vectors. This is well suited to the text domain, where a document is represented
by its feature vector and cosine similarity can be used to compare documents
[186]. However, the high computational complexity for larger datasets,
due to the larger number of pairwise comparisons, makes these methods less
effective.
Subspace-based Methods: In contrast to these traditional methods,
subspace-based methods naturally identify outliers in high-dimensional data.
These methods identify a subset of dimensions with rarely occurring patterns using
brute-force searching to obtain outlier candidates [2]. This leads to high
computational complexity; moreover, the local patterns identified as outliers in
subspaces may not be outliers in the full feature space [108].
Projection-based Methods: Lower-dimensional projection is an alternative
approach that was specifically introduced as a remedy to distance concentration in
high-dimensional data. The degree of deviation of each observation from the original
point after projecting it to the lower-dimensional space is measured to identify the
outliers [126]. However, information loss is inevitable in this approach when
projecting data from a higher to a lower dimension.
K-occurrences-based Methods: The hub concept in higher dimensions, used
in clustering, has been used inversely in anomaly detection. Researchers have used
the reverse neighbor count, or k-occurrences count, to determine outliers that
are away from hub points [155]. In [51], connections are made considering
reverse k nearest neighbors, and nodes with a low in-degree are identified
as possible outliers. Similar work has proposed using “anti-hubs” found
in sparse high-dimensional data as possible outlier candidates [58]. Although the
frequent nearest neighbor-based hub concept is successful in handling higher
dimensions, the scalability of these methods to larger datasets is questionable due
to the computational complexity of pairwise comparisons. All these high-dimensional
outlier detection methods are summarized in the latter part of Table 2.2.
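
The k-occurrences count can be sketched as follows. This is an illustrative sketch, not the implementation of [58] or [155]: the function name `k_occurrences` is hypothetical, Euclidean distance is assumed, and ties are broken by index. The count of a point is the number of other points that list it among their k nearest neighbors, which equals its in-degree in the k-NN graph.

```python
import math

def k_occurrences(points, k):
    """Count how often each point appears in the k-NN lists of the other points."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (math.dist(points[i], points[j]), j),
        )
        for j in ranked[:k]:
            counts[j] += 1  # point j is one of the k nearest neighbors of point i
    return counts

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
counts = k_occurrences(points, k=2)
# Points with the smallest count are anti-hub candidates.
anti_hubs = [i for i, c in enumerate(counts) if c == min(counts)]
print(counts, anti_hubs)
```

The isolated point is never chosen as a neighbor and is therefore the only anti-hub candidate. The nested ranking over all points illustrates the pairwise cost that limits scalability for large collections.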
Text Outlier Detection Methods
Text data is a special variation of high-dimensional data. There are limited
studies specifically focused on the text domain to identify the documents that
deviate from the common theme [4, 85, 96]. Text outliers need to be identified using
(dis)similarities between text pairs. In [85], an outlier text on the web is defined
as follows:
Definition 2.2: Given a set of web texts $T_i$ ($i = 1, 2, \ldots, n$) on a topic
$M$, let $W_{ij}$ ($j = 1, 2, \ldots, m$) be the top $m$ keywords on topic $M$ in
$T_i$, so that $T_i = (W_{i1}, W_{i2}, \ldots, W_{im})$. If the relative weight of
text $T_i$ is greater/smaller than the ones that are similar to other texts, then
the Web text $T_i$ constitutes a Web text outlier.
This outlier definition depends on a specific topic or class in the collection to
filter the outliers based on their deviation from it. However, text outliers should
be detectable in a natural setting, i.e., where a corpus contains multiple related
groups/topics and each group contains inliers. Let a set of classes 1, ..., c
represent the inlier groups. An outlier should be identifiable by recognizing its
dissimilarity to all of the multiple related inlier classes (1, ..., c) in the
collection, with which it shares fewer terms.
A common process of identifying an outlier is to compare documents within the
collection and determine dissimilarity to decide an outlier score. Cosine similar-
ity is the standard text similarity measurement used to calculate the similarity
between a document pair. This is used in an inverse manner to identify the dis-
similarity (i.e., 1 − similarity) [85]. However, these pairwise comparisons are
expensive for a large text collection. In [4], n-grams (which is an efficient way
to determine the similarity between different related words in text processing)
are used in outlier detection. This is based on the hypothesis that documents
on the same topic should have similar n-gram frequency distribution [4]. The
n-gram frequency distribution for each document is generated and dissimilarities
are computed as the angle between the document vectors. However, this process
leads to high computational complexity.
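
The pairwise process described above can be sketched as follows. This is a hedged illustration and not the method of [4] or [85]: character trigrams are assumed as the n-grams, the dissimilarity is taken as 1 − cosine similarity, the outlier score is assumed to be the mean dissimilarity to all other documents, and the names `char_ngrams`, `cosine_dissimilarity` and `outlier_scores` are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Frequency profile of character n-grams (an assumed choice of n-gram)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_dissimilarity(a, b):
    """1 - cosine similarity between two n-gram frequency profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0

def outlier_scores(docs, n=3):
    """Mean dissimilarity of each document to every other document."""
    profiles = [char_ngrams(d, n) for d in docs]
    scores = []
    for i, p in enumerate(profiles):
        others = [
            cosine_dissimilarity(p, q) for j, q in enumerate(profiles) if j != i
        ]
        scores.append(sum(others) / len(others))
    return scores

docs = [
    "text clustering groups similar documents",
    "document clustering groups similar text",
    "clustering of similar text documents",
    "stock prices fell sharply on monday",
]
scores = outlier_scores(docs)
print(scores.index(max(scores)))  # index of the most dissimilar document
```

Each score requires a comparison with every other document, so the total number of comparisons grows quadratically with the collection size, which is the cost noted in the text.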
A recent study on text outlier detection proposed using Non-negative
Matrix Factorization (NMF) to measure deviation [96]. NMF is assumed to preserve
the semantic structure within lower dimensions while decomposing the original
document-term matrix into two matrices, a document-cluster matrix and a
term-cluster matrix [3]. The learning error, calculated as the sum-of-squares
difference to the original matrix, is used to identify the outliers in [96].
However, a dataset with a larger number of clusters/groups misleads this
reconstruction process. NMF is designed to approximate high-dimensional original
data at a lower rank r, where r is the number of natural groups in the collection
[47]. When the rank r is high, it is not easy to differentiate between the groups,
and higher reconstruction errors are produced for many data points other than the
outliers. Therefore, this method becomes ineffective for datasets with fine-grained
clusters. The applicability of this method to accurate and scalable outlier
detection in Web content, which often contains a large number of document
categories, is questionable. This results in the requirement for exploring scalable
methods that will efficiently and accurately identify outliers in
higher-dimensional text document collections.
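
The reconstruction-error idea can be written compactly as below. The notation is an assumption introduced for illustration only (X for the n × m document-term matrix, W for the n × r document-cluster matrix, H for the m × r term-cluster matrix, e_i for the per-document error); the exact objective used in [96] may differ.

```latex
% Assumed notation: X (n x m, document-term), W (n x r, document-cluster),
% H (m x r, term-cluster), all entries non-negative.
\begin{align}
  (W, H) &= \operatorname*{arg\,min}_{W \ge 0,\; H \ge 0}
            \left\lVert X - W H^{\top} \right\rVert_F^{2}, \\
  e_i    &= \sum_{j=1}^{m} \left( X_{ij} - \left( W H^{\top} \right)_{ij} \right)^{2},
            \qquad i = 1, \ldots, n .
\end{align}
```

Under this reading, documents with a large e_i are poorly explained by the r learned groups and become outlier candidates. When r is large, many inlier documents also obtain a large e_i, which is the weakness described above.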
All the above-mentioned methods rank documents to identify the outliers. For
example, a distance-based method ranks the documents according to their degree of
deviation from neighbors and assigns a higher score to highly deviated observations
[126]. Similarly, a graph-based method ranks documents based on the in-degree
of document nodes, and documents with a lower in-degree get a higher rank with
the possibility of being outliers [51]. The reverse neighbor count has been used
to rank outliers in high-dimensional data [155]. In documents, term weights have
been used to rank the importance of terms, considering their
appearance in the document and the collection [133]. Though it is intuitive to use
this term weighting to rank the documents for consideration as outliers, there
exists no work showing this. IR systems are also known for ranking the
documents in a collection with respect to a posed query document, considering
term weights to obtain ranked results [59]. The possibility of using this IR
ranking to identify the outliers in document collections is a promising direction
that is yet to be explored for text outlier detection.
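
The term weighting mentioned above can be illustrated with the common TF-IDF scheme. The sketch below uses one widely used variant, term frequency multiplied by log(N / document frequency); this variant is an assumption, and the weighting used in [133] may differ. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = term frequency * log(N / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [
    "text clustering text similarity".split(),
    "text outlier detection".split(),
    "clustering similarity ranking".split(),
]
weights = tf_idf(docs)
# Terms of the second document, ranked from most to least important.
ranked_terms = sorted(weights[1], key=weights[1].get, reverse=True)
print(ranked_terms)
```

Terms that occur in few documents ("outlier", "detection") receive higher weights than a term shared across documents ("text"), which is the rareness signal that term weighting captures.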
Evaluation Measures
Evaluating the performance of outlier detection methods is an important problem.
Accuracy (ACC) [84] is the most popular measurement used in many text mining
methods to assess effectiveness. It considers the total correct predictions against
the total observations in the context. This measurement completely disregards
the incorrect predictions [84]. For instance, an outlier detection method that
detects all observations as inliers will yield high accuracy due to class skewness.
However, such a method also shows a very high rate of false inlier predictions.
Alternatively, the area under the Receiver Operating Characteristic (ROC) curve is
used with predictive methods as well as outlier detection methods [1, 155] to
overcome this issue. The ROC curve shows the True Positive Rate (TPR) against the
False Positive Rate (FPR), where P and N denote outliers and inliers respectively.
$$\mathrm{TPR} = \frac{TP}{TP + FN} = \frac{TP}{P} \tag{2.8}$$
$$\mathrm{FPR} = \frac{FP}{FP + TN} = \frac{FP}{N} \tag{2.9}$$
The area covered by the ROC curve at an optimum threshold indicates how
well a model is capable of distinguishing between inliers and outliers. However,
a detailed analysis in [32] showed that the Area Under the Curve (AUC) also
exhibits a performance bias towards true predictions at the optimal threshold.
Assessing an outlier detection method requires investigation of false predictions
(i.e., of both outliers and inliers). FPR and False Negative Rate (FNR) are directly
used by some researchers to report false predictions [65, 101]. FPR and FNR measure
the effectiveness of outlier detection through the error in predicting inliers and
outliers.
$$\mathrm{FNR} = \frac{FN}{TP + FN} = \frac{FN}{P} \tag{2.10}$$
FPR denotes inliers detected as outliers against the total inliers, with values
ranging from 0 to 1, as in Eq. 2.9. Similarly, FNR denotes outliers detected
as inliers against the total outliers, with values ranging from 0 to 1, as in Eq.
2.10. Though higher FPR and FNR values indicate poor performance of outlier
detection methods in terms of misclassified inliers and outliers
respectively, they are not able to clearly categorize the effectiveness of the
methods based on their capacity for false (inlier/outlier) detection. The capability
of outlier detection methods in terms of false detection requires a differentiable
measure to recognize their direct applicability.
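
The behaviour of these measures under class skewness can be checked with a small sketch. The function name `outlier_rates` and the toy labels are hypothetical; outliers are treated as the positive class (label 1), inliers as the negative class (label 0), and both classes are assumed to be present in the ground truth.

```python
def outlier_rates(y_true, y_pred):
    """ACC, TPR, FPR and FNR with outliers as positives (1) and inliers as negatives (0).

    Assumes y_true contains at least one outlier and one inlier.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "ACC": (tp + tn) / len(pairs),
        "TPR": tp / (tp + fn),  # Eq. 2.8
        "FPR": fp / (fp + tn),  # Eq. 2.9
        "FNR": fn / (tp + fn),  # Eq. 2.10
    }

# Skewed toy collection: 5 outliers among 100 documents.
y_true = [1] * 5 + [0] * 95
all_inliers = [0] * 100  # a detector that labels every document as an inlier
rates = outlier_rates(y_true, all_inliers)
print(rates)
```

The detector reaches an accuracy of 0.95 although it misses every outlier (FNR of 1.0 and TPR of 0.0), which is why accuracy alone is misleading for outlier detection.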
Summary: Text Outlier Detection
An outlier in a document collection is a document that deviates from the others
in the collection. A myriad of methods exist that calculate
(dis)similarity between text pairs to determine this deviation. Traditional
outlier detection methods, which deal with few dimensions, are challenged by
distance concentration, sparseness and approximation errors in identifying text
outliers. Subspace-based analysis is also not guaranteed to find an
optimal solution for high-dimensional outlier detection. In addition to the high
dimensionality, the larger size of document collections also creates issues for
angle-based, nearest neighbor-based and anti-hub-based methods due to the high
computational complexity of similarity calculation.
There is a lack of work focusing on text outlier detection: there is no formal
definition considering the multiple groups of inliers in document collections, and
there are no specialized methods or meaningful evaluation measures. When there are
multiple groups within the inlier documents, it is hard to identify outliers that
are dissimilar to those inlier classes and share fewer common terms with them. The
possibility of using ranking concepts to improve the accuracy and efficiency of text
outlier identification shows high potential and needs to be studied in depth. It
is interesting to investigate the possibility of using term weighting-based ranking
as well as IR system-based ranking responses and ranking scores in defining outlier
scores. Moreover, there is a demand for clear evaluation measures that indicate
the error in detecting outliers as well as inliers, in order to select effective
outlier detection methods.
2.3.3 Text Cluster Evolution
The topics, associated terminologies or concepts in text repositories change over
time as well as across domains, and show varying trends. Researchers have
explored these dynamic changes in text to find decaying, current and emerging
topics, events, communities or concepts [35, 41, 63, 73, 82, 119, 180]. However,
research in this area is in its infancy. Existing methods have to deal with the
common problem of high-dimensional and sparse vector representation in text data
when identifying the similarity between (text) cluster pairs [8]. One set of methods
uses the naive approach of term-based similarity [63], while other methods
use probabilistic [180] or factorization approaches [98].
The similarity between clusters is determined by term intersections using the
Jaccard coefficient to define the persistence, merging and splitting of clusters
[63]. However, Jaccard similarity is not effective in comparing the similarity of
text, as it only considers the common terms in sparse data [129]. Probabilistic
topic modeling is used to track topic occurrences over time [180], while NMF is used
to identify a set of steady topics by minimizing the learning error [98]. However,
information loss is inevitable with any lower-dimensional approximation [8]. This
emphasizes the need to explore effective methods for handling the high-dimensional
sparse text representation to identify dynamic changes in clusters. Specifically,
how to use dimensionality reduction methods to handle higher dimensions in
text cluster representation, and which techniques effectively compensate for the
associated information loss, need to be studied.
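
The term-intersection comparison can be sketched as follows. This is an illustration under stated assumptions and not the procedure of [63]: clusters are represented by small sets of top terms, the matching threshold of 0.3 is hypothetical, and the names `jaccard` and `match_clusters` are introduced here only for the example.

```python
def jaccard(a, b):
    """Jaccard coefficient between two term sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_clusters(prev, curr, threshold=0.3):
    """Pairs (i, j) of clusters from two timestamps whose term overlap reaches `threshold`."""
    return [
        (i, j)
        for i, p in enumerate(prev)
        for j, c in enumerate(curr)
        if jaccard(p, c) >= threshold
    ]

# Top terms of the clusters at two consecutive timestamps (toy data).
prev = [{"election", "vote", "party", "poll"}, {"match", "goal", "team", "league"}]
curr = [{"election", "vote", "party", "debate"}, {"flood", "rain", "storm", "rescue"}]
matches = match_clusters(prev, curr)
print(matches)
```

Here only the first cluster persists; the second earlier cluster has no successor and the second later cluster has no predecessor. Because the coefficient depends only on shared terms, two clusters on the same topic that use different vocabulary would obtain a score of zero, which is the weakness noted above for sparse text.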
Tracking evolution across different domains or time has been popularly used
in social network analysis [41, 102, 115, 123]. There are two main models
used in the community evolution of social networks, namely the snapshot model [115]
and the temporal smoothness model [40]. The snapshot model keeps track of a
fixed number of communities [123] or focuses only on a pre-determined community
structure [115] over time. In contrast, the temporal smoothness model analyzes
a continuous stream of changes to the considered networks to derive communities
over time [40]. However, all these methods that deal with community evolution
consider user clusters that are identified based on network structure analysis
[102, 115, 123]. None of these works deals with the problems associated with text
analysis.
Methods that explore the changes in text structure to characterize the
evolutionary events, concepts or terminologies have been developed with one of three
major objectives: (1) cluster evolution [63, 73]; (2) topic evolution [35, 41, 180];
or (3) event evolution [82, 119]. In comparison to cluster evolution, topic
evolution is done in a much smaller data space, where the number of extracted topics
is much smaller than the size of the entire document collection. Similarly, the
vocabulary associated with topics (i.e., the terms highly probable to be in the
topics) in a collection is much smaller than the complete vocabulary of the
collection. Event evolution detection also considers a set of selected events,
targeting a much smaller data scope compared to the original data space. A summary
of these existing methods is given in Table 2.3.
Table 2.3: Categories in dynamic text evolution
Category          | Approach                               | Bottleneck
Cluster evolution | Citation network analysis [73]         | Consider only the local relations
                  | Term intersection analysis [63]        |
Topic evolution   | LDA-based approaches [49, 180]         | Unable to identify the complex cluster dynamics
                  | NMF-based approaches [98]              |
                  | Graph-based approaches [41]            | Study a fixed set of terms and neglect new formations
Event evolution   | Text similarity-based approaches [119] | Limited only to novel event identification
                  | Topic Modeling-based approaches [195]  | Study a fixed set of events and neglect new formations
Cluster Evolution: With the aim of identifying cluster dynamics, a simple
approach of analyzing the citation network is used on publication data [73]. A
network of bibliographic coupling is generated using direct and co-citation analysis
to identify the current trends and emerging concepts. In another study, TextLuas
[63], a software tool, was developed to model each cluster solution with the
respective terms at each timestamp. It considers the similarity between consecutive
clusters with the Jaccard coefficient. However, none of these methods considers
concept shift with the full document-term space over the entire period in
determining similarity, and/or they are limited to local relations between two
consecutive timestamps in defining evolution.
Topic Evolution: Topic evolution analysis has been used to identify the
content shift through a discovered subset of topics. The majority of these methods
rely on generative probabilistic approaches [49, 180]. The LDA-related approach
used in [180] only identifies the topic occurrences in different time dimensions
with the respective calculated probabilities. It is not capable of identifying
topic evolution with splits and merges. Another probabilistic approach, used in
[49], determines topic cluster evolution based on the changes to term probability
within topics. This study considers a fixed vocabulary, limiting the set of terms
that can appear in the topics, and is not able to track the formation of new topics.
Another topic evolution tracking method used NMF to identify a set of steady
topics by minimizing the learning error [98]. Although it identifies the emerging
topics over time, it is not able to detect complex topic structure changes such
as diminishing or growing. In [41], a graph-theoretic approach is used to track
persistent and diminishing topics using the term frequency for each topic cluster
solution. This method is unable to identify the complex dynamics of topics, such
as merge and split, to cover the complete cluster lifecycle.
Event Evolution: Event evolution analysis is usually applied to social media to
keep track of event clusters that appear over time and to identify novel events or
shifts that deviate from the existing event clusters [119, 195]. In [119], a novelty
score is assigned to each event cluster for identifying new events considering the
(cosine) text similarity. Topic modeling has been used to identify events across
time [195]. Though it identifies the emerging events through deviations from
previously existing events, it fails to identify complex dynamics such as growth
and decay due to its assumption of a fixed set of events within a dataset.
There exists no work that focuses on identifying the full cluster life-cycle in the
original data. Existing methods are all restricted to a subset of the data due to
various limitations.
Summary: Text Cluster Evolution
Existing text evolution tracking methods consider neither the full data space nor
all the dynamic changes over the time/domain when identifying evolving patterns.
The topic or event evolution methods are limited to a smaller selected set of
data. There are some methods that consider the full data space in evolution
tracking; however, these methods are limited to consecutive timestamps only in
identifying cluster similarity. An effective cluster evolution method that considers
all the changes over the considered time period/domains is needed to identify the
full cluster life-cycle with persistence, emergence, growth, and decay patterns. It
is important to accurately handle the high-dimensional nature of the text in
identifying the similarity relationship between clusters. Existing
naive comparisons between clusters and probabilistic methods face difficulties in
accurately identifying the full cluster life-cycle. This creates the requirement
to investigate a novel text mining method to identify the cluster dynamics in
an unsupervised setting.
2.4 Research Gaps
This chapter has reviewed the literature relevant to clustering, outlier detection
and evolution methods focusing on text data and the challenges they face in
identifying text similarity. The following research gaps have been identified.
2.4.1 Text Clustering
As summarized in Section 2.3.1, the high-dimensional nature of text representation
and the associated sparseness challenge the process of identifying text similarity
in existing text clustering methods to find the clusters in a document collection.
In high-dimensional data, data points are known to be closer to frequent nearest
neighbours (i.e., hubs) than to the cluster mean [176]. A handful of methods have
started to use emerging concepts such as ranking, hubs in higher dimensionality,
and neighborhood for effective text similarity identification in clustering. This
thesis identifies the following research gaps in those methods and aims to explore
these promising concepts in detail.
• Though IR concepts such as indexing and ranking have been used with
partitional clustering for identifying similar text, they have yet to be
exploited in density-based clustering, which is known to identify diverse shapes
of clusters without taking the number of clusters as an input. The success
of IR ranking concepts in identifying nearest neighbors motivates their use
in identifying density differences in a document collection.
• Mutual neighbor identification and Shared Nearest Neighbor graph
construction in forming a dense representation for sparse text have been found
useful but computationally expensive. Utilizing IR ranking to make these
concepts efficient is a promising direction that needs to be exploited.
• The closeness to hubs is determined by pairwise neighborhood calculations.
This makes the existing clustering algorithms unscalable to large document
collections. The use of IR ranking score for calculating closeness to hubs is
a potential direction that needs to be studied.
• Assisting matrix factorization-based text clustering through neighborhood
information is an emerging research topic. There is no specific work exploiting
accurate ways to use IR ranking results in overcoming information loss
in matrix factorization. The success of IR ranking concepts in identifying
nearest neighbors motivates their use for assisting matrix
factorization.
• The extreme sparseness in short text is handled by document expansion
or feature learning using external sources; however, these methods result in
semantic incoherence. It will be interesting to study how to utilize
dimensionality reduction methods such as matrix factorization, which are successful
in identifying groups of terms, for self-corpus-based enrichment.
2.4.2 Text Outlier Detection
There exist only a handful of methods specifically designed for text outlier detection.
Similar to text clustering, the high dimensional sparse representation of
text is the major challenge in outlier detection when identifying (dis)similarity
between text pairs, as summarized in Section 2.3.2. The following research gaps
are identified in relation to this problem.
• Document collections taken from social media with a large number of groups
show the necessity of considering outliers among multiple inlier groups.
Existing outlier detection methods do not define text outliers in the presence
of multiple inlier groups or propose solutions that can identify outlier
documents that share fewer terms with inlier groups.
• Inverse document frequency term weighting is successful in identifying
the rareness of terms. The possibility of utilizing this simple concept of
term weighting to rank the documents in a collection and identify devia-
tions/dissimilarities is promising and needs to be investigated.
• There is no specific work exploiting IR ranking results and ranking scores in
text outlier detection. The possibility of using IR ranking concepts in mutual
neighborhood identification and anti-hub identification for text outlier
detection is a promising but unexploited direction.
• The evaluation of outlier detection methods is challenging due to the
bias of existing measures towards true predictions. There is a lack of measures
that clearly differentiate the effectiveness of methods based on both inlier
and outlier prediction errors, although identifying both is highly necessary.
2.4.3 Text Cluster Evolution
As summarized in Section 2.3.3, the problem of cluster evolution in text corpora
has not been studied effectively. Due to the challenges faced by high dimensional text in
identifying the similarity between pairs, the majority of existing methods limit
their analysis to a few local patterns while some methods are based only on topics
and events. There exist the following research gaps in this research problem.
• The global evolution of clusters over time/domain must be tracked to
identify trends. There is no specific work exploiting the global cluster
evolution over the time/domain with the objective of identifying all the clus-
ter states such as birth, death, split and merge to track the cluster evolution
patterns based on cluster similarity. “Birth” of clusters denotes an emerg-
ing pattern, “split” identifies a growth pattern, “death” and “merge” reflect
a decay pattern and a consistently appearing cluster across time/domain
signifies a persistence pattern.
• Matrix factorization is a successful solution to identify groups in input data.
None of the existing cluster evolution works explores using matrix factorization
to identify groups of similar clusters for tracking text dynamics.
There is a loss in information with the higher to lower-dimensional
projection. The possibility of using different information and modeling
techniques to minimize this problem needs to be carefully studied.
Chapter 3
Text Clustering
This chapter introduces the primary contribution of the thesis, which is a set
of novel document clustering methods to identify the groups with similar doc-
uments in a document collection. Clustering is a popular unsupervised data
mining technique that groups similar documents together based on
term co-occurrences. Generally, document collections which show fewer word
co-occurrences between documents form sparse data matrices for analysis. The
sparse representation of text data challenges traditional text clustering meth-
ods such as partitional, hierarchical, density-based and dimensionality reduction
approaches to identify the text similarities [3, 8].
Apart from the sparseness that directly affects clustering methods, distance concentration,
in which the distance differences between far and near points become
negligible, is another major issue in high dimensional data clustering [3]. Besides,
the majority of the dimensionality reduction methods that project higher dimensional
text data into lower dimensional space face information loss [8]. Though researchers
have explored effective partitional clustering methods using IR ranking concepts
and Hub concepts based on frequent nearest neighbors to identify the similarity
between text pairs dealing with these issues [30, 78, 173], there is no prior work
that investigates the applicability of these concepts to density-based methods
or matrix factorization to accurately cluster documents. In recent years, these
methods have shown high potential in text clustering [30, 174].
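The distance concentration effect mentioned above can be illustrated with a small numerical sketch. It is not taken from the thesis: it uses uniformly random points rather than text vectors, and the function name `relative_contrast`, the sample size and the seed are illustrative assumptions. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for distances from one random query point.

    Assumption: points are drawn uniformly from the unit hypercube, which is
    only a stand-in for real (sparse) document vectors.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


contrast_low_dim = relative_contrast(dim=2)
contrast_high_dim = relative_contrast(dim=1000)
print(f"dim=2: {contrast_low_dim:.2f}  dim=1000: {contrast_high_dim:.2f}")
```

In low dimensions the nearest point is far closer than the farthest one, whereas in high dimensions the two distances become nearly indistinguishable, which is the behaviour that weakens distance-based clustering.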
In addition to the sparseness created by the high dimensional representation and
other related issues faced by general text, short text clustering, which
has become popular with sources such as social media, is challenged by the extreme
sparseness in data [79]. The extremely low number of word co-occurrences
in such text challenges traditional clustering methods. Despite the
existence of different enrichment approaches based on sophisticated designs to
handle this issue [17, 80, 95], there is no prior work that explores corpus-based
enrichment using matrix factorization to minimize the extreme sparseness.
Figure 3.1: Overview of the Chapter 3 contributions
Fig. 3.1 outlines the high-level overview of the contributions made in this chapter
to effectively identify the text similarity for clustering. This chapter explores the
effectiveness of utilizing the IR ranking concept in the nearest neighbor calcula-
tion and different ways of using this information to calculate similarity between
high-dimensional text pairs for clustering. It introduces a ranking-based, mutual
neighborhood graph for density calculations and the ranking for hub formation to
improve the quality of clustering solutions. Further, this chapter emphasizes the
effectiveness of using nearest neighbor information with proper modelling tech-
niques for assisting matrix factorization to identify the similarity between text
pairs in text clustering methods to avoid information loss. In addition, another
focus of this chapter is to explore the effectiveness of using corpus-based document
enrichment/augmentation for handling extreme sparseness in short text clustering
with topics derived through matrix factorization. It is followed by an application
of the corpus-based document augmentation method for concept-mining in online
forums.
This chapter comprises four papers relating to these contributions.
• Paper 1. Wathsala Anupama Mohotti and Richi Nayak.: An Efficient
Ranking-Centered Density-Based Document Clustering Method. Pacific-
Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp.
439-451. Springer (2018)
• Paper 2. Wathsala Anupama Mohotti and Richi Nayak.: Consensus and
Complementary Non-negative Matrix Factorization for Document Cluster-
ing. Elsevier Knowledge-Based Systems journal (Under Review).
• Paper 3. Wathsala Anupama Mohotti and Richi Nayak.: Corpus-Based
Augmented Media Posts with Density-Based Clustering for Community Detection.
International Conference on Tools with Artificial Intelligence (ICTAI),
pp. 379-386. IEEE (2018)
• Paper 4. Wathsala Anupama Mohotti and Darren Christopher Lukas
and Richi Nayak.: Concept Mining in Online Forums using Self-corpus-
based Augmented Text Clustering. Pacific Rim International Conference
on Artificial Intelligence (PRICAI), pp. 397-402. Springer (2019)
Paper 1 proposes a novel ranking-centered density-based document clustering
method, RDDC. It poses document-driven queries, which statistically represent
the documents, to a search engine and uses the top-10 documents ranked by its
tf*idf ranking function [54] to build a graph of shared nearest neighbors (SNN).
The top-10 results have been shown to possess sufficient information richness [199].
A region of the graph is treated as a high-density region, and forms an initial
cluster, if it contains at least a threshold number of documents connected by
edges whose weight is at least the same threshold. This threshold (i.e., alpha)
is set based on experiments. The remaining documents are then assigned to
clusters by comparing them with the SNN sets, which are identified as multiple
hubs in the graph. The hub similarity calculation uses the ranking score.
Empirical analysis using several document corpora, including the popular
NewsGroup data and Social Event Detection data, reveals that RDDC is able to
accurately identify clusters based on density differences in the SNN graph built
using the ranking responses and ranking scores. However, RDDC consumes more
time due to the SNN graph construction and hub similarity calculation steps,
which will have an adverse effect on extremely large datasets. The overall time
complexity of RDDC is O(ndkm), where n, d, k and m are the dimensionality,
the size of the collection, the number of nearest neighbors considered and the
number of mutual neighbor sets, respectively.
Paper 2 explores the effectiveness of using nearest neighbor information to compensate
for the information loss in lower-dimensional approximation. The consensus
and complementary non-negative matrix factorization-based document clustering
method, CCNMF, is proposed based on the IR ranking-based document
similarity as well as the pairwise similarity to form adjacency matrices that hold
the geometric information. CCNMF uses Skip-Gram with Negative Sampling
to accurately model the adjacency information by assigning probabilities
based on neighborhood information. It assigns high coefficients to document
pairs that show higher presence with respect to any neighborhood. The hypothesis
of this paper is that combining the common and specific information given by
each document affinity matrix, together with the information in the document-term
representation, assists the lower-dimensional approximation in NMF with geometric
information. Experiments have been conducted with several datasets,
including well-known public NewsGroup datasets, covering short-to-medium-size
text vectors. Evaluation using extrinsic measures validates that CCNMF
is able to produce accurate clustering solutions compared to state-of-the-art
benchmarking methods.
The method RDDC proposed in Paper 1 uses hub-based cluster similarity
through SNN sets, in addition to the density concept, for identifying cluster
similarity. The comparison between the results of RDDC and CCNMF shows
that RDDC produces better outcomes than CCNMF in fine-grained
clustering scenarios. Therefore, RDDC can produce superior results for Social
Event Detection datasets, which include over a thousand clusters. In general
document clustering that deals with a small number of clusters, CCNMF, which
factorizes the input matrix into lower-rank matrices, produces better results than
RDDC.
The other part of this chapter relates to Paper 3 and Paper 4, which present a
document expansion-based approach using the self-corpus to address the extremely
low number of word co-occurrences within short text. The approach proposed in
Paper 3 is to project the high-dimensional term space to a low-dimensional space,
and infer the topic proportion vectors using the associated semantic structure to
identify virtual terms for enrichment. The most probable virtual terms are selected
based on the mean and standard deviation of the coefficients in the topic×term
matrix in a threshold-independent manner. The experiments done with three Twitter
datasets in Paper 3 show that post expansion using topic words improves the
word co-occurrences of important terms in short text. Furthermore, they reveal
that the expanded text allows identifying text-based communities in social media
through density-based clustering. This density-based clustering uses a distance
parameter (i.e., α) to identify neighbor posts based on pairwise distances, measured
as the Euclidean distance with tf as the weight, and identifies the core
dense points that can be further expanded to form clusters.
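The selection of virtual terms from a topic×term matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Paper 3: the summary above only says that the mean and standard deviation of the coefficients are used, so the specific cut-off (mean plus one population standard deviation per topic), the function name `select_virtual_terms` and the toy matrix are all hypothetical.

```python
import statistics


def select_virtual_terms(topic_term, vocabulary):
    """For each topic row, keep terms whose coefficient exceeds mean + std.

    Assumption: the cut-off 'mean + one population standard deviation' is an
    illustrative choice; the text does not state the exact rule.
    """
    selected = []
    for row in topic_term:
        cutoff = statistics.mean(row) + statistics.pstdev(row)
        selected.append([vocabulary[i] for i, c in enumerate(row) if c > cutoff])
    return selected


# Toy topic x term matrix (rows are topics, columns follow `vocabulary`).
vocabulary = ["flood", "rain", "storm", "goal", "match", "team"]
topic_term = [
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.1],
    [0.0, 0.1, 0.0, 0.7, 0.2, 0.9],
]
virtual_terms = select_virtual_terms(topic_term, vocabulary)
print(virtual_terms)
```

Because the cut-off is derived from each topic's own coefficient distribution, no corpus-specific threshold has to be tuned, which is one way to read "threshold-independent" here.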
Paper 4 shows another application of this topic word-based document expansion
in the area of online forums for concept mining. It also adds the most probable
terms in topics, which are marked with higher coefficients in the topic×term matrix,
as the coefficients of terms in a topic vector are comparable to the weights of terms in
topics. The forum posts consist of short text, where self-corpus-based expansion
is able to mitigate the extreme scarcity of word co-occurrences among posts. The
experiments done on the QUT ESP forum data show improved performance with
intrinsic measurements in obtaining the themes within the posts. The qualitative
evaluation validates these themes. However, forum posts are not as short
as tweets and show a homogeneous nature. Due to this, fewer topic terms
need to be added to expand the forum posts to identify concept
clusters, as evident from the experiments in the paper.
Next, the chapter will present four papers. Since this is a thesis by publication,
each original paper is presented by aligning with the thesis format. Due to the
papers’ different formats, there will be some minor format differences from the
published article. However, these do not alter the content of the original papers.
Paper 1: An Efficient Ranking-Centered Density-
Based Document Clustering Method
Wathsala Anupama Mohotti* and Richi Nayak*
*School of Electrical Engineering and Computer Science, Queensland University
of Technology, GPO BOX 2434, Brisbane, Australia
Published In: The Pacific-Asia Conference on Knowledge Discovery and Data
Mining (PAKDD), 3-6 June 2018, Melbourne, VIC, Australia
Statement of Contribution of Co-Authors
The authors of the papers have certified that:
1. They meet the criteria for authorship in that they have participated in
the conception, execution, or interpretation, of at least that part of the
publication in their field of expertise;
2. They take public responsibility for their part of the publication, except for
the responsible author who accepts overall responsibility for the publication;
3. There are no other authors of the publication according to these criteria;
4. Potential conflicts of interest have been disclosed to (a) granting bodies, (b)
the editor or publisher of journals or other publications, and (c) the head
of the responsible academic unit, and
5. They agree to the use of the publication in the student’s thesis and its
publication on the QUT ePrints database consistent with any limitations
set by publisher requirements.
Contributor                 Statement of contribution*
Wathsala Anupama Mohotti    Conceived the idea, designed and conducted experiments,
                            analyzed data, wrote the paper and addressed the supervisor
                            and reviewers’ comments to improve the quality of paper
                            Signature: QUT Verified Signature
                            Date: 27/03/2020
A/Prof Richi Nayak          Provided critical comments in a supervisory capacity
                            on the design and formulation of the concepts, method
                            and experiments, edited and reviewed the paper
                            Signature: QUT Verified Signature
                            Date: 26/03/2020
ABSTRACT: Document clustering is a popular method for discovering useful
information from text data. This paper proposes an innovative hybrid document
clustering method based on the novel concepts of ranking, density and shared
neighborhood. We utilize ranked documents generated from a search engine to
effectively build a graph of shared relevant documents. The high density regions
in the graph are processed to form initial clusters. The clustering decisions are
further refined using the shared neighborhood information. Empirical analysis
shows that the proposed method is able to produce accurate and efficient solutions
compared to relevant benchmarking methods.
KEYWORDS: Density estimation; Ranking function; Graph-based clustering
1 Introduction
Document clustering is a popular method to discover useful information from
the text corpuses [8]. It has been used to organize the data based on similarity
in many applications such as social media analytics, opinion mining and recom-
mendation systems. A myriad of clustering methods exist that can be classified
into the popular categories of partitional, hierarchical, matrix factorization, and
density based clustering [8, 201]. The centroid based partitional methods such as
k-means are known to suffer from the data concentration problem when dimen-
sionality is high and the data distribution is sparse. Specifically, the difference
between data points becomes negligible [177]. Hierarchical clustering suffers from
the same problem due to the requirement of multiple pairwise computation at
each step of decision making [8]. Matrix factorization, a dimension reduction
method for high dimensional text, is also commonly used in finding clusters in
low dimension data. In these methods, information loss is inevitable [8], and
the time required for low rank approximation of large text data through
optimization increases with the size of the datasets.
Density-based methods such as DBSCAN and OPTICS have been found highly
efficient in traditional data [57]. They generate diverse shapes of clusters without
taking the number of clusters as an input – the desired requirements for document
datasets [201]. Moreover, text data has been shown to exhibit the hub phenomenon,
i.e., “the number of times some points appear among k nearest neighbors of other
points is highly skewed” [177]. A density based clustering method should be ideal
to identify these naturally spread dense sub regions made of frequent nearest
neighbors that assist in estimating density. However, this approach is hardly
explored in document clustering due to manifold reasons [56].
Firstly, density based methods become stagnated in high dimensional data clus-
tering as the document datasets exhibit varying densities due to sparse text rep-
resentation and the density definition cannot identify core points to form clusters
[8]. Secondly, techniques employed for efficient neighborhood inquiry to expand
clusters do not scale well to high dimensional feature space [8]. A handful of
solutions have been proposed by using different shapes, sizes, density functions
and applying constraints in the high dimensional data [55, 201]. Semi supervised
and active learning approaches have been used in density document clustering
with DBSCAN to obtain improved clustering performance [201].
The majority of density based methods utilize the concept of Shared Nearest Neighbor
(SNN) [92], whereby the similarity between points is defined based on the
number of neighbors they share [55, 56]. The SNN concept allows relatively
uniform regions to form a graph and clusters to be identified by differentiating
varying densities. In a document clustering setting, where data representation is
naturally sparse, this is an ideal solution to identify dense regions. However, the
computation of an SNN graph is expensive due to the high number of pairwise
comparisons required.
In this paper, we propose a novel and effective method called Ranking centered
Density based Document Clustering (RDDC). It first builds the SNN graph based
on the concepts of Inverted index and Ranking and then iteratively forms clusters
by finding dense regions within the shared boundary of documents in the SNN
graph.
Information Retrieval (IR) is an established field that uses the document simi-
larity concept to provide the ranked results in response to the user query [59].
An IR system is able to process queries per second on collections of millions of
documents using efficient inverted index data structure on a traditional desk-
top computer [200]. Given a query and the documents organized in the form
of inverted index on a standard desktop machine, a search engine will efficiently
retrieve the related documents ranked by the relevancy order to the query. We
conjecture that a document neighborhood can be generated using the relevant
document set found by an IR system without expensive pairwise document
comparisons. In RDDC, we propose to explore this neighborhood of relevant documents
to build the SNN graph effectively, which, in turn, reveals the core dense
points and forms clusters.
Conventional density clustering methods are known for not covering all data
points in clusters and leaving a high number of documents un-clustered [55].
To deal with this problem, we identify multiple hubs in the shared neighborhood
sets and reassign these un-clustered documents to the closest hub based on previously
calculated relevancy scores. Empirical analysis using several document corpuses
reveals that RDDC is able to cluster a high percentage of documents accurately
and efficiently compared to other state-of-the-art methods.
More specifically, in this paper we propose a novel density based clustering
method, RDDC, for sparse text data. RDDC explores the dense patches in a high
dimensional setting using a shared nearest neighbor graph built with the ranked
results of an IR system. RDDC further enhances the clustering decision using these
shared nearest neighbors as hubs in higher dimensionality. It efficiently calculates
the similarity for hubs using relevancy scores provided by the IR system. These
approaches of cluster allocation enable RDDC to obtain improved accuracy and
efficiency for document clustering.
To the best of our knowledge, RDDC is the first method that extends the IR
concepts of Inverted index and Ranking to density document clustering. Recently,
a couple of researchers have applied the ranking concept to partitional document
clustering, to produce relevant clusters instead of all clusters in semi-supervised
clustering [173] and to select centroids using ranked retrieval in k-means [30].
However, the approach employed in RDDC is entirely different from these two
works. RDDC needs neither the user-defined cluster number k nor the expensive
centroid update steps of these methods. RDDC finds the dense regions
in the SNN graph, which is built efficiently using the document ranking scores
obtained from the text data through an IR system.
2 Ranking-centered Density Document Cluster-
ing (RDDC)
Let D = {d1, d2, d3, . . . , dN} be a document corpus and di be a document
represented by a set of M distinct terms {t1, t2, t3, . . . , tM}. RDDC uses an
IR system to index all documents in D based on their terms and frequencies.
The indexed documents become input to the clustering process that includes
three main steps. (1) Firstly, the nearest neighbor sets which possess common
documents, DSNN ⊆ D are identified using the document ranking scores obtained
from the IR system in order to build the SNN graph. (2) Secondly, the graph
GSNN is built using documents in DSNN as vertices and the corresponding number
of shared relevant documents as edge weight. Dense regions are found in the graph
and a distinct cluster label in C = {c1, c2, c3, . . . , cl} is assigned to documents
in high density regions. Another set of documents DO2 that appear in low density
regions is separated out. (3) Lastly, RDDC assigns cluster labels to di ∈ DO2
according to their maximum affinity to a hub residing within a cluster that is
identified in the previous step.
2.1 Obtaining Nearest Neighbors as Relevant Documents
Document Querying. Given a document di ∈ D as a query and D orga-
nized as inverted index, an IR system generates the most relevant documents
ranked in the order of relevancy to di. A query representing the document,
q = {t1, t2, t3, ..., ts} ∈ di should be generated such that the most accurate near-
est neighbors are obtained. RDDC represents the document as a query using the
top-s terms ranked in the order of term frequency according to the length of the
document. A set of s distinct terms with 0 ≤ s ≤ M is obtained as:
\[ s = \frac{|d_i|}{k} : i = 1, 2, \ldots, N \tag{1} \]
If the length of the document is less than s, all the terms in the document are used
as the query. A factor k controls the query length in various sized documents. A
smaller k value (e.g., k=15) is used in large text data, yielding larger queries, while
a larger value (e.g., k=35) is used in short text data, yielding smaller queries. The
value of k is learned empirically for each corpus from the range 3-50.
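The query generation step of Eq. 1 can be sketched in a few lines. This is an illustrative reading rather than the authors' code: the function name `document_query`, the use of integer division and the lower bound of one query term are assumptions, and |di| is taken to be the number of tokens in the document.

```python
from collections import Counter


def document_query(tokens, k):
    """Return the top-s terms by term frequency, with s = |d_i| / k (Eq. 1).

    Assumptions: |d_i| is the token count, the division is floored, and at
    least one term is always kept so that the query is never empty.
    """
    s = max(1, len(tokens) // k)
    counts = Counter(tokens)
    # most_common returns every distinct term when there are fewer than s.
    return [term for term, _ in counts.most_common(s)]


tokens = "cluster text cluster rank text cluster graph rank density".split()
query = document_query(tokens, k=3)
print(query)
```

A smaller k yields a longer query for the same document, which matches the guidance above that long documents use small k values and short texts use larger ones.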
Document Ranking. Given the document query q, a set of the m most relevant
documents with their vector of ranking scores r is obtained using the ranking function
Rf as in Eq. 2. A number of ranking functions, such as Term Frequency-Inverse
Document Frequency (tf ∗ idf) and Okapi Best Matching 25 (BM25), can be used
to calculate the relevancy score [59]. RDDC uses the tf ∗ idf ranking function
given in Eq. 3, where tf represents how often a term appears in the document,
idf represents how often the term appears in the document collection and the field
length normalization captures the importance of the length of the field that
contains the term.
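\[ R_f : q \rightarrow D_q = \{ (d_{qj}, r_j) \} : j = 1, 2, \ldots, m \tag{2} \]
\[ r_{d_j} = \mathrm{score}(q, d_j) = \sum_{t \in q} \left( \sqrt{tf_{t,d_j}} \times idf_t^{2} \times \mathrm{norm}(t, d_j) \right) \tag{3} \]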
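A minimal sketch of the scoring function in Eq. 3 is given below. The paper does not spell out how idf and the field length norm are computed, so the definitions used here (idf = 1 + ln(N / (df + 1)) and norm = 1 / sqrt(document length), in the style of Lucene's classic TF-IDF similarity) are assumptions, as are the function names and the toy documents.

```python
import math
from collections import Counter


def idf(term, docs):
    """Assumed idf: 1 + ln(N / (df + 1)), where df counts documents with the term."""
    df = sum(1 for doc in docs if term in doc)
    return 1.0 + math.log(len(docs) / (df + 1))


def norm(term, doc):
    """Assumed field length norm: shorter documents get a larger weight."""
    return 1.0 / math.sqrt(len(doc))


def score(query, doc, docs):
    """Eq. 3: sum over query terms of sqrt(tf) * idf^2 * norm."""
    tf = Counter(doc)
    return sum(
        math.sqrt(tf[t]) * idf(t, docs) ** 2 * norm(t, doc)
        for t in query
        if tf[t] > 0
    )


docs = [
    ["rank", "cluster", "text"],
    ["rank", "rank", "graph", "density"],
    ["sport", "goal", "team"],
]
query = ["rank", "graph"]
ranked = sorted(range(len(docs)), key=lambda j: score(query, docs[j], docs), reverse=True)
print(ranked)
```

The document that contains both query terms, including the rarer term, is ranked first, and a document sharing no term with the query receives a score of zero.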
Claim 1 shows that the ranking results obtained by an IR system using a ranking
function provide the relevant neighbors to di with reduced computational time
and high accuracy, in comparison to pairwise document comparison.
Claim 1 Let N (di) be the neighborhood documents calculated from the pairwise
document comparisons of document di with rest of the documents in the collection
D of size N obtained with δ1 time and ∂1 level of accuracy. Let R (di) be the
IR ranked result of document di obtained with δ2 time and ∂2 level of accuracy.
R (di) ⊂ N (di) will be built with δ2 (< δ1) time and ∂2 (> ∂1) level of accuracy.
Proof:
• In order to obtain N (di), the (cosine) similarity has to be obtained by comparing
di with every document in D. This process consists of N − 1 steps,
which take δ1 time, and yields the top-k neighbours, where k ≥ 1,
according to similarity values with ∂1 level of accuracy.
• R (di) is obtained using inverted indexed documents in D and a ranking
function such as tf ∗ idf [59]. The tf ∗ idf ranking function only computes
similarity scores for documents containing high tf ∗ idf weights for query
terms. This is a one step process which takes δ2 time and gives most relevant
neighbour documents with ∂2 level of accuracy.
• The cluster hypothesis [91] states that “associated documents appear in a
returned result set of a query”. The reversed cluster hypothesis in the optimum
clustering framework [59] further states that “the returned documents
in response to a query will be in the same cluster”, so they can be considered
as nearest neighbours.
• Since R (di) ⊂ N (di) ⊂ D, R (di) will contain neighbours with ∂2 (> ∂1)
level of accuracy obtained with δ2 (< δ1) time.
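The efficiency argument in the proof rests on the inverted index: only documents that share at least one term with the query are ever scored. The sketch below illustrates this with a toy in-memory index; the function names and the example corpus are hypothetical and do not reflect the IR system used in the paper.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for term in tokens:
            index[term].add(doc_id)
    return index


def candidate_documents(query, index):
    """Collect only the documents that share at least one term with the query."""
    found = set()
    for term in query:
        found |= index.get(term, set())
    return found


docs = [
    ["rank", "cluster", "text"],
    ["rank", "rank", "graph", "density"],
    ["sport", "goal", "team"],
]
index = build_inverted_index(docs)
candidates = candidate_documents(["rank", "graph"], index)
print(candidates)
```

Document 2 is never touched for this query, whereas a pairwise approach would still compare it against the query document.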
2.2 Graph based Clustering
The IR ranked results should contain the relevant neighborhood documents to the
query document. RDDC uses the top-10 ranked documents as the nearest neigh-
borhood set as they possess sufficient information richness [199]. A DSNN ⊆ D
is identified by calculating common documents for each di ∈ D with its top-10
retrieved documents. Let the retrieved result sets of di and dj be R (di) and R (dj)
respectively, where dj ∈ R (di). If |R (di) ∩ R (dj)| > 2, documents di and dj
(di, dj ∈ DSNN) become vertices in the graph GSNN and the corresponding number
of shared relevant documents (|R (di) ∩ R (dj)|) becomes the edge weight. GSNN
construction leaves out a set of orphan documents DO1 (D = DSNN ∪ DO1) that
do not appear in DSNN.
\[ D_{O1} = \{ (d_i \in D) \cap (d_i \notin D_{SNN}) \} : i = 1, 2, \ldots, N \tag{4} \]
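The SNN graph construction and the orphan set of Eq. 4 can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the `retrieved` dictionary stands in for the ranked result lists returned by the IR system, the toy lists are shorter than the top-10 used in the paper, and the function name and edge representation are hypothetical.

```python
def build_snn_graph(retrieved, min_shared=2):
    """Build SNN edges from ranked result lists.

    An edge (d_i, d_j) is created when d_j is in R(d_i) and the two result
    sets share more than `min_shared` documents; the edge weight is the
    number of shared documents. Documents without any edge are orphans (Eq. 4).
    """
    edges = {}
    for di, r_i in retrieved.items():
        for dj in r_i:
            if dj == di or dj not in retrieved:
                continue
            shared = len(set(r_i) & set(retrieved[dj]))
            if shared > min_shared:
                edges[frozenset((di, dj))] = shared
    vertices = set().union(*edges.keys())
    orphans = set(retrieved) - vertices
    return edges, vertices, orphans


# Toy ranked results (each list may include the query document itself).
retrieved = {
    "a": ["a", "b", "c", "d"],
    "b": ["b", "a", "c", "d"],
    "c": ["c", "a", "b", "d"],
    "d": ["d", "c", "e", "a"],
    "e": ["e", "d"],
}
edges, vertices, orphans = build_snn_graph(retrieved)
print(sorted(vertices), sorted(orphans))
```

Document e shares only two results with d, so it gets no edge and falls into the orphan set, while a, b, c and d form a densely connected region.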
The next task is to identify dense nodes in GSNN. A dense node is connected,
within its region, to a number of documents higher than a threshold by edges
whose weight is higher than a threshold. These nodes are defined as core points in
GSNN and become the initial cluster representatives. Each cluster boundary is then
expanded to include documents with the same edge weight. This process gives
us a set of documents with cluster labels C, and it also identifies documents
DO2 that do not fit into any cluster boundaries.
\[ D_{O2} = \{ (d_i \in D) \cap (d_i \in D_{SNN}) \cap (d_i \notin C) \} : i = 1, 2, \ldots, N \tag{5} \]
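One way to picture the density step is a DBSCAN-style expansion over the SNN graph. The sketch below is an assumption-laden illustration, since Algorithm 1 is not reproduced in the text: a vertex is treated as a core point when it has at least `alpha` neighbours over edges of weight at least `alpha`, clusters grow only through core points, and vertices left unlabelled play the role of DO2 in Eq. 5. The function name and toy graph are hypothetical.

```python
import itertools
from collections import defaultdict, deque


def density_clusters(edges, alpha=3):
    """Label vertices reachable from core points over strong edges.

    Assumption: 'strong' means edge weight >= alpha and 'core' means at
    least alpha strong neighbours, mirroring a single threshold alpha.
    """
    strong = defaultdict(set)
    for pair, weight in edges.items():
        u, v = pair
        if weight >= alpha:
            strong[u].add(v)
            strong[v].add(u)
    core = {v for v, nbrs in strong.items() if len(nbrs) >= alpha}
    labels, next_label = {}, 0
    for seed in sorted(core):
        if seed in labels:
            continue
        labels[seed] = next_label
        queue = deque([seed])
        while queue:
            u = queue.popleft()
            if u not in core:  # border vertices join a cluster but do not expand it
                continue
            for v in strong[u]:
                if v not in labels:
                    labels[v] = next_label
                    queue.append(v)
        next_label += 1
    return labels


# Toy SNN graph: two dense groups, one border vertex (y) and one weakly linked vertex (x).
edges = {}
for group, weight in ((("a", "b", "c", "d"), 4), (("p", "q", "r", "s"), 3)):
    for u, v in itertools.combinations(group, 2):
        edges[(u, v)] = weight
edges[("x", "a")] = 2
edges[("y", "p")] = 3

labels = density_clusters(edges, alpha=3)
all_vertices = {v for pair in edges for v in pair}
unclustered = all_vertices - set(labels)
print(labels, unclustered)
```

No cluster count is supplied: the number of clusters follows from how many separate dense regions the graph contains, and the weakly connected vertex x is left for the relevancy-based step.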
Algorithm 1 details the process of obtaining density based clusters. Claim 2 shows
that the SNN graph can be built accurately using ranking results.
Claim 2 Let the SNN graph created with k nearest neighbourhoods Nk (D) in the
document collection D be GNk(D) and the SNN graph created with IR ranked results
R (D) be GR(D) . If R (di) ⊂ N (di) for di ∈ D , then graph GR(D) ⊆ GNk(D) and
GR(D) contains ∂2 (> ∂1) level of accuracy where ∂1 is accuracy level of the GNk(D).
Proof:
• Let V (R (D)), V (Nk (D)) be the vertices of two graphs GR(D) and GNk(D)
represented by the documents in R (D) and Nk (D) respectively and
E (R (D)) , E (Nk (D)) be the edges represented by the number of shared
documents within document pairs in R (D) and Nk (D) respectively.
• For document di to obtain k relevant neighbourhoods Nk (di) we have to
prune meaningless neighbourhood levels. Hence, Nk (di) ⊂ N (di) .
• If Nk (di) ⊂ N (di) and R (di) ⊂ N (di) then R (di) ⊆ Nk (di) as R (di) con-
tains only the most relevant neighbours according to the optimum clustering
framework [59]. Thus R (D) ⊆ Nk (D) .
• Therefore V (R (D)) ⊆ V (Nk (D)) and E (R (D)) ⊆ E (Nk (D)) . It proves
GR(D) ⊆ GNk(D) .
• The IR ranked results R (di) of di contain the relevant documents to di with
∂2 (> ∂1) level of accuracy as shown in Claim 1. Hence, GR(D) contains all
required document information to represent the SNN graph with ∂2 (> ∂1) level
of accuracy.
In this phase, a repository H = {H1, H2, H3, . . . , H∅} is also built to store the
shared relevant documents, where each node Hj ∈ H contains a set of shared
documents {di, dj, . . . , dk} and ∅ is the total number of sets of shared documents.
Usually, |H| > |C| and a node Hj contains documents from the same cluster.
The set of relevant nodes within a cluster is comparable to the concept of Hubs
in high dimensionality [177]. These hubs actually represent the sub dense regions
within clusters. RDDC accurately clusters a higher percentage of documents using
affinity calculation for these hub nodes and avoids the problem of the high number
of un-clustered documents seen in many other density based methods.
Figure 1: Algorithm for RDDC
2.3 Relevancy Based Clustering
Algorithm 2 details the process of clustering the documents DO2 that remain un-clustered
in the first phase, based on their maximum relevancy to the sets of documents
in the repository H. In high-dimensional data such as text, a cluster is
shown to contain multiple hubs of documents instead of a uniform spread across
the cluster [177]. In RDDC, the sets of shared relevant documents present in H
are considered as hubs within a cluster. We envisage that the hubs of documents
stored in the repository H will share higher affinity with di ∈ DO2 than a
cluster represented as a mean (centroid) vector. For each di ∈ DO2, we calculate
its affinity with each node in H as follows.
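\[ AS(d_i) = \left\{ \left( \frac{\sum_{u=1}^{\mathrm{sizeof}(H_j)} \mathrm{score}(d_i, d_u)}{\mathrm{sizeof}(H_j)} \right),\ j : 1, 2, \ldots, \varnothing,\ \mathrm{score}(d_i, d_u) \text{ calculated as per Eq. 3} \right\} \tag{6} \]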
The affinity score of a hub node Hj is calculated using the ranking score of each
document that it contains when the orphan document di is posed as a query.
Hub calculation in existing clustering methods is usually very expensive
due to the need for pairwise computation between all documents within a cluster
[78, 173]. In contrast, RDDC reuses the already calculated relevancy scores in
the IR ranked results to measure hub affinity, which makes the process
computationally efficient. Document di is then assigned the cluster label of the
most relevant node Hj ∈ H, i.e., the node that yields the largest affinity
score to di.
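The affinity calculation of Eq. 6 and the label assignment that follows it can
be sketched as below. This is only an illustration under stated assumptions,
not the implementation used in the experiments: the function name
assign_orphan, the dictionary of ranking scores, and the choice to count a hub
member missing from the ranked results as a score of 0 are all hypothetical.

```python
from typing import Dict, List, Tuple


def assign_orphan(scores: Dict[str, float],
                  hubs: List[Tuple[str, List[str]]]) -> Tuple[str, float]:
    """Return (cluster_label, affinity) for the hub with the largest mean score.

    scores: ranking score of each retrieved document when the orphan document
            is posed as a query (assumption: missing documents score 0.0).
    hubs:   the repository H as (cluster_label, member_document_ids) pairs;
            every hub is assumed to be non-empty.
    """
    best_label, best_affinity = "", float("-inf")
    for label, members in hubs:
        # Mean ranking score over the hub members, as in Eq. 6.
        affinity = sum(scores.get(doc, 0.0) for doc in members) / len(members)
        if affinity > best_affinity:
            best_label, best_affinity = label, affinity
    return best_label, best_affinity


# Hypothetical ranked results for one orphan document and two hubs.
example_scores = {"d1": 2.0, "d2": 1.0, "d7": 0.5}
example_hubs = [("c1", ["d1", "d2"]), ("c2", ["d7", "d8"])]
label, affinity = assign_orphan(example_scores, example_hubs)
```

No pairwise document comparison is needed here, because only scores already
returned by the search engine are averaged.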
3 Empirical Analysis
We used multiple datasets with varying dimensionality, namely 20 Newsgroups,
Reuters 21578, Media Eval Social Event Detection (SED) 2013 and SED 2014,
and the TDT5 English corpus, as reported in Table 1. We created smaller subsets
of the 20 Newsgroups dataset to compare the performance of RDDC with
the density-based document clustering method by Zhao et al. [201]. Additionally,
several other density-based clustering methods, including SNN based DBSCAN
[55], SNN based clustering for coherent topic clustering [56] and DBSCAN [57],
as well as the well-known matrix factorization method NMF [122], were used for
benchmarking.
Table 1: Summary of the datasets in the experiment.
Dataset                      # Docs   # Terms   Avg. # terms   Std. Dev. # terms   # Clusters
                                                per doc        per corpus          (ground truth)
20 Newsgroups - 20ng DS1        300      6595        104              88                  3
20 Newsgroups - 20ng DS2       2000     22841        100             119                 20
20 Newsgroups - 20ng DS3       7528     43946         97             104                 20
Reuters - R52 DS               9100     19479         46              41                 52
SED 13 - SED13 DS             99989     61806         17              23               3711
SED 14 - SED14 DS            120000     64056         16              16               3875
TDT 5 - TDT5 DS                3905     38631        172             124                 40
Each dataset was pre-processed by removing stop words and stemming the remaining
words. Document term frequency was selected as the weighting schema for the query
representation after extensive experiments. In all the experiments, the query
length was set to the optimum query length according to Eq. 1. The threshold α,
which gives the minimum number of documents and the minimum weight for the graph,
was set to 3 for all the datasets based on experiments and prior research [67].
Experiments were done using Python 3.5 on a 1.2 GHz, 64-bit processor with
264 GB of memory. Elasticsearch with fast bulk indexing was used as the search
engine to obtain relevant documents.
Standard pairwise F1-score and Normalized Mutual Information (NMI) were used
as cluster evaluation measures [201].
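As a reminder of what the pairwise F1-score measures, a minimal sketch follows.
It uses the common definition over document pairs (a pair counts as a true
positive when both the ground truth and the clustering put the two documents
together); whether [201] uses exactly this variant, and how un-clustered
documents are treated, are assumptions not settled here. The function name
pairwise_f1 is hypothetical.

```python
from itertools import combinations


def pairwise_f1(truth, pred):
    """Pairwise F1 between ground-truth labels and predicted cluster labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_pred and same_truth:
            tp += 1          # pair correctly placed together
        elif same_pred:
            fp += 1          # pair wrongly placed together
        elif same_truth:
            fn += 1          # pair wrongly separated
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Four documents, two true classes; the clustering merges one extra document.
example_f1 = pairwise_f1([0, 0, 1, 1], [0, 0, 0, 1])
```

Cluster identifiers do not need to match class identifiers, since only the
co-membership of pairs is compared.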
3.1 Accuracy Analysis
Results in Table 2 and Fig. 2 (a) show the comparative performance of RDDC
against the benchmarking methods. As shown by the average performance over all
datasets in Table 2, RDDC produces much higher accuracy than the benchmarking
methods. Results in Fig. 2 (a) ascertain that RDDC forms tight natural
clusters. It is able to identify sub-clusters within the specified clusters, as
shown by the higher number of clusters found (Fig. 2 (a)), while still producing
a higher NMI (Table 2). Sometimes this leads to a lower F1-score. Density-based
clustering is known not to cover every data point in clusters, due to the
requirement of fitting the clustered objects into a density region [57].
Fig. 2 (a) shows that RDDC is able to assign a large share of documents to
clusters with high accuracy due to the inclusion of relevancy-based clustering in
the third step. Moving from graph-based clustering alone to the added
relevancy-based clustering, RDDC shows a two-fold increase in the percentage of
documents clustered, with 52% and 17% increases in NMI and F1-score
respectively. In some datasets, DBSCAN covers more documents than RDDC; however,
a closer investigation reveals that it produces only a few large clusters that
hold a large number of documents, yielding a poor clustering solution.
Zhao et al. [201] used DBSCAN in a semi-supervised setting for document
clustering. We created 20ng DS1, 20ng DS2, and TDT5 DS according to the
explanation given in their paper, as we were unable to find an implementation of
the method. RDDC is an unsupervised method and can be considered equivalent to
the zero-constraint level of the method in [201]. As shown in Table 3, the
results produced by unsupervised RDDC are mostly superior to those of
semi-supervised DBSCAN [201]. These results show the effectiveness of using
relevancy scores, obtained with the concepts of ranking and inverted index, in
building the SNN graph, finding dense regions and forming clusters.
Table 2: Performance comparison of different datasets, methods, and metrics
            F1-score                         NMI
Dataset     RD    SD    ST    DB    MF       RD    SD    ST    DB    MF
20ng DS3    0.28  -     -     0.00  0.18     0.28  -     -     0.09  0.14
R52 DS      0.36  0.04  -     0.07  0.38     0.41  0.38  -     0.43  0.26
SED13 DS    0.87  -     -     0.00  -        0.66  -     -     0.00  -
SED14 DS    0.87  -     -     0.00  -        0.65  -     -     0.00  -
TDT5 DS     0.61  0.36  0.21  0.22  0.70     0.35  0.32  0.25  0.22  0.54
Average     0.60  0.20  0.21  0.06  0.42     0.47  0.35  0.25  0.15  0.31
Methods: RDDC (RD), SNN based DBSCAN (SD), SNN based topic clustering (ST),
DBSCAN (DB) and NMF (MF)
Note: "-" denotes out of run-time or memory
Table 3: Performance comparison with semi supervised clustering [201]
            RDDC                             Semi-supervised DBSCAN
Dataset     NMI   F1-score  # constraints    NMI   F1-score  # constraints
20ng DS1    0.66  0.75      0                0.62  0.62      25
20ng DS2    0.40  0.32      0                0.22  0.42      50
TDT5 DS     0.61  0.35      0                0.22  0.31      75
3.2 Scalability and Complexity Analysis
Fig. 2 (b) shows that the traditional SNN based methods failed to scale to
large datasets due to the computational complexity introduced by the number of
comparisons made for the k-NN search. This complexity is O (nk + nd), where n is
the number of instances in the dataset, d is the feature dimensionality and k is
the number of nearest neighbors. In contrast, the relevant document calculation
of RDDC has a computational complexity of O (m + n), where m is the query length
used to obtain relevant neighbors. RDDC consumes more time than DBSCAN, as in
Fig. 2 (b), due to the additional graph construction and maximum relevancy
calculation. However, this trade-off is well justified by the average increases
of 0.54 and 0.32 in NMI and F1-score respectively that RDDC achieves.
Incremental sampling on the SED13 collection is used to demonstrate the
scalability of RDDC. Fig. 3 (a) shows that RDDC exhibits a near-linear increase
in time with the size of the corpus, whereas traditional SNN based methods are
not scalable, as shown by the runtime entries in Table 2. Further, the
performance of RDDC with increased feature dimensionality in Fig. 3 (b) shows
that the runtime of RDDC stabilizes after an initial linear increase with
dimensionality.
Figure 2: Performance comparison with percentage of clustered documents and
time taken
Figure 3: Scalability performance of RDDC using the SED 13 and 20ng DS3 dataset
3.3 Sensitivity Analysis
The parameter sensitivity is analyzed by obtaining two independent samples of
different sizes from each dataset in Table 2. The document model for query
formation was evaluated using the tf, idf and tf ∗ idf weighting schemas.
Fig. 4 (a) shows that the tf representation outperformed the others on many
datasets. This is justified, as the important terms that determine the theme of
a document have higher weights in this scheme.
Figure 4: Term Weighting Schema and Ranking Functions
The success of RDDC relies on obtaining accurate nearest neighbors of a given
document to build the SNN graph. This depends on how a query is represented
and on the ranking function used. Two of the most popular ranking functions are
tf ∗ idf and BM25 [59]. Fig. 4 (b) shows that a higher performance in terms of
NMI and F1-score is obtained by using tf ∗ idf, so it is used in all experiments
and can be set as the default.
Next, we explored the relationship between query length and the text size in
document corpora. Fig. 5 (a) shows that there is a linear relationship between
query length and document size. Corpora with smaller documents need smaller
queries, while corpora with larger documents need larger queries. In RDDC, the
factor k is included to control the query length for documents in a corpus. We
analyze the parameter k against the size of the datasets in Fig. 5 (b). The
factor k shows an inverse linear relationship, that is, a large k value should
be set for short text data and a smaller k value should be set for large text
data. The parameter k adjusts the query size according to the length of the
document.
Figure 5: Query length and Parameter k
The threshold α in RDDC denotes both the minimum number of documents that two
documents must share to be defined as similar and the number of documents
required for a region to be considered dense. Prior research has shown that this
value can be set to 3 [67].
As shown in Fig. 6, the best performance (i.e. the maximum number of documents
clustered) is obtained when α is set to 3. For future use, the default value can
be set to 3.
Figure 6: Parameter Alpha vs. Clustered documents
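The role of α as a shared-neighbor threshold can be illustrated with the short
sketch below. It assumes that each document's retrieved neighbor set is already
available; the function name snn_edges and the brute-force loop over all pairs
are hypothetical and chosen for readability only, whereas RDDC itself derives
this information from the IR ranked results.

```python
from itertools import combinations


def snn_edges(neighbors, alpha=3):
    """Return {(doc_a, doc_b): shared_count} for pairs sharing >= alpha neighbors.

    neighbors: dict mapping a document id to the set of document ids retrieved
               as its relevant neighbors.
    """
    edges = {}
    for a, b in combinations(sorted(neighbors), 2):
        shared = len(neighbors[a] & neighbors[b])
        if shared >= alpha:
            edges[(a, b)] = shared   # edge weight = number of shared neighbors
    return edges


# Hypothetical neighbor sets: only d1 and d2 share three neighbors.
example_neighbors = {
    "d1": {"d2", "d3", "d4", "d5"},
    "d2": {"d1", "d3", "d4", "d5"},
    "d3": {"d1", "d6"},
}
example_edges = snn_edges(example_neighbors, alpha=3)
```

Raising α removes weak edges and leaves fewer documents in dense regions, which
is consistent with the sensitivity of the clustered-document count to α
reported above.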
4 Conclusion
This paper was inspired by the conjecture that, since text documents have a
sparse data representation, we should leverage techniques that suit that
representation. We proposed a novel ranking-centered density-based document
clustering method, RDDC, based on the concepts of density estimation, inverted
indexing, ranking and hubs. RDDC introduces the concept of finding nearest
neighbors using document relevancy ranking scores to construct an SNN
graph, and finds the dense regions in it to form the clusters. We showed that
using the document ranking score is more effective than calculating the pairwise
similarity between data points in text data, as it reduces the computational
complexity and improves accuracy. We also introduced a refinement phase that
increases the percentage of clustered documents by assigning orphan documents
to hubs within clusters, rather than to the cluster itself. The hubness affinity
calculation reuses the previously calculated relevancy ranking scores and thus
does not incur any overhead. We demonstrated that closeness to shared relevant
neighbors can improve the performance of text clustering due to the existence of
multiple hubs in a text cluster. Empirical results on several datasets,
benchmarked against several clustering methods, show that RDDC overcomes the
issues associated with sparse vectors and clusters text data with considerably
higher performance, in terms of both accuracy and scalability.
Paper 2: Consensus and Complementary Non-negative Matrix Factorization for
Document Clustering
Wathsala Anupama Mohotti* and Richi Nayak*
*School of Electrical Engineering and Computer Science, Queensland University
of Technology, GPO BOX 2434, Brisbane, Australia
Under Review in: Knowledge-Based Systems (KBS Journal)
Statement of Contribution of Co-Authors
The authors of the papers have certified that:
1. They meet the criteria for authorship in that they have participated in
the conception, execution, or interpretation, of at least that part of the
publication in their field of expertise;
2. They take public responsibility for their part of the publication, except for
the responsible author who accepts overall responsibility for the publication;
3. There are no other authors of the publication according to these criteria;
4. Potential conflicts of interest have been disclosed to (a) granting bodies, (b)
the editor or publisher of journals or other publications, and (c) the head
of the responsible academic unit, and
5. They agree to the use of the publication in the student’s thesis and its
publication on the QUT ePrints database consistent with any limitations
set by publisher requirements.
Contributor: Wathsala Anupama Mohotti
Statement of contribution*: Conceived the idea, designed and conducted
experiments, analyzed data, wrote the paper and addressed the supervisor and
reviewers' comments to improve the quality of paper
Signature: QUT Verified Signature
Date: 27/03/2020
Contributor: A/Prof Richi Nayak
Statement of contribution*: Provided critical comments in a supervisory capacity
on the design and formulation of the concepts, method and experiments, edited
and reviewed the paper
Signature: QUT Verified Signature
Date: 26/03/2020
ABSTRACT: Document clustering, for grouping the documents with similar con-
cepts, has been found useful in many applications of information retrieval and
knowledge discovery. High dimensionality and sparsity of text vector represen-
tation challenge state-of-the-art methods. We propose a novel method using the
non-negative matrix factorization framework, which utilizes the ranking-based
similarity and nearest neighbor-based affinity to form the document adjacency
matrices to overcome this issue. Empirical analysis using several datasets shows
that the proposed method is able to produce accurate clustering solutions as
compared to relevant benchmarking methods.
KEYWORDS: Ranking; Nearest Neighbors; Document Affinity; Non-negative
Matrix Factorization
1 Introduction
With the advancement in digital technology, text data has grown exponentially
[86]. The process of grouping documents with similar concepts, known as
document clustering, has become significant in diverse applications such as
social media analytics, opinion mining and recommendation systems [3, 140]. A
myriad of clustering methods, such as partitional, hierarchical, density-based,
matrix factorization and spectral methods, have been developed [8]. However, the
majority of existing methods are challenged by the high dimensionality and
sparsity associated with text data.
Partitional clustering methods face distance concentration where distance dif-
ference between far and near data points becomes negligible due to the high
dimensionality of text vectors and associated sparsity [205]. Hierarchical clus-
tering suffers from the scalability problem due to the requirement of multiple
pairwise computations at each step of making a clustering decision [3]. Density-
based clustering suffers from the sparseness of text data representation which is
challenging in finding density differences [140].
Matrix factorization, which maps a high-dimensional text representation into
lower dimensions, has become popular due to its capability of finding groups in
the mapped low-dimensional space. Non-negative Matrix Factorization (NMF) has
been successfully used for multivariate analysis of data such as images,
spectrograms, and documents [46]. NMF has been especially successful in document
clustering because it requires the data to be represented and processed with
non-negative values, which text data naturally provides [39, 50]. NMF learns two
low-rank factor matrices that represent document and term clusters by
decomposing a high-dimensional document × term matrix. However, it has been
reported that, in high-dimensional sparse data, dimensionality reduction fails
to capture the geometric structure of the original data [146]. Consequently,
neighboring points in the high-dimensional space do not remain close in the
projected low-dimensional space. This information loss adversely affects the
cluster solution [3].
Manifold learning has been proposed as a solution to preserve the geometric
structure of the original data and ensure that close points remain close in the
lower-dimensional space [22]. The majority of these methods preserve the local
geometry of the data by building a graph based on local neighborhood information
in the dataset [131]. This category of methods is known as spectral methods
[203]. They face higher computation cost and information loss due to calculating
the adjacency for each document using the affinity graph and projecting the
original data into a new coordinate space [8]. There also exist global manifold
learning methods that preserve geometric information at different scales [203].
However, the distance calculation in these methods is computationally expensive.
Researchers have also attempted to combine both local and global properties of
the manifold in hybrid methods [203].
Recently, researchers have explored the ranking concept, commonly used in
Information Retrieval (IR), to find nearest neighbors in document clustering
[30, 140, 173]. IR is a well-established field, which uses inverted index data struc-
ture and the ranking concept to retrieve a list of matched (or similar) documents
for a query [30]. Researchers have shown the improved accuracy and scalabil-
ity in partitional clustering by utilising the ranking results, generated from an IR
system, to assign a document to a cluster [30, 173], as well as in density-based clus-
tering where the Shared Nearest Neighbor graph is constructed with IR ranked
document sets [140]. IR ranking shows the potential of calculating global
neighborhood information considering the entire document collection. However,
this approach has not yet been explored in matrix factorization-based document
clustering.
Figure 1: Overview of the proposed CCNMF method
Legend: A, document × term matrix; S1, NN based symmetric document affinity
matrix; S2, IR based symmetric document affinity matrix; H1, H2, H, document ×
cluster matrices; W, cluster × term matrices. The output is the final document ×
cluster matrix H.
In this paper, we propose a novel method, Consensus and Complementary Non-
negative Matrix Factorization (CCNMF), that aims to preserve the geometric
structure of the data by combining the nearest neighbor (NN) information with
the document-term representation. Firstly, the local affinity between documents
is obtained by embedding NNs calculated using pairwise document similarity. The
top k-NNs are chosen for each document and a symmetric matrix representation
is generated to encode the NN information. Secondly, the global affinity between
documents is obtained using an IR system to form another symmetric matrix to
represent the top k-NNs. The hybrid manifold learning approach employed in
CCNMF empowers it to effectively take more reliable neighborhood information
with IR while using higher representation capability of pairwise NNs. A novel
clustering objective function is proposed by combining these two matrices with
the document × term matrix in the NMF framework to accurately obtain the
document clusters.
CCNMF learns the optimum document cluster representation by iteratively
approximating the two symmetric matrices and the document-term representation
matrix while minimizing the learning error, as in Figure 1. Along with the
complementary global and local NN specific information given by the matrices,
CCNMF internally combines consensus information in the data during the
factorization using sequential update rules. We conjecture that the consensus
and complementary information provided by local and global NNs is able to
minimize the information loss that occurs with the higher-to-lower order
approximation of the document × term matrix.
To the best of our knowledge, CCNMF is the first such method that extends
the ranking concept to matrix factorization-based document clustering. Empiri-
cal analysis using several document corpora of varying sizes shows that manifold
learning in CCNMF results in providing a more accurate clustering solution com-
pared to relevant state-of-the-art methods. More specifically, this paper brings
several novel contributions to document clustering.
• A novel manner to utilise manifold-learning in NMF by combining the
document-term matrix factorization with the nearest-neighbor information
that preserves the geometric structure of the data, in order to improve the
accuracy of document clustering
• The use of local document affinity represented with pairwise document sim-
ilarity and global document affinity represented with the ranked results for
text clustering
• A novel approach to use consensus and complementary information in local
and global NNs to improve the NMF-based document clustering
The rest of the paper is organized as follows. Section 2 reviews the work related to
document clustering. The proposed approach and implementation are elaborated
in Section 3. A comprehensive empirical study and benchmarks on several public
datasets are provided in Section 4. Concluding remarks are presented in
Section 5.
2 Related Work
NMF has been used successfully in clustering text documents. It approximates
the high-order non-negative documents × terms matrix by lower-rank factor
matrices, which represent the groups of terms and the groups of documents on
the basis of shared terms [116], where the reduced rank can represent the number
of clusters in the data. This dimensionality reduction of the document Vector
Space Model (VSM), based on the underlying semantic relationships, is able to
handle sparsity in the higher-order representation [149].
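For reference, the standard NMF problem described above is commonly written as
the following optimization. The symbols X, U, V, n, m and r are generic
placeholders introduced only for this illustration and are not the notation used
for CCNMF later in the paper.

```latex
\min_{U \ge 0,\; V \ge 0} \; \lVert X - U V \rVert_F^2,
\qquad X \in \mathbb{R}_{\ge 0}^{n \times m},\;
U \in \mathbb{R}_{\ge 0}^{n \times r},\;
V \in \mathbb{R}_{\ge 0}^{r \times m},\;
r \ll \min(n, m)
```

Here the reduced rank r plays the role of the number of clusters mentioned
above.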
With the increase in dimension, the distance difference between far and near
points becomes negligible [205], as in Figure 2. This distance concentration
problem is evident in high-dimensional sparse text data and challenges the
traditional methods based on centroids and hierarchies in deciding the matching
clusters using distance. Density differences also cannot be traced in sparse
data without sophisticated designs [140]. On the other hand, NMF, which encodes
the text data and projects it to low-rank matrices while retaining natural data
non-negativity and semantic relatedness, is proposed as an effective solution to
find clusters in the newly mapped low-dimensional space. It eliminates the need
for the subtractive basis vector and encoding calculations present in other
dimensionality reduction techniques such as spectral clustering [166], which are
expensive.
Figure 2: Distance concentration in higher dimensions [52]
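The distance concentration effect can be observed with a small experiment such
as the sketch below. It is an illustration only: it uses dense uniform random
points rather than sparse text vectors, and the function name relative_contrast,
the sample size and the seed are arbitrary assumptions.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # Euclidean distance between the query and the sampled point.
        dists.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)


# The contrast between far and near points shrinks as dimensionality grows.
low_dim_contrast = relative_contrast(2)
high_dim_contrast = relative_contrast(500)
```

A small relative contrast means that the nearest and farthest points are almost
equally distant, which is why distance-based cluster assignment becomes
unreliable in high dimensions.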
Spectral clustering identifies subgroups with non-convex geometric structures,
in contrast to traditional clustering methods such as k-means [143]. The
original data is projected into a new coordinate space, which encodes
information about how near the data points are to each other. The eigenvalues
(through the first k eigenvectors) of the Laplacian matrix of the data are used
to perform dimensionality reduction. The similarity transformation reduces the
dimensionality of the space and pre-clusters the data into orthogonal dimensions
[143]. This pre-clustering is non-linear and allows for arbitrarily connected
non-convex geometries. However, spectral clustering faces the fundamental
limitation of depending on the selected eigenvectors [141]. These selected
values cannot successfully cluster datasets that contain structures at different
scales of size and density, such as text data [141]. In contrast, NMF is able to
directly encode text data, preserving natural non-negativity, into the two
lower-rank factors, which automatically represent groups of documents and terms.
Therefore, this paper proposes an NMF-based approach to represent the document
cluster assignment.
However, NMF and other dimensionality reduction methods face information loss
while compressing high to low dimensions [149]. This dimensionality reduction
fails to capture the geometric relationships in original data and neighboring points
in high-dimension do not remain as close points in the projected low-dimensional
space [146]. Researchers have explored different approaches to assist traditional
NMF to avoid this information loss. A term adjacency matrix has been used
to assist the factorization of the document × term matrix in order to
semantically enhance the clustering decision [168]. It learns the relationships
of terms to their context to enforce geometric relationships.
Manifold learning is another approach to combine with the NMF framework to
discover and maintain the geometric structure of data when projecting it to a low-
dimensional space [131]. Manifold learning algorithms vary based on the type of
the geometry they attempt to preserve, such as local, global or hybrid [203].
Local methods, also known as spectral clustering, encode the local geometry of
the data, which shows the high representational ability by building an affinity
graph that incorporates neighborhood information [31]. Global methods give a
more reliable embedding by preserving geometric relationships at different levels
[203] though they are computationally expensive. However, hybrid methods that
combine positive capabilities of both aforementioned approaches have given better
performance in manifold learning [203]. Therefore, in this paper, we are proposing
a clustering algorithm, which uses both local and global NN relationships.
Co-clustering is another parallel branch of methods that uses the dual
information within the rows and columns of the matrix data simultaneously for
clustering [66]. Traditional methods focus on one-sided clustering, i.e.,
clustering data based on the features of the data. In contrast, co-clustering
groups data points based on their distribution over features while concurrently
grouping features based on their distribution over the data points [66, 167].
This concept is used in document clustering so that the two groupings assist
each other in improving the quality of the clustering solution. It
simultaneously clusters documents and words to find a global clustering
solution, where word clustering induces document clustering and document
clustering induces word clustering [44]. These extensions with different forms
of information assistance show the capability to minimize the information loss
in dimensionality reduction.
Generally, geometric relationships are considered through nearest neighbors
[143]. Simple pairwise comparisons between points, calculated locally on the
VSM, are used to identify the NNs. This information about nearby data points is
used in the clustering process to group them [143]. Distinct from this document
affinity, which relies only on local information, Information Retrieval (IR)
systems are capable of producing global NNs. Given a document query against
the entire document collection organized in the form of an inverted index, a
search engine retrieves the related documents ranked in order of relevancy to
the query, maintaining near and far relationships [59]. In this paper, we
propose to utilize this concept to generate NNs that encode the global
information present in the data.
The use of IR for clustering is an emerging area. In our prior works, we have
utilized this concept, IR-based NNs for partitional and density-based document
clustering methods [140, 173]. The frequent nearest neighbors for points in high
dimensions are known as Hubs [173]. Data points in high-dimensional data tend
to be closer to these hubs than cluster mean. These hub points were generated
using IR ranking responses in [173] and used in assigning a data point into a
cluster considering the closest hub. In [140], a Shared Nearest neighbor graph
is constructed using the ranked results and the density estimation is done on
the graph to assign documents to clusters. The approach employed in CCNMF
is distinct from these works. We use the local NNs generated with a simple
pairwise similarity calculation as well as the global NNs calculated using the
ranked results to incorporate the geometrical structure of the data during the
matrix factorization process.
This paper proposes a novel approach to assist non-negative document × term
matrix factorization using document adjacency matrices with locally and globally
generated nearest neighbor information in the document collection. In contrast
to semantic assistance given in [168] and manifold learning in [131], CCNMF
generates local NNs with pairwise document comparison and global NNs with
IR ranking to assist NMF with the geometric structures. Our hypothesis in CC-
NMF is to incorporate both consensus and complementary information in docu-
ment clustering through NNs and document-term distribution. CCNMF clusters
documents utilizing adjacency distribution on documents while simultaneously
clustering documents based on term distribution. Within each iteration in the
optimization process, CCNMF exchanges this information to induce document
clustering.
In summary, CCNMF uses
• NMF-based clustering [39, 50] on naturally non-negative text data to iden-
tify the groups in a document collection.
• geometric relationships between data points as in manifold learning methods
[143, 203], for assisting the clustering process.
• global NNs calculated with the IR system, as used in partitional and
density-based clustering [140, 173], to obtain a more faithful clustering
solution.
3 Consensus and Complementary Non-negative
Matrix Factorization (CCNMF)
3.1 Overview of CCNMF
Let D = {d1, d2, ..., dN} be a document collection of N documents that contain
a set of distinct terms {t1, t2, ..., tM}. A document di is represented as a set
of v distinct terms {t1, t2, ..., tv} where v ≪ M. Let matrix A represent the
M × N term-document matrix with the entries as term counts. Let D also be
organized in the form of an inverted index data structure. The inverted index
keeps a dictionary of terms, together with a posting list that indicates which
documents each term occurs in [133]. A search engine can efficiently rank a
document with respect to the query using its inverted index [133].
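A minimal sketch of such an inverted index is given below, assuming the
documents have already been tokenized and pre-processed. The function name
build_inverted_index and the choice to store term frequencies in the posting
lists are illustrative assumptions; a production search engine stores
considerably more information.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a posting list {doc_id: term frequency}.

    docs: dict mapping a document id to its list of pre-processed terms.
    """
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


# Hypothetical pre-processed documents.
example_docs = {
    "d1": ["rugby", "player", "ground", "ball"],
    "d2": ["rugby", "ball", "oval"],
    "d4": ["cricket", "round", "ball"],
}
example_index = build_inverted_index(example_docs)
```

Looking up a query term returns only the documents that contain it, so ranking
never needs to touch documents that share no term with the query.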
CCNMF calculates the local NNs using a pairwise distance calculation and the
global NNs using an IR system to form the adjacency matrices S1 and S2
respectively. The symmetric affinity matrices S1 and S2 are modeled with
Skip-Gram with Negative Sampling (SGNS) [117] weighting to produce an affinity
value for each pair of points that reflects how closely they occur as neighbors
in the entire collection. This encodes the geometric information of data points
in the entire collection, i.e., how near the points are to each other.
CCNMF uses a novel NMF-based approach to decompose the input data matrix
A and the two affinity matrices S1 and S2 to identify W, H, H1 and H2. The
overall process of CCNMF is shown in Figure 1. The overarching aim is to obtain
the optimum document-cluster matrix H ∈ R^(N×G) that harmonizes the information
given by each input matrix. CCNMF enables complementary information
through different metrics in the matrix factorization process while maintaining
consensus information with the update rules of the factor matrices. CCNMF
learns a cluster label for each di ∈ D by assigning it to the cluster g ∈ G
with the highest coefficient in H.
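The final label assignment step can be sketched as follows, assuming H is
available as a list of rows with one non-negative coefficient per cluster. The
function name cluster_labels is hypothetical, and breaking ties in favor of the
lowest cluster index is an assumption made for this illustration.

```python
def cluster_labels(H):
    """Return, for each document row of H, the index of its largest coefficient."""
    # max() returns the first maximal index, so ties go to the lowest cluster.
    return [max(range(len(row)), key=row.__getitem__) for row in H]


# Hypothetical document-cluster matrix with three documents and two clusters.
example_H = [[0.9, 0.1], [0.2, 0.7], [0.5, 0.5]]
example_labels = cluster_labels(example_H)
```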
3.2 Nearest Neighbor Calculation with Skip-Gram with
Negative Sampling
Nearest Neighbors
Two symmetric affinity matrices S1 and S2 are generated considering all the
documents in D: S1 holds the top-k local NNs obtained using distance
differences, and S2 holds the top-k global NNs obtained using common as well as
rare terms. The cluster hypothesis [91] and the reverse cluster hypothesis [59]
show that semantic relationships are embedded in ranking-based similarity
identification. In line with them, a ranking function employed in a search
engine is used to identify the global NNs in S2.
Local Nearest neighbors The documents in D are represented using the Vector
Space Model. A pairwise document comparison using Euclidean distance is carried
out on all pairs in the collection. Let {l1, l2, ..., lN} be the list of
pairwise distances between di ∈ D and all N documents in the collection. The
set of k documents closest to di, i.e., those with the lowest distances, is
considered as the local k-NNs. This set of documents DlNN of di ∈ D can be
represented as follows.
$$ D_{lNN} = \{d_p : p = 1, 2, \ldots, k\} \leftarrow \mathrm{top}_k\big(\mathrm{sort}(\{l_1, l_2, \ldots, l_N\})\big) \tag{1} $$
Let S1 be an N × N document-document matrix where each row represents the
DlNN of the document di ∈ D by showing the value 1 for its members.
$$ S1_{(i,j)} = \begin{cases} 1, & \text{if } d_j \in D_{lNN}(d_i) \\ 0, & \text{if } d_j \notin D_{lNN}(d_i) \end{cases} \tag{2} $$
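Eqs. 1 and 2 can be sketched in a few lines, as below. This is a brute-force
illustration under stated assumptions: documents are given as dense numeric
vectors, a document counts as its own nearest neighbor (distance 0), ties are
broken by document index, and the function name local_knn_matrix is
hypothetical. The matrix is built exactly as Eq. 2 states, without any
additional symmetrization step.

```python
import math


def local_knn_matrix(vectors, k):
    """Build S1: row i marks with 1 the k documents closest to document i."""
    n = len(vectors)
    S1 = [[0] * n for _ in range(n)]
    for i in range(n):
        # Euclidean distance from document i to every document (Eq. 1).
        dists = [
            (math.sqrt(sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))), j)
            for j in range(n)
        ]
        for _, j in sorted(dists)[:k]:
            S1[i][j] = 1          # membership indicator of Eq. 2
    return S1


# Hypothetical 2-dimensional document vectors forming two obvious groups.
example_vectors = [[0, 0], [0, 1], [5, 5], [5, 6]]
example_S1 = local_knn_matrix(example_vectors, k=2)
```

The nested loop makes the quadratic cost of the pairwise comparison explicit,
which is the cost that the IR-based global neighbors avoid.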
Global Nearest neighbors Let di ∈ D be posed as a document query using
the set of distinct terms {t1, t2, ..., tt} contained in di, where t ≤ v. The
top-ranked documents retrieved by a ranking function Rf in the IR system are
considered the global NNs. For a given query document q, Rf returns the documents
in D in order of relevancy, together with the vector r of their relevancy scores.
The k most relevant documents among them are selected as the global NN documents
DgNN of di ∈ D.
\[ R_f : q \rightarrow D_{gNN} = \{(d_p, r_p) : p = 1, 2, \ldots, k\} \tag{3} \]
Let S2 be an N × N document-document matrix in which row i marks the members
of DgNN of document di ∈ D with the value 1.
\[ S_2(i,j) = \begin{cases} 1, & \text{if } d_j \in D_{gNN}(d_i) \\ 0, & \text{if } d_j \notin D_{gNN}(d_i) \end{cases} \tag{4} \]
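Eqs. 3 and 4 can be illustrated with a small stand-in for the IR system. The sketch below is an assumption-laden simplification: it scores documents with a plain tf ∗ idf sum over the distinct query terms instead of calling a search engine, and the function name global_nn_affinity, the smoothed idf, and the tie-breaking by document index are choices made only for this example.

```python
import math
from collections import Counter

def global_nn_affinity(docs, k):
    """docs: list of token lists. Returns S2 as N rows of 0/1 entries
    (hypothetical helper; a real IR system would replace the scoring)."""
    n = len(docs)
    tfs = [Counter(doc) for doc in docs]
    # Document frequency and a smoothed idf (the smoothing is an assumption).
    df = Counter(term for tf in tfs for term in tf)
    idf = {term: math.log(1.0 + n / df[term]) for term in df}
    S2 = [[0] * n for _ in range(n)]
    for i, query in enumerate(tfs):
        # Relevancy score r_p of every document d_p for the distinct terms of d_i.
        scores = [sum(tf[t] * idf[t] for t in query) for tf in tfs]
        # Eq. 3: rank by decreasing relevancy; ties are broken by document index.
        ranked = sorted(range(n), key=lambda p: (-scores[p], p))
        for p in ranked[:k]:
            S2[i][p] = 1  # Eq. 4: mark the global k-NNs of d_i
    return S2
```

Because rare terms receive a larger idf, documents that share a rare term with the query are ranked above documents that share only a common term.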
Example: The toy document collection in Fig. 3 shows the local NNs obtained
using pairwise distance calculation and the global NNs obtained using IR ranking
similarity. For the first three documents, the distance-based approach is able to
identify the NNs more accurately than the IR system. In contrast,
the IR system identifies the NNs accurately for the last two documents. This
shows the need for using both the local and global NNs in identifying clusters.
ID  Document
d1  Rugby players are in the ground with a ball
d2  Rugby ball is oval shape
d3  Rugby team contains fifteen players
d4  Cricket use round ball
d5  Cricket can play outdoor as well as indoor
d6  Cricket plays with twelve players

3-Nearest Neighbours based on Pairwise Distance (local NNs)
    d1  d2  d3  d4  d5  d6
d1  1   1   1   0   0   0
d2  1   1   1   0   0   0
d3  1   1   1   0   0   0
d4  0   1   0   1   1   0
d5  0   0   0   1   1   1
d6  0   0   1   0   1   1

3-Nearest Neighbours based on IR ranking Similarity (global NNs)
    d1  d2  d3  d4  d5  d6
d1  1   1   0   1   0   0
d2  1   1   0   1   0   0
d3  0   0   1   0   0   0
d4  0   1   0   1   0   1
d5  0   0   0   1   1   1
d6  0   0   0   1   1   1

The unique information given by each matrix shows the requirement of using both local NNs and global NNs in identifying clusters
Figure 3: Example showing the importance of incorporating both local and global NNs
Skip-Gram with Negative Sampling (SGNS) modeling
CCNMF aims to capture the NN distribution in S1 and S2 effectively with
SGNS modeling [168]. SGNS is used to highlight the documents that appear
together more often in comparison to their neighborhood. The Skip-Gram model
is one of the most popular neural-network-based techniques for learning word
embedding representations [137]. It is able to capture the context of a word in a
corpus, whereas the continuous bag-of-words model fails to do so [117]. The concept
of negative (word, context) sampling is used with the Skip-Gram model to maximize
the probability of an observed pair while minimizing the probability of
unobserved pairs in distributed word representation [168]. SGNS has been shown to be
equivalent to factorizing a (shifted) word correlation matrix whose cells are the
point-wise mutual information of the respective word and context pairs [117]. We
propose to model S1 and S2 as the Skip-Gram model in CCNMF to represent
neighboring document pairs considering other neighborhood pairs. The negative
sampling concept in the Skip-Gram model enables affinity matrices S1 and S2
to maximize the probability for the document pairs that show high presence in
comparison to their neighborhoods while minimizing the probability of document
pairs that show less presence. We conjecture that SGNS will help to preserve the
closeness of objects while projecting the original data to low-dimensional metrics.
Let c(di,dj) be the original cell value that represents the existence (i.e., 1) or
non-existence (i.e., 0) of an NN relationship between di and dj in S1 or S2. The
SGNS weighting model considers only the cells whose value of 1 indicates an NN
relationship between di and dj, together with the number of documents that di and
dj have within their nearest neighborhoods, given by the number of 1s in the di
row and the dj column. By dividing the original cell value c(di,dj) by these
neighborhood association values, CCNMF positions the NN relationship between di
and dj with respect to any neighborhood, as in Eq. 5. Updating the affinity between
the document pair di and dj in S1 and S2 in this way gives more information than the
binary representation, which only indicates whether a document is a neighbor or not.
\[ \left[ S_1(d_i,d_j) \mid S_2(d_i,d_j) \right] = \log \left[ \frac{c(d_i,d_j) \times T}{\sum_{d_a \in D} c(d_a,d_i) \times \sum_{d_a \in D} c(d_a,d_j)} \right] \tag{5} \]
where T is the total number of document pairs that appear as nearest neighbors
in the entire affinity matrix.
Aligning with negative sampling, cell values S1(di,dj) or S2(di,dj) that are
less than 0 after taking the logarithm are set to 0, as in [117], to minimize the
probability of document pairs that show less presence.
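The weighting in Eq. 5 followed by the clipping of negative cells can be sketched as below. The function name sgns_weight and the use of NumPy are assumptions for illustration. T is taken as the number of 1s in the binary matrix, and cells with c(di,dj) = 0 are left at 0 because the weighting considers only cells that show an NN relationship; both readings are assumptions drawn from the description above. The denominator follows Eq. 5 literally by using column sums, which equal the row sums when the affinity matrix is symmetric.

```python
import numpy as np

def sgns_weight(S):
    """Apply the Eq. 5 weighting to a binary affinity matrix and set
    negative cells to 0 (hypothetical helper, a sketch only)."""
    S = np.asarray(S, dtype=float)
    T = S.sum()                      # total number of NN pairs in the matrix
    col = S.sum(axis=0)              # sum over d_a of c(d_a, d_i) for every d_i
    denom = np.outer(col, col)       # product of the two neighborhood sizes
    weighted = np.zeros_like(S)
    mask = (S > 0) & (denom > 0)     # only cells that show an NN relationship
    weighted[mask] = np.log(S[mask] * T / denom[mask])
    # Negative-sampling step: cells below 0 after the logarithm become 0.
    return np.maximum(weighted, 0.0)
```

A pair of documents that are neighbors of each other but of few other documents receives a larger weight than a pair in which both documents appear in many neighborhoods.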
3.3 Matrix Factorization
The aim of CCNMF is to decompose the term × document matrix A, modeled
as a VSM, with NMF, utilising the document association matrices S1 and S2. It
learns the final document-cluster membership matrix H using the distinct
information attached to each input matrix. We call it complementary, as it utilises
the information from each input matrix independently.
Definition 1 Complementary Non-negative Matrix Factorization: It
is the process of learning two non-negative factor matrices that approximate each
input matrix (A, S1 and S2) and achieve the final document-cluster matrix by
emphasizing the characteristics specific to each input matrix.
CCNMF decomposes matrix $A \in \mathbb{R}^{M \times N}$ into two non-negative factor matrices
$W \in \mathbb{R}^{M \times G}$ and $H \in \mathbb{R}^{N \times G}$, where G is a lower rank that represents the number
of clusters [116]. It is formulated as follows.
\[ A \approx W H^{T} \tag{6} \]
The W and H matrices learn the term clusters and document clusters respectively.
In order to preserve the associations between documents, CCNMF uses the local
NNs and global NNs information with matrix A. The symmetric matrices S1
and S2, which carry the association information, are also decomposed into
non-negative $H_1 \in \mathbb{R}^{N \times G}$ and $H_2 \in \mathbb{R}^{N \times G}$ respectively with matrix H as follows.
\[ S_1 \approx H H_1^{T} \tag{7} \]
\[ S_2 \approx H H_2^{T} \tag{8} \]
We propose the following objective function to reduce the learning error (in the
Frobenius norm) of approximating each input matrix by incorporating the specific
information given with that matrix.
\[ \min_{W,H \geq 0} \|A - W H^{T}\|_F + \min_{H,H_1 \geq 0} \|S_1 - H H_1^{T}\|_F + \min_{H,H_2 \geq 0} \|S_2 - H H_2^{T}\|_F \tag{9} \]
This learning process obtains the final document-cluster assignment matrix H by
considering specific information from each input matrix. Thus, CCNMF can be
considered a complementary NMF for document clustering, as in Definition 1.
Solving the optimization problem
The NMF process combines the complementary information provided by input
matrices A, S1 and S2 to learn the document-cluster matrix H. At the same
time, it harmonizes the compatible information given by each input to achieve
the optimum H during the factorization process.
Definition 2 Consensus Non-negative Matrix Factorization: It is the
process of combining the inter-dependent information present in each input matrix
for learning the non-negative factor matrices to support the approximation of the
optimum document-cluster matrix.
CCNMF learns the inter-dependent factor matrices interactively, exchanging in-
formation within them. In each iteration of the optimization process, CCNMF
updates matrix entries sequentially. It solves these interdependent sub-problems
sequentially starting from W using the Block Coordinate Descent (BCD) algo-
rithm [103]. The BCD algorithm divides the matrix members into several disjoint
subgroups and iteratively minimizes the objective function with respect to the
members of each subgroup g ∈ G at a time.
\[ W_{(:,g)} \leftarrow \left[ W_{(:,g)} + \frac{(AH)_{(:,g)} - (W H^{T} H)_{(:,g)}}{(H^{T} H)_{(g,g)}} \right] \tag{10} \]
When BCD solves sub-problems that depend on each other, they have to be
computed sequentially to make use of the most recent values of the associated
factor matrices. In CCNMF, the most recent values of members at first iteration
are set to zeros at the initialization. Firstly, the BCD update rule has been used
for finding W in the NMF optimization using the term-document matrix A and
initial matrix H as in Eq. 10.
Secondly, matrix H is updated using the current values of W and other members
as follows:
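\[ H_{(:,g)} \leftarrow \left[ H_{(:,g)} + \frac{(A^{T}W)_{(:,g)} + (S_1 H_1)_{(:,g)} + (S_2 H_2)_{(:,g)}}{(W^{T}W)_{(g,g)} + (H_1^{T}H_1)_{(g,g)} + (H_2^{T}H_2)_{(g,g)}} - \frac{(H H_1^{T}H_1)_{(:,g)} + (H H_2^{T}H_2)_{(:,g)} + (H W^{T}W)_{(:,g)}}{(W^{T}W)_{(g,g)} + (H_1^{T}H_1)_{(g,g)} + (H_2^{T}H_2)_{(g,g)}} \right] \tag{11} \]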
Then, S1, the matrix representing local NNs-based affinity and most recent values
of H are used in updating H1 as in Eq. 12.
\[ H_{1(:,g)} \leftarrow \left[ H_{1(:,g)} + \frac{(S_1 H)_{(:,g)} - (H_1 H^{T} H)_{(:,g)}}{(H^{T} H)_{(g,g)}} \right] \tag{12} \]
Finally, H2 is updated using S2, the matrix representing global NNs-based affinity
and most recent values of H as in Eq. 13.
\[ H_{2(:,g)} \leftarrow \left[ H_{2(:,g)} + \frac{(S_2 H)_{(:,g)} - (H_2 H^{T} H)_{(:,g)}}{(H^{T} H)_{(g,g)}} \right] \tag{13} \]
Most recent values of the matrix entries are used in updating other factor matrices
in this iterative optimization process. This interdependent information exchange
between H ↔ H1 and H ↔ H2 allows NMF to combine information present in
each input matrix for learning final matrix representation of H. Thus, CCNMF
uses a consensus NMF process as in Definition 2.
The overall algorithm of CCNMF is given in Algorithm 1. The final document-cluster
assignment matrix H represents the probability coefficients of each document
being assigned to each cluster g ∈ G. We choose the cluster that possesses
the highest coefficient within H as the cluster assigned to a specific document.
CCNMF follows a hard clustering approach with the assumption that a document
belongs to only one cluster.
Algorithm 1 Consensus and Complementary Non-negative Matrix Factorization
Input : Term-Document matrix A
        Local NN affinity matrix S1
        Global NN affinity matrix S2
        Number of Clusters G
Output: Final Document-Cluster matrix H
Init: W ≥ 0, H ≥ 0, H1 ≥ 0, H2 ≥ 0 random real numbers
while Eq. 9 has not converged do
    foreach g = 1 : G do
        Compute W using Eq. 10
        Compute H using Eq. 11
        Compute H1 using Eq. 12
        Compute H2 using Eq. 13
    end
end
Convergence: old error − new error < 1e-3 OR number of iterations > 100
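Algorithm 1 and the update rules in Eqs. 10 to 13 can be sketched in NumPy as follows. This is an illustrative reading rather than the code used in the experiments. The function name ccnmf, the seed argument, the small constant eps in the denominators, and the projection of every updated column onto non-negative values are assumptions; the projection is added so that the factors satisfy the non-negativity constraints of Eq. 9.

```python
import numpy as np

def ccnmf(A, S1, S2, G, max_iter=100, tol=1e-3, seed=0):
    """Sketch of Algorithm 1. Returns W, H, H1, H2 and the Eq. 9 error
    recorded at initialization and after every sweep."""
    A, S1, S2 = (np.asarray(X, dtype=float) for X in (A, S1, S2))
    M, N = A.shape
    rng = np.random.default_rng(seed)
    # Init: non-negative random factors, as in Algorithm 1.
    W, H = rng.random((M, G)), rng.random((N, G))
    H1, H2 = rng.random((N, G)), rng.random((N, G))
    eps = 1e-12  # assumption: avoids division by zero if a column collapses

    def error():
        # Eq. 9: sum of the three Frobenius-norm approximation errors.
        return (np.linalg.norm(A - W @ H.T) + np.linalg.norm(S1 - H @ H1.T)
                + np.linalg.norm(S2 - H @ H2.T))

    errors = [error()]
    for _ in range(max_iter):
        for g in range(G):
            HtH = H.T @ H
            # Eq. 10: update column g of W from A and the current H.
            W[:, g] = np.maximum(
                W[:, g] + (A @ H[:, g] - W @ HtH[:, g]) / (HtH[g, g] + eps), 0.0)
            WtW, H1tH1, H2tH2 = W.T @ W, H1.T @ H1, H2.T @ H2
            denom = WtW[g, g] + H1tH1[g, g] + H2tH2[g, g] + eps
            # Eq. 11: update column g of H from A, S1 and S2 together.
            gain = A.T @ W[:, g] + S1 @ H1[:, g] + S2 @ H2[:, g]
            loss = H @ H1tH1[:, g] + H @ H2tH2[:, g] + H @ WtW[:, g]
            H[:, g] = np.maximum(H[:, g] + (gain - loss) / denom, 0.0)
            HtH = H.T @ H
            # Eqs. 12 and 13: update H1 and H2 with the most recent H.
            H1[:, g] = np.maximum(
                H1[:, g] + (S1 @ H[:, g] - H1 @ HtH[:, g]) / (HtH[g, g] + eps), 0.0)
            H2[:, g] = np.maximum(
                H2[:, g] + (S2 @ H[:, g] - H2 @ HtH[:, g]) / (HtH[g, g] + eps), 0.0)
        errors.append(error())
        # Convergence: old error - new error < tol, or max_iter sweeps done.
        if errors[-2] - errors[-1] < tol:
            break
    return W, H, H1, H2, errors
```

With the returned H, the hard cluster label of each document is the index of its highest coefficient, for example H.argmax(axis=1).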
4 Experiments
4.1 Datasets and Experiment setup
We used four publicly available English text datasets and one dataset accessible
with permission, as reported in Table 1. DS1¹ consists of webpages collected by the
WebKB project of the CMU text learning group. DS2 and DS3² are popular
news data collections known as Reuters 21578 and 20Newsgroup, respectively.
¹ http://ana.cachopo.org/datasets-for-single-label-text-categorization
² http://ana.cachopo.org/datasets-for-single-label-text-categorization
Table 1: Summary of the datasets used in the experiments
Dataset                      # of        Mean & median        # of unique   # of
                             documents   terms per document   terms         clusters
WebKb (DS1)                  4199        77, 59               7668          4
R52 (DS2)                    9100        46, 33               19479         52
20Newsgroup (DS3)            18821       97, 73               69610         20
TDT5 (DS4)                   27468       157, 133             113708        40
HealthServicesTickets (DS5)  50000       4, 4                 12106         21
DS4³ is a news dataset released for the topic detection and tracking task. DS5⁴ is a
healthcare service dataset obtained from “kaggle”. These datasets show diverse
characteristics in terms of the number of clusters and size of the collection that
facilitate analysis of CCNMF.
Datasets have been preprocessed with word stemming and stop-word removal. Matrix
A is represented as a Vector Space Model (VSM) with term frequency as the
weighting. In all experiments, when a document is posed as a query to the IR
system, it is represented by its top-10 terms in order of term frequency, as in
[173]. This paper uses the Elasticsearch 2.4 search engine as the IR system and
obtains the top-m (m=10) documents returned in response to a document query;
these have been shown to possess sufficient information richness [199]. The
tf ∗ idf ranking function is used to measure the relevancy between the query
document and the responses. Experiments were done using Python 3.5 on a single
processor of a 1.2 GHz Intel (R) Xeon (R) with 264 GB of shared memory.
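The query construction described above, in which a document is represented by its most frequent terms, can be sketched as follows. The function name document_query is hypothetical, and the input is assumed to be a list of tokens that have already been stemmed and stripped of stop words. Sending the resulting terms to the search engine is outside the scope of this sketch.

```python
from collections import Counter

def document_query(tokens, top_n=10):
    """Return the top_n most frequent terms of a document, most frequent
    first (hypothetical helper; tokens are assumed to be preprocessed)."""
    counts = Counter(tokens)
    # most_common orders by decreasing frequency; equal counts keep the
    # order in which the terms were first seen.
    return [term for term, _ in counts.most_common(top_n)]
```

For instance, document_query("rugby ball rugby players ball rugby".split(), top_n=2) keeps the terms rugby and ball, in that order.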
³ https://catalog.ldc.upenn.edu/LDC2006T18
⁴ https://www.kaggle.com/
Table 2: Performance comparison: CCNMF with standard and latest baselines
        F1-score                             NMI
Method  DS1   DS2   DS3   DS4   DS5   Avg    DS1   DS2   DS3   DS4   DS5   Avg
M1      0.58  0.44  0.41  0.36  0.25  0.41   0.33  0.51  0.51  0.43  0.22  0.4
M2      0.46  0.27  0.18  0.23  0.21  0.27   0.27  0.41  0.23  0.35  0.23  0.3
M3      0.44  0.33  0.34  0.23  -     0.34   0.01  0.41  0.44  0.35  -     0.3
M4      0.37  0.18  0.1   0.13  0.27  0.21   0.16  0.31  0.13  0.25  0.2   0.21
M5      0.45  0.26  0.18  0.25  0.24  0.28   0.25  0.46  0.23  0.39  0.2   0.31
M6      0.5   0.55  0.1   0.2   0.25  0.32   0.18  0.29  0.01  0.07  0.2   0.15
M7      0.18  0.28  0.18  0.17  0.25  0.21   0.13  0.34  0.1   0.39  0.05  0.2
Note - M1: CCNMF, M2: NMF, M3: Spectral, M4: Coclustering, M5: k-means, M6: seaNMF, M7: RDDC
4.2 Benchmarking methods and evaluation measures
The state-of-the-art clustering methods k-means [8], NMF [3], spectral clustering
[8] based on k-NNs, and co-clustering [44] are used as the standard baselines.
The latest relevant methods, namely the IR ranking and density-based RDDC [140]
and the term co-occurrence-based semantic NMF known as seaNMF [168], have also
been used. Additionally, several variations of CCNMF have been evaluated to
investigate the inclusion of the consensus and complementary components of NMF
as well as the use of local and global NN information. The standard pairwise
F1-score, which calculates the harmonic mean of precision and recall, and
Normalized Mutual Information (NMI), which measures the purity against the
number of clusters, are used as evaluation measures [140].
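The two evaluation measures can be sketched in plain Python as follows. The function names are hypothetical. The pairwise F1-score treats every pair of documents placed in the same cluster as a positive decision. For NMI, the normalization by the arithmetic mean of the two entropies is an assumption, since several normalizations are in common use and the text does not state which one was applied.

```python
from collections import Counter
from itertools import combinations
from math import log

def pairwise_f1(truth, pred):
    """Harmonic mean of pairwise precision and recall."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_pred and same_truth:
            tp += 1          # pair correctly placed together
        elif same_pred:
            fp += 1          # pair wrongly placed together
        elif same_truth:
            fn += 1          # pair wrongly separated
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def nmi(truth, pred):
    """Mutual information divided by the mean of the two entropies
    (the choice of normalization is an assumption)."""
    n = len(truth)
    ct, cp = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mi = sum(c / n * log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    h_truth = -sum(c / n * log(c / n) for c in ct.values())
    h_pred = -sum(c / n * log(c / n) for c in cp.values())
    mean_entropy = (h_truth + h_pred) / 2
    return mi / mean_entropy if mean_entropy > 0 else 1.0
```

Both measures ignore the cluster identifiers themselves, so a clustering that matches the ground truth up to a renaming of the clusters scores 1 on each.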
4.3 Accuracy
Comparison with baseline methods: Results in Table 2 show that CCNMF
achieves the highest average F1-score and NMI among all benchmarking
methods. Spectral clustering, which preserves the geometric structures between
documents, is the second-best method. Density-based RDDC shows the lowest
performance, as the density concept fails in sparse text data. The semantic NMF,
seaNMF, also produces lower NMI. It is interesting to note that it even performs
more poorly than the traditional NMF for NMI values. In comparison to CCNMF,
which obtains the geometric relationship by using association relationships
between documents, seaNMF [168] uses the term co-occurrences to provide
semantic information during NMF. However, this semantic assistance given with
term association in seaNMF is able to produce a much higher F1 score for
collections with overlapping clusters such as DS2. In contrast, CCNMF shows a
52% increase in average F1 score and a 33% increase in average NMI compared to
normal NMF (Table 2). This improved performance shows the importance of using
complementary and consensus information for assisting matrix factorization to
preserve the geometric relationship.
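The reported gains over normal NMF appear to correspond to the Avg columns of Table 2; this reading is an inference from the table, not something stated explicitly. Under that assumption, the arithmetic can be checked with a short sketch in which the helper name relative_gain is hypothetical.

```python
def relative_gain(new, old):
    """Percentage increase of new over old."""
    return 100.0 * (new - old) / old

# Average F1-score and NMI of CCNMF (M1) and NMF (M2), read from Table 2.
f1_gain = relative_gain(0.41, 0.27)
nmi_gain = relative_gain(0.40, 0.30)
print(round(f1_gain), round(nmi_gain))
```

Rounded to whole percentages, the two values agree with the 52% and 33% quoted in the text.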
DS5 contains short text compared to the other datasets of news or web blog data,
which show medium text size. The short text vectors yield only a few NNs in
DS5, which impairs the ability of spectral clustering to find valid eigenvectors.
CCNMF also relies on the NN concept in identifying accurate clusters. Having only
a few NNs directly impacts CCNMF; therefore, CCNMF is unable to gain much
performance improvement on this dataset.
Comparing the use of different ways of obtaining NNs In CCNMF, we
incorporate both local and global NNs with the input term-document matrix
during factorization, to identify the accurate cluster representation. The next ex-
periments were conducted to find which combination is the most effective. Three
different settings were tested.
1. Eq. 9 was made to exclude the factorization of matrix S2 ⇒ (A + S1) ⇒ NMF on term-document and local NN association matrices
2. Eq. 9 was made to exclude the factorization of matrix S1 ⇒ (A + S2) ⇒ NMF on term-document and global NN association matrices
3. Eq. 9 was made to exclude the factorization of matrix A ⇒ (S1 + S2) ⇒ NMF on local NN and global NN association matrices
Table 3: Performance comparisons: Incorporating NNs
                     CCNMF with only   CCNMF with only   NMF with local and
       CCNMF         local NNs         global NNs        global NNs
                     (A + S1)          (A + S2)          (S1 + S2)
       F1    NMI     F1    NMI         F1    NMI         F1    NMI
DS1    0.58  0.33    0.51  0.51        0.51  0.21        0.54  0.24
DS2    0.44  0.51    0.31  0.31        0.31  0.48        0.36  0.42
DS3    0.41  0.51    0.31  0.31        0.31  0.38        0.35  0.41
DS4    0.36  0.43    0.25  0.25        0.25  0.33        0.28  0.37
DS5    0.25  0.22    0.23  0.19        0.28  0.19        0.24  0.19
Avg.   0.41  0.4     0.32  0.31        0.33  0.32        0.35  0.33
Results in Table 3 show that using only local NNs or only global NNs with the input
data matrix is inferior to combining both types of complementary information as in
CCNMF. They also show that combining these two types of NNs alone is not
sufficient to obtain superior clustering performance; the term-document
representation of text documents is also required. The use of term distribution can
represent the semantic relationships between documents and uplift the clustering
quality. However, CCNMF also outperforms the combinations that use either local
or global NNs with the term-document representation. In DS1, which is the smallest
dataset with four classes, the assistance of local NNs gives superior
performance to global NNs, while in DS2 - DS4, which contain more documents
and clusters, the assistance of global NNs is superior. This validates that global
NNs need to be considered for preserving the geometric relationships at different
levels. Furthermore, IR ranking, which identifies the relevant nearest neighbors
through an inverted index data structure, accurately forms the NN affinity (global
NNs) while minimizing the distance concentration, as is evident in its superior
performance on many datasets compared to local NNs. Results in DS5 show that
combining only global NNs produces a higher F1 score than using both, due to
significant distance concentration in the extremely sparse short text. Thus, poor
local NN information reduces the overall results of CCNMF.
Analysis of the objective function in CCNMF In CCNMF we use com-
plementary information given by S1 and S2 by approximating them using factor
matrices H1, H2 and H with minimum error as given in Eq. 9. It uses the
consensus information to form matrix H through updating H, H1 and H2 inter-
changeably. The next set of experiments was conducted to find the best consensus
and complementary technique. Two different settings are tested.
1. CCNMF without consensus information in the objective function,
\[ \min_{W,H_a \geq 0} \|A - W H_a^{T}\|_F + \min_{H_b,H_1 \geq 0} \|S_1 - H_b H_1^{T}\|_F + \min_{H_c,H_2 \geq 0} \|S_2 - H_c H_2^{T}\|_F. \]
As this is without consensus, this factorization process does not learn a
common H. Instead, it learns a different H from each input, i.e., Ha, Hb and Hc,
and then takes their average to obtain the final H.
2. CCNMF without using complementary information in the objective function,
\[ \min_{W,H \geq 0} \|A - W H^{T}\|_F + \min \|H H_1^{T} - H H_2^{T}\|_F. \]
This process uses the same update rule as CCNMF to have consensus
information. However, it only maintains the least difference between the local
($S_1 \approx H H_1^{T}$) and global ($S_2 \approx H H_2^{T}$) NNs within the factorization
process and does not directly consider the approximation of S1 and S2 for
optimization.
Table 4: Performance comparisons: Variations in factorization process
                     CCNMF without     CCNMF without
       CCNMF         consensus         complementary
       F1    NMI     F1    NMI         F1    NMI
DS1    0.58  0.33    0.41  0.02        0.52  0.29
DS2    0.44  0.51    0.21  0.10        0.18  0.32
DS3    0.41  0.51    0.09  0.02        0.31  0.4
DS4    0.36  0.43    0.11  0.07        0.07  0.07
DS5    0.25  0.22    0.15  0.05        0.19  0.19
Avg.   0.41  0.4     0.19  0.05        0.25  0.25
We analyzed all these options of approximating factor matrices to confirm that
the formulation used in CCNMF performs best.
Results in Table 4 show that the proposed CCNMF, which uses both complementary
and consensus information, is superior to using only consensus information or
only complementary information. CCNMF without consensus does not focus on a
common H and combines different variations of H (i.e., Ha, Hb and Hc) to get the
final H. In contrast, the proposed CCNMF incorporates consensus information
through interdependent update rules. Specifically, exchanging inter-dependent
information in obtaining the final H can give a more accurate clustering solution.
Combining separately learned cluster assignments based on NNs and document
representation is not able to assist the document matrix factorization process. In
fact, it degrades the performance of the original NMF process. CCNMF without
complementary information only minimizes the difference between S1 and S2
within the factorization. This approach does not consider the specific information
given by S1 and S2 as in the proposed CCNMF. These results confirm that the
specific information given by local NNs and global NNs has to be considered
with NMF to achieve better performance.
Figure 4: Optimization in CCNMF (one panel per dataset, DS1 to DS5, plotting the relative approximation error against the number of iterations, 0 to 99)
Figure 5: Performance Comparison with and without using SGNS weighting
4.4 Sensitivity Analysis
In order to achieve the optimized solution of CCNMF, we iteratively minimize
the approximation error in the factorization process over 100 cycles. Figure 4
shows that CCNMF approaches the optimum solution within the 10-40 iteration
range for all datasets.
The SGNS weighting used for the matrices that represent the NNs is one of the
major strengths of CCNMF. We empirically validate this concept by comparing a
binary representation of the NN entries against SGNS. Figure 5 shows that the use
of SGNS weighting for the S1 and S2 matrices consistently provides superior results
for all the datasets except DS5. There is no improvement in DS5 with this weighting,
as it is a short-text dataset in which fewer NNs can be identified because each
document has only a few terms. Such documents produce fewer word co-occurrence
patterns, which are the basis for finding NNs and for expressing the NN association
in SGNS. Using SGNS, CCNMF can capture the geometric relationship between
documents more accurately, as shown by the improved clustering performance.
Figure 6: Performance comparison with parameters: (a) different ranking functions
(TFIDF and BM25; NMI per dataset DS1 to DS5, with the F1-score displayed on top
of each bar) and (b) number of NNs (NMI for DS1 to DS5 with 10 to 130 NNs; the
number of NNs that gave the highest NMI and F1-score is marked)
The success of CCNMF relies on the IR ranking function used to obtain the global
NNs as well as the number of NNs used in CCNMF. The two most popular
ranking functions are tf ∗ idf and BM25 [173]. Figure 6(a) shows that the tf ∗ idf
ranking function gives slightly better performance on most
of the datasets. Therefore, tf ∗ idf is set as the default for all the experiments.
Figure 6(b) shows how performance varies with the selected number of NNs. We
select the number of NNs that gives the highest NMI and F1 score to represent
the S1 and S2 affinity matrices for each dataset.
Figure 7: Performance Comparison with different cluster numbers k
Another crucial factor in CCNMF is the low-rank dimension G, used in matrix
factorization process to project the data from high-dimension to low-dimension.
We set this as the number of classes/clusters within the dataset. Figure 7 validates
that CCNMF can produce the best performance when setting G as the number of
clusters. It shows that 4, 52, 20, 40 and 21, which are the exact cluster numbers
according to ground-truth, can produce best NMI and F1 scores for DS1- DS5
respectively. The curves in Figure 7 resemble those used in the elbow method.
This indicates that the elbow method or the average silhouette score [10] can be
used to infer the low-rank number G in real-world scenarios where the cluster
labels are unknown. For example, Figure 8 shows that G should be set to 4 based
on the highest average silhouette score for the WebKb dataset.
4.5 Complexity Analysis
This section explores the computational complexity of all the considered methods,
excluding the input preparation time. The objective in CCNMF is to obtain
higher clustering accuracy by overcoming the information loss in the NMF process.
We want to obtain a complexity similar to that of NMF-based clustering methods
without incurring a high cost for additional processes. Consider a collection with N
documents and M feature dimensions. As expected, the complexity of CCNMF
(i.e., O(N^2)) is lower than that of spectral clustering (O(N^3)) but higher than
that of k-means (O(N)) and RDDC (O(NM)) for larger datasets. All the NMF-related
methods have O(N^2) complexity. However, seaNMF, which combines two matrices in
finding a final solution, consumes more time than the general NMF. Similarly,
CCNMF, which combines three matrices, consumes more time than seaNMF.
Figure 8: Selecting the low-rank dimension G based on average silhouette score
5 Conclusion
This paper proposes a novel document clustering method CCNMF, leveraging the
use of local NN affinity and ability of IR ranking to identify relevant documents
as global NNs. CCNMF combines local and global NNs to preserve geometric
structure together with documents represented with a vector space model in Non-
negative Matrix Factorization to deal with the sparseness in high dimensional
text vectors. We conjecture that the technique used in CCNMF to combine
complementary and consensus information can approximate lower dimensional
factor matrices of high dimensional text to accurately determine the clusters for
documents. We show that the Skip-Gram with Negative-Sampling weighting
used in the NN representation can boost the clustering accuracy by capturing
the presence of a document pair with respect to any neighborhood.
Experiments conducted on several datasets, benchmarked against several clustering
methods, show that CCNMF overcomes the issues attached to sparse
vectors and provides a clustering solution with consistently higher accuracy
than all relevant baseline methods. The use of both local and global NN affinity
shows superior results, preserving the geometric relationships in the original data,
compared to other dimensionality reduction methods such as normal NMF, seaNMF
and spectral clustering. Further, this paper validates the superiority of using
consensus and complementary information as in CCNMF.
However, using the pairwise comparison to generate local NNs is an expensive
process. In the future, we aim to investigate an effective local NN calculation
process. A problem such as community detection needs to handle user-content as
well as user-user relationships that are modeled with friendship information and
interaction information, which represent local and global associations respectively.
Extending this approach to meaningful community detection for this type of
context is also open for future investigation.
Paper 3: Corpus-based Augmented Media Posts
with Density-based Clustering for Community
Detection
Wathsala Anupama Mohotti* and Richi Nayak*
*School of Electrical Engineering and Computer Science, Queensland University
of Technology, GPO BOX 2434, Brisbane, Australia
Published In: IEEE International Conference on Tools with Artificial Intelli-
gence (ICTAI), 05-07 November 2018, Volos, Greece
Statement of Contribution of Co-Authors
The authors of the papers have certified that:
1. They meet the criteria for authorship in that they have participated in
the conception, execution, or interpretation, of at least that part of the
publication in their field of expertise;
2. They take public responsibility for their part of the publication, except for
the responsible author who accepts overall responsibility for the publication;
3. There are no other authors of the publication according to these criteria;
4. Potential conflicts of interest have been disclosed to (a) granting bodies, (b)
the editor or publisher of journals or other publications, and (c) the head
of the responsible academic unit, and
5. They agree to the use of the publication in the student’s thesis and its
publication on the QUT ePrints database consistent with any limitations
set by publisher requirements.
Contributor: Wathsala Anupama Mohotti
Statement of contribution*: Conceived the idea, designed and conducted experiments, analyzed data, wrote the paper and addressed the supervisor and reviewers’ comments to improve the quality of paper
Signature: QUT Verified Signature
Date: 27/03/2020
Contributor: A/Prof Richi Nayak
Statement of contribution*: Provided critical comments in a supervisory capacity on the design and formulation of the concepts, method and experiments, edited and reviewed the paper
Signature: QUT Verified Signature
Date: 26/03/2020
ABSTRACT: This paper proposes a corpus-based media posts expansion tech-
nique with a density-based clustering method for community detection. To enrich
the user content information, firstly all (short-text) media posts of a user are com-
bined with hash tags and URLs available with the posts. The expanded content
view is further augmented by the virtual words inferred using the novel concept
of matrix factorization based topic proportion vector approximation. This
expansion technique deals with the extreme sparseness of short text data, which
otherwise leads to insufficient word co-occurrence and, hence, inaccurate
outcomes. We then propose to group these augmented posts, which represent users,
by identifying the density patches and form user communities. The remaining
isolated users are then assigned to communities to which they are found most
similar using a distance measure. Experimental results using several Twitter
datasets show that the proposed approach is able to deal with common issues at-
tached with (short-text) media posts to form meaningful communities and attain
high accuracy compared to relevant benchmarking methods.
KEYWORDS: community detection; corpus-based expansion; clustering; short
text; text mining
1 Introduction
Microblogging services are popular social networks which disseminate trending
information and assemble social views of users based on their short-text com-
munication. Clustering algorithms have been popularly used in these services
to promote applications such as topic detection, answering service recommenda-
tions, community detection and image/video tagging [93]. Community detection
in social media analysis has been found useful in identifying groups of users with
common interests to assist in viral and targeted marketing, political campaigning,
customized health programs, event identification and many other applications
[88, 144, 147].
Community detection is usually done via two means: (1) structure analysis and,
(2) content analysis. Structural analysis has been explored heavily to construct
a network representation based on user interactions and to find cohesive groups
by applying clustering to the graph model [151, 158, 169]. Researchers have
attempted to enrich this network representation by incorporating additional in-
formation such as hashtags and URLs [119]. Interpretation of similar groups
based on the network structure is challenging due to its complex and messy na-
ture. Users who belong to a common group make a connection with different
groups through the friendship information or other connections. They may fol-
low users belonging to different groups based on their own desires and emotions
[97]. Analysis of this heterogeneous network structure results in close ties that allow
the exchange of fine-grained information and is unable to produce high-level user
groups [62]. A network-based analysis, which considers users who write messages
as entities, is unable to give insight into what the community is interested in
[142].
Alternatively, content analysis derives insight into communities directly from
what they write. However, it has not yet been explored in detail because of
the difficulty of handling high-dimensional short-text data. A handful of
studies have used supervised and unsupervised learning methods to identify
commonalities among social posts. Clustering has been used to identify mental
health communities focusing on anxiety, depression, and PTSD from Reddit
forum posts [147]. Supervised learning [88] and textual semantic similarity [144]
have been used to identify sub-topics in political tweets and events,
respectively. However, these methods have been reported to suffer from the
extreme sparseness of short-text data, which results in poor outcomes.
The majority of content-based community detection methods rely on text
clustering [147, 202]. A distance-based clustering method was used to assign
each user to the closest community by considering the distance between the
texts that represent users [147]. However, it performed poorly because of
distance concentration in high-dimensional data [90]. Specifically, the
differences in distance between data points become harder to distinguish as
dimensionality increases [176]. The authors of [144, 202] used a generative
probabilistic clustering method to approximate the probability of a user
belonging to pre-considered communities and to derive the most probable
community. However, information loss is inevitable in these methods because
of the projection from a high-dimensional to a low-dimensional space, and it
results in inaccurate communities. Most importantly, these methods require
the number of communities to be provided as an input parameter.
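The distance concentration effect mentioned above can be illustrated with a small, self-contained sketch. It is not taken from the cited works: the function name `relative_contrast`, the uniform random data and the Euclidean distance are assumptions made purely for illustration. The sketch measures how much the farthest point differs from the nearest point, relative to the nearest, as the dimensionality grows.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random query
    point to n_points uniformly random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# Illustrative comparison: a 2-dimensional space versus a 1000-dimensional one.
low_dim = relative_contrast(2)
high_dim = relative_contrast(1000)
print(f"relative contrast in 2-D: {low_dim:.2f}, in 1000-D: {high_dim:.2f}")
```

In the high-dimensional case the nearest and farthest points are almost equally far from the query, which is why a purely distance-based assignment struggles to separate communities.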
In contrast, density-based clustering methods such as DBSCAN [57] naturally
find highly dense patches in a dataset and automatically identify clusters
without requiring the number of clusters a priori. These characteristics make
a density-based method well suited to clustering text data. The density
estimation process allows clusters of different shapes to be identified.
However, these methods are limited in handling high-dimensional text data,
where the feature space is usually sparse with little term co-occurrence, and
they have difficulty distinguishing high-density regions from low-density
regions [90]. This sparseness has been addressed by finding Shared Nearest
Neighbours (SNN) and a text representation of varying density [56]. However,
the short length of social media posts impairs the identification of SNNs
because the posts share few common words [79]. Furthermore, a density-based
method usually results in incomplete clustering and leaves some objects
unassigned to any cluster. This portion of unassigned objects is relatively
high in text data because of sparseness. A density-based method therefore
cannot be applied directly to community detection, where each user should be
matched to a community according to the ideas and opinions they express
online.
To deal with these issues, in this paper we propose a novel hybrid clustering
approach that relies on the density concept to naturally identify clusters of
arbitrary shapes and uses the distance concept to place unassigned objects in
the nearest clusters. In addition, to deal with the sparsity caused by the
lack of word co-occurrence, we propose to apply document expansion to augment
the media posts used in clustering. Document expansion, which is inherited
from the Information Retrieval field, has been used in clustering to deal
with sparseness [202]. External information sources have mainly been used to
enrich the text [93, 95]. However, context mismatch and structural
incoherence between the external source and the original data lead to poorer
outcomes [202].
In this paper, we propose a novel media post augmentation method that
incorporates the semantic characteristics of short text using word
occurrences in the corpus itself, without using external resources. Firstly,
we propose to use non-content information such as hashtags and URLs to enrich
the content of media posts. This information has been used as separate views
[5], or with each type of information as a singleton view [18], in clustering
to avoid extreme sparseness. We believe that combining this information into
a consolidated view is beneficial for the latent term learning task. We then
project the high-dimensional term space to a low-dimensional space and infer
the topic proportion vectors using the associated semantic structure to
identify virtual terms for posts. We propose to use Non-negative Matrix
Factorization (NMF) based dimensionality reduction, which considers
context-based term weights, to form topic vectors. The coefficients in the
topic matrix are used to statistically derive the top-n terms from topic
vectors to augment media posts.
This media post expansion improves the word co-occurrences of important terms
in short text. Based on the augmented users' posts, the proposed density- and
centroid-based complete hard clustering mechanism is then used to assign each
user to a single community. Quantitative analysis using several Twitter
datasets covering several groups reveals that the proposed approach is
superior to state-of-the-art content-based community detection methods. In
addition, a case study was conducted, and its qualitative analysis confirms
the ability of the proposed method to detect meaningful, cohesive online
communities. In comparison to the many sub-communities identified by network
analysis of disseminated information, the proposed content-based method
identifies fewer, more cohesive communities.
Figure 1: Inference of virtual words for media post-expansion
More specifically, the contributions of the paper are as follows: (1) We put
forward the concept of document expansion to handle the sparsity of media
posts. We propose to use the abundant information available on social
platforms, as well as the corpus itself, to obtain additional terms inferred
from the topics to resolve the sparseness. (2) We propose a hybrid hard
clustering method with density and centroid concepts to naturally detect
meaningful exclusive communities based on the augmented media posts.
To the best of our knowledge, this is the first work that uses NMF to augment
short text using topic vectors together with a density-based clustering
method for community detection.
2 Community Detection With Hybrid Clustering
2.1 Preliminaries and Overview
Let there be N distinct users {u1, u2, ..., uN} to be assigned to
communities. Each user ui has written n posts {p1, p2, ..., pn}.
Let Pi be a combined post representing all the posts {p1, p2, ..., pn} of
user ui together with the URLs and hashtags used by that user in the posts.
Let UP = {P1, P2, P3, ..., PN} be the enriched media post collection
representing the combined posts of all N users. Let the collection UP consist
of M distinct terms {t1, t2, t3, ..., tM}. The matrix representing UP is
usually extremely sparse. The low-rank approximation of the topic
distribution over terms, generated using NMF on the UP matrix, is used to
obtain the top-n terms associated with each topic, as in Fig. 1. Each text
post representing a user in UP is expanded with the virtual words taken from
the appropriate topic vector to form the enriched corpus UP′.
The augmented text collection UP′ becomes the input to the clustering
process, which includes two steps. Firstly, dense regions are found in the
UP′ search space and a distinct cluster label in C = {c1, c2, c3, ..., cl} is
assigned to users/posts in highly dense regions. The set of posts P^o that
appears in low-density regions is separated out. Finally, for each post
Pi ∈ P^o, a cluster label is assigned by finding the closest dense region
using a distance metric. Thereby, each user represented by a post is assigned
to a community.
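The construction of the combined posts Pi can be sketched as follows. This is a minimal illustration rather than the preprocessing actually used in the experiments: the tokenisation rules, the regular expression and the names `tokenize` and `build_combined_posts` are assumptions, and the three sample tweets are invented. Each hashtag and URL is kept whole as a term of the post it is attached to.

```python
import re
from collections import defaultdict

# Hypothetical pattern: a hashtag or an http(s) URL is kept as a single term.
HASHTAG_OR_URL = re.compile(r"#\w+|https?://\S+")


def tokenize(post):
    """Split a post into lower-cased words, keeping hashtags and URLs whole."""
    specials = HASHTAG_OR_URL.findall(post)
    remainder = HASHTAG_OR_URL.sub(" ", post)
    words = re.findall(r"[a-z0-9']+", remainder.lower())
    # Hashtags are lower-cased; URLs are left untouched.
    specials = [s.lower() if s.startswith("#") else s for s in specials]
    return words + specials


def build_combined_posts(tweets):
    """tweets: iterable of (user_id, text). Returns {user_id: list of terms},
    i.e. one combined post P_i per user."""
    combined = defaultdict(list)
    for user, text in tweets:
        combined[user].extend(tokenize(text))
    return dict(combined)


# Invented sample data for illustration only.
tweets = [
    ("u1", "Screening saves lives #cancer https://example.org/a"),
    ("u1", "Early screening matters"),
    ("u2", "Great match today #sports"),
]
UP = build_combined_posts(tweets)
vocabulary = sorted({term for terms in UP.values() for term in terms})
print(UP)
print(len(vocabulary), "distinct terms")
```

Here `UP` plays the role of the collection UP and `vocabulary` the M distinct terms; with real data the resulting term-post matrix would be far sparser than this toy example suggests.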
Figure 2: The Content-based Community Detection Method
2.2 Augment Media Posts with Semantically Related Words
Basic media posts can be enriched using external sources or the corpus itself
to narrow the semantic gap created by their short length. The majority of
previous works utilise external knowledge sources such as Wikipedia, WordNet,
Web search results and other user-constructed knowledge bases [93, 95]. When
social media texts, which are updated frequently in real time, are enriched
using static sources such as Wikipedia or WordNet, these sources provide
inadequate or inaccurate information because of structural incoherence, and
lead to incomplete enrichment. Corpus-based document enrichment [202] offers
an efficient way to avoid these problems: enrichment is based on the data
itself, which follows the same semantic structure. Latent Dirichlet
Allocation (LDA) topic modelling has been used previously to form the
inter-relationship between topics and words, with virtual words sampled from
the topics [202]. However, LDA is a probabilistic generative model based on
word counts; contextual information cannot be captured effectively without
considering the importance of terms within the document and the document
corpus [3].
Latent Semantic Indexing (LSI) is another approach used to derive latent
concepts by performing a matrix decomposition based on term co-occurrences
[20]. LSI uses Singular Value Decomposition (SVD) to identify patterns in the
relationships between the terms and concepts contained in an unstructured
collection of text. SVD allows factors to contain both positive and negative
entries. However, the Vector Space Model (VSM) of a document collection,
which captures the contextual importance of terms through non-negative term
weights, forms a non-negative original matrix. The factors therefore need to
be non-negative to directly model the physical connection between terms and
topics. As a remedy, our augmenting technique identifies semantically related
words through associations learned under a non-negativity constraint.
NMF is a dimensionality reduction method that transforms high-dimensional
features to a lower dimension by enforcing a non-negativity constraint in the
matrix decomposition, generating non-negative factors that yield the
lower-rank approximation [110]. In a high-dimensional document model where
the document-term relationship is represented with weights, this
approximation directly corresponds to topics. NMF-based topic generation
within the lower-dimensional space, which considers term weights, is more
accurate than term-count-based probabilistic approximation and consumes less
time than generative probabilistic models such as LDA [110].
Let A be the M × N term-post matrix; using NMF, we model A as the product of
W and H, as in Eq. 1. We use the Frobenius norm as the objective function to
obtain a stable approximation and iteratively determine the optimum W and H
that minimise the sum of squared errors over all elements of A − WH, as in
Eq. 2, where W is an M × k non-negative matrix, H is a k × N non-negative
matrix and k < min(M, N) [110].
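$$A \approx WH \tag{1}$$

$$\min_{W,H \ge 0} \; \frac{1}{2}\lVert A - WH \rVert_F^2 = \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(A_{i,j} - (WH)_{i,j}\right)^2 \tag{2}$$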
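A minimal sketch of how the factorisation in Eq. 1 and Eq. 2 can be computed is given below. It uses the well-known multiplicative update rules for the squared Frobenius objective; the source does not state which solver was used, so this choice, the random initialisation, the iteration count and the toy 4 × 4 term-post matrix are all assumptions for illustration.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frobenius_loss(A, W, H):
    """Half the squared Frobenius norm of A - WH (the objective in Eq. 2)."""
    WH = matmul(W, H)
    return 0.5 * sum((a - b) ** 2
                     for row_a, row_b in zip(A, WH)
                     for a, b in zip(row_a, row_b))


def nmf(A, k, iterations=200, seed=0, eps=1e-9):
    """Factorise a non-negative M x N matrix A into W (M x k) and H (k x N)
    using multiplicative updates, which keep every entry non-negative."""
    rng = random.Random(seed)
    M, N = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(M)]
    H = [[rng.random() + 0.1 for _ in range(N)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


# Toy term-post matrix (rows: terms, columns: posts) with two obvious themes.
A = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]]
W, H = nmf(A, k=2)
print("loss after fitting:", round(frobenius_loss(A, W, H), 3))
```

In this layout each column of W is a topic described by term coefficients, and each column of H gives the topic weights of one post, matching the M × k and k × N shapes stated above.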
Topic membership of each media post is obtained from H by taking the topic
with the maximum probability for that post. This associated topic is used to
identify the virtual terms for each post using W, where topic proportion
vectors are maintained. W in Eq. 3 represents the likelihood of each term in
a given topic, from which we obtain the top-n terms as the probable terms of
that topic. We set n in a parameter-independent way based on the coefficients
of W (as described in the sensitivity analysis later). Each text post
Pi ∈ UP, which represents a user, is updated with the virtual terms that
correspond to its topic vector, and the augmented dataset UP′ is obtained.
$$W = \sum_{i=1}^{M}\sum_{K=1}^{k} p(t_i, K) \tag{3}$$
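The augmentation step described above can be sketched as follows, assuming W and H have already been obtained. The function names `top_terms` and `augment_posts`, the fixed n and the small hand-written matrices are hypothetical; whether virtual terms already present in a post are skipped is not specified in the text, so this sketch simply appends all of them.

```python
def top_terms(W, vocabulary, topic, n):
    """Return the n terms with the largest coefficients in column `topic` of W."""
    ranked = sorted(range(len(vocabulary)), key=lambda i: W[i][topic], reverse=True)
    return [vocabulary[i] for i in ranked[:n]]


def augment_posts(posts, W, H, vocabulary, n):
    """Append to each post the top-n terms of its dominant topic, i.e. the
    topic with the largest entry in the post's column of H."""
    k = len(H)
    augmented = []
    for j, terms in enumerate(posts):
        topic = max(range(k), key=lambda K: H[K][j])
        augmented.append(list(terms) + top_terms(W, vocabulary, topic, n))
    return augmented


# Hand-written toy factors: 4 terms, 2 topics, 2 posts (illustration only).
vocabulary = ["pension", "budget", "match", "goal"]
W = [[0.9, 0.0],   # pension
     [0.7, 0.1],   # budget
     [0.0, 0.8],   # match
     [0.1, 0.6]]   # goal
H = [[0.8, 0.1],   # topic 0 weight for post 0 and post 1
     [0.2, 0.9]]   # topic 1 weight for post 0 and post 1
posts = [["pension"], ["goal"]]
UP_augmented = augment_posts(posts, W, H, vocabulary, n=2)
print(UP_augmented)
```

After augmentation the first post also contains "budget" and the second also contains "match", so posts on the same theme share more terms than before.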
2.3 Hybrid (Hard) Clustering Method
The augmented dataset UP′ is analysed to identify natural dense patches and
to form communities/clusters. Some posts/users that are not part of a dense
patch remain un-clustered. These isolated users are assigned to communities
using a distance-based method. Algorithm 1 in Fig. 2 explains the hybrid
clustering method proposed for content-based community detection. The
proposed method does not need a user-defined number of clusters k, nor does
it include the expensive centroid update steps that are bottlenecks in
community detection. This method is also capable of identifying clusters of
different shapes and varying densities, in contrast to centroid-based
methods, which produce spherical clusters only.
The first step of this method is to identify dense communities based on the
dense patches that naturally exist in the data. Each post Pi ∈ UP′, which
represents a user, is checked to determine whether it is a core dense data
point, unless it is already a member of a dense community. A post Pi ∈ UP′ is
defined as a core dense point when the region within distance α around Pi
contains at least r data points. The region around a core dense point is then
expanded using the data points in its region if they also satisfy the
condition to be core dense data points. This allows communities/clusters of
arbitrary shape to form according to their close neighborhood.
This density-based community detection leaves some users unassigned to any of
the communities as they lie in less dense regions (i.e. P^o). In sparse text
data, the number of points left out as noise by a traditional density-based
method is considerably high. The post augmentation done in the previous step
addresses this problem to some extent and creates density variations among
clusters. However, the number of unassigned points still varies according to
the distance parameter α. To avoid this dependency on the parameter, we
design our method to identify a cluster for each user based on distance. Each
user is included in a community through a pair-wise comparison of the user's
post vector with the cluster centre of each dense community formed
previously. Each user Pi ∈ P^o is assigned to the community with the minimum
pair-wise distance value.
Algorithm 1 details the process of content-based community detection which as-
signs each user to a community.
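The two phases described in this section can be sketched as follows. This is a simplified illustration and not a transcription of Algorithm 1: it uses a DBSCAN-style expansion in which a point counts itself as a neighbour and non-core points reached from a core point join the cluster, it uses Euclidean distance, and the function name `hybrid_cluster` and the toy 2-D points are assumptions.

```python
import math


def hybrid_cluster(points, alpha, r):
    """Phase 1: grow clusters from core points (at least r points within alpha).
    Phase 2: assign every leftover point to the nearest cluster centroid."""
    n = len(points)
    neighbours = [[j for j in range(n) if math.dist(points[i], points[j]) <= alpha]
                  for i in range(n)]
    is_core = [len(nb) >= r for nb in neighbours]
    labels = [None] * n
    n_clusters = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        labels[i] = n_clusters
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbours[p]:
                if labels[q] is None:
                    labels[q] = n_clusters
                    if is_core[q]:          # only core points keep expanding
                        frontier.append(q)
        n_clusters += 1
    if n_clusters == 0:
        return labels                        # no dense region was found
    dim = len(points[0])
    centroids = []
    for c in range(n_clusters):
        members = [points[i] for i in range(n) if labels[i] == c]
        centroids.append([sum(m[d] for m in members) / len(members)
                          for d in range(dim)])
    for i in range(n):
        if labels[i] is None:                # a point from the low-density set
            labels[i] = min(range(n_clusters),
                            key=lambda c: math.dist(points[i], centroids[c]))
    return labels


# Two dense groups and one isolated point (toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 4)]
labels = hybrid_cluster(points, alpha=1.5, r=3)
print(labels)
```

The centroids are computed once from the dense clusters and never updated, which mirrors the stated aim of avoiding repeated centroid updates while still giving every user a community.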
3 EMPIRICAL ANALYSIS
Datasets: We used several Twitter datasets obtained from Trisma [140]
spanning the Cancer, Health and Sports domains, as reported in Table 1. We
chose a set of groups under these domains for which we could identify Twitter
accounts to collect posts. Each group in each domain is considered a ground
truth community for benchmarking the algorithmic outcome. The tweets of a
user within a given account are combined to form a single media post that
represents the user.
Table 1: Summary of the datasets in the experiments
Dataset        Tweets   Users   Classes    # of    Avg. post length (terms)
                                (groups)   Terms   Before expansion   After expansion
Cancer (DS1)    43730   14368   8           8050   16                 134
Health (DS2)    53255   11306   6           9733   20                 159
Sports (DS3)   230447   25243   8          24316   29                 267
Benchmarks and Experimental setting: The state-of-the-art clustering
methods k-means [90], NMF [110], DBSCAN [57] and SNN-based DBSCAN [56]
are used to benchmark the concepts in the proposed method. Document term
frequency was selected as the weighting scheme for the vector space model
of each corpus. In all the experiments, for density-based clustering methods,
the minimum number of posts required to form a dense patch was set to 3,
which prior research reports as the minimum requirement of a hub point [139].
The parameter α, which represents the local radius for expanding clusters,
was set to 0.7, 0.9 and 0.7 for DS1, DS2 and DS3, respectively, based on the
experiments. All the experiments were run using Python 3.5 on a standard
desktop with a 3.40 GHz 64-bit processor and 16 GB of memory. The standard
pairwise harmonic average of precision and recall (F1-score) and Normalized
Mutual Information (NMI) were used as the evaluation measures [202].
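For concreteness, the two evaluation measures can be sketched as below. The exact variants used in the experiments are not given in the text, so this sketch makes two assumptions: pairwise F1 is computed over all pairs of users, and NMI is normalised by the geometric mean of the two entropies. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pairwise_f1(truth, predicted):
    """Harmonic mean of pairwise precision and recall: a pair of users is a
    true positive when both labelings place the two users together."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        tp += same_truth and same_pred
        fp += same_pred and not same_truth
        fn += same_truth and not same_pred
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def nmi(truth, predicted):
    """Mutual information normalised by sqrt(H(truth) * H(predicted))."""
    n = len(truth)
    ct, cp = Counter(truth), Counter(predicted)
    joint = Counter(zip(truth, predicted))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    h_truth = -sum(c / n * math.log(c / n) for c in ct.values())
    h_pred = -sum(c / n * math.log(c / n) for c in cp.values())
    if h_truth == 0 or h_pred == 0:
        return 1.0 if h_truth == h_pred else 0.0
    return mi / math.sqrt(h_truth * h_pred)


truth = [0, 0, 1, 1]
same_grouping = [1, 1, 0, 0]      # identical partition, different label names
crossed = [0, 1, 0, 1]            # splits every true community in half
print(pairwise_f1(truth, same_grouping), nmi(truth, same_grouping))
print(pairwise_f1(truth, crossed), nmi(truth, crossed))
```

Both measures depend only on how users are grouped, not on the label names, so a clustering that reproduces the ground truth communities scores the maximum even if its cluster identifiers differ.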
3.1 Effect of the Expansion Steps on Clustering Performance
The short-text posts were first augmented with the non-content information
available with the text. Generally, tweets are accompanied by URLs, hashtags,
and emoticons. We treated each URL and hashtag attached to a tweet as a term
in that tweet. This technique allowed us to use the additional information
available on a social media platform to improve word co-occurrence for
accurate latent topic term learning. Fig. 3 shows the performance
improvement, measured by NMI and F1-score, obtained using this additional
data.
Figure 3: Performance difference when augmented with URLs and hash tags
For comparison, we benchmarked the performance against commonly used term
expansion methods: (1) the generative probabilistic topic modelling (LDA)
based approach (LD), (2) the Latent Semantic Indexing based approach (SI),
(3) the top ten frequent words from k-means clustering (KMT) and (4) the
WordNet synonyms based method (WN).
Table 2: Different Document Expansion Methods
Measure              Method   DS1    DS2    DS3
F1-score             AC       0.7    0.56   0.66
                     LD       0.44   0.53   0.6
                     SI       0.57   0.50   0.63
                     KMT      0.74   0.50   0.66
                     WN       0.19   0.47   0.06
NMI                  AC       0.8    0.75   0.79
                     LD       0.54   0.75   0.73
                     SI       0.45   0.01   0.46
                     KMT      0.62   0.01   0.58
                     WN       0.24   0.13   0.14
Number of Clusters   AC       10     4      7
                     LD       9      3      8
                     SI       8      2      7
                     KMT      163    15     465
                     WN       527    369    716
Augmenting Methods: Proposed NMF (AC), LDA (LD), LSI (SI), Top 10 Words using k-means (KMT), WordNet-based (WN)
Table 2 shows the clustering performance on the datasets augmented with the
different expansion methods. Using topic words estimated in the
lower-dimensional space through NMF for expansion (AC) is found to be better
than using the word-count-based topic words obtained with LDA. NMF takes the
context of the terms in the corpus into account through term weights and is
able to provide a stronger topic distribution over terms. In comparison to
the generic matrix decomposition in LSI, NMF enforces a non-negativity
constraint on the matrix factorization and is able to capture the topic terms
accurately. Furthermore, the simple use of the top 10 words in k-means
clusters for augmenting was found to be ineffective. This may be because
k-means cannot derive themes over terms using a distance metric and is hence
unable to identify the correct number of cohesive communities. For the
WordNet-based method, we used WordNet to extract synonyms for the top 10
frequent terms in the combined post that represents a user, and added these
closely related words to the post for enrichment. This way of augmenting (WN)
adds many unrelated terms to a post because of the structural mismatch
between the terms in media posts and the WordNet taxonomy.
In the k-means based (KMT) and WordNet-based (WN) augmenting, the sparseness
of the text representation increases further because of the unrelated terms
added. The density estimation process becomes weak for these methods and
results in a larger number of small clusters.
Figure 4: Time taken for different post augmenting methods
As depicted in Fig. 4, the matrix factorization based methods take the least
time to augment media posts because of their linear approximation of factors.
The proposed NMF-based method consumes more time than the LSI-based approach
because it imposes non-negativity constraints. The WordNet-based approach
relies on term search in the external information source and is expensive.
3.2 Accuracy Comparison
Results in Tables 3 and 4 show the performance of the proposed augmented
hybrid clustering method (AC) benchmarked against other popular methods,
without and with topic vector based augmentation, respectively. Table 3
reveals that the performance of the density-based algorithms, including the
proposed one, is inferior to that of the other algorithms because the density
notion fails in sparse content before term expansion. In this homogeneous
setting, centroid-based k-means is the best approach. However, it should be
noted that all methods except the density-based ones require the number of
classes as an external input for clustering, whereas density-based methods
find natural clusters without being given the number of classes.
Consequently, the density-based methods show poor performance by forming many
more subgroups than the exact number of communities. Moreover, DBSCAN, which
is designed to identify noise separately from dense regions, leaves a huge
number of users unassigned to any community because the sparseness of the
text scatters the data points. Shared-nearest-neighbors based DBSCAN, which
was introduced to address sparseness in high-dimensional data, also fails to
deal with short text: with only a minimal number of shared terms, it is
unable to identify shared neighbors accurately.
Performance of all methods using the inputs augmented with NMF-based topic
terms, given in Table 4, shows that media post expansion boosts the accuracy
of density-based methods, as depicted by higher F1-score and NMI. The
performance boost in density-based methods is much larger than that in other
algorithms because the augmentation step increases the data density. Virtual
terms added to media posts create uniform dense regions, minimizing the
sparseness of text, which favors density-based methods. These uniform
clusters allow high-level groups closer to the actual groups to be identified
without identifying many dense patches. Most importantly, the proposed method
is able to assign a community to each user, whereas DBSCAN and SNN-based
DBSCAN leave about 2% and 70% of users unassigned even after expansion.
Table 3: Performance Comparison of Different Methods Before Expansion
Measure              Method   DS1     DS2     DS3     Average
F1-score             AC       0.42    0.45    0.43    0.43
                     KM       0.69    0.62    0.63    0.65
                     MF       0.58    0.54    0.61    0.58
                     DB       0.5     0.45    0.52    0.49
                     SD       0.45    0.49    0.45    0.46
NMI                  AC       0.4     0.43    0.31    0.38
                     KM       0.66    0.51    0.62    0.60
                     MF       0.51    0.43    0.57    0.50
                     DB       0.34    0.32    0.17    0.28
                     SD       0.06    0.00    0.02    0.03
Number of Clusters   AC       547     519     739
                     KM       8       6       8
                     MF       8       6       8
                     DB       547     519     739
                     SD       9       1       23
Un-assigned users    AC       0       0       0
                     KM       0       0       0
                     MF       0       0       0
                     DB       7210    5490    20528
                     SD       795     59      2332
Proposed Augmented Hybrid Clustering Method (AC), K-Means (KM), NMF (MF), DBSCAN (DB) and SNN-based DBSCAN (SD)
Moreover, structure-based community detection methods [14, 161] are generally
known to form many more communities in comparison to the proposed method.
3.3 Time Efficiency Analysis
Table 4: Performance Comparison of Different Methods After Expansion
Measure              Method   DS1     DS2     DS3     Average
F1-score             AC       0.8     0.75    0.79    0.78
                     KM       0.69    0.66    0.68    0.68
                     MF       0.69    0.65    0.69    0.68
                     DB       0.8     0.75    0.79    0.78
                     SD       0.59    0.72    0.46    0.59
NMI                  AC       0.7     0.56    0.66    0.64
                     KM       0.63    0.53    0.58    0.58
                     MF       0.63    0.52    0.59    0.58
                     DB       0.69    0.56    0.65    0.63
                     SD       0.33    0.46    0.03    0.27
Number of Clusters   AC       10      4       7
                     KM       8       6       8
                     MF       8       6       8
                     DB       10      4       7
                     SD       19      5       35
Un-assigned users    AC       0       0       0
                     KM       0       0       0
                     MF       0       0       0
                     DB       101     11      202
                     SD       903     71      1848
Proposed Augmented Hybrid Clustering Method (AC), K-Means (KM), NMF (MF), DBSCAN (DB) and SNN-based DBSCAN (SD)
The time taken after augmenting the media posts is given in Fig. 5 for the
proposed and benchmarking methods. K-means consumes more time because of the
expensive step of centroid updates in each assignment. NMF consumes the least
time because of its document topic approximation process. However, the
trade-off for achieving an increase of 10% in NMI and 15% in F1-score with
the proposed method is well justified against NMF. SNN-based DBSCAN consumes
the most time because of the pairwise comparison among points to identify the
shared nearest neighbors. DBSCAN, which also uses the density-based
clustering approach, consumes slightly less time than our method. However,
DBSCAN leaves some users unassigned to communities, while our proposed hybrid
method assigns each user to its closest community. The complexity of the
total process is O(n^2), where n is the number of users. This is higher than
that of the proposed clustering method alone (O(n log n)) because of the
inclusion of the augmentation step.
Figure 5: Time comparison of different cluster/community detection methods
3.4 Sensitivity Analysis
The experimental settings used for augmenting media posts and for the
clustering method are analysed here. For the datasets that we use in testing,
the required number of communities to be identified was known. We use these
numbers to set the number of topics in NMF for media post expansion. However,
this would not be the case in real-world settings. To set a default value in
NMF for unknown cases, we explore the relationship between the number of
topics and F1-score. Fig. 6 (a) shows that the accuracy becomes stable after
about 10 topics. Consequently, 10 can be set as the default number of topics
for post expansion.
Figure 6: Performance w.r.t. number of topics, number of virtual terms, term weighting methods and parameter alpha
The next experiment was conducted to find the number of terms that should be
added to a media post. Fig. 6 (b) shows a pattern where accuracy increases
until a specific number of words are added and then starts to decline with
further additions. This happens because the additional terms, which have a
low probability of being on the topic, are unrelated to the document themes.
For all three datasets, the best performance is achieved with about 10 terms.
If the number of communities within the dataset is high, more terms are
needed, as in DS1 and DS3. Fig. 6 (b) confirms that the threshold for the
number of virtual terms depends on the dataset. The dataset should possess
sufficient information to set it statistically, so this threshold can be set
in a parameter-independent way. Fig. 6 (c) shows that the best F1-score and
NMI are obtained by setting the threshold to “mean + standard deviation”,
with the least time consumption as shown in Fig. 6 (d). This threshold allows
the method to select the most probable words for each topic: those whose
coefficients exceed the overall average by at least one standard deviation.
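The “mean + standard deviation” rule can be sketched as follows for a single topic. The coefficient values and the function name `select_virtual_terms` are invented for illustration, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not say which was used.

```python
import statistics


def select_virtual_terms(topic_weights, vocabulary):
    """Keep the terms whose coefficient in the topic vector exceeds
    mean + standard deviation of all coefficients in that vector."""
    threshold = statistics.mean(topic_weights) + statistics.pstdev(topic_weights)
    return [term for term, weight in zip(vocabulary, topic_weights)
            if weight > threshold]


# Invented coefficients of one topic vector (one column of W).
vocabulary = ["pension", "budget", "match", "goal", "ticket", "movie"]
topic_weights = [0.9, 0.8, 0.1, 0.05, 0.05, 0.1]
virtual_terms = select_virtual_terms(topic_weights, vocabulary)
print(virtual_terms)
```

Because the cut-off is derived from the coefficients themselves, the number of virtual terms adapts to each topic and dataset instead of being fixed by hand.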
The media posts are organized in a Vector Space Model (VSM) to find the
initial dense regions. We explore the different weighting schemes that can be
used to model the VSM. Fig. 6 (e) shows that Term Frequency (TF) is the best
weighting scheme for the proposed method. TF, which gives high weights to
frequent terms in a post, allows dense patches that share a common theme to
form.
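A raw term-frequency representation of the posts can be built as in the sketch below. The helper name `term_frequency_matrix` and the two sample posts are hypothetical; the rows here are posts and the columns are terms, so the matrix is the transpose of the term-post layout used for NMF earlier.

```python
from collections import Counter


def term_frequency_matrix(posts):
    """Return (vocabulary, matrix) where matrix[j][i] is the number of times
    vocabulary[i] occurs in posts[j]."""
    vocabulary = sorted({term for terms in posts for term in terms})
    index = {term: i for i, term in enumerate(vocabulary)}
    matrix = []
    for terms in posts:
        row = [0] * len(vocabulary)
        for term, count in Counter(terms).items():
            row[index[term]] = count
        matrix.append(row)
    return vocabulary, matrix


# Two invented (already augmented) posts.
posts = [["pension", "pension", "budget"], ["goal", "match", "goal"]]
tf_vocabulary, tf_matrix = term_frequency_matrix(posts)
print(tf_vocabulary)
print(tf_matrix)
```

Terms repeated within a post receive proportionally larger weights, which is the property the paragraph above credits for forming dense patches around a common theme.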
The proposed clustering method, in line with density-based clustering
methods, uses a distance parameter to determine the search space. This
parameter denotes the region in the data space that the algorithm checks to
decide whether a particular data point is a dense point. Fig. 6 (f) shows
that increasing it beyond a particular value diminishes the accuracy by
including all the points in one cluster.
3.5 CASE STUDY: COMMUNITY MINING
In the case study, we explored the ability of the proposed community mining
method to find meaningful communities using Australian tweets from the
“NationalSeniors” Twitter account. The objective of this case study was to
identify sub-groups of “alike” users in a senior community based on their
tweet posts. Fig. 7 shows the word cloud¹ generated for the entire tweet
dataset. It can be noted that Pension, Australian Politics (Auspol), Votes,
Care and Finance are popular discussion areas among seniors in Australia.
However, it does not give any more insight. Fig. 8 shows the 6 senior
sub-communities identified by the proposed method without setting the number
of communities in advance. The word cloud generated for the tweet posts of
each sub-community confirms that they provide more meaningful information.
Figure 7: Word Cloud for total tweets obtained from “NationalSeniors”
¹ The Voyant Tool [170] is used to illustrate the word clouds.
Community 1 discusses reform policies affecting older people and
discrimination. Discussions in community 2 focus on pension-related cuts and
changes to their finance and care. Community 3 users specifically talk about
older workers, votes, and candidates. Community 4 users seem to be concerned
with daily water supply. Community 5 is about an event organised for seniors
called “Senior Wednesday” and associated social engagements such as members,
movie, and tickets. Community 6 discusses Australian politics, the budget and
associated matters for pensioners. This case study shows the power of the
content-based community detection method to find meaningful communities
according to the text they communicate. It enables decision-makers to
identify seniors' political viewpoints and their concerns.
This information can be used in multiple ways. One example is to customize
advertising strategies for each of the targeted groups. For instance, users
in community 5 can be the focus of advertisements for social gatherings.
Furthermore, this community information highlights current events (e.g. the
budget, Senior Wednesday, and reforms), and can thus be used for event
detection if considered together with the timestamp.
Further, the users in the “NationalSeniors” dataset are analysed using the
retweet network within the group to identify structure-based communities
using the well-known Louvain algorithm [14]. It resulted in 94
sub-communities, as the network of disseminated information among users
relies on closer ties. This supports the capability of our content-based
method to produce meaningful, cohesive user groups rather than splitting
users into a larger number of fine-grained clusters.
Figure 8: Word clouds for the six sub-communities identified in the “NationalSeniors” tweets
3.6 CONCLUSION
In this paper, we propose a content-based community detection method to
identify user groups that share similar content on social media platforms.
Our approach deals with the sparseness of text by augmenting media posts. The
abundant information available on social media platforms is used as a simple
way to increase term co-occurrence and to aid media post expansion along with
NMF-based topic vectors. We conjecture that the NMF-based dimensionality
reduction method, which considers the context of terms to infer topic
proportion vectors, gives high coefficients to the most relevant terms of the
topics in the approximation process. Thereby, it is able to derive the
probable virtual words for augmenting media posts.
The proposed density-based complete clustering method naturally identifies
communities using these augmented media posts, where each post represents a
user. The method initially forms highly dense patches in the data and leaves
some users in low-density regions unassigned to any community. The refinement
phase, which adds centroid-based clustering, uses pre-calculated cluster
centroids to group the unassigned users. Extensive experiments on different
Twitter datasets and the case study confirm that the proposed approach with
corpus-based expansion significantly enhances the performance of
short-text-based community detection. Our future investigations include
extending this approach, which naturally assigns each user to one community,
to deal with users with multiple interests using soft clustering, and
handling dynamic temporal context with improved time efficiency for event
detection.
Paper 4: Concept Mining in Online Forums using
Self-corpus-based Augmented Text Clustering
Wathsala Anupama Mohotti* , Darren Christopher Lukas* and Richi Nayak*
*School of Electrical Engineering and Computer Science, Queensland University
of Technology, GPO BOX 2434, Brisbane, Australia
Published In: IEEE Pacific Rim International Conference on Artificial Intelli-
gence (PRICAI), 26-30 August 2019, Cuvu, Fiji
Statement of Contribution of Co-Authors
The authors of the papers have certified that:
1. They meet the criteria for authorship in that they have participated in
the conception, execution, or interpretation, of at least that part of the
publication in their field of expertise;
2. They take public responsibility for their part of the publication, except for
the responsible author who accepts overall responsibility for the publication;
3. There are no other authors of the publication according to these criteria;
4. Potential conflicts of interest have been disclosed to (a) granting bodies, (b)
the editor or publisher of journals or other publications, and (c) the head
of the responsible academic unit, and
5. They agree to the use of the publication in the student’s thesis and its
publication on the QUT ePrints database consistent with any limitations
set by publisher requirements.
Contributor: Wathsala Anupama Mohotti
Statement of contribution*: Conceived the idea and research design, analyzed
data, wrote the paper and addressed the supervisor and reviewers' comments to
improve the quality of the paper
Signature: QUT Verified Signature
Date: 27/03/2020

Contributor: Darren Christopher Lukas
Statement of contribution*: Conducted experiments, analyzed data
Signature: QUT Verified Signature
Date: 26/03/2020

Contributor: A/Prof Richi Nayak
Statement of contribution*: Provided critical comments in a supervisory capacity
on the design and formulation of the concepts, method and experiments, edited
and reviewed the paper
Signature: QUT Verified Signature
Date: 26/03/2020
ABSTRACT: This paper proposes a self-corpus-based text augmentation tech-
nique with clustering for concept mining in a discussion forum. Sparseness in
text data, which challenges the distance and density measures in determining the
concepts in a corpus, is handled through self-corpus-based document expansion
via matrix factorization. Experiments with a real-world dataset show that the
proposed method is able to infer useful concepts.
KEYWORDS: Concept Mining; Corpus-based augmentation; Clustering
1 Introduction
An online forum is a formal mechanism that a community uses to exchange
information through posted messages that are organized into “threads” [125]. The
forums can reflect concepts, themes, and concerns of online societies in diverse
fields such as education, marketing and politics [125, 132]. A handful of studies
have applied data and text mining methods to explore the predictive power of
the forum data [120, 132]. In the education domain, discussion forums have been
analyzed to assess interactivity over a period of time and provide early warnings
for at-risk students [132]. In marketing, online forum data is used with
predictive models to identify product defects [125]. However, these works neglect
the natural text content used in the online discussion. A few studies have
applied text mining in online forums for sentiment analysis [120], using
supervised approaches to classify forum threads. However, the unavailability of
ground truth in online forum data creates the demand for conducting the analysis
in an unsupervised setting [120].
In this paper, we propose a concept mining method that can extract concepts
based on text discussions in the unsupervised setting. Concept mining of online
forum data faces the same challenges as traditional text mining [3]. The sparse
nature of text vectors and the high number of dimensions cause distance- and
density-based methods to perform poorly due to distance concentration [3].
Specifically, the difference in distance between far and near points becomes
negligible in higher dimensions [3]. In addition, density-based methods are
unable to identify dense patches in sparse text data. Moreover, forum data is
usually homogeneous, so a minor variation in the distance/density measures will
determine the groupings. Probabilistic and matrix factorization-based approaches
have been introduced to handle higher dimensions in text [3]. However,
information loss in these dimensionality reduction methods is evident.
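The distance concentration effect mentioned above can be illustrated with a small numerical sketch. It is not part of the original experiments: it assumes dense, uniformly random points rather than sparse text vectors, and the function name relative_contrast is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for one random query point.

    Assumption: points are drawn uniformly from the unit hypercube,
    which only mimics the high dimensionality of text, not its sparsity.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# Contrast between the farthest and nearest neighbor in 2 vs 1000 dimensions.
low_dim_contrast = relative_contrast(500, 2)
high_dim_contrast = relative_contrast(500, 1000)
print(low_dim_contrast, high_dim_contrast)
```

In such a setting the contrast shrinks sharply as the number of dimensions grows, which is why nearest and farthest neighbors become hard to tell apart.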
Figure 1: Clustering Algorithm for Concept Mining: ConMine
Distinct from these works, we introduce a novel approach for content mining in
online forums using clustering and document expansion, named ConMine, to
understand the main concepts and themes present in user discussions. The
self-corpus-based document expansion [139] in ConMine, via Non-negative Matrix
Factorization (NMF), learns virtual terms from the same corpus that semantically
match the applied domain. Centroid-based clustering is then applied to the
expanded text to differentiate the concepts. ConMine automatically learns the
number of clusters to be produced within the augmentation process. Finally, we
synthesize meaningful concepts with the help of experts via word-cloud
visualization. The ConMine approach is evaluated on real-world data taken from
the Queensland University of Technology (QUT), Australia. The empirical analysis
shows that ConMine is able to handle the sparse and homogeneous nature of text in
discussion forums and identify concepts more accurately than the benchmarks.
2 Concept Mining with Self-corpus-based Augmentation
The proposed three-step ConMine algorithm is outlined in Fig. 1. Consider a
collection of online forum corpora D = {D1, D2, ..., Di, ..., Ds} over a time
period s, where Di represents the corpus at time i. Let Di be a collection of N
distinct posts, {P1, P2, ..., PN}, that contain a total of M distinct terms
{t1, t2, ..., tM}.
Self-corpus-based Augmentation with Matrix Factorization: In contrast
to using external knowledge bases [93], we conjecture that self-corpus-based
augmentation is well suited for augmenting text as it follows the forum's text
patterns. Let A be the M × N matrix representation of Di. We decompose A using
NMF into the lower-rank non-negative matrices W and H, of size M × k and k × N
respectively, with the rank k set as the number of topics. The value of k is
learned using the intrinsic topic coherence measure. The matrix factorization
process iteratively approximates W and H such that they represent the
high-dimensional A with the least error, as in Eq. 1.
$$\min_{W,H \ge 0} \; \frac{1}{2}\,\lVert A - WH \rVert_F^2 = \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(A_{i,j} - (WH)_{i,j}\right)^2 \qquad (1)$$
The topic membership of each post in Di is obtained from the maximum coefficient
value in H for that post. This associated topic is used to identify the virtual
terms for each post using W. The coefficients in W are sorted in decreasing
order, and the terms whose coefficients exceed the mean plus the standard
deviation of the distribution are taken to represent a topic, as in [139]. Each
text post of Di is expanded with the most probable terms of its topic vector as
virtual terms, forming D′i.
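The augmentation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: NMF is solved with standard multiplicative updates, raw term counts stand in for the tf*idf weights, only virtual terms not already in a post are appended, and all function names and the toy posts are hypothetical.

```python
import numpy as np

def nmf(A, k, n_iter=300, seed=0, eps=1e-9):
    """Approximate non-negative A (M x N) as W (M x k) @ H (k x N)."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k)) + eps
    H = rng.random((k, A.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry of W and H non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

def topic_terms(W, vocab):
    """Terms whose coefficient exceeds mean + std within each topic column."""
    terms = []
    for topic in range(W.shape[1]):
        col = W[:, topic]
        threshold = col.mean() + col.std()
        order = np.argsort(-col)  # decreasing coefficient order
        terms.append([vocab[i] for i in order if col[i] > threshold])
    return terms

def augment_posts(posts, vocab, k):
    """Append each post's topic terms (assumed rule: skip terms already present)."""
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(posts)))  # term x post matrix
    for j, post in enumerate(posts):
        for t in post:
            A[index[t], j] += 1.0  # raw counts; an assumption for brevity
    W, H = nmf(A, k)
    post_topic = H.argmax(axis=0)  # topic with the largest coefficient per post
    terms = topic_terms(W, vocab)
    augmented = [post + [t for t in terms[post_topic[j]] if t not in post]
                 for j, post in enumerate(posts)]
    return augmented, A, W, H

# Hypothetical toy corpus with two themes.
vocab = ["thesis", "writing", "review", "meeting", "supervisor", "milestone"]
posts = [["thesis", "writing", "review"], ["thesis", "writing"],
         ["meeting", "supervisor", "milestone"], ["meeting", "supervisor"]]
augmented, A, W, H = augment_posts(posts, vocab, k=2)
print(augmented)
```

Each augmented post keeps its original terms and gains the high-coefficient terms of its dominant topic, which is what raises term co-occurrence between posts on the same theme.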
Augmented Text Clustering: The data matrix D′i with augmented posts is
represented as a weighted term × post matrix to be partitioned into k clusters.
We use centroid-based clustering as it is reported to produce accurate outcomes
for homogeneous data [43]. As the online forum data is homogeneous in nature, we
partition the N posts into k clusters (with k obtained in the previous step)
using k-means. The initial k cluster centers are chosen randomly. Each post
P′a ∈ D′i is then compared with each of the k centers and assigned to the closest
one. The respective cluster centers are updated in each iteration.
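The clustering step described above can be sketched with a plain k-means loop. The sketch assumes Euclidean distance and a post × term layout (the transpose of the term × post matrix used in the text); the function name and the toy matrix are hypothetical and stand in for the tf-weighted augmented posts.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Partition the rows of X into k clusters with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    # Initial centers are k distinct posts chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance of every post to every center, then closest-center assignment.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members):  # keep the old center if a cluster becomes empty
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical tf vectors: rows are posts, columns are terms.
X = np.array([[1, 1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
labels, centers = kmeans(X, k=2)
print(labels)
```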
Knowledge Synthesis for meaningful Communities: Within this step, we
generate the m concepts that are meaningful to the domain in k clusters after
doing further post-processing and consultation with domain experts. We analyze
terms in each cluster through visualization and the highly occurring common
words are removed. This is an iterative quality checking process that includes
manual intervention. This process results in the m (≤ k) meaningful concepts
discussed in a forum.
3 Empirical Analysis
Datasets: The dataset is obtained from the online forum of Essential Supervisory
Practices (ESP), a 5-week training program for higher-degree research supervisors
at QUT, between 2015 and 2017. The posts from all years have been combined on
a weekly basis, resulting in five datasets as in Table 1. We consider each post,
regardless of its type (i.e., original or reply), as a single document after
applying standard text pre-processing steps. After comparing the experimental
results with multiple weighting schemes, posts are organized in the vector space
model (VSM) with the tf*idf weighting scheme to derive the topics, while the
augmented posts are represented using tf for clustering.
Table 1: Summary of the datasets used in the experiments
Dataset  Number of  Number of      Average post length (in terms)
         posts      unique terms   Before augmentation  After augmentation
W1       1664       7090           154                  165
W2       1495       7385           177                  194
W3       1416       7145           155                  165
W4       1402       7057           161                  174
W5       1568       6893           145                  153
Benchmarks and Evaluation Measures: The proposed NMF based ap-
proach for document expansion using topics in ConMine is evaluated against
probabilistic LDA (pLDA) [3] and Latent Semantic Indexing (LSI) [3]. The state-
of-the-art clustering methods of DBSCAN [139], LDA [3], LSI [3] and NMF [3]
are used for benchmarking the clustering step of ConMine. The accuracy of topic
vector formation and of the clustering process was evaluated with the intrinsic
measures of topic coherence [134] and Silhouette score [134], respectively.
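For reference, the Silhouette score used above can be computed as in the following sketch. It assumes Euclidean distance and scores singleton clusters as 0, which may differ from the exact configuration used in the experiments; the function name and toy points are hypothetical.

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean of (b - a) / max(a, b) over all points.

    a: mean distance to the other members of the point's own cluster.
    b: smallest mean distance to the members of any other cluster.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:  # convention assumed here for singleton clusters
            scores.append(0.0)
            continue
        a = D[i, same].sum() / (n_same - 1)  # self-distance is zero
        b = min(D[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated groups of points, labeled correctly and incorrectly.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good, bad)
```

A score close to 1 indicates tight, well-separated clusters, while a negative score indicates that points sit closer to another cluster than to their own.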
Figure 2: Results of the Experiments
Accuracy: ConMine with NMF performs best in terms of topic coherence (Fig.
2(a)). LSI, which approximates factors with both positive and negative entries,
is not able to provide a strong topic distribution in the VSM, which is
represented with non-negative entries. pLDA, which approximates topics using the
probability of terms, considers only the term count and neglects the context of
the words and frequencies, and has provided inferior results. We empirically
learn the number of topics that produces highly cohesive topics, as shown in
Fig. 2(b). This number is used in deriving topics for the post augmentation and
is also set as k in the clustering process. Fig. 2(c) compares clustering in
ConMine with and without post augmentation. The increased tightness of the
clusters, indicated by a higher silhouette score after augmentation in each
method, confirms the benefit of augmentation in handling the sparseness of
high-dimensional text via added terms. ConMine shows the highest increase in
silhouette value compared to all the baselines. In the homogeneous data, the
density concept (DBSCAN) creates contiguity-based clusters where very different
data items may end up in the same cluster, giving the worst results. LDA, which
uses term count-based probability, is unable to predict the correct cluster
because it neglects the context of the terms. However, NMF as a clustering
method performs similarly to ConMine, with a marginal difference, showing the
importance of mapping from a higher- to a lower-dimensional
space. The identified Concepts for each week are given in Table 2.
Table 2: Concepts identified for datasets
Dataset  Identified concepts by ConMine
W1       Research skill, Milestones, Supervisors, Meetings, Publications
W2       Experience in supervising, Relationship between student and supervisor
W3       Writing thesis, Writing literature review, Plagiarism and research issues
W4       Emotional issues, Completion, Strategy for unsatisfactory progress
W5       Examiner comments, Final submission and Seminar practice
4 Conclusion
We proposed and evaluated a concept mining method, ConMine, on real-world
forum data for understanding the discussions that are held on online forums.
To handle the sparsity and high dimensionality in text, we use NMF (which
approximates topic vectors in a linear manner considering the context of terms)
to obtain virtual words for post expansion. Leveraging the intrinsic
measurements, we learn the optimal number of topics k, which is further used in
centroid-based clustering to obtain the clusters/concepts within the augmented
text. Results show that ConMine can deal with the sparse and homogeneous nature
of online forum data to obtain useful concepts.
Chapter 4
Text Outlier Detection
This chapter introduces the second major contribution of the thesis, a set of
novel document outlier detection methods that identify documents deviating from
a set of subgroups covering common concepts in the document collection, based
on effective text (dis)similarity calculation techniques. Outlier detection in
text data has not gained as much attention from the research community as
clustering [1]. Only a few research works focus on the text domain [4, 96]. The
majority of outlier detection methods can deal with only a few dimensions
[29, 68]. With an increasing number of dimensions, outlier detection methods,
similar to clustering, face sparseness-related issues in identifying text
similarity, such as distance concentration [126] or information loss [96] in
dimensionality reduction methods. Some research addresses the issues of high
dimensionality with angle differences [109], subspace analysis [108], and
anti-hubs [58], which identify the deviated (dissimilar) data points. However,
the computational complexity of these approaches is high. There is a lack of
efficient approaches to accurately identify the outlier documents in a document
collection.
Fig. 4.1 outlines the high-level overview of the contributions discussed in this
Figure 4.1: Overview of the Chapter 4 contributions
chapter to effectively identify the text (dis)similarity for outlier detection. This
chapter explores the different ways to use ranking concepts for identifying out-
liers in document collections to avoid the issues with a sparse high-dimensional
text representation in identifying similarity. The primary hypothesis is to use
inverse document frequency, which ranks rare terms with higher importance, to
determine outlier scores. This concept can filter outliers because an outlier
document has a higher average weight of rare terms. Besides that, this chapter
introduces the use of the IR ranking concept, which identifies relevant
documents and relevancy scores through an inverted index data structure, for
outlier detection. The proposed methods in this thesis use this information
inversely to identify the outliers. In addition, neighborhood information
identified through the IR system is proposed to build a mutual neighbor graph
efficiently. A method based on density estimation and hub identification on the
graph is proposed to filter the outliers for short
text [79], while directly obtaining excluded documents from the graph as outliers
for other cases.
This chapter is comprised of two papers relating to these contributions.
• Paper 5. Wathsala Anupama Mohotti and Richi Nayak.: Efficient Out-
lier Detection in Text Corpus Using Rare Frequency and Ranking. ACM
Transactions on Knowledge Discovery from Data (TKDD) (Accepted with
Major Revision).
• Paper 6. Wathsala Anupama Mohotti and Richi Nayak.: Text Out-
lier Detection using a Ranking-based Mutual Graph. Journal of Data &
Knowledge Engineering (DKE) (Under Review).
Paper 5 aims to identify a higher number of true outliers while reducing false
identifications using ranking concepts. It proposes ranking documents using rare
document frequency and IR ranking-based neighborhoods to identify documents
dissimilar to inliers as outliers, with three main algorithms, namely Outlier
detection based on Inverse Document Frequency (OIDF), Outlier detection based
on Ranking Function Score (ORFS) and Outlier detection based on Ranked
Neighborhood k-occurrences Count (ORNC). OIDF proposes that outlier documents
should have a higher average inverse document frequency across their terms, as
they usually consist of a higher number of rare terms. ORFS, instead of using
the average of rare term weights, identifies the outlier scores using the
inverse of the relevancy scores obtained for the top-10 relevant documents
through the IR system. In addition, ORNC identifies the anti-hubs in the
collection using the relevant documents returned by an IR system for each
document in the collection. Documents with fewer k-occurrences within the
ranking responses are anti-hubs, which are potential outliers.
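The idea behind OIDF can be illustrated with a short sketch. The exact weighting and thresholding are defined in Paper 5 itself; here plain log(N/df) is assumed as the IDF, each distinct term counts once per document, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def idf_table(docs):
    """IDF per term, assuming the plain form log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def average_idf_scores(docs):
    """Outlier score per document: mean IDF over its distinct terms."""
    idf = idf_table(docs)
    scores = []
    for doc in docs:
        terms = set(doc)
        scores.append(sum(idf[t] for t in terms) / len(terms) if terms else 0.0)
    return scores

# Hypothetical collection: three related documents and one deviating document.
docs = [["cricket", "match", "score"],
        ["cricket", "match", "team"],
        ["match", "score", "team"],
        ["quantum", "entanglement", "photon"]]
scores = average_idf_scores(docs)
most_outlying = scores.index(max(scores))
print(scores, most_outlying)
```

The deviating document consists only of terms that appear nowhere else, so its average IDF is the highest in the collection.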
This paper explores sequential and independent ensemble strategies using OIDF
with ORNC and ORFS to obtain higher accuracy in outlier prediction with fewer
false predictions. In addition, two new evaluation measures are introduced in
this paper to reveal false predictions of inliers and outliers in a meaningful
manner. Experiments are done on five datasets from Wikipedia, NewsGroup and
Social Event Detection data, covering text vectors of all sizes. Experimental
results show that the proposed strategies are accurate and efficient on all the
datasets compared to the baselines. On the Wikipedia dataset, which consists of
large text vectors, all the baselines fail due to their time/memory complexity,
while OIDF, which is based on the simple concept of using rare terms, gives the
best performance among the proposed methods and consumes less time. Ensemble
methods based on ORFS reduce false detections and perform better than the other
proposed algorithms on NewsGroup data due to their efficient outlier score
calculation with ranking scores. However, ensemble ORNC performs better on
Social Event Detection data due to its anti-hub-based filtering, which deals
with the extreme sparseness in short text.
Paper 6 identifies the dissimilarities in a document collection using a
graph-based approach, Outliers by Ranking-based Density Graphs (ORDG). It
follows an incremental approach, which starts with rare term frequency-based
outlier detection as in OIDF. The sparseness in text representation, which
challenges the identification of text similarity, is addressed using a mutual
neighbor graph constructed with IR ranking results. Large and medium text
vectors, which show sufficient word co-occurrences compared to short documents,
allow the connected graph to include the inliers while leaving out the
outliers. The extreme sparseness of short text only allows the inclusion of a
few inliers in the graph. Thus, ORDG estimates uniformly dense regions on the
graph and thereby identifies the hubs attached to these inlier regions. ORDG
identifies the other inlier documents that are not included in the graph,
leaving documents dissimilar to hubs as outliers. This
hub similarity calculation is done with ranking scores given by the IR system.
Final outliers are determined by combining the outlier candidates generated by
the rare frequency with this mutual graph-based approach. Experiments are done
with the same datasets as in Paper 5. Experimental results show that ORDG is
accurate and scalable compared to baseline methods. It shows a substantial
performance improvement for datasets with short text vectors.
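The mutual neighbor graph can be sketched as below. This covers only the graph construction and the direct exclusion case for larger documents, not the density estimation or hub steps of ORDG. The neighbor lists are assumed to come from an IR system; the function names and toy lists are hypothetical.

```python
def mutual_neighbor_edges(neighbors):
    """Keep an edge (i, j) only if i and j appear in each other's ranked lists."""
    edges = set()
    for i, ranked in neighbors.items():
        for j in ranked:
            if i != j and i in neighbors.get(j, ()):
                edges.add((min(i, j), max(i, j)))
    return edges

def excluded_documents(neighbors, edges):
    """Documents with no mutual edge; treated here as outlier candidates."""
    connected = {node for edge in edges for node in edge}
    return sorted(set(neighbors) - connected)

# Hypothetical top-2 ranked neighbors per document id.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [0, 1]}
edges = mutual_neighbor_edges(neighbors)
candidates = excluded_documents(neighbors, edges)
print(sorted(edges), candidates)
```

Document 3 retrieves documents 0 and 1, but neither retrieves it back, so it stays outside the graph.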
OIDF is the most efficient method among all the proposed outlier detection
methods due to its use of the simple rare document weighting concept. ORDG,
proposed in Paper 6, uses hub-based inlier filtering and outperforms ORNC from
Paper 5, which is also proposed as an accurate method for short text. However,
the sequential ORNC method is more efficient than ORDG for short text, as ORDG
uses multiple steps in identifying outliers. Based on the experimental results
of the four outlier detection methods, OIDF, ORFS, ORNC and ORDG, the
suitability of each method with respect to the data type can be summarised as
in Table 4.1.
Table 4.1: Proposed outlier detection methods
Method  Concept                 Suitable data type                     Significant characteristics
OIDF    Rare Frequency          Collections with large text vectors    Accuracy, Efficiency
ORFS    IR Ranking Score        Collections with medium text vectors   Accuracy, Efficiency
ORNC    IR Ranking Results      Collections with short text vectors    Accuracy
                                Collections with medium text vectors   Accuracy
                                that have overlapping terms
ORDG    IR Ranking-based Graph  Collections with short text vectors    Accuracy
Paper 5 proposes two new measures, Inlier Prediction Error (IPE) and Outlier
Prediction Error (OPE), to identify false inliers and false outliers. They are
superior to FPR and FNR, which provide relative values, and are capable
of differentiating methods based on their ability to deal with false alarms.
Next, the chapter will present these two papers. Since this is a thesis by publica-
tion, each original paper is presented aligning with the thesis format. Due to the
papers’ different formats, there may be some minor format differences. However,
these do not alter the content of the original papers.
Paper 5: Efficient Outlier Detection in Text Corpus Using Rare Frequency and
Ranking
Wathsala Anupama Mohotti* and Richi Nayak*
*School of Electrical Engineering and Computer Science, Queensland University
of Technology, GPO BOX 2434, Brisbane, Australia
Accepted with Major Revision In: ACM Transactions on Knowledge Dis-
covery from Data (TKDD Journal)
Statement of Contribution of Co-Authors
The authors of the papers have certified that:
1. They meet the criteria for authorship in that they have participated in
the conception, execution, or interpretation, of at least that part of the
publication in their field of expertise;
2. They take public responsibility for their part of the publication, except for
the responsible author who accepts overall responsibility for the publication;
3. There are no other authors of the publication according to these criteria;
4. Potential conflicts of interest have been disclosed to (a) granting bodies, (b)
the editor or publisher of journals or other publications, and (c) the head
of the responsible academic unit, and
5. They agree to the use of the publication in the student’s thesis and its
publication on the QUT ePrints database consistent with any limitations
set by publisher requirements.
Contributor: Wathsala Anupama Mohotti
Statement of contribution*: Conceived the idea, designed and conducted
experiments, analyzed data, wrote the paper and addressed the supervisor and
reviewers' comments to improve the quality of the paper
Signature: QUT Verified Signature
Date: 27/03/2020

Contributor: A/Prof Richi Nayak
Statement of contribution*: Provided critical comments in a supervisory capacity
on the design and formulation of the concepts, method and experiments, edited
and reviewed the paper
Signature: QUT Verified Signature
Date: 26/03/2020
ABSTRACT: Outlier detection in text data collections has become significant
due to the need to find anomalies in the myriad of text data sources. High
feature dimensionality, together with the larger size of these document collec-
tions, presents a growing need for developing accurate outlier detection methods
with high efficiency. Traditional outlier detection methods face several challenges
including data sparseness, distance concentration and the presence of a larger
number of sub-groups when dealing with text data. In this paper, we propose to
address these issues by developing novel concepts such as presenting documents
using rare document frequency, ranking-based neighborhood for similarity com-
putation and identifying sub-dense local neighborhoods in high dimensions. We
present a set of novel ensemble approaches using the ranking concept to reduce
the false identifications while identifying the higher number of true outliers, in or-
der to improve the proposed primary method based on rare document frequency.
Extensive empirical analysis shows that the proposed method and its variations
are scalable compared to relevant benchmarking methods, as well as improving
the quality of outlier detection in document repositories.
KEYWORDS: Outlier detection; high dimensional data; k-occurrences; ranking
function; term-weighting
1 Introduction
With the advances in data processing technology, digital data have witnessed ex-
ponential growth [86]. Outlier detection plays a vital role in identifying anoma-
lies in massive data, and some examples are credit card fraud detection, criminal
activity detection in e-commerce and abnormal weather prediction [126]. The
general idea of outlier detection is to identify patterns that do not conform to
general behavior, referred to as anomalies, deviants or abnormalities [1]. There
exist different supervised and unsupervised machine learning methods that have
been used to identify exceptional points from different types of data, such as
numerical, spatial and categorical.
Outlier detection in text data is gaining attention due to the generation of a
vast amount of text through big data systems. Reports suggest that 95% of the
unstructured digital data appears in text form [86]. An outlier text document has
content that is different from the rest of the documents in the corpus that share
a few similarities amongst them [4]. Text outlier detection is frequently used in
stream data for event detection and first story detection to track the evolution
of an event [1]. However, detecting anomalies in static text data is also
beneficial for decision-making in many application domains, such as web, blog
and news article management [96]. An unusual web page on a website or web
content deviating from the theme of a blog, if discovered, will provide useful
insight for administrative purposes. Similarly, detecting an unusual news
article in a collection of news documents may help to flag it as exceptional or
fake news. The detection of unusual events in (short-length) social media data
can provide early warnings [45].
These applications of identifying text outliers face several challenges: (1) The
unavailability or limited availability of labeled data is the primary challenge
for real-world outlier detection methods, and it creates the requirement for
unsupervised methods. (2) Text data show few co-occurrences of terms among
documents and form a sparse representation that challenges the document
similarity calculation needed to identify deviations [96]. (3) In a special
category of text data such as social media text, the number of discriminative
terms and common terms shared by related text is small, making the data
extremely sparse [139, 191]. The number of groups or topics in social media is
also considerably high, which challenges text mining methods in identifying the
outliers that deviate from all these groups. Therefore, text outlier detection
methods face additional challenges in handling short-length social media posts.
(4) Moreover, the larger sizes of text collections generated by big data
systems create a need to explore efficient outlier detection methods.
Studies on general outlier detection commonly use distribution, distance, and
density-based unsupervised proximity learning methods for real-world applica-
tions, where training data with class labels are not available. Most of these
methods suffer from efficiency problems due to the high volume in large datasets
[155]. The effectiveness of these outlier detection methods on high dimensional
data is also challenged by the well-known curse of high dimensionality [126].
Specifically, the difference in distance between near and far points becomes
negligible, and these proximity-based methods are unable to compute the
similarity among documents [96]. This challenge is further amplified when the
data collection exhibits numerous distinct groups [173].
Researchers have developed subspace and angle-based methods to address these
high dimensional issues. These methods are computationally expensive due to
the larger numbers of comparisons required. Moreover, the subspace analysis
cannot guarantee that relevant subspaces are aligned with extreme values in full
dimensionality [108, 126]. Another set of solutions are proposed based on nearest-
neighbors and “anti-hub” concepts for high-dimensional data, such as graphs,
genes, etc [58, 68, 84, 154]. However, the nearest-neighbor (NN) calculation
is known to present scalability issues for larger datasets [164]. In this paper,
we conjecture that the relevant document set retrieved by a search engine, in
response to a document (posed as a query), is a promising alternative solution to
generate the neighborhood of the document. IR systems effectively calculate the
similarity between documents through the inverted index data structure avoiding
the issues with sparse data representation. We present a set of novel methods to
calculate outlier scores based on this ranking-based neighborhood concept using
the scalable search engine technology.
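The ranking-based neighborhood idea can be sketched with a toy inverted index. This is an illustration only: a real search engine would apply its own ranking function, whereas the sketch assumes a simple sum of IDF weights over shared terms, and all function names and documents are hypothetical. Counting how often each document appears in the ranked lists of the others gives its k-occurrences.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            index[term].add(doc_id)
    return index

def top_k_neighbors(query_id, docs, index, k):
    """Rank other documents by summed IDF of shared terms (assumed scoring)."""
    scores = Counter()
    for term in set(docs[query_id]):
        postings = index[term]
        idf = math.log(len(docs) / len(postings))
        for doc_id in postings:
            if doc_id != query_id:
                scores[doc_id] += idf
    return [doc_id for doc_id, score in scores.most_common(k) if score > 0]

def k_occurrences(docs, k):
    """How many times each document is returned in the other documents' top-k."""
    index = build_inverted_index(docs)
    counts = {doc_id: 0 for doc_id in range(len(docs))}
    for query_id in range(len(docs)):
        for doc_id in top_k_neighbors(query_id, docs, index, k):
            counts[doc_id] += 1
    return counts

docs = [["cricket", "match", "score"],
        ["cricket", "match", "team"],
        ["match", "score", "team"],
        ["quantum", "entanglement", "photon"]]
counts = k_occurrences(docs, k=2)
print(counts)
```

Only documents that share at least one term with the query are ever touched, which is how the inverted index avoids comparing sparse vectors pairwise. A document that is never retrieved by the others has a k-occurrence count of zero.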
There are only limited studies in outlier detection literature that specifically focus
on text-domain and deal with the sparseness of the document feature vector. Tra-
ditional outlier detection methods fail to capture the document similarity due to
ineffective distance/density computation [3] within the sparse vector space model.
Therefore, dimensionality reduction-based methods have been used for text
outlier detection [11, 16], and they have to deal with information loss in the
lower-order mapping. Recently, the use of sum-of-square differences in matrix
factorization has been proposed to determine outlier scores in text data [96].
In real-world scenarios such as social media, the number of groups/topics
inherent in the data is large and creates the need for distinguishing the
fine-grained sub-groups while identifying outliers. However, the fine-grained
problem, dealing with a larger number of inlier (i.e., normal data) and outlier
classes, presents issues for the aforementioned matrix decomposition, as the
iterative lower-rank matrix approximation process increases the level of error
(as evident in our experiments). A term weighting scheme such as Inverse
Document Frequency (IDF) is a common statistical measure in Information
Retrieval (IR) that selects the intrinsic dimensionality of a text document by
representing how important a term is in a collection [37].
Inspired by this, in this paper, we propose a novel, rare-frequency-based method
for high-dimensional document outlier detection.
Many data mining problems such as classification and clustering improve the ro-
bustness of the primary solution using an ensemble mechanism [61, 187]. This
is a promising solution to reduce the false alarm rate of outlier detection meth-
ods. Ensemble approaches, broadly classified as sequential or independent, have
been successfully used in prior work to improve the quality of an outlier detec-
tion algorithm [1]. We explore and present an ensemble strategy by combining
rare document frequency with the ranking-based neighborhood to improve the
accuracy of outlier detection by reducing false positives.
Overall, this paper presents novel outlier detection methods based on the concepts
of rare document frequency and ranking. Firstly, we propose that semantic term
clusters can effectively be used to detect deviations or anomalous documents
through meaningful term weighting. We then propose to exploit the ranking-
based retrieval techniques employed in search engines to provide similar docu-
ments in comparison to the conventional big data analytics methods that require
significant investments. Thereby, we propose to use the local sub-dense neigh-
borhood concept (Hubs), evident in high dimensional text data, through rank-
ing. We combine the neighborhood-based methods with the primary rare-term
weighting-based method to form ensemble approaches and reduce the potential
of false outlier detections. Unlike the state-of-the-art methods, we present these
methods as non-parametric and address the bottleneck of setting the user-defined
threshold to assess a document score. Lastly, the paper discusses the need for an
outlier-focused evaluation mechanism to report false positives (i.e., false outliers)
and false negatives (i.e., false inliers) in outlier detection.
In summary, this paper brings several novel contributions to the area of document
outlier detection, listed as:
• The use of rare frequency in document representation for outlier detection
to demarcate the border between common and rare documents. This novel
concept contributes to the primary OIDF algorithm.
• The concept of finding relevant neighbors using a scalable IR system that
consumes less computation cost. Two novel algorithms (ORFS and ORNC)
are developed to detect the level of deviations between documents.
• A set of ensemble approaches (ORFS (I), ORFS (S), and ORNC (S)) fo-
cusing on improving accuracy (i.e., reduced false outliers) with efficiency.
These approaches do not depend on a user-defined parameter as an outlier
threshold.
• Two meaningful evaluation measures, namely OPE and IPE, that highlight
false detections.
The rest of this paper is organized as follows. Section 2 provides the motivation
and reviews related work on outlier detection, term weighting, and IR ranking
concepts. The proposed approaches for text outlier detection based on rare term
weighting and ranking are detailed in Section 3. A comprehensive empirical
study with benchmarking on several datasets covering various length text data
is provided in Section 4, with a summary that provides useful insight on all
approaches. Finally, concluding remarks are summarized in Section 5.
2 Related Work
In the current era, most human interactions appear and are collected in the form
of free text such as emails, wikis, blogs and social media feeds. Outlier detection
is useful for finding interesting and suspicious text within the collection. Text
collection usually contains a high dimensional set of terms that result in a sparse
representation [3]. Different text representation models based on term frequency
have been used, with the Vector Space Model (VSM) being a primary model [77].
There are different term weighting schemes used in IR to give an importance
level to terms such as TF, IDF, TF∗IDF, and BM25 [160] in a data model. The
Inverse Document Frequency (IDF) scheme favors rare terms in the collection [37].
Several prior works in the field of outlier detection use Hawkins’s definition to
set an outlier, stating that “An outlier is an observation which deviated so much
from the other observations so as to arouse suspicions that it was generated by a
different mechanism” [69]. These deviations can be identified by using rare terms
in the calculation. In this paper, we conjecture that the use of IDF, which values
the importance of rare words, in presenting a dataset will highlight the outlier
documents.
Outlier detection broadly follows two approaches. (1) Supervised learning when
training data with labels of normal and abnormal data is provided [105]; and (2)
Unsupervised learning when labeled data is not available, which is common in
real-world scenarios [68]. Neural network-based methods that used deep feature
extraction [38], and Generative Adversarial Network-based active learning meth-
ods in outlier detection [127] are recent supervised and semi-supervised methods
that predict the outliers based on training data directly or indirectly. The
performance of these methods is fully or partially affected by the supervision
given by labeled data. Unsupervised learning methods follow proximity approaches such
as distance-based, density-based, distribution-based and cluster-based [68]. The
majority of the outlier detection work deals with few-dimensional numerical data
where over fittings in terms of distance or density distribution clearly separate
outliers. Distribution-based methods use different statistical fundamentals to de-
termine the anomalies that occurred outside of the normal model [16, 89]. These
methods depend on the assumption about data representation and measures used
and can be affected by over fittings in normal data. Poor scalability of this ap-
proach for the high dimensional data further makes it less effective for text outlier
detection.
Distance and density-based approaches have been extensively used in outlier de-
tection due to their simplicity in implementation [126]. Conventional distance-
based methods identify outliers that highly deviate from the remaining data in
the collection using distance differences [104]. Alternatively, neighborhood infor-
mation is used for outlier detection [157, 197]. Nearest Neighbors (NN) has been
used as an effective method to measure the distance differences. The differences
between each point and its k-NN are considered, and the top n farthest points are
labeled as outliers [157]. Text data presents challenges to this approach because
distance differences become negligible due to sparseness in high dimensions (known
as distance concentration) [126]. Document collections are usually large in size
and contain multiple groups. This creates a scalability issue for approaches based
on nearest-neighbor calculation [173].
As a remedy to the distance concentration problem, similarity calculation based
on the angle between vectors is proposed to determine the deviation [109]. This
approach can be adapted to the text domain as cosine similarity can be used
to measure angle differences in text feature vectors [186]. However, the number
of pairwise comparisons needed for larger datasets increases the computational
complexity and makes this approach infeasible to apply to large-scale data.
In contrast to the distance-based methods that identify far-away points glob-
ally, density-based methods identify less dense points locally as outliers. These
methods derive a density distribution of data and identify nearest neighbors by
handling varying density patches. The relative density of a point is compared to
neighbors and a Local Outlier Factor (LOF) is defined to determine the degree of
outlier [29]. A point gets a higher LOF value if the ratio between density around
k nearest-neighbors of that point and local neighborhood of that point is high. It
is then labeled as an outlier candidate [109]. Density-based methods are known
to have difficulty dealing with the higher dimensions inherent to text data, due to
distance concentration.
Density-based clustering methods such as DBSCAN can naturally detect outliers
in the dataset by considering them as points in sparse regions [9, 57]. Several
cluster-based outlier detection methods have been proposed that consider the
tightness of the clusters [51, 71]. Although these methods are capable of detecting
outlier clusters, they highly depend on threshold parameters [51]. Moreover,
these methods cannot be directly adapted to text data, as the text data exists in
patches and it becomes highly difficult to detect outliers.
In high-dimensional data, identifying outliers that deviate from the rest of
the collection is hard with distance and density methods, because the curse of
dimensionality makes similarity calculation between points less effective [2].
Different multi-dimensional scaling techniques have been used to deal with this
issue [27] and identify the outliers in reduced dimensions. However, loss of
information in the higher-to-lower order approximation is inevitable. Alternatively,
subspace-based outlier detection methods play a vital role in managing high
dimensionality. They combine local data pattern analysis with subspace analysis.
However, finding a subset of dimensions with rarely existing patterns using
brute-force search incurs a high computational cost [2]. Furthermore, these
approaches use the behavior of selected subspaces to identify outliers [108]. The
deviations in this embedded subspace cannot be guaranteed to determine outliers in
the full dimensional space globally.
Text data has been shown to experience the hub phenomenon that is evident
in high dimensions, i.e., “the number of times some points appear among k-NN
of other points is highly skewed” [154]. These local NNs, which form sub-dense
regions (i.e., hubs), are used to address sparseness-related problems in high
dimensionality effectively [58, 140, 173]. This concept has been used inversely in
outlier detection. In a graph-based method where each data point represents a
graph node, connections are made considering reverse k-NN [68, 84] and a lower
in-degree number identifies potential outlier nodes. Similarly, researchers have
used the concept of “anti-hubs” as potential outlier candidates [155]. Although
these local NNs-based approaches successfully handle the higher dimensions, the
scalability of these methods for larger datasets remains questionable, due to the
need for calculating a hub for each data point.
Table 1 summarizes the outlier detection methods that have been applied in nu-
merical and textual data sets. Limited studies are available focused specifically
on the text domain [16]. Given the fact that random projection approximately
preserves the distance between points in the lower dimensional space, the ran-
dom projection has been applied to text data to identify outliers [11]. Loss of
information is inevitable in projection and ultimately reduces accuracy. In order
to improve the accuracy of text outlier detection, the significance of terms should
be interpreted related to the structure of the document [1]. Non-negative Matrix
Factorization (NMF) has been used to decompose the text collection into a set of
semantic term clusters and document clusters considering document structures
[96]. The term clusters allowed the method to learn the outlier degree of each
document by ranking the sum-of-squares of differences with the original matrix.
However, an increased number of groups within the collection impairs this learning
process. Outlier detection in fine-grained scenarios is not practical
with an NMF-based method in terms of both accuracy and scalability.
The IR systems have shown the capacity to manage the text data successfully
[3]. Search engines are well-known IR systems capable of efficiently finding rele-
vant documents from a document collection, whereby the document collection is
organized in the inverted indexed data structure [200]. They have been known to
deal with big data collections [200]. In this paper, we propose novel ensemble
approaches based on the rare frequency and ranking concepts in IR, and identify
the NNs as well as local sub-dense neighborhoods in text data to determine
the deviations. To the best of our knowledge, this is the first extensive outlier
detection work in text mining, using the concepts of rare frequency, ranking and
ensemble approach.
Table 1: Summary of the major outlier detection methods used in high dimensional data
Category | Methods | Applied Domain
Ranking-based | Neighborhood-based outlier detection [126] | Numerical data
k-occurrence-based | Anti-hub based outlier detection in [155] | Numerical data
k-occurrence-based | Hubness aware outlier detection in [58] | Numerical data
Graph-based | k-NN graph based outlier detection in [68] | Numerical data
Graph-based | Natural-neighbor graph-based method in [84] | Numerical data
Subspace-based | Evolutionary algorithms in [2] | Numerical data
Subspace-based | Subspace outlier detection in [108] | Numerical data
Projection-based | Random projection based outlier detection [11] | Text data
Projection-based | NMF based outlier detection [96] | Text data
Angle-based | Angle variance-based method in [109] | Numerical data
3 Outlier Detection in Text Data: Proposed Methods
3.1 Preliminaries
In this paper, the objective of outlier detection is to identify data points distinct
from the rest of the collection as outliers, by separating them from the inliers
that are cohesive points forming sub-groups. Consider a document collection
D = {d1, d2, ..., dN} where di ∈ D is represented using a set of distinct terms
{t1, t2, ..., tv}. Let D contain a set of distinct terms {t1, t2, ..., tn}, n ≫ v, covering
all the terms in the collection. Let D be divided into a set of sub-groups C =
{c1, c2, ..., cl} where l ≪ n and l < N. Each cg ∈ C contains a set of similar
documents that share related terms. We formally define document di ∈ D as
outlier or inlier as follows.
Definition 1 - Outlier: A document di ∈ D that shows high deviation, based
on term distribution, from all distinct sets of similar documents C is considered
an outlier.
Definition 2 - Inlier: A document di ∈ D that shows high similarity, based
on terms distributions, with any distinct set of similar documents, cg ∈ C is
considered an inlier.
Example: Fig. 2(a) shows a toy document collection. It contains two (sport)
groups of documents considered as inliers, and the document on weather infor-
mation, d11 is an outlier. The outlier document shows a set of different terms
that deviates from the common terms in the collection related to subcategories
of sports.
Term weighting: We use the VSM model [201] to represent the collection.
A document is represented as a point in multidimensional space by vector di =
{w1, w2, ..., wv}, where wj is the weight of a term tj in the document. We use
IDF [163] as the weighting scheme that statistically weights the rare terms higher.
We conjecture that IDF is more informative to differentiate a document from the
collection, instead of the term frequency (TF) that weights common terms with
higher weights. The weight of the term tj is given as:
$$ w_j = idf_j = \log\left(\frac{|D|}{df_j}\right) \tag{1} $$
where dfj is the document frequency of term tj, i.e., the number of documents that
contain the term. In this paper, we calculate IDF after applying the standard
pre-processing steps of stop-word removal and stemming.
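The weighting in Eq. 1 can be illustrated with a minimal sketch. The toy collection and the function name `idf_weights` are hypothetical, the documents are assumed to be already tokenised, stop-word filtered and stemmed, and the natural logarithm is an assumption because the base of the logarithm is not stated.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return {term: log(|D| / df_term)} for a list of tokenised documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain the term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    # Natural log is an assumption; any base preserves the ordering of terms.
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Hypothetical toy collection (not the collection shown in Fig. 2).
docs = [
    ["cricket", "bat", "ball"],
    ["cricket", "wicket", "ball"],
    ["rugby", "ball", "tackle"],
    ["rain", "storm", "forecast"],
]
idf = idf_weights(docs)
```

In this toy collection the term "ball" occurs in three of four documents and receives the lowest weight, while terms that occur in a single document receive the highest.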
Nearest neighbors: The cluster hypothesis proposed in IR [91] states, “the
linked sets of documents are likely to be relevant to the same request”. Building
on this, the reversed cluster hypothesis [59] was used to show theoretically that
“documents relevant to the same query should occur in the same cluster”. Prior research
then empirically showed that when a document di is posed as a query to an IR
system, the retrieved document set can be considered similar to di [140, 173]. In
this paper, we propose to use the top-m retrieved set as NNs of di.
Let the document collection D be organized in the form of an inverted index
data structure stored in an IR system. Let di ∈ D be posed as a document query
q using a set of distinct terms {t1, t2, ..., ts} where s ≤ v to the IR system. Given
the query document q, a ranking function Rf employed in the IR system returns
the set of the m most relevant documents, Dq, as:
$$ R_f : q \rightarrow D_q = \{(d_p, r_p)\} : p = 1, 2, \ldots, m \tag{2} $$
where the relevancy score rp of a document dp can be calculated as:
$$ score(q, d_p) = r_p = \sum_{t \in q} \left( \sqrt{tf_{t,d_p}} \times idf_t^{2} \times norm(t, d_p) \right) \tag{3} $$
Definition 3 - Nearest Neighbors (NN): A set of top-ranked documents
retrieved by employing a ranking function Rf in an IR system can be considered
as NN of di.
In this paper, we use the Elasticsearch search engine as the IR system and obtain
top-m (m = 10) documents as k-NN (k = 10) for each document in the collection.
Prior research shows that precision at top-10 documents in the ranked list for a
query is high, due to tight coupling with the topic and these top documents
possess sufficient information richness [199]. When a document is posed as a
query to the IR system, it is represented with the top-s (s = 10) terms as in [173]
ranked in the order of IDF. There exist several ranking functions such as tf∗idf,
BM25, BM25P and LM-JM [54, 173]. We propose to use the widely applied tf∗idf
ranking function to measure the relevance between a document and a query as
in Eq. 3.
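The retrieval step can be sketched in memory as follows. This is an illustrative stand-in for the search engine, not the Elasticsearch implementation: the function names are hypothetical, norm(t, dp) is assumed to be a length normalisation of 1/sqrt(document length), the natural logarithm is assumed for IDF, and the query document is excluded from its own result list in line with the condition di ≠ dp in Eq. 5.

```python
import math
from collections import Counter


def build_index(docs):
    """Return per-document term counts and collection-wide IDF weights."""
    tf = [Counter(doc) for doc in docs]
    df = Counter(term for counts in tf for term in counts)
    idf = {term: math.log(len(docs) / count) for term, count in df.items()}
    return tf, idf


def top_m_neighbors(i, docs, tf, idf, m=10, s=10):
    """Pose document i as a query and return its top-m (doc index, score) pairs."""
    # Query = the s distinct terms of document i with the highest IDF.
    query = sorted(tf[i], key=lambda t: idf[t], reverse=True)[:s]
    scored = []
    for p, counts in enumerate(tf):
        if p == i or not docs[p]:
            continue  # skip the query document itself and empty documents
        norm = 1.0 / math.sqrt(len(docs[p]))  # assumed length normalisation
        score = sum(
            math.sqrt(counts[t]) * idf[t] ** 2 * norm
            for t in query
            if t in counts
        )
        if score > 0:
            scored.append((p, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:m]


# Hypothetical toy collection (not the collection shown in Fig. 2).
docs = [
    ["cricket", "bat", "ball"],
    ["cricket", "wicket", "ball"],
    ["rugby", "ball", "tackle"],
    ["rain", "storm", "forecast"],
]
tf, idf = build_index(docs)
neighbors_of_first = top_m_neighbors(0, docs, tf, idf)
neighbors_of_last = top_m_neighbors(3, docs, tf, idf)
```

In this sketch the last document shares no terms with the rest of the collection, so nothing is retrieved for it, which is the behaviour the ranking-based methods rely on.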
Reverse neighbors: This is the count of how often a data object appears in
k-NNs of every other data object [155]. This is defined as retrievability of a
document in IR literature [15]. A document that rarely appears in any other
k-NNs will have a high chance of being an outlier. This can be considered as an
alternative way to determine hubs of documents in the collection.
Definition 4 - k-occurrences: The number of times di ∈ D occurs in the k-
NN set of other documents is defined as the number of k-occurrences of di. It is
denoted as Nk(di) and referred to as the reverse neighbor count of di.
Example: Consider the document collection in Fig. 2(a) that is indexed in an
IR system to obtain the set of all relevant results, as shown in Fig. 5 (a). The
k-occurrences of document d1 is Nk(d1) = 4 in this collection as it appears in
k-NN lists of documents d1, d2, d6 and d9. This is the reverse neighbor count of
d1.
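Definition 4 can be sketched as a simple count over the retrieved neighbor lists. The neighbor lists below are hypothetical and do not reproduce Fig. 5; the function name `k_occurrences` is also an assumption.

```python
from collections import Counter


def k_occurrences(neighbor_lists):
    """Count how often each document appears in the k-NN lists of the collection.

    neighbor_lists maps a document id to the ids retrieved for it.
    """
    # Start every known document at zero so that anti-hubs are reported too.
    counts = Counter({doc_id: 0 for doc_id in neighbor_lists})
    for neighbors in neighbor_lists.values():
        counts.update(set(neighbors))  # each list contributes at most once
    return dict(counts)


# Hypothetical neighbor lists for four documents.
neighbor_lists = {
    "d1": ["d2"],
    "d2": ["d1"],
    "d3": ["d1", "d2"],
    "d4": [],
}
reverse_counts = k_occurrences(neighbor_lists)
```

Documents that never appear in another document's list keep a count of zero and are the anti-hub candidates.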
Using these basic concepts and definitions, we propose three novel algorithms to
identify outliers in a document collection, as detailed in Table 2, and map them to
the categories developed in the literature on (numerical) high-dimensional outlier
detection. These methods capture outlier documents containing rare terms using
different document ranking techniques. (1) OIDF ranks documents by the average of
the IDF weights of their terms. Because the IDF weighting scheme gives high
importance to rare terms, it can deal with the high dimensional text
representation and identify outliers efficiently. (2) ORFS uses the IR ranking
function to retrieve ranking scores for nearest neighbor documents; the reciprocal
of the average similarity score ranks a document by how dissimilar it is to its
nearest neighbors, which identifies the deviated documents. Following the cluster
hypothesis [92], the use of a scalable IR system lets ORFS deal with large, sparse
text collections by retrieving a set of related documents as nearest neighbors in
response to a given document. (3) ORNC extends this concept and uses the IR ranking
responses to capture the k-occurrences of documents among nearest neighbors in
the entire collection, thereby identifying the property of hubs, or local sub-dense
regions, in high dimensional data. ORNC ranks the documents based
on the inverse of the k-occurrence count to identify the outliers that are anti-hubs
with a low k-occurrence count.

Table 2: Summary of the proposed algorithms
Category | Algorithm Concept
Rare Frequency based | OIDF: Outlier Detection based on Inverse Document Frequency
Ranking based | ORFS: Outlier Detection based on Ranking Function Score
k-occurrence based | ORNC: Outlier Detection based on Ranked Neighborhood k-occurrences Count
3.2 Outlier Detection based on Inverse Document Frequency: OIDF
Figure 1: Algorithm 1 - OIDF

Generally, IDF measures how much information a term carries in the collection
and is able to differentiate the term as distinct in the collection. According to
Eq. 1, the IDF value of a rare term should be high. We conjecture that an outlier
document will contain terms that deviate from the majority in the collection. We
propose to use the average IDF weight of a document, combining all its terms, as
a measure to detect outliers. An outlier score OSidf is assigned to every document
di = {w1, w2, ..., wv}, represented by the weights of its v terms, based
on the average IDF weight as follows.
$$ OS^{d_i}_{idf} = \frac{1}{|v|} \sum_{j=1}^{v} w_j \tag{4} $$
It is expected that the OSidf score, which captures rare term frequencies, is higher
in outlier documents than in inliers. This is explained in more detail in
Appendix A. A document is defined as an outlier if the outlier score OSidf is
greater than a control parameter T1. The systematic and automatic setting of this
control parameter is discussed later in the experiment section.
Algorithm 1 presents all the steps of the OIDF method.
Example: Consider the same example in Fig. 2(a) that consists of eleven doc-
uments related to two sports: Cricket and Rugby, and an outlier document. Fig.
2(b) shows the IDF vector coefficients for each document with the average IDF
value of each vector. It reveals that the average IDF value of the outlier - d11 is
much higher than the rest of the collection.
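The OIDF score in Eq. 4 can be sketched as the mean IDF weight over the distinct terms of each document. This is a minimal illustration rather than Algorithm 1 itself: the toy collection and function name are hypothetical, documents are assumed to be non-empty, the natural logarithm is assumed, and the threshold T1 is left out because its setting is described in the experiments.

```python
import math
from collections import Counter


def oidf_scores(docs):
    """Return the average IDF weight of the distinct terms in each document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    scores = []
    for doc in docs:
        terms = set(doc)  # assumes every document has at least one term
        scores.append(sum(idf[t] for t in terms) / len(terms))
    return scores


# Hypothetical toy collection (not the collection shown in Fig. 2).
docs = [
    ["cricket", "bat", "ball"],
    ["cricket", "wicket", "ball"],
    ["rugby", "ball", "tackle"],
    ["rain", "storm", "forecast"],
]
scores = oidf_scores(docs)
most_deviating = scores.index(max(scores))
```

The weather-like document consists only of terms that occur once in the collection, so it receives the highest average IDF weight.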
Figure 2: Example document collection with IDF VSM
OIDF is a simple algorithm that identifies potential outlier candidates; however,
it also generates a large number of false positives. We present two ranking-based
algorithms that we propose to combine with OIDF to form ensemble approaches.
These ensemble approaches are able to drastically reduce the search space for
outliers and reduce false outliers (as evident in experiments).
3.3 Outlier Detection Based On Ranking Function Score: ORFS
The ranking concept can be used in outlier detection to assign an outlier score to
observations based on a ranked list, assuming the observations at the top will
get higher outlier scores. In this paper, we propose to obtain ranking scores
for each document using the IR system and to assign each document an
outlier score based on the relevancy scores generated through the
IR system. As per Definition 3, an IR system uses a ranking function as shown
in Eq. 2 to determine the most relevant documents ranked by the relevancy score
as calculated in Eq. 3. The relevancy score represents the level of relevancy of
a retrieved document to the query document as compared to the whole collec-
tion. We utilize the relevancy score, rp, of top-m (m = 10) relevant documents
for a given document di to show how consistent the given document is within
the collection. The relevancy score of a document dp to a query document
(score(q, dp), or rp) represents how similar that document is to the query
document. In contrast, the reciprocal of the relevancy score indicates how
dissimilar (i.e., deviated) the two documents are. We propose to calculate the
outlier score OSr for a document di ∈ D as the reciprocal of the average relevancy
score given by the search engine for the top-10 relevant documents. It presents the
degree of deviation as follows.
$$ OS^{d_i}_{r} = \frac{m}{\left| \sum_{p=1}^{m} r_p \right|}, \quad d_i \neq d_p, \text{ where } r_p \geq 0 \tag{5} $$
A document is defined as an outlier if the outlier score OSr is greater than the
control parameter T2.
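Eq. 5 can be sketched as the reciprocal of the mean relevancy score. Two points are assumptions made for the sketch rather than stated in the text: the count of scores actually retrieved is used as m when fewer than m documents come back, and a document with no retrieved neighbors (or a zero score sum) is given an infinite outlier score. The function name `orfs_score` is hypothetical.

```python
import math


def orfs_score(relevancy_scores):
    """Reciprocal of the average relevancy score of the retrieved neighbors."""
    total = abs(sum(relevancy_scores))
    if total == 0:
        # Assumption: nothing relevant was retrieved, so deviation is maximal.
        return math.inf
    return len(relevancy_scores) / total


# Hypothetical relevancy scores returned for three query documents.
well_connected = orfs_score([2.0, 4.0])   # mean score 3.0
weakly_connected = orfs_score([0.1, 0.1])  # mean score 0.1
isolated = orfs_score([])
```

Lower relevancy scores among the neighbors translate into a larger outlier score, so the weakly connected document ranks above the well connected one.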
Ensembles ORFS(I) and ORFS(S): Combining OIDF and ORFS.
We propose to combine ORFS with OIDF to create an ensemble method to
achieve robust outlier detection as in Fig. 4. Previous researchers have built
the ensemble models in two ways: independent and sequential ensembles [1]. We
explored both approaches to reduce false positives.
Following the independent ensemble approach, both ORFS and OIDF algorithms
generate the outlier candidates (i.e., DI = D) and the common candidates in
both sets have been identified as final outliers, as in Eq. 6. This reduces the
number of false positives.
$$ D^{o}_{f} = D^{o}_{idf} \cap D^{o}_{r} \tag{6} $$
Following the sequential ensemble approach, OIDF is first used to generate outlier
candidates, then those candidates are tested through ORFS to calculate the outlier
score OSr. A final set of outliers ($D^{o}_{f} = D^{o}_{r}$) is obtained using threshold T2.
Figure 3: Algorithm 2 - ORFS
Figure 4: Ensemble approaches of ORFS and OIDF for outlier detection
This approach allows ORFS to search in a much smaller search space for outliers
(i.e., DI ⊂ D). Algorithm 2 explains all the steps in this approach of relevancy
score-based outlier detection. Combining outliers generated by OIDF, which
directly uses IDF weights, with outliers generated by ORFS, which considers IDF
within the reciprocal retrieval score, reduces false detections, as detailed in the
experimental analysis section.
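The two ensemble strategies can be contrasted in a short sketch. The function names, the score values and the thresholds are hypothetical; the thresholds T1 and T2 are passed in as plain numbers because their automatic setting is described in the experiments.

```python
def independent_ensemble(oidf_scores, orfs_scores, t1, t2):
    """Eq. 6: keep documents flagged by both OIDF and ORFS on the full collection."""
    idf_outliers = {d for d, s in oidf_scores.items() if s > t1}
    orfs_outliers = {d for d, s in orfs_scores.items() if s > t2}
    return idf_outliers & orfs_outliers


def sequential_ensemble(oidf_scores, orfs_score_fn, t1, t2):
    """Score only the OIDF candidates with ORFS, then apply T2."""
    candidates = [d for d, s in oidf_scores.items() if s > t1]
    # orfs_score_fn is called for candidates only, which shrinks the search space.
    return {d for d in candidates if orfs_score_fn(d) > t2}


# Hypothetical scores for three documents.
oidf = {"d1": 0.4, "d2": 0.9, "d3": 1.2}
orfs = {"d1": 0.2, "d2": 0.3, "d3": 5.0}

outliers_independent = independent_ensemble(oidf, orfs, t1=0.8, t2=1.0)
outliers_sequential = sequential_ensemble(oidf, orfs.get, t1=0.8, t2=1.0)
```

Both variants return the same set here; the difference is that the sequential variant never asks for the ORFS score of d1, which OIDF has already ruled out.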
Figure 5: Outlier scores based on ranking scores and k-occurrences
Example: Fig. 5 (a) shows the outlier scores calculated using ranking scores
given by the Elasticsearch search engine for the same toy example. The outlier
document d11 has an outlier score ≫ 1, which highlights it as the most likely
outlier. As shown in Fig. 2 (b), OIDF also assigns the highest outlier score to
d11, which makes it the most suitable outlier candidate under both the independent
and sequential ensemble approaches, after combining with ORFS.
3.4 Outlier Detection Based On Ranked Neighborhood k-Occurrences Count: ORNC
In high dimensional data, hubs have been known to form local sub-dense neighborhoods
instead of uniform distributions in a cluster [154]. We conjecture that
outlier points are less likely to be included in these hub regions and
should have fewer k-occurrences in the nearest neighbor lists. In an indexed document
collection, we obtain all sets of relevant documents using document queries
to form the initial search space. We use neighborhood documents calculated using
Eq. 2 with tf∗idf function in Eq. 3 for each document to obtain the lists of nearest
Figure 6: Algorithm 3 - ORNC
neighbors. The k-occurrences count is measured within all the retrieved relevant
documents (i.e., nearest neighbor sets) and used to define outlier scores based on
the inverse of the count.
Let the documents retrieved in response to document query di on D be Ddi where
Ddi is obtained using Eq. 2. The outlier score OSc for do ∈ D is calculated as:
$$ OS^{d_o}_{c} = 1 \Big/ \left( \sum_{i=1}^{|D|} \left| [d_o \in D_{d_i}] \right| \right) \tag{7} $$
Algorithm 3 describes the overall process of ORNC where each document is as-
signed with an outlier score OSc. If the score is greater than the control parameter
T3, the document is classified as an outlier.
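Eq. 7 can be sketched as the inverse of the reverse neighbor count. The neighbor lists and the function name are hypothetical, and assigning an infinite score to a document that appears in no list is an assumption, since the zero-count case is not covered by the equation. The optional `candidates` argument restricts scoring to a chosen subset of documents.

```python
import math


def ornc_scores(neighbor_lists, candidates=None):
    """Inverse k-occurrence count for all documents, or for candidates only."""
    targets = set(neighbor_lists) if candidates is None else set(candidates)
    counts = dict.fromkeys(targets, 0)
    for neighbors in neighbor_lists.values():
        for doc_id in set(neighbors):  # indicator: one count per list at most
            if doc_id in counts:
                counts[doc_id] += 1
    # Assumption: a document that is never retrieved gets an infinite score.
    return {d: (1.0 / c if c else math.inf) for d, c in counts.items()}


# Hypothetical neighbor lists in which each document also retrieves itself.
neighbor_lists = {
    "d1": ["d1", "d2"],
    "d2": ["d2", "d1"],
    "d3": ["d3", "d1"],
}
all_scores = ornc_scores(neighbor_lists)
candidate_scores = ornc_scores(neighbor_lists, candidates=["d3"])
```

Here d3 appears only in its own list and therefore reaches the score of 1, while the hub-like d1 receives the lowest score.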
Ensemble ORNC(S): Combining OIDF and ORNC.
Identifying a neighborhood is an expensive operation, due to the need for pair-
wise comparisons [204]. Even in a smaller dataset, measuring k-occurrences by
analyzing the nearest-neighbor list is highly expensive. Therefore, we propose to
use only a selective set of outlier candidates to achieve effectiveness through the
Figure 7: The ensemble approach of OIDF and ORNC for outlier detection
sequential ensemble approach. The initial set of outlier candidates is obtained
using OIDF, and the number of k-occurrences is measured, within all the retrieved
relevant documents, for those candidate documents only. This sequential
ensemble approach of OIDF with ORNC is able to identify the final outlier documents
with reduced time and higher accuracy by reducing the search space, as
in Fig. 7.
Example: Fig. 5(b) shows the outlier scores calculated using reverse neighbor
count (k-occurrences) for all documents in the example document collection. The
highest outlier score of 1 is given to the outlier candidate d11 proposed by OIDF.
It shows that the proposed method can identify the actual outlier.
In summary, the core concept used in three ensemble methods is the rare
frequency-based outlier detection (OIDF). By using various IR ranking concepts,
the quality of outlier detection of OIDF is improved by reducing false outliers.
The ranking-based neighborhoods were used to provide outlier scores using pre-
calculated relevancy scores in ORFS and k-occurrences in the response sets in
ORNC.
Table 3: Summary of the datasets used in the experiments
Datasets | # of Docs | # of Unique Terms | # of Total Terms | Avg. # of Terms per doc | # of Outliers
Wikipedia (DS1) | 11521 | 305827 | 9206250 | 799 | 100
20News groups (DS2) | 4909 | 27882 | 374642 | 76 | 50
Reuters (DS3) | 5050 | 13438 | 200482 | 40 | 50
SED2013 (DS4) | 81228 | 46548 | 1583073 | 19 | 840
SED2014 (DS5) | 91670 | 46031 | 1816840 | 20 | 976
4 Empirical Analysis
In this section, we present the experimental evaluation of the proposed primary
method OIDF and its ensemble approaches ORFS(I), ORFS(S) and ORNC(S)
for accuracy, efficiency, and scalability. We performed the experiments on a single
1.2 GHz Intel(R) Xeon(R) processor with 264 GB of shared memory. Algorithms
were implemented in Python 3.5. Elasticsearch 2.4 was used as a search engine
to provide relevant documents. First, we present the description of the real-
world datasets used and the standard evaluation measures used to determine the
accuracy of outlier detection. We show that the commonly used measures do not
evaluate the outliers effectively; hence, we present a new evaluation criterion to
report false predictions. The next few sections present the empirical analyses.
4.1 Datasets
Three categories of collections having documents of short, medium and large
length are used in experiments, as shown by the column of average terms per
document in Table 3. Wikipedia data, which has an average of about 800 terms
per document, is used to validate the outlier detection behavior on a collection
with large documents. The well-known 20News group data and Reuters data,
which have about 80 and 40 terms on average respectively, are used to validate
the outlier detection behavior on collections with medium documents. The
MediaEval Social Event Detection 2013 and 2014 datasets, with about 20
terms on average, are used to analyze outlier detection on collections with short
documents. These figures are averages per document; each collection also
contains documents that are considerably longer or shorter than its average.
These datasets have ground-truth values that were used to measure
the methods’ effectiveness extrinsically. These datasets are designed/selected to
evaluate the performance of the proposed methods against various challenges that
exist for text outlier detection: (1) different text vector sizes, (2) different
collection sizes, (3) different numbers of classes and (4) high vocabulary overlap
between inlier and outlier classes.
Our approach is distinct from, and more complex than, existing methods,
as the document sets contain multiple classes of documents, both inliers and
outliers. Existing methods [96] do not include a diverse set of classes in their
datasets, which makes the outlier detection process simpler and unnatural. They
usually have one class of documents and outliers that do not belong to this class.
We select a set of inlier and outlier classes, and attempt to identify outliers that
are different from all these inlier classes and show less term overlap with
them. DS1 contains inliers from multiple Wikipedia subcategories under ‘War’
and outliers from 10 other categories not included inside ‘War’. DS2 contains
inliers from five classes related to ‘Computers’ and outliers from five other classes
in 20News groups. DS3 contains inliers from two classes and outliers from 25
other classes in the Reuters dataset. This dataset has classes with overlapping
vocabulary, presenting a more complex outlier detection scenario with a
considerably high number of overlapping terms between inlier and outlier
classes. Inliers in the short datasets (DS4 and DS5) are collected from classes that
have at least 100 documents while two outliers per each class are collected from
all other classes. These short datasets consist of more than 800 inlier and 400
outlier classes to explore the fine-grained scenarios as well as large in collection
size. Table 3 shows a summary of these datasets.
4.2 Evaluation Measures
Accuracy is a well-known measure to define the effectiveness of outlier detec-
tion. Accuracy analyses the percentage of correctness in predictions [84]. Let
TP, TN, FP, and FN denote the correct outliers, correct inliers, incorrect outliers,
and incorrect inliers respectively, where P and N denote the total number of outliers
and inliers. The metric accuracy (ACC) is calculated as:
$$ ACC = \frac{TP + TN}{TP + FP + FN + TN} = \frac{\text{total correct predictions}}{\text{total observations}} \tag{8} $$
However, the ACC measure disregards the false outliers and false inliers. An
explanation is provided in Appendix B.
Alternatively, the effectiveness of outlier detection is measured using the area
under the Receiver Operating Characteristics (ROC) curve (AUC) [1, 126, 155].
The ROC curve shows the TP rate (TPR) against FP rate (FPR). Let TPR
and FPR be:
$$ TPR = \frac{TP}{TP + FN} = \frac{TP}{P} \tag{9} $$
$$ FPR = \frac{FP}{FP + TN} = \frac{FP}{N} \tag{10} $$
When T denotes a threshold to control outliers, AUC of ROC curve can be
defined as:
\[
\mathrm{AUC} = \int_{0}^{1} \mathrm{ROC}(T)\, dT \tag{11}
\]
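As an illustration of Eq. (11), the following minimal Python sketch approximates the AUC with the trapezoidal rule. It assumes that every document carries a numeric outlier score in which a higher value means more outlying, and it integrates TPR over FPR while the threshold is swept, which is the usual numerical reading of the area under the ROC curve. The function names `roc_points` and `auc` are illustrative and are not taken from the thesis.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores.

    scores: outlier scores, higher means more outlying (assumption).
    labels: 1 for a true outlier, 0 for a true inlier.
    """
    p = sum(labels)            # total outliers (P)
    n = len(labels) - p        # total inliers (N)
    if p == 0 or n == 0:
        raise ValueError("need at least one outlier and one inlier")
    points = [(0.0, 0.0)]
    # Lowering the threshold can only add predicted outliers,
    # so the points come out ordered by increasing FPR.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n, tp / p))
    return points


def auc(points):
    """Trapezoidal area under a list of (FPR, TPR) points ordered by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: both true outliers receive the highest scores.
perfect = auc(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
# Toy example: both true outliers receive the lowest scores.
inverted = auc(roc_points([0.9, 0.8, 0.3, 0.1], [0, 0, 1, 1]))
```

In this toy setting a perfect ranking gives an area of 1 and a fully inverted ranking gives 0; a detector that ranks at random would be expected to fall near 0.5.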
However, AUC also focuses only on correct predictions, which leads to a misleading
picture of outlier detection because incorrect predictions are not considered. An
explanation is provided in Appendix C. Consequently, there is a need to analyze
false positive and false negative predictions. The complement of ACC (1 − ACC), which
represents the total number of incorrect predictions against total observations,
is not a clear measure of the false outlier and false inlier predictions. Some
researchers have used FPR or FNR to report these. FPR considers false outliers
(FP) against inliers in the dataset and shows linear variation within the range of
0 to 1 for a gradual increase of FP. FNR considers false inliers (FN) against
outliers in the dataset and shows linear variation within the range of 0 to 1 for
a gradual increase of FN. However, FPR and FNR provide relative values
and are not able to differentiate the capability of a method to deal with false
alarms. We propose two measures, Outlier Prediction Error (OPE) and Inlier
Prediction Error (IPE), to emphasize false predictions with respect to true
predictions. OPE reports false outliers (FP) against true inliers (TN) and IPE
reports false inliers (FN) against true outliers (TP).
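OPE is defined as:
\[
\mathrm{OPE} = \frac{FP}{TN + \varepsilon}, \qquad \varepsilon = \begin{cases} 1 & \text{if } TN = 0 \\ 0 & \text{otherwise} \end{cases} \tag{12}
\]
IPE is defined as:
\[
\mathrm{IPE} = \frac{FN}{TP + \varepsilon}, \qquad \varepsilon = \begin{cases} 1 & \text{if } TP = 0 \\ 0 & \text{otherwise} \end{cases} \tag{13}
\]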
The OPE (or IPE) measure varies in the range of 0 to N (or P) and can be
divided into two ranges: 0 to 1, and 1 to N (or P). The value of OPE is within
0 to 1 if a method detects more true inliers than false outliers. However, a value
greater than 1 indicates that an outlier detection method is producing more
false outliers than true inliers. Similarly, an IPE value greater than 1 indicates
a less effective method that produces more false inliers than true outliers.
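The measures in Eqs. (8) to (10), (12) and (13) can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the experiments: the function name `outlier_metrics`, the dictionary keys and the confusion-matrix counts in the examples are hypothetical, and returning 0 for TPR or FPR when P or N is empty is an added guard that the equations do not specify.

```python
def outlier_metrics(tp, tn, fp, fn):
    """Compute ACC, TPR, FPR, OPE and IPE from confusion-matrix counts.

    tp: correct outliers, tn: correct inliers,
    fp: incorrect (false) outliers, fn: incorrect (false) inliers.
    """
    p = tp + fn                      # total true outliers (P)
    n = fp + tn                      # total true inliers (N)
    acc = (tp + tn) / (tp + fp + fn + tn)
    tpr = tp / p if p else 0.0       # guard for empty P is an assumption
    fpr = fp / n if n else 0.0       # guard for empty N is an assumption
    # epsilon is 1 only when the denominator would otherwise be zero
    ope = fp / (tn + (1 if tn == 0 else 0))
    ipe = fn / (tp + (1 if tp == 0 else 0))
    return {"ACC": acc, "TPR": tpr, "FPR": fpr, "OPE": ope, "IPE": ipe}


# Hypothetical detector that labels every document as an inlier:
# ACC stays high (1000 / 1050), OPE is 0, but IPE reaches its maximum P = 50.
all_inliers = outlier_metrics(tp=0, tn=1000, fp=0, fn=50)

# Hypothetical detector that labels every document as an outlier:
# IPE is 0, but OPE reaches its maximum N = 1000.
all_outliers = outlier_metrics(tp=50, tn=0, fp=1000, fn=0)
```

The two degenerate detectors show why OPE and IPE are read together: each one alone can be driven to 0 by a method that predicts a single class.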
4.3 Baseline Algorithms
As primary baselines, we have chosen unsupervised outlier detection methods
from the major categories in the existing literature to compare against the pro-
posed unsupervised outlier detection methods. The benchmarking algorithms
listed below were used as unsupervised baselines.
• Outlier detection using k-nearest neighbors (KNNO) [157]: This is a
distance-based method where the distance is calculated between each object
and its k-NNs. Objects are then ranked based on the distance to their k-NNs,
and the top n objects are declared as outliers for a user-defined n. In this
baseline method, we set the number of outlier documents to 1% of each
collection and k to 10.
• Outlier detection using local density estimation (LOFO) [29]: This is a
density-based method where a degree known as the local outlier factor (LOF)
is assigned to each object according to how isolated the object is with respect
to its k-NNs (k is set to 10). LOF is defined as the ratio of the average
density of the neighbors to the density of the object. Objects with high LOF
values are defined as outliers. The threshold that governs the boundary
between inliers and outliers is set to 1, in line with past research [8].
• Outlier detection using Non-Negative Matrix Factorization (NMFO) [96]:
This is a recently developed matrix factorization-based approach specifically
designed for text outlier detection. The l2 norm assigned in the learning
process of document-term matrix factorization is used as the outlier score
for each document. The documents that get high-rank outlier scores are
defined as outliers. This method depends on several control parameters,
namely k, α and β, which were tuned to the best possible values over several
parameter-tuning attempts. The best parameter values (k, α, β) for DS2 and DS3
were (20, 179, 0) and (5, 23, 0) respectively, while those for DS1, DS4 and DS5
were (20, 11, 0), following the description in [96] and yielding the best
results in multiple experimental settings.
• Mutual nearest-neighbor graph-based clustering method for detecting out-
liers (MNCO) [55]: This method is designed to cluster high dimensional
sparse data by creating a considerably dense mutual neighbor graph. The
points that do not belong to a cluster in the graph are considered noise or
outliers. The two control parameters that define core dense regions in this
baseline are set to 3, as this satisfies the minimum requirement for density
[140].
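To make the distance-based baseline concrete, the ranking step described for KNNO can be sketched as below. This is an illustrative reading, not the implementation of [157] or the one used in the experiments: it assumes Euclidean distance on small dense vectors and scores each object by the distance to its k-th nearest neighbor, and the names `knn_outlier_ranking`, `points` and `n_outliers` are hypothetical.

```python
import math


def knn_outlier_ranking(points, k, n_outliers):
    """Rank objects by the distance to their k-th nearest neighbor.

    points: list of equal-length numeric tuples (toy dense vectors).
    Returns (indices of the top n_outliers objects, list of all scores).
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, a in enumerate(points):
        # Pairwise distances to every other object: O(n^2) comparisons overall
        dists = sorted(math.dist(a, b) for j, b in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_outliers], scores


# Four points on a unit square and one point far away from them.
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
top, knn_scores = knn_outlier_ranking(toy, k=2, n_outliers=1)
```

The sketch also shows the dependency noted in the text: the returned set is fully determined by `n_outliers`, which has to be supplied from outside.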
In addition, we compare the term frequency-based IR ranking approach used for
document similarity identification in the proposed methods against semantic
embedding-based document similarity identification using the doc2vec representation
[113]. The set of similar documents and the similarity scores given by doc2vec
for a document are used in the ORFS and ORNC algorithms for comparison with the
IR-based ORFS and ORNC. Recently, neural network-based approaches have become
popular in text mining with full or weak supervision [107, 127, 135]. Although our
proposed methods are fully unsupervised, we have conducted experiments with a
convolutional neural network for text classification [107] and generative adversarial
active learning for outlier detection [127]. We follow the standard practice of
using dense word representations with reduced dimensionality, obtained using
Global Vectors for Word Representation (GloVe) [148], as the input to the neural
network in the experiments [107]. These methods, which are based on training a
neural network, are not effective and are extremely time consuming to use with the
full dimensional space.
4.4 Accuracy Comparison
Accuracy of the proposed methods for large, medium and short text document
collections is analyzed with the standard measures of ACC, ROC curve, and
AUC as well as the proposed measures of OPE and IPE.
Accuracy
In general, the proposed methods show improvement over the majority of base-
lines, especially when the dimensionality of the vector is high and the dataset
is large. As detailed in Table 4, it is evident from the high ACC values of
OIDF and its ensemble approaches that they outperformed all baselines except
KNNO. However, KNNO is not scalable to the high-dimensional
Wikipedia document collection (DS1). Moreover, KNNO requires the number of
outlier documents as a control parameter, which is a major limitation because
its outcome depends on how well this parameter is chosen.
Table 4: Accuracy measure for different methods against datasets
Dataset   Our Methods                            Baseline Methods
          OIDF    ORFS(I)   ORFS(S)   ORNC(S)    KNNO    LOFO    NMFO    MNCO
DS1       0.85    0.92      0.93      0.93       ∗       ∗       ∗       -
DS2       0.87    0.93      0.94      0.94       0.99    0.01    0.98    0.06
DS3       0.82    0.93      0.9       0.91       0.98    0.02    0.69    0.17
DS4       0.82    0.9       0.9       0.93       0.99    ∗       0.01    -
DS5       0.82    0.9       0.91      0.93       0.99    ∗       0.01    -
Avg.      0.84    0.92      0.92      0.93       0.99    0.02    0.42    0.12
Note: (S), (I), ”∗” and ”-” denote the sequential ensemble approach,
independent ensemble approach, aborted operations, and memory or runtime
error respectively.
Within the proposed approaches, the basic OIDF algorithm yields the lowest
accuracy compared to the ensemble methods, which are able to reduce the false
positives generated by OIDF. The ensemble methods based on the IR ranking score,
ORFS(I) and ORFS(S), perform similarly. As per ACC, ORNC(S) is the best
approach among the proposed methods. ORNC(S), based on the Hub concept, is
able to achieve a higher level of performance even in extremely sparse short text
data such as social media text (DS4, DS5). In the Reuters dataset (DS3), where
classes in the collection are highly overlapping, it is hard to separate outliers
using terms in the VSM representation.
This dataset yields relatively lower accuracy for most of the methods.
ROC and AUC
ACC considers the total correctness of predictions and does not provide a de-
tailed analysis of true outliers and true inliers individually as compared to AUC
(see Appendix B for more detail). Hence, we have explored the ROC curves con-
sidering each fixed control parameter we proposed in our algorithms (sensitivity
analysis provides more details on these parameters) and the optimum threshold is
used for each baseline accordingly. As shown by graphs in Fig. 8, Fig. 9 and Fig.
10, OIDF provides the highest AUC values, except in short text datasets (DS4,
DS5), due to its capacity to distinguish documents according to rare frequen-
cies. This capacity helps OIDF to yield a higher TP rate and results in OIDF
achieving higher AUC due to its separate analysis of TP (true outliers) and
TN (true inliers) in contrast to ACC, which represents total correct predictions.
Both ranking score based ensemble methods ORFS(I/S) perform similarly. The
k-occurrences-based ensemble approach ORNC(S) outperformed all the others
for short text data due to its hub-based concept applicable to higher dimensions.
More specifically, Fig. 8 shows the ROC curves of the Wikipedia document
collection (DS1) derived by OIDF and its ensemble approaches. No baseline
method could be executed on this dataset due to the large text size. This confirms
the scalable nature of OIDF and its improved variations. Moreover, as seen in
the results, the basic rare term-weight-based method OIDF has outperformed the
ensemble methods. This demonstrates the power of a simple term weighting model in
large documents to differentiate terms meaningfully where the occurrence of terms is
considerably high within a document as well as in the respective collection. The IR
ranking concept diluted the effectiveness of this simple method, as depicted in
Fig. 8, by reducing the true outliers (TP) when the TP and TN (true inliers)
are separately analyzed with AUC.
On the medium-sized collections (the 20News Group dataset (DS2) and the Reuters
dataset (DS3)), the proposed methods give higher AUC compared to the baselines,
as shown in Fig. 9. As with the large text size collection, the basic OIDF, which
simply considers average rare frequency values of terms, yields the highest AUC by
identifying a higher number of TPs. The KNNO method, which requires the
number of outliers as a control parameter, is the best amongst the baselines for
DS2, while NMFO performs like a random assignment. LOFO, which measures
the density around a point with respect to the density of its neighbors, shows the
lowest performance on text data, which is naturally sparse due to fewer term
co-occurrences among documents. It could not differentiate the density around
points in this sparse setting. LOFO identifies the majority of the inliers as
outliers, which results in lower ACC. MNCO, which uses a mutual nearest neighbor
graph, outperforms the other baselines in DS3, which contains overlapping class
labels. In overlapping datasets, the ranking-based ORFS(I/S) and ORNC(S) methods
yield reduced performance in comparison to normal medium-sized datasets.
The ROC curves for document collections with short term vectors are presented
in Fig. 10. These documents share very few discriminative terms among similar
documents compared to the other two dataset categories. Consequently, in this
extremely sparse dataset, the ranking score-based ORNC(S) ensemble approach
outperforms the basic OIDF method, due to the inclusion of a local sub-dense
neighborhood (Hub) concept, which is known to work for higher dimensions.
Similarly, the distance-based KNNO, which uses pairwise distance difference
comparison, does not perform well on this dataset, as on other datasets, due to the
distance concentration problem.
Furthermore, we compare the AUC results of the ensembles ORFS(I/S) and ORNC(S),
obtained using IR ranking-based text similarity with term occurrences and
their respective frequencies, against semantic embedding-based text similarity.
Distributed Representations of Documents (doc2vec) [113] is an unsupervised
learning algorithm for obtaining dense vector representations of documents
considering syntactic and semantic word relationships within a corpus. doc2vec
is used to obtain the set of similar documents in the corpus, with similarity scores,
for a document, similar to an IR ranking function, and we modified ORFS
and ORNC to identify outliers from these sets as baselines, as in Table 5. The
results in Table 5 show that semantic embedding-based ORFS performs the same as
the IR ranking score-based ORFS on average.
However, for document collections with short term vectors, semantic embedding is
able to provide more accurate results. In short text, where vectors are extremely
sparse, semantic embedding can identify text similarity more effectively
than an IR ranking function. However, the effectiveness of doc2vec-based
ORNC in identifying hub points, or the local sub-dense points, in high-dimensional
text data is inferior to that of the IR ranking function-based ORNC(S). Theoretically,
the cluster hypothesis and its reverse [59, 91] also indicate that an IR function is
able to give a set of similar documents in response to a query document that resides
in the same cluster. They show that IR ranking can be used to identify the
documents in the same cluster, and we used it in outlier detection to identify the
hubs in clusters within ORNC(S).
Figure 8: ROC curve and AUC for document collection with larger size term vectors (”∗” denotes aborted operations, memory or runtime error)
The results in Table 6 show the performance of neural network-based methods
on the datasets. The convolutional neural network used for supervised text
classification [107] provides almost the same accuracy as ORNC(S) on average when
compared with the results in Table 4. However, the results show that, except in DS2,
which is small in collection size and has many classes, supervision based on training
is able to provide superior results. This confirms the superiority of the supervised
methods compared to unsupervised methods in the presence of enough data to provide
training. Generative adversarial active learning for outlier detection [127] is a novel
Generative Adversarial Network (GAN)-based semi-supervised approach used for
outlier detection. A GAN model includes two networks, where a generative
network is used to generate candidates and a discriminative network is used to
evaluate their validity [121, 128]. Although the GAN-based method in [127] works in an
unsupervised setting without relying on ground-truth labels of the data [100, 184],
it follows a semi-supervised approach with active learning to generate initial
outliers with reference to real data for the discriminator network. The results in
Table 6 show that it performs close to a random method on sparse text data with
weak supervision and is inferior to our proposed fully unsupervised methods.
Outlier Prediction Error and Inlier Prediction Error
To focus on false outliers and inliers, we next present the results with OPE and
IPE. Results in Table 7 support the conjecture that OIDF should be used as
a basic method and the ranking-based algorithms ORFS or ORNC should be
used to make an ensemble method with OIDF. ORFS as a standalone method
generates a high level of false outliers (i.e. its OPE value is closer to 1 than to 0)
while giving more false inliers than OIDF on average. ORNC cannot be used as a basic
method due to its high time complexity, as the number of
comparisons increases with the size of the dataset.
Figure 9: ROC curve and AUC for document collections with medium size term vectors
Figure 10: ROC curve and AUC for document collections with short size term vectors (”∗” denotes aborted operations, memory or runtime error)
As shown in Table 8, the ensemble methods combining OIDF and the ranking-based
algorithms show a significant reduction in false outliers, making
them suitable for real-world scenarios. The sequential ensemble approach
ORNC(S) gives the fewest false outliers due to its filtered outlier
candidates and is the best among our methods in terms of OPE. Furthermore,
ORFS(I/S) also shows a reduction in false outliers. This confirms the
Table 5: AUC comparisons against semantic word embedding-based ranking
Dataset   Our Methods                        Baseline Methods with doc2vec similarity scores
          ORFS(I)   ORFS(S)   ORNC(S)        ORFS with doc2vec   ORNC with doc2vec
DS1       0.70      0.69      0.71           0.72                0.56
DS2       0.85      0.83      0.79           0.65                0.60
DS3       0.7       0.69      0.7            0.67                0.60
DS4       0.65      0.65      0.77           0.75                0.72
DS5       0.65      0.65      0.78           0.75                0.72
Avg.      0.71      0.70      0.75           0.71                0.64
Note: (S) and (I) denote the sequential ensemble approach and the independent
ensemble approach respectively.
Table 6: Performance given by Neural network-based methods
Dataset   Supervised CNN     GAN-based Active Learning
          Accuracy (ACC)     Area Under the Curve (AUC)
DS1       0.99               0.52
DS2       0.74               0.56
DS3       0.98               0.52
DS4       0.97               0.50
DS5       0.97               0.50
Avg.      0.93               0.52
importance of ensemble approaches, which reduce false detections compared to
OIDF, which directly uses IDF weights for outlier detection, or ORFS and ORNC,
which use IDF weights within the ranking function. The benchmark method KNNO
shows good performance as it uses the specified number of outliers as an external
input and obtains a controlled set of outlier documents. All other baseline
methods generate a high number of false outliers, and some of them, such as the
mutual neighbor graph-based (MNCO) and LOFO algorithms, fail to scale to
large and high-dimensional datasets. In this set-up, NMFO produces the worst
performance. This may be due to the need for rigorous parameter tuning. The
fine-grained nature of the large SED datasets (i.e., DS4 and DS5) impaired this
process and we were unable to find realistic parameters even after a great effort.
Table 7: AUC, OPE and IPE for our pure ranking based outlier detection approaches
Dataset   OIDF                   ORFS                   ORNC
          AUC    OPE    IPE      AUC    OPE    IPE      AUC    OPE    IPE
DS1       0.77   0.17   0.47     0.58   0.99   0.49     ∗      ∗      ∗
DS2       0.88   0.15   0.14     0.68   0.56   0.28     ∗      ∗      ∗
DS3       0.72   0.22   0.61     0.66   0.99   0.2      ∗      ∗      ∗
DS4       0.76   0.22   0.44     0.57   0.98   0.54     ∗      ∗      ∗
DS5       0.77   0.22   0.38     0.56   0.99   0.6      ∗      ∗      ∗
Avg.      0.78   0.2    0.41     0.61   0.90   0.42     ∗      ∗      ∗
Note: ”∗” denotes aborted operations; none of the above methods is
recommended for use as a standalone method.
Table 8: OPE for different methods against datasets (a lower value near 0 is better)
Dataset   Our Methods                            Baseline Methods
          OIDF    ORFS(I)   ORFS(S)   ORNC(S)    KNNO    LOFO      NMFO        MNCO
DS1       0.17    0.08      0.08      0.07       ∗       ∗         ∗           -
DS2       0.15    0.07      0.07      0.06       0.01    372.77    0.01        20.13
DS3       0.22    0.06      0.1       0.1        0.01    23.51     0.44        5.13
DS4       0.22    0.1       0.1       0.07       0       ∗         80388       -
DS5       0.22    0.11      0.1       0.07       0       ∗         90694       -
Avg.      0.2     0.08      0.09      0.07       0.01    198.14    42770.61    12.63
Note: (S), (I), ”∗” and ”-” denote the sequential ensemble approach,
independent ensemble approach, aborted operations, and memory or runtime
error respectively.
Table 9 shows the inlier prediction error using IPE. OIDF shows the fewest
false inliers among the proposed methods. KNNO and NMFO become ineffective,
showing very high IPE. The baseline LOFO and MNCO methods outperformed
the proposed methods on the only two datasets they could process (DS2 and DS3)
by producing fewer false inliers; however, they do not scale well to larger and
high-dimensional document collections. A closer investigation of these two methods
with OPE reveals that although they do not produce false inliers, they predict a
large portion of inliers as outliers (i.e. FP is extremely high). In contrast, our
proposed methods
produce a balanced performance with reduced false outliers and false inliers.
Table 9: IPE for different methods against datasets (a lower value near 0 is better)
Dataset   Our Methods                            Baseline Methods
          OIDF    ORFS(I)   ORFS(S)   ORNC(S)    KNNO     LOFO    NMFO     MNCO
DS1       0.47    1.13      1.17      1.08       ∗        ∗       ∗        -
DS2       0.14    0.32      0.39      0.56       1.5      0       49       0
DS3       0.61    1.17      1.08      1.08       49       0       1.63     0
DS4       0.44    1.63      1.63      0.64       1        ∗       0        -
DS5       0.38    1.48      1.58      0.6        0.94     ∗       0        -
Avg.      0.41    1.15      1.17      0.79       13.11    0       12.66    0
Note: (S), (I), ”∗” and ”-” denote the sequential ensemble approach,
independent ensemble approach, aborted operations, and memory or runtime
error respectively.
AUC and OPE in combination give the complete picture of the effectiveness of
the proposed methods in outlier prediction. Further, IPE gives an indication of
false inlier prediction, which has been neglected in most of the outlier detection
work. The OPE and IPE measures in Table 8 and Table 9 depict the higher
quality of outlier prediction, with reduced false positives and false negatives, in
our approaches compared to baseline methods.
4.5 Scalability and Computational Performance Analysis
Time taken by each method is shown in Fig. 11 (a). Results in this figure confirm
that the proposed methods consume less time than the benchmarking methods,
in addition to the improved accuracy shown in the previous
sections. OIDF outperforms all the methods due to its simple rare document
frequency-based calculation used for outlier filtering. Among the ensemble
approaches, ORNC(S) shows the highest time consumption due to the requirement
of a larger number of comparisons, though it is able to execute on all datasets
due to the much smaller search space generated by the potential OIDF outliers.
The independent ensemble approach in ORFS (i.e., ORFS(I)) shows a high time
requirement due to additional iterations over the complete dataset.
Figure 11: Time and memory consumption for the proposed and benchmarking methods
The proposed methods outperformed all methods except on DS5, where the matrix
factorization-based NMFO consumes slightly less time than ORNC(S).
However, as shown in previous sections, NMFO produces inferior outcomes to
ORNC(S). When data dimensionality is high, as in the Wikipedia dataset (DS1),
the benchmarking methods were aborted due to exceptionally high time consumption.
Additionally, LOFO and MNCO could not handle the document collections
with a large number of instances such as SED 2013 (DS4) and SED 2014 (DS5).
Fig. 11 (b) shows the memory consumption of each method. It shows that the
rare frequency-based OIDF and the ranking score-based ORFS(I/S) ensemble
approaches consume less memory than the ranked neighborhood-based
ORNC(S), which considers k-occurrences similar to sub-dense local
neighborhoods (Hubs) in high dimensionality. Fig. 11 (b) clearly highlights that the
proposed methods consume less memory than the baseline methods.
KNNO, which achieves high ACC, shows higher memory and time
consumption. Due to heavy time and memory requirements, all baseline methods
are impaired when dealing with large term vectors such as Wikipedia, which leads
to resource starvation.
Figure 12: Scalability of methods using incremental samples
We further explore the scalability of the proposed methods considering incremental
samples of SED 2013, which consists of short term vectors. Fig. 12 shows
that the log of the computation time of all proposed methods is near-linear in the
data size. The simplest rare frequency-based outlier detection method, OIDF, takes
the least time, while the k-occurrences-based ensemble approach (i.e., ORNC(S))
consumes the most time among our methods. Among the baseline methods,
distance-based KNNO is the only method that scaled up to the largest sample
we used within the experiment. However, the success of KNNO depends
on the number of outliers given as an external input and it is not successful in
terms of AUC, which analyses inliers and outliers separately. Further, IPE, which
represents inlier prediction error, is high for KNNO.
Table 10 shows the computational complexities of the proposed algorithms against
the baseline methods. Amongst the proposed algorithms, ORNC, which defines the
outlier score based on the number of times a particular document appears in IR
search results, has the highest computational complexity. The sequential ensemble
approach of this method, ORNC(S), cuts down this complexity by reducing
the search space n. The baseline algorithms show comparatively higher computational
complexity. This explains why LOFO and MNCO did not work for larger
datasets while OIDF and ORFS worked efficiently.
Table 10: Summary of the proposed methods
                   Our Methods                       Baselines
                   OIDF     ORFS      ORNC           KNNO         LOFO         NMFO        MNCO
Big-O complexity   O(nd)    O(ndk)    O(n^2 dk)      O(n^2 dk)    O(n^3 dk)    O(n^2 d)    O(n^3 dk)
Note: n - size of the document collection, d - dimensionality and
k - considered number of nearest neighbors.
4.6 Sensitivity Analysis
OIDF and its ensemble approaches use a threshold parameter similar to prior
outlier detection algorithms [68, 157]. We set these control parameters
automatically using internal characteristics of the dataset; therefore, the proposed
methods can be called parameter free. All parameters that govern outlier scores
have been explored considering intrinsic data characteristics such as mean,
median and standard deviation. The control parameter T1 of OIDF is set as the
combination of median and standard deviation. The median, which removes the
effect of noise, was boosted by adding the standard deviation to detect the outliers
that form a smaller portion of the document collections. It yields more true
outliers, as shown in Fig. 13 (a), for all the datasets except Reuters (DS3), which
contains overlapping class labels for documents.
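The data-driven threshold described above can be sketched as follows. This is a minimal illustration under stated assumptions: the scores are made-up numbers, a higher score is assumed to mean a more deviating document, the population standard deviation is used because the text does not say which variant was applied, and the names `median_plus_std_threshold` and `flag_above` are hypothetical.

```python
import statistics


def median_plus_std_threshold(scores):
    """Threshold in the style of T1: the median boosted by the standard deviation."""
    # pstdev (population form) is an assumption; the sample form would also fit the text
    return statistics.median(scores) + statistics.pstdev(scores)


def flag_above(scores, threshold):
    """Return the indices of documents whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Made-up outlier scores for five documents; the last one deviates strongly.
toy_scores = [1.0, 1.1, 0.9, 1.0, 5.0]

t1 = median_plus_std_threshold(toy_scores)
flagged_t1 = flag_above(toy_scores, t1)

# Using the median alone as the threshold flags more documents.
flagged_median = flag_above(toy_scores, statistics.median(toy_scores))
```

On these toy scores the median alone also flags the mildly elevated document at index 1, whereas the boosted threshold keeps only the strongly deviating one. No value has to be supplied by the user in either case.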
The control parameter T2 in ORFS (I/S) and T3 in ORNC(S) are set as the
median value. Fig. 13 (b), Fig. 13 (c) and Fig. 13 (d) show how the quality
of prediction varies amongst descriptive statistical measures. The median, which
removes the unusual bursts in the outlier scores, gives the highest AUC except
for DS3 in ORFS (S), which contains overlapping class labels.
Figure 13: The sensitivity of the control parameters T1, T2 and T3
Figure 14: The sensitivity of the ranking functions in the IR system
[Figure 14 panels: ORFS(I), ORFS(S) and ORNC(S); y-axis: AUC; x-axis: DS1 to DS5; series: BM25 and TF*IDF]
An IR system employs different ranking functions such as LM Jelinek-Mercer
Smoothing (LM-JM), LM Dirichlet Smoothing (LM-Dirichlet) and Okapi BM25
in addition to tf*idf [23]. However, LM-JM assigns negative scores to terms that
have fewer occurrences and LM-Dirichlet captures important patterns in the text
leaving the noise [54]. Therefore, they are less effective in highlighting the
outliers that have rare terms. In contrast, the BM25 and tf*idf ranking functions
show the capability to capture the deviated documents with rare terms using the
IDF of terms. Figure 14 shows the results provided by each proposed method with
these two ranking functions. It shows that, in general, tf*idf can identify outliers
more accurately than BM25 for the proposed methods, and we used it as the
default.
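To show how IDF enters both ranking functions, the sketch below scores a document against query terms with a textbook tf*idf sum and with the common Okapi BM25 form. These are generic formulations, not the exact scoring of the IR system used in the experiments: the IDF variants, the defaults k1 = 1.2 and b = 0.75, and all function and parameter names are assumptions made for illustration.

```python
import math


def tfidf_score(query_terms, doc_tf, doc_freq, n_docs):
    """Textbook tf*idf: sum of term frequency times log(N / df)."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf and df:
            score += tf * math.log(n_docs / df)
    return score


def bm25_score(query_terms, doc_tf, doc_freq, n_docs,
               doc_len, avg_len, k1=1.2, b=0.75):
    """Common Okapi BM25 form with saturating tf and length normalization."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf and df:
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf + k1 * (1 - b + b * doc_len / avg_len)
            score += idf * tf * (k1 + 1) / norm
    return score


# Toy collection statistics: one rare term and one common term.
N_DOCS = 100
DOC_FREQ = {"rare": 2, "common": 90}
DOC_TF = {"rare": 1, "common": 1}

rare_tfidf = tfidf_score(["rare"], DOC_TF, DOC_FREQ, N_DOCS)
common_tfidf = tfidf_score(["common"], DOC_TF, DOC_FREQ, N_DOCS)
rare_bm25 = bm25_score(["rare"], DOC_TF, DOC_FREQ, N_DOCS, doc_len=50, avg_len=50)
common_bm25 = bm25_score(["common"], DOC_TF, DOC_FREQ, N_DOCS, doc_len=50, avg_len=50)
```

Both functions weight the rare term more heavily than the common one. They differ in that BM25 saturates the contribution of repeated terms and normalizes by document length, while this tf*idf form grows linearly with term frequency.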
4.7 Discussion
This paper proposes a basic method based on the concept of weighted term vectors
and its ensemble approaches with the concept of the ranking of relevant
documents obtained through an IR system, in order to achieve accurate and scalable
outlier detection in document collections. An extensive empirical analysis
provides insight into the proposed algorithms. We summarize the interesting
observations as follows:
• The basic algorithm OIDF, based on the simple concept of using rare terms
in a document, which can be emphasized through IDF schema, shows high
competence to detect deviations even for large document collections con-
suming less time. However, as shown by ACC and OPE results, OIDF is
adversely affected by the higher number of false positives produced.
• The use of search engine ranking provides the advantage of obtaining rel-
evant documents as similar documents from a large document collection
for a document posed as a query. Reported results confirm the success
of this approach in ensemble ORFS(I/S) and ORNC(S) methods. The
ORFS(I/S) algorithms estimate outliers based on how a document deviates
from the relevancy scores of relevant neighborhoods while the ORNC(S) al-
gorithm estimates the degree of an outlier from the reverse neighbor count
(k-occurrences) within the relevant neighborhoods.
• The higher accuracy achieved with ORNC(S) compared to OIDF and
ORFS(I/S) can be attributed to identifying and using the sub-dense local
neighborhoods present in higher dimensions. The count of k-occurrences
allows identifying “hubs” and “anti-hubs”; the latter lie away from local sub-dense
neighborhoods and have a lower k-occurrences count. These anti-hub
points become probable outlier candidates. ORNC(S) even produces the
best outcome for the short text size data where complexity is increased due
to less term co-occurrence. However, it consumes substantial time due to
the requirement of a large number of comparisons within each document
neighborhood and will be less time efficient for datasets with a larger num-
ber of documents.
• The strategy of combining different outlier detection approaches affects the
effectiveness of outlier prediction. According to ACC and OPE measures,
ORFS(I) and ORFS(S) outperformed the basic singleton OIDF method
by reducing false positives. The improved time efficiency of ORFS(S), however,
favors the sequential ensemble method over the independent
ensemble method, as both produce nearly the same level of accuracy.
• While comparing with baselines, KNNO shows a higher ACC compared
to the proposed approaches. The input parameter specifying the number
of outliers is the reason behind this behavior. However, AUC that inde-
pendently analyses the inlier and outlier prediction in detail confirms that
the effectiveness of KNNO is not up to the level of our methods in all the
datasets due to reporting high false inliers. High memory consumption and
false inlier prediction of the KNNO make this method a weak text outlier
detection method.
• Furthermore, reported results show that the state-of-the-art methods
are not scalable to document collections with large term vectors. A mutual
NN graph building process using k-NN calculation is not scalable to
larger datasets due to the high number of pairwise comparisons required, as
evident with MNCO. The NMFO method is a recent method, proposed to
handle the problem of high-dimensional term vectors through the resulting
error in dimensionality reduction. However, our experiments with large
term vectors (DS1) reveal that it cannot handle a large text size collection.
Additionally, experiments on datasets that consist of many groups (i.e. DS4
and DS5) show that the sum of squared errors in non-negative matrix factorization
is impaired in handling fine-grained problems, as evident from the
OPE measure.
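The k-occurrences idea discussed in the observations above can be sketched as follows. This is an illustrative reading rather than the ORNC implementation: the neighborhoods are supplied as a hand-made dictionary instead of being retrieved from an IR system, the median is used as the cut-off in the spirit of the T3 setting reported in the sensitivity analysis, and the names `k_occurrences` and `anti_hubs` are hypothetical.

```python
import statistics
from collections import Counter


def k_occurrences(neighbourhoods):
    """Count how often each document appears in other documents' top-k results.

    neighbourhoods: dict mapping a document id to the list of ids retrieved
    for it (its relevant neighborhood).
    """
    counts = Counter({doc: 0 for doc in neighbourhoods})
    for doc, neighbours in neighbourhoods.items():
        for nb in neighbours:
            if nb != doc:            # a document does not vote for itself
                counts[nb] += 1
    return counts


def anti_hubs(counts):
    """Documents whose reverse neighbor count falls below the median count."""
    cutoff = statistics.median(counts.values())
    return sorted(doc for doc, c in counts.items() if c < cutoff)


# Hand-made top-2 neighborhoods: a, b and c retrieve each other,
# while d and e are never retrieved by any other document.
toy_neighbourhoods = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a", "c"],
    "e": ["b", "c"],
}
toy_counts = k_occurrences(toy_neighbourhoods)
candidates = anti_hubs(toy_counts)
```

Documents that are frequently retrieved act as hubs, while `d` and `e`, which no other document retrieves, come out as the anti-hub outlier candidates.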
Finally, we summarize the proposed methods according to their suitability on
different types of data in Table 11.
Table 11: Applicability of the proposed methods
Category               Method     Nature of the documents in outlier detection applications                   Functionality
Rare Frequency based   OIDF       Large text size collections such as Wikipedia                               Accuracy, Efficiency
Ranking based          ORFS(S)    Medium text size collections such as newsgroup data                         Accuracy, Efficiency
Ranking based          ORFS(I)    Medium text size collections such as newsgroup data                         Accuracy
k-occurrence based     ORNC(S)    Short text data such as social media which deal with extreme sparseness     Accuracy
5 Conclusion
This paper deals with the important topic of high-dimensional text outlier detec-
tion. In this data domain, the traditional distance or density-based outlier detec-
tion methods are challenged due to the distance concentration problem. Most of
the state-of-the-art methods are impaired when the number of groups within a
document collection is high, as it becomes difficult to generalize common patterns
to identify deviation for outliers.
This paper proposes a simple method of outlier detection based on the use of the
IDF weighting scheme, OIDF. It effectively uses the notion of rare terms to iden-
tify the documents that deviate from the majority of documents in the collection.
This method, however, generates a high number of false positives and requires
additional processing to improve accuracy. To handle efficacy and efficiency, we
propose a number of ensemble approaches with OIDF using the ranking concept
in IR systems, which has already been proven to handle high dimensional larger
document collections with reduced computational complexity. An IR system is
used to retrieve the relevant documents for each document in the collection and
the top-n relevant documents are considered to be the neighborhood of the doc-
ument. ORFS uses the relevancy scores and ORNC uses the relevant document
count to identify outliers.
We explore the most effective ensemble approach (i.e., independent or sequential)
in combining ORFS and ORNC with OIDF. The sequential approach utilizes the
outlier candidates identified by OIDF to reduce the search space for improving the
quality of outlier detection. The ability of rare document frequency in identifying
outliers in OIDF is enhanced by the IR concepts in ORFS and ORNC, and
reduces false positives compared to OIDF only. In the independent ensemble
approach, both ORFS and OIDF algorithms generate the outlier candidates and
the common candidates in both sets have been identified to be the final outliers.
The ORNC is not used in the independent approach to generate outliers due to
high time complexity.
The empirical analysis is conducted on diverse datasets including large, medium
and short term-vector sizes with different numbers of classes and different levels
of vocabulary overlap. The proposed methods are benchmarked against several
state-of-the-art distance-based, density-based, NMF-based and graph-based out-
lier detection methods. Empirical analysis shows that the proposed methods are
capable of detecting outliers in high dimensional document collection with con-
siderably high performance, including accuracy and efficiency. These approaches
are designed in a threshold independent way by setting the control parameter
autonomously based on the internal characteristics of the text collection.
This paper presents substantial work in the area of text outlier detection. However,
identifying outliers in dynamic text streams with limited memory and time
is important for novelty detection. Therefore, a future direction is to apply the
proposed algorithms to dynamic temporal text data for outlier and insight detection.
Parallelizing these algorithms to improve run time and memory usage is also
left for future investigation.
Appendix A: Rationale for the outliers’ score
Let $OS^{d_o}_{idf}$ and $OS^{d_i}_{idf}$ be the average of IDF values of all terms in an outlier
document do ∈ D and inlier document di ∈ D respectively in a document collection
D. We believe that $OS^{d_o}_{idf} > OS^{d_i}_{idf}$ is valid for an outlier and inlier pair due to
the following reasons.
• For a generic document dk ∈ D , IDF weight for a term can be calculated
as in Eq. 1 where rare terms get high IDF values due to their low document
frequency (df) compared to common terms.
• An outlier document do with the average IDF weight of respective terms
$OS^{d_o}_{idf}$, calculated using Eq. 4, will get a higher value compared to the inlier
document di, as do consists of a set of rare terms within D. It indicates a
deviation from the majority.
• In contrast, an inlier document di will possess common terms that represent
intrinsic themes of D and thereby will hold a lower average IDF, $OS^{d_i}_{idf}$, for
respective terms.
• An $OS^{d_o}_{idf}$, which is dominated by (rare) deviated terms, should be higher
than an $OS^{d_i}_{idf}$, which is led by common terms within D.
Appendix B: Weakness in ACC Measurement
ACC measures the effectiveness of predictions in terms of correct predictions and
does not consider false predictions. Consequently, it disregards the false outliers
and false inliers.
• $ACC = \frac{TP + TN}{TP + FP + TN + FN}$
• $ACC = \frac{TP + TN}{P + N}$
ACC considers truly predicted instances against the total observations as a ratio
and highlights only correct predictions. It can be considered a biased evaluation
that neglects the incorrect predictions made by a method. Hence, 1-ACC can
be used as an indirect indication of incorrect predictions, which represents total
false predictions against total observations. However, ACC does not separately
evaluate FP and FN that represent false outliers and false inliers to determine
the error in outlier identification and inlier identification of a method.
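The weakness described above can be illustrated with a small numeric sketch. The two detectors and their confusion-matrix counts are hypothetical values chosen for illustration, not results from the experiments: one detector errs only by flagging false outliers, the other only by missing true outliers, yet both obtain the same ACC.

```python
def accuracy(tp, fp, tn, fn):
    """ACC = (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical counts on 100 documents (10 true outliers, 90 true inliers).
# Detector A finds every outlier but raises 8 false outliers (FP).
detector_a = dict(tp=10, fp=8, tn=82, fn=0)
# Detector B raises no false outliers but misses 8 outliers (FN, false inliers).
detector_b = dict(tp=2, fp=0, tn=90, fn=8)

acc_a = accuracy(**detector_a)
acc_b = accuracy(**detector_b)

# Both values are identical, so ACC alone cannot tell the two error types apart.
print(acc_a, acc_b)
```

Because FP and FN only enter the denominator as part of the total, swapping one for the other leaves ACC unchanged.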
Figure 15: ROC curve generated by a binary outlier detection scenario
Appendix C: Weakness in AUC Measurement
AUC that considers a trade-off between TPR and FPR does not properly focus
on false outliers and false inliers, and masks the false positives and false negatives.
Let’s use a binary outlier detection scenario to produce the ROC curve as shown
in Fig. 15. AUC is the sum of the areas of A, B, and C.
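• $AUC = A + B + C$, as proved in [32]
• $AUC = \frac{1}{2} \cdot TPR \cdot FPR + (1 - FPR) \cdot TPR + \frac{1}{2} \cdot (1 - TPR) \cdot (1 - FPR)$
• $AUC = \frac{1}{2} \cdot TPR \cdot FPR + \frac{1}{2} (1 - FPR)(TPR + 1)$
• $AUC = \frac{1}{2} (TPR + 1 - FPR)$
• $AUC = \frac{1}{2} \left( \frac{TP}{P} + 1 - \frac{FP}{FP + TN} \right)$
• $AUC = \frac{1}{2} \left( \frac{TP}{P} + \frac{TN}{FP + TN} \right)$
• $AUC = \frac{1}{2} \left( \frac{TP}{P} + \frac{TN}{N} \right)$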
AUC informs true outliers out of total outliers and true inliers out of total in-
liers. Specifically, it details the correctly predicted outlier ratio and inlier ratio
separately. It does not inform false outliers or false inliers efficiently as it does
not treat false positives or false negatives with special care.
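The derivation above can be checked numerically with a short sketch. The geometric reading of the three areas (a triangle, a rectangle and a second triangle under a single-operating-point ROC curve) and the confusion-matrix counts are assumptions made for illustration; Fig. 15 is not reproduced here.

```python
def auc_from_areas(tp, fp, tn, fn):
    """Sum of the three areas A + B + C for a single-threshold ROC curve."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    area_a = 0.5 * tpr * fpr                # first term of the derivation
    area_b = (1 - fpr) * tpr                # second term of the derivation
    area_c = 0.5 * (1 - tpr) * (1 - fpr)    # third term of the derivation
    return area_a + area_b + area_c

def auc_closed_form(tp, fp, tn, fn):
    """Final line of the derivation: 1/2 * (TP/P + TN/N)."""
    p = tp + fn
    n = fp + tn
    return 0.5 * (tp / p + tn / n)

# Hypothetical counts: 10 true outliers, 90 true inliers.
counts = dict(tp=2, fp=0, tn=90, fn=8)
auc_areas = auc_from_areas(**counts)
auc_closed = auc_closed_form(**counts)
print(auc_areas, auc_closed)
```

Both functions agree, which shows that this AUC only averages the correctly predicted outlier ratio and the correctly predicted inlier ratio.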
Paper 6: Text Outlier Detection using a Ranking-
based Mutual Graph
Wathsala Anupama Mohotti* and Richi Nayak*
*School of Electrical Engineering and Computer Science, Queensland University
of Technology, GPO BOX 2434, Brisbane, Australia
Under Review In: Data & Knowledge Engineering Journal
Statement of Contribution of Co-Authors
The authors of the papers have certified that:
1. They meet the criteria for authorship in that they have participated in
the conception, execution, or interpretation, of at least that part of the
publication in their field of expertise;
2. They take public responsibility for their part of the publication, except for
the responsible author who accepts overall responsibility for the publication;
3. There are no other authors of the publication according to these criteria;
4. Potential conflicts of interest have been disclosed to (a) granting bodies, (b)
the editor or publisher of journals or other publications, and (c) the head
of the responsible academic unit, and
5. They agree to the use of the publication in the student’s thesis and its
publication on the QUT ePrints database consistent with any limitations
set by publisher requirements.
Contributor              | Statement of contribution*                                                                                                                                                | Signature              | Date
Wathsala Anupama Mohotti | Conceived the idea, designed and conducted experiments, analyzed data, wrote the paper and addressed the supervisor and reviewers' comments to improve the quality of paper | QUT Verified Signature | 27/03/2020
A/Prof Richi Nayak       | Provided critical comments in a supervisory capacity on the design and formulation of the concepts, method and experiments, edited and reviewed the paper                  | QUT Verified Signature | 26/03/2020
ABSTRACT: Identification of unusual text instances in text corpora is highly
beneficial for several applications such as content management, emerging and
suspicious pattern detection, etc. Extreme sparseness, distance concentration
and the presence of a large number of subgroups in a text corpus are some of
the issues that challenge the traditional outlier detection methods. In this paper,
we address these issues in a novel fashion by modeling the documents using rare
frequency weighting, building a ranking-based mutual neighbor graph and identifying
outliers by density estimation. The proposed graph-based incremental outlier
detection method effectively reduces false identifications. Experimental results
show that the proposed method is scalable compared to relevant benchmarking
methods and improves the quality of outlier detection in text corpora.
KEYWORDS: Outlier Detection; Density Estimation; Graph-based Clustering;
Data Mining; Mining methods and algorithms
1 Introduction
With the advancement in digital data technologies, text data has grown expo-
nentially [86]. The process of discovering useful information from the text docu-
ment corpora, known as text mining, has become significant and leads to diverse
applications [3]. Content management that facilitates efficient and effective in-
formation retrieval [106], emerging concept identification that facilitates trend
analysis [4] and suspicious content detection that identifies fake news [94] or un-
usual events [45] are some of them. Outlier detection plays a vital role in this
context to identify abnormalities out of a massive data collection that is usually
heterogeneous in nature and includes multiple subgroups.
Figure 1: Text outlier detection and associated problems
The general idea behind outlier detection is to identify patterns that do not
comply with general behavior, also referred to as anomalies, deviants or abnormalities
[1]. An outlier text document has content that is different from the rest of the
documents in the collection that share some similarities amongst them [4]. Fig.
1(a) illustrates a typical scenario of text outliers in comparison to inliers in a
collection. There exist several applications where anomaly detection plays a ma-
jor role. (1) In Wikipedia where pages are required to be placed into relevant
categories based on their contents, identifying outliers is essential for organizing
the collection for effective information retrieval. (2) Identifying an unusual text
deviating from the theme in a blog will draw useful insight for administrative
purposes [96]. (3) In order to flag exceptional news, it is important to identify
unusual news articles from a collection of news documents [94]. (4) Detecting
unusual events that can be early warnings has high importance in social media
data [45]. (5) Identifying outliers leads to revealing the emerging business trends
and competitors on e-commerce applications [4].
Several supervised and unsupervised methods exist for discovery of outliers from
different types of data such as numerical, spatial and categorical [1]. As most
of the real-world data is unlabelled, unsupervised methods become a natural
choice. Outlier detection methods based on unsupervised learning concepts such
as distribution, distance, and density, used for traditional data, face scalability
issues when applied to large text sources [1]. There are only a handful of studies
specific to the text domain [4, 96] that deal with the sparseness of document vector
representation (shown in Fig. 1(b)). The well-known curse of dimensionality [126]
in text data, as shown in Fig. 1(c), where the concept of distance or density
diminishes, leads to incorrect outcomes (i.e. a high number of false predictions) [3].
A recent study used matrix factorization to project the high-dimensional text data
to lower order and calculated outliers by ranking the learning errors [96]. This
method accurately identifies outliers in the homogeneous data where the majority
of documents belong to the same topic. However, it fails to identify outliers in the
data where a large number of subgroups exists [173]. A large number of subgroups
within big datasets [173] also challenges traditional methods when applied to text
collections such as social media data.
In this paper, we focus on detecting text outliers with the aim of reducing false
detection using the well-known IR concepts in a novel fashion. We propose to use
(1) the “rare” term importance in detecting deviations in documents represented
as term vectors and (2) the concept of ranking to identify the local sub-dense
neighborhood, called Hubs, evident in high dimensional data. We present a
novel graph-based method, named Outliers by Ranking based Density Graphs
(ORDG), where a mutual neighbor graph is constructed using the relevant neigh-
borhoods. Documents that are not included in the mutual neighbor graph and
are away from the local sub-dense neighborhood, are treated as outliers. Em-
pirical analysis using several document corpora reveals that ORDG is able to
detect outliers in large document corpora, accurately and efficiently compared to
state-of-the-art methods [29, 96, 157].
To the best of our knowledge, ORDG is the first method that extends the IR
concepts of term weighting and ranking to document outlier detection together with
mutual neighbor graphs. More specifically, this paper makes the following
novel contributions to the area of outlier detection:
• Introduces a novel outlier detection algorithm (ORDG) which combines
the concepts of rare document frequency in document representation and
mutual neighbor graph.
• Proposes to construct the mutual neighbor graph based on the concept of
relevant neighbors using a scalable IR system that incurs a lower computation
cost to identify the deviations.
The rest of the paper is organized as follows. Section 2 details related work on
traditional and high-dimensional outlier detection. The proposed approach and
implementation are elaborated in Section 3. A comprehensive empirical study
and benchmarks on several public datasets with well-known outlier detection
algorithms are provided in Section 4. The final concluding remarks are presented
in Section 5.
2 Related Work
It is estimated that 95% of unstructured data is dominated by digital text col-
lections [60]. Detecting outliers or anomalies in these large document collections
is useful for finding the interesting as well as suspicious text [4].
2.1 Outlier detection methods for structured data
Traditional unsupervised methods are based on the proximity concepts of distance,
density, distribution, and cluster [1, 2]. In the numerical data domain,
distribution-based methods use a statistical measure to determine the anomalies
that occur outside of the normal model [16, 89]. However, this approach is highly
dependent on assumptions about data representation and leads to poor scalability,
making it less effective in text outlier detection.
The distance and density-based approaches successfully handle the anomalies in
numerical data with limited dimensions where outliers are easy to identify in
terms of distance or density distribution. These are extensively used in outlier
detection due to their simple implementation [126]. The concept of nearest neigh-
bors has been used to measure the distance differences. A distance based method
calculates the difference between each point and k nearest neighbors, and the
top-n points are ranked as outliers [157, 197]. A density-based method calcu-
lates the ratio between density around k-Nearest-Neighbors of a point and its
local neighborhood [109]. A point is ranked as an outlier candidate if its relative
density, known as Local Outlier Factor (LOF), is high [29].
The reverse neighbor count [155] that indicates the number of times a point
appears among nearest neighbors of the entire collection has also been used in
outlier detection [155]. The reverse neighbor count of some points shows sig-
nificant skewness, forming sub-dense regions known as Hubs. The “Anti-Hubs”
points have been identified as outliers [154]. With reverse k-NNs, a graph-based
method is used to identify the outlier nodes which have less in-degree values [87].
Nearest Neighbor (NN)-based mutual proximity and the k-NNs have also been
used to calculate outlier scores [58, 84].
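The reverse neighbor count described above can be sketched in a few lines. The neighbor lists are hypothetical toy data, and treating a count of zero as the "Anti-Hub" criterion is a simplifying assumption for illustration rather than the exact rule used in [154].

```python
from collections import Counter

def reverse_neighbor_counts(neighbors):
    """Count how often each point appears in the k-NN lists of other points."""
    counts = Counter({point: 0 for point in neighbors})
    for point, nbrs in neighbors.items():
        for nbr in nbrs:
            counts[nbr] += 1
    return counts

# Hypothetical 2-NN lists: "x" lists others as neighbors, but nobody lists "x".
neighbors = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "x": ["a", "c"],
}
counts = reverse_neighbor_counts(neighbors)

# Assumed criterion for this sketch: points that never occur as a neighbor.
anti_hubs = [point for point, count in counts.items() if count == 0]
print(counts, anti_hubs)
```

Points with a high count ("a" and "c" here) play the role of Hubs, while "x" is the Anti-Hub candidate.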
In text data, identifying neighborhoods using distance measures is challenging due
to distance concentration in high dimensionality [104]. All the pairwise distances
(dissimilarities) yield a similar value where distance differences between far and
near points become negligible [24]. In document collections such as the Web, which
are large and exhibit multiple sub-groups, the nearest-neighbor
calculation poses a scalability problem due to the large number of pairwise
comparisons [173].
Density based clustering has been successfully used in spatial data by isolating
outlier points with the density approximation [9, 57]. However, in the text where
data is sparse, applying the density notion to separate outliers is challenging as
data already exists in patches.
Traditional outlier detection methods are impaired in high dimensionality data
due to sparseness [1]. The angle between vectors can be successfully used to
identify the deviations in this context. This approach can be well suited to text
data which are in the form of feature vectors and cosine similarity can be used
to measure angle differences [37]. However, the number of pairwise comparisons
needed for larger datasets increases the computational complexity and makes this
approach infeasible to apply to large-scale data.
Subspace analysis is an alternative method to detect outliers in high dimensional
data. However, the problem of finding a subset of dimensions, with rarely existing
patterns, using brute-force searching mechanisms poses extreme computational
complexity [1]. Therefore, lower-dimensional projections have been used as a rem-
edy. The degree of deviation of each observation after projecting it to the lower-
dimensional space by dimensionality reduction is used to determine outliers [126].
In [27], Multi-Dimensional Scaling (MDS) is used in identifying outliers in high
dimensional data. It reduces the number of dimensions while preserving pairwise
distances between points, and identifies outliers in the embedded space using a heuristic
that captures deviants. However, the information loss in these approaches when
projecting data from higher to lower dimension makes it unsuitable to determine
extreme values in full dimensionality.
2.2 Outlier detection methods for text data
There are limited studies specifically focused on text-domain to identify the docu-
ments deviated from the common theme [4, 96]. In text-domain where data is high
dimensional, matrix factorization is proposed as a solution that projects the high
dimensional search space to a lower space with preserved original relationships in
the newly mapped space [11]. In a recent study, the sum-of-square of differences
with the original matrix while projecting to a lower order with the Non-negative
Matrix Factorization (NMF) is measured and observations with higher rank for
learning error are identified as outliers [96]. This method attempts to use the
semantic similarity while learning text outliers. However, an increased number of
groups within the collection impairs this learning process. This method
may fail to detect outliers accurately or in a scalable fashion in Web content
that often contains many document categories.
Deep neural network [38] and Generative Adversarial Network (GAN) with active
learning [127] are the latest supervised approaches used for outlier detection in
text data. They have been used with the dense representation of text data as
they are unable to work with high-dimensional sparse data. Supervised deep
network-based methods use the labeled data for training and learn the patterns
of the text data that classify outliers and inliers. GAN methods generate infor-
mative potential outliers based on the mini-max game between a generator and
a discriminator network [127]. Authors in [127] used active learning (i.e., a weak
supervision approach) to generate potential outliers with a reasonable reference
distribution for the small labelled data with GAN. The accuracy of these methods
relies on the labelled data. However, it is difficult to provide labelled data for
training due to the unknown nature of anomalies. Therefore, a common approach is
to utilise unsupervised methods to find objects or patterns that are uncommon
based on data distribution.
2.3 IR Concepts: How can they be used in outlier detection
According to the well-known Hawkins definition, an outlier is “an observation
which deviates so much from the other observations as to arouse suspicions that
it was generated by a different mechanism” [84]. We conjecture that documents
with rare terms will exhibit this characteristic and the use of a rare term weighting
technique in document representation can reveal outliers. A Vector Space Model
(VSM) is used to represent a document by a vector where each term appears
as a co-efficient to represent the term weight considering its frequency within
the document and/or collection [3]. There are different term weighting schemes
used in IR to rank the terms such as TF, IDF, TF*IDF and BM25 [160]. Term
Frequency (TF) gives high weights for frequently occurring terms by favoring
common and long documents [37] while Inverse Document Frequency (IDF) favors
the rare terms in the collection [37]. In this paper, we use the concept of term
weightings with IDF to measure the importance of rare words in a novel fashion
to detect text outliers.
IR systems have been shown to be a scalable and efficient solution for handling high
dimensional text data [3]. There exist advanced IR technologies including inverted
index data structure and ranking that allow a search engine to find related docu-
ments in a large document collection for a given user query [200]. In this paper,
we use the concepts of ranking in IR systems for outlier detection in a novel
fashion. The proposed method generates a mutual NN graph based on the retrieved
(relevant) documents for the search queries instead of an expensive NN calculation
and identifies outliers based on the density of the graph. The outliers generated
by this ranking-based mutual neighbor graph are combined with the outliers
generated by the rare term frequency-based method to reduce the high number
of false positives, a significant problem in outlier detection [2].
Figure 2: Architecture of the proposed ORDG method (Phase 1: outlier candidates from rare frequency term modelling; Phase 2: outlier candidates dissimilar to a mutual neighbor graph; Final outliers: common outliers for both phases)
3 ORDG: Outliers By Ranking-based Density Graphs
The proposed ORDG method is an ensemble approach that combines the outliers
generated by two processes as in Fig. 2. Phase 1 includes the process of obtaining
probable outlier candidates through the rare frequency term weighting model.
Phase 2 includes construction of the mutual neighbor graph and removal of outlier
candidates based on density of inliers in the graph. We define mutual neighbors
using ranking results generated by an IR system in a scalable and efficient manner.
The final list of outliers is generated by reporting the common outliers for both
processes.
3.1 Preliminaries
Consider a document collection D = {t1, t2, ..., tn} that contains a total of n terms,
where a document di ∈ D is represented using a set of unique terms {t1, t2, ..., tt}
in D. Let D consist of a set of groups C = {c1, c2, ..., cN}, where each cg ∈ C
contains a set of similar documents that share related terms.
Definition 1 Outlier A document di ∈ D that shows high deviation, based on
terms distributions, to all sets of similar documents cg ∈ C is considered an
outlier.
Definition 2 Inlier A document di ∈ D that shows high similarity, based on
terms distributions, to a set of similar documents cg ∈ C is considered an inlier.
3.2 Finding nearest neighbours
Given a document di ∈ D, a vector space model (VSM) represents a document as
a point vector in multi-dimensional space by assigning weights to each respective
term as di = {w1, w2, w3, ..., wt}. These weights in a vector emphasize the impor-
tance of the document within the collection using a weighting scheme. Inverse
Document Frequency (IDF) weighting scheme differentiates whether the term v
is common or rare considering document frequency dfv as per Eq. 1.
$$w_v = idf_v = \log\left(\frac{|D|}{df_v}\right) \qquad (1)$$
A wv ∈ di when modeled with IDF gives a higher score to rare terms. To emphasize
rare terms appearing in a document, the documents in a document collection
are represented with the IDF weighting scheme.
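A minimal sketch of Eq. 1 follows. The tokenised toy documents and the function name `idf_weights` are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated in the text.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Return {term: log(|D| / df_term)} for a list of tokenised documents."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

# Hypothetical pre-processed collection with one off-topic document.
docs = [
    ["cricket", "bat", "ball"],
    ["cricket", "ball", "wicket"],
    ["rugby", "ball", "try"],
    ["volcano", "eruption", "ash"],
]
idf = idf_weights(docs)
print(idf["ball"], idf["cricket"], idf["volcano"])
```

The term "ball" occurs in three of four documents and gets the lowest weight, while "volcano" occurs once and gets the highest, which is the behaviour the representation relies on.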
Each document di ∈ D is treated as a query document represented with top-s
(s = 10) terms ranked in the order of IDF. We use Elasticsearch search engine as
the IR system and obtain top-k documents given in response to the query as k-
Nearest Neighbors. We have set k = 10 as P@10 (Precision at top-10 documents)
in the ranked list returned for a topic is considered high due to tight coupling with
the topic [139]. Thus the top-10 documents that possess sufficient information
richness [173] are chosen as the NNs.
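The query construction step can be sketched as follows. The helper `query_terms`, the toy IDF values and the alphabetical tie-break are assumptions for illustration; sending the resulting terms to the search engine is deliberately left out, so no Elasticsearch call is shown.

```python
def query_terms(doc_tokens, idf, s=10):
    """Return the s distinct terms of a document with the highest IDF."""
    unique_terms = set(doc_tokens)
    # Descending IDF; ties are broken alphabetically to keep the result stable.
    ranked = sorted(unique_terms, key=lambda term: (-idf[term], term))
    return ranked[:s]

# Hypothetical IDF values for a small vocabulary.
idf = {"the": 0.0, "ball": 0.29, "cricket": 0.69, "wicket": 1.39}
query = query_terms(["the", "cricket", "ball", "wicket", "ball"], idf, s=2)
print(query)
```

With s = 2 the query keeps only the two rarest terms of the document; the paper uses s = 10.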
Let Rf be the ranking function employed in an IR system that extracts the
most relevant k documents as nearest neighbor documents Dq, for a given query
document q, where r is the relevancy scores vector for query q as follows.
Rf : q → Dq = {(dp, rp)} : p = 1, 2, . . . , k (2)
There exist several ranking functions employed in search engines such as tf*idf,
BM25 and LM Jelinek-Mercer smoothing to calculate the relevant documents
[173]. We use the widely applied tf*idf ranking function to measure the relevancy
between a document dp and a query q where the relevancy score rp is given as:
$$\mathrm{score}(q, d_p) = r_p = \sum_{t \in q} \left( \sqrt{tf_{t,d_p}} \times idf_t^{2} \times \mathrm{norm}(t, d_p) \right) \qquad (3)$$
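Eq. 2 and Eq. 3 can be mimicked in memory with the sketch below. It is a stand-in for the search engine, not its implementation: the length normalisation norm(t, d_p) = 1/sqrt(document length), the exclusion of the query document from its own results and all names are assumptions made for illustration.

```python
import math

def score(query_terms, doc_tokens, idf):
    """Eq. 3 style score: sum over query terms of sqrt(tf) * idf^2 * norm."""
    if not doc_tokens:
        return 0.0
    norm = 1.0 / math.sqrt(len(doc_tokens))  # assumed length normalisation
    total = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)
        if tf:
            total += math.sqrt(tf) * idf.get(term, 0.0) ** 2 * norm
    return total

def top_k(query_terms, docs, idf, k=10, exclude=None):
    """Eq. 2 style ranking: the k best (document, relevancy score) pairs."""
    scored = []
    for doc_id, tokens in docs.items():
        if doc_id == exclude:
            continue  # assumption: a document is not its own neighbor
        relevancy = score(query_terms, tokens, idf)
        if relevancy > 0:
            scored.append((doc_id, relevancy))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]

docs = {
    "d1": ["cricket", "bat", "ball"],
    "d2": ["cricket", "ball", "wicket"],
    "d3": ["rugby", "ball", "try"],
}
idf = {"ball": 0.0, "cricket": math.log(3 / 2), "bat": math.log(3),
       "wicket": math.log(3), "rugby": math.log(3), "try": math.log(3)}
results = top_k(["bat", "cricket"], docs, idf, k=10, exclude="d1")
print(results)
```

Only d2 shares a weighted query term with d1, so it is the single neighbor returned; d3 scores zero and is dropped.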
Let $D_{d_i}$ and $D_{d_j}$ be the ranking results considered as nearest neighbors of $d_i, d_j \in D$
respectively through the ranking function $R_f$. These ranking results are used
in defining mutual neighbors. Two documents $d_i$ and $d_j$ are considered mutual
neighbors if $d_j \in D_{d_i}$, $d_i \in D_{d_j}$ and $|D_{d_i} \cap D_{d_j}| > 2$.
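The mutual neighbor condition translates directly into a small predicate. The function name and the toy ranking results below are hypothetical.

```python
def are_mutual_neighbors(di, dj, neighbors, min_shared=2):
    """True if each document retrieves the other and they share > min_shared results."""
    results_i = set(neighbors[di])
    results_j = set(neighbors[dj])
    return dj in results_i and di in results_j and len(results_i & results_j) > min_shared

# Hypothetical ranking results (document ids only, scores omitted).
neighbors = {
    "d1": ["d2", "d3", "d4", "d5"],
    "d2": ["d1", "d3", "d4", "d5"],
    "d6": ["d1", "d7"],
}
print(are_mutual_neighbors("d1", "d2", neighbors))  # share d3, d4 and d5
print(are_mutual_neighbors("d1", "d6", neighbors))  # d6 is not retrieved for d1
```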
3.3 Phase 1: Outliers by the Term Weights
We conjecture that an outlier document will contain more rare terms than an
inlier document in the corpus. For each document represented in IDF weighting,
an average weight is calculated by summing all term weights that are present in
the document and dividing by the number of terms. We propose to filter the probable
outlier documents in D whose average weights exceed a threshold value $T_{idf}$, as in Eq. 4.
$$D^{o}_{idf} \leftarrow d_i : \text{where } \left\{ \frac{\sum_{v=1}^{t} (w_v \in d_i)}{t} \right\} > T_{idf} \qquad (4)$$
We set this control parameter independently, using the internal statistics such as
median and standard deviation of the term weights of the datasets, as detailed in
the sensitivity analysis section, to form the optimal threshold to filter the outliers.
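Phase 1 can be sketched as below. The exact threshold formula is not given in this section, so the default used here (median plus one population standard deviation of the per-document average weights) is purely an assumption for illustration, as are the function name and the toy weights.

```python
import statistics

def phase1_candidates(doc_weights, threshold=None):
    """Eq. 4 sketch: keep documents whose average IDF weight exceeds T_idf."""
    averages = {doc_id: sum(weights) / len(weights)
                for doc_id, weights in doc_weights.items() if weights}
    if threshold is None:
        values = list(averages.values())
        # Assumed threshold: median + standard deviation of the averages.
        threshold = statistics.median(values) + statistics.pstdev(values)
    candidates = {doc_id for doc_id, avg in averages.items() if avg > threshold}
    return candidates, averages, threshold

# Hypothetical IDF weights of the terms present in each document.
doc_weights = {
    "d1": [0.3, 0.7, 0.3],
    "d2": [0.3, 0.7, 1.4],
    "d3": [0.3, 1.4, 1.4],
    "d4": [1.4, 1.4, 1.4],
}
candidates, averages, t_idf = phase1_candidates(doc_weights)
print(candidates, round(t_idf, 3))
```

Only the document made up entirely of rare terms exceeds the threshold. A fixed value can be passed as `threshold` to study the sensitivity of the candidate set.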
Let $OS^{d_o}_{idf}$ and $OS^{d_i}_{idf}$ be the average of IDF values of all terms in an outlier
document do ∈ D and inlier document di ∈ D respectively in the document
collection D.
Claim 1 Given an inlier and outlier document pair, $OS^{d_o}_{idf} > OS^{d_i}_{idf}$ is valid.
• For a generic document dk ∈ D, IDF weight for a term can be calculated as
in Eq. 1 where rare terms get high IDF values due to their low document
frequency (df) compared to common terms.
• Since the outlier document do consists of a rare set of terms within collection
D, the average IDF weight of respective terms, $OS^{d_o}_{idf}$, will get a higher value
compared to the inlier document di. A higher score indicates deviation from
the majority.
• In contrast, an inlier di will possess common terms that represent one of
the intrinsic themes of D, thereby it will hold a lower average IDF, $OS^{d_i}_{idf}$,
for respective terms.
• Any $OS^{d_o}_{idf}$ that is dominated by rare deviated terms should be higher than
any $OS^{d_i}_{idf}$ that is led by the common terms within D.
3.4 Phase 2: Outliers by the Ranking-based Mutual graph
A (mutual neighbour) graph is constructed where two mutual neighbor documents
are represented by the adjacent nodes and the edge weight between them is the
number of neighbors the two documents share.
Let $D_{d_i}$ represent the top-10 relevant neighbor documents of $d_i$ obtained using
Eq. 2. Let document $d_j \in D_{d_i}$ and its top-10 relevant neighbor documents be
$D_{d_j}$. If $d_i$ and $d_j$ are found to be mutual neighbors, that is, each appears in the
other's ranking results and they share more than two other documents, they are
included as vertices of the graph $G_M$ with the edge weight of $|D_{d_i} \cap D_{d_j}|$.
Repeating this process for all documents in the collection, a mutual-neighbor graph
$G_M(V, E, w)$ is formed where the vertices V represent the document nodes and the
edges E with the set of weights w represent the number of mutually shared neighbors.
All the mutual documents in $G_M$ form a set $D_{MN}$. This process separates the set
of outlier documents $D^{o}_{G}$ that are not part of the connected graph. Algorithm 1
in Fig. 3 represents the ORDG algorithm for building the mutual graph.
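The graph construction described above can be sketched with plain dictionaries. This is a simplified illustration under assumed names and toy data, not a transcription of Algorithm 1 in Fig. 3.

```python
def build_mutual_graph(neighbors, min_shared=2):
    """Return (edges, documents in the graph, left-out documents).

    edges maps a document pair to the number of ranking results they share.
    """
    neighbor_sets = {doc: set(results) for doc, results in neighbors.items()}
    edges = {}
    for di, results_i in neighbor_sets.items():
        for dj in results_i:
            # Visit each unordered pair once and skip unknown documents.
            if dj not in neighbor_sets or dj <= di:
                continue
            results_j = neighbor_sets[dj]
            shared = len(results_i & results_j)
            if di in results_j and shared > min_shared:
                edges[(di, dj)] = shared
    in_graph = {doc for edge in edges for doc in edge}
    left_out = set(neighbor_sets) - in_graph
    return edges, in_graph, left_out

# Hypothetical ranking results: five tightly related documents and one stray.
group = ["d1", "d2", "d3", "d4", "d5"]
neighbors = {doc: [other for other in group if other != doc] for doc in group}
neighbors["d6"] = ["d1", "d2"]

edges, d_mn, d_og = build_mutual_graph(neighbors)
print(len(edges), sorted(d_mn), sorted(d_og))
```

Only pairs that appear in each other's result lists are examined, so the cost grows with the number of documents times k rather than with all document pairs.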
The document collections with medium to large size term vectors (e.g., news
stories, reviews, etc.) contain sufficient co-occurring terms and allow identifica-
tion of local density regions to form mutual neighbors. Due to a high number
of documents included in the mutual neighbor graph $G_M(V, E, w)$, the outliers can
be effectively identified from the left out documents set $D^{o}_{G}$ in these collections.
However, this approach poses a challenge to short documents (such as social media
posts) or sparse documents as only a few documents show mutually shared
documents. Short document collections that hold extremely sparse vector
representations share very few common terms. It becomes hard to discriminate
amongst documents and, eventually, many inlier documents are left out from the
graph construction process.
Figure 3: ORDG: Mutual Graph Building Algorithm
We refine outlier discovery for the short documents by defining outliers based on
dissimilarity to Hubs in the graph. Initial dense regions on the graph are formed
based on a region where minimum edge weight is c and the region contains at least
c document nodes. Identified dense inlier neighborhoods are further expanded to
include documents from the same edge weight forming uniform dense regions. All
the document nodes in each dense region are identified as inliers as they hold the
property in Definition 2. All other nodes are identified as outlier candidates, $D^{o}_{G}$.
Figure 4: ORDG: Dissimilarity with hubs (for document collections with short text vectors)
This process further refines the outlier filtering considering dissimilarity to the
graph. The set of shared neighbor documents DMN identified in Algorithm 1 in
Fig. 3 attached to GM are labeled as inliers l if they are attached with the dense
regions and embedded documents in them are not outliers. These document
sets in DMN identified within mutual graph construction can be considered as
Hubs in clusters (i.e, dense regions), which we propose to use to separate outlier
candidates through dissimilarity. A normalized similarity score Sh is calculated
against each Hub h ∈ DMN for each outlier candidate do ∈ DoG for identifying
the dissimilarity. The similarity score Sh calculation utilises the ranking scores
derived through Eq. 3 for obtaining relevant documents of do if they appear in h
as:
$$S^{d_o}_{h} = \frac{1}{|h|} \sum_{i=1}^{|h|} \mathrm{score}(q, d_o) \quad \text{where } q \text{ is each document in Hub } h \qquad (5)$$
This refinement step analyzes the maximum similar hub of each outlier candidate
do as in Eq. 6 and removes the considered do from the outlier candidate list if
the assigned hub is associated with an inlier label l. This process, given in
Algorithm 2 in Fig. 4, includes a two-step approach where each step provides
a better understanding of the document collection to enable a more refined
execution.
$$h \leftarrow \max\left(S^{d_o}_{h}\right) \qquad (6)$$
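Eq. 5 and Eq. 6 can be sketched as follows. It is a simplified illustration, not Algorithm 2 in Fig. 4: the data layout (each candidate's ranking results stored as a dictionary of scores), the hub names and the rule that a candidate with zero similarity to every hub stays a candidate are all assumptions.

```python
def hub_similarity(candidate_results, hub):
    """Eq. 5 sketch: ranking scores of hub members retrieved for the candidate,
    normalised by the hub size."""
    return sum(candidate_results.get(doc, 0.0) for doc in hub) / len(hub)

def refine_candidates(candidates, ranking_results, hubs, inlier_hubs):
    """Eq. 6 sketch: drop a candidate whose most similar hub is labeled inlier."""
    remaining = set()
    for candidate in candidates:
        results = ranking_results.get(candidate, {})
        similarities = {name: hub_similarity(results, members)
                        for name, members in hubs.items()}
        best_hub = max(similarities, key=similarities.get)
        # Assumption: zero similarity to every hub keeps the candidate.
        if similarities[best_hub] > 0 and best_hub in inlier_hubs:
            continue
        remaining.add(candidate)
    return remaining

# Hypothetical hubs and ranking scores for two outlier candidates.
hubs = {"cricket": {"d1", "d2", "d3"}, "rugby": {"d4", "d5"}}
ranking_results = {"d9": {"d1": 0.6, "d2": 0.3}, "d11": {}}
remaining = refine_candidates({"d9", "d11"}, ranking_results, hubs,
                              inlier_hubs={"cricket", "rugby"})
print(remaining)
```

Here d9 retrieves members of the cricket hub and is moved back to the inliers, while d11 retrieves no hub member and stays an outlier candidate.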
3.5 Phase 3: Ensemble – Combining Outliers
We propose to use the independent ensemble approach to combine the outliers
detected by the first two phases. In prior work, such ensemble methods have been
successfully used to improve the quality of an outlier detection algorithm [1].
This addresses the problem of the high number of false positives generated by a
single method [2]. The final set of outliers is produced as the common outlier
documents identified in Phase 1, $D^{o}_{idf}$, and in Phase 2, $D^{o}_{G}$, as:
$$D^{o}_{f} = D^{o}_{idf} \cap D^{o}_{G} \tag{7}$$
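As a small illustration of Eq. 7, the ensemble step reduces to a set intersection. The document ids below are hypothetical and are only meant to show the effect of requiring agreement between the two phases.

```python
# Hypothetical candidate sets produced by the two phases.
phase1_candidates = {"d3", "d7", "d11"}   # flagged by average IDF weight
phase2_candidates = {"d5", "d11"}         # not attached to a dense graph region

# Eq. 7: only documents flagged by both phases are reported as outliers.
final_outliers = phase1_candidates & phase2_candidates
print(sorted(final_outliers))
```

Documents flagged by only one phase (`d3`, `d5`, `d7` here) are treated as false positives of that phase and are discarded.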
An example: Consider an example document set in Fig. 5, which consists
of eleven documents related to two sports: Cricket and Rugby, and an outlier
Figure 5: Example document collection
Figure 6: IDF weights of terms in documents with the average IDF weight
document. The example document set clearly depicts that inlier documents share
the common theme, sport, and the outlier document is a deviation from both of
these sports categories. Fig. 6 shows the IDF weights of terms for each document
after standard pre-processing together with the average IDF value of each vector.
It reveals that the average IDF value of the document d11 is much higher than
the rest of the collection. The first phase of ORDG identifies the possible outlier
candidate d11.
Figure 7: List of relevant documents given by the search engine for the example document collection
Phase 2 of ORDG calculates the mutual neighbors to build the graph as in Fig.
8 considering the shared documents within the ranking results as given in Fig. 7.
The graph is able to isolate the outlier documents from the collection as shown
in Fig. 9.
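The mutual neighbor calculation described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the minimum of 3 shared neighbors is read from the Fig. 8 residue, and using the number of shared neighbors as the edge weight is an assumption of this sketch.

```python
from itertools import combinations


def mutual_neighbor_graph(ranked, min_shared=3):
    """Link two documents when their result lists share enough neighbors.

    `ranked` maps a document id to the ids returned for it by the search
    engine. The edge weight (number of shared neighbors) and the default
    threshold of 3 are assumptions made for this sketch.
    """
    edges = {}
    for a, b in combinations(sorted(ranked), 2):
        shared = set(ranked[a]) & set(ranked[b])
        if len(shared) >= min_shared:
            edges[(a, b)] = len(shared)
    return edges


def isolated_documents(ranked, edges):
    """Documents with no edge are treated as outlier candidates."""
    connected = {doc for pair in edges for doc in pair}
    return sorted(set(ranked) - connected)


# Toy ranking results (hypothetical ids).
ranked = {
    "d1": ["d2", "d3", "d4", "d5"],
    "d2": ["d1", "d3", "d4", "d5"],
    "d3": ["d1", "d2", "d4", "d5"],
    "d11": ["d1", "d7", "d8", "d9"],
}
edges = mutual_neighbor_graph(ranked)
print(edges)
print(isolated_documents(ranked, edges))
```

Because the neighbors come from precomputed ranking lists, no pairwise comparison of full term vectors is needed; only the short result lists are intersected.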
This example document collection forms a considerably dense VSM model by
showing high term co-occurrences. This may be different from a real-world text
outlier detection problem. In high dimensional text, where term co-occurrences
are usually extremely low, a single method tends to produce more false outliers
[2]. We address this problem by combining the outliers detected in Phase 1
(using IDF term weights) with those of Phase 2 (using mutual neighbor graphs).
ORDG identifies a document as an outlier only if both phases detect it as an
outlier.
234 4 Empirical Analysis
Figure 8: Mutual Neighbor Calculation with IR search results
(Figure content: the IR ranking results of two documents, their shared neighbors, a check for a minimum of 3 shared neighbors, and the resulting vertices (documents) and edge weight.)
Figure 9: Mutual graph construction in ORDG
4 Empirical Analysis
4.1 Datasets: size, sparsity and classes of Inliers and Outliers
We used multiple datasets with varying dimensionality in the evaluation, namely
20 Newsgroups, Reuters 21578, MediaEval Social Event Detection (SED) 2013 and
SED 2014, and Wikipedia, as reported in Table 1. The Wikipedia dataset (DS1),
which has an average of about 800 terms per document, is used to validate the
outlier detection behavior on a large document set. The well-known 20 Newsgroups
dataset
Table 1: Summary of datasets used in experiments.
Dataset               # of Docs   # of Unique Terms   # of Total Terms   Avg. # of Terms   # of Outliers
Wikipedia (DS1)       11521       305827              9206250            799               100
20 Newsgroups (DS2)   4909        27882               374642             76                50
Reuters (DS3)         5050        13438               200482             40                50
SED2013 (DS4)         81228       46548               1583073            19                840
SED2014 (DS5)         91670       46031               1816840            20                976
(DS2) and the Reuters dataset (DS3), which have 40-80 terms on average, were
used to validate the outlier detection behavior on a medium document set,
whereas the MediaEval Social Event Detection 2013 (DS4) and 2014 (DS5) datasets,
with about 20 terms on average, were used to analyze short document collections.
The ground-truth class/category labels in the datasets were used to measure
the methods’ effectiveness extrinsically.
We allow the document sets to contain several classes of documents - both inliers
and outliers. Specifically, DS1 contains inliers from the multiple subclasses under
the Wikipedia category “War” while containing outliers from 10 other categories.
DS2 contains inliers from five classes related to “Computers” and outliers from five
other categories. Similarly, DS3 contains inliers from two classes while outliers
are taken from 25 other classes. Inliers in the short datasets (DS4 and DS5) are
collected from classes that have at least 100 documents, while two outliers are
taken from each of the remaining (non-inlier) classes within the same dataset.
These short document collections are built to contain more than 400 groups in
both inlier and outlier classes to explore fine-grained scenarios. Generally,
all the datasets were created such that they contain nearly one percent outlier
documents, which belong to several classes, while the inlier documents also
belong to a diverse set of classes.
4.2 Experimental setting, Benchmarks and Evaluation
Measures
Experiments were done using Python 3.5 on a 1.2 GHz 64-bit processor
with 264 GB of (shared) memory. All datasets were preprocessed using standard
text pre-processing such as stop-word removal and stemming. Elasticsearch was
used as the search engine. Inverted indexes were generated for all datasets. For
each document in a collection, top-10 relevant documents were obtained using
the ranking process employed in Elasticsearch.
There exist only a handful of text outlier detection methods, both supervised
[107, 127] and unsupervised [96]. We compare ORDG with the Non-negative Matrix
Factorization based unsupervised method in [96] (NMFO) as well as traditional
unsupervised methods adapted for text data, including the k-nearest neighbor
based method in [157] (KNNO), the density-based local outlier factor method in
[29] (LOFO) and the pairwise mutual neighbor graph method in [55] (MNCO).
Neural network-based approaches have recently become popular in text mining
with full or weak supervision [107, 127]. Although ORDG is a fully unsupervised
method, we have also done experiments with a supervised method based on a
Convolutional Neural Network (CNN) [107] and a semi-supervised method based on
generative adversarial active learning [127]. With deep learning, we follow the
standard practice of using a dense word representation with reduced
dimensionality obtained with Global Vectors for Word Representation (GloVe)
[148] as the input to the neural network experiments [107]. Methods based on
training a neural network with the full-dimensional space are extremely
time-consuming.
Standard outlier evaluation measures including Accuracy (ACC) [84], Area Under
the ROC Curve (AUC) [1] and False Negative Rate (FNR) [1] are used to report
the results.
Let TP, TN, FP and FN denote the correct outliers, correct inliers, incorrect
outliers and incorrect inliers respectively, where P and N denote the total
outliers and inliers. Accuracy (ACC) is calculated as:
$$ACC = \frac{TP + TN}{TP + FP + FN + TN} = \frac{\text{total correct predictions}}{\text{total observations}} \tag{8}$$
ACC measures the effectiveness of predictions in terms of correct predictions and
does not consider false predictions. Consequently, it may disregard the effect of
false inliers by giving higher importance to true inliers. This is misleading in
the general outlier detection scenario, where there is a massive class skew
between the few outliers and the much larger number of inliers. Alternatively, we
used FNR to measure the effectiveness of outlier detection, highlighting the
error in predicting outliers. The False Negative Rate (FNR) is calculated as:
$$FNR = \frac{FN}{TP + FN} = \frac{FN}{P} \tag{9}$$
The Area under the Receiver Operating Characteristics (ROC) curve has been
used in prior outlier detection work to evaluate accuracy [1, 126, 155]. The ROC
curve plots the true positive rate (TPR) against the false positive rate (FPR),
which addresses the problem with skewed classes. Let TPR and FPR be:
$$TPR = \frac{TP}{TP + FN} = \frac{TP}{P} \tag{10}$$
$$FPR = \frac{FP}{FP + TN} = \frac{FP}{N} \tag{11}$$
With T denoting the threshold that controls outliers, AUC can be defined as:
$$AUC = \int_{0}^{1} ROC(T)\, dT \tag{12}$$
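The measures in Eqs. 8 to 12 can be illustrated with a short Python sketch. The helper names are hypothetical, outliers are taken as the positive class (label 1), and the AUC is approximated with the standard trapezoidal rule over (FPR, TPR) points, which is an assumption about how the integral in Eq. 12 is evaluated in practice.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN with outliers as the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def rates(tp, tn, fp, fn):
    """Eqs. 8-11: ACC, FNR, TPR and FPR."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "FNR": fn / (tp + fn),
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
    }


def auc(roc_points):
    """Trapezoidal area under a ROC curve given (FPR, TPR) points."""
    pts = sorted(roc_points)
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# 100 documents, 2 true outliers; a detector that labels everything an inlier.
y_true = [1, 1] + [0] * 98
y_pred = [0] * 100
r = rates(*confusion_counts(y_true, y_pred))
print(r)
```

The toy detector never reports an outlier, yet its ACC is high because of the class skew, while its FNR shows that every outlier was missed. This is the reason FNR and AUC are reported alongside ACC.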
4.3 Experimental Results: Accuracy Analysis
Accuracy (ACC): Accuracy results reported in Table 2 reveal that ORDG
outperformed all (unsupervised) baselines by a large margin except KNNO.
The performance of ORDG is similar to that of KNNO on all datasets except DS3,
where classes in the corpus are highly overlapping. It is hard to separate
outliers considering terms in the VSM representation due to this overlapping
class behavior; hence, ORDG yields lower accuracy. KNNO compares all pairs of
documents to produce k-NNs and calculates the differences between each
observation and its NNs, to rank the top-p points as outliers for a given p. Due
to these intricate comparisons, it produces high accuracy; however, it is not
scalable to the high dimensional Wikipedia document collection (DS1) and fails
to produce results within the time limit. Furthermore, KNNO requires the number
of outlier documents as a control parameter, which directly induces high
performance compared to the others.
In addition to producing poor quality outcomes, LOFO and MNCO are not scalable
to the big datasets DS4 and DS5 due to their requirement of a large number
of pairwise comparisons. Though NMFO can scale with the size, it shows poor
performance, especially on DS4 and DS5, as it is unable to deal with a larger
number of groups because the iterative factorization process in NMF is designed
to work with a lower rank. It is interesting to note that MNCO, a mutual
neighbor graph method, is unable to deal with the sparseness in high
dimensionality, whereas ORDG builds a mutual neighbor graph utilizing a scalable
IR system to obtain neighbors and can deal with sparse and large datasets.
Table 2: Performance comparison of different datasets and methods (Accuracy, ACC)
Dataset   ORDG   KNNO   LOFO   NMFO   MNCO
DS1       0.98   *      *      *      -
DS2       0.95   0.99   0.01   0.98   0.06
DS3       0.91   0.98   0.02   0.69   0.17
DS4       0.97   0.99   *      0.01   -
DS5       0.97   0.99   *      0.01   -
Avg.      0.96   0.99   0.02   0.42   0.12
Note: “*” and “-” denote aborted operations (after 100 minutes) and memory/runtime errors respectively.
Area Under the Curve (AUC): Next, we analyze the results in the form of
ROC and AUC, which report the trade-off between TPR and FPR. We have explored
the ROC curve using the fixed control parameter proposed in our algorithm
(the sensitivity analysis provides more details on the parameter), and the
optimum threshold is used for each baseline accordingly. None of the baseline
methods could be executed on DS1 due to the larger text size, as confirmed by
Fig. 10(a).
As depicted by Fig. 10(b) and Fig. 10(c), ORDG gives the highest AUC compared
to the baselines on DS2 and DS3, where term vectors are of medium size. KNNO,
which requires the number of outliers as a control parameter, is the best
amongst the baselines for DS2, though it does not perform similarly on DS3,
which contains overlapping class labels. MNCO, which uses a mutual neighbor
graph, outperforms the other baselines, which use simple nearest neighbors, on
DS3.
The ROC curves for document collections with short term vectors are given in
Figure 10: ROC curve and AUC for document collections, panels (a) to (e) (“*” denotes aborted operations, memory or runtime error)
Fig. 10(d) and Fig. 10(e). Due to the large collection size, LOFO and MNCO
were not able to execute. Comparatively, ORDG succeeds by consuming less
memory and time due to the efficient IR ranking-based neighborhood generation
process. ORDG outperforms KNNO due to the inclusion of the local sub-dense
neighborhood concept. NMFO performs no better than a random method on these
datasets, which contain a large number of groups, as the iterative lower-rank
matrix approximation process increases the level of error in factorization and
is impaired
Table 3: FNR for different methods against datasets (False Negative Rate, FNR)
Dataset   ORDG   KNNO   LOFO   NMFO   MNCO
DS1       0.75   *      *      *      -
DS2       0.30   0.60   0.00   0.98   0.00
DS3       0.56   0.98   0.00   0.62   0.00
DS4       0.44   0.50   *      0.00   -
DS5       0.41   0.49   *      0.00   -
Avg.      0.49   0.64   0.00   0.40   0.00
The smaller the value, the better the performance.
Note: “*” and “-” denote aborted operations and memory or runtime errors respectively.
in handling the fine-grained data.
With ACC and AUC, we have assessed how good the methods are in identifying
outliers. However, they are yet to be assessed for making false predictions.
In outlier detection especially, the majority of documents are inliers and only
a few are outliers. A method that identifies a large proportion of those few
outliers as inliers (i.e. higher FNR values) can be considered ineffective.
Table 3 reports the FNR (false negative rate), which relates the false inliers
(FN) to the total number of outliers in the data. These results reveal that
KNNO predicts many outliers as false inliers. On the other hand, LOFO, NMFO and
MNCO produce lower accuracy in identifying true outliers (i.e. low ACC and AUC
values) but report fewer false inliers (i.e. low FNR). This is mainly because
they identify a larger portion of documents as outliers.
In general, ORDG shows a consistent level of performance, including on the short
document collections DS4 and DS5, due to the additional concept of Hub-based
inlier removal included for short collections. All baselines fail to produce
results when the dimensionality of the vectors is high and the dataset is large,
as in DS1; however, ORDG handles it by using the IR concepts effectively.
Table 4: Performance given by Neural network-based methods
Dataset   Supervised CNN (Accuracy, ACC)   GAN based Active Learning (Area Under the Curve, AUC)
DS1       0.99                             0.52
DS2       0.74                             0.56
DS3       0.98                             0.52
DS4       0.97                             0.50
DS5       0.97                             0.50
Avg.      0.93                             0.52
Supervised or Semi-supervised Baselines: Experiments have also been
conducted to check the performance of the latest deep learning methods on
outlier detection. A CNN has been used in the supervised setting [107], and the
GAN-based active learning method for outlier detection [127] has been used in
the semi-supervised setting. A GAN includes two networks: a generative network
is used to generate candidates and a discriminative network is used to evaluate
their validity in an unsupervised manner. The method in [127] follows a
semi-supervised approach with active learning to generate initial outliers with
reference to real data. Results in Table 4 show that the supervised CNN-based
method, which predicts outliers using the knowledge learned from the labeled
dataset, is unable to outperform ORDG. Especially in DS2, where the document
collection is small in size and has many classes, the training phase is unable
to give adequate supervision. The semi-supervised GAN method performs almost
like a random method, producing AUC values close to 0.5 on average. The data
used for supervision should be closely matched with the actual datasets to
obtain higher performance with GAN methods.
Figure 11: Time and memory consumption for different methods
4.4 Experimental Results: Scalability and Complexity
Analysis
Time taken by each method is presented in Fig. 11 (a). It shows that the
baseline methods must be aborted when data dimensionality is high, as in the
Wikipedia dataset (DS1). Similarly, the larger document collections such as SED
2013 (DS4) and SED 2014 (DS5) cannot be handled by methods such as LOFO
and MNCO. Though the matrix factorization based NMFO consumes slightly less
time than ORDG on the larger datasets DS4 and DS5, the extra time is well
justified by the performance increments of 96% in ACC and 27% in AUC gained by
ORDG.
In addition, we compare the memory consumption of each method as in Fig. 11
(b). It clearly highlights that ORDG consumes the least memory in comparison
to baseline methods. All the baseline methods are impaired when dealing with
large term vectors such as Wikipedia (DS1) due to resource starvation.
Table 5 shows the computational complexities of ORDG against baseline meth-
ods. This validates the experimental results where LOFO and MNCO fail to
Table 5: Computational complexity of ORDG and the baseline methods.
             ORDG      KNNO        LOFO        NMFO       MNCO
Complexity   O(ndkm)   O(n^2 dk)   O(n^3 dk)   O(n^2 d)   O(n^3 dk)
Note: n - the size of the document collection, d - dimensionality, m - number of mutual neighbor sets and k - considered number of nearest neighbors
Table 6: Number of False Positives (FP) given by each phase
Dataset   ORDG Phase 1 (FP)   ORDG Phase 2 (FP)   Full ORDG (FP)   % improvement by ensemble approach
DS1       1665                1113                159              86%
DS2       624                 1775                268              57%
DS3       909                 2406                422              54%
DS4       14683               7854                2251             71%
DS5       16223               7401                2349             68%
produce output on large datasets due to having cubic time complexity, whereas
ORDG, which uses an ensemble approach combining possible outlier candidates
from two methods, works efficiently because its complexity is linear in the
collection size n.
4.5 Sensitivity Analysis
First, we explore the effectiveness of the ensemble solution in ORDG for reducing
the false positives, as in Table 6. Both phases, the rare term weighting based
first phase of ORDG and the mutual neighbor graph based second phase, yield a
high number of false outliers individually. However, reporting only the outliers
that are detected by both phases, using the ensemble approach, improves the
quality of outlier detection.
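As a worked check of the last column of Table 6, the reported percentages appear consistent with measuring the reduction in false positives relative to the better of the two single phases. This definition is inferred from the numbers and is not stated in the text, so the sketch below should be read as an assumption.

```python
# False-positive counts copied from Table 6: (Phase 1, Phase 2, full ORDG).
false_positives = {
    "DS1": (1665, 1113, 159),
    "DS2": (624, 1775, 268),
    "DS3": (909, 2406, 422),
    "DS4": (14683, 7854, 2251),
    "DS5": (16223, 7401, 2349),
}

improvement = {}
for name, (phase1, phase2, full) in false_positives.items():
    best_single = min(phase1, phase2)
    # Assumed definition: relative reduction against the better single phase.
    improvement[name] = round(100 * (1 - full / best_single))

print(improvement)
```

Under this assumed definition, the computed values match the percentages listed in Table 6 for all five datasets.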
Similar to many other outlier detection algorithms [87, 157], ORDG uses a thresh-
old to determine the top-ranked observations as outliers. We propose to set the
threshold automatically and in a user-independent way, by utilizing the internal
Figure 12: Sensitivity of the control threshold Tidf
characteristics of the dataset. The control threshold Tidf, which governs the
filtering of outlier candidates through the average IDF weights of documents, is
set as the sum of the median and the standard deviation. It yields more outliers,
as shown in Fig. 12, for all the datasets except Reuters (DS3), which contains
overlapping class labels for documents. The median, which removes the effect of
noise, is boosted by adding the standard deviation so that Tidf can detect the
outliers that form only a small portion of the document collection.
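A minimal sketch of this threshold, assuming tokenized documents, could look as follows. The IDF variant `log(N / df)`, the use of the population standard deviation and the rule that documents whose average IDF exceeds Tidf become candidates are assumptions of the sketch; the function names are hypothetical.

```python
import math
import statistics


def average_idf(docs):
    """Average IDF weight of the distinct terms in each tokenized document.

    Uses idf(t) = log(N / df(t)); the exact IDF variant is an assumption.
    """
    n_docs = len(docs)
    df = {}
    for terms in docs.values():
        for term in set(terms):
            df[term] = df.get(term, 0) + 1
    return {
        doc_id: sum(math.log(n_docs / df[t]) for t in set(terms)) / len(set(terms))
        for doc_id, terms in docs.items()
    }


def idf_outlier_candidates(docs):
    """Flag documents whose average IDF exceeds T_idf = median + std. dev."""
    avg = average_idf(docs)
    values = list(avg.values())
    # Assumption: population standard deviation; the text does not say which.
    t_idf = statistics.median(values) + statistics.pstdev(values)
    return t_idf, sorted(d for d, v in avg.items() if v > t_idf)


# Toy collection (hypothetical): four sports documents and one off-topic one.
docs = {
    "d1": ["cricket", "bat", "ball", "match"],
    "d2": ["cricket", "ball", "match", "team"],
    "d3": ["rugby", "ball", "match", "team"],
    "d4": ["rugby", "team", "match", "bat"],
    "d5": ["election", "senate", "vote", "budget"],
}
t_idf, candidates = idf_outlier_candidates(docs)
print(round(t_idf, 3), candidates)
```

Because the threshold is derived from the collection itself, no user-supplied parameter is needed; only the off-topic document, whose terms are all rare, rises above it in this toy collection.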
The premise of ORDG is obtaining nearest neighbors through IR technology instead
of the pairwise document comparisons used in traditional methods. The
performance of ORDG in obtaining nearest neighbors depends on two factors: (1)
the weighting scheme used in query and document representation in order to
retrieve the relevant neighbors; (2) the ranking function employed in the IR
system to measure the document similarity. Documents in a corpus can be
represented using different weighting schemes such as term frequency (TF),
inverse document frequency (IDF) and term frequency-inverse document frequency
(TF-IDF) [37]. According to the AUC results given in Fig. 13 (a), for outlier
detection, the IDF and TF-IDF weighting schemes used for document query
representation are more effective than the TF scheme, and IDF shows slightly
better performance. As validated by
Figure 13: Performance with weighting schema and ranking functions. (a) Different document query representation techniques: AUC per document collection (DS1 to DS5) for TF, IDF and TF*IDF. (b) Different ranking functions of IR systems: AUC per document collection (DS1 to DS5) for BM25 and tf*idf.
Claim 1, IDF places a high focus on rare terms, which are the key to identifying
deviations of documents represented as vectors of weighted terms. Therefore, the
IDF representation is used in ORDG to form document queries to retrieve relevant
documents, which gives precise nearest neighbors that can be used to
differentiate outliers.
Figure 13 (b) shows the AUC results of ORDG with the BM25 and tf*idf ranking
functions of the Elasticsearch search engine. IR systems use different ranking
functions such as LM Jelinek-Mercer Smoothing (LM-JM), LM Dirichlet Smoothing
(LM-Dirichlet), Okapi BM25 and tf*idf [23]. The BM25 and tf*idf ranking
functions give importance to rare terms, as required for outlier detection, in
contrast to LM-JM, which assigns negative scores to terms with fewer
occurrences, and LM-Dirichlet, which captures important patterns in the text
while leaving out the noise [54]. Results show that both ranking functions have
similar performance on document collections with large and medium-size text
vectors. However, BM25 [59], which calculates the relevancy score of documents
in relation to a query in addition to the terms in the documents, shows higher
performance for document collections with short text vectors.
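To make the role of rare terms in BM25 concrete, the following sketch implements the commonly used Okapi BM25 formula. It is a generic illustration and not the scoring code of the search engine used in the experiments; the parameter values `k1=1.2` and `b=0.75` and the smoothed IDF form are common defaults assumed here.

```python
import math


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a bag-of-words query.

    `corpus` is a list of tokenized documents used for document frequencies
    and the average length. k1 and b are commonly used defaults, assumed here.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = 1 - b + b * len(doc_terms) / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Toy corpus (hypothetical): "match" occurs everywhere, "cricket" is rarer.
corpus = [
    ["match", "ball", "cricket"],
    ["match", "ball", "rugby"],
    ["match", "team", "rugby"],
    ["match", "wicket", "cricket"],
]
rare = bm25_score(["cricket"], corpus[0], corpus)
common = bm25_score(["match"], corpus[0], corpus)
print(rare > common)
```

With equal term frequency and document length, the rarer query term contributes the larger score, which is the property the paragraph above relies on for outlier detection.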
5 Conclusion
This paper proposes a novel text outlier detection method based on ranking and
a mutual neighbor graph (ORDG). Phase 1 of ORDG indicates that rare terms
in a document, which can be emphasized through the IDF weighting scheme, show
higher competence in detecting deviations in a document collection. Sparseness
in high dimensional text data, where the traditional distance and density-based
concepts fail, is handled by the mutual neighbors in Phase 2 of ORDG. Mutual
neighbors facilitate relatively uniform density inside the corpus through the
shared neighbors. A normal mutual nearest neighbor graph built using k-Nearest
Neighbors calculation is not scalable to larger datasets due to the high number
of pairwise comparisons required. In contrast, ORDG, which calculates nearest
neighborhoods using relevant documents obtained through a scalable search
engine, can construct mutual nearest neighbor graphs for larger datasets
effectively. The local sub-dense neighborhood (Hub) concept in high
dimensionality is brought to ORDG together with the density approximation to
separate outliers. ORDG conjectures that documents that are not attached to a
sub-dense local neighborhood graph are possible outliers.
Extensive empirical analysis has been conducted on diverse datasets with large,
medium and short term vectors. ORDG is benchmarked against several
state-of-the-art distance-based, density-based, graph-based and matrix
factorization-based outlier detection methods. Results show that ORDG is
capable of detecting outliers in high dimensional document collections with
considerably higher performance, in terms of both accuracy and efficiency. The
ensemble approach of ORDG reduces the false outliers and inliers. Applying ORDG
to dynamic temporal text data for outlier detection is left for our future
investigation.
Chapter 5
Text Cluster Evolution
This chapter introduces the last contribution of the thesis, a novel document
cluster evolution method that identifies the dynamic changes to text clusters
over time or domains using text cluster similarity. Analyzing text-based
communications over time or domains is important for knowing how concepts have
evolved. This reveals which clusters are emerging, persistent, growing
and diminishing. This information is important in planning events, publications,
advertising and much more. Evolution tracking is more popular in network
analysis for identifying community evolution [41, 115, 123]. There exists very
little research on text-based evolution tracking.
The majority of the text-based evolution research focuses on topic evolution
[35, 41, 180] or event evolution [82, 119], which deal with a much smaller
data space than the original data space. In addition, existing text evolution
methods are limited to comparing only consecutive timestamps [63] or to
monitoring a few evolution patterns such as emerging concepts [98]. There is no
prior work that considers a global cluster evolution that is able to show the
full cluster life cycle with all the evolution patterns in the original data
space with all the terms.
Figure 5.1: Overview of the Chapter 5 contributions
Fig. 5.1 shows the main concepts used in the proposed method, Cluster
Association-aware matrix factorization for discovering Cluster Evolution (CaCE),
to identify the text cluster similarities and track the cluster evolution. This
chapter presents CaCE, which uses NMF to identify the groups of similar clusters
over the time/domain using intra- and inter-cluster similarity to handle the
issues attached to high-dimensional text. This paper is based on the conjecture
that the assistance given by inter-cluster association is able to address the
information loss that occurs with high to low-dimensional projection. Further,
this chapter introduces Skip-Gram with Negative Sampling (SGNS) to accurately
learn the context by maximizing the probability of closely associated cluster
pairs within the considered time period/domains, while minimizing that of the
loosely associated cluster pairs.
This chapter is formed by Paper 7 in its original form.
• Paper 7. Wathsala Anupama Mohotti and Richi Nayak: Discovering
Cluster Evolution Patterns with the Cluster Association-aware Matrix
Factorization. Springer Knowledge and Information Systems (KAIS) (Under
Review).
Paper 7 proposes a novel method named CaCE to discover cluster evolution when
a cluster solution is given for each time-stamp/domain. Thus, it works on
static cluster solutions and is able to identify the groups within the clusters
over the time/domain. Specifically, it identifies evolution patterns with the
cluster association-aware Matrix Factorization, which identifies cluster groups
with similar text clusters. It uses an NMF-based method with graph-based
visualization to identify the changing dynamics of text clusters over the
time/domain.
CaCE models inter-cluster associations with the number of overlapping terms
between clusters using SGNS modelling to improve the accuracy. Specifically,
it captures the similarity between each cluster pair, which carries important
information to assist the global evolution: even small values represent the
initial stage of links between clusters that could develop into growth in
upcoming years/domains. Therefore, this information semantically assists matrix
factorization for cluster-group discovery. A density concept based on the term
frequency is used to maintain a uniform term distribution within a cluster group
and to separate less cohesive clusters from it. CaCE tracks four major lifecycle
states of clusters, namely birth, death, split and merge, to discover their
emergence, persistence, growth and decay. It uses a bipartite graph to
effectively visualize this cluster evolution as a progressive k-partite graph
across the k temporal dimensions or domains. A NewsGroup dataset, a patent
abstract dataset and two Twitter datasets are used for experiments. Experiments
are done both quantitatively and qualitatively to demonstrate the validity of
CaCE.
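As a rough illustration of the raw signal behind the inter-cluster associations mentioned above, the number of overlapping terms between every pair of clusters can be counted as follows. The cluster ids and terms are hypothetical, and the SGNS modelling and matrix factorization steps of CaCE are not shown; this sketch covers only the overlap counts.

```python
def overlap_matrix(clusters):
    """Number of shared terms for every pair of clusters.

    `clusters` maps a cluster id to its set of representative terms.
    """
    ids = sorted(clusters)
    return {
        (a, b): len(clusters[a] & clusters[b])
        for a in ids
        for b in ids
        if a < b
    }


# Hypothetical clusters from two time stamps (t1, t2).
clusters = {
    "t1_c1": {"cricket", "bat", "wicket", "match"},
    "t1_c2": {"rugby", "tackle", "match"},
    "t2_c1": {"cricket", "wicket", "league", "match"},
}
overlaps = overlap_matrix(clusters)
print(overlaps)
```

Even the small overlap between `t1_c2` and `t2_c1` is kept, in line with the observation that small values may represent the initial stage of a link between clusters.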
252 Paper 7
Paper 7: Discovering Cluster Evolution Patterns
with the Cluster Association-aware Matrix Factorization
Wathsala Anupama Mohotti* and Richi Nayak*
*School of Electrical Engineering and Computer Science, Queensland University
of Technology, GPO BOX 2434, Brisbane, Australia
Under Review In: Knowledge and Information Systems (KAIS Journal)
Statement of Contribution of Co-Authors
The authors of the papers have certified that:
1. They meet the criteria for authorship in that they have participated in
the conception, execution, or interpretation, of at least that part of the
publication in their field of expertise;
2. They take public responsibility for their part of the publication, except for
the responsible author who accepts overall responsibility for the publication;
3. There are no other authors of the publication according to these criteria;
4. Potential conflicts of interest have been disclosed to (a) granting bodies, (b)
the editor or publisher of journals or other publications, and (c) the head
of the responsible academic unit, and
5. They agree to the use of the publication in the student’s thesis and its
publication on the QUT ePrints database consistent with any limitations
set by publisher requirements.
Contributor: Wathsala Anupama Mohotti
Statement of contribution*: Conceived the idea, designed and conducted experiments, analyzed data, wrote the paper and addressed the supervisor and reviewers’ comments to improve the quality of the paper
Signature: QUT Verified Signature
Date: 27/03/2020
Contributor: A/Prof Richi Nayak
Statement of contribution*: Provided critical comments in a supervisory capacity on the design and formulation of the concepts, method and experiments, edited and reviewed the paper
Signature: QUT Verified Signature
Date: 26/03/2020
254 1 Introduction
ABSTRACT: Tracking document collections over a period is helpful in several
applications such as finding the dynamics of terminologies and identifying
concept drift and emerging and evolving trends. We propose a novel “cluster
association-aware” Non-negative Matrix Factorization (NMF)-based method with
graph-based visualization to identify the changing dynamics of text clusters
over time. NMF is used to find associations among the terms of the clusters
within a collection over time. The novel concepts of “cluster associations” and
term frequency based “cluster density” are used to improve the quality of the
evolution trend. The cluster evolution is visualized using a k-partite graph to
display the birth, death, split and merge of clusters across time. Empirical
analysis with text data shows that the proposed method is able to produce an
accurate and efficient solution compared to the state-of-the-art methods.
KEYWORDS: Cluster Evolution; Text Mining; Matrix Factorization
1 Introduction
Text data, widespread in social media platforms and document repositories such
as news broadcasting platforms and research publications, has emerged as a
powerful means of communication among people and organizations [60]. Text
repositories contain data spanning domains and/or time [7]. Social networks
include opinions expressed on diverse concepts over time. Search engines are
another popular internet medium that stores (or indexes) a large collection.
Topics (or concepts) and the associated terminologies in these text repositories
change over time as well as across domains and show a varying trend.
It is useful for scholars, journalists, and practitioners of diverse disciplines
to mine these data, spanning time or domains, for finding decaying, current and
emerging concepts [64, 73, 75]. A term analysis tool such as Google Trends can
track how the popularity of a term changes over time, based on query log
analysis [34]. With the rise of big data and the dependence amongst
terms/concepts, it is appropriate to analyze the formation and evolution of
concepts (or clusters) instead of individual terms in dynamic text corpora. Over
time, a cluster can go through the states of birth, death, split and merge,
indicating the persistence, growth and decay of concepts [63].
Tracking evolution across different domains provides insight into how the same
concept has been used over diverse domains. Consumer behavior is a well-known
concept mainly used in the economics domain, which is also important for the
political domain and the agriculture domain. It is important to identify how
this concept evolves over the agriculture domain to establish marketing
strategies for businesses. Further, the trends shown by the dynamics of this
concept in the political domain create opportunities for political parties and
governments to adjust their campaigns. Similarly, discovering cluster dynamics
over time in a specific field is useful for researchers, academics, and students
in that field to set up their publications, strategies and research. Further,
these trends provide insight for businesses and governments to set up policies
accordingly to succeed. Tracking of concepts over domains or time can also
provide insight to historians and social scientists to understand how a concept
or theory has evolved [112].
In order to find common concepts, text clustering faces challenges due to the com-
plex nature of text data resulting in high-dimensional and sparse vector repre-
sentation [8]. Matrix Factorization (MF), which maps high-dimensional to lower-
dimensional space, is one of the effective solutions [103]. However, information
loss is inevitable in this family of methods that may result in poor outcome [8].
Researchers have introduced term-based semantics to assist factorization with
additional information to identify topic clusters highlighting concepts [117, 168];
particularly the use of Non-negative MF (NMF) has been found effective [117].
Only a handful of studies address cluster/topic evolution. Most
of these methods only deal with identifying emerging or novel topics [98, 99].
Only a couple of studies focus on identifying emerging, persistent
and diminishing topics [41, 63]. The method in [41] identifies some of these
patterns by measuring how the term frequency changes over time; however, it
is not able to track the individual state differences in topics such as split and
merge. The method in [63] performs similarity calculation between clusters using
overlapping terms in each consecutive time stamp, to visualize various states.
However, it disregards the global evolution over time and focuses only on
adjacent time stamps to determine similarity. Identifying all states of cluster
evolution globally over time is challenging for these types of methods as they
consider one consecutive time-interval pair at a time. Other methods [115, 123]
assume fixed skeleton structures of clusters over time to identify their evolution
and fail to consider new formations or the changes in the structure of the clus-
ters. In contrast, the proposed method considers the time/domain-wise clusters
(presented by the representative terms) for naturally identifying the emergence,
persistence, growth and decay of concepts over a period.
This paper proposes a novel and accurate method of Cluster Association-aware
matrix factorization for discovering Cluster Evolution, called CaCE. It can track
four major lifecycle states of clusters namely birth, death, split and merge to
discover their emergence, persistence, growth and decay. It includes an NMF-based
process to identify the groups of similar clusters that are formed over
time or domains, based on inter- and intra-cluster association relationships defined
using terms in the clusters. Specifically, inter-cluster associations, modeled with
the number of overlapping terms between clusters, semantically assist matrix
factorization for cluster-group discovery. To separate less cohesive clusters from
a cluster group, we introduce a novel concept of density based on uniform term
frequency distribution within the group using a pre-defined threshold. Finally,
the paper proposes to use the concept of a bipartite graph to visualize
the cluster evolution as a progressive k-partite graph across
the k temporal dimensions. The evolution is represented by drawing edges in
the k-partite graph between consecutive time intervals if the clusters possess the
same level of density and belong to the same group in this time interval.
More specifically, this paper brings several novel contributions to the area of
cluster evolution listed as:
• An NMF based approach with inter and intra-cluster associations to identify
the cluster groups.
• A term frequency-based concept of density to remove the loosely connected
clusters in the cluster groups.
• A progressive k-partite graph-based approach to display evolution of clus-
ters in the cluster groups.
To the best of our knowledge, CaCE is the first method that considers the cluster
association using an inter-cluster matrix built with overlapping terms for discovering
cluster evolution. Empirical analyses using several document corpora, with
varying numbers of time stamps and clusters, reveal
that CaCE can discover cluster evolution accurately and efficiently compared to
other state-of-the-art cluster/topic evolution methods.
The rest of the paper is organized as follows. Section 2 reviews related work and
presents the motivation behind this research. Section 3 introduces the problem
definition that is followed by the proposed CaCE method. Experiments are discussed
in Section 4 with two real-world case studies. Concluding remarks
are given in Section 5.
2 Related Work
Approaches that attempt to address the dynamic text over time can be seen
as the discovery methods of cluster evolution [63, 73], topic evolution [35, 41,
180] or event detection [82, 119]. All these paradigms focus on tracking content
shift and identifying emerging trends in dynamic text datasets. These methods
explore the change in cluster/topic structure over time through textual content
associated with clusters/topics to characterize the evolutionary events, concepts
or terminologies. In comparison to cluster evolution, topic evolution is performed in
a much smaller data space (i.e., topic space), as depicted by Fig. 1 (a) and Fig.
1 (b). The number of extracted topics in topic evolution is much smaller than the size of the
entire document collection, and the vocabulary associated with topic clusters in the
collection is much smaller than the complete vocabulary of the collection. This
is the same for event detection work, which considers the set of selected events
in tracking evolution. Community evolution [102, 115, 123] given in Fig. 1 (c),
is another paradigm in tracking cluster dynamics, which considers user groups as
clusters.
Research in (text) cluster evolution is in its infancy, with only simple approaches
available [63, 73]. A survey-based study [73] identified the evolution
of concepts in clusters of publications using bibliometric tools. It only
considers the citation network in tracking evolution. TextLuas [63] models each
cluster solution with the respective terms at each time stamp and considers sim-
ilarity between consecutive clusters, as determined by the term intersections. It
uses the Jaccard coefficient between clusters, based on a threshold, to define the persistence,
merging and splitting of clusters on a timeline. It considers only the local
relations between two consecutive time stamps in defining evolution. In contrast,
the proposed method CaCE globally identifies the cluster groups over a period
and visualizes the entire evolution among time stamps using a k-partite graph.

Figure 1: Comparison between existing evolution approaches
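The local, pairwise nature of a Jaccard-based comparison such as the one described for TextLuas can be illustrated with a minimal sketch. It is not the TextLuas implementation: the function names, the dictionary layout and the threshold value of 0.3 are assumptions made only for illustration.

```python
def jaccard(terms_a, terms_b):
    """Jaccard coefficient between two sets of cluster terms."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def link_consecutive(prev_clusters, next_clusters, threshold=0.3):
    """Link clusters of two consecutive time stamps.

    prev_clusters / next_clusters: dict mapping a cluster id to its set of terms.
    threshold: hypothetical cut-off; the value 0.3 is an assumption.
    """
    links = []
    for a, terms_a in prev_clusters.items():
        for b, terms_b in next_clusters.items():
            if jaccard(terms_a, terms_b) >= threshold:
                links.append((a, b))
    return links
```

Because every call to `link_consecutive` sees only two adjacent time stamps, relations between clusters that are further apart in time are never compared directly, which is the limitation discussed above.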
Topic modeling is another powerful paradigm for the semantic analysis of large
collections of documents. Topic models have been used as a formalization of
conversational understanding through identifying subsets (i.e., topics) [98, 180].
Several researchers have attempted to identify the evolution of topics in larger document
collections using extensions of LDA [28, 48, 74, 180]. In [180], a probabilistic
topic modeling approach is used to track topic occurrence over
time. This generative probabilistic approach only identifies topic occurrence in
different time dimensions with the respective calculated probabilities and is
incapable of identifying topic evolution with splits and merges. Authors in [49]
determine text cluster evolution based on the changes to term probability within
topics. This proposal was limited by the fixed vocabulary constraint where only
a general set of terms in the topics was studied and neglected the tracking of new
topic formations. In [98], NMF is used to identify a set of steady topics through
minimizing learning error. The emerging topics are obtained by filtering deviated
topics. However, discovering only these changes is insufficient, as they do not give
the complete insight of persistence, diminishing and growing concepts. Similarly,
topic models have been used in understanding the topic dynamics across temporal
dimensions [35, 41] in social media domain. Authors in [41] extended these topic
trends to track persistent and diminishing topics using the term frequency-based
energy concept defined for each cluster solution. The “density” concept, which
is used to determine the consistent cluster groups in CaCE, is inspired by the energy
concept. However, these topic evolution methods are limited to identifying a few
states in the cluster lifecycle. Identifying complex dynamics of topics such as merge
and split, detailing a complete cluster lifecycle, is challenging without additional
information due to the sparseness of text representation.
Event detection methods have been applied in social media communication to
find novel or trending events [82, 119, 195]. This stream of methods keeps track
of event clusters (much smaller number than the clusters in original space) that
appear across time to identify the novel events or shifts that are deviated from the
existing event clusters. In [119], a novelty score is assigned to each event cluster to
identify new events in a Twitter dataset based on tweet similarity. Identifying
events in Twitter data across time is handled in [195] with topic modeling.
This research is limited in tracking evolution and fails to identify growth and
decay of clusters. It attempts to identify emerging events through deviations to
previously existing events with the assumption of a fixed set of events within a
dataset.
Researchers have studied the community evolution in social networks focusing
on structural properties of communities [102, 115, 123]. In the area of network-
based community detection, clusters consist of users instead of text as depicted
by Fig. 1 (c). The “snapshot model” [192] considers different snapshots of the
network at different time steps to find communities or clusters, and then tracks
clusters over time in order to interpret their evolution. However, the majority
of community detection methods assume a fixed number of communities across
the time by disregarding new formation and dissolution [123] or relying on a pre-
determined community structure [115]. The “temporal smoothness model” [40] is
used to analyze continuous stream of atomic changes to the considered networks to
derive communities over time. It can be considered similar to the fixed vocabulary
constraint in some of the text evolution analysis methods [49]. However,
network evolution based on user interactions is a completely different domain
compared to text cluster evolution.
In text clustering, the sparse nature of data results in poor outcomes [3]. A few
recent studies use additional semantic information to assist the sparse text
clustering problem [117, 139, 168]. They
use word association relationships, Skip-Gram and Skip-Gram with Negative-
Sampling (SGNS), similar to the concept of word embedding. The Skip Gram
model is a training method for neural networks to learn neighbors or the context
of a word in a corpus for word embedding [137]. In [168], the term × term
association matrix modeled with SGNS is used to semantically assist the NMF
in short text clustering for topic discovery. Negative sampling tries to maximize
the probability of observed term pairs to be 1 and unobserved term pairs to
be 0 within the term association matrix. Extending these concepts to cluster
evolution, we propose the use of SGNS to model the inter-cluster association
using overlapping terms. We conjecture that, by learning the context of terms, clusters
that share similar concepts and terms can be grouped together.
Table 1 summarizes the existing cluster, topic/event and community detection
methods with their major drawbacks in accurate identification of text cluster evo-
lution. Distinct from these works, CaCE utilises the higher to lower dimensional
mapping via matrix factorization to identify the cluster associations and track all
of their states over time.

Table 1: Summary of existing evolution detection methods
Category | Applied data domain | Major drawback
Cluster Evolution | Text data | Neglect global evolution patterns due to consecutive time-stamps analysis [63]
Topic/Event Evolution | Text data | Unable to identify complex cluster dynamics [41, 98, 180]
Topic/Event Evolution | Text data | Study changes to fixed set of terms and neglect new formations [49, 119, 195]
Community Evolution | Network data | Study changes to fixed set of structures and neglect new formations [40, 115]
Community Evolution | Network data | Assume fixed number of communities over time [123]
3 Cluster Association-aware Matrix Factorization method of Cluster Evolution
3.1 Preliminaries and Definitions
Consider a document collection D = {D1, D2, ..., Dk} over a time period k or a
set of k domains. Let {t1, t2, ..., ts, ..., tk} be the considered time period or set of
domains with k consecutive instances. Let C = {C1, C2, ..., Cs, ..., Ck} be the set
of respective cluster solutions in D. Each time-stamp/domain dataset creates a
cluster solution Cs = {c1, c2, ..., cm} with m clusters, where m > 1 and the value
of m can vary among the cluster solutions.
Given a text data collection spanned across the time/domain, the proposed
method aims to identify the cluster evolution over a period of time or domains,
as stated in Definition 1.
Definition 1 : Individual clusters in the set of cluster solutions C at each time-
stamp or domain hold a lifecycle state that can assist in displaying a cluster
evolution for the document collection stored over the time or domains. Following
are the types of states that can be assigned to cluster ci at timestamp ts that
reveal the evolution patterns.
• Birth: if cluster ci that appears in time/domain ts does not have any
similar cluster in time/domain ts−1, it marks the birth of ci.
• Death: if cluster ci that appears in time/domain ts does not have any
similar cluster in time/domain ts+1, it marks the death of ci.
• Split: if cluster ci that appears in time/domain ts has multiple similar
clusters in time/domain ts+1, it marks the split of ci.
• Merge: if cluster ci that appears in time/domain ts has multiple
similar clusters in time/domain ts−1, it marks the merge of ci.
We propose an NMF-based solution to define the similarity between individual
clusters within the set of cluster solutions {C1, C2, ...Cs, ...Ck} based on cluster
associations and discover the latent relationships between clusters by projecting
them to a lower-order dimension. We then assign a unique cluster-group to each
cluster and refine these cluster-group assignments using the term weight-based
density concept to form uniform term distribution within a group. A cluster
with insufficient density value is excluded from the group, indicating that the
cluster does not share enough matching terms with the group to be a member of
the group. The following evolution patterns can be identified based on the final
cluster similarities given by the proposed method.
• Persistence: if cluster ci ∈ Cs has a similar cluster in each consecutive
clustering solution until cluster solution Cp where p ≤ k, cluster ci will
display a persistent evolution pattern within time/domain s to p.
• Growth: if cluster ci ∈ Cs has a gradual increase in the number of splits
until the cluster solution Cp where p ≤ k, cluster ci will display a growth
evolution pattern within time/domain s to p.
• Decay: if cluster ci ∈ Cs has a gradual decrease in the number of merges
until the cluster solution Cp where p ≤ k, cluster ci will display a decay
evolution pattern within time/domain s to p.
• Emerging: if cluster ci ∈ Cs has been born in time/domain s it displays
an emerging pattern in time/domain s.
Let the set of cluster solutions C over the k time-stamps or domains consist of
a total number of N clusters {c1, c2, ...cN} that contain the total number of M
terms {w1, w2, ...wM}. Let matrix S represent the “Intra-cluster association” with
term × cluster relationship modeling N clusters with M terms. The matrix S
is modeled with the traditional bag-of-words model with each term count. This
is accompanied by the symmetric matrix A that represents “Inter-cluster association”
with cluster × cluster relationship using the number of overlapping terms
between clusters. The matrix A is modeled with the Skip-Gram with Negative-Sampling
(SGNS) [117] weighting so that the probability of an observed cluster
association is high. The Skip-Gram model is a popular training approach for
neural networks to learn distributed word representation. The Skip-Gram model
predicts neighbors or the context for a considered word in a corpus in comparison
to the continuous Bag-of-Words model, which uses context to predict the word
[137]. The concept of negative sampling is used to maximize the probability of
observed (word,context) pair to be 1 while minimizing the unobserved pairs to be
0 [168]. In [117] SGNS is proved to be equivalent to factorizing a (shifted) word
correlation matrix. It shows that SGNS is implicitly factorizing a word-context
matrix, whose cells are the point-wise mutual information of the respective word
and context pairs. In CaCE, the inter-cluster association matrix, modeled following
the Skip-Gram model, semantically assists the NMF in learning the context of
the clusters. The use of the SGNS concept in CaCE increases the probability of
accurately learning the context of clusters.
We propose to utilize SGNS in CaCE with the objective of maximizing probability
P (A = 1|ci, cj) for closely associated cluster pairs (ci,cj) within the observed k
time stamps while minimizing P (A = 0|ci, cj) for loosely associated cluster pairs
(ci,cj). The inter-cluster association matrix A is represented with the SGNS of
the observed set of clusters using the number of term co-occurrences as:
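\[
A_{c_i c_j} = \log\left[\frac{\#\{w_{c_i} \cap w_{c_j}\} \times V}{\sum_{c_b \in C} \#\{w_{c_b} \cap w_{c_i}\} \times \sum_{c_b \in C} \#\{w_{c_b} \cap w_{c_j}\}}\right] \tag{1}
\]
where $w_{c_i}$ and $w_{c_j}$ are the sets of terms in clusters $c_i$ and $c_j$ respectively,
$\#\{w_{c_i} \cap w_{c_j}\}$ is the number of overlapping terms between them, and
$V = \sum_{(c_i, c_j) \in C} \#\{w_{c_i} \cap w_{c_j}\}$ is the
total number of overlapping terms among all the cluster pairs.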
The entries of A that are less than 0 are converted to zero to minimize the probability
of unobserved pairs after taking the logarithm as in Eq. 1. This modelling with
$\#\{w_{c_i} \cap w_{c_j}\} \times V$ represents the inter-cluster similarity within each pair
with respect to the total count of term similarities within clusters over the time/domain
in a normalized manner.
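The construction of the inter-cluster association matrix in Eq. 1 can be sketched as follows. This is a minimal illustration, not the original implementation: the function name is hypothetical, and the sketch assumes that V and the sums in the denominator run over all ordered cluster pairs, including a cluster paired with itself, which the text does not state explicitly.

```python
import math


def inter_cluster_association(cluster_terms):
    """Sketch of Eq. 1: shifted-PMI style weighting of term overlaps.

    cluster_terms: list of term sets, one per cluster, over all time stamps.
    Returns a symmetric N x N list of lists with negative entries clipped to 0.
    """
    n = len(cluster_terms)
    # overlap[i][j] = number of terms shared by clusters i and j
    overlap = [[len(cluster_terms[i] & cluster_terms[j]) for j in range(n)]
               for i in range(n)]
    # Assumption: V sums the overlaps over all ordered pairs, self-pairs included
    v_total = sum(sum(row) for row in overlap)
    row_sum = [sum(row) for row in overlap]

    assoc = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if overlap[i][j] > 0:
                value = math.log(overlap[i][j] * v_total / (row_sum[i] * row_sum[j]))
                # Entries below zero are set to zero, as described for Eq. 1
                assoc[i][j] = max(value, 0.0)
    return assoc
```

Pairs with no overlapping terms are left at zero directly, since the logarithm is undefined for them; this matches the intent of treating unobserved pairs as having no association.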
Figure 2: Overall process in CaCE
3.2 Overview of CaCE
CaCE includes three main phases for discovering cluster evolution, as depicted by
Fig. 2. (1) Firstly, it uses NMF to identify the groups of similar clusters over the
time/domain using the inter- and intra-cluster associations. This allows identify-
ing similar clusters within the cluster solutions C spanned across the time/domain
k. (2) Secondly, the loosely attached clusters in a cluster group are separated if
they do not contain sufficient density to be included in the group based on the
term frequencies of the cluster with respect to the maximum term frequency of
the cluster group. This allows the cluster groups to be tightly cohesive based on
the common terms that they share. (3) Finally, CaCE visualizes the global cluster
evolution patterns of emergence, persistence, growth and decay across time using
a k-partite graph where nodes represent clusters and edges represent relationships
between clusters such as persistence, split and merge considering cluster groups.
A cluster evolution with all state changes of a cluster lifecycle (i.e., birth, death,
split and merge) can be tracked with this visualization.
Figure 3: Example Cluster Evolution in Education domain (clusters Mathematics, Archeology and IT over time stamps t1, t2 and t3)
Example: Consider a document collection in a university archive collected over
three years. Application of CaCE shows an example of the evolution in clusters
in this corpus with the internal cluster state changes as displayed in Fig. 3. It
shows Mathematics as a persistent cluster over the considered period of time by
showing the progression of a similar cluster in each time stamp. IT, which is
born in t2, shows an emerging pattern. It shows growth with a split at
t3, where it has two similar clusters. In contrast, Archeology shows a decay with a
merge between t1 and t2, followed by death at t2, as it has no similar cluster
in t3.
3.3 Cluster association-aware Matrix Factorization
Matrix Factorization
The aim of CaCE is to identify the global cluster evolution, showing the trends in
how groups of terms have evolved over the time/domain. The first step is to
identify groups of common clusters in the high-dimensional sparse “intra-cluster
association” matrix S using the lower dimensional approximation. NMF, which
takes fewer parameters and produces coherent topics compared to other popular
dimensionality reduction methods such as LDA [21], is used in this approximation.
In traditional NMF [3], the sparse matrix $S \in \mathbb{R}^{M \times N}$ is approximated by learning
$W \in \mathbb{R}^{M \times g}$ and $H \in \mathbb{R}^{N \times g}$, where g is the number of cluster groups, as follows.
\[
S \approx W H^{T} \tag{2}
\]
In order to find the best groupings of clusters in intra-cluster association matrix S,
we propose to utilize the latent information within the inter-cluster association
matrix A ∈ RN×N . In this way, we take advantage of co-clustering, finding
commonalities amongst the terms based on the clusters in which they appear
as well as finding commonalities amongst the clusters based on the terms they
share. The symmetric NMF [83] is applied to A for generating two commutative
matrices, $H_c \in \mathbb{R}^{N \times g}$ and $H \in \mathbb{R}^{N \times g}$, where g is the number of cluster groups, as
follows.
\[
A \approx H H_c^{T} \tag{3}
\]
Objective Function
CaCE proposes to use both these learning processes to discover cluster groups,
as defined in the following objective function:
\[
\min_{W,H \ge 0} \|S - W H^{T}\|_F + \min_{H,H_c \ge 0} \|A - H H_c^{T}\|_F + \alpha \|W\|_1 \tag{4}
\]
We approximate the intra-cluster association matrix S and inter-cluster association
matrix A with the minimum learning error. We introduce L1 regularization
on the factor matrix W to promote sparsity, control overfitting and
highlight the distinguishing terms. This can be considered as sparse dictionary
learning, which models the sparse input data representation using only a
few (important) terms of the dictionary learned from the data itself [19]. Prior
research on traditional NMF has found this constraint to be effective for detecting
deviations or novelty in text data [99]. We conjecture that this constraint will be
able to discriminate cluster groups more effectively.
Solving the optimization problem
We propose to use the Block Coordinate Descent (BCD) algorithm [103] to op-
timize the objective function in Eq. 4. The BCD algorithm divides the matrix
members into several disjoint subgroups and iteratively minimizes the objective
function with respect to the members of each subgroup at a time. It relies on the
most recent values of the members for solving sub-problems related to their up-
dates. When solving sub-problems depend on each other, they must be computed
sequentially to make use of the most recent values for BCD.
CaCE solves these interdependent sub-problems sequentially starting from W .
The most recent values of members for the first iteration are zeros set at the
initialization. Firstly, the BCD update rule has been used for finding W in the
NMF optimization using the intra-cluster association matrix S and initial matrix
H. The matrix H is then updated using the current values of W and other
members. Finally, Hc is updated using the inter-cluster association matrix A and
the most recent values of H. This is done for each g′ ∈ g.
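\[
W_{(:,g')} \leftarrow \left[ W_{(:,g')} + \frac{(SH)_{(:,g')} - (W H^{T} H)_{(:,g')}}{(H^{T} H)_{(g',g')}} \right] \tag{5}
\]
\[
H_{(:,g')} \leftarrow \left[ H_{(:,g')} + \frac{(S^{T} W)_{(:,g')} + (A H_c)_{(:,g')}}{(W^{T} W)_{(g',g')} + (H_c^{T} H_c)_{(g',g')}} - \frac{(H H_c^{T} H)_{(:,g')} + (H W^{T} W)_{(:,g')}}{(W^{T} W)_{(g',g')} + (H_c^{T} H_c)_{(g',g')}} \right] \tag{6}
\]
\[
H_{c(:,g')} \leftarrow \left[ H_{c(:,g')} + \frac{(AH)_{(:,g')} - (H_c H^{T} H)_{(:,g')}}{(H^{T} H)_{(g',g')}} \right] \tag{7}
\]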
This enables the decomposition process to include both inter- and intra-cluster
associations. In each iteration, through these sequential updates of the factor
matrices W , H and Hc, CaCE minimizes the objective function in Eq. 4.
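The sequential column-wise updates of W, H and Hc can be sketched with NumPy as below. This is an illustrative sketch under several assumptions rather than the original implementation: the brackets in Eqs. 5-7 are read as a projection onto non-negative values; the factors are initialized with random non-negative values (an all-zero start would make the denominators vanish); a small `eps` guards the divisions; the L1 weight α is omitted because it does not appear in Eq. 5; and the H update uses the product H (Hc^T Hc), which is what differentiating the squared error gives and is an assumption about the intended form of Eq. 6. The function name and the returned loss history are hypothetical.

```python
import numpy as np


def cace_bcd(S, A, g, n_iter=100, eps=1e-10, seed=0):
    """Column-wise block coordinate descent for S ~ W H^T and A ~ H Hc^T."""
    rng = np.random.default_rng(seed)
    M, N = S.shape
    W = rng.random((M, g))   # term x group
    H = rng.random((N, g))   # cluster x group
    Hc = rng.random((N, g))  # cluster x group (second perspective)
    history = []
    for _ in range(n_iter):
        for k in range(g):
            # Eq. 5: update column k of W from S and the current H
            HtH = H.T @ H
            W[:, k] = np.maximum(
                0.0, W[:, k] + ((S @ H)[:, k] - (W @ HtH)[:, k]) / (HtH[k, k] + eps))
            # Eq. 6: update column k of H from S, A and the current W, Hc
            WtW = W.T @ W
            HctHc = Hc.T @ Hc
            denom = WtW[k, k] + HctHc[k, k] + eps
            pull = (S.T @ W)[:, k] + (A @ Hc)[:, k]
            push = (H @ HctHc)[:, k] + (H @ WtW)[:, k]
            H[:, k] = np.maximum(0.0, H[:, k] + (pull - push) / denom)
            # Eq. 7: update column k of Hc from A and the most recent H
            HtH = H.T @ H
            Hc[:, k] = np.maximum(
                0.0, Hc[:, k] + ((A @ H)[:, k] - (Hc @ HtH)[:, k]) / (HtH[k, k] + eps))
        # Squared reconstruction error after this sweep (monitoring only)
        loss = np.linalg.norm(S - W @ H.T) ** 2 + np.linalg.norm(A - H @ Hc.T) ** 2
        history.append(loss)
    return W, H, Hc, history
```

Each column update is the exact non-negative minimizer of the squared error with all other columns fixed, so the recorded loss should not increase from one sweep to the next.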
The factorization process generates two perspectives of cluster × group matrices
H and Hc in lower dimensional space. This lower rank approximation of the higher
dimensional cluster × term matrix gives a dense representation compared to the
original. It is conjectured that the lower dimensional representation that has high
co-occurrences is able to battle the sparseness related issues in high dimensional
data clustering. CaCE forms a final cluster group matrix HF based on the max-
imum pairwise coefficient of H and Hc. This allows us to identify the similarity
of a cluster with groups compensating weaknesses in the learning process of each
single perspective.
\[
H_F = \max(H, H_c) \tag{8}
\]
The final cluster assignment vector hf is defined using the hard cluster assignment
policy. A cluster group that possesses the highest coefficient within HF is used
as the group for a specific cluster.
\[
h_f = \arg\max \sum_{i=1}^{g} \left(H_{F(:,i)}\right) \tag{9}
\]
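The combination of the two perspectives and the hard assignment can be sketched as follows. The sketch assumes that the maximum in Eq. 8 is taken element-wise and that the assignment in Eq. 9 picks, for each cluster (row), the group (column) with the largest coefficient; the function name is hypothetical.

```python
def assign_groups(H, Hc):
    """Element-wise maximum of H and Hc, then hard group assignment per cluster.

    H, Hc: lists of rows (clusters), each row holding g group coefficients.
    Returns (HF, hf) where hf[i] is the group index assigned to cluster i.
    """
    HF = [[max(a, b) for a, b in zip(row_h, row_c)]
          for row_h, row_c in zip(H, Hc)]
    # For each cluster, choose the group with the highest coefficient in HF
    hf = [max(range(len(row)), key=row.__getitem__) for row in HF]
    return HF, hf
```

Taking the larger of the two coefficients lets a strong signal from either factor decide the group, which is the compensation between perspectives described above.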
3.4 Cohesive cluster groups based on term density
The above matrix factorization process forces each cluster in a cluster solution
to be included in a cluster group. This may result in loosely connected clusters
to reside within a cluster group due to the fewer terms shared with others in
the group. To handle this, we propose the density concept that determines the
strength between a cluster and its associated cluster group considering the term
frequencies. More specifically, the density value of a cluster ci is defined as the
ratio of the term frequencies within the cluster using each term wj ∈ ci to the
maximum term frequency of the corresponding cluster group gz ∈ g, as follows.
\[
Den_{c_i} = \frac{\sum_{j=1}^{|w_{c_i}|} tf(w_j)}{\max\left[\forall_{x=1}^{|g_z|}\, tf\left(\sum_{j=1}^{|w_{c_x}|} (w_j)\right)\right] \times |w_{c_i}|} \tag{10}
\]
Density values that fall within the first quantile (‘mean - standard deviation’) within
a group indicate the clusters with the lowest densities. CaCE uses this threshold to
separate the loosely connected clusters ensuring uniform term distribution within
a group. This allows identification of a set of cohesive cluster groups over the
time. A cluster that receives the density value less than the set threshold is
considered ‘inconsistent’ and its density value is set to zero. A cluster with zero
density value is indicated as a new singleton cluster group within the visualization
step.
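The density-based refinement can be sketched as below. The reading of Eq. 10 used here is an assumption: the numerator is the total term frequency of the cluster, and the denominator is the largest total term frequency of any cluster in the same group multiplied by the number of terms in the cluster. The use of the population standard deviation and the function name are also assumptions.

```python
from statistics import mean, pstdev


def density_filter(group_clusters):
    """Density per cluster within one cluster group, with low values zeroed.

    group_clusters: dict mapping cluster id -> dict of term -> frequency,
    containing only the (non-empty) clusters assigned to one group.
    """
    totals = {c: sum(tf.values()) for c, tf in group_clusters.items()}
    max_total = max(totals.values())
    density = {c: totals[c] / (max_total * len(group_clusters[c]))
               for c in group_clusters}
    # Threshold described in the text: mean minus standard deviation
    threshold = mean(density.values()) - pstdev(density.values())
    # Clusters below the threshold are treated as inconsistent (density 0)
    return {c: (d if d >= threshold else 0.0) for c, d in density.items()}
```

A cluster whose returned density is zero would then be shown as its own singleton group in the visualization step.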
3.5 Visualization of Cluster evolution with a k-partite graph
CaCE proposes to visualize all cluster dynamics including birth, death, split and
merge within a k-partite graph. The set of clusters within the cluster solution
in time ts is represented with the respective partite s and each distinct cluster
group across k partite is uniquely identified with a color code. Each cluster in
the s > 1 partite in the graph is compared with each cluster in its predecessor
partite to add edges between two clusters to mark them as similar if they belong
to the same group.
A cluster pair in two successive partites is eligible to have an edge between them
for being similar, if:
• they belong to the same group and either both of them possess zero density
values or both of them possess non-zero density values.
Figure 4: Algorithms of CaCE
In contrast, a cluster pair in two successive partites is not eligible to have an
edge between them, if:
• one of them has zero density value, though they belong to the same group.
The color code of the cluster with non-zero density is updated with a non-existing
color in the graph to separate it from the current cluster group. However, a cluster
pair in nonconsecutive partites is not considered to have an edge between them.
This process continues in an incremental manner to represent the cluster evolu-
tion spanned across time t1 to tk. The k-partite graph allows CaCE to identify
birth, death, as well as growing and decaying patterns in clusters, within the
period through colors and edges. Application of Definition 1 on the drawn edges
identifies the corresponding patterns:
• a cluster that appears in time ts (s ≤ k) that does not have any edge to
a cluster in time ts−1 marks the birth of that cluster, which represents an
emerging pattern,
• a cluster that appears in time ts (s < k) that does not have any edge to a
cluster in time ts+1 marks the death of that cluster,
• a cluster that appears in time ts (s < k) with multiple edges to clusters in
time ts+1 marks the split of that cluster showing a growth pattern,
• a cluster that appears in time ts (s ≤ k) with multiple edges to clusters in
time ts−1 marks the merge of that cluster showing a decay pattern,
• a cluster born in time ts (s < k) that continues across time with a single
edge to the succeeding time stamp ts+1 shows a persistent pattern.
This is further assisted by the colors to uniquely identify the similar clusters that
belong to the same group. Fig. 4 shows the algorithms of CaCE for discovering
the cluster evolution in a document corpus.
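The edge rules and the state rules listed above can be sketched together as follows. The data layout, function names and state labels are assumptions made for illustration; in particular, clusters in the first partite have no predecessor and are therefore labeled as births by the rule as stated.

```python
def build_edges(partites):
    """Edges between successive partites of the k-partite graph.

    partites: list over time stamps; each item maps a cluster id to a
    (group, density) pair. An edge needs the same group and densities that
    are either both zero or both non-zero.
    """
    edges = []
    for s in range(1, len(partites)):
        for u, (group_u, den_u) in partites[s - 1].items():
            for v, (group_v, den_v) in partites[s].items():
                if group_u == group_v and (den_u > 0) == (den_v > 0):
                    edges.append(((s - 1, u), (s, v)))
    return edges


def lifecycle_states(partites, edges):
    """Assign birth / death / split / merge labels from the drawn edges."""
    out_deg, in_deg = {}, {}
    for src, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
    last = len(partites) - 1
    states = {}
    for s, partite in enumerate(partites):
        for cluster in partite:
            node = (s, cluster)
            labels = set()
            if in_deg.get(node, 0) == 0:
                labels.add("birth")   # no edge to the previous time stamp
            if s < last and out_deg.get(node, 0) == 0:
                labels.add("death")   # no edge to the next time stamp
            if out_deg.get(node, 0) > 1:
                labels.add("split")   # multiple edges forward
            if in_deg.get(node, 0) > 1:
                labels.add("merge")   # multiple edges backward
            states[node] = labels
    return states
```

A node with an empty label set and a single edge on each side corresponds to a persistent cluster.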
4 Empirical Analysis
We evaluate three phases of CaCE to show its effectiveness. The quantitative
comparison against baselines using ground-truths evaluates the 1st phase
of CaCE, which uses inter-cluster association, to measure the accuracy of cluster
group identification. The impact of the 2nd phase with “density” in obtaining
cohesive cluster groups for accurate cluster group identification is evaluated
quantitatively with and without using the density concept. The 3rd phase, which shows
the evolution patterns of clusters through edges on the k-partite graph visualization,
is compared qualitatively against baseline methods that are able to visualize the cluster
evolution. Further, we compare the time efficiency and computational
complexity of CaCE against different cluster group identification methods
as detailed in Section 4.2. Other concepts used in the proposed method,
together with the parameters/thresholds, are analyzed in the sensitivity analysis
section. Finally, we conduct two case studies to qualitatively interpret the power
of CaCE in identifying cluster evolution in real-time data with a large number
of clusters that span a larger period of time.

Table 2: Summary of the datasets used for the experiments.
Name | # of clusters for each time-stamp cluster solution | Ground truth evolution
20Newsgroup (DS1) | t0: 3, t1: 5, t2: 4 |
Patent (DS2) | t0: 5, t1: 5, t2: 5 |
Health (DS3) | t0: 5, t1: 5, t2: 5, t3: 5 |
Sports (DS4) | t0: 4, t1: 3, t2: 2, t3: 4, t4: 4 |

Datasets: We use two types of datasets with medium length text vectors (con-
taining < 150 terms on an average, i.e., DS1 and DS2) and short length text
vectors (containing < 50 characters on an average, i.e., DS3 and DS4). As shown
in Table 2, for each dataset, a few categories (or domains) spanned across the
time have been selected/created to have the ground truth information, in terms
of the number of clustering solutions and the number of clusters in each clustering
solution.
• For the 20Newsgroup dataset (DS1), we selected four categories (Social,
Talk, Recreational and Computer) and spread them across three time periods.
• For the Patent abstract dataset (DS2), four categories (Distributed Production,
Microbiota, Computer Vision and Block Chain) of abstracts were
collected during three months of 2017 to make clusters.
• For the Health-related tweets (DS3), media posts sent to six disease-specific
Twitter groups (Diabetes, Mental, Kidney, Lung, Heart, Cancer) within a
four-year period (2014-2017) were selected to make clusters.
• For the Sport-related tweets (DS4), media posts sent to four sports-specific
Twitter groups (Cycling, Netball, Cricket and Soccer) within a five-year
period (2010-2014) were selected to make clusters.
These clusters were placed so as to show emerging, persistent, growth and decay
patterns over time, as in Table 2. We have made these datasets available to
researchers¹.
¹ https://drive.google.com/open?id=1gHoEm-R9S2OkiN9LRVNk3JLVeGpRdWXn
Baselines: Several benchmarking methods were used to evaluate the accuracy of
cluster group identification: (1) general NMF [103] on the intra-cluster
association matrix S; (2) the state-of-the-art clustering evolution method
TextLuas [63], which uses the Jaccard coefficient to determine the similarity of
cluster pairs in consecutive time stamps; and (3) a variation of CaCE (named
CaCE-CS) that uses cosine similarity for the inter-cluster association matrix
instead of the SGNS representation based on the number of overlapping terms.
Additionally, the topic evolution method proposed in [41] for social media with
short text is compared with CaCE in identifying the evolution patterns.
Experiments were run using Python 3.5 on a 1.2 GHz 64-bit processor with 16 GB
of memory.
Evaluation Measures: The standard pairwise harmonic average of precision and
recall (F1-score) and Normalized Mutual Information (NMI) were used as the
evaluation measures to assess the quality of cluster groups [165]. Evolution
patterns of clusters, including emerging, persistent, decay and growth, which
are indicated through state changes, are automatically identified within the
visualization using the top-frequent terms in each cluster.
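The two quantitative measures can be sketched as follows. This is a minimal
illustration rather than the evaluation code used in the experiments: the
function names are hypothetical, each item stands for a cluster whose label is
its (ground-truth or predicted) cluster group, and NMI is assumed to be
normalized by the arithmetic mean of the two entropies, which is only one of
several common conventions.

```python
from collections import Counter
from itertools import combinations
from math import log


def pairwise_f1(truth, pred):
    """Pairwise F1: a pair of items counts as positive if both share a group."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_pred and same_truth:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_truth:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Mutual information divided by the mean of the two entropies (assumed)."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    count_t, count_p = Counter(truth), Counter(pred)
    mi = sum(
        (c / n) * log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(truth) + entropy(pred)) / 2
    # Both assignments are a single group: treat them as identical.
    return mi / denom if denom > 0 else 1.0


# Hypothetical toy data: six clusters, ground-truth groups vs. predicted groups.
truth = ["A", "A", "B", "B", "C", "C"]
pred = [0, 0, 1, 1, 1, 2]
f1_score = pairwise_f1(truth, pred)
nmi_score = nmi(truth, pred)
```

Both measures are invariant to how the groups are named, so a predicted grouping
only needs to place the same clusters together to score well.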
4.1 Accuracy Analysis
Quantitative Interpretation
Results in Table 3 show that CaCE produces higher accuracy in identifying
cluster groups spanning time/domain than all other methods, due to its use of
inter-cluster association information in the matrix factorization, modeled from
the number of common terms with SGNS. Next in line is the modified version of
CaCE, CaCE-CS, which uses cosine similarity on representative terms to identify
the inter-cluster association. Cosine similarity normalizes the similarity
value to the 0-1 range and fails to maximize the probability of closely
associated clusters, as the original CaCE does by using the number of
overlapping terms with SGNS. Cosine similarity, which measures the cosine of the
angle between the vectors that represent the clusters, is inferior to the
cardinality of the term set intersection between clusters for modeling
inter-cluster association. TextLuas, which employs the Jaccard similarity
coefficient based on the number of common terms in clusters, links clusters in
consecutive time stamps if the coefficient exceeds a threshold (set as 0.5).
However, this naive approach is inferior in identifying global evolution over
the considered period. The proposed NMF with intra- and inter-cluster
associations used in CaCE is able to accurately project the high dimensional
term × cluster representation into a lower dimensional space for identifying
global cluster groups. In contrast, when the original NMF is used on the
term × cluster matrix, it is not able to capture the cluster groups within the
projected lower dimensional space, which results in lower accuracy. This impact
is worse when the number of clusters varies significantly between cluster
solutions, as in DS4 (shown in Table 2). As shown by the results, CaCE is
capable of handling both varying cluster numbers and uniformly distributed
clustering solutions over multiple time stamps.

Table 3: Performance comparisons in identifying cluster groups accurately with
different datasets, methods, and metrics
           F1-score                           NMI
Dataset    CaCE  CaCE-CS  NMF   TextLuas      CaCE  CaCE-CS  NMF   TextLuas
DS1        0.84  0.75     0.60  0.60          0.82  0.75     0.48  0.67
DS2        0.68  0.68     0.65  0.56          0.68  0.68     0.37  0.61
DS3        0.58  0.57     0.34  0.58          0.57  0.57     0.17  0.51
DS4        0.74  0.66     0.51  0.53          0.65  0.54     0.06  0.46
Average    0.71  0.67     0.53  0.57          0.68  0.64     0.27  0.56
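The association measures compared in this analysis can be illustrated with a
small sketch. It is not the implementation of TextLuas or CaCE-CS: the function
names and the toy term-frequency dictionaries are hypothetical, and the linking
rule assumes that "beyond a threshold" means strictly greater than 0.5.

```python
from math import sqrt


def overlap_count(a, b):
    """Cardinality of the term-set intersection of two clusters."""
    return len(set(a) & set(b))


def jaccard(a, b):
    """Shared terms divided by all distinct terms of the two clusters."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine of the angle between two term-frequency vectors (dicts)."""
    dot = sum(freq * b.get(term, 0) for term, freq in a.items())
    norm_a = sqrt(sum(f * f for f in a.values()))
    norm_b = sqrt(sum(f * f for f in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_consecutive(prev_clusters, next_clusters, threshold=0.5):
    """Link cluster pairs in consecutive time stamps whose Jaccard
    coefficient exceeds the threshold (a purely local comparison)."""
    return [
        (p, q)
        for p, terms_p in prev_clusters.items()
        for q, terms_q in next_clusters.items()
        if jaccard(terms_p, terms_q) > threshold
    ]


# Hypothetical term-frequency dictionaries for clusters at t0 and t1.
t0 = {"c0": {"game": 5, "team": 4, "player": 3, "hockey": 2}}
t1 = {
    "c1": {"game": 4, "team": 4, "player": 2, "baseball": 1},
    "c2": {"gun": 6, "weapon": 3, "fbi": 2, "team": 1},
}
links = link_consecutive(t0, t1)
```

Here c0 and c1 share three of five distinct terms (Jaccard 0.6) and are linked,
while c0 and c2 share only one term and are not. Because each decision looks at
a single pair of adjacent time stamps, such links carry no information about
groups that span the whole period.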
Fig. 5(a) shows the impact of applying regularization to the objective function
in Eq. (4) on identifying cluster groups accurately. L1 regularization on W in
the reconstruction error promotes sparsity in the factor matrix W, which
represents the term × cluster groups. Based on the representative terms, this
is more effective for identifying distinct cluster groups in all the datasets,
as depicted by the higher F1-score and NMI in Fig. 5(a).
Figure 5: Impact of regularization and density concept
We also analyze the effectiveness of the term frequency-based density concept
used in CaCE for identifying accurate cluster groups over time. The density
defined in Eq. (10), based on term frequencies, is capable of filtering out
clusters that are loosely attached to a group. CaCE uses the density value with
a threshold (explained in the sensitivity analysis section) to separate the
loosely connected clusters by setting the less dense values to zero, and forms
new singleton cluster groups from those clusters. This ensures a more uniform
term distribution within a group than when CaCE operates without this
density-based filtering. In general, this allows us to identify a set of
cohesive cluster groups over time, as shown by the improved performance in
Fig. 5(b). In dataset DS4, where fewer common terms can be seen according to the
top frequent terms of clusters, the density-based filtering results in slightly
poorer performance.
Figure 6: Visualization of Cluster Evolution in DS1 with CaCE (k-partite graph
over time stamps t0 - t2; each cluster is labelled with its top frequent terms)
Qualitative Interpretation
Fig. 6 - Fig. 9 give insight into the evolution patterns obtained by CaCE,
showing similar clusters in a group with a unique color. We label each cluster
with its top 5 - 6 frequent terms to represent the included concept. According
to the derived evolution patterns in Fig. 6 for DS1, (1) a persistent cluster
related to ‘computer technology’ appears in blue, (2) there is decay in
information related to ‘games’, as revealed by the merging of clusters in green
from t1 to t2, and (3) another cluster group, a mix of ‘religion’ and ‘war’,
appears in yellow and shows both a split and a merge of clusters. This reveals a
growth pattern from t0 to t1 through the split and a decay pattern from t1 to t2
through the merge. CaCE also identifies a cluster in red as a separate, isolated
cluster group. It should be noted from the results in Table 3 that, although
CaCE achieves the highest accuracy, it is not 100%. It fails to identify the
similarity of this cluster (marked in red) to the cluster group ‘game’ (marked
in green), which seems to be highly similar according to the top-frequent terms
of the two clusters. However, an investigation of the cluster vector shows that
this cluster includes many other terms that are not part of the green cluster
group, and only these few terms are shared between the two.
Figure 7: Visualization of Cluster Evolution in DS2 with CaCE (k-partite graph
over time stamps t0 - t2; each cluster is labelled with its top frequent terms)
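The way evolution patterns are read from the visualizations (a split signals
growth, a merge signals decay, continuous appearance signals persistence, and a
new-born cluster signals an emerging pattern) can be approximated by a small
sketch. This is a simplification under stated assumptions and not CaCE's
procedure: it uses only the number of clusters a group holds at each time stamp,
whereas the actual patterns are read from the edges of the k-partite graph. The
function name and the example counts are hypothetical.

```python
def label_transitions(counts):
    """Label each consecutive pair of time stamps for one cluster group.

    counts[t] is the number of clusters the group holds at time stamp t.
    """
    labels = []
    for prev, cur in zip(counts, counts[1:]):
        if prev == 0 and cur == 0:
            labels.append("absent")
        elif prev == 0:
            labels.append("emerging")    # birth of the group
        elif cur == 0:
            labels.append("death")
        elif cur > prev:
            labels.append("growth")      # more clusters, read as a split
        elif cur < prev:
            labels.append("decay")       # fewer clusters, read as a merge
        else:
            labels.append("persistent")
    return labels


# Hypothetical per-group cluster counts over three time stamps t0, t1, t2.
groups = {
    "persistent_group": [1, 1, 1],
    "split_then_merge": [1, 2, 1],
    "late_birth": [0, 0, 1],
}
patterns = {name: label_transitions(c) for name, c in groups.items()}
```

With these counts, the second group is labelled growth from t0 to t1 and decay
from t1 to t2, which mirrors the split-then-merge reading described for the
mixed group in DS1.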
Fig. 7 represents the cluster evolution identified in the Patent dataset (DS2).
(1) It shows that CaCE is able to capture the growth of the ‘block chain’
related cluster group in yellow, as revealed by its splits between t1 and t2.
(2) It identifies the ‘computer vision’ related cluster group in green as a
persistent pattern from t0 to t2, which should have been shown as a decay of
clusters according to the ground truth. The top terms within the non-linked
clusters give evidence for this deviated pattern, as they show slight
variations. (3) CaCE correctly identifies the birth of the cluster in grey,
showing an emerging pattern, which falls under the ‘Microbiota’ cluster group
according to the ground truth. This Patent dataset shows several related
clusters as separate groups. A close investigation reveals that these clusters
are related but contain several unrelated terms. Therefore, CaCE identifies
them as new groups with unique colors.
Figure 8: Visualization of Cluster Evolution in DS3 with CaCE (k-partite graph
over time stamps t0 - t3; each cluster is labelled with its top frequent terms)
Table 3 shows that DS3 has the lowest performance in identifying cluster
similarity for group formation. The patterns obtained by CaCE are visualized in
Fig. 8. (1) The growth of ‘mental’ health-related clusters in blue is
identified, in line with the ground truth, through the splits. (2) It
identifies the birth of the ‘diabetes’ cluster in pink at t3 as an emerging
pattern. However, it fails to identify the exact evolution as per the ground
truth. (3) CaCE shows a mixed group with different types of clusters in yellow
as a pattern that decays from t0 to t2 through the merges. A closer
investigation of the top frequent terms reveals that terms common to many
diseases occur with high frequency in this group. This prevents CaCE from
recognizing the different cluster groups separately.
Fig. 9 shows the evolution of clusters in DS4 displayed by CaCE. (1) It
correctly identifies the ‘soccer’ cluster group in blue, which is persistent
over time through its continuous appearance in each consecutive time stamp.
(2) The growth pattern of ‘cricket’-related clusters in green from t3 to t4 and
the decay pattern of ‘cycling’-related clusters in yellow from t0 to t1 are
also partially identified as per the ground truth in Table 2. The deviations
result from terms that differ between clusters and cause these clusters to be
identified as unmatched patterns.
Figure 9: Visualization of Cluster Evolution in DS4 with CaCE (k-partite graph
over time stamps t0 - t4; each cluster is labelled with its top frequent terms)
Comparison with state-of-the-art TextLuas: Fig. 10 - Fig. 13 show the
visualization of cluster evolution given by TextLuas. In DS1, it fails to
identify the persistence pattern of ‘computer technology’ and the decay pattern
of ‘games’ identified by CaCE. TextLuas, which is based on local evolution
patterns between cluster pairs in consecutive time stamps, is not capable of
identifying these cluster dynamics accurately.
Figure 10: Visualization of Cluster Evolution in DS1 with TextLuas
Fig. 11 shows the evolution of clusters in DS2 according to TextLuas. Unlike
CaCE, it could not identify the growth pattern of the cluster group ‘block
chain’ or the birth of ‘Microbiota’ in DS2. However, it identifies the decay
pattern of ‘computer vision’, which CaCE identifies as a persistence pattern. A
closer investigation of this pattern reveals that TextLuas mixes this cluster
group with the other groups. This shows the inability of simple Jaccard
similarity-based cluster comparison to identify cluster groups as accurately as
CaCE, which relies on both inter- and intra-cluster associations.
Figure 11: Visualization of Cluster Evolution in DS2 with TextLuas
Figure 12: Visualization of Cluster Evolution in DS3 with TextLuas
Fig. 12 and Fig. 13 show the cluster evolution patterns given by TextLuas for
DS3 and DS4. Both clearly show a mix of cluster groups compared to CaCE. This
confirms two facts: (1) global cluster evolution patterns cannot be accurately
identified through local connection analysis; and (2) Jaccard similarity, which
relies on intra-cluster similarity, is not sufficient for identifying cluster
associations.
Figure 13: Visualization of Cluster Evolution in DS4 with TextLuas
Benchmarking with other techniques: We compare our visualization results with a
recent method for emerging topic detection in short text [41]. This method uses
a set of heuristics, such as energy based on term frequencies, to identify terms
that have become important in the current time period; it then creates a
directed term-correlation graph and identifies the topics from the previous
time window that persist in the current time window. Iterative graph traversal
in this method is able to identify the topics that are emerging and track them
over time. Table 4 shows the emerging topics identified using topic words in
[41]. In DS1, it identifies the ‘talk’ related topic as an emerging topic in the
20 Newsgroups dataset with the terms people, war and fbi, while identifying
‘blockchain’ as an emerging topic in DS2. In DS3, it is capable of identifying
the ‘heart’ related topics, and in DS4 the ‘cricket’ and ‘soccer’ related
topics. However, this method, which is based on a graph theoretic temporal topic
model [41], shows all the identified topics as emerging topics for these
datasets. In contrast, CaCE is able to identify emergent, persistent and
diminishing concepts, as depicted in Fig. 8 and Fig. 9.

Table 4: Results obtained by [41]: Comparative Outcome
Dataset    Emerging topics
DS1        (1) people war fbi
DS2        (1) invention blockchain relates
DS3        (1) today sign reach (2) cardiac (3) womenshearts
DS4        (1) Clarke (2) Grella (3) career (4) Mr (5) mrcricket (6) Veteran
           BREAKING 2014
Note: As topic terms are derived from full tweet messages in DS3 and DS4, they
include hashtags as well.
In summary, CaCE shows higher performance than the benchmarking methods in
identifying similar clusters over the period (i.e., correct cluster groups), as
given in Table 3. This confirms the superiority of CaCE in globally identifying
evolution patterns (i.e., persistence, growth and decay), which relies on
accurate cluster group identification over the considered period. As revealed
by the results in Table 3 and Fig. 6 - Fig. 9, CaCE misses some evolution
patterns because some common terms appear in many cluster groups and in
sub-groups within cluster groups. Having said that, CaCE is the first method
that details comprehensive global evolution patterns with high accuracy and
informs the lifecycle of the main clusters (concepts) inherent in a corpus,
displayed through time/domain.
4.2 Efficiency and Complexity Analysis
The time comparison illustrated in Fig. 14 shows that the traditional NMF, which
considers a single matrix, consumes the least time. CaCE consumes more time than
the traditional NMF due to the inclusion of the additional inter-cluster
association matrix. The modified version, CaCE-CS, consumes much more time due
to the additional step of calculating cosine similarity between clusters. The
naive approach of calculating the Jaccard coefficient from the term
intersection in TextLuas also consumes less time on average. The higher
performance of CaCE, with 152% and 21% increases in average NMI as per Table 3
compared to NMF and TextLuas respectively, justifies its 2 - 6 times higher time
consumption. The computational complexity of CaCE, which is based on NMF, is
O(n^2), where n is the number of clusters. CaCE-CS has the same computational
complexity. However, the running times vary according to the additional matrices
and steps included in the approaches. TextLuas has a linear computational
complexity of O(rm), where r is the number of time stamps and m ≤ n is the
number of clusters in a generic time stamp.
Figure 14: Time taken by each method for identifying the evolution of clusters
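The quoted NMI gains follow directly from the averages in Table 3. As a short
worked check, with a hypothetical helper name and the Table 3 averages copied
in by hand:

```python
def relative_gain(new, baseline):
    """Percentage increase of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline


# Average NMI values as reported in Table 3.
avg_nmi = {"CaCE": 0.68, "CaCE-CS": 0.64, "NMF": 0.27, "TextLuas": 0.56}

gain_over_nmf = relative_gain(avg_nmi["CaCE"], avg_nmi["NMF"])
gain_over_textluas = relative_gain(avg_nmi["CaCE"], avg_nmi["TextLuas"])
print(f"{gain_over_nmf:.0f}% over NMF, {gain_over_textluas:.0f}% over TextLuas")
# prints "152% over NMF, 21% over TextLuas"
```

That is, (0.68 - 0.27) / 0.27 is about 1.52 and (0.68 - 0.56) / 0.56 is about
0.21, in agreement with the figures stated above.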
4.3 Sensitivity Analysis
One of the strengths of CaCE is modeling the inter-cluster association matrix
using Skip-Gram with Negative-Sampling (SGNS). We validate this empirically in
Fig. 15 by modeling the matrix A using just the number of overlapping terms
between clusters, and by modeling the same association in A with SGNS based on
probability. It shows that cluster associations modeled with SGNS, which is
able to predict the neighbors correctly, assist the sparse term × cluster
matrix factorization process in forming the lower dimensional cluster × group
matrix accurately. The inter-cluster association given by the cluster × cluster
association matrix using the number of overlapping terms (without any
weighting) between clusters gives lower performance, as it fails to boost the
distinction between clusters. As an exception to the general results, there is
not much gain from using SGNS in DS4, where the number of clusters varies
considerably between time stamps. We conjecture that, in this case, maximizing
the probability of close cluster pairs does not make much difference due to the
heterogeneous nature of the inter-cluster association matrix formed with the
number of overlapping terms.
Figure 15: Effectiveness of modeling as SGNS
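The difference between raw overlap counts and a probability-based weighting can
be illustrated with a sketch. The formula CaCE uses for A is defined earlier in
the chapter and is not reproduced here; the code below assumes the widely used
closed-form stand-in for SGNS, the shifted positive pointwise mutual information
(SGNS is known to implicitly factorize a shifted PMI matrix), so it should be
read as an approximation under that assumption. The function name, the shift
parameter k and the toy counts are hypothetical.

```python
from math import log


def sppmi(counts, k=1):
    """Shifted positive PMI of a symmetric count matrix.

    counts[i][j] is the number of terms shared by clusters i and j;
    k plays the role of the number of negative samples in SGNS.
    """
    n = len(counts)
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    weighted = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if counts[i][j] > 0:
                # The matrix is symmetric, so row sums double as column sums.
                pmi = log(counts[i][j] * total / (row_sums[i] * row_sums[j]))
                weighted[i][j] = max(pmi - log(k), 0.0)
    return weighted


# Hypothetical overlap counts between three clusters (diagonal unused).
overlap = [
    [0, 4, 1],
    [4, 0, 1],
    [1, 1, 0],
]
A = sppmi(overlap)
```

Each entry now measures how much more often two clusters share terms than their
overall overlap levels would suggest, rather than the unweighted count itself.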
The threshold used to determine the consistency of a cluster group with the
term frequency-based density is analyzed in Fig. 16(a). It shows that marking
density values less than ‘mean - standard deviation’ gives the best result in
all datasets other than DS3. Density values that fall within the first quantile
(mean - standard deviation) within a group indicate the clusters with the
lowest densities. Thus, this threshold is able to identify the less cohesive
clusters in terms of density. In DS3, the median shows the highest performance,
due to the higher occurrence of terms that are common to many clusters and act
as noise. CaCE uses ‘mean - standard deviation’ as the default threshold value
for determining uniform density in a cluster group.
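The default threshold can be sketched as follows. The density values themselves
would come from Eq. (10), which is not reproduced here, so the numbers below
are hypothetical; the function name is also hypothetical, and the population
standard deviation is an assumption, since the text does not say whether the
population or sample form is used.

```python
from statistics import mean, pstdev


def split_by_density(densities):
    """Split a cluster group into retained and loosely attached clusters.

    densities maps cluster id -> density of that cluster within the group.
    Clusters below 'mean - standard deviation' are returned separately so
    that each can be turned into its own singleton group.
    """
    values = list(densities.values())
    threshold = mean(values) - pstdev(values)
    kept = [c for c, d in densities.items() if d >= threshold]
    singletons = [c for c, d in densities.items() if d < threshold]
    return threshold, kept, singletons


# Hypothetical densities for the clusters of one group.
group = {"c1": 0.82, "c2": 0.78, "c3": 0.80, "c4": 0.35}
threshold, kept, singletons = split_by_density(group)
```

For these values the threshold is roughly 0.49, so c4 is separated as a
singleton group while c1, c2 and c3 remain together.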
CaCE focuses on identifying four major cluster dynamics (i.e., birth, death,
split and merge). It is natural to identify four groups of clusters aligned
with them, so that the evolution patterns within the groups can be studied.
However, in order to verify this number of cluster groups empirically, we also
experimented with different numbers of cluster groups. As shown in Fig. 16(b),
the maximum performance is obtained by setting four as the desired group number
for each dataset, confirming the conjecture of CaCE.
Figure 16: Parameter sensitivity for density and number of cluster groups
4.4 Case Studies: Research and Job Trend Analysis
Case Study I. We conducted a case study using the DBLP-ACM publication data²
to confirm the capability of CaCE to accurately detect evolution patterns over
a considerably long period of time (10 time stamps). We consider the DBLP-ACM
bibliographic titles related to Data Science within the period 1994-2003. The
purpose of this case study was to demonstrate the effectiveness of cluster
evolution regardless of the primary clustering solution over a long time
period; we use traditional NMF to generate the primary cluster solutions, with
three clusters per time stamp, and the clusters are interpreted using the top-3
terms of each cluster. Fig. 17 depicts the evolution patterns discovered in this
dataset, where clusters are represented by their top-3 frequent terms.
² https://www.openicpsr.org/openicpsr/project/100843/version/V2/view
Figure 17: Cluster evolution illustrated using publication data
Fig. 17 shows that research on database query and language technologies was at
its peak in 1994 to 1995 (yellow clusters, with the growth pattern revealed by
splits). Variations of database and management topics exist within this period,
such as data replication (in pink) and the emerging pattern of large object
databases (in green), shown by the new-born cluster. Later, from 1995 to 1996,
this group of clusters, which showed growth earlier, shows decay through
merges. Remarkably, this concept re-emerges in 2001 in the form of XML and web
semantic languages.
From 1997 to 1998, database technologies with commercial applications, such as
multidimensional databases and related querying algorithms (in green), show
decay, revealed by the merging of those concepts. A special concept of
transaction management information (in orange), born in 1997, is an emerging
pattern.
Data mining emerged in 1998 and grew into distributed database architecture and
data warehouse concepts in 1999, which is depicted through the split in the red
clusters. These concepts, in combination with query processing, form a separate
cluster group in blue that shows the decay of clusters with merges from 2000 to
2001. In 2003, CaCE captures the birth of a new cluster, query optimization,
which deviates from the rest of the concepts and is identified as an emerging
concept.
Generally, this case study traces the path of data science as it moves from
simple database management to web/XML-based databases through data mining and
warehousing over those years.
Case Study II. The aim of the second case study is to show the capability of
CaCE in handling a larger number of clusters within a time stamp for the
accurate identification of cluster evolution. The study uses the online job
posting data on the Kaggle website³, posted through the Armenian human resource
portal ‘CareerCenter’. We consider a subset of job postings that span 2004-2006;
the primary clusters in the clustering solution for each time stamp are
obtained using NMF with the number of clusters fixed to 10. CaCE is then applied
to identify the global evolution of these clusters over the years. Fig. 18
depicts the interesting evolution patterns revealed by CaCE for this dataset,
where clusters are represented by their top-5 frequent terms. The study reveals
the demand for and changes in certain professions over the years, and shows the
evolution of the skills most frequently required by employers.
In general, it shows how the demand for administrative, coordination, sales
and software related job positions evolves over these years. The
“administrative/director positions” revealed by the yellow cluster groups
indicate the changes in the scope and skills of the position over time. Over
2004-2006, it shows a growth pattern, with splits between 2004-2005 and
2005-2006. Director positions are posted for accounting/finance skills and
program implementation skills separately in 2004. In contrast, director
positions require both of these skills as qualifications in 2005. This changed
again to different job positions in 2006, as shown by the splits (i.e.,
consultant, finance officers, director supervision, program coordinator, etc.).
³ https://www.kaggle.com/madhab/jobposts
Figure 18: Cluster evolution illustrated using online job posting dataset
The cluster group depicted in green shows a mix of “administrative positions”
and “software developer” positions in 2004. CaCE understandably fails to
separate them, due to terms used by both of these groups, such as ‘design’ and
‘implementation’. Thus, it shows a decay pattern over 2004-2005 with the merge
of clusters. The “software developer” post is persistent within 2005-2006.
Furthermore, CaCE identifies a persistent pattern for jobs related to
area-specific programs over 2004-2005, in red. In 2005, it marks the death of
those positions.
CaCE identifies the demand for “customer care” positions in 2004, as marked by
the birth of the pink cluster. However, “customer sales” or “product sales”
related positions appear in 2005 with slight variations in the required skills,
as a new position (i.e., an emerging pattern) shown in the blue cluster group.
Over 2005-2006, this shows a decay pattern with a change in the necessary
skills (i.e., the ability to handle social and international activities).
Furthermore, CaCE discloses two emerging positions in 2005 with the birth of
“community coordination” and “rural program supervision” related clusters, in
olive and orange.
The general interesting observation in this cluster evolution is that the
skills required by a job position improve over time, while sometimes creating a
subset of job positions with specific skills. This case study confirms the
effectiveness of CaCE in identifying evolution in the presence of a considerably
larger number of clusters within a time stamp.
5 Conclusion
This paper proposes a novel Non-negative Matrix Factorization-based method
(CaCE) to discover the evolution of clusters across time/domain using inter-
and intra-cluster associations. CaCE assists the matrix factorization of the
sparse term × cluster matrix with inter-cluster relations. The inter-cluster
association matrix is built from the overlapping terms between clusters, modeled
with Skip-Gram with Negative-Sampling (SGNS). Further, we conjecture that the
term frequency-based density of clusters can be used to identify the
inconsistent clusters in these cluster groups, and thereby we form tight
cluster groups. We then visualize the evolution of clusters over time using a
k-partite graph, considering the important cluster dynamics of birth, death,
split and merge through the identified cluster groups. An extensive
experimental study has been conducted with both qualitative and quantitative
evaluation. Empirical results on several datasets, benchmarked against relevant
methods, show that CaCE discovers the emergence, persistence, growth and decay
of clusters with considerably higher accuracy. Extending this approach to be
independent of the clustering method used for each time stamp is left for our
future investigation.
Chapter 6
Conclusion and Future Directions
The exponential growth of text collections creates the need to identify
subgroups and deviated documents within a corpus, as well as to track dynamic
changes in a corpus over time or domain. Text mining leads to many applications,
such as effective information retrieval, community detection, concept mining,
fake news detection, emerging concept detection and many more [45, 79, 106,
147, 150]. Text mining research is challenged by the high-dimensional nature of
text and large collection sizes, which lead to poor accuracy, efficiency and
scalability in identifying similarity within text pairs. This thesis focuses on
unsupervised text mining methods using ranking concepts, effective density
estimation with ranking, matrix factorization, and matrix factorization-based
document expansion to minimize those challenges.
The main objective of this research work is to deal with the sparseness of text
representation, which results from higher-dimensional vector representation, in
order to accurately identify text similarity/dissimilarity for finding
clusters, outliers and dynamic changes of clusters. With this objective, the
thesis proposes novel algorithms for ranking-centered document clustering and
outlier detection methods. Furthermore, it presents a corpus-based document
expansion for short text clustering with NMF, and NMF-based subgroup
identification with the assistance of additional information to identify
clusters and cluster dynamics over time.
6.1 Summary of Contributions
Based on the literature review detailed in Chapter 2, the following gaps were
identified.
1. Lack of alternative research approaches to find documents’ neighborhoods,
such as IR ranking, for identifying text similarity in clustering, and lack of
text clustering methods dealing with short text documents.
2. Lack of research to solve the text outlier detection problem, especially in
the context of multiple classes of inliers. None of the existing works uses
term weighting-based ranking or IR ranking concepts in determining outlier
scores based on text dissimilarity.
3. Lack of research to explore global text cluster evolution over
time/domain to identify all the cluster states and patterns accurately, dealing
with the higher dimensional vector representation based on cluster
similarity.
This thesis aims to fill these research gaps by developing effective approaches
to identify text similarity, in order to present accurate and novel
unsupervised text mining algorithms for clustering, outlier detection and
cluster evolution tracking. These methods effectively apply the ranking concept
to identify document neighborhoods and extend it to
text-similarity/dissimilarity identification. Methods proposed in this thesis
follow the “cluster hypothesis” [91], which states that linked sets of
documents are relevant to the same request, and the “reversed cluster
hypothesis” [59], which theoretically proved that these documents should occur
in the same cluster. They confirm that an IR system internally adheres to
semantic relationships, as it is able to obtain document responses belonging to
the same group.
In comparison to a keyword-based search used with traditional IR systems, which
only consider syntactic similarity in obtaining relevant documents, methods pro-
posed in this thesis use document-driven queries that statistically represent the
whole document and are able to retrieve the relevant documents more accurately.
These relevant documents obtained via an IR system are explored for accurate
density estimation and to maintain the geometry structures among documents
for clustering. Furthermore, the proposed methods use refinement steps to obtain
the final solutions: (1) incorporating hubs, a set of small groups of frequent
neighbor documents, to identify the similarities, or (2) modeling data using
Skip-Gram with Negative Sampling (SGNS) considering the context. These steps,
employed together with IR ranking, are consistent with semantic embedding and
minimize issues with the syntactic nature of the VSM representation.
In addition, the use of inverse document frequency term ranking is exploited in the
thesis to define outliers, together with IR ranking responses and ranking scores.
Matrix factorization is the other concept used to identify text similarity
effectively: the proposed methods apply it to identify the groups in
lower-dimensional space. Clusters go through different stages of their life cycle
over time which is important to monitor in decision making. Therefore, this thesis
explores the global evolution of cluster dynamics in the higher dimensional text
with matrix factorization using additional relationships among clusters to avoid
the information loss. Also, the use of topic vectors and terms obtained through
matrix factorization for corpus-based document expansion is explored to solve
the extreme sparseness in short text.
Clustering: The first research contribution is presenting a set of novel text
clustering methods that identify text similarity accurately. The IR ranking re-
sponses are used in constructing a mutual neighbor graph through shared neigh-
bors for density estimation. Hubs, which are evident in higher dimensional data,
are identified with shared neighbor sets on the graph. Expensive hub similar-
ity calculation is efficiently performed using ranking scores provided by the IR
system to improve the performance of the density-based method. Similarly, in
another method, IR ranking as well as pairwise neighbors are used in constructing
affinity document matrices to represent the nearest neighbors that enforce
geometric structures. The consensus and complementary information provided by
this neighborhood information and the document representations is used to assist
higher to lower-dimensional projection in document clustering. These ranking
and neighborhood-based clustering approaches show higher accuracy compared
to state-of-the-art methods in handling sparse high dimensional data. The effec-
tiveness of these two methods is validated with real-world data that consists of
medium and short text vectors.
A corpus-based document expansion is implemented through NMF-based topic
modeling and using topic terms as virtual terms. This concept is used in commu-
nity detection and concept mining as short text clustering applications, where the
effectiveness is evaluated using social media datasets and forum data respectively.
Extrinsic, intrinsic measurements and case studies are used accordingly.
Outlier Detection: Secondly, the thesis presents a set of novel outlier detection
methods. It defines the outliers as dissimilar documents that show significant
difference through terms compared to a set of inlier groups. The inverse document
frequency weighting model, which ranks rare terms with high priority, is used to
define an outlier score for a document with the assumption that outliers contain
more rare terms. Also, IR ranking scores of relevant documents are used to
define an outlier score for a document. The ranking score, which informs the
similarity, is inversely used to identify how dissimilar/deviated that document is.
Additionally, within ranking responses of all the documents, the reverse neighbor
count is calculated to identify the hubs, and anti-hubs, which possess lower
k-occurrences, are proposed as outliers. The concept of an IR ranking-based mutual
neighbor graph, which forms uniform dense regions in a document collection
efficiently, is used to filter the outliers. These mutual graphs are used together
with the hubs that are found on the graph, to identify the documents that are
not part of the graph and/or are dissimilar to the hubs attached with the inlier
groups as outliers. All these methods are evaluated using real-world text data
covering all the vector sizes against the state-of-the-art methods. New evaluation
measures are introduced to calculate the prediction error of inliers and outliers,
which are able to categorize the effectiveness of an outlier detection method.
Cluster Evolution: The third contribution of this thesis is to present a method
that is able to track the dynamic evolution of text clusters over the time/domain
based on the cluster similarity. The potential of using matrix factorization-based
dimensionality reduction for identifying cluster groups within high dimensional
text cluster representation, is explored. It uses inter-cluster association,
represented with SGNS modeling to highlight close cluster associations, to reduce
the information loss in lower-dimensional projection. The thesis aims to identify
the global cluster evolution over the time/domain through these groups, and the
term frequency-based cohesiveness of cluster groups is used to filter loosely
attached clusters. The proposed method represents the evolution patterns through
a k-partite graph that spans across the k time stamps or domains. The real-
world data that consists of medium and short text vectors are used to evaluate
the performance of the proposed method, compared to state-of-the-art methods.
Quantitative evaluation is used for measuring the performance of identifying the
cluster groups and qualitative validation is done through visualization of evolu-
tion patterns with the cluster content. In addition, two case studies are used for
qualitative analysis.
6.2 Summary of Findings
This section discusses the main findings for the research questions presented in
Chapter 1. The neighborhood information obtained by the concepts of ranking
and matrix factorization-based projection have been found accurate in calculating
the similarity among text pairs, thus they are able to deal with the high dimen-
sional nature of the text representation and associated challenges. IR systems
have shown the ability to get relevant documents in response to a document-driven
query that statistically represents a document accurately and efficiently. The
assistance of neighborhood information has shown the ability to identify the similarity
among text pairs accurately with non-negative matrix factorization minimizing its
associated information loss. Using these concepts in identifying clusters, outliers
and cluster evolution across time has resulted in effective outcomes as shown by
the empirical analyses in the previous three chapters.
6.2.1 Clustering
This section presents the findings for the first research question about finding
similarity in text corpus to identify the subgroups/clusters.
In response to the question: How can graph-based methods with rank-
ing be used for effective density estimation in sparse data, where den-
sity difference could not be used in identifying the subgroups?
IR systems can be used to effectively generate neighborhood information for doc-
uments in a collection by posing document-driven queries that could represent
the documents systematically. Analyzing the IR-generated relevant documents of
document pairs to identify their shared neighbors can build a shared/mutual
nearest neighbor graph more accurately and efficiently than k-NN
analysis of document pairs. This IR-based shared/mutual nearest neighbor graph
can give a dense representation for sparse text data, where density-based
methods generally fail. A document on the graph is identified as a core dense
point if at least three documents are connected to it, each by sharing at
least three neighbor documents; this effectively captures the minimum requirement
for being a dense point in sparse text data. Expanding the boundaries of the
core points on the graph is able to accurately identify the varying dense patches
respective to the subgroups in text data.
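The core-point rule described above can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions rather than the thesis implementation: the function names `shared_neighbor_graph` and `core_points`, the `neighbors` input format, and the reading that two documents are linked when each retrieves the other and their responses share at least three documents are all illustrative choices.

```python
def shared_neighbor_graph(neighbors, min_shared=3):
    """Link two documents when each appears in the other's ranking response
    (mutual neighbors) and their responses share at least `min_shared` documents.

    `neighbors` maps a document id to the set of document ids that an IR
    system returned for that document's query (hypothetical input format).
    """
    graph = {doc: set() for doc in neighbors}
    for a, response in neighbors.items():
        for b in response:
            if b == a or b not in neighbors or a not in neighbors[b]:
                continue  # skip self-matches and non-mutual neighbors
            if len(response & neighbors[b]) >= min_shared:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def core_points(graph, min_links=3):
    """Treat a document as a core point when it is linked to at least
    `min_links` other documents on the graph (assumed threshold of three)."""
    return {doc for doc, linked in graph.items() if len(linked) >= min_links}


# Toy ranking responses: documents 0-4 retrieve each other, 5 and 6 are loose.
neighbors = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3, 4},
    2: {0, 1, 3, 4},
    3: {0, 1, 2, 4},
    4: {0, 1, 2, 3},
    5: {0, 6},
    6: {5},
}
graph = shared_neighbor_graph(neighbors)
cores = core_points(graph)
```

In this toy input, documents 5 and 6 retrieve each other but share no further neighbors, so they stay unlinked and would be candidates for later boundary expansion or outlier handling.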
In response to the question: Instead of expensive pairwise compar-
isons, how can IR ranking-based neighbors be employed to identify
the subgroups?
A ranking function employed in an IR system can be used to accurately and effi-
ciently find the relevant documents from a document collection organized in the
form of an inverted index data structure for a given document query [173]. In
comparison to different ranking functions available in IR systems, the tf∗idf function,
which considers both common and rare terms that appear in the collection, is
found effective in similarity identification. The nature of the document queries
that represent the documents is found as another important factor for identifying
neighbors accurately with IR systems. The terms with high frequencies in a doc-
ument show the ability to accurately identify the relevant documents. Generally,
top-10 terms are found effective in identifying relevant documents. The accuracy
can be further improved by choosing query size, depending on the characteristic
of the document collection. Comparatively larger size queries can be used for
collections that have larger text vectors on average, and vice versa.
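To make the idea of a document-driven query concrete, the following Python sketch indexes a toy collection, represents each document by its most frequent terms, and ranks the other documents with a simple tf∗idf score. It is only a sketch under assumptions: the helper names (`build_index`, `document_query`, `rank`), whitespace tokenization, and the plain `tf * log(N / df)` scoring are illustrative and do not reproduce the ranking functions of any particular IR system used in the thesis.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def document_query(text, size=10):
    """Represent a document by its `size` most frequent terms."""
    return [term for term, _ in Counter(text.lower().split()).most_common(size)]


def rank(index, n_docs, query, top_k=5, exclude=None):
    """Score documents by summed tf*idf over the query terms (assumed scoring)."""
    scores = defaultdict(float)
    for term in query:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    scores.pop(exclude, None)  # the query document is not its own neighbor
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


docs = [
    "text clustering groups similar text documents",
    "similar documents form text clusters",
    "football match ended with late goal",
    "late goal decided the football match",
]
index = build_index(docs)
# Ranked neighbors (doc_id, score) for every document posed as a query.
neighbors = {
    i: rank(index, len(docs), document_query(d), exclude=i)
    for i, d in enumerate(docs)
}
```

The ranking scores returned here are the kind of values that can be reused later, for example as hub affinities, without further pairwise comparisons.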
In addition to effectively forming the mutual neighbor graphs, the relevant docu-
ments obtained in this way from an IR system can be used to efficiently identify
the hubs - the frequent nearest neighbors in higher dimensionality. In the mutual
neighbor graphs generated with IR ranking results, the attached set of shared
neighborhoods can be used as the multiple hubs without any additional
calculations. They can be used to assign cluster labels to unclustered documents,
based on maximum relevancy/affinity. This affinity value for each hub can be
efficiently calculated using the ranking scores of included documents, that were
obtained a priori when the document was posed as the query to form neighbor-
hoods. This way of using IR ranking responses and ranking scores for identifying
the subgroups in text data is accurate and efficient, compared to expensive pair-
wise comparisons.
In response to the question: How can associated information loss be
minimised in matrix factorization to approximate the lower rank fac-
tors and to identify subgroups?
The neighborhood information that can preserve geometric structures within data
points is found effective in minimizing the information loss in NMF while project-
ing data from higher to lower order. The neighborhood information generated
with both pairwise comparisons (i.e., local neighborhoods) as well as IR ranking
responses (i.e., global neighborhoods) through document affinity matrices can ac-
curately assist the factorization of a document-term matrix. In modeling these
affinity matrices, the SGNS modeling technique in comparison to binary modeling
is found effective in highlighting the document pairs that show a higher presence
with respect to any neighborhoods giving higher accuracy. This use of both lo-
cal and global nearest neighbors can handle datasets with different sizes, scales,
and densities accurately. Moreover, the consideration of common and specific
(i.e., consensus and complementary) information given by the document-term
matrix, as well as neighborhood affinity matrices for the factorization process, is
found accurate in identifying subgroups in comparison to the use of either of the
aforementioned.
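The contrast between binary and SGNS modeling of an affinity matrix can be illustrated as follows. The sketch assumes that SGNS modeling corresponds to the shifted positive pointwise mutual information weighting that SGNS is known to factorize implicitly; the function name `sgns_affinity`, the shift parameter `k`, and the toy matrix `A` are assumptions made for illustration, not details taken from the thesis.

```python
import numpy as np


def sgns_affinity(A, k=1):
    """Re-weight a non-negative affinity (co-occurrence) matrix with shifted
    positive PMI: max(log(A_ij * total / (row_i * col_j)) - log(k), 0)."""
    A = np.asarray(A, dtype=float)
    total = A.sum()
    row = A.sum(axis=1, keepdims=True)
    col = A.sum(axis=0, keepdims=True)
    expected = row @ col / total  # affinity expected if pairs were independent
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(A / expected)
    pmi[~np.isfinite(pmi)] = 0.0  # pairs that never co-occur carry no weight
    return np.maximum(pmi - np.log(k), 0.0)


# Toy neighborhood counts: A[i, j] is how often documents i and j appear
# together in neighborhoods (hypothetical values).
A = np.array([
    [0, 3, 1, 0],
    [3, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
M = sgns_affinity(A)
```

A binary model would give every observed pair the same weight, whereas this weighting keeps only pairs that co-occur more often than their individual frequencies would suggest.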
Additionally, for short text, NMF is found to be an accurate and efficient
approach for identifying virtual terms for document expansion. Topic vectors
identified using NMF can capture the topics represented within a short document
collection accurately due to its alignment with natural non-negativity in text
representation. Including the highly probable terms of the corresponding
topic derived with NMF as virtual words for a short document can minimize the
extreme sparseness, in line with the semantic structure of the corpus, to identify
the subgroups. This corpus-based document expansion was found accurate and
efficient in identifying subgroups in short text, compared to external source-based
expansion.
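A corpus-based expansion of this kind can be sketched in a few lines of Python. The sketch below uses the standard multiplicative-update NMF and appends the top terms of each document's dominant topic as virtual terms; the function names `nmf` and `virtual_terms`, the rank, the number of virtual terms, and the toy document-term matrix are assumptions for illustration, not the settings used in the thesis.

```python
import numpy as np


def nmf(X, rank, iters=200, seed=0, eps=1e-9):
    """Factorize X (documents x terms) as W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + eps
    H = rng.random((rank, X.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def virtual_terms(W, H, vocab, n_virtual=2):
    """For each document, take the top terms of its dominant topic."""
    expanded = []
    for d in range(W.shape[0]):
        topic = int(np.argmax(W[d]))            # dominant topic of document d
        top = np.argsort(-H[topic])[:n_virtual]  # most probable topic terms
        expanded.append([vocab[t] for t in top])
    return expanded


vocab = ["super", "fund", "pension", "match", "goal", "team"]
# Toy short-text collection as a document-term count matrix.
X = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 1, 2],
], dtype=float)
W, H = nmf(X, rank=2)
virtual = virtual_terms(W, H, vocab)
error = np.linalg.norm(X - W @ H) ** 2
```

Each entry of `virtual` holds the terms that would be appended to the corresponding short document before clustering.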
Ultimately, among the proposed text clustering methods, the graph-based method
with ranking and density concepts gives better performance for fine-grained
clustering compared to the matrix factorization-based method. The latter category
of methods works with smaller cluster numbers, which are used as the lower rank for
factorization.
6.2.2 Outlier Detection
This section presents the findings in response to the second research question,
about identifying outliers.
In response to the question: How can the concept of ranking and
density used in identifying text similarity be extended in identifying
outliers in a text collection?
Primarily, term ranking based on inverse document frequency shows higher
effectiveness in identifying the outliers considering text dissimilarity. This simple
concept was found memory and time efficient for all types of text vectors. IR-
based relevant neighbors and associated relevancy score, which indicates the level
of similarity, can also be used to provide outlier scores in a scalable and efficient
manner. The inverse of average relevancy scores of relevant documents can ac-
curately define an outlier score for a document. This is efficient and scalable
compared to baselines, due to the use of efficient IR systems directly. Addi-
tionally, the reverse neighbor (k-occurrences) count of documents within ranking
responses indicates the hubness of documents, which can accurately differentiate the
outlier documents when considering anti-hubs. This is found especially accurate
for short text, which shows too few word co-occurrences to identify
text similarity/dissimilarity by other concepts.
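The three scoring ideas above can be sketched in Python as follows. This is a hedged illustration only: the function names, the use of the average idf of a document's terms as its score, and the toy inputs are assumptions, and the thesis methods (OIDF, ORFS, ORNC) may define and combine these scores differently.

```python
import math
from collections import Counter


def idf_outlier_scores(docs):
    """Average inverse document frequency of each document's distinct terms.
    Documents made of rare terms receive higher scores (assumed formulation)."""
    n = len(docs)
    tokens = [set(d.lower().split()) for d in docs]
    df = Counter(term for doc in tokens for term in doc)
    return [
        sum(math.log(n / df[t]) for t in doc) / len(doc) if doc else 0.0
        for doc in tokens
    ]


def ranking_outlier_score(relevancy_scores):
    """Inverse of the average relevancy score of the retrieved documents."""
    if not relevancy_scores:
        return float("inf")  # nothing retrieved: maximally deviated
    average = sum(relevancy_scores) / len(relevancy_scores)
    return float("inf") if average == 0 else 1.0 / average


def k_occurrences(responses):
    """Count how often each document appears in other documents' responses.
    Documents with low counts are anti-hubs and hence outlier candidates."""
    counts = {doc: 0 for doc in responses}
    for response in responses.values():
        for retrieved in response:
            counts[retrieved] = counts.get(retrieved, 0) + 1
    return counts


docs = [
    "stock market shares rise",
    "market shares fall",
    "stock market rally",
    "penguin colony antarctica",
]
idf_scores = idf_outlier_scores(docs)
# Hypothetical ranking responses: doc id -> ids retrieved for its query.
responses = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [0]}
occurrences = k_occurrences(responses)
```

In this toy input the last document has the highest idf-based score and a k-occurrence count of zero, so both signals point to it as the outlier.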
In addition, the IR ranking-based mutual neighbor graph and the density estimation
process on the graph, which were used for subgroup identification, can be used
accurately to identify the outliers that are not attached to the subgroups. Large
and medium-size text vectors show higher word co-occurrences than short text,
so a higher portion of documents become mutual neighbors on the graph; for
these, the algorithm can accurately and efficiently identify outliers as the
documents that are not included in the dense mutual neighbor graph. Short text
documents show fewer term co-occurrences among them, so only a few documents
are included in the mutual neighbor graph; for these, the multiple hubs identified
on the graph with the density estimation can refine the inliers. This algorithm is
found accurate in identifying outliers dissimilar to the hubs using the previously
calculated relevancy scores, without compromising efficiency.
Ensemble methods that combine term ranking and IR ranking-based
algorithms, sequentially or independently, are found accurate with fewer false
positives. Among them, ensemble methods that combine ranking function score
based outlier detection with term ranking sequentially and independently, are
found efficient compared to k-occurrences count-based sequential method, or
graph-based method, due to direct use of ranking score for identifying devia-
tions/dissimilarities. However, all these approaches were found time and memory
efficient, compared to existing baselines for all types of text vectors. Especially,
these ranking-based methods were found accurate and efficient for larger text
vectors where many methods fail.
These inverse document frequency-based outlier ranking and IR ranking concepts
result in methods summarised in Table 4.1. OIDF, which uses only inverse document
frequency-based outlier ranking, is efficient for document collections with
large text vectors. In contrast, ORFS and ORNC which combine term frequency-
based ranking with IR ranking are better in accuracy compared to OIDF. For
datasets with short text vectors, the k-occurrences-based ORNC as well as the
graph-based ORDG show higher accuracy. However, ORDG, which uses hub-based inlier
filtering, is superior among them.
6.2.3 Cluster Evolution
This section presents the findings in response to the third question, about iden-
tifying cluster evolution.
In response to the question: How can the matrix decomposition and
identified factors be used to understand the cluster similarity and
changing dynamics of text clusters in text collections?
The proposed CaCE method for cluster evolution identification accurately and
efficiently identifies the lower rank groups in the clusters and associated terms,
with the assistance of cluster association information using NMF from high di-
mensional text cluster representations. This additional assistance can minimize
the information loss in NMF. This inter-cluster association information, modeled
using the SGNS technique, maximizes the probability of closely associated cluster
pairs while minimizing that of loosely associated pairs, and thereby improves the
accuracy of cluster group identification. Moreover, CaCE identifies cohesive cluster
groups by enforcing a consistent term density distribution across the group, which
is found effective in cluster dynamic identification. The final cluster groups obtained this
way can effectively capture the similarity in the clusters across the time/domain,
globally compared to identifying similarity among consecutive time stamps. The
changing dynamics of the clusters can be completely identified by linking clusters
in these cohesive groups via a k-partite graph that visualizes cluster span across
k domains. Due to this visualization, CaCE can show the full set of cluster
lifecycle states (birth, death, split and merge of clusters) and all the evolution
patterns (emergence, persistence, growth, and decay) across time/domain, rather than
only a few cluster dynamics. Birth can indicate the emergence pattern of a cluster, split can
indicate the growth pattern of a cluster, merge and death can indicate the decay
pattern of a cluster and consistent appearance across the time/domain indicates
the persistence pattern of a cluster.
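One hypothetical way to read lifecycle states off a cohesive cluster group is sketched below in Python. The transition rules (comparing how many clusters of the group appear at consecutive time stamps) and the names `lifecycle` and `group` are assumptions made purely for illustration; they are not the definitions used by CaCE.

```python
def lifecycle(group, n_stamps):
    """Derive lifecycle events for one cluster group.

    `group` is an iterable of (time_stamp, cluster_id) pairs, i.e. the
    clusters of a k-partite graph that were linked into the same group.
    Returns a list of (time_stamp, event) pairs (assumed rule set).
    """
    counts = [0] * n_stamps
    for stamp, _ in group:
        counts[stamp] += 1

    events = []
    for t in range(1, n_stamps):
        prev, cur = counts[t - 1], counts[t]
        if prev == 0 and cur > 0:
            events.append((t, "birth"))
        elif prev > 0 and cur == 0:
            events.append((t, "death"))
        elif cur > prev:
            events.append((t, "split"))
        elif cur < prev:
            events.append((t, "merge"))
        elif cur > 0:
            events.append((t, "persist"))
    return events


# One cluster at t=0 splits into two at t=1, merges at t=2 and dies at t=3.
group_a = [(0, "a"), (1, "b"), (1, "c"), (2, "d")]
# A group that first appears at t=2 and persists at t=3.
group_b = [(2, "x"), (3, "y")]
events_a = lifecycle(group_a, n_stamps=4)
events_b = lifecycle(group_b, n_stamps=4)
```

Under these assumed rules, a birth event would map to the emergence pattern, split to growth, merge and death to decay, and persist to persistence.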
6.3 Future Work
This thesis presents novel clustering, outlier detection, and cluster evolution
methods with effective text similarity identification techniques. There are various
improvements that can directly be applied to the proposed methods and exten-
sions that can apply to them for solving other related problems. These potential
future research directions are presented in this section.
6.3.1 Stream mining
All the clustering, outlier detection and evolution tracking methods proposed in
the thesis, focus on the static document collections in identifying groups with
similar documents, deviated documents and evolutionary patterns. However,
popular online social media streams such as Facebook, Twitter, and
LinkedIn, with their frequent updates, would benefit from dynamic stream mining
methods. Extending the proposed concept to a context with limited computing
and storage capabilities where clusters, outliers, and cluster dynamics need to be
identified from a rapidly arriving continuous stream of text, will be worth studying
in the future.
6.3.2 Community discovery considering both structure
and content information
The community discovery problem is addressed in the thesis with the document
expansion considering text messages, which show what users communicate. However,
how users are connected, as shown through the network representation,
provides a useful piece of information that could be used to improve the community
detection methods. We have conducted a pilot study to explore the use of an NMF
approach, with the assistance of additional information that is used in subgroup
identification for this community detection problem. The results of this study
were presented as a poster at the Hopper Down Under event¹, showing that the
assistance of structural/network information for the NMF applied on the
content is able to improve the accuracy of community detection in many
cases. However, in a dense network representation, combining that information
with content shows inferior performance, highlighting the requirement of treating
content and structure with different weights/importance according to the nature
of the data. It would be interesting to study how this information could be
combined to detect communities with different weights to have improved results,
rather than using a single piece of information.
¹ https://community.anitab.org/event/hopper-down-under/
6.3.3 Deep learning
Deep learning has become extremely popular, with successful applications in im-
age processing and other machine learning research [183]. It is used in short
text clustering literature as a feature learning technique [189]. However, it shows
the bottleneck of requiring ground-truth related information to guide the train-
ing process. One of the works uses retweet or hashtags as links that must hold
to guide the training process without direct ground truth [188]. The use of IR
ranking responses to guide this process could be investigated as future work. It
would eliminate the need for a supervised or semi-supervised approach, as the optimal
clustering framework has proved that “documents relevant to the same queries
should occur in the same cluster” [59].
6.3.4 Short text clustering
This thesis proposes document expansion based on self-corpus for short text clus-
tering to minimize extreme sparseness. With the interesting observation of higher
performance gained through the IR ranking concepts for short text clustering and
outlier detection, it would be useful to apply these concepts in document expan-
sion. Future work can explore the use of IR ranking neighborhoods to derive
the virtual terms for the expansion. It requires investigation of how to select the
neighbors to use for this expansion, and the role of frequent neighbors or higher
k-occurring neighbors in this context.
6.3.5 Soft clustering
The methods proposed in this thesis use hard clustering, in line with the
datasets used and their ground truth labels.
However, real-world text documents, such as text in social media, tend to
belong to multiple groups. It would be interesting to study how
to identify the cluster labels for text that belongs to multiple clusters, and to
evaluate the accuracy of the methods.
6.3.6 Complete text mining framework
This thesis proposes a set of methods for text clustering, outlier detection and
cluster evolution identification, evaluated with a few different datasets. Future
work can explore the applicability of these methods for all types of datasets (i.e., short,
medium and long text vectors) in detail and propose solutions to deal with any
type of text vector. It would result in a complete text mining framework that finds
clusters, outliers and evolution patterns for any given dataset.
6.3.7 Pre-trained models for document representation
The methods proposed in the thesis used the VSM-based representation to statis-
tically model documents. However, word-embedding based document representa-
tion is known to provide a dense vector representation for sparse text efficiently
considering semantic similarities [115]. It would be interesting to know how these
techniques for document representation and document query representation can
improve the proposed text mining methods, considering semantic embedding.
Appendix A
Case Studies
National Senior Communities
The results of this case study, which applies the method proposed in
[139], were presented at the official launch of the QUT digital Observatory
(https://www.qut.edu.au/institute-for-future-environments/facilities/digital-observatory),
and the related video can be found at
https://www.youtube.com/watch?v=BgoJ495X5so.
QSuper communities
QSuper is an Australian superannuation fund based in Brisbane, Australia. This
case study also uses the method proposed in [139] to understand concerns of users
regarding QSuper. The dataset used for the study was also obtained from QUT digital
Observatory and includes 1091 tweets from Australian Twitter accounts
with the ‘qsuper’ keyword.
Figure 1: Word Cloud for total tweets obtained from “qsuper”
Fig. 1 shows the word cloud² generated for the entire tweet dataset. It can be
noted that the users talk about multiple things related to benefits, their families,
members, retirement issues as well as errors in the system, which could not be
captured separately.
The experiments are done with different values of α in [139]. The most meaningful
communities are obtained for α = 1, which generates 8 communities as given in Fig. 2.
The first community of users talks generally about superannuation and Unisuper
news. The second community focuses on members of the Qsuper and related
facts. The focus of the third community is mainly about the funds as an asset
or pension. Community 4 is talking about women and qbr2018 teams. Fifth and
sixth communities are focusing on investments and awards respectively. Commu-
nity 7 is about administrative errors related to Qsuper. The last community of
users generally talks about Brisbane Queensland community.
² The Voyant Tool [170] is used to illustrate the word clouds.
[Figure 2 panels: word clouds for Community 1 through Community 8]
Figure 2: Word Cloud for derived communities from “qsuper”
Appendix B
Matrix Factorization for Community Detection
using a Coupled Matrix
This case study explores the applicability of the NMF method proposed in CaCE
(Chapter 5) to the social media community detection problem. Instead of the
inter- and intra-cluster association matrices, user×user network structure and
user×term content matrices are used. The datasets include the tweets and
retweet interactions collected from QUT digital observatory³. The results of this
pilot study were presented as a poster at the Hopper Down Under conference
(https://community.anitab.org/event/hopper-down-under/).
1 Introduction
Social media platforms are a popular networking mechanism for people which
allow them to disseminate information and assemble social views based on short-
text communication [79]. Community detection in these platforms has been found
³ https://www.qut.edu.au/institute-for-future-environments/facilities/digital-observatory
useful in identifying the groups of users with common interests. It creates oppor-
tunities for political parties, businesses, and government organizations to target
certain user groups for their campaigns, customized programs and events [88, 147].
Two popular unsupervised learning methods to discover communities are network
analysis using graph partitioning [26, 161], and content analysis using clustering
and topic modeling [139, 147]. Network analysis methods, which group users
based on their connections, face challenges due to the sparseness of the network
and the heterogeneity of the interactions. Content analysis methods, which
group users based on their written posts, produce inferior outcomes due to the
curse of dimensionality in text vectors [3]. In this paper, we propose a novel
approach, named CS-NMF, to utilize both types of data using coupled
matrix factorization in a fully unsupervised manner.
CS-NMF learns the consensus user-community matrix using Non-negative Matrix
Factorization (NMF) by coupling the high-dimensional content and structure
related matrices iteratively. We empirically evaluate CS-NMF using three Twitter
datasets and benchmark it against state-of-the-art clustering and network analysis
methods. Results show that coupling the complementary information generated
by both structure and content data can minimize the issues that sparseness raises
when each is used separately.
2 Problem and Motivation
Community detection is a well-studied research area with graph-based models
where network structure is analyzed to see how users are connected through social
media. In contrast, there are methods that consider what users communicate via
text messages in community discovery. However, all these methods become
ineffective due to sparseness in structural and textual representations. Further-
more, discovering communities in a fully unsupervised manner is an essential
requirement in many real-world applications. Disseminating information related
to sales promotions, political campaigns and any special event or program needs
the identification of an interested group of users where prior knowledge on the
group is unavailable. In this paper, we explore how to overcome the sparsity
associated with social media data to have accurate communities and how to in-
corporate structure with content for community detection in an unsupervised
setting.
3 Related Work
Community detection is usually done via two means: (1) network analysis and
(2) content analysis. A larger proportion of research explores the connectedness
in the user interaction network through graph-based models for community
identification [26, 161]. However, this network representation is sparse and complex for
analysis. Users who belong to a common group make connections with different
groups based on friendship creating heterogeneous networks.
Content analysis, which relies on the text messages written by users for
communication, identifies similar users based on what they share [79, 147].
Generally, text mining suffers from the curse of dimensionality.
In high-dimensional data, the distance difference between near and far points be-
comes negligible [3] and many state-of-the-art clustering methods fail to identify
communities accurately. Additionally, social media text is short in length, which
causes extreme sparseness in the data with a lack of co-relational occurrences
[79].
There is a handful of research that attempted to enrich the outcome of community
detection with both content and structural data. Additional information available
with social media such as URLs and hashtags is incorporated with network repre-
sentation to identify users with similar interests [119]. A few researchers use text
messages together with network structures in learning user communities [153].
However, they require label information fed as input to accurately detect com-
munities. We propose to use a coupled matrix combining content and structure
generated by NMF to accurately represent the communities in an unsupervised
setting.
4 Approach
Let there be N users to be assigned to G communities. Let $S \in \mathbb{R}^{N \times N}$ denote
the user interaction matrix, with each cell representing the number
of interactions between the two corresponding users. Let $C \in \mathbb{R}^{M \times N}$ denote the
user content matrix, where the short text messages written by the N users consist
of M distinct terms. The proposed CS-NMF takes the normalized matrices as input to
the NMF process and iteratively attempts to learn an optimum coupled matrix
representing user community assignment as a factor matrix in a novel fashion.
Thereby, each user is assigned to a community using both structure and content
information in an unsupervised setting.
CS-NMF
The proposed CS-NMF factorizes the high dimensional content matrix C into
two factor matrices $W \in \mathbb{R}^{M \times G}$ and $H \in \mathbb{R}^{N \times G}$, where G is the number of
communities. It simultaneously identifies H and $H_c \in \mathbb{R}^{N \times G}$ as the lower rank
matrices for S. It learns the coupled matrix H iteratively by minimizing the
learning errors in the factorization of matrix S and C as follows.
$$\min_{W,H \ge 0} \lVert C - WH^{T} \rVert_F + \min_{H,H_c \ge 0} \lVert S - HH_c^{T} \rVert_F \tag{1}$$
We update the matrices W, H and $H_c$ sequentially, one column
$g' \in \{1, \dots, G\}$ at a time, within each iteration as follows.
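\[
W_{(:,g')} \leftarrow \left[ W_{(:,g')} + \frac{(C H)_{(:,g')} - (W H^{T} H)_{(:,g')}}{(H^{T} H)_{(g',g')}} \right] \tag{2}
\]

\[
H_{(:,g')} \leftarrow \left[ H_{(:,g')} + \frac{(C^{T} W)_{(:,g')} + (S H_c)_{(:,g')}}{(W^{T} W)_{(g',g')} + (H_c^{T} H_c)_{(g',g')}} - \frac{(H H_c^{T} H_c)_{(:,g')} + (H W^{T} W)_{(:,g')}}{(W^{T} W)_{(g',g')} + (H_c^{T} H_c)_{(g',g')}} \right] \tag{3}
\]

\[
H_{c(:,g')} \leftarrow \left[ H_{c(:,g')} + \frac{(S H)_{(:,g')} - (H_c H^{T} H)_{(:,g')}}{(H^{T} H)_{(g',g')}} \right] \tag{4}
\]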
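The form of update (3) can be motivated as follows. The derivation assumes that
the two error terms of (1) are taken as squared Frobenius norms, an assumption
made here only to keep the derivatives simple, since (1) is written with
unsquared norms. Under that assumption, the gradient and the diagonal second
derivative with respect to H are:

```latex
\begin{aligned}
f(W, H, H_c) &= \tfrac{1}{2}\lVert C - W H^{T} \rVert_F^{2}
              + \tfrac{1}{2}\lVert S - H H_c^{T} \rVert_F^{2}, \\
\frac{\partial f}{\partial H} &= -\,C^{T} W - S H_c + H W^{T} W + H H_c^{T} H_c, \\
\frac{\partial^{2} f}{\partial H_{(i,g')}^{2}} &= (W^{T} W)_{(g',g')} + (H_c^{T} H_c)_{(g',g')}.
\end{aligned}
```

Dividing the negative gradient of column g′ by this second derivative gives a
Newton-type step for that column, which is the structure of the two fractions
with a shared denominator in (3). Updates (2) and (4) follow in the same way from
the derivatives with respect to W and Hc, taking S to be symmetric in the case
of (4).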
CS-NMF is able to effectively use the complementary information available in
users' text messages and interactions, as the learning process of the coupled
matrix (H) incorporates both C and S.
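The column-wise updates (2) to (4) can be sketched in NumPy as follows. This is an
illustrative sketch under stated assumptions, not the implementation used in the
experiments: the random initialization, the fixed iteration count, the small
constant `eps`, the clipping at zero that stands in for the non-negativity
constraint of (1), and the arg-max assignment of users to communities are all
assumptions. The names `cs_nmf` and `objective` are hypothetical, the subtracted
structure term in the H update is taken as H Hc^T Hc (the gradient of the
structure term of (1) with respect to H), and `objective` uses squared Frobenius
norms.

```python
import numpy as np

def objective(C, S, W, H, Hc):
    """Sum of squared reconstruction errors for content and structure."""
    return np.linalg.norm(C - W @ H.T) ** 2 + np.linalg.norm(S - H @ Hc.T) ** 2

def cs_nmf(C, S, G, n_iter=200, seed=0, eps=1e-12):
    """Column-wise coordinate-descent sketch of CS-NMF.

    C: (M, N) content matrix, S: (N, N) symmetric structure matrix,
    G: number of communities. Returns W (M, G), H (N, G), Hc (N, G).
    """
    rng = np.random.default_rng(seed)
    M, N = C.shape
    W = rng.random((M, G))
    H = rng.random((N, G))
    Hc = rng.random((N, G))

    for _ in range(n_iter):
        for g in range(G):
            # Update (2): column g of W from the content residual.
            HtH = H.T @ H
            step = (C @ H[:, g] - W @ HtH[:, g]) / (HtH[g, g] + eps)
            W[:, g] = np.maximum(0.0, W[:, g] + step)

            # Update (3): column g of the coupled matrix H uses both C and S.
            WtW = W.T @ W
            HctHc = Hc.T @ Hc
            numer = C.T @ W[:, g] + S @ Hc[:, g] - H @ HctHc[:, g] - H @ WtW[:, g]
            H[:, g] = np.maximum(0.0, H[:, g] + numer / (WtW[g, g] + HctHc[g, g] + eps))

            # Update (4): column g of Hc from the structure residual.
            HtH = H.T @ H
            step = (S @ H[:, g] - Hc @ HtH[:, g]) / (HtH[g, g] + eps)
            Hc[:, g] = np.maximum(0.0, Hc[:, g] + step)
    return W, H, Hc

# Toy data: two groups of three users with disjoint vocabularies and
# interactions only within each group.
C = np.kron(np.eye(2), np.ones((4, 3)))  # 8 terms x 6 users
S = np.kron(np.eye(2), np.ones((3, 3)))  # 6 x 6, symmetric

W, H, Hc = cs_nmf(C, S, G=2)
labels = H.argmax(axis=1)  # assumed rule: the largest entry in a user's row of H
```

Because H appears in both reconstruction terms, each update of H is pulled
towards a community assignment that explains the content and the interactions at
the same time, which is the coupling described above.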
Table 1: Summary of the datasets

Dataset       # of Users   # of Interactions   # of Tweets   Unique Terms   # of Groups
DS1: Cancer   1585         1174                8260          2975           8
DS2: Health   2073         2191                19758         5444           6
DS3: Sport    5531         19699               12044         3558           6
Empirical Analysis
Experiments were carried out to evaluate (1) the accuracy gained by combining
content and structure compared with using each individually, and (2) the
effectiveness of CS-NMF against state-of-the-art methods, to test the efficacy
of this way of combining them. We used the NMF, LDA and k-means clustering
methods [3] and the Louvain network analysis method [26] as baselines, with
F1-score (F1) and NMI as evaluation measures [3].
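For readers unfamiliar with NMI, a minimal pure-Python sketch is given below. It
is an illustration only: the function name `nmi` is hypothetical, and the
arithmetic-mean normalization is one common convention that is assumed here,
since the exact variant used in the experiments is not restated in this section.
F1 is omitted because it additionally needs a mapping between predicted and
ground-truth communities.

```python
from collections import Counter
from math import log

def nmi(truth, pred):
    """Normalized mutual information between two labelings of the same users."""
    n = len(truth)
    count_t, count_p = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))

    # Mutual information from the joint and marginal label frequencies.
    mi = sum(
        (c / n) * log((c * n) / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )

    # Entropy of each labeling.
    h_t = -sum((c / n) * log(c / n) for c in count_t.values())
    h_p = -sum((c / n) * log(c / n) for c in count_p.values())

    # Arithmetic-mean normalization (assumed convention).
    denom = (h_t + h_p) / 2
    return mi / denom if denom > 0 else 1.0

same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # labels swapped, groups identical
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])             # groups share no information
```

The measure ignores how communities are named, so a prediction that recovers the
ground-truth groups under different labels still scores 1, while a prediction
that is independent of the ground truth scores 0.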
We used three Twitter datasets focusing on the Cancer, Health and Sports domains,
as reported in Table 1. We chose a set of groups under these domains for which
we could identify Twitter accounts from which to collect tweets and user
interactions. Each group is treated as a ground-truth community against which to
benchmark the outcome.
5 Results and Contributions
Experimental results show that combining what users communicate through text
messages with how they connect with each other improves the accuracy of
community detection compared with using either source of information alone.
This confirms that coupling content and structure in learning the community
assignment through CS-NMF is an effective approach. Thus, the use of CS-NMF
in identifying users with similar interests would be useful in applications such
as targeted marketing or campaigns.
Results
Results in Table 2 show that CS-NMF is superior to applying the clustering methods
NMF, LDA, and k-means on structure or content separately in DS1 and DS2.
Applying the network-analysis-based Louvain method also gave an inferior outcome.

Table 2: Accuracy analysis

Method            DS1                DS2                DS3
                  F1-Score   NMI     F1-Score   NMI     F1-Score   NMI
CS-NMF            0.78       0.76    0.69       0.62    0.48       0.35
NMF for C∗        0.62       0.58    0.55       0.46    0.35       0.07
NMF for S∗        0.36       0.19    0.42       0.15    0.43       0.31
LDA for C∗        0.26       0.02    0.39       0.01    0.31       0.00
LDA for S∗        0.19       0.09    0.28       0.12    0.48       0.38
k-means for C∗    0.74       0.72    0.59       0.50    0.36       0.07
k-means for S∗    0.26       0.02    0.40       0.03    0.32       0.01
Louvain           0.40       0.32    0.40       0.24    0.49       0.44

Note: C∗ and S∗ stand for the content and structure matrices.

There is a slight variation in DS3: although it confirms that combining structure
and content is able to accurately discover communities, the structure-based
grouping by Louvain achieves the best performance. As shown in Table 1, DS3 has a
higher number of user interactions than the other datasets, which creates a
considerably denser structure matrix. This confirms that when a structure matrix
is dense, a network analysis method is able to accurately discover the
communities, while a sparse network representation requires coupling with content.
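As a rough check on this reading of Table 1, the number of interactions per user
can be computed directly from the table. The ratio is only a proxy for the density
of S and is offered here as an illustration, since Table 1 reports interaction
counts rather than the number of non-zero cells in the matrix.

```python
# (users, interactions) per dataset, copied from Table 1.
datasets = {"DS1": (1585, 1174), "DS2": (2073, 2191), "DS3": (5531, 19699)}

# Average number of interactions per user in each dataset.
per_user = {name: interactions / users for name, (users, interactions) in datasets.items()}

for name, value in per_user.items():
    print(f"{name}: {value:.2f} interactions per user")
# DS1: 0.74 interactions per user
# DS2: 1.06 interactions per user
# DS3: 3.56 interactions per user
```

On this measure DS3 has more than three times as many interactions per user as
DS2 and nearly five times as many as DS1, which is consistent with the
structure-only methods performing comparatively well on DS3.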
Contributions
The contributions of this work are:
• We put forward the concept of combining content and structure for
community detection in a fully unsupervised manner to address data
sparsity, which otherwise results in an inferior outcome.
• We propose a Non-negative Matrix Factorization based coupled matrix to
accurately learn the user communities with content and structure.
Bibliography
[1] C. C. Aggarwal, “Outlier analysis,” in Data mining, pp. 237–263, Springer,
2015.
[2] C. C. Aggarwal and P. S. Yu, “Outlier detection for high dimensional data,”
in ACM Sigmod Record, vol. 30, pp. 37–46, ACM, 2001.
[3] C. C. Aggarwal and C. Zhai, Mining text data. Springer Science & Business
Media, 2012.
[4] M. Agyemang, K. Barker, and R. S. Alhajj, “Wcond-mine: algorithm for
detecting web content outliers from web documents,” in 10th IEEE Sym-
posium on Computers and Communications (ISCC’05), pp. 885–890, IEEE,
2005.
[5] M. Akbari and T.-S. Chua, “Leveraging behavioral factorization and prior
knowledge for community discovery and profiling,” in Proceedings of the
Tenth ACM International Conference on Web Search and Data Mining,
pp. 71–79, ACM, 2017.
[6] E. Aljalbout, V. Golkov, Y. Siddiqui, M. Strobel, and D. Cremers, “Clus-
tering with deep learning: Taxonomy and new methods,” arXiv preprint
arXiv:1801.07648, 2018.
[7] A. Amado, P. Cortez, P. Rita, and S. Moro, “Research trends on big data
in marketing: A text mining and topic modeling based literature analysis,”
European Research on Management and Business Economics, vol. 24, no. 1,
pp. 1–7, 2018.
[8] D. C. Anastasiu, A. Tagarelli, and G. Karypis, “Document clustering: The
next frontier,” 2013.
[9] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “Optics: ordering
points to identify the clustering structure,” in ACM Sigmod record, vol. 28,
pp. 49–60, ACM, 1999.
[10] M. Antunes, D. Gomes, and R. L. Aguiar, “Knee/elbow estimation based
on first derivative threshold,” in 2018 IEEE Fourth International Conference
on Big Data Computing Service and Applications (BigDataService), pp. 237–
240, IEEE, 2018.
[11] M. Aouf and L. A. Park, “Approximate document outlier detection using
random spectral projection,” in Australasian Joint Conference on Artificial
Intelligence, pp. 579–590, Springer, 2012.
[12] W. Ashour and S. Sunoallah, “Multi density dbscan,” in International Con-
ference on Intelligent Data Engineering and Automated Learning, pp. 446–
453, Springer, 2011.
[13] Y. Awuor and R. Oboko, “Automatic assessment of online discussions using
text mining,” International Journal of Machine Learning and Applications,
vol. 1, no. 1, p. 7, 2012.
[14] T. Aynaud, “Community detection for networkx’s documentation,” 2018.
[15] L. Azzopardi and V. Vinay, “Retrievability: an evaluation measure for higher
order information access tasks,” in Proceedings of the 17th ACM conference
on Information and knowledge management, pp. 561–570, ACM, 2008.
[16] L. D. Baker, T. Hofmann, A. McCallum, and Y. Yang, “A hierarchical prob-
abilistic model for novelty detection in text,” in Proceedings of International
Conference on Machine Learning, 1999.
[17] S. Banerjee, K. Ramanathan, and A. Gupta, “Clustering short texts using
wikipedia,” in Proceedings of the 30th annual international ACM SIGIR
conference on Research and development in information retrieval, pp. 787–
788, ACM, 2007.
[18] P. Bansal, R. Bansal, and V. Varma, “Towards deep semantic analysis of
hashtags,” in European conference on information retrieval, pp. 453–464,
Springer, 2015.
[19] C. Bao, H. Ji, Y. Quan, and Z. Shen, “Dictionary learning for sparse cod-
ing: Algorithms and convergence analysis,” IEEE transactions on pattern
analysis and machine intelligence, vol. 38, no. 7, pp. 1356–1369, 2016.
[20] B. V. Barde and A. M. Bainwad, “An overview of topic modeling methods
and tools,” in 2017 International Conference on Intelligent Computing and
Control Systems (ICICCS), pp. 745–750, IEEE, 2017.
[21] M. Belford, B. Mac Namee, and D. Greene, “Stability of topic modeling via
matrix factorization,” Expert Systems with Applications, vol. 91, pp. 159–
169, 2018.
[22] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geomet-
ric framework for learning from labeled and unlabeled examples,” Journal
of machine learning research, vol. 7, no. Nov, pp. 2399–2434, 2006.
[23] G. Bennett, F. Scholer, and A. Uitdenbogerd, “A comparative study of prob-
abilistic and language models for information retrieval,” in Proceedings of the
nineteenth conference on Australasian database-Volume 75, pp. 65–74, Aus-
tralian Computer Society, Inc., 2008.
[24] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is “near-
est neighbor” meaningful?,” in International conference on database theory,
pp. 217–235, Springer, 1999.
[25] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal
of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.
[26] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast un-
folding of communities in large networks,” Journal of statistical mechanics:
theory and experiment, vol. 2008, no. 10, p. P10008, 2008.
[27] L. Blouvshtein and D. Cohen-Or, “Outlier detection for robust multi-
dimensional scaling,” IEEE transactions on pattern analysis and machine
intelligence, 2018.
[28] L. Bolelli, S. Ertekin, and C. L. Giles, “Topic and trend detection in text
collections using latent dirichlet allocation,” in European Conference on In-
formation Retrieval, pp. 776–780, Springer, 2009.
[29] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: identifying
density-based local outliers,” in ACM sigmod record, vol. 29, pp. 93–104,
ACM, 2000.
[30] A. Broder, L. Garcia-Pueyo, V. Josifovski, S. Vassilvitskii, and S. Venkate-
san, “Scalable k-means by ranked retrieval,” in Proceedings of the 7th ACM
international conference on Web search and data mining, pp. 233–242, ACM,
2014.
[31] D. Cai, X. He, J. Han, and T. S. Huang, “Graph regularized nonnegative
matrix factorization for data representation,” IEEE transactions on pattern
analysis and machine intelligence, vol. 33, no. 8, pp. 1548–1560, 2010.
[32] S. B. Cantor and M. W. Kattan, “Determining the area under the roc
curve for a binary diagnostic test,” Medical Decision Making, vol. 20, no. 4,
pp. 468–470, 2000.
[33] F. Cao, M. Ester, W. Qian, and A. Zhou, “Density-based clustering over
an evolving data stream with noise,” in Proceedings of the 2006 SIAM in-
ternational conference on data mining, pp. 328–339, SIAM, 2006.
[34] H. A. Carneiro and E. Mylonakis, “Google trends: a web-based tool for real-
time surveillance of disease outbreaks,” Clinical infectious diseases, vol. 49,
no. 10, pp. 1557–1564, 2009.
[35] M. Cataldi, L. Di Caro, and C. Schifanella, “Emerging topic detection on
twitter based on temporal and social terms evaluation,” in Proceedings of the
tenth international workshop on multimedia data mining, p. 4, ACM, 2010.
[36] M. E. Celebi, Partitional clustering algorithms. Springer, 2014.
[37] N. Cercone, F. Yasmeen, and Y. Gonzalez-Fernandez, “Information retrieval
and the vector space model.” University Lecture, 2014.
[38] D. Chakraborty, V. Narayanan, and A. Ghosh, “Integration of deep feature
extraction and ensemble learning for outlier detection,” Pattern Recognition,
vol. 89, pp. 161–171, 2019.
[39] Y. Chen, H. Zhang, R. Liu, Z. Ye, and J. Lin, “Experimental explorations
on short text topic mining between lda and nmf based schemes,” Knowledge-
Based Systems, vol. 163, pp. 1–13, 2019.
[40] Y. Chi, X. Song, D. Zhou, K. Hino, and B. L. Tseng, “Evolutionary spectral
clustering by incorporating temporal smoothness,” in Proceedings of the 13th
ACM SIGKDD international conference on Knowledge discovery and data
mining, pp. 153–162, ACM, 2007.
[41] R. Churchill, L. Singh, and C. Kirov, “A temporal topic model for noisy
mediums,” in Pacific-Asia Conference on Knowledge Discovery and Data
Mining, pp. 42–53, Springer, 2018.
[42] C. De Boom, S. Van Canneyt, T. Demeester, and B. Dhoedt, “Representa-
tion learning for very short texts using weighted word embedding aggrega-
tion,” Pattern Recognition Letters, vol. 80, pp. 150–156, 2016.
[43] S. Dehuri, C. Mohapatra, A. Ghosh, and R. Mall, “Comparative study of
clustering algorithms,” Information Technology Journal, 2006.
[44] I. S. Dhillon, “Co-clustering documents and words using bipartite spectral
graph partitioning,” in Proceedings of the seventh ACM SIGKDD interna-
tional conference on Knowledge discovery and data mining, pp. 269–274,
ACM, 2001.
[45] J. DiGrazia, K. McKelvey, J. Bollen, and F. Rojas, “More tweets, more
votes: Social media as a quantitative indicator of political behavior,” PloS
one, vol. 8, no. 11, p. e79449, 2013.
[46] C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegative matrix
t-factorizations for clustering,” in Proceedings of the 12th ACM SIGKDD
international conference on KDD, pp. 126–135, ACM, 2006.
[47] B. Dong, M. M. Lin, and M. T. Chu, “Nonnegative rank factorization via
rank reduction,” preprint, 2008.
[48] L. Du, W. Buntine, H. Jin, and C. Chen, “Sequential latent dirichlet allo-
cation,” Knowledge and information systems, vol. 31, no. 3, pp. 475–503,
2012.
[49] N. Du, M. Farajtabar, A. Ahmed, A. J. Smola, and L. Song, “Dirichlet-
hawkes processes with applications to clustering continuous-time document
streams,” in Proceedings of the 21th ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining, pp. 219–228, ACM, 2015.
[50] R. Du, D. Kuang, B. Drake, and H. Park, “Dc-nmf: nonnegative matrix
factorization based on divide-and-conquer for fast clustering and topic mod-
eling,” Journal of Global Optimization, vol. 68, no. 4, pp. 777–798, 2017.
[51] L. Duan, L. Xu, Y. Liu, and J. Lee, “Cluster-based outlier detection,” Annals
of Operations Research, vol. 168, no. 1, pp. 151–168, 2009.
[52] A. Egg, “Locality-sensitive hashing (lsh),” 2017.
[53] I. A. El-Khair, “Term weighting,” Encyclopedia of Database Systems,
pp. 3037–3040, 2009.
[54] Elasticsearch, “Similarity module,” 2019.
[55] L. Ertoz, M. Steinbach, and V. Kumar, “Finding clusters of different sizes,
shapes, and densities in noisy, high dimensional data,” in Proceedings of the
2003 SIAM international conference on data mining, pp. 47–58, SIAM, 2003.
[56] L. Ertoz, M. Steinbach, and V. Kumar, “Finding topics in collections of
documents: A shared nearest neighbor approach,” in Clustering and Infor-
mation Retrieval, pp. 83–103, Springer, 2004.
[57] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., “A density-based algorithm
for discovering clusters in large spatial databases with noise,” in KDD, vol. 96,
pp. 226–231, 1996.
[58] A. Flexer, “Hubness-aware outlier detection for music genre recognition,”
in Proceedings of the 19th international conference on digital audio effects,
2016.
[59] N. Fuhr, M. Lechtenfeld, B. Stein, and T. Gollub, “The optimum clustering
framework: implementing the cluster hypothesis,” Information Retrieval,
vol. 15, no. 2, pp. 93–115, 2012.
[60] A. Gandomi and M. Haider, “Beyond the hype: Big data concepts, methods,
and analytics,” International journal of information management, vol. 35,
no. 2, pp. 137–144, 2015.
[61] J. Ghosh and A. Acharya, “Cluster ensembles,” Wiley Interdisciplinary Re-
views: Data Mining and Knowledge Discovery, vol. 1, no. 4, pp. 305–315,
2011.
[62] E. Giuliani and C. Pietrobelli, “Social network analysis methodologies for
the evaluation of cluster development programs,” tech. rep., Inter-American
Development Bank, 2011.
[63] D. Greene, D. Archambault, V. Belak, and P. Cunningham, “Textluas:
tracking and visualizing document and term clusters in dynamic text data,”
arXiv preprint arXiv:1502.04609, 2014.
[64] D. Greene and J. P. Cross, “Exploring the political agenda of the european
parliament using a dynamic topic modeling approach,” Political Analysis,
vol. 25, no. 1, pp. 77–94, 2017.
[65] X. Gu and H. Wang, “Online anomaly prediction for robust cluster systems,”
in 2009 IEEE 25th International Conference on Data Engineering, pp. 1000–
1011, IEEE, 2009.
[66] Q. Gu and J. Zhou, “Co-clustering on manifolds,” in Proceedings of the 15th
ACM SIGKDD international conference on Knowledge discovery and data
mining, pp. 359–368, ACM, 2009.
[67] B. Hajek, “Adaptive transmission strategies and routing in mobile radio
networks,” in Proceedings of the Conference on Information Sciences and
Systems, vol. 17, p. 373, Department of Electrical Engineering, Johns Hop-
kins University., 1983.
[68] V. Hautamaki, I. Karkkainen, and P. Franti, “Outlier detection using k-
nearest neighbour graph,” in Proceedings of the 17th International Confer-
ence on Pattern Recognition, 2004. ICPR 2004., vol. 3, pp. 430–433, IEEE,
2004.
[69] D. M. Hawkins, Identification of outliers, vol. 11. Springer, 1980.
[70] Z. He, “Hub selection for hub based clustering algorithms,” in 2014 11th In-
ternational Conference on Fuzzy Systems and Knowledge Discovery (FSKD),
pp. 479–484, IEEE, 2014.
[71] Z. He, X. Xu, and S. Deng, “Discovering cluster-based local outliers,” Pat-
tern Recognition Letters, vol. 24, no. 9-10, pp. 1641–1650, 2003.
[72] F. Heimerl, S. Lohmann, S. Lange, and T. Ertl, “Word cloud explorer: Text
analytics based on word clouds,” in 2014 47th Hawaii International Confer-
ence on System Sciences, pp. 1833–1842, IEEE, 2014.
[73] J.-L. Hervas-Oliver, G. Gonzalez, P. Caja, and F. Sempere-Ripoll, “Clusters
and industrial districts: Where is the literature going? identifying emerging
sub-fields of research,” European Planning Studies, vol. 23, no. 9, pp. 1827–
1872, 2015.
[74] M. Hoffman, F. R. Bach, and D. M. Blei, “Online learning for latent dirichlet
allocation,” in advances in neural information processing systems, pp. 856–
864, 2010.
[75] L. Hong and B. D. Davison, “Empirical study of topic modeling in twitter,”
in Proceedings of the first workshop on social media analytics, pp. 80–88,
ACM, 2010.
[76] T. Hong, T. Lee, and J. Li, “Development of sentiment analysis model for
the hot topic detection of online stock forums,” Journal of Intelligence and
Information Systems, vol. 22, no. 1, pp. 187–204, 2016.
[77] A. Hotho, A. Nurnberger, and G. Paaß, “A brief survey of text mining,” in
Ldv Forum, vol. 20, pp. 19–62, Citeseer, 2005.
[78] J. Hou and R. Nayak, “The heterogeneous cluster ensemble method using
hubness for clustering text documents,” in International Conference on Web
Information Systems Engineering, pp. 102–110, Springer, 2013.
[79] X. Hu and H. Liu, “Text analytics in social media,” in Mining text data,
pp. 385–414, Springer, 2012.
[80] X. Hu, N. Sun, C. Zhang, and T.-S. Chua, “Exploiting internal and ex-
ternal semantics for the clustering of short texts using world knowledge,”
in Proceedings of the 18th ACM conference on Information and knowledge
management, pp. 919–928, ACM, 2009.
[81] A. Huang, “Similarity measures for text document clustering,” in Proceed-
ings of the sixth new zealand computer science research student conference
(NZCSRSC2008), Christchurch, New Zealand, vol. 4, pp. 9–56, 2008.
[82] G. Huang, J. He, Y. Zhang, W. Zhou, H. Liu, P. Zhang, Z. Ding, Y. You,
and J. Cao, “Mining streams of short text for analysis of world-wide event
evolutions,” World Wide Web, vol. 18, no. 5, pp. 1201–1217, 2015.
[83] K. Huang, N. D. Sidiropoulos, and A. Swami, “Non-negative matrix factor-
ization revisited: Uniqueness and algorithm for symmetric decomposition,”
IEEE Transactions on Signal Processing, vol. 62, no. 1, pp. 211–224, 2014.
[84] J. Huang, Q. Zhu, L. Yang, and J. Feng, “A non-parameter outlier detection
algorithm based on natural neighbor,” Knowledge-Based Systems, vol. 92,
pp. 71–77, 2016.
[85] X. Huosong, F. Zhaoyan, and P. Liuyan, “Chinese web text outlier min-
ing based on domain knowledge,” in 2010 Second WRI Global Congress on
Intelligent Systems, vol. 2, pp. 73–77, IEEE, 2010.
[86] IBM, “Big data and analytics hub,” 2017.
[87] K. Ismo et al., “Outlier detection using k-nearest neighbour graph,” in null,
pp. 430–433, IEEE, 2004.
[88] R. Iyer, J. Wong, W. Tavanapong, and D. A. Peterson, “Identifying policy
agenda sub-topics in political tweets based on community detection,” in
Proceedings of the 2017 IEEE/ACM International Conference on Advances
in Social Networks Analysis and Mining 2017, pp. 698–705, ACM, 2017.
[89] D. A. Jackson and Y. Chen, “Robust principal component analysis and out-
lier detection with ecological data,” Environmetrics: The official journal of
the International Environmetrics Society, vol. 15, no. 2, pp. 129–139, 2004.
[90] A. K. Jain, “Data clustering: 50 years beyond k-means,” Pattern recognition
letters, vol. 31, no. 8, pp. 651–666, 2010.
[91] N. Jardine and C. J. van Rijsbergen, “The use of hierarchic clustering in in-
formation retrieval,” Information storage and retrieval, vol. 7, no. 5, pp. 217–
240, 1971.
[92] R. A. Jarvis and E. A. Patrick, “Clustering using a similarity measure
based on shared near neighbors,” IEEE Transactions on computers, vol. 100,
no. 11, pp. 1025–1034, 1973.
[93] C. Jia, M. B. Carson, X. Wang, and J. Yu, “Concept decompositions for
short text clustering by identifying word communities,” Pattern Recognition,
vol. 76, pp. 691–703, 2018.
[94] M. Jiang, P. Cui, and C. Faloutsos, “Suspicious behavior detection: Cur-
rent trends and future directions,” IEEE Intelligent Systems, vol. 31, no. 1,
pp. 31–39, 2016.
[95] O. Jin, N. N. Liu, K. Zhao, Y. Yu, and Q. Yang, “Transferring topical
knowledge from auxiliary long texts for short text clustering,” in Proceedings
of the 20th ACM international conference on Information and knowledge
management, pp. 775–784, ACM, 2011.
[96] R. Kannan, H. Woo, C. C. Aggarwal, and H. Park, “Outlier detection for
text data: An extended version,” arXiv preprint arXiv:1701.01325, 2017.
[97] A. Kappas, “Social regulation of emotion: messy layers,” Frontiers in psy-
chology, vol. 4, p. 51, 2013.
[98] S. P. Kasiviswanathan, P. Melville, A. Banerjee, and V. Sindhwani, “Emerg-
ing topic detection using dictionary learning,” in Proceedings of the 20th
ACM international conference on Information and knowledge management,
pp. 745–754, ACM, 2011.
[99] S. P. Kasiviswanathan, H. Wang, A. Banerjee, and P. Melville, “Online
l1-dictionary learning with application to novel document detection,” in Ad-
vances in Neural Information Processing Systems, pp. 2258–2266, 2012.
[100] P. Ke, F. Huang, M. Huang, and X. Zhu, “Araml: A stable adversarial
training framework for text generation,” arXiv preprint arXiv:1908.07195,
2019.
[101] I. Khalil, Z. Dou, and A. Khreishah, “Your credentials are compromised, do
not panic: You can be well protected,” in Proceedings of the 11th ACM on
Asia Conference on Computer and Communications Security, pp. 925–930,
ACM, 2016.
[102] M.-S. Kim and J. Han, “A particle-and-density based evolutionary cluster-
ing method for dynamic networks,” Proceedings of the VLDB Endowment,
vol. 2, no. 1, pp. 622–633, 2009.
[103] J. Kim, Y. He, and H. Park, “Algorithms for nonnegative matrix and tensor
factorizations: a unified view based on block coordinate descent framework,”
Journal of Global Optimization, vol. 58, no. 2, pp. 285–319, 2014.
[104] E. M. Knox and R. T. Ng, “Algorithms for mining distance-based outliers in
large datasets,” in Proceedings of the international conference on very large
data bases, pp. 392–403, Citeseer, 1998.
[105] S. Kokkula and N. M. Musti, “Classification and outlier detection based
on topic based pattern synthesis,” in International Workshop on Machine
Learning and Data Mining in Pattern Recognition, pp. 99–114, Springer,
2013.
[106] R. Kosala and H. Blockeel, “Web mining research: A survey,” ACM Sigkdd
Explorations Newsletter, vol. 2, no. 1, pp. 1–15, 2000.
[107] K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and
D. Brown, “Text classification algorithms: A survey,” Information, vol. 10,
no. 4, p. 150, 2019.
[108] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek, “Outlier detection in
axis-parallel subspaces of high dimensional data,” in Pacific-Asia Conference
on Knowledge Discovery and Data Mining, pp. 831–838, Springer, 2009.
[109] H.-P. Kriegel, M. Schubert, and A. Zimek, “Angle-based outlier detection
in high-dimensional data,” in Proceedings of the 14th ACM SIGKDD inter-
national conference on Knowledge discovery and data mining, pp. 444–452,
ACM, 2008.
[110] D. Kuang, J. Choo, and H. Park, “Nonnegative matrix factorization for in-
teractive topic modeling and document clustering,” in Partitional Clustering
Algorithms, pp. 215–243, Springer, 2015.
[111] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word embeddings
to document distances,” in International conference on machine learning,
pp. 957–966, 2015.
[112] S. Kutty, R. Nayak, P. Turnbull, R. Chernich, G. Kennedy, and K. Ray-
mond, “Paperminer—a real-time spatiotemporal visualization for newspaper
articles,” Digital Scholarship in the Humanities, 2019.
[113] Q. Le and T. Mikolov, “Distributed representations of sentences and doc-
uments,” in International conference on machine learning, pp. 1188–1196,
2014.
[114] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11,
pp. 2278–2324, 1998.
[115] P. Lee, L. V. Lakshmanan, and E. E. Milios, “Incremental cluster evolution
tracking from highly dynamic network data,” in 2014 IEEE 30th Interna-
tional Conference on Data Engineering, pp. 3–14, IEEE, 2014.
[116] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factoriza-
tion,” in Advances in neural information processing systems, pp. 556–562,
2001.
[117] O. Levy and Y. Goldberg, “Neural word embedding as implicit matrix fac-
torization,” in Advances in neural information processing systems, pp. 2177–
2185, 2014.
[118] Y. Li, J. Nie, Y. Zhang, B. Wang, B. Yan, and F. Weng, “Contextual recom-
mendation based on text mining,” in Proceedings of the 23rd International
Conference on Computational Linguistics: Posters, pp. 692–700, Association
for Computational Linguistics, 2010.
[119] Q. Li, A. Nourbakhsh, S. Shah, and X. Liu, “Real-time novel event detection
from social media,” in 2017 IEEE 33rd International Conference on Data
Engineering (ICDE), pp. 1129–1139, IEEE, 2017.
[120] N. Li and D. D. Wu, “Using text mining and sentiment analysis for online
forums hotspot detection and forecast,” Decision support systems, vol. 48,
no. 2, pp. 354–368, 2010.
[121] S. Liang, “Unsupervised semantic generative adversarial networks for ex-
pert retrieval,” in The World Wide Web Conference, pp. 1039–1050, ACM,
2019.
[122] C.-J. Lin, “Projected gradient methods for nonnegative matrix factoriza-
tion,” Neural computation, vol. 19, no. 10, pp. 2756–2779, 2007.
[123] Y.-R. Lin, Y. Chi, S. Zhu, H. Sundaram, and B. L. Tseng, “Facetnet: a
framework for analyzing communities and their evolutions in dynamic net-
works,” in Proceedings of the 17th international conference on World Wide
Web, pp. 685–694, ACM, 2008.
[124] F.-R. Lin, L.-S. Hsieh, and F.-T. Chuang, “Discovering genres of online
discussion threads via text mining,” Computers & Education, vol. 52, no. 2,
pp. 481–495, 2009.
[125] Y. Liu, C. Jiang, and H. Zhao, “Using contextual features and multi-view
ensemble learning in product defect identification from online discussion fo-
rums,” Decision Support Systems, vol. 105, pp. 1–12, 2018.
[126] H. Liu, X. Li, J. Li, and S. Zhang, “Efficient outlier detection for high-
dimensional data,” IEEE Transactions on Systems, Man, and Cybernetics:
Systems, vol. 48, no. 12, pp. 2451–2461, 2017.
[127] Y. Liu, Z. Li, C. Zhou, Y. Jiang, J. Sun, M. Wang, and X. He, “Gener-
ative adversarial active learning for unsupervised outlier detection,” IEEE
Transactions on Knowledge and Data Engineering, 2019.
[128] L. Liu, Y. Lu, M. Yang, Q. Qu, J. Zhu, and H. Li, “Generative adver-
sarial network for abstractive text summarization,” in Thirty-second AAAI
conference on artificial intelligence, 2018.
[129] N. Ljubesic, D. Boras, N. Bakaric, and J. Njavro, “Comparing measures
of semantic similarity,” in ITI 2008-30th International Conference on Infor-
mation Technology Interfaces, pp. 675–682, IEEE, 2008.
[130] K. Luong, T. Balasubramaniam, and R. Nayak, “A novel technique of using
coupled matrix and greedy coordinate descent for multi-view data represen-
tation,” in International Conference on Web Information Systems Engineer-
ing, pp. 285–300, Springer, 2018.
[131] K. Luong and R. Nayak, “Clustering multi-view data using non-negative
matrix factorization and manifold learning for effective understanding: A
survey paper,” in Linking and Mining Heterogeneous and Multi-view Data,
pp. 201–227, Springer, 2019.
[132] L. P. Macfadyen and S. Dawson, “Mining lms data to develop an “early
warning system” for educators: A proof of concept,” Computers & education,
vol. 54, no. 2, pp. 588–599, 2010.
[133] C. Manning, P. Raghavan, and H. Schutze, “Introduction to information
retrieval,” Natural Language Engineering, vol. 16, no. 1, pp. 100–103, 2010.
[134] V. Mehta, R. S. Caceres, and K. M. Carter, “Evaluating topic quality using
model clustering,” in 2014 IEEE Symposium on Computational Intelligence
and Data Mining (CIDM), pp. 178–185, IEEE, 2014.
[135] Y. Meng, J. Shen, C. Zhang, and J. Han, “Weakly-supervised hierarchi-
cal text classification,” in Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 33, pp. 6826–6833, 2019.
[136] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of
word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[137] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Dis-
tributed representations of words and phrases and their compositionality,”
in Advances in neural information processing systems, pp. 3111–3119, 2013.
[138] M. Mohler and R. Mihalcea, “Text-to-text semantic similarity for automatic
short answer grading,” in Proceedings of the 12th Conference of the Euro-
pean Chapter of the Association for Computational Linguistics, pp. 567–575,
Association for Computational Linguistics, 2009.
[139] W. A. Mohotti and R. Nayak, “Corpus-based augmented media posts with
density-based clustering for community detection,” in 2018 IEEE 30th Inter-
national Conference on Tools with Artificial Intelligence (ICTAI), pp. 379–
386, IEEE, 2018.
[140] W. A. Mohotti and R. Nayak, “An efficient ranking-centered density-based
document clustering method,” in Pacific-Asia Conference on Knowledge
Discovery and Data Mining, pp. 439–451, Springer, 2018.
[141] B. Nadler and M. Galun, “Fundamental limitations of spectral clustering,”
in Advances in neural information processing systems, pp. 1017–1024, 2007.
[142] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi, “Bad news travel
fast: A content-based analysis of interestingness on twitter,” in Proceedings
of the 3rd international web science conference, p. 8, ACM, 2011.
[143] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis
and an algorithm,” in Advances in neural information processing systems,
pp. 849–856, 2002.
[144] D. Nolleke, C. G. Grimmer, and T. Horky, “News sources and follow-up
communication: Facets of complementarity between sports journalism and
social media,” Journalism Practice, vol. 11, no. 4, pp. 509–526, 2017.
[145] N. Oikonomakou and M. Vazirgiannis, “A review of web document cluster-
ing approaches,” in Data mining and knowledge discovery handbook, pp. 921–
943, Springer, 2005.
[146] T. Pang, F. Nie, and J. Han, “Flexible orthogonal neighborhood preserving
embedding,” in IJCAI, pp. 2592–2598, 2017.
[147] A. Park, M. Conway, and A. T. Chen, “Examining thematic similarity,
difference, and membership in three online mental health communities from
reddit: a text mining and visualization approach,” Computers in human
behavior, vol. 78, pp. 98–112, 2018.
[148] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word
representation,” in Proceedings of the 2014 conference on empirical methods
in natural language processing (EMNLP), pp. 1532–1543, 2014.
[149] R. Peter, G. Shivapratap, G. Divya, and K. Soman, “Evaluation of svd and
nmf methods for latent semantic analysis,” International Journal of Recent
Trends in Engineering, vol. 1, no. 3, p. 308, 2009.
[150] W. M. Pottenger and T.-h. Yang, “Detecting emerging concepts in textual
data mining,” Computational information retrieval, vol. 100, no. 1, pp. 89–
105, 2001.
[151] T. Puranik and L. Narayanan, “Community detection in evolving net-
works,” in Proceedings of the 2017 IEEE/ACM International Conference
on Advances in Social Networks Analysis and Mining 2017, pp. 385–390,
ACM, 2017.
[152] J. Qiang, P. Chen, T. Wang, and X. Wu, “Topic modeling over short texts
by incorporating word embeddings,” in Pacific-Asia Conference on Knowl-
edge Discovery and Data Mining, pp. 363–374, Springer, 2017.
[153] M. Qin, D. Jin, K. Lei, B. Gabrys, and K. Musial-Gabrys, “Adaptive com-
munity detection incorporating topology and content in social networks,”
Knowledge-Based Systems, vol. 161, pp. 342–356, 2018.
[154] M. Radovanovic, A. Nanopoulos, and M. Ivanovic, “Hubs in space: Popular
nearest neighbors in high-dimensional data,” Journal of Machine Learning
Research, vol. 11, no. Sep, pp. 2487–2531, 2010.
[155] M. Radovanovic, A. Nanopoulos, and M. Ivanovic, “Reverse nearest neigh-
bors in unsupervised distance-based outlier detection,” IEEE transactions
on knowledge and data engineering, vol. 27, no. 5, pp. 1369–1382, 2014.
[156] F. Raiber and O. Kurland, “Exploring the cluster hypothesis, and cluster-
based retrieval, over the web,” in Proceedings of the 21st ACM interna-
tional conference on Information and knowledge management, pp. 2507–
2510, ACM, 2012.
[157] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient algorithms for mining
outliers from large data sets,” in ACM Sigmod Record, vol. 29, pp. 427–438,
ACM, 2000.
[158] M. Ramezani, A. Khodadadi, and H. R. Rabiee, “Community detection
using diffusion information,” ACM Transactions on Knowledge Discovery
from Data (TKDD), vol. 12, no. 2, p. 20, 2018.
[159] A. Rangrej, S. Kulkarni, and A. V. Tendulkar, “Comparative study of clus-
tering techniques for short text documents,” in Proceedings of the 20th in-
ternational conference companion on World wide web, pp. 111–112, ACM,
2011.
[160] T. Roelleke and J. Wang, “Tf-idf uncovered: a study of theories and prob-
abilities,” in Proceedings of the 31st annual international ACM SIGIR con-
ference on Research and development in information retrieval, pp. 435–442,
ACM, 2008.
[161] M. Rosvall and C. T. Bergstrom, “Maps of random walks on complex net-
works reveal community structure,” Proceedings of the National Academy of
Sciences, vol. 105, no. 4, pp. 1118–1123, 2008.
[162] M. Sahami and T. D. Heilman, “A web-based kernel function for measuring
the similarity of short text snippets,” in Proceedings of the 15th international
conference on World Wide Web, pp. 377–386, ACM, 2006.
[163] G. Salton and C. Buckley, “Term-weighting approaches in automatic text
retrieval,” Information processing & management, vol. 24, no. 5, pp. 513–
523, 1988.
[164] E. Schubert, A. Zimek, and H.-P. Kriegel, “Fast and scalable outlier detec-
tion with approximate nearest neighbor ensembles,” in International Con-
ference on Database Systems for Advanced Applications, pp. 19–36, Springer,
2015.
[165] H. Schütze, C. D. Manning, and P. Raghavan, Introduction to information
retrieval, vol. 39. Cambridge University Press, 2008.
[166] F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons, “Document
clustering using nonnegative matrix factorization,” Information Processing
& Management, vol. 42, no. 2, pp. 373–386, 2006.
[167] F. Shang, L. Jiao, and F. Wang, “Graph dual regularization non-negative
matrix factorization for co-clustering,” Pattern Recognition, vol. 45, no. 6,
pp. 2237–2250, 2012.
[168] T. Shi, K. Kang, J. Choo, and C. K. Reddy, “Short-text topic modeling via
non-negative matrix factorization enriched with local word-context correla-
tions,” in Proceedings of the 2018 World Wide Web Conference, pp. 1105–
1114, International World Wide Web Conferences Steering Committee, 2018.
[169] W. Silva, A. Santana, F. Lobato, and M. Pinheiro, “A methodology for
community detection in twitter,” in Proceedings of the International Con-
ference on Web Intelligence, pp. 1006–1009, ACM, 2017.
[170] S. Sinclair and G. Rockwell, “the voyant tools team,” 2012.
[171] M. D. Smucker and J. Allan, “A new measure of the cluster hypothesis,” in
Conference on the Theory of Information Retrieval, pp. 281–288, Springer,
2009.
[172] T. Sutanto and R. Nayak, “The ranking based constrained document clus-
tering method and its application to social event detection,” in Interna-
tional Conference on Database Systems for Advanced Applications, pp. 47–
60, Springer, 2014.
[173] T. Sutanto and R. Nayak, “Semi-supervised document clustering via
loci,” in International Conference on Web Information Systems Engineering,
pp. 208–215, Springer, 2015.
[174] T. Sutanto and R. Nayak, “Fine-grained document clustering via ranking
and its application to social media analytics,” Social Network Analysis and
Mining, vol. 8, no. 1, p. 29, 2018.
[175] N. Tomasev and D. Mladenic, “Hub co-occurrence modeling for robust high-
dimensional knn classification,” in Joint European Conference on Machine
Learning and Knowledge Discovery in Databases, pp. 643–659, Springer,
2013.
[176] N. Tomasev, M. Radovanovic, D. Mladenic, and M. Ivanovic, “The role of
hubness in clustering high-dimensional data,” IEEE Transactions on Knowl-
edge and Data Engineering, vol. 26, no. 3, pp. 739–751, 2013.
[177] N. Tomasev, M. Radovanovic, D. Mladenic, and M. Ivanovic, “Hubness-
based clustering of high-dimensional data,” in Partitional clustering algo-
rithms, pp. 353–386, Springer, 2015.
[178] P. University, “Predictive modeling & machine learning laboratory,” 2016.
[179] T. Wagner, R. Feger, and A. Stelzer, “Modifications of the optics clus-
tering algorithm for short-range radar tracking applications,” in 2018 15th
European Radar Conference (EuRAD), pp. 91–94, IEEE, 2018.
[180] X. Wang and A. McCallum, “Topics over time: a non-markov continuous-
time model of topical trends,” in Proceedings of the 12th ACM SIGKDD
international conference on Knowledge discovery and data mining, pp. 424–
433, ACM, 2006.
[181] H. Wang, F. Nie, H. Huang, and F. Makedon, “Fast nonnegative matrix
tri-factorization for large-scale data co-clustering,” in Twenty-Second Inter-
national Joint Conference on Artificial Intelligence, 2011.
[182] H. Wang, Z. Qin, and T. Wan, “Text generation based on generative adver-
sarial nets with latent variables,” in Pacific-Asia Conference on Knowledge
Discovery and Data Mining, pp. 92–103, Springer, 2018.
[183] N. Wang and D.-Y. Yeung, “Learning a deep compact image representation
for visual tracking,” in Advances in neural information processing systems,
pp. 809–817, 2013.
[184] R. Wang, D. Zhou, and Y. He, “Open event extraction from online text
using a generative adversarial network,” arXiv preprint arXiv:1908.09246,
2019.
[185] L. Wensen, C. Zewen, W. Jun, and W. Xiaoyi, “Short text classification
based on wikipedia and word2vec,” in 2016 2nd IEEE International Con-
ference on Computer and Communications (ICCC), pp. 1195–1200, IEEE,
2016.
[186] S. M. Wong, W. Ziarko, and P. C. Wong, “Generalized vector spaces model
in information retrieval,” in Proceedings of the 8th annual international ACM
SIGIR conference on Research and development in information retrieval,
pp. 18–25, ACM, 1985.
[187] M. Wozniak, M. Grana, and E. Corchado, “A survey of multiple classifier
systems as hybrid systems,” Information Fusion, vol. 16, pp. 3–17, 2014.
[188] L. Xu, C. Jiang, Y. Ren, and H.-H. Chen, “Microblog dimensionality re-
duction—a deep learning approach,” IEEE Transactions on Knowledge and
Data Engineering, vol. 28, no. 7, pp. 1779–1789, 2016.
[189] J. Xu, W. Peng, T. Guanhua, X. Bo, Z. Jun, W. Fangyuan, H. Hongwei,
et al., “Short text clustering via convolutional neural networks,” in Pro-
ceedings of the Annual Conference of the North American Chapter of the
Association for Computational Linguistics, pp. 62–69, Association for Com-
putational Linguistics, 2015.
[190] J. Xu, B. Xu, P. Wang, S. Zheng, G. Tian, and J. Zhao, “Self-taught
convolutional neural networks for short text clustering,” Neural Networks,
vol. 88, pp. 22–31, 2017.
[191] Y. Yan, R. Huang, C. Ma, L. Xu, Z. Ding, R. Wang, T. Huang, and B. Liu,
“Improving document clustering for short texts by long documents via a
dirichlet multinomial allocation model,” in Asia-Pacific Web (APWeb) and
Web-Age Information Management (WAIM) Joint Conference on Web and
Big Data, pp. 626–641, Springer, 2017.
[192] T. Yang, Y. Chi, S. Zhu, Y. Gong, and R. Jin, “Detecting communities and
their evolutions in dynamic social networks—a bayesian approach,” Machine
learning, vol. 82, no. 2, pp. 157–189, 2011.
[193] P. Yang and B. Huang, “Knn based outlier detection algorithm in large
dataset,” in 2008 International Workshop on Education Technology and
Training & 2008 International Workshop on Geoscience and Remote Sens-
ing, vol. 1, pp. 611–613, IEEE, 2008.
[194] J. Yi, Y. Zhang, X. Zhao, and J. Wan, “A novel text clustering approach
using deep-learning vocabulary network,” Mathematical Problems in Engi-
neering, vol. 2017, 2017.
[195] Y. You, G. Huang, J. Cao, E. Chen, J. He, Y. Zhang, and L. Hu, “Geam: A
general and event-related aspects model for twitter event detection,” in Inter-
national Conference on Web Information Systems Engineering, pp. 319–332,
Springer, 2013.
[196] B. Yu, “Research on information retrieval model based on ontology,”
EURASIP Journal on Wireless Communications and Networking, vol. 2019,
no. 1, p. 30, 2019.
[197] Z. Yuan, X. Zhang, and S. Feng, “Hybrid data-driven outlier detection
based on neighborhood information entropy and its developmental mea-
sures,” Expert Systems with Applications, vol. 112, pp. 243–257, 2018.
[198] X. Zhang, H. Gao, G. Li, J. Zhao, J. Huo, J. Yin, Y. Liu, and L. Zheng,
“Multi-view clustering based on graph-regularized nonnegative matrix fac-
torization for object recognition,” Information Sciences, vol. 432, pp. 463–
478, 2018.
[199] B. Zhang, H. Li, Y. Liu, L. Ji, W. Xi, W. Fan, Z. Chen, and W.-Y. Ma, “Im-
proving web search results using affinity graph,” in Proceedings of the 28th
annual international ACM SIGIR conference on Research and development
in information retrieval, pp. 504–511, ACM, 2005.
[200] J. Zhang, X. Long, and T. Suel, “Performance of compressed inverted list
caching in search engines,” in Proceedings of the 17th international confer-
ence on World Wide Web, pp. 387–396, ACM, 2008.
[201] W. Zhao, Q. He, H. Ma, and Z. Shi, “Effective semi-supervised document
clustering via active learning with instance-level constraints,” Knowledge
and information systems, vol. 30, no. 3, pp. 569–587, 2012.
[202] C. T. Zheng, C. Liu, and H. San Wong, “Corpus-based topic diffusion for
short text clustering,” Neurocomputing, vol. 275, pp. 2444–2458, 2018.
[203] N. Zheng and J. Xue, “Manifold learning,” in Statistical Learning and Pat-
tern Analysis for Image and Video Processing, pp. 87–119, Springer, 2009.
[204] P. Zhu, X. Zhan, and W. Qiu, “Efficient k-nearest neighbors search in high
dimensions using mapreduce,” in 2015 IEEE Fifth International Conference
on Big Data and Cloud Computing, pp. 23–30, IEEE, 2015.
[205] A. Zimek, “Clustering high-dimensional data,” in Data Clustering, pp. 201–
230, Chapman and Hall/CRC, 2018.