Knowledge Discovery Using Pattern Taxonomy Model in Text Mining
by
Sheng-Tang Wu
(B.Sc., M.Sc.)
A dissertation submitted for the degree of
Doctor of Philosophy
Faculty of Information Technology
Queensland University of Technology
December 2007
Keywords
Pattern Taxonomy Model, Information Retrieval, Text Mining, Data Mining, Association Rules, Sequential Pattern Mining, Closed Sequential Patterns, Pattern Deploying, Pattern Evolving.
Abstract
In the last decade, many data mining techniques have been proposed for fulfilling
various knowledge discovery tasks in order to achieve the goal of retrieving useful
information for users. Various types of patterns can then be generated using
these techniques, such as sequential patterns, frequent itemsets, and closed and
maximum patterns. However, how to effectively exploit the discovered patterns
is still an open research issue, especially in the domain of text mining. Most
text mining methods adopt a keyword-based approach, constructing text
representations from single words or terms, whereas other methods use
phrases instead of keywords, based on the hypothesis that a phrase carries
more information than a single term. Nevertheless, these phrase-based
methods did not yield significant improvements because high-frequency
patterns (normally the shorter ones) usually have high exhaustivity but low
specificity, while the more specific patterns suffer from low frequency.
This thesis presents research on developing an effective
Pattern Taxonomy Model (PTM) to overcome the aforementioned problem by
deploying discovered patterns into a hypothesis space. PTM is a pattern-based
method which adopts the technique of sequential pattern mining and uses closed
patterns as features in the representation. A PTM-based information filtering
system is implemented and evaluated by a series of experiments on the latest
version of the Reuters dataset, RCV1. The pattern evolution schemes are also
proposed in this thesis in an attempt to utilise information from negative
training examples to update the discovered knowledge. The results show that the
PTM outperforms not only all up-to-date data mining-based methods, but also the
traditional Rocchio and the state-of-the-art BM25 and Support Vector Machines
(SVM) approaches.
Contents
Keywords i
Abstract iii
List of Figures x
List of Tables xii
Statement of Original Authorship xiii
Acknowledgement xv
1 Introduction 1
1.1 Problem Statement 5
1.2 Contributions 6
1.3 Research Methodology 7
1.4 Thesis Outline 8
2 Literature Review 11
2.1 Knowledge Discovery 11
2.1.1 Process of Knowledge Discovery 13
2.1.2 Data Repository 15
2.1.3 Tasks and Challenges 18
2.2 Association Analysis 21
2.2.1 Association Rules Mining 22
2.2.2 Sequential Patterns 24
2.2.3 Frequent Itemsets 25
2.3 Text Mining 27
2.3.1 Feature Selection 27
2.3.2 Term Weighting 29
2.4 Text Representation 32
2.4.1 Keyword-based Representation 32
2.4.2 Phrase-based Representation 33
2.4.3 Other Representation 34
2.5 Information Filtering 35
2.6 Chapter Summary 36
3 Prototype of Pattern Taxonomy Model 39
3.1 Pattern Taxonomy Model 39
3.1.1 Sequential Pattern Mining (SPM) 40
3.1.2 Pattern Pruning 43
3.1.3 Using Discovered Patterns 52
3.2 Finding Non-Sequential Patterns 53
3.2.1 Basic Definition of NSPM 54
3.2.2 NSPM Algorithm 55
3.3 Related Work 59
3.4 Chapter Summary 62
4 Pattern Deploying Methods 65
4.1 Pattern Deploying 66
4.1.1 Pattern Deploying Method (PDM) 71
4.1.2 Pattern Deploying based on Supports (PDS) 79
4.2 Related Work 85
4.3 Chapter Summary 86
5 Evolution of Discovered Patterns 87
5.1 Deployed Pattern Evolution 87
5.1.1 Basic Definition of DPE 88
5.1.2 The Algorithm of DPE 91
5.2 Individual Pattern Evolution 95
5.2.1 Basic Definition of IPE 97
5.2.2 The Algorithm of IPE 101
5.3 Related Work 103
5.4 Chapter Summary 104
6 Experiments and Results 107
6.1 Experimental Dataset 108
6.2 Performance Measures 113
6.3 Evaluation Procedures 117
6.3.1 Document Indexing 121
6.3.2 Procedure of Pattern Discovery 124
6.3.3 Procedure of Pattern Deploying 125
6.3.4 Procedure of Pattern Evolving 127
6.4 Experimental Setting 130
6.5 Experiment Evaluation 131
6.5.1 Experiment on Pattern Discovery Methods 133
6.5.2 Experiment on Pattern Deploying 146
6.5.3 Experiment on Pattern Evolution 158
6.6 Chapter Summary 172
7 Conclusion 175
7.1 Contributions 176
7.2 Future Work 179
Appendices 181
A An Example of a RCV1 Document 181
B Topic Codes of TREC RCV1 185
C List of Stopwords 189
Bibliography 191
List of Figures
1.1 The research cycle. 7
2.1 A typical process of knowledge discovery [43]. 13
2.2 Taxonomy of Web mining techniques [82]. 15
2.3 Bag-of-words representation using word frequency. 32
3.1 An example of pattern taxonomy where patterns in dash boxes are closed patterns. 44
3.2 Illustration of pruning redundant patterns. 46
4.1 Deploying patterns into a term space. 66
4.2 Overlaps between discovered patterns. 68
4.3 Flowchart of pattern deploying methods in Pattern Taxonomy Model. 70
4.4 The process of merging pattern taxonomies into the feature space. 71
5.1 A negative document nd and its offending deployed patterns. 90
5.2 Different levels involved by DPE and IPE in pattern evolution. 95
5.3 The flowchart of two pattern evolving approaches. 97
5.4 Relations between patternset and termset under the topic “Effects of global warming”. 99
6.1 An XML document in RCV1 dataset. 111
6.2 Distribution of words in an RCV1 collection [118]. 112
6.3 Number of paragraphs per document in an RCV1 collection [118]. 112
6.4 An example of topic description. 113
6.5 Process of document indexing. 121
6.6 Primary output of a preprocessed document and found patterns. 123
6.7 Flow chart of experimental procedure for pattern deploying methods PDM and PDS in the pattern taxonomy model PTM. 126
6.8 Flow chart of experimental procedure for pattern evolving methods DPE and IPE in the pattern taxonomy model PTM. 128
6.9 Number of patterns discovered using SPM with different constraints on 10 RCV1 topics. 137
6.10 Comparison of precision and recall curves for different methods on RCV1 Topic r110. 142
6.11 Comparison of all methods in precision at standard recall points on the first 50 topics. 154
6.12 Comparison of PDS method and Rocchio method in difference of Fβ=1 on all topics. 155
6.13 Comparison of the PDS method and the Rocchio method in difference of top-20 precision on all topics. 155
6.14 Comparison of all methods in all measures on 100 topics. 156
6.15 The relationship between the proportion in number of negative documents greater than threshold to all documents and corresponding improvement on DPE with µ = 5 on improved topics. 163
6.16 Comparison in the number of patterns used for training by each method on the first 50 topics (r101∼r150) and the rest of the topics (r151∼r200). 165
6.17 Comparison of PTM(IPE) and TFIDF in top-20 precision. 166
6.18 Comparing PTM(IPE) with data mining methods on the first 50 RCV1 topics. 168
6.19 Comparing PTM(IPE) with other methods on the first 50 RCV1 topics. 169
List of Tables
2.1 Association rules mining algorithms. 26
2.2 Information Filtering models. 37
3.1 Each transaction represents a paragraph in a text document and contains a sequence consisting of an ordered list of words. 42
3.2 All frequent sequential patterns discovered from the sample document (Table 3.1) with min_sup: ξ = 0.5. 42
3.3 Frequent 1Term patterns with min_sup = 0.5. 48
3.4 An example of a p-projected database. 48
3.5 2Terms sequential patterns derived from 1Term patterns. 49
3.6 The assessment of closed pattern of 1Term patterns. 51
3.7 The assessment of closed pattern of 2Terms patterns. 51
3.8 Discovered frequent closed and non-closed sequential patterns. 52
3.9 2Terms candidates generated during non-sequential pattern mining. 57
3.10 3Terms candidates generated during non-sequential pattern mining. 58
3.11 4Terms candidates generated in NSPM. 59
3.12 Frequent non-sequential patterns discovered using NSPM. 60
4.1 Example of a set of positive documents consisting of pattern taxonomies. The number beside each sequential pattern indicates the absolute support of the pattern. 73
4.2 Patterns with their support from the sample database. 80
5.1 Examples of positive documents which are represented by a set of sequential patterns mined using PTM. 88
5.2 Deployed patterns from the document examples. 89
5.3 dp2 and dp3 are replaced by dp6 and deployed patterns are normalised. 89
5.4 The change of term weights in offender dp1 before and after shuffling when µ = 1/2. 94
5.5 Examples of positive documents represented by a set of sequential patterns with frequency. 99
5.6 Normalised patternsets which contain sequential patterns with corresponding weights. 100
5.7 An example of patternset composition. 100
6.1 Current Reuters data collections. 109
6.2 Contingency table. 114
6.3 Number of relevant documents (#r) and total number of documents (#d) by each topic in the RCV1 training dataset. 118
6.4 Number of relevant documents (#r) and total number of documents (#d) by each topic in the RCV1 test dataset. 119
6.5 Comparing PTM with data mining-based methods on RCV1 topics r101 to r150. 134
6.6 Precisions of top 20 returned documents on 10 RCV1 topics. 140
6.7 Results of pattern deploying methods compared with others on the first 50 topics. 148
6.8 Results of pattern deploying methods compared with others on the last 50 topics. 149
6.9 Results of pattern deploying methods compared with others on all topics. 151
6.10 Accumulated number of patterns found during pattern discovering. 153
6.11 The list of methods used for evaluation. 160
6.12 Comparison of pattern deploying and pattern evolving methods used by PTM on all topics. 162
6.13 Comparison of all methods on the first 50 topics. 164
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.
Signed:
Date:
Acknowledgement
Firstly, I would like to express my immense gratitude to Associate Professor
Yuefeng Li, my principal supervisor, for all his guidance and encouragement
throughout this research work. He has always been there, providing
support with his excellent expertise in this area. Many thanks also go to my
associate supervisors, Dr. Yue Xu and Associate Professor Yi-Ping Phoebe Chen
for their generous support and comments on my work during this candidature.
I would also like to thank my examiners for their precious comments and
suggestions.
Special thanks must go to the Faculty of Information Technology, QUT, which
has provided me with a comfortable research environment, the needed facilities,
and financial support including my scholarship and travel allowances over the period
of my candidature. I would especially like to thank all the members of our research
group for offering invaluable advice and comments regarding my research work.
This work would not have been accomplished without the constant support of
my family. I would like to dedicate this thesis to my parents for their never-ending
encouragement over these years.
Last but certainly not least, I would like to thank my wife Vivien and my
parents-in-law for their tremendous support.
Chapter 1
Introduction
Due to the rapid growth of digital data made available in recent years, knowledge
discovery and data mining have attracted great attention with an imminent
need for turning such data into useful information and knowledge. Many
applications, such as market analysis and business management, can benefit from
the use of the information and knowledge extracted from a large amount of data.
Knowledge discovery can be viewed as the process of nontrivial extraction of
information from large databases, information that is implicitly presented in the
data, previously unknown and potentially useful for users [33, 42]. Data mining
is therefore an essential step in the process of knowledge discovery in databases.
In the past decade, a significant number of data mining techniques have been
presented in order to perform different knowledge tasks. These techniques include
association rule mining, frequent itemset mining, sequential pattern mining,
maximum pattern mining and closed pattern mining. Most of them are proposed
for the purpose of developing efficient mining algorithms to find particular
patterns within a reasonable and acceptable time frame. With a large number of
patterns generated by using the data mining approaches, how to effectively exploit
these patterns is still an open research issue. Therefore, in this thesis, we focus on
the development of a knowledge discovery model to effectively use the discovered
patterns and apply it to the field of text mining.
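The mining techniques listed above all rest on the same primitive: counting a pattern's support in a data collection. As a rough illustration (the transactions and threshold here are hypothetical, and the enumeration is deliberately naive), a minimal frequent-itemset miner can be sketched in Python:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Naive frequent-itemset mining: enumerate candidate itemsets and
    keep those whose relative support reaches min_sup."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            # Relative support: fraction of transactions containing the candidate.
            sup = sum(1 for t in transactions if set(cand) <= t) / len(transactions)
            if sup >= min_sup:
                result[cand] = sup
    return result

# Hypothetical transaction database:
transactions = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
print(frequent_itemsets(transactions, min_sup=2/3))
```

Practical algorithms such as Apriori prune this search space instead of enumerating every candidate, but the support computation is the same.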
Text mining is the technique that helps users find useful information from a
large amount of digital text data. It is therefore crucial that a good text mining
model retrieves the information that users require efficiently.
Traditional Information Retrieval (IR) has the same objective of automatically
retrieving as many relevant documents as possible whilst filtering out irrelevant
documents at the same time [50]. However, IR-based systems do not adequately
provide users with what they really need [86]. Many text mining methods have
been developed in order to achieve the goal of retrieving useful information
for users [1, 36, 71, 130, 133]. Most text mining methods use the keyword-
based approaches, whereas others choose the phrase technique to construct a
text representation for a set of documents. It is believed that the phrase-based
approaches should perform better than the keyword-based ones as it is considered
that more information is carried by a phrase than by a single term. Based on
this hypothesis, Lewis [77] conducted several experiments using phrasal indexing
language on a text categorisation task. Ironically, the results showed that the
phrase-based indexing language was not superior to the word-based one.
Although phrases carry less ambiguous and more succinct meanings than
individual words, the likely reasons for the discouraging performance from the
use of phrases are: (1) phrases have inferior statistical properties to words, (2)
they have a low frequency of occurrence, and (3) there are a large number
of redundant and noisy phrases among them [130]. Scott and Matwin [129]
also suggested that simple phrase-based representations are not worth pursuing
since they found no significant performance improvement on eight different
representations based on words, phrases, synonyms and hypernyms. They also
suggested that combining classifiers with alternative representations might
produce more favourable results.
In order to solve the above-mentioned problem, new studies have been
focusing on finding better text representations for a textual data collection. One
solution is to use the data mining techniques, such as sequential pattern mining,
for building up a representation with the new type of features [159]. Such data
mining-based methods adopted the concept of closed sequential patterns and
pruned non-closed patterns from the representation with an attempt to reduce the
size of the feature set by removing noisy patterns. However, treating each
multi-term pattern as an atom in the representation is likely to encounter the
low-frequency problem when dealing with long patterns [157]. Another challenge
for the data mining-based methods is that more time is spent on uncovering
knowledge from the data; consequently less significant improvements are made
compared with information retrieval methods [158].
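The closed-pattern pruning mentioned above can be sketched under the usual definition: a pattern is closed if no proper super-pattern has the same support. The patterns and support counts below are hypothetical, purely for illustration:

```python
def is_subsequence(p, q):
    """True if sequence p occurs in q preserving order
    (not necessarily contiguously)."""
    it = iter(q)
    return all(term in it for term in p)

def closed_patterns(patterns):
    """Keep only closed patterns: those with no proper super-pattern of
    equal support. `patterns` maps tuples of terms to support counts."""
    closed = {}
    for p, sup in patterns.items():
        if not any(len(q) > len(p) and patterns[q] == sup and is_subsequence(p, q)
                   for q in patterns):
            closed[p] = sup
    return closed

# Hypothetical mined sequential patterns with absolute supports:
patterns = {("carbon",): 2, ("emission",): 2, ("carbon", "emission"): 2, ("air",): 1}
print(closed_patterns(patterns))
```

Here the two single-term patterns are pruned because the longer pattern covers them with identical support, shrinking the feature set without losing frequency information.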
The problem caused by data mining-based methods is that the measures (e.g.,
supports and confidences) adopted in the phase of using discovered patterns
are not suitable. For instance, given a specified topic, a highly frequent
pattern (normally the short pattern) usually has a high exhaustivity but a low
specificity, where exhaustivity describes the extent to which the pattern discusses
the topic and specificity describes the extent to which the pattern focuses on
the topic. These measures reveal only the statistical properties of a pattern, but
not its specificity. Therefore, a new evaluation mechanism for patterns of various
lengths is required [158]. Based on this observation, this thesis proposes a novel
method, Pattern Taxonomy Model (PTM) for the purpose of effectively using
discovered patterns. PTM re-evaluates the measures of patterns by deploying
them into a common hypothesis space based on their correlations in the pattern
taxonomies. As a result, patterns with high specificity to the topic can obtain
reasonable and adequate significance values, leading to a significant improvement
in the effectiveness of the system.
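The exhaustivity/specificity trade-off can be seen in a toy example (the paragraphs and terms are hypothetical): a short, general pattern occurs in every paragraph yet says little about the topic, while a more specific multi-term pattern is far less frequent:

```python
def support(pattern, paragraphs):
    """Fraction of paragraphs containing every term of the pattern."""
    return sum(1 for p in paragraphs if set(pattern) <= set(p)) / len(paragraphs)

# Hypothetical paragraphs from documents on the topic "effects of global warming":
paragraphs = [
    ["global", "economy", "growth"],
    ["global", "warming", "effects"],
    ["global", "warming", "sea", "level"],
    ["global", "market"],
]
# The single term is exhaustive (appears everywhere) but unspecific;
# the two-term pattern is specific to the topic but infrequent.
print(support(("global",), paragraphs))
print(support(("global", "warming"), paragraphs))
```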
In addition to pattern deploying, the influence of patterns from the negative
training examples is also investigated in this research work. There is no doubt
that negative documents contain useful information to help identify ambiguous
patterns during the concept learning. A pattern may be a good indicator to
classify relevant documents if this pattern always appears in the positive examples.
However, it becomes ambiguous if the pattern also appears in negative examples
from time to time. Therefore, it is necessary for a system to collect this information
to find ambiguous patterns and to reduce their influence. The process of
refining ambiguous patterns is referred to as pattern evolution, and it is used
for concept refinement in user profile mining. Li and Zhong [86]
proposed a novel approach of pattern evolution and applied it to ontology mining
for automatically acquiring user information needs. However, their work was
developed for a keyword-based system. Hence, in our study we propose
an effective pattern evolution approach for the PTM-based system.
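The idea of spotting ambiguous patterns can be sketched as follows. This is only an illustration: the documents, patterns, and the 0.25 ratio are hypothetical, and the thesis develops its own evolution schemes (DPE and IPE) rather than this simple filter:

```python
def ambiguous_patterns(pos_docs, neg_docs, patterns, max_neg_ratio=0.25):
    """Flag patterns that also occur too often in negative documents.
    A pattern 'occurs' in a document if all of its terms appear there."""
    def occurs(pat, doc):
        return set(pat) <= doc

    flagged = []
    for pat in patterns:
        neg_hits = sum(occurs(pat, d) for d in neg_docs)
        if neg_docs and neg_hits / len(neg_docs) > max_neg_ratio:
            flagged.append(pat)
    return flagged

# Hypothetical positive and negative training documents (as term sets):
pos_docs = [{"climate", "policy"}, {"climate", "change"}]
neg_docs = [{"policy", "election"}, {"sports", "election"}]
print(ambiguous_patterns(pos_docs, neg_docs, [("climate",), ("policy",)]))
```

A flagged pattern would then have its influence reduced rather than being discarded outright, since it still carries some positive evidence.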
In order to evaluate the proposed PTM model, we apply PTM to the practical
information filtering task. Information filtering is a task in which a user with a
specific information need monitors a stream of documents while the system selects
documents from the stream according to a profile of the user’s interests. Filtering
systems process one document at a time and show it to the user if this document
is relevant. The system then adjusts the profile or updates the threshold based on
the user’s feedback. In the case of batch filtering, a number of relevant documents
are returned, whereas a list of ranked documents is given in the case of routing
filtering. In this thesis, we conduct routing filtering to avoid the need of threshold
tuning, which is beyond our research scope. Numerous experiments are performed
on the latest data collection, Reuters Corpus Volume 1 (RCV1), to evaluate the
proposed PTM-based information filtering system. The results show that the
PTM outperforms not only all up-to-date data mining-based methods, but also
the traditional probabilistic and Rocchio methods.
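Routing filtering, as used here, amounts to ranking the incoming documents by their score against the learned profile rather than accepting or rejecting them at a threshold. A minimal sketch, with entirely hypothetical term weights:

```python
def score(doc_terms, profile):
    """Score a document as the sum of profile weights of its terms."""
    return sum(profile.get(t, 0.0) for t in doc_terms)

def route(docs, profile):
    """Routing filtering: return documents ranked by relevance score,
    avoiding any relevance-threshold tuning."""
    return sorted(docs, key=lambda d: score(d, profile), reverse=True)

# Hypothetical profile of term weights and incoming documents (as term sets):
profile = {"pattern": 0.8, "taxonomy": 0.6, "mining": 0.4}
docs = [{"mining", "gold"}, {"pattern", "taxonomy"}, {"weather"}]
print(route(docs, profile))
```

Batch filtering would instead return the set of documents whose score clears a tuned threshold, which is exactly the tuning problem routing avoids.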
1.1 Problem Statement
Most research work in the data mining community has focused on developing
efficient mining algorithms for discovering a variety of patterns from a large
data collection. However, searching for useful and interesting patterns is still
an open problem [88]. In the field of text mining, data mining techniques can be
used to find various text patterns, such as sequential patterns, frequent itemsets,
co-occurring terms and multiple grams, for building up a representation with
these new types of features [159]. Nevertheless, the first problem is how to
effectively deal with the large amount of patterns generated by using the data
mining methods.
Whether using phrases for text representation increases performance on text
categorisation tasks remains in doubt [77, 130], meaning that
no particular representation method holds a dominating advantage over the
others [68, 129]. Instead of the keyword-based approach typically used
by text mining-related tasks in the past, the pattern-based model (single term or
multiple terms) is employed to perform the same kind of task. There are two
phases that we need to consider when we use pattern-based models in text mining:
one is how to discover useful patterns from digital text documents, and the other
is how to utilise these mined patterns to improve the system’s performance.
1.2 Contributions
In this thesis a new knowledge discovery model is proposed with an attempt to
effectively exploit the discovered patterns in a large data collection using data
mining approaches. This model uses pattern taxonomies as features to represent
knowledge based on the state-of-the-art data mining techniques such as sequential
pattern mining and closed pattern mining. In order to overcome the problem in the
phase of using discovered patterns, the PTM model is extended to be effective by
using the strategy of pattern deploying. Two deploying mechanisms are proposed
to enhance the effectiveness of the PTM. Furthermore, the PTM is equipped with
pattern evolution approaches to be able to deal with the negative examples during
the profile learning. The contributions are summarised as follows:
• A knowledge discovery model based on pattern taxonomies is proposed.
• The state-of-the-art data mining techniques are used in the PTM including
sequential pattern mining and closed sequential pattern mining.
• Pattern deploying strategies are provided to increase the effectiveness of the
PTM and to solve the low precision problem.
• A scalable PTM is developed with the capability of concept adjustment by
means of evolving mined patterns.
Figure 1.1: The research cycle.
• Experimental evaluations are conducted and the results demonstrate the
feasibility and effectiveness of the proposed PTM.
1.3 Research Methodology
There has been an increase in the range of research approaches that are acceptable
for knowledge discovery research during the last decade. These methods include
case studies, field studies, action research, prototyping, and experimenting [26].
As this research focuses on the development of robust mechanisms for a
knowledge discovery system, these mechanisms and the proposed theories have to
be validated by the classic scientific method of experimentation. Hence, the
experimental approach, integrated with cycles of research, is chosen as the research method. The
process of the research approach used in this research is illustrated in Figure 1.1.
1.4 Thesis Outline
The rest of this thesis is summarised as follows:
Chapter 2: This chapter is a literature review of related disciplines including
data mining, text mining, knowledge representation models and information
filtering. It surveys current work on data mining and identifies the
drawbacks of existing representation schemes.
Chapter 3: This chapter provides the definition of sequential pattern and the
proposed algorithms of mining frequent sequential patterns from a textual
data collection. This chapter also presents a novel representation scheme
that makes use of the discovered pattern taxonomies. The relevant
publications about this chapter are:
- S-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, Automatic
Pattern-Taxonomy Extraction for Web Mining, Proceedings of the
IEEE/WIC/ACM International Conference on Web Intelligence (WI
2004), pages 242–248, 2004.
- S-T. Wu, Knowledge Discovery from Digital Text Documents,
Proceedings of the 4th International Conference on Active Media
Technology (AMT 2006), pages 446–447, 2006.
- X. Zhou, S-T. Wu, Y. Li, Y. Xu, R. Y. K. Lau, and P. D. Bruza,
Utilizing Search Intent in Topic Ontology-Based User Profile for Web
Mining, Proceedings of the IEEE/WIC/ACM International Conference
on Web Intelligence (WI 2006), pages 558–564, 2006.
- R. Y. K. Lau, Y. Li, S-T. Wu, and X. Zhou, Sequential Pattern
Mining and Nonmonotonic Reasoning for Intelligent Information
Agents, International Journal of Pattern Recognition and Artificial
Intelligence, 21(4):773–789, 2007.
- X. Zhou, Y. Li, P. D. Bruza, S-T. Wu, Y. Xu, and R. Y. K.
Lau, Using Information Filtering in Web Data Mining Process, The
IEEE/WIC/ACM International Conference on Web Intelligence (WI
2007), pages 163–169, 2007.
Chapter 4: This chapter describes the extension to the model presented in
chapter 3 and discusses the problem caused by the inadequate use of mined
patterns in a pattern-based model. The strategy of deploying discovered
patterns is adopted. Two effective and feasible solutions are proposed in
this chapter to address the problem. The relevant publications are:
- Y. Li, S-T. Wu, and Y. Xu, Deploying Association Rules on
Hypothesis Spaces, Proceedings of International Conference on
Computational Intelligence for Modelling Control and Automation
(CIMCA 2004), pages 769–778, 2004.
- S-T. Wu, Y. Li and Y. Xu, An Effective Deploying Algorithm for using
Pattern-Taxonomy, Proceedings of the 7th International Conference
on Information Integration and Web-based Applications & Services
(iiWAS 2005), pages 1013–1022, 2005.
- S-T. Wu, Y. Li and Y. Xu, Deploying Approaches for Pattern
Refinement in Text Mining, Proceedings of the 6th IEEE International
Conference on Data Mining (ICDM 2006), pages 1157–1161, 2006.
Chapter 5: This chapter presents mechanisms for pattern updating including the
evolution of both deployed patterns and individual patterns. The proposed
algorithms of these evolutions are offered in this chapter.
Chapter 6: This chapter gives the description of benchmark datasets and
performance measures, along with the application of the proposed pattern
taxonomy model to the information filtering. A detailed analysis of the
comparison results of experiments is also presented in this chapter.
Chapter 7: This chapter concludes the thesis and outlines directions for future
work.
Chapter 2
Literature Review
This chapter provides a literature review containing a wide range of knowledge
discovery and text mining topics that relate to this research work and provide the
needed conceptual framework for the development of the proposed model.
2.1 Knowledge Discovery
Knowledge discovery is the process of nontrivial extraction of information from
large databases, information that is implicitly present in the data, previously
unknown and potentially useful for users [33, 42]. The knowledge discovery
can be defined as follows [42]: Given a set of facts (data) F , a language L,
and some measure of certainty C, a pattern is a statement S in L that describes
relationships among a subset Fs of F with a certainty c, such that S is simpler than
the enumeration of all facts in Fs. A pattern is called knowledge if it is interesting
and certain enough, according to the user’s imposed interestingness measures and
criteria. Discovered knowledge is the output of a system that extracts patterns
from the set of facts in a database.
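The definition above can be read operationally: a pattern counts as knowledge only if it clears the user-imposed certainty and interestingness criteria. A hypothetical sketch (the patterns, threshold, and interestingness test are invented for illustration):

```python
def discovered_knowledge(patterns, min_certainty, is_interesting):
    """Keep only patterns certain and interesting enough to count as
    knowledge, per the definition above (criteria are user-imposed)."""
    return [p for p in patterns
            if p["certainty"] >= min_certainty and is_interesting(p)]

# Hypothetical patterns, each a statement with a certainty measure:
patterns = [
    {"statement": "age>35 -> buys plasma TV", "certainty": 0.7},
    {"statement": "id is unique", "certainty": 1.0},  # certain but trivial
]
knowledge = discovered_knowledge(
    patterns, 0.6,
    is_interesting=lambda p: "unique" not in p["statement"])
print(len(knowledge))
```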
The term pattern in the above definition is an expression in some language
describing a subset of the data [40]. For example, a pattern in a high-level
language can be expressed as:
If Age > 35 and Salary > 70K
Then buy(“Plasma TV”)
With Likelihood(0.6...0.8).
The above pattern can be understood by people and used directly by some
knowledge discovery system (e.g., expert system). In different communities,
finding useful patterns in data is represented by different names including data
mining, knowledge extraction, information harvesting and data archaeology.
Representing the degree of certainty is essential to determining how much
faith the system or user should put into a discovery [42]. Certainty is affected
by several factors, such as the size of the sample, the integrity of the data and the
support from domain knowledge. Patterns cannot be considered knowledge with
insufficient certainty. The discovered patterns also must be valid, novel and
potentially useful for the users to meet their information needs.
Numerous patterns may be discovered from a database, but
not all of them are interesting. Only those evaluated to be interesting in some
manner are viewed as useful knowledge. This depends on the assumed frame of
reference, defined either by the system itself or by the user’s knowledge. A system
may encounter a problem where a discovered pattern is not interesting to a user. Such
patterns are not qualified as knowledge. Therefore, a knowledge discovery system
should have the capability of deciding whether a pattern is interesting enough to
form knowledge in the current context.
In summary, knowledge discovery has to exhibit the following characteristics [42]:
Figure 2.1: A typical process of knowledge discovery [43].
- Interestingness: Discovered knowledge is interesting based on the
implication that patterns should be novel and potentially useful, and the
process of knowledge discovery must be nontrivial.
- Accuracy: Discovered patterns should accurately depict the contents of the
data. The extent to which the depiction is imperfect is expressed by measures
of certainty.
- Efficiency: The process of knowledge discovery is efficient, especially for
large data sources. An algorithm is considered efficient if the run time is
acceptable and predictable.
- Understandability: A high-level language is required for expressing
discovered knowledge. The expression must be understandable by users.
2.1.1 Process of Knowledge Discovery
As shown in Figure 2.1, the steps of knowledge discovery may consist of
the following: data selection, data preprocessing, data transformation, pattern
discovery and pattern evaluation [43]. These steps are briefly described as follows:
Data selection: This process includes generating a target dataset and selecting a
dataset or a subset of large data sources where discovery is to be performed.
The input of this process is a database and the output is a target dataset. For
example, among various data sources on the World Wide Web, we may
collect newswire-related Web pages for Web content mining tasks.
Pre-processing: This process involves data cleaning and noise removing. It also
includes collecting required information from selected data fields, providing
appropriate strategies for dealing with missing data and accounting for
redundant data. In the case of Web pages, non-textual data such as tags,
CSS codes, hyperlinks, pictures and metadata need to be removed for Web
content mining.
Transformation: The preprocessed data needs to be transformed into a
predefined format, depending on the data mining task. This process needs
to select an adequate type of features to represent data. In addition, feature
selection can be used at this stage for dimensionality reduction. At the end
of this process, a set of features is recognised as a dataset.
Data mining: Data mining is a specific activity that is conducted over the
transformed data in order to discover patterns. Based on user requirements,
the discovered patterns can be pairs of features from the given dataset, a set
of ordered features occurring together, or a maximum set of features.
Evaluating: The discovered patterns are evaluated to determine whether they are
valid, novel and potentially useful for the users to meet their information needs. Only those
Figure 2.2: Taxonomy of Web mining techniques [82].
evaluated to be interesting in some manner are viewed as useful knowledge.
This process should decide whether a pattern is interesting enough to form
knowledge in the current context.
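The five steps above can be sketched as a minimal pipeline; the stage bodies below (the newswire filter, the word-frequency threshold) are illustrative placeholders, not implementations prescribed by the text.

```python
# The five steps of Figure 2.1 as a minimal pipeline sketch; each
# stage mirrors the description above with an illustrative body.

def select(source):          # data selection: pick a target subset
    return [rec for rec in source if rec.get("topic") == "newswire"]

def preprocess(records):     # cleaning: drop records with missing text
    return [r for r in records if r.get("text")]

def transform(records):      # transformation: map each record to features
    return [set(r["text"].lower().split()) for r in records]

def mine(featuresets, min_count=2):  # data mining: frequent features
    counts = {}
    for fs in featuresets:
        for f in fs:
            counts[f] = counts.get(f, 0) + 1
    return {f for f, c in counts.items() if c >= min_count}

def evaluate(patterns):      # evaluation: keep the interesting ones
    return sorted(patterns)

source = [
    {"topic": "newswire", "text": "markets rally on trade news"},
    {"topic": "sports", "text": "final score"},
    {"topic": "newswire", "text": "trade talks stall"},
    {"topic": "newswire", "text": None},
]
print(evaluate(mine(transform(preprocess(select(source))))))  # ['trade']
```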
2.1.2 Data Repository
Knowledge Discovery in Databases (KDD) is a term often used interchangeably with
data mining, which aims at discovering interesting patterns or trends from a
database. In particular, KDD denotes the process of turning low-level data into
high-level knowledge [48], with data mining as the core step that extracts
patterns from the data. Therefore, knowledge discovery and data mining
should be applicable to any kind of data repository. There are many different data
stores where mining can be applied, including relational databases, transactional
databases, text and multimedia databases, and the World Wide Web. Following
are the brief descriptions of these repositories.
Relational Databases: A relational database consists of tables, each of which
has a unique name and contains a set of attributes (i.e., columns). A set
of records (i.e., tuples) is stored in a table. Each record represents an
object with a set of attribute values and is assigned a unique key. Data
in a relational database can be accessed by relational queries (e.g., SQL).
Data mining techniques can be applied to relational databases for pattern
discovery or trends detection. Relational databases are one of the most
common information repositories in the domain of knowledge discovery
and data mining.
Transactional Databases: A transactional database is generally a collection of
purchase records and is commonly used for the analysis of market basket.
Each record contains a unique identity number and a list of purchased items.
A transactional database usually consists of a large number of records.
Nevertheless, data mining systems can easily identify which items are sold
together and find the relationships between item types and certain groups
of customers.
Spatial and Temporal Databases: A spatial database contains images or maps
which are used for urban renewal or public service planning. Data
mining is able to find patterns in a spatial database that describe relationships
between objects on the images or maps. A temporal database is a time-series
database which contains time-related attributes within relational data.
Analysing a temporal database using data mining techniques can uncover the
trend of change for objects. It also provides useful information for decision
making and strategy planning.
Text Databases: A text database consists of text data which is used for describing
objects. Such a database has three types of data structure: structured (e.g.
relational database with values in text format), semi-structured (e.g., XML
documents), and unstructured (e.g., Web pages). Associations of terms in
text can be discovered by applying data mining techniques to text databases.
For further analysis, they may need to integrate with some techniques from
other fields, such as information retrieval.
The World Wide Web: The World Wide Web provides rich information on an
extremely large number of linked Web pages. Such a repository contains not
only text data but also multimedia objects, such as images, audio and video
clips. Data mining on the World Wide Web can be referred to as Web mining
which has gained much attention with the rapid growth in the amount of
information available on the internet. Web mining is classified into several
categories, including Web content mining, Web usage mining and Web
structure mining. A taxonomy of Web mining techniques is illustrated in
Figure 2.2.
The field of knowledge discovery has developed gradually since the late 1980s.
Recent research trends in knowledge discovery address the following
issues:
• Mining association rules efficiently [4, 53, 147].
• Mining object-oriented databases [47, 154].
• Mining multimedia data [93].
• Mining distributed and heterogeneous databases [145].
• Text mining [61].
• Knowledge discovery in semi-structured (Web) data [83, 159].
In recent years, the last issue has attracted much attention due to the rapid growth of
online data, which has created an immense need for data mining. Web users
now expect more sensible and rational knowledge discovery systems to help them
retrieve relevant information.
2.1.3 Tasks and Challenges
Knowledge discovery tasks depend on which functionalities the knowledge
discovery system performs and which kinds of patterns the system looks for.
Different functionalities are developed for achieving different tasks. However,
some goals with particular results need to be reached by using the combination of
several KDD methods. The main KDD tasks can be classified into the following
categories:
• Classification: Classification is the process of assigning data objects to
desired predefined categories or classes. It also can be viewed as the process
of finding a proper method to distinguish data classes or concepts. Objects
without a class label are then classified using this method. Generally,
training data is required for concept learning before classification can
proceed.
• Clustering: Given a set of data objects, clustering is the task of dividing
the set of objects into a number of groups such that the objects in the
same group have similar characteristics. In other words, clustering aims
for maximising the intra-class similarity and minimising the inter-class
similarity. The major difference between classification and clustering is
that the latter analyses objects without consulting class labels, whereas the
former needs such information to begin with.
• Summarisation: This is the task of analysing data objects and finding
their common characteristics for generating summarisation rules. A set of
compact patterns that represent the concept of these objects is extracted. For
instance, a summarisation rule can be a description like “emission of carbon
dioxide CO2 is the main factor causing global warming”.
• Change and Deviation Detection: Such a task involves the discovery of
changes and deviation of specific values in data objects (e.g., the change
in time-series data, protein sequencing in a genome, and the difference
between expected values in ordering data objects).
• Mining Association Rules: Associations are rules that describe the
frequency and certainty of two groups of data values. This task usually
is applied to a transactional database. It discovers the implication between
antecedent and consequent, both of which represent sets of items in the
transactions. For example, an association rule can be “70% of customers
who purchase bread also purchase milk”.
Data mining is the process of pattern discovery in a dataset from which noise
has been previously eliminated and which has been transformed in such a way as to
enable the pattern discovery process. Although knowledge discovery covers a range
of related concepts, activities and processes, the most challenging of these is
data mining [33].
Matheus [99] described the context and computational resources needed to
perform knowledge discovery. There must exist an application through which the
user can select, start and run the main process and access the discovered patterns.
Knowledge discovery methods often make it possible to use domain knowledge to
guide and control the process and help evaluate the patterns. In such cases, domain
knowledge must be represented using an appropriate knowledge representation
technique such as taxonomies, rules, decision trees and so on.
The main process of text-related machine learning tasks is document indexing,
which maps a document into a feature space representing the semantics of the
document. Many types of text representations have been proposed in the past. A
well-known one is the bag-of-words that uses words as elements in the vector of
the feature space. There are two types of representations used in the bag-of-words
approach: binary representation and term-weighted representation.
In [80], the Term Frequency times Inverse Document Frequency (TFIDF)
weighting scheme is used for text representation in Rocchio classifiers. In
addition to TFIDF, the global IDF and entropy weighting scheme is proposed by
Dumais [34] and improves performance by an average of 30%. Various weighting
schemes for the bag-of-words representation approach are given in [1, 62, 125].
The problem of the bag-of-words approach is how to select a limited number of
features among an enormous set of words or terms in order to increase the system’s
efficiency and avoid “overfitting” [130]. In order to reduce the number of features,
many dimensionality reduction approaches have been developed using feature
selection techniques, such as Information Gain, Mutual Information, Chi-Square,
Odds Ratio, and so on. Details of these selection functions are stated
in [78, 130].
Information extraction is used to transform unstructured data in the document
corpus into a structured database and traditional data mining methods are applied
to identify useful patterns in this extracted data [102].
2.2 Association Analysis
Association rules are interesting patterns that are discovered from a given dataset.
They are generally discovered with various data mining techniques. The earliest
form of association rule mining is market basket analysis, which searches for
interesting relationships between shoppers and the items they buy.
A data mining process may still retrieve a large number of “thought-to-be”
interesting patterns even though it has specified the relevant tasks and the type of
knowledge to be mined. Generally, only a small portion of these mined patterns
is actually of interest to the users. Thus, it is essential to further confine the
set of mined patterns in an attempt to improve the effectiveness of the system,
which can be achieved by measuring the usefulness of patterns in terms of their
simplicity, certainty, utility and novelty.
Two common measures of rule interestingness or usefulness are rule support
and confidence. Rule support is estimated by a utility function in order to define
the usefulness of a mined pattern. It is calculated as the percentage of
task-relevant data transactions for which the pattern is recognised as true. Confidence,
on the other hand, reflects the certainty or validity of the mined patterns. Given
itemsets A and B in a set of transactions D, the rule A ⇒ B holds in the
transaction set D with support s, where s is the percentage of transactions in
D that contain A∪B. It can be viewed as probability P (A∪B). The rule A⇒ B
has confidence c in the transaction set D if c is the percentage of transactions in
D containing A that also contain B. It is the conditional probability P (B|A).
The support and confidence of the rule A⇒ B can be expressed as the following
equations [54].
support(A⇒ B) = P (A ∪B) (2.1)
confidence(A⇒ B) = P (B|A) (2.2)
Generally, if association rules meet both a minimum support threshold and
a minimum confidence threshold, both of which can be set by users or domain
experts, the association rules are considered interesting and useful. Market basket
analysis, as mentioned earlier, is just one form of association rule mining. There
are various kinds of association rules that can be classified based on different
criteria, such as the types of values handled in the rule, the dimensions of data
involved, the levels of abstractions involved, and various extensions to association
mining. All these variations will be discussed in the later subsections.
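Support and confidence as defined in Equations (2.1) and (2.2) can be computed directly from a transaction set; a minimal Python sketch, with an illustrative toy basket:

```python
# Compute support and confidence for the rule A => B over a set of
# transactions, following Equations (2.1) and (2.2).

def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(B|A): support of A union B divided by support of A."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))        # 2/4 = 0.5
print(confidence(transactions, {"bread"}, {"milk"}))   # (2/4)/(3/4) = 0.667
```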
2.2.1 Association Rules Mining
Association rules mining, first studied in [2] for market basket analysis, aims
to find all association rules satisfying user-specified minimum support and
minimum confidence [153]. An association rule captures an associative
relationship among objects; i.e., the appearance of a set of objects in a database
is strongly related to the appearance of another set of objects [43]. The basic
problem of finding association rules is introduced in [2].
The problems of mining association rules from large databases can be
decomposed into two subproblems: (1) Find itemsets whose support is greater
than the user-specified minimal support; (2) Use the frequent itemsets to generate
the desired rules [53]. Much of the research has been focused on the former [2,
107]. In these studies, the well-known Apriori algorithm is adopted for finding all
frequent itemsets with minimum support. However, the drawback of the Apriori
algorithm is that the same minimal support threshold is applied for all processes of
examining data items. Therefore, using different support thresholds for different
levels of abstraction is required.
In recent years, mining association rules from large databases has been a
popular research topic. In many applications, mining associations needs to be
performed at multiple levels of abstraction. For example, 80% of customers who
purchase wheat bread may also purchase butter. We can then drill down and find:
60% of customers who buy bread may also buy salty butter. The latter statement is at a
lower level of abstraction and the former at a higher level. The lower level carries
more specific information than that in the higher level. A top-down progressive
deepening method is developed for efficient mining of multiple-level association
rules from a large database based on the Apriori principle. The method first finds
frequent items at the topmost level and then deepens the mining process into their
descendants at lower concept levels. For example, if the minimal support is 3 in
level 1 the method filters out infrequent items (coffee, wine) in the transaction set
and frequent items (bread, milk) remain in the set. Then we deepen the mining
to find the associations of only frequent items’ descendants (wheat bread, white
bread, 2% milk and light milk).
One assumption is to explore only the descendants of the frequent items
since, if an item occurs rarely, its descendants will occur even less frequently
and are uninteresting. In [53] different support thresholds for different levels of
abstraction are applied. Using a single support threshold will generate many
uninteresting rules alongside interesting ones if the threshold is set too low, but
many interesting rules at lower levels will be neglected if the threshold is too
high. For example, if the threshold is too low “milk ⇒ bread” and “milk ⇒
shampoo” would be generated and be passed down to find the associations of their
descendants, but the latter is not interesting. If the threshold is too high we get only
“milk ⇒ bread”, but then find nothing, since the association rules of their descendants
“light milk ⇒ wheat bread” and “2% milk ⇒ white bread” cannot reach this high
threshold. Using different support thresholds gives users the flexibility to control
the mining process and to reduce the meaningless associations to be generated.
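The effect of level-wise thresholds argued above can be sketched with toy data; the item taxonomy, counts and thresholds below are illustrative assumptions.

```python
# Different minimum supports per abstraction level: level-1 items are
# filtered with a higher threshold before mining descends to their
# descendants. Items, counts and thresholds are illustrative.
level1_counts = {"milk": 9, "bread": 8, "shampoo": 2}
level2_counts = {("milk", "light milk"): 4, ("milk", "2% milk"): 3,
                 ("bread", "wheat bread"): 4, ("shampoo", "herbal"): 2}
min_sup = {1: 5, 2: 3}

frequent1 = {i for i, c in level1_counts.items() if c >= min_sup[1]}
# Only descendants of frequent level-1 items are examined at level 2.
frequent2 = {d for (parent, d), c in level2_counts.items()
             if parent in frequent1 and c >= min_sup[2]}
print(sorted(frequent1), sorted(frequent2))
```

Note that "herbal" is never examined: its parent "shampoo" already fails the level-1 threshold, which is exactly the descendant-pruning assumption described above.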
The scope of mining association rules has been extended from single level
to multiple concept levels for mining multiple-level association rules from large
databases. A top-down progressive deepening algorithm is developed for mining
multiple-level association rules. Mining multiple-level association rules from
databases has wide applications, and efficient algorithms can be developed for
finding interesting and strong rules in large databases.
2.2.2 Sequential Patterns
Mining sequential patterns has been extensively studied in the data mining
community since the first research work in [7]. The earlier studies, which
focused on the large size of retail datasets have developed several Apriori-like
algorithms [5, 107, 142] in order to solve the problem of discovering sequential
patterns or itemsets from such databases. However, these algorithms perform well
only in databases consisting of short frequent sequences. This is due to the fact
that it is quite time-consuming to generate n-term candidate sequences from
(n−1)-term sequences. As a result, to solve this problem, a variety of algorithms
such as AprioriAll [7], PrefixSpan [114, 115], CloSpan [161], FP-tree [51, 56],
SPADE [165], SLPMiner [131, 132], TSP [149], SPAM [18], GSP [87], GST [59],
MILE [27] and Sliding Window [57] have been proposed. To improve efficiency,
each algorithm pursues a different method of discovering frequent sequential
patterns; some are notable for the capability of mining such patterns without
generating any candidates at all.
Kum et al. [69] developed an algorithm, ApproxMAP (for APPROXimate
Multiple Alignment Pattern mining), to find approximate sequential patterns
which are the patterns approximately shared by many sequences and cover many
short patterns. Approximate sequential patterns can effectively represent the local
data for efficient global sequential pattern mining from multiple data sources.
Additionally, mining sequential patterns from multidimensional sequential data
is suggested in [164].
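Common to the algorithms above is the notion of sequence support: a sequential pattern is counted in a data sequence when its items occur in the same order, though not necessarily contiguously. A minimal sketch, with illustrative sequences:

```python
# Support counting for sequential patterns: unlike itemsets, the order
# of items matters, so a pattern is counted when it appears as an
# ordered (not necessarily contiguous) subsequence of a data sequence.

def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence preserving order."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def seq_support(sequences, pattern):
    return sum(1 for s in sequences if is_subsequence(pattern, s))

sequences = [["a", "b", "c", "d"], ["a", "c", "b"], ["b", "a", "c"]]
print(seq_support(sequences, ["a", "c"]))  # 3: "a" precedes "c" everywhere
print(seq_support(sequences, ["c", "a"]))  # 0
```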
2.2.3 Frequent Itemsets
Frequent itemset mining has attracted a great deal of attention since the
introduction of itemset mining in [2]. The main difference between an itemset
and a sequential pattern is that the order of the items matters in the latter.
The widely adopted algorithm for frequent itemset mining is Apriori [4], which
iterates the following three steps:
(1) count item occurrences to determine the frequent n-itemsets, where n starts
from 1;
(2) generate (n + 1)-itemset candidates from the frequent n-itemsets using a
candidate generation procedure;
(3) prune candidates whose support is below a predefined minimum support.
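The three steps can be rendered as a deliberately unoptimised Apriori sketch in Python; the transactions are illustrative.

```python
# A minimal Apriori sketch: count occurrences, join frequent n-itemsets
# into (n+1)-itemset candidates, and prune candidates below the minimum
# support count. Real Apriori also prunes candidates with infrequent
# subsets; that refinement is omitted here for brevity.
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their counts."""
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}   # step (1): 1-itemsets
    frequent = {}
    while current:
        # Count occurrences and prune (step 3).
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Step (2): candidate generation by joining frequent n-itemsets.
        keys = list(survivors)
        current = {a | b for a, b in combinations(keys, 2)
                   if len(a | b) == len(a) + 1}
    return frequent

txns = [{"bread", "milk"}, {"bread", "butter"},
        {"bread", "milk", "butter"}, {"milk", "butter"}]
for itemset, count in sorted(apriori(txns, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```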
Method                   Pattern              Algorithm
AprioriAll [7]           sequential           Apriori-like
PrefixSpan [114, 115]    sequential           Apriori-like
FP-tree [51, 56]         sequential           FP-tree
SPADE [165]              sequential           Apriori-like
SLPMiner [131, 132]      sequential           Apriori-like
TSP [149]                closed sequential    Apriori-like
CloSpan [161]            closed sequential    Apriori-like
SPAM [18]                sequential           Apriori-like
GSP [87]                 sequential           Apriori-like
GST [59]                 sequential           Graph
MILE [27]                sequential           Apriori-like
CLOSET [113]             closed itemset       Apriori-like
CLOSE [110]              closed itemset       Apriori-like
CHARM [166]              closed itemset       Apriori-like
GenMax [49]              closed itemset       Apriori-like

Table 2.1: Association rules mining algorithms.
The recent data mining algorithms for discovering various patterns are
depicted in Table 2.1. The drawback of the Apriori algorithm is the
time-consuming procedure of candidate generation, especially for large databases
with a small minimum support threshold. Many variations of the Apriori
algorithm and its applications have been extensively investigated in the
literature [35, 49, 110, 112, 113, 166]. Liu et al. [89] mined frequent itemsets
from the Web to find topic-specific concepts and definitions. Maintaining frequent
itemsets in dynamic databases is examined by Zhang et al. [167]. Mining Top-K
frequent itemsets is suggested in [156].
2.3 Text Mining
Most work in knowledge discovery and data mining was concerned with
transactional or structured databases. However, a large portion of the available
data appears in collections of text articles. Text mining is used to denote all
tasks that try to extract useful information by finding potential patterns from large
quantities of text. It combines many disciplines such as information retrieval,
information extraction, machine learning, text categorisation, text clustering and
data mining [76].
Text classification or categorisation (TC) is an instance of text mining. TC is
a supervised learning task that assigns a Boolean value to each pair (di, ci) ∈
(D × C), where D is a domain of documents and C is a set of predefined
categories. The task is to approximate the unknown target function Φ : D × C → {1, 0}
by means of a function Φ̂ : D × C → {1, 0}, such that Φ and Φ̂ coincide as much
as possible [46]. The function Φ̂ is called a classifier, and the goal is to make
this coincidence as precise as possible.
2.3.1 Feature Selection
There will be a large number of terms extracted from text using data mining
methods. The high dimensionality of the feature space leads to computational
complexity and overfitting problems. Hence, only terms with valuable information are
selected. A simple way to reduce the dimensionality is the filtering approach,
which filters out irrelevant terms based on measures derived from statistical
information. The common measures are briefly described in the following.
Term Frequency
The frequency of a term t in a document d can be used for document-specific
weighting and denoted as TF(d, t). It is only a measure of a term’s significance
within a document.
Inverse Document Frequency
Inverse Document Frequency (IDF) is used to measure the specificity of terms in
a set of documents. It assumes that a semantically rich term appears in only a few
documents, while a semantically poor term is spread over many documents. The
formula of IDF can be expressed by the following.
IDF(t) = log( |D| / DF(t) )    (2.3)
where D is the set of documents in the collection and DF(t) is the document
frequency, which is the number of documents where the term t appears at least
once.
Term Frequency Inverse Document Frequency
Term Frequency Inverse Document Frequency (TFIDF) [125] is the most widely
adopted measure. TFIDF combines the exhaustivity statistic (TF) of a term with
its specificity statistic (IDF).
TFIDF(d, t) = TF(d, t) × IDF(t)    (2.4)
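Equations (2.3) and (2.4) translate directly into code; the toy documents below are illustrative.

```python
# TF-IDF per Equations (2.3) and (2.4): term frequency within a
# document multiplied by the inverse document frequency over the
# collection.
import math

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # document frequency DF(t)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return doc.count(term) * idf(term, docs)

docs = [
    ["search", "engine", "web"],
    ["web", "mining", "patterns"],
    ["patterns", "text", "mining", "mining"],
]
print(tfidf("mining", docs[2], docs))  # 2 * log(3/2)
print(tfidf("web", docs[0], docs))     # 1 * log(3/2)
```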
Residual Inverse Document Frequency
Residual Inverse Document Frequency (RIDF) is a variation of IDF. RIDF assigns
collection-specific measures to terms according to the difference between the logs
of the actual IDF and the prediction by a Poisson model [30]. It measures the
distributional behaviour of terms across documents. The function of RIDF is
expressed in the following.
RIDF(t) = IDF(t) + log(1 − Probp(t))    (2.5)
where Probp(t) = 1 − p(0; λ) is the Poisson probability that t appears at least once
in a document, and λ = CF(t)/N is the average number of occurrences of term t
per document.
Relative Frequency Technique
Relative Frequency Technique (RFT) is suggested in [37] with the assumption that
special or technical words are more rare in general usage than in documents about
the corresponding subjects. In contrast to pure TF, RFT uses a term’s collection
statistics.
RFT(t) = TF(d, t)/Td − CF(t)/Tc    (2.6)
where CF(t) is the collection frequency denoting the number of times a term t
appears in the entire collection. Td and Tc are the total number of terms in the
document and the number of terms in a general document collection respectively.
2.3.2 Term Weighting
Term weighting uses statistical regularities in documents to estimate significance
weights for terms. Term weighting functions can measure how specific terms are
to a topic by exploiting the statistic variations in the distribution of terms within
relevant documents and within a complete document collection [105]. The term
weighting strategy should be context-specific [39].
Given a term t, the following notations will be used in the weighting
functions.
r: the number of relevant documents that contain term t.
n: the total number of documents in the collection that contain term t.
R: the total number of relevant documents.
N : the number of documents in the collection.
Probabilistic Model
Robertson and Sparck Jones [119] proposed four probabilistic functions for term
weighting based on the binary independence retrieval model. Two kinds of
assumption are used in these functions: independence assumptions and ordering
principles. Following are the four probabilistic functions.
F1(t) = log( (r/R) / (n/N) )    (2.7)

F2(t) = log( (r/R) / ((n − r)/(N − R)) )    (2.8)

F3(t) = log( (r/(R − r)) / (n/(N − n)) )    (2.9)

F4(t) = log( (r/(R − r)) / ((n − r)/(N − n − R + r)) )    (2.10)
Okapi Model
The Okapi model is based on the above-mentioned probabilistic model. The
BM25 function in the Okapi model involves using the term frequency and
document length [120, 140, 141]. The weighting function can be expressed as
follows.
BM25 = (TF · (k1 + 1)) / (k1 · NF + TF) · log( ((r + 0.5) · (N − n − R + r + 0.5)) / ((R − r + 0.5) · (n − r + 0.5)) )    (2.11)

and

NF = (1 − b) + b · (DL / AVDL)    (2.12)

where TF is the term frequency; k1 and b are tuning parameters; DL and AVDL
denote the document length and the average document length respectively.
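A sketch of Equations (2.11) and (2.12); the parameter defaults k1 = 1.2 and b = 0.75 and the example counts are illustrative assumptions, not values given in the text.

```python
# BM25 weighting per Equation (2.11), with the length normalisation NF
# of Equation (2.12). k1=1.2 and b=0.75 are common defaults.
import math

def bm25_weight(tf, dl, avdl, r, n, R, N, k1=1.2, b=0.75):
    nf = (1 - b) + b * dl / avdl                       # Equation (2.12)
    rsj = math.log(((r + 0.5) * (N - n - R + r + 0.5))
                   / ((R - r + 0.5) * (n - r + 0.5)))  # relevance part
    return (tf * (k1 + 1)) / (k1 * nf + tf) * rsj      # Equation (2.11)

# A term occurring twice in an average-length document, appearing in
# 100 of 1000 documents, 8 of them among 10 known relevant ones.
print(bm25_weight(tf=2, dl=120, avdl=120, r=8, n=100, R=10, N=1000))
```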
Mutual Information
Mutual Information (MI) estimates the reduction in uncertainty between two
variables. It can be used to identify term correlations and the association between
a term and a specific topic.
MI = log( (r/R) / (n/N) ) = log(r/R) − log(n/N)    (2.13)
Information Gain
Information Gain (IG) gauges the expected reduction in entropy of category
prediction. It can also be applied for measuring term correlations [104].
IG = −(R/N) · log(R/N) + (r/N) · log(r/n) + ((R − r)/N) · log((R − r)/(N − n))
   = −Pr(rel) log Pr(rel) + Pr(t) Pr(rel|t) log Pr(rel|t)
     + Pr(¬t) Pr(rel|¬t) log Pr(rel|¬t)    (2.14)
Chi-Square
Chi-Square (X²) estimates the difference between the observed frequencies and
the frequencies expected under the independence assumption. It can be applied for
measuring the lack of independence between a term and a specific topic [104].

X² = N · (rN − nR)² / (R · n · (N − R) · (N − n))    (2.15)

Figure 2.3: Bag-of-words representation using word frequency.
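The statistics in this subsection share the counts r, n, R and N defined earlier. A sketch of MI (Equation 2.13) and Chi-Square (Equation 2.15), with illustrative counts:

```python
# Mutual Information per Equation (2.13) and Chi-Square per Equation
# (2.15), computed from the four counts r, n, R, N.
import math

def mutual_information(r, n, R, N):
    return math.log((r / R) / (n / N))

def chi_square(r, n, R, N):
    return N * (r * N - n * R) ** 2 / (R * n * (N - R) * (N - n))

# A term occurring in 40 of 50 relevant documents but only 100 of
# 1000 documents overall scores highly on both measures.
print(mutual_information(r=40, n=100, R=50, N=1000))  # log(8)
print(chi_square(r=40, n=100, R=50, N=1000))
```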
2.4 Text Representation
2.4.1 Keyword-based Representation
The bag-of-words scheme is a typical keyword-based representation in the area of
information retrieval. It has been widely used in text classification tasks due to its
simplicity. Figure 2.3 illustrates the paradigm of the bag-of-words technique. As
can be seen, each word in the document is retrieved and stored in a vector space
along with its frequency. The content of this document can then be represented by
these words, known as “features”. However, the main drawback of this scheme
is that the relationships among words cannot be reflected [135]. Another problem
with considering single words as features is semantic ambiguity, which can be
categorised into:
• Synonyms: a word which shares the same meaning as another word (e.g.
taxi and cab).
• Homonym: a word which has two or more meanings (e.g. river “bank” and
CITI “bank”).
In IR-related tasks, if a query contains an ambiguous word, the retrieved
documents may have this word but not its intended meaning. Conversely, a
document may not be retrieved since it does not share a word with the query,
even though this document is relevant as it contains words which are synonymous
to words in the query. However, almost all existing IR systems use the
bag-of-words scheme to represent documents and queries. This does not seem
adequate from a formal semanticist’s point of view, but for simple retrieval
tasks it turns out to be surprisingly effective [101]. More detail about word
disambiguation can be
found in [126].
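The bag-of-words construction of Figure 2.3 reduces to a word-frequency count; a minimal sketch, with an illustrative sentence:

```python
# Bag-of-words as in Figure 2.3: each document becomes a vector of
# word frequencies, discarding word order and word relationships.
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

doc = "the search engine indexes the web the web grows"
print(bag_of_words(doc))  # e.g. Counter({'the': 3, 'web': 2, ...})
```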
2.4.2 Phrase-based Representation
Using single words in a keyword-based representation poses the semantic
ambiguity problem. To address this problem, the use of multiple words (i.e.
phrases) as features has been proposed. In general, phrases carry more specific
content than single words; compare, for instance, “engine” and “search engine”.
Another reason for using phrase-based representation is that the simple
keyword-based representation of content is usually inadequate because single
words are rarely specific enough for accurate discrimination [144]. Identifying
groups of words that create meaningful phrases is a better method, especially
for phrases indicating important concepts in the text. Lewis [77] noted that the
traditional term clustering methods are unlikely to provide significantly improved
text representation.
There are five categories of phrase or term extraction:
• Co-occurring terms [13, 77]
• Episodes [10, 11]
• Noun phrases [21, 134]
• Key-phrases [148]
• nGram [12, 20, 100, 135]
Ahonen et al. [10] applied data mining techniques to find episodes for extracting useful information from text. For sequential data, episodes and episode rules are a modification of the concepts of frequent sets and association rules [2]. Sequential data is treated as a sequence of events, where each event is a pair of event type and time [95]. Shen et al. [135] proposed an n-multigram model to support the automatic text classification task. Their model is smaller than an nGram-based one and achieves similar performance on RCV1.
Fuhr [44] investigated probabilistic models in IR and pointed out that a dependence model for phrases is not sufficient, because only the occurrence of the phrase components in a document is considered, not the syntactical structure of the phrases. Moreover, the certainty of identification should also be considered, such as whether the words occur adjacently or only within the same paragraph.
2.4.3 Other Representation
A new representation model that uses word clusters as features for text classification is proposed in [14]. In this work, the technique of feature clustering has been shown to be an alternative to feature selection for reducing the dimensionality.
The choice of a representation depends on what one regards as the meaningful
units of text and the meaningful natural language rules for the combination of
these units [130]. With respect to the representation of the content of documents,
some research works have used phrases rather than individual words. In [25],
the combination of unigram and 2-gram is chosen for document indexing in
text categorisation (TC) and evaluated on a variety of feature selection functions
(FEF). Sharma and Raman [134] propose a phrase-based text representation for
web document management using rule-based Natural Language Processing (NLP)
and Context-free Grammar (CFG) techniques. In [11], the authors apply data mining techniques to text analysis by extracting co-occurring terms as descriptive phrases
from document collections. However, the effectiveness of the text mining systems
using phrases as text representation showed no significant improvement. The
likely reason is that a phrase-based method has “lower consistency of assignment
and lower document frequency for terms” as mentioned in [77].
2.5 Information Filtering
An information filtering (IF) system monitors an incoming document stream and
selects documents relevant to one or more of its query profiles. If the interactions between profiles are ignored, this task can be treated as a binary decision to accept or reject incoming documents with respect to a given profile [60]. In terms of relevance judgements, if users are able to give these judgements as feedback, IF can be viewed as an interactive learning process; otherwise, it is a non-interactive machine learning problem with a set of labelled documents provided in advance. The task of IF is to reduce a user’s information load
by removing all non-relevant documents from an incoming stream. It can also
be regarded as a special instance of text classification [130]. The historical
development of IF can be seen in [106].
Simple averaging of probabilities or log odds ratios generates a significant
improvement for document filtering [60]. Kernel-based methods [23, 24, 91] have
been used to address document filtering problems.
Unlike the traditional search query, an adaptive filtering system maintains user
profiles which tend to reflect a long-term information need. By interacting with
users, an adaptive filtering system can learn a better profile and update it with
feedback to improve its performance over time. The assumption of an adaptive system is that users want to receive interesting documents as soon as they arrive. Hence, the system has to make a binary decision, to retrieve or reject each incoming document, with respect to a user profile. Lau et al. [75] applied Belief Revision logic to model the task of adaptive information retrieval. Lanquillon [73, 74] proposed two methods for assessing performance indicators without user feedback.
Table 2.2 shows the existing information filtering systems in the related literature. Most of these systems adopt single words (the so-called bag-of-words) for data representation and a TFIDF variant for the term weighting scheme.
2.6 Chapter Summary
In this chapter, the background of knowledge discovery and data mining has
been discussed and the related research work regarding text mining, association
analysis, text representation and information filtering has been reviewed. Starting
IF Model Representation Term weighting
KerMIT [24] bag-of-words Kernel Function
PIRCS [70] bag-of-words TFIDF
Okapi [121] bag-of-words TFIDF
RELIEFS [22] bag-of-words Probabilistic
Rutgers [17] bag-of-words TFIDF
CLARIT [38] bag-of-words TFIDF
NewT [136] bag-of-words TFIDF
NewsWeeder [72] bag-of-words TFIDF
Aplipes [155] bag-of-words TFIDF
GroupLens [66] bag-of-words TFIDF
INFOrmer [138] nGram TFIDF
SIFT [160] bag-of-words TF
ProFile [16] bag-of-words TFIDF
INQUERY [15, 32, 150] bag-of-words Probabilistic
Table 2.2: Information Filtering models.
with knowledge discovery, we discussed its definition and the typical process of knowledge discovery, and explored current applications and challenges in the area. We then focused on the development of data mining and analysed one of its key products, association rules. Various pattern mining algorithms were reviewed, including association rule mining, frequent itemset mining, sequential pattern mining, and closed and maximum pattern mining. In terms of text mining, we briefly reviewed common feature selection and term weighting approaches for dimensionality reduction. Three types of text representation scheme were also explored and discussed. Lastly, we reviewed the literature on information filtering and related techniques.
Chapter 3
Prototype of Pattern Taxonomy Model
As mentioned in Chapter 1, knowledge discovery has been investigated for a long time, and many data mining methods have been proposed to address related challenges in various fields, especially in the domains of supermarket basket data, telecommunications data and human genomes [10]. However, it is still difficult to find a suitable example that implements these data mining techniques in the area of text mining, which is usually analysed using Information Retrieval-related methods or natural language processing. This chapter presents the fundamental prototype of the Pattern Taxonomy Model (PTM), which focuses on the issue of finding useful patterns in text documents. Definitions of patterns and related algorithms for pattern discovery are provided in this chapter as well.
3.1 Pattern Taxonomy Model
Two main stages are considered in PTM. The first stage is how to extract useful
phrases from text documents, which will be discussed in this chapter. The second
stage is then how to use these discovered patterns to improve the effectiveness of
a knowledge discovery system and will be presented in Chapter 4.
In PTM, we split a text document into a set of paragraphs and treat each
paragraph as an individual transaction, which consists of a set of words (terms).
In the subsequent phase, we apply data mining methods to find frequent patterns in these transactions and generate pattern taxonomies. During the
pruning phase, non-meaningful and redundant patterns are eliminated by applying
a proposed pruning scheme.
3.1.1 Sequential Pattern Mining (SPM)
The basic definitions of sequences used in this research work are described as
follows. Let T = {t1, t2, . . . , tk} be the set of all terms, which can be viewed as words or keywords in text documents. A sequence S = 〈s1, s2, . . . , sn〉 (si ∈ T) is an ordered list of terms. Note that duplication of terms is allowed in a sequence; this differs from the usual definition, where a pattern consists of distinct terms.
Definition 3.1. (sub-sequence) A sequence α = 〈a1, a2, . . . , an〉 is a sub-sequence of another sequence β = 〈b1, b2, . . . , bm〉, denoted by α ⊑ β, if there exist integers 1 ≤ i1 < i2 < . . . < in ≤ m such that a1 = bi1, a2 = bi2, . . . , an = bin.
For instance, sequence 〈s1, s3〉 is a sub-sequence of sequence 〈s1, s2, s3〉. However, 〈s2, s1〉 is not a sub-sequence of 〈s1, s2, s3〉, since the order of terms is considered. The sequence α is a proper sub-sequence of β, denoted as α ⊏ β, if α ⊑ β but α ≠ β. In addition, we can also say that sequence 〈s1, s2, s3〉 is a super-sequence of 〈s1, s3〉. The problem of mining sequential patterns is to find the complete set of sub-sequences, from a set of sequences, whose support is greater than a user-predefined threshold, min sup.
Pattern taxonomy is a tree-like hierarchy that preserves the sub-sequence (i.e., “is-a”) relationship between discovered sequential patterns. An example of a pattern taxonomy is illustrated in Figure 3.1.
Definition 3.2. (Absolute and Relative Support) Given a document d = {S1, S2, . . . , Sn}, where Si is a sequence representing a paragraph in d; thus, |d| is the number of paragraphs in document d. Let P be a sequence. We call P a sequential pattern of d if there is an Si ∈ d such that P ⊑ Si. The absolute support of P, denoted as suppa(P) = |{S | S ∈ d, P ⊑ S}|, is the number of occurrences of P in d. The relative support of P is the fraction of paragraphs that contain P in document d, denoted as suppr(P) = suppa(P)/|d|.
For example, the sequential pattern P = 〈t1, t2, t3〉 in the sample database,
as shown in Table 3.1, has suppa(P ) = 2 and suppr(P ) = 0.5. All sequential
patterns in Table 3.1 with absolute support greater than or equal to 2 are presented
in Table 3.2.
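Definitions 3.1 and 3.2 and the example above can be sketched in a few lines of Python (the function names are mine; the document is the sample of Table 3.1):

```python
def is_subseq(a, b):
    """Definition 3.1: a is a sub-sequence of b (order kept, gaps allowed)."""
    it = iter(b)
    return all(t in it for t in a)   # each `in` consumes the iterator

def supp_a(p, d):
    """Absolute support: number of paragraphs of d that contain p."""
    return sum(1 for s in d if is_subseq(p, s))

def supp_r(p, d):
    """Relative support: fraction of paragraphs of d that contain p."""
    return supp_a(p, d) / len(d)

# the four paragraphs (transactions) of Table 3.1
d = [["t1", "t2", "t3", "t4"], ["t2", "t4", "t5", "t3"],
     ["t3", "t6", "t1"], ["t5", "t1", "t2", "t7", "t3"]]
supp_a(["t1", "t2", "t3"], d)  # → 2
supp_r(["t1", "t2", "t3"], d)  # → 0.5
```

The iterator trick in `is_subseq` enforces the left-to-right order of Definition 3.1, so 〈t2, t1〉 is correctly rejected as a sub-sequence of 〈t1, t2, t3〉.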
The relative support of a pattern is used to properly estimate the significance of the pattern. With absolute support alone, a pattern with the same frequency acquires the same support regardless of document length; however, with the same number of occurrences, a pattern is more significant in a short document than in a long one. Unlike other approaches, we decompose a document into a set of transactions and discover frequent patterns from them using data mining methods. The relative support of a pattern is estimated by dividing its absolute support by the number of transactions in the document. Hence, a pattern can obtain an adequate support
Transaction Sequence
1 S1 : 〈t1, t2, t3, t4〉
2 S2 : 〈t2, t4, t5, t3〉
3 S3 : 〈t3, t6, t1〉
4 S4 : 〈t5, t1, t2, t7, t3〉
Table 3.1: Each transaction represents a paragraph in a text document and contains a sequence consisting of an ordered list of words.
Patterns suppa suppr
〈t4〉, 〈t5〉, 〈t1, t2〉, 〈t1, t3〉, 〈t2, t4〉, 〈t5, t3〉, 〈t1, t2, t3〉 | 2 | 0.5
〈t1〉, 〈t2〉, 〈t2, t3〉 | 3 | 0.75
〈t3〉 | 4 | 1
Table 3.2: All frequent sequential patterns discovered from the sample document (Table 3.1) with min sup: ξ = 0.5.
for various document lengths with the same frequency.
Definition 3.3. (Frequent Sequential Pattern) A sequential pattern P is called a frequent sequential pattern if suppr(P) is greater than or equal to a minimum support (min sup for short) ξ.
For example, let min sup be 0.75 for mining frequent sequential patterns from the sample document in Table 3.1; we obtain four frequent sequential patterns, 〈t2, t3〉, 〈t1〉, 〈t2〉 and 〈t3〉, since their relative supports are not less than ξ.
Definition 3.4. (nTerms Pattern) The length of a sequential pattern P, denoted as len(P), indicates the number of words (or terms) contained in P. A sequential pattern which contains n terms is denoted in short as an nTerms pattern.
For instance, given pattern P = 〈t2, t3〉, we have len(P) = 2, and P is a 2Terms pattern. A sequential pattern may consist of several terms or of just one term; thus, a 1Term pattern is a special case of an nTerms pattern in this research work.
3.1.2 Pattern Pruning
All algorithms for finding frequent sequential patterns in a dataset encounter the same problem: a large number of patterns are generated, most of which are non-meaningful and need to be eliminated [159]. A proper pruning scheme can address this issue by removing redundant patterns, not only reducing the dimensionality but also decreasing the effect of noise patterns. In this research work, we define closed patterns as meaningful patterns, since most of the sub-sequence patterns of a closed pattern have the same frequency as it, which means they always occur together in a document. For example, in Figure 3.1 patterns 〈t1, t2〉 and 〈t1, t3〉 appear twice in a document, as their parent pattern 〈t1, t2, t3〉 has a frequency of two. SPM stands for sequential pattern mining, and we denote sequential closed pattern mining as SCPM. The notion of a closed pattern is defined as follows:
Definition 3.5. (Closed Sequential Pattern) A frequent sequential pattern P is a closed sequential pattern if there exists no frequent sequential pattern P′ such that P ⊏ P′ and suppa(P) = suppa(P′). The relation ⊏ represents the strict part of the sub-sequence relation ⊑.
For instance, the nodes in Figure 3.1 represent sequential patterns extracted from Table 3.1.
Figure 3.1: An example of a pattern taxonomy where patterns in dashed boxes are closed patterns.
Only the patterns within dashed-line borders are closed sequential patterns if min sup ξ = 0.50; the others are considered non-closed sequential patterns.
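Definition 3.5 can be checked directly against a table of frequent patterns and their supports. A small sketch (helper names are mine), using the frequent patterns of Table 3.2:

```python
def is_subseq(a, b):
    """Sub-sequence test of Definition 3.1."""
    it = iter(b)
    return all(t in it for t in a)

def is_closed(p, freq):
    """Definition 3.5: p is closed iff no frequent proper super-sequence
    of p has the same absolute support. `freq` maps pattern tuples to suppa."""
    return not any(p != q and is_subseq(p, q) and freq[p] == freq[q]
                   for q in freq)

# frequent patterns of Table 3.2 with their absolute supports
freq = {("t3",): 4, ("t1",): 3, ("t2",): 3, ("t2", "t3"): 3,
        ("t4",): 2, ("t5",): 2, ("t1", "t2"): 2, ("t1", "t3"): 2,
        ("t2", "t4"): 2, ("t5", "t3"): 2, ("t1", "t2", "t3"): 2}
is_closed(("t2",), freq)       # False: 〈t2,t3〉 subsumes 〈t2〉 at equal support
is_closed(("t2", "t3"), freq)  # True: its only super-pattern has lower support
```

This is the membership test that the pruning line of Algorithm 3.1 applies incrementally, one pattern length at a time.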
Algorithm 3.1. SPMining(PL, min sup)
Input: a list of nTerms frequent sequential patterns, PL; minimum support,
min sup.
Output: a set of frequent sequential patterns, SP.
Method:
1: SP ← SP − {Pa ∈ SP | ∃Pb ∈ PL such that len(Pa) = len(Pb) − 1
∧ Pa ⊏ Pb ∧ suppa(Pa) = suppa(Pb)} //pattern pruning
2: SP ← SP ∪ PL //storing nTerms patterns
3: PL′ ← ∅
4: foreach pattern p in PL do begin
5: generating p-projected database PD
6: foreach frequent term t in PD do begin
7: P ′ = p ⋈ t //sequence extension
8: if suppr(P′) ≥ min sup then
9: PL′ ← PL′ ∪ P ′
10: end if
11: end for
12: end for
13: if |PL′| = 0 then
14: return //no more pattern
15: else
16: call SPMining(PL′, min sup)
17: end if
18: output SP
The Sequential Pattern Mining (SPM) algorithm SPMining is depicted in Algorithm 3.1. In this algorithm, we apply the pruning scheme to eliminate non-closed patterns during the process of sequential pattern discovery. The key feature of this recursive algorithm is its first line, which describes the pruning procedure: all patterns of length n−1 are examined to determine whether or not they are closed, after all patterns of length n have been generated in the previous recursion. For instance, for a 2Terms pattern 〈t2, t3〉, if there exists a 3Terms frequent pattern 〈t1, t2, t3〉 with the same frequency as 〈t2, t3〉, the shorter pattern is detected as non-closed and therefore pruned. After this pruning step, the remaining (n-1)Terms patterns (i.e. closed sequential patterns) are stored, and the algorithm continues to find (n+1)Terms patterns. The algorithm repeats itself recursively until no more patterns are discovered.
Figure 3.2: Illustration of pruning redundant patterns.
As a result, the output of algorithm SPMining is a set of
closed sequential patterns with relative supports greater than or equal to a specified
minimum support.
As mentioned above, SPMining adopts the projected-database-based approach, projecting (or partitioning) the database for each nTerms pattern in an attempt to find (n+1)Terms patterns during each recursion.
Definition 3.6. (P-projected Database) Given a pattern p, the p-projected database contains the set of sequences made of the postfixes of p in the database.
For instance, referring to the sample database in Table 3.1, let p be 〈t1〉; the p-projected database will be {〈t2, t3, t4〉, 〈〉, 〈〉, 〈t2, t7, t3〉}, where 〈〉 is a null sequence, since p does not appear in transaction 2 and term t1 is located at the end of transaction 3.
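The projection step can be sketched as follows (the function name is mine); following the text above, a transaction where p does not occur contributes a null sequence:

```python
def project(p, db):
    """Build the p-projected database (Definition 3.6): for each sequence,
    keep the postfix after the leftmost embedding of p; if p does not
    occur, keep a null sequence."""
    proj = []
    for s in db:
        i, j = 0, 0
        while i < len(s) and j < len(p):
            if s[i] == p[j]:
                j += 1            # matched the next term of p
            i += 1
        proj.append(s[i:] if j == len(p) else [])
    return proj

# the sample database of Table 3.1
db = [["t1", "t2", "t3", "t4"], ["t2", "t4", "t5", "t3"],
      ["t3", "t6", "t1"], ["t5", "t1", "t2", "t7", "t3"]]
project(["t1"], db)  # → [['t2', 't3', 't4'], [], [], ['t2', 't7', 't3']]
```

Using the leftmost embedding is safe, because it leaves the longest possible postfix in which extension terms can still be found.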
After generating the p-projected database for a sequential pattern, the next step is to find the frequent terms in this database satisfying a given minimum support. If a frequent term is found, an (n+1)Terms sequential pattern is expanded from the nTerms sequential pattern using sequence extension, which is defined as follows.
Definition 3.7. (Sequence Extension) Given a term t and a sequence S, the sequence extension of S with term t can be obtained by simply appending t to S, generating a sequence S′, denoted as S′ = S ⋈ t.
For instance, the sequence extension of 〈t1, t3〉 with term t2 is the sequence 〈t1, t3, t2〉. The generated (n+1)Terms sequential patterns are passed as input to a recursive call of the algorithm, and this continues until no more sequential patterns are found.
The input of the SPMining algorithm is a set of frequent sequential patterns obtained from its own previous output. The initial input of this recursive function is the set of frequent 1Term patterns (i.e., frequent items). For instance, the frequent 1Term patterns derived from the sample document (Table 3.1) are listed in Table 3.3. Patterns 〈t6〉 and 〈t7〉 are discarded, since their relative support is less than 0.5, meaning that each appears only once among the paragraphs and cannot be “frequent”.
The set of 1Term patterns in Table 3.3 then skips line 1 of the algorithm, which is designed to prune non-closed patterns generated in previous iterations. Lines 4 to 12 present the generation of (n+1)Terms patterns from the nTerms p-projected database; a p-projected database is first formed by using a 1Term pattern as a root and finding all projected sequences. Table 3.4 illustrates the example of a p-projected database
1Term Pattern suppa suppr
〈t1〉 3 0.75
〈t2〉 3 0.75
〈t3〉 4 1.0
〈t4〉 2 0.5
〈t5〉 2 0.5
Table 3.3: Frequent 1Term patterns with min sup = 0.5.
Root Projected sequence
t1 〈t1〉 :→ 〈t2, t3, t4〉
t1 〈t1〉 :→ 〈t2, t7, t3〉
t2 〈t2〉 :→ 〈t3, t4〉
t2 〈t2〉 :→ 〈t4, t5, t3〉
t2 〈t2〉 :→ 〈t7, t3〉
t3 〈t3〉 :→ 〈t4〉
t3 〈t3〉 :→ 〈t6, t1〉
t4 〈t4〉 :→ 〈t5, t3〉
t5 〈t5〉 :→ 〈t3〉
t5 〈t5〉 :→ 〈t1, t2, t7, t3〉
Table 3.4: An example of a p-projected database.
based on the above-mentioned scenario. For each 1Term pattern p, a number of sub-sequences Ps starting with p are generated, where Ps ⊑ Sn and Sn ∈ d. Generally speaking, the number of projected sequences of a root pattern equals the pattern’s absolute support, unless the root is located at the end of some paragraphs (e.g., t1, t3 and t4).
After a p-projected database is built, (n+1)Terms pattern candidates can be
obtained by extending the nTerms pattern using sequence extension. For instance,
1Term Pattern 2Terms Pattern suppa suppr
t1 〈t1, t2〉 2 0.5
t1 〈t1, t3〉 2 0.5
t2 〈t2, t3〉 3 0.75
t2 〈t2, t4〉 2 0.5
t3 not found - -
t4 not found - -
t5 〈t5, t3〉 2 0.5
Table 3.5: 2Terms sequential patterns derived from 1Term patterns.
in Table 3.4 the pattern 〈t1〉 has two projected sequences, 〈t2, t3, t4〉 and 〈t2, t7, t3〉, among which two frequent terms, t2 and t3, are found. At the next step, two 2Terms candidates, 〈t1, t2〉 and 〈t1, t3〉, are formed with the same relative support of 0.5. Each candidate is then examined at line 8 and confirmed as a frequent pattern if its relative support is greater than or equal to the minimum support. Based on the previous example, the 2Terms patterns derived from the p-projected databases in Table 3.4 are presented in Table 3.5. Note that no pattern is generated from the p-projected databases of patterns 〈t3〉 and 〈t4〉, because no more frequent terms exist in their databases.
As long as at least one nTerms pattern is found in the current iteration, the algorithm recursively calls itself to find (n+1)Terms patterns. Otherwise, the algorithm terminates and returns the set of sequential patterns SP as output. For the scenario in Table 3.5, the frequent 2Terms patterns 〈t1, t2〉, 〈t1, t3〉, 〈t2, t3〉, 〈t2, t4〉 and 〈t5, t3〉 are passed to the algorithm itself as one of the parameters in
order to find the 3Terms patterns. Again, at the first line of the algorithm, these 2Terms patterns are compared with the 1Term patterns in order to prune non-closed 1Term patterns. This process can be described as follows: for each (n-1)Terms pattern Pa in SP, if there exists any nTerms pattern Pb such that Pa is a proper sub-sequence of Pb (i.e., Pa ⊏ Pb) and both of them have the same relative support (i.e., suppr(Pa) = suppr(Pb)), then Pa is defined as a non-closed pattern and is eliminated from the set SP. This step is performed when finding frequent closed sequential patterns; when only frequent sequential patterns are sought, it is skipped. The result of non-closed pattern pruning for our example is depicted in Table 3.6. Figure 3.2 illustrates the process of pattern pruning: the arrows in the figure indicate the process of finding and pruning redundant patterns (i.e., non-closed patterns) from the lower level (i.e., (n+1)Terms patterns) to the higher level (i.e., nTerms patterns).
The same procedure is applied to the 2Terms patterns in the remaining lines of the algorithm, including the generation of p-projected databases and the assessment of frequent patterns. As a result, one 3Terms sequential pattern, 〈t1, t2, t3〉, is discovered at the end of this iteration. After it is passed to the next iteration, the pruning result for the 2Terms patterns is as illustrated in Table 3.7.
As mentioned before, the algorithm SPMining is designed for discovering both closed and non-closed sequential patterns from a set of documents. The execution of the first line of the algorithm is the key to finding closed patterns and removing the others; moreover, the algorithm can easily be adjusted to find non-closed sequential patterns by skipping its first line. For the previous document example in Table 3.1, after inputting this document into
1Term Pattern suppr Super-sequence suppr Closed pattern?
〈t1〉 0.75 〈t1, t2〉 0.5, 〈t1, t3〉 0.5 yes
〈t2〉 0.75 〈t1, t2〉 0.5, 〈t2, t3〉 0.75, 〈t2, t4〉 0.5 no
〈t3〉 1.0 〈t1, t3〉 0.5, 〈t2, t3〉 0.75, 〈t5, t3〉 0.5 yes
〈t4〉 0.5 〈t2, t4〉 0.5 no
〈t5〉 0.5 none - yes
Table 3.6: The assessment of closed pattern of 1Term patterns.
2Terms Pattern suppr Super-sequence suppr Closed pattern?
〈t1, t2〉 0.5 〈t1, t2, t3〉 0.5 no
〈t1, t3〉 0.5 〈t1, t2, t3〉 0.5 no
〈t2, t3〉 0.75 〈t1, t2, t3〉 0.5 yes
〈t2, t4〉 0.5 none - yes
〈t5, t3〉 0.5 none - yes
Table 3.7: The assessment of closed pattern of 2Terms patterns.
Frequent patterns Non-closed Closed
1Term: 〈t2〉, 〈t4〉 | 〈t1〉, 〈t3〉, 〈t5〉
2Terms: 〈t1, t2〉, 〈t1, t3〉 | 〈t2, t3〉, 〈t2, t4〉, 〈t5, t3〉
3Terms: none | 〈t1, t2, t3〉
Table 3.8: Discovered frequent closed and non-closed sequential patterns.
the algorithm and setting the minimum support to 0.5, a list of all the closed and non-closed sequential patterns can be returned; the results are shown in Table 3.8.
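Putting the pieces together, Algorithm 3.1 can be sketched as follows. This is an illustrative reimplementation rather than the thesis code: it is iterative instead of recursive, candidate supports are checked directly against the document, and all names are of my choosing.

```python
def sp_mining(d, min_sup):
    """Closed sequential pattern mining over a document d, given as a list
    of paragraphs (each an ordered list of terms). A sketch of Algorithm
    3.1, applying the line-1 pruning rule before each extension round."""
    n = len(d)

    def is_sub(a, b):                      # sub-sequence test (Def. 3.1)
        it = iter(b)
        return all(t in it for t in a)

    def supp(p):                           # absolute support (Def. 3.2)
        return sum(1 for s in d if is_sub(p, s))

    def project(p):                        # p-projected database (Def. 3.6)
        proj = []
        for s in d:
            i, j = 0, 0
            while i < len(s) and j < len(p):
                if s[i] == p[j]:
                    j += 1
                i += 1
            if j == len(p):
                proj.append(s[i:])
        return proj

    sp = set()
    pl = {(t,) for s in d for t in s}
    pl = {p for p in pl if supp(p) / n >= min_sup}   # frequent 1Term patterns
    while pl:
        # pattern pruning: drop stored patterns one term shorter than, and
        # subsumed at equal support by, a pattern found in this round
        sp = {pa for pa in sp
              if not any(len(pa) == len(pb) - 1 and is_sub(pa, pb)
                         and supp(pa) == supp(pb) for pb in pl)}
        sp |= pl                                     # store nTerms patterns
        nxt = set()
        for p in pl:
            for t in {t for s in project(p) for t in s}:
                q = p + (t,)                         # sequence extension
                if supp(q) / n >= min_sup:
                    nxt.add(q)
        pl = nxt
    return sp

d = [["t1", "t2", "t3", "t4"], ["t2", "t4", "t5", "t3"],
     ["t3", "t6", "t1"], ["t5", "t1", "t2", "t7", "t3"]]
sp = sp_mining(d, 0.5)
# 〈t1,t2,t3〉 survives, while its equal-support sub-patterns such as
# 〈t1,t2〉, 〈t1,t3〉 and 〈t2〉 are pruned as non-closed
```

Skipping the pruning comprehension inside the loop yields all frequent sequential patterns instead, mirroring the adjustment described for line 1 above.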
3.1.3 Using Discovered Patterns
As mentioned in the previous section, the algorithm SPMining uses a sequential data mining technique with a pruning scheme to find meaningful patterns in text documents. The next issue is how to use these discovered patterns. There are various ways to utilise discovered patterns, such as using a weighting function to assign a value to each pattern according to its frequency. One strategy was implemented and evaluated in [159], which proposed a pattern mining method that treated each found sequential pattern as a whole item, without breaking it into a set of individual terms; its results showed that using confidence as the pattern measure outperformed using support. For example, each mined sequential pattern p in PTM can be viewed as a rule p → positive, and the confidence of p was evaluated using the following weighting function:
W(p) = |{da | da ∈ D+, p ⊆ da}| / |{db | db ∈ D, p ⊆ db}|,
where D is the training set of documents, and D+ indicates the set of positive
documents in D.
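A direct reading of this weighting function can be sketched as follows (the function and variable names are mine; documents are simplified to sets of terms, with subset matching standing in for pattern containment):

```python
def weight(p, d_pos, d_all, contains):
    """W(p): among the documents that contain pattern p, the fraction
    that are positive, i.e. the confidence of the rule p -> positive."""
    matched = [d for d in d_all if contains(p, d)]
    positive = [d for d in matched if d in d_pos]
    return len(positive) / len(matched) if matched else 0.0

# toy collection: three documents, the first two labelled positive
D = [{"t1", "t2"}, {"t1", "t3"}, {"t2", "t3"}]
D_pos = D[:2]
subset = lambda p, d: p <= d
weight({"t1"}, D_pos, D, subset)  # → 1.0 (both t1-documents are positive)
weight({"t2"}, D_pos, D, subset)  # → 0.5
```

Passing the matching predicate in as a parameter lets the same function serve sequential patterns, where `contains` would be a sub-sequence test instead of a subset test.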
Two problems arise when using the above-mentioned weighting function. One is the low pattern frequency problem, which is mainly due to the fact that long patterns are hard to match in documents. The other is that patterns specific to a topic may gain a lower score than general patterns; in other words, the information carried by specific patterns cannot be estimated by the weighting function. A proper pattern processing method that overcomes these problems is therefore desirable. We discuss this issue in Chapter 4.
3.2 Finding Non-Sequential Patterns
In Section 3.1, the algorithm SPMining was developed for the purpose of mining all frequent sequential patterns from documents. In addition to sequential patterns, non-sequential pattern mining (NSPM) from a set of textual documents is another application of the data mining mechanism. From the data mining point of view, non-sequential patterns can be treated as frequent itemsets extracted from a transactional database. Frequent itemset mining is one of the most essential issues in many data mining applications. Since the seminal work by Agrawal et al. [2] in the mid 1990s, many itemset mining approaches adopting the concept of the Apriori algorithm have been proposed. These Apriori-like approaches utilise a bottom-up scheme that enumerates every single frequent itemset [49]. However, the phase of candidate generation in Apriori-like algorithms is likely to be computationally expensive. In particular, a long pattern which contains n items has 2^n − 2 proper non-empty subsets, making this approach inefficient and infeasible.
To tackle the above-mentioned problem, we propose the following strategy. Each document in the dataset is split into a set of transactions (i.e., paragraphs), instead of the whole document being viewed as a single transaction, as in traditional methods. As a result, the number of candidates for pattern generation can be greatly reduced: for a document with five paragraphs, for example, the average transaction length is one-fifth of the original. This strategy therefore saves much computational time, especially for long documents. In this section, an NSPM algorithm is developed to address the problem of finding all frequent non-sequential patterns in a given textual database. The fundamental definitions are given in Section 3.2.1 and the corresponding algorithm is presented in Section 3.2.2.
3.2.1 Basic Definition of NSPM
The essential definitions of NSPM are described as follows. Let T = {t1, t2, . . . , tk} be a set of distinct terms. A non-sequential pattern p is a subset of T.
Definition 3.8. (frequency and support) Given a document d = {S1, S2, . . . , Sn}, where Si is a paragraph of d. The frequency of a non-sequential pattern p is the number of paragraphs which contain p, denoted as freq(p) = |{S | S ∈ d ∧ p ⊆ S}|. The support of p is defined as support(p) = freq(p)/|d|.
For example, the frequency of the non-sequential pattern {t1, t3} in the document example (Table 3.1) is 3, and the support of this pattern is 0.75. Note that in the case of SPM, the frequency and the relative support of pattern 〈t1, t3〉 are 2 and 0.5 respectively, since the order of terms is taken into account.
3.2.2 NSPM Algorithm
Algorithm 3.2 is proposed to mine the non-sequential patterns whose supports are greater than or equal to a specified min sup. The inputs are a list of nTerms frequent non-sequential patterns NP, a list of 1Term frequent patterns FT, and a minimum support min sup. As in the algorithm SPMining, the initial NP is a list of frequent 1Term patterns. Using Table 3.1 as a document example, the initial NP is {{t1}, {t2}, {t3}, {t4}, {t5}}. The content of FT is the same as that of the initial NP, but FT is used for candidate generation and remains static over all iterations; min sup is a constant real value.
Algorithm 3.2. NSPMining(NP, FT, min sup)
Input: a list of nTerms frequent non-sequential patterns, NP; a list of 1Term
frequent patterns, FT; minimum support, min sup.
Output: a set of frequent non-sequential patterns, FP.
Method:
1: FP ← FP ∪NP //nTerms non-sequential patterns
2: NP ′ ← ∅
3: foreach pattern p in NP do begin
4: foreach frequent term t in FT do begin
5: P ′ = p ∪ {t} //pattern growing
6: if support(P ′) ≥ min sup then
7: NP ′ ← NP ′ ∪ P ′
8: end if
9: end for
10: end for
11: if |NP ′| = 0 then
12: return //no more pattern
13: else
14: call NSPMining(NP ′, FT, min sup)
15: end if
16: output FP
The first line of the algorithm NSPMining stores the mined patterns in NP passed from the previous recursive loop. Lines 3 to 10 extend nTerms patterns to (n+1)Terms patterns for candidate generation: by joining each frequent term in FT into each nTerms pattern in NP, a number of (n+1)Terms candidates are created. For example, the 2Terms candidates generated from the previous document example are listed in Table 3.9.
As we can see, the number of candidates generated in NSPM (10 candidates in Table 3.9) is much larger than in SPM (5 candidates in Table 3.5) for the same document example, and the difference becomes more significant as documents grow. The reason is that in NSPM every term in each transaction needs to be visited to estimate frequency and support, whereas in SPM only a portion of the sequence in each transaction does. Although SPM requires an extra process in advance, such as building p-projected databases, it takes less
1Term Pattern 2Terms Pattern Frequency Support
t1 {t1, t2} 2 0.5
t1 {t1, t3} 3 0.75
t1 {t1, t4} 1 0.25
t1 {t1, t5} 1 0.25
t2 {t2, t3} 3 0.75
t2 {t2, t4} 2 0.5
t2 {t2, t5} 2 0.5
t3 {t3, t4} 2 0.5
t3 {t3, t5} 2 0.5
t4 {t4, t5} 1 0.25
Table 3.9: 2Terms candidates generated during non-sequential pattern mining.
computing time, since only a simple splitting function is needed for that work. In our experiments, some topics take much longer (a couple of hours) than others to complete an NSPM task with a low min sup.
Once a candidate is generated, it is immediately examined to check whether or not it is frequent, by executing line 6 of the algorithm. At the end of this process (line 10), all frequent 2Terms non-sequential patterns have been discovered in the first recursive loop. From the previous example with min sup = 0.5, seven 2Terms patterns in Table 3.9 remain as frequent patterns; patterns {t1, t4}, {t1, t5} and {t4, t5} are excluded because of their low supports. If no more frequent patterns are found, the algorithm terminates and outputs the result. Otherwise, the discovered nTerms frequent patterns are passed to the next recursive loop to find further (n+1)Terms patterns.
At the second recursive loop of the algorithm NSPMining, the previously found 2Terms non-sequential patterns are reserved in FP at the first line of the
2Terms Pattern   3Terms Pattern   Frequency   Support   Frequent?
{t1, t2}         {t1, t2, t3}     2           0.5       yes
                 {t1, t2, t4}     1           0.25      no
                 {t1, t2, t5}     1           0.25      no
{t1, t3}         {t1, t3, t4}     1           0.25      no
                 {t1, t3, t5}     1           0.25      no
{t1, t4}         {t1, t4, t5}     0           0         no
{t2, t3}         {t2, t3, t4}     2           0.5       yes
                 {t2, t3, t5}     2           0.5       yes
{t2, t4}         {t2, t4, t5}     1           0.25      no
{t3, t4}         {t3, t4, t5}     1           0.25      no

Table 3.10: 3Terms candidates generated during non-sequential pattern mining.
algorithm. These 2Terms patterns are then extended to form 3Terms candidates by
joining each term t from the frequent 1Term pattern set FT. All of the possible 3Terms
candidates are presented in Table 3.10.
After assessing all 3Terms candidates, three frequent patterns still
remain, compared to just one at the same stage using the SPM algorithm. This is
further evidence that NSPM is inefficient compared with SPM when
applying data mining algorithms to the text mining task. In other words, NSPM
spends extra time not only on generating candidates but also on the larger
number of patterns that need to be processed in the next recursive loop.
For the current case in NSPM, the three 3Terms frequent non-sequential patterns
(i.e., {t1, t2, t3}, {t2, t3, t4} and {t2, t3, t5}) are consequently delivered to the next
recursive iteration. After they are stored in FP, the same process of candidate
generation is activated and 4Terms candidates are then created. These candidates
3Terms Pattern   4Terms Pattern     Frequency   Support   Frequent?
{t1, t2, t3}     {t1, t2, t3, t4}   1           0.25      no
                 {t1, t2, t3, t5}   1           0.25      no
{t2, t3, t4}     {t2, t3, t4, t5}   1           0.25      no

Table 3.11: 4Terms candidates generated in NSPM.
are illustrated in Table 3.11. The algorithm terminates and frequent patterns stored
in FP are returned. The whole list of frequent non-sequential patterns is shown
in Table 3.12.
The frequent non-sequential closed patterns can be mined using NSPMining
with the following process inserted into the first line:

1: FP ← FP − {Pa ∈ FP | ∃Pb ∈ NP such that len(Pa) = len(Pb) − 1 ∧
   Pa ⊂ Pb ∧ support(Pa) = support(Pb)}   (3.1)
The above pruning scheme is employed by the NSPMining algorithm to
eliminate any non-closed non-sequential pattern, that is, a pattern with the same
support as one of its supersets. This mining method for closed non-sequential patterns
is called the Non-Sequential Closed Pattern Mining (NSCPM) method.
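The pruning rule of Equation 3.1 can be sketched directly over the frequent patterns and supports of Table 3.12; the hard-coded dictionary and variable names below are illustrative only.

```python
# Frequent non-sequential patterns with their supports, from Table 3.12
frequent = {
    frozenset(["t1"]): 0.75, frozenset(["t2"]): 0.75, frozenset(["t3"]): 1.0,
    frozenset(["t4"]): 0.5,  frozenset(["t5"]): 0.5,
    frozenset(["t1", "t2"]): 0.5,  frozenset(["t1", "t3"]): 0.75,
    frozenset(["t2", "t3"]): 0.75, frozenset(["t2", "t4"]): 0.5,
    frozenset(["t2", "t5"]): 0.5,  frozenset(["t3", "t4"]): 0.5,
    frozenset(["t3", "t5"]): 0.5,
    frozenset(["t1", "t2", "t3"]): 0.5, frozenset(["t2", "t3", "t4"]): 0.5,
    frozenset(["t2", "t3", "t5"]): 0.5,
}

# Equation 3.1: drop Pa when an immediate superset Pb has the same support
closed = [pa for pa in frequent
          if not any(len(pb) == len(pa) + 1 and pa < pb
                     and frequent[pa] == frequent[pb]
                     for pb in frequent)]

print(len(closed))  # 6 closed patterns, matching the last column of Table 3.12
```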
3.3 Related Work
Many data mining methods have been proposed for knowledge discovery in the
last decade. However, most of them are developed for addressing the problem
of mining specific patterns in a reasonable and acceptable time frame from a
large transactional or relational database. Agrawal et al. [2] introduced association
rule mining, and the well-known Apriori algorithm was proposed by Agrawal
Pattern type Pattern Frequency Support Closed?
1Term
t1 3 0.75 not2 3 0.75 not3 4 1.0 yest4 2 0.5 not5 2 0.5 no
2Terms
t1, t2 2 0.5 not1, t3 3 0.75 yest2, t3 3 0.75 yest2, t4 2 0.5 not2, t5 2 0.5 not3, t4 2 0.5 not3, t5 2 0.5 no
3Termst1, t2, t3 2 0.5 yest2, t3, t4 2 0.5 yest2, t3, t5 2 0.5 yes
Table 3.12: Frequent non-sequential patterns discovered using NSPM.
and Srikant [5]. Similar algorithms for association rules mining were developed
in [3, 29, 65, 96, 107]. Some strategies were introduced in order to find association
rules efficiently, such as transaction reduction by Agrawal and Srikant [6] and a
hash-based algorithm by Park et al. [108, 109]. Many extensions of association
rule mining have been developed. Spatial association rule mining was proposed
by Koperski and Han [67]. Frequent episodes mining [97], negative association
rule mining [128] and inter-transaction association rule mining [41, 92, 146] were
proposed and discussed. Multilevel association rule mining was explored by Han
and Fu [52, 53].
Sequential pattern mining has been extensively studied in data mining
communities since the first research work by Agrawal and Srikant [7]. The same
concept was discussed by Srikant and Agrawal [143]. Since the first work, many
algorithms of sequential pattern mining were introduced, such as GSP [143],
FreeSpan [55], PrefixSpan [114], SPADE [165], CloSpan [161], TSP [149],
SLPMiner [131] and IncSpan [28]. Most of them adopt the Apriori property: all
nonempty subsets of a frequent itemset must also be frequent [54]. However,
with this policy applied, longer patterns tend not to be mined, since a static
minimum support is used for all pattern finding. Hence, a few constraint-based
algorithms [111, 116] were introduced to find longer patterns using a lower minimum
support. Moreover, Ayres et al. [18] used a bitmap representation for sequential
pattern mining. Ahonen-Myka et al. proposed several algorithms to find frequent
sequences or co-occurring phrases from textual datasets in [8, 9, 11, 12, 13]. In
addition to sequential patterns, the mining heuristic for frequent itemsets has the
goal of discovering all frequent non-sequential patterns in a database. There are
various extensions of frequent itemset mining including frequent closed itemset
mining [35, 110, 113, 166], maximal frequent itemset mining, parallel and distributed frequent
itemset mining [49, 151], mining top-k frequent itemsets from data stream [156],
constraint-based frequent itemset mining [112, 152], and mining frequent itemsets
by opportunistic projection [90].
The first attempt at applying data mining techniques to the domain of text was
made by Ahonen et al. [10], who presented experiments on discovering
phrases and co-occurring terms in text. The technique used for episode rule (i.e.,
modified association rule) mining is a bottom-up nGram method, which differs
greatly from our method in PTM. The window size needs to be defined in
the nGram method, and a frequency threshold for finding frequent co-occurring
terms is also required [8]. In PTM, the minimum support is the only parameter
that needs to be specified. Furthermore, during pattern discovery, the occurrence of
terms in a document is taken into account in the PTM method, but omitted in their
work. The key difference is that their work focused only on finding patterns in text
by the use of data mining techniques, without addressing how to use these
discovered patterns. In contrast, PTM not only adopts data mining methods to
find patterns in text, but also applies them to the domain of information filtering
in an attempt to improve the performance.
3.4 Chapter Summary
Many data mining methods, such as association rule mining, frequent itemset
mining, sequential pattern mining and closed pattern mining, have been proposed
and usually used for a transactional database. In this chapter we have presented
a novel methodology that attempts to implement data mining algorithms on the
domain of text data for knowledge discovery. For applying these methods, a
textual document can be viewed as a transactional database by splitting it based
on paragraphs. A pattern, therefore, is defined as a frequent pattern if its relative
support is greater than or equal to a pre-specified minimum support. Four types of
frequent patterns can be found in a textual dataset using our four proposed mining
algorithms: sequential pattern mining (SPM), sequential closed
pattern mining (SCPM), non-sequential pattern (i.e., itemset) mining (NSPM),
and non-sequential closed pattern (i.e., closed itemset) mining (NSCPM).
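The paragraph-splitting view above can be illustrated with a minimal sketch; the toy document and the `is_frequent` helper are hypothetical (the stemmed terms follow the examples used later in Chapter 4).

```python
# A document becomes a transactional database by splitting it on blank lines
# (paragraphs); each transaction is the set of terms in one paragraph.
doc = "carbon emiss global\n\ngreenhous global emiss\n\ncarbon air pollut"

transactions = [set(p.split()) for p in doc.split("\n\n") if p.strip()]

def is_frequent(pattern, min_sup=0.5):
    """A pattern is frequent when its relative support (fraction of
    paragraphs containing it) reaches the minimum support."""
    rel_sup = sum(pattern <= t for t in transactions) / len(transactions)
    return rel_sup >= min_sup

print(is_frequent({"emiss", "global"}))  # True: appears in 2 of 3 paragraphs
print(is_frequent({"air"}))              # False: appears in 1 of 3
```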
Pattern taxonomy is a tree-like hierarchy that preserves the sub-sequence
(i.e., "is-a") relationships between discovered sequential patterns. Unlike
the term independence assumption usually adopted in traditional
information retrieval methods, the pattern taxonomy in PTM can preserve the
semantic information embedded in the text data. By the use of the pattern pruning
strategy, the number of pattern candidates can be dramatically reduced during the
process of pattern discovery, resulting in a great improvement in the efficiency of
PTM. On the other hand, the effectiveness of the system can also be improved by
the removal of redundant patterns. The related experimental results are presented
in Chapter 6.
Using the confidence of a pattern for evaluation encounters two problems. One
is the so-called low pattern frequency problem, which arises mainly because few
matched patterns can be found in the training stage when the length of the pattern is
long. The other problem is that a specific long pattern does not obtain a proper
weighting score, leading to unsatisfactory performance. Therefore, a proper
pattern processing method to overcome these problems is desirable.
In summary, the proposed mining algorithms tackle the limitation of applying
data mining mechanisms to the text domain and provide the fundamental
prototype required for the development of PTM. PTM adopts the SCPM algorithm
for pattern discovery, together with the pattern pruning scheme used to eliminate
redundant patterns, resulting in improved efficiency. Moreover, the
problem of how to use discovered patterns is identified, and feasible solutions will
be discussed and presented in the following chapters.
Chapter 4
Pattern Deploying Methods
In this chapter, we propose two novel approaches that attempt to address
the drawback caused by the inadequate use of discovered patterns. In
the previous chapter, we discussed and provided various methods for mining
desired patterns by the use of data mining techniques. We also pointed out
the difficulty of transferring these techniques and presented preliminary solutions to
alleviate the problem. However, the issue of how to exploit discovered
patterns remains unsolved. One of the easiest ways to use discovered patterns
is to treat them as atoms in the feature space to represent the concept of a set
of documents. The significance of patterns can then be estimated by assigning an
evaluated value based on one of the existing weighting functions. Nevertheless,
if this representation method is used, the same mechanism used for pattern discovery
is required in the phase of document evaluation in order to find matched patterns.
Such an approach is time-consuming and ineffective because
of the computational expensiveness inherent in data mining-based methods and
the unsolved low-frequency problem for long patterns. Therefore, an efficient
and effective pattern evaluation methodology is needed after the phase of pattern
discovery in a knowledge discovery system.
Figure 4.1: Deploying patterns into a term space.
4.1 Pattern Deploying
The properties of patterns (e.g., support and confidence) used by data mining-
based methods in the phase of pattern discovery are not suitable to be adopted
in the phase of using discovered patterns [158]. Therefore, in this chapter we re-
evaluate the properties of patterns by deploying them into a common hypothesis
space based on their correlations to the pattern taxonomies. A fundamental
mechanism, the Pattern Deploying Method (PDM), is introduced first to implement
pattern deploying, followed by the method of Pattern Deploying based on
Support (PDS).
The simplified concept of PDM is illustrated in Figure 4.1. There is no doubt
that a pattern consisting of more terms is considered more specific, but its
frequency is relatively low. A short pattern, however, has more influence on
the judgement of relevance in document evaluation due to its high frequency. In
particular, we need the former to help distinguish the relevance of documents,
especially in an information filtering system. For instance, comparing the pattern
“Sequential Pattern Mining” with pattern “Mining” in Figure 4.1, the former
is obviously more helpful than the latter since the former carries more specific
information.
To use these patterns, two inevitable issues arise:
(1) How to emphasise the significance of specific patterns and avoid the low-
frequency problem.
(2) How to eliminate the interference from general patterns, which usually
have high frequency.
As mentioned in the previous chapter, data mining methods by nature generate
a large number of short patterns during the phase of pattern
discovery. One way to reduce this large number of patterns is
the adoption of pattern pruning in the mining algorithm. The pattern
pruning strategy we used eliminates the sub-sequences of maximum sequential patterns
if their supports are the same. That means patterns which always co-occur
with their parent patterns in the same transaction are redundant patterns and
need to be discarded. This induces the sequential closed pattern mining approach,
which allows us to mine closed patterns only. Therefore, such a strategy provides
a partial solution to the second aforementioned issue, through the removal of a
large number of sub-sequences (i.e., short patterns).
Despite the redundant short patterns discarded by the use of pattern pruning in
the SCPM method, some short patterns still remain. These patterns
can be classified into two main groups. The first group contains patterns which
are themselves closed but short in length (e.g., 2Terms or 3Terms closed patterns),
meaning these patterns have no parent patterns. The remaining short closed
Figure 4.2: Overlaps between discovered patterns.
patterns, which have parent patterns, are classified into the second group. We pay
attention to the second group, since the closed patterns in the first group have been
widely discussed. Patterns in the second group are considered potentially
useful since they do not always co-occur with their parent patterns in the same
paragraph, meaning that they also appear on their own several times in other paragraphs of a
document. We believe such short patterns carry significant information
with reference to the related concept of the topic and have to be taken into account.
Therefore, we define these patterns as significant short patterns.
Other than the above-mentioned closed patterns and significant short patterns,
the correlation among patterns from different pattern taxonomies also draws our
attention and needs to be clarified. As mentioned before, many pattern taxonomies
may be automatically formed after all sequential patterns are found in the phase
of pattern discovery. All patterns under a pattern taxonomy contain a
subset of terms derived from the longest closed pattern in the same taxonomy,
the root of the pattern taxonomy. Therefore, two patterns from different pattern
taxonomies should not have a subset relationship, but may share some terms.
Note that the sequential pattern in this case is simply viewed as a regular pattern.
For example, p1 and p2 are two patterns which have an inter-taxonomy correlation
between them, such that p1 ∩ p2 ≠ ∅, p1 ⊈ p2 and p2 ⊈ p1. Figure 4.2 illustrates
the inter-taxonomy correlation among several patterns. Patterns from different
taxonomies may have overlaps and share some elements (i.e., terms), but not all
of them have such a phenomenon. For example, p1 and p2 in Figure 4.2 have
an overlap and share two common terms, whereas p1 and p4 are independent
and there is no intersection between them. On the other hand, a pattern can
have inter-taxonomy relationships with more than one pattern. For instance, p1
shares one term with p3 and overlaps another two with p2 in the above-mentioned
figure. In summary, a term which appears in many pattern overlaps occurs in many
pattern taxonomies, implying the potential significance and usefulness the term
can offer. Therefore, by appropriately evaluating the inter-taxonomy correlation
between the involved patterns, the capability of describing the context of documents
in a knowledge discovery system can be improved.
In order to estimate the usefulness of significant short patterns, and considering
the ease of applying patterns to document evaluation, deploying patterns into a
feature space is an effective and efficient methodology for tackling the challenging
issues of dealing with discovered patterns. In terms of effectiveness, by deploying
patterns into a feature space, significant terms with high appearance in the overlap
areas can be accentuated and emphasised through the accumulation of their
occurrences during pattern evaluation. Details of this process will
be presented in Section 4.1.1 and Section 4.1.2. With regard to efficiency, the
components of the feature space are short individual terms instead of
long sequential or non-sequential patterns. As a result, there is no need to find such
long patterns in the phase of document evaluation, which would require the effort of
Figure 4.3: Flowchart of pattern deploying methods in Pattern Taxonomy Model.
the computationally expensive mining algorithms. Hence, such a replacement saves
a long run time and greatly improves the efficiency
of the system.
The process of pattern deploying is depicted in Figure 4.3, which shows the
flowchart of the PTM model featuring the pattern deploying methods (i.e., PDM
and PDS). Starting from the documents (in the case of textual data), several
pattern taxonomies can be built by finding informative patterns using data mining
methods. On the other hand, a feature space consisting of a set of individual
terms is generated by the use of traditional document indexing techniques. At
the next step, the created pattern taxonomies and feature space can then be
used to represent the concept of documents by applying a data mining-based
method (e.g., SPM) and the traditional Vector Space Method (VSM) respectively.
However, both approaches have inevitable limits and drawbacks, which have been
mentioned and discussed earlier in this section. In general, SPM brings both
effectiveness and efficiency problems caused by the use of time-consuming
Figure 4.4: The process of merging pattern taxonomies into the feature space.
heuristics in pattern discovery. VSM, in turn, struggles to achieve further
improvements in effectiveness. To overcome these problems, a novel methodology
is proposed in this thesis. By deploying patterns into the feature space, PDM
and PDS not only benefit from the use of sequential patterns, which keep the useful
semantic information, but also greatly improve the system efficiency by avoiding
the time-consuming pattern discovery approaches in the phase of
document evaluation.
4.1.1 Pattern Deploying Method (PDM)
The PDM is proposed to address the problem caused by the
inappropriate evaluation of patterns discovered using data mining methods. Data
mining methods, such as SPM and NSPM, utilise discovered patterns directly
without any modification and thus suffer from the low-frequency problem
on specific patterns. Instead of using patterns individually, mapping patterns to
a common hypothesis space is considered in order to re-evaluate and emphasise
the specific patterns. The concept of mapping is illustrated in Figure 4.4, which
merges the patterns under all mined taxonomies into a feature space. This approach
tackles the aforementioned issues through the following strategies:
- Simplifying the feature space to reduce the computational complexity in the
phase of document evaluation.
- Reducing the size of the feature space to improve the efficiency.
- Deploying specific patterns to emphasise their levels of significance and
avoid the low-frequency problem.
- Emphasising specific patterns to reduce the interference from the general
patterns.
- Taking into account correlation of pattern taxonomies to evaluate the
significant short patterns.
- Accumulating the weight of terms in the overlap area to estimate their levels
of significance.
Upon implementing the above strategies, the goal of improving the effectiveness
and efficiency of a pattern-based knowledge discovery system can be achieved.
Details regarding the definitions and implementation of the PDM are
presented as follows.
Firstly, the common hypothesis space used in this chapter is defined as T,
a set of terms. For any set of terms X, its covering set is

coverset(X) = {p | p ∈ SP, X ⊆ p}   (4.1)

where p denotes a sequential pattern and SP is the set of sequential patterns
discovered using the SPMining algorithm proposed in the previous chapter. This definition is
similar to the definition of coverset used in Li and Zhong [86]. However, in this
study coverset refers to a set of patterns rather than a set of transactions. Given a
set of documents D, it consists of positive and negative document sets which can
be denoted as D+ and D− respectively. A set of positive documents is then given
as an example in the following table.
Doc.   Pattern taxonomies   Sequential patterns
d1     PT(1,1)              〈carbon〉4, 〈carbon, emiss〉3
       PT(1,2)              〈air, pollut〉2
d2     PT(2,1)              〈greenhous, global〉3
       PT(2,2)              〈emiss, global〉2
d3     PT(3,1)              〈greenhous〉2
       PT(3,2)              〈global, emiss〉2
d4     PT(4,1)              〈carbon〉3
       PT(4,2)              〈air〉3, 〈air, antarct〉2
d5     PT(5,1)              〈emiss, global, pollut〉2

Table 4.1: Example of a set of positive documents consisting of pattern
taxonomies. The number beside each sequential pattern indicates the absolute
support of the pattern.
For each positive document d ∈ D+, a set of patterns is discovered in order to
be merged into the dedicated vector:

~dk = < (tk1, nk1), (tk2, nk2), . . . , (tkm, nkm) >   (4.2)

where tki in the pair (tki, nki) denotes an individual term and nki = |coverset(tki)|
is the total support obtained from all patterns in ~dk. For example, documents in
Table 4.1 can be represented by the following vectors:
~d1 = < (carbon, 2), (emiss, 1), (air, 1), (pollut, 1) >
~d2 = < (greenhous, 1), (global, 2), (emiss, 1) >
~d3 = < (greenhous, 1), (global, 1), (emiss, 1) >
~d4 = < (carbon, 1), (air, 2), (antarct, 1) >
~d5 = < (emiss, 1), (global, 1), (pollut, 1) >
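A rough sketch of how these vectors arise, using the taxonomies of Table 4.1 (supports dropped, since this step only counts |coverset(t)|); the `deploy` helper and the `docs` structure are illustrative, not the thesis code.

```python
# Pattern taxonomies of Table 4.1, kept as lists of term lists per document
docs = {
    "d1": [["carbon"], ["carbon", "emiss"], ["air", "pollut"]],
    "d2": [["greenhous", "global"], ["emiss", "global"]],
    "d3": [["greenhous"], ["global", "emiss"]],
    "d4": [["carbon"], ["air"], ["air", "antarct"]],
    "d5": [["emiss", "global", "pollut"]],
}

def deploy(patterns):
    """n_t = |coverset(t)|: the number of discovered patterns containing t."""
    vec = {}
    for p in patterns:
        for t in p:
            vec[t] = vec.get(t, 0) + 1
    return vec

print(deploy(docs["d1"]))  # {'carbon': 2, 'emiss': 1, 'air': 1, 'pollut': 1}
```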
Then the specificity of a pattern p in a document ~dk can be evaluated by the
following definition:

specificity(p, ~dk) = ∑_{t ∈ p, (t, n) ∈ ~dk} n

For the example documents in Table 4.1, the specificity of pattern 〈carbon,
emiss〉 in ~d1 is derived as specificity(〈carbon, emiss〉, ~d1) = 2 + 1 = 3. The
higher the value, the more specific the pattern. When two patterns are in the
same pattern taxonomy, the longer pattern obtains a higher specificity than the
shorter one. According to this definition, we can easily prove the
following theorem.

Theorem 4.1. Let p1 and p2 be patterns found in document ~dk. We have
specificity(p1, ~dk) ≤ specificity(p2, ~dk) if p1 ⊆ p2.
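The specificity definition and Theorem 4.1 can be checked with a small sketch; the `specificity` helper is illustrative.

```python
# Deployed vector of d1 from the running example
d1 = {"carbon": 2, "emiss": 1, "air": 1, "pollut": 1}

def specificity(pattern, d):
    """Sum the deployed count n of every term of the pattern present in d."""
    return sum(d[t] for t in pattern if t in d)

print(specificity(["carbon", "emiss"], d1))  # 2 + 1 = 3
# Theorem 4.1: a subset never scores higher than its superset
print(specificity(["carbon"], d1))           # 2
```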
In order to develop an efficient algorithm to evaluate such a kind of
representation for each positive document, the composition operation defined
in [86] is adopted for merging any two patterns. For this purpose, we firstly
expand a pattern as a set of term integer pairs. For example, 〈greenhous,
global〉 and 〈emiss, global〉 are two frequent sequential patterns derived from the
sample database in Table 4.1. Their expanded forms can be denoted as pa =
〈(greenhous, 1), (global, 1)〉 and pb = 〈(emiss, 1), (global, 1)〉, respectively.
Moreover, we need a function to extract terms from a pattern. Given a pattern
p in its expanded form, we can use the "termset" function to obtain the term list in p,
which satisfies

termset(p) = {t | (t, f) ∈ p}.

Using the above-mentioned patterns as an example, the termsets of pa
and pb are termset(pa) = {greenhous, global} and termset(pb) =
{emiss, global}. Note that the patterns themselves and their expanded forms are
not particularly distinguished unless it is necessary to do so.
Patterns can be merged using the following composition operation. The
composition of two patterns p1 and p2 can be processed using the following
equation:

p1 ⊕ p2 = {(t, f1 + f2) | (t, f1) ∈ p1, (t, f2) ∈ p2} ∪
          {(t, f) | t ∈ (termset(p1) ∪ termset(p2)) −
          (termset(p1) ∩ termset(p2)), (t, f) ∈ p1 ∪ p2}   (4.3)

For example, the composition of the aforementioned patterns, pa ⊕ pb, can be
denoted as p′ where

p′ = pa ⊕ pb = {(greenhous, 1), (emiss, 1), (global, 2)}.
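A minimal sketch of the composition operation in Equation 4.3, assuming patterns in expanded form are stored as term-to-frequency dictionaries (the `compose` name is illustrative):

```python
def compose(p1, p2):
    """Composition operator (Equation 4.3): frequencies of shared terms are
    summed; terms occurring in only one pattern keep their frequency."""
    merged = dict(p1)
    for t, f in p2.items():
        merged[t] = merged.get(t, 0) + f
    return merged

pa = {"greenhous": 1, "global": 1}
pb = {"emiss": 1, "global": 1}
print(compose(pa, pb))  # {'greenhous': 1, 'global': 2, 'emiss': 1}
```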
The detailed process of pattern deploying is presented in Algorithm 4.1. Note
that the SPMining (Algorithm 3.1) is used in line 4 for generating frequent
sequential patterns. The main process of pattern deploying occurs between line 6
and line 8 inclusively. The output of this algorithm is a set of vectors.
Algorithm 4.1. PDM(D+, min sup)
Input: a list of positive documents, D+; minimum support, min sup.
Output: a set of vectors, ∆.
Method:
1: ∆← ∅
2: foreach document d in D+ do begin
3: extract 1Terms frequent patterns PL from d
4: SP = SPMining(PL, min sup) // Call Algorithm 3.1
5: ~d← ∅
6: foreach pattern p in SP do begin
7: ~d← ~d⊕ p′ // p′ is the expanded form of p
8: end for
9: ∆← ∆ ∪ ~d
10: end for
The inputs of the algorithm PDM are a set of positive documents and a
pre-specified minimum support. In line 4 of this algorithm, a set of sequential
patterns is discovered by calling the algorithm SPMining (in Section 3.1) for
each document. So far, only positive documents are considered and used in
this approach. The use of information from negative documents is another issue,
related to pattern evolution, which will be investigated and discussed in
Chapter 5.
At the next step in line 6 to 8, each pattern is firstly transferred into
an expanded form and then merged into a temporary storage using pattern
composition operator (Equation 4.3). As a result, the deployed pattern (i.e., the set
of term weight pairs) for each document is obtained. For example, the deployed
patterns of five sample documents in Table 4.1 can be expressed as ∆:
~d1 = {(carbon, 2), (emiss, 1), (air, 1), (pollut, 1)}
~d2 = {(greenhous, 1), (global, 2), (emiss, 1)}
~d3 = {(greenhous, 1), (global, 1), (emiss, 1)}
~d4 = {(carbon, 1), (air, 2), (antarct, 1)}
~d5 = {(emiss, 1), (global, 1), (pollut, 1)}
We keep ∆ as the training result for further processing in the pattern
evolution stage. To deploy the patterns, each document vector in ∆ is normalised
first, and the feature space is updated by summing up the weight value for each
corresponding term using the aforementioned pattern composition until all vectors
in ∆ are processed. For instance, the gradual updating of the feature space for the
above-mentioned example is illustrated as follows:

~d1 = {(carbon, 2/5), (emiss, 1/5), (air, 1/5), (pollut, 1/5)}
~d1 ⊕ ~d2 = {(carbon, 2/5), (emiss, 9/20), (air, 1/5), (pollut, 1/5),
             (greenhous, 1/4), (global, 1/2)}
...
d̂ = {(carbon, 13/20), (emiss, 67/60), (air, 7/10), (pollut, 8/15),
     (greenhous, 7/12), (global, 7/6), (antarct, 1/4)}
As can be seen in the above example, terms emiss and global are more likely
to gain higher scores than the others. This is due to their high appearance among
sequential patterns. By applying pattern deploying, the significance these terms
possess can therefore be expressed. As a result, a significant long sequential
pattern can be effectively exploited and becomes useful through the emphasis of
its high-frequency components. In contrast, these high-frequency terms cannot be
fully exploited by the SPM or SCPM methods, since they are likely to be trapped in
low-frequency patterns. Furthermore, the major difference between PDM and a
keyword-based method (e.g., TFIDF) is that the former utilises the information of
pattern correlation in taxonomies, whereas the latter evaluates terms using simple
statistics only. In other words, the deployed terms in PDM carry informative
properties inherited from the patterns which contain them, rather than being
independent terms without any relation to other terms or patterns, as in the
keyword-based methods.
The output of the algorithm PDM is a set of term-weight pairs, which can be
viewed as the feature space used to represent the concept of the specified documents in a
knowledge discovery system. The weighting scheme for a given term ti in the feature
space is denoted by the following function:

weight(ti) = ∑_{~dk ∈ ∆, (ti, nki) ∈ ~dk} ( nki / ∑_{(t, w) ∈ ~dk} w ).   (4.4)
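Equation 4.4 can be checked against the running example with a short sketch; exact fractions are used to reproduce the values quoted in the text, and the helper name is illustrative.

```python
from fractions import Fraction

# The deployed vectors ∆ produced by PDM for the five sample documents
delta = [
    {"carbon": 2, "emiss": 1, "air": 1, "pollut": 1},
    {"greenhous": 1, "global": 2, "emiss": 1},
    {"greenhous": 1, "global": 1, "emiss": 1},
    {"carbon": 1, "air": 2, "antarct": 1},
    {"emiss": 1, "global": 1, "pollut": 1},
]

def weight(term):
    """Equation 4.4: sum of the term's normalised share in every vector."""
    return sum(Fraction(d[term], sum(d.values()))
               for d in delta if term in d)

print(weight("carbon"))  # 13/20
print(weight("emiss"))   # 67/60
```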
The time complexity of the composition operation is O(m) if the pairs
in patterns are sorted, where m is the average length of patterns and the basic
operation is a comparison between terms. The complexity of the
pattern compositions during the process of pattern deploying is O(nN), where n
is the number of positive documents and N is the average number of discovered
patterns per positive document. Therefore, the overall time complexity of the
main process of pattern deploying is O(nNm) if the basic operation is still the
comparison between terms.
4.1.2 Pattern Deploying based on Supports (PDS)
PDM adopts the methodology of mapping discovered patterns into a hypothesis
space in an attempt to overcome the low-frequency problem pertaining to
specific long patterns. By simply deploying patterns through a pattern
composition operator, the goal of preserving the significant information embedded
in specific patterns can be achieved. The significant short patterns, i.e., the terms
appearing in the overlaps of patterns, can be emphasised as well. However, the
pattern's support, a useful and essential property of a pattern, is not taken into
account by the PDM method. For instance, the discovered pattern 〈carbon〉 in
Table 4.1 acquires an absolute support of 4 in document d1 and 3 in document
d4, but the evaluated score for this term is as low as 13/20 in the feature space,
compared to 67/60 for another term, "emiss", which appears only two more times
in supports. Hence, it is doubtful that the term "emiss" should be estimated to be
nearly twice as significant as the term "carbon" in this case. This phenomenon is caused
by disregarding the pattern's support during the pattern evaluation
process. In the algorithm PDM, the discovered patterns are treated
equally and given equivalent weights. Therefore, the support of a pattern needs
to be considered when the feature's significance is evaluated.

In this section, a novel pattern deploying method utilising more
properties of a pattern is proposed. Different from the PDM discussed in
Section 4.1.1, the pattern's support obtained in the phase of pattern discovery
is taken into account when we deploy patterns into a common hypothesis space.
A probability function is also introduced to estimate the feature's significance.
By using SPMining (Algorithm 3.1), we can acquire a set of frequent
sequential patterns SP for each document d ∈ D+, such that SP =
{p1, p2, . . . , pn}. The absolute support suppa(pi) for each pi ∈ SP is obtained as
well. We first normalise the absolute support of each discovered pattern based
on the following function:

support :: SP → [0, 1]

such that

support(pi) = suppa(pi) / ∑_{pj ∈ SP} suppa(pj)   (4.5)
For example, after we apply the above function to the sample database in
Table 4.1, the new support of each pattern can be calculated and the result is
listed in Table 4.2.
Doc.   Sequential patterns        Support
d1     〈carbon〉                   4/9
       〈carbon, emiss〉            1/3
       〈air, pollut〉              2/9
d2     〈greenhous, global〉        3/5
       〈emiss, global〉            2/5
d3     〈greenhous〉                1/2
       〈global, emiss〉            1/2
d4     〈carbon〉                   3/8
       〈air〉                      3/8
       〈air, antarct〉             1/4
d5     〈emiss, global, pollut〉    1

Table 4.2: Patterns with their supports from the sample database.
Based on the above normalisation, the expanded form of pattern pi can be
represented in the following format:

pi = 〈(ti,1, fi,1), (ti,2, fi,2), . . . , (ti,m, fi,m)〉

where

fi,j = support(pi) / m
It is obvious that the composition operation stated in Section 4.1.1 remains
applicable to the expanded forms of patterns in this format. Details
of the deploying process are presented in Algorithm 4.2; its result is a vector
~d consisting of term weight pairs. Note that the input is a set of discovered
sequential patterns SP, not a set of documents as required in PDM.
Algorithm 4.2. PDS(SP)
Input: a set of frequent sequential patterns, SP.
Output: a vector of features in expanded form, ~d.
Method:
1: sum supp = 0, ~d← ∅
2: foreach pattern p in SP do begin
3: sum supp += suppa(p)
4: end for
5: foreach pattern p in SP do begin
6: f = suppa(p)/(sum supp× len(p))
7: p′ ← ∅
8: foreach term t in p do begin
9: p′ ← p′ ∪ (t, f)
10: end for
11: ~d← ~d⊕ p′
12: end for
The first step of the algorithm PDS is to initialise the parameters in line 1. Then, in lines 2 to 4, the absolute support of each pattern in SP is summed up and stored for further reference. The value of f in the expanded form of each pattern p is estimated and assigned to all terms in p. This operation is completed in line 6; then, in lines 8 to 10, each term weight pair in pattern p is transferred into a temporary space p′. Lastly, the final step of the algorithm PDS is to merge p′ into the vector ~d, which is returned as output once all patterns in SP have been processed.
Using the data in Table 4.2 as an example, after processing all of the documents,
the result of the algorithm PDS for each of them will be
~d1 = (carbon, 4/9+1/6), (emiss, 1/6), (air, 1/9), (pollut, 1/9)
~d2 = (greenhous, 3/10), (global, 3/10+1/5), (emiss, 1/5)
~d3 = (greenhous, 1/2), (global, 1/4), (emiss, 1/4)
~d4 = (carbon, 3/8), (air, 3/8+1/8), (antarct, 1/8)
~d5 = (emiss, 1/3), (global, 1/3), (pollut, 1/3).
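The deployment above can be reproduced with a small sketch of Algorithm 4.2, assuming each document’s patterns are given as term tuples with absolute supports (the data layout is illustrative):

```python
from fractions import Fraction

def pds(sp):
    """Sketch of Algorithm 4.2 (PDS): deploy one document's frequent
    sequential patterns into a term-weight vector. `sp` maps each
    pattern (a tuple of terms) to its absolute support."""
    sum_supp = sum(sp.values())                # lines 2-4: total support
    d = {}
    for p, supp in sp.items():
        f = Fraction(supp, sum_supp * len(p))  # line 6: weight per term
        for t in p:                            # lines 8-10: expand the pattern
            d[t] = d.get(t, Fraction(0)) + f   # accumulate, implementing the ⊕ merge
    return d

d1 = pds({("carbon",): 4, ("carbon", "emiss"): 3, ("air", "pollut"): 2})
print(d1["carbon"])  # 4/9 + 1/6 = 11/18
```

Each resulting vector sums to 1, which matches Theorem 4.2 below.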
The value of f in the expanded form (t, f) indicates the relative significance of the term t. As mentioned before, the value of f for the term “carbon” is unlikely to be appropriately evaluated in PDM, since it is given a much lower value than that of the term “emiss”, which scores roughly twice as high. In PDS, however, the values of f for both terms are estimated to be nearly the same: 71/72 for “carbon” compared with 19/20 for “emiss”. The
difference between these two terms’ estimated significance values is reduced in PDS because the support of patterns is considered and used to re-evaluate the patterns. Moreover, all documents processed in PDS are treated as equally important, meaning that the sum of term values in the expanded form of each document is assumed to be constant.
Theorem 4.2. Let ~d be the vector returned by the algorithm PDS. We have

∑_{(fst,snd) ∈ ~d} snd = 1.

Proof. According to lines 5 to 10 in the algorithm PDS, we have

∑_{(fst,snd) ∈ ~d} snd = ∑_{p ∈ SP} ∑_{(t,f) ∈ p} suppa(p) / (sum_supp × len(p))
                       = (1 / sum_supp) ∑_{p ∈ SP} ∑_{(t,f) ∈ p} suppa(p) / len(p)
                       = (1 / sum_supp) ∑_{p ∈ SP} suppa(p)
                       = 1.
Although the algorithm PDS processes one document at a time, a set of vectors ∆ can be obtained by calling PDS once per document until all specified documents have been processed. Formally, the relation between the vectors and the common hypothesis space can be described as follows:
β :: ∆ → 2^(T×[0,1]) − {∅}

such that

β(d) = {(t1, f1), (t2, f2), . . . , (tn, fn)} ⊆ T × [0, 1]    (4.6)
Generally speaking, the concept of relevance is subjective, and it can be represented on scales of various granularities. For example, we may use “0” to denote non-relevance, “1” marginal relevance, “2” fair relevance, and “3” high relevance. The simplest case uses “0” for non-relevant and “1” for relevant. A relevance function can therefore be used to describe the extent of relevance of each positive document. We can also normalise the relevance function so that it satisfies:

∑_{d ∈ D+} relevance(d) = 1.
Based on the above assumptions, a probability function can be derived to substitute the weighting scheme (Equation 4.4) for all terms ti ∈ T, which satisfies:

prβ(ti) = ∑_{~d ∈ ∆, (ti,f) ∈ ~d} relevance(d) × f    (4.7)
Theorem 4.3. Let prβ(ti) = ∑_{~d ∈ ∆, (ti,f) ∈ ~d} relevance(d) × f; then prβ is a probability function on T.

Proof. From the above definitions and Theorem 4.2, we have:

∑_{ti ∈ T} prβ(ti) = ∑_{ti ∈ T} ∑_{~d ∈ ∆, (ti,f) ∈ ~d} relevance(d) × f
                   = ∑_{~d ∈ ∆} ∑_{(fst,snd) ∈ ~d} relevance(d) × snd
                   = ∑_{~d ∈ ∆} relevance(d) ∑_{(fst,snd) ∈ ~d} snd
                   = ∑_{~d ∈ ∆} relevance(d)
                   = 1.
Then, the specificity for all patterns p can be defined as follows.
specificity(p) = ∑_{t ∈ T} prβ(t) τ(t, p)

where

τ(t, p) = 1 if t ∈ p, and 0 otherwise.    (4.8)
It is obvious that the specificity function defined in this sub-section also satisfies Theorem 4.1. As a result, after all documents in Table 4.2 are processed by PDS, the feature weight pairs in the hypothesis space can be presented as

{(carbon, 71/72), (emiss, 19/20), (air, 11/18), (pollut, 4/9), (greenhous, 4/5), (global, 13/12), (antarct, 1/8)}
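Equations 4.7 and 4.8 can be sketched over the five deployed vectors above, assuming for illustration a uniform normalised relevance of 1/5 per document; the names and data layout here are ours, not the thesis’s:

```python
from fractions import Fraction

F = Fraction
# The five deployed vectors ~d1 ... ~d5 computed earlier by PDS.
vectors = [
    {"carbon": F(4, 9) + F(1, 6), "emiss": F(1, 6), "air": F(1, 9), "pollut": F(1, 9)},
    {"greenhous": F(3, 10), "global": F(3, 10) + F(1, 5), "emiss": F(1, 5)},
    {"greenhous": F(1, 2), "global": F(1, 4), "emiss": F(1, 4)},
    {"carbon": F(3, 8), "air": F(3, 8) + F(1, 8), "antarct": F(1, 8)},
    {"emiss": F(1, 3), "global": F(1, 3), "pollut": F(1, 3)},
]
relevance = F(1, len(vectors))  # assumed uniform relevance, summing to 1

pr = {}
for d in vectors:
    for t, f in d.items():
        pr[t] = pr.get(t, F(0)) + relevance * f  # Equation 4.7

def specificity(pattern):
    """Equation 4.8: sum pr_beta over the terms occurring in the pattern."""
    return sum(pr.get(t, F(0)) for t in pattern)

print(sum(pr.values()))  # 1, as guaranteed by Theorem 4.3
```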
4.2 Related Work
This chapter presents a novel concept for effectively dealing with discovered patterns. Two approaches are introduced to implement the proposed methodology by deploying discovered patterns into a specified hypothesis space in an attempt to overcome the underlying problem within data mining-based methods. In [159], with regard to pattern properties, the pattern’s confidence
was estimated and exploited in the phase of using discovered patterns. The
result indicated that such an application of a pattern’s confidence is feasible
and outperforms TFIDF and other traditional probabilistic methods. However,
some problems with the use of confidence for document evaluation still remained
unsolved, such as the overlap among discovered patterns and low-frequency
problems in specific patterns [157]. In terms of interpretation of patterns, Li [81]
introduced a novel approach for interpreting discovered patterns by using the
random set concept. Li and Zhong [84] presented an in-depth discussion on the
interpretation of association rules. Furthermore, an extended random set-based
method was proposed by Li et al. [83] for deploying mined association rules into
a hypothesis space. In our approach, we deploy features which are on the pattern
level rather than the terms on the document level used by the other approaches.
Moreover, the interestingness criterion used in our method for pattern discovery differs from those of the other approaches.
4.3 Chapter Summary
In this chapter, we propose two novel approaches for deploying discovered
patterns in order to address the fundamental problem caused by the inadequate
use of these patterns. In the phase of using discovered patterns, patterns can be treated as components in the feature space and evaluated in the same way as in a keyword-based method. Nevertheless, such an approach provides insufficient capability for reasoning about patterns because it relies on a weak pattern property. The confidence of a pattern, adopted in data mining-based methods, is a weak property since it induces the low-frequency problem, resulting in ineffective performance for a knowledge-based system. The concept of deploying patterns proposed in this thesis is a novel solution to this problem.
Chapter 5
Evolution of Discovered Patterns
In Chapter 4, pattern deploying methods were proposed for the use of discovered knowledge. However, not all discovered patterns are suitable for describing interesting topics, since some noise patterns are extracted from the training dataset [85]. In this chapter, two methods employing pattern evolution are proposed and developed: Deployed Pattern Evolution (DPE) and Individual Pattern Evolution (IPE). Their basic definitions and algorithms are also presented.
5.1 Deployed Pattern Evolution
In the previous chapter, the PTM model was significantly improved by the adoption of the pattern deploying method PDS, which maps discovered patterns into a hypothesis space in order to solve the low-frequency problem pertaining to specific long patterns. However, information from the negative examples has not yet been exploited during concept learning. There is no doubt that negative documents contain much useful information for identifying ambiguous patterns in the concept. For example, a pattern may be a good indicator
Document  Sequential pattern set
d1        〈carbon〉, 〈carbon, emiss〉, 〈air, pollut〉
d2        〈greenhous, global〉, 〈emiss, global〉
d3        〈greenhous〉, 〈global, emiss〉
d4        〈carbon〉, 〈air〉, 〈air, antarct〉
d5        〈emiss, global, pollut〉

Table 5.1: Examples of positive documents which are represented by a set of sequential patterns mined using PTM.
to identify relevant documents if this particular pattern always appears in the positive examples, but not if it also appears in the negative examples at times. Therefore, it is necessary for a system to exploit these ambiguous patterns in the negative examples in order to reduce their influence.
The concept of pattern evolution was introduced by Li and Zhong [86]. We adopt this concept and propose the DPE approach for a PTM-based system, which deals with the deployed patterns rather than the terms.
5.1.1 Basic Definition of DPE
Given a set of documents D = {d1, d2, . . . , d|D|}, where ~dk = 〈(tk1, nk1), (tk2, nk2), . . . , (tkm, nkm)〉 with the same definition as in Section 4.1.1, the threshold of these documents can be estimated by using the following equation:

Threshold(D) = min_{~di ∈ D} ∑_{(tj, nk) ∈ ~di} nk    (5.1)

where the weight of each term is determined using the PDM term weighting function as in Equation 4.4.
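A minimal sketch of Equation 5.1, reading Threshold(D) as the minimum total term weight over the given documents (an assumption on our part, since the equation is used as a numeric threshold in Algorithm 5.1):

```python
def threshold(docs):
    """Sketch of Equation 5.1: the smallest sum of term weights
    over the given deployed vectors."""
    return min(sum(d.values()) for d in docs)

# Un-normalised deployed patterns from Table 5.2 (weights are term counts).
dps = [
    {"carbon": 2, "emiss": 1, "air": 1, "pollut": 1},  # dp1
    {"greenhous": 1, "emiss": 1, "global": 2},         # dp2
    {"greenhous": 1, "emiss": 1, "global": 1},         # dp3
    {"carbon": 1, "air": 2, "antarct": 1},             # dp4
    {"emiss": 1, "global": 1, "pollut": 1},            # dp5
]
print(threshold(dps))  # 3 (dp3 and dp5 have the smallest total weight)
```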
Table 5.1 presents the examples of positive documents which are represented
Name  Support  Deployed pattern (vector)
dp1   1        (carbon, 2), (emiss, 1), (air, 1), (pollut, 1)
dp2   1        (greenhous, 1), (emiss, 1), (global, 2)
dp3   1        (greenhous, 1), (emiss, 1), (global, 1)
dp4   1        (carbon, 1), (air, 2), (antarct, 1)
dp5   1        (emiss, 1), (global, 1), (pollut, 1)

Table 5.2: Deployed patterns from the document examples.
Name  Support  Normalised deployed pattern
dp1   1/5      (carbon, 2/5), (emiss, 1/5), (air, 1/5), (pollut, 1/5)
dp4   1/5      (carbon, 1/4), (air, 1/2), (antarct, 1/4)
dp5   1/5      (emiss, 1/3), (global, 1/3), (pollut, 1/3)
dp6   2/5      (greenhous, 7/12), (emiss, 7/12), (global, 5/6)

Table 5.3: dp2 and dp3 are replaced by dp6 and the deployed patterns are normalised.
by a set of sequential patterns mined using PTM whereas Table 5.2 shows the
deployed patterns from these document examples. For instance, in Table 5.1, although documents d2 and d3 do not have the same set of sequential patterns, we can still see that they share the same termset, since termset(d2) = termset(d3) = {greenhous, emiss, global}, as shown in Table 5.2. Therefore, we compose vectors with the same termset into one:

dp6 = dp2 ⊕ dp3 = {(greenhous, 1/4+1/3), (emiss, 1/4+1/3), (global, 1/2+1/3)}
Let Ω be a set of deployed patterns. For each document in Table 5.1, the representation of the document can be transformed from a set of discovered sequential patterns into a set of terms using the pattern deploying method PDM. The resulting set of terms is therefore denoted a “deployed pattern” in this approach.
Figure 5.1: A negative document nd and its offending deployed patterns.
A negative document nd is a document that the system falsely identifies as positive. An offender of nd is a deployed pattern that contains at least one component appearing in nd. The set of offenders of nd is defined by:

∆p = {dp ∈ Ω | termset(dp) ∩ nd ≠ ∅}    (5.2)
Figure 5.1 illustrates the relationship between a negative document nd and its
offenders. Given a set of terms T, each term t ∈ T can be classified into one of four categories:

• “X” type: {t ∈ T | t ∈ termset(dpk), termset(dpk) ⊆ nd}.
• “Y” type: {t ∈ T | t ∈ termset(dpk) ∩ nd, termset(dpk) ⊈ nd}.
• “Z” type: {t ∈ T | t ∈ termset(dpk) − nd}.
• “∗” type: others.

where k = i or j.
There are two types of offenders: (1) a complete conflict offender, which contains “X” type terms only; and (2) a partial conflict offender, which contains both “Y” type and “Z” type terms. For instance, the deployed pattern dpi in Figure 5.1
is a complete conflict offender of negative document nd and deployed pattern
dpj is a partial conflict offender of nd. As another example, given a negative document nd = {〈emiss〉, 〈global〉, 〈pollut〉, 〈car〉}, the deployed patterns dp1, dp2 and dp3 in Table 5.2 are all partial conflict offenders of nd, since termset(dp1) ∩ nd ≠ ∅, termset(dp2) ∩ nd ≠ ∅, and termset(dp3) ∩ nd ≠ ∅ but none of them is a subset of nd, whereas dp5 in the same table is a complete conflict offender of nd because termset(dp5) ⊆ nd.
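The offender classification just described can be sketched as follows; the function name is ours, and termsets are modelled as Python sets:

```python
def classify_offender(dp_termset, nd_termset):
    """Classify a deployed pattern against a negative document's termset:
    'complete' and 'partial' follow the offender types defined above;
    None means the pattern is not an offender (empty intersection)."""
    if not (dp_termset & nd_termset):
        return None
    if dp_termset <= nd_termset:
        return "complete"   # only "X" type terms
    return "partial"        # both "Y" and "Z" type terms

nd = {"emiss", "global", "pollut", "car"}
print(classify_offender({"emiss", "global", "pollut"}, nd))         # complete (dp5)
print(classify_offender({"carbon", "emiss", "air", "pollut"}, nd))  # partial (dp1)
print(classify_offender({"carbon", "air", "antarct"}, nd))          # None (dp4)
```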
5.1.2 The Algorithm of DPE
Algorithm 5.1. DPEvolving(Ω, D+, D−)
Input: a list of deployed patterns Ω; a list of positive and negative documents, D+ and D−.
Output: a set of term weight pairs ~d.
Method:
1: ~d← ∅
// estimate minimum threshold
2: τ = Threshold(D+) // Equation 5.1
3: foreach negative document nd in D− do begin
4:   if Threshold(nd) > τ then
5:     ∆p = {dp ∈ Ω | termset(dp) ∩ nd ≠ ∅}
6:     Shuffling(nd, ∆p) // Algorithm 5.2
7:   end if
8: end for
9: foreach deployed pattern dp in Ω do begin
10:  ~d← ~d⊕ dp
11: end for
The evolution of deployed patterns is implemented by the algorithm
DPEvolving (see Algorithm 5.1). The inputs of this algorithm are a list of
deployed patterns Ω, a list of positive and negative documents, D+ and D−. The
output is a set of term weight pairs which can be used directly in the testing phase.
Line 2 in DPEvolving estimates the threshold for finding the interesting negative documents. Lines 3 to 5 implement the process of discovering the offenders of the negative documents, so that a set of deployed patterns sharing terms with a negative document is collected for further processing. Once all the offenders are found, the algorithm Shuffling (Algorithm 5.2) is called to perform the main task.
Algorithm 5.2. Shuffling(nd, ∆p)
Input: a negative document nd and a list of deployed patterns ∆p.
Output: updated deployed patterns.
Method:
1: foreach deployed pattern dp in ∆p do begin
2: if termset(dp) ⊆ nd then // complete conflict offender
3: Ω = Ω− dp
4: else // partial conflict offender
5:     offering′ = (1 − 1/µ) × ∑_{t ∈ termset(dp), t ∈ nd} t.weight
6:     base = ∑_{t ∈ termset(dp), t ∉ nd} t.weight
7: foreach term t in termset(dp) do begin
8: if t ∈ nd then // shrink offender weight
9:     t.weight = (1/µ) × t.weight
10: else // shuffle weights
11: t.weight = t.weight× (1 + offering’÷ base)
12: end if
13: end for
14: end if
15: end for
The task of the algorithm Shuffling is to tune the weight distribution of terms within a deployed pattern. A different strategy is applied for each type of offender. As stated in line 3 of the algorithm Shuffling, a complete conflict offender is removed from the deployed pattern set Ω, since all of its elements are held by the negative document, indicating that the pattern can be discarded to prevent interference from this possible “noise”.
The variable offering′ in line 5 temporarily stores the weight taken from the “Y” type terms of a partial conflict offender. The offering′ is part of the offering, which is the sum of the weights of the terms in a deployed pattern that also appear in a negative document. Given a deployed pattern dp and a negative document nd, the value of the offering can be estimated by the following equation:

offering(dp) = ∑_{(t, t.weight) ∈ β(dp), t ∈ nd} t.weight    (5.3)
            “Y” type terms of dp1          “Z” type terms of dp1
original    (air, 1/5), (pollut, 1/5)      (carbon, 2/5), (emiss, 1/5)
shuffled    (air, 1/10), (pollut, 1/10)    (carbon, 8/15), (emiss, 4/15)

Table 5.4: The change of term weights in offender dp1 before and after shuffling, when 1/µ = 1/2 (i.e., µ = 2).
where β is a mapping function which describes the relationship between deployed patterns and the hypothesis space:

β : Ω → 2^(T×[0,1]) − {∅}

β(dp) = {(t1, w1), (t2, w2), . . . , (tn, wn)} ⊆ T × [0, 1]    (5.4)
For a partial conflict offender of a negative document, since it contains two types of terms, different processes are used, as stated in lines 8 to 12 of the algorithm Shuffling. For “Y” type terms, the weights are shrunk by dividing them by an experimental coefficient µ (µ > 1). An example is given in Table 5.4, showing that the weights of the terms “air” and “pollut” are reduced when dp1 is a partial conflict offender of nd, where nd = {〈air〉, 〈pollut〉, 〈health〉}. On the other hand, the “Z” type terms receive the weight shed by the “Y” type terms, distributed according to their own weights. As can be seen in Table 5.4, the weights of the terms
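A sketch of the partial-conflict branch of Shuffling, reproducing the Table 5.4 figures under the assumption µ = 2 (so that 1/µ = 1/2):

```python
from fractions import Fraction

def shuffle(dp, nd_terms, mu=Fraction(2)):
    """Sketch of the partial-conflict branch of Algorithm 5.2 (Shuffling).
    `dp` maps terms to weights; `nd_terms` is the negative document's
    termset; mu > 1 is the experimental shrink coefficient."""
    offering = (1 - 1 / mu) * sum(w for t, w in dp.items() if t in nd_terms)
    base = sum(w for t, w in dp.items() if t not in nd_terms)
    out = {}
    for t, w in dp.items():
        if t in nd_terms:
            out[t] = w / mu                     # shrink "Y" type terms
        else:
            out[t] = w * (1 + offering / base)  # redistribute to "Z" type terms
    return out

F = Fraction
dp1 = {"carbon": F(2, 5), "emiss": F(1, 5), "air": F(1, 5), "pollut": F(1, 5)}
shuffled = shuffle(dp1, {"air", "pollut", "health"})
print(shuffled["air"], shuffled["carbon"])  # 1/10 8/15, as in Table 5.4
```

Note that the total weight of the deployed pattern is preserved: the amount removed from the “Y” type terms equals the amount distributed over the “Z” type terms.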
When all of the deployed patterns in ∆p have been visited and processed, the algorithm moves on to the next document in D− until all of the negative documents have been visited. At the end of the algorithm DPEvolving, the last operation is to join all the deployed patterns in Ω using pattern composition. As a result, the output of the algorithm DPEvolving is a set of term weight pairs, which is used for the system evaluation presented in Chapter 6.
Figure 5.2: Different levels involved by DPE and IPE in pattern evolution.
5.2 Individual Pattern Evolution
In Section 5.1, a pattern refinement strategy was proposed using the pattern evolving approach DPE to reshuffle the weight distribution within offenders. In this type of approach, features which reside in the intersection of a negative document and a partial conflict offender are reviewed and adjusted by shifting their weight contribution away, in order to weaken their effect on the concept. The rest of the features in the same deployed pattern, in turn, receive the shifted offering. However, it should be noted that a deployed pattern in DPE is constructed by compounding the patterns discovered by PTM into a hypothesis space, which means this adjustment involves all the features, including some that may come from other patterns at the “P Level” in Figure 5.2.
Figure 5.2 demonstrates the three levels in a feature hierarchy based on the
physical structure of features. In other words, features from the lower level
(e.g., “T Level”) are encapsulated into the features in the higher level (e.g., “P
Level”). If a document contains two or more patterns, it indicates that the concept of this document is represented by more than one subtopic. For instance, two patterns p1 = 〈air, pollut〉 and p2 = 〈antarct〉 are discovered from a document d which describes the topic “global warming”. Hence d = {p1, p2} implies that the combination of the two subtopics “air pollution” and “Antarctic” describes the concept of “global warming” in d. If there exists a negative document nd = {antarct, explor}, then with the use of the DPE approach for pattern evolution, the weight contribution of p2 in d would be shifted to p1 according to the algorithm Shuffling in DPE.
Essentially, it is reasonable that pattern evolution is applied to a pattern which appears in both the offender and the negative document, for the purpose of removing the suspicious source of “noise”. However, the adjustment of the other patterns in the offender (such as p1 in d) is still arguable. In the above example, the significance of the pattern 〈antarct〉 in document d needs to be reduced, since its occurrence in the negative document leads to the ambiguity problem mentioned before. Nevertheless, this does not mean that the significance of the pattern 〈air, pollut〉 has to be increased. Since the deployed pattern is a lower-level representation in which multiple subtopics have been mixed together (a pattern at the “P Level” represents a subtopic), we have to process each subtopic individually. Accordingly, an alternative way to conduct the evolution of patterns is to alter these patterns at the upper level (the “P Level” in Figure 5.2) before they are deployed as lower-level features. Therefore, an evolving approach called Individual Pattern Evolution (IPE) is proposed in this section. IPE deals with patterns in their early-state individual form, instead of manipulating patterns in deployed form at the late state.
Figure 5.3 illustrates the different states in which the evolution of patterns
takes place using DPE and IPE. When a negative document is detected, DPE
Figure 5.3: The flowchart of two pattern evolving approaches.
starts to find offenders and implements pattern evolving at the “Hypothesis Space” state. In contrast, IPE executes the same action at the “Pattern” state. In addition, the structures of “Hypothesis Space” and “Pattern” are different, and thus an alternative definition and algorithm for IPE are needed. Note that the basic component of a hypothesis space is a set of term weight pairs derived by deploying all the discovered patterns in the previous stage, whereas the basic component of the “Pattern” state is a set of sequential pattern weight pairs obtained from the output of PTM.
5.2.1 Basic Definition of IPE
Let T = {t1, t2, t3, . . . , tn} be a set of terms, which can be viewed as words or keywords in text documents. D is a set of documents consisting of a set of positive documents D+ and a set of negative documents D−. As mentioned earlier, a set of terms is denoted a termset. A set of pattern weight pairs is named a patternset, which is defined as:

Pseti = {(pi,1, wi,1), (pi,2, wi,2), . . . , (pi,n, wi,n)}    (5.5)
where pi,n is a sequential pattern with its corresponding weight wi,n. A patternset
can be used to represent a set of discovered patterns from a document d using
PTM. In this section, the result of PTM mining from D is therefore represented
by a set of patternsets:

SD = {Pset1, Pset2, . . . , Psetk}    (5.6)

where Psetk denotes the discovered pattern set of a document dk ∈ D.
Let Φ = {t1, t2, t3, . . . , tm} be a set of terms with Φ ⊆ T, indicating a hypothesis space of D. For the document examples listed in Table 5.5, Φ can be derived as:

Φ = {carbon, emiss, air, pollut, greenhous, global, antarct}
The relations between the termset Φ and the patternsets Pseti for the topic “Effects of global warming” are demonstrated in Figure 5.4. As can be seen, each pattern pi in Pseti consists of a set of terms in Φ.
A set of terms in a pattern p can easily be derived from termset(p) = {t | (t, f) ∈ p}, as discussed in Section 4.1.1. However, the terms in this set are unordered. In IPE, the order of terms within a pattern is considered: two sequential patterns are equal if and only if they contain the same terms in the same order. For example, given two patterns p1 = 〈t1, t2, t3〉 and p2 = 〈t1, t3, t2〉,
Figure 5.4: Relations between patternset and termset under the topic “Effects ofglobal warming”.
Document  Patterns
d1        〈carbon〉4, 〈carbon, emiss〉3, 〈air, pollut〉2
d2        〈greenhous, global〉3, 〈emiss, global〉2
d3        〈greenhous〉2, 〈global, emiss〉2
d4        〈carbon〉3, 〈air〉3, 〈air, antarct〉2
d5        〈emiss, global, pollut〉2

Table 5.5: Examples of positive documents represented by a set of sequential patterns with frequency.
Name   Patternset
Pset1  (〈carbon〉, 4/9), (〈carbon, emiss〉, 1/3), (〈air, pollut〉, 2/9)
Pset2  (〈greenhous, global〉, 3/5), (〈emiss, global〉, 2/5)
Pset3  (〈greenhous〉, 1/2), (〈global, emiss〉, 1/2)
Pset4  (〈carbon〉, 3/8), (〈air〉, 3/8), (〈air, antarct〉, 1/4)
Pset5  (〈emiss, global, pollut〉, 1)

Table 5.6: Normalised patternsets which contain sequential patterns with corresponding weights.
Name           Patternset
Pset1          (〈carbon〉, 4/9), (〈carbon, emiss〉, 1/3), (〈air, pollut〉, 2/9)
Pset4          (〈carbon〉, 3/8), (〈air〉, 3/8), (〈air, antarct〉, 1/4)
Pset1 ⊔ Pset4  (〈carbon〉, 4/9+3/8), (〈carbon, emiss〉, 1/3), (〈air, pollut〉, 2/9), (〈air〉, 3/8), (〈air, antarct〉, 1/4)

Table 5.7: An example of patternset composition.
although termset(p1) = termset(p2), these two patterns are not equal since their terms are in a different order.
Given two patternsets Pseti and Psetj, the join of these two patternsets can be obtained by the following patternset composition:

Pseti ⊔ Psetj = {(pi,m, wi,m + wj,n) | pi,m = pj,n, (pi,m, wi,m) ∈ Pseti, (pj,n, wj,n) ∈ Psetj}
              ∪ {(p, w) | p ∈ (Pseti ∪ Psetj) − (Pseti ∩ Psetj), (p, w) ∈ Pseti ∪ Psetj}    (5.7)
An example of patternset composition is shown in Table 5.7. The weight of the pattern 〈carbon〉 is updated during the composition, since it appears in both patternsets with the same term sequence. However, the pattern 〈emiss, global〉 in Pset2 and the pattern 〈global, emiss〉 in Pset3 cannot be joined when we combine Pset2 and Pset3 using patternset composition, even though the termsets of these two patterns are the same: termset(〈emiss, global〉) = termset(〈global, emiss〉). Therefore, given two patterns p1 ∈ Pseti and p2 ∈ Psetj, they can be joined during the operation of patternset composition (Pseti ⊔ Psetj) if and only if p1 = p2. For instance, in Table 5.7 the pattern (〈carbon〉, 4/9) in Pset1 is joined with the pattern (〈carbon〉, 3/8) in Pset4 during the operation Pset1 ⊔ Pset4. After the composition, this pattern is updated to (〈carbon〉, 4/9+3/8) = (〈carbon〉, 59/72).
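The composition can be sketched by keying patterns on ordered term tuples, so that only identical sequences are joined (the dictionary layout is illustrative):

```python
from fractions import Fraction

def compose(pset_i, pset_j):
    """Sketch of the patternset composition ⊔ in Equation 5.7. Patterns are
    term tuples, so the comparison is order-sensitive: weights are added
    only when exactly the same sequence occurs in both patternsets."""
    out = dict(pset_i)
    for p, w in pset_j.items():
        out[p] = out.get(p, Fraction(0)) + w
    return out

F = Fraction
pset1 = {("carbon",): F(4, 9), ("carbon", "emiss"): F(1, 3), ("air", "pollut"): F(2, 9)}
pset4 = {("carbon",): F(3, 8), ("air",): F(3, 8), ("air", "antarct"): F(1, 4)}
joined = compose(pset1, pset4)
print(joined[("carbon",)])  # 4/9 + 3/8 = 59/72, as in Table 5.7
```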
5.2.2 The Algorithm of IPE
Algorithm 5.3. IPEvolving(D+, D−)
Input: a list of positive and negative documents, D+ and D−.
Output: a set of term weight pairs ∆.
Method:
1: ∆← ∅; ∆ps ← ∅ // ∆ps: patternset
// find a set of patternsets SD from D+ using SPMining (Algorithm 3.1)
2: SD = {Psetd1, Psetd2, . . . , Psetdm} // where m = |D+|
3: foreach Psetdi in SD do begin
     // normalise each pattern in Psetdi
4:   foreach pattern (pi,k, wi,k) ∈ Psetdi do begin wi,k = wi,k ÷ ∑_{j=1}^{|Psetdi|} wi,j end for
5:   ∆ps = ∆ps ⊔ Psetdi // patternset composition
6: end for
// find a set of patternsets SD− from D−
7: SD− = {Psetd1, Psetd2, . . . , Psetd|D−|}
8: foreach (p, w) ∈ ∆ps do begin
     // accumulate the support of offending patterns
9:   sum_sup = ∑_{i=1}^{|SD−|} ∑_{p = p−, (p−, w−) ∈ Psetdi} suppa(p−)
10:  w = w × (suppa(p) − sum_sup) / suppa(p)
11:  foreach term (t, f) in p do begin f = w / len(p) end for
12:  ∆← ∆⊕ p // pattern deploying
13: end for
The input of the algorithm IPEvolving is a set of positive documents D+ and a set of negative documents D−. The output is a set of term weight pairs ∆ which represents the concept of the topic with respect to D+ and D−. The three main phases of Algorithm 5.3 are briefly described as follows:
Pattern Generation: a set of sequential patterns for each document is generated
in this phase using PTM. Note that only positive documents are processed
here. At the end of this stage, a set of patternsets is discovered and prepared
for the next phase. This process is implemented in line 1 and line 2 as listed
in the algorithm.
Patternset Composition: in this phase, the discovered patterns from the previous
phase are transformed into a form of pattern weight pairs using patternset
composition. The structure of each pattern is preserved and all essential
information such as statistical data is temporarily stored as well. This
operation can be found between line 3 and line 6 in the algorithm.
Individual Pattern Evolving: the major task of this algorithm is performed and completed in this phase. The involved patterns are evaluated before being deployed into a hypothesis space. The procedure spans lines 7 to 13 of the algorithm.
Given the document examples shown in Table 5.5, we assume all documents are positive and belong to D+, i.e., D+ = {d1, d2, d3, d4, d5}, and that each document has the set of patterns listed in the same table, discovered in the pattern generation phase. For instance, document d1 has a set of sequential patterns {〈carbon〉4, 〈carbon, emiss〉3, 〈air, pollut〉2}, where the number beside each pattern indicates the pattern’s absolute support. The details of how to find sequential patterns in a set of documents have been discussed in Chapter 3, and the corresponding algorithm is Algorithm 3.1. In the next step, each document is represented by a patternset as defined in Equation 5.5. Therefore, document d1 can be replaced by the patternset Psetd1 = {(〈carbon〉, 4), (〈carbon, emiss〉, 3), (〈air, pollut〉, 2)}. Note that each pattern’s weight is still the absolute support at this stage. At the end of this phase, all documents are grouped into a set of patternsets, denoted SD = {Psetd1, Psetd2, Psetd3, Psetd4, Psetd5}.
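The weight adjustment in line 10 of IPEvolving can be sketched in isolation; the support figures below are hypothetical, chosen only to show the discount:

```python
from fractions import Fraction

def ipe_discount(weight, supp_pos, neg_supp_sum):
    """Sketch of line 10 in Algorithm 5.3 (IPEvolving): a pattern's weight
    is reduced in proportion to the support it accumulates across the
    negative patternsets."""
    return weight * Fraction(supp_pos - neg_supp_sum, supp_pos)

# Hypothetical figures: a pattern with weight 2/9 has absolute support 2
# in the positive documents and total support 1 in the negative patternsets.
print(ipe_discount(Fraction(2, 9), 2, 1))  # 2/9 × 1/2 = 1/9
```

A pattern that never occurs in the negative documents keeps its weight unchanged, while one whose negative support equals its positive support is reduced to zero.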
5.3 Related Work
Pattern evolution is used for concept refinement in user profile mining. Li and Zhong [86] proposed a novel approach for mining an ontology in order to automatically acquire user information needs. For ontology construction in this work, hierarchical clustering [94, 98] is adopted to determine synonymy and hyponymy relations between keywords. A set of interesting negative documents, labeled as relevant by the system, is then detected and exploited for pattern evolving. Two kinds of offenders can be discovered from these interesting negative documents: total conflict and partial conflict. By reshuffling their weight distributions, the uncertainties contained in these offenders can be dissipated.
We adopt such a concept and apply it to our pattern-based information filtering
method DPE. Instead of using document-wise patterns for concept evolution,
DPE conducts evolution on deployed patterns which are discovered by using data
mining techniques and deployed by using our proposed PDS method. In other
words, different pattern discovery methods are used for generating representatives
in these two works.
5.4 Chapter Summary
The objective of pattern evolution is to provide an effective mechanism that allows the contextual concept in the knowledge base to be updated during the learning phase of a pattern-based knowledge discovery system. A knowledge-based system becomes adaptive by revising features in a particular state and rebuilding the context representation as negative patterns are detected.
There are two evolving approaches proposed in this chapter. The first, DPE, detects the offenders in negative documents and then applies a revision scheme to the features residing in those offenders. By shuffling the weight distribution of these features in the hypothesis space, the patterns can be properly adjusted, and the goal of refining the contextual concept in the knowledge base can be achieved as well. Similarly, the second approach, IPE, also tunes patterns in an attempt to reach the same goal, but at a different level. IPE adjusts patterns at an upper level, where they are still in sequential form, rather than in the space into which patterns are deployed as in DPE. The advantage of IPE is that not all sequential patterns need to be involved in the evolving process; only those that are also found in the negative documents need to be re-evaluated. As a result, the efficiency of the system can be improved. Moreover, by modifying only the involved patterns, we can narrow the scope of the target components and concentrate on those in the whole feature space which really need to be altered.
Chapter 6
Experiments and Results
This chapter describes the experimental evaluation of our proposed approaches featured in the pattern taxonomy model PTM. Three aspects are discussed: experimental datasets, performance measures, and evaluation procedures. The latest version of the Reuters document collection is chosen among several versions as our benchmark dataset. Most of the standard performance measures (i.e., precision, recall, breakeven point, Fβ-measure and the 11 standard points) are used for evaluating the experimental performance. The discussion and analysis of the experiments are split into three categories based on the methods or strategies proposed in the previous chapters. The PTM model comprises pattern discovery approaches (i.e., SPM and SCPM), pattern deploying methods (i.e., PDM and PDS), and pattern evolution strategies (i.e., DPE and IPE).
The process of executing PTM consists of two major phases, concept learning
and document evaluation. In the former phase, one of the proposed pattern
discovery approaches is adopted to learn the concept (i.e., user profile) of
documents in the training set; the various combinations of pattern deploying
and evolving methods are then applied in the latter phase to evaluate documents in the test
set. Text preprocessing is applied to each document before both the learning
and evaluating phases. Term stemming and stopword removal techniques are also
used in this stage for document indexing.
To evaluate the performance of PTM, we implement PTM for the task of
information filtering (IF) in our experiments. By conducting IF tasks, we can
examine the ability of the proposed pattern discovery approaches and test the
effectiveness of refinement methods for discovered patterns. The experimental
results are compared with other well-known IF-related methods including Term
Frequency Inverse Document Frequency (TFIDF) method [129], Probabilistic
method (Prob) [50, 139] and Rocchio method [122, 124]. We also compare the
results from PTM to those from data mining-based methods, such as frequent
itemset mining, sequential pattern mining and closed pattern mining methods.
6.1 Experimental Dataset
Several standard benchmark datasets are available for experimental purposes.
They are Reuters corpora, OHSUMED [58], and 20 Newsgroups collection [72].
The most frequently used one is the Reuters dataset. During the last decade,
several versions of Reuters corpora have been released. The particular version
that we chose for our experiment is Reuters Corpus Volume 1, also known as
RCV1. The reason is that RCV1 is the latest one among those common data
collections, and it also contains a reasonable number of documents with relevance
judgments for both the training and test examples. Although another version,
Reuters-21578, is currently the most widely used dataset for text categorisation
tasks, it is predicted to be superseded by RCV1 in the upcoming years [123]. The
Version        #docs    #trainings  #tests  #topics  Release year
Reuters-22173  22,173   14,704      6,746   135      1993
Reuters-21578  21,578   9,603       3,299   90       1996
RCV1           806,791  5,127       37,556  100      2000
Table 6.1: Current Reuters data collections.
summary of current Reuters data collections is given in Table 6.1.
RCV1 includes 806,791 English language news stories which were produced
by Reuters journalists for the period between 20 August 1996 and 19 August 1997.
These documents were formatted using a structured XML scheme. TREC (Text
REtrieval Conference)1 has developed and provided 100 topics for the filtering
track aiming at building a robust filtering system [123]. The first 50 topics were
composed by human researchers and the rest were formed by intersecting two
Reuters topic categories. These topic codes are listed in Appendix B.
Each RCV1 topic was divided into two sets: training and test, and the
relevance judgments have also been given for each topic. The training set has
a total amount of 5,127 news stories with dates up to and including 30 September
1996 and the test set contains 37,556 news stories from the rest of the collection.
Stories in both sets are assigned to be either positive or negative. “Positive” means
the story is relevant to the assigned topic; otherwise “Negative” will be shown. In
our experiments we chose all 100 TREC topics (from topic 101 to topic 200).
Further details regarding the RCV1 can be found in [123].
RCV1 is distributed on two CDs and contains about 810,000 English language
stories. It requires about 3.7 GB for storage if all files are uncompressed. This
1 http://trec.nist.gov/
corpus can also be obtained from the following Web sites:
http://about.reuters.com/researchandstandards/corpus/
http://trec.nist.gov/data/reuters/reuters.html
The former Web site is owned by Reuters Ltd and the latter is maintained
by NIST, the National Institute of Standards and Technology. Another Reuters
corpus, Volume 2 (RCV2), is also available on request. This multilingual
corpus is distributed on one CD and contains over 487,000 Reuters news stories
in 13 languages including Dutch, French, German, Chinese, Japanese, Russian,
Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and
Swedish.
The documents in RCV1 are tagged using XML format for easy access and
parsing. An example of an RCV1 document is illustrated in Figure 6.1. Each
document is identified by a unique item ID and accompanied by a title in the
field marked by the tag <title>. The main content of the story is in a distinct
<text> field consisting of one or several paragraphs. Each paragraph is enclosed
by the XML tag <p>. In our experiment, both the “title” and “text” fields are
used and each paragraph (i.e., content in <p>) in the “text” field is viewed as
a transaction in a document. Moreover, we treat the content in the “title” field
in the document as an additional paragraph (i.e., transaction). The information
contained in the rest of the tags, such as <headline> and <metadata>, is ignored
and discarded. Nevertheless the “headline” and “metadata” fields may contain rich
information. The reason for ignoring them is that in an RCV1 document “title”
and “headline” are duplicated fields and the “metadata” field contains information
such as region and classification codes which are out of our research scope in this
work. In this thesis we focus on the major part of the document and pay more
Figure 6.1: An XML document in RCV1 dataset.
attention to the issue of how to use these meaningful patterns discovered from it.
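To make the transaction extraction concrete, the following sketch parses an RCV1-style document into transactions, keeping only the title and paragraph fields as described above. The XML snippet is a hypothetical miniature, not an actual RCV1 story.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of an RCV1 story; real documents carry many more
# fields (<headline>, <metadata>, ...), which are ignored as in the thesis.
doc = """<newsitem itemid="123">
  <title>bill senate</title>
  <text>
    <p>bill theft trade secret foreign company federal crime</p>
    <p>senate version bill passed house</p>
  </text>
</newsitem>"""

def to_transactions(xml_string):
    """Treat the title and each <p> paragraph as one transaction (word list)."""
    root = ET.fromstring(xml_string)
    transactions = [root.findtext("title", default="").split()]
    for p in root.iter("p"):
        transactions.append((p.text or "").split())
    return transactions

print(to_transactions(doc))  # 3 transactions: title plus two paragraphs
```

The title is deliberately treated as just another transaction, matching the description above.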
As mentioned above, each RCV1 document contains at least one paragraph.
Each paragraph contains at least one sentence. This makes RCV1 different from
the previous versions of Reuters datasets, which usually have only one paragraph
per document. The characteristic of multiple paragraphs in the RCV1 documents
allows the data mining algorithms to be applied for pattern discovery with ease.
The distributions of words and paragraphs in the RCV1 dataset are shown in
Figure 6.2 and Figure 6.3 respectively. The number of stories with a particular
word or paragraph count is shown in these charts. We can also see that
most stories are short, with around 6 or 7 paragraphs and 1,000 words [118].
The TREC conference is held annually and co-sponsored by the National
Institute of Standards and Technology (NIST) and the U.S. Department of
Defense. For each TREC, NIST provides a test set of documents and questions.
Participants run their own retrieval systems and return to NIST a list of the
retrieved top-ranked documents. NIST judges the retrieved documents for
Figure 6.2: Distribution of words in an RCV1 collection [118].
Figure 6.3: Number of paragraphs per document in an RCV1 collection [118].
<top>
<num> Number: R101
<title> Economic espionage
<desc> Description: What is being done to counter economic espionage internationally?
<narr> Narrative: Documents which identify economic espionage cases and provide action(s) taken to reprimand offenders or terminate their behavior are relevant. Economic espionage would encompass commercial, technical, industrial or corporate types of espionage. Documents about military or political espionage would be irrelevant.
</top>
Figure 6.4: An example of topic description.
correctness, and evaluates the results. Each TREC conference consists of a set
of tracks, such as the “Blog Track”, “Cross-Language Track”, and “Filtering Track”.
The experiments in this thesis use the same data collection as the
TREC 2002 Filtering Track. In this track, a filtering system has
to make a binary decision as to whether a new document should be retrieved
according to a user’s information needs. Therefore, a topic in RCV1 can be viewed
as the representation of the user’s information needs. An example of a topic can
be seen in Figure 6.4. Details of building a test collection for TREC 2002 can be
found in [137].
6.2 Performance Measures
How to measure the performance of an information system is an important issue. In
this section, some of the common measures that have been used in the literature
are described. To evaluate experimental results, several standard measures such as
                            human judgement
                            yes     no
system judgement    yes     TP      FP
                    no      FN      TN
Table 6.2: Contingency table.
precision and recall are used. The precision is the fraction of retrieved documents
that are relevant to the topic, and the recall is the fraction of relevant documents
that have been retrieved. For a binary classification problem the judgement can
be defined within a contingency table as depicted in Table 6.2. According to
the definition in this table, the precision and recall are denoted by the following
formulas:

    precision = TP / (TP + FP),        recall = TP / (TP + FN)        (6.1)
where TP (True Positives) is the number of documents the system correctly
identifies as positives; FP (False Positives) is the number of documents the
system falsely identifies as positives; FN (False Negatives) is the number of
relevant documents the system fails to identify.
The precision of the first K returned documents (top-K) is also adopted in this thesis,
since most users focus on the first few dozen returned documents.
The precision of the top-K returned documents refers to the proportion
of relevant documents among the first K returned documents. The value of K we use
in the experiments is 20.
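These definitions can be sketched directly in code; the ranked list of relevance judgments below (1 = relevant, 0 = non-relevant) is invented for illustration, not taken from the RCV1 experiments:

```python
def precision_recall(tp, fp, fn):
    """Equation (6.1): precision and recall from contingency-table counts."""
    return tp / (tp + fp), tp / (tp + fn)

def precision_at_k(ranked_relevance, k):
    """Fraction of relevant documents among the first k returned documents."""
    top = ranked_relevance[:k]
    return sum(top) / len(top)

# Invented judgments for a ranked list of 10 returned documents.
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
p, r = precision_recall(tp=4, fp=6, fn=1)   # 5 relevant documents exist in total
print(p, r)                       # 0.4 0.8
print(precision_at_k(ranked, 5))  # 0.6
```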
In addition, breakeven point (b/p) is used to provide another measurement for
performance evaluation. It indicates the point where the value of precision equals
the value of recall for a topic. The higher the b/p value, the more effective
the system is. The b/p measure has been frequently used in common information
retrieval evaluations.
In order to assess the effect involving both precision and recall, another
criterion which can be used for experimental evaluation is Fβ-measure [79] which
combines precision and recall and can be defined by the following equation:
    Fβ-measure = ((β² + 1) · precision · recall) / (β² · precision + recall)        (6.2)
where β is a parameter giving weights of precision and recall and can be viewed as
the relative degree of importance attributed to precision and recall [130]. A value
β = 1 is adopted in our experiments meaning that it attributes equal importance
to precision and recall. When β = 1, the measure is expressed as:
    F1 = (2 · precision · recall) / (precision + recall)        (6.3)
The value of Fβ=1 is equivalent to the b/p when precision equals recall.
However, the b/p cannot be compared directly to the Fβ=1 value, since the latter is
given a higher score than the former [162]. It has also been stated in [103]
that the Fβ=1 measure is greater than or equal to the value of b/p.
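Equations (6.2) and (6.3) can be sketched as one function; setting β = 1 reduces the general form to F1:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta measure, Equation (6.2); beta = 1 gives F1 of Equation (6.3)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_beta(0.5, 0.5))   # 0.5 -- equals the b/p when precision == recall
print(f_beta(0.4, 0.8))   # ≈ 0.533
```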
Both the b/p and the Fβ-measure are single-valued measures, in that they use
only a single figure to reflect the performance over all the documents. However,
more figures are needed to evaluate the system as a whole. Therefore, another measure,
Interpolated Average Precision (IAP) is introduced and has been adopted before
in several research works [71, 133, 162]. This measure is used to compare the
performance of different systems by averaging precisions at 11 standard recall
levels (i.e., recall = 0.0, 0.1, ..., 1.0). The 11-points measure used in our
comparison tables indicates the first of the 11 points, where recall equals
zero. Moreover, Mean Average Precision (MAP) is used in our evaluation; it is
calculated by first measuring precision at each relevant document, and then averaging these
precisions over all topics.
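The per-topic ingredient of MAP, average precision, can be sketched as follows (the ranking is invented for illustration); MAP then averages this value over all topics:

```python
def average_precision(ranked_relevance):
    """Mean of the precision values taken at each relevant document's rank."""
    precisions, hits = [], 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.833
```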
Error rate is another performance measure that is commonly used in text
categorisation. The value of error rate ε can be calculated by the equation:
    ε = (FP + FN) / (TP + FP + FN + TN)        (6.4)
In order to obtain a global measurement, there are two ways to evaluate the
average performance. In the case of text categorisation, let C be a set of classes;
precision and recall can be averaged using:
- micro-averaging:
the contingency tables of all categories are merged into a single table and
then the global performance is estimated using the merged table:
    precision_micro = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FP_i)

    recall_micro = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FN_i)
- macro-averaging:
one contingency table per category is used, measures are calculated locally
and then averaged over categories:
    precision_macro = (Σ_{i=1}^{|C|} precision_i) / |C|

    recall_macro = (Σ_{i=1}^{|C|} recall_i) / |C|
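The two averaging schemes can be contrasted with a small sketch; the per-class contingency counts below are invented:

```python
def micro_macro_precision(tables):
    """tables: list of (TP, FP) pairs, one per class.
    Returns (micro-averaged, macro-averaged) precision."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    micro = tp / (tp + fp)
    macro = sum(t[0] / (t[0] + t[1]) for t in tables) / len(tables)
    return micro, macro

# Two classes: one large and easy (90 TP, 10 FP), one small and hard (1 TP, 9 FP).
micro, macro = micro_macro_precision([(90, 10), (1, 9)])
print(micro, macro)  # micro (≈ 0.83) is dominated by the large class; macro is 0.5
```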
Generally speaking, micro-averaging yields better scores than macro-averaging
in practical experiments. In particular, micro-averaging gives every
document an equal weight in the performance, whereas macro-averaging gives every
class an equal weight. Micro-averaged precision and recall
values are usually used in the text classification domain. As mentioned above, however, PTM
is evaluated on a system which performs information filtering tasks rather than
text categorisation. Therefore, the averaged precision and recall values are
computed by summing up the corresponding values over all topics and then dividing
by the number of topics.
6.3 Evaluation Procedures
In order to evaluate the proposed PTM model, we apply PTM to a practical
information filtering task. As mentioned in Chapter 2, information filtering is
a task in which a user with a specific information need monitors a stream of
documents and the system selects documents from the stream according to a
profile of the user's interests. Filtering systems process one document at a time
and show it to the user if the document is relevant. The system then adjusts the
profile or updates the threshold based on the user’s feedback. In the case of batch
filtering, a number of relevant documents are returned, whereas a list of ranked
documents is given by a routing filtering system. In this thesis, routing filtering
is implemented and performance of the model is evaluated based on the ranked
documents. The choice of routing task can avoid the need of threshold tuning,
which is beyond our focus in this research work.
We evaluate PTM using all 100 TREC topics (r101–r200) in the experiments.
No.   #r   #d    No.   #r   #d    No.   #r   #d    No.   #r   #d
r101    7   23   r126   19   29   r151    6   49   r176    5   57
r102  135  199   r127    5   32   r152    5   55   r177   25   45
r103   14   64   r128    4   51   r153   10   18   r178    3   43
r104  120  194   r129   17   72   r154    6   52   r179    5   57
r105   16   37   r130    3   24   r155   11   74   r180    5   61
r106    4   44   r131    4   31   r156    6   37   r181    4   64
r107    3   61   r132    7  103   r157    3   42   r182   19   36
r108    3   53   r133    5   47   r158    5   79   r183   25   55
r109   20   40   r134    5   31   r159   21   62   r184    9   48
r110    5   91   r135   14   29   r160   15   36   r185   26   52
r111    3   52   r136    8   46   r161    5   52   r186   20   38
r112    6   57   r137    3   50   r162    6   27   r187    7   48
r113   12   68   r138    7   98   r163    4   29   r188    3   30
r114    5   25   r139    3   21   r164   21   64   r189   12   56
r115    3   46   r140   11   59   r165    7   53   r190   13   42
r116   16   46   r141   24   56   r166    8   39   r191    5   43
r117    3   13   r142    4   28   r167    5   63   r192    3   40
r118    3   32   r143    4   52   r168   32   43   r193    5   64
r119    4   26   r144    6   50   r169    5   35   r194   31   80
r120    9   54   r145    5   95   r170   16   79   r195    8   36
r121   14   81   r146   13   32   r171    7   48   r196    5   61
r122   15   70   r147    6   62   r172   10   78   r197   22   34
r123    3   51   r148   12   33   r173   27   35   r198    3   29
r124    6   33   r149    5   26   r174    5   44   r199   21   40
r125   12   36   r150    4   51   r175   37   37   r200    7   34

Table 6.3: Number of relevant documents (#r) and total number of documents (#d) for each topic in the RCV1 training dataset.
No.   #r   #d    No.   #r   #d    No.   #r   #d    No.   #r   #d
r101  307  577   r126  172  270   r151   22  437   r176   37  411
r102  159  308   r127   42  238   r152   41  402   r177   61  250
r103   61  528   r128   33  276   r153   37  118   r178   47  271
r104   94  279   r129   57  507   r154   39  469   r179   32  510
r105   50  258   r130   16  307   r155   63  489   r180   72  426
r106   31  321   r131   74  252   r156   72  354   r181   25  574
r107   37  571   r132   22  446   r157   37  300   r182   32  157
r108   15  386   r133   28  380   r158   45  542   r183  139  443
r109   74  240   r134   67  351   r159   97  368   r184   13  361
r110   31  491   r135  337  501   r160   54  199   r185  184  371
r111   15  451   r136   67  452   r161   47  463   r186  264  417
r112   20  481   r137    9  325   r162   81  319   r187   31  467
r113   70  552   r138   44  328   r163  122  343   r188   36  322
r114   62  361   r139   17  253   r164  182  432   r189   76  384
r115   63  357   r140   67  432   r165   52  499   r190   85  337
r116   87  298   r141   82  379   r166   17  219   r191   18  347
r117   32  297   r142   24  198   r167   40  486   r192   29  367
r118   14  293   r143   23  417   r168  269  342   r193   16  430
r119   40  271   r144   55  380   r169   35  348   r194  187  571
r120  158  415   r145   27  488   r170   73  507   r195   37  263
r121   84  597   r146  111  280   r171   68  394   r196   50  453
r122   51  393   r147   34  380   r172   41  441   r197  144  264
r123   17  342   r148  228  380   r173  226  314   r198   18  249
r124   33  250   r149   57  449   r174   82  364   r199  116  272
r125  132  544   r150   54  371   r175  312  312   r200   86  277

Table 6.4: Number of relevant documents (#r) and total number of documents (#d) for each topic in the RCV1 test dataset.
TREC provides two sets of documents for each topic, for training and test
purposes. Table 6.3 and Table 6.4 provide the related statistical information for
the training and test datasets respectively. All of the documents in these two sets are
processed in both the profile learning and document evaluating phases. Before
the learning phase, document indexing is applied to preprocess words and remove
stopwords. Once each document is transformed into the desired format, one of the
mining methods is selected to find dedicated patterns in the phase of pattern
discovery. These patterns are then passed through the subsequent deploying and
evolving processes to generate the representative concept (e.g. deployed pattern
set), which is used to represent the set of documents. Following is the test phase
where each document in the test set is evaluated to examine the performance of
the PTM-based IF system. In summary, steps required for the whole evaluation
procedure in PTM are briefly listed as follows:
(1) System starts from one of the RCV1 topics and retrieves the related
information with regard to the training set, such as file list and the number
of documents.
(2) Each document is preprocessed with word stemming and stopword
removal and transformed into a set of transactions based on its
document structure.
(3) System selects one of the pattern discovery algorithms to extract patterns.
(4) Discovered patterns are deployed into a hypothesis space using one of the
proposed deploying methods.
(5) If required, the pattern evolving process is used to refine patterns. A concept
representing the context of the topic is eventually generated.
Figure 6.5: Process of document indexing.
(6) Each document in the test set is assessed by the document evaluation
method and the experimental results are shown as an output.
(7) System ends for this topic and repeats the above steps for the next topic if
required.
In the following subsections, more details about document indexing in our
experiments are presented, followed by descriptions of the three main procedures
in our proposed PTM model. The experimental environment and settings are also
discussed at the end of this section.
6.3.1 Document Indexing
Document indexing is the process that assigns terms to documents for retrieval
purposes [45]. The goal of document indexing is to select informative features
that represent the concept of a set of documents. A typical process of document
indexing is illustrated in Figure 6.5. In this process, a set of documents is read and
a set of features is returned as output. Document indexing consists of two steps:
preprocessing and feature selection.
In preprocessing, redundant terms need to be eliminated before the documents
can be interpreted by the system. Since RCV1 documents are all in XML format,
there are many fields enclosed by tags, including <title>, <headline>,
<dateline>, <text>, <copyright> and <metadata> (see the
document example in Appendix A). In our experiments, the fields we chose in
each document are <title> and <text>. The content of the remaining
fields is discarded. In an RCV1 document, each <text> field contains several
paragraphs enclosed by the tag <p>. For implementing PTM, we treat each
paragraph as a transaction; the content of the <title> field is likewise
viewed as an extra paragraph because of the rich information it carries.
The next process is to apply stopword removal and word stemming. In stopword
removal, function words and non-informative terms are removed according to a
given stopword list (Appendix C). For word stemming, the Porter algorithm [117]
is used for suffix stripping.
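The preprocessing step can be sketched as below. The stopword list is a tiny stand-in for the full list in Appendix C, and the suffix stripper is a crude placeholder for the Porter algorithm that the thesis actually uses:

```python
STOPWORDS = {"the", "of", "a", "is", "to", "and"}  # stand-in for Appendix C

def crude_stem(word):
    """Toy suffix stripping; the real system applies the Porter algorithm [117]."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(paragraph):
    """Lowercase, remove stopwords, stem: one paragraph in, one term list out."""
    return [crude_stem(w) for w in paragraph.lower().split()
            if w not in STOPWORDS]

print(preprocess("The senate passed a version of the bill"))
# ['senate', 'pass', 'version', 'bill']
```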
During feature selection, each term is assigned a value by a
weighting scheme, and terms with low scores are removed for the purpose
of dimensionality reduction. As already mentioned in Chapter 2, feature selection
is a way to make the system efficient. Existing systems usually select a term
weighting scheme to eliminate a large number of non-relevant terms, especially
in the fields of Information Retrieval and Text Categorisation. However, in IF,
due to the lack of relevant information for training, shrinking the term base may
affect the system's effectiveness. Therefore, the strategy we adopt is to select
not terms but patterns. That means pattern pruning is applied during the process
of pattern discovery in order to achieve the goal of dimensionality reduction
(Section 3.1.2). Hence, in our experiments, almost all the terms are retained after
<< 63261.xml >> => 1 title + 4 paragraphs => 32 words
(T) bill senat
(1) bill theft trade secret foreign compani feder crime final action senat
(2) senat version bill pass hous version pass hous final action hous
(3) bill compani theft feder crime
(4) foreign trade secret
==== Found Patterns:
[1Terms]: ([senat](3)) Freq:3, rel_supp:0.6
[1Terms]: ([bill](4)) Freq:4, rel_supp:0.8
[1Terms]: ([foreign](2)) Freq:2, rel_supp:0.4
[2Terms]: ([bill](4),senat) Freq:2, rel_supp:0.4
[2Terms]: ([trade](2),secret) Freq:2, rel_supp:0.4
[3Terms]: ([bill,final](2),action) Freq:2, rel_supp:0.4
[4Terms]: ([bill,theft,feder](2),crime) Freq:2, rel_supp:0.4
[4Terms]: ([bill,compani,feder](2),crime) Freq:2, rel_supp:0.4
Figure 6.6: Primary output of a preprocessed document and found patterns.
scanning all training documents, except terms whose frequency equals one.
In fact, a number of RCV1 topics contain only a couple of training
examples; about 63% of all RCV1 topics have no more than 10 relevant examples
available for training. An example of output after document preprocessing is
illustrated in Figure 6.6.
In the case of Figure 6.6 (document “63261.xml”), it can be seen that words in
the document are stemmed and only those that appear in at least two transactions
are retained; the rest are removed. The use of pattern pruning in SCPM removes
a large number of non-closed patterns, keeping the number of discovered
patterns reasonable. If SPM is chosen for pattern discovery instead,
the number of generated patterns explodes to 35, compared with 8 for
the SCPM algorithm. In fact, the extra patterns do not improve the system's
effectiveness, according to our findings in the preliminary
work [159]. In addition to SPM, NSPM encounters the same problem, since both
of them generate a large number of redundant patterns.
6.3.2 Procedure of Pattern Discovery
The result of document indexing is a set of transactions and each transaction
consists of a vector of stemmed terms. The next step is to find frequent patterns
using our proposed pattern discovery algorithms. As mentioned in Chapter 3, data
mining approaches including association rule mining, frequent sequential pattern
mining, closed pattern mining, itemset mining, and closed itemset mining are
adopted and applied to the text mining tasks. By splitting each document into
several transactions (i.e., paragraphs), we can use these mining methods to find
frequent patterns from the textual documents. Five pattern discovery methods
which have been implemented in the experiments are briefed as follows:
- SPM: Finding sequential patterns using the algorithm SPMining (Algo-
rithm 3.1 in Section 3.1.1), skipping the first line of the algorithm.
- SCPM: Finding sequential closed patterns using the algorithm SPMining.
- NSPM: Finding non-sequential patterns using the algorithm NSPMining
(Algorithm 3.2 in Section 3.2).
- NSCPM: Finding non-sequential closed patterns using algorithm NSPMin-
ing with closed pattern mining scheme (Equation 3.1) in Section 3.2.2.
- nGram: Finding all sequential patterns, whose lengths do not exceed “n”,
using the SPMining algorithm.
Note that the min_sup we choose is 0.2 for all mining methods, which means a
pattern is frequent if it appears in n paragraphs (including the title field) of a document
containing m transactions (paragraphs plus title) such that n/m ≥ 0.2. For
fairness of comparison, the value of min_sup is kept the same for all approaches.
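This frequency test can be sketched on the transactions of document 63261.xml from Figure 6.6; the subsequence check below is a simplified stand-in for the matching done inside the actual mining algorithms:

```python
def is_subsequence(pattern, transaction):
    """True if the pattern's terms occur in order (not necessarily adjacently)."""
    remaining = iter(transaction)
    return all(term in remaining for term in pattern)

def relative_support(pattern, transactions):
    n = sum(is_subsequence(pattern, t) for t in transactions)
    return n / len(transactions)

# Title (T) plus four paragraphs of document 63261.xml (Figure 6.6).
doc = [
    ["bill", "senat"],
    ["bill", "theft", "trade", "secret", "foreign", "compani",
     "feder", "crime", "final", "action", "senat"],
    ["senat", "version", "bill", "pass", "hous", "version",
     "pass", "hous", "final", "action", "hous"],
    ["bill", "compani", "theft", "feder", "crime"],
    ["foreign", "trade", "secret"],
]
MIN_SUP = 0.2
for pattern in (["bill"], ["bill", "senat"], ["trade", "secret"]):
    rs = relative_support(pattern, doc)
    print(pattern, rs, rs >= MIN_SUP)  # matches the rel_supp values in Figure 6.6
```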
Figure 6.6 shows a primary result of the SCPM mining method. As can be seen,
eight sequential closed patterns are found. The frequency and relative
support of each pattern are also estimated. A similar result can be obtained if the
NSCPM mining algorithm is used instead of SCPM, since both of them adopt
a pattern pruning scheme during pattern discovery. However, the number of
discovered patterns increases dramatically if either SPM or NSPM is applied.
The comparison of these methods is presented in Section 6.5.1.
6.3.3 Procedure of Pattern Deploying
The procedure of pattern deploying is illustrated in Figure 6.7. Each step in the
figure is briefed as follows:
Topic: The system usually needs to process a number of topics and starts
each of them in turn. A topic contains a training dataset and a test dataset,
each of which contains a set of documents.
Data Transform: Each document in the training dataset is preprocessed.
For a document, the words enclosed by the “title” and “text” tags are retained
for further processing, which includes word stemming and stopword
removal. After preprocessing, each document is transformed into a set of
transactions representing the title and paragraphs.
Pattern Discovery: In this step, the SCPM method is chosen as a mining
Figure 6.7: Flow chart of experimental procedure for pattern deploying methods PDM and PDS in the pattern taxonomy model PTM.
mechanism in order to find frequent sequential closed patterns from
transactions. Each document is now represented by pattern taxonomies
consisting of discovered patterns.
Pattern Deployment: There are two choices for pattern deploying. Either
PDM or PDS method can be chosen in order to map discovered patterns
into a hypothesis space. The main difference between these two methods is
that the latter considers the pattern support during pattern re-evaluation.
Concept: After pattern deployment, the concept of the topic is built by merging
all documents using pattern decomposition.
Test: Once the concept is established, the relevance of each
document in the test dataset is estimated using the document evaluating
function. Documents in the dataset are then ranked according to their relevance
scores.
Evaluation: The system’s performance is evaluated using the aforemen-
tioned measures. After evaluation, the system assesses the next topic if
required.
6.3.4 Procedure of Pattern Evolving
The procedure of pattern evolving is similar to that of pattern deploying in the first
three steps, but differs in the remaining ones. Figure 6.8 presents the flow chart of
pattern evolving methods DPE and IPE. Each step is briefly described as follows:
Topic: The system usually needs to process a number of topics and starts
each of them in turn. A topic contains a training dataset and a test dataset
Figure 6.8: Flow chart of experimental procedure for pattern evolving methods DPE and IPE in the pattern taxonomy model PTM.
each of which contains a set of documents.
Data Transform: Each document in the training dataset is preprocessed.
For a document, the words enclosed by the “title” and “text” tags are retained
for further processing, which includes word stemming and stopword
removal. After preprocessing, each document is transformed into a set of
transactions representing the title and paragraphs.
Pattern Discovery: In this step, the SCPM method is chosen as a mining
mechanism in order to find frequent sequential closed patterns from
transactions. Each document is now represented by pattern taxonomies
consisting of discovered patterns.
Pattern Deployment: The pattern evolving methods DPE and IPE undertake
different processes in this step. For DPE, pattern deployment proceeds
as usual: deployed patterns are generated and passed to the
subsequent step. For IPE, however, patterns do not need to be
deployed before they are evolved. Where pattern deploying is required, either
PDM or PDS can be selected to perform the task.
Pattern Evolution: There are two approaches for pattern evolution,
DPE and IPE. Both approaches need information from the negative
documents (“nds”). The DPE method evolves the deployed patterns,
which is viewed as term-level evolution, whereas the
IPE method operates directly on the non-deployed patterns
resulting from the Pattern Discovery step, which is referred to as
pattern-level evolution.
Concept ∼ Evaluation: Please refer to the same steps described in the
procedure of pattern deploying in the previous section.
6.4 Experimental Setting
All the experiments reported in this thesis were conducted on a PC equipped
with an Intel Pentium IV 3.0 GHz CPU and 1,024 MB of memory, running the Windows
XP operating system. The PTM-based IF system is coded
in the Java programming language, with J2SDK version 1.4.2 as the development
environment. The data collection was acquired on a licensed CD from the TREC
organisation and is used in our experiments without any modification, although
we found some errors and duplicates in the data. The relevance
judgement information for each topic in the training and test datasets is also derived
from files downloaded directly from the TREC Web site2.
The value of minimum support used for association rule mining in the
experiments is set to 0.2 based on system optimisation. For
consistency, we use the same minimum support in all related mining algorithms.
The influence of different minimum support settings is a well-studied issue
which has been widely investigated in the data mining literature [54]; thus, in our
experiments we did not focus on this coefficient. Moreover, the recursive
loop of the proposed mining algorithms stops and exits when no
more patterns are found. However, in some cases (e.g., topics r193 and r199) the
recursive loop seems not to stop, since some documents in these topics contain a
large number of long patterns. The longest pattern we found has length 15, using the SCPM and
SPM mining algorithms. Therefore, for non-sequential pattern mining algorithms

2 http://trec.nist.gov/data/t2002_filtering.html
(i.e., NSPM and NSCPM), the maximum pattern length we search for is set
to 15, and the loop exits once patterns of that length have been found, regardless of
whether any longer candidates could be generated.
6.5 Experiment Evaluation
In order to evaluate the performance of the proposed PTM model, we apply PTM
to IF tasks and compare the results against those of other methods. For
an IF task, the system extracts a profile from the training dataset for each topic
and aims to filter out non-relevant incoming documents according to these
user profiles. Firstly, we apply data preprocessing to each document in order
to reduce dimensionality: stopword removal and term stemming are performed
according to a given list of stopwords (see Appendix C) and the Porter stemming
algorithm [117]. In practice, about 20 to 30 percent of text consists of stopwords [19].
There are many classic approaches to concept (i.e., user profile) generation.
The Rocchio algorithm [122], which has been widely adopted in the areas of TC
and IF, can be used to build the profile representing the concept of a topic
from a set of relevant and irrelevant documents. The centroid ~c of a
topic can be generated by using the following equation:
$$\vec{c} = \alpha \frac{1}{|D^{+}|} \sum_{\vec{d} \in D^{+}} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|D^{-}|} \sum_{\vec{d} \in D^{-}} \frac{\vec{d}}{\|\vec{d}\|} \qquad (6.5)$$
where α and β are empirical parameters; D+ and D− are the sets of positive and
negative documents respectively; ~d denotes a document.
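The centroid computation of Equation 6.5 can be sketched as follows. This is an illustrative implementation; the sparse-dict document representation and the function name are assumptions, not the thesis's code:

```python
import math

def rocchio_centroid(pos_docs, neg_docs, alpha=1.0, beta=0.0):
    """Topic centroid per Equation 6.5: the alpha-weighted mean of the
    unit-normalised positive document vectors minus the beta-weighted mean
    of the negative ones. Documents are sparse dicts term -> weight."""
    def accumulate(acc, doc, coeff):
        norm = math.sqrt(sum(w * w for w in doc.values()))
        for t, w in doc.items():
            acc[t] = acc.get(t, 0.0) + coeff * w / norm

    c = {}
    for d in pos_docs:
        accumulate(c, d, alpha / len(pos_docs))
    if neg_docs:
        for d in neg_docs:
            accumulate(c, d, -beta / len(neg_docs))
    return c
```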
The probabilistic method (Prob) [50, 119] is a well-known keyword-based
approach for concept generation. With this heuristic, each basic element
(term) t in the feature space is weighted using the following formula:
$$W(t) = \log\left(\frac{r + \eta}{R - r + \eta} \div \frac{n - r + \eta}{(N - n) - (R - r) + \eta}\right) \qquad (6.6)$$
where N and R are the total number of documents and the number of positive
documents in the training set respectively; n is the number of documents which
contain t; r is the number of positive documents which contain t, and η is a
coefficient.
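Equation 6.6 translates directly into code. A one-line sketch (the function name is ours):

```python
import math

def prob_weight(r, n, R, N, eta=0.5):
    """Probabilistic term weight per Equation 6.6: the log-odds that a
    positive document contains t versus a non-positive one, smoothed by eta."""
    return math.log(((r + eta) / (R - r + eta)) /
                    ((n - r + eta) / ((N - n) - (R - r) + eta)))
```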
In addition, TFIDF is also widely used. A term t can be weighted by
W(t) = TF(d,t) × IDF(t), where the term frequency TF(d,t) is the number of
times term t occurs in document d (d ∈ D) and D is the set of documents in
the dataset; DF(t) is the document frequency, i.e. the number of documents in
which the term t occurs at least once; IDF(t), the inverse document frequency,
is defined as log(|D|/DF(t)).
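A minimal sketch of this weighting, assuming documents are represented as lists of stemmed terms (the function name is ours):

```python
import math

def tfidf_weight(term, doc, docs):
    """W(t) = TF(d,t) * IDF(t) with IDF(t) = log(|D| / DF(t)), as defined above."""
    tf = doc.count(term)                 # raw term frequency in document d
    df = sum(term in d for d in docs)    # number of documents containing the term
    return tf * math.log(len(docs) / df)
```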
Another well-known term-based model is the BM25 approach [120], which is
basically considered the state-of-the-art baseline in IR. The weight of a term t can
be estimated by using the following function:
$$W(t) = \frac{TF \cdot (k_1 + 1)}{k_1 \cdot \left((1-b) + b\,\frac{DL}{AVDL}\right) + TF} \cdot \log\frac{(r+0.5)/(n-r+0.5)}{(R-r+0.5)/(N-n-R+r+0.5)} \qquad (6.7)$$
where TF is the term frequency; k1 and b are parameters; DL and AVDL are
the document length and average document length respectively. The values of
k1 and b are set to 1.2 and 0.75 respectively, following the suggestions in [140, 141].
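Equation 6.7 can be sketched as follows; the parameter defaults follow the values quoted above, and the function name is ours:

```python
import math

def bm25_weight(tf, dl, avdl, r, n, R, N, k1=1.2, b=0.75):
    """BM25 term weight per Equation 6.7: a length-normalised, saturated TF
    component multiplied by the relevance log-odds factor."""
    tf_part = tf * (k1 + 1) / (k1 * ((1 - b) + b * dl / avdl) + tf)
    rel_part = math.log(((r + 0.5) / (n - r + 0.5)) /
                        ((R - r + 0.5) / (N - n - R + r + 0.5)))
    return tf_part * rel_part
```

With `tf=1` and `dl == avdl`, the TF component reduces to 1 and the weight equals the relevance factor alone.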
The support vector machine (SVM) is also a well-known learning method,
introduced by Cortes and Vapnik [31]. Since the work of Joachims [63, 64],
researchers have successfully applied SVM to many related tasks and presented
convincing results [23, 24, 91, 127, 163]. The decision function in SVM is
defined as:
$$h(x) = \mathrm{sign}(w \cdot x + b) = \begin{cases} +1 & \text{if } w \cdot x + b > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (6.8)$$
where $x$ is the input vector; $b \in \mathbb{R}$ is a threshold and
$w = \sum_{i=1}^{l} y_i \alpha_i x_i$ for the given training data:

$$(x_1, y_1), \ldots, (x_l, y_l) \qquad (6.9)$$

where $x_i \in \mathbb{R}^n$ and $y_i$ equals $+1$ ($-1$) if document $x_i$ is
labeled positive (negative). $\alpha_i \in \mathbb{R}$ is the weight of the
training example $x_i$ and satisfies the following constraints:

$$\forall i: \alpha_i \geq 0 \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0. \qquad (6.10)$$
Since all positive documents are treated equally before the process of
document evaluation, the value of αi is set to 1.0 for all positive documents,
and the αi values for the negative documents can then be determined using
Equation 6.10.
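Under this setting the negative weights follow directly from Equation 6.10. A small sketch, assuming (as the text does) a single uniform alpha for the negative examples; the helper name is ours:

```python
def negative_alphas(labels):
    """Given alpha_i = 1.0 for every positive example, solve the constraint
    sum_i alpha_i * y_i = 0 (Equation 6.10) for a uniform negative alpha."""
    n_pos = labels.count(+1)
    n_neg = labels.count(-1)
    alpha_neg = n_pos / n_neg   # n_pos*(+1) + n_neg*alpha_neg*(-1) = 0
    return [1.0 if y > 0 else alpha_neg for y in labels]
```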
In document evaluation, once the concept for a topic is obtained, the
similarity between a test document and the concept is estimated using the
inner product. The relevance of a document d to a topic can be calculated by
the function R(d) = ~d · ~c, where ~d is the term vector of d and ~c is the
concept of the topic.
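The document evaluation step can be sketched as a sparse inner product (an illustrative helper, assuming dict-based term vectors):

```python
def relevance(doc_vec, concept):
    """R(d) = d . c : inner product over the shared terms of the document
    vector and the topic concept, both represented as sparse dicts."""
    return sum(w * concept.get(t, 0.0) for t, w in doc_vec.items())
```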
6.5.1 Experiment on Pattern Discovery Methods
In this section, we present the experimental results from applying various
data mining techniques to pattern discovery in an IF system, and compare the
effectiveness and efficiency of these techniques. The purpose of this
experiment is to determine whether PTM is superior to the data mining-based
methods. Furthermore, we can identify which data mining technique is most
suitable for adoption by a knowledge discovery system in the text mining
domain. In addition, we also compare the results of PTM with those of classic
approaches such as the probabilistic method.

Method     Pattern type               #Patterns  Runtime (sec.)    b/p
SPM        Sequential Pattern           126,310           5,308  0.343
SCPM       Sequential Closed Pattern     38,588           4,653  0.353
NSPM       Frequent Itemset             340,142          14,502  0.352
NSCPM      Frequent Closed Itemset       34,794           7,122  0.346
3Gram      nGram                         88,991           4,092  0.342
PTM(PDS)   Pattern Taxonomy               8,027           1,510  0.431

Table 6.5: Comparing PTM with data mining-based methods on RCV1 topics r101 to r150.
The comparison of PTM with the data mining-based methods on the first 50 RCV1
topics is depicted in Table 6.5. As we can see, PTM outperforms the other
methods by around 8 percentage points in b/p. The results also support the
superiority of PTM in efficiency, since the runtime of PTM is as low as 1,510
seconds, compared to more than 4,000 seconds for the others; NSPM even takes
14,502 seconds to complete the task. This can be explained in two ways. On the
one hand, PTM discovers only 8,027 patterns, compared to over 340,000 for
NSPM, which saves much time in pattern discovery. On the other hand, PTM does
not need to run the time-consuming pattern mining algorithm again in the
document evaluation phase, since the deploying method PDS is applied, leading
to greater efficiency than the data mining methods. In terms of
closed pattern mining, both closed pattern-based methods, SCPM and NSCPM,
produce fewer patterns than the non-closed pattern-based methods, SPM and
NSPM, respectively. We would expect an increase in the b/p scores for both
SCPM and NSCPM; however, only the former performs better than its non-closed
counterpart. The reason is that the significance of a non-closed pattern
(usually a short pattern) can be subsumed by a closed pattern (a longer
pattern), since the former is a subsequence of the latter; however, the
low-frequency problem of long patterns causes the result to deviate from this
intuition. This behaviour motivates further investigation of the issue of
pattern deploying.
Comparing the closed pattern-based methods with the non-closed ones, it is
obvious that the closed pattern-based SCPM and NSCPM are more suitable for
text mining tasks than SPM and NSPM, because fewer patterns are generated by
SCPM and NSCPM and less runtime is needed. Despite the slight difference in
performance, closed pattern-based methods are much more efficient than
non-closed ones. With regard to the issue of term order in a pattern, a
sequential pattern contains an ordered list of terms, whereas a non-sequential
pattern mined by NSCPM consists of an unordered itemset. Comparing the number
of discovered patterns, NSCPM produces fewer than SCPM; however, SCPM has
advantages over NSCPM in runtime and performance. As a result, SCPM is better
than NSCPM and is therefore suitable for use in text-related domains. In this
experiment, PTM thus adopts the concept of SCPM for closed sequential pattern
mining in the pattern discovery phase.
The sequential pattern-based methods SPM and SCPM use less runtime (5,308 and
4,653 seconds) than the non-sequential pattern-based NSPM and NSCPM (14,502
and 7,122 seconds) to complete the first 50 topics. This is mainly due to the
difference in the candidate generation process implemented by these two types
of methods. SPM and SCPM traverse only half of a paragraph on average when
generating candidates, because each traversal to find an (n+1)-term candidate
starts from the position of the last term of the n-term pattern. In contrast,
NSPM and NSCPM have to start from the first term of the paragraph for every
candidate generated, since term order in a pattern is not considered in such
mining methods. Another observation is that NSCPM generates fewer patterns
than SCPM. This can be explained by the fact that the proportion of non-closed
patterns among non-sequential itemsets is larger than that among sequential
patterns, so more non-closed patterns are removed during pattern pruning in NSCPM.
The lowest b/p result is produced by the 3Gram method, a special case of SPM.
3Gram discovers sequential patterns whose length is at most 3, resulting in a
great reduction in discovered patterns (88,991 for 3Gram compared to 126,310
for SPM). The runtime for 3Gram to complete the first 50 RCV1 topics is also
reduced, from 5,308 to 4,092 seconds. According to our assumption that long
patterns carry more significance than short ones, the removal of a large
number of long patterns in the 3Gram method should be accompanied by a drop in
performance. However, the b/p of 3Gram is only slightly lower than that of
SPM. This indicates that data mining methods can generate a large number of
specific long patterns, but these patterns remain redundant without an
adequate strategy to use them properly. It also implies that our proposed
method PTM provides an effective solution for processing and utilising
discovered patterns.
Figure 6.9: Number of patterns discovered using SPM with different constraints on 10 RCV1 topics.
Figure 6.9 illustrates the effect of the pattern pruning scheme and the
minimum support setting on the number of patterns and on performance. With
minimum support min_sup = 0.2, applying the pattern pruning scheme removes
about one fifth of the patterns, from 36,202 to 28,733 in total. Also, the b/p
score improves by around 4 percentage points, from 0.406 to 0.443. However,
changing the minimum support does not affect the b/p score without the pattern
pruning scheme: despite the great decrease in the number of patterns as
minimum support rises, the b/p score changes only slightly. One possible
explanation is that even though a large number of patterns are removed by the
minimum support setting, a large proportion of the remaining patterns are
still redundant. In contrast, activating pattern pruning not only reduces the
number of discovered patterns but also improves b/p performance. The results
therefore highlight the importance of using a pattern pruning scheme in the
sequential pattern mining algorithm.
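The core test of such a pruning scheme — dropping a pattern that is absorbed by an equally frequent super-sequence — can be sketched as follows. This is an illustrative check, not the thesis's exact pruning algorithm:

```python
def prune_non_closed(patterns):
    """Keep only closed sequential patterns: drop any pattern that is a
    subsequence of a longer pattern with the same support.
    `patterns` maps a term tuple to its support."""
    def is_subsequence(short, long_):
        it = iter(long_)
        return all(t in it for t in short)  # order-preserving containment

    closed = {}
    for p, sup in patterns.items():
        absorbed = any(is_subsequence(p, q) and sup == patterns[q]
                       for q in patterns if q != p and len(q) > len(p))
        if not absorbed:
            closed[p] = sup
    return closed
```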
The average b/p values over the 10 topics are also illustrated in Figure 6.9,
which shows the improvement achieved by using SPM with pruning compared to SPM
without it. As the minimum support increases, the average b/p value decreases
slightly, from 0.409 to 0.406. This means the effect of the patterns pruned by
minimum support alone is not significant, since their supports are relatively
smaller than those of the remaining patterns. The performance is clearly
enhanced by applying the pruning scheme, as the b/p value increases from 0.409
to 0.443; this is because the noise from redundant patterns is reduced once
they are pruned. By using both minimum support and the pruning scheme as
constraints, a significant improvement is therefore achieved.
To compare with other classic IF methods, we implemented the TFIDF method and
the probabilistic (Prob) method, which are described as follows.
TFIDF: Let D be a set of documents. The term frequency TF(d,t) is the number
of times term (word) t occurs in document d (d ∈ D), and the document
frequency DF(t) is the number of documents in which term t occurs at least
once. The inverse document frequency IDF(t) is defined as log(|D|/DF(t)),
which is low if term t occurs in many documents and high if it occurs in
only a few. The weight of a term t can then be represented by its TFIDF
value, calculated as

$$W(t) = TF(d,t) \cdot IDF(t).$$
Prob: The probabilistic method uses a keyword-based algorithm. With this
heuristic, a term t is weighted using the following formula:

$$W(t) = \log\left(\frac{r + 0.5}{R - r + 0.5} \div \frac{n - r + 0.5}{(N - n) - (R - r) + 0.5}\right) \qquad (6.11)$$

where N and R are the total number of documents and the number of positive
documents in the training set respectively; n is the number of documents
which contain t, and r is the number of positive documents which contain t.
Table 6.6 depicts the average precision of the top 20 returned documents on
10 RCV1 topics. It can be seen that PTM outperforms the other methods: the
top-20 score for PTM exceeds those of the TFIDF and Prob methods by around 20
percentage points. The two data mining methods, SPM and SCPM, are also
superior to these classic methods. The strong performance of the data
mining-based methods indicates that the use of phrases (i.e., sequential
patterns) is feasible and applicable compared with the keyword-based TFIDF
and Prob methods.

Topic    TFIDF  Prob   SPM    SCPM   PTM(PDS)
r110     0.15   0.30   0.45   0.65   0.50
r120     0.45   0.30   0.80   0.60   0.65
r130     0.05   0.05   0.10   0.25   0.25
r140     0.35   0.30   0.45   0.10   0.65
r150     0.15   0.01   0.10   0.10   0.20
r160     0.90   1.00   0.95   1.00   1.00
r170     0.30   0.30   0.55   0.60   0.50
r180     0.70   0.70   0.65   0.65   0.65
r190     0.75   0.60   0.80   0.80   0.95
r200     0.20   0.50   0.20   0.40   0.70

top-20   0.400  0.406  0.505  0.515  0.605

Table 6.6: Precisions of top 20 returned documents on 10 RCV1 topics.
However, the computational cost of discovering patterns remains a major
concern for the data mining-based methods, since the keyword-based TFIDF and
Prob methods are well known to be fast and efficient. Another observation is
that SCPM achieves a slightly better result than SPM, which suggests it
benefits from the pattern pruning scheme used in SCPM. With regard to the
correlation between a method's performance and the number of patterns it
discovers, we found no strong relationship between the two factors. According
to the numbers of generated patterns presented in Figure 6.9, four topics
have at least 3,000 patterns (i.e., r110, r140, r150 and r170), and three of
them (all except r140) score at most 0.5 in top-20 for the pattern-based PTM
method, according to the results in Table 6.6. However, this does not mean
that a low pattern count always corresponds to high performance. As we can
see, topic r130 has a low top-20 score and few patterns as well. Hence there
is no evidence of any correlation between the number of patterns and the
performance; the number of patterns is not one of the main factors that
affect the result of a pattern-based method. For efficiency reasons, however,
a method which produces fewer patterns is preferable.
Another observation from Table 6.6 is that the score for SCPM on r140 drops
significantly after pattern pruning (0.10, compared to 0.45 for SPM on the
same topic). This can be explained by the removal of some useful non-closed
patterns which happen to constitute the majority of the specific indicators
for this topic. However, such a severe drop in performance is not common for
SCPM. Generally, according to the positive results in Table 6.5, the pattern
pruning scheme used in SCPM improves b/p performance (0.353 for SCPM compared
to 0.343 for SPM on the first 50 topics). Comparing SCPM to SPM on top-20
scores, a similar result can be found in Table 6.6. Accordingly, non-closed
sequential patterns prove redundant and should be removed in a sequential
pattern-based method.
We have investigated the performance of PTM and found significant results in
both the top-20 and b/p measures. Other measures, such as precision and
recall, are also used for evaluation; the results are illustrated in
Figure 6.10, which compares PTM with the other methods on the precision and
recall curve for RCV1 topic r110. As we can see, PTM performs better than the
other methods at high recall values, while TFIDF has the lowest performance
on this topic. Generally speaking, the data mining-based methods are superior
to the classic methods. All methods produce similar results after the point
where recall equals 0.8, indicating that no method dominates the others in
the high-recall area.

Figure 6.10: Comparison of precision and recall curves for different methods on RCV1 Topic r110.
In summary, we have examined several data mining methods adopted for pattern
discovery in a pattern-based IF system. We have also tested our proposed PTM
model and compared its results with those of the data mining-based methods
and the classic methods. The following findings are observed:
• Data mining approaches can be used for the task of pattern discovery in
the text mining domain. To overcome the problem of the large number of
association rules (patterns) generated when using these approaches, our
strategy is to split the text of a document into several parts based on
paragraphs; these paragraphs can then be treated as transactions and used by
data mining methods. Besides the paragraph, a whole document or a single
sentence could also be defined as a transaction. However, the former
definition causes the above-mentioned problem of a tremendous number of
discovered patterns, especially when the number of documents is vast, while
the latter generates too many short, non-significant patterns due to the
short sentence length. Hence splitting documents by paragraphs is a suitable
and effective approach for applying data mining in the text domain.
• Both closed pattern-based approaches (i.e., SCPM and NSCPM) and non-closed
approaches (i.e., SPM and NSPM) can be adopted by a pattern-based IF system
for pattern discovery. The two kinds of approaches yield similar b/p
performance on the first 50 topics. However, SCPM and NSCPM require much less
runtime than SPM and NSPM, due to the pattern pruning scheme used in the
closed pattern-based methods. In addition, the closed pattern-based
approaches generate fewer patterns, showing that they can efficiently
alleviate the computational cost problem.
• The sequential pattern-based approaches SPM and SCPM require less runtime
than the non-sequential NSPM and NSCPM, which can be explained by the more
efficient candidate generation process adopted by SPM and SCPM. Although SCPM
is more efficient than NSCPM, the former discovers more patterns than the
latter, indicating that candidate generation takes more time than pattern
pruning in the closed pattern-based methods. Therefore, the sequential
pattern-based approaches are more efficient than the non-sequential ones.
• The pattern pruning scheme is important and necessary for a data
mining-based method, given the large number of patterns generated, which is
one of the most serious problems caused by applying these techniques in the
text domain. Pruning not only reduces the number of discovered patterns but
also improves the effectiveness of a pattern-based IF system.
• In order to reduce the runtime of a pattern-based system without affecting
performance, the nGram-based method mines patterns of limited length and
stops discovering patterns once the length of the mined patterns reaches a
pre-specified value. However, although the 3Gram approach is slightly more
efficient than SCPM, it produces more than double the number of discovered
patterns and weakens performance. This implies that the majority of the
redundant 2Terms and 3Terms patterns are non-closed patterns, which causes
the above-mentioned problem. This behaviour supports the importance of
pattern pruning in SCPM. We also tested the 5Gram method and obtained a
similar result.
• The document evaluation method is sensitive to the frequency of patterns:
the weight of a pattern is directly proportional to its frequency in
documents. From the comparison between SPM and 3Gram, it is clear that in SPM
the more specific long patterns, which carry more significant information,
cannot actually improve the system's effectiveness. This is mainly due to the
naturally low frequency of those long patterns, namely the low-frequency
problem, which is one of the main drawbacks of data mining-based approaches.
• The way discovered patterns are used in the data mining-based approaches
has proven inadequate according to our experimental observations. Although
these approaches can discover various types of patterns (i.e., frequent
sequential patterns, frequent itemsets, and frequent closed or non-closed
patterns), how to effectively use the discovered patterns is still a critical
issue. Using pattern support for document evaluation suffers from the
low-frequency problem when patterns are long. Although the data mining
methods are superior to the TFIDF and Prob methods, they do not outperform
keyword-based IR methods such as Rocchio [158]. Therefore, a proper pattern
evaluation method which can solve the low-frequency problem for specific long
patterns is required.
• The experimental results support the superiority of the proposed PTM
method in both effectiveness and efficiency. PTM reduces the number of
patterns by 77% compared to NSCPM, the best of the data mining methods in
this respect, and takes only one third of the runtime of SCPM, the most
efficient data mining method, while achieving a 21% improvement in average
b/p on the first 50 RCV1 topics. PTM also improves top-20 precision by 51%
and 49% over the TFIDF and Prob methods respectively on the 10 RCV1 topics.
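The paragraph-as-transaction strategy from the first finding above can be sketched as follows (an illustrative helper, assuming paragraphs are separated by blank lines and terms are already stemmed):

```python
def to_transactions(document):
    """Split a document into paragraph-level transactions: each non-empty
    paragraph becomes one set of terms, ready for a mining algorithm."""
    return [set(par.split()) for par in document.split("\n\n") if par.strip()]
```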
6.5.2 Experiment on Pattern Deploying
This section presents the experimental results for the pattern deploying
methods, PDM and PDS, proposed in Chapter 4 to address the problem caused by
the inadequate use of patterns discovered by data mining mechanisms. The main
problem is that too many patterns are generated by data mining-based methods
and there is no existing way to use these discovered patterns effectively.
Moreover, the low frequency of specific long patterns is the key factor in
this problem, according to our findings in the previous section. This means
that if we can find a way to exploit the significance provided by these
specific patterns, the effectiveness of the system can be greatly boosted. In
the previous section we presented some results for PDS compared to the data
mining methods, showing that PDS significantly improves both effectiveness
and efficiency. In this section, we focus on comparing the two proposed
deploying methods against all other methods, including Prob, the data mining
(DM) method SCPM, and Rocchio, with intensive examination on all RCV1 topics.
Among these baselines, Prob is a classic probabilistic method that
outperforms TFIDF; SCPM is one of the most effective and efficient data
mining methods; and Rocchio is a well-known IF method. We report the results
in two groups, the first 50 RCV1 topics and the rest of the RCV1 topics, due
to the different ways documents are labeled for evaluation in these two
datasets, as previously mentioned in Section 6.1.
All of the RCV1 topics are used in the experiment for evaluation. The Prob
method is implemented using Equation 6.11 with η = 0.5, and the Rocchio
method follows Equation 6.5 with α = 1 and β = 0. The DM method uses the
SPMining algorithm described in Section 3.1.2, with min_sup set to 0.2.
Details of PDM and PDS are presented in Chapter 4. The same document
preprocessing strategy is adopted by all methods, including word stemming and
stopword removal. For fair comparison, we also use the same set of keywords
in both the keyword-based and pattern-based methods; that is, the same set of
keywords used in Prob and Rocchio is adopted in DM, PDM and PDS for pattern
discovery.
Five implemented methods are briefly described as follows:
• Prob: Keyword-based probabilistic method in Equation 6.11 with η = 0.5.
• DM: Pattern-based data mining method SCPM.
• Rocchio: Keyword-based Rocchio method in Equation 6.5 with α = 1 and
β = 0.
• PDM: Pattern taxonomy model PTM equipped with the pattern deploying
method PDM proposed in Section 4.1.1.
• PDS: Pattern taxonomy model PTM equipped with the PDS method proposed
and described in Section 4.1.2.

         Prob   DM     Rocchio  PDM    PDS
top-20   0.407  0.406  0.416    0.470  0.490
b/p      0.381  0.353  0.392    0.427  0.431
MAP      0.379  0.364  0.391    0.435  0.441
Fβ=1     0.396  0.390  0.408    0.435  0.440
IAP      0.402  0.392  0.418    0.458  0.465

Table 6.7: Results of pattern deploying methods compared with others on the first 50 topics.
The experimental results of all methods on the first 50 topics are shown in
Table 6.7. The proposed method PDS improves performance on all five
evaluation measures compared to the other methods, especially in the top-20
score, meaning that it increases the precision of the first 20 returned
documents. It improves top-20 precision by 20.4% over the Prob method, with
improvements of about 11% to 16% in the b/p, MAP, Fβ=1 and IAP measures as
well. The significant improvement in top-20 precision indicates that PDS
performs well in the low-recall region and is able to rank relevant documents
near the top of the returned list. The results support the superiority of PDS
over the keyword-based Prob and Rocchio methods. Moreover, PDS performs
slightly better than PDM on all measures. This can be explained by the fact
that pattern support is considered and utilised in the PDS method during the
phase of using discovered patterns. This behaviour highlights the importance
of this pattern property, which is omitted in the PDM method; the omission of
pattern support in a pattern-based method can weaken the performance.
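The contrast between the two deployments can be sketched as follows. This is an illustrative simplification, not the exact PDM/PDS algorithms of Chapter 4: `use_support=True` mimics PDS's use of pattern support, while `False` mimics PDM's uniform treatment of patterns:

```python
def deploy(patterns, use_support=True):
    """Deploy discovered patterns onto their component terms.
    With use_support=True each term accumulates the (normalised) supports of
    the patterns containing it (PDS-style); with False every pattern
    contributes equally (PDM-style)."""
    weights = {}
    for pattern, support in patterns.items():
        contrib = support if use_support else 1.0
        for term in pattern:
            weights[term] = weights.get(term, 0.0) + contrib
    return weights
```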
         Prob   DM     Rocchio  PDM    PDS
top-20   0.542  0.540  0.562    0.583  0.576
b/p      0.457  0.434  0.476    0.492  0.498
MAP      0.476  0.456  0.492    0.512  0.513
Fβ=1     0.454  0.445  0.465    0.471  0.473
IAP      0.493  0.479  0.508    0.529  0.531

Table 6.8: Results of pattern deploying methods compared with others on the last 50 topics.
The promising results in Table 6.7 provide empirical evidence for the
superiority of the pattern deploying methods PDM and PDS over the data mining
method DM. They confirm that the evaluation of discovered patterns in DM is
ineffective, and that the pattern deployment strategy employed in PDM and PDS
offers a much better solution for effectively using discovered patterns. As
mentioned in the previous section, the DM method suffers from the main
problem that the useful information hidden in specific long patterns cannot
be fully utilised; this has been identified as the low-frequency problem for
long specific patterns. The significance of a specific pattern can be
extracted and carried by its components and progressively accumulated using
the pattern decomposition function. By deploying patterns, a term occurring
more frequently (i.e., appearing in many patterns) is assigned a higher
importance value. In contrast, a specific pattern cannot obtain such a high
value, since it is difficult to match the same pattern in text, especially
when the pattern contains many terms. This weakness of the DM method
corresponds with its unpromising result in our experiment.
Similar results for the last 50 RCV1 topics are shown in Table 6.8. Again,
both pattern deploying methods, PDM and PDS, outperform the other methods in
all measures. However, the differences in scores between the pattern
deploying methods and the other methods become smaller than those obtained on
the first 50 topics. For instance, PDS improves top-20 precision by 20.7%
over the DM method on the first 50 topics, whereas the improvement on the
last 50 is only 6.7%; similar observations apply to the other measures. This
behaviour can be explained by the different ways in which the two sets of
topics were generated. As mentioned in Section 6.1, the first 50 topics were
manually created by domain experts, whereas the last 50 were collected
automatically according to the category codes tagged in each XML document.
This may also explain why the scores of all measures for all methods on the
last 50 topics are higher than those on the first 50. We expected this
behaviour to be correlated with the number of available relevance examples
for each topic; however, further investigation found no relation between them
that could explain the observation.
Another interesting observation is that the PDM method is slightly better
than the PDS method in top-20 precision, which implies that PDM can place
relevant documents near the front of the ranked document list, but only
within the first few dozen documents. That means the PDM method performs well
only in the low-recall situation compared to PDS, which achieves a higher IAP
score. A similar ability can be found in the DM method: although DM is
inferior to the Prob method, it achieves similar top-20 performance. This
indicates that DM can produce performance comparable to that of the pattern
deploying methods in the low-recall situation.

         Prob   DM     Rocchio  PDM    PDS
top-20   0.475  0.473  0.489    0.527  0.533
b/p      0.419  0.394  0.434    0.460  0.464
MAP      0.427  0.410  0.442    0.473  0.477
Fβ=1     0.425  0.417  0.436    0.453  0.457
IAP      0.447  0.435  0.463    0.493  0.498

Table 6.9: Results of pattern deploying methods compared with others on all topics.

Similar behaviour for the DM method can also be found in the results obtained on the first
50 topics. Therefore, this finding provides evidence that the data mining
method DM has the ability to accurately rank highly relevant documents at the
front of the list.
Table 6.9 provides an overall view of the performance achieved by all methods
on the whole dataset. It confirms the previous finding that the pattern
deploying methods PDM and PDS achieve significant performance: both methods
outperform not only the data mining method DM but also the classic methods
Prob and Rocchio, according to the experimental results on all 100 topics.
Based on their robustness, the methods can be ranked as follows: PDS > PDM >
Rocchio > Prob > DM. It is not surprising that the Prob method is superior to
the DM method, although this is inconsistent with the result published in
[159], which showed that the data mining-based method outperforms the
probabilistic method. This can be explained by the different sets of topics
chosen and examined in the two experiments.
With regard to the pattern deploying methods, PDS is slightly better than
PDM. As mentioned before, this can be attributed to the usage of pattern support
in the PDS method. The pattern support is calculated by normalising the absolute
support of a pattern in a document. By considering the effect of pattern support,
a frequent term can be re-assigned a higher weight to reflect its significance.
This assumption has been demonstrated with a real example in Section 4.1.2. In
the PDM method, a pattern with a high support is treated equally to a pattern
with a low support. Hence, the support of a pattern cannot affect the significance
of the terms contained in it, which is the reason that PDM is inferior to PDS according
to the experimental results.
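The contrast between PDM and PDS described above can be sketched as follows. The exact weighting formulas are defined in Chapter 4; the functions below are a simplified illustration (the names and the uniform split of a pattern's weight across its terms are my assumptions), showing how normalised pattern support lets a high-support pattern push up the weights of its terms:

```python
# Hedged sketch: deploying a document's patterns into term weights,
# contrasting PDM (uniform pattern contribution) with PDS
# (contribution scaled by normalised pattern support).

def deploy_pdm(patterns):
    # patterns: list of (set_of_terms, absolute_support) in one document
    weights = {}
    for terms, _support in patterns:
        for t in terms:
            # every pattern contributes equally, regardless of its support
            weights[t] = weights.get(t, 0.0) + 1.0 / len(terms)
    return weights

def deploy_pds(patterns):
    total = sum(s for _, s in patterns)  # normalise absolute supports
    weights = {}
    for terms, support in patterns:
        for t in terms:
            # contribution scaled by the pattern's normalised support
            weights[t] = weights.get(t, 0.0) + (support / total) / len(terms)
    return weights

doc_patterns = [({"data", "mining"}, 3), ({"mining"}, 5)]
print(deploy_pdm(doc_patterns)["mining"])  # -> 1.5
print(deploy_pds(doc_patterns)["mining"])  # -> 0.8125
```

Under PDS the frequent pattern {"mining"} dominates the weight of its term, whereas PDM gives both patterns the same influence.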
Using PDS, the precision on the top-20 returned documents is improved by
around 9% over the Rocchio method, from 48.9% to 53.3%, as shown in
Table 6.9. On the first 50 topics and the last 50 topics, the PDS method also
increases the figures by 17.8% and 2.5% respectively. More importantly, PDS
uses the smallest number of training patterns compared to the other methods
(except PDM), as shown in Table 6.10. In fact, the number of training patterns
used by PDS is reduced by 72% compared to the number of terms used in
the Rocchio method, which means that the PDS method can improve not only the
effectiveness but also the efficiency of the system. The number of patterns used
by the Prob method is the same as that used by the Rocchio method, since both are
keyword-based methods and use the same set of terms for concept learning and
document evaluation. Another observation is that the DM method produces the
largest number of patterns, because the data mining scheme for pattern
discovery is applied.
We compare the pattern deploying methods PDS and PDM with the other three
methods; Figure 6.11 illustrates the results in terms of precision at the standard
recall points on the first 50 topics. It can be seen that the PDS method yields 0.77
          First 50   Last 50   All
Prob      32,760     37,418    70,178
DM        38,588     39,317    77,905
Rocchio   32,760     37,418    70,178
PDM, PDS   8,027     11,838    19,865
Table 6.10: Accumulated number of patterns found during pattern discovery.
of precision at the first recall point (recall = 0) and 0.65 at the second point
(recall = 0.1). The scores produced by the PDM method at the first few points
are slightly lower than those of the PDS method, with 0.76 and 0.63 at the first
and second points respectively. Comparing these scores to those generated by the
other methods, we find that PDS and PDM are much superior to the Rocchio and Prob
methods, but not so clearly superior to the DM method. It can be seen that the DM method
gives a similar score to the PDS method at the first point. This behaviour
corresponds to the previous finding that a data mining method is able to rank
highly relevant documents as close to the front of the list as possible compared to the
Rocchio and Prob methods. However, this ability is only effective in the low-
recall area, as the curve for the DM method drops rapidly after the first point. As
a whole, the DM method cannot dominate the other methods, but it is a good
indicator of relevance for the top few documents. In addition, the similar
performance at the first recall point for the DM method and the two pattern
deploying methods provides evidence that the DM method by itself, without
the pattern deploying mechanism of the PDS or PDM methods, can achieve
better results than the Rocchio and Prob methods, despite its inferior overall
performance. Such a behaviour is an
Figure 6.11: Comparison of all methods in precision at standard recall points on the first 50 topics.
important advantage obtained by a data mining-based method.
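The standard-recall-point curves discussed here interpolate precision at the 11 recall levels 0.0, 0.1, ..., 1.0, and their average is the IAP score reported in the tables. A minimal sketch of that computation (my own helper, not the thesis code):

```python
def eleven_point_precision(ranked_relevance):
    # ranked_relevance: 0/1 flags for the ranked documents of one topic
    total_rel = sum(ranked_relevance)
    recalls, precisions = [], []
    hits = 0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            recalls.append(hits / total_rel)
            precisions.append(hits / i)
    points = []
    for r in [i / 10 for i in range(11)]:
        # interpolated precision: best precision at any recall >= r
        ps = [p for rec, p in zip(recalls, precisions) if rec >= r]
        points.append(max(ps) if ps else 0.0)
    return points

curve = eleven_point_precision([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
print(curve[0])   # precision at recall = 0 -> 1.0
```

Averaging such curves over all topics gives the plotted figures; averaging the 11 points gives IAP.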
The comparison of the PDS method and the Rocchio method on each topic, in
terms of the difference in Fβ=1, is illustrated in Figure 6.12. It can be observed that the PDS
method outperforms the Rocchio method on the majority of topics. A few
topics show negative results in the figure. Among them, the worst
case is the result for topic 157, probably because there are not
sufficient positive examples for concept learning. Another observation,
corresponding to the previous finding, is that the average results on the first 50
topics are better than those on the last 50 topics. This can be explained by the
different manners in which these two sets of topics were built.
Moreover, we investigate the comparison of top-20 precision
between the PDS and Rocchio methods on each topic. The
Figure 6.12: Comparison of PDS method and Rocchio method in difference of Fβ=1 on all topics.
Figure 6.13: Comparison of the PDS method and the Rocchio method in difference of top-20 precision on all topics.
Figure 6.14: Comparison of all methods in all measures on 100 topics.
results are shown in Figure 6.13, which again confirms the superiority of the PDS
method in top-20 precision on almost all topics. There are many significant
improvements in scores across all topics, indicating that the PDS method is able to
accurately filter out irrelevant documents in the low-recall situation. In order to
gain an overall view, a comprehensive comparison of all methods in all measures
is depicted in Figure 6.14. It can be seen that the pattern deploying methods PDS
and PDM outperform the other five baselines in all evaluation measures. These
promising results for the PDS method support the importance of the pattern
deploying mechanism, which has been proven to be able to overcome
the low-frequency problem pertaining to the DM method and data mining-based
methods.
In summary, we have examined two pattern deploying methods, PDS and
PDM. Both are proposed to provide a proper mechanism for exploiting
patterns discovered using data mining techniques. We have also compared
their results to those of data mining-based methods and term-based methods. In
conclusion, the following findings are observed in this section:
• By using pattern deploying strategies, the experimental results of the PDS
and PDM methods provide evidence that pattern deploying methods
can significantly improve the effectiveness of the information filtering system.
These promising results from the PDS and PDM methods also indicate that
deployment of discovered patterns is a proper way to exploit these patterns
and to solve the low-frequency problem pertaining to the data mining-based
methods.
• The main drawback of data mining-based methods is that too many patterns
are generated by mining algorithms and there is no suitable existing mechanism
to deal with these patterns effectively. With pattern
deploying strategies, the number of patterns is dramatically reduced,
by 75% compared to the data mining method DM. This minimises the
computing complexity and also saves space for storing patterns.
• The low-frequency problem pertaining to the data mining-based methods
has been solved by deploying patterns into a hypothesis space. The
significance of a specific pattern can be extracted, carried by its deployed
components, and progressively accumulated using a pattern decomposition
function. Once the pattern is deployed, a term (i.e., component) with
higher occurrence (i.e., one appearing in many patterns) is assigned a
higher value of importance to support its significance. The feasibility
and effectiveness of such a strategy have been proven by the positive
experimental results in this section.
• The usage of pattern support in the PDS method leads to a noticeable
improvement over the PDM method, which omits this potentially useful
property during pattern deploying. However, even without
considering this property, the PDM method still outperforms the Rocchio,
Prob and DM methods and produces much better results in all measures
on both the first and last 50 topics.
• Despite the low overall performance of the DM method compared to the
other methods, the data mining-based methods can achieve an outcome
similar to the PDS method in precision at the first standard
recall point on the first 50 topics. This implies that the DM method
has a high accuracy of document filtering in the low-recall situation.
• All methods yield higher scores on the last 50 topics than on the first 50
topics. However, the improvement achieved by pattern deploying methods
on the last 50 topics is slightly less than that on the first 50. The
reason is that the manners of generating these two sets of topics are
different: the first 50 topics are classified and judged manually by experts,
whereas the last 50 are generated by the system according to the coded
information in the metadata of each document.
6.5.3 Experiment on Pattern Evolution
This section presents the results of the evaluation of DPE and IPE, the proposed
pattern evolving approaches used in PTM. In the previous section, PTM was
significantly improved by adopting the pattern deploying method PDS, which
uses the strategy of mapping discovered patterns into a feature space in order to
solve the low-frequency problem pertaining to specific long patterns. However,
information from the negative examples has not yet been exploited during concept
learning. In this experiment we test the ability of DPE and IPE to deal with
negative documents.
In order to compare the PTM method with others, we implement several
approaches and divide them into two categories. The first category contains all
data mining-based methods, such as sequential pattern mining, sequential closed
pattern mining, frequent itemset mining and frequent closed itemset mining,
which have been discussed in Section 6.5.1. The other classic IF methods,
including nGram, Rocchio, Probabilistic and TFIDF, are classified into the second
category. Two state-of-the-art models, BM25 and SVM, are also implemented in
this section for comparison purposes. Note that we employ SCPM as the method
for pattern discovery and PDS as the pattern deploying approach for PTM. With
regard to pattern evolution, IPE is chosen due to its promising performance. A
brief description of these methods is given in Table 6.11.
Before we discuss the comparison between PTM and the other baselines,
we first investigate the experimental results of the two proposed pattern evolving
approaches, DPE and IPE. Table 6.12 reports the figures for all evaluation
measures achieved by the pattern evolving methods (DPE, IPE) and the pattern deploying
methods (PDS, PDM) on all RCV1 topics. As we can see from the table, the
individual pattern evolving method IPE outperforms the other methods. These
results provide evidence to support the superiority of IPE, indicating that IPE can
effectively exploit the information provided by negative documents. Moreover,
the results also confirm that the process of pattern evolution should take place
at the pattern level rather than at the term level, as the DPE method does. It can
Method                   Description                                 Algorithm
PTM                      Proposed method equipped with               IPE
                         PDS and IPE                                 Section 5.2.2
Sequential ptns.         Data mining method using frequent           SPM
                         sequential patterns                         Section 3.1.1
Sequential closed ptns.  Data mining method using frequent           SCPM
                         sequential closed patterns                  Section 3.1.1
Freq. itemsets           Data mining method using frequent           NSPM
                         itemsets                                    Section 3.2.2
Freq. closed itemsets    Data mining method using frequent           NSCPM
                         closed itemsets                             Section 3.2.2
nGram                    nGram method with n = 3                     3Gram
                                                                     Section 6.3.2
Rocchio                  Rocchio method                              Equation 6.5
                                                                     α = 1, β = 0
Prob                     Probabilistic method                        Equation 6.11
                                                                     η = 0.5
TFIDF                    TFIDF method                                TFIDF
                                                                     Section 6.5
BM25                     Probabilistic method                        Equation 6.7
                                                                     k1 = 1.2, b = 0.75
SVM                      Support vector machines method              Equation 6.8
                                                                     b = 0
Table 6.11: The list of methods used for evaluation.
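The distinction between the closed and non-closed variants in Table 6.11 (SCPM/NSCPM versus SPM/NSPM) rests on the standard notion of a closed pattern: a frequent pattern with no proper super-pattern of the same support. A toy itemset-level check is sketched below (the sequential case uses subsequence containment instead of set containment; this is an illustration, not the thesis's algorithm):

```python
def closed_only(freq_patterns):
    # freq_patterns: {tuple_of_terms: support}
    # keep a pattern only if no strict super-pattern has equal support
    closed = []
    for p, sup in freq_patterns.items():
        if not any(set(p) < set(q) and sup == s2
                   for q, s2 in freq_patterns.items()):
            closed.append((p, sup))
    return closed

freq = {("data",): 3, ("data", "mining"): 3, ("mining",): 5}
# ("data",) is absorbed by ("data", "mining"), which has the same support
print(closed_only(freq))
```

Filtering to closed patterns is what keeps the pattern counts of the closed variants lower without losing support information.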
be explained by the fact that the informative context of a pattern is preserved by IPE
during pattern evolving, whereas such information is lost in DPE since all patterns
are broken apart during mapping before they are evolved. Another observation is that
PDS performs marginally better than IPE in the score of b/p. After further
investigation, we find that PDS performs very well in b/p on the last
50 topics, leading to a slightly higher average score on all topics. As mentioned
before, the result obtained on the last 50 topics is not as stable as that on the
first 50 topics.
In terms of the coefficient µ used in DPE, only the MAP score is slightly
improved by changes in the value of µ, indicating that
the filtering accuracy cannot be greatly improved by setting a coefficient to
shuffle significance among patterns in a document. This can be explained by the fact that
most of the shuffled patterns are deployed patterns, which means each of them
may represent multiple concepts from its various parent patterns. Unfortunately,
these parent patterns are not all relevant. For instance, a deployed pattern
“mining” can be acquired from the relevant parent pattern “data mining” and
the irrelevant pattern “strip mining” when we consider a topic about “knowledge
discovery”. When we find the irrelevant part and weaken the significance of
the pattern “mining”, the significance of the relevant part represented by the pattern is
reduced as well. This problem even weakens the overall performance of DPE,
leading to its slight inferiority to the pattern deploying method PDS, and it
motivates the proposed IPE method. In IPE, patterns are
evolved and revised at the pattern level rather than at the term level, which means
patterns are modified before they are deployed into a hypothesis space. Using the
aforementioned patterns as an example, if we find that “strip mining” is not relevant to
the topic “knowledge discovery”, this pattern is first weakened individually and
then merged into the space with the unchanged relevant pattern “data mining”.
This ensures that the relevant part of the pattern “mining” is preserved. As a
          PDS      PDM      DPEµ=3   DPEµ=5   DPEµ=7   IPE
top-20    0.5330   0.5265   0.5280   0.5285   0.5275   0.5360
b/p       0.4643   0.4598   0.4507   0.4507   0.4516   0.4632
MAP       0.4768   0.4734   0.4649   0.4652   0.4653   0.4770
Fβ=1      0.4565   0.4528   0.4519   0.4520   0.4520   0.4570
IAP       0.4982   0.4932   0.4861   0.4867   0.4867   0.4994
Table 6.12: Comparison of pattern deploying and pattern evolving methods used by PTM on all topics.
result, the change is applied only to those patterns which are not yet deployed and
which have high specificity. The experimental results support our finding and show
the superiority of IPE.
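The “data mining” / “strip mining” example above can be made concrete with a toy sketch. The weights, the 0.5 weakening factor and the deployment rule are illustrative assumptions, not the thesis's formulas; the point is only the order of operations:

```python
# Toy contrast of DPE (weaken after deploying, at the term level) and
# IPE (weaken the offending pattern before deploying, at the pattern level).

def deploy(patterns):
    # split each pattern's weight evenly among its terms and accumulate
    w = {}
    for terms, weight in patterns:
        for t in terms:
            w[t] = w.get(t, 0.0) + weight / len(terms)
    return w

profile = [(("data", "mining"), 1.0), (("strip", "mining"), 1.0)]

# DPE: deploy first, then weaken the shared term "mining"; the relevant
# contribution from "data mining" is weakened along with the irrelevant one.
dpe = deploy(profile)
dpe["mining"] *= 0.5

# IPE: weaken the offending pattern "strip mining" first, then deploy;
# the contribution of "mining" coming from "data mining" is preserved.
ipe = deploy([(("data", "mining"), 1.0), (("strip", "mining"), 0.5)])

print(dpe["mining"])  # -> 0.5
print(ipe["mining"])  # -> 0.75
```

Evolving before deployment leaves the term “mining” with more of the significance it earned from the relevant parent pattern.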
Another advantage of pattern evolving at the pattern level in IPE is
scalability. The topic concept (i.e., a user profile) needs to be updated once the
original concept drifts, for example when the user changes his/her information needs. The
system should be able to adapt to the new concept by evolving representatives.
To update the concept more precisely, we have to target individual patterns rather
than the whole set of deployed patterns. This can easily be achieved using the IPE
method. Therefore, the IPE method is suitable for concept drifting or adaptive
filtering cases where an accurate updating mechanism is required.
In order to evaluate the effectiveness of DPE, we attempt to find the correlation
between the achieved improvement and a parameter Ratio, denoting the ratio of the
number of negative documents whose relevance is greater than the threshold to the
total number of documents. This value can be obtained using the following equation:

Ratio = |{d | d ∈ D− ∧ relevance(d) > Threshold(D+)}| / (|D+| + |D−|)
where d is a document in the negative dataset D−, relevance(d) is the function that
estimates the degree of relevance of d to the concept of its corresponding topic,
Threshold(D) refers to Equation 5.1, which is used to find the threshold for a set
of documents D, and D+ is the positive dataset.

Figure 6.15: The relationship between the proportion of negative documents whose relevance is greater than the threshold among all documents and the corresponding improvement of DPE with µ = 5 on improved topics.
Figure 6.15 illustrates the relationship between the improvement obtained when DPE
is applied and the above-mentioned value of Ratio. As we can see, the degree of improvement
is in direct proportion to the value of Ratio. That means the more qualified
negative documents are detected for concept revision, the more improvement we
can achieve. In other words, the expected result can be achieved by using the DPE
method.
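Under the definitions above, Ratio can be computed directly from the relevance scores. A small sketch (the threshold from Equation 5.1 is passed in as a plain number here, since its definition lies outside this excerpt; function and argument names are my own):

```python
def ratio(neg_scores, pos_scores, threshold):
    # neg_scores: relevance(d) for each d in D-
    # pos_scores: relevance(d) for each d in D+
    # threshold:  Threshold(D+) computed elsewhere (Equation 5.1)
    offending = sum(1 for s in neg_scores if s > threshold)
    return offending / (len(pos_scores) + len(neg_scores))

# two of the three negatives score above the threshold: 2 / 5
print(ratio([0.1, 0.9, 0.6], [0.5, 0.7], threshold=0.5))  # -> 0.4
```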
The results of the overall comparisons are presented in Table 6.13. We list
only the results obtained on the first 50 RCV1 topics, since not all methods can
complete all tasks on the last 50 topics. As mentioned earlier, itemset-based data
Method                   top-20   b/p     MAP     Fβ=1    IAP
PTM(IPE)                 0.493    0.429   0.441   0.440   0.466
Sequential ptns          0.401    0.343   0.361   0.385   0.384
Sequential closed ptns   0.406    0.353   0.364   0.390   0.392
Freq. itemsets           0.412    0.352   0.361   0.386   0.384
Freq. closed itemsets    0.428    0.346   0.361   0.385   0.387
nGram                    0.401    0.342   0.361   0.386   0.384
Rocchio                  0.416    0.392   0.391   0.408   0.418
Prob                     0.407    0.381   0.379   0.396   0.402
TFIDF                    0.321    0.321   0.322   0.355   0.348
BM25                     0.434    0.399   0.401   0.410   0.422
SVM                      0.447    0.409   0.408   0.421   0.434
Table 6.13: Comparison of all methods on the first 50 topics.
mining methods struggle on some topics, as too many candidates are generated
to be processed. In addition, the results obtained on the first 50 topics are
more practical and reliable, since the judgement for these topics is made manually
by domain experts, whereas the judgement for the last 50 is created based on the
metadata tagged in each document. The most important information revealed in
this table is that our proposed PTM-based IF model outperforms not only the data
mining-based methods, but also the term-based methods, including the state-of-
the-art methods BM25 and SVM.
The number of patterns used for training by each method is shown in
Figure 6.16. The total number of patterns is estimated by accumulating the
number for each topic. The figure shows that PTM is the method that
utilises the fewest patterns for concept learning compared to the others, because
an efficient pattern pruning scheme is applied in the PTM method.
The classic methods such as Rocchio, Prob and TFIDF adopt terms
Figure 6.16: Comparison in the number of patterns used for training by each method on the first 50 topics (r101∼r150) and the rest of the topics (r151∼r200).
as patterns in the feature space and thus use many more patterns than our proposed
PTM method, and slightly fewer than the sequential closed pattern mining method.
In particular, nGram, the method with the lowest performance, requires
more than 17,000 patterns for concept learning. In addition, the total number of
patterns obtained on the first 50 topics is almost the same as the number
obtained on the last 50 topics for all methods except PTM. The figure based
on the first topic group (r101∼r150) for PTM is lower than that based on the other
group (r151∼r200). This can be explained by the high proportion of closed
patterns obtained by PTM on the first topic group.
A further investigation into the comparison of PTM and TFIDF in top-20
precision on all RCV1 topics is depicted in Figure 6.17. It is obvious that PTM is
superior to TFIDF, as positive results are distributed over all topics,
especially the first 50. Another observation is that the scores on the first 50
topics are better than those on the last 50, again because of the different ways
in which these two sets of topics were generated, as mentioned before. An
interesting behaviour is that there are a few topics where TFIDF outperforms PTM.
Figure 6.17: Comparison of PTM(IPE) and TFIDF in top-20 precision.
After further investigation, we found that these topics share a similar characteristic:
only a few positive examples are available. For example,
topic r157, the worst case for PTM compared to TFIDF, has only three
positive documents available, whereas the average number of positive documents
per topic is 12.13. The number of documents for that topic is 42, compared
to 51.27 as the overall average number of documents. Similar behaviour is
found in topics r134 and r144: the former drops 0.25 in top-20 score and
the latter 0.4. It is no surprise that topics r134 and r144 contain only five and six
positive documents respectively.
The plot of precision at the 11 standard recall points for PTM and the data mining
methods on the first 50 RCV1 topics is illustrated in Figure 6.18. The result
supports the superiority of the PTM method and highlights the importance of
adopting proper pattern deploying and pattern evolving methods in a pattern-
based knowledge discovery system. Comparing their performance at the first few
points around the low-recall area, it is also found that the curves for the data mining
methods drop rapidly as the recall value rises and then keep a relatively gradual
slope from the mid-recall region to the end. All four data mining methods achieve
similar results. However, the curve for PTM is much smoother than
those for the data mining methods, with no severe fluctuation. Another
observation in this figure is that the data mining-based methods nevertheless perform
well at the point where recall equals zero, despite their unpromising overall
results. Accordingly, we can conclude that the data mining-based
methods can improve the performance in the low-recall situation. Comparing
their performance with the other methods depicted in Figure 6.19, it is obvious
that, for precision at the first recall point, the data mining-based
methods outperform the Rocchio, Prob and TFIDF methods. This behaviour
explains why SCPM and SPM perform better than Prob and
TFIDF in top-20 precision, as shown in Table 6.6.
Although PTM is equipped with the data mining algorithm for discovering
sequential closed patterns, the promising results could not be produced without the
successful application of the proposed PDS and IPE methodologies.
The proper usage of the PDS method, as proven previously,
can overcome the low-frequency problem and provide a feasible solution
for effectively exploiting the vast number of patterns generated by data mining
algorithms. Moreover, the employment of IPE provides the mechanism to
utilise the information from negative examples to evolve patterns for the purpose
of concept updating. In conclusion, the experimental results provide evidence
that the PTM method is an ideal model for a pattern-based knowledge
discovery system.
Figure 6.18: Comparing PTM(IPE) with data mining methods on the first 50 RCV1 topics.
Figure 6.19 presents the plot of precision at the 11 standard recall points for PTM
and several term-based methods on the first 50 RCV1 topics. Compared to the
previous plot in Figure 6.18, the difference in performance between the methods
is easier to recognise in this figure. Again, the PTM method outperforms
all the other methods, including the nGram, Rocchio, Prob, TFIDF, BM25 and SVM
methods. Among these methods, the nGram method achieves a noticeable
precision score at the first point, where recall equals zero, meaning that the nGram
method is able to promote the top relevant documents toward the front of the ranking
list. As mentioned before, data mining-based methods can perform well in the low-
recall area, which explains why nGram has better results at this point. However,
the scores for the nGram method drop rapidly over the following couple of points.
During that period, the SVM, BM25, Rocchio and Prob methods surpass the nGram
Figure 6.19: Comparing PTM(IPE) with other methods on the first 50 RCV1 topics.
method and maintain their superiority until the last point, where recall equals 1. There
is no doubt that the lowest performance is produced by the TFIDF method, which
outperforms the nGram method only at the last few recall points. In addition,
the Prob method is superior to the nGram method, but inferior to the Rocchio
method. The overall performance of Rocchio is better than that of the Prob method,
which corresponds to the finding in [158].
In summary, both pattern evolving methods, DPE and IPE, are experimentally
evaluated in this section and positive results are obtained. However, the IPE
method does not produce large gains over the pattern deploying method PDS.
The reason is that sufficient information has already been obtained from the positive
examples by PDS, where a large number of patterns have been discovered
and exploited. Hence, the effect of using information from negative
examples in IPE is relatively insignificant.
We have equipped our proposed pattern taxonomy model PTM with IPE and
compared its performance to those of the up-to-date data mining-based methods
and the well-known term-based methods, including the state-of-the-art BM25 and
SVM models. The results show that the PTM model can produce encouraging gains
in effectiveness, in particular over the SVM model. The promising results can
be explained by the fact that the use of pattern taxonomies in PTM combines
the advantages of terms and phrases. Moreover, the pattern deploying strategy
provides an effective means of estimating each term's significance in the
hypothesis space, based not only on the term's statistical properties but also on the
pattern's associations in the pattern taxonomies.
The important findings are summarised as follows:
• Both the DPE and IPE methods attempt to utilise information extracted
from negative examples to improve the performance of the pattern-based
knowledge discovery system PTM. The experimental results show that the
IPE method can achieve this goal by evolving individual patterns once an
offending pattern is detected in a negative example. The DPE method,
however, evolves patterns by shuffling the contribution of significance among all
elements in a deployed pattern, and yields slightly less satisfactory results in
all measures compared to the IPE method. Hence, PTM adopts IPE
for pattern evolving to conduct IF tasks in all the following
experiments.
• The main difference between DPE and IPE is that the former evolves
patterns at the term level, while the latter evolves patterns at the pattern
level before they are deployed. According to the experimental results, IPE
is superior to DPE and hence is suitable for pattern evolution in a pattern-
based knowledge discovery system.
• The performance of the DPE method does not depend on the tuning of the
parameter µ. One possible reason is that the elements of deployed patterns are
mixed up with context from both positive and negative training examples
due to the application of pattern decomposition. This is also the main
motivation for proposing and developing the individual pattern evolution
method, which evolves patterns before they are decomposed and mapped into a
deployed pattern.
• The similar performances achieved by all data mining-based methods, such
as sequential (closed) patterns and frequent (closed) itemsets, provide
evidence that selecting a proper approach to exploit discovered patterns is
more important than choosing a mining method to find different sorts of
patterns.
• The final promising results support the claim that the PTM model,
which implements IPE for pattern evolution, can outperform not only the
data mining-based methods but also the state-of-the-art term-based IR
methods. The PTM model benefits from the use of pattern taxonomies,
which combine the advantages of terms and phrases. The pattern
deploying strategy used by PTM also provides an effective means of
estimating each term's significance in the hypothesis space based on both
the term's statistical properties and the pattern's associations in the pattern
taxonomies.
6.6 Chapter Summary
In this chapter, we have conducted extensive experiments to evaluate the
proposed pattern-based knowledge discovery system PTM with various pattern
deploying approaches and evolution strategies. We briefly describe the existing
data collections and choose the RCV1 corpus as our dataset for evaluation, since
RCV1 is the latest corpus, containing a large number of documents and
relevance judgements. Most of the existing standard evaluation measures are selected to
estimate the system's performance. This is followed by a description of the experimental
procedures for the three main stages. Extensive analysis and discussion of the
experimental results are presented at the end.
In terms of pattern discovery, data mining techniques can be used.
However, the main drawback of using data mining is
the explosion in the number of discovered patterns. Both closed pattern-based
approaches (i.e., SCPM and NSCPM) and non-closed approaches (i.e.,
SPM and NSPM) can be adopted and used in a pattern-based IF system for pattern
discovery. The weight of a pattern is in direct proportion to the pattern's frequency
in documents. With reference to the comparison between SPM and 3Gram, it is
obvious that in SPM the more specific long patterns, which carry more significant
information, cannot improve the system's effectiveness. This is mainly due to
the naturally low frequency of those long patterns, which is
called the low-frequency problem. Such a problem is one of the main drawbacks
of data mining-based approaches. The way discovered patterns are used in the
data mining-based approaches is shown to be inadequate according to our observations
of the experimental results. Although these approaches can discover various types of
patterns (i.e., frequent sequential patterns, frequent itemsets, and frequent closed
or non-closed patterns), how to effectively use these discovered patterns is still a
critical issue.

By using pattern deploying strategies, the experimental results of the PDS
and PDM methods provide evidence that pattern deploying methods can
significantly improve the effectiveness of the information filtering system. The
promising results of the PDS and PDM methods also indicate that deployment
of discovered patterns is a proper way to exploit these patterns and to solve the
low-frequency problem pertaining to the data mining-based methods. The usage
of pattern support in the PDS method leads to a noticeable improvement over
the PDM method, which omits this potentially useful property during
pattern deploying. Despite the lowest overall performance of the DM
method compared to the other methods, the data mining-based method can achieve an
outcome similar to the PDS method in precision at the first standard
recall point on the first 50 topics. This implies that the DM method has a high
accuracy of document filtering in the low-recall situation.
The main difference between DPE and IPE is that the former evolves patterns
at the term level, while the latter evolves patterns at the pattern level before they
are deployed. According to the experimental results, IPE is superior to DPE and
hence is suitable for pattern evolution in a pattern-based knowledge discovery
system. The similar performances achieved by all data mining-based methods,
such as sequential (closed) patterns and frequent (closed) itemsets, provide
evidence that selecting a proper approach to exploit discovered patterns is more
important than choosing a mining method to find different sorts of patterns.
The final promising results support the claim that the PTM model, which
implements IPE for pattern evolution, can outperform not only data mining-based
methods but also the state-of-the-art term-based IR methods.
Chapter 7
Conclusion
In the last decade, many data mining techniques have been proposed for fulfilling
various knowledge discovery tasks. These techniques include association rule
mining, frequent itemset mining, sequential pattern mining, maximum pattern
mining and closed pattern mining. However, using these discovered patterns in
the field of text mining is difficult and often ineffective, because a useful long
pattern with high specificity lacks support. We argue that not all frequent short
patterns are useful, and hence that inadequate use of patterns derived from data
mining techniques leads to ineffective performance. In this thesis, an effective
pattern taxonomy model has been proposed to overcome the aforementioned
problem by deploying discovered patterns into a hypothesis space. In addition,
pattern updating schemes are also investigated in this research.
This thesis presents the research on the concept of developing an effective
knowledge discovery model (PTM) based on pattern taxonomies. PTM is
implemented in three main steps: (1) discovering useful patterns by integrating
a sequential closed pattern mining algorithm with a pruning scheme (Chapter 3);
(2) using the discovered patterns by pattern deploying (Chapter 4); and
(3) adjusting user profiles by applying pattern evolution (Chapter 5). Various
mechanisms in each step are proposed and evaluated for fulfilling the PTM
model. Numerous experiments within an information filtering domain are
conducted. The latest version of the Reuters dataset, RCV1, is selected to test
the proposed PTM-based information filtering system. The results show that
PTM outperforms not only several pure data mining-based methods, but also
traditional probabilistic and Rocchio methods.
Section 7.1 presents the main contributions of this research. Section 7.2
discusses possible directions for future work in this research area.
7.1 Contributions
The contributions made by this research are listed as follows:
Solving data mining problems: We can acquire a vast number of patterns using
data mining techniques for text mining. However, dealing with these
patterns is difficult due to some of their characteristics. A typical problem
is the low support of a specific long pattern. In this thesis, we address this
problem by proposing the PTM model. In PTM, the specificity of a long
pattern is preserved by transforming it into another data format, which can
then be effectively used by a text mining system.
Effective Pattern Taxonomy Model: A complete model is set up for implementing
three main phases of knowledge discovery: (1) discovering useful patterns;
(2) evaluating patterns; and (3) updating the information concept. In the
first phase, the pattern taxonomy model adopts up-to-date data mining
techniques to discover useful patterns and represents the information
concept using pattern taxonomies. In the second phase, pattern deploying
mechanisms are introduced to overcome the low frequency problem caused
by the inadequate use of discovered patterns. In the final phase, concept
updating is achieved by evolving patterns based on the information from
irrelevant document examples.
Novel Application of Current Data Mining Techniques: The pattern taxonomy
model is the first attempt at adopting frequent sequential pattern mining
and closed sequential pattern techniques to implement the knowledge
discovery task. Applying data mining techniques to the text mining domain
is very difficult, since textual data is in an unstructured format and pattern
discovery is time-consuming. The pattern taxonomy model overcomes this
problem and obtains strong results on the information filtering test platform
equipped with the proposed pattern deployment strategies. The related
information and experiments can be found in Chapter 3 and Section 6.5.1
respectively.
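The notion of a closed pattern used here can be sketched as follows (a toy illustration of the standard definition, not the thesis mining algorithm): a frequent sequential pattern is closed if no proper super-pattern has the same support.

```python
# Sketch: filtering closed sequential patterns from a set of frequent patterns
# with known supports (hypothetical toy data, not the thesis implementation).

def is_subsequence(short, long):
    """True if `short` occurs in `long` preserving order (not necessarily contiguous)."""
    it = iter(long)
    return all(term in it for term in short)

def closed_patterns(frequent):
    """frequent: dict {term tuple: support}. Keep only patterns that have no
    proper super-pattern with exactly the same support."""
    closed = {}
    for p, sup in frequent.items():
        absorbed = any(len(q) > len(p) and sup == sup_q and is_subsequence(p, q)
                       for q, sup_q in frequent.items())
        if not absorbed:
            closed[p] = sup
    return closed

frequent = {
    ("global",): 3, ("warming",): 3, ("global", "warming"): 3,
    ("emission",): 2,
}
print(closed_patterns(frequent))
# ("global",) and ("warming",) are absorbed by ("global", "warming")
```

Using closed patterns in this way keeps the most specific pattern of each support class, which is why they serve as compact features in the model.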
Proper Usage of Discovered Patterns: Inadequate use of discovered patterns
leads to the low frequency problem in a data mining-based method.
The Pattern Deploying Method (PDM) and the Pattern Deploying method
based on Supports (PDS) are developed to handle the discovered patterns
properly and provide suitable solutions for using them. Experimental
results show that the new deploying methods achieve significant
improvements over the others. The details can be found in Chapter 4 and
the related experiments are discussed in Section 6.5.2.
Scalable Modification Scheme for Concept Updating: From a text mining point
of view, negative documents may provide useful information for the system.
Hence the capability of handling negative patterns is essential for a pattern
taxonomy-based model. Two concept adjusting schemes are proposed in
this thesis for updating concepts in the knowledge base by pattern evolving.
The first is Deployed Pattern Evolving (DPE), which performs pattern
evolution at the document level. The second is Individual Pattern Evolving
(IPE), which performs pattern evolution at the pattern level. With respect to
information filtering, DPE and IPE can be used to re-evaluate the importance
of conflicting patterns, reducing interference from possibly noisy patterns.
The details of DPE and IPE are described in Chapter 5 and the related
experiments are analysed in Section 6.5.3.
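The evolving idea can be sketched roughly as follows (a deliberately simplified, hypothetical stand-in for DPE/IPE, with illustrative names and data): terms that also occur in negative (irrelevant) documents have their deployed weights scaled down, so conflicting patterns lose influence.

```python
# Sketch of the pattern-evolving idea: down-weight terms that also appear in
# negative (irrelevant) documents. A hypothetical simplification of DPE/IPE,
# not the thesis algorithms.

def evolve(weights, negative_docs, penalty=0.5):
    """weights: {term: weight} from deployed patterns.
    negative_docs: iterable of term lists judged irrelevant by the user.
    Terms seen in any negative document are scaled down by `penalty`."""
    offenders = {term for doc in negative_docs for term in doc}
    return {t: (w * penalty if t in offenders else w) for t, w in weights.items()}

weights = {"carbon": 0.5, "emission": 0.3, "market": 0.2}
negatives = [["market", "share", "price"]]
print(evolve(weights, negatives))
# "market" is scaled down; the other terms keep their weights
```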
Feasible Information Filtering System: An information filtering framework
based on the proposed pattern taxonomy model is established and evaluated
by a series of experiments. Compared to traditional information filtering
methods, the pattern taxonomy model can improve the effectiveness of the
system. It also gains advantages over up-to-date data mining-based methods
such as sequential phrase and frequent itemset-based methods. The
experimental results also verify that the proposed system addresses a
challenging issue for the text mining community, that is, providing effective
methods to overcome the limitations of term-based information filtering
models. Furthermore, the experiments are conducted on all the topics in the
RCV1 dataset, which is the latest benchmark data collection in the area of
text mining [123].
In summary, this research work presents several novel ideas: (1) For pattern
discovery, each paragraph of a document is treated as a unit to enable the
application of sequential pattern mining. (2) A document is expressed as a set of
weighted patterns, which are useful in acquiring the high-level concepts described
by the document. (3) Three levels of features are introduced for the organisation
of information and knowledge: the Term Level, the Pattern Level and the
Document Level. They represent different levels of abstraction extracted from
a document collection, which are useful for more effective retrieval and filtering.
(4) The notions of pattern deploying and pattern evolving are introduced to
represent sensible uses of discovered patterns; they can be exploited to construct
an effective PTM-based information filtering system.
7.2 Future Work
This research work on a pattern taxonomy-based knowledge discovery model is
developed towards applying data mining techniques to practical text mining
applications. In a PTM-based system, the knowledge base is represented by
the discovered pattern taxonomies, which provide many useful features such as
the support and confidence of a pattern, relationships between patterns, the
distribution of pattern taxonomies, and the dimension of these taxonomies.
These features can be used to capture more information for building a
descriptive and comprehensive representation in the knowledge base. In our
model, some features (such as the relationships among patterns and the support
of patterns) have been investigated and evaluated. The remaining features will
be examined in further research work. An initial investigation of using the
length of patterns as a critical factor in a PTM-based Web mining model is
examined by Zhou [168, 169].
Data mining algorithms such as association rule mining and sequential pattern
mining are computationally expensive, and so are pattern taxonomy-based
models, especially during the phase of pattern discovery. An efficient algorithm
for finding useful patterns in a large dataset is essential future work. One
possible way to improve the efficiency of the pattern taxonomy-based model
is to reduce the dimensionality of the feature space in the knowledge base. This
optimisation approach is known as feature selection. However, the tradeoff of
using feature selection is the loss of information in the selected features,
especially when the number of training examples is small. Therefore, an
alternative way of applying length-decreasing support constraints [132] to
frequent pattern mining may help. That is, the minimum supports used for
mining patterns of different lengths could vary. On the one hand, a higher
minimum support can be used for finding short patterns in order to reduce the
number of patterns to be mined. On the other hand, a lower minimum support is
set for longer patterns to prevent the specific information contained in these
patterns from being lost. However, more work is required to build a
constraint-based pattern taxonomy model.
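The length-decreasing constraint suggested above can be sketched as a support threshold that falls with pattern length (the threshold schedule and example values are illustrative assumptions, not the scheme of [132]):

```python
# Sketch of a length-decreasing minimum-support constraint: short patterns
# must clear a high support threshold, longer patterns a lower one.
# The linear schedule and toy patterns are illustrative assumptions.

def min_support(length, high=0.5, low=0.1, max_len=5):
    """Linearly decrease the minimum support from `high` (length 1)
    down to `low` (length >= max_len)."""
    if length >= max_len:
        return low
    return high - (high - low) * (length - 1) / (max_len - 1)

def keep(pattern, support):
    """Retain a pattern only if its support meets the length-dependent threshold."""
    return support >= min_support(len(pattern))

print(keep(("carbon",), 0.3))                        # short pattern, bar is 0.5 -> False
print(keep(("carbon", "emission", "trading"), 0.3))  # longer pattern, bar is 0.3 -> True
```

With such a schedule, a frequent but unspecific single term is pruned while a more specific three-term pattern of the same support survives, which is exactly the asymmetry the paragraph above argues for.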
Appendix A
An Example of a RCV1 Document
<?xml version="1.0" encoding="iso-8859-1"?>
<newsitem itemid="105780" id="root" date="1996-10-09" xml:lang="en">
<title>GERMANY: Court says to rule on VW-GM lawsuit Oct 30.</title>
<headline>Court says to rule on VW-GM lawsuit Oct 30.</headline>
<dateline>FRANKFURT, Germany</dateline>
<text>
<p>A German court said Wednesday it would rule at the end of the month on a charge of defamation brought by automaker Volkswagen AG against General Motors and GM's German subsidiary Adam Opel AG.</p>
<p>Following statements from lawyers for both companies at a hearing at the Frankfurt District Court, Judge Guenther Kinnel closed proceedings and said he would announce the court's ruling on Oct. 30.</p>
<p>VW is demanding 10 million German marks ($6.54 million) in damages for statements made by GM and Opel officials last March when GM filed a claim in the United States accusing VW of industrial espionage.</p>
<p>Wednesday's hearing was the latest development in a three-year series of legal battles between the two car giants.</p>
<p>GM alleges VW production chief Jose Ignacio Lopez de Arriortua and seven other former GM managers stole secrets on purchasing and car production plans when they moved to VW in early 1993.</p>
<p>Frustrated by the lack of progress in almost three years of legal action against VW in Germany, GM said at news conferences held on March 8 in Detroit and Ruesselsheim near Frankfurt it would seek justice in the United States in the espionage case by filing a complaint at a federal district court in Michigan.</p>
<p>VW has since filed a motion to have that case dismissed.</p>
<p>At Wednesday's hearing, VW's lawyers said GM had sought to prejudice public opinion at the news conferences when it said its U.S. complaint accused VW and top officials, including VW head Ferdinand Piech, of "conspiracy, conversion, the misappropriation of trade secrets and racketeering."</p>
<p>The lawyers also accused Opel of seeking to present VW as a criminal organisation in the public eye, for example by distributing a chronology of the three-year saga to the press.</p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0">
<code code="GFR">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-10-09" />
</code>
</codes>
<codes class="bip:topics:1.0">
<code code="C12">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-10-09" />
</code>
<code code="GCRIM">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-10-09" />
</code>
</codes>
<dc element="dc.publisher" value="Reuters Holdings Plc" />
<dc element="dc.date.published" value="1996-10-09" />
<dc element="dc.source" value="Reuters" />
<dc element="dc.creator.location" value="FRANKFURT, Germany" />
<dc element="dc.creator.location.country.name" value="GERMANY" />
<dc element="dc.source" value="Reuters" />
</metadata>
</newsitem>
Appendix B
Topic Codes of TREC RCV1
CODE DESCRIPTION
1POL CURRENT NEWS - POLITICS
2ECO CURRENT NEWS - ECONOMICS
3SPO CURRENT NEWS - SPORT
4GEN CURRENT NEWS - GENERAL
6INS CURRENT NEWS - INSURANCE
7RSK CURRENT NEWS - RISK NEWS
8YDB TEMPORARY
9BNX TEMPORARY
ADS10 CURRENT NEWS - ADVERTISING
BNW14 CURRENT NEWS - BUSINESS NEWS
BRP11 CURRENT NEWS - BRANDS
C11 STRATEGY/PLANS
C12 LEGAL/JUDICIAL
C13 REGULATION/POLICY
C14 SHARE LISTINGS
C15 PERFORMANCE
C151 ACCOUNTS/EARNINGS
C1511 ANNUAL RESULTS
C152 COMMENT/FORECASTS
C16 INSOLVENCY/LIQUIDITY
C17 FUNDING/CAPITAL
C171 SHARE CAPITAL
C172 BONDS/DEBT ISSUES
C173 LOANS/CREDITS
C174 CREDIT RATINGS
C18 OWNERSHIP CHANGES
C181 MERGERS/ACQUISITIONS
C182 ASSET TRANSFERS
C183 PRIVATISATIONS
C21 PRODUCTION/SERVICES
C22 NEW PRODUCTS/SERVICES
C23 RESEARCH/DEVELOPMENT
C24 CAPACITY/FACILITIES
C31 MARKETS/MARKETING
C311 DOMESTIC MARKETS
C312 EXTERNAL MARKETS
C313 MARKET SHARE
C32 ADVERTISING/PROMOTION
C33 CONTRACTS/ORDERS
C331 DEFENCE CONTRACTS
C34 MONOPOLIES/COMPETITION
C41 MANAGEMENT
C411 MANAGEMENT MOVES
C42 LABOUR
CCAT CORPORATE/INDUSTRIAL
E11 ECONOMIC PERFORMANCE
E12 MONETARY/ECONOMIC
E121 MONEY SUPPLY
E13 INFLATION/PRICES
E131 CONSUMER PRICES
E132 WHOLESALE PRICES
E14 CONSUMER FINANCE
E141 PERSONAL INCOME
E142 CONSUMER CREDIT
E143 RETAIL SALES
E21 GOVERNMENT FINANCE
E211 EXPENDITURE/REVENUE
E212 GOVERNMENT BORROWING
E31 OUTPUT/CAPACITY
E311 INDUSTRIAL PRODUCTION
E312 CAPACITY UTILIZATION
E313 INVENTORIES
E41 EMPLOYMENT/LABOUR
E411 UNEMPLOYMENT
E51 TRADE/RESERVES
E511 BALANCE OF PAYMENTS
E512 MERCHANDISE TRADE
E513 RESERVES
E61 HOUSING STARTS
E71 LEADING INDICATORS
ECAT ECONOMICS
ENT12 CURRENT NEWS - ENTERTAINMENT
G11 SOCIAL AFFAIRS
G111 HEALTH/SAFETY
G112 SOCIAL SECURITY
G113 EDUCATION/RESEARCH
G12 INTERNAL POLITICS
G13 INTERNATIONAL RELATIONS
G131 DEFENCE
G14 ENVIRONMENT
G15 EUROPEAN COMMUNITY
G151 EC INTERNAL MARKET
G152 EC CORPORATE POLICY
G153 EC AGRICULTURE POLICY
G154 EC MONETARY/ECONOMIC
G155 EC INSTITUTIONS
G156 EC ENVIRONMENT ISSUES
G157 EC COMPETITION/SUBSIDY
G158 EC EXTERNAL RELATIONS
G159 EC GENERAL
GCAT GOVERNMENT/SOCIAL
GCRIM CRIME, LAW ENFORCEMENT
GDEF DEFENCE
GDIP INTERNATIONAL RELATIONS
GDIS DISASTERS AND ACCIDENTS
GEDU EDUCATION
GENT ARTS, CULTURE, ENTERTAINMENT
GENV ENVIRONMENT AND NATURAL WORLD
GFAS FASHION
GHEA HEALTH
GJOB LABOUR ISSUES
GMIL MILLENNIUM ISSUES
GOBIT OBITUARIES
GODD HUMAN INTEREST
GPOL DOMESTIC POLITICS
GPRO BIOGRAPHIES, PERSONALITIES, PEOPLE
GREL RELIGION
GSCI SCIENCE AND TECHNOLOGY
GSPO SPORTS
GTOUR TRAVEL AND TOURISM
GVIO WAR, CIVIL WAR
GVOTE ELECTIONS
GWEA WEATHER
GWELF WELFARE, SOCIAL SERVICES
M11 EQUITY MARKETS
M12 BOND MARKETS
M13 MONEY MARKETS
M131 INTERBANK MARKETS
M132 FOREX MARKETS
M14 COMMODITY MARKETS
M141 SOFT COMMODITIES
M142 METALS TRADING
M143 ENERGY MARKETS
MCAT MARKETS
MEUR EURO CURRENCY
PRB13 CURRENT NEWS - PRESS RELEASE WIRES
Appendix C
List of Stopwords
a about above according across after afterwards again against albeit all almost
alone along already also although always am among amongst an and another
any anybody anyhow anyone anything anyway anywhere apart are around as
at av be became because become becomes becoming been before beforehand
behind being below beside besides between beyond both but by can cannot canst
certain cf choose contrariwise cos could cu day do does doesn doing dost doth
double down dual during each either else elsewhere enough et etc even ever every
everybody everyone everything everywhere except excepted excepting exception
exclude excluding exclusive far farther farthest few ff first for formerly forth
forward from front further furthermore furthest get go had halves hardly has hast
hath have he hence henceforth her here hereabouts hereafter hereby herein hereto
hereupon hers herself him himself hindmost his hither hitherto how however
howsoever i ie if in inasmuch inc include included including indeed indoors
inside insomuch instead into inward inwards is it its itself just kind kg km last
latter latterly less lest let like little ltd many may maybe me meantime meanwhile
might moreover most mostly more mr mrs ms much must my myself namely
need neither never nevertheless next no nobody none nonetheless noone nope
nor not nothing notwithstanding now nowadays nowhere of off often ok on once
one only onto or other others otherwise ought our ours ourselves out outside
over own per perhaps plenty provide quite rather really reuter reuters round said
sake same sang save saw see seeing seem seemed seeming seems seen seldom
selves sent several shalt she should shown sideways since slept slew slung slunk
smote so some somebody somehow someone something sometime sometimes
somewhat somewhere spake spat spoke spoken sprang sprung stave staves still
such supposing than that the thee their them themselves then thence thenceforth
there thereabout thereabouts thereafter thereby therefore therein thereof thereon
thereto thereupon these they this those thou though thrice through throughout thru
thus thy thyself till to together too toward towards ugh unable under underneath
unless unlike until up upon upward upwards us use used using very via vs
want was we week well were what whatever whatsoever when whence whenever
whensoever where whereabouts whereafter whereas whereat whereby wherefore
wherefrom wherein whereinto whereof whereon wheresoever whereto whereunto
whereupon wherever wherewith whether whew which whichever whichsoever
while whilst whither who whoa whoever whole whom whomever whomsoever
whose whosoever why will wilt with within without worse worst would wow ye
yet year yippee you your yours yourself yourselves
Bibliography
[1] K. Aas and L. Eikvil. Text categorisation: A survey. Technical report, Norwegian Computing Center, Report NR 941, 1999. 2, 20
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, pages 207–216, 1993. 22, 25, 34, 53, 59
[3] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI/MIT Press, 1996. 61
[4] R. Agrawal and J. C. Shafer. Parallel mining of association rules: Design, implementation, and experience. IEEE Transactions on Knowledge and Data Engineering, 8(6):962–969, 1996. 17, 25
[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, pages 478–499, 1994. 24, 61
[6] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of VLDB, pages 487–499, 1994. 61
[7] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering, pages 3–14, Taipei, Taiwan, 1995. 24, 26, 61
[8] H. Ahonen. Finding all maximal frequent sequences in text. In ICML99 Workshop, Machine Learning in Text Data Analysis, 1999. 61, 62
[9] H. Ahonen. Knowledge discovery in documents by extracting frequent word sequences. Library Trends, 48(1):160–181, 1999. 61
[10] H. Ahonen, O. Heinonen, M. Klemettinen, and A. I. Verkamo. Mining in the phrasal frontier. In Proceedings of PKDD, pages 343–350, 1997. 34, 39, 62
[11] H. Ahonen, O. Heinonen, M. Klemettinen, and A. I. Verkamo. Applying data mining techniques for descriptive phrase extraction in digital document collections. In Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries (ADL98), pages 2–11, 1998. 34, 35, 61
[12] H. Ahonen-Myka. Discovery of frequent word sequences in text. In Proceedings of Pattern Detection and Discovery, pages 180–189, 2002. 34, 61
[13] H. Ahonen-Myka, O. Heinonen, M. Klemettinen, and A. I. Verkamo. Finding co-occurring text phrases by combining sequence and frequent set discovery. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI99) Workshop on Text Mining, pages 1–9, 1999. 34, 61
[14] H. Al-Mubaid and S. A. Umair. A new text categorization technique using distributional clustering and learning logic. IEEE Transactions on Knowledge and Data Engineering, 18(9):1156–1165, 2006. 34
[15] J. Allan, J. P. Callan, F. Feng, and D. Malin. INQUERY and TREC-8. In TREC, 1999. 37
[16] G. Amati, D. D. Aloisi, V. Giannini, and F. Ubaldini. A framework for filtering news and managing distributed data. Journal of Universal Computer Science, 3(8):1007–1021, 1997. 37
[17] A. Anghelescu, E. Boros, D. Lewis, V. Menkov, D. Neu, and P. Kantor. Rutgers filtering work at TREC 2002: Adaptive and batch. In TREC, 2002. 37
[18] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu. Sequential pattern mining using a bitmap representation. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 429–435, 2002. 24, 26, 61
[19] P. Baldi, P. Frasconi, and P. Smyth. Modeling the Internet and the Web: Probabilistic Methods and Algorithms. John Wiley, 2003. 131
[20] B. T. Bartell, G. W. Cottrell, and R. K. Belew. Automatic combination of multiple ranked retrieval systems. In Proceedings of SIGIR, pages 173–181, 1994. 34
[21] E. Brill and P. Resnik. A rule-based approach to prepositional phrase attachment disambiguation. In Proceedings of the 15th International Conference on Computational Linguistics (COLING), pages 1198–1204, 1994. 34
[22] C. Brouard. CLIPS at TREC 11: Experiments in filtering. In TREC, 2002. 37
[23] N. Cancedda, N. Cesa-Bianchi, A. Conconi, and C. Gentile. Kernel methods for document filtering. In TREC, 2002. 36, 132
[24] N. Cancedda, E. Gaussier, C. Goutte, and J-M. Renders. Word-sequence kernels. Journal of Machine Learning Research, 3:1059–1082, 2003. 36, 37, 132
[25] M. F. Caropreso, S. Matwin, and F. Sebastiani. Statistical phrases in automated text categorization. Technical report, Istituto di Elaborazione dell'Informazione, Technical Report IEI-B4-07-2000, 2000. 35
[26] J. M. Carroll and P. A. Swatman. Structured-case: A methodological framework for building theory in information system research. In Proceedings of the European Conference on Information Systems, 2000. 7
[27] G. Chen, X. Wu, and X. Zhu. Sequential pattern mining in multiple streams. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM05), pages 585–588, 2005. 25, 26
[28] H. Cheng, X. Yan, and J. Han. IncSpan: Incremental mining of sequential patterns in large database. In Proceedings of KDD, pages 527–532, 2004. 61
[29] D. W. Cheung, J. Han, V. T. Ng, A. W. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In Proceedings of the 4th International Conference on Parallel and Distributed Information Systems, pages 31–42, 1996. 61
[30] K. W. Church. One term or two? In Proceedings of SIGIR, pages 310–318, 1995. 29
[31] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995. 132
[32] W. B. Croft, J. P. Callan, and J. Broglio. TREC-2 routing and ad-hoc retrieval evaluation using the INQUERY system. In TREC, 1993. 37
[33] V. Devedzic. Knowledge discovery and data mining in databases. In S. K. Chang, editor, Handbook of Software Engineering and Knowledge Engineering, volume 1 - Fundamentals, pages 615–637. World Scientific Publishing Co, 2001. 1, 11, 19
[34] S. T. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments, & Computers, 23(2):229–236, 1991. 20
[35] L. Dumitriu. Interactive mining and knowledge reuse for the closed-itemset incremental-mining problem. SIGKDD Explorations, 3(2):28–36, 2002. 26, 62
[36] L. Edda and K. Jorg. Text categorization with support vector machines. How to represent texts in input space? Machine Learning, 46:423–444, 2002. 2
[37] H. P. Edmundson and R. E. Wyllys. Automatic abstracting and indexing - survey and recommendations. Communications of the ACM, 4(5):226–234, 1961. 29
[38] D. A. Evans, J. Shanahan, N. Roma, J. Bennett, V. Sheftel, E. Stoica, J. Montgomery, D. A. Hull, and W. Tembe. Term selection and threshold optimization in IR and SVM filters. In TREC, 2002. 37
[39] W. Fan, M. D. Gordon, and P. Pathak. Personalization of search engine services for effective retrieval and knowledge management. In Proceedings of the 21st International Conference on Information Systems, pages 20–34, 2000. 29
[40] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34, 1996. 12
[41] L. Feng, H. Lu, J. X. Yu, and J. Han. Mining inter-transaction associations with templates. In Proceedings of CIKM, pages 225–233, 1999. 61
[42] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge discovery in databases: An overview. AI Magazine, 13:57–70, 1992. 1, 11, 12
[43] Y. J. Fu. Data mining: Tasks, techniques and applications. IEEE Potentials, 16(4):18–20, 1997. ix, 13, 14, 22
[44] N. Fuhr. Probabilistic models in information retrieval. The Computer Journal, 35(3):243–255, 1992. 34
[45] N. Fuhr and C. Buckley. A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9(3):223–248, 1991. 121
[46] G. P. C. Fung, J. X. Yu, H. Lu, and P. S. Yu. Text classification without negative examples revisit. IEEE Transactions on Knowledge and Data Engineering, 18(1):6–20, 2006. 27
[47] S. S. Ge and Y. Liu. Extensible object oriented reasoning information filtering. In Proceedings of the 2002 IEEE International Symposium on Intelligent Control, pages 827–832, 2002. 17
[48] M. Goebel and L. Gruenwald. A survey of data mining and knowledge discovery software tools. SIGKDD Explorations, 1(1):20–33, 1999. 15
[49] K. Gouda and M. J. Zaki. GenMax: An efficient algorithm for mining maximal frequent itemsets. Data Mining and Knowledge Discovery, 11(3):223–242, 2005. 26, 53, 62
[50] D. A. Grossman and O. Frieder. Information Retrieval Algorithms and Heuristics. Kluwer Academic, 1998. 2, 108, 131
[51] J. Han and K. C-C. Chang. Data mining for web intelligence. Computer, 35(11):64–70, 2002. 24, 26
[52] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of VLDB, pages 420–431, 1995. 61
[53] J. Han and Y. Fu. Mining multiple-level association rules in large databases. IEEE Transactions on Knowledge and Data Engineering, 11(5):798–805, 1999. 17, 22, 23, 61
[54] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. 22, 61, 130
[55] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. In Proceedings of KDD, pages 355–359, 2000. 61
[56] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of ACM-SIGMOD, pages 1–12, 2000. 24, 26
[57] L. Harada. An efficient sliding window algorithm for detection of sequential pattern. In Proceedings of DASFAA, pages 73–80, 2003. 25
[58] W. Hersh, C. Buckley, T. Leone, and D. Hickman. OHSUMED: An interactive retrieval evaluation and new large text collection for research. In Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval, pages 192–201, 1994. 108
[59] Y. Huang and S. Lin. Mining sequential patterns using graph search techniques. In Proceedings of the 27th Annual International Computer Software and Applications Conference (COMPSAC03), pages 4–9, 2003. 24, 26
[60] D. A. Hull, J. O. Pedersen, and H. Schutze. Method combination for document filtering. In Proceedings of SIGIR, pages 279–287, 1996. 35, 36
[61] L. P. Jing, H. K. Huang, and H. B. Shi. Improved feature selection approach TFIDF in text mining. In Proceedings of the First International Conference on Machine Learning and Cybernetics, pages 944–946, 2002. 17
[62] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of ICML, pages 143–151, 1997. 20
[63] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, pages 137–142, 1998. 132
[64] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of ICML, pages 200–209, 1999. 132
[65] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. Inkeri Verkamo. Finding interesting rules from large sets of discovered association rules. In Proceedings of CIKM, pages 401–407, 1994. 61
[66] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM (CACM), 40(3):77–87, 1997. 37
[67] K. Koperski and J. Han. Discovery of spatial association rules in geographic information databases. In Proceedings of SSD, pages 47–66, 1995. 61
[68] R. Kosala and H. Blockeel. Web mining research: A survey. ACM SIGKDD Explorations, 2(1):1–15, 2000. 5
[69] H. Kum, J. H. Chang, and W. Wang. Sequential pattern mining in multi-databases via multiple alignment. Data Mining and Knowledge Discovery, 12(2-3):151–180, 2006. 25
[70] K. L. Kwok, P. Deng, N. Dinstl, and M. Chan. TREC2002 web, novelty and filtering track experiments using PIRCS. In TREC, 2002. 37
[71] W. Lam, M. E. Ruiz, and P. Srinivasan. Automatic text categorization and its application to text retrieval. IEEE Transactions on Knowledge and Data Engineering, 11(6):865–879, 1999. 2, 115
[72] K. Lang. NewsWeeder: Learning to filter netnews. In Proceedings of ICML, pages 331–339, 1995. 37, 108
[73] C. Lanquillon. Evaluating performance indicators for adaptive information filtering. In Proceedings of ICSC, pages 11–20, 1999. 36
[74] C. Lanquillon and I. Renz. Adaptive information filtering: Detecting changes in text streams. In Proceedings of CIKM, pages 538–544, 1999. 36
[75] R. Y. K. Lau, P. Bruza, and D. Song. Belief revision for adaptive information retrieval. In Proceedings of SIGIR, pages 130–137, 2004. 36
[76] C-H. Lee and H-C. Yang. A multilingual text mining approach based on self-organizing maps. Applied Intelligence, 18(3):295–310, 2003. 27
[77] D. D. Lewis. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of SIGIR, pages 37–50, 1992. 2, 5, 33, 34, 35
[78] D. D. Lewis. Feature selection and feature extraction for text categorization. In Speech and Natural Language Workshop, pages 212–217, 1992. 20
[79] D. D. Lewis. Evaluating and optimizing autonomous text classification systems. In Proceedings of SIGIR, pages 246–254, 1995. 115
[80] X. Li and B. Liu. Learning to classify texts using positive and unlabeled data. In Proceedings of IJCAI, pages 587–594, 2003. 20
[81] Y. Li. Extended random sets for knowledge discovery in information systems. In Proceedings of the 9th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, pages 524–532, 2003. 85
[82] Y. Li, X. Z. Chen, and B. R. Yang. Research on web mining-based intelligent search engine. In Proceedings of the First International Conference on Machine Learning and Cybernetics, pages 386–390, 2002. ix, 15
[83] Y. Li, S-T. Wu, and Y. Xu. Deploying association rules on hypothesis spaces. In Proceedings of the International Conference on Computational Intelligence for Modelling Control and Automation (CIMCA04), pages 769–778, 2004. 17, 86
[84] Y. Li and N. Zhong. Interpretations of association rules by granular computing. In Proceedings of the 3rd IEEE International Conference on Data Mining, pages 593–596, 2003. 86
[85] Y. Li and N. Zhong. Capturing evolving patterns for ontology-based web mining. In Proceedings of the International Conference on Web Intelligence (WI04), pages 256–263, 2004. 87
[86] Y. Li and N. Zhong. Mining ontology for automatically acquiring web user information needs. IEEE Transactions on Knowledge and Data Engineering, 18(4):554–568, 2006. 2, 4, 72, 74, 88, 103
[87] M-Y. Lin and S-Y. Lee. Incremental update on sequential patterns in large databases by implicit merging and efficient counting. Information Systems, 29(5):385–404, 2004. 24, 26
[88] T. Y. Lin. Database mining on derived attributes. In Proceedings of Rough Sets and Current Trends in Computing, pages 14–32, 2002. 5
[89] B. Liu, C. W. Chin, and H. T. Ng. Mining topic-specific concepts and definitions on the web. In Proceedings of WWW, pages 251–260, 2003. 26
[90] J. Liu, Y. Pan, K. Wang, and H. Han. Mining frequent item sets by opportunistic projection. In Proceedings of KDD, pages 229–238, 2002. 62
[91] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. 36, 132
[92] H. Lu, L. Feng, and J. Han. Beyond intratransaction association analysis: Mining multidimensional intertransaction association rules. ACM Transactions on Information Systems, 18(4):423–454, 2000. 61
[93] W. Y. Ma and B. S. Manjunath. NeTra: A toolbox for navigating large image databases. ACM Multimedia System, 7:184–198, 1999. 17
[94] A. Maedche. Ontology Learning for the Semantic Web. Kluwer Academic, 2003. 103
[95] H. Mannila and H. Toivonen. Discovering generalized episodes using minimal occurrences. In Proceedings of KDD, pages 146–151, 1996. 34
[96] H. Mannila, H. Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In KDD Workshop, pages 181–192, 1994. 61
[97] H. Mannila, H. Toivonen, and A. Inkeri Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259–289, 1997. 61
[98] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, 1999. 103
[99] C. J. Matheus, P. C. Chan, and G. Piatetsky-Shapiro. Systems for knowledge discovery in databases. IEEE Transactions on Knowledge and Data Engineering, 5:903–913, 1993. 19
[100] D. Mladenic and M. Grobelnik. Word sequences as features in text-learning. In Proceedings of the 17th Electrotechnical and Computer Science Conference (ERK98), pages 145–148, 1998. 34
[101] C. Monz. Contextual inference in computational semantics. In Proceedings of Modeling and Using Context, Second International and Interdisciplinary Conference (CONTEXT99), pages 242–255, 1999. 33
[102] R. J. Mooney and R. C. Bunescu. Mining knowledge from text using information extraction. SIGKDD Explorations, 7(1):3–10, 2005. 20
[103] I. Moulinier, G. Raskinis, and J. Ganascia. Text categorization: A symbolic approach. In Proceedings of the 5th Annual Symposium on Document Analysis and Information Retrieval (SDAIR), 1996. 115
[104] N. Nanas. Towards Nootropia: a Non-Linear Approach to Adaptive Document Filtering. PhD thesis, The Open University, 2003. 31, 32
[105] N. Nanas, V. S. Uren, and A. Roeck. A comparative evaluation of term weighting methods for information filtering. In DEXA Workshops, pages 13–17, 2004. 29
[106] D. W. Oard. The state of the art in text filtering. User Modeling and User-Adapted Interaction, 7(3):141–178, 1997. 36
[107] J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. In Proceedings of SIGMOD, pages 175–186, 1995. 22, 24, 61
[108] J. S. Park, M-S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. In Proceedings of SIGMOD, pages 175–186, 1995. 61
[109] J. S. Park, M-S. Chen, and P. S. Yu. Using a hash-based method with transaction trimming for mining association rules. IEEE Transactions on Knowledge and Data Engineering, 9(5):813–825, 1997. 61
[110] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proceedings of the 7th International Conference on Database Theory (ICDT99), pages 398–416, 1999. 26, 62
[111] J. Pei and J. Han. Can we push more constraints into frequent pattern mining? In Proceedings of KDD, pages 350–354, 2000. 61
[112] J. Pei, J. Han, and L. V. S. Lakshmanan. Pushing convertible constraints in frequent itemset mining. Data Mining and Knowledge Discovery, 8(3):227–252, 2004. 26, 62
[113] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 21–30, 2000. 26, 62
[114] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of ICDE, pages 215–224, 2001. 24, 26, 61
[115] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Transactions on Knowledge and Data Engineering, 16(11):1424–1440, 2004. 24, 26
[116] J. Pei, J. Han, and W. Wang. Mining sequential patterns with constraints in large databases. In Proceedings of CIKM, pages 18–25, 2002. 61
[117] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980. 122, 131
[118] Reuters. Reuters Ltd corpus statistics web page. Available from http://about.reuters.com/researchandstandards/corpus/statistics/index.asp. ix, 111, 112
[119] S. E. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129–146, 1976. 30, 131
[120] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu. Experimentation as a way of life: Okapi at TREC. Information Processing and Management, 36(1):95–108, 2000. 30, 132
[121] S. E. Robertson, S. Walker, H. Zaragoza, and R. Herbrich. Microsoft Cambridge at TREC 2002: Filtering track. In TREC, 2002. 37
[122] J. Rocchio. Relevance Feedback in Information Retrieval, chapter 14, pages 313–323. Prentice-Hall, 1971. 108, 131
[123] T. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus Volume 1 - from yesterday's news to today's language resources. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, pages 29–31, 2002. 108, 109, 178
[124] G. Salton. The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1971. 108
[125] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988. 20, 28
[126] M. Sanderson. Word sense disambiguation and information retrieval. In Proceedings of SIGIR, pages 142–151, 1994. 33
[127] M. Sassano. Virtual examples for text classification with support vector machines. In Proceedings of Empirical Methods in Natural Language Processing, pages 208–215, 2003. 132
[128] A. Savasere, E. Omiecinski, and S. B. Navathe. Mining for strong negative associations in a large database of customer transactions. In Proceedings of ICDE, pages 494–502, 1998. 61
[129] S. Scott and S. Matwin. Feature engineering for text classification. In Proceedings of ICML, pages 379–388, 1999. 2, 5, 108
[130] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002. 2, 5, 20, 35, 36, 115
[131] M. Seno and G. Karypis. SLPMiner: An algorithm for finding frequent sequential patterns using length-decreasing support constraint. In Proceedings of ICDM, pages 418–425, 2002. 24, 26, 61
[132] M. Seno and G. Karypis. Finding frequent patterns using length-decreasing support constraints. Data Mining and Knowledge Discovery, 10(3):197–228, 2005. 24, 26, 180
[133] R. E. Schapire and Y. Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39:135–168, 2000. 2, 115
[134] R. Sharma and S. Raman. Phrase-based text representation for managing the web document. In Proceedings of the International Conference on Information Technology: Computers and Communications (ITCC), pages 165–169, 2003. 34, 35
[135] D. Shen, J. Sun, Q. Yang, H. Zhao, and Z. Chen. Text classification improved through automatically extracted sequences. In Proceedings of ICDE, pages 121–123, 2006. 32, 34
[136] B. D. Sheth. A learning approach to personalized information filtering. Master's thesis, Massachusetts Institute of Technology, 1994. 37
[137] I. Soboroff and S. E. Robertson. Building a filtering test collection for TREC 2002. In Proceedings of SIGIR, pages 243–250, 2003. 113
[138] H. Sorensen, A. O'Riordan, and C. O'Riordan. Profiling with the informer text filtering agent. The Journal of Universal Computer Science, 3(8):988–1006, 1997. 37
[139] K. Sparck Jones. Experiments in relevance weighting of search terms. Information Processing and Management, 15(3):133–144, 1979. 108
[140] K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: development and comparative experiments - part 1. Information Processing and Management, 36(6):779–808, 2000. 30, 132
[141] K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: development and comparative experiments - part 2. Information Processing and Management, 36(6):809–840, 2000. 30, 132
[142] R. Srikant and R. Agrawal. Mining generalized association rules. In Proceedings of VLDB, pages 407–419, 1995. 24
[143] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of EDBT, pages 3–17, 1996. 61
[144] T. Strzalkowski. Robust text processing in automated information retrieval. In Proceedings of the 4th Applied Natural Language Processing Conference (ANLP), pages 168–173, 1994. 33
[145] B. Thuraisingham. A primer for understanding and applying data mining. IEEE IT Professional, 12(1):28–31, 2000. 17
[146] A. K. H. Tung, H. Lu, J. Han, and L. Feng. Breaking the barrier of transactions: Mining inter-transaction association rules. In Proceedings of KDD, pages 297–301, 1999. 61
[147] A. K. H. Tung, H. Lu, J. Han, and L. Feng. Efficient mining of intertransaction association rules. IEEE Transactions on Knowledge and Data Engineering, 15(4):1001–1017, 2003. 17
[148] P. D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303–336, 2000. 34
[149] P. Tzvetkov, X. Yan, and J. Han. TSP: Mining top-k closed sequential patterns. In Proceedings of ICDM, pages 347–354, 2003. 24, 26, 61
[150] S. R. Vasanthakumar, J. P. Callan, and W. B. Croft. Integrating INQUERY with an RDBMS to support text retrieval. IEEE Data Engineering Bulletin, 19(1):24–33, 1996. 37
[151] A. Veloso, M. E. Otey, S. Parthasarathy, and W. Meira Jr. Parallel and distributed frequent itemset mining on dynamic datasets. In Proceedings of HiPC, pages 184–193, 2003. 62
[152] K. Wang, Y. He, and J. Han. Mining frequent itemsets using support constraints. In Proceedings of VLDB, pages 43–52, 2000. 62
[153] K. Wang, Y. He, and J. Han. Pushing support constraints into association rules mining. IEEE Transactions on Knowledge and Data Engineering, 15(3):642–658, 2003. 22
[154] K. Wang and H. Liu. Discovering structural association of semistructured data. IEEE Transactions on Knowledge and Data Engineering, 12:353–371, 2000. 17
[155] D. H. Widyantoro, T. R. Ioerger, and J. Yen. An adaptive algorithm for learning changes in user interests. In Proceedings of CIKM, pages 405–412, 1999. 37
[156] R. C. Wong and A. W. Fu. Mining top-k frequent itemsets from data streams. Data Mining and Knowledge Discovery, 13(2):193–217, 2006. 26, 62
[157] S-T. Wu, Y. Li, and Y. Xu. An effective deploying algorithm for using pattern-taxonomy. In Proceedings of the 7th International Conference on Information Integration and Web-based Applications & Services (iiWAS05), pages 1013–1022, 2005. 3, 85
[158] S-T. Wu, Y. Li, and Y. Xu. Deploying approaches for pattern refinement in text mining. In Proceedings of ICDM, pages 1157–1161, 2006. 3, 66, 145, 169
[159] S-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen. Automatic pattern-taxonomy extraction for web mining. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI04), pages 242–248, 2004. 3, 5, 17, 43, 52, 85, 124, 151
[160] T. W. Yan and H. Garcia-Molina. SIFT - a tool for wide-area information dissemination. In Proceedings of USENIX Winter, pages 177–186, 1995. 37
[161] X. Yan, J. Han, and R. Afshar. CloSpan: mining closed sequential patterns in large datasets. In Proceedings of the SIAM International Conference on Data Mining (SDM03), pages 166–177, 2003. 24, 26, 61
[162] Y. Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1:69–90, 1999. 115
[163] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of SIGIR, pages 42–49, 1999. 132
[164] C-C. Yu and Y-L. Chen. Mining sequential patterns from multidimensional sequence data. IEEE Transactions on Knowledge and Data Engineering, 17(1):136–140, 2005. 25
[165] M. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 40:31–60, 2001. 24, 26, 61
[166] M. J. Zaki and C-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In Proceedings of the 2nd SIAM International Conference on Data Mining, pages 457–473, 2002. 26, 62
[167] S. Zhang, X. Wu, J. Zhang, and C. Zhang. A decremental algorithm for maintaining frequent itemsets in dynamic databases. In Proceedings of the 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK05), pages 305–314, 2005. 26
[168] X. Zhou, Y. Li, P. D. Bruza, S-T. Wu, Y. Xu, and R. Y. K. Lau. Using information filtering in web data mining process. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI07), pages 163–169, 2007. 180
[169] X. Zhou, S-T. Wu, Y. Li, Y. Xu, R. Y. K. Lau, and P. D. Bruza. Utilizing search intent in topic ontology-based user profile for web mining. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI06), pages 558–564, 2006. 180