Knowledge Discovery Using Pattern Taxonomy Model in Text Mining
by
Sheng-Tang Wu
(B.Sc., M.Sc.)
A dissertation submitted for the degree of
Doctor of Philosophy
Faculty of Information Technology
Queensland University of Technology
December 2007
Keywords
Pattern Taxonomy Model, Information Retrieval, Text Mining, Data Mining, Association Rules, Sequential Pattern Mining, Closed Sequential Patterns, Pattern Deploying, Pattern Evolving.
Abstract
In the last decade, many data mining techniques have been proposed for fulfilling
various knowledge discovery tasks in order to achieve the goal of retrieving useful
information for users. Various types of patterns can then be generated using
these techniques, such as sequential patterns, frequent itemsets, and closed and
maximum patterns. However, how to effectively exploit the discovered patterns
is still an open research issue, especially in the domain of text mining. Most
text mining methods adopt a keyword-based approach, constructing text
representations from single words or terms, whereas other methods use
phrases instead of keywords, based on the hypothesis that a phrase carries
more information than a single term. Nevertheless, these phrase-based
methods did not yield significant improvements because high-frequency
patterns (normally the shorter ones) usually have high exhaustivity but low
specificity, while the more specific patterns suffer from low frequency.
This thesis presents research on developing an effective
Pattern Taxonomy Model (PTM) to overcome the aforementioned problem by
deploying discovered patterns into a hypothesis space. PTM is a pattern-based
method which adopts the technique of sequential pattern mining and uses closed
patterns as features in the representation. A PTM-based information filtering
system is implemented and evaluated by a series of experiments on the latest
version of the Reuters dataset, RCV1. The pattern evolution schemes are also
proposed in this thesis in an attempt to utilise information from negative
training examples to update the discovered knowledge. The results show that the
PTM outperforms not only all up-to-date data mining-based methods, but also the
traditional Rocchio and the state-of-the-art BM25 and Support Vector Machines
(SVM) approaches.
Contents
Keywords i
Abstract iii
List of Figures x
List of Tables xii
Statement of Original Authorship xiii
Acknowledgement xv
1 Introduction 1
1.1 Problem Statement 5
1.2 Contributions 6
1.3 Research Methodology 7
1.4 Thesis Outline 8
2 Literature Review 11
2.1 Knowledge Discovery 11
2.1.1 Process of Knowledge Discovery 13
2.1.2 Data Repository 15
2.1.3 Tasks and Challenges 18
2.2 Association Analysis 21
2.2.1 Association Rules Mining 22
2.2.2 Sequential Patterns 24
2.2.3 Frequent Itemsets 25
2.3 Text Mining 27
2.3.1 Feature Selection 27
2.3.2 Term Weighting 29
2.4 Text Representation 32
2.4.1 Keyword-based Representation 32
2.4.2 Phrase-based Representation 33
2.4.3 Other Representation 34
2.5 Information Filtering 35
2.6 Chapter Summary 36
3 Prototype of Pattern Taxonomy Model 39
3.1 Pattern Taxonomy Model 39
3.1.1 Sequential Pattern Mining (SPM) 40
3.1.2 Pattern Pruning 43
3.1.3 Using Discovered Patterns 52
3.2 Finding Non-Sequential Patterns 53
3.2.1 Basic Definition of NSPM 54
3.2.2 NSPM Algorithm 55
3.3 Related Work 59
3.4 Chapter Summary 62
4 Pattern Deploying Methods 65
4.1 Pattern Deploying 66
4.1.1 Pattern Deploying Method (PDM) 71
4.1.2 Pattern Deploying based on Supports (PDS) 79
4.2 Related Work 85
4.3 Chapter Summary 86
5 Evolution of Discovered Patterns 87
5.1 Deployed Pattern Evolution 87
5.1.1 Basic Definition of DPE 88
5.1.2 The Algorithm of DPE 91
5.2 Individual Pattern Evolution 95
5.2.1 Basic Definition of IPE 97
5.2.2 The Algorithm of IPE 101
5.3 Related Work 103
5.4 Chapter Summary 104
6 Experiments and Results 107
6.1 Experimental Dataset 108
6.2 Performance Measures 113
6.3 Evaluation Procedures 117
6.3.1 Document Indexing 121
6.3.2 Procedure of Pattern Discovery 124
6.3.3 Procedure of Pattern Deploying 125
6.3.4 Procedure of Pattern Evolving 127
6.4 Experimental Setting 130
6.5 Experiment Evaluation 131
6.5.1 Experiment on Pattern Discovery Methods 133
6.5.2 Experiment on Pattern Deploying 146
6.5.3 Experiment on Pattern Evolution 158
6.6 Chapter Summary 172
7 Conclusion 175
7.1 Contributions 176
7.2 Future Work 179
Appendices 181
A An Example of a RCV1 Document 181
B Topic Codes of TREC RCV1 185
C List of Stopwords 189
Bibliography 191
List of Figures
1.1 The research cycle. 7
2.1 A typical process of knowledge discovery [43]. 13
2.2 Taxonomy of Web mining techniques [82]. 15
2.3 Bag-of-words representation using word frequency. 32
3.1 An example of pattern taxonomy where patterns in dash boxes are closed patterns. 44
3.2 Illustration of pruning redundant patterns. 46
4.1 Deploying patterns into a term space. 66
4.2 Overlaps between discovered patterns. 68
4.3 Flowchart of pattern deploying methods in Pattern Taxonomy Model. 70
4.4 The process of merging pattern taxonomies into the feature space. 71
5.1 A negative document nd and its offending deployed patterns. 90
5.2 Different levels involved by DPE and IPE in pattern evolution. 95
5.3 The flowchart of two pattern evolving approaches. 97
5.4 Relations between patternset and termset under the topic “Effects of global warming”. 99
6.1 An XML document in RCV1 dataset. 111
6.2 Distribution of words in an RCV1 collection [118]. 112
6.3 Number of paragraphs per document in an RCV1 collection [118]. 112
6.4 An example of topic description. 113
6.5 Process of document indexing. 121
6.6 Primary output of a preprocessed document and found patterns. 123
6.7 Flow chart of experimental procedure for pattern deploying methods PDM and PDS in the pattern taxonomy model PTM. 126
6.8 Flow chart of experimental procedure for pattern evolving methods DPE and IPE in the pattern taxonomy model PTM. 128
6.9 Number of patterns discovered using SPM with different constraints on 10 RCV1 topics. 137
6.10 Comparison of precision and recall curves for different methods on RCV1 Topic r110. 142
6.11 Comparison of all methods in precision at standard recall points on the first 50 topics. 154
6.12 Comparison of PDS method and Rocchio method in difference of Fβ=1 on all topics. 155
6.13 Comparison of the PDS method and the Rocchio method in difference of top-20 precision on all topics. 155
6.14 Comparison of all methods in all measures on 100 topics. 156
6.15 The relationship between the proportion in number of negative documents greater than threshold to all documents and corresponding improvement on DPE with µ = 5 on improved topics. 163
6.16 Comparison in the number of patterns used for training by each method on the first 50 topics (r101∼r150) and the rest of the topics (r151∼r200). 165
6.17 Comparison of PTM(IPE) and TFIDF in top-20 precision. 166
6.18 Comparing PTM(IPE) with data mining methods on the first 50 RCV1 topics. 168
6.19 Comparing PTM(IPE) with other methods on the first 50 RCV1 topics. 169
List of Tables
2.1 Association rules mining algorithms. 26
2.2 Information Filtering models. 37
3.1 Each transaction represents a paragraph in a text document and contains a sequence consisting of an ordered list of words. 42
3.2 All frequent sequential patterns discovered from the sample document (Table 3.1) with min_sup: ξ = 0.5. 42
3.3 Frequent 1Term patterns with min_sup = 0.5. 48
3.4 An example of a p-projected database. 48
3.5 2Terms sequential patterns derived from 1Term patterns. 49
3.6 The assessment of closed pattern of 1Term patterns. 51
3.7 The assessment of closed pattern of 2Terms patterns. 51
3.8 Discovered frequent closed and non-closed sequential patterns. 52
3.9 2Terms candidates generated during non-sequential pattern mining. 57
3.10 3Terms candidates generated during non-sequential pattern mining. 58
3.11 4Terms candidates generated in NSPM. 59
3.12 Frequent non-sequential patterns discovered using NSPM. 60
4.1 Example of a set of positive documents consisting of pattern taxonomies. The number beside each sequential pattern indicates the absolute support of the pattern. 73
4.2 Patterns with their support from the sample database. 80
5.1 Examples of positive documents which are represented by a set of sequential patterns mined using PTM. 88
5.2 Deployed patterns from the document examples. 89
5.3 dp2 and dp3 are replaced by dp6 and deployed patterns are normalised. 89
5.4 The change of term weights in offender dp1 before and after shuffling when µ = 1/2. 94
5.5 Examples of positive documents represented by a set of sequential patterns with frequency. 99
5.6 Normalised patternsets which contain sequential patterns with corresponding weights. 100
5.7 An example of patternset composition. 100
6.1 Current Reuters data collections. 109
6.2 Contingency table. 114
6.3 Number of relevant documents (#r) and total number of documents (#d) by each topic in the RCV1 training dataset. 118
6.4 Number of relevant documents (#r) and total number of documents (#d) by each topic in the RCV1 test dataset. 119
6.5 Comparing PTM with data mining-based methods on RCV1 topics r101 to r150. 134
6.6 Precisions of top 20 returned documents on 10 RCV1 topics. 140
6.7 Results of pattern deploying methods compared with others on the first 50 topics. 148
6.8 Results of pattern deploying methods compared with others on the last 50 topics. 149
6.9 Results of pattern deploying methods compared with others on all topics. 151
6.10 Accumulated number of patterns found during pattern discovering. 153
6.11 The list of methods used for evaluation. 160
6.12 Comparison of pattern deploying and pattern evolving methods used by PTM on all topics. 162
6.13 Comparison of all methods on the first 50 topics. 164
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.
Signed:
Date:
Acknowledgement
Firstly, I would like to express my immense gratitude to Associate Professor
Yuefeng Li, my principal supervisor, for all his guidance and encouragement
throughout this research work. He has always been there, providing
support with his excellent expertise in this area. Many thanks also go to my
associate supervisors, Dr. Yue Xu and Associate Professor Yi-Ping Phoebe Chen
for their generous support and comments on my work during this candidature.
I would also like to thank my examiners for their precious comments and
suggestions.
Special thanks must go to the Faculty of Information Technology, QUT, which
has provided me with a comfortable research environment, the needed facilities,
and financial support including my scholarship and travel allowances over the period
of my candidature. I would especially like to thank all the members of our research
group for offering invaluable advice and comments regarding my research work.
This work would not have been accomplished without the constant support of
my family. I would like to dedicate this thesis to my parents for their never-ending
encouragement over these years.
Last but certainly not least, I would like to thank my wife Vivien and my
parents-in-law for their tremendous support.
Chapter 1
Introduction
Due to the rapid growth of digital data made available in recent years, knowledge
discovery and data mining have attracted great attention with an imminent
need for turning such data into useful information and knowledge. Many
applications, such as market analysis and business management, can benefit from
the use of the information and knowledge extracted from a large amount of data.
Knowledge discovery can be viewed as the process of nontrivial extraction of
information from large databases, information that is implicitly presented in the
data, previously unknown and potentially useful for users [33, 42]. Data mining
is therefore an essential step in the process of knowledge discovery in databases.
In the past decade, a significant number of data mining techniques have been
presented in order to perform different knowledge tasks. These techniques include
association rule mining, frequent itemset mining, sequential pattern mining,
maximum pattern mining and closed pattern mining. Most of them are proposed
for the purpose of developing efficient mining algorithms to find particular
patterns within a reasonable and acceptable time frame. With a large number of
patterns generated by using the data mining approaches, how to effectively exploit
these patterns is still an open research issue. Therefore, in this thesis, we focus on
the development of a knowledge discovery model to effectively use the discovered
patterns and apply it to the field of text mining.
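The mining techniques listed above all rest on the same primitive: counting a pattern's support in a data collection. As a rough illustration (the transactions and threshold here are hypothetical, and the enumeration is deliberately naive), a minimal frequent-itemset miner can be sketched in Python:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Naive frequent-itemset mining: enumerate candidate itemsets and
    keep those whose relative support reaches min_sup."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            # Relative support: fraction of transactions containing the candidate.
            sup = sum(1 for t in transactions if set(cand) <= t) / len(transactions)
            if sup >= min_sup:
                result[cand] = sup
    return result

# Hypothetical transaction database:
transactions = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
print(frequent_itemsets(transactions, min_sup=2/3))
```

Practical algorithms such as Apriori prune this search space instead of enumerating every candidate, but the support computation is the same.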
Text mining is the technique that helps users find useful information from a
large amount of digital text data. It is therefore crucial that a good text mining
model retrieves the information that users require efficiently.
Traditional Information Retrieval (IR) has the same objective of automatically
retrieving as many relevant documents as possible whilst filtering out irrelevant
documents at the same time [50]. However, IR-based systems do not adequately
provide users with what they really need [86]. Many text mining methods have
been developed in order to achieve the goal of retrieving useful information
for users [1, 36, 71, 130, 133]. Most text mining methods use the keyword-
based approaches, whereas others choose the phrase technique to construct a
text representation for a set of documents. It is believed that the phrase-based
approaches should perform better than the keyword-based ones as it is considered
that more information is carried by a phrase than by a single term. Based on
this hypothesis, Lewis [77] conducted several experiments using phrasal indexing
language on a text categorisation task. Ironically, the results showed that the
phrase-based indexing language was not superior to the word-based one.
Although phrases carry less ambiguous and more succinct meanings than
individual words, the likely reasons for the discouraging performance from the
use of phrases are: (1) phrases have inferior statistical properties to words, (2)
they have a low frequency of occurrence, and (3) there are a large number
of redundant and noisy phrases among them [130]. Scott and Matwin [129]
also suggested that simple phrase-based representations are not worth pursuing
since they found no significant performance improvement on eight different
representations based on words, phrases, synonyms and hypernyms. They also
suggested that combining classifiers with alternative representations might
produce more favourable results.
In order to solve the above-mentioned problem, new studies have been
focusing on finding better text representations for a textual data collection. One
solution is to use the data mining techniques, such as sequential pattern mining,
for building up a representation with the new type of features [159]. Such data
mining-based methods adopted the concept of closed sequential patterns and
pruned non-closed patterns from the representation with an attempt to reduce the
size of the feature set by removing noisy patterns. However, treating each
multi-term pattern as an atom in the representation is likely to encounter the
low-frequency problem when dealing with long patterns [157]. Another challenge
for the data mining-based methods is that more time is spent on uncovering
knowledge from the data; consequently less significant improvements are made
compared with information retrieval methods [158].
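The closed-pattern pruning mentioned above can be sketched under the usual definition: a pattern is closed if no proper super-pattern has the same support. The patterns and support counts below are hypothetical, purely for illustration:

```python
def is_subsequence(p, q):
    """True if sequence p occurs in q preserving order
    (not necessarily contiguously)."""
    it = iter(q)
    return all(term in it for term in p)

def closed_patterns(patterns):
    """Keep only closed patterns: those with no proper super-pattern of
    equal support. `patterns` maps tuples of terms to support counts."""
    closed = {}
    for p, sup in patterns.items():
        if not any(len(q) > len(p) and patterns[q] == sup and is_subsequence(p, q)
                   for q in patterns):
            closed[p] = sup
    return closed

# Hypothetical mined sequential patterns with absolute supports:
patterns = {("carbon",): 2, ("emission",): 2, ("carbon", "emission"): 2, ("air",): 1}
print(closed_patterns(patterns))
```

Here the two single-term patterns are pruned because the longer pattern covers them with identical support, shrinking the feature set without losing frequency information.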
The problem caused by data mining-based methods is that the measures (e.g.,
supports and confidences) adopted in the phase of using discovered patterns
are not suitable. For instance, given a specified topic, a highly frequent
pattern (normally the short pattern) usually has a high exhaustivity but a low
specificity, where exhaustivity describes the extent to which the pattern discusses
the topic and specificity describes the extent to which the pattern focuses on
the topic. These measures reveal only the statistical properties of a pattern, but
not its specificity. Therefore, a new evaluation mechanism for patterns of various
lengths is required [158]. Based on this observation, this thesis proposes a novel
method, Pattern Taxonomy Model (PTM) for the purpose of effectively using
discovered patterns. PTM re-evaluates the measures of patterns by deploying
them into a common hypothesis space based on their correlations in the pattern
taxonomies. As a result, patterns with high specificity to the topic can obtain
reasonable and adequate significance values, leading to a significant improvement
in the effectiveness of the system.
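The exhaustivity/specificity trade-off can be seen in a toy example (the paragraphs and terms are hypothetical): a short, general pattern occurs in every paragraph yet says little about the topic, while a more specific multi-term pattern is far less frequent:

```python
def support(pattern, paragraphs):
    """Fraction of paragraphs containing every term of the pattern."""
    return sum(1 for p in paragraphs if set(pattern) <= set(p)) / len(paragraphs)

# Hypothetical paragraphs from documents on the topic "effects of global warming":
paragraphs = [
    ["global", "economy", "growth"],
    ["global", "warming", "effects"],
    ["global", "warming", "sea", "level"],
    ["global", "market"],
]
# The single term is exhaustive (appears everywhere) but unspecific;
# the two-term pattern is specific to the topic but infrequent.
print(support(("global",), paragraphs))
print(support(("global", "warming"), paragraphs))
```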
In addition to pattern deploying, the influence of patterns from the negative
training examples is also investigated in this research work. There is no doubt
that negative documents contain useful information to help identify ambiguous
patterns during the concept learning. A pattern may be a good indicator to
classify relevant documents if this pattern always appears in the positive examples.
However, it becomes ambiguous if the pattern also appears in negative examples
from time to time. Therefore, it is necessary for a system to collect this information
to find ambiguous patterns and to reduce their influence. The process of
refining ambiguous patterns is referred to as pattern evolution, and it is used
for concept refinement in user profile mining. Li and Zhong [86]
proposed a novel approach of pattern evolution and applied it to ontology mining
for automatically acquiring user information needs. However, their work was
developed for a keyword-based system. Hence, in our study we propose
an effective pattern evolution approach for the PTM-based system.
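The idea of spotting ambiguous patterns can be sketched as follows. This is only an illustration: the documents, patterns, and the 0.25 ratio are hypothetical, and the thesis develops its own evolution schemes (DPE and IPE) rather than this simple filter:

```python
def ambiguous_patterns(pos_docs, neg_docs, patterns, max_neg_ratio=0.25):
    """Flag patterns that also occur too often in negative documents.
    A pattern 'occurs' in a document if all of its terms appear there."""
    def occurs(pat, doc):
        return set(pat) <= doc

    flagged = []
    for pat in patterns:
        neg_hits = sum(occurs(pat, d) for d in neg_docs)
        if neg_docs and neg_hits / len(neg_docs) > max_neg_ratio:
            flagged.append(pat)
    return flagged

# Hypothetical positive and negative training documents (as term sets):
pos_docs = [{"climate", "policy"}, {"climate", "change"}]
neg_docs = [{"policy", "election"}, {"sports", "election"}]
print(ambiguous_patterns(pos_docs, neg_docs, [("climate",), ("policy",)]))
```

A flagged pattern would then have its influence reduced rather than being discarded outright, since it still carries some positive evidence.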
In order to evaluate the proposed PTM model, we apply PTM to the practical
information filtering task. Information filtering is a task in which a user with a
specific information need monitors a stream of documents while the system selects
documents from the stream according to a profile of the user’s interests. Filtering
systems process one document at a time and show it to the user if this document
is relevant. The system then adjusts the profile or updates the threshold based on
the user’s feedback. In the case of batch filtering, a number of relevant documents
are returned, whereas a list of ranked documents is given in the case of routing
filtering. In this thesis, we conduct routing filtering to avoid the need of threshold
tuning, which is beyond our research scope. Numerous experiments are performed
on the latest data collection, Reuters Corpus Volume 1 (RCV1), to evaluate the
proposed PTM-based information filtering system. The results show that the
PTM outperforms not only all up-to-date data mining-based methods, but also
the traditional probabilistic and Rocchio methods.
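Routing filtering, as used here, amounts to ranking the incoming documents by their score against the learned profile rather than accepting or rejecting them at a threshold. A minimal sketch, with entirely hypothetical term weights:

```python
def score(doc_terms, profile):
    """Score a document as the sum of profile weights of its terms."""
    return sum(profile.get(t, 0.0) for t in doc_terms)

def route(docs, profile):
    """Routing filtering: return documents ranked by relevance score,
    avoiding any relevance-threshold tuning."""
    return sorted(docs, key=lambda d: score(d, profile), reverse=True)

# Hypothetical profile of term weights and incoming documents (as term sets):
profile = {"pattern": 0.8, "taxonomy": 0.6, "mining": 0.4}
docs = [{"mining", "gold"}, {"pattern", "taxonomy"}, {"weather"}]
print(route(docs, profile))
```

Batch filtering would instead return the set of documents whose score clears a tuned threshold, which is exactly the tuning problem routing avoids.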
1.1 Problem Statement
Most research work in the data mining community has focused on developing
efficient mining algorithms for discovering a variety of patterns from a large
data collection. However, searching for useful and interesting patterns is still
an open problem [88]. In the field of text mining, data mining techniques can be
used to find various text patterns, such as sequential patterns, frequent itemsets,
co-occurring terms and multiple grams, for building up a representation with
these new types of features [159]. Nevertheless, the first problem is how to
effectively deal with the large amount of patterns generated by using the data
mining methods.
Whether using phrases for text representation increases performance on text
categorisation tasks remains in doubt [77, 130], meaning that
no particular representation method holds a dominating advantage over the
others [68, 129]. Instead of the keyword-based approach typically used
by text mining-related tasks in the past, the pattern-based model (single term or
multiple terms) is employed to perform the same kind of task. There are two
phases that we need to consider when we use pattern-based models in text mining:
one is how to discover useful patterns from digital text documents, and the other
is how to utilise these mined patterns to improve the system’s performance.
1.2 Contributions
In this thesis a new knowledge discovery model is proposed with an attempt to
effectively exploit the discovered patterns in a large data collection using data
mining approaches. This model uses pattern taxonomies as features to represent
knowledge based on the state-of-the-art data mining techniques such as sequential
pattern mining and closed pattern mining. In order to overcome the problem in the
phase of using discovered patterns, the PTM model is extended to be effective by
using the strategy of pattern deploying. Two deploying mechanisms are proposed
to enhance the effectiveness of the PTM. Furthermore, the PTM is equipped with
pattern evolution approaches to be able to deal with the negative examples during
the profile learning. The contributions are summarised as follows:
• A knowledge discovery model based on pattern taxonomies is proposed.
• The state-of-the-art data mining techniques are used in the PTM including
sequential pattern mining and closed sequential pattern mining.
• Pattern deploying strategies are provided to increase the effectiveness of the
PTM and to solve the low precision problem.
• A scalable PTM is developed with the capability of concept adjustment by
means of evolving mined patterns.
Figure 1.1: The research cycle.
• Experimental evaluations are conducted and the results demonstrate the
feasibility and effectiveness of the proposed PTM.
1.3 Research Methodology
There has been an increase in the range of research approaches that are acceptable
for knowledge discovery research during the last decade. These methods include
case studies, field studies, action research, prototyping, and experimenting [26].
As this research focuses on the development of robust mechanisms for a
knowledge discovery system, these mechanisms and the proposed theories have to
be validated by the classic scientific method of experimentation. Hence, the
experimental approach, integrated with cycles of research, is chosen as the research method. The
process of the research approach used in this research is illustrated in Figure 1.1.
1.4 Thesis Outline
The rest of this thesis is summarised as follows:
Chapter 2: This chapter is a literature review of related disciplines including
data mining, text mining, knowledge representation models and information
filtering. It surveys current work on data mining and identifies the
drawbacks of existing representation schemes.
Chapter 3: This chapter provides the definition of sequential pattern and the
proposed algorithms of mining frequent sequential patterns from a textual
data collection. This chapter also presents a novel representation scheme
that makes use of the discovered pattern taxonomies. The relevant
publications about this chapter are:
- S-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, Automatic
Pattern-Taxonomy Extraction for Web Mining, Proceedings of the
IEEE/WIC/ACM International Conference on Web Intelligence (WI
2004), pages 242–248, 2004.
- S-T. Wu, Knowledge Discovery from Digital Text Documents,
Proceedings of the 4th International Conference on Active Media
Technology (AMT 2006), pages 446–447, 2006.
- X. Zhou, S-T. Wu, Y. Li, Y. Xu, R. Y. K. Lau, and P. D. Bruza,
Utilizing Search Intent in Topic Ontology-Based User Profile for Web
Mining, Proceedings of the IEEE/WIC/ACM International Conference
on Web Intelligence (WI 2006), pages 558–564, 2006.
- R. Y. K. Lau, Y. Li, S-T. Wu, and X. Zhou, Sequential Pattern
Mining and Nonmonotonic Reasoning for Intelligent Information
Agents, International Journal of Pattern Recognition and Artificial
Intelligence, 21(4):773–789, 2007.
- X. Zhou, Y. Li, P. D. Bruza, S-T. Wu, Y. Xu, and R. Y. K.
Lau, Using Information Filtering in Web Data Mining Process, The
IEEE/WIC/ACM International Conference on Web Intelligence (WI
2007), pages 163–169, 2007.
Chapter 4: This chapter describes the extension to the model presented in
chapter 3 and discusses the problem caused by the inadequate use of mined
patterns in a pattern-based model. The strategy of deploying discovered
patterns is adopted. Two effective and feasible solutions are proposed in
this chapter to address the problem. The relevant publications are:
- Y. Li, S-T. Wu, and Y. Xu, Deploying Association Rules on
Hypothesis Spaces, Proceedings of International Conference on
Computational Intelligence for Modelling Control and Automation
(CIMCA 2004), pages 769–778, 2004.
- S-T. Wu, Y. Li and Y. Xu, An Effective Deploying Algorithm for using
Pattern-Taxonomy, Proceedings of the 7th International Conference
on Information Integration and Web-based Applications & Services
(iiWAS 2005), pages 1013–1022, 2005.
- S-T. Wu, Y. Li and Y. Xu, Deploying Approaches for Pattern
Refinement in Text Mining, Proceedings of the 6th IEEE International
Conference on Data Mining (ICDM 2006), pages 1157–1161, 2006.
Chapter 5: This chapter presents mechanisms for pattern updating including the
evolution of both deployed patterns and individual patterns. The proposed
algorithms of these evolutions are offered in this chapter.
Chapter 6: This chapter gives the description of benchmark datasets and
performance measures, along with the application of the proposed pattern
taxonomy model to the information filtering. A detailed analysis of the
comparison results of experiments is also presented in this chapter.
Chapter 7: This chapter concludes the thesis and outlines directions for future
work.
Chapter 2
Literature Review
This chapter provides a literature review containing a wide range of knowledge
discovery and text mining topics that relate to this research work and provide the
needed conceptual framework for the development of the proposed model.
2.1 Knowledge Discovery
Knowledge discovery is the process of nontrivial extraction of information from
large databases, information that is implicitly present in the data, previously
unknown and potentially useful for users [33, 42]. The knowledge discovery
can be defined as follows [42]: Given a set of facts (data) F , a language L,
and some measure of certainty C, a pattern is a statement S in L that describes
relationships among a subset Fs of F with a certainty c, such that S is simpler than
the enumeration of all facts in Fs. A pattern is called knowledge if it is interesting
and certain enough, according to the user’s imposed interestingness measures and
criteria. Discovered knowledge is the output of a system that extracts patterns
from the set of facts in a database.
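The definition above can be read operationally: a pattern counts as knowledge only if it clears the user-imposed certainty and interestingness criteria. A hypothetical sketch (the patterns, threshold, and interestingness test are invented for illustration):

```python
def discovered_knowledge(patterns, min_certainty, is_interesting):
    """Keep only patterns certain and interesting enough to count as
    knowledge, per the definition above (criteria are user-imposed)."""
    return [p for p in patterns
            if p["certainty"] >= min_certainty and is_interesting(p)]

# Hypothetical patterns, each a statement with a certainty measure:
patterns = [
    {"statement": "age>35 -> buys plasma TV", "certainty": 0.7},
    {"statement": "id is unique", "certainty": 1.0},  # certain but trivial
]
knowledge = discovered_knowledge(
    patterns, 0.6,
    is_interesting=lambda p: "unique" not in p["statement"])
print(len(knowledge))
```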
The term pattern in the above definition is an expression in some language
describing a subset of the data [40]. For example, a pattern in a high-level
language can be expressed as:
If Age > 35 and Salary > 70K
Then buy(“Plasma TV”)
With Likelihood(0.6...0.8).
The above pattern can be understood by people and used directly by some
knowledge discovery system (e.g., expert system). In different communities,
finding useful patterns in data is represented by different names including data
mining, knowledge extraction, information harvesting and data archaeology.
Representing the degree of certainty is essential to determining how much
faith the system or user should put into a discovery [42]. Certainty is affected
by several factors, such as the size of the sample, the integrity of the data and the
support from domain knowledge. Patterns cannot be considered knowledge with
insufficient certainty. The discovered patterns also must be valid, novel and
potentially useful for the users to meet their information needs.
Numerous patterns may be discovered from a database, but
not all of them are interesting. Only those evaluated to be interesting in some
manner are viewed as useful knowledge. This depends on the assumed frame of
reference, defined either by the system itself or by the user’s knowledge. A system
may encounter a problem where a discovered pattern is not interesting to a user. Such
patterns are not qualified as knowledge. Therefore, a knowledge discovery system
should have the capability of deciding whether a pattern is interesting enough to
form knowledge in the current context.
In summary, knowledge discovery has to exhibit the following characteristics [42]:
Figure 2.1: A typical process of knowledge discovery [43].
- Interestingness: Discovered knowledge is interesting based on the
implication that patterns should be novel and potentially useful, and the
process of knowledge discovery must be nontrivial.
- Accuracy: Discovered patterns should accurately depict the contents of the
data. The extent to which the depiction is imperfect is expressed by measures
of certainty.
- Efficiency: The process of knowledge discovery is efficient, especially for
large data sources. An algorithm is considered efficient if the run time is
acceptable and predictable.
- Understandability: A high-level language is required for expressing
discovered knowledge. The expression must be understandable by users.
2.1.1 Process of Knowledge Discovery
As shown in Figure 2.1, the steps of knowledge discovery may consist of
the following: data selection, data preprocessing, data transformation, pattern
discovery and pattern evaluation [43]. These steps are briefly described as follows:
Data selection: This process includes generating a target dataset and selecting a
dataset or a subset of large data sources where discovery is to be performed.
The input of this process is a database and the output is a target dataset. For
example, among various data sources on the World Wide Web, we may
collect newswire-related Web pages for Web content mining tasks.
Pre-processing: This process involves data cleaning and noise removing. It also
includes collecting required information from selected data fields, providing
appropriate strategies for dealing with missing data and accounting for
redundant data. In the case of Web pages, non-textual data such as tags,
CSS codes, hyperlinks, pictures and metadata need to be removed for Web
content mining.
Transformation: The preprocessed data needs to be transformed into a
predefined format, depending on the data mining task. This process needs
to select an adequate type of features to represent data. In addition, feature
selection can be used at this stage for dimensionality reduction. At the end
of this process, a set of features is recognised as a dataset.
Data mining: Data mining is a specific activity that is conducted over the
transformed data in order to discover patterns. Based on user requirements,
the discovered patterns can be pairs of features from the given dataset, a set
of ordered features occurring together, or a maximum set of features.
Evaluating: The discovered patterns are evaluated to determine whether they are
valid, novel and potentially useful for the users to meet their information needs. Only those
Figure 2.2: Taxonomy of Web mining techniques [82].
evaluated to be interesting in some manner are viewed as useful knowledge.
This process should decide whether a pattern is interesting enough to form
knowledge in the current context.
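The five steps above can be sketched as a minimal pipeline; the stage bodies below (the newswire filter, the word-frequency threshold) are illustrative placeholders, not implementations prescribed by the text.

```python
# The five steps of Figure 2.1 as a minimal pipeline sketch; each
# stage mirrors the description above with an illustrative body.

def select(source):          # data selection: pick a target subset
    return [rec for rec in source if rec.get("topic") == "newswire"]

def preprocess(records):     # cleaning: drop records with missing text
    return [r for r in records if r.get("text")]

def transform(records):      # transformation: map each record to features
    return [set(r["text"].lower().split()) for r in records]

def mine(featuresets, min_count=2):  # data mining: frequent features
    counts = {}
    for fs in featuresets:
        for f in fs:
            counts[f] = counts.get(f, 0) + 1
    return {f for f, c in counts.items() if c >= min_count}

def evaluate(patterns):      # evaluation: keep the interesting ones
    return sorted(patterns)

source = [
    {"topic": "newswire", "text": "markets rally on trade news"},
    {"topic": "sports", "text": "final score"},
    {"topic": "newswire", "text": "trade talks stall"},
    {"topic": "newswire", "text": None},
]
print(evaluate(mine(transform(preprocess(select(source))))))  # ['trade']
```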
2.1.2 Data Repository
Knowledge Discovery in Databases (KDD) is a term often used interchangeably with
data mining, which aims at discovering interesting patterns or trends from a
database. In particular, KDD denotes the process of turning low-level data into
high-level knowledge [48], with data mining as the core step that extracts
patterns from the data. Therefore, knowledge discovery and data mining
should be applicable to any kind of data repository. There are many different data
stores where mining can be applied, including relational databases, transactional
databases, text and multimedia databases, and the World Wide Web. Following
are the brief descriptions of these repositories.
Relational Databases: A relational database consists of tables, each of which
has a unique name and contains a set of attributes (i.e., columns). A set
of records (i.e., tuples) is stored in a table. Each record represents an
object with a set of attribute values and is assigned a unique key. Data
in a relational database can be accessed by relational queries (e.g., SQL).
Data mining techniques can be applied to relational databases for pattern
discovery or trends detection. Relational databases are one of the most
common information repositories in the domain of knowledge discovery
and data mining.
Transactional Databases: A transactional database is generally a collection of
purchase records and is commonly used for the analysis of market basket.
Each record contains a unique identity number and a list of purchased items.
A transactional database usually consists of a large number of records.
Nevertheless, data mining systems can easily identify which items are sold
together and find the relationships between item types and certain groups
of customers.
Spatial and Temporal Databases: A spatial database contains images or maps
which are used for urban renewal or public service planning. Data
mining is able to find patterns in a spatial database that describe relationships
between objects on the images or maps. A temporal database is a time-series
database which contains time-related attributes within relational data.
Analysing a temporal database using data mining techniques can uncover the
trend of change for objects. It also provides useful information for decision
making and strategy planning.
Text Databases: A text database consists of text data which is used for describing
objects. Such a database has three types of data structure: structured (e.g.
relational database with values in text format), semi-structured (e.g., XML
documents), and unstructured (e.g., Web pages). Associations of terms in
text can be discovered by applying data mining techniques to text databases.
For further analysis, they may need to integrate with some techniques from
other fields, such as information retrieval.
The World Wide Web: The World Wide Web provides rich information on an
extremely large number of linked Web pages. Such a repository contains not
only text data but also multimedia objects, such as images, audio and video
clips. Data mining on the World Wide Web can be referred to as Web mining
which has gained much attention with the rapid growth in the amount of
information available on the internet. Web mining is classified into several
categories, including Web content mining, Web usage mining and Web
structure mining. A taxonomy of Web mining techniques is illustrated in
Figure 2.2.
The field of knowledge discovery has developed gradually since the late 1980s.
Recent research trends in knowledge discovery address the following
issues:
• Mining association rules efficiently [4, 53, 147].
• Mining object-oriented databases [47, 154].
• Mining multimedia data [93].
• Mining distributed and heterogeneous databases [145].
• Text mining [61].
• Knowledge discovery in semi-structured (Web) data [83, 159].
In recent years, the last issue has attracted much attention due to the rapid growth of
online data, which has created an immense need for data mining. Web users
now expect more sensible and rational knowledge discovery systems to help them
retrieve relevant information.
2.1.3 Tasks and Challenges
Knowledge discovery tasks depend on which functionalities the knowledge
discovery system performs and which kinds of patterns the system looks for.
Different functionalities are developed for achieving different tasks. However,
some goals with particular results need to be reached by using the combination of
several KDD methods. The main KDD tasks can be classified into the following
categories:
• Classification: Classification is the process of assigning data objects to
desired predefined categories or classes. It also can be viewed as the process
of finding a proper method to distinguish data classes or concepts. Objects
without a class label are then classified using this method. Generally,
training data is required for concept learning before classification can
proceed.
• Clustering: Given a set of data objects, clustering is the task of dividing
the set of objects into a number of groups such that the objects in the
same group have similar characteristics. In other words, clustering aims
for maximising the intra-class similarity and minimising the inter-class
similarity. The major difference between classification and clustering is
that the latter analyses objects without consulting class labels, whereas the
former needs such information to begin with.
• Summarisation: This is the task of analysing data objects and finding
their common characteristics for generating summarisation rules. A set of
compact patterns that represent the concept of these objects is extracted. For
instance, a summarisation rule can be a description like “emission of carbon
dioxide CO2 is the main factor causing global warming”.
• Change and Deviation Detection: Such a task involves the discovery of
changes and deviation of specific values in data objects (e.g., the change
in time-series data, protein sequencing in a genome, and the difference
between expected values in ordering data objects).
• Mining Association Rules: Associations are rules that describe the
frequency and certainty of two groups of data values. This task usually
is applied to a transactional database. It discovers the implication between
antecedent and consequent, both of which represent sets of items in the
transactions. For example, an association rule can be “70% of customers
who purchase bread also purchase milk”.
Data mining is the process of pattern discovery in a dataset from which noise
has been previously eliminated and which has been transformed in such a way as to
enable the pattern discovery process. Although knowledge discovery covers a range
of related concepts, activities and processes, the most challenging of these is
data mining [33].
Matheus [99] described the context and computational resources needed to
perform knowledge discovery. There must exist an application through which the
user can select, start and run the main process and access the discovered patterns.
Knowledge discovery methods often make it possible to use domain knowledge to
guide and control the process and help evaluate the patterns. In such cases, domain
knowledge must be represented using an appropriate knowledge representation
technique such as taxonomies, rules, decision trees and so on.
The main process of text-related machine learning tasks is document indexing,
which maps a document into a feature space representing the semantics of the
document. Many types of text representations have been proposed in the past. A
well-known one is the bag-of-words that uses words as elements in the vector of
the feature space. There are two types of representations used in the bag-of-words
approach: binary representation and term-weighted representation.
In [80], the Term Frequency times Inverse Document Frequency (TFIDF)
weighting scheme is used for text representation in Rocchio classifiers. In
addition to TFIDF, the global IDF and entropy weighting scheme is proposed by
Dumais [34] and improves performance by an average of 30%. Various weighting
schemes for the bag-of-words representation approach are given in [1, 62, 125].
The problem of the bag-of-words approach is how to select a limited number of
features among an enormous set of words or terms in order to increase the system’s
efficiency and avoid “overfitting” [130]. In order to reduce the number of features,
many dimensionality reduction approaches have been developed using feature
selection techniques, such as Information Gain, Mutual Information, Chi-Square,
Odds Ratio, and so on. Details of these selection functions are stated
in [78, 130].
Information extraction is used to transform unstructured data in the document
corpus into a structured database and traditional data mining methods are applied
to identify useful patterns in this extracted data [102].
2.2 Association Analysis
Association rules are interesting patterns that are discovered from a given dataset.
They are generally discovered with various data mining techniques. The earliest
form of association rule mining is market basket analysis, which searches for
interesting relationships between shoppers and the items they buy.
A data mining process may still retrieve a large number of “thought-to-be”
interesting patterns even though it has specified the relevant tasks and the type of
knowledge to be mined. Generally, only a small portion of these mined patterns
is actually of interest to the users. Thus, it is essential to further confine the
set of mined patterns in an attempt to improve the effectiveness of the system,
which can be achieved by measuring the usefulness of patterns in terms of their
simplicity, certainty, utility and novelty.
Two common measures of rule interestingness or usefulness are rule support
and confidence. Rule support is estimated by a utility function in order to define
the usefulness of a mined pattern. It is calculated as the percentage of
task-relevant data transactions for which the pattern is recognised as true. Confidence,
on the other hand, reflects the certainty or validity of the mined patterns. Given
itemsets A and B in a set of transactions D, the rule A ⇒ B holds in the
transaction set D with support s, where s is the percentage of transactions in
D that contain A∪B. It can be viewed as probability P (A∪B). The rule A⇒ B
has confidence c in the transaction set D if c is the percentage of transactions in
D containing A that also contain B. It is the conditional probability P (B|A).
The support and confidence of the rule A⇒ B can be expressed as the following
equations [54].
support(A⇒ B) = P (A ∪B) (2.1)
confidence(A⇒ B) = P (B|A) (2.2)
Generally, if association rules meet both a minimum support threshold and
a minimum confidence threshold, both of which can be set by users or domain
experts, the association rules are considered interesting and useful. Market basket
analysis, as mentioned earlier, is just one form of association rule mining. There
are various kinds of association rules that can be classified based on different
criteria, such as the types of values handled in the rule, the dimensions of data
involved, the levels of abstractions involved, and various extensions to association
mining. All these variations will be discussed in the later subsections.
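Support and confidence as defined in Equations (2.1) and (2.2) can be computed directly from a transaction set; a minimal Python sketch, with an illustrative toy basket:

```python
# Compute support and confidence for the rule A => B over a set of
# transactions, following Equations (2.1) and (2.2).

def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(B|A): support of A union B divided by support of A."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))        # 2/4 = 0.5
print(confidence(transactions, {"bread"}, {"milk"}))   # (2/4)/(3/4) = 0.667
```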
2.2.1 Association Rules Mining
Association rules mining, first studied in [2] for market basket analysis, aims
to find all association rules satisfying user-specified minimum support and
minimum confidence [153]. An association rule captures an associative
relationship among objects; i.e., the appearance of a set of objects in a database
is strongly related to the appearance of another set of objects [43]. The basic
problem of finding association rules is introduced in [2].
The problems of mining association rules from large databases can be
decomposed into two subproblems: (1) Find itemsets whose support is greater
than the user-specified minimal support; (2) Use the frequent itemsets to generate
the desired rules [53]. Much of the research has been focused on the former [2,
107]. In these studies, the well-known Apriori algorithm is adopted for finding all
frequent itemsets with minimum support. However, the drawback of the Apriori
algorithm is that the same minimal support threshold is applied for all processes of
examining data items. Therefore, using different support thresholds for different
levels of abstraction is required.
In recent years, mining association rules from large databases has been a
popular research topic. In many applications, mining associations needs to be
performed at multiple levels of abstraction. For example, 80% of customers who
purchase wheat bread may also purchase butter. We can then drill down and find:
60% of customers who buy bread may also buy salty butter. The latter statement is at a
lower level of abstraction and the former at a higher level. The lower level carries
more specific information than that in the higher level. A top-down progressive
deepening method is developed for efficient mining of multiple-level association
rules from a large database based on the Apriori principle. The method first finds
frequent items at the topmost level and then deepens the mining process into their
descendants at lower concept levels. For example, if the minimal support is 3 in
level 1 the method filters out infrequent items (coffee, wine) in the transaction set
and frequent items (bread, milk) remain in the set. Then we deepen the mining
to find the associations of only frequent items’ descendants (wheat bread, white
bread, 2% milk and light milk).
One assumption is to explore only the descendants of the frequent items
since, if an item occurs rarely, its descendants will occur even less frequently
and are uninteresting. In [53] different support thresholds for different levels of
abstraction are applied. Using a single support threshold will generate many
uninteresting rules alongside interesting ones if the threshold is set too low, but
many interesting rules at lower levels will be neglected if the threshold is too
high. For example, if the threshold is too low “milk ⇒ bread” and “milk ⇒
shampoo” would be generated and be passed down to find the associations of their
descendants, but the latter is not interesting. If the threshold is too high we get only
“milk ⇒ bread”, but then find nothing, since the association rules of their descendants
“light milk ⇒ wheat bread” and “2% milk ⇒ white bread” cannot reach this high
threshold. Using different support thresholds gives users the flexibility to control
the mining process and to reduce the meaningless associations to be generated.
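The effect of level-wise thresholds argued above can be sketched with toy data; the item taxonomy, counts and thresholds below are illustrative assumptions.

```python
# Different minimum supports per abstraction level: level-1 items are
# filtered with a higher threshold before mining descends to their
# descendants. Items, counts and thresholds are illustrative.
level1_counts = {"milk": 9, "bread": 8, "shampoo": 2}
level2_counts = {("milk", "light milk"): 4, ("milk", "2% milk"): 3,
                 ("bread", "wheat bread"): 4, ("shampoo", "herbal"): 2}
min_sup = {1: 5, 2: 3}

frequent1 = {i for i, c in level1_counts.items() if c >= min_sup[1]}
# Only descendants of frequent level-1 items are examined at level 2.
frequent2 = {d for (parent, d), c in level2_counts.items()
             if parent in frequent1 and c >= min_sup[2]}
print(sorted(frequent1), sorted(frequent2))
```

Note that "herbal" is never examined: its parent "shampoo" already fails the level-1 threshold, which is exactly the descendant-pruning assumption described above.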
The scope of mining association rules has been extended from single level
to multiple concept levels for mining multiple-level association rules from large
databases. A top-down progressive deepening algorithm is developed for mining
multiple-level association rules. Mining multiple-level association rules from
databases has wide applications, and efficient algorithms can be developed for
finding interesting and strong rules in large databases.
2.2.2 Sequential Patterns
Mining sequential patterns has been extensively studied in the data mining
community since the first research work in [7]. The earlier studies, which
focused on the large size of retail datasets have developed several Apriori-like
algorithms [5, 107, 142] in order to solve the problem of discovering sequential
patterns or itemsets from such databases. However, these algorithms perform well
only in databases consisting of short frequent sequences. This is due to the fact
that it is quite time-consuming to generate n-term candidate sequences from
(n−1)-term sequences. As a result, to solve this problem, a variety of algorithms
such as AprioriAll [7], PrefixSpan [114, 115], CloSpan [161], FP-tree [51, 56],
SPADE [165], SLPMiner [131, 132], TSP [149], SPAM [18], GSP [87], GST [59],
MILE [27] and Sliding Window [57] have been proposed. To improve efficiency,
each algorithm pursues a different method of discovering frequent sequential
patterns; some are notable for the capability of mining such patterns without
generating any candidates at all.
Kum et al. [69] developed an algorithm, ApproxMAP (for APPROXimate
Multiple Alignment Pattern mining), to find approximate sequential patterns
which are the patterns approximately shared by many sequences and cover many
short patterns. Approximate sequential patterns can effectively represent the local
data for efficient global sequential pattern mining from multiple data sources.
Additionally, mining sequential patterns from multidimensional sequential data
is suggested in [164].
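Common to the algorithms above is the notion of sequence support: a sequential pattern is counted in a data sequence when its items occur in the same order, though not necessarily contiguously. A minimal sketch, with illustrative sequences:

```python
# Support counting for sequential patterns: unlike itemsets, the order
# of items matters, so a pattern is counted when it appears as an
# ordered (not necessarily contiguous) subsequence of a data sequence.

def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence preserving order."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def seq_support(sequences, pattern):
    return sum(1 for s in sequences if is_subsequence(pattern, s))

sequences = [["a", "b", "c", "d"], ["a", "c", "b"], ["b", "a", "c"]]
print(seq_support(sequences, ["a", "c"]))  # 3: "a" precedes "c" everywhere
print(seq_support(sequences, ["c", "a"]))  # 0
```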
2.2.3 Frequent Itemsets
Frequent itemset mining has attracted a great deal of attention since the
introduction of itemset mining in [2]. The main difference between an itemset
and a sequential pattern is that the order of the items matters in the latter.
The widely adopted algorithm for frequent itemset mining is Apriori [4], which
iterates the following three steps:
(1) count item occurrences to determine the frequent n-itemsets, where n starts
from 1;
(2) generate (n + 1)-itemset candidates from the frequent n-itemsets using a
candidate generation procedure;
(3) prune candidates whose support is below a predefined minimum support.
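The three steps can be rendered as a deliberately unoptimised Apriori sketch in Python; the transactions are illustrative.

```python
# A minimal Apriori sketch: count occurrences, join frequent n-itemsets
# into (n+1)-itemset candidates, and prune candidates below the minimum
# support count. Real Apriori also prunes candidates with infrequent
# subsets; that refinement is omitted here for brevity.
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their counts."""
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}   # step (1): 1-itemsets
    frequent = {}
    while current:
        # Count occurrences and prune (step 3).
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Step (2): candidate generation by joining frequent n-itemsets.
        keys = list(survivors)
        current = {a | b for a, b in combinations(keys, 2)
                   if len(a | b) == len(a) + 1}
    return frequent

txns = [{"bread", "milk"}, {"bread", "butter"},
        {"bread", "milk", "butter"}, {"milk", "butter"}]
for itemset, count in sorted(apriori(txns, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```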
Method                   Pattern              Algorithm
AprioriAll [7]           sequential           Apriori-like
PrefixSpan [114, 115]    sequential           Apriori-like
FP-tree [51, 56]         sequential           FP-tree
SPADE [165]              sequential           Apriori-like
SLPMiner [131, 132]      sequential           Apriori-like
TSP [149]                closed sequential    Apriori-like
CloSpan [161]            closed sequential    Apriori-like
SPAM [18]                sequential           Apriori-like
GSP [87]                 sequential           Apriori-like
GST [59]                 sequential           Graph
MILE [27]                sequential           Apriori-like
CLOSET [113]             closed itemset       Apriori-like
CLOSE [110]              closed itemset       Apriori-like
CHARM [166]              closed itemset       Apriori-like
GenMax [49]              closed itemset       Apriori-like

Table 2.1: Association rules mining algorithms.
The recent data mining algorithms for discovering various patterns are
depicted in Table 2.1. The drawback of the Apriori algorithm is the
time-consuming procedure of candidate generation, especially for large databases
with a small minimum support threshold. Many variations of the Apriori
algorithm and its applications have been extensively investigated in the
literature [35, 49, 110, 112, 113, 166]. Liu et al. [89] mined frequent itemsets
from the Web to find topic-specific concepts and definitions. Maintaining frequent
itemsets in dynamic databases is examined by Zhang et al. [167]. Mining Top-K
frequent itemsets is suggested in [156].
2.3 Text Mining
Most work in knowledge discovery and data mining was concerned with
transactional or structured databases. However, a large portion of the available
data appears in collections of text articles. Text mining is used to denote all
tasks that try to extract useful information by finding potential patterns from large
quantities of text. It combines many disciplines such as information retrieval,
information extraction, machine learning, text categorisation, text clustering and
data mining [76].
Text classification or categorisation (TC) is an instance of text mining. TC is
a supervised learning task that assigns a Boolean value to each pair (di, ci) ∈
(D × C), where D is a domain of documents and C is a set of predefined
categories. The task is to approximate the unknown target function Φ : D × C → {1, 0}
by means of a function Φ̂ : D × C → {1, 0}, such that Φ and Φ̂ coincide as much
as possible [46]. The function Φ̂ is called a classifier, and the goal is to make
this coincidence as precise as possible.
2.3.1 Feature Selection
There will be a large number of terms extracted from text using data mining
methods. The high dimensionality of the feature space leads to computational
complexity and overfitting problems. Hence, only terms with valuable information are
selected. A simple way to reduce the dimensionality is the filtering approach,
which filters out irrelevant terms based on measures derived from statistical
information. The common measures are briefly described in the following.
Term Frequency
The frequency of a term t in a document d can be used for document-specific
weighting and denoted as TF(d, t). It is only a measure of a term’s significance
within a document.
Inverse Document Frequency
Inverse Document Frequency (IDF) is used to measure the specificity of terms in
a set of documents. It assumes that a semantically rich term appears in only a few
documents, while a semantically poor term is spread over many documents. The
formula of IDF can be expressed by the following.
IDF(t) = log( |D| / DF(t) )    (2.3)
where D is the set of documents in the collection and DF(t) is the document
frequency, which is the number of documents where the term t appears at least
once.
Term Frequency Inverse Document Frequency
Term Frequency Inverse Document Frequency (TFIDF) [125] is the most widely
adopted measure. TFIDF combines the exhaustivity statistic (TF) of a term with
its specificity statistic (IDF).
TFIDF(d, t) = TF(d, t) × IDF(t)    (2.4)
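Equations (2.3) and (2.4) translate directly into code; the toy documents below are illustrative.

```python
# TF-IDF per Equations (2.3) and (2.4): term frequency within a
# document multiplied by the inverse document frequency over the
# collection.
import math

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # document frequency DF(t)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return doc.count(term) * idf(term, docs)

docs = [
    ["search", "engine", "web"],
    ["web", "mining", "patterns"],
    ["patterns", "text", "mining", "mining"],
]
print(tfidf("mining", docs[2], docs))  # 2 * log(3/2)
print(tfidf("web", docs[0], docs))     # 1 * log(3/2)
```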
Residual Inverse Document Frequency
Residual Inverse Document Frequency (RIDF) is a variation of IDF. RIDF assigns
collection-specific measures to terms according to the difference between the logs
of the actual IDF and the prediction by a Poisson model [30]. It measures the
distributional behaviour of terms across documents. The function of RIDF is
expressed in the following.
RIDF(t) = IDF(t) + log(1 − Probp(t))    (2.5)
where Probp(t) = 1 − p(0; λ) is the Poisson probability that t appears at least once
in a document, and λ = CF(t)/N is the average number of occurrences of term t
per document.
Relative Frequency Technique
Relative Frequency Technique (RFT) is suggested in [37] with the assumption that
special or technical words are more rare in general usage than in documents about
the corresponding subjects. In contrast to pure TF, RFT uses a term’s collection
statistics.
RFT(t) = TF(d, t)/Td − CF(t)/Tc    (2.6)
where CF(t) is the collection frequency denoting the number of times a term t
appears in the entire collection. Td and Tc are the total number of terms in the
document and the number of terms in a general document collection respectively.
2.3.2 Term Weighting
Term weighting uses statistical regularities in documents to estimate significance
weights for terms. Term weighting functions can measure how specific terms are
to a topic by exploiting the statistic variations in the distribution of terms within
relevant documents and within a complete document collection [105]. The term
weighting strategy should be context-specific [39].
Given a term t, the following notations will be used in the weighting
functions.
r: the number of relevant documents that contain term t.
n: the total number of documents in the collection that contain term t.
R: the total number of relevant documents.
N : the number of documents in the collection.
Probabilistic Model
Robertson and Sparck Jones [119] proposed four probabilistic functions for term
weighting based on the binary independence retrieval model. Two kinds of
assumption are used in these functions: independence assumptions and ordering
principles. Following are the four probabilistic functions.
F1(t) = log( (r/R) / (n/N) )    (2.7)

F2(t) = log( (r/R) / ((n − r)/(N − R)) )    (2.8)

F3(t) = log( (r/(R − r)) / (n/(N − n)) )    (2.9)

F4(t) = log( (r/(R − r)) / ((n − r)/(N − n − R + r)) )    (2.10)
Okapi Model
The Okapi model is based on the above-mentioned probabilistic model. The
BM25 function in the Okapi model involves using the term frequency and
document length [120, 140, 141]. The weighting function can be expressed as
follows.
BM25 = (TF · (k1 + 1)) / (k1 · NF + TF) · log( ((r + 0.5) · (N − n − R + r + 0.5)) / ((R − r + 0.5) · (n − r + 0.5)) )    (2.11)

and

NF = (1 − b) + b · (DL / AVDL)    (2.12)

where TF is the term frequency; k1 and b are tuning parameters; DL and AVDL
denote the document length and the average document length respectively.
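A sketch of Equations (2.11) and (2.12); the parameter defaults k1 = 1.2 and b = 0.75 and the example counts are illustrative assumptions, not values given in the text.

```python
# BM25 weighting per Equation (2.11), with the length normalisation NF
# of Equation (2.12). k1=1.2 and b=0.75 are common defaults.
import math

def bm25_weight(tf, dl, avdl, r, n, R, N, k1=1.2, b=0.75):
    nf = (1 - b) + b * dl / avdl                       # Equation (2.12)
    rsj = math.log(((r + 0.5) * (N - n - R + r + 0.5))
                   / ((R - r + 0.5) * (n - r + 0.5)))  # relevance part
    return (tf * (k1 + 1)) / (k1 * nf + tf) * rsj      # Equation (2.11)

# A term occurring twice in an average-length document, appearing in
# 100 of 1000 documents, 8 of them among 10 known relevant ones.
print(bm25_weight(tf=2, dl=120, avdl=120, r=8, n=100, R=10, N=1000))
```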
Mutual Information
Mutual Information (MI) estimates the reduction in uncertainty between two
variables. It can be used to identify term correlations and the association between
a term and a specific topic.
MI = log( (r/R) / (n/N) ) = log(r/R) − log(n/N)    (2.13)
Information Gain
Information Gain (IG) gauges the expected reduction in entropy of category
prediction. It can also be applied for measuring term correlations [104].
IG = −(R/N) · log(R/N) + (r/N) · log(r/n) + ((R − r)/N) · log((R − r)/(N − n))
   = −Pr(rel) log Pr(rel) + Pr(t) Pr(rel|t) log Pr(rel|t)
     + Pr(¬t) Pr(rel|¬t) log Pr(rel|¬t)    (2.14)
Chi-Square
Chi-Square (X²) estimates the difference between the observed frequencies and
the frequencies expected under the independence assumption. It can be applied for
measuring the lack of independence between a term and a specific topic [104].

X² = N · (rN − nR)² / (R · n · (N − R) · (N − n))    (2.15)

Figure 2.3: Bag-of-words representation using word frequency.
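The statistics in this subsection share the counts r, n, R and N defined earlier. A sketch of MI (Equation 2.13) and Chi-Square (Equation 2.15), with illustrative counts:

```python
# Mutual Information per Equation (2.13) and Chi-Square per Equation
# (2.15), computed from the four counts r, n, R, N.
import math

def mutual_information(r, n, R, N):
    return math.log((r / R) / (n / N))

def chi_square(r, n, R, N):
    return N * (r * N - n * R) ** 2 / (R * n * (N - R) * (N - n))

# A term occurring in 40 of 50 relevant documents but only 100 of
# 1000 documents overall scores highly on both measures.
print(mutual_information(r=40, n=100, R=50, N=1000))  # log(8)
print(chi_square(r=40, n=100, R=50, N=1000))
```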
2.4 Text Representation
2.4.1 Keyword-based Representation
The bag-of-words scheme is a typical keyword-based representation in the area of
information retrieval. It has been widely used in text classification tasks due to its
simplicity. Figure 2.3 illustrates the paradigm of the bag-of-words technique. As
can be seen, each word in the document is retrieved and stored in a vector space
along with its frequency. The content of this document can then be represented by
these words, known as “features”. However, the main drawback of this scheme
is that the relationships among words cannot be reflected [135]. Another problem
with considering single words as features is semantic ambiguity, which can be
categorised into:
• Synonyms: a word which shares the same meaning as another word (e.g.
taxi and cab).
• Homonym: a word which has two or more meanings (e.g. river “bank” and
CITI “bank”).
In IR-related tasks, if a query contains an ambiguous word, the retrieved
documents may have this word but not its intended meaning. Conversely, a
document may not be retrieved since it does not share a word with the query,
even though this document is relevant as it contains words which are synonymous
to words in the query. However, almost all existing IR systems use the
bag-of-words scheme to represent documents and queries. This does not seem
adequate from a formal semanticist’s point of view, but for simple retrieval
tasks it turns out to be surprisingly effective [101]. More detail about word
disambiguation can be
found in [126].
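The bag-of-words construction of Figure 2.3 reduces to a word-frequency count; a minimal sketch, with an illustrative sentence:

```python
# Bag-of-words as in Figure 2.3: each document becomes a vector of
# word frequencies, discarding word order and word relationships.
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

doc = "the search engine indexes the web the web grows"
print(bag_of_words(doc))  # e.g. Counter({'the': 3, 'web': 2, ...})
```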
2.4.2 Phrase-based Representation
Using single words in a keyword-based representation poses the semantic
ambiguity problem. To address this problem, the use of multiple words (i.e.
phrases) as features has been proposed. In general, phrases carry more specific
content than single words; compare, for instance, “engine” and “search engine”.
Another reason for using phrase-based representation is that the simple
keyword-based representation of content is usually inadequate because single
words are rarely specific enough for accurate discrimination [144]. Identifying
groups of words that create meaningful phrases is a better method, especially
for phrases indicating important concepts in the text. Lewis [77] noted that the
traditional term clustering methods are unlikely to provide significantly improved
text representation.
There are five categories of phrase or term extraction:
• Co-occurring terms [13, 77]
• Episodes [10, 11]
• Noun phrases [21, 134]
• Key-phrases [148]
• nGram [12, 20, 100, 135]
Ahonen et al. [10] applied data mining techniques to find episodes for extracting useful information from text. For sequential data, episodes and episode rules are a modification of the concepts of frequent sets and association rules [2]. Sequential data is treated as a sequence of events, where each event is a pair of event type and time [95]. Shen et al. [135] proposed an n-multigram model to support the automatic text classification task. Their model is smaller than an nGram-based one and achieves similar performance on RCV1.
Fuhr [44] investigated probabilistic models in IR and pointed out that a dependence model for phrases is not sufficient, because only the occurrence of the phrase components in a document is considered, not the syntactical structure of the phrases. Moreover, the certainty of identification should also be considered, such as whether the words occur adjacently or only within the same paragraph.
2.4.3 Other Representation
A new representation model that uses word clusters as features for text classification is proposed in [14]. In this work, the technique of feature clustering has been shown to be an alternative to feature selection for reducing the dimensionality.
The choice of a representation depends on what one regards as the meaningful
units of text and the meaningful natural language rules for the combination of
these units [130]. With respect to the representation of the content of documents,
some research works have used phrases rather than individual words. In [25],
the combination of unigram and 2-gram is chosen for document indexing in
text categorisation (TC) and evaluated on a variety of feature selection functions
(FEF). Sharma and Raman [134] propose a phrase-based text representation for
web document management using rule-based Natural Language Processing (NLP)
and Context-free Grammar (CFG) techniques. In [11], the authors apply data mining techniques to text analysis by extracting co-occurring terms as descriptive phrases
from document collections. However, the effectiveness of the text mining systems
using phrases as text representation showed no significant improvement. The
likely reason is that a phrase-based method has “lower consistency of assignment
and lower document frequency for terms” as mentioned in [77].
2.5 Information Filtering
An information filtering (IF) system monitors an incoming document stream and
selects documents relevant to one or more of its query profiles. If the interactions between profiles are ignored, this task can be treated as a binary decision to accept or reject incoming documents with respect to a given profile [60]. In terms of relevance judgements, if users are able to give these judgements as feedback, IF can be viewed as an interactive learning process; otherwise, it is a non-interactive machine learning problem with a set of labelled documents provided in advance. The task of IF is to reduce a user’s information load
by removing all non-relevant documents from an incoming stream. It can also
be regarded as a special instance of text classification [130]. The historical
development of IF can be seen in [106].
Simple averaging of probabilities or log odds ratios generates a significant
improvement for document filtering [60]. Kernel-based methods [23, 24, 91] have
been used to address document filtering problems.
Unlike the traditional search query, an adaptive filtering system maintains user
profiles which tend to reflect a long-term information need. By interacting with
users, an adaptive filtering system can learn a better profile and update it with
feedback to improve its performance over time. The assumption of an adaptive system is that users want to receive interesting documents as soon as they arrive. Hence, the system has to make a binary decision, to retrieve or reject each incoming document, with respect to a user profile. Lau et al. [75] applied Belief Revision logic to model the task of adaptive information retrieval. Lanquillon [73, 74] proposed two methods for assessing performance indicators without user feedback.
Table 2.2 shows the existing information filtering systems in the related literature. Most of these systems adopt single words (the so-called bag-of-words) for data representation and a TFIDF variant for the term weighting scheme.
2.6 Chapter Summary
In this chapter, the background of knowledge discovery and data mining has
been discussed and the related research work regarding text mining, association
analysis, text representation and information filtering has been reviewed. Starting
IF Model Representation Term weighting
KerMIT [24] bag-of-words Kernel Function
PIRCS [70] bag-of-words TFIDF
Okapi [121] bag-of-words TFIDF
RELIEFS [22] bag-of-words Probabilistic
Rutgers [17] bag-of-words TFIDF
CLARIT [38] bag-of-words TFIDF
NewT [136] bag-of-words TFIDF
NewsWeeder [72] bag-of-words TFIDF
Aplipes [155] bag-of-words TFIDF
GroupLens [66] bag-of-words TFIDF
INFOrmer [138] nGram TFIDF
SIFT [160] bag-of-words TF
ProFile [16] bag-of-words TFIDF
INQUERY [15, 32, 150] bag-of-words Probabilistic
Table 2.2: Information Filtering models.
with knowledge discovery, we discussed its definition and the typical process of knowledge discovery, and explored current applications and challenges in the area. We then focused on the development of data mining and analysed one of its key products, association rules. Various pattern mining algorithms were reviewed, including association rule mining, frequent itemset mining, sequential pattern mining, and closed and maximum pattern mining. In terms of text mining, we briefly reviewed common feature selection and term weighting approaches for dimensionality reduction. Three types of text representation scheme were also explored and discussed. Lastly, we reviewed the literature on information filtering and related techniques.
Chapter 3
Prototype of Pattern Taxonomy Model
As mentioned in Chapter 1, knowledge discovery has been investigated for a long time, and many data mining methods have been proposed to address related challenges in various fields, especially in the domains of supermarket basket data, telecommunications data and human genomes [10]. However, it is still difficult to find a suitable example that implements these data mining techniques in the area of text mining, which is usually analysed using Information Retrieval-related methods or natural language processing. This chapter presents the fundamental prototype of the Pattern Taxonomy Model (PTM), which focuses on the issue of finding useful patterns in text documents. Definitions of patterns and related algorithms for pattern discovery are provided in this chapter as well.
3.1 Pattern Taxonomy Model
Two main stages are considered in PTM. The first stage is how to extract useful
phrases from text documents, which will be discussed in this chapter. The second
stage is then how to use these discovered patterns to improve the effectiveness of
a knowledge discovery system and will be presented in Chapter 4.
In PTM, we split a text document into a set of paragraphs and treat each
paragraph as an individual transaction, which consists of a set of words (terms).
In the subsequent phase, we apply data mining methods to find frequent patterns in these transactions and generate pattern taxonomies. During the
pruning phase, non-meaningful and redundant patterns are eliminated by applying
a proposed pruning scheme.
3.1.1 Sequential Pattern Mining (SPM)
The basic definitions of sequences used in this research work are described as
follows. Let T = {t1, t2, . . . , tk} be the set of all terms, which can be viewed as words or keywords in text documents. A sequence S = 〈s1, s2, . . . , sn〉 (si ∈ T) is an ordered list of terms. Note that duplication of terms is allowed in a sequence; this differs from the usual definition, where a pattern consists of distinct terms.
Definition 3.1. (sub-sequence) A sequence α = 〈a1, a2, . . . , an〉 is a sub-sequence of another sequence β = 〈b1, b2, . . . , bm〉, denoted by α ⊑ β, if there exist integers 1 ≤ i1 < i2 < . . . < in ≤ m such that a1 = bi1, a2 = bi2, . . . , an = bin.
For instance, sequence 〈s1, s3〉 is a sub-sequence of sequence 〈s1, s2, s3〉. However, 〈s2, s1〉 is not a sub-sequence of 〈s1, s2, s3〉, since the order of terms is considered. The sequence α is a proper sub-sequence of β, denoted as α ⊏ β, if α ⊑ β but α ≠ β. In addition, we can also say that sequence 〈s1, s2, s3〉 is a super-sequence of 〈s1, s3〉. The problem of mining sequential patterns is to find the complete set of sub-sequences, from a set of sequences, whose support is greater than a user-predefined threshold, min sup.
Pattern taxonomy is a tree-like hierarchy that preserves the sub-sequence (i.e., “is-a”) relationship between discovered sequential patterns. An example of a pattern taxonomy is illustrated in Figure 3.1.
Definition 3.2. (Absolute and Relative Support) Given a document d = {S1, S2, . . . , Sn}, where Si is a sequence representing a paragraph in d; thus, |d| is the number of paragraphs in document d. Let P be a sequence. We call P a sequential pattern of d if there is an Si ∈ d such that P ⊑ Si. The absolute support of P, denoted as suppa(P) = |{S | S ∈ d, P ⊑ S}|, is the number of occurrences of P in d. The relative support of P is the fraction of paragraphs that contain P in document d, denoted as suppr(P) = suppa(P)/|d|.
For example, the sequential pattern P = 〈t1, t2, t3〉 in the sample database,
as shown in Table 3.1, has suppa(P ) = 2 and suppr(P ) = 0.5. All sequential
patterns in Table 3.1 with absolute support greater than or equal to 2 are presented
in Table 3.2.
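Definitions 3.1 and 3.2 and the example above can be sketched in a few lines of Python (the function names are mine; the document is the sample of Table 3.1):

```python
def is_subseq(a, b):
    """Definition 3.1: a is a sub-sequence of b (order kept, gaps allowed)."""
    it = iter(b)
    return all(t in it for t in a)   # each `in` consumes the iterator

def supp_a(p, d):
    """Absolute support: number of paragraphs of d that contain p."""
    return sum(1 for s in d if is_subseq(p, s))

def supp_r(p, d):
    """Relative support: fraction of paragraphs of d that contain p."""
    return supp_a(p, d) / len(d)

# the four paragraphs (transactions) of Table 3.1
d = [["t1", "t2", "t3", "t4"], ["t2", "t4", "t5", "t3"],
     ["t3", "t6", "t1"], ["t5", "t1", "t2", "t7", "t3"]]
supp_a(["t1", "t2", "t3"], d)  # → 2
supp_r(["t1", "t2", "t3"], d)  # → 0.5
```

The iterator trick in `is_subseq` enforces the left-to-right order of Definition 3.1, so 〈t2, t1〉 is correctly rejected as a sub-sequence of 〈t1, t2, t3〉.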
The relative support of a pattern is used to properly estimate the significance of the pattern. With absolute support alone, a pattern with the same frequency acquires the same support regardless of document length; however, with the same number of occurrences, a pattern is more significant in a short document than in a long one. Unlike other approaches, we decompose a document into a set of transactions and discover frequent patterns from them using data mining methods. The relative support of a pattern is estimated by dividing its absolute support by the number of transactions in the document. Hence, a pattern can obtain an adequate support
Transaction Sequence
1 S1 : 〈t1, t2, t3, t4〉
2 S2 : 〈t2, t4, t5, t3〉
3 S3 : 〈t3, t6, t1〉
4 S4 : 〈t5, t1, t2, t7, t3〉
Table 3.1: Each transaction represents a paragraph in a text document and contains a sequence consisting of an ordered list of words.
Patterns suppa suppr
〈t4〉, 〈t5〉, 〈t1, t2〉, 〈t1, t3〉, 〈t2, t4〉, 〈t5, t3〉, 〈t1, t2, t3〉 | 2 | 0.5
〈t1〉, 〈t2〉, 〈t2, t3〉 | 3 | 0.75
〈t3〉 | 4 | 1
Table 3.2: All frequent sequential patterns discovered from the sample document (Table 3.1) with min sup: ξ = 0.5.
for various document lengths with the same frequency.
Definition 3.3. (Frequent Sequential Pattern) A sequential pattern P is called a frequent sequential pattern if suppr(P) is greater than or equal to a minimum support (min sup for short) ξ.
For example, let min sup be 0.75 for mining frequent sequential patterns from the sample document in Table 3.1; we obtain four frequent sequential patterns, 〈t2, t3〉, 〈t1〉, 〈t2〉 and 〈t3〉, since their relative supports are not less than ξ.
Definition 3.4. (nTerms Pattern) The length of a sequential pattern P, denoted as len(P), indicates the number of words (or terms) contained in P. A sequential pattern which contains n terms is denoted in short as an nTerms pattern.
For instance, given pattern P = 〈t2, t3〉, we have len(P) = 2, and P is a 2Terms pattern. A sequential pattern may consist of several terms or of just one term; thus, a 1Term pattern is a special case of an nTerms pattern in this research work.
3.1.2 Pattern Pruning
All algorithms for finding frequent sequential patterns in a dataset encounter the same problem: a large number of patterns are generated, most of which are non-meaningful and need to be eliminated [159]. A proper pruning scheme can address this issue by removing redundant patterns, not only reducing the dimensionality but also decreasing the effect of noise patterns. In this research work, we define closed patterns as meaningful patterns, since most of the sub-sequence patterns of a closed pattern have the same frequency as it, which means they always occur together in a document. For example, in Figure 3.1 patterns 〈t1, t2〉 and 〈t1, t3〉 appear twice in a document, as their parent pattern 〈t1, t2, t3〉 has a frequency of two. SPM stands for sequential pattern mining, and we denote sequential closed pattern mining as SCPM. The notion of a closed pattern is defined as follows:
Definition 3.5. (Closed Sequential Pattern) A frequent sequential pattern P is a closed sequential pattern if there exists no frequent sequential pattern P′ such that P ⊏ P′ and suppa(P) = suppa(P′). The relation ⊏ represents the strict part of the sub-sequence relation ⊑.
For instance, the nodes in Figure 3.1 represent sequential patterns extracted from Table 3.1.
Figure 3.1: An example of a pattern taxonomy where patterns in dashed boxes are closed patterns.
Only the patterns within dashed-line borders are closed sequential patterns if min sup ξ = 0.50; the others are considered non-closed sequential patterns.
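Definition 3.5 can be checked directly against a table of frequent patterns and their supports. A small sketch (helper names are mine), using the frequent patterns of Table 3.2:

```python
def is_subseq(a, b):
    """Sub-sequence test of Definition 3.1."""
    it = iter(b)
    return all(t in it for t in a)

def is_closed(p, freq):
    """Definition 3.5: p is closed iff no frequent proper super-sequence
    of p has the same absolute support. `freq` maps pattern tuples to suppa."""
    return not any(p != q and is_subseq(p, q) and freq[p] == freq[q]
                   for q in freq)

# frequent patterns of Table 3.2 with their absolute supports
freq = {("t3",): 4, ("t1",): 3, ("t2",): 3, ("t2", "t3"): 3,
        ("t4",): 2, ("t5",): 2, ("t1", "t2"): 2, ("t1", "t3"): 2,
        ("t2", "t4"): 2, ("t5", "t3"): 2, ("t1", "t2", "t3"): 2}
is_closed(("t2",), freq)       # False: 〈t2,t3〉 subsumes 〈t2〉 at equal support
is_closed(("t2", "t3"), freq)  # True: its only super-pattern has lower support
```

This is the membership test that the pruning line of Algorithm 3.1 applies incrementally, one pattern length at a time.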
Algorithm 3.1. SPMining(PL, min sup)
Input: a list of nTerms frequent sequential patterns, PL; minimum support,
min sup.
Output: a set of frequent sequential patterns, SP.
Method:
1: SP ← SP − {Pa ∈ SP | ∃Pb ∈ PL such that len(Pa) = len(Pb) − 1
∧ Pa ⊏ Pb ∧ suppa(Pa) = suppa(Pb)} //pattern pruning
2: SP ← SP ∪ PL //storing nTerms patterns
3: PL′ ← ∅
4: foreach pattern p in PL do begin
5: generating p-projected database PD
6: foreach frequent term t in PD do begin
7: P ′ = p ⋈ t //sequence extension
8: if suppr(P′) ≥ min sup then
9: PL′ ← PL′ ∪ P ′
10: end if
11: end for
12: end for
13: if |PL′| = 0 then
14: return //no more pattern
15: else
16: call SPMining(PL′, min sup)
17: end if
18: output SP
The Sequential Pattern Mining (SPM) algorithm SPMining is depicted in Algorithm 3.1. In this algorithm, we apply the pruning scheme to eliminate non-closed patterns during the process of sequential pattern discovery. The key feature of this recursive algorithm is its first line, which describes the pruning procedure: all patterns of length n−1 are examined to determine whether or not they are closed, after all patterns of length n have been generated in the previous recursion. For instance, for a 2Terms pattern 〈t2, t3〉, if there exists a 3Terms frequent pattern 〈t1, t2, t3〉 with the same frequency as 〈t2, t3〉, the shorter pattern is detected as non-closed and therefore pruned. After this pruning step, the remaining (n-1)Terms patterns (i.e. closed sequential patterns) are stored, and the algorithm continues to find (n+1)Terms patterns. The algorithm repeats itself recursively until no more patterns are discovered.
Figure 3.2: Illustration of pruning redundant patterns.
As a result, the output of algorithm SPMining is a set of
closed sequential patterns with relative supports greater than or equal to a specified
minimum support.
As mentioned above, SPMining adopts the projected-database-based approach, projecting (or partitioning) the database for each nTerms pattern in an attempt to find (n+1)Terms patterns during each recursion.
Definition 3.6. (P-projected Database) Given a pattern p, the p-projected database contains the set of sequences made of the postfixes of p in the database.
For instance, referring to the sample database in Table 3.1, let p be 〈t1〉; the p-projected database will be {〈t2, t3, t4〉, 〈〉, 〈〉, 〈t2, t7, t3〉}, where 〈〉 is a null sequence, since p does not appear in transaction 2 and term t1 is located at the end of transaction 3.
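The projection step can be sketched as follows (the function name is mine); following the text above, a transaction where p does not occur contributes a null sequence:

```python
def project(p, db):
    """Build the p-projected database (Definition 3.6): for each sequence,
    keep the postfix after the leftmost embedding of p; if p does not
    occur, keep a null sequence."""
    proj = []
    for s in db:
        i, j = 0, 0
        while i < len(s) and j < len(p):
            if s[i] == p[j]:
                j += 1            # matched the next term of p
            i += 1
        proj.append(s[i:] if j == len(p) else [])
    return proj

# the sample database of Table 3.1
db = [["t1", "t2", "t3", "t4"], ["t2", "t4", "t5", "t3"],
      ["t3", "t6", "t1"], ["t5", "t1", "t2", "t7", "t3"]]
project(["t1"], db)  # → [['t2', 't3', 't4'], [], [], ['t2', 't7', 't3']]
```

Using the leftmost embedding is safe, because it leaves the longest possible postfix in which extension terms can still be found.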
After generating the p-projected database for a sequential pattern, the next step is to find the frequent terms in this database satisfying a given minimum support. If a frequent term is found, an (n+1)Terms sequential pattern is expanded from the nTerms sequential pattern using sequence extension, which is defined as follows.
Definition 3.7. (Sequence Extension) Given a term t and a sequence S, the sequence extension of S with term t can be obtained by simply appending t to S, generating a sequence S′, denoted as S′ = S ⋈ t.
For instance, the sequence extension of 〈t1, t3〉 with term t2 is the sequence 〈t1, t3, t2〉. The generated (n+1)Terms sequential patterns are passed as input to a recursive call of the algorithm, and this continues until no more sequential patterns are found.
The input of the SPMining algorithm is a set of frequent sequential patterns obtained from its own previous output. The initial input of this recursive function is the set of frequent 1Term patterns (i.e., frequent items). For instance, the frequent 1Term patterns derived from the sample document (Table 3.1) are listed in Table 3.3. Patterns 〈t6〉 and 〈t7〉 are discarded, since their relative support is less than 0.5, meaning that each appears only once among the paragraphs and cannot be “frequent”.
The set of 1Term patterns in Table 3.3 then skips line 1 of the algorithm, which is designed to prune non-closed patterns generated in previous iterations. Lines 4 to 12 present the generation of (n+1)Terms patterns from the nTerms p-projected database; a p-projected database is first formed by using a 1Term pattern as a root and finding all projected sequences. Table 3.4 illustrates the example of a p-projected database
1Term Pattern suppa suppr
〈t1〉 3 0.75
〈t2〉 3 0.75
〈t3〉 4 1.0
〈t4〉 2 0.5
〈t5〉 2 0.5
Table 3.3: Frequent 1Term patterns with min sup = 0.5.
Root Projected sequence
t1 〈t1〉 :→ 〈t2, t3, t4〉
t1 〈t1〉 :→ 〈t2, t7, t3〉
t2 〈t2〉 :→ 〈t3, t4〉
t2 〈t2〉 :→ 〈t4, t5, t3〉
t2 〈t2〉 :→ 〈t7, t3〉
t3 〈t3〉 :→ 〈t4〉
t3 〈t3〉 :→ 〈t6, t1〉
t4 〈t4〉 :→ 〈t5, t3〉
t5 〈t5〉 :→ 〈t3〉
t5 〈t5〉 :→ 〈t1, t2, t7, t3〉
Table 3.4: An example of a p-projected database.
based on the above-mentioned scenario. For each 1Term pattern p, a number of sub-sequences Ps starting with p are generated, where Ps ⊑ Sn and Sn ∈ d. Generally speaking, the number of projected sequences of a root pattern equals the pattern’s absolute support, unless the root is located at the end of some paragraphs (e.g., t1, t3 and t4).
After a p-projected database is built, (n+1)Terms pattern candidates can be
obtained by extending the nTerms pattern using sequence extension. For instance,
1Term Pattern 2Terms Pattern suppa suppr
t1 〈t1, t2〉 2 0.5
t1 〈t1, t3〉 2 0.5
t2 〈t2, t3〉 3 0.75
t2 〈t2, t4〉 2 0.5
t3 not found - -
t4 not found - -
t5 〈t5, t3〉 2 0.5
Table 3.5: 2Terms sequential patterns derived from 1Term patterns.
in Table 3.4 the pattern 〈t1〉 has two projected sequences, 〈t2, t3, t4〉 and 〈t2, t7, t3〉, among which two frequent terms, t2 and t3, are found. At the next step, two 2Terms candidates, 〈t1, t2〉 and 〈t1, t3〉, are formed with the same relative support of 0.5. Each candidate is then examined at line 8 and confirmed as a frequent pattern if its relative support is greater than or equal to the minimum support. Based on the previous example, the 2Terms patterns derived from the p-projected databases in Table 3.4 are presented in Table 3.5. Note that no pattern is generated from the p-projected databases of patterns 〈t3〉 and 〈t4〉, because no more frequent terms exist in their databases.
As long as at least one nTerms pattern is found in the current iteration, the algorithm recursively calls itself to find (n+1)Terms patterns. Otherwise, the algorithm terminates and returns the set of sequential patterns SP as output. For the scenario in Table 3.5, the frequent 2Terms patterns 〈t1, t2〉, 〈t1, t3〉, 〈t2, t3〉, 〈t2, t4〉 and 〈t5, t3〉 are passed to the algorithm itself as one of the parameters in
order to find the 3Terms patterns. Again, at the first line of the algorithm, these 2Terms patterns are compared with the 1Term patterns in order to prune non-closed 1Term patterns. This process can be described as follows: for each (n-1)Terms pattern Pa in SP, if there exists any nTerms pattern Pb such that Pa is a proper sub-sequence of Pb (i.e., Pa ⊏ Pb) and both of them have the same relative support (i.e., suppr(Pa) = suppr(Pb)), then Pa is defined as a non-closed pattern and is eliminated from the set SP. This step is performed when finding frequent closed sequential patterns; when only frequent sequential patterns are sought, it is skipped. The result of non-closed pattern pruning for our example is depicted in Table 3.6. Figure 3.2 illustrates the process of pattern pruning: the arrows in the figure indicate the process of finding and pruning redundant patterns (i.e., non-closed patterns) from the lower level (i.e., (n+1)Terms patterns) to the higher level (i.e., nTerms patterns).
The same procedure is applied to the 2Terms patterns in the remaining lines of the algorithm, including the generation of p-projected databases and the assessment of frequent patterns. As a result, one 3Terms sequential pattern, 〈t1, t2, t3〉, is discovered at the end of this iteration. After it is passed to the next iteration, the pruning result for the 2Terms patterns is as illustrated in Table 3.7.
As mentioned before, the algorithm SPMining is designed for discovering both closed and non-closed sequential patterns from a set of documents. The execution of the first line of the algorithm is the key to finding closed patterns and removing the others; moreover, the algorithm can easily be adjusted to find non-closed sequential patterns by skipping its first line. For the previous document example in Table 3.1, after inputting this document into
1Term Pattern suppr Super-sequence suppr Closed pattern?
〈t1〉 0.75 〈t1, t2〉 0.5, 〈t1, t3〉 0.5 yes
〈t2〉 0.75 〈t1, t2〉 0.5, 〈t2, t3〉 0.75, 〈t2, t4〉 0.5 no
〈t3〉 1.0 〈t1, t3〉 0.5, 〈t2, t3〉 0.75, 〈t5, t3〉 0.5 yes
〈t4〉 0.5 〈t2, t4〉 0.5 no
〈t5〉 0.5 none - yes
Table 3.6: The assessment of closed pattern of 1Term patterns.
2Terms Pattern suppr Super-sequence suppr Closed pattern?
〈t1, t2〉 0.5 〈t1, t2, t3〉 0.5 no
〈t1, t3〉 0.5 〈t1, t2, t3〉 0.5 no
〈t2, t3〉 0.75 〈t1, t2, t3〉 0.5 yes
〈t2, t4〉 0.5 none - yes
〈t5, t3〉 0.5 none - yes
Table 3.7: The assessment of closed pattern of 2Terms patterns.
Frequent patterns Non-closed Closed
1Term: 〈t2〉, 〈t4〉 | 〈t1〉, 〈t3〉, 〈t5〉
2Terms: 〈t1, t2〉, 〈t1, t3〉 | 〈t2, t3〉, 〈t2, t4〉, 〈t5, t3〉
3Terms: none | 〈t1, t2, t3〉
Table 3.8: Discovered frequent closed and non-closed sequential patterns.
the algorithm and setting the minimum support to 0.5, a list of all the closed and non-closed sequential patterns can be returned; the results are shown in Table 3.8.
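Putting the pieces together, Algorithm 3.1 can be sketched as follows. This is an illustrative reimplementation rather than the thesis code: it is iterative instead of recursive, candidate supports are checked directly against the document, and all names are of my choosing.

```python
def sp_mining(d, min_sup):
    """Closed sequential pattern mining over a document d, given as a list
    of paragraphs (each an ordered list of terms). A sketch of Algorithm
    3.1, applying the line-1 pruning rule before each extension round."""
    n = len(d)

    def is_sub(a, b):                      # sub-sequence test (Def. 3.1)
        it = iter(b)
        return all(t in it for t in a)

    def supp(p):                           # absolute support (Def. 3.2)
        return sum(1 for s in d if is_sub(p, s))

    def project(p):                        # p-projected database (Def. 3.6)
        proj = []
        for s in d:
            i, j = 0, 0
            while i < len(s) and j < len(p):
                if s[i] == p[j]:
                    j += 1
                i += 1
            if j == len(p):
                proj.append(s[i:])
        return proj

    sp = set()
    pl = {(t,) for s in d for t in s}
    pl = {p for p in pl if supp(p) / n >= min_sup}   # frequent 1Term patterns
    while pl:
        # pattern pruning: drop stored patterns one term shorter than, and
        # subsumed at equal support by, a pattern found in this round
        sp = {pa for pa in sp
              if not any(len(pa) == len(pb) - 1 and is_sub(pa, pb)
                         and supp(pa) == supp(pb) for pb in pl)}
        sp |= pl                                     # store nTerms patterns
        nxt = set()
        for p in pl:
            for t in {t for s in project(p) for t in s}:
                q = p + (t,)                         # sequence extension
                if supp(q) / n >= min_sup:
                    nxt.add(q)
        pl = nxt
    return sp

d = [["t1", "t2", "t3", "t4"], ["t2", "t4", "t5", "t3"],
     ["t3", "t6", "t1"], ["t5", "t1", "t2", "t7", "t3"]]
sp = sp_mining(d, 0.5)
# 〈t1,t2,t3〉 survives, while its equal-support sub-patterns such as
# 〈t1,t2〉, 〈t1,t3〉 and 〈t2〉 are pruned as non-closed
```

Skipping the pruning comprehension inside the loop yields all frequent sequential patterns instead, mirroring the adjustment described for line 1 above.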
3.1.3 Using Discovered Patterns
As mentioned in the previous section, the algorithm SPMining uses a sequential data mining technique with a pruning scheme to find meaningful patterns in text documents. The next issue is how to use these discovered patterns. There are various ways to utilise discovered patterns, such as using a weighting function to assign a value to each pattern according to its frequency. One strategy was implemented and evaluated in [159], which proposed a pattern mining method that treated each found sequential pattern as a whole item, without breaking it into a set of individual terms; its results showed that using confidence as the pattern measure outperformed using support. For example, each mined sequential pattern p in PTM can be viewed as a rule p → positive, and the confidence of p was evaluated using the following weighting function:
W(p) = |{da | da ∈ D+, p ⊆ da}| / |{db | db ∈ D, p ⊆ db}|,
where D is the training set of documents, and D+ indicates the set of positive
documents in D.
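A direct reading of this weighting function can be sketched as follows (the function and variable names are mine; documents are simplified to sets of terms, with subset matching standing in for pattern containment):

```python
def weight(p, d_pos, d_all, contains):
    """W(p): among the documents that contain pattern p, the fraction
    that are positive, i.e. the confidence of the rule p -> positive."""
    matched = [d for d in d_all if contains(p, d)]
    positive = [d for d in matched if d in d_pos]
    return len(positive) / len(matched) if matched else 0.0

# toy collection: three documents, the first two labelled positive
D = [{"t1", "t2"}, {"t1", "t3"}, {"t2", "t3"}]
D_pos = D[:2]
subset = lambda p, d: p <= d
weight({"t1"}, D_pos, D, subset)  # → 1.0 (both t1-documents are positive)
weight({"t2"}, D_pos, D, subset)  # → 0.5
```

Passing the matching predicate in as a parameter lets the same function serve sequential patterns, where `contains` would be a sub-sequence test instead of a subset test.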
Two problems arise when using the above-mentioned weighting function. One is the low pattern frequency problem, which is mainly due to the fact that long patterns are hard to match in documents. The other is that patterns specific to a topic may gain a lower score than general patterns; in other words, the information carried by specific patterns cannot be estimated by the weighting function. A proper pattern processing method that overcomes these problems is therefore desirable. We discuss this issue in Chapter 4.
3.2 Finding Non-Sequential Patterns
In Section 3.1, the algorithm SPMining was developed for the purpose of mining all frequent sequential patterns from documents. In addition to sequential patterns, non-sequential pattern mining (NSPM) from a set of textual documents is another application of the data mining mechanism. From the data mining point of view, non-sequential patterns can be treated as frequent itemsets extracted from a transactional database. Frequent itemset mining is one of the most essential issues in many data mining applications. Since the seminal work by Agrawal et al. [2] in the mid 1990s, many itemset mining approaches adopting the concept of the Apriori algorithm have been proposed. These Apriori-like approaches utilise a bottom-up scheme that enumerates every single frequent itemset [49]. However, the phase of candidate generation in Apriori-like algorithms is likely to be computationally expensive. In particular, a long pattern which contains n items has 2^n − 2 proper non-empty subsets, making this approach inefficient and infeasible.
To tackle the above-mentioned problem, we propose the following strategy. Each document in the dataset is split into a set of transactions (i.e., paragraphs), instead of the whole document being viewed as a single transaction, as in traditional methods. As a result, the number of candidates for pattern generation can be greatly reduced: for a document with five paragraphs, for example, the average transaction length is one-fifth of the original. This strategy therefore saves much computational time, especially for long documents. In this section, an NSPM algorithm is developed to address the problem of finding all frequent non-sequential patterns in a given textual database. The fundamental definitions are given in Section 3.2.1 and the corresponding algorithm is presented in Section 3.2.2.
3.2.1 Basic Definition of NSPM
The essential definitions of NSPM are described as follows. Let T = {t1, t2, . . . , tk} be a set of distinct terms. A non-sequential pattern p is a subset of T.
Definition 3.8. (frequency and support) Given a document d = {S1, S2, . . . , Sn}, where Si is a paragraph of d. The frequency of a non-sequential pattern p is the number of paragraphs which contain p, denoted as freq(p) = |{S | S ∈ d ∧ p ⊆ S}|. The support of p is defined as support(p) = freq(p)/|d|.
For example, the frequency of the non-sequential pattern {t1, t3} in the document example (Table 3.1) is 3, and the support of this pattern is 0.75. Note that in the case of SPM, the frequency and the relative support of pattern 〈t1, t3〉 are 2 and 0.5 respectively, since the order of terms is taken into account.
3.2.2 NSPM Algorithm
Algorithm 3.2 is proposed to mine the non-sequential patterns whose supports are greater than or equal to a specified min sup. The inputs are a list of nTerms frequent non-sequential patterns NP, a list of 1Term frequent patterns FT, and a minimum support min sup. As in the algorithm SPMining, the initial NP is a list of frequent 1Term patterns. Using Table 3.1 as a document example, the initial NP is {{t1}, {t2}, {t3}, {t4}, {t5}}. The content of FT is the same as that of the initial NP, but FT is used for candidate generation and remains static over all iterations; min sup is a constant real value.
Algorithm 3.2. NSPMining(NP, FT, min sup)
Input: a list of nTerms frequent non-sequential patterns, NP; a list of 1Term
frequent patterns, FT; minimum support, min sup.
Output: a set of frequent non-sequential patterns, FP.
Method:
1: FP ← FP ∪NP //nTerms non-sequential patterns
2: NP ′ ← ∅
3: foreach pattern p in NP do begin
4: foreach frequent term t in FT do begin
5: P ′ = p ∪ {t} //pattern growing
6: if support(P ′) ≥ min sup then
7: NP ′ ← NP ′ ∪ P ′
8: end if
9: end for
10: end for
11: if |NP ′| = 0 then
12: return //no more pattern
13: else
14: call NSPMining(NP ′, FT, min sup)
15: end if
16: output FP
The first line of the algorithm NSPMining stores the mined patterns in NP passed from the previous recursive loop. Lines 3 to 10 extend nTerms patterns to (n+1)Terms patterns for candidate generation: by joining each frequent term in FT into each nTerms pattern in NP, a number of (n+1)Terms candidates are created. For example, the 2Terms candidates generated from the previous document example are listed in Table 3.9.
As we can see, the number of candidates generated in NSPM (10 candidates in Table 3.9) is much larger than in SPM (5 candidates in Table 3.5) for the same document example, and the difference becomes more significant as documents grow. The reason is that in NSPM every term in each transaction needs to be visited to estimate frequency and support, whereas in SPM only a portion of the sequence in each transaction does. Although SPM requires an extra process in advance, such as building p-projected databases, it takes less
1Term Pattern 2Terms Pattern Frequency Support
t1 {t1, t2} 2 0.5
t1 {t1, t3} 3 0.75
t1 {t1, t4} 1 0.25
t1 {t1, t5} 1 0.25
t2 {t2, t3} 3 0.75
t2 {t2, t4} 2 0.5
t2 {t2, t5} 2 0.5
t3 {t3, t4} 2 0.5
t3 {t3, t5} 2 0.5
t4 {t4, t5} 1 0.25
Table 3.9: 2Terms candidates generated during non-sequential pattern mining.
computing time, since only a simple splitting function is needed for that work. In our experiments, some topics take much longer (a couple of hours) than others to complete an NSPM task with a low min sup.
Once a candidate is generated, it is immediately examined to check whether or not it is frequent, by executing line 6 of the algorithm. At the end of this process (line 10), all frequent 2Terms non-sequential patterns have been discovered in the first recursive loop. From the previous example with min sup = 0.5, seven 2Terms patterns in Table 3.9 remain as frequent patterns; patterns {t1, t4}, {t1, t5} and {t4, t5} are excluded because of their low supports. If no more frequent patterns are found, the algorithm terminates and outputs the result. Otherwise, the discovered nTerms frequent patterns are passed to the next recursive loop to find further (n+1)Terms patterns.
At the second recursive loop of the algorithm NSPMining, the previously found 2Terms non-sequential patterns are reserved in FP at the first line of the
2Terms Pattern   3Terms Pattern   Frequency   Support   Frequent?
{t1, t2}         {t1, t2, t3}     2           0.5       yes
                 {t1, t2, t4}     1           0.25      no
                 {t1, t2, t5}     1           0.25      no
{t1, t3}         {t1, t3, t4}     1           0.25      no
                 {t1, t3, t5}     1           0.25      no
{t1, t4}         {t1, t4, t5}     0           0         no
{t2, t3}         {t2, t3, t4}     2           0.5       yes
                 {t2, t3, t5}     2           0.5       yes
{t2, t4}         {t2, t4, t5}     1           0.25      no
{t3, t4}         {t3, t4, t5}     1           0.25      no

Table 3.10: 3Terms candidates generated during non-sequential pattern mining.
algorithm. These 2Terms patterns are then extended to form 3Terms candidates by
joining each term t from the frequent 1Term pattern set FT. All of the possible 3Terms
candidates are presented in Table 3.10.
After assessing all 3Terms candidates, three frequent patterns still
remain, compared to just one at the same stage using the SPM algorithm. This is
further evidence that NSPM is inefficient compared with SPM when
applying data mining algorithms to the text mining task. In other words, NSPM
spends extra time not only on generating candidates but also on the larger
number of patterns that need to be processed in the next recursive loop.
For the current case in NSPM, the three 3Terms frequent non-sequential patterns
(i.e., {t1, t2, t3}, {t2, t3, t4} and {t2, t3, t5}) are consequently delivered to the next
recursive iteration. After they are stored in FP, the same process of candidate
generation is activated and 4Terms candidates are then created. These candidates
3Terms Pattern   4Terms Pattern     Frequency   Support   Frequent?
{t1, t2, t3}     {t1, t2, t3, t4}   1           0.25      no
                 {t1, t2, t3, t5}   1           0.25      no
{t2, t3, t4}     {t2, t3, t4, t5}   1           0.25      no

Table 3.11: 4Terms candidates generated in NSPM.
are illustrated in Table 3.11. The algorithm terminates and frequent patterns stored
in FP are returned. The whole list of frequent non-sequential patterns is shown
in Table 3.12.
The frequent non-sequential closed patterns can be mined using NSPMining
with the following process inserted into the first line:

1: FP ← FP − {Pa ∈ FP | ∃Pb ∈ NP such that len(Pa) = len(Pb) − 1 ∧
   Pa ⊂ Pb ∧ support(Pa) = support(Pb)}   (3.1)
The above pruning scheme is employed by the NSPMining algorithm to
eliminate any non-closed non-sequential pattern, that is, a pattern with the same
support as one of its supersets. This mining method for closed non-sequential patterns
is called the Non-Sequential Closed Pattern Mining (NSCPM) method.
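The pruning rule of Equation 3.1 can be sketched directly over the frequent patterns and supports of Table 3.12; the hard-coded dictionary and variable names below are illustrative only.

```python
# Frequent non-sequential patterns with their supports, from Table 3.12
frequent = {
    frozenset(["t1"]): 0.75, frozenset(["t2"]): 0.75, frozenset(["t3"]): 1.0,
    frozenset(["t4"]): 0.5,  frozenset(["t5"]): 0.5,
    frozenset(["t1", "t2"]): 0.5,  frozenset(["t1", "t3"]): 0.75,
    frozenset(["t2", "t3"]): 0.75, frozenset(["t2", "t4"]): 0.5,
    frozenset(["t2", "t5"]): 0.5,  frozenset(["t3", "t4"]): 0.5,
    frozenset(["t3", "t5"]): 0.5,
    frozenset(["t1", "t2", "t3"]): 0.5, frozenset(["t2", "t3", "t4"]): 0.5,
    frozenset(["t2", "t3", "t5"]): 0.5,
}

# Equation 3.1: drop Pa when an immediate superset Pb has the same support
closed = [pa for pa in frequent
          if not any(len(pb) == len(pa) + 1 and pa < pb
                     and frequent[pa] == frequent[pb]
                     for pb in frequent)]

print(len(closed))  # 6 closed patterns, matching the last column of Table 3.12
```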
3.3 Related Work
Many data mining methods have been proposed for knowledge discovery in the
last decade. However, most of them are developed for addressing the problem
of mining specific patterns in a reasonable and acceptable time frame from a
large transactional or relational database. Agrawal et al. [2] introduced association
rule mining, and the well-known Apriori algorithm was proposed by Agrawal
Pattern type Pattern Frequency Support Closed?
1Term
t1 3 0.75 not2 3 0.75 not3 4 1.0 yest4 2 0.5 not5 2 0.5 no
2Terms
t1, t2 2 0.5 not1, t3 3 0.75 yest2, t3 3 0.75 yest2, t4 2 0.5 not2, t5 2 0.5 not3, t4 2 0.5 not3, t5 2 0.5 no
3Termst1, t2, t3 2 0.5 yest2, t3, t4 2 0.5 yest2, t3, t5 2 0.5 yes
Table 3.12: Frequent non-sequential patterns discovered using NSPM.
and Srikant [5]. Similar algorithms for association rules mining were developed
in [3, 29, 65, 96, 107]. Some strategies were introduced in order to find association
rules efficiently, such as transaction reduction by Agrawal and Srikant [6] and a
hash-based algorithm by Park et al. [108, 109]. Many extensions of association
rule mining have been developed. Spatial association rule mining was proposed
by Koperski and Han [67]. Frequent episodes mining [97], negative association
rule mining [128] and inter-transaction association rule mining [41, 92, 146] were
proposed and discussed. Multilevel association rule mining was explored by Han
and Fu [52, 53].
Sequential pattern mining has been extensively studied in data mining
communities since the first research work by Agrawal and Srikant [7]. The same
concept was discussed by Srikant and Agrawal [143]. Since the first work, many
algorithms of sequential pattern mining were introduced, such as GSP [143],
FreeSpan [55], PrefixSpan [114], SPADE [165], CloSpan [161], TSP [149],
SLPMiner [131] and IncSpan [28]. Most of them adopt the Apriori property: all
nonempty subsets of a frequent itemset must also be frequent [54]. However,
with this policy applied, longer patterns tend not to be mined, since a static
minimum support is used for all pattern finding. Hence, a few constraint-based
algorithms [111, 116] were introduced to find longer patterns using a lower minimum
support. Moreover, Ayres et al. [18] used a bitmap representation for sequential
pattern mining. Ahonen-Myka et al. proposed several algorithms to find frequent
sequences or co-occurring phrases from textual datasets in [8, 9, 11, 12, 13]. In
addition to sequential patterns, the mining heuristic for frequent itemsets has the
goal of discovering all frequent non-sequential patterns in a database. There are
various extensions of frequent itemset mining including frequent closed itemset
mining [35, 110, 113, 166], maximal frequent itemset mining, parallel and distributed frequent
itemset mining [49, 151], mining top-k frequent itemsets from data stream [156],
constraint-based frequent itemset mining [112, 152], and mining frequent itemsets
by opportunistic projection [90].
The first attempt at applying data mining techniques to the domain of text was
made by Ahonen et al. [10], who presented experiments on discovering
phrases and co-occurring terms in text. The technique used for episode rule (i.e.,
modified association rule) mining is a bottom-up nGram method, which differs
greatly from our method in PTM. The window size needs to be defined in
the nGram method, and a frequency threshold for finding frequent co-occurring
terms is also required [8]. In PTM, the minimum support is the only parameter
that needs to be specified. Furthermore, during pattern discovery, the occurrence of
terms in a document is taken into account in the PTM method, but omitted in their
work. The key difference is that their work focused only on finding patterns in text
by the use of data mining techniques, without addressing how to use these
discovered patterns. In contrast, PTM not only adopts data mining methods to
find patterns in text, but also applies them to the domain of information filtering
in an attempt to improve the performance.
3.4 Chapter Summary
Many data mining methods, such as association rule mining, frequent itemset
mining, sequential pattern mining and closed pattern mining, have been proposed
and usually used for a transactional database. In this chapter we have presented
a novel methodology that attempts to implement data mining algorithms on the
domain of text data for knowledge discovery. For applying these methods, a
textual document can be viewed as a transactional database by splitting it based
on paragraphs. A pattern, therefore, is defined as a frequent pattern if its relative
support is greater than or equal to a pre-specified minimum support. Four types of
frequent patterns can be found in a textual dataset using our four proposed mining
algorithms: sequential pattern mining (SPM), sequential closed
pattern mining (SCPM), non-sequential pattern (i.e., itemset) mining (NSPM),
and non-sequential closed pattern (i.e., closed itemset) mining (NSCPM).
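The paragraph-splitting view above can be illustrated with a minimal sketch; the toy document and the `is_frequent` helper are hypothetical (the stemmed terms follow the examples used later in Chapter 4).

```python
# A document becomes a transactional database by splitting it on blank lines
# (paragraphs); each transaction is the set of terms in one paragraph.
doc = "carbon emiss global\n\ngreenhous global emiss\n\ncarbon air pollut"

transactions = [set(p.split()) for p in doc.split("\n\n") if p.strip()]

def is_frequent(pattern, min_sup=0.5):
    """A pattern is frequent when its relative support (fraction of
    paragraphs containing it) reaches the minimum support."""
    rel_sup = sum(pattern <= t for t in transactions) / len(transactions)
    return rel_sup >= min_sup

print(is_frequent({"emiss", "global"}))  # True: appears in 2 of 3 paragraphs
print(is_frequent({"air"}))              # False: appears in 1 of 3
```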
Pattern taxonomy is a tree-like hierarchy that preserves the sub-sequence
(i.e., "is-a") relationships between discovered sequential patterns. Unlike
the term independence assumption usually adopted in traditional
information retrieval methods, the pattern taxonomy in PTM can preserve the
semantic information embedded in the text data. By the use of the pattern pruning
strategy, the number of pattern candidates can be dramatically reduced during the
process of pattern discovery, resulting in a great improvement in the efficiency of
PTM. On the other hand, the effectiveness of the system can also be improved by
the removal of redundant patterns. The related experimental results are presented
in Chapter 6.
Using the confidence of a pattern for evaluation encounters two problems. One
is the so-called low pattern frequency problem, which arises mainly because few
matched patterns can be found in the training stage when the length of the pattern is
long. The other problem is that a specific long pattern does not obtain a proper
weighting score, leading to unsatisfactory performance. Therefore, a proper
pattern processing method to overcome these problems is desirable.
In summary, the proposed mining algorithms tackle the limitation of applying
data mining mechanisms to the text domain and provide the fundamental
prototype required for the development of PTM. PTM adopts the SCPM algorithm
for pattern discovery, together with the pattern pruning scheme used to eliminate
redundant patterns, resulting in improved efficiency. Moreover, the
problem of how to use discovered patterns is identified, and feasible solutions will
be discussed and presented in the following chapters.
Chapter 4
Pattern Deploying Methods
In this chapter, we propose two novel approaches that attempt to address
the drawback caused by the inadequate use of discovered patterns. In
the previous chapter, we discussed and provided various methods for mining
desired patterns by the use of data mining techniques. We also pointed out
the difficulty of transferring these techniques and presented preliminary solutions to
alleviate the problem. However, the issue of how to exploit discovered
patterns remains unsolved. One of the easiest ways to use discovered patterns
is to treat them as atoms in the feature space to represent the concept of a set
of documents. The significance of patterns can then be estimated by assigning an
evaluated value based on one of the existing weighting functions. Nevertheless,
if this representation method is used, the same mechanism used for pattern discovery
is required in the phase of document evaluation in order to find matched patterns.
Such an approach is time-consuming and ineffective because
of the computational expensiveness inherent in data mining-based methods and
the unsolved low-frequency problem for long patterns. Therefore, an efficient
and effective pattern evaluation methodology is needed after the phase of pattern
discovery in a knowledge discovery system.
Figure 4.1: Deploying patterns into a term space.
4.1 Pattern Deploying
The properties of patterns (e.g., support and confidence) used by data mining-
based methods in the phase of pattern discovery are not suitable to be adopted
in the phase of using discovered patterns [158]. Therefore, in this chapter we re-
evaluate the properties of patterns by deploying them into a common hypothesis
space based on their correlations to the pattern taxonomies. A fundamental
mechanism, the Pattern Deploying Method (PDM), is introduced first to implement
pattern deploying, followed by the method of Pattern Deploying based on
Support (PDS).
The simplified concept of PDM is illustrated in Figure 4.1. There is no doubt
that a pattern consisting of more terms is considered more specific, but its
frequency is relatively low. A short pattern, however, has more influence on
the judgement of relevance in document evaluation due to its high frequency. In
particular, we need the former to help distinguish the relevance of documents,
especially in an information filtering system. For instance, comparing the pattern
“Sequential Pattern Mining” with pattern “Mining” in Figure 4.1, the former
is obviously more helpful than the latter since the former carries more specific
information.
To use these patterns, two inevitable issues arise:
(1) How to emphasise the significance of specific patterns and avoid the low-
frequency problem.
(2) How to eliminate the interference from general patterns, which usually
have high frequency.
As mentioned in the previous chapter, data mining methods by nature generate
a large number of short patterns during the phase of pattern
discovery. One way to reduce this large number of patterns is
the adoption of pattern pruning in the mining algorithm. The pattern
pruning strategy we used eliminates the sub-sequences of maximum sequential patterns
if their supports are the same. That means patterns which always co-occur
with their parent patterns in the same transaction are redundant patterns and
need to be discarded. This induces the sequential closed pattern mining approach,
which allows us to mine closed patterns only. Therefore, such a strategy provides
a partial solution to the second aforementioned issue, through the removal of a
large number of sub-sequences (i.e., short patterns).
Despite the redundant short patterns discarded by the use of pattern pruning in
the SCPM method, some short patterns still remain. These patterns
can be classified into two main groups. The first group contains patterns which
are themselves closed but short in length (e.g., 2Terms or 3Terms closed patterns),
meaning these patterns have no parent patterns. The remaining short closed
Figure 4.2: Overlaps between discovered patterns.
patterns, which have parent patterns, are classified into the second group. We pay
attention to the second group, since the closed patterns in the first group have been
widely discussed. Patterns in the second group are considered potentially
useful since they do not always co-occur with their parent patterns in the same
paragraph, meaning that they also appear on their own several times in other paragraphs of a
document. We believe such short patterns carry significant information
with reference to the related concept of the topic and have to be taken into account.
Therefore, we define these patterns as significant short patterns.
Other than the above-mentioned closed patterns and significant short patterns,
the correlation among patterns from different pattern taxonomies also draws our
attention and needs to be clarified. As mentioned before, many pattern taxonomies
may be automatically formed after all sequential patterns are found in the phase
of pattern discovery. All patterns under a pattern taxonomy contain a
subset of terms derived from the longest closed pattern in the same taxonomy,
the root of the pattern taxonomy. Therefore, two patterns from different pattern
taxonomies should not have a subset relationship, but may share some terms.
Note that the sequential pattern in this case is simply viewed as a regular pattern.
For example, p1 and p2 are two patterns which have an inter-taxonomy correlation
between them, such that p1 ∩ p2 ≠ ∅, p1 ⊈ p2 and p2 ⊈ p1. Figure 4.2 illustrates
the inter-taxonomy correlation among several patterns. Patterns from different
taxonomies may have overlaps and share some elements (i.e., terms), but not all
of them have such a phenomenon. For example, p1 and p2 in Figure 4.2 have
an overlap and share two common terms, whereas p1 and p4 are independent
and there is no intersection between them. On the other hand, a pattern can
have inter-taxonomy relationships with more than one pattern. For instance, p1
shares one term with p3 and overlaps another two with p2 in the above-mentioned
figure. In summary, a term which appears in many pattern overlaps occurs in many
pattern taxonomies, implying the potential significance and usefulness the term
can offer. Therefore, by appropriately evaluating the inter-taxonomy correlation
between the involved patterns, the capability of describing the context of documents
in a knowledge discovery system can be improved.
In order to estimate the usefulness of significant short patterns, and considering
the ease of applying patterns to document evaluation, deploying patterns into a
feature space is an effective and efficient methodology for tackling the challenging
issues of dealing with discovered patterns. In terms of effectiveness, by deploying
patterns into a feature space, significant terms with high appearance in the overlap
areas can be accentuated and emphasised through the accumulation of their
occurrences during pattern evaluation. Details of this process will
be presented in Section 4.1.1 and Section 4.1.2. With regard to efficiency, the
components of the feature space are short individual terms instead of
long sequential or non-sequential patterns. As a result, there is no need to find such
long patterns in the phase of document evaluation, which would require the effort of
Figure 4.3: Flowchart of pattern deploying methods in Pattern Taxonomy Model.
the computationally expensive mining algorithms. Hence, such a replacement saves
a long run time and greatly improves the efficiency
of the system.
The process of pattern deploying is depicted in Figure 4.3, which shows the
flowchart of the PTM model featuring the pattern deploying methods (i.e., PDM
and PDS). Starting from the documents (in the case of textual data), several
pattern taxonomies can be built by finding informative patterns using data mining
methods. On the other hand, a feature space consisting of a set of individual
terms is generated by the use of traditional document indexing techniques. At
the next step, the created pattern taxonomies and feature space can then be
used to represent the concept of documents by applying a data mining-based
method (e.g., SPM) and the traditional Vector Space Method (VSM) respectively.
However, both approaches have inevitable limits and drawbacks, which have been
mentioned and discussed earlier in this section. In general, SPM brings both
effectiveness and efficiency problems caused by the use of time-consuming
Figure 4.4: The process of merging pattern taxonomies into the feature space.
heuristics in pattern discovery. VSM, in turn, struggles to achieve further
improvements in effectiveness. To overcome these problems, a novel methodology
is proposed in this thesis. By deploying patterns into the feature space, PDM
and PDS not only benefit from the use of sequential patterns, which keep the useful
semantic information, but also greatly improve the system efficiency by avoiding
the time-consuming pattern discovery approaches in the phase of
document evaluation.
4.1.1 Pattern Deploying Method (PDM)
The PDM is proposed to address the problem caused by the
inappropriate evaluation of patterns discovered using data mining methods. Data
mining methods, such as SPM and NSPM, utilise discovered patterns directly
without any modification and thus suffer from the low-frequency problem
on specific patterns. Instead of using patterns individually, mapping patterns to
a common hypothesis space is considered in order to re-evaluate and emphasise
the specific patterns. The concept of mapping is illustrated in Figure 4.4, which
merges the patterns under all mined taxonomies into a feature space. This approach
tackles the aforementioned issues through the following strategies:
- Simplifying the feature space to reduce the computational complexity in the
phase of document evaluation.
- Reducing the size of the feature space to improve the efficiency.
- Deploying specific patterns to emphasise their levels of significance and
avoid the low-frequency problem.
- Emphasising specific patterns to reduce the interference from the general
patterns.
- Taking into account correlation of pattern taxonomies to evaluate the
significant short patterns.
- Accumulating the weight of terms in the overlap area to estimate their levels
of significance.
Upon implementing the above strategies, the goal of improving the effectiveness
and efficiency of a pattern-based knowledge discovery system can be achieved.
Details regarding the definitions and implementation of the PDM are
presented as follows.
Firstly, the common hypothesis space used in this chapter is defined as T,
a set of terms. For any set of terms X, its covering set is

coverset(X) = {p | p ∈ SP, X ⊆ p}   (4.1)

where p denotes a sequential pattern and SP is the set of sequential patterns
discovered using the SPMining algorithm proposed in the previous chapter. This definition is
similar to the definition of coverset used in Li and Zhong [86]. However, in this
study coverset refers to a set of patterns rather than a set of transactions. Given a
set of documents D, it consists of positive and negative document sets which can
be denoted as D+ and D− respectively. A set of positive documents is then given
as an example in the following table.
Doc.   Pattern taxonomies   Sequential patterns
d1     PT(1,1)              〈carbon〉4, 〈carbon, emiss〉3
       PT(1,2)              〈air, pollut〉2
d2     PT(2,1)              〈greenhous, global〉3
       PT(2,2)              〈emiss, global〉2
d3     PT(3,1)              〈greenhous〉2
       PT(3,2)              〈global, emiss〉2
d4     PT(4,1)              〈carbon〉3
       PT(4,2)              〈air〉3, 〈air, antarct〉2
d5     PT(5,1)              〈emiss, global, pollut〉2

Table 4.1: Example of a set of positive documents consisting of pattern
taxonomies. The number beside each sequential pattern indicates the absolute
support of the pattern.
For each positive document d ∈ D+, a set of patterns is discovered in order to
be merged into the dedicated vector:

~dk = < (tk1, nk1), (tk2, nk2), . . . , (tkm, nkm) >   (4.2)

where tki in the pair (tki, nki) denotes an individual term and nki = |coverset(tki)|
is the total support obtained from all patterns in ~dk. For example, documents in
Table 4.1 can be represented by the following vectors:
~d1 = < (carbon, 2), (emiss, 1), (air, 1), (pollut, 1) >
~d2 = < (greenhous, 1), (global, 2), (emiss, 1) >
~d3 = < (greenhous, 1), (global, 1), (emiss, 1) >
~d4 = < (carbon, 1), (air, 2), (antarct, 1) >
~d5 = < (emiss, 1), (global, 1), (pollut, 1) >
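A rough sketch of how these vectors arise, using the taxonomies of Table 4.1 (supports dropped, since this step only counts |coverset(t)|); the `deploy` helper and the `docs` structure are illustrative, not the thesis code.

```python
# Pattern taxonomies of Table 4.1, kept as lists of term lists per document
docs = {
    "d1": [["carbon"], ["carbon", "emiss"], ["air", "pollut"]],
    "d2": [["greenhous", "global"], ["emiss", "global"]],
    "d3": [["greenhous"], ["global", "emiss"]],
    "d4": [["carbon"], ["air"], ["air", "antarct"]],
    "d5": [["emiss", "global", "pollut"]],
}

def deploy(patterns):
    """n_t = |coverset(t)|: the number of discovered patterns containing t."""
    vec = {}
    for p in patterns:
        for t in p:
            vec[t] = vec.get(t, 0) + 1
    return vec

print(deploy(docs["d1"]))  # {'carbon': 2, 'emiss': 1, 'air': 1, 'pollut': 1}
```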
Then the specificity of a pattern p in a document ~dk can be evaluated by the
following definition:

specificity(p, ~dk) = ∑_{t ∈ p, (t, n) ∈ ~dk} n

For the example documents in Table 4.1, the specificity of pattern 〈carbon,
emiss〉 in ~d1 is derived as specificity(〈carbon, emiss〉, ~d1) = 2 + 1 = 3. The
higher the value, the more specific the pattern. When two patterns are in the
same pattern taxonomy, the longer pattern obtains a higher specificity than the
shorter one. According to this definition, we can easily prove the
following theorem.

Theorem 4.1. Let p1 and p2 be patterns found in document ~dk. We have
specificity(p1, ~dk) ≤ specificity(p2, ~dk) if p1 ⊆ p2.
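The specificity definition and Theorem 4.1 can be checked with a small sketch; the `specificity` helper is illustrative.

```python
# Deployed vector of d1 from the running example
d1 = {"carbon": 2, "emiss": 1, "air": 1, "pollut": 1}

def specificity(pattern, d):
    """Sum the deployed count n of every term of the pattern present in d."""
    return sum(d[t] for t in pattern if t in d)

print(specificity(["carbon", "emiss"], d1))  # 2 + 1 = 3
# Theorem 4.1: a subset never scores higher than its superset
print(specificity(["carbon"], d1))           # 2
```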
In order to develop an efficient algorithm to evaluate such a kind of
representation for each positive document, the composition operation defined
in [86] is adopted for merging any two patterns. For this purpose, we firstly
expand a pattern as a set of term integer pairs. For example, 〈greenhous,
global〉 and 〈emiss, global〉 are two frequent sequential patterns derived from the
sample database in Table 4.1. Their expanded forms can be denoted as pa =
〈(greenhous, 1), (global, 1)〉 and pb = 〈(emiss, 1), (global, 1)〉, respectively.
Moreover, we need a function to extract terms from a pattern. Given a pattern
p in its expanded form, we can use the "termset" function to obtain the term list in p,
which satisfies

termset(p) = {t | (t, f) ∈ p}.

Using the above-mentioned patterns as an example, the termsets of pa
and pb are termset(pa) = {greenhous, global} and termset(pb) =
{emiss, global}. Note that the patterns themselves and their expanded forms are
not particularly distinguished unless it is necessary to do so.
Patterns can be merged using the following composition operation. The
composition of two patterns p1 and p2 can be processed using the following
equation:

p1 ⊕ p2 = {(t, f1 + f2) | (t, f1) ∈ p1, (t, f2) ∈ p2} ∪
          {(t, f) | t ∈ (termset(p1) ∪ termset(p2)) −
          (termset(p1) ∩ termset(p2)), (t, f) ∈ p1 ∪ p2}   (4.3)

For example, the composition of the aforementioned patterns, pa ⊕ pb, can be
denoted as p′ where

p′ = pa ⊕ pb = {(greenhous, 1), (emiss, 1), (global, 2)}.
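A minimal sketch of the composition operation in Equation 4.3, assuming patterns in expanded form are stored as term-to-frequency dictionaries (the `compose` name is illustrative):

```python
def compose(p1, p2):
    """Composition operator (Equation 4.3): frequencies of shared terms are
    summed; terms occurring in only one pattern keep their frequency."""
    merged = dict(p1)
    for t, f in p2.items():
        merged[t] = merged.get(t, 0) + f
    return merged

pa = {"greenhous": 1, "global": 1}
pb = {"emiss": 1, "global": 1}
print(compose(pa, pb))  # {'greenhous': 1, 'global': 2, 'emiss': 1}
```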
The detailed process of pattern deploying is presented in Algorithm 4.1. Note
that the SPMining (Algorithm 3.1) is used in line 4 for generating frequent
sequential patterns. The main process of pattern deploying occurs between line 6
and line 8 inclusively. The output of this algorithm is a set of vectors.
Algorithm 4.1. PDM(D+, min sup)
Input: a list of positive documents, D+; minimum support, min sup.
Output: a set of vectors, ∆.
Method:
1: ∆← ∅
2: foreach document d in D+ do begin
3: extract 1Terms frequent patterns PL from d
4: SP = SPMining(PL, min sup) // Call Algorithm 3.1
5: ~d← ∅
6: foreach pattern p in SP do begin
7: ~d← ~d⊕ p′ // p′ is the expanded form of p
8: end for
9: ∆← ∆ ∪ ~d
10: end for
The inputs of the algorithm PDM are a set of positive documents and a
pre-specified minimum support. In line 4 of this algorithm, a set of sequential
patterns is discovered by calling the algorithm SPMining (in Section 3.1) for
each document. So far, only positive documents are considered and used in
this approach. The use of information from negative documents is another issue,
related to pattern evolution, which will be investigated and discussed in
Chapter 5.
At the next step in line 6 to 8, each pattern is firstly transferred into
an expanded form and then merged into a temporary storage using pattern
composition operator (Equation 4.3). As a result, the deployed pattern (i.e., the set
of term weight pairs) for each document is obtained. For example, the deployed
patterns of five sample documents in Table 4.1 can be expressed as ∆:
~d1 = {(carbon, 2), (emiss, 1), (air, 1), (pollut, 1)}
~d2 = {(greenhous, 1), (global, 2), (emiss, 1)}
~d3 = {(greenhous, 1), (global, 1), (emiss, 1)}
~d4 = {(carbon, 1), (air, 2), (antarct, 1)}
~d5 = {(emiss, 1), (global, 1), (pollut, 1)}
We keep ∆ as the training result for further processing in the pattern
evolution stage. To deploy the patterns, each document vector in ∆ is normalised
first, and the feature space is updated by summing up the weight value for each
corresponding term using the aforementioned pattern composition until all vectors
in ∆ are processed. For instance, the gradual updating of the feature space for the
above-mentioned example is illustrated as follows:

~d1 = {(carbon, 2/5), (emiss, 1/5), (air, 1/5), (pollut, 1/5)}
~d1 ⊕ ~d2 = {(carbon, 2/5), (emiss, 9/20), (air, 1/5), (pollut, 1/5),
             (greenhous, 1/4), (global, 1/2)}
...
d̂ = {(carbon, 13/20), (emiss, 67/60), (air, 7/10), (pollut, 8/15),
     (greenhous, 7/12), (global, 7/6), (antarct, 1/4)}
As can be seen in the above example, terms emiss and global are more likely
to gain higher scores than the others. This is due to their high appearance among
sequential patterns. By applying pattern deploying, the significance these terms
possess can therefore be expressed. As a result, a significant long sequential
pattern can be effectively exploited and becomes useful through the emphasis of
its high-frequency components. In contrast, these high-frequency terms cannot be
fully exploited by the SPM or SCPM methods, since they are likely to be trapped in
low-frequency patterns. Furthermore, the major difference between PDM and a
keyword-based method (e.g., TFIDF) is that the former utilises the information of
pattern correlation in taxonomies, whereas the latter evaluates terms using simple
statistics only. In other words, the deployed terms in PDM carry informative
properties inherited from the patterns which contain them, rather than being
independent terms without any relation to other terms or patterns, as in the
keyword-based methods.
The output of the algorithm PDM is a set of term-weight pairs, which can be
viewed as the feature space used to represent the concept of the specified documents in a
knowledge discovery system. The weighting scheme for a given term ti in the feature
space is denoted by the following function:

weight(ti) = ∑_{~dk ∈ ∆, (ti, nki) ∈ ~dk} ( nki / ∑_{(t, w) ∈ ~dk} w ).   (4.4)
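Equation 4.4 can be checked against the running example with a short sketch; exact fractions are used to reproduce the values quoted in the text, and the helper name is illustrative.

```python
from fractions import Fraction

# The deployed vectors ∆ produced by PDM for the five sample documents
delta = [
    {"carbon": 2, "emiss": 1, "air": 1, "pollut": 1},
    {"greenhous": 1, "global": 2, "emiss": 1},
    {"greenhous": 1, "global": 1, "emiss": 1},
    {"carbon": 1, "air": 2, "antarct": 1},
    {"emiss": 1, "global": 1, "pollut": 1},
]

def weight(term):
    """Equation 4.4: sum of the term's normalised share in every vector."""
    return sum(Fraction(d[term], sum(d.values()))
               for d in delta if term in d)

print(weight("carbon"))  # 13/20
print(weight("emiss"))   # 67/60
```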
The time complexity of the composition operation is O(m) if the pairs
in patterns are sorted, where m is the average length of patterns and the basic
operation is a comparison between terms. The complexity of the
pattern compositions during the process of pattern deploying is O(nN), where n
is the number of positive documents and N is the average number of discovered
patterns per positive document. Therefore, the overall time complexity of the
main process of pattern deploying is O(nNm) if the basic operation is still the
comparison between terms.
4.1.2 Pattern Deploying based on Supports (PDS)
PDM adopts the methodology of mapping discovered patterns into a hypothesis
space in an attempt to overcome the low-frequency problem pertaining to
specific long patterns. By simply deploying patterns through a pattern
composition operator, the goal of preserving the significant information embedded
in specific patterns can be achieved. The significant short patterns, i.e., the terms
appearing in the overlaps of patterns, can be emphasised as well. However, the
pattern's support, a useful and essential property of a pattern, is not taken into
account by the PDM method. For instance, the discovered pattern 〈carbon〉 in
Table 4.1 acquires an absolute support of 4 in document d1 and 3 in document
d4, but the evaluated score for this term is as low as 13/20 in the feature space,
compared to 67/60 for another term, "emiss", which appears only two more times
in supports. Hence, it is doubtful that the term "emiss" should be estimated to be
nearly twice as significant as the term "carbon" in this case. This phenomenon is caused
by disregarding the pattern's support during the pattern evaluation
process. In the algorithm PDM, the discovered patterns are treated
equally and given equivalent weights. Therefore, the support of a pattern needs
to be considered when the feature's significance is evaluated.

In this section, a novel pattern deploying method utilising more
properties of a pattern is proposed. Different from the PDM discussed in
Section 4.1.1, the pattern's support obtained in the phase of pattern discovery
is taken into account when we deploy patterns into a common hypothesis space.
A probability function is also introduced to estimate the feature's significance.
By using SPMining (Algorithm 3.1), we can acquire a set of frequent
sequential patterns SP for each document d ∈ D+, such that SP =
{p1, p2, . . . , pn}. The absolute support suppa(pi) for each pi ∈ SP is obtained as
well. We first normalise the absolute support of each discovered pattern based
on the following function:

support :: SP → [0, 1]

such that

support(pi) = suppa(pi) / ∑_{pj ∈ SP} suppa(pj)   (4.5)
For example, after we apply the above function to the sample database in
Table 4.1, the new support of each pattern can be calculated and the result is
listed in Table 4.2.
Doc.   Sequential patterns        Support
d1     〈carbon〉                   4/9
       〈carbon, emiss〉            1/3
       〈air, pollut〉              2/9
d2     〈greenhous, global〉        3/5
       〈emiss, global〉            2/5
d3     〈greenhous〉                1/2
       〈global, emiss〉            1/2
d4     〈carbon〉                   3/8
       〈air〉                      3/8
       〈air, antarct〉             1/4
d5     〈emiss, global, pollut〉    1

Table 4.2: Patterns with their supports from the sample database.
Based on the above normalisation, the expanded form of pattern pi can be
represented in the following format:

pi = 〈(ti,1, fi,1), (ti,2, fi,2), . . . , (ti,m, fi,m)〉

where

fi,j = support(pi) / m
It is obvious that the composition operation stated in Section 4.1.1 remains
applicable to the expanded forms of patterns in this format. Details
of the deploying process are presented in Algorithm 4.2; its result is a vector
~d consisting of term weight pairs. Note that the input is a set of discovered
sequential patterns SP, not a set of documents as required in PDM.
Algorithm 4.2. PDS(SP)
Input: a set of frequent sequential patterns, SP.
Output: a vector of features in expanded form, ~d.
Method:
1: sum supp = 0, ~d← ∅
2: foreach pattern p in SP do begin
3: sum supp += suppa(p)
4: end for
5: foreach pattern p in SP do begin
6: f = suppa(p)/(sum supp× len(p))
7: p′ ← ∅
8: foreach term t in p do begin
9: p′ ← p′ ∪ (t, f)
10: end for
11: ~d← ~d⊕ p′
12: end for
The first step of the algorithm PDS is to initialise the parameters in line 1. Then, in lines 2 to 4, the absolute support of each pattern in SP is summed up and stored for further reference. The value of f in the expanded form of each pattern p is estimated and assigned to all terms in p. This operation is completed in line 6; then, in lines 8 to 10, each term weight pair in pattern p is transferred into a temporary space p′. Lastly, the final step of the algorithm PDS is to merge p′ into the vector ~d, which is returned as output once all patterns in SP have been processed.
Using the data in Table 4.2 as an example, after processing all of the documents,
the result of the algorithm PDS for each of them will be
~d1 = (carbon, 4/9+1/6), (emiss, 1/6), (air, 1/9), (pollut, 1/9)
~d2 = (greenhous, 3/10), (global, 3/10+1/5), (emiss, 1/5)
~d3 = (greenhous, 1/2), (global, 1/4), (emiss, 1/4)
~d4 = (carbon, 3/8), (air, 3/8+1/8), (antarct, 1/8)
~d5 = (emiss, 1/3), (global, 1/3), (pollut, 1/3).
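The deployment above can be reproduced with a small sketch of Algorithm 4.2, assuming each document’s patterns are given as term tuples with absolute supports (the data layout is illustrative):

```python
from fractions import Fraction

def pds(sp):
    """Sketch of Algorithm 4.2 (PDS): deploy one document's frequent
    sequential patterns into a term-weight vector. `sp` maps each
    pattern (a tuple of terms) to its absolute support."""
    sum_supp = sum(sp.values())                # lines 2-4: total support
    d = {}
    for p, supp in sp.items():
        f = Fraction(supp, sum_supp * len(p))  # line 6: weight per term
        for t in p:                            # lines 8-10: expand the pattern
            d[t] = d.get(t, Fraction(0)) + f   # accumulate, implementing the ⊕ merge
    return d

d1 = pds({("carbon",): 4, ("carbon", "emiss"): 3, ("air", "pollut"): 2})
print(d1["carbon"])  # 4/9 + 1/6 = 11/18
```

Each resulting vector sums to 1, which matches Theorem 4.2 below.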
The value of f in the expanded form (t, f) indicates the relative significance of the term t. As mentioned before, the value of f for the term “carbon” is unlikely to be appropriately evaluated in PDM, since it is given a much lower value than that of the term “emiss”, which scores roughly twice as high. In PDS, however, the values of f for both terms are estimated to be nearly the same: 71/72 for “carbon” compared with 19/20 for “emiss”. The
difference between these two terms’ estimated significance values is reduced in PDS because the support of patterns is considered and used to re-evaluate the patterns. Moreover, all documents processed in PDS are treated as equally important, meaning that the sum of term values in the expanded form of each document is assumed to be constant.
Theorem 4.2. Let ~d be the vector returned by the algorithm PDS. We have

∑_{(fst,snd) ∈ ~d} snd = 1.

Proof. According to lines 5 to 10 in the algorithm PDS, we have

∑_{(fst,snd) ∈ ~d} snd = ∑_{p ∈ SP} ∑_{(t,f) ∈ p} suppa(p) / (sum_supp × len(p))
                       = (1 / sum_supp) ∑_{p ∈ SP} ∑_{(t,f) ∈ p} suppa(p) / len(p)
                       = (1 / sum_supp) ∑_{p ∈ SP} suppa(p)
                       = 1.
Although the algorithm PDS processes one document at a time, a set of vectors ∆ can be obtained by calling PDS once per document until all specified documents have been processed. Formally, the relation between the vectors and the common hypothesis space can be described as follows:
β :: ∆ → 2^(T×[0,1]) − {∅}

such that

β(d) = {(t1, f1), (t2, f2), . . . , (tn, fn)} ⊆ T × [0, 1]    (4.6)
Generally speaking, the concept of relevance is subjective, and it can be represented on scales of various granularities. For example, we may use “0” to denote non-relevance, “1” marginal relevance, “2” fair relevance, and “3” high relevance. The simplest case uses “0” for non-relevant and “1” for relevant. A relevance function can therefore be used to describe the extent of relevance of each positive document. We can also normalise the relevance function so that it satisfies:

∑_{d ∈ D+} relevance(d) = 1.
Based on the above assumptions, a probability function can be derived to substitute the weighting scheme (Equation 4.4) for all terms ti ∈ T, which satisfies:

prβ(ti) = ∑_{~d ∈ ∆, (ti,f) ∈ ~d} relevance(d) × f    (4.7)
Theorem 4.3. Let prβ(ti) = ∑_{~d ∈ ∆, (ti,f) ∈ ~d} relevance(d) × f; then prβ is a probability function on T.

Proof. From the above definitions and Theorem 4.2, we have:

∑_{ti ∈ T} prβ(ti) = ∑_{ti ∈ T} ∑_{~d ∈ ∆, (ti,f) ∈ ~d} relevance(d) × f
                   = ∑_{~d ∈ ∆} ∑_{(fst,snd) ∈ ~d} relevance(d) × snd
                   = ∑_{~d ∈ ∆} relevance(d) ∑_{(fst,snd) ∈ ~d} snd
                   = ∑_{~d ∈ ∆} relevance(d)
                   = 1.
Then, the specificity for all patterns p can be defined as follows.
specificity(p) = ∑_{t ∈ T} prβ(t) τ(t, p)

where

τ(t, p) = 1 if t ∈ p, and 0 otherwise.    (4.8)
It is obvious that the specificity function defined in this sub-section also satisfies Theorem 4.1. As a result, after all documents in Table 4.2 are processed by PDS, the feature weight pairs in the hypothesis space can be presented as

{(carbon, 71/72), (emiss, 19/20), (air, 11/18), (pollut, 4/9), (greenhous, 4/5), (global, 13/12), (antarct, 1/8)}
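Equations 4.7 and 4.8 can be sketched over the five deployed vectors above, assuming for illustration a uniform normalised relevance of 1/5 per document; the names and data layout here are ours, not the thesis’s:

```python
from fractions import Fraction

F = Fraction
# The five deployed vectors ~d1 ... ~d5 computed earlier by PDS.
vectors = [
    {"carbon": F(4, 9) + F(1, 6), "emiss": F(1, 6), "air": F(1, 9), "pollut": F(1, 9)},
    {"greenhous": F(3, 10), "global": F(3, 10) + F(1, 5), "emiss": F(1, 5)},
    {"greenhous": F(1, 2), "global": F(1, 4), "emiss": F(1, 4)},
    {"carbon": F(3, 8), "air": F(3, 8) + F(1, 8), "antarct": F(1, 8)},
    {"emiss": F(1, 3), "global": F(1, 3), "pollut": F(1, 3)},
]
relevance = F(1, len(vectors))  # assumed uniform relevance, summing to 1

pr = {}
for d in vectors:
    for t, f in d.items():
        pr[t] = pr.get(t, F(0)) + relevance * f  # Equation 4.7

def specificity(pattern):
    """Equation 4.8: sum pr_beta over the terms occurring in the pattern."""
    return sum(pr.get(t, F(0)) for t in pattern)

print(sum(pr.values()))  # 1, as guaranteed by Theorem 4.3
```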
4.2 Related Work
This chapter presents a novel concept for effectively dealing with discovered patterns. Two approaches are introduced to implement the proposed methodology by deploying discovered patterns into a specified hypothesis space in an attempt to overcome the underlying problem within data mining-based methods. In [159], with regard to pattern properties, the pattern’s confidence
was estimated and exploited in the phase of using discovered patterns. The
result indicated that such an application of a pattern’s confidence is feasible
and outperforms TFIDF and other traditional probabilistic methods. However,
some problems with the use of confidence for document evaluation still remained
unsolved, such as the overlap among discovered patterns and low-frequency
problems in specific patterns [157]. In terms of interpretation of patterns, Li [81]
introduced a novel approach for interpreting discovered patterns by using the
random set concept. Li and Zhong [84] presented an in-depth discussion on the
interpretation of association rules. Furthermore, an extended random set-based
method was proposed by Li et al. [83] for deploying mined association rules into
a hypothesis space. In our approach, we deploy features which are on the pattern
level rather than the terms on the document level used by the other approaches.
Moreover, the interestingness criterion used in our method for pattern discovery differs from those of the other approaches.
4.3 Chapter Summary
In this chapter, we propose two novel approaches for deploying discovered
patterns in order to address the fundamental problem caused by the inadequate
use of these patterns. In the phase of using discovered patterns, patterns can be treated as components in the feature space and evaluated in the same way as in a keyword-based method. Nevertheless, such an approach provides insufficient capability for reasoning about patterns because it relies on a weak pattern property. The confidence of a pattern, adopted in data mining-based methods, is a weak property since it induces the low-frequency problem, resulting in ineffective performance for a knowledge-based system. The concept of deploying patterns proposed in this thesis is a novel solution to this problem.
Chapter 5
Evolution of Discovered Patterns
In Chapter 4, pattern deploying methods were proposed for the use of discovered knowledge. However, not all discovered patterns are suitable for describing interesting topics, since some noise patterns are extracted from the training dataset [85]. In this chapter, two methods employing pattern evolution are proposed and developed: Deployed Pattern Evolution (DPE) and Individual Pattern Evolution (IPE). Their basic definitions and algorithms are also presented.
5.1 Deployed Pattern Evolution
In the previous chapter, the PTM model was significantly improved by the adoption of the pattern deploying method PDS, which maps discovered patterns into a hypothesis space in order to solve the low-frequency problem pertaining to specific long patterns. However, information from the negative examples has not yet been exploited during concept learning. There is no doubt that negative documents contain much useful information for identifying ambiguous patterns in the concept. For example, a pattern may be a good indicator
Document  Sequential pattern set
d1        〈carbon〉, 〈carbon, emiss〉, 〈air, pollut〉
d2        〈greenhous, global〉, 〈emiss, global〉
d3        〈greenhous〉, 〈global, emiss〉
d4        〈carbon〉, 〈air〉, 〈air, antarct〉
d5        〈emiss, global, pollut〉

Table 5.1: Examples of positive documents which are represented by a set of sequential patterns mined using PTM.
to identify relevant documents if this particular pattern always appears in the positive examples, but not if it also appears in the negative examples at times. Therefore, it is necessary for a system to exploit these ambiguous patterns in the negative examples in order to reduce their influence.
The concept of pattern evolution was introduced by Li and Zhong [86]. We adopt this concept and propose the DPE approach for a PTM-based system, which deals with the deployed patterns rather than the terms.
5.1.1 Basic Definition of DPE
Given a set of documents D = {d1, d2, . . . , d|D|}, where ~dk = 〈(tk1, nk1), (tk2, nk2), . . . , (tkm, nkm)〉 with the same definition as in Section 4.1.1, the threshold of these documents can be estimated by using the following equation:

Threshold(D) = min_{~di ∈ D} ∑_{(tj, nk) ∈ ~di} nk    (5.1)

where the weight of each term is determined using the PDM term weighting function as in Equation 4.4.
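A minimal sketch of Equation 5.1, reading Threshold(D) as the minimum total term weight over the given documents (an assumption on our part, since the equation is used as a numeric threshold in Algorithm 5.1):

```python
def threshold(docs):
    """Sketch of Equation 5.1: the smallest sum of term weights
    over the given deployed vectors."""
    return min(sum(d.values()) for d in docs)

# Un-normalised deployed patterns from Table 5.2 (weights are term counts).
dps = [
    {"carbon": 2, "emiss": 1, "air": 1, "pollut": 1},  # dp1
    {"greenhous": 1, "emiss": 1, "global": 2},         # dp2
    {"greenhous": 1, "emiss": 1, "global": 1},         # dp3
    {"carbon": 1, "air": 2, "antarct": 1},             # dp4
    {"emiss": 1, "global": 1, "pollut": 1},            # dp5
]
print(threshold(dps))  # 3 (dp3 and dp5 have the smallest total weight)
```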
Table 5.1 presents the examples of positive documents which are represented
Name  Support  Deployed pattern (vector)
dp1   1        (carbon, 2), (emiss, 1), (air, 1), (pollut, 1)
dp2   1        (greenhous, 1), (emiss, 1), (global, 2)
dp3   1        (greenhous, 1), (emiss, 1), (global, 1)
dp4   1        (carbon, 1), (air, 2), (antarct, 1)
dp5   1        (emiss, 1), (global, 1), (pollut, 1)

Table 5.2: Deployed patterns from the document examples.
Name  Support  Normalised deployed pattern
dp1   1/5      (carbon, 2/5), (emiss, 1/5), (air, 1/5), (pollut, 1/5)
dp4   1/5      (carbon, 1/4), (air, 1/2), (antarct, 1/4)
dp5   1/5      (emiss, 1/3), (global, 1/3), (pollut, 1/3)
dp6   2/5      (greenhous, 7/12), (emiss, 7/12), (global, 5/6)

Table 5.3: dp2 and dp3 are replaced by dp6 and the deployed patterns are normalised.
by a set of sequential patterns mined using PTM whereas Table 5.2 shows the
deployed patterns from these document examples. For instance, in Table 5.1, although documents d2 and d3 do not have the same set of sequential patterns, we can still see that they share the same termset, since termset(d2) = termset(d3) = {greenhous, emiss, global}, as shown in Table 5.2. Therefore, we compose vectors with the same termset into one:

dp6 = dp2 ⊕ dp3 = {(greenhous, 1/4+1/3), (emiss, 1/4+1/3), (global, 1/2+1/3)}
Let Ω be a set of deployed patterns. For each document in Table 5.1, the representation of the document can be transformed from a set of discovered sequential patterns into a set of terms using the pattern deploying method PDM. The resulting set of terms is therefore denoted a “deployed pattern” in this approach.
Figure 5.1: A negative document nd and its offending deployed patterns.
A negative document nd is a document that the system falsely identifies as positive. An offender of nd is a deployed pattern that contains at least one component appearing in nd. The set of offenders of nd is defined by:

∆p = {dp ∈ Ω | termset(dp) ∩ nd ≠ ∅}    (5.2)
Figure 5.1 illustrates the relationship between a negative document nd and its
offenders. Given a set of terms T, each term t ∈ T can be classified into one of four categories:

• “X” type: {t ∈ T | t ∈ termset(dpk), termset(dpk) ⊆ nd}.
• “Y” type: {t ∈ T | t ∈ termset(dpk) ∩ nd, termset(dpk) ⊈ nd}.
• “Z” type: {t ∈ T | t ∈ termset(dpk) − nd}.
• “∗” type: others.

where k = i or j.
There are two types of offenders: (1) a complete conflict offender, which contains “X” type terms only; and (2) a partial conflict offender, which contains both “Y” type and “Z” type terms. For instance, the deployed pattern dpi in Figure 5.1
is a complete conflict offender of negative document nd and deployed pattern
dpj is a partial conflict offender of nd. As another example, given a negative document nd = {〈emiss〉, 〈global〉, 〈pollut〉, 〈car〉}, the deployed patterns dp1, dp2 and dp3 in Table 5.2 are all partial conflict offenders of nd, since termset(dp1) ∩ nd ≠ ∅, termset(dp2) ∩ nd ≠ ∅, and termset(dp3) ∩ nd ≠ ∅ but none of them is a subset of nd, whereas dp5 in the same table is a complete conflict offender of nd because termset(dp5) ⊆ nd.
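The offender classification just described can be sketched as follows; the function name is ours, and termsets are modelled as Python sets:

```python
def classify_offender(dp_termset, nd_termset):
    """Classify a deployed pattern against a negative document's termset:
    'complete' and 'partial' follow the offender types defined above;
    None means the pattern is not an offender (empty intersection)."""
    if not (dp_termset & nd_termset):
        return None
    if dp_termset <= nd_termset:
        return "complete"   # only "X" type terms
    return "partial"        # both "Y" and "Z" type terms

nd = {"emiss", "global", "pollut", "car"}
print(classify_offender({"emiss", "global", "pollut"}, nd))         # complete (dp5)
print(classify_offender({"carbon", "emiss", "air", "pollut"}, nd))  # partial (dp1)
print(classify_offender({"carbon", "air", "antarct"}, nd))          # None (dp4)
```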
5.1.2 The Algorithm of DPE
Algorithm 5.1. DPEvolving(Ω, D+, D−)
Input: a list of deployed patterns Ω; a list of positive and negative documents, D+ and D−.
Output: a set of term weight pairs ~d.
Method:
1: ~d← ∅
// estimate minimum threshold
2: τ = Threshold(D+) // Equation 5.1
3: foreach negative document nd in D− do begin
4:   if Threshold(nd) > τ then
5:     ∆p = {dp ∈ Ω | termset(dp) ∩ nd ≠ ∅}
6:     Shuffling(nd, ∆p) // Algorithm 5.2
7:   end if
8: end for
9: foreach deployed pattern dp in Ω do begin
10:  ~d← ~d⊕ dp
11: end for
The evolution of deployed patterns is implemented by the algorithm
DPEvolving (see Algorithm 5.1). The inputs of this algorithm are a list of
deployed patterns Ω, a list of positive and negative documents, D+ and D−. The
output is a set of term weight pairs which can be used directly in the testing phase.
Line 2 in DPEvolving estimates the threshold for finding the interesting negative documents. Lines 3 to 5 implement the process of discovering the offenders of the negative documents, so that a set of deployed patterns sharing terms with a negative document is collected for further processing. Once all the offenders are found, the algorithm Shuffling (Algorithm 5.2) is called to perform the main task.
Algorithm 5.2. Shuffling(nd, ∆p)
Input: a negative document nd and a list of deployed patterns ∆p.
Output: updated deployed patterns.
Method:
1: foreach deployed pattern dp in ∆p do begin
2: if termset(dp) ⊆ nd then // complete conflict offender
3: Ω = Ω− dp
4: else // partial conflict offender
5:     offering′ = (1 − 1/µ) × ∑_{t ∈ termset(dp), t ∈ nd} t.weight
6:     base = ∑_{t ∈ termset(dp), t ∉ nd} t.weight
7: foreach term t in termset(dp) do begin
8: if t ∈ nd then // shrink offender weight
9:     t.weight = (1/µ) × t.weight
10: else // shuffle weights
11: t.weight = t.weight× (1 + offering’÷ base)
12: end if
13: end for
14: end if
15: end for
The task of the algorithm Shuffling is to tune the weight distribution of terms within a deployed pattern. A different strategy is applied for each type of offender. As stated in line 3 of the algorithm Shuffling, a complete conflict offender is removed from the deployed pattern set Ω, since all of its elements are held by the negative document, indicating that the pattern can be discarded to prevent interference from this possible “noise”.
The variable offering′ in line 5 temporarily stores the weight taken from the “Y” type terms of a partial conflict offender. The offering′ is part of the offering, which is the sum of the weights of the terms in a deployed pattern that also appear in a negative document. Given a deployed pattern dp and a negative document nd, the value of the offering can be estimated by the following equation:

offering(dp) = ∑_{(t, t.weight) ∈ β(dp), t ∈ nd} t.weight    (5.3)
            “Y” type terms of dp1          “Z” type terms of dp1
original    (air, 1/5), (pollut, 1/5)      (carbon, 2/5), (emiss, 1/5)
shuffled    (air, 1/10), (pollut, 1/10)    (carbon, 8/15), (emiss, 4/15)

Table 5.4: The change of term weights in offender dp1 before and after shuffling, when 1/µ = 1/2 (i.e., µ = 2).
where β is a mapping function which describes the relationship between deployed patterns and the hypothesis space:

β : Ω → 2^(T×[0,1]) − {∅}

β(dp) = {(t1, w1), (t2, w2), . . . , (tn, wn)} ⊆ T × [0, 1]    (5.4)
For a partial conflict offender of a negative document, since it contains two types of terms, different processes are used, as stated in lines 8 to 12 of the algorithm Shuffling. For “Y” type terms, the weights are shrunk by dividing them by an experimental coefficient µ (µ > 1). An example is given in Table 5.4, showing that the weights of the terms “air” and “pollut” are reduced when dp1 is a partial conflict offender of nd, where nd = {〈air〉, 〈pollut〉, 〈health〉}. On the other hand, the “Z” type terms receive the weight shed by the “Y” type terms, distributed according to their own weights. As can be seen in Table 5.4, the weights of the terms
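A sketch of the partial-conflict branch of Shuffling, reproducing the Table 5.4 figures under the assumption µ = 2 (so that 1/µ = 1/2):

```python
from fractions import Fraction

def shuffle(dp, nd_terms, mu=Fraction(2)):
    """Sketch of the partial-conflict branch of Algorithm 5.2 (Shuffling).
    `dp` maps terms to weights; `nd_terms` is the negative document's
    termset; mu > 1 is the experimental shrink coefficient."""
    offering = (1 - 1 / mu) * sum(w for t, w in dp.items() if t in nd_terms)
    base = sum(w for t, w in dp.items() if t not in nd_terms)
    out = {}
    for t, w in dp.items():
        if t in nd_terms:
            out[t] = w / mu                     # shrink "Y" type terms
        else:
            out[t] = w * (1 + offering / base)  # redistribute to "Z" type terms
    return out

F = Fraction
dp1 = {"carbon": F(2, 5), "emiss": F(1, 5), "air": F(1, 5), "pollut": F(1, 5)}
shuffled = shuffle(dp1, {"air", "pollut", "health"})
print(shuffled["air"], shuffled["carbon"])  # 1/10 8/15, as in Table 5.4
```

Note that the total weight of the deployed pattern is preserved: the amount removed from the “Y” type terms equals the amount distributed over the “Z” type terms.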
When all of the deployed patterns in ∆p have been visited and processed, the algorithm moves on to the next document in D− until all of the negative documents have been visited. At the end of the algorithm DPEvolving, the last operation is to join all the deployed patterns in Ω using pattern composition. As a result, the output of the algorithm DPEvolving is a set of term weight pairs, which is used for the system evaluation presented in Chapter 6.
Figure 5.2: Different levels involved by DPE and IPE in pattern evolution.
5.2 Individual Pattern Evolution
In Section 5.1, a pattern refinement strategy was proposed using the pattern evolving approach DPE to reshuffle the weight distribution within offenders. In this type of approach, features which reside in the intersection of a negative document and a partial conflict offender are reviewed and adjusted by shifting their weight contribution away, in order to weaken their effect on the concept. The rest of the features in the same deployed pattern, in turn, receive the shifted offering. However, it should be noted that a deployed pattern in DPE is constructed by compounding the patterns discovered by PTM into a hypothesis space, which means this adjustment involves all the features, including some that may come from other patterns at the “P Level” in Figure 5.2.
Figure 5.2 demonstrates the three levels in a feature hierarchy based on the
physical structure of features. In other words, features from the lower level
(e.g., “T Level”) are encapsulated into the features in the higher level (e.g., “P
Level”). If a document contains two or more patterns, it indicates that the concept of this document is represented by more than one subtopic. For instance, two patterns p1 = 〈air, pollut〉 and p2 = 〈antarct〉 are discovered from a document d which describes the topic “global warming”. Hence d = {p1, p2} implies that the combination of the two subtopics “air pollution” and “Antarctic” describes the concept of “global warming” in d. If there exists a negative document nd = {antarct, explor}, then with the use of the DPE approach for pattern evolution, the weight contribution of p2 in d would be shifted to p1 according to the algorithm Shuffling in DPE.
Essentially, it is reasonable that pattern evolution is applied to a pattern which appears in both the offender and the negative document, for the purpose of removing the suspicious source of “noise”. However, the adjustment of the other patterns in the offender (such as p1 in d) is still arguable. In the above example, the significance of the pattern 〈antarct〉 in document d needs to be reduced, since its occurrence in the negative document leads to the ambiguity problem mentioned before. Nevertheless, this does not mean that the significance of the pattern 〈air, pollut〉 has to be increased. Since the deployed pattern is a lower-level representation in which multiple subtopics have been mixed together (a pattern at the “P Level” represents a subtopic), we have to process each subtopic individually. Accordingly, an alternative way to conduct the evolution of patterns is to alter these patterns at the upper level (the “P Level” in Figure 5.2) before they are deployed as lower-level features. Therefore, an evolving approach called Individual Pattern Evolution (IPE) is proposed in this section. IPE deals with patterns in their early-state individual form, instead of manipulating patterns in deployed form at the late state.
Figure 5.3 illustrates the different states in which the evolution of patterns
takes place using DPE and IPE. When a negative document is detected, DPE
Figure 5.3: The flowchart of two pattern evolving approaches.
starts to find offenders and implements pattern evolving at the “Hypothesis Space” state. In contrast, IPE executes the same action at the “Pattern” state. In addition, the structures of “Hypothesis Space” and “Pattern” are different, and thus an alternative definition and algorithm for IPE are needed. Note that the basic component of a hypothesis space is a set of term weight pairs derived by deploying all the discovered patterns in the previous stage, whereas the basic component of the “Pattern” state is a set of sequential pattern weight pairs obtained from the output of PTM.
5.2.1 Basic Definition of IPE
Let T = {t1, t2, t3, . . . , tn} be a set of terms, which can be viewed as words or keywords in text documents. D is a set of documents consisting of a set of positive documents D+ and a set of negative documents D−. As mentioned earlier, a set of terms is denoted a termset. A set of pattern weight pairs is named a patternset, which is defined as:

Pseti = {(pi,1, wi,1), (pi,2, wi,2), . . . , (pi,n, wi,n)}    (5.5)
where pi,n is a sequential pattern with its corresponding weight wi,n. A patternset
can be used to represent a set of discovered patterns from a document d using
PTM. In this section, the result of PTM mining from D is therefore represented
by a set of patternsets:

SD = {Pset1, Pset2, . . . , Psetk}    (5.6)

where Psetk denotes the discovered pattern set of a document dk ∈ D.
Let Φ = {t1, t2, t3, . . . , tm} be a set of terms with Φ ⊆ T, indicating a hypothesis space of D. For the document examples listed in Table 5.5, Φ can be derived as:

Φ = {carbon, emiss, air, pollut, greenhous, global, antarct}
The relations between the termset Φ and the patternsets Pseti for the topic “Effects of global warming” are demonstrated in Figure 5.4. As can be seen, each pattern pi in Pseti consists of a set of terms in Φ.
A set of terms in a pattern p can easily be derived from termset(p) = {t | (t, f) ∈ p}, as discussed in Section 4.1.1. However, the terms in this set are unordered. In IPE, the order of terms within a pattern is considered: two sequential patterns are equal if and only if they contain the same terms in the same order. For example, given two patterns p1 = 〈t1, t2, t3〉 and p2 = 〈t1, t3, t2〉,
Figure 5.4: Relations between patternset and termset under the topic “Effects ofglobal warming”.
Document  Patterns
d1        〈carbon〉4, 〈carbon, emiss〉3, 〈air, pollut〉2
d2        〈greenhous, global〉3, 〈emiss, global〉2
d3        〈greenhous〉2, 〈global, emiss〉2
d4        〈carbon〉3, 〈air〉3, 〈air, antarct〉2
d5        〈emiss, global, pollut〉2

Table 5.5: Examples of positive documents represented by a set of sequential patterns with frequency.
Name   Patternset
Pset1  (〈carbon〉, 4/9), (〈carbon, emiss〉, 1/3), (〈air, pollut〉, 2/9)
Pset2  (〈greenhous, global〉, 3/5), (〈emiss, global〉, 2/5)
Pset3  (〈greenhous〉, 1/2), (〈global, emiss〉, 1/2)
Pset4  (〈carbon〉, 3/8), (〈air〉, 3/8), (〈air, antarct〉, 1/4)
Pset5  (〈emiss, global, pollut〉, 1)

Table 5.6: Normalised patternsets which contain sequential patterns with corresponding weights.
Name           Patternset
Pset1          (〈carbon〉, 4/9), (〈carbon, emiss〉, 1/3), (〈air, pollut〉, 2/9)
Pset4          (〈carbon〉, 3/8), (〈air〉, 3/8), (〈air, antarct〉, 1/4)
Pset1 ⊔ Pset4  (〈carbon〉, 4/9+3/8), (〈carbon, emiss〉, 1/3), (〈air, pollut〉, 2/9), (〈air〉, 3/8), (〈air, antarct〉, 1/4)

Table 5.7: An example of patternset composition.
although termset(p1) = termset(p2), these two patterns are not equal since their terms are in a different order.
Given two patternsets Pseti and Psetj, the join of these two patternsets can be obtained by the following patternset composition:

Pseti ⊔ Psetj = {(pi,m, wi,m + wj,n) | pi,m = pj,n, (pi,m, wi,m) ∈ Pseti, (pj,n, wj,n) ∈ Psetj}
              ∪ {(p, w) | p ∈ (Pseti ∪ Psetj) − (Pseti ∩ Psetj), (p, w) ∈ Pseti ∪ Psetj}    (5.7)
An example of patternset composition is shown in Table 5.7. The weight of the pattern 〈carbon〉 is updated during the composition, since it appears in both patternsets with the same term sequence. However, the pattern 〈emiss, global〉 in Pset2 and the pattern 〈global, emiss〉 in Pset3 cannot be joined when we combine Pset2 and Pset3 using patternset composition, even though the termsets of these two patterns are the same: termset(〈emiss, global〉) = termset(〈global, emiss〉). Therefore, given two patterns p1 ∈ Pseti and p2 ∈ Psetj, they can be joined during the operation of patternset composition (Pseti ⊔ Psetj) if and only if p1 = p2. For instance, in Table 5.7 the pattern (〈carbon〉, 4/9) in Pset1 is joined with the pattern (〈carbon〉, 3/8) in Pset4 during the operation Pset1 ⊔ Pset4. After the composition, this pattern is updated to (〈carbon〉, 4/9+3/8) = (〈carbon〉, 59/72).
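The composition can be sketched by keying patterns on ordered term tuples, so that only identical sequences are joined (the dictionary layout is illustrative):

```python
from fractions import Fraction

def compose(pset_i, pset_j):
    """Sketch of the patternset composition ⊔ in Equation 5.7. Patterns are
    term tuples, so the comparison is order-sensitive: weights are added
    only when exactly the same sequence occurs in both patternsets."""
    out = dict(pset_i)
    for p, w in pset_j.items():
        out[p] = out.get(p, Fraction(0)) + w
    return out

F = Fraction
pset1 = {("carbon",): F(4, 9), ("carbon", "emiss"): F(1, 3), ("air", "pollut"): F(2, 9)}
pset4 = {("carbon",): F(3, 8), ("air",): F(3, 8), ("air", "antarct"): F(1, 4)}
joined = compose(pset1, pset4)
print(joined[("carbon",)])  # 4/9 + 3/8 = 59/72, as in Table 5.7
```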
5.2.2 The Algorithm of IPE
Algorithm 5.3. IPEvolving(D+, D−)
Input: a list of positive and negative documents, D+ and D−.
Output: a set of term weight pairs ∆.
Method:
1: ∆← ∅; ∆ps ← ∅ // ∆ps: patternset
// find a set of patternsets SD from D+ using SPMining (Algorithm 3.1)
2: SD = {Psetd1, Psetd2, . . . , Psetdm} // where m = |D+|
3: foreach Psetdi in SD do begin
     // normalise each pattern in Psetdi
4:   foreach pattern (pi,k, wi,k) ∈ Psetdi do begin wi,k = wi,k ÷ ∑_{j=1}^{|Psetdi|} wi,j end for
5:   ∆ps = ∆ps ⊔ Psetdi // patternset composition
6: end for
// find a set of patternsets SD− from D−
7: SD− = {Psetd1, Psetd2, . . . , Psetd|D−|}
8: foreach (p, w) ∈ ∆ps do begin
     // accumulate the support of offending patterns
9:   sum_sup = ∑_{i=1}^{|SD−|} ∑_{p = p−, (p−, w−) ∈ Psetdi} suppa(p−)
10:  w = w × (suppa(p) − sum_sup) / suppa(p)
11:  foreach term (t, f) in p do begin f = w / len(p) end for
12:  ∆← ∆⊕ p // pattern deploying
13: end for
The input of the algorithm IPEvolving is a set of positive documents D+ and a set of negative documents D−. The output is a set of term weight pairs ∆ which represents the concept of the topic with respect to D+ and D−. The three main phases of Algorithm 5.3 are briefly described as follows:
Pattern Generation: a set of sequential patterns for each document is generated
in this phase using PTM. Note that only positive documents are processed
here. At the end of this stage, a set of patternsets is discovered and prepared
for the next phase. This process is implemented in line 1 and line 2 as listed
in the algorithm.
Patternset Composition: in this phase, the discovered patterns from the previous
phase are transformed into a form of pattern weight pairs using patternset
composition. The structure of each pattern is preserved and all essential
information such as statistical data is temporarily stored as well. This
operation can be found between line 3 and line 6 in the algorithm.
Individual Pattern Evolving: the major task of this algorithm is performed and completed in this phase. The involved patterns are evaluated before being deployed into a hypothesis space. The procedure spans lines 7 to 13 of the algorithm.
Given the document examples shown in Table 5.5, we assume all documents are positive and belong to D+, i.e., D+ = {d1, d2, d3, d4, d5}, and that each document has the set of patterns listed in the same table, discovered in the pattern generation phase. For instance, document d1 has a set of sequential patterns {〈carbon〉4, 〈carbon, emiss〉3, 〈air, pollut〉2}, where the number beside each pattern indicates the pattern’s absolute support. The details of how to find sequential patterns in a set of documents have been discussed in Chapter 3, and the corresponding algorithm is Algorithm 3.1. In the next step, each document is represented by a patternset as defined in Equation 5.5. Therefore, document d1 can be replaced by the patternset Psetd1 = {(〈carbon〉, 4), (〈carbon, emiss〉, 3), (〈air, pollut〉, 2)}. Note that each pattern’s weight is still the absolute support at this stage. At the end of this phase, all documents are grouped into a set of patternsets, denoted SD = {Psetd1, Psetd2, Psetd3, Psetd4, Psetd5}.
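The weight adjustment in line 10 of IPEvolving can be sketched in isolation; the support figures below are hypothetical, chosen only to show the discount:

```python
from fractions import Fraction

def ipe_discount(weight, supp_pos, neg_supp_sum):
    """Sketch of line 10 in Algorithm 5.3 (IPEvolving): a pattern's weight
    is reduced in proportion to the support it accumulates across the
    negative patternsets."""
    return weight * Fraction(supp_pos - neg_supp_sum, supp_pos)

# Hypothetical figures: a pattern with weight 2/9 has absolute support 2
# in the positive documents and total support 1 in the negative patternsets.
print(ipe_discount(Fraction(2, 9), 2, 1))  # 2/9 × 1/2 = 1/9
```

A pattern that never occurs in the negative documents keeps its weight unchanged, while one whose negative support equals its positive support is reduced to zero.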
5.3 Related Work
Pattern evolution is used for concept refinement in user profile mining. Li and Zhong [86] proposed a novel approach for mining an ontology in order to automatically acquire user information needs. For ontology construction in this work, hierarchical clustering [94, 98] is adopted to determine synonymy and hyponymy relations between keywords. A set of interesting negative documents, labeled as relevant by the system, is then detected and exploited for pattern evolving. Two kinds of offenders can be discovered from these interesting negative documents: total conflict and partial conflict. By reshuffling their weight distributions, the uncertainties contained in these offenders can be dissipated.
We adopt such a concept and apply it to our pattern-based information filtering
method DPE. Instead of using document-wise patterns for concept evolution,
DPE conducts evolution on deployed patterns which are discovered by using data
mining techniques and deployed by using our proposed PDS method. In other
words, different pattern discovery methods are used for generating representatives
in these two works.
5.4 Chapter Summary
The objective of pattern evolution is to provide an effective mechanism that allows the contextual concept in the knowledge base to be updated during the learning phase of a pattern-based knowledge discovery system. A knowledge-based system becomes adaptive by revising features in a particular state and rebuilding the context representation as negative patterns are detected.
There are two evolving approaches proposed in this chapter. The first, DPE, detects the offenders in negative documents and then applies a revision scheme to the features residing in those offenders. By shuffling the weight distribution of these features in the hypothesis space, the patterns can be properly adjusted, and the goal of refining the contextual concept in the knowledge base can be achieved as well. Similarly, the second approach, IPE, also tunes patterns in an attempt to reach the same goal, but at a different level. IPE adjusts patterns at an upper level, where they are still in sequential form, rather than in the space into which patterns are deployed as in DPE. The advantage of IPE is that not all sequential patterns need to be involved in the evolving process; only those that are also found in the negative documents need to be re-evaluated. As a result, the efficiency of the system can be improved. Moreover, by modifying only the involved patterns, we can narrow the scope of the target components and concentrate on those in the whole feature space which really need to be altered.
Chapter 6
Experiments and Results
This chapter describes the experimental evaluation of our proposed approaches featured in the pattern taxonomy model PTM. Three aspects are discussed: experimental datasets, performance measures, and evaluation procedures. The latest version of the Reuters document collection is chosen among several versions as our benchmark dataset. Most of the standard performance measures (i.e., precision, recall, breakeven point, Fβ-measure and the 11 standard points) are used for evaluating the experimental performance. The discussion and analysis of the experiments are split into three categories based on the methods or strategies proposed in the previous chapters. The PTM model comprises pattern discovery approaches (i.e., SPM and SCPM), pattern deploying methods (i.e., PDM and PDS), and pattern evolution strategies (i.e., DPE and IPE).
The process of executing PTM consists of two major phases, concept learning
and document evaluation. In the former phase, one of the proposed pattern
discovery approaches is adopted to learn the concept (i.e., user profile) of
documents in the training set; the various combinations of pattern deploying
and evolving methods are then applied in the latter phase to evaluate documents in the test
set. Text preprocessing is applied to each document before both the learning
and evaluating phases. Term stemming and stopword removal techniques are also
used in this stage for document indexing.
To evaluate the performance of PTM, we implement PTM for the task of
information filtering (IF) in our experiments. By conducting IF tasks, we can
examine the ability of the proposed pattern discovery approaches and test the
effectiveness of refinement methods for discovered patterns. The experimental
results are compared with other well-known IF-related methods including Term
Frequency Inverse Document Frequency (TFIDF) method [129], Probabilistic
method (Prob) [50, 139] and Rocchio method [122, 124]. We also compare the
results from PTM to those from data mining-based methods, such as frequent
itemset mining, sequential pattern mining and closed pattern mining methods.
6.1 Experimental Dataset
Several standard benchmark datasets are available for experimental purposes.
They are Reuters corpora, OHSUMED [58], and 20 Newsgroups collection [72].
The most frequently used one is the Reuters dataset. During the last decade,
several versions of Reuters corpora have been released. The particular version
that we chose for our experiment is Reuters Corpus Volume 1, also known as
RCV1. The reason is that RCV1 is the latest one among those common data
collections, and it also contains a reasonable number of documents with relevance
judgments for both the training and test examples. Although another version,
Reuters-21578, is currently the most widely used dataset for text categorisation
tasks, it is predicted to be superseded by RCV1 in the upcoming years [123]. The
Version        #docs    #trainings  #tests  #topics  Release year
Reuters-22173  22,173   14,704      6,746   135      1993
Reuters-21578  21,578   9,603       3,299   90       1996
RCV1           806,791  5,127       37,556  100      2000
Table 6.1: Current Reuters data collections.
summary of current Reuters data collections is given in Table 6.1.
RCV1 includes 806,791 English language news stories which were produced
by Reuters journalists for the period between 20 August 1996 and 19 August 1997.
These documents were formatted using a structured XML scheme. TREC (Text
REtrieval Conference)1 has developed and provided 100 topics for the filtering
track aiming at building a robust filtering system [123]. The first 50 topics were
composed by human researchers and the rest were formed by intersecting two
Reuters topic categories. These topic codes are listed in Appendix B.
Each RCV1 topic was divided into two sets: training and test, and the
relevance judgments have also been given for each topic. The training set has
a total amount of 5,127 news stories with dates up to and including 30 September
1996 and the test set contains 37,556 news stories from the rest of the collection.
Stories in both sets are assigned to be either positive or negative. “Positive” means
the story is relevant to the assigned topic; otherwise “Negative” will be shown. In
our experiments we chose all 100 TREC topics (from topic 101 to topic 200).
Further details regarding the RCV1 can be found in [123].
RCV1 is distributed on two CDs and contains about 810,000 English language
stories. It requires about 3.7 GB for storage if all files are uncompressed. This
1 http://trec.nist.gov/
corpus can also be obtained from the following Web sites:
http://about.reuters.com/researchandstandards/corpus/
http://trec.nist.gov/data/reuters/reuters.html
The former Web site is owned by Reuters Ltd and the latter is maintained
by NIST, the National Institute of Standards and Technology. Another Reuters
corpus, Volume 2 (RCV2), is also available on request. This multilingual
corpus is distributed on one CD and contains over 487,000 Reuters news stories
in 13 languages including Dutch, French, German, Chinese, Japanese, Russian,
Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and
Swedish.
The documents in RCV1 are tagged using XML format for easy access and
parsing. An example of an RCV1 document is illustrated in Figure 6.1. Each
document is identified by a unique item ID and accompanied by a title in the
field marked by the tag <title>. The main content of the story is in a distinct
<text> field consisting of one or several paragraphs. Each paragraph is enclosed
by the XML tag <p>. In our experiment, both the “title” and “text” fields are
used and each paragraph (i.e., content in <p>) in the “text” field is viewed as
a transaction in a document. Moreover, we treat the content in the “title” field
in the document as an additional paragraph (i.e., transaction). The information
contained in the rest of the tags, such as <headline> and <metadata>, is ignored
and discarded. Nevertheless the “headline” and “metadata” fields may contain rich
information. The reason for ignoring them is that in an RCV1 document “title”
and “headline” are duplicated fields and the “metadata” field contains information
such as region and classification codes which are out of our research scope in this
work. In this thesis we focus on the major part of the document and pay more
Figure 6.1: An XML document in RCV1 dataset.
attention to the issue of how to use these meaningful patterns discovered from it.
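To make the transaction extraction concrete, the following sketch parses an RCV1-style document into transactions, keeping only the title and paragraph fields as described above. The XML snippet is a hypothetical miniature, not an actual RCV1 story.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of an RCV1 story; real documents carry many more
# fields (<headline>, <metadata>, ...), which are ignored as in the thesis.
doc = """<newsitem itemid="123">
  <title>bill senate</title>
  <text>
    <p>bill theft trade secret foreign company federal crime</p>
    <p>senate version bill passed house</p>
  </text>
</newsitem>"""

def to_transactions(xml_string):
    """Treat the title and each <p> paragraph as one transaction (word list)."""
    root = ET.fromstring(xml_string)
    transactions = [root.findtext("title", default="").split()]
    for p in root.iter("p"):
        transactions.append((p.text or "").split())
    return transactions

print(to_transactions(doc))  # 3 transactions: title plus two paragraphs
```

The title is deliberately treated as just another transaction, matching the description above.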
As mentioned above, each RCV1 document contains at least one paragraph.
Each paragraph contains at least one sentence. This makes RCV1 different from
the previous versions of Reuters datasets, which usually have only one paragraph
per document. The characteristic of multiple paragraphs in the RCV1 documents
allows the data mining algorithms to be applied for pattern discovery with ease.
The distributions of words and paragraphs in the RCV1 dataset are shown in
Figure 6.2 and Figure 6.3 respectively. The number of stories with a particular
word or paragraph count is shown in these charts. We can also see that
most stories are short, with around 6 or 7 paragraphs and 1,000 words [118].
The TREC conference is held annually and co-sponsored by the National
Institute of Standards and Technology (NIST) and the U.S. Department of
Defense. For each TREC, NIST provides a test set of documents and questions.
Participants run their own retrieval systems and return to NIST a list of the
retrieved top-ranked documents. NIST judges the retrieved documents for
Figure 6.2: Distribution of words in an RCV1 collection [118].
Figure 6.3: Number of paragraphs per document in an RCV1 collection [118].
<top>
<num> Number: R101
<title> Economic espionage
<desc> Description: What is being done to counter economic espionage internationally?
<narr> Narrative: Documents which identify economic espionage cases and provide action(s) taken to reprimand offenders or terminate their behavior are relevant. Economic espionage would encompass commercial, technical, industrial or corporate types of espionage. Documents about military or political espionage would be irrelevant.
</top>
Figure 6.4: An example of topic description.
correctness, and evaluates the results. Each TREC conference consists of a set
of tracks, such as the “Blog Track”, “Cross-Language Track”, and “Filtering Track”.
The experiments in this thesis use the same data collection as the
TREC 2002 Filtering Track. In this track, a filtering system has
to make a binary decision as to whether a new document should be retrieved
according to a user’s information needs. Therefore, a topic in RCV1 can be viewed
as the representation of the user’s information needs. An example of a topic can
be seen in Figure 6.4. Details of building a test collection for TREC 2002 can be
found in [137].
6.2 Performance Measures
How to measure the performance of an information system is an important issue. In
this section, some of the common measures that have been used in the literature
are described. To evaluate experimental results, several standard measures such as
                            human judgement
                            yes     no
system judgement    yes     TP      FP
                    no      FN      TN
Table 6.2: Contingency table.
precision and recall are used. The precision is the fraction of retrieved documents
that are relevant to the topic, and the recall is the fraction of relevant documents
that have been retrieved. For a binary classification problem the judgement can
be defined within a contingency table as depicted in Table 6.2. According to
the definition in this table, the precision and recall are denoted by the following
formulas:

    precision = TP / (TP + FP),        recall = TP / (TP + FN)        (6.1)
where TP (True Positives) is the number of documents the system correctly
identifies as positives; FP (False Positives) is the number of documents the
system falsely identifies as positives; FN (False Negatives) is the number of
relevant documents the system fails to identify.
The precision of the first K returned documents (top-K) is also adopted in this thesis,
since most users focus on the first few dozen returned documents.
The precision of the top-K returned documents refers to the proportion
of relevant documents among the first K returned documents. The value of K we use
in the experiments is 20.
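These definitions can be sketched directly in code; the ranked list of relevance judgments below (1 = relevant, 0 = non-relevant) is invented for illustration, not taken from the RCV1 experiments:

```python
def precision_recall(tp, fp, fn):
    """Equation (6.1): precision and recall from contingency-table counts."""
    return tp / (tp + fp), tp / (tp + fn)

def precision_at_k(ranked_relevance, k):
    """Fraction of relevant documents among the first k returned documents."""
    top = ranked_relevance[:k]
    return sum(top) / len(top)

# Invented judgments for a ranked list of 10 returned documents.
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
p, r = precision_recall(tp=4, fp=6, fn=1)   # 5 relevant documents exist in total
print(p, r)                       # 0.4 0.8
print(precision_at_k(ranked, 5))  # 0.6
```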
In addition, breakeven point (b/p) is used to provide another measurement for
performance evaluation. It indicates the point where the value of precision equals
the value of recall for a topic. The higher the b/p value, the more effective
the system is. The b/p measure has been frequently used in common information
retrieval evaluations.
In order to assess the effect involving both precision and recall, another
criterion which can be used for experimental evaluation is Fβ-measure [79] which
combines precision and recall and can be defined by the following equation:
    Fβ-measure = ((β² + 1) · precision · recall) / (β² · precision + recall)        (6.2)
where β is a parameter giving weights of precision and recall and can be viewed as
the relative degree of importance attributed to precision and recall [130]. A value
β = 1 is adopted in our experiments meaning that it attributes equal importance
to precision and recall. When β = 1, the measure is expressed as:
    F1 = (2 · precision · recall) / (precision + recall)        (6.3)
The value of Fβ=1 is equivalent to the b/p when precision equals recall.
However, the b/p cannot be compared directly to the Fβ=1 value, since the latter is
given a higher score than the former [162]. It has also been stated in [103]
that the Fβ=1 measure is greater than or equal to the value of b/p.
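Equations (6.2) and (6.3) can be sketched as one function; setting β = 1 reduces the general form to F1:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta measure, Equation (6.2); beta = 1 gives F1 of Equation (6.3)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_beta(0.5, 0.5))   # 0.5 -- equals the b/p when precision == recall
print(f_beta(0.4, 0.8))   # ≈ 0.533
```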
Both the b/p and the Fβ-measure are single-valued measures, in that they use
only a single figure to reflect the performance over all the documents. However,
more figures are needed to evaluate the system as a whole. Therefore, another measure,
Interpolated Average Precision (IAP) is introduced and has been adopted before
in several research works [71, 133, 162]. This measure is used to compare the
performance of different systems by averaging precisions at 11 standard recall
levels (i.e., recall = 0.0, 0.1, ..., 1.0). The 11-points measure used in our
comparison tables indicates the first of the 11 points, where recall equals
zero. Moreover, Mean Average Precision (MAP) is used in our evaluation; it is
calculated by first measuring precision at each relevant document, and then averaging these
precisions over all topics.
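The per-topic ingredient of MAP, average precision, can be sketched as follows (the ranking is invented for illustration); MAP then averages this value over all topics:

```python
def average_precision(ranked_relevance):
    """Mean of the precision values taken at each relevant document's rank."""
    precisions, hits = [], 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.833
```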
Error rate is another performance measure that is commonly used in text
categorisation. The value of error rate ε can be calculated by the equation:
    ε = (FP + FN) / (TP + FP + FN + TN)        (6.4)
In order to obtain a global measurement, there are two ways to evaluate the
average performance. In the case of text categorisation, let C be a set of classes;
precision and recall can be averaged using:
- micro-averaging:
the contingency tables of all categories are merged into a single table and
then the global performance is estimated using the merged table:
    precision_micro = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FP_i)

    recall_micro = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FN_i)
- macro-averaging:
one contingency table per category is used, measures are calculated locally
and then averaged over categories:
    precision_macro = (Σ_{i=1}^{|C|} precision_i) / |C|

    recall_macro = (Σ_{i=1}^{|C|} recall_i) / |C|
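The two averaging schemes can be contrasted with a small sketch; the per-class contingency counts below are invented:

```python
def micro_macro_precision(tables):
    """tables: list of (TP, FP) pairs, one per class.
    Returns (micro-averaged, macro-averaged) precision."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    micro = tp / (tp + fp)
    macro = sum(t[0] / (t[0] + t[1]) for t in tables) / len(tables)
    return micro, macro

# Two classes: one large and easy (90 TP, 10 FP), one small and hard (1 TP, 9 FP).
micro, macro = micro_macro_precision([(90, 10), (1, 9)])
print(micro, macro)  # micro (≈ 0.83) is dominated by the large class; macro is 0.5
```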
Generally speaking, micro-averaging yields better scores than macro-averaging
in practical experiments. In particular, micro-averaging gives every
document an equal weight in the performance, whereas macro-averaging gives every
class an equal weight. Micro-averaged precision and recall
values are usually used in the text classification domain. As mentioned above, however, PTM
is evaluated on a system which performs information filtering tasks rather than
text categorisation. Therefore, the averaged precision and recall values are
computed by summing up the corresponding values over all topics and then dividing
by the number of topics.
6.3 Evaluation Procedures
In order to evaluate the proposed PTM model, we apply PTM to a practical
information filtering task. As mentioned in Chapter 2, information filtering is
a task in which a user with a specific information need monitors a stream of
documents and the system selects documents from the stream according to a
profile of the user's interests. Filtering systems process one document at a time
and show it to the user if the document is relevant. The system then adjusts the
profile or updates the threshold based on the user’s feedback. In the case of batch
filtering, a number of relevant documents are returned, whereas a list of ranked
documents is given by a routing filtering system. In this thesis, routing filtering
is implemented and performance of the model is evaluated based on the ranked
documents. The choice of routing task can avoid the need of threshold tuning,
which is beyond our focus in this research work.
We evaluate PTM using all 100 TREC topics (r101–r200) in the experiments.
No.   #r   #d    No.   #r   #d    No.   #r   #d    No.   #r   #d
r101    7   23   r126   19   29   r151    6   49   r176    5   57
r102  135  199   r127    5   32   r152    5   55   r177   25   45
r103   14   64   r128    4   51   r153   10   18   r178    3   43
r104  120  194   r129   17   72   r154    6   52   r179    5   57
r105   16   37   r130    3   24   r155   11   74   r180    5   61
r106    4   44   r131    4   31   r156    6   37   r181    4   64
r107    3   61   r132    7  103   r157    3   42   r182   19   36
r108    3   53   r133    5   47   r158    5   79   r183   25   55
r109   20   40   r134    5   31   r159   21   62   r184    9   48
r110    5   91   r135   14   29   r160   15   36   r185   26   52
r111    3   52   r136    8   46   r161    5   52   r186   20   38
r112    6   57   r137    3   50   r162    6   27   r187    7   48
r113   12   68   r138    7   98   r163    4   29   r188    3   30
r114    5   25   r139    3   21   r164   21   64   r189   12   56
r115    3   46   r140   11   59   r165    7   53   r190   13   42
r116   16   46   r141   24   56   r166    8   39   r191    5   43
r117    3   13   r142    4   28   r167    5   63   r192    3   40
r118    3   32   r143    4   52   r168   32   43   r193    5   64
r119    4   26   r144    6   50   r169    5   35   r194   31   80
r120    9   54   r145    5   95   r170   16   79   r195    8   36
r121   14   81   r146   13   32   r171    7   48   r196    5   61
r122   15   70   r147    6   62   r172   10   78   r197   22   34
r123    3   51   r148   12   33   r173   27   35   r198    3   29
r124    6   33   r149    5   26   r174    5   44   r199   21   40
r125   12   36   r150    4   51   r175   37   37   r200    7   34

Table 6.3: Number of relevant documents (#r) and total number of documents (#d) for each topic in the RCV1 training dataset.
No.   #r   #d    No.   #r   #d    No.   #r   #d    No.   #r   #d
r101  307  577   r126  172  270   r151   22  437   r176   37  411
r102  159  308   r127   42  238   r152   41  402   r177   61  250
r103   61  528   r128   33  276   r153   37  118   r178   47  271
r104   94  279   r129   57  507   r154   39  469   r179   32  510
r105   50  258   r130   16  307   r155   63  489   r180   72  426
r106   31  321   r131   74  252   r156   72  354   r181   25  574
r107   37  571   r132   22  446   r157   37  300   r182   32  157
r108   15  386   r133   28  380   r158   45  542   r183  139  443
r109   74  240   r134   67  351   r159   97  368   r184   13  361
r110   31  491   r135  337  501   r160   54  199   r185  184  371
r111   15  451   r136   67  452   r161   47  463   r186  264  417
r112   20  481   r137    9  325   r162   81  319   r187   31  467
r113   70  552   r138   44  328   r163  122  343   r188   36  322
r114   62  361   r139   17  253   r164  182  432   r189   76  384
r115   63  357   r140   67  432   r165   52  499   r190   85  337
r116   87  298   r141   82  379   r166   17  219   r191   18  347
r117   32  297   r142   24  198   r167   40  486   r192   29  367
r118   14  293   r143   23  417   r168  269  342   r193   16  430
r119   40  271   r144   55  380   r169   35  348   r194  187  571
r120  158  415   r145   27  488   r170   73  507   r195   37  263
r121   84  597   r146  111  280   r171   68  394   r196   50  453
r122   51  393   r147   34  380   r172   41  441   r197  144  264
r123   17  342   r148  228  380   r173  226  314   r198   18  249
r124   33  250   r149   57  449   r174   82  364   r199  116  272
r125  132  544   r150   54  371   r175  312  312   r200   86  277

Table 6.4: Number of relevant documents (#r) and total number of documents (#d) for each topic in the RCV1 test dataset.
TREC provides two sets of documents for each topic, for training and test
purposes. Table 6.3 and Table 6.4 provide the related statistical information for
the training and test datasets respectively. All of the documents in these two sets are
processed in both the profile learning and document evaluating phases. Before
the learning phase, document indexing is applied to preprocess words and remove
stopwords. Once each document is transformed into the desired format, one of the
mining methods is selected to find dedicated patterns in the phase of pattern
discovery. These patterns are then passed through the subsequent deploying and
evolving processes to generate the representative concept (e.g. deployed pattern
set), which is used to represent the set of documents. Following is the test phase
where each document in the test set is evaluated to examine the performance of
the PTM-based IF system. In summary, steps required for the whole evaluation
procedure in PTM are briefly listed as follows:
(1) System starts from one of the RCV1 topics and retrieves the related
information with regard to the training set, such as file list and the number
of documents.
(2) Each document is preprocessed with word stemming and stopword
removal and transformed into a set of transactions based on its
document structure.
(3) System selects one of the pattern discovery algorithms to extract patterns.
(4) Discovered patterns are deployed into a hypothesis space using one of the
proposed deploying methods.
(5) If required, the pattern evolving process is used to refine patterns. A concept
representing the context of the topic is eventually generated.
Figure 6.5: Process of document indexing.
(6) Each document in the test set is assessed by the document evaluation
method and the experimental results are shown as an output.
(7) System ends for this topic and repeats the above steps for the next topic if
required.
In the following subsections, more details about document indexing in our
experiments are presented, followed by descriptions of the three main procedures
in our proposed PTM model. The experimental environment and settings are also
discussed at the end of this section.
6.3.1 Document Indexing
Document indexing is the process that assigns terms to documents for retrieval
purposes [45]. The goal of document indexing is to select informative features
that represent the concept of a set of documents. A typical process of document
indexing is illustrated in Figure 6.5. In this process, a set of documents is read and
a set of features is returned as output. Document indexing consists of two steps:
preprocessing and feature selection.
In preprocessing, redundant terms need to be eliminated before the documents
can be interpreted by the system. Since RCV1 documents are all in XML format,
there are many fields enclosed by tags, including <title>, <headline>,
<dateline>, <text>, <copyright> and <metadata> (see the
document example in Appendix A). In our experiments, the fields we chose in
each document are <title> and <text>. The content of the remaining
fields is discarded. In an RCV1 document, each <text> field contains several
paragraphs enclosed by the tag <p>. For implementing PTM, we treat each
paragraph as a transaction; the content of the <title> field is likewise
viewed as an extra paragraph because of the rich information it carries.
The next process is to apply stopword removal and word stemming. In stopword
removal, function words and non-informative terms are removed according to a
given stopword list (Appendix C). For word stemming, the Porter algorithm [117]
is used for suffix stripping.
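The preprocessing step can be sketched as below. The stopword list is a tiny stand-in for the full list in Appendix C, and the suffix stripper is a crude placeholder for the Porter algorithm that the thesis actually uses:

```python
STOPWORDS = {"the", "of", "a", "is", "to", "and"}  # stand-in for Appendix C

def crude_stem(word):
    """Toy suffix stripping; the real system applies the Porter algorithm [117]."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(paragraph):
    """Lowercase, remove stopwords, stem: one paragraph in, one term list out."""
    return [crude_stem(w) for w in paragraph.lower().split()
            if w not in STOPWORDS]

print(preprocess("The senate passed a version of the bill"))
# ['senate', 'pass', 'version', 'bill']
```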
During feature selection, each term is assigned a value by a
weighting scheme, and terms with low scores are removed for the purpose
of dimensionality reduction. As already mentioned in Chapter 2, feature selection
is a way to make the system efficient. Existing systems usually select a term
weighting scheme to eliminate a large number of non-relevant terms, especially
in the fields of Information Retrieval and Text Categorisation. However, in IF,
due to the lack of relevant information for training, shrinking the term base may
affect the system's effectiveness. Therefore, the strategy we adopt is to select
not terms but patterns. That means pattern pruning is applied during the process
of pattern discovery in order to achieve the goal of dimensionality reduction
(Section 3.1.2). Hence, in our experiments, almost all the terms are retained after
<< 63261.xml >> => 1 title + 4 paragraphs => 32 words
(T) bill senat
(1) bill theft trade secret foreign compani feder crime final action senat
(2) senat version bill pass hous version pass hous final action hous
(3) bill compani theft feder crime
(4) foreign trade secret
==== Found Patterns:
[1Terms]: ([senat](3)) Freq:3, rel_supp:0.6
[1Terms]: ([bill](4)) Freq:4, rel_supp:0.8
[1Terms]: ([foreign](2)) Freq:2, rel_supp:0.4
[2Terms]: ([bill](4),senat) Freq:2, rel_supp:0.4
[2Terms]: ([trade](2),secret) Freq:2, rel_supp:0.4
[3Terms]: ([bill,final](2),action) Freq:2, rel_supp:0.4
[4Terms]: ([bill,theft,feder](2),crime) Freq:2, rel_supp:0.4
[4Terms]: ([bill,compani,feder](2),crime) Freq:2, rel_supp:0.4
Figure 6.6: Primary output of a preprocessed document and found patterns.
scanning all training documents, except terms whose frequency equals one.
In fact, a number of RCV1 topics contain only a couple of training
examples; about 63% of all RCV1 topics have no more than 10 relevant examples
available for training. An example of output after document preprocessing is
illustrated in Figure 6.6.
In the case of Figure 6.6 (document “63261.xml”), it can be seen that words in
the document are stemmed and only those that appear in at least two transactions
are retained; the rest are removed. The use of pattern pruning in SCPM removes
a large number of non-closed patterns, keeping the number of discovered
patterns reasonable. If SPM is chosen for pattern discovery instead,
the number of generated patterns explodes to 35, compared with 8 for
the SCPM algorithm. In fact, the extra patterns do not improve the system's
effectiveness, according to our findings in the preliminary
work [159]. In addition to SPM, NSPM encounters the same problem, since both
of them generate a large number of redundant patterns.
6.3.2 Procedure of Pattern Discovery
The result of document indexing is a set of transactions and each transaction
consists of a vector of stemmed terms. The next step is to find frequent patterns
using our proposed pattern discovery algorithms. As mentioned in Chapter 3, data
mining approaches including association rule mining, frequent sequential pattern
mining, closed pattern mining, itemset mining, and closed itemset mining are
adopted and applied to the text mining tasks. By splitting each document into
several transactions (i.e., paragraphs), we can use these mining methods to find
frequent patterns from the textual documents. Five pattern discovery methods
which have been implemented in the experiments are briefed as follows:
- SPM: Finding sequential patterns using the algorithm SPMining (Algo-
rithm 3.1 in Section 3.1.1), skipping the first line of the algorithm.
- SCPM: Finding sequential closed patterns using the algorithm SPMining.
- NSPM: Finding non-sequential patterns using the algorithm NSPMining
(Algorithm 3.2 in Section 3.2).
- NSCPM: Finding non-sequential closed patterns using algorithm NSPMin-
ing with closed pattern mining scheme (Equation 3.1) in Section 3.2.2.
- nGram: Finding all sequential patterns, whose lengths do not exceed “n”,
using the SPMining algorithm.
Note that the min_sup we choose is 0.2 for all mining methods, which means a
pattern is frequent if it appears in n paragraphs (including the title field) of a document
containing m transactions (paragraphs plus title) such that n/m ≥ 0.2. For
fairness of comparison, the value of min_sup is kept the same for all approaches.
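This frequency test can be sketched on the transactions of document 63261.xml from Figure 6.6; the subsequence check below is a simplified stand-in for the matching done inside the actual mining algorithms:

```python
def is_subsequence(pattern, transaction):
    """True if the pattern's terms occur in order (not necessarily adjacently)."""
    remaining = iter(transaction)
    return all(term in remaining for term in pattern)

def relative_support(pattern, transactions):
    n = sum(is_subsequence(pattern, t) for t in transactions)
    return n / len(transactions)

# Title (T) plus four paragraphs of document 63261.xml (Figure 6.6).
doc = [
    ["bill", "senat"],
    ["bill", "theft", "trade", "secret", "foreign", "compani",
     "feder", "crime", "final", "action", "senat"],
    ["senat", "version", "bill", "pass", "hous", "version",
     "pass", "hous", "final", "action", "hous"],
    ["bill", "compani", "theft", "feder", "crime"],
    ["foreign", "trade", "secret"],
]
MIN_SUP = 0.2
for pattern in (["bill"], ["bill", "senat"], ["trade", "secret"]):
    rs = relative_support(pattern, doc)
    print(pattern, rs, rs >= MIN_SUP)  # matches the rel_supp values in Figure 6.6
```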
Figure 6.6 shows a primary result of the SCPM mining method. As can be seen,
eight sequential closed patterns are found. The frequency and relative
support of each pattern are also estimated. A similar result can be obtained if the
NSCPM mining algorithm is used instead of SCPM, since both of them adopt
a pattern pruning scheme during pattern discovery. However, the number of
discovered patterns increases dramatically if either SPM or NSPM is applied.
The comparison of these methods is presented in Section 6.5.1.
6.3.3 Procedure of Pattern Deploying
The procedure of pattern deploying is illustrated in Figure 6.7. Each step in the
figure is briefed as follows:
Topic: The system usually needs to process a number of topics and starts
each of them in turn. A topic contains a training dataset and a test dataset,
each of which contains a set of documents.
Data Transform: Each document in the training dataset is preprocessed.
For a document, the words enclosed by the “title” and “text” tags are retained
for further processing, which includes word stemming and stopword
removal. After preprocessing, each document is transformed into a set of
transactions representing the title and paragraphs.
Pattern Discovery: In this step, the SCPM method is chosen as a mining
Figure 6.7: Flow chart of experimental procedure for pattern deploying methods PDM and PDS in the pattern taxonomy model PTM.
mechanism in order to find frequent sequential closed patterns from
transactions. Each document is now represented by pattern taxonomies
consisting of discovered patterns.
Pattern Deployment: There are two choices for pattern deploying. Either
PDM or PDS method can be chosen in order to map discovered patterns
into a hypothesis space. The main difference between these two methods is
that the latter considers the pattern support during pattern re-evaluation.
Concept: After pattern deployment, the concept of the topic is built by merging
all documents using pattern decomposition.
Test: Once the concept is established, the relevance of each
document in the test dataset is estimated using the document evaluating
function. Documents in the dataset are then ranked according to their relevance
scores.
Evaluation: The system’s performance is evaluated using the aforemen-
tioned measures. After evaluation, the system assesses the next topic if
required.
6.3.4 Procedure of Pattern Evolving
The procedure of pattern evolving is similar to that of pattern deploying in the first
three steps, but differs in the remaining ones. Figure 6.8 presents the flow chart of
pattern evolving methods DPE and IPE. Each step is briefly described as follows:
Topic: The system usually needs to process a number of topics and starts
each of them in turn. A topic contains a training dataset and a test dataset
Figure 6.8: Flow chart of experimental procedure for pattern evolving methods DPE and IPE in the pattern taxonomy model PTM.
each of which contains a set of documents.
Data Transform: Each document in the training dataset is preprocessed.
For a document, the words enclosed by the “title” and “text” tags are retained
for further processing, which includes word stemming and stopword
removal. After preprocessing, each document is transformed into a set of
transactions representing the title and paragraphs.
Pattern Discovery: In this step, the SCPM method is chosen as a mining
mechanism in order to find frequent sequential closed patterns from
transactions. Each document is now represented by pattern taxonomies
consisting of discovered patterns.
Pattern Deployment: The pattern evolving methods DPE and IPE undertake
different processes in this step. For DPE, pattern deployment proceeds
as usual: deployed patterns are generated and passed to the
subsequent step. For IPE, however, patterns do not need to be
deployed before they are evolved. Where pattern deploying is required, either
PDM or PDS can be selected to perform the task.
Pattern Evolution: There are two approaches for pattern evolution,
DPE and IPE. Both approaches need information from the negative
documents (“nds”). The DPE method evolves the deployed patterns,
which is viewed as term-level evolution, whereas the
IPE method operates directly on the non-deployed patterns
resulting from the Pattern Discovery step, which is referred to as
pattern-level evolution.
Concept ∼ Evaluation: Please refer to the same steps described in the
procedure of pattern deploying in the previous section.
6.4 Experimental Setting
All the experiments reported in this thesis were conducted on a PC equipped
with an Intel Pentium IV 3.0 GHz CPU and 1,024 MB of memory, running the Windows
XP operating system. The PTM-based IF system is coded
in the Java programming language, with J2SDK version 1.4.2 as the development
environment. The data collection was acquired on a licensed CD from the TREC
organisation and is used in our experiments without any modification, although
we found some errors and duplicates in the data. The relevance
judgement information for each topic in the training and test datasets is also derived
from files downloaded directly from the TREC Web site2.
The value of minimum support used for association rule mining in the
experiments is set to 0.2 based on system optimisation. For
consistency, we use the same minimum support in all related mining algorithms.
The influence of different minimum support settings is a well-studied issue
which has been widely investigated in the data mining literature [54]; thus, in our
experiments we did not focus on this coefficient. Moreover, the recursive
loop of the proposed mining algorithms stops and exits when no
more patterns are found. However, in some cases (e.g., topics r193 and r199) the
recursive loop seems not to stop, since some documents in these topics contain a
large number of long patterns. The longest pattern we found has length 15, using the SCPM and
SPM mining algorithms. Therefore, for non-sequential pattern mining algorithms

2 http://trec.nist.gov/data/t2002_filtering.html
(i.e., NSPM and NSCPM), the maximum pattern length we search for is set
to 15, and the loop exits once patterns of that length have been found, regardless of
whether any longer candidates could be generated.
6.5 Experiment Evaluation
In order to evaluate the performance of the proposed PTM model, we apply PTM
to IF tasks and compare the results against those of other methods. For
an IF task, the system extracts a profile from the training dataset for each topic
and aims to filter out non-relevant incoming documents according to these
user profiles. Firstly, we apply data preprocessing to each document in order
to reduce dimensionality: stopword removal and term stemming are performed
according to a given list of stopwords (see Appendix C) and the Porter stemming
algorithm [117]. In practice, about 20 to 30 percent of text consists of stopwords [19].
There are many classic approaches to concept (i.e., user profile) generation.
The Rocchio algorithm [122], which has been widely adopted in the areas of TC
and IF, can be used to build the profile representing the concept of a topic
from a set of relevant and irrelevant documents. The centroid ~c of a
topic can be generated by using the following equation:
$$\vec{c} = \alpha \frac{1}{|D^{+}|} \sum_{\vec{d} \in D^{+}} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|D^{-}|} \sum_{\vec{d} \in D^{-}} \frac{\vec{d}}{\|\vec{d}\|} \qquad (6.5)$$
where α and β are empirical parameters; D+ and D− are the sets of positive and
negative documents respectively; ~d denotes a document.
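The centroid computation of Equation 6.5 can be sketched as follows. This is an illustrative implementation; the sparse-dict document representation and the function name are assumptions, not the thesis's code:

```python
import math

def rocchio_centroid(pos_docs, neg_docs, alpha=1.0, beta=0.0):
    """Topic centroid per Equation 6.5: the alpha-weighted mean of the
    unit-normalised positive document vectors minus the beta-weighted mean
    of the negative ones. Documents are sparse dicts term -> weight."""
    def accumulate(acc, doc, coeff):
        norm = math.sqrt(sum(w * w for w in doc.values()))
        for t, w in doc.items():
            acc[t] = acc.get(t, 0.0) + coeff * w / norm

    c = {}
    for d in pos_docs:
        accumulate(c, d, alpha / len(pos_docs))
    if neg_docs:
        for d in neg_docs:
            accumulate(c, d, -beta / len(neg_docs))
    return c
```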
The probabilistic method (Prob) [50, 119] is a well-known keyword-based
approach for concept generation. With this heuristic, each basic element
(term) t in the feature space is weighted using the following formula:
$$W(t) = \log\left(\frac{r + \eta}{R - r + \eta} \div \frac{n - r + \eta}{(N - n) - (R - r) + \eta}\right) \qquad (6.6)$$
where N and R are the total number of documents and the number of positive
documents in the training set respectively; n is the number of documents which
contain t; r is the number of positive documents which contain t, and η is a
coefficient.
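Equation 6.6 translates directly into code. A one-line sketch (the function name is ours):

```python
import math

def prob_weight(r, n, R, N, eta=0.5):
    """Probabilistic term weight per Equation 6.6: the log-odds that a
    positive document contains t versus a non-positive one, smoothed by eta."""
    return math.log(((r + eta) / (R - r + eta)) /
                    ((n - r + eta) / ((N - n) - (R - r) + eta)))
```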
In addition, TFIDF is also widely used. A term t can be weighted by
W(t) = TF(d,t) × IDF(t), where the term frequency TF(d,t) is the number of
times term t occurs in document d (d ∈ D) and D is the set of documents in
the dataset; DF(t) is the document frequency, i.e. the number of documents in
which the term t occurs at least once; IDF(t), the inverse document frequency,
is defined as log(|D|/DF(t)).
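A minimal sketch of this weighting, assuming documents are represented as lists of stemmed terms (the function name is ours):

```python
import math

def tfidf_weight(term, doc, docs):
    """W(t) = TF(d,t) * IDF(t) with IDF(t) = log(|D| / DF(t)), as defined above."""
    tf = doc.count(term)                 # raw term frequency in document d
    df = sum(term in d for d in docs)    # number of documents containing the term
    return tf * math.log(len(docs) / df)
```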
Another well-known term-based model is the BM25 approach [120], which is
basically considered the state-of-the-art baseline in IR. The weight of a term t can
be estimated by using the following function:
$$W(t) = \frac{TF \cdot (k_1 + 1)}{k_1 \cdot \left((1-b) + b\,\frac{DL}{AVDL}\right) + TF} \cdot \log\frac{(r+0.5)/(n-r+0.5)}{(R-r+0.5)/(N-n-R+r+0.5)} \qquad (6.7)$$
where TF is the term frequency; k1 and b are parameters; DL and AVDL are
the document length and average document length respectively. The values of
k1 and b are set to 1.2 and 0.75 respectively, following the suggestions in [140, 141].
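Equation 6.7 can be sketched as follows; the parameter defaults follow the values quoted above, and the function name is ours:

```python
import math

def bm25_weight(tf, dl, avdl, r, n, R, N, k1=1.2, b=0.75):
    """BM25 term weight per Equation 6.7: a length-normalised, saturated TF
    component multiplied by the relevance log-odds factor."""
    tf_part = tf * (k1 + 1) / (k1 * ((1 - b) + b * dl / avdl) + tf)
    rel_part = math.log(((r + 0.5) / (n - r + 0.5)) /
                        ((R - r + 0.5) / (N - n - R + r + 0.5)))
    return tf_part * rel_part
```

With `tf=1` and `dl == avdl`, the TF component reduces to 1 and the weight equals the relevance factor alone.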
The support vector machine (SVM) is also a well-known learning method,
introduced by Cortes and Vapnik [31]. Since the work of Joachims [63, 64],
researchers have successfully applied SVM to many related tasks and presented
convincing results [23, 24, 91, 127, 163]. The decision function in SVM is
defined as:
$$h(x) = \mathrm{sign}(w \cdot x + b) = \begin{cases} +1 & \text{if } w \cdot x + b > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (6.8)$$
where $x$ is the input vector; $b \in \mathbb{R}$ is a threshold and
$w = \sum_{i=1}^{l} y_i \alpha_i x_i$ for the given training data:

$$(x_1, y_1), \ldots, (x_l, y_l) \qquad (6.9)$$

where $x_i \in \mathbb{R}^n$ and $y_i$ equals $+1$ ($-1$) if document $x_i$ is
labeled positive (negative). $\alpha_i \in \mathbb{R}$ is the weight of the
training example $x_i$ and satisfies the following constraints:

$$\forall i: \alpha_i \geq 0 \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0. \qquad (6.10)$$
Since all positive documents are treated equally before the process of
document evaluation, the value of αi is set to 1.0 for all positive documents,
and the αi values for the negative documents can then be determined using
Equation 6.10.
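Under this setting the negative weights follow directly from Equation 6.10. A small sketch, assuming (as the text does) a single uniform alpha for the negative examples; the helper name is ours:

```python
def negative_alphas(labels):
    """Given alpha_i = 1.0 for every positive example, solve the constraint
    sum_i alpha_i * y_i = 0 (Equation 6.10) for a uniform negative alpha."""
    n_pos = labels.count(+1)
    n_neg = labels.count(-1)
    alpha_neg = n_pos / n_neg   # n_pos*(+1) + n_neg*alpha_neg*(-1) = 0
    return [1.0 if y > 0 else alpha_neg for y in labels]
```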
In document evaluation, once the concept for a topic is obtained, the
similarity between a test document and the concept is estimated using the
inner product. The relevance of a document d to a topic can be calculated by
the function R(d) = ~d · ~c, where ~d is the term vector of d and ~c is the
concept of the topic.
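The document evaluation step can be sketched as a sparse inner product (an illustrative helper, assuming dict-based term vectors):

```python
def relevance(doc_vec, concept):
    """R(d) = d . c : inner product over the shared terms of the document
    vector and the topic concept, both represented as sparse dicts."""
    return sum(w * concept.get(t, 0.0) for t, w in doc_vec.items())
```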
6.5.1 Experiment on Pattern Discovery Methods
In this section, we present the experimental results from applying various
data mining techniques to pattern discovery in an IF system, and compare the
effectiveness and efficiency of these techniques. The purpose of this
experiment is to determine whether PTM is superior to the data mining-based
methods. Furthermore, we can identify which data mining technique is most
suitable for adoption by a knowledge discovery system in the text mining
domain. In addition, we also compare the results of PTM with those of classic
approaches such as the probabilistic method.

Method     Pattern type               #Patterns  Runtime (sec.)    b/p
SPM        Sequential Pattern           126,310           5,308  0.343
SCPM       Sequential Closed Pattern     38,588           4,653  0.353
NSPM       Frequent Itemset             340,142          14,502  0.352
NSCPM      Frequent Closed Itemset       34,794           7,122  0.346
3Gram      nGram                         88,991           4,092  0.342
PTM(PDS)   Pattern Taxonomy               8,027           1,510  0.431

Table 6.5: Comparing PTM with data mining-based methods on RCV1 topics r101 to r150.
The comparison of PTM with the data mining-based methods on the first 50 RCV1
topics is depicted in Table 6.5. As we can see, PTM outperforms the other
methods by around 8 percentage points in b/p. The results also support the
superiority of PTM in efficiency, since the runtime of PTM is as low as 1,510
seconds, compared to more than 4,000 seconds for the others; NSPM even takes
14,502 seconds to complete the task. This can be explained in two ways. On the
one hand, PTM discovers only 8,027 patterns, compared to over 340,000 for
NSPM, which saves much time in pattern discovery. On the other hand, PTM does
not need to run the time-consuming pattern mining algorithm again in the
document evaluation phase, since the deploying method PDS is applied, leading
to greater efficiency than the data mining methods. In terms of
closed pattern mining, both closed pattern-based methods, SCPM and NSCPM,
produce fewer patterns than the non-closed pattern-based methods, SPM and
NSPM, respectively. We would expect an increase in the b/p scores for both
SCPM and NSCPM; however, only the former performs better than its non-closed
counterpart. The reason is that the significance of a non-closed pattern
(usually a short pattern) can be subsumed by a closed pattern (a longer
pattern), since the former is a subsequence of the latter; however, the
low-frequency problem of long patterns causes the result to deviate from this
intuition. This behaviour motivates further investigation of the issue of
pattern deploying.
Comparing the closed pattern-based methods with the non-closed ones, it is
obvious that the closed pattern-based SCPM and NSCPM are more suitable for
text mining tasks than SPM and NSPM, because fewer patterns are generated by
SCPM and NSCPM and less runtime is needed. Despite the slight difference in
performance, closed pattern-based methods are much more efficient than
non-closed ones. With regard to the issue of term order in a pattern, a
sequential pattern contains an ordered list of terms, whereas a non-sequential
pattern mined by NSCPM consists of an unordered itemset. Comparing the number
of discovered patterns, NSCPM produces fewer than SCPM; however, SCPM has
advantages over NSCPM in runtime and performance. As a result, SCPM is better
than NSCPM and is therefore suitable for use in text-related domains. In this
experiment, PTM thus adopts the concept of SCPM for closed sequential pattern
mining in the pattern discovery phase.
The sequential pattern-based methods SPM and SCPM use less runtime (5,308 and
4,653 seconds) than the non-sequential pattern-based NSPM and NSCPM (14,502
and 7,122 seconds) to complete the first 50 topics. This is mainly due to the
difference in the candidate generation process implemented by these two types
of methods. SPM and SCPM traverse only half of a paragraph on average when
generating candidates, because each traversal to find an (n+1)-term candidate
starts from the position of the last term of the n-term pattern. In contrast,
NSPM and NSCPM have to start from the first term of the paragraph for every
candidate generated, since term order in a pattern is not considered in such
mining methods. Another observation is that NSCPM generates fewer patterns
than SCPM. This can be explained by the fact that the proportion of non-closed
patterns among non-sequential itemsets is larger than that among sequential
patterns, so more non-closed patterns are removed during pattern pruning in NSCPM.
The lowest b/p result is produced by the 3Gram method, a special case of SPM.
3Gram discovers sequential patterns whose length is at most 3, resulting in a
great reduction in discovered patterns (88,991 for 3Gram compared to 126,310
for SPM). The runtime for 3Gram to complete the first 50 RCV1 topics is also
reduced, from 5,308 to 4,092 seconds. According to our assumption that long
patterns carry more significance than short ones, the removal of a large
number of long patterns in the 3Gram method should be accompanied by a drop in
performance. However, the b/p of 3Gram is only slightly lower than that of
SPM. This indicates that data mining methods can generate a large number of
specific long patterns, but these patterns remain redundant without an
adequate strategy to use them properly. It also implies that our proposed
method PTM provides an effective solution for processing and utilising
discovered patterns.
Figure 6.9: Number of patterns discovered using SPM with different constraints on 10 RCV1 topics.
Figure 6.9 illustrates the effect of the pattern pruning scheme and the
minimum support setting on the number of patterns and on performance. With
minimum support min_sup = 0.2, applying the pattern pruning scheme removes
about one fifth of the patterns, from 36,202 to 28,733 in total. Also, the b/p
score improves by around 4 percentage points, from 0.406 to 0.443. However,
changing the minimum support does not affect the b/p score without the pattern
pruning scheme: despite the great decrease in the number of patterns as
minimum support rises, the b/p score changes only slightly. One possible
explanation is that even though a large number of patterns are removed by the
minimum support setting, a large proportion of the remaining patterns are
still redundant. In contrast, activating pattern pruning not only reduces the
number of discovered patterns but also improves b/p performance. The results
therefore highlight the importance of using a pattern pruning scheme in the
sequential pattern mining algorithm.
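The core test of such a pruning scheme — dropping a pattern that is absorbed by an equally frequent super-sequence — can be sketched as follows. This is an illustrative check, not the thesis's exact pruning algorithm:

```python
def prune_non_closed(patterns):
    """Keep only closed sequential patterns: drop any pattern that is a
    subsequence of a longer pattern with the same support.
    `patterns` maps a term tuple to its support."""
    def is_subsequence(short, long_):
        it = iter(long_)
        return all(t in it for t in short)  # order-preserving containment

    closed = {}
    for p, sup in patterns.items():
        absorbed = any(is_subsequence(p, q) and sup == patterns[q]
                       for q in patterns if q != p and len(q) > len(p))
        if not absorbed:
            closed[p] = sup
    return closed
```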
The average b/p values over the 10 topics are also illustrated in Figure 6.9,
which shows the improvement achieved by using SPM with pruning compared to SPM
without it. As the minimum support increases, the average b/p value decreases
slightly, from 0.409 to 0.406. This means the effect of the patterns pruned by
minimum support alone is not significant, since their supports are relatively
smaller than those of the remaining patterns. The performance is clearly
enhanced by applying the pruning scheme, as the b/p value increases from 0.409
to 0.443; this is because the noise from redundant patterns is reduced once
they are pruned. By using both minimum support and the pruning scheme as
constraints, a significant improvement is therefore achieved.
To compare with other classic IF methods, we implemented the TFIDF method and
the probabilistic (Prob) method, which are described as follows.
TFIDF: Let D be a set of documents. The term frequency TF(d,t) is the number
of times term (word) t occurs in document d (d ∈ D), and the document
frequency DF(t) is the number of documents in which term t occurs at least
once. The inverse document frequency IDF(t) is defined as log(|D|/DF(t)),
which is low if term t occurs in many documents and high if it occurs in
only a few. The weight of a term t can then be represented by its TFIDF
value, calculated as

$$W(t) = TF(d,t) \cdot IDF(t).$$
Prob: The probabilistic method uses a keyword-based algorithm. With this
heuristic, a term t is weighted using the following formula:

$$W(t) = \log\left(\frac{r + 0.5}{R - r + 0.5} \div \frac{n - r + 0.5}{(N - n) - (R - r) + 0.5}\right) \qquad (6.11)$$

where N and R are the total number of documents and the number of positive
documents in the training set respectively; n is the number of documents
which contain t, and r is the number of positive documents which contain t.
Table 6.6 depicts the average precision of the top 20 returned documents on
10 RCV1 topics. It can be seen that PTM outperforms the other methods: the
top-20 score for PTM exceeds those of the TFIDF and Prob methods by around 20
percentage points. The two data mining methods, SPM and SCPM, are also
superior to these classic methods. The strong performance of the data
mining-based methods indicates that the use of phrases (i.e., sequential
patterns) is feasible and applicable compared with the keyword-based TFIDF
and Prob methods.

Topic    TFIDF  Prob   SPM    SCPM   PTM(PDS)
r110     0.15   0.30   0.45   0.65   0.50
r120     0.45   0.30   0.80   0.60   0.65
r130     0.05   0.05   0.10   0.25   0.25
r140     0.35   0.30   0.45   0.10   0.65
r150     0.15   0.01   0.10   0.10   0.20
r160     0.90   1.00   0.95   1.00   1.00
r170     0.30   0.30   0.55   0.60   0.50
r180     0.70   0.70   0.65   0.65   0.65
r190     0.75   0.60   0.80   0.80   0.95
r200     0.20   0.50   0.20   0.40   0.70

top-20   0.400  0.406  0.505  0.515  0.605

Table 6.6: Precisions of top 20 returned documents on 10 RCV1 topics.
However, the computational cost of discovering patterns remains a major
concern for the data mining-based methods, since the keyword-based TFIDF and
Prob methods are well known to be fast and efficient. Another observation is
that SCPM achieves a slightly better result than SPM, which suggests it
benefits from the pattern pruning scheme used in SCPM. With regard to the
correlation between a method's performance and the number of patterns it
discovers, we found no strong relationship between the two factors. According
to the numbers of generated patterns presented in Figure 6.9, four topics
have at least 3,000 patterns (i.e., r110, r140, r150 and r170), and three of
them (all except r140) score at most 0.5 in top-20 for the pattern-based PTM
method, according to the results in Table 6.6. However, this does not mean
that a low pattern count always corresponds to high performance. As we can
see, topic r130 has a low top-20 score and few patterns as well. Hence there
is no evidence of any correlation between the number of patterns and the
performance; the number of patterns is not one of the main factors that
affect the result of a pattern-based method. For efficiency reasons, however,
a method which produces fewer patterns is preferable.
Another observation from Table 6.6 is that the score for SCPM on r140 drops
significantly after pattern pruning (0.10, compared to 0.45 for SPM on the
same topic). This can be explained by the removal of some useful non-closed
patterns which happen to constitute the majority of the specific indicators
for this topic. However, such a severe drop in performance is not common for
SCPM. Generally, according to the positive results in Table 6.5, the pattern
pruning scheme used in SCPM improves b/p performance (0.353 for SCPM compared
to 0.343 for SPM on the first 50 topics). Comparing SCPM to SPM on top-20
scores, a similar result can be found in Table 6.6. Accordingly, non-closed
sequential patterns prove redundant and should be removed in a sequential
pattern-based method.
We have investigated the performance of PTM and found significant results in
both the top-20 and b/p measures. Other measures, such as precision and
recall, are also used for evaluation; the results are illustrated in
Figure 6.10, which compares PTM with the other methods on the precision and
recall curve for RCV1 topic r110. As we can see, PTM performs better than the
other methods at high recall values, while TFIDF has the lowest performance
on this topic. Generally speaking, the data mining-based methods are superior
to the classic methods. All methods produce similar results after the point
where recall equals 0.8, indicating that no method dominates the others in
the high-recall area.

Figure 6.10: Comparison of precision and recall curves for different methods on RCV1 Topic r110.
In summary, we have examined several data mining methods adopted for pattern
discovery in a pattern-based IF system. We have also tested our proposed PTM
model and compared its results with those of the data mining-based methods
and the classic methods. The following findings are observed:
• Data mining approaches can be used for the task of pattern discovery in
the text mining domain. To overcome the problem of the large number of
association rules (patterns) generated when using these approaches, our
strategy is to split the text of a document into several parts based on
paragraphs; these paragraphs can then be treated as transactions and used by
data mining methods. Besides the paragraph, a whole document or a single
sentence could also be defined as a transaction. However, the former
definition causes the above-mentioned problem of a tremendous number of
discovered patterns, especially when the number of documents is vast, while
the latter generates too many short, non-significant patterns due to the
short sentence length. Hence splitting documents by paragraphs is a suitable
and effective approach for applying data mining in the text domain.
• Both closed pattern-based approaches (i.e., SCPM and NSCPM) and non-closed
approaches (i.e., SPM and NSPM) can be adopted by a pattern-based IF system
for pattern discovery. The two kinds of approaches yield similar b/p
performance on the first 50 topics. However, SCPM and NSCPM require much less
runtime than SPM and NSPM, due to the pattern pruning scheme used in the
closed pattern-based methods. In addition, the closed pattern-based
approaches generate fewer patterns, showing that they can efficiently
alleviate the computational cost problem.
• The sequential pattern-based approaches SPM and SCPM require less runtime
than the non-sequential NSPM and NSCPM, which can be explained by the more
efficient candidate generation process adopted by SPM and SCPM. Although SCPM
is more efficient than NSCPM, the former discovers more patterns than the
latter, indicating that candidate generation takes more time than pattern
pruning in the closed pattern-based methods. Therefore, the sequential
pattern-based approaches are more efficient than the non-sequential ones.
• The pattern pruning scheme is important and necessary for a data
mining-based method, given the large number of patterns generated, which is
one of the most serious problems caused by applying these techniques in the
text domain. Pruning not only reduces the number of discovered patterns but
also improves the effectiveness of a pattern-based IF system.
• In order to reduce the runtime of a pattern-based system without affecting
performance, the nGram-based method mines patterns of limited length and
stops discovering patterns once the length of the mined patterns reaches a
pre-specified value. However, although the 3Gram approach is slightly more
efficient than SCPM, it produces more than double the number of discovered
patterns and weakens performance. This implies that the majority of the
redundant 2Terms and 3Terms patterns are non-closed patterns, which causes
the above-mentioned problem. This behaviour supports the importance of
pattern pruning in SCPM. We also tested the 5Gram method and obtained a
similar result.
• The document evaluation method is sensitive to the frequency of patterns:
the weight of a pattern is directly proportional to its frequency in
documents. From the comparison between SPM and 3Gram, it is clear that in SPM
the more specific long patterns, which carry more significant information,
cannot actually improve the system's effectiveness. This is mainly due to the
naturally low frequency of those long patterns, namely the low-frequency
problem, which is one of the main drawbacks of data mining-based approaches.
• The way discovered patterns are used in the data mining-based approaches
has proven inadequate according to our experimental observations. Although
these approaches can discover various types of patterns (i.e., frequent
sequential patterns, frequent itemsets, and frequent closed or non-closed
patterns), how to effectively use the discovered patterns is still a critical
issue. Using pattern support for document evaluation suffers from the
low-frequency problem when patterns are long. Although the data mining
methods are superior to the TFIDF and Prob methods, they do not outperform
keyword-based IR methods such as Rocchio [158]. Therefore, a proper pattern
evaluation method which can solve the low-frequency problem for specific long
patterns is required.
• The experimental results support the superiority of the proposed PTM
method in both effectiveness and efficiency. PTM reduces the number of
patterns by 77% compared to NSCPM, the best of the data mining methods in
this respect, and takes only one third of the runtime of SCPM, the most
efficient data mining method, while achieving a 21% improvement in average
b/p on the first 50 RCV1 topics. PTM also improves top-20 precision by 51%
and 49% over the TFIDF and Prob methods respectively on the 10 RCV1 topics.
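The paragraph-as-transaction strategy from the first finding above can be sketched as follows (an illustrative helper, assuming paragraphs are separated by blank lines and terms are already stemmed):

```python
def to_transactions(document):
    """Split a document into paragraph-level transactions: each non-empty
    paragraph becomes one set of terms, ready for a mining algorithm."""
    return [set(par.split()) for par in document.split("\n\n") if par.strip()]
```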
6.5.2 Experiment on Pattern Deploying
This section presents the experimental results for the pattern deploying
methods, PDM and PDS, proposed in Chapter 4 to address the problem caused by
the inadequate use of patterns discovered by data mining mechanisms. The main
problem is that too many patterns are generated by data mining-based methods
and there is no existing way to use these discovered patterns effectively.
Moreover, the low frequency of specific long patterns is the key factor in
this problem, according to our findings in the previous section. This means
that if we can find a way to exploit the significance provided by these
specific patterns, the effectiveness of the system can be greatly boosted. In
the previous section we presented some results for PDS compared to the data
mining methods, showing that PDS significantly improves both effectiveness
and efficiency. In this section, we focus on comparing the two proposed
deploying methods against all other methods, including Prob, the data mining
(DM) method SCPM, and Rocchio, with intensive examination on all RCV1 topics.
Among these baselines, Prob is a classic probabilistic method that
outperforms TFIDF; SCPM is one of the most effective and efficient data
mining methods; and Rocchio is a well-known IF method. We report the results
in two groups, the first 50 RCV1 topics and the rest of the RCV1 topics, due
to the different ways documents are labeled for evaluation in these two
datasets, as previously mentioned in Section 6.1.
All of the RCV1 topics are used in the experiment for evaluation. The Prob
method is implemented using Equation 6.11 with η = 0.5, and the Rocchio
method follows Equation 6.5 with α = 1 and β = 0. The DM method uses the
SPMining algorithm described in Section 3.1.2, with min_sup set to 0.2.
Details of PDM and PDS are presented in Chapter 4. The same document
preprocessing strategy is adopted by all methods, including word stemming and
stopword removal. For fair comparison, we also use the same set of keywords
in both the keyword-based and pattern-based methods; that is, the same set of
keywords used in Prob and Rocchio is adopted in DM, PDM and PDS for pattern
discovery.
Five implemented methods are briefly described as follows:
• Prob: Keyword-based probabilistic method in Equation 6.11 with η = 0.5.
• DM: Pattern-based data mining method SCPM.
• Rocchio: Keyword-based Rocchio method in Equation 6.5 with α = 1 and
β = 0.
• PDM: Pattern taxonomy model PTM equipped with the pattern deploying
method PDM proposed in Section 4.1.1.
• PDS: Pattern taxonomy model PTM equipped with the PDS method proposed
and described in Section 4.1.2.

         Prob   DM     Rocchio  PDM    PDS
top-20   0.407  0.406  0.416    0.470  0.490
b/p      0.381  0.353  0.392    0.427  0.431
MAP      0.379  0.364  0.391    0.435  0.441
Fβ=1     0.396  0.390  0.408    0.435  0.440
IAP      0.402  0.392  0.418    0.458  0.465

Table 6.7: Results of pattern deploying methods compared with others on the first 50 topics.
The experimental results of all methods on the first 50 topics are shown in
Table 6.7. The proposed method PDS improves performance on all five
evaluation measures compared to the other methods, especially in the top-20
score, meaning that it increases the precision of the first 20 returned
documents. It improves top-20 precision by 20.4% over the Prob method, with
improvements of about 11% to 16% in the b/p, MAP, Fβ=1 and IAP measures as
well. The significant improvement in top-20 precision indicates that PDS
performs well in the low-recall region and is able to rank relevant documents
near the top of the returned list. The results support the superiority of PDS
over the keyword-based Prob and Rocchio methods. Moreover, PDS performs
slightly better than PDM on all measures. This can be explained by the fact
that pattern support is considered and utilised in the PDS method during the
phase of using discovered patterns. This behaviour highlights the importance
of this pattern property, which is omitted in the PDM method; the omission of
pattern support in a pattern-based method can weaken the performance.
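The contrast between the two deployments can be sketched as follows. This is an illustrative simplification, not the exact PDM/PDS algorithms of Chapter 4: `use_support=True` mimics PDS's use of pattern support, while `False` mimics PDM's uniform treatment of patterns:

```python
def deploy(patterns, use_support=True):
    """Deploy discovered patterns onto their component terms.
    With use_support=True each term accumulates the (normalised) supports of
    the patterns containing it (PDS-style); with False every pattern
    contributes equally (PDM-style)."""
    weights = {}
    for pattern, support in patterns.items():
        contrib = support if use_support else 1.0
        for term in pattern:
            weights[term] = weights.get(term, 0.0) + contrib
    return weights
```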
         Prob   DM     Rocchio  PDM    PDS
top-20   0.542  0.540  0.562    0.583  0.576
b/p      0.457  0.434  0.476    0.492  0.498
MAP      0.476  0.456  0.492    0.512  0.513
Fβ=1     0.454  0.445  0.465    0.471  0.473
IAP      0.493  0.479  0.508    0.529  0.531

Table 6.8: Results of pattern deploying methods compared with others on the last 50 topics.
The promising results in Table 6.7 provide empirical evidence for the
superiority of the pattern deploying methods PDM and PDS over the data mining
method DM. They confirm that the evaluation of discovered patterns in DM is
ineffective, and that the pattern deployment strategy employed in PDM and PDS
offers a much better solution for effectively using discovered patterns. As
mentioned in the previous section, the DM method suffers from the main
problem that the useful information hidden in specific long patterns cannot
be fully utilised; this has been identified as the low-frequency problem for
long specific patterns. The significance of a specific pattern can be
extracted and carried by its components and progressively accumulated using
the pattern decomposition function. By deploying patterns, a term occurring
more frequently (i.e., appearing in many patterns) is assigned a higher
importance value. In contrast, a specific pattern cannot obtain such a high
value, since it is difficult to match the same pattern in text, especially
when the pattern contains many terms. This weakness of the DM method
corresponds with its unpromising result in our experiment.
Similar results for the last 50 RCV1 topics are shown in Table 6.8. Again,
both pattern deploying methods, PDM and PDS, outperform the other methods in
all measures. However, the differences in scores between the pattern
deploying methods and the other methods become smaller than those obtained on
the first 50 topics. For instance, PDS improves top-20 precision by 20.7%
over the DM method on the first 50 topics, whereas the improvement on the
last 50 is only 6.7%; similar observations apply to the other measures. This
behaviour can be explained by the different ways in which the two sets of
topics were generated. As mentioned in Section 6.1, the first 50 topics were
manually created by domain experts, whereas the last 50 were collected
automatically according to the category codes tagged in each XML document.
This may also explain why the scores of all measures for all methods on the
last 50 topics are higher than those on the first 50. We expected this
behaviour to be correlated with the number of available relevance examples
for each topic; however, further investigation found no relation between them
that could explain the observation.
Another interesting observation is that the PDM method is slightly better
than the PDS method in top-20 precision, which implies that PDM can place
relevant documents near the front of the ranked document list, but only
within the first few dozen documents. That means the PDM method performs well
only in the low-recall situation compared to PDS, which achieves a higher IAP
score. A similar ability can be found in the DM method: although DM is
inferior to the Prob method, it achieves similar top-20 performance. This
indicates that DM can produce performance comparable to that of the pattern
deploying methods in the low-recall situation.

         Prob   DM     Rocchio  PDM    PDS
top-20   0.475  0.473  0.489    0.527  0.533
b/p      0.419  0.394  0.434    0.460  0.464
MAP      0.427  0.410  0.442    0.473  0.477
Fβ=1     0.425  0.417  0.436    0.453  0.457
IAP      0.447  0.435  0.463    0.493  0.498

Table 6.9: Results of pattern deploying methods compared with others on all topics.

Similar behaviour for the DM method can also be found in the results obtained on the first
50 topics. Therefore, this finding provides evidence that the data mining
method DM has the ability to accurately rank highly relevant documents at the
front of the list.
Table 6.9 provides an overall view of the performance achieved by all methods
on the whole dataset. It confirms the previous finding that the pattern
deploying methods PDM and PDS achieve significant performance: both methods
outperform not only the data mining method DM but also the classic methods
Prob and Rocchio, according to the experimental results on all 100 topics.
Based on their robustness, the methods can be ranked as follows: PDS > PDM >
Rocchio > Prob > DM. It is not surprising that the Prob method is superior to
the DM method, although this is inconsistent with the result published in
[159], which showed that the data mining-based method outperforms the
probabilistic method. This can be explained by the different sets of topics
chosen and examined in the two experiments.
With regard to the pattern deploying methods, PDS is slightly better than
PDM. As mentioned before, this can be attributed to the usage of pattern support
in the PDS method. The pattern support is calculated by normalising the absolute
support of a pattern in a document. By considering the effect of pattern support,
a frequent term can be re-assigned a higher weight to reflect its significance.
This assumption has been demonstrated with a real example in Section 4.1.2. In
the PDM method, a pattern with a high support is treated equally to a pattern
with a low support. Hence, the support of a pattern cannot affect the significance
of the terms contained in it, which is the reason that PDM is inferior to PDS according
to the experimental results.
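The contrast between PDM and PDS described above can be sketched as follows. The exact weighting formulas are defined in Chapter 4; the functions below are a simplified illustration (the names and the uniform split of a pattern's weight across its terms are my assumptions), showing how normalised pattern support lets a high-support pattern push up the weights of its terms:

```python
# Hedged sketch: deploying a document's patterns into term weights,
# contrasting PDM (uniform pattern contribution) with PDS
# (contribution scaled by normalised pattern support).

def deploy_pdm(patterns):
    # patterns: list of (set_of_terms, absolute_support) in one document
    weights = {}
    for terms, _support in patterns:
        for t in terms:
            # every pattern contributes equally, regardless of its support
            weights[t] = weights.get(t, 0.0) + 1.0 / len(terms)
    return weights

def deploy_pds(patterns):
    total = sum(s for _, s in patterns)  # normalise absolute supports
    weights = {}
    for terms, support in patterns:
        for t in terms:
            # contribution scaled by the pattern's normalised support
            weights[t] = weights.get(t, 0.0) + (support / total) / len(terms)
    return weights

doc_patterns = [({"data", "mining"}, 3), ({"mining"}, 5)]
print(deploy_pdm(doc_patterns)["mining"])  # -> 1.5
print(deploy_pds(doc_patterns)["mining"])  # -> 0.8125
```

Under PDS the frequent pattern {"mining"} dominates the weight of its term, whereas PDM gives both patterns the same influence.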
Using PDS, the precision on the top-20 returned documents is improved by
around 9% over the Rocchio method, from 48.9% to 53.3%, as shown in
Table 6.9. On the first 50 topics and the last 50 topics, the PDS method also
increases the figures by 17.8% and 2.5% respectively. More importantly, PDS
uses the smallest number of training patterns compared to the other methods
(except PDM), as shown in Table 6.10. In fact, the number of training patterns
used by PDS is reduced by 72% compared to the number of terms used in
the Rocchio method, which means that the PDS method can improve not only the
effectiveness but also the efficiency of the system. The number of patterns used
by the Prob method is the same as that used by the Rocchio method, since both are
keyword-based methods and use the same set of terms for concept learning and
document evaluation. Another observation is that the DM method produces the
largest number of patterns, because the data mining scheme for pattern
discovery is applied.
We compare the pattern deploying methods PDS and PDM with the other three
methods; Figure 6.11 illustrates the results in terms of precision at the standard
recall points on the first 50 topics. It can be seen that the PDS method yields 0.77
          First 50   Last 50   All
Prob      32,760     37,418    70,178
DM        38,588     39,317    77,905
Rocchio   32,760     37,418    70,178
PDM, PDS   8,027     11,838    19,865
Table 6.10: Accumulated number of patterns found during pattern discovery.
of precision at the first recall point (recall = 0) and 0.65 at the second point
(recall = 0.1). The scores produced by the PDM method at the first few points
are slightly lower than those of the PDS method, with 0.76 and 0.63 at the first
and second points respectively. Comparing these scores to those generated by the
other methods, we find that PDS and PDM are much superior to the Rocchio and Prob
methods, but not so clearly superior to the DM method. It can be seen that the DM method
gives a similar score to the PDS method at the first point. This behaviour
corresponds to the previous finding that a data mining method is able to rank
highly relevant documents as close to the front of the list as possible compared to the
Rocchio and Prob methods. However, this ability is only effective in the low-
recall area, as the curve for the DM method drops rapidly after the first point. As
a whole, the DM method cannot dominate the other methods, but it is a good
indicator of relevance for the top few documents. In addition, the similar
performance at the first recall point for the DM method and the two pattern
deploying methods provides evidence that the DM method by itself, without
the pattern deploying mechanism of the PDS or PDM methods, can achieve
better results than the Rocchio and Prob methods, despite its inferior overall
performance. Such a behaviour is an
Figure 6.11: Comparison of all methods in precision at standard recall points on the first 50 topics.
important advantage obtained by a data mining-based method.
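The standard-recall-point curves discussed here interpolate precision at the 11 recall levels 0.0, 0.1, ..., 1.0, and their average is the IAP score reported in the tables. A minimal sketch of that computation (my own helper, not the thesis code):

```python
def eleven_point_precision(ranked_relevance):
    # ranked_relevance: 0/1 flags for the ranked documents of one topic
    total_rel = sum(ranked_relevance)
    recalls, precisions = [], []
    hits = 0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            recalls.append(hits / total_rel)
            precisions.append(hits / i)
    points = []
    for r in [i / 10 for i in range(11)]:
        # interpolated precision: best precision at any recall >= r
        ps = [p for rec, p in zip(recalls, precisions) if rec >= r]
        points.append(max(ps) if ps else 0.0)
    return points

curve = eleven_point_precision([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
print(curve[0])   # precision at recall = 0 -> 1.0
```

Averaging such curves over all topics gives the plotted figures; averaging the 11 points gives IAP.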
The comparison of the PDS method and the Rocchio method on each topic, in
terms of the difference in Fβ=1, is illustrated in Figure 6.12. It can be observed that the PDS
method outperforms the Rocchio method on the majority of topics. A few
topics show negative results in the figure. Among them, the worst
case is the result for topic 157, probably because there are not
sufficient positive examples for concept learning. Another observation,
corresponding to the previous finding, is that the average results on the first 50
topics are better than those on the last 50 topics. This can be explained by the
different manners in which these two sets of topics were built.
Moreover, we investigate the comparison of top-20 precision
between the PDS and Rocchio methods on each topic. The
Figure 6.12: Comparison of PDS method and Rocchio method in difference of Fβ=1 on all topics.
Figure 6.13: Comparison of the PDS method and the Rocchio method in difference of top-20 precision on all topics.
Figure 6.14: Comparison of all methods in all measures on 100 topics.
results are shown in Figure 6.13, which again confirms the superiority of the PDS
method in top-20 precision on almost all topics. There are many significant
improvements in scores across all topics, indicating that the PDS method is able to
accurately filter out irrelevant documents in the low-recall situation. In order to
gain an overall view, a comprehensive comparison of all methods in all measures
is depicted in Figure 6.14. It can be seen that the pattern deploying methods PDS
and PDM outperform the other five baselines in all evaluation measures. These
promising results for the PDS method support the importance of the pattern
deploying mechanism, which has been proven to be able to overcome
the low-frequency problem pertaining to the DM method and data mining-based
methods.
In summary, we have examined two pattern deploying methods, PDS and
PDM. Both are proposed to provide a proper mechanism for exploiting
patterns discovered using data mining techniques. We have also compared
their results to those of data mining-based methods and term-based methods. In
conclusion, the following findings are observed in this section:
• By using pattern deploying strategies, the experimental results of the PDS
and PDM methods provide evidence that pattern deploying methods
can significantly improve the effectiveness of the information filtering system.
These promising results from the PDS and PDM methods also indicate that
deployment of discovered patterns is a proper way to exploit these patterns
and to solve the low-frequency problem pertaining to the data mining-based
methods.
• The main drawback of data mining-based methods is that too many patterns
are generated by mining algorithms and there is no suitable existing mechanism
to deal with these patterns effectively. With pattern
deploying strategies, the number of patterns is dramatically reduced,
by 75% compared to the data mining method DM. This minimises the
computing complexity and also saves space for storing patterns.
• The low-frequency problem pertaining to the data mining-based methods
has been solved by deploying patterns into a hypothesis space. The
significance of a specific pattern can be extracted, carried by its deployed
components, and progressively accumulated using a pattern decomposition
function. Once the pattern is deployed, a term (i.e., component) with
higher occurrence (i.e., one appearing in many patterns) is assigned a
higher value of importance to support its significance. The feasibility
and effectiveness of such a strategy have been proven by the positive
experimental results in this section.
• The usage of pattern support in the PDS method leads to a noticeable
improvement over the PDM method, which omits this potentially useful
property during pattern deploying. However, even without
considering this property, the PDM method still outperforms the Rocchio,
Prob and DM methods and produces much better results in all measures
on both the first and last 50 topics.
• Despite the low overall performance of the DM method compared to the
other methods, the data mining-based methods can achieve an outcome
similar to the PDS method in precision at the first standard
recall point on the first 50 topics. This implies that the DM method
has a high accuracy of document filtering in the low-recall situation.
• All methods yield higher scores on the last 50 topics than on the first 50
topics. However, the improvement achieved by pattern deploying methods
on the last 50 topics is slightly less than that on the first 50. The
reason is that the manners of generating these two sets of topics are
different: the first 50 topics are classified and judged manually by experts,
whereas the last 50 are generated by the system according to the coded
information in the metadata of each document.
6.5.3 Experiment on Pattern Evolution
This section presents the results of the evaluation of DPE and IPE, the proposed
pattern evolving approaches used in PTM. In the previous section, PTM was
significantly improved by adopting the pattern deploying method PDS, which
uses the strategy of mapping discovered patterns into a feature space in order to
solve the low-frequency problem pertaining to specific long patterns. However,
information from the negative examples has not yet been exploited during concept
learning. In this experiment we test the ability of DPE and IPE to deal with
negative documents.
In order to compare the PTM method with others, we implement several
approaches and divide them into two categories. The first category contains all
data mining-based methods, such as sequential pattern mining, sequential closed
pattern mining, frequent itemset mining and frequent closed itemset mining,
which have been discussed in Section 6.5.1. The other classic IF methods,
including nGram, Rocchio, Probabilistic and TFIDF, are classified into the second
category. Two state-of-the-art models, BM25 and SVM, are also implemented in
this section for comparison purposes. Note that we employ SCPM as the method
for pattern discovery and PDS as the pattern deploying approach for PTM. With
regard to pattern evolution, IPE is chosen due to its promising performance. A
brief description of these methods is given in Table 6.11.
Before we discuss the comparison between PTM and the other baselines,
we first investigate the experimental results of the two proposed pattern evolving
approaches, DPE and IPE. Table 6.12 reports the figures for all evaluation
measures achieved by the pattern evolving methods (DPE, IPE) and the pattern deploying
methods (PDS, PDM) on all RCV1 topics. As we can see from the table, the
individual pattern evolving method IPE outperforms the other methods. These
results provide evidence to support the superiority of IPE, indicating that IPE can
effectively exploit the information provided by negative documents. Moreover,
the results also confirm that the process of pattern evolution should take place
at the pattern level rather than at the term level, as the DPE method does. It can
Method                   Description                                 Algorithm
PTM                      Proposed method equipped with               IPE
                         PDS and IPE                                 Section 5.2.2
Sequential ptns.         Data mining method using frequent           SPM
                         sequential patterns                         Section 3.1.1
Sequential closed ptns.  Data mining method using frequent           SCPM
                         sequential closed patterns                  Section 3.1.1
Freq. itemsets           Data mining method using frequent           NSPM
                         itemsets                                    Section 3.2.2
Freq. closed itemsets    Data mining method using frequent           NSCPM
                         closed itemsets                             Section 3.2.2
nGram                    nGram method with n = 3                     3Gram
                                                                     Section 6.3.2
Rocchio                  Rocchio method                              Equation 6.5
                                                                     α = 1, β = 0
Prob                     Probabilistic method                        Equation 6.11
                                                                     η = 0.5
TFIDF                    TFIDF method                                TFIDF
                                                                     Section 6.5
BM25                     Probabilistic method                        Equation 6.7
                                                                     k1 = 1.2, b = 0.75
SVM                      Support vector machines method              Equation 6.8
                                                                     b = 0
Table 6.11: The list of methods used for evaluation.
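The distinction between the closed and non-closed variants in Table 6.11 (SCPM/NSCPM versus SPM/NSPM) rests on the standard notion of a closed pattern: a frequent pattern with no proper super-pattern of the same support. A toy itemset-level check is sketched below (the sequential case uses subsequence containment instead of set containment; this is an illustration, not the thesis's algorithm):

```python
def closed_only(freq_patterns):
    # freq_patterns: {tuple_of_terms: support}
    # keep a pattern only if no strict super-pattern has equal support
    closed = []
    for p, sup in freq_patterns.items():
        if not any(set(p) < set(q) and sup == s2
                   for q, s2 in freq_patterns.items()):
            closed.append((p, sup))
    return closed

freq = {("data",): 3, ("data", "mining"): 3, ("mining",): 5}
# ("data",) is absorbed by ("data", "mining"), which has the same support
print(closed_only(freq))
```

Filtering to closed patterns is what keeps the pattern counts of the closed variants lower without losing support information.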
be explained by the fact that the informative context of a pattern is preserved by IPE
during pattern evolving, whereas such information is lost in DPE since all patterns
are broken apart during mapping before they are evolved. Another observation is that
PDS performs marginally better than IPE in the score of b/p. After further
investigation, we find that PDS performs very well in b/p on the last
50 topics, leading to a slightly higher average score on all topics. As mentioned
before, the result obtained on the last 50 topics is not as stable as that on the
first 50 topics.
In terms of the coefficient µ used in DPE, only the MAP score is slightly
improved by changes in the value of µ, indicating that
the filtering accuracy cannot be greatly improved by setting a coefficient to
shuffle significance among patterns in a document. This can be explained by the fact that
most of the shuffled patterns are deployed patterns, which means each of them
may represent multiple concepts from its various parent patterns. Unfortunately,
these parent patterns are not all relevant. For instance, a deployed pattern
“mining” can be acquired from the relevant parent pattern “data mining” and
the irrelevant pattern “strip mining” when we consider a topic about “knowledge
discovery”. When we find the irrelevant part and weaken the significance of
the pattern “mining”, the significance of the relevant part represented by the pattern is
reduced as well. This problem even weakens the overall performance of DPE,
leading to its slight inferiority to the pattern deploying method PDS, and it
motivates the proposed IPE method. In IPE, patterns are
evolved and revised at the pattern level rather than at the term level, which means
patterns are modified before they are deployed into a hypothesis space. Using the
aforementioned patterns as an example, if we find that “strip mining” is not relevant to
the topic “knowledge discovery”, this pattern is first weakened individually and
then merged into the space with the unchanged relevant pattern “data mining”.
This ensures that the relevant part of the pattern “mining” is preserved. As a
          PDS      PDM      DPEµ=3   DPEµ=5   DPEµ=7   IPE
top-20    0.5330   0.5265   0.5280   0.5285   0.5275   0.5360
b/p       0.4643   0.4598   0.4507   0.4507   0.4516   0.4632
MAP       0.4768   0.4734   0.4649   0.4652   0.4653   0.4770
Fβ=1      0.4565   0.4528   0.4519   0.4520   0.4520   0.4570
IAP       0.4982   0.4932   0.4861   0.4867   0.4867   0.4994
Table 6.12: Comparison of pattern deploying and pattern evolving methods used by PTM on all topics.
result, the change is applied only to those patterns which are not yet deployed and
which have high specificity. The experimental results support our finding and show
the superiority of IPE.
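The “data mining” / “strip mining” example above can be made concrete with a toy sketch. The weights, the 0.5 weakening factor and the deployment rule are illustrative assumptions, not the thesis's formulas; the point is only the order of operations:

```python
# Toy contrast of DPE (weaken after deploying, at the term level) and
# IPE (weaken the offending pattern before deploying, at the pattern level).

def deploy(patterns):
    # split each pattern's weight evenly among its terms and accumulate
    w = {}
    for terms, weight in patterns:
        for t in terms:
            w[t] = w.get(t, 0.0) + weight / len(terms)
    return w

profile = [(("data", "mining"), 1.0), (("strip", "mining"), 1.0)]

# DPE: deploy first, then weaken the shared term "mining"; the relevant
# contribution from "data mining" is weakened along with the irrelevant one.
dpe = deploy(profile)
dpe["mining"] *= 0.5

# IPE: weaken the offending pattern "strip mining" first, then deploy;
# the contribution of "mining" coming from "data mining" is preserved.
ipe = deploy([(("data", "mining"), 1.0), (("strip", "mining"), 0.5)])

print(dpe["mining"])  # -> 0.5
print(ipe["mining"])  # -> 0.75
```

Evolving before deployment leaves the term “mining” with more of the significance it earned from the relevant parent pattern.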
Another advantage of pattern evolving at the pattern level in IPE is
scalability. The topic concept (i.e., a user profile) needs to be updated once the
original concept drifts, for example when the user changes his/her information needs. The
system should be able to adapt to the new concept by evolving representatives.
To update the concept more precisely, we have to target individual patterns rather
than the whole set of deployed patterns. This can easily be achieved using the IPE
method. Therefore, the IPE method is suitable for concept drifting or adaptive
filtering cases where an accurate updating mechanism is required.
In order to evaluate the effectiveness of DPE, we attempt to find the correlation
between the achieved improvement and a parameter Ratio, denoting the ratio of the
number of negative documents whose relevance is greater than the threshold to the
total number of documents. This value can be obtained using the following equation:

Ratio = |{d | d ∈ D− ∧ relevance(d) > Threshold(D+)}| / (|D+| + |D−|)
where d is a document in the negative dataset D−, relevance(d) is the function that
estimates the degree of relevance of d to the concept of its corresponding topic,
Threshold(D) refers to Equation 5.1, which is used to find the threshold for a set
of documents D, and D+ is the positive dataset.

Figure 6.15: The relationship between the proportion of negative documents whose relevance is greater than the threshold among all documents and the corresponding improvement of DPE with µ = 5 on improved topics.
Figure 6.15 illustrates the relationship between the improvement obtained when DPE
is applied and the above-mentioned value of Ratio. As we can see, the degree of improvement
is in direct proportion to the value of Ratio. That means the more qualified
negative documents are detected for concept revision, the more improvement we
can achieve. In other words, the expected result can be achieved by using the DPE
method.
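Under the definitions above, Ratio can be computed directly from the relevance scores. A small sketch (the threshold from Equation 5.1 is passed in as a plain number here, since its definition lies outside this excerpt; function and argument names are my own):

```python
def ratio(neg_scores, pos_scores, threshold):
    # neg_scores: relevance(d) for each d in D-
    # pos_scores: relevance(d) for each d in D+
    # threshold:  Threshold(D+) computed elsewhere (Equation 5.1)
    offending = sum(1 for s in neg_scores if s > threshold)
    return offending / (len(pos_scores) + len(neg_scores))

# two of the three negatives score above the threshold: 2 / 5
print(ratio([0.1, 0.9, 0.6], [0.5, 0.7], threshold=0.5))  # -> 0.4
```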
The results of the overall comparisons are presented in Table 6.13. We list
only the results obtained on the first 50 RCV1 topics, since not all methods can
complete all tasks on the last 50 topics. As mentioned earlier, itemset-based data
Method                   top-20   b/p     MAP     Fβ=1    IAP
PTM(IPE)                 0.493    0.429   0.441   0.440   0.466
Sequential ptns          0.401    0.343   0.361   0.385   0.384
Sequential closed ptns   0.406    0.353   0.364   0.390   0.392
Freq. itemsets           0.412    0.352   0.361   0.386   0.384
Freq. closed itemsets    0.428    0.346   0.361   0.385   0.387
nGram                    0.401    0.342   0.361   0.386   0.384
Rocchio                  0.416    0.392   0.391   0.408   0.418
Prob                     0.407    0.381   0.379   0.396   0.402
TFIDF                    0.321    0.321   0.322   0.355   0.348
BM25                     0.434    0.399   0.401   0.410   0.422
SVM                      0.447    0.409   0.408   0.421   0.434
Table 6.13: Comparison of all methods on the first 50 topics.
mining methods struggle on some topics, as too many candidates are generated
to be processed. In addition, the results obtained on the first 50 topics are
more practical and reliable, since the judgement for these topics is made manually
by domain experts, whereas the judgement for the last 50 is created based on the
metadata tagged in each document. The most important information revealed in
this table is that our proposed PTM-based IF model outperforms not only the data
mining-based methods, but also the term-based methods, including the state-of-
the-art methods BM25 and SVM.
The number of patterns used for training by each method is shown in
Figure 6.16. The total number of patterns is estimated by accumulating the
number for each topic. The figure shows that PTM is the method that
utilises the fewest patterns for concept learning compared to the others, because
an efficient pattern pruning scheme is applied in the PTM method.
The classic methods such as Rocchio, Prob and TFIDF adopt terms
Figure 6.16: Comparison in the number of patterns used for training by each method on the first 50 topics (r101∼r150) and the rest of the topics (r151∼r200).
as patterns in the feature space and thus use many more patterns than our proposed
PTM method, and slightly fewer than the sequential closed pattern mining method.
In particular, nGram, the method with the lowest performance, requires
more than 17,000 patterns for concept learning. In addition, the total number of
patterns obtained on the first 50 topics is almost the same as the number
obtained on the last 50 topics for all methods except PTM. The figure based
on the first topic group (r101∼r150) for PTM is lower than that based on the other
group (r151∼r200). This can be explained by the high proportion of closed
patterns obtained by PTM on the first topic group.
A further investigation into the comparison of PTM and TFIDF in top-20
precision on all RCV1 topics is depicted in Figure 6.17. It is obvious that PTM is
superior to TFIDF, as positive results are distributed over all topics,
especially the first 50. Another observation is that the scores on the first 50
topics are better than those on the last 50, again because of the different ways
in which these two sets of topics were generated, as mentioned before. An
interesting behaviour is that there are a few topics where TFIDF outperforms PTM.
Figure 6.17: Comparison of PTM(IPE) and TFIDF in top-20 precision.
After further investigation, we found that these topics share a similar characteristic:
only a few positive examples are available. For example,
topic r157, the worst case for PTM compared to TFIDF, has only three
positive documents available, whereas the average number of positive documents
per topic is 12.13. The number of documents for that topic is 42, compared
to 51.27 as the overall average number of documents. Similar behaviour is
found in topics r134 and r144: the former drops 0.25 in top-20 score and
the latter 0.4. It is no surprise that topics r134 and r144 contain only five and six
positive documents respectively.
The plot of precision at the 11 standard recall points for PTM and the data mining
methods on the first 50 RCV1 topics is illustrated in Figure 6.18. The result
supports the superiority of the PTM method and highlights the importance of
adopting proper pattern deploying and pattern evolving methods in a pattern-
based knowledge discovery system. Comparing their performance at the first few
points around the low-recall area, it is also found that the curves for the data mining
methods drop rapidly as the recall value rises and then keep a relatively gradual
slope from the mid-recall region to the end. All four data mining methods achieve
similar results. However, the curve for PTM is much smoother than
those for the data mining methods, with no severe fluctuation. Another
observation in this figure is that the data mining-based methods nevertheless perform
well at the point where recall equals zero, despite their unpromising overall
results. Accordingly, we can conclude that the data mining-based
methods can improve the performance in the low-recall situation. Comparing
their performance with the other methods depicted in Figure 6.19, it is obvious
that, for precision at the first recall point, the data mining-based
methods outperform the Rocchio, Prob and TFIDF methods. This behaviour
explains why SCPM and SPM perform better than Prob and
TFIDF in top-20 precision, as shown in Table 6.6.
Although PTM is equipped with the data mining algorithm for discovering
sequential closed patterns, the promising results could not be produced without the
successful application of the proposed PDS and IPE methodologies.
The proper usage of the PDS method, as proven previously,
can overcome the low-frequency problem and provide a feasible solution
for effectively exploiting the vast number of patterns generated by data mining
algorithms. Moreover, the employment of IPE provides the mechanism to
utilise the information from negative examples to evolve patterns for the purpose
of concept updating. In conclusion, the experimental results provide evidence
that the PTM method is an ideal model for a pattern-based knowledge
discovery system.
Figure 6.18: Comparing PTM(IPE) with data mining methods on the first 50 RCV1 topics.
Figure 6.19 presents the plot of precision at the 11 standard recall points for PTM
and several term-based methods on the first 50 RCV1 topics. Compared to the
previous plot in Figure 6.18, the difference in performance between the methods
is easier to recognise in this figure. Again, the PTM method outperforms
all the other methods, including the nGram, Rocchio, Prob, TFIDF, BM25 and SVM
methods. Among these methods, the nGram method achieves a noticeable
precision score at the first point, where recall equals zero, meaning that the nGram
method is able to promote the top relevant documents toward the front of the ranking
list. As mentioned before, data mining-based methods can perform well in the low-
recall area, which explains why nGram has better results at this point. However,
the scores for the nGram method drop rapidly over the following couple of points.
During that period, the SVM, BM25, Rocchio and Prob methods surpass the nGram
Figure 6.19: Comparing PTM(IPE) with other methods on the first 50 RCV1 topics.
method and maintain their superiority until the last point, where recall equals 1. There
is no doubt that the lowest performance is produced by the TFIDF method, which
outperforms the nGram method only at the last few recall points. In addition,
the Prob method is superior to the nGram method, but inferior to the Rocchio
method. The overall performance of Rocchio is better than that of the Prob method,
which corresponds to the finding in [158].
In summary, both pattern evolving methods, DPE and IPE, are experimentally
evaluated in this section and positive results are obtained. However, the IPE
method does not produce large gains over the pattern deploying method PDS.
The reason is that sufficient information has already been obtained from the positive
examples by PDS, where a large number of patterns have been discovered
and exploited. Hence, the effect of using information from negative
examples in IPE is relatively insignificant.
We have equipped our proposed pattern taxonomy model PTM with IPE and
compared its performance to those of the up-to-date data mining-based methods
and the well-known term-based methods, including the state-of-the-art BM25 and
SVM models. The results show that the PTM model can produce encouraging gains
in effectiveness, in particular over the SVM model. The promising results can
be explained by the fact that the use of pattern taxonomies in PTM combines
the advantages of terms and phrases. Moreover, the pattern deploying strategy
provides an effective means of estimating each term's significance in the
hypothesis space, based not only on the term's statistical properties but also on the
pattern's associations in the pattern taxonomies.
The important findings are summarised as follows:
• Both the DPE and IPE methods attempt to utilise information extracted
from negative examples to improve the performance of the pattern-based
knowledge discovery system PTM. The experimental results show that the
IPE method can achieve this goal by evolving individual patterns once an
offending pattern is detected in a negative example. The DPE method,
however, evolves patterns by shuffling the contribution of significance among all
elements in a deployed pattern, and yields slightly less satisfactory results in
all measures compared to the IPE method. Hence, PTM adopts IPE
for pattern evolving to conduct IF tasks in all the following
experiments.
• The main difference between DPE and IPE is that the former evolves
patterns at the term level, while the latter evolves patterns at the pattern
level before they are deployed. According to the experimental results, IPE
is superior to DPE and hence is suitable for pattern evolution in a pattern-
based knowledge discovery system.
• The performance of the DPE method does not depend on the tuning of the
parameter µ. One possible reason is that the elements of deployed patterns are
mixed up with context from both positive and negative training examples
due to the application of pattern decomposition. This is also the main
motivation for proposing and developing the individual pattern evolution
method, which evolves patterns before they are decomposed and mapped into a
deployed pattern.
• The similar performances achieved by all data mining-based methods, such
as sequential (closed) patterns and frequent (closed) itemsets, provide
evidence that selecting a proper approach to exploit discovered patterns is
more important than choosing a mining method to find different sorts of
patterns.
• The final promising results support the claim that the PTM model,
which implements IPE for pattern evolution, can outperform not only the
data mining-based methods but also the state-of-the-art term-based IR
methods. The PTM model benefits from the use of pattern taxonomies,
which combine the advantages of terms and phrases. The pattern
deploying strategy used by PTM also provides an effective means of
estimating each term's significance in the hypothesis space based on both
the term's statistical properties and the pattern's associations in the pattern
taxonomies.
6.6 Chapter Summary
In this chapter, we have conducted extensive experiments to evaluate the
proposed pattern-based knowledge discovery system PTM with various pattern
deploying approaches and evolution strategies. We briefly describe the existing
data collections and choose the RCV1 corpus as our dataset for evaluation, since
RCV1 is the latest corpus, containing a large number of documents and
relevance judgements. Most of the existing standard evaluation measures are selected to
estimate the system's performance. This is followed by a description of the experimental
procedures for the three main stages. Extensive analysis and discussion of the
experimental results are presented at the end.
In terms of pattern discovery, data mining techniques can be used.
However, the main drawback of using data mining is
the explosion in the number of discovered patterns. Both closed pattern-based
approaches (i.e., SCPM and NSCPM) and non-closed approaches (i.e.,
SPM and NSPM) can be adopted and used in a pattern-based IF system for pattern
discovery. The weight of a pattern is in direct proportion to the pattern's frequency
in documents. With reference to the comparison between SPM and 3Gram, it is
obvious that in SPM the more specific long patterns, which carry more significant
information, cannot improve the system's effectiveness. This is mainly due to
the naturally low frequency of those long patterns, which is
called the low-frequency problem. Such a problem is one of the main drawbacks
of data mining-based approaches. The way discovered patterns are used in the
data mining-based approaches is shown to be inadequate according to our observations
of the experimental results. Although these approaches can discover various types of
patterns (i.e., frequent sequential patterns, frequent itemsets, and frequent closed
or non-closed patterns), how to effectively use these discovered patterns is still a
critical issue.

By using pattern deploying strategies, the experimental results of the PDS
and PDM methods provide evidence that pattern deploying methods can
significantly improve the effectiveness of the information filtering system. The
promising results of the PDS and PDM methods also indicate that deployment
of discovered patterns is a proper way to exploit these patterns and to solve the
low-frequency problem pertaining to the data mining-based methods. The usage
of pattern support in the PDS method leads to a noticeable improvement over
the PDM method, which omits this potentially useful property during
pattern deploying. Despite the lowest overall performance of the DM
method compared to the other methods, the data mining-based method can achieve an
outcome similar to the PDS method in precision at the first standard
recall point on the first 50 topics. This implies that the DM method has a high
accuracy of document filtering in the low-recall situation.
The main difference between DPE and IPE is that the former evolves patterns
at the term level, while the latter evolves patterns at the pattern level before they
are deployed. According to the experimental results, IPE is superior to DPE and
hence is suitable for pattern evolution in a pattern-based knowledge discovery
system. The similar performances achieved by all data mining-based methods,
such as sequential (closed) patterns and frequent (closed) itemsets, provide
evidence that selecting a proper approach to exploit discovered patterns is more
important than choosing a mining method to find different sorts of patterns.
The final promising results support the claim that the PTM model, which
implements IPE for pattern evolution, can outperform not only data mining-based
methods but also the state-of-the-art term-based IR methods.
Chapter 7
Conclusion
In the last decade, many data mining techniques have been proposed for fulfilling
various knowledge discovery tasks. These techniques include association rule
mining, frequent itemset mining, sequential pattern mining, maximum pattern
mining and closed pattern mining. However, using these discovered patterns in
the field of text mining is difficult and often ineffective, because a useful long
pattern with high specificity lacks support. We argue that not all frequent short
patterns are useful, and hence that inadequate use of patterns derived from data
mining techniques leads to ineffective performance. In this thesis, an effective
pattern taxonomy model has been proposed to overcome the aforementioned
problem by deploying discovered patterns into a hypothesis space. In addition,
pattern updating schemes are also investigated in this research.
This thesis presents the research on the concept of developing an effective
knowledge discovery model (PTM) based on pattern taxonomies. PTM is
implemented in three main steps: (1) discovering useful patterns by integrating
a sequential closed pattern mining algorithm with a pruning scheme (Chapter 3);
(2) using the discovered patterns by pattern deploying (Chapter 4); and
(3) adjusting user profiles by applying pattern evolution (Chapter 5). Various
mechanisms in each step are proposed and evaluated for fulfilling the PTM
model. Numerous experiments within an information filtering domain are
conducted. The latest version of the Reuters dataset, RCV1, is selected to test
the proposed PTM-based information filtering system. The results show that
PTM outperforms not only several pure data mining-based methods, but also
traditional probabilistic and Rocchio methods.
Section 7.1 presents the main contributions of this research. Section 7.2
discusses possible directions for future work in this research area.
7.1 Contributions
The contributions made by this research are listed as follows:
Solving data mining problems: We can acquire a vast number of patterns using
data mining techniques for text mining. However, dealing with these
patterns is difficult due to some of their characteristics. A typical problem
is the low support of a specific long pattern. In this thesis, we address this
problem by proposing the PTM model. In PTM, the specificity of a long
pattern is preserved by transforming it into another data format, which can
then be effectively used by a text mining system.
Effective Pattern Taxonomy Model: A complete model is set up for implementing
three main phases of knowledge discovery: (1) discovering useful patterns;
(2) evaluating patterns; and (3) updating the information concept. In the
first phase, the pattern taxonomy model adopts up-to-date data mining
techniques to discover useful patterns and represents the information
concept using pattern taxonomies. In the second phase, pattern deploying
mechanisms are introduced to overcome the low frequency problem caused
by the inadequate use of discovered patterns. In the final phase, concept
updating is achieved by evolving patterns based on the information from
irrelevant document examples.
Novel Application of Current Data Mining Techniques: The pattern taxonomy
model is the first attempt at adopting frequent sequential pattern mining
and closed sequential pattern techniques to implement the knowledge
discovery task. Applying data mining techniques to the text mining domain
is very difficult, since textual data is in an unstructured format and pattern
discovery is time-consuming. The pattern taxonomy model overcomes this
problem and obtains strong results on the information filtering test platform
equipped with the proposed pattern deployment strategies. The related
information and experiments can be found in Chapter 3 and Section 6.5.1
respectively.
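The notion of a closed pattern used here can be sketched as follows (a toy illustration of the standard definition, not the thesis mining algorithm): a frequent sequential pattern is closed if no proper super-pattern has the same support.

```python
# Sketch: filtering closed sequential patterns from a set of frequent patterns
# with known supports (hypothetical toy data, not the thesis implementation).

def is_subsequence(short, long):
    """True if `short` occurs in `long` preserving order (not necessarily contiguous)."""
    it = iter(long)
    return all(term in it for term in short)

def closed_patterns(frequent):
    """frequent: dict {term tuple: support}. Keep only patterns that have no
    proper super-pattern with exactly the same support."""
    closed = {}
    for p, sup in frequent.items():
        absorbed = any(len(q) > len(p) and sup == sup_q and is_subsequence(p, q)
                       for q, sup_q in frequent.items())
        if not absorbed:
            closed[p] = sup
    return closed

frequent = {
    ("global",): 3, ("warming",): 3, ("global", "warming"): 3,
    ("emission",): 2,
}
print(closed_patterns(frequent))
# ("global",) and ("warming",) are absorbed by ("global", "warming")
```

Using closed patterns in this way keeps the most specific pattern of each support class, which is why they serve as compact features in the model.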
Proper Usage of Discovered Patterns: Inadequate use of discovered patterns
leads to the low frequency problem in a data mining-based method.
The Pattern Deploying Method (PDM) and the Pattern Deploying method
based on Supports (PDS) are developed to handle the discovered patterns
properly and provide suitable solutions for using them. Experimental
results show that the new deploying methods achieve significant
improvements over the others. The details can be found in Chapter 4 and
the related experiments are discussed in Section 6.5.2.
Scalable Modification Scheme for Concept Updating: From a text mining point
of view, negative documents may provide useful information for the system.
Hence the capability of handling negative patterns is essential for a pattern
taxonomy-based model. Two concept adjusting schemes are proposed in
this thesis for updating concepts in the knowledge base by pattern evolving.
The first is Deployed Pattern Evolving (DPE), which performs pattern
evolution at the document level. The second is Individual Pattern Evolving
(IPE), which performs pattern evolution at the pattern level. With respect to
information filtering, DPE and IPE can be used to re-evaluate the importance
of conflicting patterns, reducing interference from possibly noisy patterns.
The details of DPE and IPE are described in Chapter 5 and the related
experiments are analysed in Section 6.5.3.
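The evolving idea can be sketched roughly as follows (a deliberately simplified, hypothetical stand-in for DPE/IPE, with illustrative names and data): terms that also occur in negative (irrelevant) documents have their deployed weights scaled down, so conflicting patterns lose influence.

```python
# Sketch of the pattern-evolving idea: down-weight terms that also appear in
# negative (irrelevant) documents. A hypothetical simplification of DPE/IPE,
# not the thesis algorithms.

def evolve(weights, negative_docs, penalty=0.5):
    """weights: {term: weight} from deployed patterns.
    negative_docs: iterable of term lists judged irrelevant by the user.
    Terms seen in any negative document are scaled down by `penalty`."""
    offenders = {term for doc in negative_docs for term in doc}
    return {t: (w * penalty if t in offenders else w) for t, w in weights.items()}

weights = {"carbon": 0.5, "emission": 0.3, "market": 0.2}
negatives = [["market", "share", "price"]]
print(evolve(weights, negatives))
# "market" is scaled down; the other terms keep their weights
```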
Feasible Information Filtering System: An information filtering framework
based on the proposed pattern taxonomy model is established and evaluated
by a series of experiments. Compared to traditional information filtering
methods, the pattern taxonomy model can improve the effectiveness of the
system. It also gains advantages over up-to-date data mining-based methods
such as sequential phrase and frequent itemset-based methods. The
experimental results also verify that the proposed system addresses a
challenging issue for the text mining community, that is, providing effective
methods to overcome the limitations of term-based information filtering
models. Furthermore, the experiments are conducted on all the topics in the
RCV1 dataset, which is the latest benchmark data collection in the area of
text mining [123].
In summary, this research work presents several novel ideas: (1) For pattern
discovery, each paragraph of a document is treated as a unit to enable the
application of sequential pattern mining. (2) A document is expressed as a set of
weighted patterns, which are useful in acquiring the high-level concepts described
by the document. (3) Three levels of features are introduced for the organisation
of information and knowledge: the Term Level, the Pattern Level and the
Document Level. They represent different levels of abstraction extracted from
a document collection, which are useful for more effective retrieval and filtering.
(4) The notions of pattern deploying and pattern evolving are introduced to
represent sensible uses of discovered patterns; they can be exploited to construct
an effective PTM-based information filtering system.
7.2 Future Work
This research work on a pattern taxonomy-based knowledge discovery model is
developed towards applying data mining techniques to practical text mining
applications. In a PTM-based system, the knowledge base is represented by
the discovered pattern taxonomies, which provide many useful features such as
the support and confidence of a pattern, relationships between patterns, the
distribution of pattern taxonomies, and the dimension of these taxonomies.
These features can be used to capture more information for building a
descriptive and comprehensive representation in the knowledge base. In our
model, some features (such as the relationships among patterns and the support
of patterns) have been investigated and evaluated. The remaining features will
be examined in further research work. An initial investigation of using the
length of patterns as a critical factor in a PTM-based Web mining model is
examined by Zhou [168, 169].
Data mining algorithms such as association rule mining and sequential pattern
mining are computationally expensive, and so are pattern taxonomy-based
models, especially during the phase of pattern discovery. An efficient algorithm
for finding useful patterns in a large dataset is essential future work. One
possible way to improve the efficiency of the pattern taxonomy-based model
is to reduce the dimensionality of the feature space in the knowledge base. This
optimisation approach is known as feature selection. However, the tradeoff of
using feature selection is the loss of information in the selected features,
especially when the number of training examples is small. Therefore, an
alternative way of applying length-decreasing support constraints [132] to
frequent pattern mining may help. That is, the minimum supports used for
mining patterns of different lengths could vary. On the one hand, a higher
minimum support can be used for finding short patterns in order to reduce the
number of patterns to be mined. On the other hand, a lower minimum support is
set for longer patterns to prevent the specific information contained in these
patterns from being lost. However, more work is required to build a
constraint-based pattern taxonomy model.
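The length-decreasing constraint suggested above can be sketched as a support threshold that falls with pattern length (the threshold schedule and example values are illustrative assumptions, not the scheme of [132]):

```python
# Sketch of a length-decreasing minimum-support constraint: short patterns
# must clear a high support threshold, longer patterns a lower one.
# The linear schedule and toy patterns are illustrative assumptions.

def min_support(length, high=0.5, low=0.1, max_len=5):
    """Linearly decrease the minimum support from `high` (length 1)
    down to `low` (length >= max_len)."""
    if length >= max_len:
        return low
    return high - (high - low) * (length - 1) / (max_len - 1)

def keep(pattern, support):
    """Retain a pattern only if its support meets the length-dependent threshold."""
    return support >= min_support(len(pattern))

print(keep(("carbon",), 0.3))                        # short pattern, bar is 0.5 -> False
print(keep(("carbon", "emission", "trading"), 0.3))  # longer pattern, bar is 0.3 -> True
```

With such a schedule, a frequent but unspecific single term is pruned while a more specific three-term pattern of the same support survives, which is exactly the asymmetry the paragraph above argues for.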
Appendix A
An Example of a RCV1 Document
<?xml version="1.0" encoding="iso-8859-1"?>
<newsitem itemid="105780" id="root" date="1996-10-09" xml:lang="en">
<title>GERMANY: Court says to rule on VW-GM lawsuit Oct 30.</title>
<headline>Court says to rule on VW-GM lawsuit Oct 30.</headline>
<dateline>FRANKFURT, Germany</dateline>
<text>
<p>A German court said Wednesday it would rule at the end of the month on a charge of defamation brought by automaker Volkswagen AG against General Motors and GM's German subsidiary Adam Opel AG.</p>
<p>Following statements from lawyers for both companies at a hearing at the Frankfurt District Court, Judge Guenther Kinnel closed proceedings and said he would announce the court's ruling on Oct. 30.</p>
<p>VW is demanding 10 million German marks ($6.54 million) in damages for statements made by GM and Opel officials last March when GM filed a claim in the United States accusing VW of industrial espionage.</p>
<p>Wednesday's hearing was the latest development in a three-year series of legal battles between the two car giants.</p>
<p>GM alleges VW production chief Jose Ignacio Lopez de Arriortua and seven other former GM managers stole secrets on purchasing and car production plans when they moved to VW in early 1993.</p>
<p>Frustrated by the lack of progress in almost three years of legal action against VW in Germany, GM said at news conferences held on March 8 in Detroit and Ruesselsheim near Frankfurt it would seek justice in the United States in the espionage case by filing a complaint at a federal district court in Michigan.</p>
<p>VW has since filed a motion to have that case dismissed.</p>
<p>At Wednesday's hearing, VW's lawyers said GM had sought to prejudice public opinion at the news conferences when it said its U.S. complaint accused VW and top officials, including VW head Ferdinand Piech, of "conspiracy, conversion, the misappropriation of trade secrets and racketeering."</p>
<p>The lawyers also accused Opel of seeking to present VW as a criminal organisation in the public eye, for example by distributing a chronology of the three-year saga to the press.</p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0">
<code code="GFR">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-10-09" />
</code>
</codes>
<codes class="bip:topics:1.0">
<code code="C12">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-10-09" />
</code>
<code code="GCRIM">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-10-09" />
</code>
</codes>
<dc element="dc.publisher" value="Reuters Holdings Plc" />
<dc element="dc.date.published" value="1996-10-09" />
<dc element="dc.source" value="Reuters" />
<dc element="dc.creator.location" value="FRANKFURT, Germany" />
<dc element="dc.creator.location.country.name" value="GERMANY" />
<dc element="dc.source" value="Reuters" />
</metadata>
</newsitem>
Appendix B
Topic Codes of TREC RCV1
CODE DESCRIPTION
1POL CURRENT NEWS - POLITICS
2ECO CURRENT NEWS - ECONOMICS
3SPO CURRENT NEWS - SPORT
4GEN CURRENT NEWS - GENERAL
6INS CURRENT NEWS - INSURANCE
7RSK CURRENT NEWS - RISK NEWS
8YDB TEMPORARY
9BNX TEMPORARY
ADS10 CURRENT NEWS - ADVERTISING
BNW14 CURRENT NEWS - BUSINESS NEWS
BRP11 CURRENT NEWS - BRANDS
C11 STRATEGY/PLANS
C12 LEGAL/JUDICIAL
C13 REGULATION/POLICY
C14 SHARE LISTINGS
C15 PERFORMANCE
C151 ACCOUNTS/EARNINGS
C1511 ANNUAL RESULTS
C152 COMMENT/FORECASTS
C16 INSOLVENCY/LIQUIDITY
C17 FUNDING/CAPITAL
C171 SHARE CAPITAL
C172 BONDS/DEBT ISSUES
C173 LOANS/CREDITS
C174 CREDIT RATINGS
C18 OWNERSHIP CHANGES
C181 MERGERS/ACQUISITIONS
C182 ASSET TRANSFERS
C183 PRIVATISATIONS
C21 PRODUCTION/SERVICES
C22 NEW PRODUCTS/SERVICES
C23 RESEARCH/DEVELOPMENT
C24 CAPACITY/FACILITIES
C31 MARKETS/MARKETING
C311 DOMESTIC MARKETS
C312 EXTERNAL MARKETS
C313 MARKET SHARE
C32 ADVERTISING/PROMOTION
C33 CONTRACTS/ORDERS
C331 DEFENCE CONTRACTS
C34 MONOPOLIES/COMPETITION
C41 MANAGEMENT
C411 MANAGEMENT MOVES
C42 LABOUR
CCAT CORPORATE/INDUSTRIAL
E11 ECONOMIC PERFORMANCE
E12 MONETARY/ECONOMIC
E121 MONEY SUPPLY
E13 INFLATION/PRICES
E131 CONSUMER PRICES
E132 WHOLESALE PRICES
E14 CONSUMER FINANCE
E141 PERSONAL INCOME
E142 CONSUMER CREDIT
E143 RETAIL SALES
E21 GOVERNMENT FINANCE
E211 EXPENDITURE/REVENUE
E212 GOVERNMENT BORROWING
E31 OUTPUT/CAPACITY
E311 INDUSTRIAL PRODUCTION
E312 CAPACITY UTILIZATION
E313 INVENTORIES
E41 EMPLOYMENT/LABOUR
E411 UNEMPLOYMENT
E51 TRADE/RESERVES
E511 BALANCE OF PAYMENTS
E512 MERCHANDISE TRADE
E513 RESERVES
E61 HOUSING STARTS
E71 LEADING INDICATORS
ECAT ECONOMICS
ENT12 CURRENT NEWS - ENTERTAINMENT
G11 SOCIAL AFFAIRS
G111 HEALTH/SAFETY
G112 SOCIAL SECURITY
G113 EDUCATION/RESEARCH
G12 INTERNAL POLITICS
G13 INTERNATIONAL RELATIONS
G131 DEFENCE
G14 ENVIRONMENT
G15 EUROPEAN COMMUNITY
G151 EC INTERNAL MARKET
G152 EC CORPORATE POLICY
G153 EC AGRICULTURE POLICY
G154 EC MONETARY/ECONOMIC
G155 EC INSTITUTIONS
G156 EC ENVIRONMENT ISSUES
G157 EC COMPETITION/SUBSIDY
G158 EC EXTERNAL RELATIONS
G159 EC GENERAL
GCAT GOVERNMENT/SOCIAL
GCRIM CRIME, LAW ENFORCEMENT
GDEF DEFENCE
GDIP INTERNATIONAL RELATIONS
GDIS DISASTERS AND ACCIDENTS
GEDU EDUCATION
GENT ARTS, CULTURE, ENTERTAINMENT
GENV ENVIRONMENT AND NATURAL WORLD
GFAS FASHION
GHEA HEALTH
GJOB LABOUR ISSUES
GMIL MILLENNIUM ISSUES
GOBIT OBITUARIES
GODD HUMAN INTEREST
GPOL DOMESTIC POLITICS
GPRO BIOGRAPHIES, PERSONALITIES, PEOPLE
GREL RELIGION
GSCI SCIENCE AND TECHNOLOGY
GSPO SPORTS
GTOUR TRAVEL AND TOURISM
GVIO WAR, CIVIL WAR
GVOTE ELECTIONS
GWEA WEATHER
GWELF WELFARE, SOCIAL SERVICES
M11 EQUITY MARKETS
M12 BOND MARKETS
M13 MONEY MARKETS
M131 INTERBANK MARKETS
M132 FOREX MARKETS
M14 COMMODITY MARKETS
M141 SOFT COMMODITIES
M142 METALS TRADING
M143 ENERGY MARKETS
MCAT MARKETS
MEUR EURO CURRENCY
PRB13 CURRENT NEWS - PRESS RELEASE WIRES
Appendix C
List of Stopwords
a about above according across after afterwards again against albeit all almost
alone along already also although always am among amongst an and another
any anybody anyhow anyone anything anyway anywhere apart are around as
at av be became because become becomes becoming been before beforehand
behind being below beside besides between beyond both but by can cannot canst
certain cf choose contrariwise cos could cu day do does doesn doing dost doth
double down dual during each either else elsewhere enough et etc even ever every
everybody everyone everything everywhere except excepted excepting exception
exclude excluding exclusive far farther farthest few ff first for formerly forth
forward from front further furthermore furthest get go had halves hardly has hast
hath have he hence henceforth her here hereabouts hereafter hereby herein hereto
hereupon hers herself him himself hindmost his hither hitherto how however
howsoever i ie if in inasmuch inc include included including indeed indoors
inside insomuch instead into inward inwards is it its itself just kind kg km last
latter latterly less lest let like little ltd many may maybe me meantime meanwhile
might moreover most mostly more mr mrs ms much must my myself namely
need neither never nevertheless next no nobody none nonetheless noone nope
nor not nothing notwithstanding now nowadays nowhere of off often ok on once
one only onto or other others otherwise ought our ours ourselves out outside
over own per perhaps plenty provide quite rather really reuter reuters round said
sake same sang save saw see seeing seem seemed seeming seems seen seldom
selves sent several shalt she should shown sideways since slept slew slung slunk
smote so some somebody somehow someone something sometime sometimes
somewhat somewhere spake spat spoke spoken sprang sprung stave staves still
such supposing than that the thee their them themselves then thence thenceforth
there thereabout thereabouts thereafter thereby therefore therein thereof thereon
thereto thereupon these they this those thou though thrice through throughout thru
thus thy thyself till to together too toward towards ugh unable under underneath
unless unlike until up upon upward upwards us use used using very via vs
want was we week well were what whatever whatsoever when whence whenever
whensoever where whereabouts whereafter whereas whereat whereby wherefore
wherefrom wherein whereinto whereof whereon wheresoever whereto whereunto
whereupon wherever wherewith whether whew which whichever whichsoever
while whilst whither who whoa whoever whole whom whomever whomsoever
whose whosoever why will wilt with within without worse worst would wow ye
yet year yippee you your yours yourself yourselves
Bibliography
[1] K. Aas and L. Eikvil. Text categorisation: A survey. Technical report, Norwegian Computing Center, Report NR 941, 1999. 2, 20
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, pages 207–216, 1993. 22, 25, 34, 53, 59
[3] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI/MIT Press, 1996. 61
[4] R. Agrawal and J. C. Shafer. Parallel mining of association rules: Design, implementation, and experience. IEEE Transactions on Knowledge and Data Engineering, 8(6):962–969, 1996. 17, 25
[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, pages 478–499, 1994. 24, 61
[6] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of VLDB, pages 487–499, 1994. 61
[7] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering, pages 3–14, Taipei, Taiwan, 1995. 24, 26, 61
[8] H. Ahonen. Finding all maximal frequent sequences in text. In ICML99 Workshop, Machine Learning in Text Data Analysis, 1999. 61, 62
[9] H. Ahonen. Knowledge discovery in documents by extracting frequent word sequences. Library Trends, 48(1):160–181, 1999. 61
[10] H. Ahonen, O. Heinonen, M. Klemettinen, and A. I. Verkamo. Mining in the phrasal frontier. In Proceedings of PKDD, pages 343–350, 1997. 34, 39, 62
[11] H. Ahonen, O. Heinonen, M. Klemettinen, and A. I. Verkamo. Applying data mining techniques for descriptive phrase extraction in digital document collections. In Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries (ADL98), pages 2–11, 1998. 34, 35, 61
[12] H. Ahonen-Myka. Discovery of frequent word sequences in text. In Proceedings of Pattern Detection and Discovery, pages 180–189, 2002. 34, 61
[13] H. Ahonen-Myka, O. Heinonen, M. Klemettinen, and A. I. Verkamo. Finding co-occurring text phrases by combining sequence and frequent set discovery. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI99) Workshop on Text Mining, pages 1–9, 1999. 34, 61
[14] H. Al-Mubaid and S. A. Umair. A new text categorization technique using distributional clustering and learning logic. IEEE Transactions on Knowledge and Data Engineering, 18(9):1156–1165, 2006. 34
[15] J. Allan, J. P. Callan, F. Feng, and D. Malin. INQUERY and TREC-8. In TREC, 1999. 37
[16] G. Amati, D. D. Aloisi, V. Giannini, and F. Ubaldini. A framework for filtering news and managing distributed data. Journal of Universal Computer Science, 3(8):1007–1021, 1997. 37
[17] A. Anghelescu, E. Boros, D. Lewis, V. Menkov, D. Neu, and P. Kantor. Rutgers filtering work at TREC 2002: Adaptive and batch. In TREC, 2002. 37
[18] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu. Sequential pattern mining using a bitmap representation. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 429–435, 2002. 24, 26, 61
[19] P. Baldi, P. Frasconi, and P. Smyth. Modeling the Internet and the Web: Probabilistic Methods and Algorithms. John Wiley, 2003. 131
[20] B. T. Bartell, G. W. Cottrell, and R. K. Belew. Automatic combination of multiple ranked retrieval systems. In Proceedings of SIGIR, pages 173–181, 1994. 34
[21] E. Brill and P. Resnik. A rule-based approach to prepositional phrase attachment disambiguation. In Proceedings of the 15th International Conference on Computational Linguistics (COLING), pages 1198–1204, 1994. 34
[22] C. Brouard. CLIPS at TREC 11: Experiments in filtering. In TREC, 2002. 37
[23] N. Cancedda, N. Cesa-Bianchi, A. Conconi, and C. Gentile. Kernel methods for document filtering. In TREC, 2002. 36, 132
[24] N. Cancedda, E. Gaussier, C. Goutte, and J-M. Renders. Word-sequence kernels. Journal of Machine Learning Research, 3:1059–1082, 2003. 36, 37, 132
[25] M. F. Caropreso, S. Matwin, and F. Sebastiani. Statistical phrases in automated text categorization. Technical report, Istituto di Elaborazione dell'Informazione, Technical Report IEI-B4-07-2000, 2000. 35
[26] J. M. Carroll and P. A. Swatman. Structured-case: A methodological framework for building theory in information system research. In Proceedings of the European Conference on Information Systems, 2000. 7
[27] G. Chen, X. Wu, and X. Zhu. Sequential pattern mining in multiple streams. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM05), pages 585–588, 2005. 25, 26
[28] H. Cheng, X. Yan, and J. Han. IncSpan: Incremental mining of sequential patterns in large database. In Proceedings of KDD, pages 527–532, 2004. 61
[29] D. W. Cheung, J. Han, V. T. Ng, A. W. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In Proceedings of the 4th International Conference on Parallel and Distributed Information Systems, pages 31–42, 1996. 61
[30] K. W. Church. One term or two? In Proceedings of SIGIR, pages 310–318, 1995. 29
[31] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995. 132
[32] W. B. Croft, J. P. Callan, and J. Broglio. TREC-2 routing and ad-hoc retrieval evaluation using the INQUERY system. In TREC, 1993. 37
[33] V. Devedzic. Knowledge discovery and data mining in databases. In S. K. Chang, editor, Handbook of Software Engineering and Knowledge Engineering, volume 1 - Fundamentals, pages 615–637. World Scientific Publishing Co, 2001. 1, 11, 19
[34] S. T. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments, & Computers, 23(2):229–236, 1991. 20
[35] L. Dumitriu. Interactive mining and knowledge reuse for the closed-itemset incremental-mining problem. SIGKDD Explorations, 3(2):28–36, 2002. 26, 62
[36] L. Edda and K. Jorg. Text categorization with support vector machines. How to represent texts in input space? Machine Learning, 46:423–444, 2002. 2
[37] H. P. Edmundson and R. E. Wyllys. Automatic abstracting and indexing - survey and recommendations. Communications of the ACM, 4(5):226–234, 1961. 29
[38] D. A. Evans, J. Shanahan, N. Roma, J. Bennett, V. Sheftel, E. Stoica, J. Montgomery, D. A. Hull, and W. Tembe. Term selection and threshold optimization in IR and SVM filters. In TREC, 2002. 37
[39] W. Fan, M. D. Gordon, and P. Pathak. Personalization of search engine services for effective retrieval and knowledge management. In Proceedings of the 21st International Conference on Information Systems, pages 20–34, 2000. 29
[40] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34, 1996. 12
[41] L. Feng, H. Lu, J. X. Yu, and J. Han. Mining inter-transaction associations with templates. In Proceedings of CIKM, pages 225–233, 1999. 61
[42] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge discovery in databases: An overview. AI Magazine, 13:57–70, 1992. 1, 11, 12
[43] Y. J. Fu. Data mining: Tasks, techniques and applications. IEEE Potentials, 16(4):18–20, 1997. ix, 13, 14, 22
[44] N. Fuhr. Probabilistic models in information retrieval. The Computer Journal, 35(3):243–255, 1992. 34
[45] N. Fuhr and C. Buckley. A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9(3):223–248, 1991. 121
[46] G. P. C. Fung, J. X. Yu, H. Lu, and P. S. Yu. Text classification without negative examples revisit. IEEE Transactions on Knowledge and Data Engineering, 18(1):6–20, 2006. 27
[47] S. S. Ge and Y. Liu. Extensible object oriented reasoning information filtering. In Proceedings of the 2002 IEEE International Symposium on Intelligent Control, pages 827–832, 2002. 17
[48] M. Goebel and L. Gruenwald. A survey of data mining and knowledge discovery software tools. SIGKDD Explorations, 1(1):20–33, 1999. 15
[49] K. Gouda and M. J. Zaki. GenMax: An efficient algorithm for mining maximal frequent itemsets. Data Mining and Knowledge Discovery, 11(3):223–242, 2005. 26, 53, 62
[50] D. A. Grossman and O. Frieder. Information Retrieval Algorithms and Heuristics. Kluwer Academic, 1998. 2, 108, 131
[51] J. Han and K. C-C. Chang. Data mining for web intelligence. Computer, 35(11):64–70, 2002. 24, 26
[52] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of VLDB, pages 420–431, 1995. 61
[53] J. Han and Y. Fu. Mining multiple-level association rules in large databases. IEEE Transactions on Knowledge and Data Engineering, 11(5):798–805, 1999. 17, 22, 23, 61
[54] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. 22, 61, 130
[55] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. In Proceedings of KDD, pages 355–359, 2000. 61
[56] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of ACM-SIGMOD, pages 1–12, 2000. 24, 26
[57] L. Harada. An efficient sliding window algorithm for detection of sequential pattern. In Proceedings of DASFAA, pages 73–80, 2003. 25
[58] W. Hersh, C. Buckley, T. Leone, and D. Hickman. OHSUMED: An interactive retrieval evaluation and new large text collection for research. In Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval, pages 192–201, 1994. 108
[59] Y. Huang and S. Lin. Mining sequential patterns using graph search techniques. In Proceedings of the 27th Annual International Computer Software and Applications Conference (COMPSAC03), pages 4–9, 2003. 24, 26
[60] D. A. Hull, J. O. Pedersen, and H. Schutze. Method combination for document filtering. In Proceedings of SIGIR, pages 279–287, 1996. 35, 36
[61] L. P. Jing, H. K. Huang, and H. B. Shi. Improved feature selection approach TFIDF in text mining. In Proceedings of the First International Conference on Machine Learning and Cybernetics, pages 944–946, 2002. 17
[62] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of ICML, pages 143–151, 1997. 20
[63] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, pages 137–142, 1998. 132
[64] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of ICML, pages 200–209, 1999. 132
[65] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. Inkeri Verkamo. Finding interesting rules from large sets of discovered association rules. In Proceedings of CIKM, pages 401–407, 1994. 61
[66] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM (CACM), 40(3):77–87, 1997. 37
[67] K. Koperski and J. Han. Discovery of spatial association rules in geographic information databases. In Proceedings of SSD, pages 47–66, 1995. 61
[68] R. Kosala and H. Blockeel. Web mining research: A survey. ACM SIGKDD Explorations, 2(1):1–15, 2000. 5
[69] H. Kum, J. H. Chang, and W. Wang. Sequential pattern mining in multi-databases via multiple alignment. Data Mining and Knowledge Discovery, 12(2-3):151–180, 2006. 25
[70] K. L. Kwok, P. Deng, N. Dinstl, and M. Chan. TREC2002 web, novelty and filtering track experiments using PIRCS. In TREC, 2002. 37
[71] W. Lam, M. E. Ruiz, and P. Srinivasan. Automatic text categorization and its application to text retrieval. IEEE Transactions on Knowledge and Data Engineering, 11(6):865–879, 1999. 2, 115
[72] K. Lang. NewsWeeder: Learning to filter netnews. In Proceedings of ICML, pages 331–339, 1995. 37, 108
[73] C. Lanquillon. Evaluating performance indicators for adaptive information filtering. In Proceedings of ICSC, pages 11–20, 1999. 36
[74] C. Lanquillon and I. Renz. Adaptive information filtering: Detecting changes in text streams. In Proceedings of CIKM, pages 538–544, 1999. 36
[75] R. Y. K. Lau, P. Bruza, and D. Song. Belief revision for adaptive information retrieval. In Proceedings of SIGIR, pages 130–137, 2004. 36
[76] C-H. Lee and H-C. Yang. A multilingual text mining approach based on self-organizing maps. Applied Intelligence, 18(3):295–310, 2003. 27
[77] D. D. Lewis. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of SIGIR, pages 37–50, 1992. 2, 5, 33, 34, 35
[78] D. D. Lewis. Feature selection and feature extraction for text categorization. In Speech and Natural Language Workshop, pages 212–217, 1992. 20
[79] D. D. Lewis. Evaluating and optimizing autonomous text classification systems. In Proceedings of SIGIR, pages 246–254, 1995. 115
[80] X. Li and B. Liu. Learning to classify texts using positive and unlabeled data. In Proceedings of IJCAI, pages 587–594, 2003. 20
[81] Y. Li. Extended random sets for knowledge discovery in information systems. In Proceedings of the 9th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, pages 524–532, 2003. 85
[82] Y. Li, X. Z. Chen, and B. R. Yang. Research on web mining-based intelligent search engine. In Proceedings of the First International Conference on Machine Learning and Cybernetics, pages 386–390, 2002. ix, 15
[83] Y. Li, S-T. Wu, and Y. Xu. Deploying association rules on hypothesis spaces. In Proceedings of the International Conference on Computational Intelligence for Modelling Control and Automation (CIMCA04), pages 769–778, 2004. 17, 86
[84] Y. Li and N. Zhong. Interpretations of association rules by granular computing. In Proceedings of the 3rd IEEE International Conference on Data Mining, pages 593–596, 2003. 86
[85] Y. Li and N. Zhong. Capturing evolving patterns for ontology-based web mining. In Proceedings of the International Conference on Web Intelligence (WI04), pages 256–263, 2004. 87
[86] Y. Li and N. Zhong. Mining ontology for automatically acquiring web user information needs. IEEE Transactions on Knowledge and Data Engineering, 18(4):554–568, 2006. 2, 4, 72, 74, 88, 103
[87] M-Y. Lin and S-Y. Lee. Incremental update on sequential patterns in large databases by implicit merging and efficient counting. Information Systems, 29(5):385–404, 2004. 24, 26
[88] T. Y. Lin. Database mining on derived attributes. In Proceedings of Rough Sets and Current Trends in Computing, pages 14–32, 2002. 5
[89] B. Liu, C. W. Chin, and H. T. Ng. Mining topic-specific concepts and definitions on the web. In Proceedings of WWW, pages 251–260, 2003. 26
[90] J. Liu, Y. Pan, K. Wang, and H. Han. Mining frequent item sets by opportunistic projection. In Proceedings of KDD, pages 229–238, 2002. 62
[91] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. 36, 132
[92] H. Lu, L. Feng, and J. Han. Beyond intratransaction association analysis: Mining multidimensional intertransaction association rules. ACM Transactions on Information Systems, 18(4):423–454, 2000. 61
[93] W. Y. Ma and B. S. Manjunath. NeTra: A toolbox for navigating large image databases. ACM Multimedia System, 7:184–198, 1999. 17
[94] A. Maedche. Ontology Learning for the Semantic Web. Kluwer Academic, 2003. 103
[95] H. Mannila and H. Toivonen. Discovering generalized episodes using minimal occurrences. In Proceedings of KDD, pages 146–151, 1996. 34
[96] H. Mannila, H. Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In KDD Workshop, pages 181–192, 1994. 61
[97] H. Mannila, H. Toivonen, and A. Inkeri Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259–289, 1997. 61
[98] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, 1999. 103
[99] C. J. Matheus, P. C. Chan, and G. Piatetsky-Shapiro. Systems for knowledge discovery in databases. IEEE Transactions on Knowledge and Data Engineering, 5:903–913, 1993. 19
[100] D. Mladenic and M. Grobelnik. Word sequences as features in text-learning. In Proceedings of the 17th Electrotechnical and Computer Science Conference (ERK98), pages 145–148, 1998. 34
[101] C. Monz. Contextual inference in computational semantics. In Proceedings of Modeling and Using Context, Second International and Interdisciplinary Conference (CONTEXT99), pages 242–255, 1999. 33
[102] R. J. Mooney and R. C. Bunescu. Mining knowledge from text using information extraction. SIGKDD Explorations, 7(1):3–10, 2005. 20
[103] I. Moulinier, G. Raskinis, and J. Ganascia. Text categorization: A symbolic approach. In Proceedings of the 5th Annual Symposium on Document Analysis and Information Retrieval (SDAIR), 1996. 115
[104] N. Nanas. Towards Nootropia: a Non-Linear Approach to Adaptive Document Filtering. PhD thesis, The Open University, 2003. 31, 32
[105] N. Nanas, V. S. Uren, and A. Roeck. A comparative evaluation of term weighting methods for information filtering. In DEXA Workshops, pages 13–17, 2004. 29
[106] D. W. Oard. The state of the art in text filtering. User Modeling and User-Adapted Interaction, 7(3):141–178, 1997. 36
[107] J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. In Proceedings of SIGMOD, pages 175–186, 1995. 22, 24, 61
[108] J. S. Park, M-S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. In Proceedings of SIGMOD, pages 175–186, 1995. 61
[109] J. S. Park, M-S. Chen, and P. S. Yu. Using a hash-based method with transaction trimming for mining association rules. IEEE Transactions on Knowledge and Data Engineering, 9(5):813–825, 1997. 61
[110] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proceedings of the 7th International Conference on Database Theory (ICDT99), pages 398–416, 1999. 26, 62
[111] J. Pei and J. Han. Can we push more constraints into frequent pattern mining? In Proceedings of KDD, pages 350–354, 2000. 61
[112] J. Pei, J. Han, and L. V. S. Lakshmanan. Pushing convertible constraints in frequent itemset mining. Data Mining and Knowledge Discovery, 8(3):227–252, 2004. 26, 62
[113] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 21–30, 2000. 26, 62
[114] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of ICDE, pages 215–224, 2001. 24, 26, 61
[115] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Transactions on Knowledge and Data Engineering, 16(11):1424–1440, 2004. 24, 26
[116] J. Pei, J. Han, and W. Wang. Mining sequential patterns with constraints in large databases. In Proceedings of CIKM, pages 18–25, 2002. 61
[117] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980. 122, 131
[118] Reuters. Reuters Ltd corpus statistics web page. Available from http://about.reuters.com/researchandstandards/corpus/statistics/index.asp. ix, 111, 112
[119] S. E. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129–146, 1976. 30, 131
[120] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu. Experimentation as a way of life: Okapi at TREC. Information Processing and Management, 36(1):95–108, 2000. 30, 132
[121] S. E. Robertson, S. Walker, H. Zaragoza, and R. Herbrich. Microsoft Cambridge at TREC 2002: Filtering track. In TREC, 2002. 37
[122] J. Rocchio. Relevance Feedback in Information Retrieval, chapter 14, pages 313–323. Prentice-Hall, 1971. 108, 131
[123] T. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus Volume 1 - from yesterday's news to today's language resources. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, pages 29–31, 2002. 108, 109, 178
[124] G. Salton. The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1971. 108
[125] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988. 20, 28
[126] M. Sanderson. Word sense disambiguation and information retrieval. In Proceedings of SIGIR, pages 142–151, 1994. 33
[127] M. Sassano. Virtual examples for text classification with support vector machines. In Proceedings of Empirical Methods in Natural Language Processing, pages 208–215, 2003. 132
[128] A. Savasere, E. Omiecinski, and S. B. Navathe. Mining for strong negative associations in a large database of customer transactions. In Proceedings of ICDE, pages 494–502, 1998. 61
[129] S. Scott and S. Matwin. Feature engineering for text classification. In Proceedings of ICML, pages 379–388, 1999. 2, 5, 108
[130] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002. 2, 5, 20, 35, 36, 115
[131] M. Seno and G. Karypis. SLPMiner: An algorithm for finding frequent sequential patterns using length-decreasing support constraint. In Proceedings of ICDM, pages 418–425, 2002. 24, 26, 61
[132] M. Seno and G. Karypis. Finding frequent patterns using length-decreasing support constraints. Data Mining and Knowledge Discovery, 10(3):197–228, 2005. 24, 26, 180
[133] R. E. Schapire and Y. Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39:135–168, 2000. 2, 115
[134] R. Sharma and S. Raman. Phrase-based text representation for managing the web document. In Proceedings of the International Conference on Information Technology: Computers and Communications (ITCC), pages 165–169, 2003. 34, 35
[135] D. Shen, J. Sun, Q. Yang, H. Zhao, and Z. Chen. Text classification improved through automatically extracted sequences. In Proceedings of ICDE, pages 121–123, 2006. 32, 34
[136] B. D. Sheth. A learning approach to personalized information filtering. Master's thesis, Massachusetts Institute of Technology, 1994. 37
[137] I. Soboroff and S. E. Robertson. Building a filtering test collection for TREC 2002. In Proceedings of SIGIR, pages 243–250, 2003. 113
[138] H. Sorensen, A. O'Riordan, and C. O'Riordan. Profiling with the informer text filtering agent. The Journal of Universal Computer Science, 3(8):988–1006, 1997. 37
[139] K. Sparck Jones. Experiments in relevance weighting of search terms. Information Processing and Management, 15(3):133–144, 1979. 108
[140] K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: development and comparative experiments - part 1. Information Processing and Management, 36(6):779–808, 2000. 30, 132
[141] K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: development and comparative experiments - part 2. Information Processing and Management, 36(6):809–840, 2000. 30, 132
[142] R. Srikant and R. Agrawal. Mining generalized association rules. In Proceedings of VLDB, pages 407–419, 1995. 24
[143] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of EDBT, pages 3–17, 1996. 61
[144] T. Strzalkowski. Robust text processing in automated information retrieval. In Proceedings of the 4th Applied Natural Language Processing Conference (ANLP), pages 168–173, 1994. 33
[145] B. Thuraisingham. A primer for understanding and applying data mining. IEEE IT Professional, 12(1):28–31, 2000. 17
[146] A. K. H. Tung, H. Lu, J. Han, and L. Feng. Breaking the barrier of transactions: Mining inter-transaction association rules. In Proceedings of KDD, pages 297–301, 1999. 61
[147] A. K. H. Tung, H. Lu, J. Han, and L. Feng. Efficient mining of intertransaction association rules. IEEE Transactions on Knowledge and Data Engineering, 15(4):1001–1017, 2003. 17
[148] P. D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303–336, 2000. 34
[149] P. Tzvetkov, X. Yan, and J. Han. TSP: Mining top-k closed sequential patterns. In Proceedings of ICDM, pages 347–354, 2003. 24, 26, 61
[150] S. R. Vasanthakumar, J. P. Callan, and W. B. Croft. Integrating INQUERY with an RDBMS to support text retrieval. IEEE Data Engineering Bulletin, 19(1):24–33, 1996. 37
[151] A. Veloso, M. E. Otey, S. Parthasarathy, and W. Meira Jr. Parallel and distributed frequent itemset mining on dynamic datasets. In Proceedings of HiPC, pages 184–193, 2003. 62
[152] K. Wang, Y. He, and J. Han. Mining frequent itemsets using support constraints. In Proceedings of VLDB, pages 43–52, 2000. 62
[153] K. Wang, Y. He, and J. Han. Pushing support constraints into association rules mining. IEEE Transactions on Knowledge and Data Engineering, 15(3):642–658, 2003. 22
[154] K. Wang and H. Liu. Discovering structural association of semistructured data. IEEE Transactions on Knowledge and Data Engineering, 12:353–371, 2000. 17
[155] D. H. Widyantoro, T. R. Ioerger, and J. Yen. An adaptive algorithm for learning changes in user interests. In Proceedings of CIKM, pages 405–412, 1999. 37
[156] R. C. Wong and A. W. Fu. Mining top-k frequent itemsets from data streams. Data Mining and Knowledge Discovery, 13(2):193–217, 2006. 26, 62
[157] S-T. Wu, Y. Li, and Y. Xu. An effective deploying algorithm for using pattern-taxonomy. In Proceedings of the 7th International Conference on Information Integration and Web-based Applications & Services (iiWAS05), pages 1013–1022, 2005. 3, 85
[158] S-T. Wu, Y. Li, and Y. Xu. Deploying approaches for pattern refinement in text mining. In Proceedings of ICDM, pages 1157–1161, 2006. 3, 66, 145, 169
[159] S-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen. Automatic pattern-taxonomy extraction for web mining. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI04), pages 242–248, 2004. 3, 5, 17, 43, 52, 85, 124, 151
[160] T. W. Yan and H. Garcia-Molina. SIFT - a tool for wide-area information dissemination. In Proceedings of USENIX Winter, pages 177–186, 1995. 37
[161] X. Yan, J. Han, and R. Afshar. CloSpan: mining closed sequential patterns in large datasets. In Proceedings of the SIAM International Conference on Data Mining (SDM03), pages 166–177, 2003. 24, 26, 61
[162] Y. Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1:69–90, 1999. 115
[163] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of SIGIR, pages 42–49, 1999. 132
[164] C-C. Yu and Y-L. Chen. Mining sequential patterns from multidimensional sequence data. IEEE Transactions on Knowledge and Data Engineering, 17(1):136–140, 2005. 25
[165] M. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 40:31–60, 2001. 24, 26, 61
[166] M. J. Zaki and C-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In Proceedings of the 2nd SIAM International Conference on Data Mining, pages 457–473, 2002. 26, 62
[167] S. Zhang, X. Wu, J. Zhang, and C. Zhang. A decremental algorithm for maintaining frequent itemsets in dynamic databases. In Proceedings of the 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK05), pages 305–314, 2005. 26
[168] X. Zhou, Y. Li, P. D. Bruza, S-T. Wu, Y. Xu, and R. Y. K. Lau. Using information filtering in web data mining process. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI07), pages 163–169, 2007. 180
[169] X. Zhou, S-T. Wu, Y. Li, Y. Xu, R. Y. K. Lau, and P. D. Bruza. Utilizing search intent in topic ontology-based user profile for web mining. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI06), pages 558–564, 2006. 180