

University of Manchester

School of Computer Science

Final year project report 2016

BSc Computer Science with Industrial Experience

Evaluation of Mutual information versus Gini index for stable feature selection

Author:

Veneta Haralampieva

Supervisor:

Dr. Gavin Brown

May 02, 2016


Abstract

The selection of highly discriminatory features has been crucial in aiding further advancements in domains such as biomedical sciences, high-energy physics and e-commerce. Therefore, evaluating the robustness of feature selection methods to small perturbations in the data, known as feature selection stability, is of great importance to people in these fields. However, little research has focused on investigating the stability of feature selection algorithms independently of any learning models. This project addresses the problem by providing an overview of several established stability measures and reintroducing Pearson's correlation coefficient as another. The coefficient has then been employed in the empirical evaluation of four commonly used feature selection criteria: Mutual information maximisation, Mutual information feature selection, Gini index and ReliefF. A high overall stability of Mutual information maximisation and Gini index can be observed for small data samples, with a slightly lower stability seen for ReliefF. All criteria exhibit low stability when applied to high-dimensional datasets consisting of a small number of samples, with Mutual information feature selection performing poorly across all datasets.


Acknowledgements

Firstly, I would like to thank my supervisor Dr Gavin Brown for the support and guidance throughout the course of this project.

Secondly, I would like to extend a thank you to my parents and my sister for always believing in me and helping me through all difficult times.

Lastly, I would like to express my gratitude to all my colleagues and friends, in particular to my flat-mate Gaby and my boyfriend Dan, for the useful feedback regarding the seminar and later the presentation of results.

All your help and support has been greatly appreciated!


CONTENTS

1 Introduction
    1.1 Chapter overview
    1.2 Motivation
    1.3 Aims and objectives
    1.4 Report outline
    1.5 Chapter summary
2 Background
    2.1 Chapter overview
    2.2 Stability measures
        2.2.1 Tanimoto distance
        2.2.2 Hamming distance
        2.2.3 Kuncheva's consistency index
        2.2.4 Pearson's correlation coefficient
    2.3 Feature selection criteria
        2.3.1 Mutual information maximisation
        2.3.2 Mutual information feature selection
        2.3.3 Gini index
        2.3.4 ReliefF
    2.4 Chapter summary
3 Methods and design
    3.1 Chapter overview
    3.2 Project management
    3.3 Technologies
        3.3.1 R
        3.3.2 RStudio
        3.3.3 Packages
    3.4 Architecture
        3.4.1 The user interface
        3.4.2 The server
        3.4.3 The computational core
    3.5 Experimental design
    3.6 Chapter summary
4 Implementation
    4.1 Chapter overview
    4.2 The user interface
    4.3 The server
    4.4 The computational core
        4.4.1 File manipulation
        4.4.2 Data pre-processing
        4.4.3 Probability estimation
        4.4.4 The feature selection criteria
    4.5 Testing
        4.5.1 Unit tests
        4.5.2 Integration tests
        4.5.3 External tools validation
    4.6 Chapter summary
5 Experiments and analysis
    5.1 Chapter overview
    5.2 Stability measures
        5.2.1 Experiments
        5.2.2 Which stability measure should be used?
    5.3 Feature selection criteria
        5.3.1 Experiments
        5.3.2 How stable are the criteria?
    5.4 Chapter summary
6 Conclusions
    6.1 Chapter overview
    6.2 Achievements
        6.2.1 Scientific achievements
        6.2.2 Personal achievements
    6.3 Experimental analysis summary
    6.4 Future work
    6.5 Chapter summary
Bibliography
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G


LIST OF FIGURES

Figure 1: Programming languages popularity in data science
Figure 2: Computational core structure
Figure 3: User interface: data upload and preprocessing
Figure 4: User interface: stability evaluation
Figure 5: User interface: result consolidation
Figure 6: User interface: example reproductions
Figure 7: User interface: references
Figure 8: Example one reproduced from Kuncheva's paper (Kuncheva, 2007)
Figure 9: Example two reproduced from Kuncheva's paper (Kuncheva, 2007)
Figure 10: Breast dataset stability
Figure 11: Krvskp dataset stability
Figure 12: Splice dataset stability
Figure 13: Waveform dataset stability
Figure 14: Congress dataset stability
Figure 15: Heart dataset stability
Figure 16: Ionosphere dataset stability
Figure 17: Parkinson's dataset stability
Figure 18: Lung cancer dataset stability
Figure 19: Sonar dataset stability
Figure 20: Penglung dataset stability
Figure 21: Pengcolon dataset stability
Figure 22: Penglymph dataset stability
Figure 23: Sampling with replacement


Chapter One

1 INTRODUCTION

1.1 CHAPTER OVERVIEW

The focus of this chapter is to provide a short overview of the main aspects of this project: the problem being investigated, the motivation behind it, and its aims and objectives. The final section will outline the report's structure.

1.2 MOTIVATION

In recent years the term 'Big Data' has appeared prominently in both research literature and popular media. Ward and Barker provide a unified definition which postulates that 'Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and Machine learning.' (Ward & Barker, 2013). It might seem counter-intuitive that this could be problematic, since one would assume that the more information there is about a particular problem, the easier finding a solution will be.

The high-energy physics domain illustrates this phenomenon well. The Large Hadron Collider at CERN in Geneva has been colliding particles travelling close to the speed of light, driving important scientific discoveries in particle physics over the last few decades. Interestingly, the research conducted there presents numerous unique challenges not only to physicists but also to computer scientists. The four experiments are anticipated to generate 25 GB of data per second, with a single collision generating 1 MB (CERN, 2016). After a two-stage filtering process, around 15 petabytes of data are stored annually, and from this astonishing amount scientists have estimated that only one collision in a million is of interest. It becomes apparent that without proper filtering and storage, managing and analysing such data is virtually impossible. For scientists in that domain, selecting only the relevant particle collisions is of vital importance.

The Big Data phenomenon is also present in other fields, such as the biomedical sciences. Since the completion of the Human Genome Project, the price of full genome sequencing¹ has dropped rapidly, generating vast amounts of genomic data. Effective analysis of this data promises to revolutionise medicine by uncovering causes of genetic diseases, as well as enabling innovative personalised drug treatments. It should be noted that the human genome consists of around 30,000 protein-coding genes, and selecting the ones associated with a particular genetic disease such as cancer is no trivial task.

1 The term full genome sequencing describes the identification of 'the genetic basis of traits and disease susceptibilities using SNP microarrays that capture most of the common genetic variation in the human population' (Ng & Kirkness, 2010). Single nucleotide polymorphism (SNP) microarray is the technique used to accomplish this.

Outside the scientific domains, commercial companies such as Google, Amazon, Facebook and Apple store large amounts of user data in the hope of delivering a better user experience by effectively analysing and using this information. One example area is digital advertising, where Amazon employs previously stored user activity to generate recommendations for products and services. One approach the company uses for offering personalised advertisements is finding similarity between the products purchased and other available products (Linden, Smith, & York, 2003). A product on Amazon, such as a book, is described by a number of characteristics: genre, publisher, product dimensions, author, price, average review score, and so on. Evidently, publisher or product dimensions might not be of vital importance to whether a person will purchase a book; however, genre and author might be. Thus selecting a good set of characteristics is important.

All the examples mentioned above are characterised by high dimensionality and require extraction of the most important attributes describing the target concept. In the field of Machine Learning this is typically accomplished by means of feature selection, which focuses on identifying a subset of features which best achieve a certain goal, such as discovering the main genes in which mutations can lead to cancer.

It should be noted that obtaining a robust subset of features, also known as 'feature preferences' (Kalousis, Prados, & Hilario, 2007), is a necessity in the areas mentioned above. Since experts will invest a considerable amount of time, research effort and money into further exploring the selected set of 'preferences', it is paramount that small variations in the data do not drastically change them. This links closely with the notion of stability of a feature selection algorithm, defined as the 'robustness of the feature preferences it produces to differences in training sets drawn from the same generating distribution' (Kalousis, Prados, & Hilario, 2007).

Surprisingly, this area of Machine Learning has received little attention from the research community. This has motivated the investigation of various stability measures and their application in the empirical evaluation of feature selection criteria.

1.3 AIMS AND OBJECTIVES

As identified above, the aim of this project is to research and evaluate the stability of different feature selection criteria. To achieve this, detailed research has been conducted, followed by experimental work and consolidation of results. The process involved critical analysis of every criterion's strengths and limitations, followed by careful examination of their relationships with each other. To facilitate this, a custom implementation of the criteria and stability measures has been developed, followed by extensive testing and validation.

The project work has been guided by the following objectives:

- Research and implement the following feature selection criteria:
  o Mutual information maximisation
  o Mutual information feature selection
  o Gini index
  o ReliefF
- Research, implement and perform critical analysis of the following similarity measures used in stability estimation:
  o Tanimoto distance
  o Kuncheva's consistency index
  o Hamming distance
  o Pearson's correlation coefficient
- With the help of the above measures, evaluate the stability of the criteria.

1.4 REPORT OUTLINE

The remainder of the report is organised as follows: Chapter Two provides an overview of the feature selection criteria and stability measures explored in this project. Chapter Three describes the project structure and the technologies used. Chapter Four focuses on the details of the implementation, followed by a thorough explanation of the testing performed. Chapter Five outlines the experimental work undertaken and the characteristics of the datasets used, followed by analysis of the results. Chapter Six serves as a conclusion and highlights achievements and future work.

1.5 CHAPTER SUMMARY

This chapter introduced the notions of feature selection and stability, and their importance in relation to Big Data and Machine Learning applications. It then set out the aims and objectives and presented a brief overview of the report's structure.


Chapter Two

2 BACKGROUND

2.1 CHAPTER OVERVIEW

This chapter will provide a review of the research already conducted with respect to feature selection and stability, aiming to introduce the reader to the vital concepts explored in detail in this project. An outline of the strengths and weaknesses of each of the techniques will be presented.

2.2 STABILITY MEASURES

In Machine Learning, the task of identifying the category or class of an unseen instance based on past data, known as classification, plays a prominent role. It has been established that in high-dimensional problems feature reduction can improve classification accuracy, as discussed by Das (Das, 2001) and by Xing, Jordan and Karp (Xing, Jordan, & Karp, 2001). As a result, the notion of feature selection stability has traditionally been associated with classification. However, this provides no means of distinguishing the behaviour of the feature selection algorithm with respect to changes in the data from that of the classifier. One can easily infer that this distinction is quite important: a low stability score in that case gives no information as to whether the feature selection algorithm produces a non-optimal feature subset or the classifier's sensitivity to the data is too high.

Kalousis was one of the first people to investigate feature selection stability separately (Kalousis, Prados, & Hilario, 2007). He proposed several similarity measures which can be employed independently of any learning model and cater for the different outputs produced by feature selection algorithms. These algorithms do one of three things: assign a weight or a score to features, rank them, or select a small subset. As a result, Kalousis proposed three separate measures: Pearson's correlation coefficient for weight-scoring outputs, Spearman's rank correlation coefficient for rankings, and Tanimoto distance for measuring set similarity. Measures operating on subsets of features are the focus of this research and will be explored in detail shortly.

Each of the similarity measures S_x presented below operates on two sets alone. They can be used to measure the total similarity S of m subsets A = {A_1, A_2, A_3, ..., A_m} by averaging the similarity scores over every pair of subsets. This leads to a generalised approach for computing the overall stability, regardless of the similarity measure employed, defined as:

    S(A) = \frac{\sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S_x(A_i, A_j)}{m(m-1)/2}    (1)

2.2.1 Tanimoto distance

An adaptation of the Tanimoto distance has been proposed by Kalousis as a possible measure for set similarity (Kalousis, Prados, & Hilario, 2007). It computes the number of elements in the intersection of two sets s and s' divided by the number of elements in their union:

    S_t(s, s') = \frac{|s \cap s'|}{|s \cup s'|}    (2)

The values are in the range [0, 1], where 0 characterises two completely different subsets and 1 two identical subsets. This measure has advantages such as computational efficiency and simplicity of implementation.

However, upon further investigation one apparent disadvantage emerges, as discussed by Kuncheva (Kuncheva, 2007) and Lustgarten (Lustgarten, Gopalakrishnan, & Visweswaran, 2009): the measure's inability to correct for 'chance'. As the number of selected features approaches the total number of features, the Tanimoto distance produces values close to 1.
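A direct R sketch of Equation 2, treating each feature subset as a vector of feature indices (the function name is illustrative):

    # Tanimoto similarity: |s ∩ s'| / |s ∪ s'| (Equation 2)
    tanimoto <- function(s1, s2) {
      length(intersect(s1, s2)) / length(union(s1, s2))
    }

For example, tanimoto(c(1, 2, 3), c(2, 3, 4)) gives 2/4 = 0.5.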

2.2.2 Hamming distance

When evaluating the stability of different feature selection algorithms, Dunne proposed another similarity measure, based on the Hamming distance (Dunne, Cunningham, & Azuaje, 2002). It employs the masks generated by the feature selection algorithm, with 1 denoting that a feature has been selected and 0 that it has not. Kuncheva later expressed the Hamming distance in terms of two sets in the following manner (Kuncheva, 2007), where '\' denotes the set difference operation and n is the total number of features in the dataset:

    S_h(s, s') = 1 - \frac{|s \setminus s'| + |s' \setminus s|}{n}    (3)

Much like the Tanimoto distance, this similarity measure does not correct for chance.
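Under the same representation, Equation 3 might be sketched as below; the total number of features n must be supplied, since it cannot be recovered from the index sets alone:

    # Hamming-based similarity (Equation 3); n is the total number of features
    hamming_sim <- function(s1, s2, n) {
      1 - (length(setdiff(s1, s2)) + length(setdiff(s2, s1))) / n
    }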


2.2.3 Kuncheva’s consistency index

To resolve this problem, Kuncheva introduced several desirable properties of a similarity measure: monotonicity, limits and correction for chance (Kuncheva, 2007). Her research empirically showed that the previously used measures do not meet these requirements, emphasising the need for a new one. She introduced the 'consistency index', which can be formulated as:

    S_c(s, s') = \frac{rn - k^2}{k(n - k)}    (4)

where r = |s ∩ s'|, k is the number of selected features, n is the total number of features, and s, s' are subsets of features of the same size.

The values range over [-1, 1], where 1 is obtained for two identical sets (r = k). The minimal value is produced when k = n/2 and r = 0. As noted by Kuncheva, the value of the index is not defined for k = 0 and k = n; nevertheless, for completeness these cases can be implemented as 0. By taking into account the total number of features, the consistency index satisfies all three requirements introduced above.
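A sketch of Equation 4 in R, returning 0 for the undefined cases k = 0 and k = n, as suggested above:

    # Kuncheva's consistency index (Equation 4); both subsets have size k
    consistency_index <- function(s1, s2, n) {
      k <- length(s1)
      if (k == 0 || k == n) return(0)   # undefined cases implemented as 0
      r <- length(intersect(s1, s2))
      (r * n - k^2) / (k * (n - k))
    }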

2.2.4 Pearson’s correlation coefficient

Pearson's correlation coefficient has been widely used in statistics and economics as a measure of correlation between two variables. In the context of feature selection stability, Kalousis proposed its use for measuring the similarity of two weight-scoring outputs produced by feature selection algorithms. It can be formulated as follows, where \mu_w and \mu_{w'} are the sample means of w and w' respectively:

    S_p(w, w') = \frac{\sum_i (w_i - \mu_w)(w'_i - \mu_{w'})}{\sqrt{\sum_i (w_i - \mu_w)^2 \sum_i (w'_i - \mu_{w'})^2}}    (5)

Its values are in the range [-1, 1], where values less than 0 denote low similarity. This project will reintroduce Pearson's correlation coefficient and demonstrate its use in stability estimation.
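Equation 5 coincides with R's built-in cor(); an explicit sketch for two weight vectors produced by a scoring criterion might read:

    # Pearson's correlation of two weight vectors (Equation 5);
    # equivalent to the built-in cor(w1, w2)
    pearson_sim <- function(w1, w2) {
      num <- sum((w1 - mean(w1)) * (w2 - mean(w2)))
      den <- sqrt(sum((w1 - mean(w1))^2) * sum((w2 - mean(w2))^2))
      num / den
    }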


2.3 FEATURE SELECTION CRITERIA

Feature selection has been a prominent research area in Machine Learning due to the increasing number of datasets characterised by high dimensionality, as highlighted previously. Such datasets consist of a high number of features and a low number of samples, which greatly reduces classification accuracy and hinders computational performance. To mitigate these problems, feature selection techniques, aimed at reducing the number of dimensions, have been widely employed. Huang provides a detailed overview of different feature selection methods and their division into filter, wrapper and embedded methods (Huang, 2015), each of which is explained in the following paragraphs.

Wrapper methods evaluate feature quality by exploring the feature subset space and its effect on a particular learning model's accuracy. Their main drawbacks are computational inefficiency and specificity to the learning algorithm used.

Embedded methods incorporate feature selection whilst constructing the learning model; examples include classification and regression trees. A key advantage of these methods is that they are more computationally efficient than wrapper approaches. Nevertheless, they have considerable disadvantages as well: Huang discusses their difficulty in finding the globally optimal solution, due to the attempt to estimate the model parameters whilst selecting features (Huang, 2015).

Filters, on the other hand, rely on statistical measures, independent of any classifier, to estimate the relevance of a feature to the target concept. As a result, they produce generalised subsets, which might not in all cases yield the highest classification accuracy. They are characterised by computational efficiency and less overfitting². The majority of them are myopic³ measures, such as Mutual information and the Gini index. There are also non-myopic approaches such as Relief, first introduced by Kira (Kira & Rendel, 1992). A comparison of the different filter-based feature selection criteria is the focus of this research.

2 In Machine Learning, overfitting relates to a learning model being too complicated for the underlying data, thus capturing noise. This leads to a model which is too specific to the data.
3 Myopic approaches consider each feature independently from the others and do not detect feature interactions.

2.3.1 Mutual information maximisation

A number of widely used feature selection approaches rely on the notion of mutual information, first defined in the field of Information Theory. To understand mutual information, one should be familiar with the term entropy, which quantifies the amount of uncertainty present in a distribution of events of a random variable X (Shannon, 1948). If one event is highly likely to occur, then the entropy is low, since there is little uncertainty regarding the outcome. If, on the other hand, all events in the distribution are equally probable, then the entropy is maximal, since no single event prevails. Shannon defined entropy as:

    H(X) = -\sum_{x \in X} p(x) \log p(x)    (6)

Furthermore, Shannon demonstrated the means of conditioning entropy on other events. Suppose we have another distribution of events, Y; then the conditional entropy of X given Y is regarded as 'the amount of uncertainty remaining in X after we learn the outcome of Y' (Brown, Pocock, Zhao, & Luján, 2012). Its mathematical formulation is:

    H(X|Y) = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y)    (7)

Hence the mutual information shared between X and Y can naturally be described as the difference of the two entropies: the amount of uncertainty present in X before the outcome of event Y is known, and after. It should be noted that mutual information is symmetric, hence the name.

    I(X;Y) = H(X) - H(X|Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}    (8)

To aid understanding, a practical example is presented. Suppose that X denotes whether a person has heart failure or not. Without any additional information about the individual, our uncertainty regarding the outcome is high. By examining his or her family history, however, our uncertainty lowers: if the person's relatives suffer from a similar heart condition, then it is more likely that he or she will experience heart failure as well, and if there is no evidence that any family members have had this diagnosis, the person might be less likely to experience heart problems. Therefore, the outcome of Y, denoting whether cardiovascular diseases are present in the family, lowers the uncertainty. This leads to a big difference between H(X) and H(X|Y), or high mutual information.

In the context of Machine Learning, one can infer that the higher the mutual information between a feature X and a class Y, the more useful that feature will be in classification. Therefore, computing the mutual information of all features with respect to the class, and selecting the top ones with the highest values, results in a good feature selection criterion. Brown refers to this as the Mutual information maximisation criterion, which uses mutual information to define the feature score (Brown, Pocock, Zhao, & Luján, 2012).


    J_{mim}(X_k) = I(X_k; Y)    (9)
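As a sketch, Equations 8 and 9 can be estimated from discrete data using empirical frequencies; the helpers below are illustrative rather than the project's implementation, and assume a data frame of discrete features and a class vector y:

    # Plug-in estimate of I(X;Y) from two discrete vectors (Equation 8)
    mutual_information <- function(x, y) {
      joint <- table(x, y) / length(x)   # empirical p(x, y)
      px <- rowSums(joint)               # p(x)
      py <- colSums(joint)               # p(y)
      mi <- 0
      for (i in seq_along(px)) {
        for (j in seq_along(py)) {
          if (joint[i, j] > 0) {         # 0 log 0 is taken as 0
            mi <- mi + joint[i, j] * log(joint[i, j] / (px[i] * py[j]))
          }
        }
      }
      mi
    }

    # MIM (Equation 9): score every feature by I(X_k; Y), keep the top k
    mim_select <- function(data, y, k) {
      scores <- sapply(data, mutual_information, y = y)
      names(sort(scores, decreasing = TRUE))[1:k]
    }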

It is important to note that this criterion does not consider inter-feature relationships (the fact that certain features may be highly discriminatory when grouped together, even if individually they are not), making it a myopic measure. Another disadvantage is its inability to detect feature redundancy, which, as discussed by Huang, is the view that 'two perfectly correlated features are redundant to each other because adding one feature on top of the other will not provide additional information; and hence, will not improve model accuracy' (Huang, 2015). A possible solution to this problem has been proposed in the form of the Mutual information feature selection criterion (Battiti, 1994), which will be explored in detail in the following section.

2.3.2 Mutual information feature selection

Mutual information feature selection attempts to resolve the problem of redundancy by introducing a penalty for highly correlated features. It is based on the assumption that redundancy and feature correlation are equivalent and do not increase predictive accuracy. Using this method, the feature score is calculated from both the feature's mutual information with the class and the sum of the mutual information between the feature in question and all features selected previously. A user-configurable parameter β determines the relative importance of the penalty; a value of β = 1 has been found to yield satisfactory results by Battiti (Battiti, 1994).

    J_{mifs}(X_k) = I(X_k; Y) - \beta \sum_{X_j} I(X_k; X_j)    (10)

It should be noted that Guyon argued that high correlation does not always equate to redundancy (Guyon & Elisseeff, 2003). He empirically demonstrated that two highly correlated features can improve classification accuracy, and reached the conclusion that only perfectly correlated features are redundant to each other. This implies that this criterion operates under the assumption that features are class-conditionally independent.

Moreover, Estévez argued that the redundancy term grows in magnitude relative to the relevancy term as the cardinality of the subset of selected features increases (Estévez, 2009). He inferred that this may result in the early selection of irrelevant features.

As can be observed, this criterion is more computationally expensive than the previous one, due to the need to estimate the mutual information between the feature in question and all previously selected features.
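A minimal sketch of greedy forward selection under Equation 10, reusing the mutual_information() helper sketched earlier; beta defaults to 1 following Battiti, and the names are again illustrative:

    mifs_select <- function(data, y, k, beta = 1) {
      remaining <- names(data)
      selected <- character(0)
      for (step in 1:k) {
        scores <- sapply(remaining, function(f) {
          relevance <- mutual_information(data[[f]], y)
          redundancy <- if (length(selected) == 0) 0
                        else sum(sapply(selected, function(s)
                          mutual_information(data[[f]], data[[s]])))
          relevance - beta * redundancy   # J_mifs(X_k)
        })
        best <- remaining[which.max(scores)]
        selected <- c(selected, best)
        remaining <- setdiff(remaining, best)
      }
      selected
    }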


2.3.3 Gini index

Another correlation-based criterion which attempts to estimate a feature's ability to distinguish between classes is the Gini index. It was first introduced as a splitting rule in decision tree generation (Breiman, Friedman, Olshen, & Stone, 1984). This feature selection method examines the decrease in impurity when using a chosen feature; in this context, impurity relates to the ability of a feature to distinguish between the possible classes. In a similar fashion to Mutual information, the Gini index uses the difference of impurities before the value of a feature is known and afterwards. For a feature X the Gini index is defined as:

    Gini(X) = \sum_j P(X_j) \sum_c P(Y_c|X_j)^2 - \sum_c P(Y_c)^2    (11)

where:
P(X_j) is the prior probability of feature X having value X_j;
P(Y_c|X_j) is the probability that a random sample from the dataset, whose feature X has value X_j, belongs to class Y_c;
P(Y_c) is the prior probability that a random sample belongs to class Y_c.

Due to its myopic nature, the Gini index has low computational requirements, making it attractive in high-dimensional data analysis. Nevertheless, as noted by Čehovin, when used on numerical data it overestimates the features' quality (Čehovin & Bosnić, 2010). An improvement can be obtained by discretising the data prior to analysis, which is the approach undertaken in this project. Similarly to Mutual information maximisation, the Gini index is unable to detect redundant features, as well as inter-feature relationships.
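A sketch of Equation 11 for a discretised feature x and class vector y, again estimated with empirical frequencies (illustrative, not the project's code):

    gini_index <- function(x, y) {
      px   <- table(x) / length(x)                 # P(X_j)
      py   <- table(y) / length(y)                 # P(Y_c)
      cond <- prop.table(table(x, y), margin = 1)  # P(Y_c | X_j)
      sum(px * rowSums(cond^2)) - sum(py^2)
    }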

2.3.4 ReliefF

The biggest disadvantage of the criteria described above is their inability to detect feature interactions. In domains such as protein folding, where the interactions of amino acids⁴ determine the folded structure of a protein and its functionality, discovering inter-feature relationships is paramount.

4 An amino acid can be treated as a feature in this example.

To cater for this type of problem, Kira proposed a different approach to feature selection, called Relief (Kira & Rendel, 1992). What Relief does differently is that it performs both a global and a local search before updating the feature relevance score. It starts by selecting a random instance and finding its nearest neighbour from the same class, called the nearest-hit, and its nearest neighbour from the other class, the nearest-miss. It then examines how well each feature distinguishes between the classes and uses this information to update the feature weights. The details of the original algorithm can be found below:

2.3.4.1 Algorithm: Relief

Input:  S – dataset, described by n features
        m – the number of examples the algorithm should examine
Output: W – a set containing the weight score for each feature

1   W = [0, ..., 0]
2   for i = 1 to m
3       pick a random instance I_R
4       find the nearest-hit H_I and nearest-miss M_I of I_R
5       for X = 1 to n
6           W[X] = W[X] - diff(I_R[X], H_I[X])^2 / m + diff(I_R[X], M_I[X])^2 / m
7       end for
8   end for
9   return W

The function diff was originally defined as:

    diff(I_1[X], I_2[X]) = \begin{cases} \frac{|I_1[X] - I_2[X]|}{\max(X) - \min(X)} & \text{if } X \text{ is numerical} \\ 0 & \text{if } I_1[X] = I_2[X] \text{ and } X \text{ is nominal} \\ 1 & \text{if } I_1[X] \neq I_2[X] \text{ and } X \text{ is nominal} \end{cases}    (12)

As discussed by Liu, this particular function tends to underestimate numerical features (Liu & Motoda, 2007). The example he used to illustrate this phenomenon is the following. Let X take values in {1, 2, 3, 4, 5, 6, 7, 8}. If X were nominal, the value obtained by the difference function would be diff(2, 5) = 1; however, if X were numerical, the value would be diff(2, 5) = |2 - 5| / (8 - 1) ≈ 0.43. Since this value is used to update the feature weights, Relief underestimates numerical features. As a solution, Liu suggests the use of the contextual merit function proposed by Hong, which provides a generalisation of the diff function for numerical attributes (Hong, 1997).

    diff(I_1[X], I_2[X]) = \begin{cases} 0 & |I_1[X] - I_2[X]| \le t_{eq} \\ 1 & |I_1[X] - I_2[X]| > t_{diff} \\ \frac{|I_1[X] - I_2[X]| - t_{eq}}{t_{diff} - t_{eq}} & t_{eq} < |I_1[X] - I_2[X]| \le t_{diff} \end{cases}    (13)


Here t_{eq} and t_{diff} are user-defined parameters: t_{eq} is the threshold below which two values are considered equal, and t_{diff} is the threshold above which two values are regarded as distinct. Hong gives the following default values for these parameters:

    t_{eq} = 0.05(\max(X) - \min(X))    (14)

    t_{diff} = 0.10(\max(X) - \min(X))    (15)
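With the stated defaults, Equations 13 to 15 can be sketched in R as follows; the whole feature column is passed so that the thresholds can be derived from its range (names are illustrative):

    diff_numeric <- function(v1, v2, feature) {
      rng    <- max(feature) - min(feature)
      t_eq   <- 0.05 * rng                  # Equation 14
      t_diff <- 0.10 * rng                  # Equation 15
      d <- abs(v1 - v2)
      if (d <= t_eq) 0
      else if (d > t_diff) 1
      else (d - t_eq) / (t_diff - t_eq)
    }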

Kira used the Euclidean distance for the nearest-neighbour search and the squared difference, as can be seen above. Interestingly, Kononenko reported that in his experiments using only the difference yields the same results (Kononenko, Šimec, & Robnik-Šikonja, 1997). He also used the Manhattan distance in the nearest-neighbour search as opposed to the Euclidean one, though an explicit reason was not provided. The author of this report has experimented with both approaches and found little difference in the results they produce.

An interesting observation about the relationship between the update rule in Relief and the Gini index was made by Kononenko. He demonstrated that the update rule can be expressed as an approximation of the following difference in probabilities (Kononenko, Šimec, & Robnik-Šikonja, 1997):

    W[X] = P(different value of X | nearest instance from different class) - P(different value of X | nearest instance from same class)

By removing the nearest-instance condition he rewrote the rule to obtain:

    W[X] = P(different value of X | different class) - P(different value of X | same class)

He then proceeded to derive the following mathematically:

    W[X] = const \cdot \sum_j P(X_j)^2 \cdot Gini'(X)    (16)

where:

    Gini'(X) = \sum_j \left( \frac{P(X_j)^2}{\sum_j P(X_j)^2} \sum_c P(Y_c|X_j)^2 \right) - \sum_c P(Y_c)^2    (17)


As noted by Kononenko, the term \frac{P(X_j)^2}{\sum_j P(X_j)^2} is what distinguishes this from the Gini index's \frac{P(X_j)}{\sum_j P(X_j)}, providing Relief with a normalisation for multivalued features. This effectively solves the problem of multivalued feature overestimation, from which impurity functions such as Mutual information and the Gini index suffer (Kononenko, Šimec, & Robnik-Šikonja, 1997). A detailed reproduction of the relationship demonstrated by Kononenko can be found in Appendix B.

It is worth noting that Relief was designed only for binary class problems. Moreover, as discussed by Kononenko, by using only the nearest-hit and nearest-miss of a chosen instance it is sensitive to noisy data (Kononenko I., 1994). To overcome these two limitations he proposed an extension which finds the k nearest neighbours of that instance from each class and averages their contributions; if the average is of nearest-misses, it is weighted by the prior probability of that class (Kononenko, Šimec, & Robnik-Šikonja, 1997). This led to the ReliefF algorithm. Due to its advantages over the classic Relief, the author has decided to employ it in the experimental work.

2.3.4.2 Algorithm: ReliefF

Input:  S – dataset, described by n features and t instances
        m – the number of examples the algorithm should examine
        k – the number of nearest instances to examine
Output: W – a set containing the weight score for each feature

1    W = [0, ..., 0]
2    for i = 1 to m
3        pick a random instance I_R, with class Y_I
4        for c = 1 to C
5            find the nearest k instances of I_R from class Y_c
6            for X = 1 to n
7                for j = 1 to k
8                    if Y_c = Y_I
9                        then W[X] = W[X] - diff(I_R[X], S_j[X]) / (m * k)
10                       else W[X] = W[X] + (P_{Y_c} / (1 - P_{Y_I})) * diff(I_R[X], S_j[X]) / (m * k)
11                   end if
12               end for
13           end for
14       end for
15   end for
16   return W

ReliefF's asymptotic complexity is O(m * n * t), with the selection of the k nearest instances being the most expensive operation. Therefore, employing techniques such as k-d trees and cover trees for the nearest-neighbour search can improve efficiency drastically.


2.4 CHAPTER SUMMARY

This chapter introduced the reader to the existing research done in the areas of stability and

feature selection criteria central to this project. A review of several similarity measures was

presented, with their advantages and drawbacks explored, followed by a detailed description of

the different feature selection criteria used.


Chapter Three

3 METHODS AND DESIGN

3.1 CHAPTER OVERVIEW

This chapter will outline the project constraints, planning and the software engineering practices employed. It will discuss the choice of programming language, associated libraries, and the development environment that has been used.

3.2 PROJECT MANAGEMENT

One of the key characteristics of this project is that it is date-driven⁵, which has prompted the author to employ agile software engineering planning techniques. Given the short development time, delivering the most important aspects of the project in a timely manner has been crucial.

This has been achieved by effectively using short incremental iterations and user stories. 'Short iterations' refers to a process of dividing the problem into smaller units, each of which represents a complete piece of functionality, and completing them incrementally. This results in the delivery of working software at the end of each iteration, whose typical length is two weeks.

'User stories' facilitate the timely delivery of project features. They consist of a brief description, in plain English, of the desired functionality of the system; by reprioritising them it can be ensured that the most important aspects of the software are delivered first. Table 1 illustrates the user stories identified, along with their respective priorities. It can be observed that these are closely related to the project's objectives.

5 Date-driven is an agile term, which Cohn uses to define a project 'that must be released by a certain date but for which the feature set is negotiable.' (Cohn, 2005)


Iteration 1
  High:   As a researcher, I want to be able to select the most relevant features describing my data, based on their Mutual information, so that I can have a clear view of the factors underpinning the problem. (Mutual information maximisation criterion)

Iteration 2
  High:   As a researcher, I would like to minimise feature redundancy so that I can obtain a subset of relevant features. (Mutual information feature selection criterion)

Iteration 3
  Low:    As a researcher, I would like to be able to upload .csv data files, since this is one of the most common ways my data is stored.
  High:   As a researcher, I would like to be able to quantify the stability of different criteria using a basic similarity measure, so that I can understand how robust the subset obtained from the method is to slight variations of the training data. (Tanimoto distance)

Iteration 4
  High:   As a researcher, I would like to see the stability of the feature selection algorithm in a common correlation measure such as Pearson's correlation coefficient, so that I can easily interpret it.
  High:   As a researcher, I would like to have a stability measure that factors randomness and chance into the feature selection, so that I can ensure that any stable subsets produced are so by design and not by chance. (Kuncheva's consistency index)
  Medium: As a researcher, I would like to implement the Hamming distance in order to compare it to the other existing similarity measures.

Iteration 5
  High:   As a researcher, I would like the ability to visually interpret the stability results of each criterion.
  Medium: As a researcher, I would like to be able to use the feature selection algorithms on continuous data attributes, to cater for different datasets.
  Medium: As a researcher, I would like the continuous data to be discretised in the most optimal way, so that I can be sure that no relevant information will be lost during the process.

Iteration 6
  High:   As a researcher, I would like to be able to select a feature subset based on Gini index values, since it is computationally inexpensive and will present highly discriminatory features.

Iteration 7
  Medium: As a researcher, I would like to be able to employ a feature selection technique which considers inter-feature relationships. (Relief)

Table 1: Iterations, user stories and their priorities, coinciding with the main objectives of this project

3.3 TECHNOLOGIES

The supporting technologies behind any software engineering or research project are vital to its duration and quality. Therefore, sufficient time was dedicated to exploring several different languages, their strengths and limitations.


3.3.1 R

The focus area of this project is in the domain of Machine Learning, which relies heavily on statistical calculations. With this in mind, key capabilities of the chosen language had to be excellent support and optimisation for matrix operations, good integration with visualisation libraries, and a well-established status within the research community and industry.

Figure 1 provides statistical information regarding the most widely used languages within the fields of data mining, data analysis and data science. Of the six languages identified, those which are highly optimised for matrix operations, freely available and well-established in these domains are R and Python.

It should also be noted that, compared to languages like Python, which has been known for its simplicity and suitability for beginners, R is considered to have a steep learning curve, as highlighted by Theuwissen (Theuwissen, 2016). Furthermore, he commented that R is well-established in the research community and, most importantly, that its visualisation capabilities surpass those of Python. The final point has been of vital importance to this project, since effective visualisation of results is critical. This prompted the choice of R over Python despite its drawbacks; it also presented the challenge of learning a new language.

Figure 1: Programming languages popularity in data science.⁶

6 This graph was obtained from: http://www.kdnuggets.com/polls/2013/languages-analytics-data-mining-data-science.html


3.3.2 RStudio

Another advantage of using R is the RStudio development environment, which provides access

to debugging tools, instant graph visualisation and a simple command line interface. It allows

good integration with external packages and facilitates both version control and code

organisation. Moreover, it has good support from the community and is freely available. All

these advantages have made it the preferred choice over the terminal-based environment.

3.3.3 Packages

One of the main initial concerns related to the need for an effective means of visualising the results. An objective of the project is to investigate and compare the stability of different feature selection criteria; to achieve this, the results needed to be presented in a form that is easily understandable and enhances the analysis process, rather than hindering it. This led to research into data visualisation with R, and then to the 'plotly'⁷ package. Plotly is an open-source interactive graphics library for multiple programming languages, including R. It provides a fast way to generate histograms, charts and line plots. The package has a steep learning curve and requires comprehension of the documentation and examples before a novice user can utilise its capabilities.

It should be noted that the speed of the experimental work has been quite important in such a short project, hence an application has been created to cater for the author’s needs. However, since the focus of this project is not application development, the need for a package supporting quick and effortless web development in R became apparent. After careful investigation, the ‘shiny’8 and ‘shiny dashboard’9 packages were found to facilitate the creation of a dashboard-style web interface with ease. This provided a way of producing a general-purpose application in which datasets can be manipulated easily, the stability of different feature selection criteria analysed and the results consolidated. Nevertheless, these two packages have their limitations. For example, customising colours or buttons is quite difficult, if not impossible, in certain cases. For the purposes of this project, however, they have sufficed to create a clean interface that serves the user’s needs.
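As an illustration, a minimal ‘shiny dashboard’ skeleton of the interface style described above might look as follows (tab names and widgets here are placeholders, not the application’s actual layout):

library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "Stability analysis"),
  dashboardSidebar(sidebarMenu(
    menuItem("Data upload", tabName = "upload"),
    menuItem("Stability evaluation", tabName = "stability")
  )),
  dashboardBody(tabItems(
    tabItem(tabName = "upload", fileInput("data", "Upload a dataset")),
    tabItem(tabName = "stability", plotOutput("stabilityPlot"))
  ))
)

server <- function(input, output) {
  output$stabilityPlot <- renderPlot(plot(1:10, type = "l"))
}

shinyApp(ui, server)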

Another package of vital importance to this project has been ‘RUnit’10, which supports the creation of test suites. Since most computer science projects require extensive testing, this proved to be one of the most useful libraries: with the help of the test suite produced, the author could safely introduce modifications and improvements to the code. The topic of testing and validation will be explored in detail in the next chapter.
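The style of test written with ‘RUnit’ can be sketched as follows; the helper under test here is hypothetical, chosen only to illustrate the checkEquals idiom used in the real test suite:

library(RUnit)

# hypothetical helper: Sturges' rule (see section 4.4.2), rounded up to whole bins
sturgesBins <- function(n) ceiling(1 + log2(n))

test.sturgesBins <- function() {
  checkEquals(sturgesBins(1), 1)      # a single sample yields a single partition
  checkEquals(sturgesBins(569), 11)   # e.g. the 569-sample breast dataset
}
test.sturgesBins()                    # returns TRUE when all checks pass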

7 Plotly is available from https://plot.ly/r/, along with complimentary tutorials and start-up guides
8 Shiny is available from http://shiny.rstudio.com/, along with complimentary tutorials and start-up guides
9 Shiny dashboard can be obtained from the following GitHub repository: https://rstudio.github.io/shinydashboard/
10 RUnit is available from the CRAN repository https://cran.r-project.org/web/packages/RUnit/index.html


3.4 ARCHITECTURE

A principle employed when designing the structure of the project is ‘Separation of concerns’,

which as defined by Dijkstra ‘even if not perfectly possible, is yet the only available technique

for effective ordering of one's thoughts’ (Dijkstra, 1982). Therefore, as previously established a

web-based tool, which aids the research process, has been created. It consists of three main components: a thin client, a server and a main computational core.

3.4.1 The user interface

Shneiderman suggests several guidelines which should be employed when designing a user interface, amongst which is that one should strive for consistency (Shneiderman & Plaisant, 2010).

To ensure this throughout the application all graphical representations have been visualised

using line plots along with a consistent colour scheme.

Another key principle, which Shneiderman presented, is offering informative feedback

(Shneiderman & Plaisant, 2010). This improves the user experience by providing information

regarding the status of an event, or the result of an action. For example, in this project the user

has the ability to view the results of the data upload or the data discretisation in real time, which allows him/her to understand the implications of these actions.

The dashboard-style of all pages has been chosen deliberately since it serves as a metaphor for

physical dashboards which people rely on in their everyday lives. This technique has been employed by many companies, such as Apple, as a tool which allows new users to grasp concepts quickly (Apple, 2016).

3.4.2 The server

The server’s responsibilities include graph generation, retrieval of results and communication

with the main computational core. The decision to split the server from the core has been

influenced by the need for abstraction to facilitate the independent use of the stability measures

and feature selection criteria. The server simply utilises the necessary components from the

core, passes the relevant information to the client and is responsible for generating the graphs.

3.4.3 The computational core

The computational core is the most fundamental component of the project. It comprises the

different feature selection criteria, all the similarity measures discussed, probability estimation

functions which support the criteria, data pre-processing capabilities and file manipulation.

Consequently, the core has been divided into five main subcomponents, illustrated in Figure 2,

whose implementation will be explored in detail in the next chapter.


The separation of the subcomponents caters for the need for flexibility and extensibility. This

implies that if a reimplementation of the probability estimation functions is required, it could be

done without affecting other parts of the code. Due to the generalised approach for stability

estimation, the addition of a new criterion is a trivial task and requires minimal effort and

change in the existing codebase.

Figure 2: Computational core structure

3.5 EXPERIMENTAL DESIGN

To achieve the aim of this project several experiments have been conducted. However, since their output is used to formulate hypotheses about the behaviour of the different feature selection criteria, a careful design process has been required. By constructing a design protocol

one ensures the repeatability of the work, along with verifying that the experiment will indeed

answer the question of interest. Therefore, the following protocol has been devised. A

bootstrapping technique, detailed in Appendix G, is used to introduce small variations in the

data, and the number of repetitions, which is thirty, has been chosen to provide a more

representative stability score for each feature selection criterion.

3.5.1.1 Experimental protocol

1. Generate a similar dataset to the original one, using bootstrapping
2. For subset size 𝑘 = 1 to 𝑛
3. Perform feature selection using the four different criteria
4. Store the feature preferences produced
5. Repeat Steps 1-4 thirty times
6. Compute the stability scores for every subset size 𝑘, using the preferences from the 30 sets
7. Compute the confidence intervals
8. Plot the results
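The protocol can be sketched compactly in R as follows; selectFeatures and pearsonStability are stand-ins for the project’s criterion and stability functions, and are assumptions rather than the actual implementation:

runProtocol <- function(data, labels, repetitions = 30) {
  n <- ncol(data)
  prefs <- matrix(0, nrow = repetitions, ncol = n)
  for (r in 1:repetitions) {                      # Steps 1-5
    idx <- sample(nrow(data), replace = TRUE)     # bootstrap sample of the rows
    prefs[r, ] <- selectFeatures(data[idx, ], labels[idx])
  }
  sapply(1:n, function(k)                         # Step 6: one score per subset size
    pearsonStability(prefs, k))
}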


3.6 CHAPTER SUMMARY

This chapter presented the main technologies and software engineering practices utilised in the

development of this project. It described the design choices, provided an overview of the main

architecture and discussed the project planning process.


Chapter 4

4 IMPLEMENTATION

4.1 CHAPTER OVERVIEW

This chapter will provide a detailed review of the project implementation and the challenges

faced. Since the main focus of this project is research rather than web application development, this chapter will focus largely on the computational core but will also feature details regarding the user interface and the server. It will additionally explore the different

testing techniques employed for validation.

4.2 THE USER INTERFACE

The user interface has been developed using the ‘shiny’ and ‘shiny dashboard’ packages, which

are web application frameworks. They provide basic building blocks such as menu bars,

sidebars, headers, plot outputs, table outputs and many more, which facilitate the creation of a

web application for data analysis in an efficient way. The figures below demonstrate the

interface’s capabilities.

As can be observed in Figure 3, the user has the ability to upload data and labels, discretise them if necessary and preview them in a table format. Such a representation is quite consistent with tools such as Microsoft Excel, thus exploiting the user’s pre-existing knowledge to create a better user experience.

Figure 4 demonstrates the feature selection stability evaluation page where the user can select

the number of times the algorithms should be run and which criteria should be compared. The

graphs are interactive and facilitate the data analysis process. Furthermore, if the user wishes, he/she can exclude certain criteria directly on the graph and compare only two at a time.

Figure 5 illustrates the result consolidation capabilities. The user has the ability to examine the characteristics of the datasets, such as the number of features, classes or examples, and display graphical representations of previously run results. This page serves as a storage place for all experimental work and is accessible from multiple locations. This has proven to be valuable during the experimental analysis stage of the project.

Figure 6 presents the example recreation screen, which contains research paper reproductions

illustrating the stability measures’ performance. Lastly, Figure 7 provides an overview of the

References page which caters for the central storage of background reading.


Figure 3: User interface: data upload and preprocessing

Figure 4: User interface: stability evaluation

Figure 5: User interface: result consolidation

Figure 6: User interface: example reproductions

Figure 7: User interface: references


4.3 THE SERVER

Section 3.4.2 highlighted the server’s main responsibilities whose implementation shall be

discussed in detail. For retrieval and storage of results the server utilizes the file manipulation

subcomponent of the computational core. Similarly, when data discretisation is required the

server utilizes the data pre-processing subcomponent.

Graph generation is required in a couple of instances and has been implemented using the ‘plotly’ package. Firstly, upon conducting experimental work the results are presented in the form of a line graph. The server obtains the results from the computational core. Afterwards, they are

passed to the client for visualisation. Their update is achieved with the help of ‘reactive

expressions’11. Additionally, if the user wishes to preserve the results for future use, a special R

file is created containing the data structure of the output.

The saved experimental work could then be read from the server’s filesystem upon start-up.

Similar techniques have been employed for the example reproduction page, with all

experimental results being stored as files.
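A minimal server-side sketch of this pattern, with illustrative names only (the project’s actual identifiers and file format may differ), could be:

library(shiny)
library(plotly)

server <- function(input, output) {
  results <- reactive({
    readRDS(input$resultFile$datapath)        # previously saved experimental results
  })
  output$stabilityPlot <- renderPlotly({      # re-renders only when results() changes
    r <- results()
    plot_ly(x = r$k, y = r$stability, type = "scatter", mode = "lines")
  })
}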

4.4 THE COMPUTATIONAL CORE

As outlined previously the computational core has been divided into five subcomponents,

ensuring modularity and abstraction.

4.4.1 File manipulation

The first subcomponent is responsible for file manipulation, namely opening and saving files.

Those capabilities are utilized when the user uploads the data and if he/she wishes to save the

results obtained from the analysis. It is also used in the example sections for reading previously

stored data analysis results and dataset information.

4.4.2 Data pre-processing

The data pre-processing subcomponent provides data discretization capabilities, which have

become necessary to the performance improvement of several feature selection criteria as

mentioned in section 2.3.3.

Discretisation is the process of converting continuous attributes into discrete ones. As explained

by Clarke it usually involves the process of splitting the continuous space into 𝐾 partitions

corresponding to discrete values (Clarke & Barton, 2000). This process introduces what is

known as discretisation error due to the inability to accurately represent all the states of a variable.

11 Reactive expressions control which parts of the application need to be updated and when. They aim to avoid unnecessary work.

Therefore the number of partitions 𝐾 must be chosen carefully to minimise this error. In 1926 Sturges proposed an approach for estimating the number of partitions given a sample size 𝑛, which later became known as ‘Sturges’ rule’ (Sturges, 1926).

𝐾 = 1 + log₂(𝑛) (18)

The R package ‘infotheo’12 has been used to perform the actual discretisation, with the number of partitions provided using the above formula.
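A minimal sketch of this step might be as follows, assuming a numeric data frame X; equal-width binning is shown here as one reasonable choice, since the project’s exact binning mode is not restated in this section:

library(infotheo)

n <- nrow(X)
bins <- ceiling(1 + log2(n))                           # Sturges' rule, equation (18)
Xd <- discretize(X, disc = "equalwidth", nbins = bins)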

4.4.3 Probability estimation

Since many of the feature selection criteria rely on probability estimations, such as prior and

posterior probabilities, a separate subcomponent has been developed to facilitate reusability

and modularity. All methods in it rely on the same input parameters, namely the dataset and the column number of the data labels, and return matrices as their output. The output type was chosen after an investigation of R’s performance using profiling tools, which demonstrated the superiority of matrices over other R data types such as ‘data frames’13.
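The estimation itself can be sketched in base R as follows, assuming a discretised dataset data and a label column index dataClass (as in the pseudocode of Appendix C):

estimateProbs <- function(data, dataClass) {
  y <- data[, dataClass]
  x <- data[, -dataClass, drop = FALSE]
  prior <- as.matrix(prop.table(table(y)))               # P(Y_c), as a matrix
  # posterior P(Y_c | X_j) for feature f: normalise the contingency table by row
  posterior <- function(f) prop.table(table(x[, f], y), margin = 1)
  list(prior = prior, posterior = lapply(seq_len(ncol(x)), posterior))
}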

An issue often encountered in probability estimation is the ‘zero conditional probability

problem’, which refers to a dataset not containing an example of class 𝑌𝑐 with value 𝑋𝑖 for an

attribute 𝑋. This is explored in detail in Appendix D along with a possible solution.

As per the definition of stability its calculation involves introducing small perturbations in the

data and applying the feature selection algorithm a number of times. In the field of Machine

learning, a technique for achieving this is bootstrapping, detailed in Appendix G. After a

dataset is generated using this approach a feature selection method is used to produce the

feature preferences. A user-defined parameter controls the number of times the process is

performed and afterwards the preferences are passed to the stability estimator, along with the

type of similarity measure to be used. A detailed pseudocode of the stability function is

supplied in Appendix C.
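For reference, the bootstrap step itself is a single line of base R: sampling row indices with replacement yields a perturbed dataset of the same size as the original.

bootstrapSample <- function(data) {
  idx <- sample(nrow(data), size = nrow(data), replace = TRUE)
  data[idx, , drop = FALSE]                 # some rows repeat, others drop out
}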

12 The package ‘infotheo’ is available from https://cran.r-project.org/web/packages/infotheo/index.html
13 Detailed information about data frames is available at https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html


4.4.4 The feature selection criteria

The feature selection subcomponent is responsible for invoking the correct criteria, performing

the necessary computations for subset generation and returning a set of feature scores. The first

and last parts are straightforward and are implemented as a separate interface. Each of the

criteria is separated into an individual function and file, to ensure low coupling. The Mutual

information based approaches along with Gini index utilise the probability estimation

subcomponent, which provides the prior and posterior probabilities necessary.

The following parameters have been used in the implementations:

- Mutual information maximisation: a custom implementation has been provided with no special parameters; the data has been discretised prior to analysis; detailed pseudocode is available in Appendix C (a brief R sketch of this scoring loop is given after this list).

- Mutual information feature selection: 𝛽 = 1 (Battiti, 1994); the data has been discretised prior to analysis; detailed pseudocode is available in Appendix C.

- Gini index: a custom implementation has been provided with no special parameters; the data has been discretised prior to analysis; detailed pseudocode is located in Appendix C.

- ReliefF: the number of instances considered, 𝑚, is equal to the total number of examples (Kononenko, Šimec, & Robnik-Šikonja, 1997); the number of nearest neighbours 𝑘 has been given the value 10 (Kononenko I., 1994), or fewer if there are fewer than 10 instances of that class. The default values for the thresholds 𝑡𝑒𝑞 and 𝑡𝑑𝑖𝑓𝑓 introduced in section 2.3.4 have been used; the data has been discretised prior to analysis to facilitate comparison with the other criteria; detailed pseudocode is available in section 2.3.4.
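As referenced in the first item above, MIM ranking reduces to scoring each feature by its mutual information with the class and sorting. The sketch below uses the ‘infotheo’ package as a stand-in for the custom implementation; Xd (a discretised data frame) and y (the class labels) are assumptions:

library(infotheo)

mimRank <- function(Xd, y) {
  scores <- sapply(Xd, function(f) mutinformation(f, y))   # I(X_f ; Y) per feature
  order(scores, decreasing = TRUE)                         # feature preferences
}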

4.5 TESTING

In order to verify the correctness of the implementation, along with the author’s understanding

of the subject matter, several testing techniques have been employed. Together they formed the

test suite of the project.

Embury provided several guidelines for ‘a good quality test suite’ (Embury, 2012). Firstly, test

cases need to be independent, in other words always produce the same result regardless of the

execution context. Secondly, they should be repeatable, meaning that the same output should be

produced every time the test is run (assuming no code modifications). Thirdly, there should be

no redundant test cases, i.e. a test case which verifies the same behaviour as another one. The

last guideline is of importance because running test suites is computationally expensive, consumes resources and introduces waste, as unnecessary test cases need to be maintained. The author has taken the above considerations into account when designing the test suite.


4.5.1 Unit tests

In software development, unit tests are developed in such a manner so that they can verify the

smallest testable piece of the software. They are generally fast and useful in the debugging

process since they can pinpoint a bug’s location very accurately. When designing test cases the

author has decided to employ black box testing techniques such as boundary value analysis14

and equivalence partitioning15. This ensured all necessary scenarios were covered with an optimal set of test cases. Overall there are eighteen separate functions, all of which have unit tests.

4.5.2 Integration tests

This type of test is used to verify the correctness of multiple components working together. For

example, the feature selection component relies on the probability estimation one for obtaining

the necessary prior and posterior probabilities. To test the components interactions a total of six

such tests have been employed. The output of these along with the one from the unit tests can

be found in Appendix F.

4.5.3 External tools validation

The above-mentioned approaches test the correctness of the implementation only against the author’s individual understanding of the problem. As a counter-measure, validation of the results against external tools and libraries has been performed.

To test Mutual information based criteria the FEAST16 toolbox for Matlab has been utilized. The

implementation of the ReliefF algorithm has been verified using the ‘dprep’17 package for R and

certain similarity measures like Pearson’s correlation coefficient have been validated using

online tools such as the ‘Correlation coefficient calculator’18.

4.6 CHAPTER SUMMARY

This chapter presented in detail the architectural components and their use. It explored various

testing techniques and their practical application in the context of this project.

14 As defined in the testing standards glossary, boundary value analysis is: ‘A test case design technique for a component in which test cases are designed which include representatives of boundary values’ (Standards, 2016).
15 As defined in the testing standards glossary, equivalence partitioning is: ‘A test case design technique for a component in which test cases are designed to execute representatives from equivalence classes’ (Standards, 2016).
16 FEAST has been created by G. Brown, A. Pocock, M.-J. Zhao, M. Lujan and is available at http://www.cs.man.ac.uk/~gbrown/software/
17 ‘dprep’ is an R package available from the CRAN https://cran.r-project.org/web/packages/dprep/index.html
18 ‘Correlation coefficient calculator’ is a tool available at http://www.socscistatistics.com/tests/pearson/Default2.aspx


Chapter 5

5 EXPERIMENTS AND ANALYSIS

5.1 CHAPTER OVERVIEW

This chapter will explore in detail the experiments conducted, the datasets used and the

techniques applied. It will discuss the choice of a similarity measure and will investigate the

stability of the different criteria by providing an analysis of the experimental results.

5.2 STABILITY MEASURES

In order to identify the best similarity measure to use, the author has conducted several

experiments which aim to explore the strengths and weaknesses of each one.

5.2.1 Experiments

In Chapter 2, four different similarity measures have been presented. In order to compare these,

two examples from Kuncheva’s paper have been reproduced. They focus on Tanimoto distance,

Hamming distance and the Consistency index (Kuncheva, 2007). Additionally in section 2.2.4

another similarity measure based on Pearson’s correlation coefficient has been presented and

has been added to the reproduced examples for completeness.

5.2.1.1 Experiment one

Experiment one aimed to demonstrate the behaviour of the different similarity measures when

applied on a small subset of selected features. The following hypothetical sequences of features,

𝑠 and 𝑠′, have been produced by two runs of a feature selection algorithm on a dataset with 𝑛 =

10 features.

𝑠 = {𝑓9, 𝑓7, 𝑓2, 𝑓1, 𝑓3, 𝑓10, 𝑓8, 𝑓4, 𝑓5, 𝑓6}

𝑠′ = {𝑓3, 𝑓7, 𝑓9, 𝑓10, 𝑓2, 𝑓4, 𝑓8, 𝑓6, 𝑓1, 𝑓5}

These sets have been used to illustrate the behaviour of all measures as subsets of size 𝑘 = 1 to

10 have been examined.

As noted previously, Kuncheva’s experiments do not investigate the behaviour of Pearson’s

correlation coefficient. Originally Kalousis suggested its use only when the feature selection

algorithm produces weight-scoring output (Kalousis, Prados, & Hilario, 2007). However, one

can apply a generalised approach that would allow the use of the coefficient for feature subsets


by assigning weight 1 to all features in the selected subset and 0 to all features which have not

been chosen. These weights can then be used to compute the coefficient.
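This generalisation is straightforward to express in R; the subsets below are the top-three features of the two hypothetical runs above:

subsetToWeights <- function(subset, n) {
  w <- numeric(n)
  w[subset] <- 1          # weight 1 for selected features, 0 otherwise
  w
}

s1 <- c(9, 7, 2)          # top-3 features of s
s2 <- c(3, 7, 9)          # top-3 features of s'
cor(subsetToWeights(s1, 10), subsetToWeights(s2, 10))   # Pearson similarity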

Figure 8 demonstrates the results of this experiment; more details about which can be found in

Appendix E.

Figure 8: Example one reproduced from Kuncheva's paper (Kuncheva, 2007).

5.2.1.2 Experiment two

Kuncheva’s second example aims to demonstrate the advantages of the consistency index over

Tanimoto distance and Hamming distance using the total stability obtained from ten random

sequences of size ten. Similarly to experiment one, Pearson’s correlation coefficient has been

added for completeness. Using the random sequences, a number of features have been selected

and the stability amongst all ten subsets computed using equation (1). Figure 9 illustrates the

output of this experiment.

Figure 9: Example two reproduced from Kuncheva's paper (Kuncheva, 2007).


5.2.2 Which stability measure should be used?

The most important observation that can be made from analysing the results of these

experiments is that Pearson’s correlation coefficient produces identical values to the

Consistency index. The results imply that under the assumptions outlined above Pearson’s

correlation coefficient is equivalent to the Consistency index; a detailed proof of which can be

found in Appendix A. It should be noted that this has been observed simultaneously by fellow researchers in the field, Nogueira and Brown (Nogueira & Brown, 2016).

Additionally, the first example clearly demonstrates that Tanimoto distance and Hamming

distance produce high values when the subset size 𝑘 approaches the total number of elements 𝑛.

In contrast, the Consistency index and Pearson’s correlation coefficient adjust the value to account for chance.

Example two demonstrates that the Consistency index and Pearson’s correlation coefficient

retain a value of 0 for all subset sizes, whilst the stability reported by Tanimoto distance and

Hamming distance varies considerably. This is particularly important since high stability scores

should be obtained only by design and not by chance.

Considering the above identified advantages the author has decided to employ Pearson’s

correlation coefficient in the stability evaluation of feature selection criteria.

5.3 FEATURE SELECTION CRITERIA

The aim of this research is to evaluate the stability of different feature selection criteria which

has been achieved by conducting several experiments, outlined in detail below.

5.3.1 Experiments

The empirical evaluation of the criteria involved the use of datasets19 with various

characteristics. Table 2 provides detailed information about them. Some datasets like the breast

cancer and splice contain a small number of features and classes and a large number of samples,

whilst others such as the lung cancer, penglung, pengcolon and penglymph are described by a large

number of features and classes but contain only a few examples. These have been chosen

deliberately to facilitate the analysis of the feature selection criteria under various conditions.

19 The datasets used in this research have been obtained from https://archive.ics.uci.edu/ml/index.html


Data Features Examples Classes

breast 30 569 2

congress 16 435 2

heart 13 270 2

ionosphere 34 351 2

krvskp 36 3196 2

lungcancer 56 32 3

parkinsons 22 195 2

pengcolon 2000 62 2

penglung 325 73 7

penglymph 4026 96 9

sonar 60 208 2

splice 60 3175 3

waveform 40 5000 3

Table 2: Datasets characteristics

Since stability aims to evaluate the robustness of the feature selection algorithm to small

perturbations to the data, a way is needed to introduce these. In Machine learning, a technique

called ‘bootstrapping’ could be used, which is explained in detail in Appendix G.

By applying the design protocol described in section 3.5 the following results have been

obtained. The x-axis denotes the number of features selected and the y-axis the stability score

for each subset calculated using Pearson’s correlation coefficient, with the appropriate

confidence intervals. Blue illustrates the scores achieved by the Mutual information maximisation

criterion (MIM); orange corresponds to Mutual information feature selection (MIFS); green to

Gini index (GI) and red to ReliefF (RF).


Figure 10: Breast dataset stability

Figure 11: Krvskp dataset stability

Figure 12: Splice dataset stability

Figure 13: Waveform dataset stability

Figure 14: Congress dataset stability

Figure 15: Heart dataset stability

Figure 16: Ionosphere dataset stability

Figure 17: Parkinson’s dataset stability


Figure 18: Lung cancer dataset stability

Figure 19: Sonar dataset stability

Figure 20: Penglung dataset stability

Figure 21: Pengcolon dataset stability

Figure 22: Penglymph dataset stability


5.3.2 How stable are the criteria?

As can be observed from the experimental results, there seems to be no criterion which is stable in all cases. However, a few generalised conclusions can be drawn.

Mutual information maximisation and the Gini index criterion exhibit very similar behaviour in

all cases. This is likely due to their method of computing feature scores. Both are impurity

functions, which aim to approximate the difference between prior and posterior probabilities.

Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14 characterize the behaviour of the feature

selection criteria when the dataset contains a large number of examples and a small number of

features. In those instances Mutual information maximisation (MIM) and Gini index (GI) appear to be the most stable. This is likely because introducing small variations in datasets with a large number of examples does not significantly change the prior and posterior probabilities, which MIM and GI rely on. ReliefF produces quite high results as well, which can

be associated with its relationship to GI identified in section 2.3.4. However, since ReliefF relies

on nearest neighbours’ estimation, the bootstrapping technique might affect the nearest-hits and

nearest-misses, thus resulting in lower stability scores than GI and MIM report. Mutual

information feature selection (MIFS) appears the least stable in these datasets. Since the

relevancy term is the same as in MIM, this would imply that variations in the data cause

changes in the redundancy term, or the overall correlation between features. Interestingly, MIFS

appears to achieve high stability scores on the Krvskp dataset possibly due to the high number

of examples with respect to the number of features and classes.

Additionally, the behaviour of this criterion when applied to the Waveform dataset is peculiar.

Its stability is poor initially and increases after the top twenty features have been selected, as

opposed to MIM, GI and ReliefF whose stability degrades at that point. This could indicate the

presence of highly correlated features with high discriminatory power. In these instances MIM,

GI and ReliefF will successfully select all of them even with small variations. MIFS, on the other

hands, could be affected by which of the correlated ones are chosen first. After the twenty

feature mark most of the features could be of similar relevance thus explaining the drop in

stability of MIM, GI and ReliefF. However, since these would have been selected by MIFS

earlier its stability is lower beforehand.

Figure 15, Figure 16 and Figure 17 report the results obtained from applying the experimental

protocol on datasets with a relatively small number of features but far less samples.

Interestingly, Figure 15 seems to differ from the other two because all algorithms fail to identify

a consistent set of one to five most important features. This behaviour could be due to the

features describing the dataset. It is likely that there is no single feature which has significantly

higher discriminatory power in comparison to the others. This could lead to different features

receiving slightly higher scores when perturbations in the data are introduced. The poor


performance of ReliefF, which considers feature interactions, could suggest that most of the features play an important role in distinguishing between the classes.

Figure 16 and Figure 17 demonstrate the effect of the decreasing number of instances. MIM and

GI identify the top two features quite well but have problems producing a stable subset

afterwards. ReliefF on the other hand produces lower stability values possibly due to the

sampling technique which removes certain hits and misses and introduces new ones. MIFS

identifies the most important feature but its performance degrades significantly afterwards. As

can be observed, it achieves higher stability results on the Parkinson’s dataset. Since the relevancy term is the same as in MIM, this would imply that the redundancy term is able to

compensate and likely introduces a penalty for the same highly correlated features.

Figure 18, Figure 19, Figure 20, Figure 21 and Figure 22 illustrate the behaviour of the criteria in

extreme cases, that is, a large number of features and classes and a small sample size, with all

criteria performing poorly. ReliefF, which relies on nearest-hits and misses, suffers from the

small sample size since there are not enough instances belonging to each class, and often the

resampling removes key ones. Similarly the prior and posterior probability estimations are

highly affected when variations are present, resulting in the poor stability score of MIM and GI.

MIFS appears to achieve slightly higher average stability with the increase of the subset size,

perhaps due to early detection of certain highly correlated features. As noted by Brown, with

the increasing number of features the redundancy term in MIFS grows, whilst the relevancy one

remains the same, thus selecting features that minimize redundancy as opposed to the desired

trade-off between relevancy and redundancy (Brown, Pocock, Zhao, & Luján, 2012).

5.4 CHAPTER SUMMARY

This chapter introduced the reader to the experiments conducted and the results obtained. A

thorough analysis of the different feature selection criteria was performed, exploring their

behaviour under various conditions.


Chapter 6

6 CONCLUSIONS

6.1 CHAPTER OVERVIEW

The final chapter will summarize the key findings and contributions of this project as well as

discuss future work on the subject.

6.2 ACHIEVEMENTS

This section will provide a summary of the key achievements of this project. To aid

understanding they have been divided into two groups, scientific and personal. As can be observed, they clearly demonstrate that the project’s aim and objectives have been met.

6.2.1 Scientific achievements

- Empirical and theoretical proof of the equivalence between Pearson’s correlation coefficient and Kuncheva’s consistency index
- Empirical evaluation of feature selection criteria’s stability
- Reintroduction of Pearson’s correlation coefficient as a measure in stability estimation
- Comparison of various similarity measures
- Critical analysis of feature selection criteria
- Theoretical analysis of the relationship between the criteria

6.2.2 Personal achievements

- Comprehension of over thirty research papers, books and materials
- Example reproduction of scientific experiments
- Use of a new programming language, R
- Web application development using R
- Practical use of agile methodologies
- Application of user interface design principles


6.3 EXPERIMENTAL ANALYSIS SUMMARY

The theoretical research and empirical evaluation of the various similarity measures has led to

the reintroduction of Pearson’s correlation coefficient as a possible measure to be employed. A

peculiar finding has been its equivalence to Kuncheva’s consistency index, which has been

proven both mathematically and experimentally. Consequently, this research relied on the

correlation coefficient in the stability calculations.

The results obtained from the experimental work on the stability of feature selection criteria

illustrate that impurity-based measures such as Gini index and Mutual information maximisation exhibit stable behaviour on datasets characterised by a small number of features and classes and a large number of samples, whilst ReliefF achieves a slightly lower score. ReliefF’s stability is highly influenced by the nearest-hits and nearest-misses, which could account for its behaviour. Mutual information feature selection appears to be quite unstable in most cases.

Moreover, the empirical evaluation of all four criteria demonstrated their poor stability when

applied to high-dimensional small-sample datasets, thus implying that no criterion appears to

be superior to the others. Therefore, the choice of a feature selection criterion should also be

based on the data characteristics to ensure maximal stability.

6.4 FUTURE WORK

This research has focused on evaluating the stability of the feature selection criteria without the

use of a classifier, and while it gives insights into their behaviour it does not provide any

evidence for the classification capabilities of the subsets. In the future the topic of optimizing

both stability and classification accuracy could be explored.

It should also be noted that one of the limitations of this study is that the stability measure used

for evaluation does not cater for the possibility that different highly correlated features could be

selected in each run. As discussed by Brown, the output from each run could contain features

with different array indices; however, due to their high correlation they are effectively identical

(Brown, Pocock, Zhao, & Luján, 2012). To address this problem Yu proposed a new stability

measure called ‘information consistency’ (Yu, Ding, & Loscalzo, 2008). Using this measure for

the stability estimation of feature selection criteria could be a future topic for research.

A further limitation of this study is that the stability of the criteria has not been examined on

datasets with more than four thousand features. Since feature selection is widely used on

genomic and proteomic datasets, which contain more features, in the future it will be useful to

evaluate the stability across such datasets as well.


Furthermore, another possible topic for future investigation could be the stability evaluation of

approaches such as Markov blankets for detecting weakly relevant but non-redundant features

(Koller & Sahami, 1995), the Joint mutual information criterion (Yang & Moody, 1999) and the

Hilbert-Schmidt Independence criterion (Okun, 2011), which appear to be widely employed.

Exploring the stability of more exotic feature selection algorithms, such as ones based on

cooperative Game Theory (Sun, et al., 2012) and filter approaches relying on genetic algorithms

(Lanzi, 1997), could provide new insights about their behaviour and applicability to high-dimensional problems.

Finally, this study does not address the problem of selecting a final set of features to be explored

further. Since the algorithms will produce slightly different preferences as an output, a method

for constructing a final set of size 𝑘, obtained from all sequences, is paramount both for domains

which employ feature selection alone and ones which utilize a classifier. In the future, the

development of a standardized approach to achieve this task will be of great benefit.

6.5 CHAPTER SUMMARY

This chapter provided the reader with the concluding remarks and summarized achievements.

It also explored the limitations of the study and suggested topics for future research.


BIBLIOGRAPHY

Apple. (2016, Apr 21). iOS Human Interface Guidelines: Design principles. Retrieved from Apple: https://developer.apple.com/library/ios/documentation/UserExperience/Conceptual/MobileHIG/Principles.html#//apple_ref/doc/uid/TP40006556-CH4-SW1

Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. Neural Networks, IEEE Transactions on 5.4, 537-550.

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Belmont, CA, USA: Wadsworth International Group.

Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 27-66.

Čehovin, L., & Bosnić, Z. (2010). Empirical evaluation of feature selection methods in classification. Intelligent data analysis 14.3, 265-281.

CERN. (2016, April 07). Processing What To Record. Retrieved from CERN: http://home.cern/about/computing/processing-what-record

Chen, K. (2016, April 9). Naive Bayes Classifier. Retrieved from COMP24111 Machine Learning: http://studentnet.cs.manchester.ac.uk/ugt/COMP24111/materials/slides/Naive-Bayes.pdf

Clarke, E. J., & Barton, B. A. (2000). Entropy and MDL discretization of continuous variables for Bayesian belief networks. International Journal of Intelligent Systems 15.1, 61-92.

Cohn, M. (2005). Agile estimating and planning. Addison-Wesley.

Das, S. (2001). Filters, wrappers and a boosting-based hybrid for feature selection. ICML. Vol. 1, 74-81.

Dijkstra, E. W. (1982). On the role of scientific thought. Selected writings on computing: a personal perspective, 60-66.

Dunne, K., Cunningham, P., & Azuaje, F. (2002). Solutions to instability problems with sequential wrapper-based approaches to feature selection. Journal of Machine Learning Research, 1-22.

Embury, S. M. (2012, Sept). What's in a Test? A Brief Introduction. Retrieved from COMP33711: Agile Software Engineering: https://moodle.cs.man.ac.uk/file.php/357/Handouts/AgileTesting/comp33711TestingBasics.pdf

Estévez, P. A. (2009). Normalized mutual information feature selection. IEEE Transactions on Neural Networks 20, no. 2, 189-201.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research 3, 1157-1182.

Hong, S. J. (1997). Use of contextual information for feature ranking and discretization. Knowledge and Data Engineering, IEEE Transactions on 9.5, 718-730.

Huang, S. H. (2015). Supervised feature selection: A tutorial. Artificial Intelligence Research, 22.

Kalousis, A., Prados, J., & Hilario, M. (2007). Stability of Feature Selection Algorithms: a study on high-dimensional spaces. Knowledge and information systems, 95-116.

Kira, K., & Rendel, L. A. (1992, July 12). A practical approach to feature selection. Proceedings of the ninth international workshop on Machine learning, 249-256.

Koller, D., & Sahami, M. (1995). Toward optimal feature selection. In 13th International Conference on Machine Learning, 284-292.

Kononenko, I. (1994, April 6). Estimating attributes: analysis and extensions of RELIEF. Machine Learning: ECML-94, 171-182.

Kononenko, I., Šimec, E., & Robnik-Šikonja, M. (1997). Overcoming the myopia of inductive learning algorithms with RELIEFF. Applied Intelligence 7.1, 39-55.

Kuncheva, L. I. (2007, Feb 12). A stability index for feature selection. Artificial intelligence and applications, 421-427.

Lanzi, P. L. (1997). Fast feature selection with genetic algorithms: a filter approach. IEEE International Conference on Evolutionary Computation. IEEE, 537-540.

Linden, G., Smith, B., & York, J. (2003). Amazon.com recommendations: Item-to-item collaborative filtering. Internet Computing, IEEE 7.1, 76-80.

Liu, H., & Motoda, H. (2007). Computational methods of feature selection. CRC Press.

Lustgarten, J. L., Gopalakrishnan, V., & Visweswaran, S. (2009, Jan). Measuring stability of feature selection in biomedical datasets. AMIA, 406-410.

Ng, P. C., & Kirkness, E. F. (2010). Whole genome sequencing. In Genetic Variation (pp. 215-226). Humana Press.

Nogueira, S., & Brown, G. (2016). Tutorial Stability Measures. Unpublished manuscript.

Okun, O. (2011). Feature Selection and Ensemble Methods for Bioinformatics. Hershey, PA: Medical Information Science Reference.

Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 623-656.

Shneiderman, B., & Plaisant, C. (2010). Designing the User Interface: Strategies for Effective Human-Computer Interaction. Reading, MA: Addison-Wesley Publ. Co.

Standards, T. (2016, 04 27). Living Glossary. Retrieved from Testing Standards: http://www.testingstandards.co.uk/living_glossary.htm

Sturges, H. A. (1926). The choice of a class interval. Journal of the American Statistical Association 21.153, 65-66.

Sun, X., Liu, Y., Li, J., Zhu, J., Chen, H., & Liu, X. (2012). Feature evaluation and selection with cooperative game theory. Pattern Recognition, 45(8), 2992-3002.

Theuwissen, M. (2016, Apr 04). R vs Python for Data Science. Retrieved from kdnuggets: http://www.kdnuggets.com/2015/05/r-vs-python-data-science.html

Ward, J. S., & Barker, A. (2013). Undefined by data: a survey of big data definitions. arXiv preprint arXiv:1309.5821.

Xing, E. P., Jordan, M., & Karp, R. M. (2001). Feature selection for high-dimensional genomic microarray data. ICML Vol. 1, 601-608.

Yang, H. H., & Moody, J. E. (1999). Data Visualization and Feature Selection: New Algorithms for Nongaussian Data. NIPS. Vol. 99, 687-693.

Yu, L., Ding, C., & Loscalzo, S. (2008). Stable feature selection via dense feature groups. Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 803-811.


Appendices

APPENDIX A

This appendix details the proof of the equivalence between Pearson’s correlation coefficient and

the Consistency index.

Consistency index:

$$S_c(s, s') = \frac{rn - k^2}{k(n - k)}$$

Pearson's correlation coefficient:

$$S_p(w, w') = \frac{\sum_i (w_i - \mu_w)(w_i' - \mu_{w'})}{\sqrt{\sum_i (w_i - \mu_w)^2 \sum_i (w_i' - \mu_{w'})^2}}$$

Notations:

I. $r = |s \cap s'|$
II. $n$ – total number of features
III. $k$ – number of selected features
IV. $s, s'$ – two sets containing the features selected
V. $w, w'$ – two sets containing the weights of all features

Assumptions:

I. The number of selected features in both sets is the same
II. The weights used in Pearson's correlation coefficient are in $\{0, 1\}$

Proof:

$\mu_w, \mu_{w'}$ are the sample means of the weighting sets $w, w'$ respectively. Since 1 denotes selected features and 0 non-selected ones, and using Assumption I, we can infer that:

$$\mu_w = \mu_{w'} = \frac{\text{number of selected features}}{\text{total number of features}} = \frac{k}{n}$$

Therefore, Pearson's correlation coefficient can be rewritten as follows:

$$S_p(w, w') = \frac{\sum_i \left(w_i - \frac{k}{n}\right)\left(w_i' - \frac{k}{n}\right)}{\sqrt{\sum_i \left(w_i - \frac{k}{n}\right)^2 \sum_i \left(w_i' - \frac{k}{n}\right)^2}}$$

Let:

$$A = \sum_i \left(w_i - \frac{k}{n}\right)\left(w_i' - \frac{k}{n}\right), \qquad B = \sqrt{\sum_i \left(w_i - \frac{k}{n}\right)^2 \sum_i \left(w_i' - \frac{k}{n}\right)^2}$$

For $A$, taking into account Assumption II, there are four possible cases for the values of $w_i, w_i'$:

I. $w_i = w_i' = 1$: the features selected in both sets; their number is $r$. Each such term equals $\left(1 - \frac{k}{n}\right)\left(1 - \frac{k}{n}\right) = 1 - \frac{2k}{n} + \frac{k^2}{n^2}$.

II. $w_i = 1, w_i' = 0$: the features selected in $s$ but not in $s'$; their number is $k - r$. Each such term equals $\left(1 - \frac{k}{n}\right)\left(0 - \frac{k}{n}\right) = -\frac{k}{n} + \frac{k^2}{n^2}$.

III. $w_i = 0, w_i' = 1$: the features selected in $s'$ but not in $s$; their number is $k - r$. Each such term equals $\left(0 - \frac{k}{n}\right)\left(1 - \frac{k}{n}\right) = -\frac{k}{n} + \frac{k^2}{n^2}$.

IV. $w_i = w_i' = 0$: the features selected in neither set; their number is $n - (r + (k - r) + (k - r)) = n - 2k + r$. Each such term equals $\left(0 - \frac{k}{n}\right)\left(0 - \frac{k}{n}\right) = \frac{k^2}{n^2}$.

Using these four cases we can rewrite $A$ as follows:

$$A = r\left(1 - \frac{2k}{n} + \frac{k^2}{n^2}\right) + 2(k - r)\left(-\frac{k}{n} + \frac{k^2}{n^2}\right) + (n - 2k + r)\frac{k^2}{n^2}$$

$$= r - \frac{2rk}{n} + \frac{rk^2}{n^2} - \frac{2k^2}{n} + \frac{2rk}{n} + \frac{2k^3}{n^2} - \frac{2rk^2}{n^2} + \frac{k^2}{n} - \frac{2k^3}{n^2} + \frac{rk^2}{n^2} = r - \frac{k^2}{n} = \frac{rn - k^2}{n}$$

For $B$, taking into account Assumption II, there are two possible cases for the values of $w_i$ (and, identically, for $w_i'$):

I. $w_i = 1$, for the $k$ selected features: $\left(w_i - \frac{k}{n}\right)^2 = \left(1 - \frac{k}{n}\right)^2 = 1 - \frac{2k}{n} + \frac{k^2}{n^2}$

II. $w_i = 0$, for the $n - k$ non-selected features: $\left(w_i - \frac{k}{n}\right)^2 = \left(0 - \frac{k}{n}\right)^2 = \frac{k^2}{n^2}$

Hence both sums under the square root are equal, and:

$$B = \sqrt{\left(k\left(1 - \frac{2k}{n} + \frac{k^2}{n^2}\right) + (n - k)\frac{k^2}{n^2}\right)^2} = k\left(1 - \frac{2k}{n} + \frac{k^2}{n^2}\right) + (n - k)\frac{k^2}{n^2}$$

$$= k - \frac{2k^2}{n} + \frac{k^3}{n^2} + \frac{k^2}{n} - \frac{k^3}{n^2} = \frac{k(n - k)}{n}$$

Using the above equations we can rewrite Pearson's correlation coefficient in the following manner:

$$S_p(w, w') = \frac{A}{B} = \frac{(rn - k^2)/n}{k(n - k)/n} = \frac{rn - k^2}{k(n - k)}$$

$$\therefore S_p(w, w') = \frac{\sum_i (w_i - \mu_w)(w_i' - \mu_{w'})}{\sqrt{\sum_i (w_i - \mu_w)^2 \sum_i (w_i' - \mu_{w'})^2}} = \frac{rn - k^2}{k(n - k)} = S_c(s, s')$$
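The equivalence can also be checked numerically in a few lines of R; for any pair of random equal-size subsets, Pearson's coefficient on the 0/1 weight vectors matches the Consistency index:

set.seed(1)
n <- 10; k <- 4
s1 <- sample(n, k); s2 <- sample(n, k)      # two random subsets of size k
w1 <- replace(numeric(n), s1, 1)            # 0/1 weight vectors
w2 <- replace(numeric(n), s2, 1)
r <- length(intersect(s1, s2))

cor(w1, w2)                                 # Pearson's correlation coefficient
(r * n - k^2) / (k * (n - k))               # Consistency index: identical value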


APPENDIX B

This appendix explores the relationship between Relief and Gini index.

Gini index:

$$Gini(X) = \sum_j P(X_j) \sum_c P(Y_c | X_j)^2 - \sum_c P(Y_c)^2$$

Relief weight update:

$$W[X] = P(\text{different value of } X \mid \text{nearest instance from different class}) - P(\text{different value of } X \mid \text{nearest instance from same class})$$

As noted by Kononenko, if the nearest-instance condition is removed this can be rewritten in the following manner (Kononenko, Šimec, & Robnik-Šikonja, 1997):

$$W[X] = P(\text{different value of } X \mid \text{different class}) - P(\text{different value of } X \mid \text{same class})$$

Notations:

I. $P(X_j)$ – prior probability of feature $X$ taking value $X_j$
II. $P(Y_c | X_j)$ – probability that a random sample from the dataset, whose feature $X$ has value $X_j$, belongs to class $Y_c$
III. $P(Y_c)$ – prior probability that a random sample belongs to class $Y_c$
IV. $P_{sX|dc}$ – probability that two instances have the same value of $X$ given that they belong to different classes
V. $P_{sX|sc}$ – probability that two instances have the same value of $X$ given that they belong to the same class
VI. $P_{dX|dc}$ – probability that two instances have different values of $X$ given that they belong to different classes
VII. $P_{dX|sc}$ – probability that two instances have different values of $X$ given that they belong to the same class

Using the above notations the Relief update rule becomes:

$$W[X] = P_{dX|dc} - P_{dX|sc} = (1 - P_{sX|dc}) - (1 - P_{sX|sc}) = P_{sX|sc} - P_{sX|dc}$$

$$= \frac{P_{sc|sX} P_{sX}}{P_{sc}} - \frac{P_{dc|sX} P_{sX}}{P_{dc}} = \frac{P_{sc|sX} P_{sX}}{P_{sc}} - \frac{(1 - P_{sc|sX}) P_{sX}}{1 - P_{sc}}$$

$$= \frac{P_{sc|sX} P_{sX} (1 - P_{sc}) - (1 - P_{sc|sX}) P_{sX} P_{sc}}{P_{sc}(1 - P_{sc})} = \frac{P_{sc|sX} P_{sX} - P_{sX} P_{sc}}{P_{sc}(1 - P_{sc})} = \frac{P_{sX}(P_{sc|sX} - P_{sc})}{P_{sc}(1 - P_{sc})}$$

Since the prior class probability does not depend on the attribute $X$, the denominator can be treated as a constant and the weight update rule takes the following form:

$$const = \frac{1}{P_{sc}(1 - P_{sc})}, \qquad W[X] = const \cdot P_{sX}(P_{sc|sX} - P_{sc})$$

The probability $P_{sc}$ of two instances being in the same class is simply, summed over all classes, the probability that one instance belongs to that class times the probability that the second one belongs to the same class:

$$P_{sc} = \sum_c P(Y_c)^2$$

Similarly, the prior probability that two instances have the same value of $X$ can be expressed as:

$$P_{sX} = \sum_j P(X_j)^2$$

Using Bayes' rule we obtain:

$$P_{sc|sX} = \frac{P_{sc \cap sX}}{P_{sX}} = \frac{\sum_j P(X_j)^2 \sum_c P(Y_c | X_j)^2}{\sum_j P(X_j)^2} = \sum_j \left( \frac{P(X_j)^2}{\sum_j P(X_j)^2} \sum_c P(Y_c | X_j)^2 \right)$$

Using this and the above two equations, the update rule can be rewritten as follows:

$$W[X] = const \cdot P_{sX}(P_{sc|sX} - P_{sc}) = const \sum_j P(X_j)^2 \left( \sum_j \left( \frac{P(X_j)^2}{\sum_j P(X_j)^2} \sum_c P(Y_c | X_j)^2 \right) - \sum_c P(Y_c)^2 \right)$$

Let:

$$Gini'(X) = \sum_j \left( \frac{P(X_j)^2}{\sum_j P(X_j)^2} \sum_c P(Y_c | X_j)^2 \right) - \sum_c P(Y_c)^2$$

$$\therefore W[X] = const \sum_j P(X_j)^2 \; Gini'(X)$$

$$Gini(X) = \sum_j P(X_j) \sum_c P(Y_c | X_j)^2 - \sum_c P(Y_c)^2$$

The similarity between $Gini'(X)$ and $Gini(X)$ is evident, with the only difference being the $\frac{P(X_j)^2}{\sum_j P(X_j)^2}$ term in $Gini'(X)$, where the original $Gini(X)$ has $\frac{P(X_j)}{\sum_j P(X_j)}$. Since:

$$\sum_j P(X_j) = 1 \quad \therefore \quad P(X_j) = \frac{P(X_j)}{\sum_j P(X_j)}$$

As noted by Kononenko, the term used for the weight update in the Relief algorithm serves as a normalisation for multivalued features. This effectively addresses the problem of multivalued feature overestimation, which affects impurity functions such as Mutual information and Gini index (Kononenko, Šimec, & Robnik-Šikonja, 1997).


APPENDIX C

This appendix presents the pseudocode used for implementing stability estimation and the

feature selection criteria, crucial to this project.

Algorithm 1: measure Stability

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

Input: 𝐴 – a 2d array containing all sets of feature preferences

𝑚 – the number of sets

𝑡𝑦𝑝𝑒 – the type of similarity measure to be used

𝑛 – total number of features in the original dataset

Output: 𝑠𝑡𝑎𝑏𝑖𝑙𝑖𝑡𝑦 – the stability score for those sets 𝑠𝑡𝑎𝑏𝑖𝑙𝑖𝑡𝑦 = 0 for 𝑖 = 1 to 𝑚

if (𝑡𝑦𝑝𝑒 = 𝑝)

then 𝑤𝑒𝑖𝑔ℎ𝑡𝑠𝑆𝑒𝑡𝐼 = 𝑜𝑏𝑡𝑎𝑖𝑛𝑊𝑒𝑖𝑔ℎ𝑡𝑠𝐹𝑟𝑜𝑚𝑆𝑒𝑡(𝐴[𝑖, ], 𝑛)

end if

for 𝑗 = 𝑖 + 1 to 𝑚

if (𝑡𝑦𝑝𝑒 = 𝑝)

then 𝑤𝑒𝑖𝑔ℎ𝑡𝑠𝑆𝑒𝑡𝐽 = 𝑜𝑏𝑡𝑎𝑖𝑛𝑊𝑒𝑖𝑔ℎ𝑡𝑠𝐹𝑟𝑜𝑚𝑆𝑒𝑡(𝐴[𝑗, ], 𝑛)

𝑠𝑡𝑎𝑏𝑖𝑙𝑖𝑡𝑦+= 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒𝑃𝑒𝑎𝑟𝑠𝑜𝑛𝑠𝐶𝑜𝑟𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝐶𝑜𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡(𝑤𝑒𝑖𝑔ℎ𝑡𝑠𝑆𝑒𝑡𝐼 , 𝑤𝑒𝑖𝑔ℎ𝑡𝑠𝑆𝑒𝑡𝐽)

else if (𝑡𝑦𝑝𝑒 = 𝑡)

then 𝑠𝑡𝑎𝑏𝑖𝑙𝑖𝑡𝑦+= 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒𝑇𝑎𝑛𝑖𝑚𝑜𝑡𝑜𝐷𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝐴[𝑖, ], 𝐴[𝑗, ])

else if (𝑡𝑦𝑝𝑒 = ℎ)

then 𝑠𝑡𝑎𝑏𝑖𝑙𝑖𝑡𝑦+= 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒𝐻𝑎𝑚𝑚𝑖𝑛𝑔𝐷𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝐴[𝑖, ], 𝐴[𝑗, ])

else if (𝑡𝑦𝑝𝑒 = 𝑐)

then 𝑠𝑡𝑎𝑏𝑖𝑙𝑖𝑡𝑦+= 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒𝐶𝑜𝑛𝑠𝑖𝑠𝑡𝑒𝑛𝑐𝑦𝐼𝑛𝑑𝑒𝑥(𝐴[𝑖, ], 𝐴[𝑗, ], 𝑛)

end if

end for

end for

𝑠𝑡𝑎𝑏𝑖𝑙𝑖𝑡𝑦 =𝑠𝑡𝑎𝑏𝑖𝑙𝑖𝑡𝑦

𝑚(𝑚 − 1)/2

return 𝑠𝑡𝑎𝑏𝑖𝑙𝑖𝑡𝑦

Page 58: Evaluation of Mutual information versus Gini index for ...studentnet.cs.manchester.ac.uk/resources/library/3... · Research and implement the following feature selection criteria

51

Algorithm 2: Mutual information

Input: condProb – a 2d array containing the conditional probabilities between the variables
       priorProbX – a 1d array containing all the prior probabilities for X
       priorProbY – a 1d array containing all the prior probabilities for Y
Output: mi – the mutual information between X and Y

mi = 0
for valueX = 1 to length(priorProbX)
    for valueY = 1 to length(priorProbY)
        mi += condProb[valueX, valueY] * priorProbY[valueY] * log2(condProb[valueX, valueY] / priorProbX[valueX])
    end for
end for
return mi
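
The following direct R rendering of Algorithm 2 is given for illustration, and is not necessarily the project's exact implementation; it assumes condProb[x, y] holds P(X = x | Y = y) and skips zero-probability terms, whose contribution is taken to be zero.

mutualInformation <- function(condProb, priorProbX, priorProbY) {
  mi <- 0
  for (x in seq_along(priorProbX)) {
    for (y in seq_along(priorProbY)) {
      if (condProb[x, y] > 0) {  # 0 * log(0) is treated as 0
        mi <- mi + condProb[x, y] * priorProbY[y] *
          log2(condProb[x, y] / priorProbX[x])
      }
    }
  }
  mi
}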

Algorithm 3: Mutual information maximisation criterion

Input: data – a 2d array containing all m samples and n features
       dataClass – the column number of the data class
Output: preferences – the features, ordered by the algorithm's preference

featureScores = [0 … 0]
priorClass = priorProbVariable(data, dataClass)
priorFeatures = priorProbAll(data[, −dataClass])
for f = 1 to n
    condProb = condProbBetweenVariables(data, f, dataClass)
    featureScores[f] = mutualInformation(condProb, priorClass, priorFeatures[f, ])
end for
preferences = order(featureScores)
return preferences


Algorithm 4: Mutual information feature selection criterion

Input: data – a 2d array containing all m samples and n features
       dataClass – the column number of the data class
Output: preferences – the features, ordered by the algorithm's preference

miClass = [0 … 0]
miFeatures = n × n matrix of zeros
beta = 1
priorClass = priorProbVariable(data, dataClass)
priorFeatures = priorProbAll(data[, −dataClass])
for f = 1 to n
    condClass = condProbBetweenVariables(data, f, dataClass)
    miClass[f] = mutualInformation(condClass, priorClass, priorFeatures[f, ])
end for
for i = 1 to n
    for j = 1 to n
        condProb = condProbBetweenVariables(data, i, j)
        miFeatures[i, j] = mutualInformation(condProb, priorFeatures[i, ], priorFeatures[j, ])
    end for
end for
selectionCriteria = miClass
for featureToSelect = 1 to n
    preferences[featureToSelect] = selectNumberOfFeatures(selectionCriteria, 1)
    miSelected = miFeatures[, preferences]
    rowSum = rowSums(miSelected)
    selectionCriteria = [0 … 0]
    for feature = 1 to n
        selectionCriteria[feature] = miClass[feature] − beta * rowSum[feature]
    end for
end for
return preferences


Algorithm 5: Gini index

Input: priorProbX – a 1d array containing all the prior probabilities for X
       priorProbY – a 1d array containing all the prior probabilities for Y
       condProb – a 2d array containing the conditional probabilities between the variables
Output: gi – the Gini index between the variables

gi = 0
for y = 1 to length(priorProbY)
    xCondSquared = 0
    for x = 1 to length(priorProbX)
        xCondSquared += condProb[x, y]^2
    end for
    gi += priorProbY[y] * xCondSquared
end for
xPriorSquared = 0
for x = 1 to length(priorProbX)
    xPriorSquared += priorProbX[x]^2
end for
gi −= xPriorSquared
return gi

Algorithm 6: Gini index criterion

Input: data – a 2d array containing all m samples and n features
       dataClass – the column number of the data class
Output: preferences – the features, ordered by the algorithm's preference

featureScores = [0 … 0]
priorClass = priorProbVariable(data, dataClass)
priorFeatures = priorProbAll(data[, −dataClass])
for f = 1 to n
    condProb = condProbBetweenVariables(data, dataClass, f)
    featureScores[f] = giniIndex(priorClass, priorFeatures[f, ], condProb)
end for
preferences = order(featureScores)
return preferences


APPENDIX D

This appendix explores the 'zero conditional probability problem', which is often encountered in probability estimation and arises when a dataset contains no example of class $Y_c$ with value $X_i$ for an attribute $X$. This leads to $P(X_i|Y_c) = 0$, which yields incorrect results. The problem is particularly common for datasets consisting of a small number of samples but characterised by a high number of classes and attribute values. To mitigate it, an approach discussed by Chen, which relies on virtual examples in the conditional probability estimation, has been implemented (Chen, 2016).

$$P(X_i|Y_c) = \frac{n_c + mp}{n + m} \qquad (19)$$

where:
$n_c$ – number of training examples for which $X = X_i$ and $Y = Y_c$
$m$ – number of virtual examples, usually $m \geq 1$
$p$ – prior estimate, usually $p = \frac{1}{t}$, where $t$ is the number of possible values for $X$
$n$ – number of training examples for which $Y = Y_c$
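
The following minimal R sketch of equation (19) is illustrative; the function name mEstimate and the example counts are hypothetical rather than taken from the project's code.

# m-estimate of P(Xi | Yc): nc and n are the counts defined above,
# t is the number of possible values of X, m the number of virtual examples.
mEstimate <- function(nc, n, t, m = 1) {
  p <- 1 / t                  # uniform prior estimate over the t values of X
  (nc + m * p) / (n + m)
}

# Example: a value unseen within class Yc (nc = 0) no longer yields zero:
mEstimate(nc = 0, n = 5, t = 3)   # (0 + 1/3) / (5 + 1) ≈ 0.056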


APPENDIX E

This appendix reproduces the calculations for example one from Kuncheva's paper.

Let 𝑠 and 𝑠′ be two hypothetical sequences of features produced by two runs of a feature

selection algorithm on a dataset with 𝑛 = 10 features.

𝑠 = {𝑓9, 𝑓7, 𝑓2, 𝑓1, 𝑓3, 𝑓10, 𝑓8, 𝑓4, 𝑓5, 𝑓6}

𝑠′ = {𝑓3, 𝑓7, 𝑓9, 𝑓10, 𝑓2, 𝑓4, 𝑓8, 𝑓6, 𝑓1, 𝑓5}

Let 𝑘 = 3 be the subset size to be considered.

𝑠(3) = {𝑓9, 𝑓7, 𝑓2}

𝑠′(3) = {𝑓3, 𝑓7, 𝑓9}

Using this information we can compute the similarities according to the different measures.

Tanimoto distance:

The cardinality of the intersection of the two subsets is:

$$|s(3) \cap s'(3)| = |\{f_7, f_9\}| = 2$$

The cardinality of the union is:

$$|s(3) \cup s'(3)| = |\{f_2, f_3, f_7, f_9\}| = 4$$

$$\therefore\; S_t(s, s') = \frac{|s \cap s'|}{|s \cup s'|} = \frac{2}{4} = 0.5$$

Hamming distance:

The cardinality of the difference between $s$ and $s'$ is:

$$|s(3) \setminus s'(3)| = |\{f_2\}| = 1$$

The cardinality of the difference between $s'$ and $s$ is:

$$|s'(3) \setminus s(3)| = |\{f_3\}| = 1$$

$$\therefore\; S_h(s, s') = 1 - \frac{|s \setminus s'| + |s' \setminus s|}{n} = 1 - \frac{1 + 1}{10} = 0.8$$


Consistency index:

The cardinality of the intersection, $r$, of those two subsets is:

$$r = |s(3) \cap s'(3)| = |\{f_7, f_9\}| = 2$$

$$S_c(s, s') = \frac{rn - k^2}{k(n - k)} = \frac{2 \cdot 10 - 3^2}{3(10 - 3)} = \frac{11}{21} \approx 0.5238$$

Pearson's correlation coefficient:

Considering this example, by assigning weight 1 to all features in the selected subset and 0 to all features which have not been chosen, we obtain the following weights for subsets $s$ and $s'$:

$$w = \{0,1,0,0,0,0,1,0,1,0\}$$
$$w' = \{0,0,1,0,0,0,1,0,1,0\}$$

Consequently, the means of $w$ and $w'$ can be computed:

$$\mu_w = \mu_{w'} = \frac{3}{10}$$

$$S_p(w, w') = \frac{\sum_i (w_i - \mu_w)(w_i' - \mu_{w'})}{\sqrt{\sum_i (w_i - \mu_w)^2 \sum_i (w_i' - \mu_{w'})^2}} \approx 0.5238$$
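
The four values above can be verified in a few lines of R; this snippet is illustrative and independent of the project's test suite.

n <- 10; k <- 3
s      <- c(9, 7, 2)                  # indices of the features in s(3)
sPrime <- c(3, 7, 9)                  # indices of the features in s'(3)

tanimoto    <- length(intersect(s, sPrime)) / length(union(s, sPrime))            # 0.5
hamming     <- 1 - (length(setdiff(s, sPrime)) + length(setdiff(sPrime, s))) / n  # 0.8
r           <- length(intersect(s, sPrime))
consistency <- (r * n - k^2) / (k * (n - k))                                      # 11/21

w      <- replace(numeric(n), s, 1)       # weight 1 for selected features, else 0
wPrime <- replace(numeric(n), sPrime, 1)
pearson <- cor(w, wPrime)                 # ~ 0.5238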


APPENDIX F

The following appendix contains the output of the test suite for this project.

RUNIT TEST PROTOCOL -- Sun May 01 14:19:12 2016
***********************************************
Number of test functions: 24
Number of errors: 0
Number of failures: 0

1 Test Suite :
final-year-project - 24 test functions, 0 errors, 0 failures

Details
***************************
Test Suite: final-year-project
Test function regexp: ^test.+
Test file regexp: ^\d+\.R
Involved directory: tests
---------------------------
Test file: tests/1.R
test.calculatePriorProbability: (2 checks) ... OK (0.03 seconds)
---------------------------
Test file: tests/10.R
test.estimateHammingDistance: (7 checks) ... OK (0 seconds)
---------------------------
Test file: tests/11.R
test.estimateNumBinSturgesRule: (5 checks) ... OK (0 seconds)
---------------------------
Test file: tests/12.R
test.estimateNumBinScottsRule: (6 checks) ... OK (0 seconds)
---------------------------
Test file: tests/13.R
test.calculateGiniIndexFeature: (4 checks) ... OK (0.01 seconds)
---------------------------
Test file: tests/14.R
test.differenceFuntion: (3 checks) ... OK (0 seconds)
---------------------------
Test file: tests/15.R
test.calculateMinMaxDiffValues: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/16.R
test.mutualInformationFeatureSelection: (4 checks) ... OK (0.02 seconds)
---------------------------
Test file: tests/17.R
test.mutualInformationMaximisation: (4 checks) ... OK (0.01 seconds)
---------------------------
Test file: tests/18.R
test.calculateMutualInformationSingleFeature: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/19.R
test.mutualInfoCriterionBetweenSelectedFeatures: (8 checks) ... OK (0 seconds)
---------------------------


Test file: tests/2.R
test.calculateNumFeatureValues: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/20.R
test.calculateGiniIndexSingleFeature: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/21.R
test.featureSelectionBasedOnGiniIndex: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/22.R
test.calculatePriorProbabilityFeatures: (16 checks) ... OK (0 seconds)
---------------------------
Test file: tests/23.R
test.calculateCondProbBetweenVariables: (30 checks) ... OK (0 seconds)
---------------------------
Test file: tests/24.R
test.findNearestImproved: (8 checks) ... OK (0.01 seconds)
---------------------------
Test file: tests/3.R
test.calculateCondProbWithM: (20 checks) ... OK (0 seconds)
---------------------------
Test file: tests/4.R
test.calculateMutualInformation: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/5.R
test.selectNumberOfFeatures: (3 checks) ... OK (0 seconds)
---------------------------
Test file: tests/6.R
test.estimateTanimotoDistance: (7 checks) ... OK (0 seconds)
---------------------------
Test file: tests/7.R
test.estimateConsistencyIndex: (10 checks) ... OK (0.03 seconds)
---------------------------
Test file: tests/8.R
test.estimatePearsonsCorrelationCoefficient: (8 checks) ... OK (0 seconds)
---------------------------
Test file: tests/9.R
test.weightsFromSet: (22 checks) ... OK (0 seconds)


APPENDIX G

Suppose there is a dataset containing $N$ items. 'Bootstrapping' samples $N$ items with replacement from the original dataset. This method can be used to generate a new dataset with similar characteristics to the original one, and can therefore be used in evaluating the stability of different feature selection criteria; a short R sketch of the sampling procedure is given after the figure below.

Figure 23 illustrates this approach. As can be observed, the first bootstrap set contains samples 4 and 7 more than once, whilst samples 5 and 8 are missing. The second one contains a duplicate of sample 8, with sample 9 missing.

[Figure: the original dataset consists of samples 4–9; Bootstrap 1 contains samples {4, 4, 6, 7, 7, 9} and Bootstrap 2 contains samples {4, 5, 6, 7, 8, 8}.]

Figure 23: Sampling with replacement
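
A minimal R sketch of this sampling procedure is given below; generateBootstrap is a hypothetical helper for illustration, not the project's code.

# Draw one bootstrap set: N rows sampled with replacement from the dataset.
generateBootstrap <- function(data) {
  n <- nrow(data)
  data[sample(seq_len(n), size = n, replace = TRUE), , drop = FALSE]
}

# Example: two bootstrap sets from a toy dataset of samples 4 to 9.
toy <- data.frame(id = 4:9)
bootstrap1 <- generateBootstrap(toy)
bootstrap2 <- generateBootstrap(toy)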