feature selection using ant colony optimization: applications in health … · 2012-02-14 · ant...

1

FEATURE SELECTION USINGANT COLONY OPTIMIZATION:

APPLICATIONS IN HEALTH CARE

João M. C. Sousa1

[email protected]

S. M. Vieira1, S. N. Finkelstein2,3, A. S. Fialho1,2,F. Cismondi1,2, S. R. Reti3 and M. D. Howell3

1 Technical University of Lisbon, Instituto Superior Técnico, Dept. of Mechanical Engineering,CIS/IDMEC – LAETA, Av. Rovisco Pais, 1049-001 Lisbon, Portugal

2 Massachusetts Institute of Technology, Engineering Systems Division, 77 MassachusettsAvenue, 02139 Cambridge, MA, USA

3 Division of Clinical Informatics, Department of Medicine, Beth Israel Deaconess MedicalCentre, Harvard Medical School, Boston, MA, USA

Motivation

Knowledge discovery process

20 September 2010 Eindhoven, the Netherlands 2

Modeling

DataTarget data

Preprocesseddata

Reduceddata

Patterns

Knowledge

Data acquisition

Preprocessing

Feature selection

Interpretation

From G. Piatetsky-Shapiro U. Fayyad and P. Smyth. From data mining to knowledge discovery in databases.Artificial Intelligence Magazine, 17(3):37-54, 1996.

Outline

Motivation

ModelingNeural networksFuzzy sets and systemsFuzzy modeling

Feature selection

Ant colony optimization

Ant feature selection

Application: predicting outcomes of sepsis patients


NEURAL NETWORKS

Eindhoven, the Netherlands 5

Artificial neuron

xi: i-th input of the neuron

wi: synaptic strength (weight) for xi

y = ( wixi): output signal

w2

wn

x1

x2

xn

...

yNeuron


Types of neurons

McCulloch and Pits (1943)Threshold :

1

n

i ii

y sign w x

Other types of activation functions (net = wixi)

1

0

1

step 1, if 00, if 0

nety

netsigmoid 1

1 netye

lineary net

mailto:jmsousa:@ist.utl.pt

2


Multi-Layer Perceptron (MLP)

Can learn functions that are not linearly separable.

Out

puts

igna

lsEindhoven, the Netherlands 8

Most common MLP

Output layer

Hidden layer

x1

xi

xn

......

1

2

j

m

wijh

1

k

l

w11h

wnmh

w11o

y1

yk

yl

......

......

wjko

b1h

bmh

b1o

blowml

o


Most common MLP

Output of neurons in the hidden-layer hj:

1 0

0tanh

n nh h hj ij i j ij ii i

n hij ii

h w x b w x

w x

Output of neurons in the output-layer yk:

1 0

0

m mo o ok jk j j jk kj j

m ojk jj

y w h b w h

w h

sigmoid

linear


Learning in NN

Biological neural networks:Synaptic connections amongst neurons whichsimultaneously exhibit high activity are strengthned.

Artificial neural networks:Mathematical approximation of biological learning.Error minimization (nonlinear optimization problem).

Error backpropagation (first-order gradient)Newton methods (second-order gradient)Levenberg-Marquardt (second-order gradient)Conjugate gradients...


Supervised learning

Training data:

1 2

TT T TNY y y y

1 2

TT T TNX x x x

x y

e


Bibliography

S. Haykin. Neural Networks - A ComprehensiveFoundation. Prentice Hall, 1999.

J.-S. Jang, C.-T. Sun and E. Mizutani. Neuro-Fuzzy andSoft Computing: A Computational Approach toLearning and Machine Intelligence. Prentice Hall, NewJersey, 1997.

Andries P. Engelbrecht. Computational Intelligence: AnIntroduction. John Wiley, Chichester, 2002

Michael Negnevitsky. Artificial Intelligence: A Guide toIntelligent Systems. Addison-Wesley, PearsonEducation, 2002.

3

FUZZY SETS

Basic Concepts


Introduction

How to simplify very complex systems?

Allow some degree of uncertainty in theirdescription!

How to deal mathematically with uncertainty?

Using probabilistic theory (stochastic).

Using the theory of fuzzy sets (non-stochastic).Proposed in 1965 by Lotfi Zadeh (Fuzzy Sets,Information Control, 8, pp. 338-353).

Imprecision or vagueness in natural language doesnot imply a loss of accuracy or meaningfulness!


Classical set

Example: set of old people A = {age | age 70}

A

50 600

70 80 90 100

0.5

1

16

Logic propositions

“Nick is old” ... true or false

Nick’s age:ageNick = 70, A(70) = 1 (true)

ageNick = 69.9, A(69.9) = 0 (false)A

50 600

70 80 90 100

0.5

1


Fuzzy set

Graded membership, element belongs to a set to acertain degree.

A

50 600

70 80 90 100

0.5

1

mem

bers

hip

grad

e

18

Fuzzy proposition

“Nick is old”... degree of truthageNick = 70, A(70) = 0.5

ageNick = 69.9, A(69.9) = 0.49

ageNick = 90, A(90) = 1A

50 600

70 80 90 100

0.5

1

mem

bers

hip

grad

e

4


Typical linguistic values

20 400

60 80 100

1

mem

bers

hip

grad

e young middle age old

20

Linguistic variable

x is age= {young, middle age, old}

20 400

60 80 100

1

mem

bers

hip

grad

e

young middle age oldsemantic rules MX


Fuzzy complement

(x) = 1 – A(x)

1

x

A


Intersection of fuzzy sets

A B(x) = min( A(x), B(x))

x

BA


Union of fuzzy sets

A B(x) = max( A(x), B(x))

x

BAFUZZY SYSTEMS

5


Linguistic variable

{x, , , MX}

Where:x – name of the linguistic variable

– linguistic values (terms)

– Universe of discourse

MX – semantic rule that associates each linguisticvalue to a membership function.


Fuzzy if-then rules

Fuzzy propositionsx is A, y is B

Linguistic (Mamdani) fuzzy if-then rule:

If x is A then y is BAntecedent: x is A

Consequent: y is B

Rule “If x is A then y is B” is represented by afuzzy relation defined on X Y.


Examples

If the road is slippery then brake softly.

If error is Negative big and e is Positive big then u isNegative small.If a tomato is red then the tomato is ripe.

If the temperature is very high then reduce the heat alot.

If the valve is closed then the pressure is high.


Linguistic (Mamdani) model

Decomposing using conjunctive forms:

: is is , 1,2, ,k k kR A B k KIf x then y

1 1 2 2

1 1 2 2

: is is is is is is

k k k kn n

k k kp p

R x A x A x Ay B y B y B

If and and andthen and and and

Degree of fulfillment of antecedents:

1 21 2= ( ) ( ) ( ), 1,2, ,k k k

n

knA A Ax x x k K


Takagi-Sugeno fuzzy model

Affine linear form:

: is ( ), 1,2, ,k k k kR A y f k KIf x then x

: isTk k k k kR A y a bIf x then x

Degree of fulfillment k defined as in linguistic models

Model output given by the weighted fuzzy-mean:

11

1 1

( )KK k k T kk kkk

K Kj jj j

a byy

x


Bibliography

G. Klir and T. Folger. Fuzzy Sets Uncertainty and Information.Prentice Hall, 1988.J.-S. Jang, C.-T. Sun and E. Mizutani. Neuro-Fuzzy and SoftComputing: A Computational Approach to Learning andMachine Intelligence. Prentice Hall, New Jersey, 1997.Andries P. Engelbrecht. Computational Intelligence: AnIntroduction. John Wiley, Chichester, 2002.J.M.C. Sousa and U. Kaymak. Fuzzy Decision Making in Modelingand Control. World Scientific Series in Robotics and IntelligentSystems, vol. 27. World Scientific Pub. Co., Singapore, Dec. 2002Michael Negnevitsky. Artificial Intelligence: A Guide toIntelligent Systems. Addison-Wesley, Pearson Education, 2002.R. Babuska. Fuzzy Modeling for Control. Kluwer AcademicPublishers, 1998.

6

FUZZY MODELING


Kernel-based modeling

Fuzzy systems

Radial basis function networksSupport vector machines

Multi-layer perceptron

...

Fuzzy systems can be interpretable!

Fuzzy sets can close the gap between symbolicprocessing and numerical computations.


Fuzzy system parameters

Parameters of antecedent membership functions(shape, location, etc.)

Parameters of consequent membership functions(Mamdani systems)

Parameters of consequent functions (Takagi-Sugenosystems)

Aggregation of antecedent membershipsImplication/reasoning

Defuzzification function (Mamdani systems)


Building fuzzy models

Data-driven approachnonlinear mapping

extract from input-output data:rules

antecedents (membership functions)

consequents (membership or crisp functions)

35

Fuzzy c-means

0

0.5

1

0

0.5

10

0.5

1

XY

MF

0

0.5

1

0

0.5

10

0.5

1

XY

MF

0

0.5

1

0

0.5

10

0.5

1

XY

MF

Assumes partition matrix is fixed


Modeling based on fuzzy clustering

1. Collect the data

2. Select model structure (Mamdani, Takagi-Sugeno,…)3. Select number of clusters and clustering algorithm

4. Cluster the data

5. Obtain antecedent membership functions (MF) fromclusters. Obtain consequents (MF or parameters)

6. Determine a fuzzy rule for each cluster7. Simplify the model, if necessary

8. Validate the model

7


Building fuzzy models

StructureInput and output variables. For dynamic systems alsothe representation of the dynamics.Number of membership functions per variable, type ofmembership functions, number of rules.

ParametersAntecedent membership functionsConsequent parameters


Linguistic models from clustering

Use fuzzy c-means algorithm.

Cluster data in input–output product space.Membership functions obtained by:

projection onto variables,membership function parameterization.

One rule per cluster

: is is , 1,2, ,k k kR A B k KIf x then y

Figure reproduced with permission of Prof. Uzay Kaymak39

Example of linguistic model

If income is Low then tax is Low

If income is High then tax is High


Selecting number of antecedents

A priori knowledge (experts, dynamics, etc.)

Regularity criterion – based on cross-validation.Split training set randomly into two parts (A and B)

Minimize regularity criterion:

Variables selected incrementally until regularitycriterion increases.

2 21 1 12

1 1

ˆ ˆA B

A B

k kA AB B BAi i i ik k

i iRC y y y y


Selecting number of antecedents

Feature selection (FS)Principal Component Analysis (PCA)

Curvilinear Component Analysis (CCA)

(...)Tree search methods

Bottom-upTop-down

FS using genetic algorithms

FS using ant colony optimization

FEATURE SELECTION

8

Introduction

Many applications have hundreds to tens of thousandsof variables/features

Many are irrelevant and/or redundant.

Curse of dimensionality.

20 September 2010 Eindhoven, the Netherlands 43 Eindhoven, the Netherlands 44

Feature selection

What is feature selection?

Remove features (inputs) X(i) to improve (orleast degrade) prediction of outputs Y.

Advantages:Feature selection selects most relevant featuresCollect/process less features and dataLess complex models run fasterModels are easier to understand, verify and explain


Feature selection algorithms

FiltersBased on general characteristics of data to be evaluated.No model is involved.

WrappersUses model performance to evaluate feature subsets.Train one classification model for each feature subset.

Hybrid methodsDo not retrain the model at every step.Search feature selection space and model parameterspace simultaneously.

Tree search – bottom-up


ANT COLONYOPTIMIZATION


Biologically inspired algorithms

Artificial ant colonies: maybe the most used methodfrom the artificial life algorithms.Introduced by Marco Dorigo (1992), has been wellreceived by academic world and it is starting to beused in industrial applications.Applications: Traveling Salesman Problem, VehicleRouting, Quadratic Assignment Problem, InternetRouting, Logistic Scheduling, clustering and datamining problems.

9


Ant Colony Optimization

Artificial Life algorithms: swarm, ants, wasps, bees

Ant Colony Optimization is one of the most usedmethod of the Artificial Life algorithms.

Applications: Travelling salesman problem, vehiclerouting, quadratic assignment problem, internetrouting, logistics scheduling.

There are also some applications of ACO in clusteringand data mining problems, including featureselection.


What is special about ants?

Ants can perform complex tasks:nest building, food storage

garbage collection, war

foraging (to wander in search of food)

There is no management in an ant colonycollective intelligence

They communicate using:pheromones (chemical substance), sound, touch

Curiosities:Ant colonies exist for more than 100 million years

Myrmercologists estimate that there are around 20 000 speciesof ants


The foraging behaviour of ants

How can almost blind animals manage to learn the shortest route pathsfrom their nests to the food source and back?

a) - Ants follow path between theNest and the Food Source

b) - Ants go around the obstacle following oneof two different paths with equalprobability

c) - On the shorter path, morepheromones are laid down

Fotos: http://iridia.ulb.ac.be/~mdorigo/ACO/RealAnts.html

d) – At the end, all ants follow the shortestpath.

52

Artificial ants

Artificial ants move in graphsnodes / arcsenvironment is discrete

As real ants:choose paths based on pheromoneconcentrationdeposit pheromones on pathsEnvironment updates pheromones

Extra abilities of artificial ants:prior knowledge (heuristic )

memory (feasible neighbourhood N


Mathematical framework

Choose nodeInitializationSet ij = 0For l =1: Nmax

Build a complete tourFor i = 1 to n

For k = 1 to mChoose node

Update NApply local heuristicend

endAnalyze solutions

For k = 1 to mCompute fk

endUpdate pheromones

end

Update feasible neighbourhood

Pheromone update

, if ( , )0, otherwise

kij

Q f i j S

\N N j

kijll )1()()1(

,

0, otherwise

ij ij

k ij ijij j

if j Np

54

Bibliography

Marco Dorigo and Thomas Stützle. Ant ColonyOptimization. The MIT Press. July 2004.

J. Kennedy, R. C. Eberhart and Y. Shi. SwarmIntelligence. Morgan Kaufmann Publishers, 2002.

Andries P. Engelbrecht. Computational Intelligence: AnIntroduction. John Wiley, Chichester, 2002

http://iridia.ulb.ac.be/%7Emdorigo/ACO/RealAnts.html

10

ANT FEATURE SELECTION

Objective function

Objectives:

minimize the number of misclassifications, or theclassification error Ne

reduce the number of features, or the featurecardinality Nf

Tradeoff precision vs. accuracy.


1 2minimize e ff Nw w N

Multicriteria ant system


Feature 1

Feature N

RankFeatures

YtestXtest

Test

Cost

Modeling

Ant system

Ant colonyfor selectionof features

Ant colonyfor cardinality

of features

Updatepheromone

Updatepheromone

N cyclesMinimize

number of features Minimizeclassification error

Ant Feature Selection (AFS)


Choose node

Pheromone update

( 1) ( )(1 ) kijl l

x3

x5

x7

x1 x2

x4

xn

x6

Subset:{x3,x6,x7,x1,x4}

,

0, otherwise

ij ij

kij ijij

j

if j Np

Heuristics in AFS

Heuristic for feature cardinality: Fisher’s score for thefeatures

Heuristic for selection of features: classification errore(i) for the individual features


1 2

1 2

2

2 2

( ) ( )( )

( ) ( )c c

c c

i iF i

i imean and variance values of featurei for the samples in class c1 and c2

1( )

( )f ie i


Ant feature selection

General designNumber of ants gBalance of exploration and exploitation

Combination with greedy heuristics orlocal search

When should pheromones be updated?

11


/* Initialization */Set parameters f, n, f, n, f, n, I, N, g.for t = 1 to I

Choose size of subset Nf(k) for each ant kfor l = 1 to N

Build feature set Lkf(t) choosing Nf(k) features

Derive model using Lkf(t) features selected by ant k

Compute classification error Ek(t)Update pheromone trails ni(t + 1) and fj(t + 1)

end forend for

Algorithm

PREDICTING OUTCOMES OFSEPTIC SHOCK PATIENTS

Motivation


ProblemSeptic shock is a common ICU key adverseoutcome, translated into ~50% mortalityrate and high costs of treatments.

Feature Selection (tree search and ant colony optimization)

MethodsFuzzy Systems or Neural Networks

+Feature Selection (tree search and ant colony optimization)

GoalPredict the outcome (survive or

decease) of septic shock patients,for purposes of therapy

management.

Sepsis

Annual mortality rate of sepsis in USA: more than220,000. Sepsis is the tenth most common cause ofdeath.Severe sepsis accounts 2% to 3% of all hospitaladmissions. 59% of patients with sepsis require ICUcare, composing 10.4% of ICU admissions.

The mortality rate for severe sepsis ranges from 13%to 50%, and is as high as 80% to 90% for septic shockand multiple organ dysfunction.


Septic shock - background

Management of sepsis is increasingly protocol-driven

Care is goal-directed and parameterizedWith goal-directed therapy, care becomes similar to acontrol problem, with ‘ideal’ process of care revolvingaround:

Setting a goal/target for a specific physiological parameter

Rapidly driving the physiologic process toward specificgoal/target

Maintaining that physiological parameter within upper andlower limits of that goal


Septic shock - assumptions

Adequacy of control depends largely on:

Close monitoringEarly detection of change

Active management and intervention by nurses


12

MEDAN database

Database used as testbench (Paetza 2003)http://141.2.16.103/datenbank/download_database.htm

VariablesThe MEDAN data base contains the data of 103variables of 387 patientsData from ICU from 1998-2002 collected by medicaldocumentation staffAll patients have septic shock of abdominal cause

TaskPredict patients survival


Problems in the database


Selection of 387 patients and 59 variables.


One of the most complete patients.


Varia

ble

Time [hours]


Measurements for a considerable part of the variablesstopped.


Varia

ble

Time [hours]


Long periods with missing data.


Varia

ble

Time [hours]

Classification measures

In this example we used the following measures:

Classification accuracy (% of correct classification)Area under the ROC Curve (AUC)

Specificity

Sensitivity


http://141.2.16.103/datenbank/download_database.htm

13

Confusion matrix


Specificity and Sensibility

Specificity or true negative rate (TNR)

Sensitivity or true positive rate (TPR)


TNSpecificityTN FP

TPSensibilityTP FN

Area Under the ROC Curve (AUC)

In signal detection theory, a receiver operating characteristic(ROC), or simply ROC curve, is a graphical plot of the sensitivity,vs. false positive ratio (1 specificity).


Area under the ROC curve(AUC)

Results

Classification accuracy ACC (%)


FSmethod

Model

12 Features set 28 Features set

Num.Feat.

Mean StdNum.Feat.

Mean Std

-NN

[Paetza]12 69.0 4.37 - - -

Treesearch

Fuzzy 2-6 74.1 1.31 2-7 82.3 1.56

NN 2-8 73.2 2.03 4-8 81.2 1.97

AFSFuzzy 2-3 72.8 1.44 3-9 78.6 1.44

NN 2-7 75.7 1.37 5-12 81.9 2.12

Results

Specificity


FSmethod

Model12 Features set 28 Features set

Num.Feat.

Mean StdNum.Feat.

Mean Std

-NN

[Paetza]12 92.3 - - - -

Treesearch

Fuzzy 2-6 71.2 2.86 2-7 83.3 2.62

NN 2-8 81.7 3.61 4-8 90.3 2.05

AFSFuzzy 2-3 70.5 0.02 3-9 78.2 0.03

NN 2-7 85.6 0.02 5-12 90.2 0.02

Results

Sensitivity


FSmethod

Model12 Features set 28 Features set

Num.Feat.

Mean StdNum.Feat.

Mean Std

-NN

[Paetza]12 15.0 - - - -

Treesearch

Fuzzy 2-6 79.9 2.60 2-7 82.3 1.56

NN 2-8 54.5 5.42 4-8 64.2 3.92

AFSFuzzy 2-3 76.5 0.03 3-9 79.2 0.04

NN 2-7 59.6 0.02 5-12 67.0 0.05

14

Results

AUC


FSmethod

Model

12 Features set 28 Features set

Num.Feat.

Mean Std Num.Feat.

Mean Std

-NN

[Paetza]12 - - - - -

Treesearch

Fuzzy 2-6 75.0 1.06 2-7 81.8 1.97

NN 2-8 71.9 1.17 4-8 80.8 1.28

AFSFuzzy 2-3 73.5 0.01 3-9 78.7 0.02

NN 2-7 72.6 0.01 5-12 78.1 0.03

12 features subset


0%

20%

40%

60%

80%

100%

1 2 5 6 8 10 14 16 17 24 26 28

Freq

uenc

y

Feature label

BU + FM

AFS + FM

AFS + NN

BU + NN

Most frequent features:8 – pH26 – Calcium28 – Creatinine

28 features subset


0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

1 2 3 5 6 8 9 10 11 12 13 14 16 17 18 20 22 24 25 26 28 29 31 32 35 41 43 85

Freq

uenc

y

Feature label

BU + FM

AFS + FM

AFS + NN

BU + NN

Most frequent features: 8, 26, 28 and18 – thrombocytes 41 – CRP (C-reactive protein)22 – antithrombin III 85 – FiO235 – total bilirubin

Future work

Apply the same techniques to larger health caredatabases with more available features (MIMIC II)

MIMIC II (dimension of database) 40,000 patients and500 features

PreprocessingLarge amount of missing valuesUneven time samplings

Validate models with other datasets – Hospital da Luzin Lisbon.


feature selection using ant colony optimization: applications in health … · 2012-02-14 · ant...

Documents