feature selection using ant colony optimization: applications in health … · 2012-02-14 · ant...
FEATURE SELECTION USING ANT COLONY OPTIMIZATION:
APPLICATIONS IN HEALTH CARE
João M. C. Sousa1
S. M. Vieira1, S. N. Finkelstein2,3, A. S. Fialho1,2, F. Cismondi1,2, S. R. Reti3 and M. D. Howell3
1 Technical University of Lisbon, Instituto Superior Técnico, Dept. of Mechanical Engineering, CIS/IDMEC – LAETA, Av. Rovisco Pais, 1049-001 Lisbon, Portugal
2 Massachusetts Institute of Technology, Engineering Systems Division, 77 Massachusetts Avenue, 02139 Cambridge, MA, USA
3 Division of Clinical Informatics, Department of Medicine, Beth Israel Deaconess Medical Centre, Harvard Medical School, Boston, MA, USA
Motivation
Knowledge discovery process
20 September 2010 Eindhoven, the Netherlands 2
[Figure: knowledge discovery pipeline. Data → (data acquisition) → Target data → (preprocessing) → Preprocessed data → (feature selection) → Reduced data → (modeling) → Patterns → (interpretation) → Knowledge]
From G. Piatetsky-Shapiro, U. Fayyad and P. Smyth. From data mining to knowledge discovery in databases. Artificial Intelligence Magazine, 17(3):37-54, 1996.
Outline
Motivation
Modeling
    Neural networks
    Fuzzy sets and systems
    Fuzzy modeling
Feature selection
Ant colony optimization
Ant feature selection
Application: predicting outcomes of sepsis patients
NEURAL NETWORKS
Artificial neuron
$x_i$: i-th input of the neuron
$w_i$: synaptic strength (weight) for $x_i$
$y = \sigma\left(\sum_i w_i x_i\right)$: output signal
[Figure: neuron with inputs $x_1, x_2, \ldots, x_n$, weights $w_1, w_2, \ldots, w_n$ and output $y$]
Types of neurons
McCulloch and Pitts (1943), threshold $\theta$:
$y = \operatorname{sign}\left(\sum_{i=1}^{n} w_i x_i - \theta\right)$
Other types of activation functions ($net = \sum_i w_i x_i$):
step: $y = 1$ if $net \geq 0$; $y = 0$ if $net < 0$
sigmoid: $y = \dfrac{1}{1 + e^{-net}}$
linear: $y = net$
Multi-Layer Perceptron (MLP)
Can learn functions that are not linearly separable.
[Figure: multi-layer perceptron; axis label: output signals]
Most common MLP
[Figure: MLP with inputs $x_1, \ldots, x_n$, a hidden layer of $m$ neurons (weights $w_{ij}^h$, biases $b_j^h$) and an output layer of $l$ neurons (weights $w_{jk}^o$, biases $b_k^o$) producing outputs $y_1, \ldots, y_l$]
Most common MLP
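Output of neurons in the hidden layer $h_j$ (sigmoid, here tanh):
$h_j = \sigma\left(\sum_{i=1}^{n} w_{ij}^h x_i + b_j^h\right) = \sigma\left(\sum_{i=0}^{n} w_{ij}^h x_i\right) = \tanh\left(\sum_{i=0}^{n} w_{ij}^h x_i\right)$
Output of neurons in the output layer $y_k$ (linear):
$y_k = \sigma\left(\sum_{j=1}^{m} w_{jk}^o h_j + b_k^o\right) = \sigma\left(\sum_{j=0}^{m} w_{jk}^o h_j\right) = \sum_{j=0}^{m} w_{jk}^o h_j$
Here $x_0 = 1$ and $h_0 = 1$, so the biases are absorbed into the sums as $w_{0j}^h = b_j^h$ and $w_{0k}^o = b_k^o$.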
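The two layer equations of this slide can be sketched in a few lines of plain Python. This is only an illustration: the function name `mlp_forward`, the weight layout (`W_h[i][j]` from input i to hidden neuron j, `W_o[j][k]` from hidden neuron j to output k) and the toy weights are assumptions, not taken from the slides.

```python
import math

def mlp_forward(x, W_h, b_h, W_o, b_o):
    """One hidden tanh layer followed by a linear output layer."""
    n, m, l = len(x), len(b_h), len(b_o)
    # Hidden layer: h_j = tanh(sum_i w_ij^h * x_i + b_j^h)
    h = [math.tanh(sum(W_h[i][j] * x[i] for i in range(n)) + b_h[j])
         for j in range(m)]
    # Output layer (linear): y_k = sum_j w_jk^o * h_j + b_k^o
    y = [sum(W_o[j][k] * h[j] for j in range(m)) + b_o[k] for k in range(l)]
    return h, y

# Toy network (made-up weights): 2 inputs, 2 hidden neurons, 1 output.
h, y = mlp_forward([1.0, -1.0],
                   W_h=[[0.5, -0.5], [0.5, 0.5]],
                   b_h=[0.0, 0.0],
                   W_o=[[1.0], [2.0]],
                   b_o=[0.1])
```

With these weights the first hidden neuron receives a net input of 0 and the second a net input of -1, so the output is 2·tanh(-1) + 0.1.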
Learning in NN
Biological neural networks:
    Synaptic connections amongst neurons which simultaneously exhibit high activity are strengthened.
Artificial neural networks:
    Mathematical approximation of biological learning.
    Error minimization (nonlinear optimization problem).
        Error backpropagation (first-order gradient)
        Newton methods (second-order gradient)
        Levenberg-Marquardt (second-order gradient)
        Conjugate gradients
        ...
Supervised learning
Training data:
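$X = [x_1^T, x_2^T, \ldots, x_N^T]^T$
$Y = [y_1^T, y_2^T, \ldots, y_N^T]^T$
[Figure: supervised learning loop with input $x$, output $y$ and error $e$]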
FUZZY SETS
Basic Concepts
Introduction
How to simplify very complex systems?
Allow some degree of uncertainty in their description!
How to deal mathematically with uncertainty?
Using probability theory (stochastic).
Using the theory of fuzzy sets (non-stochastic).
Proposed in 1965 by Lotfi Zadeh (Fuzzy Sets, Information and Control, 8, pp. 338-353).
Imprecision or vagueness in natural language does not imply a loss of accuracy or meaningfulness!
Classical set
Example: set of old people $A = \{age \mid age \geq 70\}$
[Figure: membership $\mu_A$ versus age (50 to 100); $\mu_A$ steps from 0 to 1 at age 70]
Logic propositions
“Nick is old” ... true or false
Nick’s age:
$age_{Nick} = 70$, $\mu_A(70) = 1$ (true)
$age_{Nick} = 69.9$, $\mu_A(69.9) = 0$ (false)
[Figure: crisp membership $\mu_A$ versus age (50 to 100)]
Fuzzy set
Graded membership: an element belongs to a set to a certain degree.
[Figure: membership grade $\mu_A$ versus age (50 to 100), rising gradually from 0 to 1]
Fuzzy proposition
“Nick is old” ... degree of truth
$age_{Nick} = 70$, $\mu_A(70) = 0.5$
$age_{Nick} = 69.9$, $\mu_A(69.9) = 0.49$
$age_{Nick} = 90$, $\mu_A(90) = 1$
[Figure: membership grade $\mu_A$ versus age (50 to 100)]
Typical linguistic values
[Figure: membership grade versus age (0 to 100) for the linguistic values young, middle age and old]
Linguistic variable
$x$ is age
$\mathcal{A}$ = {young, middle age, old}
[Figure: membership grade versus age (0 to 100); the semantic rules MX map the linguistic values young, middle age and old to membership functions]
Fuzzy complement
$\mu_{\bar{A}}(x) = 1 - \mu_A(x)$
[Figure: membership functions of $A$ and its complement over $x$]
Intersection of fuzzy sets
$\mu_{A \cap B}(x) = \min(\mu_A(x), \mu_B(x))$
[Figure: membership functions of $A$ and $B$ over $x$]
Union of fuzzy sets
$\mu_{A \cup B}(x) = \max(\mu_A(x), \mu_B(x))$
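The complement, intersection and union operators can be tried out on simple membership functions. The sketch below assumes piecewise-linear shapes for "old" (rising between 60 and 80) and "young" (falling between 20 and 40); these breakpoints and all function names are illustrative assumptions, not the curves drawn on the slides.

```python
def mu_old(age, a=60.0, b=80.0):
    """Assumed membership for 'old': 0 below a, 1 above b, linear between."""
    if age <= a:
        return 0.0
    if age >= b:
        return 1.0
    return (age - a) / (b - a)

def mu_young(age, a=20.0, b=40.0):
    """Assumed membership for 'young': 1 below a, 0 above b, linear between."""
    if age <= a:
        return 1.0
    if age >= b:
        return 0.0
    return (b - age) / (b - a)

def complement(mu):
    return lambda x: 1.0 - mu(x)              # 1 - mu_A(x)

def intersection(mu_a, mu_b):
    return lambda x: min(mu_a(x), mu_b(x))    # min(mu_A(x), mu_B(x))

def union(mu_a, mu_b):
    return lambda x: max(mu_a(x), mu_b(x))    # max(mu_A(x), mu_B(x))

not_old = complement(mu_old)
young_and_old = intersection(mu_young, mu_old)
young_or_old = union(mu_young, mu_old)
```

With these assumed shapes, a 70-year-old is "old" and "not old" to the same degree 0.5, which a crisp set cannot express.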
[Figure: membership functions of $A$ and $B$ over $x$]
FUZZY SYSTEMS
Linguistic variable
$\{x, \mathcal{A}, X, MX\}$
Where:
$x$ – name of the linguistic variable
$\mathcal{A}$ – linguistic values (terms)
$X$ – universe of discourse
$MX$ – semantic rule that associates each linguistic value to a membership function.
Fuzzy if-then rules
Fuzzy propositions: x is A, y is B
Linguistic (Mamdani) fuzzy if-then rule:
If x is A then y is B
Antecedent: x is A
Consequent: y is B
Rule “If x is A then y is B” is represented by a fuzzy relation defined on $X \times Y$.
Examples
If the road is slippery then brake softly.
If error is Negative big and e is Positive big then u is Negative small.
If a tomato is red then the tomato is ripe.
If the temperature is very high then reduce the heat a lot.
If the valve is closed then the pressure is high.
Linguistic (Mamdani) model
$R^k$: If $\mathbf{x}$ is $A^k$ then $\mathbf{y}$ is $B^k$, $k = 1, 2, \ldots, K$
Decomposing using conjunctive forms:
$R^k$: If $x_1$ is $A_1^k$ and $x_2$ is $A_2^k$ and ... and $x_n$ is $A_n^k$ then $y_1$ is $B_1^k$ and $y_2$ is $B_2^k$ and ... and $y_p$ is $B_p^k$
Degree of fulfillment of antecedents:
$\beta_k = \mu_{A_1^k}(x_1) \wedge \mu_{A_2^k}(x_2) \wedge \cdots \wedge \mu_{A_n^k}(x_n)$, $k = 1, 2, \ldots, K$
Takagi-Sugeno fuzzy model
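$R^k$: If $\mathbf{x}$ is $A^k$ then $y^k = f^k(\mathbf{x})$, $k = 1, 2, \ldots, K$
Affine linear form:
$R^k$: If $\mathbf{x}$ is $A^k$ then $y^k = (\mathbf{a}^k)^T \mathbf{x} + b^k$
Degree of fulfillment $\beta_k$ defined as in linguistic models
Model output given by the weighted fuzzy-mean:
$y = \dfrac{\sum_{k=1}^{K} \beta_k y^k}{\sum_{j=1}^{K} \beta_j} = \dfrac{\sum_{k=1}^{K} \beta_k \left((\mathbf{a}^k)^T \mathbf{x} + b^k\right)}{\sum_{j=1}^{K} \beta_j}$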
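The weighted fuzzy-mean of a Takagi-Sugeno model can be sketched as follows. The Gaussian membership shape, the use of `min` as the conjunction operator, and the two example rules are assumptions chosen for illustration; the slides do not fix any of them.

```python
import math

def gauss(x, c, s):
    """Gaussian membership function (an assumed shape)."""
    return math.exp(-0.5 * ((x - c) / s) ** 2)

def ts_output(x, rules):
    """Takagi-Sugeno output: fulfillment-weighted mean of affine rule outputs.

    Each rule is (memberships, a, b): one membership function per input
    and the affine consequent y_k = a^T x + b.
    """
    num = den = 0.0
    for memberships, a, b in rules:
        # Degree of fulfillment: conjunction (here min) of antecedent memberships.
        beta = min(mu(xi) for mu, xi in zip(memberships, x))
        y_k = sum(ai * xi for ai, xi in zip(a, x)) + b
        num += beta * y_k
        den += beta
    return num / den  # assumes at least one rule fires (den > 0)

rules = [
    ([lambda v: gauss(v, 0.0, 1.0)], [1.0], 0.0),  # If x is "around 0" then y = x
    ([lambda v: gauss(v, 4.0, 1.0)], [0.0], 4.0),  # If x is "around 4" then y = 4
]
y_mid = ts_output([2.0], rules)
```

At x = 2 both rules fire to the same degree, so the output is the plain average of the two consequents (2 and 4).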
FUZZY MODELING
Kernel-based modeling
Fuzzy systems
Radial basis function networks
Support vector machines
Multi-layer perceptron
...
Fuzzy systems can be interpretable!
Fuzzy sets can close the gap between symbolic processing and numerical computations.
Fuzzy system parameters
Parameters of antecedent membership functions (shape, location, etc.)
Parameters of consequent membership functions (Mamdani systems)
Parameters of consequent functions (Takagi-Sugeno systems)
Aggregation of antecedent memberships
Implication/reasoning
Defuzzification function (Mamdani systems)
Building fuzzy models
Data-driven approach
    nonlinear mapping
    extract from input-output data:
        rules
        antecedents (membership functions)
        consequents (membership or crisp functions)
Fuzzy c-means
[Figure: three surface plots of the membership function (MF) over X and Y, all axes from 0 to 1]
Assumes partition matrix is fixed
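Fuzzy c-means alternates two steps: with the partition matrix held fixed the cluster centres are updated, and with the centres held fixed the partition matrix is updated. The following is a minimal sketch of the standard algorithm in plain Python; the function name `fcm`, the fuzziness exponent m = 2, the fixed iteration count and the toy data are assumptions, not details from the slides.

```python
import random

def fcm(data, c, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means on a list of points (each a list of floats)."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Random initial partition matrix U (c x n); each column sums to 1.
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for j in range(n):
        s = sum(U[i][j] for i in range(c))
        for i in range(c):
            U[i][j] /= s
    for _ in range(iters):
        # Step 1: U fixed, centres are membership-weighted means of the data.
        V = []
        for i in range(c):
            w = [U[i][j] ** m for j in range(n)]
            V.append([sum(w[j] * data[j][d] for j in range(n)) / sum(w)
                      for d in range(dim)])
        # Step 2: centres fixed, memberships follow from squared distances.
        for j in range(n):
            d2 = [sum((data[j][d] - V[i][d]) ** 2 for d in range(dim)) + 1e-12
                  for i in range(c)]
            for i in range(c):
                U[i][j] = 1.0 / sum((d2[i] / d2[k]) ** (1.0 / (m - 1.0))
                                    for k in range(c))
    return V, U

data = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
V, U = fcm(data, 2)
```

Each row of `U` is the membership function of one cluster sampled at the data points, which is what gets projected onto the variables to obtain antecedent membership functions.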
Modeling based on fuzzy clustering
1. Collect the data
2. Select model structure (Mamdani, Takagi-Sugeno, …)
3. Select number of clusters and clustering algorithm
4. Cluster the data
5. Obtain antecedent membership functions (MF) from clusters. Obtain consequents (MF or parameters)
6. Determine a fuzzy rule for each cluster
7. Simplify the model, if necessary
8. Validate the model
Building fuzzy models
Structure
    Input and output variables. For dynamic systems also the representation of the dynamics.
    Number of membership functions per variable, type of membership functions, number of rules.
Parameters
    Antecedent membership functions
    Consequent parameters
Linguistic models from clustering
Use fuzzy c-means algorithm.
Cluster data in input–output product space.
Membership functions obtained by:
    projection onto variables,
    membership function parameterization.
One rule per cluster
$R^k$: If $\mathbf{x}$ is $A^k$ then $\mathbf{y}$ is $B^k$, $k = 1, 2, \ldots, K$
Figure reproduced with permission of Prof. Uzay Kaymak
Example of linguistic model
If income is Low then tax is Low
If income is High then tax is High
Selecting number of antecedents
A priori knowledge (experts, dynamics, etc.)
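Regularity criterion – based on cross-validation.
    Split training set randomly into two parts (A and B)
    Minimize regularity criterion:
    $RC = \dfrac{1}{2}\left[\dfrac{1}{k_A}\sum_{i=1}^{k_A}\left(y_i^A - \hat{y}_i^{AB}\right)^2 + \dfrac{1}{k_B}\sum_{i=1}^{k_B}\left(y_i^B - \hat{y}_i^{BA}\right)^2\right]$
    where $\hat{y}_i^{AB}$ is the output for sample $i$ of part A predicted by the model identified on part B, and $\hat{y}_i^{BA}$ the reverse.
    Variables selected incrementally until regularity criterion increases.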
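The regularity criterion is a two-fold cross-validation error: each half of the data is predicted by the model identified on the other half. A minimal sketch, assuming a one-dimensional least-squares line as a stand-in for the fuzzy model (the helper `fit_line`, the function names and the toy data are illustrative assumptions):

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b (a stand-in for the fuzzy model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda x: a * x + b

def regularity_criterion(xa, ya, xb, yb, fit=fit_line):
    """RC = 1/2 * (MSE on A of model from B + MSE on B of model from A)."""
    model_a, model_b = fit(xa, ya), fit(xb, yb)
    err_a = sum((y - model_b(x)) ** 2 for x, y in zip(xa, ya)) / len(xa)
    err_b = sum((y - model_a(x)) ** 2 for x, y in zip(xb, yb)) / len(xb)
    return 0.5 * (err_a + err_b)

xa, ya = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]     # samples of y = 2x + 1
xb, yb = [3.0, 4.0, 5.0], [7.0, 9.0, 11.0]    # more samples of the same line
rc_clean = regularity_criterion(xa, ya, xb, yb)
rc_noisy = regularity_criterion(xa, ya, xb, [7.0, 9.0, 12.0])
```

In the incremental scheme described above, a candidate input would be kept only if adding it lowers this value.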
Selecting number of antecedents
Feature selection (FS)
Principal Component Analysis (PCA)
Curvilinear Component Analysis (CCA)
(...)
Tree search methods
    Bottom-up
    Top-down
FS using genetic algorithms
FS using ant colony optimization
FEATURE SELECTION
Introduction
Many applications have hundreds to tens of thousands of variables/features
Many are irrelevant and/or redundant.
Curse of dimensionality.
Feature selection
What is feature selection?
Remove features (inputs) X(i) to improve (or least degrade) prediction of outputs Y.
Advantages:
    Feature selection selects most relevant features
    Collect/process fewer features and less data
    Less complex models run faster
    Models are easier to understand, verify and explain
Feature selection algorithms
Filters
    Based on general characteristics of data to be evaluated.
    No model is involved.
Wrappers
    Use model performance to evaluate feature subsets.
    Train one classification model for each feature subset.
Hybrid methods
    Do not retrain the model at every step.
    Search feature selection space and model parameter space simultaneously.
Tree search – bottom-up
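A bottom-up tree search used as a wrapper can be sketched as a greedy loop that trains one model per candidate subset and adds the feature that lowers the error most. The nearest-centroid classifier, the use of training error as the evaluation measure and the toy data below are simplifying assumptions for illustration; they are not the models or the validation scheme used in this work.

```python
def nearest_centroid_error(X, y, subset):
    """Training error of a nearest-centroid classifier using only `subset`."""
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(r[f] for r in rows) / len(rows) for f in subset]
    errors = 0
    for x, label in zip(X, y):
        pred = min(classes, key=lambda c: sum(
            (x[f] - centroids[c][i]) ** 2 for i, f in enumerate(subset)))
        errors += pred != label
    return errors / len(X)

def bottom_up_selection(X, y, evaluate=nearest_centroid_error):
    """Greedy bottom-up search: add the best feature until the error stops improving."""
    remaining = list(range(len(X[0])))
    selected, best_err = [], float("inf")
    while remaining:
        # Wrapper step: one model is trained and evaluated per candidate subset.
        err, f = min((evaluate(X, y, selected + [f]), f) for f in remaining)
        if err >= best_err:
            break
        selected.append(f)
        remaining.remove(f)
        best_err = err
    return selected, best_err

# Toy data: feature 0 is noise, feature 1 separates the two classes.
X = [[0.3, 0.0], [0.9, 0.1], [0.1, 0.2], [0.8, 1.0], [0.2, 1.1], [0.4, 0.9]]
y = [0, 0, 0, 1, 1, 1]
selected, err = bottom_up_selection(X, y)
```

A top-down search would start from the full feature set and remove one feature at a time with the same evaluation loop.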
ANT COLONY OPTIMIZATION
Biologically inspired algorithms
Artificial ant colonies: maybe the most used method from the artificial life algorithms.
Introduced by Marco Dorigo (1992), it has been well received by the academic world and is starting to be used in industrial applications.
Applications: Traveling Salesman Problem, Vehicle Routing, Quadratic Assignment Problem, Internet Routing, Logistic Scheduling, clustering and data mining problems.
Ant Colony Optimization
Artificial Life algorithms: swarm, ants, wasps, bees
Ant Colony Optimization is one of the most used methods of the Artificial Life algorithms.
Applications: Travelling salesman problem, vehicle routing, quadratic assignment problem, internet routing, logistics scheduling.
There are also some applications of ACO in clustering and data mining problems, including feature selection.
What is special about ants?
Ants can perform complex tasks:
    nest building, food storage
    garbage collection, war
    foraging (to wander in search of food)
There is no management in an ant colony
    collective intelligence
They communicate using:
    pheromones (chemical substance), sound, touch
Curiosities:
    Ant colonies have existed for more than 100 million years
    Myrmecologists estimate that there are around 20 000 species of ants
The foraging behaviour of ants
How can almost blind animals manage to learn the shortest paths from their nests to the food source and back?
a) Ants follow a path between the nest and the food source.
b) Ants go around the obstacle following one of two different paths with equal probability.
c) On the shorter path, more pheromones are laid down.
d) At the end, all ants follow the shortest path.
Photos: http://iridia.ulb.ac.be/~mdorigo/ACO/RealAnts.html
Artificial ants
Artificial ants move in graphs
    nodes / arcs
    environment is discrete
As real ants:
    choose paths based on pheromone concentration
    deposit pheromones on paths
    environment updates pheromones
Extra abilities of artificial ants:
    prior knowledge (heuristic $\eta$)
    memory (feasible neighbourhood $N$)
Mathematical framework
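Initialization
Set $\tau_{ij} = \tau_0$
For l = 1 : Nmax
    Build a complete tour
    For i = 1 to n
        For k = 1 to m
            Choose node
            Update N
            Apply local heuristic
        end
    end
    Analyze solutions
    For k = 1 to m
        Compute $f_k$
    end
    Update pheromones
end

Choose node:
$p_{ij}^k = \dfrac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{r \in N} \tau_{ir}^{\alpha}\,\eta_{ir}^{\beta}}$ if $j \in N$; $0$ otherwise
Update feasible neighbourhood:
$N \leftarrow N \setminus \{j\}$
Pheromone update:
$\tau_{ij}(l+1) = \tau_{ij}(l)(1 - \rho) + \Delta\tau_{ij}^k$
$\Delta\tau_{ij}^k = Q / f_k$ if $(i, j) \in S$; $0$ otherwise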
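The framework on this slide can be illustrated on a small travelling salesman problem. The sketch below follows the usual Ant System scheme (probabilistic node choice from pheromone and heuristic, evaporation, then a deposit inversely proportional to tour length). The function name `aco_tsp`, all parameter values, the inverse-distance heuristic and the four-city example are assumptions made for illustration.

```python
import math
import random

def aco_tsp(dist, n_ants=10, n_iter=50, alpha=1.0, beta=2.0,
            rho=0.1, Q=1.0, tau0=1.0, seed=0):
    """Ant System sketch for a symmetric TSP given a distance matrix."""
    rng = random.Random(seed)
    n = len(dist)
    tau = [[tau0] * n for _ in range(n)]               # pheromone trails
    eta = [[0.0 if i == j else 1.0 / dist[i][j]        # heuristic: inverse distance
            for j in range(n)] for i in range(n)]
    best_tour, best_len = None, float("inf")
    for _ in range(n_iter):
        tours = []
        for _ in range(n_ants):
            tour = [rng.randrange(n)]
            feasible = set(range(n)) - {tour[0]}       # feasible neighbourhood N
            while feasible:
                i = tour[-1]
                cand = sorted(feasible)
                # Selection weights proportional to tau^alpha * eta^beta.
                weights = [tau[i][j] ** alpha * eta[i][j] ** beta for j in cand]
                j = rng.choices(cand, weights=weights)[0]
                tour.append(j)
                feasible.remove(j)                     # N <- N \ {j}
            length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
            tours.append((tour, length))
            if length < best_len:
                best_tour, best_len = tour, length
        # Evaporation on every arc, then each ant deposits Q / f_k on its arcs.
        for i in range(n):
            for j in range(n):
                tau[i][j] *= (1.0 - rho)
        for tour, length in tours:
            for k in range(n):
                a, b = tour[k], tour[(k + 1) % n]
                tau[a][b] += Q / length
                tau[b][a] += Q / length
    return best_tour, best_len

# Four cities on the corners of a unit square; the shortest tour has length 4.
pts = [(0, 0), (0, 1), (1, 1), (1, 0)]
dist = [[math.hypot(ax - bx, ay - by) for bx, by in pts] for ax, ay in pts]
tour, length = aco_tsp(dist)
```

For feature selection the same loop applies with features as nodes and a classification-based cost in place of the tour length.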
ANT FEATURE SELECTION
Objective function
Objectives:
minimize the number of misclassifications, or the classification error $N_e$
reduce the number of features, or the feature cardinality $N_f$
Tradeoff precision vs. accuracy.
minimize $f = w_1 N_e + w_2 N_f$
Multicriteria ant system
[Figure: multicriteria ant system. Features 1 to N are ranked; an ant colony for feature cardinality (minimize number of features) and an ant colony for selection of features (minimize classification error) build feature subsets; a model is built and tested on (Xtest, Ytest); the resulting cost updates the pheromones of both colonies; the loop runs for N cycles]
Ant Feature Selection (AFS)
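Choose node:
$p_{ij}^k = \dfrac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{r \in N} \tau_{ir}^{\alpha}\,\eta_{ir}^{\beta}}$ if $j \in N$; $0$ otherwise
Pheromone update:
$\tau_{ij}(l+1) = \tau_{ij}(l)(1 - \rho) + \Delta\tau_{ij}^k$
[Figure: graph whose nodes are the features $x_1, x_2, \ldots, x_n$; an ant's path selects the subset $\{x_3, x_6, x_7, x_1, x_4\}$]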
Heuristics in AFS
Heuristic for feature cardinality: Fisher’s score for the features
$F(i) = \dfrac{\left(\mu_{c_1}(i) - \mu_{c_2}(i)\right)^2}{\sigma_{c_1}^2(i) + \sigma_{c_2}^2(i)}$
where $\mu_{c}(i)$ and $\sigma_{c}^2(i)$ are the mean and variance values of feature $i$ for the samples in class $c_1$ and $c_2$
Heuristic for selection of features: classification error $e(i)$ for the individual features
$\eta_f(i) = \dfrac{1}{e(i)}$
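Fisher's score for a single feature and two classes can be computed directly from the class means and variances. The sketch below assumes the population variance (division by the class size) and uses made-up values; the function name `fisher_score` is also an assumption.

```python
def fisher_score(values, labels, c1, c2):
    """Fisher score of one feature for the two classes c1 and c2."""
    def mean_var(cls):
        v = [x for x, y in zip(values, labels) if y == cls]
        mu = sum(v) / len(v)
        var = sum((x - mu) ** 2 for x in v) / len(v)   # population variance
        return mu, var
    mu1, var1 = mean_var(c1)
    mu2, var2 = mean_var(c2)
    # Squared distance between class means over the summed class variances.
    return (mu1 - mu2) ** 2 / (var1 + var2)

labels = [0, 0, 0, 1, 1, 1]
f_separating = fisher_score([1, 2, 3, 7, 8, 9], labels, 0, 1)    # classes far apart
f_overlapping = fisher_score([1, 5, 9, 2, 6, 10], labels, 0, 1)  # classes overlap
```

The well-separated feature scores 27, the overlapping one about 0.047, so a higher score marks a feature that discriminates better between the two classes.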
Ant feature selection
General design
    Number of ants g
    Balance of exploration and exploitation
    Combination with greedy heuristics or local search
    When should pheromones be updated?
Algorithm
/* Initialization */
Set parameters $\rho_f, \rho_n, \alpha_f, \alpha_n, \beta_f, \beta_n, I, N, g$.
for t = 1 to I
    Choose size of subset $N_f(k)$ for each ant $k$
    for l = 1 to N
        Build feature set $L_f^k(t)$ choosing $N_f(k)$ features
        Derive model using $L_f^k(t)$ features selected by ant $k$
        Compute classification error $E_k(t)$
        Update pheromone trails $\tau_{n_i}(t + 1)$ and $\tau_{f_j}(t + 1)$
    end for
end for
PREDICTING OUTCOMES OF SEPTIC SHOCK PATIENTS
Motivation
Problem
    Septic shock is a common and key adverse outcome in the ICU, with a mortality rate of ~50% and high treatment costs.
Methods
    Fuzzy Systems or Neural Networks + Feature Selection (tree search and ant colony optimization)
Goal
    Predict the outcome (survival or death) of septic shock patients, for purposes of therapy management.
Sepsis
Annual mortality of sepsis in the USA: more than 220,000 deaths. Sepsis is the tenth most common cause of death.
Severe sepsis accounts for 2% to 3% of all hospital admissions. 59% of patients with sepsis require ICU care, making up 10.4% of ICU admissions.
The mortality rate for severe sepsis ranges from 13% to 50%, and is as high as 80% to 90% for septic shock and multiple organ dysfunction.
Septic shock - background
Management of sepsis is increasingly protocol-driven
Care is goal-directed and parameterized
With goal-directed therapy, care becomes similar to a control problem, with the ‘ideal’ process of care revolving around:
    Setting a goal/target for a specific physiological parameter
    Rapidly driving the physiological process toward that specific goal/target
    Maintaining that physiological parameter within upper and lower limits of that goal
Septic shock - assumptions
Adequacy of control depends largely on:
Close monitoring
Early detection of change
Active management and intervention by nurses
MEDAN database
Database used as testbench (Paetza 2003)
http://141.2.16.103/datenbank/download_database.htm
Variables
    The MEDAN database contains the data of 103 variables of 387 patients
    Data from ICU from 1998-2002 collected by medical documentation staff
    All patients have septic shock of abdominal cause
Task
    Predict patient survival
Problems in the database
Selection of 387 patients and 59 variables.
Problems in the database
One of the most complete patients.
[Figure: variable versus time (hours)]
Problems in the database
Measurements for a considerable part of the variables stopped.
[Figure: variable versus time (hours)]
Problems in the database
Long periods with missing data.
[Figure: variable versus time (hours)]
Classification measures
In this example we used the following measures:
Classification accuracy (% of correct classification)
Area under the ROC Curve (AUC)
Specificity
Sensitivity
Confusion matrix
Specificity and Sensitivity
Specificity or true negative rate (TNR)
$Specificity = \dfrac{TN}{TN + FP}$
Sensitivity or true positive rate (TPR)
$Sensitivity = \dfrac{TP}{TP + FN}$
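Both rates, together with the classification accuracy, follow directly from the four cells of the confusion matrix. A small sketch with made-up labels; the function names and the choice of 1 as the positive class are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def specificity(tn, fp):
    return tn / (tn + fp)      # true negative rate

def sensitivity(tp, fn):
    return tp / (tp + fn)      # true positive rate

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```

Here the counts are TP = 2, TN = 4, FP = 1 and FN = 1, giving a specificity of 0.8, a sensitivity of 2/3 and an accuracy of 0.75.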
Area Under the ROC Curve (AUC)
In signal detection theory, a receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot of the sensitivity vs. the false positive rate (1 − specificity).
[Figure: area under the ROC curve (AUC)]
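The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counted as one half). A minimal sketch of this pairwise formulation, with made-up scores and an assumed function name:

```python
def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value = auc(y_true, scores)
```

Three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. The pairwise loop is quadratic in the number of samples, which is acceptable for a few hundred patients.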
Results
Classification accuracy ACC (%)
FS method | Model | 12 features set: Num. feat. | Mean | Std | 28 features set: Num. feat. | Mean | Std
- | NN [Paetza] | 12 | 69.0 | 4.37 | - | - | -
Tree search | Fuzzy | 2-6 | 74.1 | 1.31 | 2-7 | 82.3 | 1.56
Tree search | NN | 2-8 | 73.2 | 2.03 | 4-8 | 81.2 | 1.97
AFS | Fuzzy | 2-3 | 72.8 | 1.44 | 3-9 | 78.6 | 1.44
AFS | NN | 2-7 | 75.7 | 1.37 | 5-12 | 81.9 | 2.12
Results
Specificity
FS method | Model | 12 features set: Num. feat. | Mean | Std | 28 features set: Num. feat. | Mean | Std
- | NN [Paetza] | 12 | 92.3 | - | - | - | -
Tree search | Fuzzy | 2-6 | 71.2 | 2.86 | 2-7 | 83.3 | 2.62
Tree search | NN | 2-8 | 81.7 | 3.61 | 4-8 | 90.3 | 2.05
AFS | Fuzzy | 2-3 | 70.5 | 0.02 | 3-9 | 78.2 | 0.03
AFS | NN | 2-7 | 85.6 | 0.02 | 5-12 | 90.2 | 0.02
Results
Sensitivity
FS method | Model | 12 features set: Num. feat. | Mean | Std | 28 features set: Num. feat. | Mean | Std
- | NN [Paetza] | 12 | 15.0 | - | - | - | -
Tree search | Fuzzy | 2-6 | 79.9 | 2.60 | 2-7 | 82.3 | 1.56
Tree search | NN | 2-8 | 54.5 | 5.42 | 4-8 | 64.2 | 3.92
AFS | Fuzzy | 2-3 | 76.5 | 0.03 | 3-9 | 79.2 | 0.04
AFS | NN | 2-7 | 59.6 | 0.02 | 5-12 | 67.0 | 0.05
Results
AUC
FS method | Model | 12 features set: Num. feat. | Mean | Std | 28 features set: Num. feat. | Mean | Std
- | NN [Paetza] | 12 | - | - | - | - | -
Tree search | Fuzzy | 2-6 | 75.0 | 1.06 | 2-7 | 81.8 | 1.97
Tree search | NN | 2-8 | 71.9 | 1.17 | 4-8 | 80.8 | 1.28
AFS | Fuzzy | 2-3 | 73.5 | 0.01 | 3-9 | 78.7 | 0.02
AFS | NN | 2-7 | 72.6 | 0.01 | 5-12 | 78.1 | 0.03
12 features subset
[Figure: bar chart of selection frequency (0% to 100%) for feature labels 1, 2, 5, 6, 8, 10, 14, 16, 17, 24, 26 and 28; series: BU + FM, AFS + FM, AFS + NN, BU + NN]
Most frequent features:
    8 – pH
    26 – Calcium
    28 – Creatinine
28 features subset
[Figure: bar chart of selection frequency (0% to 100%) for feature labels 1, 2, 3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 20, 22, 24, 25, 26, 28, 29, 31, 32, 35, 41, 43 and 85; series: BU + FM, AFS + FM, AFS + NN, BU + NN]
Most frequent features: 8, 26, 28 and
    18 – thrombocytes
    22 – antithrombin III
    35 – total bilirubin
    41 – CRP (C-reactive protein)
    85 – FiO2
Future work
Apply the same techniques to larger health care databases with more available features (MIMIC II)
MIMIC II (dimension of database): 40,000 patients and 500 features
Preprocessing
    Large amount of missing values
    Uneven time samplings
Validate models with other datasets – Hospital da Luz in Lisbon.