evolutionary artificial neural networks - yonsei...
TRANSCRIPT
1
Evolutionary Artificial Neural Networks
2
Overview of Soft Computing
[Castellano]
Backgrounds
3
Why NN+EC?
• “Evolving brains”: biological neural networks compete and evolve
– The way that intelligence was created
• Global search
• Adaptation to dynamic environments without human intervention
– Architecture evolution
[Figure: a fitness landscape with initial weights, a local maximum, population samples, and the optimal solution, illustrating global search by a population]
4
General Framework of EANN
[X. Yao]
5
Evolution of Connection Weights
1. Encode each individual neural network’s connection weights into chromosomes
2. Calculate the error function and determine individual’s fitness
3. Reproduce children based on selection criterion
4. Apply genetic operators
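The four steps above can be sketched as a minimal genetic-algorithm loop. This is an illustrative toy, not the method from the slides: the 2-2-1 network, the XOR-style targets, truncation selection, and all parameter values are assumptions.

```python
import random

random.seed(0)

def predict(w, x):
    # Step 1: the chromosome w is a flat list of 9 real-valued genes
    # (weights and biases) for a tiny fixed 2-2-1 network.
    h1 = max(0.0, w[0] * x[0] + w[1] * x[1] + w[2])
    h2 = max(0.0, w[3] * x[0] + w[4] * x[1] + w[5])
    return w[6] * h1 + w[7] * h2 + w[8]

def fitness(w, data):
    # Step 2: error function turned into a fitness (negated MSE).
    return -sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # toy targets
pop = [[random.uniform(-1, 1) for _ in range(9)] for _ in range(20)]
best, best_fit = pop[0], fitness(pop[0], data)

for gen in range(50):
    scored = sorted(pop, key=lambda w: fitness(w, data), reverse=True)
    if fitness(scored[0], data) > best_fit:
        best, best_fit = scored[0], fitness(scored[0], data)
    parents = scored[:10]                         # step 3: truncation selection
    pop = []
    for _ in range(20):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, 9)              # step 4: one-point crossover
        child = a[:cut] + b[cut:]
        child[random.randrange(9)] += random.gauss(0, 0.1)  # Gaussian mutation
        pop.append(child)
```

The fitness is negated error so that selection can maximize it; real EANN systems replace the toy network and operators with the representations discussed next.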
6
Representation
• Binary representation
– Weights are represented by binary bits
• e.g. 8 bits can represent connection weights between -127 and +127
– Limitation on representation precision
• Too few bits → some numbers cannot be approximated
• Too many bits → training might be prolonged
• To overcome the limits of binary representation, some proposed using real numbers
– i.e., one real number per connection weight
• Standard genetic operators such as crossover are not directly applicable to this representation
– However, some argue that evolutionary computation can be performed with mutation alone
– Fogel, Fogel and Porto (1990): adopted one genetic operator, Gaussian random mutation
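The mutation-only variation on a real-valued chromosome can be sketched as follows; the step size sigma and the toy weight vector are illustrative assumptions, not values from the slides.

```python
import random

random.seed(1)

def gaussian_mutate(weights, sigma=0.05):
    # Mutation-only variation: every real-valued connection weight
    # receives a small Gaussian perturbation; no crossover is used.
    return [w + random.gauss(0.0, sigma) for w in weights]

parent = [0.4, -0.7, 0.1]      # one real number per connection weight
child = gaussian_mutate(parent)
```

Because offspring differ from parents only by small perturbations, this variation never suffers the disruption that crossover causes on real-valued weight vectors.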
7
Evolution of Architectures
1. Encode each individual neural network’s architecture into chromosomes
2. Train each neural network with predetermined learning rule
3. Calculate the error function and determine individual’s fitness
4. Reproduce children based on selection criterion
5. Apply genetic operators
8
Direct Encoding
• All information is represented by binary strings, i.e. each connection and node is specified by some binary bits
• An N × N matrix C = (c_ij) can represent the connectivity of N nodes, where
  c_ij = 1 if the connection from node i to node j is ON, and c_ij = 0 if it is OFF
• Does not scale well, since a large NN needs a big matrix to represent it
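Direct encoding can be sketched by flattening the connectivity matrix into a bit string; the 4-node network and its bit pattern below are made up for illustration.

```python
# Direct encoding sketch: a 4-node network's chromosome is the row-major
# bit string of its N-by-N connectivity matrix C, with c_ij = 1 when the
# connection from node i to node j is ON. The bit pattern is made up.
N = 4
bits = "0110" "0001" "0001" "0000"   # adjacent string literals concatenate

C = [[int(bits[i * N + j]) for j in range(N)] for i in range(N)]
connections = [(i, j) for i in range(N) for j in range(N) if C[i][j]]
# connections == [(0, 1), (0, 2), (1, 3), (2, 3)]
```

The chromosome length grows as N², which is exactly the scaling problem the bullet above points out.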
9
Indirect Encoding
• Only the most important parameters or features of an architecture are represented; other details are left to the learning process to decide
– e.g. specify the number of hidden nodes and let the learning process decide how they are connected (e.g. fully connected)
• More biologically plausible: according to discoveries in neuroscience, it is impossible for the genetic information encoded in humans to specify the whole nervous system directly
10
Evolution of Learning Rules
1. Decode each individual into a learning rule
2. Construct a neural network (either pre-determined or random) and train it with the decoded learning rule
• This refers to adapting the learning function; in this case, the connection weights are updated with an adaptive rule
3. Calculate the error function and determine individual’s fitness
4. Reproduce children based on selection criterion
5. Apply genetic operators
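Steps 1 and 2 can be sketched as decoding a chromosome into a weight-update rule. This is a hypothetical illustration: the gene layout (a learning rate and a momentum term) and the one-dimensional error surface are assumptions, not the encoding from the slides.

```python
# Hypothetical sketch of steps 1-2: a chromosome encodes learning-rule
# parameters (here an assumed layout: [learning rate, momentum]), and
# decoding yields the adaptive weight-update rule used during training.

def decode(chromosome):
    eta, mu = chromosome
    def update(w, grad, prev_dw):
        dw = -eta * grad + mu * prev_dw   # the decoded learning rule
        return w + dw, dw
    return update

rule = decode([0.1, 0.5])                 # one individual's learning rule
w, dw = 1.0, 0.0
for _ in range(10):                       # step 2: train with the decoded rule
    grad = 2 * w                          # gradient of the toy error w**2
    w, dw = rule(w, grad, dw)
```

An individual's fitness (step 3) would then be the error reached after training with its decoded rule.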
11
Three Case Studies
• Evolving single neural networks
– Evolving an intrusion detector
– Evolving a classifier for DNA microarray data
• Evolving ensemble neural networks
12
Evolutionary Learning Program’s Behavior In Neural Networks for Anomaly Detection
13
Motivation (1)
• Attacker’s strategy: causing malfunctions by exploiting a program’s bugs
– The program shows different behavior compared to its normal behavior
• Anomaly detection
– Learning a program’s normal behavior from audit data
– Classifying programs whose behavior deviates from the normal profile as intrusions
– Adopted in many host-based intrusion detection systems
• System audit data and machine learning techniques
– Basic security module (BSM)
– Rule-based learning, neural networks and HMMs
14
Motivation (2)
• Machine learning methods such as neural networks (NN) and HMMs
– Effective for intrusion detection based on a program’s behavior
• Architecture of the classifier
– The most important thing in classification
– Searching for an appropriate architecture for the problem is crucial
• NN: the number of hidden neurons and connection information
• HMM: the number of states and connection information
• Traditional methods
– Trial-and-error
• Train 90 neural networks [Ghosh99]
→ This takes too much time because the audit data is so large; optimize architectures as well as connection weights
15
Related Works
• S. Forrest (1998, 1999)
– First intrusion detection by learning a program’s behavior
– HMM performed better than other methods
• J. Stolfo (1997): Rule-based learning (RIPPER)
• N. Ye (2001)
– Probabilistic methods: decision tree, chi-square multivariate test and first-order Markov chain model (1998 IDEVAL data)
• Ghosh (1999, 2000)
– Multi-layer perceptrons and Elman neural networks
– The Elman neural network performed the best (1999 IDEVAL data)
• Vemuri (2003)
– kNN and SVM (1998 IDEVAL data)
16
The Proposed Method
• Architecture
– System call audit data and evolutionary neural networks
[Figure: the BSM audit facility feeds audit data to a preprocessor; a GA modeler builds a normal profile of per-program neural networks (NN_ps, NN_su, NN_at, NN_login, NN_ping, ...); the detector raises an ALARM when behavior deviates from the profile]
17
Normal Behavior Modeling
• Evolutionary neural networks– Simultaneously learning weights and architectures using
genetic algorithm– Partial training: back-propagation algorithm– Representation: matrix
• Rank-based selection, crossover, mutation operators• Fitness evaluation : Recognition rate on training data (mixing real
normal sequences and artificial intrusive sequences)
Generating neural networks with optimal architectures for learning program’s behavior
NN ×
18
ENN (Evolutionary Neural Network) Algorithm
[Flowchart]
1. Separate the BSM data into training and test data
2. Generate initial ANNs
3. Train the ANNs partially
4. Compute the fitness
5. If not done: apply rank-based selection, apply crossover and mutation, generate the new generation, and go to step 3
6. If done: train the ANNs fully and evaluate them on the test data
19
Representation
[Figure: the genotype is a 5 × 5 matrix over the nodes (I1, H1, H2, H3, O1); one triangle holds the connectivity bits and the other the connection weights (e.g. 0.4, 0.5, 0.1, 0.2, 0.7). Decoding the matrix generates the corresponding neural network]
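The matrix genotype can be sketched as a connectivity matrix plus a weight matrix over the node order (I1, H1, H2, H3, O1), decoded by a feed-forward pass. The connection pattern and weight values below are illustrative, taken loosely from the figure rather than reproducing it exactly.

```python
# Sketch of the matrix genotype: conn[i][j] marks whether the link
# i -> j exists and weight[i][j] holds its strength, over the node
# order [I1, H1, H2, H3, O1]. Values are illustrative.
nodes = ["I1", "H1", "H2", "H3", "O1"]
conn = [
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
weight = [
    [0.0, 0.4, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.1],
    [0.0, 0.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

def forward(x):
    # Evaluate nodes in order (inputs first, output last); because the
    # matrix is upper-triangular the network is feed-forward.
    act = [x, 0.0, 0.0, 0.0, 0.0]
    for j in range(1, len(nodes)):
        act[j] = sum(act[i] * weight[i][j] for i in range(j) if conn[i][j])
    return act[-1]

out = forward(1.0)   # out ≈ 0.306 (linear activations, for simplicity)
```

Note that crossover and mutation can then act directly on rows and entries of these matrices, as the next slides show.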
20
Crossover (1)
[Figure: crossover on two parent networks over the nodes (I1, H1, H2, H3, O1): the parents exchange complementary subnetworks, producing two offspring that mix the parents’ connections and weights]
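The crossover in the figure can be sketched as exchanging whole node rows of the matrix genotype between two parents; the crossover point choice and the small example matrices are assumptions.

```python
import random

random.seed(2)

def node_crossover(parent1, parent2):
    # Exchange whole node rows of the matrix genotype (each row carries
    # one node's outgoing connections and weights) at a random point,
    # producing two offspring, as in the crossover figure.
    point = random.randrange(1, len(parent1))
    child1 = [r[:] for r in parent1[:point]] + [r[:] for r in parent2[point:]]
    child2 = [r[:] for r in parent2[:point]] + [r[:] for r in parent1[point:]]
    return child1, child2

m1 = [[0.0, 0.4, 0.5], [0.0, 0.0, 0.7], [0.0, 0.0, 0.0]]
m2 = [[0.0, 0.1, 0.0], [0.0, 0.0, 0.2], [0.0, 0.0, 0.0]]
c1, c2 = node_crossover(m1, m2)
```

Swapping rows rather than arbitrary bits keeps each node's connection set intact, so offspring inherit meaningful subnetworks.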
21
Crossover (2)
[Figure: the same crossover shown on the matrix genotype: corresponding blocks of the two parent matrices are exchanged to form the two offspring matrices]
22
Mutation
I1
H1
H3
H2 O1
0.4
0.5
0.1
0.7
0.1
0.2
0.7
Add ConnectionI1
H1
H3
H2 O1
0.4
0.5
0.1
0.7
0.1
0.2
0.70.3
I1
H1
H3
H2 O1
0.4
0.5
0.1
0.7
0.1
0.2
0.7
Delete ConnectionI1
H1
H3
H2 O1
0.4
0.5
0.1
0.7
0.1
0.2
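The two structural mutations in the figure can be sketched as flipping one off-diagonal entry of the connectivity matrix; the entry-selection scheme and the fresh-weight range are assumptions.

```python
import random

random.seed(3)

def mutate(conn, weight):
    # Structural mutation from the figure: pick a random off-diagonal
    # entry and either add the connection (with a fresh random weight)
    # or delete it if it already exists.
    n = len(conn)
    i, j = random.randrange(n), random.randrange(n)
    while i == j:
        i, j = random.randrange(n), random.randrange(n)
    if conn[i][j]:
        conn[i][j], weight[i][j] = 0, 0.0                    # delete connection
    else:
        conn[i][j], weight[i][j] = 1, random.uniform(-1, 1)  # add connection

conn = [[0, 1], [0, 0]]
weight = [[0.0, 0.7], [0.0, 0.0]]
mutate(conn, weight)
```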
23
Anomaly Detection (1)
• 280 system calls in BSM audit data
– 45 frequently occurring calls (indexed as 0~44)
– Remaining calls indexed as 45
• 10 input nodes, 15 hidden nodes (maximum number of hidden nodes), 2 output nodes
– Input values normalized between 0 and 1
– Output nodes: normal and anomaly
[Table: the 45 frequently occurring system calls, including vfork, execve, readlink, lstat, stat, access, chown, unlink, creat, fork, exit, pathconf, chdir, getaudit, setpgrp, close, setgroups, sysinfo, audit, memcntl, mmap, auditon, setgid, getmsg, utime, fchdir, putmsg, setuid, mkdir, seteuid, pipe, rename, munmap, ioctl, fcntl, and open with various flag combinations (read, write, creat, trunc)]
24
Anomaly Detection (2)
• The evaluation value rises sharply when an intrusion occurs
– Detection of a locally continuous anomalous sequence is important
– Previous values are taken into account
• Output values are normalized so that the same threshold applies to all neural networks
– m: average output value for the training data, d: standard deviation

o'_t = (o_t − m) / d

w_t = α · w_{t−1} + (1 − α) · o'_t

[Figure: output value over time for normal and abnormal traces; the value rises sharply during the intrusion]
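The scoring above can be sketched as normalization followed by exponential smoothing; this assumes a standard smoothing form for the evaluation value, and m, d, alpha and the output traces are made-up numbers.

```python
def anomaly_scores(outputs, m, d, alpha=0.7):
    # Normalize each raw output with the training mean m and standard
    # deviation d so one threshold fits every per-program network, then
    # smooth over time so locally continuous anomalies accumulate.
    w, scores = 0.0, []
    for o in outputs:
        o_norm = (o - m) / d
        w = alpha * w + (1 - alpha) * o_norm
        scores.append(w)
    return scores

normal = [0.10, 0.12, 0.09, 0.11]
attack = [0.10, 0.90, 0.95, 0.92]   # sustained deviation from the profile
s_normal = anomaly_scores(normal, m=0.1, d=0.05)
s_attack = anomaly_scores(attack, m=0.1, d=0.05)
```

A single momentary spike decays quickly under the smoothing, while a sustained anomalous sequence drives the score well past any threshold that normal traffic reaches.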
25
Experimental Design
• 1999 DARPA IDEVAL data provided by MIT Lincoln Lab
– Denial of Service, Probe, Remote-to-Local (R2L), User-to-Root (U2R)
– Main focus: detection of U2R attacks
• U2R attacks leave marks of their traces in the audit data
• Monitoring the behavior of programs that have the SETUID privilege
– The main target of U2R attacks
[Table: monitored SETUID programs, including ps, eject, fdformat, ffbconfig, su, login, passwd, newgrp, at, atq, crontab, sendmail, rsh, rcp, rlogin, rdist, ssh, ping, xlock, top, quota, whodo, volcheck, admintool, allocate, deallocate, list_devices, ufsdump, ufsrestore, and other SETUID programs]
26
Experimental Design (2)
• 1999 IDEVAL: audit data for 5 weeks
– Weeks 1 and 3 (attack free): training data
– Weeks 4-5: test data
• The test data includes 11 attacks in total, of 4 U2R types
• Settings of the genetic algorithm
– Population size: 20, crossover rate: 0.3, mutation rate: 0.08, maximum generation: 100
– The best individual in the last generation is used

Name      | Description                                            | Times
eject     | exploiting a buffer overflow in the 'eject' program    | 2
ffbconfig | exploiting a buffer overflow in the 'ffbconfig' program | 2
fdformat  | exploiting a buffer overflow in the 'fdformat' program | 3
ps        | race condition attack in the 'ps' program              | 4
27
Evolution Results
• Convergence to fitness 0.8 near 100 generations
[Figure: average, minimum and maximum fitness over 100 generations, converging toward 0.8]
28
Learning Time
• Environments
– Intel Xeon 2.4 GHz dual processor, 1 GB RAM
– Solaris 9 operating system
• Data
– login program
– 1,905 sequences in total
• Parameters
– Learning for 5,000 epochs
– Average of 10 runs

Types | Hidden Nodes | Running Time (sec)
MLP   | 10           | 235.5
MLP   | 15           | 263.4
MLP   | 20           | 454.2
MLP   | 25           | 482
MLP   | 30           | 603.6
MLP   | 35           | 700
MLP   | 40           | 853.6
MLP   | 50           | 1216
MLP   | 60           | 1615
ENN   | 15           | 4460
29
Detection Rates
• 100% detection rate with 0.7 false alarms per day
• The Elman NN, which showed the best performance on the 1999 IDEVAL data: 100% detection rate with 3 false alarms per day
[Figure: detection rate vs. false alarms per day]
Effectiveness of Evolutionary NN for IDS
30
Results Analysis – Architecture of NN
• The best individual for learning the behavior of the ps program
– Effective for system call sequences and more complex than a general MLP
31
Comparison of Architectures
• Comparison of the number of connections between an ENN evolved for 100 generations on ps program data and an MLP
• They have a similar number of connections
• However, the ENN has different types of connections and a more sophisticated architecture

ENN
FROM╲TO | Input | Hidden | Output
Input   | 0     | 86     | 15
Hidden  | 0     | 67     | 19
Output  | 0     | 0      | 0

MLP
FROM╲TO | Input | Hidden | Output
Input   | 0     | 150    | 0
Hidden  | 0     | 0      | 30
Output  | 0     | 0      | 0
32
Evolving Artificial Neural Networks for DNA Microarray Analysis
33
Motivation
• Colon cancer: second only to lung cancer as a cause of cancer-related mortality in Western countries
• The development of microarray technology has supplied a large volume of data to many fields
• It has been applied to the prediction and diagnosis of cancer, and is expected to help us predict and diagnose cancer exactly
• Proposed method
– Feature selection + evolutionary neural network (ENN)
– ENN: no restriction on architecture (design without human prior knowledge)
34
What is Microarray?
• Microarray technology
– Enables the simultaneous analysis of thousands of DNA sequences for genetic and genomic research and for diagnostics
• Two major techniques
– Hybridization method
• cDNA microarray / oligonucleotide microarray
– Sequencing method
• SAGE
35
Acquiring Gene Expression Data
[Figure: DNA microarray → image scanner → gene expression data; the expression level is log2(Int(Cy5)/Int(Cy3)), arranged as a genes × samples matrix]
36
Machine Learning for DNA Microarray
[Pipeline: microarray → expression data → feature selection → cancer predictor (tumor vs. normal)]
• Feature selection methods: Pearson's correlation coefficient, Spearman's correlation coefficient, Euclidean distance, cosine coefficient, information gain, mutual information, signal-to-noise ratio
• Cancer predictors: 3-layered MLP with backpropagation, k-nearest neighbor, support vector machine, structure-adaptive self-organizing map, ensemble classifier
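One of the listed scores, the signal-to-noise ratio, can be sketched per gene as (mu_tumor − mu_normal) / (sigma_tumor + sigma_normal); the expression values and labels below are made up for illustration.

```python
from statistics import mean, stdev

def snr(values, labels):
    # Signal-to-noise ratio of one gene between the tumor (label 1)
    # and normal (label 0) sample groups.
    tumor = [v for v, y in zip(values, labels) if y == 1]
    normal = [v for v, y in zip(values, labels) if y == 0]
    return (mean(tumor) - mean(normal)) / (stdev(tumor) + stdev(normal))

labels = [1, 1, 1, 0, 0, 0]
genes = {
    "g1": [5.1, 4.8, 5.3, 1.0, 1.2, 0.9],   # clearly separates the classes
    "g2": [2.0, 1.1, 2.9, 2.1, 1.0, 3.0],   # mostly noise
}
# Keep the genes with the largest absolute score (here, the top 1 of 2).
ranked = sorted(genes, key=lambda g: abs(snr(genes[g], labels)), reverse=True)
```

Feature selection of this kind reduces the 2000 expression levels per sample to a small input vector (30 features in the experiments below) before the ENN is evolved.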
37
Related Works
Authors        | Feature                       | Classifier                 | Accuracy (%)
Furey et al.   | signal-to-noise ratio         | SVM                        | 90.3
Li et al.      | genetic algorithm             | KNN                        | 94.1
Ben-Dor et al. | all genes, TNoM score         | nearest neighbor           | 80.6
Ben-Dor et al. | all genes, TNoM score         | SVM with quadratic kernel  | 74.2
Ben-Dor et al. | all genes, TNoM score         | AdaBoost                   | 72.6
Nguyen et al.  | principal component analysis  | logistic discriminant      | 87.1
Nguyen et al.  | principal component analysis  | quadratic discriminant     | 87.1
Nguyen et al.  | partial least squares         | logistic discriminant      | 93.5
Nguyen et al.  | partial least squares         | quadratic discriminant     | 91.9
38
Overview
[Flowchart]
1. Apply feature selection to the microarray data and separate it into training, validation and test data
2. Generate initial ANNs
3. Train the ANNs partially
4. Compute the fitness
5. If not done: apply rank-based selection, apply crossover and mutation, generate the new generation, and go to step 3
6. If done: train the ANNs fully and evaluate them on the test data
39
Colon Cancer Dataset
• Alon’s data
• The colon dataset consists of 62 samples of colon epithelial cells taken from colon-cancer patients
– 40 of the 62 samples are colon cancer samples and the remaining 22 are normal samples
• Each sample contains 2000 gene expression levels
• Samples were taken from tumors and from normal healthy parts of the colons of the same patients, and measured using high-density oligonucleotide arrays
• Training data: 31 of 62, test data: 31 of 62
40
Experimental Setup
• Feature size: 30
• Parameters of the genetic algorithm
– Population size: 20
– Maximum generation number: 200
– Crossover rate: 0.3
– Mutation rate: 0.1
• Fitness function: recognition rate on the validation data
• Learning rate of BP: 0.1
41
Performance Comparison
[Figure: accuracy by classifier; EANN reaches 0.94, while the other classifiers range from 0.71 to 0.81]
1: EANN
2: MLP
3: SASOM
4: SVM (Linear)
5: SVM (RBF)
6: KNN (Cosine)
7: KNN (Pearson)
42
Sensitivity/Specificity
• Sensitivity = 100%
• Specificity = 81.8%
• Cost comparison
– Misclassifying a cancer patient as normal is more costly than misclassifying a normal person as having cancer

EANN confusion matrix (test data)
Actual╲Predicted | 0 (Normal) | 1 (Cancer)
0 (Normal)       | 9          | 2
1 (Cancer)       | 0          | 20
43
Architecture Analysis
[Figures: the whole evolved architecture, and connections from input to hidden neurons]
44
Architecture Analysis (2)
[Figures: connections from input to output, from hidden to hidden, and from hidden to output neurons]
• The input-to-output relationship is useful to analyze
45
Exploiting Diversity of Neural Ensembles with Speciated Evolution
46
• ANN
– Requires much trial and error to decide parameters such as the number of hidden nodes, weights and connections
• Evolutionary ANN
– Uses an evolutionary algorithm to decide the parameters
– Uses only the best individual and ignores all the information that other NNs have gained from evolution and learning
– EPNet (X. Yao and Y. Liu, 1998)
• Based on evolutionary programming
Motivation (1)
47
• Multiple ANNs
– Use all the information learned by the ANNs in the population
– Improve performance and reliability
– But an EA tends to converge to one best solution
– Combination based on the Dempster-Shafer theory (G. Rogova, 1994)
– ADDEMUP (D. W. Opitz and J. W. Shavlik, 1996)
• Uses a genetic algorithm to search for a diverse set of ANNs
• Proposed multiple ANNs
– Use speciation in the evolution of ANNs to obtain diverse ANNs
– Speciation
• Creates different species in the genetic algorithm
• Yields diverse solutions
Motivation (2)
48
Proposed Model
49
Overview
[Flowchart]
1. Generate initial ANNs
2. Train the ANNs partially
3. Compute the fitness
4. If not done: select ANNs, apply crossover and mutation, generate the new generation, and go to step 2
5. If done: train the ANNs fully and combine the outputs of the ANNs
50
Fitness Sharing (1)
• Prevents convergence to a single best solution by decreasing the fitness of densely populated ANNs and sharing it among neighboring members
• Shared fitness

fs_i = f_i / Σ_{j=1}^{population size} sh(d_ij)

– f_i: fitness of individual i, sh(d_ij): sharing function
51
Fitness Sharing (2)
• Sharing function

sh(d_ij) = 1 − d_ij / σ_s   for 0 ≤ d_ij < σ_s
sh(d_ij) = 0                for d_ij ≥ σ_s

– d_ij: distance between i and j, σ_s: sharing radius
[Figure: individuals a, b, c at distances d_ia, d_ib, d_ic from i, inside and outside the sharing radius σ_s]
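The two formulas above can be sketched together; the 1-D positions standing in for genotype distance and all numeric values are illustrative assumptions.

```python
def shared_fitness(raw, positions, sigma_s=0.2):
    # Each individual's raw fitness f_i is divided by its niche count
    # sum_j sh(d_ij), using the triangular sharing function with a 1-D
    # distance as a stand-in for the ANN similarity measures.
    result = []
    for i, fi in enumerate(raw):
        niche = sum(
            max(0.0, 1.0 - abs(positions[i] - positions[j]) / sigma_s)
            for j in range(len(raw))
        )
        result.append(fi / niche)
    return result

raw = [1.0, 1.0, 1.0, 1.0]
positions = [0.10, 0.11, 0.12, 0.90]   # three crowded ANNs, one isolated
fs = shared_fitness(raw, positions)
```

The three crowded individuals split roughly one unit of fitness between them, while the isolated one keeps its full raw fitness, which is what maintains multiple species.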
52
An Example of Fitness Sharing
[Figure: population distribution on a fitness landscape; with sharing, individuals spread over multiple peaks, while with no sharing they cluster around a single peak]
53
Speciation in ANNs
• Similarity criterion
– Average of the outputs of the ANN
– Modified Kullback-Leibler entropy
– Pearson correlation
• Sharing radius
– Set empirically
54
Similarity Criterion (1)
• Average of outputs

out_avg = (Σ_{i=1}^{N} out_i) / N

– out_i is the output for the i-th input datum and N is the total number of data
• Modified Kullback-Leibler entropy

D(p, q) = (1/2) Σ_{i=1}^{m} Σ_{j=1}^{n} ( p_ij log(p_ij / q_ij) + q_ij log(q_ij / p_ij) )

– p and q are the output probability distributions of two ANNs, which have m output nodes and are trained with n data
– p_ij is the i-th output value of the ANN with respect to the j-th datum
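The symmetrized entropy above can be sketched directly; the two small output distributions are made-up examples, not outputs of trained ANNs.

```python
from math import log

def modified_kl(p, q):
    # Symmetrized Kullback-Leibler entropy between two ANNs' output
    # probability distributions: p[j][i] is the i-th output value on
    # the j-th datum (n data, m output nodes).
    return 0.5 * sum(
        p[j][i] * log(p[j][i] / q[j][i]) + q[j][i] * log(q[j][i] / p[j][i])
        for j in range(len(p))
        for i in range(len(p[0]))
    )

p = [[0.9, 0.1], [0.2, 0.8]]   # outputs of ANN a on two data points
q = [[0.8, 0.2], [0.3, 0.7]]   # outputs of ANN b on the same data
d = modified_kl(p, q)
```

Averaging the two KL directions makes the measure symmetric, so it can serve as the distance d_ij in the sharing function.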
55
Similarity Criterion (2)
• Pearson correlation
– The similarity between ANN a and u
56
Environment (1)
• Breast cancer data from the University of Wisconsin Hospitals
– Classes: benign and malignant
– Number of attributes: 9
– Number of data: 699
– Train/Verify/Test: 349/175/175
• Australian credit approval data
– Classes: ‘+’ and ‘-’
– Number of attributes: 14
– Number of data: 690
– Train/Verify/Test: 346/172/172
• Diabetes data
– Classes: 1 and 2
– Number of attributes: 9
– Number of data: 768
– Train/Verify/Test: 384/192/192
57
Environment (2)
• Evolution parameters
– Individual: feed-forward ANN with 5 initial hidden nodes
– Population size: 20
– Crossover rate: 0.3
– Mutation rate: 0.1
• Learning parameters
– Learning algorithm: BP
– Error function: sum-of-squares
– Learning rate: 0.1
58
Analysis of Speciation
Combine multiple EANNs
59
Combined Results (1)
Australian credit approval Breast Cancer
G=Gating, V=Voting, W=Winner, WA=Weighted Average, B=Bayesian, C=Condorcet, and O=Ideal
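Two of the listed combination schemes can be sketched on made-up member predictions: majority voting (V) over class labels and a weighted average (WA) of the member networks' output scores.

```python
from collections import Counter

def vote(labels):
    # Majority voting (V) over the member networks' class decisions.
    return Counter(labels).most_common(1)[0][0]

def weighted_average(scores, weights):
    # Weighted average (WA) of the member networks' output scores.
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

members = ["benign", "malignant", "benign"]        # three speciated ANNs
decision = vote(members)                           # -> "benign"
score = weighted_average([0.2, 0.9, 0.3], [1.0, 2.0, 1.0])  # ≈ 0.575
```

Combination only helps when the members disagree in useful ways, which is exactly what the speciated evolution above is designed to produce.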
60
Combined Results (2)
Diabetes
G=Gating, V=Voting, W=Winner, WA=Weighted Average, B=Bayesian, C=Condorcet, and O=Ideal
61
Comparison with Other Works
• The BKS combination method is used for comparison
Australian credit approval
Breast cancer
Diabetes