evolutionary artificial neural networks - yonsei...
TRANSCRIPT
1
Evolutionary Artificial Neural Networks
2
Overview of Soft Computing
[Castellano]
Backgrounds
3
Why NN+EC?
• “Evolving brains”: biological neural networks compete and evolve
– The way that intelligence was created
• Global search
• Adaptation to dynamic environments without human intervention
– Architecture evolution
[Figure: a fitness landscape with initial weights, a local maximum, population samples, and the optimal solution, illustrating global search by a population]
4
General Framework of EANN
[X. Yao]
5
Evolution of Connection Weights
1. Encode each individual neural network’s connection weights into chromosomes
2. Calculate the error function and determine individual’s fitness
3. Reproduce children based on selection criterion
4. Apply genetic operators
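The four steps above can be sketched as a minimal genetic-algorithm loop. This is an illustrative toy, not the method from the slides: the 2-2-1 network, the XOR-style targets, truncation selection, and all parameter values are assumptions.

```python
import random

random.seed(0)

def predict(w, x):
    # Step 1: the chromosome w is a flat list of 9 real-valued genes
    # (weights and biases) for a tiny fixed 2-2-1 network.
    h1 = max(0.0, w[0] * x[0] + w[1] * x[1] + w[2])
    h2 = max(0.0, w[3] * x[0] + w[4] * x[1] + w[5])
    return w[6] * h1 + w[7] * h2 + w[8]

def fitness(w, data):
    # Step 2: error function turned into a fitness (negated MSE).
    return -sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # toy targets
pop = [[random.uniform(-1, 1) for _ in range(9)] for _ in range(20)]
best, best_fit = pop[0], fitness(pop[0], data)

for gen in range(50):
    scored = sorted(pop, key=lambda w: fitness(w, data), reverse=True)
    if fitness(scored[0], data) > best_fit:
        best, best_fit = scored[0], fitness(scored[0], data)
    parents = scored[:10]                         # step 3: truncation selection
    pop = []
    for _ in range(20):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, 9)              # step 4: one-point crossover
        child = a[:cut] + b[cut:]
        child[random.randrange(9)] += random.gauss(0, 0.1)  # Gaussian mutation
        pop.append(child)
```

The fitness is negated error so that selection can maximize it; real EANN systems replace the toy network and operators with the representations discussed next.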
6
Representation
• Binary representation
– Weights are represented by binary bits
• e.g. 8 bits can represent connection weights between -127 and +127
– Limitation on representation precision
• Too few bits → some numbers cannot be approximated
• Too many bits → training might be prolonged
• To overcome the limits of binary representation, some proposed using real numbers
– i.e., one real number per connection weight
• Standard genetic operators such as crossover are not directly applicable to this representation
– However, some argue that evolutionary computation can be performed with mutation alone
– Fogel, Fogel and Porto (1990): adopted one genetic operator, Gaussian random mutation
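The mutation-only variation on a real-valued chromosome can be sketched as follows; the step size sigma and the toy weight vector are illustrative assumptions, not values from the slides.

```python
import random

random.seed(1)

def gaussian_mutate(weights, sigma=0.05):
    # Mutation-only variation: every real-valued connection weight
    # receives a small Gaussian perturbation; no crossover is used.
    return [w + random.gauss(0.0, sigma) for w in weights]

parent = [0.4, -0.7, 0.1]      # one real number per connection weight
child = gaussian_mutate(parent)
```

Because offspring differ from parents only by small perturbations, this variation never suffers the disruption that crossover causes on real-valued weight vectors.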
7
Evolution of Architectures
1. Encode each individual neural network’s architecture into chromosomes
2. Train each neural network with predetermined learning rule
3. Calculate the error function and determine individual’s fitness
4. Reproduce children based on selection criterion
5. Apply genetic operators
8
Direct Encoding
• All information is represented by binary strings, i.e. each connection and node is specified by some binary bits
• An N × N matrix C = (c_ij) can represent the connectivity of N nodes, where
  c_ij = 1 if the connection from node i to node j is ON, and c_ij = 0 if it is OFF
• Does not scale well, since a large NN needs a big matrix to represent it
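Direct encoding can be sketched by flattening the connectivity matrix into a bit string; the 4-node network and its bit pattern below are made up for illustration.

```python
# Direct encoding sketch: a 4-node network's chromosome is the row-major
# bit string of its N-by-N connectivity matrix C, with c_ij = 1 when the
# connection from node i to node j is ON. The bit pattern is made up.
N = 4
bits = "0110" "0001" "0001" "0000"   # adjacent string literals concatenate

C = [[int(bits[i * N + j]) for j in range(N)] for i in range(N)]
connections = [(i, j) for i in range(N) for j in range(N) if C[i][j]]
# connections == [(0, 1), (0, 2), (1, 3), (2, 3)]
```

The chromosome length grows as N², which is exactly the scaling problem the bullet above points out.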
9
Indirect Encoding
• Only the most important parameters or features of an architecture are represented; other details are left to the learning process to decide
– e.g. specify the number of hidden nodes and let the learning process decide how they are connected (e.g. fully connected)
• More biologically plausible: according to discoveries in neuroscience, it is impossible for the genetic information encoded in humans to specify the whole nervous system directly
10
Evolution of Learning Rules
1. Decode each individual into a learning rule
2. Construct a neural network (either pre-determined or random) and train it with the decoded learning rule
• This refers to adapting the learning function; in this case, the connection weights are updated with an adaptive rule
3. Calculate the error function and determine individual’s fitness
4. Reproduce children based on selection criterion
5. Apply genetic operators
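Steps 1 and 2 can be sketched as decoding a chromosome into a weight-update rule. This is a hypothetical illustration: the gene layout (a learning rate and a momentum term) and the one-dimensional error surface are assumptions, not the encoding from the slides.

```python
# Hypothetical sketch of steps 1-2: a chromosome encodes learning-rule
# parameters (here an assumed layout: [learning rate, momentum]), and
# decoding yields the adaptive weight-update rule used during training.

def decode(chromosome):
    eta, mu = chromosome
    def update(w, grad, prev_dw):
        dw = -eta * grad + mu * prev_dw   # the decoded learning rule
        return w + dw, dw
    return update

rule = decode([0.1, 0.5])                 # one individual's learning rule
w, dw = 1.0, 0.0
for _ in range(10):                       # step 2: train with the decoded rule
    grad = 2 * w                          # gradient of the toy error w**2
    w, dw = rule(w, grad, dw)
```

An individual's fitness (step 3) would then be the error reached after training with its decoded rule.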
11
Three Case Studies
• Evolving single neural networks
– Evolving an intrusion detector
– Evolving a classifier for DNA microarray data
• Evolving ensemble neural networks
12
Evolutionary Learning Program’s Behavior In Neural Networks for Anomaly Detection
13
Motivation (1)
• Attacker’s strategy: causing malfunctions by exploiting a program’s bugs
– The program shows different behavior compared to its normal behavior
• Anomaly detection
– Learning a program’s normal behavior from audit data
– Classifying programs whose behavior deviates from the normal profile as intrusions
– Adopted in many host-based intrusion detection systems
• System audit data and machine learning techniques
– Basic security module (BSM)
– Rule-based learning, neural networks and HMMs
14
Motivation (2)
• Machine learning methods such as neural networks (NN) and HMMs
– Effective for intrusion detection based on a program’s behavior
• Architecture of the classifier
– The most important thing in classification
– Searching for an appropriate architecture for the problem is crucial
• NN: the number of hidden neurons and connection information
• HMM: the number of states and connection information
• Traditional methods
– Trial-and-error
• Train 90 neural networks [Ghosh99]
→ This takes too much time because the audit data is so large; optimize architectures as well as connection weights
15
Related Works
• S. Forrest (1998, 1999)
– First intrusion detection by learning a program’s behavior
– HMM performed better than other methods
• J. Stolfo (1997): Rule-based learning (RIPPER)
• N. Ye (2001)
– Probabilistic methods: decision tree, chi-square multivariate test and first-order Markov chain model (1998 IDEVAL data)
• Ghosh (1999, 2000)
– Multi-layer perceptrons and Elman neural networks
– The Elman neural network performed the best (1999 IDEVAL data)
• Vemuri (2003)
– kNN and SVM (1998 IDEVAL data)
16
The Proposed Method
• Architecture
– System call audit data and evolutionary neural networks
[Figure: the BSM audit facility feeds audit data to a preprocessor; a GA modeler builds a normal profile of per-program neural networks (NN_ps, NN_su, NN_at, NN_login, NN_ping, ...); the detector raises an ALARM when behavior deviates from the profile]
17
Normal Behavior Modeling
• Evolutionary neural networks– Simultaneously learning weights and architectures using
genetic algorithm– Partial training: back-propagation algorithm– Representation: matrix
• Rank-based selection, crossover, mutation operators• Fitness evaluation : Recognition rate on training data (mixing real
normal sequences and artificial intrusive sequences)
Generating neural networks with optimal architectures for learning program’s behavior
NN ×
18
ENN (Evolutionary Neural Network) Algorithm
[Flowchart]
1. Separate the BSM data into training and test data
2. Generate initial ANNs
3. Train the ANNs partially
4. Compute the fitness
5. If not done: apply rank-based selection, apply crossover and mutation, generate the new generation, and go to step 3
6. If done: train the ANNs fully and evaluate them on the test data
19
Representation
[Figure: the genotype is a 5 × 5 matrix over the nodes (I1, H1, H2, H3, O1); one triangle holds the connectivity bits and the other the connection weights (e.g. 0.4, 0.5, 0.1, 0.2, 0.7). Decoding the matrix generates the corresponding neural network]
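The matrix genotype can be sketched as a connectivity matrix plus a weight matrix over the node order (I1, H1, H2, H3, O1), decoded by a feed-forward pass. The connection pattern and weight values below are illustrative, taken loosely from the figure rather than reproducing it exactly.

```python
# Sketch of the matrix genotype: conn[i][j] marks whether the link
# i -> j exists and weight[i][j] holds its strength, over the node
# order [I1, H1, H2, H3, O1]. Values are illustrative.
nodes = ["I1", "H1", "H2", "H3", "O1"]
conn = [
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
weight = [
    [0.0, 0.4, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.1],
    [0.0, 0.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

def forward(x):
    # Evaluate nodes in order (inputs first, output last); because the
    # matrix is upper-triangular the network is feed-forward.
    act = [x, 0.0, 0.0, 0.0, 0.0]
    for j in range(1, len(nodes)):
        act[j] = sum(act[i] * weight[i][j] for i in range(j) if conn[i][j])
    return act[-1]

out = forward(1.0)   # out ≈ 0.306 (linear activations, for simplicity)
```

Note that crossover and mutation can then act directly on rows and entries of these matrices, as the next slides show.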
20
Crossover (1)
[Figure: crossover on two parent networks over the nodes (I1, H1, H2, H3, O1): the parents exchange complementary subnetworks, producing two offspring that mix the parents’ connections and weights]
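The crossover in the figure can be sketched as exchanging whole node rows of the matrix genotype between two parents; the crossover point choice and the small example matrices are assumptions.

```python
import random

random.seed(2)

def node_crossover(parent1, parent2):
    # Exchange whole node rows of the matrix genotype (each row carries
    # one node's outgoing connections and weights) at a random point,
    # producing two offspring, as in the crossover figure.
    point = random.randrange(1, len(parent1))
    child1 = [r[:] for r in parent1[:point]] + [r[:] for r in parent2[point:]]
    child2 = [r[:] for r in parent2[:point]] + [r[:] for r in parent1[point:]]
    return child1, child2

m1 = [[0.0, 0.4, 0.5], [0.0, 0.0, 0.7], [0.0, 0.0, 0.0]]
m2 = [[0.0, 0.1, 0.0], [0.0, 0.0, 0.2], [0.0, 0.0, 0.0]]
c1, c2 = node_crossover(m1, m2)
```

Swapping rows rather than arbitrary bits keeps each node's connection set intact, so offspring inherit meaningful subnetworks.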
21
Crossover (2)
[Figure: the same crossover shown on the matrix genotype: corresponding blocks of the two parent matrices are exchanged to form the two offspring matrices]
22
Mutation
I1
H1
H3
H2 O1
0.4
0.5
0.1
0.7
0.1
0.2
0.7
Add ConnectionI1
H1
H3
H2 O1
0.4
0.5
0.1
0.7
0.1
0.2
0.70.3
I1
H1
H3
H2 O1
0.4
0.5
0.1
0.7
0.1
0.2
0.7
Delete ConnectionI1
H1
H3
H2 O1
0.4
0.5
0.1
0.7
0.1
0.2
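The two structural mutations in the figure can be sketched as flipping one off-diagonal entry of the connectivity matrix; the entry-selection scheme and the fresh-weight range are assumptions.

```python
import random

random.seed(3)

def mutate(conn, weight):
    # Structural mutation from the figure: pick a random off-diagonal
    # entry and either add the connection (with a fresh random weight)
    # or delete it if it already exists.
    n = len(conn)
    i, j = random.randrange(n), random.randrange(n)
    while i == j:
        i, j = random.randrange(n), random.randrange(n)
    if conn[i][j]:
        conn[i][j], weight[i][j] = 0, 0.0                    # delete connection
    else:
        conn[i][j], weight[i][j] = 1, random.uniform(-1, 1)  # add connection

conn = [[0, 1], [0, 0]]
weight = [[0.0, 0.7], [0.0, 0.0]]
mutate(conn, weight)
```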
23
Anomaly Detection (1)
• 280 system calls in BSM audit data
– 45 frequently occurring calls (indexed as 0~44)
– Remaining calls indexed as 45
• 10 input nodes, 15 hidden nodes (maximum number of hidden nodes), 2 output nodes
– Input values normalized between 0 and 1
– Output nodes: normal and anomaly
[Table: the 45 frequently occurring system calls, including vfork, execve, readlink, lstat, stat, access, chown, unlink, creat, fork, exit, pathconf, chdir, getaudit, setpgrp, close, setgroups, sysinfo, audit, memcntl, mmap, auditon, setgid, getmsg, utime, fchdir, putmsg, setuid, mkdir, seteuid, pipe, rename, munmap, ioctl, fcntl, and open with various flag combinations (read, write, creat, trunc)]
24
Anomaly Detection (2)
• The evaluation value rises sharply when an intrusion occurs
– Detection of a locally continuous anomalous sequence is important
– Previous values are taken into account
• Output values are normalized so that the same threshold applies to all neural networks
– m: average output value for the training data, d: standard deviation

o'_t = (o_t − m) / d

w_t = α · w_{t−1} + (1 − α) · o'_t

[Figure: output value over time for normal and abnormal traces; the value rises sharply during the intrusion]
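The scoring above can be sketched as normalization followed by exponential smoothing; this assumes a standard smoothing form for the evaluation value, and m, d, alpha and the output traces are made-up numbers.

```python
def anomaly_scores(outputs, m, d, alpha=0.7):
    # Normalize each raw output with the training mean m and standard
    # deviation d so one threshold fits every per-program network, then
    # smooth over time so locally continuous anomalies accumulate.
    w, scores = 0.0, []
    for o in outputs:
        o_norm = (o - m) / d
        w = alpha * w + (1 - alpha) * o_norm
        scores.append(w)
    return scores

normal = [0.10, 0.12, 0.09, 0.11]
attack = [0.10, 0.90, 0.95, 0.92]   # sustained deviation from the profile
s_normal = anomaly_scores(normal, m=0.1, d=0.05)
s_attack = anomaly_scores(attack, m=0.1, d=0.05)
```

A single momentary spike decays quickly under the smoothing, while a sustained anomalous sequence drives the score well past any threshold that normal traffic reaches.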
25
Experimental Design
• 1999 DARPA IDEVAL data provided by MIT Lincoln Lab
– Denial of Service, Probe, Remote-to-Local (R2L), User-to-Root (U2R)
– Main focus: detection of U2R attacks
• U2R attacks leave marks of their traces in the audit data
• Monitoring the behavior of programs that have the SETUID privilege
– The main target of U2R attacks
[Table: monitored SETUID programs, including ps, eject, fdformat, ffbconfig, su, login, passwd, newgrp, at, atq, crontab, sendmail, rsh, rcp, rlogin, rdist, ssh, ping, xlock, top, quota, whodo, volcheck, admintool, allocate, deallocate, list_devices, ufsdump, ufsrestore, and other SETUID programs]
26
Experimental Design (2)
• 1999 IDEVAL: audit data for 5 weeks
– Weeks 1 and 3 (attack free): training data
– Weeks 4-5: test data
• The test data includes 11 attacks in total, of 4 U2R types
• Settings of the genetic algorithm
– Population size: 20, crossover rate: 0.3, mutation rate: 0.08, maximum generation: 100
– The best individual in the last generation is used

Name      | Description                                            | Times
eject     | exploiting a buffer overflow in the 'eject' program    | 2
ffbconfig | exploiting a buffer overflow in the 'ffbconfig' program | 2
fdformat  | exploiting a buffer overflow in the 'fdformat' program | 3
ps        | race condition attack in the 'ps' program              | 4
27
Evolution Results
• Convergence to fitness 0.8 near 100 generations
[Figure: average, minimum and maximum fitness over 100 generations, converging toward 0.8]
28
Learning Time
• Environments
– Intel Xeon 2.4 GHz dual processor, 1 GB RAM
– Solaris 9 operating system
• Data
– login program
– 1,905 sequences in total
• Parameters
– Learning for 5,000 epochs
– Average of 10 runs

Types | Hidden Nodes | Running Time (sec)
MLP   | 10           | 235.5
MLP   | 15           | 263.4
MLP   | 20           | 454.2
MLP   | 25           | 482
MLP   | 30           | 603.6
MLP   | 35           | 700
MLP   | 40           | 853.6
MLP   | 50           | 1216
MLP   | 60           | 1615
ENN   | 15           | 4460
29
Detection Rates
• 100% detection rate with 0.7 false alarms per day
• The Elman NN, which showed the best performance on the 1999 IDEVAL data: 100% detection rate with 3 false alarms per day
[Figure: detection rate vs. false alarms per day]
Effectiveness of Evolutionary NN for IDS
30
Results Analysis – Architecture of NN
• The best individual for learning the behavior of the ps program
– Effective for system call sequences and more complex than a general MLP
31
Comparison of Architectures
• Comparison of the number of connections between an ENN evolved for 100 generations on ps program data and an MLP
• They have a similar number of connections
• However, the ENN has different types of connections and a more sophisticated architecture

ENN
FROM╲TO | Input | Hidden | Output
Input   | 0     | 86     | 15
Hidden  | 0     | 67     | 19
Output  | 0     | 0      | 0

MLP
FROM╲TO | Input | Hidden | Output
Input   | 0     | 150    | 0
Hidden  | 0     | 0      | 30
Output  | 0     | 0      | 0
32
Evolving Artificial Neural Networks for DNA Microarray Analysis
33
Motivation
• Colon cancer: second only to lung cancer as a cause of cancer-related mortality in Western countries
• The development of microarray technology has supplied a large volume of data to many fields
• It has been applied to the prediction and diagnosis of cancer, and is expected to help us predict and diagnose cancer exactly
• Proposed method
– Feature selection + evolutionary neural network (ENN)
– ENN: no restriction on architecture (design without human prior knowledge)
34
What is Microarray?
• Microarray technology
– Enables the simultaneous analysis of thousands of DNA sequences for genetic and genomic research and for diagnostics
• Two major techniques
– Hybridization method
• cDNA microarray / oligonucleotide microarray
– Sequencing method
• SAGE
35
Acquiring Gene Expression Data
[Figure: DNA microarray → image scanner → gene expression data; the expression level is log2(Int(Cy5)/Int(Cy3)), arranged as a genes × samples matrix]
36
Machine Learning for DNA Microarray
[Pipeline: microarray → expression data → feature selection → cancer predictor (tumor vs. normal)]
• Feature selection methods: Pearson's correlation coefficient, Spearman's correlation coefficient, Euclidean distance, cosine coefficient, information gain, mutual information, signal-to-noise ratio
• Cancer predictors: 3-layered MLP with backpropagation, k-nearest neighbor, support vector machine, structure-adaptive self-organizing map, ensemble classifier
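One of the listed scores, the signal-to-noise ratio, can be sketched per gene as (mu_tumor − mu_normal) / (sigma_tumor + sigma_normal); the expression values and labels below are made up for illustration.

```python
from statistics import mean, stdev

def snr(values, labels):
    # Signal-to-noise ratio of one gene between the tumor (label 1)
    # and normal (label 0) sample groups.
    tumor = [v for v, y in zip(values, labels) if y == 1]
    normal = [v for v, y in zip(values, labels) if y == 0]
    return (mean(tumor) - mean(normal)) / (stdev(tumor) + stdev(normal))

labels = [1, 1, 1, 0, 0, 0]
genes = {
    "g1": [5.1, 4.8, 5.3, 1.0, 1.2, 0.9],   # clearly separates the classes
    "g2": [2.0, 1.1, 2.9, 2.1, 1.0, 3.0],   # mostly noise
}
# Keep the genes with the largest absolute score (here, the top 1 of 2).
ranked = sorted(genes, key=lambda g: abs(snr(genes[g], labels)), reverse=True)
```

Feature selection of this kind reduces the 2000 expression levels per sample to a small input vector (30 features in the experiments below) before the ENN is evolved.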
37
Related Works
Authors        | Feature                       | Classifier                 | Accuracy (%)
Furey et al.   | signal-to-noise ratio         | SVM                        | 90.3
Li et al.      | genetic algorithm             | KNN                        | 94.1
Ben-Dor et al. | all genes, TNoM score         | nearest neighbor           | 80.6
Ben-Dor et al. | all genes, TNoM score         | SVM with quadratic kernel  | 74.2
Ben-Dor et al. | all genes, TNoM score         | AdaBoost                   | 72.6
Nguyen et al.  | principal component analysis  | logistic discriminant      | 87.1
Nguyen et al.  | principal component analysis  | quadratic discriminant     | 87.1
Nguyen et al.  | partial least squares         | logistic discriminant      | 93.5
Nguyen et al.  | partial least squares         | quadratic discriminant     | 91.9
38
Overview
[Flowchart]
1. Apply feature selection to the microarray data and separate it into training, validation and test data
2. Generate initial ANNs
3. Train the ANNs partially
4. Compute the fitness
5. If not done: apply rank-based selection, apply crossover and mutation, generate the new generation, and go to step 3
6. If done: train the ANNs fully and evaluate them on the test data
39
Colon Cancer Dataset
• Alon’s data
• The colon dataset consists of 62 samples of colon epithelial cells taken from colon-cancer patients
– 40 of the 62 samples are colon cancer samples and the remaining 22 are normal samples
• Each sample contains 2000 gene expression levels
• Samples were taken from tumors and from normal healthy parts of the colons of the same patients, and measured using high-density oligonucleotide arrays
• Training data: 31 of 62, test data: 31 of 62
40
Experimental Setup
• Feature size: 30
• Parameters of the genetic algorithm
– Population size: 20
– Maximum generation number: 200
– Crossover rate: 0.3
– Mutation rate: 0.1
• Fitness function: recognition rate on the validation data
• Learning rate of BP: 0.1
41
Performance Comparison
[Figure: accuracy by classifier; EANN reaches 0.94, while the other classifiers range from 0.71 to 0.81]
1: EANN
2: MLP
3: SASOM
4: SVM (Linear)
5: SVM (RBF)
6: KNN (Cosine)
7: KNN (Pearson)
42
Sensitivity/Specificity
• Sensitivity = 100%
• Specificity = 81.8%
• Cost comparison
– Misclassifying a cancer patient as normal is more costly than misclassifying a normal person as having cancer

EANN confusion matrix (test data)
Actual╲Predicted | 0 (Normal) | 1 (Cancer)
0 (Normal)       | 9          | 2
1 (Cancer)       | 0          | 20
43
Architecture Analysis
[Figures: the whole evolved architecture, and connections from input to hidden neurons]
44
Architecture Analysis (2)
[Figures: connections from input to output, from hidden to hidden, and from hidden to output neurons]
• The input-to-output relationship is useful to analyze
45
Exploiting Diversity of Neural Ensembles with Speciated Evolution
46
• ANN
– Requires much trial and error to decide parameters such as the number of hidden nodes, weights and connections
• Evolutionary ANN
– Uses an evolutionary algorithm to decide the parameters
– Uses only the best individual and ignores all the information that other NNs have gained from evolution and learning
– EPNet (X. Yao and Y. Liu, 1998)
• Based on evolutionary programming
Motivation (1)
47
• Multiple ANNs
– Use all the information learned by the ANNs in the population
– Improve performance and reliability
– But an EA tends to converge to one best solution
– Combination based on the Dempster-Shafer theory (G. Rogova, 1994)
– ADDEMUP (D. W. Opitz and J. W. Shavlik, 1996)
• Uses a genetic algorithm to search for a diverse set of ANNs
• Proposed multiple ANNs
– Use speciation in the evolution of ANNs to obtain diverse ANNs
– Speciation
• Creates different species in the genetic algorithm
• Yields diverse solutions
Motivation (2)
48
Proposed Model
49
Overview
[Flowchart]
1. Generate initial ANNs
2. Train the ANNs partially
3. Compute the fitness
4. If not done: select ANNs, apply crossover and mutation, generate the new generation, and go to step 2
5. If done: train the ANNs fully and combine the outputs of the ANNs
50
Fitness Sharing (1)
• Prevents convergence to a single best solution by decreasing the fitness of densely populated ANNs and sharing it among neighboring members
• Shared fitness

fs_i = f_i / Σ_{j=1}^{population size} sh(d_ij)

– f_i: fitness of individual i, sh(d_ij): sharing function
51
Fitness Sharing (2)
• Sharing function

sh(d_ij) = 1 − d_ij / σ_s   for 0 ≤ d_ij < σ_s
sh(d_ij) = 0                for d_ij ≥ σ_s

– d_ij: distance between i and j, σ_s: sharing radius
[Figure: individuals a, b, c at distances d_ia, d_ib, d_ic from i, inside and outside the sharing radius σ_s]
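The two formulas above can be sketched together; the 1-D positions standing in for genotype distance and all numeric values are illustrative assumptions.

```python
def shared_fitness(raw, positions, sigma_s=0.2):
    # Each individual's raw fitness f_i is divided by its niche count
    # sum_j sh(d_ij), using the triangular sharing function with a 1-D
    # distance as a stand-in for the ANN similarity measures.
    result = []
    for i, fi in enumerate(raw):
        niche = sum(
            max(0.0, 1.0 - abs(positions[i] - positions[j]) / sigma_s)
            for j in range(len(raw))
        )
        result.append(fi / niche)
    return result

raw = [1.0, 1.0, 1.0, 1.0]
positions = [0.10, 0.11, 0.12, 0.90]   # three crowded ANNs, one isolated
fs = shared_fitness(raw, positions)
```

The three crowded individuals split roughly one unit of fitness between them, while the isolated one keeps its full raw fitness, which is what maintains multiple species.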
52
An Example of Fitness Sharing
[Figure: population distribution on a fitness landscape; with sharing, individuals spread over multiple peaks, while with no sharing they cluster around a single peak]
53
Speciation in ANNs
• Similarity criterion
– Average of the outputs of the ANN
– Modified Kullback-Leibler entropy
– Pearson correlation
• Sharing radius
– Set empirically
54
Similarity Criterion (1)
• Average of outputs

out_avg = (Σ_{i=1}^{N} out_i) / N

– out_i is the output for the i-th input datum and N is the total number of data
• Modified Kullback-Leibler entropy

D(p, q) = (1/2) Σ_{i=1}^{m} Σ_{j=1}^{n} ( p_ij log(p_ij / q_ij) + q_ij log(q_ij / p_ij) )

– p and q are the output probability distributions of two ANNs, which have m output nodes and are trained with n data
– p_ij is the i-th output value of the ANN with respect to the j-th datum
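The symmetrized entropy above can be sketched directly; the two small output distributions are made-up examples, not outputs of trained ANNs.

```python
from math import log

def modified_kl(p, q):
    # Symmetrized Kullback-Leibler entropy between two ANNs' output
    # probability distributions: p[j][i] is the i-th output value on
    # the j-th datum (n data, m output nodes).
    return 0.5 * sum(
        p[j][i] * log(p[j][i] / q[j][i]) + q[j][i] * log(q[j][i] / p[j][i])
        for j in range(len(p))
        for i in range(len(p[0]))
    )

p = [[0.9, 0.1], [0.2, 0.8]]   # outputs of ANN a on two data points
q = [[0.8, 0.2], [0.3, 0.7]]   # outputs of ANN b on the same data
d = modified_kl(p, q)
```

Averaging the two KL directions makes the measure symmetric, so it can serve as the distance d_ij in the sharing function.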
55
Similarity Criterion (2)
• Pearson correlation
– The similarity between ANN a and u
56
Environment (1)
• Breast cancer data from the University of Wisconsin Hospitals
– Classes: benign and malignant
– Number of attributes: 9
– Number of data: 699
– Train/Verify/Test: 349/175/175
• Australian credit approval data
– Classes: ‘+’ and ‘-’
– Number of attributes: 14
– Number of data: 690
– Train/Verify/Test: 346/172/172
• Diabetes data
– Classes: 1 and 2
– Number of attributes: 9
– Number of data: 768
– Train/Verify/Test: 384/192/192
57
Environment (2)
• Evolution parameters
– Individual: feed-forward ANN with 5 initial hidden nodes
– Population size: 20
– Crossover rate: 0.3
– Mutation rate: 0.1
• Learning parameters
– Learning algorithm: BP
– Error function: sum-of-squares
– Learning rate: 0.1
58
Analysis of Speciation
Combine multiple EANNs
59
Combined Results (1)
Australian credit approval Breast Cancer
G=Gating, V=Voting, W=Winner, WA=Weighted Average, B=Bayesian, C=Condorcet, and O=Ideal
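Two of the listed combination schemes can be sketched on made-up member predictions: majority voting (V) over class labels and a weighted average (WA) of the member networks' output scores.

```python
from collections import Counter

def vote(labels):
    # Majority voting (V) over the member networks' class decisions.
    return Counter(labels).most_common(1)[0][0]

def weighted_average(scores, weights):
    # Weighted average (WA) of the member networks' output scores.
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

members = ["benign", "malignant", "benign"]        # three speciated ANNs
decision = vote(members)                           # -> "benign"
score = weighted_average([0.2, 0.9, 0.3], [1.0, 2.0, 1.0])  # ≈ 0.575
```

Combination only helps when the members disagree in useful ways, which is exactly what the speciated evolution above is designed to produce.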
60
Combined Results (2)
Diabetes
G=Gating, V=Voting, W=Winner, WA=Weighted Average, B=Bayesian, C=Condorcet, and O=Ideal
61
Comparison with Other Works
• The BKS combination method is used for comparison
Australian credit approval
Breast cancer
Diabetes