university term project final...


Page 1: UNIVERSITY Term project Final Presentationsclab.yonsei.ac.kr/courses/04DB_project/report/9C_Final... · 2007-03-16 · 0100 1011 1001 1011 1101 0111 1011 1001 1011 1101 0111 YONSEI

YONSEI UNIVERSITY, Computer Science
Database system: Term project Final Presentation

Efficient huge-scale feature selection using Speciated Genetic Algorithm and Bayesian Network

Team: 9C
Members:
– Ja-Min Koo, Head (M.S. 3rd) [email protected]
– Ji-Oh Yoo (M.S. 1st) [email protected]
– Jin-Hyuk Hong (Ph.D 1st) [email protected]
– Gsum-Sung Hwang (Ph.D 1st) [email protected]

Soft Computing Lab., Computer Science, Yonsei University
Presentation Date: June 10th, 2004
Presentation by Ja-Min Koo

Page 2: Agenda

Motivation
Related Work
– Classification & feature selection
– Conventional feature selection with genetic algorithm
Proposed Method
– Overview
– Genetic algorithm with speciation
– Modified representation for huge-scale feature selection
– Bayesian network learning and classification
Experiments
– Experimental environment
– Experimental results
Conclusion

Page 3: Motivation

Classification issue
– One of the popular data mining problems in databases.

Huge-scale features
– Web usage profiles
– Gene expression data
– Over 1,000 features!

Selecting informative features
– NP-complete problem

In this project…
– Genetic algorithm
– Bayesian classifier

Page 4: Related Work: Classification

Classification – 2 step process
1. Build a model that describes the data classes or concepts.
2. Classify new samples and evaluate the accuracy.

Classifiers
– Decision trees, neural networks, SVMs, kNN, Bayesian classifier, etc.
– The Bayesian classifier is based on Bayes' theorem:

$$p(h \mid x) = \frac{p(x \mid h)\, p(h)}{P(x)}$$

– Based on class conditional independence and a graphical probability model.
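As a small worked illustration of Bayes' theorem for classification, the sketch below computes the posterior p(h | x) for two classes and picks the more probable one. The class names, priors, and likelihoods are made-up numbers chosen only for the example; they do not come from the project data.

```python
def posterior(prior, likelihoods):
    """Return p(h | x) for each hypothesis h via Bayes' theorem.

    prior: dict mapping class label -> p(h)
    likelihoods: dict mapping class label -> p(x | h) for one observed x
    """
    # Evidence P(x) = sum over h of p(x | h) * p(h)
    evidence = sum(likelihoods[h] * prior[h] for h in prior)
    return {h: likelihoods[h] * prior[h] / evidence for h in prior}


# Hypothetical numbers: two classes with invented priors and likelihoods.
prior = {"class0": 0.6, "class1": 0.4}
likelihoods = {"class0": 0.2, "class1": 0.5}

post = posterior(prior, likelihoods)
predicted = max(post, key=post.get)  # class with the highest posterior
```

The evidence term only normalizes the posteriors, so the predicted class can also be found by comparing p(x | h) p(h) directly.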

Page 5: Related Work: Feature selection issue

Feature selection
– Huge-scale features
  Irrelevant or redundant features may be present.
– Problems
  Need to ensure the statistical variability between patterns from different classes.
  Irrelevant features can mislead learning algorithms or overfit the data.
  The model becomes more complex.
– Therefore, feature selection that extracts informative features is necessary!

Page 6: Related Work: Conventional feature selection with GA

Previous work
– Goal
  Maximize the distribution of distance between groups.
– Evaluate each feature in one dimension
  This may lose crucial information from the combination of features.

GA wrapper method
– The number of possible (non-empty) feature subsets is

$$n_c = \sum_{k=1}^{n_f} {}_{n_f}C_k = 2^{n_f} - 1$$

  ▶ Too large!
– It is almost impossible to evaluate them all.
– So…

[Figure: GA procedure with Crossover, Mutation, Selection, and Evaluation steps]
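To give a feel for why exhaustive evaluation is out of reach, the sketch below counts the non-empty feature subsets for a small n_f by direct summation and then reports how many decimal digits the count has for a data set with 4,026 features. The function name is hypothetical and the code is only an illustration of the counting argument.

```python
import math


def count_feature_subsets(n_f):
    """Number of non-empty subsets of n_f features: sum over k of C(n_f, k)."""
    return sum(math.comb(n_f, k) for k in range(1, n_f + 1))


# The direct sum agrees with the closed form 2**n_f - 1 for a small case.
small = count_feature_subsets(10)

# For thousands of features the closed form is cheaper to evaluate.
n_f = 4026
digits = len(str(2**n_f - 1))  # number of decimal digits in the subset count
```

Even at one evaluation per nanosecond, a count with more than a thousand digits cannot be enumerated, which is the motivation for a heuristic search such as a GA.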

Page 7: Proposed Method: Overview

SGABN
– Speciated Genetic Algorithm and Bayesian Network

Page 8: Proposed Method: Genetic algorithm with speciation

Speciation technique
– Generate multiple species within the population of evolutionary methods.
– Method
  Some techniques restrict an individual to mate only with similar ones, while others manipulate its fitness using niching pressure to control the selection.
– Advantages
  Avoid "genetic drift".
  Maintain the wide search landscape.

We use explicit fitness sharing.
– sf_i: shared fitness, f_i: fitness
– d_ij: distance between i and j
– σ_s: sharing radius

$$sf_i = \frac{f_i}{m_i}, \qquad m_i = \sum_{j=1}^{N} sh(d_{ij})$$

$$sh(d_{ij}) = \begin{cases} 1 - \left(\dfrac{d_{ij}}{\sigma_s}\right)^{\alpha}, & \text{for } 0 \le d_{ij} < \sigma_s \\ 0, & \text{for } d_{ij} \ge \sigma_s \end{cases}$$
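Explicit fitness sharing can be sketched in a few lines: each raw fitness f_i is divided by a niche count m_i that sums the sharing function sh(d_ij) over the population. The example below is a minimal sketch, not the project's implementation; the Hamming distance, the value α = 1, and the toy population are assumptions made for illustration.

```python
def sharing(d, sigma_s, alpha=1.0):
    """sh(d): 1 - (d / sigma_s)**alpha inside the sharing radius, else 0."""
    if 0 <= d < sigma_s:
        return 1.0 - (d / sigma_s) ** alpha
    return 0.0


def shared_fitness(fitness, distance, sigma_s, alpha=1.0):
    """Return sf_i = f_i / m_i with m_i = sum over j of sh(d_ij).

    fitness: list of raw fitness values f_i
    distance: function (i, j) -> d_ij between individuals i and j
    """
    n = len(fitness)
    shared = []
    for i in range(n):
        # m_i is at least 1 because sh(d_ii) = sh(0) = 1
        m_i = sum(sharing(distance(i, j), sigma_s, alpha) for j in range(n))
        shared.append(fitness[i] / m_i)
    return shared


# Hypothetical population: bit strings compared by Hamming distance (an assumption).
population = ["0000", "0001", "1111"]
fitness = [1.0, 1.0, 1.0]


def hamming(i, j):
    return sum(a != b for a, b in zip(population[i], population[j]))


sf = shared_fitness(fitness, hamming, sigma_s=2.0)
```

In this toy run the two near-identical individuals share a niche and have their fitness reduced, while the isolated individual keeps its full fitness, which is how sharing keeps several species alive.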

Page 10: Proposed Method: Modified representation for huge-scale feature selection

The data in this project are composed of thousands of features.
– It is hard to apply the conventional approach.

New approach
– Modify the representation of the chromosome.
– In this project…
  ns is set to 25.
  13-bit and 12-bit indices are used to represent …
  – 7,129 features of Leukemia
  – 4,026 features of Lymphoma cancer data
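One way to read the index-based representation is that a chromosome holds ns fixed-width binary indices, each naming one selected feature, so 13 bits cover the 7,129 Leukemia features (2^13 = 8,192) and 12 bits cover the 4,026 Lymphoma features (2^12 = 4,096). The sketch below follows that reading. The function names are hypothetical, and wrapping out-of-range indices with a modulo is an assumption, since the slides do not say how such values are handled.

```python
import random


def random_chromosome(n_s, bits, rng):
    """A chromosome is n_s indices, each encoded as a `bits`-long bit string."""
    return "".join(rng.choice("01") for _ in range(n_s * bits))


def decode(chromosome, bits, n_features):
    """Decode a bit string into a sorted list of distinct selected feature indices.

    Assumption: indices outside [0, n_features) are wrapped with a modulo;
    the slides do not specify how out-of-range values are treated.
    """
    selected = set()
    for start in range(0, len(chromosome), bits):
        index = int(chromosome[start:start + bits], 2)
        selected.add(index % n_features)
    return sorted(selected)


rng = random.Random(0)
leukemia = random_chromosome(n_s=25, bits=13, rng=rng)
features = decode(leukemia, bits=13, n_features=7129)
```

Under this encoding a Leukemia chromosome is 25 × 13 = 325 bits long, compared with 7,129 bits if one bit per feature were used (assuming that is what the conventional approach refers to).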

Page 12: Proposed Method: BN learning and classification

Bayesian network learning
– Training data: from GA
– Structure learning: K2 algorithm
– Accuracy for training data: GA fitness function

K2 algorithm
– Greedy heuristic search algorithm.
– Given a database D, it searches for the BN structure G with maximal Pr(G, D).

Noisy-OR gate
– Describes the interaction between n causes X1, X2, …, Xn and their common effect Y.
– Binary noisy-OR gate
  For Y's CPD, define p1, p2, …, pn, where pi is the probability that Y is true when cause Xi is true and all the other causes are false.
  From these, it is easy to derive the probability of y given a subset Xp of the Xi's.
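For a binary noisy-OR gate without a leak term, the standard expression for the probability that Y is true, given that exactly the causes in a subset X_p are true, is 1 − ∏ (1 − p_i) over the active causes. The sketch below evaluates that expression; the function name and the example probabilities are hypothetical, and the no-leak form is an assumption about which variant the slides intend.

```python
def noisy_or(p, active):
    """P(Y = true | exactly the causes in `active` are true), no leak term.

    p: list of p_i, the probability that Y is true when only cause X_i is true
    active: iterable of indices i of the causes that are true (the subset X_p)
    """
    prob_all_fail = 1.0
    for i in active:
        # Each active cause independently fails to trigger Y with probability 1 - p_i
        prob_all_fail *= 1.0 - p[i]
    return 1.0 - prob_all_fail


# Hypothetical link probabilities for three causes.
p = [0.8, 0.5, 0.3]
none_active = noisy_or(p, [])
one_active = noisy_or(p, [0])
two_active = noisy_or(p, [0, 1])
```

The gate needs only n parameters instead of a full table with 2^n rows, which keeps the conditional probability distribution small when a node has many parents.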

Page 14: Experiments: Experimental environment (1)

Experimental domains
– Leukemia data (Lin et al., 2001)
  # of features: 7,129
  # of data: 72 (training: 38, testing: 34)
  Class
  – ALL patients (47 samples) are labeled as class 0
  – AML patients (25 samples) are labeled as class 1
– Lymphoma data (Lossos et al., 2000)
  # of features: 4,026
  # of data: 47 (training: 22, testing: 25)
  Class
  – GC B-like (24 samples) are labeled as class 0
  – Activated B-like (23 samples) are labeled as class 1

Page 15: Experiments: Experimental environment (2)

Parameters of genetic operators
– Population size: 20
– Selection rate: 0.8
– Crossover rate: 0.8
– Mutation rate: 0.005
– Use elitism

Parameters of neural network
– Learning rate: 0.3 / 0.1
– Momentum: 0.5 / 0.8
– Maximum iteration: 500
– Minimum error: 0.02
– # of hidden nodes: 2 / 5

[Figure: GA procedure with Crossover, Mutation, Selection, and Evaluation steps]
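The listed GA parameters could be wired into a generational loop along the following lines. This is a rough sketch under several assumptions that the slides do not confirm: binary tournament selection, one-point crossover, bit-flip mutation, and a toy fitness function (the number of 1 bits) standing in for classifier accuracy. The selection rate of 0.8 is not used because its exact meaning is not stated.

```python
import random

POP_SIZE = 20
CROSSOVER_RATE = 0.8
MUTATION_RATE = 0.005


def evolve(fitness, length, generations, rng):
    """Simple generational GA with elitism on bit-list chromosomes."""
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(POP_SIZE)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        next_pop = [scored[0][:]]  # elitism: carry the best individual over unchanged
        while len(next_pop) < POP_SIZE:
            # Assumption: binary tournament selection for each parent.
            a = max(rng.sample(pop, 2), key=fitness)
            b = max(rng.sample(pop, 2), key=fitness)
            child = a[:]
            if rng.random() < CROSSOVER_RATE:
                cut = rng.randrange(1, length)  # one-point crossover
                child = a[:cut] + b[cut:]
            # Bit-flip mutation, applied independently to each gene
            child = [1 - g if rng.random() < MUTATION_RATE else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)


# Toy fitness (count of 1 bits) used only to exercise the loop.
best = evolve(fitness=sum, length=30, generations=50, rng=random.Random(1))
```

Because the best individual is always copied into the next generation, the best fitness in the population never decreases from one generation to the next.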

Page 16: Experiments: Experimental results: Number of features used

[Figure: number of features used (y-axis) versus generation (x-axis, 0 to 100) for sGA and SGABN; two panels: Lymphoma and Leukemia]

Page 17: Experiments: Experimental results: Bayesian network

Result table: selected features
– Simple GA: 14 features are selected → 0.638889
  507 651 35 1999 274 154 1044 1277 522 1221 875 105 624 320
– Speciated GA: 13 features are selected → 0.86
  448 1999 235 743 1834 1575 634 1809 879 1132 380 1247 1556

Page 18: Experiments: Experimental results: GA comparison

[Figure: Lymphoma; fitness (y-axis, 0.55 to 1) versus generation (x-axis, 1 to 301) for simple GA, speciated GA, and no Index sGA]

Page 19: Experiments: Experimental results: Neural network (1)

Leukemia
Measure | sGA | GANN | SGANN
Average training error rate (std) | 0.342 (0.02) | 0.210 (0.03) | 0.236 (0.03)
Average training error rate of the best solution (std) | 0.211 (0.06) | 0.008 (0.07) | 0.026 (0.08)
Generation for finding 3 solutions | Not found in 100 generations | Not found in 100 generations | 86 generations
Discovered solutions in last 10 generations (train error rate <= 1/38) | 0 | 1 | 5
Average features used | 3569 | 25 | 25

Lymphoma
Measure | sGA | GANN | SGANN
Average training error rate (std) | 0.637 (0.02) | 0.266 (0.02) | 0.269 (0.02)
Average training error rate of the best solution (std) | 0.3217 (0.06) | 0.002 (0.09) | 0.002 (0.09)
Generation for finding 5 solutions | Not found in 100 generations | 16 generations | 15 generations
Discovered solutions in last 5 generations (train error rate = 0) | 0 | 8 | 7
Average features used | 2009 | 25 | 25

Page 20: Experiments: Experimental results: Neural network (2)

Leukemia
Measure | sGA | GANN | SGANN
Input node # | 3569 | 25 | 25
Average processing time (10 generations) | 42,359 sec | 559 sec | 590 sec
Average test error rate (std) | 0.3006 (0.03) | 0.3401 (0.03) | 0.2252 (0.09)

Lymphoma
Measure | sGA | GANN | SGANN
Input node # | 2009 | 25 | 25
Average processing time (10 generations) | 34,390 sec | 502 sec | 545 sec
Average test error rate (std) | 0.4692 (0.07) | 0.2662 (0.10) | 0.2247 (0.08)

[Figure: Lymphoma; test accuracy (y-axis, 0 to 80) by feature selection method (x-axis: PC, SC, ED, CC, MI, sGA, GANN, SGANN) for five classifiers: Neural network, SASOM, SVM (linear kernel), SVM (RBF kernel), kNN (cosine)]

Page 21: Conclusion

Huge-scale feature data
– Search diverse solutions using speciated GA
– Improve the performance of classification
– Easy interpretability by BN

Future work
– Some problems with the BN classifier
  The other results will be included in the final report.
– Cross-validation: too few samples

Page 22

Any questions?

Page 23

Thank you!