Artificial Intelligence
Term Project #3
Kyu-Baek Hwang
Biointelligence Lab
School of Computer Science and Engineering
Seoul National University
Copyright (c) 2004 by SNU CSE Biointelligence Lab
Outline
Bayesian network – revisit I
  Properties of Bayesian network
  Structural learning of Bayesian network
Project 3-1
  Analysis of structural learning algorithms
  ALARM dataset
Bayesian network – revisit II
  Bayesian network classifiers (probabilistic inference)
Project 3-2
  Classification of microarray gene expression data using Bayesian networks
Bayesian Network
The joint probability distribution over all the variables in the Bayesian network.
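$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathbf{Pa}_i)$$
$$
\begin{aligned}
P(A, B, C, D, E) &= P(A)\,P(B \mid A)\,P(C \mid A, B)\,P(D \mid A, B, C)\,P(E \mid A, B, C, D) \\
&= P(A)\,P(B)\,P(C \mid A, B)\,P(D \mid B)\,P(E \mid C)
\end{aligned}
$$
[Figure: example network with edges A → C, B → C, B → D, C → E]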
Local probability distribution for Xi
  $\mathbf{Pa}_i$: the set of parents of $X_i$
  $\theta_i = (\theta_{i1}, \ldots, \theta_{i q_i})$ ~ parameter for $P(X_i \mid \mathbf{Pa}_i)$
  $q_i$: # of configurations of $\mathbf{Pa}_i$
  $r_i$: # of states of $X_i$
Generative Model
From the underlying distribution, a set of data examples can be generated.
Any conditional probability of interest can be calculated from the joint probability distribution.
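[Figure: Bayesian network over Class and Gene A to Gene H, with edges Gene A → Class, Gene B → Class, Gene C → Gene E, Gene C → Gene F, Class → Gene F, Class → Gene G, Gene D → Gene G, Gene D → Gene H]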
This Bayesian network can classify the examples by calculating the appropriate conditional probability, P(Class | other variables).
Classification by Bayesian Networks I
Calculate the conditional probability of the ‘Class’ variable given the values of the other variables.
  Infer the conditional probability from the joint probability distribution. For example,
$$P(Class \mid GeneA, GeneB, \ldots, GeneH) = \frac{P(Class, GeneA, GeneB, \ldots, GeneH)}{\sum_{Class} P(Class, GeneA, GeneB, \ldots, GeneH)},$$
  where the summation is taken over all the possible class values.
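The normalization on this slide can be sketched in Python. This is only an illustration: the function name `class_posterior`, the callable `joint`, and the toy probability table are assumptions, not part of the project material.

```python
def class_posterior(joint, evidence, class_values):
    """Return P(Class | evidence) by normalizing the joint over class values.

    `joint(c, evidence)` is a hypothetical callable returning
    P(Class=c, evidence).
    """
    unnormalized = {c: joint(c, evidence) for c in class_values}
    total = sum(unnormalized.values())  # summation over all class values
    return {c: p / total for c, p in unnormalized.items()}


# Toy joint over (Class, GeneA) with made-up probabilities that sum to 1.
TOY_JOINT = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

posterior = class_posterior(
    lambda c, e: TOY_JOINT[(c, e)], evidence=1, class_values=[0, 1]
)
print(posterior)
```

With GeneA observed as 1, the two class posteriors are 0.1/0.5 and 0.4/0.5.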
Knowing the Causal Structure
[Figure: the Bayesian network over Class and Gene A to Gene H]
Gene C regulates Gene E and F.
Gene D regulates Gene G and H.
Class has an effect on Gene F and G.
A set of comprehensible rules (or knowledge)
Learning Bayesian Networks
Metric approach
  Use a scoring metric to measure how well a particular structure fits an observed set of cases.
  A search algorithm is used.
  Find a canonical form of an equivalence class.
Independence approach
  An independence oracle (approximated by some statistical test) is queried to identify the equivalence class that captures the independencies in the distribution from which the data was generated.
  Search for a PDAG.
Scoring Metrics for Bayesian Networks
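Likelihood: $L(G, \Theta_G, C) = P(C \mid G^h, \Theta_G)$
  $G^h$: hypothesis that the given data ($C$) was generated by a distribution that can be factored according to $G$.
The maximum likelihood metric of $G$ (entropy metric with opposite sign)
$$
\begin{aligned}
\log M_{ML}(G, C) &= \max_{\Theta_G} \log L(G, \Theta_G, C) \\
&= \sum_{j=1}^{N} \log \hat{P}_G(\mathbf{x}_j) \\
&= N \sum_{i} \sum_{x_i, \mathbf{pa}_i} \hat{P}(x_i, \mathbf{pa}_i) \log \hat{P}(x_i \mid \mathbf{pa}_i) \\
&= \sum_{i} \sum_{x_i, \mathbf{pa}_i} N(x_i, \mathbf{pa}_i) \log \hat{P}(x_i \mid \mathbf{pa}_i)
\end{aligned}
$$
  This metric prefers the complete graph structure.
  $N$: data size
  $\mathbf{x}_j$: $j$th example
  $N(x_i, \mathbf{pa}_i)$: # of examples with $X_i = x_i$ and $\mathbf{Pa}_i = \mathbf{pa}_i$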
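As a rough illustration of the maximum likelihood metric, the count-based form (sum over variables and configurations of count times log of the estimated conditional probability) can be computed directly from data. The function name `log_ml_score`, the data layout (a list of dicts), and the four toy examples are assumptions made for this sketch.

```python
import math
from collections import Counter


def log_ml_score(data, parents):
    """Sum over variables of N(x_i, pa_i) * log P^(x_i | pa_i).

    data: list of dicts mapping variable name -> discrete value.
    parents: dict mapping variable name -> tuple of parent names.
    """
    score = 0.0
    for var, pa in parents.items():
        # Counts of (value, parent configuration) and of parent configuration.
        joint = Counter((row[var], tuple(row[p] for p in pa)) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        for (_, pa_cfg), n_xp in joint.items():
            score += n_xp * math.log(n_xp / marginal[pa_cfg])
    return score


# Made-up data set with two binary variables.
data = [{"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 0}]
empty = log_ml_score(data, {"A": (), "B": ()})        # no edges
with_edge = log_ml_score(data, {"A": (), "B": ("A",)})  # A -> B
print(empty, with_edge)
```

Adding the edge A → B does not lower the score here, which is the behavior behind the preference for the complete graph.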
Information Criterion Scoring Metrics
The Akaike information criterion (AIC) metric
$$M_{AIC}(G, C) = \log M_{ML}(G, C) - \mathrm{Dim}(G), \qquad \mathrm{Dim}(G) = \sum_{i} (r_i - 1)\, q_i$$
Bayesian information criterion (BIC) metric
$$M_{BIC}(G, C) = \log M_{ML}(G, C) - \frac{1}{2}\, \mathrm{Dim}(G) \log N$$
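A small sketch of how the two penalties differ, assuming the usual forms (AIC subtracts Dim(G); BIC subtracts half of Dim(G) times log N). The helper names and the numeric log-likelihood value are hypothetical.

```python
import math


def dim(r, q):
    """Dim(G) = sum_i (r_i - 1) * q_i, given per-variable state counts r
    and parent-configuration counts q."""
    return sum((r_i - 1) * q_i for r_i, q_i in zip(r, q))


def aic(log_ml, r, q):
    return log_ml - dim(r, q)


def bic(log_ml, r, q, n_samples):
    return log_ml - 0.5 * dim(r, q) * math.log(n_samples)


# Hypothetical network A -> B with binary variables: q_A = 1, q_B = 2.
r, q = [2, 2], [1, 2]
log_ml = -4.1589  # made-up maximum log-likelihood value
print(dim(r, q), aic(log_ml, r, q), bic(log_ml, r, q, n_samples=1000))
```

Under these forms the BIC penalty exceeds the AIC penalty once log N is larger than 2, so BIC favors sparser structures on the larger data sets.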
MDL Scoring Metrics
The minimum description length (MDL) metric 1
$$M_{MDL1}(G, C) = \log P(G) + M_{BIC}(G, C)$$
The minimum description length (MDL) metric 2
$$M_{MDL2}(G, C) = \log M_{ML}(G, C) - |E_G| \log N - c \cdot \mathrm{Dim}(G)$$
Bayesian Scoring Metrics
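A Bayesian metric
$$M(G, C, \xi) = \log P(G^h \mid \xi) + \log P(C \mid G^h, \xi) + c$$
The BDe (Bayesian Dirichlet & likelihood equivalence) metric
$$P(C \mid G^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}$$
Prior on the network structure
[Equation for $\log P(G)$ in terms of $n$ and $|\mathbf{Pa}_i|$; not recoverable from the extracted text]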
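The Bayesian Dirichlet marginal likelihood is usually evaluated in log space with the log-gamma function. The sketch below does that for complete discrete data. The prior counts N'_ijk = ess / (q_i r_i) are an assumed uniform (BDeu-style) choice, and the function names and data layout are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def log_bd_family(data, var, pa, states, ess=1.0):
    """Log BD term for one variable, assuming N'_ijk = ess / (q_i * r_i)."""
    r = len(states[var])
    pa_cfgs = list(product(*(states[p] for p in pa)))  # [()] when no parents
    q = len(pa_cfgs)
    n_ijk = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
    prior_ijk, prior_ij = ess / (q * r), ess / q
    total = 0.0
    for cfg in pa_cfgs:
        n_ij = sum(n_ijk[(cfg, x)] for x in states[var])
        total += math.lgamma(prior_ij) - math.lgamma(prior_ij + n_ij)
        for x in states[var]:
            total += math.lgamma(prior_ijk + n_ijk[(cfg, x)]) - math.lgamma(prior_ijk)
    return total


def log_bd(data, parents, states, ess=1.0):
    """Log marginal likelihood: sum of the per-variable terms."""
    return sum(log_bd_family(data, v, pa, states, ess) for v, pa in parents.items())


# One binary variable observed twice as 0, with prior counts N'_ijk = 1.
value = log_bd([{"A": 0}, {"A": 0}], {"A": ()}, {"A": [0, 1]}, ess=2.0)
print(math.exp(value))
```

For this toy case the marginal likelihood is Γ(2)/Γ(4) · Γ(3)/Γ(1) = 1/3, which matches a uniform prior on the probability of state 0.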
Greedy Search Algorithm
Generate an initial Bayesian network structure $G_0$.
For m = 1, 2, 3, …, until convergence:
  Among the possible local changes (insertion of an edge, reversal of an edge, and deletion of an edge) in $G_{m-1}$, the one that leads to the largest improvement in the score is performed. The resulting graph is $G_m$.
Stopping criterion
  Score($G_{m-1}$) == Score($G_m$).
At each iteration (learning a Bayesian network consisting of n variables), $O(n^2)$ local changes should be evaluated to select the best one.
Random restarts are usually adopted to escape local maxima.
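The loop above might look roughly like this in Python. It is a minimal sketch without random restarts: graphs are frozensets of directed edges, the search stops when no local change improves the score, and the `toy_score` function is a made-up stand-in for a real scoring metric.

```python
from itertools import permutations


def is_acyclic(nodes, edges):
    """Repeatedly remove nodes with no incoming edges; fail if none exist."""
    remaining, edges = set(nodes), set(edges)
    while remaining:
        roots = {v for v in remaining if not any(b == v for _, b in edges)}
        if not roots:
            return False
        remaining -= roots
        edges = {(a, b) for a, b in edges if a not in roots}
    return True


def neighbors(nodes, edges):
    """All graphs reachable by one edge insertion, deletion, or reversal."""
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            yield edges - {(a, b)}                 # deletion
            yield (edges - {(a, b)}) | {(b, a)}    # reversal
        elif (b, a) not in edges:
            yield edges | {(a, b)}                 # insertion


def greedy_search(nodes, score, initial=frozenset()):
    current = frozenset(initial)
    current_score = score(current)
    while True:
        candidates = [frozenset(g) for g in neighbors(nodes, current)
                      if is_acyclic(nodes, g)]
        best = max(candidates, key=score, default=current)
        if score(best) <= current_score:  # no improving local change
            return current, current_score
        current, current_score = best, score(best)


# Toy score (an assumption): negative distance to a fixed target edge set.
target = {("A", "B"), ("B", "C")}


def toy_score(edges):
    return -len(set(edges) ^ target)


learned, final_score = greedy_search(["A", "B", "C"], toy_score)
print(sorted(learned), final_score)
```

Each iteration enumerates all ordered node pairs, which is where the quadratic number of candidate local changes comes from.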
Project 3-1
Analysis of structural learning algorithms
  Data generation from ALARM network
    Various data set sizes, e.g., 1000, 3000, 5000, 10000.
  Structural learning of Bayesian network by greedy search (hill-climbing) with several kinds of scoring metrics
  Compare the results w.r.t. edge errors according to various sample sizes and learning methods
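One possible way to count edge errors between the true network and a learned one is sketched below. The split into missing, extra, and reversed edges is an assumption; the slides do not define the error categories, and the edge lists are made up.

```python
def edge_errors(true_edges, learned_edges):
    """Count missing, extra, and reversed edges of a learned structure."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    # Learned edges whose opposite direction is in the true network.
    reversed_edges = {(a, b) for (a, b) in learned_edges
                      if (b, a) in true_edges and (a, b) not in true_edges}
    extra = (learned_edges - true_edges) - reversed_edges
    missing = {(a, b) for (a, b) in true_edges - learned_edges
               if (b, a) not in learned_edges}
    return {"missing": len(missing), "extra": len(extra),
            "reversed": len(reversed_edges)}


true_net = [("A", "B"), ("B", "C"), ("C", "D")]
learned_net = [("A", "B"), ("C", "B"), ("A", "D")]
errors = edge_errors(true_net, learned_net)
print(errors)
```

Tabulating these counts per data set size and per scoring metric gives the comparison the project asks for.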
ALARM Network
# of nodes: 37
# of edges: 46
# of possible values of variable: 2 ~ 4 values
Data Generation
Using Netica (http://www.norsys.com)
Structural Learning
WEKA (http://www.cs.waikato.ac.nz/ml/weka/)
  http://www.cs.waikato.ac.nz/~remco/weka_bn/
Probabilistic Inference
Calculate the conditional probability given values of observed variables.
  Junction tree algorithm
  Sampling methods
General probabilistic inference is intractable. (It is known to be NP-hard.)
However, calculation of the conditional probability for classification is rather straightforward because of the property of Bayesian network structure (d-separation).
Markov Blanket
Variables of interest: $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$
For a variable $X_i$, its Markov blanket $\mathbf{MB}(X_i)$ is the subset of $\mathbf{X} - X_i$ which satisfies the following:
$$P(X_i \mid \mathbf{X} - X_i) = P(X_i \mid \mathbf{MB}(X_i))$$
Markov boundary: the minimal Markov blanket
Markov Blanket in Bayesian Network
Given a Bayesian network structure, determining the Markov blanket of a variable is straightforward: it follows from the conditional independence assertions.
[Figure: the Bayesian network over Class and Gene A to Gene H]
The Markov blanket of a node in the Bayesian network consists of all of its parents, spouses, and children.
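Reading the Markov blanket off a structure takes only a few set operations. In this sketch the edge list follows the factorization used on the classification slides (Class depends on Genes A and B, Gene F on C and Class, and so on); the function name and the edge-list representation are assumptions.

```python
def markov_blanket(edges, node):
    """Parents, children, and spouses (other parents of the children)."""
    parents = {a for a, b in edges if b == node}
    children = {b for a, b in edges if a == node}
    spouses = {a for a, b in edges if b in children and a != node}
    return parents | children | spouses


GENE_EDGES = [
    ("A", "Class"), ("B", "Class"),
    ("C", "E"), ("C", "F"), ("Class", "F"),
    ("Class", "G"), ("D", "G"), ("D", "H"),
]

blanket = markov_blanket(GENE_EDGES, "Class")
print(sorted(blanket))
```

For Class the blanket is its parents A and B, its children F and G, and the spouses C and D; Genes E and H are left out.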
Classification by Bayesian Networks II
Writing $A, \ldots, H$ for Gene A, ..., Gene H:
$$
\begin{aligned}
P(Class \mid A, B, C, D, E, F, G, H)
&= \frac{P(Class, A, B, C, D, E, F, G, H)}{\sum_{Class} P(Class, A, B, C, D, E, F, G, H)} \\
&= \frac{P(A)P(B)P(C)P(Class \mid A, B)P(D)P(E \mid C)P(F \mid C, Class)P(G \mid Class, D)P(H \mid D)}{\sum_{Class} P(A)P(B)P(C)P(Class \mid A, B)P(D)P(E \mid C)P(F \mid C, Class)P(G \mid Class, D)P(H \mid D)} \\
&= \frac{P(A)P(B)P(C)P(Class \mid A, B)P(D)P(F \mid C, Class)P(G \mid Class, D)}{\sum_{Class} P(A)P(B)P(C)P(Class \mid A, B)P(D)P(F \mid C, Class)P(G \mid Class, D)} \\
&= \frac{P(Class \mid A, B)\,P(F \mid C, Class)\,P(G \mid Class, D)}{\sum_{Class} P(Class \mid A, B)\,P(F \mid C, Class)\,P(G \mid Class, D)}
\end{aligned}
$$
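The cancellation in this derivation can be checked numerically. The sketch below builds the nine-factor network with randomly generated (made-up) conditional probability tables over binary variables, then compares the posterior from the full joint with the one that uses only the three factors mentioning Class. All names and values are illustrative assumptions.

```python
import random
from itertools import product

random.seed(0)
PARENTS = {
    "A": (), "B": (), "C": (), "D": (),
    "Class": ("A", "B"), "E": ("C",),
    "F": ("C", "Class"), "G": ("Class", "D"), "H": ("D",),
}


def random_cpt(n_parents):
    """P(X = 1 | parent configuration) for binary parents (made-up values)."""
    return {cfg: random.uniform(0.1, 0.9)
            for cfg in product([0, 1], repeat=n_parents)}


CPT = {v: random_cpt(len(pa)) for v, pa in PARENTS.items()}


def prob(var, assignment):
    """P(var = assignment[var] | parents of var as set in assignment)."""
    p1 = CPT[var][tuple(assignment[p] for p in PARENTS[var])]
    return p1 if assignment[var] == 1 else 1.0 - p1


def posterior(evidence, factors):
    """Normalize the product of the chosen factors over Class in {0, 1}."""
    scores = []
    for c in (0, 1):
        assignment = dict(evidence, Class=c)
        value = 1.0
        for v in factors:
            value *= prob(v, assignment)
        scores.append(value)
    return [s / sum(scores) for s in scores]


evidence = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 1, "F": 0, "G": 1, "H": 1}
full = posterior(evidence, list(PARENTS))             # all nine factors
blanket = posterior(evidence, ["Class", "F", "G"])    # factors containing Class
print(full, blanket)
```

The two posteriors agree up to floating-point error, because every factor that does not contain Class cancels between numerator and denominator.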
Project 3-2
Classification using Bayesian network
  Evaluate performance of Bayesian network classifier (classification accuracy)
  Various parameter settings, e.g., scoring metrics and learning methods
  If possible, compare with other learning methods such as neural networks and decision trees.
  Leave-one-out cross validation
Using WEKA
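The project itself uses WEKA, but the leave-one-out procedure is easy to state in code. The sketch below is a plain-Python illustration; the `train`/`predict` interface and the nearest-neighbor stand-in classifier are assumptions, not the Bayesian network classifier from the slides.

```python
def leave_one_out_accuracy(examples, labels, train, predict):
    """Hold out each example once, train on the rest, and count hits."""
    correct = 0
    for i in range(len(examples)):
        train_x = examples[:i] + examples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        correct += predict(model, examples[i]) == labels[i]
    return correct / len(examples)


# Stand-in classifier: 1-nearest neighbor under Hamming distance.
def train_1nn(xs, ys):
    return list(zip(xs, ys))


def predict_1nn(model, x):
    def hamming(pair):
        return sum(a != b for a, b in zip(pair[0], x))
    return min(model, key=hamming)[1]


# Made-up binary data where the label equals the first attribute.
examples = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
labels = [0, 0, 1, 1]
accuracy = leave_one_out_accuracy(examples, labels, train_1nn, predict_1nn)
print(accuracy)
```

With 120 examples this means 120 train-and-test rounds, each holding out a single sample.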
Molecular Biology: Central Dogma
DNA microarray
DNA Microarrays
Monitor thousands of gene expression levels simultaneously, in contrast to traditional one-gene experiments.
Fabricated by high-speed robotics.
Known probes
Types of DNA Microarrays
Oligonucleotide chips
  An array of oligonucleotide (20 ~ 80-mer oligos) probes is synthesized.
cDNA microarrays
  Probe cDNA (500 ~ 5,000 bases long) is immobilized to a solid surface.
Study
Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells, MH Cheok et al., Nature Genetics 35, 2003.
60 leukemia patients
Bone marrow samples
Affymetrix GeneChip arrays
Gene expression data
Gene Expression Data
# of data examples
  120 (60: before treatment, 60: after treatment)
# of genes measured
  12600 (Affymetrix HG-U95A array)
Task
  Classification between “before treatment” and “after treatment” based on gene expression pattern
Affymetrix GeneChip Arrays
Use short oligos to detect gene expression level.
  Each gene is probed by a set of short oligos.
  Each gene expression level is summarized by
    Signal: numerical value describing the abundance of mRNA
    A/P call: denotes the statistical significance of signal
Preprocessing
Remove the genes having more than 60 ‘A’ calls
  # of genes: 12600 → 3190
Discretization of gene expression level
  Criterion: median gene expression value of each sample
  0 (low) and 1 (high)
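A minimal sketch of the two preprocessing steps, under stated assumptions: call and signal matrices are indexed as `[gene][sample]`, the median is taken over the genes within each sample, and values strictly above the median map to 1. The function names and toy matrices are hypothetical.

```python
from statistics import median


def filter_absent(calls, max_absent=60):
    """Indices of genes with at most `max_absent` 'A' calls.

    calls[g][s] is 'A' or 'P' for gene g in sample s.
    """
    return [g for g, row in enumerate(calls) if row.count("A") <= max_absent]


def discretize_by_sample_median(expr):
    """Threshold each sample (column) at its own median over genes."""
    n_genes, n_samples = len(expr), len(expr[0])
    thresholds = [median(expr[g][s] for g in range(n_genes))
                  for s in range(n_samples)]
    return [[1 if expr[g][s] > thresholds[s] else 0 for s in range(n_samples)]
            for g in range(n_genes)]


# Made-up example: 3 genes by 2 samples.
expr = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
binary = discretize_by_sample_median(expr)
kept = filter_absent([["A", "A", "P"], ["P", "P", "A"]], max_absent=1)
print(binary, kept)
```

Whether a value equal to the median counts as low or high is not stated on the slide; the strict comparison here is one possible choice.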
Gene Filtering
Using mutual information
$$I(G; C) = \sum_{G, C} P(G, C) \log \frac{P(G, C)}{P(G)\, P(C)}$$
  Estimated probabilities were used.
  # of genes: 3190 → 50
Final dataset
  # of attributes: 51 (one for the class)
  Class: 0 (after treatment), 1 (before treatment)
  # of data examples: 120
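The mutual-information filter might be sketched as follows, with probabilities estimated by relative frequencies. The natural logarithm, the function names, and the toy vectors are assumptions; the slides do not state the log base or the estimator details.

```python
import math
from collections import Counter


def mutual_information(gene, cls):
    """I(G; C) from empirical frequencies of two discrete sequences."""
    n = len(gene)
    joint = Counter(zip(gene, cls))
    count_g, count_c = Counter(gene), Counter(cls)
    return sum(
        (n_gc / n) * math.log((n_gc / n) / ((count_g[g] / n) * (count_c[c] / n)))
        for (g, c), n_gc in joint.items()
    )


def top_k_genes(genes, cls, k):
    """Indices of the k genes with the highest mutual information."""
    ranked = sorted(range(len(genes)),
                    key=lambda i: mutual_information(genes[i], cls),
                    reverse=True)
    return ranked[:k]


# Made-up data: gene 1 copies the class, gene 0 is independent of it.
cls = [0, 0, 1, 1]
genes = [[0, 1, 0, 1], [0, 0, 1, 1]]
selected = top_k_genes(genes, cls, k=1)
print(selected, mutual_information(genes[1], cls))
```

In the project setting the same ranking would be applied to the 3190 discretized genes with k = 50.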
Final Dataset
[Figure: data matrix with 120 rows (data examples) and 51 columns (attributes)]
Submission
Deadline: 2004. 12. 2
Location: 301-419