TRANSCRIPT
Model-based clustering of gene expression data
Ka Yee Yeung1, Chris Fraley2, Alejandro Murua3, Adrian E. Raftery2,
and
Walter L. Ruzzo1
1Department of Computer Science & Engineering, University of Washington
2Department of Statistics, University of Washington
3Insightful Corporation
CAIMS 2001, UW CSE Computational Biology Group
Model-based clustering of gene expression data
Walter L. Ruzzo
University of Washington
Computer Science and Engineering
3
Overview
• Overview of gene expression data
• Model-based clustering
• Data sets
• Results
• Summary and Conclusions
4
• Each chip is one experiment:
– time course
– drug response
– tissue samples (normal/cancer)
– …
• Snapshot of activities of all genes in the cell
A gene expression data set: an n × p matrix X, with n genes as rows, p experiments as columns, and entry Xij the expression level of gene i in experiment j.
5
Analysis of expression data?
• Hard problem:
– High dimensionality
– Noise & artifacts
– Goals loosely defined, etc.
• Common exploratory analysis technique: Clustering
6
Why Cluster?
• Find biologically related genes
• Tissue classification
How to cluster?
• Many algorithms
• Most heuristic-based
• No clear winner
7
Overview
• Overview of gene expression data
• Model-based clustering
• Data sets
• Results
• Summary and Conclusions
8
Model-based clustering
• Mixture model:
– Each mixture component is generated by a specified underlying probability distribution
• Gaussian mixture model:
– Each component k is modeled by the multivariate normal distribution with parameters:
• Mean vector: μk
• Covariance matrix: Σk
• EM algorithm: used to estimate the parameters
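A minimal sketch of fitting such a Gaussian mixture by EM, using scikit-learn's GaussianMixture (the data here is simulated and stands in for an expression matrix; it is not from the talk):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic "expression" clusters in a 5-experiment space.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 5)),
    rng.normal(loc=4.0, scale=1.0, size=(100, 5)),
])

# EM estimates the mean vectors, covariance matrices, and mixing weights.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)

print(gmm.means_.shape)        # (2, 5): one mean vector per component
print(gmm.covariances_.shape)  # (2, 5, 5): one covariance matrix per component
```

With clusters this well separated, the posterior cluster memberships recover the two generating groups exactly.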
9
Covariance models (Banfield & Raftery 1993)
Σk = λk Dk Ak Dkᵀ
• λk: volume
• Dk: orientation
• Ak: shape
• Equal-volume spherical model (EI): Σk = λI (~ k-means)
• Unequal-volume spherical model (VI): Σk = λkI
• Diagonal model: Σk = λkBk, where Bk is diagonal with |Bk| = 1
• EEE elliptical model: Σk = λDADᵀ
• Unconstrained model (VVV): Σk = λkDkAkDkᵀ
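scikit-learn's GaussianMixture does not implement the Banfield–Raftery family exactly, but its covariance_type options give a feel for the same flexibility-versus-parameters trade-off. A sketch on toy data (the correspondences in the comments are an analogy, not an exact mapping to the MCLUST models):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (80, 4)), rng.normal(3, 1, (80, 4))])

# Rough analogues of the slide's models:
#   'spherical' ~ unequal-volume spherical (VI): one variance per component
#   'diag'      ~ diagonal model: per-component diagonal covariance
#   'tied'      ~ EEE: one full covariance shared by all components
#   'full'      ~ unconstrained (VVV): per-component full covariance
shapes = {}
for cov_type in ["spherical", "diag", "tied", "full"]:
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type,
                          random_state=0).fit(X)
    shapes[cov_type] = np.shape(gmm.covariances_)
print(shapes)
```

The shape of `covariances_` shows the parameter count growing from one scalar per component up to a full matrix per component.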
More flexible and more descriptive, but more parameters.
10
Advantages of model-based clustering
• Higher-quality clusters
• Flexible models
• Model selection: choose the right model and the right number of clusters
– Bayesian Information Criterion (BIC):
• Approximate Bayes factor: posterior odds for one model against another model
• A large BIC score indicates strong evidence for the corresponding model.
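Using BIC to pick the number of clusters can be sketched with scikit-learn (toy data; note a sign convention caveat: the slide's BIC score is maximized, while scikit-learn's `.bic()` returns the negated form, so there lower is better):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two clearly separated groups; the "right" answer is 2 clusters.
X = np.vstack([rng.normal(-3, 1, (120, 3)), rng.normal(3, 1, (120, 3))])

# Score candidate cluster counts. scikit-learn's .bic() is minimized,
# the opposite sign convention from the BIC score on the slide.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(best_k)
```

In a fuller comparison one would score every (model, number of clusters) pair, as the talk does across EI, VI, diagonal, EEE, and VVV.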
11
Overview
• Overview of gene expression data
• Model-based clustering
• Data sets
– Gene expression data sets
– Synthetic data sets
• Results
• Summary and Conclusions
12
Our Goal
• To show that the model-based approach has superior performance with respect to:
– Quality of clusters
– Number of clusters and model chosen (BIC)
13
Our Approach
• Validate the results from model-based clustering using data sets with external criteria (BIC scores do not require the external criteria)
• To compare clusters with an external criterion:
– Adjusted Rand index (Hubert and Arabie 1985)
– Adjusted Rand index = 1 means perfect agreement
– Two random partitions have an expected index of 0
• Compare the quality of clusters with a leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999)
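The adjusted Rand index's key properties can be illustrated with scikit-learn (toy labelings, not the talk's data):

```python
from sklearn.metrics import adjusted_rand_score

truth    = [0, 0, 0, 1, 1, 1, 2, 2, 2]
perfect  = [2, 2, 2, 0, 0, 0, 1, 1, 1]  # same partition, labels renamed
shuffled = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # every cluster mixes all classes

# The index depends only on the partition, not on label names.
print(adjusted_rand_score(truth, perfect))   # 1.0
# A partition unrelated to the truth scores near 0 (it can even go negative).
print(adjusted_rand_score(truth, shuffled))
```

This label-invariance is what makes the index usable for comparing clusterings against an external gold standard.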
14
Gene expression data sets
• Ovarian cancer data set (Michel Schummer, Institute of Systems Biology)
– Subset of data: 235 clones, 24 experiments (cancer/normal tissue samples)
– The 235 clones correspond to 4 genes
• Yeast cell cycle data (Cho et al. 1998)
– 17 time points
– Subset of 384 genes associated with 5 phases of the cell cycle
15
Synthetic data sets (both based on the ovary data)
• Gaussian mixture
– Generate multivariate normal distributions with the sample covariance matrix and mean vector of each class in the ovary data
• Randomly resampled ovary data
– For each class, randomly resample the expression levels in each experiment, independently
– Near-diagonal covariance matrix
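Both constructions can be sketched with NumPy. The data below is a random stand-in for one class of the ovary data, and "resample" is read here as a column-wise permutation, one plausible reading of the slide:

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy stand-in for one class: 50 clones x 24 experiments.
clones = rng.normal(size=(50, 24))

# Gaussian-mixture synthesis: draw from a multivariate normal with the
# class's sample mean vector and sample covariance matrix.
mean = clones.mean(axis=0)
cov = np.cov(clones, rowvar=False)
gauss_synth = rng.multivariate_normal(mean, cov, size=50)

# Resampling synthesis: permute expression levels within each experiment
# (column) independently. This preserves each experiment's marginal
# distribution but destroys between-experiment correlation, giving a
# near-diagonal covariance matrix.
resampled = np.column_stack([rng.permutation(clones[:, j])
                             for j in range(clones.shape[1])])

print(gauss_synth.shape, resampled.shape)
```

The first construction favors full-covariance models (VVV/EEE) by design; the second favors the diagonal model, which is the point of the comparison.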
16
Overview
• Overview of gene expression data
• Model-based clustering
• Data sets
• Results
– Synthetic data sets
– Expression data sets
• Summary and Conclusions
17
Results: Gaussian mixture based on ovary data (2350 genes)
• At 4 clusters, VVV, diagonal, and CAST achieve high adjusted Rand indices.
• BIC selects VVV at 4 clusters.
[Figure: adjusted Rand index vs. number of clusters (1–31) for EI, VI, VVV, diagonal, and CAST; BIC vs. number of clusters for EI, VI, VVV, and diagonal]
18
Results: randomly resampled ovary data
• The diagonal model achieves the maximum adjusted Rand index and BIC score (higher than CAST).
• BIC is maximized at 4 clusters.
• Confirms the expected result.
[Figure: adjusted Rand index vs. number of clusters (2–32) for EI, VI, VVV, diagonal, and CAST; BIC vs. number of clusters for EI, VI, and diagonal]
19
Results: square root ovary data
• Maximum adjusted Rand index: EEE at 4 clusters (> CAST).
• BIC selects VI at 8 clusters.
• The BIC curve shows a bend at 4 clusters for EEE.
[Figure: adjusted Rand index vs. number of clusters (2–32) for EI, VI, VVV, diagonal, CAST, and EEE; BIC vs. number of clusters for EI, VI, diagonal, and EEE]
20
Results: log yeast cell cycle data
• CAST achieves much higher adjusted Rand indices than most model-based approaches (except EEE).
• The BIC scores of EEE are much higher than those of the other models.
[Figure: adjusted Rand index vs. number of clusters (2–32) for EI, VI, VVV, diagonal, CAST, and EEE; BIC vs. number of clusters for EI, VI, diagonal, and EEE]
21
Results: standardized yeast cell cycle data
• EI achieves slightly higher adjusted Rand indices than CAST at 5 clusters.
• BIC selects EEE at 5 clusters.
[Figure: adjusted Rand index vs. number of clusters (2–32) for EI, VI, VVV, diagonal, CAST, and EEE; BIC vs. number of clusters for EI, VI, diagonal, and EEE]
22
Overview
• Overview of gene expression data
• Model-based clustering
• Data sets
• Results
• Summary and Conclusions
23
Summary and Conclusions
• Synthetic data sets:
– With the correct model, model-based clustering performs better than a leading heuristic clustering algorithm
– BIC selects the right model and the right number of clusters
• Real expression data sets:
– Adjusted Rand indices comparable to CAST
– BIC gives a hint about the number of clusters
• Time course data:
– Standardization makes the classes spherical
24
Future Work
• Custom refinements to the model-based implementation:
– Design models that incorporate specific information about the experiments, e.g., a block-diagonal covariance matrix
– Missing data
– Outliers & noise
25
Acknowledgements
• Ka Yee Yeung1, Chris Fraley2, Alejandro Murua3, Adrian E. Raftery2
1Department of Computer Science & Engineering, University of Washington
2Department of Statistics, University of Washington
3Insightful Corporation
• Michèl Schummer (Institute of Systems Biology)
– the ovary data set
• Jeremy Tantrum
– help with the MBC software (diagonal model)
27
Model-based clustering
• Mixture model:
– Assume each component is generated by an underlying probability distribution
– Likelihood for the mixture model:
L_mix(θ1, …, θG | y) = ∏_{i=1}^{n} ∑_{k=1}^{G} τk fk(yi | θk)
• Gaussian mixture model:
– Each component k is modeled by the multivariate normal distribution with parameters:
• Mean vector: μk
• Covariance matrix: Σk
28
Eigenvalue decomposition of the covariance matrix
• The covariance matrix Σk determines the geometric features of each component k.
• Banfield and Raftery (1993): Σk = λk Dk Ak Dkᵀ
– Dk: orientation of component k (orthogonal matrix)
– Ak: shape of component k (diagonal matrix)
– λk: volume of component k (scalar)
• Allowing some but not all of the parameters to vary yields many models within this framework
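The volume/orientation/shape decomposition can be verified numerically for a toy 2×2 covariance matrix, taking λ = |Σ|^(1/d) so that the shape matrix A has determinant 1 (one common convention; the matrix below is made up):

```python
import numpy as np

# Toy covariance matrix to decompose into volume * orientation * shape.
sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

eigvals, D = np.linalg.eigh(sigma)            # D: orthogonal (orientation)
lam = np.prod(eigvals) ** (1 / len(eigvals))  # volume: |Sigma|^(1/d)
A = np.diag(eigvals / lam)                    # shape: diagonal, det(A) = 1

reconstructed = lam * D @ A @ D.T
print(np.allclose(reconstructed, sigma))  # True
```

Constraining any of λ, D, or A to be shared across components (or fixed) produces the EI, VI, EEE, and VVV models of the earlier slides.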
29
EM algorithm
• General approach to maximum likelihood
• Iterate between E and M steps:
– E step: compute the probability of each observation belonging to each cluster using the current parameter estimates
– M step: estimate model parameters using the current group membership probabilities
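The two steps can be sketched for a toy one-dimensional two-component mixture. This is illustrative only: the data and starting values are made up, and production implementations are far more careful about initialization and degenerate variances:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])

# Initial guesses for means, standard deviations, and mixing proportions.
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def normal_pdf(v, m, s):
    return np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E step: responsibility of each component for each observation.
    dens = np.stack([pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)
    # M step: reestimate parameters from the responsibilities.
    nk = resp.sum(axis=1)
    mu = (resp * x).sum(axis=1) / nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
    pi = nk / len(x)

print(np.sort(mu))  # close to the generating means -2 and 2
```

Each iteration cannot decrease the mixture likelihood, which is the guarantee that makes EM a workhorse for fitting these models.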
30
Definition of the BIC score
• The integrated likelihood p(D | Mk) is hard to evaluate, where D is the data and Mk is the model.
• BIC is an approximation to 2 log p(D | Mk):
2 log p(D | Mk) ≈ 2 log p(D | θ̂k, Mk) − νk log(n) ≡ BICk
• νk: number of parameters to be estimated in model Mk
32
Adjusted Rand Example

            c#1(4)  c#2(5)  c#3(7)  c#4(4)
class#1(2)    2       0       0       0
class#2(3)    0       0       0       3
class#3(5)    1       4       0       0
class#4(10)   1       1       7       1

a (pairs in the same class and same cluster) = ∑ij C(nij, 2) = C(2,2) + C(3,2) + C(4,2) + C(7,2) = 31
b (same class, different clusters) = ∑i C(ni·, 2) − a = 59 − 31 = 28
c (different classes, same cluster) = ∑j C(n·j, 2) − a = 43 − 31 = 12
d (different classes, different clusters) = C(20, 2) − a − b − c = 190 − 71 = 119

Rand R = (a + d) / (a + b + c + d) = 150 / 190 = 0.789
Adjusted Rand = (R − E(R)) / (1 − E(R)) = 0.469
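The example's numbers (Rand = 0.789, Adjusted Rand = 0.469) can be recomputed directly from the contingency table; a sketch using the Hubert–Arabie formulation:

```python
from math import comb
import numpy as np

# Rows are true classes, columns are clusters, as in the example.
table = np.array([
    [2, 0, 0, 0],
    [0, 0, 0, 3],
    [1, 4, 0, 0],
    [1, 1, 7, 1],
])
n = int(table.sum())

index = sum(comb(int(nij), 2) for nij in table.flatten())       # a = 31
row_sum = sum(comb(int(r), 2) for r in table.sum(axis=1))       # 59
col_sum = sum(comb(int(c), 2) for c in table.sum(axis=0))       # 43
expected = row_sum * col_sum / comb(n, 2)
max_index = (row_sum + col_sum) / 2

ari = (index - expected) / (max_index - expected)

# The plain Rand index, for comparison.
d = comb(n, 2) - row_sum - col_sum + index  # pairs split in both partitions
rand = (index + d) / comb(n, 2)

print(round(rand, 3), round(ari, 3))  # 0.789 0.469
```

Chance correction is the whole point: the raw Rand index of 0.789 overstates the agreement that the adjusted index reports as 0.469.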
33
Results: mixture of normal distributions based on ovary data (235 genes)
• VVV and CAST: comparable adjusted Rand indices at 4 clusters.
• With only 235 genes, no BIC scores are available from VVV.
• With bigger data sets (2350 genes), BIC selects VVV at 4 clusters.
[Figure: adjusted Rand index vs. number of clusters (2–32) for EI, VI, VVV, diagonal, and CAST; BIC vs. number of clusters for EI, VI, and diagonal]
34
Results: log ovary data
• Maximum adjusted Rand index: VVV at 4 clusters, EEE at 4–6 clusters (higher than CAST).
• BIC selects VI at 9 clusters.
• The BIC curve shows a bend at 4 clusters for the diagonal model.
[Figure: adjusted Rand index vs. number of clusters (2–32) for EI, VI, VVV, diagonal, CAST, and EEE; BIC vs. number of clusters for EI, VI, diagonal, and EEE]
35
Results: standardized ovary data
• Comparable adjusted Rand indices for CAST, EI and EEE at 4 clusters.
• BIC selects EEE at 7 clusters.
[Figure: adjusted Rand index vs. number of clusters (2–32) for EI, VI, VVV, diagonal, CAST, and EEE; BIC vs. number of clusters for EI, VI, diagonal, and EEE]