
Model-based clustering of gene expression data

Ka Yee Yeung¹, Chris Fraley², Alejandro Murua³, Adrian E. Raftery², and Walter L. Ruzzo¹

¹Department of Computer Science & Engineering, University of Washington
²Department of Statistics, University of Washington
³Insightful Corporation

CAIMS 2001, UW CSE Computational Biology Group

Model-based clustering of gene expression data

Walter L. Ruzzo

University of Washington

Computer Science and Engineering

CAIMS 2001, UW CSE Computational Biology Group

3

Overview

• Overview of gene expression data
• Model-based clustering
• Data sets
• Results
• Summary and Conclusions

4

• Each chip is one experiment:
  – time course
  – drug response
  – tissue samples (normal/cancer)
  – …

• Snapshot of activities of all genes in the cell

A gene expression data set

[Figure: the data form an n × p matrix, with n genes as rows, p experiments as columns, and entries Xij giving the expression level of gene i in experiment j. A sketch of this layout follows below.]
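As a concrete illustration (an addition, not from the slides), such a data set is simply an n × p matrix; the file name and tab-delimited format below are assumptions.

```python
import numpy as np

# Hypothetical tab-delimited file: one row per gene, one column per experiment.
# X[i, j] is the expression level of gene i in experiment j.
X = np.loadtxt("expression_matrix.tab", delimiter="\t")
n_genes, p_experiments = X.shape
print(f"{n_genes} genes measured across {p_experiments} experiments")
```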

5

Analysis of expression data?

• Hard problem
  – High dimensionality
  – Noise & artifacts
  – Goals loosely defined, etc.

• Common exploratory analysis technique: Clustering

6

Why Cluster?

• Find biologically related genes
• Tissue classification

How to cluster?

• Many algorithms
• Most heuristic-based
• No clear winner

7

Overview

• Overview of gene expression data
• Model-based clustering
• Data sets
• Results
• Summary and Conclusions

8

Model-based clustering

• Mixture model:
  – Each mixture component is generated by a specified underlying probability distribution
• Gaussian mixture model:
  – Each component k is modeled by the multivariate normal distribution with parameters:
    • Mean vector: μ_k
    • Covariance matrix: Σ_k
• EM algorithm: used to estimate the parameters (see the sketch below)
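The slides describe fitting with the authors' model-based clustering (MCLUST) software; as a rough sketch of the same idea in Python (not the authors' implementation), scikit-learn's GaussianMixture estimates a Gaussian mixture by EM. The data, component count, and variable names below are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder expression matrix: n genes (rows) x p experiments (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Fit a 4-component Gaussian mixture by EM; each component k has a mean
# vector mu_k and a covariance matrix Sigma_k.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(X)

labels = gmm.predict(X)        # hard cluster assignments
post = gmm.predict_proba(X)    # posterior membership probabilities (E-step output)
print(gmm.means_.shape, gmm.covariances_.shape)  # (4, 10) and (4, 10, 10)
```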

9

Covariance models (Banfield & Raftery 1993)

Σ_k = λ_k D_k A_k D_k^T, where λ_k controls volume, D_k orientation, and A_k shape.

• Equal volume spherical model (EI): Σ_k = λI (~ k-means)
• Unequal volume spherical model (VI): Σ_k = λ_k I
• Diagonal model: Σ_k = λ_k B_k, where B_k is diagonal with |B_k| = 1
• EEE elliptical model: Σ_k = λ D A D^T
• Unconstrained model (VVV): Σ_k = λ_k D_k A_k D_k^T

Going down the list, the models become more flexible and more descriptive, but require more parameters (a parameter count is sketched below).
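To make the flexibility-versus-parameters trade-off concrete, here is a small tally of the free covariance parameters implied by each model. The counting function is my own addition, derived from the decomposition above rather than taken from the slides.

```python
# Free covariance parameters for each model, given G clusters in p dimensions.
# (Covariance structure only; means add G*p parameters and mixing proportions G-1.)
def covariance_params(G: int, p: int) -> dict:
    return {
        "EI   (lambda I)":                 1,
        "VI   (lambda_k I)":               G,
        "diag (lambda_k B_k)":             G * p,
        "EEE  (lambda D A D^T)":           p * (p + 1) // 2,
        "VVV  (lambda_k D_k A_k D_k^T)":   G * p * (p + 1) // 2,
    }

# Example: 4 clusters in 24 experiments (the ovary data dimensions).
for name, count in covariance_params(G=4, p=24).items():
    print(f"{name:34s} {count}")
```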

10

Advantages of model-based clustering

• Higher quality clusters
• Flexible models
• Model selection: choose the right model and the right number of clusters
  – Bayesian Information Criterion (BIC):
    • Approximate Bayes factor: posterior odds for one model against another model
    • A large BIC score indicates strong evidence for the corresponding model.
  (a model-selection sketch follows below)
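A hedged sketch of BIC-based model selection using scikit-learn rather than the authors' MCLUST code. Note that scikit-learn's bic() follows the opposite sign convention (smaller is better), so it is negated here to match the "larger is better" convention used on these slides; the grid of models and cluster counts is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))   # placeholder data; substitute the expression matrix

best = None
for cov in ["spherical", "diag", "tied", "full"]:
    for G in range(2, 11):
        gmm = GaussianMixture(n_components=G, covariance_type=cov,
                              random_state=0).fit(X)
        # sklearn's bic() is -2*logL + nu*log(n); negate so larger is better,
        # as on these slides.
        bic = -gmm.bic(X)
        if best is None or bic > best[0]:
            best = (bic, cov, G)

print("Best (BIC, covariance model, #clusters):", best)
```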

11

Overview

• Overview of gene expression data
• Model-based clustering
• Data sets
  – Gene expression data sets
  – Synthetic data sets
• Results
• Summary and Conclusions

12

Our Goal

• To show that the model-based approach has superior performance with respect to:
  – Quality of clusters
  – Number of clusters and model chosen (BIC)

13

Our Approach

• Validate the results of model-based clustering using data sets with external criteria (BIC scores do not require external criteria)
• To compare clusters against an external criterion:
  – Adjusted Rand index (Hubert and Arabie 1985)
  – Adjusted Rand index = 1 means perfect agreement
  – Two random partitions have an expected index of 0
• Compare the quality of clusters with a leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999)

14

Gene expression data sets

• Ovarian cancer data set (Michel Schummer, Institute of Systems Biology)
  – Subset of data: 235 clones × 24 experiments (cancer/normal tissue samples)
  – The 235 clones correspond to 4 genes
• Yeast cell cycle data (Cho et al. 1998)
  – 17 time points
  – Subset of 384 genes associated with 5 phases of the cell cycle

15

Synthetic data sets

Both are based on the ovary data (a sketch of the Gaussian scheme follows below).

• Gaussian mixture
  – Generate multivariate normal distributions with the sample covariance matrix and mean vector of each class in the ovary data
• Randomly resampled ovary data
  – For each class, randomly resample the expression levels in each experiment, independently
  – Near-diagonal covariance matrix
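A minimal sketch of the first synthetic-data scheme, assuming the real data arrive as a matrix X with a class label per gene; the function name and arguments are mine, not from the slides.

```python
import numpy as np

def gaussian_mixture_synthetic(X, labels, rng=None):
    """For each class, draw as many multivariate-normal samples as the class has
    members, using that class's sample mean vector and sample covariance matrix."""
    rng = np.random.default_rng(rng)
    parts, classes = [], []
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu = Xc.mean(axis=0)
        Sigma = np.cov(Xc, rowvar=False)
        parts.append(rng.multivariate_normal(mu, Sigma, size=len(Xc)))
        classes.append(np.full(len(Xc), c))
    return np.vstack(parts), np.concatenate(classes)
```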

16

Overview

• Overview of gene expression data
• Model-based clustering
• Data sets
• Results
  – Synthetic data sets
  – Expression data sets

• Summary and Conclusions

17

Results: Gaussian mixture based on ovary data (2350 genes)

• At 4 clusters, VVV, diagonal, and CAST all achieve high adjusted Rand indices.
• BIC selects VVV at 4 clusters.

[Figure: adjusted Rand index and BIC versus number of clusters (1-31); adjusted Rand shown for EI, VI, VVV, diagonal, and CAST, BIC for EI, VI, VVV, and diagonal.]

18

Results: randomly resampled ovary data

• The diagonal model achieves the maximum adjusted Rand index and BIC score (higher than CAST).
• BIC is maximized at 4 clusters.
• This confirms the expected result.

[Figure: adjusted Rand index and BIC versus number of clusters (2-32); adjusted Rand shown for EI, VI, VVV, diagonal, and CAST, BIC for EI, VI, and diagonal.]

19

Results: square root ovary data

• Maximum adjusted Rand index: EEE at 4 clusters (higher than CAST).
• BIC selects VI at 8 clusters.
• The BIC curve shows a bend at 4 clusters for EEE.

[Figure: adjusted Rand index and BIC versus number of clusters (2-32); adjusted Rand shown for EI, VI, VVV, diagonal, EEE, and CAST, BIC for EI, VI, diagonal, and EEE.]

20

Results: log yeast cell cycle data

• CAST achieves much higher adjusted Rand indices than most model-based approaches (except EEE).
• BIC scores for EEE are much higher than for the other models.

[Figure: adjusted Rand index and BIC versus number of clusters (2-32); adjusted Rand shown for EI, VI, VVV, diagonal, EEE, and CAST, BIC for EI, VI, diagonal, and EEE.]

21

Results: standardized yeast cell cycle data

• EI gives slightly higher adjusted Rand indices than CAST at 5 clusters.
• BIC selects EEE at 5 clusters.

[Figure: adjusted Rand index and BIC versus number of clusters (2-32); adjusted Rand shown for EI, VI, VVV, diagonal, EEE, and CAST, BIC for EI, VI, diagonal, and EEE.]

22

Overview

• Overview of gene expression data
• Model-based clustering
• Data sets
• Results
• Summary and Conclusions

23

Summary and Conclusions

• Synthetic data sets:
  – With the correct model, model-based clustering performs better than a leading heuristic clustering algorithm.
  – BIC selects the right model and the right number of clusters.
• Real expression data sets:
  – Adjusted Rand indices comparable to CAST.
  – BIC gives a hint about the number of clusters.
• Time course data:
  – Standardization makes the classes spherical.

24

Future Work

• Custom refinements to the model-based implementation:
  – Design models that incorporate specific information about the experiments, e.g., a block-diagonal covariance matrix
  – Missing data
  – Outliers & noise

25

Acknowledgements

• Ka Yee Yeung¹, Chris Fraley², Alejandro Murua³, Adrian E. Raftery²
  ¹Department of Computer Science & Engineering, University of Washington
  ²Department of Statistics, University of Washington
  ³Insightful Corporation
• Michèl Schummer (Institute of Systems Biology)
  – the ovary data set
• Jeremy Tantrum
  – help with MBC software (diagonal model)

26

More Info

http://www.cs.washington.edu/homes/ruzzo

CAIMS 2001, UW CSE Computational Biology Group

27

Model-based clustering

• Mixture model:
  – Assume each component is generated by an underlying probability distribution.
  – Likelihood for the mixture model (a sketch evaluating it follows below):

    L_mix(θ_1, …, θ_G | y_1, …, y_n) = ∏_{i=1}^{n} ∑_{k=1}^{G} τ_k f_k(y_i | θ_k)

• Gaussian mixture model:
  – Each component k is modeled by the multivariate normal distribution with parameters:
    • Mean vector: μ_k
    • Covariance matrix: Σ_k
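A small sketch evaluating the reconstructed mixture log-likelihood with SciPy; the parameter names (tau, mus, Sigmas) are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(Y, tau, mus, Sigmas):
    """log L_mix = sum_i log( sum_k tau_k * N(y_i | mu_k, Sigma_k) )."""
    densities = np.column_stack([
        multivariate_normal.pdf(Y, mean=mu, cov=Sigma)
        for mu, Sigma in zip(mus, Sigmas)
    ])                                     # shape (n, G)
    return np.sum(np.log(densities @ np.asarray(tau)))
```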

28

Eigenvalue decomposition of the covariance matrix

• The covariance matrix Σ_k determines the geometric features of each component k.
• Banfield and Raftery (1993): Σ_k = λ_k D_k A_k D_k^T, where
  – D_k: orientation of component k (orthogonal matrix)
  – A_k: shape of component k (diagonal matrix)
  – λ_k: volume of component k (scalar)
• Allowing some, but not all, of the parameters to vary yields many models within this framework (a decomposition sketch follows below).
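A sketch of the volume/shape/orientation decomposition for a single covariance matrix, using NumPy's symmetric eigendecomposition; the normalization follows the |A| = 1 convention stated above.

```python
import numpy as np

def volume_shape_orientation(Sigma):
    """Decompose Sigma = lambda * D * A * D^T (Banfield & Raftery 1993):
    lambda is a scalar volume, A a diagonal shape matrix with |A| = 1,
    and D an orthogonal orientation matrix."""
    eigvals, D = np.linalg.eigh(Sigma)        # Sigma = D diag(eigvals) D^T
    p = Sigma.shape[0]
    lam = np.prod(eigvals) ** (1.0 / p)       # volume: geometric mean of eigenvalues
    A = np.diag(eigvals / lam)                # shape: normalized so det(A) = 1
    return lam, A, D
```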

29

EM algorithm

• A general approach to maximum likelihood.
• Iterate between E and M steps (both steps are sketched below):
  – E-step: compute the probability of each observation belonging to each cluster, using the current parameter estimates.
  – M-step: estimate the model parameters, using the current group membership probabilities.
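A minimal from-scratch sketch of the two steps for the unconstrained (VVV) Gaussian mixture, assuming data Y and current parameters tau, mus, Sigmas; in practice the steps are alternated until the log-likelihood converges. This is an illustration, not the MCLUST implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(Y, tau, mus, Sigmas):
    """E-step: posterior probability z[i, k] that observation i belongs to cluster k."""
    dens = np.column_stack([tau[k] * multivariate_normal.pdf(Y, mus[k], Sigmas[k])
                            for k in range(len(tau))])
    return dens / dens.sum(axis=1, keepdims=True)

def m_step(Y, z):
    """M-step (VVV model): update mixing proportions, means, and covariances."""
    n, G = z.shape
    tau = z.sum(axis=0) / n
    mus, Sigmas = [], []
    for k in range(G):
        w = z[:, k]
        mu = (w @ Y) / w.sum()
        diff = Y - mu
        Sigmas.append((w[:, None] * diff).T @ diff / w.sum())  # weighted ML estimate
        mus.append(mu)
    return tau, mus, Sigmas
```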

30

Definition of the BIC score

• The integrated likelihood p(D | M_k) is hard to evaluate, where D is the data and M_k is the model.
• BIC is an approximation to 2 log p(D | M_k):

  2 log p(D | M_k) ≈ 2 log p(D | θ̂_k, M_k) - ν_k log(n) ≡ BIC_k

• ν_k: number of parameters to be estimated in model M_k
  (a direct translation into code follows below)
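The slide's definition translated directly into a small helper; the example parameter count in the comment uses the standard tally for the VVV model (mixing proportions, means, and full covariances), which is my own addition rather than something stated on the slide.

```python
import numpy as np

def bic_score(log_likelihood_hat, n_params, n_obs):
    """BIC_k = 2 * log p(D | theta_hat_k, M_k) - nu_k * log(n); larger favors the model."""
    return 2.0 * log_likelihood_hat - n_params * np.log(n_obs)

# Example count for the VVV model with G clusters in p dimensions (assumed tally):
#   nu = (G - 1) + G*p + G*p*(p + 1)/2
```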

31

Randomly resampled synthetic data set

[Figure: the ovary data matrix (genes × experiments, rows grouped by class) alongside the synthetic data matrix obtained by resampling expression levels within each class and experiment. A resampling sketch follows below.]
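A sketch of the second synthetic-data scheme described on the earlier "Synthetic data sets" slide, assuming a matrix X and per-gene class labels as before; the function and variable names are mine.

```python
import numpy as np

def resample_within_classes(X, labels, rng=None):
    """For each class and each experiment (column), independently resample the
    observed expression levels with replacement. This breaks correlations between
    experiments, so the true covariance structure becomes near diagonal."""
    rng = np.random.default_rng(rng)
    X_new = np.empty_like(X)
    for c in np.unique(labels):
        rows = np.where(labels == c)[0]
        for j in range(X.shape[1]):
            X_new[rows, j] = rng.choice(X[rows, j], size=len(rows), replace=True)
    return X_new
```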

32

Adjusted Rand Example

               c#1 (4)   c#2 (5)   c#3 (7)   c#4 (4)
class#1 (2)       2         0         0         0
class#2 (3)       0         0         0         3
class#3 (5)       1         4         0         0
class#4 (10)      1         1         7         1

Rand = (a + d) / (a + b + c + d) = 0.789
Adjusted Rand = (Rand - E(Rand)) / (1 - E(Rand)) = 0.469

(Here a counts pairs of objects placed together in both partitions, d pairs placed apart in both, and b, c the disagreements; a verification sketch follows below.)
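A short verification of the worked example using the standard Hubert & Arabie formula; it reproduces the Rand and adjusted Rand values recovered from the slide (0.789 and 0.469).

```python
import numpy as np
from math import comb

# Contingency table from the slide: rows = true classes, columns = clusters.
table = np.array([[2, 0, 0, 0],
                  [0, 0, 0, 3],
                  [1, 4, 0, 0],
                  [1, 1, 7, 1]])

n = int(table.sum())
sum_cells = sum(comb(int(x), 2) for x in table.ravel())        # pairs together in both
sum_rows  = sum(comb(int(x), 2) for x in table.sum(axis=1))
sum_cols  = sum(comb(int(x), 2) for x in table.sum(axis=0))
total_pairs = comb(n, 2)
expected = sum_rows * sum_cols / total_pairs

a = sum_cells
b = sum_rows - a
c = sum_cols - a
d = total_pairs - a - b - c
rand = (a + d) / total_pairs
ari = (sum_cells - expected) / (0.5 * (sum_rows + sum_cols) - expected)
print(round(rand, 3), round(ari, 3))   # 0.789 0.469, matching the slide
```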

33

Results: mixture of normal distributions based on ovary data (235 genes)

• VVV and CAST achieve comparable adjusted Rand indices at 4 clusters.
• With only 235 genes, no BIC scores are available for VVV.
• With a bigger data set (2350 genes), BIC selects VVV at 4 clusters.

[Figure: adjusted Rand index and BIC versus number of clusters (2-32); adjusted Rand shown for EI, VI, VVV, diagonal, and CAST, BIC for EI, VI, and diagonal.]

34

Results: log ovary data

• Maximum adjusted Rand index: VVV at 4 clusters and EEE at 4-6 clusters (higher than CAST).
• BIC selects VI at 9 clusters.
• The BIC curve shows a bend at 4 clusters for the diagonal model.

[Figure: adjusted Rand index and BIC versus number of clusters (2-32); adjusted Rand shown for EI, VI, VVV, diagonal, EEE, and CAST, BIC for EI, VI, diagonal, and EEE.]

35

Results: standardized ovary data

• Comparable adjusted Rand indices for CAST, EI, and EEE at 4 clusters.
• BIC selects EEE at 7 clusters.

[Figure: adjusted Rand index and BIC versus number of clusters (2-32); adjusted Rand shown for EI, VI, VVV, diagonal, EEE, and CAST, BIC for EI, VI, diagonal, and EEE.]

36

Key advantage of the model-based approach: choose the right model and the number of clusters

• Bayesian Information Criterion (BIC):
  – Approximate Bayes factor: posterior odds for one model against another model
  – A large BIC score indicates strong evidence for the corresponding model.