PCluster: Probabilistic Agglomerative Clustering of Gene Expression Profiles. Nir Friedman. Presenting: Inbar Matarasso, 09/05/2005, The School of Computer Science, Tel Aviv University


PCluster: Probabilistic Agglomerative Clustering

of Gene Expression Profiles

Nir Friedman

Presenting: Inbar Matarasso

09/05/2005

The School of Computer Science

Tel Aviv University


Outline

A little about clustering

Mathematics background

Introduction

The problem

Notation

Scoring Method

Agglomerative clustering

Double clustering

Conclusion


A little about clustering

Partition entities (genes) into groups called clusters (according to similarity in their expression profiles across the probed conditions).

Clusters should be homogeneous and well separated.

Clustering problems arise in numerous disciplines, including biology, medicine, psychology, and economics.


Clustering – why?

Reduce the dimensionality of the problem: identify the major patterns in the dataset

Pattern recognition

Image processing

Economic science (especially market research)

WWW: document classification; clustering weblog data to discover groups of similar access patterns


Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

Earthquake studies: Observed earthquake epicenters should be clustered along continental faults


Types of clustering methods

How to choose a particular method?

1. The type of output desired

2. The known performance of the method with particular types of data

3. The hardware and software facilities available

4. The size of the dataset

In general, clustering methods may be divided into two categories based on the cluster structure which they produce: partitioning methods and hierarchical (agglomerative) methods.


Partitioning Methods

Partition the objects into a prespecified number of groups K

Iteratively reallocate objects to clusters until some criterion is met (e.g. minimize within cluster sums of squares)

Examples: k-means, partitioning around medoids (PAM), self-organizing maps (SOM), model-based clustering
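The iterative reallocation idea can be illustrated with k-means, one of the examples named above. This is a minimal sketch, not part of the original slides; the function name `kmeans`, the squared Euclidean distance, and the toy points are assumptions chosen for illustration.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means: reassign points, recompute centroids, stop when stable."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign the point to the nearest centroid (squared Euclidean distance).
            k = min(
                range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[k].append(p)
        # Recompute each centroid as the mean of its members (keep it if empty).
        new_centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else cen
            for c, cen in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # criterion met: no centroid moved
        centroids = new_centroids
    return centroids, clusters


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
centroids, clusters = kmeans(points, centroids=[[0, 0], [10, 10]])
print(centroids)
```

The number of clusters K is fixed by the length of the initial centroid list, which is exactly the prespecification that hierarchical methods avoid.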


Partitioning Methods

Result: M clusters, each object belonging to one cluster

Single Pass:

1. Make the first object the centroid for the first cluster.

2. For the next object, calculate the similarity, S, with each existing cluster centroid, using some similarity coefficient.

3. If the highest calculated S is greater than some specified threshold value, add the object to the corresponding cluster and redetermine the centroid; otherwise, use the object to initiate a new cluster. If any objects remain to be clustered, return to step 2.
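The three steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: cosine similarity stands in for the unspecified "similarity coefficient", the centroid is the running mean of the members, and the names `single_pass` and `cosine` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def single_pass(objects, threshold):
    """One pass over the data: join the most similar cluster or start a new one."""
    clusters = []   # member indices per cluster
    centroids = []  # mean vector per cluster
    for idx, obj in enumerate(objects):
        best, best_sim = None, -1.0
        for k, cen in enumerate(centroids):
            s = cosine(obj, cen)
            if s > best_sim:
                best, best_sim = k, s
        if best is not None and best_sim > threshold:
            clusters[best].append(idx)
            n = len(clusters[best])
            # Redetermine the centroid as the running mean of its members.
            centroids[best] = [c + (x - c) / n for c, x in zip(centroids[best], obj)]
        else:
            # No cluster is similar enough: the object initiates a new cluster.
            clusters.append([idx])
            centroids.append(list(obj))
    return clusters


profiles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(single_pass(profiles, threshold=0.8))  # [[0, 1], [2, 3]]
```

Because objects are assigned in input order and never revisited, shuffling `profiles` can change the result, which is the order dependence this method is known for.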


Partitioning Methods

This method requires only one pass through the dataset

The time requirements are typically of order O(NlogN) for order O(logN) clusters.

A disadvantage is that the resulting clusters are not independent of the order in which the objects are processed, with the first clusters formed usually being larger than those created later in the clustering run.


Hierarchical Clustering

Produce a dendrogram

Avoid prespecification of the number of clusters K

The tree can be built in two distinct ways:

Bottom-up: agglomerative clustering

Top-down: divisive clustering


Hierarchical Clustering

Organize the genes in a structure of a hierarchical tree

Initial step: each gene is regarded as a cluster with one item

Find the 2 most similar clusters and merge them into a common node

The length of the branch is proportional to the distance

Iterate on merging nodes until all genes are contained in one cluster, the root of the tree.

[Figure: dendrogram over genes g1, g2, g3, g4, g5 with internal nodes {1,2}, {4,5}, {1,2,3} and root {1,2,3,4,5}]
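The merge order in the dendrogram can be reproduced with a small bottom-up sketch. The slides do not say which distance or linkage is used, so average linkage over absolute differences is an assumption, as are the one-dimensional toy values assigned to g1..g5 and the name `agglomerate`.

```python
from itertools import combinations


def average_linkage(a, b, values):
    """Mean absolute difference between all cross-cluster pairs."""
    return sum(abs(values[i] - values[j]) for i in a for j in b) / (len(a) * len(b))


def agglomerate(values):
    """Merge the two closest clusters until one remains; return the merge order."""
    clusters = [frozenset([i]) for i in range(len(values))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], values),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(sorted(i + 1 for i in a | b))  # 1-based gene labels
    return merges


# Hypothetical expression values for g1..g5, chosen to mimic the figure.
print(agglomerate([1.0, 1.2, 2.0, 5.0, 5.3]))
# [[1, 2], [4, 5], [1, 2, 3], [1, 2, 3, 4, 5]]
```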


Partitioning vs. Hierarchical

Partitioning

Advantage: Provides clusters that satisfy some optimality criterion (approximately)

Disadvantages: Need initial K, long computation time

Hierarchical

Advantage: Fast computation (agglomerative)

Disadvantages: Rigid, cannot correct later for erroneous decisions made earlier


Mathematical evaluation of clustering solution

Merits of a ‘good’ clustering solution:

Homogeneity: genes inside a cluster are highly similar to each other. Measured as the average similarity between a gene and the center (average profile) of its cluster.

Separation: genes from different clusters have low similarity to each other. Measured as the weighted average similarity between centers of clusters.

These are conflicting features: increasing the number of clusters tends to improve within-cluster homogeneity at the expense of between-cluster separation.
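A rough sketch of how the two merits could be computed is given below. The slides do not fix the similarity measure or the weights, so cosine similarity and weights equal to the product of the two cluster sizes are assumptions; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def center(profiles):
    """Average profile of a cluster."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]


def homogeneity(clusters):
    """Average similarity between each gene and the center of its cluster."""
    total = sum(len(c) for c in clusters)
    score = 0.0
    for c in clusters:
        cen = center(c)
        score += sum(cosine(p, cen) for p in c)
    return score / total


def separation(clusters):
    """Size-weighted average similarity between cluster centers (lower is better)."""
    centers = [center(c) for c in clusters]
    num = den = 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            w = len(clusters[i]) * len(clusters[j])  # assumed weighting
            num += w * cosine(centers[i], centers[j])
            den += w
    return num / den if den else 0.0


clusters = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
print(homogeneity(clusters), separation(clusters))
```

With every gene in its own cluster, homogeneity is trivially maximal while the separation value is driven entirely by how similar individual genes are, which illustrates the conflict described above.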


Gaussian Distribution Function

Large number of events

describes physical events

approximates the exact binomial distribution of events

Distribution: Gaussian

Functional form: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-a)^2}{2\sigma^2}\right)$

Mean: a

Standard deviation: σ


Bayes' Theorem

$$p(A \mid X) = \frac{p(X \mid A)\,p(A)}{p(X \mid A)\,p(A) + p(X \mid {\sim}A)\,p({\sim}A)}$$

1% of women at age forty who participate in routine screening have breast cancer.  80% of women with breast cancer will get positive mammographies.  9.6% of women without breast cancer will also get positive mammographies.  A woman in this age group had a positive mammography in a routine screening.  What is the probability that she actually has breast cancer?


Bayes' Theorem

The correct answer is 7.8%, obtained as follows:  Out of 10,000 women, 100 have breast cancer; 80 of those 100 have positive mammographies.  From the same 10,000 women, 9,900 will not have breast cancer and of those 9,900 women, 950 will also get positive mammographies.  This makes the total number of women with positive mammographies 950+80 or 1,030.  Of those 1,030 women with positive mammographies, 80 will have cancer.  Expressed as a proportion, this is 80/1,030 or 0.07767 or 7.8%.
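The same calculation can be checked numerically. The sketch below plugs the three stated probabilities into Bayes' theorem; the function name `posterior` is hypothetical. The unrounded result is about 0.0776, slightly below 80/1,030, because 9.6% of 9,900 is 950.4 rather than the rounded 950 used in the counting argument.

```python
def posterior(prior, p_pos_given_a, p_pos_given_not_a):
    """p(A | positive) from the prior and the two conditional probabilities."""
    joint_a = p_pos_given_a * prior                   # p(positive & A)
    joint_not_a = p_pos_given_not_a * (1.0 - prior)   # p(positive & ~A)
    return joint_a / (joint_a + joint_not_a)


p = posterior(prior=0.01, p_pos_given_a=0.80, p_pos_given_not_a=0.096)
print(f"{p:.1%}")  # 7.8%
```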


Bayes' Theorem

p(cancer): 0.01 Group 1: 100 women with breast cancer

p(~cancer): 0.99 Group 2: 9900 women without breast cancer

p(positive|cancer): 80.0% 80% of women with breast cancer have positive mammographies

p(~positive|cancer): 20.0% 20% of women with breast cancer have negative mammographies

p(positive|~cancer): 9.6% 9.6% of women without breast cancer have positive mammographies

p(~positive|~cancer): 90.4% 90.4% of women without breast cancer have negative mammographies

p(cancer&positive): 0.008 Group A:  80 women with breast cancer and positive mammographies

p(cancer&~positive): 0.002 Group B: 20 women with breast cancer and negative mammographies

p(~cancer&positive): 0.095 Group C: 950 women without breast cancer and positive mammographies

p(~cancer&~positive): 0.895 Group D: 8950 women without breast cancer and negative mammographies

p(positive): 0.103 1030 women with positive results

p(~positive): 0.897 8970 women with negative results

p(cancer|positive): 7.80% Chance you have breast cancer if mammography is positive: 7.8%

p(~cancer|positive): 92.20% Chance you are healthy if mammography is positive: 92.2%

p(cancer|~positive): 0.22% Chance you have breast cancer if mammography is negative: 0.22%

p(~cancer|~positive): 99.78% Chance you are healthy if mammography is negative: 99.78%


Bayes' Theorem

To find the chance that a woman with positive mammography has breast cancer, we computed:

$$\frac{p(\text{positive} \mid \text{cancer})\,p(\text{cancer})}{p(\text{positive} \mid \text{cancer})\,p(\text{cancer}) + p(\text{positive} \mid {\sim}\text{cancer})\,p({\sim}\text{cancer})}$$

1. which is p(positive&cancer) / [p(positive&cancer) + p(positive&~cancer)]

2. which is p(positive&cancer) / p(positive)

3. which is p(cancer|positive)


Bayes' Theorem

The original proportion of patients with breast cancer is known as the prior probability.  The chance that a patient with breast cancer gets a positive mammography, and the chance that a patient without breast cancer gets a positive mammography, are known as the two conditional probabilities.  Collectively, this initial information is known as the priors.  The final answer - the estimated probability that a patient has breast cancer, given that we know she has a positive result on her mammography - is known as the revised probability or the posterior probability.


Bayes' Theorem

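$$p(A \mid X) = p(A \mid X)$$

$$p(A \mid X) = \frac{p(X \,\&\, A)}{p(X)}$$

$$p(A \mid X) = \frac{p(X \,\&\, A)}{p(X \,\&\, A) + p(X \,\&\, {\sim}A)}$$

$$p(A \mid X) = \frac{p(X \mid A)\,p(A)}{p(X \mid A)\,p(A) + p(X \mid {\sim}A)\,p({\sim}A)}$$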


Introduction

A central problem in analysis of gene expression data is clustering of genes with similar expression profiles.

We are going to get familiar with a hierarchical clustering procedure that is based on a simple probabilistic model.

Genes that are expressed similarly in each group of conditions are clustered together.


The problem

The goal of clustering is to identify groups of genes with “similar” expression patterns.

A group of genes is clustered together if their measured expression values could have been sampled from the same stochastic source with a high probability.

The user specifies in advance a partition of the experimental conditions.


Clustering Gene Expression Data

Cluster genes, e.g. to (attempt to) identify groups of co-regulated genes

Cluster samples, e.g. to identify tumors based on profiles

Cluster both at the same time

Can be helpful for identifying patterns in time or space

Useful (essential?) when seeking new subclasses of samples

Can be used for exploratory purposes


Notation

A matrix of gene expression measurements: $D = \{e_{g,c} : g \in \mathit{Genes},\ c \in \mathit{Conds}\}$

Genes is a set of genes, and Conds is a set of conditions.


Scoring Method

Given a partition C = {C1, …, Cm} of the conditions in Conds and a partition G = {G1, …, Gn} of the genes in Genes, we want to score the combined partition.

Assumption: if g and g′ are in the same gene cluster, and c and c′ are in the same condition cluster, then the expression values $e_{g,c}$ and $e_{g',c'}$ are sampled from the same distribution.


Scoring Method

Likelihood function:

Where θi,k are the parameters that describe the expression of genes in Gi in conditions in Ck.

L(G,C,θ:D) = L(G,C,θ:D’) for any choice of G and θ.
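The likelihood formula itself did not survive in this transcript. Under the assumption stated on the previous slide (all values in the same gene cluster and condition cluster share one distribution), a natural reading is that the likelihood factorizes over the cells of the combined partition. The form below is a reconstruction from that assumption and from the description of θi,k, not a quotation from the slide:

```latex
L(G, C, \theta : D) \;=\;
\prod_{i=1}^{n} \prod_{k=1}^{m}
\prod_{g \in G_i} \prod_{c \in C_k}
p\left(e_{g,c} \mid \theta_{i,k}\right)
```

Here $n$ and $m$ are the numbers of gene clusters and condition clusters, and $\theta_{i,k}$ parameterizes the distribution shared by every $e_{g,c}$ with $g \in G_i$ and $c \in C_k$.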


Scoring Method

The expression values are parameterized with a Gaussian distribution.
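As a hedged illustration of the Gaussian parameterization, the sketch below treats each cell (one gene cluster crossed with one condition cluster) as a sample from a single Gaussian with its own mean and variance, and evaluates its log-likelihood. The helper names and the toy matrix are assumptions for illustration.

```python
import math


def gaussian_logpdf(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)


def cell_values(data, gene_cluster, cond_cluster):
    """All expression values e[g][c] with g in the gene cluster, c in the condition cluster."""
    return [data[g][c] for g in gene_cluster for c in cond_cluster]


def cell_loglik(values, mu, var):
    """Log-likelihood of a cell under one shared Gaussian."""
    return sum(gaussian_logpdf(x, mu, var) for x in values)


def ml_params(values):
    """Maximum-likelihood mean and variance for a cell."""
    n = len(values)
    mu = sum(values) / n
    var = sum((x - mu) ** 2 for x in values) / n
    return mu, var


# Toy data: rows are genes, columns are conditions.
data = [[1.0, 2.0, 9.0], [3.0, 2.0, 8.0]]
vals = cell_values(data, gene_cluster=[0, 1], cond_cluster=[0, 1])
mu, var = ml_params(vals)
print(mu, var, cell_loglik(vals, mu, var))
```

Note that the maximum-likelihood variance is zero for a cell whose values are all identical, which makes the log-likelihood undefined; this is one practical reason to prefer an averaged (Bayesian) score.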


Scoring Method

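Using the previous parameterization, for each dataset we could choose the best parameter set, but scoring a partition by its best-fitting parameters overestimates how well it explains the data.

To compensate for this overestimate we use the Bayesian approach and average the likelihood over all possible parameter values.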


Scoring Method - Summary

Local score of a particular cell:
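The local score formula is missing from this transcript. As a hedged illustration of averaging a Gaussian likelihood over its parameters, the sketch below assumes a conjugate Normal-Gamma prior on the mean and precision of a cell, for which the averaged likelihood has a closed form. The prior choice and the hyperparameter names `mu0`, `kappa0`, `alpha0`, `beta0` are assumptions, not taken from the slides.

```python
import math


def log_marginal(xs, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Log of the Gaussian likelihood of xs averaged over a Normal-Gamma prior.

    Assumed prior: precision ~ Gamma(alpha0, rate=beta0),
    mean | precision ~ N(mu0, 1 / (kappa0 * precision)).
    """
    n = len(xs)
    if n == 0:
        return 0.0
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n)
    return (
        math.lgamma(alpha_n) - math.lgamma(alpha0)
        + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
        + 0.5 * (math.log(kappa0) - math.log(kappa_n))
        - 0.5 * n * math.log(2.0 * math.pi)
    )


tight = [2.0, 2.1, 1.9, 2.0]
spread = [2.0, 6.0, -3.0, 9.0]
print(log_marginal(tight), log_marginal(spread))
```

The score depends on the data only through the count, the mean, and the sum of squared deviations, so a cell's score can be updated from these sufficient statistics when two clusters are merged.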


Agglomerative Clustering

Given a partition C = {C1, …, Cm} of the conditions, one approach to learning a clustering of the genes is to use an agglomerative procedure.


Agglomerative Clustering

Start with G(1) = {G1, …, G|Genes|}, where each Gi is a singleton.

While t < |Genes| and G(t) contains more than a single cluster:

Compute the change in the score that results from merging the clusters Gi and Gj.


Agglomerative Clustering

Choose (it,jt) to be the pair of clusters whose merger is the most beneficial according to the score:

Define:

O(|Genes|^2 |C|)
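The greedy merge loop can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the cell score is a Gaussian likelihood averaged over an assumed Normal-Gamma prior (hypothetical hyperparameters), and the loop recomputes every pairwise merge score in each round, so it is simpler but slower than the quoted bound, which would need cached merge scores.

```python
import math
from itertools import combinations


def log_marginal(xs, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Averaged Gaussian log-likelihood under an assumed Normal-Gamma prior."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    kappa_n, alpha_n = kappa0 + n, alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n)
    return (math.lgamma(alpha_n) - math.lgamma(alpha0)
            + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
            + 0.5 * (math.log(kappa0) - math.log(kappa_n))
            - 0.5 * n * math.log(2.0 * math.pi))


def cluster_score(data, genes, cond_partition):
    """Score of one gene cluster: sum of its cell scores over condition clusters."""
    return sum(
        log_marginal([data[g][c] for g in genes for c in conds])
        for conds in cond_partition
    )


def agglomerate_genes(data, cond_partition):
    """Return the sequence of (partition, total score), from singletons to one cluster."""
    partition = [(g,) for g in range(len(data))]
    scores = {cl: cluster_score(data, cl, cond_partition) for cl in partition}
    history = [(list(partition), sum(scores.values()))]
    while len(partition) > 1:
        best = None
        for a, b in combinations(partition, 2):
            merged = a + b
            s = cluster_score(data, merged, cond_partition)
            delta = s - scores[a] - scores[b]  # change in score from merging a and b
            if best is None or delta > best[0]:
                best = (delta, a, b, merged, s)
        _, a, b, merged, s = best
        partition = [cl for cl in partition if cl not in (a, b)] + [merged]
        scores[merged] = s
        history.append((list(partition), sum(scores[cl] for cl in partition)))
    return history


data = [[0.1, 0.2, 5.0, 5.1], [0.0, 0.3, 5.2, 4.9],
        [3.0, 3.1, -2.0, -2.1], [2.9, 3.2, -1.9, -2.2]]
for part, score in agglomerate_genes(data, cond_partition=[[0, 1], [2, 3]]):
    print(part, round(score, 2))
```

The returned history holds one partition per merge step together with its total score, so it can be scanned afterwards for the highest-scoring partition.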


Double Clustering

We want the procedure to select for us the best partition:

1. Track the sequence of partitions G(1), …, G(|Genes|).

2. Select the partition with the highest score.

In theory: the maximum likelihood score should select G(1)

In practice: it selects a partition in a much later stage.

Intuition: the best-scoring partition strikes a tradeoff between finding groups of genes such that each group is homogeneous and ensuring that there are distinct differences between the groups.


Double Clustering

Cluster both genes and conditions at the same time:

1. Start with some partition of the conditions (say, the one where each is a singleton).

2. Perform gene agglomeration.

3. Select the “best” scoring gene partition.

4. Fix this gene partition.

5. Perform agglomeration on conditions.

Intuitively, each step improves the score, and thus this procedure should converge.
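The alternating procedure above can be sketched as follows. This is a rough illustration under several assumptions: the score is a Gaussian likelihood averaged over an assumed Normal-Gamma prior, each agglomeration restarts from singletons and keeps the best partition on its greedy path, and the steps are repeated for a hypothetical `rounds` parameter until the condition partition stops changing. The brute-force scoring is only practical for tiny matrices.

```python
import math
from itertools import combinations


def log_marginal(xs, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Averaged Gaussian log-likelihood under an assumed Normal-Gamma prior."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    kappa_n, alpha_n = kappa0 + n, alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n)
    return (math.lgamma(alpha_n) - math.lgamma(alpha0)
            + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
            + 0.5 * (math.log(kappa0) - math.log(kappa_n))
            - 0.5 * n * math.log(2.0 * math.pi))


def total_score(data, gene_part, cond_part):
    """Score of the combined partition: sum of cell scores."""
    return sum(log_marginal([data[g][c] for g in gs for c in cs])
               for gs in gene_part for cs in cond_part)


def best_agglomeration(items, score_of):
    """Greedy bottom-up merging; return the best-scoring partition on the path."""
    partition = [(i,) for i in items]
    best_part, best_score = partition, score_of(partition)
    while len(partition) > 1:
        candidates = []
        for a, b in combinations(partition, 2):
            merged = [cl for cl in partition if cl not in (a, b)] + [a + b]
            candidates.append((score_of(merged), merged))
        score, partition = max(candidates, key=lambda t: t[0])
        if score > best_score:
            best_part, best_score = partition, score
    return best_part, best_score


def double_cluster(data, rounds=3):
    """Alternate gene and condition agglomeration, fixing one while merging the other."""
    n_genes, n_conds = len(data), len(data[0])
    cond_part = [(c,) for c in range(n_conds)]  # step 1: singleton conditions
    gene_part, score = [(g,) for g in range(n_genes)], float("-inf")
    for _ in range(rounds):
        # Steps 2-4: agglomerate genes, keep the best partition, fix it.
        gene_part, _s = best_agglomeration(
            range(n_genes), lambda gp: total_score(data, gp, cond_part))
        # Step 5: agglomerate conditions with the gene partition fixed.
        new_cond, score = best_agglomeration(
            range(n_conds), lambda cp: total_score(data, gene_part, cp))
        if new_cond == cond_part:
            break
        cond_part = new_cond
    return gene_part, cond_part, score


data = [[0.1, 0.2, 5.0, 5.1], [0.0, 0.3, 5.2, 4.9],
        [3.0, 3.1, -2.0, -2.1], [2.9, 3.2, -1.9, -2.2]]
print(double_cluster(data))
```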


Particular features of our algorithm

We can measure a large number of genes.

The agglomerative clustering algorithm returns a hierarchical partition that describes similarities at different scales.

We use a likelihood function rather than a measure of similarity.

The user specifies in advance a partition of experimental conditions.


Conclusion

Partition entities into groups called clusters.

Clusters are homogeneous and well separated.

Bayes' Theorem:

$$p(A \mid X) = \frac{p(X \mid A)\,p(A)}{p(X \mid A)\,p(A) + p(X \mid {\sim}A)\,p({\sim}A)}$$

Partitions: C = {C1, …, Cm}, G = {G1, …, Gn}; we want to score the combined partition.

Likelihood function:


Conclusion

Agglomerative Clustering

The main advantage of this procedure is that it can take as input the “relevant” distinctions among the conditions


Questions?


References

[1] N. Friedman. PCluster: Probabilistic Agglomerative Clustering of Gene Expression Profiles. 2003

[2] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. J. Comp. Bio., 6(3-4):281–97, 1999.

[3] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genomewide expression patterns. PNAS, 95(25):14863–8, 1998.

[4] Eliezer Yudkowsky. An Intuitive Explanation of Bayesian Reasoning. 2003