

Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics
Day 7: Clustering in Bioinformatics

Karsten Borgwardt

February 21 to March 4, 2011

Machine Learning & Computational Biology Research Group
MPIs Tübingen

Clustering in bioinformatics

Microarrays
Clustering is a widely used tool in microarray analysis
Class discovery is an important problem in microarray studies for two reasons:
  either the classes are completely unknown beforehand
  or it is unknown whether a known class contains interesting subclasses

Clustering in bioinformatics

Examples
Classes unknown:
  Does a disease affect gene expression in a particular tissue?
  Does gene expression differ between two groups in a particular condition?

Subclasses unknown:
  Are there subtypes of a disease?
  Is there even a hierarchy of subclasses within one disease?

Clustering in bioinformatics

Popularity
Clustering tools are available in the large microarray database NCBI Gene Expression Omnibus (GEO)
http://www.ncbi.nlm.nih.gov/geo/
3002 PubMed hits for ’microarray clustering’
Recent editorial of OUP Bioinformatics

Distance metrics

Euclidean distance
Euclidean distance of gene x and y of n samples or sample x and y of n genes:

$$d_{xy} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$

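Pearson’s Correlation
Pearson correlation of gene x and y of n samples or sample x and y of n genes, where $\bar{x}$ is the mean of x and $\bar{y}$ is the mean of y:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (2)$$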
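Equations (1) and (2) can be sketched in a few lines of plain Python. This is only an illustration: the function names and the toy profiles `gene_a` and `gene_b` are hypothetical and do not come from the slides.

```python
import math


def euclidean_distance(x, y):
    """Equation (1): square root of the summed squared differences."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def pearson_correlation(x, y):
    """Equation (2): correlation of the mean-centered profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return numerator / denominator


# Toy expression profiles (made up): gene_b is gene_a scaled by 2.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
print(euclidean_distance(gene_a, gene_b))   # nonzero: the magnitudes differ
print(pearson_correlation(gene_a, gene_b))  # close to 1: the shapes agree
```

The toy pair shows why the choice matters: the two profiles are far apart in Euclidean terms but perfectly correlated.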

Distance metrics

Un-centered correlation coefficient
Un-centered correlation coefficient of gene x and y of n samples or sample x and y of n genes:

$$r^{u}_{xy} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (3)$$
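A small sketch of equation (3) in plain Python, with a hypothetical function name and made-up profiles. Because the means are not subtracted, the coefficient is the cosine of the angle between the two profiles, so a constant offset changes its value.

```python
import math


def uncentered_correlation(x, y):
    """Equation (3): like Pearson's correlation, but without subtracting the means."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    numerator = sum(xi * yi for xi, yi in zip(x, y))
    denominator = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    if denominator == 0:
        raise ValueError("coefficient is undefined for an all-zero profile")
    return numerator / denominator


# Made-up profiles: `shifted` is `profile` plus a constant offset of 10.
profile = [1.0, 2.0, 3.0]
shifted = [11.0, 12.0, 13.0]
print(uncentered_correlation(profile, shifted))  # below 1: the offset is not removed
```

Pearson's correlation of the same pair would be 1, since centering removes the offset.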

Clustering algorithms

Hierarchical Clustering
Single linkage: The linking distance is the minimum distance between two clusters.
Complete linkage: The linking distance is the maximum distance between two clusters.
Average linkage/UPGMA: The linking distance is the average of all pair-wise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
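The three linkage rules differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance between members; the function names and the toy clusters are hypothetical.

```python
import math
from itertools import product


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def linkage_distance(cluster_a, cluster_b, method="single", dist=euclidean):
    """Linking distance between two clusters under the chosen linkage rule."""
    pairwise = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":      # minimum over all cross-cluster pairs
        return min(pairwise)
    if method == "complete":    # maximum over all cross-cluster pairs
        return max(pairwise)
    if method == "average":     # UPGMA: unweighted mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


# Toy clusters on a line; the cross-cluster distances are 3, 5, 2 and 4.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
for method in ("single", "complete", "average"):
    print(method, linkage_distance(cluster_a, cluster_b, method))
```

An agglomerative procedure would repeatedly merge the two clusters with the smallest linking distance; only the distance rule is shown here.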

‘Flat’ Clustering
k-means (k from 2 to 15, 3 runs)
k-median (k-medoid)
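A minimal k-means sketch with several restarts, assuming Lloyd's algorithm with random initial centers and keeping the run with the lowest within-cluster sum of squares. The slides do not say how the runs are combined, so `best_of_runs` and every other name here is an assumption for illustration.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points]


def kmeans(points, k, iterations=100, seed=0):
    """One run of Lloyd's algorithm; returns (labels, centers, cost)."""
    points = [tuple(p) for p in points]
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct data points
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    cost = sum(squared_distance(p, centers[label]) for p, label in zip(points, labels))
    return labels, centers, cost


def best_of_runs(points, k, runs=3):
    """Repeat k-means with different seeds and keep the lowest-cost run."""
    return min((kmeans(points, k, seed=s) for s in range(runs)), key=lambda r: r[2])


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, cost = best_of_runs(points, k=2, runs=3)
print(labels, cost)
```

Scanning k over a range, as on the slide, would mean calling `best_of_runs` once per k and comparing the resulting clusterings.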

The two-sample problem

Interpretation of clusters
Clustering introduces ‘structure’ into microarray datasets
But is there a statistical or biomedical meaning of these classes?
Biomedical meaning has to be established in experiments
‘Statistical meaning’ can be measured using statistical tests, by a so-called two-sample test

A two-sample test decides whether two samples were drawn from the same probability distribution or not

The two-sample problem

Data diversity
Molecular biology produces a wealth of information
The problem is that these data are generated
  on different platforms and
  by different protocols
  under different levels of noise

Hence data from different labs show
  different scales
  different ranges
  different distributions

Main problem:
Joint data analysis may detect differences in distributions, not biological phenomena!

The two-sample problem

The two-sample problem
Given two samples X and Y.
Were they generated by the same distribution?

Previous approaches
two-sample tests exist for univariate and multivariate data

The two-sample problem

t-test
A test of the null hypothesis that the means of two normally distributed populations are equal
unpaired/independent (versus paired)
For equal sample sizes and equal variances, the t statistic to test whether the means are different can be calculated as follows:

$$t = \frac{\bar{x} - \bar{y}}{\sigma_{xy} \cdot \sqrt{\frac{2}{n}}} \qquad (4)$$

where $\sigma_{xy} = \sqrt{\frac{\sigma_x^2 + \sigma_y^2}{2}}$.
This test has 2n − 2 degrees of freedom, where n is the size of each sample.
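Equation (4) can be computed directly with the Python standard library. In this sketch the variances are taken to be the unbiased sample variances (`statistics.variance`), which is an assumption; the function name and the toy samples are hypothetical.

```python
import math
import statistics


def equal_size_t_statistic(x, y):
    """t statistic of equation (4) and its degrees of freedom, for equal sample sizes."""
    if len(x) != len(y):
        raise ValueError("equation (4) assumes equal sample sizes")
    n = len(x)
    if n < 2:
        raise ValueError("each sample needs at least two observations")
    # Pooled standard deviation sigma_xy; sample variances are an assumption here.
    pooled_sd = math.sqrt((statistics.variance(x) + statistics.variance(y)) / 2)
    t = (statistics.mean(x) - statistics.mean(y)) / (pooled_sd * math.sqrt(2 / n))
    return t, 2 * n - 2


# Toy samples (made up): y is x shifted by one unit.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 4.0, 5.0, 6.0]
t, dof = equal_size_t_statistic(x, y)
print(t, dof)
```

Turning t into a p-value requires the cumulative distribution of Student's t with 2n − 2 degrees of freedom, which the standard library does not provide; a statistics package or a table would be needed for that step.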

The two-sample problem

New challenges in bioinformatics
high-dimensional
structured (strings and graphs)
low sample size

Novel distribution test: Maximum Mean Discrepancy (MMD)


MMD key idea

Key Idea
Avoid density estimator, use means in feature spaces
Maximum Mean Discrepancy (Fortet and Mourier, 1953)

$$D(p, q, \mathcal{F}) := \sup_{f \in \mathcal{F}} \bigl( \mathbf{E}_p[f(x)] - \mathbf{E}_q[f(y)] \bigr)$$

Theorem
$D(p, q, \mathcal{F}) = 0$ iff $p = q$, when $\mathcal{F} = C^0(\mathcal{X})$.
Follows directly, e.g. from Dudley, 1984.

Theorem
$D(p, q, \mathcal{F}) = 0$ iff $p = q$, when $\mathcal{F} = \{f \mid \|f\|_{\mathcal{H}} \le 1\}$, provided that $\mathcal{H}$ is a universal RKHS.
(follows via Steinwart, 2001, Smola et al., 2006).
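Why the unit-ball case needs no optimization can be seen from a standard RKHS argument, sketched here under the assumption that the kernel $k$ of $\mathcal{H}$ is bounded, so that the mean embeddings $\mu_p = \mathbf{E}_p[k(x, \cdot)]$ and $\mu_q = \mathbf{E}_q[k(y, \cdot)]$ exist. The symbols $\mu_p$ and $\mu_q$ are introduced for this sketch only.

```latex
\begin{align*}
D(p, q, \mathcal{F})
  &= \sup_{\|f\|_{\mathcal{H}} \le 1} \bigl( \mathbf{E}_p[f(x)] - \mathbf{E}_q[f(y)] \bigr) \\
  &= \sup_{\|f\|_{\mathcal{H}} \le 1} \langle f, \mu_p - \mu_q \rangle_{\mathcal{H}}
   = \| \mu_p - \mu_q \|_{\mathcal{H}}, \\
D(p, q, \mathcal{F})^2
  &= \langle \mu_p - \mu_q, \mu_p - \mu_q \rangle_{\mathcal{H}} \\
  &= \mathbf{E}_{p,p}[k(x, x')] - 2\,\mathbf{E}_{p,q}[k(x, y)] + \mathbf{E}_{q,q}[k(y, y')].
\end{align*}
```

The first step uses the reproducing property $\mathbf{E}_p[f(x)] = \langle f, \mu_p \rangle_{\mathcal{H}}$. The supremum is attained at the unit vector in the direction of $\mu_p - \mu_q$, so the squared discrepancy reduces to expectations of kernel evaluations.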

MMD statistic

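Goal: Estimate $D(p, q, \mathcal{F})$

$$\mathbf{E}_{p,p}[k(x, x')] - 2\,\mathbf{E}_{p,q}[k(x, y)] + \mathbf{E}_{q,q}[k(y, y')]$$

U-Statistic: Empirical estimate $D(X, Y, \mathcal{F})$

$$\frac{1}{m(m-1)} \sum_{i \neq j} \bigl[ k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j) \bigr]$$

Theorem
$D(X, Y, \mathcal{F})$ is an unbiased estimator of $D(p, q, \mathcal{F})$.

Test
Estimate $\sigma^2$ from data.
Reject null hypothesis that $p = q$ if $D(X, Y, \mathcal{F})$ exceeds acceptance threshold.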
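The U-statistic is a double loop over kernel evaluations. The sketch below assumes a Gaussian kernel with a hand-picked bandwidth, and it swaps the variance-based acceptance threshold for a simple permutation test, which is a different way to calibrate the statistic than the one on the slide. All function names and the toy data are hypothetical.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel; the kernel and bandwidth are assumptions of this sketch."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / (2 * bandwidth ** 2))


def mmd_u_statistic(xs, ys, kernel=gaussian_kernel):
    """Empirical estimate: average of the four kernel terms over all i != j."""
    m = len(xs)
    if m != len(ys) or m < 2:
        raise ValueError("need two samples of equal size m >= 2")
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            total += (kernel(xs[i], xs[j]) - kernel(xs[i], ys[j])
                      - kernel(ys[i], xs[j]) + kernel(ys[i], ys[j]))
    return total / (m * (m - 1))


def permutation_p_value(xs, ys, permutations=200, seed=0, kernel=gaussian_kernel):
    """Permutation calibration (not the slide's threshold): reshuffle the pooled sample."""
    rng = random.Random(seed)
    observed = mmd_u_statistic(xs, ys, kernel)
    pooled = list(xs) + list(ys)
    m = len(xs)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(pooled)
        if mmd_u_statistic(pooled[:m], pooled[m:], kernel) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (permutations + 1)


# Toy one-dimensional samples (made up): ys is xs shifted by 5.
xs = [(0.1 * i,) for i in range(8)]
ys = [(5.0 + 0.1 * i,) for i in range(8)]
print(mmd_u_statistic(xs, ys))      # large: the samples are far apart
print(mmd_u_statistic(xs, xs))      # zero: identical samples
print(permutation_p_value(xs, ys))  # small: reject p = q at alpha = 0.05
```

Because `kernel` is a parameter, the same two functions work for strings or graphs once a suitable kernel on those objects is supplied. The double loop also makes the quadratic cost in the number of data points visible.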

Attractive for bioinformatics

MMD
two-sample test in terms of kernels

Computationally attractive
search infinite space of functions by evaluating one expression
no optimization problem has to be solved

All thanks to kernels!

Attractive for bioinformatics

Wide applicability
for one- and higher-dimensional vectorial data,
but also for structured data!
two-sample problems can now be tackled on
  strings: protein and DNA sequences
  graphs: molecules, protein interaction networks
  time series: time series of microarray data
  and sets, trees, . . .

Cross-platform comparability

Data
microarray data from two breast cancer studies
one on cDNA platform (Gruvberger et al., 2001)
other on oligonucleotide microarray platform (West et al., 2001)

Task
Can MMD help to find out if two sets of observations were generated by
  the same study (both from Gruvberger or both from West)?
  different studies (one Gruvberger, one West)?

Cross-platform comparability

Experiment
sample size each: 25
dimension of each data point: 2,116
significance level: α = 0.05

100 times: 1 sample from Gruvberger, 1 from West
100 times: both from Gruvberger or both from West
report percentage of correct decisions
compare to t-test, Friedman-Rafsky Wald-Wolfowitz and Smirnov
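The protocol above can be sketched as a small evaluation loop. Everything here is an assumption made for illustration: the function name `fraction_correct`, the choice to draw the same-study pair as two disjoint halves of one draw, and the toy pools and toy test, which stand in for the real expression data and for MMD.

```python
import random
import statistics


def fraction_correct(pool_a, pool_b, rejects, sample_size=25, repeats=100, seed=0):
    """Fraction of correct decisions of a two-sample test.

    `rejects(xs, ys)` must return True when the test rejects p = q.
    Returns (correct when studies differ, correct when the study is the same).
    """
    rng = random.Random(seed)
    correct_different = 0
    correct_same = 0
    for _ in range(repeats):
        # One sample from each study: rejecting is the correct decision.
        xs = rng.sample(pool_a, sample_size)
        ys = rng.sample(pool_b, sample_size)
        correct_different += rejects(xs, ys)
        # Both samples from one study (disjoint halves): not rejecting is correct.
        pool = pool_a if rng.random() < 0.5 else pool_b
        both = rng.sample(pool, 2 * sample_size)
        correct_same += not rejects(both[:sample_size], both[sample_size:])
    return correct_different / repeats, correct_same / repeats


# Toy stand-ins: scalar "observations" from two clearly different pools,
# and a crude test that rejects when the sample means differ by more than 5.
pool_a = [float(i % 5) for i in range(60)]         # values 0..4
pool_b = [10.0 + float(i % 5) for i in range(60)]  # values 10..14


def toy_test(xs, ys):
    return abs(statistics.mean(xs) - statistics.mean(ys)) > 5.0


print(fraction_correct(pool_a, pool_b, toy_test))
```

To compare several tests as on the slide, the same loop would be run once per test with an identical seed, so that every test sees the same draws.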


Kernel-based statistical test

novel statistical test for two-sample problem:
  easy to implement
  non-parametric
  first for structured data
  best on high-dimensional data
  quadratic runtime w.r.t. the number of data points
  impressive accuracy in our experiments

kernel method for two-sample problem:
  all kernels recently defined in molecular biology can be re-used for data integration
  applicable to vectors, strings, sets, trees, graphs and time series

Biclustering

Clustering in two dimensions
alternative names: co-clustering, two-mode clustering
A bicluster is a subset of genes that show similar activity patterns under a subset of conditions.
Clustering in 2 dimensions
Cluster patients and conditions
Earliest work by Hartigan, 1972: Divide a matrix into submatrices with minimum variance.
Most interesting cases are NP-complete.
Many extensions in bioinformatics (e.g. Cheng and Church, 2002)
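Two ways of scoring a candidate bicluster can be sketched in plain Python: the variance of the submatrix, in the spirit of Hartigan's criterion, and the mean squared residue used by Cheng and Church. Only the scores are shown, not a search over row and column subsets; the function names and the toy matrix are hypothetical.

```python
def submatrix(matrix, rows, cols):
    """Entries of `matrix` restricted to the chosen rows and columns."""
    return [[matrix[i][j] for j in cols] for i in rows]


def submatrix_variance(matrix, rows, cols):
    """Variance of all entries in the submatrix; 0 for a constant bicluster."""
    values = [v for row in submatrix(matrix, rows, cols) for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def mean_squared_residue(matrix, rows, cols):
    """Mean of (a_ij - row mean - column mean + overall mean)^2 over the submatrix.

    The score is 0 when every row is a shifted copy of every other row.
    """
    sub = submatrix(matrix, rows, cols)
    row_means = [sum(row) / len(row) for row in sub]
    col_means = [sum(col) / len(col) for col in zip(*sub)]
    overall = sum(row_means) / len(row_means)
    total = sum((sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
                for i in range(len(sub)) for j in range(len(sub[0])))
    return total / (len(sub) * len(sub[0]))


# Toy expression matrix (made up): genes as rows, conditions as columns.
# The first two conditions follow an additive pattern; the third does not.
expression = [
    [1.0, 2.0, 9.0],
    [2.0, 3.0, 0.0],
    [3.0, 4.0, 5.0],
]
print(mean_squared_residue(expression, [0, 1, 2], [0, 1]))     # 0: coherent bicluster
print(mean_squared_residue(expression, [0, 1, 2], [0, 1, 2]))  # larger: third column breaks it
print(submatrix_variance(expression, [0, 1, 2], [0, 1]))       # nonzero: entries are not constant
```

The example shows the difference between the two criteria: the first two columns are not constant, so their variance is positive, yet the rows move together, so the mean squared residue is zero.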

References and further reading

References

[1] Gretton, Borgwardt, Rasch, Schölkopf, Smola: A kernel method for the two-sample problem. NIPS 2006

The end

See you tomorrow! Next topic: Feature Selection in Bioinformatics