machine learning clustering and machine learning...


Post on 27-Jul-2020


Page 1: Machine learning Clustering and machine learning forxray.bmc.uu.se/kurs/BioinfX3/2006_learning_rules.pdf · Clustering and machine learning for gene expression data Torgeir R. Hvidsten


Clustering and machine learning for gene expression data

Torgeir R. Hvidsten

Linnaeus Centre for Bioinformatics

Torgeir R. Hvidsten, 2006.02.17

Machine learning: to learn general concepts from examples

[Figure: diagram connecting Real world, Data (feature space) and Knowledge (classes), with the labels Data collection, Abstraction and Machine learning]

Assumed functional relationship partially described by the examples

Gene Ontology

Ordered controlled vocabulary organized in a taxonomy for describing the molecular role of gene products

• Molecular function: the tasks performed by individual gene products
• Biological process: broad biological goals that are accomplished by ordered assemblies of molecular functions
• Cellular component: subcellular structures, locations, and macromolecular complexes

Protein structure classification (CATH)


Microarray


Hybridization


Image after scanning


Numerical data

Gene/Expr     E1     E2     E3     E4     E5     E6     E7     E8     E9    E10      …     EM
G1         -0.47  -3.32  -0.81   0.11  -0.60  -1.36  -1.03  -1.84  -1.00  -0.60      …  -0.94
G2          0.66   0.07   0.20   0.29  -0.89  -0.45  -0.29  -0.29  -0.15  -0.45      …  -0.42
G3          0.14  -0.04   0.00  -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81      …  -1.12
G4         -0.04   0.00  -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76      …  -0.62
G5          0.28   0.37   0.11  -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29      …  -0.74
G6          0.54   0.53   0.16   0.14   0.20  -0.34  -0.38  -0.36  -0.49  -0.58      …  -1.47
G7          0.20   0.14   0.00   0.11  -0.34  -0.03   0.04  -0.76  -0.81  -1.12      …  -1.36
G8          0.40   0.43   0.18   0.00  -0.14   0.29   0.07  -0.79  -0.81  -0.92      …  -1.22
G9          0.01   0.46   0.28  -0.34  -0.23  -0.36  -0.45  -0.64  -0.79  -1.22      …  -1.09
…              …      …      …      …      …      …      …      …      …      …      …      …
GN         -0.23   0.04   0.00  -0.30  -0.29  -0.45  -0.97  -2.06  -0.89  -1.22      …  -0.97

Each value is a log ratio of the two channel intensities, e.g. log(2.3/2.4) = log("Red/Green")

M < 100 (experiments, columns E1 … EM)

N ≈ 10000 (genes, rows G1 … GN)
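As a minimal sketch of how one table entry could arise, the snippet below computes a log ratio from a red and a green intensity. The slide does not state the base of the logarithm; base 2 is an assumption here, and the function name `log_ratio` is purely illustrative.

```python
import math

def log_ratio(red: float, green: float, base: float = 2.0) -> float:
    """Log of the red/green intensity ratio (the base is an assumption)."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log(red / green, base)

# Intensities from the slide's example: red = 2.3, green = 2.4.
value = log_ratio(2.3, 2.4)
print(f"{value:.3f}")  # negative, because red is weaker than green
```

A ratio above 1 gives a positive value and a ratio below 1 a negative one, so up- and down-regulation are treated symmetrically around 0.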


Data analysis goals

What to study?

• Classes of experiments; changes in expression levels between tissue samples with different diseases, treatments, environmental effects, etc.

• Classes of genes; expression profiles of genes with similar biological function

• Both of the above


Data analysis methods

• Unsupervised learning (clustering, class discovery); used to "discover" natural groups of genes/experiments, e.g.
– discover subclasses of a form of cancer that is clinically homogeneous
• Supervised learning; used to "learn" a model of a set of predefined classes of genes/experiments, e.g.
– diagnosis of cancer/subclasses of cancer

Clustering analysis

Need to define:
• measure of similarity
• algorithm for using the measure of similarity to discover natural groups in the data

The number of ways to divide n items into k clusters: approximately k^n/k!

Example (n = 500, k = 10): 10^500/10! ≈ 2.756 × 10^493
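The arithmetic in the example can be checked with exact integers. This is only a sketch of the slide's k^n/k! estimate (with integer division); the function name is illustrative.

```python
from math import factorial

def approx_partition_count(n: int, k: int) -> int:
    """Integer part of k**n / k!, the slide's estimate of the number of clusterings."""
    return k**n // factorial(k)

count = approx_partition_count(500, 10)
digits = str(count)
# The leading digits are 2755..., i.e. about 2.756 x 10^493 after rounding.
print(f"leading digits {digits[:5]}, exponent {len(digits) - 1}")
```

Even for 500 items the count is far beyond exhaustive search, which is why clustering relies on heuristic algorithms.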


Measure of similarity

What is similar? Euclidean distance

[Figure: plot with axes E1 and E2 and a distance d marked between points]
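A minimal sketch of the Euclidean distance between two expression profiles. The two profiles reuse the E1 and E2 values of genes G2 and G3 from the numerical data table; the function name `euclidean` is illustrative.

```python
import math
from typing import Sequence

def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two profiles of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of experiments")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

g2 = [0.66, 0.07]   # G2 in experiments E1, E2
g3 = [0.14, -0.04]  # G3 in experiments E1, E2
d = euclidean(g2, g3)
print(f"d(G2, G3) = {d:.3f}")
```

The same function works unchanged for all M experiments, since it only assumes the two profiles have equal length.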


Hierarchical clustering

• INPUT: n genes/experiments
• Consider each gene/experiment as an individual cluster and initiate an n × n distance matrix d
• Repeat
– identify the two most similar clusters in d (i.e. smallest number in d)
– merge the two most similar clusters and update the matrix (i.e. substitute the two clusters with the new cluster)
• OUTPUT: A tree of merged genes/experiments (called a dendrogram)
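The steps above can be sketched as a naive agglomerative procedure. This is an unoptimized illustration, not the implementation used for the slides: it recomputes intercluster distances from the item distances instead of updating a matrix, and the names `agglomerate`, `dist` and `linkage` are assumptions made for the example.

```python
from itertools import combinations

def agglomerate(labels, dist, linkage=min):
    """Naive agglomerative clustering.

    labels:  item names
    dist:    dict mapping frozenset({a, b}) to the distance between items a and b
    linkage: combines item distances between two clusters (min = single linkage)
    Returns the dendrogram as nested tuples and the list of merge distances.
    """
    # Each cluster is (tree, members); start with one cluster per item.
    clusters = [(name, (name,)) for name in labels]
    heights = []

    def between(c1, c2):
        return linkage(dist[frozenset((a, b))] for a in c1[1] for b in c2[1])

    while len(clusters) > 1:
        # Identify the two most similar clusters.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: between(clusters[p[0]], clusters[p[1]]))
        heights.append(between(clusters[i], clusters[j]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        # Substitute the two clusters with the new cluster.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters[0][0], heights

# Toy data (hypothetical): four items on a line.
positions = {"A": 0, "B": 1, "C": 5, "D": 7}
dist = {frozenset((a, b)): abs(positions[a] - positions[b])
        for a, b in combinations(positions, 2)}
tree, heights = agglomerate(list(positions), dist)
print(tree, heights)
```

Passing `linkage=max` switches the same loop to complete linkage; only the distance at which the last two clusters merge changes for this toy data.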


Hierarchical clustering

Intercluster similarity measures: (a) single linkage, (b) complete linkage and (c) average linkage
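The three measures differ only in how the pairwise distances between the members of two clusters are combined: single linkage takes the smallest, complete linkage the largest, and average linkage the mean. A small sketch using Euclidean distance on hypothetical 2-D points:

```python
import math
from itertools import product

def pairwise(c1, c2):
    """All distances between a member of c1 and a member of c2."""
    return [math.dist(a, b) for a, b in product(c1, c2)]

def single_linkage(c1, c2):
    return min(pairwise(c1, c2))

def complete_linkage(c1, c2):
    return max(pairwise(c1, c2))

def average_linkage(c1, c2):
    d = pairwise(c1, c2)
    return sum(d) / len(d)

# Two hypothetical clusters on the x axis.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(left, right),
      complete_linkage(left, right),
      average_linkage(left, right))
```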

Example of hierarchical clustering: languages of Europe

Distance: the number of numerals whose names begin with different letters in the two languages, e.g.

d(E, N) = 2, d(E, Du) = 7, d(Sp, I) = 1

Intercluster strategy: SINGLE LINKAGE

Iteration 1

     E   N  Da  Du   G  Fr  Sp   I   P   H  Fi
E    0
N    2   0
Da   2   1   0
Du   7   5   6   0
G    6   4   5   5   0
Fr   6   6   6   9   7   0
Sp   6   6   5   9   7   2   0
I    6   6   5   9   7   1   1   0
P    7   7   6  10   8   5   3   4   0
H    9   8   8   8   9  10  10  10  10   0
Fi   9   9   9   9   9   9   9   9   9   8   0

[Dendrogram so far: leaves I, Fr; height axis 1 to 8]

Iteration 2

        (I Fr)   E   N  Da  Du   G  Sp   P   H  Fi
(I Fr)       0
E            6   0
N            6   2   0
Da           5   2   1   0
Du           9   7   5   6   0
G            7   6   4   5   5   0
Sp           1   6   6   5   9   7   0
P            4   7   7   6  10   8   3   0
H           10   9   8   8   8   9  10  10   0
Fi           9   9   9   9   9   9   9   9   8   0

[Dendrogram so far: leaves I, Fr, Da, N; height axis 1 to 8]

Iteration 3

        (Da N)  (I Fr)   E  Du   G  Sp   P   H  Fi
(Da N)       0
(I Fr)       5       0
E            2       6   0
Du           5       9   7   0
G            4       7   6   5   0
Sp           5       1   6   9   7   0
P            6       4   7  10   8   3   0
H            8      10   9   8   9  10  10   0
Fi           9       9   9   9   9   9   9   8   0

[Dendrogram so far: leaves I, Fr, Da, N, Sp; height axis 1 to 8]

Iteration 4

           (Sp I Fr)  (Da N)   E  Du   G   P   H  Fi
(Sp I Fr)          0
(Da N)             5       0
E                  6       2   0
Du                 9       5   7   0
G                  7       4   6   5   0
P                  3       6   7  10   8   0
H                 10       8   9   8   9  10   0
Fi                 9       9   9   9   9   9   8   0

[Dendrogram so far: leaves I, Fr, Da, N, Sp, E; height axis 1 to 8]

Iteration 5

           (E Da N)  (Sp I Fr)  Du   G   P   H  Fi
(E Da N)          0
(Sp I Fr)         5          0
Du                5          9   0
G                 4          7   5   0
P                 6          3  10   8   0
H                 8         10   8   9  10   0
Fi                9          9   9   9   9   8   0

[Dendrogram so far: leaves I, Fr, Da, N, Sp, E, P; height axis 1 to 8]

Iteration 6

             (P Sp I Fr)  (E Da N)  Du   G   H  Fi
(P Sp I Fr)            0
(E Da N)               5         0
Du                     9         5   0
G                      7         4   5   0
H                     10         8   8   9   0
Fi                     9         9   9   9   8   0

[Dendrogram so far: leaves I, Fr, Da, N, Sp, E, P, G; height axis 1 to 8]

Iteration 7

             (G E Da N)  (P Sp I Fr)  Du   H  Fi
(G E Da N)            0
(P Sp I Fr)           5            0
Du                    5            9   0
H                     8           10   8   0
Fi                    9            9   9   8   0

[Dendrogram so far: leaves I, Fr, Da, N, Sp, E, P, G, Du; height axis 1 to 8]

Iteration 8

               (Du G E Da N)  (P Sp I Fr)   H  Fi
(Du G E Da N)              0
(P Sp I Fr)                5            0
H                          8           10   0
Fi                         9            9   8   0

[Dendrogram so far: leaves I, Fr, Da, N, Sp, E, P, G, Du; height axis 1 to 8]

Iteration 9

                         (P Sp I Fr Du G E Da N)   H  Fi
(P Sp I Fr Du G E Da N)                        0
H                                              8   0
Fi                                             9   8   0

[Dendrogram so far: leaves I, Fr, Da, N, Sp, E, P, G, Du, H; height axis 1 to 8]

Iteration 10

                         (Fi H)  (P Sp I Fr Du G E Da N)
(Fi H)                        0
(P Sp I Fr Du G E Da N)       8                        0

[Final dendrogram: leaves I, Fr, Da, N, Sp, E, P, G, Du, H, Fi; height axis 1 to 8]
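The ten iterations can be replayed in code. The sketch below runs single linkage on the initial language distance matrix and records the distance at which each merge happens. The matrix values are copied from Iteration 1; the function name and the data layout are illustrative. Ties may be broken in a different order than on the slides, but under single linkage the sequence of merge distances is the same.

```python
LANGS = ["E", "N", "Da", "Du", "G", "Fr", "Sp", "I", "P", "H", "Fi"]
LOWER = [
    [0],
    [2, 0],
    [2, 1, 0],
    [7, 5, 6, 0],
    [6, 4, 5, 5, 0],
    [6, 6, 6, 9, 7, 0],
    [6, 6, 5, 9, 7, 2, 0],
    [6, 6, 5, 9, 7, 1, 1, 0],
    [7, 7, 6, 10, 8, 5, 3, 4, 0],
    [9, 8, 8, 8, 9, 10, 10, 10, 10, 0],
    [9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 0],
]

def single_linkage_heights(names, lower):
    """Merge distances of single-linkage clustering on a lower-triangular matrix."""
    clusters = [frozenset([n]) for n in names]
    # d maps an unordered pair of clusters to their distance.
    d = {frozenset((clusters[i], clusters[j])): lower[i][j]
         for i in range(len(names)) for j in range(i)}
    heights = []
    while len(clusters) > 1:
        pair = min(d, key=d.get)  # smallest number in d
        a, b = tuple(pair)
        heights.append(d.pop(pair))
        rest = [c for c in clusters if c != a and c != b]
        merged = a | b
        for c in rest:
            # Single linkage: keep the smaller of the two old distances.
            d[frozenset((merged, c))] = min(d.pop(frozenset((a, c))),
                                            d.pop(frozenset((b, c))))
        clusters = rest + [merged]
    return heights

heights = single_linkage_heights(LANGS, LOWER)
print(heights)
```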


Any data mining result needs to be consistent BOTH with the data and current knowledge!


Evaluation of clusters

Clusters may be evaluated according to how well they describe current knowledge

[Dendrogram of the languages (leaves I, Fr, Da, N, Sp, E, P, G, Du, H, Fi; height axis 1 to 8) with a legend of language groups: Roman, Slavic, Germanic, Ugro-Finnish]

Hierarchical clustering: properties

• Huge memory requirements: stores the n × n matrix
• Running time: O(n^3)
• Deterministic: produces the same clustering each time
• Nice visualization: dendrogram
• Number of clusters can be selected using the dendrogram

K-means clustering

• Split the data into k random clusters
• Repeat
– calculate the centroid of each cluster
– (re-)assign each gene/experiment to the closest centroid
– stop if no new assignments are made
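A minimal pure-Python sketch of the loop above. Two details are assumptions that the slide leaves open: the random split is done round-robin over a shuffled order so that no cluster starts empty, and a cluster that loses all its members keeps its previous centroid. The function name `kmeans` and the `seed` parameter are illustrative.

```python
import math
import random

def kmeans(points, k, seed=0):
    """K-means on a list of equal-length tuples; requires len(points) >= k."""
    rng = random.Random(seed)
    # Split the data into k random clusters.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for pos, idx in enumerate(order):
        labels[idx] = pos % k
    centroids = [None] * k
    while True:
        # Calculate the centroid of each cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
        # (Re-)assign each point to the closest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Stop if no new assignments are made.
        if new_labels == labels:
            return labels, centroids
        labels = new_labels

# Hypothetical data: two well-separated groups in two dimensions.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Changing `seed` changes the initial split, which is the source of the non-determinism of k-means; on this toy data every split recovers the same two groups, which is not true in general.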

Example of K-means: two dimensions

[Figure sequence, K = 2, centroids marked x: initial clusters; iteration 1: calculate centroids, then (re-)assign; iteration 2: calculate centroids, then (re-)assign; iteration 3: calculate centroids, then (re-)assign. No new assignments are made in iteration 3, so the algorithm stops.]

K-means: properties

• Low memory usage
• Running time: O(n) (per iteration, for fixed k)
• Improves iteratively: not trapped in previous mistakes
• Non-deterministic: will in general produce different clusters with different initializations
• Number of clusters must be decided in advance

Hierarchical vs. k-means

• Hierarchical clustering:
– computationally expensive -> relatively small data sets
– nice visualization, no. of clusters can be selected
– deterministic
– cannot correct early "mistakes"
• K-means:
– computationally efficient -> large data sets
– predefined no. of clusters
– non-deterministic -> should be run several times
– iterative improvement
• Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k=2 -> best of both worlds!

Supervised learning

• Uses examples of known classes to learn a model
• Examples are expression profiles of genes with known classes (clinical state or function)
• The model can be e.g.
– hyperplanes separating classes in n dimensions
– artificial neural networks
– decision trees
– IF-THEN rules
• Can be used for e.g.
– diagnostics
– predicting gene function for unknown genes

Support Vector Machines

[Figure labels: maximum margin separating "hyperplane"; support vectors; soft margin]

Artificial neural networks

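[Figure: network with inputs x1, x2, x3, x4 between an input layer and an output layer; and a single unit with inputs x1 … xn, weights w1 … wn and output f(x)]

f(x) = 1 if \sum_{i=1}^{n} w_i x_i > \theta, and 0 otherwise

(\theta stands for the unit's threshold, whose value is not legible in the extracted slide.)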
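A single threshold unit of this kind can be sketched in a few lines. The threshold value and the example weights are assumptions chosen for illustration; they are not taken from the slide.

```python
from typing import Sequence

def threshold_unit(x: Sequence[float], w: Sequence[float],
                   threshold: float = 0.0) -> int:
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    if len(x) != len(w):
        raise ValueError("need one weight per input")
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if weighted_sum > threshold else 0

# Hypothetical weights: with threshold 1.5 the unit acts as a logical AND
# on two binary inputs.
weights = [1.0, 1.0]
for inputs in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(inputs, threshold_unit(inputs, weights, threshold=1.5))
```

Learning such a unit amounts to choosing the weights, which geometrically places a separating hyperplane in the n-dimensional input space.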

Decision tree learning

Class knowledge:
• Group 1: Nordic countries
• Group 2: UK, France, Greece, Spain, Portugal
• Group 3: Benelux countries, Switzerland, Austria, Italy, Germany

Christian Democrats > 16?
– Yes: Group 3
– No: Agrarians > 4?
    – Yes: Group 1
    – No: Group 2

Rule learning: Rough sets

Agrarians([4, *)) AND Christian Democrats([*, 16)) => Class(1)
Agrarians([*, 4)) AND Christian Democrats([*, 16)) => Class(2)
Christian Democrats([16, *)) => Class(3)
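The three rules translate directly into code. The sketch below follows the interval notation of the rules, so the boundaries 4 and 16 are inclusive on the upper branch. The slide does not say what the two attributes measure or in which unit, so the inputs in the example are hypothetical numbers.

```python
def classify(agrarians: float, christian_democrats: float) -> int:
    """Apply the three interval rules and return the class (group) number."""
    if christian_democrats >= 16:      # Christian Democrats([16, *)) => Class(3)
        return 3
    if agrarians >= 4:                 # Agrarians([4, *)) AND CD([*, 16)) => Class(1)
        return 1
    return 2                           # Agrarians([*, 4)) AND CD([*, 16)) => Class(2)

# Hypothetical attribute values for three countries.
for agr, cd in [(10.0, 5.0), (0.0, 5.0), (0.0, 20.0)]:
    print(agr, cd, "-> group", classify(agr, cd))
```

The rule set and the decision tree describe the same partition of the two attributes; the rules simply list each root-to-leaf path as a separate condition.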


Supervised vs. clustering

Clustering
+ class discovery
+ robust towards incorrect knowledge

Supervised
+ evaluation
+ predictive/descriptive model
+ based on actual knowledge rather than idealized hypotheses

Predicting biological process from gene expression time profiles

Papers:

I. T. R. Hvidsten, A. Lægreid and J. Komorowski. Learning rule-based models of biological process from gene expression time profiles using gene ontology, Bioinformatics, 19(9): 1116-23, 2003.

II. A. Lægreid, T. R. Hvidsten, H. Midelfart, J. Komorowski and A. K. Sandvik. Predicting Gene Ontology Biological Process From Temporal Gene Expression Patterns, Genome Research, 13(5): 965-979, 2003.


Hierarchical clustering

Iyer et al., The transcriptional program in the response of human fibroblasts to serum, Science, 283(5398): 83-87, 1999


Gene Ontology vs. expression clustering

Methodology

1. Annotation

2. Extracting features for learning

3. Inducing minimal decision rules using rough sets

4. The function of uncharacterized genes is predicted using the rules

Annotated expression data:

Gene    0HR  15MIN  30MIN    1HR    2HR    4HR    6HR    8HR   12HR   16HR   20HR   24HR  Process
g1     0.00  -0.47  -3.32  -0.81   0.11  -0.60  -1.36  -1.03  -1.84  -1.00  -0.60  -0.94  Unknown
g2     0.00   0.66   0.07   0.20   0.29  -0.89  -0.45  -0.29  -0.29  -0.15  -0.45  -0.42  Transport and defense response
g3     0.00   0.14  -0.04   0.00  -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81  -1.12  Cell cycle control
g4     0.00  -0.04   0.00  -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76  -0.62  Positive control of cell proliferation
g5     0.00   0.28   0.37   0.11  -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29  -0.74  Positive control of cell proliferation
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...  ...

[Figure: ontology fragment under Process with the nodes Transport (g2), Defense response (g2), Cell cycle control (g3) and Positive control of cell proliferation (g4, g5)]

[Plot: expression profile over time; time axis 0 to 24, expression axis -2 to 1.5]

Example rule:

0 - 4(Increasing) AND 6 - 10(Decreasing) AND 14 - 18(Constant) => GO(cell proliferation)

Rule Induction

• IF-part (antecedent, premise): the minimal set of discrete changes in expression needed to uphold the discriminatory power of the full data set

• THEN-part (consequent): all functions of genes described by the premise-side

• We want rules that describe the expression profiles of several genes with one or a few functions

– accuracy: the fraction of genes matching the IF-part that are annotated with the process in the THEN-part

– coverage: the fraction of genes annotated with the process in the THEN-part that match the IF-part

Example:

IF 0 - 4(Constant) AND 0 - 10(Increasing) THEN GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)
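The two rule-quality measures can be written as set operations. This is a sketch for a single process in the THEN-part; the gene identifiers are made up for the example and the function name is illustrative.

```python
def rule_quality(matching: set, annotated: set) -> tuple:
    """Accuracy and coverage of a rule for one process.

    matching:  genes whose expression profile matches the IF-part
    annotated: genes annotated with the process in the THEN-part
    """
    both = matching & annotated
    accuracy = len(both) / len(matching) if matching else 0.0
    coverage = len(both) / len(annotated) if annotated else 0.0
    return accuracy, coverage

# Hypothetical gene sets.
matching = {"g1", "g2", "g3", "g4"}
annotated = {"g2", "g3", "g5", "g6", "g7"}
accuracy, coverage = rule_quality(matching, annotated)
print(f"accuracy = {accuracy:.2f}, coverage = {coverage:.2f}")
```

A rule with high accuracy but low coverage is reliable but describes few of the genes in the process, which is why both numbers are reported.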


Rule example

Rule: 0 - 4(Constant) AND 0 - 10(Increasing) => GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)

Covered genes: M35296, J02783, D13748, X05130, X60957

[Plot: expression profiles of the covered genes; time axis 0 to 24, expression axis -1 to 3]

Classification

[Figure: the expression profile of gene X60957 (time axis 0 to 24, expression axis -1 to 3) is matched against a list of IF … THEN … rules; vote increments +4, +1 and +1 are shown next to the vote table]

Matching rule shown:

IF 0 - 4(Constant) AND 0 - 10(Increasing) THEN GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)

Process                                Votes
protein metabolism and modification        6
mesoderm development                       3
proteolysis and peptidolysis               2
transcription                              1
protein biosynthesis                       1
vision                                     1
…

Votes are normalized and processes with vote fractions higher than a selection-threshold are chosen as predictions
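A minimal sketch of the normalization and selection step, using the vote counts listed above. Two things are assumptions: the vote table is truncated on the slide, so the total here covers only the six listed processes, and the selection threshold of 0.2 is a hypothetical value.

```python
votes = {
    "protein metabolism and modification": 6,
    "mesoderm development": 3,
    "proteolysis and peptidolysis": 2,
    "transcription": 1,
    "protein biosynthesis": 1,
    "vision": 1,
}

def predict(votes: dict, threshold: float) -> list:
    """Return the processes whose normalized vote fraction exceeds the threshold."""
    total = sum(votes.values())
    return [process for process, count in votes.items()
            if count / total > threshold]

predictions = predict(votes, threshold=0.2)  # hypothetical threshold
print(predictions)
```

Raising the threshold yields fewer but more confident predictions, which is the trade-off explored in the threshold selection and ROC analysis.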


Evaluation

Evaluation technique:
– divide examples into training set and test set
– cross validation

Evaluation measures:
– accuracy = (TP+TN)/(TP+FN+TN+FP)
– sensitivity = TP/(TP+FN)
– specificity = TN/(TN+FP)
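The three measures follow directly from the four confusion-matrix counts. The counts in the example are made up (a test set of 12 genes with 3 positives); only the formulas come from the slide.

```python
def evaluation_measures(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts: 2 of 3 positives found, no false positives.
measures = evaluation_measures(tp=2, fn=1, tn=9, fp=0)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With few positives among many negatives, accuracy stays high even when a positive is missed, so sensitivity and specificity are reported alongside it.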


Threshold selection

[Figure: test set genes g1 … g12 plotted by their fraction of votes for "protein biosynthesis" (axis up to 1), with two candidate cut-offs marked Threshold 1 and Threshold 2; markers distinguish genes with function "protein biosynthesis" from genes with a different function]

Trade-offs shown for the two thresholds:
– Sensitivity = 2/3, Specificity = 1
– Sensitivity = 1, Specificity = 2/3

sensitivity: TP/(TP+FN)
specificity: TN/(TN+FP)

ROC analysis and classifier evaluation

[Figure: ROC plot with sensitivity (0 to 1) on the vertical axis and 1 - specificity (0 to 1) on the horizontal axis; labels "No discrimination", "Perfect discrimination" and "AUC"]

• ROC: the receiver operating characteristic curve results from plotting sensitivity against 1 - specificity for all possible thresholds
– sensitivity: TP/(TP+FN)
– specificity: TN/(TN+FP)
• AUC: area under the ROC curve
• Cross validation (CV)
– systematic division of data into training and test sets
– CV estimates are interpreted as the classification performance expected on new, unseen data
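A small sketch of how a ROC curve and its AUC can be computed by sweeping the threshold over the observed scores. The scores and labels are hypothetical vote fractions for six test genes, and the trapezoidal rule is one common way to compute the area; both classes must be present.

```python
def roc_points(scores, labels):
    """Points (1 - specificity, sensitivity) for every threshold, from strict to lenient."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical vote fractions; label 1 marks genes that have the function.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
print(points)
print(f"AUC = {area:.3f}")
```

An AUC of 0.5 corresponds to the "no discrimination" diagonal and an AUC of 1 to "perfect discrimination".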


Over all classes:
Coverage = TP/(TP+FN)
Precision = TP/(TP+FP)

Coverage: 84%, Precision: 50%
Coverage: 71%, Precision: 60%
Coverage: 39%, Precision: 90%

*Iyer et al.

Cross validation estimates

PROCESS                                  AUC    SE    P-VALUE
Ion homeostasis                          1.00   0.00  0.008
Protein targeting                        0.99   0.03  0.000
Blood coagulation                        0.96   0.08  0.000
DNA metabolism                           0.94   0.09  0.000
Intracellular signaling cascade          0.94   0.06  0.000
Cell cycle                               0.93   0.04  0.000
Energy pathways                          0.93   0.12  0.004
Oncogenesis                              0.92   0.11  0.000
Circulation                              0.91   0.11  0.001
Cell death                               0.90   0.10  0.000
Developmental processes                  0.90   0.07  0.000
Defense (immune) response                0.88   0.05  0.000
Transcription                            0.88   0.11  0.002
Cell adhesion                            0.87   0.09  0.002
Stress response                          0.86   0.15  0.002
Protein metabolism and modification      0.85   0.10  0.000
Cell motility                            0.84   0.11  0.000
Cell surface rec linked signal transd    0.82   0.15  0.005
Lipid metabolism                         0.81   0.14  0.000
Cell organization and biogenesis         0.79   0.11  0.000
Cell proliferation                       0.79   0.06  0.002
Transport                                0.79   0.17  0.001
Amino acid and derivative metabolism     0.69   0.06  0.288
AVERAGE                                  0.88   0.09
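The AVERAGE row can be checked against the per-process values. The sketch below copies the AUC and SE columns from the table and takes unweighted means; that the slide's averages are unweighted is an assumption, and it is consistent with the reported 0.88 and 0.09.

```python
# (AUC, SE) pairs in table order, from "Ion homeostasis" to
# "Amino acid and derivative metabolism".
rows = [
    (1.00, 0.00), (0.99, 0.03), (0.96, 0.08), (0.94, 0.09), (0.94, 0.06),
    (0.93, 0.04), (0.93, 0.12), (0.92, 0.11), (0.91, 0.11), (0.90, 0.10),
    (0.90, 0.07), (0.88, 0.05), (0.88, 0.11), (0.87, 0.09), (0.86, 0.15),
    (0.85, 0.10), (0.84, 0.11), (0.82, 0.15), (0.81, 0.14), (0.79, 0.11),
    (0.79, 0.06), (0.79, 0.17), (0.69, 0.06),
]

mean_auc = sum(a for a, _ in rows) / len(rows)
mean_se = sum(s for _, s in rows) / len(rows)
print(f"mean AUC = {mean_auc:.2f}, mean SE = {mean_se:.2f}")
```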