

Clustering and machine learning for gene expression data

Torgeir R. Hvidsten

Linnaeus Centre for Bioinformatics

Torgeir R. Hvidsten, 2006.02.17

Machine learning: to learn general concepts from examples

Real world Data (Feature space)

Knowledge (classes)

Assumed functional relationship partially described by the examples

Data collection

Abstraction

Machine learning


Gene Ontology

Ordered controlled vocabulary organized in a taxonomy for describing the molecular role of gene products

• Molecular function: the tasks performed by individual gene products
• Biological process: broad biological goals that are accomplished by ordered assemblies of molecular functions
• Cellular component: subcellular structures, locations, and macromolecular complexes


Protein structure classification (CATH)



Microarray


Hybridization


Image after scanning


Numerical data

Gene/Expr     E1     E2     E3     E4     E5     E6     E7     E8     E9    E10    …     EM
G1         -0.47  -3.32  -0.81   0.11  -0.60  -1.36  -1.03  -1.84  -1.00  -0.60    …  -0.94
G2          0.66   0.07   0.20   0.29  -0.89  -0.45  -0.29  -0.29  -0.15  -0.45    …  -0.42
G3          0.14  -0.04   0.00  -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81    …  -1.12
G4         -0.04   0.00  -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76    …  -0.62
G5          0.28   0.37   0.11  -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29    …  -0.74
G6          0.54   0.53   0.16   0.14   0.20  -0.34  -0.38  -0.36  -0.49  -0.58    …  -1.47
G7          0.20   0.14   0.00   0.11  -0.34  -0.03   0.04  -0.76  -0.81  -1.12    …  -1.36
G8          0.40   0.43   0.18   0.00  -0.14   0.29   0.07  -0.79  -0.81  -0.92    …  -1.22
G9          0.01   0.46   0.28  -0.34  -0.23  -0.36  -0.45  -0.64  -0.79  -1.22    …  -1.09
…              …      …      …      …      …      …      …      …      …      …    …      …
GN         -0.23   0.04   0.00  -0.30  -0.29  -0.45  -0.97  -2.06  -0.89  -1.22    …  -0.97

log(2.3/2.4) = log(“Red/Green”)

M < 100 experiments (columns)

N ≈ 10000 genes (rows)



Data analysis goals

What to study?

• Classes of experiments: changes in expression levels between tissue samples that differ in e.g. disease, treatment or environmental effects

• Classes of genes; expression profiles of genes with similar biological function

• Both of the above


Data analysis methods

• Unsupervised learning (clustering, class discovery): used to “discover” natural groups of genes/experiments, e.g.
  – discover subclasses of a form of cancer that is clinically homogeneous
• Supervised learning: used to “learn” a model of a set of predefined classes of genes/experiments, e.g.
  – diagnosis of cancer/subclasses of cancer


Clustering analysis

Need to define:
• measure of similarity
• algorithm for using the measure of similarity to discover natural groups in the data

The number of ways to divide n items into k clusters is approximately k^n/k!

Example (n = 500, k = 10): 10^500/10! ≈ 2.756 × 10^493
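The size of the search space can be checked with exact integer arithmetic. The sketch below computes the k^n/k! estimate and, for comparison, the exact number of partitions (the Stirling number of the second kind); the function names are illustrative and not part of the slides.

```python
from math import comb, factorial

def approx_partitions(n: int, k: int) -> int:
    """Estimate k^n / k! of the ways to split n items into k clusters."""
    return k**n // factorial(k)

def exact_partitions(n: int, k: int) -> int:
    """Exact count: Stirling number of the second kind, by inclusion-exclusion."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)

estimate = approx_partitions(500, 10)
digits = str(estimate)
# Print the leading digits and the decimal exponent of the estimate.
print(f"{digits[0]}.{digits[1:4]} x 10^{len(digits) - 1}")
# Small sanity check: 4 items can be split into 2 non-empty groups in 7 ways.
print(exact_partitions(4, 2))
```

For small n the estimate is rough (4 items, 2 clusters: 8 versus the exact 7), but the two counts converge quickly as n grows.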


Measure of similarity

[Figure: two items plotted against the axes E1 and E2, separated by the distance d]

What is similar? Euclidean distance
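As a minimal sketch, the Euclidean distance between two expression profiles is the square root of the summed squared differences over the experiments. The two short profiles below reuse the first three values of G1 and G2 from the numerical data table; the function name is illustrative.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equally long expression profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same experiments")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

g1 = [-0.47, -3.32, -0.81]  # G1 in experiments E1..E3
g2 = [0.66, 0.07, 0.20]     # G2 in experiments E1..E3
print(round(euclidean(g1, g2), 2))
```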



Hierarchical clustering

• INPUT: n genes/experiments
• Consider each gene/experiment as an individual cluster and initiate an n × n distance matrix d
• Repeat
  – identify the two most similar clusters in d (i.e. smallest number in d)
  – merge the two most similar clusters and update the matrix (i.e. substitute the two clusters with the new cluster)
• OUTPUT: A tree of merged genes/experiments (called a dendrogram)
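The loop above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation behind the slides: it uses single linkage for the matrix update, records the merges instead of drawing a dendrogram, and runs on a made-up 4 × 4 distance matrix.

```python
def agglomerate(labels, dist):
    """Agglomerative clustering with single linkage on a symmetric distance matrix.

    Returns the merges in order as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = {i: (label,) for i, label in enumerate(labels)}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        # Identify the two most similar clusters (smallest number in d).
        (a, b), height = min(d.items(), key=lambda item: item[1])
        merges.append((clusters[a], clusters[b], height))
        # Update the matrix: distances from the new cluster to all others.
        new = next_id
        next_id += 1
        for c in clusters:
            if c in (a, b):
                continue
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            d[(c, new)] = min(d_ac, d_bc)  # single linkage
        # Substitute the two merged clusters with the new cluster.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        clusters[new] = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
    return merges

# Hypothetical distances between four genes.
toy = [[0, 1, 4, 5],
       [1, 0, 3, 6],
       [4, 3, 0, 2],
       [5, 6, 2, 0]]
for left, right, height in agglomerate(["g1", "g2", "g3", "g4"], toy):
    print(left, right, height)
```

The list of merges with their heights is all that is needed to draw the dendrogram.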


Hierarchical clustering

Intercluster similarity measures: (a) single linkage, (b) complete linkage and (c) average linkage
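The three intercluster measures differ only in how the pairwise distances between members of two clusters are summarized: the minimum (single), the maximum (complete) or the mean (average). A small sketch on hypothetical 2-D points, using Euclidean distance as the underlying measure:

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+

def single_linkage(c1, c2):
    """Distance between the two closest members."""
    return min(dist(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2):
    """Distance between the two most distant members."""
    return max(dist(p, q) for p, q in product(c1, c2))

def average_linkage(c1, c2):
    """Mean of all pairwise distances between the clusters."""
    return sum(dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))

# Two hypothetical clusters on a line.
left = [(0, 0), (1, 0)]
right = [(3, 0), (5, 0)]
print(single_linkage(left, right), complete_linkage(left, right), average_linkage(left, right))
# → 2.0 5.0 3.5
```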

Example of hierarchical clustering: languages of Europe

Distance: the number of numerals whose names start with a different letter in the two languages, e.g.

d(E, N) = 2, d(E, Du) = 7, d(Sp, I) = 1

Intercluster strategy: SINGLE LINKAGE

Iteration 1

      E   N  Da  Du   G  Fr  Sp   I   P   H  Fi
E     0
N     2   0
Da    2   1   0
Du    7   5   6   0
G     6   4   5   5   0
Fr    6   6   6   9   7   0
Sp    6   6   5   9   7   2   0
I     6   6   5   9   7   1   1   0
P     7   7   6  10   8   5   3   4   0
H     9   8   8   8   9  10  10  10  10   0
Fi    9   9   9   9   9   9   9   9   9   8   0

[Dendrogram, distance axis 1 to 8; leaves so far: I, Fr]


Iteration 2

      I+Fr   E   N  Da  Du   G  Sp   P   H  Fi
I+Fr     0
E        6   0
N        6   2   0
Da       5   2   1   0
Du       9   7   5   6   0
G        7   6   4   5   5   0
Sp       1   6   6   5   9   7   0
P        4   7   7   6  10   8   3   0
H       10   9   8   8   8   9  10  10   0
Fi       9   9   9   9   9   9   9   9   8   0

[Dendrogram, distance axis 1 to 8; leaves so far: I, Fr, Da, N]

Iteration 3

      Da+N  I+Fr   E  Du   G  Sp   P   H  Fi
Da+N     0
I+Fr     5     0
E        2     6   0
Du       5     9   7   0
G        4     7   6   5   0
Sp       5     1   6   9   7   0
P        6     4   7  10   8   3   0
H        8    10   9   8   9  10  10   0
Fi       9     9   9   9   9   9   9   8   0

[Dendrogram, distance axis 1 to 8; leaves so far: I, Fr, Sp, Da, N]

Iteration 4

         Sp+I+Fr  Da+N   E  Du   G   P   H  Fi
Sp+I+Fr        0
Da+N           5     0
E              6     2   0
Du             9     5   7   0
G              7     4   6   5   0
P              3     6   7  10   8   0
H             10     8   9   8   9  10   0
Fi             9     9   9   9   9   9   8   0

[Dendrogram, distance axis 1 to 8; leaves so far: I, Fr, Sp, Da, N, E]

Iteration 5

         E+Da+N  Sp+I+Fr  Du   G   P   H  Fi
E+Da+N        0
Sp+I+Fr       5        0
Du            5        9   0
G             4        7   5   0
P             6        3  10   8   0
H             8       10   8   9  10   0
Fi            9        9   9   9   9   8   0

[Dendrogram, distance axis 1 to 8; leaves so far: I, Fr, Sp, P, Da, N, E]


Iteration 6

           P+Sp+I+Fr  E+Da+N  Du   G   H  Fi
P+Sp+I+Fr          0
E+Da+N             5       0
Du                 9       5   0
G                  7       4   5   0
H                 10       8   8   9   0
Fi                 9       9   9   9   8   0

[Dendrogram, distance axis 1 to 8; leaves so far: I, Fr, Sp, P, Da, N, E, G]

Iteration 7

           G+E+Da+N  P+Sp+I+Fr  Du   H  Fi
G+E+Da+N          0
P+Sp+I+Fr         5          0
Du                5          9   0
H                 8         10   8   0
Fi                9          9   9   8   0

[Dendrogram, distance axis 1 to 8; leaves so far: I, Fr, Sp, P, Da, N, E, G, Du]

Iteration 8

              Du+G+E+Da+N  P+Sp+I+Fr   H  Fi
Du+G+E+Da+N             0
P+Sp+I+Fr               5          0
H                       8         10   0
Fi                      9          9   8   0

[Dendrogram, distance axis 1 to 8; leaves so far: I, Fr, Sp, P, Da, N, E, G, Du]

Iteration 9

                        P+Sp+I+Fr+Du+G+E+Da+N   H  Fi
P+Sp+I+Fr+Du+G+E+Da+N                       0
H                                           8   0
Fi                                          9   8   0

[Dendrogram, distance axis 1 to 8; leaves so far: I, Fr, Sp, P, Da, N, E, G, Du, H]


Iteration 10

                        Fi+H  P+Sp+I+Fr+Du+G+E+Da+N
Fi+H                       0
P+Sp+I+Fr+Du+G+E+Da+N      8                      0

[Dendrogram, distance axis 1 to 8; leaves: I, Fr, Sp, P, Da, N, E, G, Du, H, Fi]
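The merge distances of this example can be cross-checked in code. The sketch below takes the lower-triangular distance matrix of iteration 1 and uses a swapped-in technique rather than the matrix-update loop of the slides: single-linkage merge heights equal the edge weights of a minimum spanning tree, so a Kruskal-style pass with a union-find structure yields the same heights. Function and variable names are illustrative.

```python
LANGS = ["E", "N", "Da", "Du", "G", "Fr", "Sp", "I", "P", "H", "Fi"]
# Lower triangle of the iteration 1 distance matrix, one row per language.
LOWER = [
    [0],
    [2, 0],
    [2, 1, 0],
    [7, 5, 6, 0],
    [6, 4, 5, 5, 0],
    [6, 6, 6, 9, 7, 0],
    [6, 6, 5, 9, 7, 2, 0],
    [6, 6, 5, 9, 7, 1, 1, 0],
    [7, 7, 6, 10, 8, 5, 3, 4, 0],
    [9, 8, 8, 8, 9, 10, 10, 10, 10, 0],
    [9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 0],
]

def single_linkage_heights(lower):
    """Merge distances of single-linkage clustering via Kruskal's algorithm."""
    n = len(lower)
    parent = list(range(n))

    def find(x):
        # Root of the cluster containing x, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((lower[i][j], i, j) for i in range(n) for j in range(i))
    heights = []
    for d, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the pair joins two different clusters
            parent[root_i] = root_j
            heights.append(d)
    return heights

print(single_linkage_heights(LOWER))
# → [1, 1, 1, 2, 3, 4, 5, 5, 8, 8]
```

The ten heights match the smallest entries of the ten matrices above. The order in which tied pairs are merged may differ from the slides, but the heights do not depend on it.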


Any data mining result needs to be consistent BOTH with the data and current knowledge!


Evaluation of clusters

[Figure: final dendrogram over I, Fr, Sp, P, Da, N, E, G, Du, H, Fi with distance axis 1 to 8, and a legend of language families: Romance, Slavic, Germanic, Ugro-Finnish]

Clusters may be evaluated according to how well they describe current knowledge

Hierarchical clustering: properties

• Huge memory requirements: stores the n × n matrix
• Running time: O(n^3)
• Deterministic: produces the same clustering each time
• Nice visualization: dendrogram
• Number of clusters can be selected using the dendrogram



K-means clustering

• Split the data into k random clusters
• Repeat
  – calculate the centroid of each cluster
  – (re-)assign each gene/experiment to the closest centroid
  – stop if no new assignments are made
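A minimal sketch of these steps in plain Python follows. The random initial split, the handling of an empty cluster (re-seeding it with a random point) and the `max_iter` safety limit are assumptions added for the illustration; the data points are made up.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Split the data into k random clusters of near-equal size.
    assign = [i % k for i in range(len(points))]
    rng.shuffle(assign)
    centroids = []
    for _ in range(max_iter):
        # Calculate the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if not members:  # assumption: re-seed an empty cluster
                members = [rng.choice(points)]
            centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
        # (Re-)assign each point to the closest centroid.
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        # Stop if no new assignments are made.
        if new_assign == assign:
            break
        assign = new_assign
    return assign, centroids

# Two well-separated hypothetical groups in two dimensions.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(data, k=2)
print(labels)
```

Because the initial split is random, the cluster numbering (which group is called 0 or 1) depends on the seed.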

Example of K-means: two dimensions

[Figure sequence; centroids are marked x]

Initial clusters, K = 2

Iteration 1: calculate centroids

Iteration 1: (re-)assign


Iteration 2: calculate centroids

Iteration 2: (re-)assign

Iteration 3: calculate centroids

Iteration 3: (re-)assign. No new assignments! STOP



K-means: properties

• Low memory usage
• Running time: O(n) per iteration
• Improves iteratively: not trapped in previous mistakes
• Non-deterministic: will in general produce different clusters with different initializations
• Number of clusters must be decided in advance


Hierarchical vs. k-means

• Hierarchical clustering:
  – computationally expensive -> relatively small data sets
  – nice visualization, no. of clusters can be selected
  – deterministic
  – cannot correct early “mistakes”
• K-means:
  – computationally efficient -> large data sets
  – predefined no. of clusters
  – non-deterministic -> should be run several times
  – iterative improvement
• Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k=2 -> best of both worlds!


Supervised learning

• Uses examples of known classes to learn a model
• Examples are expression profiles of genes with known classes (clinical state or function)
• The model can be e.g.
  – hyperplanes separating classes in n dimensions
  – artificial neural networks
  – decision trees
  – IF-THEN rules
• Can be used for e.g.
  – diagnostics
  – predicting gene function for unknown genes


Support Vector Machines

Maximum margin separating “hyperplane”

Support vectors

Soft margin



Artificial neural networks

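[Figure: feed-forward network with inputs x1, x2, x3, x4 in the input layer and output f(x) in the output layer]

[Figure: a single unit with inputs x1 … xn and weights w1 … wn, computing]

$$ f(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases} $$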
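A single threshold unit of this kind can be sketched as follows, assuming the reading that the unit outputs 1 when the weighted sum of its inputs exceeds 0 and -1 otherwise. The weights and inputs below are made up.

```python
def threshold_unit(x, w):
    """Output 1 if the weighted input sum is positive, otherwise -1."""
    if len(x) != len(w):
        raise ValueError("need one weight per input")
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if weighted_sum > 0 else -1

print(threshold_unit([1.0, 2.0], [0.5, -0.1]))  # weighted sum is 0.3
print(threshold_unit([1.0, 2.0], [-1.0, 0.2]))  # weighted sum is -0.6
```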

Decision tree learning

Class knowledge:
Group 1: Nordic countries
Group 2: UK, France, Greece, Spain, Portugal
Group 3: Benelux countries, Switzerland, Austria, Italy, Germany

Christian Democrats > 16?
  Yes: Group 3
  No: Agrarians > 4?
    Yes: Group 1
    No: Group 2

Rule learning: Rough sets

Agrarians([4, *)) AND Christian Democrats([*, 16)) => Class(1)
Agrarians([*, 4)) AND Christian Democrats([*, 16)) => Class(2)
Christian Democrats([16, *)) => Class(3)
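The tree and the three rules encode the same classifier, which can be written as two nested tests. The sketch below follows the interval boundaries of the rules (a value of exactly 16 or 4 falls in the upper interval); the function name and the example values are illustrative.

```python
def classify_country(christian_democrats: float, agrarians: float) -> int:
    """Return the group (1, 2 or 3) for the two attribute values."""
    if christian_democrats >= 16:   # Christian Democrats([16, *)) => Class(3)
        return 3
    if agrarians >= 4:              # Agrarians([4, *)) => Class(1)
        return 1
    return 2                        # Agrarians([*, 4)) => Class(2)

print(classify_country(christian_democrats=20, agrarians=0))
print(classify_country(christian_democrats=5, agrarians=10))
print(classify_country(christian_democrats=5, agrarians=1))
```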


Supervised vs. clustering

Clustering
+ class discovery
+ robust towards incorrect knowledge

Supervised
+ evaluation
+ predictive/descriptive model
+ based on actual knowledge rather than idealized hypotheses


Predicting biological process from gene expression time profiles

Papers:

I. T. R. Hvidsten, A. Lægreid and J. Komorowski. Learning rule-based models of biological process from gene expression time profiles using gene ontology, Bioinformatics, 19(9): 1116-1123, 2003.

II. A. Lægreid, T. R. Hvidsten, H. Midelfart, J. Komorowski and A. K. Sandvik. Predicting Gene Ontology Biological Process From Temporal Gene Expression Patterns, Genome Research, 13(5): 965-979, 2003.


Hierarchical clustering

Iyer et al., The transcriptional program in the response of human fibroblasts to serum, Science, 283(5398): 83-87, 1999


Gene Ontology vs. expression clustering


Methodology

1. Annotation
2. Extracting features for learning
3. Inducing minimal decision rules using rough sets
4. The function of uncharacterized genes is predicted using the rules

Annotated expression time profiles:

Gene    0HR  15MIN  30MIN    1HR    2HR    4HR    6HR    8HR   12HR   16HR   20HR   24HR  Process
g1     0.00  -0.47  -3.32  -0.81   0.11  -0.60  -1.36  -1.03  -1.84  -1.00  -0.60  -0.94  Unknown
g2     0.00   0.66   0.07   0.20   0.29  -0.89  -0.45  -0.29  -0.29  -0.15  -0.45  -0.42  Transport and defense response
g3     0.00   0.14  -0.04   0.00  -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81  -1.12  Cell cycle control
g4     0.00  -0.04   0.00  -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76  -0.62  Positive control of cell proliferation
g5     0.00   0.28   0.37   0.11  -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29  -0.74  Positive control of cell proliferation
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...  ...

[Figure: ontology fragment linking the genes g2, g3, g4 and g5 to the processes transport, defense response, cell cycle control and positive control of cell proliferation]

Example rule:

0 - 4(Increasing) AND 6 - 10(Decreasing) AND 14 - 18(Constant) => GO(cell proliferation)

[Figure: expression profile plot, x-axis 0 to 24, y-axis -2 to 1.5]
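The feature extraction step turns a time profile into discrete trends such as 0 - 4(Increasing). The sketch below is only one plausible reading, not the published procedure: it labels an interval by the change between its two endpoints, requires the endpoints to be sampled time points, and uses a hypothetical threshold of 0.3. The profile is g3 from the annotation table, with the time points read as hours.

```python
TIMES = [0, 0.25, 0.5, 1, 2, 4, 6, 8, 12, 16, 20, 24]  # sampled time points (hours)
G3 = [0.00, 0.14, -0.04, 0.00, -0.15, -0.58, -0.30, -0.18, -0.38, -0.49, -0.81, -1.12]

def interval_trend(times, values, start, end, threshold=0.3):
    """Label the interval start..end as Increasing, Decreasing or Constant.

    Assumption: the trend is the difference between the endpoint values, and
    `threshold` (hypothetical) is the smallest change that counts as a trend.
    Raises ValueError if an endpoint is not a sampled time point.
    """
    change = values[times.index(end)] - values[times.index(start)]
    if change > threshold:
        return "Increasing"
    if change < -threshold:
        return "Decreasing"
    return "Constant"

for start, end in [(0, 4), (4, 8), (8, 12)]:
    print(f"{start} - {end}({interval_trend(TIMES, G3, start, end)})")
```

Features of this form are what the IF-part of a rule tests.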



Rule Induction

• IF-part (antecedent, premise): the minimal set of discrete changes in expression needed to uphold the discriminatory power of the full data set

• THEN-part (consequent): all functions of genes described by the premise-side

• We want rules that describe the expression profiles of several genes with one or a few functions

– accuracy: the fraction of genes matching the IF-part that are annotated with the process in the THEN-part

– coverage: the fraction of genes annotated with the process in the THEN-part that match the IF-part

IF 0 - 4(Constant) AND 0 - 10(Increasing)

THEN GO(prot. met. and mod.) OR GO(mesoderm develop.) OR GO(prot. biosynt.)
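Rule accuracy and coverage as defined above reduce to set arithmetic over two gene sets: the genes matching the IF-part and the genes annotated with the process in the THEN-part. The gene identifiers in this sketch are hypothetical.

```python
def rule_quality(matching_if, annotated):
    """Return (accuracy, coverage) of a rule for one process.

    matching_if: genes whose expression profile matches the IF-part
    annotated:   genes annotated with the process in the THEN-part
    """
    both = matching_if & annotated
    accuracy = len(both) / len(matching_if)
    coverage = len(both) / len(annotated)
    return accuracy, coverage

# Hypothetical gene sets.
matching_if = {"geneA", "geneB", "geneC", "geneD"}
annotated = {"geneB", "geneC", "geneD", "geneE", "geneF", "geneG"}
print(rule_quality(matching_if, annotated))
# → (0.75, 0.5)
```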


Rule example

Rule:

0 - 4(Constant) AND 0 - 10(Increasing) => GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)

Covered genes: M35296, J02783, D13748, X05130, X60957

[Figure: expression profiles of the covered genes, x-axis 0 to 24, y-axis -1 to 3]


Classification

[Figure: the expression profile of gene X60957 (x-axis 0 to 24, y-axis -1 to 3) is matched against the list of IF … THEN … rules; vote increments shown in the figure: +4, +1, +1]

One of the rules in the list:

IF 0 - 4(Constant) AND 0 - 10(Increasing) THEN GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)

Process                               Votes
protein metabolism and modification       6
mesoderm development                      3
proteolysis and peptidolysis              2
transcription                             1
protein biosynthesis                      1
vision                                    1
…

Votes are normalized and processes with vote fractions higher than a selection-threshold are chosen as predictions
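The final selection step can be sketched as follows. The vote counts are the rows listed on the slide; the elided rows are ignored here, so the fractions are illustrative only, and the threshold of 0.2 is a hypothetical value.

```python
def predict_processes(votes, threshold):
    """Normalize votes and keep processes whose vote fraction exceeds the threshold."""
    total = sum(votes.values())
    fractions = {process: count / total for process, count in votes.items()}
    ranked = sorted(fractions.items(), key=lambda item: item[1], reverse=True)
    return [process for process, fraction in ranked if fraction > threshold]

votes = {
    "protein metabolism and modification": 6,
    "mesoderm development": 3,
    "proteolysis and peptidolysis": 2,
    "transcription": 1,
    "protein biosynthesis": 1,
    "vision": 1,
}
print(predict_processes(votes, threshold=0.2))
# → ['protein metabolism and modification', 'mesoderm development']
```

A lower threshold yields more predictions per gene, and a higher one yields fewer.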


Evaluation

Evaluation technique:
– divide examples into training set and test set
– cross validation

Evaluation measures:
– accuracy = (TP+TN)/(TP+FN+TN+FP)
– sensitivity = TP/(TP+FN)
– specificity = TN/(TN+FP)
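The three measures follow directly from the four confusion counts. A small sketch with made-up counts:

```python
def evaluation_measures(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical test set: 10 positives and 90 negatives.
measures = evaluation_measures(tp=8, fn=2, tn=81, fp=9)
print(measures)
# → {'accuracy': 0.89, 'sensitivity': 0.8, 'specificity': 0.9}
```

With unbalanced classes such as these, accuracy is dominated by the negatives, which is why sensitivity and specificity are reported separately.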


Threshold selection

[Figure: test set genes g1 … g12 placed along an axis of the fraction of votes for “protein biosynthesis” (up to 1); markers distinguish genes with function “protein biosynthesis” from genes with a different function]

The figure marks two thresholds (Threshold 1 and Threshold 2) with the operating points:
– sensitivity = 2/3, specificity = 1
– sensitivity = 1, specificity = 2/3

sensitivity: TP/(TP+FN)
specificity: TN/(TN+FP)
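The trade-off between the two operating points can be reproduced on made-up data. The sketch below assumes 12 test genes, 3 of which have the function, with hypothetical vote fractions; a gene is predicted positive when its fraction is higher than the threshold.

```python
from math import isclose

def sens_spec(fractions, has_function, threshold):
    """Sensitivity and specificity when predicting 'positive' above the threshold."""
    tp = sum(1 for f, y in zip(fractions, has_function) if f > threshold and y)
    fn = sum(1 for f, y in zip(fractions, has_function) if f <= threshold and y)
    tn = sum(1 for f, y in zip(fractions, has_function) if f <= threshold and not y)
    fp = sum(1 for f, y in zip(fractions, has_function) if f > threshold and not y)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical vote fractions for 12 test genes (first three have the function).
fractions = [0.9, 0.7, 0.4, 0.6, 0.5, 0.45, 0.3, 0.2, 0.2, 0.1, 0.1, 0.0]
has_function = [True] * 3 + [False] * 9

strict = sens_spec(fractions, has_function, threshold=0.65)
lenient = sens_spec(fractions, has_function, threshold=0.35)
print(strict)   # sensitivity 2/3, specificity 1
print(lenient)  # sensitivity 1, specificity 2/3
```

Lowering the threshold finds the third positive gene at the cost of three false positives.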


ROC analysis and classifier evaluation

[Figure: ROC plot with sensitivity on the y-axis and 1 - specificity on the x-axis, both from 0 to 1; labels: no discrimination, perfect discrimination, AUC]

• ROC: the receiver operating characteristic curve results from plotting sensitivity against 1 - specificity for all possible thresholds
  – sensitivity: TP/(TP+FN)
  – specificity: TN/(TN+FP)
• AUC: area under the ROC curve
• Cross validation (CV)
  – systematic division of data into training and test sets
  – CV estimates are interpreted as the classification performance expected on new, unseen data
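A ROC curve and its AUC can be computed by sweeping the threshold over the observed scores. The sketch below predicts positive when the score is at least the threshold and integrates with the trapezoidal rule; the scores and labels are made up.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs for every observed threshold."""
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

perfect = roc_points([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
uninformative = roc_points([0.5, 0.5, 0.5, 0.5], [True, False, True, False])
print(auc(perfect))        # → 1.0 (perfect discrimination)
print(auc(uninformative))  # → 0.5 (no discrimination)
```

In a cross validation setting, the scores would come from models trained without the genes being scored.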


Over all classes:
Coverage = TP/(TP+FN)
Precision = TP/(TP+FP)

Coverage: 84%
Precision: 50%



*Iyer et al.

Cross validation estimates

PROCESS                                  AUC    SE   P-VALUE
Ion homeostasis                         1.00  0.00     0.008
Protein targeting                       0.99  0.03     0.000
Blood coagulation                       0.96  0.08     0.000
DNA metabolism                          0.94  0.09     0.000
Intracellular signaling cascade         0.94  0.06     0.000
Cell cycle                              0.93  0.04     0.000
Energy pathways                         0.93  0.12     0.004
Oncogenesis                             0.92  0.11     0.000
Circulation                             0.91  0.11     0.001
Cell death                              0.90  0.10     0.000
Developmental processes                 0.90  0.07     0.000
Defense (immune) response               0.88  0.05     0.000
Transcription                           0.88  0.11     0.002
Cell adhesion                           0.87  0.09     0.002
Stress response                         0.86  0.15     0.002
Protein metabolism and modification     0.85  0.10     0.000
Cell motility                           0.84  0.11     0.000
Cell surface rec linked signal transd   0.82  0.15     0.005
Lipid metabolism                        0.81  0.14     0.000
Cell organization and biogenesis        0.79  0.11     0.000
Cell proliferation                      0.79  0.06     0.002
Transport                               0.79  0.17     0.001
Amino acid and derivative metabolism    0.69  0.06     0.288
AVERAGE                                 0.88  0.09
