MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS
Elena Marchiori
IBIVU, Vrije Universiteit Amsterdam

Page 1: MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS Elena Marchiori IBIVU Vrije Universiteit Amsterdam


Page 2

Summary

• Machine Learning

• Supervised Learning: classification

• Unsupervised Learning: clustering

Page 3

Machine Learning (ML)

• Construct a computational model from a dataset describing properties of an unknown (but existent) system.

[Diagram: an unknown system ("?") produces observations; ML constructs a computational model of the system's properties from them.]

Page 4

Supervised Learning

• The dataset describes examples of the input-output behaviour of an unknown (but existent) system.

• The algorithm tries to find a function ‘equivalent’ to the system.

• ML techniques for classification: K-nearest neighbour, decision trees, Naïve Bayes, Support Vector Machines.

Page 5

Supervised Learning

[Diagram: a supervisor labels observations of the unknown system ("?") with the property of interest; the resulting training data feed an ML algorithm that outputs a model, and the model predicts the property for each new observation.]

Page 6

Example: A Classification Problem

• Categorize images of fish—say, “Atlantic salmon” vs. “Pacific salmon”

• Use features such as length, width, lightness, fin shape & number, mouth position, etc.

• Steps:
  1. Preprocessing (e.g., background subtraction)
  2. Feature extraction
  3. Classification

example from Duda & Hart

Page 7

Classification in Bioinformatics

• Computational diagnostics: early cancer detection

• Tumor biomarker discovery

• Protein folding prediction

• Protein-protein binding site prediction

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 8

Classification Techniques

• Naïve Bayes

• K Nearest Neighbour

• Support Vector Machines (next lesson)

• …

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 9

Bayesian Approach

• Each observed training example can incrementally decrease or increase the estimated probability of a hypothesis, rather than eliminating the hypothesis outright

• Prior knowledge can be combined with observed data to determine the hypothesis

• Bayesian methods can accommodate hypotheses that make probabilistic predictions

• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities

Kathleen McKeown’s slides

Page 10

Bayesian Approach

• Assign the most probable target value, given the attribute values <a1, a2, …, an>

• v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)

• Using Bayes' theorem:
  v_MAP = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
        = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj)

• Bayesian learning is optimal
• It is easy to estimate P(vj) by counting in the training data
• Estimating the different P(a1, a2, …, an | vj) is not feasible (we would need a training set of size proportional to the number of possible instances times the number of classes)

Kathleen McKeown's slides

Page 11

Bayes’ Rules

• Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)

• Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)

• In distribution form:
  P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)

Kathleen McKeown’s slides

Page 12

Naïve Bayes

• Assume conditional independence of the attributes given the class:
  P(a1, a2, …, an | vj) = ∏i P(ai | vj)

• Substitute into the v_MAP formula:
  v_NB = argmax_{vj ∈ V} P(vj) ∏i P(ai | vj)

Kathleen McKeown’s slides

Page 13

v_NB = argmax_{vj ∈ V} P(vj) ∏i P(ai | vj)

S-length S-width P-length Class

1 high high high Versicolour

2 low high low Setosa

3 low high low Verginica

4 low high med Verginica

5 high high high Versicolour

6 high high med Setosa

7 high high low Setosa

8 high high high Versicolour

9 high high high Versicolour

Kathleen McKeown’s slides
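The v_NB rule on this toy table can be worked out by simple counting. A minimal sketch (not part of the original slides; the data reproduce the table above):

```python
# Naive Bayes by counting, on the toy iris-style table above.
data = [
    (("high", "high", "high"), "Versicolour"),
    (("low",  "high", "low"),  "Setosa"),
    (("low",  "high", "low"),  "Verginica"),
    (("low",  "high", "med"),  "Verginica"),
    (("high", "high", "high"), "Versicolour"),
    (("high", "high", "med"),  "Setosa"),
    (("high", "high", "low"),  "Setosa"),
    (("high", "high", "high"), "Versicolour"),
    (("high", "high", "high"), "Versicolour"),
]

def naive_bayes_predict(x, data):
    """Return argmax_v P(v) * prod_i P(a_i | v), all estimated by counting."""
    classes = {v for _, v in data}
    n = len(data)
    best, best_score = None, -1.0
    for v in classes:
        rows = [a for a, c in data if c == v]
        score = len(rows) / n                      # P(v)
        for i, ai in enumerate(x):
            match = sum(1 for a in rows if a[i] == ai)
            score *= match / len(rows)             # P(a_i | v)
        if score > best_score:
            best, best_score = v, score
    return best

print(naive_bayes_predict(("high", "high", "high"), data))  # Versicolour
```

For the query (high, high, high), only Versicolour has a non-zero count for every attribute value, so it wins with score 4/9.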

Page 14

Estimating Probabilities

• What happens when the number of data elements is small?

• Suppose the true P(S-length=low | Verginica) = .05
• There are only 2 instances with C = Verginica
• We estimate the probability by nc/n using the training set
• Then the count #(S-length=low ∧ Verginica) must be 0
• Then, instead of .05, we use an estimated probability of 0
• Two problems:
  • a biased underestimate of the probability
  • this probability term will dominate if a future query contains S-length=low

Kathleen McKeown’s slides

Page 15

Instead: use m-estimate

• Use priors as well

• (nc + m·p) / (n + m)

  – where p = prior estimate of P(S-length=low | Verginica)

  – m is a constant called the equivalent sample size
    » determines how heavily to weight p relative to the observed data
    » typical method: assume a uniform prior over the attribute's values (e.g. if the values are low, med, high, then p = 1/3)

Kathleen McKeown’s slides
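The m-estimate is a one-line correction. A small sketch (illustrative, not from the slides; m = 3 is an assumed choice):

```python
def m_estimate(nc, n, m, p):
    """m-estimate of probability: (nc + m*p) / (n + m).
    nc: count of the attribute value within the class, n: class count,
    p: prior estimate, m: equivalent sample size."""
    return (nc + m * p) / (n + m)

# Slide's scenario: no 'low' S-length among the 2 Verginica rows,
# uniform prior p = 1/3 over {low, med, high}:
print(m_estimate(0, 2, 3, 1/3))  # 0.2 instead of 0
```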

Page 16

K-Nearest Neighbour
• Memorize the training data

• Given a new example, find its k nearest neighbours, and output the majority-vote class.

• Choices:
  – How many neighbours?
  – What distance measure?
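The whole classifier fits in a few lines. A minimal sketch with Euclidean distance and majority vote (the training data here are made up for illustration):

```python
import math
from collections import Counter

def knn_predict(x, data, k=3):
    """Majority vote among the k nearest training examples
    (Euclidean distance); data is a list of (vector, label) pairs."""
    neighbours = sorted(data, key=lambda p: math.dist(x, p[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((0.1, 0.3), "A"),
         ((2.0, 2.0), "B"), ((2.1, 1.9), "B"), ((1.8, 2.2), "B")]
print(knn_predict((0.1, 0.1), train, k=3))  # A
```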

Page 17

Application in Bioinformatics
• A regression-based K-nearest-neighbor algorithm for gene function prediction from heterogeneous data, Z. Yao and W.L. Ruzzo, BMC Bioinformatics 2006, 7

1. For each dataset k and for each pair of genes p, compute the similarity f(p,k) of p w.r.t. the k-th dataset.

2. Construct a predictor H of gene-pair similarity, e.g. by logistic regression: (f(p,1), …, f(p,m)) → H(f(p,1), …, f(p,m)), such that H takes a high value if the genes of p have similar functions.

Given a new gene g, find its k nearest neighbours using H as the distance. Predict the functional classes C1, …, Cn of g with confidence Confidence(Ci) = 1 − Π(1 − Pij), taken over the neighbours gj of g with Ci in the set of classes of gj (the probability that at least one prediction is correct, i.e. 1 − the probability that all predictions are wrong).
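The confidence combination at the end is just the complement of "all neighbour predictions wrong". A small sketch (the Pij values below are hypothetical):

```python
def confidence(ps):
    """Probability that at least one of the neighbour-based predictions
    is correct: 1 - prod(1 - P_ij)."""
    out = 1.0
    for p in ps:
        out *= (1.0 - p)
    return 1.0 - out

# Two neighbours each predicting the class with probability 0.5:
print(confidence([0.5, 0.5]))  # 0.75
```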

Page 18

Classification: CV error

• Training error– Empirical error

• Error on independent test set – Test error

• Cross validation (CV) error– Leave-one-out (LOO)– N-fold CV

[Diagram: the N samples are split into a test part (1/n of the samples) and a training part (the remaining (1 − 1/n)·N samples); errors are counted on the test part and summarized as the CV error rate.]

Supervised learning
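Leave-one-out CV follows the diagram directly: hold out each sample once, train on the rest, count errors. In this sketch a 1-nearest-neighbour classifier stands in for the model (toy data, not from the lecture):

```python
import math

def nn1_predict(x, train):
    """1-nearest-neighbour classifier used as a stand-in model."""
    return min(train, key=lambda p: math.dist(x, p[0]))[1]

def loo_error(data):
    """Leave-one-out CV: train on all samples but one, test on that one."""
    errors = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if nn1_predict(x, train) != y:
            errors += 1
    return errors / len(data)

data = [((0.0,), "A"), ((0.1,), "A"), ((0.2,), "A"),
        ((1.0,), "B"), ((1.1,), "B"), ((5.0,), "A")]
print(loo_error(data))  # one LOO error (the outlier at 5.0), i.e. 1/6
```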

Page 19

Two schemes of cross validation

[Diagram: two schemes over the N samples. CV1: the gene selector and the classifier are both trained and tested inside each LOO fold, and errors are counted. CV2: gene selection is done once on all N samples, and only the classifier is trained and tested inside the LOO folds.]

Supervised learning

Page 20

Difference between CV1 and CV2

• CV1: gene selection within the LOOCV loop
• CV2: gene selection before the LOOCV loop
• CV2 can yield an optimistic estimate of the true classification error

• CV2 was used in the paper by Golub et al.:
  – 0 training errors
  – 2 CV errors (5.26%)
  – 5 test errors (14.7%)
  – CV error differs from test error!

Supervised learning

Page 21

Significance of classification results

• Permutation test:
  – Permute the class labels of the samples
  – Compute the LOOCV error on the data with permuted labels
  – Repeat this process many times
  – Compare with the LOOCV error on the original data:

• P-value = (# of times the LOOCV error on permuted data <= the LOOCV error on the original data) / (total # of permutations considered)

Supervised learning
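The permutation test above can be sketched as follows; `error_fn` stands for any error estimator (e.g. LOOCV), here replaced by a toy fixed rule for illustration:

```python
import random

def permutation_p_value(error_fn, X, y, n_perm=200, seed=0):
    """P-value = fraction of label permutations whose error is <=
    the error obtained with the original labels."""
    rng = random.Random(seed)
    base = error_fn(X, y)
    hits = sum(1 for _ in range(n_perm)
               if error_fn(X, rng.sample(y, len(y))) <= base)
    return hits / n_perm

# Toy stand-in for the LOOCV error: disagreement with a fixed rule.
def rule_error(X, y):
    preds = ["B" if x > 0.5 else "A" for x in X]
    return sum(p != t for p, t in zip(preds, y)) / len(y)

X = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
y = ["A", "A", "A", "B", "B", "B"]
print(permutation_p_value(rule_error, X, y))  # small p-value: labels are informative
```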

Page 22

Unsupervised Learning

ML for unsupervised learning attempts to discover interesting structure in the available data

Unsupervised learning

Page 23

Unsupervised Learning

• The dataset describes the structure of an unknown (but existent) system.

• The computer program tries to identify the structure of the system (clustering, data compression).

• ML techniques: hierarchical clustering, k-means, Self Organizing Maps (SOM), fuzzy clustering (described in a future lesson).

Page 24

Clustering
Clustering is one of the most important unsupervised learning processes: it organizes objects into groups whose members are similar in some way.

Clustering finds structure in a collection of unlabeled data.

A cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.

Page 25

Clustering Algorithms

• Start with a collection of n objects, each represented by a p-dimensional feature vector xi, i = 1, …, n.

• The goal is to associate the n objects with k clusters so that objects within a cluster are more similar to one another than to objects in other clusters. k is usually unknown.

• Popular methods: hierarchical, k-means, SOM, …

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 26

Hierarchical Clustering

[Figure: a dendrogram and a Venn diagram of the clustered data.]

From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt

Page 27

Hierarchical Clustering (Cont.)

• Multilevel clustering: level 1 has n clusters; level n has one cluster.

• Agglomerative HC: starts with singleton clusters and merges them.

• Divisive HC: starts with one all-inclusive cluster and splits it.

Page 28

Nearest Neighbor Algorithm

• Nearest Neighbor Algorithm is an agglomerative approach (bottom-up).

• Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.

From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt

Page 29

Nearest Neighbor, Level 2, k = 7 clusters.

From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt

Page 30

Nearest Neighbor, Level 3, k = 6 clusters.

Page 31

Nearest Neighbor, Level 4, k = 5 clusters.

Page 32

Nearest Neighbor, Level 5, k = 4 clusters.

Page 33

Nearest Neighbor, Level 6, k = 3 clusters.

Page 34

Nearest Neighbor, Level 7, k = 2 clusters.

Page 35

Nearest Neighbor, Level 8, k = 1 cluster.

Page 36

Hierarchical Clustering

1. Calculate the similarity between all possible combinations of two profiles.

2. Group the two most similar clusters together to form a new cluster.

3. Calculate the similarity between the new cluster and all remaining clusters; repeat from step 2.

Keys:
• the similarity measure
• the clustering (linkage) rule

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
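The three steps above can be sketched as plain agglomerative clustering with single linkage (a naive O(n³) illustration, not the lecture's code):

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering: start from singletons, repeatedly merge
    the two clusters with the smallest minimum pairwise distance,
    stop when k clusters remain."""
    clusters = [[p] for p in points]

    def dist(c1, c2):
        # single linkage: minimum distance between members
        return min(math.dist(a, b) for a in c1 for b in c2)

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(single_linkage(pts, 2))  # two clusters: 3 points near the origin, 2 near (5, 5)
```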

Page 37

Clustering in Bioinformatics

• Microarray data quality checking:
  – Do replicates cluster together?
  – Do similar conditions, time points, tissue types cluster together?

• Cluster genes: predict the functions of unknown genes from the known ones

• Cluster samples: discover clinical characteristics (e.g. survival, marker status) shared by samples

• Promoter analysis of commonly regulated genes

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 38

Functionally significant gene clusters

Two-way clustering

Gene clusters

Sample clusters

Page 39

Bhattacharjee et al. (2001) Human lung carcinomas mRNA expression profiling reveals distinct adenocarcinoma subclasses.

Proc. Natl. Acad. Sci. USA, Vol. 98, 13790-13795.

Page 40

Similarity Measurements
• Pearson Correlation

Two profiles (vectors) x = (x1, …, xN) and y = (y1, …, yN):

  C_pearson(x, y) = Σ_{i=1..N} (xi − mx)(yi − my) / sqrt( [Σ_{i=1..N} (xi − mx)²] · [Σ_{i=1..N} (yi − my)²] )

  with mx = (1/N) Σ_{n=1..N} xn and my = (1/N) Σ_{n=1..N} yn

  −1 ≤ Pearson Correlation ≤ +1

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
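The formula translates directly into code (a sketch, using the slide's notation):

```python
import math

def pearson(x, y):
    """C_pearson(x, y): centered dot product divided by the product
    of the centered norms."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                    * sum((yi - my) ** 2 for yi in y))
    return num / den

a = [1.0, 2.0, 3.0, 4.0]
print(pearson(a, [x + 0.5 for x in a]))  # same trend, shifted: 1.0
print(pearson(a, [0.2 * x for x in a]))  # same trend, rescaled: 1.0
```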

Page 41

Similarity Measurements

• Pearson Correlation: Trend Similarity

[Figure: three profiles a, b and c with the same trend but different offset/scale (in the original figure b and c are shifted/scaled copies of a, with constants 0.5 and 0.2); all pairwise correlations are 1:
  C_pearson(a, b) = 1, C_pearson(a, c) = 1, C_pearson(b, c) = 1]

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 42

Similarity Measurements

• Euclidean Distance

  d(x, y) = sqrt( Σ_{n=1..N} (xn − yn)² )

  for two profiles x = (x1, …, xN) and y = (y1, …, yN)

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
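And the Euclidean distance, a sketch in the same style (the example vectors are illustrative):

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt(sum_n (x_n - y_n)^2)."""
    return math.sqrt(sum((xn - yn) ** 2 for xn, yn in zip(x, y)))

a = [1.0, 2.0, 3.0]
b = [x + 0.5 for x in a]
# Offset profiles have a non-zero distance even though their
# Pearson correlation is 1:
print(euclidean(a, b))
```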

Page 43

Similarity Measurements

• Euclidean Distance: Absolute difference

[Figure: the same three profiles a, b and c; d(a, c) = 1.5875, d(a, b) = 2.8025, d(b, c) = 3.2211. Unlike Pearson correlation, Euclidean distance reflects absolute differences rather than trends.]

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 44

Clustering

[Figure: three clusters C1, C2, C3. Merge which pair of clusters?]

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 45

Clustering: Single Linkage

Dissimilarity between two clusters (C1, C2) = the minimum dissimilarity between the members of the two clusters.

Tends to generate "long chains".

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 46

Clustering: Complete Linkage

Dissimilarity between two clusters (C1, C2) = the maximum dissimilarity between the members of the two clusters.

Tends to generate "clumps".

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 47

Clustering: Average Linkage

Dissimilarity between two clusters (C1, C2) = the averaged distance over all pairs of objects (one from each cluster).

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 48

Clustering: Average Group Linkage

Dissimilarity between two clusters (C1, C2) = the distance between the two cluster means.

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 49

Considerations

• Which genes are used to cluster the samples?
  – Expression variation
  – Inherent variation
  – Prior knowledge (irrelevant genes)
  – Etc.

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 50

K-means Clustering

– Initialize the K cluster representatives wj, e.g. to randomly chosen examples.

– Assign each input example x to the cluster c(x) with the nearest corresponding weight vector:

  c(x) = argmin_j || x − wj(n) ||

– Update the weights:

  wj(n+1) = ( Σ_{x such that c(x)=j} x ) / nj,  with nj the number of examples assigned to cluster j

– Increment n by 1 and go on until no noticeable changes of the cluster representatives occur.

Unsupervised learning
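The two alternating steps (assignment and update) can be sketched as follows; the seed choice and data are illustrative:

```python
import math

def kmeans(points, seeds, max_iter=100):
    """K-means sketch: assign each x to the nearest representative w_j,
    then move each w_j to the mean of its cluster; stop at convergence."""
    w = [list(s) for s in seeds]
    for _ in range(max_iter):
        clusters = [[] for _ in w]
        for x in points:
            j = min(range(len(w)), key=lambda j: math.dist(x, w[j]))
            clusters[j].append(x)
        # empty clusters keep their previous representative
        new_w = [[sum(c) / len(cl) for c in zip(*cl)] if cl else wj
                 for cl, wj in zip(clusters, w)]
        if new_w == w:
            break
        w = new_w
    return w, clusters

pts = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (4.0, 4.0), (4.2, 4.0)]
centers, clusters = kmeans(pts, seeds=[(0.0, 0.0), (4.0, 4.0)])
print(centers)
```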

Page 51

Example I

[Figure: initial data and seeds (left); final clustering (right).]

Unsupervised learning

Page 52

Example II

[Figure: initial data and seeds (left); final clustering (right).]

Unsupervised learning

Page 53

The brain maps the external multidimensional representation of the world into a similar one- or two-dimensional internal representation.

That is, the brain processes the external signals in a topology-preserving way.

Mimicking the way the brain learns, our clustering system should be able to do the same thing.

SOM: Brain’s self-organization

Unsupervised learning

Page 54

Self-Organized Map: idea

Data: vectors XT = (X1, ... Xd) from d-dimensional space.

Grid of nodes, with local processor (called neuron) in each node.

Local processor # j has d adaptive parameters W(j).

Goal: change the W(j) parameters to recover the data clusters in X space.

Unsupervised learning

Page 55

Training process

[Figure: x = data points; o = positions of the neuron weights. A 2-D grid of neurons is trained against the N-dimensional data space; the weights point to locations in N-D.]

Java demos: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html

Unsupervised learning

Page 56

Concept of the SOM

[Figure: input space vs. reduced feature space. The cluster centers (code vectors, labelled e.g. Mn, Sr, Ba) are clustered and ordered in a two-dimensional grid, and each code vector has a place in the reduced space.]

Unsupervised learning

Page 57

Concept of the SOM

[Figure: the same trained map (with labels such as Ba, Mn, Sr, Mg, SA3) can be used for visualization, for classification, and for clustering.]

Unsupervised learning

Page 58

SOM: learning algorithm

• Initialization: n = 0. Choose random small values for the weight-vector components.

• Sampling: select an x from the input examples.

• Similarity matching: find the winning neuron i(x) at iteration n:

  i(x) = argmin_j || x(n) − wj(n) ||

• Updating: adjust the weight vectors of all neurons using the rule

  wj(n+1) = wj(n) + η(n) · h_{j,i(x)}(d_{j,i(x)}) · ( x(n) − wj(n) )

• Continuation: n = n + 1. Go to the Sampling step until no noticeable changes in the weights are observed.

Unsupervised learning
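The four steps can be sketched for a 1-D lattice with the Gaussian neighbourhood of the next slide; all parameter values (η, σ, lattice size, data) are illustrative choices, not from the slides:

```python
import math
import random

def train_som(data, n_neurons=5, n_iter=500, eta=0.1, sigma=1.0, seed=0):
    """1-D SOM sketch: winner search by minimum distance, Gaussian
    neighbourhood h over the lattice distance |j - i|."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialization: small random weights
    w = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_neurons)]
    for _ in range(n_iter):
        x = rng.choice(data)                                   # Sampling
        winner = min(range(n_neurons),
                     key=lambda j: math.dist(x, w[j]))         # Matching
        for j in range(n_neurons):                             # Updating
            h = math.exp(-((j - winner) ** 2) / (2 * sigma ** 2))
            for d in range(dim):
                w[j][d] += eta * h * (x[d] - w[j][d])
    return w

data = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 1.1)]
weights = train_som(data)
# quantization error: mean distance from each input to its winning neuron
qe = sum(min(math.dist(x, wj) for wj in weights) for x in data) / len(data)
print(qe)
```

After training, the quantization error should be small: the code vectors have moved onto the two data clusters.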

Page 59

Neighborhood Function

– Gaussian neighborhood function:

  h_{j,i}(d_{j,i}) = exp( − d_{j,i}² / (2σ²) )

– d_{j,i}: lateral distance of neurons i and j
  • in a 1-dimensional lattice: | j − i |
  • in a 2-dimensional lattice: || rj − ri ||, where rj is the position of neuron j in the lattice.

Unsupervised learning

Page 60

Initial h function (example)

Unsupervised learning

Page 61

Some examples of real-life applications

The Helsinki University of Technology web site http://www.cis.hut.fi/research/refs/ contains > 5000 papers on SOM and its applications:

• Brain research: modeling of the formation of various topographical maps in motor, auditory, visual and somatotopic areas.

• Clustering of genes, protein properties, chemical compounds, speech phonemes, sounds of birds and insects, astronomical objects, economical data, business and financial data, ...

• Data compression (images and audio), information filtering.
• Medical and technical diagnostics.

Unsupervised learning

Page 62

Issues in Clustering

• How many clusters?
  – A user parameter
  – Use model-selection criteria (e.g. the Bayesian Information Criterion) with a penalization term that accounts for model complexity. See e.g. X-means: http://www2.cs.cmu.edu/~dpelleg/kmeans.html

• What similarity measure?
  – Euclidean distance
  – Correlation coefficient
  – Ad-hoc similarity measures

Unsupervised learning

Page 63

Validation of clustering results

• External measures
  – According to some external knowledge
  – Consideration of bias and subjectivity

• Internal measures
  – Quality of the clusters according to the data
  – Compactness and separation
  – Stability
  – …

See e.g. J. Handl, J. Knowles, D.B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics, 21(15):3201-3212, 2005.

Unsupervised learning

Page 64

Molecular Classification of Cancer:Class Discovery and Class Prediction by Gene Expression Monitoring

T.R. Golub et al., Science 286, 531 (1999)

Bioinformatics Application

Unsupervised learning

Page 65

Identification of cancer types

• Why is identification of the cancer class (tumor sub-type) important?
  – Cancers of identical grade can have widely variable clinical courses (e.g. acute lymphoblastic leukemia vs. acute myeloid leukemia).

• Traditional methods:
  – Morphological appearance.

  – Enzyme-based histochemical analyses.

  – Immunophenotyping.

  – Cytogenetic analysis.

Golub et al. 1999
Unsupervised learning

Page 66

Class Prediction

• How could one use an initial collection of samples belonging to known classes to create a class predictor?
  – Identification of informative genes

  – Weighted vote

Golub et al. slides
Unsupervised learning

Page 67

Data

• Initial sample: 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis.

• Independent sample: 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).

Golub et al. slides
Unsupervised learning

Page 68

Validation of Gene Voting

• Initial samples: 36 of the 38 samples were predicted as either AML or ALL, and two as uncertain; all 36 predictions agree with the clinical diagnosis.

• Independent samples: 29 of the 34 samples are strongly predicted, with 100% accuracy.

Golub et al. slides
Unsupervised learning

Page 69

Class Discovery

• Can cancer classes be discovered automatically based on gene expression?
  – Cluster tumors by gene expression
  – Determine whether the putative classes produced are meaningful.

Golub et al. slides
Unsupervised learning

Page 70

Cluster tumors

Self-Organizing Map (SOM): mathematical cluster analysis for recognizing and classifying features in complex, multidimensional data (similar to the K-means approach). It chooses a geometry of "nodes"; the nodes are mapped into K-dimensional space, initially at random, and are iteratively adjusted.

Golub et al. slides
Unsupervised learning

Page 71

Validation of SOM

• Prediction based on clusters A1 and A2:
  – 24/25 of the ALL samples from the initial dataset were clustered in group A1
  – 10/13 of the AML samples from the initial dataset were clustered in group A2

Golub et al. slides
Unsupervised learning

Page 72

Validation of SOM

• How could one evaluate the putative clusters if the "right" answer were not known?
  – Assumption: class discovery can be tested by class prediction.

• Testing the assumption:
  – Construct predictors based on clusters A1 and A2.
  – Construct predictors based on random clusters.

Golub et al. slides
Unsupervised learning

Page 73

Validation of SOM

• Predictions using the predictors based on clusters A1 and A2 yield 34 accurate predictions, one error, and three uncertain calls.

Golub et al. slides
Unsupervised learning

Page 74

Validation of SOM

Golub et al. slides
Unsupervised learning

Page 75

CONCLUSION

• In Machine Learning, every technique has its assumptions and constraints, advantages and limitations

• My view:
  – First perform simple data analysis before applying fancy high-tech ML methods

– Possibly use different ML techniques and then ensemble results

– Apply correct cross validation method!

– Check for significance of results (permutation test, stability of selected genes)

– Work in collaboration with data producer (biologist, pathologist) when possible!

ML in bioinformatics