MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS
Elena Marchiori
IBIVU, Vrije Universiteit Amsterdam

Page 1: MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS Elena Marchiori IBIVU Vrije Universiteit Amsterdam


Page 2

Summary

• Machine Learning

• Supervised Learning: classification

• Unsupervised Learning: clustering

Page 3

Machine Learning (ML)

• Construct a computational model from a dataset describing properties of an unknown (but existent) system.

[Diagram: an unknown system ("?") produces observations; ML constructs a computational model of the system's properties from them.]

Page 4

Supervised Learning

• The dataset describes examples of the input-output behaviour of an unknown (but existent) system.

• The algorithm tries to find a function ‘equivalent’ to the system.

• ML techniques for classification: K-nearest neighbour, decision trees, Naïve Bayes, Support Vector Machines.

Page 5

Supervised Learning

[Diagram: a supervisor labels observations of the unknown system ("?") with the property of interest; the resulting training data feed an ML algorithm that outputs a model, and the model predicts the property for each new observation.]

Page 6

Example: A Classification Problem

• Categorize images of fish—say, “Atlantic salmon” vs. “Pacific salmon”

• Use features such as length, width, lightness, fin shape & number, mouth position, etc.

• Steps:
  1. Preprocessing (e.g., background subtraction)
  2. Feature extraction
  3. Classification

example from Duda & Hart

Page 7

Classification in Bioinformatics

• Computational diagnostics: early cancer detection

• Tumor biomarker discovery

• Protein folding prediction

• Protein-protein binding site prediction

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 8

Classification Techniques

• Naïve Bayes

• K Nearest Neighbour

• Support Vector Machines (next lesson)

• …

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 9

Bayesian Approach

• Each observed training example can incrementally decrease or increase the estimated probability of a hypothesis, rather than eliminating the hypothesis outright

• Prior knowledge can be combined with observed data to determine the hypothesis

• Bayesian methods can accommodate hypotheses that make probabilistic predictions

• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities

Kathleen McKeown’s slides

Page 10

Bayesian Approach

• Assign the most probable target value, given the attribute values <a1, a2, …, an>

• v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)

• Using Bayes' theorem:
  v_MAP = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
        = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj)

• Bayesian learning is optimal
• It is easy to estimate P(vj) by counting in the training data
• Estimating the different P(a1, a2, …, an | vj) is not feasible (we would need a training set of size proportional to the number of possible instances times the number of classes)

Kathleen McKeown's slides

Page 11

Bayes’ Rules

• Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)

• Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)

• In distribution form:
  P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)

Kathleen McKeown’s slides

Page 12

Naïve Bayes

• Assume conditional independence of the attributes given the class:
  P(a1, a2, …, an | vj) = ∏i P(ai | vj)

• Substitute into the v_MAP formula:
  v_NB = argmax_{vj ∈ V} P(vj) ∏i P(ai | vj)

Kathleen McKeown’s slides

Page 13

v_NB = argmax_{vj ∈ V} P(vj) ∏i P(ai | vj)

S-length S-width P-length Class

1 high high high Versicolour

2 low high low Setosa

3 low high low Verginica

4 low high med Verginica

5 high high high Versicolour

6 high high med Setosa

7 high high low Setosa

8 high high high Versicolour

9 high high high Versicolour

Kathleen McKeown’s slides
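The v_NB rule on this toy table can be worked out by simple counting. A minimal sketch (not part of the original slides; the data reproduce the table above):

```python
# Naive Bayes by counting, on the toy iris-style table above.
data = [
    (("high", "high", "high"), "Versicolour"),
    (("low",  "high", "low"),  "Setosa"),
    (("low",  "high", "low"),  "Verginica"),
    (("low",  "high", "med"),  "Verginica"),
    (("high", "high", "high"), "Versicolour"),
    (("high", "high", "med"),  "Setosa"),
    (("high", "high", "low"),  "Setosa"),
    (("high", "high", "high"), "Versicolour"),
    (("high", "high", "high"), "Versicolour"),
]

def naive_bayes_predict(x, data):
    """Return argmax_v P(v) * prod_i P(a_i | v), all estimated by counting."""
    classes = {v for _, v in data}
    n = len(data)
    best, best_score = None, -1.0
    for v in classes:
        rows = [a for a, c in data if c == v]
        score = len(rows) / n                      # P(v)
        for i, ai in enumerate(x):
            match = sum(1 for a in rows if a[i] == ai)
            score *= match / len(rows)             # P(a_i | v)
        if score > best_score:
            best, best_score = v, score
    return best

print(naive_bayes_predict(("high", "high", "high"), data))  # Versicolour
```

For the query (high, high, high), only Versicolour has a non-zero count for every attribute value, so it wins with score 4/9.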

Page 14

Estimating Probabilities

• What happens when the number of data elements is small?

• Suppose the true P(S-length=low | Verginica) = .05
• There are only 2 instances with C = Verginica
• We estimate the probability by nc/n using the training set
• Then the count #(S-length=low ∧ Verginica) must be 0
• Then, instead of .05, we use an estimated probability of 0
• Two problems:
  • a biased underestimate of the probability
  • this probability term will dominate if a future query contains S-length=low

Kathleen McKeown’s slides

Page 15

Instead: use m-estimate

• Use priors as well

• (nc + m·p) / (n + m)

  – where p = prior estimate of P(S-length=low | Verginica)

  – m is a constant called the equivalent sample size
    » determines how heavily to weight p relative to the observed data
    » typical method: assume a uniform prior over the attribute's values (e.g. if the values are low, med, high, then p = 1/3)

Kathleen McKeown’s slides
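The m-estimate is a one-line correction. A small sketch (illustrative, not from the slides; m = 3 is an assumed choice):

```python
def m_estimate(nc, n, m, p):
    """m-estimate of probability: (nc + m*p) / (n + m).
    nc: count of the attribute value within the class, n: class count,
    p: prior estimate, m: equivalent sample size."""
    return (nc + m * p) / (n + m)

# Slide's scenario: no 'low' S-length among the 2 Verginica rows,
# uniform prior p = 1/3 over {low, med, high}:
print(m_estimate(0, 2, 3, 1/3))  # 0.2 instead of 0
```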

Page 16

K-Nearest Neighbour
• Memorize the training data

• Given a new example, find its k nearest neighbours, and output the majority-vote class.

• Choices:
  – How many neighbours?
  – What distance measure?
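The whole classifier fits in a few lines. A minimal sketch with Euclidean distance and majority vote (the training data here are made up for illustration):

```python
import math
from collections import Counter

def knn_predict(x, data, k=3):
    """Majority vote among the k nearest training examples
    (Euclidean distance); data is a list of (vector, label) pairs."""
    neighbours = sorted(data, key=lambda p: math.dist(x, p[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((0.1, 0.3), "A"),
         ((2.0, 2.0), "B"), ((2.1, 1.9), "B"), ((1.8, 2.2), "B")]
print(knn_predict((0.1, 0.1), train, k=3))  # A
```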

Page 17

Application in Bioinformatics
• A regression-based K-nearest-neighbor algorithm for gene function prediction from heterogeneous data, Z. Yao and W.L. Ruzzo, BMC Bioinformatics 2006, 7

1. For each dataset k and for each pair of genes p, compute the similarity f(p,k) of p w.r.t. the k-th dataset.

2. Construct a predictor H of gene-pair similarity, e.g. by logistic regression: (f(p,1), …, f(p,m)) → H(f(p,1), …, f(p,m)), such that H takes a high value if the genes of p have similar functions.

Given a new gene g, find its k nearest neighbours using H as the distance. Predict the functional classes C1, …, Cn of g with confidence Confidence(Ci) = 1 − Π(1 − Pij), taken over the neighbours gj of g with Ci in the set of classes of gj (the probability that at least one prediction is correct, i.e. 1 − the probability that all predictions are wrong).
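The confidence combination at the end is just the complement of "all neighbour predictions wrong". A small sketch (the Pij values below are hypothetical):

```python
def confidence(ps):
    """Probability that at least one of the neighbour-based predictions
    is correct: 1 - prod(1 - P_ij)."""
    out = 1.0
    for p in ps:
        out *= (1.0 - p)
    return 1.0 - out

# Two neighbours each predicting the class with probability 0.5:
print(confidence([0.5, 0.5]))  # 0.75
```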

Page 18

Classification: CV error

• Training error– Empirical error

• Error on independent test set – Test error

• Cross validation (CV) error– Leave-one-out (LOO)– N-fold CV

[Diagram: the N samples are split into a test part (1/n of the samples) and a training part (the remaining (1 − 1/n)·N samples); errors are counted on the test part and summarized as the CV error rate.]

Supervised learning
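Leave-one-out CV follows the diagram directly: hold out each sample once, train on the rest, count errors. In this sketch a 1-nearest-neighbour classifier stands in for the model (toy data, not from the lecture):

```python
import math

def nn1_predict(x, train):
    """1-nearest-neighbour classifier used as a stand-in model."""
    return min(train, key=lambda p: math.dist(x, p[0]))[1]

def loo_error(data):
    """Leave-one-out CV: train on all samples but one, test on that one."""
    errors = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if nn1_predict(x, train) != y:
            errors += 1
    return errors / len(data)

data = [((0.0,), "A"), ((0.1,), "A"), ((0.2,), "A"),
        ((1.0,), "B"), ((1.1,), "B"), ((5.0,), "A")]
print(loo_error(data))  # one LOO error (the outlier at 5.0), i.e. 1/6
```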

Page 19

Two schemes of cross validation

[Diagram: two schemes over the N samples. CV1: the gene selector and the classifier are both trained and tested inside each LOO fold, and errors are counted. CV2: gene selection is done once on all N samples, and only the classifier is trained and tested inside the LOO folds.]

Supervised learning

Page 20

Difference between CV1 and CV2

• CV1: gene selection within the LOOCV loop
• CV2: gene selection before the LOOCV loop
• CV2 can yield an optimistic estimate of the true classification error

• CV2 was used in the paper by Golub et al.:
  – 0 training errors
  – 2 CV errors (5.26%)
  – 5 test errors (14.7%)
  – CV error differs from test error!

Supervised learning

Page 21

Significance of classification results

• Permutation test:
  – Permute the class labels of the samples
  – Compute the LOOCV error on the data with permuted labels
  – Repeat this process many times
  – Compare with the LOOCV error on the original data:

• P-value = (# of times the LOOCV error on permuted data <= the LOOCV error on the original data) / (total # of permutations considered)

Supervised learning
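The permutation test above can be sketched as follows; `error_fn` stands for any error estimator (e.g. LOOCV), here replaced by a toy fixed rule for illustration:

```python
import random

def permutation_p_value(error_fn, X, y, n_perm=200, seed=0):
    """P-value = fraction of label permutations whose error is <=
    the error obtained with the original labels."""
    rng = random.Random(seed)
    base = error_fn(X, y)
    hits = sum(1 for _ in range(n_perm)
               if error_fn(X, rng.sample(y, len(y))) <= base)
    return hits / n_perm

# Toy stand-in for the LOOCV error: disagreement with a fixed rule.
def rule_error(X, y):
    preds = ["B" if x > 0.5 else "A" for x in X]
    return sum(p != t for p, t in zip(preds, y)) / len(y)

X = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
y = ["A", "A", "A", "B", "B", "B"]
print(permutation_p_value(rule_error, X, y))  # small p-value: labels are informative
```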

Page 22

Unsupervised Learning

ML for unsupervised learning attempts to discover interesting structure in the available data

Unsupervised learning

Page 23

Unsupervised Learning

• The dataset describes the structure of an unknown (but existent) system.

• The computer program tries to identify the structure of the system (clustering, data compression).

• ML techniques: hierarchical clustering, k-means, Self Organizing Maps (SOM), fuzzy clustering (described in a future lesson).

Page 24

Clustering
Clustering is one of the most important unsupervised learning processes: it organizes objects into groups whose members are similar in some way.

Clustering finds structure in a collection of unlabeled data.

A cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.

Page 25

Clustering Algorithms

• Start with a collection of n objects, each represented by a p-dimensional feature vector xi, i = 1, …, n.

• The goal is to associate the n objects with k clusters so that objects within a cluster are more similar to one another than to objects in other clusters. k is usually unknown.

• Popular methods: hierarchical, k-means, SOM, …

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 26

Hierarchical Clustering

[Figure: a dendrogram and a Venn diagram of the clustered data.]

From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt

Page 27

Hierarchical Clustering (Cont.)

• Multilevel clustering: level 1 has n clusters; level n has one cluster.

• Agglomerative HC: starts with singleton clusters and merges them.

• Divisive HC: starts with one all-inclusive cluster and splits it.

Page 28

Nearest Neighbor Algorithm

• Nearest Neighbor Algorithm is an agglomerative approach (bottom-up).

• Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.

From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt

Page 29

Nearest Neighbor, Level 2, k = 7 clusters.

From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt

Page 30

Nearest Neighbor, Level 3, k = 6 clusters.

Page 31

Nearest Neighbor, Level 4, k = 5 clusters.

Page 32

Nearest Neighbor, Level 5, k = 4 clusters.

Page 33

Nearest Neighbor, Level 6, k = 3 clusters.

Page 34

Nearest Neighbor, Level 7, k = 2 clusters.

Page 35

Nearest Neighbor, Level 8, k = 1 cluster.

Page 36

Hierarchical Clustering

1. Calculate the similarity between all possible combinations of two profiles.

2. Group the two most similar clusters together to form a new cluster.

3. Calculate the similarity between the new cluster and all remaining clusters; repeat from step 2.

Keys:
• the similarity measure
• the clustering (linkage) rule

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
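The three steps above can be sketched as plain agglomerative clustering with single linkage (a naive O(n³) illustration, not the lecture's code):

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering: start from singletons, repeatedly merge
    the two clusters with the smallest minimum pairwise distance,
    stop when k clusters remain."""
    clusters = [[p] for p in points]

    def dist(c1, c2):
        # single linkage: minimum distance between members
        return min(math.dist(a, b) for a in c1 for b in c2)

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(single_linkage(pts, 2))  # two clusters: 3 points near the origin, 2 near (5, 5)
```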

Page 37

Clustering in Bioinformatics

• Microarray data quality checking:
  – Do replicates cluster together?
  – Do similar conditions, time points, tissue types cluster together?

• Cluster genes: predict the functions of unknown genes from the known ones

• Cluster samples: discover clinical characteristics (e.g. survival, marker status) shared by samples

• Promoter analysis of commonly regulated genes

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 38

Functionally significant gene clusters

Two-way clustering

Gene clusters

Sample clusters

Page 39

Bhattacharjee et al. (2001) Human lung carcinomas mRNA expression profiling reveals distinct adenocarcinoma subclasses.

Proc. Natl. Acad. Sci. USA, Vol. 98, 13790-13795.

Page 40

Similarity Measurements
• Pearson Correlation

Two profiles (vectors) x = (x1, …, xN) and y = (y1, …, yN):

  C_pearson(x, y) = Σ_{i=1..N} (xi − mx)(yi − my) / sqrt( [Σ_{i=1..N} (xi − mx)²] · [Σ_{i=1..N} (yi − my)²] )

  with mx = (1/N) Σ_{n=1..N} xn and my = (1/N) Σ_{n=1..N} yn

  −1 ≤ Pearson Correlation ≤ +1

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
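The formula translates directly into code (a sketch, using the slide's notation):

```python
import math

def pearson(x, y):
    """C_pearson(x, y): centered dot product divided by the product
    of the centered norms."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                    * sum((yi - my) ** 2 for yi in y))
    return num / den

a = [1.0, 2.0, 3.0, 4.0]
print(pearson(a, [x + 0.5 for x in a]))  # same trend, shifted: 1.0
print(pearson(a, [0.2 * x for x in a]))  # same trend, rescaled: 1.0
```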

Page 41

Similarity Measurements

• Pearson Correlation: Trend Similarity

[Figure: three profiles a, b and c with the same trend but different offset/scale (in the original figure b and c are shifted/scaled copies of a, with constants 0.5 and 0.2); all pairwise correlations are 1:
  C_pearson(a, b) = 1, C_pearson(a, c) = 1, C_pearson(b, c) = 1]

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 42

Similarity Measurements

• Euclidean Distance

  d(x, y) = sqrt( Σ_{n=1..N} (xn − yn)² )

  for two profiles x = (x1, …, xN) and y = (y1, …, yN)

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
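And the Euclidean distance, a sketch in the same style (the example vectors are illustrative):

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt(sum_n (x_n - y_n)^2)."""
    return math.sqrt(sum((xn - yn) ** 2 for xn, yn in zip(x, y)))

a = [1.0, 2.0, 3.0]
b = [x + 0.5 for x in a]
# Offset profiles have a non-zero distance even though their
# Pearson correlation is 1:
print(euclidean(a, b))
```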

Page 43

Similarity Measurements

• Euclidean Distance: Absolute difference

[Figure: the same three profiles a, b and c; d(a, c) = 1.5875, d(a, b) = 2.8025, d(b, c) = 3.2211. Unlike Pearson correlation, Euclidean distance reflects absolute differences rather than trends.]

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 44

Clustering

[Figure: three clusters C1, C2, C3. Merge which pair of clusters?]

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 45

Clustering: Single Linkage

Dissimilarity between two clusters (C1, C2) = the minimum dissimilarity between the members of the two clusters.

Tends to generate "long chains".

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 46

Clustering: Complete Linkage

Dissimilarity between two clusters (C1, C2) = the maximum dissimilarity between the members of the two clusters.

Tends to generate "clumps".

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 47

Clustering: Average Linkage

Dissimilarity between two clusters (C1, C2) = the averaged distance over all pairs of objects (one from each cluster).

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 48

Clustering: Average Group Linkage

Dissimilarity between two clusters (C1, C2) = the distance between the two cluster means.

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 49

Considerations

• Which genes are used to cluster the samples?
  – Expression variation
  – Inherent variation
  – Prior knowledge (irrelevant genes)
  – Etc.

From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

Page 50

K-means Clustering

– Initialize the K cluster representatives wj, e.g. to randomly chosen examples.

– Assign each input example x to the cluster c(x) with the nearest corresponding weight vector:

  c(x) = argmin_j || x − wj(n) ||

– Update the weights:

  wj(n+1) = ( Σ_{x such that c(x)=j} x ) / nj,  with nj the number of examples assigned to cluster j

– Increment n by 1 and go on until no noticeable changes of the cluster representatives occur.

Unsupervised learning
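The two alternating steps (assignment and update) can be sketched as follows; the seed choice and data are illustrative:

```python
import math

def kmeans(points, seeds, max_iter=100):
    """K-means sketch: assign each x to the nearest representative w_j,
    then move each w_j to the mean of its cluster; stop at convergence."""
    w = [list(s) for s in seeds]
    for _ in range(max_iter):
        clusters = [[] for _ in w]
        for x in points:
            j = min(range(len(w)), key=lambda j: math.dist(x, w[j]))
            clusters[j].append(x)
        # empty clusters keep their previous representative
        new_w = [[sum(c) / len(cl) for c in zip(*cl)] if cl else wj
                 for cl, wj in zip(clusters, w)]
        if new_w == w:
            break
        w = new_w
    return w, clusters

pts = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (4.0, 4.0), (4.2, 4.0)]
centers, clusters = kmeans(pts, seeds=[(0.0, 0.0), (4.0, 4.0)])
print(centers)
```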

Page 51

Example I

[Figure: initial data and seeds (left); final clustering (right).]

Unsupervised learning

Page 52

Example II

[Figure: initial data and seeds (left); final clustering (right).]

Unsupervised learning

Page 53

The brain maps the external multidimensional representation of the world into a similar one- or two-dimensional internal representation.

That is, the brain processes the external signals in a topology-preserving way.

Mimicking the way the brain learns, our clustering system should be able to do the same thing.

SOM: Brain’s self-organization

Unsupervised learning

Page 54

Self-Organized Map: idea

Data: vectors XT = (X1, ... Xd) from d-dimensional space.

Grid of nodes, with local processor (called neuron) in each node.

Local processor # j has d adaptive parameters W(j).

Goal: change the W(j) parameters to recover the data clusters in X space.

Unsupervised learning

Page 55

Training process

[Figure: x = data points; o = positions of the neuron weights. A 2-D grid of neurons is trained against the N-dimensional data space; the weights point to locations in N-D.]

Java demos: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html

Unsupervised learning

Page 56

Concept of the SOM

[Figure: input space vs. reduced feature space. The cluster centers (code vectors, labelled e.g. Mn, Sr, Ba) are clustered and ordered in a two-dimensional grid, and each code vector has a place in the reduced space.]

Unsupervised learning

Page 57

Concept of the SOM

[Figure: the same trained map (with labels such as Ba, Mn, Sr, Mg, SA3) can be used for visualization, for classification, and for clustering.]

Unsupervised learning

Page 58

SOM: learning algorithm

• Initialization: n = 0. Choose random small values for the weight-vector components.

• Sampling: select an x from the input examples.

• Similarity matching: find the winning neuron i(x) at iteration n:

  i(x) = argmin_j || x(n) − wj(n) ||

• Updating: adjust the weight vectors of all neurons using the rule

  wj(n+1) = wj(n) + η(n) · h_{j,i(x)}(d_{j,i(x)}) · ( x(n) − wj(n) )

• Continuation: n = n + 1. Go to the Sampling step until no noticeable changes in the weights are observed.

Unsupervised learning
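The four steps can be sketched for a 1-D lattice with the Gaussian neighbourhood of the next slide; all parameter values (η, σ, lattice size, data) are illustrative choices, not from the slides:

```python
import math
import random

def train_som(data, n_neurons=5, n_iter=500, eta=0.1, sigma=1.0, seed=0):
    """1-D SOM sketch: winner search by minimum distance, Gaussian
    neighbourhood h over the lattice distance |j - i|."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialization: small random weights
    w = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_neurons)]
    for _ in range(n_iter):
        x = rng.choice(data)                                   # Sampling
        winner = min(range(n_neurons),
                     key=lambda j: math.dist(x, w[j]))         # Matching
        for j in range(n_neurons):                             # Updating
            h = math.exp(-((j - winner) ** 2) / (2 * sigma ** 2))
            for d in range(dim):
                w[j][d] += eta * h * (x[d] - w[j][d])
    return w

data = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 1.1)]
weights = train_som(data)
# quantization error: mean distance from each input to its winning neuron
qe = sum(min(math.dist(x, wj) for wj in weights) for x in data) / len(data)
print(qe)
```

After training, the quantization error should be small: the code vectors have moved onto the two data clusters.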

Page 59

Neighborhood Function

– Gaussian neighborhood function:

  h_{j,i}(d_{j,i}) = exp( − d_{j,i}² / (2σ²) )

– d_{j,i}: lateral distance of neurons i and j
  • in a 1-dimensional lattice: | j − i |
  • in a 2-dimensional lattice: || rj − ri ||, where rj is the position of neuron j in the lattice.

Unsupervised learning

Page 60

Initial h function (example)

Unsupervised learning

Page 61

Some examples of real-life applications

The Helsinki University of Technology web site http://www.cis.hut.fi/research/refs/ contains > 5000 papers on SOM and its applications:

• Brain research: modeling of the formation of various topographical maps in motor, auditory, visual and somatotopic areas.

• Clustering of genes, protein properties, chemical compounds, speech phonemes, sounds of birds and insects, astronomical objects, economical data, business and financial data, ...

• Data compression (images and audio), information filtering.
• Medical and technical diagnostics.

Unsupervised learning

Page 62

Issues in Clustering

• How many clusters?
  – A user parameter
  – Use model-selection criteria (e.g. the Bayesian Information Criterion) with a penalization term that accounts for model complexity. See e.g. X-means: http://www2.cs.cmu.edu/~dpelleg/kmeans.html

• What similarity measure?
  – Euclidean distance
  – Correlation coefficient
  – Ad-hoc similarity measures

Unsupervised learning

Page 63

Validation of clustering results

• External measures
  – According to some external knowledge
  – Consideration of bias and subjectivity

• Internal measures
  – Quality of the clusters according to the data
  – Compactness and separation
  – Stability
  – …

See e.g. J. Handl, J. Knowles, D.B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics, 21(15):3201-3212, 2005.

Unsupervised learning

Page 64

Molecular Classification of Cancer:Class Discovery and Class Prediction by Gene Expression Monitoring

T.R. Golub et al., Science 286, 531 (1999)

Bioinformatics Application

Unsupervised learning

Page 65

Identification of cancer types

• Why is identification of the cancer class (tumor sub-type) important?
  – Cancers of identical grade can have widely variable clinical courses (e.g. acute lymphoblastic leukemia vs. acute myeloid leukemia).

• Traditional methods:
  – Morphological appearance.

  – Enzyme-based histochemical analyses.

  – Immunophenotyping.

  – Cytogenetic analysis.

Golub et al. 1999
Unsupervised learning

Page 66

Class Prediction

• How could one use an initial collection of samples belonging to known classes to create a class predictor?
  – Identification of informative genes

  – Weighted vote

Golub et al. slides
Unsupervised learning

Page 67

Data

• Initial sample: 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis.

• Independent sample: 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).

Golub et al. slides
Unsupervised learning

Page 68

Validation of Gene Voting

• Initial samples: 36 of the 38 samples were predicted as either AML or ALL, and two as uncertain; all 36 predictions agree with the clinical diagnosis.

• Independent samples: 29 of the 34 samples are strongly predicted, with 100% accuracy.

Golub et al. slides
Unsupervised learning

Page 69

Class Discovery

• Can cancer classes be discovered automatically based on gene expression?
  – Cluster tumors by gene expression
  – Determine whether the putative classes produced are meaningful.

Golub et al. slides
Unsupervised learning

Page 70

Cluster tumors

Self-Organizing Map (SOM): mathematical cluster analysis for recognizing and classifying features in complex, multidimensional data (similar to the K-means approach). It chooses a geometry of "nodes"; the nodes are mapped into K-dimensional space, initially at random, and are iteratively adjusted.

Golub et al. slides
Unsupervised learning

Page 71

Validation of SOM

• Prediction based on clusters A1 and A2:
  – 24/25 of the ALL samples from the initial dataset were clustered in group A1
  – 10/13 of the AML samples from the initial dataset were clustered in group A2

Golub et al. slides
Unsupervised learning

Page 72

Validation of SOM

• How could one evaluate the putative clusters if the "right" answer were not known?
  – Assumption: class discovery can be tested by class prediction.

• Testing the assumption:
  – Construct predictors based on clusters A1 and A2.
  – Construct predictors based on random clusters.

Golub et al. slides
Unsupervised learning

Page 73

Validation of SOM

• Predictions using the predictors based on clusters A1 and A2 yield 34 accurate predictions, one error, and three uncertain calls.

Golub et al. slides
Unsupervised learning

Page 74

Validation of SOM

Golub et al. slides
Unsupervised learning

Page 75

CONCLUSION

• In Machine Learning, every technique has its assumptions and constraints, advantages and limitations

• My view:
  – First perform simple data analysis before applying fancy high-tech ML methods

– Possibly use different ML techniques and then ensemble results

– Apply correct cross validation method!

– Check for significance of results (permutation test, stability of selected genes)

– Work in collaboration with data producer (biologist, pathologist) when possible!

ML in bioinformatics