25/05/2004 evolution/phylogeny/pattern recognition bioinformatics master course bioinformatics data...

51
25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Post on 19-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

25/05/2004

Evolution/Phylogeny/Pattern recognition

Bioinformatics Master Course

Bioinformatics Data Analysis and Tools

Page 2: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

PatternsSome are easy some are not

• Knitting patterns

• Cooking recipes

• Pictures (dot plots)

• Colour patterns

• Maps

Page 3: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Example of algorithm reuse: Data clustering

• Many biological data analysis problems can be formulated as clustering problems– microarray gene expression data analysis– identification of regulatory binding sites (similarly, splice

junction sites, translation start sites, ......)– (yeast) two-hybrid data analysis (for inference of protein

complexes)– phylogenetic tree clustering (for inference of horizontally

transferred genes)– protein domain identification– identification of structural motifs– prediction reliability assessment of protein structures– NMR peak assignments – ......

Page 4: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Data Clustering Problems

• Clustering: partition a data set into clusters so that data points of the same cluster are “similar” and points of different clusters are “dissimilar”

• cluster identification -- identifying clusters with significantly different features than the background

Page 5: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Application Examples

• Regulatory binding site identification: CRP (CAP) binding site

• Two hybrid data analysis Gene expression data analysis

Are all solvable by the same algorithm!

Page 6: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Other Application Examples

• Phylogenetic tree clustering analysis (Evolutionary trees)

• Protein sidechain packing prediction

• Assessment of prediction reliability of protein structures

• Protein secondary structures

• Protein domain prediction

• NMR peak assignments

• ……

Page 7: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Multivariate statistics – Cluster analysis

Dendrogram

Scores

Similaritymatrix

5×5

12345

C1 C2 C3 C4 C5 C6 ..

Raw table

Similarity criterion

Cluster criterion

Page 8: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Human Evolution

Page 9: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Comparing sequences - Similarity Score -

Many properties can be used:

• Nucleotide or amino acid composition

• Isoelectric point

• Molecular weight

• Morphological characters

• But: molecular evolution through sequence alignment

Page 10: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Multivariate statistics – Cluster analysisNow for sequences

Phylogenetic tree

Scores

Similaritymatrix

5×5

Multiple sequence alignment

12345

Similarity criterion

Page 11: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Human -KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ Chicken -KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ Dogfish –KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQLamprey SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ Barley TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ Maizey casei -KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ Bacillus TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ Lacto__ste -RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ Lacto_plant QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ Therma_mari MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ Bifido -KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ Thermus_aqua MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ Mycoplasma -KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ

Lactate dehydrogenase multiple alignment

Distance Matrix 1 2 3 4 5 6 7 8 9 10 11 12 13 1 Human 0.000 0.112 0.128 0.202 0.378 0.346 0.530 0.551 0.512 0.524 0.528 0.635 0.637 2 Chicken 0.112 0.000 0.155 0.214 0.382 0.348 0.538 0.569 0.516 0.524 0.524 0.631 0.651 3 Dogfish 0.128 0.155 0.000 0.196 0.389 0.337 0.522 0.567 0.516 0.512 0.524 0.600 0.655 4 Lamprey 0.202 0.214 0.196 0.000 0.426 0.356 0.553 0.589 0.544 0.503 0.544 0.616 0.669 5 Barley 0.378 0.382 0.389 0.426 0.000 0.171 0.536 0.565 0.526 0.547 0.516 0.629 0.575 6 Maizey 0.346 0.348 0.337 0.356 0.171 0.000 0.557 0.563 0.538 0.555 0.518 0.643 0.587 7 Lacto_casei 0.530 0.538 0.522 0.553 0.536 0.557 0.000 0.518 0.208 0.445 0.561 0.526 0.501 8 Bacillus_stea 0.551 0.569 0.567 0.589 0.565 0.563 0.518 0.000 0.477 0.536 0.536 0.598 0.495 9 Lacto_plant 0.512 0.516 0.516 0.544 0.526 0.538 0.208 0.477 0.000 0.433 0.489 0.563 0.485 10 Therma_mari 0.524 0.524 0.512 0.503 0.547 0.555 0.445 0.536 0.433 0.000 0.532 0.405 0.598 11 Bifido 0.528 0.524 0.524 0.544 0.516 0.518 0.561 0.536 0.489 0.532 0.000 0.604 0.614 12 Thermus_aqua 0.635 0.631 0.600 0.616 0.629 0.643 0.526 0.598 0.563 0.405 0.604 0.000 0.641 13 Mycoplasma 0.637 0.651 0.655 0.669 0.575 0.587 0.501 0.495 0.485 0.598 0.614 0.641 0.000

Page 12: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools
Page 13: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Multivariate statistics – Cluster analysis

Dendrogram/tree

Scores

Similaritymatrix

5×5

12345

C1 C2 C3 C4 C5 C6 ..

Data table

Similarity criterion

Cluster criterion

Page 14: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Multivariate statistics – Cluster analysis

Why do it?• Finding a true typology• Model fitting• Prediction based on groups• Hypothesis testing• Data exploration• Data reduction• Hypothesis generation But you can never prove a

classification/typology!

Page 15: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Cluster analysis – data normalisation/weighting

12345

C1 C2 C3 C4 C5 C6 ..

Raw table

Normalisation criterion

12345

C1 C2 C3 C4 C5 C6 ..

Normalised table

Column normalisation x/max

Column range normalise (x-min)/(max-min)

Page 16: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Cluster analysis – (dis)similarity matrix

Scores

Similaritymatrix

5×5

12345

C1 C2 C3 C4 C5 C6 ..

Raw table

Similarity criterion

Di,j = (k | xik – xjk|r)1/r Minkowski metrics

r = 2 Euclidean distancer = 1 City block distance

Page 17: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Cluster analysis – Clustering criteria

Dendrogram (tree)

Scores

Similaritymatrix

5×5

Cluster criterion

Single linkage - Nearest neighbour

Complete linkage – Furthest neighbour

Group averaging – UPGMA

Ward

Neighbour joining – global measure

Page 18: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Cluster analysis – Clustering criteria

1. Start with N clusters of 1 object each

2. Apply clustering distance criterion iteratively until you have 1 cluster of N objects

3. Most interesting clustering somewhere in between

Dendrogram (tree)

distance

N clusters1 cluster

Page 19: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Single linkage clustering (nearest neighbour)

Char 1

Char 2

Page 20: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Single linkage clustering (nearest neighbour)

Char 1

Char 2

Page 21: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Single linkage clustering (nearest neighbour)

Char 1

Char 2

Page 22: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Single linkage clustering (nearest neighbour)

Char 1

Char 2

Page 23: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Single linkage clustering (nearest neighbour)

Char 1

Char 2

Page 24: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Single linkage clustering (nearest neighbour)

Char 1

Char 2

Distance from point to cluster is defined as the smallest distance between that point and any point in the cluster

Page 25: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Cluster analysis – Ward’s clustering criterion

Per cluster: calculate Error Sum of Squares (ESS)

ESS = x2 – (x)2/n

calculate minimum increase of ESS

Suppose:

Obj Val c l u s t e r i n g ESS

1 1 1 2 3 4 5 0

2 2 1 2 3 4 5 0.5

3 7 1 2 3 4 5 2.5

4 9 1 2 3 4 5 13.1

5 12 1 2 3 4 5 86.8

Page 26: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Multivariate statistics – Cluster analysis

Phylogenetic tree

Scores

Similaritymatrix

5×5

12345

C1 C2 C3 C4 C5 C6 ..

Data table

Similarity criterion

Cluster criterion

Page 27: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Multivariate statistics – Cluster analysis

Scores

5×5

12345

C1 C2 C3 C4 C5 C6

Similarity criterion

Cluster criterion

Scores

6×6

Cluster criterion

Make two-way ordered

table using dendrograms

Page 28: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Multivariate statistics – Principal Component Analysis (PCA)

12345

C1 C2 C3 C4 C5 C6 Similarity Criterion:Correlations

6×6

Calculate eigenvectors with greatest eigenvalues:

•Linear combinations

•Orthogonal

Correlations

Project datapoints ontonew axes (eigenvectors)

12

Page 29: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))

“Nothing in bioinformatics makes sense except in the light of Biology”

Bioinformatics

Page 30: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Evolution

• Most of bioinformatics is comparative biology

• Comparative biology is based upon evolutionary relationships between compared entities

• Evolutionary relationships are normally depicted in a phylogenetic tree

Page 31: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Where can phylogeny be used

• For example, finding out about orthology versus paralogy

• Predicting secondary structure of RNA

• Studying host-parasite relationships

• Mapping cell-bound receptors onto their binding ligands

• Multiple sequence alignment (e.g. Clustal)

Page 32: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Phylogenetic tree (unrooted)

human

mousefugu

Drosophila

edge

internal node

leaf

OTU – Observed taxonomic unit

Page 33: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Phylogenetic tree (unrooted)

human

mousefugu

Drosophila

root

edge

internal node

leaf

OTU – Observed taxonomic unit

Page 34: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Phylogenetic tree (rooted)

human

mouse

fuguDrosophila

root

edge

internal node (ancestor)

leaf

OTU – Observed taxonomic unit

time

Page 35: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

How to root a tree

• Outgroup – place root between distant sequence and rest group

• Midpoint – place root at midpoint of longest path (sum of branches between any two OTUs)

• Gene duplication – place root between paralogous gene copies

f

D

m

h D f m h

f

D

m

h D f m h

f-

h-

f-

h- f- h- f- h-

5

32

1

1

4

1

2

13

1

Page 36: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Combinatoric explosion

# sequences # unrooted # rooted trees trees

2 1 13 1 34 3 155 15 1056 105 9457 945 10,3958 10,395 135,1359 135,135 2,027,02510 2,027,025 34,459,425

Page 37: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Tree distances

human x

mouse 6 x

fugu 7 3 x

Drosophila 14 10 9 x

human

mouse

fugu

Drosophila

5

1

1

2

6human

mouse

fuguDrosophila

Evolutionary (sequence distance) = sequence dissimilarity

Page 38: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Phylogeny methods

• Parsimony – fewest number of evolutionary events (mutations) – relatively often fails to reconstruct correct phylogeny

• Distance based – pairwise distances

• Maximum likelihood – L = Pr[Data|Tree]

Page 39: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Parsimony & DistanceSequences 1 2 3 4 5 6 7Drosophila t t a t t a a fugu a a t t t a a mouse a a a a a t a human a a a a a a t

human x

mouse 2 x

fugu 3 4 x

Drosophila 5 5 3 x

human

mouse

fuguDrosophila

Drosophila

fugu

mouse

human

12

3 7

64 5

Drosophila

fugu

mouse

human

2

11

12

parsimony

distance

Page 40: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Maximum likelihood

• If data=alignment, hypothesis = tree, and under a given evolutionary model,

maximum likelihood selects the hypothesis (tree) that maximises the observed data

• Extremely time consuming method

• We also can test the relative fit to the tree of different models (Huelsenbeck & Rannala, 1997)

Page 41: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Bayesian methods

• Calculates the posterior probability of a tree (Huelsenbeck et al., 2001) –- probability that tree is true tree given evolutionary model

• Most computer intensive technique• Feasible thanks to Markov chain Monte Carlo

(MCMC) numerical technique for integrating over probability distributions

• Gives confidence number (posterior probability) per node

Page 42: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Distance methods: fastest

• Clustering criterion using a distance matrix

• Distance matrix filled with alignment scores (sequence identity, alignment scores, E-values, etc.)

• Cluster criterion

Page 43: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Phylogenetic tree by Distance methods (Clustering)

Phylogenetic tree

Scores

Similaritymatrix

5×5

Multiple alignment

12345

Similarity criterion

Page 44: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Human -KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ Chicken -KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ Dogfish –KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQLamprey SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ Barley TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ Maizey casei -KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ Bacillus TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ Lacto__ste -RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ Lacto_plant QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ Therma_mari MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ Bifido -KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ Thermus_aqua MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ Mycoplasma -KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ

Lactate dehydrogenase multiple alignment

Distance Matrix 1 2 3 4 5 6 7 8 9 10 11 12 13 1 Human 0.000 0.112 0.128 0.202 0.378 0.346 0.530 0.551 0.512 0.524 0.528 0.635 0.637 2 Chicken 0.112 0.000 0.155 0.214 0.382 0.348 0.538 0.569 0.516 0.524 0.524 0.631 0.651 3 Dogfish 0.128 0.155 0.000 0.196 0.389 0.337 0.522 0.567 0.516 0.512 0.524 0.600 0.655 4 Lamprey 0.202 0.214 0.196 0.000 0.426 0.356 0.553 0.589 0.544 0.503 0.544 0.616 0.669 5 Barley 0.378 0.382 0.389 0.426 0.000 0.171 0.536 0.565 0.526 0.547 0.516 0.629 0.575 6 Maizey 0.346 0.348 0.337 0.356 0.171 0.000 0.557 0.563 0.538 0.555 0.518 0.643 0.587 7 Lacto_casei 0.530 0.538 0.522 0.553 0.536 0.557 0.000 0.518 0.208 0.445 0.561 0.526 0.501 8 Bacillus_stea 0.551 0.569 0.567 0.589 0.565 0.563 0.518 0.000 0.477 0.536 0.536 0.598 0.495 9 Lacto_plant 0.512 0.516 0.516 0.544 0.526 0.538 0.208 0.477 0.000 0.433 0.489 0.563 0.485 10 Therma_mari 0.524 0.524 0.512 0.503 0.547 0.555 0.445 0.536 0.433 0.000 0.532 0.405 0.598 11 Bifido 0.528 0.524 0.524 0.544 0.516 0.518 0.561 0.536 0.489 0.532 0.000 0.604 0.614 12 Thermus_aqua 0.635 0.631 0.600 0.616 0.629 0.643 0.526 0.598 0.563 0.405 0.604 0.000 0.641 13 Mycoplasma 0.637 0.651 0.655 0.669 0.575 0.587 0.501 0.495 0.485 0.598 0.614 0.641 0.000

Page 45: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools
Page 46: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Cluster analysis – (dis)similarity matrix

Scores

Similaritymatrix

5×5

1

2

3

4

5

C1 C2 C3 C4 C5 C6 ..

Raw table

Similarity criterion

Di,j = (k | xik – xjk|r)1/r Minkowski metrics

r = 2 Euclidean distancer = 1 City block distance

Page 47: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Cluster analysis – Clustering criteria

Phylogenetic tree

Scores

Similaritymatrix

5×5

Cluster criterion

Single linkage - Nearest neighbour

Complete linkage – Furthest neighbour

Group averaging – UPGMA

Ward

Neighbour joining – global measure

Page 48: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Neighbour joining

• Global measure – keeps total branch length minimal, tends to produce a tree with minimal total branch length

• At each step, join two nodes such that distances are minimal (criterion of minimal evolution)

• Agglomerative algorithm

• Leads to unrooted tree

Page 49: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

Neighbour joining

xx

y

x

y

xy xy

x

(a) (b) (c)

(d) (e) (f)

At each step all possible ‘neighbour joinings’ are checked and the one corresponding to the minimal total tree length (calculated by adding all branch lengths) is taken.

Page 50: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

How to assess confidence in tree

• Bayesian method – time consuming– The Bayesian posterior probabilities (BPP) are assigned to

internal branches in consensus tree – Bayesian Markov chain Monte Carlo (MCMC) analytical software

such as MrBayes (Huelsenbeck and Ronquist, 2001) and BAMBE (Simon and Larget,1998) is now commonly used

– Uses all the data

• Distance method – bootstrap:– Select multiple alignment columns with replacement– Recalculate tree– Compare branches with original (target) tree– Repeat 100-1000 times, so calculate 100-1000 different trees– How often is branching (point between 3 nodes) preserved for

each internal node?– Uses samples of the data

Page 51: 25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools

The Bootstrap

1 2 3 4 5 6 7 8 C C V K V I Y SM A V R L I F SM C L R L L F T

3 4 3 8 6 6 8 6 V K V S I I S IV R V S I I S IL R L T L L T L

1

2

3

1

2

3

Original

Scrambled

4

5

1

5

2x 3x

Non-supportive