Tree searching

Kai Müller

Tree searching: exhaustive search

• branch addition algorithm
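Why exhaustive search quickly becomes infeasible can be sketched with a small count. The function below assumes unrooted, strictly bifurcating trees; when taxon k+1 is added to a tree with k taxa, there are 2k-3 branches to attach it to. The function name is only illustrative.

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees reachable by branch addition."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1  # exactly one unrooted tree for 3 taxa
    for k in range(3, n_taxa):
        # a tree with k taxa has 2k-3 branches, each a possible attachment point
        count *= 2 * k - 3
    return count

for n in (4, 5, 10):
    print(n, count_unrooted_trees(n))
```

With 10 taxa there are already more than two million trees to evaluate.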

Branch and bound

• Lmin = L(random tree)
• "search tree" as in branch addition
– at each level, if L > Lmin, go back one level and try another path
– at the last level, if L < Lmin, set Lmin = L; then go back to the first level unless all paths have been tried already
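The bounding logic can be sketched generically. This is a minimal skeleton, not a full phylogenetic implementation: `children`, `length` and `is_complete` are hypothetical callables supplied by the caller, and the sketch assumes that the length never decreases when a partial tree is extended (as is the case for parsimony length when taxa are added).

```python
def branch_and_bound(root, children, length, is_complete, l_min=float("inf")):
    """Depth-first search that abandons a path once its length exceeds l_min."""
    best = None
    stack = [root]
    while stack:
        node = stack.pop()
        l = length(node)
        if l > l_min:
            continue  # bound: no extension of this path can beat l_min
        if is_complete(node):
            if best is None or l < l_min:
                best, l_min = node, l  # new best complete solution
            continue
        stack.extend(children(node))  # branch: go one level deeper
    return best, l_min

# Toy stand-in for the search tree: pick one cost per level, minimise the sum.
levels = [[3, 1], [2, 5], [4, 1]]
best, l_min = branch_and_bound(
    root=(),
    children=lambda node: [node + (c,) for c in levels[len(node)]],
    length=sum,
    is_complete=lambda node: len(node) == len(levels),
)
print(best, l_min)
```

Starting `l_min` from the length of a random (or heuristically found) tree, as on the slide, only tightens the bound earlier; the result is the same.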

Heuristic searches

• stepwise addition
– as branch addition, but on each level only the path that follows the shortest tree at this level is searched

Star decomposition

Branch swapping

• NNI: nearest neighbour interchanges
• SPR: subtree pruning and regrafting
• TBR: tree bisection and reconnection

Tree inference with many terminals

• general problem of getting trapped in local optima
• searches under parsimony: parsimony ratchet
• searches under likelihood: estimation of
– substitution model parameters
– branch lengths
– topology

Parsimony ratchet

1) generate a start tree
2) TBR on this tree with the original matrix
3) perturb the characters by randomly upweighting 5-25% of them, then TBR on the best tree found under 2) with the perturbed matrix. Go to 2) [200+ times]
4) once more TBR on the current best tree & the original matrix
5) get the best trees from those collected in steps 2) and 4)
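The control flow of the steps above can be sketched as follows. This is only a skeleton under stated assumptions: `tbr_search(tree, weights)` and `tree_length(tree, weights)` are hypothetical callables that the caller must supply, and "upweighting" is assumed to mean doubling a character's weight.

```python
import random

def perturb_weights(n_chars, fraction, rng, upweight=2):
    """Upweight a random fraction of characters (assumption: weight 1 -> 2)."""
    k = round(n_chars * fraction)
    chosen = set(rng.sample(range(n_chars), k))
    return [upweight if i in chosen else 1 for i in range(n_chars)]

def ratchet(start_tree, n_chars, tbr_search, tree_length, iterations=200, seed=0):
    rng = random.Random(seed)
    original = [1] * n_chars
    collected = []
    current = start_tree                                   # step 1
    for _ in range(iterations):
        current = tbr_search(current, original)            # step 2
        collected.append(current)
        weights = perturb_weights(n_chars, rng.uniform(0.05, 0.25), rng)
        current = tbr_search(current, weights)             # step 3
    current = tbr_search(current, original)                # step 4
    collected.append(current)
    shortest = min(tree_length(t, original) for t in collected)
    return [t for t in collected if tree_length(t, original) == shortest]  # step 5
```

The perturbed search in step 3 is what lets the search leave a local optimum; the trees it produces are only used as starting points and are never collected themselves.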

Bootstrapping

• estimates properties of an estimator (such as its variance) by constructing a number of resamples of the observed dataset, each of equal size to the observed dataset and each obtained by random sampling with replacement from the original dataset
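A minimal sketch of one bootstrap replicate, assuming the dataset is a sequence alignment and the resampled units are its columns (characters). The dictionary layout and the function name are illustrative choices, not part of the slides.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; same size as the original."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in columns)
            for taxon, seq in alignment.items()}

alignment = {"taxon1": "ACGTAC", "taxon2": "ACGTTC", "taxon3": "AGGTTC"}
replicate = bootstrap_alignment(alignment, random.Random(42))
print(replicate)
```

A tree is then inferred from each replicate, and the replicate trees are summarised to obtain support values.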

Bootstrapping

• variants
– FWR (frequencies within replicates)
– SC (strict consensus)

Bremer support / decay

• Bremer support (decay analysis) is the number of extra steps needed to "collapse" a branch.
• searches under reverse constraints: keep only trees that do NOT contain a given node
• takes longer than bootstrapping: parsimony ratchet beneficial (~20 iterations)

Homoplasy indices

• Consistency index: CI = m/s
• m = the smallest theoretically possible number of steps the character could show on any tree
• s = the number of steps the character actually shows on a given tree
• Characters without homoplasy therefore have a CI of 1.
• As soon as "extra" steps become necessary, e.g. s = 3 (with m = 1), homoplasy increases and the CI decreases, here to 1/3 = 0.33.

Homoplasy indices (2)

• Ensemble consistency index
– The ensemble consistency index is 1 when none of the characters is homoplastic, i.e. when all of them fit the tree perfectly.
• Drawbacks of the CI
– Parsimony-uninformative characters always contribute a CI of 1 and thus artificially inflate the overall CI.
– On the other hand, the CI can never reach 0. Yet this would be a desirable property for a scale of all conceivable degrees of homoplasy, which ideally should range from 0 to 1.
– Third, the CI decreases as the number of taxa increases, even if the information content of the dataset does not change substantially.

Homoplasy indices (3)

• Retention index (RI)
– If g is the largest possible number of steps of a character on any conceivable tree (the number on a completely unresolved "broom", i.e. a star tree), then RI = (g-s)/(g-m)

Homoplasy indices (4)

char  s  m  g  CI    RI
1     2  1  2  0.5   0
2     3  1  4  0.33  0.33
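The two example characters can be checked with a few lines of code. This sketch uses the definitions CI = m/s and RI = (g-s)/(g-m) from the previous slides; the ensemble CI is computed as the sum of m over the sum of s, which is the usual definition but is not spelled out in the slides. Function names are illustrative.

```python
def consistency_index(m, s):
    """CI = m / s for a single character."""
    return m / s

def retention_index(m, s, g):
    """RI = (g - s) / (g - m); undefined (None) when g == m."""
    if g == m:
        return None  # parsimony-uninformative character
    return (g - s) / (g - m)

def ensemble_ci(characters):
    """Assumed definition: sum of m divided by sum of s over all characters."""
    return sum(m for m, s, g in characters) / sum(s for m, s, g in characters)

characters = [(1, 2, 2), (1, 3, 4)]  # (m, s, g) for characters 1 and 2
for m, s, g in characters:
    print(round(consistency_index(m, s), 2), round(retention_index(m, s, g), 2))
```

Character 1 shows why the RI is useful: its CI of 0.5 looks moderate, but with s = g it fits the tree no better than it would fit a star tree, so its RI is 0.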

Overview: treebuilding methods

Data types: discrete characters vs. distances

Distance methods

• observed number vs. actual number of substitutions

Types of substitutions

• transitions/transversions

• synonymous/non-synonymous

Distance correction

Substitution models

• p-distance: uncorrected
• substitution models
– characterized by substitution probability matrices

Substitution models

• Jukes-Cantor
– oldest (1969), simplest
– nucleotide frequencies all identical
– nucleotide substitutions all equally likely

P(t)

• JC69:
– probability of a substitution after time t if the mean instantaneous substitution rate is 10^-8 per site per year

Distances

• simple considerations & rearrangements of Pij(t) show that the JC-corrected distance d, when observing a fraction P of differing nucleotides, is d = -(3/4) ln(1 - (4/3) P)
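The relation between the observed fraction of differences and the corrected distance can be sketched in code. The sketch assumes the standard JC69 parametrisation, in which a distance of d expected substitutions per site yields an expected fraction 3/4 (1 - exp(-4d/3)) of differing sites; the function names are illustrative.

```python
import math

def jc69_fraction_different(d):
    """Expected fraction of differing sites after d substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def jc69_distance(p):
    """JC-corrected distance for an observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to exist")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Assumed example: rate 1e-8 per site per year over 1e7 years gives d = 0.1
d = 1e-8 * 1e7
p = jc69_fraction_different(d)
print(p, jc69_distance(p))
```

The observed fraction p is always smaller than d because multiple substitutions at the same site hide each other; it saturates at 0.75, where the correction is no longer defined.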

K2P

• Kimura 2-parameter model
– 2 different nucleotide substitution types: transitions and transversions
– nucleotide frequencies all identical
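A sketch of the K2P-corrected distance between two aligned sequences. The slides do not show the formula; the code uses the standard one, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed fractions of transitional and transversional differences. It assumes sequences of equal length that contain only A, C, G and T.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq1, seq2):
    """K2P distance; assumes equal-length sequences over A, C, G, T only."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (equal length)")
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue
        # purine<->purine or pyrimidine<->pyrimidine is a transition
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq1)
    q = transversions / len(seq1)
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)

print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

When transitions make up one third of all differences (P = Q/2), the expression reduces to the JC-corrected distance.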

More models

• Felsenstein (1981), F81:
– 1 nucleotide substitution type, 4 base frequencies
• HKY85
– 2 different nucleotide substitution types, 4 base frequencies
• GTR
– 6 different nucleotide substitution types, 4 base frequencies

Heterogeneity among sites

Among site rate variation modelled via gamma distribution

Hierarchical relationships among common models

Amino acid models

Codon models

• GY94, MG94
• 61 x 61 matrix (stop codons ignored)
– π_j = frequency of codon j
– κ = transition/transversion ratio
– ω = ratio nonsynonymous/synonymous

Models getting more "realistic"

• example: covarion models
• DNA sites change between "on" and "off" states: changes allowed vs. forbidden
– transition rates s01 and s10 between the two states; kappa = proportion of "on" sites

Additivity of distances

• condition: triangle-inequality

• four-point-condition
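The four-point condition can be tested directly on a distance matrix. The sketch below uses the usual formulation: for every quartet of taxa, the two largest of the three pairwise sums d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k) must be equal. Storing distances in a dictionary keyed by `frozenset` pairs is an illustrative choice.

```python
from itertools import combinations

def satisfies_four_point(names, d, tol=1e-9):
    """True if every quartet of taxa meets the four-point condition."""
    def dist(x, y):
        return d[frozenset((x, y))]
    for i, j, k, l in combinations(names, 4):
        sums = sorted([dist(i, j) + dist(k, l),
                       dist(i, k) + dist(j, l),
                       dist(i, l) + dist(j, k)])
        if abs(sums[2] - sums[1]) > tol:  # the two largest sums must match
            return False
    return True

# Distances read off the tree ((A:1,B:2):5,(C:3,D:4)), so they are additive
names = ["A", "B", "C", "D"]
d = {frozenset(p): v for p, v in {
    ("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
    ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}.items()}
print(satisfies_four_point(names, d))
```

Changing a single entry, for example d(A,D), is enough to break additivity for this quartet.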

Corrected distances are rarely tree additive!

• two approaches try to find the tree that minimizes the error e when fitting the distances to it
• both are 2-step, tree-search methods
1. least-squares fit criterion (more generally: goodness-of-fit methods)
2. minimum evolution
• tree length L = sum of all branch lengths

Clustering methods

• 1-step, algorithmic methods
– UPGMA
• condition of an ultrametric tree
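A compact sketch of UPGMA clustering. It returns only the topology as nested tuples (no branch lengths), and it assumes the usual UPGMA update rule, in which the distance from a merged cluster to any other cluster is the size-weighted average of the two old distances. The data layout is an illustrative choice.

```python
from itertools import combinations

def upgma(names, dist):
    """dist maps frozenset({a, b}) to a distance; returns nested tuples."""
    sizes = {name: 1 for name in names}
    d = {frozenset(p): dist[frozenset(p)] for p in combinations(names, 2)}
    while len(sizes) > 1:
        closest = min(d, key=d.get)           # pair of clusters to merge
        a, b = sorted(closest, key=repr)
        merged = (a, b)
        for c in sizes:
            if c == a or c == b:
                continue
            # size-weighted average distance to the merged cluster
            d[frozenset((merged, c))] = (
                sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
            ) / (sizes[a] + sizes[b])
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))

dist = {frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("B", "C")): 6}
tree = upgma(["A", "B", "C"], dist)
print(tree)
```

The result is only guaranteed to match the true tree when the distances are ultrametric, which is the condition named on the slide.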

Clustering methods

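• neighbor joining
– star decomposition
– d(pair members, new node): distance from each member of the joined pair to the new node
– d(other taxa, new node): distance from every remaining taxon to the new node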
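The formulas for these two distances did not survive on the slide. The sketch below performs one joining step with the standard neighbor-joining formulas (Saitou & Nei) as an assumption: the pair minimizing Q(i,j) = (n-2) d(i,j) - r(i) - r(j) is joined, where r(x) is the sum of distances from x to all other taxa. The function name and data layout are illustrative.

```python
from itertools import combinations

def nj_step(names, d):
    """One neighbor-joining step; d maps frozenset({x, y}) to a distance."""
    n = len(names)  # requires n > 2
    r = {x: sum(d[frozenset((x, y))] for y in names if y != x) for x in names}
    i, j = min(combinations(names, 2),
               key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
    dij = d[frozenset((i, j))]
    # d(pair members, new node)
    len_i = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
    len_j = dij - len_i
    # d(other taxa, new node)
    to_new = {k: (d[frozenset((i, k))] + d[frozenset((j, k))] - dij) / 2
              for k in names if k not in (i, j)}
    return i, j, len_i, len_j, to_new

# Additive distances from the tree ((A:1,B:2):5,(C:3,D:4))
names = ["A", "B", "C", "D"]
d = {frozenset(p): v for p, v in {
    ("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
    ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}.items()}
print(nj_step(names, d))
```

On additive input such as this, the step recovers the true terminal branch lengths of the joined pair; repeating it on the reduced matrix resolves the star tree one node at a time.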
