maximum likelihood given competing explanations for a particular observation, which explanation...


Upload: caitlin-briggs

Post on 18-Jan-2016


TRANSCRIPT

Page 1: Maximum Likelihood

• Given competing explanations for a particular observation, which explanation should we choose?
• Maximum likelihood methodologies suggest that the explanation that makes the observation most likely is the preferred one
• Given some data, D, and a hypothesis, H,
  – L_D = Pr(D|H): the likelihood of the data is the probability of observing D given H
  – Previous slide: L = P(D | M, θ, τ, ν)
• For our purposes, D is the data set (typically sequences) and H is any possible tree relating those sequences
• The best tree is the one that makes the observed data most likely.
• The main idea behind maximum likelihood (ML) phylogenetic inference is to determine the tree topology, branch lengths, and evolutionary model that maximize the probability of observing the observed sequences
  – L(τ, θ) = Prob(Data | τ, θ) = Prob(Aligned sequences | tree, model)
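As a minimal sketch of choosing between hypotheses by likelihood, the following compares a few candidate branch lengths for a two-sequence tree built from sequences 1 and 2 of the four-taxon example in these slides. The Jukes-Cantor (JC69) model, the function names, and the candidate branch lengths are assumptions made for illustration; the slides do not name a model.

```python
import math

def jc69_prob(same: bool, t: float) -> float:
    """JC69 probability of ending in the same state, or in one specific
    different state, after a branch of length t (substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay

def pair_likelihood(seq_a: str, seq_b: str, t: float) -> float:
    """Pr(D | H): D is the aligned pair, H is a two-taxon tree of length t."""
    likelihood = 1.0
    for a, b in zip(seq_a, seq_b):
        # 0.25 is the JC69 equilibrium frequency of the starting state.
        likelihood *= 0.25 * jc69_prob(a == b, t)
    return likelihood

seq1, seq2 = "TTCGCTTAA", "TTTCTGCAA"  # sequences 1 and 2 from the slides
candidates = [0.1, 0.5, 1.0, 2.0]      # hypothetical branch lengths (assumption)
scores = {t: pair_likelihood(seq1, seq2, t) for t in candidates}
best_t = max(scores, key=scores.get)   # the hypothesis that makes D most likely
print(best_t, scores[best_t])
```

Under these assumptions the preferred hypothesis is simply the candidate with the largest Pr(D|H); a real analysis would optimize the branch length continuously rather than pick from a short list.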

Page 2: Maximum Likelihood

• Given four taxa and associated sequences
  – 1 – …TTCGCTTAA…
  – 2 – …TTTCTGCAA…
  – 3 – …TTGCTGGTA…
  – 4 – …TCTCGGCAA…
• If we have an evolutionary model, we have an estimate of the instantaneous rates of change for any given site and set of nucleotides
• We can also derive any number of hypothetical trees on which to map the data
• Our job is to determine the likelihood of the data given the evolutionary model and the possible trees

Page 3: Maximum Likelihood

• Given four taxa and associated sequences
  – 1 – …TTCGCTTAA…
  – 2 – …TTTCTGCAA…
  – 3 – …TTGCTGGTA…
  – 4 – …TCTCGGCAA…
• For the highlighted position (the fifth site shown, with states C, T, T, G in sequences 1–4), there are three possible trees
• Assuming a given evolutionary model, a value can be assigned to each topology

[Figure: three alternative tree topologies, each with tips C, T(2), T(3), and G and internal nodes X and Y]

Page 4: Maximum Likelihood

• Just as we are considering only one site among many, we can consider one tree among many
• This is one possible rooted tree
• There are 16 possible combinations of states for X and Y (4 × 4)
• Again, let’s choose one of them

[Figure: rooted tree with tips C, T(2), T(3), and G for sequences 1–4; the internal nodes X and Y are assigned the states A and T]

Page 5: Maximum Likelihood

• Since we have a model of evolutionary change we can calculate the probability of this tree for this site
  – It is a product of the probabilities of all of the states/changes of state given our model of sequence evolution
  – P(τ) = Π_{i=1}^{N} P_i, a product function
  – P(τ) = P_A × P_AG × P_AC × P_AT × P_TT × P_TT

[Figure: the same rooted tree with tips C, T(2), T(3), and G and internal states A and T]
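The product above can be sketched numerically. This is an illustration under stated assumptions, not the deck's own calculation: it assumes the JC69 model, a single branch length `t = 0.1` on every branch, and it reads the product on the slide as a root in state A that connects to tips C and G and to an internal node in state T, which in turn connects to the two T tips. It also adds the standard next step of summing over all 16 internal-state combinations to obtain the likelihood of the site.

```python
import math
from itertools import product

BASES = "ACGT"

def jc69(i: str, j: str, t: float) -> float:
    """JC69 probability of observing state j at the end of a branch of
    length t that starts in state i."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def assignment_probability(x: str, y: str, t: float = 0.1) -> float:
    """P_x * P_xC * P_xG * P_xy * P_yT * P_yT for root state x and
    internal state y (tree shape and branch length are assumptions)."""
    return (0.25 * jc69(x, "C", t) * jc69(x, "G", t)
            * jc69(x, y, t) * jc69(y, "T", t) * jc69(y, "T", t))

# The single combination chosen on the slide: X = A, Y = T.
p_slide = assignment_probability("A", "T")

# Likelihood of the site: sum over all 16 combinations of internal states.
site_likelihood = sum(assignment_probability(x, y)
                      for x, y in product(BASES, repeat=2))
print(p_slide, site_likelihood)
```

With these assumed values the combination X = T, Y = T contributes more than X = A, Y = T, because it requires fewer changes of state along short branches.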

Page 6: Maximum Likelihood

• The probability must be calculated for all sites for this tree
  – P(τ) = Π_{i=1}^{N} P_i, a product function
• Then for all sites in all possible trees
• These numbers are very small so they are typically expressed as log likelihoods
  – ln L(τ) = Σ_{i=1}^{N} ln L_i
• ln L(τ) is the log likelihood of observing the given alignment under the chosen evolutionary model, given that particular tree and the branch lengths on the tree
• Because we are dealing not only with simple tree topologies but also with branch lengths, there are even more trees than ordinarily considered
• Heuristic (approximate) methods are usually applied

[Figure: the same rooted tree with tips C, T(2), T(3), and G and internal states A and T]
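A short sketch of why log likelihoods are used: multiplying many small per-site likelihoods underflows in floating point, while summing their logarithms does not. The per-site value `1e-4` and the alignment length of 100 sites are hypothetical numbers chosen only to trigger the effect.

```python
import math

# Hypothetical per-site likelihoods L_i for a 100-site alignment (assumption).
site_likelihoods = [1e-4] * 100

# Direct product: 1e-400 is below the smallest positive double, so it becomes 0.0.
raw_product = math.prod(site_likelihoods)

# ln L = sum of ln L_i stays well within floating-point range.
log_likelihood = sum(math.log(l) for l in site_likelihoods)

print(raw_product)     # 0.0
print(log_likelihood)  # about -921.03
```

Because the logarithm is monotonic, the tree that maximizes ln L(τ) is the same tree that maximizes L(τ).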

Page 7: Maximum Likelihood

• Because we are dealing not only with simple tree topologies but also with branch lengths, there are even more trees than ordinarily considered
• To reduce the computational complexity, heuristic methods are usually applied to suggest reasonable starting trees
• Exact methods – will find the best tree under a given criterion but are not feasible for large data sets
  – Branch and Bound
• Heuristic – any approach to problem solving, learning, or discovery that employs a practical methodology not guaranteed to be optimal or perfect, but sufficient for the immediate goal
  – Stepwise addition
  – Branch swapping methods
  – Quartet puzzling
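To give a sense of scale for why exact searches become infeasible, the sketch below counts unrooted bifurcating topologies using the standard formula (2n−5)!! = 3 × 5 × … × (2n−5). The function name is an assumption for illustration, and the count ignores branch lengths, which enlarge the search space further.

```python
def unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted bifurcating tree topologies for n_taxa >= 3,
    computed as the double factorial (2n - 5)!!."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # 3, 5, ..., 2n - 5
        count *= k
    return count

for n in (4, 6, 10, 20):
    print(n, unrooted_topologies(n))
```

Four taxa give the three trees shown earlier in the deck, ten taxa already give about two million, and the count keeps growing faster than exponentially.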

Page 8: Maximum Likelihood

• Branch-and-Bound Method
  – Add taxa to trees along ‘paths’
  – Quit a path when it is apparent that no solutions along that path are optimal
  – Accomplished by evaluating the tree criterion after each addition
• Good for 12 – 25 taxa
• Will find the globally optimal tree under the chosen criterion, since it is an exact method

Page 9: Maximum Likelihood

• Branch-and-Bound Method
  – L = number of terminal taxa
  – Choose an initial tree with three leaves from L
  – Add a terminal taxon at a defined position
  – Repeat until all taxa are added
  – Evaluate using optimality criterion
  – Set upper bound for optimality criterion
  – Repeat

Page 10: Maximum Likelihood

• Branch-and-Bound Method (same steps as on Page 9)

[Figure: taxa A–F evaluated in this example]
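The branch-and-bound steps can be sketched in a few dozen lines. To keep the example short, this sketch swaps in Fitch parsimony as the optimality criterion instead of likelihood, because the parsimony score is cheap to compute and can never decrease when a taxon is added, which is what makes the bound valid. Sequences 1–4 are from the slides; sequences 5 and 6, the nested-tuple tree representation, and all function names are hypothetical.

```python
def fitch(node, seqs, site):
    """Return (state set, number of changes) for one site on a subtree."""
    if isinstance(node, str):                      # leaf: taxon name
        return {seqs[node][site]}, 0
    left_set, left_cost = fitch(node[0], seqs, site)
    right_set, right_cost = fitch(node[1], seqs, site)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony(tree, seqs):
    """Total Fitch parsimony score of a tree over all sites."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch(tree, seqs, site)[1] for site in range(n_sites))

def insertions(node, leaf, attach_here=True):
    """Yield copies of a subtree with `leaf` attached to each of its branches."""
    if attach_here:
        yield (node, leaf)
    if not isinstance(node, str):
        left, right = node
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)

def unrooted_insertions(tree, leaf):
    """Attach `leaf` to each of the 2k-3 branches of an unrooted k-taxon tree.
    The two branches at the top of the tuple are one unrooted branch,
    so the attachment directly above the right child is skipped."""
    left, right = tree
    for new_left in insertions(left, leaf):
        yield (new_left, right)
    for new_right in insertions(right, leaf, attach_here=False):
        yield (left, new_right)

def branch_and_bound(taxa, seqs):
    best = {"score": float("inf"), "trees": []}
    def extend(tree, remaining):
        score = parsimony(tree, seqs)
        if score > best["score"]:
            return                                 # bound: abandon this path
        if not remaining:
            if score < best["score"]:
                best["score"], best["trees"] = score, []
            best["trees"].append(tree)
            return
        for candidate in unrooted_insertions(tree, remaining[0]):
            extend(candidate, remaining[1:])
    extend((taxa[0], (taxa[1], taxa[2])), taxa[3:])
    return best["score"], best["trees"]

seqs = {
    "1": "TTCGCTTAA", "2": "TTTCTGCAA", "3": "TTGCTGGTA", "4": "TCTCGGCAA",
    "5": "TTCGCTGAA", "6": "TCTCTGCAA",            # hypothetical extra taxa
}
taxa = list(seqs)
best_score, best_trees = branch_and_bound(taxa, seqs)
print(best_score, len(best_trees))
```

Because a path is abandoned only when its partial score already exceeds the best complete tree found so far, the search returns the same optimum as scoring every tree, usually after scoring far fewer of them.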

Page 11: Maximum Likelihood

• Stepwise Addition Method
  – Select three random taxa from n terminal taxa
  – Find the most likely tree
  – Add another random taxon
  – Find the most likely tree
  – Repeat n-3 times
• Will find a locally optimal tree
• Other addition orders may give a better tree
• Perform tree rearrangements to search for other optimal trees
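A rough sketch of why stepwise addition is cheap: when a taxon is added to an unrooted tree that already has k taxa, there are 2k−3 branches to try, so one pass scores only a few dozen trees even when the full tree space is huge. The function names are assumptions, and the count covers a single addition order with no later rearrangements.

```python
import math

def stepwise_addition_evaluations(n_taxa: int) -> int:
    """Trees scored in one stepwise-addition pass: 2k - 3 placements for
    each of the n - 3 additions, with k running from 3 to n - 1."""
    return sum(2 * k - 3 for k in range(3, n_taxa))

def exhaustive_evaluations(n_taxa: int) -> int:
    """All unrooted bifurcating topologies, (2n - 5)!!, for comparison."""
    return math.prod(range(3, 2 * n_taxa - 4, 2))

for n in (4, 10, 20):
    print(n, stepwise_addition_evaluations(n), exhaustive_evaluations(n))
```

The price of this saving is that each placement is fixed once made, which is why the result is only locally optimal and depends on the addition order.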


Page 13: Maximum Likelihood

• Once you have found a reasonable tree using a heuristic method…
• Perform branch swapping to search for other, possibly better trees
  – Nearest neighbor interchange (NNI)
  – Subtree pruning and regrafting (SPR)
  – Tree bisection and reconnection (TBR)

[Figure: schematic NNI, SPR, and TBR rearrangements]

Page 14: Maximum Likelihood

– Nearest neighbor interchange (NNI)

• For any internal edge, there are three ways the four subtrees can be regrouped
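The three regroupings around one internal edge can be written out directly. In this sketch the four subtrees are passed as opaque values (strings or nested tuples), the first arrangement is taken to be the current tree, and the function names are assumptions for illustration.

```python
def nni_arrangements(a, b, c, d):
    """The three ways to group four subtrees around one internal edge."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

def nni_neighbor_count(n_taxa: int) -> int:
    """An unrooted bifurcating tree has n - 3 internal edges, and each
    edge offers two alternative arrangements."""
    return 2 * (n_taxa - 3)

# Subtrees may themselves be trees, e.g. the clade ("E", "F").
current, *neighbors = nni_arrangements("A", "B", "C", ("E", "F"))
print(current)
print(neighbors)
print(nni_neighbor_count(6))
```

Each NNI step therefore examines only two new trees per internal edge, which makes it the most local of the three rearrangement types listed in the deck.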

Page 15: Maximum Likelihood

– Subtree pruning and regrafting (SPR)

• Clip subtrees and reinsert them at all possible locations

Page 16: Maximum Likelihood

– Tree bisection and reconnection (TBR)

• Cut a tree into two subtrees
• Reconnect the subtrees by creating a new branch that joins a branch on one subtree to a branch on the other

Page 17: Maximum Likelihood

• Quartet Puzzling Method
• Given any set of sequences, any group of four is a quartet

Page 18: Maximum Likelihood

• Quartet Puzzling Method
• 1. Estimate parameters for the model to be used
  – Build distance matrix (D) and corresponding neighbor-joining (NJ) tree using a given model
  – Determine ML branch lengths and use them to re-estimate model parameters
  – Using new estimates, rebuild D and NJ tree, re-estimate parameters
  – Iterate the previous two steps until the parameters are stable
• 2. Calculate likelihoods for all quartet trees: 3 × n!/(4!(n−4)!) in total, three topologies for each of the n!/(4!(n−4)!) quartets
• 3. Add taxa in random order, placing each in the least contradictory position based on the likelihoods
  – Repeat using different addition orders to generate a set of trees
• 4. Build a consensus tree where the percent occurrence of each branch is represented with puzzle support values
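The count in step 2 is easy to check numerically. The sketch below enumerates the quartets for a hypothetical set of six taxa named A to F and multiplies by the three possible topologies per quartet; the function name is an assumption.

```python
from itertools import combinations
from math import comb

def quartet_likelihood_count(n_taxa: int) -> int:
    """Likelihoods to compute in step 2: three topologies per quartet,
    with n!/(4!(n-4)!) quartets."""
    return 3 * comb(n_taxa, 4)

taxa = ["A", "B", "C", "D", "E", "F"]     # hypothetical taxon names
quartets = list(combinations(taxa, 4))    # every group of four taxa
print(len(quartets), quartet_likelihood_count(len(taxa)))
```

The number of quartets grows roughly with the fourth power of the number of taxa, far more slowly than the number of full tree topologies.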

Page 19: Maximum Likelihood

• Quartet Puzzling Method
• Assume 6 taxa
• 1. Pick four at random and build the best ML tree
• 2. Pick another random sequence and add it for all possible quartets
• 3. Evaluate ML for each
• 4. Graft new taxon on best branch based on ML
• 5. Repeat

[Figure: two rounds of taxon addition (steps 2, 3, 4 and 2’, 3’, 4’), with the best placement marked in each round]

Page 20: Maximum Likelihood

• All of these methods will work for ML, parsimony, and Bayesian methods

Page 21: Maximum Likelihood

• Objections to ML
• Phylogenetic inferences using ML require an explicit model of evolution
• Good – we are aware of any assumptions
• Bad – where do we get our parameter estimates?
  – If we knew the actual parameters, we could better infer evolution
  – In order to get the actual parameters, we need to know the evolutionary history