Interpreting ‘tree space’ in the context of very large empirical datasets Joe Parker School of Biological and Chemical Sciences Queen Mary University of London

Post on 13-Jul-2015

Page 1

Interpreting ‘tree space’ in the context of very large empirical datasets

Joe Parker

School of Biological and Chemical Sciences

Queen Mary University of London

Page 2

Topics

• What evolutionary biology is
  – And what we do in the lab

• Introducing phylogenies (trees / digraphs)

• Molecular evolution

• Tests involving phylogeny comparison

• Problems in phylogeny comparison

• Conclusion / thanks / questions

Page 3

Introduction to our work (1/5)

Page 4

A Tale of Bats and Whales

Page 5

The Prestin gene & high-frequency hearing

Page 6

Evolution

Page 7

Prestin evolution

Human          NDLTRNRFFENPALWELLFH… SIHDAVLGSQLREALAEQEASAPPSQ
Rat            NDLTSNRFFENPALKELLFH… SIHDAVLGSQVREAMAEQETTVLPPQ
Dog            NDLTQNRFFENPALKELLFH… SIHDAVLGSQLREALAEQEASALPPQ
Dolphin        SDLTRNQFFENPALLDLLFH… SIHDAVLGSLVREALAEKEAAAATPQ
Horseshoe Bat  SDLTRNRFFENPALLDLLFH… SIHDAVLGSLVREALEEKEAAAATPQ

Page 8

Introduction to phylogenies (2/5)

Page 9

Phylogenies

• Phylogenies are directed graphs that show evolutionary relations between taxa

• Or our hypotheses about them

Page 10

Comparative approaches

Page 11

Tree space

• Phylogeneticists often talk about tree space: the set of all possible trees

• Within tree space, two graphs are said to be adjacent if they differ at e.g. one internal node

• Trees are said to be ‘near’ if they are similar, e.g. separated by only a few rearrangements

• However, it is not actually a well-defined concept
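
To give a feel for how large "the set of all possible trees" is, the sketch below counts labelled, fully bifurcating topologies using the standard double-factorial results: (2n-5)!! unrooted trees for n taxa, and 2(n-3) nearest-neighbour-interchange neighbours per unrooted tree. The function names are illustrative and not taken from the talk.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating, labelled topologies: (2n-5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product = 1 when n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def num_nni_neighbours(n_taxa: int) -> int:
    """Each of the n-3 internal edges allows 2 nearest-neighbour interchanges."""
    if n_taxa < 4:
        return 0
    return 2 * (n_taxa - 3)


for n in (4, 10, 20):
    print(n, num_unrooted_trees(n), num_nni_neighbours(n))
```

Even at 10 taxa there are 2,027,025 unrooted topologies, while any one tree is adjacent (by a single interchange) to only 14 of them, which is one reason "near" and "far" are hard to pin down.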

Page 12

Introduction to molecular evolution (3/5)

Page 13

Molecular evolution

• Molecular evolution is the study of the processes by which DNA sequences change over time

• Stochastic changes dominate over short time-scales, but over longer ones directional natural selection is apparent

• Normally modelled as a stochastic process

• Unlike classical physical phenomena, it is largely understood as a statistical, not a mechanical, phenomenon

Page 14

Simple model: Jukes-Cantor 69

• Letters {A,C,G,T}

• Equal frequencies at equilibrium

• Each base changes to each of the other three at rate u / 3 per unit time

• e.g. A → C after time t:

$$\Pr(C \mid A, u, t) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$

More generally:

Felsenstein (2004) Inferring Phylogenies. Springer, NY

(Following model figures and formulae: ibid.)
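
The Jukes-Cantor probability of observing a different base after time t, (1/4)(1 - exp(-4ut/3)), can be sketched as a small function. The same-base case is not shown on the slide; it follows from the four probabilities in a row summing to one. The function and argument names are illustrative.

```python
import math


def jc69_prob(from_base: str, to_base: str, u: float, t: float) -> float:
    """Jukes-Cantor probability of seeing to_base after time t, starting at from_base."""
    decay = math.exp(-4.0 * u * t / 3.0)
    if from_base == to_base:
        # 1 minus three times the probability of each specific change.
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


# Hypothetical values of u and t, for illustration only.
row = [jc69_prob("A", b, u=1.0, t=0.1) for b in "ACGT"]
print(row, sum(row))
```

At t = 0 no change is possible, and as ut grows every entry tends to the equilibrium frequency 1/4.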

Page 15

Maximum likelihood

• One of the most popular frameworks for understanding and modelling molecular evolution and phylogenies

• Likelihood of the data D given the model and phylogeny T, over sites i = 1…m:

$$L = \Pr(D \mid T) = \prod_{i=1}^{m} \Pr\left(D^{(i)} \mid T\right)$$

• Likelihood-maximisation gives a way to parameterise the model and/or phylogeny
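
Because the likelihood is a product of many per-site probabilities, it is normally handled as a sum of logs (lnL). The sketch below uses made-up per-site likelihoods to show why: the raw product underflows in double precision while the log-likelihood stays finite. All names and values are illustrative.

```python
import math


def log_likelihood(site_likelihoods):
    """ln L = sum over sites of ln Pr(D_i | T), assuming sites are independent."""
    return sum(math.log(p) for p in site_likelihoods)


# Hypothetical per-site likelihoods, for illustration only.
site_likelihoods = [1e-3] * 2000

naive_product = math.prod(site_likelihoods)  # underflows to 0.0
ln_l = log_likelihood(site_likelihoods)      # finite, about -13815.5

print(naive_product, ln_l)
```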

Page 16

Independence of sites (1)

$$L = \Pr(D \mid T) = \prod_{i=1}^{m} \Pr\left(D^{(i)} \mid T\right) \tag{1}$$

Independence of branches (2)

$$\Pr\left(D^{(i)} \mid T\right) = \sum_{x} \sum_{y} \sum_{z} \sum_{w} \Pr(A, C, C, C, G, x, y, z, w \mid T) \tag{2}$$
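
The sum over the unobserved interior states x, y, z, w does not have to be enumerated term by term. Below is a minimal sketch of Felsenstein's pruning algorithm for one site with tip states A, C, C, C, G, checked against the explicit four-fold sum. The rooted topology, the equal branch lengths (ut = 0.1) and the Jukes-Cantor model are assumptions made for illustration; the tree on the original slide is not recoverable from the transcript.

```python
import math
from itertools import product

BASES = "ACGT"
UT = 0.1  # assumed branch length (rate x time), identical on every branch


def jc69(i, j, ut):
    """Jukes-Cantor probability of state j at the end of a branch starting in state i."""
    decay = math.exp(-4.0 * ut / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def conditional(node):
    """Return {state: Pr(tips below node | node is in that state)}."""
    if isinstance(node, str):  # tip: the observed base
        return {s: 1.0 if s == node else 0.0 for s in BASES}
    result = {s: 1.0 for s in BASES}
    for child, ut in node:     # interior node: tuple of (child, branch length)
        below = conditional(child)
        for s in BASES:
            result[s] *= sum(jc69(s, c, ut) * below[c] for c in BASES)
    return result


def site_likelihood(root):
    """Average the root conditionals over the equilibrium frequencies (1/4 each)."""
    cond = conditional(root)
    return sum(0.25 * cond[s] for s in BASES)


# Hypothetical topology: ((A,C)y, (C,(C,G)x)z)w
x = (("C", UT), ("G", UT))
z = (("C", UT), (x, UT))
y = (("A", UT), ("C", UT))
root = ((y, UT), (z, UT))


def brute_force():
    """Explicit sum over all 4^4 assignments of the interior states."""
    total = 0.0
    for w_, y_, z_, x_ in product(BASES, repeat=4):
        total += (0.25 * jc69(w_, y_, UT) * jc69(w_, z_, UT)
                  * jc69(y_, "A", UT) * jc69(y_, "C", UT)
                  * jc69(z_, "C", UT) * jc69(z_, x_, UT)
                  * jc69(x_, "C", UT) * jc69(x_, "G", UT))
    return total


print(site_likelihood(root), brute_force())
```

Writing the joint probability as a product of per-branch terms, as `brute_force` does, is exactly where the independence-of-branches assumption enters; pruning simply reorders the same sums.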

Page 17

Phylogenomics

• Technological advances mean datasets are now several orders of magnitude larger

• Shift in emphasis from ML on specific phylogenies to statistics across all of them

(Images: spectrum.ieee.org; Illumina.com; flickr/stephenjjohnson)

Page 18

Phylogenomics

• The stochastic nature of molecular evolution becomes apparent in large datasets

• Goodness-of-fit varies by site / gene for a single phylogeny / model

• Corollary: goodness-of-fit varies amongst models for a single genome

Page 19

Hypothesis-comparison tests using multiple phylogenies (4/5)

Page 20

Convergence detection by ∆SSLS - Parker et al. (2013)

• De novo genomes:
  – four taxa
  – 2,321 protein-coding loci
  – 801,301 codons

• Published:
  – 18 genomes

• ~69,000 simulated datasets

• ~3,500 cluster cores

$$\Delta\mathrm{SSLS}_i = \ln L_{i,H_0} - \ln L_{i,H_a}$$
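
A minimal sketch of the per-site statistic, taking the difference of site-wise log-likelihoods under the two trees (lnL under H0 minus lnL under Ha). The input values are invented for illustration, and the function names are not those of the published pipeline.

```python
def delta_ssls(lnl_h0, lnl_ha):
    """Per-site difference in log-likelihood: lnL under H0 minus lnL under Ha."""
    if len(lnl_h0) != len(lnl_ha):
        raise ValueError("both trees must be scored on the same sites")
    return [h0 - ha for h0, ha in zip(lnl_h0, lnl_ha)]


def mean(values):
    return sum(values) / len(values)


# Hypothetical site-wise lnL values for one locus under each tree.
lnl_h0 = [-10.0, -8.25, -7.5]
lnl_ha = [-9.5, -8.75, -7.5]

per_site = delta_ssls(lnl_h0, lnl_ha)
print(per_site, mean(per_site))
```

With this sign convention a negative value means the site has the higher log-likelihood under Ha, and a positive value means it fits H0 better.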

Page 21

Our pipeline for detecting genome-wide convergence

Page 22
Page 23
Page 24
Page 25
Page 26
Page 27
Page 28
Page 29

mean = 0.05

Page 30

mean = 0.05 mean = -0.01 mean = -0.08

Page 31

Continuous distributions

• Output approximates a continuous distribution

• Comparing alternative hypotheses, it is apparent that the choice of tree largely determines location, skew etc. (perhaps as expected)

• But given that the distribution tails are what is considered significant, it is problematic to interpret, or compare, values in these tails

Page 32

Significance by simulation

• Very common technique in evolutionary biology: simulate a large dataset under the null model, compare with the empirical data

• In this context: simulate data, then obtain ‘unexpectedness’ U:

$$U = 1 - \mathrm{cdf}\left(\Delta\mathrm{SSLS}_{H_0 - H_a} \mid j\right)$$
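
One way to read the formula is with an empirical CDF built from values simulated under the null model. The sketch below assumes that reading; the transcript does not define the conditioning term j, so it is not modelled here, and all names and numbers are illustrative.

```python
from bisect import bisect_right


def unexpectedness(observed, simulated_null):
    """U = 1 - ecdf(observed), where ecdf is built from null-simulated values."""
    if not simulated_null:
        raise ValueError("need at least one simulated value")
    ordered = sorted(simulated_null)
    # Fraction of simulated values less than or equal to the observed value.
    ecdf = bisect_right(ordered, observed) / len(ordered)
    return 1.0 - ecdf


# Hypothetical null distribution: 100 evenly spaced values, 0..99.
simulated = list(range(100))
print(unexpectedness(94.5, simulated))
```

Here 95 of the 100 simulated values fall at or below the observed one, so U is about 0.05. The resolution of U is limited by the number of simulations, which is part of why so many simulated datasets are needed.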

Page 33

Problems in multiple-hypothesis phylogeny comparisons (5/5)

Page 34

Multiple hypotheses

• Alternative hypotheses drawn from tree space

• Same dataset, different Ha, different U

• What U is expected for a given Ha?

• More simulation: multiple draws from tree space:

Uc = U – mean Uc

Page 35

Tree space

• In the context of ML, tree space can be thought of in terms of the distance in lnL units (or any other related statistic*) between two trees with otherwise identical models / data

• In our previous results this appeared continuous.

• This may be misleading; in reality tree space, or derived statistics, can be highly discontinuous.

Page 36

Multiple comparisons

• However, recall that distance in tree space, and the shape of tree space, are not well determined.

• How to sample effectively to control U (as Uc)?

• How to compare Uc for Ha?

• Sample every point (tree)?

• Sample lots?

• Sample systematically? Inverse-distance? Etc.

Page 37

Tree space

• Previously, with small empirical datasets, we assumed a single phylogeny was a good descriptor of most/many sites

• With large datasets this may not be true
  – Small adjustments are a better fit for many sites
  – And so are some large rearrangements

• Perhaps a better definition of tree space is needed

• Considering two Ha equidistant from H0

Page 38
Page 39

Tree distance properties

• Scalar distances informative

• Triangle inequality

• Proportional to L for a given model(?)

• Vectors informative (?)
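
A property usually asked of any candidate distance d is the triangle inequality, d(a, b) ≤ d(a, c) + d(c, b). Assuming that is the property the list has in mind, the sketch below checks a matrix of pairwise tree distances for violations. The matrices are invented for illustration and do not come from the talk.

```python
from itertools import permutations


def triangle_violations(dist, tol=1e-12):
    """Return all (i, j, k) with dist[i][j] > dist[i][k] + dist[k][j]."""
    n = len(dist)
    return [(i, j, k)
            for i, j, k in permutations(range(n), 3)
            if dist[i][j] > dist[i][k] + dist[k][j] + tol]


# Hypothetical pairwise distances between three trees.
metric_like = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.5],
    [2.0, 1.5, 0.0],
]
# Here going from tree 0 to tree 2 via tree 1 is "shorter" than going directly.
non_metric = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [5.0, 1.0, 0.0],
]

print(triangle_violations(metric_like))
print(triangle_violations(non_metric))
```

A distance based on a difference in lnL, for example, is not guaranteed to pass this check, which is one way to compare the candidate measures empirically.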

Page 40

Tree distance candidates

• Statistic or model-based measures:
  – Parsimony, ML or amino-acid/nucleotide distance
  – ∆lnL

• Topology-based measures:
  – Number / type of rearrangement moves, e.g.
    • Nearest-neighbour interchange
    • Subtree prune-and-regraft
    • Tree bisection-and-reconnection

• Algorithm-based measures:
  – Number of algorithm move steps
  – Wall clock time

Page 41

Acknowledgements

• School of Biological and Chemical Sciences, Queen Mary, University of London
  – Rossiter Group
  – Prof. Steve Rossiter (PI)
  – Drs Kalina Davies, Georgia Tsagkogeorga, Michael McGowen, Mao Xiuguang
  – Seb Bailey, Kim Warren

• Others:
  – Profs Richard Nichols, Andrew Leitch (SBCS)
  – Drs Yannick Wurm, Richard Buggs, Chris Faulkes, Steve Le Comber (SBCS)
  – Drs Chris Walker & Rob Horton (GridPP HTC)

• Sanger Centre
  – Dr James Cotton

(L-R): Joe Parker; Georgia Tsagkogeorga; Kalina Davies; Steve Rossiter; Xiuguang Mao; Seb Bailey