

Page 1: Substitution Models and theSubstitution Models and the ...bioinformatics.org.au/resources/ws11/presentations... · Commonly-used Markov Models Assumptions used Markov Models J k &

Substitution Models and the Phylogenetic Assumptions

Vivek Jayaswal, Lars S. Jermiin

COMMONWEALTH OF AUSTRALIA
Copyright Regulation

WARNING

This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act).

The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act.

Do not remove this notice.

Page 2: Why do we need substitution models?

Alignment:

Site       1 2 3 4 5 … N
Human      A T C G A … C
Chimp      A G C A A … C
Gorilla    ………
Orangutan  ………

[Figure: rooted tree. The Root (state j) has two children, the internal nodes I1 (state u) and I2 (state v). I1 leads to Gorilla (O1) and Orangutan (O2); I2 leads to Chimp (O3) and Human (O4). The edges are labelled P(u|j), P(v|j), P(O1|u), P(O2|u), P(O3|v) and P(O4|v).]

Likelihood of site i:

\[
L_i = \sum_{j=1}^{4} \sum_{u=1}^{4} \sum_{v=1}^{4} f_j \, P(u|j) \, P(v|j) \, P(O_1|u) \, P(O_2|u) \, P(O_3|v) \, P(O_4|v)
\]
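
The triple sum on this slide can be evaluated directly for a four-taxon tree. The sketch below is only an illustration: it assumes JC69 transition probabilities on every edge, made-up edge lengths, and made-up Gorilla and Orangutan states (the slide does not show them). The names `jc69_p` and `site_likelihood` are hypothetical helpers, not part of the talk.

```python
import math
from itertools import product

def jc69_p(t, alpha=1.0):
    """4x4 JC69 transition probabilities after time t (assumed model)."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if a == b else diff for b in range(4)] for a in range(4)]

def site_likelihood(obs, f, p_root_i1, p_root_i2, p_leaves):
    """Sum over the root state j and the internal states u (I1) and v (I2).

    obs      -- observed states (O1, O2, O3, O4), coded A, C, G, T = 0..3
    f        -- root frequencies f_j
    p_leaves -- transition matrices for the four terminal edges
    """
    o1, o2, o3, o4 = obs
    total = 0.0
    for j, u, v in product(range(4), repeat=3):
        total += (f[j] * p_root_i1[j][u] * p_root_i2[j][v]
                  * p_leaves[0][u][o1] * p_leaves[1][u][o2]
                  * p_leaves[2][v][o3] * p_leaves[3][v][o4])
    return total

# Hypothetical inputs: uniform root frequencies and arbitrary edge lengths.
f = [0.25] * 4
inner = jc69_p(0.05)
leaves = [jc69_p(0.1)] * 4
# Site 2 of the alignment: Chimp = G, Human = T; Gorilla and Orangutan are
# not shown on the slide, so G and G are assumed here.
L = site_likelihood((2, 2, 2, 3), f, inner, inner, leaves)
print(L)
```

A quick sanity check is that the likelihoods of all 4^4 possible site patterns add up to one.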

Page 3: Topics

• how corrections for multiple substitutions are done
• some of the phylogenetic assumptions
• how we may evaluate phylogenetic assumptions
• an example involving bacterial DNA

Page 4: Substitutions at a Single Site

[Figure: six scenarios, each showing two lineages descending from an ancestor with state A]

• Single Substitution: A→G in one lineage
• Multiple Substitution: A→G followed by G→T in one lineage
• Coincidental Substitution: A→G in one lineage and A→T in the other
• Parallel Substitution: A→G in both lineages
• Convergent Substitution: A→G in one lineage; A→T followed by T→G in the other
• Back Substitution: A→G followed by G→A in one lineage

Note: Every substitution overwrites the evidence of a past state, which leads to an erosion of the historical signal
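
Because later substitutions overwrite earlier ones, the observed proportion of differing sites underestimates the number of substitutions that actually occurred. One widely used correction, shown here only as an illustration and not necessarily the one used in the talk, is the Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3). The function names are hypothetical; the two sequences are the Seq_1 and Seq_2 fragments shown later in the deck.

```python
import math

def p_distance(seq1, seq2):
    """Observed proportion of sites at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)

def jc69_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the correction is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

p = p_distance("ACGTGTCCATGATTA", "ACCTGCCCAAGATAA")
d = jc69_distance(p)
print(p, d)  # the corrected distance d exceeds the observed proportion p
```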

Page 5: Modeling Nucleotide Substitutions

Consider the evolution at a given site in terms of conditional rates-of-change from nucleotide i to nucleotide j

\[
R = \begin{bmatrix}
-\sum_{j \neq A} \alpha_{Aj} & \alpha_{AC} & \alpha_{AG} & \alpha_{AT} \\
\alpha_{CA} & -\sum_{j \neq C} \alpha_{Cj} & \alpha_{CG} & \alpha_{CT} \\
\alpha_{GA} & \alpha_{GC} & -\sum_{j \neq G} \alpha_{Gj} & \alpha_{GT} \\
\alpha_{TA} & \alpha_{TC} & \alpha_{TG} & -\sum_{j \neq T} \alpha_{Tj}
\end{bmatrix}
\]

[Figure: the four nucleotides A, T, C and G]

Note: Here αij is the conditional rate-of-change from nucleotide i to nucleotide j in R, the rate matrix, and R is the most general Markov model for DNA
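
As a minimal sketch of how such a matrix might be assembled, the code below fills the twelve off-diagonal rates and then sets each diagonal entry to minus the sum of the other entries in its row. The function name `rate_matrix` and the numerical rates are made up for illustration.

```python
NUCS = "ACGT"

def rate_matrix(rates):
    """Build the general 4x4 rate matrix R.

    rates -- dict mapping (i, j) nucleotide pairs, i != j, to alpha_ij
    """
    R = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(NUCS):
        for j, b in enumerate(NUCS):
            if i != j:
                R[i][j] = rates[(a, b)]
        # The diagonal is still 0.0 here, so sum(R[i]) is the off-diagonal sum.
        R[i][i] = -sum(R[i])
    return R

# Hypothetical rates: twelve distinct values, one per ordered pair.
rates = {(a, b): 0.05 + 0.01 * (4 * i + j)
         for i, a in enumerate(NUCS) for j, b in enumerate(NUCS) if a != b}
R = rate_matrix(rates)
for row in R:
    print(["%.2f" % x for x in row])
```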

Page 6: Modelling a Site in one Sequence

Consider a Markov process, X, that results in nucleotide i being converted to nucleotide j over time t:

\[
P_{ij}(t) = P[X(t) = j \mid X(0) = i]
\]

If the rates of change are constant, then P(t) = e^{Rt}

In matrix notation, this is

\[
P(t) = I + Rt + \frac{(Rt)^2}{2!} + \frac{(Rt)^3}{3!} + \cdots = \sum_{k=0}^{\infty} \frac{(Rt)^k}{k!}
\]

[Figure: time axis for lineage X, running from the ancestor (state i) at t = 0 to state j at time t]
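
The power series can be evaluated numerically by truncating it after enough terms. The sketch below does this in plain Python and compares the result with the known closed form for the Jukes-Cantor model, P_ii(t) = 1/4 + (3/4)e^{-4αt}. The truncation length, α and t are arbitrary choices; production code would normally call a library matrix exponential instead.

```python
import math

def mat_mul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transition_matrix(R, t, terms=40):
    """P(t) = sum_k (Rt)^k / k!, truncated after `terms` terms."""
    Rt = [[r * t for r in row] for row in R]
    P = [[float(i == j) for j in range(4)] for i in range(4)]  # k = 0 term
    term = [row[:] for row in P]
    for k in range(1, terms):
        # (Rt)^k / k! = ((Rt)^(k-1) / (k-1)!) * Rt / k
        term = [[x / k for x in row] for row in mat_mul(term, Rt)]
        P = [[P[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    return P

alpha, t = 0.1, 2.0  # hypothetical rate and time
R = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
P = transition_matrix(R, t)
print(P[0][0], 0.25 + 0.75 * math.exp(-4 * alpha * t))
```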

Page 7: Modelling a Site in two Sequences

Consider two Markov processes, X and Y, and the following scenario:

[Figure: lineages X and Y diverge from a common ancestor at t = 0 and are observed at time t]

\[
F_{ij}(t) = P[X(t) = i, \; Y(t) = j \mid X(0) = Y(0)]
\]

In matrix notation, this is

\[
F(t) = P_X(t)^{T} \, F(0) \, P_Y(t)
\]
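
A small numerical sketch of the matrix formula follows. It assumes that F(0) is the diagonal matrix of ancestral nucleotide frequencies and that both lineages follow JC69 with different (made-up) amounts of change; `joint_divergence` is a hypothetical helper name.

```python
import math

def jc69_p(t, alpha):
    """4x4 JC69 transition probabilities after time t (assumed model)."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if a == b else diff for b in range(4)] for a in range(4)]

def joint_divergence(f0, PX, PY):
    """F(t) = PX^T diag(f0) PY, i.e. F_ij = sum_k f0_k PX[k][i] PY[k][j]."""
    return [[sum(f0[k] * PX[k][i] * PY[k][j] for k in range(4))
             for j in range(4)] for i in range(4)]

f0 = [0.25] * 4         # assumed ancestral frequencies
PX = jc69_p(1.0, 0.1)   # lineage X
PY = jc69_p(1.0, 0.3)   # lineage Y evolves faster (hypothetical)
F = joint_divergence(f0, PX, PY)
print(sum(sum(row) for row in F))  # the joint probabilities sum to 1
```

Multiplying F(t) by the number of sites n gives the expected divergence matrix that is compared with observed counts later in the deck.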

Page 8: Take-home Message #1

• The substitution model, R, is an integral part of the transition function; it is used to estimate the probability of the present states, given R and t

Page 9: Rate matrix revisited

\[
R_i = \begin{bmatrix}
-\sum_{j \neq A} r_{Aj} & r_{AC} & r_{AG} & r_{AT} \\
r_{CA} & -\sum_{j \neq C} r_{Cj} & r_{CG} & r_{CT} \\
r_{GA} & r_{GC} & -\sum_{j \neq G} r_{Gj} & r_{GT} \\
r_{TA} & r_{TC} & r_{TG} & -\sum_{j \neq T} r_{Tj}
\end{bmatrix}
\]

• Typically simplified forms of this rate matrix are used

• These matrices belong to the GTR-family of models and can be represented as

\[
R = \begin{bmatrix}
- & r_{AC} & r_{AG} & r_{AT} \\
r_{AC} & - & r_{CG} & r_{CT} \\
r_{AG} & r_{CG} & - & r_{GT} \\
r_{AT} & r_{CT} & r_{GT} & -
\end{bmatrix}
\begin{bmatrix}
\pi_A & & & \\
 & \pi_C & & \\
 & & \pi_G & \\
 & & & \pi_T
\end{bmatrix}
\]

where each diagonal entry of R is set so that its row sums to zero.
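
The product form can be sketched in a few lines: each off-diagonal entry is a symmetric exchange rate multiplied by the frequency of the target nucleotide, and the diagonal is filled so that rows sum to zero. The function name and all numerical values below are made up for illustration.

```python
NUCS = "ACGT"

def gtr_rate_matrix(r, pi):
    """GTR-family rate matrix with R_ij = r_ij * pi_j for i != j.

    r  -- dict with keys 'AC', 'AG', 'AT', 'CG', 'CT', 'GT' (r_ij = r_ji)
    pi -- nucleotide frequencies in the order A, C, G, T
    """
    R = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                key = "".join(sorted(NUCS[i] + NUCS[j]))  # 'CA' -> 'AC'
                R[i][j] = r[key] * pi[j]
        R[i][i] = -sum(R[i])  # diagonal is 0.0 at this point
    return R

# Hypothetical exchange rates and frequencies.
r = {"AC": 1.0, "AG": 3.0, "AT": 0.5, "CG": 0.7, "CT": 4.0, "GT": 1.2}
pi = [0.1, 0.2, 0.3, 0.4]
R = gtr_rate_matrix(r, pi)
# pi is stationary for R: every entry of pi R is (numerically) zero.
print([sum(pi[i] * R[i][j] for i in range(4)) for j in range(4)])
```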

Page 10: Commonly-used Markov Models

Jukes & Cantor (1969)

\[
R = \begin{bmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{bmatrix}
\]

Assumptions
1. One rate (α)
2. Uniform nucleotide content

Kimura (1980)

\[
R = \begin{bmatrix}
-(2\beta + \alpha) & \beta & \alpha & \beta \\
\beta & -(2\beta + \alpha) & \beta & \alpha \\
\alpha & \beta & -(2\beta + \alpha) & \beta \\
\beta & \alpha & \beta & -(2\beta + \alpha)
\end{bmatrix}
\]

Assumptions
1. Two rates (α and β)
2. Uniform nucleotide content

Hasegawa, Kishino & Yano (1985)

\[
R = \begin{bmatrix}
-(\beta\pi_Y + \alpha\pi_G) & \beta\pi_C & \alpha\pi_G & \beta\pi_T \\
\beta\pi_A & -(\beta\pi_R + \alpha\pi_T) & \beta\pi_G & \alpha\pi_T \\
\alpha\pi_A & \beta\pi_C & -(\beta\pi_Y + \alpha\pi_A) & \beta\pi_T \\
\beta\pi_A & \alpha\pi_C & \beta\pi_G & -(\beta\pi_R + \alpha\pi_C)
\end{bmatrix}
\]

Here π_R = π_A + π_G (purines) and π_Y = π_C + π_T (pyrimidines).

Assumptions
1. Two rates (α and β)
2. Non-uniform nucleotide content
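
The three models are nested, which a short sketch can make concrete: the Kimura matrix is the Hasegawa-Kishino-Yano matrix with uniform nucleotide content, and the Jukes-Cantor matrix is the Kimura matrix with α = β. The function names and parameter values are illustrative assumptions; rows and columns are ordered A, C, G, T as on the slide.

```python
def hky85(alpha, beta, pi):
    """HKY85: transitions (A<->G, C<->T) at alpha*pi_j, transversions at beta*pi_j."""
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}
    R = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                R[i][j] = (alpha if (i, j) in transitions else beta) * pi[j]
        R[i][i] = -sum(R[i])  # diagonal is 0.0 at this point
    return R

def k80(alpha, beta):
    """Kimura (1980): HKY85 with uniform content; the factor 4 cancels pi_j = 1/4."""
    return hky85(4.0 * alpha, 4.0 * beta, [0.25] * 4)

def jc69(alpha):
    """Jukes & Cantor (1969): Kimura's model with a single rate."""
    return k80(alpha, alpha)

pi = [0.1, 0.2, 0.3, 0.4]  # hypothetical frequencies
H = hky85(2.0, 1.0, pi)
K = k80(0.2, 0.1)
J = jc69(0.1)
print(K[0])  # approximately [-(2*0.1 + 0.2), 0.1, 0.2, 0.1]
```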

Page 11: The Phylogenetic Assumptions

• Given an alignment of nucleotides, phylogenetic methods commonly assume that the sites have evolved under
  • stationary and reversible conditions
  • homogeneous conditions
  • independent and identical conditions

Page 12: Phylogenetic Assumptions

Consider the following scenario…

[Figure: an ancestral sequence (0) with two descendant sequences (1 and 2)]

F0 = [fA fC fG fT]    F1 = [fA fC fG fT]    F2 = [fA fC fG fT]

Π1 = [πA πC πG πT]    Π2 = [πA πC πG πT]

R1 and R2 each take the general form

\[
R_k = \begin{bmatrix}
-\sum_{j \neq A} \alpha_{Aj} & \alpha_{AC} & \alpha_{AG} & \alpha_{AT} \\
\alpha_{CA} & -\sum_{j \neq C} \alpha_{Cj} & \alpha_{CG} & \alpha_{CT} \\
\alpha_{GA} & \alpha_{GC} & -\sum_{j \neq G} \alpha_{Gj} & \alpha_{GT} \\
\alpha_{TA} & \alpha_{TC} & \alpha_{TG} & -\sum_{j \neq T} \alpha_{Tj}
\end{bmatrix},
\qquad k = 1, 2
\]

Page 13: The Stationary Condition

[Figure: ancestor 0 with F0 = [fA fC fG fT]; descendants 1 and 2 with Π1 = [πA πC πG πT] and Π2 = [πA πC πG πT]]

Note: The stationary condition is met if F0 = Π1 = Π2, implying that the marginal distributions of R1 and R2 are the same, even though R1 and R2 may differ!
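
To illustrate that two different rate matrices can share one stationary distribution, the sketch below builds two reversible matrices from different (made-up) exchange rates and the same frequencies, then recovers the stationary distribution of each numerically. The iteration used here, repeated multiplication by I + R/μ, is just one simple way to find it and is an assumption of this example.

```python
def reversible_rate_matrix(s, pi):
    """R_ij = s_ij * pi_j for i != j, with s symmetric; rows sum to zero."""
    R = [[s[i][j] * pi[j] if i != j else 0.0 for j in range(4)] for i in range(4)]
    for i in range(4):
        R[i][i] = -sum(R[i])
    return R

def stationary_distribution(R, iters=2000):
    """Iterate pi <- pi (I + R/mu), which converges to the pi solving pi R = 0."""
    mu = 1.1 * max(-R[i][i] for i in range(4))  # makes I + R/mu stochastic
    pi = [0.25] * 4
    for _ in range(iters):
        pi = [pi[j] + sum(pi[i] * R[i][j] for i in range(4)) / mu
              for j in range(4)]
    return pi

pi = [0.1, 0.2, 0.3, 0.4]  # hypothetical frequencies
s1 = [[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]
s2 = [[0, 1, 3, 0.5], [1, 0, 0.7, 4], [3, 0.7, 0, 1.2], [0.5, 4, 1.2, 0]]
R1 = reversible_rate_matrix(s1, pi)
R2 = reversible_rate_matrix(s2, pi)
pi1 = stationary_distribution(R1)
pi2 = stationary_distribution(R2)
print(pi1, pi2)  # both close to pi although R1 != R2
```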

Page 14: The Reversible Condition

[Figure: ancestor 0; lineage 1 with R1 and Π1; lineage 2 with R2 and Π2]

Notes:
If the process is stationary, then Π1 = Π2
Moreover, if πiRij = πjRji for all i and j, then the process is reversible
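
The reversibility condition is easy to check numerically. The sketch below contrasts a Jukes-Cantor matrix, which satisfies it, with a made-up cyclic process (A→C→G→T→A) that has the same uniform stationary distribution but is not reversible. The helper names are hypothetical.

```python
def is_stationary(R, pi, tol=1e-12):
    """True if pi R = 0, i.e. pi is a stationary distribution of R."""
    n = len(pi)
    return all(abs(sum(pi[i] * R[i][j] for i in range(n))) <= tol
               for j in range(n))

def is_reversible(R, pi, tol=1e-12):
    """True if pi_i R_ij = pi_j R_ji for all pairs i, j."""
    n = len(pi)
    return all(abs(pi[i] * R[i][j] - pi[j] * R[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))

uniform = [0.25] * 4
jc69 = [[-0.3 if i == j else 0.1 for j in range(4)] for i in range(4)]
# Hypothetical one-way cycle A -> C -> G -> T -> A.
cyclic = [[-1, 1, 0, 0], [0, -1, 1, 0], [0, 0, -1, 1], [1, 0, 0, -1]]

print(is_stationary(jc69, uniform), is_reversible(jc69, uniform))
print(is_stationary(cyclic, uniform), is_reversible(cyclic, uniform))
```

The second case shows that stationarity alone does not imply reversibility.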

Page 15: The Homogeneous Condition

[Figure: ancestor 0; lineage 1 with R1; lineage 2 with R2]

Notes:
The homogeneous condition is met for the Markov processes, R1 and R2, if R1 = R2
If the homogeneous condition is met, then Π1 = Π2; however, non-stationary, and therefore non-reversible, conditions may still prevail (i.e., if F0 ≠ Π1 = Π2)
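
A small sketch of the homogeneous but non-stationary case: both lineages use the same JC69 process (so Π1 = Π2 is uniform), but the ancestral composition F0 is skewed, so the composition drifts over time. All numbers are invented for illustration.

```python
import math

def jc69_p(t, alpha):
    """4x4 JC69 transition probabilities after time t (assumed model)."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if a == b else diff for b in range(4)] for a in range(4)]

def evolve_frequencies(f0, P):
    """Composition after time t: the row vector f0 multiplied by P(t)."""
    return [sum(f0[i] * P[i][j] for i in range(4)) for j in range(4)]

f0 = [0.7, 0.1, 0.1, 0.1]  # hypothetical A-rich ancestor, F0 != uniform
P = jc69_p(0.5, 0.1)       # the same process on both lineages (R1 = R2)
f1 = evolve_frequencies(f0, P)
f2 = evolve_frequencies(f0, P)
print(f1)  # moves from F0 towards the uniform stationary distribution
```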

Page 16: IID condition

[Figure: lineages X and Y (sequences 1 and 2) diverge from a common ancestor at t = 0 and are observed at time t; rate matrices Rx, Ry, … are marked on the figure]

Seq_1 ACGTGTCCATGATTA...
Seq_2 ACCTGCCCAAGATAA...

Notes: For computational reasons, it is convenient to assume that
• Sites in the sequence have evolved independently
• Sites in the sequence have evolved under the same model (Rx = Ry)
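
The computational convenience is that, under the IID assumption, the log-likelihood of an alignment is a plain sum of per-site terms that all use the same model. The sketch below shows this for the two sequence fragments on this slide, assuming (purely for illustration) the JC69 model with uniform frequencies and a distance d measured in expected substitutions per site.

```python
import math

def pair_log_likelihood(seq1, seq2, d):
    """Log-likelihood of two aligned sequences at JC69 distance d (IID sites)."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # joint probability of an identical pair
    p_diff = 0.25 * (0.25 - 0.25 * e)  # joint probability of one specific mismatch
    # Independence: per-site log-probabilities add up.
    return sum(math.log(p_same if a == b else p_diff)
               for a, b in zip(seq1, seq2))

seq1 = "ACGTGTCCATGATTA"
seq2 = "ACCTGCCCAAGATAA"
p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
d_hat = -0.75 * math.log(1.0 - 4.0 * p / 3.0)  # JC69 distance estimate
best = pair_log_likelihood(seq1, seq2, d_hat)
print(d_hat, best)
```

Under these assumptions the log-likelihood peaks at the Jukes-Cantor distance estimate.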

Page 17: Rate Heterogeneity Across Sites

RNA-coding Genes

Page 18: Take-home Message #2

• Phylogenetic analyses require the users to make certain assumptions about the data before these are investigated in detail

Page 19: Questions…

• Are these phylogenetic assumptions realistic?
• How can we assess whether the phylogenetic assumptions are met by the data?

Page 20: Answer…

• Inspecting the data, before or after inferring the phylogeny, increases the chance of finding out what might have taken place during the evolution

Page 21: Considering the IID Condition

• Visual inspection of the alignment might show whether some regions evolved faster

• Assume rate-heterogeneity across some sites, and then use phylogenetic methods that account for this

Page 22: Hierarchical Likelihood-Ratio Test

Consider the following decision tree…

[Figure: decision tree in which each level poses a question with Yes and No branches]
• Uniform nucleotide content?
• One conditional rate?
• Two or six conditional rates?

Source: Posada & Crandall (1998) Bioinformatics 14, 817-818.
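
Each question in such a decision tree is answered with a likelihood-ratio test between two nested models. A minimal sketch follows, assuming the usual chi-square approximation with degrees of freedom equal to the number of extra free parameters. The log-likelihood values are invented, and `chi2_sf` is a small stand-in for a statistics library.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df >= 1."""
    y = x / 2.0
    # Start from the df = 1 or df = 2 closed form, then step up by 2.
    a = 0.5 if df % 2 else 1.0
    q = math.erfc(math.sqrt(y)) if df % 2 else math.exp(-y)
    while a < df / 2.0 - 1e-9:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return the statistic 2(lnL_alt - lnL_null) and its approximate p-value."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, extra_params)

# Hypothetical log-likelihoods: one rate (null) versus two rates (alternative).
stat, p_value = likelihood_ratio_test(-1250.0, -1240.0, extra_params=1)
print(stat, p_value)  # a small p-value favours the richer model
```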

Page 23: Take-home Message #3

• The likelihood-ratio test can be used to identify a suitable substitution model for a given data set

Page 24: Testing assumptions prior to modelling

• n × F(t): expected divergence matrix
• N(t): observed divergence matrix

Examples:

\[
N(0) = \begin{bmatrix}
294 & 0 & 0 & 0 \\
0 & 372 & 0 & 0 \\
0 & 0 & 829 & 0 \\
0 & 0 & 0 & 655
\end{bmatrix}
\qquad
N(t) = \begin{bmatrix}
244 & 31 & 8 & 11 \\
28 & 321 & 14 & 9 \\
11 & 13 & 801 & 4 \\
14 & 10 & 3 & 628
\end{bmatrix}
\]
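
An observed divergence matrix is simply a cross-tabulation of the nucleotides found at each site in two aligned sequences. The sketch below counts it for the short Seq_1 and Seq_2 fragments shown earlier in the deck; the function name and the choice to skip gaps and ambiguity codes are assumptions of this example.

```python
def divergence_matrix(seq1, seq2, alphabet="ACGT"):
    """N[i][j] = number of sites with nucleotide i in seq1 and j in seq2."""
    index = {n: k for k, n in enumerate(alphabet)}
    N = [[0] * len(alphabet) for _ in alphabet]
    for a, b in zip(seq1, seq2):
        if a in index and b in index:  # skip gaps and ambiguity codes
            N[index[a]][index[b]] += 1
    return N

N = divergence_matrix("ACGTGTCCATGATTA", "ACCTGCCCAAGATAA")
for row in N:
    print(row)
```

Identical sequences give a diagonal matrix, as in the N(0) example above.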

Page 25: Matched-pairs Tests of Symmetry

Seq 1 AGACTAGGTCTTGTATAGACTAATGTTCACAGTTTTTTAACTTTGTCAATGGA...
Seq 2 AGACGAGGTCGTGTATGGCCTCGTGAGCACGGGTTGTTCACTCCGCCAACGGT...

      A   C   G   T   Σ2
A     5   4   7   1   17
C     0   7   2   0    9
G     0   1   5   0    6
T     1   4   5   8   18
Σ1    6  16  19   9

Test of Symmetry

\[
X^2_{\mathrm{Bowker}} = \sum_{i<j} \frac{(x_{ij} - x_{ji})^2}{x_{ij} + x_{ji}}
\]

Test of Marginal Symmetry

\[
X^2_{\mathrm{Stuart}} = D V^{-1} D^{T}, \qquad D_i = x_{i\bullet} - x_{\bullet i}, \qquad V = \text{covariance matrix of } D
\]

Note: These test statistics are asymptotically χ2-distributed on ν degrees of freedom

Sources: Bowker (1948). JASA 43, 572-574; Stuart (1955). Biometrika 42, 412-416.
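
Both statistics can be computed from the 4 × 4 table on this slide. The sketch below is an illustration under stated assumptions: pairs with x_ij + x_ji = 0 are skipped in Bowker's test (a common convention), the last category is dropped in Stuart's test because the D_i sum to zero, and V uses the usual estimate V_ii = x_i• + x_•i − 2x_ii, V_ij = −(x_ij + x_ji). The helpers `chi2_sf` and `solve` stand in for library routines.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df >= 1."""
    y = x / 2.0
    a = 0.5 if df % 2 else 1.0
    q = math.erfc(math.sqrt(y)) if df % 2 else math.exp(-y)
    while a < df / 2.0 - 1e-9:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def bowker(x):
    """Bowker's test of symmetry: returns (statistic, degrees of freedom)."""
    stat, df = 0.0, 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            s = x[i][j] + x[j][i]
            if s > 0:  # assumed convention: empty pairs carry no information
                stat += (x[i][j] - x[j][i]) ** 2 / s
                df += 1
    return stat, df

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def stuart(x):
    """Stuart's test of marginal symmetry: returns (statistic, degrees of freedom)."""
    n = len(x)
    row = [sum(x[i]) for i in range(n)]
    col = [sum(x[i][j] for i in range(n)) for j in range(n)]
    m = n - 1  # drop the last category
    D = [row[i] - col[i] for i in range(m)]
    V = [[row[i] + col[i] - 2 * x[i][i] if i == j else -(x[i][j] + x[j][i])
          for j in range(m)] for i in range(m)]
    z = solve(V, D)  # z = V^{-1} D, so the statistic is D . z
    return sum(D[i] * z[i] for i in range(m)), m

table = [[5, 4, 7, 1], [0, 7, 2, 0], [0, 1, 5, 0], [1, 4, 5, 8]]
b_stat, b_df = bowker(table)
s_stat, s_df = stuart(table)
print(b_stat, b_df, chi2_sf(b_stat, b_df))
print(s_stat, s_df, chi2_sf(s_stat, s_df))
```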

Page 26: Bacterial 16S rDNA Sequences

Ribosomal RNA from five bacteria was compared using the matched-pairs test of homogeneity

Probabilities   Thermotoga      Bacillus        Deinococcus     Thermus
Aquifex         1.32 × 10^-01   1.79 × 10^-11   3.45 × 10^-10   5.09 × 10^-01
Thermotoga                      2.64 × 10^-12   6.69 × 10^-12   4.15 × 10^-01
Bacillus                                        9.95 × 10^-01   3.64 × 10^-09
Deinococcus                                                     5.99 × 10^-11

Note: It is highly unlikely that these data have evolved under homogeneous conditions, implying that it would be unwise to use a time-reversible Markov model

Source: Ababneh et al. (2006). Bioinformatics 22, 1225-1231.

Page 27: Phylogeny of Bacterial Ribosomal RNA

[Figure: two panels]
Markov model: GTR
Markov model: General

Source: Ababneh et al. (2006). Bioinformatics 22, 1225-1231.

Page 28: Take-home Message #4

• It is important to consider the substitution models carefully when using them in phylogenetic studies

Page 29: Suggested Literature

RDM Page, EC Holmes (1998), Molecular Evolution.
• Chapter 5 (Sections 5.2, 5.3): important reading

W-H Li (1997), Molecular Evolution.
• Chapter 3 (pp. 59-78): contains descriptions that are better than those in Page & Holmes 1998; important reading

D Posada, KA Crandall, 1998. MODELTEST: testing the model of DNA substitution. Bioinformatics 14, 817-818: useful reading

LS Jermiin et al. (2008). Phylogenetic model Evaluation. Pp 331-363. In Bioinformatics - Volume I: Data, Sequences Analysis and Evolution (Ed. Keith J), Humana Press, Totowa, NJ. [2008]: important reading