substitution models and thesubstitution models and the...

Substitution Models and the Phylogenetic Assumptions

Vivek Jayaswal Lars S. Jermiin

COMMONWEALTH OF AUSTRALIA
Copyright Regulation

WARNING

This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act).

The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act.

Do not remove this notice.

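Why do we need substitution models?

Alignment (sites 1 to N):

Site        1 2 3 4 5 … N
Human       A T C G A … C
Chimp       A G C A A … C
Gorilla     ………...........
Orangutan   ………...........

(Figure: rooted tree. The Root (state j) has two internal nodes, I1 (state u) and I2 (state v). Gorilla and Orangutan descend from I1; Chimp and Human descend from I2. Edge labels: P(u|j) and P(v|j) on the edges from the Root; P(O1|u), P(O2|u), P(O3|v) and P(O4|v) on the edges to Gorilla, Orangutan, Chimp and Human.)

Likelihood of the observations O1, …, O4 at one site:

$$
L = \sum_{j=1}^{4} \sum_{u=1}^{4} \sum_{v=1}^{4} f_j \, P(u \mid j) \, P(v \mid j) \, P(O_1 \mid u) \, P(O_2 \mid u) \, P(O_3 \mid v) \, P(O_4 \mid v)
$$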
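The triple sum over the root state j and the internal states u and v can be sketched in a few lines of Python. This is only an illustration: the slide does not fix a substitution model or edge lengths, so the Jukes-Cantor transition probabilities, the uniform root frequencies f and the times 0.05 and 0.10 used here are assumptions, and the names `jc69_p` and `site_likelihood` are hypothetical.

```python
import math
from itertools import product

NUCS = "ACGT"

def jc69_p(t, alpha=1.0):
    """Jukes-Cantor transition probabilities over time t (assumed model)."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return {(x, y): (same if x == y else diff) for x in NUCS for y in NUCS}

def site_likelihood(o1, o2, o3, o4, f, p_int, p_tip):
    """Sum over root state j and internal states u (node I1) and v (node I2).

    o1, o2 are the states below I1 (Gorilla, Orangutan);
    o3, o4 are the states below I2 (Chimp, Human).
    p_int[(j, u)] plays the role of P(u|j); p_tip[(u, o)] the role of P(O|u).
    """
    total = 0.0
    for j, u, v in product(NUCS, repeat=3):
        total += (f[j] * p_int[(j, u)] * p_int[(j, v)]
                  * p_tip[(u, o1)] * p_tip[(u, o2)]
                  * p_tip[(v, o3)] * p_tip[(v, o4)])
    return total

# Assumed inputs: uniform root frequencies and arbitrary edge lengths.
f = {x: 0.25 for x in NUCS}
p_int = jc69_p(0.05)   # edges Root -> I1 and Root -> I2
p_tip = jc69_p(0.10)   # edges from I1 / I2 to the four species

lik_same = site_likelihood("A", "A", "A", "A", f, p_int, p_tip)
lik_mixed = site_likelihood("A", "A", "G", "G", f, p_int, p_tip)

# Sanity check: the likelihoods of all 4^4 site patterns add up to one.
total_prob = sum(site_likelihood(*pat, f, p_int, p_tip)
                 for pat in product(NUCS, repeat=4))
```

The brute-force loop visits 4^3 = 64 combinations per site, which is fine for four species; larger trees are normally handled with a recursive (pruning) scheme rather than explicit enumeration.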

Topics

• how corrections for multiple substitutions are done
• some of the phylogenetic assumptions
• how we may evaluate phylogenetic assumptions
• an example involving bacterial DNA

Substitutions at a Single Site

(Figure: six panels, each showing an ancestral nucleotide and its two descendant lineages.)

• Single substitution: A→G in one lineage
• Multiple substitution: A→G followed by G→T in one lineage
• Coincidental substitution: A→G in one lineage and A→T in the other
• Parallel substitution: A→G in both lineages
• Convergent substitution: A→G in one lineage; A→T followed by T→G in the other
• Back substitution: A→G followed by G→A in one lineage

Note: Every substitution overwrites the evidence of a past state, which leads to an erosion of the historical signal

Modeling Nucleotide Substitutions

Consider the evolution at a given site in terms of conditional rates-of-change from nucleotide i to nucleotide j

$$
R = \begin{bmatrix}
-\sum_{j \neq A} \alpha_{Aj} & \alpha_{AC} & \alpha_{AG} & \alpha_{AT} \\
\alpha_{CA} & -\sum_{j \neq C} \alpha_{Cj} & \alpha_{CG} & \alpha_{CT} \\
\alpha_{GA} & \alpha_{GC} & -\sum_{j \neq G} \alpha_{Gj} & \alpha_{GT} \\
\alpha_{TA} & \alpha_{TC} & \alpha_{TG} & -\sum_{j \neq T} \alpha_{Tj}
\end{bmatrix}
$$

(Rows and columns are ordered A, C, G, T.)

Note: Here αij is the conditional rate-of-change from nucleotide i to nucleotide j in R, the rate matrix, and R is the most general Markov model for DNA

Modelling a Site in one Sequence

Consider a Markov process, X, that results in nucleotide i being converted to nucleotide j over time t:

$$
P_{ij}(t) = P[\,X(t) = j \mid X(0) = i\,]
$$

(Figure: time axis for lineage X, running from the Ancestor in state i at t = 0 to state j at time t.)

If the rates of change are constant, then P(t) = e^{Rt}

In matrix notation, this is

$$
P(t) = I + Rt + \frac{(Rt)^2}{2!} + \frac{(Rt)^3}{3!} + \cdots = \sum_{k=0}^{\infty} \frac{(Rt)^k}{k!}
$$
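The series for P(t) = e^{Rt} can be evaluated directly by truncating it after a fixed number of terms. The sketch below does this in plain Python and compares the result with the known closed form for the Jukes-Cantor model; the rate α = 0.1, the time t = 2 and the cut-off of 40 terms are arbitrary illustrative choices, and `transition_matrix` is a hypothetical helper name.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(rate, t, terms=40):
    """Approximate P(t) = exp(Rt) by the truncated series sum_k (Rt)^k / k!."""
    n = len(rate)
    rt = [[rate[i][j] * t for j in range(n)] for i in range(n)]
    # The k = 0 term is the identity matrix.
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in term]
    for k in range(1, terms):
        # Turn (Rt)^(k-1)/(k-1)! into (Rt)^k/k! and add it to the running sum.
        term = [[x / k for x in row] for row in mat_mul(term, rt)]
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total

# Illustrative input: a Jukes-Cantor rate matrix with one rate, alpha.
alpha, t = 0.1, 2.0
R = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
P = transition_matrix(R, t)

# Closed-form Jukes-Cantor probabilities for comparison.
expected_same = 0.25 + 0.75 * math.exp(-4 * alpha * t)
expected_diff = 0.25 - 0.25 * math.exp(-4 * alpha * t)
```

A plain truncated series is adequate when the entries of Rt are small, as here; for long edges or large rates a dedicated matrix-exponential routine is the safer choice.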

Modelling a Site in two Sequences

Consider two Markov processes, X and Y, and the following scenario:

(Figure: lineages X and Y diverging from a common Ancestor at t = 0 and observed at time t.)

$$
F_{ij}(t) = P[\,X(t) = i,\; Y(t) = j \mid X(0) = Y(0)\,]
$$

In matrix notation, this is

$$
F(t) = P_X(t)^{T} \, F(0) \, P_Y(t)
$$
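As a rough illustration of F(t) = P_X(t)^T F(0) P_Y(t), the sketch below builds the joint distribution of the two descendant states. Several things are assumed rather than taken from the slide: both lineages follow a Jukes-Cantor process (with different edge lengths, 0.1 and 0.3), the ancestral frequencies are uniform, and F(0) is the diagonal matrix of those frequencies, which reflects X(0) = Y(0). The function names are hypothetical.

```python
import math

def jc69_matrix(t, alpha=1.0):
    """4x4 Jukes-Cantor transition matrix over time t (assumed model)."""
    e = math.exp(-4.0 * alpha * t)
    return [[0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def divergence_matrix(f0, p_x, p_y):
    """F(t) = P_X(t)^T F(0) P_Y(t), with F(0) = diag(f0)."""
    n = len(f0)
    f_zero = [[f0[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return mat_mul(mat_mul(transpose(p_x), f_zero), p_y)

f0 = [0.25, 0.25, 0.25, 0.25]          # assumed ancestral frequencies
F = divergence_matrix(f0, jc69_matrix(0.1), jc69_matrix(0.3))

# F is a joint probability distribution, so its 16 entries sum to one.
total = sum(map(sum, F))
```

Entry F[i][j] is the probability of observing nucleotide i in the first sequence and nucleotide j in the second at the same site; multiplying F by the number of sites gives the expected counts.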

Take-home Message #1

• The substitution model, R, is an integral part of the transition function; it is used to estimate the probability of the present states, given R and t

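Rate matrix revisited

$$
R = \begin{bmatrix}
-\sum_{j \neq A} r_{Aj} & r_{AC} & r_{AG} & r_{AT} \\
r_{CA} & -\sum_{j \neq C} r_{Cj} & r_{CG} & r_{CT} \\
r_{GA} & r_{GC} & -\sum_{j \neq G} r_{Gj} & r_{GT} \\
r_{TA} & r_{TC} & r_{TG} & -\sum_{j \neq T} r_{Tj}
\end{bmatrix}
$$

• Typically simplified forms of this rate matrix are used

• These matrices belong to the GTR-family of models and can be represented as

$$
R = \begin{bmatrix}
- & r_{AC} & r_{AG} & r_{AT} \\
r_{AC} & - & r_{CG} & r_{CT} \\
r_{AG} & r_{CG} & - & r_{GT} \\
r_{AT} & r_{CT} & r_{GT} & -
\end{bmatrix}
\begin{bmatrix}
\pi_A & 0 & 0 & 0 \\
0 & \pi_C & 0 & 0 \\
0 & 0 & \pi_G & 0 \\
0 & 0 & 0 & \pi_T
\end{bmatrix}
$$

where each diagonal entry marked "-" is set so that the corresponding row of R sums to zero.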

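Commonly-used Markov Models

Jukes & Cantor (1969)

$$
R = \begin{bmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{bmatrix}
$$

Assumptions
1. One rate (α)
2. Uniform nucleotide content

Kimura (1980)

$$
R = \begin{bmatrix}
-(2\beta + \alpha) & \beta & \alpha & \beta \\
\beta & -(2\beta + \alpha) & \beta & \alpha \\
\alpha & \beta & -(2\beta + \alpha) & \beta \\
\beta & \alpha & \beta & -(2\beta + \alpha)
\end{bmatrix}
$$

Assumptions
1. Two rates (α and β)
2. Uniform nucleotide content

Hasegawa, Kishino & Yano (1985)

$$
R = \begin{bmatrix}
-(\beta\pi_Y + \alpha\pi_G) & \beta\pi_C & \alpha\pi_G & \beta\pi_T \\
\beta\pi_A & -(\beta\pi_R + \alpha\pi_T) & \beta\pi_G & \alpha\pi_T \\
\alpha\pi_A & \beta\pi_C & -(\beta\pi_Y + \alpha\pi_A) & \beta\pi_T \\
\beta\pi_A & \alpha\pi_C & \beta\pi_G & -(\beta\pi_R + \alpha\pi_C)
\end{bmatrix}
$$

Assumptions
1. Two rates (α and β)
2. Non-uniform nucleotide content

Rows and columns are ordered A, C, G, T; πR = πA + πG (purines) and πY = πC + πT (pyrimidines).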
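The three rate matrices above share one construction: each off-diagonal entry is a rate (α for transitions A↔G and C↔T, β for transversions) multiplied by the frequency of the target nucleotide, and each diagonal entry makes its row sum to zero. A minimal sketch of that construction follows; `hky85_rate_matrix` is a hypothetical helper name and the parameter values are arbitrary.

```python
NUCS = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky85_rate_matrix(alpha, beta, pi):
    """Build an HKY85-style rate matrix (rows and columns ordered A, C, G, T).

    alpha: transition rate, beta: transversion rate,
    pi: dict mapping each nucleotide to its frequency.
    """
    rate = []
    for x in NUCS:
        row = [0.0 if x == y
               else (alpha if (x, y) in TRANSITIONS else beta) * pi[y]
               for y in NUCS]
        row[NUCS.index(x)] = -sum(row)   # diagonal: minus the off-diagonal sum
        rate.append(row)
    return rate

# Non-uniform nucleotide content (illustrative values).
pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
R_hky = hky85_rate_matrix(alpha=2.0, beta=1.0, pi=pi)

# Uniform content reduces the model to a Kimura (1980)-type matrix whose
# rates are alpha/4 and beta/4; alpha == beta then gives Jukes & Cantor (1969)
# with rate alpha/4.
uniform = {x: 0.25 for x in NUCS}
R_k80 = hky85_rate_matrix(alpha=2.0, beta=1.0, pi=uniform)
R_jc = hky85_rate_matrix(alpha=1.0, beta=1.0, pi=uniform)
```

Seen this way the models are nested: each simpler model is obtained by constraining parameters of the more general one, which is what makes comparing them by likelihood-ratio tests natural.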

The Phylogenetic Assumptions

• Given an alignment of nucleotides, phylogenetic methods commonly assume that the sites have evolved under
  • stationary and reversible conditions
  • homogeneous conditions
  • independent and identical conditions

Phylogenetic Assumptions

Consider the following scenario…

(Figure: an ancestral sequence, 0, gives rise to two descendant sequences, 1 and 2, along edges governed by the rate matrices R1 and R2.)

Nucleotide frequencies: F0 = [fA fC fG fT] at the ancestor; F1 = [fA fC fG fT] and F2 = [fA fC fG fT] at the descendants.

R1 and R2 are each general rate matrices of the form (k = 1, 2)

$$
R_k = \begin{bmatrix}
-\sum_{j \neq A} \alpha_{Aj} & \alpha_{AC} & \alpha_{AG} & \alpha_{AT} \\
\alpha_{CA} & -\sum_{j \neq C} \alpha_{Cj} & \alpha_{CG} & \alpha_{CT} \\
\alpha_{GA} & \alpha_{GC} & -\sum_{j \neq G} \alpha_{Gj} & \alpha_{GT} \\
\alpha_{TA} & \alpha_{TC} & \alpha_{TG} & -\sum_{j \neq T} \alpha_{Tj}
\end{bmatrix}
$$

with associated distributions Π1 = [πA πC πG πT] and Π2 = [πA πC πG πT].

The Stationary Condition

(Figure: the same scenario, with F0 = [fA fC fG fT] at the ancestor, 0, and Π1 = [πA πC πG πT] and Π2 = [πA πC πG πT] on the edges to sequences 1 and 2.)

Note: The stationary condition is met if F0 = Π1 = Π2, implying that the marginal distributions of R1 and R2 are the same, even though R1 and R2 may differ!

The Reversible Condition

(Figure: the same scenario, with R1, Π1 and R2, Π2 on the edges to sequences 1 and 2.)

Notes:
If the process is stationary, then Π1 = Π2

Moreover, if πiRij = πjRji for all i and j, then the process is reversible
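The condition πiRij = πjRji can be checked numerically for any rate matrix. The sketch below uses two made-up examples: a GTR-style matrix built from symmetric exchangeabilities, which satisfies the condition, and a cyclic process A → C → G → T → A, which keeps uniform frequencies unchanged (πR = 0) yet is not reversible. All names and numbers are illustrative assumptions.

```python
def is_stationary(pi, rate, tol=1e-12):
    """True if pi R = 0, i.e. pi is left unchanged by the process."""
    n = len(pi)
    return all(abs(sum(pi[i] * rate[i][j] for i in range(n))) <= tol
               for j in range(n))

def is_reversible(pi, rate, tol=1e-12):
    """True if pi_i R_ij = pi_j R_ji for every pair i < j."""
    n = len(pi)
    return all(abs(pi[i] * rate[i][j] - pi[j] * rate[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))

def gtr_rate_matrix(s, pi):
    """R_ij = s_ij * pi_j off the diagonal; each row sums to zero."""
    n = len(pi)
    rate = [[s[i][j] * pi[j] if i != j else 0.0 for j in range(n)]
            for i in range(n)]
    for i in range(n):
        rate[i][i] = -sum(rate[i])
    return rate

# Example 1: symmetric exchangeabilities give a reversible process.
pi = [0.1, 0.2, 0.3, 0.4]
s = [[0, 1, 2, 1],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]
R_gtr = gtr_rate_matrix(s, pi)

# Example 2: changes only ever go A -> C -> G -> T -> A.
uniform = [0.25, 0.25, 0.25, 0.25]
R_cyclic = [[-1.0, 1.0, 0.0, 0.0],
            [0.0, -1.0, 1.0, 0.0],
            [0.0, 0.0, -1.0, 1.0],
            [1.0, 0.0, 0.0, -1.0]]

gtr_checks = (is_stationary(pi, R_gtr), is_reversible(pi, R_gtr))
cyclic_checks = (is_stationary(uniform, R_cyclic), is_reversible(uniform, R_cyclic))
```

The second example shows why the two conditions are listed separately: reversibility implies stationarity, but not the other way round.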

The Homogeneous Condition

(Figure: the same scenario, with R1 and R2 on the edges from the ancestor, 0, to sequences 1 and 2.)

Notes:
The homogeneous condition is met for the Markov processes, R1 and R2, if R1 = R2

If the homogeneous condition is met, then Π1 = Π2; however, non-stationary, and therefore non-reversible, conditions may still prevail (i.e., if F0 ≠ Π1 = Π2)

IID condition

(Figure: two sequences, Seq_1 and Seq_2, diverging from a common Ancestor at t = 0 along lineages X and Y up to time t; rate-matrix labels Rx, Rx, … are shown along the sites.)

Seq_1 ACGTGTCCATGATTA...
Seq_2 ACCTGCCCAAGATAA...

Notes: For computational reasons, it is convenient to assume that
• Sites in the sequence have evolved independently
• Sites in the sequence have evolved under the same model (Rx = Ry)

Rate Heterogeneity Across Sites
RNA-coding Genes

• Phylogenetic analyses require the users to make certain assumptions about the data before these are investigated in detail

Questions…

• Are these phylogenetic assumptions realistic?
• How can we assess whether the phylogenetic assumptions are met by the data?

Answer…

• Inspecting the data, before or after inferring the phylogeny, increases the chance of finding out what might have taken place during the evolution

Considering the IID Condition

• Visual inspection of the alignment might show whether some regions evolved faster

• Assume rate-heterogeneity across some sites, and then use phylogenetic methods that account for this

Hierarchical Likelihood-Ratio Test

Consider the following decision tree…

(Figure: decision tree of successive questions, each with Yes/No branches: Uniform nucleotide content? One conditional rate? Two or six conditional rates?)

Source: Posada & Crandall (1998) Bioinformatics 14, 817-818.


• The likelihood-ratio test can be used to identify a suitable substitution model for a given data set

Testing assumptions prior to modelling

• n × F(t): expected divergence matrix
• N(t): observed divergence matrix

Examples:

$$
N(0) = \begin{bmatrix}
294 & 0 & 0 & 0 \\
0 & 372 & 0 & 0 \\
0 & 0 & 829 & 0 \\
0 & 0 & 0 & 655
\end{bmatrix}
\qquad
N(t) = \begin{bmatrix}
244 & 31 & 8 & 11 \\
28 & 321 & 14 & 9 \\
11 & 13 & 801 & 4 \\
14 & 10 & 3 & 628
\end{bmatrix}
$$

Matched-pairs Tests of Symmetry

Seq 1 AGACTAGGTCTTGTATAGACTAATGTTCACAGTTTTTTAACTTTGTCAATGGA...
Seq 2 AGACGAGGTCGTGTATGGCCTCGTGAGCACGGGTTGTTCACTCCGCCAACGGT...

      A   C   G   T   Σ2
A     5   4   7   1   17
C     0   7   2   0    9
G     0   1   5   0    6
T     1   4   5   8   18
Σ1    6  16  19   9

Test of Symmetry

$$
X^2_{\mathrm{Bowker}} = \sum_{i<j} \frac{(x_{ij} - x_{ji})^2}{x_{ij} + x_{ji}}
$$

Test of Marginal Symmetry

$$
X^2_{\mathrm{Stuart}} = D V^{-1} D^{T}, \qquad D_i = x_{i\bullet} - x_{\bullet i}, \qquad V = \text{covariance matrix of } D
$$

Note: These test statistics are asymptotically χ2-distributed on ν degrees of freedom

Sources: Bowker (1948). JASA 43, 572-574; Stuart (1955). Biometrika 42, 412-416.
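Both statistics can be computed from the 4 × 4 table of paired counts shown on this slide. The sketch below is a plain-Python illustration under stated assumptions: the slide only describes V as the covariance matrix of D, so the code uses the usual estimate under the null hypothesis (V_ii = x_i• + x_•i − 2x_ii and V_ij = −(x_ij + x_ji)) and drops the last category because the differences D_i sum to zero; p-values are left out because they need a χ2 distribution function. Function names are hypothetical.

```python
def bowker_statistic(x):
    """Bowker's test of symmetry: returns (statistic, degrees of freedom)."""
    stat, df = 0.0, 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            total = x[i][j] + x[j][i]
            if total > 0:          # empty pairs carry no information
                stat += (x[i][j] - x[j][i]) ** 2 / total
                df += 1
    return stat, df

def solve(a, b):
    """Solve a y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (m[r][n] - sum(m[r][c] * y[c] for c in range(r + 1, n))) / m[r][r]
    return y

def stuart_statistic(x):
    """Stuart's test of marginal symmetry: returns (statistic, degrees of freedom)."""
    k = len(x) - 1                 # last category dropped (differences sum to 0)
    row = [sum(r) for r in x]
    col = [sum(r[j] for r in x) for j in range(len(x))]
    d = [row[i] - col[i] for i in range(k)]
    v = [[row[i] + col[i] - 2 * x[i][i] if i == j else -(x[i][j] + x[j][i])
          for j in range(k)] for i in range(k)]
    y = solve(v, d)                # y = V^{-1} d
    return sum(di * yi for di, yi in zip(d, y)), k

# Counts from the slide (rows and columns ordered A, C, G, T).
counts = [[5, 4, 7, 1],
          [0, 7, 2, 0],
          [0, 1, 5, 0],
          [1, 4, 5, 8]]

b_stat, b_df = bowker_statistic(counts)   # about 20.33 on 6 degrees of freedom
s_stat, s_df = stuart_statistic(counts)   # about 20.13 on 3 degrees of freedom
```

For this table most of the asymmetry is already visible in the margins (row totals 17, 9, 6, 18 against column totals 6, 16, 19, 9), which is why the two statistics are so close.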

Bacterial 16S rDNA Sequences

Ribosomal RNA from five bacteria was compared using the matched-pairs test of homogeneity

Probabilities   Thermotoga      Bacillus        Deinococcus     Thermus
Aquifex         1.32 × 10^-01   1.79 × 10^-11   3.45 × 10^-10   5.09 × 10^-01
Thermotoga                      2.64 × 10^-12   6.69 × 10^-12   4.15 × 10^-01
Bacillus                                        9.95 × 10^-01   3.64 × 10^-09
Deinococcus                                                     5.99 × 10^-11

Note: It is highly unlikely that these data have evolved under homogeneous conditions, implying that it would be unwise to use a time-reversible Markov model

Source: Ababneh et al. (2006). Bioinformatics 22, 1225-1231.

Phylogeny of Bacterial Ribosomal RNA

(Figure: two phylogenies of the bacterial ribosomal RNA, one inferred with Markov model: GTR and one with Markov model: General.)

Source: Ababneh et al. (2006). Bioinformatics 22, 1225-1231.

Take-home Message #4

• It is important to consider the substitution models carefully when using them in phylogenetic studies

Suggested Literature

RDM Page, EC Holmes (1998). Molecular Evolution.
• Chapter 5 (Sections 5.2, 5.3): important reading

W-H Li (1997). Molecular Evolution.
• Chapter 3 (pp. 59-78): contains descriptions that are better than those in Page & Holmes 1998; important reading

D Posada, KA Crandall (1998). MODELTEST: testing the model of DNA substitution. Bioinformatics 14, 817-818.
• useful reading

LS Jermiin et al. (2008). Phylogenetic model evaluation. Pp 331-363. In Bioinformatics - Volume I: Data, Sequences Analysis and Evolution (Ed. Keith J), Humana Press, Totowa, NJ.
• important reading
