Substitution Models and the Phylogenetic Assumptions
Vivek Jayaswal, Lars S. Jermiin
COMMONWEALTH OF AUSTRALIA
Copyright Regulation
WARNING
This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act.
Do not remove this notice.
Why do we need substitution models?
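Site        1  2  3  4  5  …  N
Human       A  T  C  G  A  …  C
Chimp       A  G  C  A  A  …  C
Gorilla     …
Orangutan   …

[Figure: rooted tree. The root (state j) has two children, the internal nodes I1 (state u) and I2 (state v). I1 is the parent of Gorilla and Orangutan; I2 is the parent of Chimp and Human.]

\[
L_i = \sum_{j=1}^{4} \sum_{u=1}^{4} \sum_{v=1}^{4} f_j \, P(u \mid j) \, P(v \mid j) \, P(O_1 \mid u) \, P(O_2 \mid u) \, P(O_3 \mid v) \, P(O_4 \mid v)
\]

Edge labels in the tree:
• Root to I1: P(u|j)
• Root to I2: P(v|j)
• I1 to Gorilla: P(O1|u)
• I1 to Orangutan: P(O2|u)
• I2 to Chimp: P(O3|v)
• I2 to Human: P(O4|v)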
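The triple sum over the root state j and the internal states u and v can be evaluated directly for a single site. The sketch below is only an illustration: it assumes the Jukes-Cantor model with rate `alpha`, the same edge length `t` on every edge, and uniform root frequencies, none of which are specified on the slide. The function names are hypothetical.

```python
import math

NUC = "ACGT"

def jc69_p(t, alpha=1.0):
    """Jukes-Cantor transition probabilities P(t) as a 4x4 nested list."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def site_likelihood(obs, t, freqs=(0.25, 0.25, 0.25, 0.25)):
    """Likelihood of one site on the tree ((Gorilla, Orangutan), (Chimp, Human)).

    obs holds the observed nucleotides O1..O4 in the order
    Gorilla, Orangutan, Chimp, Human. P[j][u] stands for P(u | j).
    """
    P = jc69_p(t)
    o1, o2, o3, o4 = (NUC.index(x) for x in obs)
    total = 0.0
    for j in range(4):              # state at the root
        for u in range(4):          # state at I1
            for v in range(4):      # state at I2
                total += (freqs[j] * P[j][u] * P[j][v]
                          * P[u][o1] * P[u][o2] * P[v][o3] * P[v][o4])
    return total

# Hypothetical site pattern, chosen only for illustration.
lik = site_likelihood("AAGG", t=0.1)
```

Summing `site_likelihood` over all 256 possible site patterns gives 1, which is a quick check that the sum has been coded correctly.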
Topics
• how corrections for multiple substitutions are made
• some of the phylogenetic assumptions
• how we may evaluate phylogenetic assumptions
• an example involving bacterial DNA
Substitutions at a Single Site

[Figure: six scenarios for one site in two lineages descending from an ancestor with nucleotide A]
• Single substitution: A→G in one lineage
• Multiple substitution: A→G followed by G→T in one lineage
• Coincidental substitution: A→G in one lineage and A→T in the other
• Parallel substitution: A→G in both lineages
• Convergent substitution: A→G in one lineage; A→T followed by T→G in the other
• Back substitution: A→G followed by G→A in one lineage

Note: Every substitution overwrites the evidence of a past state, which leads to an erosion of the historical signal.
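Because hidden substitutions erase earlier states, the observed proportion of differing sites underestimates the number of substitutions that took place. As a hedged illustration of a correction for multiple substitutions, the sketch below applies the standard Jukes-Cantor distance formula, d = -(3/4) ln(1 - 4p/3). Choosing the Jukes-Cantor model here is an assumption made for simplicity, and the function names are hypothetical.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)

def jc69_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to exist")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Short fragments used purely as an example.
p = p_distance("ACGTGTCCATGATTA", "ACCTGCCCAAGATAA")
d = jc69_distance(p)
```

The corrected distance `d` is never smaller than `p`, and the gap widens as the sequences diverge.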
Modeling Nucleotide Substitutions

Consider the evolution at a given site in terms of conditional rates-of-change from nucleotide i to nucleotide j:

\[
R = \begin{bmatrix}
-\sum_{j \neq A} \alpha_{Aj} & \alpha_{AC} & \alpha_{AG} & \alpha_{AT} \\
\alpha_{CA} & -\sum_{j \neq C} \alpha_{Cj} & \alpha_{CG} & \alpha_{CT} \\
\alpha_{GA} & \alpha_{GC} & -\sum_{j \neq G} \alpha_{Gj} & \alpha_{GT} \\
\alpha_{TA} & \alpha_{TC} & \alpha_{TG} & -\sum_{j \neq T} \alpha_{Tj}
\end{bmatrix}
\]

[Figure: diagram of the four nucleotides A, T, C and G]

Note: Here $\alpha_{ij}$ is the conditional rate-of-change from nucleotide i to nucleotide j in R, the rate matrix, and R is the most general Markov model for DNA.
Modelling a Site in one Sequence

Consider a Markov process, X, that results in nucleotide i being converted to nucleotide j over time t:

\[
P_{ij}(t) = P\left[ X(t) = j \mid X(0) = i \right]
\]

[Figure: a single lineage X running from the ancestor in state i at t = 0 to state j at time t]

If the rates of change are constant, then $P(t) = e^{Rt}$. In matrix notation, this is

\[
P(t) = I + Rt + \frac{(Rt)^2}{2!} + \frac{(Rt)^3}{3!} + \cdots = \sum_{k=0}^{\infty} \frac{(Rt)^k}{k!}
\]
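The power series for P(t) can be summed numerically. The sketch below truncates the series after a fixed number of terms, which is adequate for illustration when the entries of Rt are small; production software generally uses more robust matrix-exponential routines. The function names and the choice of a Jukes-Cantor rate matrix with α = 0.1 are assumptions made for this example.

```python
import math

def mat_mul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(R, t, terms=60):
    """Approximate P(t) = exp(Rt) by summing (Rt)^k / k! for k < terms."""
    n = len(R)
    Rt = [[R[i][j] * t for j in range(n)] for i in range(n)]
    term = [[float(i == j) for j in range(n)] for i in range(n)]  # (Rt)^0 = I
    P = [row[:] for row in term]
    for k in range(1, terms):
        term = mat_mul(term, Rt)
        term = [[x / k for x in row] for row in term]  # now (Rt)^k / k!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P

alpha = 0.1
R_jc = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
P = transition_matrix(R_jc, t=2.0)

# Closed-form Jukes-Cantor probability of no net change, for comparison.
expected_same = 0.25 + 0.75 * math.exp(-4 * alpha * 2.0)
```

Each row of `P` sums to 1, and for this rate matrix the diagonal entries agree with the closed-form value in `expected_same`.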
Modelling a Site in two Sequences

Consider two Markov processes, X and Y, and the following scenario:

[Figure: two lineages, X and Y, diverging from a common ancestor at t = 0 and observed at time t]

\[
F_{ij}(t) = P\left[ X(t) = i,\; Y(t) = j \mid X(0) = Y(0) \right]
\]

In matrix notation, this is

\[
F(t) = P_X(t)^{T} \, F(0) \, P_Y(t)
\]
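The matrix product for F(t) can be written out entry by entry. The sketch below assumes that F(0) is the diagonal matrix of ancestral nucleotide frequencies and, purely for illustration, that both lineages follow the Jukes-Cantor model with the same rate and time. The function names and parameter values are hypothetical.

```python
import math

def jc69_p(t, alpha):
    """Jukes-Cantor transition probabilities P(t) as a 4x4 nested list."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def joint_distribution(f0, Px, Py):
    """F(t) = Px(t)^T F(0) Py(t), with F(0) = diag(f0).

    Entry (i, j) sums, over ancestral states k, the probability of k
    times P(k -> i) along X times P(k -> j) along Y.
    """
    n = len(f0)
    return [[sum(Px[k][i] * f0[k] * Py[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

f0 = [0.25, 0.25, 0.25, 0.25]          # hypothetical ancestral frequencies
F = joint_distribution(f0, jc69_p(0.5, 0.1), jc69_p(0.5, 0.1))

# Expected counts for an alignment of n = 1000 sites, i.e. n x F(t).
expected_counts = [[1000 * x for x in row] for row in F]
```

The entries of `F` sum to 1, and at t = 0 the function returns the diagonal matrix F(0) itself.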
Take-home Message #1
• The substitution model, R, is an integral part of the transition function; it is used to estimate the probability of the present states, given R and t
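Rate matrix revisited

\[
R_i = \begin{bmatrix}
-\sum_{j \neq A} r_{Aj} & r_{AC} & r_{AG} & r_{AT} \\
r_{CA} & -\sum_{j \neq C} r_{Cj} & r_{CG} & r_{CT} \\
r_{GA} & r_{GC} & -\sum_{j \neq G} r_{Gj} & r_{GT} \\
r_{TA} & r_{TC} & r_{TG} & -\sum_{j \neq T} r_{Tj}
\end{bmatrix}
\]

• Typically simplified forms of this rate matrix are used
• These matrices belong to the GTR family of models and can be represented as

\[
R = \begin{bmatrix}
- & r_{AC} & r_{AG} & r_{AT} \\
r_{AC} & - & r_{CG} & r_{CT} \\
r_{AG} & r_{CG} & - & r_{GT} \\
r_{AT} & r_{CT} & r_{GT} & -
\end{bmatrix}
\begin{bmatrix}
\pi_A & 0 & 0 & 0 \\
0 & \pi_C & 0 & 0 \\
0 & 0 & \pi_G & 0 \\
0 & 0 & 0 & \pi_T
\end{bmatrix}
\]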
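A GTR rate matrix can be assembled from six symmetric exchangeabilities and four nucleotide frequencies. The sketch below assumes the usual convention that each off-diagonal entry is r_ij × π_j and that each diagonal entry is set so its row sums to zero. The parameter values and the function name are hypothetical.

```python
NUC = "ACGT"

def gtr_rate_matrix(exch, pi):
    """Build R with R[i][j] = r_ij * pi_j for i != j and rows summing to zero.

    exch maps unordered pairs such as "AC" to the exchangeability r_AC;
    pi lists the frequencies of A, C, G, T.
    """
    R = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                pair = "".join(sorted(NUC[i] + NUC[j]))
                R[i][j] = exch[pair] * pi[j]
        R[i][i] = -sum(R[i])        # diagonal is still 0.0 when the sum is taken
    return R

# Hypothetical parameter values, chosen only for illustration.
exch = {"AC": 1.0, "AG": 4.0, "AT": 0.5, "CG": 0.8, "CT": 5.0, "GT": 1.0}
pi = [0.3, 0.2, 0.2, 0.3]
R = gtr_rate_matrix(exch, pi)

# pi is stationary: (pi R)_j = sum_i pi_i R_ij should vanish for every j.
flux = [sum(pi[i] * R[i][j] for i in range(4)) for j in range(4)]
```

With symmetric exchangeabilities, every entry of `flux` is zero up to rounding, so π is the stationary distribution of R.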
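Commonly-used Markov Models

Jukes & Cantor (1969)

\[
R = \begin{bmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{bmatrix}
\]

Assumptions
1. One rate (α)
2. Uniform nucleotide content

Kimura (1980)

\[
R = \begin{bmatrix}
-(2\beta + \alpha) & \beta & \alpha & \beta \\
\beta & -(2\beta + \alpha) & \beta & \alpha \\
\alpha & \beta & -(2\beta + \alpha) & \beta \\
\beta & \alpha & \beta & -(2\beta + \alpha)
\end{bmatrix}
\]

Assumptions
1. Two rates (α and β)
2. Uniform nucleotide content

Hasegawa, Kishino & Yano (1985)

\[
R = \begin{bmatrix}
-(\beta\pi_Y + \alpha\pi_G) & \beta\pi_C & \alpha\pi_G & \beta\pi_T \\
\beta\pi_A & -(\beta\pi_R + \alpha\pi_T) & \beta\pi_G & \alpha\pi_T \\
\alpha\pi_A & \beta\pi_C & -(\beta\pi_Y + \alpha\pi_A) & \beta\pi_T \\
\beta\pi_A & \alpha\pi_C & \beta\pi_G & -(\beta\pi_R + \alpha\pi_C)
\end{bmatrix}
\]

where $\pi_R = \pi_A + \pi_G$ (purines) and $\pi_Y = \pi_C + \pi_T$ (pyrimidines).

Assumptions
1. Two rates (α and β)
2. Non-uniform nucleotide content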
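The three models are nested, which a small sketch can make concrete. The code below builds the Hasegawa, Kishino & Yano matrix for nucleotides ordered A, C, G, T and shows that uniform frequencies recover the Kimura pattern and that equal rates with uniform frequencies recover the Jukes & Cantor pattern. The function name and parameter values are hypothetical, and with uniform frequencies the rates come out scaled by 1/4 relative to the α and β written in the simpler matrices.

```python
def hky85(alpha, beta, pi):
    """HKY85 rate matrix for nucleotides ordered A, C, G, T.

    alpha scales transitions (A<->G, C<->T), beta scales transversions,
    and pi holds the frequencies of A, C, G, T.
    """
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}
    R = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                rate = alpha if (i, j) in transitions else beta
                R[i][j] = rate * pi[j]
        R[i][i] = -sum(R[i])        # rows of a rate matrix sum to zero
    return R

uniform = [0.25] * 4
R_k80_like = hky85(alpha=2.0, beta=1.0, pi=uniform)   # two rates, uniform content
R_jc_like = hky85(alpha=1.0, beta=1.0, pi=uniform)    # one rate, uniform content
R_hky = hky85(alpha=2.0, beta=1.0, pi=[0.3, 0.2, 0.2, 0.3])
```

In `R_k80_like` the first row has the Kimura pattern (transversion, transition, transversion), while in `R_jc_like` all off-diagonal entries are equal.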
The Phylogenetic Assumptions
• Given an alignment of nucleotides, phylogenetic methods commonly assume that the sites have evolved under
  • stationary and reversible conditions
  • homogeneous conditions
  • independent and identical conditions
Phylogenetic Assumptions

Consider the following scenario...

[Figure: an ancestral sequence (0) diverges into two descendant sequences (1 and 2). The figure is labelled with the quantities below.]

\[
F_0 = [\, f_A \;\; f_C \;\; f_G \;\; f_T \,], \qquad
F_1 = [\, f_A \;\; f_C \;\; f_G \;\; f_T \,], \qquad
F_2 = [\, f_A \;\; f_C \;\; f_G \;\; f_T \,]
\]

\[
\Pi_1 = [\, \pi_A \;\; \pi_C \;\; \pi_G \;\; \pi_T \,], \qquad
\Pi_2 = [\, \pi_A \;\; \pi_C \;\; \pi_G \;\; \pi_T \,]
\]

For k = 1, 2:

\[
R_k = \begin{bmatrix}
-\sum_{j \neq A} \alpha_{Aj} & \alpha_{AC} & \alpha_{AG} & \alpha_{AT} \\
\alpha_{CA} & -\sum_{j \neq C} \alpha_{Cj} & \alpha_{CG} & \alpha_{CT} \\
\alpha_{GA} & \alpha_{GC} & -\sum_{j \neq G} \alpha_{Gj} & \alpha_{GT} \\
\alpha_{TA} & \alpha_{TC} & \alpha_{TG} & -\sum_{j \neq T} \alpha_{Tj}
\end{bmatrix}
\]
The Stationary Condition

[Figure: the same scenario with ancestor 0 and descendants 1 and 2, labelled with $F_0$, $\Pi_1$ and $\Pi_2$]

\[
F_0 = [\, f_A \;\; f_C \;\; f_G \;\; f_T \,], \qquad
\Pi_1 = [\, \pi_A \;\; \pi_C \;\; \pi_G \;\; \pi_T \,], \qquad
\Pi_2 = [\, \pi_A \;\; \pi_C \;\; \pi_G \;\; \pi_T \,]
\]

Note: The stationary condition is met if $F_0 = \Pi_1 = \Pi_2$, implying that the marginal distributions of $R_1$ and $R_2$ are the same, even though $R_1$ and $R_2$ may differ!
The Reversible Condition

[Figure: the same scenario with ancestor 0 and descendants 1 and 2, labelled with $R_1$, $\Pi_1$, $R_2$ and $\Pi_2$]

Notes:
If the process is stationary, then $\Pi_1 = \Pi_2$.
Moreover, if $\pi_i R_{ij} = \pi_j R_{ji}$ for all i and j, then the process is reversible.
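Both conditions can be checked numerically for a given rate matrix and frequency vector. The sketch below tests stationarity through the standard condition πR = 0 and reversibility through π_i R_ij = π_j R_ji. The function names are hypothetical, and the cyclic process is an invented example of a rate matrix that is stationary under uniform frequencies but not reversible.

```python
def is_stationary(R, pi, tol=1e-12):
    """True if pi R = 0, i.e. pi is a stationary distribution of R."""
    n = len(R)
    return all(abs(sum(pi[i] * R[i][j] for i in range(n))) < tol
               for j in range(n))

def is_reversible(R, pi, tol=1e-12):
    """True if pi_i R_ij = pi_j R_ji for all i and j (detailed balance)."""
    n = len(R)
    return all(abs(pi[i] * R[i][j] - pi[j] * R[j][i]) < tol
               for i in range(n) for j in range(i + 1, n))

pi = [0.25] * 4
alpha = 0.1
R_jc = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

# Hypothetical cyclic process A -> C -> G -> T -> A at unit rate.
R_cyclic = [[-1, 1, 0, 0],
            [0, -1, 1, 0],
            [0, 0, -1, 1],
            [1, 0, 0, -1]]
```

`R_jc` passes both checks, whereas `R_cyclic` passes only the stationarity check, which shows that stationarity does not imply reversibility.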
The Homogeneous Condition

[Figure: the same scenario with ancestor 0 and descendants 1 and 2, labelled with $R_1$ and $R_2$]

Notes:
The homogeneous condition is met for the Markov processes, $R_1$ and $R_2$, if $R_1 = R_2$.
If the homogeneous condition is met, then $\Pi_1 = \Pi_2$; however, non-stationary, and therefore non-reversible, conditions may still prevail (i.e., if $F_0 \neq \Pi_1 = \Pi_2$).
IID Condition

[Figure: lineages X and Y diverging from a common ancestor at t = 0 and leading to the two sequences below at time t; the sites are labelled $R_x$, $R_x$, …]

Seq_1 ACGTGTCCATGATTA...
Seq_2 ACCTGCCCAAGATAA...

Notes:
For computational reasons, it is convenient to assume that:
• Sites in the sequence have evolved independently
• Sites in the sequence have evolved under the same model ($R_x = R_y$)
Rate Heterogeneity Across Sites: RNA-coding Genes
Take-home Message #2
• Phylogenetic analyses require the users to make certain assumptions about the data before these are investigated in detail
Questions...
• Are these phylogenetic assumptions realistic?
• How can we assess whether the phylogenetic assumptions are met by the data?
Answer...
• Inspecting the data, before or after inferring the phylogeny, increases the chance of finding out what might have taken place during the evolution
Considering the IID Condition
• Visual inspection of the alignment might show whether some regions evolved faster
• Assume rate-heterogeneity across some sites, and then use phylogenetic methods that account for this
Hierarchical Likelihood-Ratio Test

Consider the following decision tree...

[Figure: decision tree in which each question branches into Yes and No: Uniform nucleotide content? → One conditional rate? → Two or six conditional rates?]

Source: Posada & Crandall (1998) Bioinformatics 14, 817-818.
Take-home Message #3
• The likelihood-ratio test can be used to identify a suitable substitution model for a given data set
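A single step of such a comparison can be sketched as follows. The code assumes two nested models fitted to the same alignment, uses the standard statistic 2(ln L_alt - ln L_null), and compares it with a χ² distribution whose degrees of freedom equal the number of extra free parameters. The log-likelihood values are hypothetical and the helper names are not taken from any particular program.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2:                      # odd df: start from df = 1
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:                           # even df: start from df = 2
        a, q = 1.0, math.exp(-y)
    while 2 * a < df:               # Q(a + 1, y) = Q(a, y) + y^a e^-y / Gamma(a + 1)
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1.0
    return q

def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return (statistic, p-value) for nested models fitted to the same data."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, extra_params)

# Hypothetical log-likelihoods, not taken from any real analysis:
# a one-rate model (null) against a two-rate model (one extra parameter).
stat, p = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-995.0, extra_params=1)
```

A small p-value suggests that the extra parameter improves the fit by more than chance alone would explain, so the richer model would be retained at that step of the decision tree.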
Testing assumptions prior to modelling
• n × F(t): expected divergence matrix
• N(t): observed divergence matrix

Examples:

\[
N(0) = \begin{bmatrix}
294 & 0 & 0 & 0 \\
0 & 372 & 0 & 0 \\
0 & 0 & 829 & 0 \\
0 & 0 & 0 & 655
\end{bmatrix},
\qquad
N(t) = \begin{bmatrix}
244 & 31 & 8 & 11 \\
28 & 321 & 14 & 9 \\
11 & 13 & 801 & 4 \\
14 & 10 & 3 & 628
\end{bmatrix}
\]
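An observed divergence matrix is simply a table of site-pattern counts for a pair of aligned sequences. The sketch below assumes that rows index the nucleotide in the first sequence and columns the nucleotide in the second, both ordered A, C, G, T, and that sites with gaps or ambiguity codes are skipped. The function name is hypothetical, and the input is the pair of short sequence fragments shown on the IID slide.

```python
NUC = "ACGT"

def divergence_matrix(seq1, seq2):
    """Count site patterns: N[i][j] = sites with nucleotide i in seq1 and j in seq2."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    N = [[0] * 4 for _ in range(4)]
    for a, b in zip(seq1, seq2):
        if a in NUC and b in NUC:   # skip gaps and ambiguity codes
            N[NUC.index(a)][NUC.index(b)] += 1
    return N

# Leading fragments of Seq_1 and Seq_2 as shown on the IID slide.
N = divergence_matrix("ACGTGTCCATGATTA", "ACCTGCCCAAGATAA")

# Comparing a sequence with itself gives a diagonal matrix, like N(0).
N0 = divergence_matrix("ACGTGTCCATGATTA", "ACGTGTCCATGATTA")
```

The off-diagonal entries of `N` count the sites at which the two sequences differ, and the row sums of `N` equal the diagonal of `N0`.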
Matched-pairs Tests of Symmetry

Seq 1 AGACTAGGTCTTGTATAGACTAATGTTCACAGTTTTTTAACTTTGTCAATGGA...
Seq 2 AGACGAGGTCGTGTATGGCCTCGTGAGCACGGGTTGTTCACTCCGCCAACGGT...

        A    C    G    T    Σ2
A       5    4    7    1    17
C       0    7    2    0     9
G       0    1    5    0     6
T       1    4    5    8    18
Σ1      6   16   19    9

Test of Symmetry

\[
X^2_{\text{Bowker}} = \sum_{i<j} \frac{(x_{ij} - x_{ji})^2}{x_{ij} + x_{ji}}
\]

Test of Marginal Symmetry

\[
X^2_{\text{Stuart}} = D V^{-1} D^{T}, \qquad D_i = x_{i\bullet} - x_{\bullet i}, \qquad V = \text{covariance matrix of } D
\]

Note: These test statistics are asymptotically χ²-distributed on ν degrees of freedom.

Sources: Bowker (1948). JASA 43, 572-574; Stuart (1955). Biometrika 42, 412-416.
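Bowker's statistic can be computed directly from the 4 × 4 table of counts on this slide. The sketch below assumes that the degrees of freedom are the number of off-diagonal pairs with at least one observation, which is a common convention but is not stated on the slide. The helper names are hypothetical, and Stuart's test of marginal symmetry is not covered here.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2:                      # odd df: start from df = 1
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:                           # even df: start from df = 2
        a, q = 1.0, math.exp(-y)
    while 2 * a < df:               # Q(a + 1, y) = Q(a, y) + y^a e^-y / Gamma(a + 1)
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1.0
    return q

def bowker_test(N):
    """Bowker's matched-pairs test of symmetry on a square table of counts."""
    stat, df = 0.0, 0
    n = len(N)
    for i in range(n):
        for j in range(i + 1, n):
            total = N[i][j] + N[j][i]
            if total > 0:           # pairs with no observations carry no information
                stat += (N[i][j] - N[j][i]) ** 2 / total
                df += 1
    return stat, df, chi2_sf(stat, df)

# Counts from the slide's 4 x 4 table (rows and columns ordered A, C, G, T).
table = [[5, 4, 7, 1],
         [0, 7, 2, 0],
         [0, 1, 5, 0],
         [1, 4, 5, 8]]
stat, df, p = bowker_test(table)
```

A small p-value indicates that the table is unlikely to be symmetric, which would argue against the two sequences having evolved under the same stationary, reversible process.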
Bacterial 16S rDNA Sequences

Ribosomal RNA from five bacteria was compared using the matched-pairs test of homogeneity

Probabilities   Thermotoga      Bacillus        Deinococcus     Thermus
Aquifex         1.32 × 10^-01   1.79 × 10^-11   3.45 × 10^-10   5.09 × 10^-01
Thermotoga                      2.64 × 10^-12   6.69 × 10^-12   4.15 × 10^-01
Bacillus                                        9.95 × 10^-01   3.64 × 10^-09
Deinococcus                                                     5.99 × 10^-11

Note: It is highly unlikely that these data have evolved under homogeneous conditions, implying that it would be unwise to use a time-reversible Markov model.

Source: Ababneh et al. (2006). Bioinformatics 22, 1225-1231.
Phylogeny of Bacterial Ribosomal RNA

[Figure: two phylogenies of the bacterial sequences, one inferred with Markov model: GTR and one with Markov model: General]

Source: Ababneh et al. (2006). Bioinformatics 22, 1225-1231.
Take-home Message #4
• It is important to consider the substitution models carefully when using them in phylogenetic studies
Suggested Literature
• RDM Page, EC Holmes (1998). Molecular Evolution. Chapter 5 (Sections 5.2, 5.3). Important reading.
• W-H Li (1997). Molecular Evolution. Chapter 3 (pp. 59-78). Contains descriptions that are better than those in Page & Holmes 1998. Important reading.
• D Posada, KA Crandall (1998). MODELTEST: testing the model of DNA substitution. Bioinformatics 14, 817-818. Useful reading.
• LS Jermiin et al. (2008). Phylogenetic model evaluation. Pp 331-363. In Bioinformatics - Volume I: Data, Sequences Analysis and Evolution (Ed. Keith J), Humana Press, Totowa, NJ. Important reading.