Post on 19-Dec-2015
Combining genes in phylogeny and how to test phylogeny methods…
Tal Pupko
Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University
Multiple sequence alignment (vWF)
Rat      QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK
Rabbit   QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN
Gorilla  QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR
Cat      REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR
From sequences to a phylogenetic tree
[Figure: the first 13 columns of the vWF alignment (Rat, Rabbit, Gorilla, Cat), labelled VWF]
Multiple multiple sequence alignment
Rat      QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK
Rabbit   QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN
Gorilla  QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR
Cat      REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR
[Figure: this alignment block is repeated six times on the slide]
Murphy et al. (2001b): 19 nuclear genes + 3 mitochondrial genes (16,400 bp)
Phylogenetic studies are now based on the analysis of multiple genes
Consensus trees
Consensus tree
A consensus tree summarizes information common to two or more trees.
[Figure: three trees on the taxa a, b, c, d, e]
Strict consensus
Strict consensus includes only those groups that occur in all the trees being considered.
[Figure: three trees on the taxa a, b, c, d, e and their strict consensus]
Strict consensus
Problem: the split {ab} is found 2 out of 3 times, and this is not shown in the strict consensus.
[Figure: three trees on the taxa a, b, c, d, e and their strict consensus]
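The strict-consensus rule can be sketched in a few lines of Python. This is a minimal illustration, not the slide's figure: the three trees below are hypothetical, they are written as rooted nested tuples, and rooted clades stand in for splits. As in the problem described above, the group {a,b} occurs in 2 of the 3 trees and is therefore dropped.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree given as nested tuples of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):              # a leaf
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)                       # every internal node defines a clade
        return below

    found.discard(leaves(tree))                # drop the clade that contains every taxon
    return found


def strict_consensus(trees):
    """Keep only the clades that occur in all of the trees."""
    return set.intersection(*(clades(t) for t in trees))


# Hypothetical input trees (assumption: not the trees drawn on the slide).
trees = [
    (((("a", "b"), "c"), "d"), "e"),
    (((("a", "b"), "c"), "d"), "e"),
    ((("a", "c"), ("b", "d")), "e"),
]
consensus = strict_consensus(trees)
print(sorted("".join(sorted(c)) for c in consensus))
```

Only the clade {a,b,c,d} survives here, even though {a,b} and {a,b,c} are each found in two of the three trees.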
Majority-rule consensus
Majority-rule consensus: splits that are found in the majority of the trees are shown.
[Figure: three trees on the taxa a, b, c, d, e and their majority-rule consensus]
Majority-rule consensus
The percentage of trees supporting each split is indicated.
[Figure: three trees on the taxa a, b, c, d, e and their majority-rule consensus, with split support values of 100, 67 and 67]
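A majority-rule consensus with support percentages can be sketched as follows. The trees are hypothetical (nested tuples, rooted clades used in place of splits), chosen so that one group is found in all three trees and two groups in two of three; they are not claimed to be the trees on the slide.

```python
from collections import Counter


def clades(tree):
    """Return the non-trivial clades of a rooted tree given as nested tuples of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    found.discard(leaves(tree))                # ignore the clade of all taxa
    return found


def majority_rule(trees):
    """Map each clade found in more than half of the trees to its support in percent."""
    counts = Counter(c for t in trees for c in clades(t))
    return {
        c: round(100 * k / len(trees))
        for c, k in counts.items()
        if k > len(trees) / 2
    }


# Hypothetical input trees (assumption).
trees = [
    (((("a", "b"), "c"), "d"), "e"),
    (((("a", "b"), "c"), "d"), "e"),
    ((("a", "c"), ("b", "d")), "e"),
]
support = majority_rule(trees)
for clade, percent in sorted(support.items(), key=lambda item: -item[1]):
    print("".join(sorted(clade)), percent)
```

The groups {a,c} and {b,d}, each present in a single tree, fall below the 50% threshold and are left out.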
Problem with Majority-rule consensus
However, if we consider only {b,c,d}, then in both trees b is closer to c than either is to d.
[Figure: two trees on the taxa a, b, c, d, e; their majority-rule consensus equals their strict consensus]
Adams consensus
The Adams consensus gives the subtrees that are common to all trees. It is useful when one or more sequences have unclear positions but the relationships within a subset of sequences are common to all trees.
[Figure: two trees on the taxa a, b, c, d, e and their Adams consensus]
Networks
A network is sometimes used to represent a tree in which recombination occurred.
[Figure: a network on the taxa a, b, c, d, e]
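Maximum Likelihood
[Figure: three-leaf tree with leaves A, C and S, internal node X and branch lengths t1, t2 and t3]
$P(\mathrm{Data} \mid r) = \sum_{X \in \{AA\}} P(X)\, P_{XA}(t_1 r)\, P_{XC}(t_2 r)\, P_{XS}(t_3 r)$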
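The likelihood of one alignment column on a three-leaf tree can be sketched as below. The substitution model is an assumption made only for illustration: an equal-rates ("Poisson") model over the 20 amino acids with uniform frequencies, so `transition_prob`, the branch lengths and the rate `r` are hypothetical and not the model used in the talk.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = len(AMINO_ACIDS)


def transition_prob(x, y, t):
    """P_xy(t) under an equal-rates model; t is in expected replacements per site."""
    decay = math.exp(-K * t / (K - 1))
    if x == y:
        return 1 / K + (1 - 1 / K) * decay
    return 1 / K - decay / K


def site_likelihood(leaves, branch_lengths, r):
    """Sum over the unknown internal state X of P(X) times the three branch probabilities."""
    total = 0.0
    for x in AMINO_ACIDS:
        term = 1 / K                            # P(X): uniform prior on the internal state
        for leaf, t in zip(leaves, branch_lengths):
            term *= transition_prob(x, leaf, t * r)
        total += term
    return total


# Leaves A, C and S as on the slide; branch lengths and rate are made-up values.
like = site_likelihood(("A", "C", "S"), (0.1, 0.2, 0.3), r=1.0)
print(like)
```

Maximum likelihood estimation then amounts to choosing the branch lengths (and rate) that maximize the product of such site likelihoods over all alignment columns.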
Multiple genes analysis: concatenate analysis
e.g., Murphy et al. (2001)
Gene 1 + Gene 2 + Gene 3
Sp1: TCTGT…AACTCTTT…GAATCGTT…GCC
Sp2: TCTGC…GACTCGCT…GGAACGCT…CCC
Sp3: CTTAT…GATCTATT…GGAATATT…CGA
Sp4: CCTAT…GATCCATT…GGACCATT…CCA
[Figure: one tree on Sp1 to Sp4 and one evolutionary model]
Multiple genes analysis: concatenate analysis
e.g., Murphy et al. (2001)
Gene 1           Gene 2           Gene 3
Sp1: TCTGT…AAC   Sp1: TCTTT…GAA   Sp1: TCGTT…GCC
Sp2: TCTGC…GAC   Sp2: TCGCT…GGA   Sp2: ACGCT…CCC
Sp3: CTTAT…GAT   Sp3: CTATT…GGA   Sp3: ATATT…CGA
Sp4: CCTAT…GAT   Sp4: CCATT…GGA   Sp4: CCATT…CCA
[Figure: a tree on Sp1 to Sp4 and a box labelled "Evolutionary model" under each gene]
What are branch lengths?
Branch lengths correspond to evolutionary distance:
d = AA replacements/site = [AA replacements/(site*year)] * years = evolutionary rate * years
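As a small worked example of the relation d = rate * time, the sketch below multiplies a hypothetical replacement rate by a hypothetical divergence time. Both numbers are assumptions chosen for round arithmetic, not values from the talk.

```python
def branch_length(rate_per_site_per_year, years):
    """Evolutionary distance in replacements per site: rate multiplied by elapsed time."""
    return rate_per_site_per_year * years


# Assumed values: 1e-9 AA replacements per site per year, over 100 million years.
d = branch_length(1e-9, 100e6)
print(f"{d:.3f} replacements per site")
```

The same distance would result from a rate twice as high acting for half the time, which is why a branch length alone does not separate rate from time.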
Multiple genes analysis: separate analysis
e.g., Nikaido et al. (2001)
Gene 1           Gene 2           Gene 3
Sp1: TCTGT…AAC   Sp1: TCTTT…GAA   Sp1: TCGTT…GCC
Sp2: TCTGC…GAC   Sp2: TCGCT…GGA   Sp2: ACGCT…CCC
Sp3: CTTAT…GAT   Sp3: CTATT…GGA   Sp3: ATATT…CGA
Sp4: CCTAT…GAT   Sp4: CCATT…GGA   Sp4: CCATT…CCA
[Figure: under each gene, a tree on Sp1 to Sp4 and its own evolutionary model (Evolutionary model 1, 2 and 3)]
Multiple genes analysis: number of parameters
Number of species = n; number of genes = g; number of parameters in the model = m

                                  Concatenate analysis   Separate analysis
Number of parameters              m+(2n-3)               g*(m+(2n-3))
Example (n = 44, g = 22, m = 0)   85                     1870
Multiple genes analysis: number of parameters
Both an oversimplified model and over-parameterization may lead to wrong phylogenetic conclusions.
Multiple genes analysis: proportional analysis
Gene 1           Gene 2           Gene 3
Sp1: TCTGT…AAC   Sp1: TCTTT…GAA   Sp1: TCGTT…GCC
Sp2: TCTGC…GAC   Sp2: TCGCT…GGA   Sp2: ACGCT…CCC
Sp3: CTTAT…GAT   Sp3: CTATT…GGA   Sp3: ATATT…CGA
Sp4: CCTAT…GAT   Sp4: CCATT…GGA   Sp4: CCATT…CCA
[Figure: under each gene, a tree on Sp1 to Sp4 and its own evolutionary model (Evolutionary model 1, 2 and 3); the three trees are labelled Rate=1, Rate=0.5 and Rate=1.5]
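One way to read the proportional analysis is that all genes share one set of branch lengths, and each gene multiplies them by its own rate. The sketch below illustrates that reading with the three rates shown on the slide; the branch names, the shared lengths and the assignment of rates to particular genes are assumptions.

```python
# Shared branch lengths (hypothetical values, in replacements per site).
SHARED_BRANCH_LENGTHS = {"Sp1": 0.20, "Sp2": 0.10, "Sp3": 0.40, "Sp4": 0.30, "internal": 0.05}

# Rates from the slide; which gene gets which rate is an assumption.
GENE_RATES = {"gene_a": 1.0, "gene_b": 0.5, "gene_c": 1.5}


def gene_branch_lengths(shared, rate):
    """Scale every shared branch length by the gene-specific rate."""
    return {branch: length * rate for branch, length in shared.items()}


per_gene = {gene: gene_branch_lengths(SHARED_BRANCH_LENGTHS, rate)
            for gene, rate in GENE_RATES.items()}
for gene, lengths in per_gene.items():
    print(gene, lengths)
```

Under this scheme the ratio between any two branch lengths is the same in every gene, which is what keeps the parameter count far below that of the separate analysis.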
Multiple genes analysis: number of parameters
Number of species = n; number of genes = g; number of parameters in the model = m

                                  Concatenate analysis   Separate analysis   Proportional analysis
Number of parameters              m+(2n-3)               g*(m+(2n-3))        g-1+gm+(2n-3)
Example (n = 44, g = 22, m = 0)   85                     1870                106
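The three parameter counts can be checked with a short script. The function names are illustrative; the formulas are the ones given on the slide, with 2n-3 branches in an unrooted tree of n species.

```python
def concatenate_params(n, g, m):
    """One model and one set of branch lengths for all genes: m + (2n-3)."""
    return m + (2 * n - 3)


def separate_params(n, g, m):
    """A model and a full set of branch lengths per gene: g * (m + (2n-3))."""
    return g * (m + (2 * n - 3))


def proportional_params(n, g, m):
    """Shared branch lengths, g-1 relative gene rates, a model per gene: g-1 + g*m + (2n-3)."""
    return (g - 1) + g * m + (2 * n - 3)


n, g, m = 44, 22, 0   # the example on the slide
print(concatenate_params(n, g, m), separate_params(n, g, m), proportional_params(n, g, m))
```

With these inputs the proportional analysis adds only g-1 = 21 parameters to the concatenate analysis, while the separate analysis multiplies the count by g.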
Aims of our study
To compare 3 types of multiple-gene analysis:
• Concatenate analysis
• Separate analysis
• Proportional analysis
3 protein datasets:
• Mitochondrial dataset [56 species, 12 genes]
• Nuclear dataset (“short genes”, based on the Murphy dataset) [46 species, 6 genes]
• Nuclear dataset (“long genes”) [28 species, 4 genes]
Comparing topologies
Morphological topology (based on McKenna and Bell, 1997)
[Figure: tree with clades labelled Archonta, Glires, Ungulata, Carnivora, Insectivora and Xenarthra]
Taxa, in figure order: Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Tree shrew, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Horseshoe bat, Little red flying fox, Ryukyu flying fox, Mouse, Rat, Vole, Cane-rat, Guinea pig, Squirrel, Dormouse, Rabbit, Pika, Pig, Hippopotamus, Sheep, Cow, Alpaca, Blue whale, Fin whale, Sperm whale, Donkey, Horse, Indian rhino, White rhino, Elephant, Aardvark, Grey seal, Harbor seal, Dog, Cat, Asiatic shrew, Long-clawed shrew, Small Madagascar hedgehog, Hedgehog, Gymnure, Mole, Armadillo, Bandicoot, Wallaroo, Opossum, Platypus
Mitochondrial topology
[Figure: tree with clades labelled Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2 and Hedgehogs]
Taxa, in figure order: Donkey, Horse, Indian rhino, White rhino, Grey seal, Harbor seal, Dog, Cat, Blue whale, Fin whale, Sperm whale, Hippopotamus, Sheep, Cow, Alpaca, Pig, Little red flying fox, Ryukyu flying fox, Horseshoe bat, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Asiatic shrew, Long-clawed shrew, Mole, Small Madagascar hedgehog, Aardvark, Elephant, Armadillo, Rabbit, Pika, Tree shrew, Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Squirrel, Dormouse, Cane-rat, Guinea pig, Mouse, Rat, Vole, Hedgehog, Gymnure, Bandicoot, Wallaroo, Opossum, Platypus
Nuclear topology (Madsenl tree)
[Figure: tree with four numbered groups (1 to 4) and clades labelled Cetartiodactyla, Afrotheria, Chiroptera, Eulipotyphla, Glires, Xenarthra, Carnivora, Perissodactyla, Scandentia+Dermoptera, Pholidota and Primate]
Taxa, in figure order: Round Eared Bat, Flying Fox, Hedgehog, Mole, Pangolin, Whale, Hippo, Cow, Pig, Cat, Dog, Horse, Rhino, Rat, Capybara, Rabbit, Flying Lemur, Tree Shrew, Human, Galago, Sloth, Hyrax, Dugong, Elephant, Aardvark, Elephant Shrew, Opossum, Kangaroo
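Comparing different models using the Akaike Information Criterion (AIC)
$\mathrm{AIC} = -2\log L + 2P$
where L is the maximized likelihood and P is the number of free parameters.
A model which minimizes the AIC is considered to be the most appropriate model.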
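The AIC comparison can be sketched as below, using the log-likelihood and df values reported for the mitochondrial dataset. The pairing of each df value with a log-likelihood and with an analysis name is an inference: it is the pairing that reproduces the AIC values on the slide and matches the parameter-count formulas, and should be read as an assumption.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 log L + 2P."""
    return -2.0 * log_likelihood + 2.0 * n_params


# (Ln(L), df) per analysis, mitochondrial dataset; the pairing is inferred (assumption).
analyses = {
    "concatenate": (-91188.71, 121),
    "separate": (-89921.78, 1320),
    "proportional": (-90999.30, 132),
}

scores = {name: aic(lnl, df) for name, (lnl, df) in analyses.items()}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name:13s} AIC = {score:.2f}")
print("lowest AIC:", best)
```

The separate analysis has the highest likelihood, but its 1320 parameters cost more in the penalty term than the likelihood gain is worth, so it does not have the lowest AIC.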
Results: the best multiple gene analysis
(Mitochondrial tree, N-Gamma rate model)

         Concatenate analysis   Separate analysis   Proportional analysis
df       121                    1320                132
Ln(L)    -91188.71              -89921.78           -90999.30
AIC      182619.42              182483.55           182262.60

The proportional analysis is the best for the mitochondrial dataset
Results: the best multiple gene method
(Murphy dataset, Madsenl tree, N-Gamma rate model)

         Concatenate analysis   Separate analysis   Proportional analysis
df       95                     540                 100
Ln(L)    -11618.67              -11192.12           -11543.87
AIC      23427.33               23464.23            23287.74

The Proportional analysis is the best for the Nuclear dataset (“Short genes”)
Results: the best multiple gene method
(Madsen dataset, Murphyl tree, N-Gamma rate model)

         Concatenate analysis   Separate analysis   Proportional analysis
df       57                     216                 60
Ln(L)    -31519.10              -31153.28           -31406.81
AIC      63152.21               62738.56            62933.63

The Separate analysis is the best for the Nuclear dataset (“Long genes”)
Conclusion: the best multiple gene method
1- The concatenate model is always the worst way to analyze multiple genes.
2- The choice between the separate analysis and the proportional analysis depends on the data considered:
The proportional model is better suited to short genes, the separate model to longer sequences
Results: mammalian phylogeny
The morphological tree is always rejected (Kishino-Hasegawa test, P < 0.05):
• whatever the model used
• whatever the dataset
Results: mammalian phylogeny
• The mitochondrial tree is the best tree for the mitochondrial dataset. But we cannot reject the nuclear tree.
• The nuclear tree is the best for the nuclear datasets, and we can reject the mitochondrial tree.
Conclusion (Topology): It seems that the nuclear tree is the best tree among the 3 alternative trees.
Modelling site rate variation
Homogeneous model:
$F(t+x) = F(t)\,P(x)$
Gamma model:
$F(t+x) = \sum_{n=1}^{c} \frac{1}{c}\,F(t)\,P(x R_n)$
The gamma distribution:
[Figure: gamma density f(r); x-axis: substitution rates (R); y-axis: site proportions]
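Likelihoods with rate variation
[Figure: three-leaf tree with leaves A, C and G and branch lengths d1, d2 and d3, shown once for the continuous case and once for the discrete case]
Continuous:
$p(D) = \int_r \sum_X p(X)\, P_{XG}(d_1 r)\, P_{XA}(d_2 r)\, P_{XC}(d_3 r)\, f(r)\, dr$
Discrete:
$p(D) = \sum_{r_i} \sum_X p(X)\, P_{XG}(d_1 r_i)\, P_{XA}(d_2 r_i)\, P_{XC}(d_3 r_i)\, p(r_i)$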
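The discrete form of the likelihood can be sketched as a weighted sum over rate categories. The sketch assumes the Jukes-Cantor (JC69) nucleotide model and a hand-picked set of three equally weighted rates; a real discrete-gamma analysis would derive the category rates from the gamma shape parameter, which is not done here.

```python
import math

NUCLEOTIDES = "ACGT"


def jc69(x, y, d):
    """JC69 transition probability for a branch of length d (substitutions per site)."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def site_likelihood(leaves, branch_lengths, rates, weights):
    """p(D): sum over rate categories r_i and internal states X, weighted by p(r_i) and p(X)."""
    total = 0.0
    for r, w in zip(rates, weights):
        for x in NUCLEOTIDES:
            term = 0.25                        # p(X): uniform over the four nucleotides
            for leaf, d in zip(leaves, branch_lengths):
                term *= jc69(x, leaf, d * r)
            total += w * term
    return total


leaves = ("G", "A", "C")                       # leaves paired with d1, d2, d3
branch_lengths = (0.1, 0.2, 0.3)               # hypothetical values
rates = (0.2, 1.0, 1.8)                        # hypothetical categories with mean rate 1
weights = (1 / 3, 1 / 3, 1 / 3)

with_variation = site_likelihood(leaves, branch_lengths, rates, weights)
homogeneous = site_likelihood(leaves, branch_lengths, (1.0,), (1.0,))
print(with_variation, homogeneous)
```

With a single category of rate 1 the function reduces to the homogeneous model, so the two cases share one implementation.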
Results: the best site-rate variation model
Mitochondrial data set (Mitochondrial tree, proportional analysis)

         Homogeneous model   1-Gamma model   N-Gamma model
df       120                 121             132
Ln(L)    -98998.68           -91094.30       -90999.30
AIC      198237.37           182430.61       182262.60
Conclusion: the best site-rate variation model
The N-Gamma model is always the best site-rate variation model.
Combining Multiple Genes: collaborations
• Dorothee Huchon (Florida State University)
• Masami Hasegawa (Institute of Statistical Mathematics)
• Norihiro Okada (Tokyo Institute of Technology)
• Ying Cao (ISM)
Known phylogenies
The best way to test different methods of phylogenetic reconstruction is on trees that are known to be true from other sources…
Problem: known phylogenies are very rare.
Known phylogenies: laboratory animals and crop plants (and even those are often suspect). Also, their evolutionary rate is very low…
Known phylogenies
David Hillis and colleagues have created “experimental” phylogenies in the lab.
Known phylogenies
They used bacteriophage T7 and subdivided cultures of it in the presence of a mutagen. They then sequenced a marker gene from the final cultures and gave the sequences as input to a few phylogenetic methods. The output of the tree-building methods was compared to the true tree.
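Comparing an inferred tree to a known tree can be done by counting the groups found in one tree but not in the other. The sketch below is only an illustration of that idea: the trees are hypothetical, rooted clades are used in place of splits, and it is not claimed to be the comparison procedure used in the T7 experiment.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree given as nested tuples of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    found.discard(leaves(tree))                # ignore the clade of all taxa
    return found


def clade_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees (0 means identical topology)."""
    return len(clades(tree_a) ^ clades(tree_b))


# Hypothetical trees (assumption).
true_tree = ((("a", "b"), "c"), ("d", "e"))
inferred_ok = (("d", "e"), ("c", ("b", "a")))      # same topology, written differently
inferred_wrong = ((("a", "c"), "b"), ("d", "e"))   # a and c wrongly grouped

print(clade_distance(true_tree, inferred_ok), clade_distance(true_tree, inferred_wrong))
```

A distance of 0 corresponds to the outcome reported for the experiment, where the methods recovered the true tree.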
Known phylogenies
In fact, they used the restriction-site method to infer the phylogeny, using MP, NJ, UPGMA and others.
All methods reconstructed the true tree.
Known phylogenies
They also compared outputs of ancestral sequence reconstruction, using MP.
97.3% of the ancestral states were correctly reconstructed.
Encouraging!
Known phylogenies
Criticism: (1) The true tree was very easy to infer, because it was well balanced, and all nodes were accompanied by numerous changes.
(2) The mutations by a single mutagen do not reflect reality.
Thank You…