Models of Molecular Evolution II
Level 3 Molecular Evolution and Bioinformatics
Jim Provan
Page and Holmes: Sections 7.3 – 7.4
Isochore structure of vertebrate genomes
- Why do patterns of base composition – the frequencies of the four bases and of the codons used to specify amino acids – differ between genomes?
- Mean G + C content in bacteria ranges from 25% to 75%, but there is little intragenome variation
- Genomes of vertebrates have a much greater range of G + C values:
  - Caused by continuous sections (> 300 kb), each of which has a uniform G + C content (isochores)
  - G + C content of isochores also varies between species
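A minimal sketch of how regional G + C content could be profiled along a sequence. The function names, the toy sequence and the 20-base window are illustrative assumptions only; real isochores span more than 300 kb, so real windows would be far larger.

```python
def gc_content(seq):
    """Fraction of G and C among the A, C, G, T bases of a DNA string."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def windowed_gc(seq, window, step=None):
    """G + C fraction in successive windows along a sequence."""
    step = step or window
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


# Hypothetical toy sequence: a GC-rich stretch followed by an AT-rich stretch.
toy = "GCGCGGCC" * 5 + "ATATTAAT" * 5
profile = windowed_gc(toy, window=20)
print(profile)
```

A profile that stays flat over long runs and then shifts abruptly is the kind of pattern the isochore description refers to.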
Properties of vertebrate isochores
G + C rich isochores
- Correlate with reverse Giemsa (R) bands
- Early replicating
- High density of genes
- SINEs present
- CpG islands in genes
- High G + C content at third codon position
- High frequency of retroviral sequences
- High frequency of chiasmata
A + T rich isochores
- Correlate with Giemsa (G) bands
- Late replicating
- Low density of genes (only tissue specific)
- LINEs present
- No CpG islands
- High A + T content at third codon position
- Low frequency of retroviral sequences
- Low frequency of chiasmata
Theories on the existence of isochores
- Selectionist hypothesis of Bernardi et al. suggests that the GC-rich isochores found predominantly in warm-blooded vertebrates are an adaptation to higher body temperature:
  - Extra hydrogen bond in the G-C pair may lessen the possibility of thermal damage to DNA
  - Desert plants also have higher GC contents
- Evidence for independent occurrence of isochores, since birds and mammals do not share an immediate ancestor
- However, some thermophilic bacteria are AT-rich
Theories on the existence of isochores (continued)
- Neutralist explanation for the existence of isochores is that they simply reflect variation in the process of mutation across the genome
- Studies on argininosuccinate synthetase processed pseudogenes from anthropoid primates:
  - Pseudogenes were derived from the same functional ancestral gene but then inserted into different parts of the genome
  - Despite their common ancestry, they now differ in base composition
  - Because pseudogenes are not subject to selection, differences in base composition must have been due to regional variation in mutation patterns
Why should mutation patterns vary across genomes?
- Replication hypothesis suggests that genes which replicate earlier in the cell cycle are more GC-rich than those which replicate later:
  - Believed to be because the G and C precursor pools of dNTPs are larger at this time, so errors are more likely to incorporate G or C
- Repair hypothesis is based on the assumption that the efficiency of DNA repair varies across the genome:
  - May be an outcome of transcriptionally active areas being repaired more efficiently
  - CpG islands are maintained by a special repair system – efficiency of DNA replication may be dependent on location
Why should mutation patterns vary across genomes? (continued)
- Recombination hypothesis claims that the isochore structure of vertebrate genomes is the outcome of differences in the pattern and frequency of recombination:
  - Low GC localities will be associated with regions of reduced recombination:
    - Genes with low rates of recombination have low GC values
    - The large, non-recombining region of the Y chromosome has a low GC composition
  - The fact that recombination plays such a large part in the structuring of eukaryote genomes makes this an attractive hypothesis
- Although the relative contributions of these hypotheses are still unclear, the neutralist interpretation seems more likely
Codon usage
[Figure: bar charts of codon usage (axis scale 0 to 60) in E. coli and human for the six arginine (ARG) codons CGA, CGC, CGG, CGU, AGA, AGG and the six leucine (LEU) codons CUA, CUC, CUG, CUU, UUA, UUG]
What determines codon usage?
- Degeneracy of the genetic code:
  - Null hypothesis is that all codons for a particular amino acid are used with equal frequency
  - Refuted when nucleotide sequences became available for a wide range of organisms
- Selectionist argument:
  - Highly expressed genes show the most codon bias because they require more translational efficiency: coevolution of tRNAs and codons
  - Also supports the neutralist prediction of a relationship between functional constraint and substitution rate
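One common way to compare observed codon usage with the equal-frequency null hypothesis is relative synonymous codon usage (RSCU). This measure is not named on the slides and is brought in here purely as an illustration. The counts below are hypothetical, not real data.

```python
def rscu(codon_counts):
    """Relative synonymous codon usage for the codons of one amino acid.

    codon_counts maps each synonymous codon to its observed count.
    A value of 1.0 means the codon is used exactly as often as expected
    if all synonymous codons were used with equal frequency.
    """
    total = sum(codon_counts.values())
    if total == 0:
        raise ValueError("no codons observed for this amino acid")
    expected = total / len(codon_counts)
    return {codon: count / expected for codon, count in codon_counts.items()}


# Hypothetical counts for the six leucine codons (illustration only).
leu_counts = {"CUA": 5, "CUC": 10, "CUG": 60, "CUU": 10, "UUA": 5, "UUG": 10}
leu_rscu = rscu(leu_counts)
for codon, value in sorted(leu_rscu.items()):
    print(codon, round(value, 2))
```

Under the null hypothesis every value would be 1.0; values well above 1.0 mark preferred codons.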
Gene expression and codon bias
Highly expressed genes
- Strong selection for translational efficiency
- Restricted tRNAs used
- Strong codon bias
- Low rate of synonymous substitution (few neutral mutations)
Lowly expressed genes
- Weak selection for translational efficiency
- More tRNAs used
- Weak codon bias
- High rate of synonymous substitution (many neutral mutations)
The molecular clock
- Idea of a molecular clock is central to the neutralist theory, since it demonstrates the constancy of the underlying neutral mutation rate
- Previous example of -globin
- Does not imply that all genes and proteins evolve at the same rate:
  - Great variation between proteins (fibrinopeptides vs. histones)
  - Variation in rate among genes and proteins is compatible with the neutral theory if the underlying cause is changes in selective constraint
  - Key question concerning the validity of a molecular clock is whether rates of substitution are constant within genes across evolutionary time
Neutral theory and the molecular clock
- Rate of nucleotide substitution (fixation) at any site per year, $k$, in a diploid population of size $2N$ is equal to the number of new mutations (neutral, deleterious or advantageous) arising per year, $2N\mu$ (where $\mu$ is the mutation rate), multiplied by their probability of fixation, $u$:
  $k = 2N\mu u$
- For a neutral mutation, probability of fixation is the reciprocal of population size:
  $u = 1/(2N)$
- So substitution rate for a neutral mutation is:
  $k = (2N\mu)(1/(2N))$
Neutral theory and the molecular clock (continued)
- Parameters for population size ($2N$) cancel out, leaving:
  $k = \mu$
- One of the most important formulae in molecular evolution: the rate of substitution of neutral mutations depends only on the underlying mutation rate and is independent of other factors such as population size
- Also holds for mutants with a very weak selective advantage, e.g. $s < 1/(2N_e)$
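The cancellation of population size can be checked numerically. This is a small sketch in which `mu` (the mutation rate per site per year) and the population sizes are hypothetical values chosen for illustration.

```python
def neutral_substitution_rate(n_diploid, mu):
    """k = (2N * mu) * (1 / (2N)) for neutral mutations."""
    new_mutations = 2 * n_diploid * mu   # new mutations arising per site per year
    p_fixation = 1 / (2 * n_diploid)     # fixation probability of a neutral mutant
    return new_mutations * p_fixation


mu = 1e-9  # hypothetical mutation rate per site per year
for n in (1_000, 100_000, 10_000_000):
    # The printed rate stays at mu (up to floating-point rounding) for every N.
    print(n, neutral_substitution_rate(n, mu))
```

Larger populations produce more new mutations, but each has a proportionally smaller chance of fixation, so the two effects cancel.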
Substitution of selectively advantageous mutations
- Probability of fixation is roughly twice the selection coefficient:
  $u = 2sN_e/N$
- Substituting this into $k = 2N\mu u$, we get:
  $k = 4N_e s\mu$
- In this case, substitution rate for an advantageous mutation also depends on population size and magnitude of selective advantage
- For natural selection to produce a molecular clock, it is necessary for $N_e$, $s$ and $\mu$ (a combination of ecological, mutational and selective events) to be the same across evolutionary time – highly unlikely!
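For contrast with the neutral case, the following sketch evaluates the rate for advantageous mutations, taking the fixation probability as 2 s N_e / N and writing mu for the mutation rate. All parameter values are hypothetical.

```python
def advantageous_substitution_rate(n_census, n_effective, s, mu):
    """k = (2N * mu) * (2 * s * Ne / N), which simplifies to 4 * Ne * s * mu."""
    new_mutations = 2 * n_census * mu             # new mutations per site per year
    p_fixation = 2 * s * n_effective / n_census   # approximate fixation probability
    return new_mutations * p_fixation


mu = 1e-9   # hypothetical mutation rate per site per year
s = 0.01    # hypothetical selection coefficient
for n_e in (1_000, 10_000):
    k = advantageous_substitution_rate(n_census=20_000, n_effective=n_e, s=s, mu=mu)
    print(n_e, k)
```

Unlike the neutral case, the rate here scales with effective population size and with the selection coefficient, which is why a selection-driven clock would need both to stay constant through time.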
Constancy of the molecular clock
- Neutral theory predicted a molecular clock and the first protein sequence data appeared to confirm this: led Kimura to cite this as the best evidence for neutrality
- As more comparative sequence data became available, particularly from mammals, examples of rate variation began to appear
- Debate arose concerning the constancy of the molecular clock
Testing the molecular clock
- Dispersion index $R(t)$: tests whether there is more rate variation between lineages than expected under a Poisson process:
  - If the data fit a Poisson process, the variance in the number of substitutions between lineages should be no greater than the mean number, so $R(t)$ is the ratio of the variance to the mean
  - If the data fit a Poisson process then $R(t) = 1.0$; if not, then $R(t) > 1.0$ and the clock is said to be overdispersed
- A star phylogeny should be used, since any phylogenetic structure will complicate the calculations (e.g. placental mammals)
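A minimal sketch of the dispersion index as a plain variance-to-mean ratio. It assumes a star phylogeny, uses the ordinary sample variance, and runs on hypothetical substitution counts; estimators used in published analyses may apply corrections that are not shown here.

```python
from statistics import mean, variance


def dispersion_index(substitution_counts):
    """R(t): sample variance of substitution counts divided by their mean."""
    average = mean(substitution_counts)
    if average == 0:
        raise ValueError("mean number of substitutions is zero")
    return variance(substitution_counts) / average


# Hypothetical numbers of substitutions along six lineages of a star phylogeny.
counts = [12, 9, 15, 30, 8, 10]
r_t = dispersion_index(counts)
print(round(r_t, 2))
```

A value close to 1.0 is what a Poisson clock predicts; a value well above 1.0 would be described as overdispersed.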
Testing the molecular clock (continued)
- Mammalian protein data presented a serious problem for neutralists
- Problems most likely due to inaccuracies in phylogenies:
  - "Outlier" in data was guinea pig
  - Guinea pig is much more divergent than previously thought

Protein          Species (n)   Amino acids   R(t)
Haemoglobin α    6             141           1.17
Haemoglobin β    6             146           3.04
Myoglobin        6             153           1.60
Cytochrome c     4             104           3.22
Ribonuclease     4             123           2.15
-Crystallin      6             175           2.71
The relative rate test
- The relative rate test compares the numbers of substitutions in two closely related taxa relative to a third, more distantly related outgroup
- If A and B have evolved according to a molecular clock, both should be equidistant from C:
  $d_{AC} = d_{BC}$
- A and B must be closest relatives and C must not be too far removed
[Figure: tree of taxa A and B with outgroup C, and a node labelled X]
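The comparison of the distance from A to C with the distance from B to C can be sketched on three aligned sequences. The sequences below are hypothetical, and the uncorrected proportion of differing sites stands in for the distance measure, which is a simplification; real analyses would normally use distances corrected for multiple substitutions.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


def relative_rate_difference(seq_a, seq_b, seq_c):
    """d_AC - d_BC, expected to be about zero under a molecular clock."""
    return p_distance(seq_a, seq_c) - p_distance(seq_b, seq_c)


# Hypothetical aligned sequences: A and B are sister taxa, C is the outgroup.
a = "ACGTACGTAC"
b = "ACGTACGTTC"
c = "ACGAACGTAG"
print(relative_rate_difference(a, b, c))
```

A negative value means B has accumulated more differences from the outgroup than A; whether that departure matters depends on its sampling error.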
The relative rate test (continued)
[Figure: tree with 1 = Old World monkey, 2 = Human, 3 = New World monkey]

Region                                    Length (bp)   $d_{12}$   $d_{13} - d_{23}$
Synonymous sites in nine nuclear genes    3520          6.7        2.3 ± 0.6
-globin pseudogene                        1827          7.9        1.5 ± 0.4
Three introns                             3376          6.9        1.0 ± 0.5
Two flanking regions                      936           7.9        3.1 ± 1.1
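One rough way to read the differences quoted on this slide is to divide each by its "±" term. The sketch below does only that arithmetic and assumes each "±" value is a standard error, which the slide does not state.

```python
# Values of d13 - d23 and their "±" terms as quoted on the slide.
# Assumption: each "±" value is a standard error.
differences = {
    "synonymous sites, nine nuclear genes": (2.3, 0.6),
    "globin pseudogene": (1.5, 0.4),
    "three introns": (1.0, 0.5),
    "two flanking regions": (3.1, 1.1),
}


def ratio_to_error(difference, error):
    """Size of a rate difference measured in units of its error term."""
    return difference / error


ratios = {region: ratio_to_error(d, se) for region, (d, se) in differences.items()}
for region, ratio in ratios.items():
    print(f"{region}: {ratio:.2f}")
```

If the assumption holds, ratios of about 2 or more suggest that the difference between the Old World monkey and human lineages is unlikely to be zero.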
Lineage effects and the molecular clock
- Substitution rate varies with underlying neutral mutation rate: $k = \mu$
- Three ways for rates to vary between species:
  - Differences in generation time
  - Differences in metabolic rate
  - Differences in efficiency of DNA repair
- These are known as lineage effects: neutralists believe that lineage effects alone can account for all variation in the molecular clock
- Selectionists believe that genes also show rate variation due to other, selection-driven factors (residue effects)
Generation time and the molecular clock
[Figure: only the axis label "Time" is recoverable]
Generation time and the molecular clock (continued)
- At the molecular level, generation time ($g$) can be defined as the time it takes for germ-line DNA to replicate, i.e. from one gamete to the next
- Since most mutations occur at this point, rate of substitution under neutral theory is a function of both mutation rate and generation time:
  $k = \mu/g$
- General conclusion from molecular data is that the clock is generation time dependent at silent sites and in non-coding DNA:
  - Silent rates in orang-utan, gorilla and chimp are 1.3-, 2.2- and 1.2-fold faster than in humans, which matches differences in generation times
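The generation-time effect can be illustrated with a short sketch of the rate per year as the mutation rate per generation divided by the generation time. The mutation rate and generation times used here are hypothetical.

```python
def substitution_rate_per_year(mu_per_generation, generation_time_years):
    """Neutral substitution rate per site per year: mutation rate divided by g."""
    if generation_time_years <= 0:
        raise ValueError("generation time must be positive")
    return mu_per_generation / generation_time_years


mu = 2e-8  # hypothetical mutation rate per site per generation
for g in (0.5, 5, 20):
    print(g, substitution_rate_per_year(mu, g))
```

With the same per-generation mutation rate, a lineage with a fortyfold shorter generation time accumulates neutral substitutions forty times faster per year.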
The metabolic rate hypothesis
- In sharks, rate of silent change is five- to sevenfold lower than in primates and ungulates, which have similar generation times:
  - Led to the hypothesis that differences in metabolic rate are a better explanation for differences in mutation rates than differences in generation time (metabolic rate hypothesis)
  - States that organisms with high metabolic rates have higher levels of DNA synthesis
- Two pieces of mitochondrial DNA evidence support this:
  - Small-bodied animals, which have higher metabolic rates, tend to have higher mutation rates
  - Warm-blooded animals also have higher mutation rates than cold-blooded animals
Relationship between body mass and sequence evolution
[Figure: % sequence divergence per Myr (axis 0.1 to 10) plotted against body mass in kg (axis 0.01 to 100,000) for rodents, geese, dogs, primates, horses, bears, whales, newts, frogs, tortoises, salmon, sea turtles and sharks]
DNA repair and mutation
[Figure: flow diagram with the labels DNA, Direct damage, Replication errors, Repair, Correctly repaired, Incorrectly repaired and Mutation]
DNA repair and mutation (continued)
- Repair mechanisms are extremely complex and there are many repair pathways
- There is some evidence supporting the hypothesis that DNA repair influences mutation rate:
  - Evidence that highly transcribed genes are more efficiently repaired
  - Base composition and substitution rates at silent sites in mammalian genes tend to be gene- rather than species-specific: suggests that homologous genes are transcribed and repaired in a similar manner
- Conversely, closely related species such as hominoid primates, which share very similar repair mechanisms, can exhibit greatly differing substitution rates