Genetics to Genomics (From Basics to Buzzwords)
Source: cobamide2.bio.pitt.edu/core/overheads.pdf
• Genetics : Understanding the role of heritable material in
shaping organismal phenotypes
• Genomes are more than collections of genes:
• Chromosomes and episomes
• Gene clusters within chromosomes
• Genes and associated control elements
• Complex Exon/Intron organization
• Functional domains organized within coding regions
• Functional domains positioned outside coding regions
The fundamental task of genomics is understanding what information is important, and what is not.
• Genomes result from the accumulation of changes over time (evolution).
• Therefore, an understanding of genomes must have a basis in the understanding of how constituent domains, genes, gene clusters and chromosomes evolve
• This leads to an understanding of patterns of information within and between genes and genomes.
A History of Genomic Data
• Richard Roblin’s Ph.D. thesis in 1967 was the determination of the identity of a single nucleotide (1 base is not a sequence); it was the 5’ end of bacteriophage R17, a 3 kb RNA phage; that base was a guanosine (pppGp…)
• In 1970, the 12-bp cohesive ends of bacteriophage lambda were determined by Ray Wu
• In 1977, two methods were introduced for rapid DNA sequencing (both won their proponents Nobel Prizes):
• The Maxam-Gilbert chemical degradation method
• The Sanger primer extension method
• In 1977, the 5,386 base sequence of E. coli bacteriophage φX-174 was published
• In 1983, the 48,502 base pair sequence of bacteriophage λ was published
• In 1995 the 1,830,137 base pair sequence of the free-living bacterium Haemophilus influenzae was published
• In late 1996, the 12,052,000 base pair sequence of the yeast Saccharomyces cerevisiae was published
• In late 1998, the 97,000,000 bp sequence of the nematode Caenorhabditis elegans was published
• In 2000, a draft of the 3,000,000,000 base pair human genome was completed
• By early 2003, the genomes of nearly 100 species of Bacteria and Archaea, and 10 species of eukaryotes, were completely sequenced.
Mendel and Darwin: More than two dead white guys?
• What was “Blending Inheritance”?
• How did they view mutations?
• What was the influence of Aristotle?
• What was the influence of Malthus?
• What was the influence of Geologists?
• Natural Variation : The result of genetic experiments played
out over long periods of time
• Similarity & Difference : Provide clues to relative importance:
the results of (more or less) an infinite number of genetic
experiments
Pillars of molecular evolution (How we make our models)
Empirical Data
• Direct experimentation in laboratory environments
• Direct experimentation in natural environments
• Observation of natural variation within species
• Observation of differences between species
Integrative Analysis
• Mathematical Modeling
• Cogitation
• Extrapolation and integration
Classification of similarity
Criteria for Classification
• By what criteria are features similar?
• By what criteria are features different?
• What processes lead to similarity?
• What processes lead to differences?
Types of Similarity
• Homology : Identity by Descent
• Orthology : Encoded functions are identical
• Paralogy : Encoded functions are different
• Convergence : Identity by State
• Chance : Identity by State
Methods for Assessing Molecular Similarity
• DNA-DNA Hybridization
• Isozyme analysis and MLEE
• Library overlap (SAB)
• RFLP Analysis
• DNA sequence divergence
Measuring Mutation Rates
Mutation rates
• Luria-Delbrück Fluctuation tests
• Targets used in laboratory experiments
• Phage resistance
• Antibiotic resistance
• lacZ
• Lessons
• Mutations occur almost at random
• Probability matrix is organism-dependent
• There are context effects
Substitution rates
• A mutation is a lesion
• A substitution is a variant allele observed in nature
• Not all mutations become substitutions
Fate of Mutations
A. Mutations originate at particular frequencies
i Variable exposure to mutagens
ii Context for polymerase error
iii Likelihood of replication slippage for frameshift
B. However, lesions are repaired at different frequencies
i Different mismatch repair systems
ii Transcription coupled repair
C. Mutation not repaired, but has lethal effects
D. Mutation is disadvantageous and eventually lost
i Though not lost, mutation is infrequent in the population
ii Mutation is frequent in the population
E. Mutation becomes ubiquitous in the population (fixation)
i For a neutral mutation, P = 1/(pN), where p is the ploidy (p = 2 for diploids)
ii Average time to fixation is T = 2pN generations
Random Genetic Drift
What is the probability that a variant allele becomes fixed?
• $P = (1 - e^{-4 N_e s q}) / (1 - e^{-4 N_e s})$
• Consider the approximation $e^{-x} \approx 1 - x$ when x is small
• Consider a newly arisen allele; in a diploid population its frequency is $q = 1/2N$
• $P = (1 - e^{-2 N_e s / N}) / (1 - e^{-4 N_e s})$
• s = 0 for a neutral mutation
• Therefore, in the limit s → 0, P = 1/2N; this should be intuitive
• If $N_e = N$, then $P = (1 - e^{-2s}) / (1 - e^{-4Ns})$
• If s is small, then $P = 2s / (1 - e^{-4Ns})$
• For s > 0 and N large, P = 2s
• Neutral alleles go to fixation in t = 2pN generations
• For s ≠ 0, alleles fix in $t = (2/|s|) \ln(2N)$ generations
Definitions
q = Initial frequency of variant allele
s = selection coefficient
$N_e$ = Effective population size
N = Actual population size
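The fixation formula above can be explored numerically. The following is a minimal Python sketch, not part of the original slides; the function name `fixation_probability` and its defaults (Ne = N, q = 1/2N) are assumptions made for illustration, and s > 0 is taken to favour the variant, as in the "P = 2s" bullet.

```python
import math

def fixation_probability(s, N, Ne=None, q=None):
    """Fixation probability P = (1 - e^(-4*Ne*s*q)) / (1 - e^(-4*Ne*s)).

    s  : selection coefficient (s > 0 favours the variant in this sketch)
    N  : actual population size (diploid)
    Ne : effective population size (assumed equal to N if omitted)
    q  : initial frequency (assumed 1/(2N), a newly arisen allele, if omitted)
    """
    if Ne is None:
        Ne = N
    if q is None:
        q = 1.0 / (2 * N)
    if s == 0:
        # Neutral limit of the formula: P equals the initial frequency.
        return q
    # (1 - e^-a) / (1 - e^-b) == expm1(-a) / expm1(-b); expm1 stays
    # accurate when the exponents are very small.
    return math.expm1(-4 * Ne * s * q) / math.expm1(-4 * Ne * s)

neutral = fixation_probability(0.0, N=1000)          # 1/2N
beneficial = fixation_probability(0.01, N=10**6)     # close to 2s
```

For large N the beneficial case approaches 2s, while the neutral case returns 1/2N, matching the two limiting bullets.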
Effectively Neutral Mutations
Are mutations always either beneficial or detrimental?
As we saw earlier, that depends on what phenotype one is examining
Even more insidious, that depends on population size and population structure
In small populations, it takes a mighty big change in fitness (either positive or negative) to counteract the stochastic process of genetic drift. “Detrimental” mutations can sweep a population even if they confer a disadvantageous phenotype.
In larger populations, these same mutations could be eliminated quickly, since genetic drift has a smaller impact.
This interaction between population size and the effect of a mutation delineates a zone of fitness effects termed “effectively neutral,” whereby the fitness impact is not statistically different from zero. This is a function of the population size.
This is why it is difficult to proclaim “conserved” sequences as important and nonconserved sequences as unimportant. “Conservation” (that is, the elimination of deleterious mutations from the population) is a function of population size.
It is also a function of population structure (subdivision, migration, etc.) and sexual exchange (obligate or infrequent)
Selectionism vs Neutralism
What is the significance of natural variation?
Selectionism : Selectionism argues that most variants have adaptive value,
and variation is maintained through a variety of mechanisms
• Selection/mutation balance
• Heterosis
• Frequency dependence
• Spatial and temporal heterogeneity
Neutralism : Neutralism argues that most variation is effectively neutral, and
reflects primarily genetic drift
The Poisson Distribution
• Predicts the distribution of occurrences in a discrete
classification system
• Derived from the binomial distribution and equal probability
of state
• $P_\mu(x) = \mu^x e^{-\mu} / x!$
• So, for µ = 1, $P(x) = 1/(e \cdot x!)$
• So, the probability of zero occurrences is ~37%
• The probability of only 1 occurrence is ~37%
• The probability of 2 or more occurrences is ~26%
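The percentages quoted for µ = 1 can be checked directly. This is a small illustrative Python sketch; `poisson_pmf` is a name chosen here, not something from the slides.

```python
import math

def poisson_pmf(x, mu):
    """Poisson probability of exactly x occurrences when the mean is mu."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

mu = 1.0
p_zero = poisson_pmf(0, mu)             # probability of zero occurrences
p_one = poisson_pmf(1, mu)              # probability of exactly one
p_two_or_more = 1.0 - p_zero - p_one    # everything else
```

With µ = 1 the first two values are both 1/e, so the remainder is 1 - 2/e.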
When did they diverge?
ACTGTAGGAATCGC
 *  * *
AATGAAAGAATCGC
If the probability of mutation is $10^{-9}$ / bp / generation, how many generations have these two sequences been diverging?
Naïve answer
• Let p be the probability of a mutation arising
• This can be (and has been) measured in the laboratory
• p = $10^{-9}$ / bp / generation
• For 14 bp, p = $1.4 \times 10^{-8}$ / generation
• Therefore 1 mutation arises - on average - every $7.14 \times 10^{7}$ generations
• Therefore 3 mutations arise - on average - in $2.14 \times 10^{8}$ generations
What is missing here?
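The naïve arithmetic can be reproduced in a few lines of Python. This sketch only restates the calculation on the slide (it does not address what the naïve answer leaves out); the variable and function names are illustrative.

```python
seq1 = "ACTGTAGGAATCGC"
seq2 = "AATGAAAGAATCGC"

def count_differences(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

mu = 1e-9                                 # mutations / bp / generation (from the slide)
per_sequence_rate = mu * len(seq1)        # mutations / generation over 14 bp
diffs = count_differences(seq1, seq2)     # observed differences
naive_generations = diffs / per_sequence_rate
```

Here `naive_generations` is simply the observed number of differences divided by the per-sequence mutation rate.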
When did they diverge - Part II
First, many substitutions go unnoticed
[Figure: six panels illustrating Single Substitution, Multiple Substitutions, Coincidental Substitutions, Parallel Substitutions, Convergent Substitution and Back Substitution]
Only the “Single Substitution” leads to differences that accurately reflect the number of mutational events
Jukes and Cantor Model
• Probability of any base changing to another base during time t is set to be α
• Probability of a base being equal to its original state at time T = t is $P_1 = 1 - 3\alpha$
• At time T = 2t, the probability of the original state is: $P_2 = (1 - 3\alpha)P_1 + \alpha(1 - P_1)$
• This can be formulated as a first-order differential equation: $dP_t/dt = -4\alpha P_t + \alpha$
• This can be solved as $P_t = \tfrac{1}{4} + (P_0 - \tfrac{1}{4})e^{-4\alpha t}$
• Since $P_0 = 1$, $P_t = \tfrac{1}{4} + \tfrac{3}{4}e^{-4\alpha t}$
• If $P_0 = 0$, $P_t = \tfrac{1}{4} - \tfrac{1}{4}e^{-4\alpha t}$
• Notice that both equations converge to 1/4 at equilibrium
• Under the Jukes-Cantor model, all bases have the same frequency and interchange with equal likelihood
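As a rough numerical check of the Jukes-Cantor solution, the step-by-step recursion and the closed form can be compared. This is a sketch under the slide's assumptions; the function names and the value of α are chosen here for illustration only.

```python
import math

def jc_closed_form(t, alpha, p0=1.0):
    """Closed-form solution: P_t = 1/4 + (P_0 - 1/4) * exp(-4*alpha*t)."""
    return 0.25 + (p0 - 0.25) * math.exp(-4 * alpha * t)

def jc_iterate(steps, alpha, p0=1.0):
    """Apply P_next = (1 - 3*alpha)*P + alpha*(1 - P) once per time step."""
    p = p0
    for _ in range(steps):
        p = (1 - 3 * alpha) * p + alpha * (1 - p)
    return p

alpha = 1e-4                                   # hypothetical per-step rate
stepwise = jc_iterate(1000, alpha)             # discrete recursion
closed = jc_closed_form(1000, alpha)           # continuous approximation
from_same = jc_closed_form(10**6, alpha, 1.0)  # long time, started as the base
from_other = jc_closed_form(10**6, alpha, 0.0) # long time, started as another base
```

For small α the two approaches agree closely, and both starting conditions end at 1/4.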
Convergence of the Jukes & Cantor Model
[Figure: probability of having an 'A' (y-axis, 0.00 to 1.00) plotted against time in million years (x-axis, 0 to 200)]
Kimura’s Two-parameter Model
• Separates transition probability from transversion probability
• A transition substitution occurs with probability α
• A transversion substitution occurs with probability β
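• Probability of identity over time is calculated as: $X_t = \tfrac{1}{4} + \tfrac{1}{4}e^{-4\beta t} + \tfrac{1}{2}e^{-2(\alpha+\beta)t}$
• Probability of difference by transition is $Y_t = \tfrac{1}{4} + \tfrac{1}{4}e^{-4\beta t} - \tfrac{1}{2}e^{-2(\alpha+\beta)t}$
• Probability of difference by each transversion is $Z_t = \tfrac{1}{4} - \tfrac{1}{4}e^{-4\beta t}$
• Note that $X_t + Y_t + 2Z_t = 1$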
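The three Kimura probabilities can be evaluated together to confirm that they account for all four possible states. A minimal Python sketch; `k2p_probabilities` and the example rates are assumptions made for illustration.

```python
import math

def k2p_probabilities(t, alpha, beta):
    """Return (X_t, Y_t, Z_t): identity, transition, and one transversion."""
    e_tv = math.exp(-4 * beta * t)
    e_mix = math.exp(-2 * (alpha + beta) * t)
    x = 0.25 + 0.25 * e_tv + 0.5 * e_mix
    y = 0.25 + 0.25 * e_tv - 0.5 * e_mix
    z = 0.25 - 0.25 * e_tv
    return x, y, z

# Hypothetical rates with transitions more frequent than transversions.
x, y, z = k2p_probabilities(t=100, alpha=2e-3, beta=5e-4)
total = x + y + 2 * z   # one identity, one transition, two transversions
```

At t = 0 the function returns (1, 0, 0), and for any t the total is 1.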
Justification for the Kimura Model
Relative substitution rates in mammalian pseudogenes
Mutant        Original Nucleotide
Nucleotide    A              T              C              G
A             -              4.4 +/- 1.1    6.5 +/- 1.1    20.1 +/- 2.2
T             4.7 +/- 1.3    -              21.0 +/- 2.1   7.2 +/- 1.1
C             5.0 +/- 0.7    8.2 +/- 1.3    -              5.3 +/- 1.0
G             9.4 +/- 1.3    3.3 +/- 1.2    4.2 +/- 0.5    -
• Notice that transition rates are higher than individual
transversion rates
• Notice also that the rates of substitution are not symmetrical
Correcting for multiple substitutions
• Let’s start with the Jukes & Cantor one-parameter model
• Probability of identity for sequence in TWO lineages is $P_i = \tfrac{1}{4} + \tfrac{3}{4}e^{-8\alpha t}$
• Probability of difference is $P_D = 1 - P_i$, so $P_D = \tfrac{3}{4}(1 - e^{-8\alpha t})$, or $8\alpha t = -\ln(1 - \tfrac{4}{3}P_D)$
• Since t is unknown, we cannot estimate α. Instead, we compute K, the number of substitutions per site
• For 2 lineages, $K = 2(3\alpha t)$
• So, $K = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}P)$, where P is the proportion of differing nucleotides per site
• For sequence of length L, the sampling variance is $V(K) = P(1-P) / [L(1 - \tfrac{4}{3}P)^2]$
• For the Kimura model, let P be the proportion of sites differing by transitions and Q be the proportion of sites differing by transversions
• $K = \tfrac{1}{2}\ln(a) + \tfrac{1}{4}\ln(b)$, where $a = 1/(1-2P-Q)$ and $b = 1/(1-2Q)$
• $V(K) = [a^2 P + c^2 Q - (aP + cQ)^2]/L$, where $c = (a+b)/2$
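The two distance corrections can be written as short functions. This Python sketch follows the formulas above; the function names are illustrative, and the example input (3 differences in 14 sites) is borrowed from the earlier divergence slide.

```python
import math

def jc_distance(p):
    """Jukes-Cantor K = -(3/4) ln(1 - 4p/3); p = proportion of differing sites."""
    return -0.75 * math.log(1 - 4 * p / 3)

def jc_variance(p, length):
    """Sampling variance V(K) = p(1 - p) / [L (1 - 4p/3)^2]."""
    return p * (1 - p) / (length * (1 - 4 * p / 3) ** 2)

def k2p_distance(P, Q):
    """Kimura K = (1/2) ln(a) + (1/4) ln(b); P = transitions, Q = transversions."""
    a = 1 / (1 - 2 * P - Q)
    b = 1 / (1 - 2 * Q)
    return 0.5 * math.log(a) + 0.25 * math.log(b)

p_obs = 3 / 14                       # observed proportion of differences
k_jc = jc_distance(p_obs)            # corrected substitutions per site
v_jc = jc_variance(p_obs, 14)        # large, because the sequence is short
```

The corrected value exceeds the raw proportion, reflecting substitutions hidden by multiple hits. When one third of the differences are transitions (P = p/3, Q = 2p/3), the Kimura formula reduces to the Jukes-Cantor value.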
Nucleotide Positions Comprise Two Classes of ‘Sites’
• Alterations at Nonsynonymous Sites change the encoded
protein
• Alterations at Synonymous Sites do not change the
encoded protein
In early terminology:
• Synonymous Site = “Silent Site;” such a change was thought
to show no phenotype
• Nonsynonymous Site = “Replacement Site”
Variation in Nonsynonymous Substitution Rates
• Variation in purifying selection within genes
• There is “domain structure” within proteins that means
some regions will evolve more quickly than others, since
they serve less important roles
• For example, a nucleotide binding domain may evolve
more slowly than a cytoplasmic loop
• Variation among genes due to selection intensity
• Some entire genes play more important roles, and
therefore changes are less well tolerated; e.g., genes
encoding histones evolve very slowly
• Variation among genes due to differences in mutation rate
• Variation in amino acid tolerance to substitution
• Variation in lineage-specific rates
The Molecular Clock
Hypothesis that substitution rates are equivalent between
lineages. Tested using a “relative rate” test:
   +------------- C
   |
---+        +------ B
   |        |
   +------ O
            |
            +------ A
• $K_{AC} = K_{OA} + K_{OC}$
• $K_{BC} = K_{OB} + K_{OC}$
• $K_{AB} = K_{OA} + K_{OB}$
Therefore:
$K_{OA} = (K_{AC} + K_{AB} - K_{BC})/2$
$K_{OB} = (K_{AB} + K_{BC} - K_{AC})/2$
$K_{OC} = (K_{AC} + K_{BC} - K_{AB})/2$
According to the Molecular Clock, $K_{OA} - K_{OB} = 0$
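The relative rate test reduces to a small calculation on the three pairwise distances. The sketch below is illustrative Python; the function name and the distances are hypothetical, not data from the slides.

```python
def relative_rate(k_ab, k_ac, k_bc):
    """Return (K_OA, K_OB, K_OC) and the clock statistic K_OA - K_OB."""
    k_oa = (k_ac + k_ab - k_bc) / 2
    k_ob = (k_ab + k_bc - k_ac) / 2
    k_oc = (k_ac + k_bc - k_ab) / 2
    return (k_oa, k_ob, k_oc), k_oa - k_ob

# Hypothetical pairwise distances (substitutions per site); C is the outgroup.
branches, clock_diff = relative_rate(k_ab=0.25, k_ac=0.50, k_bc=0.55)
```

Note that K_OA - K_OB simplifies to K_AC - K_BC, so the statistic depends only on the two distances to the outgroup. In practice the difference would be compared with its sampling error before rejecting the clock.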
Natural Selection
Fitness (w) - A measure of relative ability of organisms to
survive and reproduce in a certain environment. A fitness of
1.0 is typically used as a baseline for comparison. A fitness
value lower than 1.0 indicates that an organism is less likely to
produce viable offspring. Fitness is a genotype by
environment interaction.
Selection coefficient (s) - A measure of how a particular
phenotypic trait alters fitness. Since fitness is measured as
w=1-s, positive selection coefficients denote detrimental
traits.
Malthus - Noted that more offspring are produced than can
survive.
Darwin - Postulated that fitness is heritable; that is, more fit
organisms produce more fit offspring.
Kinds of Selection
Purifying selection - The process by which substitutions
resulting in less fit organisms are removed from the population.
Heterosis - The phenomenon whereby heterozygotes have a
higher fitness than do homozygotes
Frequency-dependent selection - The phenomenon whereby
fitness of a genotype depends upon its frequency in the
population; typically less frequent genotypes are more fit. This
leads to the stable maintenance of polymorphism.
Diversifying Selection - The phenomenon whereby the
fitness conferred by a genotype is strongly dictated by the
environment, leading to the stable maintenance of
polymorphism.
Approaches for Constructing Dendrograms
Phenetics : Relationships are based on overall levels of
similarity. Common methods include:
• UPGMA (unweighted pair-group method with arithmetic means)
• Transformed Distance and other variants
• Fitch-Margoliash
• Neighbor-joining
Cladistics : Relationships are based on elucidation of shared,
derived characteristics. Common methods include:
• Parsimony
• Evolutionary Parsimony
• Maximum Likelihood
UPGMA I : The Method
• Form clusters starting with most closely related taxa
• Average relationships with other taxa
• Repeat
Original Divergence
       A      B      C      D      E      F
A      -     .08    .19    .32    .28    .55
B             -     .22    .26    .25    .62
C                    -     .31    .28    .59
D                           -     .14    .64
E                                  -     .63
Round 1 : Group taxa A & B
       A,B    C      D      E      F
A,B     -    .205   .29    .265   .585
C             -     .31    .28    .59
D                    -     .14    .64
E                           -     .63
UPGMA II : The Tree
Round 2 : Group taxa D & E
       A,B    C      D,E     F
A,B     -    .205   .2775   .585
C             -     .295    .59
D,E                  -      .635
Round 3 : Group taxa C with (A,B) cluster
           (A,B),C   D,E      F
(A,B),C      -      .28625   .5875
D,E                   -      .635
Round 4 : Group [(A,B),C] cluster with (D,E) cluster
                   ((A,B),C),(D,E)   F
((A,B),C),(D,E)          -          .61125**
**Note straight average of all taxa is 0.606; this value reflects shared branches
Round 5 : Dendrogram is complete
[Figure: dendrogram of taxa A to F plotted against divergence (axis from 0.0 to 0.6)]
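The rounds above can be automated. The following Python sketch mirrors the averaging used in the rounds, where the distance from a new cluster to any other taxon is the plain average of the two merged entries; this is an assumption about the slide's procedure, and library UPGMA implementations usually weight that average by cluster size instead. All names are illustrative.

```python
from itertools import combinations

def cluster_by_averaging(labels, distances):
    """Repeatedly merge the closest pair, averaging the two merged rows.

    distances maps frozenset({a, b}) -> distance between labels a and b.
    Returns the list of merges as (left, right, distance_at_merge).
    """
    d = dict(distances)
    clusters = list(labels)
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        merges.append((a, b, d[frozenset((a, b))]))
        rest = [c for c in clusters if c not in (a, b)]
        for c in rest:
            # Simple average of the two merged clusters' distances to c.
            d[frozenset((merged, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        clusters = rest + [merged]
    return merges

taxa = "ABCDEF"
rows = {  # upper triangle of the "Original Divergence" table
    "A": [.08, .19, .32, .28, .55],
    "B": [.22, .26, .25, .62],
    "C": [.31, .28, .59],
    "D": [.14, .64],
    "E": [.63],
}
table = {
    frozenset((t, u)): value
    for i, t in enumerate(taxa[:-1])
    for u, value in zip(taxa[i + 1:], rows[t])
}
merges = cluster_by_averaging(taxa, table)
```

The merge order is A with B, then D with E, then C with (A,B), then the two large clusters, and finally F, which joins at a distance of 0.61125.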
UPGMA III : Significance
So, the previous tree looked robust, but what about this one:
We may be confident with saying A, B & C belong to one group
and D & E belong to another, but are we confident that A is
closer to B than it is to C? In other words, what is the
confidence in the marked (**) branch?
[Figure: dendrogram of taxa A to F plotted against divergence (axis from 0.0 to 0.6), with one branch marked **]
Other Distance Methods
Transformed Distance
• Transform distance first as $D^*_{ij} = (D_{ij} - D_{ir} - D_{jr})/2 + c$
• Where r is a referent taxon and c allows for positive values
Fitch and Margoliash
• $D_{ij}$ is again the observed distance and $E_{ij}$ is the tree distance
• Trees are chosen to minimize the following: $s_{FM} = 100 \left[ 2 \sum_{i<j} \{(D_{ij} - E_{ij})/D_{ij}\}^2 / (n^2 - n) \right]^{1/2}$
Minimum Evolution (Simplified as Neighbor Joining)
• In an unrooted, bifurcating tree of n taxa, there are 2n - 3 possible branches; $\lambda_i$ is the length of the ith branch
• The sum of branches is $L = \sum_i \lambda_i$
• Final tree minimizes L; this is not maximum parsimony since this method is not affected by backward or parallel mutation
Neighborliness
• Consider a tree with n > 3 taxa; assume taxa 1 & 2 are neighbors and $D_{ij}$ is distance between taxa i & j
• Therefore $D_{12} + D_{ij} < D_{1i} + D_{2j}$ AND $D_{12} + D_{ij} < D_{1j} + D_{2i}$
• The best tree maximizes the number of cases in which this is true
Estimating Branch Lengths
Consider 3 taxa in an unrooted tree
• $D_{AB} = x + y$
• $D_{AC} = x + z$
• $D_{BC} = y + z$
So, we can solve as
• $x = (D_{AB} + D_{AC} - D_{BC})/2$
• $y = (D_{AB} + D_{BC} - D_{AC})/2$
• $z = (D_{AC} + D_{BC} - D_{AB})/2$
Now, consider more than three taxa
• Let’s say taxa 1 & 2 were the first to cluster
• These will correspond to taxa ‘A’ and ‘B’ in the three-taxa case as above
• Therefore, x=a and y=b
• Collapse all other taxa into ‘C’, represented as the c/d/e junctions on this phylogeny
• So, $D_{A,C} = D_{1,(3,4,5)} = (D_{1,3} + D_{1,4} + D_{1,5})/3$
• Next we collapse other taxa so that 1,2=A, 4=B and 3,5=C
• Repeat until all lengths are calculated
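The collapsing step can be sketched in Python as follows. The distances are hypothetical and the function names are chosen for illustration; the sketch covers only the first step (taxa 1 and 2 as A and B, all others collapsed into C).

```python
from statistics import mean

def three_point(d_ab, d_ac, d_bc):
    """Solve the three-taxon system for the branch lengths x, y, z."""
    x = (d_ab + d_ac - d_bc) / 2
    y = (d_ab + d_bc - d_ac) / 2
    z = (d_ac + d_bc - d_ab) / 2
    return x, y, z

def first_pair_branches(dist, pair, others):
    """Branches for the first clustered pair, collapsing `others` into 'C'."""
    i, j = pair
    d_ab = dist[frozenset((i, j))]
    d_ac = mean(dist[frozenset((i, k))] for k in others)  # e.g. D_1,(3,4,5)
    d_bc = mean(dist[frozenset((j, k))] for k in others)  # e.g. D_2,(3,4,5)
    return three_point(d_ab, d_ac, d_bc)

# Hypothetical distances involving taxa 1 and 2 in a five-taxon data set.
dist = {
    frozenset((1, 2)): 0.3,
    frozenset((1, 3)): 0.5, frozenset((1, 4)): 0.6, frozenset((1, 5)): 0.7,
    frozenset((2, 3)): 0.6, frozenset((2, 4)): 0.7, frozenset((2, 5)): 0.8,
}
a, b, rest = first_pair_branches(dist, pair=(1, 2), others=(3, 4, 5))
```

Here `a` and `b` are the terminal branches leading to taxa 1 and 2; `rest` is the average distance from their junction to the collapsed group and is refined in later steps.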
[Figure: unrooted three-taxon tree (A, B, C) with branches x, y and z; unrooted five-taxon tree (taxa 1 to 5) with branches a to g]
Parsimony Methods
Maximum Parsimony
• Ancestral sequences are inferred from extant sequences, and the tree requiring the minimum number of changes is computed.
• Branch lengths are computed a number of ways, often correlated to the number of changes occurring along a branch.
• The likelihood of parallel and/or backwards events can be adjusted depending on the data set.
Evolutionary Parsimony
• Usually considers only four taxa
• Transition/transversion bias is used to compute quantities X, Y & Z for the three topologies.
• If only one is significant, then this tree is chosen
Example of Parsimony
Taxon   Sequence
A       G C G G C G G A C C G G G
B       G C G A C A C T C C G G A
C       A C A T T G G A A A T A A
D       G C A T T A C T T A T A G
Types of Sites
• Invariant
• Variant
• Informative
Tree Support
(A,B) , (C,D)   5
(A,C) , (B,D)   3
(A,D) , (B,C)   1
Types of Trees
• Unrooted
• Outgroup Rooted
• Midpoint Rooted
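The tree-support counts for the four-taxon example can be reproduced by classifying each site. In this Python sketch a site is treated as informative when two taxa share one base and the other two share a different base, which is the usual four-taxon criterion; the function name is illustrative.

```python
from collections import Counter

SEQS = {
    "A": "GCGGCGGACCGGG",
    "B": "GCGACACTCCGGA",
    "C": "ACATTGGAAATAA",
    "D": "GCATTACTTATAG",
}

def tree_support(seqs):
    """Count informative sites supporting each of the three unrooted topologies."""
    support = Counter()
    for a, b, c, d in zip(seqs["A"], seqs["B"], seqs["C"], seqs["D"]):
        if a == b and c == d and a != c:
            support["(A,B),(C,D)"] += 1
        elif a == c and b == d and a != b:
            support["(A,C),(B,D)"] += 1
        elif a == d and b == c and a != b:
            support["(A,D),(B,C)"] += 1
        # Invariant sites and sites with a single odd base support no tree.
    return support

support = tree_support(SEQS)
```

Sites that fall through all three branches are either invariant or variant but uninformative.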
Testing Parsimony Trees
Cavender’s test
• For specific numbers of characters, calculates how many characters worse a tree must be to be rejected
Chars   Steps   Chars   Steps   Chars   Steps
3-4     3       21-23   10      43-46   17
5-6     4       24-26   11      47-49   18
7-9     5       27-29   12      50      19
10-11   6       30-33   13      60      22
12-14   7       34-36   14      75      26
15-17   8       37-39   15      100     33
18-20   9       40-42   16
Felsenstein tests
• Work on trees with small numbers of taxa
• First : Tests to see if the number of steps supporting the best tree is significantly lower than the number of steps supporting the next-best tree (S = a-b).
• Second : The number of characters supporting the best tree (C=a)
Chars   S(.05)   C(.05)   Chars   S(.05)   C(.05)
4       4        4        14      6        9
5       5        5        15-16   6        10
6       6        5        17-19   6        11
7       7        6        20      6        12
8       8        6        21      7        12
9-10    5        7        22-23   7        13
11-12   5        8        24-26   7        14
13      5        9        27-28   7        15
Maximum Likelihood Methods
Topology Generation
• Nucleotide positions are considered separately under models for DNA evolution.
• Topologies are tested for their likelihood of generating the resulting data set.
• Likelihood is calculated as the sum of the log-likelihoods for each variant site.
• The tree with the highest likelihood is chosen.
Topology testing
• Likelihood calculated for each tree as L = Σln(λi)
• The log-likelihood test uses the differences in likelihood to
eliminate topologies with significantly lower likelihoods
• All other trees are not significantly different; this is a
dendrogram “neighborhood” of equally good trees
• Variant branches can be collapsed to yield a consensus tree
Trade-offs in Alignments
Many kinds of data must be weighed
• Homologous positions must be assigned
• Relative weighting of transitions and transversions in making alignment
• Relative weighting of transitions and transversions in assessing divergence
• Relative weight of assigning a gap
• Relative weight of increasing gap length
• Different for protein coding & non-coding sequences
             Scheme A   Scheme B
Sequence 1   GT-AC      GT-AC
Sequence 2   G-CAG      GC-AG
Sequence 3   GTCAC      GTCAC
• In both schemes, three events occur
• In scheme A, there are two insertion/deletion events and one nucleotide substitution
• In scheme B, there is one insertion/deletion event and two nucleotide substitutions
Testing Complex Trees : A Naïve Approach
• Felsenstein’s and other tests work on small numbers of taxa
• Therefore, we can test specific 4-taxon subsets to determine
which clades are robust
Imagine a complex phylogeny (right); we could analyze this as follows:
1. Test (A,B) , (C,D) to determine if ‘C’ is excluded from (A,B) clade
2. Test (D,E) , (F,G) to determine if ‘F’ is excluded from (D,E) clade
3. Test (G,H) , (I,J) to determine if those are robust clades
4. Test [(A,C) , (D,F)] AND [(A,B) , (E,F)] AND [(B,C) , (D,F)] to determine if those clades are distinct.
5. But wait, that’s only 3 of the possible 9 combinations; should we do all 9? What if 8 support and 1 doesn’t?
6. In test 1, does it matter that the ‘outgroup’ is ‘D’? Should we try all 7 outgroups?
[Figure: phylogeny of taxa A to J]
The 1% Inclusion Parsimony Approach
• Examine all trees within 1% of the tree length of the most-
parsimonious tree
• Assign confidence in nodes according to what percentage of
trees include that clade
This is somewhat arbitrary in two ways:
(1) Why are trees within 1% of the most-parsimonious tree
length chosen?
(2) How do we interpret “confidence” values?
Testing Trees : Resampling Approaches
The Jackknife
• Resample data points at random without replacement
• If resample size = 50% of the sample size, then the variance of the distribution of the resampled parameter is equivalent to the variance of the original parameter, since the factor relating the two variances, a ratio of M and N - M, equals 1 when M = N/2
• Robust nodes appear in >95% of trees made with resampled datasets; typically 100 - 10,000 trees are examined
The Bootstrap
• Resample N-1 data points at random with replacement
• Recalculate topology as above
• Robust nodes appear in >95% of trees made with resampled datasets; typically 100 - 10,000 trees are examined
Advantages : Method for assessing reliability is independent
of tree construction method
Drawback : Can be computationally intensive
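A resampling test can be sketched for the simplest case of four taxa, where the "tree" is whichever topology has the most supporting informative sites. This Python sketch is illustrative only: the function names are invented here, the alignment is the four-taxon parsimony example from earlier in these slides, and each replicate draws as many sites as the alignment contains (a common convention, whereas the slide states N-1).

```python
import random

def best_topology(columns):
    """Topology with the most supporting informative sites, or None on a tie."""
    counts = {"(A,B),(C,D)": 0, "(A,C),(B,D)": 0, "(A,D),(B,C)": 0}
    for a, b, c, d in columns:
        if a == b and c == d and a != c:
            counts["(A,B),(C,D)"] += 1
        elif a == c and b == d and a != b:
            counts["(A,C),(B,D)"] += 1
        elif a == d and b == c and a != b:
            counts["(A,D),(B,C)"] += 1
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def bootstrap_support(seqs, topology, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which `topology` is the best tree."""
    rng = random.Random(seed)
    columns = list(zip(seqs["A"], seqs["B"], seqs["C"], seqs["D"]))
    n = len(columns)
    hits = 0
    for _ in range(replicates):
        # Sample sites (alignment columns) with replacement.
        sample = [columns[rng.randrange(n)] for _ in range(n)]
        if best_topology(sample) == topology:
            hits += 1
    return hits / replicates

seqs = {
    "A": "GCGGCGGACCGGG",
    "B": "GCGACACTCCGGA",
    "C": "ACATTGGAAATAA",
    "D": "GCATTACTTATAG",
}
support_ab_cd = bootstrap_support(seqs, "(A,B),(C,D)")
```

Because the resampling wraps around `best_topology` without depending on how the tree is built, any other tree-construction method could be substituted, which is the advantage noted above.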
The Problem With Parsimony
For three taxa there is only 1 unrooted tree and three possible rooted trees: (A,B),C and (A,C),B and (B,C),A.
But these numbers grow fast. For N taxa there are:
• $(2N-3)! / [2^{N-2}(N-2)!]$ rooted trees and
• $(2N-5)! / [2^{N-3}(N-3)!]$ unrooted trees
N    Rooted
2    1
3    3
4    15
5    105
6    945
7    10,395
8    135,135
9    2,027,025
10   34,459,425
12   13,749,310,575
14   7,905,853,580,625
16   6,190,283,353,629,375
18   6,332,659,870,762,850,625
20   8.200 E+021
22   1.311 E+025
24   2.537 E+028
26   5.843 E+031
28   1.579 E+035
30   4.951 E+038
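The tree counts can be computed exactly with integer arithmetic. A short Python sketch, assuming strictly bifurcating trees; the function names are chosen for illustration.

```python
from math import factorial

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

counts = {n: rooted_trees(n) for n in (2, 3, 4, 5, 10, 16)}
```

Python integers do not overflow, so the larger entries are exact rather than rounded. The number of unrooted trees for N taxa equals the number of rooted trees for N - 1 taxa.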
How Good Is It?
Like distance methods, parsimony will give you a tree,
although you may not get a “most-parsimonious” tree.
How good is it? Consider these two data sets:
Taxon   Data Set 1    Data Set 2
A       GGGCCAATTAA   GGGCCAATTAA
B       GGGCCAATGCC   GGGCCTTGGCC
C       CAATTTTGTCC   AAATTAATTCC
D       CAATTTTGGAA   AAATTTTGGAA
Steps   14            17
Chars   8 OF 11       5 OF 11
[Figure: tree of taxa A, B, C and D]
History of Classification
God (4500 B.C.) : not-A groups
Noah (3500 B.C.) : Cladistic characters
Plato (427-347 BC) : Idealized Form
Aristotle (384 - 322 BC) : Scala Naturae
Hans and Zacharias Janssen (1600) : Microscope
Marcello Malpighi (1628-1694) : Cellular organization
Robert Hooke (1635-1702) : Cells
Anton van Leeuwenhoek (1632-1723) : Described bacteria
Carl von Linné (1707-1778) : Systema Naturae
Otto F. Muller (1730-1784) : 379 Animalcule descriptions
Antoine-Laurent de Jussieu (1748 -1836) : Major divisions of plants
Georges Cuvier (1769-1832) : Major animal phyla
Christian Ehrenberg (1795-1876) : Included bacteria in systematics
Georges-Louis Buffon (1707-1788) : Not all species present at the Creation
Thomas Malthus (1766-1834) : Exponential growth
Georges Cuvier (1769-1832) : Catastrophes
Louis Agassiz (1807-1873) : Serial Creation
James Hutton (1726-1797) : Old Earth
Charles Lyell (1797-1875) : Old Earth
Jean-Baptiste Lamarck (1744-1829) : Inheritance of acquired characters
Charles Darwin (1809 - 1882) : Natural Selection
Louis Pasteur (1822-1895) : Microbial processes
Ernst Haeckel (1834 - 1919) : Evolutionary classification
Edouard Chatton (1883 - 1947) : Prokaryote/eukaryote dichotomy
Herbert Copeland (1902 - 1968) : Reclassification
Robert Whittaker (1924 - 1980) : “Modern” classification
Emil Zuckerkandl & Linus Pauling : Molecular clocks
Motoo Kimura and Tom Jukes : Neutral theory
Carl Woese and George Fox : Molecular phylogeny
Naoyuki Iwabe & Takashi Miyata : Rooting the tree of life via EF’s
Peter Gogarten : Rooting the tree of life via ATPases
Brian Golding & Radhey Gupta : Eukarya by Fusion
Phylogeny I
God (4500 BC) : Heavens, Earth
Yet the “Heavens” have no defining characteristics
Noah (3500 BC) : Living Things; Animals, Plants
Potential Introduction of Hierarchy in Classification
Phylogeny II
Aristotle (350 BC) : Animal, Vegetable, Mineral
Classification, but lacking Hierarchy
Aristotle (350 BC) : Air, Earth, Fire, Water
Classification, but lacking Hierarchy
Phylogeny III
Linnaeus (AD 1743) : Animalia, Plantae, Infusoria
Chatton (AD 1937) : Prokaryotes, Eukaryotes
Completely Hierarchical (even to non-living things!)
Animals; Vertebrates, Invertebrates; Birds, Mammals
Introduced Polarity, or Time, Into Lines of Phylogenetic Descent
Copeland (AD 1956) : Animalia, Plantae, Fungi, Protista, Monera
Whittaker (AD 1959) : Animalia, Plantae, Protista, Fungi, Monera
Eukaryotes, Prokaryotes - Incorporation of Chatton’s distinction
Association Coefficients Between Representative Members of the Three Primary Kingdoms
    Organism                  1    2    3    4    5    6    7    8    9    10   11   12   13
1   S. cerevisiae             -   .29  .33  .05  .06  .08  .09  .11  .08  .11  .11  .08  .08
2   Lemna minor              .29   -   .36  .10  .05  .06  .10  .09  .11  .10  .10  .13  .07
3   (not labeled)            .33  .36   -   .06  .06  .07  .07  .09  .06  .10  .10  .09  .07
4   Escherichia coli         .05  .10  .06   -   .24  .25  .28  .26  .21  .11  .12  .07  .12
5   Chlorobium vibrioforme   .06  .05  .06  .24   -   .22  .22  .20  .19  .06  .07  .06  .09
6   Bacillus firmus          .08  .06  .07  .25  .22   -   .34  .26  .20  .11  .13  .06  .12
7   C. diphtheriae           .09  .10  .07  .28  .22  .34   -   .23  .21  .12  .12  .09  .10
8   Aphanocapsa 6714         .11  .09  .09  .26  .20  .26  .23   -   .31  .11  .11  .10  .10
9   Chloroplast (Lemna)      .08  .11  .06  .21  .19  .20  .21  .31   -   .14  .12  .10  .12
10  Methanobacterium th.     .11  .10  .10  .11  .06  .11  .12  .11  .14   -   .51  .25  .30
11  M. ruminantium           .11  .10  .10  .12  .07  .13  .12  .11  .12  .51   -   .25  .24
12  Methanobacterium sp.     .08  .13  .09  .07  .06  .06  .09  .10  .10  .25  .25   -   .32
13  Methanosarcina barkeri   .08  .07  .07  .12  .09  .12  .10  .10  .12  .30  .24  .32   -
Similarities determined by SAB analysis of ribosomal RNAs
Phylogeny VI
Woese (AD 1977) : Eukaryotes, Eubacteria, Archaebacteria
Woese, implied (AD 1977) : Eukaryotes, Eubacteria, Archaebacteria
Phylogeny VI
Rooting
theTree
ofLife:U
seofD
uplicateG
enes
ArchaeaEucaryaEubacteriaO
utgroupgene
EucaryaEubacteriaArchaeaO
utgroupgene
ArchaeaEubacteriaEucaryaO
utgroupgene
Gene
EF-Tu
EF-G
ATPaseF1-
ATPaseF1-
tRN
AM
et-E
tRN
AM
et-I
0.960.03
0.01
0.790.21
0.00
1.000.00
0.00
1.000.00
0.00
0.550.33
0.12
0.500.41
0.09
ProbabilityofR
ecoveringthe
SpecifiedR
elationships
Phylogeny VII
Golding and Gupta (AD 1995) : Eukaryotes, Eubacteria, Archaebacteria; Mitochondrion, Cyanobacteria
Iwabe (AD 1989) & Gogarten (1989) : Eukarya, Bacteria, Archaea
Gene Families
• Most genes have homologues in closely-related taxa whose
products perform similar functions; these pairs of genes are
called orthologues
• In addition, many genes have homologues within the same
genome which perform some different function; these pairs
of genes are called paralogues
What Functions Do Paralogues Play?
• They may act on different substrates (e.g., an enzyme with a
different binding site, or a protein kinase with a different
target)
• They may act in different tissue types or at different
developmental stages (e.g., embryonically-expressed globin
genes)
• They may be regulated in response to differential
environmental conditions to perform the same job, but for
different reasons (e.g., nitrate reductase for reducing nitrate
as an electron sink, or reducing it to provide ammonia for
assimilation)
How Do Gene Families Arise?
The Classic Model
• A gene duplicates within the genome; typically, an unequal-
crossing-over event is invoked
• The additional copy is free to evolve a novel function, or
novel regulatory regime, since the original copy performs its
original function
• Both copies are then maintained by selection
Yet this scenario is not as rosy as it sounds….
Problems with the Classic Model
• Dosage compensation after duplication may select for
organisms that have eliminated the duplicated copy
• After duplication, there is no selection to prevent deleterious
mutations from eliminating gene function.
• If an advantageous mutation arises, it must have a
sufficiently large benefit that elimination of this newly-created
form by mutation and drift is counter selected; this is difficult
in small populations, especially those seen in eukaryotes
• After duplication and gain of advantageous function, gene
conversion may homogenize the two copies
• Duplication of a single gene may be insufficient to provide for
a novel phenotype; for example, a new signaling cascade
will require a new receptor, MAPK, MAPKK, MAPKKK, etc.
An Alternative Model
• Genes “duplicate” every time an organism reproduces.
• Consider a population (Pop’n A) where an entire pathway
experiences selection to perform an alternative function; in
Population B, the original function is maintained
• This function will likely never be achieved, since it would
require abandoning its original function, which may be
essential
• However, substantial headway may be gained in pursuing
the alternative form at the expense of the original form
• Admixture of population A and B produces heterozygotes
with an advantage; that is, both pathways are now found in
the same cell, leading to heterosis.
• This is an unstable state, since only 50% of the progeny of
heterozygotes are also heterozygotes.
• NOW, unequal crossing-over or other chromosome
gymnastics will allow for presence of both pathways in all
offspring.
• In this model, genes diverge UNDER SELECTION, prior to
reintroduction into the same cytoplasm (duplication).
Keeping Pathways Together I
• What prevents “mixing” of the genes of the two pathways via gene-conversion and meiosis?
• R.A. Fisher (1930) proposed that natural selection could maintain groups of cooperating advantageous genes; this idea was extended by Botstein and Susskind to suggest that this selection would lead to clustering of these genes on the eukaryotic chromosome.
• Consider two loci, each with two alleles (A & a; B & b)
• Consider A works well with B, and a works well with b
• Therefore, the fitness of AB/AB cells and ab/ab cells would
be higher than heterozygotes, especially Ab/Ab or aB/aB.
• This would lead to APPARENT linkage disequilibrium
between loci A & B due to counterselection of the classes
of heterozygotes.
• This selection would lead to an increase in the ACTUAL
linkage disequilibrium (decrease in chromosomal distance)
so that heterozygote disruption of coadapted gene
complexes would be minimized.
• This model requires high-frequency recombination.
Keeping Pathways Together II
• The above model does not work for haploid organisms with minimal amounts of recombination that could disrupt coadapted gene complexes.
• Yet in these organisms, coadapted gene complexes are found in very tight clusters (bacterial operons).
• How do genes attain such tight clustering, especially since the primary mechanism for juxtaposition (deletion) would likely remove important genes from the chromosome?
What genes are clustered?
• The bacterial operon allows for coregulation of genes, as
well as reducing their disruption by recombination.
• Yet coregulation is not a plausible influence for selection
for the origin of the gene cluster, since a very tiny
advantage would be conferred by adding only one gene to
a cluster at a time.
• Moreover, virtually none of the very important,
coordinately-regulated genes are found in operons
• In contrast, many operons encode peripheral metabolic
functions with lower selective value.
Keeping Pathways Together III
• Therefore, we must consider that the selection for the
ORIGIN of a feature may not be the same as the selection
for the MAINTENANCE of a feature.
• One advantage to a gene cluster is that it allows all of the
genes responsible for a selectable function or phenotype to be
mobilized and transferred in a single step;
transfer of one individual gene would not transfer the function.
• After transfer of genes that are only moderately clustered,
intervening genes would be deleted, since only those
genes under selection would be maintained; this results in
a tight gene cluster or operon, which can be expressed by
a host promoter at the site of insertion.
• The operon exploits the capability of prokaryotes to direct
the synthesis of numerous proteins from a single transcript
• Transfer of operons among bacteria and from bacteria to
eukaryotes is a powerful mechanism for allowing
recipients to gain novel metabolic capabilities
Bacterial genes are organized into clusters
Both bacterial and eukaryotic genes can be clustered via selection for proximity to cis-acting regulatory sequences
β-Globin Locus : Developmental regulation via proximity to LCR
Both bacterial and eukaryotic genes can be clustered via selection for proximity to cis-acting genes