genetics to genomics (from basics to buzzwords)cobamide2.bio.pitt.edu/core/overheads.pdf ·...

Genetics to Genomics (From Basics to Buzzwords)

• Genetics : Understanding the role of heritable material in

shaping organismal phenotypes

• Genomes are more than collections of genes:

• Chromosomes and episomes

• Gene clusters within chromosomes

• Genes and associated control elements

• Complex Exon/Intron organization

• Functional domains organized within coding regions

• Functional domains positioned outside coding regions

The fundamental task of genomics is understanding what information is important, and what is not.

• Genomes result from the accumulation of changes over time (evolution).

• Therefore, an understanding of genomes must have a basis

in the understanding of how constituent domains, genes,

gene clusters and chromosomes evolve

• This leads to an understanding of patterns of information within and between genes and genomes.

A History of Genomic Data

• Richard Roblin’s Ph.D. thesis in 1967 was the determination of the identity of a single nucleotide (1 base is not a sequence); it was the 5’ end of bacteriophage R17, a 3 kb RNA phage; that base was a guanosine (pppGp…)

• In 1970, the 12-bp cohesive ends of bacteriophage lambda were determined by Ray Wu

• In 1977, two methods were introduced for rapid DNA sequencing (both won their proponents Nobel Prizes):

• The Maxam-Gilbert chemical degradation method

• The Sanger primer extension method

• In 1977, the 5,386 base sequence of E. coli bacteriophage φX-174 was published

• In 1983, the 48,502 base pair sequence of bacteriophage λ was published

• In 1995, the 1,830,137 base pair sequence of the free-living bacterium Haemophilus influenzae was published

• In late 1996, the 12,052,000 base pair sequence of the yeast Saccharomyces cerevisiae was published

• In late 1998, the 97,000,000 bp sequence of the nematode Caenorhabditis elegans was published

• In 2000, a draft of the 3,000,000,000 base pair human genome was completed

• By early 2003, the genomes of nearly 100 species of Bacteria and Archaea, and 10 species of eukaryotes, had been completely sequenced.

Mendel and Darwin: More than two dead white guys?

• What was “Blending Inheritance”?

• How did they view mutations?

• What was the influence of Aristotle?

• What was the influence of Malthus?

• What was the influence of Geologists?

• Natural Variation : The result of genetic experiments played out over long periods of time

• Similarity & Difference : Provide clues to relative importance: the results of (more or less) an infinite number of genetic experiments

Pillars of molecular evolution (How we make our models)

Empirical Data

• Direct experimentation in laboratory environments

• Direct experimentation in natural environments

• Observation of natural variation within species

• Observation of differences between species

Integrative Analysis

• Mathematical Modeling

• Cogitation

• Extrapolation and integration

Classification of similarity

Criteria for Classification

• By what criteria are features similar?

• By what criteria are features different?

• What processes lead to similarity?

• What processes lead to differences?

Types of Similarity

• Homology : Identity by Descent

• Orthology : Encoded functions are identical

• Paralogy : Encoded functions are different

• Convergence : Identity by State

• Chance : Identity by State

Methods for Assessing Molecular Similarity

• DNA-DNA Hybridization

• Isozyme analysis and MLEE

• Library overlap (SAB)

• RFLP Analysis

• DNA sequence divergence

Measuring Mutation Rates

Mutation rates

• Luria-Delbrück Fluctuation tests

• Targets used in laboratory experiments

• Phage resistance

• Antibiotic resistance

• lacZ

• Lessons

• Mutations occur almost at random

• Probability matrix is organism-dependent

• There are context effects

Substitution rates

• A mutation is a lesion

• A substitution is a variant allele observed in nature

• Not all mutations become substitutions

Fate of Mutations

A. Mutations originate at particular frequencies

i Variable exposure to mutagens

ii Context for polymerase error

iii Likelihood of replication slippage for frameshift

B. However, lesions are repaired at different frequencies

i Different mismatch repair systems

ii Transcription coupled repair

C. Mutation not repaired, but has lethal effects

D. Mutation is disadvantageous and eventually lost

i Though not lost, mutation is infrequent in the population

ii Mutation is frequent in the population

E. Mutation becomes ubiquitous in the population (fixation)

i For a neutral mutation, P = 1/(pN), where p is the ploidy and N is the population size

ii Average time to fixation is T = 2pN generations

Random Genetic Drift

What is the probability that a variant allele becomes fixed?

• $P = (1 - e^{-4 N_e s q}) / (1 - e^{-4 N_e s})$

• Consider the approximation $e^{-x} \approx 1 - x$ when x is small

• Consider a newly arisen allele; in a diploid population its frequency is $q = 1/(2N)$

• $P = (1 - e^{-2 N_e s / N}) / (1 - e^{-4 N_e s})$

• s = 0 for a neutral mutation

• Therefore, in the limit $s \to 0$, $P = 1/(2N)$; this should be intuitive

• If $N_e = N$, then $P = (1 - e^{-2s}) / (1 - e^{-4Ns})$

• If s is small, then $P \approx 2s / (1 - e^{-4Ns})$

• For s > 0 and large N, $P \approx 2s$

• Neutral alleles go to fixation in $t = 2pN$ generations

• For $s \neq 0$, alleles fix in $t = (2/|s|) \ln(2N)$ generations

Definitions

q = Initial frequency of variant allele

s = selection coefficient

Ne = Effective population size

N = Actual population size
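The fixation-probability formula above can be checked numerically. The following is a minimal sketch, assuming a hypothetical diploid population with N = Ne = 10,000 and illustrative selection coefficients of ±0.01; the function name `fixation_probability` is ours, not part of the original slides.

```python
import math

def fixation_probability(q, s, n_e):
    """P = (1 - exp(-4*Ne*s*q)) / (1 - exp(-4*Ne*s)).

    For s == 0 the expression is 0/0, so the neutral limit P = q is returned.
    """
    if s == 0:
        return q
    # expm1(x) = exp(x) - 1; the two minus signs cancel in the ratio
    return math.expm1(-4 * n_e * s * q) / math.expm1(-4 * n_e * s)

N = 10_000                 # hypothetical population size, with Ne = N
q_new = 1 / (2 * N)        # frequency of a newly arisen allele in a diploid

p_neutral = fixation_probability(q_new, 0.0, N)       # equals 1/(2N)
p_beneficial = fixation_probability(q_new, 0.01, N)   # close to 2s
p_deleterious = fixation_probability(q_new, -0.01, N) # vanishingly small

print(p_neutral, p_beneficial, p_deleterious)
```

With these assumed values the beneficial allele fixes with probability close to 2s, while the equally deleterious allele essentially never fixes in a population this large.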

Effectively Neutral Mutations

Are mutations always either beneficial or detrimental?

As we saw earlier, that depends on what phenotype one is examining

Even more insidious, that depends on population size and population structure

In small populations, it takes a mighty big change in fitness (either positive or negative) to counteract the stochastic process of genetic drift. “Detrimental” mutations can sweep a population even if they confer a disadvantageous phenotype.

In larger populations, these same mutations could be eliminated quickly, since genetic drift has a smaller impact.

This interaction between population size and the effect of a mutation delineates a zone of fitness effects termed “effectively neutral,” whereby the fitness impact is not statistically different from zero. This is a function of the population size.

This is why it is difficult to proclaim “conserved” sequences as important and nonconserved sequences as unimportant. “Conservation” (that is, the elimination of deleterious mutations from the population) is a function of population size.

It is also a function of population structure (subdivision, migration, etc.) and sexual exchange (obligate or infrequent)

Selectionism vs Neutralism

What is the significance of natural variation?

Selectionism : Argues that most variants have adaptive value, and variation is maintained through a variety of mechanisms

• Selection/mutation balance

• Heterosis

• Frequency dependence

• Spatial and temporal heterogeneity

Neutralism : Argues that most variation is effectively neutral, and reflects primarily genetic drift

The Poisson Distribution

• Predicts the distribution of occurrences in a discrete

classification system

• Derived from the binomial distribution and equal probability

of state

• $P_\mu(x) = \mu^x e^{-\mu} / x!$

• So, for µ = 1, $P(x) = 1 / (e \cdot x!)$

• So, the probability of zero occurrences is ~37%

• The probability of only 1 occurrence is ~37%

• The probability of 2 or more occurrences is ~26%
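The percentages above follow directly from the Poisson formula with µ = 1. A short sketch, using only the standard library (the helper name `poisson_pmf` is ours):

```python
import math

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean is mu."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

p_zero = poisson_pmf(0, 1.0)           # 1/e
p_one = poisson_pmf(1, 1.0)            # also 1/e
p_two_or_more = 1 - p_zero - p_one     # everything that is left

print(f"{p_zero:.3f} {p_one:.3f} {p_two_or_more:.3f}")
```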

When did they diverge?

ACTGTAGGAATCGC
 *  * *
AATGAAAGAATCGC

If the probability of mutation is $10^{-9}$ / bp / generation, how many generations have these two sequences been diverging?

Naïve answer

• Let p be the probability of a mutation arising

• This can be (and has been) measured in the laboratory

• $p = 10^{-9}$ / bp / generation

• For 14 bp, $p = 1.4 \times 10^{-8}$ / generation

• Therefore 1 mutation arises - on average - every $7.14 \times 10^{7}$ generations

• Therefore 3 mutations arise - on average - in $2.14 \times 10^{8}$ generations
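The naïve arithmetic can be reproduced in a few lines. This sketch simply counts mismatches between the two 14-bp sequences and divides by the per-sequence mutation rate; it deliberately carries the same simplifications as the naïve answer.

```python
seq1 = "ACTGTAGGAATCGC"
seq2 = "AATGAAAGAATCGC"
mutation_rate = 1e-9  # per bp per generation, as given above

# Count positions at which the two sequences differ
differences = sum(a != b for a, b in zip(seq1, seq2))

# Expected mutations per generation across the whole 14-bp sequence
rate_per_generation = mutation_rate * len(seq1)

# Naive estimate: observed differences divided by that rate
naive_generations = differences / rate_per_generation

print(differences, f"{naive_generations:.3g}")
```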

What is missing here?

When did they diverge - Part II

First, many substitutions go unnoticed

[Figure: six substitution scenarios in diverging sequences: Single Substitution, Multiple Substitutions, Coincidental Substitutions, Parallel Substitutions, Convergent Substitution, Back Substitution]

Only the “Single Substitution” leads to differences that accurately reflect the number of mutational events

Jukes and Cantor Model

• Probability of any base changing to another base during time t is set to be α

• Probability of a base being equal to its original state at time T = t is $P_1 = 1 - 3\alpha$

• At time T = 2t, the probability of the original state is: $P_2 = (1 - 3\alpha) P_1 + \alpha (1 - P_1)$

• This can be formulated as a first-order differential equation: $dP_t/dt = -4\alpha P_t + \alpha$

• This can be solved as $P_t = \frac{1}{4} + (P_0 - \frac{1}{4}) e^{-4\alpha t}$

• Since $P_0 = 1$, $P_t = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$

• If $P_0 = 0$, $P_t = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$

• Notice that both equations converge to ¼ at equilibrium

• Under the Jukes-Cantor model, all bases have the same frequency and interchange with equal likelihood
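As a rough check on the Jukes-Cantor solution, the sketch below compares the closed form with direct iteration of the one-step recurrence. The value α = 0.001 per time step is an arbitrary illustrative assumption, and the function names are ours.

```python
import math

def jc_closed_form(t, alpha, p0=1.0):
    """P_t = 1/4 + (P_0 - 1/4) * exp(-4*alpha*t)."""
    return 0.25 + (p0 - 0.25) * math.exp(-4 * alpha * t)

def jc_iterated(steps, alpha, p0=1.0):
    """Apply P_next = (1 - 3*alpha)*P + alpha*(1 - P) repeatedly."""
    p = p0
    for _ in range(steps):
        p = (1 - 3 * alpha) * p + alpha * (1 - p)
    return p

alpha = 0.001  # hypothetical substitution probability per time step

closed_500 = jc_closed_form(500, alpha)
iterated_500 = jc_iterated(500, alpha)

# Starting from the original base (P0 = 1) or from a different base (P0 = 0),
# both trajectories approach 1/4 after a long time
late_from_one = jc_closed_form(10_000, alpha, p0=1.0)
late_from_zero = jc_closed_form(10_000, alpha, p0=0.0)

print(closed_500, iterated_500, late_from_one, late_from_zero)
```

The discrete recurrence and the continuous solution agree closely when α is small, which is the regime in which the differential-equation approximation is reasonable.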

Convergence of the Jukes & Cantor Model

[Figure: probability of having an 'A' (0.00 to 1.00) plotted against time (0 to 200 million years)]

Kimura’s Two-parameter Model

• Separates transition probability from transversion probability

• A transition substitution occurs with probability α

• A transversion substitution occurs with probability β

• Probability of identity over time is calculated as:

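$X_t = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} + \frac{1}{2} e^{-2(\alpha+\beta)t}$

• Probability of difference by transition is

$Y_t = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} - \frac{1}{2} e^{-2(\alpha+\beta)t}$

• Probability of difference by each transversion is

$Z_t = \frac{1}{4} - \frac{1}{4} e^{-4\beta t}$

• Note that $X_t + Y_t + 2Z_t = 1$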
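A quick numerical sketch of the three Kimura probabilities, confirming that they sum to one. The rates α = 0.002 and β = 0.0005 are arbitrary illustrative assumptions, as is the function name.

```python
import math

def kimura_two_parameter(t, alpha, beta):
    """Return (X_t, Y_t, Z_t): identity, transition, and each transversion."""
    slow = math.exp(-4 * beta * t)
    fast = math.exp(-2 * (alpha + beta) * t)
    x = 0.25 + 0.25 * slow + 0.5 * fast
    y = 0.25 + 0.25 * slow - 0.5 * fast
    z = 0.25 - 0.25 * slow
    return x, y, z

alpha, beta = 0.002, 0.0005   # hypothetical transition / transversion rates

x0, y0, z0 = kimura_two_parameter(0, alpha, beta)      # no time elapsed
x, y, z = kimura_two_parameter(100, alpha, beta)       # after 100 time units
total = x + y + 2 * z

print(x, y, z, total)
```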

Justification for the Kimura Model

Relative substitution rates in mammalian pseudogenes

                     Original Nucleotide
Mutant Nucleotide    A              T              C              G
A                    -              4.4 +/- 1.1    6.5 +/- 1.1    20.1 +/- 2.2
T                    4.7 +/- 1.3    -              21.0 +/- 2.1   7.2 +/- 1.1
C                    5.0 +/- 0.7    8.2 +/- 1.3    -              5.3 +/- 1.0
G                    9.4 +/- 1.3    3.3 +/- 1.2    4.2 +/- 0.5    -

• Notice that transition rates are higher than individual

transversion rates

• Notice also that the rates of substitution are not symmetrical

Correcting for multiple substitutions

• Let’s start with the Jukes & Cantor one-parameter model

• Probability of identity for sequence in TWO lineages is $P_i = \frac{1}{4} + \frac{3}{4} e^{-8\alpha t}$

• Probability of difference is $P_D = 1 - P_i$, so $P_D = \frac{3}{4}(1 - e^{-8\alpha t})$, or $8\alpha t = -\ln(1 - \frac{4}{3} P_D)$

• Since t is unknown, we cannot estimate α. Instead, we compute K, the number of substitutions per site

• For 2 lineages, $K = 2 (3\alpha t)$

• So, $K = -\frac{3}{4} \ln(1 - \frac{4}{3} P)$, where P is the proportion of differing nucleotides per site

• For sequence of length L, the sampling variance is $V(K) = P(1-P) / [L (1 - \frac{4}{3} P)^2]$

• For the Kimura model, let P be the proportion of bases differing as transitions and Q be the proportion of bases differing as transversions

• $K = \frac{1}{2} \ln(a) + \frac{1}{4} \ln(b)$, where $a = 1/(1 - 2P - Q)$ and $b = 1/(1 - 2Q)$

• $V(K) = [a^2 P + c^2 Q - (aP + cQ)^2] / L$, where $c = (a + b)/2$
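The two distance corrections can be compared on the same data. This is a sketch under assumed inputs: a hypothetical 1,000-bp alignment in which 10% of sites differ by transition and 5% by transversion. The function names are ours.

```python
import math

def jukes_cantor_distance(p, length):
    """K and V(K) under the one-parameter model; p = proportion of differing sites."""
    k = -0.75 * math.log(1 - 4 * p / 3)
    variance = p * (1 - p) / (length * (1 - 4 * p / 3) ** 2)
    return k, variance

def kimura_distance(p, q, length):
    """K and V(K) under the two-parameter model.

    p = proportion of sites differing by transition,
    q = proportion of sites differing by transversion.
    """
    a = 1 / (1 - 2 * p - q)
    b = 1 / (1 - 2 * q)
    c = (a + b) / 2
    k = 0.5 * math.log(a) + 0.25 * math.log(b)
    variance = (a * a * p + c * c * q - (a * p + c * q) ** 2) / length
    return k, variance

# Hypothetical observations, not taken from the slides
L_sites, ts, tv = 1000, 0.10, 0.05

k_jc, v_jc = jukes_cantor_distance(ts + tv, L_sites)
k_k2p, v_k2p = kimura_distance(ts, tv, L_sites)

print(f"JC: K={k_jc:.4f}  K2P: K={k_k2p:.4f}")
```

Both corrected values exceed the raw proportion of differing sites (0.15), and the two-parameter estimate is slightly larger here because transitions are over-represented in the assumed data.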

Nucleotide Positions Comprise Two Classes of ‘Sites’

• Alterations at Nonsynonymous Sites change the encoded

protein

• Alterations at Synonymous Sites do not change the

encoded protein

In early terminology:

• Synonymous Site = “Silent Site;” such a change was thought

to show no phenotype

• Nonsynonymous Site = “Replacement Site”

Variation in Nonsynonymous Substitution Rates

• Variation in purifying selection within genes

• There is “domain structure” within proteins that means some regions will evolve more quickly than others, since they serve less important roles

• For example, a nucleotide binding domain may evolve more slowly than a cytoplasmic loop

• Variation among genes due to selection intensity

• Some entire genes play more important roles, and therefore changes are less well tolerated; e.g., genes encoding histones evolve very slowly

• Variation among genes due to differences in mutation rate

• Variation in amino acid tolerance to substitution

• Variation in lineage-specific rates

The Molecular Clock

Hypothesis that substitution rates are equivalent between lineages. Tested using a “relative rate” test, where C is the outgroup and O is the common ancestor of A and B:

          +------------- C
      ----|
          |      +------ B
          +------O
                 +------ A

• $K_{AC} = K_{OA} + K_{OC}$

• $K_{BC} = K_{OB} + K_{OC}$

• $K_{AB} = K_{OA} + K_{OB}$

Therefore:

$K_{OA} = (K_{AC} + K_{AB} - K_{BC})/2$

$K_{OB} = (K_{AB} + K_{BC} - K_{AC})/2$

$K_{OC} = (K_{AC} + K_{BC} - K_{AB})/2$

According to the Molecular Clock, $K_{OA} - K_{OB} = 0$
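The relative rate calculation can be sketched as follows. The three pairwise distances are hypothetical values chosen for illustration, and the function name is ours; a real test would also need the variance of the difference to judge significance.

```python
def relative_rate(k_ac, k_bc, k_ab):
    """Split pairwise distances into branch lengths from the ancestor O."""
    k_oa = (k_ac + k_ab - k_bc) / 2
    k_ob = (k_ab + k_bc - k_ac) / 2
    k_oc = (k_ac + k_bc - k_ab) / 2
    return k_oa, k_ob, k_oc

# Hypothetical substitutions per site; C is the outgroup
k_oa, k_ob, k_oc = relative_rate(k_ac=0.30, k_bc=0.28, k_ab=0.10)

# Under a strict molecular clock this difference is expected to be zero
clock_deviation = k_oa - k_ob

print(k_oa, k_ob, k_oc, clock_deviation)
```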

Natural Selection

Fitness (w) - A measure of relative ability of organisms to

survive and reproduce in a certain environment. A fitness of

1.0 is typically used as a baseline for comparison. A fitness

value lower than 1.0 indicates that an organism is less likely to

produce viable offspring. Fitness is a genotype by

environment interaction.

Selection coefficient (s) - A measure of how a particular

phenotypic trait alters fitness. Since fitness is measured as

w=1-s, positive selection coefficients denote detrimental

traits.

Malthus - Noted that more offspring are produced than can

survive.

Darwin - Postulated that fitness is heritable; that is, more fit

organisms produce more fit offspring.

Kinds of Selection

Purifying selection - The process by which substitutions

resulting in less fit organisms are removed from the population.

Heterosis - The phenomenon whereby heterozygotes have a

higher fitness than do homozygotes

Frequency-dependent selection - The phenomenon whereby

fitness of a genotype depends upon its frequency in the

population; typically less frequent genotypes are more fit. This

leads to the stable maintenance of polymorphism.

Diversifying Selection - The phenomenon whereby the

fitness conferred by a genotype is strongly dictated by the

environment, leading to the stable maintenance of

polymorphism.

Approaches for Constructing Dendrograms

Phenetics : Relationships are based on overall levels of

similarity. Common methods include:

• UPGMA (unweighted pair-group method with arithmetic means)

• Transformed Distance and other variants

• Fitch-Margoliash

• Neighbor-joining

Cladistics : Relationships are based on the elucidation of shared,

derived characteristics. Common methods include:

• Parsimony

• Evolutionary Parsimony

• Maximum Likelihood

UPGMA I : The Method

• Form clusters starting with most closely related taxa

• Average relationships with other taxa

• Repeat

Original Divergence

       A      B      C      D      E      F
A      -     .08    .19    .32    .28    .55
B             -     .22    .26    .25    .62
C                    -     .31    .28    .59
D                           -     .14    .64
E                                  -     .63

Round 1 : Group taxa A & B

       A,B    C      D      E      F
A,B     -    .205   .28    .265   .585
C             -     .31    .28    .59
D                    -     .14    .64
E                           -     .63

UPGMA II : The Tree

Round 2 : Group taxa D & E

       A,B    C      D,E     F
A,B     -    .205   .2725   .585
C             -     .295    .59
D,E                  -      .635

Round 3 : Group taxon C with (A,B) cluster

          (A,B),C    D,E       F
(A,B),C      -      .28375    .5875
D,E                   -       .635

Round 4 : Group [(A,B),C] cluster with (D,E) cluster

                   ((A,B),C),(D,E)    F
((A,B),C),(D,E)           -          0.61125**

**Note straight average of all taxa is 0.606; this value reflects shared branches

Round 5 : Dendrogram is complete

[Figure: dendrogram of taxa A, B, C, D, E and F; divergence axis from 0.0 to 0.6]
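The clustering rounds can be sketched in code. The version below follows the averaging used in the tables, where the distance from a new cluster to any other cluster is the plain mean of its two constituents' distances; textbook UPGMA instead weights that mean by cluster size, which would be a small change to one line. The function name is ours. Computed directly from the original matrix, the (A,B) to D entry comes out as (.32 + .26)/2 = 0.29 rather than the tabulated .28, so two intermediate values differ slightly from the tables, while the merge order and the final value of 0.61125 are unaffected.

```python
from itertools import combinations

def cluster_by_simple_averaging(taxa, distances):
    """Repeatedly merge the closest pair; returns [(cluster, divergence), ...].

    distances maps frozenset({x, y}) to the divergence between x and y.
    """
    d = dict(distances)
    clusters = list(taxa)
    merges = []
    while len(clusters) > 1:
        # Find the pair of current clusters with the smallest divergence
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        merges.append((merged, d[frozenset((a, b))]))
        rest = [c for c in clusters if c != a and c != b]
        for other in rest:
            # Plain mean of the two merged clusters' distances (as in the tables)
            d[frozenset((merged, other))] = (
                d[frozenset((a, other))] + d[frozenset((b, other))]
            ) / 2
        clusters = rest + [merged]
    return merges

# Upper triangle of the original divergence matrix
upper = {
    "AB": .08, "AC": .19, "AD": .32, "AE": .28, "AF": .55,
    "BC": .22, "BD": .26, "BE": .25, "BF": .62,
    "CD": .31, "CE": .28, "CF": .59,
    "DE": .14, "DF": .64,
    "EF": .63,
}
distances = {frozenset(pair): value for pair, value in upper.items()}

merges = cluster_by_simple_averaging("ABCDEF", distances)
for cluster, divergence in merges:
    print(cluster, round(divergence, 5))
```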

UPGMA III : Significance

So, the previous tree looked robust, but what about this one:

We may be confident with saying A, B & C belong to one group

and D & E belong to another, but are we confident that A is

closer to B than it is to C? In other words, what is the

confidence in the marked (**) branch?

[Figure: second dendrogram of taxa A, B, C, D, E and F; divergence axis from 0.0 to 0.6; one branch is marked **]

Other Distance Methods

Transformed Distance

• Transform distance first as $D^*_{ij} = (D_{ij} - D_{ir} - D_{jr})/2 + c$

• Where r is a referent taxon and c allows for positive values

Fitch and Margoliash

• $D_{ij}$ is again the observed distance and $E_{ij}$ is the tree distance

• Trees are chosen to minimize the following: $s_{FM} = 100 \left[ 2 \sum_{i<j} \{(D_{ij} - E_{ij})/D_{ij}\}^2 / (n^2 - n) \right]^{1/2}$

Minimum Evolution (Simplified as Neighbor Joining)

• In an unrooted, bifurcating tree of n taxa, there are 2n - 3 possible branches; $\lambda_i$ is the length of the ith branch

• The sum of branches is $L = \sum \lambda_i$

• Final tree minimizes L; this is not maximum parsimony since this method is not affected by backward or parallel mutation

Neighborliness

• Consider a tree with n > 3 taxa; assume taxa 1 & 2 are neighbors and $D_{ij}$ is distance between taxa i & j

• Therefore $D_{12} + D_{ij} < D_{1i} + D_{2j}$ AND $D_{12} + D_{ij} < D_{1j} + D_{2i}$

• The best tree maximizes the number of cases in which this is true

Estimating Branch Lengths

Consider 3 taxa in an unrooted tree

• $D_{AB} = x + y$

• $D_{AC} = x + z$

• $D_{BC} = y + z$

So, we can solve as

• $x = (D_{AB} + D_{AC} - D_{BC})/2$

• $y = (D_{AB} + D_{BC} - D_{AC})/2$

• $z = (D_{AC} + D_{BC} - D_{AB})/2$

Now, consider more than three taxa

• Let’s say taxa 1 & 2 were the first to cluster

• These will correspond to taxa ‘A’ and ‘B’ in the three-taxa case as above

• Therefore, x=a and y=b

• Collapse all other taxa into ‘C’, represented as the c/d/e junctions on this phylogeny

• So, $D_{A,C} = D_{1,(3,4,5)} = (D_{1,3} + D_{1,4} + D_{1,5})/3$

• Next we collapse other taxa so that 1,2=A, 4=B and 3,5=C

• Repeat until all lengths are calculated
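The first collapsing step can be sketched as follows. The five-taxon distances are hypothetical: they were generated from a made-up tree in which taxon 1 has a terminal branch of 0.03 and taxon 2 a terminal branch of 0.05, so the expected answers are known in advance. Function names are ours.

```python
def three_taxon_branches(d_ab, d_ac, d_bc):
    """Solve x, y, z for an unrooted three-taxon tree."""
    x = (d_ab + d_ac - d_bc) / 2
    y = (d_ab + d_bc - d_ac) / 2
    z = (d_ac + d_bc - d_ab) / 2
    return x, y, z

def mean_distance(taxon, others, dist):
    """Average distance from one taxon to a collapsed group of taxa."""
    return sum(dist[frozenset((taxon, o))] for o in others) / len(others)

# Hypothetical additive distances among taxa 1..5 (not from the slides)
pairs = {
    (1, 2): .08, (1, 3): .10, (1, 4): .11, (1, 5): .13,
    (2, 3): .12, (2, 4): .13, (2, 5): .15,
    (3, 4): .11, (3, 5): .11, (4, 5): .14,
}
dist = {frozenset(p): v for p, v in pairs.items()}

# Taxa 1 and 2 clustered first; collapse 3, 4 and 5 into a composite 'C'
others = [3, 4, 5]
d_ab = dist[frozenset((1, 2))]
d_ac = mean_distance(1, others, dist)
d_bc = mean_distance(2, others, dist)

a, b, _ = three_taxon_branches(d_ab, d_ac, d_bc)
print(round(a, 4), round(b, 4))
```

Because the assumed distances are perfectly additive, the recovered lengths match the terminal branches used to build them; with real data they would be estimates.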

[Figure: unrooted three-taxon tree (A, B, C) with branches x, y and z; unrooted five-taxon tree (1 to 5) with branches a to g]

Parsimony Methods

Maximum Parsimony

• Ancestral sequences are inferred from extant sequences,

and the tree requiring the minimum number of changes is

computed.

• Branch lengths are computed a number of ways, often

correlated to the number of changes occurring along a

branch.

• The likelihood of parallel and/or backwards events can be

adjusted depending on the data set.

Evolutionary Parsimony

• Usually considers only four taxa

• Transition/transversion bias is used to compute quantities X, Y & Z for the three topologies.

• If only one is significant, then this tree is chosen

Example of Parsimony

Taxon   Sequence
A       G C G G C G G A C C G G G
B       G C G A C A C T C C G G A
C       A C A T T G G A A A T A A
D       G C A T T A C T T A T A G

Types of Sites

• Invariant

• Variant

• Informative

Tree                Support
(A,B) , (C,D)       5
(A,C) , (B,D)       3
(A,D) , (B,C)       1
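The support counts can be reproduced by scanning the alignment column by column. In this sketch a site counts as informative for a four-taxon tree when it splits the taxa into two pairs that each share a base; the function name is ours.

```python
from collections import Counter

sequences = {
    "A": "GCGGCGGACCGGG",
    "B": "GCGACACTCCGGA",
    "C": "ACATTGGAAATAA",
    "D": "GCATTACTTATAG",
}

def tree_support(seqs):
    """Count informative sites favouring each of the three unrooted topologies."""
    support = Counter()
    for a, b, c, d in zip(seqs["A"], seqs["B"], seqs["C"], seqs["D"]):
        if a == b and c == d and a != c:
            support["(A,B),(C,D)"] += 1
        elif a == c and b == d and a != b:
            support["(A,C),(B,D)"] += 1
        elif a == d and b == c and a != b:
            support["(A,D),(B,C)"] += 1
        # every other pattern (invariant, singleton, etc.) is uninformative
    return support

support = tree_support(sequences)
print(dict(support))
```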

Types of Trees

• Unrooted

• Outgroup Rooted

• Midpoint Rooted

Testing Parsimony Trees

Cavender’s test

• For specific numbers of characters, calculates how many characters worse a tree must be to be rejected

Chars    Steps    Chars    Steps    Chars    Steps
3-4      3        21-23    10       43-46    17
5-6      4        24-26    11       47-49    18
7-9      5        27-29    12       50       19
10-11    6        30-33    13       60       22
12-14    7        34-36    14       75       26
15-17    8        37-39    15       100      33
18-20    9        40-42    16

Felsenstein tests

• Work on trees with small numbers of taxa

• First : Tests to see if the number of steps supporting the best tree is significantly lower than the number of steps supporting the next-best tree (S = a-b).

• Second : The number of characters supporting the best tree (C = a)

Chars    S(.05)   C(.05)   Chars    S(.05)   C(.05)
4        4        4        14       6        9
5        5        5        15-16    6        10
6        6        5        17-19    6        11
7        7        6        20       6        12
8        8        6        21       7        12
9-10     5        7        22-23    7        13
11-12    5        8        24-26    7        14
13       5        9        27-28    7        15

Maximum Likelihood Methods

Topology Generation

• Nucleotide positions are considered separately under

models for DNA evolution.

• Topologies are tested for their likelihood of generating the

resulting data set.

• Likelihood is calculated as the sum of the log-likelihoods for each variant site.

• The tree with the highest likelihood is chosen.

Topology testing

• Likelihood calculated for each tree as L = Σln(λi)

• The log-likelihood test uses the differences in likelihood to

eliminate topologies with significantly lower likelihoods

• All other trees are not significantly different; this is a

dendrogram “neighborhood” of equally good trees

• Variant branches can be collapsed to yield a consensus tree

Trade-offs in Alignments

Many kinds of data must be weighed

• Homologous positions must be assigned

• Relative weighting of transitions and transversions in making alignment

• Relative weighting of transitions and transversions in assessing divergence

• Relative weight of assigning a gap

• Relative weight of increasing gap length

• Different for protein coding & non-coding sequences

              Scheme A    Scheme B
Sequence 1    GT-AC       GT-AC
Sequence 2    G-CAG       GC-AG
Sequence 3    GTCAC       GTCAC

• In both schemes, three events occur

• In scheme A, there are two insertion/deletion events and one nucleotide substitution

• In scheme B, there is one insertion/deletion event and two nucleotide substitutions

Testing Complex Trees : A Naïve Approach

• Felsenstein’s and other tests work on small numbers of taxa

• Therefore, we can test specific 4-taxon subsets to determine

which clades are robust

Imagine a complex phylogeny (right); we could

analyze this as follows:

1. Test (A,B) , (C,D) to determine if ‘C’ is

excluded from (A,B) clade

2. Test (D,E) , (F,G) to determine if ‘F’ is

excluded from (D,E) clade

3. Test (G,H) , (I,J) to determine if those are

robust clades

4. Test [(A,C) , (D,F)] AND [(A,B) , (E,F)] AND [(B,B) , (D,F)] to

determine if those clades are distinct.

5. But wait, that’s only 3 of the possible 9 combinations; should

we do all 9? What if 8 support and 1 doesn’t?

6. In test 1, does it matter that the ‘outgroup’ is ‘D’? Should we

try all 7 outgroups?

[Figure: phylogeny of taxa A, B, C, D, E, F, G, H, I and J]

The 1% Inclusion Parsimony Approach

• Examine all trees within 1% of the tree length of the most-

parsimonious tree

• Assign confidence in nodes according to what percentage of

trees include that clade

This is somewhat arbitrary in two ways:

(1) Why are trees within 1% of the most-parsimonious tree length chosen?

(2) How do we interpret “confidence” values?

Testing Trees : Resampling Approaches

The Jackknife

• Resample data points at random without replacement

• If resample size = 50% of the sample size, then the variance of the distribution of the resampled parameter is equivalent to the variance of the original parameter: the factor relating the two variances is a ratio of M to N - M (resample size M, sample size N), which equals 1 when M = N/2, so $\tilde{\sigma}^2 \approx \sigma^2$

• Robust nodes appear in >95% of trees made with resampled datasets; typically 100 - 10,000 trees are examined

The Bootstrap

• Resample N-1 data points at random with replacement

• Recalculate topology as above

• Robust nodes appear in >95% of trees made with resampled datasets; typically 100 - 10,000 trees are examined

Advantages : Method for assessing reliability is independent

of tree construction method

Drawback : Can be computationally intensive
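A bootstrap over alignment columns can be sketched with the four-taxon parsimony example from earlier in these notes. Several choices here are assumptions made for illustration: each replicate draws N columns with replacement (the conventional choice; change `k` to draw N-1), the "tree" for a replicate is simply the topology with the most informative sites, ties go to the first topology listed, and the seed and replicate count are arbitrary.

```python
import random
from collections import Counter

sequences = {
    "A": "GCGGCGGACCGGG",
    "B": "GCGACACTCCGGA",
    "C": "ACATTGGAAATAA",
    "D": "GCATTACTTATAG",
}
TOPOLOGIES = ("(A,B),(C,D)", "(A,C),(B,D)", "(A,D),(B,C)")

def best_topology(columns):
    """Pick the topology supported by the most informative columns."""
    counts = dict.fromkeys(TOPOLOGIES, 0)
    for a, b, c, d in columns:
        if a == b and c == d and a != c:
            counts[TOPOLOGIES[0]] += 1
        elif a == c and b == d and a != b:
            counts[TOPOLOGIES[1]] += 1
        elif a == d and b == c and a != b:
            counts[TOPOLOGIES[2]] += 1
    # max() returns the first topology among ties (a simplification)
    return max(TOPOLOGIES, key=counts.get)

def bootstrap_support(seqs, replicates=500, seed=1):
    """Fraction of resampled datasets that recover each topology."""
    rng = random.Random(seed)
    columns = list(zip(seqs["A"], seqs["B"], seqs["C"], seqs["D"]))
    wins = Counter()
    for _ in range(replicates):
        resampled = rng.choices(columns, k=len(columns))  # with replacement
        wins[best_topology(resampled)] += 1
    return {t: wins[t] / replicates for t in TOPOLOGIES}

support = bootstrap_support(sequences)
print(support)
```

Swapping `rng.choices` for `rng.sample(columns, k=len(columns) // 2)` would turn the same loop into the 50% jackknife described above.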

The Problem With Parsimony

For three taxa there is only 1 unrooted tree and three possible rooted trees: (A,B),C and (A,C),B and (B,C),A.

But these numbers grow fast. For N taxa there are:

• $(2N-3)! / [2^{N-2} (N-2)!]$ rooted trees and

• $(2N-5)! / [2^{N-3} (N-3)!]$ unrooted trees

N     Rooted
2     1
3     3
4     15
5     105
6     945
7     10,395
8     135,135
9     2,027,025
10    34,459,425
12    13,749,310,575
14    7,905,853,580,625
16    6,190,283,353,629,375
18    6,332,659,870,762,850,625
20    8.200 E+021
22    1.311 E+025
24    2.537 E+028
26    5.843 E+031
28    1.579 E+035
30    4.951 E+038
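The tree counts can be computed exactly with integer arithmetic, which avoids the rounding that creeps into large values. A short sketch (function names are ours):

```python
from math import factorial

def rooted_trees(n):
    """(2n-3)! / (2^(n-2) * (n-2)!) rooted bifurcating trees, for n >= 2."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """(2n-5)! / (2^(n-3) * (n-3)!) unrooted bifurcating trees, for n >= 3."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (3, 4, 5, 10, 16, 20):
    print(n, f"{rooted_trees(n):,}", f"{unrooted_trees(n):,}")
```

Note that the number of unrooted trees for N taxa equals the number of rooted trees for N - 1 taxa.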

How Good Is It?

Like distance methods, parsimony will give you a tree,

although you may not get a “most-parsimonious” tree.

How good is it? Consider these two data sets:

Taxon    Data Set 1     Data Set 2
A        GGGCCAATTAA    GGGCCAATTAA
B        GGGCCAATGCC    GGGCCTTGGCC
C        CAATTTTGTCC    AAATTAATTCC
D        CAATTTTGGAA    AAATTTTGGAA
Steps    14             17
Chars    8 of 11        5 of 11

[Figure: tree relating taxa A, B, C and D]

History of Classification

God (4500 B.C.) : not-A groups
Noah (3500 B.C.) : Cladistic characters
Plato (427-347 BC) : Idealized Form
Aristotle (384-322 BC) : Scala Naturae
Hans and Zacharias Janssen (1600) : Microscope
Marcello Malpighi (1628-1694) : Cellular organization
Robert Hooke (1635-1702) : Cells
Anton van Leeuwenhoek (1632-1723) : Described bacteria
Carl von Linné (1707-1778) : Systema Naturae
Otto F. Müller (1730-1784) : 379 Animalcule descriptions
Antoine-Laurent de Jussieu (1748-1836) : Major divisions of plants
Georges Cuvier (1769-1832) : Major animal phyla
Christian Ehrenberg (1795-1876) : Included bacteria in systematics
Georges-Louis Buffon (1707-1788) : Not all species present at the Creation
Thomas Malthus (1766-1834) : Exponential growth
Georges Cuvier (1769-1832) : Catastrophes
Louis Agassiz (1807-1873) : Serial Creation
James Hutton (1726-1797) : Old Earth
Charles Lyell (1797-1875) : Old Earth
Jean-Baptiste Lamarck (1744-1829) : Inheritance of acquired characters
Charles Darwin (1809-1882) : Natural Selection
Louis Pasteur (1822-1895) : Microbial processes
Ernst Haeckel (1834-1919) : Evolutionary classification
Edouard Chatton (1883-1947) : Prokaryote/eukaryote dichotomy
Herbert Copeland (1902-1968) : Reclassification
Robert Whittaker (1924-1980) : “Modern” classification
Emil Zuckerkandl & Linus Pauling : Molecular clocks
Motoo Kimura and Tom Jukes : Neutral theory
Carl Woese and George Fox : Molecular phylogeny
Naoyuki Iwabe & Takashi Miyata : Rooting the tree of life via EF’s
Peter Gogarten : Rooting the tree of life via ATPases
Brian Golding & Radhey Gupta : Eukarya by Fusion

God (4500 BC)

Heavens Earth

Noah (3500 BC)

Animals Plants

Phylogeny I

Yet the “Heavens” have no defining characteristics

Potential Introduction of Hierarchy in Classification

Living Things

Plants Animals

Aristotle (350 BC)

Animal Vegetable

Mineral

Aristotle (350 BC)

Air Earth

Fire Water

Phylogeny II

Classification, but lacking Hierarchy

Classification, but lacking Hierarchy

Linnaeus (AD 1743)

Animalia Plantae

Infusoria

Chatton (AD 1937)

Prokaryotes

Eukaryotes

Phylogeny III

Completely Hierarchical (even to non-living things!)

Animals

Vertebrates

Invertebrates

Birds Mammals

Introduced Polarity, or Time, Into Lines of Phylogenetic Descent

Copeland (AD 1956)

Animalia Plantae Fungi

Protista Monera

Whittaker (AD 1959)

Animalia Plantae Protista Fungi Monera

Eukaryotes Prokaryotes

- Incorporation of Chatton’s distinction

Association Coefficients Between Representative Members of the Three Primary Kingdoms

Organism 1 2 3 4 5 6 7 8 9 10 11 12 13

S. cerevisiae - .29 .33 .05 .06 .08 .09 .11 .08 .11 .11 .08 .08

Lemna minor .29 - .36 .10 .05 .06 .10 .09 .11 .10 .10 .13 .07

.33 .36 - .06 .06 .07 .07 .09 .06 .10 .10 .09 .07

Escherichia coli .05 .10 .06 - .24 .25 .28 .26 .21 .11 .12 .07 .12

Chlorobium vibrioforme .06 .05 .06 .24 - .22 .22 .20 .19 .06 .07 .06 .09

Bacillus firmus .08 .06 .07 .25 .22 - .34 .26 .20 .11 .13 .06 .12

C. diptheriae .09 .10 .07 .28 .22 .34 - .23 .21 .12 .12 .09 .10

Aphanocapsa 6714 .11 .09 .09 .26 .20 .26 .23 - .31 .11 .11 .10 .10

Chloroplast (Lemna) .08 .11 .06 .21 .19 .20 .21 .31 - .14 .12 .10 .12

Methanobacterium th. .11 .10 .10 .11 .06 .11 .12 .11 .14 - .51 .25 .30

M. ruminantium .11 .10 .10 .12 .07 .13 .12 .11 .12 .51 - .25 .24

Methanobacterium sp. .08 .13 .09 .07 .06 .06 .09 .10 .10 .25 .25 - .32

Methanosarcina barkeri .08 .07 .07 .12 .09 .12 .10 .10 .12 .30 .24 .32 -

Similarities determined by SAB analysis of ribosomal RNAs

Woese (AD 1977)

Eukaryotes Eubacteria

Archaebacteria

Woese, implied (AD 1977)

Eukaryotes

Eubacteria Archaebacteria

Phylogeny VI

Rooting the Tree of Life: Use of Duplicate Genes

[Figure: three candidate topologies, tips in the order drawn: (1) Archaea, Eucarya, Eubacteria, Outgroup gene; (2) Eucarya, Eubacteria, Archaea, Outgroup gene; (3) Archaea, Eubacteria, Eucarya, Outgroup gene]

Probability of Recovering the Specified Relationships

Gene            (1)     (2)     (3)
EF-Tu           0.96    0.03    0.01
EF-G            0.79    0.21    0.00
ATPase F1-      1.00    0.00    0.00
ATPase F1-      1.00    0.00    0.00
tRNA Met-E      0.55    0.33    0.12
tRNA Met-I      0.50    0.41    0.09

Golding and Gupta (AD 1995)

Eukaryotes Eubacteria Archaebacteria

Mitochondrion

Cyanobacteria

Iwabe (AD 1989) & Gogarten (1989)

Eukarya Bacteria Archaea

Phylogeny VII

Gene Families

• Most genes have homologues in closely-related taxa whose

products perform similar functions; these pairs of genes are

called orthologues

• In addition, many genes have homologues within the same

genome which perform some different function; these pairs

of genes are called paralogues

What Functions Do Paralogues Play?

• They may act on different substrates (e.g., an enzyme with a

different binding site, or a protein kinase with a different

target)

• They may act in different tissue types or at different

developmental stages (e.g., embryonically-expressed globin

genes)

• They may be regulated in response to differential

environmental conditions to perform the same job, but for

different reasons (e.g., nitrate reductase for reducing nitrate

as an electron sink, or reducing it to provide ammonia for

assimilation)

How Do Gene Families Arise?

The Classic Model

• A gene duplicates within the genome; typically, an unequal-

crossing-over event is invoked

• The additional copy is free to evolve a novel function, or

novel regulatory regime, since the original copy performs its

original function

• Both copies are then maintained by selection

Yet this scenario is not as rosy as it sounds….

Problems with the Classic Model

• Dosage compensation after duplication may select for

organisms that have eliminated the duplicated copy

• After duplication, there is no selection to prevent deleterious

mutations from eliminating gene function.

• If an advantageous mutation arises, it must have a

sufficiently large benefit that elimination of this newly-created

form by mutation and drift is counter selected; this is difficult

in small populations, especially those seen in eukaryotes

• After duplication and gain of advantageous function, gene

conversion may homogenize the two copies

• Duplication of a single gene may be insufficient to provide for

a novel phenotype; for example, a new signaling cascade

will require a new receptor, MAPK, MAPKK, MAPKKK, etc.

An Alternative Model

• Genes “duplicate” every time an organism reproduces.

• Consider a population (Pop’n A) where an entire pathway

experiences selection to perform an alternative function; in

Population B, the original function is maintained

• This function will likely never be achieved, since it would

require abandoning its original function, which may be

essential

• However, substantial headway may be gained in pursuing

the alternative form at the expense of the original form

• Admixture of population A and B produces heterozygotes

with an advantage; that is, both pathways are now found in

the same cell, leading to heterosis.

• This is an unstable state, since only 50% of the progeny of

heterozygotes are also heterozygotes.

• NOW, unequal crossing-over or other chromosome

gymnastics will allow for presence of both pathways in all

offspring.

• In this model, genes diverge UNDER SELECTION, prior to

reintroduction into the same cytoplasm (duplication).

Keeping Pathways Together I

• What prevents “mixing” of the genes of the two pathways via

gene-conversion and meiosis?

• R.A. Fisher (1930) proposed that natural selection could

maintain groups of cooperating advantageous genes; this

idea was extended by Botstein and Suskind to suggest that

this selection would lead to clustering of these genes on the

eukaryotic chromosome.

• Consider two loci, each with two alleles (A & a; B & b)

• Consider A works well with B, and a works well with b

• Therefore, the fitness of AB/AB cells and ab/ab cells would

be higher than heterozygotes, especially Ab/Ab or aB/aB.

• This would lead to APPARENT linkage disequilibrium

between loci A & B due to counterselection of the classes

of heterozygotes.

• This selection would lead to an increase in the ACTUAL

linkage disequilibrium (decrease in chromosomal distance)

so that heterozygote disruption of coadapted gene

complexes would be minimized.

• This model requires high-frequency recombination.

Keeping Pathways Together II

• The above model does not work for haploid organisms

with minimal amounts of recombination that could disrupt

coadapted gene complexes.

• Yet in these organisms, coadapted gene complexes are

found in very tight clusters (bacterial operons).

• How do genes attain such tight clustering, especially since

the primary mechanism for juxtaposition (deletion) would

likely remove important genes from the chromosome?

What genes are clustered?

• The bacterial operon allows for coregulation of genes, as

well as reducing their disruption by recombination.

• Yet coregulation is not a plausible influence for selection

for the origin of the gene cluster, since a very tiny

advantage would be conferred by adding only one gene to

a cluster at a time.

• Moreover, virtually none of the very important,

coordinately-regulated genes are found in operons

• In contrast, many operons encode peripheral metabolic

functions with lower selective value.

Keeping Pathways Together III

• Therefore, we must consider that the selection for the

ORIGIN of a feature may not be the same as the selection

for the MAINTENANCE of a feature.

• One advantage to a gene cluster is that it allows

mobilization of all of the genes responsible for a selectable

function or phenotype to be transferred in a single step;

transfer of one individual gene would not.

• After transfer of genes that are only moderately clustered,

intervening genes would be deleted, since only those

genes under selection would be maintained; this results in

a tight gene cluster or operon, which can be expressed by

a host promoter at the site of insertion.

• The operon exploits the capability of prokaryotes to direct

the synthesis of numerous proteins from a single transcript

• Transfer of operons among bacteria and from bacteria to

eukaryotes is a powerful mechanism for allowing

recipients to gain novel metabolic capabilities

Bacterial genes are organized into clusters

Both bacterial and eukaryotic genes can be clustered via selection for proximity to cis-acting regulatory sequences

β−Globin Locus : Developmental regulation via proximity to LCR

Both bacterial and eukaryotic genes can be clustered via selection for proximity to cis-acting genes
