Bioinformatics I, WS’14/15, D. Huson, December 15, 2014
7 Introduction to Population Genetics
This chapter is closely based on a tutorial given by Stephan Schiffels (currently Sanger Institute) at the Australian Centre for Ancient DNA in November 2014. The text follows his script very closely, with his permission.
7.1 Questions
The aim of today’s lecture is to answer three questions:
1. What can we say about the time to the most recent common ancestor between you and the Queen of England?
2. How different or similar is the DNA sequence of you and the Queen of England?
3. How did our ancestral population size change through time?
7.2 Ancestors
In a simple model of human populations, the number of ancestors that an individual has doubles when going back one generation:
(figure by Stephan Schiffels)
Number of ancestors as a function of number of generations g:
$$A(g) = 2^g.$$
Problem: After ≈ 32 generations this exceeds the number of living humans.
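The doubling argument can be checked with a few lines of Python. This is a minimal sketch: the world population of 7.2 billion (roughly the 2014 figure) and the function names are assumptions that do not come from the lecture.

```python
def ancestors(g):
    """Number of ancestors g generations back under pure doubling: A(g) = 2^g."""
    return 2 ** g


def first_generation_exceeding(population):
    """Smallest g for which 2^g exceeds the given population size."""
    g = 0
    while ancestors(g) <= population:
        g += 1
    return g


# Assumed world population (about 7.2 billion; not stated in the text).
WORLD_POPULATION = 7_200_000_000

g_star = first_generation_exceeding(WORLD_POPULATION)
print(g_star, ancestors(g_star))
```

Under this assumption, 2^32 (about 4.3 billion) is still below the threshold and 2^33 (about 8.6 billion) is above it, in line with the rough figure of 32 generations quoted above.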
7.3 Coalescence events
In a family tree in a finite population, it will occasionally happen that two (or more) different ancestors in the same generation share an ancestor in the previous generation. This is called a coalescence event:
(figure by Stephan Schiffels)
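Model of growth: Here is a more realistic model of the growth of the number of ancestors as a function of the number of generations g:
$$A(g) = \begin{cases} 2^g & \text{if } 2^g \ll N_{\text{eff}} \\ \longrightarrow N_{\text{eff}} & \text{else,} \end{cases}$$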
where Neff is the so-called effective population size.
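The saturating model can be sketched as follows. Note the simplification: the model above approaches Neff gradually, whereas this sketch assumes a hard cap that switches abruptly once 2^g reaches Neff. The function name is hypothetical, and the value 15 000 is the North-European figure quoted in Section 7.4.

```python
def ancestors_capped(g, n_eff):
    """Ancestors g generations back, capped at the effective population size.

    Simplification: the lecture's model approaches N_eff gradually,
    whereas this sketch switches abruptly once 2^g reaches N_eff.
    """
    return min(2 ** g, n_eff)


N_EFF = 15_000  # North-European value quoted in Section 7.4

for g in (5, 10, 13, 14, 20):
    print(g, ancestors_capped(g, N_EFF))
```

With this cap, doubling holds up to g = 13 (8192 ancestors) and the count stays at 15 000 from g = 14 onward.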
7.4 Effective population size
The effective population size
• reflects the long-term population size of a population (in humans, that is across several hundred thousand years)
• reflects the effective number of people from whom you “randomly” choose your mates, assuming a so-called panmictic population: a well-mixed, randomly mating population.
Some concrete numbers:
• North-Europeans: Neff ≈ 15 000
• West-Africans: Neff ≈ 20 000
• Native Americans: Neff ≈ 12 000
7.5 Answer to question 1
Question 1: Time to most recent common ancestor: What can we say about the time to the most recent common ancestor between you and the Queen of England?
The approximate probability of sharing an ancestor with someone else g generations ago is:
$$\frac{(2^g)^2}{N_{\text{eff}}}.$$
This is approximately 1 in only 7 generations:
$$\frac{(2^g)^2}{N_{\text{eff}}} = \frac{(2^7)^2}{15\,000} = \frac{128^2}{15\,000} = \frac{16\,384}{15\,000} \approx 1.$$
How strongly does this answer depend on the assumed effective population size? While the number of ancestors that lived g generations ago depends strongly on this, the number of generations that one must go back to find a last common ancestor does not:
(figure by Stephan Schiffels)
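The weak dependence on the effective population size can be illustrated numerically. The sketch below evaluates the approximation (2^g)^2 / Neff and finds the first generation at which it reaches 1. The cap at 1 and the function names are assumptions; the first three Neff values are those listed in Section 7.4, and the value 1 000 000 is a hypothetical large population added for contrast.

```python
def shared_ancestor_probability(g, n_eff):
    """Approximate probability (2^g)^2 / N_eff, capped at 1."""
    return min(1.0, (2 ** g) ** 2 / n_eff)


def generations_until_certain(n_eff):
    """Smallest g at which the approximation (2^g)^2 / N_eff reaches 1."""
    g = 0
    while (2 ** g) ** 2 < n_eff:
        g += 1
    return g


# 12 000, 15 000 and 20 000 are from Section 7.4; 1 000 000 is hypothetical.
for n_eff in (12_000, 15_000, 20_000, 1_000_000):
    print(n_eff, generations_until_certain(n_eff))
```

Because the numerator grows by a factor of 4 per generation, even an Neff that is almost 70 times larger only shifts the answer from 7 to 10 generations.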
7.6 Ancestry of two or more genes
We now model the genealogy of two or more genes (or, more precisely, alleles of a gene) backward until their most recent common ancestor is found:
(figure by Stephan Schiffels)
This looks complicated. A central idea is to ignore all genes that are not passed down to the current-day set of interest. This is also called looking backward in time.
To study the genealogy of a set of genes, we start at the present, and move backward in time, generation by generation, modeling individual coalescence events:
(figure by Stephan Schiffels)
7.7 Coalescence theory with a pair of samples
Definition 7.7.1 (Basic coalescence theory) Basic assumptions of coalescence theory for a pair of samples¹:
• The population has size $N$, with $2N$ gene copies.
• The probability $P(t)$ that two genes do not share a common ancestor within the last $t$ generations is $P(t) = \left(1 - \frac{1}{2N}\right)^t$.
• For very large populations, $N \to \infty$, this is well approximated by $P(t) \approx e^{-t/(2N)}$.
• So the waiting time to a coalescence event between two lineages is exponentially distributed with mean $T_2 = \langle t_{\text{coal}} \rangle = 2N$ generations.
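The quality of the large-population approximation can be checked numerically. The following sketch compares the exact expression with the exponential approximation; the function names are hypothetical, and N = 15 000 is simply the value used elsewhere in this chapter.

```python
import math


def p_no_coalescence(t, n):
    """Exact probability that two lineages have not coalesced after t generations."""
    return (1.0 - 1.0 / (2 * n)) ** t


def p_no_coalescence_approx(t, n):
    """Large-N approximation exp(-t / 2N)."""
    return math.exp(-t / (2 * n))


N = 15_000  # population size used elsewhere in this chapter

for t in (1_000, 10_000, 30_000, 100_000):
    exact = p_no_coalescence(t, N)
    approx = p_no_coalescence_approx(t, N)
    print(t, round(exact, 5), round(approx, 5))

# In the discrete model the waiting time is geometric with success
# probability 1/(2N), so its mean is 2N generations.
mean_waiting_time = 2 * N
```

For a population of this size the two expressions agree to several decimal places, which is why the continuous exponential form is used from here on.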
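Recall²:
If the cumulative distribution function of an exponential distribution is:
$$F(x;\lambda) = \begin{cases} 1 - e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{else,} \end{cases}$$
then the mean is $1/\lambda$. Here the probability of no coalescence is $P(t) = 1 - F(t;\lambda)$ with $\lambda = 1/(2N)$, which gives the mean $2N$.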
7.8 Genetic diversity
To model genetic diversity, we add mutations to our simple model.
• Mutations occur with probability µ per generation per site.
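• The mean $t_{\text{MRCA}}$ (time to the most recent common ancestor) of two genes is $2N$ generations.
• The two lineages leading back to this ancestor have a combined length of $4N$ generations, so the number of mutations that we expect per site between two genes is $4N\mu$.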
(figure by Stephan Schiffels)
• The site heterozygosity is given by Θ = 4Nµ.
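Estimator for population size: This gives us an estimator for population size:
$$\Theta = 4N\mu,$$
where $\Theta$ is the fraction of heterozygous positions in the genome and $N$ is the effective population size. If the mutation rate $\mu$ is known, $N$ can be estimated as $\Theta/(4\mu)$.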
This simple formula encapsulates a deep relationship between a purely genomic property (the heterozygosity) and a population level quantity (the effective population size).
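Inverting the formula gives the population size from a measured heterozygosity. The sketch below is illustrative only: the function name is hypothetical, and the input values are the ones used in Section 7.9, not new measurements.

```python
def estimate_population_size(theta, mu):
    """Invert Theta = 4 N mu to obtain N = Theta / (4 mu)."""
    return theta / (4.0 * mu)


# Values from Section 7.9: Theta = 0.00075, mu = 1.25e-8 per site per generation.
n_hat = estimate_population_size(theta=0.00075, mu=1.25e-8)
print(round(n_hat))
```

The estimate is only as good as the assumed mutation rate, since an error in µ translates directly into a proportional error in N.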
¹ J.F.C. Kingman, On the Genealogy of Large Populations, J. of Applied Probability, 19:27-43 (1982)
² http://en.wikipedia.org/wiki/Cumulative_distribution_function
7.9 Answer to question 2
Question 2: Sequence similarity: How similar is the DNA sequence of the Queen of England and of you?
Consider a single chromosome and compare the Queen’s copy with your copy.
Using $N = 15\,000$ and $\mu = 1.25 \times 10^{-8}$, we get:
$$\Theta = 4N\mu = 4 \times 15\,000 \times 1.25 \times 10^{-8} = 0.00075.$$
Hence:
The Queen’s and your chromosome differ at about 1 in 1333 sites. (Note that $1333 \approx 1/0.00075$.)
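The arithmetic can be reproduced as follows. The sequence length of one million sites is a hypothetical value chosen only to turn the rate into a count; it is not a real chromosome length, and the function name is likewise an assumption.

```python
def heterozygosity(n, mu):
    """Expected per-site heterozygosity Theta = 4 N mu."""
    return 4 * n * mu


theta = heterozygosity(15_000, 1.25e-8)
sites_per_difference = 1 / theta

# Hypothetical stretch of sequence, used only for illustration.
HYPOTHETICAL_LENGTH = 1_000_000
expected_differences = theta * HYPOTHETICAL_LENGTH

print(theta, round(sites_per_difference), round(expected_differences))
```

Under these assumptions, two copies of a one-million-site stretch would be expected to differ at roughly 750 positions.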
7.10 Mutations on a coalescence tree
Recall that the probability of two samples not coalescing in time t is:
$$P_2(t) = \left(1 - \frac{1}{2N}\right)^t \approx e^{-\frac{t}{2N}}.$$
The probability of i samples not coalescing in time t is:
$$P_i(t) = \left(1 - \binom{i}{2}\frac{1}{2N}\right)^t = \left(1 - \frac{i(i-1)}{2}\,\frac{1}{2N}\right)^t \approx e^{-\frac{i(i-1)}{4N}t}.$$
Mean waiting time for coalescence events: So, the waiting times $T_i$ are exponentially distributed with mean³:
$$\langle T_i \rangle = \frac{4N}{i(i-1)}.$$
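The mean waiting times shrink quickly as the number of lineages grows, which the following sketch tabulates. The function names are hypothetical, and N = 15 000 is the value used elsewhere in this chapter.

```python
from math import comb


def coalescence_probability(i, n):
    """Per-generation probability that some pair among i lineages coalesces."""
    return comb(i, 2) / (2 * n)


def mean_waiting_time(i, n):
    """Mean waiting time 4N / (i (i - 1)) while i lineages remain."""
    return 4 * n / (i * (i - 1))


N = 15_000

for i in range(2, 7):
    # The mean is the inverse of the per-generation coalescence probability.
    print(i, mean_waiting_time(i, N), 1 / coalescence_probability(i, N))
```

For i = 2 this reproduces the pairwise mean of 2N = 30 000 generations, while with 5 lineages the next coalescence is expected after only 3000 generations.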
(figure by Stephan Schiffels)
Given n samples and waiting times $T_i$, the expected total branch length is:
$$\langle T \rangle = \sum_{i=2}^{n} i \langle T_i \rangle = 4N \sum_{i=2}^{n} \frac{1}{i-1} = 4N \sum_{i=1}^{n-1} \frac{1}{i}.$$
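The closed form can be compared against a small Monte Carlo simulation of the coalescent. This is a sketch under the continuous-time approximation, with hypothetical function names; the seed, sample size and number of replicates are arbitrary choices made for illustration.

```python
import random


def expected_total_branch_length(n_samples, n):
    """Closed form 4N * sum_{i=1}^{n-1} 1/i."""
    return 4 * n * sum(1.0 / i for i in range(1, n_samples))


def simulate_total_branch_length(n_samples, n, rng):
    """Draw one coalescent tree and return its total branch length."""
    total = 0.0
    for i in range(n_samples, 1, -1):
        rate = i * (i - 1) / (4 * n)  # inverse of the mean waiting time <T_i>
        t_i = rng.expovariate(rate)   # waiting time while i lineages remain
        total += i * t_i              # all i branches are extended by t_i
    return total


rng = random.Random(1)  # fixed seed, chosen arbitrarily for reproducibility
N = 15_000
N_SAMPLES = 10
REPLICATES = 20_000

simulated_mean = sum(
    simulate_total_branch_length(N_SAMPLES, N, rng) for _ in range(REPLICATES)
) / REPLICATES
theory = expected_total_branch_length(N_SAMPLES, N)
print(round(simulated_mean), round(theory))
```

Each interval with i lineages contributes i times its duration, which is exactly the factor i in the sum above.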
Hence, the expected number of mutations per site anywhere on the tree is:
$$\langle S \rangle = \mu \langle T \rangle = \mu\, 4N \sum_{i=1}^{n-1} \frac{1}{i} = \Theta \sum_{i=1}^{n-1} \frac{1}{i}.$$
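A short sketch shows how slowly the expected number of mutations grows with the number of samples. The function names are hypothetical; N and µ are the values used in Section 7.9.

```python
def harmonic_sum(n_samples):
    """Sum_{i=1}^{n-1} 1/i for a sample of size n."""
    return sum(1.0 / i for i in range(1, n_samples))


def expected_mutations_per_site(n_samples, n, mu):
    """Expected mutations per site on the tree: Theta * sum_{i=1}^{n-1} 1/i."""
    theta = 4 * n * mu
    return theta * harmonic_sum(n_samples)


N = 15_000
MU = 1.25e-8

for n_samples in (2, 10, 100):
    print(n_samples, expected_mutations_per_site(n_samples, N, MU))
```

For two samples the result reduces to Θ itself, and because the harmonic sum grows only logarithmically, adding further samples contributes progressively fewer new mutations.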
³ Kingman, 1982
7.11 Two famous estimators of genetic diversity
How to estimate the quantity Θ = 4Nµ (heterozygosity) from genome data?
Consider n sequences of length L.
Definition 7.11.1 (Tajima’s estimator) Tajima’s estimator is the mean proportion of pairwise differences between any two sequences:
$$\Theta_\pi = \frac{\text{mean nr of pairwise differences}}{L}.$$
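Definition 7.11.2 (Watterson estimator) The Watterson estimator is based on the number of segregating sites, normalized by the sequence length and the harmonic sum from Section 7.10:
$$\Theta_W = \frac{\text{nr of segregating sites}}{L \sum_{i=1}^{n-1} 1/i}.$$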
(figures by Stephan Schiffels)
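Both estimators can be computed directly from a set of aligned sequences. The following is a minimal sketch with hypothetical function names; it assumes equal-length sequences without gaps or missing data, and the toy alignment is made up for illustration.

```python
from itertools import combinations


def tajima_estimator(sequences):
    """Mean number of pairwise differences, divided by the sequence length."""
    length = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_differences / len(pairs) / length


def watterson_estimator(sequences):
    """Number of segregating sites, divided by L * sum_{i=1}^{n-1} 1/i."""
    length = len(sequences[0])
    n = len(sequences)
    # A site is segregating if more than one allele occurs in its column.
    segregating = sum(len(set(column)) > 1 for column in zip(*sequences))
    harmonic = sum(1.0 / i for i in range(1, n))
    return segregating / (length * harmonic)


# Hypothetical toy alignment (made up for illustration, not real data).
toy_alignment = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAC",
    "ACGTACGTTC",
]

theta_pi = tajima_estimator(toy_alignment)
theta_w = watterson_estimator(toy_alignment)
print(theta_pi, theta_w)
```

In this toy alignment both variable sites are singletons, so the Watterson estimate comes out slightly higher than Tajima's estimate.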
7.12 Answer to question 3
Question 3: demographic history: How did our ancestral population size change through time?
Three possible simple answers: The population has been
(a) constant
(b) declining
(c) expanding
(figure by Stephan Schiffels)
Interestingly, we can distinguish between these three possible scenarios by only comparing existing genomes.
7.13 Determining demographic history
We compare Tajima’s estimator and Watterson’s estimator to get a useful measure:
Definition 7.13.1 (Tajima’s D) Define
$$D = \frac{\Theta_\pi - \Theta_W}{\sqrt{\mathrm{Var}(\Theta_\pi - \Theta_W)}}$$
If the population size is constant, then we should have D ≈ 0.
If the population size has been increasing, then more mutations will have occurred on leaf edges, thus affecting fewer pairs, causing D to be negative.
If the population size has been decreasing, then more mutations will have occurred on inner edges, thus affecting more pairs, causing D to be positive.
So, Tajima’s D tells us something about the history of a population.
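The sign behaviour of D can be illustrated with a small sketch. The lecture does not give a formula for the variance term, so this example assumes the standard constants from Tajima (1989). It also works with counts (mean pairwise differences and number of segregating sites) rather than per-site values, which leaves D unchanged because the sequence length cancels. The function name and the two example inputs are hypothetical.

```python
import math


def tajimas_d(n, mean_pairwise_differences, segregating_sites):
    """Tajima's D from sample size n, mean pairwise differences and S.

    Assumption: the variance term uses the standard constants from
    Tajima (1989); they are not given in the lecture text.
    """
    s = segregating_sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    return (mean_pairwise_differences - s / a1) / math.sqrt(variance)


# Hypothetical sample of 4 sequences with 4 segregating sites.
# Case 1: every mutation is private to one sequence (leaf edges).
d_singletons = tajimas_d(4, 2.0, 4)
# Case 2: every mutation is shared by two of the four sequences (inner edges).
d_shared = tajimas_d(4, 16 / 6, 4)
print(d_singletons, d_shared)
```

With the same number of segregating sites, mutations on leaf edges give a negative D and mutations on inner edges give a positive D, matching the two scenarios described above.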
7.14 Summary
The simple coalescence model allows us to:
• Estimate the time to the last common ancestor of individuals of a population.
• Estimate how similar the DNA of different individuals of a population is.
• Make statements about the “shape” of the recent history of a population.