7 introduction to population genetics - uni-tuebingen.de · bioinformatics i, ws’14/15, d. huson,...


Bioinformatics I, WS’14/15, D. Huson, December 15, 2014 107

7 Introduction to Population Genetics

This chapter is closely based on a tutorial given by Stephan Schiffels (currently Sanger Institute) at the Australian Centre for Ancient DNA in November 2014. This text is based very closely on his script, with his permission.

7.1 Questions

The aim of today’s lecture is to answer three questions:

1. What can we say about the time to the most recent common ancestor between you and the Queen of England?

2. How different or similar is the DNA sequence of you and the Queen of England?

3. How did our ancestral population size change through time?

7.2 Ancestors

In a simple model of human populations, the number of ancestors that an individual has doubles when going back one generation:

(figure by Stephan Schiffels)

Number of ancestors as a function of number of generations g:

$$A(g) = 2^g.$$

Problem: After ≈ 32 generations this exceeds the number of living humans.
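A quick numerical check of this doubling argument: the sketch below finds the first generation at which $2^g$ exceeds a given head count. The world population of 7.2 billion (roughly the 2014 figure) is an assumption supplied for illustration, not a number from the lecture.

```python
def ancestors(g: int) -> int:
    """Number of ancestors g generations back in the naive doubling model."""
    return 2 ** g


def first_generation_exceeding(limit: int) -> int:
    """Smallest g with 2**g > limit."""
    g = 0
    while ancestors(g) <= limit:
        g += 1
    return g


WORLD_POPULATION = 7_200_000_000  # assumed value, roughly the 2014 figure

g = first_generation_exceeding(WORLD_POPULATION)
print(g, ancestors(g))
```

With this assumed population the crossover falls at g = 33, in line with the approximate value quoted above.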

7.3 Coalescence events

In a family tree in a finite population, it will occasionally happen that two (or more) different ancestors in the same generation share an ancestor in the previous generation. This is called a coalescence event:


(figure by Stephan Schiffels)

Model of growth: Here is a more realistic model of the growth of the number of ancestors as a function of the number of generations g:

$$A(g) = \begin{cases} 2^g & \text{if } 2^g \ll N_{\text{eff}} \\ \to N_{\text{eff}} & \text{else,} \end{cases}$$

where $N_{\text{eff}}$ is the so-called effective population size.

7.4 Effective population size

The effective population size

• reflects the long-term size of a population (in humans, across several hundred thousand years)

• reflects the effective number of people from whom you “randomly” choose your mates; this assumes a so-called panmictic population: a well-mixed, randomly mating population.

Some concrete numbers:

• North-Europeans: Neff ≈ 15 000

• West-Africans: Neff ≈ 20 000

• Native Americans: Neff ≈ 12 000

7.5 Answer to question 1

Question 1: Time to recent common ancestor: What can we say about the time to the most recent common ancestor between you and the Queen of England?

The approximate probability of sharing an ancestor with someone else g generations ago is:

$$\frac{(2^g)^2}{N_{\text{eff}}}.$$

This is approximately 1 after only 7 generations:

$$\frac{(2^g)^2}{N_{\text{eff}}} = \frac{(2^7)^2}{15\,000} = \frac{128^2}{15\,000} = \frac{16\,384}{15\,000} \approx 1.$$

How strongly does this answer depend on the assumed effective population size? While the number of ancestors that lived g generations ago depends strongly on this, the number of generations that one must go back to find a last common ancestor does not:


(figure by Stephan Schiffels)
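The weak dependence on the effective population size can be made concrete. Setting $(2^g)^2 / N_{\text{eff}} = 1$ gives $g = \tfrac{1}{2}\log_2 N_{\text{eff}}$, so the required number of generations grows only logarithmically with population size. A minimal sketch, using the three effective population sizes listed in Section 7.4 (the function names are illustrative):

```python
import math


def shared_ancestor_score(g: int, n_eff: float) -> float:
    """Approximate probability (2^g)^2 / N_eff; only meaningful while it stays below about 1."""
    return (2 ** g) ** 2 / n_eff


def generations_to_common_ancestor(n_eff: float) -> float:
    """Solve (2^g)^2 / N_eff = 1 for g."""
    return 0.5 * math.log2(n_eff)


for label, n_eff in [("Native Americans", 12_000),
                     ("North-Europeans", 15_000),
                     ("West-Africans", 20_000)]:
    print(f"{label}: g = {generations_to_common_ancestor(n_eff):.2f}")
```

All three populations give a value close to 7 generations.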

7.6 Ancestry of two or more genes

We now model the genealogy of two or more genes (or, more precisely, alleles of a gene) backward until their most recent common ancestor is found:

(figure by Stephan Schiffels)

This looks complicated. A central idea is to ignore all genes that are not passed down to the current-day set of interest. This is also called looking backward in time.

To study the genealogy of a set of genes, we start at the present, and move backward in time, generation by generation, modeling individual coalescence events:

(figure by Stephan Schiffels)


7.7 Coalescence theory with a pair of samples

Definition 7.7.1 (Basic coalescence theory) Basic assumptions of coalescence theory of a pair of samples¹:

• Population has size N, with 2N gene copies.

• The probability P(t) of two genes not having the same ancestor in t generations is given by $P(t) = \left(1 - \frac{1}{2N}\right)^t$.

• In the limit of very large populations, $N \to \infty$, we have $P(t) \approx e^{-t/(2N)}$.

• So the waiting time to a coalescence event between two lineages is exponentially distributed with mean $T_2 = \langle t_{\text{coal}} \rangle = 2N$.

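Recall²:

If the cumulative distribution function of an exponential distribution is:

$$F(x; \lambda) = \begin{cases} 1 - e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{else,} \end{cases}$$

then the mean is $1/\lambda$. Here $P(t) = 1 - F(t; \lambda)$ with $\lambda = 1/(2N)$, which gives the mean $2N$.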
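These assumptions are easy to check by simulation. In the sketch below, two lineages coalesce with probability $1/(2N)$ in each generation, so the waiting time is geometric; it is sampled by inverse transform rather than by an explicit generation-by-generation loop. The population size N = 1000, the number of samples, and the seed are arbitrary illustrative choices.

```python
import math
import random


def sample_pair_coalescence_time(n: int, rng: random.Random) -> int:
    """Generations until two lineages coalesce; geometric with p = 1/(2N)."""
    p = 1.0 / (2 * n)
    u = 1.0 - rng.random()  # uniform in (0, 1]
    return int(math.log(u) / math.log(1.0 - p)) + 1


def prob_not_coalesced(t: int, n: int) -> float:
    """Exact discrete-generation probability P(t) = (1 - 1/(2N))^t."""
    return (1.0 - 1.0 / (2 * n)) ** t


N = 1000                  # illustrative population size
rng = random.Random(42)   # fixed seed for reproducibility
times = [sample_pair_coalescence_time(N, rng) for _ in range(20_000)]
mean_time = sum(times) / len(times)

print(f"simulated mean: {mean_time:.0f}, theory 2N = {2 * N}")
print(f"P(t=2N) exact: {prob_not_coalesced(2 * N, N):.4f}, "
      f"exponential: {math.exp(-1):.4f}")
```

The simulated mean should land close to 2N, and the exact discrete probability should agree with the exponential approximation to about three decimal places for this N.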

7.8 Genetic diversity

To model genetic diversity, we add mutations to our simple model.

• Mutations occur with probability µ per generation per site.

• The mean tMRCA (time to the most recent common ancestor) between two genes is 2N generations.

• So, the number of mutations per site that we expect between two genes is 4Nµ, because the two lineages together span 2 × 2N = 4N generations.

(figure by Stephan Schiffels)

• The site heterozygosity is given by Θ = 4Nµ.

Estimator for population size: This gives us an estimator for population size:

$$\Theta = 4N\mu,$$

where Θ is the fraction of heterozygous positions in the genome and N is the effective population size.

This simple formula encapsulates a deep relationship between a purely genomic property (the heterozygosity) and a population-level quantity (the effective population size).

¹ J.F.C. Kingman, On the Genealogy of Large Populations, J. of Applied Probability, 19:27-43 (1982)

² http://en.wikipedia.org/wiki/Cumulative_distribution_function


7.9 Answer to question 2

Question 2: Sequence similarity: How similar are the DNA sequences of the Queen of England and you?

Consider a single chromosome and compare the Queen’s copy with your copy.

Using N = 15 000 and $\mu = 1.25 \times 10^{-8}$, we get:

$$\Theta = 4N\mu = 4 \times 15\,000 \times 1.25 \times 10^{-8} = 0.00075$$

Hence:

The Queen’s and your chromosome differ at about 1 in 1333 sites. (Note that 1333 ≈ 1/0.00075.)
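The arithmetic can be reproduced in a few lines, and the same relation can be inverted to estimate N from a measured heterozygosity. The function names are illustrative; the values of N and µ are those used above.

```python
def heterozygosity(n: float, mu: float) -> float:
    """Expected per-site heterozygosity, Theta = 4 * N * mu."""
    return 4 * n * mu


def effective_population_size(theta: float, mu: float) -> float:
    """Invert Theta = 4 * N * mu to estimate N."""
    return theta / (4 * mu)


N = 15_000
MU = 1.25e-8  # mutations per site per generation

theta = heterozygosity(N, MU)
print(f"Theta = {theta:.5f}, i.e. about 1 difference in {1 / theta:.0f} sites")
print(f"N recovered from Theta: {effective_population_size(theta, MU):.0f}")
```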

7.10 Mutations on a coalescence tree

Recall that the probability of two samples not coalescing in time t is:

$$P_2(t) = \left(1 - \frac{1}{2N}\right)^t \approx e^{-\frac{t}{2N}}.$$

The probability of i samples not coalescing in time t is:

$$P_i(t) = \left(1 - \binom{i}{2}\frac{1}{2N}\right)^t = \left(1 - \frac{i(i-1)}{2}\,\frac{1}{2N}\right)^t \approx e^{-\frac{i(i-1)}{4N}t}.$$

Mean waiting time for coalescence events: So, the waiting times $T_i$ are exponentially distributed with mean³:

$$\langle T_i \rangle = \frac{4N}{i(i-1)}.$$

(figure by Stephan Schiffels)

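Given n samples and times $T_i$, the expected total branch length follows by noting that, while there are i lineages, each of the i branches grows by $T_i$:

$$\langle T \rangle = \sum_{i=2}^{n} i \langle T_i \rangle = 4N \sum_{i=2}^{n} \frac{1}{i-1} = 4N \sum_{i=1}^{n-1} \frac{1}{i}.$$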

Hence, the expected number of mutations anywhere on the tree is:

$$\langle S \rangle = \mu \langle T \rangle = \mu\, 4N \sum_{i=1}^{n-1} \frac{1}{i} = \Theta \sum_{i=1}^{n-1} \frac{1}{i}.$$
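A small simulation can check these expectations. The sketch below samples the waiting times $T_i$ from exponential distributions with mean $4N/(i(i-1))$, sums the branch lengths, and compares the average with $4N\sum_{i=1}^{n-1} 1/i$. The sample size n, the replicate count, and the seed are arbitrary illustrative choices; N and µ are the values used in Section 7.9.

```python
import random


def mean_waiting_time(i: int, n_pop: float) -> float:
    """Expected time (in generations) during which i lineages exist: 4N / (i (i - 1))."""
    return 4 * n_pop / (i * (i - 1))


def expected_total_branch_length(n_samples: int, n_pop: float) -> float:
    """<T> = 4N * sum_{i=1}^{n-1} 1/i."""
    return 4 * n_pop * sum(1 / i for i in range(1, n_samples))


def simulate_total_branch_length(n_samples: int, n_pop: float,
                                 rng: random.Random) -> float:
    """One random coalescent tree: sum of i * T_i for i = n, ..., 2."""
    total = 0.0
    for i in range(n_samples, 1, -1):
        rate = 1.0 / mean_waiting_time(i, n_pop)  # expovariate takes a rate
        total += i * rng.expovariate(rate)
    return total


N, n, MU = 15_000, 10, 1.25e-8
rng = random.Random(1)  # fixed seed for reproducibility
sims = [simulate_total_branch_length(n, N, rng) for _ in range(20_000)]
simulated = sum(sims) / len(sims)
expected = expected_total_branch_length(n, N)

print(f"<T> simulated: {simulated:.0f}, formula: {expected:.0f}")
print(f"<S> per site = mu * <T> = {MU * expected:.5f}")
```

For n = 2 the formula reduces to 4N, matching the two branches of mean length 2N each in the pairwise case.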

³ Kingman, 1982


7.11 Two famous estimators of genetic diversity

How to estimate the quantity Θ = 4Nµ (heterozygosity) from genome data?

Consider n sequences of length L.

Definition 7.11.1 (Tajima’s estimator) Tajima’s estimator is the mean proportion of pairwise differences between any two sequences:

$$\Theta_\pi = \frac{\text{mean nr of pairwise differences}}{L}.$$

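Definition 7.11.2 (Watterson estimator) The Watterson estimator is based on the number of segregating sites, that is, sites at which the n sequences are not all identical:

$$\Theta_W = \frac{\text{nr of segregating sites}}{L \sum_{i=1}^{n-1} 1/i}.$$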

(figures by Stephan Schiffels)
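Both estimators are straightforward to compute from an alignment. The sketch below assumes the n sequences are given as equal-length strings without gaps or missing data; the toy alignment and the function names are made up for illustration.

```python
from itertools import combinations


def tajima_estimator(seqs: list[str]) -> float:
    """Mean number of pairwise differences, divided by the sequence length L."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total_diffs / len(pairs) / length


def count_segregating_sites(seqs: list[str]) -> int:
    """Number of alignment columns in which not all sequences agree."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def watterson_estimator(seqs: list[str]) -> float:
    """Segregating sites divided by L * sum_{i=1}^{n-1} 1/i."""
    n, length = len(seqs), len(seqs[0])
    harmonic = sum(1 / i for i in range(1, n))
    return count_segregating_sites(seqs) / (length * harmonic)


# Toy alignment: n = 4 sequences of length L = 10 (made-up data).
alignment = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAC",
    "ACGAACGTTC",
]
print(tajima_estimator(alignment), watterson_estimator(alignment))
```

In this toy alignment there are 2 segregating sites and 7 differences summed over the 6 pairs, so the two estimates come out close to each other but not identical.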

7.12 Answer to question 3

Question 3: demographic history: How did our ancestral population size change through time?

Three possible simple answers: The population has been

(a) constant

(b) declining

(c) expanding

(figure by Stephan Schiffels)

Interestingly, we can distinguish between these three possible scenarios by comparing only existing genomes:


7.13 Determining demographic history

We compare Tajima’s estimator and Watterson’s estimator to get a useful measure:

Definition 7.13.1 (Tajima’s D) Define

$$D = \frac{\Theta_\pi - \Theta_W}{\sqrt{\operatorname{Var}(\Theta_\pi - \Theta_W)}}$$

If the population size is constant, then we should have D ≈ 0.

If the population size has been increasing, then more mutations will have occurred on leaf edges, thus affecting fewer pairs, causing D to be negative.

If the population size has been decreasing, then more mutations will have occurred on inner edges, thus affecting more pairs, causing D to be positive.

So, Tajima’s D tells us something about the history of a population.
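The definition above leaves the variance term unspecified. A minimal sketch follows, assuming the standard variance estimate from Tajima (1989), which these notes do not spell out; the constants a1, a2, b1, b2, c1, c2, e1, e2 follow that convention. The function works with counts (the mean number of pairwise differences and the number of segregating sites S) rather than per-site values, which leaves D unchanged because numerator and denominator scale by the same factor 1/L. The input numbers in the example are made up.

```python
import math


def tajimas_d(mean_pairwise_diffs: float, num_segregating: int, n: int) -> float:
    """Tajima's D from counts, using Tajima's (1989) variance estimate (an assumption here)."""
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = num_segregating
    variance = e1 * s + e2 * s * (s - 1)
    # Numerator: pairwise-difference estimate minus Watterson-style estimate S / a1.
    return (mean_pairwise_diffs - s / a1) / math.sqrt(variance)


# Made-up example: n = 10 sequences, S = 16 segregating sites,
# mean pairwise difference 3.89.
d = tajimas_d(3.89, 16, 10)
print(f"D = {d:.2f}")
```

A negative value such as the one in this example would, by the reasoning above, point toward an expanding population; in practice the significance of D also has to be assessed before drawing conclusions.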

7.14 Summary

The simple coalescence model allows us to:

• Estimate the time to the last common ancestor of individuals of a population.

• Estimate how similar the DNA of different individuals of a population is.

• Make statements about the “shape” of the recent history of a population.