POPULATION GENETICS: WRIGHT FISHER MODEL AND COALESCENT PROCESS

by Hailong Cui and Wangshu Zhang

Supervisor: Prof. Quentin Berger

A Final Project Report Presented in Partial Fulfillment of the Requirements for Math 505b

April 2014




Acknowledgments

We want to thank Prof. Quentin Berger for introducing the Wright Fisher model to us in lecture, which inspired us to choose Population Genetics as our project topic. The resources Prof. Berger provided have been excellent learning materials, and his feedback has helped us greatly in creating this report. We would also like to acknowledge that the research papers listed in the references were an integral part of this process. They motivated us to learn about models beyond those covered in class, and gave us confidence that these probabilistic models can actually be used in real applications.


Contents

Acknowledgments

Abstract

1 Introduction to Population Genetics

2 Wright Fisher Model

2.1 Random drift

2.2 Genealogy of the Wright Fisher model

3 Coalescent Process

3.1 Kingman’s Approximation

4 Applications

Reference List


Abstract

In this project on Population Genetics, we aim to introduce models that lay the foundation for studying more complicated models. Specifically, we discuss the Wright Fisher model and the coalescent process. These models interest us not only for their mathematical elegance. With the availability of massive amounts of sequencing data, we can actually use these models (or more advanced models incorporating variable population size, mutation effects, etc., which are however outside the scope of this project) to answer real questions in molecular biology. First we explain concepts such as random drift, then discuss whether an allele can eventually become fixed in a population and what the probability is that genetic variation survives after a number of generations. After this we illustrate graphically the tree-like nature of tracing back to the most recent common ancestor (MRCA), and then derive the distribution of the time back to the MRCA for a sample of size 2. For the remainder of the report, we provide a treatment of Kingman’s approximation. Finally we move on to a literature review of an application to HIV-1, concerning coalescent estimates of the HIV-1 generation time in vivo.

Keywords: Population Genetics, Wright Fisher model, Most Recent Common Ancestors (MRCA), Allele, Sequencing data, Heterozygosity, Genealogy, Coalescent process, Kingman’s approximation, HIV-1 evolution


Chapter 1

Introduction to Population Genetics

With the advent of new sequencing technology [1, 2], we are harvesting large volumes of genetic data (at the DNA, RNA and even protein level) and making them publicly available [3, 4, 5, 6]. This enables researchers to analyze these sequencing data to tackle one of the most important challenges in modern molecular biology: how to make sense of the variation in genetic information and how this variation is translated into differences in phenotype. For example, can one capture the evolution among tumor cells and use the observed variability to infer the velocity of disease aggravation? Another example is studying the variation among human genetic sequences to identify genes that are related to diseases, such as the Egr3 gene for schizophrenia. These are some of the questions in population genetics, and within the scope of this project we aim first to introduce the probabilistic models that form the basis of further research. We then briefly review some published papers that use these models, as well as other related methods and software in phylogeny, all of which ultimately aim at a better understanding of diseases.

Essentially, the field of population genetics is the study of genetic variation within a population. We assume that a gene has two alleles and denote them by A and B. The population is then composed of individuals with two copies of each gene, i.e., AA, AB (BA) or BB. It is convenient to classify evolution problems by the time scales involved. A typical question asks what will happen in the future, such as “how long does a new mutant survive in the population?” or “what is the chance that an allele gets fixed in a population?”. We can also think about these problems from a different angle, namely retrospectively, by asking where the population has been in the past.

Many factors can affect the evolution of a population, such as random drift, selection, mutation, recombination, population subdivision, etc. Nonetheless, in the next section we begin by introducing a simple model in which many of these effects are ignored.


Chapter 2

Wright Fisher Model

We begin by introducing the simplest Wright Fisher model (Fisher (1922), Wright (1931)). We assume that the population is finite and of constant size, and that the gene of interest has only two alleles. We also ignore the effects of mutation, selection, etc., and assume that the population undergoes random mating. The chance fluctuation of allele frequencies that results from this random sampling is what we call random drift, which is discussed more formally below.

2.1 Random drift

As before, let A and B denote the two alleles at the locus of interest, and assume that no mutation occurs. Define $Y_r$ as the number of A alleles in generation $r$; then $N - Y_r$ is the number of B alleles in generation $r$. First, let’s make the following assumptions:

Assumption 1. Discrete, non-overlapping generations of equal size $N$.

Assumption 2. Parents of the next generation of $N$ genes are picked randomly with replacement from the preceding generation (genetic differences have no fitness consequences).

The population at generation $r + 1$ is derived from the population at generation $r$ by binomial sampling of $N$ genes from a gene pool in which the fraction of A alleles is its current frequency, namely $\pi_i = i/N$. Hence, given $Y_r = i$, the probability that $Y_{r+1} = j$ is

$$p_{ij} = \binom{N}{j} \pi_i^{\,j} (1 - \pi_i)^{N-j}, \qquad 0 \le i, j \le N.$$

The process $\{Y_r, r = 0, 1, \cdots\}$ is a time-homogeneous Markov chain with transition matrix $P = (p_{ij})$ and state space $S = \{0, 1, \cdots, N\}$. The states 0 and $N$ are clearly absorbing: if the population contains only one allele in some generation, then it remains so in every subsequent generation. In this case, we say that the population is fixed for that allele.
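To make the transition matrix concrete, the following is a minimal Python sketch that builds $P = (p_{ij})$ directly from the binomial formula. The function name `wright_fisher_matrix` and the choice $N = 4$ are illustrative assumptions, not part of the original report.

```python
from math import comb

def wright_fisher_matrix(N):
    """Return the (N+1) x (N+1) Wright Fisher transition matrix as nested lists.

    Entry P[i][j] is the probability of j copies of allele A in the next
    generation given i copies in the current one.
    """
    P = []
    for i in range(N + 1):
        pi = i / N  # current frequency of allele A
        row = [comb(N, j) * pi**j * (1 - pi)**(N - j) for j in range(N + 1)]
        P.append(row)
    return P

# Illustrative population size (an assumption chosen for readability).
N = 4
P = wright_fisher_matrix(N)
for row in P:
    print([round(x, 4) for x in row])
```

Each row is a Binomial($N$, $i/N$) distribution, so every row sums to 1, and the first and last rows put all their mass on states 0 and $N$ respectively, which is the absorbing behaviour described above.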


The binomial nature of the transition matrix makes some properties of the process easy to calculate. For example,

$$E(Y_r \mid Y_{r-1}) = N \cdot \frac{Y_{r-1}}{N} = Y_{r-1},$$

so by taking expectations on both sides we get $E(Y_r) = E(Y_{r-1})$, and by iteration,

$$E(Y_r) = E(Y_0), \qquad r = 1, 2, \cdots$$

Note that:

$$E(Y_{r+1} \mid Y_r = i) = N \left(\frac{i}{N}\right) = i,$$

$$Var(Y_{r+1} \mid Y_r = i) = N \left(\frac{i}{N}\right)\left(1 - \frac{i}{N}\right).$$

Therefore the expected number of A (or B) alleles remains constant across generations; nonetheless, variability must eventually be lost. Hence the population will ultimately contain only A alleles or only B alleles, that is, the chain is absorbed in state 0 or state $N$. Naturally we want to understand the probabilities of these events, so we define

$$a_i = P(\text{eventually all alleles are A} \mid \text{initially exactly } i \text{ alleles are A}).$$

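Clearly $a_0 = 0$ and $a_N = 1$, and $Y_r$ is a martingale, as can be seen from the conditional expectation above. If we define $T$ as the time of absorption at 0 or $N$ and apply the optional stopping theorem, we get $E[Y_T] = E[Y_0] = i$. Since $Y_T$ takes only the values 0 and $N$,

$$E[Y_T] = N \cdot P(Y_T = N) = N \cdot a_i = i,$$

so

$$a_i = \frac{i}{N}.$$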

This means that an allele eventually becomes fixed in the population with probability equal to its initial proportion. As a side note, fixation in genetic sequences makes it more difficult to trace back in time to determine common ancestors.
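The result $a_i = i/N$ can be checked empirically. The sketch below simulates the Wright Fisher chain by direct binomial resampling and estimates the fixation probability by Monte Carlo; the function names, the seed, the number of trials and the values $N = 10$, $i = 3$ are all assumptions made for illustration.

```python
import random

def simulate_fixation(N, i, rng):
    """Run one Wright Fisher trajectory from Y_0 = i until absorption.

    Returns True if allele A becomes fixed (the chain hits N).
    """
    y = i
    while 0 < y < N:
        p = y / N
        # Binomial(N, p) draw: each of the N genes picks an A parent with prob. p.
        y = sum(1 for _ in range(N) if rng.random() < p)
    return y == N

def estimate_fixation_probability(N, i, trials=20000, seed=1):
    """Monte Carlo estimate of a_i from independent trajectories."""
    rng = random.Random(seed)
    fixed = sum(simulate_fixation(N, i, rng) for _ in range(trials))
    return fixed / trials

# Illustrative parameters (assumptions, not taken from the report).
N, i = 10, 3
estimate = estimate_fixation_probability(N, i)
print(f"estimated a_i = {estimate:.3f}, theory i/N = {i / N:.3f}")
```

With these settings the estimate should agree with $i/N = 0.3$ up to Monte Carlo error, which is on the order of a few thousandths for 20000 trials.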

The next question of interest is how fast genetic variation is lost. To answer it, let’s study another widely used quantity in population genetics: heterozygosity. It is defined as the probability $H_r$ that two genes chosen at random with replacement in generation $r$ are different.


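If we define $P_r = Y_r / N$ to be the proportion of A alleles in generation $r$, then the heterozygosity is $H_r = 2P_r(1 - P_r)$. Writing $p_0 = P_0$ for the initial proportion (treated as fixed), so that $H_0 = 2p_0(1 - p_0)$, the expected heterozygosity after one generation is

$$\begin{aligned}
E(H_1) &= E\big(2P_1(1 - P_1)\big) \\
&= 2\left(E(P_1) - E(P_1^2)\right) \\
&= 2\left(E(P_1) - E(P_1)^2 - Var(P_1)\right) \\
&= 2\left(p_0 - p_0^2 - \frac{p_0(1 - p_0)}{N}\right) \\
&= 2p_0(1 - p_0)\left(1 - \frac{1}{N}\right) \\
&= H_0\left(1 - \frac{1}{N}\right).
\end{aligned}$$

After $r$ generations:

$$E(H_r) = H_0\left(1 - \frac{1}{N}\right)^r \approx H_0\, e^{-r/N}.$$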

The probability $H_r$ measures the genetic variability surviving in the population, and its expectation decays at rate $1/N$ per generation. The decrease of heterozygosity is a measure of random drift. As the computation above shows, the expected heterozygosity decays to 0 as $r$ goes to infinity. The expected time until variation is lost is complicated to compute. In fact, because explicit expressions are difficult to find, one may want to resort to approximation methods. Interested readers can refer to topics on diffusion approximations for further reading [7].
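The decay formula $E(H_r) = H_0 (1 - 1/N)^r$ can be verified exactly, without simulation, by propagating the distribution of $Y_r$ through the binomial transition probabilities. The following is a small sketch under that approach; the function name and the parameters $N = 20$, $Y_0 = 8$ are illustrative assumptions.

```python
from math import comb, exp

def expected_heterozygosity(N, y0, generations):
    """Return [E(H_0), ..., E(H_generations)] computed exactly.

    The distribution of Y_r is pushed forward one generation at a time
    using p_ij = C(N, j) (i/N)^j (1 - i/N)^(N - j).
    """
    P = [[comb(N, j) * (i / N)**j * (1 - i / N)**(N - j) for j in range(N + 1)]
         for i in range(N + 1)]
    dist = [0.0] * (N + 1)
    dist[y0] = 1.0  # start with exactly y0 copies of allele A
    values = []
    for _ in range(generations + 1):
        # E(H_r) = sum over states y of P(Y_r = y) * 2 (y/N) (1 - y/N)
        values.append(sum(q * 2 * (y / N) * (1 - y / N) for y, q in enumerate(dist)))
        dist = [sum(dist[i] * P[i][j] for i in range(N + 1)) for j in range(N + 1)]
    return values

# Illustrative parameters (assumptions, not taken from the report).
N, y0 = 20, 8
exact = expected_heterozygosity(N, y0, 10)
closed_form = [exact[0] * (1 - 1 / N)**r for r in range(11)]
for r in (0, 5, 10):
    print(r, exact[r], closed_form[r], exact[0] * exp(-r / N))
```

The first two printed columns should coincide up to floating-point rounding, while the third shows how close the exponential approximation $H_0 e^{-r/N}$ is for this moderate $N$.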

2.2 Genealogy of the Wright Fisher model

In this section we study the genealogy of the Wright Fisher model. We can imagine that each individual in a given generation carries either an A or a B allele. Assuming no mutation as before, all offspring of A individuals continue to carry only A alleles. Below we introduce the concept of the most recent common ancestor (MRCA) by illustrating two simulation results in Fig 2.1 and Fig 2.2 [7]. Both are for a Wright Fisher model of $N = 9$ individuals. Generations evolve vertically downwards and the individuals are labelled $1, 2, \cdots, 9$ from left to right. Lines are directional, though drawn without arrows, and join individuals in two generations if one is the offspring of the other. In Fig 2.1, we can see that individuals 3 and 4 have their MRCA 3 generations ago. This figure shows a heavily tangled relationship and may look confusing. Fig 2.2, however, presents a clearer structure in a typical phylogenetic tree format. The order of the individuals is untangled in Fig 2.2, and we can see that the MRCA of individuals 6 and 7 is 11 generations ago, i.e., at the root of the tree.

Figure 2.1: First simulation.

Now we want to understand how long it takes for two alleles to trace back to their MRCA. Since individuals choose their parents at random, we see that

$$P(\text{2 individuals have 2 distinct parents}) = \lambda = 1 - \frac{1}{N}.$$


Figure 2.2: Second simulation in untangled form.

Since those parents are themselves a random sample from their generation, we may iterate this argument to see that

$$P(\text{first common ancestor more than } r \text{ generations ago}) = \lambda^r = \left(1 - \frac{1}{N}\right)^r. \qquad (2.1)$$

When the population size is large and time is measured in units of $N$ generations, the time to the MRCA of a sample of size 2 has approximately an exponential distribution with mean 1. To see this, rescale time so that $r = Nt$, and let $N \to \infty$ in (2.1). We see that this probability is

$$\left(1 - \frac{1}{N}\right)^{Nt} \to e^{-t}.$$
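The convergence to the exponential distribution can be seen numerically. The sketch below evaluates the probability in (2.1) at $r = \lfloor Nt \rfloor$ for increasing $N$ and compares it with $e^{-t}$; the function name and the chosen values of $N$ and $t$ are illustrative assumptions.

```python
from math import exp, floor

def prob_no_common_ancestor(N, t):
    """P(MRCA of two genes is more than floor(N t) generations ago), from (2.1)."""
    r = floor(N * t)
    return (1 - 1 / N) ** r

# Illustrative rescaled time and population sizes (assumptions).
t = 1.0
for N in (10, 100, 10000):
    discrete = prob_no_common_ancestor(N, t)
    print(N, discrete, exp(-t), abs(discrete - exp(-t)))
```

The last column, the absolute error of the exponential approximation, shrinks as $N$ grows.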


Now we consider the probability $h_r$ that two individuals chosen with replacement from generation $r$ carry distinct alleles; in particular, $h_0 = E[2Y_0(N - Y_0)]/N^2$. The two sampled individuals carry different alleles if and only if they are two distinct individuals (which has probability $(N-1)/N$), their common ancestor is more than $r$ generations ago, and their ancestors at time 0 carry different alleles. The probability of this last event is the chance that 2 individuals chosen without replacement at time 0 carry different alleles, which is just $E[2Y_0(N - Y_0)]/(N(N-1))$. Combining these results,

$$h_r = \lambda^r \cdot \frac{N-1}{N} \cdot \frac{E[2Y_0(N - Y_0)]}{N(N-1)} = \lambda^r h_0,$$

in agreement with the result for $E(H_r)$ in the previous section.

Here are more discrete-time properties:

• P(two genes have the same parent in the previous generation) is $\frac{1}{N}$

• Number of generations since two genes first shared a common ancestor $\sim$ Geometric$\left(\frac{1}{N}\right)$

• Number of generations since at least two genes in a sample of $k$ shared a common ancestor $\sim$ Geometric$\left(\frac{k(k-1)}{2N}\right)$

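Proof. Define $G_{k,k}$ to be the probability that $k$ genes have $k$ distinct ancestors in the previous generation. Then

$$\begin{aligned}
G_{k,k} &= \left(\frac{N-1}{N}\right)\left(\frac{N-2}{N}\right)\cdots\left(\frac{N-(k-1)}{N}\right) \\
&= \left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right)\cdots\left(1 - \frac{k-1}{N}\right) \\
&= 1 - \left(\frac{1 + 2 + 3 + \cdots + (k-1)}{N}\right) + O\left(\frac{1}{N^2}\right) \\
&= 1 - \frac{k(k-1)}{2N} + O\left(\frac{1}{N^2}\right).
\end{aligned}$$

Therefore, the probability that at least two genes share a common ancestor in the previous generation is

$$1 - G_{k,k} = \frac{k(k-1)}{2N} + O\left(\frac{1}{N^2}\right).$$

Since this is the same in each generation, the number of generations until at least two genes in a sample of $k$ share a common ancestor is, up to terms of order $1/N^2$, $\sim$ Geometric$\left(\frac{k(k-1)}{2N}\right)$.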
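As a quick numerical check of the expansion in the proof, the sketch below compares the exact value of $1 - G_{k,k}$ with the leading-order term $k(k-1)/(2N)$. The function names and the values $N = 1000$, $k = 5$ are assumptions chosen for illustration.

```python
def prob_all_distinct_parents(N, k):
    """Exact G_{k,k}: probability that k genes pick k distinct parents."""
    g = 1.0
    for m in range(1, k):
        g *= (N - m) / N
    return g

def coalescence_prob_approx(N, k):
    """Leading-order approximation k(k-1)/(2N) to 1 - G_{k,k}."""
    return k * (k - 1) / (2 * N)

# Illustrative parameters (assumptions, not taken from the report).
N, k = 1000, 5
exact = 1 - prob_all_distinct_parents(N, k)
approx = coalescence_prob_approx(N, k)
print(exact, approx, abs(exact - approx))
```

The difference between the two values is of order $1/N^2$, consistent with the $O(1/N^2)$ remainder in the derivation. For $k = 2$ the two expressions coincide, since $1 - G_{2,2} = 1/N$ exactly.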


Chapter 3

Coalescent Process

In this section we discuss a basic coalescent process, which is closely related to the MRCA introduced in the previous sections. Essentially, the term coalescence means “connection” or “coming together”; it is the opposite of branching. When two alleles are descended from a common ancestor in some previous generation, we say that they coalesce in that generation. In the Wright Fisher model above we started from a population of size $N$ and then moved “forward” in time to observe descendants. In the coalescent process, we begin from a certain generation and look “backward” in time at the past. Viewed this way, the lineages of two individuals of interest merge in some previous generation.

Let’s begin with the simplest statement of the coalescent model. Kingman proved this to be the limiting ancestral process for a broad class of population structures that includes the Wright Fisher model. We trace the ancestral lineages, which are the series of genetic ancestors of the samples at a locus, back through time. The history of a sample of size $n$ comprises $n - 1$ coalescent events. Each coalescent event decreases the number of ancestral lineages by one. This takes the sample from the present day, when there are $n$ lineages, through a series of steps in which the number of lineages decreases from $n$ to $n - 1$, then from $n - 1$ to $n - 2$, etc., and finally from two to one. The single lineage remaining after the final coalescent event is the MRCA of the entire sample. At each coalescent event, two of the lineages fuse into one common ancestral lineage. The result is a bifurcating tree like the one shown in Fig 3.1. The times $T_i$ on the right in Fig 3.1 are the times during which there were exactly $i$ lineages ancestral to the sample.

3.1 Kingman’s Approximation

Figure 3.1: A coalescent genealogy of a sample of n = 9 items.

Discrete-time models can be cumbersome to work with, so we would like a representation in continuous time. Kingman (1982) considered the case where $N$ (population size) is very large relative to $n$ (sample size). Recall that $G_{k,k}$ is the probability that $k$ genes had $k$ ancestors in the previous generation. More generally, define $G_{i,j}$ as the probability that $i$ genes had $j$ ancestors ($j \le i$) in the previous generation. Then we can show that

$$G_{i,j} = \frac{S_i^{(j)} N_{[j]}}{N^i}, \qquad 1 \le j \le i,$$

where $N_{[j]} = N(N-1)\cdots(N-j+1)$ and $S_i^{(j)}$ are Stirling numbers of the second kind.

Important observation: when $N$ is large, it is unlikely for more than one coalescent event to occur in a single generation.
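The formula for $G_{i,j}$ and the observation about multiple coalescent events can both be checked with exact rational arithmetic. The sketch below is one possible way to do this; the function names and the sample size $i = 5$ are illustrative assumptions, and the Stirling numbers are computed with the standard recurrence $S_i^{(j)} = j\,S_{i-1}^{(j)} + S_{i-1}^{(j-1)}$.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(i, j):
    """Stirling number of the second kind: partitions of i items into j blocks."""
    if i == j:
        return 1
    if j == 0 or j > i:
        return 0
    return j * stirling2(i - 1, j) + stirling2(i - 1, j - 1)

def falling_factorial(N, j):
    """N_[j] = N (N - 1) ... (N - j + 1)."""
    result = 1
    for m in range(j):
        result *= N - m
    return result

def G(N, i, j):
    """Probability that i genes have exactly j distinct parents."""
    return Fraction(stirling2(i, j) * falling_factorial(N, j), N**i)

def prob_multiple_coalescence(N, i):
    """Probability that i genes have i - 2 or fewer parents in one generation."""
    return sum(G(N, i, j) for j in range(1, i - 1))

# Illustrative sample size and population sizes (assumptions).
i = 5
for N in (100, 1000, 10000):
    print(N, float(G(N, i, i - 1)), float(prob_multiple_coalescence(N, i)))
```

For each $N$ the probabilities $G_{i,j}$ sum to 1 over $j$. The probability of exactly one coalescence ($j = i - 1$) scales like $1/N$, whereas the probability of losing two or more lineages in a single generation scales like $1/N^2$, which is why simultaneous events can be neglected in the limit.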

Under the assumption that coalescent events do not occur simultaneously, we look at the limit as $N \to \infty$:

• Consider a sample of $n$ lineages and follow the process backwards in time

• Define $T_i$ = time during which there are exactly $i$ lineages in the sample, measured in units of $N$ generations

• $P(T_i > t)$ = probability that the time to a coalescent event in a sample of $i$ lineages from a population of size $N$ is greater than $t$:

$$P(T_i > t) = (G_{i,i})^{[Nt]},$$

where $[Nt]$ denotes the integer part of $Nt$.


For the Wright Fisher model:

$$P(T_i > t) = (G_{i,i})^{[Nt]} = \left(1 - \frac{i(i-1)}{2N} + O\left(\frac{1}{N^2}\right)\right)^{[Nt]} \to e^{-\binom{i}{2} t} \quad \text{as } N \to \infty.$$

In this case, with appropriate time units, the time to coalescence in a sample of $i$ lineages follows an Exponential distribution with mean $\mu = \frac{2}{i(i-1)}$.

The probability density function for $T_i$ is

$$f_{T_i}(t) = \binom{i}{2} e^{-\binom{i}{2} t}, \qquad t \ge 0, \quad i = 2, 3, \cdots, n.$$

The mean and variance are

$$E(T_i) = \frac{2}{i(i-1)},$$

$$Var(T_i) = \left(\frac{2}{i(i-1)}\right)^2.$$

Fewer lineages mean a longer expected time to coalescence.

To generate a genealogy of $i$ genes under Kingman’s coalescent:

• Draw an observation from an exponential distribution with mean $\mu = 2/(i(i-1))$. This is the waiting time until the next coalescent event (looking from the present backwards in time).

• Pick two lineages at random to coalesce.

• Decrease $i$ by 1.

• If $i = 1$, stop. Otherwise, repeat these steps [8, 9].
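The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name, the integer labelling of lineages, the event-list representation of the tree and the seed are all assumptions.

```python
import random

def simulate_coalescent(n, rng):
    """Simulate one genealogy of n genes under Kingman's coalescent.

    Returns a list of (time, (a, b), parent) events, where lineages a and b
    merge into a new lineage labelled `parent` at the given time (measured
    backwards from the present, in units of N generations).
    """
    lineages = list(range(n))  # sampled genes are labelled 0 .. n-1
    next_label = n             # internal nodes get fresh labels n, n+1, ...
    time = 0.0
    events = []
    while len(lineages) > 1:
        i = len(lineages)
        rate = i * (i - 1) / 2           # rate C(i, 2), i.e. mean 2 / (i (i - 1))
        time += rng.expovariate(rate)    # waiting time until the next coalescence
        a, b = rng.sample(lineages, 2)   # pick two lineages at random
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_label)
        events.append((time, (a, b), next_label))
        next_label += 1
    return events

# Illustrative run with a hypothetical sample size and seed.
rng = random.Random(0)
events = simulate_coalescent(5, rng)
for t, pair, parent in events:
    print(f"t = {t:.4f}: lineages {pair} coalesce into {parent}")
```

The time of the last event is the time to the MRCA of the sample. Summing the means $E(T_i) = 2/(i(i-1))$ over $i = 2, \ldots, n$ gives an expected time to the MRCA of $2(1 - 1/n)$, so averaging the final event time over many simulated genealogies is a simple sanity check.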


Chapter 4

Applications

We start this section with a review of a paper on coalescent estimates of HIV-1 generation time in vivo [10]. Though somewhat dated, this paper shows how a new method based on coalescent theory can be used to estimate the HIV-1 generation time in vivo. The generation time of HIV-1 had been of interest to many researchers and had previously been estimated with a different mathematical model of viral dynamics. The first author, Allen Rodrigo (now a professor at Duke), used nucleotide sequencing data for the analysis, together with a reconstructed genealogy of sequences obtained over time. The study was of a single individual, a homosexual Caucasian male who was diagnosed as HIV-1 positive following an episode of aseptic meningitis in February of 1985, when he was 23 years old. Over the course of 3 years beginning in 1989, blood was obtained at time points 7, 22, 23, and 34 months after the first specimen. The method was applied to sequences obtained from this long-term non-progressing individual at these five sampling occasions. The average viral generation time estimated with the coalescent method was 1.2 days per generation, which is close to the value obtained by mathematical modeling (1.8 days per generation), thus strengthening confidence in estimates of a short viral generation time.

Readers interested in more recent papers with applications to sequence data can refer to the 2002 Nature Reviews Genetics paper by Noah Rosenberg [11] (now a professor at Stanford). The authors discuss the increased use of genetic polymorphism for inference about population phenomena, such as migration and selection, and employ the coalescent process in their analysis.

Beyond the scope of our stochastic modeling, there are also different approaches that use sequence alignment to infer phylogenetic trees, estimate the rates of molecular evolution, etc. Readers can refer to a well-established software package called MEGA [12].


Reference List

[1] E. R. Mardis, “Next-generation DNA sequencing methods,” Annu. Rev. Genomics Hum. Genet., vol. 9, pp. 387–402, 2008.

[2] M. L. Metzker, “Sequencing technologies – the next generation,” Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.

[3] P. J. Cock, C. J. Fields, N. Goto, M. L. Heuer, and P. M. Rice, “The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants,” Nucleic Acids Research, vol. 38, no. 6, pp. 1767–1771, 2010.

[4] D. Karolchik, R. Baertsch, M. Diekhans, T. S. Furey, A. Hinrichs, Y. Lu, K. M. Roskin, M. Schwartz, C. W. Sugnet, D. J. Thomas, et al., “The UCSC genome browser database,” Nucleic Acids Research, vol. 31, no. 1, pp. 51–54, 2003.

[5] T. Hubbard, D. Barker, E. Birney, G. Cameron, Y. Chen, L. Clark, T. Cox, J. Cuff, V. Curwen, T. Down, et al., “The Ensembl genome database project,” Nucleic Acids Research, vol. 30, no. 1, pp. 38–41, 2002.

[6] K. D. Pruitt, T. Tatusova, and D. R. Maglott, “NCBI reference sequence project: update and current status,” Nucleic Acids Research, vol. 31, no. 1, pp. 34–37, 2003.

[7] S. Tavaré, “Part I: Ancestral inference in population genetics,” in Lectures on Probability Theory and Statistics, pp. 1–188, Springer, 2004.

[8] J. Wakeley, “Chapter 3 of Coalescent Theory: An Introduction.” http://webpages.uidaho.edu/hohenlohe/Wakeley_ch3.pdf, cited April 2014.

[9] L. Kubatko, “Tutorial on coalescent theory.” http://www.stat.osu.edu/~lkubatko/coalescent_theory_penn_state_part1.pdf, cited April 2014.

[10] A. G. Rodrigo, E. G. Shpaer, E. L. Delwart, A. K. Iversen, M. V. Gallo, J. Brojatsch, M. S. Hirsch, B. D. Walker, and J. I. Mullins, “Coalescent estimates of HIV-1 generation time in vivo,” Proceedings of the National Academy of Sciences, vol. 96, no. 5, pp. 2187–2191, 1999.


[11] N. A. Rosenberg and M. Nordborg, “Genealogical trees, coalescent theory and the analysis of genetic polymorphisms,” Nature Reviews Genetics, vol. 3, no. 5, pp. 380–390, 2002.

[12] K. Tamura, G. Stecher, D. Peterson, A. Filipski, and S. Kumar, “MEGA6: Molecular evolutionary genetics analysis version 6.0,” Molecular Biology and Evolution, vol. 30, no. 12, pp. 2725–2729, 2013.
