Monte Carlo methods for estimating population genetic parameters
Rasmus Nielsen, University of Copenhagen

Post on 21-Dec-2015


Page 1

Monte Carlo methods for estimating population genetic parameters

Rasmus Nielsen

University of Copenhagen

Page 2

Outline

• Idiosyncratic history and background on ML estimation of demographic parameters based on DNA sequence data.

• A new computational approach/modification.

• Idiosyncratic history and background on ML estimation of demographic parameters based on SNP data.

• Ascertainment and large scale SNP data sets.

Page 3

Felsenstein’s Equation

\[
\Pr(X \mid \Theta) = \int \Pr(X \mid G)\, p(G \mid \Theta)\, dG = E_{G \mid \Theta}\left[\Pr(X \mid G)\right]
\]

So

\[
\Pr(X \mid \Theta) \approx \frac{1}{k} \sum_{i=1}^{k} \Pr(X \mid G_i)
\]

where G_i, i = 1, 2, …, k, has been simulated from p(G | Θ).
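As a minimal sketch of this estimator, assume a toy model that is not part of the talk: the genealogy G is reduced to a single tree length tau, exponentially distributed with mean theta, and the data X are a mutation count s that is Poisson with mean tau. Under these assumptions the likelihood has the closed form theta^s / (1 + theta)^(s + 1), so the Monte Carlo average can be checked against it. All function names are illustrative.

```python
import math
import random

def poisson_pmf(s, tau):
    """Pr(X = s | G = tau): Poisson probability of s mutations on a tree of length tau."""
    return math.exp(-tau) * tau ** s / math.factorial(s)

def naive_likelihood(s, theta, k, rng):
    """Average Pr(X | G_i) over k genealogies G_i drawn from p(G | theta)."""
    total = 0.0
    for _ in range(k):
        # Toy assumption: tree length is exponential with mean theta.
        tau = rng.expovariate(1.0 / theta)
        total += poisson_pmf(s, tau)
    return total / k

def exact_likelihood(s, theta):
    """Closed form for the toy model: theta^s / (1 + theta)^(s + 1)."""
    return theta ** s / (1.0 + theta) ** (s + 1)

rng = random.Random(1)
print(naive_likelihood(s=3, theta=2.0, k=100_000, rng=rng), exact_likelihood(3, 2.0))
```

In this toy case the two printed numbers should agree closely. For realistic genealogies most simulated G_i contribute almost nothing to the sum, which is what motivates importance sampling.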

Page 4

Coefficient of Variation

[Figure: coefficient of variation (C.V., 0 to 1) plotted against sample size (0 to 10).]

Page 5

Importance Sampling

\[
\Pr(X \mid \Theta) = \int \Pr(X \mid G)\, p(G \mid \Theta)\, dG
= \int \frac{\Pr(X \mid G)\, p(G \mid \Theta)}{h(G)}\, h(G)\, dG
= E_{h}\left[\frac{\Pr(X \mid G)\, p(G \mid \Theta)}{h(G)}\right]
\]

So

\[
\Pr(X \mid \Theta) \approx \frac{1}{k} \sum_{i=1}^{k} \frac{\Pr(X \mid G_i)\, p(G_i \mid \Theta)}{h(G_i)}
\]

where G_i, i = 1, 2, …, k, has been simulated from h(G).
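The importance sampling average can be sketched on a toy model that is an assumption of this example, not of the talk: G is a tree length tau with exponential prior of mean theta, X is a Poisson mutation count with mean tau, and the proposal h(G) is an exponential density with a fixed, hypothetical mean `h_mean`. One set of draws from h is reused to evaluate the likelihood at several values of theta.

```python
import math
import random

def poisson_pmf(s, tau):
    """Pr(X = s | G = tau) for the toy model."""
    return math.exp(-tau) * tau ** s / math.factorial(s)

def exp_density(tau, mean):
    """Density of an exponential distribution with the given mean."""
    return math.exp(-tau / mean) / mean

def is_likelihood(s, theta, draws, h_mean):
    """(1/k) * sum of Pr(X | G_i) * p(G_i | theta) / h(G_i), with G_i drawn from h."""
    total = 0.0
    for tau in draws:
        weight = exp_density(tau, theta) / exp_density(tau, h_mean)
        total += poisson_pmf(s, tau) * weight
    return total / len(draws)

rng = random.Random(2)
h_mean = 3.0  # hypothetical proposal mean; any fixed value above 0.5 keeps the weights well behaved here
draws = [rng.expovariate(1.0 / h_mean) for _ in range(100_000)]
for theta in (0.5, 1.0, 2.0, 4.0):
    exact = theta ** 3 / (1.0 + theta) ** 4  # closed form for s = 3 in the toy model
    print(theta, is_likelihood(3, theta, draws, h_mean), exact)
```

Because the draws do not depend on theta, the same sample gives a whole likelihood curve, at the cost of higher variance where h is a poor match to the integrand.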

Page 6

Griffiths and Tavare

Recursion

\[
\Pr(X \mid \Theta) = p(\mathrm{mut} \mid \Theta) \sum_{X'_{\mathrm{mut}}} \Pr(X' \mid \Theta)\, p(X \mid X', \mathrm{mut})
+ p(\mathrm{coal} \mid \Theta) \sum_{X'_{\mathrm{coal}}} \Pr(X' \mid \Theta)\, p(X \mid X', \mathrm{coal})
\]

Simulate mutation (coalescent) from

\[
\frac{p(\mathrm{mut} \mid \Theta)\, p(X \mid X', \mathrm{mut})}
{\sum_{X'_{\mathrm{mut}}} p(\mathrm{mut} \mid \Theta)\, p(X \mid X', \mathrm{mut}) + \sum_{X'_{\mathrm{coal}}} p(\mathrm{coal} \mid \Theta)\, p(X \mid X', \mathrm{coal})}
\]

and correct using importance sampling.

Page 7

Example (Nielsen 1998)

• Infinite sites model

• Estimation of T

• Estimation of population phylogenies

Page 8

Integro-recursion

Ugliest equation ever published in a biological journal…

Page 9

MLE: T=1.8 (36,000 years)

Page 10

Data from the Caribbean Hawksbill Turtle

Page 11

MCMC

Set up a Markov chain on the state space of all supported values of Θ and G, with stationary distribution p(Θ, G | X). Now since

\[
p(\Theta, G \mid X) \propto \Pr(X \mid G)\, p(G \mid \Theta)\, p(\Theta)
\]

this can easily be done using Metropolis-Hastings sampling, i.e. updates to Θ and G are proposed from a proposal distribution q(Θ, G → Θ’, G’) and accepted with probability

\[
\min\left\{1,\ \frac{P(X \mid G')\, P(G' \mid \Theta')\, q(\Theta', G' \to \Theta, G)}{P(X \mid G)\, P(G \mid \Theta)\, q(\Theta, G \to \Theta', G')}\right\}
\]

where the ratio is also multiplied by p(Θ’)/p(Θ) unless the prior on Θ is flat.
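A minimal Metropolis-Hastings sketch of this scheme follows, under assumptions that are not from the talk: G is a single tree length tau with exponential prior of mean theta, X is a Poisson mutation count s with mean tau, the prior on theta is flat on (0, theta_max), and both quantities are updated jointly with a multiplicative random-walk proposal. The names and tuning values are hypothetical.

```python
import math
import random

def log_target(theta, tau, s, theta_max):
    """log of Pr(X | G) p(G | theta) p(theta), up to a constant, for the toy model."""
    if not (0.0 < theta < theta_max) or tau <= 0.0:
        return -math.inf
    log_lik = s * math.log(tau) - tau              # Poisson: Pr(X = s | G = tau)
    log_prior_g = -math.log(theta) - tau / theta   # exponential: p(G = tau | theta)
    return log_lik + log_prior_g                   # flat prior on theta only adds a constant

def metropolis_hastings(s, theta_max, n_iter, step, rng):
    """Return the visited (theta, tau) states of the chain."""
    theta, tau = 1.0, 1.0
    samples = []
    for _ in range(n_iter):
        new_theta = theta * math.exp(step * rng.gauss(0.0, 1.0))
        new_tau = tau * math.exp(step * rng.gauss(0.0, 1.0))
        # Multiplicative proposals are asymmetric: q(new -> old) / q(old -> new)
        # equals (new_theta / theta) * (new_tau / tau).
        log_ratio = (log_target(new_theta, new_tau, s, theta_max)
                     - log_target(theta, tau, s, theta_max)
                     + math.log(new_theta / theta) + math.log(new_tau / tau))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta, tau = new_theta, new_tau
        samples.append((theta, tau))
    return samples

rng = random.Random(3)
chain = metropolis_hastings(s=3, theta_max=10.0, n_iter=20_000, step=0.5, rng=rng)
kept = [theta for theta, _ in chain[2_000:]]  # discard an initial burn-in
print(sum(kept) / len(kept))  # posterior mean of theta under the toy model
```

The posterior for theta is then read off the sampled theta values, typically through a histogram or another smoother.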

Page 12
Page 13
Page 14
Page 15
Page 16
Page 17

Some problems…

• Histogram estimator or other smoothing must be used.

• Likelihood ratios hard to estimate (e.g. M=0).

Page 18

A new method

• It is possible to calculate the marginal prior probability of a genealogy

\[
P(G) = \int P(G \mid \Theta)\, P(\Theta)\, d\Theta
\]

• It turns out that this math is doable for most components of Θ, such as θ and M.

• Then we can sample from the marginal posterior of G

\[
P(G \mid X) = \frac{P(X \mid G)\, P(G)}{P(X)} \propto P(X \mid G)\, P(G)
\]

using the previously discussed MCMC procedures.

Slide inspired by Jody Hey

Page 19

We then recover the posterior for Θ using

\[
P(\Theta \mid X) = \int P(\Theta \mid G)\, P(G \mid X)\, dG
\]

Approximated by

\[
P(\Theta \mid X) \approx \frac{1}{k} \sum_{i=1}^{k} P(\Theta \mid G_i) = \frac{1}{k} \sum_{i=1}^{k} \frac{P(G_i \mid \Theta)\, P(\Theta)}{P(G_i)}
\]

Slide inspired by Jody Hey
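The two steps (sample G from its marginal posterior, then average P(Θ | G_i)) can be sketched on a toy model. Everything specific here is an assumption for illustration: G is a tree length tau with exponential prior of mean theta, X is a Poisson mutation count s, and the prior on theta is a hypothetical five-point grid, so the marginal prior P(G) is a finite sum rather than the analytic integral used in the talk.

```python
import math
import random

THETAS = [0.5, 1.0, 2.0, 4.0, 8.0]          # hypothetical discrete prior support
PRIOR = [1.0 / len(THETAS)] * len(THETAS)   # uniform prior weights

def p_g_given_theta(tau, theta):
    """Toy P(G | theta): exponential density with mean theta."""
    return math.exp(-tau / theta) / theta

def p_g(tau):
    """Marginal prior P(G) = sum over theta of P(G | theta) P(theta)."""
    return sum(p_g_given_theta(tau, th) * pr for th, pr in zip(THETAS, PRIOR))

def sample_genealogies(s, n_iter, step, rng):
    """Metropolis-Hastings on G alone, targeting P(G | X) proportional to P(X | G) P(G)."""
    def log_target(tau):
        return s * math.log(tau) - tau + math.log(p_g(tau))
    tau, draws = 1.0, []
    for _ in range(n_iter):
        new_tau = tau * math.exp(step * rng.gauss(0.0, 1.0))
        # Hastings correction for the multiplicative proposal is new_tau / tau.
        log_ratio = log_target(new_tau) - log_target(tau) + math.log(new_tau / tau)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            tau = new_tau
        draws.append(tau)
    return draws

def posterior_theta(draws):
    """P(theta | X) estimated as (1/k) * sum_i P(G_i | theta) P(theta) / P(G_i)."""
    k = len(draws)
    return [sum(p_g_given_theta(tau, th) * pr / p_g(tau) for tau in draws) / k
            for th, pr in zip(THETAS, PRIOR)]

rng = random.Random(5)
draws = sample_genealogies(s=3, n_iter=20_000, step=0.7, rng=rng)[2_000:]
for th, post in zip(THETAS, posterior_theta(draws)):
    print(th, post)
```

The chain never moves in theta, and the estimated posterior is an average of smooth functions of theta, so no histogram is needed.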

Page 20

Advantages

• Eliminates problems with covariance between parameters leading to mixing problems.

• Provides a smooth posterior/likelihood function useful for optimization and likelihood ratio estimation.

Disadvantages

• Requires more calculation in each MCMC iteration

Page 21

Likelihood ratio estimation

6 loci, 15 gene copies, H0: m1=m2

Page 22

Other approaches

• Kuhner and Felsenstein use a combination of MCMC and importance sampling to estimate surfaces (no prior for the parameters).

• PAC methods, suggested by Stephens and Donnelly, sample from a close approximation to

\[
p(G \mid X, \Theta) = \frac{\Pr(X \mid G)\, p(G \mid \Theta)}{\Pr(X \mid \Theta)}
\]

to estimate an approximate likelihood.

• ABC methods (Beaumont, Pritchard, Tavare and others) are a very popular and promising class of methods based on (1) reducing the data to summary statistics, (2) simulating new data from the prior, and (3) accepting the parameter value under which the data were simulated if the difference between simulated and true statistics is less than a chosen tolerance.
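The three ABC steps can be sketched as a rejection sampler. The model is a toy assumption made for this example only: theta has a flat prior on (0, theta_max], the genealogy is a single exponential tree length with mean theta, and the summary statistic is the number of mutations on that tree. Function names are hypothetical.

```python
import random

def simulate_summary(theta, rng):
    """Simulate a toy data set under theta and reduce it to one summary statistic."""
    tau = rng.expovariate(1.0 / theta)   # tree length G drawn from p(G | theta)
    count, t = 0, rng.expovariate(1.0)   # mutations arrive as a rate-1 Poisson process
    while t <= tau:
        count += 1
        t += rng.expovariate(1.0)
    return count

def abc_rejection(s_obs, theta_max, n_sims, tolerance, rng):
    """Keep each prior draw whose simulated statistic is within `tolerance` of s_obs."""
    accepted = []
    for _ in range(n_sims):
        theta = theta_max * (1.0 - rng.random())   # prior draw, uniform on (0, theta_max]
        # Using <= so that tolerance = 0 means an exact match of the statistic.
        if abs(simulate_summary(theta, rng) - s_obs) <= tolerance:
            accepted.append(theta)
    return accepted

rng = random.Random(6)
accepted = abc_rejection(s_obs=3, theta_max=10.0, n_sims=50_000, tolerance=0, rng=rng)
print(len(accepted), sum(accepted) / len(accepted))
```

With a discrete statistic and zero tolerance the accepted values are exact posterior draws for this toy model; with continuous or high-dimensional statistics a positive tolerance trades accuracy for acceptance rate.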

Page 23

SNP Data

Nielsen and Slatkin (2000)

Page 24
Page 25
Page 26

A more efficient method...

Griffiths and Tavare (1998), Nielsen (2000)

Page 27

A more efficient method...

Griffiths and Tavare (1998), Nielsen (2000)

Page 28
Page 29

Ascertainment Sample vs. Typed Sample

Ascertainment sample

Typed sample

Page 30

n = 20, d = 4, #SNPs = 1000

[Figure: frequency (0.00 to 0.30) against allele count x (1 to 19); series: true frequencies, observed frequencies.]

Page 31

[Figure: E[D'] (0.5 to 1.0) against 2Nc (0 to 10); series: no ascertainment bias, ascertainment bias.]

Page 32

Correcting for ascertainment biases

Now, for simplicity, consider the case without a sweep, then

\[
L(\mathbf{P}) = \prod_i \frac{\Pr(X_i = x_i, \mathrm{Asc} \mid \mathbf{P})}{\Pr(\mathrm{Asc} \mid \mathbf{P})}
= \prod_i \frac{p_{x_i} \Pr(\mathrm{Asc} \mid X_i = x_i)}{\Pr(\mathrm{Asc} \mid \mathbf{P})}
\]

where (in the simplest possible case)

\[
\Pr(\mathrm{Asc} \mid X_i = x_i) = 1 - \frac{\binom{x_i}{d} + \binom{n - x_i}{d}}{\binom{n}{d}}
\]

and

\[
\Pr(\mathrm{Asc} \mid \mathbf{P}) = \sum_{j=1}^{n-1} p_j \Pr(\mathrm{Asc} \mid X_i = j)
\]

Page 33

In this simple case, the maximum likelihood estimate of P is simply given by

\[
\hat{p}_k = \frac{n_k}{\Pr(\mathrm{Asc} \mid X = k)} \left[ \sum_{j=1}^{n-1} \frac{n_j}{\Pr(\mathrm{Asc} \mid X = j)} \right]^{-1}, \quad k = 1, 2, \ldots, n - 1,
\]

where n_k is the number of SNPs with allele frequency k.
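The correction can be sketched in a few lines. This assumes the simplest ascertainment scheme, in which a SNP is discovered when it is variable in a random subsample of d of the n typed chromosomes (with d at least 2); the function names and the 1/k test spectrum are illustrative choices, not taken from the talk.

```python
from math import comb

def prob_ascertained(x, n, d):
    """Pr(Asc | X = x): a SNP with x copies among n chromosomes is variable
    in a random ascertainment subsample of d of those chromosomes."""
    return 1.0 - (comb(x, d) + comb(n - x, d)) / comb(n, d)

def corrected_spectrum(counts, n, d):
    """Estimate p_k from counts[k], the number of ascertained SNPs with allele count k."""
    weights = {k: counts.get(k, 0) / prob_ascertained(k, n, d) for k in range(1, n)}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Check on expected counts: start from a spectrum proportional to 1/k,
# thin it by the ascertainment probabilities, then undo the thinning.
n, d = 20, 4
norm = sum(1.0 / k for k in range(1, n))
true_p = {k: (1.0 / k) / norm for k in range(1, n)}
observed = {k: 1000.0 * true_p[k] * prob_ascertained(k, n, d) for k in range(1, n)}
estimate = corrected_spectrum(observed, n, d)
print(max(abs(estimate[k] - true_p[k]) for k in range(1, n)))
```

Rare alleles are the least likely to be ascertained, so they receive the largest upweighting, which also makes their corrected frequencies the noisiest when real counts are used.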

Selective sweeps:

Similarly define L(P, D, …) = Pr(X | P, D, …, Asc)

Page 34

[Figure: frequency (0 to 0.3) against allele count (1 to 19); series: true frequencies, observed frequencies, corrected frequencies.]

10,000 simulated SNPs with n = 20 and d = 5

Page 35

[Figure, panel b: vertical axis 0 to 0.7, horizontal axis 2Nc (2 to 10); series: corrected, uncorrected.]

Hudson’s (2001) Estimator when n = 100, m = 5, = 5, and #SNP pairs = 200.

Page 36

Complications

• Double-hit ascertainment (HapMap)

• Ascertainment based on chimpanzee (HapMap)

• Panel depth may vary among SNPs and/or among regions (HapMap).

• Ascertainment method may vary among SNPs (HapMap).

• Population structure (HapMap).

• Loss of information regarding asc. scheme (HapMap??).

Page 37

HapMap ascertainment depth distrb. (ignores many important components)

[Figure: proportion (0 to 0.3) against ascertainment depth (2 to 21).]

Page 38
Page 39

Perlegen

HapMap

Page 40

Data

Directly sequenced polymorphism data from 20 European-Americans, 19 African-Americans and one chimpanzee from 9,316 protein coding genes

Data set previously described in Bustamante, C.D. et al. 2005. Natural selection on protein-coding genes in the human genome. Nature 437, 1153-7.

Page 41

Demographic model

European-Americans, African-Americans

Bottleneck

Population growth

Migration

Admixture

Page 42

Estimation

\[
L(\Theta) = \prod_{j=1}^{n-1} p_j(\Theta)^{n_j}
\]

where p_j(Θ) are the sampling probabilities from the 2D frequency spectrum and n_j is the number of SNPs with pattern j in the 2D frequency spectrum.

SNPs within a gene are correlated. But the estimator is consistent. The estimate has the same properties as a real likelihood estimator except that it converges slightly slower because of the correlation (Wiuf 2006).
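On the log scale this product is a sum of counts times log probabilities. The sketch below assumes the pattern probabilities p_j(Θ) have already been obtained for a candidate Θ (for example by simulation, which is not shown); the function name and the numbers are purely illustrative.

```python
import math

def composite_log_likelihood(counts, probs):
    """log L = sum_j n_j * log p_j, treating SNPs as if they were independent."""
    return sum(n_j * math.log(p_j) for n_j, p_j in zip(counts, probs) if n_j > 0)

# Hypothetical counts for three spectrum patterns and two candidate parameter values.
counts = [60, 30, 10]
probs_a = [0.6, 0.3, 0.1]   # pattern probabilities under candidate A
probs_b = [0.4, 0.4, 0.2]   # pattern probabilities under candidate B
print(composite_log_likelihood(counts, probs_a), composite_log_likelihood(counts, probs_b))
```

Candidate A reproduces the observed proportions exactly, so it gets the higher value; maximizing this function over Θ gives the estimate.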

Page 43

African-Americans

[Figure: proportion of SNPs (0.00 to 0.35) against allele frequency (0 to 35); series: simulated, observed.]

Page 44

European-Americans

[Figure: proportion of SNPs (0.00 to 0.45) against allele frequency (0 to 40); series: simulated, observed.]

Goodness-of-fit: p = 0.6

Page 45

Acknowledgements

Jody Hey, John Wakeley, Melissa Hubisz, Andy Clark, Carlos Bustamante, Scott Williamson, Aida Andres, Amit Andip, Adam Boyko, Anders Albrechtsen, Mark Adams, Michelle Cargill and other staff at Celera Genomics and Applied Biosystems.