
Rare variants, selection, and recent population demography in humans

John Novembre

Department of Human Genetics

University of Chicago

October 8, 2013

Wednesday, October 30, 2013

Outline

• Background: Recent attention on rare variants

• Missing heritability

• Large sequencing surveys

• Modeling studies: rare variants and heritability

• Two works in progress:

• Characteristics of negatively selected rare variant haplotypes under varying demographies

• Properties of the distribution of allele frequencies for negatively selected alleles under varying demographies

• Conclusion


Common Variant Genome-wide Association Studies

Many insightful loci found, but not enough to explain the observed heritability of the traits in question

Potential “hiding grounds” of the missing heritability:

• Untyped common SNPs

• CNVs/Structural variants

• Epistatic effects

• Epigenetic effects

• Over-estimates of heritability

• Rare variants of modest effect

Manolio et al, Nature (2009)


Large sample size surveys in humans

Study                     Genes              Individuals
Coventry et al (2010)     2                  13,715
Nelson et al (2012)       202                14,002
Tennessen et al (2012)    15,585             2,440
Fu et al (2013)           15,336             6,515
Gazave et al (arXiv)      15 (non-coding)    493

• Consistent findings of:

• Abundance of rare variants (e.g. 1 per 17bp in Nelson et al)

• Excess of rare variants over constant-size predictions

• Among rare variants, excess missense and nonsense coding variants
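
As a point of reference for the "constant-size predictions" above, here is a minimal sketch of the standard neutral expectation for a constant-size population, under which the expected number of sites with derived allele count i in a sample of n chromosomes is θ/i. That the cited studies used exactly this baseline is an assumption, and the function names `neutral_sfs` and `singleton_fraction` are hypothetical.

```python
import numpy as np

def neutral_sfs(theta, n):
    """Expected unfolded site frequency spectrum for a constant-size
    neutral population: entry i-1 is theta / i for i = 1..n-1."""
    counts = np.arange(1, n)
    return theta / counts

def singleton_fraction(n):
    """Expected share of segregating sites that are singletons (theta cancels)."""
    sfs = neutral_sfs(1.0, n)
    return sfs[0] / sfs.sum()

if __name__ == "__main__":
    # The expected singleton share shrinks slowly as the sample grows.
    for n in (10, 1000, 28000):
        print(n, round(singleton_fraction(n), 3))
```

An observed singleton share well above this baseline is the kind of excess the surveys report.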


Gene-level abundance of variants as a function of frequency

[Figure: multi-panel figure (panels A to F), including variant count per kb vs. minor allele count by functional category, number of variants by MAF class, proportion of cMAF by gene cMAF range, and cumulative MAF and coding length vs. gene rank (by number of rare coding SNVs)]

Number of common variants per gene

Number of rare variants per gene

202 genes ordered by number of rare variants

Adapted from: Nelson et al (2012)


Observed log-log frequency spectra

[Figure: variant count per kb vs. minor allele count on log-log axes, by functional category (intron, UTR, non-synonymous, synonymous)]

~10-fold deficiency of common nonsynonymous variants

Excess of rare variants over constant-size expectations

Relative abundance of rare variants does not depend on functional category

Constant-size expectations

Nelson et al (2012)



Nonsynonymous:Synonymous ratio vs minor allele frequency

[Figure: multi-panel figure, including (A) number of sites observed to be variable per kb vs. sample size; (C) NS vs. S fractions by MAF (%) bin, from new mutations and singletons up to (2, 50); (D) PolyPhen and SIFT categories by MAF (%) bin; (E) PhyloP score by MAF (%) bin; and a relative ratio inference (never common / never fixed / neutral)]

Proportion of NonSyn:Syn in singletons is close to mutational expectation

Substantial fraction of variants are nonsynonymous up to ~2% frequency

Nelson et al (2012)


Data quality is crucial...

Shown: Simulation study of inferred vs. true frequency spectra

Eunjung Han, PhD student, Biostatistics

Panels: 4x coverage per individual; 10x coverage per individual

• Single-sample caller mode

• False positive problem for rare variants

• Multi-sample caller mode

• False positive problem for rare variants

• Single or multi...

• Higher coverage necessary to infer frequency spectra


Data quality metrics for Nelson et al study

• Median coverage of 27x

• Call rate of 90.7% using a depth >= 7 and GQ >= 20 filter

• Heterozygous concordance at called sites:

– 99.1% estimated from 130 sample duplicates

– 99.0% estimated from comparison to 1000G trios

• Singleton concordance at called sites:

– 98.5% estimated from 130 sample duplicates

– 98.3% estimated from 245 validation attempts with capillary sequencing

• Estimates of false negative rate for calling variants due to filtering on depth and GQ:

– 1.02% overall

– 2.72% for singletons


Do rare variants contribute substantially to complex trait heritability?

• Models with constant-size populations:

• Pritchard (2001, AJHG) and Eyre-Walker (2010, PNAS)

• Models considering population growth:

• Simons et al (2013, arXiv), Lohmueller (2013, arXiv), Gazave et al (2013, Genetics)

• Findings of:

• Strong dependence on distribution of fitness effects

• If disease alleles are negatively selected, the contribution of rare variants to heritability can be strong, especially in the presence of recent growth

• Growth increases genetic heterogeneity more than genetic variance and load - many loci at lower frequencies


Open questions to discuss today

• What should we expect for the characteristics of the haplotypes that carry a negatively selected variant?

• Conditional on frequency, deleterious variants are expected to have younger ages and hence longer shared haplotypes

• How is this affected by recent demography?

• Does it have applications for detecting selected alleles?

• How is the frequency spectrum of negatively selected alleles impacted by recent demography?


Challenge: Efficient simulation of rare variant haplotypes

• Studying haplotype variation around deleterious rare variants is challenging with forward simulations

• Difficult to condition on final frequency

• Expensive to simulate haplotypes in forward-time simulations

• Proposed solution:

• Importance sampling to generate allele frequency trajectories conditioned on final allele frequency

• Structured coalescent to generate sample haplotype data

• Joint work with Diego Ortega del Vecchyo (PhD student, Bioinformatics UCLA)


Importance sampling with backward trajectories

• Following Slatkin (2001)

• Backwards path generated by modified Wright-Fisher

where y’t is the z that solves

I.e., the mean frequency is the frequency that would give rise to the current frequency after one deterministic update.

Motivated by Maruyama (1974), use -s1 and -s2 in the above for deleterious alleles.
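
The backward step described above can be sketched as follows. The exact update used in the talk is not reproduced here, so several choices are assumptions: genotype fitnesses of 1, 1+s1 and 1+s2, a binomial draw of 2N copies around the solved mean, and the hypothetical names `deterministic_update`, `solve_previous_mean` and `backward_trajectory`. The importance weights that correct for this proposal are not computed in this sketch.

```python
import numpy as np

def deterministic_update(z, s1, s2):
    """One generation of deterministic selection on frequency z,
    assuming genotype fitnesses 1, 1 + s1, 1 + s2 (an assumption)."""
    numerator = z * z * (1 + s2) + z * (1 - z) * (1 + s1)
    mean_fitness = 1 + 2 * z * (1 - z) * s1 + z * z * s2
    return numerator / mean_fitness

def solve_previous_mean(y, s1, s2, tol=1e-12):
    """Bisection for the z in [0, 1] whose deterministic update equals y.
    The update is increasing in z when all fitnesses are positive."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if deterministic_update(mid, s1, s2) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def backward_trajectory(p_final, n_diploid, s1, s2, rng, max_gens=100_000):
    """Sample a frequency path backward in time from p_final until the
    allele is absent (or fixed), in a population of constant size."""
    two_n = 2 * n_diploid
    freqs = [p_final]
    while 0.0 < freqs[-1] < 1.0 and len(freqs) <= max_gens:
        z = solve_previous_mean(freqs[-1], s1, s2)
        freqs.append(rng.binomial(two_n, z) / two_n)
    return np.array(freqs)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Sign-flipped coefficients, following the note about deleterious alleles.
    path = backward_trajectory(0.01, 1000, 0.005, 0.01, rng, max_gens=5000)
    print("generations sampled:", len(path) - 1)
```

A varying population size would replace the fixed `n_diploid` with the size in each past generation.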


Example allele frequency trajectories

• Conditional on reaching a given frequency:

• Negatively selected alleles rise in frequency faster than neutral alleles

• In the case of additive fitness effects and constant size, expected allele ages are equal for positively and negatively selected alleles

[Figure: allele frequency (0.01 to 0.05) vs. number of generations (0 to 1500) for selective coefficients 4Ns = 0, 4Ns = 200 and 4Ns = −200]


Structured coalescent

• Structured coalescent

• Allele frequency trajectory determines subpopulation size

• Using extension to ms courtesy of Dick Hudson...

Hudson and Kaplan 1998; Spencer and Coop 2004


Importance sampling with backward trajectories

• Calculate importance sampling weights:

$$w_i = \frac{P_F(H)}{P_B(H)} = \cdots$$

• We can compute expectations of arbitrary functions / summary statistics from the realized data:

• And we can calculate Effective Sample Sizes:
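
A minimal sketch of how such weights are commonly used: the self-normalized importance sampling estimate of an expectation, and the usual effective sample size (Σw)² / Σw². These are standard forms; that the talk used exactly these is an assumption, and the function names are hypothetical.

```python
import numpy as np

def is_expectation(values, weights):
    """Self-normalized importance sampling estimate of E[f(H)],
    given f evaluated on each simulated history and its weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.sum(weights * values) / np.sum(weights)

def effective_sample_size(weights):
    """ESS = (sum of w)^2 / (sum of w^2): equals the number of draws when
    all weights are equal and approaches 1 when one weight dominates."""
    weights = np.asarray(weights, dtype=float)
    return np.sum(weights) ** 2 / np.sum(weights ** 2)

if __name__ == "__main__":
    ages = [120.0, 300.0, 80.0, 510.0]   # illustrative values only
    w = [0.9, 0.1, 1.5, 0.05]            # illustrative weights only
    print(is_expectation(ages, w), effective_sample_size(w))
```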


Validation: Theory vs. IS estimates for constant-size case

• Comparison to Maruyama (1974) results

[Figure: expected allele age (generations) vs. 4Ns (−100 to 0) for allele frequencies 0.01 and 0.03, comparing theory with IS estimates]


Evaluating effect of demographic history

• We will consider 4 trajectories of population size

• “Constant”: Ne=10,000

• “Africa” and “Europe” trajectories based on Tennessen et al (2012)

• Africa final Ne=424k

• Europe final Ne=512k

• Europe’: Europe without recent growth

[Figure: log10(Ne) vs. thousands of years (150 to 0) for the Constant, Africa, Europe and Europe' trajectories]


Results: Age as a function of pF, demography, and 4Ns

• As expected, negatively selected alleles have younger ages

• The bottleneck in Europe/Europe’ appears to have an effect:

• Dampens the dependence of age on 4Ns for 4Ns < -10

• Strengthens the dependence of age on 4Ns for 4Ns > -10


Haplotype diversity statistics

• Expected haplotype homozygosity for a variant

• For specified length scale (e.g. 100kb or 0.1cM)

• Calculate frequencies for each of K haplotypes that carry the variant allele

• And then calculate expected haplotype homozygosity as:

$$\mathrm{EHH} = \sum_{i=1}^{K} p_i^2$$

• NC statistic (Kiezun et al 2013)

• Identify minimal distance to a recombination or less frequent mutation in each direction from the variant

• NC = Log of distance between the events in each direction
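
The EHH calculation above can be sketched in a few lines. This is an illustration under assumptions: haplotypes are given as strings already truncated to the chosen length scale and restricted to carriers of the variant allele, and the function name is hypothetical.

```python
from collections import Counter

def expected_haplotype_homozygosity(carrier_haplotypes):
    """Sum of squared frequencies of the distinct haplotypes observed
    among chromosomes that carry the variant allele."""
    counts = Counter(carrier_haplotypes)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

if __name__ == "__main__":
    # Four carriers: one haplotype seen twice, two seen once each.
    carriers = ["AAT", "AAT", "ACT", "GCT"]
    print(expected_haplotype_homozygosity(carriers))
```

A value of 1 means all carriers share one haplotype over the window; values near 1/K indicate K roughly equally frequent haplotypes.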


Results: Haplotype diversity statistics

• As expected:

• Lower frequency alleles have higher EHH

• Alleles under stronger purifying selection have higher EHH

• Notable:

• Magnitude of haplotype response with respect to 4Ns is small in absolute terms (also see Kiezun et al 2013)

• Rapid growth in Europe leads to reduced EHH relative to Europe’

• NC statistic slightly more responsive than EHH to 4Ns (results not shown)


Caveats and extensions to IS

• Inefficiency of importance sampling

• Effective sample size (ESS) mean: 112.7

• SD=66.6 across parameter settings

• With 200,000 simulations

• Worst ESS values are for larger allele frequencies and selection coefficients

• Extensions:

• Suggests a place for attempting an SIS resampling algorithm

• Using sampled coalescent genealogies, can calculate likelihoods based on rare variant haplotype configurations


Examining the distribution of allele frequencies using diffusion theory based approaches

• Thus far we have conditioned on allele frequency...

• But what does the distribution of allele frequencies look like for negatively selected alleles?

• How is it affected by demography?

• Diffusion-based approaches are best suited to address these questions...

• Joint work with Evan Koch (PhD student, Ecology & Evolution, U Chicago)


Diffusion equations

Evans et al. (2007) describe the following formula and boundary conditions for the expectation measure of the frequency spectrum.

Which they transform to make numerical solution easier:

Where we use the usual Wright-Fisher mean and variance and measure time in units of the population size at t=0.
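
For orientation only, a generic forward diffusion for a density f(x, t) with the usual Wright-Fisher mean and variance can be written as below. This is not the exact equation, transformation, or boundary conditions of Evans et al. (2007), and it makes several assumptions: the symbols M, V, s and N(t) are introduced here, selection is genic with per-allele coefficient s, and time is in generations rather than in units of the population size at t=0.

```latex
\frac{\partial f(x,t)}{\partial t}
  = -\frac{\partial}{\partial x}\bigl[M(x)\,f(x,t)\bigr]
  + \frac{1}{2}\,\frac{\partial^{2}}{\partial x^{2}}\bigl[V(x,t)\,f(x,t)\bigr],
\qquad
M(x) = s\,x(1-x), \quad
V(x,t) = \frac{x(1-x)}{2N(t)}
```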


Numerical solution: Temporal update

Continuing the method of Evans et al., we use an implicit backward Euler scheme over a grid on t, and first- and second-order difference schemes for the first and second partial derivatives in x, respectively.

Where,

We solve this linear system of equations to obtain numerical solutions forward in time.

Where the vector d contains the boundary conditions for the diffusion.
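
A minimal sketch of one implicit backward Euler step for a generic diffusion of the form ∂f/∂t = −∂/∂x[M f] + ½ ∂²/∂x²[V f]. It departs from the talk in ways that are assumptions: a uniform x grid, central differences for both derivatives, fixed (Dirichlet) boundary values, and a dense linear solve. The function name and arguments are hypothetical.

```python
import numpy as np

def backward_euler_step(f, x, dt, mean, var, f_left, f_right):
    """Advance f by one implicit step: solve (I - dt * L) f_new = f_old,
    where L is the discretized diffusion operator on interior points and
    the boundary rows pin f_new to f_left and f_right."""
    f = np.asarray(f, dtype=float)
    n = len(x)
    h = x[1] - x[0]          # uniform spacing (assumption)
    M = mean(x)              # drift term evaluated on the grid
    V = var(x)               # variance term evaluated on the grid
    A = np.eye(n)
    rhs = f.copy()
    for i in range(1, n - 1):
        # Coefficients of f[i-1], f[i], f[i+1] in the operator L at point i.
        lower = M[i - 1] / (2 * h) + 0.5 * V[i - 1] / h**2
        diag = -V[i] / h**2
        upper = -M[i + 1] / (2 * h) + 0.5 * V[i + 1] / h**2
        A[i, i - 1] -= dt * lower
        A[i, i] -= dt * diag
        A[i, i + 1] -= dt * upper
    rhs[0], rhs[-1] = f_left, f_right   # boundary conditions enter the right-hand side
    return np.linalg.solve(A, rhs)

if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 11)
    # With zero drift and constant variance, a linear profile is a steady state.
    out = backward_euler_step(x.copy(), x, 0.01,
                              lambda g: np.zeros_like(g),
                              lambda g: np.full_like(g, 2.0),
                              0.0, 1.0)
    print(np.max(np.abs(out - x)))
```

The matrix is tridiagonal, so a banded solver would be the natural choice on a fine grid; the dense solve is kept here only for readability.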


Numerical solution: Grid settings

To perform this numerical solution we need to choose an appropriate grid on x and t. The frequency spectrum is very steep at low values of x, so we use a nonuniform grid beginning with smaller spacing.

Begin with x spacing ~10^-8, double after 20 steps until spacing is 10^-3.

Spacing in t is uniform at 10^-3.

[Figure: expected number vs. frequency (0 to 1) for the Constant, Africa and Europe demographies; 4Ns = -10]
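
One reading of the grid rule above, as a sketch: start at spacing 10^-8, double the spacing after every 20 steps, and cap it at 10^-3. The interpretation of "double after 20 steps", the choice to start the grid at x = 0 and end it at x = 1, and the function name are all assumptions.

```python
import numpy as np

def build_frequency_grid(h0=1e-8, h_max=1e-3, steps_per_level=20):
    """Nonuniform grid on [0, 1]: fine spacing near 0, doubling every
    `steps_per_level` steps until the spacing reaches `h_max`."""
    xs = [0.0]
    h = h0
    steps = 0
    while xs[-1] + h < 1.0:
        xs.append(xs[-1] + h)
        steps += 1
        if steps % steps_per_level == 0 and h < h_max:
            h = min(2 * h, h_max)
    xs.append(1.0)   # close the grid at the upper boundary
    return np.array(xs)

if __name__ == "__main__":
    grid = build_frequency_grid()
    print(len(grid), grid[1], np.diff(grid).max())
```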


Testing numerical methods

Numerically propagate the equilibrium solution forward in time for 6000 generations to see how consistent it stays.

The error appears to increase with allele frequency, but its scale is relatively small.

S=2


Slices of the spectra through time

[Figure: expected number of polymorphic sites vs. time (0 to 0.4, in units of 2N generations) in three frequency ranges: very common variants (20-50% MAF), less common variants (5-20% MAF), and rare variants (0-5% MAF); curves for S = −20, −10, −2 and −1 under the African and European demographies]

• For “very common” variants: little effect of demography

• For “less common” variants: older events show effect

• For “rare” variants: tight coupling with demography - large impact of recent growth


Proportion of variants that have arisen since onset of growth

[Figure: proportion of novel variants vs. −s (0 to 0.003) for the Constant, Africa, Europe and Europe' demographies]

• First solution: run forward to the present with no mutations added after the onset of growth

• Second solution: run from the onset of growth to the present, starting with no variants

• By summing the two solutions we obtain the total number of variants and the proportion that arose since the onset of growth
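
The bookkeeping in the last step can be sketched as below, assuming each solution is summarized as an array of expected variant counts per frequency bin. The array representation and the function name are hypothetical, not taken from the talk.

```python
import numpy as np

def proportion_novel(old_spectrum, new_spectrum):
    """Share of all segregating variants that arose after the onset of growth.
    old_spectrum: variants present before growth that persist to the present.
    new_spectrum: variants from mutations arising after the onset of growth."""
    old = np.asarray(old_spectrum, dtype=float)
    new = np.asarray(new_spectrum, dtype=float)
    total = old + new        # the two solutions add because the equation is linear
    return new.sum() / total.sum()

if __name__ == "__main__":
    print(proportion_novel([1.0, 1.0], [3.0, 3.0]))
```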


Conclusions

• Haplotype characteristics of negatively selected rare variants

• Age and haplotype diversity reflect 4Ns of the variant

• However, only weakly, especially for haplotype diversity.

• Any inference will need to bin variants

• Demographies: the European bottleneck has a strong effect on ages, and subsequent growth lowers haplotype homozygosity relative to no growth

• Frequency spectra characteristics:

• Sensitivity to demography is highly dependent on frequency range

• Most variants in growing populations are recent and rare, especially negatively selected ones


Conclusions

• Have applied two computational techniques for generating expectations in situations with varying population sizes and negative selection

• Techniques are promising but still have challenges:

• Very strong selection and growth are problematic for both

• Sampling issues (particularly as n approaches 2N)

• Still have limitations of numeric approaches - how outcomes depend on parameters is never generalized

• Handling these challenges is an exciting and important area for contemporary theory and methods development


Acknowledgements

• Importance sampling to study rare variant haplotypes

• Diego Ortega del Vecchyo (UCLA)

• Numerical solutions to diffusion

• Evan Koch (University of Chicago)

• Funding: NIH, Sloan Research Fellowship


Thanks

