Rare variants, selection, and recent population demography in humans
John Novembre
Department of Human Genetics
University of Chicago
October 8, 2013
Wednesday, October 30, 2013
Outline
• Background: Recent attention on rare variants
• Missing heritability
• Large sequencing surveys
• Modeling studies: rare variants and heritability
• Two works in progress:
• Characteristics of negatively selected rare variant haplotypes under varying demographies
• Properties of the distribution of allele frequencies for negatively selected alleles under varying demographies
• Conclusion
Common Variant Genome-wide Association Studies
Many insightful loci found, but not enough to explain the observed heritability of the traits in question
Potential “hiding grounds” of the missing heritability:
• Untyped common SNPs
• CNVs/Structural variants
• Epistatic effects
• Epigenetic effects
• Over-estimates of heritability
• Rare variants of modest effect
Manolio et al, Nature (2009)
Large sample size surveys in humans
Study                    Genes              Individuals
Coventry et al (2010)    2                  13,715
Nelson et al (2012)      202                14,002
Tennessen et al (2012)   15,585             2,440
Fu et al (2013)          15,336             6,515
Gazave et al (arXiv)     15 (non-coding)    493
• Consistent findings of:
• Abundance of rare variants (e.g. 1 per 17bp in Nelson et al)
• Excess of rare variants over constant-size predictions
• Among rare variants, excess missense and nonsense coding variants
Gene-level abundance of variants as a function of frequency
[Figure: 202 genes ordered by number of rare variants (gene rank by number of rare coding SNVs), showing the number of common variants per gene and the number of rare variants per gene.]
Adapted from: Nelson et al (2012)
Observed log-log frequency spectra
[Figure: variant count per kb versus minor allele count on log-log axes for intron, UTR, nonsynonymous, and synonymous sites, with constant-size expectations shown for comparison.]
• ~10-fold deficiency of common nonsynonymous variants
• Excess of rare variants over constant-size expectations
• Relative abundance of rare variants does not depend on functional category
Nelson et al (2012)
Nonsynonymous:Synonymous ratio vs minor allele frequency
[Figure: frequency of nonsynonymous (NS) and synonymous (S) variants by MAF (%) bin: mutations, singleton, doubleton, (0.01, 0.1], (0.1, 0.5], (0.5, 2], (2, 50).]
• Proportion of NonSyn:Syn in singletons is close to the mutational expectation
• A substantial fraction of variants are nonsynonymous up to ~2% frequency
Nelson et al (2012)
Data quality is crucial...
Shown: Simulation study of inferred vs. true frequency spectra
Eunjung Han, PhD student, Biostatistics
[Figure panels: 4x coverage per individual; 10x coverage per individual]
• Single-sample caller mode: false positive problem for rare variants
• Multi-sample caller mode: false positive problem for rare variants
• Single or multi...: higher coverage necessary to infer frequency spectra
Data quality metrics for Nelson et al study
• Median coverage of 27x
• Call rate of 90.7% using a depth >=7 and GQ >=20 filter
• Heterozygous concordance at called sites:
– 99.1% estimated from 130 sample duplicates
– 99.0% estimated from comparison to 1000G Trios
• Singleton concordance at called sites:
– 98.5% estimated from 130 sample duplicates
– 98.3% estimated from 245 validation attempts with capillary sequencing
• Estimates of false negative rate for calling variants due to filtering on depth and GQ:
– 1.02% overall
– 2.72% for singletons
Do rare variants contribute substantially to complex trait heritability?
• Models with constant-size populations:
• Pritchard (2001, AJHG) and Eyre-Walker (2010, PNAS)
• Models considering population growth:
• Simons et al (2013, arXiv), Lohmueller (2013, arXiv), Gazave et al (2013, Genetics)
• Findings of:
• Strong dependence on distribution of fitness effects
• If disease alleles are negatively selected, the contribution of rare variants to heritability can be strong, especially in the presence of recent growth
• Growth increases genetic heterogeneity more than genetic variance and load: many loci at lower frequencies
Open questions to discuss today
• What should we expect for the characteristics of the haplotypes that carry a negatively selected variant?
• Conditional on frequency, deleterious variants are expected to have younger ages and hence longer shared haplotypes
• How is this affected by recent demography?
• Does it have applications for detecting selected alleles?
• How is the frequency spectrum of negatively selected alleles impacted by recent demography?
Challenge: Efficient simulation of rare variant haplotypes
• Studying haplotype variation around deleterious rare variants is challenging using forward simulations
• Difficult to condition on final frequency
• Expensive to simulate haplotypes in forward-time simulations
• Proposed solution:
• Importance sampling to generate allele frequency trajectories conditioned on final allele frequency
• Structured coalescent to generate sample haplotype data
• Joint work with Diego Ortega del Vecchyo (PhD student, Bioinformatics UCLA)
Importance sampling with backward trajectories
• Following Slatkin (2001)
• Backwards path generated by a modified Wright-Fisher model [equation not recovered]
where y'_t is the z that solves [equation not recovered]
I.e. the mean frequency is the frequency that would give rise to the current frequency after a deterministic update.
Motivated by Maruyama (1974): use -s1 and -s2 in the above for deleterious alleles
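The backward step described on this slide can be sketched as follows. The slide's equations did not survive extraction, so this is only an illustration under stated assumptions: genotype fitnesses are taken to be 1 + s1 (AA), 1 + s2 (Aa), and 1 (aa), the previous-generation count is drawn from a binomial with mean y', and all function names are hypothetical.

```python
import random

def deterministic_update(z, s1, s2):
    """Frequency after one generation of selection, starting from frequency z.

    Assumed (not from the slide) genotype fitnesses: AA = 1 + s1, Aa = 1 + s2, aa = 1.
    """
    mean_fitness = 1.0 + s1 * z * z + 2.0 * s2 * z * (1.0 - z)
    return (z * z * (1.0 + s1) + z * (1.0 - z) * (1.0 + s2)) / mean_fitness

def backward_mean(y, s1, s2, tol=1e-12):
    """Bisection for the z in [0, 1] with deterministic_update(z) == y.

    Relies on the update being increasing in z, which is assumed here.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if deterministic_update(mid, s1, s2) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def backward_step(y, two_n, s1, s2, rng):
    """Draw the previous-generation frequency as Binomial(2N, y') / 2N."""
    y_prime = backward_mean(y, s1, s2)
    count = sum(1 for _ in range(two_n) if rng.random() < y_prime)
    return count / two_n

# One backward step for a deleterious allele currently at frequency 0.03.
rng = random.Random(1)
y_prev = backward_step(0.03, 2000, -0.01, -0.005, rng)
```

Repeating `backward_step` until the frequency reaches zero would give one backward trajectory; the importance weights then correct for the fact that this proposal is not the true conditional process.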
Example allele frequency trajectories
• Conditional on achieving a fixed frequency:
• Negatively selected alleles rise in frequency faster than neutral alleles
• In the case of additive fitness effects and constant size, expected allele ages are equal for positively and negatively selected alleles
[Figure: allele frequency (0.01 to 0.05) versus number of generations (0 to 1500) for selective coefficients 4Ns = 0, 4Ns = 200, and 4Ns = −200.]
Structured coalescent
• Allele frequency trajectory determines subpopulation size
• Using extension to ms courtesy of Dick Hudson...
Hudson and Kaplan 1998; Spencer and Coop 2004
Importance sampling with backward trajectories
• Calculate importance sampling weights: $w_i = P_F(H) / P_B(H)$ [remainder of equation not recovered]
• We can compute expectations of arbitrary functions / summary statistics from the realized data [equation not recovered]
• And we can calculate Effective Sample Sizes [equation not recovered]
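The formulas for the expectation and the effective sample size were lost from the slide. A minimal sketch, assuming the commonly used self-normalized importance sampling estimator and the (sum of weights)^2 / (sum of squared weights) definition of ESS, neither of which is confirmed by the source; function names are hypothetical:

```python
def importance_estimate(values, weights):
    """Self-normalized IS estimate of E[f(H)] from f(H_i) and weights w_i = P_F / P_B."""
    total = sum(weights)
    return sum(w * v for v, w in zip(values, weights)) / total

def effective_sample_size(weights):
    """ESS = (sum w_i)^2 / sum w_i^2; equals the sample count when all weights are equal."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

# Equal weights recover the plain average and an ESS equal to the number of draws.
ages = [100.0, 200.0, 300.0, 400.0]
print(importance_estimate(ages, [1.0, 1.0, 1.0, 1.0]))
print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))
```

When a few weights dominate, the ESS falls toward 1, which is the sense in which a low ESS signals an inefficient proposal.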
Validation: Theory vs. IS estimates for constant-size case
• Comparison to Maruyama (1974) results
[Figure: expected allele age (generations) versus 4Ns (−100, −50, −10, −1, 0) for allele frequencies 0.01 and 0.03, comparing theory with IS estimates.]
Evaluating effect of demographic history
• We will consider 4 trajectories of population size
• “Constant”: Ne=10,000
• “Africa” and “Europe” trajectories based on Tennessen et al (2012)
• Africa final Ne=424k
• Europe final Ne=512k
• Europe’: Europe without recent growth
[Figure: log10(Ne) (3 to 6) versus thousands of years (150 to 0) for the Constant, Africa, Europe, and Europe' trajectories.]
Results: Age as a function of pF, demography, and 4Ns
• As expected, negatively selected alleles have younger ages
• Bottleneck in Europe/Europe’ appears to have an effect:
• Dampens the dependence of age on 4Ns for 4Ns < -10
• Strengthens the dependence of age on 4Ns for 4Ns > -10
Haplotype diversity statistics
• Expected haplotype homozygosity for a variant
• For specified length scale (e.g. 100kb or 0.1cM)
• Calculate frequencies for each of K haplotypes that carry the variant allele
• And then calculate expected haplotype homozygosity as: $EHH = \sum_{i=1}^{K} p_i^2$
• NC statistic (Kiezun et al 2013)
• Identify minimal distance to a recombination or less frequent mutation in each direction from the variant
• NC = Log of distance between the events in each direction
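As a small illustration of the EHH definition above, the sketch below computes the sum of squared haplotype frequencies among carriers of the variant allele. The function name and the representation of haplotypes as strings over a fixed window are assumptions made for the example.

```python
from collections import Counter

def expected_haplotype_homozygosity(carrier_haplotypes):
    """Sum of squared frequencies of the distinct haplotypes carrying the variant."""
    counts = Counter(carrier_haplotypes)
    n = len(carrier_haplotypes)
    return sum((c / n) ** 2 for c in counts.values())

# Four carriers, two distinct haplotypes at equal frequency.
print(expected_haplotype_homozygosity(["ACGT", "ACGT", "ACTT", "ACTT"]))
```

A value of 1 means every carrier shares one haplotype over the window; the minimum for n carriers is 1/n, reached when all carrier haplotypes are distinct.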
Results: Haplotype diversity statistics
• As expected:
• Lower frequency alleles have higher EHH
• Alleles under stronger purifying selection have higher EHH
• Notable:
• Magnitude of haplotype response with respect to 4Ns is small in absolute terms (also see Kiezun et al 2013)
• Rapid growth in Europe leads to reduced EHH relative to Europe’
• NC statistic slightly more responsive than EHH to 4Ns (results not shown)
Caveats and extensions to IS
• Inefficiency of importance sampling
• Effective sample size (ESS) mean: 112.7
• SD=66.6 across parameter settings
• With 200,000 simulations
• Worst ESS values are for larger allele frequencies and selection coefficients
• Extensions:
• Suggests a place for attempting an SIS resampling algorithm
• Using sampled coalescent genealogies, can calculate likelihoods based on rare variant haplotype configurations
Examining the distribution of allele frequencies using diffusion theory based approaches
• Thus far we have conditioned on allele frequency...
• But what does the distribution of allele frequencies look like for negatively selected alleles?
• How is it affected by demography?
• Diffusion-based approaches are best suited to address these questions...
• Joint work with Evan Koch (PhD student, Ecology & Evolution, U Chicago)
Diffusion equations
Evans et al. 2007 describe the following formula and boundary conditions for the expectation measure of the frequency spectrum. [equations not recovered]
Which they transform to make numerical solution easier: [equation not recovered]
Where we use the usual Wright-Fisher mean and variance and measure time in terms of the pop. size at t=0.
Numerical solution: Temporal update
Continuing the method of Evans et al., we use an implicit backward Euler scheme over a grid on t and first/second order difference schemes for the first/second partials on x respectively. [equations not recovered]
We solve this linear system of equations to obtain numerical solutions forward in time, where the vector d contains the boundary conditions for the diffusion.
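The slide's update equations were not recovered, so the sketch below shows only the general shape of an implicit backward Euler step with boundary terms, on a toy problem. It assumes the plain heat equation u_t = u_xx on a uniform grid rather than the Wright-Fisher diffusion operator and nonuniform grid used in the talk; the tridiagonal solver and all names are illustrative.

```python
import math

def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal linear system (lower[0] and upper[-1] are ignored)."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def backward_euler_step(u, dt, h, boundary=(0.0, 0.0)):
    """One implicit step for u_t = u_xx on interior points.

    Solves (I - dt * A) u_new = u + dt * d, where A is the second-difference
    matrix and d carries the fixed boundary values.
    """
    n = len(u)
    r = dt / (h * h)
    rhs = list(u)
    rhs[0] += r * boundary[0]
    rhs[-1] += r * boundary[1]
    return thomas_solve([-r] * n, [1.0 + 2.0 * r] * n, [-r] * n, rhs)

# Toy run: a sine profile decays smoothly under repeated implicit steps.
m = 50
h = 1.0 / m
u = [math.sin(math.pi * j * h) for j in range(1, m)]
for _ in range(10):
    u = backward_euler_step(u, dt=1e-3, h=h)
```

The implicit form is the reason a linear system must be solved at every time step; in exchange the scheme stays stable for time steps that would make an explicit update blow up.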
Numerical solution: Grid settings
To perform this numerical solution we need to choose an appropriate grid on x and t. The frequency spectrum is very steep at low values of x, so we use a nonuniform grid beginning with smaller spacing.
Begin with x spacing ~10^-8, double after 20 steps until spacing is 10^-3.
Spacing in t is uniform at 10^-3.
[Figure: expected number versus frequency (0 to 1) for the Constant, Africa, and Europe demographies, 4Ns = -10.]
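One reading of the grid rule above can be sketched as follows. It assumes the grid starts at x = 0, that the spacing doubles after every 20 steps, and that the spacing is capped at 10^-3 until x reaches 1; the slide does not spell out these details, and the function name is hypothetical.

```python
def make_frequency_grid(h0=1e-8, h_max=1e-3, steps_per_doubling=20, x_end=1.0):
    """Nonuniform grid on [0, x_end]: fine spacing near 0, doubling up to h_max."""
    xs = [0.0]
    h = h0
    steps = 0
    while xs[-1] < x_end:
        nxt = xs[-1] + h
        if nxt >= x_end:
            xs.append(x_end)  # clip the final step so the grid ends exactly at x_end
            break
        xs.append(nxt)
        steps += 1
        if steps % steps_per_doubling == 0 and h < h_max:
            h = min(2.0 * h, h_max)
    return xs

grid = make_frequency_grid()
print(len(grid), grid[1] - grid[0], max(b - a for a, b in zip(grid, grid[1:])))
```

Most of the points added by the doubling phase sit below a frequency of about 10^-2, which is where the spectrum is steepest.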
Testing numerical methods
Numerically propagate the equilibrium solution forward in time for 6000 generations to see how consistent it stays.
Error appears to increase with allele frequency, but the scale is relatively small.
[Figure label: S=2]
Slices of the spectra through time
[Figure: expected number of polymorphic sites versus time (2N generations, 0 to 0.4) for S = −20, −10, −2, −1 under the African and European demographies, in three panels: very common variants (20-50% MAF), less common variants (5-20% MAF), and rare variants (0-5% MAF).]
• For “very common” variants: little effect of demography
• For “less common” variants: older events show effect
• For “rare” variants: tight coupling with demography - large impact of recent growth
Proportion of variants that have arisen since onset of growth
[Figure: proportion of novel variants (0.80 to 1.00) versus −s (0 to 0.003) for the Constant, Africa, Europe, and Europe' demographies.]
• First solution: Forward to present with no mutations added after onset of growth
• Second solution: From onset of growth to present, with starting conditions as no variants
• By summing two solutions we can obtain total variants and proportion since onset of growth
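Summing the two solutions works because the forward equation is linear in the expected spectrum. Written out, with f_old and f_new as hypothetical names for the first and second solutions, and with integration over the full frequency range as an assumed way of counting variants:

```latex
f(x, t) = f_{\mathrm{old}}(x, t) + f_{\mathrm{new}}(x, t),
\qquad
\text{proportion since onset of growth}
= \frac{\int_0^1 f_{\mathrm{new}}(x, t)\,dx}
       {\int_0^1 \left[ f_{\mathrm{old}}(x, t) + f_{\mathrm{new}}(x, t) \right] dx}
```

Restricting both integrals to a frequency bin would give the corresponding proportion for that bin.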
Conclusions
• Haplotype characteristics of negatively selected rare variants
• Age and haplotype diversity reflect 4Ns of the variant
• However, only weakly, especially for haplotype diversity.
• Any inference will need to bin variants
• Demographies: the European bottleneck has a strong effect on ages, and subsequent growth lowers haplotype homozygosity relative to no growth
• Frequency spectra characteristics:
• Sensitivity to demography is highly dependent on frequency range
• Most variants in growing populations are recent and rare, but especially negatively selected ones
Conclusions
• Have applied two computational techniques for generating expectations in situations with varying population sizes and negative selection
• Techniques are promising but still have challenges:
• Very strong selection and growth are problematic for both
• Sampling issues (particularly as n approaches 2N)
• Still have the limitations of numerical approaches: how outcomes depend on parameters is never generalized
• Handling these challenges is an exciting and important area for contemporary theory and methods development
Acknowledgements
• Importance sampling to study rare variant haplotypes
• Diego Ortega del Vecchyo (UCLA)
• Numerical solutions to diffusion
• Evan Koch (University of Chicago)
• Funding: NIH, Sloan Research Fellowship
Thanks