Theory of IBD sharing in the Wright-Fisher model



Shai Carmi, Pier Francesco Palamara, Vladimir Vacic, and Itsik Pe’er

Department of Computer Science, Columbia University, New York, NY

References

1. A. Gusev et al., Whole population, genome-wide mapping of hidden relatedness, Genome Res. 19, 318 (2009).
2. B. L. Browning and S. R. Browning, A fast, powerful method for detecting identity-by-descent, AJHG 88, 173 (2011).
3. S. R. Browning and B. L. Browning, Identity by Descent Between Distant Relatives: Detection and Applications, Annu. Rev. Genet. 46, 615 (2012).
4. P. F. Palamara et al., Length Distributions of Identity by Descent Reveal Fine-Scale Demographic History, AJHG (2012).
5. H. Li and R. Durbin, Inference of human population history from individual whole-genome sequences, Nature 475, 493 (2011).
6. A. Gusev et al., Low-Pass Genome-Wide Sequencing and Variant Inference Using Identity-by-Descent in an Isolated Human Population, Genetics 190, 679 (2012).
7. Y. Shen et al., Coverage tradeoffs and power estimation in the design of whole-genome sequencing experiments for detecting association, Bioinformatics 27, 1995 (2011).

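Poster sections:

• Background: Identity-by-descent
• Distribution of the total sharing: renewal theory
• Direct calculation of the variance
• The cohort-averaged sharing and imputation by IBD
• More applications and conclusions

Background: Identity-by-descent

[Figure: the chromosomes of two individuals, A and B, with a shared segment highlighted.]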

• In populations that have recently undergone strong genetic drift, most individuals share a very recent common ancestor.
• Long haplotypes are frequently shared identical-by-descent (IBD).
• Algorithms can detect IBD shared segments between all pairs in large cohorts based on either segment length or frequency [1,2].
• Applications:
o Demographic inference
o Imputation
o Phasing
o Association of rare variants/haplotypes
o Pedigree reconstruction
o Detection of positive selection
o See review [3]

How does the amount of sharing depend on the demographic history of the population?

The Wright-Fisher model:

• Non-overlapping, discrete generations.
• Constant size of N haploid individuals, or a changing size.
• Ignore recent mutations.
• Recombination is a Poisson process.
• Each pair of individuals (lineages) has probability 1/N to coalesce in the previous generation.
• For continuous time and large population size, approximated by the coalescent.

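• (Scaled) time to most recent common ancestor: t, in units of N generations (exponentially distributed with mean 1 for constant size).
• Assume a segment can be detected only if it is longer than m (Morgans).
• Denote the fraction of the chromosome shared between two random individuals as the total sharing fT.
• Palamara et al. [4] give the mean total sharing ⟨fT⟩ as a function of the demographic history.
• For constant size, ⟨fT⟩ = (1 + 4mN)/(1 + 2mN)².
• Used to infer population histories.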
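The constant-size mean can be checked numerically. The sketch below is a minimal illustration rather than the authors' code. It assumes that t is measured in units of N generations and is exponential with mean 1, that recombination on the two lineages is Poisson with total rate 2Nt per Morgan, and that the segment around a site extends independently to the left and to the right. The function names are hypothetical.

```python
import random


def mean_sharing_closed_form(N, m):
    """Closed form (1 + 4mN) / (1 + 2mN)^2 for constant size N, cutoff m in Morgans."""
    x = 2.0 * m * N
    return (1.0 + 2.0 * x) / (1.0 + x) ** 2


def mean_sharing_monte_carlo(N, m, n_samples=200_000, seed=1):
    """Estimate the probability that the segment covering a random site is >= m."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Scaled time to the common ancestor at the site (assumed Exp(1)).
        t = max(rng.expovariate(1.0), 1e-12)
        # Recombination rate per Morgan summed over both lineages.
        rate = 2.0 * N * t
        # The segment extends an exponential distance on each side of the site.
        length = rng.expovariate(rate) + rng.expovariate(rate)
        if length >= m:
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    N, m = 500, 0.01
    print(mean_sharing_closed_form(N, m), mean_sharing_monte_carlo(N, m))
```

The two numbers should agree up to Monte Carlo noise; a larger N or m lowers the mean sharing, since long segments become rarer.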

Questions:

• Distribution? Higher moments?
• Differences between individuals?
• Applications (imputation by IBD, sharing between siblings).

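Distribution of the total sharing: renewal theory

[Figure: chromosomes A and B along the coordinate from 0 to L, divided into blocks ℓ1, ..., ℓ11. Blocks of length ≥ m are shared; in the example, ℓT = ℓ1 + ℓ5 + ℓ9.]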

• In each block, the two chromosomes maintain the same ancestor.
• Blocks (segments) end at recombination events.
• Define ℓT as the total length of segments having length ≥ m.
• In the Sequentially Markov Coalescent, fT = ℓT/L.
• Li and Durbin [5] derived the distribution of the coalescence time t at segment ends.
• For a given t (in units of N generations), the probability of no recombination at distance ℓ is exp(−2Ntℓ). Therefore (see also [4]), the segment length PDF P(ℓ) is the average of 2Nt·exp(−2Ntℓ) over the distribution of t at segment ends.
• For constant size, the density of t at segment ends is t·exp(−t), and P(ℓ) = 4N/(1 + 2Nℓ)³.
• Find the distribution of ℓT using renewal theory. Map:
o Coordinate on chromosome → time (t)
o Shared segments → waiting times between events
o L → T, ℓT → tS
o Segment length PDF P(ℓ) → waiting time PDF ψ(τ).
• Laplace transform the PDF PT(tS) → Ps(u).
• The PDF of the number of shared segments (Laplace transformed, T → s).
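The renewal picture can also be explored by simulation. The sketch below is an illustration under stated assumptions, not the poster's analytical solution: it assumes constant size with segment length density P(ℓ) = 4N/(1 + 2Nℓ)³ (so the CDF is 1 − (1 + 2Nℓ)⁻²), draws independent segments until the chromosome is covered, and starts each chromosome at a segment boundary, which ignores a small edge effect. All function names are hypothetical.

```python
import random


def sample_segment_length(N, rng):
    """Inverse-CDF draw from P(l) = 4N / (1 + 2Nl)^3 (assumed constant-size density)."""
    u = rng.random()  # uniform on [0, 1)
    return ((1.0 - u) ** -0.5 - 1.0) / (2.0 * N)


def simulate_total_sharing(N, m, L, rng):
    """Lay down segments along one chromosome of length L and return f_T."""
    position, shared = 0.0, 0.0
    while position < L:
        # The last segment is truncated at the chromosome end.
        end = min(position + sample_segment_length(N, rng), L)
        if end - position >= m:
            shared += end - position
        position = end
    return shared / L


def sharing_samples(N, m, L, n_chromosomes, seed=7):
    """Independent realisations of the total sharing f_T."""
    rng = random.Random(seed)
    return [simulate_total_sharing(N, m, L, rng) for _ in range(n_chromosomes)]


if __name__ == "__main__":
    samples = sharing_samples(N=500, m=0.01, L=1.0, n_chromosomes=400)
    mean = sum(samples) / len(samples)
    var = sum((f - mean) ** 2 for f in samples) / len(samples)
    print(f"mean f_T = {mean:.4f}, variance = {var:.5f}")
```

A histogram of the returned samples gives an empirical view of the distribution of fT that the renewal-theory calculation targets.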

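Direct calculation of the variance

• A general equation for the variance of the total sharing: with fT = (1/M) Σ_s I(s), Var[fT] = (1/M²) Σ_{s1} Σ_{s2} [π2(s1,s2) − π²].
• M: number of markers; the sums run over all markers.
• I(s): indicator of a site to lie on a shared segment; I(s) = 1 with probability π.
• π2(s1,s2): probability of both sites s1 and s2 to lie on shared segments.
• A simple approximation is also available, with an explicit form for a constant size population.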
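The general variance equation is the usual expansion of the variance of an average of indicators. The sketch below checks this numerically on toy data. The indicators come from a two-state Markov chain along the markers, which is only a stand-in for real sharing patterns. Because the check uses empirical frequencies, π² is replaced by the per-site product π(s1)·π(s2), which coincides with π² when all sites share the same π. Function names are hypothetical.

```python
import random


def simulate_indicators(M, p_stay, n_pairs, seed=3):
    """Toy correlated 0/1 indicators I(s) along M markers for n_pairs pairs."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_pairs):
        state = rng.random() < 0.5
        row = []
        for _ in range(M):
            row.append(1 if state else 0)
            if rng.random() > p_stay:  # switch between shared / not shared
                state = not state
        rows.append(row)
    return rows


def variance_direct(rows):
    """Population variance of f_T = (1/M) * sum_s I(s) across pairs."""
    M = len(rows[0])
    f = [sum(r) / M for r in rows]
    mean_f = sum(f) / len(f)
    return sum((x - mean_f) ** 2 for x in f) / len(f)


def variance_from_pairs(rows):
    """(1/M^2) * sum over (s1, s2) of [pi2(s1, s2) - pi(s1) * pi(s2)]."""
    M, n = len(rows[0]), len(rows)
    pi = [sum(r[s] for r in rows) / n for s in range(M)]
    total = 0.0
    for s1 in range(M):
        for s2 in range(M):
            pi2 = sum(r[s1] * r[s2] for r in rows) / n
            total += pi2 - pi[s1] * pi[s2]
    return total / M ** 2


if __name__ == "__main__":
    rows = simulate_indicators(M=30, p_stay=0.9, n_pairs=400)
    print(variance_direct(rows), variance_from_pairs(rows))
```

The two values agree up to floating-point rounding, so the whole problem reduces to finding π2(s1,s2).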

• A full solution of the variance (only key equations shown):
• π2(s1,s2) = pnr·πnr + (1 − pnr)·πr.
• pnr: probability of no recombination between the two sites in the history of the two chromosomes.
• πnr: the probability of the two sites to lie on shared segments, given that there was no recombination (similarly for πr).
• For a discrete ancestral process and distance d between the sites, the no-recombination terms have explicit expressions.
• When there was recombination, calculation of πr is complicated by the fact that the segments are bounded on one end.
• Solve by explicit calculations on the coalescent with recombination.
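The decomposition over recombination histories can be sketched as follows. This is one plausible discretisation and is not taken from the poster: it assumes that each lineage recombines between the two sites with probability d per generation (small d), that the pair coalesces with probability 1/N per generation, and that pnr is the probability that the pair coalesces before either lineage recombines. The function names are hypothetical, and πnr and πr are treated as given inputs.

```python
def p_no_recombination(N, d):
    """Closed form of sum_g (1/N) (1 - 1/N)^(g-1) (1 - d)^(2g) (assumed model)."""
    q = (1.0 - d) ** 2  # neither lineage recombines in one generation
    return q / (N * (1.0 - (1.0 - 1.0 / N) * q))


def p_no_recombination_sum(N, d, max_generations=50_000):
    """Same quantity by explicit summation over the coalescence generation g."""
    q = (1.0 - d) ** 2
    total = 0.0
    survive = 1.0  # P(no coalescence and no recombination in earlier generations)
    for _ in range(max_generations):
        total += survive * q / N
        survive *= (1.0 - 1.0 / N) * q
    return total


def pair_sharing_probability(p_nr, pi_nr, pi_r):
    """Law of total probability: pi2 = p_nr * pi_nr + (1 - p_nr) * pi_r."""
    return p_nr * pi_nr + (1.0 - p_nr) * pi_r


if __name__ == "__main__":
    N, d = 1000, 0.001
    print(p_no_recombination(N, d), 1.0 / (1.0 + 2.0 * N * d))
```

Under these assumptions, for large N and small d the discrete result approaches 1/(1 + 2Nd).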

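The cohort-averaged sharing and imputation by IBD

• Define the cohort-averaged sharing: for each individual i in a cohort of size n, the average sharing to the rest of the cohort, f̄T(i) = (1/(n − 1)) Σ_{j≠i} fT(i,j).
• Approximate the variance of the cohort-averaged sharing.
• The variance has different limiting forms for small n and for large n; for large n it is independent of n.
• Distribution is approximately normal.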
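The definition of the cohort-averaged sharing translates directly into code. The sketch below assumes the pairwise total sharing is available as a symmetric n × n matrix (a list of lists) whose diagonal is ignored. The function name and input layout are illustrative choices.

```python
def cohort_averaged_sharing(pairwise):
    """Average sharing of each individual with the other n - 1 cohort members."""
    n = len(pairwise)
    if n < 2:
        raise ValueError("need at least two individuals")
    return [
        sum(pairwise[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


if __name__ == "__main__":
    # Hypothetical pairwise sharing fractions for three individuals.
    pairwise = [
        [0.0, 0.1, 0.3],
        [0.1, 0.0, 0.2],
        [0.3, 0.2, 0.0],
    ]
    print(cohort_averaged_sharing(pairwise))
```

The spread of the returned values across individuals is the quantity whose variance is discussed above.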

Imputation by IBD:

• Assume a cohort of size n is genotyped and IBD sharing is detected between all pairs.

• A fraction ns/n of the individuals is selected for sequencing.

• Non-sequenced individuals are imputed using the sequenced individuals along segments of IBD sharing.

• What is the expected imputation success rate when individuals are randomly selected?

• What is the success rate when individuals are selected according to their cohort-averaged sharing [6]?

• Define pc as the fraction of the genome covered by IBD segments shared with the sequenced individuals.
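The coverage pc and the selection step can be sketched as below. Two assumptions are made for illustration only: IBD segments are given as (start, end) pairs in Morgans lying within the genome, and "selected according to their cohort-averaged sharing" is read as taking the top-ranked individuals. The actual selection scheme of [6] may differ, and the function names are hypothetical.

```python
def covered_fraction(segments, genome_length):
    """Fraction of the genome covered by the union of (start, end) segments."""
    covered, current_end = 0.0, 0.0
    for start, end in sorted(segments):
        start = max(start, current_end)  # skip the part already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


def select_by_sharing(cohort_avg, n_s):
    """Indices of the n_s individuals with the highest cohort-averaged sharing."""
    order = sorted(range(len(cohort_avg)), key=lambda i: cohort_avg[i], reverse=True)
    return order[:n_s]


if __name__ == "__main__":
    # Segments one non-sequenced individual shares with sequenced individuals.
    segments = [(0.1, 0.3), (0.2, 0.5), (0.7, 0.8)]
    print(covered_fraction(segments, genome_length=1.0))
    print(select_by_sharing([0.2, 0.15, 0.25], n_s=2))
```

Overlapping segments are counted once, so pc never exceeds 1 however many sequenced individuals share the same region.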

Downstream effect on power of association:

• The effective number of sequenced individuals increases with imputation success rate.

• The power to detect a variant of frequency β appearing in cases only is computed following [7].

See our paper:

S. Carmi, P. F. Palamara, V. Vacic, T. Lencz, A. Darvasi, and I. Pe’er, The variance of identity-by-descent sharing in the Wright-Fisher model, Submitted (2012). arXiv:1206.4745.

More applications and conclusions

Sharing between siblings:

• The variance in sharing between (same parent) chromosomes of siblings is known.
• What happens when siblings come from an inbred population and thus share also due to remote ancestry?
• The mean sharing has an explicit expression.
• When calculating the variance, decompose sharing into either same-grandparent or remote.

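A simple estimator of the population size:

• Use the constant-size mean ⟨fT⟩ = (1 + 4mN)/(1 + 2mN)², isolate N and simplify: N̂ = [1 − f̄T + √(1 − f̄T)]/(2m·f̄T), where f̄T is the total sharing averaged over all pairs.
• The expectation and the variance of the estimator can also be calculated.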
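The inversion behind the estimator can be written out as a short sketch. It assumes the constant-size mean sharing ⟨fT⟩ = (1 + 4mN)/(1 + 2mN)²; setting x = 2mN gives the quadratic f·x² + 2(f − 1)·x + (f − 1) = 0, whose positive root is x = [1 − f + √(1 − f)]/f. The function names are hypothetical.

```python
import math


def mean_sharing(N, m):
    """Assumed constant-size mean total sharing, (1 + 4mN) / (1 + 2mN)^2."""
    x = 2.0 * m * N
    return (1.0 + 2.0 * x) / (1.0 + x) ** 2


def estimate_population_size(f_bar, m):
    """Solve mean_sharing(N, m) = f_bar for N; requires 0 < f_bar < 1."""
    if not 0.0 < f_bar < 1.0:
        raise ValueError("average sharing must lie strictly between 0 and 1")
    return (1.0 - f_bar + math.sqrt(1.0 - f_bar)) / (2.0 * m * f_bar)


if __name__ == "__main__":
    m = 0.01  # length cutoff in Morgans
    f_bar = mean_sharing(500, m)
    print(estimate_population_size(f_bar, m))  # recovers N = 500 up to rounding
```

In practice f̄T is an average over a finite number of pairs, so the estimate inherits the sampling noise of the total sharing.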

Summary and discussion:

• We obtained analytical results for properties of IBD sharing in the Wright-Fisher model.

• Calculated the distribution using renewal theory and the variance using two methods.

• Treat genotyping/phasing errors by increasing the length cutoff m. If segments are missed with probability ε, one can show that both the mean and the variance are scaled by (1 − ε).

• Other analytical approaches and applications to demographic inference are in [4] and in the talk here.

• The sharing per individual (averaged over cohort) exhibits a surprisingly wide distribution even for large cohorts.

• Can be taken advantage of in imputation by IBD.
