modeling linkage disequilibrium and identifying...

22
Copyright 2003 by the Genetics Society of America Modeling Linkage Disequilibrium and Identifying Recombination Hotspots Using Single-Nucleotide Polymorphism Data Na Li* and Matthew Stephens †,1 *Department of Biostatistics and Department of Statistics, University of Washington, Seattle, Washington 98195 Manuscript received January 30, 2003 Accepted for publication August 11, 2003 ABSTRACT We introduce a new statistical model for patterns of linkage disequilibrium (LD) among multiple SNPs in a population sample. The model overcomes limitations of existing approaches to understanding, summarizing, and interpreting LD by (i) relating patterns of LD directly to the underlying recombination process; (ii) considering all loci simultaneously, rather than pairwise; (iii) avoiding the assumption that LD necessarily has a “block-like” structure; and (iv) being computationally tractable for huge genomic regions (up to complete chromosomes). We examine in detail one natural application of the model: estimation of underlying recombination rates from population data. Using simulation, we show that in the case where recombination is assumed constant across the region of interest, recombination rate estimates based on our model are competitive with the very best of current available methods. More importantly, we demonstrate, on real and simulated data, the potential of the model to help identify and quantify fine-scale variation in recombination rate from population data. We also outline how the model could be useful in other contexts, such as in the development of more efficient haplotype-based methods for LD mapping. L INKAGE disequilibrium (LD) is the nonindepen- 2. They assume a “block-like” structure for patterns of LD, which may not be appropriate at all loci. dence, at a population level, of the alleles carried at different positions in the genome. The patterns of 3. They do not directly relate patterns of LD to biologi- cal mechanisms of interest, such as the underlying LD observed in natural populations are the result of recombination rate. a complex interplay between genetic factors and the population’s demographic history. In particular, recom- As an example of the limitations of current methods, bination plays a key role in shaping patterns of LD in a consider Figure 1, which graphically shows pairwise LD population. When a recombination occurs between two measures for six simulated data sets, simulated under loci, it tends to reduce the dependence between the alleles various models for heterogeneity in the underlying re- carried at those loci and thus reduce LD. Although combination rate. The reader is invited to speculate recombination events in a single meiosis are relatively on what the underlying models are in each case—the rare over small regions, the large total number of meioses answer appears in the Figure 8 legend. In each of the six that occurs each generation in a population has a sub- figures one can identify by eye, or by some quantitative stantial cumulative effect on patterns of LD, and so criteria (e.g., Daly et al. 2001; Olivier et al. 2001; Wang molecular data from population samples contain valu- et al. 2002), “blocks” of sites, such that LD tends to be able information on fine-scale variations in recombina- high among markers within a block. In some cases there tion rate. might also be little LD between markers in different Despite the undoubted importance of understanding blocks, which might be interpreted as evidence for varia- patterns of LD across the genome, most obviously be- tion in local recombination rates: low recombination cause of its potential impact on the design and analysis rates within the blocks and higher rates between the of studies to map disease genes in humans, most current blocks. Indeed, Jeffreys et al. (2001) have shown, using methods for interpreting and analyzing patterns of LD sperm typing, that in the class II region of the MHC, suffer from at least one of the following limitations: variations in local recombination rate are indeed re- sponsible for block-like patterns of LD. However, with- 1. They are based on computing some measure of LD out this type of experimental confirmation, which is defined only for pairs of sites, rather than considering currently technically challenging and time consuming, all sites simultaneously. it is difficult to distinguish between blocks that arise due to recombination rate heterogeneity and blocks that arise due to chance, perhaps through chance clus- 1 Corresponding author: Department of Statistics, University of Wash- tering of recombination events in the ancestry of the ington, Padelford B-313, Stevens Way, Seattle, WA 98195. E-mail: [email protected] particular sample being considered (Wang et al. 2002). Genetics 165: 2213–2233 (December 2003)

Upload: others

Post on 08-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

Copyright 2003 by the Genetics Society of America

Modeling Linkage Disequilibrium and Identifying Recombination HotspotsUsing Single-Nucleotide Polymorphism Data

Na Li* and Matthew Stephens†,1

*Department of Biostatistics and †Department of Statistics, University of Washington, Seattle, Washington 98195

Manuscript received January 30, 2003Accepted for publication August 11, 2003

ABSTRACTWe introduce a new statistical model for patterns of linkage disequilibrium (LD) among multiple SNPs

in a population sample. The model overcomes limitations of existing approaches to understanding,summarizing, and interpreting LD by (i) relating patterns of LD directly to the underlying recombinationprocess; (ii) considering all loci simultaneously, rather than pairwise; (iii) avoiding the assumption thatLD necessarily has a “block-like” structure; and (iv) being computationally tractable for huge genomicregions (up to complete chromosomes). We examine in detail one natural application of the model:estimation of underlying recombination rates from population data. Using simulation, we show that inthe case where recombination is assumed constant across the region of interest, recombination rateestimates based on our model are competitive with the very best of current available methods. Moreimportantly, we demonstrate, on real and simulated data, the potential of the model to help identify andquantify fine-scale variation in recombination rate from population data. We also outline how the modelcould be useful in other contexts, such as in the development of more efficient haplotype-based methodsfor LD mapping.

LINKAGE disequilibrium (LD) is the nonindepen- 2. They assume a “block-like” structure for patterns ofLD, which may not be appropriate at all loci.dence, at a population level, of the alleles carried

at different positions in the genome. The patterns of 3. They do not directly relate patterns of LD to biologi-cal mechanisms of interest, such as the underlyingLD observed in natural populations are the result ofrecombination rate.a complex interplay between genetic factors and the

population’s demographic history. In particular, recom-As an example of the limitations of current methods,bination plays a key role in shaping patterns of LD in aconsider Figure 1, which graphically shows pairwise LDpopulation. When a recombination occurs between twomeasures for six simulated data sets, simulated underloci, it tends to reduce the dependence between the allelesvarious models for heterogeneity in the underlying re-carried at those loci and thus reduce LD. Althoughcombination rate. The reader is invited to speculaterecombination events in a single meiosis are relativelyon what the underlying models are in each case—therare over small regions, the large total number of meiosesanswer appears in the Figure 8 legend. In each of the sixthat occurs each generation in a population has a sub-figures one can identify by eye, or by some quantitativestantial cumulative effect on patterns of LD, and socriteria (e.g., Daly et al. 2001; Olivier et al. 2001; Wangmolecular data from population samples contain valu-et al. 2002), “blocks” of sites, such that LD tends to beable information on fine-scale variations in recombina-high among markers within a block. In some cases theretion rate.might also be little LD between markers in different

Despite the undoubted importance of understandingblocks, which might be interpreted as evidence for varia-

patterns of LD across the genome, most obviously be-tion in local recombination rates: low recombination

cause of its potential impact on the design and analysisrates within the blocks and higher rates between the

of studies to map disease genes in humans, most current blocks. Indeed, Jeffreys et al. (2001) have shown, usingmethods for interpreting and analyzing patterns of LD sperm typing, that in the class II region of the MHC,suffer from at least one of the following limitations: variations in local recombination rate are indeed re-

sponsible for block-like patterns of LD. However, with-1. They are based on computing some measure of LDout this type of experimental confirmation, which isdefined only for pairs of sites, rather than consideringcurrently technically challenging and time consuming,all sites simultaneously.it is difficult to distinguish between blocks that arisedue to recombination rate heterogeneity and blocksthat arise due to chance, perhaps through chance clus-

1Corresponding author: Department of Statistics, University of Wash-tering of recombination events in the ancestry of theington, Padelford B-313, Stevens Way, Seattle, WA 98195.

E-mail: [email protected] particular sample being considered (Wang et al. 2002).

Genetics 165: 2213–2233 (December 2003)

Page 2: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2214 N. Li and M. Stephens

Figure 1.—Plots of LD measure-ment, |D�|, (bottom right diagonal)and P value for Fisher’s exact test(top left diagonal) for every pairof sites with minor allele fre-quency �0.15 in data sets simu-lated under varying assumptionsabout variation in the local recom-bination rate. Details of the mod-els used to simulate each data setappear in the Figure 8 legend,which is based on the same six datasets.

The ability to distinguish between these cases would of the lines of, for example, McPeek and Strahs (1999),Morris et al. (2000), and Liu et al. (2001).course be interesting from a basic science standpoint—

for example, in helping to identify sequence characteris-tics associated with recombination hotspots. In addition,

MODELSit would have important implications for the design andanalysis of LD mapping studies. For example, it would Background: The most successful current approacheshelp in predicting patterns of variation at sites that have to constructing statistical models relating genetic varia-not been genotyped (perhaps sites influencing suscepti- tion to the underlying recombination rate (and to otherbility to a disease), and it would provide some indication genetic and demographic factors) are based on the coa-of whether block structures observed in one sample lescent (Kingman 1982) and its generalization to in-are likely to be replicated in other samples—a crucial clude recombination (Hudson 1983). Although theserequirement for being able to select representative “tag” approaches are based on rather simplistic assumptionssingle-nucleotide polymorphisms (SNPs; Johnson et al. about the demographic history of the population from2001) on the basis of LD patterns observed in some which individuals were sampled and about the evolu-reference sample. tionary processes acting on the genetic region being

In this article we introduce a statistical model for LD studied, they have nonetheless proven useful in a varietythat overcomes the limitations of existing approaches of applications. In particular, they provide a helpful simu-by relating genetic variation in a population sample to lation tool (e.g., software described in Hudson 2002),the underlying recombination rate. We examine in de- allowing more realistic data to be generated under vari-tail one natural application of the model: estimation of ous assumptions about underlying biology and demo-underlying recombination rates from population data. graphy, and hence aid exploration of what patterns ofUsing simulation, we show that in the case in which LD might be expected under different scenarios (Krug-recombination is assumed constant across the region lyak 1999; Pritchard and Przeworski 2001).of interest, recombination rate estimates based on our Despite the ease with which coalescent models can bemodel are competitive with the very best of current simulated from, using these models for inference remainsavailable methods. More importantly, we demonstrate, extremely challenging. For example, consider the prob-on real and simulated data, the potential of the model lem of estimating the underlying recombination rate into help identify and quantify fine-scale variation in re- a region, using data from a random population sample.combination rate (including “recombination hotspots”) It follows from coalescent theory that population sam-from population data. ples contain information on the value of the product

Although we focus here on estimating recombination of the recombination rate c and the effective (diploid)rates, we view the model as being useful more broadly population size N, but not on c and N separately. It hasin interpreting and analyzing patterns of LD across mul- therefore become standard to attempt to estimate thetiple loci. In particular, as we outline in our discussion, compound parameter � � 4Nc, and several methodsthe model could be helpful in the development of more have been proposed. Some (e.g., Griffiths and Marjo-

ram 1996; Kuhner et al. 2000; Nielsen 2000; Fearn-efficient haplotype-based methods for LD mapping, along

Page 3: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2215Linkage Disequilibrium and Recombination

head and Donnelly 2001) try to make use of all the Pr(h 1, . . . , h n|�) � �(h 1|�)�(h 2|h 1; �) . . . �(h n|h 1, . . . , h n�1; �).

(2)molecular data available. However, although such meth-ods have been applied successfully to small regions and

We refer to this as a “product of approximate condition-nonrecombining parts of the genome (Harding et al.

als” (PAC) model and to the corresponding likelihood1997; Hammer et al. 1998; Kuhner et al. 2000; Nielsen

as a PAC likelihood, which we denote LPAC. Explicitly,2000; Fearnhead and Donnelly 2001), for even moder-ate-sized autosomal regions (e.g., a few kilobases in hu- LPAC(�) � �(h1|�)�(h2|h1; �) . . . �(hn|h1, . . . , hn�1; �).

(3)mans) they become computationally impractical (Fearn-head and Donnelly 2001). Other methods, many of

Similarly, we refer to the value of � that maximizes LPACwhich are considered by Wall (2000), make use of onlyas a maximum PAC likelihood estimate for � and denote it

summaries of the data, substantially reducing computa-by �PAC.

tional requirements at the expense of some loss in effi-The utility of the model (3) will naturally depend on the

ciency.use of an appropriate approximation for the conditional

More recently, Hudson (2001) and Fearnhead anddistribution �. This approximation should be designed

Donnelly (2002) proposed “composite-likelihood” meth-to answer the following question: if, at a particular locus,

ods for estimating � over moderate to large genomicin a random sample of k chromosomes from a popula-

regions. Hudson’s method is based on multiplying to-tion, we observe genetic types h1, . . . , hk, what is the

gether likelihoods for every pair of sites genotyped, inconditional distribution of the type of the next sampled

which these pairwise likelihoods are computed via simu-chromosome, Pr(hk�1|h1, . . . , hk)? We are aware of three

lation, assuming an “infinite-sites” mutation model (i.e.,forms for � in the literature, each of which attempts to

no repeat mutation). This method has been modifiedanswer this question under different assumptions for

by McVean et al. (2002) to allow for repeat mutation.the genetic model underlying the loci being studied.

Fearnhead and Donnelly’s method is based on dividingThe first and best known comes from the Ewens sampling

data on a large region into smaller regions and multi-formula (Ewens 1972). This arises from considering a

plying likelihoods obtained for each smaller region.neutral locus in a randomly mating population, evolving

These methods, together with the best of the summary-with constant (diploid) size N and mutation rate � per

statistic-based methods of Wall (2000), appear to begeneration, and assuming an “infinite-alleles” mutation

the most accurate of existing methods for estimatingmodel, in which each mutation creates a novel (pre-

recombination rates from patterns of LD over moderateviously unseen) haplotype. Under these idealized condi-

to large genomic regions. None of these methods, astions, if we let � 4N�, then with probability k/(k �

currently implemented, allows explicitly for variation in) the k � 1st haplotype is an exact copy of one of the

recombination rate along the region under study.first k haplotypes chosen at random; otherwise it is a

A new model: Here we describe a new model for LD,novel haplotype. Although the assumptions underlying

which enjoys many of the advantages of coalescent-basedthis formula will never hold in practice, it does capture

methods (e.g., it directly relates LD patterns to the un-the following properties that we would expect to hold

derlying recombination rate) while remaining computa-more generally:

tionally tractable for huge genomic regions, up to entirechromosomes. Our model relates the distribution of i. The next haplotype is more likely to match a haplo-

type that has already been observed many timessampled haplotypes to the underlying recombinationrate, by exploiting the identity rather than one that has been observed less fre-

quently.Pr(h 1, . . . , h n |�) � Pr(h 1|�) Pr(h 2|h 1; �) . . . Pr(h n|h 1, . . . , h n�1; �),

ii. The probability of seeing a novel haplotype de-(1)

creases as k increases.iii. The probability of seeing a novel haplotype in-where h 1, . . . , hn denote the n sampled haplotypes, and

� denotes the recombination parameter (which may creases as increases.be a vector of parameters if the recombination rate is

However, for modern molecular data, and for se-allowed to vary along the region). This identity expresses

quence data and SNP data in particular, it fails to cap-the unknown probability distribution on the left as a prod-

ture the following two properties:uct of conditional distributions on the right. For simplicitywe often use the notation � to denote these conditional iv. If the next haplotype is not exactly the same as an

existing (i.e., previously seen) haplotype, it will tenddistributions. While the conditional distributions are notcomputationally tractable for models of interest, they to differ by a small number of mutations from an

existing haplotype, rather than to be completelyare amenable to approximation, as we describe below.Our strategy is to substitute an approximation for these different from all existing haplotypes.

v. Due to recombination, the next haplotype will tendconditional distributions (�, say) into the right-handside of (1), to obtain an approximation to the distribu- to look somewhat similar to existing haplotypes over

contiguous genomic regions, the average physicaltion of the haplotypes h given �:

Page 4: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2216 N. Li and M. Stephens

suggested a similar simplification.] The second, which wedescribe in detail in appendix b and denote �B, is a slightmodification of �A, developed using empirical results fromFigure 3 to produce a likelihood LPAC that gives moreaccurate estimates of �. Where necessary, we denote thePAC likelihoods and maximum PAC-likelihood estimatescorresponding to �A (respectively, �B) by LPAC-A and �PAC-A

(respectively, LPAC-B and �PAC-B).A key property of both �A and �B is that they are

easy and fast to compute. Unlike the Ewens samplingFigure 2.—Illustration of how �A(hk�1|h1, . . . , hk) builds formula, but like the approximations of Stephens and

hk�1 as an imperfect mosaic of h1, . . . , hk. This illustrates the Donnelly (2000) and FD, neither corresponds exactlycase k � 3 and shows two possible values (h4A and h4B) for h4, to the actual conditional distribution under explicit as-given h1,h2,h3. Each of the possible h4’s can be thought of as

sumptions about population demography and the evolu-having been created by “copying” (imperfectly) parts of h1,h2,tionary forces on the locus under consideration. Indeed,and h3. The shading in each case shows which haplotype was

copied at each position along the chromosome. Intuitively we no closed-form expressions for �, based on such explicitthink of h4 as having recent shared ancestry with the haplotype assumptions, and capturing iv or v, are known. However,that it copied in each segment. We assume that the copying the suggested forms for � were motivated by consideringprocess is Markov along the chromosome, with jumps (i.e.,

both the Ewens sampling formula and the underlyingchanges in the shading) occurring at rate �/k per physicalgenealogy (or, in the case with recombination, genealo-distance. Thus the more frequent jumps in h4B suggest a higher

value of � than do the less frequent jumps in h4A. Note that gies) relating a random sample of haplotypes from afor very large values of � the loci become independent, as neutrally evolving, constant-sized panmictic population.they should. Each column of circles represents a SNP locus, As such, it may be helpful to view them as approxima-with black and white representing the two alleles. The imper-

tions to the (unknown) true conditional distributionfect nature of the copying process is exemplified at the thirdunder these assumptions. In particular, there are certainlocus, where h4A and h4B have the black allele, although they

copied h2, which has the white allele. In practice, of course, aspects of many real populations (e.g., populationthe shading is not observed, and so to compute the probability expansion or population structure) and biological fac-of observing a particular h4 we must sum over all possible tors (e.g., gene conversion and selection) that theseshadings. The Markov assumption allows us to do this effi-

forms for � do not attempt to capture. For some applica-ciently, using standard methods for hidden Markov models,tions this may not matter very much. For others it mayas described in appendix a.be necessary to develop forms for � that do capturethese aspects—a point we return to in the discussion.

An unwelcome feature of the PAC likelihoods corre-length of these regions being larger in areas of thesponding to our choices of �—and indeed the formsgenome where the local rate of recombination isfor � from Stephens and Donnelly (2000) and FD—islow.that they depend on the order in which the haplotypes

Stephens and Donnelly (2000) suggested a form for are considered. In other words, although these likeli-� that captures properties i–iv above. In their suggested hoods each correspond to a valid probability distribu-form for �, the next haplotype differs by M mutations tion on the haplotypes, these probability distributionsfrom a randomly chosen existing haplotype, where M has do not enjoy the property of exchangeability that wea geometric distribution with Pr(M � 0) � k/(k � ) (so would expect to be satisfied by the true (unknown)that it reproduces the Ewens sampling formula in the distribution. Practical experience, and theory in Ste-special case of the infinite-alleles mutation model). phens and Donnelly (2000; their Proposition 1, partThus the next haplotype is a (possibly imperfect) “copy” d), suggests that this problem cannot be rectified byof a randomly chosen existing haplotype. making a simple modification to �. Although in princi-

Fearnhead and Donnelly (2001; henceforth FD) ple the dependence on ordering could be removed byextended this form for � to also capture property v averaging the PAC likelihood over all possible orderingsabove. In FD’s approximation, the k � 1st haplotype is of the haplotypes, in practice this would require a summade up of an imperfect mosaic of the first k haplotypes, over n! terms, which is infeasible even for rather smallwith the size of the mosaic fragments being smaller for values of n. Instead, as a pragmatic alternative solution,higher values of the recombination rate. we propose to average LPAC over several random orders

Here we use two new forms for � that also capture of the haplotypes. Unless otherwise stated, all resultsproperties i–v above. The first, described in detail in reported here were obtained by averaging over 20 ran-appendix a and illustrated in Figure 2, which we denote dom orders. In our experience, the performance of�A, is a simplification of FD’s approximation that is easier the method is not especially sensitive to the numberto understand and slightly quicker to compute. [N. Pat- of random orders used—results based on 100 random

orders gave qualitatively similar results, and resultsterson (personal communication) has independently

Page 5: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2217Linkage Disequilibrium and Recombination

Figure 3.—Histograms of the error Err(�, �PAC-A) � log10(�PAC-A/�), based on 100 data sets simulated from the standard coalescentmodel with n � 50 haplotypes and S � 50 segregating sites. The values of � are (a) � � 5, (b) � � 25, and (c) � � 500.Superimposed curves are normal densities with the same mean and standard deviation as the 100 values making up the histogram.These results, as well as those in Figure 4 and Table 1, are based on averaging the likelihoods over 10 random orders of thehaplotypes.

based on a single random order were often not much equals 1.0. Thus the value of � is also the total valueof � across the region).worse (data not shown). It is, however, important that

when comparing likelihoods for different values of �,For each data set we found �PAC-A by numerically max-the same set of random orders should be used for each

imizing the PAC likelihood (using a golden bisectionvalue of �.search method; Press et al. 1992) and compared it withthe true value of � used to generate the data.

It seems natural to measure the error in estimatesESTIMATING CONSTANT RECOMBINATION RATE for � on a relative, rather than an absolute, scale. For

example, Wall (2000) reported the frequency withIn this section we consider estimating the recombina-which different methods for estimating � gave estimatestion rate when it is assumed to be constant across thewithin a factor of 2 of the true value, and both FD andregion of interest. More precisely, we assume that cross-Hudson (2001) examine the distribution of the ratioovers in a single meiosis occur as a Poisson process of�/� for their estimates of � and the deviation of thisconstant rate c per unit (physical) distance and con-ratio from the “optimal” value of 1. A problem withsider estimating the scalar parameter � � 4Nc. We firstworking with this ratio directly is that it tends to penalizeuse simulated data to examine the properties of theoverestimation more heavily than underestimation. Forestimator �PAC-A, corresponding to the conditional distri-example, overestimating � by a factor of 10 gives a largerbution �A described in appendix a, under what we calldeviation from 1 than underestimating � by a factor ofthe “standard coalescent model”: a constant-sized, pan-10. To avoid this problem, we quantify the relative errormictic population with an infinite-sites mutation model.of an estimate � for � by Err(�, �) � log10(�/�). ThisWe show that, although quite accurate, �PAC-A exhibits agives, for example, an error of 0 if � � �, an error of 1systematic bias. We use the empirical results to develop aif � overestimates � by a factor of 10, and an error ofmodified conditional distribution, �B (described in detail�1 if � underestimates � by a factor of 10.in appendix b), whose corresponding estimator, �PAC-B,

We note that Err(�, �) can also be viewed as the errorexhibits considerably less bias and is more accurate. We(on an absolute scale) in estimating log10(�) by log 10(�).compare the performance of models based on both �AThus, if the usual asymptotic theory for maximum-likeli-and �B with results from other methods.hood estimation applies for estimation of log10(�) inProperties of the point estimate �PAC: We used thethis setting (which, as discussed in FD, it may not), thenprogram mksample (Hudson 2002) to simulate datafor the actual maximum-likelihood estimate (MLE) �MLEsets consisting of samples of SNP haplotypes from theof �, Err(�, �MLE) would be normally distributed asymp-standard coalescent model for various values of:totically, centered on 0. Optimistically, we might there-

1. The number n of haplotypes in the sample fore hope that for sufficiently large data sets (large in2. The number S of markers typed terms of the number of haplotypes, the number of mark-3. The value of � (we measure physical distance so that ers, or both) Err(�, �PAC-A) might be approximately nor-

mally distributed, centered on 0. In our simulations, wethe total physical length of each simulated haplotype

Page 6: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2218 N. Li and M. Stephens

found that for some combinations of n, S, and � this This may be due to the fact that for larger spacingsmore recombination events occur, increasing the rela-did indeed appear to be the case (e.g., Figure 3b), but

that for other combinations, although the distribution tive accuracy with which � can be estimated, althoughwe would not expect this pattern to continue indefinitelyoften appeared close to normal, it was centered around

some nonzero value (e.g., Figure 3, a and c), indicating as the marker spacing is increased beyond the rangeconsidered here.a systematic tendency for �PAC-A to over- or underestimate

�. We refer to the median of Err as the “bias” [of log Comparison of point estimates with other methods:Hudson (2001) introduced a composite-likelihood method(�PAC-A) in estimating log(�)]. Although bias is usually

defined as a mean error, this is not particularly helpful for estimating �, based on multiplying together the like-lihood computed for every pair of SNPs. He comparedhere since the mean is often heavily influenced by a

small number of very large values and may even be the performance of this method with others in the litera-ture (Hudson and Kaplan 1985; Hudson 1987; Heyinfinite in some cases (see also FD). We therefore follow

previous authors, including Hudson (2001) and FD, in and Wakeley 1997; Wall 2000), under the standardcoalescent model, and found it to be as good as, orconcentrating on the behavior of the median, rather

than the mean, of the error. better than, the best of these. We compared the resultsreported by Hudson (2001) for his maximum compos-Despite the biases evident in Figure 3, a and c, �PAC-A

gives reasonably accurate estimates of �. For example, ite-likelihood estimate, �CL, with the results for �PAC-A and�PAC-B on data sets simulated under the same conditionseven in Figure 3c, which shows one of the most extreme

biases that we observed in our simulations, the bias (Figure 5). For data sets with small numbers of SNPs( �12) �CL provides the most accurate estimates of �,corresponds to underestimating � by approximately a

factor of 2, and �PAC-A is within a factor of 2 of the true although all three methods struggle to produce reliableestimates. For larger numbers of SNPs both �PAC-A andvalue of � in 68% of cases. Although in many statistical

applications estimates within a factor of 2 of the truth �PAC-B tend, desirably, to exhibit less variability than �CL.Further, �PAC-B exhibits little or none of the bias presentwould not be considered particularly helpful or impres-

sive, in this setting this kind of accuracy is often not in �PAC-A and provides the most accurate estimates of �.The superior performance of the pairwise composite-easy to achieve (see, for example, Wall 2000).

We performed extensive simulations to better charac- likelihood method for data sets with small numbers ofSNPs is perhaps not surprising—indeed, for data setsterize the bias noted above and found that, although

the bias depends on all three variables (n, S, and �), it with only two SNPs �CL is precisely the maximum-likeli-hood estimate for �. However, we note that almost allis especially dependent on the average spacing between

sites. More specifically, for fixed n and S we observed a of the improvement in accuracy comes from the in-crease in the 10th percentile of the estimator towardstriking linear relationship between the bias and the log

of the average marker spacing (Figure 4). This linear the true value, rather than from a decrease in the 90thpercentile. One possible explanation for this is that �CLrelationship was also apparent for data simulated under

an assumption of population expansion (data not uses a likelihood based on an infinite-sites mutationmodel (i.e., assumes no repeat mutation) and so is ableshown). The slope of the linear relationship is negative

in each case, indicating a tendency for �PAC-A to overesti- essentially to rule out very small values for � if there iseven one pair of sites at which all four gametes aremate � when the markers are very closely spaced and

underestimate � when the markers are far apart. As the present. (The effect of this may be compounded by thefact that �CL was found by maximizing over a grid ofnumber of sampled haplotypes increases, both the slope

and intercept of the line appear to get closer to 0 (Table possible values, which forces all nonzero estimates of �to be above some threshold.) Our estimator does not1). On the basis of these empirical results we can modify

�A to reduce the bias of the point estimates (see appen- make the infinite-sites assumption and so will be moreinclined to estimate very small values of �, possibly lead-dix b for details). The improved performance of this

modified conditional distribution, which we denote �B, ing to occasional substantial underestimates. Since inreal data it will typically be unclear whether or not theis illustrated in the next section.

Figure 4 also illustrates the effect of varying parameter infinite-sites assumption holds, the advantage of �CL foreven small numbers of sites is perhaps less clear-cut thanvalues on the variability of point estimates. As might

be expected, the variance of the error reduces with it appears in Figure 5.We used the same simulated data to examine theincreased sample size and increased number of sites,

with the latter providing the more substantial decrease. accuracy of estimates of � obtained by the methodsdescribed by Kuhner et al. (2000) and FD, both of whichFor example, doubling the number of sites from 50 to

100 roughly halved the variance of the error in most use computationally intensive Monte Carlo proceduresto attempt to approximate the full coalescent likelihood.cases, while doubling the number of individuals from

50 to 100 resulted in much smaller decreases. For a The computational complexity of these approaches in-creases with what might be called “the total value of �fixed sample size and number of sites, the variance of

the error decreases as the spacing between sites grows. across the region,” or “per-locus �,” which we denote �

Page 7: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2219Linkage Disequilibrium and Recombination

Figure 4.—Box plots showing the relationship of the bias to the average marker spacing. For each combination of parameters,100 data sets each were simulated under the standard coalescent model. The parameters involved are: the number of haplotypesin each sample, n � 20,50,100,200; the number of segregating sites, S � 20,50,100; and the average marker spacing, �/S �0.1,0.5,1.0,5.0, and 10.0. In humans a marker spacing of �/S � 0.5 corresponds to �1 kb between markers. The unlabeled tickmarks on the y-axis correspond to �PAC-A � �2�.

for larger �. However, point estimates based on these(more precisely, in our notation � � �L, where L is themethods could still be accurate, if the maximum of thephysical length of the region). Results from FD suggestapproximate-likelihood curve occurs in about the rightthat even for small values of � (�3, say), the approxi-place. To investigate this possibility, we applied bothmate-likelihood curves obtained by these methods maymethods, using �1 day of CPU time per method perbe poor approximations to the actual likelihood curve,data set (compared with �30 sec per data set for �PAC-B), toand so it seems unlikely that the curves will be accurate10 of the data sets simulated with � � 40. Computationalconsiderations make a more comprehensive simulationstudy inconvenient. Each of the methods was run with

TABLE 1 fixed at the value used to simulate the data, givingthem some advantage over how they could be used inThe intercepts and slopes of the linear relationship

between log10(�PAC/�) and log10(spacing) practice. Nevertheless, neither method produced pointestimates of � as accurate as those from �PAC-B (Table 2).

Intercept Slope Of the two full-likelihood schemes, the maximum ofthe likelihood curve obtained by infs was consistentlyn/S 20 50 100 20 50 100closer to the true value of � than was the maximum of

20 �0.16 �0.12 �0.09 �0.18 �0.21 �0.26 the likelihood curve obtained by Recombine. Indeed,50 �0.12 �0.07 �0.06 �0.16 �0.21 �0.24 the estimates obtained from Recombine were often100 �0.12 �0.06 �0.04 �0.09 �0.17 �0.21 close to an order of magnitude smaller than the true200 �0.10 �0.05 �0.02 �0.06 �0.14 �0.17 value of �, which raises a danger that when the method

See also Figure 4. is applied to real data (for which the value of � is of

Page 8: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2220 N. Li and M. Stephens

Figure 5.—Comparisonof �PAC-A and �PAC-B with Hud-son’s pairwise composite-likelihood estimator �CL

(Hudson 2001) on data setsof n � 50 haplotypes simu-lated from the neutral infi-nite-sites model. The data setswere simulated with haplo-types of physical length L �4, 8, 12, 16, 20, 24, 28, 32,36, 40, 44, and 48 units, with� � 1/unit physical length,and � 1⁄4/unit physicallength. (With these parame-ters the expected numberof SNPs in each data set isapproximately equal to the

physical length of the haplotypes.) The results for �CL come from Hudson (2001) and were kindly provided by R. Hudson. The resultsfor �PAC-A and �PAC-B are based on 1000 data sets that we simulated for each set of parameters using the program mksample (Hudson2002). (We discarded the few simulated data sets that had only one SNP.) The panels are (a) �CL, (b) �PAC-A, and (c) �PAC-B. In eachpart, the solid line is the median of Err(�, �) � log10(�/�) and the dashed lines are the 10 and 90% quantiles.

course not known) the user might be misled into think- I. Include all values of � for which loge(LPAC-B(�)) iswithin 2 of the maximum.ing that the value of � is small enough for the method

to produce reliable results. Results from longer runs of II. Include all values of loge(�) within �1.96 of �PAC-B,where is the square root of the inverse of minusinfs, taking �5 days of CPU time each, produced improved

results, competitive with �PAC-B (data not shown). the second derivative (found numerically) of thelog of the PAC-B likelihood curve [as a function ofProperties of PAC likelihood curves: Construction of

confidence intervals: We examined the coverage proper- loge(�)] evaluated at �PAC-A � �PAC-B.ties of confidence intervals (C.I.’s) constructed from

The rationale for looking at such C.I.’s is that, underthe PAC-likelihood curve in two ways:standard asymptotic theory for likelihood estimation,C.I.’s constructed in this way using the true likelihoodcurve would include the true value of � �95% of theTABLE 2time. (For I this follows from the asymptotic �2 distribu-

Comparison of �PAC-B with estimates of � from Recombine and tion of the log-likelihood-ratio statistic; for II it followsinfs, for 10 data sets simulated with � � 40, � � 10, n � 50

from asymptotic normal distribution of the MLE.)Figure 6 shows the coverage properties for C.I.’s pro-Data set Recombine infs �PAC-B

duced using the two methods (i.e., the proportion of1 13 26 57 times that C.I.’s formed using each method contained2 9 21 24 the true value of �) for the data sets used to obtain3 24 27 30 Figure 5c. For moderate sequence length both methods4 10 29 26

produce C.I.’s that are slightly anticonservative, with5 10 27 34coverage properties that approach �0.91, compared6 5 33 42with the expectation of �0.95 under asymptotic theory.7 4 25 45

8 8 23 38 On the basis of these results we speculate that the curva-9 7 53 50 ture of the PAC-B-likelihood curve does not deviate

10 12 27 29 grossly from that of the true-likelihood curve. We noteMedian 10 27 36 that the coverage properties are also closer to asymptoticMedian |Err(�, �)| 0.62 0.17 0.11

expectations than those reported by Fearnhead andBoth infs and Recombine were run with fixed at its true Donnelly (2002) for their composite likelihood using

value. infs was run for 20,000 iterations with five driving values the same methods of C.I. construction.for � (10, 30, 40, 50, and 60). The effective sample size (ESS) Comparison with other methods: We compared the LPAC-B-at the MLE is always �4, indicating that infs had very little

likelihood curves with likelihood curves obtained fromconfidence in its estimated likelihood curve (and the esti-three other methods: the full-data coalescent methodmated 95% C.I.’s failed to include the true � in all but one

case). Recombine was run with five short runs of 20,000 itera- of FD (implemented in the computer program infs) andtions and one long run of 1 million iterations, using three the pairwise composite-likelihood methods of Hudsonheating temperatures, initializing the runs at the true value [2001; implemented by one of us (N.L.), using tablesof �, with fixed as 10. The CPU times (on an 800-MHz Pentium

available from R. Hudson’s website] and McVean et al.III processor) for data set 8 were �30 hr for infs and Recombineand 30 sec for �PAC-B. (2002) (implemented in the computer program LDhat).

Page 9: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2221Linkage Disequilibrium and Recombination

shown from infs are profile likelihoods for � at the truevalue of ; Hudson’s method and our method avoidexplicitly estimating ; LDhat estimates using an ana-log of Watterson’s estimate, but allowing for multiplemutations.

Notwithstanding these issues, we attempt to drawsome general conclusions:

1. In general, the likelihood curves produced by thefour methods seem to agree rather more closely thanmight have been expected. (Compare, for example,the variability here with the variability observed fordifferent runs of a single method in FD.) However,the closeness of the agreement between the methodsdiffers appreciably across data sets. Data set 12 con-sists of only four sites, three of which are singletons,and so the differences in the curves for this dataFigure 6.—Empirical coverage properties of confidenceset seem not to be particularly interesting. We wereintervals produced using two different methods described in

the text. Each number is based on analysis of 1000 data sets unable to discern a systematic reason for the largerand shows the proportion of cases in which the C.I. contained differences among methods observed in some of thethe true value of � used to generate the data. The data sets

other data sets (e.g., 16).used are the same as those used to produce Figure 5c.2. The two pairwise composite-likelihood methods tend

to produce likelihood curves that are slightly moreFigure 7 shows likelihood curves obtained using each peaked than those of the other two methods. Thismethod for the 20 data sets considered by Wall (2000), might be expected since, as pointed out by McVeanwhich were simulated under the standard coalescent et al. (2002), pairwise composite-likelihood curvesmodel with � � � 3.0 across a region of physical are typically more peaked than the true-likelihoodlength 1 and were kindly supplied by J. D. Wall. (These curve because they treat each pair of sites as indepen-likelihood curves are plotted with � on the x-axis, rather dent, when in fact many pairs are highly dependent.than log(�), because infs and LDhat output likelihood 3. The method implemented in LDhat, which allowscurves for evenly spaced values of �.) for multiple mutations, tends to achieve its maximum

Interpreting the results of this comparison is slightly at larger values of � than does Hudson’s method,tricky. Unlike the other three methods we consider, the which does not allow for multiple mutations. This isfull-data coalescent method can, in principle, provide surprising; indeed, the opposite might have beena fully accurate representation of the true-likelihood expected, since multiple mutations could be usedcurve. As such it is tempting to treat this as a “gold in place of recombination events to explain certainstandard” against which to compare the other methods.

patterns of LD. One possible explanation is that theHowever, as mentioned previously, even for the rather

run lengths we used for computing the likelihoodsmall value of � � 3 used to generate these data, accuratein LDhat might be insufficient (we used the defaultapproximation of the true-likelihood curve may be com-values).putationally impractical. Indeed, the estimated effective

4. Different orderings of the haplotypes can give PAC-sample sizes (ESSs) obtained for these data sets, shownlikelihood curves that differ appreciably from oneat the top of each part of Figure 7, suggest that weanother. In addition, the maximum of the likelihoodshould not place much confidence in the accuracy ofcurve based on the average over several orderingsmany of the curves. Our attempts to obtain more accu-tends to be toward the left end of the distributionrate likelihood curves by performing longer runs forof maxima obtained from different orderings. Thissome of the data sets (numbers 15 and 16) actually pro-is because, although not shown in Figure 7, the curvesduced smaller estimated ESSs, suggesting that the effectivewith maxima at smaller values of � tend to be largersample sizes quoted for the other data sets are optimistic(in absolute value) than those with maxima at larger(see FD for further discussion of this problem). A fur-values of � (presumably because they correspondther complication in comparing the methods is thatto orderings of the haplotypes that, in some sense,both our method and that of McVean et al. (2002) allowrequire fewer recombination events to explain them)(implicitly and explicitly, respectively) for the possibilityand thus contribute more to the average. Althoughof multiple mutations, and thus the likelihoods fromthis dependence on ordering is bothersome, in simu-these methods are in some sense not directly compara-lation studies (results not shown) we have found thatble with those from the other two methods. Finally, wethe variability in the position of the maxima of thenote that the methods deal in different ways with the

unknown mutation parameter : the likelihood curves PAC likelihood over different orderings of the haplo-

Page 10: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2222 N. Li and M. Stephens

Figure 7.—Comparison of the relative PAC-likelihood curves, with coalescent-based and pairwise composite relative-likelihoodcurves, for the first 20 data sets in Wall (2000). In each case the relative likelihood is obtained by normalizing each likelihoodcurve to have a maximum of 1. The light gray lines show 20 PAC-likelihood curves, each from a different random order of thehaplotypes, and the solid black line is based on the PAC likelihood averaged over the 20 random orders. The other linescorrespond to likelihood curves computed using the methods of: FD, implemented in the computer program infs (red dashedline); McVean et al. (2002), implemented in LDhat (blue dotted line); and Hudson (2001), using the table generated by programeh written by Hudson (cyan dot-dashed line). The effective sample size (ESS) for infs at the MLE is given for each data set abovethe graph and is a measure of the confidence infs has in its estimated likelihood curve (the larger the better). Results for infsfor all data sets except 15 and 16 were kindly provided to us by P. Fearnhead and were obtained using between 50,000 and5,000,000 iterations. Results for data sets 15 and 16 were obtained by ourselves using 10,000,000 iterations.

types is typically small compared with the uncertainty Here c represents the background rate of crossover,a and b represent the left and right ends of thein estimation of �.hotspot region, and � (�1) quantifies the magnitudeof the recombination hotspot. The PAC likelihoodVARIABLE RECOMBINATION RATEfor this model is a function of four parameters: a, b,

Models for variation in recombination rate: One of �, and � � 4N c.our main motivations for developing this model is to

2. A more general model, where if x is a position be-explore fine-scale variation in recombination rates. Atween markers j and j � 1, thensimple (no interference) model for variation in recom-

bination rates is that crossovers in a single meiosis occuras an inhomogeneous Poisson process, of rate c(x) at c(x) � �j c . (5)position x. Here we consider two specific cases of this

Here c represents the background rate of crossover,general model:and �j is a multiplier controlling how the crossover

1. A simple single-hotspot model, where rate between markers j and j � 1 deviates from thebackground rate. The PAC likelihood for this modelis a function of the parameters �1, . . . �S�1 (where Sc(x) � ��c for a x b,

c otherwise.(4)

is the number of SNPs) and � � 4 Nc.

Page 11: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2223Linkage Disequilibrium and Recombination

TABLE 3

Performance of the simple single-hotspot model, for data sets simulated under various demographic scenarios,with a hotspot of magnitude � � 10 and with no hotspot (i.e., � � 1)

� � 10 � � 1

Power Med � Med |Err| Type I error Med � Med |Err|All sites 0.90 13.02 0.19 0.04 0.95 0.45Common sites (f � 0.1) 0.81 16.57 0.29 0.05 1.13 0.57Island (mixed) 0.92 11.35 0.18 0.07 0.90 0.33Island (single) 0.94 10.24 0.16 0.07 1.01 0.31Expansion (t � 500) 0.94 24.29 0.40 0.13 1.41 0.56Expansion (t � 5000) 0.53 11.66 0.36 0.07 1.23 0.70

In each case, the first column shows the proportion of “significant” LR tests (LLR � 1.92) for testing thenull hypothesis of no hotspot, the second column shows the median estimate of �, and the third column showsthe median of |Err(�, �)|. Each number is based on results for 200 simulated data sets.

For the simple single-hotspot model it is straightfor- likelihood for a particular �j is very flat, this indicatesthat there is little information about the recombinationward to obtain numerically the maximum PAC-likeli-

hood estimates for all four parameters simultaneously, rate in that marker interval, in which case it seems sensi-ble to estimate that the recombination rate is close toalthough in the examples that we consider we assume

that a and b are known and maximize the PAC likelihood the background rate (i.e., �j � 1), rather than (closeto) infinitely bigger or smaller!in terms of � and �. The evidence for the presence of

a hotspot can be summarized by the log-likelihood ratio To solve both these problems, we assume a “prior”distribution for the �j’s: specifically that the �j’s are inde-(LLR) for the null hypothesis of no hotspot, H0: � � 1

vs. the alternative H1: � � 1. If �0 denotes the value of pendent and identically distributed, with log10(�j) � N(0,0.52). This prior was chosen to allow occasional devia-� that maximizes LPAC-B under H0, and �1 and �1 de-

note the values of � and � that maximize LPAC-B under tions from the background rate of recombination by afactor of 10 or more (with probability �95%, �j liesH1, thenin the range 0.1–10). This choice of prior could be

LLR � logeLPAC-B(�1, �1)/LPAC-B(�0 , � � 1), (6)motivated from a Bayesian viewpoint as reflecting ourprior beliefs about the �j’s, but it also has the moreand large values of LLR represent evidence for the exis-pragmatic justification that identifying variations of thistence of a hotspot. Under standard asymptotic theory,kind of magnitude seems both interesting and, perhaps,two times LLR would have (asymptotically) a chi-squareattainable. We consider alternative prior specificationsdistribution on 1 d.f., and so rejecting H0 if LLR � 1.92in the discussion.would give a hypothesis test with a type I error rate of

In principle, given the prior distribution for the �j’s0.05. Although it seemed unlikely that standard asymp-described above, we could also place a prior distributiontotic theory would apply here, we found that for dataon � and obtain an approximation to the posterior distri-sets simulated under the null hypothesis, rejecting H0

bution of all parameters, using Markov chain Montefor LLR � 1.92 gave empirical type I error rates closeCarlo, for example. Although this would be our pre-to 0.05 (Table 3), which provides some guidance as toferred approach, for simplicity we avoid this here andwhat might be considered a “large” value of LLR.instead use the following ad hoc two-stage approach:For the second, more general model, obtaining maxi-first, obtain point estimates for � and � by jointly max-mum PAC-likelihood estimates for the parameters cre-imizing the product of LPAC(�, �) and the prior densityates problems. First, the maximum-likelihood estimatesof �; second, obtain a “posterior distribution” for eachare not unique (indeed there are infinitely many of�j by fixing all other parameters at their estimated val-them), because multiplying all the �j by any constant,ues, discretizing the prior on �j (truncated at �j � �3),and dividing � by the same constant, gives exactly theand computing the corresponding discretized posteriorsame likelihood. (In technical terms, the parametersdistribution as being proportional to the prior timesare said to be unidentifiable.) Second, even if the identi-the PAC likelihood. For data sets with a large numberfiability problem is solved (for example, by first ob-of sites the first stage (optimization over �, �) can betaining an estimate for � assuming that there is no hot-very time consuming, requiring large numbers of evalua-spot and then fixing this when estimating the othertions of the likelihood function. Further, it seems un-parameters) there is the practical problem that the like-likely that the simple optimization method we used willlihood curve for some �j will often be very flat, resultingreliably find the global maximum of the likelihood sur-in estimates for many �j being very close to (or equal

to) either 0 or infinity. This seems undesirable: if the face. Both these problems could be alleviated by ex-

Page 12: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2224 N. Li and M. Stephens

ploiting the fact that the derivatives of the PAC likelihood to test the null hypothesis H0: � � 1 against the alterna-tive H1: � � 1 (Table 3). For the scenarios not involvingcan be computed efficiently, but we do not pursue this

here. population expansion, the test gave type I error ratesof �0.05 when applied to data without a hotspot and aPower to detect recombination hotspots and robust-

ness: In this section we assume that there is a single power of �0.90 when applied to data simulated with ahotspot, although the test based on just the commonrecombination hotspot (Equation 4), whose putative

position is known, and examine the power of our model SNPs had a slightly reduced power. The two scenariosinvolving population expansion gave either a substantialto detect the hotspot under various assumptions about

the population demography and SNP marker ascertain- reduction in power or an inflated type I error rate(which is in some sense equivalent to a reduction inment. Although the assumptions made here (in particu-

lar, that there is a single hotspot with known putative power). This might be due to a reduction in the numberof “common” SNPs under these scenarios, as commonposition) are unrealistic, they provide a convenient

framework within which to examine quantitatively the SNPs tend to be most informative for estimating recom-bination rates.power of our approach and how it is affected by popula-

tion demographic history and marker ascertainment We also examined the robustness of estimates of �under the various scenarios (Figure 3). As noted by FD,schemes.

We consider the following scenarios: the recombination rate � � 4Nc depends on how theeffective population size N is defined. In contrast, the

1. Constant-size randomly mating population, alldefinition of the parameter � does not depend on how

markersthe effective population size is defined, and so we might

2. Constant-size randomly mating population, onlyhope that estimation of � will be robust to departures

markers at frequency �0.1from the assumption of a constant-sized panmictic pop-

3. Exponentially expanding population, with expan-ulation. For the levels of population structure we used

sion starting t � 500 generations agoin our simulations this does indeed appear to be the

4. Exponentially expanding population, with expan-case—in both cases estimates were more accurate than

sion starting t � 5000 generations agothose for the sample from a single random-mating popu-

5. Haplotypes sampled from a structured population,lation, perhaps because population structure makes re-

consisting of two islands exchanging migrants at acombinants easier to “spot.” As might be expected, esti-

rate of one per generation (scaled migration parame-mates based only on common SNPs were less accurate

ter 4Nm � 4)than those based on all SNPs. A drop in accuracy is also

6. Haplotypes sampled from only one of the islands inevident for the scenarios simulated under population

the structured population described above.expansion, probably again due to a reduction in thenumber of “common” SNPs under these scenarios.These last four models are the same as those consid-

ered in Pritchard and Przeworski (2001). In the two Some of the scenarios also resulted in an upward bias forestimates of �, notably one of the expansion scenarios inexpanding-population scenarios, the population was as-

sumed constant sized until t generations ago, when it which the median of the estimates was almost 2.5 timesthe true value.started to expand exponentially, continuing until the

present. The current population size N0 is set to be 105 Estimating recombination rates along a region: Simu-lated data: We fitted the more general varying recombi-and the population growth rate is chosen, as a function

of t, to match the expected diversity in a population of nation rate model to the simulated data used to producethe pairwise LD plots in Figure 1; the results are shownconstant size 104. (The necessary growth rates of � �

1960, 350 for t � 500,5000 were kindly provided by M. in Figure 8. From the latter figure we might conclude,correctly, that the data sets corresponding to the top leftPrzeworski.)

For each scenario we simulated data sets under the and bottom middle parts had recombination hotspotssomewhere in the region 0.4–0.6. We might also con-simple single-hotspot model described above, using

mksample and the postprocessing algorithm described clude that the other data sets had no hotspots, whichwould be correct except in the case of the bottom leftin Appendix C. For the first two scenarios each data

set was simulated to have �50 segregating sites and 60 figure, which was actually generated from data with ahotspot between 0.4 and 0.5. One possible reason thatchromosomes, with a � 0.4, b � 0.5, � � 20, and � �

10 [these values were chosen to approximately match the hotspot shows up less well in this case is that thereare fewer sites at high frequency (�0.15) in this datavalues for the TAP2 data from Jeffreys et al. (2000)

considered in Figure 9]. For the expanding population set. Despite the fact that we might have been misled inone case out of six, we view Figure 8 as considerablyscenarios we set � � 4N0c � 200, and for the structured

scenarios � � 20 within each population. For each sce- more informative than Figure 1, from which we find itdifficult to draw any conclusions.nario we also simulated data sets under the same condi-

tions, but with no hotspot (i.e., � � 1). TAP2 data: Jeffreys et al. (2000) used patterns ofLD (measured by haplotype diversity) in a populationWe applied the likelihood-ratio test to each data set

Page 13: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2225Linkage Disequilibrium and Recombination

Figure 8.—Estimates of variation from background recombination rate within each marker interval for the same simulateddata sets that were used to produce Figure 1. Top left, bottom left, and bottom middle correspond to data sets simulated witha single hotspot of magnitude � � 10 between positions 0.4 and 0.5. The other parts correspond to data simulated with constantrecombination rate across the region.

sample to refine the location of a putative recombina- they warn that this estimate of the background rateshould be “treated with caution”). Since our estimatetion hotspot in the human TAP2 gene and provided a

more detailed characterization of its properties through is based on a population sample, it is actually an estimateof the magnitude of the hotspot in the sex-averagedsperm typing. The population sample consists of 30

individuals from the United Kingdom typed at 47 poly- crossover rates. Jeffreys et al. (2000) point out that thecrossover rate in this region appears to be substantiallymorphisms (45 SNPs, two insertion/deletions) across 9.7

kb, with haplotypes determined by allele-specific PCR. higher in females than in males, and so the sex-averagedcrossover rates are likely to be dominated by the femaleThrough analysis of sperm crossover events Jeffreys

et al. (2000) identified a region of increased crossover crossover process. Our results therefore suggest thatthe crossover rate within the 1.2-kb hotspot in femaleintensity, located approximately in the interval from 4

to 5.2 kb. meioses is roughly an order of magnitude higher thanthe (female) background rate.We fitted the simple single-hotspot model to the hap-

lotype data (kindly provided in convenient electronic Jeffreys et al. (2000) found that if one assumes thatthe sex-averaged background recombination rate isformat by A. Jeffreys), assuming a hotspot between 4

and 5.2 kb, and obtained estimates of � � 1.3/kb and equal to the male background recombination rate of�0.4 cM/Mb (a number that we again emphasize they� � 12 (95% C.I. [6,21]), with a LLR of 12, indicating

strong evidence for the presence of the hotspot. Our suggest should be treated with caution), the observedpatterns of LD in the population sample appear consis-estimate for the average magnitude of the hotspot, � �

12 times the background rate, agrees well with the tent with an effective population size of N � 100,000,which contrasts with the more commonly quoted num-sperm-typing results from Jeffreys et al. (2000). In par-

ticular, Jeffreys et al. (2000; their Figure 4) observed ber for humans of N � 10,000. Our estimate of � �1.3/kb supports their analysis, as it corresponds to N �128 crossovers within the interval 4–5.2 kb in 2.4 million

progenitor molecules, giving an average rate of 4.4 cM/ 84,000 if the background sex-averaged recombinationrate is assumed to be 0.4 cM/Mb. As pointed out byMb, which is 11 times the approximate background rate

(for males) that of 0.4 cM/Mb they quote, (although Jeffreys et al. (2000), one possible explanation for this

Page 14: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2226 N. Li and M. Stephens

Figure 9.—TAP2 data: MLE with posteriorquantiles. Results of fitting the more generalmodel for varying recombination rate to the TAP2data from Jeffreys et al. (2000). The plot showsthe estimated value and 25, 10, and 5% posteriorquantiles for �j in each marker interval along thechromosome. The vertical lines indicate the ap-proximate boundaries of the hotspot identifiedby Jeffreys et al. (2000).

is differences between male and female recombination accurate, and in this analysis we assume them to berates—in particular, a sex-averaged background rate known without error. On the basis of patterns of LDacross the 9.7-kb region of 3.4 cM/Mb would give N � and on the results of phylogenetic-based methods that10,000. An alternative (or additional) explanation, sug- attempt to infer ancestral recombination events, Tem-gested to us by M. Przeworski (personal communica- pleton et al. (2000) suggested the existence of a putativetion), is that gene conversion events not detected by recombination hotspot between [2987,4872].the sperm-typing experiments could partially account Table 4 shows the results of fitting the simple single-for the unusually large estimated effective population hotspot model to the whole data set and to the datasize. from each subpopulation individually, assuming a hot-

Figure 9 shows the estimates of �j and posterior quan- spot from 3 to 5 kb. Figure 10 shows the results fortiles obtained by fitting the more general model for fitting the more general model for recombination raterecombination rate variation to the TAP2 haplotype variation. Overall, these results seem to support thedata. The hotspot in the region 4–5.2 kb is fairly clear, existence of the putative hotspot, although there is con-with some suggestion that it may extend slightly farther siderable variation in the strength of the evidence (asto the right than 5.2 kb. The peak of the recombination measured by the LLR), and of the estimated magnitudehotspot is estimated as being �14 times the background of the hotspot, in different subpopulations. We note thatrate. In the interval corresponding to this peak the pos- the apparent magnitude of the hotspot in the Finnishterior probability that �j � 7 is 75% (compared with a population is smaller in Figure 10 than in Table 4, dueprior probability of �5%). However, the large number to the affect of the prior. There is also tentative evidence,of parameters estimated within this more general modelresults in generally poor precision for the estimate ofeach �j. In particular, confidence in the estimates is TABLE 4probably not sufficient to conclude that the three sub-

Results of fitting the simple single-hotspot model to thepeaks present in Figure 9, within the hotspot, corre-LPL data, to each subpopulation sample individually,

spond to actual variation in the hotspot intensity. and to the combined sampleLipoprotein lipase data: The lipoprotein lipase (LPL)

data (Clark et al. 1998; Nickerson et al. 1998) consist Population �/kb � C.I. LLRof 9.7 kb of genomic DNA sequence from the human

Jackson 7.5 2.7 [1.3, 4.8] 3lipoprotein lipase gene from 71 individuals from Jack-Rochester 0.74 12 [5.0, 25] 8son, Mississippi (n � 24), Rochester, Minnesota (n �Finland 0.15 104 [55, 183] 2223), and North Karelia, Finland (n � 24). In the pub- Combined 7.0 6.5 [4.1, 8.5] 23

lished data, the haplotypic phase for 69 sites was eitherThe first two columns show estimated values for � and �determined by experiment or estimated by Clark’s

assuming a hotspot between 3 and 5 kb along the sequence.(1990) algorithm. Although the use of a statisticalC.I. gives the range of values of � whose log-likelihood is withinmethod to infer some of the phases means that there 1.92 of the log-likelihood at � (for � fixed at its estimated

is some possibility that not all the published haplotypes value). LLR gives the value of the log-likelihood ratio fortesting the null hypothesis of � � 1.are completely correct, the majority seem likely to be

Page 15: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2227Linkage Disequilibrium and Recombination

Figure 10.—Results of fittingthe more general model for vary-ing recombination rate to the LPLhaplotype data from Clark et al.(1998). The plot shows the esti-mated value and 25, 10, and 5%posterior quantiles for �j in eachmarker interval along the chromo-some. The vertical dashed linesindicate the boundaries of the pu-tative recombination hotspot identi-fied by Templeton et al. (2000).

mostly from the Jackson sample, for a smaller-magni- the coverage probabilities for these C.I.’s are unlikelyto be 95%, they give some indication of the curvaturetude hotspot between 8 and 9 kb. Although no interval

in that region produces a very large estimate for �i, the of the likelihood surface, and the fact that not one ofthe three intervals overlaps with either of the otherclustering together of three intervals with moderate �i

provides stronger evidence than any one of these esti- two suggests that the hotspot intensity may indeed varyamong the three populations. Additional simulation re-mates taken separately.

Our results are consistent with those from Fearnhead sults (not shown) suggest that the larger effective popu-lation size of the Jackson population should actuallyand Donnelly (2002), who found evidence for the

[2987,4872] hotspot in the samples from Rochester and increase power to detect the hotspot compared with theFinnish population, and so differences in effective popu-Finland, but not in those from Jackson. In addition,

both we and Fearnhead and Donnelly (2002) found lation sizes do not appear to explain our results. Analternative explanation is the bias we observed for ourthat the Rochester and Finland samples give muchestimates of � under certain expansion scenarios (Fig-smaller estimates for � than the Jackson sample gives,ure 3), which might partially explain the large estimateprobably reflecting smaller effective population sizes asof � in the Finnish population, for example, althougha result of a recent bottleneck. One general advantageit is unclear whether this is enough to account for theof the approach we take here, over the method of con-fact that the estimated � is almost 40 times greater thansidering segments of the chromosome separately as dothat in the Jackson population. Biological mechanismsFearnhead and Donnelly (2002), is that it uses thethat could lead to different patterns of recombinationpatterns of LD not only between markers within therate heterogeneity in different populations are knownhotspot, but also between markers either side of the hot-to exist (e.g., Jeffreys and Neumann 2002), and thespot, to estimate the magnitude of the hotspot. Thiskinds of methods we introduce here should be helpfulmay explain why we detected a signal (albeit a modestin determining how often this occurs in practice.one) for a [2987,4872] hotspot in the Jackson sample,

where Fearnhead and Donnelly (2002) did not.The large differences among estimates for � from the

DISCUSSIONthree separate population samples are surprising. Toexamine whether this might be simply due to poor preci- In this article we have introduced a new statisticalsion for these estimates in one or more of the popula- model relating patterns of LD at multiple loci to thetions (due, for example, to too much or too little diver- underlying recombination rate and examined its effec-sity), we constructed �95% confidence intervals for � tiveness for inferring the underlying rate of recombina-using an analog of method I in the Construction of confi- tion. Another potential application of our model is to

methods for LD (association) mapping in “case-control”dence intervals section (column C.I. in Table 4). Although

Page 16: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2228 N. Li and M. Stephens

studies, where chromosomes have been collected and observation that in some regions of the genome LDexhibits a “block-like” structure. Daly et al. (2001)typed for both case and control individuals. Several au-

thors, including McPeek and Strahs (1999), Morris model each observed haplotype as a mosaic of “ancestralhaplotypes,” with the transition rates among these ances-et al. (2000), and Liu et al. (2001), have developed meth-

ods to use genetic types at multiple loci to perform tral states (representing the “historical recombinationfrequency” between each pair of consecutive markers)association mapping for case-control studies. These

methods aim to improve on other common methods— being estimated by maximum likelihood. The ancestralhaplotypes are identified by an initial scan for regionswhich typically test small groups of markers, one group

at a time, for association with a trait—by considering of low haplotype diversity, although in principle theycould instead be treated as parameters in the model.data at many SNP markers simultaneously. Although

the methods differ in details, broadly speaking they all Daly et al. (2001) used this model to produce a sum-mary of patterns of LD that illustrates the haplotypepursue a strategy of assuming that (subsets of) the case

chromosomes share some region identical by descent structure in their data more clearly, and in more detail,than would plots of pairwise LD measures. However, itabout a causal mutation and as a result will be more

similar than would be expected by chance. The chal- is currently unclear to what extent this model might behelpful for applications involving statistical inference,lenge then is to identify regions where (subsets of) the

case chromosomes are more similar than would be ex- or prediction, particularly in regions where patterns ofLD are less block-like.pected by chance. Models of LD play a key role here,

because what would be expected “by chance” depends Several challenges might arise in applying our methodto real data, which we have ignored here. In particular,critically on the amount of LD among loci. In particular,

correlations among loci will cause chromosomes to tend we have assumed in our examples that haplotypes areknown and that there are no missing genotypes or geno-to be more similar by chance than if the loci were inde-

pendent. McPeek and Strahs (1999) use a first-order typing errors. A new version of the software packagePHASE (Stephens et al. 2001) is under development,Markov chain to model LD, so that the probability of

observing types (x1, . . . , xL) at L loci along a chromosome and it will deal with these problems by incorporatingthe PAC likelihood into a Markov chain Monte Carlois Pr(x1) Pr(x2|x1) Pr(x3|x2) . . . Pr(xL|xL�1), where the

conditional probabilities Pr(xr|xr�1) are estimated using algorithm to jointly estimate the recombination rateparameters, haplotypes, missing genotypes, and poten-the control chromosomes. This model was also adopted

by Morris et al. (2000). While the first-order Markov tial locations of genotyping errors. This algorithm alsoproduces a method for estimating haplotypes that takesassumption is better than assuming that the loci are

independent and may suffice if there is little LD among account of decay of LD along chromosomes. Prelimi-nary results for simulated data suggest that these ideasmarkers, it seems not to be a good model for LD in

general. In particular, it fails to capture the fact that result in slightly more accurate haplotype estimates thandoes the method described in Stephens et al. (2001).markers may be in weak LD with neighboring markers,

but in strong LD with more distant markers. Although There are also biological aspects of real data that wehave not accounted for here, including, for example, geneMcPeek and Strahs (1999) mention that higher-order

Markov models might better model LD, such models conversion, whose effect on patterns of LD in humanshas been the subject of considerable recent interest (seeseem unlikely to be helpful in practice because of the

difficulty of estimating all the necessary parameters. The Frisse et al. 2001, for example). The effect that thepresence of gene conversion will have on our methodmodel we have introduced here provides a parsimoni-

ous method for modeling LD: even the more general will vary, depending on how the tract length—aboutwhich little is known in humans—compares with themodel for varying recombination rates has fewer param-

eters than the first-order Markov model used previously. marker density. Gene conversion events with very smalltract lengths compared to the marker density will onlyFurther, in these kinds of applications, where estimation

of underlying recombination rates may be of only indi- rarely involve a typed marker and so will tend to havea small effect on our method unless such events arerect interest, the usefulness of our model will depend

only on whether Pr(h1, . . . , hn|�) is a sensible distribution extremely common. Conversely, gene conversion eventswith longer tract lengths—comparable to the typicalfor h1, . . . , hn for some value of the parameters �, even if

this � does not correspond precisely to the background distance between markers—will often affect one ormore markers and will tend to look like double-cross-recombination rate scaled by the effective population

size. Under these circumstances our two approxima- over events to our method. The presence of gene con-version with this kind of tract length will thus elevate ourtions �A and �B should perform almost identically, and

so �A might be preferred on the grounds that it is sim- estimates of recombination rate, perhaps substantially,and regions with elevated rates of such gene conversionpler to understand and implement and is more amena-

ble to theoretical study. may appear as recombination hotspots in our method.In principle the PAC model could be extended to ac-Another model for LD across multiple sites, intro-

duced by Daly et al. (2001), is based on the empirical count explicitly for gene conversion by suitable modifi-

Page 17: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2229Linkage Disequilibrium and Recombination

cation of the conditional distribution �. A concrete have been to use the pseudo-likelihood (Besag 1974)based on our conditional distribution,suggestion for how to achieve this would be to augment

the space of the hidden Markov model for the mosaicL pseudo(�) � �

n

k�1

�(hk |H�k), (7)process (described in detail in appendix a) to includeboth the current and previous “copied” chromosomeand then to modify the Markov jump process to make where H�k denotes the set of all haplotypes excluding

hk. The pseudo-likelihood, by definition, does not de-jumps back to the previously copied chromosome morelikely than jumps to other chromosomes. However, this pend on the ordering of the haplotypes. This idea is

more along the lines of the way that these conditionalwould greatly increase the computational expense ofthe model, making it unappealing in practice. A more distributions are used in Stephens et al. (2001). How-

ever, in preliminary studies we found that this pseudo-attractive possibility would be to settle for modeling onlythose gene conversion events that affect a single marker likelihood performed poorly for estimating �. Our intu-

itive explanation for this is that the pseudo-likelihood(which, depending on tract length and marker density,may be the vast majority of gene conversion events af- in effect contains only information on the recombina-

tion that is occurring in the tips of the trees and notfecting patterns of LD). This would require only a simplemodification of the conditional distribution (it could be on the structure of the tree as a whole. (Interestingly,

under our approximation the first two haplotypes con-handled similarly to the way that mutations are currentlyhandled), with essentially no increase in the computa- tain no information on �, so in some sense the informa-

tion on � comes from intermediate haplotypes.) None-tion required.Another aspect of real data that we have not ac- theless, it is possible that the pseudo-likelihood may

prove useful in settings where estimating � is not ofcounted for explicitly is population structure. Our simu-lation results in Figure 3 suggest that for the purposes of direct interest.

We have introduced here two models for variation inidentifying recombination hotspots our method is robustto a certain amount of population structure. Nevertheless, recombination rate: a simple single-hotspot model and

a more general model that allows recombination ratesmodeling population structure explicitly might provehelpful in some settings. For example, it could be used to vary along the chromosome. Each of these models

has weaknesses. The simple single-hotspot model makesto extend methods for detecting population structurefrom unlinked markers (e.g., Pritchard et al. 2000) to some unrealistic assumptions: the background recombi-

nation is unlikely to be constant and neither is theallow them to be applied to sets of tightly linked mark-ers. Again, a natural approach is to modify the condi- recombination rate within the hotspot; furthermore,

there could be more than one hotspot. The more gen-tional distribution � to account explicitly for populationstructure. One suggestion is to modify the copying pro- eral model makes few assumptions and allows more

flexible investigation of patterns of recombination ratecess in the k � 1st chromosome (see appendix a) sothat, rather than being equally likely to copy all r existing variation along a region. However, this extra flexibility

comes at the expense of the introduction of extra pa-chromosomes, it is more likely to copy chromosomesfrom the same population than chromosomes from a rameters, which can result in a reduction in the preci-

sion with which parameters can be estimated. Whendifferent population. This would effectively model pop-ulation structure by increasing the probability of seeing using the model as a general model for LD, rather than

for parameter estimation as we have concentrated onsimilar chromosomes in the same population, comparedwith in different populations. We are currently investi- here, the precision of parameter estimates may be unim-

portant, and the few assumptions made by the moregating the effectiveness of a similar idea for LD mappingin case-control studies: treating cases and controls as sepa- general model make it particularly attractive in this situ-

ation. When estimation of recombination rates is therate populations and examining whether there appearsto be evidence in some regions for the case chromosomes main goal, the more general model may be viewed as most

suited to exploratory data analysis, identifying plausibleto be more similar to other case chromosomes than tocontrol chromosomes. positions for hotspots, whose magnitudes might be esti-

mated by a more parsimonious model. In this situationWhile we have concentrated here on models for bial-lelic loci, the ideas we have introduced could also be it might prove fruitful to consider modifying the more

general model by putting a more informative prior onused to model LD among multiallelic loci such as micro-satellites. There is a natural analog of �A for loci with the �j’s. In particular, a prior in which the �j’s are corre-

lated along the chromosome (e.g., an autoregressiveK alleles (see also the conditional distribution forK-allele loci suggested in FD), and this could form a prior) would reduce the variance of parameter estimates at

the expense of assuming that changes in recombinationstarting point for further investigation.To deal with the problem that the PAC likelihood rate occur more or less smoothly along the chromosome

(which may or may not be the case).depends on the order in which the haplotypes are con-sidered, we have chosen to average the likelihood over In assessing our model as a method for estimating

recombination rates from sequence data over moderateseveral random orders. One possible alternative would

Page 18: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2230 N. Li and M. Stephens

Clark, A. G., 1990 Inference of haplotypes from PCR-amplifiedgenomic regions, perhaps the most natural comparisonssamples of diploid populations. Mol. Biol. Evol. 7 (2): 111–122.

to make are with the composite-likelihood methods of Clark, A. G., K. M. Weiss, D. A. Nickerson, S. L. Taylor, A.Buchanan et al., 1998 Haplotype structure and population ge-Hudson (2001) and Fearnhead and Donnelly (2002).netics inferences from nucleotide-sequence variation in human(While some other methods based on summaries of thelipoprotein lipase. Am. J. Hum. Genet. 63: 595–612.

data might be competitive with these approaches when Daly, M. J., J. D. Rioux, S. F. Schaffner, T. J. Hudson and E. S.Lander, 2001 High-resolution haplotype structure in the hu-the recombination rate is assumed constant, they seemman genome. Nat. Genet. 29: 229–232.likely to suffer from loss of information when fitting

Ewens, W. J., 1972 The sampling theory of selectively neutral alleles.models with more parameters, such as either of our Theor. Popul. Biol. 3: 87–112.

Fearnhead, P. N., and P. Donnelly, 2001 Estimating recombina-models for recombination rate variation.) Of the twotion rates from population genetic data. Genetics 159: 1299–1318.composite-likelihood methods, although both are trac-

Fearnhead, P., and P. Donnelly, 2002 Approximate likelihoodtable for estimating � over large genomic regions, only methods for estimating local recombination rates. J. R. Stat. Soc.

Ser. B 64: 657–680.Hudson’s method is comparable with our own in termsFrisse, L., R. R. Hudson, A. Bartoszewicz, J. D. Wall, J. Donfackof computational expense: our method and Hudson’s

et al., 2001 Gene conversion and different population historiesmethod typically take seconds, or less, to compute a may explain the contrast between polymorphism and linkage

disequilibrium levels. Am. J. Hum. Genet. 69: 831–843.likelihood, while Fearnhead and Donnelly’s method canGriffiths, R. C., and P. Marjoram, 1996 Ancestral inference fromtake hours per likelihood computation. Although the

samples of DNA sequences with recombination. J. Comput. Biol.time and effort expended in collecting these kinds of 3: 479–502.

Hammer, M. C., T. Karafet, A. Rasanayagam, E. T. Wood, T. K.data make it not unreasonable to wait hours or days forAltheide et al., 1998 Out of Africa and back again: nestedthe results of analysis, the extra computational burdencladistic analysis of human Y chromosome variation. Mol. Biol.

may make Fearnhead and Donnelly’s method difficult Evol. 15 (4): 427–441.Harding, R. M., S. M. Fullerton, R. C. Griffiths, J. Bond, M. J. Coxto extend to more general settings involving missing

et al., 1997 Archaic African and Asian lineages in the geneticgenotype data, genotyping error, and/or unknownancestry of modern humans. Am. J. Hum. Genet. 60: 772–789.

haplotypic phase, for example. The approach of split- Hey, J., and J. Wakeley, 1997 A coalescent estimator of the popula-tion recombination rate. Genetics 145: 833–846.ting the sequence data into contiguous segments also

Hudson, R. R., 1983 Properties of a neutral allele model with intra-has the disadvantage, noted earlier, of estimating recom-genic recombination. Theor. Popul. Biol. 23: 183–201.

bination rates in a region only on the basis of sites Hudson, R. R., 1987 Estimating the recombination parameter of afinite population model without selection. Genet. Res. 50: 245–within the region and not sites either side of the region,250.resulting in potential loss of information. Our limited

Hudson, R. R., 2001 Two-locus sampling distribution and their ap-comparisons with Hudson’s method suggest that it per- plication. Genetics 159: 1805–1817.

Hudson, R. R., 2002 Generating samples under a Wright-Fisherforms similarly to our method for estimating the recom-neutral model of genetic variation. Bioinformatics 18: 337–338.bination rate when it is assumed constant across the

Hudson, R. R., and N. L. Kaplan, 1985 Statistical properties of theregion. In principle, Hudson’s method could also be number of recombination events in the history of a sample of

DNA sequences. Genetics 111: 147–164.applied to fit models of varying recombination rateJeffreys, A. J., and R. Neumann, 2002 Reciprocal crossover asymme-along the sequence, and the existence of more than

try and meiotic drive in a human recombination hot spot. Nat.one method to fit such models would be welcome. Both Genet. 31: 267–271.

Jeffreys, A. J., A. Ritchie and R. Neumann, 2000 High resolutionapproaches seem to offer considerable advantages overanalysis of haplotype diversity and meiotic crossover in the humanother available methods for modeling LD and inferringTAP2 recombination hotspot. Hum. Mol. Genet. 9: 725–733.

patterns of recombination rate heterogeneity. Jeffreys, A. J., L. Kauppi and R. Neumann, 2001 Intensely punctatemeiotic recombination in the class II region of the major histo-compatibility complex. Nat. Genet. 29: 217–222.

Johnson, G. C. L., L. Esposito, B. J. B. A. N. Smith, J. Heward, G. D.SOFTWARE Genova et al., 2001 Haplotype tagging for the identification ofcommon disease genes. Nat. Genet. 29: 233–235.The methods discussed in this article have been imple- Kingman, J. F. C., 1982 The coalescent. Stoch. Proc. Appl. 13: 235–

mented in a C�� package, Hotspotter, which is freely 248.Kruglyak, L., 1999 Prospects for whole-genome linkage disequilib-available for download at http://www.stat.washington.edu/

rium mapping of common disease genes. Nat. Genet. 22: 139–144.stephens/software.html. Kuhner, M. K., J. Yamato and J. Felsenstein, 2000 Maximum likeli-hood estimation of recombination rates from population data.We thank Peter Donnelly, Paul Fearnhead, Gil McVean, JonathanGenetics 156: 1393–1401.Pritchard, and Molly Przeworski for helpful conversations and Paul

Liu, J. S., C. Sabatti, J. Teng, B. J. Keats and N. Risch, 2001 Bayes-Fearnhead, Richard Hudson, and Mary Kuhner for helpful adviceian analysis of haplotypes for linkage disequilibrium mapping.

and/or results from software. Two anonymous referees gave helpful Genome Res. 11: 1716–1724.comments on the submitted manuscript. This work was supported by McPeek, M. S., and A. Strahs, 1999 Assessment of linkage disequi-a grant from the University of Washington and National Institutes of librium by the decay of haplotype sharing, with application toHealth grant no. 1RO1HG/LM02585-01. fine-scale genetic mapping. Am. J. Hum. Genet. 65: 858–875.

McVean, G., P. Awadalla and P. Fearnhead, 2002 A coalescent-based method for detecting and estimating recombination fromgene sequences. Genetics 160: 1231–1241.

Morris, A. P., J. C. Whittaker and D. J. Balding, 2000 BayesianLITERATURE CITEDfine-scale mapping of disease loci, by hidden Markov models.

Besag, J., 1974 Spatial interaction and the statistical analysis of lattice Am. J. Hum. Genet. 67: 155–169.Nickerson, D. A., S. L. Taylor, K. M. Weiss, A. G. Clark, R. G.systems. J. R. Stat. Soc. Ser. B 36: 192–236.

Page 19: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2231Linkage Disequilibrium and Recombination

Hutchinson et al., 1998 DNA sequence diversity in a 9.7-kb Pr(Xj�1 � x�|Xj � x)region of the human lipoprotein lipase gene. Nat. Genet. 19:233–240.

Nielsen, R., 2000 Estimation of population parameters and recom- � �exp(��jdj/k) � (1 � exp(�� jdj/k))(1/k) if x� � x;

(1 � exp(�� jdj/k))(1/k) otherwise ,bination rates from single nucleotide polymorphisms. Genetics154: 931–942. (A1)

Olivier, M., V. I. Bustos, M. R. Levy, G. A. Smick, I. Moreno etal., 2001 Complex high-resolution linkage disequilibrium and where dj is the physical distance between markers j andhaplotype patterns of single-nucleotide polymorphisms in 2.5 Mb

j � 1 (assumed known); and �j � 4Ncj , where N is theof sequence on human chromosome 21. Genomics 78: 64–72.Press, W. H., S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, effective (diploid) population size, and cj is the average

1992 Numerical Recipes in C: The Art of Scientific Computing. Cam- rate of crossover per unit physical distance, per meiosis,bridge University Press, New York.between sites j and j � 1 (so that cjd j is the geneticPritchard, J. K., and M. Przeworski, 2001 Linkage disequilibrium

in humans: models and data. Am. J. Hum. Genet. 69: 1–14. distance between sites j and j � 1). This transition matrixPritchard, J. K., M. Stephens and P. Donnelly, 2000 Inference captures the idea that, if sites j and j � 1 are a smallof population structure using multilocus genotype data. Genetics

genetic distance apart (i.e., cjd j is small), then they are155: 945–959.Rabiner, L. R., 1989 A tutorial on hidden Markov models and se- highly likely to copy the same chromosome (i.e., Xj�1 �

lected applications in speech recognition. Proc. IEEE 77: 257– Xj). We note the following special cases used in this286.article:Stephens, M., and P. Donnelly, 2000 Inference in molecular popu-

lation genetics. J. R. Stat. Soc. Ser. B 62: 605–655.Stephens, M., N. J. Smith and P. Donnelly, 2001 A new statistical 1. Constant recombination rate (see estimating con-

method for haplotype reconstruction from population data. Am. stant recombination rate): cj � c for all j.J. Hum. Genet. 68: 978–989.2. Simple single-hotspot model (see Power to detect recom-Templeton, A. R., A. G. Clark, K. M. Weiss, D. A. Nickerson, E.

Boerwinkle et al., 2000 Recombination and mutation at the bination hotspots and robustness): c j � c if markers jLPL locus. Am. J. Hum. Genet. 66: 69–83. and j � 1 are both outside the hotspot, and c j � �cUeberhuber, C. W., 1997 Numeric Computation: Methods, Software,

if markers j and j � 1 are both inside the hotspot.and Analysis, Vol. I. Springer-Verlag, Berlin-Heidelberg, Germany.Wakeley, J., 1997 Using the variance of pairwise differences to esti- (For brevity we omit details of the more tedious,

mate the recombination rate. Genet. Res. 69: 45–48. though straightforward, case in which one marker isWall, J. D., 2000 A comparison of estimators of the populationin the hotspot and the other outside the hotspot.)recombination rate. Mol. Biol. Evol. 17: 156–163.

Wang, N., J. M. Akey, K. Zhang, R. Chakraborty and L. Jin, 2002 3. General variable recombination rate model (see Esti-Distribution of recombination crossovers and the origin of haplo- mating recombination rates along a region): cj � �j c.type blocks: the interplay of population history, recombination,and mutation. Am. J. Hum. Genet. 71: 1227–1234.

To mimic the effects of mutation, the copying processCommunicating editor: S. Tavare may be imperfect: with probability k/(k � ) the copy

is exact, while with probability /(k � ), a “mutation”will be applied to the copied haplotype. Specifically, if

APPENDIX A: hi,j denotes the allele (0 or 1) at site j in haplotype i,THE CONDITIONAL DISTRIBUTION �Athen, given the copying process X1, . . . , XS, the alleles

Here we give a formal description of �A. We also pro- hk�1,1, hk�1,2 , . . . , hk�1,S are independent, withvide some additional motivation for the form of this

Pr(hk�1,j � a|Xj � x,h1, . . . , hk)approximation and describe briefly some of the varia-tions on this form with which we have also experi-

� �k/(k � ) � (1/2) � /(k � ), h x, j � a

(1/2) � /(k � ), hx , j � a . (A2)mented.

Formal description of �A: Let h1, . . . , hn denote then sampled haplotypes typed at S biallelic loci (SNPs).

[The factor of (1/2) appears in both cases, so that asTypically h1, . . . , hn would come from a sample of n → ∞ both alleles become equally likely.]haploid individuals or n/2 diploid individuals. We as-

We fix the value of to besume that the distribution of the first haplotype is inde-pendent of � [e.g., all 2S possible haplotypes are equally

� � �n�1

m�1

1m�

�1

, (A3)likely, so �A(h1) � 1/2S]. Consider now the conditionaldistribution of hk�1, given h1, . . . , hk, for k � 1. Recall(Figure 2) that hk�1 is an imperfect mosaic of h1, . . . , where n is the total number of sampled haplotypes. (See

Motivation and Variation below for more discussion.)hk. That is, for k � 1, at each SNP, hk�1 is a (possiblyimperfect) copy of one of h1, . . . , hk at that position. Computation: Computing �A(hk�1|h1, . . . , hk) requires

a sum over all possible values of the Xj , which can beLet Xj denote which haplotype hk�1 copies at site j (soXj � {1,2, . . . , k}). For example, for haplotype h4A in done efficiently using the forward part of the forward-

backward algorithm for hidden Markov models (e.g.,Figure 2, (X 1,X 2,X 3,X 4,X 5) � (3,3,2,2,2). To mimic theeffects of recombination, we model the Xj as a Markov Rabiner 1989). Specifically, let hk�1,j denote the

types of the first j sites of haplotype hk�1, and let �j(x) �chain on {1, . . . , k}, with Pr(X1 � x) � 1/k (x � {1,. . . , k}), and Pr(hk�1, j ,Xj � x). Then �1(x) can be computed directly

Page 20: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2232 N. Li and M. Stephens

for x � 1, . . . k, and �2(x), . . . . �S(x) can be computed The reason for our choice of is that �n�1m�11/m is the

recursively using expected number of mutation events at a single site onthe genealogical tree relating a random sample of n

�j�1(x) � �j�1(x) �k

x��1

�j(x�)Pr(Xj�1 � x |Xj � x�) (A4) chromosomes, so (A3) gives a priori an expected numberof mutation events at each site of 1 (although it doesnot force the number of mutations to be exactly 1,

� �j�1(x)�pj �j(x) � (1 � pj)1k �

k

x��1

�j(x�)�, (A5) and so our method should be somewhat robust to thepresence of multiple mutations at some sites).

where �j�1(x) � Pr(hk�1,j�1|Xj�1 � x, h1, . . . , hk) is given We performed simulation experiments along the linesin (A2), and pj � exp(��j d j/k). of those used to produce Figure 4 to see whether varia-

The value of �A(hk�1|h1, . . . , hk) can then be computed tions on the conditional distribution �A described aboveusing might eliminate the bias we observed for �A. In particu-

lar, we tried using values for that were up to four times�A(hk�1|h1, . . . , hk) � �

k

x�1

�S(x). (A6) bigger or smaller than that in (A3); estimating fromthe data; replacing the transition probability in (A1)

The second term in the parentheses of (A5) does not with a transition probability of �di/(k � �d i), as in FD;depend on x and needs to be computed only once and making use of the more complex mutation mecha-for each j (as noted in FD). Thus the computational nism (involving Gaussian quadrature) used in Stephenscomplexity of �A(hk�1|h1, . . . , hk) increases linearly in and Donnelly (2000) and in FD. Although these differ-the number of SNPs and linearly in k. As a result, the ent variations gave different quantitative results, theycomputation of LPAC-A is linear in the number of SNPs all produced similar qualitative patterns, and in particu-and quadratic in the number of chromosomes in the lar the bias we observed for �A remained in every varia-sample. tion that we tried. We therefore resorted to the empiri-

Motivation and variations: Although it seems intu- cal correction described in appendix b below.itively sensible that the transition matrix in Figure 8should have the property that the rate at which jumpsoccur in the copying process should increase with � anddecrease with the number of previous sampled chromo- APPENDIX B:somes k, it is perhaps not so obvious why we chose �B, A BIAS-CORRECTED VERSION OF �A

the rate �/k. Indeed, the empirical results in Figure 4To correct the bias observed in the results for �PAC-A,suggest that this rate is not quite ideal, and appendix

we modified the transition matrix in (A1) by replacingb describes a corrected rate, based on these empirical�j by �j� j, where �j � exp(a � b log10�j). The interceptresults. However, we can get some idea of why �/n is aa and slope b are interpolated on the basis of the numbersensible starting point for the rate parameter from theof haplotypes n and segregating sites S in the data (Fig-following informal argument. Assume that h1, . . . , hk�1ure 1), using tensor product interpolation with naturalare a random sample of haplotypes from a neutrallycubic splines first in the direction of varying n and thenevolving, randomly mating, constant-sized population,in the direction of varying S (Ueberhuber 1997).and consider the unknown genealogical tree relating

h1, . . . , hk�1 at a single site. It follows from the Ewenssampling formula that in this tree, the probability thathk�1 is separated by at least one mutation from each of

APPENDIX C: SIMULATING DATA WITH Ah1, . . . , hk (unconditional on the actual values of h1,RECOMBINATION HOTSPOT. . . , hk) is /(k � ), where � 4N�, and � is the

probability of mutation per meiosis at that site. Similarly, We use the following algorithm to postprocess theif we consider marking on the tree recombination events output from mksample (Hudson 2002) to simulate datathat occur between this site and the next site, the proba- under the simple single-hotspot model for recombina-bility that there will be at least one such event separating tion variation. Suppose we would like to simulate a sam-hk�1 from each of h1, . . . , hk is �/(k � �), where � � ple with approximately S segregating sites. The back-4Nc and c is the probability of recombination between ground recombination rate is �. A hotspot of width w �two adjacent sites per meiosis. Since � is small, �/(k � (b � a) lies between positions a and b, with recombina-�) � �/k, giving the rate that we used. (We emphasize tion rate �� where � � 1. We follow these steps:that this is not intended to be a formal argument and

1. Simulate samples with S� � (1 � w(� � 1))S segregat-in particular that it is unclear how our mosaic processing sites and constant recombination rate �� � (1 �relates formally to the genealogical tree relating thew(� � 1))�.haplotypes. It is merely intended to provide additional

2. Multiply the position of each site by a factor of 1 �motivation for the use of this rate and perhaps to stimu-late research into a more formal connection.) w(� � 1) so that the total length of the haplotypes

Page 21: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage

2233Linkage Disequilibrium and Recombination

is 1 � w(� � 1) instead of 1 (and the background hotspot [subtract w(� � 1)] so that the total lengthrecombination rate is �). is again 1.

3. For sites within a and a � w�, randomly delete themShrinking the distance between sites in the hotspotwith probability 1 � 1/�.

produces the effect of elevated recombination rate. De-4. For the remaining sites within a and a � w�, shrinkleting some sites keeps the mutation rate constant overthe distance of adjacent sites by a factor of �.

5. Shift the positions of the sites to the right of the the region.

Page 22: Modeling Linkage Disequilibrium and Identifying ...evolution.genetics.washington.edu/gs590/2006/LiStephens2003.pdf · Copyright 2003 by the Genetics Society of America Modeling Linkage