
Estimation of Relatedness by DNA Fingerprinting 1

Michael Lynch Department of Ecology, Ethology and Evolution, University of Illinois

The recent discovery of hypervariable VNTR (variable number of tandem repeat) loci has led to much excitement among population biologists regarding the feasibility of deriving individual estimates of relatedness in field populations by DNA fingerprinting. It is shown that unbiased estimates of relatedness cannot be obtained at the individual level without knowledge of the allelic distributions in both the individuals of interest and the base population unless the proportion of shared marker alleles between unrelated individuals is essentially zero. Since the latter is usually on the order of 0.1-0.5 and since there are enormous practical difficulties in obtaining the former, only an approximate estimator for the relatedness can be given. The bias of this estimator is individual specific and inversely related to the number of marker loci and frequencies of marker alleles. Substantial sampling variance in estimates of relatedness arises from variation in identity by descent within and between loci and, with finite numbers of alleles, from variation in identity in state between genes that are not identical by descent. In the extreme case of 25 assayed loci, each with an effectively infinite number of alleles, the standard error of a relatedness estimate is no less than 14%, 20%, 35%, and 53% of the expectation for full sibs and second-, third-, and fourth-order relationships, respectively. Attempts to ascertain relatedness by means of DNA fingerprinting should proceed with caution.

Introduction

Tests of a number of ideas in evolutionary biology, particularly in the areas of social organization and kin selection, require that the genetic relationships of wild individuals be resolved accurately (Hamilton 1964; Grafen 1985). It is also useful to know the relatedness of potential mates in controlled breeding programs, for the advancement of economically important characters or for the preservation of endangered species, in order to avoid the deleterious consequences of inbreeding. Even among pair-bonding species, behavioral observations cannot be trusted as definitive indicators of blood relationships because of the possibility of extrapair copulations; in species lacking pair-bonds, this problem is exacerbated. It is necessary to rely on genetic markers. In the past, almost all attempts to ascertain relationships have utilized polymorphic enzyme loci (Metcalf and Whitt 1977; McCracken and Bradbury 1981; Pamilo and Crozier 1982; Crozier et al. 1984; Pamilo 1984; Wilkinson and McCracken 1986). Since the analysis of each such locus requires a separate biochemical protocol and since most loci are monomorphic (hence providing no information), this can be a rather time-consuming procedure. Moreover, because individual relatedness estimates (including paternity analyses) based on isozyme surveys are statistically unreliable (Chakraborty et al. 1988), field studies have not advanced beyond the estimation of average relatedness among group members.

1. Key words: none. Address for correspondence and reprints: Michael Lynch, Department of Ecology, Ethology, and Evolution, University of Illinois, Shelford Vivarium, 606 East Healey Street, Champaign, Illinois 61820.

Mol. Biol. Evol. 5(5):584-599. 1988. © 1988 by The University of Chicago. All rights reserved. 0737-4038/88/0505-0010$02.00

There is now much hope that DNA fingerprinting techniques (Gill et al. 1985; Jeffreys et al. 1985a, 1985b, 1987; Burke and Bruford 1987; Jeffreys and Morton 1987; Nakamura et al. 1987; Wetton et al. 1987; Vassart et al. 1987) may revolutionize the study of social interactions by allowing investigators to estimate relatedness with an unprecedented level of accuracy. DNA fingerprinting has two big advantages over isozyme analysis. First, the markers for many loci are simultaneously visible on a single gel. Second, the average number of alleles per locus is very much greater than in the case of enzymatic loci. Thus, the yield of information per unit effort is expected to be quite high.

DNA fingerprinting is now being advocated as a powerful method for paternity exclusion and forensic analysis in humans. The technique can be applied to other mammals, birds, and, very likely, to most other eukaryotes (Vassart et al. 1987). Thus, many laboratories are now gearing up for the application of DNA fingerprinting to wild populations and to captive populations of endangered species. The general attitude seems to be that the success of DNA fingerprinting in the analysis of parent-offspring relationships will extend to more distant relationships. Because DNA technology is still quite expensive and because the fingerprinting technique has some technical and statistical problems, it seems useful at this point to outline the limitations of the technique.

Background

DNA fingerprinting relies on the existence of families of minisatellites that are dispersed throughout the genome in a large number of hypervariable loci. Individual minisatellites consist of tandem arrays of short (10-60-bp) repeat units. While DNA sequence variation may exist among repeat units, the most important variation from the standpoint of the fingerprinting technique concerns the variation in the number of tandem repeats per array. Substantial amounts of such allelic variation appear to be generated by high rates of unequal crossing-over, possibly due, in part, to the existence of core sequences within each repeat unit that act as recombination hot spots (Jeffreys et al. 1985a).

Standard DNA techniques can be used to discriminate the members of a minisatellite family on the basis of their size. After a genomic digest with a restriction endonuclease that does not cleave the repeat unit, the fragments are separated by size by agarose gel electrophoresis, transferred to an appropriate membrane via a Southern blot, and hybridized to a radio-labeled probe for the repeat unit. The length of each fragment is a function of the number of tandem repeats contained within it. Depending on the stringency of the hybridization conditions, many loci are probed simultaneously under this procedure, and a dozen or more bands typically appear in each lane of the gel (fig. 1). The array of bands for an individual is its DNA fingerprint. In humans, the length variation at each VNTR (variable-number-of-tandem-repeat) locus is so great that the fingerprints of virtually all individuals (other than monozygotic twins) are unique (Jeffreys et al. 1985b).

A number of problems arise in interpreting the genetic basis of an individual’s DNA fingerprint. First, it is generally unknown which markers belong to which loci. Allelism can be excluded whenever two unique markers of a single parent are found to be inherited by the same offspring, but large breeding designs would have to be implemented to identify alleles with certainty. Since dozens of alleles may exist at some loci, we can regard such work as unfeasible for most studies. This raises a second problem. If markers cannot be associated with specific loci, then it cannot be ascertained whether individuals are homozygous or heterozygous. Consequently, if an appreciable probability of homozygosity exists, then the only feasible measure of similarity between individuals, i.e., the fraction of shared marker bands, will not necessarily be equivalent to the fraction of shared genes.

FIG. 1.-DNA fingerprints for two individuals, A and B. There are 26 distinct bands, a-z. Shared bands are denoted with dashed lines.

If the DNA fingerprinting technique is to be of any use in estimating relatedness, the fraction of shared marker bands must be a reliable surrogate. In figure 1, A exhibits 60% of the markers of B, while B exhibits 66.7% of the markers of A. Thus, as a first approximation, one might be inclined to suggest that A and B are first-degree relatives. However, if there is a significant chance that unrelated individuals share the same markers, as will always be the case unless the allelic variation is enormous, the proportional similarity of two DNA fingerprints will exceed the relatedness. The bias is directly related to the frequencies of markers in the population, but, as noted above, these are generally unknown.
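The band-sharing percentages quoted for figure 1 can be reproduced with a short sketch. The assignment of individual bands to A and B below is hypothetical (the gel image is not reproduced here); it is chosen only so that the counts agree with the text: 26 distinct bands, A with two fewer bands than B, and sharing of 60% and 66.7%. The function name `band_sharing` is likewise an illustrative choice.

```python
def band_sharing(focal, other):
    """Fraction of the focal individual's bands that also appear in the other individual."""
    focal, other = set(focal), set(other)
    return len(focal & other) / len(focal)

# Hypothetical band sets: 12 shared bands, 6 unique to A, 8 unique to B.
letters = "abcdefghijklmnopqrstuvwxyz"
shared = set(letters[:12])
bands_A = shared | set(letters[12:18])   # 18 bands in A
bands_B = shared | set(letters[18:26])   # 20 bands in B

frac_of_B_in_A = band_sharing(bands_B, bands_A)   # "A exhibits 60% of the markers of B"
frac_of_A_in_B = band_sharing(bands_A, bands_B)   # "B exhibits 66.7% of the markers of A"
print(f"{frac_of_B_in_A:.3f} {frac_of_A_in_B:.3f}")   # prints "0.600 0.667"
```

Because the two individuals carry different numbers of bands, the measure is not symmetric, which is why the two percentages differ.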

The following theory considers the connection between DNA fingerprint similarity and relatedness. The expected similarity between various kinds of relatives will be evaluated first, and an estimator for relatedness will be given. Then, the sampling variance of similarity resulting from variation in identity by descent and from background variation will be discussed. The formulas presented are meant primarily to serve as a heuristic guide to evaluating the practical utility of DNA fingerprinting for ascertaining relationships.

Unfortunately, several technical limitations of the fingerprinting technique preclude the development and application of a rigorous statistical model. First, when allelic diversity is high and a large fraction of the lane of a gel is occupied by radio-labeled markers, as is normally the case with DNA fingerprints, there is a very real possibility of comigration of unrelated fragments. Comigration will inflate the variance of similarity, but the practical difficulties noted above usually will prevent a quantitative evaluation of the problem. Second, substantial numbers of markers with very low molecular weights are usually run off the end of the gel, leaving one with an incomplete description of the minisatellite family. If the variation among missing fragments differs from that among observed fragments, as appears to be the case (Jeffreys et al. 1985a), the connection between similarity and relatedness will be further obscured. Third, some marker loci may be linked, in which case individual markers should not be treated as independent estimators of relatedness. However, the ascertainment of linkage relationships is clearly beyond the scope of most of the field studies to which the fingerprinting technique is likely to be applied.

Returning to figure 1, we see that individual A has two fewer bands than does B. We now see that there are at least three potential explanations for this: (1) A might be homozygous for two more loci than B; (2) two pairs of unrelated markers in A may have comigrated; or (3) two of A’s markers may have migrated off the gel. In the following analyses, it is assumed that comigration, loss of markers, and linkage are unimportant, so that variation in marker number per individual can only be a result of variation in homozygosity. It is also assumed that the marker alleles are neutral and unlinked with any selected loci and that the probability of mutation is negligible. The resultant conclusions regarding the limits of the power of DNA fingerprinting for ascertaining relatedness are therefore conservative.

Expected Proportion of Shared Markers

Many different technical definitions of relatedness exist (Michod and Hamilton 1980; Grafen 1985). Here we will define the relatedness of individual B to individual A, $r_{BA}$, as the expected fraction of B’s autosomal genes that are identical by descent (IBD) with those in A (Crozier 1970). Identity by descent refers to situations in which two genes are direct copies of an ancestral gene. In contrast, identity in state (IIS) implies nothing about coancestry. Genes that are IIS need not be IBD, but genes that are IBD must also be IIS. In the special case of no inbreeding, which will be our primary focus, $r_{BA}$ is equivalent to all other definitions of relatedness, including Wright’s (1922) correlation coefficient of relationship and Hamilton’s (1971) regression coefficient of relatedness.

Focusing on a single locus, table 1 illustrates the ways in which IBD can arise between the genes in A and B. Associated with each of the nine configurations is a probability of occurrence (Jacquard’s [1974] condensed coefficients of identity). The coefficients $\phi_1$, $\phi_2$, and $\phi_3$ are probabilities of situations in which neither A nor B is inbred, $\phi_4$ and $\phi_5$ of situations in which only A is inbred, $\phi_6$ and $\phi_7$ of situations in which only B is inbred, and $\phi_8$ and $\phi_9$ of situations in which both A and B are inbred. For any pair of individuals, the nine coefficients, which sum to one, are defined by the type of relationship. For example, if A and B are parent and offspring and if neither is inbred, all coefficients are equal to zero, except that $\phi_2 = 1$; an offspring always inherits one, and only one, gene from its parent at each locus. In terms of the coefficients,

$$r_{BA} = \phi_3 + \phi_7 + \phi_9 + \frac{\phi_2 + \phi_5}{2}. \tag{1}$$

Ideally, estimates of relatedness based on DNA fingerprints should converge asymptotically on $r_{BA}$ as more markers are compared.

The operational measure of similarity between two DNA fingerprints is the proportion of shared bands. If we let $k_B$ be the number of bands exhibited by B, then the similarity of B to A is

$$S_{BA} = \frac{1}{k_B} \sum_{j=1}^{k_B} P_{mj}, \tag{2}$$

where $P_{mj} = 1$ if A shares the $j$th marker of B and $P_{mj} = 0$ otherwise. The subscript $m$ simply denotes the (generally unknown) marker locus. To evaluate the connection between the statistic $S_{BA}$ and the parameter $r_{BA}$, the behavior of the marker-specific similarities needs to be defined. We start with $E(P_{mj})$, the conditional probability that A will have the band given that B has it. This probability depends on the distribution of identity by descent between the two individuals. If we let $\theta_{imj}$ be the probability that A exhibits the $j$th band of B conditional on identity relationship $i$, then

$$E(P_{mj}) = \sum_{i=1}^{9} \phi_i \theta_{imj}. \tag{3}$$

Expressions for three of the $\theta_{imj}$ values will be derived to illustrate the basic concepts. Under identity relationship 1, A and B are noninbred and not IBD. Thus, the probability that A exhibits a particular band of B is simply the probability that A is a homozygote or heterozygote for the allele: $\theta_{1mj} = p_{mj}^2 + 2p_{mj}(1 - p_{mj}) = p_{mj}(2 - p_{mj})$, where $p_{mj}$ is the population frequency of genes at locus $m$ that are of type $j$. Under identity relationship 2, A and B have a single gene IBD and neither is inbred. We know that individual B has at least one allele that yields the observed band. The probability that the second allele in B is the same as the first (and that B is therefore homozygous) is $p_{mj}$, in which case A must exhibit the marker. The probability that B is a heterozygote is $(1 - p_{mj})$, in which case there is a 0.5 probability that the marker gene in B is also present in A. If it is the alternative allele in B that is transmitted to A, there is still a $p_{mj}$ probability that A’s other gene at the locus is identical in state with the marker. To sum up, $\theta_{2mj} = p_{mj} + (1 - p_{mj})(0.5 + 0.5p_{mj}) = p_{mj} + 0.5(1 - p_{mj}^2)$. Finally, for identity relationship 3, both alleles exhibit identity by descent between individuals, so $\theta_{3mj} = 1$. These and the remaining coefficients are summarized in table 1.

Table 1
The Nine Possible Identity-by-Descent Relationships

Identity-by-Descent Relationship                                      Probability   Probability that A has Marker Allele
1. Neither inbred; no genes IBD between A and B                        $\phi_1$      $\theta_{1mj} = p_{mj}(2 - p_{mj})$
2. Neither inbred; one gene IBD between A and B                        $\phi_2$      $\theta_{2mj} = p_{mj} + 0.5(1 - p_{mj}^2)$
3. Neither inbred; both genes IBD between A and B                      $\phi_3$      $\theta_{3mj} = 1$
4. Only A inbred ($a_1 = a_2$); no genes IBD between A and B           $\phi_4$      $\theta_{4mj} = p_{mj}$
5. Only A inbred ($a_1 = a_2$); A's genes IBD with one gene of B       $\phi_5$      $\theta_{5mj} = 0.5(1 + p_{mj})$
6. Only B inbred ($b_1 = b_2$); no genes IBD between A and B           $\phi_6$      $\theta_{6mj} = p_{mj}(2 - p_{mj})$
7. Only B inbred ($b_1 = b_2$); B's genes IBD with one gene of A       $\phi_7$      $\theta_{7mj} = 1$
8. Both inbred ($a_1 = a_2$, $b_1 = b_2$); no genes IBD between A and B   $\phi_8$   $\theta_{8mj} = p_{mj}$
9. Both inbred; all four genes IBD ($a_1 = a_2 = b_1 = b_2$)           $\phi_9$      $\theta_{9mj} = 1$

NOTE.-The relationships are between two alleles at the $m$th locus in individuals A ($a_1$ and $a_2$) and B ($b_1$ and $b_2$); the associated probabilities, $\phi_i$, depend on the nature of the relationship. $\theta_{imj}$ is the probability that A shares a particular band ($j$, with population allelic frequency $p_{mj}$) with B, given the identity relationship $i$ at the locus. The presence of an equals sign (=) implies identity by descent.

The expected proportion of bands shared between unrelated individuals is obtained by weighting the conditional $\theta_{1mj}$ values by their expected frequencies $p_{mj}$ and summing over loci and alleles:

$$\theta_1 = \frac{1}{L} \sum_{m=1}^{L} \sum_{j=1}^{n_m} p_{mj}^2 (2 - p_{mj}), \tag{4}$$

where $L$ is the number of marker loci and $n_m$ is the total number of alleles at the $m$th locus. When all alleles have equal frequencies, $n_m = 1/p$, and equation (4) reduces to $\theta_1 = p(2 - p)$. Clearly, $\theta_1$ approaches zero as the number of alleles per locus increases. However, substantial similarity between nonrelatives is still expected even at fairly low allelic frequencies. With $p = 0.01$, 0.1, and 0.2, for example, $\theta_1 = 0.02$, 0.19, and 0.36. Diallelic loci with $p = 0.5$ yield $\theta_1 = 0.75$. Although the allelic frequency distributions for DNA fingerprinting loci are not actually uniform, equation (4) shows that the expected similarity, per locus, of unrelated individuals can be no less than $q_m^2(2 - q_m)$, where $q_m$ is the frequency of the most abundant allele. Consequently, even if most alleles are very rare, a few common ones will greatly magnify the similarity of nonrelatives.
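Equation (4) is easy to evaluate numerically. The following is a minimal sketch, assuming allele frequencies are supplied as one list per locus; the function names and the skewed example locus are illustrative assumptions, not data from any study.

```python
def theta1(loci):
    """Equation (4): expected band sharing between unrelated individuals.

    `loci` is a list of per-locus allele-frequency lists, each summing to 1.
    """
    total = 0.0
    for freqs in loci:
        total += sum(p * p * (2.0 - p) for p in freqs)
    return total / len(loci)

def uniform_locus(n_alleles):
    """A locus whose n_alleles alleles all have frequency 1/n_alleles."""
    return [1.0 / n_alleles] * n_alleles

# Uniform case: equation (4) should agree with p(2 - p).
for n in (100, 10, 5, 2):
    p = 1.0 / n
    print(n, round(theta1([uniform_locus(n)]), 4), round(p * (2.0 - p), 4))

# Hypothetical skewed locus: one common allele (q = 0.5) plus 50 rare ones.
skewed = [0.5] + [0.01] * 50
q = max(skewed)
print(round(theta1([skewed]), 4), "is at least", round(q * q * (2.0 - q), 4))
```

The skewed example shows the point made in the text: a single common allele sets a floor on the similarity of nonrelatives regardless of how many rare alleles accompany it.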

We are now in a position to examine the relationship between expected similarity and relatedness. The following illustrations focus on the special case of noninbred individuals, but all of the tools necessary for more complex analyses (including different definitions of relatedness) are in table 1. For our special case, the relatedness of B to A is

$$r_{BA} = 0.5\phi_2 + \phi_3. \tag{5}$$

From equation (3), the expected similarity, conditional on B having band $mj$, is

$$E(P_{mj}) = \phi_1 \theta_{1mj} + \phi_2 \theta_{2mj} + \phi_3. \tag{6}$$

Noting that $1 = \phi_1 + \phi_2 + \phi_3$ and that $\theta_{2mj} = 0.5 + 0.5\theta_{1mj}$, we can also write this as

$$E(P_{mj}) = r_{BA} + (1 - r_{BA})\theta_{1mj}. \tag{7}$$

Averaging over all bands exhibited by B, we find that the expected similarity between A and B conditional on B’s set of bands is

$$E(S_{BA}) = r_{BA} + (1 - r_{BA})\theta_{1B}, \tag{8}$$

where

$$\theta_{1B} = \frac{1}{k_B} \sum_{j=1}^{k_B} \theta_{1mj} \tag{9}$$

is the expected similarity of B to nonrelatives. Only if all alleles have equal frequencies is $\theta_{1B}$ exactly equal to $\theta_1$, the average similarity between all pairs of nonrelatives in the population.

Equation (8) states that the expected similarity is equal to the sum of (a) the probability that genes in B are IBD with those in A and (b) the probability that they are not IBD but still identical in state. The second term in equation (8) defines the upward bias that results if one relies on similarity as an estimator of relatedness. Figure 2 illustrates the inflation $E(S_{BA})/r_{BA}$ for the special case in which all marker alleles have equal frequency. This ratio is equal to $1/r_{BA}$ when loci are monomorphic and declines to one with increasing allelic diversity. Thus, the more distant the relationship, the more one is likely to overestimate the degree of relatedness when relying on similarity as an estimator, and with distant relationships a substantial amount of bias remains even when the number of alleles is quite high. When the true relatedness is 0.25, 0.125, and 0.0625, fewer than 5, 13, and 29 detectable alleles per locus will cause the expected similarity to be more than double the relatedness. The implications of these results are clear. One should avoid using similarity as an estimator of relatedness unless there is reason to believe that all marker alleles are very rare.
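The inflation ratio described here follows directly from equation (8). As a rough sketch, assuming equal allele frequencies so that the expected similarity to nonrelatives is $p(2 - p)$ with $p = 1/n$ (the function names are illustrative):

```python
def expected_similarity(r, theta):
    """Equation (8): E(S_BA) = r_BA + (1 - r_BA) * theta_1B."""
    return r + (1.0 - r) * theta

def inflation(r, n_alleles):
    """E(S_BA) / r_BA when all n_alleles alleles at a locus are equally frequent."""
    p = 1.0 / n_alleles
    return expected_similarity(r, p * (2.0 - p)) / r

# Rows: relatedness; columns: 1, 5, 20, and 100 alleles per locus.
for r in (0.5, 0.25, 0.125, 0.0625):
    print(r, [round(inflation(r, n), 2) for n in (1, 5, 20, 100)])
```

With a single allele the ratio equals $1/r_{BA}$, and for any fixed number of alleles it is larger for more distant relationships, which is the pattern the paragraph above describes.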

Rearranging equation (8) and substituting observed for expected similarity, we find an unbiased estimator for the relatedness,

$$\hat{r}_{BA} = \frac{S_{BA} - \theta_{1B}}{1 - \theta_{1B}}. \tag{10}$$

FIG. 2.-The ratio of the expected proportion of shared bands $E(S_{BA})$ to the relatedness ($r_{BA}$) for the special case in which allele frequencies are equal and inbreeding is absent. The dependence of this ratio on the number of alleles per locus is shown for four levels of relatedness.

The main problem with this approach is that the parameter $\theta_{1B}$ will almost never be known. There are two ways to obtain estimates of $\theta_{1B}$. One way is to substitute estimates of the population frequencies for B’s alleles into equation (9) and solve for the statistic $\hat{\theta}_{1B}$. Alternatively, one could estimate the mean similarity of B to many independent relatives of the same type (fixed $r_{BA}$) and solve

$$\hat{\theta}_{1B} = \frac{\bar{S}_{BA} - r_{BA}}{1 - r_{BA}}. \tag{11}$$

Both of these approaches are impractical, if not impossible, for most field studies. Thus, it is clear that an alternative protocol is required if DNA fingerprinting is to be of any use in estimating relatedness.

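The most easily implemented procedure is to utilize an estimate of the average similarity between all unrelated individuals, $\hat{\theta}_1$, in place of $\theta_{1B}$. It can be shown by Taylor expansion that

$$\hat{r}_{BA} \simeq \frac{S_{BA} - \hat{\theta}_1}{1 - \hat{\theta}_1} + \frac{(1 - S_{BA})\,\mathrm{Var}(\hat{\theta}_1)}{(1 - \hat{\theta}_1)^3}, \tag{12}$$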


where $\mathrm{Var}(\hat{\theta}_1)$ is an estimate of the variance of similarity among pairs of unrelated individuals. From large-sample theory, $\theta_{1B}$ and $\hat{\theta}_1$ are expected to converge as the number of assayed markers increases. However, the rate and direction of this convergence will depend on the distribution of marker alleles in the population and in individual B. Thus, the cost of not knowing $\theta_{1B}$ is that equation (12) gives biased estimates of relatedness, the magnitude of which is unique and unknown for each individual. The burden is therefore upon the investigator to demonstrate that equation (12) is a reasonable surrogate for the less practical equation (10).

In principle, the estimates $\hat{\theta}_1$ and $\mathrm{Var}(\hat{\theta}_1)$ can be obtained by making measures of $S_{BA}$ for $N$ independent pairs of the same types of relatives. $\hat{\theta}_1$ is then obtained by use of equation (11) with $\bar{S}$, the mean similarity between the designated class of relatives, being substituted for $\bar{S}_{BA}$. If we let $\mathrm{Var}(S)$ be the variance among the independent measures of $S$, $\mathrm{Var}(\hat{\theta}_1) = \mathrm{Var}(S)/(1 - r)^2$, and the standard error of $\hat{\theta}_1$ is $[\mathrm{Var}(\hat{\theta}_1)/N]^{1/2}$. Ideally this type of analysis would be performed on pairs of unrelated individuals ($r = 0$), since this minimizes the standard error of $\hat{\theta}_1$. However, since the main purpose of a fingerprinting study is to ascertain relatedness, this would involve a circularity in many field studies. An alternative approach would be to focus on a class of individuals for which the relatedness is known with confidence (e.g., mothers and offspring in mammals), but even here one would have to be secure with an assumption of no inbreeding.
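The calibration procedure just described can be sketched in a few lines. This is only an illustration under stated assumptions: the band-sharing values are made up, the function names are hypothetical, and the sample variance from Python's `statistics` module stands in for Var(S). The last function implements the approximate estimator of equation (12) as written in the text.

```python
from statistics import mean, variance

def estimate_theta1(similarities, r_known):
    """Equation (11) applied to N independent pairs of known relatedness r_known.

    Returns (theta1_hat, Var(theta1_hat), standard error of theta1_hat).
    """
    n = len(similarities)
    theta_hat = (mean(similarities) - r_known) / (1.0 - r_known)
    var_theta = variance(similarities) / (1.0 - r_known) ** 2
    return theta_hat, var_theta, (var_theta / n) ** 0.5

def relatedness_eq10(s_ba, theta_1b):
    """Equation (10): estimator when the individual-specific theta_1B is known."""
    return (s_ba - theta_1b) / (1.0 - theta_1b)

def relatedness_eq12(s_ba, theta_hat, var_theta):
    """Equation (12): approximate estimator using the population-level theta_1 estimate."""
    first = (s_ba - theta_hat) / (1.0 - theta_hat)
    second = (1.0 - s_ba) * var_theta / (1.0 - theta_hat) ** 3
    return first + second

# Made-up band-sharing values for eight mother-offspring pairs (r = 0.5).
calibration = [0.62, 0.58, 0.66, 0.60, 0.64, 0.59, 0.63, 0.61]
theta_hat, var_theta, se = estimate_theta1(calibration, r_known=0.5)
print(theta_hat, se)
print(relatedness_eq12(0.45, theta_hat, var_theta))
```

When the variance term is zero, equation (12) collapses to the form of equation (10) with the population-level estimate substituted for the individual-specific parameter.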

Sampling Variance

An essential consideration in any attempt to ascertain relatedness is the magnitude of the sampling variance of similarity. Such variation arises as a consequence of variation in both identity by descent per locus and IIS between genes that are not IBD. If the sampling variance of $S_{BA}$ is high, then that of $\hat{r}_{BA}$ will be too.

Here it becomes useful to focus on the properties of a locus rather than on those of a single allele. If the two alleles of B at the $m$th locus are denoted as 1 and 2, the realized similarity of B with A at this locus may be written as

$$S_{BA,m} = \frac{P_{m1} + P_{m2}}{2}, \tag{13}$$

where $S_{BA,m}$ is restricted to values of 0, 0.5, and 1. To compute the sampling variance of $S_{BA,m}$, $\sigma^2(S_{BA,m})$, for a particular type of relationship, the mean squared value, $E(S_{BA,m}^2)$, needs to be evaluated. This requires an averaging over all possible identity relationships. If we let $H_{mi}$ be the probability that B is homozygous at the $m$th locus conditional on identity relationship $i$ and let $E(S_{BA,m}^2 \mid H, i)$ and $E(S_{BA,m}^2 \mid O, i)$ be the expected squared similarities given homozygosity ($H$) or heterozygosity ($O$) in B and identity relationship $i$, then

$$\sigma^2(S_{BA,m}) = \sum_{i=1}^{9} \phi_i \left[ H_{mi} E(S_{BA,m}^2 \mid H, i) + (1 - H_{mi}) E(S_{BA,m}^2 \mid O, i) \right] - E^2(S_{BA,m}). \tag{14}$$

Table 2
Expressions for $\sigma^2(S_{BA,m})$ for Cases in Which There Is No Inbreeding and the Distribution of Allele Frequencies Is Uniform

Relation                                                          Sampling Variance of $S_{BA,m}$
Nonrelative                                                       $p(1 - 2.5p + 2.5p^2 - p^3)$
Parent-offspring                                                  $0.25p(2 - 5p + 4p^2 - p^3)$
Full sibs                                                         $0.125(1 - 4p^2 + 5p^3 - 2p^4)$
Half-sibs, grandparent-grandchild, uncle(aunt)-niece(nephew)      $0.0625(1 + 8p - 24p^2 + 24p^3 - 9p^4)$
Third order                                                       $0.03125(1.5 + 22p - 61p^2 + 62p^3 - 24.5p^4)$
Fourth order                                                      $0.015625(1.75 + 53p - 139.5p^2 + 141p^3 - 56.25p^4)$

NOTE.-To obtain the sampling variance of similarity based on $L$ loci, the tabulated expressions should be divided by $L$.

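For the special case of no inbreeding, equation (14) needs only to be summed over $i = 1$ to 3 with

$$E(S_{BA,m}^2 \mid H, 1) = \sum_{j=1}^{n_m} p_{mj} \left[ p_{mj}^2 + 2p_{mj}(1 - p_{mj}) \right], \tag{15a}$$

$$E(S_{BA,m}^2 \mid H, 2) = 1, \tag{15b}$$

$$E(S_{BA,m}^2 \mid H, 3) = 1, \tag{15c}$$

$$E(S_{BA,m}^2 \mid O, 1) = \frac{\sum_{j=1}^{n_m} p_{mj} \sum_{k \neq j} p_{mk}(2p_{mj}p_{mk}) + 0.25 \sum_{j=1}^{n_m} p_{mj} \sum_{k \neq j} p_{mk} \left[ p_{mj}^2 + p_{mk}^2 + 2(p_{mj} + p_{mk})(1 - p_{mj} - p_{mk}) \right]}{1 - \sum_{j=1}^{n_m} p_{mj}^2}, \tag{15d}$$

$$E(S_{BA,m}^2 \mid O, 2) = \frac{\sum_{j=1}^{n_m} p_{mj} \sum_{k \neq j} p_{mk} \left[ 0.25(1 - p_{mk}) + p_{mk} \right]}{1 - \sum_{j=1}^{n_m} p_{mj}^2}, \tag{15e}$$

and

$$E(S_{BA,m}^2 \mid O, 3) = 1. \tag{15f}$$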

Although three of these expressions depend on the gene frequency distribution in rather complicated ways, substantial simplification is possible when $p$ is assumed to be constant within and between loci. In table 2, expressions for $\sigma^2(S_{BA,m})$ are given for several types of relatives under these special conditions. The sampling variance for the similarity of parents and offspring at a marker locus approaches zero asymptotically as $p \to 0$, being approximately $p/2$ for small $p$. For other types of relationships, however, there is a real lower limit to the sampling variance per locus (0.125 for full sibs and 0.062, 0.047, and 0.027 for second-, third-, and fourth-order relationships, respectively), caused by variation in the number of IBD genes per locus.
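The closed forms for the uniform-frequency case can be checked numerically by mixing the conditional moments over identity relationships 1-3, as equation (14) prescribes. The sketch below assumes a uniform allele frequency $p$, so that a noninbred B is homozygous with probability $p$. The identity coefficients in `PHI` are the standard values for noninbred, unilineal relatives and are supplied here as an assumption rather than taken from the text.

```python
def conditional_moments(p):
    """(E[S], E[S^2]) of the per-locus similarity for identity relationships
    1, 2, and 3 (no inbreeding), with every allele at frequency p."""
    H = p  # probability that a noninbred B is homozygous
    # i = 1: no genes IBD; heterozygote term is the uniform-p reduction of (15d)
    rel1 = (p * (2 - p), H * p * (2 - p) + (1 - H) * (p + 0.5 * p * p))
    # i = 2: one gene IBD; heterozygote term is the uniform-p reduction of (15e)
    rel2 = (p + 0.5 * (1 - p * p), H * 1.0 + (1 - H) * (0.25 * (1 - p) + p))
    # i = 3: both genes IBD, so the similarity is always 1
    rel3 = (1.0, 1.0)
    return [rel1, rel2, rel3]

def locus_variance(phi, p):
    """Equation (14) restricted to i = 1..3; phi = (phi1, phi2, phi3)."""
    moments = conditional_moments(p)
    mean = sum(f * m[0] for f, m in zip(phi, moments))
    mean_sq = sum(f * m[1] for f, m in zip(phi, moments))
    return mean_sq - mean ** 2

# Assumed identity coefficients (phi1, phi2, phi3) for each class of relatives.
PHI = {
    "nonrelative": (1.0, 0.0, 0.0),
    "parent-offspring": (0.0, 1.0, 0.0),
    "full sibs": (0.25, 0.5, 0.25),
    "second order": (0.5, 0.5, 0.0),
    "third order": (0.75, 0.25, 0.0),
    "fourth order": (0.875, 0.125, 0.0),
}

for name, phi in PHI.items():
    print(name, [round(locus_variance(phi, p), 4) for p in (0.2, 0.05, 0.001)])
```

As $p$ becomes small the printed variances for nonrelatives and parent-offspring pairs shrink toward zero, while those for the other classes level off at the lower limits quoted in the text.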

To gain some appreciation for the magnitude of the error in estimates of relatedness that is caused by the sampling variance of similarity, the square roots of the expressions in table 2 are divided by $[E(S_{BA,m}) - p(2 - p)]$ to obtain coefficients of variation (standard error/expectation) of $\hat{r}_{BA}$ (fig. 3). For parent-offspring relationships, $\mathrm{CV}(\hat{r}_{BA})$ declines with increasing allelic diversity, but even with 100 alleles/locus it is still approximately 0.15. For other types of relationships, the coefficient of variation is substantially higher, increasing with the distance of the relationship. As the number of alleles increases, $\mathrm{CV}(\hat{r}_{BA})$ approaches lower limits of 0.7 for full sibs and of 1.0, 1.7, and 2.6 for second-, third-, and fourth-order relationships, respectively.

Ordinarily, similarities will be computed by summing over several loci, typically 10-20. If we assume that the loci are unlinked and in linkage equilibrium, the sampling variance of such an integrated measure would then be equal to the mean sampling variance of the individual loci divided by the number of loci. Thus, with 25 loci, the coefficients of variation of $\hat{r}_{BA}$ would be one-fifth of those in figure 3. In this extreme case, the standard error of $\hat{r}_{BA}$ for full sibs and for second-, third-, and fourth-order relationships would be no less than 14%, 20%, 35%, and 53%, respectively, of the expectation.
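As a rough illustration of this averaging argument, the sketch below divides a single-locus coefficient of variation by the square root of the number of loci, assuming unlinked loci with equal sampling variance. The single-locus inputs $\sqrt{1/2}$, $1$, $\sqrt{3}$, and $\sqrt{7}$ are assumed unrounded versions of the limits 0.7, 1.0, 1.7, and 2.6 quoted above, and `multilocus_cv` is a hypothetical helper name.

```python
from math import sqrt


def multilocus_cv(single_locus_cv, n_loci):
    """CV of a similarity averaged over n_loci independent, equally variable loci."""
    if n_loci < 1:
        raise ValueError("n_loci must be at least 1")
    return single_locus_cv / sqrt(n_loci)


# Assumed unrounded single-locus limits (they round to 0.7, 1.0, 1.7, 2.6).
SINGLE_LOCUS_CV = {
    "full sibs": sqrt(0.5),
    "second-order": 1.0,
    "third-order": sqrt(3.0),
    "fourth-order": sqrt(7.0),
}

for name, cv in SINGLE_LOCUS_CV.items():
    print(f"{name:13s} 25 loci: {100 * multilocus_cv(cv, 25):.0f}% of expectation")
```

With 25 loci the divisor is 5, which gives the one-fifth scaling described in the text.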

These analyses are sufficient to illustrate the uncertainties that can arise when the similarity measure is used to estimate the degree of relationship between individuals. The sampling variance of $\hat{r}_{BA}$ computed above is truly a lower limit because it does not account for the (usually unobserved) variation of $\theta_{1B}$ around $\theta_1$ or for the sampling variance of the estimates of $\theta_1$ and Var($\theta_1$) around the population parameters. One can only conclude that beyond (and often including) second-degree relationships, DNA fingerprinting does not provide a powerful means of assessing individual relationships. If, however, one is simply interested in the average relatedness among independent pairs of individuals in a group, then there is hope, since the standard error of the average relatedness is that for independent estimates divided by the square root of the number of estimates. For the extreme case of 25 loci and $p \to 0$, assays of two independent pairs of full sibs would reduce the coefficient of variation of average relatedness to 0.1 (the additional problem of uncertainty about $\theta_{1B}$ here being ignored). For groups composed of second-, third-, and fourth-order relatives, the same level of accuracy would require assays of 4, 12, and 28 independent pairs, respectively.
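The number of independent pairs needed for a target precision follows from the square-root rule just stated. The sketch below is a minimal illustration, assuming squared single-locus coefficients of variation of 1/2, 1, 3, and 7 (whose square roots round to the 0.7, 1.0, 1.7, and 2.6 limits given earlier). Exact fractions are used so that rounding up is not disturbed by floating-point error. The name `pairs_needed` is hypothetical.

```python
from fractions import Fraction
from math import ceil

# Assumed squared single-locus CVs in the p -> 0 limit.
SINGLE_LOCUS_CV_SQUARED = {
    "full sibs": Fraction(1, 2),
    "second-order": Fraction(1),
    "third-order": Fraction(3),
    "fourth-order": Fraction(7),
}


def pairs_needed(cv_squared, n_loci, target_cv):
    """Smallest number of independent pairs with CV(mean relatedness) <= target_cv."""
    per_pair = cv_squared / n_loci  # squared CV of one pair's multilocus estimate
    return max(1, ceil(per_pair / target_cv ** 2))


target = Fraction(1, 10)
for name, cv2 in SINGLE_LOCUS_CV_SQUARED.items():
    print(f"{name:13s} {pairs_needed(cv2, 25, target)} pairs")
```

For 25 loci and a target coefficient of variation of 0.1, this reproduces the 2, 4, 12, and 28 pairs stated in the text.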

Discussion

The preceding analyses have identified three technical difficulties in using DNA fingerprinting to obtain individual estimates of relatedness: (1) the upward bias of fingerprint similarity compared with relatedness caused by finite numbers of alleles, (2) the inability to completely correct for such bias because of its individual specificity, and (3) the sampling variance caused by variation in identity by descent within and between loci. If we take into account the additional problems of comigration of nonallelic markers, linkage and/or linkage disequilibrium between marker loci, the frequent inability to observe markers with very low molecular weights, possible linkage of marker loci with other loci under selection, and high and variable mutation rates (Jeffreys et al. 1988), it is clear that considerable caution needs to be exercised in applications of DNA fingerprinting to estimate individual relatedness.

With very large numbers of alleles per locus (the critical number increasing with the distance of the relationship), the bias between DNA fingerprint similarity and relatedness can be ignored. However, while most VNTR loci do appear to be exceptionally variable, the number of alleles per locus is by no means high enough to permit the use of similarity as a reasonable estimator of relatedness. Nakamura et al. (1987) surveyed samples of 60-80 individuals with 372 VNTR probes, 77 of which revealed length polymorphisms. Of the latter loci, 11 exhibited more than 10 marker alleles, but the vast majority (68%) had five or fewer alleles, and 26% had only two to three alleles. The heterozygosity per polymorphic locus has a mode at 0.6-0.7 (fig. 4), which implies a rather high average similarity between nonrelatives. Expected values of $\theta_1$ are given in figure 4 under the assumption of an even allele frequency distribution (heterozygosity = $1 - p$, and $\theta_1 = p[2 - p]$). An averaging of these locus-specific similarities leads to an underestimate of $\theta_1$, the expected similarity between nonrelatives, because the observed allele frequency distributions are not even. Thus, for the set of loci observed by Nakamura et al. (1987), $\theta_1$ is certainly >0.5.

[Figure 3: vertical axis, 0-3.5; horizontal axis, Number of Marker Alleles/Locus, 0-100; curves labeled Parent-offspring, Full-sibs, and Second degree.]

FIG. 3.-Coefficient of variation of the relatedness estimate as a function of the number of marker alleles per locus and the degree of relationship. For L loci, the plotted values should be divided by $L^{1/2}$.

[Figure 4: vertical axis, 0-300; horizontal axis, Heterozygosity (%), 0-100; upper scale, Minimum Similarity/Locus for Nonrelatives, with ticks at 1.00, 0.75, 0.56, 0.36, and 0.10.]

FIG. 4.-Distribution of heterozygosity over 392 loci probed for VNTR polymorphisms in humans (data from Nakamura et al. 1987). The upper scale gives the expected fraction of shared bands per locus between unrelated individuals as a function of the heterozygosity under the assumption of an even allele frequency distribution.
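The conversion from heterozygosity to expected band sharing between nonrelatives under an even allele frequency distribution can be sketched in a few lines. The relations used are the ones stated in the text (heterozygosity = 1 - p, similarity = p[2 - p]); the function name `similarity_from_heterozygosity` is hypothetical.

```python
def similarity_from_heterozygosity(h):
    """Expected per-locus band sharing between nonrelatives, p(2 - p) with p = 1 - h.

    Assumes every allele at the locus has the same frequency p.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("heterozygosity must lie in [0, 1]")
    p = 1.0 - h
    return p * (2.0 - p)


for h in (0.5, 0.6, 0.7, 0.8):
    print(f"heterozygosity {h:.1f} -> similarity {similarity_from_heterozygosity(h):.2f}")
```

Because p(2 - p) with p = 1 - h simplifies to 1 - h², a heterozygosity of 0.6-0.7 corresponds to an expected similarity of roughly 0.51-0.64 under this even-frequency assumption.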

It is important to recognize that Nakamura et al. (1987) were only able to reveal locus-specific polymorphisms by probing for VNTR loci under high-stringency conditions. In DNA fingerprinting, one normally hybridizes under low-stringency conditions in order to examine a large number of markers simultaneously. This eliminates the luxury of ignoring loci with low (or no) variability. Prescreening a population with many different potential probes prior to analysis is a useful way to identify probes that give low $\theta_1$ under low-stringency conditions. However, it should be noted that probes for the most hypervariable VNTR loci in humans still give $\theta_1$ in the range of 0.1-0.3 (Jeffreys et al. 1985b). When the same probes are used in house sparrows, $\theta_1$ = 0.14 (SE = 0.01) (Wetton et al. 1987). Pair-wise comparisons in six species of birds yielded estimates of $\theta_1$ in the range of 0.1-0.5 (Burke and Bruford 1987).

Thus, unless probes can be located for loci that are much more variable than the already “hypervariable” loci of Jeffreys et al. (1985a, 1985b), the upward bias of DNA fingerprint similarity relative to relatedness is too substantial to ignore. Prior to embarking on a field study, it is then essential to obtain the baseline estimates $\theta_1$ and Var($\theta_1$) in order to apply equation (12), and even then one must recognize that this formula gives biased estimates of relatedness. Equation (12) will tend to overestimate the relatedness when applied to individuals that happen to contain common alleles and will tend to underestimate it for individuals carrying rare alleles.

Since the estimator for the sampling variance derived in the present paper did not include some potentially important sources of variation, it should be interpreted as nothing more than a lower limit to the sampling variance in hypothesis testing; but it can serve as a guide to the minimum number of markers (and/or individuals) that need to be evaluated to achieve a desired level of precision. An alternative empirically based (but time-consuming) approach to hypothesis testing exists. This would involve the development of frequency distributions of similarity from known types of relatives drawn from the base population of interest. The probability that any individual estimate of similarity is associated with a particular type of relationship can then be assayed directly without knowledge of the allelic distributions at specific loci.
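One way such an empirical approach might be organized is sketched below. It is only an illustration, not a procedure given in this paper: calibration similarities from known relatives are binned, and an observed similarity is scored by the relative frequency of each relationship class in its bin, with all classes weighted equally a priori. The function names, the bin count, and the calibration numbers are all hypothetical placeholders.

```python
def bin_index(similarity, n_bins):
    """Map a similarity in [0, 1] to one of n_bins equal-width bins."""
    return min(int(similarity * n_bins), n_bins - 1)


def class_probabilities(observed, calibration, n_bins=10):
    """Relative support for each relationship class given one observed similarity.

    calibration maps a class name to similarities measured on known relatives.
    Classes are weighted equally a priori (an assumption of this sketch).
    """
    target = bin_index(observed, n_bins)
    freq = {}
    for name, values in calibration.items():
        hits = sum(1 for v in values if bin_index(v, n_bins) == target)
        freq[name] = hits / len(values)
    total = sum(freq.values())
    if total == 0:
        # No calibration pair fell in this bin; the data give no support.
        return {name: 0.0 for name in calibration}
    return {name: f / total for name, f in freq.items()}


# Placeholder numbers for illustration only; not data from any study.
calibration = {
    "full sibs": [0.55, 0.62, 0.58, 0.71, 0.66],
    "nonrelatives": [0.22, 0.31, 0.18, 0.27, 0.52],
}
print(class_probabilities(0.57, calibration))
```

In practice the calibration samples would need to be far larger than these placeholders, and the bin width would have to be chosen with the sample size in mind.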

In closing, we must emphasize that the cautionary tone of the present paper applies primarily to ascertainment of relatedness. There are numerous problems in population biology for which DNA fingerprinting may provide a simple and powerful analytical tool. Provided one knows the mother of an individual, paternity exclusion is quite straightforward, since it simply involves the identification of markers in the offspring that cannot be attributed to the mother or to the putative father. A similar logic applies to the ascertainment of multiple paternity. The simplicity of both types of analyses stems from the fact that they merely involve the rejection of a general hypothesis rather than the acceptance of a specific relationship. In some applications, such as the identification of optimal breeding pairs in genetic conservation programs, it will often be less critical to establish absolute relatedness than rank-order relatedness. Since expected similarity is a monotonic function of relatedness, DNA fingerprinting can provide a useful guide to such decision making. One should, however, maintain an awareness that, owing to the many sources of error described in the present paper, rank order based on fingerprint similarities can sometimes be substantially different than the true rank order of relatedness.
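The exclusion logic described above amounts to a set difference on scored bands. The sketch below is a minimal illustration, assuming each fingerprint has already been reduced to a set of band labels. The function names, the band labels, and the `tolerance` parameter (a hypothetical allowance for mutation or scoring error) are not from the paper.

```python
def unexplained_bands(offspring, mother, putative_father):
    """Offspring bands found in neither the mother nor the putative father."""
    return set(offspring) - set(mother) - set(putative_father)


def is_excluded(offspring, mother, putative_father, tolerance=0):
    """Reject paternity when more than `tolerance` offspring bands are unexplained."""
    return len(unexplained_bands(offspring, mother, putative_father)) > tolerance


# Hypothetical band labels, for illustration only.
mother = {"a", "c", "f"}
offspring = {"a", "d", "f", "h"}
male_1 = {"d", "h", "k"}
male_2 = {"b", "d", "k"}

print(is_excluded(offspring, mother, male_1))  # all non-maternal bands accounted for
print(is_excluded(offspring, mother, male_2))  # band "h" is left unexplained
```

A failure to exclude a male does not establish paternity; it only means that the hypothesis of paternity has not been rejected, which is the asymmetry the text emphasizes.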

Acknowledgments

Helpful comments were provided by N. Burley, T. Crease, B. Leathers, D. Queller, S. Robinson, and R. Yokoyama. Financial support came from National Science Foundation grant BSR 86-00487 and National Institutes of Health Biomedical Research support grant RR07030.

LITERATURE CITED

BURKE, T., and M. W. BRUFORD. 1987. DNA fingerprinting in birds. Nature 327:149-152.

CHAKRABORTY, R., T. R. MEAGHER, and P. E. SMOUSE. 1988. Parentage analysis with genetic markers in natural populations. I. The expected proportion of offspring with unambiguous paternity. Genetics 118:527-536.

CROZIER, R. H. 1970. Coefficients of relationship and the identity of genes by descent in the Hymenoptera. Am. Nat. 104:216-217.

CROZIER, R. H., P. PAMILO, and Y. C. CROZIER. 1984. Relatedness and microgeographic genetic variation in Rhytidoponera mayri, an Australian arid zone ant. Behav. Ecol. Sociobiol. 15:143-150.

GILL, P., A. J. JEFFREYS, and D. J. WERRETT. 1985. Forensic application of DNA ‘fingerprints.’ Nature 318:577-579.

GRAFEN, A. 1985. A geometric view of relatedness. Oxf. Surv. Evol. Biol. 2:28-89.

HAMILTON, W. D. 1964. The genetical evolution of social behaviour. I and II. J. Theor. Biol. 7:1-52.

HAMILTON, W. D. 1971. Selection of selfish and spiteful behavior in some extreme models. Pp. 55-91 in J. F. EISENBERG and W. S. DILLON, eds. Man and beast: comparative social behaviour. Smithsonian Institution Press, Washington, D.C.

JACQUARD, A. 1974. The genetic structure of populations. Springer, Berlin.

JEFFREYS, A. J., and D. B. MORTON. 1987. DNA fingerprints of dogs and cats. Anim. Genet. 18:1-15.

JEFFREYS, A. J., N. J. ROYLE, V. WILSON, and Z. WONG. 1988. Spontaneous mutation rates to new length alleles at tandem-repetitive hypervariable loci in human DNA. Nature 332:278-281.

JEFFREYS, A. J., V. WILSON, R. KELLY, B. A. TAYLOR, and G. BULFIELD. 1987. Mouse DNA ‘fingerprints’: analysis of chromosome localization and germ line stability of hypervariable loci in recombinant inbred strains. Nucleic Acids Res. 15:2823-2836.

JEFFREYS, A. J., V. WILSON, and S. L. THEIN. 1985a. Hypervariable ‘minisatellite’ regions in human DNA. Nature 314:67-73.

JEFFREYS, A. J., V. WILSON, and S. L. THEIN. 1985b. Individual-specific ‘fingerprints’ of human DNA. Nature 316:76-79.

MCCRACKEN, G. F., and J. W. BRADBURY. 1981. Social organization and kinship in the polygynous bat, Phyllostomus hastatus. Behav. Ecol. Sociobiol. 15:287-291.

METCALF, R. A., and G. S. WHITT. 1977. Intra-nest relatedness in the social wasp Polistes metricus. Behav. Ecol. Sociobiol. 2:339-351.

MICHOD, R. E., and W. D. HAMILTON. 1980. Coefficients of relatedness in sociobiology. Nature 288:694-697.

NAKAMURA, Y., M. LEPPERT, P. O’CONNELL, R. WOLFF, T. HOLM, M. CULVER, C. MARTIN, E. FUJIMOTO, M. HOFF, E. KUMLIN, and R. WHITE. 1987. Variable number of tandem repeat (VNTR) markers for human gene mapping. Science 235:1616-1622.

PAMILO, P. 1984. Genotypic correlation and regression in social groups: multiple alleles, multiple loci and subdivided populations. Genetics 107:307-320.

PAMILO, P., and R. H. CROZIER. 1982. Measuring genetic relatedness in natural populations: methodology. Theor. Popul. Biol. 21:171-193.

VASSART, G., M. GEORGES, R. MONSIEUR, H. BROCAS, A. S. LEQUARRE, and D. CHRISTOPHE. 1987. A sequence in M13 phage detects hypervariable minisatellites in human and animal DNA. Science 235:683-684.

WETTON, J. H., R. E. CARTER, D. T. PARKIN, and D. WALTERS. 1987. Demographic study of a wild house sparrow population by DNA fingerprinting. Nature 327:147-149.

WILKINSON, G. S., and G. F. MCCRACKEN. 1986. On estimating genetic relatedness using genetic markers. Evolution 39:1169-1174.

WRIGHT, S. 1922. Coefficients of inbreeding and relationship. Am. Nat. 56:330-338.

MASATOSHI NEI, reviewing editor

Received February 4, 1988; revision received April 21, 1988