estimating allele frequencies of hypervariable dna systems

Forensic Science International, 51 (1991) 273-280 Elsevier Scientific Publishers Ireland Ltd.


ESTIMATING ALLELE FREQUENCIES OF HYPERVARIABLE DNA SYSTEMS

V.L. PASCALI, E. d’ALOJA, M. DOBOSZ and M. PESCARMONA

Immunohematology Laboratory, Istituto di Medicina Legale, Università Cattolica del S. Cuore, Largo F. Vito, 1, I-00168 Roma (Italy)

(Received May 13th, 1991) (Revision received August 22nd, 1991) (Accepted August 29th, 1991)

Summary

Several polymorphisms of human DNA have been shown to be hypervariable due to the recurrence of a variable number of tandem repeats (VNTRs) in the lengths of allelic restriction fragments. The recurrence of allelic variants in this novel class of polymorphisms seems to comply well with a model of continuous random variables. Based on this assumption, we have compiled some simple algorithms for the classification of continuous data and the estimation of relative class frequencies, and have implemented these routines for the management of databases storing hypervariable single-locus DNA genetic systems. The algorithms are written in BASIC and can be incorporated in task-oriented computer programs. Three procedures are discussed, based in turn on: (a) predetermined, arbitrary classes; (b) point estimates of frequencies for single fragments, using the error measurements associated with the kilobase value assignment; (c) estimates of phenotype frequencies according to error measurements. Error measurements are obtained from the statistics of values pertaining to several restriction fragments (genomic controls) repeatedly tested in different experiments. Problems related to these approaches are discussed.

Key words: DNA profiles; Single-locus probes; Gene frequencies; BASIC algorithms.

Introduction

Hypervariable DNA markers of the human genome are a vast category of polymorphisms whose importance is well established in several fields of biological research [1,2]. They are used in forensic biology because of their considerable power of individualization and have revolutionized the fields of biological stain analysis [3] and parenthood investigations [4]. The prominent feature of HVRs (hypervariable regions; VNTRs, variable number of tandem repeats) lies in the high number of allelic restriction fragments for each genetic locus.

Unequal crossing over [5] and premeiotic germline mutations [6] have been proposed to explain the continuous generation of new alleles. Consistent with these hypotheses, a high rate of mutation per locus [7] has been reported for some of these loci. Whatever the nature of the underlying genetic mechanism, DNA restriction fragments with different lengths result from the iteration of simple tandem repeats of core nucleotide sequences.

0379-0738/91/$03.50 © 1991 Elsevier Scientific Publishers Ireland Ltd. Printed and Published in Ireland

In VNTR systems, allelic products may theoretically differ from each other by a length portion equivalent to at least one repeat unit [8]. In low molecular weight systems, resolution is much improved so that individual alleles can be identified and classified even in the presence of a high polymorphism (e.g. YNZ22-D17S5).

However, many VNTRs have short repeat units assembled in large-size alleles. Small differences in length, together with measurement errors (because of the low resolution of high molecular weight alleles in agarose electrophoresis), create an array of continuous data. As a consequence, individual alleles cannot be identified [8].

Systems of this sort fit the model of continuous random variables. Therefore, use of VNTRs as genetic markers must address allelic length variation against a background of random measurement fluctuation.

This paper deals with several procedures which we have adopted to estimate the frequency of occurrence of hypervariable fragments. We have developed some simple routine algorithms, written in BASIC language, which enable calculation of allele frequencies of hypervariable DNA and may be useful as part of a task-oriented computer program. Some properties of hypervariable DNA regions (HVRs) to which the algorithms apply, as well as problems related to the computational approach are discussed.

Materials and Methods

Genomic DNA was obtained as follows. Blood samples (0.8 ml each) were first frozen, then thawed, and the red cells were selectively lysed with 1× saline sodium citrate (SSC). The white cells were pelleted and subsequently incubated in a sodium dodecyl sulphate (SDS)/sodium acetate/proteinase K buffer [3]. DNA was finally obtained by phenol/chloroform extraction and ethanol precipitation.

Enzymatic restriction was carried out by overnight incubation with a fivefold excess of Hinf I (Boehringer, Mannheim, no. 1274082). Digests of 3 µg were then electrophoresed on 0.8% agarose gels in 1× TBE (Tris-borate-EDTA buffer; 10× = 1.3 M Tris; 0.75 M boric acid; 0.015 M EDTA, pH 8.8). Each run typically contained two visible mol. wt markers (1 kb ladder, BRL, cat. no. 520 - 5615SB, 1 µg per lane) in the outermost lanes of the gel. Adjacent to these lanes, two more lanes contained a smaller aliquot of the same marker (1 kb ladder, 2 ng). A central lane contained one genomic control digest, whose polymorphic profile was known from previous analyses. Each genomic control digest was typically subjected to serial analyses (30 experiments in most cases) and was thereafter replaced by another digest. In this way, data were collected on several genomic control digests of different molecular sizes. At the end of the experiments, data were available on genomic control fragments spanning 7 to 0.5 kb at roughly regular intervals of about 500 base pairs (bp).

The gels were run (35 V, 20 mA) until the 2 kb marker band had migrated 160 mm from the well line. On completion of electrophoresis, all digests were Southern blotted onto a nylon membrane (Hybond, Amersham) and hybridised to two different hypervariable probes (YNH24/D2S44; 3’HVR/D16) under high stringency conditions after Church and Gilbert [9]. Autoradiography of the hybridised membranes was carried out for 3-5 days at -80°C.

The relative positions of the detected fragments were measured on a semi-automated basis, using a digitizing tablet (Summagraphics), then converted into kb values by an algorithm based on the reciprocal method of Elder and Southern [10] and stored in sequential files. Different files were created for the genomic controls and the population data. In both cases, additional information (on the geographical and ethnic origin of the individual and on the relevant experiment) was attached to each pair of kb values and recorded in duplicated sequential files.
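The conversion from band position to kb value can be illustrated as follows. The paper does not reproduce its own algorithm, so this is only a minimal sketch of the general reciprocal relation (m - m0)(L - L0) = c fitted through three flanking size standards; the function names and the ladder values are hypothetical.

```python
def fit_reciprocal(standards):
    """Fit (m - m0) * (L - L0) = c through three (mobility_mm, size_kb) standards.

    Assumption: the three standards flank the unknown band. The fit fails
    (division by zero) if the three points happen to lie on a straight line.
    """
    (m1, l1), (m2, l2), (m3, l3) = standards
    a = ((l1 - l2) * (m3 - m2)) / ((l2 - l3) * (m2 - m1))
    m0 = (m3 - a * m1) / (1.0 - a)
    c = (l1 - l2) * (m1 - m0) * (m2 - m0) / (m2 - m1)
    l0 = l1 - c / (m1 - m0)
    return m0, l0, c


def mobility_to_kb(mobility_mm, standards):
    """Convert a migration distance (mm) into a fragment size (kb)."""
    m0, l0, c = fit_reciprocal(standards)
    return l0 + c / (mobility_mm - m0)


# Hypothetical ladder bands: (migration distance in mm, size in kb).
ladder = [(50.0, 5.2), (110.0, 2.2), (210.0, 1.2)]
size_kb = mobility_to_kb(90.0, ladder)
print(round(size_kb, 3))
```

In practice the three standards would be chosen from the ladder lanes nearest to the band being sized, so that the fit is local to that region of the gel.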

Files containing genomic controls were used to derive standard deviations and the percentages of error underlying the procedure of kb value assignment. Files containing population data were used to calculate allele and phenotype frequencies. Two hundred samples (unrelated individuals from Central and Southern Italy) were processed in about 40 electrophoretic runs.
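As a rough illustration of how a genomic control file yields a standard deviation and a percentage of error, the sketch below summarises repeated kb readings of one control fragment. The readings are invented, and expressing the percentage of error as the standard deviation relative to the mean is an assumption about how the figure is defined.

```python
from statistics import mean, stdev


def control_error(kb_readings):
    """Return (mean kb, sample standard deviation, percentage of error)."""
    avg = mean(kb_readings)
    sd = stdev(kb_readings)
    # Assumed definition: standard deviation as a percentage of the mean size.
    return avg, sd, 100.0 * sd / avg


# Hypothetical serial readings (kb) of a single genomic control fragment.
readings = [2.96, 3.00, 3.04, 3.00, 2.98, 3.02]
avg_kb, sd_kb, pct_error = control_error(readings)
print(f"mean={avg_kb:.3f} kb  sd={sd_kb:.4f} kb  error={pct_error:.2f}%")
```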

Results

As shown by the diagram in Fig. 1, a system was devised in which genomic control files act as keys of access to the population database. The profile of an unknown DNA sample is first converted into kb values, and a class interval is sized around each value by ascribing ±1 to ±3 standard deviations to it. Gene frequencies according to different confidence intervals are finally derived by scanning the population database and counting how many kb values fall within the class interval. Care is taken to use the percentage of error of the genomic control that is as close in length as possible to the fragment whose frequency is sought.
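The counting step described above might look like the following sketch. It is not the authors' BASIC routine; the function name, the population values and the 1.2% error are hypothetical, and the standard deviation is assumed to be the percentage of error applied to the fragment's own kb value.

```python
def point_estimate(target_kb, population_kb, pct_error, n_sd):
    """Fraction of database fragments inside target_kb +/- n_sd standard deviations."""
    # Assumption: one standard deviation equals pct_error percent of the target size.
    half_width = n_sd * target_kb * pct_error / 100.0
    low, high = target_kb - half_width, target_kb + half_width
    hits = sum(1 for kb in population_kb if low <= kb <= high)
    return hits / len(population_kb)


# Hypothetical database of fragment sizes (kb), two entries per typed individual.
population = [2.90, 2.95, 2.97, 3.00, 3.02, 3.05, 3.10, 3.40, 4.10, 5.00]

# One relative frequency for each of the three class intervals (1, 2 and 3 SD).
estimates = {n: point_estimate(3.00, population, 1.2, n) for n in (1, 2, 3)}
print(estimates)
```

Widening the interval from 1 to 3 standard deviations can only raise the estimate, which is why the wider intervals give the more conservative frequencies.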

Discussion

The procedure of calculating gene frequencies outlined above treats VNTR alleles as continuous random variables [11,12]. While the distribution of populations of VNTR alleles does not fit the Gaussian distribution (being multimodal), it can be shown that each individual fragment, if repeatedly measured, generates its own subset of values which fit a normal distribution curve. As a consequence, and as long as the hypothesis of normally distributed measurement errors holds, fragments from a population may be sampled according to the variance of each assigned molecular size.

To classify the fragments of a given population, a variance (s²) should be experimentally ascribed to each kb measurement. This would involve repeating every measurement an adequate number of times under standard experimental conditions. Such a procedure is obviously not practical. A possible way to circumvent this problem is to compile statistics from serial measurements of a few selected restriction fragments and to assume that their variance represents the error in sizing fragments, regardless of their size.

[Fig. 1 diagram. Box labels: Genomic controls; Southern blot; Kb values; Archive of population data; Relative frequency by point estimation.]

Fig. 1. A scheme of the procedure by which relative frequencies of hypervariable alleles are computed. A standardized protocol of Southern blot analysis feeds the population database, and a selected array of fragments (genomic controls) is scattered at roughly regular kb intervals. Separate computer files are provided for the archive and for each genomic control. Serial measurements of genomic controls provide the standard deviations of the kb assignment procedure. These are converted into percentages of error and ascribed to every fragment size whose frequency is sought. Point estimates at variable confidence limits are derived by scanning the database and counting how many fragments fall within the preselected confidence limit around the kb measurement under evaluation.

Prior to this, it is necessary to show that the variance in the procedure of kb assignment does not significantly change within a reasonably large kb interval.

Data from our laboratory (Pascali, observations on 12 genomic controls; unpublished data) have so far shown us that the model of normal distribution applies well to most subsets of genomic control data and that a roughly uniform percentage of error (ranging between 1.0 and 1.5%) affects a rather large kb range (from 0.4 to 7.5 kb) under our experimental conditions.

We are aware that, in spite of the uniformity of the percentage error in controls, extending error estimates from selected fragments to the whole population of VNTR measurements (or to a part of it) remains a simplifying inference. There are in fact arguments questioning the validity of this assumption.

First of all, the larger the distance in kb between the allelic control and the restriction fragment to be evaluated, the weaker the hypothesis that their variances (s²) are equivalent. In fact, although errors due to the simple act of measuring play a central role in determinations [11], this does not prove that they remain uniform all along the kb range, in the widest range of experimental conditions and for every band position in the kb scale.

Local electrophoretic band shifts and/or fluctuations in measured kb values have already been reported [13]. It has also been observed [14] that measurement error increases with molecular weight. We have experienced both phenomena (Pascali, unpublished data), presumably due to local endosmosis, uneven diffusion coefficients of the bands and differing thickness of autoradiographed bands. These observations argue against adopting the standard deviation of just one genomic control to validate procedures for classifying the whole range of hypervariable alleles.

As a compromise, we have set out to establish and permanently update a library of several control fragments throughout the kb scale of interest to us. As the general database of fragments grows in size with the addition of new data, the genomic control data grow as well (since controls are included in every test). Standard deviations (as a percentage of kb) can then be calculated by reference to the closest control fragment.
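A lookup against such a library could be sketched as below. The library contents are invented for illustration (the percentages merely fall in the 1.0-1.5% range quoted earlier), and the function name is hypothetical.

```python
def nearest_control(fragment_kb, library):
    """Return (control size in kb, percentage of error) of the closest control.

    `library` maps the mean size of each genomic control (kb) to its
    percentage of error.
    """
    size = min(library, key=lambda control_kb: abs(control_kb - fragment_kb))
    return size, library[size]


# Hypothetical control library: mean size (kb) -> percentage of error.
library = {1.0: 1.0, 2.0: 1.1, 3.0: 1.2, 4.5: 1.3, 6.0: 1.5}

control_kb, pct = nearest_control(3.4, library)
sd_kb = 3.4 * pct / 100.0  # standard deviation ascribed to the 3.4 kb fragment
print(control_kb, pct, round(sd_kb, 4))
```

Because the library is just a mapping, adding a new control or refreshing an existing percentage as more runs accumulate does not change the lookup.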

The problem of estimating VNTR frequencies is controversial [15], and several computational procedures have so far been adopted [14,16,17].

The so-called binning approach is the most obvious procedure to obtain relative frequencies and is one of the most widespread.

A given population of fragments is sorted into an ascending array and stored in discrete classes [18], so that each class of continuous values becomes an easier-to-handle population of discrete data. Increasing the number of classes involves a proportionally higher fluctuation of the values contained in each class if the population is sampled again. Conversely, if only a few classes were used, they could miss some significant features of the underlying population.

Although no general rule may be adopted to assess a straightforward class interval, several laboratories adopt this approach (and its graphical counterpart, the bar histogram) as the current method to estimate the distribution of VNTR alleles [16,17]. We occasionally use it to represent the overall distribution of fragment sizes, but not as a way to estimate individual gene frequencies (a short BASIC routine enables subdivision of VNTR measurements into classes and representation by a bar histogram; the programme is available on request).
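A binning routine of this kind can be sketched in a few lines. This is not the authors' BASIC programme; the function names, class width, origins and fragment sizes are all hypothetical. The example bins the same data twice with different starting points to show how much the frequency attached to one band depends on an arbitrary choice.

```python
import math
from collections import Counter


def bin_frequencies(sizes_kb, width_kb, origin_kb=0.0):
    """Relative frequency of each class [low, high) of fixed width."""
    counts = Counter(math.floor((kb - origin_kb) / width_kb) for kb in sizes_kb)
    total = len(sizes_kb)
    return {
        (origin_kb + i * width_kb, origin_kb + (i + 1) * width_kb): counts[i] / total
        for i in sorted(counts)
    }


def band_frequency(band_kb, freqs):
    """Frequency of the class that contains a given band (0.0 if none)."""
    for (low, high), freq in freqs.items():
        if low <= band_kb < high:
            return freq
    return 0.0


# Hypothetical fragment sizes (kb).
sizes = [2.05, 2.20, 2.45, 2.55, 2.70, 2.95, 3.10, 3.30]

bins_a = bin_frequencies(sizes, 0.5, origin_kb=0.0)
bins_b = bin_frequencies(sizes, 0.5, origin_kb=0.25)
print(band_frequency(2.20, bins_a), band_frequency(2.20, bins_b))
```

With these made-up data the 2.20 kb band belongs to a class of frequency 0.375 under one origin and 0.25 under the other, although the class width is the same.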

One intriguing feature of the binning approach is the fact that relative frequencies of arbitrary classes (bins) vary if the starting point for the class subdivision is changed. Apparently in this case the same density distribution of the population sample is sufficiently well represented by both arrays whereas kb values corresponding to given bands may have significantly different relative frequencies. A different way of computing hypervariable allele frequencies has recently been employed by Gill et al. [14].

With this method, called the Sliding Window Fit approach, the ideal migrational field encompassing all observed DNA fragments is subdivided into numerous position units. Two adjacent position units span the minimal distance still resolvable by a computer graphical grid. After Southern blot analysis, DNA bands are assigned to the nearest position unit. Since the experimental procedure generates fluctuations in the electrophoretic positions of the fragments, any given band may be ascribed to several neighbouring positions when repeatedly electrophoresed. The sliding window approach therefore estimates VNTR frequencies by first determining the mean error in positioning the bands (this is easily obtained from serial data on the genomic controls) and then grouping all bands which fall within ±1 to ±3 standard deviations of selected position units. Frequencies derived from this method are used in place of those drawn from arbitrary subdivisions into classes. This procedure was originally described to work in association with an automated optical scanner capturing and storing the positions of the bands within a reference mol. wt standard ladder [19].
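Read literally, the description above amounts to something like the following sketch. It is a loose illustration of the text only, not the implementation of Gill et al.; the function name, the band positions and the 1.2-unit standard deviation are hypothetical.

```python
def sliding_window(band_positions, sd_units, n_sd, selected_units):
    """Relative frequency of bands within n_sd standard deviations of each unit."""
    # Each band is first assigned to its nearest position unit on the grid.
    units = [round(p) for p in band_positions]
    half_width = n_sd * sd_units
    total = len(units)
    return {
        u: sum(1 for b in units if abs(b - u) <= half_width) / total
        for u in selected_units
    }


# Hypothetical band positions, in grid units, from a population sample.
positions = [100.2, 101.7, 103.4, 104.9, 110.3, 118.8]

# The window is evaluated at every unit of the field, here units 100 to 119.
freqs = sliding_window(positions, sd_units=1.2, n_sd=2, selected_units=range(100, 120))
print(freqs[103], round(freqs[110], 3))
```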

The method that we propose here (Point Estimates based on random errors in base pairs) is similar to the sliding window fit, but there are three main differences: (1) A point estimate is drawn from a given kb value (any fragment whose frequency is sought) using the percentage of error of the nearest genomic control available in the library (we find it useful to derive three relative frequencies corresponding to as many confidence intervals). (2) A point estimate involves counting how many fragments fall inside a class interval defined by superimposing tolerance limits on a single kb measurement. No arrays of frequencies are therefore calculated. (3) The procedure uses precalculated band kb values rather than band position units, thus ensuring that all sources of experimental and computational error are accounted for.

A database of DNA fragments stored in computer sequential files can obviously speed up the computations involved in our procedure. Indeed, point estimates by random errors would be tedious to perform manually, but are easily accomplished by a computer algorithm (a model point estimation algorithm is available on request).

If both alleles of every typed individual are sequentially entered in the database file, then phenotype frequencies may be calculated by grouping individuals who match for the classification of both fragments. This further application of the point estimation method may be run by another simple algorithm (details available on request). Definite advantages of our approach are: (1) relative frequency estimates account for the ability of the laboratory to resolve hypervariable alleles in different kb regions; (2) confidence limits (according to the number of standard deviations adopted to define the class intervals) always accompany frequency values; (3) frequencies based on realistic (and conservative) assumptions may be computed. Our current guideline is to adopt confidence limits corresponding to ±2 standard deviations around each estimated molecular size; (4) the whole procedure (storage and retrieval of data, computation of frequencies) handles kb values, thus allowing a direct comparison of data with those from other laboratories. Databases from different laboratories can be computed together, provided that a common experimental protocol is used. Gene frequencies derived from point estimates based on random errors in base pairs may realistically represent the probability of a given allele occurring in the database of our laboratory. On the other hand, whenever a general idea of the distribution of fragments along the entire kb range is needed, arrays of arbitrary bins are adequate.
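The phenotype calculation mentioned above can be sketched as follows. The authors' own algorithm is not shown in the paper, so the names and data here are hypothetical. For brevity a single percentage of error is applied to both fragments (in practice each could take the value of its nearest genomic control), and single-band patterns are not treated specially.

```python
def within(kb, target_kb, pct_error, n_sd):
    """True if kb lies inside target_kb +/- n_sd standard deviations."""
    return abs(kb - target_kb) <= n_sd * target_kb * pct_error / 100.0


def phenotype_frequency(target, individuals, pct_error, n_sd=2):
    """Fraction of individuals whose two fragments both match the target pair."""
    t_small, t_large = sorted(target)
    hits = 0
    for pair in individuals:
        small, large = sorted(pair)
        if within(small, t_small, pct_error, n_sd) and within(large, t_large, pct_error, n_sd):
            hits += 1
    return hits / len(individuals)


# Hypothetical database: one (fragment, fragment) pair in kb per individual.
individuals = [(3.00, 4.50), (3.02, 4.45), (4.52, 2.97), (3.00, 5.10), (2.40, 4.50)]

freq = phenotype_frequency((3.00, 4.50), individuals, pct_error=1.2)
print(freq)
```

The default of two standard deviations follows the guideline stated in the text; sorting each pair makes the match independent of the order in which the two fragments were entered.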

Acknowledgements

We thank Prof. Francesco Andreasi Bassi (Dept. Physics, UCSC) for criticisms and for patiently supervising several editions of the algorithms presented here. We also thank Prof. Angelo Serra (Dept. Genetics, UCSC) and Dr. Maurizio Genuardi for revising the manuscript. We acknowledge Dr. Peter Gill (HOCRE, Aldermaston) for reviewing the reasoning contained in this paper and for critically reading the manuscript. Many ideas discussed here came to us by attending the sessions of the EDNAP (European DNA profile) group. The sliding window fit procedure was first introduced by P. Gill and the HOCRE staff on such occasions. We are indebted to all our colleagues of the EDNAP group for several interesting public discussions and private interviews on this subject. The responsibility for what is asserted here rests entirely with us.

References

1 R. White, DNA sequence polymorphisms revitalize linkage approaches in human genetics. Trends Genet., 177 (1985) 181.

2 K.E. Davies, New hypervariable DNA markers for mapping human genetic disease. Trends Genet., April (1985) 97.

3 P. Gill, A.J. Jeffreys and D.J. Werrett, Forensic applications of DNA fingerprints. Nature, 318 (1985) 577-579.

4 A.J. Jeffreys, J.F.Y. Brookfield and R. Semeonoff, Positive identification of an immigration test-case using human DNA fingerprints. Nature, 317 (1985) 818-819.

5 A.J. Jeffreys, V. Wilson and S.L. Thein, Hypervariable ‘minisatellite’ regions in human DNA. Nature, 314 (1985) 67-73.

6 A.J. Jeffreys, R. Neumann and V. Wilson, Repeat unit sequence variation in minisatellites: a novel source of DNA polymorphism for studying variation and mutation by single molecule analysis. Cell, 60 (1990) 473-485.

7 A.J. Jeffreys, N.J. Royle, V. Wilson and Z. Wong, Spontaneous mutation rates to new length alleles at tandem-repetitive hypervariable loci in human DNA. Nature, 332 (1988) 278-281.

8 S.J. Odelberg, R. Plaetke, J.R. Eldridge, L. Ballard, P. O’Connell, Y. Nakamura, M. Leppert, J.M. Lalouel and R. White, Characterization of eight VNTRs loci by agarose gel electrophoresis. Genomics, 5 (1989) 915-924.

9 G.M. Church and W. Gilbert, Genomic sequencing. Proc. Natl. Acad. Sci. U.S.A., 81 (1984) 1991-1995.

10 J.K. Elder and E.M. Southern, Computer-aided analysis of one-dimensional restriction fragment gels, in M.J. Bishop and C.J. Rawlings (eds.), Nucleic Acids and Protein Sequence Analysis, IRL Press, Oxford, 1987, pp. 165-172.

11 B.W. Lindgreen and G.W. McElrath, Introduction to Probability and Statistics, The Macmillan Co., Toronto, 1969, pp. 131-142.

12 P. Armitage, Statistical Methods in Medical Research, Blackwell, Oxford, 1971.

13 C. Brenner and J.W. Morris, Paternity Index Calculations in Single Locus Hypervariable DNA Probes: Validation and Other Studies. Proc. Int. Symp. on Human Identification 1989, copyright Promega Corp., 1990, pp. 21-54.

14 P. Gill, K. Sullivan, D.J. Werrett, The analysis of hypervariable DNA profiles: problems associated with the objective determination of the probability of a match. Hum. Genet., 85 (1990) 75-79.

15 P.H. van Eade, T.M. Cuypers, G.G. de Lange, Attempts to calculate the allele frequency distribution of a TaqI RFLP detected by the highly polymorphic probe YNH24. Hum. Genet., 84 (1990) 376-378.

16 M. Baird, I. Balsz, A. Giusti, L. Miyzaki, L. Nicholas, K. Wexler, E. Kanter, J. Glassber, F. Allen, P. Rubinstein and L. Sussman, Allele frequency distribution of two highly polymorphic DNA sequences in three ethnic groups and its application to the determination of paternity. Am. J. Hum. Genet., 29 (1986) 489-501.

17 B. Budowle, Data for Forensic Matching Criteria for VNTRs Profiles. Proc. Int. Symp. on Human Identification 1989, copyright Promega Corp., 1990, pp. 103-116.

18 W.H. Press, B.P. Flannery, S.A. Toukolsky and W. Vetterling, Numerical Recipes, Cambridge University Press, Cambridge, 1986, pp. 462-464.

19 P. Gill and D.J. Werrett, Interpretation of DNA profiles using a computerised database. Electrophoresis, 11 (1990) 444-448.
