Estimating allele frequencies of hypervariable DNA systems








<ul><li><p>Forensic Science International, 51 (1991) 273-280 Elsevier Scientific Publishers Ireland Ltd. </p><p>ESTIMATING ALLELE FREQUENCIES OF HYPERVARIABLE DNA SYSTEMS </p><p>V.L. PASCALI, E. d'ALOJA, M. DOBOSZ and M. PESCARMONA </p><p>Immunohematology Laboratory, Istituto di Medicina Legale, Università Cattolica del S. Cuore, Largo F. Vito, 1, I-00168 Roma (Italy) </p><p>(Received May 13th, 1991) (Revision received August 22nd, 1991) (Accepted August 29th, 1991) </p><p>Summary </p><p>Several polymorphisms of human DNA have been shown to be hypervariable due to the occurrence of a variable number of tandem repeats (VNTRs) affecting the lengths of allelic restriction fragments. The occurrence of allelic variants in this novel class of polymorphisms seems to comply well with a model of continuous random variables. Based on this assumption, we have compiled some simple algorithms for classification of continuous data and estimation of classes of relative frequencies and have implemented these routines for the management of databases storing hypervariable single locus DNA genetic systems. The algorithms are written in BASIC and can be incorporated in task-oriented computer programs. Three procedures are discussed, based in turn on: (a) using predetermined, arbitrary classes; (b) point estimations of frequencies for single fragments using error measurements associated with the kilobase value assignment; (c) estimates of phenotype frequencies according to error measurements. Error measurements are obtained from statistics on values pertaining to several restriction fragments (genomic controls) repeatedly tested in different experiments. Problems related to these approaches are discussed. </p><p>Key words: DNA profiles; Single-locus probes; Gene frequencies; BASIC algorithms. 
</p><p>Introduction </p><p>Hypervariable DNA markers of the human genome are a vast category of polymorphisms whose importance is well established in several fields of biological research [1,2]. They are used in forensic biology because of their considerable power of individualization and have revolutionized the fields of biological stain analysis [3] and parenthood investigations [4]. The prominent feature of HVRs (hypervariable regions; VNTRs, variable number of tandem repeats) lies in the high number of allelic restriction fragments for each genetic locus. </p><p>Unequal crossing over [5], as well as premeiotic germline mutations [6], have been proposed to explain a continuous generation of new alleles. Consistent with these hypotheses, a high rate of mutation per locus [7] has been reported for some of these loci. Whatever the nature of the underlying genetic mechanism, DNA restriction fragments with different lengths result from the iteration of simple tandem repeats of core nucleotide sequences. </p><p>0379-0738/91/$03.50 © 1991 Elsevier Scientific Publishers Ireland Ltd. Printed and Published in Ireland </p></li><li><p>In VNTR systems, allelic products may theoretically differ from each other by a length portion equivalent to at least one repeat unit [8]. In low molecular weight systems, resolution is much improved so that individual alleles can be identified and classified even in the presence of a high degree of polymorphism (e.g. YNZ22-D17S5). </p><p>However, many VNTRs have short repeat units assembled in large-size alleles. Small differences in length and measurement errors (because of the low resolution of high molecular weight alleles in agarose electrophoresis) create an array of continuous data. As a consequence, individual alleles cannot be identified [8]. </p><p>Systems of this sort fit the model of continuous random variables. 
Therefore, use of VNTRs as genetic markers must address allelic length variation against a background of random measurement fluctuation. </p><p>This paper deals with several procedures which we have adopted to estimate the frequency of occurrence of hypervariable fragments. We have developed some simple routine algorithms, written in BASIC, which enable calculation of allele frequencies of hypervariable DNA and may be useful as part of a task-oriented computer program. Some properties of hypervariable DNA regions (HVRs) to which the algorithms apply, as well as problems related to the computational approach, are discussed. </p><p>Materials and Methods </p><p>Genomic DNA was obtained as follows. Blood samples (0.8 ml each) were first frozen, then thawed, and the red cells selectively lysed by 1× saline sodium citrate (SSC). The white cells were pelleted and subsequently incubated in a sodium dodecyl sulphate (SDS)/sodium acetate/proteinase K buffer [3]. DNA was finally obtained by phenol/chloroform extraction and ethanol precipitation. </p><p>Enzymatic restriction was carried out using overnight incubation with fivefold excess Hinf I (Boehringer, Mannheim, no. 1274082). Digests of 3 µg were finally electrophoresed on 0.8% agarose gels in 1× TBE (Tris Borate EDTA buffer; 10× = 1.3 M Tris; 0.75 M Boric acid; 0.015 M EDTA, pH 8.8). Each run typically contained two visible mol. wt markers (1 kb ladder, BRL, cat. no. 520-5615SB, 1 µg per lane) at the gel side extremities. Adjacent to these lanes, two more lanes contained a smaller aliquot of the same marker (1 kb ladder, 2 ng). A central lane contained one genomic control digest, whose polymorphic profile was known from previous analyses. One genomic control digest was typically subjected to serial analyses (30 experiments in most cases); thereafter it was replaced by another digest. Data pertaining to several genomic control digests with different molecular sizes were collected in this way. 
At the end of the experiments, data on genomic control fragments spanning from 7 to 0.5 kb at roughly regular intervals of about 500 base pairs (bp) were available. </p><p>The gels were run (35 V, 20 mA) until the 2 kb marker band had reached 160 mm from the well line. On completion of electrophoresis, all digests were Southern blotted onto a nylon membrane (Hybond, Amersham) and hybridised to two different hypervariable probes (YNH24/D2S44; 3'HVR/D16) under high stringency conditions after Church and Gilbert [9]. Autoradiography of the hybridised membranes was carried out for 3-5 days, at -80°C. </p></li><li><p>The relative positions of the detected fragments were measured on a semi-automated basis, using a digitizing tablet (Summagraphics), then turned into kb values by an algorithm based on the reciprocal method of Elder and Southern [10] and stored in sequential files. Different files were created for the genomic controls and the population data. In both cases, additional information (on the geographical and ethnic origin of the individual and on the relevant experiment) was attached to each pair of kb values and recorded in duplicated sequential files. </p><p>Files containing genomic controls were used to derive standard deviations and the percentages of error underlying the procedure of kb value assignment. Files containing population data were used to calculate allele and phenotype frequencies. Two hundred samples (unrelated individuals from Central and Southern Italy) were processed in about 40 electrophoretic runs. </p><p>Results </p><p>As shown by the diagram in Fig. 1, a system was devised in which genomic control files act as keys of access to the population database. The profile of an unknown DNA sample is first converted into kb values, and a class interval is sized around each of them by ascribing ±1 to ±3 standard deviations to each individual kb value. 
Gene frequencies according to different confidence intervals are finally derived by scanning the population database and counting how many kb values fall within the class interval. Care is taken to use the percentage of error of the genomic control that is as close in length as possible to the fragment whose frequency is sought. </p><p>Discussion </p><p>The procedure of calculating gene frequencies outlined above treats VNTR alleles as continuous random variables [11,12]. While the distribution of populations of VNTR alleles does not fit the Gaussian distribution (being multimodal), it can conversely be demonstrated that each individual fragment, if repeatedly measured, generates its own subset of values which fit a normal distribution curve. As a consequence, and as long as the hypothesis of normal distribution of the errors of measurement holds, fragments from a population may be sampled according to the variance of every assigned weight. </p><p>To classify fragments of a given population, a variance (s) should be experimentally ascribed to each kb measurement. This would involve repeating every measurement an adequate number of times under standard experimental conditions. Such a procedure is obviously not practical. A possible way to circumvent this problem is to compile statistics from serial measurements of a few selected restriction fragments and assume their variance to represent the error in weighing fragments, regardless of their size. </p></li><li><p>[Figure 1: flow diagram with boxes labelled GENOMIC CONTROLS, SOUTHERN BLOT, Kb VALUES, ARCHIVE OF POPULATION DATA and RELATIVE FREQUENCY BY POINT ESTIMATION] </p><p>Fig. 1. A scheme of the procedure by which relative frequencies of hypervariable alleles are computed. 
A standardized protocol of Southern blot analysis feeds the population database, and a selected array of fragments (genomic controls) is spread at roughly regular kb intervals. Separate computer files are provided for the archive and for each genomic control. Serial measurements of genomic controls yield the standard deviations of the kb assignment procedure. These are converted into percentages of error and ascribed to every fragment size whose frequency is sought. Point estimates at variable confidence limits are derived by scanning the database and counting how many fragments fall within the preselected confidence limit around the kb measure under evaluation. </p></li><li><p>Prior to this, it is necessary to show that the variance in the procedure of kb assignment does not significantly change within a reasonably large kb interval. </p><p>Data from our laboratory (Pascali, observations on 12 genomic controls; unpublished data) have so far shown us that the model of normal distribution applies well to most subsets of genomic control data and that a roughly uniform percentage of error (ranging between 1.0 and 1.5%) affects a rather large kb range (from 0.400 to 7.5 kb) in our experimental conditions. </p><p>We are aware that, in spite of the uniformity of percentage error in controls, transferring measures of error from selected fragments to the whole population of VNTR measurements (or to a part of it) remains a simplified inference. There are in fact arguments questioning the validity of this assumption. </p><p>First of all, the larger the distance in kb between the allelic control and the restriction fragment to be evaluated, the weaker the hypothesis of equivalence of their variances (s). Even if errors due to the simple act of measuring play a central role in determinations [11], this does not prove that they remain uniform along the whole kb range, across the widest range of experimental conditions and for every band position on the kb scale. 
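</p><p>The percentage of error discussed above can be illustrated with a minimal Python sketch (the original routines were written in BASIC and are not reproduced here). It assumes that the error of a control fragment is expressed as its sample standard deviation divided by its mean kb value; the function name <code>percent_error</code> and the readings are hypothetical, not the authors' data. </p>

```python
from statistics import mean, stdev


def percent_error(measurements_kb):
    """Return (mean, sd, sd as % of mean) for serial kb readings of one control.

    Assumption: the percentage of error is the sample standard deviation
    (n - 1 denominator) expressed as a percentage of the mean kb value.
    """
    m = mean(measurements_kb)
    sd = stdev(measurements_kb)
    return m, sd, 100.0 * sd / m


# Hypothetical serial readings (kb) of two genomic control fragments.
controls = {
    "control_2kb": [2.02, 1.98, 2.00, 2.03, 1.97],
    "control_5kb": [5.05, 4.95, 5.00, 5.08, 4.92],
}

for name, readings in controls.items():
    m, sd, pct = percent_error(readings)
    print(f"{name}: mean={m:.3f} kb, sd={sd:.3f} kb, error={pct:.2f}%")
```

<p>Comparing the percentages obtained for controls of different sizes is one simple way to check whether the error stays roughly uniform over the kb range of interest. 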
</p><p>Factors involving local electrophoretic band shifts and/or fluctuations in measured kb values have already been noticed [13]. It has also been observed [14] that measurement error increases with molecular weight. We actually experienced both phenomena (Pascali, unpublished data), presumably due to local endosmosis, uneven coefficients of diffusion of the bands and different thicknesses of autoradiographed bands. These observations argue against adopting the standard deviation of just one genomic control to validate procedures for classifying the whole range of hypervariable alleles. </p><p>As a compromise, we have set out to establish and permanently update a library of several control fragments throughout the kb scale of interest to us. As the general database of fragment populations grows in size with the addition of new data, the genomic control data grow as well (since controls are included in every test). Standard deviations (kb%) can be calculated by reference to the closest control fragment. </p><p>The problem of estimating VNTR frequencies is controversial [15], and several computational procedures have so far been adopted [14,16,17]. </p><p>The so-called binning approach is the most obvious procedure to obtain relative frequencies and is one of the most widespread. </p><p>A given population of fragments is sorted into an ascending array and stored in discrete classes [18], so that each class of continuous values becomes an easier-to-handle population of discrete data. Increasing the number of classes involves a proportionally higher fluctuation of the values contained in each class if the population is sampled again. Conversely, if only a few classes were used, they could miss some significant features of the underlying population. 
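</p><p>The class subdivision described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' BASIC routine: classes have a fixed width, each class is identified by its lower bound, and the names <code>bin_frequencies</code>, <code>width_kb</code> and <code>origin_kb</code> as well as the fragment sizes are hypothetical. </p>

```python
import math
from collections import Counter


def bin_frequencies(sizes_kb, width_kb, origin_kb=0.0):
    """Relative frequency of fragments per fixed-width class.

    Keys are the lower bounds of the classes (kb); values are the fraction
    of fragments falling in [lower, lower + width_kb).
    """
    counts = Counter(math.floor((s - origin_kb) / width_kb) for s in sizes_kb)
    n = len(sizes_kb)
    return {
        round(origin_kb + index * width_kb, 6): count / n
        for index, count in sorted(counts.items())
    }


# Hypothetical fragment sizes (kb).
sizes = [2.05, 2.10, 2.45, 2.55, 2.60, 3.10]

print(bin_frequencies(sizes, width_kb=0.5))                  # classes start at 0 kb
print(bin_frequencies(sizes, width_kb=0.5, origin_kb=0.25))  # classes shifted by 0.25 kb
```

<p>Running the routine with two different origins on the same data shows how the frequency assigned to a given fragment size depends on where the classes are made to start. 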
</p><p>Although no general rule may be adopted to assess a straightforward class interval, several laboratories adopt this approach (and its graphical counterpart, the bar histogram) as the current method to estimate the distribution of VNTR alleles [16,17]. We occasionally use it to represent the overall distribution of fragment sizes, but not as a way to estimate individual gene frequencies (a short BASIC routine enables subdivision of VNTR measurements into classes and representation by a bar histogram; the programme is available on request). </p></li><li><p>One intriguing feature of the binning approach is the fact that relative frequencies of arbitrary classes (bins) vary if the starting point for the class subdivision is changed. Apparently, in this case the same density distribution of the population sample is sufficiently well represented by both arrays, whereas kb values corresponding to given bands may have significantly different relative frequencies. A different way of computing hypervariable allele frequencies has recently been employed by Gill et al. [14]. </p><p>With this method, called the Sliding Window Fit approach, the ideal migrational field encompassing all observed DNA fragments is subdivided into numerous position units. Two adjacent position units intercept the minimal distance still resolvable by a computer graphical grid. After Southern blot analysis, DNA bands are assigned to the nearest position unit. Since the experimental procedure generates fluctuations of the electrophoretic positions of the fragments, and any given band may be ascribed to several neighbouring positions when repeatedly electrophoresed, the sliding window approach estimates VNTR frequencies by first determining the mean error in positioning the bands (this is easily obtained from serial data on the genomic controls) and then grouping all bands which fall within ±1 to ±3 standard deviations of selected position units. 
Frequencies derived from this method are used in place of those drawn from arbitrary subdivisions into classes. This procedure was originally described to work in association with an automated optical scanner capturing...</p></li></ul>
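<p>The point estimation outlined in the Results (a class interval of ±1 to ±3 standard deviations around a kb value, with the error taken from the closest genomic control) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' BASIC code: the percentage of error is applied to the fragment size itself, the frequency is expressed relative to the number of fragments in the database, and every name and number below is hypothetical. </p>

```python
def nearest_control_error(size_kb, error_pct_by_control_kb):
    """Percentage of error of the control fragment closest in size."""
    nearest = min(error_pct_by_control_kb, key=lambda c: abs(c - size_kb))
    return error_pct_by_control_kb[nearest]


def window_frequency(size_kb, population_kb, error_pct_by_control_kb, k=2):
    """Fraction of database fragments within +/- k SD of size_kb.

    Assumption: SD (kb) = size_kb * percentage error of nearest control / 100.
    Returns (relative frequency, (lower limit, upper limit)).
    """
    sd = size_kb * nearest_control_error(size_kb, error_pct_by_control_kb) / 100.0
    low, high = size_kb - k * sd, size_kb + k * sd
    hits = sum(1 for x in population_kb if low <= x <= high)
    return hits / len(population_kb), (low, high)


# Hypothetical database of fragment sizes (kb) and control errors (%).
population = [2.00, 2.02, 2.05, 2.50, 3.00, 3.02, 4.00, 4.10, 5.00, 6.00]
control_errors = {2.0: 1.0, 4.0: 1.5}  # control size (kb) -> error (%)

for k in (1, 2, 3):
    freq, limits = window_frequency(2.0, population, control_errors, k=k)
    print(f"k={k}: window={limits[0]:.3f}-{limits[1]:.3f} kb, frequency={freq:.2f}")
```

<p>Widening the window from ±1 to ±3 standard deviations can only keep or increase the number of fragments counted, so the larger window gives the more conservative (higher) frequency estimate for the same fragment. </p>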