

Proc. Natl. Acad. Sci. USA
Vol. 89, pp. 1075-1079, February 1992
Genetics

Duplication-targeted DNA methylation and mutagenesis in the evolution of eukaryotic chromosomes

(repeat-induced point mutation/CpG dinucleotide)

MAJA C. KRICKER, JOHN W. DRAKE*, AND MIROSLAV RADMAN†

Laboratory of Molecular Genetics, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709

Communicated by Carl R. Woese, October 28, 1991

ABSTRACT Mammalian genomes are threatened with gene inactivation and chromosomal scrambling by recombination between repeated sequences such as mobile genetic elements and pseudogenes. We present and test a model for a defensive strategy based on the methylation and subsequent mutation of CpG dinucleotides in those DNA duplications that create uninterrupted homologous sequences longer than about 0.3 kilobases. The model helps to explain both the diversity of CpG frequencies in different genes and the persistence of gene fragmentation into exons and introns.

The human genome harbors about a million interspersed repetitive elements and many gene families, affording a myriad of opportunities for ectopic (out-of-register) recombination generating deletions, additions, and chromosomal rearrangements (1) that can result in disability and disease. Even if the recombination frequency between duplicated sequences were as low as 10^-6, the majority of cells would accumulate gross chromosomal rearrangements or aberrations. Yet mammalian genomes are remarkably stable: chromosomal rearrangements caused by inter-repeat recombination are seen only rarely, as in cellular events causing hereditary disease or cancer. What molecular mechanisms suppress ectopic recombination among repeated sequences in animal and plant chromosomes while allowing those chromosomes to recombine accurately as sister chromatids in mitosis and as homologs in meiosis?
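The arithmetic behind the "majority of cells" statement can be sketched as follows. This is only an illustration under assumptions that are not in the text: each of about 10^6 repeats is taken to recombine ectopically with probability 10^-6 per cell, independently of the others, so the number of events per cell is roughly Poisson distributed. The function name is hypothetical.

```python
import math


def fraction_of_cells_with_rearrangement(n_repeats: float, freq_per_repeat: float) -> float:
    """Probability of at least one ectopic event per cell.

    Assumption (not from the paper): events are independent, so their number
    per cell is approximately Poisson with mean n_repeats * freq_per_repeat.
    """
    expected_events = n_repeats * freq_per_repeat
    return 1.0 - math.exp(-expected_events)


# About a million repeats, each recombining at a frequency of 10^-6.
p = fraction_of_cells_with_rearrangement(1e6, 1e-6)
print(f"{p:.2f}")  # 0.63
```

Under these assumptions the expected number of events per cell is 1, and about 63% of cells would carry at least one rearrangement.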

Studies in bacteria, yeast, and mammalian cells show that homologous recombination is inhibited by decreasing the degree of sequence similarity; reducing similarity by only a few percent sharply reduces recombination (2-5). These effects can be explained by the known properties of recombination enzymes and the specificity of mismatch repair as an editor of recombination in bacteria (2-4, 6). Defects in bacterial mismatch repair greatly increase interspecies recombination (6), and chromosomal rearrangements resulting from intrachromosomal recombination between 3.7-kilobase (kb) sequence repeats diverged by about 3-4% (7). Intrachromosomal recombination between identical repeats in mouse cells is an efficient process if segments of sequence identity are longer than about 0.3 kb. However, 19% sequence divergence in the same repeats inhibits their recombination by 1000-fold compared with identical repeats (5).

The human growth hormone gene GH1 is flanked by highly homologous sequences and by 48 Alu elements. Familial growth hormone deficiency type 1A is caused by ectopic recombination that deletes both GH1 alleles on homologous chromosomes (8). Of 10 independent deletions, 9 occurred within 99%-identical 594-nucleotide (nt) segments and one within 98%-identical 274-nt segments flanking the GH1 gene. No deletion breakpoints were in Alu sequences, which are only 85% identical. This result again suggests that efficient ectopic recombination requires highly homologous segments and that the divergence typical of Alu sequences is sufficient to thwart recombination between them.

Thus, it is likely that the existing sequence divergence and polymorphism among repeated sequences accounts for the current stability of eukaryotic genomes. But how is stability achieved after the amplification of sequences when the resulting repeats are identical? The spread of identical repeats such as transposons should create a genetic time bomb unless efficient mechanisms exist to suppress their amplification and recombination. Sequence divergence by rare spontaneous mutations is not likely to be an efficient mechanism. However, an efficient germ-line process capable of identifying, specifically modifying, and mutating repetitive sequences has been demonstrated in fungal systems. In Neurospora crassa and Ascobolus immersus, DNA duplications that enter the sexual cycle are often extensively methylated, leading to their immediate functional inactivation. In N. crassa, methylation at cytosines is accompanied by a very high rate of G·C → A·T mutation, a phenomenon designated as "repeat-induced point mutation" (RIP) (9) or ripping. Because C → T mutagenesis is associated with extensive cytosine methylation, ripping may occur by the facilitated deamination of 5-methylcytosine (5MeC) to thymine. Despite the existence of a G·T → G·C mismatch repair system, 5MeC acts as an intermediate in C → T mutagenesis both in bacteria and in mammalian cells (10-12).

In vertebrate DNA, about 60-90% of CpG dinucleotides are methylated at the 5 position of cytosine (13-15). CpGs occur at only about 1/5 of the expected frequency in bulk DNA, suggesting 5MeC deamination to yield thymine. About 1% of total vertebrate DNA consists of unmethylated CpG-rich "islands" that have the CpG frequencies expected from their base composition and are composed of unique sequences by the criterion of DNA reassociation. These islands contain about half of the unmethylated CpGs in the genome. They are suspected of playing a role in gene expression because they are associated with housekeeping genes and some tissue-specific genes and because the transcription of some genes with Hpa II tiny fragment (HTF) islands is inhibited when the island is methylated (16).

We propose that, in addition to its putative role in gene control, CpG methylation has an important role in the evolution and stability of chromosome structure: it provides a means to specifically mark and diversify duplicated sequences and thereby to protect against recombination-mediated chromosome rearrangements. At present there is no coherent explanation why some DNA segments are methylated while others are not. CpG islands extend from ~0.5 to >3 kb and can include the 5' ends of genes, the first few exons, portions of introns, or the 3' regions of genes. We propose that CpG islands remain unmethylated because they are unique sequences. We further suggest that precisely duplicated sequences are targeted for methylation by the pairing of homologous regions. In support of this hypothesis, we present evidence from sequence analyses demonstrating that repeated sequences in mammals preferentially experience a high frequency of transition mutations at sites of cytosine methylation. We propose that this process transforms actively amplifying sequences into inert nonrecombining DNA, thus stabilizing vertebrate genomes. We then estimate the minimum length of homologous sequence required for such a process and suggest that gene fragmentation into exons protects coding sequences against homologous interactions with their own processed pseudogenes.

Abbreviations: nt, nucleotide(s); 5MeC, 5-methylcytosine; LINE, long interspersed element; MUP, major urinary protein; TPI, triosephosphate isomerase; SINE, short interspersed element; TK, thymidine kinase.
*To whom reprint requests should be addressed.
†Permanent address: Institute Jacques Monod, 75251 Paris Cedex 05, France.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

METHODS

The process we propose will leave specific traces in the pattern of DNA sequence evolution. Because the most frequently methylated sequence in vertebrates is the dinucleotide CpG, the cytosine deamination model predicts that repeated sequences should display a history of numerous transitions from CpG to TpG and CpA dinucleotides. Thus, gene families and repeated sequences would have low CpG frequencies compared with unique or unmethylated sequences.

We therefore compared the nucleotide and dinucleotide compositions of members of functional gene families, interspersed repetitive elements, pseudogenes, housekeeping genes, introns, and unmethylated sequences. Sequences were obtained from GenBank or EMBL data bases provided by the Genetics Computer Group. Nucleotide compositions and sequence alignments were generated by using the Genetics Computer Group sequence analysis software package (17). In each sequence, the observed CpG frequency was normalized to that expected from the product of its cytosine and guanine frequencies. We note, but will not here elaborate, that other normalizations are possible, such as comparisons against reciprocal GpC, GpT, and ApC dinucleotides or against hypothetical unripped consensus sequences. While perhaps appealing, these alternative normalizations involve additional assumptions and complicate the analyses. They tend to increase the contrasts between unique and repeated sequences and thus to strengthen our conclusions.

Mammalian genomes contain over 10^4 copies of long interspersed elements (LINEs). Each mammalian species contains a single LINE family, always designated L1 regardless of species (18). Each LINE contains two open reading frames, the second homologous to reverse transcriptase genes (19). We analyzed the second open reading frame of the consensus L1 sequence in three species of mice and one of rats, and L1 open reading frames from the factor VIII and hemoglobin genes of humans.

We examined several families of functional genes. (i) The primate β-globin gene family comprises the adult β- and δ-globins, the embryonic ε-globin, the fetal Gγ- and Aγ-globins, and the β pseudogene. All six apparently derive from a single ancestral gene (20). We analyzed all three exons from each gene in the human cluster and the exons of the δ- and ε-globins from several species. (ii) Mouse major urinary proteins (MUPs) are the most abundant proteins of the male mouse liver. Most of the 35 MUP genes have been classified into two groups with approximately 15 members each (21). We analyzed the seven exons from a representative of the actively expressed group 1 genes. (iii) The cytochrome P450 superfamily comprises 20 gene families, 10 of which are present in all mammals (22). There are 60-200 functional cytochrome P450 genes in any mammalian species. We analyzed the first exon of two closely related genes in the rabbit alcohol-inducible cytochrome P450 subfamily. (iv) Mammalian genomes have ~35 different cytochrome c sequences (23). We analyzed the single functional human gene.

Pseudogenes are under relaxed constraints compared with their progenitor genes. We analyzed the coding sequences of a group-2 MUP pseudogene, three human cytochrome c pseudogenes, a human β-globin pseudogene, and three human triosephosphate isomerase (TPI) pseudogenes.

For unique sequences, we examined 12 housekeeping genes for which there was no evidence of homology with other sequences except for processed pseudogenes.

When repeated sequences are eliminated, introns contain unique sequences less subject to selective pressures than exons of housekeeping genes, as witnessed by frequent intra-intronic insertions of repeated elements. We analyzed four examples that were apparently free of insertions.

Active mammalian α-globin genes are unmethylated. The ζ-globin gene encodes a fetal form of α-globin replaced during development by the α1 and α2 globins; while β-globin cytosines are methylated, α-globin cytosines are not (24).

Drosophila and yeast do not detectably methylate their cytosines (25), and we analyzed three Drosophila LINE-like transposons and a yeast Ty1 element.

The mammalian genome contains about 5 × 10^5 copies of short interspersed elements (SINEs), including the human Alu family (26). We examined eight Alu sequences lurking in the introns of the human complement component C1-inhibitor gene, and one newly inserted Alu (27).

Diverse criteria were used to choose sequences for analysis. The longest LINEs were chosen to reduce sampling error. Globins were chosen because sequences were available for most members of the globin family from a variety of species and because some were unmethylated; genes from other families were simply chosen without conscious bias. Pseudogenes were chosen as derivatives of either unique genes or functional gene families. A convenient set of Alu sequences was chosen from a single locus; Alu sequences within other regions gave similar results. Housekeeping genes were chosen for which information was available on the presence or absence of pseudogenes. Intron sequences were chosen to be free of repetitive elements. Mobile elements among nonmethylating organisms were chosen for similarity of evolutionary origin to vertebrate elements.
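The normalization described in the Methods (observed CpG count divided by the count expected from the cytosine and guanine content) can be sketched as below. This is a minimal illustration, not the Genetics Computer Group software the authors used; the function name and the toy sequence are hypothetical.

```python
def cpg_observed_expected(seq: str) -> tuple[int, float, float]:
    """Return (O, E, O/E) for CpG dinucleotides in one strand.

    O is the number of CpG dinucleotides; E = (number of C) * (number of G)
    / (sequence length), the count expected from base composition alone.
    """
    s = seq.upper()
    n = len(s)
    observed = sum(1 for i in range(n - 1) if s[i:i + 2] == "CG")
    expected = s.count("C") * s.count("G") / n if n else 0.0
    ratio = observed / expected if expected else float("nan")
    return observed, expected, ratio


# Toy 16-nt sequence (hypothetical): 4 C, 4 G, two CpG dinucleotides.
print(cpg_observed_expected("ACGTACGTTTGGCCAA"))  # (2, 1.0, 2.0)
```

A ratio near 1 means CpG occurs about as often as base composition predicts; ratios well below 1 indicate CpG depletion.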

RESULTS AND DISCUSSION

Patterns of CpG Depletion. The results of our analyses appear in Table 1. It is immediately obvious that most repeated sequences (LINEs, members of functional gene families, and repeated pseudogenes) contain substantial deficits of CpG dinucleotides compared with most unique sequences (housekeeping genes and unique introns).

Table 1 also reveals that repeated but unmethylated sequences (the mammalian α-globin genes and transposons in organisms that do not methylate their DNA) contain substantially more CpGs than do mammalian repetitive sequences or even housekeeping genes. Thus, methylation and ripping are closely associated, as expected if 5MeC is indeed a mechanistic intermediate in the accelerated loss of CpGs. The special case of α-globin genes is discussed below.

We performed pairwise comparisons of the means of observed/expected CpG frequencies for each group using the modified Tukey-Kramer method that adjusts for unequal sample sizes (28, 29). The means of repeated groups (LINEs, pseudogenes, members of functional gene families) are not significantly different from each other at the 95% confidence level, but they differ significantly from the means of the unique and unmethylated groups (housekeeping genes, introns, and Drosophila and yeast transposons).

Table 1. CpG dinucleotides in different classes of genes

Gene or sequence                 Size, bp     O     E     O/E
LINEs
  Consensus Mus domesticus            315     0    14   0.000
  Consensus Mus caroli                315     1    13   0.078
  Consensus Mus platythrix            315     1    12   0.085
  L1-h in human β-globin             2101    10    76   0.132
  L1-c in human factor VIII          2145    18    86   0.210
  L1 rat consensus                    319     3    12   0.247
  L1-b in human factor VIII          2145    21    84   0.250
  L1-a in human factor VIII           477     5    19   0.268
  Mean                                                  0.159
  SD                                                    0.099
Members of functional gene families
  Orangutan δ-globin                  444     2    32   0.063
  Human Aγ-globin                     441     2    32   0.063
  Human Gγ-globin                     441     2    30   0.066
  Human δ-globin                      441     3    32   0.093
  Human β-globin                      441     5    35   0.142
  Human ε-globin                      441     6    35   0.174
  Goat ε-globin                       438     6    32   0.186
  Spider monkey δ-globin              443     7    35   0.199
  Tarsius syrichta δ-globin           444     8    35   0.232
  Rabbit Cyt P450 IIE2 exon 1         398     7    28   0.249
  Human Cyt c                         318     4    15   0.264
  Human HPRT                          657     8    27   0.299
  MUP group 1                         543     7    23   0.308
  Rabbit Cyt P450 IIE1 exon 1         401    10    30   0.332
  Mean                                                  0.191
  SD                                                    0.095
Processed pseudogenes
  Human Cyt c pseudogene 1            318     1    14   0.073
  MUP group 2 exons 1-7               776     3    29   0.103
  Human Cyt c pseudogene 2            317     2    15   0.136
  Human β-globin pseudogene           439     5    27   0.189
  Human Cyt c pseudogene a            309     3    15   0.200
  Human TPI pseudogene 13C           1005    15    53   0.217
  Human TPI pseudogene 19A           1003    24    71   0.338
  Human TPI pseudogene 5A            1280    32    81   0.393
  Mean                                                  0.206
  SD                                                    0.111
Unique housekeeping genes
  Human HMG-CoA reductase            2667    40   128   0.312
  Human G3P dehydrogenase            1008    29    77   0.379
  Eel Na channel protein             5463   113   295   0.383
  Human DNA polymerase β             1008    18    43   0.423
  Mouse TK                            621    23    47   0.490
  Human Na/K ATPase                   912    24    48   0.499
  Human Cu/Zn SOD                     465    14    27   0.523
  Human TK                            705    35    64   0.546
  Human APRT                          841    49    88   0.560
  Human TPI                          1076    37    62   0.594
  Human G6P dehydrogenase            1440    77   127   0.606
  Chicken Na/K ATPase                 918    38    63   0.606
  Mean                                                  0.493
  SD                                                    0.098
Unique introns
  Human TK intron d                  1503    42   120   0.349
  Human APRT intron 2                 993    41   100   0.409
  Human HPRT intron a 5'             1166    58    88   0.659
  Cyt c intron a                     1073    46    64   0.723
  Mean                                                  0.535
  SD                                                    0.183
Unmethylated mammalian genes
  Orangutan α1-globin                 429    33    44   0.760
  Orangutan α2-globin                 429    34    44   0.776
  Human ζ-globin                      429    42    45   0.943
  Mean                                                  0.826
  SD                                                    0.101
Repeated sequences in nonmethylating organisms
  Drosophila jockey element          1752    62    84   0.735
  Yeast Ty1-H3 element ORF b         3985   112   137   0.818
  Drosophila Fw element ORF 1        2576   110   124   0.890
  Drosophila I factor                1290    39    43   0.899
  Mean                                                  0.835
  SD                                                    0.076

bp, Base pairs; O, observed number of CpGs in a sequence; E, expected number of CpGs in a sequence = (number of C residues in sequence) × (number of G residues in sequence) ÷ (total bases in sequence). APRT, adenosine phosphoribosyltransferase; Cyt, cytochrome; G3P, glyceraldehyde 3-phosphate; G6P, glucose 6-phosphate; HMG-CoA, hydroxymethylglutaryl coenzyme A; HPRT, hypoxanthine phosphoribosyltransferase; ORF, open reading frame; SOD, superoxide dismutase; TK, thymidine kinase.

The results support our proposal that mammalian genomes possess a general mechanism for speeding the divergence of repetitive sequences and inactivating mobile elements. They also help to rationalize the diversity of CpG frequencies in mammalian genes. An alternative hypothesis for the paucity of CpGs in repetitive sequences is that methylation occurs irrespective of duplication and that duplication merely relaxes the stringency of selection against new transitions at CpGs. Several observations contradict this hypothesis. (i) Families of functional (highly stringent) genes, many carrying essential functions, have as few CpGs as do LINEs. (ii) Unique regions in introns are nonessential sequences but average no fewer CpGs than do exons of housekeeping genes. (iii) CpG-rich islands in active genes are relatively short, unique sequences; they remain unmethylated unless the gene has been switched off (15).

Because unmethylated sequences can have CpG values ≥0.8 (Table 1), the mean CpG value of ~0.5 for unique sequences suggests that even they have been ripped, perhaps reflecting their ancient origin by duplication, or limited ripping by currently unrecognized pseudogenes.

While our analysis used mostly mammalian examples, repeat-induced methylation and ripping should occur generally in methylating eukaryotes, including vertebrates, some invertebrates, plants, and some fungi. Because the sequence specificity of methylation differs among plants, fungi, and vertebrates, CpG depletion might apply only to vertebrates. In addition, rates of ripping may vary greatly in different eukaryotes. Thus, in N. crassa ripping is rapid, while in A. immersus duplications may be inactivated initially by methylation and subsequently by slow ripping (9).

Mutational Specificity at CpG Sites. If CpG sites are ripped by cytosine deamination, transitions to TpG and CpA will predominate. We tested this prediction in three situations. (i) When nine clustered human Alu sequences were aligned and compared with a consensus sequence (30) from another 30 human Alu sequences, more than half of the nucleotide changes were at CpG positions (Fig. 1). The alternate dinucleotide was TpG or CpA at about 90% of the CpG sites, indicating that CpG → TpG and CpG → CpA transitions are the major consequences of ripping in mammals and a major source of Alu sequence divergence. (Alu 9 is a special case that will be considered below.) (ii) The TPI gene and its 13C pseudogene (31) were aligned and CpG positions were analyzed over a 464-nt region of homologous coding sequence. Of the 20 CpGs in the active gene, 14 had mutated in the pseudogene, 13 by transition and 1 by transversion. (iii) We aligned a 2000-nt region in the second open reading frame of primate LINEs and derived a consensus sequence from at least seven LINEs at each position, a number that varies because most LINEs harbor deletions. (The alignment is available from M.C.K. upon request.) We compared each sequence at the 13 consensus CpG sites and detected 66 transitions and 17 transversions. The frequency of transitions was 79% of all substitution mutations at CpGs.

FIG. 1. Variation at CpG sites in Alu sequences (rows: Alu consensus and Alu 1 through Alu 9). Symbols in the original figure distinguish CpG, CpA, TpG, TpA, and transversions; absence of a symbol indicates a deletion that includes a CpG site.
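The bookkeeping used in these three comparisons can be sketched as follows: each consensus CpG site is looked up in an aligned sequence and the dinucleotide found there is classified. This is an illustrative sketch only; the function names, category labels, and toy sequences are hypothetical, and the alignment is assumed to be gapped with "-".

```python
from collections import Counter


def classify_cpg_site(dinucleotide: str) -> str:
    """Classify what an aligned sequence carries at a consensus CpG site."""
    d = dinucleotide.upper()
    if "-" in d:
        return "deletion"
    if d == "CG":
        return "unchanged"
    if d in ("TG", "CA"):
        # C -> T on one strand reads as TpG; on the other strand it reads as CpA.
        return "transition"
    if d == "TA":
        return "double transition"
    # Every other dinucleotide requires at least one transversion.
    return "transversion"


def tally_cpg_sites(consensus: str, aligned: str) -> Counter:
    """Count outcomes at every CpG of the consensus (equal-length alignment)."""
    if len(consensus) != len(aligned):
        raise ValueError("sequences must be aligned to the same length")
    cons = consensus.upper()
    tally = Counter()
    for i in range(len(cons) - 1):
        if cons[i:i + 2] == "CG":
            tally[classify_cpg_site(aligned[i:i + 2])] += 1
    return tally


# Toy alignment (hypothetical): three consensus CpGs, two of them mutated.
print(tally_cpg_sites("ACGTTCGAACGT", "ATGTTCAAACGT"))
```

With the LINE counts reported above, the transition fraction is 66 / (66 + 17), or about 0.795.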

Evolutionary Implications. Because of ripping, the number of nucleotide substitutions among Alu sequences, LINEs, and other repetitive elements cannot accurately reflect their chronological ages. Transitions at CpG sites in eight of the Alu sequences accounted for more than 50% of total mutations (Fig. 1), revealing that these positions mutate more rapidly than other positions in the same sequence. The numbers of non-CpG transitions were similar to the numbers of transversions; thus, ripping alters the proportions of transitions and transversions and should be considered when estimating divergence times between sequences. Ripping should therefore be factored out and analyzed separately when constructing phylogenies, and especially when evaluating the rates of putative evolutionary clocks.
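One way to factor ripping out of a divergence estimate is to count differences at reference CpG positions separately from the rest. The sketch below assumes two pre-aligned sequences of equal length containing only A, C, G, T, or "-"; the function names and toy sequences are hypothetical and this is not the authors' procedure.

```python
PURINES = {"A", "G"}


def is_transition(a: str, b: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return a != b and ((a in PURINES) == (b in PURINES))


def split_differences(reference: str, other: str) -> dict[str, int]:
    """Count differences at reference CpG positions separately from the rest."""
    ref, seq = reference.upper(), other.upper()
    if len(ref) != len(seq):
        raise ValueError("sequences must be aligned to the same length")
    cpg_positions = set()
    for i in range(len(ref) - 1):
        if ref[i:i + 2] == "CG":
            cpg_positions.update((i, i + 1))
    counts = {"cpg": 0, "non_cpg_transition": 0, "non_cpg_transversion": 0}
    for i, (a, b) in enumerate(zip(ref, seq)):
        if a == b or "-" in (a, b):
            continue  # identical or gapped positions are not substitutions
        if i in cpg_positions:
            counts["cpg"] += 1
        elif is_transition(a, b):
            counts["non_cpg_transition"] += 1
        else:
            counts["non_cpg_transversion"] += 1
    return counts


# Toy pair (hypothetical): one CpG change, one other transition, one transversion.
print(split_differences("ACGTACGTAA", "ATGTACGCAT"))
```

Divergence computed from the two non-CpG counts alone is then not inflated by the faster-mutating CpG positions.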

Surprisingly, the depletion of CpG dinucleotides in mobile elements such as LINEs presents yet a different hazard to genome stability. If extensively ripped sequences remain capable of transposition, then duplications will arise that are already CpG-poor and cannot efficiently be further methylated or ripped. Although LINEs are much older than SINEs (19, 26), LINEs show less intraspecific diversity than do SINEs in humans and rodents (18, 19). Because mammalian LINEs have few CpGs, they appear to derive mainly from already ripped sequences. We noticed that LINEs are AT-rich and that their noncoding strands are 1.5- to 2-fold richer in adenines than in other bases. Among human LINEs, G → A transitions account for nearly 70% of mutations of adenines. This may be related to a diversifying mechanism in mealworm DNA, where G·C → A·T transitions account for most of the variation in a major DNA component despite its normal CpG content and absence of 5MeC (32). Thus, even when CpG ripping is impossible, an additional mechanism may accelerate divergence.

The accumulation of repetitious noncoding sequences that have been diversified by ripping would explain most of the observed sequence polymorphism within mammalian species. Ripping as a major generator of sequence polymorphism may be instrumental in the recombinational isolation of chromosomes during both mitosis and meiosis, and it could therefore contribute to sterility in mating between closely related species (1-3). Indeed, yeast mutations preventing chromosome pairing and recombination lead to meiotic sterility because of chromosome nondisjunction (33).

Protected Sequences. The α-globin gene family remains unripped (Table 1), implying that a special mechanism exists to protect specific multicopy genes from methylation and CpG loss. Although old Alu sequences are extensively ripped, newly transposed Alu sequences such as Alu 9 are both common and unripped (27), suggesting that actively transposing Alu sequences arise from a hidden protected progenitor with a complete CpG content. A similar protective mechanism operates in N. crassa, where ~170 copies of 9.3-kb rRNA-encoding DNA are clustered at the end of a chromosome and remain unripped except when an occasional copy is transposed to an unprotected position (9).

Ripping Target Size. Because it is likely to involve a

homologous DNA interaction, ripping may require contigu-ous homologous sequence over some minimum length ('i).

Consider first the TPI gene, with seven exons, six introns, and several pseudogenes (31). The structures of the active gene and a typical pseudogene are shown in Fig. 2. There is about 90% identity between the coding regions of the active gene and three well-characterized pseudogenes. As shown in Table 2, the CpG content of the first six exons of the active gene is similar to that of other housekeeping genes and higher than for the pseudogenes. The CpG content of the seventh exon of the active gene is low, resembling that of repetitive sequences. The pseudogenes have lost all seven introns, thus gaining uninterrupted sequence homology with each other over at least 462 nt. The first six exons of the active gene lack continuous homology with the pseudogenes over segments longer than 133 nt, the size of the largest exon. The seventh exon is also short (119 nt), but homology with the pseudogenes extends an additional 445 nt into the 3' noncoding region, so that the pseudogenes and the seventh exon of the active gene are homologous over 564 nt. This suggests that a minimum continuously homologous sequence of 133 < l_m < 462 nt is required for ripping.
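The CpG observed/expected (O/E) ratios compared in this argument can be illustrated with a short Python sketch. It assumes a commonly used expectation, (number of C) × (number of G) / sequence length, which may differ in detail from the calculation behind Table 2; the function name `cpg_obs_exp` is hypothetical.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio for one DNA strand.

    Assumption (not taken from the paper): the expected CpG count is
    (count of C) * (count of G) / sequence length.
    """
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    # Without both C and G the expectation is zero; report 0.0 by convention.
    if n == 0 or c == 0 or g == 0:
        return 0.0
    observed = s.count("CG")   # "CG" cannot overlap itself, so count() is exact
    expected = c * g / n
    return observed / expected


# Toy sequences only, not real gene data.
print(cpg_obs_exp("CCGG"))      # 1.0: CpG occurs as often as expected
print(cpg_obs_exp("CCAGTGGA"))  # 0.0: C and G present but no CpG
```

On this measure, a sequence that has lost CpG dinucleotides to CpG → TpG transitions gives a ratio well below 1, as for the low-CpG entries in Table 2.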

Consider next the rabbit cytochrome P450 IIE1 and IIE2 genes (22). They share 176 homologous nt in their first exon plus 199 homologous nt 5' to the first exon, providing 375 nt

FIG. 2. Sizes (in nt) and topography of the TPI active gene and pseudogene. Black bars, exons 1-6; hatched bar, exon 7; open bar, 3' noncoding region; open spaces between bars, introns. [Diagram not reproduced; pseudogene segments: exons 1-6, 631 nt; exon 7, 119 nt; 3' noncoding region, 445 nt.]



Table 2. Contiguous homology requirements for ripping

Sequence                             CpG O/E*   Sequence length, bp   Uninterrupted homologous sequence, bp
TPI exons 1-6                        0.59       631                   133
TPI exon 7                           0.21       119                   564
TPI exon 7 + 3' noncoding region     0.15       564                   564
TPI pseudogene 13C                   0.20       909                   909
TPI pseudogene 19A                   0.35       922                   922
TPI pseudogene 5A                    0.37       1066                  500
Rabbit CytP450 IIE1 exon 1           0.33       176                   375
Rabbit CytP450 IIE2 exon 1           0.25       224                   375
Rabbit CytP450 IIE1 exon 2           0.71       132                   132
Rabbit CytP450 IIE2 exon 2           0.86                             132
Mouse TK exons 1-6                   0.55                             120
Mouse TK pseudogene exons 1-6        0.57                             120
Mouse TK exon 7                      0.22       570                   570
Mouse TK pseudogene exon 7           0.23       570                   570

*Observed/expected numbers of CpG dinucleotides.

of continuously homologous sequence; here, CpG frequencies are low (Table 2). They share only 130 nt of continuously homologous sequence in their second exon (the adjacent introns lacking homology); here, CpG frequencies are similar to those of housekeeping genes. Thus, 130 < l_m < 375 nt.

Consider next the mouse TK gene and its two pseudogenes (34). One appears by hybridization tests to have been extensively rearranged and lacks uninterrupted homology with either the active gene or the second pseudogene. The active gene has six introns, six short exons of ~120 nt, and one long final exon of 570 nt. The second pseudogene is intronless and the longest homology with the active gene is the 570-nt segment. The first six exons of the active gene and the same region in the second pseudogene have CpG frequencies similar to those of housekeeping genes. The long, 570-nt, segment, however, has low CpG frequencies in both the active gene and the second pseudogene. Thus, 120 < l_m < 570 nt.

Because Alu sequences of ~300 nt are extensively ripped, l_m < 300 nt. Pooling all values, 133 < l_m < 300 nt, a range comparable with the 150 < l_m < 800-nt range reported for N. crassa (9). This value recalls observations suggesting a similar minimum length for efficient homologous recombination in mammalian somatic cells (35). Because the minimum length for efficient recombination varies in different organisms, so may l_m. Both recombination and ripping presumably initiate with DNA pairing tests for homology, and ripping may occur at an early stage of this interaction.
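The pooling step amounts to intersecting the individual bounds: the pooled lower limit is the largest of the lower limits and the pooled upper limit is the smallest of the upper limits. A minimal Python sketch follows, using the values quoted in the text; the names `bounds` and `pool_bounds` are hypothetical, and the minimum length is written as l_m in the comments.

```python
# (lower, upper) limits on the minimum ripping length l_m, in nt,
# as quoted in the text; None means no limit is given from that source.
bounds = {
    "TPI": (133, 462),
    "rabbit P450 IIE1/IIE2": (130, 375),
    "mouse TK": (120, 570),
    "Alu": (None, 300),
}


def pool_bounds(intervals):
    """Intersect open intervals: largest lower limit, smallest upper limit."""
    intervals = list(intervals)
    lowers = [lo for lo, _ in intervals if lo is not None]
    uppers = [hi for _, hi in intervals if hi is not None]
    lo, hi = max(lowers), min(uppers)
    if lo >= hi:
        raise ValueError("the individual bounds do not overlap")
    return lo, hi


print(pool_bounds(bounds.values()))  # (133, 300)
```

The TPI gene sets the lower limit and the Alu family sets the upper limit; the rabbit P450 and mouse TK comparisons are consistent with the pooled range but do not narrow it.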

Implications for Genome Organization. Mammalian transgenes can be methylated and inactivated in the germ line in rough proportion to their copy number (36), suggesting that multicopy transgenes are subject to ripping. Effective transgene therapy for genetic disorders may therefore require introducing only a single copy of the gene.

Most mammalian genes have spawned multiple processed retropseudogenes (37). Our analysis suggests that pseudogenes might be potent CpG → TpG mutagens for functional genes. Indeed, as many as a third of point mutations causing human genetic disorders and 40% of point mutations causing some cancers result from transitions at CpG dinucleotides (12). However, we observe that functional parental genes are largely protected from being ripped by their own pseudogenes. Because more than 95% of exons are smaller than 0.3 kb (38), we propose that the fragmentation of coding sequences into exons protects genes from the awesome effect of ripping and from recombination with their retropseudogene homologs. The observation that introns are preferentially located between domain-encoding sequences suggests that exon shuffling might be an important evolutionary strategy (38, 39). Ripping, however, provides a powerful selective pressure to both generate and maintain the fragmented status of genes.

1. Petes, T. & Hill, C. H. (1988) Annu. Rev. Genet. 22, 147-168.
2. Radman, M. (1988) in Genetic Recombination, eds. Kucherlapati, R. & Smith, G. R. (Am. Soc. Microbiol., Washington), pp. 169-192.
3. Shen, P. & Huang, H. V. (1986) Genetics 112, 441-457.
4. Shen, P. & Huang, H. V. (1989) Mol. Gen. Genet. 218, 358-360.
5. Waldman, A. S. & Liskay, R. M. (1987) Proc. Natl. Acad. Sci. USA 84, 5340-5344.
6. Rayssiguier, C., Thaler, D. S. & Radman, M. (1989) Nature (London) 342, 396-401.
7. Petit, M.-A., Dimpfl, J., Radman, M. & Echols, H. (1991) Genetics 129, 327-332.
8. Vnencak-Jones, C. L. & Phillips, J. A., III (1991) Science 250, 1745-1748.
9. Selker, E. U. (1990) Annu. Rev. Genet. 24, 579-613.
10. Radman, M. & Wagner, R. (1986) Annu. Rev. Genet. 20, 523-538.
11. Coulondre, C., Miller, J. H., Farabaugh, P. J. & Gilbert, W. (1978) Nature (London) 274, 775-780.
12. Rideout, W. M., III, Coetzee, G. A., Olumi, A. F. & Jones, P. A. (1990) Science 249, 1288-1290.
13. Bird, A. P. (1986) Nature (London) 321, 209-213.
14. Bird, A. P. (1987) Trends Genet. 3, 342-346.
15. Bird, A., Taggart, M., Frommer, M., Miller, O. J. & MacLeod, D. (1985) Cell 40, 91-99.
16. Antequera, F., Boyes, J. & Bird, A. (1990) Cell 62, 503-514.
17. Devereux, J., Haeberli, P. & Smithies, O. (1984) Nucleic Acids Res. 12, 387-395.
18. Singer, M. F. & Skowronski, J. (1985) Trends Biochem. Sci. 10, 119-122.
19. Hutchison, C. A., III, Hardies, S. C., Loeb, D. L., Shehee, W. R. & Edgell, M. H. (1989) in Mobile DNA, eds. Berg, D. E. & Howe, M. M. (Am. Soc. Microbiol., Washington), pp. 593-617.
20. Hardies, S. C., Edgell, M. H. & Hutchison, C. A., III (1984) J. Biol. Chem. 259, 3748-3756.
21. Shi, Y., Son, H. J., Shahan, K., Rodriguez, M., Costantini, F. & Derman, E. (1989) Proc. Natl. Acad. Sci. USA 86, 4584-4588.
22. Gonzalez, F. J. & Nebert, D. W. (1990) Trends Genet. 6, 182-186.
23. Evans, M. J. & Scarpulla, R. C. (1988) Proc. Natl. Acad. Sci. USA 85, 9625-9629.
24. Perutz, M. F. (1990) J. Mol. Biol. 213, 203-206.
25. Proffitt, J. H., Davie, J. R., Swinton, D. & Hattman, S. (1984) Mol. Cell. Biol. 4, 985-988.
26. Deininger, P. L. & Daniels, G. R. (1986) Trends Genet. 2, 76-80.
27. Stoppa-Lyonnet, D., Carter, P. E., Meo, T. & Tosi, M. (1990) Proc. Natl. Acad. Sci. USA 87, 1551-1555.
28. Kramer, C. Y. (1956) Biometrics 12, 307-310.
29. SAS Institute (1985) SAS User's Guide: Statistics (SAS Inst., Cary, NC), Version 5, pp. 470-476.
30. Britten, R. J., Baron, W. F., Stout, D. B. & Davidson, E. H. (1988) Proc. Natl. Acad. Sci. USA 85, 4770-4774.
31. Brown, J. R., Daar, I. O., Krug, J. R. & Maquat, L. E. (1985) Mol. Cell. Biol. 5, 1694-1706.
32. Ugarkovic, D., Plohl, M. & Gamulin, V. (1989) Gene 83, 181-183.
33. Roeder, S. (1990) Trends Genet. 6, 385-389.
34. Seiser, E., Knöfler, M., Rudelstorfer, I., Haas, R. & Wintersberger, E. (1989) Nucleic Acids Res. 17, 185-197.
35. Bollag, R. J., Waldman, A. S. & Liskay, R. M. (1989) Annu. Rev. Genet. 23, 199-225.
36. Mehtali, M., LeMeur, M. & Lathe, R. (1990) Gene 91, 179-184.
37. Li, W.-H. & Graur, D. (1991) Molecular Evolution (Sinauer, Sunderland, MA).
38. Dorit, R. L., Schoenbach, L. & Gilbert, W. (1990) Science 250, 1377-1382.
39. Gilbert, W. (1978) Nature (London) 271, 501.
