

J Mol Evol (1985) 22:195-208 Journal of Molecular Evolution © Springer-Verlag 1985

DNA Sequence Patterns in Human, Mouse, and Rabbit Immunoglobulin Kappa-Genes

Samuel Karlin and Ghassan Ghandour

Department of Mathematics, Stanford University, Stanford, California 94305, USA

Summary. DNA sequences of the human, mouse, and rabbit immunoglobulin kappa-gene (J-C regions) are compared with respect to various DNA patterns, including dyad symmetry pairings, runs of nucleotides, repeat clusters, and repeats that occur with unusually high frequency. The significant dyad symmetry pairings within each of the sequences emphasize the two "control-enhancer" elements of the J5-C intron. Dyad symmetry pairs between the J-C region and a number of kappa variable (V)-gene domains suggest differences in the affinities between the V and J segments. It is the "consensus heptamer", rather than the "consensus nonamer", that embodies the longest V-J dyad symmetry combinations. In the rabbit there are long runs and repeat clusters of the sequences that identify regions of high duplication; these regions are absent in the human and mouse sequences. High-frequency oligonucleotides feature the consensus nonamer 5' to the J segments, especially in the mouse sequence.

Key Words: Dyad symmetries - Consensus sequences - Repeat clusters - High-frequency oligonucleotides

Introduction

DNA "consensus" elements on the 5' side of structural genes that function in transcription control have been identified in prokaryotes. Attempts to characterize corresponding control sites in eukaryotes have been less successful. Proposed control sites for RNA polymerase II include the TATA box, placed about 30 nucleotides upstream from the initiation site of transcription, and, at a variable distance [40-120 base pairs (bp)] upstream, the CCAAT box, or the presence of a G + C-rich block in this vicinity. Consensus sequences for RNA polymerase III have been identified downstream from the transcription start site (e.g., see Grosschedl and Birnstiel 1980; Tsujimoto et al. 1981). Recent investigations of enhancer elements (located more distantly both downstream and upstream) that stimulate and substantially increase gene transcription have been reported for mammalian viruses (e.g., see Gluzman and Shenk 1983) and for the introns of both the heavy and light immunoglobulin (Ig) genes (see Gillies et al. 1983; Queen and Baltimore 1983). The performance of the Ig enhancer tends to be tissue and species specific. Although a consensus core has been proposed (e.g., see Laimins et al. 1983), the essential DNA composition of enhancer elements is unknown.

Offprint requests to: S. Karlin

An appealing concept is that consensus patterns (i.e., DNA segments of similar features but not necessarily the same content), rather than specific consensus oligonucleotides, proximal to a given gene (or set of genes) and shared among species or among genes within species may underlie biological controls. Our objective in this paper is to extend our previous homology analyses (Karlin and Ghandour 1985a; Karlin et al. 1985) by ascertaining and interpreting convergence and divergence of various DNA patterns in the human, mouse, and rabbit Ig kappa-genes (J-C gene segments, intervening regions, and flanks). In particular, we examine global and close dyad symmetry (inverted complementary) pairings, repeat clusters, extended runs of specific nucleotide types, and frequently occurring oligonucleotides. We further identify all block identities and exact dyad symmetry pairings above a certain length between the variable (V) regions and the J-C regions.

It is hypothesized (e.g., Early et al. 1980; Hieter et al. 1982) that appropriate DNA recognition sites contribute to the process of V-J joining. Specifically, the consensus nonamer GGTTTTTGT and consensus heptamer CACTGTG (the latter of which abuts the J-gene segments), which are separated by a 23 ± 2-bp spacer segment, are proposed as V-J recombination control sites. The importance of these oligonucleotides is further suggested by the existence of their approximate inverted complements proximal 3' to the V domains but separated by a 12-bp spacer segment. A number of questions arise. How strong is the dyad symmetry correspondence between the V and J consensus elements? In terms of these recognition sequences, do all J segments relate equally strongly to a given V, or do certain V domains and J-gene segments tend to be more strongly associated? Are the associations of equal magnitude for the consensus nonamer and the consensus heptamer?
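To make the dyad relationship above concrete, here is a minimal Python sketch that computes the inverted complements of the two consensus elements quoted in the text. The function name `inverted_complement` is an illustrative choice and not notation from the paper.

```python
# Watson-Crick complements for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def inverted_complement(oligo: str) -> str:
    """Return the inverted (reverse) complement of a DNA oligonucleotide."""
    return oligo.translate(COMPLEMENT)[::-1]

NONAMER = "GGTTTTTGT"   # consensus nonamer quoted in the text
HEPTAMER = "CACTGTG"    # consensus heptamer quoted in the text

nonamer_dyad = inverted_complement(NONAMER)
heptamer_dyad = inverted_complement(HEPTAMER)
print(nonamer_dyad, heptamer_dyad)  # prints: ACAAAAACC CACAGTG
```

The text describes the V-side elements only as approximate inverted complements, so actual V flanks need not match these strings exactly.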

Lewis et al. (1984, 1985) studied V-J kappa rearrangement within a DNA construct that was introduced via a retrovirus into a cell line that undergoes continuous rearrangement of its kappa-gene locus. They demonstrated in this system that rearrangement and joining of V and J can occur through an inversion event focused at a J heptamer element and at a V 3'-flank dyad heptamer element. The recovered reciprocal recombinant product is a 14-bp palindrome composed of the J 5'-flank heptamer abutting its inverted complement originating in the 3' flank of V. These findings implicate the heptamer element in the process of V-J rearrangement. The role of the consensus nonamer in this process was not investigated.
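The palindrome property of such a junction can be checked directly. The sketch below is only an illustration: it joins the consensus heptamer to its own inverted complement and tests whether the 14-bp product equals its inverted complement. The consensus string stands in for the actual flanking sequences, which are not given here, and the joining order is an assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def inverted_complement(oligo: str) -> str:
    """Return the inverted (reverse) complement of a DNA oligonucleotide."""
    return oligo.translate(COMPLEMENT)[::-1]

def is_dna_palindrome(oligo: str) -> bool:
    """True when the oligonucleotide equals its own inverted complement."""
    return oligo == inverted_complement(oligo)

J_HEPTAMER = "CACTGTG"  # consensus heptamer, used as a stand-in

# Hypothetical junction: heptamer abutting its inverted complement.
junction = J_HEPTAMER + inverted_complement(J_HEPTAMER)
print(junction, len(junction), is_dna_palindrome(junction))
```

Any word joined to its own inverted complement passes this test, so the check holds for either joining order.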

Wood and Coleclough (1984) assayed the frequencies with which the four functional J-gene segments in the mouse rearrange with V coding domains in populations of B-lymphocytes in the absence of antigenic selection. The results for both unspliced (nuclear) and mature mRNA indicated a marked difference in frequency of usage among the four active J-gene segments. J1 and J2 were favored, each being selected about 40% of the time, whereas J4 and J5 were each utilized about 10% of the time. Moreover, Wood and Coleclough's review of published data for the human kappa-gene revealed a similar bias toward the J1 and J2 segments. We note parenthetically that Yancopoulos et al. (1984) reported that the most J-proximal V-gene family of the heavy chain is used preferentially to form V-D-J rearrangements in pre-B-cells.

In light of the demands on the Ig-gene complex to accommodate diverse antigenic stimuli, the need for correct rearrangement, and the occurrence of preferred V-J pairings, both common and discriminating signatures might be expected in the intervening and flanking regions.

To facilitate later discussion, it will be useful to summarize the distinctive characteristics of the principal conserved oligonucleotides of the Ig kappa-gene sequences across and within the three species. [See Karlin et al. (1985) for a detailed description of the contents and locations of all statistically significant block identities (shared oligonucleotides).] The long conserved oligonucleotides show close alignment relative to one another and/or to the J- and C-gene boundaries.

(i) The leader region 5' to J1 contains a highly significant shared oligonucleotide in all three species about 125 bp upstream from J1. We hypothesized that this element serves as a major control site primary to all V-J rearrangement operations and not specific to any J.

(ii) Similarities in the sections separating the J segments emphasize the consensus nonamer both within and between species. Unlike the nonamer, the consensus heptamer is not part of any significant block identity shared among all three species. However, the human and mouse sequences share three statistically significant block identities that subsume the heptamer. These occur proximal to J1, J3, and J4 in human and to J1, J4, and J5 in mouse, respectively (see Karlin et al. 1985, Table 2-IV).

(iii) In the J5-C intron there are two distinctive regions of identity. These include a stretch encompassing the demonstrated enhancer element e1 (Queen and Baltimore 1983; Picard and Schaffner 1984; Potter et al. 1984) covering the region 600- to 750-5' to C in all three species,¹ and a second 200-bp segment, e2, about 1 kilobase (kb) upstream from e1. The e2 element is centered at about 915-3' to J5 in human, 1200-3' to J5 in mouse, and 960-3' to J5 in rabbit. With respect to the J-C intron, the e2 element shows more overall clustering and longer statistically significant block identities among the three species than does e1. Several recent deletion experiments (Picard and Schaffner 1984; Potter et al. 1984) suggest that the enhancer element e1 alone cannot account for the transcriptional levels observed and that other regulatory elements 5' to e1 are presumably operating. Also, Chung et al. (1983) report regions of DNase I hypersensitivity that closely correspond to e1 and e2. This finding, as well as the results of Parslow and Granner (1982), support the notion that these two regions play a role in kappa-gene rearrangement and transcription.

¹ In this paper we employ the following coordinate notation in locating oligonucleotides: "120-5' to J2" signifies that the indicated oligonucleotide starts 120 bp upstream from the J2 gene segment; "77-3' to J5" means that the oligonucleotide starts at base position 77 after J5. The designation "15-5' in J3" specifies that the start of the oligonucleotide coincides with the 15th base pair of J3.
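The coordinate notation defined in the footnote can be translated into sequence indices mechanically. The following sketch assumes 0-based indexing, a hypothetical `segments` dictionary giving the first and last base of each gene segment, and particular off-by-one conventions; none of these come from the paper, and the coordinates shown are invented for illustration.

```python
import re

def oligo_start(notation: str, segments: dict) -> int:
    """Translate a locator such as "120-5' to J2" into a 0-based start index.

    `segments` maps a segment name to (first, last), the 0-based indices of
    its first and last base. The off-by-one conventions are assumptions.
    """
    match = re.fullmatch(r"(\d+)-([35])' (to|in) (\w+)", notation)
    if match is None:
        raise ValueError(f"unrecognized locator: {notation!r}")
    offset, side, relation, name = match.groups()
    offset = int(offset)
    first, last = segments[name]
    if relation == "in":       # start coincides with the offset-th base of the segment
        return first + offset - 1
    if side == "5":            # start lies `offset` bp upstream of the segment
        return first - offset
    return last + offset       # start is the offset-th base after the segment

# Hypothetical segment coordinates, for illustration only.
segments = {"J2": (1000, 1038), "J3": (1300, 1337), "J5": (2000, 2037)}
print(oligo_start("120-5' to J2", segments))
print(oligo_start("77-3' to J5", segments))
print(oligo_start("15-5' in J3", segments))
```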

(iv) There are two highly significant triply shared block identities, one of 11 bp and the other of 10 bp, situated about 120- to 150- and 160- to 190-3' to C, respectively. The distance between the two is almost identical in each sequence. The triply shared 10-bp oligonucleotide (AATAAAGTGA) starts with the familiar polyadenylation signal.

(v) The statistically significant repeated oligonucleotides within the human and mouse sequences are localized to the J-gene segments; these generally share the 3' half of the segments extending over the splice junction. The mouse sequence shows additional long internal repeats encompassing the consensus nonamer element. In the rabbit sequence, all significantly long repeats are concentrated in A + T-rich sections 5' to J1 and in the J5-C intron. The mouse functional J segments display a high degree of similarity, having more than 80% DNA identity, whereas the rabbit J segments are more differentiated, their degree of identity being about 50%.

Given the high degree of similarity among the J segments and among the nonamers for the human and mouse, it seems plausible that the distinction among the various J segments for purposes of V-J rearrangement resides in the inter-J regions. There are five groups of substantial conserved oligonucleotides in the inter-J regions that could relate to recombination, transcription, splicing, and termination control: (1) the regions about 40 bp 5' to each J display extensions on the consensus nonamer; (2) the 5' abutting J flank subsumes the consensus heptamer; (3) significant block identities are found about 100-150 bp 5' to both J1 and J2; (4) the canonical splice identities terminate each J of the general form GT(A/C)AGT, with some conserved extensions; (5) approximately 70-90 bp 3' to various J-gene segments there are long conserved oligonucleotides.

Types of Patterns

Dyad Symmetries

Identifying dyad symmetry pairs to determine their potential function in secondary structure and possible role as signals for recombinational, transcriptional, or translational control is a widely accepted procedure (e.g., see Tjian 1978; Nomura et al. 1981; Lewin 1983). In this paper, we identify all significantly long and all close dyad symmetry combinations. The latter is defined by the conditions of a prescribed minimum stem length (at least 7 bp) with loop length not exceeding 55 bp. We also examine the distribution of close dyad pairs favoring G + C content. These potential stem loop structures are quite stable, consonant with the Tinoco et al. (1973) criteria. Close dyad symmetries and palindromes appropriately spaced may provide "optimal" structural stability to mRNA, enabling it to reach the ribosomes with reasonable success yet also to unravel expeditiously for translation.
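A brute-force search for close exact dyad symmetries under the stated thresholds (stem of at least 7 bp, loop of at most 55 bp) might look like the sketch below. It is not the authors' program; the function name, the tuple layout, and the test sequence are illustrative assumptions, and the loop is taken to be the number of bases between the stem and its dyad.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def inverted_complement(oligo):
    """Return the inverted (reverse) complement of a DNA oligonucleotide."""
    return oligo.translate(COMPLEMENT)[::-1]

def close_dyad_pairs(seq, stem=7, max_loop=55):
    """List (start, dyad_start, loop) for exact stems of length `stem`.

    A pair is reported when the inverted complement of seq[start:start+stem]
    begins at dyad_start, with loop = dyad_start - (start + stem) <= max_loop.
    """
    pairs = []
    for start in range(len(seq) - 2 * stem + 1):
        dyad = inverted_complement(seq[start:start + stem])
        lo = start + stem                       # earliest possible dyad start
        hi = min(lo + max_loop, len(seq) - stem)
        for dyad_start in range(lo, hi + 1):
            if seq[dyad_start:dyad_start + stem] == dyad:
                pairs.append((start, dyad_start, dyad_start - lo))
    return pairs

# Hypothetical test sequence: a 7-bp stem, a 4-bp loop, and the stem's dyad.
example = "TT" + "GCCTCAG" + "AAAA" + "CTGAGGC" + "TT"
print(close_dyad_pairs(example))  # prints: [(2, 13, 4)]
```

A stem longer than `stem` is reported as several overlapping hits, so a fuller implementation would merge adjacent hits into maximal stems.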

Nucleotide Runs

We define a run of nucleotides as the consecutive occurrence of nucleotides of the same type, e.g., sufficient iteration of a single nucleotide, a purine tract, or a string of weakly bonding nucleotides. Another run pattern of interest consists of strictly alternating consecutive occurrences of nucleotides from two possible groups. RY runs are hypothesized to facilitate Z-DNA formation (Wang et al. 1979). Runs based on more elaborate almost periodic patterns may also be of interest in relation to tertiary structure.
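Both run types defined above can be measured in a single pass over the sequence. The sketch below is illustrative only: `longest_class_run` handles runs drawn from one nucleotide group (a single base, purines, or the weakly bonding pair A and T), and `longest_alternating_run` handles strict alternation between a group and its complement, such as purine/pyrimidine (RY) runs. Both function names are hypothetical.

```python
from itertools import groupby

def longest_class_run(seq, members):
    """Length and 0-based start of the longest run of bases drawn from `members`."""
    best_len, best_start, pos = 0, -1, 0
    for in_class, group in groupby(seq, key=lambda base: base in members):
        length = len(list(group))
        if in_class and length > best_len:
            best_len, best_start = length, pos
        pos += length
    return best_len, best_start

def longest_alternating_run(seq, group_a=frozenset("AG")):
    """Longest stretch that strictly alternates between group_a and the other bases."""
    if not seq:
        return 0
    best = current = 1
    for prev, base in zip(seq, seq[1:]):
        # Extend the run only when adjacent bases fall in different groups.
        current = current + 1 if (prev in group_a) != (base in group_a) else 1
        best = max(best, current)
    return best

print(longest_class_run("CTAGGAAGCT", set("AG")))  # purine tract
print(longest_class_run("AAATAAAA", set("A")))     # single-nucleotide run
print(longest_alternating_run("GTGTGT"))           # RY alternation
```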

Clusters of Repeats

The locations of the longest oligonucleotide occurring twice in a randomly generated sequence N bp long are expected to be, on the average, N/3 bp apart (Karlin and Ost 1985). Repeat clusters are closely spaced multiple repeats. For our purposes, in a 5-kb DNA sequence a repeat cluster of an oligonucleotide 5, 6, or 7 bp long is defined as consisting of four or more such direct repeats with no gap exceeding 50 bp between successive copies. A repeat cluster of an 8- or 9-bp oligonucleotide in a 5-kb DNA sequence is taken to include at least three copies with successive gaps of ≤50 bp. These criteria entail a probability of <0.05 of observing such a repeat cluster in a corresponding (base content) randomly generated 5-kb sequence.
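The cluster definition above can be sketched as follows. This is a minimal illustration under two assumptions that the text does not settle: the gap is measured from the end of one copy to the start of the next, and overlapping copies of the same oligonucleotide are skipped. The function name and test sequence are hypothetical.

```python
from collections import defaultdict

def repeat_clusters(seq, k, min_copies, max_gap=50):
    """Return {oligo: [start positions]} for clustered direct repeats of length k.

    Gap = bases between the end of one copy and the start of the next
    (an assumed reading of "gap"); overlapping copies are skipped.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    clusters = {}
    for oligo, starts in positions.items():
        best, chain = [], []
        for s in starts:
            if chain and s < chain[-1] + k:
                continue                    # overlaps the previous copy
            if chain and s - (chain[-1] + k) > max_gap:
                chain = []                  # gap too long: start a new chain
            chain.append(s)
            if len(chain) > len(best):
                best = list(chain)
        if len(best) >= min_copies:
            clusters[oligo] = best
    return clusters

# Hypothetical test sequence: four copies of a 7-mer separated by 10-bp gaps.
example = ("GATTACA" + "C" * 10) * 4
print(repeat_clusters(example, k=7, min_copies=4)["GATTACA"])
```

Following the thresholds in the text, one would call it with `min_copies=4` for k of 5, 6, or 7 and `min_copies=3` for k of 8 or 9.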

Repeat clusters may aid in transcription. Several close direct repeats identified by Alu restriction sites are located in spacer regions of rRNA genes of Drosophila melanogaster; these are similar to a 45-bp direct repeat overlapping the 5' end of the 28S gene. It has been postulated that these Alu cluster repeats can simulate binding sites, drawing RNA polymerase I to the vicinity and thereby aiding transcription of the actual 28S gene (Dover 1982).

Tandem and interspersed direct repeats can be induced through unequal crossover events and through intra- and interchromosomal conversions and transpositions. DNA stuttering, often associated with repair mechanisms, may create short, frequent direct repeats. A steady rate of mutation would tend to disrupt the identity of longer repeats readily. Long direct repeats are thus more likely to occur by virtue of functional constraints or be of relatively recent origin.

Table 1. All close exact dyad symmetry (DS) pairings of stem length ≥6 and loop length ≤55 and close DS pairings with mismatches having a majority of the stem base elements consisting of G or C

Human, Mouse, Rabbit (for each species: Sequence, Location, Gap to dyad)

5' to J1

AAACCA 603-5' to J1 41 CTTAGTTA 600-5' to J1 4 GCCTCAG 105-5' to J1 34 TGTAATTAT 427-5' to J1 26 AGGAAAA 502-5' to J1 31 GCCTCAG 74-5' to J1 2 AAAAAAA 326-5' to J1 7 AAAGAACT 490-5' to J1 34

326-5' to J1 8 ATATAATA 216-5' to J1 5 AAAGAAA 337-5' to J1 37 TTCTCTG 134-5' to J1 1 TTTAGGT 167-5' to J1 3 ACAGTGG 23-5' to J1 7

CAGC.CAG 645-5' to J1 15 CAGAG.AG 126-5' to J1 52 T.CTCCC.A 387-5' to J1 8 CA.CTGCT 107-5' to J1 20 TG.AGAGG 96-5' to J1 32 CTGC.TCAG 119-5' to J1 45

CCTC.TCA 56-5' to J1 1

CACTT.CG 3-5' in J4 26

J regions

ATGGGACCA 14-5' in J3 50 AAAACGT 33-5' in J4 27

ATGGGACCA.A 14-5' in J3 46

GGACCAA 17-5' in J3 44

GAATCAC 12-5' to J3 1

TCT.TTCCCTG 41-3' to J~ 53 AGCC.ATC 70-5' to J4 6 G.TCCCTC 15-5' to J, 20 C.CTAGCC.G 120-5' to J5 42

Between the J regions

TTACTCTG ATCACTG TGAATCAC CAGTGAT

AGCA.GGT AG.TGGGC GCCC.TCT

75-3' to J1 10-5' to J3 13-5' to J4 21-5' to J4

93-5' to J2 94-5' to J~ 70-5' to J~

34 AAAGGGA 17-5' to J3 52 11 CCCCAAA 93-3' to J3 41

1 4

35 TCTG.CTG 75-5' to J5 20 16 33

J5-C intron

TAAAGAG 694-3' to J5 37 CAGAAAATCT 68-3' to J5 GCCGGTG 758-3' to J5 1 ACTTCAG 330-3' to J5 TTTAAAA 988-3' to J5 13 GATTTTT 576-3' to J5 AAAATTT 960-3' to J5 54 TCATTTC 789-3' to J5 TTTTCAT 1072-3' to J5 36 ATTTCTAC 791-3' to J5 TAGTTTT 851-5' to C 10 ATATTAA 1118-5' to C GCAAACA 68-5' to C 40 ATATTTTT 1061-5' to C

TTAAAAT 824-5' to C CCTCTGT 714-5' to C CTGGCAG 661-5' to C CCAGGGTCTG 540-5' to C TTCAGAC 351-5' to C TTTCTAA 205-5' to C

GGCTCT.T 736-3' to J5 39 GGA.ACTCA.C 466-3' to J5 T.CCACCC 746-5' to C 3 G.GACTC.T 689-3' to J5 G.AATCCCCC 731-5' to C 3 AG.GACTC 729-3' to J5 C.AAGAGG 709-5' to C 44 C.GGGTGA.C 984-3' to J5 GGCCACCTG.C 703-5' to C 41 CT.CCGCNGG 1034-3' to J5 T.GCCAC.T 667-5' to C 10 CC.CGCGG 1055-3' to J5 A.AGGCCT 589-5' to C 2 A.CTGGCAG 663-5' to C GGC.GGGCT 506-5' to C 31 GCCA.CTG 666-5' to C GGC.GGGCT.A 506-5' to C 22 CCTG.TCT 590-5' to C TGC.GGGA 485-5' to C 44 CAT.AGACC 547-5' to C AGC.AGGA 186-5' to C 41 GGT.TAGC 463-5' to C

48 CACTTTTA 52-3' to J5 13 12 AAATAGA 272-3' to J5 14 8 CCAATTT 515-3' to J5 35

17 TGAAGGAAA 602-3' to J5 46 39 AAAAGAA 1092-3' to J5 17

9 TATTTTA 1408-3' to J5 13 8, 36 TTTCAAA 1542-3' to J5 5

29 TTTTAAA 1594-3' to J5 25 21 TTTTAAAA 1604-3' to J5 13 23 TTTAAAA 1626-3' to J5 12 35 CTTTTAA 1644-3' to J5 6 29 ATTTTAT 1732-3' to J5 11 27 AAATATA 1753-3' to J5 14

AATTAATT 1797-3' to J5 18

3 C.GTGGCC.T 761-3' to J5 36 46 CGTGGC.G 762-3' to J5 52

9 G.GGGGC.G 784-3' to J5 24 23 GGCCAC.GCAG 809-3' to J5 23 12 CC.GCCAC 821-3' to J5 19 13 CAG.ACTG 1044-5' to C 52 23 T.GGCCCTG 769-5' to C 23 45 G.CAGCTG 725-5' to C 54 39 CA.CTGCC 723-5' to C 34

2 G.GGTCAG 714-5' to C 46 3 G.TCACC.G 699-5' to C 11

AC_K3CC.AC 516-5' to C 12 CCTGG.CTG 491-5' to C 47 AGA.CAGG 434-5' to C 40 AG.GAGCGT 413-5' to C 8 CTG.CCCC 374-5' to C 52

ACAGG.CC 351-5' to C 1 A.GOGGCAG 188-5' to C 47 AGGG.AGA 55-5' to C 34

GCCTCT.T 64-5' in C CAGC.TCAG 195-5' in C CCCTGA.G 208-5' in C A.GGGAG.G 306-5' in C

C domains

CATCAAT 105-5' in C 12 GTCCTGA 151-5' in C 13

28 GTCCTGA.CAGT 151-5' in C 3 7 C.TGTGAG 254-5' in C 4

51 CC.TTGTC 286-5' in C 28 48

TG.TCCAGTTGC 3-5' in C 40 TT.CCACC.G 31-5' in C 8 CCAG.TGC 37-5' in C 10 TTC.CGATG.CA 98-5' in C 37 TGGG.GGTG 118-5' in C 5 CAG.CGTC 281-5' in C 32 GGGT.ACTG 303-5' in C 29

CTCCTCC 129-3' to C 26 AATATTT CTCCTCC 132-3' to C 23 TTGATTC 498-3' to C 10

TCCTCCA.C 100-3' to C 50 GGAGG.TT TTC.CCTC 115-3' to C 40 T.CTCCTCC 130-3' to C 23 G.CTTGCT 417-3' to C 19

3' to C

166-3' to C 5

67-3' to C 34 C.CTGAGG 32-3' to C 20

Dots denote mismatch positions. Two or more mismatches are allowed only if at least five consecutive dyad matches separate them. "N" represents an undetermined nucleotide.

Identification of DNA sections with clustered repeats versus sequences containing long repeats can help demarcate regions where different kinds of duplication mechanisms are operating or regions that serve different functions.

High-Frequency Repeats

A high-frequency (hf) repeat is defined as an oligonucleotide of a given length that occurs significantly more often than is expected by chance within the sequence or throughout the aggregate of sequences under consideration. High-frequency and clustered repeats may reflect the extent of duplications and genomic rearrangements.
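One simple way to flag candidate high-frequency oligonucleotides is to compare observed counts with a chance expectation. The sketch below is not the paper's procedure: the null model (independent bases at the observed frequencies, binomial variance, overlap correlations ignored), the default cutoff of 3 standard deviations, and the `min_count` guard are all assumptions made for illustration.

```python
from collections import Counter
from math import prod, sqrt

def high_frequency_oligos(seq, k, z_min=3.0, min_count=3):
    """Return {oligo: (count, expected, z)} for k-mers with z >= z_min.

    Null model (an assumption): independent bases at the observed base
    frequencies, binomial variance, overlap correlations ignored.
    """
    n_words = len(seq) - k + 1
    base_freq = {base: n / len(seq) for base, n in Counter(seq).items()}
    counts = Counter(seq[i:i + k] for i in range(n_words))
    hits = {}
    for oligo, count in counts.items():
        if count < min_count:
            continue                      # too few copies to call a repeat
        p = prod(base_freq[base] for base in oligo)
        sd = sqrt(n_words * p * (1 - p))
        if sd == 0:
            continue                      # degenerate single-base sequence
        z = (count - n_words * p) / sd
        if z >= z_min:
            hits[oligo] = (count, n_words * p, z)
    return hits

# Hypothetical test sequence: 12 copies of the consensus nonamer in filler DNA.
example = ("GGTTTTTGT" + "ACGATCAGCTAGCATCGACT") * 12
hits = high_frequency_oligos(example, k=9)
print(hits["GGTTTTTGT"][0])
```

For long oligonucleotides the expected count is tiny, so even one occurrence scores a large z under this model; that is why the sketch also requires a minimum number of copies.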

Significant similarities and contrasts in DNA pattern among the human, mouse, and rabbit Ig kappa-gene sequences are recorded in Tables 1-6 and graphically presented in Figs. 1-3.

Dyad Symmetry (DS) Pairings Within the Human, Mouse, and Rabbit Ig Kappa-Gene Sequences

A statistically significant structure is one that differs by at least three standard deviations from the theoretical expected structure, as calculated from a corresponding random model (see Karlin and Ost 1985 for details). We determined all exact DS pairings of stem length ≥10 bp without restriction on loop length (global) (see Karlin and Ghandour 1985a, Table 3). We also determined all exact close DS pairs of length ≥7 bp and close DS of high G + C content, allowing mismatches (Table 1; see notes to table for mismatch criteria).

Based on these results, we now offer some interpretations and hypotheses:

(i) The single global statistically significant DS pair in the mouse sequence is 14 bp long (ATTATATACATTAA) and occurs 202-5' to J1, close to the postulated major control site at 127-5' to J1, with its dyad oligonucleotide 1297-3' to J5, proximal to the e2 "control" element of the J5-C intron (see Introduction). The correspondence between these control sites is strengthened by an 11-bp dyad combination composed of the oligonucleotide sequence GCTCTGTTCCT, located 64-5' to J1, and its dyad oligonucleotide 1109-3' to J5, which is also close to the e2 intron control element.

Fig. 1. All close exact DS pairs and DS pairs with mismatches of stem length ≥7 bp and loop length ≤55 bp having a majority of stem base elements consisting of G or C in human, mouse, and rabbit Ig kappa-gene J-C regions. The numbers above the lines indicate the lengths of exact DS words. Numbers below the lines refer to lengths of DS words with mismatches. Arrows labeled "e1" and "e2" refer to the known enhancer element and the proposed control region (Karlin and Ghandour 1985), respectively. Arrow labeled "g" refers to the global recombination control site 5' to J1 (see Karlin et al. 1985).

Also, in the mouse the members of the dyad pair consisting of CAAGGAAAGGG 1172-3' to J5 and its inverted complement 141-3' to C are proximal to two significantly long shared oligonucleotides of the three sequences. We have conjectured that this dyad combination may be instrumental in relating the e2 regulatory element and a transcriptional termination control site.

(ii) With one exception, all exact DS pairings of at least 11 bp (of which there are five in human, nine in mouse, and eight in rabbit), and even those of 10 bp, do not overlap the coding domains (cf. Karlin and Ghandour 1985a).

(iii) Close DS pairings composed primarily of G + C can form the most stable secondary structures. Moreover, DNA sequences dominated by G + C may be less subject to disabling mutation events by virtue of their strong bonding.

All three sequences show a high concentration of close DS pairs of high G + C content near the enhancer element e1 (Table 1). An enhancer element's capacity to operate in either orientation entails that its reading in the 5' to 3' direction exhibits an essential DNA feature that is also maintained in its inverted complementary sequence. In this capacity, a region containing a close DS pair obviously presents the same oligonucleotide in both orientations.
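The orientation argument can be verified on a toy example. In the sketch below the stem `GCCACCT` and its flanking bases are hypothetical; the point is only that a region carrying both a word and its inverted complement still carries that word after the whole region is inverted.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def inverted_complement(oligo):
    """Return the inverted (reverse) complement of a DNA oligonucleotide."""
    return oligo.translate(COMPLEMENT)[::-1]

stem = "GCCACCT"  # hypothetical G + C-rich stem
# Hypothetical region: stem, a 5-bp loop, then the stem's inverted complement.
region = "TA" + stem + "ACGTT" + inverted_complement(stem) + "GC"
flipped = inverted_complement(region)

print(stem in region, stem in flipped)
```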

The enhancer element e1 is tissue and species specific. With this element deleted, Potter et al. (1984) observed a low level of transcription in fibroblasts, whereas the transcription assays in B-cells were normal, which led them to hypothesize that there may be a DNA recognition site other than the identified enhancer upstream in the J5-C intron. A potential second "enhancer-control" region about 1 kb upstream is indicated in Karlin and Ghandour (1985a, Table 1 and Fig. 1); it is labeled e2. The preponderance of long conserved oligonucleotides in the three species about e2 (which is 915-, 1200-, and 960-3' to J5 in human, mouse, and rabbit, respectively) supports the proposition that this region has a regulatory function. Moreover, a concentration of close DS pairs occurs in the region of the e2 element (see Fig. 1) in mouse and rabbit. (A stretch of about 570 bp not sequenced in the human starting at about position 950-3' to J5 may account for the absence of corresponding DS pairings in the human sequence.)

(iv) In rabbit, the bulk of close dyad symmetries with stem compositions favoring G + C extend from about 800-5' to C through the C domain. For the human and mouse sequences the same stretch also shows an excess of close DS pairs. The human sequence further shows a preponderance of strong close DS pairings 3' to C in the vicinity of the "termination control" site (100- to 170-3' to C), described explicitly in Karlin et al. (1985).

(v) The cumulative count (human 125, mouse 141, rabbit 193) of close DS pairings (including both exact pairings and those with mismatches regardless of base content) in the three species shows a clear excess of close dyad symmetries in the rabbit compared with the human and mouse Ig kappa-gene sequences, a pattern similar to that observed with respect to repeat clusters (see Fig. 3). This can be ascribed partly to intense duplication of short oligonucleotides, generating high A + T content, in the regions 600- to 700-3' to J5, 1500- to 1700-3' to J5, and about 300- to 400-5' to J2 in the rabbit.

(vi) Generally, the patterns (numbers, locations, and compositions) of dyad symmetries are more similar between human and mouse than between the rabbit and either the mouse or human. To illustrate, Table 1 shows six close exact DS pairs in both human and mouse located 5' to J1 that are evenly distributed over this region. By contrast, in the rabbit sequence, there exists a single close exact dyad combination consisting of the oligonucleotide GCCTCAG repeated at positions 105- and 74-5' to J1 with its inverted complement located at 65-5' to J1. The rabbit sequence is deficient in close dyads over the inter-J regions compared with the human and mouse sequences. For all three species there are virtually no close dyads overlapping the J-gene segments (an exception is in the inactive J3 segment of mouse and rabbit).

Dyad Symmetry Pairings and Repeat Occurrences Between the V and J-C Regions

We determined all exact DS pairings and all shared oligonucleotides of length ≥9 bp between the V regions of the kappa-gene and the J-C region for each of the human and mouse sequences. The V sequences available to us (from the GenBank database) were VK101 of human and VK2, VK41, VL6, and V167 of mouse.

The longest exact DS pairings and block identities between the V and J-C regions are presented in Table 2. The heptamer and not the nonamer shows greater length and stability of the DS pairings for each of the V domains considered. In fact, the pair- ing often extends beyond the core heptamer CACTGTG to a length ranging from 9 to 12 bp. We refer to these associations as the extended heptamer DS elements.

The mouse VK2 sequence shows the strongest DS pairing, with the extended heptamer abutting J4 (12 bp exact stem length). The mouse VL6 extended heptamer can also strongly pair with J5 (10 bp stem), whereas mouse V167 and VK41 heptamer elements show equal affinity to J4 and J5 heptamers (both with 9-bp stem lengths for both V domains). The longest DS pairing, that between the human sequence VK101 and the J-C region, is 11 bp and also relates to the consensus heptamer 5' to J5.
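The longest exact DS pairing between two sequences is the longest block shared by one sequence and the inverted complement of the other, which a standard dynamic-programming scan can find. The sketch below is an assumed, simplified procedure rather than the authors' method, and the flank sequences are invented; only the embedded 9-bp stem `CACAGTGAT` echoes an extended heptamer complement of the kind mentioned in the text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def inverted_complement(oligo):
    """Return the inverted (reverse) complement of a DNA oligonucleotide."""
    return oligo.translate(COMPLEMENT)[::-1]

def longest_common_block(a, b):
    """Longest exact block shared by a and b: (length, start_in_a, start_in_b)."""
    best = (0, -1, -1)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1   # extend the block ending at (i-1, j-1)
                if curr[j] > best[0]:
                    best = (curr[j], i - curr[j], j - curr[j])
        prev = curr
    return best

def longest_dyad_pairing(v_flank, j_flank):
    """Longest exact DS stem between two sequences.

    Returns (length, start_in_v, start_in_j), with start_in_j given on the
    original (not inverted) j_flank strand.
    """
    length, v_start, rc_start = longest_common_block(
        v_flank, inverted_complement(j_flank))
    if length == 0:
        return (0, -1, -1)
    return (length, v_start, len(j_flank) - rc_start - length)

# Hypothetical flanks sharing a 9-bp dyad stem.
v_flank = "AAT" + "CACAGTGAT" + "GA"
j_flank = "GG" + inverted_complement("CACAGTGAT") + "TTA"
print(longest_dyad_pairing(v_flank, j_flank))  # prints: (9, 3, 2)
```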

What about the strength of the V and J consensus nonamer DS pairings? It is noteworthy that all the V-J nonamer-dyad pairings but one are not exact, even for the core (GGTTTTTGT) element. For example, VK2, VKL6, and VK41 each show one or two mismatches in the core nonamer. Moreover, virtually no pairings of neighboring nucleotides are feasible, even if one allows for deletions, insertions, and mismatches. VK167 forms an exact dyad match with the nonamer of J1 that stretches to a length of 13 bp with one mismatch, but the dyad associations to the nonamer regions of J2, J4, and J5 are limited to the core nonamer. These observations attest to the relatively weaker dyad pairing of the consensus nonamer compared with the consensus heptamer.
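Counting mismatches in an inexact dyad pairing, as discussed for the core nonamer, can be sketched as a sliding-window comparison against the element's inverted complement. The function name and the flank sequence below are hypothetical, and insertions or deletions are not handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def inverted_complement(oligo):
    """Return the inverted (reverse) complement of a DNA oligonucleotide."""
    return oligo.translate(COMPLEMENT)[::-1]

def best_dyad_match(flank, element):
    """Find the window of `flank` closest to the inverted complement of `element`.

    Returns (mismatches, start, window); ties go to the leftmost window.
    """
    target = inverted_complement(element)
    k = len(target)
    candidates = []
    for start in range(len(flank) - k + 1):
        window = flank[start:start + k]
        mismatches = sum(a != b for a, b in zip(window, target))
        candidates.append((mismatches, start, window))
    return min(candidates)

NONAMER = "GGTTTTTGT"         # core nonamer quoted in the text
flank = "TTGACATAAACCGGA"     # hypothetical V 3' flank
print(best_dyad_match(flank, NONAMER))  # prints: (1, 3, 'ACATAAACC')
```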

A number of DS pairings of stem length ≥ 7 bp relate regions of the V-gene segments to the J5-C intron in the vicinity of the e1 enhancer element and to the neighborhood of the second control element e2 in the intron (data not shown). Enzymatic activity leading to V-J joining may act on these regions in the J5-C intron. In fact, Durdik et al. (1984) reported such a case in λ-Ig-producing B-cells where the oligonucleotide CACAGTGATAG (of which the initial heptamer is the inverted complement of the consensus heptamer) was rearranged with a recombining sequence about 10 kb downstream, leading to the excision of the kappa constant (C) domain from the genomes of these cells. The joining site 1084-3' to J5 of the first recombination was localized to about 100 bp from the major oligonucleotide identity of the e2 region. Since the V-J rearrangement would have been nonfunctional for both cell lines examined, Durdik et al. hypothesized that the intron heptamer may play some role in kappa to lambda switching. These observations, along with the fact that the correspondences between the V heptamers and the J-segment heptamers extend beyond the consensus 7 bp, suggest a greater degree of involvement of the consensus heptamer region than of the consensus nonamer in V-J joining.

The foregoing hypothesis is further supported by the results of Lewis et al. (1985), who investigated the mechanisms of Ig kappa-gene rearrangement through the use of a virally introduced recombination substrate. They isolated in this system recombined V-J kappa coding units and a reciprocal recombination product consisting of the J and V heptamer elements contiguously joined. These findings suggest an inversion event in V-J rearrangement and underscore the relevance of the consensus heptamer to this process.

The rearranged V-J plus C domain is transcribed and the mRNA is ultimately spliced to yield an intron-free mRNA composed of contiguous V-J-C coding domains. However, after V-J rearrangement there can potentially appear several donor splice junctions (immediately 3' to each J segment). The multiple choice of splice donor sites may present a problem in implementing correct splicing when V joins with one of the earlier J segments. There are at least two mechanisms that may ensure accurate and unencumbered intron splicing: (i) In the region 5' to each J segment we can imagine control elements whose presence in transcribed mRNA represses the splicing of the J segment immediately downstream. The union with V removes these elements for the selected J and thereby frees its splice junction to operate normally. (ii) Alternatively, within the V coding regions or the V intron, there may be positive controls that activate the selection for processing of the J splice junction of the particular J joined directly to V. In the first scenario one could envision the nonamer element as a (repressor) control site acting to guarantee correct VJ-C splicing, for the following reason. The nonamer sequence displays marked constancy both within and among the three species. Its role is presumably similar for all the J segments and consequently it is probably not involved in processes requiring differentiation among various J segments. Its exceptional conservation, which may extend up to 19 bp in length (cf. Karlin et al. 1985, Table 2-IV), strongly suggests some function for this sequence.

Table 2. (i) Dyad symmetry pairs relating J-C and V kappa-gene regions; included are all dyad symmetries of ≥ 11 bp plus all consensus heptamer and nonamer dyads of ≥ 9 bp

Sequence        Length   J-C location                   Location of inverted complement in the V region

Human VK101:
TAACACTGTGG     11       11-5' to J5                    2-3' to exon 2

Mouse VK2:
TGAATCACTGTG    12       13-5' to J4                    3-3' to exon 2
CCTCATGTCAG     11       123-5' to J4                   73-5' to exon 1

Mouse VK41:
TCTCATTTCTAC    12       785-3' to J5                   39-3' to exon 2
TCACTGTGG       9        9-5' to J5                     2-3' to exon 2
ATCACTGTG       9        10-5' to J4                    3-3' to exon 2

Mouse VL6:
CAAGTGATAGT     11       122-5' to J2                   69-5' in exon 2
CTCACTGTGG      10       10-5' to J5                    2-3' to exon 2

Mouse V167:
TCACTGTGG       9        9-5' to J5                     1-3' to exon 2
ATCACTGTG       9        10-5' to J4                    2-3' to exon 2
GGTTTTTGT       9        40-5' to J1, J5; 41-5' to J4   21-3' to exon 2

(ii) All block identities between J-C and V kappa-gene regions of length ≥ 11 bp in mouse and human

Sequence        Length   J-C location                   Location of block identity in the V region

Human VK101:
No block identities ≥ 11 bp

Mouse VK2:
CTCAGCTACTA     11       968-5' to C                    94-3' to exon 2

Mouse VK41:
CTTTTATCATGCT   13*      153-3' to C                    47-5' to exon 2
AAAGAGTCAGT     11       126-3' to J4                   62-5' in exon 2

Mouse VL6:
AAAGATATAATA    12       220-5' to J~                   366-5' to exon 1
TCACTGTGATT     11       9-5' to J4                     27-3' to exon 1
TTCTTTGTTGT     11       49-3' to J2                    349-5' to exon 1
AACCAAAGTAA     11       87-5' to J~                    394-5' to exon 1

Mouse V167:
CACAGTGATAG     11       1084-3' to J5                  2-3' to exon 2
CTTCCTTCCTC     11       15-5' to C                     45-5' to exon 1

The expected length of the longest DS pair or block identity between a given V and J-C region occurring by chance is ~11 bp. To be statistically significant (indicated by an asterisk) a DS pair or block identity must be ≥ 13 bp long

The observed lengths of DS "heptamer" pairings favor J5 and to a lesser extent J4 and J~, depending on the V domain. These DS associations of proximal sequences between the V and J segments featuring J5 and J4 are perhaps surprising in light of the experimental findings of Wood and Coleclough (1984). They reported for both unspliced and mature mRNA mouse Ig kappa-genes under mitogenic stimulation approximately 40% representation from each of J1 and J2 compared with 10% each from J4 and J5. Perhaps the J1 and J2 V-J recombination preferences derive from the special relations of the proximal 5' regions of these gene segments to the major control enhancer element e2 of the J5-C intron. In this context, J1 is distinguished by the existence of a highly conserved oligonucleotide about 125-5' to J1 in all three species that we postulated to be primary to all V-J rearrangement operations. Moreover, this region contains long oligonucleotides in dyad symmetry relationships with the region e2 of the J5-C intron of the mouse sequence. The sequence 5' to J2 also displays a striking correspondence with the e2 element. Explicitly, the 9-bp oligonucleotide TTCAGAAAT occurs 105-5' to J2 in human, 111-5' to J2 in mouse, and 112-5' to J2 in rabbit, and 917-3' to J5 in human, 1198-3' to J5 in mouse, and (the initial octamer) 961-3' to J5 in rabbit. The locations in the J5-C intron correspond to the core of the e2 control element (Karlin et al. 1985, Table 1-IV).

We determined all block identities between the J-C region and the V regions [Table 2(ii)]. No statistically significant (length ≥ 13 bp) block identities were observed. However, there occur two shared oligonucleotides of lengths 11 and 10 bp (CACAGTGATAG and its initial 10 bp) relating the consensus heptamer 3' to the V domains V167 and VK41, respectively, to the region of the hypothesized second control element e2 (cf. Durdik et al. 1984). Thus in the mouse, the extended approximate consensus heptamer appears at three sets of locations: 3' to V domains, 5' to J segments, and in the J5-C intron.
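The chance level of about 11 bp quoted in the notes to Table 2 is consistent with the leading-order behavior of the longest exact match between two independent random sequences, which grows as log(nm)/log(1/p), where n and m are the sequence lengths and p is the probability that two randomly drawn bases agree. The sketch below evaluates only this leading term. The lengths (5000 and 1000 bp) and the uniform base composition are illustrative assumptions, not values taken from the sequences analyzed here, and the function names are hypothetical; the actual significance criteria are those of Karlin and Ost (1985).

```python
import math


def match_probability(base_freqs):
    """Probability that two independently drawn bases are identical."""
    return sum(f * f for f in base_freqs.values())


def leading_order_longest_match(n, m, base_freqs):
    """Leading term log(n*m) / log(1/p) for the longest exact match
    between two independent random sequences of lengths n and m."""
    p = match_probability(base_freqs)
    return math.log(n * m) / math.log(1.0 / p)


# Assumed, illustrative inputs: equal base frequencies and round lengths.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
chance_length = leading_order_longest_match(5000, 1000, uniform)  # about 11.1
```

A skewed base composition raises p and therefore lengthens the match expected by chance, which is one reason the significance levels differ among the three species.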

Long Runs of Nucleotide Types

Sufficient consecutive repeated occurrences of a single nucleotide, or of elements of a group of nucleotides, may demarcate a signal region.

Figure 2 shows in a schematic layout all statistically significant runs of bases or groups of bases, and of alternating elements. [The criteria for statistical significance are discussed in Karlin and Ost (1985).] These include several long polyadenine runs in both the human and rabbit sequences. There is a striking iteration of 12 guanines in the rabbit. The mouse sequence contains no significant single-base runs, in contradistinction to the numerous such runs in the human sequence. The putatively higher mutation rate in mouse may disrupt long runs (cf. Wu and Li 1985). Iterations of a single nucleotide in the three species do not overlap any coding segments of the Ig kappa-gene and are principally found in the J5-C intron or in the flanking regions 5' to J1 and 3' to C.

A similar asymmetry prevails in terms of runs composed of strongly or weakly bonding bases. Thus, the rabbit sequence shows the only significant such runs, the longest being a formidable 35-bp run of "weak" bases located 5' to J1. Six other significant weakly bonding base runs of the rabbit sequence are confined to a region of approximately 150 bases in the J5-C intron, which region is about 95% A + T.
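Runs of a single base or of a base class are straightforward to locate. The fragment below is a minimal sketch, assuming the class definitions given in the caption to Fig. 2 (S = G + C, W = A + T, R = A + G, Y = C + T); the function name `class_runs` and the toy sequence are hypothetical, and the minimum lengths passed in are the rabbit significance levels quoted in that caption rather than values computed here.

```python
import re

# Base classes as defined in the caption to Fig. 2.
CLASSES = {"S": "GC", "W": "AT", "R": "AG", "Y": "CT"}


def class_runs(seq, symbol, min_len):
    """Maximal runs of at least min_len bases drawn from one class.

    symbol is a class letter (S, W, R, Y) or a single base (A, C, G, T).
    Returns (start, length) tuples with 0-based starts.
    """
    bases = CLASSES.get(symbol, symbol)
    pattern = re.compile("[%s]{%d,}" % (bases, min_len))
    # Greedy matching makes each reported run maximal.
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(seq)]


# Toy sequence: ten A's followed by six further weak bases.
toy = "GCGC" + "A" * 10 + "TTATAT" + "GG"
weak_runs = class_runs(toy, "W", 16)    # rabbit level for weak bases
poly_a = class_runs(toy, "A", 8)        # level for iterated A
strong_runs = class_runs(toy, "S", 10)  # rabbit level for strong bases
```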

Of the significant pyrimidine (Y) runs there are three in human, one in rabbit, and one 28 bp long in mouse, all located 3' to C. Is it possible that this consensus structure is pertinent to transcription termination of the Ig kappa-gene?

There is a single significant alternating strong-weak (SW) run, of length 16 bp. This contains the dinucleotide GA iterated seven times midway between J1 and J2 of the rabbit sequence. (This is embedded in the 27-bp purine run marked in Fig. 2.) The next-longest SW run in rabbit is of length 15 bp and found at 207-5' in C. It is interesting that the longest alternating SW run in human, also of length 15 bp, exists central to the C domain at 157-5'. This conforms with the view (e.g., Blaisdell 1983; W.-H. Li, C.-C. Luo, and C.-I. Chung, manuscript in preparation) that alternating SW runs tend to occur in coding regions, whereas long stretches of only strong or weak bases occur more often in noncoding regions.

There are no significant alternating RY runs in the human sequences. A significant mouse RY run, GTATATATGTGCATC, is 15 bp long and is proximal to the C domain at 154-5' to C. A 14-bp RY alternation starts 7 bp 5' to J2 in mouse and extends over the 5' end. Alternating RY runs, mostly from 1 to 2 standard deviations above the chance theoretical mean length, are largely confined to the sections separating the J-gene segments (data not shown).
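Alternating runs (SW or RY) can be scanned in a single pass. The following is a minimal sketch that counts only strict alternation, with no allowance for the single-base breaks of an imperfect run; the function name `alternating_runs` and the toy sequences are hypothetical.

```python
def alternating_runs(seq, class_a, class_b, min_len):
    """Maximal stretches in which consecutive bases alternate strictly
    between two classes, e.g. "GC"/"AT" for SW or "AG"/"CT" for RY.

    Returns (start, length) tuples with 0-based starts.
    """
    def label(base):
        if base in class_a:
            return 0
        if base in class_b:
            return 1
        return None  # ambiguous or unknown symbol breaks any run

    runs = []
    start = 0
    for i in range(1, len(seq) + 1):
        broken = (
            i == len(seq)
            or label(seq[i]) is None
            or label(seq[i - 1]) is None
            or label(seq[i]) == label(seq[i - 1])
        )
        if broken:
            if i - start >= min_len and label(seq[start]) is not None:
                runs.append((start, i - start))
            start = i
    return runs


# Toy example: GA iterated seven times between non-alternating flanks.
toy = "CC" + "GA" * 7 + "TT"
sw_runs = alternating_runs(toy, "GC", "AT", 10)
ry_runs = alternating_runs("ACACAC", "AG", "CT", 6)
```

In this toy sequence the flanking bases do not continue the alternation, so the SW run is exactly the 14 bp of the GA iteration; in a real sequence the neighboring bases may extend it.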

We call attention to a substantial imperfect RY run (not shown in Fig. 2). Specifically, the rabbit exhibits an RY run starting at 208-5' to J1 with two single base breaks in the alternation, giving lengths of 7, 13, and 6, for a total of 26 bp.

Fig. 2. Long runs of nucleotides and of nucleotide groups in the human, mouse, and rabbit Ig kappa-genes (J-C region). The notation is exemplified as follows: "A8" denotes eight iterated A bases; "S13" stands for 13 successive strong bases; S (strong) = G + C; W (weak) = A + T; R (purine) = A + G; Y (pyrimidine) = C + T. "RY" and "SW" denote, respectively, alternating purine-pyrimidine and strong-weak runs: "(SW)15" signifies a total of 15 alternating strong and weak bases. In human and mouse eight consecutive repetitions are statistically significant for A, six for C, six for G, and eight for T; in the rabbit the corresponding levels are eight for A, six for C, and seven for G or T. An 18-bp or longer stretch exhibiting runs of weakly bonding bases (W) and an 11-bp or longer stretch for strongly bonding bases (S) are statistically significant in the human and mouse sequences. Significant runs in the rabbit sequence involve 16 or more weak or 10 strong bases. A purine (R) run ≥ 14 bp in length and a pyrimidine (Y) run ≥ 13 bp in length are significant in both human and mouse, but 15- and 13-bp lengths are required for rabbit. An alternating run SW or RY of ≥ 16 bp in any of the three species is statistically significant. A statistically significant run occurs by chance (i.e., for a corresponding random sequence with the same base content) with a probability of <0.02

The human sequence shows no significant runs in inter-J regions, whereas the mouse and rabbit show contrasts and similarities in their run patterns. The mouse is distinguished by several significant poly-G runs and by a lack of any significant iteration of A or T. The human and rabbit sequences contain long A, C, and T runs located exclusively 5' to J1 and in the J5-C intron. Deletion experiments or other DNA manipulations for purposes of investigating the role of these polynucleotide tracts have not yet been done.

Repeat Clusters

Figure 3 describes the distribution of clusters in the three sequences. A marked contrast to the human and mouse sequences is shown by the greater abundance of repeat clusters in the rabbit sequence. This conforms with the phenomenon of many short duplications in the rabbit, especially 5' to J1 and in the J5-C intron. In all three species the repeat clusters share the following characteristics: (i) No clusters overlap any of the coding regions. (ii) All A + T-rich clusters occur either 5' to J1 or in the J5-C intron. (iii) All the repeat clusters occurring between the J regions exist only in the rabbit, are exclusively purine, and are generally G rich. (iv) All repeat clusters 3' to the constant domain are C rich and exclusively pyrimidine in content. These clusters are located in the region 300-400 bp 3' to C in human and mouse. (The corresponding segment was not sequenced in the rabbit.)

High-Frequency (hf) Oligonucleotides

Many (probably most) DNA sequences show significantly more highly repeated oligonucleotides of lengths 4-6 bp than would be expected of randomly generated sequences (cf. Karlin et al. 1983). This is not true for oligonucleotide lengths of ≥ 8 bp.

High-frequency repeats of oligonucleotides 6, 7, 8, and 9 bp long with high representation in all three sequences are presented in Table 3. Note that all such hf repeats of length > 7 relate to the consensus nonamer. It is interesting and perhaps telling that the common form of the polyadenylation signal (AATAAA) occurs nine times in both the human and mouse sequences and six times in the rabbit sequence. This suggests that this oligonucleotide alone cannot be a decisive signal for termination of transcription.

Fig. 3. Repeat clusters in the human, mouse, and rabbit Ig kappa-genes (J-C region). The following exemplify the notation: "T5" indicates the oligonucleotide TTTTT; "(ACC)3" represents ACCACCACC. Labels "e1," "e2," and "g" are defined in the caption to Fig. 1

The mouse sequence contains the most representations of the oligonucleotides of length ≥ 8 bp that are found in high frequency in the aggregate of the three sequences. In this category we have the consensus nonamer GGTTTTTGT, which occurs exactly seven times, 39-5' to J3 and 40-5' to J4 in human, 40-5' to J1 and J5 and 41-5' to J4 in mouse, and 31-5' to J4 and 39-5' to J5 in rabbit (Karlin and Ghandour 1985, Table 2-IV).

Based on an analysis of repeat structure for a variety of DNA sequences, including many non-Ig sequences, a distinction of the patterns, contents, and distribution of short (5-7 bp) repeats from those of long (≥ 9 bp) repeats emerged (Karlin and Ghandour 1985b). Short repeats are generally composed of one or two bases that display nearby duplications and relatively few point mutations. Longer repeat elements (excluding special situations such as the homogeneous J-gene segments) tend to be G + C rich and may contribute to some regulatory function. Both types of repeats are more prevalent in noncoding regions.

Probably numerous and quite distinct mechanisms lead to short versus long repeats. The general explanation is that short clustered repeats are products of DNA repair actions and of local gene conversions with incidental mutation events. Alternatively, short repeats can be a result of DNA polymerase stuttering due to blockage by single-strand folding during replication. It seems likely that most short clustered repeats are due to anomalous duplication or replication and thus embody no functional constraints.

Table 3. Oligonucleotides with high representation (a) in all three sequences

Length   Oligonucleotides (b)

6        TTTTGT (9,11,6), AAATAA (11,8,9), TTTTTC (8,7,8), AATAAA (9,9,6), TTATTT (6,6,8)
7        GTTTTTG (4,4,3), TTTTTGT (5,5,3)
8        GTTTTTGT (4,4,3)
9        AGGTTTTTG (2,2,2), GGTTTTTGT (2,3,2)

(a) Criteria are ≥ 6, ≥ 3, ≥ 3, and ≥ 2 occurrences in each sequence for lengths 6, 7, 8, and 9, respectively

(b) The array (m1,m2,m3) indicates the number of occurrences (ma in the a-th sequence) of the specified oligonucleotide in the three sequences. Thus GTTTTTGT occurs four times in human, four times in mouse, and three times in rabbit
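The selection rule behind Table 3, a minimum number of occurrences in each of the three sequences, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: occurrences are counted with overlaps allowed, the function names are hypothetical, and the three short strings are invented test sequences built around the consensus nonamer.

```python
from collections import Counter

# Per-sequence minimum occurrences by oligonucleotide length (notes to Table 3).
TABLE3_MIN_EACH = {6: 6, 7: 3, 8: 3, 9: 2}


def kmer_counts(seq, k):
    """Count every length-k word in seq, overlapping occurrences included."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_high_representation(seqs, k, min_each):
    """Words of length k occurring at least min_each times in every sequence.

    Returns {word: (m1, m2, ...)} with one count per input sequence.
    """
    counts = [kmer_counts(s, k) for s in seqs]
    common = set(counts[0]).intersection(*counts[1:])
    return {
        word: tuple(c[word] for c in counts)
        for word in sorted(common)
        if all(c[word] >= min_each for c in counts)
    }


# Invented toy sequences, each carrying the consensus nonamer twice.
toy_seqs = [
    "GGTTTTTGTAAGGTTTTTGT",
    "CCGGTTTTTGTCGGTTTTTGTC",
    "GGTTTTTGTGGTTTTTGT",
]
shared = shared_high_representation(toy_seqs, 9, TABLE3_MIN_EACH[9])
```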

Longer repeats are often related to crossing-over events and to transposition operations. Effective unequal crossing over will generally produce tandem long repeat units corresponding to the sequences that mediated the alignment of the recombining elements. Transposon termini and sites of their insertions are associated with both direct and inverted (DS) repeats. Long clustered repeats can serve as regulatory binding sites that can conceivably convert an on/off single-copy activation signal into a progressively controlled one.

Table 4 lists the oligonucleotides whose cumulative occurrence is significantly high in the aggregate 16,000 bp, but that are unequally represented, being favored by a factor of 3 in one or two of the sequences compared with the other(s) (see notes to Table 4). The hf oligonucleotides with skewed representation of length ≥ 9 bp listed in Table 4 favor the mouse and slightly less the human, and mostly overlap the J-gene segments. On the other hand, for oligonucleotides of length 5-7 bp the human and rabbit sequences show more of the hf repeats. Most hf oligonucleotides of length 5-7 bp are A + T rich. Exceptions are the purine-rich words CAGGG, AGGGAG, and GGGAAA, which are significantly frequent in human and rabbit, but virtually absent in the mouse sequence.

Table 4. High-frequency oligonucleotides with skewed representations relative to the human, mouse, and rabbit sequences

Length   Oligonucleotides

5        AAAAA (44,9,24), CAGGG (12,5,18), CAGTT (6,18,10), TATAA (5,10,19)
6        TTTAAA (12,3,10), TAAAAT (2,9,9), AAAAAA (25,0,8), TAAAAA (9,2,6), ATTTTA (11,3,11), AGGGAG (6,1,10), GGGAAA (11,2,6), AAGATT (10,3,5), TAATTT (3,7,10), TATTTT (8,3,9)
7        AAAAAAA (15,0,3), AAATAAA (5,6,2), TATTTTA (4,1,7), AATAAAA (2,7,2), TTTTAAA (6,0,5)
8        AACGTAAG (4,4,1)
9        GTTTTTGTA (2,4,1), ACCAAGCTG (1,3,1), CCAAGCTGG (1,3,1), CAAGCTGGA (1,3,1), TCAAACGTA (3,1,1), CAAACGTAA (3,1,1), AAACGTAAG (4,4,1), AACGTAAGT (4,4,1), ACGTAAGTA (2,4,1), GAAATAAAA (1,4,0), AGAGCTTCA (1,1,3)
10       GGTTTTTGTA (2,3,1), GGAAATAAAA (1,3,0)
11       AAACGTAAGTA (2,4,1), ACCAAGCTGGA (1,3,1)

The array (m1,m2,m3) is as described in Table 3. An oligonucleotide of length k is considered of high frequency (hf) provided m1 + m2 + m3 exceeds an appropriate level Lk (cf. Karlin et al. 1983). In the case at hand, the requisite levels are L5 = 20, L6 = 15, L7 = 10, L8 = 9, L9 = 7, L10 = 6, and L11 = 5. The critical numbers are determined by permutation analyses of an associated independence model that reflects the same base frequencies and lengths of the individual sequences such that less than 1% of oligonucleotides of length k obey m1 + m2 + m3 ≥ Lk. A hf oligonucleotide is described as having a skewed representation if the ratio of occurrences is at least 3:1 between some two of the three sequences
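The two conditions stated in the notes to Table 4, a cumulative count reaching the level Lk and a ratio of at least 3:1 between some two sequences, translate directly into a pair of predicates. The sketch below is only an illustration: the function names are hypothetical, the hf condition is read as "at least Lk" (an assumption about the intended comparison), and a count of zero in one sequence is treated as skewed whenever another sequence has any occurrence.

```python
# Levels L_k as listed in the notes to Table 4.
LEVELS = {5: 20, 6: 15, 7: 10, 8: 9, 9: 7, 10: 6, 11: 5}


def is_high_frequency(counts, k, levels=LEVELS):
    """True if the cumulative count m1 + m2 + m3 reaches the level L_k.

    Assumption: the comparison is taken as >= L_k.
    """
    return sum(counts) >= levels[k]


def is_skewed(counts, ratio=3):
    """True if some two sequences differ by a factor of at least `ratio`.

    Comparing the largest and smallest counts suffices; a zero count is
    treated as skewed provided the largest count is positive.
    """
    return max(counts) > 0 and max(counts) >= ratio * min(counts)


# Counts (human, mouse, rabbit) taken from Tables 3 and 4.
aaaaa = (44, 9, 24)    # Table 4, length 5
ttttgt = (9, 11, 6)    # Table 3, length 6
aaaaaa = (25, 0, 8)    # Table 4, length 6
```

Under this reading TTTTGT is hf but not skewed, which agrees with its appearance in Table 3 and its absence from Table 4.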

Discussion

Comparison of DNA sequences of similar genes within or across species entails the identification of regions of significant similarities and differences. Regions that are clearly distinct may embody species-specific characteristics. Similar regions may indicate control or structural elements due to homology (common ancestry) or analogy (convergence). Generally, such comparisons concentrate on similarities and differences of DNA content. In the foregoing sections we have described various sequence features of the Ig kappa-gene DNA sequences for human, mouse, and rabbit. We conclude by briefly highlighting the main results and then we contemplate possible experiments for assessing mechanisms of V-J rearrangements and splicing in relation to the sequence features identified in the previous sections.

(i) DS pairings within each sequence. Close G + C-rich DS pairs are concentrated in the J5-C intron close to the e2 control element and the known enhancer element e1 (Karlin and Ghandour 1985a, Table 1). The region e2 is further distinguished by having a 14-bp DS pair in the mouse sequence that links it with a highly conserved region 5' to J1.

(ii) DS pairings between V and J-C regions. The consensus heptamer and not the nonamer embodies the strongest correspondence between the V- and J-gene segment recognition sequences in terms of the length of exact DS pairing. More than seven (generally nine to 12) nucleotides are involved in these pairings. The experiments of Lewis et al. (1985) suggest that the heptamer elements play a key role in V-J joining. A bias toward J1 and J2 in Ig expression is reported in Wood and Coleclough (1984). However, the longest V-J exact dyad pairings occur with J4 and J5 (Table 2). The greater numbers of V-J1 and V-J2 recombinants may be due to the stronger relationships these J regions have with enhancer control elements of the large intron. We hypothesized that the extended nonamer elements exercise negative control on J-C splicing.

(iii) Long runs of nucleotides and repeat clusters. These types of sequence features tend to single out the regions of high duplication in the rabbit sequence. The most significant runs are found 5' to J1 and in the J5-C intron, which regions are heavily (95%) A + T and concomitantly display long runs of weakly bonding bases and clusters of A + T-rich repeats. The regions 3' to C in human and mouse show a preponderance of pyrimidine stretches, which may be related to a termination signal.

(iv) High-frequency oligonucleotides. The consensus nonamer dominates the composition of the hf repeats of length ≥ 8 bp over the aggregate of the three sequences. The mouse shows the highest number of hf repeats overlapping the J-gene segments. These are suggestive of a higher degree of homogenization across the J segments in the mouse than in the human and rabbit.

One experimental approach to investigating the nature of the RNA splicing choices for Ig-κ and to defining the role of the V-J joining event in association with the inter-J signals would utilize two DNA constructs. The first would incorporate a promoter element with a partial sequence of a known gene ligated upstream from the J1-Cκ sequence, thereby retaining the proximal flanking inter-J segments through the termination signal after Cκ. A second construct would have the promoter and sequence of this known gene replacing the 5' region of one of the J regions. This would simulate the result of V-J joining, but with V replaced by the known gene segment. Plasmids containing one of each of these constructs can be transfected into pre-B lymphocytes (e.g., see Potter et al. 1984). Analysis of the resulting primary transcripts and mature RNA from these transfections could yield information about the splicing mechanism. Specifically, if the presence of signals 5' to each J represses splicing of the corresponding J, then for the first construct there will be no size differences between nuclear and mature RNA transcripts (all splice junctions will be repressed) or no mature RNA transcript will be recovered. For the second construct, one would expect correct splicing for the J segment with no 5' flank. Alternatively, if the splice activation signal is carried specifically within the V region, both constructs should display the same relationship between the primary and mature RNA, since the V segment is absent from both. The possibility of positive interactions of signals in the V sequence and elements 3' to the prescribed J in activating correct splicing can be investigated by deletion experiments.

A procedure described in Lewis et al. (1984) provides a system with which to test the degree to which the consensus nonamer is involved in V-J rearrangement. This investigation would entail some modification of the regions 5' to J segments by deletion of the nonamer element. Using the procedure of Lewis et al., one could then assay for V-J rearrangements in the absence of the nonamer signal. A possible modification would be to utilize one or several J segments with the 5' nonamer region deleted. This type of investigation could be repeated to test for various signals associated with V-J joining (e.g., one could remove the V nonamer signal, or shorten or lengthen the 23-bp heptamer-nonamer spacer sequence).

Acknowledgments. We are happy to thank Drs. P. Jones and B. Blaisdell for valuable comments on the manuscript. This work was supported in part by NIH grants 2R01 GM10452-21 and 1R01 HL30856 and NSF grant MCS 82-15131.


References

Blaisdell BE (1983) A prevalent persistent nonrandomness that distinguishes coding and noncoding eucaryotic nuclear DNA sequences. J Mol Evol 19:122-133

Chung S-Y, Folsom V, Wooley J (1983) DNase I-hypersensitive sites in the chromatin of immunoglobulin κ light chain genes. Proc Natl Acad Sci USA 80:2427-2431

Dover G (1982) Molecular drive: a cohesive mode of species evolution. Nature 299:111-117

Durdik J, Moore MW, Selsing E (1984) Novel κ light-chain gene rearrangements in mouse λ light chain-producing B lymphocytes. Nature 307:749-752

Early P, Huang H, Davis M, Calame K, Hood L (1980) An immunoglobulin heavy chain variable region gene is generated from three segments of DNA: V, D, and J. Cell 19:981-992

Gillies SD, Morrison SL, Oi VT, Tonegawa S (1983) Tissue-specific enhancer element in the major intron of an immunoglobulin heavy-chain gene. In: Gluzman Y, Shenk T (eds) Current communications in molecular biology: enhancers and eukaryotic gene expression. Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, pp 121-128

Gluzman Y, Shenk T (eds) (1983) Current communications in molecular biology: enhancers and eukaryotic gene expression. Cold Spring Harbor Laboratory, Cold Spring Harbor, New York

Grosschedl R, Birnstiel ML (1980) Identification of regulatory sequences in prelude sequences of an H2A histone gene by the study of specific deletion mutants in vivo. Proc Natl Acad Sci USA 78:727-731

Hieter PA, Maizel JV, Leder P (1982) Evolution of human immunoglobulin κ J region genes. J Biol Chem 257:1516-1522

Karlin S, Ghandour G (1985a) Alignment maps and homology analysis of the J-C intron in human, mouse and rabbit immunoglobulin kappa gene. Mol Biol Evol 2:53-65

Karlin S, Ghandour G (1985b) Comparative statistics for DNA and protein sequences: single sequence analysis. Proc Natl Acad Sci USA 82:5800-5804

Karlin S, Ost F (1985) Maximal segmental match length among random sequences from a finite alphabet. In: Lecam L, Olsen R (eds) Berkeley symposium in probability and statistics in honor of J. Kiefer and J. Neyman. Wadsworth, Belmont, California, pp 225-244

Karlin S, Ghandour G, Ost F, Tavaré S, Korn LJ (1983) New approaches for computer analysis of nucleic acid sequences. Proc Natl Acad Sci USA 80:5660-5664

Karlin S, Ghandour G, Foulser DE (1985) DNA sequence comparisons of the human, mouse and rabbit immunoglobulin kappa gene. Mol Biol Evol 2:35-52

Laimins LA, Kessel M, Rosenthal N, Khoury G (1983) Viral and cellular enhancer elements. In: Gluzman Y, Shenk T (eds) Current communications in molecular biology: enhancers and eukaryotic gene expression. Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, pp 28-37

Lewin B (1983) Genes. John Wiley & Sons, New York, pp 205-208

Lewis S, Gifford A, Baltimore D (1984) Joining of Vκ to Jκ gene segments in a retroviral vector introduced into lymphoid cells. Nature 308:425-428

Lewis S, Gifford A, Baltimore D (1985) DNA elements are asymmetrically joined during site specific recombination of kappa immunoglobulin genes. Science 228:677-685

Nomura M, Yates JL, Jean D, Post LE (1981) Feedback regulation of ribosomal protein expression in Escherichia coli: structural homology of ribosomal RNA and ribosomal protein mRNA. Proc Natl Acad Sci USA 77:7084-7088


Parslow TG, Granner DK (1982) Chromatin changes accompany immunoglobulin κ gene activation: a potential control region within the gene. Nature 299:449-451

Picard D, Schaffner W (1984) A lymphocyte specific enhancer in the mouse immunoglobulin κ gene. Nature 307:80-82

Potter H, Weir L, Leder P (1984) Enhancer dependent expression of human κ immunoglobulin genes introduced into mouse pre-B lymphocytes by electroporation. Proc Natl Acad Sci USA 81:7161-7165

Queen C, Baltimore D (1983) Immunoglobulin gene transcription is activated by downstream sequence elements. Cell 33:741-748

Tinoco I, Borer PN, Dengler B, Levine MD, Uhlenbeck OC, Crothers DM, Gralla J (1973) Improved estimation of secondary structure in ribonucleic acids. Nature (New Biol) 246:40-41

Tjian R (1978) Binding site on SV40 DNA for a T-antigen related protein. Cell 13:165-179

Tsujimoto Y, Hirose S, Tsuda M, Suzuki Y (1981) Promoter sequence of fibroin gene assigned by in vitro transcription system. Proc Natl Acad Sci USA 78:4838-4842

Wang AH-J, Quigley GJ, Kolpak FJ, Crawford JL, van Boom JH, van der Marel G, Rich A (1979) Molecular structure of a left-handed double helical DNA fragment at atomic resolution. Nature 282:680-686

Wood DL, Coleclough C (1984) Different joining region J elements of the murine κ immunoglobulin light chain locus are used at markedly different frequencies. Proc Natl Acad Sci USA 81:4756-4760

Wu CI, Li W-H (1985) Evidence for higher rates of nucleotide substitutions in rodents than in man. Proc Natl Acad Sci USA 82:1741-1745

Yancopoulos GD, Desiderio SV, Paskind M, Kearney JF, Baltimore D, Alt FW (1984) Preferential utilization of the most JH-proximal VH gene segments in pre-B-cell lines. Nature 311:727-733

Received March 4, 1985/Revised June 6, 1985