questions to be addressed

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS

Questions to be addressed

• Can multiple D genes be inserted?
  – Violation of 12/23 rule

• Can D genes be inserted backwards?

• Is there a D gene preference?

• Is there a reading frame preference for D genes?
  – If yes, is it part of the gene rearrangement?

• Who is doing the end trimming?

Data sets

• 6329 clonally unrelated rearrangements
  – 1968 un-mutated functional
  – 3707 mutated functional
  – 274 un-mutated non-functional
  – 380 mutated non-functional

P nucleotides

Distance = distance from heptamer to gene end

                  Sequences                          Permutated sequences
Distance  No. of seq  No. with P  % with P   No. of seq  No. with P  % with P   p-value

VH gene
1               1448         474      32.7         1635         103       6.3    <10^-5
2               1027          48       4.7         1068          65       6.1     0.091
3                762          53       7.0          612          36       5.9     0.245

JH gene
1                324          60      18.5          350          23       6.6    <10^-5
2                184           2       1.0          209           3       1.4     0.560
3                219           8       3.7          250          14       5.6     0.220

5' end of D gene
1                519         128      24.7          619          54       8.7    <10^-5
2                343          31       9.0          347          26       7.5     0.275
3                474          25       5.3          454          17       3.7     0.168

3' end of D gene
1                616          86      14.0          684          58       8.5     0.001
2                266          30      11.3          276          24       8.7     0.195
3                460           5       1.1          485           9       1.9     0.241
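The slides do not say which test produced the p-values in the table above. As a rough sketch, assuming a pooled two-proportion z-test (a stand-in, not necessarily the test the authors used), the comparison between real and permutated sequences could be computed like this. The function name `two_proportion_p` is hypothetical.

```python
import math

def two_proportion_p(hits_a: int, n_a: int, hits_b: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# VH gene, distance 1: 474 of 1448 real vs. 103 of 1635 permutated sequences
p_vh1 = two_proportion_p(474, 1448, 103, 1635)
# VH gene, distance 2: 48 of 1027 real vs. 65 of 1068 permutated sequences
p_vh2 = two_proportion_p(48, 1027, 65, 1068)

print(f"{100 * 474 / 1448:.1f}% vs {100 * 103 / 1635:.1f}%, p = {p_vh1:.2g}")
print(f"{100 * 48 / 1027:.1f}% vs {100 * 65 / 1068:.1f}%, p = {p_vh2:.2g}")
```

Under this assumed test the distance 1 row is clearly significant and the distance 2 row is not, which is the same pattern as in the table; the exact p-values need not match those on the slide.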

How many types of D genes?

• Conventional D genes
  – Identified in 81% of unmutated sequences, 64% of mutated sequences

• Inverted D genes
  – Long inverted D genes cannot be excluded

• Two D genes

• D genes with irregular RSS (DIR)

• Chromosome 15 OR

D gene usage

27 conventional D genes, 34 known alleles

[Figure: D-Gene Usage and Lengths. x-axis: D Gene; left y-axis: Number of Sequences (bars), 0 to 800; right y-axis: Average Length (triangles) and Germline Length (diamonds), 0 to 40.]

D genes on the x-axis: IGHD1-1, IGHD2-2, IGHD3-3, IGHD4-11/IGHD4-4, IGHD5-18/IGHD5-5, IGHD6-6, IGHD1-7, IGHD2-8, IGHD3-9, IGHD3-10, IGHD5-12, IGHD6-13, IGHD1-14, IGHD2-15, IGHD3-16, IGHD4-17, IGHD6-19, IGHD1-20, IGHD2-21, IGHD3-22, IGHD4-23, IGHD5-24, IGHD6-25, IGHD1-26, IGHD7-27

D-gene usage and JH gene

• JH-proximal D genes are more often recombined to JH4 than to JH6, while JH-distal D genes are more often recombined to JH6

Inverted (palindromic) D genes

Inverted D genes are not used! (or are used extremely infrequently)
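A minimal sketch of how inverted use could be searched for, assuming that "inverted" means the D gene appears in the junction as its reverse complement. The helper `reverse_complement` is a hypothetical name; the IGHD1-7 sequence is the one listed on the DIR slide.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

IGHD1_7 = "GGTATAACTGGAACTAC"
inverted = reverse_complement(IGHD1_7)
print(inverted)  # the string to look for when testing for inverted use
```

Matching the junction against both `IGHD1_7` and `inverted` with the same scoring, and comparing the hit rates with those in permutated sequences, is one way to arrive at a statement like the one above.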

D genes with irregular RSS (recombination signal sequence) (DIR)

• Very long, >180 bp

• Contain a family 1 D gene

>DIR1 (in between D6-6 and D1-7)
GGTGTTCCGCTAGCTGGGGCTCACAGTGCTCACCCCACACCTAAAACGAGCCACAGCCTCCGGAGCCCCTGAAGGAGACCCCGCCCACAAGCCCAGCCCCCACCCAGGAGGCCCCAGAGCACAGGGCGCCCCGTCGGATTCTGAACAGCCCCGAGTCACAGTGGGTATAACTGGAACTAC
>IGHD1-7-01|X13972|IGHD1-7-01|Homo sapiens|F|D-REGION
GGTATAACTGGAACTAC
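The claim that DIR1 contains a family 1 D gene can be checked directly on the two sequences listed above. A small sketch, using only the last 30 nucleotides of DIR1 to keep it short (the variable names are hypothetical):

```python
# Last 30 nt of DIR1 as listed above, and the full IGHD1-7 D-REGION
DIR1_TAIL = "CCGAGTCACAGTGGGTATAACTGGAACTAC"
IGHD1_7 = "GGTATAACTGGAACTAC"

offset = DIR1_TAIL.find(IGHD1_7)          # -1 would mean "not contained"
ends_with_gene = DIR1_TAIL.endswith(IGHD1_7)
print(offset, ends_with_gene)
```

Because the conventional gene sits at the very end of DIR1, a junction that matches only this part cannot by itself distinguish use of DIR1 from use of IGHD1-7.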

D genes with irregular RSS (DIR)

• Very long, >180 bp

• Contain a family 1 D gene

• Found in 1% of sequences, inverted in 1.2%

• Some explained as family 1 gene plus N additions

• Median length of the remaining hits is not different from that in permutated sequences

=> No evidence for use of DIR

Two D genes

• 2 D genes found in 1% of sequences

• Frequency not different from permutated sequences

• Some explained as one long D gene with a deletion

• Some not possible due to the location of the D genes

• Median length of the longest gene resembles normal D genes; that of the shortest resembles hits in permutated sequences

Multiple D genes

• 65 sequences with two D genes

• Average length of shortest D genes: 11.6 bp

• Average length of longest D genes: 18.8 bp

• Average length of D genes in permuted sequences: 11.3 bp

• Average length of D genes in normal sequences: 17.8 bp

• => multiple D genes are not present!!!

[Figure: schematic of a rearrangement with the segments labelled V-gene, Longest-D, Shortest-D and J-gene]
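The slides use hits in permutated sequences as the baseline for what a chance match looks like, but do not describe the procedure. A minimal sketch, assuming that permutation means shuffling the nucleotides of each junction and that a hit is scored by its longest exact match to a germline D gene. All function names and the example junction are made up for illustration.

```python
import random

def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest exact match shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def permuted(seq: str, rng: random.Random) -> str:
    """Shuffle the nucleotides of seq, keeping its composition."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

def mean_best_match(junctions, genes) -> float:
    """Average, over junctions, of the best match length to any gene."""
    return sum(max(longest_common_substring(j, g) for g in genes)
               for j in junctions) / len(junctions)

D_GENES = ["GGTATAACTGGAACTAC"]   # IGHD1-7, from the DIR slide
junctions = ["AAGGTATAACTGGTT"]   # made-up junction, illustration only
rng = random.Random(0)
shuffled = [permuted(j, rng) for j in junctions]

print(mean_best_match(junctions, D_GENES), mean_best_match(shuffled, D_GENES))
```

If the second, shorter D gene in a putative D-D junction matches no better than hits in shuffled junctions (11.6 bp against 11.3 bp above), it is most simply explained as a chance match within N nucleotides.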

Chromosome 15 OR (orphons)

• 10 OR resembling D genes on chromosome 15

• High homology to conventional D genes

>IGHD5-12-01|X13972|IGHD5-12-01|Homo sapiens|F|D- 275 aa vs.
>IGHD5-OR15-5 |X55583 and X55584 253 aa
91.3% identity; Global alignment score: 1563

          10        20         30        40        50
  FINDFASTADTEMPLATESDATIGHD-SATSEPMNIELHMEPRECTSMNIELANTIBDYH
  :::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::
  FINDFASTADTEMPLATESDATIGHDRSATSEPMNIELHMEPRECTSMNIELANTIBDYH
          10        20        30        40        50        60

 60        70        80        90       100       110
  MMDGENANALYSISIRIXRGANISMIPCMMANDLINEPARAMETERSSETTLARGVISAN
  ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
  MMDGENANALYSISIRIXRGANISMIPCMMANDLINEPARAMETERSSETTLARGVISAN
          70        80        90       100       110       120

120       130       140       150       160       170
  AMELISTDEFALTISNAMEXEXCLDENAMESMFINDENTRYBYCMMNSBMATCHAFINDA
  ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
  AMELISTDEFALTISNAMEXEXCLDENAMESMFINDENTRYBYCMMNSBMATCHAFINDA
         130       140       150       160       170       180

180       190       200       210       220       230
  LLENTRIESSNAMEMSTBEINSTARTFNAMENMBERFFASTAENTRIESREADFRMFILE
  ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
  LLENTRIESSNAMEMSTBEINSTARTFNAMENMBERFFASTAENTRIESREADFRMFILE
         190       200       210       220       230       240

240       250       260       270
  DTEMPLATESDATGTGGATATAGTGGCTACGATTAC
  :::::::::::::
  DTEMPLATESDAT-----------------------
         250
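The 91.3% identity quoted above can be reproduced from the alignment itself, assuming identity is counted as matching columns divided by all alignment columns, gaps included. A short sketch (the function name `percent_identity` is hypothetical), applied to the last alignment block and then to the per-block counts read off the colon lines:

```python
def percent_identity(top: str, bottom: str) -> float:
    """Identical, non-gap columns as a percentage of all alignment columns."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    matches = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    return 100.0 * matches / len(top)

# Last block of the alignment above
last_top = "DTEMPLATESDATGTGGATATAGTGGCTACGATTAC"
last_bottom = "DTEMPLATESDAT" + "-" * 23
print(round(percent_identity(last_top, last_bottom), 1))

# Matching columns and total columns in each of the five blocks
matches_per_block = [59, 60, 60, 60, 13]
columns_per_block = [60, 60, 60, 60, 36]
overall = 100.0 * sum(matches_per_block) / sum(columns_per_block)
print(round(overall, 1))
```

With 252 matching columns out of 276, the overall figure agrees with the value reported in the alignment header.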

Chromosome 15 OR (orphons)

• 10 OR resembling D genes on chromosome 15

• High homology to conventional D genes

• Very few OR15 in un-mutated sequences

• Median length not different from hits in permutated sequences

=> No evidence for use of OR15 genes

D gene reading frames

• The recombination mechanism utilises each D gene reading frame at the same frequency

Reading frame (frame number in parentheses), with P and NP values for each frame:

            Stop                          Hydrophilic                   Hydrophobic
Gene        Frame               P    NP   Frame               P    NP   Frame              P    NP
D2-2*01     RIL**YQLLC (1)    6.5  34.7   GYCSSTSCYA (2)   61.2  32.6   DIVVVPAAM (3)   32.2  32.6
D2-2*02     RIL**YQLLY (1)   11.3  46.7   GYCSSTSCYT (2)   55.0  20.0   DIVVVPAAI (3)   33.8  33.3
D2-2*03     WIL**YQLLC (1)    0.0  50.0   GYCSSTSCYA (2)   66.7  50.0   DIVVVPAAM (3)   33.3   0.0
D2-8*01     RILY*WCMLY (1)    2.4  42.9   GYCTNGVCYT (2)   68.3  28.6   DIVLMVYAI (3)   29.3  28.6
D2-8*02     RILYWWCMLY (1)    0.0   0.0   GYCTGGVCYT (2)   88.9   0.0   DIVLVVYAI (3)   11.1   100
D2-15*01    RIL*WW*LLL (1)    1.5  32.5   GYCSGGSCYS (2)   70.8  37.5   DIVVVVAAT (3)   27.7  30.0
D2-21*01    SILWW*LLF (1)     8.3  50.0   AYCGGDCYS (2)    58.3  25.0   HIVVVIAI (3)    33.3  25.0
D2-21*02    SILWW*LLF (1)     0.0  54.5   AYCGGDCYS (2)    78.0  18.2   HIVVVTAI (3)    22.0  27.3
Total       -                10.8  33.6   -                62.2  32.4   -               26.9  34.0
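The three peptide columns in the table are the translations of one D gene in its three forward reading frames. A sketch of that translation for D2-2*01 follows. The nucleotide sequence is not on the slide; it is assumed here to be the IGHD2-2*01 germline sequence and is used because its translations reproduce the D2-2*01 row. The helper `translate` is a hypothetical name.

```python
# Standard genetic code, codons ordered T, C, A, G at each position
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {a + b + c: AMINO[16 * i + 4 * j + k]
          for i, a in enumerate(BASES)
          for j, b in enumerate(BASES)
          for k, c in enumerate(BASES)}

def translate(seq: str, frame: int) -> str:
    """Translate seq in forward reading frame 1, 2 or 3 ('*' marks a stop)."""
    start = frame - 1
    usable = len(seq) - (len(seq) - start) % 3   # drop an incomplete last codon
    return "".join(CODONS[seq[i:i + 3]] for i in range(start, usable, 3))

# Assumed germline sequence of IGHD2-2*01 (not given on the slide)
IGHD2_2 = "AGGATATTGTAGTAGTACCAGCTGCTATGCC"

for frame in (1, 2, 3):
    print(frame, translate(IGHD2_2, frame))
```

Frame 1 contains two stop codons, frame 2 is the hydrophilic frame and frame 3 the hydrophobic one, matching the column grouping of the table.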

N nucleotide dependence on end nucleotide

              Position X+1
Position X    A       T       G       C       P-value
A             0.292   0.146   0.292   0.271   0.04
T             0.260   0.290   0.207   0.243   0.016
G             0.204   0.172   0.453   0.172   0.0004
C             0.136   0.204   0.231   0.430   <0.0001
Expected      0.210   0.201   0.292   0.298   -

N addition is not random but dependent on end nucleotide
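Each row of the table is a conditional frequency: given the nucleotide at position X, how often each nucleotide follows at X+1. A minimal sketch of how such rows could be tallied, assuming each input string is the gene-end nucleotide followed by the N nucleotides added to it. The function name and the three input strings are made up for illustration.

```python
from collections import Counter, defaultdict

def transition_frequencies(n_regions):
    """For each nucleotide at position X, the frequency of each nucleotide at X+1."""
    counts = defaultdict(Counter)
    for seq in n_regions:
        for x, nxt in zip(seq, seq[1:]):
            counts[x][nxt] += 1
    return {x: {base: c[base] / sum(c.values()) for base in "ATGC"}
            for x, c in counts.items()}

# Made-up sequences, illustration only
freqs = transition_frequencies(["GGGA", "CCT", "AG"])
for x, row in sorted(freqs.items()):
    print(x, {base: round(f, 3) for base, f in row.items()})
```

Comparing each row with the overall composition (the "Expected" row) is what reveals the tendency for a nucleotide to be followed by itself, most visibly for G and C.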

Trimming of gene ends

[Figure: Trimming of VH. x-axis: End Position (15 down to 1, then P1, P2); y-axis: Number of Sequences (0 to 140); series: Observed and Predicted. Avg. 3.8 bp]

Trimming depends on the gene end and cannot be described solely by a simple removal of one nucleotide at a time
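The slide does not say how the "Predicted" series was obtained. One way to read "simple removal of one nucleotide at a time" is a model in which, after each removed nucleotide, another one is removed with a fixed probability q, giving a geometric distribution of trim lengths. The sketch below is that assumption only, fitted to the reported average of 3.8 bp; the function name is hypothetical.

```python
def geometric_trimming(mean_trim: float, max_trim: int) -> list:
    """P(k nucleotides trimmed) for k = 0..max_trim under a constant-rate model."""
    # For P(k) = (1 - q) * q**k the mean is q / (1 - q), so q = mean / (1 + mean)
    q = mean_trim / (1.0 + mean_trim)
    return [(1.0 - q) * q ** k for k in range(max_trim + 1)]

probs = geometric_trimming(3.8, 15)
for k, p in enumerate(probs):
    print(k, round(p, 3))
```

Under such a model the frequencies can only fall as the trim length grows, for every gene alike, so gene-specific peaks in the observed distribution are what a constant-rate model cannot capture.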

VDJsolver performance

[Figure: VDJsolver performance; panels: Unmutated sequences, Mutated sequences. Significance markers: #: p<0.01; §: p<0.001]

Results regarding recombination and diversity and open questions

• DIR, OR15, multiple D genes and VH replacements are not used at a significant rate

• Inverted D genes are used rarely

• Not all D genes are used at the same frequency
  What determines if a D gene is used?

• D gene usage is somewhat dependent on the JH gene
  Do multiple D-J recombination steps take place?

• All D gene reading frames are used at an equal rate at the recombination step
  At what step in development does the selection for the hydrophilic reading frame happen?

Results regarding recombination and diversity and open questions (cont.)

• N addition not random but dependent on end nucleotide

Does nucleotide availability or the specificity of TdT determine the N addition?

• Trimming not random but dependent on gene and sequence

Which enzyme(s) are responsible for the trimming?

Numbering Schemes

The Kabat numbering scheme is a widely adopted standard for numbering the residues in an antibody in a consistent manner. However, the scheme has problems!

The Chothia numbering scheme is identical to the Kabat scheme, but places the insertions in CDR-L1 and CDR-H1 at the structurally correct positions. This means that topologically equivalent residues in these loops do get the same label (unlike the Kabat scheme).

The IMGT unique numbering for all IG and TR V-REGIONs of all species relies on the high conservation of the structure of the variable region. This numbering, set up after aligning more than 5 000 sequences, takes into account and combines the definition of the framework (FR) and complementarity determining regions (CDR), structural data from X-ray diffraction studies, and the characterization of the hypervariable loops.

http://www.bioinf.org.uk/abs/#kabatnum http://imgt.cines.fr/

Identification of CDR regions

Identifying the CDRs

CDR-L1
  Start: Approx. residue 24
  Residue before: always C
  Residue after: always W. Typically WYQ, but also WLQ, WFQ, WYL
  Length: 10 to 17 residues

CDR-L2
  Start: always 16 residues after the end of CDR-L1
  Residues before: generally IY, but also VY, IK, IF
  Length: always 7 residues

CDR-L3
  Start: always 33 residues after end of CDR-L2
  Residue before: always C
  Residues after: always FGXG
  Length: 7 to 11 residues

CDR-H1
  Start: Approximately residue 31 (always 9 after a C) (Chothia/AbM definition starts 5 residues earlier)
  Residues before: always CXXXXXXXX
  Residues after: always W. Typically WV, but also WI, WA
  Length: 5 to 7 residues (Kabat definition); 7 to 9 residues (Chothia definition); 10 to 12 residues (AbM definition)

CDR-H2
  Start: always 15 residues after the end of Kabat/AbM definition of CDR-H1
  Residues before: typically LEWIG, but a number of variations
  Residues after: K[RL]IVFT[AT]SIA (where residues in square brackets are alternatives at that position)
  Length: Kabat definition 16 to 19 residues (AbM definition and most recent Chothia definition ends 7 residues earlier; earlier Chothia definition starts 2 residues later and ends 9 earlier)

CDR-H3
  Start: always 33 residues after end of CDR-H2 (always 3 after a C)
  Residues before: always CXX (typically CAR)
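Rules of this kind translate naturally into pattern matching. As a rough sketch, the CDR-L3 rules (preceded by C, followed by FGXG, 7 to 11 residues long) can be written as a regular expression. This ignores the positional rule relative to CDR-L2, so it is a simplification rather than a full implementation; the function name and the example fragment are made up for illustration.

```python
import re

# C, then 7-11 residues (the CDR), then F-G-any-G
CDR_L3 = re.compile(r"C(?P<cdr>[A-Z]{7,11})FG[A-Z]G")

def find_cdr_l3(light_chain: str):
    """Return the first CDR-L3 candidate in the sequence, or None."""
    match = CDR_L3.search(light_chain)
    return match.group("cdr") if match else None

# Made-up light-chain fragment, illustration only
fragment = "DFATYYCQQYNSYPLTFGGGTKVEIK"
print(find_cdr_l3(fragment))
```

On a full-length chain the pattern should be anchored using the position of CDR-L2, since an earlier C followed by a chance FGXG motif would otherwise be reported first.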
