the hla system - cdn.ymaws.com · q a 1 d q b 1 r b 1 coverage hla coverage over wgs average...


The Application of NGS to HLA Typing

Challenges in Data Interpretation


Marcelo A. Fernández Viña, Ph.D.

Department of Pathology

Medical School

Stanford University

The HLA system

• High degree of polymorphism at most of the expressed loci (function)

• Lack of a single predominant allele, high degree of heterozygosity (function)

• Strong linkage disequilibrium (unknown, function?)

[Figure: diagram with numeric labels; values not recoverable from the extracted text]


GENOMIC ORGANIZATION OF THE HLA GENES

HLA-A

HLA-B

HLA-C

HLA-DQA1

HLA-DQB1

HLA-DRB1

HLA-DPA1

HLA-DPB1

[Figure: DRB1*08 alleles, with regions labeled Intron 1 and Intron 2]

HLA coverage over WGS

[Figure: average and minimum coverage (y-axis: Coverage, 0 to 180) at HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1 and -DRB1 for Samples 1 to 4]

Why not whole-genome sequencing?

• Inadequate coverage of complex genomic regions, such as HLA. Conventional WGS (30x avg. coverage) provides only sparse coverage of HLA. Complexities due to:

– Indels

– GC-rich regions, secondary structure

– Paralogous genes

– Repeat regions across HLA loci

• Cost. With WGS, achieving adequate coverage of HLA would require >1,000x avg. coverage

J.Immunol. 1992 Jun 15;148(12):4043-53.

HLA-J, a second inactivated class I HLA gene related to HLA-G and HLA-A.

Implications for the evolution of the HLA-A-related genes.

Messer G, Zemmour J, Orr HT, Parham P, Weiss EH, Girdlestone J.

Ragoussis and co-workers described a class I HLA gene that maps to within 50 kb of HLA-A. Comparison of the nucleotide sequences of HLA-J alleles shows this gene is more related to HLA-G, A, and H.

All alleles of HLA-J are pseudogenes because of deleterious mutations that produce translation termination in either exon 2 or exon 4.

HLA-J appears, like HLA-H, to be an inactivated gene that results from duplication of an Ag-presenting locus related to HLA-A.

Evolutionary relationships as assessed by construction of trees suggest the four modern loci: HLA-A, G, H, and J were formed by successive duplications from a common ancestral gene.

In this scheme one intermediate locus gave rise to HLA-A and H, the other to HLA-G and J.


Alleles at different HLA loci (genes and pseudogenes) share nucleotide sequences

HLA-A and HLA-H (pseudogene)

AA Codon 30 35 40 45 50

A*24:02:01:01 GGC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AGG ATG GAG CCG CGG GCG CCG

A*01:01:01:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -A- --- --- --- --- --- ---

A*02:01:01:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

A*25:01:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

A*32:01:01 --- --- --- --- --- --- --- --- --- --- --T --- --- --- --- --- --- --- --- --- --- --- --- --- ---

H*01:01:01:01 GGC TAC GTG GAC GAT ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AGG ATG GAG CCG CGG GCG CCG

HLA-A, B and HLA-H (pseudogene)

AA Codon 80 85 90

B*57:01:01 GAG AAC CTG CGG ATC GCG CTC CGC TAC TAC AAC CAG AGC GAG GCC G

B*07:02:01 --- -G- --- --- -A- CT- -G- G-- --- --- --- --- --- --- --- -

B*08:01:01 --- -G- --- --- -A- CT- -G- G-- --- --- --- --- --- --- --- -

B*15:17:01:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -

B*35:01:01:01 --- -G- --- --- -A- CT- -G- G-- --- --- --- --- --- --- --- -

B*44:02:01:01 --- --- --- --C -C- --- --- --- --- --- --- --- --- --- --- -

B*51:01:01:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -

AA Codon 80 85 90

H*01:01:01:01 GAG AAC CTG CGG ATC GCG CTC CGC TAC TAC AAC CAG AGC GAG GGC G

AA Codon 80 85 90

A*24:02:01:01 GAG AAC CTG CGG ATC GCG CTC CGC TAC TAC AAC CAG AGC GAG GCC G

A*01:01:01:01 -C- --- --- G-- -C- CT- -G- G-- --- --- --- --- --- --- -A- -

A*02:01:01:01 -T- G-- --- G-- -C- CT- -G- G-- --- --- --- --- --- --- --- -

A*25:01:01 --- -G- --- --- --- --- --- --- --- --- --- --- --- --- -A- -

A*32:01:01 --- -G- --- --- --- --- --- --- --- --- --- --- --- --- --- -
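The alignments above use the usual shorthand in which a dash means "same base as the first (reference) row". As a minimal sketch of how such rows can be read programmatically, the following expands a dash-notated row into its full sequence and lists the codon groups that differ. The function names are illustrative, and the sketch assumes substitution-only rows with no gap or insertion markers.

```python
def expand_row(reference: str, row: str) -> str:
    """Expand a dash-notated alignment row against its reference row.

    '-' means "identical to the reference base"; any letter is a substitution.
    Both rows are whitespace-separated codon groups, as on the slide.
    """
    ref_groups = reference.split()
    row_groups = row.split()
    if len(ref_groups) != len(row_groups):
        raise ValueError("row and reference have different numbers of groups")
    expanded = []
    for ref_codon, row_codon in zip(ref_groups, row_groups):
        if len(ref_codon) != len(row_codon):
            raise ValueError("codon groups differ in length")
        expanded.append(
            "".join(r if q == "-" else q for r, q in zip(ref_codon, row_codon))
        )
    return " ".join(expanded)


def substitutions(reference: str, row: str):
    """Return (group index, reference codon, row codon) for each differing group."""
    full = expand_row(reference, row).split()
    return [
        (i, r, f)
        for i, (r, f) in enumerate(zip(reference.split(), full))
        if r != f
    ]


# Rows copied from the last alignment block (reference row A*24:02:01:01).
A24 = "GAG AAC CTG CGG ATC GCG CTC CGC TAC TAC AAC CAG AGC GAG GCC G"
A32 = "--- -G- " + "--- " * 13 + "-"   # the A*32:01:01 row
H01 = "GAG AAC CTG CGG ATC GCG CTC CGC TAC TAC AAC CAG AGC GAG GGC G"

print(substitutions(A24, A32))
print(substitutions(A24, H01))
```

In this segment the pseudogene row H*01:01:01:01 differs from A*24:02:01:01 in a single codon group, which illustrates how easily reads from HLA-H can be confused with reads from HLA-A.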

DRB Gene Content Varies in Haplotypes Bearing Different DRB1 Allele Sero-Groups (Copy Number Variation has been known in HLA for more than 3 decades)

Haplo-groups

DR1 DQB1 DQA1 DRB1 DRB6 DRB9 DRA HLA-B

DR51 DQB1 DQA1 DRB1 DRB6 DRB5 DRB9 DRA HLA-B

DR52 DQB1 DQA1 DRB1 DRB2 DRB3 DRA HLA-B

DR8 DQB1 DQA1 DRB1 DRB9 DRA HLA-B

DR53 DQB1 DQA1 DRB1 DRB7 DRB8 DRB4 DRB9 DRA HLA-B

Nature. 1986 Jul 3-9;322(6074):67-70.

Polymorphism of human Ia antigens: gene conversion between two DR beta loci results in a new HLA-D/DR specificity.

Gorski J, Mach B.

Molecular mapping of the DR beta-chain region allows true allelic comparisons of the two expressed DR beta-chain loci, DR beta I and DR beta III.

At the more polymorphic locus, DR beta I, the allelic differences are clustered and may result from gene conversion events over very short distances.

The gene encoding the HLA-DR3/Dw3 specificity has been generated by a gene conversion involving the DR beta I and the DR beta III loci of the HLA-DRw6/Dw18 haplotype, as recipient and donor gene, respectively.

The generation of HLA-DR polymorphism within the DRw52 supertypic group can thus be accounted for by a succession of gene duplication, divergence and gene conversion.


Alleles at different HLA-DRB loci share nucleotide

sequences

AA Codon 10 15 20 25

DRB1*01:01:01 CA CGT TTC TTG TGG CAG CTT AAG TTT GAA TGT CAT TTC TTC AAT GGG ACG GAG CGG GTG CGG TTG CTG GAA AGA

DRB1*01:03 -- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

DRB1*03:01:01:01 -- --- --- --- GA- T-C TC- -C- -C- --G --- --- --- --- --- --- --- --- --- --- --- -AC --- --C ---

DRB1*04:02:01 -- --- --- --- GA- --- G-- --A CA- --G --- --- --- --- --C --- --- --- --- --- --- --C --- --C ---

DRB1*07:01:01:01 -- --- --- C-- --- --- GG- --- -A- A-G --- --- --- --- --C --- --- --- --- --- -A- --C --- --- ---

DRB1*11:01:01 -- --- --- --- GA- T-C TC- -C- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --C ---

DRB1*11:30 -- --- --- --- GA- -T- --- --- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --C ---

DRB1*13:01:01 -- --- --- --- GA- T-C TC- -C- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --C ---

DRB3*01:01:02:01 -- --- --- --- GA- -T- -G- --- -C- --G --- --- --- --- --- --- --- --- --- --- --- -AC --- --C ---

DRB3*02:02:01:01 -- --- --- --- GA- -T- --- --- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --G ---

Alleles at different HLA-DRB loci share nucleotide sequences

Importance of determining Phase

AA Codon 10 15 20 25

DRB1*01:01:01 CA CGT TTC TTG TGG CAG CTT AAG TTT GAA TGT CAT TTC TTC AAT GGG ACG GAG CGG GTG CGG TTG CTG GAA AGA

DRB1*03:01:01:02 -- --- --- --- GA- T-C TC- -C- -C- --G --- --- --- --- --- --- --- --- --- --- --- -AC --- --C ---

DRB1*13:01:01:01 -- --- --- --- GA- T-C TC- -C- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --C ---

DRB1*13:67 -- --- --- --- GA- -T- --- --- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --C ---

DRB3*01:01:02:01 -- --- --- --- GA- -T- -G- --- -C- --G --- --- --- --- --- --- --- --- --- --- --- -AC --- --C ---

DRB3*02:02:01:02 -- --- --- --- GA- -T- --- --- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --G ---

AA Codon 30 35 40 45 50

DRB1*01:01:01 TGC ATC TAT AAC CAA GAG GAG TCC GTG CGC TTC GAC AGC GAC GTG GGG GAG TAC CGG GCG GTG ACG GAG CTG GGG

DRB1*03:01:01:02 -A- T-- C-- --- --G --- --- AA- --- --- --- --- --- --- --- --- --- -T- --- --- --- --- --- --- ---

DRB1*13:01:01:01 -A- T-- C-- --- --G --- --- AA- --- --- --- --- --- --- --- --- --- -T- --- --- --- --- --- --- ---

DRB1*13:67 -A- T-- C-- --- --G --- --- AA- --- --- --- --- --- --- --- --- --- -T- --- --- --- --- --- --- ---

DRB3*01:01:02:01 -A- T-- C-- --- --G --- --- -T- C-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

DRB3*02:02:01:02 CA- T-- C-- --- --G --- --- -A- -C- --- --- --- --- --- --- --- --- --- --- --- --- -G- --- --- ---

AA Codon 55 60 65 70 75

DRB1*01:01:01 CGG CCT GAT GCC GAG TAC TGG AAC AGC CAG AAG GAC CTC CTG GAG CAG AGG CGG GCC GCG GTG GAC ACC TAC TGC

DRB1*03:01:01:02 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -A- --- -G- CG- --- --- -AT --- ---

DRB1*13:01:01:01 --- --- --- --- --- --- --- --- --- --- --- --- A-- --- --A G-C GA- --- --- --- --- --- --- --- ---

DRB1*13:67 --- --- --- --- --- --- --- --- --- --- --- --- A-- --- --A G-C GA- --- --- --- --- --- --- --- ---

DRB3*01:01:02:01 --- --- -TC --- --- -C- --- --- --- --- --- --- --- --- --- --- -A- --- -G- CG- --- --- -AT --- ---

DRB3*02:02:01:02 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -A- --- -G- CA- --- --- -AT --- ---

AA Codon 80 85 90

DRB1*01:01:01 AGA CAC AAC TAC GGG GTT GGT GAG AGC TTC ACA GTG CAG CGG CGA G

DRB1*03:01:01:02 --- --- --- --- --- --- -TG --- --- --- --- --- --- --- --- -

DRB1*13:01:01:01 --- --- --- --- --- --- -TG --- --- --- --- --- --- --- --- -

DRB1*13:67 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -

DRB3*01:01:02:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -

DRB3*02:02:01:02 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -


HLA typing using high throughput sequencing

technologies.

Whole-gene amplification.

Exon-wise amplification of few exons.


Sequencing workflow

[Figure: amplicon spanning Ex 1, Ex 2, Ex 3 and Ex 4, shown before and after library preparation]

Fragmentation

Ligate barcoded adaptors

Size select and purify 300-340 bp fragments

Sequencing library Q/C


What would be the ideal sequencing machine?

• High-throughput

• Accurate

• Long read length

• Simple to use

• Able to detect all types of genomic changes (SNPs, insertions or deletions, large-scale rearrangements, methylation)

High-throughput sequencing technologies: an

overview

• All platforms share core similarities:

– DNA templates are spatially segregated, no physical separation step

– DNA is sequenced through synthesis, rather than termination

– DNA sequence is decoded by the emission of light or pH change

• Platforms differ by:

– Specific method used to generate template libraries

– Chemistries/approaches used to generate the sequence signal (light)

– Throughput (number of bases sequenced per run)

– Length of sequence read

– Error modalities and error rates (e.g. homopolymer regions)

The Platforms that We Tested

• 454-Roche: exon coverage only; multiplexing; workflow was demanding (2011-2012)

• PGM-Ion Torrent: instrument problems; reads too short; homopolymer problems (2012-2013)

• Pacific Biosciences: extremely long reads; throughput and workflow still in development (appears to be simple); base calling (15 percent error) by consensus; homopolymer problems (2014)

• Illumina: lower error rate; robust instruments (2011-2015); various systems

Examples of ambiguities: exon shuffling, segmental exchange, substitutions in untested segments

Potential benefits of next-generation sequencing for HLA typing

• Clonal template amplification in vitro to eliminate the problem of sequencing heterozygous DNA

• Sufficiently long read length (300+ bp) to cover an entire exon (or more) in phase

• Increased sequence coverage of HLA genes

• Capability to multiplex patient specimens

• Potential to complete run and data analysis within one week

Practical Advantages of Extending Sequence Coverage

• Test complete gene

• No assumptions made

• Transplantation: detect mismatches thought to be absent

• Mapping of disease susceptibility factors

• Mapping of Disease Susceptibility Factors


Many allele groups in HLA-A show one allele with an insertion of an extra ‘C’ after a run of seven ‘C’s

A*01010101 AC CCC CCC .AAG ACA CAT ATG ACC CAC CAC

A*0104N -- --- --- C--- --- --- --- --- --- ---

A*02010101 -- G-- --- .--A --G --- --- --T --- ---

A*03010101 -- --- --- .--- --- --- --- --- --- ---

A*0321N -- --- --- C--- --- --- --- --- --- ---

A*310102 -- --- --- .--- --G --- --- --T --- ---

A*3114N -- --- --- C--- --G --- --- --T --- ---

C

A*02010101 AAAACGCATATGACTCACCAC

A*0321N CAAGACACATATGACCCACCA

MAARMSMMWWWK

A*03010101 AAGACACATATGACCCACCAC

A*02null CAAAACGCATATGACTCACCA

MARAMMSMWWWK
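The two mixed strings above (MAARMSMMWWWK and MARAMMSMWWWK) are consistent with IUPAC ambiguity codes obtained by superimposing the two allele sequences base by base, as a Sanger trace of heterozygous DNA would once the inserted 'C' shifts one allele out of register. The following is a minimal sketch of that superposition; the function name is illustrative, and the sequences are copied from the slide.

```python
# IUPAC codes for one base or a mixture of two bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def mixed_trace(allele_1: str, allele_2: str) -> str:
    """Superimpose two sequences position by position into IUPAC codes."""
    return "".join(
        IUPAC[frozenset((a, b))] for a, b in zip(allele_1, allele_2)
    )


# A*02010101 read together with the shifted A*0321N sequence.
pair_1 = mixed_trace("AAAACGCATATGACTCACCAC", "CAAGACACATATGACCCACCA")
# A*03010101 read together with the shifted A*02 null sequence.
pair_2 = mixed_trace("AAGACACATATGACCCACCAC", "CAAAACGCATATGACTCACCA")

print(pair_1[:12])
print(pair_2[:12])
```

The two genotypes produce different mixed patterns downstream of the insertion, but telling them apart from the superimposed signal alone requires decoding every shifted position, which is the problem clonal sequencing avoids.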

Resolution of common and well documented null alleles (clinically relevant)

Locus  Allele     Related allele  Difference  Change         Resolution   Alternative
A      0104N      010101          EXON 4      ins 1          routine SBT
A      0253N      020101          EXON 2      PTC            routine SBT
A      2409N      240201          EXON 4      PTC            routine SBT
A      2411N      240201          EXON 4      ins 1          routine SBT
A      6811N      680102          EXON 1      del 1          ad hoc SSP
B      15010102N  150101          INTRON 1    del 10         ad hoc SSP   extend reading by SBT
B      4022N      400201          EXON 3      PTC            routine SBT
B      4423N      440201          EXON 3      PTC            routine SBT
B      5111N      510101          EXON 4      ins 1          routine SBT
Cw     0409N      040101          EXON 7      del 1          ad hoc SSP
Cw     0507N      050101          EXON 3      del 2          routine SBT
DRB4   01030102N  010301          INTRON 1    splicing site  ad hoc SSP   extend reading by SBT
DRB5   0108N      0102            EXON 3      del 19         ad hoc SSP
DRB5   0110N      0102            EXON 2      del 2          routine SBT

Cw*0401/Cw*0409N if B*4403 is present

PTC = premature termination codon; ins = nuc. insertion; del = nuc. deletion

DRB5*0102/0108N if possible haplotype is DRB1*1502-DQB1*050188

Detection of C*04:09N (common) and A*31:14N (rare) alleles in a single pass

A*31:01:02 (red line) shows interrupted coverage at the beginning of exon 4, while A*31:14N (blue line), which differs from A*31:01:02 by a one-base insertion, shows continuous coverage.

C*04:01:01:01 (red line) shows interrupted coverage at the end of exon 7, while C*04:09N (blue line), which differs from C*04:01:01:01 by a one-base deletion, shows continuous coverage.


HLA Typing by NGS

• Wang C, Krishnakumar S, Wilhelmy J, Babrzadeh F, Stepanyan L, Su LF, Levinson D, Fernandez-Viña MA, Davis RW, Davis MM, Mindrinos M. High-throughput, high-fidelity HLA genotyping with deep sequencing. Proc Natl Acad Sci U S A. 2012 May 29;109(22):8676-81. doi: 10.1073/pnas.1206614109. Epub 2012 May 15. PubMed PMID: 22589303; PubMed Central PMCID: PMC3365218.

• New methodology that leverages the power of Next Generation Sequencing (NGS) and long-range PCR

• Interrogated the entire sequences of the class I genes and most of the extent of the class II genes in more than 9,000 subjects

1. Sample Collection

2. Long-Range PCR

3. Quantification & Pooling

4. Fragmentation

5. Library preparation & Pooling

6. Sequencing

7. Data analysis

[Figure: amplicon maps (5’UTR, numbered exons, 3’UTR) for HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, HLA-DPA1, HLA-DPB1 and HLA-DRB1, 3, 4, 5]

NGS HLA TYPING SYSTEMS

Data Analysis

Shotgun sequencing

Genotype calling

[Figure: coverage (y-axis) vs. position (x-axis) for reads mapped to A*02:07 and A*02:01:01:01. Left panel: genomic mapping. Right panel: cDNA mapping]

One nucleotide difference in exon 3 distinguishes A*02:01:01:01 (A) from A*02:07 (G). The HLA-A type of the cell line BM9 is A*02:01:01:01. The top left panel shows the coverage plot when sequencing reads are mapped to the A*02:01:01:01 and A*02:07 genomic sequences. The top right panel shows the coverage plot when sequencing reads are mapped to the A*02:01:01:01 and A*02:07 cDNA sequences.

High-throughput, high-resolution HLA genotyping

Data Analysis Steps

• De-multiplexing

– Identical barcodes at both ends of paired-end reads

• Lowers the chance of cross-contamination

• Mapping

– Competitive mapping

• Reads are mapped against all available reference sequences, including those from pseudogenes; the best alignments are passed on.

• Filtering

– Best alignments

– Identical alignments (for cDNA only)

– Paired-end alignment

• Genotype calling

– Limited number of candidates (top 10 of each category: number of reads mapped, minimal coverage, minimal central coverage)

– Enumerate possible combinations of homozygous and heterozygous sets

– Rank those combinations on aggregated number of reads mapped, minimal coverage, minimal central coverage

• De novo assembly

– Local de novo assembly can be performed to capture SNPs for novel alleles
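The de-multiplexing rule above (accept a read pair only when both ends carry the same barcode) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: barcodes are assumed to be fixed-length prefixes of each mate, matching is exact, and the function name and sample labels are hypothetical.

```python
from collections import defaultdict


def demultiplex(read_pairs, barcode_to_sample):
    """Assign paired-end reads to samples by barcode.

    read_pairs: iterable of (forward, reverse) sequence strings.
    barcode_to_sample: dict mapping a fixed-length barcode to a sample name.
    A pair is kept only if both mates start with the same known barcode.
    """
    barcode_len = len(next(iter(barcode_to_sample)))
    by_sample = defaultdict(list)
    rejected = 0
    for forward, reverse in read_pairs:
        bc_fwd = forward[:barcode_len]
        bc_rev = reverse[:barcode_len]
        if bc_fwd == bc_rev and bc_fwd in barcode_to_sample:
            # Strip the barcode before passing the reads on to mapping.
            sample = barcode_to_sample[bc_fwd]
            by_sample[sample].append(
                (forward[barcode_len:], reverse[barcode_len:])
            )
        else:
            # Mismatched or unknown barcodes may indicate cross-contamination.
            rejected += 1
    return dict(by_sample), rejected


barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
pairs = [
    ("ACGTGGCTACGT", "ACGTTTAGGCTA"),  # same barcode at both ends
    ("TGCAGGCTACGT", "TGCATTAGGCTA"),  # same barcode at both ends
    ("ACGTGGCTACGT", "TGCATTAGGCTA"),  # barcodes disagree, rejected
]
assigned, n_rejected = demultiplex(pairs, barcodes)
print(assigned, n_rejected)
```

Requiring agreement at both ends trades a small loss of reads for a lower chance of assigning a read to the wrong sample.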


Paired-end Sequencing

[Figure: paired-end reads (~500 bp) shown against Reference Sequence 1 and Reference Sequence 2 (marked ✕)]

Central Read Definition

Using Central Reads Coverage

On the regular coverage plot, the two candidates look similar. On the central read coverage plot, the wrong candidate has much lower coverage than the authentic candidate.

Complement Logics Resolved Difficult Alleles

C*03:03:01 and C*03:04:01:01 differ by a single base at the end of exon 2. Because some B alleles and C alleles are similar in this region, cDNA alignment shows little difference between the two candidates. With genomic alignment and the paired-end filter, the difference between the two candidates is greatly amplified, providing definitive evidence to call one over the other.

Using Complement Logics

[Figure panels: cDNA alignment; genomic alignment]

Some short exons, such as exon 6 of some C alleles, are identical to those of B alleles. With cDNA alignment, it is hard to tell whether an alignment is authentic. With genomic alignment and paired-end reads, the neighboring polymorphic sites provide sufficient information to decide.

Maximize Usage of Computing Power for Speed

[Figure: pipeline diagram from raw reads through blocks labeled B1 to B5, B1.1 to B1.5, B1.A to B1.D, PA and PB]

Stage: level of parallelism

• De-multiplexing: one process (*)

• Mapping: M processes per barcode (*****)

• Merging, Filtering, De-multiplexing: one process per barcode (***)

• Genotype calling: several processes per barcode (*****)

Streaming SIMD Extensions-vectorized implementation of the Smith-Waterman algorithm
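For orientation, the Smith-Waterman local alignment score can be computed with the plain scalar recurrence below. This is a minimal sketch, not the SSE-vectorized implementation the slide refers to; the scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for illustration.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Scalar dynamic programming with a linear gap penalty, keeping only
    the previous row of the score matrix in memory.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            # Local alignment: a cell never drops below zero.
            current[j] = max(
                0,
                diagonal,
                previous[j] + gap,     # gap in b
                current[j - 1] + gap,  # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The inner loop is the part a SIMD implementation processes several cells at a time, which is why vectorization pays off when millions of reads are aligned.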


User Friendly Interface

• The data analysis pipeline runs with a single command: hla_pipeline.py -c config.file

• Results are reviewed graphically through a web page.

• The two components will be merged into a single standalone program in about the next 6 months.

Interface

[Screenshot: Sample Info, Genotypes, Candidate, Commenting, Counting Logics]

Interface

[Screenshot: Coverage plot, Central Read Coverage plot, Reference alignment, Read tiling pattern]

Phasing Strategy

de novo Assembly

GCCAATGATGCACTGACTAGCCTAGCCACCC

GCCAATGATGCACTGACTAGCCTAGCCACCC

GCCAATGATGCACTGACTAGCCTAGCCACCC

TGCACTGACTAGCCTAGCCACCCGATCAGCTCC

TGCACTGACTAGCCTAGCCACCCGATCAGCTCC

TGCACTGACTAGCCTAGCCACCCGATCAGCTCC

CTAGCCACCCGATCAGCTCCGATCGATCGGG

CTAGCCACCCGATCAGCTCCGATCGATCGGG

CTAGCCACCCGATCAGCTCCGATCGATCGGG

CTAGCCTAGCCACCCGATCAGCTCCGATC

CTAGCCTAGCCACCCGATCAGCTCCGATC

CTAGCCTAGCCACCCGATCAGCTCCGATC

CCGATCGATCGGGCATCGATCGATCGG

CCGATCGATCGGGCATCGATCGATCGG

CCGATCGATCGGGCATCGATCGATCGG

GCCAATGATGCACTGACTAGCCTAGCCACCC

GCCAATGATGCACTGACTAGCCTAGCCACCC

GCCAATGATGCACTGACTAGCCTAGCCACCC

TGCACTGACTAGCCTAGCCACCCGATCAGCTCC

TGCACTGACTAGCCTAGCCACCCGATCAGCTCC

TGCACTGACTAGCCTAGCCACCCGATCAGCTCC

CTAGCCACCCGATCAGCTCCGATCGATCGGG

CTAGCCACCCGATCAGCTCCGATCGATCGGG

CTAGCCACCCGATCAGCTCCGATCGATCGGG

CTAGCCTAGCCACCCGATCAGCTCCGATC

CTAGCCTAGCCACCCGATCAGCTCCGATC

CTAGCCTAGCCACCCGATCAGCTCCGATC

CCGATCGATCGGGCATCGATCGATCGG

CCGATCGATCGGGCATCGATCGATCGG

CCGATCGATCGGGCATCGATCGATCGG

Multiple fragments of similar sequences generated by NGS

Clustering of fragments based on similar sequences to create contiguous sequence

Phasing Analysis

Step 1: Identify true polymorphic sites

• The ratio between major and minor alleles needs to be above a set threshold for a site to be considered a true polymorphic site

• The polymorphic sites are determined by a statistical model

Example 1: site 1 has 5x "T" and 6x "G"; site 2 has 5x "G" and 6x "A"; site 3 has 5x "C" and 6x "A". All 3 sites are true polymorphic sites.

Example 2: site 1 has 10x "T" and 1x "G" ("G" = noise); site 2 has 1x "G" and 10x "A" ("G" = noise); site 3 has 5x "C" and 6x "A" (true polymorphic site).

Build Phase Resolved “Contigs”

CCATGTTCCAATGATGCCCTGTGCATGCATCG

CCATGTGCCAATAATGCACTGTGCATGCATCG

CCATGTTCCAATGATGCCCTGTGCATGCATCG

CCATGTGCCAATAATGCACTGTGCATGCATCG

CCATGTTCCAATGATGCCCTGTGCATGCATCG

CCATGTGCCAATAATGCACTGTGCATGCATCG

CCATGTGCCAATAATGCACTGTGCATGCATCG

CCATGTTCCAATGATGCCCTGTGCATGCATCG

CCATGTGCCAATAATGCACTGTGCATGCATCG

CCATGTTCCAATGATGCCCTGTGCATGCATCG

CCATGTGCCAATAATGCACTGTGCATGCATCG

Step 1: Identify polymorphic sites

Polymorphic sites: T/G, G/A, C/A

Step 2: Determine which polymorphisms are linked together to resolve two contigs

CCATGTTCCAATGATGCCCTGTGCATGCATCG

CCATGTTCCAATGATGCCCTGTGCATGCATCG

CCATGTTCCAATGATGCCCTGTGCATGCATCG

CCATGTTCCAATGATGCCCTGTGCATGCATCG

CCATGTTCCAATGATGCCCTGTGCATGCATCG

CCATGTGCCAATAATGCACTGTGCATGCATCG

CCATGTGCCAATAATGCACTGTGCATGCATCG

CCATGTGCCAATAATGCACTGTGCATGCATCG

CCATGTGCCAATAATGCACTGTGCATGCATCG

CCATGTGCCAATAATGCACTGTGCATGCATCG

CCATGTGCCAATAATGCACTGTGCATGCATCG

T-G-C are linked

G-A-A are linked
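The two phasing steps can be sketched on the slide's own reads. This is a simplified illustration: the reads are assumed to be already aligned, equal in length and spanning all sites, and a fixed minor-allele fraction (0.2, a hypothetical value) stands in for the statistical model mentioned above, which the slides do not specify.

```python
from collections import Counter


def polymorphic_sites(reads, min_minor_fraction=0.2):
    """Step 1: positions where the second most common base is frequent enough."""
    sites = []
    for pos in range(len(reads[0])):
        counts = Counter(read[pos] for read in reads).most_common()
        if len(counts) > 1 and counts[1][1] / len(reads) >= min_minor_fraction:
            sites.append(pos)
    return sites


def phase(reads, sites):
    """Step 2: count the base combinations that co-occur on the same read."""
    return Counter("".join(read[p] for p in sites) for read in reads)


HAP_1 = "CCATGTTCCAATGATGCCCTGTGCATGCATCG"
HAP_2 = "CCATGTGCCAATAATGCACTGTGCATGCATCG"
# One extra read carrying a single sequencing error (T>G at index 3).
NOISY = "CCAGGTGCCAATAATGCACTGTGCATGCATCG"

reads = [HAP_1] * 5 + [HAP_2] * 6 + [NOISY]
sites = polymorphic_sites(reads)
haplotypes = phase(reads, sites)
print(sites)
print(haplotypes)
```

The error at index 3 appears in only 1 of 12 reads and is treated as noise, while the three true sites group the reads into the T-G-C and G-A-A contigs.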


Best Matching Alleles

Dynamic Phasing

Calling polymorphisms from de novo assembled, mapped, paired-end sequences

Phase Resolved Consensus

Build phased contig sequences based on polymorphic linkage

Consensus Alignment

Compare contig sequences back to the database to find the best match

Best Matching

Alleles

• “Detail Review” window can be used for in-depth review of HLA genotyping

• “Detail Review” window displays the contig alignment browser as well as other reference parameters (eReads and xReads)

• “Contig alignment” browser indicates phased blocks

Build Phased Contig Blocks

Summary

• Broad coverage (exons & introns) and deep sequencing (> 50)

• Paired-end sequencing

• Mapping

• Phasing:

– Complement logic (cDNA vs. genomic)

– Paired-end sequencing

– Central read logic

– Build Contig blocks


Coverage variance

[Figure: coverage (y-axis) vs. position (x-axis) along the amplicon, one panel each for HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1 and HLA-DRB1]

Genotype calling

Mapping

Reads are mapped onto the IMGT-HLA references, including non-classical HLA genes and pseudogenes, with NCBI BLASTN.

Filtering

Alignments are filtered out successively: those with sub-best bit scores, those containing mismatches or gaps, those shorter than 50 bp, and those to references that map to only one end of a paired-end read when another reference maps to both ends.

Genotyping

MCOR and MCCR are computed for each mapped reference, and references with either MCOR = 0 or MCCR = 0 are eliminated. Combinations of either one reference (homozygous) or two references (heterozygous) are enumerated, and the combination with the maximum number of reads is picked.
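The genotyping step can be sketched as a small enumeration. This is an illustration only: MCOR and MCCR are treated as precomputed per-reference values (the slides do not define them), the tie-break in favor of a single allele is an assumption, and the function and variable names are hypothetical.

```python
from itertools import combinations


def call_genotype(read_hits, mcor, mccr):
    """Pick the one- or two-allele combination that explains the most reads.

    read_hits: dict read_id -> set of reference names the read maps to
               (after filtering).
    mcor, mccr: dict reference -> precomputed value; a reference with
                either value equal to 0 is eliminated.
    Returns a tuple with one (homozygous) or two (heterozygous) references,
    or None if no candidate survives.
    """
    candidates = sorted(
        ref for ref in mcor if mcor[ref] > 0 and mccr.get(ref, 0) > 0
    )
    combos = [(ref,) for ref in candidates] + list(combinations(candidates, 2))
    if not combos:
        return None

    def reads_explained(combo):
        allele_set = set(combo)
        return sum(1 for hits in read_hits.values() if hits & allele_set)

    # Most reads wins; on a tie the smaller combination is preferred.
    return max(combos, key=lambda combo: (reads_explained(combo), -len(combo)))


read_hits = {
    "r1": {"A*02:01"},
    "r2": {"A*02:01", "A*02:07"},
    "r3": {"A*24:02"},
    "r4": {"A*24:02"},
}
mcor = {"A*02:01": 40, "A*02:07": 35, "A*24:02": 50}
mccr = {"A*02:01": 12, "A*02:07": 0, "A*24:02": 15}  # A*02:07 is eliminated
genotype = call_genotype(read_hits, mcor, mccr)
print(genotype)
```

With thousands of candidate references the enumeration grows quadratically, which is one reason the earlier slide limits the candidates to the top few in each category before enumerating.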

[Figure, panels A to D: references 01:02:01, 01:03:01 and 01:02:02; lengths L and R with the condition L/R<=2 or R/L<=2]

Data Analysis

• Determination of number of reads

• Bar codes specific for sample and locus (amplicon)

• Barcodes specific for sample (early pooling)

• Informatics:

• Mapping of Reads

• Phasing Reads

• Insertions and Deletions

• Homozygous and Heterozygous Positions

• Reads from other Loci

• Hybrid alleles, Novel alleles


Data Analysis

• Determination of number of reads

• Barcodes specific for sample and locus (amplicon)

- Technically unwieldy

- Easier interpretation by software (reads are assigned to the locus)

• Barcodes specific for sample (early pooling)

- Technically simple

- Software needs to be more sophisticated: it needs to phase longer sequence stretches



Data Analysis

Informatics:

• Mapping of Reads

• Phasing Reads

• Reads from other Loci (Highly homologous genes, DQA2, DPA2, DQB2, DPB2, DRB2/6/7/8/9)

• Alleles with incomplete references (in general rare)

• Hybrid alleles

• Novel alleles

Pseudogene Disambiguation

(Alpha sample SBC060)

SBT Result: A*02:01

NGS Result: HLA-H (novel)

A*02:01 TAC CAC CAG TAC GCC TAC GAC GGC AAG GAT TAC ATC GCC CTG AAA GAG GAC CTG CGC TCT TGG

H*01:01 GAC CAC CAG TAC GCC TAC GAC AGC AAG GAT TAC ATC GCT CTG AAA GAG GAC CTG CGC TCC TGG
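As a small illustration of why reads from the HLA-H pseudogene can be mistaken for HLA-A, the sketch below compares the two codon strings shown on this slide and lists where they differ. The function name and the 0-based segment index are assumptions made for this example; the slide does not give codon coordinates for the segment.

```python
# Codon strings copied from the slide (A*02:01 vs. the HLA-H reference H*01:01).
A_0201 = ("TAC CAC CAG TAC GCC TAC GAC GGC AAG GAT TAC ATC GCC CTG AAA GAG GAC CTG "
          "CGC TCT TGG").split()
H_0101 = ("GAC CAC CAG TAC GCC TAC GAC AGC AAG GAT TAC ATC GCT CTG AAA GAG GAC CTG "
          "CGC TCC TGG").split()


def differing_codons(seq_a, seq_b):
    """Return (index, codon_a, codon_b) for every codon that differs.

    The index is 0-based within the displayed segment, not an HLA codon
    number, because the slide does not state the segment coordinates.
    """
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


diffs = differing_codons(A_0201, H_0101)
for i, a, b in diffs:
    print(f"segment codon {i}: A*02:01 {a} vs H*01:01 {b}")
```

Only a handful of codons separate the two sequences in this segment, so a short read that does not cover one of them cannot be assigned to HLA-A or HLA-H on sequence alone.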

Hybrid allele carrying sequences of two loci

AA Codon 1 5 10 15 20

DRB1*01:01:01 CTG GCT TTG GCT GGG GAC ACC CGA C|CA CGT TTC TTG TGG CAG CTT AAG TTT GAA TGT CAT TTC TTC AAT GGG ACG

DRB1*14:54:01 --- --- --- --- --- --- --- A-- -|-- --- --- --- GA- T-C TC- -C- -C- --G --- --- --- --- --- --- ---

DRB1*14:141 --- --- --- --- --- --- --- A-- -|-- --- --- --- GA- -T- --- --- -C- --G --- --- --- --- --- --- ---

DRB3*02:02:01:02 --- --- --C --- --- --- --- --- -|-- --- --- --- GA- -T- --- --- -C- --G --- --- --- --- --- --- ---

AA Codon 25 30 35 40 45

DRB1*01:01:01 GAG CGG GTG CGG TTG CTG GAA AGA TGC ATC TAT AAC CAA GAG GAG TCC GTG CGC TTC GAC AGC GAC GTG GGG GAG

DRB1*14:54:01 --- --- --- --- --C --- --C --- -A- T-- C-- --- --G --- --- -T- --- --- --- --- --- --- --- --- ---

DRB1*14:141 --- --- --- --- --C --- --G --- CA- T-- C-- --- --G --- --- -A- -C- --- --- --- --- --- --- --- ---

DRB3*02:02:01:02 --- --- --- --- --C --- --G --- CA- T-- C-- --- --G --- --- -A- -C- --- --- --- --- --- --- --- ---

AA Codon 50 55 60 65 70

DRB1*01:01:01 TAC CGG GCG GTG ACG GAG CTG GGG CGG CCT GAT GCC GAG TAC TGG AAC AGC CAG AAG GAC CTC CTG GAG CAG AGG

DRB1*14:54:01 --- --- --- --- --- --- --- --- --- --- -C- --G --- C-- --- --- --- --- --- --- --- --- --- -G- ---

DRB1*14:141 --- --- --- --- -G- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -A-

AA Codon 75 80 85 90 95

DRB1*01:01:01 CGG GCC GCG GTG GAC ACC TAC TGC AGA CAC AAC TAC GGG GTT GGT GAG AGC TTC ACA GTG CAG CGG CGA G|TT GAG

DRB1*14:54:01 --- --- -A- --- --- --- --T --- --- --- --- --- --- --- -TG --- --- --- --- --- --- --- --- -|-C C-T

DRB1*14:141 --- -G- CA- --- --- -AT --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -|-C C-T

DRB3*02:02:01:02 --- -G- CA- --- --- -AT --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -|-C C-T

AA Codon 100 105 110 115 120

DRB1*01:01:01 CCT AAG GTG ACT GTG TAT CCT TCA AAG ACC CAG CCC CTG CAG CAC CAC AAC CTC CTG GTC TGC TCT GTG AGT GGT

DRB1*14:54:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --G --- --- --T --- --- --- ---

DRB1*14:141 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --G --- --- --T --- --- --- ---

DRB3*02:02:01:02 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --G ---

AA Codon 125 130 135 140 145

DRB1*01:01:01 TTC TAT CCA GGC AGC ATT GAA GTC AGG TGG TTC CGG AAC GGC CAG GAA GAG AAG GCT GGG GTG GTG TCC ACA GGC

DRB1*14:54:01 --- --- --- --- --- --- --- --- --- --- --- --- --T --- --- --- --- --- A-- --- --- --- --- --- ---

DRB1*14:141 --- --- --- --- --- --- --- --- --- --- --- --- --T --- --- --- --- --- A-- --- --- --- --- --- ---

DRB3*02:02:01:02 --- --- --- --- --- --C --- --- --- --- --- --- --- --- --A --- --- --- --- --- --- --- --- --- ---
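The alignment above uses the usual dash notation, where "-" means "same base as the reference row". A minimal sketch of how such rows can be expanded and compared is given below, using the codon 25 to 49 block copied from this slide. The helper names `expand` and `mismatches` are assumptions for illustration, and the sketch does not handle the "|" exon-boundary marks or "*" (unsequenced) symbols that appear in other blocks.

```python
FIRST_CODON = 25  # first codon of the block, from the "AA Codon" ruler

REF = ("GAG CGG GTG CGG TTG CTG GAA AGA TGC ATC TAT AAC CAA GAG GAG TCC GTG "
       "CGC TTC GAC AGC GAC GTG GGG GAG")  # DRB1*01:01:01
ROWS = {
    "DRB1*14:54:01": ("--- --- --- --- --C --- --C --- -A- T-- C-- --- --G --- --- "
                      "-T- --- --- --- --- --- --- --- --- ---"),
    "DRB1*14:141": ("--- --- --- --- --C --- --G --- CA- T-- C-- --- --G --- --- "
                    "-A- -C- --- --- --- --- --- --- --- ---"),
    "DRB3*02:02:01:02": ("--- --- --- --- --C --- --G --- CA- T-- C-- --- --G --- --- "
                         "-A- -C- --- --- --- --- --- --- --- ---"),
}


def expand(reference, row):
    """Replace each '-' in an alignment row with the reference base."""
    return ["".join(r if c == "-" else c for r, c in zip(ref_codon, codon))
            for ref_codon, codon in zip(reference.split(), row.split())]


expanded = {name: expand(REF, row) for name, row in ROWS.items()}


def mismatches(name_a, name_b):
    """Codon numbers at which two expanded rows differ."""
    pairs = zip(expanded[name_a], expanded[name_b])
    return [FIRST_CODON + i for i, (a, b) in enumerate(pairs) if a != b]


print(mismatches("DRB1*14:141", "DRB3*02:02:01:02"))
print(mismatches("DRB1*14:141", "DRB1*14:54:01"))
```

Within this block DRB1*14:141 is identical to the DRB3 row and differs from DRB1*14:54:01 at a few codons, which is the kind of pattern expected for a hybrid allele carrying sequence from a second locus.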


Characterization of a rare allele with incomplete sequence

B*15:147 derives from B*15:01:01:01


SBT/SSO vs NGS

Identifying a novel allele

S-101, Reference Type Result: B*13@, B*38@, one allele is an exon 4 variant ?

NGS Result:

B*13:02:01, B*38:02:01; exon 4 variant: A to G, Lys to Arg, codon 186.
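The slide reports the variant only as "A to G, Lys to Arg" and does not show the codon itself. As a worked check under that limitation, the sketch below enumerates every single A-to-G substitution in the two lysine codons of the standard genetic code and keeps those that produce arginine. The table is a deliberately minimal slice of the genetic code, and the function name is an assumption for this example.

```python
# Minimal slice of the standard genetic code: only the codons reachable from
# a lysine codon (AAA, AAG) by one A-to-G substitution.
CODON_TABLE = {
    "AAA": "Lys", "AAG": "Lys",
    "GAA": "Glu", "GAG": "Glu",
    "AGA": "Arg", "AGG": "Arg",
}


def a_to_g_variants(codon):
    """Yield (position, new_codon) for each single A>G change (1-based position)."""
    for i, base in enumerate(codon):
        if base == "A":
            yield i + 1, codon[:i] + "G" + codon[i + 1:]


lys_to_arg = [
    (codon, pos, new)
    for codon in ("AAA", "AAG")
    for pos, new in a_to_g_variants(codon)
    if CODON_TABLE[new] == "Arg"
]
print(lys_to_arg)
```

Whichever lysine codon is present at codon 186, an A-to-G change gives arginine only when it hits the second base of the codon.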

DPB1 Hybrid Alleles

[Figure: DPB1 gene map (E2, I2, E3, I3, E4) with the X(AAGG) repeat. Labels: "Recombination area"; "244 - 2761 bp from exon 2". Haplotypes shown: DPB1*463:01/DPA1*01:03:01:05; DPB1*04:02:01:01/DPA1*01:03:01:05; DPB1*03:01:01/DPA1*01:03:01:03.]


Characterization of a Novel allele through the evaluation of unmapped reads

Functional Significance

Subject with two closely related alleles included in the DPB1*04:02:01G group

DPB1*04:02:01G:

DPB1*04:02:01:01

DPB1*04:02:01:02

DPB1*105:01

DPB1*463:01

DPB1*571:01

Identical Antigen Recognition Site Structure

Different levels of Expression (we propose)

AA Codon -25 -20 -15 -10 -5

DPB1*105:01 ATG ATG GTT CTG CAG GTT TCT GCG GCC CCC CGG ACA GTG GCT CTG ACG GCG TTA CTG ATG GTG CTG CTC ACA TCT

DPB1*414:01 *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

AA Codon 1 5 10 15 20

DPB1*105:01 GTG GTC CAG GGC AGG GCC ACT CCA G|AG AAT TAC CTT TTC CAG GGA CGG CAG GAA TGC TAC GCG TTT AAT GGG ACA

DPB1*414:01 *** *** *** *** *** *** *** *** *|-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

DPB1*463:01 --- --- --- --- --- --- --- --- -|-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

AA Codon 25 30 35 40 45

DPB1*105:01 CAG CGC TTC CTG GAG AGA TAC ATC TAC AAC CGG GAG GAG TTC GTG CGC TTC GAC AGC GAC GTG GGG GAG TTC CGG

DPB1*414:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

AA Codon 50 55 60 65 70

DPB1*105:01 GCG GTG ACG GAG CTG GGG CGG CCT GAT GAG GAG TAC TGG AAC AGC CAG AAG GAC ATC CTG GAG GAG AAG CGG GCA

DPB1*414:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- G-- --- ---

DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

AA Codon 75 80 85 90 95

DPB1*105:01 GTG CCG GAC AGG ATG TGC AGA CAC AAC TAC GAG CTG GGC GGG CCC ATG ACC CTG CAG CGC CGA G|TC CAG CCT AGG

DPB1*414:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -|-- --- --- -A-

DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -|-- --- --- -A-

AA Codon 100 105 110 115 120

DPB1*105:01 GTG AAT GTT TCC CCC TCC AAG AAG GGG CCC TTG CAG CAC CAC AAC CTG CTT GTC TGC CAC GTG ACG GAT TTC TAC

DPB1*414:01 --- --C --- --- --- --- --- --- --- --- C-- --- --- --- --- --- --- --- --- --- --- --A --- --- ---

DPB1*463:01 --- --C --- --- --- --- --- --- --- --- C-- --- --- --- --- --- --- --- --- --- --- --A --- --- ---

AA Codon 125 130 135 140 145

DPB1*105:01 CCA GGC AGC ATT CAA GTC CGA TGG TTC CTG AAT GGA CAG GAG GAA ACA GCT GGG GTC GTG TCC ACC AAC CTG ATC

DPB1*414:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

AA Codon 150 155 160 165 170

DPB1*105:01 CGT AAT GGA GAC TGG ACC TTC CAG ATC CTG GTG ATG CTG GAA ATG ACC CCC CAG CAG GGA GAT GTC TAC ACC TGC

DPB1*414:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --C --- --- -T- ---

DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --C --- --- -T- ---

AA Codon 175 180 185 190 195

DPB1*105:01 CAA GTG GAG CAC ACC AGC CTG GAT AGT CCT GTC ACC GTG GAG TGG A|AG GCA CAG TCT GAT TCT GCC CGG AGT AAG

DPB1*414:01 --- --- --- --- --- --- --- --C --- --- --- --- --- --- --- -|-- --- --- --- --- --- --- --- --- ---

DPB1*463:01 --- --- --- --- --- --- --- --C --- --- --- --- --- --- --- -|-- --- --- --- --- --- --- --- --- ---

AA Codon 200 205 210 215 220

DPB1*105:01 ACA TTG ACG GGA GCT GGG GGC TTC GTG CTG GGG CTC ATC ATC TGT GGA GTG GGC ATC TTC ATG CAC AGG AGG AGC

DPB1*414:01 --- --- --- --- --- --- --- --- A-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---


Possible eSTR proximal to intron-2 splicing site

STR length may play a regulatory role in the expression of DPB1

[Figure: DPB1 gene map. 5' UTR, E1 (101 bp), E2 (264 bp), I2, E3 (282 bp), I3, E4 (111 bp), E5 (22 bp), 3' UTR. DPB1 fragment E2/E4 (~5.1 kb), flanked by F_DPB1 and R_DPB1. X(AAGG) at 4172 – 4227 bp.]

DPB1 Intron 2 eSTR(AAGG)

Low

High

DPB1*02:01:04e1 L_AA

DPB1*02:01:02v5 L_AA

DPB1*02:01:02v6 L_AA

DPB1*02:01:02 L_AA

DPB1*02:01:02v3 L_AA

DPB1*02:01:02v7 L_AA

DPB1*02:02 L_AA

DPB1*02:01:02v2 L_AA

DPB1*02:01:02v1 L_AA

DPB1*02:01:02v4 L_AA

DPB1*04:02:01:01 L_AA

DPB1*04:02:01:02 L_AA

DPB1*04:01:01:01v1 L_AA

DPB1*04:01:31 L_AA

DPB1*04:01:31 L_AA

DPB1*04:01:01:01v4 L_AA

DPB1*04:01:01:01v5 L_AA

DPB1*04:01:01:01 L_AA

DPB1*04:01:01:01v3 L_AA

DPB1*464:01

DPB1*04:01:01:02 L_AA

DPB1*398:01

DPB1*30:01

DPB1*58:01e1

DPB1*17:01e1 L_AA

DPB1*17:01x1

DPB1*19:01e1 H_GA

DPB1*39:01x1

DPB1*11:01:01e1 H_GA

DPB1*27:01e1

DPB1*13:01:01/DPB1*107:01e1 H_GA

DPB1*85:01e1 H_GA

DPB1*01:01:01e1 H_GA

DPB1*296:01e1

DPB1*15:01:01e1 H_GA

DPB1*18:01e1 H_GA

DPB1*05:01:01e1 H_GA

DPB1*414:01e1

DPB1*463:01

DPB1*16:01:01 H_GG

DPB1*21:01e1

DPB1*06:01e1 H_GG

DPB1*09:01:01e1 H_GG

DPB1*104:01e1

DPB1*03:01:01 H_GG

DPB1*14:01:01e1 H_GG

[Figure: tree of the DPB1 alleles listed above, with numeric node labels ranging from 32 to 100.]

Intron2 (-43)

STR Analysis : Short - Short

DPB1* 01:01:01e1, 05:01:01e1


STR Analysis : Short - Long

DPB1* 02:01:02, 13:01:01e1

STR Analysis : Long - Long

DPB1* 02:01:02, 02:01:02v3
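The short/long classification above depends on how many AAGG units the intron 2 repeat contains. A minimal sketch of counting the longest uninterrupted tandem run in a sequence is shown below. The function name, the flanking bases and the repeat counts in the two example sequences are all hypothetical, and no short/long cutoff is implied because the slides do not state one.

```python
import re


def longest_repeat_run(sequence, unit="AAGG"):
    """Return the number of units in the longest uninterrupted tandem run."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)


# Hypothetical sequences: invented flanks around 5 and 12 AAGG units.
short_like = "CTTC" + "AAGG" * 5 + "TTGCA"
long_like = "CTTC" + "AAGG" * 12 + "TTGCA"

print(longest_repeat_run(short_like), longest_repeat_run(long_like))
```

In practice the count would be taken from phased consensus sequences, one per allele, so that each allele gets its own repeat length.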

Data Analysis

• Determination of number of reads

• Barcodes specific for sample and locus (amplicon)

• Barcodes specific for sample (early pooling)

• Informatics:

• Mapping of Reads

• Phasing Reads

• Insertions and Deletions

• Homozygous and Heterozygous Positions

• Reads from other Loci

• Hybrid alleles, Novel alleles
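The first step in the list above, determining the number of reads per barcode, can be sketched as a simple demultiplexing pass. This is an illustrative example only: the barcode sequences, sample names and reads are hypothetical, and exact prefix matching is an assumption (production pipelines usually tolerate sequencing errors in the barcode).

```python
from collections import Counter


def demultiplex(reads, barcodes):
    """Count reads per sample by exact match of a leading barcode."""
    counts = Counter()
    for read in reads:
        sample = next(
            (name for barcode, name in barcodes.items() if read.startswith(barcode)),
            "unassigned",
        )
        counts[sample] += 1
    return counts


# Hypothetical barcodes and reads, for illustration only.
BARCODES = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}
READS = ["ACGTACGGGTTT", "TGCATGCCCAAA", "ACGTACTTTGGG", "GGGGGGAAAAAA"]

counts = demultiplex(READS, BARCODES)
print(dict(counts))
```

With barcodes specific for sample and locus, the dictionary key would map to a (sample, locus) pair instead of a sample name; with early pooling, locus assignment is left to read mapping.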


Typing two DRB5 alleles

All reads need to be accounted for

Correct genotype: DRB5*01:01:01, DRB5*01:08N

Candidate genotypes: DRB5*01:02, 01:08N; DRB5*01:01:01, 01:02; DRB5*01:01:01, 01:08N

DRB5*01:02/01:08N: identical in exon 2, differ by a 19 nt indel in exon 3

DRB5*01:01:01/01:02: identical in exon 3, differ by 3 nt substitutions in exon 2
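The requirement that all reads be accounted for can be sketched as a consistency check over candidate genotypes. The exon relationships below come from the slide; the labels `e2_a`, `e2_b`, `e3_a`, `e3_b` are hypothetical placeholders for the two exon 2 sequences and the two exon 3 sequences (with and without the 19 nt indel), and `OBSERVED` assumes that reads of both types were seen in each exon.

```python
from itertools import combinations

# Exon-level sequence types per allele (labels are placeholders).
ALLELES = {
    "DRB5*01:01:01": {"exon2": "e2_a", "exon3": "e3_a"},
    "DRB5*01:02":    {"exon2": "e2_b", "exon3": "e3_a"},
    "DRB5*01:08N":   {"exon2": "e2_b", "exon3": "e3_b"},
}

# Assumed observation: two distinct read populations in each exon.
OBSERVED = {"exon2": {"e2_a", "e2_b"}, "exon3": {"e3_a", "e3_b"}}


def explains_all_reads(genotype):
    """True if the allele pair predicts exactly the read types observed."""
    return all(
        {ALLELES[allele][exon] for allele in genotype} == seen
        for exon, seen in OBSERVED.items()
    )


consistent = [g for g in combinations(ALLELES, 2) if explains_all_reads(g)]
print(consistent)
```

Each of the other two candidate pairs explains one exon perfectly but leaves a read population in the other exon unexplained, which is why evaluating exons separately can produce a wrong call.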

Must Know

• Amplicon: size (homogeneous or variable according to allele families)

• Preferential amplifications (locus or allele families)

• Primers: multiplexed or single location

• Other genes co-amplified (DRB)

• Software: Binning of reads (to a given locus, to a given allele family). No binning (possible interference in allele assignment)

• Phasing: reads covering informative SNPs, Central Reads, Assembly

• Utilization of reads
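The phasing item in the list above mentions reads covering informative SNPs. A minimal sketch of that idea follows: reads that span two heterozygous positions are tallied by the pair of bases they carry, and the two most frequent pairs are taken as the phased haplotypes. The read representation (a dict of position to base), the positions and the bases are hypothetical; this is not the method of any particular software package.

```python
from collections import Counter


def phase_two_snps(reads, pos1, pos2):
    """Count base pairs seen on reads that cover both positions.

    Each read is a dict {position: base}. Reads missing either position
    are uninformative for phasing and are skipped.
    """
    pairs = Counter()
    for read in reads:
        if pos1 in read and pos2 in read:
            pairs[(read[pos1], read[pos2])] += 1
    return pairs


# Hypothetical reads over two heterozygous positions (100 and 180).
READS = [
    {100: "A", 180: "C"},
    {100: "A", 180: "C"},
    {100: "G", 180: "T"},
    {100: "G"},            # covers only the first SNP
    {180: "T"},            # covers only the second SNP
    {100: "G", 180: "T"},
]

pairs = phase_two_snps(READS, 100, 180)
haplotypes = [pair for pair, _ in pairs.most_common(2)]
print(haplotypes)
```

When no read (or read pair) spans both positions, this approach gives no phase information, which is where the other options named in the list come in.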

Homozygous allele? Not exactly

[Figure: gene map of DRB1, DQA1 and DQB1]

DRB1 DQA1 DQB1 Count

*13:02:01 *01:02:01:04 *06:04:01/*06:09:01 21/553

*15:01:01:01 *01:02:01:03 *06:02:01/*06:03:01 423/553


[Figure: gene map of DRB1, DQA1 and DQB1; distances shown: ~40 kb and ~10 kb]

DRB1 DQA1 DQB1

*01:01:01/*01:03 *01:01:01 *05:01:01:0x

*10:01:01/*14:54:01 *01:05/*01:04:01:01 *05:01:01:02

*01:02:01 *01:01:02 *05:01:01:01

DQB1*05:01:01:0x = DQB1*05:01:01:01 (intron 4) + DQB1*05:01:01:02 (intron 2)

My thought Process for Genotype Assignment

• Examine Genotype assigned by software through mapping

• Perfect match with reference vs no full match at genomic level

• Check match with reference vs no full match at exon level

• Check completeness of reference

• Identify novel allele; see close allele and check differences and sequences

• Examine phasing by another method (central reads, paired-end reads, assembly)

• Check LD tables (may help identify dropouts)


Barcode performance

[Figure: bar chart of read count (0 to 1.8e+06) per barcode.]


Data Analysis

• Solid and simple logic

– error is minor

• Accurate

– User-friendly interface for reviewing result

• Fast

– Less than 2 hours for seq run (12-24 samples)

• Ability to pick up new allele

• Stand-alone desktop solution

• Ability to evaluate genotype assignment by second method

Our experience

• Allele calls were made virtually by the software with no operator evaluation

• Fourth field data: in most instances no previous information

• Haplotype associations stronger than expected

• Several common allele subtypes distinguished at the fourth field

• Specific allele associations became apparent without any assumptions made

• These studies show the robustness and comprehensive coverage provided by the typing system


Summary of State of the Art NGS for HLA

• Application to HLA typing is feasible

• Processes have been optimized

• Current methods are appropriate for both Registry Typing and small scale quick TAT

• Extremely accurate and comprehensive

• Great developments in the informatics and analysis

• Completion of sequences of common alleles will be helpful

• Studies in families may reveal limitations