Graph and assembly strategies for the MHC and ribosomal DNA regions
Alexander Dilthey
The MHC is the zebrafish of the genome!
(model region)
PRGs – Population Reference Graphs
• Simple: acyclic, directed (sub-class of general variation graphs)
• Usually built from MSA, preserve gap positions (i.e. global homology between input sequences).
• Generative model: Recombination
• Ploidy well-defined (0, 1, 2)
[Figure: example PRG; edge labels are nucleotides and gap characters (_)]
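A minimal sketch of how a PRG of this kind could be derived from an MSA, assuming the simplest possible construction: one graph level per MSA column boundary, and one edge per distinct character (nucleotide or gap) seen in that column. The function names `build_prg`, `is_path` and `count_paths` are hypothetical, and the real construction may keep more haplotype structure than this column-collapsed version. The input is the toy exonic MSA used later in the talk.

```python
def build_prg(msa_rows):
    """Return, per MSA column, the sorted edge labels between level i and i+1."""
    lengths = {len(row) for row in msa_rows}
    if len(lengths) != 1:
        raise ValueError("all MSA rows must have the same number of columns")
    n_cols = lengths.pop()
    # Gap characters are kept as ordinary labels, so MSA coordinates survive.
    return [sorted({row[col] for row in msa_rows}) for col in range(n_cols)]


def is_path(prg, gapped_seq):
    """True if the gapped sequence can be spelled by walking the graph."""
    return len(gapped_seq) == len(prg) and all(
        ch in labels for ch, labels in zip(gapped_seq, prg)
    )


def count_paths(prg):
    """Number of distinct source-to-sink paths (haplotypes the graph can generate)."""
    total = 1
    for labels in prg:
        total *= len(labels)
    return total


# Toy exonic MSA from the talk (hypothetical gene "T"), one string per allele.
EXONIC_MSA = [
    "__ACGTACT__",  # T*01:01
    "CAACATACT__",  # T*01:02
    "__ACGCGCT__",  # T*01:03
    "__ATCCGCTAC",  # T*01:04
    "__ATCCCCT__",  # T*01:05
    "___CCTACT__",  # T*01:06
]

prg = build_prg(EXONIC_MSA)
# A recombinant of T*01:02 (columns 0-8) and T*01:04 (columns 9-10).
recombinant = EXONIC_MSA[1][:9] + EXONIC_MSA[3][9:]
print(len(prg), count_paths(prg), is_path(prg, recombinant))
```

The recombinant is not one of the input rows but is still a valid path, which is what "generative model: recombination" amounts to in this simplified setting.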
Outline
• Quick recap:
What we know about the utility of graph genome approaches
• New results:
Haplotyping in hypervariable regions (HLA)
Pseudo graph alignment
• De novo assembly of ribosomal DNA
In most of the MHC, single-reference approaches work just fine…
[Figure: bar chart. y-axis: Number of kmers (millions); bars: PGF reference, Platypus, PRG-Viterbi, PRG-Mapped; legend: kmers recovered, kmers not recovered]
+ long-read validation with consistent results (not shown)
Dilthey et al., Nature Genetics 2015
… graph genomes outperform in the most complex sub-region of the MHC …
Dilthey et al., Nature Genetics 2015
… remaining problems driven by incomplete input haplotypes + algorithmics.
[Figure: aligned kmers. x-axis: Chromotype position (kb); y-axis: Read position (kb)]
Incomplete input haplotypes: Large uncharacterized inversion
Algorithmics: Incorrect HLA haplotyping.
Dilthey et al., Nature Genetics 2015
HLA haplotyping
• Hypothesis: Whole-genome sequencing data contains the information necessary for accurate HLA typing
• “HLA typing”: HLA gene exon sequences
  • HLA class I: exons 2 and 3
  • HLA class II: exon 2
• Challenge: align reads to the right gene – homology hell.
• Proper read-to-graph alignment instead of k-mers.
Class I exon homology
[Figure panels: Exon 2, Exon 3]
HLA-A 3284 alleles
HLA-B 4077 alleles
HLA-C 2799 alleles
Approach: deep PRG + mapping
Exonic MSA
T*01:01  _ _ A C G T A C T _ _
T*01:02  C A A C A T A C T _ _
T*01:03  _ _ A C G C G C T _ _
T*01:04  _ _ A T C C G C T A C
T*01:05  _ _ A T C C C C T _ _
T*01:06  _ _ _ C C T A C T _ _

Genomic MSA
T*01:01  A G C A _ _ A C G T A C T _ _ C C T A
T*01:02  A C C A C A A C A T A C T _ _ C C T A
T*01:04  _ T T A _ _ A T C C G C T A C C C T A

8 xMHC reference haplotypes
PGF (with T*01:01)   A C T A G C A _ _ A C G T A C T _ _ C C T A T G A
MANN (with T*01:04)  T T T _ T T A _ _ A T C C G C T A C C C T A T G A
1) Gene-only PRG – 46 (pseudo) genes, mostly HLA
Gene 1 |--NNN--| Gene 2 |--NNN--| Gene 3
Padding | UTR | Exon 1 | Intron 1 | Exon 2 | UTR | Padding
[Figure: number of reference sequences along a gene; the region covered by 'genomic' sequences is marked]
2) Varying numbers of input sequences across PRG
3) Use hierarchical MSA approach to combine in
Approach: deep PRG + mapping
[Figure: a read aligned to the PRG. Graph levels are numbered 1 to 26; edge labels are nucleotides or gaps (_)]
4) Seed-and-extend paired-end mapping to PRG
5) Likelihood-based inference: maximize L( aligned reads | HLA types ) (independently per locus)
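A minimal sketch of step 5, assuming a deliberately simplified read model that is not taken from the talk: each read is already placed at a known offset on gap-free allele sequences, it comes from either of the two alleles with probability 1/2, and each base is wrong with a fixed probability `err`. The function names and the error model are illustrative assumptions; the toy alleles are the gap-free rows of the exonic MSA above.

```python
import math
from itertools import combinations_with_replacement


def read_log_prob(read, start, allele_seq, err=0.01):
    """log P(read | allele) under an independent per-base error model (assumed)."""
    logp = 0.0
    for i, base in enumerate(read):
        if base == allele_seq[start + i]:
            logp += math.log(1.0 - err)
        else:
            logp += math.log(err / 3.0)
    return logp


def genotype_log_likelihood(reads, allele1, allele2, err=0.01):
    """log L(aligned reads | HLA type) for one locus and one diploid allele pair."""
    total = 0.0
    for start, read in reads:
        p1 = math.exp(read_log_prob(read, start, allele1, err))
        p2 = math.exp(read_log_prob(read, start, allele2, err))
        # Each read is emitted by one of the two chromosomes with equal probability.
        total += math.log(0.5 * p1 + 0.5 * p2)
    return total


def call_hla_type(reads, alleles, err=0.01):
    """Return the allele pair that maximizes the likelihood at this locus."""
    pairs = combinations_with_replacement(sorted(alleles), 2)
    return max(
        pairs,
        key=lambda p: genotype_log_likelihood(reads, alleles[p[0]], alleles[p[1]], err),
    )


# Toy alleles (hypothetical gene "T") and reads given as (start offset, sequence).
ALLELES = {"T*01:01": "ACGTACT", "T*01:03": "ACGCGCT", "T*01:05": "ATCCCCT"}
het_reads = [(0, "ACGTACT"), (0, "ACGTACT"), (0, "ATCCCCT"), (0, "ATCCCCT")]
hom_reads = [(0, "ACGCGCT")] * 4

print(call_hla_type(het_reads, ALLELES))
print(call_hla_type(hom_reads, ALLELES))
```

Because each locus is scored on its own, the search is over allele pairs at one gene at a time, matching "independently per locus". For realistic read lengths the per-read probabilities would need to stay in log space to avoid underflow.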
High-quality WGS data enables gold-standard accuracy
(of note: 2/3 original discrepancies with validation data were errors in the validation data!)
… but not from exome, MiSeq data
Sequencing error?
Effective fragment length? [2 x read length + IS]
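The bracketed formula can be written out as a one-line helper. This sketch assumes "IS" stands for insert size in the sense of the unsequenced gap between the two reads of a pair; the function name and the numbers in the example are illustrative only and are not taken from the datasets in the talk.

```python
def effective_fragment_length(read_length, insert_size):
    """Span covered by one read pair: 2 x read length + IS (IS assumed to be the inner gap)."""
    return 2 * read_length + insert_size


# Illustrative values only: 2 x 100 bp reads separated by a 300 bp inner gap.
print(effective_fragment_length(100, 300))
```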
Conclusion (intermediate)
• If the input sequencing data is “good enough”, we manage near-perfect haplotyping in the genome's most polymorphic region
• Effective fragment length likely the most important factor
• Not-so-good sequencing data: joint haplotyping + alignment (i.e. alignment location is not independent of inferred haplotype)
• Read mapping implementation SLOW
Pseudo graph mapping
Input sequences
Graph
Align short reads to input sequences...
... transpose onto graph
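A minimal sketch of the "transpose onto graph" step, assuming the graph inherits its coordinates from the MSA and that the linear alignment is given as two equal-length gapped strings (input sequence and read). The function names are hypothetical. Each read base is assigned the MSA column of the input-sequence base it is aligned to; read insertions have no column yet and are marked `None`.

```python
def msa_columns(msa_row, gap="_"):
    """MSA column index of every non-gap character of an input sequence."""
    return [col for col, ch in enumerate(msa_row) if ch != gap]


def transpose_alignment(msa_row, seq_start, aligned_seq, aligned_read, gap="_"):
    """Project a linear read-to-sequence alignment onto MSA (graph) columns.

    seq_start is the 0-based position in the ungapped input sequence where the
    alignment begins. Returns a list of (msa_column_or_None, read_character).
    """
    if len(aligned_seq) != len(aligned_read):
        raise ValueError("alignment rows must have equal length")
    cols = msa_columns(msa_row, gap)
    pos = seq_start
    projected = []
    for seq_ch, read_ch in zip(aligned_seq, aligned_read):
        if seq_ch == gap:
            # Insertion in the read relative to the input sequence.
            projected.append((None, read_ch))
        else:
            projected.append((cols[pos], read_ch))
            pos += 1
    return projected


# Toy example: Seq1 has a gap in MSA column 5 (0-based); the read carries an extra G.
projected = transpose_alignment("AACAC_TTT", 0, "AACAC_TTT", "AACACGTTT")
print(projected)
```

The appeal of this route is that the expensive step is ordinary linear alignment; the projection itself is a single pass over the alignment.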
Scrubbing, cutting, cleaning
Input MSA        Lin. alignment   MSA coor.    Scrubbed
     123456789                    123456X789   123456789
Seq1 AACAC_TTT   Seq1 AACAC_TTT   AACAC__TTT   AACAC_TTT
Seq2 TTCACGTTT   Read AACACGTTT   AACAC_GTTT   AACACGTTT
[Figure: graph built from the input MSA; recoverable node labels: TTCAC, TTT, G, -]
Scrubbing: get rid of INDEL-induced changes in the alignment coordinate system
Cutting: Examine alignment gap structure; cut in “bad” areas; use longest stretch
Cleaning: Find the best gap-less sequence-to-graph alignment + extension with gaps
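A minimal sketch of the scrubbing idea as it appears in the toy example above, under the assumption that a read insertion can be absorbed by MSA columns that the input sequence skips (its gap columns) between the two flanking aligned bases. The input format, a list of `(msa_column_or_None, read_character)` pairs with `None` marking an insertion, and the function name `scrub` are hypothetical; the actual implementation may handle more cases.

```python
def scrub(projected):
    """Move read insertions into unused MSA columns between their flanking bases."""
    out = list(projected)
    i = 0
    while i < len(out):
        if out[i][0] is not None:
            i += 1
            continue
        # Find the end of this run of inserted bases.
        j = i
        while j < len(out) and out[j][0] is None:
            j += 1
        left = out[i - 1][0] if i > 0 else None
        right = out[j][0] if j < len(out) else None
        if left is not None and right is not None:
            free_cols = range(left + 1, right)  # columns skipped by the input sequence
            for k, col in zip(range(i, j), free_cols):
                out[k] = (col, out[k][1])
        i = j
    return out


# Read projected via Seq1 (gap at 0-based column 5); the inserted G has no column yet.
projected = [(0, "A"), (1, "A"), (2, "C"), (3, "A"), (4, "C"),
             (None, "G"), (6, "T"), (7, "T"), (8, "T")]
scrubbed = scrub(projected)
print(scrubbed)
```

After scrubbing, the read occupies columns 0 to 8 of the original MSA and reads AACACGTTT, as in the "Scrubbed" panel; insertions with no free column between their neighbours are left marked as `None`.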
Graph alignment
      123456789
Graph AACACGTTT
Seq1  AACACGTTT
Accuracy slightly worse; fast!
Conclusion: perhaps there is a middle ground between graph and linear sequence alignment. Work in progress. Further tuning?
Highest Resolution
                          HLA*PRG                         MHC-PRG-2
Cohort        Locus  N    Inferred  Accuracy  Call Rate   Inferred  Accuracy  Call Rate
PlatinumTrio  A      6    6         1.00      1.00        6         1.00      1.00
PlatinumTrio  B      6    6         1.00      1.00        6         1.00      1.00
PlatinumTrio  C      6    6         1.00      1.00        6         1.00      1.00
PlatinumTrio  DQA1   6    6         1.00      1.00        6         1.00      1.00
PlatinumTrio  DQB1   6    6         1.00      1.00        6         1.00      1.00
PlatinumTrio  DRB1   6    6         1.00      1.00        6         1.00      1.00
1000 Genomes  A      22   22        0.86      1.00        22        1.00      1.00
1000 Genomes  B      22   22        1.00      1.00        22        1.00      1.00
1000 Genomes  C      22   22        1.00      1.00        22        1.00      1.00
1000 Genomes  DQA1   12   12        1.00      1.00        12        1.00      1.00
1000 Genomes  DQB1   22   22        1.00      1.00        22        1.00      1.00
1000 Genomes  DRB1   22   22        0.91      1.00        22        0.95      1.00
Towards additional high-quality reference haplotypes…
Remaining challenges: extreme repeats, haplotypes.
Sergey Koren
Ribosomal DNA
• Encodes ribosomal RNA
• Hundreds of copies (tandem repeat arrays)
• Variation poorly characterized
• Step 1: Targeted approach
• Step 2: WGS-based
• Step 3: Variation graph
Read error vs variation
… from whole-genome data?
Long reads
de Bruijn graph
Technology!
6% > 50k
Summary
• Variation graphs are worth the effort – at least in highly complex regions.
• Evidence: MHC “model system”
+ overall improvement of genome inference accuracy
+ complex-locus haplotyping
• Incorporate LD?
• Middle ground between full graph alignment and linear sequence alignment?
• Ribosomal DNA – let me know if you're also interested!
Acknowledgements
NIH: Adam Phillippy, Sergey Koren, Brian Walenz, Jung-Hyun Kim, Vladimir Larionov
Oxford: Gil McVean, Zam Iqbal, Alexander Mentzer
Histogenetics: Nezih Cereb
UCSF/Nantes: Pierre-Antoine Gourraud
GSK: Matt Nelson, Charles Cox