Graph and assembly strategies for the MHC and ribosomal DNA regions

Alexander Dilthey

The MHC is the zebrafish of the genome!

(model region)

PRGs: Population Reference Graphs

• Simple: acyclic, directed (a sub-class of general variation graphs)

• Usually built from an MSA; preserve gap positions (i.e. global homology between input sequences)

• Generative model: recombination

• Ploidy well-defined (0, 1, 2)

[Figure: example PRG; edges are labelled with bases and gap characters (_)]
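The construction described above (a PRG built from an MSA, with gap positions preserved) can be sketched roughly as follows. This is a minimal illustration, not the published implementation: it assumes the simplest possible layout, one graph level per MSA column, with one edge per distinct character (base or gap) in that column. The names `build_prg` and `count_paths` are hypothetical.

```python
def build_prg(msa_rows):
    """Build a column-wise PRG from equal-length MSA rows.

    Returns a list of levels, one per MSA column. Each level maps an edge
    label (a base or the gap character '_') to the set of input-sequence
    indices that carry this label in the column.
    """
    lengths = {len(row) for row in msa_rows}
    if len(lengths) != 1:
        raise ValueError("all MSA rows must have the same length")
    levels = []
    for col in range(lengths.pop()):
        edges = {}
        for idx, row in enumerate(msa_rows):
            edges.setdefault(row[col], set()).add(idx)
        levels.append(edges)
    return levels


def count_paths(levels):
    """Number of distinct paths (recombinant haplotypes) through the graph."""
    n = 1
    for edges in levels:
        n *= len(edges)
    return n


# Two input sequences; '_' marks a gap position kept from the MSA.
msa = ["AACAC_TTT", "TTCACGTTT"]
prg = build_prg(msa)
print(len(prg), count_paths(prg))
```

Because every column is an independent choice point, the graph contains more haplotypes than the input (here 8 paths from 2 sequences), which is one way to read "generative model: recombination".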

Outline

• Quick recap: what we know about the utility of graph genome approaches

• New results: haplotyping in hypervariable regions (HLA); pseudo graph alignment

• De novo assembly of ribosomal DNA

In most of the MHC, single-reference approaches work just fine…

[Figure: number of kmers (millions), recovered vs. not recovered, for PGF reference, Platypus, PRG-Viterbi and PRG-Mapped]

+ long-read validation with consistent results (not shown)

Dilthey et al., Nature Genetics 2015

… graph genomes outperform in the most complex sub-region of the MHC …

Dilthey et al., Nature Genetics 2015

… remaining problems driven by incomplete input haplotypes + algorithmics.

[Figure: aligned kmers, chromotype position (kb) against read position (kb)]

Incomplete input haplotypes: large uncharacterized inversion.

Algorithmics: incorrect HLA haplotyping.

Dilthey et al., Nature Genetics 2015

HLA haplotyping

• Hypothesis: whole-genome sequencing data contains the information necessary for accurate HLA typing

• “HLA typing”: HLA gene exon sequences
  • HLA class I: exons 2 and 3
  • HLA class II: exon 2

• Challenge: align reads to the right gene (homology hell)

• Proper read-to-graph alignment instead of k-mers

Class I exon homology

Exon 2 Exon 3

HLA-A 3284 alleles
HLA-B 4077 alleles
HLA-C 2799 alleles

Approach: deep PRG + mapping

Exonic MSA
T*01:01  _ _ A C G T A C T _ _
T*01:02  C A A C A T A C T _ _
T*01:03  _ _ A C G C G C T _ _
T*01:04  _ _ A T C C G C T A C
T*01:05  _ _ A T C C C C T _ _
T*01:06  _ _ _ C C T A C T _ _

Genomic MSA
T*01:01  A G C A _ _ A C G T A C T _ _ C C T A
T*01:02  A C C A C A A C A T A C T _ _ C C T A
T*01:04  _ T T A _ _ A T C C G C T A C C C T A

8 xMHC reference haplotypes
PGF (with T*01:01)   A C T A G C A _ _ A C G T A C T _ _ C C T A T G A
MANN (with T*01:04)  T T T _ T T A _ _ A T C C G C T A C C C T A T G A

1) Gene-only PRG: 46 (pseudo) genes, mostly HLA

Gene 1 |--NNN--| Gene 2 |--NNN--| Gene 3

Per gene: Padding, UTR, Exon 1, Intron 1, Exon 2, UTR, Padding

[Figure: number of reference sequences along the PRG, with the region covered by 'genomic' sequences marked]

2) Varying numbers of input sequences across PRG

3) Use hierarchical MSA approach to combine in

Approach: deep PRG + mapping

[Figure: a read aligned to the PRG, level by level; edge labels include bases and gap characters (_)]

4) Seed-and-extend paired-end mapping to PRG

5) Likelihood-based inference: maximize L( aligned reads | HLA types ) (independently per locus)
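As an illustration of step 5, the sketch below picks, for one locus, the pair of alleles that maximizes the likelihood of the aligned reads. The slides do not spell out the likelihood, so the model here is an assumption: each read is taken to come from either of the two alleles with probability 1/2, and the per-read values P(read | allele) are supplied as input. The function name `infer_hla_type` and the toy numbers are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def infer_hla_type(read_likelihoods):
    """Return the allele pair maximizing L(aligned reads | HLA types).

    read_likelihoods maps an allele name to a list of P(read_i | allele),
    with the same read order for every allele. Assumed model: a read is
    drawn from either allele of the pair with probability 1/2.
    """
    alleles = sorted(read_likelihoods)
    n_reads = len(read_likelihoods[alleles[0]])
    best_pair, best_ll = None, -math.inf
    for a1, a2 in combinations_with_replacement(alleles, 2):
        ll = 0.0
        for i in range(n_reads):
            p = 0.5 * read_likelihoods[a1][i] + 0.5 * read_likelihoods[a2][i]
            ll += math.log(p) if p > 0 else -math.inf
        if ll > best_ll:
            best_pair, best_ll = (a1, a2), ll
    return best_pair, best_ll


# Toy input: two reads fit T*01:01 well, two reads fit T*01:04 well.
toy = {
    "T*01:01": [0.9, 0.9, 0.01, 0.01],
    "T*01:02": [0.1, 0.1, 0.01, 0.01],
    "T*01:04": [0.01, 0.01, 0.9, 0.9],
}
best_pair, best_ll = infer_hla_type(toy)
print(best_pair)
```

Running the search independently per locus, as the slide states, keeps the number of candidate pairs quadratic in the alleles of a single gene rather than exploding across genes.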

High-quality WGS data enables gold-standard accuracy

(of note: 2/3 original discrepancies with validation data were errors in the validation data!)

… but not from exome, MiSeq data

Sequencing error?

Effective fragment length? [2 x read length + IS]
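The bracketed formula can be written out as a one-line helper. This sketch assumes that "IS" denotes the unsequenced stretch between the two mates of a read pair; the function name and the example values are hypothetical and do not describe any dataset from the talk.

```python
def effective_fragment_length(read_length, inner_distance):
    """Effective fragment length of a read pair: 2 x read length + IS."""
    return 2 * read_length + inner_distance


# Hypothetical libraries, for illustration only.
short_fragment = effective_fragment_length(read_length=100, inner_distance=0)
long_fragment = effective_fragment_length(read_length=100, inner_distance=150)
print(short_fragment, long_fragment)
```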

Conclusion (intermediate)

• If the input sequencing data is "good enough", we manage near-perfect haplotyping in the genome's most polymorphic region

• Effective fragment length is likely the most important factor

• Not-so-good sequencing data: joint haplotyping + alignment (i.e. alignment location is not independent of inferred haplotype)

• Read mapping implementation is SLOW

Pseudo graph mapping

Input sequences

Graph

Align short reads to input sequences...

... transpose onto graph
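The "transpose onto graph" step can be illustrated with a coordinate lift-over from an input sequence to its MSA columns, which are also the levels of a graph built from that MSA. This is a minimal sketch under simplifying assumptions: the read is aligned to the ungapped input sequence without indels, and the helper names are hypothetical rather than taken from the actual implementation.

```python
def sequence_to_msa_columns(msa_row, gap="_"):
    """Map each position of the ungapped sequence to its 1-based MSA column."""
    return [col for col, ch in enumerate(msa_row, start=1) if ch != gap]


def transpose_alignment(msa_row, seq_start, read, gap="_"):
    """Lift a gapless read alignment from sequence to MSA coordinates.

    seq_start is the 0-based start of the read in the ungapped sequence.
    Returns a list of (msa_column, read_base) pairs.
    """
    columns = sequence_to_msa_columns(msa_row, gap)
    if seq_start < 0 or seq_start + len(read) > len(columns):
        raise ValueError("read alignment extends beyond the input sequence")
    return [(columns[seq_start + i], base) for i, base in enumerate(read)]


# Seq1 has a gap at MSA column 6, so the read skips that column.
seq1_row = "AACAC_TTT"
lifted = transpose_alignment(seq1_row, seq_start=2, read="CACTT")
print(lifted)
```

Reads that carry insertions or deletions relative to the input sequence do not fit this simple mapping; handling them is what the scrubbing, cutting and cleaning steps are for.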

Scrubbing, cutting, cleaning

Input MSA
      123456789
Seq1  AACAC_TTT
Seq2  TTCACGTTT

Graph: (AA|TT) CAC (_|G) TTT

Lin. alignment
Seq1  AACAC_TTT
Read  AACACGTTT

MSA coor.
      123456X789
Seq1  AACAC__TTT
Read  AACAC_GTTT

Scrubbed
      123456789
Seq1  AACAC_TTT
Read  AACACGTTT

Scrubbing: get rid of INDEL-induced changes in the alignment coordinate system

Cutting: Examine alignment gap structure; cut in "bad" areas; use longest stretch

Cleaning: Find the best gap-less sequence-to-graph alignment + extension with gaps

Graph alignment

       123456789
Graph  AACACGTTT
Seq1   AACACGTTT

Accuracy slightly worse; fast!

Conclusion: perhaps there is a middle ground between graph and linear sequence alignment. Work in progress. Further tuning?

Highest resolution

                          MHC-PRG-2                      HLA*PRG
Cohort        Locus  N    Inferred  Accuracy  Call Rate  Inferred  Accuracy  Call Rate
Platinum Trio A      6    6         1.00      1.00       6         1.00      1.00
Platinum Trio B      6    6         1.00      1.00       6         1.00      1.00
Platinum Trio C      6    6         1.00      1.00       6         1.00      1.00
Platinum Trio DQA1   6    6         1.00      1.00       6         1.00      1.00
Platinum Trio DQB1   6    6         1.00      1.00       6         1.00      1.00
Platinum Trio DRB1   6    6         1.00      1.00       6         1.00      1.00
1000 Genomes  A      22   22        0.86      1.00       22        1.00      1.00
1000 Genomes  B      22   22        1.00      1.00       22        1.00      1.00
1000 Genomes  C      22   22        1.00      1.00       22        1.00      1.00
1000 Genomes  DQA1   12   12        1.00      1.00       12        1.00      1.00
1000 Genomes  DQB1   22   22        1.00      1.00       22        1.00      1.00
1000 Genomes  DRB1   22   22        0.91      1.00       22        0.95      1.00

Towards additional high-quality reference haplotypes…

Remaining challenges: extreme repeats, haplotypes.

Sergey Koren

Ribosomal DNA

• Encodes ribosomal RNA

• Hundreds of copies (tandem repeat arrays)

• Variation poorly characterized

• Step 1: Targeted approach

• Step 2: WGS-based

• Step 3: Variation graph

Read error vs variation

… from whole-genome data?

Long reads, de Bruijn graph. Technology!

6% > 50k

Summary

• Variation graphs are worth the effort, at least in highly complex regions.

• Evidence: MHC "model system"
  + overall improvement of genome inference accuracy
  + complex-locus haplotyping

• Incorporate LD?

• Middle ground between full graph alignment and linear sequence alignment?

• Ribosomal DNA: let me know if you're also interested!

Acknowledgements

NIH: Adam Phillippy, Sergey Koren, Brian Walenz, Jung-Hyun Kim, Vladimir Larionov

Oxford: Gil McVean, Zam Iqbal, Alexander Mentzer

Histogenetics: Nezih Cereb

UCSF/Nantes: Pierre-Antoine Gourraud

GSK: Matt Nelson, Charles Cox
