everyday de novo diploid assembly
TRANSCRIPT
FOR RESEARCH USE ONLY. Not for use in diagnostic procedures. © 10x Genomics, Inc. 2016
Everyday de novo diploid assembly
Deanna M. Church, Oct 2016
@deannachurch
2
Disclosures
Employee and Shareholder: 10x Genomics
Shareholder: Personalis
10x Genomics products described are for Research Use Only. Not for use in diagnostic procedures.
3
Acknowledgements
The entire team at 10x
David Jaffe, Neil Weisenfeld, Vijay Kumar, Preyas Shah
Patrick Marks
4
Agenda
• Why haven’t we always done de novo assembly on every sample?
5
Agenda
• Why haven’t we always done de novo assembly on every sample?
• What are Linked-Reads?
6
Agenda
• Why haven’t we always done de novo assembly on every sample?
• What are Linked-Reads?
• What does everyday de novo assembly enable today?
7
Why haven’t we always done de novo genome analysis?
8
9
10
11
Current approach: averaging over haplotypes
12
Averaging over haplotypes fails with increased diversity
doi:10.1038/nature20098
AK1
13
New technology reveals more information
14
New technology reveals more information
doi:10.1038/nrg3933
15
Public human assemblies to date
https://www.ncbi.nlm.nih.gov/assembly/organism/9606/latest/
Composite genomes
• GRCh38
• Celera (2)
Hydatidiform moles (single haplotypes)
• CHM1 (9)
• CHM13 (5)
Individual genomes
• NA12878 (9)
• HX1
• A/J Son
• A/J Mother
• A/J Father
• NA18507
• YH1
• HS1011
• AK1
• HuRef
What these assemblies required
• Lots of labor
• Lots of time
• Lots of coverage
• Lots of money
16
What are Linked-Reads?
17
Unlinked-Reads: short range information
18
Linked-Reads: long range information
19
Start with long molecules
NA19240
20
Making Linked-Reads
P5, 16 bp BC, R1, Nmer, gDNA Insert
21
Making Linked-Reads
Long input molecule
Excess of sequenceable inserts randomly primed off each long molecule
P5, 16 bp BC, R1, Nmer, gDNA Insert
22
Making Linked-Reads
Long input molecule (50 Kb)
Excess of sequenceable inserts randomly primed off each long molecule
P5, 16 bp BC, R1, Nmer, gDNA Insert
Long input molecule (50 Kb)
30x sequence, ~35 fragments, ~0.2x coverage
Standard reference based analysis recommendations
23
Making Linked-Reads
Long input molecule (50 Kb)
Excess of sequenceable inserts randomly primed off each long molecule
P5, 16 bp BC, R1, Nmer, gDNA Insert
Long input molecule (50 Kb)
56x sequence, ~65 fragments, ~0.4x coverage
Supernova analysis recommendations
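The per-molecule coverage figures on these two slides can be checked with a little arithmetic. The sketch below assumes each sequenced fragment contributes one read pair of 2 x 150 bases (300 bases total); the slides do not state the read length, so `bases_per_fragment` is an assumption, and the function name is illustrative only.

```python
def per_molecule_coverage(fragments, molecule_bp, bases_per_fragment=300):
    """Approximate sequence coverage of one long input molecule.

    fragments: number of sequenced fragments carrying the molecule's barcode
    molecule_bp: length of the long input molecule in bases
    bases_per_fragment: ASSUMED 2 x 150 bases per fragment (not from the slides)
    """
    return fragments * bases_per_fragment / molecule_bp


# Fragment counts and the 50 Kb molecule length are taken from the slides.
for label, fragments in [("reference-based, 30x", 35), ("Supernova, 56x", 65)]:
    coverage = per_molecule_coverage(fragments, 50_000)
    print(f"{label}: ~{coverage:.2f}x per 50 Kb molecule")
```

Under that read-length assumption, 35 and 65 fragments come out close to the ~0.2x and ~0.4x quoted on the slides: each molecule is sequenced only sparsely, and the barcode is what ties its reads together.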
24
Synthetic Long Reads: less physical coverage
[Figure labels: A, B, C; Sequencing cost; Physical coverage]
25
Linked-Reads: greater physical coverage
[Figure labels: A, B, C; Sequencing cost; Physical coverage]
26
Linked-Reads allow for increased physical coverage
150X avg physical coverage
Chr13: BRCA2
>56X avg read coverage (assembly)
27
Generating Linked-Reads
Start with:
HMW gDNA, 100 Kb+ molecules
1.0 ng input DNA = 300 copies of the genome
0.5 ng DNA = 150 copies of the genome, partitioned into >1M GEMs
[Figure labels: DNA; Oil; Barcoded Primer Library; Enzyme; Collect]
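The mass-to-copies conversion on this slide can be reproduced approximately. The sketch below assumes a haploid human genome of about 3.2 Gb and an average of about 650 g/mol per base pair; neither value appears on the slides, so both constants are assumptions, and the result is only expected to land near the rounded 300 and 150 quoted.

```python
AVOGADRO = 6.02214076e23          # molecules per mole
GRAMS_PER_MOL_PER_BP = 650.0      # ASSUMED average mass of one base pair
HAPLOID_GENOME_BP = 3.2e9         # ASSUMED haploid human genome size


def genome_copies(mass_ng):
    """Approximate number of haploid genome copies in a given DNA mass (ng)."""
    grams_per_genome = HAPLOID_GENOME_BP * GRAMS_PER_MOL_PER_BP / AVOGADRO
    return mass_ng * 1e-9 / grams_per_genome


for mass_ng in (1.0, 0.5):
    print(f"{mass_ng} ng is roughly {genome_copies(mass_ng):.0f} genome copies")
```

With these constants 1.0 ng works out to a little under 300 copies and 0.5 ng to a little under 150, which matches the slide's rounded numbers and the 150X average physical coverage quoted earlier.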
28
Assembly made easy
BCL → Supernova de novo assembly → FASTA
1200M
NA19240
http://www.biorxiv.org/content/early/2016/08/19/070425
1 server, 348 Gb memory, 2 days compute
1 library, 1 ng input
29
Assembly made easy
BCL → Supernova de novo assembly → FASTA
1200M
NA19240
1 library, 1 ng input
http://www.biorxiv.org/content/early/2016/08/19/070425
1 server (28 cores), 348 Gb memory, 2 days compute
30
Assembly made easy
BCL → Supernova de novo assembly → FASTA
1200M
NA19240
1 library, 1 ng input
http://www.biorxiv.org/content/early/2016/08/19/070425
1 server (28 cores), 348 Gb memory, 2 days compute
megabubble, megabubble, megabubble
31
Performance over multiple human samples
http://www.biorxiv.org/content/early/2016/08/19/070425
sample   ethnicity  sex  cov  frag  N50 contig (Kb)  N50 scaffold (Mb)  N50 phase block (Mb)  Gap (%)
NA19238  YRI        F    56   115   114.6            18.7               8                     2.1
NA19240  YRI        F    56   125   118.8            16.4               9.3                   2.3
HG00733  PR         F    56   106   123.6            17.8               3.4                   2.0
HG00512  HAN        M    56   102   113.2            15.4               2.7                   2.2
NA24385  AJ         M    56   120   106.4            15.1               4.2                   2.6
HGP      EUR        M    56   139   120.2            18.6               4.5                   2.5
NA12878  EUR        F    56   92    118.5            16.4               2.8                   2.9
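The contig, scaffold, and phase block columns above are all N50 values. As a reminder of what that statistic means, here is a minimal sketch of the standard N50 definition: the length L such that pieces of length L or longer contain at least half of the total bases. The function name and the toy lengths are illustrative and are not taken from the Supernova output.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and walk down the list until the
    running total reaches half of the summed length; that length is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one length")


# Toy example: total is 32, so the two longest pieces (8 + 8) reach half.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

The same function applies to contigs, scaffolds, or phase blocks; only the list of lengths changes.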
32
High quality assembly at lower coverage
[Figure: three panels plotting Contig N50 (kb), Scaffold N50 (Mb), and Phase Block N50 (Mb) against number of reads (millions), from 500 to 1,300 million reads]
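The plots on this slide are indexed by read count, while other slides quote coverage (for example 56x). A rough conversion between the two is sketched below. It assumes 150-base reads and a 3.2 Gb genome, neither of which is stated on the slides, so treat the constants and the function name as assumptions.

```python
def read_coverage(n_reads, read_length=150, genome_bp=3.2e9):
    """Approximate sequence coverage from a read count.

    read_length and genome_bp are ASSUMED values, not taken from the slides.
    """
    return n_reads * read_length / genome_bp


for millions in (500, 700, 900, 1100, 1300):
    coverage = read_coverage(millions * 1e6)
    print(f"{millions}M reads is roughly {coverage:.0f}x")
```

Under these assumptions 1,200 million reads corresponds to about 56x, which lines up with the coverage column in the sample table.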
33
De Novo Performance Drastically Improves with Increased DNA Length
[Figure: three panels plotting Contig N50, Scaffold N50 (Mb), and Phase Block N50 against DNA length, from 0 to 60,000]
34
Comparison to truth data
35
Assembly assessment
[Figure: percent of GRCh37 100-mers missing per assembly, shown as haploid and diploid values (y-axis 0 to 25), comparing Supernova (10x) assemblies with other methods. Samples along the x-axis: NA19238, NA19240, HG00733, HG00512, NA24385, HGP, NA12878, YH, NA12878, NA12878, NA12878, NA24385, NA24143]
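The metric on this slide is the percentage of reference 100-mers that are absent from an assembly. The slides do not describe how it was computed, so the sketch below is only one plausible way to do it: collect the reference k-mers, collect the assembly k-mers, and report the share of reference k-mers not found. Treating a k-mer and its reverse complement as the same, and skipping k-mers containing N, are assumptions of this sketch; all function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(sequences, k):
    """Set of strand-independent k-mers, skipping any that contain non-ACGT."""
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= _BASES:
                # Keep whichever strand sorts first so both strands match.
                kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def percent_missing(reference_seqs, assembly_seqs, k=100):
    """Percent of distinct reference k-mers that are absent from the assembly."""
    reference = canonical_kmers(reference_seqs, k)
    if not reference:
        raise ValueError("reference contains no valid k-mers")
    assembly = canonical_kmers(assembly_seqs, k)
    return 100.0 * len(reference - assembly) / len(reference)


# Toy example with k=4 so the numbers are easy to follow.
print(percent_missing(["AAAACCCC"], ["AAAACC"], k=4))
```

For a diploid assembly, one option is to pass the sequences of both haplotypes as `assembly_seqs`; for a haploid value, pass only one. A set-based approach like this is fine for toy inputs but would need a disk- or sketch-based k-mer counter at whole-genome scale.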
36
What does everyday de novo assembly enable?
37
Ideal: Complete genome information
doi:10.1038/nature09534
• SNVs
• Deletions
• Insertions
• Inversions
• Translocations
38
Areas in which assembly excels: diverse regions
AluY
Supernova (de novo)
PacBio Reads
Illumina Reads
39
Areas in which assembly excels: insertions
Supernova (de novo)
PacBio Reads
Illumina Reads
40
Areas in which assembly excels: insertions
41
Areas in which assembly excels: insertions
SHANK2
GRCh37: chr11
GRCh37.p13: chr11_fix_patch
42
Areas in which assembly excels: insertions
SHANK2
GRCh37: chr11
GRCh37.p13: chr11_fix_patch
35 kb
43
Areas in which assembly excels: insertions
Hap1_scaffold7938
Hap2_scaffold7939
chr11
SHANK2
44
Areas in which assembly excels: insertions
Hap1_scaffold7938
Hap2_scaffold7939
chr11
SHANK2
45
Areas in which assembly excels: insertions
Hap1_scaffold7938
Hap1_scaffold7939
chr11
SHANK2
chr11
Hap2_scaffold7939
SHANK2
Hap1_scaffold7938
46
Assembly analysis: alignment work needed
SHANK2
Supernova (de novo)
PacBio Reads
Illumina Reads
47
Areas in which assembly excels: inversions
GRCh37 chrX:6137041-6138541 (NLGN4X)
Supernova (de novo)
PacBio Reads
Illumina Reads
48
Assembly analysis: alignment work needed
GRCh37 chrX:6137041-6138541 (NLGN4X)
Hap1_scaffold5127
Hap2_scaffold5128
49
FASTA is a lossy format
megabubble, megabubble, megabubble
multi-Mb phase blocks
many Mb scaffold
microstructure
• bubbles, often at indeterminate poly-A
• short gaps, often at poly-A
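One way to see why a flat sequence file loses information is a toy model of a scaffold as a chain of megabubbles. The sketch below is not Supernova's graph format: the `Megabubble` class, the `flatten` function, and the example sequences are all hypothetical, chosen only to show that two different phasing structures can produce identical pairs of FASTA-style sequences.

```python
from dataclasses import dataclass


@dataclass
class Megabubble:
    """Hypothetical phased region with one sequence per haplotype arm."""
    arm_a: str
    arm_b: str


def flatten(scaffold):
    """Flatten a scaffold into two plain sequences, one per pseudo-haplotype.

    scaffold: list whose items are either a str (unphased, shared sequence)
    or a Megabubble. Arm A always goes to the first output and arm B to the
    second, even though arms of separate megabubbles are not phased together.
    """
    hap1, hap2 = [], []
    for part in scaffold:
        if isinstance(part, Megabubble):
            hap1.append(part.arm_a)
            hap2.append(part.arm_b)
        else:
            hap1.append(part)
            hap2.append(part)
    return "".join(hap1), "".join(hap2)


# Two independent phase blocks separated by shared sequence...
two_blocks = [Megabubble("AAT", "AAC"), "GG", Megabubble("TTA", "TTC")]
# ...versus a single phase block spanning the whole region.
one_block = [Megabubble("AATGGTTA", "AACGGTTC")]

print(flatten(two_blocks) == flatten(one_block))
```

Once flattened, the two cases are indistinguishable: the sequences no longer record where phase blocks start and end, which is the kind of information a native graph format can retain.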
50
Native formats have more information
Supernova (de novo)
Long Ranger (reference based)
51
Native formats have more information
Supernova (de novo)
Long Ranger (reference based)
52
Native formats have more information
Supernova (de novo)
Long Ranger (reference based)
53
Conclusions
• Routine, de novo, diploid assembly of 1000s of samples is possible today!
• Early uses will be for better resolution of divergent regions and novel sequence
• A new generation of tools needs to be developed to fully utilize assembly data