Getting the Most from the Reference Assembly
Valerie Schneider, Ph.D., NCBI
18 October 2016
https://genomereference.org
Twitter: @GenomeRef
Announcements: [email protected]
Outline
• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data
Reference Assembly Basics
Latest assembly version only
Spot the Difference!
Contiguity
Coverage
Segmental Duplication/Repeat Representation
Gene representation
Diploid/haploid
Sample/Population
Assembly level (contig, scaffold, chromosome)
Full/Partial Genome Representation
Sequencing method
Assembly method
Variant analysis
Annotation
Clinical Diagnostics
Comparative genomics
Transcriptomics
FINISHED?
Clone-based assemblies
BAC insert
BAC vector
Shotgun sequence clone
Assemble
GAPS
Finish
Minimal Tiling Path
Define switch points for adjacent components
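A minimal sketch of what switch points do along a tiling path. The clone names, sequences, and coordinates are invented for illustration, and the 1-based inclusive coordinate convention is an assumption rather than the GRC's actual file format. Each component contributes only the span between its switch points, and the spans are joined end to end.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str      # hypothetical clone name
    sequence: str  # finished clone sequence
    start: int     # first base used, 1-based inclusive (assumed convention)
    end: int       # last base used, 1-based inclusive


def build_from_tiling_path(path):
    """Join the used span of each component, in tiling-path order."""
    pieces = []
    for comp in path:
        if not (1 <= comp.start <= comp.end <= len(comp.sequence)):
            raise ValueError(f"bad switch points for {comp.name}")
        pieces.append(comp.sequence[comp.start - 1:comp.end])
    return "".join(pieces)


# Two toy clones that overlap: the last 4 bases of cloneA match
# the first 4 bases of cloneB.
path = [
    Component("cloneA", "ACGTACGTTTGA", 1, 10),  # switch out of A after base 10
    Component("cloneB", "TTGACCCGGG", 3, 10),    # switch into B at base 3
]
assembled = build_from_tiling_path(path)
print(assembled)
```

Because the switch points sit inside the overlap, the shared bases are counted once and the joined sequence has no duplication at the clone boundary.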
Figure: gaps plotted against fold sequence. Deeper sequence coverage rarely resolves all gaps.
Lander and Waterman (1988) Genomics
1X Coverage: 37% not sequenced, 63% sequenced
5X Coverage: 0.6% not sequenced, 99.4% sequenced
10X Coverage: 0.005% not sequenced, 99.995% sequenced
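The percentages above follow from a simple random-sampling model: if reads land uniformly at random, the expected fraction of bases hit by no read at fold coverage c is e^(-c). A short sketch, assuming that idealised Poisson model (the function name is illustrative):

```python
import math


def fraction_not_sequenced(coverage):
    """Expected fraction of bases covered by no read at the given fold coverage."""
    return math.exp(-coverage)


for c in (1, 5, 10):
    missed = fraction_not_sequenced(c)
    print(f"{c}X coverage: {missed:.4%} not sequenced, {1 - missed:.4%} sequenced")
```

The model ignores repeats, cloning bias, and other non-random effects, which is one reason real assemblies keep gaps even at high coverage.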
N50: a measure of contiguity. Contigs of this length or greater contain half of the bases in the assembly.
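N50 is usually computed by sorting contigs from longest to shortest and walking down the list until the running total reaches half of the assembly's total bases; the length of the contig at that point is the N50. A small sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, longest first,
    first reaches half of the total assembled bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (arbitrary units), not taken from any real assembly.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))
```

Here the total is 290, and the two longest contigs (80 + 70 = 150) are the first to pass half of it, so the N50 is 70.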
Figure: coverage and N50 by assembly (HuRef; SOAPdenovo NA12878; ALLPATHS NA12878; MHAP CHM1; AK1; HX1)
Chaisson and Eichler (2015)
Why all this matters:
• Longer haplotype blocks
• Fewer collapsed repeats & segmental duplications
• Better annotation
• More robust mapping target
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
Current Assembly model: represent both haplotypes
GRC Assembly Model
Assembly (e.g. GRCh38)
Primary Assembly Unit
Non-nuclear assembly unit (e.g. MT)
ALT 1 ... ALT 7 (many)
PAR
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
Alt loci alignments are an integral part of the assembly model
Scaffold sequence + alignment to chr = Alt
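One way to picture the model is as a small data structure in which an alt locus is a scaffold paired with its alignment to the chromosome. This is only a sketch: the class names, field names, and coordinates are hypothetical and do not correspond to any GRC or NCBI schema.

```python
from dataclasses import dataclass, field


@dataclass
class Alignment:
    chromosome: str
    chrom_start: int
    chrom_end: int


@dataclass
class AltLocus:
    scaffold_name: str
    scaffold_length: int
    alignment: Alignment  # alt = scaffold sequence + alignment to the chromosome


@dataclass
class GenomicRegion:
    name: str
    chromosome: str
    start: int
    end: int
    alt_loci: list = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    primary_unit: list   # chromosome names in the primary assembly unit
    non_nuclear: list    # e.g. ["MT"]
    regions: list = field(default_factory=list)

    def alts_at(self, chromosome, position):
        """Alt loci whose alignment spans the given chromosome position."""
        return [
            alt
            for region in self.regions
            for alt in region.alt_loci
            if alt.alignment.chromosome == chromosome
            and alt.alignment.chrom_start <= position <= alt.alignment.chrom_end
        ]


# Toy values only; these are not real GRCh38 coordinates.
mhc = GenomicRegion("MHC", "chr6", 1000, 5000)
mhc.alt_loci.append(AltLocus("toy_alt_1", 3800, Alignment("chr6", 1200, 4900)))
toy_assembly = Assembly("toy_assembly", ["chr6"], ["MT"], [mhc])
print([alt.scaffold_name for alt in toy_assembly.alts_at("chr6", 2000)])
```

Keeping the alignment on the alt locus itself is what lets a tool relate a position on the alt scaffold back to its chromosome context.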
GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb
GRCh38: Alt Loci
Alignment Legend: no alignment, mismatch, deletion
chromosome
alt/patch
reads
On-target alignment
Off-target alignments (n=122,922)
PLoS Biol. 2011 Jul;9(7):e1001091
GRCh38: Assembly Stats
GRCh38 vs. GRCh37
GRCh38: Annotation Stats
GRCh38 Base Updates
GRCh38 Novel Sequence
New in GRCh38
http://www.biorxiv.org/content/early/2016/08/30/072116
GRCh38 Centromeres
Miga et al., Genome Res. 2014 Apr;24(4):697-707
Assembly Updates
Assembly (e.g. GRCh38.p1)
Primary Assembly Unit
Non-nuclear assembly unit (e.g. MT)
ALT 1 ... ALT 7
PAR
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
Patches
Genomic Region (ABO)
Genomic Region (FOXO6)
Genomic Region (FCGBP)
Patches
FIX: scaffold status at next major assembly release: -- (integrated). Treat as: Preferred
NOVEL: scaffold status at next major assembly release: ALT LOCI. Treat as: Allelic
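The patch rules can be written down as a small lookup, reading the slide as: FIX patches are integrated at the next major release and treated as preferred over the chromosome sequence, while NOVEL patches become alt loci and are treated as allelic. The names below are illustrative, and the helper reflects one possible interpretation of "preferred" and "allelic", not an official GRC procedure.

```python
from enum import Enum


class PatchType(Enum):
    FIX = "FIX"
    NOVEL = "NOVEL"


# patch type -> (scaffold status at next major release, how to treat it)
PATCH_RULES = {
    PatchType.FIX: ("integrated", "preferred"),
    PatchType.NOVEL: ("alt locus", "allelic"),
}


def valid_representations(patch_type, chromosome_seq, patch_seq):
    """Sequences to regard as valid for a patched region (assumed reading).

    FIX: the patch corrects the chromosome, so only the patch is kept.
    NOVEL: the patch is an additional allele, so both are kept.
    """
    if patch_type is PatchType.FIX:
        return [patch_seq]
    return [chromosome_seq, patch_seq]


for patch_type, (status, treat_as) in PATCH_RULES.items():
    print(f"{patch_type.value}: next major release -> {status}; treat as {treat_as}")
```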
GRCh38.p9
• 96 Patches: >1 Mb novel sequence
• 48 FIX
• 48 NOVEL
Accessing the Data
https://genomereference.org
GRC Assembly Management
What assembly version?
Where’s the problem?
How can we contact you?
What’s wrong?
Learn more about navigating GRCh38 with
NCBI browsers and resources: 1926F (3-4 pm)
ftp://ngs.sanger.ac.uk/production/grit/track_hub/hub.txt
GRC Credits
https://genomereference.org
GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs
Assembly Updates
https://www.ncbi.nlm.nih.gov/genome/tools/remap