Getting the Most from the Reference Assembly

Getting the Most from the Reference Assembly Valerie Schneider, Ph.D. NCBI 18 October 2016 https://genomereference.o

Upload: genome-reference-consortium

Post on 17-Jan-2017


Page 1: Getting the most from the reference assembly

Getting the Most from the Reference Assembly

Valerie Schneider, Ph.D.

NCBI

18 October 2016

https://genomereference.org

Page 2: Getting the most from the reference assembly

https://genomereference.org

Twitter: @GenomeRef

Announcements: [email protected]

Page 3: Getting the most from the reference assembly

Outline

• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data

Page 4: Getting the most from the reference assembly

Reference Assembly Basics

Page 5: Getting the most from the reference assembly

Reference Assembly Basics

Latest assembly version only

Page 6: Getting the most from the reference assembly

Reference Assembly Basics

Spot the Difference!

• Contiguity
• Coverage
• Segmental Duplication/Repeat Representation
• Gene representation
• Diploid/haploid
• Sample/Population
• Assembly level (contig, scaffold, chromosome)
• Full/Partial Genome Representation
• Sequencing method
• Assembly method

Page 7: Getting the most from the reference assembly

Variant analysis

Annotation

Clinical Diagnostics

Comparative genomics

Transcriptomics

Reference Assembly Basics

Page 8: Getting the most from the reference assembly
Page 9: Getting the most from the reference assembly

Reference Assembly Basics

FINISHED?

Page 10: Getting the most from the reference assembly

Reference Assembly Basics

Clone based assemblies

• BAC insert, BAC vector
• Shotgun sequence clone
• Assemble
• GAPS
• Finish
• Minimal Tiling Path
• Define switch points for adjacent components

[Figure: gaps versus fold sequence coverage. Deeper sequence coverage rarely resolves all gaps.]

Page 11: Getting the most from the reference assembly

Lander and Waterman (1988) Genomics

Coverage   Sequenced   Not sequenced
1X         63%         37%
5X         99.4%       0.6%
10X        99.995%     0.005%
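The coverage percentages on this slide are consistent with the Poisson approximation behind the Lander and Waterman model, in which the expected fraction of bases missed at fold coverage c is e^(-c). The following is a minimal sketch under that assumption (uniformly random read placement); the function name is illustrative and not taken from the talk.

```python
import math


def fraction_unsequenced(fold_coverage: float) -> float:
    """Expected fraction of bases covered by no read, assuming reads land
    uniformly at random (Poisson approximation): e^(-coverage)."""
    return math.exp(-fold_coverage)


if __name__ == "__main__":
    for c in (1, 5, 10):
        missed = fraction_unsequenced(c)
        print(f"{c}X: {100 * (1 - missed):.3f}% sequenced, "
              f"{100 * missed:.3f}% not sequenced")
```

Even at 10X a small fraction of bases is expected to be missed by chance alone, before accounting for repeats or regions that are hard to clone or sequence.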

Reference Assembly Basics

N50: a measure of contiguity. Contigs of this length or greater contain half of the assembled bases.

[Figure: N50 by assembly: HuRef; NA12878 (SOAPdenovo); NA12878 (ALLPATHS); CHM1 (MHAP); AK1; HX1. Source: Chaisson and Eichler (2015)]
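N50 is conventionally computed over bases rather than over contig counts: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the assembly size. The sketch below illustrates that calculation; the function name and the toy contig lengths are assumptions for illustration, not data from the talk.

```python
def n50(contig_lengths):
    """Return the contig length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")


if __name__ == "__main__":
    # Toy lengths, invented for illustration only.
    toy_contigs = [80, 70, 50, 40, 30, 20, 10]
    print(n50(toy_contigs))
```

Because the statistic is weighted by bases, a handful of very long contigs can dominate it, which is why it is usually read alongside coverage and gap counts rather than on its own.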

Why all this matters:

• Longer haplotype blocks
• Fewer collapsed repeats & segmental duplications
• Better annotation
• More robust mapping target

Page 12: Getting the most from the reference assembly

Outline

• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data

Page 13: Getting the most from the reference assembly

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

Current Assembly model: represent both haplotypes

GRC Assembly Model

many

Page 14: Getting the most from the reference assembly

GRC Assembly Model

Church et al., PLoS Biol. 2011 Jul;9(7):e1001091

Assembly (e.g. GRCh38)

• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• PAR
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)
• ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7

Page 15: Getting the most from the reference assembly

GRC Assembly Model

Alt loci alignments are an integral part of the assembly model

alignment to chr + scaffold sequence = Alt
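The idea on this slide, that an alt locus is a scaffold sequence together with its alignment to the chromosome, can be sketched as a small data structure. This is an illustrative model only: every class name, field name, scaffold name and coordinate below is hypothetical and does not reflect the GRC or NCBI schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Alignment:
    """Placement of a scaffold on a primary-assembly chromosome."""
    chromosome: str
    chrom_start: int
    chrom_end: int


@dataclass
class AltLocus:
    """An alt = scaffold sequence + its alignment to the chromosome."""
    scaffold_name: str
    scaffold_length: int
    alignment: Alignment


@dataclass
class GenomicRegion:
    name: str
    alt_loci: List[AltLocus] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    regions: Dict[str, GenomicRegion] = field(default_factory=dict)

    def add_alt(self, region_name: str, alt: AltLocus) -> None:
        """Attach an alt locus to a named region, creating the region if needed."""
        region = self.regions.setdefault(region_name, GenomicRegion(region_name))
        region.alt_loci.append(alt)


if __name__ == "__main__":
    assembly = Assembly("GRCh38")
    # Scaffold name and coordinates are invented placeholders, not real values.
    placeholder_alt = AltLocus("example_alt_scaffold", 1_000,
                               Alignment("chr6", 100, 1_100))
    assembly.add_alt("MHC", placeholder_alt)
    print(assembly.regions["MHC"].alt_loci[0].alignment.chromosome)
```

Keeping the alignment on the alt itself reflects the point of the slide: without the placement, a scaffold is just extra sequence with no chromosome context.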

Page 16: Getting the most from the reference assembly

GRCh38

• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb

Page 17: Getting the most from the reference assembly

Outline

• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data

Page 18: Getting the most from the reference assembly

GRCh38: Alt Loci

Alignment Legend: no alignment, mismatch, deletion

Page 19: Getting the most from the reference assembly

GRCh38: Alt Loci

PLoS Biol. 2011 Jul;9(7):e1001091

[Figure labels: chromosome; alt/patch; reads; On-target alignment; Off-target alignments (n=122,922)]

Page 20: Getting the most from the reference assembly

GRCh38: Assembly Stats

http://www.biorxiv.org/content/early/2016/08/30/072116

Page 21: Getting the most from the reference assembly

GRCh38: Assembly Stats

GRCh38 vs. GRCh37

http://www.biorxiv.org/content/early/2016/08/30/072116

Page 22: Getting the most from the reference assembly

GRCh38: Annotation Stats

http://www.biorxiv.org/content/early/2016/08/30/072116

Page 23: Getting the most from the reference assembly

GRCh38 Base Updates

http://www.biorxiv.org/content/early/2016/08/30/072116

Page 24: Getting the most from the reference assembly

GRCh38 Novel Sequence

New in GRCh38

http://www.biorxiv.org/content/early/2016/08/30/072116

Page 25: Getting the most from the reference assembly

GRCh38 Centromeres

Miga et al., Genome Res. 2014 Apr;24(4):697-707

Page 26: Getting the most from the reference assembly

Assembly Updates

Assembly (e.g. GRCh38.p1)

• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7
• PAR
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)

Patches:

• Genomic Region (ABO)
• Genomic Region (FOXO6)
• Genomic Region (FCGBP)

Patches                                          FIX               NOVEL
Scaffold status at next major assembly release   -- (integrated)   ALT LOCI
Treat as                                         Preferred         Allelic
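The distinction between the two patch types can be expressed as a simple lookup: a FIX patch corrects the chromosome and is integrated at the next major release, so it is treated as preferred, while a NOVEL patch adds an alternate representation, becomes an alt locus, and is treated as allelic. The sketch below only encodes that rule; the enum, the dictionary and the function name are hypothetical and not part of any GRC tool.

```python
from enum import Enum


class PatchType(Enum):
    FIX = "FIX"
    NOVEL = "NOVEL"


# (scaffold status at next major assembly release, how to treat the patch now)
PATCH_HANDLING = {
    PatchType.FIX: ("integrated", "preferred"),
    PatchType.NOVEL: ("alt locus", "allelic"),
}


def handling(patch_type: PatchType):
    """Return (status at next major release, treat-as) for a patch type."""
    return PATCH_HANDLING[patch_type]


if __name__ == "__main__":
    for patch_type in PatchType:
        status, treat_as = handling(patch_type)
        print(f"{patch_type.value}: next release -> {status}; treat as {treat_as}")
```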

Page 27: Getting the most from the reference assembly

Assembly Updates

GRCh38.p9

• 96 Patches: >1 Mb novel sequence
• 48 FIX
• 48 NOVEL

Page 28: Getting the most from the reference assembly

Outline

• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data

Page 29: Getting the most from the reference assembly

Accessing the Data

https://genomereference.org

Page 30: Getting the most from the reference assembly

Accessing the Data

Page 31: Getting the most from the reference assembly

Accessing the Data

Page 32: Getting the most from the reference assembly

Accessing the Data

Page 33: Getting the most from the reference assembly

Accessing the Data

Page 34: Getting the most from the reference assembly

Accessing the Data

Page 35: Getting the most from the reference assembly

Accessing the Data

Page 36: Getting the most from the reference assembly

GRC Assembly Management

What assembly version?

Where’s the problem?

How can we contact you?

What’s wrong?

Page 37: Getting the most from the reference assembly
Page 38: Getting the most from the reference assembly

Accessing the Data

Page 39: Getting the most from the reference assembly

Accessing the Data

Learn more about navigating GRCh38 with NCBI browsers and resources: 1926F (3-4 pm)

Page 40: Getting the most from the reference assembly

Accessing the Data

http://www.ensembl.org/

Page 41: Getting the most from the reference assembly

Accessing the Data

Page 42: Getting the most from the reference assembly

Accessing the Data

ftp://ngs.sanger.ac.uk/production/grit/track_hub/hub.txt

Page 43: Getting the most from the reference assembly

Outline

• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data

Page 44: Getting the most from the reference assembly

GRC Credits

https://genomereference.org

GRCh38 Collaborators

• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes

GRC SAB

• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs

Page 45: Getting the most from the reference assembly

Assembly Updates