abgt 2016 workshop schneider

Relating New Assemblies to the Human Genome Reference Valerie Schneider, Ph.D. NCBI 10 February 2016 http://genomereference.or

Upload: genome-reference-consortium

Post on 17-Jan-2017


TRANSCRIPT

Page 1: ABGT 2016 Workshop Schneider

Relating New Assemblies to the Human Genome Reference

Valerie Schneider, Ph.D.

NCBI

10 February 2016

http://genomereference.org

Page 2: ABGT 2016 Workshop Schneider

http://genomereference.org

Twitter: @[email protected]

Page 3: ABGT 2016 Workshop Schneider
Page 4: ABGT 2016 Workshop Schneider

Overview

• Changes in reference assembly sequence sources
• Diversity
• Properties
• Evaluating new sequences for use (Assemblathon)
• Future of assembly curation and the reference assembly

Page 5: ABGT 2016 Workshop Schneider

GRCh38

• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 alt loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb

Page 6: ABGT 2016 Workshop Schneider
Page 7: ABGT 2016 Workshop Schneider

Assembly Composition

Page 8: ABGT 2016 Workshop Schneider

WGS Assemblies Contributing to GRCh38

Assembly Name               Assembly Accession  Seq Method  Usage                            Length (bp)
RP11_1.0_unmatched_regions  GCA_000442295.1     454         Gaps, Correction                 754717 (0.02%)
CHM1_1.1                    GCF_000306695.2     Illumina    Gaps, Correction                 133662 (0.004%)
HsapALLPATHS1               GCA_000185165.1     Illumina    Gaps, Correction                 364303 (0.01%)
HuRef                       GCF_000002125.1     Sanger      Gaps, Correction, Alt Loci, CEN  4800690 (0.16%)
LinearCen1.1 (normalized)   GCA_000442335.2     Sanger      CEN                              59546786 (2.02%)

Assembly Composition

Page 9: ABGT 2016 Workshop Schneider

WGS Gap Closure

Page 10: ABGT 2016 Workshop Schneider

Human assemblies available in the NCBI assembly database

http://www.ncbi.nlm.nih.gov/assembly

Assemblies in GenBank

Oct. 2014: 13 assemblies

Nov. 2015: 28 assemblies

Feb. 2016: 39 assemblies

[Figure labels: YRI, CEU, CEU, CHB]

Page 11: ABGT 2016 Workshop Schneider

Reference Assembly Basics

Reads:   Sanger  Sanger  Illumina  Illumina  PacBio (older)  PacBio (newer)
Method:  clone   WGS     WGS       WGS       WGS             WGS

N50: Measure of contiguity. Half of the assembled bases are in contigs of this length or greater.
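As a minimal sketch of how N50 is usually computed (the function name `n50` and the example lengths are illustrative, not from the slides): sort the contigs longest first and return the length at which the running total reaches half of the assembled bases. Because the statistic is weighted by bases, it generally differs from the median contig length.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    N50 is the length L such that contigs of length >= L together
    contain at least half of all bases in the assembly.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("n50() needs at least one contig length")

    total_bases = sum(lengths)
    running_total = 0
    for length in lengths:
        running_total += length
        # Stop at the first contig that brings us to half the bases.
        if 2 * running_total >= total_bases:
            return length


# Hypothetical toy assembly: 8 contigs, 32 bases in total.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_contigs)
```

For the toy input, the two longest contigs already hold 16 of 32 bases, so the N50 is 8, while the median contig length is only 3.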

Page 12: ABGT 2016 Workshop Schneider

Overview

• Changes in reference assembly sequence sources
• Diversity
• Properties
• Evaluating new sequences for use (Assemblathon)
• Future of assembly curation and the reference assembly

Page 13: ABGT 2016 Workshop Schneider

Assemblathon Analysis Overview

CHM1/CHM13 Assemblathon Goals

• Assess aspects of data generation (coverage, length)
• Assess assembler algorithms & parameters
• Platinum genome selection (MGI)
• More robust reference curation (GRC)
• Set expectations for these new resources
• Understand quality and limitations
• Plan for regions needing other resources
• Develop new pipelines and SOPs

Page 14: ABGT 2016 Workshop Schneider

Assembly                                      GRCh38            CHM13_FC         CHM13_CA1        CHM13_CA2        CHM13_CA3        CHM13_CA4
Accession                                     GCF_000001405.26  GCA_000983455.2  GCA_000983465.1  GCA_001015355.1  GCA_000983475.1  GCA_001015385.1
Total sequences                               50,304            50,304           50,304           50,304           50,304           50,304
No Alignment                                  21 (0.04%)        88 (0.17%)       50 (0.10%)       49 (0.10%)       46 (0.09%)       50 (0.10%)
Multiple best alignments (split transcripts)  10 (0.02%)        40 (0.08%)       340 (0.68%)      316 (0.63%)      611 (1.22%)      395 (0.79%)
CDS coverage < 95%                            17 (0.04%)        256 (0.66%)      378 (0.97%)      326 (0.84%)      622 (1.60%)      392 (1.01%)
Dropped at consolidation (coding)             0                 167              259              278              240              250
Dropped at consolidation (non-coding)         0                 138              212              209              185              191

Assemblathon RefSeq Alignment Stats: CHM13
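The "No Alignment" percentages in these stats can be reproduced from the counts and the 50,304 total sequences. The sketch below is only an illustration of that arithmetic; the names `TOTAL_SEQUENCES`, `no_alignment`, and `percent_of_total` are introduced here and are not from the slides.

```python
# Counts copied from the "No Alignment" row; the denominator is the
# "Total sequences" value reported for every assembly.
TOTAL_SEQUENCES = 50_304

no_alignment = {
    "GRCh38": 21,
    "CHM13_FC": 88,
    "CHM13_CA1": 50,
    "CHM13_CA2": 49,
    "CHM13_CA3": 46,
    "CHM13_CA4": 50,
}


def percent_of_total(count, total=TOTAL_SEQUENCES):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


# Percentage of transcripts with no alignment, per assembly.
no_alignment_pct = {
    assembly: percent_of_total(count)
    for assembly, count in no_alignment.items()
}
```

The "CDS coverage < 95%" percentages do not match this denominator (256 of 50,304 would be about 0.51%, not 0.66%), so they are presumably relative to a smaller set such as coding transcripts only; that is an inference, not something stated on the slide.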

Page 15: ABGT 2016 Workshop Schneider

Frameshifts  GRCh38            CHM13_FC         CHM13_CA1        CHM13_CA2        CHM13_CA3        CHM13_CA4
Accession    GCF_000001405.26  GCA_000983455.2  GCA_000983465.1  GCA_001015355.1  GCA_000983475.1  GCA_001015385.1
proteins     19                218              346              503              627              439
genes        12                161              232              317              366              281

Number of proteins  Assemblies in which frameshifted
953                 1
106                 2
50                  3
113                 4
115                 5
41                  6
2 (PKD1L2)          7

Assemblathon RefSeq Alignment Stats: CHM13
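A tally of how many assemblies share each frameshifted protein can be built from per-assembly sets of protein identifiers. The following is a sketch under assumptions: the function name `frameshift_sharing`, the assembly labels, and the protein IDs are hypothetical and do not come from the Assemblathon data.

```python
from collections import Counter


def frameshift_sharing(frameshifted_by_assembly):
    """Map "number of assemblies" -> "number of proteins frameshifted in that many".

    frameshifted_by_assembly: dict of assembly name -> iterable of protein IDs
    that align with a frameshift in that assembly.
    """
    assemblies_per_protein = Counter()
    for proteins in frameshifted_by_assembly.values():
        # set() guards against a protein being listed twice for one assembly.
        assemblies_per_protein.update(set(proteins))

    # Invert: count proteins by how many assemblies they were seen in.
    return Counter(assemblies_per_protein.values())


# Hypothetical input: P2 is frameshifted in all three assemblies,
# P1 and P3 in one assembly each.
example = {
    "asm_a": {"P1", "P2"},
    "asm_b": {"P2", "P3"},
    "asm_c": {"P2"},
}
sharing = frameshift_sharing(example)
```

Proteins frameshifted in only one assembly are more likely to reflect errors in that assembly, whereas proteins frameshifted in every assembly may instead point to real variation or to a problem with the transcript model.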

Page 16: ABGT 2016 Workshop Schneider

[Figure: "Seq in assembly 1" and "Seq in assembly 2", with regions labeled A, A, B, B', B. Graphic: Deanna Church]

Legend:

• Unique well-aligned region in both assemblies
• First Pass (FP) alignments
• Second Pass (SP) alignments
• SP only: Expansion (Assembly 1)
• SP + FP: Collapse (Assembly 2)

Assemblathon: Assembly-Assembly Alignments
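The SP-only and SP + FP labels on this slide suggest a simple rule for flagging regions. The sketch below is one reading of the figure, not the pipeline used for the Assemblathon: it assumes assembly 1 carries regions A, B, and B' while assembly 2 carries A and B, and the function name `classify_region` and its labels are hypothetical.

```python
def classify_region(has_first_pass, has_second_pass):
    """Label a region by which alignment passes cover it.

    has_first_pass:  region is covered by a First Pass (FP) alignment
    has_second_pass: region is covered by a Second Pass (SP) alignment
    """
    if has_first_pass and has_second_pass:
        # Two regions of the other assembly align here.
        return "possible collapse"
    if has_second_pass:
        # Only reachable through the second pass: an extra copy.
        return "possible expansion"
    if has_first_pass:
        return "unique"
    return "unaligned"


# Assumed reading of the figure: (has_first_pass, has_second_pass) per region.
assembly_1 = {"A": (True, False), "B": (True, False), "B'": (False, True)}
assembly_2 = {"A": (True, False), "B": (True, True)}

labels_1 = {name: classify_region(*passes) for name, passes in assembly_1.items()}
labels_2 = {name: classify_region(*passes) for name, passes in assembly_2.items()}
```

Both labels describe the same event seen from each side: alignments alone cannot say whether assembly 1 has a spurious extra copy or assembly 2 has merged two real copies, so flagged regions still need other evidence.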

Page 17: ABGT 2016 Workshop Schneider

Assembly   Average
CHM13_FC   2.36%
CHM13_CA1  2.38%
CHM13_CA2  2.41%
CHM13_CA3  2.03%
CHM13_CA4  2.13%
GRCh37     1.06%

ungapped

Page 18: ABGT 2016 Workshop Schneider

Overview

• Changes in reference assembly sequence sources
• Diversity
• Properties
• Evaluating new sequences for use (Assemblathon)
• Future of assembly curation and the reference assembly

Page 19: ABGT 2016 Workshop Schneider

• Platinum and gold genomes expected to contribute to reference corrections and alternate loci
• Set standards for use of other WGS assemblies

• Gold and platinum assembly curation
• Tools for local re-assembly
• Assessing and communicating local assembly quality

[Figure labels: GRCh38, CHM1, CHM13, NA19240, NA12878, NA19434, HG00733, HG00514]

Future Curation

Page 20: ABGT 2016 Workshop Schneider

• Multiple human references
• Reference graphs
• Long-term curation

Future Curation

CHM1

CHM13

NA19240

NA12878

HG00733

HG00514

NA19434

GRCh38

Page 21: ABGT 2016 Workshop Schneider

Overview

• Changes in reference assembly sequence sources
• Diversity
• Properties
• Evaluating new sequences for use (Assemblathon)
• Future of assembly curation and the reference assembly

Page 22: ABGT 2016 Workshop Schneider

GRCh38 Credits

GRCh38 Collaborators

• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes

GRC SAB

• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs

Assemblathon Collaborators

• Jason Chin
• Adam Phillippy
• Sergey Koren
• Heng Li

GRC

Tina Graves-Lindsay
Karyn Meltz Steinberg
Kerstin Howe
Richard Durbin
Paul Flicek
Laura Clarke
Deanna Church
Curators!
Developers!