150224 grc kms

Characterizing extreme diversity in the human genome using a single haplotype genomic resource Karyn Meltz Steinberg, Ph.D. AGBT 2015 GRC Workshop @KMS_Meltzy

Upload: genome-reference-consortium

Post on 14-Jul-2015

TRANSCRIPT

Page 1: 150224 grc kms

Characterizing extreme diversity in the human genome using a single haplotype genomic resource

Karyn Meltz Steinberg, Ph.D.

AGBT 2015 GRC Workshop

@KMS_Meltzy

Page 2: 150224 grc kms

Human Genetic Variation

Types of genetic variants

[Figure: frequency plotted against size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr). Labeled variant classes: SNP, copy number variants, trisomy, monosomy.]

Slide courtesy of S. Girirajan

Page 3: 150224 grc kms

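Human Genetic Variation

Types of genetic variants

[Figure: frequency plotted against size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr). Labeled variant classes: SNP, copy number variants, trisomy, monosomy.]

How do we assay them?

[Figure: throughput plotted against size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr). No assays labeled yet.]

Slide courtesy of S. Girirajan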

Page 4: 150224 grc kms

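Human Genetic Variation

Types of genetic variants

[Figure: frequency plotted against size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr). Labeled variant classes: SNP, copy number variants, trisomy, monosomy.]

How do we assay them?

[Figure: throughput plotted against size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr). Labeled assay: SNP genotyping.]

Slide courtesy of S. Girirajan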

Page 5: 150224 grc kms

Human Genetic Variation

Types of genetic variants

[Figure: frequency plotted against size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr). Labeled variant classes: SNP, copy number variants, trisomy, monosomy.]

How do we assay them?

[Figure: throughput plotted against size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr). Labeled assays: SNP genotyping, Array-CGH, Karyotyping.]

Slide courtesy of S. Girirajan

Page 6: 150224 grc kms

Human Genetic Variation

Types of genetic variants

[Figure: frequency plotted against size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr). Labeled variant classes: SNP, copy number variants, trisomy, monosomy.]

How do we assay them?

[Figure: throughput plotted against size of variant (axis ticks: 1 bp, 1 kb, 1 Mb, 1 chr). Labeled assays: SNP genotyping, Array-CGH, Karyotyping, Sequencing.]

Slide courtesy of S. Girirajan

Page 7: 150224 grc kms

Extreme diversity in the human genome

• <99.5% identity to the reference

• Refractory to traditional sequencing efforts

• Loci often contain gene families associated with immune response and xenobiotic metabolism
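
The "<99.5% identity" criterion can be made concrete with a small calculation. The following is a minimal sketch, assuming two sequences that have already been aligned to equal length with '-' for gap columns; the function names and the choice to count gap columns as mismatches are illustrative assumptions, not taken from the talk.

```python
def percent_identity(ref: str, alt: str) -> float:
    """Percent of aligned columns where both sequences carry the same base.

    Assumes `ref` and `alt` are pre-aligned and equal in length, with '-'
    marking gap columns. Gap columns count as mismatches (an assumption).
    """
    if len(ref) != len(alt) or not ref:
        raise ValueError("sequences must be aligned, non-empty, equal length")
    matches = sum(1 for r, a in zip(ref, alt) if r == a and r != "-")
    return 100.0 * matches / len(ref)


def is_extremely_diverse(ref: str, alt: str, threshold: float = 99.5) -> bool:
    """True when identity to the reference falls below the threshold."""
    return percent_identity(ref, alt) < threshold


# Toy example: 6 mismatches in 1,000 aligned bases gives 99.4% identity.
reference = "A" * 1000
haplotype = "A" * 994 + "C" * 6
print(percent_identity(reference, haplotype))
print(is_extremely_diverse(reference, haplotype))
```

In practice the identity would be computed over a real alignment of an alternate haplotype to the reference; the toy strings here only show the arithmetic.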

Page 8: 150224 grc kms

HLA is a classic example of an extremely diverse locus

• Critical to immune response

• Characterized by overdominant selection

• Alleles are linked and segregate as distinct haplotypes

• Shaped by gene duplication and diversification

Page 9: 150224 grc kms
Page 10: 150224 grc kms
Page 11: 150224 grc kms
Page 12: 150224 grc kms
Page 13: 150224 grc kms

Segmental duplications can predispose loci to further rearrangement via NAHR

Page 14: 150224 grc kms

Segmental duplications can predispose loci to further rearrangement via NAHR

Page 15: 150224 grc kms

Diploid Genome

[Figure: allelic copies and repeat copies shown side by side; repeat copies are noted by color difference.]

With a diploid genome, there is significant ambiguity sorting allelic copies from repeat copies

Haploid Genome

[Figure: repeat copies only, noted by color differences.]

With a haploid genome, allelic differences are eliminated, and base differences are likely indicative of repeat copies
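
The reasoning on this slide can be sketched in code. The example below is illustrative only: it assumes reads from several repeat copies have piled up on a single collapsed locus, and it flags positions where two or more bases are each well supported. The function names, the 20% support cutoff, and the toy pileups are all assumptions made for the sketch.

```python
from collections import Counter


def multi_base_sites(pileups, min_fraction=0.2):
    """Return positions where at least two bases each reach `min_fraction` of reads.

    `pileups` maps a position to a string of observed read bases
    (a hypothetical, simplified stand-in for a real pileup).
    """
    sites = []
    for pos, bases in sorted(pileups.items()):
        counts = Counter(bases)
        depth = sum(counts.values())
        if depth == 0:
            continue
        supported = [b for b, n in counts.items() if n / depth >= min_fraction]
        if len(supported) >= 2:
            sites.append(pos)
    return sites


def interpret(positions, ploidy):
    """Label multi-base sites according to the ploidy of the sequenced genome."""
    if ploidy == 1:
        label = "likely repeat (paralogous) copy difference"
    else:
        label = "ambiguous: allelic or repeat copy difference"
    return {pos: label for pos in positions}


# Toy pileups: position 101 shows two well-supported bases,
# position 102 shows a single stray base that falls below the cutoff.
pileups = {100: "AAAAAAAA", 101: "CCCCTTTT", 102: "GGGGGGGA"}
sites = multi_base_sites(pileups)
print(interpret(sites, ploidy=1))
print(interpret(sites, ploidy=2))
```

The same observation (two bases at one position) gets a confident interpretation only in the haploid case, which is the motivation for using a single haplotype resource.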

Page 16: 150224 grc kms

Hydatidiform mole

Page 17: 150224 grc kms

SRGAP2: homology between genes

[Figure: nearly identical segments between SRGAP2A and the SRGAP2 paralogs, and homology between SRGAP2B and SRGAP2C. Tracks: SRGAP2A, SRGAP2B, SRGAP2C.]

Dennis et al. 2012

Page 18: 150224 grc kms

1q21 patch alignment to chromosome 1

[Figure: alignment of the 1q21 patch to chromosome 1; labeled regions: 1q32, 1q21, 1p21.]

Page 19: 150224 grc kms

Hydatidiform mole

Let’s sequence and assemble the whole genome!

Page 20: 150224 grc kms

CHM1_1.1 Assembly

• Reference-guided assembly – SRPRISM v2.3, R. Agarwala

• Alignment of Illumina reads to GRCh37 primary assembly

• CHORI-17 BAC clone tilepaths were then incorporated

• 428 total clones

• 324 clones in 45 tilepaths

• 104 clones as singletons

http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2

Total Sequence Length        3,037,866,619 bp
Total Assembly Gap Length      210,229,812 bp
Number of Scaffolds                    163
Scaffold N50                    50,362,920 bp

CHM1 assembly paper: Steinberg et al. 2014, Genome Research
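
For readers unfamiliar with the summary statistics above, here is a minimal sketch of how a scaffold N50 is computed, along with the ungapped length implied by the reported totals. The toy scaffold lengths are hypothetical; only the two totals come from the slide.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


# Hypothetical toy scaffold lengths, only to show the calculation.
print(n50([80, 70, 50, 40, 30, 20]))

# Totals reported on the slide for CHM1_1.1.
TOTAL_LENGTH = 3_037_866_619
GAP_LENGTH = 210_229_812

ungapped_length = TOTAL_LENGTH - GAP_LENGTH
gap_fraction = GAP_LENGTH / TOTAL_LENGTH
print(f"ungapped length: {ungapped_length:,} bp")
print(f"gap fraction: {gap_fraction:.2%}")
```

By the slide's own numbers, roughly 7% of the total assembly length is gap sequence.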

Page 21: 150224 grc kms

CHM1_1.1 assembly is highly contiguous compared to other WGS based assemblies

Page 22: 150224 grc kms

Integrating BAC tiling paths improved assembly

Page 23: 150224 grc kms

Integrating BAC tiling paths improved assembly

Page 24: 150224 grc kms

Alignment of CHM1 Illumina data to assembly revealed regions of extreme heterogeneity

                                 Heterozygous   Homozygous   Total
Variants                         64033          22513        86546
In RepeatMasked (RM) sequence    37060          14833        51893
In Segmental duplication (SD)    30670          4843         35513
In RM and SD                     51466          17174        68640
Ts:Tv                            1.5            0.7          1.2
Mean SNV density/kb              0.02           0.008        0.03

There are significantly more heterozygous variants in repetitive sequence than expected (p < 1x10^-16). BAC ends mapping discordantly and in multiple loci are significantly enriched for segmental duplications (p < 1x10^-5).
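
Two quantities in the table can be worked through directly. The sketch below recomputes the share of heterozygous and homozygous calls that fall in segmental duplications from the counts above, and shows how a Ts:Tv ratio is derived from a list of SNVs. The SNV list is a hypothetical toy input, and the helper names are assumptions for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv(snvs):
    """Ratio of transitions to transversions for (ref, alt) pairs.

    Raises ZeroDivisionError if the input has no transversions.
    """
    ts = sum(1 for r, a in snvs if is_transition(r, a))
    tv = sum(1 for r, a in snvs if r != a and not is_transition(r, a))
    return ts / tv


# Hypothetical toy SNVs: 3 transitions and 2 transversions.
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
print(ts_tv(toy_snvs))

# Counts taken from the table on this slide.
het_total, hom_total = 64033, 22513
het_in_sd, hom_in_sd = 30670, 4843

het_sd_fraction = het_in_sd / het_total
hom_sd_fraction = hom_in_sd / hom_total
print(f"heterozygous calls in SD: {het_sd_fraction:.1%}")
print(f"homozygous calls in SD:   {hom_sd_fraction:.1%}")
```

Because CHM1 is a single haplotype, heterozygous calls are not expected; the larger share of them falling in segmental duplications is consistent with reads from paralogous copies being aligned to one location. Ts:Tv values well below the roughly 2 usually seen for genome-wide human SNVs point the same way.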

Page 25: 150224 grc kms

Identified 549 novel protein coding genes not annotated in GRCh37

Page 26: 150224 grc kms

CHM1 BioNano Genome Map Aligned to GRCh38

[Figure: GRCh38 track aligned with the CHM1 BioNano map track; annotation: ~15 kb additional data.]

Page 27: 150224 grc kms

BioNano SV Calls Identified Assembly Problems

[Figure: CHM1_1.1 assembly compared with the CHM1 BioNano map. Labeled features: Collapse, Expansion in Assembly, Gap in Sequence.]

Page 28: 150224 grc kms

Conclusion

• Extremely diverse regions of the genome are difficult to characterize due to issues distinguishing allelic from paralogous duplications

• CHM1_1.1 is a highly contiguous single haplotype representation of the genome

• Identified regions of misassembly or reference-ized regions

• Utilize long read technology and nanopore technology to attempt to fix these regions

Page 29: 150224 grc kms
Page 30: 150224 grc kms
Page 31: 150224 grc kms

Need to add more diversity to reference

• Finish another hydatidiform mole to platinum status

• Finish 5 genomes to gold status

• NA19240 (Yoruban)

• NA12878 (European)

• HG00513 (Han Chinese)

• 2 “wildcards”

• Looking for underrepresented minority population

• Add high quality alternative sequences to reference to create a population reference graph or “pan genome”

Page 32: 150224 grc kms

Use colored de Bruijn graph structure to represent population reference graph
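
To make the data structure concrete, here is a minimal sketch of a colored de Bruijn graph: each edge joins two overlapping (k-1)-mers and carries the set of samples ("colors") in which the corresponding k-mer occurs. The sample names, sequences, and function names are hypothetical, and real tools use far more compact representations.

```python
from collections import defaultdict


def build_colored_dbg(samples, k):
    """Map each edge (prefix (k-1)-mer, suffix (k-1)-mer) to the set of colors seen.

    `samples` maps a color name to a list of sequences.
    """
    edges = defaultdict(set)
    for color, seqs in samples.items():
        for seq in seqs:
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                edges[(kmer[:-1], kmer[1:])].add(color)
    return edges


def shared_and_private(edges, all_colors):
    """Split edges into those present in every color and those unique to one."""
    everyone = set(all_colors)
    shared = {e for e, colors in edges.items() if colors == everyone}
    private = {e: next(iter(colors)) for e, colors in edges.items() if len(colors) == 1}
    return shared, private


# Hypothetical example: two haplotypes differing by a single base.
samples = {"ref": ["ACGTAC"], "alt": ["ACGGAC"]}
graph = build_colored_dbg(samples, k=3)
shared, private = shared_and_private(graph, samples.keys())
print(sorted(shared))
print(sorted(private.items()))
```

Edges shared by all colors represent sequence common to the population, while single-color edges trace the paths where haplotypes diverge; this is what lets one graph hold a reference plus its alternative sequences.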

Page 33: 150224 grc kms

Bioinformatic tool development in the future

• Alignment of short reads to population reference graph

• Variant calling

• Variant reporting/Haplotype resolution

Page 34: 150224 grc kms

Adapted from Weinstein et al, 2009

Page 35: 150224 grc kms

The GRCh37 reference sequence was assembled from three lymphoblastoid cell lines

Not a true haplotype

Incomplete

Page 36: 150224 grc kms

The CH17 haplotype is quite different from the reference

Page 37: 150224 grc kms

Novel insertion

The CH17 haplotype is quite different from the reference

Page 38: 150224 grc kms

Complex Indel

The CH17 haplotype is quite different from the reference

Page 39: 150224 grc kms

Hotspot/Recurrent Mutation

The CH17 haplotype is quite different from the reference

Page 40: 150224 grc kms

60 kbp Insertion (Hotspot)

African Asian European

Page 41: 150224 grc kms

Duplication (influenza)

The CH17 haplotype is quite different from the reference

Page 42: 150224 grc kms

44 kbp Duplication (influenza)

African Asian European

Page 43: 150224 grc kms

Summary of hydatidiform mole sequence

• 47 functional V genes

• 24 total variants (SNV and CNV) involving 29 IGHV genes

• 5 structural variants

• 19 single nucleotide variants

• 15 non-synonymous mutations

• 20 out of 24 variants represent differences in amino acid sequence or gene copy number

Page 44: 150224 grc kms

Summary of hydatidiform mole sequence

• 47 functional V genes

• 24 total variants (SNV and CNV) involving 29 IGHV genes

• 5 structural variants

• 19 single nucleotide variants

• 15 non-synonymous mutations

• 20 out of 24 variants represent differences in amino acid sequence or gene copy number

100 kbp of novel sequence

Page 45: 150224 grc kms

Current status of CHM1 resources

• CHORI-17 BAC Library (created from CHM1 cell line)

• CHORI-17 BAC end sequences (n=325,659)

• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)

• CHORI-17 BACs (>750 have been sequenced, with 592 of them in GenBank as phase 3)

• Active cell line

• >100X coverage Illumina 100 bp reads

• 300 bp, 500 bp, and 3 kb inserts

• Reference assisted assembly CHM1_1.1

• BioNano genome map

• >50X coverage of PacBio long read data

Page 46: 150224 grc kms

CHM1_1.1 Assembly

• Reference-guided assembly – SRPRISM v2.3, R. Agarwala

• Alignment of Illumina reads to GRCh37 primary assembly

• CHORI-17 BAC clone tilepaths were then incorporated

• 428 total clones

• 324 clones in 45 tilepaths

• 104 clones as singletons

• Comparison back to GRCh37 reference to provide appropriate gap sizes

• Assembly submitted to GenBank

• http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2

• Steinberg et al., 2014, Genome Research (Dec;24(12):2066-76)

Page 47: 150224 grc kms

[Figure: labeled loci]

LILR (leukocyte immunoglobulin-like receptor)/KIR (killer immunoglobulin receptor)

Immunoglobulin Kappa chain

Immunoglobulin Lambda chain

TCRA/B

17q21.31 inversion polymorphism

Immunoglobulin heavy chain locus

CYP2D6

SRGAP2

15q13.3 inversion polymorphism