aug2014 nist structural variant integration

ANALYSIS OF STRUCTURAL VARIANTS FROM NEXT GENERATION SEQUENCING
Hemang Parikh, Ph.D., NIST

Upload: genomeinabottle

Post on 24-Jun-2015

Page 1: Aug2014 nist structural variant integration

ANALYSIS OF STRUCTURAL VARIANTS FROM NEXT GENERATION SEQUENCING

Hemang Parikh, Ph.D.

NIST

Page 2: Aug2014 nist structural variant integration

Challenges for identifying true SVs

This Venn diagram, from Alkan et al. (2011), shows the numbers of unique and shared structural variants (SVs) found by the different sequencing-based discovery approaches used in the 1000 Genomes Project.

Because the approaches share only part of their calls, we decided to develop methods that look for evidence of SVs in mapped sequencing reads from multiple sequencing technologies.

Page 3: Aug2014 nist structural variant integration

Validation parameters for each SV

• Coverage (mean and standard deviation)
• Paired-end distance/insert size (mean and standard deviation)
• # of discordant paired-end reads
• Soft clipping of the reads (mean and standard deviation)
• Mapping quality (mean and standard deviation)
• # of heterozygous and homozygous SNP genotype calls
• % of GC content
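Two of the parameters above, coverage (mean and standard deviation) and % of GC content, can be illustrated with a minimal Python sketch. It is not the script used in this work: the function names are hypothetical, reads are simplified to (start, end) intervals rather than BAM records, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev


def coverage_stats(region_start, region_end, reads):
    """Return (mean, SD) of per-base depth over [region_start, region_end).

    `reads` is a list of (start, end) half-open alignment intervals; this is
    a simplification of what would be read from a BAM file.
    """
    depth = [0] * (region_end - region_start)
    for read_start, read_end in reads:
        # Clip each read to the region before counting its bases.
        lo = max(read_start, region_start)
        hi = min(read_end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return mean(depth), pstdev(depth)


def gc_percent(sequence):
    """Return % GC among called bases (A, C, G, T); N and other symbols are ignored."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called
```

For example, `coverage_stats(0, 4, [(0, 2), (1, 4)])` gives per-base depths of 1, 2, 1, 1, so the mean coverage is 1.25.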

Page 4: Aug2014 nist structural variant integration

Inputs:
• Reference sequence
• RepeatMasker data
• Aligned sequence data (BAM file)
• List of structural variants (BED file)

Processing: Perl script

Output: about 180 annotations per SV
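The flow on this slide (a list of SVs in, a set of annotations per SV out) can be sketched as follows. The original tool is a Perl script; this Python version is only an illustration, and `parse_bed`, `annotate`, and the `annotators` mapping are hypothetical names, not part of that script.

```python
def parse_bed(lines):
    """Yield (chrom, start, end) from BED text; BED is 0-based and half-open."""
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and browser/track header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def annotate(svs, annotators):
    """Apply each annotator function to each SV and collect one row per SV.

    `annotators` maps an annotation name to a function of (chrom, start, end).
    """
    rows = []
    for chrom, start, end in svs:
        row = {"chrom": chrom, "start": start, "end": end}
        for name, func in annotators.items():
            row[name] = func(chrom, start, end)
        rows.append(row)
    return rows
```

A usage example with a single toy annotator: `annotate(parse_bed(["chr1\t100\t600"]), {"size": lambda c, s, e: e - s})` returns one row whose `size` is 500. Real annotators would query the BAM file, the reference sequence, and the RepeatMasker data.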

Page 5: Aug2014 nist structural variant integration

NA12878 Data Sets—RM for GIAB

• Illumina (250 bp long sequences with 50X coverage)

• Illumina NIST (150 bp long sequences with 300X coverage)

• Illumina Platinum Genome (100 bp long sequences with 200X coverage)

• Illumina Moleculo

• Pacific Biosciences

Page 6: Aug2014 nist structural variant integration

Deletions Gold Sets for NA12878

• Personalis (n=2,306)
• The 1000 Genomes pilot (n=2,773)
• Complete Genomics (n=2,032)
• Conrad et al. (n=515)
• Kidd et al. (n=317)
• McCarroll et al. (n=128)
• The 1000 Genomes, aCGH array based (n=3,901)
• Roche NimbleGen 42 million, aCGH array based (n=719)
• Randomly generated (n=2,306)
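The slides do not say how the randomly generated set was built. One plausible approach, shown here purely as an assumption, is to draw one random region per gold-set SV with the same size, placed uniformly on a chromosome chosen in proportion to its length. The function name and the chromosome lengths in the example are hypothetical.

```python
import random


def random_regions(sv_sizes, chrom_lengths, seed=0):
    """Draw one random (chrom, start, end) region per SV size.

    Assumption: regions are size-matched to the SVs and placed uniformly;
    the sampling scheme actually used is not described in the slides.
    """
    rng = random.Random(seed)
    regions = []
    for size in sv_sizes:
        # Only chromosomes long enough to hold the region are eligible.
        eligible = [c for c, length in chrom_lengths.items() if length >= size]
        if not eligible:
            raise ValueError(f"no chromosome can hold a region of size {size}")
        weights = [chrom_lengths[c] for c in eligible]
        chrom = rng.choices(eligible, weights=weights)[0]
        start = rng.randrange(chrom_lengths[chrom] - size + 1)
        regions.append((chrom, start, start + size))
    return regions
```

Calling `random_regions([100, 5000], {"chrA": 10000, "chrB": 500})` returns two regions of 100 and 5,000 bases; the second can only fall on `chrA`. A production version would also need to avoid assembly gaps, which this sketch ignores.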

Page 7: Aug2014 nist structural variant integration

Personalis deletions call set (n=2,306)

[Figure: histogram of SV sizes in the Personalis call set; x-axis: Log10 (SV Size), 2 to 5; y-axis: Counts, 0 to 600]

• BAM-level evidence in the vicinity of each SV, in most of the 19 CEPH pedigree samples

• SV breakpoints were identified

• Some SVs were validated with PCR

Page 8: Aug2014 nist structural variant integration

Illumina NIST

[Figure: two histograms of Log10 (M coverage) against Counts, one panel for Personalis and one for Random genome]

Page 9: Aug2014 nist structural variant integration

Identifying likely SVs and likely non-SVs

[Figure: histogram of Log10 (M coverage) against Counts for the Random genome set]

Identify the 99th percentile value of an annotation parameter in the random genome set.

Compare this value with the same annotation parameter from the SV Gold Set.
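The thresholding step can be sketched in Python as below. This is an illustration under stated assumptions: the function names are hypothetical, the percentile uses the nearest-rank definition, and only the upper tail is tested, whereas a parameter such as coverage over a deletion may need a lower-tail threshold instead.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def flag_beyond_threshold(random_values, gold_values, pct=99):
    """Return the random-set threshold and, per gold-set value, whether it exceeds it."""
    threshold = percentile(random_values, pct)
    return threshold, [value > threshold for value in gold_values]
```

With random-set values 1 through 100, the 99th percentile is 99, so gold-set values of 50 and 99 are not flagged and a value of 150 is.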

Page 10: Aug2014 nist structural variant integration

Annotating with Illumina NIST and Illumina Moleculo

Personalis SV Gold Set for Illumina NIST annotation parameters

• L Insert size
• L Soft Clipped
• L # of discordant paired-end reads
• M Coverage
• M Coverage SD
• M Mapping quality
• M Insert size
• M Soft Clipped
• M # of discordant paired-end reads

Personalis SV Gold Set for Illumina Moleculo annotation parameters

• L Soft Clipped
• M Coverage
• M Coverage SD
• M Mapping quality
• M Soft Clipped

Page 11: Aug2014 nist structural variant integration

(A) Personalis (rows: Moleculo, 0 to 5; columns: Illumina NIST, 0 to 10)

      0     1     2     3     4     5     6     7     8     9    10
0    21    96   323   350   231   126    80    40    10     2     1
1     4    19    45    59    61    29    16     9     9     0     1
2     1    22   108   200   214   111    69    36     8     3     0
3     0     0     0     1     1     0     0     0     0     0     0
4     0     0     0     0     0     0     0     0     0     0     0
5     0     0     0     0     0     0     0     0     0     0     0

(B) Random genome (rows: Moleculo, 0 to 5; columns: Illumina NIST, 0 to 10)

      0     1     2     3     4     5     6     7     8     9    10
0  2059    94    18     6     2     3     1     0     0     0     0
1    62    15    12     5     1     3     2     0     0     1     0
2    13     3     5     0     0     0     0     1     0     0     0
3     0     0     0     0     0     0     0     0     0     0     0
4     0     0     0     0     0     0     0     0     0     0     0
5     0     0     0     0     0     0     0     0     0     0     0
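The slide does not define the row and column indices. One reading, offered here as an interpretation only, is that each index is the number of annotation parameters per SV that fall beyond the random-set threshold in that dataset. Under that assumption, a table of this shape could be built as in the Python sketch below; the function names are hypothetical.

```python
from collections import Counter


def count_flags(flags_per_sv):
    """For each SV, count how many of its annotation parameters are flagged."""
    return [sum(flags) for flags in flags_per_sv]


def cross_tabulate(row_counts, col_counts, n_rows, n_cols):
    """Build an n_rows x n_cols table of SV counts indexed by (row count, column count)."""
    pairs = Counter(zip(row_counts, col_counts))
    return [[pairs[(r, c)] for c in range(n_cols)] for r in range(n_rows)]
```

For three SVs with Moleculo flag counts of 0, 1, 2 and Illumina NIST flag counts of 1, 0, 3, `cross_tabulate([0, 1, 2], [1, 0, 3], 3, 4)` places one SV in each of the cells (0, 1), (1, 0), and (2, 3). The cell totals always sum to the number of SVs, which is consistent with both tables on this slide summing to 2,306.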

Page 12: Aug2014 nist structural variant integration

Conclusions

• Graphical visualization of the annotation parameters shows a clear distinction between true positive and false positive SVs

• A key advantage of the proposed method is its simplicity and flexibility to generate various annotation parameters from aligned sequence data based on different sequencing datasets from the same genome

• This allows integration of multiple sequencing datasets to identify high-confidence SV and non-SV calls that can be used as a benchmark to assess false positive and false negative rates

• We are currently testing classification methods based on the annotation parameters to generate both high-confidence SV calls and high-confidence non-SV calls for NA12878

Page 13: Aug2014 nist structural variant integration

Acknowledgements

NIST
Marc Salit
Justin Zook
Hariharan Iyer
Desu Chen
Sumona Sarkar
Jennifer McDaniel
Lindsay Vang
David Catoe
Nathanael Olson

Genome in a Bottle Consortium

Personalis Inc.
Mark Pratt
Gabor Bartha
Jason Harris

Illumina Inc.
Michael Eberle

Stanford University
Michael Snyder
Amin Zia
Somalee Datta
Cuiping Pan
Sean Michael Boyle
Rajini Haraksingh
Natalie Jaeger