

10/09/2015


__________________________________________________________________________________________________ Fall 2015 GCBA 815 __________________________________________________________________________________________________ Fall 2015 GCBA 815

Tools and Algorithms in Bioinformatics GCBA815, Fall 2015

Week 13: Next Generation Sequencing (NGS) Analysis

Adam Cornish

Graduate Student Guda lab

Department of Genetics, Cell Biology and Anatomy

University of Nebraska Medical Center


Vector NTI is an integrated suite of sequence analysis and design tools that help you manage, view, analyze, transform, share, and publicize diverse types of molecular biology data, in a graphically rich analysis environment.

Introduction

Eisenstein. Nature. 2015


Sources of NGS data

- Illumina
- Ion Torrent
- PacBio


Single Cell Sequencing


Applications of NGS

- Genome
  - Targeted sequencing panels (cancer, newborns, autism, etc.)
  - Whole exome sequencing
  - Whole genome sequencing
  - Copy number analysis
  - Reconstruction of extinct species’ genomes
- Transcriptome
  - Whole transcriptome (poly-A selection)
  - Small RNA analysis (siRNA, snoRNA, lincRNA, etc.)
  - Gene expression profiling for selected target genes
  - Rare cell identification
- Metagenome
  - Bulk sequencing of many types of bacteria
  - Examples: human gut microbiome, pollen composition, bacterial composition, viral studies
- Epigenome
  - Chromatin Immunoprecipitation Sequencing (ChIP-Seq)
  - Methylation Sequencing (Methyl-Seq)


Variant calling using NGS data


Important file types

The big three:

- Fastq
  - Raw sequencing data, usually directly from the sequencer
- SAM/BAM
  - Sequence data that has usually been aligned to a specific genome
- VCF
  - Tab-delimited text file that contains a list of possible variants:
    - SNV
    - Insertion and deletion (indel)
    - Duplication
    - Copy number variation
    - Inversion
    - Tandem duplication


Fastq

@SRR098401.11403008/1
GAGGCTATAGCATGGTCAAGGCACAAGAAGATCACTGGACTGCCCTCGCTCAGCCCTCAGCTACTG
+
>>?>?@>?>@@>?@@=@@@@@??>??@??@?@A?>@@@?>@@???A@:@A@@A@@@A@@AAB@@BB

Row 1: Information from the sequencer about the location of this read on the plate
Row 2: The sequence
Row 3: Metadata provided by the sequencing team
Row 4: Quality scores pertaining to each nucleotide in the sequence
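The four-row layout of a Fastq record can be sketched in a few lines of Python. This is a minimal illustration, not a production parser: the names FastqRecord and parse_fastq_record are hypothetical, and the sketch assumes each record occupies exactly four lines, as in the example from the slide.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    read_id: str    # Row 1, without the leading '@'
    sequence: str   # Row 2
    separator: str  # Row 3, without the leading '+' (often empty)
    quality: str    # Row 4, one character per nucleotide


def parse_fastq_record(lines):
    """Turn the four rows of one Fastq record into a FastqRecord.

    Hypothetical helper for illustration; assumes exactly four lines.
    """
    if len(lines) != 4:
        raise ValueError("a Fastq record has exactly four rows")
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@"):
        raise ValueError("row 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("row 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("rows 2 and 4 must have the same length")
    return FastqRecord(header[1:], sequence, separator[1:], quality)


# The record shown on the slide.
record = parse_fastq_record([
    "@SRR098401.11403008/1",
    "GAGGCTATAGCATGGTCAAGGCACAAGAAGATCACTGGACTGCCCTCGCTCAGCCCTCAGCTACTG",
    "+",
    ">>?>?@>?>@@>?@@=@@@@@??>??@??@?@A?>@@@?>@@???A@:@A@@A@@@A@@AAB@@BB",
])
print(record.read_id, len(record.sequence))
```

The length check on rows 2 and 4 reflects the point made above: every nucleotide has exactly one quality character.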


Fastq continued

Quality scores are Phred-scaled:

Seq:   TCAGCCCTCAGCTACTGCTCT
Score: A@@A@@@A@@AAB@@BBABAB

Phred-33 is the most common encoding, and is based on ASCII values.

The quality score of a base call is the ASCII value of its character minus 33.

Example: the ASCII value for ‘A’ is 65, and 65 - 33 = 32. That means the base call corresponding to this score has a roughly 1 in 1,600 chance of being wrong.

Phred quality score    Probability that the base is called wrong    Accuracy of the base call
20                     1 in 100                                     99%
30                     1 in 1,000                                   99.9%
40                     1 in 10,000                                  99.99%
50                     1 in 100,000                                 99.999%
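The Phred-33 arithmetic above can be checked with a short Python sketch. The function names are hypothetical; the conversion from a quality score Q to an error probability uses the standard Phred definition P = 10^(-Q/10), which is what the table rows (20, 30, 40, 50) follow.

```python
def phred33_to_quality(char):
    """Quality score of one base call: ASCII value of the character minus 33."""
    return ord(char) - 33


def error_probability(quality):
    """Probability that the base call is wrong, using P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


# Score string from the slide, paired with its sequence.
seq = "TCAGCCCTCAGCTACTGCTCT"
scores = "A@@A@@@A@@AAB@@BBABAB"

for base, char in zip(seq, scores):
    q = phred33_to_quality(char)
    p = error_probability(q)
    print(f"{base} {char} Q={q} P(wrong)={p:.2e} (about 1 in {round(1 / p):,})")
```

For the character ‘A’ this gives Q = 32, which sits between the Q30 and Q40 rows of the table.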


Sequence Alignment / Map (SAM / BAM)

Similar to the Fastq file in that it contains the raw sequence and its quality scores.

It also tells you where the sequence aligned to the genome, and how well (this score is also Phred-scaled).

In this case, the read aligned to chromosome 22, position 17445857, and has a mapping quality score of 60 (or a 1 in 1,000,000 chance of being placed incorrectly).

SRR098401.104031357 83 chr22 17445857 60 76M = 17445512 -421 ACTGTTACCAGATCAAGAACTGATAGGGACAGGGATCATTATTCCCCCTTTACAGATGAGAAGGCCGTCACGCCTC @@>>B@@@BBAAAB9A@@>:@@?=A@?@?@A???>?@??=???@@@@@>@>>@@@><??@>@>@@8?>?=:@>?>> BD:Z:NOJKPQQQQMONOMKKKLNOMNLLLJLMINLJLMLMLKKKKJLJJJMKCKLINJMMLJKKKMOOMNNOLPQSNMKK PG:Z:MarkDuplicates RG:Z:NA12878 BI:Z:OOMLRRPPRPPQQONOLOPOONOOOKLNMONJKMNONMMMMLMKKKMLGMNLNMMNNJMJLNOMLNMPNONONNMM NM:i:0 MQ:i:60 AS:i:76 XS:i:0
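As a rough illustration of how the chromosome, position, and mapping quality are read from a SAM record, the sketch below splits one alignment line into the eleven mandatory SAM columns. The helper names are hypothetical. Real SAM files are tab-delimited, so splitting on tabs is the safer choice there; this sketch splits on any whitespace only because the record on the slide is shown with spaces. Only a few of the optional tags from the slide are included, to keep the example short.

```python
# Names of the 11 mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Split one alignment line into a dict (hypothetical helper)."""
    columns = line.split()  # assumption: whitespace-separated, as on the slide
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 columns")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record


def mapping_error_probability(mapq):
    """Phred-scaled mapping quality to probability of wrong placement."""
    return 10 ** (-mapq / 10)


line = (
    "SRR098401.104031357 83 chr22 17445857 60 76M = 17445512 -421 "
    "ACTGTTACCAGATCAAGAACTGATAGGGACAGGGATCATTATTCCCCCTTTACAGATGAGAAGGCCGTCACGCCTC "
    "@@>>B@@@BBAAAB9A@@>:@@?=A@?@?@A???>?@??=???@@@@@>@>>@@@><??@>@>@@8?>?=:@>?>> "
    "NM:i:0 MQ:i:60 AS:i:76 XS:i:0"
)
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["MAPQ"],
      mapping_error_probability(record["MAPQ"]))
```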


ExAC Browser

Variant Call Format (VCF)