NGS - QC & Data Format
DESCRIPTION
The quality of data is very important for various downstream analyses, such as sequence assembly and single nucleotide polymorphism identification. This presentation shows the parameters used for NGS data quality checks and the data formats of the leading sequencing machines.
TRANSCRIPT
![Page 1: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/1.jpg)
NGS
Data Formats & QC Analysis
Karan Veer Singh
Scientist, NBAGR
04/11/231
![Page 2: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/2.jpg)
Sequence Formats
All sequence formats are ASCII text containing the sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence
Formats are designed to hold sequence data and other information about the sequence
![Page 3: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/3.jpg)
Why so many formats?
Created based on the information required for each step of analysis
Efficient Data & time management
Data formats vary in the information they contain
Types of sequence file formats
• Raw sequence files
• Co-ordinate files
• Parameter files
• Annotation files
• Metadata files
![Page 4: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/4.jpg)
Read output formats
454
Solexa/Illumina
SOLiD
![Page 5: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/5.jpg)
454 output formats
.sff (Standard Flowgram Format)
.fna
.qual
![Page 6: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/6.jpg)
Illumina output formats
.seq.txt
.prb.txt
Illumina FASTQ (ASCII value - 64 = Illumina score)
Qseq (ASCII value - 64 = Phred score)
SCARF: Illumina single-line format (Solexa Compact ASCII Read Format)
Phred quality scores
![Page 7: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/7.jpg)
Illumina FASTQ
ASCII value for h = 104
Quality of base A at position 1 = 104 - 64 = 40, where 40 is the Phred score
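The offset arithmetic can be sketched in Python. The function name `phred64_to_scores` and the sample quality string are illustrative assumptions, not part of the slides; the offset of 64 is the one the slides give for Illumina FASTQ.

```python
def phred64_to_scores(quality_string, offset=64):
    """Convert an ASCII-encoded quality string to a list of integer scores."""
    # ord() gives the ASCII value; subtracting the offset gives the score.
    return [ord(ch) - offset for ch in quality_string]

# Hypothetical quality string for a 4-base read.
print(phred64_to_scores("gfed"))  # [39, 38, 37, 36]
```

Files that use the Sanger convention would need `offset=33` instead.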
![Page 8: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/8.jpg)
SOLiD output format(s)
CSFASTA
Color-space sequence reads in a FASTA format
These reads can be retained and analyzed in color space by software
The Format Conversion Tool offers options for cleaning the CSFASTA files
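A minimal sketch of how a color-space read maps back to bases, assuming the usual SOLiD two-base encoding (A=0, C=1, G=2, T=3, each color being the XOR of two adjacent base codes). The encoding table is not stated on the slide, and `decode_color_space` is a hypothetical helper; missing color calls (`.`) are not handled.

```python
BASES = "ACGT"

def decode_color_space(cs_read):
    """Translate a color-space read (primer base + color digits) to bases."""
    current = BASES.index(cs_read[0])   # known primer base, e.g. 'T'
    decoded = []
    for color in cs_read[1:]:
        current ^= int(color)           # each color encodes a base transition
        decoded.append(BASES[current])
    return "".join(decoded)

# Hypothetical read: primer base T followed by colors 0, 1, 2, 3.
print(decode_color_space("T0123"))  # TGAT
```

Because every base depends on the one before it, a single wrong color changes all downstream bases, which is one reason to keep the analysis in color space.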
![Page 9: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/9.jpg)
Read Length
• Sanger read lengths ~800-2000 bp
• Generally we define short reads as anything below 200 bp
−Illumina (100-250 bp)
−SOLiD (75 bp max)
−Ion Torrent (200-300 bp max, currently)
−Roche 454 (400-800 bp)
• Even with these platforms it is cheaper to produce short reads (e.g. 50 bp) rather than 100 or 200 bp reads
• Diminishing returns: for some applications 50 bp is more than sufficient
−Resequencing of smaller organisms
−Bacterial de novo assembly
−ChIP-Seq
−Digital Gene Expression profiling
−Bacterial RNA-seq
![Page 10: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/10.jpg)
Alignment/Assembly Format
Common (“standard”) formats for read alignments:
SAM
BAM (= binary SAM)
MAQ
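As an illustration of the SAM layout, each alignment line carries 11 mandatory tab-separated fields. The sketch below splits one line into those fields; `parse_sam_line` and the example record are hypothetical, and optional tags after column 11 are ignored.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    """Return the 11 mandatory SAM fields of one alignment line as a dict."""
    values = line.rstrip("\n").split("\t")[:11]
    record = dict(zip(SAM_FIELDS, values))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])   # numeric columns
    return record

# Hypothetical alignment line, for illustration only.
example = "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["CIGAR"])  # chr1 100 8M
```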
![Page 11: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/11.jpg)
Sequencers & Sequence Assembly Packages
![Page 12: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/12.jpg)
Formats for Genome/Gene annotation
BED format (genome-browser tracks)
GFF format (gene/genome features)
BioXSD (XML) (any annotation; under development)
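One practical difference between BED and GFF is the coordinate convention: BED intervals are 0-based with an exclusive end, while GFF intervals are 1-based with an inclusive end. A rough sketch of the conversion follows; `bed_to_gff` and the default `source` and `feature_type` values are placeholders chosen for the example.

```python
def bed_to_gff(bed_line, source="example", feature_type="region"):
    """Convert a minimal 3-column BED line to a 9-column GFF line."""
    chrom, start, end = bed_line.split("\t")[:3]
    gff_start = int(start) + 1   # BED start is 0-based; GFF start is 1-based
    gff_end = int(end)           # BED end is exclusive; GFF end is inclusive
    fields = [chrom, source, feature_type, str(gff_start), str(gff_end),
              ".", ".", ".", "."]
    return "\t".join(fields)

# Hypothetical BED interval covering bases 100 to 200 of chr1.
print(bed_to_gff("chr1\t99\t200"))
```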
![Page 13: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/13.jpg)
If reads should be deposited in a public repository:
SRA (Sequence Read Archive, formerly Short Read Archive) at NCBI
![Page 14: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/14.jpg)
Points to remember on Data Formats
For base-call data: “standard” FASTQ (Sanger, Phred)
For read alignments: SAM/BAM/MAQ format
For annotation results: e.g. GFF or BED format
![Page 15: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/15.jpg)
QC analysis
![Page 16: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/16.jpg)
All platforms have errors
Illumina, SOLiD/ABI-Life, Roche 454, Ion Torrent
1. Removal of low quality bases / low complexity regions
2. Removal of adaptor sequences
3. Homopolymer-associated base call errors (3 or more identical DNA bases) cause a higher number of (artificial) frameshifts
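Homopolymer runs of 3 or more identical bases can be located with a regular expression. This is only a sketch for spotting candidate regions; `find_homopolymers` is a hypothetical helper and the sequence is made up.

```python
import re

# A base followed by at least two more copies of itself (run length >= 3).
HOMOPOLYMER = re.compile(r"([ACGT])\1{2,}")

def find_homopolymers(sequence):
    """Return (start, base, length) for each run of 3 or more identical bases."""
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in HOMOPOLYMER.finditer(sequence.upper())]

print(find_homopolymers("ACGGGGTTACCCA"))  # [(2, 'G', 4), (9, 'C', 3)]
```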
![Page 17: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/17.jpg)
Illumina artefacts
Under-represented GC-rich regions (PCR, sequencing)
GGC/GCC motif is associated with low quality and mismatches
Low quality reads: Phred score < 20
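A Phred score Q corresponds to an error probability P = 10^(-Q/10), so a score of 20 means roughly 1 wrong call in 100. The sketch below shows that relation and one possible low-quality test; using the mean score of a read as the criterion is an assumption for illustration, not something the slide specifies.

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def is_low_quality(scores, threshold=20):
    """Flag a read whose mean Phred score falls below the threshold."""
    return sum(scores) / len(scores) < threshold

print(phred_to_error_probability(20))    # about 0.01
print(is_low_quality([30, 12, 10, 8]))   # True (mean is 15)
```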
![Page 18: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/18.jpg)
Need for QC & Preprocessing
QC analysis of sequence data is extremely important for meaningful downstream analysis
To analyze problems in quality scores/ statistics of sequencing data
To check whether further analysis with sequence is possible
To remove redundancy (filtering)
To remove low quality reads from analysis
To remove adapter contamination
Highly efficient and fast processing tools are required to handle large volumes of data
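The three removal steps above (adapter contamination, low quality reads, redundancy) can be sketched as a small pipeline. Everything here is a simplification under stated assumptions: exact adapter matching, a mean-quality cutoff of 20, Phred+64 encoding, and exact-duplicate removal; the function names and the adapter string are hypothetical.

```python
def trim_adapter(sequence, quality, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    position = sequence.find(adapter)
    if position == -1:
        return sequence, quality
    return sequence[:position], quality[:position]

def passes_quality(quality, min_mean=20, offset=64):
    """Keep a read only if its mean Phred score reaches min_mean."""
    if not quality:
        return False
    scores = [ord(ch) - offset for ch in quality]
    return sum(scores) / len(scores) >= min_mean

def preprocess(reads, adapter):
    """Yield (sequence, quality) pairs after trimming and filtering."""
    seen = set()
    for sequence, quality in reads:
        sequence, quality = trim_adapter(sequence, quality, adapter)
        if sequence in seen or not passes_quality(quality):
            continue  # drop exact duplicates and low quality reads
        seen.add(sequence)
        yield sequence, quality

# Hypothetical reads: one with adapter, one duplicate, one low quality.
reads = [("ACGTAGATCGGAAG", "h" * 14), ("ACGT", "hhhh"), ("TTTT", "BBBB")]
print(list(preprocess(reads, adapter="AGATCGGAAG")))  # [('ACGT', 'hhhh')]
```

Dedicated tools also handle partial adapter matches and mismatches, which this sketch does not attempt.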
![Page 19: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/19.jpg)
Need for QC & Preprocessing
The quality of data is very important for various downstream analyses, such as sequence assembly and single nucleotide polymorphism identification
Most of the programs available for downstream analyses do not provide utilities for quality checking and filtering of NGS data before processing
![Page 20: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/20.jpg)
NGS QC Toolkit & FastQC
NGS QC Toolkit is for quality checking and for filtering to retain high-quality reads
This toolkit is a standalone and open source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html
The applications are implemented in the Perl programming language
QC of sequencing data generated using Roche 454 and Illumina platforms
Additional tools to aid QC (sequence format converter and trimming tools) and analysis (statistics tools)
FastQC can be used only for preliminary analysis
![Page 21: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/21.jpg)
![Page 22: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/22.jpg)
![Page 23: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/23.jpg)
NGSQC toolkit Output
![Page 24: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/24.jpg)
NGSQC toolkit Output
![Page 25: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/25.jpg)
Comparison - QC tools
![Page 26: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/26.jpg)
FastQC
• Basic statistics
• Quality - Per base position
• Per sequence quality distribution
• Nucleotide content per position
• Per sequence GC distribution
• Per base GC distribution
• Per base N content
• Length distribution
• Overrepresented/duplicated sequences
• K-mer content
![Page 27: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/27.jpg)
FastQC (Box-Whisker plot)
Y axis: Quality score
X axis: Base position
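The data behind such a plot is a summary of quality scores at each base position across all reads. A simplified sketch follows, reporting only minimum, median and maximum rather than the full box-whisker statistics; it assumes equal-length reads and Phred+64 encoding, and `per_position_quality` is a hypothetical name, not FastQC's own code.

```python
import statistics

def per_position_quality(quality_strings, offset=64):
    """Return (min, median, max) Phred scores for each base position."""
    summary = []
    for column in zip(*quality_strings):   # one column per base position
        scores = [ord(ch) - offset for ch in column]
        summary.append((min(scores), statistics.median(scores), max(scores)))
    return summary

# Three hypothetical 2-base reads.
print(per_position_quality(["hh", "dh", "bh"]))  # [(34, 36, 40), (40, 40, 40)]
```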
![Page 28: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/28.jpg)
2. Quality - Per base position
![Page 29: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/29.jpg)
2. Quality - Per base position
![Page 30: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/30.jpg)
3. Per Sequence Quality Distribution
![Page 31: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/31.jpg)
3. Per Sequence Quality Distribution
![Page 32: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/32.jpg)
4. Nucleotide content per position
![Page 33: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/33.jpg)
4. Nucleotide content per position
![Page 34: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/34.jpg)
5. Per sequence GC distribution
![Page 35: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/35.jpg)
5. Per sequence GC distribution
![Page 36: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/36.jpg)
6. Per base GC distribution
![Page 37: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/37.jpg)
6. Per base GC distribution
![Page 38: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/38.jpg)
7. Per base N content
![Page 39: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/39.jpg)
7. Length Distribution
![Page 40: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/40.jpg)
8. K-mer content
Any k-mer showing more than a 3 fold overall enrichment or a 5 fold enrichment at any given base position will be reported by this module.
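The overall enrichment idea can be sketched as an observed/expected ratio per k-mer. This is a simplified stand-in, not FastQC's implementation: the expected count is taken from overall base frequencies, only the 3-fold overall rule is checked (the per-position 5-fold rule is omitted), and the function names are hypothetical.

```python
from collections import Counter

def kmer_enrichment(reads, k=5):
    """Observed/expected ratio per k-mer, expected from base frequencies."""
    base_counts = Counter("".join(reads))
    total_bases = sum(base_counts.values())
    kmer_counts = Counter(read[i:i + k]
                          for read in reads
                          for i in range(len(read) - k + 1))
    total_kmers = sum(kmer_counts.values())
    ratios = {}
    for kmer, observed in kmer_counts.items():
        probability = 1.0
        for base in kmer:
            probability *= base_counts[base] / total_bases
        ratios[kmer] = observed / (probability * total_kmers)
    return ratios

def enriched_kmers(reads, k=5, fold=3):
    """Return the k-mers whose overall enrichment exceeds `fold`."""
    return {kmer: ratio
            for kmer, ratio in kmer_enrichment(reads, k).items()
            if ratio > fold}

# Hypothetical reads; with real data k and fold would stay at their defaults.
print(enriched_kmers(["ACAC", "ACAC"], k=2, fold=2))
```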
![Page 41: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/41.jpg)
9. Overrepresented/ duplicate sequences
The analysis of overrepresented sequences will spot an increase in any exactly duplicated sequences
Too many duplicated sequences may indicate sequencing problems
This module will issue a warning if any sequence is found to represent more than 0.1% of the total.
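The 0.1% rule can be illustrated by counting exact duplicates and comparing each count with the total number of reads. This is a sketch of the idea only; `overrepresented` is a hypothetical helper, and it scans every read it is given.

```python
from collections import Counter

def overrepresented(sequences, threshold=0.001):
    """Return sequences making up more than `threshold` of all reads."""
    counts = Counter(sequences)
    total = len(sequences)
    return {seq: n / total
            for seq, n in counts.items()
            if n / total > threshold}

# Hypothetical data: one sequence seen 3 times among 2000 reads (0.15%).
reads = ["ACGT"] * 3 + [f"read{i}" for i in range(1997)]
print(overrepresented(reads))  # {'ACGT': 0.0015}
```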
![Page 42: NGS - QC & Dataformat](https://reader036.vdocuments.net/reader036/viewer/2022062418/554c193bb4c905f1518b502b/html5/thumbnails/42.jpg)
QC Report
Sequence Statistics
Total No. of Sequences: 6970943
Avg. Sequence Length: 54
Max Sequence Length: 54
Min Sequence Length: 54
Total Sequence Length: 376430922
Total N bases: 14254521
% N bases: 3.78676
No. of Sequences with Ns: 278635
% Sequences with Ns: 3.99709
Quality Statistics
Total HQ bases: 334195496
% HQ bases: 88.78
Total HQ reads: 6350256
% HQ reads: 91.0961
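The percentages in this report follow directly from the raw counts, which gives a quick consistency check. The sketch below recomputes them from the figures shown; the variable names and the `percent` helper are illustrative.

```python
# Raw counts copied from the QC report.
total_sequences = 6970943
total_length = 376430922
n_bases = 14254521
sequences_with_n = 278635
hq_bases = 334195496
hq_reads = 6350256

def percent(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100 * part / whole

print("Avg. sequence length:", total_length / total_sequences)
print("% N bases:", round(percent(n_bases, total_length), 5))
print("% sequences with Ns:", round(percent(sequences_with_n, total_sequences), 5))
print("% HQ bases:", round(percent(hq_bases, total_length), 2))
print("% HQ reads:", round(percent(hq_reads, total_sequences), 4))
```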
Alignment statistics