
Exome Sequencing Report

Customer Name Demo

Institute University of ABC

Order Number 6667

Date Prepared 2017-06-02

This report covers confidential materials of LC Sciences. Please make sure

that the contents of this report are for your personal use only and that you are

responsible for confidentiality. If the contents of this report are disclosed to any

third party or company, according to the relevant laws and regulations, LC Sciences

will be entitled to legal action.

Prepared by LC Sciences |www.lcsciences.com| [email protected]

2575 W Bellort Ave, Houston, TX 77054 | Tel. 888-528-8818, Fax. 713-664-8181

Table of Contents

1 Exome Sequencing Introduction
2 Exome Sequencing Report
2.1 Disease Information
2.2 Database Information
2.3 Data Analysis Program
3 Technical Methods and Processes
3.1 Experimental Processes
3.2 Analysis Process
4 Sequencing Data Overview
4.1 Sample Collection and Grouping Information
4.2 Sequencing Data Filtering
4.3 Sequencing Data Quality Control
4.4 Sequencing Depth/Coverage Distribution
4.5 Coverage Results
5 Variant Calling Results and Analysis
5.1 SNP Results
5.2 SnpEff Annotation of SNP
5.3 INDEL Results
5.4 SnpEff Annotation of INDEL
6 Annotation and Filtering of Variants
6.1 dbSNP Annotation and Filtering
6.2 1000Genome Annotation and Filtering
6.3 Coding Region Annotation
6.4 Protein Function Annotation
7 Quality Control
8 Appendix
8.1 Materials and Methods
8.2 Information Analysis
9 References
10 Contact Us




1 Exome Sequencing Introduction

The exon is the part of a eukaryotic gene that is retained after splicing and can be translated into peptide sequence. The exome is the sum of all exon regions in the genome; it contains the information needed for translation and covers most of the functional variants associated with individual phenotypes. The human genome contains approximately 180,000 exons with a total length of about 30 Mb. The human exome accounts for about 1% of the genome, but is responsible for about 85% of human pathogenic mutations.

Exome sequencing refers to the use of specially designed probes to enrich the protein-coding regions or other specific regions of interest, followed by high-throughput sequencing of the enriched DNA. This greatly improves the efficiency of exome studies and significantly reduces the cost of research. The technology can be used to identify and study the genes underlying Mendelian diseases and complex diseases such as cancer, diabetes and obesity, which enables researchers to better explain the pathogenesis of these diseases.

The technical advantages of exome sequencing:

Cost-effective: Information on coding regions across the whole genome can be obtained more economically and efficiently than with whole-genome sequencing. The exon regions are sequenced to a greater depth, so the results are more accurate.

High detection accuracy: Individual base variations can be identified across the genome-wide set of exons.

Applicable to large sample sizes: Because of its lower cost per sample, exome sequencing is well suited to studies with a large number of samples.




2 Exome Sequencing Report

Species Name: Human (Homo sapiens)

2.1 Disease Information

Disease Name: Genetic Disease

Disease Type: Dominant/recessive disorders on autosomal or sex chromosomes

Family Map: Demo




2.2 Database Information

Database               Version   Source
Genome Database        hg19      ftp://ftp.ensembl.org/pub/release-73/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz
dbSNP Database         144b      http://www.ncbi.nlm.nih.gov/SNP/
1000Genome Database    V73       http://www.1000genomes.org/
Clinvar Database       144       ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b144_GRCh37p13/VCF/clinical_vcf_set

2.3 Data Analysis Program

Analysis                      Tool       Version   Description
Data Quality Control          FastQC     0.10.1
                              BWA        0.7.10    Reference Genome Comparison
                              SAMtools   0.1.19    View and Sort Alignment Results
                              Picard     1.119     Merge Sample Bam Results
SNP/INDEL Detection           GATK       3.3.0     Detection and Filtration of SNP and INDEL
SNP/INDEL Gene Annotation     SnpEff     V4.1      Detected Variation Explanation




3 Technical Methods and Processes

3.1 Experimental Processes

The liquid chip capture system (Agilent, CA, USA) is used to efficiently enrich the human exon region. High-throughput sequencing is performed on the HiSeq 2500/4000 platform. Library construction and capture experiments are carried out using the SureSelect Human All Exon V6 kit (Agilent, CA, USA).

Sample DNA quality assessment requires at least 1.5 ug of DNA (measured by Qubit). The agarose gel electrophoresis result should show no degradation and no RNA contamination. Additionally, the OD260/280 ratio measured by NanoDrop should range from 1.8 to 2.0. For samples with less than 1 ug of DNA, a substitute protocol may be suggested for optimizing the sequencing library.

Genomic DNA is randomly fragmented into pieces of 150-300 bp. Following end repair and A-tailing (addition of an adenine to the 3’ ends), the fragments are ligated to sequencing adaptors that include specific indices. The library is then hybridized with up to 738,690 biotin-labeled probes so that the exon region (including the upstream and downstream regions), 58 Mb in total, can be captured using streptavidin beads. After PCR amplification and quality assurance, the library is loaded on the flowcell for sequencing (Figure). Paired-end reads (2 x 150) are obtained for downstream data analysis. A more detailed description of library preparation is available in the supplemental materials.

Figure. Library Construction Experimental Workflow




3.2 Analysis Process

Adaptor, polyN, polyA and other low-complexity sequences are removed from the raw reads. The remaining valid reads are mapped to the reference genome using BWA (Li H et al. and Kent WJ et al.). The mapping result is saved in the BAM format and then sorted using SAMtools (Li H et al.). Duplicate reads arising from PCR amplification are marked using Picard. Reads marked as duplicates are not used for subsequent processing, as they may lead to false positives in mutation detection.

After marking duplicate reads, it is necessary to re-align the reads close to regions reported by BWA as insertions/deletions (INDELs) based on the Compact Idiosyncratic Gapped Alignment Report (CIGAR) value. Mismatches reported by BWA close to an INDEL region may not be accurate because of its alignment algorithm, and they may cause false positive variant calls. Therefore, correction at these sites is required for subsequent SNP and INDEL analysis. The IndelRealigner module in GATK is used to carry out INDEL re-alignment in an effort to minimize the error rate of mismatches near each INDEL site.

Variant calling relies heavily on the quality score of each base reported by the sequencer. For example, the BWA aligner reports a mismatch when the base quality is above Q25. A base quality of Q25 corresponds to a sequencing error probability of about 0.3% (Q20 corresponds to 1%), and errors at this rate can still heavily affect the reliability of downstream analysis. Additionally, the sequencing quality at the 3’ end is usually lower than the quality at the 5’ end due to reagent depletion, and the quality of A/C is often lower than that of T/G. Therefore, it is necessary to recalibrate the base quality using the BaseRecalibrator module in GATK so that the quality scores of the sequence are more reliable.

Note: The reads in one sample are supposed to come from the same lane for base recalibration.

Otherwise, reads from different lanes need to be recalibrated separately.

After the steps mentioned above, variant calling is performed with the UnifiedGenotyper or the HaplotypeCaller module in GATK. The UnifiedGenotyper module does not consider the impact of adjacent bases when making a call, whereas the HaplotypeCaller module makes the call based on local de novo assembly. The HaplotypeCaller module first builds a De Bruijn graph and then applies a PairHMM model to predict haplotypes and make the variant call.

Of course, in dealing with such typical large-scale Bayesian inference problems for a large quantity of samples, a high-sensitivity model may produce more false positive results. Therefore, it is important to perform further corrections on the variant calling results (variant recalibration).

In general, real mutation sites are clustered together by the variant calling model, and these clusters should fit a Gaussian distribution. Therefore, the VariantRecalibrator module in GATK uses a Gaussian mixture model to remove false positive calls and find the true mutation sites.

It is well known that mutations in the coding region may be critical and cause diseases. Therefore, it is important to annotate the biological function of each mutation site. We used the SnpEff program (officially recommended by GATK) to examine structural changes at the mutation site and further narrow down the candidate regions leading to the disease. The overall flowchart of data analysis is as follows:

Flowchart of data analysis

Note:

The Genome Analysis Toolkit (GATK) is developed by the Broad Institute for second-generation sequencing data analysis. It contains a variety of tools mainly for variant calling, genotyping, etc. Data quality assurance is highly emphasized to reduce false positive results. The program has a powerful architecture, a powerful processing engine and high-performance computing capabilities that make it suitable for projects of any size. At present, GATK together with the mapping software BWA has become the most mainstream analysis pipeline for whole genome and exome sequencing.




4 Sequencing Data Overview

4.1 Sample Collection and Grouping Information

Group Sample

sampleA sampleA

sampleB sampleB

A complete table can be found in summary/1 RawData/sample info mendelian.xlsx

4.2 Sequencing Data Filtering

Paired-end raw reads were obtained from the high-throughput sequencer and may contain sequencing adaptors and low-quality reads. To ensure accurate analysis results, it is necessary to preprocess the raw reads and obtain valid data for subsequent analysis.

The preprocessing steps are as follows:

(1) Adapter removal

(2) Removal of reads containing N for more than 5% of the bases

(3) Removal of low quality reads with the quality score less than 10 for more than 20% of the bases

(4) Removal of reads after a comprehensive evaluation with Q20, Q30, and the GC content
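Criteria (2) and (3) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual filter: the function names are hypothetical, quality strings are assumed to use the Phred+33 encoding, and adapter removal (1) and the overall evaluation (4) are left out.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string (assumed Phred+33) to integer scores."""
    return [ord(char) - 33 for char in quality_string]


def keep_read(sequence, qualities,
              max_n_fraction=0.05,
              min_quality=10,
              max_low_quality_fraction=0.20):
    """Return True if a read passes criteria (2) and (3) of the list above.

    (2) reject reads in which more than 5% of the bases are N
    (3) reject reads in which more than 20% of the bases have quality < 10
    """
    if not sequence or len(sequence) != len(qualities):
        return False
    n_fraction = sequence.upper().count("N") / len(sequence)
    if n_fraction > max_n_fraction:
        return False
    low_quality = sum(1 for q in qualities if q < min_quality)
    if low_quality / len(qualities) > max_low_quality_fraction:
        return False
    return True
```

For paired-end data a pair would typically be kept only if both mates pass; that pairing rule is an assumption, since the report does not state it.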




4.3 Sequencing Data Quality Control

For paired-end sequencing (PE150), the percentage of bases with a quality score greater than 20 should be more than 90%, and the percentage of bases with a quality score greater than 30 should be more than 85%.

Table. Summary of sequencing quality control

Sample    Raw Data/Read   Raw Data/Base   Valid Data/Read   Valid Data/Base   Raw Depth (x)   Valid%   Q20%    Q30%    GC%
sampleA   100053676       15.01G          98910722          14.84G            258.79          98.86    98.51   96.28   47.51
sampleB   98312530        14.75G          96767964          14.52G            254.31          98.43    97.24   93.08   48.39

Terminology:

Term Annotations

Sample Sequencing Library Name

Raw Data/Read Number of reads obtained from the sequencer

Raw Data/Base Number of bases in billions (Giga) obtained from the sequencer

Valid Data/Read Number of reads after preprocessing

Valid Data/Base Number of bases in billions (Giga) after preprocessing

Raw Depth Raw number of bases divided by the capture size of the Agilent kit (58 Mb)

Valid Ratio% Percentage of the processed reads (Valid) to the raw reads (Raw)

Q20% Percentage of bases with the quality score greater than 20

Q30% Percentage of bases with the quality score greater than 30

GC count% Percentage of the GC content in raw reads

A complete table can be found in summary/1 RawData/ReadsQC.xlsx
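The derived columns of the table can be reproduced from the definitions in the terminology list. The short Python sketch below is illustrative only; the function and constant names are hypothetical, and the 58 Mb capture size is the value quoted in the Raw Depth definition.

```python
CAPTURE_SIZE_BASES = 58_000_000  # Agilent capture size quoted in the terminology list


def raw_depth(raw_bases, capture_size=CAPTURE_SIZE_BASES):
    """Raw Depth: raw number of bases divided by the capture size."""
    return raw_bases / capture_size


def valid_ratio_percent(valid_reads, raw_reads):
    """Valid Ratio%: processed (valid) reads as a percentage of raw reads."""
    return 100.0 * valid_reads / raw_reads


# Values for sampleA taken from the table above.
sample_a_depth = round(raw_depth(15.01e9), 2)
sample_a_valid = round(valid_ratio_percent(98910722, 100053676), 2)
```

With the sampleA values, these two expressions give the Raw Depth and Valid% figures shown in the table.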

After preprocessing, the two samples contain 195.68M valid reads in total, an average of 97.84M valid reads (14.68G valid bases) per sample. All samples have Q30 > 90% and meet the criteria for downstream analysis.




4.4 Sequencing Depth/Coverage Distribution

Sequencing coverage/depth is estimated from the number of reads mapped to exons. Usually the mapping rate of reads from a human sample is more than 95%. A variant call at a site with coverage/depth higher than 10X is more reliable.

Figure. Sequencing Depth Graph

The abscissa indicates the sequencing depth, and the ordinate indicates the ratio of the bases at the corresponding depth to all bases. The graph on the right shows the cumulative base ratio (ordinate) at different depths (abscissa). For example, if the cumulative depth of 50X corresponds to a base ratio of about 95%, then about 95% of the bases have a sequencing depth greater than 50X.

The file of the figure can be found in summary/2 MappedData/ReadsDepthCoverage.png
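The cumulative curve described above can be computed directly from per-base depths. The following Python sketch assumes the depths are already available as a list of integers (for example, one value per target base); the function names are hypothetical and the report does not state which tool produced the figure.

```python
def fraction_deeper_than(depths, threshold):
    """Fraction of bases whose sequencing depth is greater than `threshold`."""
    if not depths:
        return 0.0
    return sum(1 for depth in depths if depth > threshold) / len(depths)


def cumulative_curve(depths, thresholds):
    """Return (threshold, fraction) pairs for plotting the cumulative graph."""
    return [(t, fraction_deeper_than(depths, t)) for t in thresholds]


# Toy example: four bases with depths 10X, 60X, 70X and 80X.
toy_curve = cumulative_curve([10, 60, 70, 80], [0, 50, 75])
```

In the toy example three of the four bases are deeper than 50X, so the curve passes through a base ratio of 0.75 at 50X.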




Figure. The average sequencing coverage/depth on each chromosome

The abscissa indicates the chromosomes, and the ordinate represents the average depth. The average depth is calculated as (number of mapped reads x length of the covered region) / total length of the exon region on each chromosome.

The file of the figure can be found in summary/2 MappedData/DepthCoverageByChr.png




4.5 Coverage Results

Term                                    sampleA               sampleB
Total                                   100053676 (100.00%)   98312530 (100.00%)
Duplicate                               11323351 (11.32%)     9049250 (9.20%)
Mapped                                  95830841 (95.78%)     91689785 (93.26%)
TARGET_TERRITORY                        60700153              60700153
NEAR_AMPLICON_BASES                     2500942993            2061476018
NEAR_AMPLICON_BASES+TARGET_TERRITORY    2561643146            2122176171
PF_UQ_READS_ALIGNED                     95830841              91689785
ON_AMPLICON_BASES                       7066308762            5996686766
MEAN_TARGET_COVERAGE                    118.66                100.58
PCT_TARGET_BASES_30X                    93.23%                90.81%




Terminology:

Term Annotations

Total                   Total number of raw reads (read 1 + read 2)
Duplicate               Number of duplicate reads
Mapped                  Number of reads mapped to the reference genome
TARGET_TERRITORY        Number of unique bases covered by the intervals of all targets that should be covered
NEAR_AMPLICON_BASES     Number of PF aligned bases that mapped to within a fixed interval of an amplified region, but not on a baited region
PF_UQ_READS_ALIGNED     Number of PF unique reads that are aligned with mapping score > 0 to the reference genome
ON_AMPLICON_BASES       Number of PF aligned bases that mapped to an amplified region of the genome
MEAN_TARGET_COVERAGE    Mean coverage of targets that received a coverage depth of at least 2 at one or more bases
PCT_TARGET_BASES_30X    Percentage of all target bases achieving 30X or greater coverage




A complete table can be found in summary/2 MappedData/MappedStatistics.xlsx




Figure. Read coverage of each sample

The file of the figure can be found in summary/2 MappedData/DepthCoverageByTarget.png

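The average coverage depth of each sample is more than 100X (109.62X on average across the samples). The average coverage depth of each chromosome is more than 100X. More than 90% of target bases in each sample reach 30X or greater coverage, showing sufficient depth and good uniformity for downstream analysis.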




5 Variant Calling Results and Analysis

Single Nucleotide Polymorphism (SNP) refers to a single-nucleotide variation in the genome that can serve as a genetic marker. Variations at individual nucleotides in the genome include substitutions, deletions, and insertions. Depending on the structure of the nucleotide bases involved, a substitution is classified as a transition (purine to purine or pyrimidine to pyrimidine, e.g., C to T, G to A) or a transversion (purine to pyrimidine or the reverse, e.g., C to A, G to T, C to G, A to T). SNPs appear most frequently at CpG sites: the C in a CpG dinucleotide tends to be methylated, and methylated C then turns into T through spontaneous deamination. In general, a SNP refers to a single-nucleotide variation present in more than 1% of a population.

SNPs may fall within coding regions of genes, non-coding regions of genes, or intergenic regions. SNPs within non-coding regions may still affect alternative splicing, transcription factor binding, mRNA degradation or non-coding RNA sequences. SNPs of this type that affect gene expression are known as expression SNPs (eSNPs) and may occur in the upstream or downstream region of the gene. SNPs within the coding region of a gene (cSNPs) are less common: the variation rate in the exome is only about 1/5 of the variation rate in the surrounding regions. However, they are more strongly correlated with the development of genetic diseases. From the perspective of a genetic trait, cSNPs can be classified into two types: synonymous and nonsynonymous. Synonymous cSNPs are changes in the coding region that do not alter the amino acid sequence of the translated protein. Nonsynonymous cSNPs are variations that change the protein sequence and therefore may change the function of the protein. This change is often implicated as the direct cause of changes in biological traits. About half of cSNPs are nonsynonymous.

Insertion-Deletion (INDEL) refers to the insertion or deletion of a small fragment (one or more bases) relative to the reference genome. INDELs may fall within the coding or the noncoding regions of genes. INDELs within coding regions may cause structural and functional changes in the protein. If a number of bases that is not a multiple of three is inserted or deleted within the coding region, the mutation is called a frame shift mutation. Such mutations change the downstream amino acid sequence. INDELs within noncoding regions (e.g., intron regions) may reduce the efficiency of transcription and the accuracy of alternative splicing. Moreover, the occurrence of INDELs is also one of the main drivers of evolution. For species with a relatively close genetic relationship, the major cause of species divergence is INDELs. In general, the greater the genetic distance between species, the more INDEL fragments there are and the longer they tend to be.
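The frame shift rule in the paragraph above (an inserted or deleted length that is not a multiple of three) is easy to express in code. This is a minimal sketch that assumes the variant is given as VCF-style REF and ALT strings and lies entirely within a coding region; the function name is hypothetical and this is not SnpEff's implementation.

```python
def is_frameshift(ref, alt):
    """Return True if a coding INDEL changes the reading frame.

    The length difference between ALT and REF is the number of inserted
    (positive) or deleted (negative) bases. A frame shift occurs when that
    number is non-zero and not a multiple of three.
    """
    length_change = len(alt) - len(ref)
    return length_change != 0 and length_change % 3 != 0
```

For example, `is_frameshift("A", "AT")` is true for a one-base insertion, while `is_frameshift("A", "ATGC")` is false because three inserted bases add one whole codon.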

SNP and INDEL sites on the genome are analyzed using GATK. After variant calling, SnpEff is used for mutation

site annotation.




5.1 SNP Results

Table. SNP location classification and annotation

SNV Class   NON_SYNONYMOUS_CODING   START_GAINED   START_LOST   STOP_GAINED   STOP_LOST   SYNONYMOUS_CODING
sampleA     30352                   781            57           280           29          38200
sampleB     27992                   729            39           295           33          31879

SNV Pos     DOWNSTREAM   INTRON   SPLICE_SITE_ACCEPTOR   UPSTREAM   UTR_3_PRIME   UTR_5_PRIME
sampleA     71305        137844   111                    56138      10858         6331
sampleB     66760        127846   122                    52787      10034         5948

Terminology:

Term Annotations

SNV Class
NON_SYNONYMOUS_CODING   Number of variants causing a codon that produces a different amino acid
SYNONYMOUS_CODING       Number of variants causing a codon that produces the same amino acid
STOP_GAINED             Number of variants causing a STOP codon
STOP_LOST               Number of variants causing a stop codon to be mutated into a non-stop codon
START_GAINED            Number of variants causing a START codon
START_LOST              Number of variants causing a start codon to be mutated into a non-start codon

SNV Pos
DOWNSTREAM              Downstream 1K bases from the stop codon
EXON                    Exon region
INTERGENIC              Intergenic region
INTRON                  Intron region
SPLICING                Region within 10 bases from the splicing junction
UPSTREAM                Upstream 1K bases from the start codon
UTR_3_PRIME             3’-UTR region


A complete table can be found in summary/3 VariantData/SNP INDEL PositionType VariantsType.xlsx




Note: The effect of mutations on the genome is as follows

High-Impact Effects     Moderate-Impact Effects              Low-Impact Effects
SPLICE_SITE_ACCEPTOR    NON_SYNONYMOUS_CODING                SYNONYMOUS_START
SPLICE_SITE_DONOR       CODON_CHANGE (*)                     NON_SYNONYMOUS_START
START_LOST              CODON_INSERTION                      START_GAINED
EXON_DELETED            CODON_CHANGE_PLUS_CODON_INSERTION    SYNONYMOUS_CODING
FRAME_SHIFT             CODON_DELETION                       SYNONYMOUS_STOP
STOP_GAINED             CODON_CHANGE_PLUS_CODON_DELETION     NON_SYNONYMOUS_STOP
STOP_LOST               UTR_5_DELETED
                        UTR_3_DELETED

(*) This effect is used by SnpEff only for MNPs, not SNPs.




Figure. The distribution of SNPs in different categories (left) and the distribution of SNPs (right) in different regions

of the genome

The pie chart on the left shows the distribution of SNPs in different categories as defined in the previous table. The pie chart on the right shows the distribution of SNPs in different locations as defined in the previous table.

The file of the figure can be found in summary/3 VariantData/sampleA/sampleA.SNV.png




Table. Summary of SNP types

Sample    all     genotype.Het   genotype.Hom   novel   in dbSNP   novel proportion   dbSNP proportion   Ts      Tv      novel.Ts   novel.Tv
sampleA   55608   35408          20200          3545    52063      0.06               0.94               39295   16354   2303       1242
sampleB   51936   32751          19185          2527    49409      0.05               0.95               36785   15193   1599       928

Terminology:

Term Annotations

Sample Sample Name

novel ts Number of novel transitions (not annotated in dbSNP)

novel tv Number of novel transversions

ts Total number of transitions

tv Total number of transversions

all Number of all SNPs

genotype.Het Number of SNPs of heterozygous genotypes

genotype.Hom Number of SNPs of homozygous genotypes

novel Number of novel SNPs

novel proportion Proportion of novel SNPs

A complete table can be found in summary/3 VariantData/VariantsType SNP.xlsx
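The transition/transversion classification and the summary ratios in the table can be illustrated with a short Python sketch. The function names are hypothetical; the classification rule is the standard one (purine to purine or pyrimidine to pyrimidine is a transition), and the example counts are the sampleA values from the table above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Return True for a transition, False for a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    # Both bases in the same chemical class means a transition.
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ts_tv_ratio(transitions, transversions):
    """Ratio of transitions to transversions."""
    return transitions / transversions


# sampleA values from the table above.
sample_a_ts_tv = round(ts_tv_ratio(39295, 16354), 2)
sample_a_novel_proportion = round(3545 / 55608, 2)
```

For sampleA this gives a ts/tv ratio of about 2.4 and a novel proportion of 0.06, matching the table.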




Figure. Comparison of SNP types

The first bar graph shows the number of heterozygous (Het) and homozygous (Hom) SNPs in one sample. The Het rate represents the percentage of heterozygous SNPs in all SNPs.

The second bar graph shows the number of annotated (dbSNP) and novel SNPs. The dbSNP rate represents the percentage of annotated SNPs in all SNPs.

The third bar graph shows the number of transitions (ts) and transversions (tv). The value of ts/tv is the ratio of transitions to transversions.

The fourth bar graph shows the number of transitions (ts) and transversions (tv) in novel SNPs. The value of ts/tv is the ratio of transitions to transversions.

The file of the figure can be found in summary/3 VariantData/sampleA/sampleA.SNP VariantsType.png




5.2 SnpEff Annotation of SNP

SNPs identified from sequencing are annotated using SnpEff. The annotations are presented in the VCF4.1 format for each

sample. The VCF files can be found in summary/3 VariantData/ for each sample, e.g., summary/3 VariantData/sampleA/sampleA.snp.annotation.fixed.function.vcf

Terminology:

Term Description

CHROM chromosome id

POS chromosome position

ID ID

REF reference allele

ALT alternative allele

QUAL quality

FILTER filter

INFO information

AD Allelic depths

DP Approximate read depth

GQ Genotype Quality

GT Genotype

PL Phred-scaled likelihoods of the given genotypes

A complete description of the terms in the VCF file can be found in summary/3 VariantData/readme.xlsx
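The per-sample fields listed above (GT, AD, DP, GQ, PL) appear in a VCF record as a colon-separated FORMAT column followed by one colon-separated value column per sample. The sketch below parses one such pair using only the Python standard library; the function names and the example values are illustrative and are not taken from the delivered VCF files.

```python
import re


def parse_sample_field(format_string, sample_string):
    """Pair FORMAT keys with one sample's values and convert numeric fields."""
    record = dict(zip(format_string.split(":"), sample_string.split(":")))
    parsed = {"GT": record.get("GT")}
    for key in ("AD", "PL"):  # comma-separated integer lists
        value = record.get(key)
        if value not in (None, "."):
            parsed[key] = [int(item) for item in value.split(",")]
    for key in ("DP", "GQ"):  # single integers
        value = record.get(key)
        if value not in (None, "."):
            parsed[key] = int(value)
    return parsed


def is_heterozygous(genotype):
    """True if the called alleles differ, e.g. 0/1 or 0|1."""
    alleles = [a for a in re.split(r"[/|]", genotype) if a != "."]
    return len(set(alleles)) > 1


example = parse_sample_field("GT:AD:DP:GQ:PL", "0/1:12,9:21:99:255,0,310")
```

Here `example["AD"]` holds the allelic depths for the reference and alternative alleles, and `is_heterozygous(example["GT"])` is true.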




5.3 INDEL Results

Table. INDEL location classification and annotation

INDEL Class   CODON_CHANGE_PLUS_CODON_DELETION   CODON_CHANGE_PLUS_CODON_INSERTION   CODON_DELETION   CODON_INSERTION   FRAME_SHIFT   FRAME_SHIFT+STOP_GAINED
sampleA       187                                126                                 84               137               310           16
sampleB       177                                142                                 78               144               357           18

INDEL Pos     DOWNSTREAM   INTRON   SPLICE_SITE_ACCEPTOR   SPLICE_SITE_DONOR   UPSTREAM   UTR_3_PRIME
sampleA       5474         15091    101                    17                  4428       618
sampleB       5171         13373    97                     14                  4152       641

Terminology:

Term Annotations

INDEL Class
CODON_CHANGE_PLUS_CODON_DELETION    One codon is changed and one or more codons are deleted
CODON_CHANGE_PLUS_CODON_INSERTION   One codon is changed and one or more codons are inserted
CODON_DELETION                      One or more codons are deleted
CODON_INSERTION                     One or more codons are inserted
FRAME_SHIFT                         Insertion or deletion causes a frame shift
FRAME_SHIFT+STOP_GAINED             Insertion or deletion causes a frame shift and a STOP codon

INDEL Pos
DOWNSTREAM                          Downstream 1K bases from the stop codon
EXON                                Exon region
INTERGENIC                          Intergenic region
INTRON                              Intron region
SPLICING                            Region within 10 bases from the splicing junction
UPSTREAM                            Upstream 1K bases from the start codon



A complete table can be found in summary/3 VariantData/SNP INDEL PositionType VariantsType.xlsx




Figure. The distribution of INDELs in different categories (left) and the distribution of INDELs (right) in different

regions of the genome

The pie chart on the left shows the distribution of INDELs in different categories as defined in the previous table. The pie chart on the right shows the distribution of INDELs in different locations as defined in the previous table.

The file of the figure can be found in summary/3 VariantData/sampleA/sampleA.INDEL.png




Table. Summary of INDEL types

Sample    all    genotype.Het   genotype.Hom   novel   in dbSNP   novel proportion   dbSNP proportion
sampleA   3846   2231           1615           436     3410
sampleB   3469   1904           1565           279     3190

Terminology:

Term Description

Sample Sample name

all Number of all INDELs

genotype.Het Number of heterozygous INDELs

genotype.Hom Number of homozygous INDELs

novel Number of novel INDELs

novel proportion Proportion of novel INDELs

dbSNP proportion Proportion of INDELs annotated in dbSNP

A complete table can be found in summary/3 VariantData/VariantsType INDEL.xlsx




Figure. Comparison of INDEL types

The first bar graph shows the number of heterozygous (Het) and homozygous (Hom) INDELs in one sample. The Het

rate represents the percentage of heterozygous INDELs in all INDELs.

The second bar graph shows the number of annotated (dbSNP) and novel INDELs. The dbSNP rate represents the

percentage of annotated INDELs in all INDELs.

The file of the figure can be found in summary/3 VariantData/sampleA/sampleA.INDEL VariantsType.png




5.4 SnpEff Annotation of INDEL

INDELs identified from sequencing are annotated using SnpEff. The annotations are presented in the VCF4.1 format for

each sample. The VCF files can be found in summary/3 VariantData/ for each sample, e.g., summary/3 VariantData/sampleA/sampleA.indel.annotation.fixed.function.vcf

Terminology:

Term Description

CHROM chromosome id

POS chromosome position

ID ID

REF reference allele

ALT alternative allele

QUAL quality

FILTER filter

INFO information

AD Allelic depths

DP Approximate read depth

GQ Genotype Quality

GT Genotype

PL Phred-scaled likelihoods of the given genotypes

A complete description of the terms in the VCF file can be found in summary/3 VariantData/readme.xlsx




6 Annotation and Filtering of Variants

6.1 dbSNP Annotation and Filtering

The Single Nucleotide Polymorphism Database (dbSNP; http://www.ncbi.nlm.nih.gov/SNP/) is developed by NCBI and the National Human Genome Research Institute (NHGRI). The database contains annotations of single base substitutions and short insertions/deletions for multiple organisms. Each SNP is identified by an index starting with ’rs’. High-frequency mutations in healthy people are usually not pathogenic, so we annotate and filter out the mutations that are already included in dbSNP and retain the mutation sites that are not annotated in dbSNP for downstream analysis.

The annotation results can be found in summary/4 VariantMultiAnno/sampleA/SNP/sampleA.snp.dbSNP.xlsx
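The dbSNP filtering step can be sketched as follows. This is an assumption-laden illustration rather than the pipeline's code: each variant is represented as a hypothetical dictionary whose "id" key holds the VCF ID column, where "." means no identifier and several identifiers may be separated by semicolons.

```python
def is_in_dbsnp(variant_id):
    """True if the VCF ID column carries at least one 'rs' identifier."""
    return any(part.startswith("rs") for part in variant_id.split(";"))


def retain_novel(variants):
    """Keep only the variants that are not annotated in dbSNP."""
    return [v for v in variants if not is_in_dbsnp(v["id"])]


# Hypothetical records for illustration only.
variants = [
    {"chrom": "1", "pos": 1000, "id": "rs0000001"},
    {"chrom": "1", "pos": 2000, "id": "."},
]
novel = retain_novel(variants)
```

Only the second record, which has no rs identifier, is retained for downstream analysis.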

6.2 1000Genome Annotation and Filtering

The 1000 Genomes Project (1000Genome) was launched on January 22, 2008, with a target of 1,200 people. It was designed to draw the most detailed and most valuable map of human genetic polymorphism. The map allows researchers to pinpoint disease-related genetic variants more quickly, enabling the use of genetic information to develop new strategies for the diagnosis, treatment and prevention of common diseases. The project includes genetic data from Yoruba in Ibadan, Nigeria; Japanese living in Tokyo; Chinese living in Beijing; Utah residents with Northern and Western European ancestry; Luhya of Webuye; Maasai of Kinyawa; Toscani residents of Italy; Gujarati Indians living in Houston; Chinese people living in Denver; people of Mexican ancestry living in Los Angeles; and people of African ancestry living in the southwestern United States. Mutations in 1000Genome with a minor allele frequency (MAF) greater than 5% are filtered out, retaining the mutation sites with a MAF less than 5% for downstream analysis.

The annotation results can be found in summary/4 VariantMultiAnno/sampleA/SNP/sampleA.snp.dbSNP.KGenome.xlsx
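The MAF filter described above amounts to a simple threshold test. The sketch below is illustrative: the function name is hypothetical, and two behaviours that the report does not specify are assumptions, namely that variants absent from 1000Genome (MAF of None) are retained and that a MAF of exactly 5% is filtered out.

```python
def passes_maf_filter(maf, threshold=0.05):
    """Return True if a variant is retained by the 1000Genome MAF filter.

    `maf` is the minor allele frequency as a fraction (0.05 means 5%),
    or None when the variant has no 1000Genome entry (assumed retained).
    """
    if maf is None:
        return True
    return maf < threshold


# Hypothetical frequencies for illustration only.
retained = [maf for maf in (0.001, 0.049, 0.05, 0.30, None) if passes_maf_filter(maf)]
```

With these example values, the rare variants (0.1% and 4.9%) and the variant without a frequency entry are kept, and the common ones are removed.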




6.3 Coding Region Annotation

Variants within the coding region or within the upstream/downstream 10 bases region from the splicing junction are

retained as candidate sites that may cause diseases.


sampleA.snp.dbSNP.KGenome.func.xlsx
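The retention rule in this section (inside a coding region, or within 10 bases of a splicing junction) can be sketched as an interval test. The representation used here is an assumption: coding regions are given as inclusive (start, end) coordinate pairs and splicing junctions as single positions on the same chromosome, and the function names are hypothetical.

```python
def near_splice_junction(position, junctions, window=10):
    """True if `position` lies within `window` bases of any junction."""
    return any(abs(position - junction) <= window for junction in junctions)


def is_candidate_site(position, coding_intervals, junctions, window=10):
    """True if a variant is in a coding interval or near a splicing junction."""
    in_coding = any(start <= position <= end for start, end in coding_intervals)
    return in_coding or near_splice_junction(position, junctions, window)


# Hypothetical gene model: one coding exon from 100 to 200, junctions at its ends.
coding = [(100, 200)]
junctions = [100, 200]
```

With this toy gene model, position 150 is retained as a coding variant, position 208 is retained because it is within 10 bases of the junction at 200, and position 250 is dropped.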

6.4 Protein Function Annotation

The effect of amino acid substitutions on protein function is predicted using the SIFT program (http://sift.jcvi.org/). It

can determine whether the amino acid substitutions are functionally neutral or deleterious. A standardized score ranging

from 0 to 1 is reported by the program. A score greater than 0.05 indicates that the mutation is tolerable. In other

words, the mutation has little or no effect on protein function. A score less than 0.05 suggests the mutation is harmful,

that is, the mutation has greater impact on protein function.

Polymorphism Phenotyping (PolyPhen2; http://genetics.bwh.harvard.edu/pph2/) is also a tool for predicting the effect of amino acid substitutions on protein structure and function. The results include three parts: Query, Prediction and Details. The Query section contains query information, similar to the input file. The Prediction section shows the predicted results. The Details section shows the details of the PolyPhen prediction, including all data information. A PolyPhen2 value greater than 0.95 indicates that the mutation site has a great impact on gene function.


sampleA.snp.dbSNP.KGenome.func.syn.xlsx
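The two score thresholds described in this section can be written as simple classification functions. This is a sketch under assumptions: the function names are hypothetical, scores are taken as already extracted numbers between 0 and 1, and a SIFT score of exactly 0.05 is treated as tolerated because the text only defines the cases above and below that value.

```python
def sift_call(score):
    """Classify a SIFT score: below 0.05 is harmful, otherwise tolerated."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores range from 0 to 1")
    return "deleterious" if score < 0.05 else "tolerated"


def polyphen2_high_impact(score):
    """True if a PolyPhen2 value exceeds the 0.95 threshold given above."""
    return score > 0.95


def flagged_by_both(sift_score, polyphen2_score):
    """True when both predictors suggest a strong effect on the protein."""
    return sift_call(sift_score) == "deleterious" and polyphen2_high_impact(polyphen2_score)
```

Requiring agreement between the two predictors, as in `flagged_by_both`, is one possible way to prioritize candidates; the report itself does not state how the two scores are combined.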




7 Quality Control

The original image data obtained on Illumina HiSeq 2500 were processed for base calling and then converted to raw

reads stored in the FASTQ format. The FASTQ format contains the sequence information and the corresponding base

quality. The sequencing error rate per base increases along with the length of the sequencing read due to depletion of

chemical reagents at the end of the sequencing cycles. This phenomenon is common in the Illumina platform (Erlich and

Mitra, 2008).

Quality score per base




The figure above shows the quality score per base. A quality score of 20 (Q20) indicates that the sequencing error rate is 1%, and a quality score of 30 (Q30) indicates that the sequencing error rate is 0.1%.
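The Q20 and Q30 values quoted above follow from the standard Phred definition, Q = -10 x log10(P), where P is the probability that the base call is wrong. A minimal Python sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error_probability(quality):
    """Error probability for a Phred quality score: P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(probability)


q20_error = phred_to_error_probability(20)  # about 0.01, i.e. 1%
q30_error = phred_to_error_probability(30)  # about 0.001, i.e. 0.1%
```

Each increase of 10 in the quality score therefore corresponds to a tenfold drop in the expected error rate.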

The figure below shows the sequence content per base. Ideally, the percentage of each base at every position should be approximately equal and show no bias with position.

Sequence content per base

The file of the figure can be found in summary/6 Quality Control/1 fastQC




8 Appendix

8.1 Materials and Methods

Materials for Sample Quality Control

Instrument: Bioanalyzer 2100 (Agilent, CA, USA)

DNA Quality Assurance Kit: RNA 6000 Nano LabChip Kit (Agilent, CA, USA)

Materials for Sequencing Library Quality Control

Instrument: Bioanalyzer 2100 (Agilent, CA, USA)

Sequencing Library Quality Assurance Kit: High Sensitivity DNA Chip Kit (Agilent, CA, USA)

Sequencing

The sequencing library was loaded on the flowcell to generate clusters on Illumina’s Cluster Station. In each sequencing

cycle, a fluorescence signal was read to detect one base. Paired-end reads with 150 bases were obtained. Read

1 starts from the 5’ end of the insert and read 2 starts from the 3’ end of the insert. Depending on the insert

size, paired-end reads can cover the sequence from both ends. In addition, the distance between the two reads can be

estimated for downstream mapping and assembly.

8.2 Information Analysis

Sequence and primary analysis

A DNA library from human samples was sequenced with the Illumina HiSeq2500/4000 platform. Millions of paired-end

reads with 150 bases were obtained. This yields an average of 195.68G bases per sample to cover the human exome (50 Mb).
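The relation between sequencing yield and depth is simple arithmetic: the nominal mean depth is the number of sequenced bases divided by the size of the target. The sketch below uses a hypothetical yield of 10 Gb (not a figure from this report) against the 50 Mb exome size stated above. The result is an upper bound, since off-target reads, duplicates and unmapped reads all reduce the usable depth.

```python
def nominal_mean_depth(total_bases, target_size_bp):
    """Nominal mean depth: sequenced bases divided by target size."""
    return total_bases / target_size_bp


# Hypothetical yield of 10 Gb over a 50 Mb target.
print(nominal_mean_depth(10_000_000_000, 50_000_000))  # 200.0
```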

Alignment and duplicate marking

For the alignment step, BWA is used to align the reads in the paired FASTQ files to the reference genome.

In the first post-alignment processing step, Picard tools are used to identify and mark duplicate reads

in the BAM file.
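The idea behind duplicate marking can be sketched as follows: read pairs that map to the same position and strand are treated as copies of one original fragment, and all but the best-scoring copy are flagged. This is a simplified illustration of the concept, not Picard's implementation; the dictionary keys (name, chrom, pos, strand, mate_pos, base_quality_sum) and the example reads are hypothetical.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    Each read is a dict with the hypothetical keys name, chrom, pos,
    strand, mate_pos and base_quality_sum.
    """
    groups = defaultdict(list)
    for read in reads:
        # Reads sharing position, strand and mate position are assumed
        # to derive from the same original fragment.
        key = (read["chrom"], read["pos"], read["strand"], read["mate_pos"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["base_quality_sum"])
        # Keep the copy with the highest summed base quality, flag the rest.
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "mate_pos": 300, "base_quality_sum": 5200},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "mate_pos": 300, "base_quality_sum": 4800},
    {"name": "r3", "chrom": "chr1", "pos": 150, "strand": "+", "mate_pos": 350, "base_quality_sum": 5000},
]
print(mark_duplicates(reads))  # {'r2'}
```

Marked duplicates stay in the BAM file with a flag set, so downstream tools can ignore them without losing the original data.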

Local realignment around INDELs

In the second post-alignment processing step, local read realignment is performed to correct for potential alignment

errors around indels. Mapping of reads around the edges of indels often results in misaligned bases, thus creating false

positive SNP calls. Local realignment uses these mismatched bases to determine if a site should be realigned and applies

a computationally intensive algorithm to determine the most consistent placement of the reads with respect to the indel

and remove misalignment artifacts.

Base quality score recalibration

Each base of each read has an associated quality score, corresponding to the probability of a sequencing error. Due to

systematic biases, the reported quality scores are known to be inaccurate and as such must be recalibrated prior to

genotyping. After recalibration, the recalibrated quality score in the output BAM will more closely correspond to the

probability of a sequencing error.
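The principle of recalibration can be shown with a small sketch: bases are grouped by their reported quality, mismatches against the reference are counted at sites not known to be variant, and the observed mismatch rate is converted back to a Phred score. This is a deliberately reduced version. GATK additionally groups bases by covariates such as read group, machine cycle and sequence context, and the function name empirical_qualities and the input layout are assumptions for this example.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each reported quality to an empirically observed quality.

    observations: iterable of (reported_quality, is_mismatch) pairs for
    bases at sites that are not known to be variant.
    """
    counts = defaultdict(lambda: [0, 0])   # reported Q -> [mismatches, total bases]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1

    table = {}
    for reported_q, (mismatches, total) in counts.items():
        # Pseudocounts keep the rate away from 0, which would give an infinite score.
        error_rate = (mismatches + 1) / (total + 2)
        table[reported_q] = -10 * math.log10(error_rate)
    return table


# 98 bases reported as Q30, 9 of which disagree with the reference.
observations = [(30, True)] * 9 + [(30, False)] * 89
print(empirical_qualities(observations))
```

In this made-up example the bases reported as Q30 behave like roughly Q10 bases, which is the kind of discrepancy recalibration is meant to correct.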

Variant calling

Variant calls can be generated with GATK HaplotypeCaller or UnifiedGenotyper, which examine the evidence for

variation from the reference via Bayesian inference.
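To make the Bayesian step concrete, the sketch below computes posterior probabilities for the three diploid genotypes at a single biallelic site from the bases and qualities piled up over it. It is a textbook-style simplification, not the GATK implementation: reads are treated as independent, each base error is assumed equally likely to produce any of the three other bases, qualities are assumed to be greater than zero, and the prior values are hypothetical.

```python
import math


def genotype_posteriors(pileup, ref, alt, priors=None):
    """Posterior probability of hom_ref, het and hom_alt at one site.

    pileup: list of (base, phred_quality) pairs covering the site.
    """
    genotypes = {"hom_ref": (ref, ref), "het": (ref, alt), "hom_alt": (alt, alt)}
    if priors is None:
        # Hypothetical priors favouring the reference genotype.
        priors = {"hom_ref": 0.998, "het": 0.001, "hom_alt": 0.001}

    log_posterior = {}
    for name, alleles in genotypes.items():
        log_p = math.log(priors[name])
        for base, quality in pileup:
            error = 10 ** (-quality / 10)
            # Each read is drawn from one of the two alleles with equal probability.
            p = sum((1 - error) if base == allele else error / 3 for allele in alleles) / 2
            log_p += math.log(p)
        log_posterior[name] = log_p

    # Normalise in log space to avoid underflow at high depth.
    peak = max(log_posterior.values())
    unnormalised = {k: math.exp(v - peak) for k, v in log_posterior.items()}
    total = sum(unnormalised.values())
    return {k: v / total for k, v in unnormalised.items()}


pileup = [("A", 30)] * 10 + [("G", 30)] * 10
posteriors = genotype_posteriors(pileup, ref="A", alt="G")
print(max(posteriors, key=posteriors.get))  # het
```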

Variant recalibration

A Gaussian mixture model is fit to assign accurate confidence scores to each putative mutation call and evaluate new

potential variants.
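As an illustration of the modelling idea only, the sketch below fits a two-component Gaussian mixture to a single variant annotation with the EM algorithm. GATK's variant recalibration works on several annotations at once and is trained on known variant sites; the one-dimensional model, the function name fit_two_gaussians and the example values here are all assumptions made to keep the example short.

```python
import math


def fit_two_gaussians(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns a list of (weight, mean, variance) tuples, one per component.
    """
    n = len(values)
    overall_mean = sum(values) / n
    overall_var = sum((x - overall_mean) ** 2 for x in values) / n or 1.0
    weights = [0.5, 0.5]
    means = [min(values), max(values)]        # start the components far apart
    variances = [overall_var, overall_var]

    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)     # floor keeps the density finite
    return list(zip(weights, means, variances))


# Made-up annotation values forming a low-scoring and a high-scoring group.
values = [0.0, 0.5, 1.0, 0.2, 0.8, 10.0, 10.5, 11.0, 10.2, 10.8]
for weight, mean, variance in fit_two_gaussians(values):
    print(round(weight, 2), round(mean, 2), round(variance, 2))
```

Once such a model is fitted, each call can be scored by how well its annotation values match the component that represents reliable variants.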

Variant function annotation

Biological functional annotation is a crucial step in finding the links between genetic variation and disease. SnpEff is

utilized to add biological information to a set of variants.




9 References

[1]. Ng SB, Turner EH, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 2009,

461(7261): 272-276.

[2]. Choi M, Scholl UI, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc

Natl Acad Sci USA, 2009, 106(45): 19096-19101.

[3]. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009,

25(14): 1754-1760.

[4]. Kent W J, Sugnet C W, Furey T S, et al. The human genome browser at UCSC. Genome research, 2002, 12(6):

996-1006.

[5]. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics, 2009,

25(16): 2078-2079.

[6]. Sherry S T, Ward M H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic acids research,

2001, 29(1): 308-311.

[7]. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput

sequencing data. Nucleic acids research, 2010, 38(16): e164.

[8]. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature,

2012, 491(7422): 56-65.

[9]. Cingolani P, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs

in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 2012, 6(2): 80-92.

10 Contact Us

Address: 2575 W Bellfort Ave Ste 270 Houston, TX 77054

Phone Number: (713) 664-7087

Toll-free: (888) 528-8818

Fax: (713) 664-8181

Email: [email protected]

Website: www.lcsciences.com


