Exome Sequencing Report
Customer Name: Demo
Institute: University of ABC
Order Number: 6667
Date Prepared: 2017-06-02
This report covers confidential materials of LC Sciences. Please make sure
that the contents of this report are for your personal use only and that you are
responsible for confidentiality. If the contents of this report are disclosed to any
third party or company, according to the relevant laws and regulations, LC Sciences
will be entitled to legal action.
Prepared by LC Sciences |www.lcsciences.com| [email protected]
2575 W Bellort Ave, Houston, TX 77054 Tel. 888-528-8818, Fax. 713-664-8181
Table of Contents
1 Exome Sequencing Introduction
2 Exome Sequencing Report
2.1 Disease Information
2.2 Database Information
2.3 Data Analysis Program
3 Technical Methods and Processes
3.1 Experimental Processes
3.2 Analysis Process
4 Sequencing Data Overview
4.1 Sample Collection and Grouping Information
4.2 Sequencing Data Filtering
4.3 Sequencing Data Quality Control
4.4 Sequencing Depth/Coverage Distribution
4.5 Coverage Results
5 Variant Calling Results and Analysis
5.1 SNP Results
5.2 SnpEff Annotation of SNP
5.3 INDEL Results
5.4 SnpEff Annotation of INDEL
6 Annotation and Filtering of Variants
6.1 dbSNP Annotation and Filtering
6.2 1000Genome Annotation and Filtering
6.3 Coding Region Annotation
6.4 Protein Function Annotation
7 Quality Control
8 Appendix
8.1 Materials and Methods
8.2 Information Analysis
9 References
10 Contact Us
1 Exome Sequencing Introduction
An exon is the part of a eukaryotic gene that is retained after splicing; protein-coding exons are
translated into peptide sequences. The exome is the sum of all exon regions in the genome. It contains
the information needed for translation and covers most of the functional variants associated with
individual phenotypes. The human genome contains approximately 180,000 exons with a total length
of about 30 Mb. The human exome accounts for only about 1% of the genome, yet it harbors about 85%
of known human pathogenic mutations.
Exome sequencing uses specially designed probes to enrich the protein-coding regions, or other
specific regions of interest, before high-throughput sequencing. Restricting sequencing to these
regions greatly improves the efficiency of exome studies and significantly reduces the cost of
research. The technology can be used to identify pathogenic genes in Mendelian diseases and in
complex diseases such as cancer, diabetes, and obesity, which enables researchers to better
explain the pathogenesis of these diseases.
The technical advantages of exome sequencing:
Cost-effective: Variants across all coding regions of the genome can be obtained more economically
and efficiently than with whole-genome sequencing. For the same sequencing output, the exon regions
are sequenced more deeply and the results are more accurate.
High detection accuracy: Single-base variants can be identified across the entire exome.
Applicable to large sample sizes: Because exome sequencing is economical, it is well suited to
studies with a large number of samples.
2 Exome Sequencing Report
Species Name: Human (Homo sapiens)
2.1 Disease Information
Disease Name: Genetic Disease
Disease Type: Dominant/recessive disorders on autosomal or sex chromosomes
Family Map: Demo
2.2 Database Information
Database Information
Database              Source                                                                                                          Version
Genome Database       ftp://ftp.ensembl.org/pub/release-73/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz           hg19
dbSNP Database        http://www.ncbi.nlm.nih.gov/SNP/                                                                                144b
1000Genome Database   http://www.1000genomes.org/                                                                                     V73
Clinvar Database      ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b144_GRCh37p13/VCF/clinical_vcf_set                             144
2.3 Data Analysis Program
Analysis                    Tool      Version  Description
Data Quality Control        FastQC    0.10.1
                            BWA       0.7.10   Reference Genome Alignment
                            SAMtools  0.1.19   View and Sort Alignment Results
                            Picard    1.119    Merge Sample BAM Results
SNP/INDEL Detection         GATK      3.3.0    Detection and Filtration of SNP and INDEL
SNP/INDEL Gene Annotation   SnpEff    V4.1     Detected Variation Explanation
3 Technical Methods and Processes
3.1 Experimental Processes
An in-solution (liquid-phase) capture system (Agilent, CA, USA) is used to efficiently enrich the
human exon regions. Library construction and capture are carried out using the SureSelect Human
All Exon V6 kit (Agilent, CA, USA), and high-throughput sequencing is performed on the
HiSeq 2500/4000 platform.
Sample DNA quality assessment requires a DNA amount of >= 1.5 ug (Qubit). Agarose gel
electrophoresis should show no degradation and no RNA contamination. Additionally, the
OD260/280 ratio measured by Nanodrop should range from 1.8 to 2.0. For samples with a DNA
amount < 1 ug, a substitute protocol may be suggested to optimize the sequencing library.
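The acceptance criteria above can be expressed as a small check. This is only an illustrative sketch: the function name and arguments are hypothetical, and the gel results are reduced to two boolean flags.

```python
def passes_dna_qc(amount_ug, od_260_280, degraded=False, rna_contaminated=False):
    """Return True when a sample meets the acceptance criteria stated above.

    amount_ug        -- DNA amount measured by Qubit, in micrograms
    od_260_280       -- OD260/280 ratio measured by Nanodrop
    degraded         -- True if the agarose gel shows degradation
    rna_contaminated -- True if the agarose gel shows RNA contamination
    """
    enough_dna = amount_ug >= 1.5
    pure = 1.8 <= od_260_280 <= 2.0
    return enough_dna and pure and not degraded and not rna_contaminated


print(passes_dna_qc(2.0, 1.85))   # meets all criteria
print(passes_dna_qc(0.8, 1.85))   # too little DNA
```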
Genomic DNA is randomly fragmented into pieces of 150-300 bp. Following end repair and A-tailing
(addition of a single adenine to the 3' ends), the fragments are ligated to sequencing adaptors
that carry specific indices. The library is then hybridized with up to 738,690 biotin-labeled
probes so that the 58 Mb exon region (including the upstream and downstream regions) can be
captured using streptavidin beads. After PCR amplification and quality assurance, the library is
loaded onto the flowcell for sequencing (Figure). Paired-end reads (2 x 150) are obtained for
downstream data analysis. A more detailed description of library preparation is available in the
supplemental materials.
Figure. Library Construction Experimental Workflow
3.2 Analysis Process
Adaptor sequences, poly-N, poly-A, and other low-complexity sequences are removed from the raw
reads. The remaining valid reads are mapped to the reference genome using BWA (Li H et al.
and Kent WJ et al.). The mapping result is saved in the BAM format and then sorted using
SAMtools (Li H et al.). Duplicate reads arising from PCR amplification are marked using
Picard. Marked duplicates are excluded from subsequent processing because they may produce
false positive results in the detection of mutations.
After duplicates are marked, the reads close to regions that BWA reports as insertions/deletions
(INDELs), based on the Compact Idiosyncratic Gapped Alignment Report (CIGAR) string, need to be
re-aligned. Because of BWA's alignment algorithm, mismatches reported close to an INDEL may not
be accurate and can cause false positive variant calls. Therefore, these sites must be corrected
before SNP and INDEL analysis. The IndelRealigner module in GATK is used to carry out INDEL
re-alignment in an effort to minimize the error rate of mismatches near each INDEL site.
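To illustrate how INDELs appear in a CIGAR string, the sketch below parses a CIGAR value and lists the insertions and deletions it contains. It is not part of the GATK workflow itself; the function names and the example CIGAR string are hypothetical, and only the standard SAM operation letters are assumed.

```python
import re

# One CIGAR element is a length followed by a SAM operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def indel_offsets(cigar):
    """Return (reference_offset, operation, length) for each I or D element.

    The offset is counted from the first aligned reference base of the read.
    """
    offset = 0
    found = []
    for length, op in parse_cigar(cigar):
        if op in "ID":
            found.append((offset, op, length))
        if op in "MDN=X":  # operations that consume reference bases
            offset += length
    return found


# Hypothetical 150 bp read with a 2-base deletion and a 1-base insertion.
print(indel_offsets("50M2D30M1I69M"))
```

Reads whose INDEL offsets fall near a candidate site are the ones that local re-alignment is meant to revisit.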
Variant calling relies heavily on the quality score of each base reported by the sequencer. For
example, the BWA aligner reports a mismatch when the base quality is above Q25. At Q25, the
probability that such a mismatch is caused by a sequencing error is about 0.3% (it is 1% at Q20),
which may still affect the reliability of downstream analysis. Additionally, the sequencing
quality at the 3’ end is generally lower than the quality at the 5’ end due to reagent depletion,
and the quality of A/C is often lower than that of T/G. Therefore, it is necessary to recalibrate
the base quality using the BaseRecalibrator module in GATK so that the quality scores of the
sequence are more reliable.
Note: The reads in one sample are expected to come from the same lane for base recalibration.
Otherwise, reads from different lanes need to be recalibrated separately.
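The relationship between a Phred quality score Q and the error probability p is Q = -10 * log10(p). A minimal sketch of the conversion in both directions is shown below; the function names are illustrative only.

```python
import math


def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)


for q in (10, 20, 25, 30):
    print(f"Q{q}: error probability = {phred_to_error(q):.4%}")
```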
After the steps mentioned above, variant calling is performed with the UnifiedGenotyper or the
HaplotypeCaller module in GATK. The UnifiedGenotyper module makes each call without considering
the impact of adjacent bases, whereas the HaplotypeCaller module makes the call based on local
de novo assembly: it first builds a De Bruijn graph and then applies a PairHMM model to predict
haplotypes and make the variant call.
When such a large-scale Bayesian inference problem is solved for a large number of samples, the
high-sensitivity model may produce more false positive results. Therefore, it is important to
perform further corrections on the variant calling results (variant recalibration).
In general, real mutation sites are clustered together by the variant calling model, and these
clusters should fit Gaussian distributions. Therefore, the VariantRecalibrator module in GATK
uses a Gaussian mixture model to remove false positive calls and find the true mutation sites.
Mutations in the coding region may be critical and cause diseases. Therefore, it is important to
annotate the biological function of each mutation site. We used the SnpEff program (officially
recommended by GATK) to examine structural changes at the mutation site and to further narrow
down the candidate regions leading to the disease. The overall flowchart of data analysis is as
follows:
Flowchart of data analysis
Note:
The Genome Analysis Toolkit (GATK) is developed by the Broad Institute for the analysis of second-generation
sequencing data. It contains a variety of tools, mainly for variant calling and genotyping. Data quality
assurance is strongly emphasized to reduce false positive results. The program has a powerful architecture
and processing engine, and its high-performance computing capabilities make it suitable for projects
of any size. At present, BWA for mapping combined with GATK is one of the most widely used analysis
pipelines for whole genome and exome sequencing.
4 Sequencing Data Overview
4.1 Sample Collection and Grouping Information
Group Sample
sampleA sampleA
sampleB sampleB
A complete table can be found in summary/1 RawData/sample info mendelian.xlsx
4.2 Sequencing Data Filtering
Paired-end raw reads obtained from the high-throughput sequencer may contain sequencing adaptors and
low-quality reads. To ensure accurate analysis results, the raw reads are preprocessed to obtain valid
data for subsequent analysis.
The preprocessing steps are as follows:
(1) Adapter removal
(2) Removal of reads in which more than 5% of the bases are N
(3) Removal of low-quality reads in which more than 20% of the bases have a quality score below 10
(4) Removal of reads after a comprehensive evaluation with Q20, Q30, and the GC content
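Steps (2) and (3) can be sketched as a per-read filter. This is an illustration under stated assumptions, not the pipeline's actual implementation: the function name and parameters are hypothetical, and quality scores are assumed to be already decoded to integers.

```python
def keep_read(seq, quals, max_n_frac=0.05, min_qual=10, max_lowq_frac=0.20):
    """Apply filter steps (2) and (3) to a single read.

    seq   -- base string, e.g. "ACGTN"
    quals -- list of integer quality scores, one per base
    """
    if not seq or len(seq) != len(quals):
        return False
    n_frac = seq.upper().count("N") / len(seq)
    lowq_frac = sum(q < min_qual for q in quals) / len(quals)
    # A read is removed only when a fraction is MORE than its threshold.
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac


print(keep_read("ACGT" * 25, [30] * 100))            # clean read
print(keep_read("N" * 6 + "A" * 94, [30] * 100))     # 6% N
print(keep_read("A" * 100, [5] * 21 + [30] * 79))    # 21% low-quality bases
```

For paired-end data, a pair would typically be dropped when either mate fails; that policy is an assumption and is not stated in this report.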
4.3 Sequencing Data Quality Control
For paired-end sequencing (PE150), the percentage of bases with a quality score greater than 20 should be
more than 90%, and the percentage of bases with a quality score greater than 30 should be more than 85%.
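The Q20% and Q30% values can be computed directly from FASTQ quality strings. The sketch below assumes the Phred+33 encoding, which is not stated in this report, and uses a hypothetical function name.

```python
def quality_summary(quality_strings, offset=33):
    """Return (Q20%, Q30%) over all bases in the given FASTQ quality strings.

    offset -- ASCII offset of the quality encoding (33 is assumed here)
    """
    total = q20 = q30 = 0
    for qs in quality_strings:
        for ch in qs:
            q = ord(ch) - offset
            total += 1
            q20 += q > 20   # "greater than 20", as defined in this report
            q30 += q > 30
    if total == 0:
        return 0.0, 0.0
    return 100.0 * q20 / total, 100.0 * q30 / total


# 'I' encodes Q40, '5' Q20, '?' Q30 and '#' Q2 under Phred+33.
print(quality_summary(["I5?#"]))
```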
Table. Summary of sequencing quality control
Sample   Raw Data/Read  Raw Data/Base  Valid Data/Read  Valid Data/Base  Raw Depth (x)  Valid%  Q20%   Q30%   GC%
sampleA  100053676      15.01G         98910722         14.84G           258.79         98.86   98.51  96.28  47.51
sampleB  98312530       14.75G         96767964         14.52G           254.31         98.43   97.24  93.08  48.39
Terminology:
Term             Annotations
Sample           Sequencing library name
Raw Data/Read    Number of reads obtained from the sequencer
Raw Data/Base    Number of bases in billions (Giga) obtained from the sequencer
Valid Data/Read  Number of reads after preprocessing
Valid Data/Base  Number of bases in billions (Giga) after preprocessing
Raw Depth (x)    Number of raw bases divided by the capture size of the Agilent kit (58M)
Valid%           Percentage of the processed reads (Valid) relative to the raw reads (Raw)
Q20%             Percentage of bases with a quality score greater than 20
Q30%             Percentage of bases with a quality score greater than 30
GC%              Percentage of the GC content in raw reads
A complete table can be found in summary/1 RawData/ReadsQC.xlsx
After preprocessing, the average number of valid bases per sample is 14.68G. All samples have at least
14.52G valid bases and Q30 > 90%, and meet the criteria for downstream analysis.
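The Raw Depth and Valid% columns follow from the definitions in the terminology table. The sketch below recomputes them from the reported values for sampleA; the function names are illustrative, and the 58M capture size is taken from the terminology table.

```python
CAPTURE_SIZE_BP = 58e6  # capture size of the Agilent kit, as stated in the terminology


def raw_depth(raw_bases, capture_size=CAPTURE_SIZE_BP):
    """Raw sequencing depth: raw bases divided by the capture size."""
    return raw_bases / capture_size


def valid_ratio(valid_reads, raw_reads):
    """Percentage of reads that remain after preprocessing."""
    return 100.0 * valid_reads / raw_reads


# Values for sampleA from the summary table.
print(round(raw_depth(15.01e9), 2))
print(round(valid_ratio(98910722, 100053676), 2))
```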
4.4 Sequencing Depth/Coverage Distribution
Sequencing coverage/depth is estimated from the number of reads mapped to exons. The mapping rate of
reads from a human sample is usually more than 95%. A variant call at a site with coverage/depth higher
than 10X is more reliable.
Figure. Sequencing Depth Graph
The abscissa indicates the sequencing depth, and the ordinate indicates the ratio of bases at the corresponding
depth to all bases. The graph on the right shows the cumulative base ratio (ordinate) at different depths (abscissa). For
example, a cumulative depth of 50X that corresponds to a base ratio of about 95% indicates that about 95% of the bases
have a sequencing depth greater than 50X.
The file of the figure can be found in summary/2 MappedData/ReadsDepthCoverage.png
Figure. The average sequencing coverage/depth on each chromosome
The abscissa indicates the chromosomes, and the ordinate represents the average depth. The average depth is calculated
as (number of mapped reads x length of the covered region) / total length of the exon region on each chromosome.
The file of the figure can be found in summary/2 MappedData/DepthCoverageByChr.png
4.5 Coverage Results
Term                                   sampleA               sampleB
Total                                  100053676 (100.00%)   98312530 (100.00%)
Duplicate                              11323351 (11.32%)     9049250 (9.20%)
Mapped                                 95830841 (95.78%)     91689785 (93.26%)
TARGET_TERRITORY                       60700153              60700153
NEAR_AMPLICON_BASES                    2500942993            2061476018
NEAR_AMPLICON_BASES+TARGET_TERRITORY   2561643146            2122176171
PF_UQ_READS_ALIGNED                    95830841              91689785
ON_AMPLICON_BASES                      7066308762            5996686766
MEAN_TARGET_COVERAGE                   118.66                100.58
PCT_TARGET_BASES_30X                   93.23%                90.81%
PCT_TARGET_BASES_20X                   95.47%                94.66%
PCT_TARGET_BASES_10X                   96.87%                96.81%
PCT_TARGET_BASES_2X                    97.61%                97.71%
Terminology:
Term                   Annotations
Total                  Total number of raw reads (read 1 + read 2)
Duplicate              Number of duplicate reads
Mapped                 Number of reads mapped to the reference genome
TARGET_TERRITORY       Number of unique bases covered by the intervals of all targets that should be covered
NEAR_AMPLICON_BASES    Number of PF aligned bases that mapped to within a fixed interval of an amplified region, but not on a baited region
PF_UQ_READS_ALIGNED    Number of PF unique reads that are aligned to the reference genome with a mapping score > 0
ON_AMPLICON_BASES      Number of PF aligned bases that mapped to an amplified region of the genome
MEAN_TARGET_COVERAGE   Mean coverage of targets that received a coverage depth of at least 2 at one base
PCT_TARGET_BASES_30X   Percentage of all target bases achieving 30X or greater coverage
PCT_TARGET_BASES_20X   Percentage of all target bases achieving 20X or greater coverage
PCT_TARGET_BASES_10X   Percentage of all target bases achieving 10X or greater coverage
PCT_TARGET_BASES_2X    Percentage of all target bases achieving 2X or greater coverage
A complete table can be found in summary/2 MappedData/MappedStatistics.xlsx
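The percentages in the coverage table can be recomputed from the read counts, and the PCT_TARGET_BASES_nX style of metric can be illustrated on a list of per-base depths. This is a sketch with hypothetical function names and a made-up depth list; it does not reproduce how Picard computes its metrics internally.

```python
def pct(part, total):
    """Percentage of `part` in `total`, rounded to two decimals."""
    return round(100.0 * part / total, 2)


def pct_bases_at_least(depths, threshold):
    """Percentage of target bases whose depth is at least `threshold`."""
    if not depths:
        return 0.0
    return 100.0 * sum(d >= threshold for d in depths) / len(depths)


# Read counts for sampleA from the coverage table.
total, duplicate, mapped = 100053676, 11323351, 95830841
print(pct(duplicate, total), pct(mapped, total))

# Made-up per-base depths for four target bases.
print(pct_bases_at_least([0, 5, 12, 40], 10))
```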
Figure. Read coverage of each sample
The file of the figure can be found in summary/2 MappedData/DepthCoverageByTarget.png
The average coverage depth of each sample is more than 100X. The average coverage depth of each chromosome is
more than 100X. In each sample, at least 94.66% of target bases are covered at 20X or more, with good uniformity for downstream analysis.
5 Variant Calling Results and Analysis
A single nucleotide polymorphism (SNP) is a single nucleotide variation in the genome that can serve as a genetic
marker. Variations at individual nucleotides in the genome include substitutions, deletions, and insertions.
Depending on the structure of the bases involved, a substitution is classified as a transition (purine to purine or
pyrimidine to pyrimidine, e.g., C to T, G to A) or a transversion (purine to pyrimidine or the reverse, e.g., C to A,
G to T, C to G, A to T). SNPs appear most frequently at CpG sites: the C in a CpG dinucleotide tends to be methylated,
and methylated C then turns into T through spontaneous deamination.
In general, a SNP refers to a single nucleotide variation with a frequency greater than 1% in a population.
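The distinction between transitions and transversions can be sketched as a small classifier based on the purine (A, G) and pyrimidine (C, T) base classes. The function name is illustrative only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref, alt):
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # A transition stays within the same chemical class of base.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


for ref, alt in [("C", "T"), ("G", "A"), ("C", "A"), ("A", "T")]:
    print(f"{ref}>{alt}: {substitution_type(ref, alt)}")
```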
SNPs may fall within coding regions of genes, non-coding regions of genes, or intergenic regions. SNPs within
non-coding regions may still affect alternative splicing, transcription factor binding, mRNA degradation, or non-coding
RNA sequences. A SNP of this type that affects gene expression is known as an expression SNP (eSNP) and may occur in the
upstream or downstream region of the gene. SNPs within the coding region of a gene (cSNPs) are less common; the
variation rate in the exome is only about 1/5 of the variation rate in the surrounding regions. However, cSNPs are more
strongly correlated with the development of genetic diseases. cSNPs can be classified into two types: synonymous and
nonsynonymous. Synonymous cSNPs are changes in the coding region that do not alter the amino acid sequence of the
protein. Nonsynonymous cSNPs change the protein sequence and may therefore change the function of the protein. Such
changes are often implicated as the direct cause of changes in biological traits. About half of cSNPs are nonsynonymous.
An insertion-deletion (INDEL) is the insertion or deletion of a small fragment (one or more bases) relative to the
reference genome. INDELs may fall within the coding regions or the noncoding regions of genes. INDELs within
coding regions may change the structure and function of the protein. If the number of bases inserted or deleted
within the coding region is not a multiple of three, the mutation is called a frameshift mutation; such mutations
change the downstream amino acid sequence. INDELs within noncoding regions (e.g., intron regions) may
reduce the efficiency of transcription and the accuracy of alternative splicing. Moreover, INDELs are
one of the main sources of variation in evolution. For species with a relatively close genetic relationship, INDELs are
a major cause of species divergence. In general, the greater the genetic distance between species, the more numerous
and the longer the INDEL fragments.
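The frameshift rule described above depends only on whether the length change is a multiple of three. A minimal sketch, with a hypothetical function name and VCF-style REF/ALT alleles as input:

```python
def indel_effect(ref, alt):
    """Classify a coding-region INDEL by the length difference of its alleles.

    Returns a (kind, frame) pair, e.g. ("deletion", "frameshift").
    """
    delta = len(alt) - len(ref)
    if delta == 0:
        raise ValueError("alleles have equal length; not an INDEL")
    kind = "insertion" if delta > 0 else "deletion"
    frame = "in-frame" if delta % 3 == 0 else "frameshift"
    return kind, frame


print(indel_effect("A", "ACGT"))  # 3 bases inserted
print(indel_effect("ACG", "A"))   # 2 bases deleted
```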
SNP and INDEL sites on the genome are analyzed using GATK. After variant calling, SnpEff is used for mutation
site annotation.
5.1 SNP Results
Table. SNP location classification and annotation
SNV Class               sampleA  sampleB
NON_SYNONYMOUS_CODING   30352    27992
START_GAINED            781      729
START_LOST              57       39
STOP_GAINED             280      295
STOP_LOST               29       33
SYNONYMOUS_CODING       38200    31879
SNV Pos                 sampleA  sampleB
DOWNSTREAM              71305    66760
INTRON                  137844   127846
SPLICE_SITE_ACCEPTOR    111      122
UPSTREAM                56138    52787
UTR_3_PRIME             10858    10034
UTR_5_PRIME             6331     5948
Terminology:
Term                    Annotations
SNV Class
NON_SYNONYMOUS_CODING   Number of variants causing a codon that produces a different amino acid
SYNONYMOUS_CODING       Number of variants causing a codon that produces the same amino acid
STOP_GAINED             Number of variants causing a STOP codon
STOP_LOST               Number of variants causing a stop codon to be mutated into a non-stop codon
START_GAINED            Number of variants causing a START codon
START_LOST              Number of variants causing a start codon to be mutated into a non-start codon
SNV Pos
DOWNSTREAM              Downstream 1K bases from the stop codon
EXON                    Exon region
INTERGENIC              Intergenic region
INTRON                  Intron region
SPLICING                Region within 10 bases from the splicing junction
UPSTREAM                Upstream 1K bases from the start codon
UTR_3_PRIME             3’-UTR region
UTR_5_PRIME             5’-UTR region
A complete table can be found in summary/3 VariantData/SNP INDEL PositionType VariantsType.xlsx
Note: The effect of mutations on the genome is as follows
High-Impact Effects    Moderate-Impact Effects             Low-Impact Effects
SPLICE_SITE_ACCEPTOR   NON_SYNONYMOUS_CODING               SYNONYMOUS_START
SPLICE_SITE_DONOR      CODON_CHANGE (*)                    NON_SYNONYMOUS_START
START_LOST             CODON_INSERTION                     START_GAINED
EXON_DELETED           CODON_CHANGE_PLUS_CODON_INSERTION   SYNONYMOUS_CODING
FRAME_SHIFT            CODON_DELETION                      SYNONYMOUS_STOP
STOP_GAINED            CODON_CHANGE_PLUS_CODON_DELETION    NON_SYNONYMOUS_STOP
STOP_LOST              UTR_5_DELETED
                       UTR_3_DELETED
(*) This effect is used by SnpEff only for MNPs, not SNPs.
Figure. The distribution of SNPs in different categories (left) and the distribution of SNPs in different regions
of the genome (right)
The pie chart on the left shows the distribution of SNPs in different categories as defined in the previous table. The
pie chart on the right shows the distribution of SNPs in different locations as defined in the previous table.
The file of the figure can be found in summary/3 VariantData/sampleA/sampleA.SNV.png
Table. Summary of SNP types
Sample   all    genotype.Het  genotype.Hom  novel  in dbSNP  novel proportion  dbSNP proportion  Ts     Tv     novel.Ts  novel.Tv
sampleA  55608  35408         20200         3545   52063     0.06              0.94              39295  16354  2303      1242
sampleB  51936  32751         19185         2527   49409     0.05              0.95              36785  15193  1599      928
Terminology:
Term              Annotations
Sample            Sample name
all               Number of all SNPs
genotype.Het      Number of SNPs of heterozygous genotypes
genotype.Hom      Number of SNPs of homozygous genotypes
novel             Number of novel SNPs
novel proportion  Proportion of novel SNPs
Ts                Total number of transitions
Tv                Total number of transversions
novel.Ts          Number of novel transitions (not annotated in dbSNP)
novel.Tv          Number of novel transversions
A complete table can be found in summary/3 VariantData/VariantsType SNP.xlsx
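The proportion columns and the ts/tv ratio can be derived from the counts in the summary table. The sketch below uses a hypothetical function name and the counts reported for sampleA.

```python
def snp_summary(total, novel, ts, tv):
    """Derive the proportion columns and the ts/tv ratio from raw counts."""
    return {
        "novel_proportion": round(novel / total, 2),
        "dbsnp_proportion": round((total - novel) / total, 2),
        "ts_tv": round(ts / tv, 2),
    }


# Counts for sampleA from the summary table.
print(snp_summary(total=55608, novel=3545, ts=39295, tv=16354))
```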
Figure. Comparison of SNP types
The first bar graph shows the number of heterozygous (Het) and homozygous (Hom) SNPs in one sample. The Het rate
represents the percentage of heterozygous SNPs among all SNPs.
The second bar graph shows the number of annotated (dbSNP) and novel SNPs. The dbSNP rate represents the
percentage of annotated SNPs among all SNPs.
The third bar graph shows the number of transitions (ts) and transversions (tv). The value of ts/tv is the ratio of
transitions to transversions.
The fourth bar graph shows the number of transitions (ts) and transversions (tv) in novel SNPs. The value of ts/tv is the
ratio of transitions to transversions.
The file of the figure can be found in summary/3 VariantData/sampleA/sampleA.SNP VariantsType.png
5.2 SnpEff Annotation of SNP
SNPs identified from sequencing are annotated using SnpEff. The annotations are presented in the VCF4.1 format for each
sample. The VCF files can be found in summary/3 VariantData/ for each sample, e.g., summary/3 VariantData/sampleA/sampleA.snp.annotation.fixed.function.vcf
Terminology:
Term    Description
CHROM   chromosome ID
POS     position on the chromosome
ID      variant ID
REF     reference allele
ALT     alternative allele
QUAL    quality
FILTER  filter
INFO    information
AD      Allelic depths
DP      Approximate read depth
GQ      Genotype Quality
GT      Genotype
PL      Phred-scaled likelihoods of the given genotypes
A complete description of the terms in the VCF file can be found in summary/3 VariantData/readme.xlsx
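As an illustration of how these fields are laid out, the sketch below parses one single-sample VCF data line into the terms listed above. The example line is made up, the function names are hypothetical, and missing values (".") in the genotype fields are not handled.

```python
def _ints(value):
    """Convert a comma-separated field such as '20,25' to a list of ints."""
    return [int(x) for x in value.split(",")]


def parse_genotype_fields(vcf_line):
    """Parse one tab-separated, single-sample VCF data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt, sample = cols[:10]
    # FORMAT names the colon-separated keys of the sample column.
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": var_id,
        "REF": ref,
        "ALT": alt,
        "QUAL": float(qual),
        "FILTER": filt,
        "GT": fields["GT"],
        "AD": _ints(fields["AD"]),
        "DP": int(fields["DP"]),
        "GQ": int(fields["GQ"]),
        "PL": _ints(fields["PL"]),
    }


# Made-up example line, not taken from the delivered VCF files.
EXAMPLE = "1\t12345\t.\tC\tT\t812.77\tPASS\t.\tGT:AD:DP:GQ:PL\t0/1:20,25:45:99:841,0,623"
record = parse_genotype_fields(EXAMPLE)
print(record["GT"], record["AD"], record["DP"])
```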
5.3 INDEL Results
Table. INDEL location classification and annotation
INDEL Class                         sampleA  sampleB
CODON_CHANGE_PLUS_CODON_DELETION    187      177
CODON_CHANGE_PLUS_CODON_INSERTION   126      142
CODON_DELETION                      84       78
CODON_INSERTION                     137      144
FRAME_SHIFT                         310      357
FRAME_SHIFT+STOP_GAINED             16       18
INDEL Pos                           sampleA  sampleB
DOWNSTREAM                          5474     5171
INTRON                              15091    13373
SPLICE_SITE_ACCEPTOR                101      97
SPLICE_SITE_DONOR                   17       14
UPSTREAM                            4428     4152
UTR_3_PRIME                         618      641
Terminology:
Term                                Annotations
INDEL Class
CODON_CHANGE_PLUS_CODON_DELETION    One codon is changed and one or more codons are deleted
CODON_CHANGE_PLUS_CODON_INSERTION   One codon is changed and one or more codons are inserted
CODON_DELETION                      One or more codons are deleted
CODON_INSERTION                     One or more codons are inserted
FRAME_SHIFT                         Insertion or deletion causes a frame shift
FRAME_SHIFT+STOP_GAINED             Insertion or deletion causes a frame shift or a STOP codon
INDEL Pos
DOWNSTREAM                          Downstream 1K bases from the stop codon
EXON                                Exon region
INTERGENIC                          Intergenic region
INTRON                              Intron region
SPLICING                            Region within 10 bases from the splicing junction
UPSTREAM                            Upstream 1K bases from the start codon
UTR_3_PRIME                         3’-UTR region
UTR_5_PRIME                         5’-UTR region
A complete table can be found in summary/3 VariantData/SNP INDEL PositionType VariantsType.xlsx
Figure. The distribution of INDELs in different categories (left) and the distribution of INDELs in different
regions of the genome (right)
The pie chart on the left shows the distribution of INDELs in different categories as defined in the previous table. The
pie chart on the right shows the distribution of INDELs in different locations as defined in the previous table.
The file of the figure can be found in summary/3 VariantData/sampleA/sampleA.INDEL.png
Table. Summary of INDEL types
Sample   all   genotype.Het  genotype.Hom  novel  in dbSNP  novel proportion  dbSNP proportion
sampleA  3846  2231          1615          436    3410
sampleB  3469  1904          1565          279    3190
Terminology:
Term Description
Sample Sample name
all Number of all INDELs
genotype.Het Number of heterozygous INDELs
genotype.Hom Number of homozygous INDELs
novel Number of novel INDELs
novel proportion Proportion of novel INDELs
dbSNP proportion Proportion of INDELs annotated in dbSNP
A complete table can be found in summary/3 VariantData/VariantsType INDEL.xlsx
Figure. Comparison of INDEL types
The first bar graph shows the number of heterozygous (Het) and homozygous (Hom) INDELs in one sample. The Het
rate represents the percentage of heterozygous INDELs in all INDELs.
The second bar graph shows the number of annotated (dbSNP) and novel INDELs. The dbSNP rate represents the
percentage of annotated INDELs in all INDELs.
The file of the figure can be found in summary/3 VariantData/sampleA/sampleA.INDEL VariantsType.png
5.4 SnpEff Annotation of INDEL
INDELs identified from sequencing are annotated using SnpEff. The annotations are presented in the VCF4.1 format for
each sample. The VCF files can be found in summary/3 VariantData/ for each sample, e.g., summary/3 VariantData/sampleA/sampleA.indel.annotation.fixed.function.vcf
Terminology:
Term    Description
CHROM   chromosome ID
POS     position on the chromosome
ID      variant ID
REF     reference allele
ALT     alternative allele
QUAL    quality
FILTER  filter
INFO    information
AD      Allelic depths
DP      Approximate read depth
GQ      Genotype Quality
GT      Genotype
PL      Phred-scaled likelihoods of the given genotypes
A complete description of the terms in the VCF file can be found in summary/3 VariantData/readme.xlsx
6 Annotation and Filtering of Variants
6.1 dbSNP Annotation and Filtering
The Single Nucleotide Polymorphism Database (dbSNP; http://www.ncbi.nlm.nih.gov/SNP/) is developed by NCBI and
the National Human Genome Research Institute (NHGRI). The database contains annotations of single base
substitutions and short insertions/deletions for multiple organisms. Each entry is identified by
an index starting with ’rs’. High frequency mutations in healthy people are usually not pathogenic, so we annotate
the variants against dbSNP and filter out the high frequency mutations that are included in it, retaining the mutation
sites that are not annotated in dbSNP for downstream analysis.
The annotation results can be found in summary/4 VariantMultiAnno/sampleA/SNP/sampleA.snp.dbSNP.xlsx
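The filtering idea can be sketched in Python as below. This is only an illustration under stated assumptions, not the pipeline's own code: the variant records and their rs numbers are placeholders, and the sketch assumes the dbSNP identifier is stored in the VCF ID field, where "." means no identifier and several identifiers are separated by semicolons.

```python
def is_in_dbsnp(variant_id):
    """Return True when the ID field carries at least one 'rs' identifier."""
    return any(part.startswith("rs") for part in variant_id.split(";"))


# Hypothetical annotated variants; the rs numbers are placeholders.
variants = [
    {"CHROM": "chr1", "POS": 12345, "ID": "."},
    {"CHROM": "chr2", "POS": 67890, "ID": "rs0000001"},
    {"CHROM": "chr3", "POS": 13579, "ID": "rs0000002;rs0000003"},
]

# Keep only the sites that are not annotated in dbSNP.
novel = [v for v in variants if not is_in_dbsnp(v["ID"])]
```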
6.2 1000Genome Annotation and Filtering
The 1000 Genomes Project (1000Genome) was launched on January 22, 2008, with the goal of sequencing 1,200 people. It
was designed to produce the most detailed and most valuable map of human genetic polymorphism. The new map
allows researchers to pinpoint disease-related genetic variants more quickly, enabling the use of genetic information
to develop new strategies for the diagnosis, treatment and prevention of common diseases. The project includes genetic
data from Yoruba in the Ibadan region of Nigeria, Japanese living in Tokyo, Chinese living in Beijing, Utah residents
with Northern and Western European ancestry, Luhya of Webuye, Kenya, Maasai of Kinyawa, Kenya, Toscani residents of
Italy, Gujarati Indians living in Houston, Chinese people living in Denver, Mexican descendants living in Los Angeles
and African descendants living in the southwestern United States. Mutations in 1000Genome with a minor allele
frequency (MAF) greater than 5% are filtered out, retaining the mutation sites with an MAF less than 5% for
downstream analysis.
The annotation results can be found in summary/4 VariantMultiAnno/sampleA/SNP/
sampleA.snp.dbSNP.KGenome.xlsx
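The MAF cutoff can be expressed as a short Python sketch. The field name kg_af, the example records, and the decision to keep variants that are absent from 1000 Genomes are assumptions made for illustration; the text above also does not say how a frequency of exactly 5% is handled, and this sketch drops it.

```python
MAF_CUTOFF = 0.05  # 5%, as stated in the text


def minor_allele_frequency(alt_freq):
    """The minor allele is whichever of the two alleles is rarer."""
    return min(alt_freq, 1.0 - alt_freq)


def keep_rare(variants, cutoff=MAF_CUTOFF):
    """Retain variants whose 1000 Genomes MAF is below the cutoff."""
    kept = []
    for variant in variants:
        freq = variant["kg_af"]  # hypothetical field; None if not in 1000 Genomes
        if freq is None or minor_allele_frequency(freq) < cutoff:
            kept.append(variant)
    return kept


# Hypothetical records with alternative-allele frequencies.
variants = [
    {"name": "a", "kg_af": 0.30},  # common, filtered out
    {"name": "b", "kg_af": 0.01},  # rare, kept
    {"name": "c", "kg_af": None},  # not in 1000 Genomes, kept (assumption)
    {"name": "d", "kg_af": 0.97},  # MAF is 0.03, kept
]
rare = keep_rare(variants)
```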
6.3 Coding Region Annotation
Variants within the coding region, or within 10 bases upstream or downstream of a splice junction, are
retained as candidate sites that may cause disease.
The annotation results can be found in summary/4 VariantMultiAnno/sampleA/SNP/
sampleA.snp.dbSNP.KGenome.func.xlsx
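A minimal Python sketch of this positional rule is shown below. It is a simplification under stated assumptions: the transcript is described only by its coding exons (1-based, inclusive coordinates), splice junctions are taken to be the boundaries between consecutive exons, and the coordinates are invented. The actual annotation in the pipeline is produced by the annotation software, not by this code.

```python
SPLICE_WINDOW = 10  # bases on either side of a splice junction, per the text


def splice_junctions(exons):
    """Return the exon-edge positions that flank each intron."""
    ordered = sorted(exons)
    junctions = []
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        junctions.extend([left_end, right_start])
    return junctions


def in_coding_region(pos, exons):
    """True when pos falls inside any coding exon."""
    return any(start <= pos <= end for start, end in exons)


def is_candidate(pos, exons, window=SPLICE_WINDOW):
    """True for coding positions or positions within `window` bases of a junction."""
    near_junction = any(abs(pos - j) <= window for j in splice_junctions(exons))
    return in_coding_region(pos, exons) or near_junction


# Hypothetical two-exon transcript.
EXONS = [(100, 200), (300, 400)]
```

With these coordinates, position 205 lies in the intron but is 5 bases from the junction at 200, so it is retained, while position 250 is neither coding nor close to a junction.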
6.4 Protein Function Annotation
The effect of amino acid substitutions on protein function is predicted using the SIFT program (http://sift.jcvi.org/),
which determines whether an amino acid substitution is functionally neutral or deleterious. The program reports a
standardized score ranging from 0 to 1. A score greater than 0.05 indicates that the mutation is tolerated, that is,
it has little or no effect on protein function. A score less than 0.05 suggests that the mutation is deleterious,
that is, it has a greater impact on protein function.
Polymorphism Phenotyping (PolyPhen2; http://genetics.bwh.harvard.edu/pph2/) is another tool for predicting the effect
of amino acid substitutions on protein structure and function. The results include three parts: Query, Prediction and
Details. The Query section contains the query information, similar to the input file. The Prediction section shows the
predicted results. The Details section shows the details of the PolyPhen2 prediction, including all data information. A
PolyPhen2 value greater than 0.95 indicates that the mutation site has a large impact on gene function.
The annotation results can be found in summary/4 VariantMultiAnno/sampleA/SNP/
sampleA.snp.dbSNP.KGenome.func.syn.xlsx
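The two thresholds can be applied as in the Python sketch below. The function names are hypothetical, and the scores are supplied directly rather than read from the report's spreadsheets. The text does not say how a SIFT score of exactly 0.05 is treated; this sketch assumes it counts as deleterious.

```python
SIFT_CUTOFF = 0.05      # from the text
POLYPHEN_CUTOFF = 0.95  # from the text


def sift_call(score):
    """Classify a SIFT score; the boundary value 0.05 is an assumption."""
    return "deleterious" if score <= SIFT_CUTOFF else "tolerated"


def polyphen_high_impact(score):
    """True when the PolyPhen2 value exceeds 0.95."""
    return score > POLYPHEN_CUTOFF


def damaging_by_both(sift_score, polyphen_score):
    """True only when both predictors agree that the substitution is harmful."""
    return sift_call(sift_score) == "deleterious" and polyphen_high_impact(polyphen_score)
```

Requiring agreement between both tools is one possible design choice for prioritizing candidates; either predictor can also be used on its own.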
7 Quality Control
The original image data obtained on Illumina HiSeq 2500 were processed for base calling and then converted to raw
reads stored in the FASTQ format. The FASTQ format contains the sequence information and the corresponding base
quality. The sequencing error rate per base increases along with the length of the sequencing read due to depletion of
chemical reagents at the end of the sequencing cycles. This phenomenon is common in the Illumina platform (Erlich and
Mitra, 2008).
Quality score per base
The figure above shows the quality score per base. A quality score of 20 (Q20) indicates a sequencing error rate of 1%,
and a quality score of 30 (Q30) indicates a sequencing error rate of 0.1%.
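The Q20 and Q30 values follow from the Phred relationship Q = -10 * log10(P), where P is the probability that the base call is wrong. The Python sketch below converts in both directions and decodes a FASTQ quality string. The ASCII offset of 33 is an assumption about the file encoding, and the function names are chosen here for illustration.

```python
import math


def phred_to_error_probability(q):
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    The offset of 33 is an assumption; older Illumina files used 64.
    """
    return [ord(char) - offset for char in quality_string]
```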
The figure below shows the sequence content per base. Ideally, the percentage of each base should be approximately
equal at every position and show no bias with position.
Sequence content per base
The file of the figure can be found in summary/6 Quality Control/1 fastQC
8 Appendix
8.1 Materials and Methods
Materials for Sample Quality Control
Instrument: Bioanalyzer 2100 (Agilent, CA, USA)
DNA Quality Assurance Kit: RNA 6000 Nano LabChip Kit (Agilent, CA, USA)
Materials for Sequencing Library Quality Control
Instrument: Bioanalyzer 2100 (Agilent, CA, USA)
Sequencing Library Quality Assurance Kit: High Sensitivity DNA Chip Kit (Agilent, CA, USA)
Sequencing
The sequencing library was loaded on the flowcell to generate clusters on Illumina’s Cluster Station. In each sequencing
cycle, a fluorescence signal was read to detect one base. Paired-end reads of 150 bases were obtained. Read
1 starts from the 5’ end of the insert and read 2 starts from the 3’ end of the insert. Depending on the insert
size, paired-end reads can cover the sequence from both ends. In addition, the distance between the two reads can be
estimated for downstream mapping and assembly.
8.2 Information Analysis
Sequence and primary analysis
A DNA library from human samples was sequenced on the Illumina HiSeq2500/4000 platform. Millions of paired-end
reads of 150 bases were obtained. This yields an average of 195.68G bases per sample to cover the human exome (50 Mb).
Alignment and duplicate marking
For the alignment step, BWA is used to align the reads contained in the paired FASTQ files to the reference genome.
For the first post-alignment processing step, Picard tools is used to identify and mark duplicate reads
in the BAM file.
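The idea behind duplicate marking can be sketched in Python as follows. This is a simplified illustration, not Picard's implementation: it assumes each read pair is summarized by a dictionary with hypothetical keys (chrom, pos, mate_pos, strand, qual_sum), groups pairs that share the same mapping coordinates and orientation, keeps the pair with the highest summed base quality, and flags the rest.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Flag all but the best-quality read pair in each group of identical placements."""
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["mate_pos"], read["strand"])
        groups[key].append(read)

    for group in groups.values():
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            read["duplicate"] = read is not best
    return reads


# Hypothetical read pairs; the first two share the same placement.
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 1000, "mate_pos": 1250, "strand": "+", "qual_sum": 5200},
    {"name": "r2", "chrom": "chr1", "pos": 1000, "mate_pos": 1250, "strand": "+", "qual_sum": 4800},
    {"name": "r3", "chrom": "chr1", "pos": 1010, "mate_pos": 1260, "strand": "+", "qual_sum": 5000},
]
mark_duplicates(reads)
```

Duplicates are flagged rather than deleted, so downstream tools can decide whether to ignore them.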
Local realignment around INDELs
In the second post-alignment processing step, local read realignment is performed to correct for potential alignment
errors around indels. Mapping of reads around the edges of indels often results in misaligned bases, thus creating false
positive SNP calls. Local realignment uses these mismatched bases to determine if a site should be realigned and applies
a computationally intensive algorithm to determine the most consistent placement of the reads with respect to the indel
and remove misalignment artifacts.
Base quality score recalibration
Each base of each read has an associated quality score, corresponding to the probability of a sequencing error. Due to
systematic biases, the reported quality scores are known to be inaccurate and as such must be recalibrated prior to
genotyping. After recalibration, the quality scores in the output BAM will more closely correspond to the
probability of a sequencing error.
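The principle of recalibration can be illustrated with a small Python sketch: bases are grouped by their reported quality, mismatches against the reference are counted, and an empirical quality is computed from the observed error rate. This is a conceptual simplification and not the GATK algorithm; the input format, the add-one smoothing and the function name are assumptions, and mismatches at known variant sites are assumed to have been excluded beforehand.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each reported quality to the quality implied by observed mismatches.

    observations: iterable of (reported_quality, is_mismatch) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # reported quality -> [bases, mismatches]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)

    table = {}
    for reported_q, (bases, mismatches) in counts.items():
        # Add-one smoothing keeps the rate above zero when no mismatch is seen.
        error_rate = (mismatches + 1) / (bases + 2)
        table[reported_q] = round(-10 * math.log10(error_rate))
    return table


# Hypothetical data: 998 bases reported as Q30, 9 of which mismatch the reference.
observations = [(30, True)] * 9 + [(30, False)] * 989
table = empirical_qualities(observations)
```

In this invented example the bases reported as Q30 show a smoothed error rate of 1%, so the empirical quality is 20, which is the kind of discrepancy recalibration is meant to correct.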
Variant calling
Variant calls can be generated with GATK HaplotypeCaller or UnifiedGenotyper, which examine the evidence for
variation from the reference via Bayesian inference.
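As a small illustration of the Bayesian step, the Python sketch below turns the Phred-scaled genotype likelihoods (the PL field) of a biallelic site into posterior probabilities using Bayes' rule. It is not the model used inside GATK: the flat prior, the example PL values and the function name are assumptions made for this example.

```python
def genotype_posteriors(pl, priors=None):
    """Posterior probability of each genotype from Phred-scaled likelihoods."""
    likelihoods = [10 ** (-value / 10) for value in pl]
    if priors is None:
        priors = [1.0 / len(pl)] * len(pl)  # flat prior (assumption)
    weighted = [lk * pr for lk, pr in zip(likelihoods, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


# PL order for a biallelic site: homozygous reference, heterozygous, homozygous alternative.
GENOTYPES = ["0/0", "0/1", "1/1"]
posteriors = genotype_posteriors([850, 0, 640])  # invented PL values
best = GENOTYPES[posteriors.index(max(posteriors))]
```

The genotype with a PL of 0 is the most likely one, so with a flat prior it also receives the highest posterior probability.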
Variant recalibration
A Gaussian mixture model is fit to assign accurate confidence scores to each putative mutation call and evaluate new
potential variants.
Variant function annotation
Biological functional annotation is a crucial step in finding the links between genetic variation and disease. SnpEff is
utilized to add biological information to a set of variants.
9 References
[1]. Ng SB, Turner EH, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 2009, 461(7261): 272-276.
[2]. Choi M, Scholl UI, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci USA, 2009, 106(45): 19096-19101.
[3]. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009, 25(14): 1754-1760.
[4]. Kent W J, Sugnet C W, Furey T S, et al. The human genome browser at UCSC. Genome research, 2002, 12(6): 996-1006.
[5]. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics, 2009, 25(16): 2078-2079.
[6]. Sherry S T, Ward M H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic acids research, 2001, 29(1): 308-311.
[7]. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research, 2010, 38(16): e164.
[8]. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature, 2012, 491(7422): 56-65.
[9]. Cingolani P, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 2012, 6(2): 80-92.
10 Contact Us
Address: 2575 W Bellfort Ave Ste 270 Houston, TX 77054
Phone Number: (713) 664-7087
Toll-free: (888) 528-8818
Fax: (713) 664-8181
Email: [email protected]
Website: www.lcsciences.com