bioinformatic techniques & tools for snp analysis jill l wegrzyn department of plant sciences...

Bioinformatic Techniques & Tools for SNP Analysis

Jill L Wegrzyn

Department of Plant Sciences, University of California at Davis, Davis, CA 95616 USA
E-mail: [email protected]

SNP Analysis Agenda

Sequence-Based SNP Identification

Common Bioinformatic Solutions: Phred, Phrap, Consed, Polyphred, and Polybayes

Re-Sequencing Project (ADEPT2)

High-Throughput SNP Identification Solution (PineSAP)

Finding SNPs: Sequence-based SNP Mining

Sequence Overlap - SNP Discovery

GTTACGCCAATACAG G ATCCAGGAGATTACC
GTTACGCCAATACAG C ATCCAGGAGATTACC

[Figure: sources of overlapping sequence for SNP discovery by DNA sequencing. Genomic: RRS Library (Shotgun Overlap), BAC Library (BAC Overlap), Random Shotgun (Align to Reference). mRNA: cDNA Library (EST Overlap).]

Comprehensive SNP Discovery: Resequencing

• Overlapping PCR Amplicons across entire gene

• Make no assumptions about sequence function

• Sequence diversity and genetic structure for each gene is different

• Proper association studies can only be designed in this context

• Complete resequencing facilitates population genetics methods

Sequence-based SNP Identification

[Figure: workflow. Amplify DNA (5' to 3'), sequence each end of the fragment, then Phred, Phrap, PolyPhred/Polybayes, Consed, and Analysis. Step labels: base-calling, quality determination, contig assembly, final quality determination, polymorphism detection, sequence viewing, polymorphism tagging, polymorphism reporting, individual genotyping, phylogenetic analysis. Example traces: homozygotes ATAGACG and ATACACG; heterozygote ATAGACG/ATACACG.]

What is Phred/Phrap/Consed ?

Phred/Phrap/Consed is a DNA sequence analysis package from University of Washington:

Trace file (chromatograms) reading

Quality (confidence) assignment to each individual base

Vector and repeat sequences identification and masking

Sequence assembly

Assembly visualization and editing

Automatic finishing

Phred, Phrap, Consed, Polyphred, Polybayes

phred: Base calling and quality assignments

phrap: Contig formation and new quality assignments

consed: Visual X-Windows graphic interface, to view and edit alignments and contigs, and to view the original traces

polyphred: finds polymorphisms in phrap contigs, makes quality calls, and adds data to the phrap files so that consed can find and display the polymorphisms.

polybayes: Fully probabilistic SNP detection algorithm that calculates the probability (SNP score) that discrepancies at a given location of a multiple alignment represent true sequence variations as opposed to sequencing errors.

Files and Program Execution

Input files: ABI chromat or SCF files

Order of program execution:

phred, then phrap, then consed

or: phred, then phrap, then polyphred, polybayes, then consed

or: phred, then phrap, then consed, then polyphred, polybayes, then consed

Trace File

High quality region – no ambiguities

Trace File

Medium quality region – some ambiguities

Trace File

Poor quality region – low confidence

What is Phred?

• Phred observes the base trace, makes base calls, and assigns quality values (qv) to the bases in the sequence.

• Writes base calls and qv to output files for Phrap assembly.

• Quality values (phred scores) range from 0 to 60. A value of 20 or above is considered a confident base call (1 in 100 chance that it has been called wrong, i.e. ~99% accuracy).

• Useful for consensus sequence construction

ATGCATTC        string1
   CGTTCATGC    string2
ATGC-TTCATGC    superstring

• Here the two strings disagree at one position ('A' in string1, 'G' in string2), shown as the dash in the superstring. The quality values decide the call: the base with the higher qv replaces the dash.
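The quality-based choice described above can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not Phrap's actual algorithm; the function name `merge_overlap`, the `offset` argument, and the quality values used in the usage note are assumptions made for the example.

```python
def merge_overlap(read1, qual1, read2, qual2, offset):
    """Merge two overlapping reads into one superstring.

    offset is the index in read1 where read2 starts (an assumption of this
    sketch; a real assembler determines it by alignment).
    Where both reads cover a position, the base with the higher quality wins.
    """
    if not 0 <= offset <= len(read1):
        raise ValueError("read2 must start inside or directly after read1")
    merged = []
    length = max(len(read1), offset + len(read2))
    for i in range(length):
        j = i - offset                      # matching index in read2
        in1 = i < len(read1)
        in2 = 0 <= j < len(read2)
        if in1 and in2:
            merged.append(read1[i] if qual1[i] >= qual2[j] else read2[j])
        elif in1:
            merged.append(read1[i])
        else:
            merged.append(read2[j])
    return "".join(merged)
```

With string1 and string2 from the slide and an offset of 3, a low quality on the 'A' of string1 gives a superstring carrying 'G' at the disputed position, and a high quality keeps the 'A'.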

Phred value formula

$$q = -10 \log_{10}(p)$$

where
q = quality value
p = estimated error probability for a base call

Examples:

q = 20 means $p = 10^{-2}$ (1 error in 100 bases)

q = 40 means $p = 10^{-4}$ (1 error in 10,000 bases)
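The formula and its inverse are easy to check numerically. The sketch below uses only the Python standard library; the function names are chosen for this example.

```python
import math


def error_to_phred(p):
    """Quality value q for an estimated base-call error probability p."""
    return -10 * math.log10(p)


def phred_to_error(q):
    """Estimated error probability p for a quality value q."""
    return 10 ** (-q / 10)
```

For instance, `error_to_phred(0.01)` reproduces q = 20 and `phred_to_error(40)` reproduces p = 0.0001, matching the two examples on the slide.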

Why Phred?

• Output sequence might contain errors.

• Vector contamination

• Dye-terminator reaction might not occur.

• Weak or variable signal strength of peak corresponding to a base.

The structure of a phd file

BEGIN_SEQUENCE 01EBV10201A02.g

BEGIN_COMMENT

CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big

END_COMMENT

BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
...
t 24 2221
a 24 2232
a 22 2245
a 27 2261
g 25 2272
c 19 2286
c 12 2302
t 19 2314
g 12 2324
g 15 2331
g 19 2346
g 23 2363
t 33 2378
g 36 2390
c 44 2404
c 44 2419
t 39 2433
a 39 2446
a 34 2460
t 35 2470
g 34 2482
...
t 16 8191
g 19 8200
t 13 8211
c 13 8229
g 4 8241
n 4 8253
c 4 8263
t 10 8276
t 9 8286
c 12 8301
t 16 8313
c 12 8329
c 12 8336
c 15 8343
t 19 8356
c 9 8371
g 13 8386
g 14 8397
a 7 8417
g 9 8427
g 4 8445
...
t 6 11908
a 6 11921
g 6 11927
t 6 11947
c 6 11953
a 6 11964
g 6 11981
c 4 11994
n 4 12015
c 4 12037
n 4 12044
n 4 12058
n 4 12071
n 4 12085
n 4 12098
n 4 12111
n 4 12124
c 4 12144
n 4 12151
END_DNA

END_SEQUENCE
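Each line between BEGIN_DNA and END_DNA holds a base call, its quality value, and its peak position in the trace. A minimal reader for that section might look like the sketch below; it assumes the three-column layout shown on the slide, and the function name `parse_phd_dna` is made up for this example.

```python
def parse_phd_dna(text):
    """Return (base, quality, peak_position) tuples from the DNA section
    of a phd file given as a string."""
    calls = []
    in_dna = False
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            base, qual, pos = line.split()
            calls.append((base, int(qual), int(pos)))
    return calls
```

Typical use would be `parse_phd_dna(open(path).read())`, after which low-quality stretches such as the runs of `n 4` at the end of the read above are easy to spot.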

Phrap

• Phrap constructs the contig sequence as a mosaic of the highest quality parts of the reads rather than a statistically computed “consensus”.

• Avoids the complex algorithm issues associated with multiple alignment methods

• Speed and accuracy

• Phrap is an assembler NOT an aligner

• The sequence produced by Phrap is quite accurate

• Less than 1 error per 10 kb in typical datasets.

• Phrap considers sequence quality at a given position (determined by Phred)

Running phredPhrap

Construct directory structure: a directory for the run, containing subdirectories chromat_dir, edit_dir, phd_dir, poly_dir

Move trace files to directory chromat_dir

Run phredPhrap from directory edit_dir

Determine phredPhrap parameters: use the defaults or adjust parameters

The greater the forcelevel, the more liberal the assembly (generally produces fewer contigs). However, a higher forcelevel results in more manual editing when reviewing polymorphisms later.
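The directory setup is mechanical and can be scripted. The sketch below only creates the four subdirectories named on the slide and moves the trace files into chromat_dir; it does not launch phredPhrap. The function name `prepare_run` and its arguments are assumptions for this example.

```python
import shutil
from pathlib import Path


def prepare_run(run_dir, trace_files):
    """Create the run directory layout and move trace files into chromat_dir.

    Returns the edit_dir path, from which phredPhrap is then run by hand.
    """
    run = Path(run_dir)
    for name in ("chromat_dir", "edit_dir", "phd_dir", "poly_dir"):
        (run / name).mkdir(parents=True, exist_ok=True)
    for trace in trace_files:
        target = run / "chromat_dir" / Path(trace).name
        shutil.move(str(trace), str(target))
    return run / "edit_dir"
```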

Phrap output files

• *.contigs – fasta file containing the contigs
- Contigs with more than one read
- Singletons (single reads with a match to some other contig but could not be merged consistently)

• *.singlets – fasta file of the singlet reads
- Reads with no match to other read

• *.ace – for viewing the assembly w/ Consed

• *.view – for viewing the assembly w/ Phrapview

Consed

Consed is a program for viewing and editing assemblies produced by Phrap

a. Assembly viewer - allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence.

b. Trace file viewer – single and multiple trace files can be visualized allowing for comparison of a given sequence in several reads.

c. Navigation – identify and list regions which are below a given quality threshold, contain high quality discrepancies, single-strand coverage, etc.

Phred/Phrap/Consed Pipeline

Directories: Chromat_dir, Phd_dir, Edit_dir

1. Input: chromatogram files

2. Quality (confidence) values assignment: Phred
   phd files - *.phd

3. Conversion - phd to fasta: phd2fasta.pl
   nucleotide sequences - seqs_fasta
   quality values - seqs_fasta.screen.qual

4. Vector screening and masking: Cross_Match (local alignment program) x vector.seq
   screened/masked file - seqs_fasta.screen

5. Assembly: Phrap
   assembled contigs - seqs_fasta.screen.contigs
   assembly file - seqs_fasta.screen.ace#

6. Assembly viewing/editing: Consed

Using PolyPhred to Visualize SNPs

• Compares sequences across traces obtained from different individuals to identify sites for SNPs.

• Will occasionally miscall genotypes - frequency of such mistakes depends on the sequencing chemistry used to generate the trace.

• To reduce the number of miscalled sites, ignores regions of poor quality & ends

Polyphred

polyphred -ace gene_file.ace.1 -tag p -snp hom -indel -f 50 > gene_file.polyphred.out

where gene_file.ace.1 is the .ace file present in the edit_dir directory.

Note: this command will give polymorphisms of quality 'ranks' 1 through 3: 1 is the highest quality, 6 is the lowest quality

The qualifier -f 50 is used to list 50bp of flanking sequence on each side of the detected polymorphism

The qualifier -indel is used to identify insertions or deletions

The qualifier -snp hom is used to identify homozygous SNPs only

The qualifier -tag p is used to list the tagged polymorphisms in the polyphred output file *.polyphred.out.

It first reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly. It then reads the PHD and POLY files associated with each trace.
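When many genes are processed, the same command line can be assembled programmatically. The sketch below uses only the qualifiers listed on the slide; the wrapper functions are hypothetical, and `run_polyphred` assumes that a `polyphred` executable is on the PATH and is started from edit_dir.

```python
import subprocess


def polyphred_command(ace_file, flank=50):
    """Build the argument list for the polyphred call shown on the slide."""
    return ["polyphred", "-ace", ace_file,
            "-tag", "p", "-snp", "hom", "-indel", "-f", str(flank)]


def run_polyphred(ace_file, out_file, flank=50):
    """Run polyphred and redirect its standard output to out_file."""
    with open(out_file, "w") as out:
        subprocess.run(polyphred_command(ace_file, flank), stdout=out, check=True)
```

Keeping the command construction separate from the execution makes it possible to log or inspect the exact call before anything is run.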

Polyphred

Reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly.

Reads the PHD and POLY files associated with each trace.

During the SNP search phase, PolyPhred combines information from all of the sequence traces to derive a genotype and a score for each sequence

The score indicates how well the trace at the site matches the expected pattern for a SNP.

Updates the ACE and PHD files by adding tags that mark the positions of the sites. The tagged sites can then be examined using Consed.

Polybayes

Bayesian statistical model takes into account:

- depth of coverage
- base quality values of the sequences
- a priori expected rate of polymorphic sites in region

Polybayes calculations are aided with information on major/minor allele frequencies as well as polymorphism rates within the species under investigation

**Can also integrate into the poly files for viewing with Consed
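To make the Bayesian idea concrete, the sketch below scores a single alignment column with a deliberately simplified two-hypothesis model: either the site is monomorphic and every discrepancy is a sequencing error, or the site carries two alleles. This is not the Polybayes algorithm; the prior of 0.003, the minor allele frequency of 0.5, and the function names are all assumptions for illustration.

```python
from collections import Counter


def phred_to_error(q):
    """Error probability for a phred quality value."""
    return 10 ** (-q / 10)


def snp_posterior(bases, quals, prior=0.003, minor_freq=0.5):
    """Posterior probability that a column is polymorphic (toy model)."""
    ranked = Counter(bases).most_common(2)
    if len(ranked) < 2:
        return 0.0                       # no discrepancy, nothing to score
    major, minor = ranked[0][0], ranked[1][0]
    like_mono = 1.0                      # all reads come from the major allele
    like_poly = 1.0                      # each read comes from either allele
    for base, qual in zip(bases, quals):
        err = phred_to_error(qual)
        p_major = (1 - err) if base == major else err / 3
        p_minor = (1 - err) if base == minor else err / 3
        like_mono *= p_major
        like_poly *= (1 - minor_freq) * p_major + minor_freq * p_minor
    numerator = prior * like_poly
    return numerator / (numerator + (1 - prior) * like_mono)
```

In this toy model a column with four high-quality reads of each allele scores close to 1, while a single low-quality discrepant base in otherwise clean coverage scores close to 0, which mirrors the role of depth, base quality, and the prior named on the slide.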

Allele Discovery of Economic Pine Traits

Goals

• Re-sequence ~8,000 amplicons in Loblolly Pine

• SNP Discovery

• Genotyping of 3,000 trees using Illumina technology

• Phenotyping of same 3,000 trees

• Association studies

Resequencing Project Pipeline

[Figure: project flowchart. EST Databases (UGA, UMN); Assemble Contigs; Primer Design; Primer Validation; Diversity Panel DNA Extraction (Megagametophyte); Agencourt Sequencing (18 individuals & 5 other); SNP Identification; Illumina Genotyping; Population DNA Extraction (Needles); Phenotyping (Drought, Diseases, & Wood); Association Studies.]

Primer Selection and Validation

Primer selection using OSP primer picking

Sequencing on validation DNA

Primer synthesis at Illumina

Successes - select 1 primer set per amplicon

BLASTing of validation DNA sequence back to EST database

Successes

Sequencing of diversity panel - 1 primer set per amplicon which has been successful at each step

Successes

Report to TreeGenes - several primers per amplicon will be stored in TreeGenes

Report to TreeGenes

Report to TreeGenes

Report to TreeGenes

Species                  Num. Successful Amplicons
Pinus taeda              7,424
Pinus radiata            6,249 (84%)
Pinus lambertiana        2,234 (30%)
Picea abies              1,024 (13%)
Pseudotsuga menziesii    750 (10%)

**312 successful in all 5 species

Alignment and SNP Calling Pipeline: Challenges in High-Throughput SNP Identification

Alignment
- Critical in the automation of base calls
- Commonly used Phrap (from PhredPhrap) is an assembler and is NOT ideal for alignments
- Many commonly used aligners work best with protein sequences or with a reference sequence
- Preservation of quality scores for input into SNP identification programs
- Speed for high-throughput programs

Automated SNP Calls
- Reference Sequence Required
- Traditional approaches without reference sequence include “eSNPs” (human, maize, and pine)
- Very little redundancy outside of abundant genes
- Overall high number of false positives (single pass reads)
- Not specific to frequencies observed in different organisms
- High number of false positives in currently accepted methods (Polybayes & Polyphred)

PineSAP

Re-Sequencing data from Agencourt:

Initial Processing

Base Calling Sequence Alignment

SNP Identification

Machine Learning

Data Storage & Release

Base Calling and Sequence Alignment

Modified PhredPhrap: allows for trimming of bases from start and end of sequence based on trace quality

Ace2FASTA: converts native PhredPhrap output (ace file) into an unaligned FASTA file

ProbconsRNA: optimal DNA sequence alignment program

AlignedContig2ReadFASTA: provides single multifasta file with all reads aligned to the contig from PhredPhrap AND the contig's alignment to the other contigs from ProbconsRNA

FASTA2Ace: converts resulting FASTA file back into ace file for SNP Identification
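The trimming step in the first item can be pictured with a very simple rule: drop bases from each end until a base reaches a minimum quality. The slides do not state the rule the modified PhredPhrap actually applies, so the threshold of 20 and the function name below are assumptions for this sketch.

```python
def trim_by_quality(bases, quals, min_q=20):
    """Remove low-quality bases from the start and end of a read.

    Interior bases are kept even if their quality is below min_q.
    """
    start = 0
    while start < len(quals) and quals[start] < min_q:
        start += 1
    end = len(quals)
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]
```

Because the qualities are trimmed together with the bases, the quality scores stay aligned with the sequence for the later SNP identification steps.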

Phrap assembly

Number of contigs    Percentage
1                    23.67%
2                    27.00%
3                    19.67%
4                    14.00%
5                    6.00%
6                    4.33%
7                    2.67%
8                    2.33%

Alignment Pipeline Success

Problems (out of 356 validation set)

Problem                                                                 Count    Percentage
Sequencing problems                                                     3        0.84%
Problems in end or middle                                               4        1.12%
F&R don't overlap                                                       2        0.56%
Alignment problems due to low quality                                   1        0.28%
Alignment problem due to wrong direction in original phrap alignment    2        0.56%

Summary (out of 356 validation set)

Result                                   Count    Percentage
Correct alignments                       344      96.63%
Alignments we are working on fixing      +5       98.03%


SNP Identification and Machine Learning

Algorithms applied independently for SNP identification:

Polyphred and Polybayes both utilize alignment and base quality information to identify potential polymorphisms.

Both programs yield a high number of false positives

Machine Learning: utilizes sequence-based information and compares actual base calls to proposed calls to develop SNP identification filters specific to pine

Parameters under consideration: local and global quality scores, frequencies of major and minor alleles, local and global SNP frequencies, relative positions in sequence to existing variation, and Polyphred and Polybayes scores

SNP Identification Overview

Examine features to improve the accuracy of SNP location prediction

Utilize machine learning to apply the features

Refine the accuracy of the learning algorithm through adjustments to feature representation (iterative process)

Utilize the final classifier against the large re-sequenced set to improve accuracy of SNP calls originating from Polybayes and Polyphred

Features Selected

Description                           Representation
Sequence Depth                        Continuous
Variation Type                        Categorical
Polybayes Score                       Continuous
Polyphred Score                       Continuous
Freq of major/minor alleles           Continuous
Max quality of major/minor alleles    Continuous
Local average quality                 Continuous
Overall average quality               Continuous
Alignment Quality                     Continuous

Feature Representation

Sequence depth: count of the number of sequences in the alignment at the position of variation. Not all sequences in the alignment necessarily overlap the position of variation, so this number can differ from the total number of sequences in the alignment.

Variation type: variation type can be SNP or INDEL

PolyBayes score: the PolyBayes program assigns a Bayesian posterior probability value for each called SNP using the frequency priors given for observing a variation at that position

Polyphred score: Polyphred assigns a score calculated primarily from sequence depth and quality score.

Feature Representation

Base frequencies: the number of occurrences of different bases at the position of variation is important in determining a polymorphic position. Frequencies of the first (major allele) and the second (minor allele) are represented as a ratio to sequence depth.

Relative distance: sequence quality at the ends of the alignment tends to be poor due to inherent limitations of current sequencing technology. SNP position was represented as the relative distance, i.e. the ratio of the distance in the consensus sequence from the closest end.

Feature Representation

Sequence quality: variation may be observed because of a poor quality base. Based on the base frequencies calculated:
- maximum qualities of the major and minor alleles
- average qualities of major and minor alleles

Alignment quality: misalignment of bases caused by sequence alignment programs sometimes results in an erroneous SNP call. In the neighborhood of the SNP (+/- 10 bases), all the mismatches with the consensus sequence are given a penalty, and the penalty is greater if the mismatches are contiguous.
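Several of the features described above can be read directly off one alignment column. The sketch below computes depth, major and minor allele frequencies, their maximum qualities, and a relative distance. It is not PineSAP code: the input layout (one entry per read, `None` where a read does not cover the position) is an assumption, and normalizing the distance by the consensus length is also an assumption, since the slides do not state the denominator.

```python
from collections import Counter


def column_features(column, quals, position, consensus_length):
    """Features for one variant position in a multiple alignment."""
    covered = [(b, q) for b, q in zip(column, quals) if b is not None]
    depth = len(covered)                         # reads overlapping the site
    ranked = Counter(b for b, _ in covered).most_common(2)
    major, major_n = ranked[0]
    minor, minor_n = ranked[1] if len(ranked) > 1 else (None, 0)

    def max_quality(allele):
        return max((q for b, q in covered if b == allele), default=0)

    # Distance to the closest end of the consensus, as a fraction of its length.
    distance = min(position, consensus_length - 1 - position)
    return {
        "depth": depth,
        "major_freq": major_n / depth,
        "minor_freq": minor_n / depth,
        "max_major_quality": max_quality(major),
        "max_minor_quality": max_quality(minor),
        "relative_distance": distance / consensus_length,
    }
```

A dictionary of this kind, one per candidate site, is the sort of row that a decision-tree learner can take as input alongside the Polyphred and Polybayes scores.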

Classification Goal: Prediction

Software: decision tree. The C4.5 program is open-source C code; WEKA provides an implementation of it as J48.

At each point in the construction of the decision tree, C4.5 selects the feature to test based on maximum information gain.

The goal is to generate a minimum size tree that correctly classifies all the elements in the training set.

The size of the tree is the number of nodes (decision nodes); the leaves are the classes (categories) into which the elements are distributed.
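The selection criterion named above, information gain, is the drop in class entropy produced by a split. The sketch below computes it for a threshold on one continuous feature. It illustrates plain information gain only (C4.5 itself normalizes this into a gain ratio), and the labels and values in the usage note are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on values <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0                       # the split separates nothing
    total = len(labels)
    remainder = (len(left) / total) * entropy(left) \
        + (len(right) / total) * entropy(right)
    return entropy(labels) - remainder
```

For example, with invented quality values `[5, 8, 30, 40]` labelled `["FP", "FP", "TP", "TP"]`, a threshold of 10 separates the classes perfectly and yields the maximum gain of 1 bit, whereas a threshold of 6 yields less.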

SNP Identification Datasets

Training set for loblolly pine was composed of a total of 300 validated sequences, divided to represent the relative percentages of sequence source. The testing set is composed of 120 validated sequence sets.

Training set for poplar was composed of 42 validated sequences selected at random. The testing set is composed of a total of 30 validated sequence sets.

Validation = FP, FN, TP, and TN SNP calls determined manually through observation of trace files in Consed.

Alignment and SNP Identification: Evaluation Criteria

Accuracy = (TP + TN)/total

Sensitivity = TP/(TP + FN)

Specificity = TN/(FP + TN)
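The three criteria follow directly from the four validated counts. A minimal sketch, with a function name chosen for this example:

```python
def evaluate(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }
```

With illustrative counts (not taken from the study) of 45 true positives, 50 true negatives, 2 false positives and 3 false negatives, the accuracy is 0.95 and the sensitivity is 45/48.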

Evaluation     J48      Polyphred    Polybayes
Accuracy       93.6     76.25        78.02
Sensitivity    88.21    83.22        86.54
Specificity    98.73    N/A          N/A

Evaluation     J48      Polyphred    Polybayes
Accuracy       94.6     79.35        80.24
Sensitivity    90.54    85.01        88.14
Specificity    97.23    N/A          N/A

Resulting Identifications

• SNPs have been called using PineSAP on 7424 amplicons representing 6924 Unigenes.

• Average amplicon length is 445 bp

• Average of 5.5 SNPs/amplicon

• Pine is highly polymorphic!

• Total SNPs ~41,500

• Distribution of SNPs/amplicon:
- 1404 - 0 SNPs
- 1133 - 1 SNP
- 4887 - 2 or more SNPs

• Custom lists can be exported into Illumina style input

**option for adding IUPAC codes for SNPs in flanking sequence
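The IUPAC option can be illustrated with a short formatter: the target SNP is written in brackets and any other known SNP inside the flank is replaced by its two-base IUPAC ambiguity code. The bracketed `[A/G]` layout, the function name, and the argument layout are assumptions for this sketch and are not taken from PineSAP; only biallelic SNPs are handled.

```python
# Standard IUPAC ambiguity codes for two-base combinations.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def format_snp(consensus, pos, alleles, other_snps, flank=50):
    """Return flanking sequence with the target SNP as [X/Y].

    other_snps maps 0-based positions to allele pairs; those positions are
    written as IUPAC codes when they fall inside the flank.
    """
    start = max(0, pos - flank)
    end = min(len(consensus), pos + flank + 1)
    out = []
    for i in range(start, end):
        if i == pos:
            out.append("[" + "/".join(alleles) + "]")
        elif i in other_snps:
            out.append(IUPAC[frozenset(other_snps[i])])
        else:
            out.append(consensus[i])
    return "".join(out)
```

Marking neighbouring SNPs this way lets an assay designer see which flanking positions are themselves variable.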

Alignment and SNP Identification: SNP Formatting

Alignment and SNP Identification: Illumina Design


Data Storage and Release

SNP Identifications

GENOTYPING

dbSNP NCBI SNP database

SNP Discovery: dbSNP database

SNP data submitted to dbSNP

dbSNP processing of SNPs

SNPs submitted by research community (submitted SNPs = ss#)

Unique mapping to a genome location (reference SNP = rs#)

(by 2hit-2allele)

Summary

PineSAP improves on:

• Inaccuracies introduced by using Phrap to align sequences

• Time which would be required by using an aligner such as ProbconsRNA or ClustalW on its own

PineSAP has a 98% success rate when used to align loblolly resequencing data.

PineSAP identified a successful list of features to enhance polymorphism predictions and generated a prediction accuracy of 93%

PineSAP provides a full alignment and polymorphism detection system that can be adapted to specific genomes

More Information…

Phred/Phrap/Consed: www.phrap.org/

Polyphred: droog.mbt.washington.edu/PolyPhred.html

Polybayes: genomeold.wustl.edu/groups/informatics/software/polybayes/
