Post on 22-Dec-2015
![Page 1: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/1.jpg)
Rationale for searching sequence databases
June 22, 2005
Writing topics due today
Writing projects due July 8
Learning objectives: Review of the Smith-Waterman program, the FASTA and BLAST programs, PSI-BLAST.
Workshop: Use PSI-BLAST to determine sequence similarities. Use BLASTX to gain information on gene structure.
![Page 2: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/2.jpg)
FASTA (Pearson and Lipman 1988)
FASTA combines a word search with the Smith-Waterman algorithm:
• The query sequence is divided into small words of a certain size.
• The initial comparison of the query sequence to the database is performed using these “words”.
• If these “words” are located on the same diagonal in an array, the region surrounding the diagonal is analyzed further.
• Search time is proportional only to the size of the database (not database × query sequence).
![Page 3: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/3.jpg)
The FASTA program uses hash tables. These tables speed up the word search.
Query sequence = TCTCTC (positions 1 to 6)
Database sequence = TTCTCTC (positions 1 to 7)
You choose a word size of 4 for your table (the total number of possible words in your table is $4^4 = 256$).

| Word (of 256 possible) | Position within query | Position within DB | Offset (Q minus DB) |
|---|---|---|---|
| TCTC | 1, 3 | 2, 4 | -1 or -3 or 1 |
| CTCT | 2 | 3 | -1 |
| TTCT | | 1 | ? |
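The word table for this example can be reproduced with a short sketch. This is only an illustration of the hashing idea, not FASTA's actual code; the function names `word_positions` and `offsets` are made up for this example, and positions are 1-based to match the slide.

```python
from collections import Counter, defaultdict

def word_positions(seq, k):
    """Map each k-letter word to the 1-based positions where it starts in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return table

def offsets(query, db, k):
    """Count offsets (query position minus database position)
    over every word that occurs in both sequences."""
    q_table = word_positions(query, k)
    d_table = word_positions(db, k)
    counts = Counter()
    for word, q_positions in q_table.items():
        for q in q_positions:
            for d in d_table.get(word, []):
                counts[q - d] += 1
    return counts

query, db = "TCTCTC", "TTCTCTC"
print(dict(word_positions(query, 4)))   # words found in the query
print(dict(word_positions(db, 4)))      # words found in the database sequence
print(offsets(query, db, 4))            # how often each offset occurs
```

The offset that occurs most often (here -1, seen three times) marks the diagonal on which the two sequences share the most words.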
![Page 4: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/4.jpg)
FASTA Steps
1. Local regions of identity are found. (Figure labels: different offset values; identical offset values in a contiguous sequence.)
2. Rescore the local regions using a PAM or BLOSUM matrix. Diagonals are extended.
3. Eliminate short diagonals below a cutoff score.
4. Create a gapped alignment in a narrow segment and then perform an S-W alignment.
![Page 5: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/5.jpg)
Summary of FASTA steps
1. Analyzes the database for contiguous identical matches (between 5 and 10 amino acids in length, with the same offset values).
2. The longest diagonals are scored again using the PAM matrix (or another matrix). The best scores are saved as “init1” scores.
3. Short diagonals are removed.
4. Long diagonals that are neighbors are joined. The score for this joined region is “initn”.
5. An S-W dynamic programming alignment is performed around the joined sequences to give an “opt” score.
Thus, the time-consuming S-W step is performed only on the top initn-scoring sequences.
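The rescoring in step 2 can be sketched as finding the best-scoring ungapped segment along one diagonal. This is a simplified illustration, not FASTA's implementation: the function name is hypothetical, and a flat match/mismatch score (+5/-4, chosen arbitrarily) stands in for a PAM or BLOSUM matrix.

```python
def best_diagonal_segment(query, db, offset, match=5, mismatch=-4):
    """Best-scoring ungapped segment on one diagonal.

    offset = query index minus database index (identical for 0- or 1-based
    positions). The match/mismatch values are toy stand-ins for a real matrix.
    """
    best = running = 0
    # Visit every query index i whose partner j = i - offset lies inside db.
    start = max(0, offset)
    end = min(len(query), len(db) + offset)
    for i in range(start, end):
        j = i - offset
        running += match if query[i] == db[j] else mismatch
        if running < 0:
            running = 0          # a negative prefix never helps; restart
        best = max(best, running)
    return best

query, db = "TCTCTC", "TTCTCTC"
scores = {off: best_diagonal_segment(query, db, off) for off in (-3, -1, 0, 1)}
print(scores)
print("init1-like score:", max(scores.values()))
```

Taking the maximum over the candidate diagonals gives a score analogous to “init1”. Joining neighboring diagonals (“initn”) and the banded S-W step (“opt”) are not shown.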
![Page 6: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/6.jpg)
The ktup value
• The ktup (for k-tuples) value stands for the length of the word used to search for identity.
• For proteins, a ktup value of 3 would give a hash table of $20^3$ elements (8000 entries).
• The higher the ktup value, the less likely you will get a match unless it is identical (remember the dot plots).
• The lower the ktup value, the more background you will have.
• The higher the ktup value, the faster the analysis (fewer diagonals).
The following rules typically apply when using FASTA:

| ktup | Analysis |
|---|---|
| 1 | Proteins, distantly related |
| 2 | Proteins, somewhat related (default) |
| 3 | DNA, default |
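The table sizes quoted for different word lengths follow directly from the alphabet size raised to the power of ktup. A minimal sketch (the helper name `hash_table_size` is made up for this illustration):

```python
def hash_table_size(alphabet_size, ktup):
    """Number of distinct words of length ktup over the given alphabet."""
    return alphabet_size ** ktup

# 20 amino acids for proteins, 4 bases for DNA.
for label, alphabet_size in (("protein", 20), ("DNA", 4)):
    for ktup in (1, 2, 3, 4):
        print(label, ktup, hash_table_size(alphabet_size, ktup))
```

For proteins with ktup = 3 this gives 8000 entries, and for DNA with a word size of 4 it gives 256, matching the numbers used in the slides.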
![Page 7: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/7.jpg)
FASTA Versions
• FASTA: nucleotide or protein sequence searching.
• FASTx/FASTy: compares a translated DNA query sequence to a protein sequence database (forward or backward translation of the query).
• tFASTx/tFASTy: compares a protein query sequence to a DNA sequence database that has been translated into three forward and three reverse reading frames.
![Page 8: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/8.jpg)
FASTA Statistical Significance
A way of measuring the significance of a score considers the mean of the random score distribution. The difference between the similarity score for your single alignment and the mean of the random score distribution is normalized by the standard deviation of that random score distribution. This is the Z-score.
Higher Z-scores are better because the further the real score is from this mean (in standard deviation units), the more significant it is.
![Page 9: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/9.jpg)
FASTA Statistical Significance
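Z-score for a single alignment:

$$Z = \frac{\text{similarity score} - \text{mean score from database}}{\text{standard deviation from database}}$$

$$\text{Stand. Dev.} = \sqrt{\frac{\sum \text{scores}^2}{N} - \left(\frac{\sum \text{scores}}{N}\right)^2}$$

where $N$ is the total number of sequences.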
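As a rough illustration, the Z-score can be computed from a list of database scores as follows. The function name and the example scores are invented for this sketch, and it ignores any correction for sequence length that a real search program may apply.

```python
import math

def z_score(score, db_scores):
    """Z-score of one alignment score against the database score distribution."""
    n = len(db_scores)
    mean = sum(db_scores) / n
    # Population variance: mean of the squares minus the square of the mean.
    variance = sum(s * s for s in db_scores) / n - mean ** 2
    return (score - mean) / math.sqrt(variance)

# Hypothetical scores of unrelated database sequences.
background = [20, 30, 40, 50, 60]
print(z_score(110, background))  # a high-scoring hit, far above the mean
print(z_score(45, background))   # a score close to the mean
```

The first call returns a Z-score of about 4.9, while the second stays well below 1, which is the sense in which higher Z-scores indicate more significant matches.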
![Page 10: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/10.jpg)
[Figure labels: Mean similarity scores of complete database; Mean similarity scores of related records]
![Page 11: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/11.jpg)
FASTA statistics (cont.)
Using the distribution of the z-scores in the database, the FASTA program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search.
![Page 12: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/12.jpg)
The E value (false positive expectation value)
The Expect value (E) is a parameter that describes the number of “hits” one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially as the Similarity Score (S) increases (inverse relationship). The higher the Similarity Score, the lower the E value. Essentially, the E value describes the random background noise that exists for matches between two sequences. The E value is used as a convenient way to create a significance threshold for reporting results. When the E value is increased from the default value of 10 prior to a sequence search, a larger list with more low-similarity scoring hits can be reported. An E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size you might expect to see 1 match with a similar score simply by chance.
![Page 13: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/13.jpg)
E value
$$E = K \cdot m \cdot n \cdot e^{-\lambda S}$$

where K is a constant, m is the length of the query sequence, n is the length of the database sequence, λ is the decay constant, and S is the similarity score.
• If S increases, E decreases exponentially.
• If the decay constant increases, E decreases exponentially.
• If m•n increases, the “search space” increases and there is a greater chance for a random “hit”, so E increases. A larger database will increase E.
When z increases, E() decreases.
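The behavior of the E value formula can be checked numerically. In this sketch the values of K and λ are arbitrary placeholders chosen only for illustration (real values depend on the scoring system), and the function name is hypothetical.

```python
import math

def expect_value(K, m, n, lam, S):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * S)

K, lam = 0.1, 0.3          # placeholder constants, not from a real scoring system
m = 300                    # query length
for S in (40, 50, 60):     # E drops exponentially as the score rises
    print(S, expect_value(K, m, 1_000_000, lam, S))

# Doubling the database length n doubles E for the same score.
print(expect_value(K, m, 2_000_000, lam, 50) / expect_value(K, m, 1_000_000, lam, 50))
```

Because E is linear in m and n but exponential in S, a modest increase in score outweighs even a large increase in search space.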
![Page 14: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/14.jpg)
Evaluating the Results of FASTA
Best
SCORES Init1: 2847 Initn: 2847 Opt: 2847
z-score: 2609.2 E(): 1.4e-138
Smith-Waterman score: 2847; 100.0% identity in 413 overlap

Good
SCORES Init1: 719 Initn: 748 Opt: 793
z-score: 734.0 E(): 3.8e-34
Smith-Waterman score: 796; 41.3% identity in 378 overlap

Mediocre
SCORES Init1: 249 Initn: 304 Opt: 260
z-score: 243.2 E(): 8.3e-07
Smith-Waterman score: 270; 35.0% identity in 183 overlap
![Page 15: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/15.jpg)
BLAST
Basic Local Alignment Search Tool
Speed is achieved by:
• Pre-indexing the database before the search
• Parallel processing
Uses a hash table that contains neighborhood words rather than just random words.
![Page 16: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/16.jpg)
Neighborhood words
The program declares a hit if a database word, aligned with a word taken from the query sequence, has a score >= T when a scoring matrix is used.
This allows the word size W (similar to the ktup value) to be kept high (for speed) without sacrificing sensitivity.
If T is increased by the user, the number of background hits is reduced and the program will run faster.
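The effect of the threshold T can be illustrated with a toy neighborhood-word generator. Everything here is an assumption made for the example: the four-letter alphabet, the scoring function (4 for identity, 2 for a "similar" pair, -2 otherwise) and the function names are invented and are not a real BLOSUM or PAM matrix or BLAST's own code.

```python
from itertools import product

ALPHABET = "ILKR"                                  # toy alphabet
SIMILAR = {frozenset("IL"), frozenset("KR")}       # toy "conservative" pairs

def score(a, b):
    """Toy substitution score standing in for a real scoring matrix."""
    if a == b:
        return 4
    if frozenset((a, b)) in SIMILAR:
        return 2
    return -2

def neighborhood(word, T):
    """All words of the same length whose aligned score with `word` is >= T."""
    hits = []
    for candidate in product(ALPHABET, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= T:
            hits.append("".join(candidate))
    return hits

for T in (12, 10, 8):
    print(T, neighborhood("ILK", T))
```

With this toy scoring, T = 12 admits only the exact word, T = 10 admits four words, and T = 8 admits seven, so raising T shrinks the neighborhood and with it the number of background hits.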
![Page 17: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/17.jpg)
![Page 18: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/18.jpg)
Comparison Matrices
In general, the BLOSUM series is thought to be superior to the PAM series because it is derived from areas of conserved sequences.
It is important to vary the parameters when performing a sequence comparison. Similarity scores for truly related sequences are usually not sensitive to changes in scoring matrix and gap penalty.
Thus, if your “hits list” holds up after changing these parameters you can be more sure that you are detecting similar sequences.
![Page 19: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/19.jpg)
Which Program should one use?
Most researchers use methods for determining local similarities:
• Smith-Waterman (gold standard)
• FASTA
• BLAST
FASTA and BLAST do not find every possible alignment of the query with a database sequence. These are used because they run faster than S-W.
![Page 20: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/20.jpg)
What are the different BLAST programs?
• blastp compares an amino acid query sequence against a protein sequence database.
• blastn compares a nucleotide query sequence against a nucleotide sequence database.
• blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
• tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.
![Page 21: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/21.jpg)
When to use the correct program
| Problem | Program | Explanation |
|---|---|---|
| Identify unknown protein | BLASTP; FASTA3 | General protein comparison. Use ktup=2 for speed; ktup=1 for a sensitive search. |
| | Smith-Waterman | Slower than FASTA3 and BLAST but provides maximum sensitivity. |
| | TFASTX3; TFASTY3; TBLASTN | Use if a homolog cannot be found in protein databases; approx. 33% slower. |
| | PSI-BLAST | Finds distantly related sequences. It replaces the query sequence with a position-specific score matrix after an initial BLASTP search. Then it uses the matrix to find distantly related sequences. |
![Page 22: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/22.jpg)
When to use the correct program (cont. 1)
| Problem | Program | Explanation |
|---|---|---|
| Identify new orthologs | TFASTX3; TFASTY3; TBLASTN; TBLASTX | Use PAM matrix <=20 or BLOSUM90 to avoid detecting distant relationships. Search EST sequences within the same species. |
| Identify EST sequence | FASTX3; FASTY3; BLASTX; TBLASTX | Always attempt to translate your sequence into protein prior to searching. |
| Identify DNA sequence | FASTA; BLASTN | Nucleotide sequence comparison. |
![Page 23: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/23.jpg)
Choosing the database
Remember that the E value increases approximately linearly with database size. When searching for distant relationships, always use the smallest database likely to contain the homolog of interest.
Thought problem: If the E value one obtains for a search is 12 in Swiss-PROT and the E value one obtains for the same search is 74 in PIR, how large is PIR compared to Swiss-PROT?
Answer: because E scales approximately linearly with database size, PIR is about 74/12 ≈ 6 times as large as Swiss-PROT.
![Page 24: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/24.jpg)
Filtering Repetitive Sequences
Over 50% of genomic DNA is repetitive. This is due to:
• retrotransposons
• ALU region
• microsatellites
• centromeric sequences, telomeric sequences
• 5’ untranslated region of ESTs
Example of ESTs with simple low-complexity regions:
T27311
GGGTGCAGGAATTCGGCACGAGTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC
![Page 25: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/25.jpg)
Filtering Repetitive Sequences (cont. 1)
Programs like BLAST have the option of filtering out low-complexity regions.
Repetitive sequences increase the chance of a match during a database search.
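One simple way to picture low-complexity filtering is to mask windows whose Shannon entropy falls below a cutoff. This is a teaching sketch only and is not the filter that BLAST itself uses; the window size, the entropy cutoff, and the function names are arbitrary choices for the example, and the test sequence is a shortened stand-in for an EST with a TC repeat.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residues in a window."""
    counts = Counter(window)
    total = len(window)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def mask_low_complexity(seq, window=12, min_entropy=1.5):
    """Replace every window whose entropy is below min_entropy with N's."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        # Entropy is always measured on the original, unmasked sequence.
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            masked[i:i + window] = "N" * window
    return "".join(masked)

est_like = "GGGTGCAGGAATTCGGCACGAG" + "TC" * 20   # made-up, shortened example
print(mask_low_complexity(est_like))
```

The TC repeat uses only two letters, so every window inside it has an entropy of 1 bit and is masked, whereas a sequence that mixes all four bases evenly is left untouched.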
![Page 26: Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman](https://reader034.vdocuments.net/reader034/viewer/2022042717/56649d795503460f94a5d62e/html5/thumbnails/26.jpg)
PSI-BLAST
PSI: position-specific iterative.
A position-specific scoring matrix (PSSM) is constructed automatically from multiple HSPs of the initial BLAST search. The normal E value is used.
This PSSM is the new scoring matrix for a second BLAST search. A low E value is used (E = 0.001).
Result:
1) obtain distantly related sequences
2) find out the important residues that provide function or structure.
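The idea of a position-specific scoring matrix can be sketched as column-wise log-odds scores built from aligned hits. This is a heavily simplified illustration, not PSI-BLAST's procedure: it assumes ungapped, equal-length hits, a uniform background frequency, a flat pseudocount, and no sequence weighting. The function names, the four-letter alphabet, and the example hits are all invented.

```python
import math
from collections import Counter

def build_pssm(aligned, alphabet, pseudocount=1.0):
    """Column-wise log2-odds scores from equal-length aligned sequences,
    assuming a uniform background frequency (a simplification)."""
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2((counts[a] + pseudocount) / total / background)
            for a in alphabet
        })
    return pssm

def score_with_pssm(pssm, seq):
    """Sum the position-specific score of each residue in seq."""
    return sum(column[residue] for column, residue in zip(pssm, seq))

hits = ["ILK", "ILR", "LLK", "ILK"]        # made-up aligned hits
pssm = build_pssm(hits, "ILKR")
print(pssm[1])                              # column 2 strongly favors L
print(score_with_pssm(pssm, "ILK"), score_with_pssm(pssm, "KRI"))
```

Positions where one residue dominates receive large positive scores for that residue and negative scores for the others, which is how the matrix highlights residues likely to matter for function or structure.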