Database Searching for Similar Sequences

Post on 19-Dec-2015


Page 1: Database Searching for Similar Sequences

Search a sequence database for sequences that are similar to a query sequence.

Provide a list of database sequences with which the query sequence can be aligned well.

Key issue: efficiency

Page 2: Database Searching for Similar Sequences: Methods

Smith-Waterman requires on the order of N^2 L computations.

Popular database searching methods (heuristic methods):
  FASTA [Pearson & Lipman, 1988]
  BLAST [Altschul et al., 1990]

Tradeoff of using the fast heuristic methods:
  Accuracy (sensitivity and selectivity)

Page 3: FASTA

Problem with the Smith-Waterman algorithm: too many calculations are "wasted" by comparing regions that have nothing in common.

Initial insight: regions that are similar between two sequences are likely to share short stretches that are identical.

Basic method: look for similar regions only near short stretches that match exactly ("hit and extend" sequence searching).

Page 4: Diagonal Method Example

s = H A R F Y A A Q I V L   (positions 1 to 11)
t = V D M A A Q I A         (positions 1 to 8)

Look-up table (positions of each residue in s):
  A: 2, 6, 7
  F: 4
  H: 1
  I: 9
  L: 11
  Q: 8
  R: 3
  V: 10
  Y: 5

Offsets (position in s minus position in t) for each residue of t:
  t1 = V: +9
  t2 = D: (none)
  t3 = M: (none)
  t4 = A: -2, +2, +3
  t5 = A: -3, +1, +2
  t6 = Q: +2
  t7 = I: +2
  t8 = A: -6, -2, -1

Offset vector (one counter per offset, from -7 to +9): the highest count is 4, at offset +2.
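The look-up table and offset vector of the diagonal method can be sketched in a few lines of Python. This is a minimal illustration rather than FASTA's actual implementation: the function names `build_lookup` and `offset_vector` are hypothetical, the k-tuple size is 1 as in the example, and positions are 1-based to match the slide.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def build_lookup(s: str, k: int = 1) -> Dict[str, List[int]]:
    """Map every k-tuple of s to the 1-based positions where it starts."""
    table: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(s) - k + 1):
        table[s[i:i + k]].append(i + 1)
    return dict(table)


def offset_vector(s: str, t: str, k: int = 1) -> Counter:
    """Count k-tuple matches per offset (position in s minus position in t)."""
    table = build_lookup(s, k)
    offsets: Counter = Counter()
    for j in range(len(t) - k + 1):
        for i in table.get(t[j:j + k], []):
            offsets[i - (j + 1)] += 1
    return offsets


s = "HARFYAAQIVL"
t = "VDMAAQIA"
offsets = offset_vector(s, t)
best_offset, hits = offsets.most_common(1)[0]
print(best_offset, hits)  # prints "2 4"
```

The offset with the most hits (+2 here) marks the diagonal on which s and t share the run AAQI, which is where a more careful alignment would be attempted.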

Page 5: Limitations of FASTA

FASTA can miss significant similarity since:

For nucleic acids, due to codon "wobble", DNA sequences may look like XXy where X's are conserved and y's are not.
  GGuUCuACgAAg and GGcUCcACaAAA both code for the same peptide sequence (Gly-Ser-Thr-Lys) but they don't match with k-tuple size of 3 or higher.

For proteins, similar sequences do not have to share identical residues.
  Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly but there is no match with k-tuple of size 2.
  Asp-Lys-Val is quite similar to Glu-Arg-Ile yet it is missed even with k-tuple size of 1.

Score? Ala-Ala-Ala-Ala-Ala vs Ala-Ala-Ala-Ala-Ala
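The three examples above can be checked mechanically by listing the k-tuples two sequences have in common. The sketch below is only an illustration: `ktuples` and `shared_ktuples` are hypothetical helper names, and the peptides are written with the standard one-letter amino acid codes (Gly = G, Asp = D, Lys = K, Glu = E, Arg = R, Val = V, Ile = I).

```python
from typing import Set


def ktuples(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_ktuples(a: str, b: str, k: int) -> Set[str]:
    """Return the k-tuples that occur in both sequences."""
    return ktuples(a, k) & ktuples(b, k)


# Codon wobble: same peptide, but no exact 3-tuple in common.
print(shared_ktuples("GGUUCUACGAAG", "GGCUCCACAAAA", 3))  # prints "set()"

# Similar peptides with no shared 2-tuple.
print(shared_ktuples("GDGKG", "GEGRG", 2))  # prints "set()"

# Similar peptides with no shared residue at all.
print(shared_ktuples("DKV", "ERI", 1))  # prints "set()"
```

An exact-match seeding step has nothing to start from in any of these cases, which is the limitation the slide describes.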

Page 6: BLAST

What does BLAST stand for?

Basic Local Alignment Search Tool

Page 7: BLAST

BLAST is similar to FASTA, but it searches for words which score above T rather than words that match exactly. It is also faster because its implementation was optimized to work with parallel UNIX architectures from an early stage.

Reference
S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman. Basic Local Alignment Search Tool. J. Mol. Biol. 215:403-410 (1990)

Page 8: BLAST basics

BLAST is mainly a 3-step algorithm:

Compile list of high-scoring strings (words)

Search for hits; each hit gives a seed

Extend seeds to obtain segment pairs

Page 9: BLAST

For protein sequences, the list of high-scoring words consists of all words of w characters that score at least T with some word in the query sequence (w = 3 or 4 for protein searches, 11 or 12 for nucleotide sequences).

Search for "hits" using a hash table or a finite state machine.

Key concept: searching for words which score above T rather than words that match exactly.

Page 10: BLAST method for proteins

1. Compile a list of words which give a score of at least T when paired with the query sequence.

Example using PAM-120 for query sequence ACDE (w = 4, T = 17):

          A  C  D  E
  ACDE = +3 +9 +5 +5 = 22

Try all possibilities:
  AAAA = +3 -3  0  0 =  0  no good
  AAAC = +3 -3  0 -7 = -7  no good
  ...too slow, try directed change

Page 11: Generating word list

          A  C  D  E
  ACDE = +3 +9 +5 +5 = 22

Change 1st position to all acceptable substitutions:
  gCDE =  1 9 5 5 = 20  ok (= pCDE, sCDE, tCDE)
  nCDE =  0 9 5 5 = 19  ok (= dCDE, eCDE, vCDE)
  iCDE = -1 9 5 5 = 18  ok (= qCDE)
  kCDE = -2 9 5 5 = 17  ok (= mCDE)

Change 2nd position: can't, since all alternatives are negative and the other three positions only add up to 13.

Change 3rd position in combination with first position:
  gCnE = 1 9 2 5 = 17  ok

Continue (use recursion).
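The "directed change" idea can be sketched as a recursive search that abandons a partial word as soon as even the best possible completion cannot reach T. The code below is a hedged illustration, not BLAST's implementation. `SLIDE_SCORES` holds only the pair scores that are legible in the example above, and every other pair gets the placeholder `UNKNOWN_PAIR_SCORE`, which is an assumption and not a PAM-120 value. The output is therefore only the part of the word list that those scores determine, not the full PAM-120 list.

```python
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Pair scores taken from the ACDE example; treated as symmetric.
SLIDE_SCORES: Dict[Tuple[str, str], int] = {
    ("A", "A"): 3, ("C", "C"): 9, ("D", "D"): 5, ("E", "E"): 5,
    ("A", "C"): -3, ("A", "D"): 0, ("A", "E"): 0, ("C", "E"): -7,
    ("D", "N"): 2,
}
for residues, value in (("GPST", 1), ("NV", 0), ("IQ", -1), ("KM", -2)):
    for r in residues:
        SLIDE_SCORES[("A", r)] = value

UNKNOWN_PAIR_SCORE = -8  # placeholder assumption for pairs not on the slides


def pair_score(a: str, b: str) -> int:
    """Look up a pair score in either order, falling back to the placeholder."""
    if (a, b) in SLIDE_SCORES:
        return SLIDE_SCORES[(a, b)]
    return SLIDE_SCORES.get((b, a), UNKNOWN_PAIR_SCORE)


def neighborhood(query_word: str, threshold: int) -> Dict[str, int]:
    """Return all words scoring at least threshold against query_word."""
    n = len(query_word)
    # best_suffix[i] = highest score obtainable from positions i..n-1
    best_suffix: List[int] = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        best_here = max(pair_score(query_word[i], c) for c in AMINO_ACIDS)
        best_suffix[i] = best_suffix[i + 1] + best_here

    found: Dict[str, int] = {}

    def extend(prefix: str, score: int) -> None:
        i = len(prefix)
        if i == n:
            found[prefix] = score
            return
        for c in AMINO_ACIDS:
            new_score = score + pair_score(query_word[i], c)
            # Prune: even a perfect remainder cannot reach the threshold.
            if new_score + best_suffix[i + 1] >= threshold:
                extend(prefix + c, new_score)

    extend("", 0)
    return found


words = neighborhood("ACDE", 17)
print(words["ACDE"], words["GCDE"], words["KCDE"], words["GCNE"])  # prints "22 20 17 17"
```

The pruning test is the same argument the slide makes for the 2nd position: the other three positions contribute at most 13, so any substitution scoring below 4 there is rejected without enumerating its completions.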

Page 12: BLAST method for proteins

2. Scan the database for hits with the compiled list of words.
  Use a finite state machine (the approach actually used).
  Calculate a state transition table that tells what state to go to based on the next character in the sequence.

3. Extend hits in both directions to form segment pairs (without allowing gaps).
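Step 3 can be sketched as follows. This is an illustration under stated assumptions, not the BLAST source: the function name `extend_hit`, the `score` callback, and the `drop_off` stopping rule (give up once the running score falls more than `drop_off` below the best score seen) are choices made for this example, and the +1/-1 scoring in the usage lines stands in for a real substitution matrix.

```python
from typing import Callable, Tuple


def extend_hit(query: str, subject: str, q_start: int, s_start: int,
               word_len: int, score: Callable[[str, str], int],
               drop_off: int) -> Tuple[int, int, int, int]:
    """Extend a seed without gaps; return (q_begin, s_begin, length, score)."""
    seed_score = sum(score(query[q_start + k], subject[s_start + k])
                     for k in range(word_len))

    def extend(q_pos: int, s_pos: int, step: int) -> Tuple[int, int]:
        best = running = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            running += score(query[q_pos], subject[s_pos])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > drop_off:
                break  # extension has fallen too far below its best score
            q_pos += step
            s_pos += step
        return best, best_len

    right_score, right_len = extend(q_start + word_len, s_start + word_len, +1)
    left_score, left_len = extend(q_start - 1, s_start - 1, -1)
    return (q_start - left_len, s_start - left_len,
            left_len + word_len + right_len,
            seed_score + left_score + right_score)


def match_mismatch(a: str, b: str) -> int:
    """Toy scoring for the example: +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


# Seed "CGT" at index 3 (0-based) in both sequences.
print(extend_hit("GGACGTACGG", "TTACGTACTT", 3, 3, 3, match_mismatch, 2))
# prints "(2, 2, 6, 6)", i.e. the segment pair ACGTAC / ACGTAC
```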

Page 13: BLAST method for proteins

Example of a finite state machine for string matching (input alphabet: a, b, c)

Word: ababaca

Figure: state diagram with states 0 to 7. The forward transitions 0 to 1 to ... to 7 are labeled a, b, a, b, a, c, a; the remaining labeled edges (a or b) lead back to earlier states.

Database sequence: bcabccaaababacababacabb
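A state transition table for a single word can be built and run as in the sketch below. It uses the textbook construction for a string-matching automaton (state q means "the last q characters read equal the first q characters of the word"); the function names are hypothetical, and BLAST itself builds one machine for the whole word list rather than one per word.

```python
from typing import Dict, List, Tuple


def build_transitions(word: str, alphabet: str) -> Dict[Tuple[int, str], int]:
    """For each (state, character), give the next state of the matcher."""
    table: Dict[Tuple[int, str], int] = {}
    for state in range(len(word) + 1):
        for ch in alphabet:
            seen = word[:state] + ch
            # Longest prefix of the word that is a suffix of what was read.
            k = min(len(word), len(seen))
            while k > 0 and not seen.endswith(word[:k]):
                k -= 1
            table[(state, ch)] = k
    return table


def find_hits(word: str, alphabet: str, text: str) -> List[int]:
    """Return 0-based start positions of every occurrence of word in text."""
    table = build_transitions(word, alphabet)
    state = 0
    starts: List[int] = []
    for i, ch in enumerate(text):
        state = table[(state, ch)]
        if state == len(word):
            starts.append(i - len(word) + 1)
    return starts


print(find_hits("ababaca", "abc", "bcabccaaababacababacabb"))  # prints "[8, 14]"
```

Each database character is read exactly once, and the two overlapping occurrences are both reported because the machine does not reset to state 0 after a hit.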

Page 14: Exercise

Construct a finite state machine that recognizes the word:

ATG

Assume the sequence is a nucleotide sequence.

Page 15: BLAST Method for DNA

1. Make a list of all words of length w in the query sequence (often w = 11 or 12).

2. Compress the database by packing 4 nucleotides into a single byte (use an auxiliary table to tell you where sequences start and stop within the compressed database). This doesn't allow for unspecified bases (wildcards).

Page 16: BLAST Method for DNA

3. Compress the words from the query sequence the same way.

4. Search the compressed database for matches with the compressed words.
  Since all frames of the query sequence are considered separately, any match of length w >= 11 must contain a match of length 8 that lies on a byte boundary of one of the words from the query sequence. The search can therefore scan one (packed) byte at a time, improving speed 4-fold over comparing one nucleotide at a time.
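The packing step and the byte-boundary argument can be illustrated with a short sketch. The 2-bit codes in `CODE` (A = 0, C = 1, G = 2, T = 3) and both function names are assumptions made for this example; the slides do not specify the encoding.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def pack(seq: str) -> bytes:
    """Pack 4 nucleotides per byte, first base in the highest bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            if base not in CODE:
                # Two bits leave no room for wildcards such as N.
                raise ValueError(f"cannot pack base {base!r}")
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a final partial byte with zeros
        out.append(byte)
    return bytes(out)


def aligned_bytes_inside(start: int, length: int) -> int:
    """Count whole packed bytes fully covered by [start, start + length)."""
    first_boundary = -(-start // 4) * 4  # round start up to a multiple of 4
    return max(0, (start + length - first_boundary) // 4)


print(pack("ACGT"))  # prints "b'\x1b'"
print([aligned_bytes_inside(s, 11) for s in range(4)])  # prints "[2, 2, 2, 2]"
print([aligned_bytes_inside(s, 10) for s in range(4)])  # prints "[2, 1, 2, 2]"
```

A match of length 11 covers two whole bytes (8 nucleotides) whatever its position relative to the byte boundaries, while a match of length 10 may cover only one. The zero padding of the last byte is indistinguishable from the base A, which is one reason an auxiliary table of sequence start and stop positions is needed.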

Page 17: Low-Complexity Regions

Low-complexity regions are segments that contain certain bases or amino acids more often than one would expect in "normal" nucleotide or protein sequences.

Problem: if the query sequence has a stretch of unusual base composition (e.g., A-T rich) or a repeated sequence element (e.g., Alu sequence), there will be many hits with "uninteresting" regions.

Page 18: Low-Complexity Regions

Solution:

Make a list of the words occurring very frequently (more frequently than expected by chance).

Remove these words from the query list of words before searching the database. (The words are replaced by strings of Xs.)
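A toy version of this masking step is sketched below. It makes two simplifying assumptions that are not in the slides: word frequencies are counted within the query itself, and "more frequently than expected by chance" is replaced by a fixed `max_count` cutoff. The function name `mask_frequent_words` is hypothetical.

```python
from collections import Counter


def mask_frequent_words(seq: str, w: int, max_count: int) -> str:
    """Replace every word of length w seen more than max_count times by Xs."""
    counts = Counter(seq[i:i + w] for i in range(len(seq) - w + 1))
    masked = list(seq)
    for i in range(len(seq) - w + 1):
        if counts[seq[i:i + w]] > max_count:
            masked[i:i + w] = "X" * w
    return "".join(masked)


print(mask_frequent_words("ATATATATATGCGTAC", 4, 2))  # prints "XXXXXXXXXXGCGTAC"
```

Words that overlap a masked position no longer match anything in the database, so the A-T rich stretch produces no seeds while the rest of the query is searched as usual.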

Page 19: BLAST Statistical Significance

A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of maximum segment pairs (MSPs) given w and T.

This allows BLAST to rank matching sequences in order of "significance" and to cut off listings at a user-specified probability.

Page 20

Page 21

Page 22: Choosing Values for w and T

Trade-off: sensitivity vs. running time

Choosing a value for w
  Small w: many matches to expand
  Big w: many words to be generated
  w = 3 or 4 is a good compromise

Choosing a value for T
  Small T: greater sensitivity, more matches to expand

Page 23: BLAST Notes

May fail to find optimal MSPs
  May miss seeds if T is too stringent

Empirically, 10 to 50 times faster than Smith-Waterman

Page 24: Basic BLAST Family

BLASTN: DNA to DNA database
BLASTP: protein to protein database
TBLASTN: protein to DNA database (translated)
BLASTX: DNA (translated) to protein database
TBLASTX: DNA (translated) to DNA database (translated)

Page 25: BLAST Refinements

Gapped alignments

"Two-hit" method for extending word pairs

Iterate with position-specific matrix (PSI-BLAST)

Pattern-hit initiated BLAST (PHI-BLAST)