blast - basic local alignment search tool

28
BLAST - Basic Local Alignment Search Tool Published in 1990 by Altschul, Gish, Miller, Myers, and Lipman Originally for ungapped local comparison of sequences. It has since been expanded to involve comparisons of gapped sequences. There have been several extensions of the technique and improvements to the basic tool throughout the 17 years of its life thus far.

Upload: damia

Post on 12-Jan-2016

54 views

Category:

Documents


2 download

DESCRIPTION

BLAST - Basic Local Alignment Search Tool. Published in 1990 by Altschul, Gish, Miller, Myers, and Lipman Originally for ungapped local comparison of sequences. It has since been expanded to involve comparisons of gapped sequences. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: BLAST -  Basic Local Alignment Search Tool

BLAST - Basic Local Alignment Search Tool

Published in 1990 by Altschul, Gish, Miller, Myers, and Lipman

Originally for ungapped local comparison of sequences. It has since been expanded to involve comparisons of gapped sequences.

There have been several extensions of the technique and improvements to the basic tool throughout the 17 years of its life thus far.

Page 2: BLAST -  Basic Local Alignment Search Tool

Needleman-Wunsch, SemiGlobal Alignment, and Smith-Waterman assume we know which two sequences we need to compare.

BLAST is designed to do a database search for possible matching sequences:

1. There is no known starting point to begin the matches

2. There is not a well established format for the information stored in the data

3. It is like searching for a file in a cluttered office – see Professor Leinbach’s or Professor James’ offices for reference.

The amazing thing is that BLAST has been so successful!

Page 3: BLAST -  Basic Local Alignment Search Tool

Consider this sequence

gtcaaatgaaaggagtttctacatttatgtcggaaatgctggaaacagcttctatattaa

We want to search for possible matches to gain clues to its identity

1. Place a sliding window 11 or 12 nucleotides long over the sequence

gtcaaatgaaaggagtttctacatttatgtcggaaatgctggaaacagcttctatattaa

2. Extract that subsequence and compress it to 3 bytes

Code a as 002 c as 012 g as 102 and t as 112

Thus, the 11 characters take up 22 bits – 3 bytes with two bits unused

3. Using a Finite State Automaton, eliminate subsequences that occur with a very high frequency in the database

4. If the subsequence survives, i.e. is determined to be relatively rare, use hash table to locate sequences in the database that contain that subsequence.

5. Extend the match in both directions scoring the extension until the match drops below a predetermined threshold. If it survives for the length of the original sequence – report the result

6. Slide the window down one character and repeat steps 2 - 6

The above is only an approximate description of the BLAST algorithm.

Page 4: BLAST -  Basic Local Alignment Search Tool

Here is one of the BLAST results for our sequence:

Score = 44.1 bits (22), Expect = 0.035 Identities = 34/38 (89%), Gaps = 0/38 (0%) Strand=Plus/Plus

Query 12 GGAGTTTCTACATTTATGTCGGAAATGCTGGAAACAGC 49 ||||| || ||||||||| | |||||||||||||||||

Sbjct 5030 GGAGTGTCAACATTTATGGCTGAAATGCTGGAAACAGC 5067

With a report of:

gi|71835970|gb|DQ117988.1| Physcomitrella patens DNA mismatch repair protein MSH2 gene

The result along with 68 others was reported in about 30 seconds of searching on a Friday afternoon before a holiday. The search involved over 3 million subject sequences accounting for over 16 billion characters!

Page 5: BLAST -  Basic Local Alignment Search Tool

Let’s try to mimic what may have happened to get this alignment. (This is all hypothetical for illustrative purposes only.)

0. The user determines a scoring scheme, the sliding window size, the window cutoff score and a threshold for the extension. Let’s choose

score: match =1 mismatch = -1

window size = 11 window cutoff score = 9

extension threshold = 4 below present maximum

minimum reportable sequence length = 25

1. BLAST has compiled a preliminary list of “words” that are considered as significant. One of the words is ATGCTGGAAAC.

2. This sequence is found in the MSH2 Gene of the moss. So the match begins.

Query: ATGCTGGAAAC

Moss: ATGCTGGAAAC

Page 6: BLAST -  Basic Local Alignment Search Tool

3. We now begin the extension to the right and our score of 11 advances to a score of 14 (our present maximum) since we have three more matches

Query: ATGCTGGAAACAGC

Moss: ATGCTGGAAACAGC

4. We continue the extension and have 3 successive mismatches, 1 match, and then two more mismatches. The score is now four below the previous maximum of 14

Query: ATGCTGGAAACAGCTGACTA Moss: ATGCTGGAAACAGCCACCAG We prune this part of the sequence

5. We now repeat the extension process going to the left and obtain the reported sequence of 27 characters

Query 12 GGAGTTTCTACATTTATGTCGGAAATGCTGGAAACAGC 49 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Moss 5030 GGAGTGTCAACATTTATGGCTGAAATGCTGGAAACAGC 5067

Page 7: BLAST -  Basic Local Alignment Search Tool

Let’s look at a real BLAST report using the blast2seq program on the NCBI BLAST page.

For this example we compared pig -globin and Human Hemoglobin, beta mRNA.

BLAST2Res.html

Using accession numbers: 28302128(Human) and X86791.1(pig)

Note the parameters that have been set by the user (or default) prior to beginning the search.

Let’s look at a real BLAST report using the blast2seq program on the NCBI BLAST page.

For this example we compared pig -globin and Human Hemoglobin, beta mRNA.

BLAST2Res.html

Using accession numbers: 28302128(Human) and X86791.1(pig)

Let’s look at a real BLAST report using the blast2seq program on the NCBI BLAST page.

For this example we compared pig -globin and Human Hemoglobin, beta mRNA.

BLAST2Res.html

Using accession numbers: 28302128(Human) and X86791.1(pig)

Page 8: BLAST -  Basic Local Alignment Search Tool

For protein sequences the window is 3 - 4 amino acids long:

1 Amino Acid = 3 Nucleotides

3 Amino Acids = 9 Nucleotides = 18 Bits coded

4 Amino Acids = 12 Nucleotides = 3 Bytes coded

Alternative: One byte for each of Amino Acid Letters then a window of 3 Amino Acids would take 3 Bytes.

We will choose a window length of 3 Bytes for our example.

The BLAST algorithm has three main parts

1. Compile a list of pairwise alignments with the window called word pairs.

2. Scan this list for word pairs that meet some threshold score, T

3. Extend the word pairs that surpass some cutoff score S at which point those hits will be reported to the user. Scores are calculated by some scoring matrix, say BLOWSUM62.

Page 9: BLAST -  Basic Local Alignment Search Tool

Let’s apply these rules to the following sequences:

Bovine RBP

1 hmsatakgrv rllnnwdvca dmvgtftdte dpakfkmkyw gvasflqkgn ddhwiidtdy61 etfavqyscr llnldgtcad sysfvfardp sgfspevqki vrqrqeelcl arqyrliphn121 gycdgksern il

Human RBP 1 mkwvwallll aawaaaerdc rvssfrvken fdkarfsgtw yamakkdpeg lflqdnivae61 fsvdetgqms atakgrvrll nnwdvcadmv gtftdtedpa kfkmkywgva sflqkg

These are rather short sequences and the matching subsequences are easy for us

to find. However, the computer has a narrow field of vision and needs to follow the Algotithm.

Page 10: BLAST -  Basic Local Alignment Search Tool

A Retinol Binding Protein

Page 11: BLAST -  Basic Local Alignment Search Tool

BLOSUM62

Page 12: BLAST -  Basic Local Alignment Search Tool

Step 1:

First calculate the score for matching word in the sliding window with itself:

msa = (5 + 4 + 4) = 13

Set a threshold (this is arbitrary), say T = 9 now match all possible word pairs with the word in the window

Words > T Words < T

msc = (5 + 4 + 9) = 17 msw = (5 + 4 – 3) = 6

mst = (5 + 4 + 0) = 9 msf = (5 + 4 – 2) = 7

mss = (5 + 4 + 1) = 10 mva = (5 – 3 + 4) = 6

mta = (5 + 5 + 4) = 14 mia = (5 – 3 + 4) = 6

mga = (5 + 0 + 4) = 9 dsa = (-3 + 4 + 4) = 5

esa = (1 + 4 + 4) = 9 hsa = (-2 + 4 + 4) = 6

This is only a partial list. There are 8,000 possible word pairs

We test the subject sequence for any occurrence of the words in the word pair that are above the threshold and are considered to be significant, i.e. occurring in the database with low frequency.

Page 13: BLAST -  Basic Local Alignment Search Tool

Next we (arbitrarily) set a cut off score for our extension, say S = 15. If the local alignment falls below this score, we stop aligning the sequences. hmsatakgrvrllnnwdvcadmvgtftdtedpakfkmkywgvasflqkgnddhwiidtdy

fsvdetgqmsatakgrvrllnnwdvcadmvgtftdtedpakfkmkywgvasflqkg

The box encloses our original match. We extend the match to the right and we come to the end of the second sequence (Human RBP) so we must stop. We now extend the original match to the left and the first sequence (Bovine RBP) ends with a mismatch. So, we do not include it in the report.

Here is the report from the actual NCBI BLAST 2 Sequences program:

Score = 104 bits (259), Expect = 1e-21 Identities = 48/48 (100%), Positives = 48/48 (100%), Gaps = 0/48 (0%)

Query 69 MSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKG 116

MSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKG

Sbjct 2 MSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKG 49

Page 14: BLAST -  Basic Local Alignment Search Tool

The Effect of the Threshold Parameter

In the original papers on BLAST by Steven Altschul, David Lipman et. al. in 1990 and 1997 they called the threshold parameter T. In the NCBI BLAST program, this threshold is called f.

Jonathan Pevsner (Bioinformatics and Functional Genomics, Wiley-Liss, 2003, p 102) reports on the effect of using different threshold values for performing a gapped BLAST search using RBP4 (NP_006735) as the Query sequence

f = 11 (default) f = 5 f = 17

Number of sequences in db 1,046,476 1,046,476 1,046,476

Number of hits to the db 129,839,417 2,200,945,350 12,002,487

Number of extensions 5,198,652 589,935,555 61,838

Number of successful exten. 8,377 13,145 1,117

Number of HSPs gapped 145 146 93

Page 15: BLAST -  Basic Local Alignment Search Tool

The FASTA

Heuristic Search Algorithm

Page 16: BLAST -  Basic Local Alignment Search Tool

• Developed by Pearson & Lipman in 1988

• Part of a family of algorithms known as FASTX Algorithms

• Compares Query to every sequence in the database

• Generally takes more time than BLAST

• Uses a predetermined scoring matrix

Nucleotides: Identity, Cantor-Jukes, Kimura

Proteins: PAM250, Blosum62, Blosum50

• More Sensitive than BLAST especially when sequences are repetitive

• Can do gapped comparisons

FASTA

Page 17: BLAST -  Basic Local Alignment Search Tool

Algorithmic Details

• Break the algorithm into words (similar to the sliding window in BLAST, although these are smaller) sometimes called a ktup (for k-tuple)

Generally, 1 – 2 Amino Acids for proteins 4 – 6 Nucleotides for DNA

Larger word sizes are usually less sensitive.• Gap Penalties (set by user – these are only suggestions)

Opening or origination: -12 for proteins -16 for DNAExtension or Length: -2 for proteins -4 for DNA

Page 18: BLAST -  Basic Local Alignment Search Tool

Step1 – Break Query Sequence Into Words

Our Example Sequence:

R K T U R K R K T U

R K

K T

T U

U R

R K

K R

R K

K T

T U

These words will be placed within a table for comparison with the sequence from the database. Initially score based on the number of identities

Page 19: BLAST -  Basic Local Alignment Search Tool

Compare

RKTURKRKTU with ARKURWKTUR

RK KT TU UR RK KR RK KT TU

AR

RK

KU

UR

RW

WK

KT

TU

UR

Hash Table

Key Address

RK 1, 5, 7

KT 2, 8

TU 3, 9

UR 4

KR 6

Page 20: BLAST -  Basic Local Alignment Search Tool

Compare

RKTURKRKTU with ARKURWKTUR

Hash Table

Key Address

RK 1, 5, 7

KT 2, 8

TU 3, 9

UR 4

KR 6

RK KT TU UR RK KR RK KT TU

AR

RK * * *

KU

UR *

RW

WK

KT * *

TU * *

UR *

Page 21: BLAST -  Basic Local Alignment Search Tool

FASTA – Steps 2, 3, & 4

2. In the complete query sequence locate the 10 regions having the highest score with the database sequence. Rescore those regions with a PAM or Blosum matrix. If necessary, trim the ends of the region if the result is a higher score. Call this score init1

3. Try to join some of the regions from step 2 even with gaps inserted if it makes a better overall score. Call this score an initn score.

4. Use the full Smith – Waterman Algorithm on a narrow band, usually chosen to be 32 residues wide around the best matches from step 3. This is called the opt score.

Page 22: BLAST -  Basic Local Alignment Search Tool

Graphical Display of the

Four Step FASTA Process

Step 1 Step 2

Step 3 Step 4

Page 23: BLAST -  Basic Local Alignment Search Tool

Some Problems with FASTA

1. A sequence with 50% similarity could be missed

AKAKAKAKAKAKAKAKAKAK

ARARARARARARARARARAR

2. The join in step 4 restricts the band to a width 32. Suppose that 2 proteins are identical except that one has a gap of length greater than 32 in the middle. FASTA will only pick up on one side of the gap.

3. FASTA picks up on only perfect matches and ignores conservative replacement in proteins, i.e. it may miss sequences with functional homology, but not identity.

NOTE: using a smaller word size can help with 1 and 3.

No Word Matches

Page 24: BLAST -  Basic Local Alignment Search Tool

The Following Graphic Compares The Three Local Alignment Algorithms

Speed is Graphed vs. Quality of the Results

Page 25: BLAST -  Basic Local Alignment Search Tool

A FASTA search engine is located at the BioChem Department server at the University of Virginia and can be reached through the URL

http://fasta.bioch.virginia.edu/fasta

This site allows the user to choose the database against which the query sequence will be run. In fact, the user can choose multiple databases!

The user can then set the following parameters:

ktup length

scoring matrix

gap origination

gap exension

A FASTA search engine is located at the BioChem Department server at the University of Virginia and can be reached through the URL

http://fasta.bioch.virginia.edu/fasta

This site allows the user to choose the database against which the query sequence will be run. In fact, the user can choose multiple databases!

The user can then set the following parameters:

ktup length

scoring matrix

gap origination

gap exension

A FASTA search engine is located at the BioChem Department server at the University of Virginia and can be reached through the URL

http://fasta.bioch.virginia.edu/fasta

This site allows the user to choose the database against which the query sequence will be run. In fact, the user can choose multiple databases!

The user can then set the following parameters:

ktup length

scoring matrix

gap origination

gap exension

Page 26: BLAST -  Basic Local Alignment Search Tool

This site can also do a FASTA 2 sequence comparison

Page 27: BLAST -  Basic Local Alignment Search Tool

Here is the result of that search (whale hemoglobin vs hippo hemoglobin :

Page 28: BLAST -  Basic Local Alignment Search Tool

We also compare the whale hemoglobin with the same subunit of the pig hemoglobin