Gapped BLAST and PSI-BLAST: a new generation of protein database search programs


Upload: jamal

Post on 02-Feb-2016



Page 1: Gapped BLAST and  PSI-BLAST : a new generation of protein database search programs

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

Presented by 佘健生, 鄭為正, 李定達, 曾文鴻

Outline

• BLAST 1.0

• BLAST 2.0
– The two-hit method
– Gapped alignment
– PSI-BLAST

• Performance evaluation

• Discussion and Conclusion

• NCBI website

Statistical preliminaries

• HSP: high-scoring segment pair
– A locally optimal pair of segments

• S' = (λS − ln K) / ln 2
– S': normalized score
– P_i: background probability with which amino acid i occurs at random at any position
– s_ij: score for aligning the pair of amino acids i and j
– K: minor constant
– λ: constant that adjusts for the scale of the scoring matrix
– s_ij and P_i determine K and λ

• E = N / 2^{S'}
– E: expected number of distinct HSPs with normalized score at least S'
– N = mn is the search space size
– Equivalently, S' = log2(N/E)

• q_ij = P_i P_j e^{λ_u s_ij}
– q_ij: target frequency of the aligned pair of letters (i, j) within HSPs (high-scoring segment pairs)
– λ_u: the unique positive number for which the q_ij sum to 1
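As a rough illustration of the score statistics on this slide, the sketch below converts a raw score S into a normalized score S' = (λS − ln K) / ln 2 and then into an expected number of HSPs E = N / 2^{S'}. The function names are invented for this example, and the λ, K, m and n values in the demo lines are assumed placeholders, not values from the slides.

```python
import math

def normalized_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2, i.e. the score expressed in bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expected_hsps(bit_score, m, n):
    """E = N / 2^S' for a search space of size N = m * n."""
    return (m * n) / 2 ** bit_score

def bits_for_e_value(e_value, m, n):
    """S' = log2(N / E): the normalized score needed to reach a given E."""
    return math.log2(m * n / e_value)

# Demo with assumed (illustrative) parameters.
lam, k = 0.32, 0.13          # hypothetical lambda and K
m, n = 250, 50_000_000       # hypothetical query and database lengths
bits = normalized_score(80, lam, k)
print(round(bits, 1), expected_hsps(bits, m, n))
```

Because E halves with every additional bit, a fixed E-value threshold translates directly into a minimum normalized score for a given search space.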

BLAST

• Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman)

• The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities.

• BLAST is a heuristic that attempts to optimize a specific similarity measure.

• The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.

The maximal segment pair measure

• MSP (maximal segment pair): the highest-scoring pair of identical-length segments chosen from 2 sequences
– For DNA: identities +5; mismatches −4
– For protein: BLOSUM62 …

• BLAST heuristically attempts to calculate the MSP score.

• Dynamic programming (DP) is O(mn), but BLAST is O(m).


BLAST 1.0

1) Build the hash table for Sequence A.

2) Scan Sequence B for hits.

3) Extend hits.

Step 1: Build the hash table for Sequence A (3-tuple example)

For DNA:

Seq. A   = ACGTAGTA
Position = 12345678

Hash table (word: positions in Seq. A)
AAA:
AAC:
..
ACG: 1
..
AGT: 5
..
CGT: 2
..
GTA: 3, 6
..
TAG: 4
..
TTT:

For protein:

Seq. A = YGGFM

Add xyz to the hash table if Score(xyz, YGG) ≥ T;
Add xyz to the hash table if Score(xyz, GGF) ≥ T;
Add xyz to the hash table if Score(xyz, GFM) ≥ T;

T: the 'threshold' parameter. A high T yields greater speed, but weak similarities may be missed.
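The DNA case of Step 1 can be sketched in a few lines. This is a minimal illustration, not the BLAST source: a dictionary stands in for the hash table, positions are 1-based as on the slide, and the function name `build_word_index` is made up for this example.

```python
from collections import defaultdict

def build_word_index(seq, w=3):
    """Map every w-letter word in seq to the 1-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(seq) - w + 1):
        index[seq[start:start + w]].append(start + 1)
    return dict(index)

index = build_word_index("ACGTAGTA")
for word in sorted(index):
    print(word, index[word])
```

Words that never occur in Sequence A (AAA, AAC, TTT and so on) simply have no entry, which corresponds to the empty slots of the table.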

List all words in query

Query: YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ

Words: YGG GGF GFM FMT MTS TSE SEK …

Augment word list

Query: YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ

Words: YGG GGF GFM FMT MTS TSE SEK …

Candidate three-letter words: AAA, AAB, AAC, …, YYY

BLOSUM62 scores

Match:
G G F
G G Y
6 + 6 + 3 = 15

Non-match:
G G F
A A A
0 + 0 + (−2) = −2

A user-specified threshold determines which three-letter words are considered matches and non-matches.
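The word-scoring rule on this slide can be sketched as follows. Only the handful of BLOSUM62 entries needed for the two examples are included (a real implementation would load the full matrix), and the names `BLOSUM62_SUBSET`, `word_score` and `is_match` are hypothetical.

```python
# Partial BLOSUM62 table: just the pairs used in the slide's two examples.
# Keys are stored with the two letters in alphabetical order.
BLOSUM62_SUBSET = {
    ("G", "G"): 6,
    ("F", "F"): 6,
    ("F", "Y"): 3,
    ("A", "G"): 0,
    ("A", "F"): -2,
}

def pair_score(a, b):
    """Look up a symmetric substitution score."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]

def word_score(word, query_word):
    """Sum of the position-by-position substitution scores of two words."""
    return sum(pair_score(a, b) for a, b in zip(word, query_word))

def is_match(word, query_word, threshold):
    """A word is kept for the query word if its score reaches the threshold T."""
    return word_score(word, query_word) >= threshold

print(word_score("GGY", "GGF"), word_score("AAA", "GGF"))
```

With a threshold such as T = 11, GGY is accepted as a neighbor of the query word GGF while AAA is rejected.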

Query: YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ

Words: YGG GGF GFM FMT MTS TSE SEK …

Words added for GGF: GGI GGL GGM GGF GGW GGY


Store words in search tree

Search tree

Augmented list of query words

“Does this query contain GGF?”

“Yes, at position 2.”

O(1) time

Search tree

[Figure: search tree with the path G → G branching to the leaves L, M, F, W and Y]
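One way to picture the search tree is as nested dictionaries, one level per letter. The sketch below is an assumption about the data structure rather than the layout BLAST actually uses; `insert_word` and `lookup` are invented names, and the special key "positions" holds the query positions at the end of a word.

```python
def insert_word(tree, word, position):
    """Walk or create one node per letter, then record the query position."""
    node = tree
    for letter in word:
        node = node.setdefault(letter, {})
    node.setdefault("positions", []).append(position)

def lookup(tree, word):
    """Return the query positions stored for word, or [] if it is absent."""
    node = tree
    for letter in word:
        if letter not in node:
            return []
        node = node[letter]
    return node.get("positions", [])

tree = {}
# Neighbors of the query word GGF, which starts at query position 2.
for neighbor in ["GGI", "GGL", "GGM", "GGF", "GGW", "GGY"]:
    insert_word(tree, neighbor, 2)

print(lookup(tree, "GGF"), lookup(tree, "GGA"))
```

A lookup touches one node per letter, so for fixed word length W = 3 its cost does not depend on the size of the word list.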

Scan the database

[Figure: dot plot of the query sequence against the database sequence; each x marks a word hit]

Extend hit

Query sequence:     L P P Q G L L
Database sequence:  M P P E G L L
Word scores:            7 2 6        (BLOSUM62; word score = 15)
HSP scores:         2 7 7 2 6 4 4    (HSP score = 32)

The hit is extended in both directions until the running alignment's score has dropped more than X below the maximum score yet attained.
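The extension step can be sketched as an ungapped walk along the diagonal in each direction, stopping once the running score falls more than X below the best score seen so far. This is a simplified illustration: `PAIR_SCORES` holds only the BLOSUM62 entries that occur in the slide's example, the fallback score of -4 for any other pair is a placeholder assumption, and all function names are hypothetical.

```python
# BLOSUM62 entries for the pairs in the slide's example only.
PAIR_SCORES = {("L", "M"): 2, ("P", "P"): 7, ("Q", "E"): 2,
               ("G", "G"): 6, ("L", "L"): 4}

def score(a, b):
    """Symmetric lookup; -4 is an assumed placeholder for unlisted pairs."""
    return PAIR_SCORES.get((a, b), PAIR_SCORES.get((b, a), -4))

def extend_one_way(query, subject, i, j, step, x_drop):
    """Extend from (i, j) in direction step; return (best gain, residues kept)."""
    running = best = 0
    length = kept = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        running += score(query[i], subject[j])
        length += 1
        if running > best:
            best, kept = running, length
        elif best - running > x_drop:
            break  # dropped more than X below the best score so far
        i += step
        j += step
    return best, kept

def extend_hit(query, subject, q_pos, s_pos, w, x_drop):
    """Return (HSP score, query start, query end) for a word hit (0-based, end exclusive)."""
    word = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    right, r_len = extend_one_way(query, subject, q_pos + w, s_pos + w, +1, x_drop)
    left, l_len = extend_one_way(query, subject, q_pos - 1, s_pos - 1, -1, x_drop)
    return word + left + right, q_pos - l_len, q_pos + w + r_len

print(extend_hit("LPPQGLL", "MPPEGLL", 2, 2, 3, x_drop=10))
```

Run on the slide's example, the word PQG/PEG (score 15) grows to cover all seven residue pairs.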

BLAST 2.0: the two-hit method

• BLAST 1.0
– The extension step typically accounts for >90% of BLAST's execution time

• Observations:
– An HSP of interest is much longer than a single word pair
– It may therefore entail multiple hits on the same diagonal and within a short distance of one another

• Invoke an extension only when two non-overlapping hits are found within distance A on the same diagonal

• Recent[i]: the most recent hit found on the ith diagonal (always increasing)

[Figure: three cases for a new hit on a diagonal. It overlaps the most recent hit; it lies less than A from it (extend!); it lies more than A from it.]
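The bookkeeping with Recent[i] can be sketched as below. This is an interpretation of the slide rather than the BLAST implementation: hits are (query position, database position) pairs, a dictionary keyed by diagonal replaces the Recent array, overlapping hits are skipped without updating the record, and the default window A = 40 is an assumed value.

```python
def two_hit_triggers(hits, w=3, window=40):
    """Return the hits that would trigger an extension under the two-hit rule.

    hits: iterable of (query_pos, db_pos) word hits of length w.
    A hit triggers when an earlier, non-overlapping hit on the same
    diagonal lies less than `window` positions before it.
    """
    recent = {}      # diagonal -> query position of the most recent hit
    triggers = []
    # Process hits in database scan order.
    for q, s in sorted(hits, key=lambda h: (h[1], h[0])):
        diagonal = s - q
        previous = recent.get(diagonal)
        if previous is not None:
            distance = q - previous
            if distance < w:
                continue                  # overlaps the most recent hit: ignore
            if distance < window:
                triggers.append((q, s))   # second hit within A: extend
        recent[diagonal] = q
    return triggers

hits = [(0, 10), (1, 11), (5, 15), (60, 70), (3, 50)]
print(two_hit_triggers(hits))
```

Only (5, 15) triggers here: (1, 11) overlaps the first hit, (60, 70) is too far from the previous hit on its diagonal, and (3, 50) is alone on its diagonal.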

• T must be lowered
– One-hit: W = 3, T = 13
– Two-hit: W = 3, T = 11
– More single hits are found, but the majority are dismissed

• Sensitivity
– For HSPs with at least 33 bits, the two-hit heuristic is more sensitive

• Speed (two-hit):
– Generates on average ~3.2 times as many hits, but only ~0.14 times as many hit extensions (for most hits it only has to decide whether the hit needs to be extended)
– Twice as fast as one-hit

Gapped alignment

• Original BLAST: finds several distinct HSPs
– All HSPs related to one alignment should be found

• Gapped BLAST: tolerates a much higher chance of missing any single moderately scoring HSP
– It seeks a single gapped alignment, rather than a collection of ungapped ones
– Example: the combined result should be found with probability > 0.95, where p is the probability of missing a single HSP
  • Original, with 2 HSPs that must both be found: (1 − p)(1 − p) > 0.95, so p < 0.025
  • Now, with either of the 2 HSPs sufficient: p^2 < 0.05, so p can be as high as 0.22
– T can therefore be raised, which makes the search faster

• Now:
– Only one HSP needs to be found
– It serves as the seed, and it is found with the two-hit method
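The arithmetic behind the two miss probabilities can be checked with a short sketch. The function names are invented for this example; `target` is the required probability of recovering the combined result and `k` is the number of HSPs in the alignment.

```python
def max_miss_prob_all_needed(k, target):
    """Largest p with (1 - p)**k >= target: every one of k HSPs must be found."""
    return 1 - target ** (1 / k)

def max_miss_prob_any_suffices(k, target):
    """Largest p with 1 - p**k >= target: finding any one of k HSPs is enough."""
    return (1 - target) ** (1 / k)

print(round(max_miss_prob_all_needed(2, 0.95), 3))    # original BLAST case
print(round(max_miss_prob_any_suffices(2, 0.95), 3))  # gapped BLAST case
```

The tolerable miss probability per HSP rises roughly ninefold, which is what makes a higher threshold T affordable.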

Gapped alignment (contd)

• A gapped extension takes much longer to execute than an ungapped extension, but by performing very few of them the fraction of the total time can be kept low.

• Trigger a gapped extension for any HSP exceeding score S_g

• S_g should be set at ~22 bits, so that a gapped extension is triggered for about 1 in 50 database sequences

The original BLAST locates only the first and the last ungapped alignments, and the resulting E-value is more than 50 times higher.


Gapped Local Alignments

http://binfo.ym.edu.tw/post/internet/gap_blast.htm

Before Gap Insertion

actaactattacagactaactattacagactaactataca
|||||||||||| ||||||||   | |        | |
actaactattacggactaacttacagactaactaaaca

Percent Identity = 24/40 = 0.6

After Gap Insertion

actaactattacagactaactattacagactaactataca
|||||||||||| ||||||||  ||||||||||||| |||
actaactattacggactaact--tacagactaactaaaca

Percent Identity = 36/40 = 0.9
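The two identity figures on this slide can be reproduced with a small sketch. The sequences are the two from the slide; the function name `percent_identity` and the choice to divide by the length of the longer row are assumptions made to match the slide's 24/40 and 36/40.

```python
def percent_identity(row_a, row_b):
    """Count identical, non-gap columns and divide by the longer row's length."""
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return matches, matches / max(len(row_a), len(row_b))

seq_1 = "actaactattacagactaactattacagactaactataca"
seq_2_ungapped = "actaactattacggactaacttacagactaactaaaca"
seq_2_gapped = "actaactattacggactaact--tacagactaactaaaca"

print(percent_identity(seq_1, seq_2_ungapped))
print(percent_identity(seq_1, seq_2_gapped))
```

Inserting a two-residue gap brings the downstream region back into register, raising the identity from 24 to 36 matching positions out of 40.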

Gapped Local Alignments

• Start from a single aligned pair of residues, called the seed.

Gapped extension

– Find the ungapped region with the highest alignment score.

– If the score of the ungapped region exceeds S_g, try a gapped extension using DP.

– Use its central residue pair as the seed.

– A gapped extension is invoked less than once per 50 database sequences.
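Seed selection can be sketched as a sliding-window search over the per-position scores of an HSP. The window length of 11 and the rule for HSPs shorter than the window are assumptions for this illustration, and `choose_seed` is a made-up name; the slide itself only says to take the central residue pair of the best-scoring ungapped region.

```python
def choose_seed(pair_scores, window=11):
    """Return the index (within the HSP) of the residue pair used as the seed.

    pair_scores: substitution scores of the aligned pairs along the HSP.
    The seed is the central pair of the highest-scoring window; if the HSP
    is no longer than the window, its own central pair is used.
    """
    n = len(pair_scores)
    if n <= window:
        return n // 2
    current = sum(pair_scores[:window])
    best, best_start = current, 0
    for start in range(1, n - window + 1):
        # Slide the window one position to the right.
        current += pair_scores[start + window - 1] - pair_scores[start - 1]
        if current > best:
            best, best_start = current, start
    return best_start + window // 2

hsp = [2, 7, 7, 2, 6, 4, 4]
print(choose_seed(hsp), choose_seed(hsp, window=3))
```

The gapped extension then proceeds outward from that single residue pair in both directions.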


PSSM

• Conserved regions
– Proteins of the same family
– Some regions are very similar
– These reflect the structure and functionality typical of this family

PSI-BLAST (Position-Specific Iterated BLAST)

From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html

[1] Select a query and search it against a protein database

[2] PSI-BLAST constructs a multiple sequence alignment, then creates a "profile" or specialized position-specific scoring matrix (PSSM)

[3] The PSSM is used as a query against the database

[4] PSI-BLAST estimates statistical significance (E values)

[5] Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query.
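The iteration in steps [1] to [5] can be outlined as a loop. This is only a control-flow sketch: `search` and `build_pssm` are hypothetical callables supplied by the caller (no real BLAST API is implied), the 0.01 inclusion threshold follows the multiple-alignment slide, and stopping early when no new sequences are included is an added assumption.

```python
def psi_blast(query, search, build_pssm, max_iterations=5, e_threshold=0.01):
    """Iterate search -> profile -> search, returning the included sequence ids.

    search(profile) must return a list of (sequence_id, e_value) pairs.
    build_pssm(query, included_ids) must return the profile for the next round.
    """
    profile = query            # round 1 searches with the plain query
    included = set()
    for _ in range(max_iterations):
        hits = search(profile)
        now_included = {seq_id for seq_id, e in hits if e < e_threshold}
        if now_included == included:
            break              # assumed stopping rule: nothing new was found
        included = now_included
        profile = build_pssm(query, included)
    return included

# Toy stand-ins: sequence "b" only becomes significant once a profile is used.
def toy_search(profile):
    if profile == "QUERY":
        return [("a", 0.001), ("b", 0.5)]
    return [("a", 0.001), ("b", 0.005)]

def toy_build_pssm(query, included):
    return ("pssm", frozenset(included))

print(sorted(psi_blast("QUERY", toy_search, toy_build_pssm)))
```

The toy run shows the intended effect: a sequence missed by the first search is picked up once the profile replaces the query.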


Score matrix architecture

• Each matrix has length precisely equal to that of the original query sequence.

Multiple alignment construction

• Include the sequence segments reported in the BLAST output with E-value < 0.01.

• Any row identical to the query segment with which it aligns is purged.

• Only one copy is retained of any rows that are more than 98% identical to one another.


Multiple alignment construction

• Pairwise alignment columns that involve gap characters inserted into the query are simply ignored.

• So M has exactly the same length as the query.

Multiple alignment construction

• The matrix scores for a given alignment column should depend not only upon the residues appearing there, but upon those in other columns as well.

• For each column C, a reduced multiple alignment M_C is defined: the set R of sequences it includes is exactly those that contribute a residue to column C.

• The columns of M_C are just those columns of M in which all the sequences of R are represented.
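The definition of the reduced alignment for a column C can be made concrete with a small sketch. The representation is an assumption for this example: each row of M is a string as long as the query, and "." marks positions that a sequence's local alignment does not cover. The name `reduced_alignment` is hypothetical.

```python
ABSENT = "."  # assumed marker: the sequence is not represented at this column

def reduced_alignment(rows, c):
    """Return (R, columns) defining M_C for column c.

    R: indices of the rows that contribute a residue to column c.
    columns: the columns of M in which every row of R is represented.
    """
    r = [i for i, row in enumerate(rows) if row[c] != ABSENT]
    columns = [j for j in range(len(rows[0]))
               if all(rows[i][j] != ABSENT for i in r)]
    return r, columns

rows = ["ACDEF",   # the query covers every column
        ".CDE.",
        "..DEF"]
print(reduced_alignment(rows, 1))
print(reduced_alignment(rows, 4))
```

Different columns can therefore be scored from different subsets of sequences and different column ranges, depending on which local alignments cover them.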

Sequence weights

• A large set of closely related sequences carries little more information than a single member, but its size may allow it to outvote a small number of more divergent sequences.

• One remedy is to assign weights to the sequences.

• Gap characters are treated as a 21st distinct character.
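The slides do not say which weighting scheme is used, so the sketch below swaps in one common choice, position-based weighting in the style of Henikoff and Henikoff, purely as an illustration. In each column a row receives 1 / (k × n), where k is the number of distinct characters in the column and n is the number of rows sharing that row's character; gap characters count as just another character. The name `position_based_weights` is made up.

```python
from collections import Counter

def position_based_weights(rows):
    """Return one weight per row, normalized to sum to 1."""
    weights = [0.0] * len(rows)
    for column in zip(*rows):
        counts = Counter(column)        # gaps ("-") count as a 21st character
        distinct = len(counts)
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (distinct * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]

# Three identical sequences and one divergent sequence.
rows = ["AAA", "AAA", "AAA", "CCC"]
print([round(w, 3) for w in position_based_weights(rows)])
```

In this toy case the three identical rows share half of the total weight and the single divergent row gets the other half, so the redundant group can no longer outvote it.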

Sequence weights

• In constructing matrix scores, not only a column's observed residue frequencies are important, but also the effective number of independent observations they represent.

• Estimate the relative number N_C of independent observations constituted by the alignment M_C.

• N_C: the mean number of different residue types, including gap characters, observed in the columns of M_C.
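The estimate of N_C described here is simple enough to sketch directly. The function name and the toy alignment are invented for this example; gap characters are counted like any other character.

```python
def effective_observations(rows, columns):
    """N_C: mean number of distinct characters over the given columns."""
    distinct_per_column = [len({row[j] for row in rows}) for j in columns]
    return sum(distinct_per_column) / len(distinct_per_column)

rows = ["ACD", "ACE", "AGE"]
print(effective_observations(rows, [0, 1, 2]))
```

A column of identical residues contributes 1 however many sequences there are, so many near-identical sequences yield a small N_C.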

• Given a large number of independent sequences, the estimate of Q_i should converge simply to the observed frequency of residue i in that column.

• Pseudocount frequencies:

g_i = Σ_j (f_j / P_j) q_ij

• Estimate Q_i by:

Q_i = (α f_i + β g_i) / (α + β), with α = N_C − 1

where f_i is the observed (weighted) frequency of residue i in the column and β is the weight given to the pseudocounts.
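A sketch of the pseudocount estimate, assuming the form Q_i = (α f_i + β g_i) / (α + β) with g_i = Σ_j (f_j / P_j) q_ij and α = N_C − 1. The two-letter alphabet, the target frequencies and the default β = 10 are toy assumptions chosen so the numbers are easy to follow; `estimate_q` is a hypothetical name.

```python
def estimate_q(f, p, q, n_c, beta=10.0):
    """Blend observed frequencies f with pseudocount frequencies g.

    f: observed frequency of each residue in the column.
    p: background frequency of each residue.
    q: target frequencies, q[i][j] for the aligned pair (i, j).
    n_c: effective number of independent observations (alpha = n_c - 1).
    """
    alpha = n_c - 1
    letters = list(p)
    g = {i: sum(f[j] / p[j] * q[i][j] for j in letters) for i in letters}
    return {i: (alpha * f[i] + beta * g[i]) / (alpha + beta) for i in letters}

# Toy two-letter alphabet with symmetric target frequencies that sum to 1.
p = {"A": 0.5, "B": 0.5}
q = {"A": {"A": 0.4, "B": 0.1}, "B": {"A": 0.1, "B": 0.4}}
f = {"A": 1.0, "B": 0.0}

print(estimate_q(f, p, q, n_c=1))    # alpha = 0: pseudocounts only
print(estimate_q(f, p, q, n_c=11))   # alpha = 10: observations count equally
```

As N_C grows, α dominates β and Q_i approaches the observed frequency f_i, which is the convergence property stated in the first bullet.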


Performance Evaluation

Gapped BLAST:
1. 3 times faster than the original BLAST, and finds more
2. More than 100 times faster than Smith-Waterman (S-W); misses only 8, with the same scores

PSI-BLAST:
1. Faster than the original BLAST and 40 times faster than S-W, and much more sensitive
2. Multiple iterations do even better; better for the NCBI non-redundant database
3. Slower than gapped BLAST, because of the time needed to construct the PSSM

PSI-BLAST Examples (1)

1. The two proteins have been shown to be structurally similar, but with HIT as the query, a BLAST search of SWISS-PROT reveals hits with E < 0.01 only to other HIT proteins.

2. A PSI-BLAST search, using the PSSM generated by […], yields an E-value of 2 × 10^-4 for uridylyltransferase.

PSI-BLAST Examples (2): BRCT proteins

Seven recent additions to the protein databases as members of BRCT superfamily


Discussion

Possible future improvement: gap costs

• Allow a gap to involve residues in both sequences rather than just one

• A gap in which k residues are inserted or deleted and j pairs of residues are left unaligned receives the score −(a + bk + cj)
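The proposed gap score can be written as a one-line function. The parameter values in the demo are arbitrary placeholders, not values from the slides, and `gap_score` is an invented name.

```python
def gap_score(k, j, a, b, c):
    """Score of a gap with k inserted/deleted residues and j unaligned pairs."""
    return -(a + b * k + c * j)

# Placeholder costs: a = 10 to open the gap, b = 1 per residue, c = 1 per pair.
print(gap_score(k=3, j=0, a=10, b=1, c=1))   # ordinary affine gap
print(gap_score(k=3, j=2, a=10, b=1, c=1))   # gap that also skips 2 residue pairs
```

With j = 0 the formula reduces to the usual affine gap cost −(a + bk), so the generalization only adds the option of leaving residue pairs unaligned inside a gap.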

Possible future improvement: realignment

• Rather than combining all pairwise alignments that exceed the threshold into a single multiple alignment, select only the most significant ones to construct an initial multiple alignment and PSSM, then use these to rescore and realign the database sequences that received lower scores

• Advantages
– Weaker pairwise alignments are improved
– False positives can be downgraded by an improved matrix
– False negatives can have their scores increased

Conclusion

• The gapped version of BLAST is faster than the original one, and is able to produce gapped alignments.

• PSI-BLAST greatly increases sensitivity to weak but biologically relevant sequence relationships.

• PSI-BLAST retains the ability to report accurate statistics, runs each iteration in a time not much greater than gapped BLAST, and can be used both iteratively and fully automatically.

NCBI
• Books
• PubMed
• BLAST

(1) Nucleotide
-- Quickly search for highly similar sequences
-- Nucleotide-nucleotide BLAST

(2) Protein
-- Protein-protein BLAST

(3) Translated
-- Translated query vs. protein database

(4) Special
-- Align two sequences


Database searching

[Figure: a query and a sequence database are passed to a sequence comparison algorithm, which returns targets ranked by score]


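Question Set of Final Exam

• 1. Describe the principle that allows BLAST to find sequences in a database quickly.

• 2. What are the differences between the two-hit and one-hit methods?

• 3. Briefly describe the improvements that PSI-BLAST makes over BLAST.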