C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E 02-11-2006 Alignments 3: BLAST Sequence Analysis

Post on 18-Dec-2015

Centre for Integrative Bioinformatics VU

02-11-2006

Alignments 3: BLAST

Sequence Analysis


[2] 09-11-2006 Sequence Analysis

Sequence searching - challenges

• Exponential growth of databases


Sequence searching – definition

• Task:

• Query: short, new sequence (~1000b)

• Database (searching space): very many sequences

• Goal: find seqs related to query

• We want:

• fast tool

• primarily a filter: most sequences will be unrelated to the query

• fine-tune the alignment later


Heuristic Alignment Motivation

• Dynamic programming has O(mn) time complexity, which is too slow for large databases with high query traffic
  – MPsrch [Sturrock & Collins, MPsrch version 1.3 (1993) – Massively parallel DP]

• Heuristic methods provide a fast approximation to dynamic programming
  – FASTA [Pearson & Lipman, 1988]
  – BLAST [Altschul et al., 1990]


Heuristic Alignment Motivation

• Consider the task of searching SWISS-PROT against a query sequence:
• Say our query sequence is 362 amino acids long
• SWISS-PROT release 38 contains 29,085,265 amino acids
• Finding local alignments via dynamic programming would entail O(10^10) matrix operations
• Many servers handle thousands of such queries a day (NCBI > 50,000)
• Using the DP algorithm for this is clearly prohibitive
• Note: each database search can be sped up by 'trivial parallelisation'
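The order of magnitude quoted above can be checked with a few lines of arithmetic. This is only a sketch: it counts one matrix cell per (query residue, database residue) pair and ignores the constant work per cell; the variable names are illustrative.

```python
# Back-of-the-envelope cost of exhaustive dynamic programming,
# using the figures quoted on the slide.
QUERY_LENGTH = 362              # amino acids in the query
DATABASE_RESIDUES = 29_085_265  # amino acids in SWISS-PROT release 38
QUERIES_PER_DAY = 50_000        # lower bound quoted for NCBI


def dp_cells(query_length: int, database_residues: int) -> int:
    """Number of DP matrix cells filled when the query meets every residue."""
    return query_length * database_residues


cells_per_query = dp_cells(QUERY_LENGTH, DATABASE_RESIDUES)
cells_per_day = cells_per_query * QUERIES_PER_DAY

print(f"{cells_per_query:.2e} cells per query")
print(f"{cells_per_day:.2e} cells per day")
```

A single query already needs about 10^10 cell updates, and a day of traffic multiplies that by the number of queries.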


Heuristic Alignment

• Today: BLAST is discussed to show you a few of the tricks people have come up with to make alignment and database searching fast, while not losing too much quality.


What is BLAST

• Basic Local Alignment Search Tool

• Bad news: it is only a heuristic
  • Heuristic: a rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, DN (1981) The Mind's Best Work
  • Also see http://en.wikipedia.org/wiki/Heuristic

• Basic idea:
  • High-scoring segments have a well-conserved (almost identical) part
  • As well-conserved parts are identified, extend these to the real alignment



What does "well conserved" mean for BLAST?

• BLAST works with k-words (words of length k)

• k is a parameter

• different for DNA (>10) and proteins (2..4); default k values are 11 and 3, respectively

• word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T=12)

Similar 3-words

W1:     R   K   P
W2:     R   R   P
Score:  9  –1   7   (sum = 15)


BLAST algorithm: 3 basic steps

1) Preprocess the query: extract all the k-words

2) Scan for T-similar matches in database

3) Extend them to alignments


BLAST, Step 1: Preprocess the query

• Take the query (e.g. LVNRKPVVP)

• Chop it into overlapping k-words (k=3 in this case)

• For each word find all similar words (scoring at least T)

• E.g. for RKP the following 3-words are similar: QKP, KKP, RQP, REP, RRP, RKP

Query: LVNRKPVVP
Word1: LVN
Word2: VNR
Word3: NRK
…
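The preprocessing step can be sketched as follows. The score table is a five-residue excerpt with values as recalled from BLOSUM62, so treat the numbers as illustrative; the threshold T=13 and the restricted alphabet are assumptions chosen so that the toy example reproduces the six similar words listed for RKP.

```python
from itertools import product

# Excerpt of substitution scores for five residues (values as recalled from
# BLOSUM62; illustrative only, check against a real matrix file before use).
SCORES = {
    ("R", "R"): 5, ("K", "K"): 5, ("Q", "Q"): 5, ("E", "E"): 5, ("P", "P"): 7,
    ("R", "K"): 2, ("R", "Q"): 1, ("R", "E"): 0, ("R", "P"): -2,
    ("K", "Q"): 1, ("K", "E"): 1, ("K", "P"): -1,
    ("Q", "E"): 2, ("Q", "P"): -1, ("E", "P"): -1,
}
ALPHABET = "RKQEP"  # restricted alphabet for the toy example


def pair_score(a, b):
    """Symmetric lookup in the excerpted score table."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def k_words(query, k=3):
    """Chop the query into overlapping words of length k."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def word_score(w1, w2):
    """Sum of pair scores between two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def neighbourhood(word, threshold):
    """All words over ALPHABET that are T-similar to `word`."""
    candidates = ("".join(c) for c in product(ALPHABET, repeat=len(word)))
    return sorted(w for w in candidates if word_score(word, w) >= threshold)


print(k_words("LVNRKPVVP"))
print(neighbourhood("RKP", threshold=13))
```

Enumerating every candidate word is affordable for k=3 over 20 amino acids (8,000 words); production implementations prune this search rather than score every candidate.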


Step 2: Scanning the Database with DFA (Deterministic Finite-state Automaton)

• search database for all occurrences of query words
• can be a massive task
• approach:
  • build a DFA (deterministic finite-state automaton) that recognizes all query words
  • run DB sequences through DFA
  • remember hits


DFA: finite-state machine

• abstract machine

• constant amount of memory (states)

• used in computation and languages

• recognizes the languages described by regular expressions, e.g. AC*T|GGC

• compare shell wildcards: cp dmt*.pdf /home/john


BLAST, Step 2: Find "exact" matches with scanning

• Use all the T-similar k-words to build the Finite State Machine

• Scan for exact matches

Database sequence (the word list moves along it):
...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...

Word list: QKP, KKP, RQP, REP, RRP, RKP, ...


Scanning the Database - DFA

Example (next 2 slides):

• consider a DFA to recognize the query words: QL, QM, ZL

• All that a DFA does is read strings, and output "accept" or "reject."

• use Mealy paradigm (accept on transitions) to save space and time

Moore paradigm: the alphabet is (a, b), the states are q0, q1, and q2, the start state is q0 (denoted by the arrow coming from nowhere), the only accepting state is q2 (denoted by the double ring around the state), and the transitions are the arrows. The machine works as follows. Given an input string, we start at the start state, and read in each character one at a time, jumping from state to state as directed by the transitions. When we run out of input, we check to see if we are in an accept state. If we are, then we accept. If not, we reject.

Moore paradigm: accept/reject states

Mealy paradigm: accept/reject transitions


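A DFA to recognize the query words QL, QM, ZL in a fast way

Mealy paradigm: accept on red transitions

[Figure: DFA with a start state, a Q state and a Z state. From start: Q leads to the Q state, Z leads to the Z state, not (Q or Z) stays in start. From the Q state: L or M is an accepting transition, Q stays in the Q state, not (L or M or Q) returns to start. From the Z state: L is an accepting transition, Z stays in the Z state, not (L or Z) returns to start.]

This DFA was downloaded from an expert website, but what do you think (see next slide)?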


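A DFA to recognize the query words QL, QM, ZL in a fast way

Mealy paradigm: accept on red transitions

[Figure: DFA with a start state, a Q state and a Z state. From start: Q leads to the Q state, Z leads to the Z state, not (Q or Z) stays in start. From the Q state: L or M is an accepting transition, Q stays in the Q state, Z leads to the Z state, not (L or M or Q or Z) returns to start. From the Z state: L is an accepting transition, Z stays in the Z state, Q leads to the Q state, not (L or Z or Q) returns to start.]

Spot and justify the differences with the last slide.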
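The automaton with the extra Q-to-Z and Z-to-Q transitions can be written down directly as a transition function. This is a sketch under one assumption not stated on the slide: after an accepting transition the machine returns to the start state (neither L nor M begins a query word). The function and state names are illustrative.

```python
# States: "start", "Q" (last residue read was Q), "Z" (last residue read was Z).
def step(state, residue):
    """Return (next_state, accepted) for one input residue (Mealy style)."""
    if state == "Q" and residue in ("L", "M"):
        return "start", True        # QL or QM recognised on this transition
    if state == "Z" and residue == "L":
        return "start", True        # ZL recognised on this transition
    if residue == "Q":
        return "Q", False           # also covers Z -> Q and Q -> Q
    if residue == "Z":
        return "Z", False           # also covers Q -> Z and Z -> Z
    return "start", False


def scan(sequence):
    """Run a sequence through the automaton and collect (position, word) hits."""
    hits, state = [], "start"
    for i, residue in enumerate(sequence):
        state, accepted = step(state, residue)
        if accepted:
            hits.append((i - 1, sequence[i - 1:i + 1]))
    return hits


print(scan("AQZLQMZQL"))
```

In the input `AQZLQMZQL` the word ZL directly follows a Q; a machine that returns to start on reading Z in the Q state would miss it, which is the point of the added transitions.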


BLAST, Step 3: Extending "exact" matches

• Having the list of matches (hits), we extend the alignment in both directions

• …until the sum of scores drops below some level X from the best known

Query:      L   V   N   R   K   P   V   V   P
T-similar:              R   R   P
Subject:    G   V   C   R   R   P   L   K   C
Score:     -3   4  -3   5   2   7   1  -2  -3


Step 3: Extending Hits

• extend hits in both directions (without allowing gaps)

• terminate extension in one direction when the score falls a certain distance below the best score for shorter extensions

• return segment pairs scoring at least S
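A minimal sketch of ungapped extension with a drop-off threshold, using the query/subject pair and the per-column scores from the earlier extension example (L V N R K P V V P against G V C R R P L K C). The fallback score of -4 for pairs not in that example and the function names are assumptions; the final comparison against the cutoff S is left to the caller.

```python
# Column scores copied from the slide's example alignment; any other pair
# falls back to an assumed mismatch score of -4.
SLIDE_SCORES = {
    ("L", "G"): -3, ("V", "V"): 4, ("N", "C"): -3, ("R", "R"): 5, ("K", "R"): 2,
    ("P", "P"): 7, ("V", "L"): 1, ("V", "K"): -2, ("P", "C"): -3,
}


def pair_score(a, b):
    return SLIDE_SCORES.get((a, b), -4)


def extend_hit(query, subject, q_pos, s_pos, k, x_drop):
    """Extend a length-k hit without gaps; return (score, q_start, q_end).

    q_end is exclusive. Extension in a direction stops once the running
    score falls more than x_drop below the best score seen in that direction.
    """
    seed = sum(pair_score(query[q_pos + i], subject[s_pos + i]) for i in range(k))

    def walk(q, s, direction):
        best = current = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            current += pair_score(query[q], subject[s])
            length += 1
            if current > best:
                best, best_len = current, length
            elif best - current > x_drop:
                break
            q += direction
            s += direction
        return best, best_len

    right_gain, right_len = walk(q_pos + k, s_pos + k, +1)
    left_gain, left_len = walk(q_pos - 1, s_pos - 1, -1)
    return seed + left_gain + right_gain, q_pos - left_len, q_pos + k + right_len


print(extend_hit("LVNRKPVVP", "GVCRRPLKC", 3, 3, 3, x_drop=2))
print(extend_hit("LVNRKPVVP", "GVCRRPLKC", 3, 3, 3, x_drop=4))
```

A larger drop-off lets the extension cross the -3 column on the left and pick up the V/V match, at the price of examining more columns per hit.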


More Recent BLAST Extensions

• the two-hit method
• gapped BLAST
• hashing the database
• PSI-BLAST

All are aimed at increasing sensitivity while keeping run-times minimal.

Altschul et al., Nucleic Acids Research 1997


The Two-Hit Method

• extension step typically accounts for 90% of BLAST's execution time
• key idea: do extension only when there are two hits on the same diagonal within distance A of each other
• to maintain sensitivity, lower the T parameter
  • more single hits found
  • but only a small fraction have an associated 2nd hit
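The two-hit test amounts to grouping hits by diagonal (subject position minus query position) and remembering the most recent hit on each diagonal. The sketch below assumes hits are (query_pos, subject_pos) pairs and that the two hits must not overlap; the function name and the toy coordinates are illustrative.

```python
def two_hit_seeds(hits, window, k=3):
    """Return (first_q, second_q, diagonal) for pairs of non-overlapping hits
    on the same diagonal whose query positions are at most `window` apart."""
    last_on_diagonal = {}
    seeds = []
    for q_pos, s_pos in sorted(hits):
        diagonal = s_pos - q_pos
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            gap = q_pos - previous
            if gap < k:
                continue                    # overlaps the earlier hit; keep that one
            if gap <= window:
                seeds.append((previous, q_pos, diagonal))
        last_on_diagonal[diagonal] = q_pos
    return seeds


# Toy hits: three on diagonal 10 close together, one far away, one elsewhere.
toy_hits = [(0, 10), (1, 11), (3, 13), (20, 5), (60, 70)]
print(two_hit_seeds(toy_hits, window=40))
```

Only the pair at query positions 0 and 3 triggers an extension; the isolated hits are discarded without any extension work.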


The Two-Hit Method

Figure from: Altschul et al. Nucleic Acids Research 25, 1997


Gapped BLAST

• trigger gapped alignment if two-hit extension has a sufficiently high score
• find length-11 segment with highest score; use central pair in this segment as seed
• run DP process both forward & backward from seed
• prune cells when local alignment score falls a certain distance below best score yet


Gapped BLAST

Figure from: Altschul et al. Nucleic Acids Research 25, 1997


Combining the two-hit method and Gapped BLAST

• Before:
  • relatively high T threshold for 3-letter word (hashed) lists
  • two-way hit extension (see earlier slides)

• Current BLAST:
  • Lower T: many more hits (more 3-letter words accepted as match)
  • Relatively few hits (diagonal elements) will be on same matrix diagonal within a given distance A
  • Perform 2-way local Dynamic Programming (gapped BLAST) only on 'two-hits' (preceding bullet)

The new way is a bit faster on average and gives better (gapped) alignments and better alignment scores!


Hashing – associative arrays

• Indexing with the object, the

• Hash function:

• Objects should be “well spread”

[Figure: the hash function maps an object x from the large set of possible objects to a small set that fits in memory]


Hashing - examples

• T9 Predictive Text in mobile phones

• "hello" with multi-tap entry: 4, 4, 3, 3, 5, 5, 5, (pause) 5, 5, 5, 6, 6, 6

• "hello" in T9: 4, 3, 5, 5, 6

• Collisions: 4, 6 gives both "in" and "go"


Hashing – examples (cont.)

• Another, simpler hash function: let a=1, b=2, c=3, etc.

• “hello” now gets hash address 8+5+12+12+15 = 52

• “olleh” will get same address (collision)

• Each word encountered gets a hash address immediately and can be indexed.

• How good is this hash function?
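The letter-sum hash is easy to try out, which also suggests an answer to the question above: any rearrangement of the same letters collides. The second function is a common order-sensitive alternative added here purely for contrast; it is not from the slides, and the table size is an arbitrary assumption.

```python
def letter_sum_hash(word):
    """Hash from the slide: a=1, b=2, ..., z=26, summed over the word.
    Assumes the word contains only the letters a-z."""
    return sum(ord(c) - ord("a") + 1 for c in word.lower())


def positional_hash(word, table_size=10_007):
    """Order-sensitive alternative (an assumption, not from the slides):
    treat the word as a base-26 number, reduced modulo the table size."""
    h = 0
    for c in word.lower():
        h = (h * 26 + (ord(c) - ord("a") + 1)) % table_size
    return h


print(letter_sum_hash("hello"), letter_sum_hash("olleh"))
print(positional_hash("hello"), positional_hash("olleh"))
```

The letter sums also cluster in a narrow range of small addresses, so the objects are not "well spread" in the sense of the earlier slide.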


BLAST, Step 2: Find ”exact” matches with hashing

• Preprocess the database

• Hash the database with k-words

• For each k-word store in which sequences it appears

k-word: RKP

Hashed DB:
QKP: HUgn0151194, Gene14, IG0, ...
KKP: haemoglobin, Gene134, IG_30, ...
RQP: HSPHOSR1, GeneA22...
RKP: galactosyltransferase, IG_1...
REP: haemoglobin, Gene134, IG_30, ...
RRP: Z17368, Creatine kinase, ...
...


BLAST, Step 2: Find “exact” matches with hashing

• The database is preprocessed only once! (independent from the query)

• In a constant time we can get the sequences with a certain k-word

k-word: RKP

Hashed DB:
QKP: HUgn0151194, Gene14, IG0, ...
KKP: haemoglobin, Gene134, IG_30, ...
RQP: HSPHOSR1, GeneA22...
RKP: galactosyltransferase, IG_1...
REP: haemoglobin, Gene134, IG_30, ...
RRP: Z17368, Creatine kinase, ...
...
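A hashed database of this kind can be sketched with an ordinary dictionary, which gives expected constant-time lookup per k-word. The two sequences and their names below are hypothetical toy data, and storing (name, position) pairs rather than names only is a design choice made here so that hits can later be extended.

```python
from collections import defaultdict


def build_kword_index(database, k=3):
    """Map every k-word to the (sequence name, position) pairs where it occurs.
    Built once, independently of any query."""
    index = defaultdict(list)
    for name, sequence in database.items():
        for pos in range(len(sequence) - k + 1):
            index[sequence[pos:pos + k]].append((name, pos))
    return index


# Hypothetical toy database
toy_db = {"seq1": "VLQKPLKKPP", "seq2": "CCEVVRKPLV"}
index = build_kword_index(toy_db)

# Look up a few of the words that are similar to RKP.
hits = [hit for word in ("QKP", "KKP", "RKP") for hit in index.get(word, [])]
print(hits)
```

The cost of building the index is paid once per database release; each query then only performs one lookup per neighbourhood word.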


BLAST flavours

• blastp: protein query, protein db
• blastn: DNA query, DNA db
• blastx: DNA query, protein db
  • query translated in all reading frames. Used to find potential translation products of an unknown nucleotide sequence.
• tblastn: protein query, DNA db
  • database dynamically translated in all reading frames.
• tblastx: DNA query, DNA db
  • all translations of query against all translations of db (compare at protein level)


PSI-BLAST

• Position-Specific Iterated BLAST

• A profile (called PSSM by BLAST – Position Specific Scoring Matrix) is derived from the result of the first search (using a single query sequence)

• Database is searched against the profile (instead of a sequence) in subsequent rounds

• Up to 3-10 iterations are recommended


PSI-BLAST steps in words

1. Query sequences are first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition likely to lead to spurious hits; these regions are excluded from alignment.

2. The program then initially operates on a single query sequence by performing a gapped BLAST search.

3. Then, the program takes the significant local alignments (hits) found, constructs a multiple alignment (master-slave alignment) and abstracts a position-specific scoring matrix (PSSM) from this alignment.

4. The database is rescanned in a subsequent round, now using the PSSM, to find more homologous sequences. Iteration continues until the user decides to stop or the search has converged.


Profile

• a Profile is a generalized form of sequence

• probabilities instead of a letter

         A     C     D    ..    W     Y
pos 1   0.3   0.1   0     ..   0.3   0.3
pos 2   0.5   0     0     ..   0     0.5
pos 3   0     0.5   0.2   ..   0.1   0.2
pos 4   0.2   0.0   0.1   ..   0.4   0.3
...     ...   ...   ...   ..   ...   ...


Constructing a profile

• Take significant BLAST hits

• Make an alignment

• Assign weights to sequences

• Construct profile

         A     C     D    ..    W     Y
pos 1   0.3   0.1   0     ..   0.3   0.3
pos 2   0.5   0     0     ..   0     0.5
pos 3   0     0.5   0.2   ..   0.1   0.2
pos 4   0.2   0.0   0.1   ..   0.4   0.3
...     ...   ...   ...   ..   ...   ...


PSI BLAST:Constructing the Profile Matrix

Figure from: Altschul et al. Nucleic Acids Research 25, 1997


Profile calculation example using frequency normalisation and log conversion

(A) Building the profile

Aligned sequences (positions 1-5):
S1 GCTCC
S2 AATCG
S3 TACGC
S4 GTGTT
S5 GTAAA
S6 CGTCC

Observed frequencies per position:
      1      2      3      4      5     Overall
A    .17    .33    .17    .17    .17    6/30 = .20
C    .17    .17    .17    .50    .50    9/30 = .30
G    .50    .17    .17    .17    .17    7/30 = .23
T    .17    .33    .50    .17    .17    8/30 = .27

Normalise by dividing by overall frequencies:
      1      2      3      4      5     Overall
A    .85   1.65    .85    .85    .85    6/30 = .20
C    .57    .57    .57   1.67   1.67    9/30 = .30
G   2.17    .74    .74    .74    .74    7/30 = .23
T    .63   1.22   1.85    .63    .63    8/30 = .27

Convert to log to base of 2 (the profile):
      1      2      3      4      5
A  -0.23   0.72  -0.23  -0.23  -0.23
C  -0.81  -0.81  -0.81   0.74   0.74
G   1.11  -0.43  -0.43  -0.43  -0.43
T  -0.66   0.29   0.89  -0.66  -0.66

(B) Match GATCA to PSSM

• Find nucleotides at corresponding positions
• Sum corresponding log odds matrix scores

Score = 1.11 + 0.72 + 0.89 + 0.74 - 0.23 = 3.23
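The worked example can be reproduced in a few lines. This sketch follows the same three steps (column frequencies, division by overall frequencies, log base 2) without pseudo-counts; it works here only because every nucleotide occurs in every column, so no logarithm of zero arises. With unrounded intermediate values the score for GATCA comes out at about 3.22, slightly below the 3.23 obtained from the rounded table entries.

```python
import math

SEQUENCES = ["GCTCC", "AATCG", "TACGC", "GTGTT", "GTAAA", "CGTCC"]
ALPHABET = "ACGT"


def build_pssm(sequences):
    """Log-odds profile: log2(column frequency / overall frequency)."""
    n_seq, length = len(sequences), len(sequences[0])
    total = n_seq * length
    overall = {b: sum(s.count(b) for s in sequences) / total for b in ALPHABET}
    pssm = []
    for col in range(length):
        column = [s[col] for s in sequences]
        pssm.append({
            b: math.log2((column.count(b) / n_seq) / overall[b])
            for b in ALPHABET
        })
    return pssm


def score(pssm, word):
    """Sum the log-odds entries for the residues of `word`, position by position."""
    return sum(pssm[i][b] for i, b in enumerate(word))


pssm = build_pssm(SEQUENCES)
print(round(score(pssm, "GATCA"), 2))
```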


PSI BLAST: Determining profile elements more reliably using pseudo-counts

• the value for a given element of the profile matrix is given by: [equation not preserved in transcript]

• where the probability of seeing amino acid ai in column j is estimated as: [equation not preserved in transcript]

Labels attached to the equations:
• Observed frequency
• Pseudocount (e.g. database frequency)
• Overall alignment frequency (preceding slide)
• e.g. [symbol lost] = number of sequences in profile, [symbol lost] = 1
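The equations themselves did not survive in this transcript. A form that is consistent with the surviving labels, and with the pseudo-count scheme commonly presented alongside PSI-BLAST, is P_ij = (alpha * f_ij + beta * b_i) / (alpha + beta) and M_ij = log(P_ij / p_i), where f_ij is the observed frequency, b_i the pseudocount, p_i the overall alignment frequency, alpha the number of sequences and beta = 1. The symbols, the log base 2 and the function names below are assumptions, not a reconstruction of the slide.

```python
import math


def pseudo_count_probability(observed, prior, alpha, beta):
    """Mix the observed column frequency with a prior (pseudocount) frequency.
    Assumed form: (alpha * observed + beta * prior) / (alpha + beta)."""
    return (alpha * observed + beta * prior) / (alpha + beta)


def profile_element(observed, prior, overall, alpha, beta):
    """Assumed log-odds profile entry: log2(mixed probability / overall frequency)."""
    return math.log2(pseudo_count_probability(observed, prior, alpha, beta) / overall)


# Column 1 of the six-sequence DNA example: G observed in 3 of 6 sequences,
# overall frequency 7/30 used as the prior; alpha = 6 sequences, beta = 1.
print(pseudo_count_probability(0.5, 7 / 30, alpha=6, beta=1))

# A residue never observed in a column still gets a finite profile entry.
print(profile_element(0.0, 0.2, 0.2, alpha=6, beta=1))
```

The second call shows the practical benefit: without the pseudocount the entry would be the logarithm of zero.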


PSI BLAST: Determining profile elements more reliably using pseudo-counts

Pseudo-counts:

• mix observed a.a. frequencies with prior (e.g. database) frequencies

• drawback is pulling all frequencies towards the prior frequencies, which reduces differences

• are useful when the multiple alignment contains only a few sequences, so that there is no statistical sample per column yet

• with greater numbers of sequences in the MSA, the profile becomes less dependent on the pseudo-counts


PSI-BLAST iteration graphic…

[Figure: the query sequence Q, with its low-complexity region masked (xxxxxxxxxxxxxxxxx), is run through a gapped BLAST search; the database hits are turned into a PSSM (columns A, C, D, .., Y); the PSSM is run through a gapped BLAST search to give new database hits; iterate]


Another PSI-BLAST iteration graphic…

[Figure: run query sequence Q against database (DB) with threshold T, separating hits from discarded sequences; build a PSSM from the hits; run PSSM against database]


[Figure 6: panels (A), (B), (C), (D)]


PSI-BLAST entry page

Paste your query sequence

Switch this off for default run


1 - This portion of each description links to the sequence record for a particular hit.

2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence).

3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will occur in the database by chance. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E value of e-117 meaning that a sequence with a similar score is very unlikely to occur simply by chance.

4 - These links provide the user with direct access from BLAST results to related entries in other databases. ‘L’ links to LocusLink records and ‘S’ links to structure records in NCBI's Molecular Modeling DataBase.


‘X’ residues denote low-complexity sequence fragments that are ignored


Alignment Bit Score

B = (λS – ln K) / ln 2

• S is the raw alignment score
• The bit score ('bits') B has a standard set of units
• The bit score B is calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment
• λ and K are the statistical parameters of the scoring system (BLOSUM62 in BLAST)
• See Altschul and Gish, 1996, for a collection of values for λ and K over a set of widely used scoring matrices
• Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches based on different scoring schemes (a.a. exchange matrices)
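The conversion from raw score to bit score is a one-line function. The formula used is the standard bit-score normalisation with the scale parameter lambda multiplying the raw score; the two (lambda, K) pairs in the example are hypothetical values chosen only to show that equal raw scores need not mean equal bit scores.

```python
import math


def bit_score(raw_score, lam, K):
    """Bit score B = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Hypothetical parameters for two different scoring systems.
system_a = {"lam": 0.32, "K": 0.13}
system_b = {"lam": 0.27, "K": 0.04}

raw = 100
print(round(bit_score(raw, **system_a), 1))
print(round(bit_score(raw, **system_b), 1))
```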


What is the statistical significance of an alignment?

• To get a null model: extract local alignments from random sequences

• P-value
  • The probability of obtaining the result by pure chance
  • An alignment giving a lower P-value than a threshold value set by the user is considered a hit.


Normalised sequence similarity

The p-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences.

This probability follows the Poisson distribution (Waterman and Vingron, 1994):

P(x, n) = 1 – e^(–n·P(S ≥ x)),

where n is the number of sequences in the database. P(x, n) depends on x and on n (which is fixed for a given database).


E-value

• The concept of P-value applies to single comparisons

• What about searching in a large database?

Task.

Having a protein, we want to find similar ones in a large database (1 million sequences). We are interested in P-value < 0.01. Count the number of hits we'll get by chance alone.


Normalised sequence similarity: statistical significance

The E-value is defined as the expected number of non-homologous sequences with score greater than or equal to a score x in a database of n sequences:

E(x, n) = n·P(S ≥ x)

For example, if E-value = 0.01, then the expected number of random hits with score S ≥ x is 0.01, which means that a hit with this score is expected by chance only once in 100 independent searches over the database. If the E-value of a hit is 5, then five fortuitous hits with S ≥ x are expected within a single database search, which renders the hit not significant.
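The database-level P-value and the E-value are linked by P = 1 – e^(–E), which follows from the two definitions. A short sketch, with illustrative function names, makes the earlier task concrete: a per-comparison threshold of 0.01 over one million sequences yields about 10,000 chance hits.

```python
import math


def e_value(n_sequences, p_single):
    """Expected number of chance hits: E = n * P(S >= x)."""
    return n_sequences * p_single


def p_value(n_sequences, p_single):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value(n_sequences, p_single))


print(e_value(1_000_000, 0.01))   # chance hits expected in the task above
print(p_value(1_000_000, 0.01))   # at least one chance hit is practically certain
print(p_value(1, 0.01))           # for small E, P is close to E
```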


A model for database searching score probabilities

• Scores resulting from searching with a query sequence against a database follow the Extreme Value Distribution (EVD) (Gumbel, 1955).

• Using the EVD, the raw alignment scores are converted to a statistical score (E value) that keeps track of the database amino acid composition and the scoring scheme (a.a. exchange matrix)


Extreme Value Distribution

Probability density function for the extreme value distribution resulting from parameter values μ = 0 and λ = 1, [y = 1 – exp(–e^(–x))], where μ is the characteristic value and λ is the decay constant.

y = 1 – exp(–e^(–λ(x–μ)))


Extreme Value Distribution (EVD)

You know that an optimal alignment of two sequences is selected out of many suboptimal alignments, and that a database search is also about selecting the best alignment(s). This fits well with the EVD, which has a right tail that falls off more slowly than the left tail. Compared to using the normal distribution, when using the EVD an alignment has to score further away from the expected mean value to become a significant hit.

[Figure legend: real data; EVD approximation]


Extreme Value Distribution

The probability of a score S to be larger than a given value x can be calculated following the EVD as:

E-value: P(S ≥ x) = 1 – exp(–e^(–λ(x–μ))),

where μ = (ln Kmn)/λ, m and n are the lengths of the query and the database, and K is a constant that can be estimated from the background amino acid distribution and scoring matrix (see Altschul and Gish, 1996, for a collection of values for λ and K over a set of widely used scoring matrices).


Extreme Value Distribution

Using the equation for μ (preceding slide), the probability for the raw alignment score S becomes

P(S ≥ x) = 1 – exp(–Kmn·e^(–λx)).

In practice, the probability P(S ≥ x) is estimated using the approximation 1 – exp(–e^(–x)) ≈ e^(–x), which is valid for large values of x. This leads to a simplification of the equation for P(S ≥ x):

P(S ≥ x) ≈ e^(–λ(x–μ)) = Kmn·e^(–λx).

The lower the probability (E value) for a given threshold value x, the more significant the score S.
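The quality of the approximation can be checked numerically. The sketch below evaluates both the exact expression and the simplified one; the values λ = 0.267 and K = 0.041 are illustrative assumptions (of the order quoted for BLOSUM62 with gaps), and m and n reuse the query and database sizes from the SWISS-PROT example earlier in the lecture.

```python
import math


def simplified_p(score, m, n, lam, K):
    """Simplified form: K * m * n * exp(-lambda * x)."""
    return K * m * n * math.exp(-lam * score)


def exact_p(score, m, n, lam, K):
    """Exact form: 1 - exp(-K * m * n * exp(-lambda * x))."""
    return -math.expm1(-simplified_p(score, m, n, lam, K))


M, N = 362, 29_085_265     # query length and database size in residues
LAM, K = 0.267, 0.041      # illustrative parameter values (assumption)

for x in (80, 100, 150):
    print(x, exact_p(x, M, N, LAM, K), simplified_p(x, M, N, LAM, K))
```

For high scores the two values agree to many digits; for low scores the simplified form overestimates the probability and can even exceed 1, which is why it is read as an expected count rather than a probability.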


Normalised sequence similarity: statistical significance

• Database searching is commonly performed using an E-value threshold between 0.001 and 0.1.

• Low E-value thresholds decrease the number of false positives in a database search, but increase the number of false negatives, thereby lowering the sensitivity of the search.


Words of Encouragement

• "There are three kinds of lies: lies, damned lies, and statistics" – Benjamin Disraeli

• “Statistics in the hands of an engineer are like a lamppost to a drunk – they’re used more for support than illumination”

• “Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates


Database Search Algorithms: Sensitivity, Selectivity

• Sensitivity – the ability to detect weak similarities between sequences (often due to long evolutionary separation). Increasing sensitivity reduces false negatives, i.e. those database sequences similar to the query, but rejected. Sensitivity = TP / (TP + FN)

• Selectivity – the ability to screen out similarities due to chance. Increasing selectivity reduces false positives, those sequences recognized as similar when they are not. Selectivity = TP / (TP + FP)
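Both measures are simple ratios of counts from a benchmark in which the true relationships are known. The counts in the example are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of truly related sequences that the search reports."""
    return tp / (tp + fn)


def selectivity(tp, fp):
    """Fraction of reported hits that are truly related."""
    return tp / (tp + fp)


# Hypothetical benchmark: 100 true homologues in the database, 80 of them
# reported, together with 40 unrelated sequences.
tp, fn, fp = 80, 20, 40
print(sensitivity(tp, fn))
print(round(selectivity(tp, fp), 3))
```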

Courtesy of Gary Benson (ISSCB 2003)


Dot-plots

A simple way to visualise sequence similarity

Can be a bit messy, though...

Filter: 6/10 residues have to match...
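A windowed dot-plot filter of this kind can be sketched as follows: a dot is placed at (i, j) only when at least `min_matches` of the `window` residues starting at i in one sequence and at j in the other are identical. The function name is illustrative, and the toy call uses a smaller window than the 6-out-of-10 default so that the short example sequences produce output.

```python
def dot_plot(a, b, window=10, min_matches=6):
    """Return the set of (i, j) window start positions that pass the filter."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + t] == b[j + t] for t in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


dots = dot_plot("GATTACA", "GATCACA", window=4, min_matches=3)
print(sorted(dots))
```

The surviving dots lie on the main diagonal, where the two sequences differ by a single substitution; isolated single-residue matches elsewhere are filtered out.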


Dot-plots, what about...

• Insertions/deletions -- DNA and proteins

• Duplications (e.g. tandem repeats) – DNA and proteins

• Inversions -- DNA


Dot-plots, self-comparison

• Direct repeat

• Tandem repeat

• Inverted repeat


• END