Centre for Integrative Bioinformatics VU

02-11-2006

Alignments 3: BLAST

Sequence Analysis

[2] 09-11-2006 Sequence Analysis

Sequence searching - challenges

• Exponential growth of databases


Sequence searching – definition

• Task:

• Query: short, new sequence (~1000b)

• Database (searching space): very many sequences

• Goal: find seqs related to query

• We want:

• fast tool

• primarily a filter: most sequences will be unrelated to the query

• fine-tune the alignment later


Heuristic Alignment Motivation

• Dynamic programming has performance O(mn), which is too slow for large databases with high query traffic

– MPsrch [Sturrock & Collins, MPsrch version 1.3 (1993) – Massively parallel DP]

• Heuristic methods do fast approximation to dynamic programming

– FASTA [Pearson & Lipman, 1988]

– BLAST [Altschul et al., 1990]


Heuristic Alignment Motivation

• Consider the task of searching SWISS-PROT against a query sequence:

• say our query sequence is 362 amino acids long

• SWISS-PROT release 38 contains 29,085,265 amino acids

• finding local alignments via dynamic programming would entail O(10^10) matrix operations

• Many servers handle thousands of such queries a day (NCBI > 50,000)

• Using the DP algorithm for this is clearly prohibitive

• Note: each database search can be sped up by ‘trivial parallelisation’
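The order-of-magnitude estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it counts one matrix cell per (query residue, database residue) pair, and the function name `dp_cells` is made up for this illustration.

```python
def dp_cells(query_len: int, db_residues: int) -> int:
    """Number of DP matrix cells needed to align one query against the whole database."""
    return query_len * db_residues

# Figures quoted on the slide: 362-residue query, SWISS-PROT release 38.
cells = dp_cells(362, 29_085_265)
print(f"{cells:,} cells per query")               # roughly 10^10
print(f"{cells * 50_000:.2e} cells per day")      # at 50,000 queries a day
```

Even at one cell update per nanosecond, a single query would take on the order of ten seconds, which is why a fast filter is wanted before any fine-tuned alignment.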




Heuristic Alignment

• Today: BLAST is discussed to show you a few of the tricks people have come up with to make alignment and database searching fast, while not losing too much quality.


What is BLAST

• Basic Local Alignment Search Tool

• Bad news: it is only a heuristic

• Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, DN (1981) The Mind's Best Work

• Also see http://en.wikipedia.org/wiki/Heuristic

• Basic idea:

• High scoring segments have a well conserved (almost identical) part

• As well conserved parts are identified, extend these to the real alignment


What does “well conserved” mean for BLAST?

• BLAST works with k-words (words of length k)

• k is a parameter

• different for DNA (>10) and proteins (2..4); default k values are 11 and 3, resp.

• word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T=12)

Similar 3-words

W1:     R   K   P
W2:     R   R   P
Score:  9  –1   7  = 15


BLAST algorithm: 3 basic steps

1) Preprocess the query: extract all the k-words

2) Scan for T-similar matches in database

3) Extend them to alignments


BLAST, Step 1: Preprocess the query

• Take the query (e.g. LVNRKPVVP)

• Chop it into overlapping k-words (k=3 in this case)

• For each word find all similar words (scoring at least T)

• E.g. for RKP the following 3-words are similar: QKP KKP RQP REP RRP RKP

Query: LVNRKPVVP
Word1: LVN
Word2: VNR
Word3: NRK
…
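The preprocessing step can be sketched in Python as follows. Several things here are assumptions made for the illustration and are not stated on the slides: the pair scores are BLOSUM62 values restricted to a reduced five-letter alphabet (R, K, Q, E, P), the threshold is set to T=13, and the helper names are invented. Real BLAST enumerates neighbours over all 20 amino acids.

```python
from itertools import product

# Assumption: BLOSUM62 scores, restricted to five residues to keep the table short.
PAIR_SCORES = {
    ("R", "R"): 5, ("R", "K"): 2, ("R", "Q"): 1, ("R", "E"): 0, ("R", "P"): -2,
    ("K", "K"): 5, ("K", "Q"): 1, ("K", "E"): 1, ("K", "P"): -1,
    ("Q", "Q"): 5, ("Q", "E"): 2, ("Q", "P"): -1,
    ("E", "E"): 5, ("E", "P"): -1,
    ("P", "P"): 7,
}
ALPHABET = "RKQEP"  # reduced alphabet (assumption for this sketch)

def pair_score(a, b):
    """Symmetric lookup in the score table."""
    return PAIR_SCORES[(a, b)] if (a, b) in PAIR_SCORES else PAIR_SCORES[(b, a)]

def k_words(seq, k=3):
    """All overlapping words of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def word_score(w1, w2):
    """Sum of pair scores over aligned positions of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))

def neighbourhood(word, threshold):
    """All words over ALPHABET that are T-similar to `word`."""
    candidates = ("".join(p) for p in product(ALPHABET, repeat=len(word)))
    return sorted(w for w in candidates if word_score(word, w) >= threshold)

print(k_words("LVNRKPVVP"))
print(neighbourhood("RKP", threshold=13))
```

Under these assumed scores the neighbourhood of RKP at T=13 is exactly the six words listed on the slide; lowering the threshold to 12 would also admit EKP.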


Step 2: Scanning the Database with DFA (Deterministic Finite-state Automaton)

• search database for all occurrences of query words

• can be a massive task

• approach:

• build a DFA (deterministic finite-state automaton) that recognizes all query words

• run DB sequences through DFA

• remember hits





DFA Finite state machine

AC*T|GGC

• abstract machine

• constant amount of memory (states)

• used in computation and languages

• recognizes regular expressions

• cp dmt*.pdf /home/john
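As a small illustration of "recognizes regular expressions", the pattern AC*T|GGC from this slide can be tried out with Python's `re` module. The sketch assumes the pattern is meant in the usual regular-expression sense (`*` as "zero or more of the preceding symbol", `|` as alternation); the function name `accepts` is invented here.

```python
import re

# Assumption: the slide's pattern is read as a standard regular expression.
PATTERN = re.compile(r"AC*T|GGC")

def accepts(s):
    """True if the whole string is in the language of the pattern."""
    return PATTERN.fullmatch(s) is not None

for s in ["AT", "ACCCT", "GGC", "ACG", "GGCT"]:
    print(s, accepts(s))
```

Note that the shell example on the slide uses `*` as a wildcard for any run of characters, which is a different convention from the regular-expression star.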



BLAST, Step 2: Find “exact” matches with scanning

• Use all the T-similar k-words to build the Finite State Machine

• Scan for exact matches

Database: ...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...

Words: QKP KKP RQP REP RRP RKP ...

[Figure: the word list moves along the database sequence]



Scanning the Database - DFA

Example (next 2 slides):

• consider a DFA to recognize the query words: QL, QM, ZL

• All that a DFA does is read strings, and output "accept" or "reject."

• use Mealy paradigm (accept on transitions) to save space and time

Moore paradigm: the alphabet is (a, b), the states are q0, q1, and q2, the start state is q0 (denoted by the arrow coming from nowhere), the only accepting state is q2 (denoted by the double ring around the state), and the transitions are the arrows. The machine works as follows. Given an input string, we start at the start state, and read in each character one at a time, jumping from state to state as directed by the transitions. When we run out of input, we check to see if we are in an accept state. If we are, then we accept. If not, we reject.

Moore paradigm: accept/reject states

Mealy paradigm: accept/reject transitions


A DFA to recognize the query words QL, QM, ZL in a fast way

[Figure: Mealy-paradigm DFA with a start state; accept on red transitions. Edge labels: Q; Z; L or M; L; not (L or M or Q); not (L or Z); not (Q or Z).]

This DFA was downloaded from an expert website, but what do you think (see next slide)?


A DFA to recognize the query words QL, QM, ZL in a fast way

[Figure: Mealy-paradigm DFA with a start state; accept on red transitions. Edge labels: Q; Z; L or M; L; not (L or M or Q or Z); not (L or Z or Q); not (Q or Z); plus additional transitions labelled Z and Q.]

Spot and justify the differences with the last slide.
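A Mealy-style recognizer for the three query words can be written out by hand. The sketch below is one possible reading of the diagram, not a transcription of it: the state names and the `scan` helper are invented, and the machine keeps track of a pending Q or Z even when it arrives from another non-start state.

```python
START, AFTER_Q, AFTER_Z = range(3)

def step(state, ch):
    """One Mealy transition: returns (next_state, accepted_word_or_None)."""
    if state == AFTER_Q and ch in "LM":
        return START, "Q" + ch          # accepting transition for QL / QM
    if state == AFTER_Z and ch == "L":
        return START, "ZL"              # accepting transition for ZL
    if ch == "Q":
        return AFTER_Q, None            # a Q always starts a possible QL / QM
    if ch == "Z":
        return AFTER_Z, None            # a Z always starts a possible ZL
    return START, None

def scan(seq):
    """Run a sequence through the machine and remember (start_position, word) hits."""
    state, hits = START, []
    for i, ch in enumerate(seq):
        state, word = step(state, ch)
        if word is not None:
            hits.append((i - 1, word))
    return hits

print(scan("QQLZMZLQM"))
print(scan("ZQL"))
```

The second call shows why transitions between the Q and Z states matter: if reading Q after Z sent the machine back to the start state, the QL in ZQL would be missed.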


BLAST, Step 3: Extending “exact” matches

• Having the list of matches (hits) we extend alignment in both directions

Query:      L  V  N  R  K  P  V  V  P
T-similar:           R  R  P
Subject:    G  V  C  R  R  P  L  K  C
Score:     -3  4 -3  5  2  7  1 -2 -3

• … till the sum of scores drops below some level X from the best known


Step 3: Extending Hits

• extend hits in both directions (without allowing gaps)

• terminate extension in one direction when score falls certain distance below best score for shorter extensions

• return segment pairs scoring at least S
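The extension rule can be sketched as follows, using the per-column scores from the LVNRKPVVP / GVCRRPLKC example. The drop-off value X=4 and the function names are assumptions chosen for the illustration; the sketch works on a precomputed list of column scores rather than on the sequences themselves.

```python
def extend(scores, x_drop):
    """Extend in one direction; returns (best_gain, length_of_best_extension)."""
    running = best = best_len = 0
    for i, s in enumerate(scores, start=1):
        running += s
        if running > best:
            best, best_len = running, i
        elif best - running > x_drop:
            break                        # fell more than X below the best so far
    return best, best_len

def extend_hit(col_scores, seed_start, seed_end, x_drop):
    """Ungapped two-way extension of the seed col_scores[seed_start:seed_end].

    Returns (start, end_exclusive, score) of the extended segment pair.
    """
    seed = sum(col_scores[seed_start:seed_end])
    right_gain, right_len = extend(col_scores[seed_end:], x_drop)
    left_gain, left_len = extend(col_scores[:seed_start][::-1], x_drop)
    return seed_start - left_len, seed_end + right_len, seed + left_gain + right_gain

# Column scores from the slide; the seed RKP/RRP occupies columns 3..5.
scores = [-3, 4, -3, 5, 2, 7, 1, -2, -3]
print(extend_hit(scores, 3, 6, x_drop=4))
```

A final filter would then keep only segment pairs whose score is at least S.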



More Recent BLAST Extensions

• the two-hit method

• gapped BLAST

• hashing the database

• PSI-BLAST

all are aimed at increasing sensitivity while keeping run-times minimal

Altschul et al., Nucleic Acids Research 1997


The Two-Hit Method

• extension step typically accounts for 90% of BLAST’s execution time

• key idea: do extension only when there are two hits on the same diagonal within distance A of each other

• to maintain sensitivity, lower T parameter

• more single hits found

• but only small fraction have associated 2nd hit
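The two-hit trigger can be sketched by grouping word hits by diagonal. In this sketch a hit is a pair (query position, database position), the diagonal is their difference, and the values k=3 and A=40 are assumed defaults; the function name is invented, and a real implementation would also skip hits already covered by an earlier extension.

```python
def two_hit_seeds(hits, k=3, A=40):
    """Return pairs of non-overlapping hits on the same diagonal within distance A."""
    last_on_diag = {}                    # diagonal -> query position of last stored hit
    seeds = []
    for q, d in sorted(hits):
        diag = d - q
        prev = last_on_diag.get(diag)
        if prev is not None and k <= q - prev <= A:
            seeds.append(((prev, prev + diag), (q, d)))
        # an overlapping hit (closer than k) does not replace the stored one
        if prev is None or q - prev >= k:
            last_on_diag[diag] = q
    return seeds

hits = [(0, 10), (5, 15), (3, 40), (80, 90)]
print(two_hit_seeds(hits))
```

Only the first two hits share a diagonal and lie close enough together, so only they would trigger an extension; the isolated hits are discarded without any extension work.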




The Two-Hit Method

Figure from: Altschul et al. Nucleic Acids Research 25, 1997


Gapped BLAST

• trigger gapped alignment if two-hit extension has a sufficiently high score

• find length-11 segment with highest score; use central pair in this segment as seed

• run DP process both forward & backward from seed

• prune cells when local alignment score falls a certain distance below best score yet




Gapped BLAST



Combining the two-hit method and Gapped BLAST

• Before:

• relatively high T threshold for 3-letter word (hashed) lists

• two-way hit extension (see earlier slides)

• Current BLAST:

• Lower T: many more hits (more 3-letter words accepted as match)

• Relatively few hits (diagonal elements) will be on same matrix diagonal within a given distance A

• Perform 2-way local Dynamic Programming (gapped BLAST) only on ‘two-hits’ (preceding bullet)

The new way is a bit faster on average and gives better (gapped) alignments and better alignment scores!




Hashing – associative arrays

• Indexing with the object, the

• Hash function:

hash: x → address, from the set of possible objects (large) to a small set of addresses (fits in memory)

• Objects should be “well spread”




Hashing - examples

• T9 Predictive Text in mobile phones

• “hello”:4, 4, 3, 3, 5, 5, 5, (pause) 5, 5, 5, 6, 6, 6

• “hello” in T9: 4, 3, 5, 5, 6

• Collisions: 4, 6:“in”, “go”
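The T9 example can be written as a tiny hash function from words to key sequences. The keypad layout below is the standard phone layout; the names `T9_KEYS` and `t9_hash` are invented for this sketch.

```python
# Standard phone keypad layout.
T9_KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
           "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_KEY = {ch: key for key, letters in T9_KEYS.items() for ch in letters}

def t9_hash(word):
    """Map a word to the digit string typed in T9 mode (one key press per letter)."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())

print(t9_hash("hello"))
print(t9_hash("in"), t9_hash("go"))     # same key sequence: a collision
```

A phone resolves such collisions with a dictionary lookup; a hash table does the same by storing all objects that share an address.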


Hashing – examples (cont..)

• Other easier hash function: let a=1, b=2, c=3, etc.

• “hello” now gets hash address 8+5+12+12+15 = 52

• “olleh” will get same address (collision)

• Each word encountered gets a hash address immediately and can be indexed.

• How good is this hash function?
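The letter-sum hash is easy to try out, which also helps answer the question of how good it is. The function name is invented for this sketch, and it assumes lower-case words over a to z.

```python
def letter_sum_hash(word):
    """Hash address = sum of letter ranks, with a=1, b=2, ..., z=26."""
    return sum(ord(ch) - ord("a") + 1 for ch in word.lower())

print(letter_sum_hash("hello"))                          # 8 + 5 + 12 + 12 + 15
print(letter_sum_hash("olleh"))                          # any anagram collides
print(letter_sum_hash("ad"), letter_sum_hash("bc"))      # so do unrelated words
```

Because the sum ignores letter order and most words land in a narrow range of addresses, the objects are not "well spread", so this function produces many collisions.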


BLAST, Step 2: Find “exact” matches with hashing

• Preprocess the database

• Hash the database with k-words

• For each k-word store in which sequences it appears

k-word: RKP

Hashed DB:
QKP: HUgn0151194, Gene14, IG0, ...
KKP: haemoglobin, Gene134, IG_30, ...
RQP: HSPHOSR1, GeneA22...
RKP: galactosyltransferase, IG_1...
REP: haemoglobin, Gene134, IG_30, ...
RRP: Z17368, Creatine kinase, ...
...



BLAST, Step 2: Find “exact” matches with hashing

• The database is preprocessed only once! (independent from the query)

• In a constant time we can get the sequences with a certain k-word

k-word: RKP

Hashed DB:
QKP: HUgn0151194, Gene14, IG0, ...
KKP: haemoglobin, Gene134, IG_30, ...
RQP: HSPHOSR1, GeneA22...
RKP: galactosyltransferase, IG_1...
REP: haemoglobin, Gene134, IG_30, ...
RRP: Z17368, Creatine kinase, ...
...
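A hashed database of this kind can be sketched with a Python dictionary, which is itself a hash table. The two database entries below are hypothetical, and the sketch stores positions as well as sequence names, which goes slightly beyond what the slide shows.

```python
from collections import defaultdict

def build_index(database, k=3):
    """Map every k-word to the list of (sequence name, position) where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

# Hypothetical database entries, for illustration only.
db = {"seqA": "VLQKPLKKPPLV", "seqB": "CEVVRKPLVKV"}
index = build_index(db)              # done once, independent of any query

print(index["RKP"])                  # lookup is a single dictionary access
print(index["KKP"])
```

At query time, each T-similar word from Step 1 is looked up in the index, and the returned positions are the hits passed on to the extension step.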


BLAST flavours

• blastp: protein query, protein db

• blastn: DNA query, DNA db

• blastx: DNA query, protein db

• query translated in all reading frames. Used to find potential translation products of an unknown nucleotide sequence.

• tblastn: protein query, DNA db

• database dynamically translated in all reading frames.

• tblastx: DNA query, DNA db

• all translations of query against all translations of db (compare at protein level)


PSI-BLAST

• Position-Specific Iterated BLAST

• A profile (called PSSM by BLAST – Position Specific Scoring Matrix) is derived from the result of the first search (using a single query sequence)

• Database is searched against the profile (instead of a sequence) in subsequent rounds

• Up to 3-10 iterations are recommended


PSI-BLAST steps in words

1. Query sequences are first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition likely to lead to spurious hits; these are excluded from alignment.

2. The program then initially operates on a single query sequence by performing a gapped BLAST search.

3. Then, the program takes the significant local alignments (hits) found, constructs a multiple alignment (master-slave alignment) and abstracts a position-specific scoring matrix (PSSM) from this alignment.

4. The database is rescanned in a subsequent round, now using the PSSM, to find more homologous sequences. Iteration continues until the user decides to stop or the search has converged.


Profile

• a Profile is a generalized form of sequence

• probabilities instead of a letter

Position   A     C     D     ..   W     Y
1          0.3   0.1   0     ..   0.3   0.3
2          0.5   0     0     ..   0     0.5
3          0     0.5   0.2   ..   0.1   0.2
4          0.2   0.0   0.1   ..   0.4   0.3
...        ...   ...   ...   ..   ...   ...


Constructing a profile

• Take significant BLAST hits

• Make an alignment

• Assign weights to sequences

• Construct profile

Position   A     C     D     ..   W     Y
1          0.3   0.1   0     ..   0.3   0.3
2          0.5   0     0     ..   0     0.5
3          0     0.5   0.2   ..   0.1   0.2
4          0.2   0.0   0.1   ..   0.4   0.3
...        ...   ...   ...   ..   ...   ...


PSI BLAST: Constructing the Profile Matrix



Profile calculation example using frequency normalisation and log conversion

(A) Aligned sequences (positions 1 to 5):

S1  GCTCC
S2  AATCG
S3  TACGC
S4  GTGTT
S5  GTAAA
S6  CGTCC

Observed frequencies per position:

     1     2     3     4     5     Overall
A   .17   .33   .17   .17   .17   6/30 = .20
C   .17   .17   .17   .50   .50   9/30 = .30
G   .50   .17   .17   .17   .17   7/30 = .23
T   .17   .33   .50   .17   .17   8/30 = .27

Normalise by dividing by overall frequencies:

     1     2     3     4     5     Overall
A   .85  1.65   .85   .85   .85   6/30 = .20
C   .57   .57   .57  1.67  1.67   9/30 = .30
G  2.17   .74   .74   .74   .74   7/30 = .23
T   .63  1.22  1.85   .63   .63   8/30 = .27

Convert to log to base of 2:

      1      2      3      4      5
A  -0.23   0.72  -0.23  -0.23  -0.23
C  -0.81  -0.81  -0.81   0.74   0.74
G   1.11  -0.43  -0.43  -0.43  -0.43
T  -0.66   0.29   0.89  -0.66  -0.66

(B) Match GATCA to PSSM

      1      2      3      4      5
A  -0.23   0.72  -0.23  -0.23  -0.23
C  -0.81  -0.81  -0.81   0.74   0.74
G   1.11  -0.43  -0.43  -0.43  -0.43
T  -0.66   0.29   0.89  -0.66  -0.66

Find nucleotides at corresponding positions

Sum corresponding log odds matrix scores

Score = 1.11 + 0.72 + 0.89 + 0.74 - 0.23 = 3.23
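The same calculation can be reproduced in a few lines of Python. This is a minimal sketch with invented function names; it works with unrounded frequencies, so the resulting score (about 3.22) differs slightly from the 3.23 obtained from the rounded table entries.

```python
import math
from collections import Counter

ALIGNMENT = ["GCTCC", "AATCG", "TACGC", "GTGTT", "GTAAA", "CGTCC"]
BASES = "ACGT"

def build_pssm(alignment):
    """Log2-odds of column frequency over overall alignment frequency, per position.

    Note: a base that never occurs in a column would give log2(0); the example
    alignment has no such column, otherwise pseudo-counts would be needed.
    """
    n_seqs, length = len(alignment), len(alignment[0])
    overall = Counter("".join(alignment))
    total = n_seqs * length
    pssm = []
    for j in range(length):
        column = Counter(seq[j] for seq in alignment)
        pssm.append({b: math.log2((column[b] / n_seqs) / (overall[b] / total))
                     for b in BASES})
    return pssm

def score(pssm, seq):
    """Sum the log-odds entries for the residue found at each position."""
    return sum(col[b] for col, b in zip(pssm, seq))

pssm = build_pssm(ALIGNMENT)
print(round(score(pssm, "GATCA"), 2))
```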


PSI BLAST: Determining profile elements more reliably using pseudo-counts

• the value for a given element of the profile matrix is given by:

[Equation not recovered; annotation: overall alignment frequency (preceding slide)]

• where the probability of seeing amino acid ai in column j is estimated as:

[Equation not recovered; annotations: observed frequency; pseudocount (e.g. database frequency); two weights, e.g. one equal to the number of sequences in profile and the other equal to 1]


PSI BLAST: Determining profile elements more reliably using pseudo-counts

Pseudo-counts:

• mix observed a.a. frequencies with prior (e.g. database) frequencies

• drawback is pulling all frequencies to prior frequencies, which reduces differences

• are useful when multiple alignment contains only few sequences so that there is no statistical sample per column yet

• with greater numbers of sequences in the MSA, the profile becomes less dependent on the pseudo-counts
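One common way to mix observed and prior frequencies is a weighted average, sketched below. The formula p = (alpha * observed + beta * prior) / (alpha + beta), the parameter names `alpha` and `beta`, and the example numbers are assumptions made for this illustration; they are not recovered from the slides.

```python
def mix_with_pseudocounts(observed, background, alpha, beta=1.0):
    """Weighted mix of observed column frequencies and prior (background) frequencies.

    Assumed form: p = (alpha * observed + beta * background) / (alpha + beta).
    """
    return {a: (alpha * observed.get(a, 0.0) + beta * background[a]) / (alpha + beta)
            for a in background}

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # hypothetical prior
observed = {"C": 0.5, "G": 0.5}                               # A and T never seen

few = mix_with_pseudocounts(observed, background, alpha=2)     # 2 sequences
many = mix_with_pseudocounts(observed, background, alpha=200)  # 200 sequences
print(few["A"], many["A"])
```

With two sequences the unseen residue A still receives a noticeable probability; with two hundred the estimate is dominated by the observed frequencies, which matches the last bullet above.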


PSI-BLAST iteration graphic…

[Figure labels: Query sequence (Q); Low-complexity region (xxxxxxxxxxxxxxxxx); Gapped BLAST search; Database hits; PSSM (A C D .. Y); iterate]


Another PSI-BLAST iteration graphic…

[Figure labels: Q; DB; T; hits; PSSM; Discarded sequences; Run query sequence against database; Run PSSM against database]


[Figure 6, panels (A) to (D)]




PSI-BLAST entry page

Paste your query sequence

Switch this off for default run




1 - This portion of each description links to the sequence record for a particular hit.

2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence).

3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will occur in the database by chance. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E value of e-117 meaning that a sequence with a similar score is very unlikely to occur simply by chance.

4 - These links provide the user with direct access from BLAST results to related entries in other databases. ‘L’ links to LocusLink records and ‘S’ links to structure records in NCBI's Molecular Modeling DataBase.




‘X’ residues denote low-complexity sequence fragments that are ignored


Alignment Bit Score

B = (λS – ln K) / ln 2

• S is the raw alignment score

• The bit score (‘bits’) B has a standard set of units

• The bit score B is calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment

• λ and K are the statistical parameters of the scoring system (BLOSUM62 in Blast)

• See Altschul and Gish, 1996, for a collection of values for λ and K over a set of widely used scoring matrices

• Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches based on different scoring schemes (a.a. exchange matrices)
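The conversion from raw score to bit score, in its usual form B = (λS – ln K) / ln 2, is a one-liner. The parameter values used in the example call (λ = 0.267, K = 0.041) are illustrative values assumed here, not numbers taken from the slides.

```python
import math

def bit_score(raw_score, lam, K):
    """Bit score B = (lam * S - ln K) / ln 2 for raw score S."""
    return (lam * raw_score - math.log(K)) / math.log(2)

# Illustrative parameters (assumption): lam = 0.267, K = 0.041.
print(round(bit_score(100, lam=0.267, K=0.041), 1))
```

Because λ and K absorb the properties of the scoring matrix and gap penalties, two searches run with different scoring schemes give bit scores on the same scale.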


What is the statistical significance of an alignment

• To get a null model: extract local alignments from random sequences

• P-value

• The probability of obtaining the result by pure chance

• An alignment giving a lower P-value than a threshold value set by the user is considered a hit.


Normalised sequence similarity

The p-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences.

This probability follows the Poisson distribution (Waterman and Vingron, 1994):

P(x, n) = 1 – e^(–n·P(S ≥ x)),

where n is the number of sequences in the database

Depending on x and n (fixed)


E-value

• The concept of P-value applies to single comparisons

• What about searching in a large database?

Task.

Having a protein, we want to find similar ones in a large database (1 million sequences). We are interested in P-value < 0.01.

Count the number of hits we’ll get by chance alone.


Normalised sequence similarity: Statistical significance

The E-value is defined as the expected number of non-homologous sequences with score greater than or equal to a score x in a database of n sequences:

E(x, n) = n·P(S ≥ x)

For example, if E-value = 0.01, then the expected number of random hits with score S ≥ x is 0.01, which means that this E-value is expected by chance only once in 100 independent searches over the database. If the E-value of a hit is 5, then five fortuitous hits with S ≥ x are expected within a single database search, which renders the hit not significant.
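The relation between the single-comparison probability, the E-value and the database-wide p-value can be made concrete with a short sketch. The function names are invented; the formulas are E = n·P(S ≥ x) and P(x, n) = 1 – exp(–E) as given on the slides.

```python
import math

def expected_hits(p_single, n):
    """E-value: expected number of chance hits in a database of n sequences."""
    return n * p_single

def p_at_least_one(p_single, n):
    """Database-wide p-value: probability of at least one chance hit."""
    return 1.0 - math.exp(-n * p_single)

# The task from the E-value slide: per-comparison P-value 0.01, 1 million sequences.
print(expected_hits(0.01, 1_000_000))        # chance hits expected in one search

# For small E the two quantities nearly coincide.
print(expected_hits(1e-8, 1_000_000), p_at_least_one(1e-8, 1_000_000))
```

A per-comparison threshold of 0.01 is therefore far too lenient for a database search: about ten thousand hits would be expected by chance alone.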


A model for database searching score probabilities

• Scores resulting from searching with a query sequence against a database follow the Extreme Value Distribution (EVD) (Gumbel, 1955).

• Using the EVD, the raw alignment scores are converted to a statistical score (E value) that keeps track of the database amino acid composition and the scoring scheme (a.a. exchange matrix)


Extreme Value Distribution

Probability density function for the extreme value distribution resulting from parameter values μ = 0 and λ = 1, [y = 1 – exp(–e^(–x))], where μ is the characteristic value and λ is the decay constant.

y = 1 – exp(–e^(–λ(x – μ)))


Extreme Value Distribution (EVD)

You know that an optimal alignment of two sequences is selected out of many suboptimal alignments, and that a database search is also about selecting the best alignment(s). This fits well with the EVD, which has a right tail that falls off more slowly than the left tail. Compared to using the normal distribution, when using the EVD an alignment has to score further away from the expected mean value to become a significant hit.

[Figure legend: real data; EVD approximation]


Extreme Value Distribution

The probability of a score S to be larger than a given value x can be calculated following the EVD as:

E-value: P(S ≥ x) = 1 – exp(–e^(–λ(x – μ))),

where μ = (ln Kmn)/λ, and K a constant that can be estimated from the background amino acid distribution and scoring matrix (see Altschul and Gish, 1996, for a collection of values for λ and K over a set of widely used scoring matrices).


Extreme Value Distribution

Using the equation for μ (preceding slide), the probability for the raw alignment score S becomes

P(S ≥ x) = 1 – exp(–Kmn·e^(–λx)).

In practice, the probability P(S ≥ x) is estimated using the approximation 1 – exp(–e^(–x)) ≈ e^(–x), which is valid for large values of x. This leads to a simplification of the equation for P(S ≥ x):

P(S ≥ x) ≈ e^(–λ(x – μ)) = Kmn·e^(–λx).

The lower the probability (E value) for a given threshold value x, the more significant the score S.
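The quality of the large-x approximation can be checked numerically. In this sketch the decay constant is written `lam`, the values lam = 0.267 and K = 0.041 are illustrative assumptions, m and n are taken as the query and database lengths from the earlier SWISS-PROT example, and the function names are invented.

```python
import math

def evd_tail(x, lam, K, m, n):
    """Exact tail probability P(S >= x) = 1 - exp(-K*m*n*exp(-lam*x))."""
    return -math.expm1(-K * m * n * math.exp(-lam * x))

def evd_tail_approx(x, lam, K, m, n):
    """Large-x approximation P(S >= x) ~ K*m*n*exp(-lam*x)."""
    return K * m * n * math.exp(-lam * x)

params = dict(lam=0.267, K=0.041, m=362, n=29_085_265)   # illustrative values

for x in (10, 60, 100):
    print(x, evd_tail(x, **params), evd_tail_approx(x, **params))
```

For high scores the two values agree closely, while for low scores the approximation exceeds 1 and can no longer be read as a probability, only as an expected count.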


Normalised sequence similarity: Statistical significance

• Database searching is commonly performed using an E-value threshold between 0.1 and 0.001.

• Low E-value thresholds decrease the number of false positives in a database search, but increase the number of false negatives, thereby lowering the sensitivity of the search.


Words of Encouragement

• “There are three kinds of lies: lies, damned lies, and statistics” – Benjamin Disraeli

• “Statistics in the hands of an engineer are like a lamppost to a drunk – they’re used more for support than illumination”

• “Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates


Database Search Algorithms: Sensitivity, Selectivity

• Sensitivity – the ability to detect weak similarities between sequences (often due to long evolutionary separation). Increasing sensitivity reduces false negatives, i.e. those database sequences similar to the query, but rejected. Sensitivity = TP / (TP + FN)

• Selectivity – the ability to screen out similarities due to chance. Increasing selectivity reduces false positives, those sequences recognized as similar when they are not. Selectivity = TP / (TP + FP)

Courtesy of Gary Benson (ISSCB 2003)
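Both measures follow directly from the counts of true positives (TP), false negatives (FN) and false positives (FP). The counts in the example are hypothetical and only serve to show the arithmetic.

```python
def sensitivity(tp, fn):
    """Fraction of truly related sequences that the search reports."""
    return tp / (tp + fn)

def selectivity(tp, fp):
    """Fraction of reported sequences that are truly related."""
    return tp / (tp + fp)

# Hypothetical search outcome: 80 true hits found, 20 missed, 10 spurious hits.
print(sensitivity(tp=80, fn=20))
print(selectivity(tp=80, fp=10))
```

Loosening the E-value threshold typically moves hits from FN to TP (higher sensitivity) while also adding FP (lower selectivity), which is the trade-off described above.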


Dot-plots: a simple way to visualise sequence similarity

Can be a bit messy, though...

Filter: 6/10 residues have to match...
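A filtered dot-plot of this kind can be sketched as a sliding-window comparison. The function name and the convention of placing the dot at the window start are choices made for this illustration; the defaults correspond to the 6-out-of-10 filter mentioned above.

```python
def dot_plot(a, b, window=10, min_matches=6):
    """Set of (i, j) where the windows a[i:i+window] and b[j:j+window]
    agree in at least min_matches positions."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + t] == b[j + t] for t in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots

seq = "ACGTTGCAAGCTAGCT"
dots = dot_plot(seq, seq)
# A self-comparison always contains the full main diagonal.
print(all((i, i) in dots for i in range(len(seq) - 9)))
```

With window=1 and min_matches=1 the function reduces to the unfiltered dot-plot, where every pair of identical residues gives a dot.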


Dot-plots, what about...

• Insertions/deletions – DNA and proteins

• Duplications (e.g. tandem repeats) – DNA and proteins

• Inversions – DNA


Dot-plots, self-comparison

Direct repeat

Tandem repeat

Inverted repeat




• END
