Mathematics and computation behind BLAST and FASTA

Xuhua Xia

[email protected]

http://dambe.bio.uottawa.ca

Bioinformatics-enabled research

Sequence variation:

UUCUCAACCAACCAUAAAGAUAU

UUCUCUACAAACCACAAAGACAU


UUCUCAACCAACCACAAAGACAU

UUCUCCACGAACCACAAAGAUAU

UUCUCUACAAACCACAAAGAUAU


UUCUCUACUAACCACAAAGACAU

Difference in

1. coding sequences

2. Regulatory sequences

3. transcription

4. splicing

5. translation

6. translated sequences

Difference in

1. protein abundance

2. protein structure

3. cellular localization

4. protein interaction partners

Difference in biochemical function

Difference in phenotype

1. morphological

2. physiological

3. behavioural

Difference in

1. susceptibility to diseases

2. response to medicine

3. Fitness (survival and reproductive success)

Personalized medicine

Conservation strategies

Evolutionary mechanisms

... Nurturing environment

Why string matching?

• Efficient search against large sequence databases

• Practical significance from early applications
– Sequence similarity between an oncogene (a viral gene that causes a cancer-like transformation of the infected cells), v-sis, and the platelet-derived growth factor (PDGF)
  • M. D. Waterfield et al. 1983. Nature 304:35-39
  • R. F. Doolittle et al. 1983. Science 221:275-277
– Contig assembly
– Functional annotation by homology search

• Fast computational methods in string matching
– FASTA
– BLAST
– Local pair-wise alignment by dynamic programming

Basic stats in string matching

• Given $P_A$, $P_C$, $P_G$, $P_T$ in a target (database) sequence, the probability of a query sequence, say, ATTGCC, having a perfect match of the target sequence is:

$$\text{prob} = P_A P_T P_T P_G P_C P_C = P_A (P_C)^2 P_G (P_T)^2$$

• Let M be the target sequence length and N be the query sequence length; the "matching operation" can be performed (M – N + 1) times, e.g.,

Query:  ATG
Target: CGATTGCCCG

• The probability distribution of the number of matches follows (approximately) a binomial distribution with p = prob and n = (M – N + 1)
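The two quantities on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lecture: the function names are hypothetical, and the equal nucleotide frequencies are an assumption made only for the example.

```python
from math import prod

def perfect_match_prob(query, freqs):
    """Probability that `query` matches perfectly at one given position,
    i.e. the product of the target frequencies of its letters."""
    return prod(freqs[base] for base in query)

def n_match_positions(target_len, query_len):
    """Number of times the matching operation can be performed: M - N + 1."""
    return max(target_len - query_len + 1, 0)

# Assumed frequencies (equal), chosen only for illustration.
freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
query = "ATTGCC"
target = "CGATTGCCCG"

p = perfect_match_prob(query, freqs)             # P_A (P_C)^2 P_G (P_T)^2
n = n_match_positions(len(target), len(query))   # M - N + 1
expected_matches = n * p                         # mean of the binomial, np

print(f"p = {p:.6g}, n = {n}, expected matches = {expected_matches:.6g}")
```

With these inputs n is 5 and p is 0.25 to the sixth power, so a perfect match by chance is expected far less than once.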

Basic stats in string matching

• Probability of having a sequence match: p

• Probability of having no match: q = 1 - p

• Binomial distribution:

$$(p+q)^n = p^n + \frac{n!}{(n-1)!\,1!}\,p^{n-1}q + \dots + \frac{n!}{(n-x)!\,x!}\,p^{n-x}q^{x} + \dots + q^n$$

• When np > 50, the binomial distribution can be approximated by the normal distribution with the mean = np and variance = npq:

$$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

• When np < 1 and n is very large, the binomial distribution can be approximated by the Poisson distribution with mean and variance equal to np (i.e., μ = σ² = np):

$$P(x) = \frac{e^{-\mu}\mu^{x}}{x!}$$

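From Binomial to Poisson

$$(p+q)^n = p^n + \frac{n!}{(n-1)!\,1!}\,p^{n-1}q + \dots + \frac{n!}{(n-x)!\,x!}\,p^{n-x}q^{x} + \dots + \frac{n!}{(n-x)!\,x!}\,p^{x}q^{n-x} + \dots + q^n$$

$$P(n) = p^n$$

$$P(n-1) = n\,p^{n-1}q$$

$$P(n-x) = \frac{n!}{(n-x)!\,x!}\,p^{n-x}q^{x}$$

$$P(x) = \frac{n!}{(n-x)!\,x!}\,p^{x}q^{n-x}$$

$$P(0) = q^n$$

When n is very large and p is very small (so that n ≫ x and $q^x \approx 1$):

$$P(x) = \frac{n!}{(n-x)!\,x!}\,p^{x}q^{n-x} = \frac{n(n-1)(n-2)\cdots(n-x+1)}{x!}\,p^{x}\,\frac{q^{n}}{q^{x}} \approx \frac{n^{x}}{x!}\,p^{x}q^{n} = \frac{(np)^{x}}{x!}\,(1-p)^{n}$$

Because $(1-p)^{n} = \left[(1-p)^{1/p}\right]^{np} \approx e^{-np}$ for small p,

$$P(x) \approx \frac{(np)^{x}e^{-np}}{x!} = \frac{\mu^{x}e^{-\mu}}{x!}$$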
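The quality of the Poisson approximation is easy to check numerically. The sketch below compares the exact binomial probabilities with the Poisson probabilities for one illustrative choice of n and p. The values n = 100,000 and p = 0.00001 are assumptions picked for the example, not numbers from the lecture.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """Exact binomial probability of x matches in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, mu):
    """Poisson probability of x matches with mean mu = np."""
    return exp(-mu) * mu ** x / factorial(x)

# Illustrative values only: large n, small p, so that np = 1.
n, p = 100_000, 1e-5
mu = n * p

for x in range(4):
    b = binomial_pmf(x, n, p)
    q = poisson_pmf(x, mu)
    print(f"x = {x}: binomial = {b:.8f}, Poisson = {q:.8f}, diff = {abs(b - q):.2e}")
```

For these values the two distributions differ only far beyond the fourth decimal place, which is why the Poisson form is used in place of the binomial.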

Matching two sequences without gap

• Assuming equal nucleotide frequencies, the probability of a nucleotide site in the query sequence matching a site in the target sequence is p = 0.25.

• The probability of finding an exact match of L letters is $a = p^L = 0.25^L = 2^{-2L} = 2^{-S}$, where S is called the bit score in BLAST.

• M: query length; N: target length, e.g., M = 8, N = 5, L = 3

AACGGTTC
CGGTT

• A sequence of length L can move at (M – L + 1) distinct sites along the query and (N – L + 1) distinct sites along the target.

• m = (M – L + 1) and n = (N – L + 1) are called effective lengths of the two sequences.

• The expected number of matches with length L is $mn2^{-S}$, which is called the E-value in ungapped BLAST.

• S is calculated differently in the gapped BLAST
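As a worked example of the ungapped case, the sketch below plugs the slide's numbers (M = 8, N = 5, L = 3) into the formulas above. The function names are hypothetical, and the code assumes equal nucleotide frequencies, so that the bit score is simply S = 2L.

```python
def effective_lengths(M, N, L):
    """Effective lengths m = M - L + 1 and n = N - L + 1."""
    return M - L + 1, N - L + 1

def ungapped_e_value(M, N, L):
    """Expected number of exact matches of length L: m * n * 2^(-S).

    Assumes p = 0.25 per site, so 0.25^L = 2^(-2L) and S = 2L."""
    m, n = effective_lengths(M, N, L)
    bit_score = 2 * L
    return m * n * 2.0 ** (-bit_score)

m, n = effective_lengths(8, 5, 3)
print(m, n)                       # 6 3
print(ungapped_e_value(8, 5, 3))  # 18 / 64 = 0.28125
```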

Blast Output (Nuc. Seq.)

BLASTN 2.2.4 [Aug-26-2002]
...
Query= Seq1 38
Database: MgCDS
480 sequences; 526,317 total letters

Sequences producing significant alignments:    Score (bits)    E Value
MG001 1095 bases                               34              7e-004

Score = 34.2 bits (17), Expect = 7e-004
Identities = 35/40 (87%), Gaps = 2/40 (5%)

Query: 1 atgaataacg--attatttccaacgacaaaacaaaaccac 38
         ||||||||||  |||||||||||  |||||| ||||||||
Sbjct: 1 atgaataacgttattatttccaataacaaaataaaaccac 40

Lambda    K        H
1.37      0.711    1.31

Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
…
effective length of query: 26
effective length of database: 520,557

Matches: 35*1 = 35
Mismatches: 3*(-3) = -9
Gap open: 1*5 = 5
Gap extension: 2*2 = 4
R = 35 - 9 - 5 - 4 = 17
S = [λR – ln(K)]/ln(2) = [1.37*17 - ln(0.711)]/ln(2) = 34
E = $mn2^{-S}$ = 26 * 520557 * $2^{-34}$ = 7.878E-04

x    p(x)
0    0.999265217
1    0.000734513
2    0.000000270
3    0.000000000

$$p(x) = \frac{e^{-E}E^{x}}{x!}$$

Typically one would count only 1 GE here.

Constant gap penalty vs affine function penalty
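The arithmetic on this slide can be reproduced with a short script. This is a sketch under the slide's own conventions, not BLAST's actual implementation: the function names are hypothetical, the gap is charged one opening plus two extensions (as on the slide), and the bit score is rounded to 34 before computing E because the slide does so.

```python
from math import log

def raw_score(matches, mismatches, gap_opens, gap_extensions,
              match=1, mismatch=-3, gap_open=5, gap_extend=2):
    """Raw score R: match and mismatch scores minus gap penalties."""
    return (matches * match + mismatches * mismatch
            - gap_opens * gap_open - gap_extensions * gap_extend)

def bit_score(raw, lam, k):
    """Bit score S = [lambda * R - ln(K)] / ln(2)."""
    return (lam * raw - log(k)) / log(2)

def e_value(m, n, bits):
    """E = m * n * 2^(-S), with m and n the effective lengths."""
    return m * n * 2.0 ** (-bits)

# Numbers taken from the BLAST output above.
R = raw_score(matches=35, mismatches=3, gap_opens=1, gap_extensions=2)
S = bit_score(R, lam=1.37, k=0.711)
E = e_value(26, 520557, round(S))  # the slide rounds S to 34

print(f"R = {R}, S = {S:.2f} bits, E = {E:.3e}")
```

Using the unrounded bit score instead of 34 gives a slightly smaller E, closer to the 7e-004 that BLAST itself reports.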

Lambda (λ) and K

BLAST output includes lambda (λ) and K. Mathematically, λ is defined as follows:

$$\sum_{i=1}^{4}\sum_{j=1}^{4} p_i p_j e^{\lambda s_{ij}} = 1$$

where $p_i$, $p_j$ are nucleotide frequencies (i, j = A, C, G, or T), and $s_{ij}$ is the match (when i = j) or mismatch (when i ≠ j) score. In nucleotide BLAST by default, we have $s_{ii}$ = 1 and $s_{ij}$ = -3. In the simplest case with equal nucleotide frequencies, i.e., when $p_i$ = 0.25, the equation above is reduced to

$$\sum_{i=1}^{4}\sum_{j=1}^{4} p_i p_j e^{\lambda s_{ij}} = 4 \times 0.25^2 e^{\lambda} + 12 \times 0.25^2 e^{-3\lambda} = 0.25\,e^{\lambda} + 0.75\,e^{-3\lambda} = 1$$

For amino acid sequences, the sums run over the 20 amino acids:

$$\sum_{i=1}^{20}\sum_{j=1}^{20} p_i p_j e^{\lambda s_{ij}} = 1$$

See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K.
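The defining equation for λ has no closed-form solution in general, but it is easy to solve numerically. The sketch below uses plain bisection. It is an illustration rather than the method used by BLAST or by BLASTParameter.xlsm, the function names are hypothetical, and it assumes the expected score is negative so that a single positive root exists.

```python
from math import exp

def lambda_residual(lam, freqs, score):
    """Left-hand side of the lambda equation minus 1."""
    return sum(pi * pj * exp(lam * score(a, b))
               for a, pi in freqs.items()
               for b, pj in freqs.items()) - 1.0

def solve_lambda(freqs, score, tol=1e-10):
    """Find the positive root of the lambda equation by bisection.

    Assumes a negative expected score, so the residual is negative just
    above zero and becomes positive for large lambda."""
    lo, hi = 1e-6, 1.0
    while lambda_residual(hi, freqs, score) <= 0:
        hi *= 2.0                      # expand until the root is bracketed
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if lambda_residual(mid, freqs, score) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# Equal frequencies with match = 1 and mismatch = -3.
freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
lam = solve_lambda(freqs, lambda a, b: 1 if a == b else -3)
print(f"lambda = {lam:.5f}")
```

For equal frequencies this solves 0.25·e^λ + 0.75·e^(-3λ) = 1 and gives a value close to the 1.37 shown in the BLAST output.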

E-Value in BLAST

• The E-value is the expected number of random matches that are as good as or better than the reported match. It can be a number near zero or much larger than 1.

• It is NOT the probability of finding the reported match.

• Only when the E-value is extremely small can it be interpreted as the probability of finding 1 match that is as good as the reported one (see next slide).

E-value and P(1)

[Figure: P(1) plotted against E-value, for E-values from 0 to 1]

$$p(x) = \frac{e^{-E}E^{x}}{x!}$$

$$p(1) = E\,e^{-E} \approx E \quad \text{when } E \to 0$$
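The relation between the E-value and p(1) can be tabulated directly from the Poisson formula. The sketch below is illustrative only: the function names and the list of E-values are assumptions chosen to show where the approximation p(1) ≈ E holds and where it breaks down.

```python
from math import exp, factorial

def poisson_pmf(x, e_value):
    """Probability of exactly x chance matches when the E-value is the mean."""
    return exp(-e_value) * e_value ** x / factorial(x)

def prob_at_least_one(e_value):
    """Probability of one or more chance matches: 1 - p(0)."""
    return 1.0 - exp(-e_value)

for e in (0.001, 0.01, 0.1, 0.5, 1.0):
    print(f"E = {e:<6} p(1) = {poisson_pmf(1, e):.6f}  "
          f"P(x >= 1) = {prob_at_least_one(e):.6f}")
```

For E = 0.001 both probabilities are essentially equal to E, whereas for E = 1 the value of p(1) is only about 0.37.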

Gapped BLAST

• Adapted from Crane & Raymer 2003

• Input sequence: AILVPTVIGCTVPT

• Algorithm:
– Break the query sequence into words: AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT, GCTV, CTVP, TVPT
– Discard common words (i.e., words made entirely of common amino acids)
– Search for matches against database sequences, assess significance, and decide whether to discard the match or to continue with extension using dynamic programming:

Query:                AILVPTVIGCTVPT
Target: MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC
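The word-breaking and seeding steps can be sketched as follows. This is a simplified illustration with hypothetical function names: it uses exact word matches only and skips the "discard common words" filter, whereas a real protein BLAST search also scores similar (neighbourhood) words.

```python
def words(seq, w):
    """All overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def exact_seeds(query, target, w):
    """1-based (query position, target position) pairs of exact word hits."""
    index = {}
    for j, word in enumerate(words(target, w), start=1):
        index.setdefault(word, []).append(j)
    return [(i, j)
            for i, word in enumerate(words(query, w), start=1)
            for j in index.get(word, [])]

query = "AILVPTVIGCTVPT"
target = "MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC"

print(words(query, 4))
print(exact_seeds(query, target, 4))
```

All hits found this way lie on the same diagonal (target position minus query position = 14), which is the region a subsequent extension step would examine.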

BLAST Programs

Program            Database     Query        Typical Uses
BLASTN/MEGABLAST   Nucleotide   Nucleotide   MEGABLAST has a longer word size than BLASTN
BLASTP             Protein      Protein      Query a protein/peptide against a protein database
BLASTX             Protein      Nucleotide   Translate a nuc sequence into a "protein" in six frames and search against a protein database
TBLASTN            Nucleotide   Protein      Unannotated nuc sequences (e.g., ESTs) are translated in six frames, against which the query protein is searched
TBLASTX            Nucleotide   Nucleotide   6-frame translation of both query and database
PHI-BLAST          Protein      Protein      Pattern-hit initiated BLAST
PSI-BLAST          Protein      Protein      Position-specific iterated BLAST
RPS-BLAST          Protein      Protein      Reverse PSI-BLAST

FASTA

• Another commonly used family of alignment and search tools

• Generally considered to be more sensitive than BLAST.

• Illustration with two fictitious sequences used in the Contig Assembly lecture:

Seq1: ACCGCGATGACGAATA
Seq2: GAATACGACTGACGATGGA

String Match in FASTA

Pos        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
Query      A   C   C   G   C   G   A   T   G   A   C   G   A   A   T   A
Target     G   A   A   T   A   C   G   A   C   T   G   A   C   G   A   T   G   G   A

Positions of each nucleotide in the query:

A     C     G     T
1     2     4     8
7     3     6     15
10    5     9
13    11    12
14
16

For each target site, the move (target position minus query position) that aligns it with each matching query site:

Pos        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
Target     G   A   A   T   A   C   G   A   C   T   G   A   C   G   A   T   G   G   A
Move      -3   1   2  -4   4   4   3   7   7   2   7  11  11  10  14   8  13  14  18
Move      -5  -5  -4 -11  -2   3   1   1   6  -5   5   5  10   8   8   1  11  12  12
Move      -8  -8  -7      -5   1  -2  -2   4       2   2   8   5   5       8   9   9
Move     -11 -11 -10      -8  -5  -5  -5  -2      -1  -1   2   2   2       5   6   6
Move         -12 -11      -9          -6              -2           1               5
Move         -14 -13     -11          -8              -4          -1               3

Number of matches (N) for each move:

Left           Right
Move   N       Move   N
-1     3       1      6
-2     5       2      7
-3     1       3      3
-4     3       4      3
-5     7       5      6
-6     1       6      3
-7     1       7      3
-8     4       8      5
-9     1       9      2
-10    1       10     2
-11    5       11     3
-12    1       12     2
-13    1       13     1
-14    1       14     2
-15    0       15     0
               16     0
               17     0
               18     1

Left and Right: -n means moving the query left by n sites and n means moving the query right by n sites.
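The lookup-table procedure on this slide can be sketched in Python. This is a minimal illustration of the first stage only (counting matches per move), with hypothetical function names. The word length k is a parameter, so the same code covers single nucleotides (k = 1) and longer words.

```python
from collections import Counter, defaultdict

def build_lookup(query, k=1):
    """Map each word of length k to its 1-based positions in the query."""
    lookup = defaultdict(list)
    for i in range(len(query) - k + 1):
        lookup[query[i:i + k]].append(i + 1)
    return lookup

def move_counts(query, target, k=1):
    """Count matching words for each move (target position minus query
    position). Negative moves shift the query left, positive moves right."""
    lookup = build_lookup(query, k)
    counts = Counter()
    for j in range(len(target) - k + 1):
        for i in lookup.get(target[j:j + k], []):
            counts[(j + 1) - i] += 1
    return counts

query = "ACCGCGATGACGAATA"
target = "GAATACGACTGACGATGGA"

counts1 = move_counts(query, target, k=1)
counts2 = move_counts(query, target, k=2)
print(sorted(counts1.items()))
print(counts2.most_common(1))
```

With k = 1 the moves 2 and -5 tie at seven matches each, while k = 2 singles out move 2 with five matching words.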

Alternative Matched Strings

From lecture on contig assembly:

Query:  ACCGCGATGACGAATA
Target:            GAATACGACTGACGATGGA

From FASTA algorithm:

Query:    ACCGCGATGACGAATA
Target: GAATACGACTGACGATGGA



Which one is best based on YOUR judgment?

Forw.          Back
Move   N       Move   N
-1     3       1      6
-2     5       2      7
-3     1       3      3
-4     3       4      3
-5     7       5      6
-6     1       6      3
-7     1       7      3
-8     4       8      5
-9     1       9      2
-10    1       10     2
-11    5       11     3
-12    1       12     2
-13    1       13     1
-14    1       14     2
-15    0       15     0
               16     0
               17     0
               18     1

Best

2nd best

One of the three 3rd best

Word length of 2

Query:  ACCGCGATGACGAATA
Target: GAATACGACTGACGATGGA

Positions of each two-letter word in the query:

Word    AA  AC  AG  AT  CA  CC  CG  CT  GA  GC  GG  GT  TA  TC  TG  TT
Pos     13   1       7       2   3       6   4          15       8
Pos         10      14           5       9
Pos                             11      12

For each target word, the move (target position minus query position) that aligns it with each matching query word:

Pos        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
Word      GA  AA  AT  TA  AC  CG  GA  AC  CT  TG  GA  AC  CG  GA  AT  TG  GG  GA
Move      -5 -11  -4 -11   4   3   1   7       2   5  11  10   8   8   8      12
Move      -8     -11      -5   1  -2  -2           2   2   8   5   1           9
Move     -11                  -5  -5              -1       2   2               6

Number of matches (N) for each move:

Left           Right
Move   N       Move   N
-1     1       1      3
-2     2       2      5
-3     0       3      1
-4     1       4      1
-5     4       5      2
-6     0       6      1
-7     0       7      1
-8     1       8      4
-9     0       9      1
-10    0       10     1
-11    4       11     1
-12    0       12     1
-13    0       13     0
-14    0       14     0
               15     0
               16     0
               17     0


Query:    ACCGCGATGACGAATA
Target: GAATACGACTGACGATGGA
Best

One of the three 2nd best

Comparison: BLAST and FASTA

• BLAST starts with exact string matching, while FASTA starts with inexact string matching (or exact string matching with shorter words). BLAST is faster than FASTA.

• For the examples given, both BLAST and FASTA will find the same best match, i.e., shifting the query sequence by 2 sites to the right.

• Both perform dynamic programming for extending the match after the initial match.

Optional: BLAST Parameters

• The Lambda (λ) and Karlin-Altschul (K) parameters are important because they directly affect the computation of the E-value.

• Both λ and K depend on
– nucleotide (or amino acid) frequencies
– match-mismatch matrix

• All BLAST implementations generally assume that nucleotide (or amino acid) sequences have roughly equal frequencies.

• For nucleotide (or amino acid) sequences with strongly biased frequencies, the BLAST E-value obtained with this assumption can be quite misleading, i.e., one should use appropriate λ and K.

Case 1: equal frequencies, (-3, 1)

Base   Freq   A (0.25)   G (0.25)   C (0.25)   T (0.25)
A      0.25   0.0625     0.0625     0.0625     0.0625
G      0.25   0.0625     0.0625     0.0625     0.0625
C      0.25   0.0625     0.0625     0.0625     0.0625
T      0.25   0.0625     0.0625     0.0625     0.0625

Match-Mismatch
Base   A    G    C    T
A      1    -3   -3   -3
G      -3   1    -3   -3
C      -3   -3   1    -3
T      -3   -3   -3   1

Lambda 1.37407
Base   A          G          C          T
A      0.246963   0.001013   0.001013   0.001013
G      0.001013   0.246963   0.001013   0.001013
C      0.001013   0.001013   0.246963   0.001013
T      0.001013   0.001013   0.001013   0.246963
Sum    1.000007

Case 2: different frequencies, (-3, 1)

Base   Freq   A (0.1)   G (0.4)   C (0.4)   T (0.1)
A      0.1    0.01      0.04      0.04      0.01
G      0.4    0.04      0.16      0.16      0.04
C      0.4    0.04      0.16      0.16      0.04
T      0.1    0.01      0.04      0.04      0.01
Sum    1

Lambda 1.0501
Base   A          G          C          T
A      0.028579   0.001714   0.001714   0.000428
G      0.001714   0.45727    0.006854   0.001714
C      0.001714   0.006854   0.45727    0.001714
T      0.000428   0.001714   0.001714   0.028579
Sum    0.999972

Case 3: different frequencies, s/v

Base   Freq   A (0.1)   G (0.4)   C (0.4)   T (0.1)
A      0.1    0.01      0.04      0.04      0.01
G      0.4    0.04      0.16      0.16      0.04
C      0.4    0.04      0.16      0.16      0.04
T      0.1    0.01      0.04      0.04      0.01
Sum    1

Lambda 0.9899
Base   A          G          C          T
A      0.02691    0.014865   0.002053   0.000513
G      0.014865   0.430554   0.008211   0.002053
C      0.002053   0.008211   0.430554   0.014865
T      0.000513   0.002053   0.014865   0.02691
Sum    1.000046
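The three tables can be checked by recomputing each term $p_i p_j e^{\lambda s_{ij}}$ and confirming that the terms sum to approximately 1 for the λ reported on the slide. The sketch below uses hypothetical names throughout. For Case 3 it assumes that "s/v" means a transition score of -1 and a transversion score of -3, which is consistent with the tabulated values.

```python
from math import exp

BASES = "AGCT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def score_simple(a, b):
    """Match = 1, mismatch = -3."""
    return 1 if a == b else -3

def score_sv(a, b):
    """Match = 1, transition = -1, transversion = -3 (assumed for Case 3)."""
    if a == b:
        return 1
    return -1 if (a, b) in TRANSITIONS else -3

def term_matrix(freqs, score, lam):
    """Matrix of p_i * p_j * exp(lambda * s_ij), rows and columns in BASES order."""
    return [[freqs[a] * freqs[b] * exp(lam * score(a, b)) for b in BASES]
            for a in BASES]

def total(matrix):
    return sum(sum(row) for row in matrix)

equal = dict.fromkeys(BASES, 0.25)
biased = {"A": 0.1, "G": 0.4, "C": 0.4, "T": 0.1}

cases = {
    "Case 1": (equal, score_simple, 1.37407),
    "Case 2": (biased, score_simple, 1.0501),
    "Case 3": (biased, score_sv, 0.9899),
}
for name, (freqs, score, lam) in cases.items():
    print(f"{name}: lambda = {lam}, sum of terms = {total(term_matrix(freqs, score, lam)):.6f}")
```

The sums deviate from 1 only in the fifth decimal place, which reflects the rounding of λ on the slide.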

K: Case 1

Base   Freq   A (0.25)   G (0.25)   C (0.25)   T (0.25)
A      0.25   0.0625     0.0625     0.0625     0.0625
G      0.25   0.0625     0.0625     0.0625     0.0625
C      0.25   0.0625     0.0625     0.0625     0.0625
T      0.25   0.0625     0.0625     0.0625     0.0625

Match 1
Mismatch -3

Score         -3     -2    -1    0    1
Probability   0.75   0     0     0    0.25

Type '=karlin(-3,1,true,true,true)' to compute the BLAST parameters. The three 'true' arguments correspond to the parameters bDoLambda, bDoH and bDoK.

Lambda = 1.3741 H = 1.3072 K = 0.7106

K: Case 2

Base   Freq   A (0.1)   G (0.4)   C (0.4)   T (0.1)
A      0.1    0.01      0.04      0.04      0.01
G      0.4    0.04      0.16      0.16      0.04
C      0.4    0.04      0.16      0.16      0.04
T      0.1    0.01      0.04      0.04      0.01

Match 1
Transition -1
Transversion -3

Score         -3    -2    -1     0    1
Probability   0.5   0     0.16   0    0.34

Lambda = 0.9898 H = 0.7705 K = 0.4891
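The score probabilities and H on these two slides can be reproduced from the frequencies and the scoring scheme. The sketch below is a hedged illustration, not the spreadsheet's karlin function: the names are hypothetical, λ is taken from the slides rather than solved for, and H is computed as the relative entropy λ·Σ s·p(s)·e^(λs) in nats, which matches the reported values. Computing K requires a more involved series and is not attempted here.

```python
from collections import defaultdict
from math import exp

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def score_distribution(freqs, score):
    """Probability of each score when two random sites are paired."""
    dist = defaultdict(float)
    for a, pa in freqs.items():
        for b, pb in freqs.items():
            dist[score(a, b)] += pa * pb
    return dict(dist)

def relative_entropy(dist, lam):
    """H = lambda * sum over scores of s * p(s) * exp(lambda * s), in nats."""
    return lam * sum(s * p * exp(lam * s) for s, p in dist.items())

def score_case1(a, b):
    return 1 if a == b else -3

def score_case2(a, b):
    if a == b:
        return 1
    return -1 if (a, b) in TRANSITIONS else -3

dist1 = score_distribution(dict.fromkeys("AGCT", 0.25), score_case1)
dist2 = score_distribution({"A": 0.1, "G": 0.4, "C": 0.4, "T": 0.1}, score_case2)

H1 = relative_entropy(dist1, 1.3741)  # lambda from the Case 1 slide
H2 = relative_entropy(dist2, 0.9898)  # lambda from the Case 2 slide
print(dist1, f"H = {H1:.4f}")
print(dist2, f"H = {H2:.4f}")
```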

Bioinformatics research workflow

Accumulation of nucleotide and amino acid sequences:


UUCUCUACAAACCACAAAGACAU



UUCUCCACGAACCACAAAGAUAU

UUCUCUACAAACCACAAAGAUAU


UUCUCUACUAACCACAAAGACAU

Storage and annotation of the sequences

1. Structural annotation with homology search and de novo gene prediction

2. Functional annotation with gene ontologies

Species-specific gene dictionaries, e.g., yeastgenome.org

1. Comparative genomics (the origin of new genes, new features and new species)

2. Phylogenetics (cladogenic process, dating of speciation and gene duplication events)

3. Phylogeny-based inference.

Mutation

Selection

Adaptation

1. Gene/Protein families (e.g., Pfam)

2. Cluster of orthologous genes (e.g., COG)

3. Supermatrix of gene presence/absence

4. Genome-based pair-wise distance distributions

Functional genomics

Systems biology

Digital cells
