Transcript
Page 1: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

BLAST and FASTAHeuristics in pairwise sequence alignment

Christoph Dieterich

Department of Evolutionary BiologyMax Planck Institute for Developmental Biology

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 2: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Introduction

1 Pairwise alignment is used to detect homologies betweendifferent protein or DNA sequences, e.g. as global or localalignments.

2 Problem solved using dynamic programming in O(nm)time and O(n) space.

3 This is too slow for searching current databases.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 3: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Heuristics for large-scale database searching

In practice algorithms are used that run much faster, at theexpense of possibly missing some significant hits due to theheuristics employed.

Such algorithms are usually seed-and-extend approaches inwhich first small exact matches are found, which are thenextended to obtain long inexact ones.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 4: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Basic idea: Preprocessing

After preprocessing, a large part of the computation is alreadyfinished before we search for similarities.For biological sequences of length n there are 4n (for theDNA-alphabet Σ = {A, G, C, T}) and/or 20n (for the amino acidalphabet) different strings.For small |Σ| and n, it is possible to store all in a hash table.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 5: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Basic idea: Preprocessing

1 A certain consistency of the database entries is assumed.2 Large sequence databases are split into two parts: one

consists of the constant part, containing all the sequencesthat were used for hash table generation, and of a dynamicpart, containing all new entries.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 6: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

BLAST

BLAST, the Basic Local Alignment Search Tool (Altschul et al.,1990), is perhaps the most widely used bioinformatics tool everwritten. It is an alignment heuristic that determines “localalignments” between a query and a database. It uses anapproximation of the Smith-Waterman algorithm.

BLAST consists of two components: a search algorithm andcomputation of the statistical significance of solutions.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 7: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

BLAST terminology

Definition

Let q be the query and d the database. A segment is simply asubstring s of q or d .A segment-pair (s, t) (or hit) consists of two segments, one in qand one d , of the same length.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 8: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

BLAST terminology

ExampleV A L L A RP A M M A R

We think of s and t as being aligned without gaps and scorethis alignment using a substitution score matrix, e.g. BLOSUMor PAM in the case of protein sequences.The alignment score for (s, t) is denoted by σ(s, t).

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 9: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

BLAST terminology

A locally maximal segment pair (LMSP) is any segmentpair (s, t) whose score cannot be improved by shorteningor extending the segment pair.A maximum segment pair (MSP) is any segment pair (s, t)of maximal alignment score σ(s, t).Given a cutoff score S, a segment pair (s, t) is called ahigh-scoring segment pair (HSP), if it is locally maximaland σ(s, t) ≥ S.Finally, a word is simply a short substring of fixed length w .

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 10: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

The BLAST algorithm

Goal: Find all HSPs for a given cut-off score.

Given three parameters, i.e. a word size w , a word similaritythreshold T and a minimum cut-off score S. Then we arelooking for a segment pair with a score of at least S thatcontains at least one word pair of length w with score at least T .

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 11: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

The BLAST algorithm - Preprocessing

Preprocessing: Of the query sequence q first all words oflength w are generated. Then a list of all w-mers of length wover the alphabet Σ that have similarity > T to some word inthe query sequence q is generated.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 12: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

The BLAST algorithm - Preprocessing

Example

For the query sequence RQCSAGW the list of words of lengthw = 2 with a score T > 8 using the BLOSUM62 matrix are:

word 2−mer with score > 8RQ RQQC QC, RC, EC, NC, DC, KC, MC, SCCS CS,CA,CN,CD,CQ,CE,CG,CK,CTSA -AG AGGW GW,AW,RW,NW,DW,QW,EW,HW,KW,PW,SW,TW,WW

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 13: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

The BLAST algorithm - Searching

1 Localization of the hits: The database sequence d isscanned for all hits t of w-mers s in the list, and theposition of the hit is saved.

2 Detection of hits: First all pairs of hits are searched thathave a distance of at most A (think of them lying on thesame diagonal in the matrix of the SW-algorithm).

3 Extension to HSPs: Each such seed (s, t) is extended inboth directions until its score σ(s, t) cannot be enlarged(LMSP). Then all best extensions are reported that havescore ≥ S, these are the HSPs. Originally the extensiondid not include gaps, the modern BLAST2 algorithm allowsinsertion of gaps.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 14: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

The BLAST algorithm - Searching

The list L of all words of length w that have similarity > Tto some word in the query sequence q can be produced inO(|L|) time.These are placed in a “keyword tree” and then, for eachword in the tree, all exact locations of the word in thedatabase d are detected in time linear to the length of d .As an alternative to storing the words in a tree, afinite-state machine can be used, which Altschul et al.found to have the faster implementation.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 15: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

The BLAST algorithm

Use of seeds of length w and the termination of extensions withfading scores (score dropoff threshold X) are both steps thatspeed up the algorithm.Recent improvements (BLAST 2.0):

Two word hits must be found within a window of A residues.Explicit treatment of gaps.Position-specific iterative BLAST (PSI-BLAST).

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 16: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

The BLAST algorithm - DNA

For DNA sequences, BLAST operates as follows:

The list of all words of length w in the query sequence q isgenerated. In practice, w = 12 for DNA.The database d is scanned for all hits of words in this list.Blast uses a two-bit encoding for DNA. This saves spaceand also search time, as four bases are encoded per byte.

Note that the “T” parameter dictates the speed and sensitivity ofthe search.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 17: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

All flavors of BLAST

BLASTN: compares a DNA query sequence to a DNAsequence database; qDNA ↔ sDNA

BLASTP: compares a protein query sequence to a proteinsequence database; qprotein ↔ sprotein

TBLASTN: compares a protein query sequence to a DNAsequence database (6 frames translation); qprotein ↔st{+1,+2,+3,−1,−2,−3}(DNA)

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 18: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

All flavors of BLAST - continued

BLASTX: compares a DNA query sequence (6 framestranslation) to a protein sequence database.TBLASTX: compares a DNA query sequence (6 framestranslation) to a DNA sequence database (6 framestranslation).

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 19: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Statistical analysis

1 The null hypothesis H0 states that the two sequences (s, t)are not homologous. Then the alternative hypothesisstates that the two sequences are homologous.

2 Choose an experiment to find the pair (s, t): use BLAST todetect HSPs.

3 Compute the probability of the result under the hypothesisH0, P(Score ≥ σ(s, t) | H0) by generating a probabilitydistribution with random sequences.

4 Fix a rejection level for H0.5 Perform the experiment, compute the probability of

achieving the result or higher and compare with therejection level.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 20: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Poisson and extreme value distributions

The Karlin and Altschul theory for local alignments (withoutgaps) is based on Poisson and extreme value distributions. Thedetails of that theory are beyond the scope of this lecture, butbasics are sketched in the following.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 21: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Poisson distribution

Definition

The Poisson distribution with parameter v is given by

P(X = x) =vx

x!e−v

Note that v is the expected value as well as the variance. Fromthe equation we follow that the probability that a variable X willhave a value at least x is

P(X ≥ x) = 1−x−1∑i=0

v i

i!e−v

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 22: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Statistical significance of an HSP

Problem

Given an HSP (s, t) with score σ(s, t). How significant is thismatch (i.e., local alignment)?

Given the scoring matrix S(a, b), the expected score foraligning a random pair of amino acid is required to be negative:

E =∑

a,b∈Σ

papbS(a, b) < 0

The sum of a large number of independent identicallydistributed (i.i.d) random variables tends to a normaldistribution. The maximum of a large number of i.i.d. randomvariables tends to an extreme value distribution as we will seebelow.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 23: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Statistical significance of an HSP

HSP scores are characterized by two parameters, K and λ.The parameters K and λ depend on the backgroundprobabilities of the symbols and on the employed scoringmatrix. λ is the unique value for y that satisfies the equation∑

a,b∈Σ

papbeS(a,b)y = 1

K and λ are scaling-factors for the search space and for thescoring scheme, respectively.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 24: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Statistical significance of an HSP

The number of random HSPs (s, t) with σ(s, t) ≥ S can bedescribed by a Poisson distribution with parameterv = Kmne−λS. The number of HSPs with score ≥ S that weexpect to see due to chance is then the parameter v , alsocalled the E-value:

E(HSPs with score ≥ S) = Kmne−λS

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 25: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Hence, the probability of finding exactly x HSPs with a score≥ S is given by

P(X = x) = e−E Ex

x!,

where E is the E-value for C.The probability of finding at least one HSP “by chance” is

P(S) = 1− P(X = 0) = 1− e−E .

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 26: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

We would like to “hide” the parameters K and λ to make iteasier to compare results from different BLAST searches.

For a given HSP (s, t) we transform the raw score S into abit-score:

S′ :=λS − ln K

ln 2.

Such bit-scores can be compared between different BLASTsearches, as the parameters of the given scoring systems aresubsumed in them.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 27: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

We would like to “hide” the parameters K and λ to make iteasier to compare results from different BLAST searches.

For a given HSP (s, t) we transform the raw score S into abit-score:

S′ :=λS − ln K

ln 2.

Such bit-scores can be compared between different BLASTsearches, as the parameters of the given scoring systems aresubsumed in them.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 28: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Significance of a bit-score

To determine the significance of a given bit-score S′ the onlyadditional value required is the size of the search space. SinceS = (S′ ln 2 + ln K )/λ, we can express the E-value in terms ofthe bit-score as follows:

E = Kmne−λS = Kmne−(S′ ln 2+ln K ) = mn2−S′.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 29: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

FASTA

FASTA (pronounced fast-ay) is a heuristic for finding significantmatches between a query string q and a database string d . It isthe older of the two heuristics introduced in the lecture.

FASTA’s general strategy is to find the most significantdiagonals in the dot-plot or dynamic programming matrix.The algorithm consists of four phases: Hashing, 1stscoring, 2nd scoring, alignment.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 30: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

FASTA

FASTA (pronounced fast-ay) is a heuristic for finding significantmatches between a query string q and a database string d . It isthe older of the two heuristics introduced in the lecture.

FASTA’s general strategy is to find the most significantdiagonals in the dot-plot or dynamic programming matrix.The algorithm consists of four phases: Hashing, 1stscoring, 2nd scoring, alignment.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 31: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Phase 1: Hashing

The first step of the algorithm is to determine all exactmatches of length k (word-size) between the twosequences, called hot-spots.A hot-spot is given by (i , j), where i and j are the locations(i.e., start positions) of an exact match of length k in thequery and database sequence respectively.Any such hot-spot (i , j) lies on the diagonal (i − j) of thedot-plot or dynamic programming matrix.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 32: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Phase 1: Hashing

Using this scheme, the main diagonal has number 0 (i = j),whereas diagonals above the main one have positivenumbers (i < j), the ones below negative (i < j).A diagonal run is a set of hot-spots that lie in a consecutivesequence on the same diagonal. It corresponds to agapless local alignment.A score is assigned to each diagonal run.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 33: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Phase 2+3:

Each of the ten diagonal runs with highest score are furtherprocessed. Within each of these scores an optimal localalignment is computed using the match score substitutionmatrix. These alignments are called initial regions.The score of the best sub-alignment found in this phase isreported as init1.The next step is to combine high scoring sub-alignmentsinto a single larger alignment, allowing the introduction ofgaps into the alignment. The score of this alignment isreported as initn

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 34: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Phase 4:

Finally, a banded Smith-Waterman dynamic program is used toproduce an optimal local alignment along the best matchedregions. The center of the band is determined by the regionwith the score init1, and the band has width 8 for ktup=2. Thescore of the resulting alignment is reported as opt.

In this way, FASTA determines a highest scoring region, not allhigh scoring alignments between two sequences. Hence,FASTA may miss instances of repeats or multiple domainsshared by two proteins.

After all sequences of the databases have thus been searcheda statistical significance similar to the BLAST statistics iscomputed and reported.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 35: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

FASTA

Example

Two sequences ACTGAC and TACCGA: The hot spots for k = 2 aremarked as pairs of black bullets, a diagonal run is shaded in darkgrey. An optimal sub-alignment in this case coincides with thediagonal run. The light grey shaded band of width 3 around thesub-alignment denotes the area in which the optimal local alignment

is searched.Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 36: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Comparing BLAST and FASTA

BLAST FASTA

Que

ry

Database

Que

ry

Database

↓ ↓

Que

ry

Database

Que

ry

Database

(a) (b)

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 37: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

BLAT- BLAST-Like Alignment Tool

BLAT (Kent et al. 2002), is supposedly more accurate and 500times faster than popular existing tools for mRNA/DNAalignments and 50 times faster for protein alignments atsensitivity settings typically used when comparing vertebratesequences.

BLAT’s speed stems from an index of all non-overlappingK -mers in the genome. The program has several stages: Ituses the index to find regions in the genome that are possiblyhomologous to the query sequence. It performs an alignmentbetween such regions. It stitches together the aligned regions(often exons) into larger alignments (typically genes). Finally,BLAT revisits small internal exons and adjusts large gapboundaries that have canonical splice sites where feasible.

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA

Page 38: BLAST and FASTA - ab.inf.uni-tuebingen.de · BLAST and FASTA Heuristics in pairwise sequence alignment Christoph Dieterich Department of Evolutionary Biology Max Planck Institute

MPI for Developmental Biology, Tubingen

logo

Introduction BLAST Statistical analysis FASTA

Job Advertisement

I encourage applications forStudent research projectsDiploma thesis projectsHiWi jobs

http://www2.tuebingen.mpg.de/abt4/plone/BT

Christoph Dieterich Max Planck Institute for Developmental Biology

BLAST and FASTA


Top Related