local alignments seq x: seq y:. local alignment what’s local? –allow only parts of the sequence...

Local alignments

Seq X:

Seq Y:

Local alignment

What’s local?– Allow only parts of the sequence to

match– Results in High Scoring Segments– Locally maximal: can not make it better

by trimming/extending the alignment

Seq X:

Seq Y:

Local alignmentSeq X:

Seq Y:

Why local?– Parts of sequence diverge faster

evolutionary pressure does not put constraints on the whole sequence

– Proteins have modular constructionsharing domains between sequences

Global local alignment

Take the old, good equation

Look at the result of the global alignment (on the right)

- s e q u e

-

s

e

q

CAGCACTTGGATTCTCG-CA-C-----GATTCGT-G

a) global align

b) retrieve the result

c) score along the result

align.pos.

Local alignment – breaking the alignment A recipe

– Just don’t let the score go below 0

– Start the new alignment when it happens

– Where is the result in the matrix?

Before: After:

align.pos.

q

e

s

0-

euqes-

q

e

0s

0-

euqes-

align.pos.

align.pos.

Local alignment – the equation

Init the matrix with 0’s Read the maximal value from

anywhere in the matrix Find the result with backtracking

M[i, j] =M[i, j-1] – g

M[i-1, j] – g

M[i-1, j-1] + score(X[i],Y[j])

max

0Great contribution to science!

q

e

s

-

euqes-

Amino acid substitution matrices

Point Accepted Mutations distance and matrices

Accepted by natural selection– not lethal– not silent

Def.: S1 and S2 are PAM 1 distant if

on avg. there was one mutation per 100 aa

Q.: If the seqs are PAM 8 distant, how many residues may be diffent?

PAM matrix Created from “easy”alignments

– pairwise– gapless– 85% id

f proline − frequency of occurence of proline

f proline ,valine − frequency of substitution

proline with valine , for PAM 1

i.e. f a ,b =∑

aligns

count a ,b

∑aligns

∑c ,d

count c ,d

M − symmetric matr ix ,

i.e. M=[f a ,a f a ,b

f b ,a f b ,b ]

PAM matrix How to calculate M for PAM 2

distance?– Take more distant sequences– or extrapolate...

M 2=[ f a ,a f a ,b

f b ,a f b ,b ]2

=[ f a ,a f a , a f a ,b f b ,a f a ,a f a ,b f a , b f b ,b

⋯ ⋯ ]

AlignmentsBLAST - Basic Local Alignment

Search Tool

The amount of genetic information in organisms

http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980313.html

Largest genome: amoeba Chaos chaos (200x human genome) http://www.lawrence.edu/dept/biology/animal/

Sequence searching - challenges Exponential growth of

databases

Sequence searching – definition

Task:

– Query: short, new sequence (~1000 letters)

– Database (searching space): very many sequences

– Goal: find seqs homologous to the query

Sequence searching – definition

We want:

– fast tool– primarily a filter: most sequences will

be unrelated to the query – fine-tune the alignment later

What is BLAST Basic Local Alignment Search Tool Bad news: it is only a heuristics

– Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees.Perkins, DN (1981) The Mind's Best Work

Basic idea:– High scoring segments have well

conserved (almost identical) part– As well conserved part are identified,

extend it to the real alignment

q

e

s

-

euqes-

What means well conserved for BLAST? BLAST works with k-words (words of length

k)

– k is a parameter– different for DNA (>10) and proteins

(2..4) word w1 is T-similar to w2 if the sum of pair

scores is at least T (e.g. T=12)

Similar 3-wordsW1: R K PW2: R R PScore: 9 –1 7 sum = 15

BLAST algorithm3 basic steps

1)Preprocess the query: extract all the k-words

2)Scan for T-similar matches in database

3)Extend them to alignments

1)Preprocess2)Scan3)Extend

BLAST, Step 1: Preprocess the query

Take the query (e.g. LVNRKPVVP) Chop it into overlapping k-words (k=3 in this

case) Query: LVNRKPVVPWord1: LVNWord2: VNRWord3: NRK…

For each word find all similar words (scoring at least T)

E.g. for RKP the following 3-words are similar:QKP KKP RQP REP RRP RKP


Finite state machine

AC*T|GGC

abstract machine constant amount of memory

(states) used in computation and languages recognizes regular expressions

– cp dmt*.pdf /home/john


BLAST, Step 2: Find ¨exact¨ matches with scanning Use all the T-similar k-words to build the Finite

State Machine Scan for exact matches

...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...

QKPKKPRQPREPRRPRKP...

movement


BLAST, Step 3: Extending ¨exact¨ matches Having the list of exact matches we extend

alignment in both directions

Query: L V N R K P V V PT-similar: R R PSubject: G V C R R P L K CScore: -3 4 -3 5 2 7 1 -2 -3

…till the sum of scores drops below some level X (e.g. X=-100) from the best known- what with gaps?


Gapped BLAST (now standard)

gapped local alignments are computed: much, much, much slower

therefore: modified “Hit criteria”


Hit criteria Extends the alignment only if there are

close two hits on the same diagonal

– sensitivity would drop without lowering T– reduces extensions (90% time is spend

on extensions) Gapped local alignments are computed

– increased sensitivity allows us raise T– raising T speeds up the search


dbpos

querypos

close

hit, s

ame d

iag

BLAST flavours blastp: protein query, protein db blastn: DNA query, DNA db blastx: DNA query, protein db

– in all reading frames. Used to find potential translation products of an unknown nucleotide sequence.

tblastn: protein query, DNA db

– database dynamically translated in all reading frames.

tblastx: DNA query, DNA db

– all translations of query against all translations of db

local alignments seq x: seq y:. local alignment what’s local? –allow only parts of the sequence...

Documents