local alignments seq x: seq y:. local alignment what’s local? –allow only parts of the sequence...
Post on 18-Dec-2015
218 views
TRANSCRIPT
Local alignments
Seq X:
Seq Y:
Local alignment
What’s local?– Allow only parts of the sequence to
match– Results in High Scoring Segments– Locally maximal: can not make it better
by trimming/extending the alignment
Seq X:
Seq Y:
Local alignmentSeq X:
Seq Y:
Why local?– Parts of sequence diverge faster
evolutionary pressure does not put constraints on the whole sequence
– Proteins have modular constructionsharing domains between sequences
Global local alignment
Take the old, good equation
Look at the result of the global alignment (on the right)
- s e q u e
-
s
e
q
CAGCACTTGGATTCTCG-CA-C-----GATTCGT-G
a) global align
b) retrieve the result
c) score along the result
align.pos.
Local alignment – breaking the alignment A recipe
– Just don’t let the score go below 0
– Start the new alignment when it happens
– Where is the result in the matrix?
Before: After:
align.pos.
q
e
s
0-
euqes-
q
e
0s
0-
euqes-
align.pos.
align.pos.
Local alignment – the equation
Init the matrix with 0’s Read the maximal value from
anywhere in the matrix Find the result with backtracking
M[i, j] =M[i, j-1] – g
M[i-1, j] – g
M[i-1, j-1] + score(X[i],Y[j])
max
0Great contribution to science!
q
e
s
-
euqes-
Amino acid substitution matrices
Point Accepted Mutations distance and matrices
Accepted by natural selection– not lethal– not silent
Def.: S1 and S2 are PAM 1 distant if
on avg. there was one mutation per 100 aa
Q.: If the seqs are PAM 8 distant, how many residues may be diffent?
PAM matrix Created from “easy”alignments
– pairwise– gapless– 85% id
f proline − frequency of occurence of proline
f proline ,valine − frequency of substitution
proline with valine , for PAM 1
i.e. f a ,b =∑
aligns
count a ,b
∑aligns
∑c ,d
count c ,d
M − symmetric matr ix ,
i.e. M=[f a ,a f a ,b
f b ,a f b ,b ]
PAM matrix How to calculate M for PAM 2
distance?– Take more distant sequences– or extrapolate...
M 2=[ f a ,a f a ,b
f b ,a f b ,b ]2
=[ f a ,a f a , a f a ,b f b ,a f a ,a f a ,b f a , b f b ,b
⋯ ⋯ ]
AlignmentsBLAST - Basic Local Alignment
Search Tool
The amount of genetic information in organisms
http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980313.html
Largest genome: amoeba Chaos chaos (200x human genome) http://www.lawrence.edu/dept/biology/animal/
Sequence searching - challenges Exponential growth of
databases
Sequence searching – definition
Task:
– Query: short, new sequence (~1000 letters)
– Database (searching space): very many sequences
– Goal: find seqs homologous to the query
Sequence searching – definition
We want:
– fast tool– primarily a filter: most sequences will
be unrelated to the query – fine-tune the alignment later
What is BLAST Basic Local Alignment Search Tool Bad news: it is only a heuristics
– Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees.Perkins, DN (1981) The Mind's Best Work
Basic idea:– High scoring segments have well
conserved (almost identical) part– As well conserved part are identified,
extend it to the real alignment
q
e
s
-
euqes-
What means well conserved for BLAST? BLAST works with k-words (words of length
k)
– k is a parameter– different for DNA (>10) and proteins
(2..4) word w1 is T-similar to w2 if the sum of pair
scores is at least T (e.g. T=12)
Similar 3-wordsW1: R K PW2: R R PScore: 9 –1 7 sum = 15
BLAST algorithm3 basic steps
1)Preprocess the query: extract all the k-words
2)Scan for T-similar matches in database
3)Extend them to alignments
1)Preprocess2)Scan3)Extend
BLAST, Step 1: Preprocess the query
Take the query (e.g. LVNRKPVVP) Chop it into overlapping k-words (k=3 in this
case) Query: LVNRKPVVPWord1: LVNWord2: VNRWord3: NRK…
For each word find all similar words (scoring at least T)
E.g. for RKP the following 3-words are similar:QKP KKP RQP REP RRP RKP
1)Preprocess2)Scan3)Extend
Finite state machine
AC*T|GGC
abstract machine constant amount of memory
(states) used in computation and languages recognizes regular expressions
– cp dmt*.pdf /home/john
1)Preprocess2)Scan3)Extend
BLAST, Step 2: Find ¨exact¨ matches with scanning Use all the T-similar k-words to build the Finite
State Machine Scan for exact matches
...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
QKPKKPRQPREPRRPRKP...
movement
1)Preprocess2)Scan3)Extend
BLAST, Step 3: Extending ¨exact¨ matches Having the list of exact matches we extend
alignment in both directions
Query: L V N R K P V V PT-similar: R R PSubject: G V C R R P L K CScore: -3 4 -3 5 2 7 1 -2 -3
…till the sum of scores drops below some level X (e.g. X=-100) from the best known- what with gaps?
1)Preprocess2)Scan3)Extend
Gapped BLAST (now standard)
gapped local alignments are computed: much, much, much slower
therefore: modified “Hit criteria”
1)Preprocess2)Scan3)Extend
Hit criteria Extends the alignment only if there are
close two hits on the same diagonal
– sensitivity would drop without lowering T– reduces extensions (90% time is spend
on extensions) Gapped local alignments are computed
– increased sensitivity allows us raise T– raising T speeds up the search
1)Preprocess2)Scan3)Extend
dbpos
querypos
close
hit, s
ame d
iag
BLAST flavours blastp: protein query, protein db blastn: DNA query, DNA db blastx: DNA query, protein db
– in all reading frames. Used to find potential translation products of an unknown nucleotide sequence.
tblastn: protein query, DNA db
– database dynamically translated in all reading frames.
tblastx: DNA query, DNA db
– all translations of query against all translations of db