lecture 6ll
TRANSCRIPT
-
7/27/2019 Lecture 6ll
1/46
Bioinformatics 1 -- lecture 6
Affine gap penalty
Substitution Matrices
PAM
BLOSUM
Matrix bias in local alignment
-
7/27/2019 Lecture 6ll
2/46
Reminder: scoring an alignment
A ~ T S F M ~A G L S T F M
The score of the alignment is the sum of the scores of each
column (match, deletion or insertion) in the alignment.
Match: look up match score from substitution matrixNew gap: use gap initiation penalty
Additional gap: use gap extension penalty
End gap: Optional, may be zero.
-
7/27/2019 Lecture 6ll
3/46
Affine gap exercise
~ ~ A T S F M
A G L S T F M
A ~ ~ T S F M
A G L S T F M
A ~ T S F M ~
A G L S T F M
Which alignmentscores the highest?
Given:
No end gap penalty.
BLOSUM score.
Affine gap penalty. (-2,
-1)
-
7/27/2019 Lecture 6ll
4/46
Worksheet for affine gap local dynamic programming
Q
S
IM
PR
P
L T V K PM I D
Fill in the scores are the traceback letters using the BLOSUM62 matrix and
Gap opening = -2, gap extension = -1. Start at 0. End at maximum.No end gaps (meaning no starting gaps).
Q
S
IM
PR
P
L T V K P
-
7/27/2019 Lecture 6ll
5/46
Rules for affine gap penalty DPDo not penalize end-gaps.
I can follow D, and vice versa, but it is a gap opening.
For each box, write the score and the traceback letter (M,I,orD)
Affine gap penalty worksheet
M = match matrix I = insertion matrix D = deletion matrixscores for alignments
currently in a match statescores for alignments with
gap in first sequence.
scores for alignments with
gap in second sequence.
Fill in M as the
max over three
possible
diagnoal
arrows Fill in I as the
max over three
possible down
arrows
Fill in D as the
max over three
possible right
arrows
A D P Q F GA
K
L
K
L
D
Q
F
G
P
A
K
L
K
L
D
Q
F
G
P
A
K
L
K
L
D
Q
F
G
P
A D P Q F G A D P Q F GNote:Iwrotethesequences
inthegaprows!!
M[i,j] = MAX
M[i-1,j-1]+match score
I[i-1,j-1]+match score
D[i-1,j-1]+match score
I[i,j] = MAX
M[i,j-1] - 2
I[i,j-1] - 1
D[i,j-1] - 2
D[i,j] = MAX
M[i-1,j] - 2
I[i-1,j] - 2
D[i-1,j] - 1
-
7/27/2019 Lecture 6ll
6/46
BLOSUM matrix for match scoresTwo 20x20 substitution matrices are used: BLOSUM & PAM.
4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -2
9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2
6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -3
5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2
6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 3
6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -3
8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 2
4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1
5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2
4 2 -3 -3 -2 -2 -2 -1 1 -2 -1
5 -2 -2 0 -1 -1 -1 1 -1 -1
6 -2 0 0 1 0 -3 -4 -2
7 -1 -2 -1 -1 -2 -4 -3
5 1 0 -1 -2 -2 -15 -1 -1 -3 -3 -2
4 1 -2 -3 -2
5 0 -2 -2
4 -3 -1
11 2
7
A C D E F G H I K L M N P Q R S T V W Y ACDEF
GHIKLMNP
QRSTVWY
BLOSUM62
Each number is the score
for aligning a single pair
ofamino acids.
What is the score for this alignment?:
ACEPGAA
ASDDGTV
-
7/27/2019 Lecture 6ll
7/46
Substitution matrices
Used to score aligned positions, usually of amino acids.
Expressed as the log-likelihood ratio of mutation (orlog-odds
ratio)
Derived from multiple sequence alignments
Two commonly used matrices: PAM and BLOSUM
PAM = percent accepted mutations (Dayhoff)
BLOSUM = Blocks substitution matrix (Henikoff)
Read: Mount pp94-113
-
7/27/2019 Lecture 6ll
8/46
PAM
Evolutionary time is measured in Percent Accepted
Mutations, or PAMs
One PAM of evolution means 1% of the residues/bases have
changed, averaged over all 20 amino acids.
To get the relative frequency of each type of mutation, we
count the times it was observed in a database of multiple
sequence alignments.
Based on global alignments
Assumes a Markov model for evolution.
M Dayhoff, 1978
-
7/27/2019 Lecture 6ll
9/46
BLOSUM
Based on database of ungapped local alignments
(BLOCKS)
Alignments have lower similarity than PAM alignments.
BLOSUM number indicates the percent identity level of
sequences in the alignment. For example, for BLOSUM62
sequences with approximately 62% identity were counted.
Some BLOCKS represent functional units, providing
validation of the alignment.
Henikoff & Henikoff, 1992
-
7/27/2019 Lecture 6ll
10/46
A multiple sequence alignment is made using many pairwise sequence alignments
Multiple Sequence Alignment
-
7/27/2019 Lecture 6ll
11/46
Columns in a MSA have a common evolutionary history
By aligning the sequences, we assert that the aligned
residues in each column had a common ancestor.
-
7/27/2019 Lecture 6ll
12/46
How do you count the mutations?Assume any of the sequences could be the ancestral
one.
L K F R L S K K P
L K F R L S K K P
L K F R L T K K P
L K F R L S K K P
L K F R L S R K PL K F R L T R K P
L K F R L ~ K K P
GG
G
W
W
NG
G
G W W N G G
If the first sequence was the ancestor,
then it mutated to a W twice, to N
once, and conserved G three times.
-
7/27/2019 Lecture 6ll
13/46
Or, we could have picked...
L K F R L S K K P
L K F R L S K K P
L K F R L T K K P
L K F R L S K K P
L K F R L S R K PL K F R L T R K P
L K F R L ~ K K P
WG
G
W
W
NG
G
G W W N G G
W was the ancestor, then it mutated
to a G four times, to N once, and was
conserved once.
-
7/27/2019 Lecture 6ll
14/46
Subsitution matrices are symmetrical
Since we don't know which sequence came first, we don't
know whether
...is correct. So we count this as one mutation of each type.G-->W and W-->G. In the end the 20x20 matrix will have
the same number for elements (i,j) and (j,i).
(That's why we only show the upper triangle)
G
Gw
w
or
-
7/27/2019 Lecture 6ll
15/46
Summing the substitution counts
G
G
W
W
N
GG
one column of a MSA
G
G
W
W
N
N
3 21
symmetrical matrix
We assume the ancester is one of the observed amino acids,
but we don't know which, so we try them all.
-
7/27/2019 Lecture 6ll
16/46
Next possible ancester...
G
G
W
W
N
GG
G
G
W
W
N
N
2 21
We already counted this residue against all others, so be blank it out.
-
7/27/2019 Lecture 6ll
17/46
Next...
G
G
W
W
N
GG
G
G
W
W
N
N
1
2
1
-
7/27/2019 Lecture 6ll
18/46
Next...
G
G
W
W
N
GG
G
G
W
W
N
N
0
2
1
-
7/27/2019 Lecture 6ll
19/46
Next...
G
G
W
W
N
GG
G
G
W
W
N
N
0
2
0
-
7/27/2019 Lecture 6ll
20/46
Next...
G
G
W
W
N
GG
G
G
W
W
N
N
1 0 0
-
7/27/2019 Lecture 6ll
21/46
Last...
G
G
W
W
N
GG
G
G
W
W
N
N
0 0 0
(no counts for last seq.)
-
7/27/2019 Lecture 6ll
22/46
Summing the substitution counts
G
G
W
W
N
GG
G
G
W
W
N
N
6 4 8
TOTAL=21
0
1
2
Now we do this for every
column in every multiple
sequence alignment...
-
7/27/2019 Lecture 6ll
23/46
log odds
log odds ratio = log2(observed/expected )
Substitutions (and many other things in bioinformatics) are
expressed as a "likelihood ratio", or "odds ratio" of the
observed data over the expected value. Likelihood and
odds are synomyms for Probability.So Log Odds is the log (usually base 2) of the odds ratio.
-
7/27/2019 Lecture 6ll
24/46
Getting log-odds from counts
Observed probability of G->G
qGG = P(G->G)=6/21 = 0.29
Expected probability of G->G,
eGG = 0.57*0.57 = 0.33
odds ratio = qGG/eGG = 0.29/0.33
log odds ratio = log2(qGG/eGG )
If the lod is < 0., then
the mutation is less
likely than expected by
chance. If it is > 0., it ismore likely.
P(G) = 4/7 = 0.57
-
7/27/2019 Lecture 6ll
25/46
Different observations, same expectation
G G
G A
W G
W A
N GG A
G A
P(G)=0.50
eGG = 0.25qGG = 9/42 =0.21
lod = log2(0.21/0.25) =0.2
G WG A
G W
G A
G WG A
G A
P(G)=0.50
eGG = 0.25qGG = 21/42 =0.5
lod = log2(0.50/0.25) = 1
Gs spread over many columns
Gs concentrated
-
7/27/2019 Lecture 6ll
26/46
Different observations, same expectation
G G
G A
W G
A W
N GG A
G A
P(G)=0.50, P(W)=0.14
eGW = 0.07qGW = 7/42 =0.17
lod = log2(0.17/0.07) = 1.3
G WG A
G W
G A
G WG A
A G
P(G)=0.50, P(W)=0.14
eGW = 0.07qGG = 3/42 =0.07
lod = log2(0.07/0.07) = 0
G and W seen together more
often than expected.
Gs and Ws not
seen together.
In class exercise:
-
7/27/2019 Lecture 6ll
27/46
Get the substitution value for P->Q
...given a very small database.
PQPPQQQPQQP
PQPPPQQQP
P(P)=_____, P(Q)=_____
ePQ = _____
qPQ = ___/___ =_____
lod = log2(ePQ/qPQ) = ____
P Q
Q
P
In class exercise:
-
7/27/2019 Lecture 6ll
28/46
Markovian evolution and PAMA Markov process is one where te likelihood of the next
"state" depends only on the current state.
Markovian evolution assumes that base changes (or
amino acid changes) occur at a constant rate and
depend only on the identity of the current base (oramino acid).
G G A V V G
millions of years
one
position ina protein
.9946 .0002 .0021 .0001.9932
-
7/27/2019 Lecture 6ll
29/46
Markovian evolution is an extrapolation
Start with all G's. Wait 1 million years. Where
do they go?
Using PAM1, we expect them to mutate to
about 0.0002 A, 0.0007 P, 0.9946 G, etcWait another million years.
The new A's mutate according to PAM1 for A's,
P's mutate according to PAM1 for P's, etc.
Wait another million, etc , etc etc.
What is the final distribution of amino acids at
the positions that were once G's?
PAM1 =
PAM1 =
-
7/27/2019 Lecture 6ll
30/46
Matrix multiplication
PAM1 =
00001000
00000000000
0
P(G->A)P(G->C)P(G->D)P(G->E)P(G->G)P(G->F)P(G->H)
P(G->I)P(G->K)P(G->L)P(G->M)P(G->N)P(G->P)P(G->Q)P(G->R)P(G->S)P(G->T)
PAM1 x =
To start we have 100%G,
0% everything else
After 1MY we have
each amino acid
according to the
PAM probabilities.
0.0001
0.0001
0.00015
0.00005
0.99943
0.00002
0.00005
0.00001
0.0002
0.00015
0.00002
0.00003
0.0006
0.0006
0.00002
=
-
7/27/2019 Lecture 6ll
31/46
Matrix multiplication
PAM1 =
PAM1 x
After 2MY each
amino acid has
mutated again
according to thePAM1 probabilities.
PAM1
=
etc.
0.0001
0.0001
0.00015
0.00005
0.99943
0.00002
0.00005
0.000010.0002
0.00015
0.00002
0.00003
0.0006
0.0006
0.00002
-
7/27/2019 Lecture 6ll
32/46
250 PAMs
PAM1 =PAM1 PAM1
PAM1
250
==
-
7/27/2019 Lecture 6ll
33/46
Differences between PAM and BLOSUM
PAMPAM matrices are based onglobal alignments of closely related proteins.
The PAM1 is the matrix calculated from comparisons of sequences with no more
than 1% divergence.
Other PAM matrices are extrapolated from PAM1 using an assumed Markov
chain.
BLOSUMBLOSUM matrices are based on local alignments.
BLOSUM 62 is a matrix calculated from comparisons of sequences with approx
62% identity.
All BLOSUM matrices are based on observed alignments; they are not
extrapolated from comparisons of closely related proteins.
BLOSUM 62 is the default matrix in BLAST (the database search program). It is
tailored for comparisons of moderately distant proteins. Alignment of distant
relatives may be more accurate with a different matrix.
-
7/27/2019 Lecture 6ll
34/46
Increasing sophistication in match scoring
1. Identity score.
2. Genetic code changes (mutations on one base more likely than 2,3). (1966)
3. Matrices based on chemical similarity of amino acids. (1985)
4. Matrices based on multiple sequence alignments (PAM (1978),BLOSUM (1994))
5. Dipeptide substitution matrices (ie. AG --> DG, etc) (1994)
6. Class specific substitution matrices (D. Jones' transmembrane proteinmatrix) (1994)
7. Structure-based substitution matrices (2000)
8. Position-specific, structure-based substitution matrices (2006)
-
7/27/2019 Lecture 6ll
35/46
PAM250
-
7/27/2019 Lecture 6ll
36/46
BLOSUM62
-
7/27/2019 Lecture 6ll
37/46
Which substitution matrix favors...
conservation of polar residues
conservation of non-polar residuesconservation of C, Y, or W
polar-to-nonpolar mutations
polar-to-polar mutations
PAM250 BLOSUM62
-
7/27/2019 Lecture 6ll
38/46
Local alignment, revisited
Starts at zero (the score of a non-alignment)
Ends at the maximum score anywhere in the matrix.
Global Local
Advantages:
Does not care if the aligned region has long "tails".
Can align pieces of one sequence to pieces of another.
Multi-domain sequences are OK
-
7/27/2019 Lecture 6ll
39/46
Local alignment, revisited
Global Local
Disadvantages:
Fails on multidomain alignments if large gaps are present.
Success depends on an additional parameter: Matrix Bias
-
7/27/2019 Lecture 6ll
40/46
Local Alignment with matrix biasMatrix bias= a constant added to the
substitution score.
Has the same effect as starting the
alignment at a number other than zero.
P
G
T
SF
E
P
A T S F M
A(i,j) = MAX
A(i-1,j-1) + match score + bias
A(i,j-1) + gap*
A(i-1,j) + gap
0 + match score
*linear gap penalty.
-
7/27/2019 Lecture 6ll
41/46
Effect of matrix bias
Higher matrix bias favors matches over
gaps. RESULT: more matches, longerlocal alignments.
more matrix bias
H d k h h li i
-
7/27/2019 Lecture 6ll
42/46
How do we know when the alignment is
correct?
2DRC:A 1/2 MISLIAALAVDRVIGMENAM-PFNLPADLAWFKRNTL-------DKPVIMGRHTWESIG-
1DRF:_ 3/4 SLNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLVIMGKKTWFSIPE
2DRC:A 52/53 --RPLPGRKNIILSSQP--GTDDRVTWVKSVDEAIAACG------DVPEIMVIGGGRVYE
1DRF:_ 63/64 KNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYK
2DRC:A 102/103 QFLPK--AQKLYLTHIDAEVEGDTHFPDYEPDDWESVF------SEFHDADAQNSHSYCF
1DRF:_ 123/124 EAMNHPGHLKLFVTRIMQDFESDTFFPEIDLEKYKLLPEYPGVLSDVQEE---KGIKYKF
2DRC:A 154/155 EILERR
1DRF:_ 180/181 EVYEKN
Compare to known structure-based alignments
-
7/27/2019 Lecture 6ll
43/46
Aligning fibronectin
Fibronectin is a long multidomain
protein involved in
adhesion/migration of cells, blood
clotting, signaling, and
interactions with the extracellular
matrix (ECM). Interacts withcollagen, fibrin, heparin and
integins.
It is made up of many copies of at
least 3 "modules". Smalldifferences within modules cause
important biological effects.
How do you align fibronectins?
-
7/27/2019 Lecture 6ll
44/46
Multiple local alignments?
II
III
II II II III II III
Fragment-based alignment methods find all local
alignments. (BLAST, FASTA)
One way would be to select the maximum score, then thenext highest score, and so on to get all of the possible
alignments.
-
7/27/2019 Lecture 6ll
45/46
You have seen....
Dynamic programming:
Global alignment
Global/local alignment (no end gaps. 3 ways to do it.)
Local alignment
Linear gap penalty
Affine gap penalty
How many ways are there to do DP?
-
7/27/2019 Lecture 6ll
46/46
In class exercise: local alignment
using BestFitStart SeqLabGo to Editor mode, with no sequences.
Download two protein sequences from the PIRdatabase(File/Add sequences/Databases):
PIR2:A46444
PIR2:S58653
Select both. Run BestFit (Functions/Pairwisecomparison/BestFit.)
Go to Options: set gap creation (range 1-20), gap extension
(range 0-10) . Run. Look at the results. (Compare to what you see on