2011-05-23
Sequence comparison and alignment
• Maximum parsimony: choose the alignment requiring the least improbable combination of point mutations and insertions/deletions
Before alignment
Sequence1 AGGVLIIQVG
          ||||||
Sequence2 AGGVLIQVG
After alignment
Sequence1 AGGVLIIQVG
          |||||| |||
Sequence2 AGGVLI-QVG
Insertion in sequence 1 or deletion in sequence 2; gap
Comparing longer sequences
[Figure: two panels comparing sequences A and B]
Scoring of alignments, scoring matrices
• Unitary scoring matrix: identity = one point; otherwise no points
• Does not take into account that some mutations are more common than others
• Insensitive; difficult to detect distant relationships
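A minimal sketch of unitary scoring, applied to the AGGVLIIQVG / AGGVLIQVG example from the first slide. The function name `identity_score` is illustrative, and padding the shorter sequence with a trailing gap for the "before alignment" case is an assumption made here so that both rows have equal length.

```python
def identity_score(row1: str, row2: str) -> int:
    """Score two aligned rows with the unitary matrix: 1 per identical pair, else 0."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    # A gap paired with anything scores nothing.
    return sum(1 for a, b in zip(row1, row2) if a == b and a != "-")


# Before alignment: sequence 2 is simply padded at the end (assumption).
ungapped = identity_score("AGGVLIIQVG", "AGGVLIQVG-")
# After alignment: one internal gap restores the register.
gapped = identity_score("AGGVLIIQVG", "AGGVLI-QVG")
print(ungapped, gapped)
```

The two scores correspond to the number of `|` marks in the two versions of the example: the gap recovers the three identities at the end of the sequences.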
Mutation matrix summarises observed mutations
From comparison of sequences known to be related; see also Petsko & Ringe fig 1-6
Mutation Probability Matrix for the Evolutionary Distance of 2 PAM
[Table: 20 × 20 matrix with rows and columns in the order Gly, Pro, Asp, Glu, Ala, Asn, Gln, Ser, Thr, Lys, Arg, His, Val, Ile, Met, Cys, Leu, Phe, Tyr, Trp, plus a final Sum column. The individual entries are not legible in this transcript; only the row sums are listed.]
Row   Sum
Gly   10063
Pro    9942
Asp    9996
Glu    9981
Ala   10188
Asn    9927
Gln    9900
Ser   10003
Thr   10030
Lys   10125
Arg    9960
His    9975
Val   10188
Ile    9872
Met    9737
Cys    9964
Leu   10140
Phe   10047
Tyr    9981
Trp    9960
All entries are multiplied by 10,000. An element of this matrix, m(i,j), gives the probability that the amino acid in column j will be replaced by the amino acid in row i after an evolutionary interval of 2 PAM, i.e., 2 accepted point mutations per 100 amino acids. Thus, there is a probability of 0.0059 that Ala will be replaced by Ser, and a probability of 0.0099 that Ser will be replaced by Ala. The sum of each column is 1.0. The sum of a row represents the growth factor per 2 PAM of the corresponding amino acid residue; it ranges from 0.9737 (for Met) to 1.0188 (for Ala and Val).
(PAM = percentage of accepted mutations)
Scoring matrix from mutation matrix
Each matrix element relates the probability of similarity due to conservation to the probability of chance similarity
"Log-odds" matrices; use the logarithm of the substitution probabilities for each element
Gaps: gap creation penalty, gap extension penalty
Weighting with log-odds matrix
Scoring
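A minimal sketch of the two ingredients named above, under stated assumptions: the log-odds element is scaled as 10·log10 and rounded to an integer (a common convention for PAM-style matrices, not confirmed by the slides), and the gap cost uses one of several affine conventions, namely creation penalty for the first position plus extension penalty for each further position. All names and default values are illustrative.

```python
import math


def log_odds(p_related: float, p_chance: float, scale: float = 10.0) -> int:
    """Log of (probability of the pair in related sequences / probability by chance)."""
    return round(scale * math.log10(p_related / p_chance))


def affine_gap_penalty(length: int, gap_create: float = 10.0, gap_extend: float = 1.0) -> float:
    """Cost of one gap of the given length: creation once, extension per extra position."""
    if length <= 0:
        return 0.0
    return gap_create + gap_extend * (length - 1)


# A pair seen twice as often as chance predicts gets a positive score,
# a pair seen half as often gets a negative one.
favoured = log_odds(0.02, 0.01)
neutral = log_odds(0.01, 0.01)
disfavoured = log_odds(0.005, 0.01)
print(favoured, neutral, disfavoured)
print(affine_gap_penalty(1), affine_gap_penalty(3))
```

Because the elements are logarithms, summing them along an alignment corresponds to multiplying the underlying odds, which is why the total alignment score is a simple sum minus the gap penalties.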
Log-odds matrix for 250 PAM
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   8
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -8  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   8
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
Symmetric
Assumptions of the PAM model
Assumptions in the PAM model:
1. Replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model).
2. Sequences that are being compared have average amino acid composition.
Sources of error in the PAM model:
1. Many sequences depart from average composition.
2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements were observed!).
3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM.
4. The Markov process is an imperfect representation of evolution: distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over the entire sequence.
Must use sequences with known relation (>85 % identity) and extrapolate to lower levels of similarity
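A minimal sketch of the extrapolation step under the Markov assumption: applying a 1 PAM mutation probability matrix n times is the same as raising it to the n-th power. The three-letter alphabet and the numbers in `PAM1_TOY` are hypothetical and chosen only so that each column sums to 1 with 1 % total replacement; they are not taken from the real 20 × 20 matrix.

```python
def mat_mul(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_power(m, n):
    """Multiply the matrix by itself n times, starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 1 PAM matrix: column j holds the probabilities that
# letter j is replaced by the letter in row i (columns sum to 1).
PAM1_TOY = [
    [0.990, 0.004, 0.002],
    [0.006, 0.990, 0.008],
    [0.004, 0.006, 0.990],
]

pam250_toy = mat_power(PAM1_TOY, 250)
column_sums = [sum(pam250_toy[i][j] for i in range(3)) for j in range(3)]
print(column_sums)
print(pam250_toy[0][0])  # probability of still seeing letter 0 where letter 0 started
```

Since every entry of the 1 PAM matrix is multiplied through 250 steps, any error in a rarely observed replacement is carried into every element of the result, which is the magnification the third source of error refers to.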
Evolutionary rates
Rates of Mutation Acceptance
Protein                          PAMs per 100 million years
IG kappa chain C region          37
Kappa casein                     33
Phospholipase A                  19
Prolactin                        17
Carbonic anhydrase C             16
Hemoglobin alpha chain           12
Lipid-binding protein A-II       10
Animal lysozyme                  9.8
Myoglobin                        8.9
Trypsin                          5.9
Alpha crystallin A chain         5.0
Cytochrome b                     4.5
Calcitonin                       4.3
Neurophysin 2                    3.6
Lactate dehydrogenase            3.4
Adenylate kinase                 3.2
Triosephosphate isomerase        2.8
Vasoactive intestinal peptide    2.6
Cytochrome c                     2.2
Plant ferredoxin                 1.9
Troponin C, skeletal muscle      1.5
Glutamate dehydrogenase          0.9
Histone H2B                      0.9
Histone H2A                      0.5
Histone H3                       0.14
Histone H4                       0.10
Ubiquitin                        0.00
Depend on protein; different mutation rates and selection pressures
BLOSUM matrices
• Based on short local alignments (BLOCKS database). Known relatedness not required
• Suitable level of similarity can be used
BLOSUM vs. PAM
It is best to use a scoring matrix derived from sequences with levels of similarity close to those being investigated
Optimal pairwise alignment depends on efficient algorithms
• Problem: finding the alignment with the highest score
• Calculation of the score for all possibilities is not feasible; need an optimization method to find the best solution with minimum computation
• Dynamic programming method
[Figure: two dot plots with the sequence MNALSQLN along one axis]
Illustration with dot plots: finding the best path
Real proteins
Needleman-Wunsch algorithm: implementation of the dynamic programming method for finding the optimal solution in pairwise alignment
Mathematically guaranteed: it can be proven that the best alignment will be found
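A minimal sketch of the dynamic programming idea, assuming a deliberately simple scoring scheme (match +1, mismatch -1, linear gap penalty -2) instead of a log-odds matrix with separate gap creation and extension penalties. The function name and parameters are illustrative.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (best score, aligned seq1, aligned seq2)."""
    n, m = len(seq1), len(seq2)

    def pair(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    # table[i][j] = best score for aligning seq1[:i] with seq2[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + pair(i, j),  # align the two residues
                table[i - 1][j] + gap,             # gap in seq2
                table[i][j - 1] + gap,             # gap in seq1
            )

    # Trace back from the bottom-right corner to recover one optimal path.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + pair(i, j):
            row1.append(seq1[i - 1]); row2.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            row1.append(seq1[i - 1]); row2.append("-"); i -= 1
        else:
            row1.append("-"); row2.append(seq2[j - 1]); j -= 1
    return table[n][m], "".join(reversed(row1)), "".join(reversed(row2))


best_score, aligned1, aligned2 = needleman_wunsch("AGGVLIIQVG", "AGGVLIQVG")
print(best_score)
print(aligned1)
print(aligned2)
```

Each cell is computed once from three neighbours, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments. When several paths have the same score (here the gap can face either of the two adjacent I residues), the traceback returns just one of them.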
Database searching
• How to find sequences related to a given sequence in a large database?
• Need a large number of sequence comparisons and scoring of results
• Need fast methods for sequence comparisons; approximate.
• Word methods (k-tuple): FASTA, BLAST
• Search for "words" (k-tuples); investigate hits more closely.
• Result: a number of optimised alignments (with gaps) ranked according to score
How BLAST works
Query sequence
“words” (subsequences of the query sequence)
Query words are compared to the database (target sequences) and exact matches identified
For each word match, alignment is extended in both directions to find alignments that score greater than some threshold (maximal segment pairs, or MSPs) (Schneider and La Rota 2000)
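A minimal sketch of the word-and-extend idea described above, not of the BLAST program itself. It assumes exact word matches only, identity scoring (+1 / -1) and a simple stopping rule that ends an extension once the running score drops a fixed amount below the best seen; the function names, the target sequence and all parameter values are illustrative.

```python
def find_word_hits(query, target, k=3):
    """Return (query_pos, target_pos) for every exact k-letter word shared by both."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    hits = []
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            hits.append((i, j))
    return hits


def extend_hit(query, target, i, j, k=3, match=1, mismatch=-1, drop=2):
    """Ungapped extension of one word hit; returns (score, query_start, target_start, length)."""

    def extend(qi, tj, step):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= tj < len(target):
            running += match if query[qi] == target[tj] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running >= drop:
                break  # score has fallen too far below the best; give up
            qi += step
            tj += step
        return best, best_len

    right_score, right_len = extend(i + k, j + k, 1)
    left_score, left_len = extend(i - 1, j - 1, -1)
    total = k * match + left_score + right_score
    return total, i - left_len, j - left_len, k + left_len + right_len


query = "AGGVLIIQVG"
target = "MKAGGVLIQVGT"  # hypothetical database sequence
hits = find_word_hits(query, target)
segment = extend_hit(query, target, 0, 2)
print(hits)
print(segment)
```

Indexing the query words once makes the database scan a series of dictionary lookups, which is where the speed of word methods comes from; only the few positions with a hit are examined more closely.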
BLAST
Optimize and rank HSPs
FASTA
Statistics
Score (S): measure of similarity between query sequence and match
Expect (E) value: a parameter that describes the number of hits (with score ≥ S) one can expect to see by chance when searching a database of a particular size. It decreases exponentially as the score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance
K and λ: parameters for the database and scoring system; n′ and m′ relate to the sequence lengths; D: size of the database
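A minimal sketch of how the E value depends on the score, assuming the standard Karlin-Altschul form E = K · m′ · n′ · exp(-λS). The formula itself is not part of this transcript, so both the form and the way the database size enters (here folded into `n_eff`) are assumptions; the numeric values for K and λ are illustrative only.

```python
import math


def expect_value(score, k_param, lam, m_eff, n_eff):
    """Expected number of chance hits with at least this score (assumed Karlin-Altschul form)."""
    return k_param * m_eff * n_eff * math.exp(-lam * score)


# Illustrative parameters, not taken from the slides.
K_PARAM, LAMBDA = 0.041, 0.267
QUERY_LEN, DB_LEN = 300, 1_000_000

e_low = expect_value(50, K_PARAM, LAMBDA, QUERY_LEN, DB_LEN)
e_high = expect_value(60, K_PARAM, LAMBDA, QUERY_LEN, DB_LEN)
print(e_low, e_high)
```

Raising the score by a fixed amount divides E by a fixed factor, exp(λ · ΔS), which is the exponential decrease mentioned above; doubling the database length doubles E for the same score.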
BLAST output 1: Overview
Sequence motifs detected
BLAST output 2: List
BLAST output 3: alignments
Best hit (number 1)
Number 25
Significance of sequence similarity
Practical definition: > 20 % of residues identical (after reasonable correction for insertions/deletions). Probability of 20 % identity by chance in 100-residue sequences?
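After Schulz and Schirmer, Principles of Protein Structure
Let l = sequence length and i = number of identical amino acids
$$P = \left(\frac{1}{20}\right)^{i} \left(\frac{19}{20}\right)^{l-i} \frac{l!}{i!\,(l-i)!}$$
With l = 100 and i = 20 we get
$$P = \left(\frac{1}{20}\right)^{20} \left(\frac{19}{20}\right)^{80} \frac{100!}{20!\,80!} \approx 10^{-7}$$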
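A short numerical check of the binomial expression above, assuming (as the formula does) that all 20 amino acids are equally likely at every position. The function name is illustrative.

```python
from math import comb


def identity_probability(length, identical, p_match=1 / 20):
    """Probability of exactly `identical` matching positions out of `length` by chance."""
    return comb(length, identical) * p_match ** identical * (1 - p_match) ** (length - identical)


p_chance = identity_probability(100, 20)
print(f"{p_chance:.1e}")
```

The result is of the order of 10^-7, in line with the slide. Note that the expression gives the probability of exactly 20 identities; the probability of 20 or more is the sum over i = 20 … 100 and is only slightly larger, because the terms fall off quickly.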
Alignment becomes more difficult at lower levels of similarity
Sequence and functional similarity
Petsko & Ringe fig 4.3
Single domain
Multiple domains
Translated BLAST
Method    Query                     Database
------------------------------------------------------------------------
BLASTP    protein                   protein
BLASTN    DNA                       DNA
BLASTX    DNA (6 reading frames)    protein
TBLASTN   protein                   DNA (6 reading frames)    Time consuming
TBLASTX   DNA (6 reading frames)    DNA (6 reading frames)    More time consuming
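A minimal sketch of what "6 reading frames" means: three offsets on the given strand and three on its reverse complement, each translated with the standard genetic code. The helper names and the example DNA string are illustrative; ambiguous or incomplete codons are handled in the simplest possible way (unknown codons become `X`, trailing bases are dropped).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translations of frames +1, +2, +3 followed by -1, -2, -3."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


frames = six_frames("ATGGCCTAA")
print(frames)
```

Every DNA query or database entry is expanded into six protein sequences before comparison, which is why the translated searches in the table are marked as time consuming.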
Multiple sequence alignment
Alignment of more than two sequences to produce the best global fit (optimal placement of gaps etc.)
Sequence no.   Position
               1  2  3  4  5  6  7  8  9  10
               -----------------------------
I              Y  D  G  G  A  V  -  E  A  L
II             Y  D  G  G  -  -  -  E  A  L
III            F  E  G  G  I  L  V  E  A  L
IV             F  D  -  G  I  L  V  Q  A  V
V              Y  E  G  G  A  V  V  Q  A  L
• Computationally difficult; the cost is proportional to (sequence length)^(number of sequences)
• Not mathematically guaranteed to find the best solution
• Most methods start with pairwise alignments
• Clustal is a common program for MSA
Reconstruction of evolution from sequence alignments
Related sequences: ACGH, DBGH, ADIJ, CBIJ
Inferred ancestors: ABGH, ABIJ
Mutations on the branches to the related sequences: B->C, A->D, B->D, A->C
Mutations between the two ancestors: I<->G, J<->H
Parsimony: most probable path?
Minimal number of mutations?
Related sequences; phylogenetic tree
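A minimal sketch that counts the mutations along the tree in the four-sequence example, assuming the tree shape read from the slide: ABGH is the ancestor of ACGH and DBGH, ABIJ is the ancestor of ADIJ and CBIJ, and the two ancestors are joined by one branch. It only scores this one given tree; it does not search for the most parsimonious one.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# (ancestor, descendant) pairs, plus the branch joining the two ancestors.
TREE_BRANCHES = [
    ("ABGH", "ACGH"),
    ("ABGH", "DBGH"),
    ("ABIJ", "ADIJ"),
    ("ABIJ", "CBIJ"),
    ("ABGH", "ABIJ"),
]

mutations_per_branch = [hamming(parent, child) for parent, child in TREE_BRANCHES]
total_mutations = sum(mutations_per_branch)
print(mutations_per_branch, total_mutations)
```

A parsimony method would compute this total for alternative trees and ancestors and keep the arrangement with the smallest value.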
Multiple sequence alignment: example
Source: protein kinase domains from Pfam
Petsko & Ringe fig 4.4
Catalytic loop
Insertions/deletions
Database search using multiple alignment: PSI-BLAST
• Position-specific iterated BLAST
• Useful for detection of related sequences with weak similarity
• Step 1. Identify close relatives and perform a multiple sequence alignment. Generate a sequence profile (PSSM) from the MSA
• Step 2. Query the database with the generated profile. Hits can be added to the alignment and the profile can be modified.
• Repeat step 2 until no more sequences are added to the alignment
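A minimal sketch of the profile in step 1, built from the five-sequence alignment shown earlier in the lecture. It assumes a uniform background frequency of 1/20, a flat pseudocount of 1 per amino acid and no sequence weighting, all of which are simplifications compared with PSI-BLAST itself; the function names are illustrative.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """One dict of log-odds scores (base 2) per alignment column; gaps are ignored."""
    background = 1 / len(AMINO)
    pssm = []
    for col in range(len(alignment[0])):
        residues = [row[col] for row in alignment if row[col] != "-"]
        total = len(residues) + pseudocount * len(AMINO)
        pssm.append({
            aa: math.log2(((residues.count(aa) + pseudocount) / total) / background)
            for aa in AMINO
        })
    return pssm


def score_sequence(pssm, seq):
    """Sum of position-specific scores for an ungapped sequence of profile length."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


MSA = ["YDGGAV-EAL", "YDGG---EAL", "FEGGILVEAL", "FD-GILVQAV", "YEGGAVVQAL"]
pssm = build_pssm(MSA)
related = score_sequence(pssm, "YDGGAVVEAL")
unrelated = score_sequence(pssm, "WWWWWWWWWW")
print(round(related, 2), round(unrelated, 2))
```

Unlike a single 20 × 20 matrix, the profile scores the same amino acid differently at different positions: G is rewarded strongly in column 4, where it is fully conserved, and much less elsewhere. In step 2 the new hits would be added to `MSA` and the profile rebuilt.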
Sequence conservation and structure
Cellulose-binding domains