
2011-05-23


Sequence comparison and alignment

• Maximum parsimony: choose the alignment requiring the least improbable combination of point mutations and insertions/deletions

Before alignment
Sequence1 AGGVLIIQVG
          ||||||
Sequence2 AGGVLIQVG

After alignment
Sequence1 AGGVLIIQVG
          |||||| |||
Sequence2 AGGVLI-QVG

Insertion in sequence 1 or deletion in sequence 2; gap

Comparing longer sequences


Scoring of alignments, scoring matrices

• Unitary scoring matrix: identity = one point; otherwise no point

• Does not take into account that some mutations are more common than others

• Insensitive; difficult to detect distant relationships


Mutation matrix summarises observed mutations

From comparison of sequences known to be related; see also Petsko & Ringe fig 1-6

Mutation probability matrix for the evolutionary distance of 2 PAM

Columns: Gly Pro Asp Glu Ala Asn Gln Ser Thr Lys Arg His Val Ile Met Cys Leu Phe Tyr Trp Sum

Gly 9 8 7 0 1 7 1 3 2 2 4 0 2 2 1 1 4 2 8 5 0 1 7 0 0 3 2 0 0 0 10063
Pro 7 9 8 5 0 1 1 3 2 3 9 1 3 1 1 5 3 0 0 4 3 0 0 0 0 0 0 9942
Asp 8 1 9 7 5 7 9 6 1 3 4 5 2 7 2 6 2 8 0 6 4 0 1 0 2 0 0 0 9996
Glu 1 3 1 7 9 5 9 7 2 6 2 1 9 4 0 1 5 1 2 1 3 0 4 7 4 1 0 4 0 0 0 9981
Ala 4 2 5 4 2 4 3 7 9 7 3 0 3 1 3 4 9 9 4 5 1 8 0 5 3 2 3 1 9 5 5 5 0 0 10188
Asn 1 0 1 0 3 6 7 1 4 9 7 0 1 2 0 5 1 1 7 1 9 7 2 4 4 4 1 0 2 0 0 0 9927
Gln 4 1 1 1 6 2 4 1 2 1 5 9 7 3 6 1 3 1 0 9 1 4 1 4 5 4 1 1 0 2 0 0 0 9900
Ser 2 6 1 5 2 8 1 6 5 9 6 7 2 2 9 5 9 8 6 9 1 4 2 1 7 7 4 2 3 2 7 3 6 0 0 10003
Thr 6 8 3 1 4 3 0 2 5 2 0 7 6 9 7 5 9 1 0 0 8 2 0 2 4 1 1 8 5 3 0 0 10030
Lys 5 6 1 3 2 1 1 7 3 7 2 3 2 2 1 4 9 8 4 5 6 5 1 4 1 3 9 1 1 0 6 0 4 0 10125
Arg 0 0 0 0 0 5 1 3 1 0 2 3 9 8 8 1 1 7 0 0 1 8 0 0 2 0 0 9960
His 0 0 4 3 2 2 0 1 5 1 0 5 6 1 9 9 8 6 5 1 4 0 0 3 3 4 1 1 9975
Val 6 8 5 1 0 2 7 7 1 2 9 2 5 1 2 0 3 9 7 8 3 1 5 6 8 2 1 8 2 2 3 0 0 10188
Ile 0 2 0 3 1 3 4 3 1 4 4 0 4 7 0 9 7 0 3 2 2 3 2 2 1 4 0 0 9872
Met 0 0 0 0 2 0 4 5 2 2 7 0 1 2 7 9 6 7 2 5 1 4 5 0 0 9737
Cys 1 0 0 0 1 0 0 1 2 3 0 0 0 6 2 1 1 9 9 2 8 0 0 0 0 9964
Leu 2 0 3 7 4 3 4 5 7 6 0 6 2 4 5 2 9 9 0 9 8 9 9 1 9 0 0 10140
Phe 0 0 0 0 2 0 0 5 2 0 3 4 2 1 8 1 8 0 1 0 9 8 7 9 7 4 3 0 10047
Tyr 0 0 0 0 0 0 0 0 0 2 0 4 0 0 0 0 0 5 1 9 9 0 9 1 7 9981
Trp 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 8 7 9 9 4 1 9960

All entries are multiplied by 10,000. An element of this matrix, m(i,j), gives the probability that the amino acid in column j will be replaced by the amino acid in row i after an evolutionary interval of 2 PAM, i.e., 2 accepted point mutations per 100 amino acids. Thus, there is a probability of 0.0059 that Ala will be replaced by Ser, and a probability of 0.0099 that Ser will be replaced by Ala. The sum of each column is 1.0. The sum of a row represents the growth factor per 2 PAM of the corresponding amino acid residue; it ranges from 0.9737 (for Met) to 1.0188 (for Ala and Val).

(PAM=percentage of accepted mutations)


Scoring matrix from mutation matrix

Each matrix element relates the probability of similarity due to conservation to the probability of chance similarity

"Log-odds" matrices: each element uses the logarithm of the substitution probabilities

Gaps: gap creation penalty, gap extension penalty

Weighting with log-odds matrix

Scoring
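As a rough illustration of the two ideas on this slide, the sketch below turns a substitution probability into a log-odds score and charges a gap with separate creation and extension penalties. The function names, the scaling (10 times log10, rounded to an integer), the penalty values and the convention that the extension penalty is charged for every gapped position are all assumptions made for the example, not values from the lecture.

```python
import math


def log_odds(p_substitution, p_chance, scale=10.0):
    """Log-odds score: log of (probability due to conservation / chance probability).

    The scale factor and rounding are assumed conventions for this sketch.
    """
    return round(scale * math.log10(p_substitution / p_chance))


def gap_penalty(length, creation=10, extension=1):
    """Affine gap cost: pay once to create the gap, then once per gapped position.

    The default penalty values are hypothetical.
    """
    if length <= 0:
        return 0
    return creation + extension * length
```

A substitution seen more often than expected by chance gets a positive score, one seen less often gets a negative score, and a long gap costs only slightly more than a short one.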


Log-odds matrix for 250 PAM

C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   8
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -8  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   8
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W

Symmetric


Assumptions of the PAM model

Assumptions in PAM model:
1. Replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model).
2. Sequences that are being compared have average amino acid composition.

Sources of error in PAM model:
1. Many sequences depart from average composition.
2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements were observed!).
3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM.
4. The Markov process is an imperfect representation of evolution: distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over the entire sequence.

Must use sequences with known relation (>85 % identity) and extrapolate to lower levels of similarity
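Under the Markov assumption, extrapolating from 1 PAM to 250 PAM amounts to applying the same mutation probability matrix 250 times, which is also why small errors in the 1 PAM matrix are magnified. The sketch below shows this with a made-up three-letter alphabet; `TOY_PAM1` and its values are purely hypothetical and only chosen so that each column sums to 1 and about 1 % of positions change per step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical toy matrix: entry [i][j] is the probability that the letter in
# column j is replaced by the letter in row i after one step ("1 PAM").
TOY_PAM1 = [
    [0.990, 0.006, 0.002],
    [0.007, 0.990, 0.008],
    [0.003, 0.004, 0.990],
]

# Extrapolation to a distance of 250 steps under the Markov assumption.
TOY_PAM250 = mat_power(TOY_PAM1, 250)
```

The columns of `TOY_PAM250` still sum to 1, but the diagonal entries are much smaller than in `TOY_PAM1`, reflecting the many substitutions accumulated over the longer interval.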


Evolutionary rates

Rates of mutation acceptance (PAMs per 100 million years)

Ig kappa chain C region          37
Kappa casein                     33
Phospholipase A                  19
Prolactin                        17
Carbonic anhydrase C             16
Hemoglobin alpha chain           12
Lipid-binding protein A-II       10
Animal lysozyme                  9.8
Myoglobin                        8.9
Trypsin                          5.9
Alpha crystallin A chain         5.0
Cytochrome b                     4.5
Calcitonin                       4.3
Neurophysin 2                    3.6
Lactate dehydrogenase            3.4
Adenylate kinase                 3.2
Triosephosphate isomerase        2.8
Vasoactive intestinal peptide    2.6
Cytochrome c                     2.2
Plant ferredoxin                 1.9
Troponin C, skeletal muscle      1.5
Glutamate dehydrogenase          0.9
Histone H2B                      0.9
Histone H2A                      0.5
Histone H3                       0.14
Histone H4                       0.10
Ubiquitin                        0.00

Depend on protein; different mutation rates and selection pressures


BLOSUM matrices

•  Based on short local alignments (BLOCKS database). Known relatedness not required

•  Suitable level of similarity can be used


BLOSUM vs. PAM

It is best to use a scoring matrix derived from sequences with a level of similarity close to that of the sequences being investigated

Optimal pairwise alignment depends on efficient algorithms

•  Problem: finding the alignment with highest score
•  Calculation of score for all possibilities not feasible; need optimization method to find best solution with minimum computation:
•  Dynamic programming method

[Figure: two dot plots comparing the sequence MNALSQLN (horizontal axis) with a second short sequence (vertical axis)]

Illustration with dot plots: finding the best path


Real proteins

Needleman-Wunsch algorithm: implementation of the dynamic programming method for finding the optimal solution in pairwise alignment. Mathematically guaranteed: it can be proven that the best alignment will be found.
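A minimal sketch of the dynamic programming idea behind Needleman-Wunsch is given below. It uses the unitary scoring from earlier in the lecture (identity = 1, otherwise 0) and, as an assumption for simplicity, a linear gap penalty of -1 per gapped position instead of separate creation and extension penalties. The function name and parameters are illustrative only.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned1, aligned2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell holds the best score for aligning the two prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align two residues
                              score[i - 1][j] + gap,       # gap in seq2
                              score[i][j - 1] + gap)       # gap in seq1

    # Traceback from the bottom-right corner recovers one optimal alignment.
    aligned1, aligned2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                aligned1.append(seq1[i - 1])
                aligned2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned1)), "".join(reversed(aligned2))
```

Called as `needleman_wunsch("AGGVLIIQVG", "AGGVLIQVG")`, it aligns the two example sequences from the first slide with a single gap in the shorter sequence. When several alignments share the best score, the traceback returns just one of them.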


Database searching

•  How to find sequences related to a given sequence in a large database?
•  Need large number of sequence comparisons and scoring of results
•  Need fast methods for sequence comparisons; approximate
•  Word methods (k-tuple): FASTA, BLAST
•  Search for "words" (k-tuples); investigate hits more closely
•  Result: a number of optimised alignments (with gaps) ranked according to score


How BLAST works

Query sequence

“words” (subsequences of the query sequence)

Query words are compared to the database (target sequences) and exact matches identified

For each word match, alignment is extended in both directions to find alignments that score greater than some threshold (maximal segment pairs, or MSPs) (Schneider and La Rota 2000)
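The word-matching and extension steps described above can be sketched roughly as follows. This is a simplified illustration and not the actual BLAST implementation: it only uses exact word matches, extends without gaps along the whole diagonal, and the word length and the match/mismatch scores are assumed values. The function names are hypothetical.

```python
def find_seeds(query, target, k=3):
    """Return (query_pos, target_pos) pairs where a k-letter query word occurs exactly in the target."""
    index = {}
    for t in range(len(target) - k + 1):
        index.setdefault(target[t:t + k], []).append(t)
    return [(q, t)
            for q in range(len(query) - k + 1)
            for t in index.get(query[q:q + k], [])]


def extend_seed(query, target, q, t, k=3, match=1, mismatch=-1):
    """Extend a word match in both directions without gaps; return (score, query segment)."""

    def best_extension(qi, ti, step):
        # Walk along the diagonal and remember the best running score seen.
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= ti < len(target):
            running += match if query[qi] == target[ti] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            qi += step
            ti += step
        return best, best_len

    right_score, right_len = best_extension(q + k, t + k, +1)
    left_score, left_len = best_extension(q - 1, t - 1, -1)
    total = k * match + left_score + right_score
    return total, query[q - left_len:q + k + right_len]
```

For example, `find_seeds("AGGVLIQVG", "MKAGGVLIQVGTT")` finds the word AGG at query position 0 and target position 2, and `extend_seed` grows that seed into the full matching segment. Segments scoring above a chosen threshold would then be kept as candidate hits.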


BLAST

Optimize and rank HSPs


FASTA


Statistics

Score (S): measure of similarity between query sequence and match

Expect (E) value: a parameter that describes the number of hits (with score ≥ S) one can expect to see by chance when searching a database of a particular size. It decreases exponentially as the score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.

K and λ: parameters for database and scoring; n’ and m’ relate to sequence lengths; D: size of database
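To make the exponential dependence on S concrete, the sketch below assumes the commonly quoted Karlin-Altschul form E = K · m’ · n’ · exp(−λS). This exact form, how the database size D enters the effective lengths, and the default values of K and λ are assumptions for illustration; they are not taken from the slide.

```python
import math


def expect_value(score, m_eff, n_eff, k_param=0.13, lam=0.32):
    """Expected number of chance hits with at least this score.

    Assumed form: E = K * m' * n' * exp(-lambda * S).
    k_param and lam are placeholder values, not fitted parameters.
    """
    return k_param * m_eff * n_eff * math.exp(-lam * score)
```

With everything else fixed, raising the score by 10 divides E by exp(10λ), while doubling either effective length doubles E.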


BLAST output 1: Overview

Sequence motifs detected

BLAST output 2: List


BLAST output 3: alignments

Best hit (number 1)

Number 25


Significance of sequence similarity

Practical definition: > 20 % of residues identical (after reasonable correction for insertions/deletions). Probability of 20 % identity by chance in 100-residue sequences?

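After Schulz and Schirmer, Principles of Protein Structure

Let l = sequence length, i = number of identical amino acids

\[
P = \left(\frac{1}{20}\right)^{i} \left(\frac{19}{20}\right)^{l-i} \frac{l!}{i!\,(l-i)!}
\]

With l = 100 and i = 20 we get

\[
P = \left(\frac{1}{20}\right)^{20} \left(\frac{19}{20}\right)^{80} \frac{100!}{20!\,80!} \approx 10^{-7}
\]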

Alignment becomes more difficult at lower levels of similarity
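The chance probability of i identities among l aligned positions can be checked numerically. The sketch below evaluates the binomial expression with 20 equally likely amino acids, which is the assumption behind the formula; the function name is chosen for illustration.

```python
from math import comb


def identity_probability(length, identical, alphabet=20):
    """Probability of exactly `identical` matching positions out of `length` by chance,
    assuming all `alphabet` residue types are equally likely at every position."""
    p = 1 / alphabet
    return comb(length, identical) * p**identical * (1 - p)**(length - identical)
```

For `identity_probability(100, 20)` the result is on the order of 10^-7, in line with the estimate above.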


Sequence and functional similarity

Petsko & Ringe fig 4.3

Single domain

Multiple domains


Translated BLAST

Method    Query                    Database
------------------------------------------------------------------------------
BLASTP    protein                  protein
BLASTN    DNA                      DNA
BLASTX    DNA (6 reading frames)   protein
TBLASTN   protein                  DNA (6 reading frames)   Time consuming
TBLASTX   DNA (6 reading frames)   DNA (6 reading frames)   More time consuming
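The "6 reading frames" in the table are the three codon offsets on each of the two DNA strands. A minimal sketch of that translation step is shown below, using the standard genetic code; the function names are illustrative and incomplete or ambiguous codons are simply written as X, which is an assumption of this example.

```python
BASES = "TCAG"
# Standard genetic code, ordered by first, second and third base in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of the sequence; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Translate a DNA sequence in three offsets on each strand."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]
```

Each of the six protein sequences is then compared separately, which is why the translated searches take longer than BLASTP or BLASTN.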


Multiple sequence alignment

Alignment of more than two sequences to produce best global fit (optimal placement of gaps etc.)

Position        1 2 3 4 5 6 7 8 9 10
Sequence no.
I               Y D G G A V - E A L
II              Y D G G - - - E A L
III             F E G G I L V E A L
IV              F D - G I L V Q A V
V               Y E G G A V V Q A L

•  Computationally difficult; proportional to (sequence length)^(number of sequences)

•  Not mathematically guaranteed to find the best solution

•  Most methods start with pairwise alignments

•  Clustal is a common program for MSA


Reconstruction of evolution from sequence alignments

Related sequences: ACGH, DBGH, ADIJ, CBIJ

[Figure: phylogenetic tree with leaves ACGH, DBGH, ADIJ, CBIJ and inferred ancestors ABGH and ABIJ; mutations B->C, A->D, B->D, A->C; I<->G, J<->H]

Parsimony: most probable path?

Minimal number of mutations?


Multiple sequence alignment: example

Source: protein kinase domains from Pfam

Petsko & Ringe fig 4.4

Catalytic loop

Insertions/deletions


Database search using multiple alignment: PSI-BLAST

•  Position-specific iterated BLAST
•  Useful for detection of related sequences with weak similarity
•  Step 1. Identify close relatives and perform multiple sequence alignment. Generate sequence profile (PSSM) from MSA
•  Step 2. Query database with the generated profile. Hits can be added to the alignment and the profile can be modified.
•  Repeat step 2 until no more sequences are added to alignment
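The profile generated in step 1 can be pictured as one column of scores per alignment position. The sketch below builds such a position-specific scoring matrix from a small alignment using simple counts, a flat pseudocount and a uniform background frequency. These simplifications, and the function names, are assumptions for illustration; PSI-BLAST itself uses more elaborate weighting.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, background=0.05, pseudocount=1.0):
    """Return one dict of log-odds scores (residue -> score) per alignment column.

    Gaps ('-') are ignored; background and pseudocount are assumed values.
    """
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score_window(pssm, peptide):
    """Score an ungapped peptide of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, peptide))


# The five-sequence example alignment from the multiple sequence alignment slide.
EXAMPLE_MSA = ["YDGGAV-EAL", "YDGG---EAL", "FEGGILVEAL", "FD-GILVQAV", "YEGGAVVQAL"]
EXAMPLE_PSSM = build_pssm(EXAMPLE_MSA)
```

A peptide that follows the conserved columns, such as `"YDGGAVVEAL"`, scores higher against `EXAMPLE_PSSM` than an unrelated one, which is what lets the profile pick up weakly similar sequences in step 2.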


Sequence conservation and structure

Cellulose-binding domains
