Scoring Matrices for Sequence Alignment

Anne Haake

Rhys Price Jones

Scoring Matrices

• Sequence comparisons require some scoring matrices

• To use the alignment algorithms to do database searches, we need some scoring schemes that are based on biological knowledge.

  – Scoring matrices represent evolutionary theory

• The choice of matrix can influence the outcome of the analysis

  – Understanding the theory can help in making an appropriate choice

Nucleotide Scoring

1. Identity matrix (similarity)

    A   T   C   G
A   1   0   0   0
T   0   1   0   0
C   0   0   1   0
G   0   0   0   1

Nucleotide Scoring

2. BLAST matrix

A T C G

A 5 -4 -4 -4

T -4 5 -4 -4

C -4 -4 5 -4

G -4 -4 -4 5
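As a minimal sketch of how such a matrix is applied, the snippet below sums match and mismatch scores over an ungapped pair of equal-length sequences. The values 5 and -4 come from the matrix above; the function name `blast_score` and the example sequences are illustrative assumptions, and gaps are not handled.

```python
# Match/mismatch values taken from the BLAST nucleotide matrix above.
MATCH, MISMATCH = 5, -4


def blast_score(seq1: str, seq2: str) -> int:
    """Sum match/mismatch scores over an ungapped, equal-length alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length (gaps are not handled)")
    return sum(MATCH if a == b else MISMATCH for a, b in zip(seq1, seq2))


# Three matches and one mismatch: 3 * 5 + (-4) = 11
print(blast_score("ACGT", "ACGA"))
```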

Nucleotide Scoring

3. Transition/Transversion Matrix

    A   T   C   G
A   0   5   5   1
T   5   0   1   5
C   5   1   0   5
G   1   5   5   0
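The values in this matrix follow a simple rule: a transition (purine to purine, or pyrimidine to pyrimidine) costs 1, a transversion costs 5, and identity costs 0. The sketch below rebuilds the matrix from that rule. The helper name `substitution_cost` is an assumption made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_cost(a: str, b: str, transition: int = 1, transversion: int = 5) -> int:
    """Cost of replacing base a with base b (0 for identity)."""
    if a == b:
        return 0
    pair = {a, b}
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return transition if same_class else transversion


BASES = "ATCG"  # same row/column order as the slide
matrix = {a: {b: substitution_cost(a, b) for b in BASES} for a in BASES}

for a in BASES:
    print(a, [matrix[a][b] for b in BASES])
```

Because identical bases score 0 and changes score higher, this is a cost (distance) matrix, unlike the two similarity matrices before it.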

Protein Scoring

1. Identity Matrix

  – Score 1 if equal
  – Score 0 if not equal
  – Easy, but weak

2. Genetic code Matrix

  – Determine the minimum number of base changes required to convert one amino acid into another
  – Edit distance: will be 0, 1, 2, or 3
  – Is a distance matrix
  – Not very discriminating; consider CAU

Code Table

http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeE.html#page1

http://www.people.virginia.edu/~rjh9u/codetabl.html
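The genetic code matrix described above can be sketched as follows: for two amino acids, take the smallest number of base differences over all pairs of their codons. The standard genetic code is encoded here as a 64-character string in TCAG order, written with DNA letters (T in place of U). The function names are assumptions made for this illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(aa: str) -> list:
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, a in CODON_TABLE.items() if a == aa]


def genetic_code_distance(aa1: str, aa2: str) -> int:
    """Minimum number of base changes needed to turn aa1 into aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )


print(genetic_code_distance("H", "Q"))  # His (CAT/CAC) vs Gln (CAA/CAG)
print(genetic_code_distance("M", "Y"))  # Met (ATG) vs Tyr (TAT/TAC)
```

With only four possible values (0 to 3) spread over all amino acid pairs, many unrelated pairs receive the same score, which is one way to see why the matrix is not very discriminating.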

Protein Scoring

3. Hydrophobicity Matrix

  – Based on physical/chemical properties of the amino acids

Hydrophobicity matrix

4. Log odds Matrices

  – Which amino acids are most likely to be seen?
  – In close relatives? In distant relatives?

Ex. PAM and BLOSUM matrices



PAM and BLOSUM Substitution Matrices for Amino Acids

• Based on actual substitution rates among the various amino acids in nature

• Empirically derived; huge amount of work!

• General Strategy:

  – Select a collection of related proteins and align them
  – Observe the frequencies with which one amino acid is replaced by another = A
  – Figure out how often, given the frequencies of the amino acids in your set, the replacement would occur by chance alone = B
  – The ratio A/B (odds) tells us how often the replacement has occurred in evolution (as compared to a random process)

PAM and BLOSUM Substitution Matrices for Amino Acids

• Matrices are 20 X 20 tables of values that describe the probability of a residue pair occurring in an alignment

  – The scoring matrix values are logarithms of ratios of the probability of a meaningful occurrence to the probability of random occurrence.
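A minimal sketch of the log odds idea: divide the observed frequency of a residue pair (A) by the frequency expected by chance (B), then take the logarithm. It assumes the chance frequency is simply the product of the two background frequencies and uses base 2; real matrices also apply a scaling factor and rounding, which are omitted here. The function name and the numbers are made up for illustration.

```python
import math


def log_odds(observed_pair_freq: float, freq_i: float, freq_j: float, base: float = 2.0) -> float:
    """log(A / B): A is the observed pair frequency, B the chance expectation."""
    expected = freq_i * freq_j  # assumption: residues pair independently by chance
    return math.log(observed_pair_freq / expected, base)


# Pair seen twice as often as chance predicts: positive score.
print(log_odds(0.125, 0.25, 0.25))
# Pair seen half as often as chance predicts: negative score.
print(log_odds(0.03125, 0.25, 0.25))
```

Positive scores mark replacements that evolution tolerates more often than chance would suggest, negative scores mark replacements that are selected against, and a score of 0 means the pair occurs at the chance rate.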

PAM Matrices

• PAM stands for Point Accepted Mutation or Percent Accepted Mutation

• Developed by Dayhoff et al. 1978.

  – Model based on empirically derived data
  – Groups of closely related proteins were aligned (global alignments)
      • So that probability of more than one replacement at a single site was negligible
      • 1,572 changes in 71 groups of closely related proteins

1 PAM

http://www.icp.ucl.ac.be/~opperd/private/Pammatrix.html

PAM Unit

• Matrix represents substitution probabilities over a fixed unit of evolutionary change

  – e.g. PAM1 is 1 substitution per 100 residues or one PAM unit (an amount of evolution)
  – 1% divergence

• Start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all a.a. residues have undergone changes at time t+n. New sequence M’

• What is the probability that a.a. i in M will be replaced by a.a. j in M’?

  – To get your answer, look it up in the PAM-1 table (entry Rij)

PAM Matrix

• Matrix values are based on the model that one sequence is derived from the other by a series of independent mutations, each changing one amino acid in the first sequence to another amino acid in the second

• The model is an approximation

  – Many assumptions
  – Not all of the assumptions necessarily hold

PAM Matrix

• PAM-1 is used to derive other PAM matrices

  – Why?

• PAM-1 is 1% accepted mutations

• PAM-N is N% accepted mutations

• To derive PAM-N, the PAM-1 matrix is multiplied by itself N times

  – e.g. PAM-100; PAM-250
  – What does this mean for errors?
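The extrapolation step can be sketched with a toy example. The 3 X 3 matrix below is an invented stand-in for the real 20 X 20 PAM-1 mutation probability matrix (its numbers are not Dayhoff's data), and `mat_pow` is a hypothetical helper. The multiplication is applied to probabilities, not to the final log odds scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(m, n):
    """Return the product of n copies of m (identity matrix when n is 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM-1": each residue is unchanged with probability 0.99.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1, but the diagonal shrinks as N grows because repeated and back mutations accumulate. Any sampling error in the PAM-1 estimates is carried through every multiplication, which bears on the question about errors above.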

Which PAM matrix do I use?

• Depends on how closely the sequences are believed to be related

• Use PAM-1 for more closely related sequences

• Use PAM-1000 for more distant relationships

• In practice, PAM-250 often used in alignment and database searching software.

BLOSUM Matrix

• BLOSUM is from BLOcks SUbstitution Matrix

• Originates with a paper by Henikoff and Henikoff (1992; PNAS 89:10915-10919)

BLOSUM Matrix

• Derived from the BLOCKS database

BLOCKS database

• Derived by observing substitution rates among similar protein sequences

  – Uses families of (distantly) related protein sequences, because a multiple alignment is needed
  – The interest is in substitutions rather than indels, which tend to occur more in distantly related sequences

• Ungapped multiple alignments are used to identify conserved blocks of amino acids

http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeE.html#Blocksdb

BLOSUM matrix

• A clustering approach is used to sort the sequences into closely related groups, where the sequences are similar at some threshold value of percentage identity

  – e.g. BLOSUM62 is the standard matrix for ungapped alignment. 62 represents the cutoff value for clustering (sequences are put into the same cluster if more than 62% identical).

• Substitution frequencies for all pairs of amino acids are then calculated between the groups, and these are used to calculate a log odds BLOSUM matrix
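A rough sketch of the clustering step, under simplifying assumptions: the sequences are equal-length ungapped block segments, percent identity is counted column by column, and a sequence joins a cluster if it reaches the threshold with any member (single linkage). Whether the cutoff is inclusive is an assumption here, and the toy sequences and function names are invented for the example.

```python
def percent_identity(s1: str, s2: str) -> float:
    """Percentage of aligned positions that are identical (ungapped, equal length)."""
    matches = sum(a == b for a, b in zip(s1, s2))
    return 100.0 * matches / len(s1)


def cluster(seqs, threshold=62.0):
    """Group sequences so that each one reaches the threshold with some cluster member."""
    clusters = []
    for seq in seqs:
        linked, unlinked = [], []
        for c in clusters:
            hit = any(percent_identity(seq, member) >= threshold for member in c)
            (linked if hit else unlinked).append(c)
        # Merge the new sequence with every cluster it links to.
        merged = [seq] + [member for c in linked for member in c]
        clusters = unlinked + [merged]
    return clusters


block = ["ACDEFGHI", "ACDEFGHK", "WYWYWYWY"]
for group in cluster(block):
    print(group)
```

In this toy block the first two segments are 87.5% identical and fall into one group, while the third stays alone. Substitutions would then be counted between groups rather than within them, so that a family with many near-identical members does not dominate the frequencies.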

BLOSUM

• BLOSUM-62 matrix: appropriate for comparing sequences of approximately 62% sequence similarity

• BLOSUM-80 matrix: 80% similarity

PAM vs BLOSUM

• Lower PAM numbers used for more closely related sequences

• Lower BLOSUM numbers used for more distantly related sequences

• Dayhoff-like matrices (PAM) derive their initial substitution frequencies from global alignments of very similar sequences.

• The BLOSUM matrix is derived from local multiple alignments of more distantly related sequences

Constructing a BLOSUM matrix

• In class

• In lab

Constructing PAM Matrices

• A multiple alignment is constructed between sequences with high identity (>85%)

• A phylogenetic tree is constructed from the aligned sequences

• Substitutions are identified between each pair of sequences in the tree

• The substitution matrix is constructed by calculating the frequency of substitution for each amino acid, the relative mutability for each, and the mutation probability for each pair of amino acids (see example)
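The last step can be sketched on a toy three-letter alphabet. This assumes the usual Dayhoff-style formulation, in which the probability that residue j is replaced by residue i is proportional to the mutability of j times the share of j's observed substitutions that went to i, with a scaling constant chosen so that 1% of positions change overall. The counts, mutabilities, frequencies, and the function name are all invented for illustration and are not Dayhoff's data.

```python
def mutation_probability_matrix(counts, mutability, freqs, target=0.01):
    """Return M where M[i][j] is the probability that residue j is replaced by i.

    counts[i][j]: observed substitutions between i and j (diagonal ignored)
    mutability[j]: relative mutability of residue j
    freqs[j]: background frequency of residue j
    target: overall fraction of positions expected to change (0.01 for 1 PAM)
    """
    n = len(freqs)
    # Scale so that the frequency-weighted chance of change equals the target.
    lam = target / sum(freqs[j] * mutability[j] for j in range(n))
    m = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_total = sum(counts[i][j] for i in range(n) if i != j)
        for i in range(n):
            if i == j:
                m[i][j] = 1.0 - lam * mutability[j]
            else:
                m[i][j] = lam * mutability[j] * counts[i][j] / col_total
    return m


toy_counts = [[0, 30, 10], [30, 0, 20], [10, 20, 0]]
toy_mutability = [1.0, 0.8, 0.5]
toy_freqs = [0.5, 0.3, 0.2]

M = mutation_probability_matrix(toy_counts, toy_mutability, toy_freqs)
for row in M:
    print([round(p, 5) for p in row])
```

Each column sums to 1, and weighting the off-diagonal mass by the background frequencies gives the 1% overall change that defines one PAM unit.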

Constructing PAM Matrix

• Example
