scoring matrices for sequence alignment
![Page 1: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/1.jpg)
Scoring Matrices for Sequence Alignment
Anne Haake
Rhys Price Jones
![Page 2: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/2.jpg)
Scoring Matrices
• Sequence comparisons require scoring matrices
• To use the alignment algorithms for database searches, we need scoring schemes that are based on biological knowledge.
– Scoring matrices represent evolutionary theory
• The choice of matrix can influence the outcome of the analysis
– Understanding the theory can help in making an appropriate choice
![Page 3: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/3.jpg)
Nucleotide Scoring
1. Identity matrix (similarity)
A T C G
A 1 0 0 0
T 0 1 0 0
C 0 0 1 0
G 0 0 0 1
![Page 4: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/4.jpg)
Nucleotide Scoring
2. BLAST matrix
A T C G
A 5 -4 -4 -4
T -4 5 -4 -4
C -4 -4 5 -4
G -4 -4 -4 5
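To see how the two nucleotide matrices so far are applied, here is a minimal sketch that scores an ungapped alignment position by position. The function names and the two example sequences are made up for illustration; only the match/mismatch values (1/0 and 5/-4) come from the slides.

```python
BASES = "ATCG"

def make_matrix(match, mismatch):
    """Build a nucleotide scoring matrix as a dict keyed by (base, base)."""
    return {(a, b): (match if a == b else mismatch) for a in BASES for b in BASES}

IDENTITY = make_matrix(1, 0)   # identity matrix from the slides
BLAST = make_matrix(5, -4)     # BLAST matrix from the slides

def score_ungapped(seq1, seq2, matrix):
    """Sum the matrix entries over each aligned pair (no gaps allowed)."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))

# Hypothetical example sequences: 5 matches and 2 mismatches.
s1, s2 = "GATTACA", "GACTATA"
identity_score = score_ungapped(s1, s2, IDENTITY)  # 5 * 1 + 2 * 0
blast_score = score_ungapped(s1, s2, BLAST)        # 5 * 5 + 2 * (-4)
```

With the identity matrix a mismatch costs nothing, while the BLAST matrix penalizes it, so the same alignment is ranked differently under each scheme.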
![Page 5: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/5.jpg)
Nucleotide Scoring
3. Transition/Transversion Matrix
A T C G
A 0 5 5 1
T 5 0 1 5
C 5 1 0 5
G 1 5 5 0
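The values in this matrix follow one rule: a transition (purine to purine, or pyrimidine to pyrimidine) costs 1, a transversion costs 5, and an identity costs 0. A minimal sketch that regenerates the table from that rule is below; the function name and default arguments are illustrative assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = "ATCG"  # same row/column order as the slide

def substitution_cost(a, b, transition=1, transversion=5):
    """Cost of replacing base a by base b (0 for identity)."""
    if a == b:
        return 0
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion

# Rebuild the matrix row by row in A, T, C, G order.
ts_tv_matrix = [[substitution_cost(a, b) for b in BASES] for a in BASES]
```

Note that this is a cost (distance-like) matrix: smaller values mean more similar, the opposite of the identity and BLAST matrices.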
![Page 6: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/6.jpg)
Protein Scoring
1. Identity Matrix
– Score 1 if equal
– Score 0 if not equal
– Easy, but weak
2. Genetic Code Matrix
– Determine the minimum number of base changes required to convert one amino acid into another
– Edit distance: will be 0, 1, 2, or 3
– Is a distance matrix
– Not very discriminating; consider the codon CAU (His)
Code Table
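The genetic code distance can be sketched as follows, assuming the standard genetic code written in the usual compact U, C, A, G order. The helper names are hypothetical. The last function lists every amino acid reachable from a codon by a single base change, which is one way to see why the measure is not very discriminating.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codons_for(aa):
    """All codons that encode the given amino acid."""
    return [codon for codon, a in CODON_TABLE.items() if a == aa]

def min_base_changes(aa1, aa2):
    """Minimum number of base differences over all codon pairs (0 to 3)."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in codons_for(aa1) for c2 in codons_for(aa2))

def one_change_neighbors(codon):
    """Amino acids (or stop) reachable from codon by changing one base."""
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                reachable.add(CODON_TABLE[codon[:pos] + base + codon[pos + 1:]])
    return reachable

his_neighbors = one_change_neighbors("CAU")
```

Under these assumptions, a single base change in CAU reaches seven other amino acids with very different properties (for example Asp, Leu, Pro and Arg), and all of them get the same distance of 1 from His.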
![Page 7: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/7.jpg)
Protein Scoring
3. Hydrophobicity Matrix
– Based on physical/chemical properties of the amino acids
Hydrophobicity matrix
4. Log Odds Matrices
– Which amino acids are most likely to be seen?
– In close relatives? In distant relatives?
Ex. PAM and BLOSUM matrices
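As a rough illustration of the hydrophobicity matrix in item 3, the sketch below scores a pair of amino acids by how close their hydrophobicity values are. The slides do not say which scale is used; the Kyte-Doolittle hydropathy values for a handful of residues are an assumption here, and so is the choice of the negative absolute difference as the score.

```python
# Kyte-Doolittle hydropathy values for a few amino acids (illustrative subset;
# the choice of scale is an assumption, not taken from the slides).
KD = {"I": 4.5, "V": 4.2, "L": 3.8, "A": 1.8, "G": -0.4,
      "S": -0.8, "D": -3.5, "K": -3.9, "R": -4.5}

def hydrophobicity_score(a, b):
    """0 for identical hydropathy, increasingly negative as the values diverge."""
    return -abs(KD[a] - KD[b])

similar = hydrophobicity_score("I", "V")     # both strongly hydrophobic
dissimilar = hydrophobicity_score("I", "R")  # hydrophobic vs. charged
```

A score like this rewards chemically conservative replacements, but unlike the log odds matrices it uses no observed substitution data.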
![Page 8: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/8.jpg)
PAM and BLOSUM Substitution Matrices for Amino Acids
• Based on actual substitution rates among the various amino acids in nature
• Empirically derived; huge amount of work!
• General Strategy:
– Select a collection of related proteins and align them
– Observe the frequencies with which one amino acid is replaced by another = A
– Figure out how often, given the frequencies of the amino acids in your set, the replacement would occur by chance alone = B
– The ratio A/B (odds) tells us how often the replacement has occurred in evolution (as compared to a random process)
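The general strategy above can be sketched on toy data. The aligned residue pairs below are invented for illustration, pairs are treated as unordered, and the chance expectation B uses the simple product of residue frequencies; real PAM and BLOSUM derivations involve more steps than this.

```python
from collections import Counter

def odds(pairs, a, b):
    """Ratio A/B for the unordered residue pair {a, b} in a list of aligned pairs."""
    n = len(pairs)
    # A: observed frequency of the pair among all aligned positions.
    observed = sum(1 for x, y in pairs if {x, y} == {a, b}) / n
    # Background residue frequencies, counted over both sequences.
    residues = Counter(r for pair in pairs for r in pair)
    total = sum(residues.values())
    pa, pb = residues[a] / total, residues[b] / total
    # B: frequency expected by chance (factor 2 because the pair is unordered).
    expected = pa * pb if a == b else 2 * pa * pb
    return observed / expected

# Hypothetical aligned positions taken from pairwise alignments.
toy_pairs = [("L", "L"), ("L", "I"), ("I", "L"), ("K", "K"),
             ("L", "L"), ("K", "R"), ("I", "I"), ("K", "K")]
leu_ile = odds(toy_pairs, "L", "I")  # seen more often than chance predicts
leu_lys = odds(toy_pairs, "L", "K")  # never observed in this toy set
```

An odds value above 1 means the pair occurs more often than chance would predict; a value below 1 means it occurs less often.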
![Page 9: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/9.jpg)
PAM and BLOSUM Substitution Matrices for Amino Acids
• Matrices are 20 × 20 tables of values that describe the probability of a residue pair occurring in an alignment
– The scoring matrix values are logarithms of ratios of the probability of a meaningful occurrence to the probability of random occurrence.
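Taking the logarithm turns the ratio into an additive score. A minimal sketch is shown below; the base-2 logarithm, the scale factor of 2 (half-bit units) and the rounding to integers are assumptions chosen for illustration, not something the slides specify.

```python
from math import log2

def log_odds_score(p_meaningful, p_random, scale=2):
    """Rounded, scaled log2 of (meaningful probability / random probability)."""
    return round(scale * log2(p_meaningful / p_random))

more_than_chance = log_odds_score(0.04, 0.01)  # ratio 4
same_as_chance = log_odds_score(0.01, 0.01)    # ratio 1
less_than_chance = log_odds_score(0.01, 0.04)  # ratio 1/4
```

Because the entries are logarithms, adding the scores along an alignment corresponds to multiplying the underlying odds, and the sign shows whether a pair is favored (positive) or disfavored (negative) relative to chance.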
![Page 10: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/10.jpg)
PAM Matrices
• PAM stands for Point Accepted Mutation or Percent Accepted Mutation
• Developed by Dayhoff et al. 1978.
– Model based on empirically derived data
– Groups of closely related proteins were aligned (global alignments)
  • So that probability of more than one replacement at a single site was negligible
  • 1,572 changes in 71 groups of closely related proteins
1 PAM
![Page 11: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/11.jpg)
PAM Unit
• Matrix represents substitution probabilities over a fixed unit of evolutionary change
– e.g. PAM1 is 1 substitution per 100 residues, or one PAM unit (an amount of evolution)
– 1% divergence
• Start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all a.a. residues have undergone changes at time t+n, giving a new sequence M’
• What is the probability that a.a. i in M will be replaced by a.a. j in M’?
– To get your answer, look it up in the PAM-1 table (entry Rij)
![Page 12: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/12.jpg)
PAM Matrix
• Matrix values are based on the model that one sequence is derived from the other by a series of independent mutations, each changing one amino acid in the first sequence to another amino acid in the second
• The model is an approximation
– Many assumptions
– Not all of the assumptions necessarily hold
![Page 13: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/13.jpg)
PAM Matrix
• PAM-1 is used to derive other PAM matrices
– Why?
• PAM-1 is 1% accepted mutations
• PAM-N is N% accepted mutations
• To derive PAM-N, the PAM-1 matrix is multiplied by itself N times (i.e. raised to the Nth power)
– e.g. PAM-100; PAM-250
– What does this mean for errors?
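The extrapolation step can be sketched with a toy two-letter alphabet instead of the real 20 × 20 table. The matrix values below are invented so that 1% of residues change per step; only the idea of raising PAM-1 to the Nth power comes from the slides.

```python
def mat_mul(x, y):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """Raise matrix m to the nth power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Toy "PAM-1" for a two-letter alphabet: each residue changes with probability 0.01.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(TOY_PAM1, 250)
unchanged_after_250 = toy_pam250[0][0]  # close to 0.5 in this toy model
```

Two things are visible even in the toy case. PAM-250 does not mean 250% of residues differ, because sites can change more than once or change back. And since every power is built from the same PAM-1 values, any inaccuracy in PAM-1 is carried into all the derived matrices.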
![Page 14: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/14.jpg)
Which PAM matrix do I use?
• Depends on how closely the sequences are believed to be related
• PAM-1: use for more closely related sequences
• PAM-1000: use for more distant relationships
• In practice, PAM-250 is often used in alignment and database searching software.
![Page 15: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/15.jpg)
BLOSUM Matrix
• BLOSUM is from BLOcks SUbstitution Matrix
• Originates with a paper by Henikoff and Henikoff (1992; PNAS 89:10915-10919)
![Page 16: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/16.jpg)
BLOSUM Matrix
• Derived from the BLOCKS database
BLOCKS database
• Derived by observing substitution rates among similar protein sequences
– Uses families of (distantly) related protein sequences, because a multiple alignment is needed
– The interest is in substitutions rather than indels, which tend to occur more in distantly related sequences
• Ungapped multiple alignments are used to identify conserved blocks of amino acids
![Page 17: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/17.jpg)
BLOSUM Matrix
![Page 18: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/18.jpg)
BLOSUM matrix
• Clustering approach used to sort the sequences into closely related groups, where the sequences are similar at some threshold value of percentage identity
– e.g. BLOSUM62 is the standard matrix for ungapped alignment. The 62 represents the cutoff value for clustering (sequences are put into the same cluster if they are more than 62% identical).
• Substitution frequencies for all pairs of amino acids are then calculated between the groups, and these are used to calculate a log odds BLOSUM matrix
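The clustering step can be sketched on a toy ungapped block. The four sequences are invented, and the sketch assumes single-linkage grouping with a "greater than or equal to" comparison at the threshold; both are illustrative choices rather than a statement of how the published matrices were built.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("block sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def cluster(seqs, threshold=62.0):
    """Group sequences; a sequence joins every cluster holding a close enough member."""
    clusters = []
    for s in seqs:
        hits = [c for c in clusters
                if any(percent_identity(s, t) >= threshold for t in c)]
        merged = [s]
        for c in hits:            # merge all clusters that s links together
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters

# Hypothetical ungapped block of four sequences.
toy_block = ["AVLKG", "AVLKA", "AILRG", "TILRG"]
toy_clusters = cluster(toy_block)
```

In this toy block the first two sequences are 80% identical and form one cluster, the last two form another, and the 60% identity between AVLKG and AILRG is below the cutoff. Substitutions would then be counted between the two clusters rather than within them, which reduces the weight of closely related sequences.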
![Page 19: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/19.jpg)
BLOSUM
• BLOSUM-62 matrix: appropriate for comparing sequences of approximately 62% sequence similarity
• BLOSUM-80 matrix: 80% similarity
![Page 20: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/20.jpg)
PAM vs BLOSUM
• Lower PAM numbers used for more closely related sequences
• Lower BLOSUM numbers used for more distantly related sequences
• Dayhoff-like matrices (PAM) derive their initial substitution frequencies from global alignments of very similar sequences.
• The BLOSUM matrix is derived from local multiple alignments of more distantly related sequences
![Page 21: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/21.jpg)
Constructing a BLOSUM matrix
• In class
• In lab
![Page 22: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/22.jpg)
Constructing PAM Matrices
• A multiple alignment is constructed between sequences with high identity (>85%)
• A phylogenetic tree is constructed from the aligned sequences
• Substitutions are identified between each pair of sequences in the tree
• The substitution matrix is constructed by calculating the frequency of substitution for each amino acid, the relative mutability for each, and the mutation probability for each pair of amino acids (see example)
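The last step above can be sketched with toy numbers for a three-letter alphabet. The substitution counts and residue frequencies are invented, relative mutability is simplified to (changes involving a residue) / (its frequency), and the scaling constant is chosen so that 1% of residues change overall. Treat this as a hedged outline of a Dayhoff-style calculation, not a reproduction of the original procedure.

```python
def pam1_from_counts(counts, freqs, target=0.01):
    """Return M[(new, old)]: probability that residue `old` is replaced by `new`."""
    letters = sorted(freqs)

    def a(i, j):
        # Symmetric count of accepted substitutions between i and j.
        if i == j:
            return 0
        return counts.get((i, j), counts.get((j, i), 0))

    changes = {j: sum(a(i, j) for i in letters) for j in letters}
    # Simplified relative mutability: changes per unit of occurrence.
    mutability = {j: changes[j] / freqs[j] for j in letters}
    # Scale so the expected fraction of changed residues equals `target`.
    lam = target / sum(freqs[j] * mutability[j] for j in letters)
    matrix = {}
    for j in letters:
        for i in letters:
            if i == j:
                matrix[(i, j)] = 1.0 - lam * mutability[j]
            else:
                matrix[(i, j)] = lam * mutability[j] * a(i, j) / changes[j]
    return matrix

# Hypothetical accepted-substitution counts and residue frequencies.
toy_counts = {("A", "S"): 30, ("A", "T"): 10, ("S", "T"): 20}
toy_freqs = {"A": 0.5, "S": 0.3, "T": 0.2}
toy_pam1 = pam1_from_counts(toy_counts, toy_freqs)
unchanged = sum(toy_freqs[j] * toy_pam1[(j, j)] for j in toy_freqs)
```

Each column of the result sums to 1, and the frequency-weighted diagonal is 0.99, which is the "1% accepted mutations" property that defines PAM-1.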
![Page 23: Scoring Matrices for Sequence Alignment](https://reader035.vdocuments.net/reader035/viewer/2022062720/5681356e550346895d9cd44f/html5/thumbnails/23.jpg)
Constructing PAM Matrix
• Example