DNA sequences alignment measurement
Lecture 13
Introduction
• Measurement of the “strength” of an alignment
• Nucleic acid and amino acid substitutions
• Measurement of alignment gaps
Measurement of aligned sequences
• When aligning sequences (DNA/AA) it is assumed that:
– they have a common ancestor;
– the differences between the sequences are the result of mutations;
– important areas such as coding sequences (CDS) will be conserved: there is a bias “against” mutations in these areas;
– furthermore, there is a bias in the types of mutation: substitutions are more likely than insertions/deletions.
• The dot plot gives a visual representation of the regions where sequences align. But how do we measure the strength of these alignments?
Measurement of aligned sequences
• One way is to count the mismatches: the “difference” between the sequences.
– Hamming distance: the number of mismatched positions between strings of equal length.
– agtc
– cgta
Distance is 2 (give another example).
• If the sequences (strings) are not of equal length, then use:
– The Levenshtein distance: the minimum number of edit operations (alter/insert/delete) required to turn one string into another:
• ag-tcc
• cgctca
What is the Levenshtein distance?
• The latter technique has the advantage of allowing the inclusion of gaps.
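
The two distances above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function names `hamming` and `levenshtein` are chosen here, and the Levenshtein routine uses the usual dynamic-programming table with a cost of 1 for every alter, insert or delete.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of alter/insert/delete operations turning a into b."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # alter x into y (free when they match)
            ))
        prev = curr
    return prev[-1]


print(hamming("agtc", "cgta"))         # 2, as in the slide example
print(levenshtein("agtcc", "cgctca"))  # compare with your own answer
```

Note that `levenshtein` takes the ungapped strings; the gap in "ag-tcc" is something the edit operations introduce, not part of the input.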
Measurement of matching
• But what about the biological plausibility of these approaches to measuring “differences” between sequences (strings)?
• DNA sequence mismatches are different from ordinary string mismatches:
– the probabilities of substitutions, insertions and deletions are not the same;
– certain types of mutation, such as inversions, translocations and duplications, complicate the assessment of similarity; e.g. how would you treat tandem repeats or inverted repeats?
Nucleic acid mutations
• In sequence alignment we are trying to determine whether the differences (similarity) have occurred due to:
– chance (random mutations);
– a common origin (degree of conservation).
• One approach would be to count the percentage of matches, but there is also a need to include the bias associated with possible substitutions.
• However, similarity does not necessarily imply a common ancestor, or vice versa: Zvelebil and Baum (2008, p. 74) suggest this can occur in convergent/divergent evolution.
• So the findings of alignment tests need to be put in context (bat and bird both have wings…).
Alignment scoring methods
• In general, an alignment is given a score at each matching position, and the alignment with the largest total score is optimal and is chosen; however, suboptimal alignments may also need to be considered.
• The most basic approach is to measure the percentage of similarity.
• Given that not all “changes” occur with equal chance, there is a need to develop:
– a nucleotide substitution matrix.
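
As a rough sketch of the most basic approach, the percentage of similarity of two already-aligned sequences can be computed as below. The function name `percent_identity` and the choice of denominator (all alignment columns, including gap columns) are assumptions made for this illustration; other conventions exist.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    if not a:
        raise ValueError("aligned sequences must not be empty")
    # A column that holds a gap character ("-") never counts as a match.
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)


print(percent_identity("agtc", "cgta"))  # 50.0: two of the four columns match
```

Every match counts the same here, which is exactly the limitation that a substitution matrix is meant to address.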
Nucleotide scoring matrix
• It is known that certain mutations are more likely to occur than others: e.g. transitions (a<->g, c<->t) are more common than transversions (e.g. a<->c).
• However, since this difference is small relative to the chance of a mutation occurring at all, it is mostly ignored. The following shows a typical scoring matrix for nucleotides.
Adapted from Baxevanis p. 303
Nucleic acid scoring matrix
• The values are based on the probability of a type of substitution occurring (expected value); this includes a nucleotide substituting with itself.
• These expected values are calculated as the ratio:
– number of “observed changes” / number of changes “due to chance”.
• These values are obtained by examining large numbers of DNA sequences.
Nucleic acid scoring matrix
• Then calculate 10*log10(“expected value”).
• This ensures that the values for adjacent nucleotides can now be added, as opposed to being multiplied, in determining the alignment score.
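
Written out, the reason for taking the logarithm is that the log of a product is a sum of logs. The symbols below (S for the score of a pair, O for the observed count, C for the count expected by chance) are notation introduced here for illustration only.

```latex
S(a,b) = 10\,\log_{10}\!\left(\frac{O(a,b)}{C(a,b)}\right),
\qquad
\sum_{i} S(a_i,b_i)
  = 10\,\log_{10}\!\left(\prod_{i}\frac{O(a_i,b_i)}{C(a_i,b_i)}\right)
```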
Nucleic acid scoring matrix
• An expected value of 1 indicates the substitution has the same chance of occurring as if it were occurring randomly.
• A value greater than 1 indicates a bias in favour of the substitution.
• A value less than 1 indicates a bias against the substitution.
• A score (matrix value) of 5 corresponds to what expected value?
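
The conversion between an expected value (the observed/chance ratio) and a matrix score can be sketched as follows. The function names `score_from_ratio` and `ratio_from_score` are invented for this example; the only formula used is the 10*log10 transformation described above and its inverse.

```python
import math


def score_from_ratio(ratio: float) -> float:
    """Matrix score for an observed/chance ratio: 10 * log10(ratio)."""
    return 10 * math.log10(ratio)


def ratio_from_score(score: float) -> float:
    """Inverse of score_from_ratio: the ratio that produces a given score."""
    return 10 ** (score / 10)


for ratio in (0.5, 1.0, 2.0):
    # Ratios below 1 give negative scores, 1 gives 0, above 1 gives positive scores.
    print(ratio, round(score_from_ratio(ratio), 2))

# Adding scores corresponds to multiplying ratios: 2.0 * 0.5 == 1.0
print(round(score_from_ratio(2.0) + score_from_ratio(0.5), 6))

# Use this to check your answer to the question above.
print(ratio_from_score(5))
```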
Measuring protein similarity
• Deriving a matrix for proteins is more complex because:
– there are 20 amino acids, so there is a much larger set of possible substitutions;
– the amino acids have properties that affect the structure, and so the functionality, of the protein.
• Therefore substitutions can be conserved or semi-conserved.
• Observations show that conserved substitutions are more common:
– conserved: e.g. hydrophobic <-> hydrophobic;
– semi-conserved: e.g. hydrophilic <-> hydrophobic.
The PAM matrix
• The PAM matrix, developed in the 1970s by M. Dayhoff, was derived from all types of sequences, in both conserved and unconserved regions of the protein.
• The PAM (point accepted mutation) matrix is a measure of the chance of one amino acid being substituted by another.
• The PAM “1” matrix was “derived” from very similar sequences (close in terms of evolution) and represents one point “accepted” mutation per 100 residues.
• It assumes that a substitution is independent of previous mutations at that position.
• This means it can be used to derive matrices over longer “evolutionary distances”, i.e. more mutations per 100 residues.
The PAM matrix
• The number of times the matrix is “multiplied” by itself represents the number of mutations per 100 residues.
• So a PAM 100 represents 100 mutations per 100 residues: this of course does not mean that all 100 residues have changed, since the same position can mutate more than once.
• The most common, the PAM 250, represents 250 mutations per 100 residues.
• The original PAM 1 was derived in a similar way to the nucleotide substitution matrix:
– 10 log10(expected value)
– (the number of observed changes divided by the number of changes expected from the AA frequencies, as if changes occurred by chance, gives the expected value.)
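
The “multiplied by itself” step can be illustrated with a toy example. Everything below is hypothetical: a three-letter alphabet and made-up probabilities stand in for the real 20 x 20 table of amino acid mutation probabilities, and the helper names `mat_mul` and `mat_pow` are chosen for this sketch. The multiplication is applied to probabilities; the log scores are taken afterwards.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Return m multiplied together n times in total (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM-1-like matrix: each letter is unchanged with probability 0.99
# and changes to each of the other two letters with probability 0.005,
# i.e. on average one accepted change per 100 residues.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam100 = mat_pow(toy_pam1, 100)

# Probability that a position shows its original letter after 100 steps.
print(round(toy_pam100[0][0], 3))
```

In this toy model a little under half of the positions still show their original letter after 100 steps, which is why 100 mutations per 100 residues does not mean that every residue has changed.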
Dayhoff PAM (250) matrix
• H <-> N has a value of 2, i.e. an expected value of 1.6 (it occurred 1.6 times more often than would be expected by chance).
• Same amino acid “substitution”: scores range from high, C <-> C (12), to lower, V <-> V (4).
• Hydrophobic <-> hydrophobic: V <-> A (0); V <-> I (4).
• Hydrophilic <-> hydrophilic: K <-> T (0); K <-> R (3).
• Hydrophilic <-> hydrophobic: K <-> V (-2).
Blosum matrix
• Another matrix, the Blosum (Blocks Substitution) matrix, used a larger data set than the PAM.
• This allowed the Blosum to look at mutations within “blocks” of conserved sequences, as opposed to point mutations on individual sequences in both conserved and variable regions.
• The Blosum number, e.g. 62, unlike the PAM number, e.g. 120, indicates that the substitution probabilities are derived from sequences with that much similarity or conservation.
• So the Blosum 62 is obtained from blocks sharing 62% conservation (more precisely, sequences within a block that are 62% or more identical are clustered and counted as one).
• In addition, these results are obtained directly from the observed data, as opposed to being extrapolated as in the case of the PAM matrix.
The Blosum and PAM matrices
Adapted from Zvelebil and Baum p. 82
The two matrices opposite are the Blosum 62 and PAM 120.
The following shows scores from the Blosum 62 matrix:
• Same residues: C <-> C (9); V <-> V (4)
• Hydrophobic to hydrophobic: V <-> A (0); V <-> I (3)
• Hydrophilic to hydrophilic: K <-> T (-1); K <-> R (2)
• Hydrophobic to hydrophilic: K <-> V (-2)
The trend is the same.
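
To make the comparison concrete, the sketch below scores one short ungapped alignment with both sets of values. The dictionaries hold only the handful of PAM 250 and Blosum 62 scores quoted in these slides, not the full matrices; the peptides "CVKV" and "CIRK" and the function names are invented for this example.

```python
# Excerpts only: the scores quoted in the lecture, keyed by residue pair.
PAM250 = {("C", "C"): 12, ("V", "V"): 4, ("V", "A"): 0, ("V", "I"): 4,
          ("K", "T"): 0, ("K", "R"): 3, ("K", "V"): -2, ("H", "N"): 2}
BLOSUM62 = {("C", "C"): 9, ("V", "V"): 4, ("V", "A"): 0, ("V", "I"): 3,
            ("K", "T"): -1, ("K", "R"): 2, ("K", "V"): -2}


def pair_score(matrix, x, y):
    """Look up a symmetric substitution score, whichever way round the pair is stored."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]  # raises KeyError for pairs missing from the excerpt


def alignment_score(matrix, a, b):
    """Sum the substitution scores over the columns of an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(matrix, x, y) for x, y in zip(a, b))


print(alignment_score(PAM250, "CVKV", "CIRK"))    # 12 + 4 + 3 - 2 = 17
print(alignment_score(BLOSUM62, "CVKV", "CIRK"))  # 9 + 3 + 2 - 2 = 12
```

The absolute totals differ between the two matrices, but the ranking of the individual columns is the same.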
PAM and Blosum matrices
• In the PAM matrix, as the number increases so does the evolutionary distance, while it is the reverse in the Blosum matrix.
• According to Baxevanis (2003), the following represents the equivalence and most appropriate use of both matrices:
– PAM250 and the Blosum 45
– PAM160 and the Blosum 62
PAM and Blosum matrix

| Matrix | Best in determining |
| --- | --- |
| PAM 40 / Blosum 90 | Short similar (conserved) alignments |
| PAM 250 | Longer, more divergent alignments |
| PAM 160 / Blosum 80 | Detecting members of protein families |
| Blosum 62 | Finding all potential similarities |

Adapted from Baxevanis 2005
An excellent review of scoring matrices can be found in Henikoff and Henikoff 2000.
Dot plot matrix: imperfect match
• Some alignments require gaps to increase the matching score; the gaps are used to represent insertion/deletion mutations.
• The diagram shows that most of the two sequences are aligned. Breaks in the diagonal indicate areas of non-alignment or mismatches: gaps or substitutions.
Adapted from: dotplot example
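
A text-only dot plot is easy to sketch, and shows how a deletion appears as a break in the diagonal. The function name `dot_plot` and the two short sequences are made up for this illustration; real dot plot tools usually also apply a window and threshold to reduce noise.

```python
def dot_plot(a: str, b: str) -> list[str]:
    """One text row per residue of a, with '*' where it equals the residue of b in that column."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


# "agcc" is "agtcc" with the t deleted.
for row in dot_plot("agtcc", "agcc"):
    print(row)
```

The diagonal runs through the first two rows, the row for the deleted t has no match, and the diagonal resumes one row lower, which is the shift that a gap in the alignment accounts for.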
Measurement of alignment gaps
• Gaps represent insertions and deletions.
• Baxevanis (2005) suggests that no more than “one gap in 20 pairs is a good rule of thumb”.
• Gaps in alignments are penalised: they are given a negative scoring value.
• The penalty associated with using gaps depends on:
– opening the gap (introducing an insertion or deletion);
– extending the gap (as opposed to opening a new gap);
– the length of the gap (the number of deletions/insertions).
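
One common way to combine these three factors is an affine gap penalty: a larger charge for opening a gap plus a smaller charge for each position that extends it. The sketch below assumes that scheme; the function name `gap_penalty` and the default values 10 and 0.5 are illustrative choices, not values from the lecture.

```python
def gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Affine penalty: the opening charge once, plus an extension charge per further position."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return -(gap_open + gap_extend * (length - 1))


print(gap_penalty(1))      # -10.0: a single-position gap
print(gap_penalty(5))      # -12.0: one gap of five positions
print(5 * gap_penalty(1))  # -50.0: five separate single-position gaps
```

With these numbers one long gap is penalised far less than the same number of gap positions scattered along the alignment.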
Gap penalties
• There is no overall agreement on what values should be assigned to gap penalties (Zvelebil and Baum 2008).
• The purpose of inserting a gap is to increase the strength of the alignment.
• So choosing a high penalty will eliminate alignments with gaps, while if the penalty is too low then alignments with more and larger gaps will be chosen.
• The value should also depend on how closely “related” the aligned sequences must be:
– sequences requiring a very strict match would use a high gap penalty;
– alignment between distantly related species would use a low gap penalty.
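
The effect of the penalty size can be seen by scoring two candidate alignments of the same pair of sequences. This is a sketch under stated assumptions: the match/mismatch values (+1/-1), the penalty values, the sequences and the function name `score_alignment` are all invented for illustration.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Score two aligned strings ('-' marks a gap) with an open/extend gap penalty."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap = None  # which sequence held a gap in the previous column, if any
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        gap = "a" if x == "-" else "b" if y == "-" else None
        if gap is None:
            score += match if x == y else mismatch
        elif gap == prev_gap:
            score -= gap_extend  # continuing the gap opened earlier
        else:
            score -= gap_open    # a new gap starts here
        prev_gap = gap
    return score


ungapped = ("acgtacgt", "acgacgtt")
gapped = ("acgtacgt-", "acg-acgtt")

for gap_open in (1, 5):
    print(gap_open,
          score_alignment(*ungapped, gap_open=gap_open),
          score_alignment(*gapped, gap_open=gap_open))
```

With the low penalty the gapped alignment scores higher (5 against 0); with the high penalty it scores lower (-3 against 0), so the ungapped alignment would be chosen.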
Potential exam questions
• What is the purpose of measuring the strength of an alignment? (3 marks)
• Explain two differences between analysing an ordinary string (sequence) and a DNA string. (4 marks)
• Describe how you would measure the similarity between two DNA sequences. (10 marks)
• Discuss the use of gap penalties in a sequence alignment score. (13 marks)
References
• Baxevanis, A.D. (2005) Bioinformatics: a practical guide to the analysis of genes and proteins, chapter 11; Wiley
• Lesk, A. (2008) Introduction to bioinformatics, 3rd edition; Oxford University Press
• Zvelebil and Baum (2008) Understanding Bioinformatics