Pairwise sequence alignment · 2014-05-26

September 6, 2006. Pairwise sequence alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins 260.602.01


Page 1: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

September 6, 2006

Pairwise sequence alignment

Jonathan Pevsner, Ph.D.
Introduction to Bioinformatics
Johns Hopkins
260.602.01

Page 2: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Copyright notice

Many of the images in this powerpoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

These images and materials may not be used without permission from the publisher. We welcome instructors to use these powerpoints for educational purposes, but please acknowledge the source.

The book has a homepage at http://www.bioinfbook.org including hyperlinks to the book chapters.

Page 3: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Announcements

Lab is Friday 10:30 in the computer center, W3025 SPH. Auditors are welcome. If you have questions about the Needleman and Wunsch (1970) paper, we will discuss them then.

The moodle quiz for Chapter 2 (from Friday, September 1st class) is due by Friday at noon.

Page 4: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Outline: pairwise alignment

• Overview and examples
• Definitions: homologs, paralogs, orthologs
• Assigning scores to aligned amino acids: Dayhoff’s PAM matrices
• Alignment algorithms: Needleman-Wunsch, Smith-Waterman
• Statistical significance of pairwise alignments

Page 5: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise alignments in the 1950s

β-corticotropin (sheep)   ala gly glu asp asp glu
Corticotropin A (pig)     asp gly ala glu asp glu

Oxytocin      CYIQNCPLG
Vasopressin   CYFQNCPRG
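As a rough illustration of what such an early comparison measures, the oxytocin/vasopressin pair above can be scored for percent identity. This is a minimal sketch: the function name `percent_identity` and the choice to ignore gapped columns in the denominator are assumptions made for the example, not a convention taken from the lecture.

```python
def percent_identity(seq_a: str, seq_b: str, gap_chars: str = "-.") -> float:
    """Percent of ungapped aligned columns that hold identical residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue  # assumption: gapped columns are left out of the count
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


oxytocin = "CYIQNCPLG"
vasopressin = "CYFQNCPRG"
print(f"{percent_identity(oxytocin, vasopressin):.1f}% identity")
```

The two nonapeptides differ at two of nine positions (I/F and L/R), so seven of nine columns are identical.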

Page 6: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Early example of sequence alignment: globins (1961)

H.C. Watson and J.C. Kendrew, “Comparison Between the Amino-Acid Sequences of Sperm Whale Myoglobin and of Human Hæmoglobin.” Nature 190:670-672, 1961.

[Figure: alignment of myoglobin with the α- and β-globins]

Page 7: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise sequence alignment is the most fundamental operation of bioinformatics

• It is used to decide if two proteins (or genes) are related structurally or functionally

• It is used to identify domains or motifs that are shared between proteins

• It is the basis of BLAST searching (next week)

• It is used in the analysis of genomes

Page 8: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01
Page 9: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise alignment: protein sequences can be more informative than DNA

• protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties

• codons are degenerate: changes in the third position often do not alter the amino acid that is specified

• protein sequences offer a longer “look-back” time

• DNA sequences can be translated into protein, and then used in pairwise alignments

Page 10: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 54

Page 11: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise alignment: protein sequences can be more informative than DNA

• DNA can be translated into six potential proteins

Forward frames:   5’ CAT CAA    5’ ATC AAC    5’ TCA ACT

5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

Reverse frames:   5’ GTG GGT    5’ TGG GTA    5’ GGG TAG
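The six reading frames shown on this slide can be sketched in a few lines. This is an illustrative sketch only: it assumes the standard genetic code, and the helper names (`reverse_complement`, `translate`, `six_frames`) and the `+1` to `-3` frame labels are hypothetical choices for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Translate three frames on each strand."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


dna = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for name, protein in six_frames(dna).items():
    print(name, protein)
```

The first two codons of each frame match the slide: CAT CAA, ATC AAC and TCA ACT on the top strand, and GTG GGT, TGG GTA and GGG TAG on the reverse strand.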

Page 12: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise alignment: protein sequences can be more informative than DNA

• Many times, DNA alignments are appropriate
--to confirm the identity of a cDNA
--to study noncoding regions of DNA
--to study DNA polymorphisms
--example: Neanderthal vs modern human DNA

Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
           |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247

Page 13: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

retinol-binding protein 4 (NP_006735)

β-lactoglobulin (P02754)

Page 42

Page 14: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Outline: pairwise alignment

• Overview and examples
• Definitions: homologs, paralogs, orthologs
• Assigning scores to aligned amino acids: Dayhoff’s PAM matrices
• Alignment algorithms: Needleman-Wunsch, Smith-Waterman
• Statistical significance of pairwise alignments

Page 15: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Definitions

Pairwise alignment: The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

Page 16: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Definitions

Homology: Similarity attributed to descent from a common ancestor.

Page 42

Page 17: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Definitions

Homology: Similarity attributed to descent from a common ancestor.

Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.

Page 44

RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVA 59
               + K++ + ++ GTW++MA + L + A
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKA 55

Page 18: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Definitions: two types of homology

Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.

Paralogs: Homologous sequences within a single species that arose by gene duplication.

Page 43

Page 19: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Orthologs: members of a gene (protein) family in various organisms. This tree shows RBP orthologs.

[Figure: phylogenetic tree of RBP orthologs. Tip labels: common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, mouse, rat, rabbit, cow, pig, horse, human. Scale bar: 10 changes]

Page 43

Page 20: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Paralogs: members of a gene (protein) family within a species

[Figure: phylogenetic tree of lipocalin paralogs. Tip labels: apolipoprotein D, retinol-binding protein 4, complement component 8, prostaglandin D2 synthase, neutrophil gelatinase-associated lipocalin, lipocalin 1, odorant-binding protein 2A, progestagen-associated endometrial protein, alpha-1 microglobulin/bikunin. Scale bar: 10 changes]

Page 44

Page 21: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01
Page 22: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise alignment of retinol-binding protein 4 and β-lactoglobulin

  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50   RBP
    . ||| | . |. . . | : .||||.:| :
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44   lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97   RBP
    : | | | | :: | .| . || |: || |.
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93   lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136  RBP
    || ||. | :.|||| | . .|
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135  lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185  RBP
    . | | | : || . | || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178  lactoglobulin

Page 46

Page 23: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Definitions

Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.

Identity: The extent to which two sequences are invariant.

Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

Page 24: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise alignment of retinol-binding protein and β-lactoglobulin (alignment as on Page 22)

Identity (bar)

Page 46

Page 25: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise alignment of retinol-binding protein and β-lactoglobulin (alignment as on Page 22)

Somewhat similar (one dot)

Very similar (two dots)

Page 46

Page 26: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Definitions

Pairwise alignment: The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

Page 47

Page 27: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise alignment of retinol-binding protein and β-lactoglobulin (alignment as on Page 22)

Internal gap

Terminal gap

Page 46

Page 28: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Gaps

• Positions at which a letter is paired with a null are called gaps.

• Gap scores are typically negative.

• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.

• In BLAST, it is rarely necessary to change gap values from the default.
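One common way to give the presence of a gap more weight than its length is an affine gap penalty: a large cost for opening a gap plus a small cost per gapped position. The sketch below is illustrative; the function name is hypothetical, and the default values of 11 (existence) and 1 (extension) are assumed here as typical protein-search settings rather than taken from the slide.

```python
def affine_gap_score(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Score (negative) for one gap spanning `length` positions."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0  # no gap, no penalty
    # Assumed convention: one existence cost plus one extension cost per position.
    return -(gap_open + gap_extend * length)


# One 8-residue gap versus eight separate 1-residue gaps.
print(affine_gap_score(8))
print(8 * affine_gap_score(1))
```

Under these assumed values a single gap of eight positions is penalized far less than eight isolated single-position gaps, which mirrors the reasoning that one mutational event can insert or delete several residues at once.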

Page 29: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise alignment of retinol-binding protein and β-lactoglobulin (alignment as on Page 22)

Page 30: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise alignment of retinol-binding protein from human (top) and rainbow trout (O. mykiss)

  1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
     ::   ||  ||  ||   .||.||. .| :|||:.|:.| |||.|||||
  1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP 47

 49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
     |||| ||:||:|||||.|.|.||| ||| :||||:.||.| ||| || |
 48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD 97

 99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
    ||||||:||| ||:|| ||||||::||||| ||: |||| ..||||| |
 98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147

149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199
    |||:||| | || || ||||  :..|:|   .|| : | |:|:
148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS...... 192

Page 31: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise sequence alignment allows us to look back billions of years ago (BYA)

[Figure: timeline with axis ticks at 4, 3, 2, 1 and 0 BYA. Labels: origin of life; earliest fossils; eukaryote/archaea; origin of eukaryotes; plant/animal; fungi/animal; insects]

Page 48

Page 32: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases

fly        GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human      GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant      GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium  GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast      GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

Page 49

Page 33: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Multiple sequence alignment of human lipocalin paralogs

~~~~~EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM  lipocalin 1
LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF  odorant-binding protein 2a
TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR  progestagen-assoc. endo.
VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV  apolipoprotein D
VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF  retinol-binding protein
LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF  neutrophil gelatinase-ass.
VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL  prostaglandin D2 synthase
VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW  alpha-1-microglobulin
PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD...  complement component 8

Page 49

Page 34: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Outline: pairwise alignment

• Overview and examples
• Definitions: homologs, paralogs, orthologs
• Assigning scores to aligned amino acids: Dayhoff’s PAM matrices
• Alignment algorithms: Needleman-Wunsch, Smith-Waterman
• Statistical significance of pairwise alignments

Page 35: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

General approach to pairwise alignment

• Choose two sequences
• Select an algorithm that generates a score
• Allow gaps (insertions, deletions)
• Score reflects degree of similarity
• Alignments can be global or local
• Estimate probability that the alignment occurred by chance

Page 36: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Calculation of an alignment score

Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Alignment_Scores2.html

Page 37: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 80

Emile Zuckerkandl and Linus Pauling (1965) considered substitution frequencies in 18 globins (myoglobins and hemoglobins from human to lamprey).

Black: identity
Gray: very conservative substitutions (>40% occurrence)
White: fairly conservative substitutions (>21% occurrence)
Red: no substitutions observed

lys found at 58% of arg sites

Page 38: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 80

Page 39: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Dayhoff’s 34 protein superfamilies

Protein                        PAMs per 100 million years

Ig kappa chain                 37
Kappa casein                   33
luteinizing hormone b          30
lactalbumin                    27
complement component 3         27
epidermal growth factor        26
proopiomelanocortin            21
pancreatic ribonuclease        21
haptoglobin alpha              20
serum albumin                  19
phospholipase A2, group IB     19
prolactin                      17
carbonic anhydrase C           16
Hemoglobin α                   12
Hemoglobin β                   12

Page 50

Page 40: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Dayhoff’s 34 protein superfamilies (table as on Page 39)

human (NP_005203) versus mouse (NP_031812)

Page 41: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Dayhoff’s 34 protein superfamilies

Protein                        PAMs per 100 million years

apolipoprotein A-II            10
lysozyme                       9.8
gastrin                        9.8
myoglobin                      8.9
nerve growth factor            8.5
myelin basic protein           7.4
thyroid stimulating hormone b  7.4
parathyroid hormone            7.3
parvalbumin                    7.0
trypsin                        5.9
insulin                        4.4
calcitonin                     4.3
arginine vasopressin           3.6
adenylate kinase 1             3.2

Page 50

Page 42: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Dayhoff’s 34 protein superfamilies

Protein                              PAMs per 100 million years

triosephosphate isomerase 1          2.8
vasoactive intestinal peptide        2.6
glyceraldehyde phosph. dehydrogenase 2.2
cytochrome c                         2.2
collagen                             1.7
troponin C, skeletal muscle          1.5
alpha crystallin B chain             1.5
glucagon                             1.2
glutamate dehydrogenase              0.9
histone H2B, member Q                0.9
ubiquitin                            0

Page 50

Page 43: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Pairwise alignment of human (NP_005203) versus mouse (NP_031812) ubiquitin

Page 44: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Accepted point mutations (PAMs): inferring amino acid substitutions between a protein and its ancestor

Dayhoff et al. compared protein sequences with inferred ancestors, rather than with each other directly. Consider four globins (myoglobin, alpha, beta, delta globin). In a phylogenetic tree there are four existing sequences plus two inferred ancestral sequences (5, 6). (We will learn how to make trees on October 16.)

[Figure: phylogenetic tree with existing sequences 1 to 4 at the tips and inferred ancestral nodes 5 and 6]

Page 45: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Accepted point mutations (PAMs): inferring amino acid substitutions between a protein and its ancestor

The tree is made from a multiple sequence alignment of the four globins. Consider a comparison of alpha globin to myoglobin, and to their common ancestor (node 6).
• A direct comparison suggests alanine changed to glycine. But an ancestral glutamate changed to ala or gly!
• Three additional examples are boxed.

beta       MVHLTPEEKSAVTALWGKV
delta      MVHLTPEEKTAVNALWGKV
alpha      MV.LSPADKTNVKAAWGKV
myoglobin  .MGLSDGEWQLVLNVWGKV
5          MVHLSPEEKTAVNALWGKV
6          MVHLTPEEKTAVNALWGKV

[Figure: phylogenetic tree with existing sequences 1 to 4 at the tips and inferred ancestral nodes 5 and 6]
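The point of comparing against an inferred ancestor can be checked on the sequences from this slide. The following sketch lists columns where alpha globin and myoglobin differ from each other and the node 6 residue matches neither of them; the function name `hidden_changes` and the rule of skipping gapped columns are assumptions made for the illustration.

```python
def hidden_changes(desc_a: str, desc_b: str, ancestor: str, gap: str = "."):
    """Columns where two descendants differ and the ancestor matches neither."""
    if not (len(desc_a) == len(desc_b) == len(ancestor)):
        raise ValueError("all three aligned sequences must have the same length")
    hits = []
    for col, (a, b, anc) in enumerate(zip(desc_a, desc_b, ancestor), start=1):
        if gap in (a, b, anc):
            continue  # assumption: gapped columns are not scored
        if a != b and anc != a and anc != b:
            hits.append((col, a, b, anc))
    return hits


alpha     = "MV.LSPADKTNVKAAWGKV"
myoglobin = ".MGLSDGEWQLVLNVWGKV"
node6     = "MVHLTPEEKTAVNALWGKV"

for col, a, b, anc in hidden_changes(alpha, myoglobin, node6):
    print(f"column {col}: alpha={a} myoglobin={b} ancestor={anc}")
```

Column 7 is the alanine/glycine/glutamate case described on the slide. The other three columns returned (11, 13 and 15) are further positions in these sequences where a direct pairwise comparison would hide the ancestral residue.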

Page 46: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Dayhoff’s numbers of “accepted point mutations”: what amino acid substitutions occur in proteins?

       A     R     N     D     C     Q     E     G
       Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly
A
R       30
N      109    17
D      154     0   532
C       33    10     0     0
Q       93   120    50    76     0
E      266     0    94   831     0   422
G      579    10   156   162    10    30   112
H       21   103   226    43    10   243    23    10

Page 52; Dayhoff (1978) p. 346.

Page 47: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases (alignment as on Page 32)

Page 48: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

The relative mutability of amino acids

Page 53

Dayhoff et al. described the “relative mutability” of each amino acid as the probability that the amino acid will change over a small evolutionary time period. The total number of changes is counted (on all branches of all protein trees considered), along with the total number of occurrences of each amino acid, and the ratio is taken.

Relative mutability ∝ [changes] / [occurrences]

Example:

sequence 1   ala his val ala
sequence 2   ala arg ser val

For ala, relative mutability = [1] / [3] = 0.33
For val, relative mutability = [2] / [2] = 1.0
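The arithmetic of the toy example above can be reproduced directly. This sketch counts changes and occurrences over a single aligned pair only; Dayhoff et al. counted over all branches of many protein trees and then rescaled the values, so the function name and the two-sequence simplification are assumptions for illustration.

```python
from collections import Counter


def relative_mutability(seq1, seq2):
    """Ratio of changes to occurrences for each residue in one aligned pair."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    occurrences = Counter(seq1) + Counter(seq2)
    changes = Counter()
    for a, b in zip(seq1, seq2):
        if a != b:
            # A mismatched column counts as one change for each residue involved.
            changes[a] += 1
            changes[b] += 1
    return {residue: changes[residue] / occurrences[residue] for residue in occurrences}


sequence_1 = ["ala", "his", "val", "ala"]
sequence_2 = ["ala", "arg", "ser", "val"]
for residue, value in relative_mutability(sequence_1, sequence_2).items():
    print(f"{residue}: {value:.2f}")
```

Alanine occurs three times and is involved in one change (1/3), while valine occurs twice and is involved in two changes (2/2), matching the values on the slide.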

Page 49: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

The relative mutability of amino acids

Asn 134     His 66
Ser 120     Arg 65
Asp 106     Lys 56
Glu 102     Pro 56
Ala 100     Gly 49
Thr  97     Tyr 41
Ile  96     Phe 41
Met  94     Leu 40
Gln  93     Cys 20
Val  74     Trp 18

Page 53

Page 50: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

The relative mutability of amino acids (table as on Page 49)

Page 53

Note that alanine is normalized to a value of 100.
Trp and cys are least mutable.
Asn and ser are most mutable.

Page 51: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Normalized frequencies of amino acids

Gly 8.9%     Arg 4.1%
Ala 8.7%     Asn 4.0%
Leu 8.5%     Phe 4.0%
Lys 8.1%     Gln 3.8%
Ser 7.0%     Ile 3.7%
Val 6.5%     His 3.4%
Thr 5.8%     Cys 3.3%
Pro 5.1%     Tyr 3.0%
Glu 5.0%     Met 1.5%
Asp 4.7%     Trp 1.0%

• blue = 6 codons; red = 1 codon
• These frequencies f_i sum to 1

Page 53

Page 52: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 54

Page 53: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Dayhoff’s numbers of “accepted point mutations”: what amino acid substitutions occur in proteins? (table as on Page 46)

Page 52

Page 54: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Dayhoff’s mutation probability matrix for the evolutionary distance of 1 PAM

We have considered three kinds of information:
• a table of number of accepted point mutations (PAMs)
• relative mutabilities of the amino acids
• normalized frequencies of the amino acids in PAM data

This information can be combined into a “mutation probability matrix” in which each element M_ij gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval (e.g. 1 PAM).

Page 50
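For reference, one way to write the combination described above follows the formulation usually attributed to Dayhoff et al. (1978). The symbols are assumptions introduced for this note and do not appear on the slide: A_ij is the accepted point mutation count between amino acids i and j, m_j is the relative mutability of amino acid j, f_j is its normalized frequency, and λ is a proportionality constant.

```latex
% Off-diagonal elements: probability that amino acid j is replaced by amino acid i
M_{ij} = \frac{\lambda \, m_j \, A_{ij}}{\sum_{k \neq j} A_{kj}}, \qquad i \neq j

% Diagonal elements: probability that amino acid j is unchanged
M_{jj} = 1 - \lambda \, m_j

% Each column sums to 1
\sum_{i} M_{ij} = 1

% lambda is chosen so that 1 residue in 100 changes on average (1 PAM)
\sum_{j} f_j \, (1 - M_{jj}) = 0.01
```

Under this reading, a highly mutable amino acid has a smaller diagonal element, and its off-diagonal probability is shared among replacements in proportion to how often each substitution was observed.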

Page 55: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Dayhoff’s PAM1 mutation probability matrix

Original amino acid (columns)

        A     R     N     D     C     Q     E     G     H
        Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly   His
A      9867     2     9    10     3     8    17    21     2
R         1  9913     1     0     1    10     0     0    10
N         4     1  9822    36     0     4     6     6    21
D         6     0    42  9859     0     6    53     6     4
C         1     1     0     0  9973     0     0     0     1
Q         3     9     4     5     0  9876    27     1    23
E        10     0     7    56     0    35  9865     4     2
G        21     1    12    11     1     3     7  9935     1
H         1     8    18     3     1    20     1     0  9912
I         2     2     3     1     2     1     2     0     0

Page 55

Page 56: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Dayhoff’s PAM1 mutation probability matrix (table as on Page 55)

Each element of the matrix shows the probability that an original amino acid (top) will be replaced by another amino acid (side)

Page 57: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Substitution Matrix

A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids.

Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids.

Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.

The two major types of substitution matrices are PAM and BLOSUM.

Page 58: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

PAM matrices are based on global alignments of closely related proteins.

The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, one change has occurred over a length of 100 amino acids.

Other PAM matrices are extrapolated from PAM1. For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids.

All the PAM data come from closely related proteins (>85% amino acid identity).

PAM matrices: Point-accepted mutations
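The extrapolation from PAM1 can be sketched as repeated matrix multiplication: PAMn is PAM1 multiplied by itself n times. The example below is only an illustration under stated assumptions. It uses a made-up three-letter alphabet with hypothetical probabilities (not Dayhoff's data), and the names `TOY_PAM1`, `mat_mul`, and `pam_n` are invented for this sketch.

```python
# Toy illustration: PAMn = (PAM1)^n.
# The 3x3 matrix is hypothetical, NOT Dayhoff's data.
# Column j = original residue, row i = replacement, so each column sums to 1.
TOY_PAM1 = [
    [0.990, 0.004, 0.002],
    [0.006, 0.990, 0.008],
    [0.004, 0.006, 0.990],
]

def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_n(pam1, n):
    """Return pam1 raised to the n-th power (n = 0 gives the identity)."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result

pam0 = pam_n(TOY_PAM1, 0)        # identity: nothing has changed
pam250 = pam_n(TOY_PAM1, 250)    # diagonal shrinks, off-diagonals grow
pam2000 = pam_n(TOY_PAM1, 2000)  # columns become nearly identical

for row in pam250:
    print([round(v, 3) for v in row])
```

With enough multiplications every column approaches the same set of background frequencies, which is the behaviour of a matrix for very distantly related proteins.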

Page 59: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Dayhoff’s PAM1 mutation probability matrix

Page 55

Page 60: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Dayhoff’s PAM0 mutation probability matrix: the rules for extremely slowly evolving proteins

PAM0     A     R     N     D     C     Q     E
       Ala   Arg   Asn   Asp   Cys   Gln   Glu
A     100%    0%    0%    0%    0%    0%    0%
R       0%  100%    0%    0%    0%    0%    0%
N       0%    0%  100%    0%    0%    0%    0%
D       0%    0%    0%  100%    0%    0%    0%
C       0%    0%    0%    0%  100%    0%    0%
Q       0%    0%    0%    0%    0%  100%    0%
E       0%    0%    0%    0%    0%    0%  100%
G       0%    0%    0%    0%    0%    0%    0%

Top: original amino acid
Side: replacement amino acid

Page 56

Page 61: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Dayhoff’s PAM2000 mutation probability matrix: the rules for very distantly related proteins

PAM∞     A     R     N     D     C     Q     E     G
       Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly
A     8.7%  8.7%  8.7%  8.7%  8.7%  8.7%  8.7%  8.7%
R     4.1%  4.1%  4.1%  4.1%  4.1%  4.1%  4.1%  4.1%
N     4.0%  4.0%  4.0%  4.0%  4.0%  4.0%  4.0%  4.0%
D     4.7%  4.7%  4.7%  4.7%  4.7%  4.7%  4.7%  4.7%
C     3.3%  3.3%  3.3%  3.3%  3.3%  3.3%  3.3%  3.3%
Q     3.8%  3.8%  3.8%  3.8%  3.8%  3.8%  3.8%  3.8%
E     5.0%  5.0%  5.0%  5.0%  5.0%  5.0%  5.0%  5.0%
G     8.9%  8.9%  8.9%  8.9%  8.9%  8.9%  8.9%  8.9%

Top: original amino acid
Side: replacement amino acid

Page 56

Page 62: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

PAM250 mutation probability matrix

Top: original amino acid
Side: replacement amino acid

Page 57

Page 63: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

PAM250 log odds scoring matrix

Page 58

Page 64: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Why do we go from a mutation probability matrix to a log odds matrix?

• We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues.

• Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them).

Page 57

Page 65: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

How do we go from a mutation probability matrix to a log odds matrix?

• The cells in a log odds matrix are based on an “odds ratio”:

(the probability that an alignment is authentic) / (the probability that the alignment was random)

The score S for an alignment of residues a,b is given by:

S(a,b) = 10 log10 (Mab/pb)

As an example, for tryptophan aligned with tryptophan,

S(tryptophan,tryptophan) = 10 log10 (0.55/0.010) = 17.4

Page 57
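The tryptophan calculation can be checked with a few lines of Python. This is a minimal sketch: the function name `log_odds_score` is made up for illustration, and the two inputs are the values quoted on the slide.

```python
import math

def log_odds_score(m_ab, p_b):
    """S(a,b) = 10 * log10(Mab / pb), the formula given on the slide."""
    return 10 * math.log10(m_ab / p_b)

# Values quoted on the slide for tryptophan: Mab = 0.55, pb = 0.010
trp_score = log_odds_score(0.55, 0.010)
print(round(trp_score, 1))  # 17.4
```

In a scoring matrix the result is rounded to the nearest integer, which gives +17 for this pair.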

Page 66: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

What do the numbers mean in a log odds matrix?

S(tryptophan,tryptophan) = 10 log10 (0.55/0.010) = 17.4

A score of +17 for tryptophan means that this alignment is 50 times more likely than a chance alignment of two Trp residues.

S(a,b) = 17
Probability of replacement (Mab/pb) = x
Then
17 = 10 log10 x
1.7 = log10 x
10^1.7 = x = 50

Page 58

Page 67: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

What do the numbers mean in a log odds matrix?

A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance.

A score of 0 is neutral.

A score of –10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids.

Page 58
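Going the other way, a log odds score can be converted back into an odds ratio by inverting S = 10 log10(x), which gives x = 10^(S/10). The helper name `odds_from_score` below is hypothetical; the scores are the ones discussed on these slides.

```python
def odds_from_score(score):
    """Invert S = 10 * log10(x): return the odds ratio x = 10 ** (S / 10)."""
    return 10 ** (score / 10)

for s in (17, 2, 0, -10):
    # +17 -> about 50, +2 -> about 1.6, 0 -> 1 (neutral), -10 -> 0.1
    print(s, round(odds_from_score(s), 2))
```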

Page 68: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

PAM250 log odds scoring matrix

Page 58

Page 69: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

PAM10 log odds scoring matrix

Page 59

Page 70: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Rat versus mouse RBP

Rat versus bacterial lipocalin

More conserved Less conserved

Page 71: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Comparing two proteins with a PAM1 matrix gives completely different results than PAM250!

Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches, and penalizes them severely. Using this matrix you can find almost no match.

A PAM250 matrix is very tolerant of mismatches.

hsrbp, 136  CRLLNLDGTC
btlact,  3  CLLLALALTC
            * ** *  **

24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%

rbp4   26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
                *     ****  *        * *   *   *               **  *

rbp4   86 --CADMVGTFTDTEDPAKFKM
btlact 80 GECAQKKIIAEKTKIPAVFKI
            **        *  ** **

Page 60

Page 72: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

BLOSUM matrices are based on local alignments.

BLOSUM stands for blocks substitution matrix.

BLOSUM62 is a matrix calculated from comparisons of sequences sharing no more than 62% identity (sequences that are at least 62% identical are clustered and counted as one).

BLOSUM Matrices

Page 60

Page 73: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

BLOSUM Matrices

[Figure: BLOSUM62 on a scale of percent amino acid identity, with ticks at 100, 62, and 30]

Page 74: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

BLOSUM Matrices

[Figure: BLOSUM62, BLOSUM30, and BLOSUM80, each on a scale of percent amino acid identity with ticks at 100, 62, and 30]

Page 75: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

The BLOCKS database contains thousands of groups of multiple sequence alignments.

BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

BLOSUM Matrices

Page 60

Page 76: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Blosum62 scoring matrix

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -1   1   1  -2  -1  -3  -2   5
M  -1  -2  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

Page 61

Page 77: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Blosum62 scoring matrix

Page 61

Page 78: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Rat versus mouse RBP

Rat versus bacterial lipocalin

Page 61

Page 80: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

[Figure: percent identity plotted against evolutionary distance in PAMs; the “twilight zone” is marked]

Two randomly diverging protein sequences change in a negatively exponential fashion

Page 62

Page 81: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

[Figure: percent identity plotted against differences per 100 residues; the “twilight zone” is marked]

At PAM1, two proteins are 99% identical
At PAM10.7, there are 10 differences per 100 residues
At PAM80, there are 50 differences per 100 residues
At PAM250, there are 80 differences per 100 residues

Page 62

Page 82: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

PAM matrices reflect different degrees of divergence

PAM250

Page 83: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

PAM: “Accepted point mutation”

• Two proteins with 50% identity may have 80 changes per 100 residues. (Why? Because any residue can be subject to back mutations.)

• Proteins with 20% to 25% identity are in the “twilight zone” and may be statistically significantly related.

• PAM or “accepted point mutation” refers to the “hits” or matches between two sequences (Dayhoff & Eck, 1968)

Page 62

Page 84: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Ancestral sequence: ACCCTAC
Sequence 1:         ACCGATC
Sequence 2:         AATAATC

Site  Sequence 1 lineage   Type of substitution         Sequence 2 lineage
1     A                    no change                    A
2     C                    single substitution          C --> A
3     C                    multiple substitutions       C --> A --> T
4     C --> G              coincidental substitutions   C --> A
5     T --> A              parallel substitutions       T --> A
6     A --> C --> T        convergent substitutions     A --> T
7     C                    back substitution            C --> T --> C

Li (1997) p.70
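The gap between observed differences and actual substitutions in this figure can be counted directly. The sketch below transcribes the per-site histories from the figure into two lists (the list names and the string encoding of each history are assumptions made for this example) and compares the number of substitutions that occurred with the number of differences visible between the two present-day sequences.

```python
ancestral = "ACCCTAC"
seq1 = "ACCGATC"
seq2 = "AATAATC"

# Per-site histories transcribed from the figure: each string lists the
# successive states at that site along one lineage (e.g. "CAT" = C -> A -> T).
lineage1 = ["A", "C", "C", "CG", "TA", "ACT", "C"]
lineage2 = ["A", "CA", "CAT", "CA", "TA", "AT", "CTC"]

# Every step within a history is one substitution.
actual = sum(len(h) - 1 for h in lineage1 + lineage2)

# Only sites where the two present-day sequences differ are observable.
observed = sum(x != y for x, y in zip(seq1, seq2))

print("actual substitutions:", actual)
print("observed differences:", observed)
```

Parallel, convergent, and back substitutions leave the two sequences identical at a site even though changes occurred, which is why the number of changes per 100 residues can exceed the percent difference.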

Page 85: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Percent identity between two proteins: What percent is significant?

100%
80%
65%
30%
23%
19%

We will see in the BLAST lecture that it is appropriate to describe significance in terms of probability (or expect) values. As a rule of thumb, two proteins sharing > 30% identity over a substantial region are usually homologous.

Page 86: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Outline: pairwise alignment

• Overview and examples
• Definitions: homologs, paralogs, orthologs
• Assigning scores to aligned amino acids: Dayhoff’s PAM matrices
• Alignment algorithms: Needleman-Wunsch, Smith-Waterman
• Statistical significance of pairwise alignments

Page 87: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

An alignment scoring system is required to evaluate how good an alignment is

• positive and negative values assigned

• gap creation and extension penalties

• positive score for identities

• some partial positive score for conservative substitutions

• global versus local alignment

• use of a substitution matrix

Page 88: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Calculation of an alignment score
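As a rough sketch of how such a score is calculated, the example below sums substitution scores over the aligned columns and subtracts gap penalties. The substitution values are a few BLOSUM62 entries copied from the matrix in this lecture; the gap penalties (-11 to open, -1 to extend), the gap model, and the function names are assumptions chosen for illustration.

```python
# A few BLOSUM62 entries copied from the matrix shown in this lecture.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("R", "R"): 5, ("N", "N"): 6, ("D", "D"): 6,
    ("D", "N"): 1, ("R", "A"): -1,
}

def pair_score(x, y):
    """Look up a substitution score; the matrix is symmetric."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]

def alignment_score(row1, row2, gap_open=-11, gap_extend=-1):
    """Sum column scores of two equal-length aligned rows ('-' marks a gap).

    Hypothetical gap model: the first position of a gap costs gap_open,
    each further position of the same gap costs gap_extend.
    """
    assert len(row1) == len(row2)
    total = 0
    gap_row = None  # which row the currently open gap is in, if any
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            current = 1 if x == "-" else 2
            total += gap_extend if gap_row == current else gap_open
            gap_row = current
        else:
            total += pair_score(x, y)
            gap_row = None
    return total

score = alignment_score("ARN--DA", "ARDRRDA")
print(score)  # 4 + 5 + 1 - 11 - 1 + 6 + 4 = 8
```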

Page 89: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

We will first consider the global alignment algorithm of Needleman and Wunsch (1970).

We will then explore the local alignment algorithm of Smith and Waterman (1981).

Finally, we will consider BLAST, a heuristic version of Smith-Waterman. We will cover BLAST in detail on Monday.

Two kinds of sequence alignment: global and local

Page 63

Page 90: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

• Two sequences can be compared in a matrix along x- and y-axes.

• If they are identical, a path along a diagonal can be drawn

• Find the optimal subpaths, and add them up to achieve the best score. This involves
--adding gaps when needed
--allowing for conservative substitutions
--choosing a scoring system (simple or complicated)

• N-W is guaranteed to find optimal alignment(s)

Global alignment with the algorithm of Needleman and Wunsch (1970)

Page 63

Page 91: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

[1] set up a matrix

[2] score the matrix

[3] identify the optimal alignment(s)

Three steps to global alignment with the Needleman-Wunsch algorithm

Page 63

Page 92: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

[1] identity (stay along a diagonal)
[2] mismatch (stay along a diagonal)
[3] gap in one sequence (move vertically!)
[4] gap in the other sequence (move horizontally!)

Four possible outcomes in aligning two sequences

Page 64

Page 93: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 64

Page 94: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Start Needleman-Wunsch with an identity matrix

Page 65

Page 95: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Start Needleman-Wunsch with an identity matrix

sequence 1 ABCNJ-RQCLCR-PM
sequence 2 AJC-JNR-CKCRBP-

sequence 1 ABC-NJRQCLCR-PM
sequence 2 AJCJN-R-CKCRBP-

Page 66

Page 96: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Fill in the matrix starting from the bottom right

Page 65

Page 97: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 65

Page 98: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 65

Page 99: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 65

Page 100: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 66

Page 101: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 66

Page 102: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Rule for assigning score in position i, j:

s_{i,j} = max of
    s_{i-1,j-1} + s(a_i, b_j)
    s_{i-x,j}   (i.e. add a gap of length x)
    s_{i,j-x}   (i.e. add a gap of length x)

Page 66-67
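The three steps (set up a matrix, score it, trace back) can be sketched in Python as follows. This is an illustrative variant rather than the exact procedure on the slides: it fills the matrix from the top left instead of the bottom right, handles gaps one position at a time, and uses hypothetical scores (match +1, mismatch -1, gap -1). The function name is also made up for this sketch.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    n, m = len(a), len(b)
    # [1] set up a matrix with one extra row and column for leading gaps
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # [2] score the matrix: each cell extends the best of three subpaths
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # identity or mismatch
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # [3] trace back from the last cell to recover one optimal alignment
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                row_a.append(a[i - 1]); row_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))

# The two sequences from the Needleman-Wunsch example in this lecture
best, top, bottom = needleman_wunsch("ABCNJRQCLCRPM", "AJCJNRCKCRBP")
print(best)
print(top)
print(bottom)
```

Because the scoring parameters here differ from those used on the slides, the alignment printed for the example sequences need not match the slide alignments exactly.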

Page 103: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

After you’ve filled in the matrix, find the optimal path(s) by a “traceback” procedure

Page 66

Page 104: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

sequence 1 ABCNJ-RQCLCR-PM
sequence 2 AJC-JNR-CKCRBP-

sequence 1 ABC-NJRQCLCR-PM
sequence 2 AJCJN-R-CKCRBP-

Page 66

Page 105: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

N-W is guaranteed to find optimal alignments, although the algorithm does not search all possible alignments.

It is an example of a dynamic programming algorithm: an optimal path (alignment) is identified by incrementally extending optimal subpaths. Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score.

Needleman-Wunsch: dynamic programming

Page 67

Page 106: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Try using needle to implement a Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps):

http://www.ebi.ac.uk/emboss/align/

Page 107: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Queries:
beta globin (NP_000509)
alpha globin (NP_000549)

Page 108: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01
Page 109: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other.

Local alignment finds optimally matching regions within two sequences (“subsequences”).

Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences.

Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough.

Global alignment versus local alignment

Page 69

Page 110: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

How the Smith-Waterman algorithm works

Set up a matrix between two proteins (size m+1, n+1)

No values in the scoring matrix can be negative! S ≥ 0

The score in each cell is the maximum of four values:
[1] s(i-1, j-1) + the new score at [i,j] (a match or mismatch)
[2] s(i,j-1) – gap penalty
[3] s(i-1,j) – gap penalty
[4] zero

Page 69
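The four-way maximum above can be sketched in Python. This is a minimal illustration, not SSEARCH or any production implementation: the scores (match +2, mismatch -1, gap penalty 2) and the function name are hypothetical, and only one best local alignment is traced back.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=2):
    """Local alignment; returns (best score, aligned piece of a, aligned piece of b)."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(h[i - 1][j - 1] + s,  # [1] match or mismatch
                          h[i][j - 1] - gap,    # [2] gap penalty
                          h[i - 1][j] - gap,    # [3] gap penalty
                          0)                    # [4] zero: start afresh
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached
    i, j = best_pos
    row_a, row_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] - gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))

# Only the shared GGG stretch aligns; the flanking residues are ignored.
print(smith_waterman("AAAGGGTTT", "CCGGGCC"))
```

The zero option is what makes the alignment local: a poorly scoring stretch resets the score instead of dragging down everything that follows.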

Page 111: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 70

Page 112: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Try using SSEARCH to perform a rigorous Smith-Waterman local alignment: http://fasta.bioch.virginia.edu/

Page 113: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Queries:
beta globin (NP_000509)
alpha globin (NP_000549)

Page 114: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01
Page 115: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Rapid, heuristic versions of Smith-Waterman: FASTA and BLAST

Smith-Waterman is very rigorous and it is guaranteed to find an optimal alignment.

But Smith-Waterman is slow. It requires computer space and time proportional to the product of the lengths of the two sequences being aligned (or the product of the lengths of a query and an entire database).

Gotoh (1982) and Myers and Miller (1988) improved the algorithms so both global and local alignment require less time and space.

FASTA and BLAST provide rapid alternatives to S-W.

Page 71

Page 116: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

How FASTA works

[1] A “lookup table” is created. It consists of short stretches of amino acids (e.g. k=3 for a protein search). The length of a stretch is called a k-tuple. The FASTA algorithm finds the ten highest scoring segments that align to the query.

[2] These ten aligned regions are re-scored with a PAM or BLOSUM matrix.

[3] High-scoring segments are joined.

[4] Needleman-Wunsch or Smith-Waterman is then performed.
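Step [1] can be illustrated with a small sketch of a k-tuple lookup table. This is a simplification and not the FASTA source code: the function names are hypothetical, and the example only counts exact k-tuple matches per diagonal, which is the raw information from which high-scoring segments would then be selected.

```python
from collections import defaultdict

def build_lookup(query, k):
    """Map every k-tuple in the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return dict(table)

def diagonal_hits(query, subject, k):
    """Count shared k-tuples per diagonal (subject start minus query start)."""
    table = build_lookup(query, k)
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in table.get(subject[j:j + k], []):
            hits[j - i] += 1
    return dict(hits)

print(build_lookup("GTWYA", 3))             # k=3 for a protein search
print(diagonal_hits("GTWYA", "AGTWYA", 3))  # all hits fall on one diagonal
```

Diagonals with many hits mark regions where the two sequences are likely to align without gaps, which is why only those regions need to be re-scored.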

Page 72

Page 117: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

• Go to http://www.ncbi.nlm.nih.gov/BLAST
• Choose BLAST 2 sequences
• In the program,
[1] choose blastp or blastn
[2] paste in your accession numbers (or use FASTA format)
[3] select optional parameters
--3 BLOSUM and 3 PAM matrices
--gap creation and extension penalties
--filtering
--word size
[4] click “align”

Pairwise alignment: BLAST 2 sequences

Page 72

Page 118: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 73

Page 119: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Page 74

Page 120: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Outline: pairwise alignment

• Overview and examples
• Definitions: homologs, paralogs, orthologs
• Assigning scores to aligned amino acids: Dayhoff’s PAM matrices
• Alignment algorithms: Needleman-Wunsch, Smith-Waterman
• Statistical significance of pairwise alignments

Page 121: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Statistical significance of pairwise alignment

We will discuss the statistical significance of alignment scores in the next lecture (BLAST). A basic question is how to determine whether a particular alignment score is likely to have occurred by chance. According to the null hypothesis, two aligned sequences are not homologous (evolutionarily related). Can we reject the null hypothesis at a particular significance level alpha?

Page 122: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

                                   homologous sequences    non-homologous sequences
Sequences reported as related      True positives          False positives
Sequences reported as unrelated    False negatives         True negatives

Sensitivity: ability to find true positives
Specificity: ability to minimize false positives

Page 76
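These two terms are commonly quantified as sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below applies those standard formulas to made-up counts; the numbers are hypothetical and are not results from any search in this lecture.

```python
def sensitivity(tp, fn):
    """Fraction of truly homologous sequences that were reported as related."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-homologous sequences that were reported as unrelated."""
    return tn / (tn + fp)

# Hypothetical counts from an imaginary database search
tp, fn, tn, fp = 90, 10, 950, 50
print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.95
```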

Page 125: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
               + K++ +  + +GTW++MA      + L   + A   V  T  +         +L+ W+
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Randomization test: scramble a sequence

• First compare two proteins and obtain a score

• Next scramble the bottom sequence 100 times, and obtain 100 “randomized” scores (+/- standard deviation)

• Composition and length are maintained

• If the comparison is “real” we expect the authentic score to be several standard deviations above the mean of the “randomized” scores

Page 76
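The scramble procedure can be sketched in Python as below. This is only an outline of the idea under stated assumptions: a real test would re-align each shuffled sequence with a program such as gap or bestfit, whereas this sketch uses a hypothetical stand-in score (`identity_score`, the number of identical residues at matching positions). The function names, the seed, and the gap-stripped input strings are all choices made for this example.

```python
import random
import statistics

def identity_score(a, b):
    """Hypothetical stand-in score: identical residues at the same position."""
    return sum(x == y for x, y in zip(a, b))

def randomization_test(seq_a, seq_b, n_shuffles=100, seed=0, score=identity_score):
    """Shuffle seq_b n_shuffles times; return (real, mean, sd, z)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    real = score(seq_a, seq_b)
    residues = list(seq_b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # composition and length are maintained
        shuffled_scores.append(score(seq_a, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    z = (real - mean) / sd if sd > 0 else float("inf")
    return real, mean, sd, z

# The RBP and glycodelin segments from this slide, with gap characters removed
rbp = "RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD"
glycodelin = "QTKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN"
real, mean, sd, z = randomization_test(rbp, glycodelin)
print(real, round(mean, 2), round(sd, 2), round(z, 2))
```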

Page 126: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

A randomization test shows that RBP is significantly related to β-lactoglobulin

[Figure: histogram of scores from the random shuffles; axes: quality score, number of instances]

100 random shuffles
Mean score = 8.4
Std. dev. = 4.5

Real comparison
Score = 37

But this test assumes a normal distribution of scores! In the BLAST lecture we will consider the distribution of scores and more appropriate significance tests.

Page 77

Page 127: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

You can perform this randomization test in GCG using the gap or bestfit pairwise alignment programs.

Type > gap -ran=100
or > bestfit -ran=100

Z = (S_real – X_randomized score) / standard deviation
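As a worked example of this formula, the numbers from the RBP randomization test earlier in the lecture (real score 37, randomized mean 8.4, standard deviation 4.5) can be plugged in directly. The function name `z_score` is made up for this sketch.

```python
def z_score(real, mean_randomized, sd):
    """Z = (real score - mean of randomized scores) / standard deviation."""
    return (real - mean_randomized) / sd

z = z_score(37, 8.4, 4.5)
print(round(z, 1))  # 6.4
```

The real score lies more than six standard deviations above the randomized mean, although interpreting that value as a probability assumes the scores are normally distributed.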

Page 128: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

The PRSS program performs a scramble test for you (http://fasta.bioch.virginia.edu/fasta/prss.htm)

Bad scores

Good scores

But these scores are not normally distributed!

Page 129: Pairwise sequence alignment · 2014-05-26 · September 6, 2006. Pairwise sequence . alignment. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. Johns Hopkins. 260.602.01

Next in the course...

On your own:
Take the moodle quiz (chapter 3), due in a week.
You can try the problems for Friday’s computer lab.
Optional: take the multiple choice quizzes in the book (chapters 2 & 3); answers are on page 735.

Friday: computer lab #1 (of 7)
Spend half the time on chapter 2 problems (pages 37-38), and half the time on chapter 3 problems (pages 80-82)

For Monday: Chapter 4 (BLAST)
For Wednesday in a week: Chapter 5 (advanced BLAST)