An Introduction to Hidden Markov Models for Biological Sequences

Chapter 4

by Anders Krogh
Center for Biological Sequence Analysis, Technical University of Denmark
Building 206, 2800 Lyngby, Denmark
Phone: +45 4525 2471
Fax: +45 4593 4808
E-mail: [email protected]

In Computational Methods in Molecular Biology, edited by S. L. Salzberg, D. B. Searls and S. Kasif, pages 45-63. Elsevier, 1998.


Contents

4 An Introduction to Hidden Markov Models for Biological Sequences
  4.1 Introduction
  4.2 From regular expressions to HMMs
  4.3 Profile HMMs
    4.3.1 Pseudocounts
    4.3.2 Searching a database
    4.3.3 Model estimation
  4.4 HMMs for gene finding
    4.4.1 Signal sensors
    4.4.2 Coding regions
    4.4.3 Combining the models
  4.5 Further reading

4.1 Introduction

Very efficient programs for searching a text for a combination of words are available on many computers. The same methods can be used for searching for patterns in biological sequences, but often they fail. This is because biological ‘spelling’ is much more sloppy than English spelling: proteins with the same function from two different organisms are almost certainly spelled differently, that is, the two amino acid sequences differ. It is not rare that two such homologous sequences have less than 30% identical amino acids. Similarly in DNA many interesting signals vary greatly even within the same genome. Some well-known examples are ribosome binding sites and splice sites, but the list is long. Fortunately there are usually still some subtle similarities between two such sequences, and the question is how to detect these similarities.

The variation in a family of sequences can be described statistically, and this is the basis for most methods used in biological sequence analysis; see [1] for a presentation of some of these statistical approaches. For pairwise alignments, for instance, the probability that a certain residue mutates to another residue is used in a substitution matrix, such as one of the PAM matrices. For finding patterns in DNA, e.g. splice sites, some sort of weight matrix is very often used, which is simply a position-specific score calculated from the frequencies of the four nucleotides at all the positions in some known examples. Similarly, methods for finding genes use, almost without exception, the statistics of codons or dicodons in some form or other.

A hidden Markov model (HMM) is a statistical model that is very well suited for many tasks in molecular biology, although HMMs have been mostly developed for speech recognition since the early 1970s; see [2] for historical details. The most popular use of the HMM in molecular biology is as a ‘probabilistic profile’ of a protein family, which is called a profile HMM. From a family of proteins (or DNA) a profile HMM can be made for searching a database for other members of the family. These profile HMMs resemble the profile [3] and weight matrix methods [4, 5], and probably the main contribution is that the profile HMM treats gaps in a systematic way.

The HMM can be applied to other types of problems. It is particularly well suited for problems with a simple ‘grammatical structure,’ such as gene finding. In gene finding several signals must be recognized and combined into a prediction of exons and introns, and the prediction must conform to various rules to make it a reasonable gene prediction. An HMM can combine recognition of the signals, and it can be made such that the predictions always follow the rules of a gene.

Since much of the literature on HMMs is a little hard to read for many biologists, I will attempt in this chapter to give a non-mathematical introduction to HMMs. Whereas the little biological background needed is taken for granted, I have tried to explain HMMs at a level that almost anyone can follow. First HMMs are introduced by an example and then profile HMMs are described. Then an HMM for finding eukaryotic genes is sketched, and finally pointers to the literature are given.

4.2 From regular expressions to HMMs

Most readers have no doubt come across regular expressions at some point, and many probably use them quite a lot. Regular expressions are used in many programs, in particular on Unix computers. In programs like awk, grep, sed, and perl, regular expressions can be used for searching text files for a pattern. With grep for instance, you can search a file for all lines containing ‘C. elegans’ or ‘Caenorhabditis elegans’ with the regular expression ‘C[\.a-z]* elegans’. This will match any line containing a C followed by any number of lower-case letters or ‘.’, then a space and then elegans. Regular expressions can also be used to characterize protein families, which is the basis for the PROSITE database [6].
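The grep search described above can be tried out in Python, whose `re` module accepts the same expression. This is only a sketch: the three sample lines are made up for illustration and do not come from the chapter.

```python
import re

# Pattern from the text: a 'C', then any number of lower-case letters or '.',
# then a space and the word 'elegans'.
pattern = re.compile(r"C[\.a-z]* elegans")

# Hypothetical sample lines (illustration only).
lines = [
    "sample 1: C. elegans",
    "sample 2: Caenorhabditis elegans",
    "sample 3: Drosophila melanogaster",
]

# Like grep, keep every line that contains a match anywhere in the line.
hits = [line for line in lines if pattern.search(line)]
for line in hits:
    print(line)
```

Both spellings of the species name are found, while the third line is skipped because it contains no capital C followed by ‘elegans’.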

Using regular expressions is a very elegant and efficient way to search for some protein families, but difficult for others. As already mentioned in the introduction, the difficulties arise because protein spelling is much more free than English spelling. Therefore the regular expressions sometimes need to be very broad and complex. Imagine a DNA motif like this:

    A C A - - - A T G
    T C A A C T A T C
    A C A C - - A G C
    A G A - - - A T C
    A C C G - - A T C

(I use DNA only because of the smaller number of letters than for amino acids). A regular expression for this is

    [AT] [CG] [AC] [ACGT]* A [TG] [GC] ,

meaning that the first position is A or T, the second C or G, and so forth. The term ‘[ACGT]*’ means that any of the four letters can occur any number of times.

The problem with the above regular expression is that it does not in any way distinguish between the highly implausible sequence

    T G C T - - A G G

which has the exceptional character in each position, and the consensus sequence

    A C A C - - A T C

with the most plausible character in each position (the dashes are just for aligning these sequences with the previous ones). What is meant by a ‘plausible’ sequence can of course be debated, although most would probably agree that the first sequence is not likely to be the same motif as the 5 sequences above. It is possible to make the regular expression more discriminative by splitting it into several different ones, but it easily becomes messy. The alternative is to score sequences by how well they fit the alignment.

Figure 4.1: A hidden Markov model derived from the alignment discussed in the text. The transitions are shown with arrows whose thickness indicates their probability. In each state the histogram shows the probabilities of the four nucleotides.
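The lack of discrimination of the regular expression is easy to check. The sketch below uses Python's `re` module and writes the sequences without the alignment dashes; the variable names are my own.

```python
import re

# The motif expression from the text, written without the spaces between terms.
motif = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

# The five aligned sequences, the consensus and the 'exceptional' sequence,
# all with the alignment dashes removed.
alignment = ["ACAATG", "TCAACTATC", "ACACAGC", "AGAATC", "ACCGATC"]
consensus = "ACACATC"
exceptional = "TGCTAGG"

# fullmatch() requires the whole sequence to fit the expression.
results = {seq: bool(motif.fullmatch(seq))
           for seq in alignment + [consensus, exceptional]}
for seq, matched in results.items():
    print(seq, matched)
```

Every sequence is accepted, including the exceptional one: a regular expression only answers yes or no and gives no measure of how well a sequence fits.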

To score a sequence, we say that there is a probability of 4/5 = 0.8 for an A in the first position and 1/5 = 0.2 for a T, because we observe that out of 5 letters 4 are As and one is a T. Similarly in the second position the probability of C is 4/5 and of G 1/5, and so forth. After the third position in the alignment, 3 out of 5 sequences have ‘insertions’ of varying lengths, so we say the probability of making an insertion is 3/5 and thus 2/5 for not making one. To keep track of these numbers a diagram can be drawn with probabilities as in Fig. 4.1.

This is a hidden Markov model. A box in the drawing is called a state, and there is a state for each term in the regular expression. All the probabilities are found simply by counting in the multiple alignment how many times each event occurs, just as described above. The only part that might seem tricky is the ‘insertion,’ which is represented by the state above the other states. The probability of each letter is found by counting all occurrences of the four nucleotides in this region of the alignment. The total counts are one A, two Cs, one G, and one T, yielding probabilities 1/5, 2/5, 1/5, and 1/5 respectively. After sequences 2, 3 and 5 have made one insertion each, there are two more insertions (from sequence 2) and the total number of transitions back to the main line of states is 3 (all three sequences with insertions have to finish). Therefore there are 5 transitions in total from the insert state, and the probability of making a transition to itself is 2/5 and the probability of making one to the next state is 3/5.
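The counting procedure can be written down in a few lines. The following Python sketch assumes the five-sequence alignment of the DNA motif, with alignment columns 4 to 6 treated as the insert region; the function and variable names are illustrative choices, not part of the chapter.

```python
from collections import Counter

# The five aligned sequences; columns 4-6 (indices 3-5) are the insert region.
alignment = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
main_cols = [0, 1, 2, 6, 7, 8]
insert_cols = [3, 4, 5]

def frequencies(letters):
    """Turn a collection of nucleotides into relative frequencies."""
    counts = Counter(letters)
    total = sum(counts.values())
    return {x: counts[x] / total for x in "ACGT"}

# Emission probabilities of the six main states: one column each.
main_emissions = [frequencies(seq[c] for seq in alignment) for c in main_cols]

# Emission probabilities of the insert state: all inserted letters pooled.
insert_letters = [seq[c] for seq in alignment for c in insert_cols if seq[c] != "-"]
insert_emissions = frequencies(insert_letters)

# Transitions around the insert state.
n_with_insert = sum(any(seq[c] != "-" for c in insert_cols) for seq in alignment)
p_enter_insert = n_with_insert / len(alignment)           # 3/5
# Each inserted letter is followed by exactly one transition out of the
# insert state: back to itself, or on to the next main state.
n_from_insert = len(insert_letters)                       # 5
p_insert_loop = (n_from_insert - n_with_insert) / n_from_insert  # 2/5
p_insert_exit = n_with_insert / n_from_insert             # 3/5

print(main_emissions[0], insert_emissions, p_enter_insert, p_insert_loop)
```

The numbers agree with those derived by hand above: 0.8 for A in the first state, 0.4 for C in the insert state, and 3/5 and 2/5 for the transitions.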

Table 4.1: Probabilities and log-odds scores for the 5 sequences in the alignment and for the consensus sequence and the ‘exceptional’ sequence.

                 Sequence     Probability × 100    Log odds
    Consensus    ACAC--ATC    4.7                  6.7
    Original     ACA---ATG    3.3                  4.9
    sequences    TCAACTATC    0.0075               3.0
                 ACAC--AGC    1.2                  5.3
                 AGA---ATC    3.3                  4.9
                 ACCG--ATC    0.59                 4.6
    Exceptional  TGCT--AGG    0.0023               -0.97

It is now easy to score the consensus sequence ACACATC. The probability of the first A is 4/5. This is multiplied by the probability of the transition from the first state to the second, which is 1. Continuing this, the total probability of the consensus is

$$P(\mathrm{ACACATC}) = 0.8 \times 1 \times 0.8 \times 1 \times 0.8 \times 0.6 \times 0.4 \times 0.6 \times 1 \times 1 \times 0.8 \times 1 \times 0.8 \approx 4.7 \times 10^{-2}.$$

Making the same calculation for the exceptional sequence yields only $0.0023 \times 10^{-2}$, which is roughly 2000 times smaller than for the consensus. This way we achieved the goal of getting a score for each sequence, a measure of how well a sequence fits the motif.
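The multiplication along the path can be sketched in Python. The probabilities below are the ones counted from the five-sequence alignment of the motif; the constant and function names are my own, and the function relies on the fact that in this small model the first three and last three letters must use main states.

```python
# Emission probabilities of the six main states and of the insert state.
MAIN = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
P_TO_INSERT, P_SKIP_INSERT = 0.6, 0.4   # transitions out of main state 3
P_LOOP, P_EXIT = 0.4, 0.6               # transitions out of the insert state

def probability(seq):
    """P(seq | model). The first and last three letters are emitted by main
    states; anything in between is emitted by the insert state."""
    if len(seq) < 6:
        return 0.0
    head, middle, tail = seq[:3], seq[3:-3], seq[-3:]
    p = 1.0
    for state, x in zip(MAIN[:3], head):
        p *= state.get(x, 0.0)
    if middle:
        p *= P_TO_INSERT
        for x in middle:
            p *= INSERT.get(x, 0.0)
        p *= P_LOOP ** (len(middle) - 1) * P_EXIT
    else:
        p *= P_SKIP_INSERT
    for state, x in zip(MAIN[3:], tail):
        p *= state.get(x, 0.0)
    return p

p_consensus = probability("ACACATC")
p_exceptional = probability("TGCTAGG")
print(round(p_consensus * 100, 1), round(p_exceptional * 100, 4))
```

The ratio of the two probabilities is about 2000, as stated in the text. Transitions with probability 1 are left out of the product because they do not change it.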

The same probability can be calculated for the five original sequences in the alignment in exactly the same way, and the result is shown in Table 4.1. The probability depends very strongly on the length of the sequence. Therefore the probability itself is not the most convenient number to use as a score, and the log-odds score shown in the last column of the table is usually better. It is the logarithm of the probability of the sequence divided by the probability according to a null model. The null model is one that treats the sequences as random strings of nucleotides, so the probability of a sequence of length $L$ is $0.25^L$. Then the log-odds score is

$$\text{log-odds for sequence } S = \log \frac{P(S)}{0.25^L} = \log P(S) - L \log 0.25.$$

I have used the natural logarithm in Table 4.1. Logarithms to different bases are proportional to each other, so it does not really matter which one you use; it is quite common to use the logarithm base 2. One can of course use other null models instead. Often one would use the overall nucleotide frequencies in the organism studied instead of just 0.25.
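As a small worked example, the log-odds formula can be evaluated directly. The sketch below is in Python; the function name and its parameters are hypothetical, and the two probabilities are the products computed for the consensus and the exceptional sequence (both of length 7).

```python
import math

def log_odds(p_model, length, p_null_per_letter=0.25, base=math.e):
    """log P(S) - L * log(null probability per letter), in the given base."""
    natural = math.log(p_model) - length * math.log(p_null_per_letter)
    return natural / math.log(base)

consensus_score = log_odds(0.047186, 7)      # natural logarithm
exceptional_score = log_odds(2.304e-5, 7)
consensus_bits = log_odds(0.047186, 7, base=2)

print(round(consensus_score, 1), round(exceptional_score, 2))
```

Changing the base only rescales all scores by the same constant factor, which is why the choice of logarithm does not affect the ranking of sequences.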

Figure 4.2: The probabilities of the model in Fig. 4.1 have been turned into log-odds by taking the logarithm of each nucleotide probability and subtracting log(0.25). The transition probabilities have been converted to simple logs.

When a sequence fits the motif very well the log-odds is high. When it fits the null model better, the log-odds score is negative. Although the raw probability of the second sequence (the one with three inserts) is almost as low as that of the exceptional sequence, notice that the log-odds score is much higher than for the exceptional sequence, and the discrimination is very good. Unfortunately, one cannot always assume that anything with a positive log-odds score is ‘a hit,’ because there are random hits if one is searching a large database. See Section 4.5 for references.

Instead of working with probabilities one might convert everything to log-odds. If each nucleotide probability is divided by the probability according to the null model (0.25 in this case) and the logarithm is applied, we would get the numbers shown in Fig. 4.2. The transition probabilities are also turned into logarithms. Now the log-odds score can be calculated directly by adding up these numbers instead of multiplying the probabilities. For instance, the calculation of the log-odds of the consensus sequence is

$$\text{log-odds}(\mathrm{ACACATC}) = 1.16 + 0 + 1.16 + 0 + 1.16 - 0.51 + 0.47 - 0.51 + 1.39 + 0 + 1.16 + 0 + 1.16 = 6.64.$$

(The finite precision causes the little difference between this number and the one in Table 4.1.)

If the alignment had no gaps or insertions we would get rid of the insert state, and then all the probabilities associated with the arrows (the transition probabilities) would be 1 and might as well be ignored completely. Then the HMM works exactly as a weight matrix of log-odds scores, which is commonly used.

Figure 4.3: The structure of the profile HMM.

4.3 Profile HMMs

A profile HMM is a certain type of HMM with a structure that in a natural way allows position-dependent gap penalties. A profile HMM can be obtained from a multiple alignment and can be used for searching a database for other members of the family in the alignment very much like standard profiles [3]. The structure of the model is shown in Fig. 4.3. The bottom line of states are called the main states, because they model the columns of the alignment. In these states the probability distribution is just the frequency of the amino acids or nucleotides as in the above model of the DNA motif. The second row of diamond-shaped states are called insert states and are used to model highly variable regions in the alignment. They function exactly like the top state in Fig. 4.1, although one might choose to use a fixed distribution of residues, e.g. the overall distribution of amino acids, instead of calculating the distribution as in the example above. The top line of circular states are called delete states. These are a different type of state, called a silent or null state. They do not match any residues, and they are there merely to make it possible to jump over one or more columns in the alignment, i.e., to model the situation when just a few of the sequences have a ‘-’ in the multiple alignment at a position. Let us turn to an example.

Suppose you have a multiple alignment such as the one shown in Fig. 4.4. A region of this alignment has been chosen to be an ‘insertion,’ because an alignment of this region is highly uncertain. The rest of the alignment (shaded in the figure) are the columns that will correspond to main states in the model. For each non-insert column we make a main state and set the probabilities equal to the amino acid frequencies. To estimate the transition probabilities we count how many sequences use the various transitions, just like the transition probabilities were calculated in the first example. The model is shown in Fig. 4.5. There are two transitions to a delete state shown with dashed lines in the figure, that from begin to the first delete state and from main state 12 to delete state 13. Both of these correspond to dashes in the alignment. In both cases only one sequence has gaps, so the probability of these delete transitions is 1/30. The fourth sequence continues deletion to the end, so the probability of going from delete 13 to 14 is 1 and from delete 14 to the end is also 1.

    G G W W R G d y . g g k k q L W F P S N Y V
    I G W L N G y n e t t g e r G D F P G T Y V
    P N W W E G q l . . n n r r G I F P S N Y V
    D E W W Q A r r . . d e q i G I V P S K - -
    G E W W K A q s . . t g q e G F I P F N F V
    G D W W L A r s . . s g q t G Y I P S N Y V
    G D W W D A e l . . k g r r G K V P S N Y L
    - D W W E A r s l s s g h r G Y V P S N Y V
    G D W W Y A r s l i t n s e G Y I P S T Y V
    G E W W K A r s l a t r k e G Y I P S N Y V
    G D W W L A r s l v t g r e G Y V P S N F V
    G E W W K A k s l s s k r e G F I P S N Y V
    G E W C E A q t . k n g q . G W V P S N Y I
    S D W W R V v n l t t r q e G L I P L N F V
    L P W W R A r d . k n g q e G Y I P S N Y I
    R D W W E F r s k t v y t p G Y Y E S G Y V
    E H W W K V k d . a l g n v G Y I P S N Y V
    I H W W R V q d . r n g h e G Y V P S S Y L
    K D W W K V e v . . n d r q G F V P A A Y V
    V G W M P G l n e r t r q r G D F P G T Y V
    P D W W E G e l . . n g q r G V F P A S Y V
    E N W W N G e i . . g n r k G I F P A T Y V
    E E W L E G e c . . k g k v G I F P K V F V
    G G W W K G d y . g t r i q Q Y F P S N Y V
    D G W W R G s y . . n g q v G W F P S N Y V
    Q G W W R G e i . . y g r v G W F P A N Y V
    G R W W K A r r . a n g e t G I I P S N Y V
    G G W T Q G e l . k s g q k G W A P T N Y L
    G D W W E A r s n . t g e n G Y I P S N Y V
    N D W W T G r t . . n g k e G I F P A N Y V

Figure 4.4: An alignment of 30 short amino acid sequences chopped out of an alignment of the SH3 domain. The shaded areas are the most conserved and were chosen to be represented by the main states in the HMM. The unshaded area with lower-case letters was chosen to be represented by an insert state.

Figure 4.5: A profile HMM made from the alignment shown in Fig. 4.4. Transition lines with no arrow head are transitions from left to right. Transitions with probability zero are not shown, and those with very small probability are shown as dashed lines. Transitions from an insert state to itself are not shown; instead the probability times 100 is shown in the diamond. The numbers in the circular delete states are just position numbers. (This figure and Fig. 4.6 were generated by a program in the SAM package of programs.)

Figure 4.6: Model obtained in the same way as Fig. 4.5, but using a pseudocount of one.

4.3.1 Pseudocounts

It is dangerous to estimate a probability distribution from just a few observed amino acids. If for instance you have just two sequences with leucine at a certain position, the probability for leucine would be 1 and the probability would be zero for all other amino acids at this position, although it is well known that one often sees for example valine substituted for leucine. In such a case the probability of a whole sequence may easily become zero if a single leucine is substituted by a valine, or equivalently, the log-odds is minus infinity.

Therefore it is important to have some way of avoiding this sort of over-fitting, where strong conclusions are drawn from very little evidence. The most common method is to use pseudocounts, which means that one pretends to have more counts of amino acids than those from the data. The simplest is to add 1 to all the counts. With the leucine example the two observed leucines plus 20 pseudocounts give 22 counts in total, so the probability of leucine would be estimated as 3/22 and for the 19 other amino acids it would become 1/22. In Fig. 4.6 a model is shown, which was obtained from the alignment in Fig. 4.4 using a pseudocount of 1.

Adding one to all the counts can be interpreted as assuming a priori that all the amino acids are equally likely. However, there are significant differences in the occurrence of the 20 amino acids in known protein sequences. Therefore, the next step is to use pseudocounts proportional to the observed frequencies of the amino acids instead. This is the minimum level of pseudocounts to be used in any real application of HMMs.
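Both pseudocount schemes can be sketched in a few lines of Python. Note the assumptions: the function names are mine, the background frequencies are made-up numbers chosen only so that leucine is common and tryptophan rare, and the total pseudocount weight of 20 is an arbitrary choice.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_probabilities(observed, pseudocounts):
    """Estimate emission probabilities from observed counts plus pseudocounts."""
    total = sum(observed.get(a, 0) + pseudocounts[a] for a in AMINO_ACIDS)
    return {a: (observed.get(a, 0) + pseudocounts[a]) / total for a in AMINO_ACIDS}

def proportional_pseudocounts(background, weight):
    """Pseudocounts proportional to background frequencies, summing to `weight`."""
    return {a: weight * background[a] for a in AMINO_ACIDS}

# A column in which only two leucines were observed.
observed = {"L": 2}

# Scheme 1: add one to every count.
add_one = {a: 1.0 for a in AMINO_ACIDS}
p_add_one = column_probabilities(observed, add_one)

# Scheme 2: pseudocounts proportional to (hypothetical) background frequencies.
background = {a: 0.05 for a in AMINO_ACIDS}
background["L"], background["W"] = 0.09, 0.01   # made-up values, still sum to 1
p_background = column_probabilities(observed, proportional_pseudocounts(background, 20))

print(p_add_one["L"], p_add_one["V"])
print(p_background["L"], p_background["V"], p_background["W"])
```

With add-one pseudocounts the column has 2 + 20 = 22 counts, so leucine gets 3/22 and every other amino acid 1/22. With background-proportional pseudocounts a rare amino acid such as tryptophan receives a smaller probability than a typical one, but never zero.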

Because a column in the alignment may contain information about the preferred type of amino acids, it is also possible to use more sophisticated pseudocount strategies. If a column consists predominantly of leucine (as above), one would expect substitutions to other hydrophobic amino acids to be more probable than substitutions to hydrophilic amino acids. One can e.g. derive pseudocounts for a given column from substitution matrices. See Section 4.5 for references.

4.3.2 Searching a database

Above we saw how to calculate the probability of a sequence in the alignment by multiplying all the probabilities (or adding the log-odds scores) in the model along the path followed by that particular sequence. However, this path is usually not known for other sequences which are not part of the original alignment, and the next problem is how to score such a sequence. Obviously, if we can find a path through the model where the new sequence fits well in some sense, then we can score the sequence as before. We need to ‘align’ the sequence to the model. It resembles very much the pairwise alignment problem, where two sequences are aligned so that they are most similar, and indeed the same type of dynamic programming algorithm can be used.

For a particular sequence, an alignment to the model (or a path) is an assignment of states to each residue in the sequence. There are many such alignments for a given sequence. For instance an alignment might be as follows. Let us label the amino acids in a protein as A1, A2, A3, etc. Similarly we can label the HMM states as M1, M2, M3, etc. for match states, I1, I2, I3 for insert states, and so on. Then an alignment could have A1 match state M1, A2 and A3 match I1, A4 match M2, A5 match M6 (after passing through three delete states), and so on. For each such path we can calculate the probability of the sequence or the log-odds score, and thus we can find the best alignment, i.e., the one with the largest probability. Although there are an enormous number of possible alignments it can be done efficiently by the above mentioned dynamic programming algorithm, which is called the Viterbi algorithm. The algorithm also gives the probability of the sequence for that alignment, and thus a score is obtained.
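A compact version of the Viterbi recursion is sketched below in Python. To keep it short it makes a simplifying assumption: the model has no silent delete states, so every state emits one letter. The example model is the small DNA motif HMM (six main states and one insert state); the state names and the dictionary layout are my own choices.

```python
import math

NEG_INF = float("-inf")

def log(p):
    """Logarithm that maps probability zero to minus infinity."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(seq, states, start, trans, emit, end):
    """Return (log probability, state path) of the most probable path."""
    # v[s]: best log probability of emitting the letters so far, ending in s.
    v = {s: log(start.get(s, 0)) + log(emit[s].get(seq[0], 0)) for s in states}
    back = []
    for x in seq[1:]:
        prev, v, ptr = v, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log(trans[r].get(s, 0)))
            v[s] = prev[best] + log(trans[best].get(s, 0)) + log(emit[s].get(x, 0))
            ptr[s] = best
        back.append(ptr)
    last = max(states, key=lambda s: v[s] + log(end.get(s, 0)))
    score = v[last] + log(end.get(last, 0))
    # Trace the stored pointers backwards to recover the path.
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return score, path[::-1]

STATES = ["M1", "M2", "M3", "I", "M4", "M5", "M6"]
START = {"M1": 1.0}
END = {"M6": 1.0}
TRANS = {
    "M1": {"M2": 1.0}, "M2": {"M3": 1.0}, "M3": {"I": 0.6, "M4": 0.4},
    "I": {"I": 0.4, "M4": 0.6}, "M4": {"M5": 1.0}, "M5": {"M6": 1.0}, "M6": {},
}
EMIT = {
    "M1": {"A": 0.8, "T": 0.2}, "M2": {"C": 0.8, "G": 0.2},
    "M3": {"A": 0.8, "C": 0.2}, "I": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "M4": {"A": 1.0}, "M5": {"T": 0.8, "G": 0.2}, "M6": {"C": 0.8, "G": 0.2},
}

score, path = viterbi("ACACATC", STATES, START, TRANS, EMIT, END)
print(path, round(math.exp(score) * 100, 1))
```

For the consensus sequence the best path visits the insert state once, and the probability along that path is the same 4.7 × 10^-2 obtained by hand. A full profile HMM implementation would also have to handle the silent delete states, which this sketch leaves out.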

The log-odds score found in this manner can be used to search databases for members of the same family. A typical distribution of scores from such a search is shown in Fig. 4.7. As is also the case with other types of searches, there is no clear-cut separation of true and false positives, and one needs to investigate some of the sequences around a log-odds of zero, and possibly include some of them in the alignment and try searching again.

Figure 4.7: The distribution of log-odds scores from a search of Swissprot with a profile HMM of the SH3 domain (histogram; horizontal axis: log-odds, vertical axis: number of sequences). The dark area of the histogram represents the sequences with an annotated SH3 domain, and the light those that are not annotated as having one. This is for illustrative purposes only, and the sequences with log-odds around zero were not investigated further.

An alternative way of scoring sequences is to sum the probabilities of all possible alignments of the sequence to the model. This probability can be found by a similar algorithm called the forward algorithm. This type of scoring is not very common in biological sequence comparison, but it is more natural from a probabilistic point of view. However, it usually gives very similar results.
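The difference between the two ways of scoring can be illustrated with a toy model. The Python sketch below uses one recursion for both: summing over predecessor states gives the forward probability, taking the maximum gives the probability of the best single path. The five-state model (three main states and two insert states, no delete states) and all its numbers are hypothetical and chosen only so that a sequence has more than one possible alignment.

```python
def path_score(seq, states, start, trans, emit, end, combine):
    """combine=sum gives the forward probability (all paths);
    combine=max gives the probability of the single best path."""
    f = {s: start.get(s, 0.0) * emit[s].get(seq[0], 0.0) for s in states}
    for x in seq[1:]:
        f = {s: emit[s].get(x, 0.0)
                * combine(f[r] * trans[r].get(s, 0.0) for r in states)
             for s in states}
    return combine(f[s] * end.get(s, 0.0) for s in states)

# Hypothetical toy model: M1 -> (I1) -> M2 -> (I2) -> M3.
STATES = ["M1", "I1", "M2", "I2", "M3"]
START = {"M1": 1.0}
END = {"M3": 1.0}
TRANS = {
    "M1": {"I1": 0.5, "M2": 0.5}, "I1": {"I1": 0.5, "M2": 0.5},
    "M2": {"I2": 0.5, "M3": 0.5}, "I2": {"I2": 0.5, "M3": 0.5}, "M3": {},
}
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
EMIT = {
    "M1": {"A": 1.0}, "I1": UNIFORM, "M2": {"A": 0.5, "C": 0.5},
    "I2": UNIFORM, "M3": {"G": 1.0},
}

# 'AACG' can be aligned in two ways: the extra letter is emitted either by
# I1 (path M1 I1 M2 M3) or by I2 (path M1 M2 I2 M3).
total = path_score("AACG", STATES, START, TRANS, EMIT, END, sum)
best = path_score("AACG", STATES, START, TRANS, EMIT, END, max)
print(total, best)
```

Here the two alignments happen to be equally probable, so the forward probability is exactly twice the best-path probability. In a realistic profile HMM one alignment usually dominates, which is why the two scores tend to be close.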

4.3.3 Model estimation

As presented so far, one may view the profile HMMs as a generalization of weight matrices to incorporate insertions and deletions in a natural way. There is however one interesting feature of HMMs, which has not been addressed yet. It is possible to estimate the model, i.e. determine all the probability parameters of it, from unaligned sequences. Furthermore, a multiple alignment of the sequences is produced in the process. Like many other multiple alignment methods this is done in an iterative manner. One starts out with a model with more or less random probabilities, or if a reasonable alignment of some of the sequences is available, a model is constructed from this alignment. Then, when all the sequences are aligned to the model, we can use the alignment to improve the probabilities in the model. These new probabilities may then lead to a slightly different alignment. If they do, we then repeat the process and improve the probabilities again. The process is repeated until the alignment does not change. The alignment of the sequences to the final model yields a multiple alignment (see footnote 1).

Although this estimation process sounds easy, there are many problems to consider to actually make it work well. One problem is choosing the appropriate model length, which determines the number of inserts in the final alignment. Another severe problem is that the iterative procedure can converge to suboptimal solutions. It is not guaranteed that it finds the optimal multiple alignment, i.e. the most probable one. Methods for dealing with these issues are described in the literature pointed to in Section 4.5.

4.4 HMMs for gene finding

One ability of HMMs, which is not really utilized in profile HMMs, is the ability to model grammar. Many problems in biological sequence analysis have a grammatical structure, and eukaryotic gene structure, which I will use as an example, is one of them. If you consider exons and introns as the ‘words’ in a language, the sentences are of the form exon-intron-exon-intron...intron-exon. The ‘sentences’ can never end with an intron, at least if the genes are complete, and an exon can never follow an exon without an intron in between. Obviously this grammar is greatly simplified, because there are several other constraints on gene structure, such as the constraint that the exons have to fit together to give a valid coding region after splicing. In Fig. 4.8 the structure of a gene is shown with some of the known signals marked.

Formal language theory applied to biological problems is not a new invention. In particular David Searls [7] has promoted this idea and used it for gene finding [8], but many other gene finders use it implicitly. Formally the HMM can only represent the simplest of grammars, which is called a regular grammar [7, 1], but that turns out to be good enough for the gene finding problem, and many other problems. One of the problems that has a more complicated grammar than the HMM can handle is the RNA folding problem, which is one step up the ladder of grammars, because base pairing introduces correlations between bases far from each other in the RNA sequence.

I will here briefly outline my own approach to gene finding with the weight on the principles rather than on the details.

Footnote 1: Another slightly different method for model estimation sums over all alignments instead of using the most probable alignment of a sequence to the model. This method uses the forward algorithm instead of Viterbi, and it is called the Baum–Welch algorithm or the forward–backward algorithm.

Figure 4.8: The structure of a gene with some of the important signals shown.

4.4.1 Signal sensors

One may apply an HMM similar to the ones already described directly to many of the signals in a gene structure. In Fig. 4.9 an alignment is shown of some sequences around acceptor sites from human DNA. It has 19 columns and an HMM with 19 states (no insert or delete states) can be made directly from it. Since the alignment is gap-less, the HMM is equivalent to a weight matrix.

There is one problem: in DNA there are fairly strong dinucleotide preferences. A model like the one described treats the nucleotides as independent, so dinucleotide preferences cannot be captured. This is easily fixed by having 16 probability parameters in each state instead of 4. In column two we first count all occurrences of the four nucleotides given that there is an A in the first column and normalize these four counts, so they become probabilities. This is the conditional probability that a certain nucleotide appears in position two, given that the previous one was A. The same is done for all the instances of C in column 1 and similarly for G and T. This gives a total of 16 probabilities to be used in state two of the HMM. Similarly for all the other states. To calculate the probability of a sequence, say ACTGTC..., we just multiply the conditional probabilities

$$P(\mathrm{ACTGTC}\ldots) = p_1(A) \times p_2(C \mid A) \times p_3(T \mid C) \times p_4(G \mid T) \times p_5(T \mid G) \times p_6(C \mid T) \times \cdots$$

    G T G C C T C T C C C T C C A G A T T
    A C G A C A T T T T C C A C A G G A G
    T T C C A T G T C C T G A C A G G T G
    T C G T G T G T C T C C C C A G C C C
    T T T T C C T T T T C T A C A G A A T
    A T G T G C A T C C C C C C A G G A G
    T A T T T A T T T A A C A T A G G G C
    C C C A T G T G A C C T G C A G G T A
    C A C T T T G C T C C C A C A G C G T
    T T G G G T T T C T T T G C A G A A C
    T T C T G T T C C G A T G C A G G G C
    T C C T A T A T G T T G A C A G G G T
    T G C C T C T C T T T T C A A G G G T
    G T T C C T T T G T T T C T A G C A C
    T A T T G T T T T C T T A C A G G G C
    C T C C C T G T G T C C A C A G G C T

Figure 4.9: Examples of human acceptor sites (the splice site 5' to the exon). Except in rare cases, the intron ends with AG, which has been highlighted. Included in these sequences are 16 bases upstream of the splice site and 3 bases downstream into the exon.

Here $p_1$ is the probability of the four nucleotides in state 1, $p_2(x \mid y)$ is the conditional probability in state 2 of nucleotide $x$ given that the previous nucleotide was $y$, and so forth.
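The estimation of such conditional probabilities can be sketched in Python using the 16 acceptor-site sequences of Fig. 4.9. The function names are my own, and no pseudocounts are used, so with only 16 sequences many of the 16 × 18 conditional probabilities come out as zero; a real model would be built from far more data.

```python
from collections import Counter

# The 16 acceptor-site sequences of Fig. 4.9, written without spaces.
sites = [
    "GTGCCTCTCCCTCCAGATT", "ACGACATTTTCCACAGGAG", "TTCCATGTCCTGACAGGTG",
    "TCGTGTGTCTCCCCAGCCC", "TTTTCCTTTTCTACAGAAT", "ATGTGCATCCCCCCAGGAG",
    "TATTTATTTAACATAGGGC", "CCCATGTGACCTGCAGGTA", "CACTTTGCTCCCACAGCGT",
    "TTGGGTTTCTTTGCAGAAC", "TTCTGTTCCGATGCAGGGC", "TCCTATATGTTGACAGGGT",
    "TGCCTCTCTTTTCAAGGGT", "GTTCCTTTGTTTCTAGCAC", "TATTGTTTTCTTACAGGGC",
    "CTCCCTGTGTCCACAGGCT",
]

def first_order_model(seqs):
    """Return p1 (zeroth-order first state) and a list `cond`, where
    cond[k-1][(x, y)] = P(x at index k | y at index k-1)."""
    first = Counter(s[0] for s in seqs)
    p1 = {x: first[x] / len(seqs) for x in "ACGT"}
    cond = []
    for k in range(1, len(seqs[0])):
        pairs = Counter((s[k], s[k - 1]) for s in seqs)
        prev = Counter(s[k - 1] for s in seqs)
        cond.append({(x, y): pairs[(x, y)] / prev[y]
                     for x in "ACGT" for y in "ACGT" if prev[y]})
    return p1, cond

def probability(seq, p1, cond):
    """Multiply the conditional probabilities along the sequence."""
    p = p1[seq[0]]
    for k in range(1, len(seq)):
        p *= cond[k - 1].get((seq[k], seq[k - 1]), 0.0)
    return p

p1, cond = first_order_model(sites)
print(p1)
print(cond[14][("G", "A")])   # P(G at position 16 | A at position 15)
print(probability(sites[0], p1, cond))
```

Because every intron in the sample ends with AG, the conditional probability of G at position 16 given A at position 15 is 1.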

A state with conditional probabilities is called a first order state, because it captures the first order correlations between neighboring nucleotides. It is easy to expand to higher order. A second order state has probabilities conditioned on the two previous nucleotides in the sequence, i.e., probabilities of the form $p(x \mid y, z)$. We will return to such higher order states below.

Small HMMs like this are constructed in exactly the same way for other signals: donor splice sites, the regions around the start codons, and the regions around the stop codons.

4.4.2 Coding regions

The codon structure is the most important feature of coding regions. Bases in triplets can be modeled with three states as shown in Fig. 4.10. The figure also shows how this model of coding regions can be used in a simple model of an unspliced gene that starts with a start codon (ATG), then consists of some number of codons, and ends with a stop codon.

Since a codon is three bases long, the last state of the codon model must be at least of order two to correctly capture the codon statistics. The 64 probabilities in such a state are estimated by counting the number of each codon in a set of known coding regions. These numbers are then normalized properly. For example the probabilities derived from the counts of CAA, CAC, CAG, and CAT are

$$\begin{aligned}
p(A \mid CA) &= c(CAA) / [c(CAA) + c(CAC) + c(CAG) + c(CAT)] \\
p(C \mid CA) &= c(CAC) / [c(CAA) + c(CAC) + c(CAG) + c(CAT)] \\
p(G \mid CA) &= c(CAG) / [c(CAA) + c(CAC) + c(CAG) + c(CAT)] \\
p(T \mid CA) &= c(CAT) / [c(CAA) + c(CAC) + c(CAG) + c(CAT)]
\end{aligned}$$

where $c(xyz)$ is the count of codon $xyz$.

Figure 4.10: Top: A model of coding regions, where state one, two and three match the first, second and third codon positions respectively. A coding region of any length can match this model, because of the transition from state three back to state one. Bottom: a simple model for unspliced genes with the first three states matching a start codon, the next three of the form shown to the left, and the last three states matching a stop codon (only one of the three possible stop codons is shown).

One of the characteristics of coding regions is the lack of stop codons. That is automatically taken care of, because $p(\mathrm{A} \mid \mathrm{TA})$, $p(\mathrm{G} \mid \mathrm{TA})$ and $p(\mathrm{A} \mid \mathrm{TG})$, corresponding to the three stop codons TAA, TAG and TGA, will automatically become zero.
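The counting and normalization described above can be sketched in a few lines of Python. The function name and the two toy coding regions are assumptions made for illustration only; real estimates would of course be based on a large set of known coding regions. Contexts that never occur in the data are simply left out of the table in this sketch.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"

def codon_state_probabilities(coding_regions):
    """Estimate p(z | xy) = c(xyz) / sum over z' of c(xyz') from in-frame codon counts."""
    counts = Counter()
    for seq in coding_regions:
        # Count complete in-frame codons only.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    probs = {}
    for x in NUCLEOTIDES:
        for y in NUCLEOTIDES:
            total = sum(counts[x + y + z] for z in NUCLEOTIDES)
            if total == 0:
                continue  # context xy was never observed in this data set
            probs[x + y] = {z: counts[x + y + z] / total for z in NUCLEOTIDES}
    return probs

# Hypothetical coding regions (in frame, without the terminal stop codon).
regions = ["ATGCAACAGTAC", "CAACATTATTGGTGC"]
probs = codon_state_probabilities(regions)
print(probs["CA"])        # probabilities of the third base after CA
print(probs["TA"]["A"])   # TAA never occurs in frame, so this is zero
```

Because the training sequences contain no in-frame stop codons, the entries for TAA, TAG and TGA come out as zero without any special handling.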

For modeling codon statistics it is natural to use an ordinary (zeroth order) state as the first state of the codon model and a first order state for the second. However, there are actually also dependencies between neighboring codons, and therefore one may want even higher order states. In my own gene finder, I use three fourth order states, which is inspired by GeneMark [9], in which such models were first introduced. Technically speaking, such a model is called an inhomogeneous Markov chain, which can be viewed as a sub-class of HMMs.
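To make the idea of an inhomogeneous (three-periodic) Markov chain more concrete, here is a minimal Python sketch with one conditional table per codon position. It is not the implementation used in the gene finder or in GeneMark; the function names, the pseudocount and the toy training sequence are assumptions, and the order is left as a parameter (the text uses order four, the example below uses order two to keep the numbers small).

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"

def train_periodic_chain(coding_regions, order=4, pseudocount=1.0):
    """Return prob(frame, context, x) with one table per codon position (frame)."""
    counts = [Counter(), Counter(), Counter()]
    for seq in coding_regions:
        for i in range(order, len(seq)):
            # The frame is the position within the codon: 0, 1 or 2.
            counts[i % 3][(seq[i - order:i], seq[i])] += 1

    def prob(frame, context, x):
        total = sum(counts[frame][(context, n)] for n in NUCLEOTIDES)
        return (counts[frame][(context, x)] + pseudocount) / (total + 4 * pseudocount)

    return prob

def log_likelihood(prob, seq, order=4):
    """Log-probability of seq (assumed in frame), skipping the first `order` bases."""
    return sum(math.log(prob(i % 3, seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

# Hypothetical training data, order 2 for brevity.
prob = train_periodic_chain(["ATGATGATGATG"], order=2)
print(prob(2, "AT", "G"))                      # third codon position, context AT
print(log_likelihood(prob, "ATGATG", order=2))
```

The same context can have different probabilities in different frames, which is what distinguishes this chain from a homogeneous Markov chain.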

4.4.3 Combining the models

To be able to discover genes, we need to combine the models in a way that satisfies the grammar of genes. I restrict myself to coding regions, i.e. the 5' and 3' untranslated regions of the genes are not modeled and also promoters are disregarded.

Figure 4.11: A hidden Markov model for unspliced genes. In this drawing an 'x' means a state for non-coding DNA, and a 'c' a state for coding DNA. Only one of the three possible stop codons is shown in the model of the region around the stop codon.

Figure 4.12: To allow for splicing in three different frames three intron models are needed. To get the frame correct 'spacer states' are added before and after the intron models.

First, let us see how to do it for unspliced genes. If we ignore genes that are very closely spaced or overlapping, a model could look like Fig. 4.11. It consists of a state for intergenic regions (of order at least 1), a model for the region around the start codon, the model for the coding region, and a model for the region around the stop codon. The model for the start codon region is made just like the acceptor model described earlier. It models eight bases upstream of the start codon,² the ATG start codon itself, and the first codon after the start. Similarly for the stop codon region. The whole model is one big HMM, although it was put together from small independent HMMs.

Having such a model, how can we predict genes in a sequence of anonymous DNA? That is quite simple: use the Viterbi algorithm to find the most probable path through the model. When this path goes through the ATG states, a start codon is predicted, when it goes through the codon states a codon is predicted, and so on.
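The following Python sketch shows the Viterbi recursion on a deliberately tiny model: one background state 'x' and three states that match A, T and G. It is only meant to illustrate how a start codon is read off the most probable path. The state names, transition and emission probabilities are invented for this example, all states are zeroth order for simplicity, and the model is far smaller than the one in Fig. 4.11.

```python
import math

def _log(p):
    # Log of a probability, with log(0) represented as minus infinity.
    return math.log(p) if p > 0 else float("-inf")

def viterbi(sequence, states, start, trans, emit):
    """Return the most probable state path for `sequence` (log-space Viterbi)."""
    score = {s: _log(start.get(s, 0.0)) + _log(emit[s].get(sequence[0], 0.0))
             for s in states}
    backpointers = []
    for symbol in sequence[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            prev = max(states, key=lambda r: score[r] + _log(trans[r].get(s, 0.0)))
            pointers[s] = prev
            new_score[s] = (score[prev] + _log(trans[prev].get(s, 0.0))
                            + _log(emit[s].get(symbol, 0.0)))
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy model (assumed values): background state x, then states matching A, T, G.
states = ["x", "s1", "s2", "s3"]
start = {"x": 1.0}
trans = {"x": {"x": 0.9, "s1": 0.1}, "s1": {"s2": 1.0},
         "s2": {"s3": 1.0}, "s3": {"x": 1.0}}
emit = {"x": {n: 0.25 for n in "ACGT"},
        "s1": {"A": 1.0}, "s2": {"T": 1.0}, "s3": {"G": 1.0}}
path = viterbi("CCATGCC", states, start, trans, emit)
print(path)
```

In this toy case the path passes through s1, s2 and s3 at the positions of ATG, which corresponds to predicting a start codon there.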

This model might not always predict correct genes, but at least it will only predict sensible genes that obey the grammatical rules. A gene will always start with a start codon and end with a stop codon, the length will always be divisible by 3, and it will never contain stop codons in the reading frame, which are the minimum requirements for unspliced gene candidates.
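The grammatical rules listed above are simple enough to write down as a stand-alone check. The sketch below is not part of the HMM (the model enforces these rules through its structure); it is just a hypothetical helper, with an assumed name, that tests whether a candidate sequence satisfies the same minimum requirements.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def obeys_gene_grammar(candidate):
    """Check the minimum requirements for an unspliced gene candidate."""
    # Length must be divisible by 3 and hold at least a start and a stop codon.
    if len(candidate) < 6 or len(candidate) % 3 != 0:
        return False
    codons = [candidate[i:i + 3] for i in range(0, len(candidate), 3)]
    starts_with_atg = codons[0] == "ATG"
    ends_with_stop = codons[-1] in STOP_CODONS
    no_internal_stop = not any(c in STOP_CODONS for c in codons[1:-1])
    return starts_with_atg and ends_with_stop and no_internal_stop

print(obeys_gene_grammar("ATGCAATAA"))     # valid candidate
print(obeys_gene_grammar("ATGTAACAATAA"))  # in-frame stop codon inside the gene
```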

Making a model that conforms to the rules of splicing is a bit more difficult than it might seem at first. That is because splicing can happen in three different reading frames, and the reading frame in one exon has to fit the one in the next. It turns out that by using three different models of introns, one for each frame, this is possible. In Fig. 4.12 it is shown how these models are added to the model of coding regions.

The top line in the model is for introns appearing between two codons. It has three states (labeled ccc) before the intron starts to match the last codon of the exon. The first two states of the intron model match GT, which is the consensus sequence at donor sites (it is occasionally another sequence, but such cases are ignored here). The next six states match the six bases immediately after GT. The states just described model the donor site, and the probabilities are found as described earlier for acceptor sites. Then follows a single state to model the interior of the intron. I actually use the same probability parameters in this state as in the state modeling intergenic regions. Now follows the acceptor model, which includes three states to match the first codon of the next exon.

² A similar model could be used for prokaryotic genes. In that case, however, one should model the Shine-Dalgarno sequence, which is often more than 8 bases upstream from the start. Also, one would probably need to allow for other start codons than ATG that are used in the organism studied (in some eukaryotes other start codons can also be used).

The next line in the model is for introns appearing after the first base in a codon. The difference from the first is that there is one more state for a coding base before the intron and two more states after the intron. This ensures that the frames fit in two neighboring exons. Similarly in the third line from the top there are two extra coding states before the intron and one after, so that it can match introns appearing after the second base in a codon.
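The bookkeeping behind the three intron lines can be sketched as follows. For each intron, the number of coding bases seen so far, taken modulo 3, tells which of the three intron models applies, and that in turn fixes how many extra coding ('spacer') states sit before and after the intron. The term "phase" and both function names are introduced here for illustration only and do not appear in the chapter.

```python
def intron_phases(exon_lengths):
    """Return, for each intron, how many bases of a split codon precede it (0, 1 or 2)."""
    phases = []
    coding_bases = 0
    for length in exon_lengths[:-1]:  # there is one intron after every exon but the last
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases

def spacer_states(phase):
    """Extra coding states (before, after) the intron, relative to the top line of Fig. 4.12."""
    return phase, (3 - phase) % 3

# Hypothetical gene with exons of 4, 5 and 6 coding bases (15 in total).
exons = [4, 5, 6]
for phase in intron_phases(exons):
    print(phase, spacer_states(phase))
```

For every intron the extra states before and after add up to either zero or three bases, which is why the reading frame of one exon always fits the frame of the next.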

There are obviously many possible variations of the model. One can add more states to the signal sensors, include models of promoter elements and untranslated regions of the gene, and so forth.

4.5 Further reading

A general introduction can be found in [2], and one aimed more at biological readers in [1]. The first applications to sequence analysis were probably modeling compositional differences between various DNA types [10] and studying the compositional structure of genomes [11]. The initial work on using hidden Markov models (HMMs) as 'probabilistic profiles' of protein families was presented at a conference in the spring of 1992 and in a technical report from the same year, and it was published in [12, 13]. The idea was quickly taken up by others [14, 15]. Independently, some very similar ideas were also developed in [16, 17]. Also the generalized profiles [18] are very similar.

Estimation and multiple alignment are described in [13] in detail, and in [19] some of the practical methods are further discussed. Alternative methods for model estimation are presented in [14, 20]. Methods for scoring sequences against a profile HMM were given in [13], but these issues have more recently been addressed in [21]. The basic pseudocount method is also explained in [13], and more advanced methods are discussed in [22, 23, 24, 25, 26].

A review of profile HMMs can be found in [27], and in [1] profile HMMs are discussed in great detail. Also [28] will undoubtedly contain good material on profile HMMs.

Some of the recent applications of profile HMMs to proteins are: detection of fibronectin type III domains in yeast [29], a database of protein domain families [30], protein topology recognition from secondary structure [31], and modeling of a protein splicing domain [32].

There are two program packages available free of charge to the academic community. One, developed by Sean Eddy, is called hmmer (pronounced 'hammer'), and can be obtained from his web-site (http://genome.wustl.edu/eddy/hmm.html). The other one, called SAM (http://www.cse.ucsc.edu/research/compbio/sam.html), was developed by myself and the group at UC Santa Cruz, and it is now being maintained and further developed under the command of Richard Hughey.

The gene finder sketched above is called HMMgene. There are many details omitted, such as special methods for estimation and prediction described in [33]. It is still under development, and it is possible to follow the development and test the current version at the web site http://www.cbs.dtu.dk/services/HMMgene/.

Methods for automated gene finding go back a long time, see [34] for a review. The first HMM based gene finder is probably EcoParse developed for E. coli [35]. VEIL [36] is a recent HMM based gene finder for human genes. The main difference from HMMgene is that it does not use high order states (neither does EcoParse), which makes good modeling of coding regions harder.

Two recent methods use so-called generalized HMMs. Genie [37, 38, 39] combines neural networks into an HMM-like model, whereas GENSCAN [40] is more similar to HMMgene, but uses a different model type for splice sites. Also, the generalized HMM can explicitly use exon length distributions, which is not possible in a standard HMM. Web pointers to gene finding can be found at http://www.cbs.dtu.dk/krogh/genefinding.html.

Other applications of HMMs related to gene finding are: detection of short protein coding regions and analysis of translation initiation sites in Cyanobacterium [41, 42], characterization of prokaryotic and eukaryotic promoters [43], and recognition of branch points [44].

Apart from the areas mentioned here, HMMs have been used for prediction of protein secondary structure [45], modeling an oscillatory pattern in nucleosomes [46], modeling site dependence of evolutionary rates [47], and for including evolutionary information in protein secondary structure prediction [48].

Acknowledgements

Henrik Nielsen, Steven Salzberg, and Steen Knudsen are gratefully acknowledged for comments on the manuscript. This work was supported by the Danish National Research Foundation.

References

[1] Durbin, R. M., Eddy, S. R., Krogh, A., and Mitchison, G. (1998) Biological Sequence Analysis. Cambridge University Press. To appear.

[2] Rabiner, L. R. (1989) Proc. IEEE 77, 257–286.

[3] Gribskov, M., McLachlan, A. D., and Eisenberg, D. (1987) Proc. of the Nat. Acad. of Sciences of the U.S.A. 84, 4355–4358.

[4] Staden, R. (1984) Nucleic Acids Research 12, 505–519.

[5] Staden, R. (1988) Computer Applications in the Biosciences 4, 53–60.

[6] Bairoch, A., Bucher, P., and Hofmann, K. (1997) Nucleic Acids Research 25, 217–221.

[7] Searls, D. B. (1992) American Scientist 80, 579–591.

[8] Dong, S. and Searls, D. B. (1994) Genomics 23, 540–551.

[9] Borodovsky, M. and McIninch, J. (1993) Computers and Chemistry 17, 123–133.

[10] Churchill, G. A. (1989) Bull Math Biol 51, 79–94.

[11] Churchill, G. A. (1992) Computers and Chemistry 16, 107–115.

[12] Haussler, D., Krogh, A., Mian, I. S., and Sjolander, K. (1993) Protein modeling using hidden Markov models: Analysis of globins. In Mudge, T. N., Milutinovic, V., and Hunter, L. (Eds.), Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, volume 1, pp. 792–802. Los Alamitos, California. IEEE Computer Society Press.

[13] Krogh, A., Brown, M., Mian, I. S., Sjolander, K., and Haussler, D. (1994) Journal of Molecular Biology 235, 1501–1531.

[14] Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. A. (1994) Proc. of the Nat. Acad. of Sciences of the U.S.A. 91, 1059–1063.

[15] Eddy, S. R., Mitchison, G., and Durbin, R. (1995) J Comput Biol 2, 9–23.

[16] Stultz, C. M., White, J. V., and Smith, T. F. (1993) Protein Science 2, 305–314.

[17] White, J. V., Stultz, C. M., and Smith, T. F. (1994) Mathematical Biosciences 119, 35–75.

[18] Bucher, P., Karplus, K., Moeri, N., and Hofmann, K. (1996) Computers and Chemistry 20, 3–24.

[19] Hughey, R. and Krogh, A. (1996) CABIOS 12, 95–107.

[20] Eddy, S. R. (1995) Multiple alignment using hidden Markov models. In Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T., and Wodak, S. (Eds.), Proc. of Third Int. Conf. on Intelligent Systems for Molecular Biology, volume 3, pp. 114–120. Menlo Park, CA. AAAI Press.

[21] Barrett, C., Hughey, R., and Karplus, K. (1997) Computer Applications in the Biosciences 13, 191–199.

[22] Brown, M., Hughey, R., Krogh, A., Mian, I. S., Sjolander, K., and Haussler, D. (1993) Using Dirichlet mixture priors to derive hidden Markov models for protein families. In Hunter, L., Searls, D., and Shavlik, J. (Eds.), Proc. of First Int. Conf. on Intelligent Systems for Molecular Biology, pp. 47–55. Menlo Park, CA. AAAI/MIT Press.

[23] Tatusov, R. L., Altschul, S. F., and Koonin, E. V. (1994) Proc. of the Nat. Acad. of Sciences of the U.S.A. 91, 12091–12095.

[24] Karplus, K. (1995) Evaluating regularizers for estimating distributions of amino acids. In Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T., and Wodak, S. (Eds.), Proc. of Third Int. Conf. on Intelligent Systems for Molecular Biology, volume 3, pp. 188–196. Menlo Park, CA. AAAI Press.

[25] Henikoff, J. G. and Henikoff, S. (1996) Computer Applications in the Biosciences 12, 135–143.

[26] Sjolander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I. S., and Haussler, D. (1996) CABIOS 12, 327–345.

[27] Eddy, S. R. (1996) Current Opinion in Structural Biology 6, 361–365.

[28] Baldi, P. and Brunak, S. (1998) Bioinformatics - The Machine Learning Approach. MIT Press, Cambridge MA. To appear.

[29] Bateman, A. and Chothia, C. (1997) Curr. Biol. 6, 1544–1547.

[30] Sonnhammer, E. L. L., Eddy, S. R., and Durbin, R. (1997) Proteins 28, 405–420.

[31] Di Francesco, V., Garnier, J., and Munson, P. J. (1997) Journal of Molecular Biology 267, 446–463.

[32] Dalgaard, J. Z., Moser, M. J., Hughey, R., and Mian, I. S. (1997) Journal of Computational Biology 4, 193–214.

[33] Krogh, A. (1997) Two methods for improving performance of a HMM and their application for gene finding. In Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C., and Valencia, A. (Eds.), Proc. of Fifth Int. Conf. on Intelligent Systems for Molecular Biology, pp. 179–186. Menlo Park, CA. AAAI Press.

[34] Fickett, J. W. (1996) Trends Genet. 12, 316–320.

[35] Krogh, A., Mian, I. S., and Haussler, D. (1994) Nucleic Acids Research 22, 4768–4778.

[36] Henderson, J., Salzberg, S., and Fasman, K. H. (1997) Finding genes in DNA with a hidden Markov model. Journal of Computational Biology.

[37] Kulp, D., Haussler, D., Reese, M. G., and Eeckman, F. H. (1996) A generalized hidden Markov model for the recognition of human genes in DNA. In States, D., Agarwal, P., Gaasterland, T., Hunter, L., and Smith, R. (Eds.), Proc. Conf. on Intelligent Systems in Molecular Biology, pp. 134–142. Menlo Park, CA. AAAI Press.

[38] Reese, M. G., Eeckman, F. H., Kulp, D., and Haussler, D. (1997) Improved splice site detection in Genie. In Waterman, M. (Ed.), Proceedings of the First Annual International Conference on Computational Molecular Biology (RECOMB). New York. ACM Press.

[39] Kulp, D., Haussler, D., Reese, M. G., and Eeckman, F. H. (1997) Integrating database homology in a probabilistic gene structure model. In Altman, R. B., Dunker, A. K., Hunter, L., and Klein, T. E. (Eds.), Proceedings of the Pacific Symposium on Biocomputing. New York. World Scientific.

[40] Burge, C. and Karlin, S. (1997) Journal of Molecular Biology 268, 78–94.

[41] Yada, T. and Hirosawa, M. (1996) DNA Res. 3, 355–361.

[42] Yada, T., Sazuka, T., and Hirosawa, M. (1997) DNA Res. 4, 1–7.

[43] Pedersen, A. G., Baldi, P., Brunak, S., and Chauvin, Y. (1996) Characterization of prokaryotic and eukaryotic promoters using hidden Markov models. In Proc. of Fourth Int. Conf. on Intelligent Systems for Molecular Biology, pp. 182–191. Menlo Park, CA. AAAI Press.

[44] Tolstrup, N., Rouze, P., and Brunak, S. (1997) Nucleic Acids Research 25, 3159–3164.

[45] Asai, K., Hayamizu, S., and Handa, K. (1993) Computer Applications in the Biosciences 9, 141–146.

[46] Baldi, P., Brunak, S., Chauvin, Y., and Krogh, A. (1996) Journal of Molecular Biology 263, 503–510.

[47] Felsenstein, J. and Churchill, G. A. (1996) Molecular Biological Evolution 13.

[48] Goldman, N., Thorne, J. L., and Jones, D. T. (1996) Journal of Molecular Biology 263, 196–208.
