Sequence classification & hidden Markov models
DESCRIPTION
Bioinformatics, Models & algorithms, 8th November 2005. Patrik Johansson, Dept. of Cell & Molecular Biology, Uppsala University. Sequence classification & hidden Markov models. A family of proteins share a similar structure but not necessarily sequence.

TRANSCRIPT
Sequence classification & hidden Markov models
Bioinformatics,
Models & algorithms,
8th November 2005
Patrik Johansson,
Dept. of Cell & Molecular Biology,
Uppsala University
A family of proteins shares a similar structure, but not necessarily a similar sequence
[Figure: two clusters of known sequences, families A and B, with an unknown sequence s between them]

Classification of an unknown sequence s to family A or B using HMMs
Hidden Markov Models, introduction
• General method for pattern recognition, cf. neural networks
• An HMM generates sequences / sequence distributions
• Markov chain of events
The outcome, e.g. 'Heads Heads Tails', is generated by a hidden Markov chain Γ
[Figure: a sequence of coin tosses produced by switching between three hidden coins]

Three coins A, B & C give a Markov chain Γ = CAABA..
Hidden Markov Models, introduction..
• The model M emits a symbol (H or T) in each state i according to an emission probability $e_i$
• The next state j is chosen according to a transition probability $a_{i,j}$
For example, the sequence s = 'Tails Heads Tails' generated over the path Γ = BCC has probability

$$P(s \mid M, \Gamma) = a_{0,B}\, e_B(\text{Tails})\; a_{B,C}\, e_C(\text{Heads})\; a_{C,C}\, e_C(\text{Tails})$$

The distribution over all sequences S that the model can generate is normalized;

$$\sum_{s' \in S} P(s' \mid M) = 1$$

[Figure: trellis of the states A, B, C emitting 'Tails Heads Tails']
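As a concrete illustration, a minimal Python sketch of the coin model; the probability values below are invented for the example, since the slides give no numbers:

```python
# Three-coin HMM sketch: states A, B, C plus a begin state "0".
# All probability values are illustrative assumptions.
a = {  # transition probabilities a[i][j]
    "0": {"A": 0.3, "B": 0.4, "C": 0.3},
    "A": {"A": 0.5, "B": 0.3, "C": 0.2},
    "B": {"A": 0.2, "B": 0.3, "C": 0.5},
    "C": {"A": 0.3, "B": 0.2, "C": 0.5},
}
e = {  # emission probabilities e_i(symbol): each coin has its own bias
    "A": {"Heads": 0.5, "Tails": 0.5},
    "B": {"Heads": 0.7, "Tails": 0.3},
    "C": {"Heads": 0.1, "Tails": 0.9},
}

def path_probability(symbols, path):
    """P(s | M, path): product of transition and emission probabilities."""
    p, prev = 1.0, "0"  # start in the begin state
    for sym, state in zip(symbols, path):
        p *= a[prev][state] * e[state][sym]
        prev = state
    return p

# The example from the slide: s = 'Tails Heads Tails' over the path BCC.
print(path_probability(["Tails", "Heads", "Tails"], ["B", "C", "C"]))
```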
Profile hidden Markov Model architecture
• A first approach for sequence distribution modelling
[Figure: linear model with begin state B, match states M1 … MN and end state E]

$$P(s \mid M) = \prod_{j=1}^{N} e_{M_j}(s_j)$$
Profile hidden Markov Model architecture..
• Insertion modelling
[Figure: insert state Ij attached between match states Mj and Mj+1]

Insertions are modelled as random; $e_{I_j}(a) = q(a)$, the background distribution. The probability of k insertions after match state j then becomes

$$P(k\ \text{insertions}) = a_{M_j,I_j}\, a_{I_j,I_j}^{\,k-1}\, a_{I_j,M_{j+1}} \prod_{l=0}^{k-1} q(s_{i+l})$$
Profile Hidden Markov Model architecture..
• Deletion modelling
[Figure: deletions modelled either by direct jump transitions past match states or, alternatively, by silent delete states Dj]
Profile Hidden Markov Model architecture..
Insert & delete states are generalized to all positions. The model M can generate sequences from state B by successive emissions and transitions until state E is reached
[Figure: the full profile HMM architecture; each position j has a match state Mj, an insert state Ij and a delete state Dj, connected between begin state B and end state E]
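To make the architecture concrete, a minimal Python container for these states and parameters (a sketch only; the attribute names are this example's own, not any library's):

```python
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

@dataclass
class ProfileHMM:
    match_e: list  # match_e[j][a] = e_Mj(a), one dict per match state
    t: list        # t[j][(x, y)] = log transition score, x, y in 'MID';
                   # j = 0 leaves the begin state, j = N enters the end state
    q: dict        # background distribution q(a), also used for insert states

    @property
    def length(self):
        # Number of match states N.
        return len(self.match_e)
```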
Probabilistic sequence modelling
• Classification criteria
Assign the sequence s to the model with the highest posterior probability;

$$P(M_i \mid s) \geq P(M_j \mid s), \qquad j = 1 \ldots k \qquad (1)$$

Bayes' theorem;

$$P(M \mid s) = \frac{P(s \mid M)\, P(M)}{P(s)} \qquad (2)$$

$$\frac{P(M \mid s)}{P(N \mid s)} = \frac{P(s \mid M)\, P(M)}{P(s \mid N)\, P(N)} \qquad (3)$$

..but, P(M) & P(s)..?
Probabilistic sequence modelling..
If N models the whole sequence space (N = q), the criterion ( 3 ) reduces to

$$\frac{P(M \mid s)}{P(q \mid s)} \geq 1 \qquad (4)$$

Taking logarithms (more convenient, since $P(s \mid M) \ll 1$);

$$\log P(M \mid s) - \log P(q \mid s) = \log P(s \mid M) - \log P(s \mid q) + \log\frac{P(M)}{P(q)}$$

Def., the log-odds score V;

$$V = \text{score} = \log P(s \mid M) - \log P(s \mid q) \qquad (5)$$
Probabilistic sequence modelling..
Eq. ( 4 ) & ( 5 ) give a new classification criterion;

$$\text{score} \;\geq\; \log_z P(q) - \log_z P(M) \;\equiv\; d \qquad (6)$$

..for a certain significance level (i.e. the accepted number $n_d$ of incorrect classifications when searching a database of n sequences) a threshold d is required;

$$d = \log_z n - \log_z n_d \qquad (7)$$

The criterion is thus

$$\text{score} = \log_z P(s \mid M) - \log_z P(s \mid q) \;\geq\; d$$
Probabilistic sequence modelling..
Example
If the significance level is chosen as one incorrect classification (false positive) per 1000 searches of a database of n = 10000 sequences, i.e. $n_d = 10^{-3}$;

$$d = \ln n - \ln n_d = \ln 10^{7} \approx 16\ \text{nits} \quad (z = e), \qquad d = \log_2 10^{7} \approx 23\ \text{bits} \quad (z = 2)$$
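The same arithmetic as a small Python check:

```python
import math

# Threshold d from eq. (7): d = log_z(n) - log_z(n_d), with n_d the accepted
# number of false positives per search (here 1 per 1000 searches).
n, n_d = 10_000, 1 / 1000

d_nits = math.log(n) - math.log(n_d)    # z = e
d_bits = math.log2(n) - math.log2(n_d)  # z = 2
print(f"{d_nits:.1f} nits, {d_bits:.1f} bits")  # about 16.1 nits, 23.3 bits
```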
[Figure: families A and B in sequence space; a high threshold d keeps only true positives, a low d also lets in false positives]
One can define sensitivity, 'how many are found';

$$\text{recall, accuracy, sensitivity} = \frac{\text{true positives}}{\text{true examples}}$$

..and selectivity, 'how many are correct';

$$\text{precision, reliability, selectivity} = \frac{\text{true positives}}{\text{all positives}}$$
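As a quick Python illustration (the counts are invented):

```python
# Sensitivity and selectivity for one search result (illustrative numbers).
true_positives = 90   # family members found above the threshold
true_examples = 100   # family members actually present in the database
all_positives = 120   # everything scoring above the threshold

sensitivity = true_positives / true_examples  # 'how many are found'
selectivity = true_positives / all_positives  # 'how many are correct'
print(f"sensitivity {sensitivity:.2f}, selectivity {selectivity:.2f}")
```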
Model characteristics
Model construction
• From initial alignment
The most common method. Start from an initial multiple alignment of e.g. a protein family
• Iteratively
By successive database searches, incorporating new similar sequences into the model
• Neural-inspired
The model is trained using some continuous minimization algorithm, e.g. Baum-Welch, steepest descent etc.
Model construction..
A short family alignment gives a simple model M, potential match states marked with (*);

A _ _ _ K
A D _ _ R
A D _ _ R
S D _ _ K
A E L G R
* *     *

[Figure: the corresponding model from state B to state E]
Model construction..

A more generalized model. Ex. evaluate the sequence s = 'AIEH'; it can be aligned to the model in several alternative ways, each corresponding to a different path from B to E;

Alt. 1;
A _ _ _ K
A D _ _ R
A D _ _ R
S D _ _ K
A E L G R
* *     *
A I E _ H

Alt. 2;
A _ _ _ _ K
A D _ _ _ R
A D _ _ _ R
S D _ _ _ K
A E L G _ R
* *       *
_ A I E H _

Alt. 3;
A _ _ _ _ K
A _ D _ _ R
A _ D _ _ R
S _ D _ _ K
A _ E I G R
*   *     *
A I E _ _ H
Sequence evaluation
The optimal alignment, i.e. the path with the greatest probability of generating the sequence s, can be determined through dynamic programming

[Figure: match state Mj is reached from Mj-1, Ij-1 or Dj-1]

$$V_j^M(s_i) = \log\frac{e_{M_j}(s_i)}{q(s_i)} + \max\begin{cases} V_{j-1}^M(s_{i-1}) + \log a_{M_{j-1},M_j} \\ V_{j-1}^I(s_{i-1}) + \log a_{I_{j-1},M_j} \\ V_{j-1}^D(s_{i-1}) + \log a_{D_{j-1},M_j} \end{cases}$$

The maximum log-odds score $V_j^M(s_i)$ for match state j emitting $s_i$ is calculated from the emission score plus the maximum over the previous scores plus the corresponding transition score
Sequence evaluation..
Viterbi's algorithm;

$$V_j^M(s_i) = \log\frac{e_{M_j}(s_i)}{q(s_i)} + \max\begin{cases} V_{j-1}^M(s_{i-1}) + \log a_{M_{j-1},M_j} \\ V_{j-1}^I(s_{i-1}) + \log a_{I_{j-1},M_j} \\ V_{j-1}^D(s_{i-1}) + \log a_{D_{j-1},M_j} \end{cases} \qquad (8)$$

$$V_j^I(s_i) = \log\frac{e_{I_j}(s_i)}{q(s_i)} + \max\begin{cases} V_j^M(s_{i-1}) + \log a_{M_j,I_j} \\ V_j^I(s_{i-1}) + \log a_{I_j,I_j} \\ V_j^D(s_{i-1}) + \log a_{D_j,I_j} \end{cases} \qquad (9)$$

$$V_j^D(s_i) = \max\begin{cases} V_{j-1}^M(s_i) + \log a_{M_{j-1},D_j} \\ V_{j-1}^I(s_i) + \log a_{I_{j-1},D_j} \\ V_{j-1}^D(s_i) + \log a_{D_{j-1},D_j} \end{cases} \qquad (10)$$

(Note that $e_{I_j}(a) = q(a)$, so the emission term vanishes in eq ( 9 ).) Initialization; $V_0(s_0) = 0$, since P(the path begins with $s_1$) = 1, and $V_i(s_0) = -\infty$ for $i \neq 0$, since P(the path does not begin with $s_1$) = 0
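Below is a sketch of eqs ( 8 )-( 10 ) in Python, written against the hypothetical ProfileHMM container sketched earlier (the names are this sketch's assumptions, not a library API). It fills the three dynamic-programming matrices and returns the optimal log-odds score, without traceback:

```python
import math

LOG0 = float("-inf")  # log(0); marks impossible paths

def viterbi_score(seq, hmm):
    """Optimal log-odds score of seq against a profile HMM, eqs (8)-(10).

    Assumes nonzero emission probabilities. hmm.t[j][(x, y)] holds the log
    transition score from state x at position j to state y at position j+1
    (for 'I' targets, at the same position j).
    """
    n, N = len(seq), hmm.length
    # V[x][j][i]: best score ending in state x_j with i symbols consumed.
    V = {x: [[LOG0] * (n + 1) for _ in range(N + 1)] for x in "MID"}
    V["M"][0][0] = 0.0  # the begin state B is treated as M0

    for j in range(1, N + 1):
        for i in range(n + 1):
            if i > 0:
                a = seq[i - 1]
                em = math.log(hmm.match_e[j - 1][a] / hmm.q[a])
                V["M"][j][i] = em + max(
                    V[x][j - 1][i - 1] + hmm.t[j - 1][(x, "M")] for x in "MID")
                # Insert emissions follow q, so their emission term is zero
                # (N-terminal inserts at I0 are omitted for brevity).
                V["I"][j][i] = max(
                    V[x][j][i - 1] + hmm.t[j][(x, "I")] for x in "MID")
            # Delete states are silent: no symbol is consumed.
            V["D"][j][i] = max(
                V[x][j - 1][i] + hmm.t[j - 1][(x, "D")] for x in "MID")
    # Best transition into the end state E, treated as the M target at N+1.
    return max(V[x][N][n] + hmm.t[N][(x, "M")] for x in "MID")
```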
Parameter estimation, background
• Proteins with similar structures can have very different sequences
• Classical sequence alignment based only on heuristic rules & parameters cannot deal with sequence identities below ~ 50-60%
• Substitution matrices add static a priori information about amino acids and protein sequences, giving good alignments down to ~ 25-30% sequence identity, e.g. CLUSTAL
• How to get further down into ‘the twilight zone’..?
- More, and more dynamic, a priori information..!
Parameter estimation
A _ _ _ K
A D _ _ R
A D _ _ R
S D _ _ K
A E L G R
* *     *

Probability of emitting an alanine in the first match state, $e_{M_1}(\text{'A'})$..?

• Maximum likelihood estimation

$$e_{M_j}(a) = \frac{c_j(a)}{\sum_{a'} c_j(a')}$$

where $c_j(a)$ is the number of times amino acid a is observed in column j
[Bar chart: maximum likelihood estimate of the emission distribution over the 20 amino acids; with only five sequences, most probabilities are zero]
Parameter estimation..
• Add-one pseudocount estimation

$$e_{M_j}(a) = \frac{c_j(a) + 1}{\sum_{a'} c_j(a') + 20}$$

• Background pseudocount estimation

$$e_{M_j}(a) = \frac{c_j(a) + A\, q(a)}{\sum_{a'} c_j(a') + A}$$
[Bar charts: emission distributions estimated with add-one and background pseudocounts; the zero counts are replaced by small, nonzero probabilities]
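The three estimators above as a Python sketch, evaluated on the first column of the example alignment (A, A, A, S, A); the flat background distribution q is an assumption for illustration:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ml_estimate(column):
    """Maximum likelihood: e_Mj(a) = c_j(a) / sum_a' c_j(a')."""
    c = Counter(column)
    total = sum(c.values())
    return {a: c[a] / total for a in AMINO_ACIDS}

def add_one_estimate(column):
    """Add-one pseudocounts: (c_j(a) + 1) / (sum_a' c_j(a') + 20)."""
    c = Counter(column)
    total = sum(c.values()) + 20
    return {a: (c[a] + 1) / total for a in AMINO_ACIDS}

def background_estimate(column, q, A=20):
    """Background pseudocounts: (c_j(a) + A q(a)) / (sum_a' c_j(a') + A)."""
    c = Counter(column)
    total = sum(c.values()) + A
    return {a: (c[a] + A * q[a]) / total for a in AMINO_ACIDS}

col = "AAASA"                          # column 1 of the example alignment
q = {a: 1 / 20 for a in AMINO_ACIDS}   # flat background, for illustration
print(ml_estimate(col)["A"],           # 0.8
      add_one_estimate(col)["A"],      # 0.2
      background_estimate(col, q)["A"])  # 0.2
```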
Parameter estimation..
• Substitution mixture estimation

Score;

$$s(a,b) = \log\frac{P(a,b)}{q(a)\, q(b)}, \qquad P(b \mid a) = q(b)\, e^{s(a,b)}$$

Maximum likelihood gives pseudocounts;

$$f_j(a) = \frac{c_j(a)}{\sum_{a'} c_j(a')}, \qquad \alpha_j(a) = A \sum_b f_j(b)\, P(a \mid b)$$

Total estimation;

$$e_{M_j}(a) = \frac{c_j(a) + \alpha_j(a)}{\sum_{a'} \left( c_j(a') + \alpha_j(a') \right)}$$
[Bar chart: emission distribution estimated with substitution-mixture pseudocounts]
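A compact Python sketch of the substitution-mixture estimate; the two-letter alphabet and the joint probabilities P(a,b) are invented to keep the example small (in practice they come from a substitution matrix such as BLOSUM):

```python
# Toy two-letter alphabet with an invented symmetric joint distribution.
ALPHABET = "AS"
q = {"A": 0.6, "S": 0.4}
P_joint = {("A", "A"): 0.45, ("A", "S"): 0.15,
           ("S", "A"): 0.15, ("S", "S"): 0.25}

def P_cond(a, b):
    """P(a | b) = P(b, a) / q(b), equivalently q(a) * exp(s(b, a))."""
    return P_joint[(b, a)] / q[b]

def mixture_estimate(column, A=5):
    counts = {a: column.count(a) for a in ALPHABET}
    total = sum(counts.values())
    f = {a: counts[a] / total for a in ALPHABET}              # ML frequencies
    alpha = {a: A * sum(f[b] * P_cond(a, b) for b in ALPHABET)
             for a in ALPHABET}                               # pseudocounts
    denom = sum(counts[a] + alpha[a] for a in ALPHABET)
    return {a: (counts[a] + alpha[a]) / denom for a in ALPHABET}

print(mixture_estimate("AAASA"))
```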
Parameter estimation..
All the above methods are, in spite of their dynamic implementation, still based on heuristic parameters

A method that compensates for and complements the lack of data in a statistically correct way;

• Dirichlet mixture estimation

Looking at sequence alignments, several different amino acid distributions seem to be recurring, not just the background distribution q

Assume that there are k probability densities that generate these;

$$p(\vec\theta) = g_1\, p_1(\vec\theta) + \ldots + g_k\, p_k(\vec\theta), \qquad \sum_{j=1}^{k} g_j = 1$$
Parameter estimation, Dirichlet Mixture style..
Given the data, a count vector $\vec n = (n_1 \ldots n_{20})$ of observed amino acids, this method allows a linear combination of the k individual estimations, weighted with the probability that $\vec n$ was generated by each component;

$$P(\vec\theta \mid \vec n) = P(1 \mid \vec n)\, p_1(\vec\theta \mid \vec n) + \ldots + P(k \mid \vec n)\, p_k(\vec\theta \mid \vec n)$$

The k components can be modelled from a curated database of alignments. Using some parametric form of the probability density, an explicit expression for the probability that $\vec n$ has been generated by the jth component can be derived, giving the total estimate;

$$e_i = \sum_{j=1}^{k} P(j \mid \vec n)\; \frac{n_i + \alpha_{j,i}}{\sum_{l=1}^{20} \left( n_l + \alpha_{j,l} \right)}$$

where $\vec\alpha_j$ are the parameters of the jth component
[Bar chart: emission distribution estimated with a Dirichlet mixture over the 20 amino acids]
Ex. the Dirichlet density as a parametric form;

$$p(\vec\theta) \propto \prod_{i=1}^{20} \theta_i^{\alpha_i - 1}, \qquad \sum_{i=1}^{20} \theta_i = 1$$
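A runnable sketch of the Dirichlet-mixture estimate above; the two components, their weights and the toy three-letter alphabet are invented for illustration (a real mixture has k components over the 20 amino acids, fitted to a curated alignment database):

```python
import math
from math import lgamma

# Invented two-component mixture over a toy 3-letter alphabet.
components = [
    (0.7, [5.0, 1.0, 1.0]),  # weight g_j, Dirichlet parameters alpha_j
    (0.3, [1.0, 1.0, 1.0]),  # a flat, background-like component
]

def log_evidence(n, alpha):
    """log P(n | j-th component), up to the multinomial coefficient
    (which is the same for every component and cancels in P(j | n))."""
    out = lgamma(sum(alpha)) - lgamma(sum(n) + sum(alpha))
    for ni, ai in zip(n, alpha):
        out += lgamma(ni + ai) - lgamma(ai)
    return out

def dirichlet_mixture_estimate(n):
    """e_i = sum_j P(j | n) (n_i + alpha_ji) / sum_l (n_l + alpha_jl)."""
    logw = [math.log(g) + log_evidence(n, a) for g, a in components]
    m = max(logw)
    w = [math.exp(x - m) for x in logw]
    post = [x / sum(w) for x in w]  # P(j | n)
    e = [0.0] * len(n)
    for pj, (_, alpha) in zip(post, components):
        denom = sum(n) + sum(alpha)
        for i, ni in enumerate(n):
            e[i] += pj * (ni + alpha[i]) / denom
    return e

# Count vector n = (4, 1, 0): the first letter dominates the column.
print(dirichlet_mixture_estimate([4, 1, 0]))
```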
Parameter estimation, Dirichlet Mixture style..
The k components describe peaks of amino acid distributions in a multidimensional space

Depending on where in this space our count vector $\vec n$ lies, i.e. depending on which components can be assumed to have generated $\vec n$, distribution information is incorporated into the probability estimate e

[Figure: the count vector $\vec n$ among the component peaks in distribution space]
Classification example

Alignment of some known glycoside hydrolase family 16 sequences
• Define which columns are to be regarded as match states (*)
• Build the corresponding model M & HMM graph
• Estimate all emission and transition probabilities, $e_j$ & $a_{j,k}$
• Evaluate the log-odds score / probability that an unknown sequence s has been generated by M using Viterbi's algorithm
• If score(s | M) > d, the sequence can be classified as a GH16 family member (as in the sketch below)
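Putting the steps together as a hypothetical sketch, reusing viterbi_score and the eq. ( 7 ) threshold from the earlier sketches:

```python
import math

def classify(seq, hmm, n_db=10_000, n_d=1e-3):
    """Return (is_family_member, score) for a sequence against a model."""
    d = math.log(n_db) - math.log(n_d)  # threshold in nits, eq. (7)
    score = viterbi_score(seq, hmm)     # optimal log-odds, eqs (8)-(10)
    return score > d, score

# e.g. is_member, score = classify("SDGSYT...", gh16_model)
```

where gh16_model would be a ProfileHMM built from the family alignment.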
Classification example..
A certain sequence s1=WHKLRQ.. is evaluated and gets a score of -17.63 nits, i.e. the probability that M has generated s1 is very small
Another sequence s2=SDGSYT.. gets a score of 27.49 nits and can with good significance be classified as a family member
Summary
• Hidden Markov models are used mainly for classification / searching (PFAM), but also for sequence mapping / alignment
• As compared to normal alignment, a position-specific approach is used for sequence distributions, insertions and deletions
• Model building is usually a compromise between sensitivity and selectivity. If more a priori information is incorporated, the sensitivity goes up whereas the selectivity goes down