sequence classification & hidden markov models


Page 1: Sequence classification &  hidden Markov models

Sequence classification & hidden Markov models

Bioinformatics,

Models & algorithms,

8th November 2005

Patrik Johansson,

Dept. of Cell & Molecular Biology,

Uppsala University

Page 2: Sequence classification &  hidden Markov models

A family of proteins share a similar structure but not necessarily sequence

Page 3: Sequence classification &  hidden Markov models

[Figure: two clusters of sequences, families A and B, with an unknown sequence s between them]

Classification of an unknown sequence s to family A or B using HMMs

Page 4: Sequence classification &  hidden Markov models

Hidden Markov Models, introduction

• General method for pattern recognition, cf. neural networks

• An HMM generates sequences / sequence distributions

• Markov chain of events

Outcome, e.g. 'Heads Heads Tails', is generated by a hidden Markov chain Γ

[Figure: three coins A, B & C and the hidden chain of coin choices]

Three coins A, B & C give a Markov chain Γ = CAABA..

Page 5: Sequence classification &  hidden Markov models

Hidden Markov Models, introduction..

• Model M emits a symbol (T, H) in each state i according to an emission probability e_i

• The next state j is chosen according to a transition probability a_{i,j}

$$P(s \mid M) = a_{0,B_1}\, e_{B_1}(\text{Tails}) \cdot a_{B_1,C_2}\, e_{C_2}(\text{Heads}) \cdot a_{C_2,C_3}\, e_{C_3}(\text{Tails})$$

$$\sum_{s' \in S} P(s' \mid M) = 1$$

[Figure: the path Γ = BCC through the coin states, emitting 'Tails Heads Tails']

e.g. the sequence s = 'Tails Heads Tails' over the path Γ = BCC
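To make the path probability concrete, here is a minimal Python sketch (all probabilities are invented for illustration) that evaluates the product above for s = 'Tails Heads Tails' along the path Γ = BCC:

```python
# Minimal sketch of P(s, path | M) for the three-coin HMM.
# All numbers are invented for illustration.

# emission probabilities e_i: P(symbol | coin)
emit = {
    "A": {"Heads": 0.5, "Tails": 0.5},   # fair coin
    "B": {"Heads": 0.3, "Tails": 0.7},   # biased coin
    "C": {"Heads": 0.6, "Tails": 0.4},   # biased coin
}

# transition probabilities a_{i,j}; "0" is the start state
trans = {
    "0": {"A": 0.4, "B": 0.3, "C": 0.3},
    "A": {"A": 0.4, "B": 0.3, "C": 0.3},
    "B": {"A": 0.3, "B": 0.4, "C": 0.3},
    "C": {"A": 0.3, "B": 0.3, "C": 0.4},
}

def path_probability(symbols, path):
    """P(s, path | M) = product of a_{i,j} * e_j(symbol) along the path."""
    p, prev = 1.0, "0"
    for sym, state in zip(symbols, path):
        p *= trans[prev][state] * emit[state][sym]
        prev = state
    return p

# the slide's example: s = 'Tails Heads Tails' over the path BCC
print(path_probability(["Tails", "Heads", "Tails"], "BCC"))
```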

Page 6: Sequence classification &  hidden Markov models

Profile hidden Markov Model architecture

• A first approach for sequence distribution modelling

[Figure: linear model B → M1 → … → Mj → … → MN → E]

$$P(s \mid M) = \prod_{j=1}^{N} e_{M_j}(s_j)$$
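As a quick sketch of this first, match-only architecture (the position-specific distributions below are invented):

```python
# One invented position-specific distribution per match state M1..M3.
match_emissions = [
    {"A": 0.8, "S": 0.2},   # M1
    {"D": 0.6, "E": 0.4},   # M2
    {"K": 0.5, "R": 0.5},   # M3
]

def p_sequence(s):
    """P(s | M) = product over j of e_Mj(s_j); zero if a residue is unseen."""
    p = 1.0
    for dist, a in zip(match_emissions, s):
        p *= dist.get(a, 0.0)
    return p

print(p_sequence("ADK"))   # 0.8 * 0.6 * 0.5 = 0.24
```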

Page 7: Sequence classification &  hidden Markov models

Profile hidden Markov Model architecture..

• Insertion modelling

[Figure: insert state Ij attached between Mj and Mj+1 in the chain B → … → Mj → Mj+1 → … → E]

Insertions are modelled as random; e_{I_j}(a) = q(a)

$$P(k \text{ insertions}) = a_{M_j,I_j}\; a_{I_j,I_j}^{\,k-1}\; a_{I_j,M_{j+1}} \prod_{l=0}^{k-1} q(s_{i+l})$$
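A minimal sketch of this insertion probability, assuming e_{I_j}(a) = q(a) as above; the transition values and the tiny background table are invented defaults:

```python
# P(k insertions) = a_{Mj,Ij} * a_{Ij,Ij}^(k-1) * a_{Ij,Mj+1} * prod q(s_i..)
def p_insertions(k, residues, q, a_mi=0.1, a_ii=0.4, a_im=0.6):
    assert k == len(residues)
    p = a_mi * a_ii ** (k - 1) * a_im
    for a in residues:
        p *= q[a]                       # random emissions from background q
    return p

q = {"A": 0.08, "G": 0.07, "L": 0.09}   # tiny invented background
print(p_insertions(2, ["A", "G"], q))
```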

Page 8: Sequence classification &  hidden Markov models

Profile Hidden Markov Model architecture..

• Deletion modelling

[Figure: deletions modelled either by direct jump transitions between match states or, alternatively, by silent delete states Dj]

Page 9: Sequence classification &  hidden Markov models

Profile Hidden Markov Model architecture..

Insert & delete states are generalized to all positions. The model M can generate sequences from state B by successive emissions and transitions until state E is reached.

[Figure: full profile HMM architecture with match (Mj), insert (Ij) and delete (Dj) states between B and E]

Page 10: Sequence classification &  hidden Markov models

Probabilistic sequence modelling

• Classification criteria

$$P(M_j \mid s) \ge P(M_i \mid s), \quad i = 1 \ldots k \qquad (1)$$

Bayes theorem;

$$P(M \mid s) = \frac{P(s \mid M)\,P(M)}{P(s)} \qquad (2)$$

$$\frac{P(M \mid s)}{P(N \mid s)} = \frac{P(s \mid M)\,P(M)}{P(s \mid N)\,P(N)} \qquad (3)$$

..but, P(M) & P(s)..?

Page 11: Sequence classification &  hidden Markov models

Probabilistic sequence modelling..

If N models the whole sequence space (N = q), the criterion becomes P(M | s) / P(q | s) ≥ 1, and

$$\log P(M \mid s) - \log P(q \mid s) = \log P(s \mid M) - \log P(s \mid q) + \log\frac{P(M)}{P(q)} \qquad (4)$$

Since P(s | M) ≪ 1, logarithmic probabilities are more convenient.

Def., log-odds score V;

$$\text{score} = \log P(s \mid M) - \log P(s \mid q) \qquad (5)$$

Page 12: Sequence classification &  hidden Markov models

Probabilistic sequence modelling..

Eq. (4) & (5) give a new classification criterion;

$$\text{score} = \log_z P(s \mid M) - \log_z P(s \mid q) \;\ge\; \log_z P(q) - \log_z P(M) = d \qquad (6)$$

..for a certain significance level α (i.e. the accepted number of incorrect classifications when searching a database of n sequences), a threshold d is required;

$$n\,z^{-d} \le \alpha \;\Leftrightarrow\; d \ge \log_z n - \log_z \alpha \qquad (7)$$

Page 13: Sequence classification &  hidden Markov models

Probabilistic sequence modelling..

Example

If z = e or z = 2 and the significance level is chosen as one incorrect classification (false positive) per 1000 searches of a database of n = 10000 sequences;

$$d = \ln n - \ln 10^{-3} = \ln(1000\,n) \approx 16 \text{ nits}$$

$$d = \log_2 n - \log_2 10^{-3} = \log_2(1000\,n) \approx 23 \text{ bits}$$
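The same arithmetic as a quick Python check:

```python
import math

n = 10_000      # database size
alpha = 1e-3    # accepted false positives per database search

d_nits = math.log(n / alpha)     # natural log -> nits (nats)
d_bits = math.log2(n / alpha)    # base-2 log  -> bits
print(f"{d_nits:.1f} nits, {d_bits:.1f} bits")   # ~16.1 nits, ~23.3 bits
```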

Page 14: Sequence classification &  hidden Markov models

Large vs. small threshold d

[Figure: families A and B in sequence space with two decision boundaries; a high d yields only true positives, while a low d also admits a false positive]

Page 15: Sequence classification &  hidden Markov models

Model characteristics

One can define sensitivity, 'how many are found';

$$\text{recall, accuracy, sensitivity} = \frac{\text{true positives}}{\text{true examples}}$$

..and selectivity, 'how many are correct';

$$\text{precision, reliability, selectivity} = \frac{\text{true positives}}{\text{all positives}}$$
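The two ratios in code, with invented counts:

```python
def sensitivity(true_positives, true_examples):
    """Recall / accuracy / sensitivity: how many of the real members are found."""
    return true_positives / true_examples

def selectivity(true_positives, all_positives):
    """Precision / reliability / selectivity: how many of the hits are correct."""
    return true_positives / all_positives

# invented example: 90 of 100 family members found, plus 10 false hits
print(sensitivity(90, 100))   # 0.9
print(selectivity(90, 100))   # 0.9  (90 true hits out of 100 reported)
```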

Page 16: Sequence classification &  hidden Markov models

Model construction

• From initial alignment

The most common method: start from an initial multiple alignment of e.g. a protein family

• Iteratively

By successive database searches, incorporating new similar sequences into the model

• Neural-inspired

The model is trained using some continuous minimization algorithm, e.g. Baum-Welch, steepest descent etc.

Page 17: Sequence classification &  hidden Markov models

Model construction..

A short family alignment gives a simple model M, potential match states marked with an asterisk (*);

A _ _ _ K
A D _ _ R
A D _ _ R
S D _ _ K
A E L G R
* *     *

[Figure: the corresponding model from state B to state E]
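The slide does not say how the starred columns are chosen; a common heuristic (an assumption here) is to make a column a match state when at most half of its entries are gaps:

```python
# Toy alignment from the slide; '_' marks a gap.
alignment = [
    "A___K",
    "AD__R",
    "AD__R",
    "SD__K",
    "AELGR",
]

def match_columns(aln, max_gap_fraction=0.5):
    """Columns with few enough gaps become match states (marked '*')."""
    n_seq = len(aln)
    cols = []
    for j in range(len(aln[0])):
        gaps = sum(1 for row in aln if row[j] == "_")
        cols.append(gaps / n_seq <= max_gap_fraction)
    return cols

marks = match_columns(alignment)
print("".join("*" if m else " " for m in marks))   # '**  *'
```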

Page 18: Sequence classification &  hidden Markov models

Model construction..

A more generalized model

Ex. evaluate the sequence s = 'AIEH'

[Figure: the family alignment together with s = 'AIEH' aligned to the model in three alternative ways (A I E _ H / _ A I E H _ / A I E _ _ H), each corresponding to a different path through the model]

Page 19: Sequence classification &  hidden Markov models

Sequence evaluation

The optimal alignment, i.e. the path that has the greatest probability of generating the sequence s, can be determined through dynamic programming

[Figure: match state Mj is reached from Mj-1, Ij-1 or Dj-1]

$$V^M_j(s_i) = \log\frac{e_{M_j}(s_i)}{q(s_i)} + \max\begin{cases} V^M_{j-1}(s_{i-1}) + \log a_{M_{j-1},M_j} \\ V^I_{j-1}(s_{i-1}) + \log a_{I_{j-1},M_j} \\ V^D_{j-1}(s_{i-1}) + \log a_{D_{j-1},M_j} \end{cases}$$

The maximum log-odds score V^M_j(s_i) for match state j emitting s_i is calculated from the emission score plus the maximum over the previous scores plus the corresponding transition score

Page 20: Sequence classification &  hidden Markov models

Sequence evaluation..

Viterbi's algorithm;

$$V^M_j(s_i) = \log\frac{e_{M_j}(s_i)}{q(s_i)} + \max\begin{cases} V^M_{j-1}(s_{i-1}) + \log a_{M_{j-1},M_j} \\ V^I_{j-1}(s_{i-1}) + \log a_{I_{j-1},M_j} \\ V^D_{j-1}(s_{i-1}) + \log a_{D_{j-1},M_j} \end{cases} \qquad (8)$$

$$V^I_j(s_i) = \log\frac{e_{I_j}(s_i)}{q(s_i)} + \max\begin{cases} V^M_j(s_{i-1}) + \log a_{M_j,I_j} \\ V^I_j(s_{i-1}) + \log a_{I_j,I_j} \\ V^D_j(s_{i-1}) + \log a_{D_j,I_j} \end{cases} \qquad (9)$$

$$V^D_j(s_i) = \max\begin{cases} V^M_{j-1}(s_i) + \log a_{M_{j-1},D_j} \\ V^I_{j-1}(s_i) + \log a_{I_{j-1},D_j} \\ V^D_{j-1}(s_i) + \log a_{D_{j-1},D_j} \end{cases} \qquad (10)$$

Initialization;

$$V^B_0(s_0) = 0 \qquad\quad P(\text{begins with } s_1) = 1$$

$$V^B_0(s_i) = -\infty,\; i \neq 0 \qquad\quad P(\text{does not begin with } s_1) = 0$$
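The recursions (8)-(10) map directly onto dynamic programming over a (state, j, i) table. Below is a minimal sketch, not the lecture's own code: the dictionary-based parameter tables, the state naming, and the 1e-12 floor against log(0) are all choices made here. Filling the tables with probabilities from the estimators on the following slides and comparing the returned score against the threshold d completes the classifier.

```python
import math

NEG_INF = float("-inf")

def viterbi_score(s, match_e, insert_e, q, a):
    """Log-odds Viterbi for a profile HMM, following eqs (8)-(10).

    s        -- sequence string, e.g. "ADK"
    match_e  -- match_e[j][res]: emission probs of match state j (1-based)
    insert_e -- insert_e[j][res]: emission probs of insert state j (0..N)
    q        -- background distribution q[res]
    a        -- a[("M1", "M2")] etc.: transition probs; "M0" is B, "E" the end
    Returns the best log-odds score for aligning all of s to the model.
    """
    N = len(match_e) - 1            # number of match states; index 0 unused
    L = len(s)
    V = {x: [[NEG_INF] * (L + 1) for _ in range(N + 1)] for x in "MID"}
    V["M"][0][0] = 0.0              # begin state B ("M0"), nothing emitted yet

    def tr(x, jx, y, jy=""):        # log a_{x_jx, y_jy}; -inf if absent
        p = a.get((f"{x}{jx}", f"{y}{jy}"), 0.0)
        return math.log(p) if p > 0 else NEG_INF

    for j in range(N + 1):
        for i in range(L + 1):
            if j > 0 and i > 0:     # eq (8): match state M_j emits s_i
                em = math.log(match_e[j].get(s[i-1], 1e-12) / q[s[i-1]])
                V["M"][j][i] = em + max(
                    V["M"][j-1][i-1] + tr("M", j-1, "M", j),
                    V["I"][j-1][i-1] + tr("I", j-1, "M", j),
                    V["D"][j-1][i-1] + tr("D", j-1, "M", j))
            if i > 0:               # eq (9): insert state I_j emits s_i;
                                    # with e_I = q the emission term is 0
                em = math.log(insert_e[j].get(s[i-1], q[s[i-1]]) / q[s[i-1]])
                V["I"][j][i] = em + max(
                    V["M"][j][i-1] + tr("M", j, "I", j),
                    V["I"][j][i-1] + tr("I", j, "I", j),
                    V["D"][j][i-1] + tr("D", j, "I", j))
            if j > 0:               # eq (10): delete state D_j is silent
                V["D"][j][i] = max(
                    V["M"][j-1][i] + tr("M", j-1, "D", j),
                    V["I"][j-1][i] + tr("I", j-1, "D", j),
                    V["D"][j-1][i] + tr("D", j-1, "D", j))

    # finish by moving from the last column into the end state E
    return max(V["M"][N][L] + tr("M", N, "E"),
               V["I"][N][L] + tr("I", N, "E"),
               V["D"][N][L] + tr("D", N, "E"))
```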

Page 21: Sequence classification &  hidden Markov models

Parameter estimation, background

• Proteins with similar structures can have very different sequences

• Classical sequence alignment based only on heuristic rules & parameters cannot deal with sequence identities below ~ 50-60%

• Substitution matrices add static a priori information about amino acids and protein sequences, giving good alignments down to ~ 25-30% sequence identity, e.g. CLUSTAL

• How to get further down into ‘the twilight zone’..?

- More and dynamic a priori information..!

Page 22: Sequence classification &  hidden Markov models

Parameter estimation

A _ _ _ K
A D _ _ R
A D _ _ R
S D _ _ K
A E L G R
* *     *

Probability of emitting an alanine in the first match state, e_{M1}('A')..?

• Maximum likelihood-estimation

$$e_{M_j}(a) = \frac{c_j(a)}{\sum_{a'} c_j(a')}$$

[Bar chart: maximum likelihood amino acid distribution for the first match state, dominated by A]
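In code, the maximum likelihood estimate is just a normalized count over an alignment column (here the first match column of the toy alignment):

```python
from collections import Counter

def ml_estimate(column):
    """Maximum likelihood: e_Mj(a) = c_j(a) / sum_a' c_j(a'), gaps ignored."""
    counts = Counter(a for a in column if a != "_")
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

# first match-state column of the toy alignment: A, A, A, S, A
print(ml_estimate("AAASA"))   # {'A': 0.8, 'S': 0.2}
```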

Page 23: Sequence classification &  hidden Markov models

Parameter estimation..

• Add-one pseudocount estimation

$$e_{M_j}(a) = \frac{c_j(a) + 1}{\sum_{a'} c_j(a') + 20}$$

• Background pseudocount estimation

$$e_{M_j}(a) = \frac{c_j(a) + A\,q(a)}{\sum_{a'} c_j(a') + A}$$

[Bar charts: the resulting amino acid distributions under add-one and background pseudocounts]
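Both pseudocount schemes are small variations on the raw counts; a sketch with a uniform stand-in background (a real q would be estimated from a large database, and the weight A is a tunable choice):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def add_one_estimate(column):
    """e_Mj(a) = (c_j(a) + 1) / (sum_a' c_j(a') + 20)"""
    c = Counter(a for a in column if a != "_")
    total = sum(c.values()) + 20
    return {a: (c[a] + 1) / total for a in AMINO_ACIDS}

def background_estimate(column, q, A=20):
    """e_Mj(a) = (c_j(a) + A*q(a)) / (sum_a' c_j(a') + A)"""
    c = Counter(a for a in column if a != "_")
    total = sum(c.values()) + A
    return {a: (c[a] + A * q[a]) / total for a in AMINO_ACIDS}

uniform_q = {a: 1 / 20 for a in AMINO_ACIDS}         # stand-in background
print(add_one_estimate("AAASA")["A"])                # (4+1)/(5+20) = 0.2
print(background_estimate("AAASA", uniform_q)["A"])  # same with uniform q
```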

Page 24: Sequence classification &  hidden Markov models

Parameter estimation..

• Substitution mixture estimation

Score;

$$s(a,b) = \log\frac{P(a,b)}{q(a)\,q(b)} \qquad\qquad P(b \mid a) = q(b)\, e^{\,s(a,b)}$$

Maximum likelihood gives pseudocounts;

$$f_j(a) = \frac{c_j(a)}{\sum_{a'} c_j(a')} \qquad\qquad \alpha_j(a) = A \sum_b f_j(b)\, P(a \mid b)$$

Total estimation;

$$e_{M_j}(a) = \frac{c_j(a) + \alpha_j(a)}{\sum_{a'} \left(c_j(a') + \alpha_j(a')\right)}$$

[Bar chart: the resulting substitution mixture amino acid distribution]
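A sketch of the mixture estimator; the conditional substitution table P_cond must be supplied by the caller (the slide derives it from the substitution score via P(b | a) = q(b) e^{s(a,b)}), and A is again a tunable pseudocount weight:

```python
from collections import Counter

def substitution_mixture_estimate(column, P_cond, A=20):
    """
    Pseudocounts derived from a substitution matrix:
      f_j(b)     = c_j(b) / sum_b' c_j(b')       (observed frequencies)
      alpha_j(a) = A * sum_b f_j(b) * P(a | b)   (pseudocounts)
      e_Mj(a)    = (c_j(a) + alpha_j(a)) / sum_a' (c_j(a') + alpha_j(a'))
    P_cond[b][a] holds P(a | b).
    """
    c = Counter(a for a in column if a != "_")
    total = sum(c.values())
    f = {b: n / total for b, n in c.items()}
    aas = list(P_cond)
    alpha = {a: A * sum(fb * P_cond[b][a] for b, fb in f.items()) for a in aas}
    denom = sum(c[a] + alpha[a] for a in aas)
    return {a: (c[a] + alpha[a]) / denom for a in aas}
```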

Page 25: Sequence classification &  hidden Markov models

Parameter estimation..

All of the above methods are, in spite of their dynamic implementation, still based on heuristic parameters.

A method that compensates for and complements the lack of data in a statistically correct way;

• Dirichlet mixture estimation

Looking at sequence alignments, several different amino acid distributions seem to be recurring, not just the background distribution q.

Assume that there are k probability densities ρ_1 … ρ_k that generate these, combined into a mixture;

$$p(\vec\theta) = g_1\rho_1(\vec\theta) + \cdots + g_k\rho_k(\vec\theta), \qquad \sum_{j=1}^{k} g_j = 1$$

Page 26: Sequence classification &  hidden Markov models

Parameter estimation, Dirichlet Mixture style..

Given the data, a count vector $\vec n = (n_1 \ldots n_{20})$, this method allows a linear combination of the k individual estimations, weighted with the probability that n was generated by each component;

$$P(\vec n \mid \rho_j) = \frac{\Gamma(|\vec n| + 1)}{\prod_{i} \Gamma(n_i + 1)} \cdot \frac{\Gamma(|\vec\alpha_j|)}{\Gamma(|\vec n| + |\vec\alpha_j|)} \prod_{i=1}^{20} \frac{\Gamma(n_i + \alpha_{ji})}{\Gamma(\alpha_{ji})}$$

The k components can be modelled from a curated database of alignments. Using some parametric form of the probability density, an explicit expression for the probability that n has been generated by the j:th component can be derived, giving the estimate;

$$e_i = \sum_{j=1}^{k} P(\rho_j \mid \vec n)\; \frac{n_i + \alpha_{ji}}{|\vec n| + \sum_{l=1}^{20} \alpha_{jl}}$$

[Bar chart: the resulting Dirichlet mixture amino acid distribution]

Ex. a single Dirichlet component;

$$\rho(\vec\theta) \propto \prod_{i=1}^{20} \theta_i^{\alpha_i - 1}, \qquad \sum_{i=1}^{20} \theta_i = 1$$
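A sketch of the estimator; the mixture weights g_j and parameter vectors α_j are inputs that, as the slide says, would be fitted to a curated alignment database. The multinomial coefficient in P(n | ρ_j) is dropped here since it is common to all components and cancels in the posterior:

```python
import math

def log_p_count_vector(n, alpha):
    """log P(n | one Dirichlet component), dropping the multinomial
    coefficient (it cancels in the component posterior)."""
    N, A = sum(n), sum(alpha)
    return (math.lgamma(A) - math.lgamma(N + A)
            + sum(math.lgamma(ni + ai) - math.lgamma(ai)
                  for ni, ai in zip(n, alpha)))

def dirichlet_mixture_estimate(n, components):
    """
    components: list of (g_j, alpha_j) mixture weights and parameter vectors.
    Returns e_i = sum_j P(rho_j | n) * (n_i + alpha_ji) / (|n| + |alpha_j|).
    """
    logs = [math.log(g) + log_p_count_vector(n, a) for g, a in components]
    m = max(logs)
    w = [math.exp(x - m) for x in logs]   # unnormalized posteriors P(j | n)
    Z = sum(w)
    e = [0.0] * len(n)
    N = sum(n)
    for (g, alpha), wj in zip(components, w):
        A = sum(alpha)
        for i, (ni, ai) in enumerate(zip(n, alpha)):
            e[i] += (wj / Z) * (ni + ai) / (N + A)
    return e
```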

Page 27: Sequence classification &  hidden Markov models

Parameter estimation, Dirichlet Mixture style..

The k components describe peaks of amino acid distributions in a multidimensional distribution space.

Depending on where in this space our count vector n lies, i.e. depending on which components can be assumed to have generated n, distribution information is incorporated into the probability estimate e.

[Figure: count vector n among the component peaks]

Page 28: Sequence classification &  hidden Markov models

Classification example

Alignment of some known glycoside hydrolase family 16 sequences

• Define which columns are to be regarded as match states (*)

• Build the corresponding model M & HMM graph

• Estimate all emission and transition probabilities, e_j & a_{jk}

• Evaluate the log-odds score / probability that an unknown sequence s has been generated by M using Viterbi's algorithm

• If score(s | M) > d, the sequence can be classified as a GH16 family member

Page 29: Sequence classification &  hidden Markov models

Classification example..

A certain sequence s1=WHKLRQ.. is evaluated and gets a score of -17.63 nits, i.e. the probability that M has generated s1 is very small

Another sequence s2=SDGSYT.. gets a score of 27.49 nits and can with good significance be classified as a family member

Page 30: Sequence classification &  hidden Markov models

Summary

• Hidden Markov models are used mainly for classification / searching (PFAM), but also for sequence mapping / alignment

• As compared to normal alignment, a position-specific approach is used for sequence distributions, insertions and deletions

• Model building is usually a compromise between sensitivity and selectivity. If more a priori information is incorporated, the sensitivity goes up whereas the selectivity goes down