Sequence classification & hidden Markov models
DESCRIPTION
Bioinformatics, Models & algorithms, 8th November 2005. Patrik Johansson, Dept. of Cell & Molecular Biology, Uppsala University. Sequence classification & hidden Markov models. A family of proteins share a similar structure but not necessarily sequence.

TRANSCRIPT
Sequence classification & hidden Markov models
Bioinformatics,
Models & algorithms,
8th November 2005
Patrik Johansson,
Dept. of Cell & Molecular Biology,
Uppsala University
A family of proteins shares a similar structure, but not necessarily a similar sequence
[Figure: two clusters of known sequences, families A and B, with an unknown sequence s between them]

Classification of an unknown sequence s to family A or B using HMMs
Hidden Markov Models, introduction
• General method for pattern recognition, cf. neural networks
• An HMM generates sequences / sequence distributions
• Markov chain of events
The outcome, e.g. 'Heads Heads Tails', is generated by a hidden Markov chain Γ
[Figure: a sequence of coin tosses produced by switching between three hidden coins]

Three coins A, B & C give a Markov chain Γ = CAABA..
Hidden Markov Models, introduction..
• The model M emits a symbol (H or T) in each state i according to an emission probability $e_i$
• The next state j is chosen according to a transition probability $a_{i,j}$
For example, the sequence s = 'Tails Heads Tails' generated over the path Γ = BCC has probability

$$P(s \mid M, \Gamma) = a_{0,B}\, e_B(\text{Tails})\; a_{B,C}\, e_C(\text{Heads})\; a_{C,C}\, e_C(\text{Tails})$$

The distribution over all sequences S that the model can generate is normalized;

$$\sum_{s' \in S} P(s' \mid M) = 1$$

[Figure: trellis of the states A, B, C emitting 'Tails Heads Tails']
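As a concrete illustration, a minimal Python sketch of the coin model; the probability values below are invented for the example, since the slides give no numbers:

```python
# Three-coin HMM sketch: states A, B, C plus a begin state "0".
# All probability values are illustrative assumptions.
a = {  # transition probabilities a[i][j]
    "0": {"A": 0.3, "B": 0.4, "C": 0.3},
    "A": {"A": 0.5, "B": 0.3, "C": 0.2},
    "B": {"A": 0.2, "B": 0.3, "C": 0.5},
    "C": {"A": 0.3, "B": 0.2, "C": 0.5},
}
e = {  # emission probabilities e_i(symbol): each coin has its own bias
    "A": {"Heads": 0.5, "Tails": 0.5},
    "B": {"Heads": 0.7, "Tails": 0.3},
    "C": {"Heads": 0.1, "Tails": 0.9},
}

def path_probability(symbols, path):
    """P(s | M, path): product of transition and emission probabilities."""
    p, prev = 1.0, "0"  # start in the begin state
    for sym, state in zip(symbols, path):
        p *= a[prev][state] * e[state][sym]
        prev = state
    return p

# The example from the slide: s = 'Tails Heads Tails' over the path BCC.
print(path_probability(["Tails", "Heads", "Tails"], ["B", "C", "C"]))
```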
Profile hidden Markov Model architecture
• A first approach for sequence distribution modelling
[Figure: linear model with begin state B, match states M1 … MN and end state E]

$$P(s \mid M) = \prod_{j=1}^{N} e_{M_j}(s_j)$$
Profile hidden Markov Model architecture..
• Insertion modelling
[Figure: insert state Ij attached between match states Mj and Mj+1]

Insertions are modelled as random; $e_{I_j}(a) = q(a)$, the background distribution. The probability of k insertions after match state j then becomes

$$P(k\ \text{insertions}) = a_{M_j,I_j}\, a_{I_j,I_j}^{\,k-1}\, a_{I_j,M_{j+1}} \prod_{l=0}^{k-1} q(s_{i+l})$$
Profile Hidden Markov Model architecture..
• Deletion modelling
[Figure: deletions modelled either by direct jump transitions past match states or, alternatively, by silent delete states Dj]
Profile Hidden Markov Model architecture..
Insert & delete states are generalized to all positions. The model M can generate sequences from state B by successive emissions and transitions until state E is reached
[Figure: the full profile HMM architecture; each position j has a match state Mj, an insert state Ij and a delete state Dj, connected between begin state B and end state E]
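To make the architecture concrete, a minimal Python container for these states and parameters (a sketch only; the attribute names are this example's own, not any library's):

```python
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

@dataclass
class ProfileHMM:
    match_e: list  # match_e[j][a] = e_Mj(a), one dict per match state
    t: list        # t[j][(x, y)] = log transition score, x, y in 'MID';
                   # j = 0 leaves the begin state, j = N enters the end state
    q: dict        # background distribution q(a), also used for insert states

    @property
    def length(self):
        # Number of match states N.
        return len(self.match_e)
```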
Probabilistic sequence modelling
• Classification criteria
Assign the sequence s to the model with the highest posterior probability;

$$P(M_i \mid s) \geq P(M_j \mid s), \qquad j = 1 \ldots k \qquad (1)$$

Bayes' theorem;

$$P(M \mid s) = \frac{P(s \mid M)\, P(M)}{P(s)} \qquad (2)$$

$$\frac{P(M \mid s)}{P(N \mid s)} = \frac{P(s \mid M)\, P(M)}{P(s \mid N)\, P(N)} \qquad (3)$$

..but, P(M) & P(s)..?
Probabilistic sequence modelling..
If N models the whole sequence space (N = q), the criterion ( 3 ) reduces to

$$\frac{P(M \mid s)}{P(q \mid s)} \geq 1 \qquad (4)$$

Taking logarithms (more convenient, since $P(s \mid M) \ll 1$);

$$\log P(M \mid s) - \log P(q \mid s) = \log P(s \mid M) - \log P(s \mid q) + \log\frac{P(M)}{P(q)}$$

Def., the log-odds score V;

$$V = \text{score} = \log P(s \mid M) - \log P(s \mid q) \qquad (5)$$
Probabilistic sequence modelling..
Eq. ( 4 ) & ( 5 ) give a new classification criterion;

$$\text{score} \;\geq\; \log_z P(q) - \log_z P(M) \;\equiv\; d \qquad (6)$$

..for a certain significance level (i.e. the accepted number $n_d$ of incorrect classifications when searching a database of n sequences) a threshold d is required;

$$d = \log_z n - \log_z n_d \qquad (7)$$

The criterion is thus

$$\text{score} = \log_z P(s \mid M) - \log_z P(s \mid q) \;\geq\; d$$
Probabilistic sequence modelling..
Example
If the significance level is chosen as one incorrect classification (false positive) per 1000 searches of a database of n = 10000 sequences, i.e. $n_d = 10^{-3}$;

$$d = \ln n - \ln n_d = \ln 10^{7} \approx 16\ \text{nits} \quad (z = e), \qquad d = \log_2 10^{7} \approx 23\ \text{bits} \quad (z = 2)$$
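The same arithmetic as a small Python check:

```python
import math

# Threshold d from eq. (7): d = log_z(n) - log_z(n_d), with n_d the accepted
# number of false positives per search (here 1 per 1000 searches).
n, n_d = 10_000, 1 / 1000

d_nits = math.log(n) - math.log(n_d)    # z = e
d_bits = math.log2(n) - math.log2(n_d)  # z = 2
print(f"{d_nits:.1f} nits, {d_bits:.1f} bits")  # about 16.1 nits, 23.3 bits
```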
[Figure: families A and B in sequence space; a high threshold d keeps only true positives, a low d also lets in false positives]
One can define sensitivity, 'how many are found';

$$\text{recall, accuracy, sensitivity} = \frac{\text{true positives}}{\text{true examples}}$$

..and selectivity, 'how many are correct';

$$\text{precision, reliability, selectivity} = \frac{\text{true positives}}{\text{all positives}}$$
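As a quick Python illustration (the counts are invented):

```python
# Sensitivity and selectivity for one search result (illustrative numbers).
true_positives = 90   # family members found above the threshold
true_examples = 100   # family members actually present in the database
all_positives = 120   # everything scoring above the threshold

sensitivity = true_positives / true_examples  # 'how many are found'
selectivity = true_positives / all_positives  # 'how many are correct'
print(f"sensitivity {sensitivity:.2f}, selectivity {selectivity:.2f}")
```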
Model characteristics
Model construction
• From initial alignment
The most common method. Start from an initial multiple alignment of e.g. a protein family
• Iteratively
By successive database searches, incorporating new similar sequences into the model
• Neural-inspired
The model is trained using some continuous minimization algorithm, e.g. Baum-Welch, steepest descent etc.
Model construction..
A short family alignment gives a simple model M, potential match states marked with (*);

A _ _ _ K
A D _ _ R
A D _ _ R
S D _ _ K
A E L G R
* *     *

[Figure: the corresponding model from state B to state E]
Model construction..

A more generalized model. Ex. evaluate the sequence s = 'AIEH'; it can be aligned to the model in several alternative ways, each corresponding to a different path from B to E;

Alt. 1;
A _ _ _ K
A D _ _ R
A D _ _ R
S D _ _ K
A E L G R
* *     *
A I E _ H

Alt. 2;
A _ _ _ _ K
A D _ _ _ R
A D _ _ _ R
S D _ _ _ K
A E L G _ R
* *       *
_ A I E H _

Alt. 3;
A _ _ _ _ K
A _ D _ _ R
A _ D _ _ R
S _ D _ _ K
A _ E I G R
*   *     *
A I E _ _ H
Sequence evaluation
The optimal alignment, i.e. the path with the greatest probability of generating the sequence s, can be determined through dynamic programming

[Figure: match state Mj is reached from Mj-1, Ij-1 or Dj-1]

$$V_j^M(s_i) = \log\frac{e_{M_j}(s_i)}{q(s_i)} + \max\begin{cases} V_{j-1}^M(s_{i-1}) + \log a_{M_{j-1},M_j} \\ V_{j-1}^I(s_{i-1}) + \log a_{I_{j-1},M_j} \\ V_{j-1}^D(s_{i-1}) + \log a_{D_{j-1},M_j} \end{cases}$$

The maximum log-odds score $V_j^M(s_i)$ for match state j emitting $s_i$ is calculated from the emission score plus the maximum over the previous scores plus the corresponding transition score
Sequence evaluation..
Viterbi's algorithm;

$$V_j^M(s_i) = \log\frac{e_{M_j}(s_i)}{q(s_i)} + \max\begin{cases} V_{j-1}^M(s_{i-1}) + \log a_{M_{j-1},M_j} \\ V_{j-1}^I(s_{i-1}) + \log a_{I_{j-1},M_j} \\ V_{j-1}^D(s_{i-1}) + \log a_{D_{j-1},M_j} \end{cases} \qquad (8)$$

$$V_j^I(s_i) = \log\frac{e_{I_j}(s_i)}{q(s_i)} + \max\begin{cases} V_j^M(s_{i-1}) + \log a_{M_j,I_j} \\ V_j^I(s_{i-1}) + \log a_{I_j,I_j} \\ V_j^D(s_{i-1}) + \log a_{D_j,I_j} \end{cases} \qquad (9)$$

$$V_j^D(s_i) = \max\begin{cases} V_{j-1}^M(s_i) + \log a_{M_{j-1},D_j} \\ V_{j-1}^I(s_i) + \log a_{I_{j-1},D_j} \\ V_{j-1}^D(s_i) + \log a_{D_{j-1},D_j} \end{cases} \qquad (10)$$

(Note that $e_{I_j}(a) = q(a)$, so the emission term vanishes in eq ( 9 ).) Initialization; $V_0(s_0) = 0$, since P(the path begins with $s_1$) = 1, and $V_i(s_0) = -\infty$ for $i \neq 0$, since P(the path does not begin with $s_1$) = 0
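Below is a sketch of eqs ( 8 )-( 10 ) in Python, written against the hypothetical ProfileHMM container sketched earlier (the names are this sketch's assumptions, not a library API). It fills the three dynamic-programming matrices and returns the optimal log-odds score, without traceback:

```python
import math

LOG0 = float("-inf")  # log(0); marks impossible paths

def viterbi_score(seq, hmm):
    """Optimal log-odds score of seq against a profile HMM, eqs (8)-(10).

    Assumes nonzero emission probabilities. hmm.t[j][(x, y)] holds the log
    transition score from state x at position j to state y at position j+1
    (for 'I' targets, at the same position j).
    """
    n, N = len(seq), hmm.length
    # V[x][j][i]: best score ending in state x_j with i symbols consumed.
    V = {x: [[LOG0] * (n + 1) for _ in range(N + 1)] for x in "MID"}
    V["M"][0][0] = 0.0  # the begin state B is treated as M0

    for j in range(1, N + 1):
        for i in range(n + 1):
            if i > 0:
                a = seq[i - 1]
                em = math.log(hmm.match_e[j - 1][a] / hmm.q[a])
                V["M"][j][i] = em + max(
                    V[x][j - 1][i - 1] + hmm.t[j - 1][(x, "M")] for x in "MID")
                # Insert emissions follow q, so their emission term is zero
                # (N-terminal inserts at I0 are omitted for brevity).
                V["I"][j][i] = max(
                    V[x][j][i - 1] + hmm.t[j][(x, "I")] for x in "MID")
            # Delete states are silent: no symbol is consumed.
            V["D"][j][i] = max(
                V[x][j - 1][i] + hmm.t[j - 1][(x, "D")] for x in "MID")
    # Best transition into the end state E, treated as the M target at N+1.
    return max(V[x][N][n] + hmm.t[N][(x, "M")] for x in "MID")
```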
Parameter estimation, background
• Proteins with similar structures can have very different sequences
• Classical sequence alignment based only on heuristic rules & parameters cannot deal with sequence identities below ~ 50-60%
• Substitution matrices add static a priori information about amino acids and protein sequences, giving good alignments down to ~ 25-30% sequence identity, e.g. CLUSTAL
• How to get further down into ‘the twilight zone’..?
- More, and more dynamic, a priori information..!
Parameter estimation
A _ _ _ K
A D _ _ R
A D _ _ R
S D _ _ K
A E L G R
* *     *

Probability of emitting an alanine in the first match state, $e_{M_1}(\text{'A'})$..?

• Maximum likelihood estimation

$$e_{M_j}(a) = \frac{c_j(a)}{\sum_{a'} c_j(a')}$$

where $c_j(a)$ is the number of times amino acid a is observed in column j
[Bar chart: maximum likelihood estimate of the emission distribution over the 20 amino acids; with only five sequences, most probabilities are zero]
Parameter estimation..
• Add-one pseudocount estimation

$$e_{M_j}(a) = \frac{c_j(a) + 1}{\sum_{a'} c_j(a') + 20}$$

• Background pseudocount estimation

$$e_{M_j}(a) = \frac{c_j(a) + A\, q(a)}{\sum_{a'} c_j(a') + A}$$
[Bar charts: emission distributions estimated with add-one and background pseudocounts; the zero counts are replaced by small, nonzero probabilities]
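The three estimators above as a Python sketch, evaluated on the first column of the example alignment (A, A, A, S, A); the flat background distribution q is an assumption for illustration:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ml_estimate(column):
    """Maximum likelihood: e_Mj(a) = c_j(a) / sum_a' c_j(a')."""
    c = Counter(column)
    total = sum(c.values())
    return {a: c[a] / total for a in AMINO_ACIDS}

def add_one_estimate(column):
    """Add-one pseudocounts: (c_j(a) + 1) / (sum_a' c_j(a') + 20)."""
    c = Counter(column)
    total = sum(c.values()) + 20
    return {a: (c[a] + 1) / total for a in AMINO_ACIDS}

def background_estimate(column, q, A=20):
    """Background pseudocounts: (c_j(a) + A q(a)) / (sum_a' c_j(a') + A)."""
    c = Counter(column)
    total = sum(c.values()) + A
    return {a: (c[a] + A * q[a]) / total for a in AMINO_ACIDS}

col = "AAASA"                          # column 1 of the example alignment
q = {a: 1 / 20 for a in AMINO_ACIDS}   # flat background, for illustration
print(ml_estimate(col)["A"],           # 0.8
      add_one_estimate(col)["A"],      # 0.2
      background_estimate(col, q)["A"])  # 0.2
```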
Parameter estimation..
• Substitution mixture estimation

Score;

$$s(a,b) = \log\frac{P(a,b)}{q(a)\, q(b)}, \qquad P(b \mid a) = q(b)\, e^{s(a,b)}$$

Maximum likelihood gives pseudocounts;

$$f_j(a) = \frac{c_j(a)}{\sum_{a'} c_j(a')}, \qquad \alpha_j(a) = A \sum_b f_j(b)\, P(a \mid b)$$

Total estimation;

$$e_{M_j}(a) = \frac{c_j(a) + \alpha_j(a)}{\sum_{a'} \left( c_j(a') + \alpha_j(a') \right)}$$
[Bar chart: emission distribution estimated with substitution-mixture pseudocounts]
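A compact Python sketch of the substitution-mixture estimate; the two-letter alphabet and the joint probabilities P(a,b) are invented to keep the example small (in practice they come from a substitution matrix such as BLOSUM):

```python
# Toy two-letter alphabet with an invented symmetric joint distribution.
ALPHABET = "AS"
q = {"A": 0.6, "S": 0.4}
P_joint = {("A", "A"): 0.45, ("A", "S"): 0.15,
           ("S", "A"): 0.15, ("S", "S"): 0.25}

def P_cond(a, b):
    """P(a | b) = P(b, a) / q(b), equivalently q(a) * exp(s(b, a))."""
    return P_joint[(b, a)] / q[b]

def mixture_estimate(column, A=5):
    counts = {a: column.count(a) for a in ALPHABET}
    total = sum(counts.values())
    f = {a: counts[a] / total for a in ALPHABET}              # ML frequencies
    alpha = {a: A * sum(f[b] * P_cond(a, b) for b in ALPHABET)
             for a in ALPHABET}                               # pseudocounts
    denom = sum(counts[a] + alpha[a] for a in ALPHABET)
    return {a: (counts[a] + alpha[a]) / denom for a in ALPHABET}

print(mixture_estimate("AAASA"))
```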
Parameter estimation..
All the above methods are, in spite of their dynamic implementation, still based on heuristic parameters

A method that compensates for and complements the lack of data in a statistically correct way;

• Dirichlet mixture estimation

Looking at sequence alignments, several different amino acid distributions seem to be recurring, not just the background distribution q

Assume that there are k probability densities that generate these;

$$p(\vec\theta) = g_1\, p_1(\vec\theta) + \ldots + g_k\, p_k(\vec\theta), \qquad \sum_{j=1}^{k} g_j = 1$$
Parameter estimation, Dirichlet Mixture style..
Given the data, a count vector $\vec n = (n_1 \ldots n_{20})$ of observed amino acids, this method allows a linear combination of the k individual estimations, weighted with the probability that $\vec n$ was generated by each component;

$$P(\vec\theta \mid \vec n) = P(1 \mid \vec n)\, p_1(\vec\theta \mid \vec n) + \ldots + P(k \mid \vec n)\, p_k(\vec\theta \mid \vec n)$$

The k components can be modelled from a curated database of alignments. Using some parametric form of the probability density, an explicit expression for the probability that $\vec n$ has been generated by the jth component can be derived, giving the total estimate;

$$e_i = \sum_{j=1}^{k} P(j \mid \vec n)\; \frac{n_i + \alpha_{j,i}}{\sum_{l=1}^{20} \left( n_l + \alpha_{j,l} \right)}$$

where $\vec\alpha_j$ are the parameters of the jth component
[Bar chart: emission distribution estimated with a Dirichlet mixture over the 20 amino acids]
Ex. the Dirichlet density as a parametric form;

$$p(\vec\theta) \propto \prod_{i=1}^{20} \theta_i^{\alpha_i - 1}, \qquad \sum_{i=1}^{20} \theta_i = 1$$
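A runnable sketch of the Dirichlet-mixture estimate above; the two components, their weights and the toy three-letter alphabet are invented for illustration (a real mixture has k components over the 20 amino acids, fitted to a curated alignment database):

```python
import math
from math import lgamma

# Invented two-component mixture over a toy 3-letter alphabet.
components = [
    (0.7, [5.0, 1.0, 1.0]),  # weight g_j, Dirichlet parameters alpha_j
    (0.3, [1.0, 1.0, 1.0]),  # a flat, background-like component
]

def log_evidence(n, alpha):
    """log P(n | j-th component), up to the multinomial coefficient
    (which is the same for every component and cancels in P(j | n))."""
    out = lgamma(sum(alpha)) - lgamma(sum(n) + sum(alpha))
    for ni, ai in zip(n, alpha):
        out += lgamma(ni + ai) - lgamma(ai)
    return out

def dirichlet_mixture_estimate(n):
    """e_i = sum_j P(j | n) (n_i + alpha_ji) / sum_l (n_l + alpha_jl)."""
    logw = [math.log(g) + log_evidence(n, a) for g, a in components]
    m = max(logw)
    w = [math.exp(x - m) for x in logw]
    post = [x / sum(w) for x in w]  # P(j | n)
    e = [0.0] * len(n)
    for pj, (_, alpha) in zip(post, components):
        denom = sum(n) + sum(alpha)
        for i, ni in enumerate(n):
            e[i] += pj * (ni + alpha[i]) / denom
    return e

# Count vector n = (4, 1, 0): the first letter dominates the column.
print(dirichlet_mixture_estimate([4, 1, 0]))
```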
Parameter estimation, Dirichlet Mixture style..
The k components describe peaks of amino acid distributions in a multidimensional space

Depending on where in this space our count vector $\vec n$ lies, i.e. depending on which components can be assumed to have generated $\vec n$, distribution information is incorporated into the probability estimate e

[Figure: the count vector $\vec n$ among the component peaks in distribution space]
Classification example

Alignment of some known glycoside hydrolase family 16 sequences
• Define which columns are to be regarded as match states (*)
• Build the corresponding model M & HMM graph
• Estimate all emission and transition probabilities, $e_j$ & $a_{j,k}$
• Evaluate the log-odds score / probability that an unknown sequence s has been generated by M using Viterbi's algorithm
• If score(s | M) > d, the sequence can be classified as a GH16 family member (as in the sketch below)
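Putting the steps together as a hypothetical sketch, reusing viterbi_score and the eq. ( 7 ) threshold from the earlier sketches:

```python
import math

def classify(seq, hmm, n_db=10_000, n_d=1e-3):
    """Return (is_family_member, score) for a sequence against a model."""
    d = math.log(n_db) - math.log(n_d)  # threshold in nits, eq. (7)
    score = viterbi_score(seq, hmm)     # optimal log-odds, eqs (8)-(10)
    return score > d, score

# e.g. is_member, score = classify("SDGSYT...", gh16_model)
```

where gh16_model would be a ProfileHMM built from the family alignment.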
Classification example..
A certain sequence s1=WHKLRQ.. is evaluated and gets a score of -17.63 nits, i.e. the probability that M has generated s1 is very small
Another sequence s2=SDGSYT.. gets a score of 27.49 nits and can with good significance be classified as a family member
Summary
• Hidden Markov models are used mainly for classification / searching (PFAM), but also for sequence mapping / alignment
• As compared to normal alignment, a position-specific approach is used for sequence distributions, insertions and deletions
• Model building is usually a compromise between sensitivity and selectivity. If more a priori information is incorporated, the sensitivity goes up whereas the selectivity goes down