hiddenmarkov modelspedagogix-tagc.univ-mrs.fr/.../pdf_files/03.05.hidden-markov-models.… · a...

Hidden Markov Models

Jacques van Helden

Aix-Marseille Université (AMU), Lab. Theory and Approaches of Genomic Complexity (TAGC)
https://tagc.univ-amu.fr/

Institut Français de Bioinformatique (IFB)
http://www.france-bioinformatique.fr/

ORCID: https://orcid.org/0000-0002-8799-8584

A seminal book

- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison. Cambridge University Press, 1998. ISBN 0-521-62041-4 (hardback).
- In 1998, Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison published a seminal book entitled "Biological Sequence Analysis".
  - A tutorial introduction to hidden Markov models and other probabilistic modelling approaches in computational sequence analysis.
  - The authors restate the classical sequence analysis problems in terms of Hidden Markov Models (HMM).
  - Even their table of contents is presented as an HMM (their Figure 1.1).

http://eddylab.org/cupbook.html

Applications of Hidden Markov Models in biology

- Hidden Markov models can be applied to solve a diversity of problems in bioinformatics.
- Sequence segmentation
  - Detection of CpG islands
  - Intron/exon prediction
- Motif detection
  - Protein domains (long motifs in peptide sequences)
  - Transcription factor binding sites (short motifs in DNA sequences)
- Secondary structure prediction
- …

Markov models (nothing to hide so far)

- A Markov process is defined by
  - A finite number of states (A, B, C, …)
- Example: 2-state Markov process
  - States: {X, Y}
  - Transitions: {X → X, X → Y, Y → X, Y → Y}
- The probability of transition from each state to each state (including itself) is described in a transition matrix.
  - Rows: current state s_i
  - Columns: next state s_{i+1}
  - Values: P(s_{i+1} | s_i)
  - Transition probabilities sum to 1 on each row
- Examples of Markov models to annotate genomic sequences
  1. State X = intron, state Y = exon
  2. State X = transcribed region, state Y = intergenic region
  3. State X = CpG island, state Y = other genomic region

Markov process

[Figure: 2-state Markov process with states X and Y]

Transition matrix

       X     Y
X    0.9   0.1
Y    0.2   0.8
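The 2-state process above can be explored with a short simulation. This is a minimal sketch, not part of the original slides: the function names `check_rows` and `simulate` and the fixed random seed are illustrative assumptions; only the transition values (0.9, 0.1, 0.2, 0.8) come from the matrix above.

```python
import random

# Transition matrix from the slide: rows = current state, columns = next state
TRANSITIONS = {
    "X": {"X": 0.9, "Y": 0.1},
    "Y": {"X": 0.2, "Y": 0.8},
}

def check_rows(matrix, tol=1e-9):
    """Return True if every row of the transition matrix sums to 1."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in matrix.values())

def simulate(matrix, start, n_steps, seed=0):
    """Generate a chain of n_steps states, starting from `start` (hypothetical helper)."""
    rng = random.Random(seed)
    chain = [start]
    for _ in range(n_steps - 1):
        row = matrix[chain[-1]]
        # Draw the next state according to the row of the current state
        chain.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(chain)

print(check_rows(TRANSITIONS))
print(simulate(TRANSITIONS, "X", 60))
```

Because both states have a high probability of staying in place, the simulated chain should show long runs of X and of Y, which is the behaviour wanted for segmenting a genome into regions.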

Examples of biological applications

1. Segmentation of the genome into transcribed and intergenic regions (figure legend: transcript, genome fragment)
2. Segmentation of transcribed regions into introns and exons (figure legend: intron, exon)
3. Segmentation of the genome into CpG islands and non-CpG islands (figure legend: CpG island, non-CpG island)

Segmentation of the genome into different types of regions

- In order to annotate the genome, we could conceive a multi-state Markov model that would represent the different types of regions:
  1. State W = intron
  2. State X = exon
  3. State Y = CpG island
  4. State Z = other genomic region

[Figure: k-state Markov process with states W (intron), X (exon), Y (CpG island), Z (other type of genomic region), plus sequence beginning (B) and end (E)]

Transition matrix (arbitrary values)

        W          X          Y          Z
W    0.990      0.010      0.000      0.000
X    0.010      0.988      0.001      0.001
Y    0.00000    0.00002    0.99898    0.00100
Z    0          0.000002   0.000001   0.999997

Markov model of a sequence

- We can model a macromolecular sequence as a Markov process
  - DNA: n = 4 states (A, C, G, T)
  - Proteins: n = 20 states (amino acids)
  - Optionally, additional states can be used to represent the beginning (B) and the end (E) of the sequence. This makes it possible to generate sequences of different lengths.
- Transition probabilities indicate the probability of generating a given residue (suffix) given the current residue (prefix).
- Exercise
  - DNA sequences are generated using a Markov model with an ending probability of 0.99 (irrespective of the current residue). What is the distribution of sequence lengths?

[Figure: 4-state Markov process for DNA sequences, with states A, C, G, T, plus beginning (B) and end (E)]

Probability of a sequence segment

- What is the probability of a given sequence segment?
- Different models can be chosen
  - Bernoulli model
    • Assumes independence between successive nucleotides.
    • The probability of each residue is fixed a priori (prior residue probability)
      - Example: P(A) = 0.35; P(T) = 0.32; P(C) = 0.17; P(G) = 0.16
    • Particular case: equiprobable residues
      - P(A) = P(T) = P(C) = P(G) = 0.25
      - Simple, but NOT realistic!
  - Markov model
    • The probability of each residue depends on the m preceding residues.
    • The parameter m is called the order of the Markov model.
    • Remark: a Bernoulli model can be considered as a Markov model of order 0.

Independent and equiprobable nucleotides

- The simplest model: Bernoulli with independent and identically distributed (i.i.d.) nucleotides.
  p = P(A) = P(C) = P(G) = P(T) = 0.25
- The probability of a sequence
  - is the product of its residue probabilities (independence);
  - equiprobability: since all residues have the same probability, it is simply computed as the residue probability (p) to the power of the sequence length (L).

$$P(S) = p^L$$

    • S is a sequence segment (e.g. an oligonucleotide)
    • L is the length of the sequence segment
    • p is the nucleotide probability
    • P(S) is the probability of observing this sequence segment at a given position of a larger sequence
- Example
  - P(CACGTG) = 0.25^6 = 2.44e-4

Bernoulli model : independently distributed nucleotides

- A more refined model consists in using residue-specific probabilities. The probability of each residue is assumed to be constant over the whole sequence (Bernoulli schema).
- The probability of a sequence is the product of its residue probabilities.

$$P(S) = \prod_{i=1}^{L} P(r_i)$$

  - i = 1..L is the index of nucleotide positions
  - r_i is the residue found at position i
  - P(r_i) is the probability of this residue
- Example: non-coding sequences in the yeast genome
  - P(A) = P(T) = 0.325
  - P(C) = P(G) = 0.175
  - P(CACGTG) = P(C) P(A) P(C) P(G) P(T) P(G) = 0.325^2 * 0.175^4 = 9.91E-5
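The two Bernoulli computations above (equiprobable residues, then residue-specific probabilities) can be checked with a few lines of Python. This is only an illustrative sketch: the function name `bernoulli_probability` and the dictionary names are assumptions, while the probabilities are those quoted in the examples.

```python
import math

def bernoulli_probability(segment, residue_probs):
    """Probability of a segment as the product of its residue probabilities."""
    return math.prod(residue_probs[r] for r in segment)

# Equiprobable model: P(A) = P(C) = P(G) = P(T) = 0.25
EQUIPROBABLE = dict.fromkeys("ACGT", 0.25)
# Residue-specific model quoted for yeast non-coding sequences
YEAST_NONCODING = {"A": 0.325, "T": 0.325, "C": 0.175, "G": 0.175}

print(f"{bernoulli_probability('CACGTG', EQUIPROBABLE):.3g}")
print(f"{bernoulli_probability('CACGTG', YEAST_NONCODING):.3g}")
```

CACGTG contains two A/T residues and four C/G residues, so the second value is 0.325^2 * 0.175^4, lower than under the equiprobable model because C and G are rarer in these sequences.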

Bernoulli models

- A Bernoulli model assumes that
  - each residue has a specific prior probability
  - this probability is constant over the sequence (no context dependencies)
- The heat maps depict the nucleotide frequencies in non-coding upstream sequences of various organisms.
- The frequencies of AT versus CG show strong inter-organism differences.

[Figure: heat maps of nucleotide frequencies, one panel per organism]
- Saccharomyces cerevisiae (Fungus)
- Escherichia coli K12 (Proteobacteria)
- Mycobacterium leprae (Actinobacteria)
- Mycoplasma genitalium (Firmicute, intracellular)
- Bacillus subtilis (Firmicute, extracellular)
- Anopheles gambiae (Insect)
- Homo sapiens (Mammal)
- Plasmodium falciparum (Apicomplexa, intracellular)

Markov model estimation (“training”)

- Transition frequencies for a Markov model of order m can be estimated from the frequencies observed for oligomers (k-mers) of length k = m+1 in a reference sequence set.
- Example
  - The table of dinucleotide frequencies (k = 2) was computed from the whole set of upstream sequences of the yeast Saccharomyces cerevisiae.
  - This table can be used to estimate a Markov model of order m = k–1 = 1.

Dinucleotide frequencies

Sequence   Occurrences   Frequency
S          N(S)          F(S)
AA         526,149       0.112
AC         251,377       0.054
AG         275,056       0.059
AT         414,453       0.088
CA         294,423       0.063
CC         178,324       0.038
CG         146,052       0.031
CT         275,859       0.059
GA         277,343       0.059
GC         184,367       0.039
GG         173,404       0.037
GT         239,569       0.051
TA         369,980       0.079
TC         280,475       0.060
TG         279,932       0.060
TT         521,236       0.111


Exercise: estimate P(G|T) from the dinucleotide frequency table
- Give the formula with symbols
- Replace the symbols by their values
- Compute the result

$$P(r_i \mid S_{1..m}) = \frac{F_{bg}(r_i \mid S_{1..m})}{\sum_{j \in A} F_{bg}(r_j \mid S_{1..m})} = \frac{F_{bg}(S_{1..m} r_i)}{\sum_{j \in A} F_{bg}(S_{1..m} r_j)}$$

$$P(G \mid T) = \frac{F(G \mid T)}{\sum_{j \in A} F(j \mid T)} = \frac{F(TG)}{F(T{*})} = \frac{0.060}{0.079 + 0.060 + 0.060 + 0.111} = \frac{0.060}{0.310} = 0.194$$
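The same estimation can be applied to every prefix at once. The sketch below is an illustration under stated assumptions: the function name `transition_matrix` is hypothetical, and it works from the occurrence counts N(S) of the dinucleotide table rather than from the rounded frequencies, so P(G|T) comes out at about 0.193 instead of the 0.194 obtained by hand from three-decimal frequencies.

```python
# Dinucleotide occurrences N(S) copied from the table above
DINUCLEOTIDE_COUNTS = {
    "AA": 526149, "AC": 251377, "AG": 275056, "AT": 414453,
    "CA": 294423, "CC": 178324, "CG": 146052, "CT": 275859,
    "GA": 277343, "GC": 184367, "GG": 173404, "GT": 239569,
    "TA": 369980, "TC": 280475, "TG": 279932, "TT": 521236,
}

def transition_matrix(kmer_counts):
    """Estimate P(suffix | prefix) from k-mer counts.

    The prefix is the first k-1 letters of each k-mer and the suffix its
    last letter, so k-mers of length k give a Markov model of order k-1.
    """
    prefix_totals = {}
    for kmer, n in kmer_counts.items():
        prefix_totals[kmer[:-1]] = prefix_totals.get(kmer[:-1], 0) + n
    matrix = {}
    for kmer, n in kmer_counts.items():
        matrix.setdefault(kmer[:-1], {})[kmer[-1]] = n / prefix_totals[kmer[:-1]]
    return matrix

matrix = transition_matrix(DINUCLEOTIDE_COUNTS)
print(round(matrix["T"]["G"], 3))
```

Since the function only relies on the split between prefix and last letter, the same code would accept trinucleotide counts to estimate a model of order 2.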

Examples of transition matrices

- The two tables below show the transition matrices for a Markov model of order 1 (first table) and of order 2 (second table), respectively.
- Trained with the whole set of non-coding upstream sequences of the yeast Saccharomyces cerevisiae.
- Notice the high probability of transitions from AA to A and TT to T.

$$P(r_i \mid S_{i-m,i-1})$$

Transition matrix, order 1

Prefix/Suffix    A       C       G       T      P(Prefix)
a              0.371   0.165   0.178   0.285    0.321
c              0.327   0.190   0.167   0.316    0.183
g              0.312   0.214   0.189   0.285    0.177
t              0.273   0.179   0.173   0.375    0.320
Sum            1.283   0.748   0.708   1.261
P(suffix)      0.321   0.183   0.176   0.320

Transition matrix, order 2

Prefix/Suffix    A       C       G       T      P(Prefix)
aa             0.416   0.151   0.187   0.246    0.119
ac             0.352   0.181   0.171   0.297    0.053
ag             0.339   0.202   0.193   0.267    0.057
at             0.346   0.166   0.162   0.326    0.092
ca             0.344   0.185   0.180   0.291    0.060
cc             0.305   0.200   0.171   0.324    0.035
cg             0.282   0.232   0.193   0.294    0.031
ct             0.241   0.189   0.184   0.385    0.058
ga             0.411   0.144   0.187   0.257    0.055
gc             0.334   0.192   0.182   0.293    0.038
gg             0.315   0.220   0.194   0.271    0.033
gt             0.307   0.156   0.200   0.338    0.050
ta             0.304   0.184   0.160   0.352    0.087
tc             0.313   0.192   0.152   0.343    0.057
tg             0.300   0.214   0.180   0.307    0.055
tt             0.218   0.194   0.164   0.423    0.120
Sum            5.127   3.000   2.860   5.013
P(suffix)      0.321   0.183   0.176   0.319

Markov chains and Bernoulli models

- By extension of the concept of Markov chain, Bernoulli models can be qualified as Markov models of order 0 (the order 0 means that there is no dependency between a residue and the preceding ones).
- The prior probabilities of a Markov model of order m = 0 can be estimated from the frequencies of single nucleotides (k = m+1 = 1) in a background sequence set.
- The table below shows the residue frequencies in the genomes of the yeast Saccharomyces cerevisiae and the bacterium Escherichia coli K12, respectively.
- Notice the strong differences between these genomes.

Markov order 0 = Bernoulli

Genome                        A       C       G       T
Saccharomyces cerevisiae    0.310   0.191   0.191   0.309
Escherichia coli K12        0.246   0.254   0.254   0.246

Scoring a sequence segment with a Markov model

- Exercise: compute the probability P(S|B) of a sequence segment S with a background Markov model B of order 2, estimated from 3 nt frequencies on the yeast non-coding upstream sequences.

Sequence probability given the background model

$$P(S \mid B) = P(S_{1,m} \mid B) \prod_{i=m+1}^{L} P(r_i \mid S_{i-m,i-1}, B)$$

Sequence: CCTACTATATGCCCAGAATT

Background model B: transition matrix, order 2

Prefix/Suffix    A           C         G         T          P(Prefix)   N(Prefix)
AA             0.388       0.161     0.200     0.251        0.112       525,000
AC             0.339       0.198     0.173     0.290        0.054       251,072
AG             0.345       0.204     0.196     0.255        0.059       274,601
AT             0.311       0.184     0.182     0.323        0.088       413,946
CA             0.347       0.178     0.189     0.286        0.063       293,750
CC             0.341       0.190     0.161     0.309        0.038       178,110
CG             0.293       0.221     0.196     0.290        0.031       145,876
CT             0.229       0.195     0.205     0.371        0.059       275,634
GA             0.394       0.155     0.187     0.264        0.059       277,053
GC             0.330       0.205     0.169     0.297        0.039       184,192
GG             0.318       0.217     0.187     0.277        0.037       173,266
GT             0.285       0.175     0.204     0.336        0.051       239,384
TA             0.300       0.193     0.168     0.339        0.079       369,426
TC             0.313       0.203     0.152     0.332        0.060       280,131
TG             0.302       0.209     0.208     0.282        0.060       279,783
TT             0.210       0.208     0.189     0.392        0.111       520,906
P(Suffix)      0.313       0.191     0.187     0.310
N(suffix)      1,466,075   893,444   873,260   1,449,351

Scoring a sequence segment with a Markov model

- The example below illustrates the computation of the probability P(S|B) of a sequence segment S with a background Markov model B of order 2, estimated from 3 nt frequencies on the yeast non-coding upstream sequences.

Sequence probability given the background model

$$P(S \mid B) = P(S_{1,m} \mid B) \prod_{i=m+1}^{L} P(r_i \mid S_{i-m,i-1}, B)$$

Sequence: CCTACTATATGCCCAGAATT

Background model B: the order-2 transition matrix given with the exercise above.

Columns: pos = step; P(R|W) = probability of residue R given the preceding word W, with its value; wR = word followed by residue; S = sequence scored so far; P(S) = cumulative probability.

pos   P(R|W)    value   wR    S                      P(S)
1     P(CC)     0.038   cc    CC                     3.80E-02
2     P(T|CC)   0.309   ccT   CCT                    1.17E-02
3     P(A|CT)   0.229   ctA   CCTA                   2.69E-03
4     P(C|TA)   0.193   taC   CCTAC                  5.19E-04
5     P(T|AC)   0.290   acT   CCTACT                 1.50E-04
6     P(A|CT)   0.229   ctA   CCTACTA                3.45E-05
7     P(T|TA)   0.339   taT   CCTACTAT               1.17E-05
8     P(A|AT)   0.311   atA   CCTACTATA              3.63E-06
9     P(T|TA)   0.339   taT   CCTACTATAT             1.23E-06
10    P(G|AT)   0.182   atG   CCTACTATATG            2.25E-07
11    P(C|TG)   0.209   tgC   CCTACTATATGC           4.69E-08
12    P(C|GC)   0.205   gcC   CCTACTATATGCC          9.61E-09
13    P(C|CC)   0.190   ccC   CCTACTATATGCCC         1.82E-09
14    P(A|CC)   0.341   ccA   CCTACTATATGCCCA        6.21E-10
15    P(G|CA)   0.189   caG   CCTACTATATGCCCAG       1.17E-10
16    P(A|AG)   0.345   agA   CCTACTATATGCCCAGA      4.04E-11
17    P(A|GA)   0.394   gaA   CCTACTATATGCCCAGAA     1.59E-11
18    P(T|AA)   0.251   aaT   CCTACTATATGCCCAGAAT    4.00E-12
19    P(T|AT)   0.323   atT   CCTACTATATGCCCAGAATT   1.29E-12
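The step-by-step computation above can be reproduced programmatically. The following is a minimal sketch, assuming the rounded three-decimal values of the order-2 transition matrix; the names `TRANSITIONS`, `PREFIX_PROB` and `sequence_probability` are illustrative and not taken from any tool mentioned in the slides. Because it multiplies rounded values, the result may differ slightly in the last digit from the table.

```python
BASES = "ACGT"

# Order-2 transition matrix: prefix -> (P(A), P(C), P(G), P(T)), from the slide
TRANSITIONS = {
    "AA": (0.388, 0.161, 0.200, 0.251), "AC": (0.339, 0.198, 0.173, 0.290),
    "AG": (0.345, 0.204, 0.196, 0.255), "AT": (0.311, 0.184, 0.182, 0.323),
    "CA": (0.347, 0.178, 0.189, 0.286), "CC": (0.341, 0.190, 0.161, 0.309),
    "CG": (0.293, 0.221, 0.196, 0.290), "CT": (0.229, 0.195, 0.205, 0.371),
    "GA": (0.394, 0.155, 0.187, 0.264), "GC": (0.330, 0.205, 0.169, 0.297),
    "GG": (0.318, 0.217, 0.187, 0.277), "GT": (0.285, 0.175, 0.204, 0.336),
    "TA": (0.300, 0.193, 0.168, 0.339), "TC": (0.313, 0.203, 0.152, 0.332),
    "TG": (0.302, 0.209, 0.208, 0.282), "TT": (0.210, 0.208, 0.189, 0.392),
}
# P(Prefix) column of the same table
PREFIX_PROB = {
    "AA": 0.112, "AC": 0.054, "AG": 0.059, "AT": 0.088,
    "CA": 0.063, "CC": 0.038, "CG": 0.031, "CT": 0.059,
    "GA": 0.059, "GC": 0.039, "GG": 0.037, "GT": 0.051,
    "TA": 0.079, "TC": 0.060, "TG": 0.060, "TT": 0.111,
}

def sequence_probability(seq, m=2):
    """P(S|B) = P(first m residues) * product of P(r_i | m preceding residues)."""
    p = PREFIX_PROB[seq[:m]]
    for i in range(m, len(seq)):
        prefix = seq[i - m:i]
        p *= TRANSITIONS[prefix][BASES.index(seq[i])]
    return p

print(f"{sequence_probability('CCTACTATATGCCCAGAATT'):.2e}")
```

For long sequences the product quickly underflows, so in practice one would sum logarithms of the probabilities instead of multiplying them.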

Sequence discrimination

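- Problem: for a given sequence of symbols (e.g. a nucleotide sequence), identify the most likely Markov model.
- Approach: compute the log-likelihood ratio (LLR) of the sequence probabilities computed with the two respective transition matrices (CpG island versus genomic background).

$$P_{CpG}(S) = P_{CpG}(S_1) \cdot \prod_{i=1}^{L-1} P_{CpG}(S_{i+1} \mid S_i)$$

$$P_{Bg}(S) = P_{Bg}(S_1) \cdot \prod_{i=1}^{L-1} P_{Bg}(S_{i+1} \mid S_i)$$

$$L(S) = \log \frac{P_{CpG}(S)}{P_{Bg}(S)}$$

- A more efficient approach
  - Compute (only once) a log-odds matrix from the two transition matrices

$$L(r_j \mid r_i) = \log \frac{P_{CpG}(r_j \mid r_i)}{P_{Bg}(r_j \mid r_i)}$$

  - Compute the LLR of the sequence by summing the transition LLRs

$$L(S) = L_B(S_1) + \sum_{i=1}^{L-1} L(S_{i+1} \mid S_i)$$

  Here L_B(S_1) is the log-odds of the first residue (row B, sequence beginning, of the matrices), and the upper bound L − 1 of the sum refers to the sequence length L.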

Discriminating sequences based on alternative Markov models

In the three matrices below, rows give the preceding residue (B = sequence beginning) and columns the next residue.

Genomic background

Prefix     A       C       G       T
B        0.290   0.210   0.210   0.290
A        0.263   0.188   0.261   0.287
C        0.375   0.213   0.052   0.361
G        0.307   0.219   0.213   0.260
T        0.243   0.223   0.272   0.263

CpG islands

Prefix     A       C       G       T
B        0.162   0.337   0.339   0.162
A        0.154   0.288   0.438   0.119
C        0.179   0.295   0.318   0.208
G        0.182   0.384   0.296   0.138
T        0.090   0.378   0.376   0.156

Log-odds transition matrix (CpG / Bg log-odds)

Prefix     A        C       G        T
B       -0.843    0.684   0.691   -0.842
A       -0.769    0.611   0.745   -1.264
C       -1.062    0.471   2.605   -0.791
G       -0.760    0.809   0.478   -0.918
T       -1.431    0.762   0.470   -0.753
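The log-odds matrix and the LLR of a sequence can be derived from the two transition matrices in a few lines. This is a sketch under assumptions: base-2 logarithms are used because they appear consistent with the tabulated log-odds values, the assignment of matrix rows to prefixes follows the row labels B, T, G, C, A of the figure, and the function names `log_odds_matrix` and `llr` are hypothetical.

```python
import math

NUC = "ACGT"
# Rows: preceding residue ("B" = sequence beginning); columns: A, C, G, T
BACKGROUND = {
    "B": [0.29, 0.21, 0.21, 0.29],
    "A": [0.263, 0.188, 0.261, 0.287],
    "C": [0.375, 0.213, 0.052, 0.361],
    "G": [0.307, 0.219, 0.213, 0.26],
    "T": [0.243, 0.223, 0.272, 0.263],
}
CPG = {
    "B": [0.162, 0.337, 0.339, 0.162],
    "A": [0.154, 0.288, 0.438, 0.119],
    "C": [0.179, 0.295, 0.318, 0.208],
    "G": [0.182, 0.384, 0.296, 0.138],
    "T": [0.09, 0.378, 0.376, 0.156],
}

def log_odds_matrix(model, background):
    """Log-odds (base 2, an assumption) of each transition, computed once."""
    return {
        prev: [math.log2(p / q) for p, q in zip(model[prev], background[prev])]
        for prev in model
    }

def llr(sequence, log_odds):
    """Sum the log-odds of the first residue (row B) and of each transition."""
    total, prev = 0.0, "B"
    for residue in sequence:
        total += log_odds[prev][NUC.index(residue)]
        prev = residue
    return total

LOG_ODDS = log_odds_matrix(CPG, BACKGROUND)
print(round(LOG_ODDS["C"][NUC.index("G")], 2))
print(round(llr("CGCG", LOG_ODDS), 2), round(llr("ATAT", LOG_ODDS), 2))
```

A positive LLR favours the CpG island model and a negative one the genomic background; the C to G transition carries by far the largest weight, reflecting the depletion of CpG dinucleotides outside islands.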

Exercise

1. Open a connection to the UCSC genome browser, and select the Table Browser tool.
2. Choose a mammalian genome (e.g. Human version hg38), select the CpG track in the Regulation group, and download the sequences of all the annotated CpG islands.
3. Open a connection to RSAT Metazoa.
4. Compute the transition matrix of a 1st-order Markov model from the CpG island sequences with the tool create background model.
5. Use the tool random genome fragments to extract sequences of random genomic fragments of the same sizes as the CpG islands.
6. Use these random genome fragments to compute the genomic background (transition matrix of a 1st-order Markov model).
7. With the tool sequence proba, compute the sequence probabilities for each sequence file (CpG islands, random genome fragments) with each model (CpG island, genome background).
8. Open the 4 result files with R or in a spreadsheet, and compute the log-likelihood ratio log(P(S|CpG) / P(S|Bg)) for
   - the CpG islands
   - the genomic background
9. Compare the distributions of these LLR (you can depict them with histograms, boxplots, violin plots, …).

Links
- https://genome.ucsc.edu/cgi-bin/hgTables
- http://metazoa.rsat.eu/
- http://rsat.sb-roscoff.fr/create-background-model_form.cgi

Hidden Markov Models (HMM)

- Let us consider the genome as a Markov chain composed of a succession of regions in one of two states: CpG island (+) or non-CpG island (-).
- The total size of the human genome is 3 Gb, and its annotation contains 31,144 CpG islands totaling 24.2 Mb.
- Exercise: based on these numbers, estimate the transition probabilities between states.

Probabilities of transition between states

[Figure: 2-state Markov process with states CpG (+) and other (-)]

Transition matrix (to be filled in)

          CpG    Other
CpG        ?       ?
Other      ?       ?

Solution: transition probabilities between states

Segmentation of the genome into CpG islands and non-CpG islands

Notation: S_g is the genome size, S_+ the total size of the CpG islands, S_- the total size of the non-CpG regions, and N_+ the number of CpG islands.

- The total size of non-CpG regions is the difference between the genome size (3e+09) and the total size of the CpG islands (24,200,434).

$$S_{-} = S_{g} - S_{+} = 3 \times 10^{9} - 24{,}200{,}434 = 2{,}975{,}799{,}566$$

- In total, there are 31,144 CpG islands in the genome, and each of them is preceded by a non-CpG region. There are thus 31,144 transitions from non-CpG to CpG. The probability of transition from non-CpG to CpG is thus the number of non-CpG positions preceding a CpG island divided by the total number of non-CpG positions.

$$P(+ \mid -) = \frac{N_{+}}{S_{-}} = \frac{31{,}144}{2{,}975{,}799{,}566} = 1.0466 \times 10^{-5}$$

- Since there are only two possible states, the transition probability from non-CpG to non-CpG is the complement of the transition probability from non-CpG to CpG.

$$P(- \mid -) = 1 - P(+ \mid -) = 0.99999$$

- Each CpG island has exactly one ending nucleotide, which precedes a non-CpG nucleotide. The number of nucleotides marking a transition from CpG to non-CpG is thus 31,144. The probability of transition from CpG to non-CpG is this number divided by the total size of all CpG islands.

$$P(- \mid +) = \frac{N_{+}}{S_{+}} = \frac{31{,}144}{24{,}200{,}434} = 0.00129$$

- The probability of transition from CpG to CpG is the complement.

$$P(+ \mid +) = 1 - P(- \mid +) = 0.99871$$

Transition matrix

             CpG (+)    Other (-)
CpG (+)      0.99871    0.00129
Other (-)    0.00001    0.99999
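The arithmetic of this solution is easy to reproduce. The snippet below is a small sketch using only the three numbers quoted in the solution (genome size, total CpG island size, number of CpG islands); the variable names are illustrative.

```python
GENOME_SIZE = 3_000_000_000      # S_g, approximate size of the human genome
CPG_TOTAL_SIZE = 24_200_434      # S_+, total size of annotated CpG islands
N_CPG_ISLANDS = 31_144           # N_+, number of annotated CpG islands

# Total size of the non-CpG regions
non_cpg_size = GENOME_SIZE - CPG_TOTAL_SIZE

# One entry into and one exit from each CpG island
p_minus_to_plus = N_CPG_ISLANDS / non_cpg_size
p_plus_to_minus = N_CPG_ISLANDS / CPG_TOTAL_SIZE

transitions = {
    "+": {"+": 1 - p_plus_to_minus, "-": p_plus_to_minus},
    "-": {"+": p_minus_to_plus, "-": 1 - p_minus_to_plus},
}

for state, row in transitions.items():
    print(state, {target: f"{p:.5f}" for target, p in row.items()})
```

The inverse of P(-|+) gives the mean length of a CpG island under this model, a useful sanity check against the annotation (24,200,434 / 31,144, a bit under 800 bp).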



Hidden Markov Models

- Hidden Markov Models (HMM) are an extension of Markov chains, where we assume a process with a given number of states, and a specific probability of emitting symbols associated with each state.
- The state-specific emission probabilities can themselves be modeled as Markov chains, or as Bernoulli models.
- Example: CpG islands in genomic sequences
  - 2 states: CpG islands and genomic background
  - Each state has a specific nucleotide transition matrix
  - Note: the emission probabilities of the background state were computed from a random selection of genomic regions, which might include some fragments of CpG islands. However, the proportion is likely to be very low, so we can use it as an estimator for the emission probabilities of non-CpG regions (other).

[Figure: 2-state model with states CpG (+) and Other (-), the state transition matrix, and the emission probabilities of each state: genomic background for the Bg state (-), CpG islands for the CpG state (+)]

Transition matrix

             CpG (+)    Other (-)
CpG (+)      0.99871    0.00129
Other (-)    0.00001    0.99999

Emission probabilities: the genomic background (-) and CpG island (+) nucleotide transition matrices are those given in the section "Discriminating sequences based on alternative Markov models".



Sequence segmentation

- Problem: given a long unannotated sequence, identify the segments corresponding to CpG islands.
- Notes
  - This problem differs from the discrimination problem seen before, where we assign a class to each sequence as a whole.
  - No unequivocal correspondence from symbols to states: the same symbol can be emitted by different states.
  - The state underlying each emission (nucleotide) is thus “hidden”.
  - We need to discover it by finding the chain of states most likely to have produced the sequence.

Sequence segmentation

- Example: sequence ATTATGGGCGCGAA
- Each nucleotide (symbol) can be generated by either the CpG (upper row) or the non-CpG (lower row) state.
- Between each pair of nucleotides, we can either stay in the current state (horizontal arrows) or switch to the other state (oblique arrows).
- The problem amounts to finding, among all possible paths from B (sequence beginning) to E (end), the path having the maximal likelihood.
- Exercise: how many possible paths are there between B and E?

[Figure: lattice with the sequence drawn twice between B and E, upper row = CpG island state, lower row = non-CpG state]

Sequence position   1  2  3  4  5  6  7  8  9  10 11 12 13 14
Nucleotide          A  T  T  A  T  G  G  G  C  G  C  G  A  A
Hidden state        ?  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?

Uncovering the hidden chain of states

- The drawing highlights one of the possible paths from the beginning to the end of the sequence ATTATGGGCGCGAA.
- Exercise
  - Annotate the sequence of hidden states with + and -
  - Compute the probability of this path according to the previously defined parameters
  - How many possible paths are there between B and E?
  - Which path would you intuitively propose as the best?
  - Which path would you intuitively propose as the worst?
  - Compute the probability of these paths

[Figure: lattice of CpG island (+) and non-CpG (-) states with one path highlighted; hidden states to be filled in for the 14 positions]

Parameters: the emission matrices of the non-CpG (-) state (genomic background) and of the CpG (+) state (CpG islands), and the state transition matrix, are the ones defined previously.

             CpG (+)    Other (-)
CpG (+)      0.99871    0.00129
Other (-)    0.00001    0.99999

Viterbi algorithm

- The Viterbi algorithm finds the optimal path in a Hidden Markov Model.
- Same principle as dynamic programming:
  - Compute the probability to arrive at each state and position from each of the incoming arrows.
  - For each state at each position, take the highest probability and keep track of the corresponding incoming arrow.

[Figure: lattice of CpG island and non-CpG states along the sequence; hidden states to be uncovered]
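The two steps of the Viterbi algorithm listed above can be sketched as follows for the 2-state CpG model. Several points are assumptions made for this illustration and are not specified in the slides: the initial state probabilities (`START`, set to 0.5 each), the convention that each state emits a nucleotide conditionally on the preceding nucleotide using its own transition matrix (row B for the first position), the absence of an explicit end state, and the function names. Logarithms are used to avoid numerical underflow.

```python
import math

NUC = "ACGT"
STATES = "+-"
# Emission matrices: rows = previous nucleotide ("B" = beginning), columns = A, C, G, T
EMIT = {
    "+": {"B": [0.162, 0.337, 0.339, 0.162], "A": [0.154, 0.288, 0.438, 0.119],
          "C": [0.179, 0.295, 0.318, 0.208], "G": [0.182, 0.384, 0.296, 0.138],
          "T": [0.09, 0.378, 0.376, 0.156]},
    "-": {"B": [0.29, 0.21, 0.21, 0.29], "A": [0.263, 0.188, 0.261, 0.287],
          "C": [0.375, 0.213, 0.052, 0.361], "G": [0.307, 0.219, 0.213, 0.26],
          "T": [0.243, 0.223, 0.272, 0.263]},
}
TRANS = {"+": {"+": 0.99871, "-": 0.00129}, "-": {"+": 0.00001, "-": 0.99999}}
START = {"+": 0.5, "-": 0.5}  # assumption: initial probabilities are not given

def log_emission(state, prev, nuc):
    return math.log(EMIT[state][prev][NUC.index(nuc)])

def path_log_prob(seq, path):
    """Log-probability of emitting seq along a given chain of hidden states."""
    total = math.log(START[path[0]]) + log_emission(path[0], "B", seq[0])
    for i in range(1, len(seq)):
        total += math.log(TRANS[path[i - 1]][path[i]])
        total += log_emission(path[i], seq[i - 1], seq[i])
    return total

def viterbi(seq):
    """Return the most likely chain of states and its log-probability."""
    score = {s: math.log(START[s]) + log_emission(s, "B", seq[0]) for s in STATES}
    backpointers = []
    for i in range(1, len(seq)):
        new_score, pointer = {}, {}
        for s in STATES:
            # Best incoming arrow for state s at position i
            best = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            pointer[s] = best
            new_score[s] = (score[best] + math.log(TRANS[best][s])
                            + log_emission(s, seq[i - 1], seq[i]))
        backpointers.append(pointer)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointer in reversed(backpointers):  # trace back the stored arrows
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path)), best_score

path, log_p = viterbi("ATTATGGGCGCGAA")
print(path, round(log_p, 2))
```

The helper `path_log_prob` can also be used for the exercise on scoring a hand-chosen path. With transition probabilities this close to 1 on the diagonal, switching state is heavily penalised, so short sequences tend to be assigned a single state throughout.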

Sequence motifs

Bioinformatics

Profile matrices (= position-specific scoring matrices, PSSM)

- Starting from a multiple alignment, one can build a matrix which reflects the most representative residues at each position
  - Each column represents a position
  - Each row represents a residue (20 rows for proteins, 4 rows for DNA)
  - The cells indicate the frequency of each residue at each position of the multiple alignment.

Multiple alignment

W S K T N V T S T L H I C W G A Q A G L
W S K T N V T S T L H I C W G A Q A G L
W T Q S H V H R T L N I C W A A Q A A V
F L K Q N V T S S M Y I C W G A M A A L
W S V T N V T S T I H I C W G A Q A G L
W S K D H V T S T L F V C W A V Q A A L
W S K D H V T S T L F V C W A V Q A A L
W S K S H V Y S S L H I C W G A Q A A L
W T T T N V H S T L N V C W G G M A A V
W A K D H V T S T L F V C W A V Q A A L
W A K D H V T S T L F V C W A V Q A A L
W S K T H V Y S T L H I C W G A Q A G L
W S R H N V Y S T M F I C W A A Q A G L
W A K A H V T S T L Y I C W A A Q A G L
W A K E H V T S T L F V C W A V Q A A L
W T Q T N V H S T L N V C W G A M A A I
W S K T H V Y S T L H I C W G A Q A G L

Position-Specific Scoring Matrix (counts)

Residue \ Position
          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
A         0  4  0  1  0  0  0  0  0  0  0  0  0  0  8 11  0 17 10  0
C         0  0  0  0  0  0  0  0  0  0  0  0 17  0  0  0  0  0  0  0
D         0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
E         0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
F         1  0  0  0  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0
G         0  0  0  0  0  0  0  0  0  0  0  0  0  0  9  1  0  0  7  0
H         0  0  0  1 10  0  3  0  0  0  6  0  0  0  0  0  0  0  0  0
I         0  0  0  0  0  0  0  0  0  1  0 10  0  0  0  0  0  0  0  1
K         0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
L         0  1  0  0  0  0  0  0  0 14  0  0  0  0  0  0  0  0  0 14
M         0  0  0  0  0  0  0  0  0  2  0  0  0  0  0  0  3  0  0  0
N         0  0  0  0  7  0  0  0  0  0  3  0  0  0  0  0  0  0  0  0
P         0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Q         0  0  2  1  0  0  0  0  0  0  0  0  0  0  0  0 14  0  0  0
R         0  0  1  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
S         0  9  0  2  0  0  0 16  2  0  0  0  0  0  0  0  0  0  0  0
T         0  3  1  7  0  0 10  0 15  0  0  0  0  0  0  0  0  0  0  0
V         0  0  1  0  0 17  0  0  0  0  0  7  0  0  0  5  0  0  0  2
W        16  0  0  0  0  0  0  0  0  0  0  0  0 17  0  0  0  0  0  0
Y         0  0  0  0  0  0  4  0  0  0  2  0  0  0  0  0  0  0  0  0
sum      17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17

Weight matrix

$$W_{i,j} = \ln\left(\frac{f'_{i,j}}{p_i}\right)$$

$$f'_{i,j} = \frac{n_{i,j} + p_i k}{\sum_{r=1}^{A} n_{r,j} + k}$$

where n_{i,j} is the count of residue i at position j, p_i the prior probability of residue i, k the pseudo-count, A the number of residues in the alphabet, and f'_{i,j} the frequency corrected by the pseudo-count.

Count matrix: row sums and residue frequencies (the counts per position are those of the position-specific scoring matrix above)

Residue   Sum   Freq
A          51   0.150
C          17   0.050
D           4   0.012
E           1   0.003
F           7   0.021
G          17   0.050
H          20   0.059
I          12   0.035
K          12   0.035
L          29   0.085
M           5   0.015
N          10   0.029
P           0   0.000
Q          17   0.050
R           2   0.006
S          29   0.085
T          36   0.106
V          32   0.094
W          33   0.097
Y           6   0.018
sum       340   1.000

Weight matrix

Residue \ Position
           1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18     19     20
A      -1.72   0.19  -1.72  -0.39  -1.72  -1.72  -1.72  -1.72  -1.72  -1.72  -1.72  -1.72  -1.72  -1.72   0.49   0.63  -1.72   0.82   0.59  -1.72
C      -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26   1.28  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26
D      -0.70  -0.70  -0.70   1.21  -0.70  -0.70  -0.70  -0.70  -0.70  -0.70  -0.70  -0.70  -0.70  -0.70  -0.70  -0.70  -0.70  -0.70  -0.70  -0.70
E      -0.30  -0.30  -0.30   1.02  -0.30  -0.30  -0.30  -0.30  -0.30  -0.30  -0.30  -0.30  -0.30  -0.30  -0.30  -0.30  -0.30  -0.30  -0.30  -0.30
F       0.42  -0.90  -0.90  -0.90  -0.90  -0.90  -0.90  -0.90  -0.90  -0.90   1.18  -0.90  -0.90  -0.90  -0.90  -0.90  -0.90  -0.90  -0.90  -0.90
G      -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26   1.00   0.07  -1.26  -1.26   0.89  -1.26
H      -1.32  -1.32  -1.32   0.00   0.98  -1.32   0.46  -1.32  -1.32  -1.32   0.76  -1.32  -1.32  -1.32  -1.32  -1.32  -1.32  -1.32  -1.32  -1.32
I      -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11   0.21  -1.11   1.19  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11   0.21
K      -1.11  -1.11   1.27  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11  -1.11
L      -1.48  -0.15  -1.48  -1.48  -1.48  -1.48  -1.48  -1.48  -1.48   0.97  -1.48  -1.48  -1.48  -1.48  -1.48  -1.48  -1.48  -1.48  -1.48   0.97
M      -0.78  -0.78  -0.78  -0.78  -0.78  -0.78  -0.78  -0.78  -0.78   0.83  -0.78  -0.78  -0.78  -0.78  -0.78  -0.78   1.01  -0.78  -0.78  -0.78
N      -1.04  -1.04  -1.04  -1.04   1.11  -1.04  -1.04  -1.04  -1.04  -1.04   0.74  -1.04  -1.04  -1.04  -1.04  -1.04  -1.04  -1.04  -1.04  -1.04
P       0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
Q      -1.26  -1.26   0.36   0.07  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26  -1.26   1.19  -1.26  -1.26  -1.26
R      -0.48  -0.48   0.85  -0.48  -0.48  -0.48  -0.48   0.85  -0.48  -0.48  -0.48  -0.48  -0.48  -0.48  -0.48  -0.48  -0.48  -0.48  -0.48  -0.48
S      -1.48   0.78  -1.48   0.14  -1.48  -1.48  -1.48   1.03   0.14  -1.48  -1.48  -1.48  -1.48  -1.48  -1.48  -1.48  -1.48  -1.48  -1.48  -1.48
T      -1.57   0.22  -0.25   0.58  -1.57  -1.57   0.73  -1.57   0.91  -1.57  -1.57  -1.57  -1.57  -1.57  -1.57  -1.57  -1.57  -1.57  -1.57  -1.57
V      -1.52  -1.52  -0.20  -1.52  -1.52   1.01  -1.52  -1.52  -1.52  -1.52  -1.52   0.63  -1.52  -1.52  -1.52   0.49  -1.52  -1.52  -1.52   0.09
W       0.98  -1.53  -1.53  -1.53  -1.53  -1.53  -1.53  -1.53  -1.53  -1.53  -1.53  -1.53  -1.53   1.00  -1.53  -1.53  -1.53  -1.53  -1.53  -1.53
Y      -0.85  -0.85  -0.85  -0.85  -0.85  -0.85   1.06  -0.85  -0.85  -0.85   0.77  -0.85  -0.85  -0.85  -0.85  -0.85  -0.85  -0.85  -0.85  -0.85
sum    -17.8  -14.4  -13.7  -10.7  -17.2  -19.1  -15.7  -17.8  -17.6  -16.3  -14.1  -17.2  -19.1  -19.1  -17.2  -16.0  -17.4  -19.1  -17.2  -16.3

Scoring a sequence with a profile matrix

The weight matrix is aligned at three successive positions of the sequence. At each position, the score is the sum of the weights of the residues found in the 20 columns.

Alignment 1 (SUM = -21.1626)

Sequence      L      W      A      K      D      H      V      T      S      T      M      F      V      C      W      A      V      M      A      A
Score     -1.48  -1.53  -1.72  -1.11  -0.70  -1.32  -1.52  -1.57  0.136  -1.57  -0.78  -0.90  -1.52  -1.26  -1.53  0.628  -1.52  -0.78  0.587  -1.72

Alignment 2 (SUM = 17.59818)

Sequence      W      A      K      D      H      V      T      S      T      M      F      V      C      W      A      V      M      A      A      L
Score     0.975  0.192  1.268  1.210  0.981  1.014  0.735  1.029  0.910  0.835  1.180  0.631  1.277  1.001  0.491  0.486  1.007  0.817  0.587  0.972

Alignment 3 (SUM = -21.9422)

Sequence      A      K      D      H      V      T      S      T      M      F      V      C      W      A      V      M      A      A      L      V
Score     -1.72  -1.11  -0.70   0.00  -1.52  -1.57  -1.48  -1.57  -0.78  -0.90  -1.52  -1.26  -1.53  -1.72  -1.52  -0.78  -1.72  0.817  -1.48  0.094
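Building a weight matrix from the alignment and sliding it along a sequence can be sketched as below. This is an illustration under explicit assumptions rather than a reproduction of the slide's numbers: it applies the weight formula with natural logarithms and k = 1, and it uses smoothed priors (one extra count per residue) so that no prior is zero, since proline never occurs in the alignment. The tabulated weights may have been produced with different conventions (for instance another logarithm base), so the scores are not expected to match digit for digit. The scanned sequence LWAKDHVTSTMFVCWAVMAALV is assembled here from the three overlapping windows shown above; function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
# The 17 aligned sequences from the multiple alignment
ALIGNMENT = """
WSKTNVTSTLHICWGAQAGL WSKTNVTSTLHICWGAQAGL WTQSHVHRTLNICWAAQAAV
FLKQNVTSSMYICWGAMAAL WSVTNVTSTIHICWGAQAGL WSKDHVTSTLFVCWAVQAAL
WSKDHVTSTLFVCWAVQAAL WSKSHVYSSLHICWGAQAAL WTTTNVHSTLNVCWGGMAAV
WAKDHVTSTLFVCWAVQAAL WAKDHVTSTLFVCWAVQAAL WSKTHVYSTLHICWGAQAGL
WSRHNVYSTMFICWAAQAGL WAKAHVTSTLYICWAAQAGL WAKEHVTSTLFVCWAVQAAL
WTQTNVHSTLNVCWGAMAAI WSKTHVYSTLHICWGAQAGL
""".split()

def weight_matrix(alignment, k=1.0):
    """W[j][i] = ln(f'_ij / p_i), with f'_ij = (n_ij + p_i * k) / (N + k)."""
    n_seq, width = len(alignment), len(alignment[0])
    total = Counter("".join(alignment))
    # Smoothed priors (assumption): one extra count per residue avoids p_i = 0
    prior = {a: (total[a] + 1) / (n_seq * width + len(ALPHABET)) for a in ALPHABET}
    weights = []
    for j in range(width):
        column = Counter(seq[j] for seq in alignment)
        weights.append({
            a: math.log(((column[a] + prior[a] * k) / (n_seq + k)) / prior[a])
            for a in ALPHABET
        })
    return weights

def score(segment, weights):
    """Sum of the weights of the residues found in each column."""
    return sum(w[r] for r, w in zip(segment, weights))

def scan(sequence, weights):
    """Score every window of the sequence having the width of the matrix."""
    width = len(weights)
    return [(i, score(sequence[i:i + width], weights))
            for i in range(len(sequence) - width + 1)]

WEIGHTS = weight_matrix(ALIGNMENT)
for offset, s in scan("LWAKDHVTSTMFVCWAVMAALV", WEIGHTS):
    print(offset, round(s, 2))
```

As in the three alignments above, only the window in register with the motif should obtain a positive score; shifting by a single residue in either direction makes almost every column score negatively.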

PSI-BLAST

- PSI-BLAST stands for Position-Specific Iterated BLAST (Altschul et al., 1997)
  1. BLAST runs a first time in normal mode.
  2. Resulting sequences are aligned together (multiple sequence alignment) and a PSSM is calculated.
  3. This PSSM is used to scan the database for new matches.
  4. Steps 2-3 can be iterated several times.
- The PSSM increases the sensitivity of the search.

References

- Substitution matrices
  - PAM series
    • Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. (1978). A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5, 345-352.
  - BLOSUM substitution matrices
    • Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89, 10915-9.
  - Gonnet matrices, built by an iterative procedure
    • Gonnet, G. H., Cohen, M. A. & Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science 256, 1443-5.
- Sequence alignment algorithms
  - Needleman-Wunsch (pairwise, global)
    • Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-53.
  - Smith-Waterman (pairwise, local)
    • Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol 147, 195-7.
  - FastA (database searches, pairwise, local)
    • Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444-2448.
  - BLAST (database searches, pairwise, local)
    • Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). A basic local alignment search tool. J Mol Biol 215, 403-410.
    • Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-3402.
  - Clustal (multiple, global)
    • Higgins, D. G. & Sharp, P. M. (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237-44.
    • Higgins, D. G., Thompson, J. D. & Gibson, T. J. (1996). Using CLUSTAL for multiple sequence alignments. Methods Enzymol 266, 383-402.
  - Dialign (multiple, local)
    • Morgenstern, B., Frech, K., Dress, A. & Werner, T. (1998). DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14, 290-4.
  - MUSCLE (multiple local)

Modelling protein families with HMM

Limitations of position weight matrices

- The main limitation of position-weight matrices is that they do not handle gaps in a practical way.
- Hidden Markov Models (HMM) can be used to handle gaps.

The Pfam database

- http://pfam.xfam.org/
  - “The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).”
