Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS, BioCentrum, DTU

Upload: marcos-wardlow

Post on 14-Dec-2015



Outline

• What is a binding motif?
• How to describe a sequence motif?
• Construction of scoring matrices
• Sequence motifs and hidden Markov models
• Use of HMM
• Why are Profile HMM’s better than Anders Gorm’s sequence alignments
  – Or at least PSSM’s

Binding motifs

[Figure: binding motifs for MHC-I, TAP, and MHC-II]


Anchor positions

MHC class I with peptide

SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV
LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL
HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI
ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL
LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV
PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV
ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM
KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV
KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV
SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV
ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV
TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA
GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA
KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV
AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV
GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV
VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC
ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA
YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI
NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV
VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ
GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL
EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV
YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL
FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL
AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI
AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

Sequence information


Sequence Information

Calculate p_a at each position

Entropy

$S = -\sum_a p_a \log(p_a)$

Information content

$I = \log(20) + \sum_a p_a \log(p_a)$

The logarithm is base 2, so log(20) ≈ 4.32 bits.

Conserved positions
– P_V = 1, P_rest = 0 => S = 0, I = log(20)

Mutable positions
– P_aa = 1/20 => S = log(20), I = 0
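The two limiting cases above can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms and a 20-letter amino acid alphabet; the function names are illustrative and not taken from the slides.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def entropy(p):
    """Shannon entropy S = -sum_a p_a * log2(p_a); zero-probability terms contribute 0."""
    return -sum(pa * math.log2(pa) for pa in p.values() if pa > 0)

def information_content(p):
    """Information content I = log2(20) - S for one motif position."""
    return math.log2(len(AMINO_ACIDS)) - entropy(p)

conserved = {"V": 1.0}                         # only V is ever observed
mutable = {aa: 1 / 20 for aa in AMINO_ACIDS}   # all amino acids equally likely

for name, p in (("conserved", conserved), ("mutable", mutable)):
    print(name, round(entropy(p), 2), round(information_content(p), 2))
```

A fully conserved position carries the maximal information log2(20), while a uniform position carries none.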

Information content

Amino acid frequencies per position. The last two columns give the entropy S and the information content I of each position.

     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V  |    S    I
1 0.10 0.06 0.01 0.02 0.01 0.02 0.02 0.09 0.01 0.07 0.11 0.06 0.04 0.08 0.01 0.11 0.03 0.01 0.05 0.08  | 3.96 0.37
2 0.07 0.00 0.00 0.01 0.01 0.00 0.01 0.01 0.00 0.08 0.59 0.01 0.07 0.01 0.00 0.01 0.06 0.00 0.01 0.08  | 2.16 2.16
3 0.08 0.03 0.05 0.10 0.02 0.02 0.01 0.12 0.02 0.03 0.12 0.01 0.03 0.05 0.06 0.06 0.04 0.04 0.04 0.07  | 4.06 0.26
4 0.07 0.04 0.02 0.11 0.01 0.04 0.08 0.15 0.01 0.10 0.04 0.03 0.01 0.02 0.09 0.07 0.04 0.02 0.00 0.05  | 3.87 0.45
5 0.04 0.04 0.04 0.04 0.01 0.04 0.05 0.16 0.04 0.02 0.08 0.04 0.01 0.06 0.10 0.02 0.06 0.02 0.05 0.09  | 4.04 0.28
6 0.04 0.03 0.03 0.01 0.02 0.03 0.03 0.04 0.02 0.14 0.13 0.02 0.03 0.07 0.03 0.05 0.08 0.01 0.03 0.15  | 3.92 0.40
7 0.14 0.01 0.03 0.03 0.02 0.03 0.04 0.03 0.05 0.07 0.15 0.01 0.03 0.07 0.06 0.07 0.04 0.03 0.02 0.08  | 3.98 0.34
8 0.05 0.09 0.04 0.01 0.01 0.05 0.07 0.05 0.02 0.04 0.14 0.04 0.02 0.05 0.05 0.08 0.10 0.01 0.04 0.03  | 4.04 0.28
9 0.07 0.01 0.00 0.00 0.02 0.02 0.02 0.01 0.01 0.08 0.26 0.01 0.01 0.02 0.00 0.04 0.02 0.00 0.01 0.38  | 2.78 1.55

$S = -\sum_a p_a \log(p_a)$

$I = \log(20) + \sum_a p_a \log(p_a)$

Sequence logos

Height of a column equal to I

Relative height of a letter is p

Highly useful tool to visualize sequence motifs

[Figure: sequence logo for HLA-A0201, with the high information positions marked]

http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
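The logo geometry described above (column height I, letter height proportional to p) can be sketched as follows. The example column is hypothetical and only serves to illustrate the arithmetic; base-2 logarithms are assumed.

```python
import math

def letter_heights(p, alphabet_size=20):
    """Return (column height I, per-letter heights p_a * I) for one logo column."""
    entropy = -sum(v * math.log2(v) for v in p.values() if v > 0)
    info = math.log2(alphabet_size) - entropy
    return info, {aa: v * info for aa, v in p.items()}

# Hypothetical anchor-like column (not taken from the slides).
column = {"L": 0.5, "M": 0.25, "I": 0.25}
info, heights = letter_heights(column)
print(round(info, 2), {aa: round(h, 2) for aa, h in heights.items()})
```

The letter heights add up to the column height, so a conserved position produces a tall stack dominated by one letter.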

Characterizing a sequence motif from small data sets

10 MHC restricted peptides

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

What can we learn?

1. A at P1 favors binding?
2. I is not allowed at P9?
3. K at P4 favors binding?
4. Which positions are important for binding?

Simple motifs: Yes/No rules

[AGTK]1 [LMIV]2 [ANLV]3 ... [MNRTVL]9

• Only 11 of 212 peptides identified!
• Need more flexible rules
  – If a peptide does not fit P1 but fits P2, then it is ok
• Not all positions are equally important
  – We know that P2 and P9 determine binding more than other positions
• Cannot discriminate between good and very good binders

Simple motifs: Yes/No rules

[AGTK]1 [LMIV]2 [ANLV]3 ... [AIFKLV]7 ... [MNRTVL]9

• Example

RLLDDTPEV 0.59
GLLGNVSTV 0.71
ALAKAAAAL 0.47

• The first two peptides will not fit the motif
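The yes/no rule can be sketched as one set of allowed amino acids per position, built from the 10 training peptides. This is a minimal illustration; the helper names are assumptions, not part of the original slides.

```python
TRAINING_PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def build_yes_no_motif(peptides):
    """One set of observed amino acids per position (e.g. {A, G, T, K} at P1)."""
    return [set(column) for column in zip(*peptides)]

def fits(peptide, motif):
    """A peptide fits only if every position holds an amino acid seen in training."""
    return len(peptide) == len(motif) and all(
        aa in allowed for aa, allowed in zip(peptide, motif)
    )

motif = build_yes_no_motif(TRAINING_PEPTIDES)
for peptide in ("RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL"):
    print(peptide, fits(peptide, motif))
```

RLLDDTPEV fails at P1 (R was never observed) and GLLGNVSTV fails at P5 and P7, which is why such rigid rules reject peptides even when the anchor positions look fine.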

Extended motifs

Fitness of aa at each position given by P(aa)

Example P1
P_A = 6/10
P_G = 2/10
P_T = P_K = 1/10
P_C = P_D = … = P_V = 0

Problems
– Few data
– Data redundancy/duplication

Sequence information: raw sequence counting

Sequence weighting

Poor or biased sampling of sequence space

The five similar sequences (ALAKAAAAM, ALAKAAAAN, ALAKAAAAR, ALAKAAAAT, ALAKAAAAV) each get weight 1/5

Example P1
P_A = 2/6
P_G = 2/6
P_T = P_K = 1/6
P_C = P_D = … = P_V = 0

Example
RLLDDTPEV 0.59
GLLGNVSTV 0.71
ALAKAAAAL 0.47
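The down-weighting of similar sequences can be sketched as follows. The slides only state that the five similar peptides share a weight of 1/5; the greedy clustering and the threshold of 6 identical positions used here are assumptions made for illustration.

```python
from collections import defaultdict

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def identities(a, b):
    return sum(x == y for x, y in zip(a, b))

def cluster(peptides, min_identities=6):
    """Greedy clustering (assumed scheme): join the first cluster whose first
    member shares at least min_identities positions, else start a new cluster."""
    clusters = []
    for pep in peptides:
        for members in clusters:
            if identities(pep, members[0]) >= min_identities:
                members.append(pep)
                break
        else:
            clusters.append([pep])
    return clusters

def frequencies(peptides, position, weights):
    """Weighted amino acid frequencies at one (0-based) position."""
    totals = defaultdict(float)
    for pep in peptides:
        totals[pep[position]] += weights[pep]
    norm = sum(totals.values())
    return {aa: w / norm for aa, w in totals.items()}

clusters = cluster(PEPTIDES)
weights = {pep: 1 / len(members) for members in clusters for pep in members}
raw_p1 = frequencies(PEPTIDES, 0, {pep: 1.0 for pep in PEPTIDES})
weighted_p1 = frequencies(PEPTIDES, 0, weights)
print(raw_p1["A"], weighted_p1["A"])
```

With these assumptions the peptides fall into six clusters, and the frequency of A at P1 drops from 6/10 to 2/6 as on the slide.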


Pseudo counts

I is not found at position P9. Does this mean that I is forbidden (P(I)=0)?

No! Use the Blosum substitution matrix to estimate the pseudo frequency of I at P9

The Blosum matrix

     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A 0.29 0.03 0.03 0.03 0.02 0.03 0.04 0.08 0.01 0.04 0.06 0.04 0.02 0.02 0.03 0.09 0.05 0.01 0.02 0.07
R 0.04 0.34 0.04 0.03 0.01 0.05 0.05 0.03 0.02 0.02 0.05 0.12 0.02 0.02 0.02 0.04 0.03 0.01 0.02 0.03
N 0.04 0.04 0.32 0.08 0.01 0.03 0.05 0.07 0.03 0.02 0.03 0.05 0.01 0.02 0.02 0.07 0.05 0.00 0.02 0.03
D 0.04 0.03 0.07 0.40 0.01 0.03 0.09 0.05 0.02 0.02 0.03 0.04 0.01 0.01 0.02 0.05 0.04 0.00 0.01 0.02
C 0.07 0.02 0.02 0.02 0.48 0.01 0.02 0.03 0.01 0.04 0.07 0.02 0.02 0.02 0.02 0.04 0.04 0.00 0.01 0.06
Q 0.06 0.07 0.04 0.05 0.01 0.21 0.10 0.04 0.03 0.03 0.05 0.09 0.02 0.01 0.02 0.06 0.04 0.01 0.02 0.04
E 0.06 0.05 0.04 0.09 0.01 0.06 0.30 0.04 0.03 0.02 0.04 0.08 0.01 0.02 0.03 0.06 0.04 0.01 0.02 0.03
G 0.08 0.02 0.04 0.03 0.01 0.02 0.03 0.51 0.01 0.02 0.03 0.03 0.01 0.02 0.02 0.05 0.03 0.01 0.01 0.02
H 0.04 0.05 0.05 0.04 0.01 0.04 0.05 0.04 0.35 0.02 0.04 0.05 0.02 0.03 0.02 0.04 0.03 0.01 0.06 0.02
I 0.05 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.27 0.17 0.02 0.04 0.04 0.01 0.03 0.04 0.01 0.02 0.18
L 0.04 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.12 0.38 0.03 0.05 0.05 0.01 0.02 0.03 0.01 0.02 0.10
K 0.06 0.11 0.04 0.04 0.01 0.05 0.07 0.04 0.02 0.03 0.04 0.28 0.02 0.02 0.03 0.05 0.04 0.01 0.02 0.03
M 0.05 0.03 0.02 0.02 0.02 0.03 0.03 0.03 0.02 0.10 0.20 0.04 0.16 0.05 0.02 0.04 0.04 0.01 0.02 0.09
F 0.03 0.02 0.02 0.02 0.01 0.01 0.02 0.03 0.02 0.06 0.11 0.02 0.03 0.39 0.01 0.03 0.03 0.02 0.09 0.06
P 0.06 0.03 0.02 0.03 0.01 0.02 0.04 0.04 0.01 0.03 0.04 0.04 0.01 0.01 0.49 0.04 0.04 0.00 0.01 0.03
S 0.11 0.04 0.05 0.05 0.02 0.03 0.05 0.07 0.02 0.03 0.04 0.05 0.02 0.02 0.03 0.22 0.08 0.01 0.02 0.04
T 0.07 0.04 0.04 0.04 0.02 0.03 0.04 0.04 0.01 0.05 0.07 0.05 0.02 0.02 0.03 0.09 0.25 0.01 0.02 0.07
W 0.03 0.02 0.02 0.02 0.01 0.02 0.02 0.03 0.02 0.03 0.05 0.02 0.02 0.06 0.01 0.02 0.02 0.49 0.07 0.03
Y 0.04 0.03 0.02 0.02 0.01 0.02 0.03 0.02 0.05 0.04 0.07 0.03 0.02 0.13 0.02 0.03 0.03 0.03 0.32 0.05
V 0.07 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.16 0.13 0.03 0.03 0.04 0.02 0.03 0.05 0.01 0.02 0.27

Some amino acids are highly conserved (e.g. C), some have a high chance of mutation (e.g. I)

Pseudo count estimation

• Calculate observed amino acid frequencies f_a
• Pseudo frequency for amino acid b

$g_b = \sum_a f_a \cdot q_{b|a}$

• Example (I at P9)

$g_I = 0.2 \cdot q_{I|M} + 0.1 \cdot q_{I|N} + \ldots + 0.3 \cdot q_{I|V} + 0.1 \cdot q_{I|L}$

$g_I = 0.2 \cdot 0.04 + 0.1 \cdot 0.01 + \ldots + 0.3 \cdot 0.18 + 0.1 \cdot 0.17 \approx 0.09$
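The pseudo frequency of I at P9 can be reproduced with a few lines. This sketch uses the raw (unweighted) P9 frequencies of the 10 peptides and the I row of the Blosum table, as the worked example on the slide does; variable names are illustrative.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Row I of the Blosum frequency table, used as q_{I|a} for each observed amino acid a.
Q_I_GIVEN_A = dict(zip(AMINO_ACIDS, [
    0.05, 0.02, 0.01, 0.02, 0.02, 0.01, 0.02, 0.02, 0.01, 0.27,
    0.17, 0.02, 0.04, 0.04, 0.01, 0.03, 0.04, 0.01, 0.02, 0.18,
]))

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

# Observed frequencies f_a at P9 (index 8).
counts = Counter(pep[8] for pep in PEPTIDES)
f = {aa: n / len(PEPTIDES) for aa, n in counts.items()}

# g_I = sum_a f_a * q_{I|a}
g_I = sum(f_a * Q_I_GIVEN_A[a] for a, f_a in f.items())
print(round(g_I, 2))
```

Although I is never observed at P9, its pseudo frequency is clearly non-zero because V, L and M, which are observed there, substitute readily with I.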

Weight on pseudo count

• Pseudo counts are important when only limited data is available
• With large data sets only “true” observations should count

$p_a = \frac{\alpha \cdot f_a + \beta \cdot g_a}{\alpha + \beta}$

• α is the effective number of sequences (N−1), β is the weight on prior
• Example
  – If α is large, p ≈ f and only the observed data define the motif
  – If α is small, p ≈ g and the pseudo counts (or prior) define the motif
• β is normally in the range [50-200]
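The effect of the two weights can be sketched numerically. The values f = 0 and g = 0.09 are those of I at P9 from the earlier example; the choice of α values is purely illustrative.

```python
def corrected_probability(f_a, g_a, alpha, beta=50.0):
    """p_a = (alpha * f_a + beta * g_a) / (alpha + beta)."""
    return (alpha * f_a + beta * g_a) / (alpha + beta)

f_I, g_I = 0.0, 0.09   # I at P9: never observed, pseudo frequency about 0.09

# alpha = 0 reproduces the prior, a very large alpha reproduces the observations.
for alpha in (0, 5, 50, 5000):
    print(alpha, round(corrected_probability(f_I, g_I, alpha), 4))
```

As α grows relative to β the estimate moves from the pseudo frequency g toward the observed frequency f.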

Sequence weighting and pseudo counts

RLLDDTPEV 0.59
GLLGNVSTV 0.71
ALAKAAAAL 0.47

P7(P) and P7(S) > 0

Position specific weighting

We know that positions 2 and 9 are anchor positions for most MHC binding motifs
– Increase weight on high information positions

Motif found on large data set

Weight matrices

Estimate amino acid frequencies from alignment including sequence weighting and pseudo count

     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1 0.08 0.06 0.02 0.03 0.02 0.02 0.03 0.08 0.02 0.08 0.11 0.06 0.04 0.06 0.02 0.09 0.04 0.01 0.04 0.08
2 0.04 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.11 0.44 0.02 0.06 0.03 0.01 0.02 0.05 0.00 0.01 0.10
3 0.08 0.04 0.05 0.07 0.02 0.03 0.03 0.08 0.02 0.05 0.11 0.03 0.03 0.06 0.04 0.06 0.05 0.03 0.05 0.07
4 0.08 0.05 0.03 0.10 0.01 0.05 0.08 0.13 0.01 0.05 0.06 0.05 0.01 0.03 0.08 0.06 0.04 0.02 0.01 0.05
5 0.06 0.04 0.05 0.03 0.01 0.04 0.05 0.11 0.03 0.04 0.09 0.04 0.02 0.06 0.06 0.04 0.05 0.02 0.05 0.08
6 0.06 0.03 0.03 0.03 0.03 0.03 0.04 0.06 0.02 0.10 0.14 0.04 0.03 0.05 0.04 0.06 0.06 0.01 0.03 0.13
7 0.10 0.02 0.04 0.04 0.02 0.03 0.04 0.05 0.04 0.08 0.12 0.02 0.03 0.06 0.07 0.06 0.05 0.03 0.03 0.08
8 0.05 0.07 0.04 0.03 0.01 0.04 0.06 0.06 0.03 0.06 0.13 0.06 0.02 0.05 0.04 0.08 0.07 0.01 0.04 0.05
9 0.08 0.02 0.01 0.01 0.02 0.02 0.03 0.02 0.01 0.10 0.23 0.03 0.02 0.04 0.01 0.04 0.04 0.00 0.02 0.25

What do the numbers mean?
– P2(V) > P2(M). Does this mean that V enables binding more than M?
– In nature not all amino acids are found equally often
  • qA = 0.070, qW = 0.013
  • Finding 6% A is hence not significant, but 6% W is highly significant
  • In nature V is found more often than M, so we must somehow rescale with the background

How to score a sequence to a probability matrix?

• p_ij describes a motif
• The probability that a peptide fits the motif is

$p(s \mid \mathrm{model}) = \prod_p p_{p,a_p}$

where a_p is the amino acid at position p of the peptide s.

• The probability that the peptide fits a random model is

$p(s \mid \mathrm{random}) = \prod_p q_{a_p}$

• The ratio of the two gives the odds

$O = \frac{\prod_p p_{p,a_p}}{\prod_p q_{a_p}} = \prod_p \frac{p_{p,a_p}}{q_{a_p}}$

• The log gives the score

$S = \sum_p \log\left(\frac{p_{p,a_p}}{q_{a_p}}\right)$
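The relation between the odds and the additive log score can be checked on a toy model. The two-position motif and the three-letter background below are hypothetical values chosen only to illustrate the formulas; they are not from the slides.

```python
import math

# Hypothetical two-position motif over a reduced alphabet (illustration only).
MOTIF = [
    {"A": 0.7, "L": 0.2, "V": 0.1},
    {"A": 0.1, "L": 0.3, "V": 0.6},
]
BACKGROUND = {"A": 0.5, "L": 0.3, "V": 0.2}

def p_model(seq):
    """Probability of the sequence under the motif: product of p_{p,a_p}."""
    return math.prod(MOTIF[i][aa] for i, aa in enumerate(seq))

def p_random(seq):
    """Probability under the random model: product of background frequencies."""
    return math.prod(BACKGROUND[aa] for aa in seq)

def log_odds(seq):
    """Score S = sum over positions of log(p_{p,a_p} / q_{a_p})."""
    return sum(math.log(MOTIF[i][aa] / BACKGROUND[aa]) for i, aa in enumerate(seq))

for seq in ("AV", "VA"):
    print(seq, round(p_model(seq) / p_random(seq), 2), round(log_odds(seq), 2))
```

Taking the log turns the product of per-position ratios into a sum, which is what allows a sequence to be scored by adding table entries.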

Weight matrices

• A weight matrix is given as W_ij = log(p_ij/q_j)
  – where i is a position in the motif, and j an amino acid. q_j is the background frequency for amino acid j.
• W is an L x 20 matrix, where L is the motif length

     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1  0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
3  0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
4 -0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
5 -1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
6 -0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
7  1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
8 -2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
9 -0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5

Example

Calculate the weight matrix based on the following observation (use β = 50):

Sequence = I

Important: what is α?

$g_b = \sum_a f_a \cdot q_{b|a}$

$p_a = \frac{\alpha \cdot f_a + \beta \cdot g_a}{\alpha + \beta}$

$W_{ij} = \log(p_{ij}/q_j)$

Blosum frequencies q_{b|a} (excerpt, rows E to F):

     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
E 0.06 0.05 0.04 0.09 0.01 0.06 0.30 0.04 0.03 0.02 0.04 0.08 0.01 0.02 0.03 0.06 0.04 0.01 0.02 0.03
G 0.08 0.02 0.04 0.03 0.01 0.02 0.03 0.51 0.01 0.02 0.03 0.03 0.01 0.02 0.02 0.05 0.03 0.01 0.01 0.02
H 0.04 0.05 0.05 0.04 0.01 0.04 0.05 0.04 0.35 0.02 0.04 0.05 0.02 0.03 0.02 0.04 0.03 0.01 0.06 0.02
I 0.05 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.27 0.17 0.02 0.04 0.04 0.01 0.03 0.04 0.01 0.02 0.18
L 0.04 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.12 0.38 0.03 0.05 0.05 0.01 0.02 0.03 0.01 0.02 0.10
K 0.06 0.11 0.04 0.04 0.01 0.05 0.07 0.04 0.02 0.03 0.04 0.28 0.02 0.02 0.03 0.05 0.04 0.01 0.02 0.03
M 0.05 0.03 0.02 0.02 0.02 0.03 0.03 0.03 0.02 0.10 0.20 0.04 0.16 0.05 0.02 0.04 0.04 0.01 0.02 0.09
F 0.03 0.02 0.02 0.02 0.01 0.01 0.02 0.03 0.02 0.06 0.11 0.02 0.03 0.39 0.01 0.03 0.03 0.02 0.09 0.06

Background frequencies q:

     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
  0.08 0.05 0.04 0.05 0.02 0.03 0.06 0.07 0.02 0.06 0.10 0.06 0.02 0.04 0.04 0.06 0.05 0.01 0.03 0.07

Example

With a single observed sequence, α = N − 1 = 0, so p_a = g_a = q_{a|I}.

So the score is simply the Blosum62 row for amino acid I!!!

This is why β is called weight on prior. Our prior knowledge is Blosum. We will only accept a weight matrix different from Blosum if we have many data.

$g_b = \sum_a f_a \cdot q_{b|a}$

$p_a = \frac{\alpha \cdot f_a + \beta \cdot g_a}{\alpha + \beta}$

$W_{ij} = \log(p_{ij}/q_j)$

                    A     R     N     D     C     Q     E      G     H     I     L     K     M     F     P     S     T     W     Y
p (Blosum row I)  0.05  0.02  0.01  0.02  0.02  0.01  0.02   0.02  0.01  0.27  0.17  0.02  0.04  0.04  0.01  0.03  0.04  0.01  0.02
q (background)    0.08  0.05  0.04  0.05  0.02  0.03  0.06   0.07  0.02  0.06  0.10  0.06  0.02  0.04  0.04  0.06  0.05  0.01
W                -0.45 -1.08 -1.12 -1.12 -0.43 -0.94 -1.12  -1.28 -1.08  1.38  0.53 -0.90  0.39 -0.06 -0.98 -0.82 -0.25 -0.79 -0.44
W (rescaled)     -1.30 -3.11 -3.23 -3.29 -1.25 -2.71 -3.218 -3.69 -3.13  3.99  1.52 -2.60  1.12 -0.18 -2.82 -2.38 -0.72 -2.28 -1.27
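The single-observation example can be worked through in code. This is a sketch using the two-decimal table values and a natural logarithm (an assumption), so the resulting scores will differ slightly from those printed on the slide.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
BLOSUM_ROW_I = [0.05, 0.02, 0.01, 0.02, 0.02, 0.01, 0.02, 0.02, 0.01, 0.27,
                0.17, 0.02, 0.04, 0.04, 0.01, 0.03, 0.04, 0.01, 0.02, 0.18]
BACKGROUND = [0.08, 0.05, 0.04, 0.05, 0.02, 0.03, 0.06, 0.07, 0.02, 0.06,
              0.10, 0.06, 0.02, 0.04, 0.04, 0.06, 0.05, 0.01, 0.03, 0.07]

def weight_row(f, g, q, alpha, beta):
    """Combine observed and pseudo frequencies, then take log(p/q) per amino acid."""
    p = [(alpha * fa + beta * ga) / (alpha + beta) for fa, ga in zip(f, g)]
    return p, [math.log(pa / qa) for pa, qa in zip(p, q)]

# One observed sequence "I": f_I = 1, all other f_a = 0, hence g_b = q_{b|I}.
f = [1.0 if aa == "I" else 0.0 for aa in AMINO_ACIDS]
g = BLOSUM_ROW_I
p, w = weight_row(f, g, BACKGROUND, alpha=0, beta=50)   # alpha = N - 1 = 0

print(dict(zip(AMINO_ACIDS, (round(x, 2) for x in w))))
```

Because α is zero the observation itself drops out and p equals the Blosum row for I; only with more sequences would the observed frequencies start to matter.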

Scoring a sequence to a weight matrix

Score sequences to the weight matrix by looking up and adding L values from the matrix

Which peptide is most likely to bind? Which peptide second?

Peptide    Weight matrix score  Listed value
RLLDDTPEV  11.9                 0.59
GLLGNVSTV  14.7                 0.71
ALAKAAAAL   4.3                 0.47
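The lookup-and-add scoring can be reproduced directly from the 9 x 20 weight matrix shown on the slides. The sketch below parses the matrix from text; the function and variable names are illustrative.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Weight matrix from the slides: one row per motif position, columns in AMINO_ACIDS order.
MATRIX_TEXT = """
0.6 0.4 -3.5 -2.4 -0.4 -1.9 -2.7 0.3 -1.1 1.0 0.3 0.0 1.4 1.2 -2.7 1.4 -1.2 -2.0 1.1 0.7
-1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3 1.0 5.1 -3.7 3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8 0.4
0.2 -1.3 0.1 1.5 0.0 -1.8 -3.3 0.4 0.5 -1.0 0.3 -2.5 1.2 1.0 -0.1 -0.3 -0.5 3.4 1.6 0.0
-0.1 -0.1 -2.0 2.0 -1.6 0.5 0.8 2.0 -3.3 0.1 -1.7 -1.0 -2.2 -1.6 1.7 -0.6 -0.2 1.3 -6.8 -0.7
-1.6 -0.1 0.1 -2.2 -1.2 0.4 -0.5 1.9 1.2 -2.2 -0.5 -1.3 -2.2 1.7 1.2 -2.5 -0.1 1.7 1.5 1.0
-0.7 -1.4 -1.0 -2.3 1.1 -1.3 -1.4 -0.2 -1.0 1.8 0.8 -1.9 0.2 1.0 -0.4 -0.6 0.4 -0.5 -0.0 2.1
1.1 -3.8 -0.2 -1.3 1.3 -0.3 -1.3 -1.4 2.1 0.6 0.7 -5.0 1.1 0.9 1.3 -0.5 -0.9 2.9 -0.4 0.5
-2.2 1.0 -0.8 -2.9 -1.4 0.4 0.1 -0.4 0.2 -0.0 1.1 -0.5 -0.5 0.7 -0.3 0.8 0.8 -0.7 1.3 -1.1
-0.2 -3.5 -6.1 -4.5 0.7 -0.8 -2.5 -4.0 -2.6 0.9 2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6 4.5
"""

WEIGHTS = [
    dict(zip(AMINO_ACIDS, map(float, row.split())))
    for row in MATRIX_TEXT.strip().splitlines()
]

def score(peptide):
    """Sum the matrix entries for the amino acid found at each position."""
    return sum(WEIGHTS[i][aa] for i, aa in enumerate(peptide))

for peptide in ("RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL"):
    print(peptide, round(score(peptide), 1))
```

The scores rank GLLGNVSTV first and RLLDDTPEV second, in the same order as the values listed next to the peptides.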


Example from real life

• 10 peptides from MHCpep database

• Bind to the MHC complex

• Relevant for immune system recognition

• Estimate sequence motif and weight matrix

• Evaluate motif “correctness” on 528 peptides



Prediction accuracy

Pearson correlation 0.45
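For reference, a Pearson correlation can be computed as sketched below. The input pairs are the three example peptides from the scoring slide, assuming the second number listed with each peptide is its measured binding value (the slides do not say so explicitly). Three points are far too few for a meaningful estimate, so this only illustrates the calculation, not the 0.45 reported for the 528 evaluation peptides.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

matrix_scores = [11.9, 14.7, 4.3]     # weight matrix scores of the example peptides
listed_values = [0.59, 0.71, 0.47]    # assumed to be measured binding values

print(round(pearson(matrix_scores, listed_values), 2))
```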

Predictive performance

[Figure: bar chart of Pearson's correlation (CC) for each method]

Method         CC
Simple         0.45
Seq.W          0.5
Seq.W+SC       0.6
Seq.W+SC+PW    0.65
Large dataset  0.79

End of first part

Take a deep breath

Smile to your neighbor

Hidden Markov Models

• Weight matrices do not deal with insertions and deletions

• In alignments, this is done in an ad-hoc manner by optimization of the two gap penalties for first gap and gap extension

• HMM is a natural framework where insertions/deletions are dealt with explicitly

Why hidden?

The unfair casino: Loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1

Die      1     2     3     4     5     6
Fair     1/6   1/6   1/6   1/6   1/6   1/6
Loaded   1/10  1/10  1/10  1/10  1/10  1/2

Transitions: Fair to Fair 0.95, Fair to Loaded 0.05, Loaded to Loaded 0.9, Loaded to Fair 0.10

Model generates numbers
– 312453666641

Does not tell which die was used

Alignment (decoding) can give the most probable solution/path (Viterbi)
– FFFFFFLLLLLL
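The generative side of the casino model can be sketched as a small simulation. The emission and transition probabilities are those of the slide; starting in the fair state and the fixed random seed are assumptions made for the example.

```python
import random

EMISSIONS = {
    "F": [1 / 6] * 6,              # fair die
    "L": [1 / 10] * 5 + [1 / 2],   # loaded die, p(6) = 0.5
}
TRANSITIONS = {
    "F": {"F": 0.95, "L": 0.05},
    "L": {"F": 0.10, "L": 0.90},
}

def roll_dice(n, rng, start="F"):
    """Generate n rolls plus the hidden state path (start state is an assumption)."""
    state, rolls, states = start, [], []
    for _ in range(n):
        states.append(state)
        rolls.append(rng.choices(range(1, 7), weights=EMISSIONS[state])[0])
        nxt = TRANSITIONS[state]
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return "".join(map(str, rolls)), "".join(states)

rolls, hidden = roll_dice(12, random.Random(0))
print(rolls)    # what an observer sees
print(hidden)   # which die was used: the hidden part of the model
```

Only the first printed line is visible to a player; recovering the second from the first is the decoding problem.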

HMM (a simple example)

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

• Example from A. Krogh
• Core region (the core of the alignment, red on the slide) defines the number of states in the HMM
• Insertion and deletion statistics are derived from the non-core part of the alignment (black on the slide)

HMM construction

[Figure: HMM state diagram with the emission probabilities over A, C, G, T for each state and the transition probabilities between states]

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

• 5 matches: A, 2xC, T, G
• 5 transitions in gap region
  – C out, G out
  – A-C, C-T, T out
  – Out transition 3/5
  – Stay transition 2/5

ACA---ATG  0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2

Align sequence to HMM

ACA---ATG  0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2
TCAACTATC  0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075x10^-2
ACAC--AGC  = 1.2x10^-2
AGA---ATC  = 3.3x10^-2
ACCG--ATC  = 0.59x10^-2

Consensus:
ACAC--ATC  = 4.7x10^-2
ACA---ATC  = 13.1x10^-2

Exceptional:
TGCT--AGG  = 0.0023x10^-2
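The probabilities above can be reproduced by estimating the model from the five aligned sequences. The sketch below assumes that columns 1-3 and 7-9 are match states, that columns 4-6 feed a single insert state, and that there are no delete states; all names are illustrative.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
MATCH_COLS = [0, 1, 2, 6, 7, 8]    # core columns (assumed from the slide)
INSERT_COLS = [3, 4, 5]            # gap region, modelled by one insert state

def frequencies(chars):
    counts = Counter(chars)
    return {ch: n / len(chars) for ch, n in counts.items()}

match_emit = [frequencies([row[c] for row in ALIGNMENT]) for c in MATCH_COLS]
insert_chars = [row[c] for row in ALIGNMENT for c in INSERT_COLS if row[c] != "-"]
insert_emit = frequencies(insert_chars)

n_with_insert = sum(any(row[c] != "-" for c in INSERT_COLS) for row in ALIGNMENT)
p_enter = n_with_insert / len(ALIGNMENT)       # match 3 -> insert (3/5)
p_skip = 1 - p_enter                           # match 3 -> match 4 (2/5)
p_leave = n_with_insert / len(insert_chars)    # insert -> match 4 (3/5)
p_stay = 1 - p_leave                           # insert -> insert (2/5)

def probability(aligned):
    """Probability of a sequence given as a 9-column row in the alignment layout."""
    p = 1.0
    for k, c in enumerate(MATCH_COLS[:3]):
        p *= match_emit[k].get(aligned[c], 0.0)
    inserted = [aligned[c] for c in INSERT_COLS if aligned[c] != "-"]
    if inserted:
        p *= p_enter
        for i, ch in enumerate(inserted):
            p *= insert_emit.get(ch, 0.0)
            p *= p_stay if i < len(inserted) - 1 else p_leave
    else:
        p *= p_skip
    for k, c in enumerate(MATCH_COLS[3:], start=3):
        p *= match_emit[k].get(aligned[c], 0.0)
    return p

for row in ALIGNMENT + ["ACAC--ATC", "ACA---ATC", "TGCT--AGG"]:
    print(row, f"{probability(row):.2e}")
```

Transitions between consecutive match states have probability 1 here because no training sequence has a gap in a core column.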

Align sequence to HMM - Null model

• Score depends strongly on length
• Null model is a random model. For length L the score is 0.25^L
• Log-odds score for sequence S: log(P(S)/0.25^L)
• Positive score means more likely than Null model

ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6

Consensus:
ACAC--ATC = 6.7
ACA---ATC = 6.3

Exceptional:
TGCT--AGG = -0.97

Note!
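The length correction can be sketched with the probabilities quoted on the previous slide. A natural logarithm is assumed here because it reproduces the listed scores; the sequence length L counts only the non-gap symbols.

```python
import math

def log_odds(p_model, length, alphabet_size=4):
    """log( P(S) / 0.25**L ) for a DNA sequence of length L."""
    return math.log(p_model / (1 / alphabet_size) ** length)

# (aligned sequence, P(S) as quoted on the slides)
EXAMPLES = [
    ("ACA---ATG", 3.3e-2),
    ("TCAACTATC", 0.0075e-2),
    ("ACAC--ATC", 4.7e-2),
    ("TGCT--AGG", 0.0023e-2),
]

for aligned, p in EXAMPLES:
    length = len(aligned.replace("-", ""))
    print(aligned, round(log_odds(p, length), 2))
```

The long sequence TCAACTATC has a very small raw probability but still a clearly positive log-odds score, while the exceptional sequence falls below zero.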

Model decoding (Viterbi)

The unfair casino

Log model (log10 of the probabilities)

Die      1      2      3      4      5      6
Fair    -0.78  -0.78  -0.78  -0.78  -0.78  -0.78
Loaded  -1     -1     -1     -1     -1     -0.3

Transitions: Fair to Fair -0.02, Fair to Loaded -1.3, Loaded to Fair -1, Loaded to Loaded -0.05

Example: 1245666

     1      2      4      5      6      6      6
F   -0.78  -1.58  -2.38  -3.18  -3.98  -4.78  -5.58
L   Null   -3.08  -3.88  -4.68  -4.78  -5.13  -5.48

For example, F at the second roll: −0.78 − 0.02 − 0.78 = −1.58

$P_l(i+1) = p_l(i+1) \cdot \max_k \left( P_k(i) \cdot a_{kl} \right)$

or

$\log P_l(i+1) = \log p_l(i+1) + \max_k \left( \log P_k(i) + \log a_{kl} \right)$

Most probable path: FFFFLLL
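The recursion can be written out as a short Viterbi implementation in log10 space. Starting in the fair state with probability 1 is an assumption read from the "Null" entry for L at the first roll; the names are illustrative.

```python
import math

STATES = "FL"
LOG_EMIT = {
    "F": {r: math.log10(1 / 6) for r in "123456"},
    "L": {**{r: math.log10(1 / 10) for r in "12345"}, "6": math.log10(1 / 2)},
}
LOG_TRANS = {
    ("F", "F"): math.log10(0.95), ("F", "L"): math.log10(0.05),
    ("L", "F"): math.log10(0.10), ("L", "L"): math.log10(0.90),
}
LOG_START = {"F": 0.0, "L": -math.inf}   # assumption: the casino starts with the fair die

def viterbi(rolls):
    """Return (most probable state path, its log10 probability, score table)."""
    scores = [{s: LOG_START[s] + LOG_EMIT[s][rolls[0]] for s in STATES}]
    back = []
    for r in rolls[1:]:
        prev, row, ptr = scores[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda k: prev[k] + LOG_TRANS[(k, s)])
            row[s] = LOG_EMIT[s][r] + prev[best] + LOG_TRANS[(best, s)]
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    state = max(STATES, key=lambda s: scores[-1][s])
    best_score = scores[-1][state]
    path = [state]
    for ptr in reversed(back):   # trace the pointers back to the first roll
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path)), best_score, scores

path, best_score, table = viterbi("1245666")
print(path, round(best_score, 2))
```

The table filled by the loop corresponds to the F and L rows above, and the backtrace recovers the path shown on the slide.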

HMM’s and weight matrices

• In the case of un-gapped alignments HMM’s become simple weight matrices

• To achieve high performance, the emission frequencies are estimated using the techniques of
  – Sequence weighting
  – Pseudo counts

Profile HMM’s

• Alignments based on conventional scoring matrices (BLOSUM62) score all positions in a sequence in an equal manner

• Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix)

• Profile HMM’s are ideally suited to describe such position specific variations

Profile HMM’s

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

Annotations on the alignment:
• Core: position with < 2 gaps
• Conserved: must have a G
• Non-conserved: anything can match
• Deletion
• Insertion
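The core and conserved columns can be located programmatically. The sketch below applies the "fewer than 2 gaps" rule from the slide to the alignment above; column indices are 0-based and the helper names are illustrative.

```python
ALIGNMENT = [
    "ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN",
    "TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I",
    "-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I",
    "IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---",
    "-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V",
    "ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----",
    "TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP",
    "TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI",
]

def columns(alignment):
    return ["".join(col) for col in zip(*alignment)]

def core_columns(alignment, max_gaps=2):
    """Columns with fewer than max_gaps gap characters form the core."""
    return [i for i, col in enumerate(columns(alignment)) if col.count("-") < max_gaps]

def conserved_columns(alignment):
    """Gap-free columns in which every sequence has the same residue."""
    return {
        i: col[0]
        for i, col in enumerate(columns(alignment))
        if len(set(col)) == 1 and col[0] != "-"
    }

core = core_columns(ALIGNMENT)
conserved = conserved_columns(ALIGNMENT)
print(len(core), conserved)
```

Core columns would become match states of the profile HMM, and a fully conserved column such as the G at index 19 gives a match state with a sharply peaked emission distribution.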

Profile HMM’s

All M/D pairs must be visited once

L1 -  Y2 A3 V4 R5 -  I6

P1 D2 P3 P4 I4 P5 D6 P7

Example. Sequence profiles

• Alignment of protein sequences 1PLC._ and 1GYC.A
• E-value > 1000
• Profile alignment
  – Align 1PLC._ against Swiss-prot
  – Make position specific weight matrix from alignment
  – Use this matrix to align 1PLC._ against 1GYC.A
• E-value < 10^-22. Rmsd = 3.3

Example continued

Smith-Waterman score: 53; 26.2% identity in 61 aa overlap

10 20 30
1PLC._ IDVLLGADDGSLAFVPSEFSISPG--EKIV-----FKNNAG
:: .: : .:: .: . :...
1GYC.A ILRYQGAPVAEPTTTQTTSVIPLIETNLHPLARMPVPGSPTPGGVDKALNLAFNFNGTNF
280 290 300 310 320 330

40 50 60 70 80 90
1PLC._ FPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVG
: .: : ..: .. . ... .::: :
1GYC.A FINNASFTPPTVPVLLQILSGAQTAQDLLPAGSVYPLPAHSTIEITLPATALAPGAPHPF
340 350 360 370 380 390

1PLC._ KVTVN
1GYC.A HLHGHAFAVVRSAGSTTYNYNDPIFRDVVSTGTPAAGDNVTIRFQTDNPGPWFLHCHIDF
400 410 420 430 440 450

Example continued

Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)

Query: 3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
F + G++ N+ + +G + +
Sbjct: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79

Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
A G +F G + ++ G+ G V
Sbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126

Rmsd = 3.3 Å

Model red, structure blue

HMM packages

• HMMER (http://hmmer.wustl.edu/)
  – S.R. Eddy, WashU St. Louis. Freely available.

• SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)
  – R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.

• META-MEME (http://metameme.sdsc.edu/)
  – William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.

• NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)
  – Freely available to academia, nominal license fee for commercial users.
  – Allows HMM architecture construction.

• EasyGibbs (http://www.cbs.dtu.dk/biotools/EasyGibbs/)
  – Webserver for Gibbs sampling of protein sequences