Sequence motifs, information content, logos, and HMM’s

Center for Biological Sequence Analysis, Technical University of Denmark (DTU)
Morten Nielsen, CBS, BioCentrum, DTU

Post on 29-Jan-2016


Page 1: Sequence motifs, information content, logos, and HMM’s

Center for Biological Sequence Analysis, Technical University of Denmark, DTU

Sequence motifs, information content, logos, and HMM’s

Morten Nielsen, CBS, BioCentrum, DTU

Page 2: Sequence motifs, information content, logos, and HMM’s

Outline

• Multiple alignments and sequence motifs
• Weight matrices and consensus sequence
  – Sequence weighting
  – Low (pseudo) counts
• Information content
  – Sequence logos
  – Mutual information
• Example from the real world
• HMM’s and profile HMM’s
  – TMHMM (trans-membrane protein)
  – Gene finding
• Links to HMM packages

Page 3: Sequence motifs, information content, logos, and HMM’s

Multiple alignment and sequence motifs

• Core
• Consensus sequence
• Weight matrices
• Problems
  – Sequence weights
  – Low counts

----------MLEFVVEADLPGIKA--------
----------MLEFVVEFALPGIKA--------
----------MLEFVVEFDLPGIAA--------
-------------YLQDSDPDSFQD--------
---GSDTITLPCRMKQFINMWQE----------
---RNQEERLLADLMQNYDPNLR----------
-------YDPNLRPAERDSDVVNVSLK------
----------NVSLKLTLTNLISLNEREEA---
----EREEALTTNVWIEMQWCDYR---------
----------WCDYRLRWDPRDYEGLWVLR---
--LWVLRVPSTMVWRPDIVLEN-----------
------------IVLENNVDGVFEVALYCNVL-
-------------YCNVLVSPDGCIYWLPPAIF
---------PPAIFRSACSISVTYFPFDW----
             *********
             FVVEFDLPG   Consensus
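As a minimal sketch of how the consensus could be read off the core, the snippet below takes the 9-mer cores marked by the stars and picks the most frequent residue in each column. The tie-breaking rule (the residue seen first wins) is an assumption made here; several columns are tied, and the slide does not say how ties were resolved.

```python
# Core 9-mers read off the starred columns of the alignment (one per sequence).
cores = [
    "FVVEADLPG", "FVVEFALPG", "FVVEFDLPG", "YLQDSDPDS", "MKQFINMWQ",
    "LMQNYDPNL", "PAERDSDVV", "LKLTLTNLI", "VWIEMQWCD", "YRLRWDPRD",
    "WRPDIVLEN", "VLENNVDGV", "YCNVLVSPD", "FRSACSISV",
]


def consensus(seqs):
    """Most frequent residue per column; ties go to the residue seen first."""
    result = []
    for column in zip(*seqs):
        counts = {}
        for aa in column:
            counts[aa] = counts.get(aa, 0) + 1
        # max() returns the first key with the highest count (insertion order).
        result.append(max(counts, key=counts.get))
    return "".join(result)


print(consensus(cores))
```

With unweighted counts the three near-identical sequences at the top dominate the result, which is the motivation for sequence weighting.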

Page 4: Sequence motifs, information content, logos, and HMM’s

Sequence weighting 1 - Clustering

----------MLEFVVEADLPGIKA--------  }
----------MLEFVVEFALPGIKA--------  } Homologous sequences: weight = 1/n (1/3)
----------MLEFVVEFDLPGIAA--------  }
-------------YLQDSDPDSFQD--------
---GSDTITLPCRMKQFINMWQE----------
---RNQEERLLADLMQNYDPNLR----------
-------YDPNLRPAERDSDVVNVSLK------
----------NVSLKLTLTNLISLNEREEA---
----EREEALTTNVWIEMQWCDYR---------
----------WCDYRLRWDPRDYEGLWVLR---
--LWVLRVPSTMVWRPDIVLEN-----------
------------IVLENNVDGVFEVALYCNVL-
-------------YCNVLVSPDGCIYWLPPAIF
---------PPAIFRSACSISVTYFPFDW----
             *********

Consensus sequence: YRQELDPLV

Previous: FVVEFDLPG

Page 5: Sequence motifs, information content, logos, and HMM’s

Sequence weighting 2 - (Henikoff & Henikoff)

Sequence    w
FVVEADLPG   0.37
FVVEFALPG   0.43
FVVEFDLPG   0.32
YLQDSDPDS   0.59
MKQFINMWQ   0.90
LMQNYDPNL   0.68
PAERDSDVV   0.75
LKLTLTNLI   0.85
VWIEMQWCD   0.84
YRLRWDPRD   0.51
WRPDIVLEN   0.71
VLENNVDGV   0.59
YCNVLVSPD   0.71
FRSACSISV   0.75

• $w'_{aa} = 1/(r \cdot s)$
• r: number of different amino acids in a column
• s: number of occurrences of the amino acid in that column
• Normalize so that $\sum_{aa} w_{aa} = 1$ for each column
• The sequence weight is the sum of $w_{aa}$ over the columns

First column:
F: r = 7 (FYMLPVW), s = 4, w’ = 1/28, w = 0.055
Y: s = 3, w’ = 1/21, w = 0.073
M, P, W: s = 1, w’ = 1/7, w = 0.218
L, V: s = 2, w’ = 1/14, w = 0.109
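The weighting scheme can be sketched in a few lines. The version below sums the un-normalized per-column terms 1/(r·s) for each sequence; this appears to be what the tabulated sequence weights correspond to (for example 0.37 for FVVEADLPG), and it makes the weights add up to the number of columns. The per-column normalization of the bullet list is not applied here, which is an assumption of this sketch.

```python
# Core 9-mers from the table above.
cores = [
    "FVVEADLPG", "FVVEFALPG", "FVVEFDLPG", "YLQDSDPDS", "MKQFINMWQ",
    "LMQNYDPNL", "PAERDSDVV", "LKLTLTNLI", "VWIEMQWCD", "YRLRWDPRD",
    "WRPDIVLEN", "VLENNVDGV", "YCNVLVSPD", "FRSACSISV",
]


def henikoff_weights(seqs):
    """Sum over columns of 1/(r*s) for the residue each sequence carries."""
    weights = [0.0] * len(seqs)
    for column in zip(*seqs):
        counts = {}
        for aa in column:
            counts[aa] = counts.get(aa, 0) + 1
        r = len(counts)                      # different amino acids in column
        for k, aa in enumerate(column):
            weights[k] += 1.0 / (r * counts[aa])   # counts[aa] is s
    return weights


weights = henikoff_weights(cores)
for seq, w in zip(cores, weights):
    print(f"{seq} {w:.2f}")
```

The three homologous sequences at the top end up with the lowest weights, while sequences carrying rare residues such as MKQFINMWQ get the highest.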

Page 6: Sequence motifs, information content, logos, and HMM’s

Low count correction

--------MLEFVVEADLPGIKA--------
--------MLEFVVEFALPGIKA--------
--------MLEFVVEFDLPGIAA--------
-----------YLQDSDPDSFQD--------
-GSDTITLPCRMKQFINMWQE----------
-RNQEERLLADLMQNYDPNLR----------
-----YDPNLRPAERDSDVVNVSLK------
--------NVSLKLTLTNLISLNEREEA---
--EREEALTTNVWIEMQWCDYR---------
--------WCDYRLRWDPRDYEGLWVLR---
LWVLRVPSTMVWRPDIVLEN-----------
----------IVLENNVDGVFEVALYCNVL-
-----------YCNVLVSPDGCIYWLPPAIF
-------PPAIFRSACSISVTYFPFDW----
           *********
           P1

• Limited number of data
• Poor sampling of sequence space
• I is not found at position P1. Does this mean that I is forbidden?
• No! Use the Blosum matrix to estimate a pseudo frequency of I

Page 7: Sequence motifs, information content, logos, and HMM’s

Low count correction using Blosum matrices

Blosum62 substitution frequencies

#   I       L       V
L   0.1154  0.3755  0.0962
V   0.1646  0.1303  0.2689

• Every time, for instance, L or V is observed, I is also likely to occur
• Estimate the low (pseudo) count correction using this approach
• As more data are included, the pseudo count correction becomes less important

$N_L = 2$, $N_V = 2$, $N_{eff} = 12$ =>
$f_I = (2 \cdot 0.1154 + 2 \cdot 0.1646)/12 = 0.05$

$p_I^* = (N_{eff} \cdot p_I + \beta \cdot f_I)/(N_{eff} + \beta) = (12 \cdot 0 + 10 \cdot 0.05)/(12 + 10) = 0.02$

Here $\beta = 10$ is the weight on the pseudo counts.
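The arithmetic of the example can be sketched as follows. The names `pseudo_frequency`, `corrected_frequency` and `beta` are chosen here for illustration; the weight of 10 on the pseudo counts is taken from the numbers in the example, and only the L and V contributions are counted, as on the slide.

```python
# Blosum62 substitution frequencies towards I, taken from the table above.
freq_to_I = {"L": 0.1154, "V": 0.1646}

counts_p1 = {"L": 2, "V": 2}   # N_L and N_V observed at position P1
n_eff = 12                     # effective number of sequences
beta = 10                      # weight on the pseudo counts


def pseudo_frequency(counts, sub_freq, n_eff):
    """Pseudo frequency of the target residue given the observed counts."""
    return sum(n * sub_freq[aa] for aa, n in counts.items()) / n_eff


def corrected_frequency(p_obs, f_pseudo, n_eff, beta):
    """Weighted average of the observed and the pseudo frequency."""
    return (n_eff * p_obs + beta * f_pseudo) / (n_eff + beta)


f_I = pseudo_frequency(counts_p1, freq_to_I, n_eff)
p_I = corrected_frequency(0.0, f_I, n_eff, beta)
print(f"f_I = {f_I:.2f}, p_I* = {p_I:.2f}")
```

As `n_eff` grows relative to `beta`, the corrected frequency moves towards the observed one, which is the sense in which the correction becomes less important with more data.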

Page 8: Sequence motifs, information content, logos, and HMM’s

Information content

• Information and entropy
  – Conserved amino acid regions contain a high degree of information (high order == low entropy)
  – Variable amino acid regions contain a low degree of information (low order == high entropy)

• Shannon information: $D = \log_2(N) + \sum_i p_i \log_2 p_i$ (for proteins N = 20, for DNA N = 4)

• Conserved residue: $p_A = 1$, $p_{i \neq A} = 0$, $D = \log_2(N)$ (= 4.3 for proteins)

• Variable region: $p_A = 0.05$, $p_C = 0.05$, ..., $D = 0$
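The two limiting cases can be checked numerically. This is a small sketch of the Shannon information formula above; the function name `information_content` is chosen here, and terms with p = 0 are skipped under the usual convention that 0·log 0 = 0.

```python
import math


def information_content(probs, n_symbols):
    """D = log2(N) + sum_i p_i * log2(p_i), skipping p_i = 0 terms."""
    return math.log2(n_symbols) + sum(p * math.log2(p) for p in probs if p > 0)


conserved = [1.0] + [0.0] * 19   # one residue always observed
variable = [0.05] * 20           # all 20 residues equally likely

print(f"conserved: {information_content(conserved, 20):.1f} bits")
print(f"variable:  {information_content(variable, 20):.1f} bits")
```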

Page 9: Sequence motifs, information content, logos, and HMM’s

Sequence logo

• Height of a column is equal to D

• Relative height of a letter is $p_A$

• Highly useful tool to visualize sequence motifs

[Figure: sequence logo for MHC class II, built from 10 sequences, with a high-information position marked]

http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html

Page 10: Sequence motifs, information content, logos, and HMM’s

More on logos

• Information content: $D = \sum_i p_i \log_2(p_i/q_i)$

• Shannon, $q_i = 1/N = 0.05$:
  $D = \sum_i p_i \log_2(p_i) - \sum_i p_i \log_2(1/N) = \log_2 N + \sum_i p_i \log_2(p_i)$

• Kullback-Leibler, $q_i$ = background frequency
  – V/L/A are more frequent than, for instance, C/H/W

Frequency matrix (values in percent, one row per motif position)

pos   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1   2   1   1   1   1   1   1   1   1   4  16   1   6  15   7   1   2   7  18  13
  2   8  19   1   1   7   2   2   2   1   3  15  13   6   2   1   2   2   7   1   8
  3   3   2   7   2   1  17  13   2   1   8  14   3   1   1   7   7   2   0   1   8
  4   8  13  13  14   1   2  13   2   1   2   3   3   1   7   1   3   7   0   1   7
  5   4   1   7   7   7   1   2   2   1  13  15   2   6   6   1   7   2   7   7   4
  6   5   2   8  23   1   6   3   2   1   3   3   2   1   1   1  13   8   0   1  18
  7   2   1   7  13   1   1   2   2   1   8  14   2   6   1  20   7   2   7   1   3
  8   3   7   7   8   7   1   7   8   1   2   8   2   1   1  13   7   2   7   1   7
  9   3   2   7  19   1   6   2   8   1   9   9   2   1   1   1   7   2   0   1  18
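To illustrate that the Shannon form is the special case of the Kullback-Leibler form with a flat background, the sketch below evaluates both for the first row of the frequency matrix. The function name `kl_information` is chosen here; a real background distribution would simply be passed in place of the uniform one, which is not given on the slide.

```python
import math

# First row of the frequency matrix (percent), in the order
# A R N D C Q E G H I L K M F P S T W Y V.
row1 = [2, 1, 1, 1, 1, 1, 1, 1, 1, 4, 16, 1, 6, 15, 7, 1, 2, 7, 18, 13]


def kl_information(p, q):
    """D = sum_i p_i * log2(p_i / q_i), skipping p_i = 0 terms."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


total = sum(row1)
p = [x / total for x in row1]
uniform = [1 / 20] * 20          # Shannon case: q_i = 1/N

d_kl = kl_information(p, uniform)
d_shannon = math.log2(20) + sum(pi * math.log2(pi) for pi in p if pi > 0)
print(f"KL with flat background: {d_kl:.3f} bits")
print(f"Shannon form:            {d_shannon:.3f} bits")
```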

Page 11: Sequence motifs, information content, logos, and HMM’s

Mutual information

$I(i,j) = \sum_{aa_i} \sum_{aa_j} P(aa_i, aa_j) \log\left[ \frac{P(aa_i, aa_j)}{P(aa_i) \, P(aa_j)} \right]$

Peptides (P1 is the first position, P6 the sixth):

ALWGFFPVA
ILKEPVHGV
ILGFVFTLT
LLFGYPVYV
GLSPTVWLS
YMNGTMSQV
GILGFVFTL
WLSLLVPFV
FLPSDFFPS

P(G1) = 2/9 = 0.22, ...
P(V6) = 4/9 = 0.44, ...
P(G1,V6) = 2/9 = 0.22
P(G1)*P(V6) = 8/81 = 0.10

log(0.22/0.10) > 0
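A small sketch of the mutual information sum for the nine peptides above. The base of the logarithm is not stated on the slide; base 2 is assumed here, and the function name `mutual_information` is chosen for illustration.

```python
import math
from collections import Counter

peptides = [
    "ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
    "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS",
]


def mutual_information(seqs, i, j):
    """Sum over observed residue pairs of P(a,b) * log2(P(a,b) / (P(a) P(b)))."""
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    count_ij = Counter((s[i], s[j]) for s in seqs)
    total = 0.0
    for (a, b), c in count_ij.items():
        joint = c / n
        total += joint * math.log2(joint / ((count_i[a] / n) * (count_j[b] / n)))
    return total


# Positions are 0-based in the code: P1 is index 0, P6 is index 5.
mi_p1_p6 = mutual_information(peptides, 0, 5)
print(f"I(P1, P6) = {mi_p1_p6:.2f} bits")
```

With only nine peptides the estimate is strongly affected by small-sample noise, so a positive value on its own should be read with care.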

Page 12: Sequence motifs, information content, logos, and HMM’s

Mutual information 

[Figure: mutual information for 313 binding peptides and for 313 random peptides]

Page 13: Sequence motifs, information content, logos, and HMM’s

Mutual information at anchor position is low

• Mutual information between the anchor positions 2 and 9 and other residues is low
  – At position 2, L, M, T, V and I are the most frequent amino acids
  – At position 9, V, L, I and A are the most frequent
  – 313 peptides (Rammensee + Buus)

• P(L2) = 0.51, P(V9) = 0.48, P(L2,V9) = 0.23
• P(L2,V9)/(P(L2)*P(V9)) = 0.23/0.24 = 1.0

• Knowing that we have L at position 2 does not tell us which one of V, L or I is placed at position 9 => NO mutual information

Page 14: Sequence motifs, information content, logos, and HMM’s

Weight matrices

• Estimate amino acid frequencies from the alignment, including sequence weighting and pseudo counts

• A weight matrix is then given as
  $W_{ij} = \log(p_{ij}/q_j)$

• Here i is a position in the motif and j an amino acid; $q_j$ is the background frequency of amino acid j

• W is an L x 20 matrix, where L is the motif length

• Score a sequence against the weight matrix by looking up and adding L values from the matrix
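The look-up-and-add scoring can be sketched as below. Several simplifications are assumptions of this sketch and not the method of the slides: a flat pseudo count of 1 per amino acid stands in for the Blosum-based correction, no sequence weighting is applied, the background is uniform (q = 1/20), and the natural logarithm is used. The training peptides are simply a handful of 9-mers used as toy input.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Toy training set of 9-mers (illustrative input only).
training = [
    "ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
    "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS",
]


def build_weight_matrix(peptides, pseudo=1.0):
    """W[i][j] = log(p_ij / q_j) with a flat pseudo count and uniform q."""
    q = 1.0 / len(AMINO_ACIDS)
    denom = len(peptides) + pseudo * len(AMINO_ACIDS)
    matrix = []
    for i in range(len(peptides[0])):
        column = [p[i] for p in peptides]
        row = {aa: math.log((column.count(aa) + pseudo) / denom / q)
               for aa in AMINO_ACIDS}
        matrix.append(row)
    return matrix


def score(matrix, peptide):
    """Look up one value per position and add the L values."""
    return sum(row[aa] for row, aa in zip(matrix, peptide))


W = build_weight_matrix(training)
print(f"GILGFVFTL: {score(W, 'GILGFVFTL'):.2f}")
print(f"DDDDDDDDD: {score(W, 'DDDDDDDDD'):.2f}")
```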

Page 15: Sequence motifs, information content, logos, and HMM’s

Example from real life

• 10 peptides from MHCpep database

• Bind to the MHC complex

• Relevant for immune system recognition

• Estimate sequence motif and weight matrix

• Evaluate on 528 peptides

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

Page 16: Sequence motifs, information content, logos, and HMM’s

Example (cont.)

• Raw sequence counting
  – No sequence weighting
  – No pseudo count
  – Prediction accuracy 0.45

• Sequence weighting
  – No pseudo count
  – Prediction accuracy 0.5


Page 17: Sequence motifs, information content, logos, and HMM’s

Example (cont.)

• Sequence weighting and pseudo count
  – Prediction accuracy 0.60

• Motif found on all data (485)
  – Prediction accuracy 0.79


Page 18: Sequence motifs, information content, logos, and HMM’s

Hidden Markov Models

• Weight matrices do not deal with insertions and deletions

• In alignments, this is done in an ad hoc manner by optimizing two gap penalties, one for the first gap and one for gap extension

• An HMM is a natural framework in which insertions and deletions are dealt with explicitly

Page 19: Sequence motifs, information content, logos, and HMM’s

HMM (a simple example)

ACA---ATG

TCAACTATC

ACAC--AGC

AGA---ATC

ACCG--ATC

• Example from A. Krogh

• The core region defines the number of states in the HMM (red)

• Insertion and deletion statistics are derived from the non-core part of the alignment (blue)

Core of alignment

Page 20: Sequence motifs, information content, logos, and HMM’s

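HMM construction

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

[Figure: the resulting HMM. Six match states and one insert state, each with an emission distribution over A, C, G, T. Transition probabilities are 0.6 into the insert state and 0.4 past it, 0.4 to stay in the insert state and 0.6 to leave it, and 1 elsewhere.]

• 5 matches in the gap region: A, 2xC, T, G
• 5 transitions in the gap region
• C out, G out
• A-C, C-T, T out
• Out transition 3/5
• Stay transition 2/5

ACA---ATG: 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2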
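The counting behind the insert-state parameters can be sketched directly from the alignment. Treating columns 4 to 6 as the non-core (gap) region follows the slide; the variable names are chosen here for illustration.

```python
from collections import Counter

alignment = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]

# Residues each sequence places in the gap region (alignment columns 4-6).
inserts = ["".join(c for c in seq[3:6] if c != "-") for seq in alignment]

# Emission probabilities of the insert state: 5 residues in total.
emitted = Counter("".join(inserts))
n_emitted = sum(emitted.values())
insert_emission = {base: emitted[base] / n_emitted for base in "ACGT"}

# Each inserted residue is followed by one transition: "stay" if another
# inserted residue follows, "out" if it is the last one in its sequence.
stay = sum(len(ins) - 1 for ins in inserts if ins)
out = sum(1 for ins in inserts if ins)
p_stay = stay / (stay + out)
p_out = out / (stay + out)

# From the last core state before the gap: enter the insert state or skip it.
p_enter = out / len(alignment)
p_skip = 1 - p_enter

print(insert_emission)
print(f"stay {p_stay}, out {p_out}, enter {p_enter}, skip {p_skip:.1f}")
```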

Page 21: Sequence motifs, information content, logos, and HMM’s

Align sequence to HMM

ACA---ATG = 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2
TCAACTATC = 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075x10^-2
ACAC--AGC = 1.2x10^-2
AGA---ATC = 3.3x10^-2
ACCG--ATC = 0.59x10^-2

Consensus:
ACAC--ATC = 4.7x10^-2
ACA---ATC = 13.1x10^-2

Exceptional:
TGCT--AGG = 0.0023x10^-2
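The products above can be reproduced with a short sketch. The emission and transition values are hard-coded as read off the example alignment and the factors shown in the products; the function name `path_probability` and the fixed 3 + 3 + 3 column layout are assumptions made for this illustration.

```python
# Match-state emissions: column frequencies of the six core positions.
match_emission = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
insert_emission = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
P_ENTER, P_SKIP, P_STAY, P_OUT = 0.6, 0.4, 0.4, 0.6


def path_probability(aligned):
    """Probability of an aligned sequence: 3 core, 3 gap and 3 core columns."""
    core = aligned[:3] + aligned[6:]
    insert = aligned[3:6].replace("-", "")
    prob = 1.0
    for state, base in zip(match_emission, core):
        prob *= state.get(base, 0.0)          # match-state emissions
    if insert:
        # enter the insert state, stay for each extra residue, then leave
        prob *= P_ENTER * P_STAY ** (len(insert) - 1) * P_OUT
        for base in insert:
            prob *= insert_emission[base]
    else:
        prob *= P_SKIP                        # jump past the insert state
    return prob                               # all other transitions are 1


for seq in ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "ACA---ATC", "TGCT--AGG"]:
    print(f"{seq} = {path_probability(seq) * 100:.4f} x 10^-2")
```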

Page 22: Sequence motifs, information content, logos, and HMM’s

Align sequence to HMM - Null model

• Score depends strongly on length

• The null model is a random model. For length L the score is $0.25^L$

• Log-odds score for sequence S: $\log(P(S)/0.25^L)$

ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6

Consensus:
ACAC--ATC = 6.7
ACA---ATC = 6.3

Exceptional:
TGCT--AGG = -0.97

Note!
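A sketch of the log-odds calculation, using the rounded P(S) values listed on the previous slide as input. The natural logarithm appears to reproduce the listed scores; because the inputs are rounded, the results can differ from the slide in the last digit. L is taken as the number of residues, so gap symbols are not counted.

```python
import math

# P(S) values as listed for the alignment to the HMM (rounded).
probabilities = {
    "ACA---ATG": 3.3e-2,
    "TCAACTATC": 0.0075e-2,
    "ACAC--AGC": 1.2e-2,
    "AGA---ATC": 3.3e-2,
    "ACCG--ATC": 0.59e-2,
    "ACAC--ATC": 4.7e-2,
    "ACA---ATC": 13.1e-2,
    "TGCT--AGG": 0.0023e-2,
}


def log_odds(aligned, p):
    """log(P(S) / 0.25^L), with L the number of residues in the sequence."""
    length = len(aligned.replace("-", ""))
    return math.log(p / 0.25 ** length)


scores = {seq: log_odds(seq, p) for seq, p in probabilities.items()}
for seq, s in scores.items():
    print(f"{seq} = {s:.2f}")
```

After the length correction, the long sequence TCAACTATC scores well above the exceptional sequence, and only the exceptional sequence gets a negative score.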

Page 23: Sequence motifs, information content, logos, and HMM’s

HMM’s and weight matrices

• In the case of un-gapped alignments HMM’s become simple weight matrices

• It still might be useful to use an HMM tool package to estimate a weight matrix
  – Sequence weighting
  – Pseudo counts

Page 24: Sequence motifs, information content, logos, and HMM’s

Profile HMM’s

• Alignments based on conventional scoring matrices (BLOSUM62) score all positions in a sequence in an equal manner

• Some positions are highly conserved, some are highly flexible (more than what is described in the BLOSUM matrix)

• Profile HMM’s are ideally suited to describe such position-specific variations

Page 25: Sequence motifs, information content, logos, and HMM’s

Example: Sequence profiles

• Alignment of 1PLC._ to 1GYC.A
• Blast e-value > 1000
• Profile alignment
  – Align 1PLC._ against Swiss-prot
  – Make a position specific weight matrix from the alignment
  – Use this matrix to align 1PLC._ against 1GYC.A
• E-value > 10^-22. Rmsd = 3.3

Page 26: Sequence motifs, information content, logos, and HMM’s

Example continued

 Score = 97.1 bits (241), Expect = 9e-22
 Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)

Query: 3  VLLGADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
          V+ G       F     +   G++          N+       +       +G + +
Sbjct: 26 VVNG------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79

Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
                  A G +F          G + ++         G+ G   V
Sbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126

Rmsd = 3.3

Page 27: Sequence motifs, information content, logos, and HMM’s

Profile HMM’s

EM55_HUMAN WWQGRVEGSSKESAGLIPSPELQEWRVASMAQSAP--SEAPSCSPFGKKKK-YKDKYLAK
CSKP_HUMAN WWQGKLENSKNGTAGLIPSPELQEWRVACIAMEKTKQEQQASCTWFGKKKKQYKDKYLAK
KAPB_MOUSE -----PENLLIDHQGYIQVTDFGFAKRVKG------------------------------
NRC2_NEUCR -----PENILLHQSGHIMLSDFDLSKQSDPGGKPTMIIGKNGTSTSSLPTIDTKSCIANF

EM55_HUMAN HSSIFDQLDVVSYEEVVRLPAFKRKTLVLIGASGVGRSHIKNALLSQNPEKFVYPVPYTT
CSKP_HUMAN HNAVFDQLDLVTYEEVVKLPAFKRKTLVLLGAHGVGRRHIKNTLITKHPDRFAYPIPHTT
KAPB_MOUSE RTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSG
NRC2_NEUCR RTNSFVGTEEYIAPEVIKGSGHTSAVDWWTLGILIYEMLYGTTPFKGKNRNATFANILRE

EM55_HUMAN RPPRKSEEDGKEYHFISTEEMTRNISANEFLEFGSYQGNMFGTKFETVHQIHKQNKIAIL
CSKP_HUMAN RPPKKDEENGKNYYFVSHDQMMQDISNNEYLEYGSHEDAMYGTKLETIRKIHEQGLIAIL
KAPB_MOUSE KVRFPSHF-----SSDLKDLLRNLLQVDLTKRFGNLKNGVSDIKTHKWFATTDWIAIYQR
NRC2_NEUCR DIPFPDHAGAPQISNLCKSLIRKLLIKDENRRLG-ARAGASDIKTHPFFRTTQWALI--R

EM55_HUMAN NNGVDETLKKLQEAFDQACSSPQWVPVSWVY
CSKP_HUMAN NNEIDETIRHLEEAVELVCTAPQWVPVSWVY
KAPB_MOUSE EKCGKEFCEF---------------------
NRC2_NEUCR ENAVDPFEEFNSVTLHHDGDEEYHSDAYEKR

[Figure labels marking regions of the alignment: Insertion, Deletion, Conserved]

Page 28: Sequence motifs, information content, logos, and HMM’s

Profile HMM’s

All M/D pairs must be visited once

Page 29: Sequence motifs, information content, logos, and HMM’s

TMHMM (trans-membrane HMM)

(Sonnhammer, von Heijne, and Krogh)

Model TM length distribution. Power of HMM. Difficult in alignment.

Page 30: Sequence motifs, information content, logos, and HMM’s

Combination of HMM’s - Gene finding

x cccxxxxxxxxATGccc cccTAAxxxxxxxx

Inter-genic region

Region around start codon

Coding region

Region around stop codon

Start codon

Stop codon

Page 31: Sequence motifs, information content, logos, and HMM’s

HMM packages

• HMMER (http://hmmer.wustl.edu/)
  – S.R. Eddy, WashU St. Louis. Freely available.

• SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)
  – R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.

• META-MEME (http://metameme.sdsc.edu/)
  – William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.

• NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)
  – Freely available to academia, nominal license fee for commercial users.
  – Allows HMM architecture construction.

Page 32: Sequence motifs, information content, logos, and HMM’s

Simple Hmmer command

hmmbuild --gapmax 0.0 --fast A2.hmmer A2.fsa

hmmbuild - build a hidden Markov model from an alignment
HMMER 2.2g (August 2001)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Alignment file: A2.fsa
File format: a2m
Search algorithm configuration: Multiple domain (hmmls)
Model construction strategy: Fast/ad hoc (gapmax 0.0)
Null model used: (default)
Sequence weighting method: G/S/C tree weights
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Alignment: #1
Number of sequences: 232
Number of columns: 9
Determining effective sequence number ... done. [192]
Weighting sequences heuristically ... done.
Constructing model architecture ... done.
Converting counts to probabilities ... done.
Setting model name, etc. ... done. [A2.fasta]
Constructed a profile HMM (length 9)
Average score: -6.42 bits
Minimum score: -15.47 bits
Maximum score: -0.84 bits
Std. deviation: 2.72 bits

>HLA-A.0201 16 Example_for_Ligand
SLLPAIVEL
>HLA-A.0201 16 Example_for_Ligand
YLLPAIVHI
>HLA-A.0201 16 Example_for_Ligand
TLWVDPYEV
>HLA-A.0201 16 Example_for_Ligand
SXPSGGXGV
>HLA-A.0201 16 Example_for_Ligand
GLVPFLVSV