Applications of HMMs Yves Moreau 2003-2004


Page 1: Applications of HMMs Yves Moreau 2003-2004. Overview Profile HMMs Estimation Database search Alignment Gene finding Elements of gene prediction Prokaryotes

Applications of HMMs

Yves Moreau

2003-2004


Overview

Profile HMMs
  Estimation
  Database search
  Alignment

Gene finding
  Elements of gene prediction
  Prokaryotes vs. eukaryotes
  Gene prediction by homology
  GENSCAN


Profile HMM

Hidden Markov model for the modeling of protein families and for multiple alignment

Example
Part of the alignment of the SH3 domain
Two conserved regions separated by a variable region

GGWWRGdy.ggkkqLWFPSNYV
IGWLNGynettgerGDFPGTYV
PNWWEGql..nnrrGIFPSNYV
DEWWQArr..deqiGIVPSK--
GEWWKAqs..tgqeGFIPFNFV
GDWWLArs..sgqtGYIPSNYV
GDWWDAel..kgrrGKVPSNYL
-DWWEArslssghrGYVPSNYV
GDWWYArslitnseGYIPSTYV
GEWWKArslatrkeGYIPSNYV
GDWWLArslvtgreGYVPSNFV
GEWWKAkslsskreGFIPSNYV
GEWCEAgt.kngq.GWVPSNYI
SDWWRVvnlttrqeGLIPLNFV
LPWWRArd.kngqeGYIPSNYI
RDWWEFrsktvytpGYYESGYV
EHWWKVkd.algnvGYIPSNYV
IHWWRVqd.rngheGYVPSSYL
KDWWKVev..ndrqGFVPAAYV


Profile HMMs

Hidden Markov Models for multiple alignments
Match, insert, and delete states

[Figure: profile HMM with begin (Bgn) and end (End) states and match, insertion, and deletion states]


Silent deletion states

Deletions could be modeled by shortcut jumps between states
Problem: number of transitions grows quadratically
Other solution: use parallel states that do not produce any symbol (silent states)
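
The difference in growth can be made concrete with a rough count. This is only a sketch under stated assumptions: a chain of n match states, insert states ignored, and the function names are made up for illustration. With shortcut jumps every forward pair of states needs its own transition; with one silent delete state per position, each position only needs the four transitions match-to-match, match-to-delete, delete-to-match, and delete-to-delete.

```python
def shortcut_transitions(n):
    """Forward jumps between every ordered pair i < j of n match states."""
    return n * (n - 1) // 2


def silent_state_transitions(n):
    """Transitions when each position has one match and one silent delete state.

    Between consecutive positions there are 4 transitions:
    M->M, M->D, D->M, D->D (insert states are ignored in this sketch).
    """
    return 4 * (n - 1)


for n in (10, 100, 1000):
    print(n, shortcut_transitions(n), silent_state_transitions(n))
```

For a model of length 100 this gives 4950 shortcut transitions against 396 transitions with silent states, which is why the silent-state design is preferred.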


HMM from multiple alignment

GGWWRGdy.ggkkqLWFPSNYV
IGWLNGynettgerGDFPGTYV
PNWWEGql..nnrrGIFPSNYV
DEWWQArr..deqiGIVPSK--
GEWWKAqs..tgqeGFIPFNFV
GDWWLArs..sgqtGYIPSNYV
GDWWDAel..kgrrGKVPSNYL
-DWWEArslssghrGYVPSNYV
GDWWYArslitnseGYIPSTYV
GEWWKArslatrkeGYIPSNYV
GDWWLArslvtgreGYVPSNFV
GEWWKAkslsskreGFIPSNYV
GEWCEAgt.kngq.GWVPSNYI
SDWWRVvnlttrqeGLIPLNFV
LPWWRArd.kngqeGYIPSNYI
RDWWEFrsktvytpGYYESGYV
EHWWKVkd.algnvGYIPSNYV
IHWWRVqd.rngheGYVPSSYL
KDWWKVev..ndrqGFVPAAYV

Multiple alignment (+ conserved columns)

Parameter estimation = estimation with known paths


Corresponding profile HMM
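
Because the alignment fixes which state emitted each residue, estimation reduces to counting. The following is a minimal sketch of that idea for the emission probabilities of match columns only; the function name `match_emissions` is hypothetical, the input is the first conserved block of a few rows from the alignment above, and transition probabilities (which would be counted the same way) are left out.

```python
from collections import Counter


def match_emissions(rows):
    """Maximum-likelihood emission frequencies for each match column.

    rows: equal-length strings restricted to match columns; '-' marks a deletion
    and is not counted as an emitted residue.
    """
    emissions = []
    for col in range(len(rows[0])):
        counts = Counter(r[col] for r in rows if r[col] != "-")
        total = sum(counts.values())
        # A column made only of gaps has no emissions to estimate.
        emissions.append({aa: k / total for aa, k in counts.items()} if total else {})
    return emissions


rows = ["GGWWRG", "IGWLNG", "PNWWEG", "DEWWQA", "-DWWEA"]
for i, dist in enumerate(match_emissions(rows), start=1):
    print(i, dist)
```

Note that the third column yields probability 1 for W and 0 for every other residue, which is exactly the situation pseudocounts are meant to repair.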


Pseudocounts

Zero probabilities in the HMM cause the rejection of sequences containing previously unseen residues

To avoid this problem, add pseudocounts (add extra counts as if prior data were available)

New profile HMM
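
As an illustration, the simplest scheme adds the same constant to every residue count (Laplace or add-one smoothing). This is a sketch assuming a uniform pseudocount over the 20 standard amino acids; the function name and the default value of `pseudo` are choices made for this example, not taken from the slides.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def emissions_with_pseudocounts(column, pseudo=1.0):
    """Emission probabilities for one match column with a uniform pseudocount."""
    counts = {aa: pseudo for aa in AMINO_ACIDS}
    for aa in column:
        if aa in counts:  # skips gaps and non-standard symbols
            counts[aa] += 1
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


probs = emissions_with_pseudocounts("WWWWW")
print(probs["W"], probs["A"])  # W stays dominant, A is small but no longer zero
```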



Database search with profile HMM

The estimated model can be used to detect new members of the protein family in a sequence database (more sensitive than PSI-BLAST)

For each sequence in the database, we compute P(x, π* | M) (Viterbi, with π* the most probable path) or P(x | M) (forward algorithm)

In practice we work with log-odds (w.r.t. the random model P(x | R))
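
The log-odds score is log P(x | M) / P(x | R). The sketch below shows the idea in a deliberately simplified setting: it assumes an ungapped model with one emission distribution per position and a position-independent background, so transitions, insertions, and deletions are ignored. The toy distributions and the function name are invented for illustration.

```python
import math


def log_odds(seq, emissions, background):
    """Log-odds score (in bits) of seq under per-position emissions vs. a background."""
    return sum(
        math.log2(emissions[i][aa] / background[aa])
        for i, aa in enumerate(seq)
    )


# Toy two-position model and a uniform background over 20 amino acids (assumed values).
emissions = [{"G": 0.5, "A": 0.5}, {"W": 0.8, "L": 0.2}]
background = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}

print(log_odds("GW", emissions, background))  # positive: fits the model better than random
```

A positive score means the sequence is more probable under the family model than under the random model; working in logarithms also avoids numerical underflow on long sequences.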


Alignment to profile HMM

Through Viterbi (search for the best alignment path), we can align sequences w.r.t. a profile HMM
  Training sequences
  Database matches


Multiple alignment with profile HMM

If the sequences are not aligned, it is possible to train a profile HMM to align them

Initialization: choose the length of the profile HMM
Length of profile HMM is number of match states

sequence length

Training: estimate the model via Viterbi training or Baum-Welch training
Heuristics to avoid local minima

Multiple alignment: use Viterbi decoding to align sequences


Extensions

More sophisticated pseudocounts are possible
  Dirichlet mixtures

Different types of local alignments can be done with HMMs

Methods are available to weight sequences according to their evolutionary distances


Protein families

PFAM
http://www.sanger.ac.uk/Software/Pfam/search.shtml
Collection of protein families and protein domains

Provides multiple alignment of the protein families for the domains
Provides the domain organization of proteins
Provides profile HMMs of the domains


Software for profile HMMs

SAM: University of California Santa Cruz
http://www.cse.ucsc.edu/research/compbio/sam.html
Web service: http://www.cse.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html (takes time)

HMMER (‘hammer’): Washington University, St. Louis
http://genome.wustl.edu/eddy/hmmer.html


Gene finding


Overview

Elements of gene prediction
Prokaryotes vs. eukaryotes
Gene prediction by homology
GENSCAN


DNA makes RNA makes proteins


Evidence for gene prediction

Sources of evidence (positive and negative)
  Sequence similarity to known genes (e.g., found by BLASTX)
  Statistical measure of codon bias
  Template matches to functional sites (e.g., splice site)
  Similarity to features not likely to overlap coding sequence (e.g., Alu repeats)

The structure must respect the biological grammar (promoter, exon, intron, ...)


Search by signal vs. search by content

Search by signal
  Detect short signals in the genome
  E.g., splice site, signal peptide, glycosylation site
  Neural networks can be useful here

Search by content
  Detect extended regions in the genome
  E.g., coding regions, CpG islands
  Hidden Markov Models are useful here

Gene finding algorithms combine both
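
To illustrate search by content, here is a minimal sketch of Viterbi decoding with a two-state HMM that segments a sequence into GC-rich regions (state I) and background (state B). All probabilities are invented for the example and are not estimated from real data; the computation is done in log space to avoid underflow.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most probable state path for obs (all probabilities must be non-zero)."""
    # Scores for the first symbol.
    scores = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for x in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            prev = max(states, key=lambda r: scores[-1][r] + math.log(trans[r][s]))
            col[s] = scores[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][x])
            ptr[s] = prev
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


states = "IB"  # I = GC-rich region, B = background (toy model)
start = {"I": 0.5, "B": 0.5}
trans = {"I": {"I": 0.9, "B": 0.1}, "B": {"I": 0.1, "B": 0.9}}
emit = {
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "B": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

print(viterbi("ATATGCGCGCATAT", states, start, trans, emit))
```

The high self-transition probabilities are what make the model prefer extended regions over switching state at every base.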


Probabilistic prediction vs. homology

Hidden Markov Models can be used to predict genes

Homology to a known gene is also a strong method for detecting genes

More and more gene prediction packages combine both approaches


Search by signal vs. content


Signals in prokaryotes

Transcription start and stop
  -35 region
  TATA box

Translation start and stop
  Open Reading Frames
  Shine-Dalgarno motif
  Start ATG/GTG
  Stop TAA/TAG/TGA
  Stem-loops

Operon


Problems for prokaryotes

Short genes are hard to detect
Operons
Overlapping genes


Signals in eukaryotes

Transcription
  Promoter/enhancer/silencer
  TATA box
  Introns/exons
    Donor/acceptor/branch
  PolyA
  Repeats
    Alu, satellites
  CpG islands
  Cap/CCAAT&GC boxes

Translation
  5’ and 3’ UTR
  Kozak consensus
  Start ATG
  Stop TAA/TAG/TGA


Open reading frames

Translate the sequence into the six possible reading frames
Check for start and stop codons
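
A minimal sketch of such a scan is given below. It assumes ATG as the only start codon and reports, for each of the three frames on both strands, every stretch from the first ATG to the next in-frame stop codon. The function names are made up for this example, and no minimum ORF length is enforced.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) of ATG...stop ORFs in one frame; end is exclusive."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            yield start, i + 3
            start = None


def six_frame_orfs(seq):
    """Return (strand, frame, orf_sequence) for all ORFs in the six reading frames."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                found.append((strand, frame, s[start:end]))
    return found


print(six_frame_orfs("CCATGAAATAGCC"))
```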


Codon bias

In coding sequences, genomes have specific biases for the use of codons encoding the same amino acid
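
Such a bias can be measured by counting, within a coding sequence, how often each synonymous codon is used. The sketch below does this for the six leucine codons only; the function name and the toy sequence are illustrative assumptions, and a real analysis would cover every amino acid and many genes.

```python
from collections import Counter

LEUCINE_CODONS = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")


def synonymous_usage(cds, synonymous=LEUCINE_CODONS):
    """Relative usage of a set of synonymous codons in an in-frame coding sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(c for c in codons if c in synonymous)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: counts[c] / total for c in synonymous}


print(synonymous_usage("ATGCTGCTGTTACTGTAA"))
```

In this toy sequence CTG accounts for three of the four leucine codons, the kind of skew a gene finder can exploit as a coding signal.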


Coding potential

Most coding potentials are based on analysis of codon usage

The HMM keeps track of some kind of average coding potential around each position

The increase and decrease of the coding potential will “push” the HMM in and out of the exons


Promoter region

Promoter region contains the elements that control the expression of the gene
Prediction of the promoter region (e.g., prediction of the TATA-box) is difficult


Intron-exon splicing

Consensus
  5’ donor: (A,C)AG/GT(A,G)AGT
  3’ acceptor: TTTTTNCAG/GCCCCC
  Branch: CT(G,A)A(C,T)

Neural networks can predict splice sites; they can detect complex correlations between positions in a functional site
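
As a baseline for comparison, the 5’ donor consensus above can be written directly as a regular expression. This is a naive sketch: it only finds exact matches to the consensus, whereas real donor sites deviate from it, which is precisely why weight matrices and neural networks are used in practice. The function name is hypothetical.

```python
import re

# (A,C)AG/GT(A,G)AGT as a zero-width lookahead so that overlapping candidates are all reported.
DONOR = re.compile(r"(?=[AC]AGGT[AG]AGT)")


def donor_sites(seq):
    """Return 0-based positions of the first intron base (the G of GT) for exact consensus matches."""
    return [m.start() + 3 for m in DONOR.finditer(seq.upper())]


print(donor_sites("TTCAGGTAAGTCC"))
```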


Gene prediction by homology


Gene prediction by homology

Coding regions evolve more slowly than noncoding ones (conserved by natural selection because of their functional role)

Not only the protein sequence but also the gene structure can be conserved

Use standard homology methods
Gene syntax must be respected


Gene prediction by homology


Procrustes

Find potentially related sequences with BLASTX (= model sequences)

Find all possible blocks (exons) on the basis of acceptor/donor location
Look which blocks can be aligned with model sequences
Look for best alignment of blocks with the query sequence


Gene prediction by homology

Advantages
  Recognition of short exons and atypical exons
  Correct assembly of complex genes (> 10 exons)

Disadvantages
  Genes without known homologs are missed
  Good homologs necessary for the prediction of the gene structure
  Very sensitive to sequencing errors


GENSCAN


GENSCAN

GENSCAN was used for the annotation of the human genome in the Human Genome Project

Gene prediction with Hidden Semi-Markov Models

Different models depending on G+C content (<43% G+C, 43-50%, 50-57%, >57%)
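
Selecting the parameter set then amounts to measuring the G+C content of the input sequence. The sketch below shows one way this could look; the function names are invented, and the handling of the exact boundary values (43, 50, and 57%) is an assumption, since the slide does not say on which side they fall.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def gc_class(seq):
    """Return the G+C class label used to pick a model (boundary handling assumed)."""
    pct = 100 * gc_fraction(seq)
    if pct < 43:
        return "<43%"
    if pct < 50:
        return "43-50%"
    if pct < 57:
        return "50-57%"
    return ">57%"


print(gc_class("ATATATATGC"), gc_class("GCGCGCGCAT"))
```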


Typical gene structure


Signal: human splice site

5’ splice site

3’ splice site


Hidden semi-Markov model


Example

Nodes of HSMM
  Position-weight matrix (signal)
  Higher-order position-weight matrix
  HMM (content)
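
To make the first node type concrete, here is a minimal sketch of a position-weight matrix: it is built from a few aligned example sites and then used to score every window of a sequence. The example sites, the uniform background of 0.25, the pseudocount, and the function names are all assumptions made for illustration and are not GENSCAN's actual parameters.

```python
import math


def build_pwm(sites, pseudo=1.0):
    """Log-odds (bits) position-weight matrix from equal-length aligned DNA sites."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {b: pseudo for b in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        # Compare each base frequency with a uniform background of 0.25.
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in "ACGT"})
    return pwm


def best_hit(seq, pwm):
    """Return (score, position) of the highest-scoring window in seq."""
    width = len(pwm)
    return max(
        (sum(pwm[i][seq[p + i]] for i in range(width)), p)
        for p in range(len(seq) - width + 1)
    )


pwm = build_pwm(["GTAAGT", "GTGAGT", "GTAAGA"])  # hypothetical donor-like sites
print(best_hit("CCAGGTAAGTCC", pwm))
```

A plain position-weight matrix treats positions as independent; the higher-order variant listed above conditions each position on its neighbours.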


Architecture of GENSCAN


Training of HSMM

Viterbi algorithm
Viterbi algorithm for HSMMs


Gene structure prediction

Current performance on exon prediction is acceptable

However, grouping the correct exons into the genes is still problematic

In many cases, a significant proportion of the predicted genes will not be correct


CpG islands

In mammals, CpG islands have higher G+C and CG dinucleotide content than the rest of the DNA

CpG islands arise in active regions where no deactivation by methylation takes place (CG dinucleotides in methylated regions disappear by deamination)

CpG islands may be used as gene markers in mammals
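
A common way to quantify the CG dinucleotide content is the observed/expected CpG ratio, (number of CG × length) / (number of C × number of G). The sketch below computes it for a single window; the function name is illustrative, and any threshold for calling an island is left out on purpose.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio of a DNA window (0.0 if C or G is absent)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


print(cpg_obs_exp("CGCGCGCG"), cpg_obs_exp("CCCCGGGG"))
```

Both toy windows have the same G+C content, but only the first is rich in CG dinucleotides, which is the property that distinguishes islands from methylated DNA where CG has been depleted.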


Repeats

Repeats make up a large part of the human genome
  Alu repeats
  Long Interspersed Elements (LINEs)
  Short Interspersed Elements (SINEs)

Important to mask repeats when searching for genes


Promoter, enhancers, and silencers


Promoter, enhancers, and silencers


Polyadenylation signal

Polyadenylation (cleavage of pre-mRNA 3' end and synthesis of poly-(A) tract) is a very important early step of pre-mRNA processing

The most well-known signal involved in this process is AATAAA, located 15-20 nucleotides upstream from the poly-(A) site (site of cleavage)

Real polyadenylation signals can differ from the AATAAA consensus sequence. The most frequent natural variant, ATTAAA, is nearly as active as the canonical sequence.
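
Scanning for the canonical signal and its most frequent variant can be sketched with a single regular expression. This is only a candidate finder under simple assumptions: it reports every AATAAA or ATTAAA hexamer in a DNA string and does not check for a downstream cleavage site. The function name is made up for the example.

```python
import re

# Matches the canonical AATAAA and the ATTAAA variant.
POLYA_SIGNAL = re.compile(r"A[AT]TAAA")


def polya_signals(seq):
    """Return (position, hexamer) for each candidate polyadenylation signal."""
    return [(m.start(), m.group()) for m in POLYA_SIGNAL.finditer(seq.upper())]


print(polya_signals("GGAATAAACCCATTAAAGG"))
```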


Problem: alternative splicing


Problem: pseudogenes

Loss of promoter, extra stop codon, frameshift
Translocation, duplication


Problem: RNA genes

rRNA (ribosomal)
tRNA (transfer)
snRNA (splicing)
telomerase RNA (telomerase)
microRNAs


Neural networks for exon prediction

GRAIL uses a neural network to predict the score of a candidate exon