gene prediction using hidden markov model and recurrent neural network

Gene Prediction Using Hidden Markov Model & Recurrent Neural Network

Ahmed Hani AlGhidani

MSc Student in Computer Science at Cairo University

Research and SDE at RDI Egypt


Agenda

• DNA Structure
- Eukaryotic and Prokaryotic Cells

• Gene Prediction Methods
- Empirical Methods
- Ab initio Methods

• Hidden Markov Model
- Existing HMM-based systems

• Recurrent Neural Network

• Other Methods

DNA Structure

DNA Structure (Cont.)

• Prokaryotic Cells

• Most of the DNA is coding

• No introns

• Promoters


• Eukaryotic Cells

• Exons (coding)

• Introns (non-coding)

• Acceptors (end of an intron, i.e. its 3’ boundary, when reading 5’→3’)

• Donors (start of an intron, i.e. its 5’ boundary, when reading 5’→3’)


• Eukaryotic Cells (cont.)

Gene Prediction

• Identify the exon regions that will be translated into amino acids (protein)

Gene Prediction (Cont.)

• Empirical methods are used specifically for prokaryotic cells

• Most prokaryotic DNA is coding, with no introns

• A feature engineering approach

• Open Reading Frames (ORFs)
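An ORF scan can be sketched in a few lines. This is a minimal illustration, not the method of any particular tool: it assumes the standard start codon ATG and stop codons TAA/TAG/TGA, scans only the forward strand in the three reading frames, and the function name `find_orfs` and its `min_codons` parameter are hypothetical.

```python
# Minimal ORF scan (illustrative sketch; forward strand only).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples; `end` is exclusive and includes the stop codon.

    `min_codons` counts the codons before the stop codon, including ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGCC"))
```

A real gene finder would also scan the reverse complement and typically apply a much larger minimum length.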


• Pros
- Simple and easy to implement
- Works well with prokaryotic DNA because of its simple structure

• Cons
- Poor performance on large sequences
- Works poorly with complex DNA such as eukaryotic DNA


• Ab initio methods are used for eukaryotic cells

• They depend on statistical methods and computational models

• Feature engineering can be involved in the computations

• Examples: Hidden Markov Models and Recurrent Neural Networks

Hidden Markov Model

• The basic idea is the Markov chain

• Finite set of states

• Transition Matrix
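As a small illustration of these ingredients, the sketch below builds a two-state Markov chain and computes the probability of a state path as the initial probability times the product of transition probabilities. The state names and all numbers are made-up assumptions chosen only so that each row of the transition matrix sums to 1.

```python
# Toy Markov chain (all probabilities are illustrative assumptions).
INITIAL = {"exon": 0.5, "intron": 0.5}
TRANSITION = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.2, "intron": 0.8},
}


def path_probability(path):
    """P(s1, ..., sn) = P(s1) * product of P(s_t | s_{t-1})."""
    prob = INITIAL[path[0]]
    for prev, cur in zip(path, path[1:]):
        prob *= TRANSITION[prev][cur]
    return prob


if __name__ == "__main__":
    # 0.5 * 0.9 * 0.1
    print(path_probability(["exon", "exon", "intron"]))
```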

Hidden Markov Model (Cont.)


• In practice, it may be hard to access the patterns or classes that we want to predict

• We need indicators (visible states) to obtain the hidden patterns

Hidden Markov Model (Cont.)

• Observation Probability Estimation (problem 1)
- Estimate the probability of an observation sequence given the model

• Optimal Hidden State Sequence (problem 2)
- Determine the optimal sequence of hidden states for the given observations

• HMM Parameter Estimation (problem 3)
- Find the model parameters that maximize the probability of the observations
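The first problem is usually solved with the forward algorithm. The sketch below is a minimal version under stated assumptions: the two states and every start, transition and emission probability are invented for illustration, and for long sequences a real implementation would rescale or work in log space to avoid underflow.

```python
# Forward algorithm on a toy HMM (all parameters are illustrative assumptions).
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.2, "intron": 0.8},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward_probability(obs):
    """Return P(obs | model), summing over all hidden state paths."""
    # alpha[s] = P(obs[0..t], state at time t is s)
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for symbol in obs[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


if __name__ == "__main__":
    print(forward_probability("ATGC"))
```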


• In gene prediction, the observations are the A, C, G, T sequences, and the hidden states are Exon, Intron and Other

• Use the training data to set the model parameters (problem 3) using the Baum-Welch algorithm

• For the given observations, predict the states (problem 2) using the Viterbi algorithm
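The decoding step can be sketched as follows. This is a minimal log-space Viterbi implementation over the three hidden states named above; the probabilities are hand-picked assumptions for illustration, whereas in the setup described here they would come from Baum-Welch training.

```python
import math

# Toy gene-finding HMM (all parameters are illustrative assumptions).
STATES = ("exon", "intron", "other")
START = {"exon": 0.2, "intron": 0.1, "other": 0.7}
TRANS = {
    "exon": {"exon": 0.8, "intron": 0.1, "other": 0.1},
    "intron": {"exon": 0.1, "intron": 0.8, "other": 0.1},
    "other": {"exon": 0.1, "intron": 0.1, "other": 0.8},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "other": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}


def viterbi(obs):
    """Return the most likely hidden state path for the observed bases."""
    # best[s] = log-probability of the best path that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    backpointers = []
    for symbol in obs[1:]:
        step, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            step[s] = prev
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][symbol]))
        backpointers.append(step)
        best = new_best
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: best[s])]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("ATGCGCTA"))
```

Working with log-probabilities keeps the scores numerically stable on long sequences, at the cost of requiring all parameters to be strictly positive.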

Neural Network (Cont.)

• A relatively unexplored area in bioinformatics

• No need for manual feature engineering

• Often outperforms classical machine learning

• Biologically inspired!

Neural Network (Cont.)

Recurrent Neural Networks

Recurrent Neural Networks (Cont.)


• Acceptor/Donor experiments


• Exon/intron experiments are still in progress

• The dataset contains 800K sequences

• Sequences are variable-length

• LSTM instead of a vanilla RNN

• TensorFlow
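Because the sequences are not fixed-size, they are typically one-hot encoded and padded to a common length before being fed to a recurrent model. The sketch below shows one way to do this in plain Python. It is an assumption about the preprocessing, not the pipeline actually used in these experiments, and the helper names `one_hot` and `pad_batch` are hypothetical.

```python
# One-hot encoding and zero-padding of variable-length DNA sequences
# (illustrative preprocessing sketch; not tied to any framework).
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-dimensional indicator vector.

    Bases outside ALPHABET (e.g. N) become an all-zero vector.
    """
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def pad_batch(seqs):
    """Return (batch, lengths): sequences padded with zero vectors to equal length."""
    max_len = max(len(s) for s in seqs)
    batch, lengths = [], []
    for s in seqs:
        encoded = one_hot(s)
        encoded += [[0.0] * len(ALPHABET) for _ in range(max_len - len(s))]
        batch.append(encoded)
        lengths.append(len(s))  # true length, so padded steps can be masked out
    return batch, lengths


if __name__ == "__main__":
    batch, lengths = pad_batch(["ACG", "T"])
    print(lengths)
```

The returned lengths let the recurrent layer ignore the padded time steps, so the padding does not influence the final hidden state.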

Other Methods

• Naive Bayes + Statistical Features

• Hidden Markov Model Support Vector Machine (HMM-SVM)

• Open Reading Frames + Hidden Markov Model

• Open Reading Frames + Statistical Features + Hidden Markov Model

References

• http://bpg.utoledo.edu/~afedorov/lab/eid.html
• http://www.ece.drexel.edu/gailr/ECE-S690-503/markov_models.ppt.pdf
• http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-62
• https://github.com/AhmedHani/Hidden-Markov-Model
• https://ahmedhanibrahim.wordpress.com/2015/10/25/hidden-markov-models-hmms-part-i/
• http://www.cbcb.umd.edu/software/GlimmerHMM/man.shtml?tid%5B%5D=44&=Apply
• http://www.math.uwaterloo.ca/~aghodsib/courses/w05stat440/w05stat440-notes/feb27.pdf
• https://en.wikipedia.org/wiki/GLIMMER
• https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-096-algorithms-for-computational-biology-spring-2005/lecture-notes/lecture7.pdf
• https://www.cs.us.es/~fran/students/julian/gene_finding/gene_finding.html
• http://www.nature.com/nbt/journal/v25/n8/full/nbt0807-883.html
• http://gobics.de/mario/papers/diss.pdf
• https://www.ncbi.nlm.nih.gov/books/NBK21132/
• https://archive.ics.uci.edu/ml/datasets/Molecular+Biology+(Splice-junction+Gene+Sequences)
