gene prediction using hidden markov model and recurrent neural network
Gene Prediction Using Hidden Markov Model & Recurrent Neural Network
Ahmed Hani AlGhidani
MSc Student in Computer Science at Cairo University
Research and SDE at RDI Egypt
Agenda
• DNA Structure
- Eukaryotic and Prokaryotic Cells
• Gene Prediction Methods
- Empirical Methods
- Ab initio Methods
• Hidden Markov Model
- Existing HMM-based systems
• Recurrent Neural Network
• Other Methods
DNA Structure (Cont.)
• Eukaryotic Cells
• Exons (Coding)
• Introns (Non-Coding)
• Acceptors (End of an intron, reading in the 5’ to 3’ direction)
• Donors (Start of an intron, reading in the 5’ to 3’ direction)
Gene Prediction (Cont.)
• Empirical methods are used specifically for Prokaryotic cells
• Most of a prokaryotic genome is coding sequence, with no introns
• Feature Engineering method
• Open Reading Frames (ORFs)
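The ORF idea can be sketched in a few lines of Python. This is only an illustration, not the system described in the talk: it scans the forward strand in all three reading frames, assumes the standard start codon ATG and stop codons TAA, TAG and TGA, and the function name `find_orfs` and the `min_codons` threshold are hypothetical choices made for this example.

```python
# Sketch: find open reading frames (ORFs) on the forward strand only.
# Assumes the standard start codon (ATG) and stop codons (TAA, TAG, TGA).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) index pairs; end is exclusive and
    includes the stop codon. min_codons counts codons before the stop."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # close it and keep scanning
    return sorted(orfs)


if __name__ == "__main__":
    for begin, end in find_orfs("CCATGAAATAGCC"):
        print(begin, end)
```

A fuller gene finder would also scan the reverse complement and use a much larger minimum length; both are left out here to keep the sketch short.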
Gene Prediction (Cont.)
• Pros
- Simple and easy to implement
- Works well with Prokaryotic DNA because of its simplicity
• Cons
- Poor performance on large sequences
- Works poorly with complex DNA such as Eukaryotic DNA
Gene Prediction (Cont.)
• Ab initio methods for Eukaryotic cells
• Depend on statistical methods and computational models
• Feature engineering could be involved in the computations
• Hidden Markov Model and Recurrent Neural Networks
Hidden Markov Model (Cont.)
• Practically, it may be hard to access the patterns or classes that we want to predict
• We need indicators (visible states) to obtain the hidden patterns
Hidden Markov Model (Cont.)
• Problem 1: Observation Probability Estimation
- Estimate the probability of an observation sequence given the model
• Problem 2: Optimal Hidden State Sequence
- Determine the optimal sequence of hidden states
• Problem 3: HMM Parameter Estimation
- Find the model parameters that maximize the probability of the given observations
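The first of the three problems above (the probability of an observation sequence given the model) is usually solved with the forward algorithm. The sketch below is a minimal illustration under stated assumptions: the two states and every probability in `START_P`, `TRANS_P` and `EMIT_P` are made-up toy values, not parameters from the talk.

```python
# Sketch: forward algorithm for P(observations | model).
# All states and probabilities below are invented toy values.
STATES = ("Exon", "Intron")
START_P = {"Exon": 0.5, "Intron": 0.5}
TRANS_P = {
    "Exon": {"Exon": 0.9, "Intron": 0.1},
    "Intron": {"Exon": 0.2, "Intron": 0.8},
}
EMIT_P = {
    "Exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "Intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(obs, states, start_p, trans_p, emit_p):
    """Return the probability of a non-empty observation string."""
    # alpha[s] = probability of the prefix seen so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol]
            * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


if __name__ == "__main__":
    print(forward("ACGT", STATES, START_P, TRANS_P, EMIT_P))
```

For long sequences the products underflow, so practical implementations rescale `alpha` at each step or work in log space.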
Hidden Markov Model (Cont.)
• In Gene Prediction, the observations are the A, C, G, T sequences, and the hidden states are Exons, Introns and Other
• Use the training data to set the model parameters (problem 3) using the Baum-Welch algorithm
• For the given observations, we predict the states (problem 2) using the Viterbi algorithm
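The decoding step can be sketched as follows. This is a minimal Viterbi implementation in log space over the three hidden states named above. The transition, emission and start probabilities are invented toy values standing in for parameters that would normally come from training, and all probabilities are assumed to be strictly positive so that the logarithms are defined.

```python
# Sketch: Viterbi decoding of hidden states (Exon / Intron / Other).
# All probabilities are invented toy values and must be > 0.
import math

STATES = ("Exon", "Intron", "Other")
START_P = {"Exon": 0.2, "Intron": 0.1, "Other": 0.7}
TRANS_P = {
    "Exon": {"Exon": 0.8, "Intron": 0.15, "Other": 0.05},
    "Intron": {"Exon": 0.15, "Intron": 0.8, "Other": 0.05},
    "Other": {"Exon": 0.1, "Intron": 0.05, "Other": 0.85},
}
EMIT_P = {
    "Exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "Intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "Other": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a non-empty observation string."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # one back-pointer table per position after the first
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(trans_p[best][s])
                        + math.log(emit_p[s][symbol]))
        back.append(pointers)
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("ATGCGCGCTTAA", STATES, START_P, TRANS_P, EMIT_P))
```

Working in log space replaces products with sums, which avoids numerical underflow on long DNA sequences.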
Neural Network (Cont.)
• Relatively unexplored area in Bioinformatics
• No need for feature engineering
• Often outperforms classical Machine Learning methods
• Based on Biological philosophy!
Recurrent Neural Networks (Cont.)
• Exon/Intron prediction is still in progress
• Dataset size is 800K sequences
• Sequences aren’t fixed-size
• LSTM instead of Vanilla RNN
• TensorFlow
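Because the sequences are not fixed-size, they are commonly one-hot encoded and padded to a common length before being fed to a recurrent model. The sketch below shows one way to do that in plain Python. It is an assumption about a typical preprocessing step, not the pipeline used in the talk, and the helper names `one_hot` and `pad_batch` are hypothetical.

```python
# Sketch: one-hot encode DNA strings and pad them to a common length.
# The helper names and the zero-padding scheme are illustrative assumptions.
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-element vector in A, C, G, T order."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def pad_batch(seqs):
    """Return (batch, mask): zero-padded encodings and 1/0 validity flags."""
    max_len = max(len(s) for s in seqs)
    batch, mask = [], []
    for s in seqs:
        pad = max_len - len(s)
        batch.append(one_hot(s) + [[0.0] * len(ALPHABET) for _ in range(pad)])
        mask.append([1] * len(s) + [0] * pad)   # 1 = real base, 0 = padding
    return batch, mask


if __name__ == "__main__":
    batch, mask = pad_batch(["ACGT", "GG", "TAC"])
    print(len(batch), len(batch[0]), len(batch[0][0]))
    print(mask)
```

The mask lets a recurrent layer skip the padded positions; in TensorFlow this role is typically played by a Keras masking layer placed in front of the LSTM.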
Other Methods
• Naive Bayesian + Statistical Features
• Hidden Markov Model Support Vector Machine (HMM-SVM)
• Open Reading Frames + Hidden Markov Model
• Open Reading Frames + Statistical Features + Hidden Markov Model