research proposal lancaster university
7/29/2019 Research Proposal Lancaster University.
Nermeen Shaltout
Lancaster Scholarship
Previous Work and Research Proposal
February 8, 2012
1 Previous Work
The research work in Bioinformatics described below combines two disciplines,
biology and computer science, and follows the usual machine learning workflow.
The previous project was a survey of existing data mining techniques for
classifying the influenza virus. Its goal was to classify influenza data
according to host (Swine, Avian, or Human) in three steps:
1. Data Preprocessing
2. Feature selection
3. Classification
1.1 Data Preprocessing
The data preprocessing stage uses well-known Bioinformatics tools to prepare the
data for the feature selection process. It involves:
1. Downloading data from online gene banks such as www.ncbi.org [2] and
www.fludb.org [1]. The data is genetic material in the form of nucleotides,
the letters that represent the building units of genes. The downloaded data is
DNA data, made up of four letters (A, G, C, and T) arranged in patterns that
vary with the virus and the host.
2. Once the genetic data is downloaded, it is aligned using tools such as
BioEdit [3] and MAFFT [4]. Alignment makes extracting the features easier
by arranging similar data segments together.
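The result of the preprocessing steps above can be sanity-checked with a short script. The following is a minimal sketch in Python, assuming the downloaded and aligned sequences are stored in FASTA format; the header layout (`seq1|Human`) and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks).upper()))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records

def is_aligned(records):
    """True if all sequences have equal length, as an alignment requires."""
    return len({len(seq) for _, seq in records}) <= 1

def unexpected_symbols(records, allowed="ACGT-"):
    """Count symbols outside the four nucleotides and the gap character."""
    counts = Counter()
    for _, seq in records:
        counts.update(ch for ch in seq if ch not in allowed)
    return dict(counts)

# Hypothetical toy input; real records would come from the gene banks.
example = """>seq1|Human
ACGT-A
>seq2|Avian
ACGTTA
>seq3|Swine
ACG
TNA
"""
records = parse_fasta(example)
print(is_aligned(records), unexpected_symbols(records))
```

Equal sequence length is only a necessary condition for a valid alignment; the check does not judge the quality of the alignment itself.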
1.2 Data Extraction
Genetic data sets are very large, so a subset of informative features must be
selected before classification. The feature selection was implemented from
scratch in Matlab.
1. The data extraction is performed using information gain, across all
three virus classes.
2. The data with the highest information gain is the most suitable for
classification, and is selected in order to classify the genes.
3. The information gain is calculated in two tiers, because the
influenza virus data set is so large. Influenza viruses can also be divided
by subtype. The subtype is determined by proteins on the surface of
the virus known as antigens, the two main antigens being H and N.
In the first tier, the information gain is maximized against the four
H antigen subtypes: H1, H2, H3, and H5. In the second tier, the
information gain is optimized against the influenza host types: Human,
Avian, and Swine. The classification, described in the next section,
is divided into the same two tiers.
4. The information gain is calculated by subtracting the remainder from
the entropy. The entropy is calculated using the equation given in [5].
5. The remainder is calculated using the corresponding equation in [5].
The information gain is then the difference between the two, as defined
in [5].
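The equations referred to in steps 4 and 5 are not reproduced in this copy. As a hedged illustration, the sketch below uses the standard textbook definitions, which are assumed (not confirmed) to match those in [5]: the entropy of a sample set S is H(S) = -sum over classes c of p_c * log2(p_c), the remainder of a feature A is the sum over its values v of (|S_v| / |S|) * H(S_v), and the gain is H(S) minus the remainder. The original work was coded in Matlab; this Python version, its function names, and the toy sequences are illustrative assumptions only.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def remainder(column, labels):
    """Expected entropy left after splitting samples by the symbol in one column."""
    groups = defaultdict(list)
    for symbol, label in zip(column, labels):
        groups[symbol].append(label)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())

def information_gain(column, labels):
    """Entropy of the labels minus the remainder for this column."""
    return entropy(labels) - remainder(column, labels)

def rank_positions(sequences, labels):
    """Return (position, gain) pairs sorted by decreasing gain."""
    gains = []
    for pos in range(len(sequences[0])):
        column = [seq[pos] for seq in sequences]
        gains.append((pos, information_gain(column, labels)))
    return sorted(gains, key=lambda pair: pair[1], reverse=True)

# Hypothetical aligned toy sequences and host labels (not real influenza data).
sequences = ["ACGA", "ACGT", "GCTA", "GCTT", "TCGA", "TCGT"]
hosts = ["Human", "Human", "Avian", "Avian", "Swine", "Swine"]
ranking = rank_positions(sequences, hosts)
for pos, gain in ranking:
    print(f"position {pos}: gain {gain:.3f} bits")
```

For the two-tier scheme described in step 3, the same function could be called first with subtype labels (H1, H2, H3, H5) and then with host labels; how the two rankings are combined is not specified in the text and is left open here.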
1.3 Classification
Classification can be achieved using more than one technique. Some
techniques are more suitable for certain problems than others.
1. Check the classification accuracy using neural networks.
2. Check the classification accuracy using decision trees.
3. Measure and compare the performance of both techniques.
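The comparison in the steps above could be scored as in the following minimal Python sketch. It assumes each classifier has already produced one predicted host per test sequence; the names `classifier_a` and `classifier_b` and all predictions are made up for illustration and do not represent measured results for either technique.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted host matches the true host."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count how often each (true_host, predicted_host) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Hypothetical true hosts and predictions, for illustration only.
y_true = ["Human", "Human", "Avian", "Avian", "Swine", "Swine"]
predictions = {
    "classifier_a": ["Human", "Human", "Avian", "Swine", "Swine", "Swine"],
    "classifier_b": ["Human", "Swine", "Avian", "Avian", "Swine", "Human"],
}

scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
for name, score in scores.items():
    print(f"{name}: accuracy {score:.2f}")
    for (true_host, pred_host), n in sorted(confusion_counts(y_true, predictions[name]).items()):
        if true_host != pred_host:
            print(f"  {true_host} misclassified as {pred_host}: {n}")
```

Listing the misclassified pairs alongside the accuracy shows which hosts a technique confuses, which a single accuracy figure hides.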
Improving Influenza A Host Classification using Novel
Feature Selection and Classification Techniques
Main Points of Research
3.1 Research proposal
There are several strains of influenza virus, and the one causing the most
problems is the Influenza A virus. Influenza A has a high tendency to
mutate, which is what leads to pandemics. The virus has recently gained the
ability to infect more than one host: Avian Flu was able to infect both
Avian and Human hosts, and Swine Flu both Swine and Human hosts, which
makes these pandemics spread faster. The viruses can also kill the host, in
the case of Avian Flu, or threaten the lives of individuals within a certain
age range, in the case of Swine Flu. Influenza pandemics, especially
well-known variants such as Swine Flu and Avian Flu, have afflicted many
countries. If Influenza A can be classified according to host using its
genetic information, future pandemics can be tracked and contained more
easily. However, the classification must be done quickly.
3.2 Goals and Methodology
Feature selection and classification techniques already exist and were
surveyed in the previous project. Unlike that project, the main goal here is
to create an optimized feature selection, classification, or preprocessing
system that reaches the same result in less time. Bioinformatics
calculations are usually very computationally intensive, so building
algorithms or organizing techniques to achieve the same results in less
time is crucial to advancing research. The model can be further abstracted
for use with mutating viruses other than influenza by grouping the different
data mining techniques using scripts. (Usually a range of separate programs
is used to process the genetic data, instead of a unified system.) My main
future contribution would be to use graph-based decision trees to see
whether they improve the speed of the feature extraction and classification
algorithms [7]. If done right, the graph-based decision tree would be able
to perform feature selection and classification in one step, with minimal
data preprocessing.
3.3 Time Plan
Main phases of research work:
Phase I - Existing Paper and Literature Review
Phase II - Designing and Developing the System
Phase III - System Implementation, Deployment and Testing
Phase IV - Taking the Research to the Next Level
Phase I - Literature Review (already completed)
1. Data Mining in Bioinformatics [6]
2. Data Mining genetic data in other viruses [5]
3. Data Mining applied to the influenza virus
4. Novel techniques in Classification of genetic data [7]
Phase II - Designing and Developing (in progress)
1. Linking the different modules together via a script. More than one program is
usually used for alignment, feature selection, and classification. A script that groups
all three together would be beneficial, especially in the abstraction phase.
2. Building an improved graph-based classification system.
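The linking script in step 1 could take roughly the following shape. This is a sketch under stated assumptions, not the proposal's implementation: `run_pipeline` and the three stand-in stages are hypothetical, and `run_mafft` assumes a `mafft` executable is installed and on the PATH (it is defined here but not called in the example).

```python
import subprocess

def run_mafft(fasta_path):
    """Align the sequences in a FASTA file with the MAFFT command-line tool.

    Assumes `mafft` is installed; MAFFT writes the alignment to standard output.
    """
    result = subprocess.run(
        ["mafft", "--auto", fasta_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def run_pipeline(data, stages):
    """Pass data through each (name, function) stage in order."""
    log = []
    for name, stage in stages:
        data = stage(data)
        log.append(name)
    return data, log

# Stand-in stages so the sketch runs without external tools.
def pad_to_equal_length(seqs):
    """Toy placeholder for alignment: right-pad with gaps (not a real alignment)."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]

def select_columns(seqs, positions=(0, 2)):
    """Toy placeholder for feature selection: keep fixed, hypothetical columns."""
    return ["".join(s[p] for p in positions) for s in seqs]

def toy_classify(features):
    """Toy placeholder for classification: an arbitrary rule, for illustration only."""
    return ["Human" if f.startswith("A") else "Avian" for f in features]

stages = [
    ("align", pad_to_equal_length),
    ("select", select_columns),
    ("classify", toy_classify),
]
labels, log = run_pipeline(["ACGT", "GCT", "ACG"], stages)
print(log, labels)
```

Keeping each stage as an ordinary function means a stage can be swapped, for example for a graph-based classifier, without changing the script that links them.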
Phase III - Testing the System's Performance
1. Performance testing and evaluation of the old methods of preprocessing, feature
extraction, and classification.
2. Performance testing and evaluation of the novel graph-based decision tree.
3. Comparing both results to measure the improvement in performance.
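Because the stated aim is the same result in less time, the comparison in step 3 needs to check both output equality and running time. The sketch below is one possible timing harness in Python; `time_call` and `compare` are hypothetical helpers, and the two nucleotide-counting functions are simple stand-ins for the old and new methods, not the methods themselves.

```python
import time
from collections import Counter

def time_call(func, *args, repeats=5):
    """Return (best_seconds, result) over several runs of func(*args)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result

def compare(baseline, candidate, data, repeats=5):
    """Time both functions on the same data and check that their outputs agree."""
    base_time, base_out = time_call(baseline, data, repeats=repeats)
    cand_time, cand_out = time_call(candidate, data, repeats=repeats)
    return {
        "same_output": base_out == cand_out,
        "baseline_seconds": base_time,
        "candidate_seconds": cand_time,
        "speedup": base_time / cand_time if cand_time > 0 else float("inf"),
    }

# Stand-in workloads: two ways of counting nucleotides in a sequence.
def count_multi_pass(seq):
    return {base: sum(1 for ch in seq if ch == base) for base in "ACGT"}

def count_single_pass(seq):
    counts = Counter(seq)
    return {base: counts[base] for base in "ACGT"}

report = compare(count_multi_pass, count_single_pass, "ACGT" * 1000)
print(report)
```

Taking the best of several runs reduces noise from other processes; the measured times depend on the machine, so no expected figures are given.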
Phase IV - Taking the Research to the Next Level
1. Publishing the theoretical part of the paper before March 15.
2. Transferring the research to Lancaster University for further development,
or joining an available project at Lancaster that matches my research expertise.
3.4 Expected Results
Phase II should be completed before March 15th, by which time a proposal
should have been submitted and defended, during February, to the supervisors
of both disciplines. Phase III should take place from March 15th to May 15th,
when the research is passed to Kyoto University or another suitable university.
Phase IV should take place from September 2013 or Spring 2014,
depending on availability.
The final result of the research will be a system that takes genetic
input from online research databases and outputs the host of origin
of the virus, faster than existing approaches.
References
[1] The Influenza Research Database. N.p. Web.
[2] "National Center for Biotechnology Information." Influenza Virus Resource. N.p. Web.
[3] Hall, Tom. "http://www.mbio.ncsu.edu/bioedit/bioedit.html." Bioedit: Biological
Sequence Alignment Editor. Ibis Biosciences, Carlsbad, CA 92008, n.d. Web.
[4] Katoh, Kazutaka. "Mafft-a multiple sequence alignment
program." CBRC, AIST, n.d. Web.
[5] Leung, KS, Eddie YT Ng, KH Lee, Henry LY Chan, Stephen KW Tsui,
Tony SK Mok, Chi-Hang Tse, and Joseph JY Sung. "Data
Mining on DNA Sequences of Hepatitis B Virus by Nonlinear
Integrals." n. page. Web. 8 Feb. 2013.
[6] Saeys, Y, Inza I, and Larrañaga P. "A Review of Feature
Selection Techniques in Bioinformatics." Bioinformatics 23.19 (2007): 2507-17. Print.
[7] Geamsakul, Warodom, Takashi Matsuda, Tetsuya Yoshida,
Hiroshi Motoda, and Takashi Washio. "Constructing a
Decision Tree for Graph Structured Data." n. page. Web. 8
Feb. 2013.