research proposal lancaster university


Nermeen Shaltout
Lancaster Scholarship
Previous Work and Research Proposal
February 8, 2012

1 Previous Work

The previous research work in bioinformatics, described below, spans two disciplines: biology and computer science. It follows the usual machine learning workflow. The goal was to classify influenza data according to host. The project surveyed existing data mining techniques for classifying the influenza virus by host (Swine, Avian, and Human) in three stages:

1. Data preprocessing

2. Feature selection

3. Classification

1.1 Data Preprocessing

The data preprocessing stage uses well-known bioinformatics tools to prepare the data for the feature selection process. It involves:

1. Downloading data from online gene banks such as www.ncbi.org [2] and www.fludb.org [1]. The data is genetic material in the form of nucleotides, the letters that represent the building units of genes. The downloaded data is DNA data, made up of four letters (A, G, C, and T) arranged in patterns that vary with the virus and the host.

2. Adjusting the downloaded genetic data with tools such as BioEdit [3] and MAFFT [4] for data alignment. Alignment makes extracting the features easier by arranging similar data segments together.
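To illustrate why alignment helps, the sketch below reads already-aligned sequences in FASTA format and regroups them by position, so that each position can later be scored as a candidate feature. It is a minimal Python sketch, not the tooling used in the project: the function names and the three sample records are hypothetical, and it assumes the aligner has already padded all sequences to a common length using '-' for gaps.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence data may span several lines; append to the current record.
            records[header] += line.upper()
    return records


def alignment_columns(records: Dict[str, str]) -> List[str]:
    """Return one string per alignment position, in record order."""
    sequences = list(records.values())
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("sequences are not aligned to a common length")
    return ["".join(column) for column in zip(*sequences)]


# Hypothetical aligned sample; '-' marks an alignment gap.
SAMPLE = """>human_1
ATG-CA
>avian_1
ATGGCA
>swine_1
ACG-CA
"""

columns = alignment_columns(parse_fasta(SAMPLE))
for position, column in enumerate(columns):
    print(position, column)
```

Each printed column holds the nucleotide found at one position in every sequence, which is the unit that a per-position feature score operates on.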

1.2 Data Extraction

Genetic data is very large, which is why features must be selected. The selection was coded from scratch in Matlab.

1. The data extraction is performed using information gain, computed across all three virus classes.

2. The data with the highest information gain is the most suitable for classification and is selected in order to classify the genes.

3. The information gain is computed in two tiers, because the influenza virus data set is so large. Influenza viruses can also be divided by subtype, which is determined by proteins on the surface of the virus known as antigens; the two main antigens are H and N. In the first tier, the information gain is maximized against the four H antigen subtypes: H1, H2, H3, and H5. In the second tier, it is optimized against the influenza host types: Human, Avian, and Swine. The classification described in the next section is divided into the same two tiers.

4. The information gain is calculated by subtracting the remainder from the entropy. The entropy is calculated using the equation given in [5].

5. The remainder is calculated using the corresponding equation in [5]. The information gain is finally calculated from these two quantities, again following [5].
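The calculation in the steps above can be sketched as follows. This is a minimal Python sketch (the original work was coded in Matlab) that assumes the standard definitions: the entropy of a set of labels is the negative sum of p * log2(p) over the classes, the remainder is the size-weighted entropy of the groups formed by the nucleotide found at one position, and the gain is the entropy minus the remainder. Whether [5] uses exactly these forms is an assumption, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy, in bits, of a collection of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def remainder(values: Sequence[str], labels: Sequence[str]) -> float:
    """Size-weighted entropy left after grouping the labels by feature value."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_gain(values: Sequence[str], labels: Sequence[str]) -> float:
    """Entropy of the labels minus the remainder for this feature."""
    return entropy(labels) - remainder(values, labels)


def rank_positions(columns: List[str], labels: Sequence[str]) -> List[int]:
    """Alignment positions sorted from highest to lowest information gain."""
    gains = [information_gain(column, labels) for column in columns]
    return sorted(range(len(columns)), key=lambda i: gains[i], reverse=True)


# Hypothetical data: one string per alignment position, one label per sequence.
columns = ["AAGG", "AAAA", "AGAG"]
hosts = ["Human", "Human", "Avian", "Avian"]

ranking = rank_positions(columns, hosts)
print(ranking[0], information_gain(columns[ranking[0]], hosts))
```

For the two-tier scheme, the same ranking function could be called first with subtype labels (H1, H2, H3, H5) and then, within each subtype, with host labels.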

1.3 Classification


Classification can be achieved using more than one technique. Some techniques are more suitable for certain data sets than others. The work involved:

1. Checking the classification accuracy using neural networks.

2. Checking the classification accuracy using decision trees.

3. Measuring the performance of both techniques.
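The comparison in the steps above comes down to running each classifier on the same labelled data and recording both accuracy and running time. The Python sketch below shows one way to do that. The two classifiers are hypothetical hand-written rules that only stand in for a trained neural network and a trained decision tree, and the samples and labels are made up for illustration.

```python
import time
from typing import Callable, Sequence, Tuple

Classifier = Callable[[str], str]


def evaluate(classifier: Classifier,
             samples: Sequence[str],
             labels: Sequence[str]) -> Tuple[float, float]:
    """Return (accuracy, elapsed seconds) for a classifier on labelled samples."""
    if len(samples) != len(labels) or not labels:
        raise ValueError("samples and labels must be non-empty and equal in length")
    start = time.perf_counter()
    predictions = [classifier(sample) for sample in samples]
    elapsed = time.perf_counter() - start
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels), elapsed


# Hypothetical stand-ins for the two trained models being compared.
def model_a(sequence: str) -> str:
    return "Human" if sequence[0] == "A" else "Avian"


def model_b(sequence: str) -> str:
    return "Human"


# Hypothetical selected features (short nucleotide strings) and host labels.
samples = ["AAG", "AAC", "GTC", "GTA"]
labels = ["Human", "Human", "Avian", "Swine"]

acc_a, time_a = evaluate(model_a, samples, labels)
acc_b, time_b = evaluate(model_b, samples, labels)
print(f"model_a accuracy={acc_a:.2f}, model_b accuracy={acc_b:.2f}")
```

In a real comparison the samples would be held-out data that neither model saw during training, so that the accuracy figures are not inflated.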


Improving Influenza A Host Classification using Novel Feature Selection and Classification Techniques

Main Points of Research

3.1 Research proposal

There are several strains of influenza virus; the strain causing the most problems is the Influenza A virus. Influenza A has a high tendency to mutate, which is what causes pandemics. Recently the virus has also gained the ability to infect more than one host: in previous cases, Avian Flu was able to infect both Avian and Human hosts, and Swine Flu both Swine and Human hosts, making the spread of these pandemics twice as fast. The viruses can also kill the host in the case of Avian Flu, or threaten the lives of individuals within a certain age range in the case of Swine Flu. Influenza pandemics, especially infamous variants such as Swine Flu and Avian Flu, have afflicted many countries. If Influenza A can be classified according to host using its genetic information, future pandemics can be tracked and stopped more easily. However, the classification must be done swiftly.

3.2 Goals and Methodology

Feature selection and classification techniques already exist and were surveyed in the previous project. Unlike in the first project, the main goal here is to create an optimized feature selection, classification, or preprocessing system that achieves the same results in less time. Bioinformatics calculations are usually very computationally intensive, so building algorithms or organizing techniques to achieve the same results in less time is crucial to advancing research. The model can be further abstracted for use with mutating viruses other than influenza by grouping the different data mining techniques using scripts. (Usually a range of separate programs is used to process the genetic data, instead of a unified system.) My main future contribution would be to use graph-based decision trees to see whether they improve the speed of the feature extraction and classification algorithms [7]. If done right, the graph-based decision tree would be able to perform feature selection and classification in one step, with minimal data preprocessing.

3.3 Time Plan

Main phases of research work:

Phase I - Existing Paper and Literature Review

Phase II - Designing and Developing the System

Phase III - System Implementation, Deployment and Testing

Phase IV - Taking the Research to the Next Level

Phase I - Literature Review (Already completed)

1. Data mining in bioinformatics [6]

2. Data mining of genetic data in other viruses [5]

3. Data mining applied to the influenza virus

4. Novel techniques in the classification of genetic data [7]

Phase II - Designing and Developing (In progress)

1. Linking the different modules together via a script. More than one program is usually used for alignment, feature selection, and classification. Writing a script that groups all three together would be beneficial, especially in the abstraction phase.

2. Building an improved graph-based classification system.
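The linking script in item 1 could take the shape sketched below: each module is wrapped as a stage, the stages run in order, and the time spent in each is recorded so that slow steps are visible. This is a minimal Python sketch under stated assumptions. The three stage functions are hypothetical placeholders (padding instead of real alignment, a "keep varying positions" filter instead of information gain, and a fixed rule instead of a trained classifier); a real script would call the actual tools at those points.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], data: Any) -> Tuple[Any, Dict[str, float]]:
    """Feed each stage the previous stage's output and time every stage."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


def align(sequences: List[str]) -> List[str]:
    # Placeholder: pad to a common length; a real script would run an aligner.
    width = max(len(s) for s in sequences)
    return [s.ljust(width, "-") for s in sequences]


def select(sequences: List[str]) -> List[str]:
    # Placeholder: keep only the positions where the sequences differ.
    keep = [i for i in range(len(sequences[0]))
            if len({s[i] for s in sequences}) > 1]
    return ["".join(s[i] for i in keep) for s in sequences]


def classify(features: List[str]) -> List[str]:
    # Placeholder rule standing in for a trained classifier.
    return ["Human" if f.startswith("A") else "Avian" for f in features]


stages = [("align", align), ("select", select), ("classify", classify)]
result, timings = run_pipeline(stages, ["ATGC", "GTG"])
print(result)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

Because every stage has the same call shape, swapping in a different feature selection or classification method only means replacing one entry in the stage list, which is what makes the abstraction to other viruses practical.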

Phase III - Testing the System's Performance

1. Performance testing and evaluation of the old methods of preprocessing, feature extraction, and classification.

2. Performance testing and evaluation of the novel graph-based decision tree.

3. Comparing both sets of results to measure the improvement in performance.

Phase IV - Taking the Research to the Next Level

1. Publishing the theoretical part of the paper before March 15.

2. Transferring the research abroad for further development at Lancaster University, or joining an available project at Lancaster that matches my research expertise.

3.4 Expected Results

Phase II should be completed before March 15th, with a proposal submitted to and defended before the supervisors of both disciplines during February. Phase III should take place from March 15th to May 15th, as well as when the research is passed to Kyoto University or another suitable university. Phase IV should take place from September 2013 or Spring 2014, depending on availability.

The final result of the research will be a system that takes genetic input from online research databases and outputs the host of origin of the virus, at a faster rate than before.


References

[1] The Influenza Research Database. N.p. Web.

[2] "National Center for Biotechnology Information." Influenza Virus Resource. N.p. Web.

[3] Hall, Tom. "BioEdit: Biological Sequence Alignment Editor." Ibis Biosciences, Carlsbad, CA 92008, n.d. Web. http://www.mbio.ncsu.edu/bioedit/bioedit.html

[4] Katoh, Kazutaka. "MAFFT - a multiple sequence alignment program." CBRC, AIST, n.d. Web.

[5] Leung, KS, Eddie YT Ng, KH Lee, Henry LY Chan, Stephen KW Tsui, Tony SK Mok, Chi-Hang Tse, and Joseph JY Sung. "Data Mining on DNA Sequences of Hepatitis B Virus by Nonlinear Integrals." n. page. Web. 8 Feb. 2013.

[6] Saeys, Y., I. Inza, and P. Larrañaga. "A Review of Feature Selection Techniques in Bioinformatics." Bioinformatics 23.19 (2007): 2507-17. Print.

[7] Geamsakul, Warodom, Takashi Matsuda, Tetsuya Yoshida, Hiroshi Motoda, and Takashi Washio. "Constructing a Decision Tree for Graph Structured Data." n. page. Web. 8 Feb. 2013.

