research proposal lancaster university

Upload: charles-higgins

Post on 03-Apr-2018


  • 7/29/2019 Research Proposal Lancaster University.

    1/7

    Nermeen Shaltout
    Lancaster Scholarship
    Previous Work and Research Proposal
    February 8, 2012

    1 Previous Work

    The previous research in bioinformatics, described below, combines two
    disciplines, biology and computer science, and follows the usual machine
    learning workflow. Its goal was to classify influenza data according to host.
    The project surveyed existing data mining techniques for classifying the
    influenza virus by host (Swine, Avian, and Human) in three stages:

    1. Data preprocessing

    2. Feature selection

    3. Classification

    1.1 Data Preprocessing

    The data preprocessing stage uses well-known bioinformatics tools to prepare the
    data for the feature selection process. It involves:

    1. Downloading data from online gene banks such as www.ncbi.org [2] and
    www.fludb.org [1]. The data is genetic material in the form of nucleotides, the
    letters representing the building units of genes. The downloaded data is DNA
    data and consists of four letters (A, G, C, and T) arranged in patterns that
    vary with the virus and the host.

    2. Adjusting the downloaded genetic data using tools such as Bioedit [3] and
    Mafft [4] for data alignment. Alignment makes extracting the features easier
    by arranging similar data segments together.
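    As an illustration of what this stage hands to feature selection, the sketch
    below reads aligned sequences in FASTA format and turns them into per-position
    columns. It is a minimal Python sketch, not the Bioedit or Mafft workflow
    itself; the FASTA input format, the function names read_fasta and
    alignment_columns, and the three toy sequences are assumptions made for the
    example.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def alignment_columns(records):
    """Turn aligned sequences of equal length into a list of columns."""
    sequences = [seq for _, seq in records]
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("sequences are not aligned to a common length")
    return ["".join(column) for column in zip(*sequences)]


# Hypothetical toy alignment; '-' marks a gap inserted by the aligner.
example = StringIO(
    ">seq1|Human\nATG-CA\n"
    ">seq2|Avian\nATGGCA\n"
    ">seq3|Swine\nAT--CA\n"
)
records = read_fasta(example)
columns = alignment_columns(records)
```

    Each entry of columns holds the nucleotides that all sequences carry at one
    aligned position, which is the unit a per-position feature score can be
    computed on.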


    1.2 Data Extraction

    Genetic data sets are very large, so a subset of informative features must be
    selected. The selection was coded from scratch in Matlab.

    1. Features are extracted using information gain, computed across all three
    virus classes.

    2. The features with the highest information gain are the most suitable for
    classification and are selected to classify the genes.

    3. The information gain is computed in two tiers because the influenza data
    set is so large. Influenza viruses can also be divided by subtype, which is
    determined by proteins on the surface of the virus known as antigens; the two
    main antigens are H and N. In the first tier, the information gain is
    maximized against the four H antigen subtypes: H1, H2, H3, and H5. In the
    second tier, it is maximized against the influenza host types: Human, Avian,
    and Swine. The classification described in the next section is divided into
    the same two tiers.

    4. The information gain is calculated by subtracting the remainder from the
    entropy. The entropy is calculated using the equation given in [5].

    5. The remainder is calculated using the equation given in [5].

    The information gain is finally calculated using the equation given in [5].
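    The calculation in steps 4 and 5 can be sketched as follows. This is a minimal
    Python illustration using the standard definitions of entropy, remainder, and
    information gain, which are assumed here to correspond to the equations in
    [5]; it is not the original Matlab code. The function names, the toy host
    labels, and the two nucleotide columns are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(S) = -sum over classes c of p_c * log2(p_c)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def remainder(values, labels):
    """Weighted entropy of the labels after splitting on the feature values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(group) / total * entropy(group) for group in groups.values())


def information_gain(values, labels):
    """Gain = entropy of the labels minus the remainder of the feature."""
    return entropy(labels) - remainder(values, labels)


# Hypothetical toy data: one nucleotide per sequence at two alignment positions.
hosts = ["Human", "Human", "Avian", "Avian", "Swine", "Swine"]
position_a = ["A", "A", "G", "G", "C", "C"]  # separates the hosts perfectly
position_b = ["T", "T", "T", "T", "T", "T"]  # identical in every sequence

gain_a = information_gain(position_a, hosts)
gain_b = information_gain(position_b, hosts)
```

    Here position_a reaches the highest possible gain for three equally frequent
    hosts, log2(3) or about 1.585 bits, while position_b has zero gain, so ranking
    positions by gain keeps the first and discards the second. For the two-tier
    scheme, the same function would be called first with subtype labels and then
    with host labels.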

    1.3 Classification


    Classification can be achieved using more than one technique, and some
    techniques are more suitable for certain problems than others.

    1. Checking the classification accuracy using neural networks.

    2. Checking the classification accuracy using decision trees.

    3. Measuring the performance of both techniques.
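    A minimal sketch of how classification accuracy can be checked on held-out
    data is given below. It is written in Python and uses a depth-one decision
    tree (a stump that looks at a single sequence position) as a stand-in
    classifier; it is not the neural network or decision tree implementation used
    in the previous work. The function names, the chosen position, and the toy
    sequences are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_stump(sequences, labels, position):
    """Depth-one decision tree: map the nucleotide at `position` to its most common host."""
    votes = defaultdict(Counter)
    for seq, label in zip(sequences, labels):
        votes[seq[position]][label] += 1
    # Nucleotides never seen in training fall back to the overall majority host.
    fallback = Counter(labels).most_common(1)[0][0]
    table = {nuc: counts.most_common(1)[0][0] for nuc, counts in votes.items()}
    return lambda seq: table.get(seq[position], fallback)


def accuracy(predict, sequences, labels):
    """Fraction of sequences whose predicted host matches the true host."""
    correct = sum(predict(seq) == label for seq, label in zip(sequences, labels))
    return correct / len(labels)


# Hypothetical toy data, split by hand into a training set and a test set.
train_x = ["ATGA", "ATGA", "ATGG", "GTGA", "GTGC", "CTGA", "CTGT"]
train_y = ["Human", "Human", "Human", "Avian", "Avian", "Swine", "Swine"]
test_x = ["ATGC", "GTGA", "CTGA", "TTGA"]
test_y = ["Human", "Avian", "Swine", "Avian"]

stump = train_stump(train_x, train_y, position=0)
score = accuracy(stump, test_x, test_y)
```

    Any other classifier that can be wrapped as a function from a sequence to a
    host label can be passed to accuracy in the same way, which keeps the
    comparison between techniques on the same test set.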


    Improving Influenza A Host Classification using Novel Feature Selection and
    Classification Techniques

    Main Points of Research

    3.1 Research proposal

    There are several strains of the influenza virus; the one causing the most
    problems is the Influenza A virus. Influenza A has a high tendency to mutate,
    which can lead to pandemics. Recently the virus has gained the ability to
    infect more than one host: in previous cases, Avian Flu was able to infect
    both Avian and Human hosts, and Swine Flu both Swine and Human hosts, making
    the spread of these pandemics twice as fast. The viruses can also kill the
    host, in the case of Avian Flu, or threaten the lives of individuals within a
    certain age range, in the case of Swine Flu. Influenza pandemics, especially
    infamous variants such as Swine Flu and Avian Flu, have afflicted many
    countries. If Influenza A can be classified according to host using its
    genetic information, future pandemics can be tracked and stopped more easily.
    However, the classification must be done swiftly.

    3.2 Goals and Methodology

    Feature selection and classification techniques already exist and were
    surveyed in the previous project. Unlike the first project, the main goal
    here is to create an optimized feature selection, classification, or
    preprocessing system that achieves the same results in less time.
    Bioinformatics calculations are usually very computationally intensive, so
    building algorithms or organizing techniques that achieve the same results in
    less time is crucial to advancing research. The model can be further
    abstracted for use with mutating viruses other than influenza by grouping the
    different data mining techniques using scripts. (Usually a range of separate
    programs, rather than a unified system, is used to process the genetic data.)
    My main future contribution would be to use a graph-based decision tree to
    see whether it improves the speed of the feature extraction and
    classification algorithms [7]. If done right, the graph-based decision tree
    would be able to perform feature selection and classification in one step,
    with minimal data preprocessing.

    3.3 Time Plan

    Main phases of research work:

    Phase I - Existing Paper and Literature Review

    Phase II - Designing and Developing the System


    Phase III - System Implementation, Deployment and Testing

    Phase IV - Taking the Research to the Next Level

    Phase I - Literature Review (Already completed)

    1. Data mining in bioinformatics [6]

    2. Data mining of genetic data in other viruses [5]

    3. Data mining applied to the influenza virus

    4. Novel techniques in the classification of genetic data [7]

    Phase II - Designing and Developing (In progress)

    1. Linking the different modules together via a script. More than one program
    is usually used for alignment, feature selection, and classification. A script
    that groups all three together would be beneficial, especially in the
    abstraction phase.

    2. Building an improved graph-based classification system.
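    The linking script in item 1 could take roughly the following shape. This is
    a hedged Python sketch of the idea only: run_pipeline is a hypothetical
    helper, and the three stage functions are deliberately trivial placeholders
    standing in for the real alignment, feature selection, and classification
    programs, not implementations of them.

```python
def run_pipeline(data, stages):
    """Pass `data` through each (name, function) stage in order and log the stage names."""
    log = []
    for name, stage in stages:
        data = stage(data)
        log.append(name)
    return data, log


def align(sequences):
    """Placeholder aligner: pad with '-' to a common length (not a real alignment)."""
    width = max(len(seq) for seq in sequences)
    return [seq.ljust(width, "-") for seq in sequences]


def select_features(sequences):
    """Placeholder selector: keep only the columns in which the sequences differ."""
    keep = [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]
    return ["".join(seq[i] for i in keep) for seq in sequences]


def classify(features):
    """Placeholder classifier: give identical feature strings the same group number."""
    groups = {}
    for feature in features:
        groups.setdefault(feature, len(groups))
    return [groups[feature] for feature in features]


stages = [("align", align), ("select", select_features), ("classify", classify)]
result, log = run_pipeline(["ATGA", "ATGC", "ATG"], stages)
```

    Because every stage takes the output of the previous one, a stage can be
    swapped for a wrapper around another tool without changing the rest of the
    script, which is the property the abstraction phase relies on.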

    Phase III - Testing the System's Performance

    1. Performance testing and evaluation of the old methods of preprocessing,
    feature extraction, and classification.

    2. Performance testing and evaluation of the novel graph-based decision tree.

    3. Comparing both results to measure the improvement in performance.
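    One way to carry out the comparison in item 3 is to time both methods on the
    same input and confirm that they return the same result. The Python sketch
    below shows this under stated assumptions: time_call and speedup are
    hypothetical helpers, and the two nucleotide-counting functions are toy
    stand-ins for the old and new methods, not the methods themselves.

```python
import time
from collections import Counter


def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` runs, and the last result."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is; infinite if it took no measurable time."""
    if candidate_seconds == 0:
        return float("inf")
    return baseline_seconds / candidate_seconds


# Hypothetical stand-ins for an old and a new method that must agree on the output.
def count_nucleotides_old(sequence):
    return {n: sum(1 for ch in sequence if ch == n) for n in "ACGT"}


def count_nucleotides_new(sequence):
    counts = Counter(sequence)
    return {n: counts[n] for n in "ACGT"}


sequence = "ACGT" * 5000
old_time, old_result = time_call(count_nucleotides_old, sequence)
new_time, new_result = time_call(count_nucleotides_new, sequence)
ratio = speedup(old_time, new_time)
```

    Taking the best of several runs reduces the influence of background load on
    the measurement. The ratio is only meaningful when old_result and new_result
    are equal, so that check belongs in the comparison as well.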

    Phase IV - Taking the Research to the Next Level

    1. Publishing the theoretical part of the paper before March 15.

    2. Transferring the research to Lancaster University for further development,
    or joining an available project at Lancaster that matches my research
    expertise.

    3.4 Expected Results

    Phase II should be completed before March 15th, with a proposal submitted to
    and defended before the supervisors of both disciplines during February.
    Phase III should take place from March 15th to May 15th, when the research is
    passed to Kyoto University or another suitable university. Phase IV should
    start in September 2013 or Spring 2014, depending on availability.

    The final result of the research will be a system that takes genetic input
    from online research databases and outputs the host of origin of the virus
    at a faster rate than before.


    References

    [1] The Influenza Research Database. N.p. Web.

    [2] "National Center for Biotechnology Information." Influenza Virus
    Resource. N.p. Web.

    [3] Hall, Tom. "http://www.mbio.ncsu.edu/bioedit/bioedit.html." Bioedit:
    Biological Sequence Alignment Editor. Ibis Biosciences, Carlsbad, CA 92008,
    n.d. Web.

    [4] Katoh, Kazutaka. "Mafft-a multiple sequence alignment program." CBRC,
    AIST, n.d. Web.

    [5] Leung, KS, Eddie YT Ng, KH Lee, Henry LY Chan, Stephen KW Tsui, Tony SK
    Mok, Chi-Hang Tse, and Joseph JY Sung. "Data Mining on DNA Sequences of
    Hepatitis B Virus by Nonlinear Integrals." n. page. Web. 8 Feb. 2013.

    [6] Saeys, Y, I Inza, and P Larrañaga. "A Review of Feature Selection
    Techniques in Bioinformatics." Bioinformatics 23.19 (2007): 2507-17. Print.

    [7] Geamsakul, Warodom, Takashi Matsuda, Tetsuya Yoshida, Hiroshi Motoda, and
    Takashi Washio. "Constructing a Decision Tree for Graph Structured Data."
    n. page. Web. 8 Feb. 2013.
