
    SECONDARY STRUCTURE PREDICTION OF TUBERCULOSIS

    GENOMES USING MACHINE LEARNING ALGORITHMS

    Dissertation submitted

    to

    Sri Ramachandra University

    Porur, Chennai 600 116.

in partial fulfillment of the requirements for the Degree of Master of Science

    in

    Bioinformatics

    By

    RAJIV GANDHI.B

    Reg No: 3010815

    DEPARTMENT OF BIOINFORMATICS

    Sri Ramachandra College of Biomedical Sciences, Research & Technology

    Sri Ramachandra University

    Porur, Chennai - 600 116.

    JUNE 2010


    ACKNOWLEDGEMENT

I express my sincere gratitude to my guide Dr. P. Venkatesan, Deputy Director, Tuberculosis Research Centre, Indian Council of Medical Research, for his excellent guidance, knowledge, belief, innovative thoughts and motivation throughout. His continuous and constant support and encouragement have given strength and motivation to me and to my work. He has been the backbone of my work.

I am grateful to Dr. P. K. Ragunath, Head of the Department, Department of Bioinformatics, Sri Ramachandra University, for his kind encouragement at all times.

I express my profound thanks to the Dean of Faculties, the Vice Chancellor and the Management of Sri Ramachandra University for providing all the facilities to carry out this project work.

I would like to express my heartfelt thanks to all the staff members, Mrs. Arundathi, Mr. Dicky John Davis, Ms. C. R. Hemalatha, Ms. S. Kayalvizhi, Ms. M. Premavathi, Mr. Ramesh and Mr. Venkatesan, whose disciplined guidance helped me complete my project on time.

I express my special gratitude to my parents, my sister and my brother for their constant encouragement, support and blessings. Their inspiration has given me the immense courage to work efficiently.

    I express my heartfelt gratitude and obligation to my loving friends P.A.Abhinand,

    R.D.Nandhini, B.Chaitanya, Balaraman, Gajalakshmi, J.Andrews Paulraj, K.Deepa, Selva,

    T.Sridhar, and R.Sridhar for helping me throughout this project.

I would like to thank all those who have directly or indirectly helped me in my project.


DEDICATION

I would like to dedicate this project to my loving parents Mr. M. Balasubramanian & Mrs. B. Palaniammal Balasubramanian, and to my friends, especially P.A.Abhinand, R.D.Nandhini & B.Chaitanya, who have assisted and motivated me.


CONTENTS

Serial No.  Topic

1  Introduction

2  Machine Learning Approaches

3  Mycobacterium tuberculosis

4  Aim & Objectives

5  Materials and Methods

6  Results

7  Discussion

8  Summary

9  References


Introduction


    1. INTRODUCTION

    1.1 Bioinformatics

    The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of

    informatics processes in biotic systems. Its primary use since at least the late 1980s has been in

    genomics and genetics, particularly in those areas of genomics involving large-scale DNA

    sequencing.

    Bioinformatics represents a new, growing area of science that uses computational

    approaches to answer biological questions. Answering these questions requires that investigators

    take advantage of large, complex data sets (both public and private) in a rigorous fashion to

    reach valid, biological conclusions. The potential of such an approach is beginning to change the

    fundamental way in which basic science is done, helping to more efficiently guide experimental

    design in the laboratory. With the explosion of sequence and structural information available to

    researchers, the field of bioinformatics is playing an increasingly large role in the study of

fundamental biomedical problems. The challenge facing computational biologists will be to aid in gene discovery and in the design of molecular modeling, site-directed mutagenesis, and

    experiments of other types that can potentially reveal previously unknown relationships with

    respect to the structure and function of genes and proteins. This challenge becomes particularly

    daunting in light of the vast amount of data that has been produced by the Human Genome

    Project and other systematic sequencing efforts to date.

    The primary goal of bioinformatics is to increase the understanding of biological

processes. What sets it apart from other approaches, however, is its focus on developing and applying computationally intensive techniques (e.g., pattern recognition, data mining, machine

    learning algorithms, and visualization) to achieve this goal. Major research efforts in the field

    include sequence alignment, gene finding, genome assembly, drug design, drug discovery,

    protein structure alignment, protein structure prediction, prediction of gene expression and

    protein-protein interactions, genome-wide association studies and the modeling of evolution

    (Achuthsankar S Nair. 2007).


    1.2 Bioinformatics Algorithms

Bioinformatics is the field that solves biological problems through the application of information technology. On the technology side, algorithms must be designed to solve these problems; in short, the programs (software) used to solve biological problems are called bioinformatics algorithms. Some of the important bioinformatics algorithms are explained below:

    1.2.1 Greedy Algorithm

A greedy algorithm is defined as an algorithm which searches for a local optimum at each step and treats it as the global optimum (Cormen, et al. 1990).

    1.2.1.1 Types of Greedy Algorithm

There are three types of greedy algorithm:

Pure greedy algorithm

Orthogonal greedy algorithm

Smooth greedy algorithm

    Disadvantage of greedy algorithm

The disadvantage of the greedy algorithm is that, since it stops at a local optimum, the global optimum of the data may not be found (see the sketch below).
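A minimal Python sketch (not from the dissertation) of this disadvantage: with the illustrative coin denominations {1, 3, 4}, greedily taking the largest coin first misses the global optimum for amount 6.

def greedy_coin_count(amount, coins):
    """Repeatedly take the largest coin that fits (local optimization)."""
    count = 0
    for coin in sorted(coins, reverse=True):
        count += amount // coin
        amount %= coin
    return count

print(greedy_coin_count(6, [1, 3, 4]))  # 3 coins: 4 + 1 + 1
# The global optimum is 2 coins (3 + 3), which the greedy strategy never finds.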

    1.2.2 Dynamic programming

It is a general optimization method proposed by Richard Bellman of Princeton University in the 1950s. It is extensively used in sequence alignment and other computational problems, and was first applied to biological sequences by Needleman and Wunsch.

The original problem is broken into smaller subproblems which are then solved. Pieces of the larger problem have a sequential dependency: the fourth piece can be solved using the solution of the third piece, the third piece can be solved using the solution of the second piece, and so on. A dynamic programming algorithm first solves all the subproblems and stores each intermediate solution in a table along with a score, using an m*n matrix of scores, where m and n are the lengths of the sequences being aligned (Eddy, S. R. 2004).


Dynamic programming can be used for:

Local alignment (Smith-Waterman algorithm)

Global alignment (Needleman-Wunsch algorithm)

Figure 1: Processing steps in Dynamic Programming

Steps involved in dynamic programming:

1. Initialization

2. Matrix fill (scoring)

3. Traceback (alignment)

    1.2.2.1 Global Alignment: Needleman-Wunsch Algorithm

In global sequence alignment, an attempt is made to align the entirety of two different sequences, up to and including the ends of the sequences. Needleman and Wunsch (1970) were among the first to describe a dynamic programming algorithm for global sequence alignment. The Needleman-Wunsch algorithm performs a global alignment on two sequences (called A and B here). It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul Needleman and Christian Wunsch.

The Needleman-Wunsch algorithm is an example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison.


    1.2.2.2 Local alignment

The Smith-Waterman algorithm is the well-known algorithm for performing local alignment. Instead of looking at the total sequence, it compares segments of all possible lengths and optimizes the similarity measure.

1.2.2.3 Difference between Global and Local Alignment

The main difference from the Needleman-Wunsch algorithm is that negative scoring matrix cells are set to zero, which renders the local alignment visible. Backtracking starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment.

Table 1: Difference between Global and Local Alignment

Global Alignment (Needleman-Wunsch): a longer process, aligning the full length of both sequences. Given the substitution score s(x_i, y_j) and gap penalty d:

F(0, 0) = 0
F(i, j) = max{ F(i-1, j-1) + s(x_i, y_j), F(i-1, j) - d, F(i, j-1) - d }

Local Alignment (Smith-Waterman): a shorter process, aligning only the best-matching segments. Given s(x_i, y_j) and d:

F(0, 0) = 0
F(i, j) = max{ 0, F(i-1, j-1) + s(x_i, y_j), F(i-1, j) - d, F(i, j-1) - d }
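A minimal Python sketch (not the dissertation's implementation) of the three dynamic programming steps named above, i.e., initialization, matrix fill and traceback, using the global-alignment recurrence F(i, j); the scoring scheme (match +1, mismatch -1, gap -2) is an illustrative assumption.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    m, n = len(a), len(b)
    # 1. Initialization: first row/column hold cumulative gap penalties.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    # 2. Matrix fill: F(i, j) = max of diagonal, up, and left moves.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # 3. Traceback: walk from F(m, n) back to F(0, 0), rebuilding the alignment.
    out_a, out_b, i, j = [], [], m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b)), F[m][n]

print(needleman_wunsch("GATTACA", "GCATGCU"))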


    1.2.3 Divide and Conquer Algorithm

Divide and conquer is an algorithm which divides a large problem into two or more sub-problems of the same type, until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem (Thomas H. et al. 2000).
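A minimal Python sketch of the divide-and-conquer pattern just described (merge sort, an illustrative example not taken from the dissertation): split the problem in two, solve each half recursively, and combine the sub-solutions.

def merge_sort(items):
    if len(items) <= 1:                 # simple enough to solve directly
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])      # divide into sub-problems and solve
    right = merge_sort(items[mid:])
    merged, i, j = [], 0, 0             # combine the two sorted halves
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 8, 1, 9, 3]))   # [1, 2, 3, 5, 8, 9]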

    1.2.4 Randomized Algorithm

In a randomized algorithm, the object to be optimized is selected randomly, yet all the objects in the given dataset can be considered for optimization. This gives better optimization and saves time: as soon as one desired optimization is reached, the process stops (Cormen, et al. 1990).
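A hedged Python sketch of the idea described above: candidates are picked at random, the best one seen so far is kept, and the search stops as soon as a desired quality is reached. The function name, objective and threshold are illustrative assumptions.

import random

def randomized_search(objective, candidates, good_enough, max_tries=10000):
    best = None
    for _ in range(max_tries):
        x = random.choice(candidates)           # pick an object at random
        if best is None or objective(x) > objective(best):
            best = x                            # keep the best optimization so far
        if objective(best) >= good_enough:      # stop once the target is reached
            break
    return best

print(randomized_search(lambda x: -(x - 7) ** 2, list(range(100)), good_enough=0))  # finds 7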

    1.2.5 Branch and bound Algorithm

The branch and bound algorithm is usually used for finding the optimal solution, mainly in discrete and combinatorial optimization. The first step is branching or splitting: the given dataset N is split into smaller sets {N1, N2, N3} whose union covers N.

For each smaller set, the nodes present in it provide a lower and an upper bound value. Suppose the lower bound value of node A is greater than the upper bound value of node B; then node A is discarded. At the end, only the node with the best bound value is taken, and this provides the optimum solution (a sketch follows below).
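A minimal Python branch-and-bound sketch (an illustrative 0/1 knapsack, not from the dissertation; for this maximization problem the pruning test is mirrored): each node splits into "take item i" / "skip item i" sub-sets, and a branch whose optimistic upper bound cannot beat the best known value is discarded.

def branch_and_bound_knapsack(values, weights, capacity):
    best = [0]

    def upper_bound(i, value, room):
        # Optimistic bound: pretend the remaining items are divisible.
        for v, w in sorted(zip(values[i:], weights[i:]), key=lambda p: -p[0] / p[1]):
            if w <= room:
                value += v; room -= w
            else:
                return value + v * room / w
        return value

    def branch(i, value, room):
        best[0] = max(best[0], value)
        if i == len(values) or upper_bound(i, value, room) <= best[0]:
            return                          # prune: this branch cannot beat the best
        if weights[i] <= room:              # branch 1: take item i
            branch(i + 1, value + values[i], room - weights[i])
        branch(i + 1, value, room)          # branch 2: skip item i

    branch(0, 0, capacity)
    return best[0]

print(branch_and_bound_knapsack([60, 100, 120], [10, 20, 30], 50))  # 220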

    1.2.6 Brute Force Algorithm

Brute force algorithms bring the optimum results: if a number of data items are to be processed, all of them are taken into consideration, and each and every item is processed individually to get the desired output.
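A minimal Python sketch of the brute-force pattern (the task, finding the pair whose sum is closest to a target, is an illustrative assumption): every candidate is examined individually, so the optimum is guaranteed at the cost of exhaustive work.

from itertools import combinations

def best_pair_sum(data, target):
    best, best_gap = None, float("inf")
    for pair in combinations(data, 2):      # consider every possible pair
        gap = abs(sum(pair) - target)
        if gap < best_gap:
            best, best_gap = pair, gap
    return best

print(best_pair_sum([3, 9, 14, 20, 25], target=30))  # (9, 20)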


    1.2.7 Machine Learning Algorithm

The machine learning algorithm mainly deals with the following types of learning (Sergios Theodoridis et al. 2009):

    Supervised Learning

    Unsupervised Learning

    Reinforcement Learning

Supervised Learning

The steps involved in supervised learning are:

The sample input (with known outputs) is given.

Separate test and training data are supported.

When new data or a test input is given, the system takes the sample data as reference and processes it accordingly.

Unsupervised Learning

The steps involved in unsupervised learning are:

No labelled test sample is provided.

The input is given directly to the system, and processing is done by automatic learning (a combined sketch of both modes follows below).
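A hedged Python sketch (assuming scikit-learn is available; the toy data and labels are illustrative) contrasting the two modes: the supervised model is given labelled training samples, while the unsupervised model receives the inputs directly and groups them itself.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y_train = ["coil", "coil", "helix", "helix"]        # labels given: supervised

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(clf.predict([[0.85, 0.85]]))                  # uses the sample data as reference

km = KMeans(n_clusters=2, n_init=10).fit(X_train)   # no labels: unsupervised
print(km.labels_)                                   # groups found automatically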


    Reinforcement Learning

It is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning

    algorithms attempt to find a policy that maps states of the world to the actions the agent ought to

    take in those states. In economics and game theory, reinforcement learning is considered as a

    boundedly rational interpretation of how equilibrium may arise.

    Reinforcement learning is particularly well suited to problems which include a long-term

    versus short-term reward trade-off. It has been applied successfully to various problems,

    including robot control, elevator scheduling, telecommunications, etc.
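A hedged Python sketch of the idea above: an agent learns a policy mapping states to actions by updating long-term value estimates (here tabular Q-learning; the tiny chain world, reward and parameters are illustrative assumptions, not from the dissertation).

import random

n_states, actions = 5, [-1, +1]            # states 0..4; move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma = 0.5, 0.9                    # learning rate and discount factor

for _ in range(300):                       # episodes with a random behavior policy
    s = 0
    for _ in range(200):
        a = random.choice(actions)         # explore; Q-learning is off-policy
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0   # long-term reward only at the goal
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2
        if s == n_states - 1:
            break

# The learned policy maps each state to its highest-value action (move right).
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)})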

Thus, bioinformatics algorithms can be applied to the statistical analysis of any data.

    1.3 PROTEINS

Proteins are the most abundant organic molecules of the living system. They occur in every part of the cell and constitute about 50% of cellular dry weight. Proteins form the fundamental basis of the structure and function of life.

    1.3.1 Functions of proteins

    Proteins perform a great variety of specialized and essential functions in the living cells.

These functions may be broadly grouped as static (structural) and dynamic (U.Satyanarayana.

    2007).

Structural functions: Some proteins perform brick-and-mortar roles and are primarily responsible for the structure and strength of the body. These include collagen and elastin, found in bone matrix, the vascular system and other organs, and α-keratin, present in epidermal tissues.

    Dynamic functions: The dynamic functions of proteins are more diversified in nature. These

    include proteins acting as enzymes, hormones, blood clotting factors, immunoglobulins,


    membrane receptors, storage proteins, besides their function in genetic control, muscle

contraction, respiration, etc. Proteins performing dynamic functions are appropriately regarded as the workhorses of the cell.

    1.3.2 Levels of protein structure

Proteins are polymers of L-α-amino acids. The structure of proteins is rather complex and can be divided into four levels of organization (U.Satyanarayana. 2007).

    Primary structure

    The linear sequence of amino acids forms the backbone of proteins (polypeptides).

    Secondary structure

The spatial arrangement of amino acids formed by twisting of the polypeptide chain.

    Tertiary structure

    It represents the three dimensional structure of a functional protein.

    Quaternary structure

Some of the proteins are composed of two or more polypeptide chains referred to as subunits. The spatial arrangement of these subunits is known as quaternary structure.

    1.3.3 Primary structure of protein

    Each protein has a unique sequence of amino acids which is determined by the genes

    contained in DNA. The primary structure of a protein is largely responsible for its function

    (U.Satyanarayana. 2007).

    Peptide bond

The amino acids are held together in a protein by covalent peptide bonds or linkages. These bonds are rather strong and serve as the cementing material between the individual amino acids.

Formation of peptide bond: When the amino group of one amino acid combines with the carboxyl group of another amino acid, a peptide bond is formed. A dipeptide will have two amino acids and one peptide bond (not two). Peptides containing more than 10 amino acids (decapeptide) are referred to as polypeptides (U.Satyanarayana. 2007).


    Determination of primary structure

The primary structure comprises the identification of constituent amino acids with regard to their quality, quantity and sequence in a protein structure.

    A pure sample of a protein or a polypeptide is essential for the determination of primary

    structure which involves 3 stages (U.Satyanarayana. 2007).

    1. Determination of amino acid composition.

    2. Degradation of protein or polypeptide into smaller fragments.

    3. Determination of the amino acid sequence.

    1.3.4 Secondary structure of protein

The conformation of the polypeptide chain by twisting or folding is referred to as secondary

    structure. The amino acids are located close to each other in their sequence. The following

    secondary structures are mainly identified (U.Satyanarayana. 2007).

The Alpha Helix Is a Coiled Structure Stabilized by Intrachain Hydrogen Bonds

In evaluating potential structures, Pauling and Corey considered which conformations of peptides were sterically allowed and which most fully exploited the hydrogen-bonding capacity of the backbone NH and CO groups. The first of their proposed structures, the α helix, is a rodlike structure. A tightly coiled backbone forms the inner part of the rod and the side chains extend outward in a helical array. The α helix is stabilized by hydrogen bonds between the NH and CO groups of the main chain. In particular, the CO group of each amino acid forms a hydrogen bond with the NH group of the amino acid that is situated four residues ahead in the sequence. Thus, except for amino acids near the ends of an α helix, all the main-chain CO and NH groups are hydrogen bonded. Each residue is related to the next one by a rise of 1.5 Å along the helix axis and a rotation of 100 degrees, which gives 3.6 amino acid residues per turn of helix.


Thus, amino acids spaced three and four apart in the sequence are spatially quite close to one another in an α helix. In contrast, amino acids two apart in the sequence are situated on opposite sides of the helix and so are unlikely to make contact. The pitch of the α helix, which is equal to the product of the translation (1.5 Å) and the number of residues per turn (3.6), is 5.4 Å. The screw sense of a helix can be right-handed (clockwise) or left-handed (counterclockwise). The Ramachandran diagram reveals that both the right-handed and the left-handed helices are among the allowed conformations. However, right-handed helices are energetically more favorable because there is less steric clash between the side chains and the backbone. Essentially all α helices found in proteins are right-handed. In schematic diagrams of proteins, α helices are depicted as twisted ribbons or rods.

When Pauling and Corey predicted the structure of the α helix, it was actually seen in the X-ray reconstruction of the structure of myoglobin. The elucidation of the structure of the α helix is a landmark in biochemistry because it demonstrated that the conformation of a polypeptide chain can be predicted if the properties of its components are rigorously and precisely known.

The α-helical content of proteins ranges widely, from nearly none to almost 100%. For example, about 75% of the residues in ferritin, a protein that helps store iron, are in α helices. Single α helices are usually less than 45 Å long. However, two or more α helices can entwine to form a very stable structure, which can have a length of 1000 Å (100 nm, or 0.1 μm) or more. Such α-helical coiled coils are found in myosin and tropomyosin in muscle, in fibrin in blood clots, and in keratin in hair. The helical cables in these proteins serve a mechanical role in forming stiff bundles of fibers, as in porcupine quills. The cytoskeleton (internal scaffolding) of cells is rich in so-called intermediate filaments, which also are two-stranded α-helical coiled coils. Many proteins that span biological membranes also contain α helices (Jeremy M et al. 1975).


Figure 2: The structure of the α helix


Beta Sheets Are Stabilized by Hydrogen Bonding Between Polypeptide Strands

Pauling and Corey discovered another periodic structural motif, which they named the β pleated sheet (β because it was the second structure that they elucidated, the α helix having been the first). The β pleated sheet (or, more simply, the β sheet) differs markedly from the rodlike α helix. A polypeptide chain, called a β strand, in a β sheet is almost fully extended rather than being tightly coiled as in the α helix. A range of extended structures are sterically allowed.

The distance between adjacent amino acids along a β strand is approximately 3.5 Å, in contrast with a distance of 1.5 Å along an α helix. The side chains of adjacent amino acids point in opposite directions. A β sheet is formed by linking two or more β strands by hydrogen bonds. Adjacent chains in a β sheet can run in opposite directions (antiparallel β sheet) or in the same direction (parallel β sheet). In the antiparallel arrangement, the NH group and the CO group of each amino acid are respectively hydrogen bonded to the CO group and the NH group of a partner on the adjacent chain.

In the parallel arrangement, the hydrogen-bonding scheme is slightly more complicated. For each amino acid, the NH group is hydrogen bonded to the CO group of one amino acid on the adjacent strand, whereas the CO group is hydrogen bonded to the NH group on the amino acid two residues farther along the chain. Many strands, typically 4 or 5 but as many as 10 or more, can come together in β sheets. Such β sheets can be purely antiparallel, purely parallel, or mixed.

In schematic diagrams, β strands are usually depicted by broad arrows pointing in the direction of the carboxyl-terminal end to indicate the type of β sheet formed, parallel or antiparallel. More structurally diverse than α helices, β sheets can be relatively flat but most adopt a somewhat twisted shape.

The β sheet is an important structural element in many proteins. For example, fatty acid-binding proteins, important for lipid metabolism, are built almost entirely from β sheets (Jeremy M et al. 1975).


Figure 3: The structure of β strands. Two or more such strands may come together to form β sheets in parallel (a) or antiparallel (b) orientation.



Polypeptide Chains Can Change Direction by Making Reverse Turns and Loops

Most proteins have compact, globular shapes, requiring reversals in the direction of their polypeptide chains. Many of these reversals are accomplished by a common structural element called the reverse turn (also known as the β turn or hairpin bend). In many reverse turns, the CO group of residue i of a polypeptide is hydrogen bonded to the NH group of residue i + 3. This interaction stabilizes abrupt changes in direction of the polypeptide chain. In other cases, more elaborate structures are responsible for chain reversals. These structures are called loops or sometimes Ω loops (omega loops) to suggest their overall shape. Unlike α helices and β strands, loops do not have regular, periodic structures. Nonetheless, loop structures are often rigid and well defined. Turns and loops invariably lie on the surfaces of proteins and thus often participate in interactions between proteins and other molecules. The distribution of α helices, β strands, and turns along a protein chain is often referred to as its secondary structure (Jeremy M et al. 1975).

    1.3.5 Tertiary structure of protein

The three-dimensional arrangement of protein structure is referred to as tertiary structure. It is a compact structure with hydrophobic side chains held in the interior while the hydrophilic groups are on the surface of the protein molecule. This type of arrangement ensures the stability of the molecule (U.Satyanarayana 2007).

    Bonds of tertiary structure: Besides the hydrogen bonds, disulfide bonds (-S-S-), ionic

    interactions (electrostatic bonds) and hydrophobic interactions also contribute to the tertiary

    structure of proteins.

Domain: The term domain is used to represent the basic units of protein structure (tertiary) and function. A polypeptide with 200 amino acids normally consists of two or more domains.


    1.3.6 Quaternary structure of protein

    A great majority of the proteins are composed of single polypeptide chains. Some of the

    proteins, however, consist of two or more polypeptides which may be identical or unrelated.

    Such proteins are termed as oligomers and possess quaternary structure.

The individual polypeptide chains are known as monomers, protomers, or subunits. A dimer consists of two polypeptides while a tetramer has four (U.Satyanarayana. 2007).

Bonds in quaternary structure: The monomeric subunits are held together by non-covalent bonds, namely hydrogen bonds, hydrophobic interactions and ionic bonds.

    1.4 Protein Structure Prediction

Protein structure prediction is the prediction of the three-dimensional structure of a protein from its amino acid sequence; that is, the prediction of a protein's tertiary structure from its primary structure (structure prediction is fundamentally different from the inverse problem of protein design). Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry. It is of high importance in medicine (for example, in drug design) and biotechnology (for example, in the design of novel enzymes) (Mount DM 2004).

    Proteins are an important class of biological macromolecules present in all biological

    organisms, made up primarily from the elements carbon, hydrogen, nitrogen, oxygen, and

    sulphur. All proteins are polymers of amino acids. Classified by their physical size, proteins are

nanoparticles (definition: 1-100 nm). Each protein polymer, also known as a polypeptide, consists of a sequence formed from 20 different L-α-amino acids, also referred to as residues. For chains under 40 residues the term peptide is frequently used instead of protein.

    To be able to perform their biological function, proteins fold into one or more specific

    spatial conformations, driven by a number of non-covalent interactions such as hydrogen

bonding, ionic interactions, van der Waals forces, and hydrophobic packing. To understand the

    functions of proteins at a molecular level, it is often necessary to determine their three-

    dimensional structure. This is the topic of the scientific field of structural biology, which


    employs techniques such as X-ray crystallography, NMR spectroscopy, and dual polarization

    interferometry to determine the structure of proteins.

    1.5 Secondary structure prediction

    Secondary structure prediction is a set of techniques in bioinformatics that aim to

    predict the local secondary structures of proteins and RNA sequences based only on knowledge

    of their primary structure - amino acid or nucleotide sequence, respectively. For proteins, a

    prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta

strands (often noted as "extended" conformations), or turns. Globular protein domains are typically composed of the two basic secondary structure types, the α-helix and the β-strand, which are easily distinguishable because of their regular (periodic) character. Other types of secondary structures such as different turns, bends, bridges, and non-α helices are less frequent and more difficult to observe and classify for a non-expert. The non-α, non-β structures are often referred to as coil or loop, and the majority of secondary structure prediction methods are aimed at predicting only these three classes of local structure. Given the observed distribution of the three states in globular proteins (about 30% α-helix, 20% β-strand and 50% coil), random prediction should yield about 40% accuracy per residue (guessing each state in proportion to its background frequency succeeds with probability 0.3^2 + 0.2^2 + 0.5^2 = 0.38).

The accuracy of the secondary structure prediction methods devised earlier, such as Chou-Fasman (1974) or GOR (Garnier et al. 1978), is in the range of 50-55%. The best modern secondary structure prediction methods have reached a sustained level of 76% accuracy for the last 2 years, with α-helices predicted with ca. 10% higher accuracy than β-strands (Koh et al. 2003). Hence, it is quite surprising that the early mediocre methods are still used in good faith by many researchers; maybe even more surprising that they are sometimes recommended in contemporary reviews of bioinformatics software or built in as a default method in new versions of commercial software packages for protein sequence analysis and structure modeling. Modern secondary structure prediction methods typically perform analyses not for single target sequences, but rather utilize the evolutionary information derived from an MSA provided by the user or generated by an internal routine for database searches and alignment (Levin et al. 1993).

    The information from the MSA provides a better insight into the positional conservation of


    physico-chemical features such as hydrophobicity and hints at a position of loops in the regions

    of insertions and deletions (indels) corresponding to gaps in the alignment.

    It is also recommended to combine different methods for secondary structure prediction;

the ways of combining predictions may include the calculation of a simple consensus or more

    advanced approaches, including machine learning, such as voting, linear discrimination, neural

    networks and decision trees (King et al. 2000). JPRED (Cuff et al. 1998) is an example of a

    consensus meta-server that returns predictions from several secondary structure prediction

    methods (mostly third-party algorithms) and infers a consensus using a neural network, thereby

improving the average accuracy of prediction. In addition, JPRED predicts the relative solvent accessibility of each residue in the target sequence, which is very useful for identification of

    solvent-exposed and buried faces of amphipathic helices. In general, the most effective

    secondary structure prediction strategies follow these rules: (1) if an experimentally

    determined three-dimensional structure of a closely related protein is known, copy the secondary

structure assignment from the known structure rather than attempt to predict it de novo. (2) If no related structures are known, use multiple sequence information. If your target sequence shows similarity to only a few (or no) other proteins with sequence identity


    In our own hands, the application of these rules in a semi-automated manner (i.e. human

post-processing of predictions generated by various individual methods) led to a very high

    accuracy of 83 % per residue (better than any single server or any other human predictor)

    according to the recent evaluation within the CASP-5 experiment

    (http://predictioncenter.llnl.gov/casp5/) (I. Cymerman et al.).

    1.6 Features Which Are Used For the Protein Secondary Structure Prediction

A protein sequence contains characters from the 20-letter amino acid alphabet {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. An important issue in applying neural

    networks to protein sequence classification is how to encode protein sequences, i.e., how to

    represent the protein sequences as the input of the neural networks. Good input representations

    make it easier for the neural networks to recognize underlying regularities. Thus, good input

representations are crucial to the success of neural network learning (Hirsh H. et al.). Hence, feature extraction from the protein sequence is an essential step for the prediction of 2D or 3D structure based on the primary sequence.

The best high-level features should be relevant. By relevant we mean that there should be high mutual information between the features and the output of the neural networks, where the mutual information measures the average reduction in uncertainty about the output of the neural networks given the values of the features (Wang J. T. L. et al.).
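A minimal Python sketch of the encoding issue discussed above: a sequence over the 20-letter alphabet is turned into a fixed numeric representation (here simple one-hot vectors per residue, one common choice among many) before it is fed to a network. The function name is illustrative.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence):
    """Encode each residue as a 20-dimensional indicator vector."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    return [[1 if i == index[aa] else 0 for i in range(20)] for aa in sequence]

print(one_hot("ACD")[0])  # the vector for 'A': 1 in position 0, 0 elsewhere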

    Amino Acid Exchange Group

One particular feature which has been considered here is the 6-letter amino acid exchange group {e1, e2, e3, e4, e5, e6}, where e1 = {H, R, K}, e2 = {D, E, N, Q}, e3 = {C}, e4 = {S, T, P, A, G}, e5 = {M, I, L, V}, and e6 = {F, Y, W}. Exchange groups represent conservative replacements through evolution. These exchange groups are effectively equivalence classes of amino acids and are derived from PAM (C. H. Wu et al., Dayhoff M. O. et al.).


    Amino Acid Grouping

The 6-letter exchange group is taken to group the amino acids according to the amino acid properties:

    e1 = {H, R, K}, e2 = {D, E, N, Q}, e3 = {C}, e4 = {S, T, P, A, G}, e5 = {M, I, L, V}, e6 = {F,

    Y, W}

    Exchange groups represent conservative replacements through evolution. These exchange

    groups are effectively equivalence classes of amino acids and are derived from PAM.
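A minimal Python sketch mapping each residue to its exchange group, compressing the 20-letter alphabet into the six equivalence classes listed above; the dictionary names are illustrative.

EXCHANGE_GROUPS = {
    "e1": "HRK", "e2": "DENQ", "e3": "C",
    "e4": "STPAG", "e5": "MILV", "e6": "FYW",
}
RESIDUE_TO_GROUP = {aa: g for g, members in EXCHANGE_GROUPS.items() for aa in members}

print([RESIDUE_TO_GROUP[aa] for aa in "MKVLA"])  # ['e5', 'e1', 'e5', 'e5', 'e4']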

    Amino Acid Frequency

Another feature which has been considered here is the frequency of occurrence of particular amino acid residues in these secondary structures, as shown in the table below, which helps in determining whether a particular sequence in a protein forms an α helix, a β strand, or a turn. Residues such as alanine, glutamate, and leucine tend to be present in α helices, whereas valine and isoleucine tend to be present in β strands. Glycine, asparagine, and proline have a propensity for being in turns (Jeremy M. Berg et al.).

The results of studies of proteins and synthetic peptides have revealed some reasons for these preferences. The α helix can be regarded as the default conformation. Branching at the β-carbon atom, as in valine, threonine, and isoleucine, tends to destabilize α helices because of steric clashes. These residues are readily accommodated in β strands, in which their side chains project out of the plane containing the main chain. Serine, aspartate, and asparagine tend to disrupt α helices because their side chains contain hydrogen-bond donors or acceptors in close proximity to the main chain, where they compete for main-chain NH and CO groups. Proline tends to disrupt both α helices and β strands because it lacks an NH group and because its ring structure restricts its φ value to near -60 degrees. Glycine readily fits into all structures and for that reason does not favor helix formation in particular (Jeremy M. Berg et al.).


Frequency Values for Each Amino Acid in Secondary Structure

Table 2: Displaying amino acid frequency values (Creighton T. E. et al.). The amino acids are grouped according to their preference for α helices (top group), β sheets (second group), or turns (third group).

Propensity Values for Amino Acids in Secondary Structure

Amino acid   Amino_num   Prop_alpha   Prop_beta   Prop_other
A            1           142          83          66
C            2           70           119         119
D            3           101          54          146
E            4           151          37          74
F            5           113          138         60
G            6           57           75          156
H            7           100          87          95
I            8           108          160         47
K            9           114          74          101
L            10          121          130         59
M            11          145          105         60
N            12          67           89          156
P            13          57           55          152
Q            14          111          110         98
R            15          98           93          95
S            16          77           75          143
T            17          83           119         96
V            18          106          170         50
W            19          108          137         96
Y            20          69           147         114

Table 3: Displaying amino acid propensity values (Creighton T. E. et al.)
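A hedged Python sketch of how propensity values like those in Table 3 can be used in the spirit of Chou-Fasman-style scoring (this is illustrative, not the dissertation's method): average the helix propensity over a sliding window and flag helix-favoring regions. Only a few residues' values are reproduced.

PROP_ALPHA = {"A": 142, "E": 151, "L": 121, "G": 57, "P": 57, "V": 106}

def mean_alpha_propensity(seq, start, width=6):
    """Average Prop_alpha over a window; high values suggest helix."""
    window = seq[start:start + width]
    return sum(PROP_ALPHA[aa] for aa in window) / len(window)

seq = "AELAEL" + "GPGPGP"
print(mean_alpha_propensity(seq, 0))  # 138.0: helix-favoring window
print(mean_alpha_propensity(seq, 6))  # 57.0: helix-breaking window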

    Amino Acid Evolutionary Score

    Evolutionary score for each amino acid is considered as the next feature to be taken as

    neural network input. Knowledge of the amino acid composition of early proteomes can reveal

    which of the amino acids have increased and which have decreased in frequency with evolution.

    Such information could provide clues to the order in which amino acids were introduced into the

genetic code and thus into primitive proteins. The approach for inferring ancestral amino acid composition is based on the insight that the amino acid composition of conserved residues in present-day proteins, i.e., those residues which are unchanged between an ancestral sequence and any given descendant sequence, is determined by two factors: (1) the amino acid composition within ancestral sequences, and (2) the relative probability of conservation of each amino acid between an ancestral and an extant descendant sequence. Reversing this logic, given the amino acid composition of conserved residues and the relative probability of conservation of each amino


    acid, the amino acid composition within ancestral sequences can be inferred. (Dawn J. Brooks

    et al.)

Evolutionary Score Calculations

Evolutionary Score for Each Amino Acid

Amino acid   Amino_num   Evol_score
A            1           0.1109
C            2           0.003
D            3           0.0577
E            4           0.0785
F            5           0.0213
G            6           0.0781
H            7           0.0321
I            8           0.0758
K            9           0.062
L            10          0.0573
M            11          0.0266
N            12          0.0397
P            13          0.0435
Q            14          0.018
R            15          0.065
S            16          0.0523
T            17          0.0624
V            18          0.0976
W            19          0.0042
Y            20          0.0139

Table 4: Displaying the amino acids' evolutionary scores


Machine Learning Approaches


2. MACHINE LEARNING APPROACHES

    2.1 Machine Learning

    Machine learning can be best described as "the study of computer algorithms that

    improve automatically through experience" (Mitchell 1996). Machine learning is learning, like

    intelligence whose definition includes phrases such as to gain knowledge, or understanding of,

    or skill in, by study, instructions, or experience. It is a natural outgrowth of the intersection of

    computer science and statistics. It covers a broad range of learning tasks, such as how to design

    autonomous mobile robots that learns to navigate from its own experience. To be more precise,

    we say that a machine learns with respect to a particular task T, performance metric P, and type

    of experience E, if the system reliably improves its performance P at task T, following

    experience E [Tom M. Mitchell, 2006].

    2.2 Artificial Neural Networks

    Artificial neural networks are massively parallel adaptive networks of simple nonlinear

computing elements called neurons, which are intended to abstract and model some of the functionality of the human nervous system in an attempt to partially capture some of its

    computational strengths. It resembles the brain in two respects:

    Knowledge is acquired by the network through a learning process.

    Interneuron connection strengths known as synaptic weights are used to store the

    knowledge.

    A computational neural network is a set of non-linear data modeling tools consisting of input

    and output layers plus one or two hidden layers. The connections between neurons in each layer

    have associated weights, which are iteratively adjusted by the training algorithm to minimize

    error and provide accurate predictions. We set the conditions under which the network learns

    and can finely control the training stopping rules and network architecture, or let the procedure

    automatically choose the architecture for us.


There are two main types of neural network models: supervised neural networks, such as the multilayer perceptron (MLP) or radial basis function (RBF) network, and unsupervised neural networks, such as Kohonen feature maps.

    2.2.1 Neural Network Structure:

    2.2.1.1 BIOLOGICAL NEURON:

A neuron's dendritic tree is connected to a thousand neighboring neurons. When one of those neurons fires, a positive or negative charge is received by one of the dendrites. The strengths of all the received charges are added together through the processes of spatial and temporal summation. Spatial summation occurs when several weak signals are converted into a single large one, while temporal summation converts a rapid series of weak pulses from one source into one large signal. The aggregate input is then passed to the soma (cell body). The soma and the enclosed nucleus don't play a significant role in the processing of incoming and outgoing data. Their primary function is to perform the continuous maintenance required to keep the neuron functional.

The part of the soma that does concern itself with the signal is the axon hillock. If the aggregate input is greater than the axon hillock's threshold value, then the neuron fires, and an

    output signal is transmitted down the axon. The strength of the output is constant, regardless of

    whether the input was just above the threshold, or a hundred times as great. The output strength

    is unaffected by the many divisions in the axon; it reaches each terminal button with the same

    intensity it had at the axon hillock.

    Figure 4: showing the signal transmission in neurons


    2.2.1.2 ARTIFICIAL NEURAL NETWORK:

Although neural networks impose minimal demands on model structure and assumptions, it is useful to understand the general network architecture. The multilayer perceptron (MLP) or radial basis function (RBF) network is a function of predictors (also called inputs or independent variables) that minimizes the prediction error of target variables (also called outputs).

    The neural network Structure shown in the figure, in that:

    The input layer contains the predictors.

    The hidden layer contains unobservable nodes, or units. The value of each hidden unit is

    some function of the predictors; the exact form of the function depends in part upon the

    network type and in part upon the user-controllable specifications.

The output layer contains the responses. For example, a categorical response with two categories is recoded as two indicator variables. Each output unit is some function of the hidden units. Again, the exact form of the function depends in part on the network type and in part on user-controllable specifications.

    The MLP network allows a second hidden layer; in that case, each unit of the second

hidden layer is a function of the units in the first hidden layer, and each response is a function of the units in the second hidden layer.

    Figure 5: showing the structure of neural network


    Figure 6: Some artificial neural network connection structures

    Figure 7: Some common activation functions


    Multilayer Perceptron:

The MLP procedure fits a particular kind of neural network called a multilayer perceptron. The multilayer perceptron is a supervised method using a feed-forward architecture. It

    can have multiple hidden layers. One or more dependent variables may be specified, which may

    be scale, categorical, or a combination. If a dependent variable has scale measurement level, then

    the neural network predicts continuous values that approximate the true value of some

    continuous function of the input data. If a dependent variable is categorical, then the neural

    network is used to classify cases into the best category based on the input predictors.

    Training

Training the network means adjusting its weights so that it gives the correct output. Training data are what the neural network uses to learn how to predict the known output. It is rather easy to train a network that has no hidden layers (called a perceptron). For each object in the training set, the attribute values and class are known.

    Type of Training

    The training type determines how the network processes the records.

    Batch. Updates the synaptic weights only after passing all training data records; that is,

    batch training uses information from all records in the training dataset. Batch training is

    often preferred because it directly minimizes the total error; however, batch training may

    need to update the weights many times until one of the stopping rules is met and hence

    may need many data passes. It is most useful for smaller datasets.

    Online. Updates the synaptic weights after every single training data record; that is,

    online training uses information from one record at a time. Online training continuously

gets a record and updates the weights until one of the stopping rules is met. If all the records are used once and none of the stopping rules is met, then the process continues by

    recycling the data records. Online training is superior to batch for larger datasets with

    associated predictors; that is, if there are many records and many inputs, and their values

    are not independent of each other, then online training can more quickly obtain a

    reasonable answer than batch training.


    Mini-batch. Divides the training data records into groups of approximately equal size,

    then updates the synaptic weights after passing one group; that is, mini-batch training

    uses information from a group of records. Then the process recycles the data group if

    necessary. Mini-batch training offers a compromise between batch and online training,

    and it may be best for medium-size datasets. The procedure can automatically

    determine the number of training records per mini-batch, or you can specify an integer

    greater than 1 and less than or equal to the maximum number of cases to store in

    memory.
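A minimal Python sketch (illustrative pseudologic, not the SPSS procedure) of how the three training types above differ only in when the weight update is applied during one data pass.

def run_epoch(records, update, mode="batch", batch_size=2):
    if mode == "online":                       # update after every single record
        for r in records:
            update([r])
    elif mode == "mini-batch":                 # update after each group of records
        for i in range(0, len(records), batch_size):
            update(records[i:i + batch_size])
    else:                                      # batch: one update from all records
        update(records)

run_epoch([1, 2, 3, 4], update=lambda group: print("update from", group), mode="mini-batch")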

    Testing

Testing is used for validation. For each object in the testing set, the attribute values are known, but the class is unknown.

    Hidden Layers

    The hidden layer contains unobservable network nodes (units). Each hidden unit is a

    function of the weighted sum of the inputs. The function is the activation function, and the values

    of the weights are determined by the estimation algorithm.

    If the network contains a second hidden layer, each hidden unit in the second layer is a

    function of the weighted sum of the units in the first hidden layer. The same activation function

    is used in both layers.

    Number of Hidden Layers. A multilayer perceptron can have one or two hidden layers.

    Activation Function. The activation function "links" the weighted sums of units in a layer to the

    values of units in the succeeding layer.

Hyperbolic tangent. This function has the form: f(c) = tanh(c) = (e^c - e^-c)/(e^c + e^-c). It takes real-valued arguments and transforms them to the range (-1, 1). When automatic architecture selection is used, this is the activation function for all units in the hidden layers.

Sigmoid. This function has the form: f(c) = 1/(1 + e^-c). It takes real-valued arguments and transforms them to the range (0, 1).


    Number of Units. The number of units in each hidden layer can be specified explicitly or

    determined automatically by the estimation algorithm.

    Output Layer

    The output layer contains the target (dependent) variables.

    Activation Function. The activation function "links" the weighted sums of units in a layer to the

    values of units in the succeeding layer.

Identity. This function has the form: f(c) = c. It takes real-valued arguments and returns them unchanged. When automatic architecture selection is used, this is the activation function for units in the output layer if there are any scale-dependent variables.

Softmax. This function has the form: f(c_k) = exp(c_k) / Σ_j exp(c_j). It takes a vector of real-valued arguments and transforms it to a vector whose elements fall in the range (0, 1) and sum to 1. Softmax is available only if all dependent variables are categorical. When automatic architecture selection is used, this is the activation function for units in the output layer if all dependent variables are categorical.

Hyperbolic tangent. This function has the form: f(c) = tanh(c) = (e^c - e^-c)/(e^c + e^-c). It takes real-valued arguments and transforms them to the range (-1, 1).

Sigmoid. This function has the form: f(c) = 1/(1 + e^-c). It takes real-valued arguments and transforms them to the range (0, 1).
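A minimal NumPy sketch of the activation functions defined above; the max-shift in softmax is a standard numerical-stability detail, not part of the definition.

import numpy as np

def tanh(c):        # range (-1, 1)
    return (np.exp(c) - np.exp(-c)) / (np.exp(c) + np.exp(-c))

def sigmoid(c):     # range (0, 1)
    return 1.0 / (1.0 + np.exp(-c))

def softmax(c):     # vector output in (0, 1), summing to 1
    e = np.exp(c - np.max(c))   # shift for numerical stability
    return e / e.sum()

def identity(c):    # returned unchanged, for scale-dependent outputs
    return c

print(tanh(0.5), sigmoid(0.5), softmax(np.array([1.0, 2.0, 3.0])))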

Figure 8: showing the structure of a Multilayer Perceptron


    Radial Basis Function:

The RBF procedure fits a radial basis function neural network, which is a feed-forward, supervised learning network with an input layer, a hidden layer called the radial basis function layer, and an output layer. The hidden layer transforms the input vectors into radial basis functions. Like the MLP procedure, the RBF procedure performs prediction and classification. The RBF procedure trains the network in two stages:

    1. The procedure determines the radial basis functions using clustering methods. The

    center and width of each radial basis function are determined.

    2. The procedure estimates the synaptic weights given the radial basis functions. The

    sum-of-squares error function with identity activation function for the output layer is used for

    both prediction and classification. Ordinary Least Squares regression is used to minimize the

    sum-of-squares error.

    Due to this two-stage training approach, the RBF network is in general trained much

    faster than MLP.

Figure 9: showing the structure of a Radial Basis Function neural network
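A hedged Python sketch (assuming scikit-learn and NumPy are available) of the two-stage training described above: stage 1 finds the basis-function centers by clustering, stage 2 solves the output weights by ordinary least squares. The shared width taken as the mean distance is a simplifying assumption, and the data are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def train_rbf(X, y, n_centers=3):
    km = KMeans(n_clusters=n_centers, n_init=10).fit(X)            # stage 1: centers
    centers = km.cluster_centers_
    width = np.mean(np.linalg.norm(X[:, None] - centers, axis=2))  # shared width
    H = np.exp(-np.linalg.norm(X[:, None] - centers, axis=2) ** 2 / (2 * width ** 2))
    w, *_ = np.linalg.lstsq(H, y, rcond=None)                      # stage 2: least squares
    return centers, width, w

def predict_rbf(X, centers, width, w):
    H = np.exp(-np.linalg.norm(X[:, None] - centers, axis=2) ** 2 / (2 * width ** 2))
    return H @ w                                # identity activation at the output

X = np.random.rand(40, 2)
y = np.sin(X[:, 0] * 3) + X[:, 1]
params = train_rbf(X, y)
print(predict_rbf(X[:5], *params))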


    2.3 Support vector machines

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. In simple words, given a set of training examples, each marked

    as belonging to one of two categories, an SVM training algorithm builds a model that predicts

    whether a new example falls into one category or the other. Intuitively, an SVM model is a

    representation of the examples as points in space, mapped so that the examples of the separate

    categories are divided by a clear gap that is as wide as possible. New examples are then mapped

    into that same space and predicted to belong to a category based on which side of the gap they

    fall on.

More formally, a support vector machine constructs a hyperplane or set of hyperplanes

    in a high or infinite dimensional space, which can be used for classification, regression or other

    tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to

    the nearest training data points of any class (so-called functional margin), since in general the

    larger the margin the lower the generalization error of the classifier (David Meyer et al.2003).

    2.3.1 Motivation

    Classifying data is a common task in machine learning. Suppose some given data points

each belonging to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p - 1)-dimensional hyperplane. This is called a linear classifier.

There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two

    classes. So we choose the hyperplane so that the distance from it to the nearest data point on each

    side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane

    and the linear classifier it defines is known as a maximum margin classifier (Corinna Cortes et

    al. 1995).
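A hedged Python sketch (assuming scikit-learn is available; the data are illustrative) of the maximum-margin idea above: a linear SVM is fitted on two separable classes, and a new point is classified by the side of the learned hyperplane on which it falls.

from sklearn.svm import SVC

X = [[0, 0], [1, 1], [0, 1], [8, 8], [9, 9], [8, 9]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0).fit(X, y)     # maximum-margin hyperplane
print(clf.predict([[2, 2], [7, 8]]))            # sides of the gap: [0 1]
print(clf.support_vectors_)                     # the points that fix the margin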


Figure 10: showing the motivation. H3 (green) doesn't separate the two classes; H1 (blue) does, with a small margin, and H2 (red) with the maximum margin.

    2.3.2 Applications in Bioinformatics

    Support vector machines are a natural match for the features of many bioinformatics

    datasets and deliver state-of-the-art performance in several applications. For microarray gene

    expression data, the SVM is becoming the system of choice. SVMs are currently among the best

    performers for a number of classification tasks ranging from text to genomic data.
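    As an illustration of the ideas above, the following is a minimal sketch of training a maximum-margin classifier with the scikit-learn library; the toy data points, labels, and parameter values are invented for illustration and are not from this study.

    # Minimal maximum-margin classifier sketch (toy data, illustrative only).
    from sklearn import svm

    X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # training points (2-dimensional vectors)
    y = [0, 0, 1, 1]                        # class labels

    clf = svm.SVC(kernel='linear', C=1.0)   # linear kernel: a separating hyperplane
    clf.fit(X, y)

    # New points are assigned to a class by which side of the gap they fall on.
    print(clf.predict([[0.2, 0.4], [0.9, 0.7]]))
    print(clf.support_vectors_)             # the training points that define the margin

    The support vectors printed at the end are exactly the nearest points to the hyperplane; all other training points could be removed without changing the decision boundary.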

    2.4 Genetic Algorithms (GAs)

    The basic idea of a genetic algorithm (GA) is quite simple. A GA works not with a single

    solution iterated over time but with a whole population of solutions in each algorithm iteration.

    The population contains many (typically several hundred) individuals: bit strings representing

    solutions. Evolutionary algorithms deal with similar strings of more general form, e.g.,

    containing integer numbers or characters. The mechanism of a GA involves only elementary

    operations such as string copying, partial bit swapping, and bit value changing (Goldberg et al.

    1989).

    A GA starts with a population of strings and thereafter generates successive populations

    using three basic operations: reproduction, crossover, and mutation. Reproduction

    is the process by which individual strings are copied according to an objective function value

    (fitness).


    Copying strings according to their fitness means that strings with a higher value have a higher probability of contributing one or more offspring to the next generation. This

    is an artificial version of natural selection. Mutation is an occasional (small-probability)

    random alteration of a string position's value. It is needed because, although

    reproduction and crossover effectively search and recombine the existing representations,

    they occasionally become overzealous and lose some potentially useful genetic material; the

    mutation operator prevents such an irrecoverable loss. The recombination mechanism allows

    parental information to be mixed while passing it to descendants, and mutation introduces innovation into the population.

    Figure 11: The genetic algorithm, based on the Darwinian paradigm

    In spite of its simple principles, designing a GA for successful practical use is

    surprisingly complicated. A GA has many parameters that depend on the problem to be solved.

    The first is the population size. Larger populations usually decrease the number of iterations

    needed but dramatically increase the computing time for each iteration. The factors increasing

    demands on the population size are the complexity of the problem being solved and the length

    of the individuals. Every individual contains one or more chromosomes holding the value of a

    potential solution, and chromosomes consist of genes. The gene in our version of the GA is a structure

    representing one bit of the solution value. It is usually advantageous to use some redundancy in


    genes, so the physical length of our genes is greater than one bit. This type of redundancy

    was introduced by Ryan.

    To prevent degeneration and deadlock in a local extreme, a limited lifetime for each

    individual can be used. This limited lifetime is realized by the death operator, which acts as

    a continual partial restart of the GA. This operator enables the population size to be decreased as

    well as the speed of fitness improvement to be increased. It is necessary to store the best solution

    obtained separately, since the corresponding individual need not always be present in the

    population because of the limited lifetime.

    Many GAs are implemented on a population consisting of haploid individuals (each

    individual contains just one chromosome). However, in nature many living organisms have more than one chromosome, and there are mechanisms to determine dominant genes.

    Sexual recombination generates an endless variety of genotype combinations that

    increases the evolutionary potential of the population. Because it increases the variation among

    the offspring produced by an individual, it improves the chance that some of them will be

    successful in the varying and often unpredictable environments they will encounter. Using diploid or

    multiploid individuals can often decrease demands on the population size. However, since the use of a

    multiploid GA with sexual reproduction brings some complications, the advantage of

    multiploidy can often be substituted by the death operator and redundant gene coding.

    New individuals are created by an operation called crossover. In the simplest case, crossover

    means swapping two parts of two chromosomes split at a randomly selected point (so-called one-

    point crossover). In our GA we use uniform crossover at the bit level. The strategy for selecting

    individuals for crossover is very important; it strongly determines the behavior of the GA.

    Genetic algorithms commonly use heuristic and stochastic approaches. From the

    theoretical viewpoint, the convergence of heuristic algorithms is not guaranteed in most application cases, which is why defining the stopping rule of a GA poses a new

    problem. It can be shown that, with a proper version of the GA, a typical number of

    iterations can be determined.


    Details of the GA implementation are specified as follows:

    1. Generation of the initial population: At the beginning the whole population is generated

    randomly, and the members are sorted by fitness (in descending order).

    2. Mutation: The mutation is applied to each gene with the same probability. Mutation

    of a gene means the inversion of one randomly selected bit in the gene.

    3. Death: Classical GAs use two main operations, crossover and mutation (a further operation may be migration). In our GA a third operation, death, is also available. Every

    individual carries an additional piece of information, its age: a simple counter that is incremented in

    each GA iteration.

    If the age of any member reaches the preset lifetime limit LT, the member dies and is

    immediately replaced by a new randomly generated member. The age is neither mutated nor

    crossed over, and the age of new individuals (including individuals created by crossover) is set to

    zero.

    4. Sorting: realized by the fitness function values.

    5. Crossover: Uniform crossover is used for all genes (each bit of the offspring gene is

    selected separately from the corresponding bits of the genes of both parents).

    6. While the stopping rule is not satisfied, go to step 2.
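    To make steps 1 to 6 concrete, the following is a minimal Python sketch of this loop (the study itself used the MATLAB Genetic Algorithm Toolbox, described later); the bit-string length, population size, lifetime limit LT, mutation probability, iteration budget, and the counting-ones fitness function are all illustrative assumptions.

    # Minimal GA loop following steps 1-6 above (all parameters illustrative).
    import random

    BITS, POP, LT, P_MUT, GENS = 16, 50, 10, 0.02, 200

    def fitness(bits):                      # placeholder objective: count of ones
        return sum(bits)

    def new_individual():
        return {'genes': [random.randint(0, 1) for _ in range(BITS)], 'age': 0}

    # 1. Generate the initial population and sort by fitness (descending).
    pop = sorted((new_individual() for _ in range(POP)),
                 key=lambda i: fitness(i['genes']), reverse=True)
    best = dict(pop[0])                     # best solution is stored separately

    for gen in range(GENS):                 # 6. stopping rule: fixed iteration budget
        for ind in pop:
            # 2. Mutation: invert each bit with the same small probability.
            ind['genes'] = [b ^ 1 if random.random() < P_MUT else b
                            for b in ind['genes']]
            # 3. Death: members past the lifetime limit LT are replaced at once.
            ind['age'] += 1
            if ind['age'] >= LT:
                ind.update(new_individual())
        # 4. Sorting by the fitness function values.
        pop.sort(key=lambda i: fitness(i['genes']), reverse=True)
        # 5. Uniform crossover: each child bit is chosen from one of the parents.
        child = [random.choice(p) for p in zip(pop[0]['genes'], pop[1]['genes'])]
        pop[-1] = {'genes': child, 'age': 0}  # offspring age is set to zero
        if fitness(pop[0]['genes']) > fitness(best['genes']):
            best = {'genes': list(pop[0]['genes']), 'age': 0}

    print(fitness(best['genes']), best['genes'])

    Note that the best solution is kept outside the population, as required above, because the death operator may remove the corresponding individual at any iteration.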


    Figure 12: The genetic algorithm as a conceptual algorithm

    2.4.1 Problems suited to Genetic Algorithms

    Table 5: Problems suited to genetic algorithms

    Domain                       Application Types

    Control                      gas pipeline, pole balancing, missile evasion, pursuit

    Design                       semiconductor layout, aircraft design, keyboard configuration, communication networks

    Scheduling                   manufacturing, facility scheduling, resource allocation

    Robotics                     trajectory planning

    Machine Learning             designing neural networks, improving classification algorithms, classifier systems

    Signal Processing            filter design

    Game Playing                 poker, checkers, prisoner's dilemma

    Combinatorial Optimization   set covering, travelling salesman, routing, bin packing, graph colouring and partitioning


    2.6 Fuzzy Logic

    The concept of Fuzzy Logic (FL) was conceived by Lotfi Zadeh, a professor at the University of California at Berkeley. It was presented not as a control methodology but as a way

    of processing data by allowing partial set membership rather than crisp set membership or non-

    membership. Fuzzy Logic is a problem-solving control-system methodology that lends itself to

    implementation in systems ranging from simple, small, embedded micro-controllers to large,

    networked, multi-channel PC- or workstation-based data acquisition and control systems. It can

    be implemented in hardware, software, or a combination of both. It provides a simple way to

    arrive at a definite conclusion based upon vague, ambiguous, imprecise, noisy, or missing input

    information. FL's approach to control problems mimics how a person would make decisions,

    only much faster (Biacino et al. 2002).
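    As a small illustration of partial set membership, the following Python sketch defines a triangular membership function over an invented "medium expression level" fuzzy set; the breakpoints and the example values are purely illustrative.

    # Partial set membership via a triangular membership function (illustrative).
    def triangular(x, a, b, c):
        """Degree of membership: rises from a to b, falls from b to c."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    # "Medium expression" as a fuzzy set over normalised expression levels.
    for level in (0.2, 0.45, 0.5, 0.7):
        print(level, triangular(level, 0.3, 0.5, 0.7))
    # 0.45 belongs to the set to degree 0.75; crisp logic would force 0 or 1.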

    2.6.1 Fuzzy Logic in Bioinformatics

    Fuzzy logic and fuzzy technology are now frequently used in bioinformatics. The following

    are some examples of their use in bioinformatics:

    1. To analyze gene expression data.

    2. To unravel functional and ancestral relationships between proteins via fuzzy alignment

    methods, or using a generalized radial basis function neural network architecture that

    generates fuzzy classification rules.

    3. To analyze the relationships between genes and decipher a genetic network.

    4. To process complementary deoxyribonucleic acid (cDNA) microarray images; the

    procedure should be automated due to the large number of spots, and this is achieved using a

    fuzzy vector filtering framework.

    5. To classify amino acid sequences into different superfamilies.


    2.7 Hidden Markov Model (HMM)

    Hidden Markov models are statistical tools for modelling complex stochastic phenomena, and they have a long history. In 1913, the first studies [Markov, 1913] on

    Markov chains were used to analyze language and led A. A. Markov to the conception of

    Markov chain theory. From 1948 to 1951, Shannon established information theory using

    Markov chains [Shannon, 1948, Shannon, 1951]. In 1966, research by L. E. Baum and T. Petrie

    [Baum and Petrie, 1966, Baum and Eagon, 1967, Baum and Sell, 1968, Petrie, 1969, Baum et al.,

    1970, Baum, 1972] defined algorithms to train HMMs. In 1980, variable-duration HMMs were

    defined [Furguson, 1980]. In the 1980s, neural networks were incorporated into HMMs [Bourland and Wellekens, 1990], extending the applications of HMMs to speech analysis [Bahl et al.,

    1983, Rabiner et al., 1983, Juang and Rabiner, 1986, Rabiner and Levinson, 1985, Poritz and

    Richter, 1986, Rosemberg and Colla, 1987, Euler and Wolf, 1988, Dours, 1989]. Nowadays,

    HMMs are used for task scheduling [Soukhal et al., 2001b, Soukhal et al., 2001a] and

    information technologies [Zaragoza and Gallinari, 1998, Amini, 2001, Serradura et al., 2001].

    Many variants of the genuine HMM have been created to solve particular problems. Continuous-

    density HMMs [Rabiner, 1989], hierarchical HMMs [Fine et al., 1998], multidimensional

    HMMs with independent processes [Brouard, 1999] and symbol-substitution HMMs [Aupetit

    et al., 2002, Aupetit, 2005] are some examples of those new models.

    An HMM is a finite set of states, each of which is associated with a (generally

    multidimensional) probability distribution. Transitions among the states are governed by a set of

    probabilities called transition probabilities. The sum of all transition probabilities out of a given

    state has to equal 1.0, as does the sum of all observation probabilities for a particular state. In a

    regular Markov model, the state is directly visible to the observer, and therefore the state

    transition probabilities are the only parameters. In a hidden Markov model, only the outcome, not the state,

    is visible to an external observer, so the states are "hidden" to the outside; hence the name

    Hidden Markov Model. Each state has a probability distribution over the possible output tokens,

    so the sequence of tokens generated by an HMM gives some information about the

    sequence of states.


    Note that the adjective 'hidden' refers to the state sequence through which the model

    passes, not to the parameters of the model; even if the model parameters are known exactly, the

    model is still 'hidden'.

    Hidden Markov models are especially known for their application in temporal pattern

    recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical

    score following, partial discharges and bioinformatics.

    An HMM is characterized by:

    The number of states of the model, N.

    The number of observation symbols in the alphabet, M.

    A set of state transition probabilities.

    Figure 14: A hidden Markov model, where x = states, y = possible observations, a = state transition probabilities, and b = output probabilities.
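    To make the notation concrete, the following is a minimal Python sketch of the forward algorithm, which computes the likelihood of an observation sequence given the model; the two-state transition matrix A, output matrix B, and initial distribution below are invented for illustration.

    # Forward algorithm for an HMM with N=2 states and M=2 symbols (toy values).
    import numpy as np

    A  = np.array([[0.7, 0.3],    # a: state transition probabilities (rows sum to 1)
                   [0.4, 0.6]])
    B  = np.array([[0.9, 0.1],    # b: output probabilities per state
                   [0.2, 0.8]])
    pi = np.array([0.5, 0.5])     # initial state distribution

    def forward(obs):
        alpha = pi * B[:, obs[0]]            # initialise with the first symbol
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]    # sum over the hidden state paths
        return alpha.sum()                   # P(observation sequence | model)

    print(forward([0, 1, 1, 0]))  # likelihood of one token sequence

    Because the states are hidden, the algorithm sums over all state paths that could have produced the tokens, rather than following a single visible path.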

    2.7.1 Applications of hidden Markov models

    Cryptanalysis

    Speech recognition

    Machine translation

    Partial discharge

    Gene prediction

    Alignment of bio-sequences


    2.8 Markov Chain Monte Carlo

    A Markov chain is a mathematical model for stochastic systems whose states, discrete or

    continuous, are governed by a transition probability. The current state in a Markov chain depends only

    on the most recent previous state(s), e.g. on the single previous state for a first-order Markov chain (Christophe

    Andrieu et al. 2003).

    The Markovian property means locality in space or time, as in Markov random fields

    and Markov chains. Indeed, a discrete-time Markov chain can be viewed as a special case of a

    Markov random field (causal and one-dimensional).

    MCMC is a general-purpose technique for generating fair samples from a probability distribution π(X) in a high-dimensional space, using random numbers drawn from a probability distribution on a certain range. A Markov chain is designed to have π(X) as its stationary (or invariant) distribution.

    This is a non-trivial task when π(X) is very complicated in a very high-dimensional space!

    Usually it is not hard to construct a Markov chain with the desired properties. The more

    difficult problem is to determine how many steps are needed to converge to the stationary

    distribution within an acceptable error. A good chain will have rapid mixing (the stationary

    distribution is reached quickly starting from an arbitrary position), described further under

    Markov chain mixing time.


    Typical use of MCMC sampling can only approximate the target distribution, as there is

    always some residual effect of the starting position. More sophisticated MCMC-based algorithms

    such as coupling from the past can produce exact samples, at the cost of additional computation

    and an unbounded (though finite in expectation) running time.

    The most common application of these algorithms is numerically calculating multi-

    dimensional integrals. In these methods, an ensemble of "walkers" moves around randomly. At

    each point where the walker steps, the integrand value at that point is counted towards the

    integral. The walker then may make a number of tentative steps around the area, looking for a

    place with reasonably high contribution to the integral to move into next.

    Random walk methods are a kind of random simulation or Monte Carlo method.

    However, whereas the random samples of the integrand used in a conventional Monte Carlo

    integration are statistically independent, those used in MCMC are correlated. A Markov chain is

    constructed in such a way as to have the integrand as its equilibrium distribution. Surprisingly,

    this is often easy to do.

    Multi-dimensional integrals often arise in Bayesian statistics, computational physics,

    computational biology and computational linguistics, so Markov chain Monte Carlo methods are

    widely used in those fields.

    2.8.1 Some Properties of Markov Chains

    Irreducible chain: it is possible to get from any state to any other state eventually (with non-zero probability).

    Periodic state: state i is periodic with period k if all returns to i must occur in multiples of

    k.

    Ergodic chain: irreducible with an aperiodic state. This implies all states are aperiodic, so the

    chain is aperiodic.

    Finite state space: the chain can be represented as a matrix of transition probabilities; then ergodic

    = regular.


    Regular chain: some power of the transition matrix has only positive elements.

    Reversible chain: satisfies detailed balance (discussed later).

    2.8.2 MCMC algorithms

    The following algorithms fall under Markov chain Monte Carlo; a minimal sketch of the Metropolis algorithm is given after this list.

    Metropolis-Hastings algorithm

    Metropolis algorithm

    Mixtures and blocks

    Gibbs sampling, Sequential Monte Carlo & particle filters
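    The sketch below shows random-walk Metropolis sampling from an unnormalised target density π(x) in Python; the standard-normal target, proposal step size, and burn-in length are illustrative assumptions, not part of any specific method above.

    # Random-walk Metropolis sampler (illustrative target and parameters).
    import math, random

    def log_pi(x):                      # log of the (unnormalised) target density
        return -0.5 * x * x             # here: a standard normal

    def metropolis(n_samples, step=1.0, x0=0.0, burn_in=1000):
        x, samples = x0, []
        for i in range(n_samples + burn_in):
            proposal = x + random.gauss(0.0, step)     # symmetric random walk
            # Accept with probability min(1, pi(proposal) / pi(x)).
            if math.log(random.random()) < log_pi(proposal) - log_pi(x):
                x = proposal
            if i >= burn_in:            # discard the residual effect of the start
                samples.append(x)
        return samples

    s = metropolis(50_000)
    print(sum(s) / len(s))              # correlated samples; mean approx. 0

    Unlike conventional Monte Carlo, successive samples here are correlated, which is why the burn-in period and mixing speed matter in practice.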

    2.8.3 Applications

    Computer vision and object tracking [Blake & Isard]

    Speech and audio enhancement

    Web statistics estimation

    Regression and classification

    Global maximization of MLPs [Freitas et al.]

    Bayesian networks (details in the book by Gilks et al.)

    Genetics and molecular biology

    Robotics, etc.


    Mycobacterium tuberculosis


    3. MYCOBACTERIUM TUBERCULOSIS

    3.1 Mycobacterium tuberculosis

    Tuberculosis (TB) is a disease that is spread from person to person through the air. TB

    usually affects the lungs, but it can also affect other parts of the body, such as the brain, the

    kidneys, or the spine. TB germs are put into the air when a person with TB disease of the lungs

    or throat coughs or sneezes. When a person inhales air that contains TB germs, he or she may

    become infected. People with TB infection do not feel sick and do not have any symptoms.

    However, they may develop TB disease at some time in the future. The general symptoms of TB

    disease include feeling sick or weak, weight loss, fever, and night sweats. The symptoms of TB

    of the lungs include coughing, chest pain, and coughing up blood. Other symptoms depend on

    the part of the body that is affected.

    Mycobacterium tuberculosis (MTB) is a pathogenic bacterial species in the genus

    Mycobacterium and the causative agent of most cases of tuberculosis. First discovered in 1882

    by Robert Koch, M. tuberculosis has an unusual, waxy coating on the cell surface (primarily

    mycolic acid), which makes the cells impervious to Gram staining; acid-fast techniques are used

    instead. M. tuberculosis is highly aerobic and requires high levels of oxygen.

    Primarily a pathogen of the mammalian respiratory system, MTB infects the lungs, causing

    tuberculosis (Ryan KJ et al. 2004).

    The M. tuberculosis genome was sequenced in 1998 (Cole ST et al. 1998; Camus JC et al. 2002).


    3.1.1 Strain variation

    M. tuberculosis appears to be genetically diverse, and this genetic diversity results in

    significant phenotypic differences between clinical isolates. M. tuberculosis exhibits a

    biogeographic population structure, and different strain lineages are associated with different

    geographic regions. Phenotypic studies suggest that this strain variation may have implications

    for the development of new diagnostics and vaccines. Micro-evolutionary variation affects the

    relative fitness and transmission dynamics of antibiotic-resistant strains (Gagneux S 2009).

    3.1.2 Mycobacterium tuberculosis CDC1551

    CDC1551 is a clinical strain that was originally thought to be highly virulent (142); it has

    more recently been shown that CDC1551 induces levels of cytokines, including TNF-α, that are

    higher than those induced by other M. tuberculosis strains in mice. However, it is not more

    virulent than the other strains, as defined by bacterial load and mortality.

    The genome of M. tuberculosis strain CDC1551 was sequenced by the whole-genome

    random sequencing method as described in Fleischmann RD et al. (1995), Science 269:496-512.

    The M. tuberculosis genome is a circular chromosome of 4,403,765 base pairs with an average G

    + C content of 65.6%. There are a total of 4,033 predicted open reading frames (ORFs).

    Predicted biological roles were assigned to 1,734 ORFs (43%); 605 ORFs (15%) match

    hypothetical proteins from other species, and 1,694 ORFs (42%) have no database match and

    presumably represent novel genes.


    Figure 15: Role category pie chart of Mycobacterium tuberculosis CDC1551

    3.1.3 Mycobacterium tuberculosis H37Rv

    The M. tuberculosis H37Rv genome consists of 4.4 × 10^6 bp and contains approximately

    4,000 genes (53). Annotation of the M. tuberculosis genome shows that this bacterium

    has some unique features. Until recently, a genome sequence for the H37Ra strain has been lacking. Historically, M. tuberculosis H37Ra is

    the avirulent counterpart of the virulent strain H37Rv, and both strains are derived from their

    virulent parent strain H37.


    Figure 16: Role category pie chart of Mycobacterium tuberculosis H37Rv (lab strain)

    3.1.4 Comparison between H37Rv and H37Ra

    The H37Ra genome is highly similar to that of H37Rv with respect to gene content and

    order but is 8,445 bp larger as a result of 53 insertions and 21 deletions in H37Ra relative to

    H37Rv. Variations in repetitive sequences such as IS6110 and the PE/PPE/PE-PGRS family genes

    are responsible for most of the gross genetic changes. A total of 198 single nucleotide variations

    (SNVs) that differ between H37Ra and H37Rv were identified, yet 119 of them are

    identical between H37Ra and CDC1551 and 3 are due to H37Rv strain variation, leaving only 76

    H37Ra-specific SNVs, which affect only 32 genes.


    Figure 17: Proposed phylogenetic relationship of H37Ra, H37Rv, H37, and CDC1551


    Figure 18: Statistics for Mycobacterium tuberculosis CDC1551

    Figure 19: Statistics for Mycobacterium tuberculosis H37Rv (lab strain)


    Aim & Objectives


    4. AIM & OBJECTIVES

    To predict the secondary structure of proteins using a genetic algorithm.

    To compare the performance of evolutionary genetic algorithms with a neural network in

    the prediction of protein secondary structure.


    Materials & Methods


    5. MATERIALS AND METHODS

    5.1 Software and Databases

    Table 6: Name and utility of the software and databases used in the methodology

    Name of the software / database       Utility

    JCVI                                  To retrieve the important proteins of Mycobacterium tuberculosis CDC1551 and Mycobacterium tuberculosis H37Rv

    NCBI Protein database                 To get the amino acid sequences of the desired proteins

    SOPMA                                 To make the secondary structure prediction

    SPSS 16.0                             To fit the Multi-Layer Perceptron (MLP) feed-forward neural network

    MATLAB (Genetic Algorithm Toolbox)    To perform the secondary structure prediction of the amino acid sequences using the genetic algorithm toolbox


    5.2 FLOW CHART OF METHODOLOGY

    1. Select the complete genome of the particular Mycobacterium strain from the JCVI database.

    2. Get the desired protein names from the whole list of proteins.

    3. Retrieve the protein sequences in FASTA format from the NCBI database.

    4. Run the protein sequences through the SOPMA secondary structure prediction tool and obtain the secondary structure assignments.

    5. Use a Perl script to create a file consisting of the nine features (input nodes) of each amino acid.

    6. Take the inputs file and run it in the SPSS neural network (both Multilayer Perceptron and Radial Basis Function).

    7. Take the inputs file and run it in the MATLAB genetic algorithm toolbox.

    8. Compare the results obtained from the SPSS neural network and the MATLAB genetic algorithm.
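    Step 5 was performed with a Perl script that is not reproduced here; the following Python sketch illustrates one plausible interpretation, assuming the "nine features" are a sliding window of nine residues centred on each amino acid (a common encoding in secondary structure prediction) and a hypothetical output file name.

    # Hypothetical reconstruction of the feature-extraction step (assumptions:
    # nine-residue window per amino acid, '-' padding, invented file name).
    WINDOW = 9
    HALF = WINDOW // 2

    def window_features(sequence, pad='-'):
        """Yield one nine-residue window per amino acid in the sequence."""
        padded = pad * HALF + sequence + pad * HALF
        for i in range(len(sequence)):
            yield list(padded[i:i + WINDOW])   # centred on sequence[i]

    with open('inputs.txt', 'w') as out:       # illustrative output file
        for row in window_features('MTEQQWNF'):  # illustrative sequence
            out.write(','.join(row) + '\n')

    Each row would then serve as the nine input nodes for one residue when the file is loaded into the neural network or genetic algorithm.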


    Figure 21: Genome information of Mycobacterium tuberculosis H37Rv

    Figure 22: List of protein-coding genes of Mycobacterium tuberculosis CDC1551


    Figure 23: List of protein-coding genes of Mycobacterium tuberculosis H37Rv

    Table 7: List of proteins taken from Mycobacterium tuberculosis CDC1551 for secondary structure prediction

    S.No   Protein Name                        Length of Sequence

    1      conserved hypothetical protein      507

    2      conserved hypothetical protein      402

    3      conserved hypothetical protein      187

    4      conserved hypothetical protein      686

    5      conserved hypothetical protein      838

    6      conserved hypothetical protein      304

    7      conserved hypothetical protein      262

    8      conserved hypothetical protein      232

    9      conserved hypothetical protein      626

    10     conserved hypothetical protein      511

    11     conserved hypothetical protein      521

    12     conserved hypothetical protein      87


    13     ABC transporter, ATP-binding protein                330

    14     Acyl carrier protein                                87

    15     Chromosomal replication initiator protein DnaA      507

    16     DNA gyrase subunit A                                838

    17     DNA gyrase subunit B                                686

    18     DNA polymerase III, beta subunit                    402

    19     Glutamine amidotransferase, class I                 232

    20     Leucyl-tRNA synthetase                              969

    21     L-serine dehydratase                                461

    22     Penicillin-binding protein                          820

    23     Replicative DNA helicase, intein-containing         874

    24     Ribosomal protein L9                                152

    25     Ribosomal protein S18                               84

    26     Ribosomal protein S6                                96

    27     Serine/threonine protein kinase                     626

    28     Transcriptional regulator, ArsR family              114

    29     Transcriptional regulator, GntR family              244

    30     Transcriptional regulator, MarR family              208

    Table 8: List of proteins taken from Mycobacterium tuberculosis H37Rv for secondary structure prediction

    S.No   Protein Name                    Length of Sequence

    1      Hypothetical Protein Rv0257     124

    2      Hypothetical Protein Rv0264c    210

    3      Hypothetical Protein Rv0268c    169

    4      Hypothetical Protein Rv0272c    377

    5      Hypothetical Protein Rv0282     631

    6      Hypothetical protein Rv0064     979


    Retrieving amino acid sequence data from the NCBI protein database

    The amino acid sequences of the desired proteins were retrieved from the NCBI protein database

    by giving an appropriate query.

    5.3.2 NCBI PROTEIN DATABASE

    In addition to protein sequences, other protein-related information is available via Entrez:

    search the Structure database by choosing "Structure" from the Entrez pull-down menu,

    the Conserved Domains Database (CDD) by choosing "Domains", and 3D Domains by choosing

    the "3D Domains" option.

    Figure 24: Home page of the NCBI protein database


    Figure 25: Retrieving amino acid sequence data from the NCBI protein database

    5.4 SECONDARY STRUCTURE PREDICTION USING SOPMA

    Secondary structure prediction was carried out with the SOPMA tool, hosted on the Pôle Bio-Informatique Lyonnais server. The

    tool can be accessed at the URL http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_sopma.html

    SOPMA

    Recently a new method called the self-optimized prediction method (SOPM) has been

    described to improve the success rate in the prediction of the secondary structure of proteins.

    The improved method (SOPMA) builds on the gains obtained by predicting all the sequences of a set of aligned

    proteins belonging to the same family. SOPMA correctly

    predicts 69.5% of amino acids for a three-state description of the secondary structure (alpha-

    helix, beta-sheet and coil) in a whole database containing 126 chains of non-homologous (less

    than 25% identity) proteins. Joint prediction with SOPMA and a neural network method (PHD)

    correctly predicts 82.2% of residues for 74% of co-predicted amino acids.
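    The accuracy figures above are three-state (Q3) percentages, i.e. the fraction of residues whose predicted state (helix, sheet, or coil) matches the observed one. The following small Python sketch, with invented example strings, shows the computation.

    # Three-state (Q3) prediction accuracy (example strings are invented).
    def q3(predicted, observed):
        assert len(predicted) == len(observed)
        hits = sum(p == o for p, o in zip(predicted, observed))
        return 100.0 * hits / len(observed)

    print(q3('HHHHCCEEECC', 'HHHCCCEEHCC'))  # 9 of 11 residues correct: 81.8%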


    Figure 26: Home page of the SOPMA secondary structure prediction tool