8/12/2019 SECONDARY STRUCTURE PREDICTION OF TUBERCULOSIS GENOMES USING MACHINE LEARNING ALGORITHMS
SECONDARY STRUCTURE PREDICTION OF TUBERCULOSIS
GENOMES USING MACHINE LEARNING ALGORITHMS
Dissertation submitted
to
Sri Ramachandra University
Porur, Chennai 600 116.
in partial fulfillment of the requirements for the Degree of Master of Science
in
Bioinformatics
By
RAJIV GANDHI.B
Reg No: 3010815
DEPARTMENT OF BIOINFORMATICS
Sri Ramachandra College of Biomedical Sciences, Research & Technology
Sri Ramachandra University
Porur, Chennai - 600 116.
JUNE 2010
ACKNOWLEDGEMENT
I express my sincere gratitude to my guide Dr. P. Venkatesan, Deputy Director,
Tuberculosis Research Centre, Indian Council of Medical Research for his excellent guidance,
his knowledge, his belief, innovative thoughts and motivation throughout. His continuous and
constant support and encouragement has given strength and motivation to me, as well as my
work. He has been the backbone of my work.
I am grateful to Dr.PK.Ragunath, Head of the Department, Department of
Bioinformatics, Sri Ramachandra University for his kind encouragement all the times.
I express my profound thanks to the Dean of Faculties, Vice Chancellor and Management
of Sri Ramachandra University, for providing all the facilities to do this project work.
I would like to express my heartfelt thanks to all staff members Mrs.Arundathi, Mr.
Dicky John Davis, Ms.C.R.Hemalatha, Ms.S.Kayalvizhi, Ms.M.Premavathi, Mr.Ramesh and
Mr.Venkatesan whose disciplined guidance helped to complete my project on time.
I express my special gratitude to my parents, my sister and my brother for their constant encouragement, support and blessings. Their inspiration has given me the immense courage to
work efficiently.
I express my heartfelt gratitude and obligation to my loving friends P.A.Abhinand,
R.D.Nandhini, B.Chaitanya, Balaraman, Gajalakshmi, J.Andrews Paulraj, K.Deepa, Selva,
T.Sridhar, and R.Sridhar for helping me throughout this project.
I would like to thank all those who had directly or indirectly helped me in my project.
DEDICATION
I would like to dedicate this project to my loving parents Mr.M.Balasubramanian
& Mrs.B.Palaniammal Balasubramanian and my friends, especially
P.A.Abhinand, R.D.Nandhini & B.Chaitanya, who have assisted and motivated me.
CONTENTS
Serial No. Topic
1 Introduction
2 Machine Learning Approaches
3 Mycobacterium tuberculosis
4 Aim & Objectives
5 Materials and Methods
6 Result
7 Discussion
8 Summary
9 Reference
Introduction
1. INTRODUCTION
1.1 Bioinformatics
The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of
informatics processes in biotic systems. Its primary use since at least the late 1980s has been in
genomics and genetics, particularly in those areas of genomics involving large-scale DNA
sequencing.
Bioinformatics represents a new, growing area of science that uses computational
approaches to answer biological questions. Answering these questions requires that investigators
take advantage of large, complex data sets (both public and private) in a rigorous fashion to
reach valid, biological conclusions. The potential of such an approach is beginning to change the
fundamental way in which basic science is done, helping to more efficiently guide experimental
design in the laboratory. With the explosion of sequence and structural information available to
researchers, the field of bioinformatics is playing an increasingly large role in the study of
fundamental biomedical problems. The challenge facing computational biologists will be to aid in gene discovery and in the design of molecular modeling, site-directed mutagenesis, and
experiments of other types that can potentially reveal previously unknown relationships with
respect to the structure and function of genes and proteins. This challenge becomes particularly
daunting in light of the vast amount of data that has been produced by the Human Genome
Project and other systematic sequencing efforts to date.
The primary goal of bioinformatics is to increase the understanding of biological
processes. What sets it apart from other approaches, however, is its focus on developing and applying computationally intensive techniques (e.g., pattern recognition, data mining, machine
learning algorithms, and visualization) to achieve this goal. Major research efforts in the field
include sequence alignment, gene finding, genome assembly, drug design, drug discovery,
protein structure alignment, protein structure prediction, prediction of gene expression and
protein-protein interactions, genome-wide association studies and the modeling of evolution
(Achuthsankar S Nair. 2007).
1.2 Bioinformatics Algorithms
Bioinformatics is the field that solves biological problems through the application of
information technology. On the technology side, algorithms must be designed to solve those
problems; in short, the algorithms (programs or software) used to solve biological problems
are called bioinformatics algorithms. Some of the important bioinformatics algorithms are
explained below:
1.2.1 Greedy Algorithm
A greedy algorithm is defined as an algorithm which makes the locally optimal choice at each
step and treats it as part of a global optimum (Cormen, et al. 1990).
1.2.1.1 Types of Greedy Algorithm
There are three types of greedy algorithm:
Pure greedy algorithm
Orthogonal greedy algorithm
Smooth greedy algorithm
Disadvantage of greedy algorithm
The disadvantage of the greedy algorithm is that, since it stops at a local optimum, the global optimum of the data may not be found.
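This failure mode can be illustrated with a small sketch (a hypothetical coin-change example, invented for illustration and not drawn from this work): picking the largest coin at each step is locally optimal but can miss the global optimum.

```python
def greedy_coin_change(amount, denominations):
    """Greedily pick the largest usable coin at each step (the locally optimal choice)."""
    coins = []
    for coin in sorted(denominations, reverse=True):
        while amount >= coin:
            coins.append(coin)
            amount -= coin
    return coins

# With denominations {1, 3, 4} and amount 6, greedily taking 4 first
# yields [4, 1, 1] (three coins), whereas the global optimum is [3, 3].
print(greedy_coin_change(6, [1, 3, 4]))
```

The example shows exactly the disadvantage noted above: the locally optimal first choice commits the algorithm to a solution that is not globally optimal.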
1.2.2 Dynamic programming
It is a general optimization method, proposed by Richard Bellman of Princeton
University in the 1950s. It is extensively used in sequence alignment and other computational
problems, and was applied to biological sequences by Needleman and Wunsch.
The original problem is broken into smaller subproblems which are then solved. Pieces of the larger problem have a sequential dependency: the fourth piece can be solved using the solution of the third
piece, the third piece can be solved using the solution of the second piece, and so on. Dynamic
programming first solves all the subproblems and stores each intermediate solution in a table along
with a score, using an m*n matrix of scores where m and n are the lengths of the sequences
being aligned (Eddy, S. R. 2004).
Dynamic programming can be used for
Local alignment (Smith-Waterman Algorithm)
Global Alignment (Needleman-Wunsch Algorithm)
Figure 1: Processing steps in Dynamic Programming
Steps involved in dynamic programming
1. Initialization
2. Matrix Fill (scoring)
3. Traceback (alignment)
1.2.2.1 Global Alignment: Needleman-Wunsch Algorithm
In global sequence alignment, an attempt is made to align the entirety of two different sequences,
up to and including the ends of the sequences. Needleman and Wunsch (1970) were
among the first to describe a dynamic programming algorithm for global sequence alignment.
The Needleman-Wunsch algorithm performs a global alignment on two sequences
(called A and B here). It is commonly used in bioinformatics to align protein or nucleotide
sequences. The algorithm was published in 1970 by Saul Needleman and Christian Wunsch,
and it was the first application of dynamic programming to biological sequence comparison.
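The three steps of the method (initialization, matrix fill, traceback) can be sketched in Python. This is a minimal illustration, not the implementation used in this work; the match, mismatch and gap values are arbitrary assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming: initialize, fill, trace back."""
    m, n = len(a), len(b)
    # Initialization: first row and column hold cumulative gap penalties.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    # Matrix fill: each cell takes the best of the diagonal, up, and left moves.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Traceback: walk from the bottom-right corner to recover one optimal alignment.
    ra, rb, i, j = [], [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append('-'); i -= 1
        else:
            ra.append('-'); rb.append(b[j - 1]); j -= 1
    return ''.join(reversed(ra)), ''.join(reversed(rb)), F[m][n]

aligned_a, aligned_b, score = needleman_wunsch("ACGT", "AGT")
print(aligned_a, aligned_b, score)  # ACGT  A-GT  1
```

Because the first row and column are initialized with gap penalties and the traceback starts at the bottom-right cell, the alignment necessarily covers both sequences end to end, which is what makes it global.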
1.2.2.2 Local alignment
The Smith-Waterman algorithm is the best-known algorithm for performing local alignment. Instead of looking at the total sequence, it compares segments of all possible lengths and
optimizes the similarity measure.
1.2.2.3 Difference between Global and local Alignment
The main difference from the Needleman-Wunsch algorithm is that negative scoring matrix
cells are set to zero, which renders the local alignment visible. Backtracking starts at the highest-scoring
matrix cell and proceeds until a cell with score zero is encountered, yielding the highest-scoring
local alignment.
Table 1: Difference between Global and Local Alignment

Global Alignment (Needleman-Wunsch): a longer process. Given the substitution score s(x_i, y_j) and gap penalty d:

    F(0, 0) = 0
    F(i, j) = max { F(i-1, j-1) + s(x_i, y_j),
                    F(i-1, j) - d,
                    F(i, j-1) - d }

Local Alignment (Smith-Waterman): a shorter process. Given s(x_i, y_j) and gap penalty d, negative cells are floored at zero:

    F(0, 0) = 0
    F(i, j) = max { 0,
                    F(i-1, j-1) + s(x_i, y_j),
                    F(i-1, j) - d,
                    F(i, j-1) - d }
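The Smith-Waterman scoring recurrence can be sketched in a few lines; the only changes from the global version are the zero floor in each cell and taking the best cell anywhere in the matrix. This is an illustrative sketch with arbitrary scoring values, not the implementation used in this work.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score: Needleman-Wunsch recurrence with a zero floor,
    taking the maximum over all cells rather than the bottom-right corner."""
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]  # first row/column stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap, F[i][j - 1] + gap)
            best = max(best, F[i][j])
    return best

print(smith_waterman_score("TTACGT", "ACG"))  # the shared segment ACG scores 3
```

The zero floor means a poorly matching prefix cannot drag down a later well-matching segment, which is exactly what makes the alignment local.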
1.2.3 Divide and Conquer Algorithm
Divide and conquer is an algorithm which divides large problems into two or more sub-
problems of the same type, until these become simple enough to be solved directly. The solutions
to the sub-problems are then combined to give a solution to the original problem (Thomas H et
al.2000).
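Merge sort is a textbook instance of this pattern; the sketch below (illustrative, not part of the thesis methods) divides the list, solves the halves recursively until they are simple enough to solve directly, and then combines the results.

```python
def merge_sort(items):
    """Divide: split in half; conquer: sort each half; combine: merge the halves."""
    if len(items) <= 1:          # simple enough to be solved directly
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    merged, i, j = [], 0, 0      # combine the two sorted sub-solutions
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 4, 1, 3]))  # [1, 2, 3, 4, 5]
```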
1.2.4 Randomized Algorithm
In a randomized algorithm, the object to be optimized is selected randomly, but all
the objects in the given dataset are eventually optimized. This gives better optimization and
saves time: once the desired optimization is reached, the process stops (Cormen, et al.
1990).
1.2.5 Branch and bound Algorithm
The branch and bound algorithm is usually used for finding the best solution, mainly
in discrete and combinatorial optimization. The first step is splitting (branching): the given set N is split into smaller sets {N1, N2, N3} whose union covers N.
For each smaller set, the nodes present in it provide a lower and an upper bound value.
If the lower bound value of node A is greater than the upper bound value of node B, node A is
discarded (pruning). At the end, the node with the best lower bound value is taken; this provides the
optimum solution.
1.2.6 Brute Force Algorithm
Brute force algorithms bring the optimum result: if a number of data items are to be
processed, all of them are taken into consideration and each and every item is processed
individually to get the desired output.
1.2.7 Machine Learning Algorithm
Machine learning algorithms mainly deal with the following types of learning (Sergios Theodoridis et al. 2009),
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Supervised Learning
The steps involved in supervised learning are:
Sample (labelled) input is given
Test and training data are also supported
When new data or a test input is given, the algorithm takes the sample data as reference and processes
accordingly
Unsupervised Learning
The steps involved in unsupervised learning are:
No labelled test sample is provided
The input is given directly to the system, and the processing is done by automatic learning
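Supervised learning in the sense described above can be illustrated with a minimal nearest-neighbour classifier: labelled sample data are given first, and a new test input is classified by reference to those samples. This is a hypothetical sketch; the points and labels are invented for illustration.

```python
def nearest_neighbour_classify(training, test_point):
    """Supervised learning in miniature: return the label of the closest
    labelled training sample (1-nearest-neighbour)."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    # Pick the (point, label) pair whose point is nearest the test input.
    _, label = min(training, key=lambda pair: squared_distance(pair[0], test_point))
    return label

# Hypothetical labelled samples: two feature points with class labels.
training = [((0.0, 0.0), 'helix'), ((1.0, 1.0), 'strand')]
print(nearest_neighbour_classify(training, (0.2, 0.1)))  # helix
```

In unsupervised learning, by contrast, no such labelled pairs would be supplied; the algorithm would have to group the points by their mutual distances alone.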
Reinforcement Learning
It is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning
algorithms attempt to find a policy that maps states of the world to the actions the agent ought to
take in those states. In economics and game theory, reinforcement learning is considered as a
boundedly rational interpretation of how equilibrium may arise.
Reinforcement learning is particularly well suited to problems which include a long-term
versus short-term reward trade-off. It has been applied successfully to various problems,
including robot control, elevator scheduling, telecommunications, etc.
Thus, bioinformatics algorithms can be applied to the statistical analysis of biological
data.
1.3 PROTEINS
Proteins are the most abundant organic molecules of the living system. They occur in
every part of the cell and constitute about 50% of the cellular dry weight.
Proteins form the fundamental basis of the structure and function of life.
1.3.1 Functions of proteins
Proteins perform a great variety of specialized and essential functions in the living cells.
These functions may be broadly grouped as static (structural) and dynamic (U.Satyanarayana.
2007).
Structural functions: Some proteins perform brick-and-mortar roles and are primarily
responsible for the structure and strength of the body. These include collagen and elastin found in
bone matrix, the vascular system and other organs, and α-keratin present in epidermal tissues.
Dynamic functions: The dynamic functions of proteins are more diversified in nature. These
include proteins acting as enzymes, hormones, blood clotting factors, immunoglobulins,
membrane receptors, storage proteins, besides their function in genetic control, muscle
contraction, respiration, etc. Proteins performing dynamic functions are appropriately
regarded as the workhorses of the cell.
1.3.2 Levels of protein structure
Proteins are polymers of L-α-amino acids. The structure of proteins is rather complex
and can be divided into four levels of organization (U.Satyanarayana. 2007).
Primary structure
The linear sequence of amino acids forms the backbone of proteins (polypeptides).
Secondary structure
The spatial arrangement of amino acids by twisting of the polypeptide chain.
Tertiary structure
It represents the three dimensional structure of a functional protein.
Quaternary structure
Some of the proteins are composed of two or more polypeptide chains referred to as
subunits. The spatial arrangement of these subunits is known as quaternary structure.
1.3.3 Primary structure of protein
Each protein has a unique sequence of amino acids which is determined by the genes
contained in DNA. The primary structure of a protein is largely responsible for its function
(U.Satyanarayana. 2007).
Peptide bond
The amino acids are held together in a protein by covalent peptide bonds or linkages. These bonds are rather strong and hold the individual amino acids together.
Formation of peptide bond: When the amino group of one amino acid combines with the
carboxyl group of another amino acid, a peptide bond is formed. A dipeptide will have two amino
acids and one peptide bond (not two). Peptides containing more than 10 amino acids
(decapeptide) are referred to as polypeptides (U.Satyanarayana. 2007).
Determination of primary structure
The determination of primary structure comprises the identification of the constituent amino acids with regard to their quality, quantity and sequence in the protein structure.
A pure sample of a protein or a polypeptide is essential for the determination of primary
structure which involves 3 stages (U.Satyanarayana. 2007).
1. Determination of amino acid composition.
2. Degradation of protein or polypeptide into smaller fragments.
3. Determination of the amino acid sequence.
1.3.4 Secondary structure of protein
The conformation of the polypeptide chain by twisting or folding is referred to as secondary
structure. The amino acids involved are located close to each other in the sequence. The following
secondary structures are mainly identified (U.Satyanarayana. 2007).
The Alpha Helix Is a Coiled Structure Stabilized by Intrachain Hydrogen Bonds
In evaluating potential structures, Pauling and Corey considered which conformations of
peptides were sterically allowed and which most fully exploited the hydrogen-bonding capacity
of the backbone NH and CO groups. The first of their proposed structures, the α helix, is a rodlike
structure. A tightly coiled backbone forms the inner part of the rod and the side chains extend
outward in a helical array. The α helix is stabilized by hydrogen bonds between the NH and CO
groups of the main chain. In particular, the CO group of each amino acid forms a hydrogen bond with the NH group of the amino acid that is situated four residues ahead in the sequence. Thus,
except for amino acids near the ends of an α helix, all the main-chain CO and NH groups are
hydrogen bonded. Each residue is related to the next one by a rise of 1.5 Å along the helix axis
and a rotation of 100 degrees, which gives 3.6 amino acid residues per turn of helix.
Thus, amino acids spaced three and four apart in the sequence are spatially quite close to
one another in an α helix. In contrast, amino acids two apart in the sequence are situated on
opposite sides of the helix and so are unlikely to make contact. The pitch of the α helix, which is
equal to the product of the translation (1.5 Å) and the number of residues per turn (3.6), is 5.4 Å.
The screw sense of a helix can be right-handed (clockwise) or left-handed (counterclockwise).
The Ramachandran diagram reveals that both the right-handed and the left-handed
helices are among the allowed conformations. However, right-handed helices are energetically more
favorable because there is less steric clash between the side chains and the backbone. Essentially
all α helices found in proteins are right-handed. In schematic diagrams of proteins, α helices are depicted as twisted ribbons or rods.
The structure of the α helix that Pauling and Corey predicted was later actually seen in the
X-ray reconstruction of the structure of myoglobin. The elucidation of the structure of the α helix is
a landmark in biochemistry because it demonstrated that the conformation of a polypeptide chain
can be predicted if the properties of its components are rigorously and precisely known.
The α-helical content of proteins ranges widely, from nearly none to almost 100%. For
example, about 75% of the residues in ferritin, a protein that helps store iron, are in α helices.
Single α helices are usually less than 45 Å long. However, two or more α helices can entwine to
form a very stable structure, which can have a length of 1,000 Å (100 nm, or 0.1 μm) or more.
Such α-helical coiled coils are found in myosin and tropomyosin in muscle, in fibrin in blood
clots, and in keratin in hair. The helical cables in these proteins serve a mechanical role in
forming stiff bundles of fibers, as in porcupine quills. The cytoskeleton (internal scaffolding) of
cells is rich in so-called intermediate filaments, which also are two-stranded α-helical coiled coils. Many proteins that span biological membranes also contain α helices (Jeremy M et al.
1975).
Figure 2: The structure of the α helix
Beta Sheets Are Stabilized by Hydrogen Bonding Between Polypeptide
Strands
Pauling and Corey discovered another periodic structural motif, which they named the
β pleated sheet (because it was the second structure that they elucidated, the α helix having been
the first). The β pleated sheet (or, more simply, the β sheet) differs markedly from the rodlike
α helix. A polypeptide chain, called a β strand, in a β sheet is almost fully extended rather than
being tightly coiled as in the α helix. A range of extended structures are sterically allowed.
The distance between adjacent amino acids along a β strand is approximately 3.5 Å, in
contrast with a distance of 1.5 Å along an α helix. The side chains of adjacent amino acids point
in opposite directions. A β sheet is formed by linking two or more β strands by hydrogen bonds.
Adjacent chains in a β sheet can run in opposite directions (antiparallel β sheet) or in the same
direction (parallel β sheet). In the antiparallel arrangement, the NH group and the CO group of
each amino acid are respectively hydrogen bonded to the CO group and the NH group of a
partner on the adjacent chain.
In the parallel arrangement, the hydrogen-bonding scheme is slightly more complicated.
For each amino acid, the NH group is hydrogen bonded to the CO group of one amino acid on
the adjacent strand, whereas the CO group is hydrogen bonded to the NH group on the amino acid two residues farther along the chain. Many strands, typically 4 or 5 but as many as 10 or
more, can come together in β sheets. Such sheets can be purely antiparallel, purely parallel, or
mixed.
In schematic diagrams, β strands are usually depicted by broad arrows pointing in the
direction of the carboxyl-terminal end to indicate the type of β sheet formed, parallel or
antiparallel. More structurally diverse than α helices, β sheets can be relatively flat but most
adopt a somewhat twisted shape.
The β sheet is an important structural element in many proteins. For example, fatty acid-binding
proteins, important for lipid metabolism, are built almost entirely from β sheets (Jeremy
M et al. 1975).
Figure 3: The structure of β strands. Two or more such strands may come together to form β sheets
in parallel (a) or antiparallel (b) orientation.
Polypeptide Chains Can Change Direction by Making Reverse Turns and
Loops
Most proteins have compact, globular shapes, requiring reversals in the direction of their
polypeptide chains. Many of these reversals are accomplished by a common structural element
called the reverse turn (also known as the β turn or hairpin bend). In many reverse turns, the CO
group of residue i of a polypeptide is hydrogen bonded to the NH group of residue i + 3. This
interaction stabilizes abrupt changes in direction of the polypeptide chain. In other cases, more
elaborate structures are responsible for chain reversals. These structures are called loops or
sometimes Ω loops (omega loops) to suggest their overall shape. Unlike α helices and β
strands, loops do not have regular, periodic structures. Nonetheless, loop structures are often
rigid and well defined. Turns and loops invariably lie on the surfaces of proteins and thus often
participate in interactions between proteins and other molecules. The distribution of α helices, β
strands, and turns along a protein chain is often referred to as its secondary structure (Jeremy M
et al. 1975).
1.3.5 Tertiary structure of protein
The three-dimensional arrangement of protein structure is referred to as tertiary structure. It is a compact structure with hydrophobic side chains held in the interior while the hydrophilic groups
are on the surface of the protein molecule. This type of arrangement ensures the stability of the
molecule (U.Satyanarayana 2007).
Bonds of tertiary structure: Besides the hydrogen bonds, disulfide bonds (-S-S-), ionic
interactions (electrostatic bonds) and hydrophobic interactions also contribute to the tertiary
structure of proteins.
Domain: The term domain is used to represent the basic units of protein structure (tertiary) and function. A polypeptide with 200 amino acids normally consists of two or more domains.
1.3.6 Quaternary structure of protein
A great majority of the proteins are composed of single polypeptide chains. Some of the
proteins, however, consist of two or more polypeptides which may be identical or unrelated.
Such proteins are termed as oligomers and possess quaternary structure.
The individual polypeptide chains are known as monomers, protomers, or subunits. A
dimer consists of two polypeptides while a tetramer has four (U.Satyanarayana. 2007).
Bonds in quaternary structure: the monomeric subunits are held together by non-covalent
bonds namely hydrogen bonds, hydrophobic interactions and ionic bonds.
1.4 Protein Structure Prediction
Protein structure prediction is the prediction of the three-dimensional structure of a
protein from its amino acid sequence, that is, the prediction of a protein's tertiary structure from
its primary structure (structure prediction is fundamentally different from the inverse problem of
protein design). Protein structure prediction is one of the most important goals pursued by
bioinformatics and theoretical chemistry. It is of high importance in
medicine (for example, in drug design) and biotechnology (for example, in the design of novel enzymes) (Mount DM 2004).
Proteins are an important class of biological macromolecules present in all biological
organisms, made up primarily from the elements carbon, hydrogen, nitrogen, oxygen, and
sulphur. All proteins are polymers of amino acids. Classified by their physical size, proteins are
nanoparticles (definition: 1-100 nm). Each protein polymer, also known as a polypeptide,
consists of a sequence formed from the 20 different L-α-amino acids, also referred to as residues. For chains
under 40 residues the term peptide is frequently used instead of protein.
To be able to perform their biological function, proteins fold into one or more specific
spatial conformations, driven by a number of non-covalent interactions such as hydrogen
bonding, ionic interactions, van der Waals forces, and hydrophobic packing. To understand the
functions of proteins at a molecular level, it is often necessary to determine their three-
dimensional structure. This is the topic of the scientific field of structural biology, which
employs techniques such as X-ray crystallography, NMR spectroscopy, and dual polarization
interferometry to determine the structure of proteins.
1.5 Secondary structure prediction
Secondary structure prediction is a set of techniques in bioinformatics that aim to
predict the local secondary structures of proteins and RNA sequences based only on knowledge
of their primary structure, amino acid or nucleotide sequence, respectively. For proteins, a
prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta
strands (often noted as "extended" conformations), or turns. Globular protein domains are
typically composed of the two basic secondary structure types, the α-helix and the β-strand,
which are easily distinguishable because of their regular (periodic) character. Other types of
secondary structures such as different turns, bends, bridges, and non-α helices are less frequent
and more difficult to observe and classify for a non-expert. The non-α, non-β structures are often
referred to as coil or loop, and the majority of secondary structure prediction methods are aimed
at predicting only these three classes of local structure. Given the observed distribution of the
three states in globular proteins (about 30% α-helix, 20% β strand and 50% coil), random
prediction should yield about 40% accuracy per residue.
The accuracy of the secondary structure prediction methods devised earlier, such as
Chou-Fasman (1974) or GOR (Garnier et al. 1978), is in the range of 50-55%. The best modern
secondary structure prediction methods have reached a sustained level of 76% accuracy for the
last 2 years, with α-helices predicted with ca. 10% higher accuracy than β strands (Koh et al.
2003). Hence, it is quite surprising that the early mediocre methods are still used in good faith by
many researchers; maybe even more surprising that they are sometimes recommended in
contemporary reviews of bioinformatics software or built in as a default method in new versions
of commercial software packages for protein sequence analysis and structure modeling. Modern
secondary structure prediction methods typically perform analyses not for single target
sequences, but rather utilize the evolutionary information derived from a multiple sequence alignment (MSA) provided by the
user or generated by an internal routine for database searches and alignment (Levin et al. 1993).
The information from the MSA provides a better insight into the positional conservation of
The information from the MSA provides a better insight into the positional conservation of
physico-chemical features such as hydrophobicity, and hints at the positions of loops in the regions
of insertions and deletions (indels) corresponding to gaps in the alignment.
It is also recommended to combine different methods for secondary structure prediction;
the ways of combining predictions may include the calculation of a simple consensus or more
advanced approaches, including machine learning, such as voting, linear discrimination, neural
networks and decision trees (King et al. 2000). JPRED (Cuff et al. 1998) is an example of a
consensus meta-server that returns predictions from several secondary structure prediction
methods (mostly third-party algorithms) and infers a consensus using a neural network, thereby
improving the average accuracy of prediction. In addition, JPRED predicts the relative solvent accessibility of each residue in the target sequence, which is very useful for identification of
solvent-exposed and buried faces of amphipathic helices. In general, the most effective
secondary structure prediction strategies follow these rules: (1) if an experimentally
determined three-dimensional structure of a closely related protein is known, copy the secondary
structure assignment from the known structure rather than attempt to predict it de novo. (2) If no
related structures are known, use multiple sequence information. If your target sequence shows
similarity to only a few (or none) other proteins with sequence identity
In our own hands, the application of these rules in a semi-automated manner (i.e. human
post-processing of predictions generated by various individual methods) led to a very high
accuracy of 83% per residue (better than any single server or any other human predictor)
according to the recent evaluation within the CASP-5 experiment
(http://predictioncenter.llnl.gov/casp5/) (I. Cymerman et al.).
1.6 Features Used for Protein Secondary Structure Prediction
A protein sequence contains characters from the 20-letter amino acid alphabet {A, C,
D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. An important issue in applying neural
networks to protein sequence classification is how to encode protein sequences, i.e., how to
represent the protein sequences as the input of the neural networks. Good input representations
make it easier for the neural networks to recognize underlying regularities. Thus, good input
representations are crucial to the success of neural network learning (Hirsh H. et al.). Hence,
feature extraction from the protein sequence is an essential step for the prediction of 2D or 3D
structure based on the primary sequence.
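One common input representation, sketched below, is one-hot (orthogonal) encoding over the 20-letter alphabet: each residue becomes a 20-dimensional indicator vector. This is offered as a general illustration of sequence encoding for neural networks, not as the specific encoding used in this work.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter amino acid alphabet

def one_hot_encode(sequence):
    """Encode each residue as a 20-dimensional indicator vector: a 1 at the
    residue's position in the alphabet and 0 elsewhere."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    vectors = []
    for residue in sequence:
        v = [0] * len(ALPHABET)
        v[index[residue]] = 1
        vectors.append(v)
    return vectors

encoded = one_hot_encode("ACD")
# Each vector contains exactly one 1, at that residue's alphabet position.
```

A sequence of n residues thus becomes an n × 20 binary matrix, a representation from which a network can recognize regularities without any artificial ordering being imposed on the amino acids.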
The best high-level features should be relevant: there should
be high mutual information between the features and the output of the neural networks, where
the mutual information measures the average reduction in uncertainty about the output of the
neural networks given the values of the features (Wang J. T. L. et al.).
Amino Acid Exchange Group
One particular feature considered here is the six-letter amino acid exchange-group alphabet
{e1, e2, e3, e4, e5, e6}, where e1 = {H, R, K}, e2 = {D, E, N, Q}, e3 = {C}, e4 = {S, T, P, A, G},
e5 = {M, I, L, V}, and e6 = {F, Y, W}. Exchange groups represent conservative
replacements through evolution. They are effectively equivalence classes of
amino acids and are derived from PAM matrices. (C. H. Wu et al., Dayhoff M. O. et al.)
Amino Acid Grouping
The six-letter exchange group is used to group the amino acids according to their properties:
e1 = {H, R, K}, e2 = {D, E, N, Q}, e3 = {C}, e4 = {S, T, P, A, G}, e5 = {M, I, L, V}, e6 = {F,
Y, W}
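The grouping above can be turned into a compact six-dimensional input feature. The dictionary and function names in this sketch are illustrative, not the thesis's own code:

```python
# Sketch: mapping residues to the six PAM-derived exchange groups
# listed above, then encoding each residue as a 6-dimensional
# indicator vector.
EXCHANGE_GROUPS = {
    "e1": "HRK", "e2": "DENQ", "e3": "C",
    "e4": "STPAG", "e5": "MILV", "e6": "FYW",
}

def exchange_group(residue):
    """Return the exchange-group label (e1..e6) for a residue."""
    for label, members in EXCHANGE_GROUPS.items():
        if residue in members:
            return label
    raise ValueError("unknown residue: %r" % residue)

def encode_exchange(residue):
    """6-dimensional indicator vector over exchange groups."""
    labels = sorted(EXCHANGE_GROUPS)           # e1..e6
    vec = [0] * 6
    vec[labels.index(exchange_group(residue))] = 1
    return vec

groups = [exchange_group(r) for r in "MKHC"]
```

This reduces each residue from a 20-letter to a 6-letter alphabet while preserving evolutionarily conservative substitutions.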
Amino Acid Frequency
Another feature which has been considered over here is the frequency of occurrence of
particular amino acid residues in these secondary structures as shown in table which help in
determining whether a particular sequence in a protein forms a helix, a strand, or a turn.
Residues such as alanine, glutamate, and leucine tend to be present in helices, whereas valine
and isoleucine tend to be present in strands. Glycine, asparagine, and proline have a propensity
for being in turns. (Jeremy M. Berg et al.)
The results of studies of proteins and synthetic peptides have revealed some reasons for
these preferences. The α helix can be regarded as the default conformation. Branching at the β-
carbon atom, as in valine, threonine, and isoleucine, tends to destabilize α helices because of
steric clashes. These residues are readily accommodated in β strands, in which their side chains
project out of the plane containing the main chain. Serine, aspartate, and asparagine tend to
disrupt α helices because their side chains contain hydrogen-bond donors or acceptors in close
proximity to the main chain, where they compete for main-chain NH and CO groups. Proline
tends to disrupt both α helices and β strands because it lacks an NH group and because its ring
structure restricts its φ value to near -60 degrees. Glycine readily fits into all structures and for
that reason does not favor helix formation in particular. (Jeremy M. Berg et al.)
Frequency Values for Each Amino Acid in Secondary Structure
Table 2: Displaying Amino acid frequency values (Creighton T. E. et al.)
The amino acids are grouped according to their preference for helices (top group), sheets (second group), or turns (third group).
Propensity Values for Amino Acids in Secondary Structure
Amino acid Amino_num Prop_alpha prop_beta prop_other
A 1 142 83 66
C 2 70 119 119
D 3 101 54 146
E 4 151 37 74
F 5 113 138 60
G 6 57 75 156
H 7 100 87 95
I 8 108 160 47
K 9 114 74 101
L 10 121 130 59
M 11 145 105 60
N 12 67 89 156
P 13 57 55 152
Q 14 111 110 98
R 15 98 93 95
S 16 77 75 143
T 17 83 119 96
V 18 106 170 50
W 19 108 137 96
Y 20 69 147 114
Table 3: Displaying Amino acid propensity values (Creighton T. E. et al.)
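As a minimal sketch of how the Table 3 propensities can be used, the state (helix, strand, or other) with the highest propensity can be suggested per residue; only a few rows of the table are reproduced here, and the one-letter state codes are an assumption of this example:

```python
# Sketch: per-residue prediction from Table 3 propensity values.
# Only five residues from the table are reproduced.
PROPENSITY = {               # residue: (Prop_alpha, prop_beta, prop_other)
    "A": (142, 83, 66),
    "E": (151, 37, 74),
    "V": (106, 170, 50),
    "G": (57, 75, 156),
    "P": (57, 55, 152),
}
STATES = ("H", "E", "C")     # helix, strand, other/coil

def preferred_state(residue):
    """State whose propensity is highest for this residue."""
    props = PROPENSITY[residue]
    return STATES[props.index(max(props))]

prediction = "".join(preferred_state(r) for r in "AEVGP")  # -> "HHECC"
```

Real propensity-based methods average such values over a window rather than judging single residues, but the lookup step is the same.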
Amino Acid Evolutionary Score
The evolutionary score for each amino acid is the next feature taken as
neural network input. Knowledge of the amino acid composition of early proteomes can reveal
which of the amino acids have increased and which have decreased in frequency with evolution.
Such information could provide clues to the order in which amino acids were introduced into the
genetic code and thus into primitive proteins. The method for inferring ancestral amino acid
composition is based on the insight that the amino acid composition of conserved residues in
present-day proteins, i.e., those residues which are unchanged between an ancestral sequence
and any given descendant sequence, is determined by two factors: (1) the amino acid composition
within ancestral sequences, and (2) the relative probability of conservation of each amino acid
between an ancestral and an extant descendant sequence. Reversing this logic, given the amino
acid composition of conserved residues and the relative probability of conservation of each amino
acid, the amino acid composition within ancestral sequences can be inferred. (Dawn J. Brooks
et al.)
Evolutionary score calculations
Evolutionary Score for Each Amino Acid
Amino acid Amino_num Evol_score
A 1 0.1109
C 2 0.003
D 3 0.0577
E 4 0.0785
F 5 0.0213
G 6 0.0781
H 7 0.0321
I 8 0.0758
K 9 0.062
L 10 0.0573
M 11 0.0266
N 12 0.0397
P 13 0.0435
Q 14 0.018
R 15 0.065
S 16 0.0523
T 17 0.0624
V 18 0.0976
W 19 0.0042
Y 20 0.0139
Table 4: Displaying the Amino acids evolutionary score
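The Table 4 scores can be used directly as one more numeric input per residue; the values below are copied from the table, and the function name is an assumption of this sketch:

```python
# Sketch: Table 4's evolutionary scores as a per-residue lookup
# feature for the neural network input vector.
EVOL_SCORE = {
    "A": 0.1109, "C": 0.003,  "D": 0.0577, "E": 0.0785, "F": 0.0213,
    "G": 0.0781, "H": 0.0321, "I": 0.0758, "K": 0.062,  "L": 0.0573,
    "M": 0.0266, "N": 0.0397, "P": 0.0435, "Q": 0.018,  "R": 0.065,
    "S": 0.0523, "T": 0.0624, "V": 0.0976, "W": 0.0042, "Y": 0.0139,
}

def evolutionary_features(sequence):
    """One evolutionary-score input per residue of the sequence."""
    return [EVOL_SCORE[r] for r in sequence]

feats = evolutionary_features("ACD")
```

Note that the twenty scores sum to approximately 1, consistent with their interpretation as an inferred amino acid composition.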
Machine Learning Approaches
2. MACHINE LEARNING APPROACHES
2.1 Machine Learning
Machine learning can be best described as "the study of computer algorithms that
improve automatically through experience" (Mitchell 1996). Machine learning is learning in the
same sense as intelligence, whose definition includes phrases such as to gain knowledge,
understanding, or skill by study, instruction, or experience. It is a natural outgrowth of the
intersection of computer science and statistics. It covers a broad range of learning tasks, such
as how to design autonomous mobile robots that learn to navigate from their own experience.
To be more precise, we say that a machine learns with respect to a particular task T, performance
metric P, and type of experience E, if the system reliably improves its performance P at task T,
following experience E [Tom M. Mitchell, 2006].
2.2 Artificial Neural Networks
Artificial neural networks are massively parallel adaptive networks of simple nonlinear
computing elements called neurons, which are intended to abstract and model some of the
functionality of the human nervous system in an attempt to partially capture some of its
computational strengths. A neural network resembles the brain in two respects:
Knowledge is acquired by the network through a learning process.
Interneuron connection strengths known as synaptic weights are used to store the
knowledge.
A computational neural network is a set of non-linear data modeling tools consisting of input
and output layers plus one or two hidden layers. The connections between neurons in each layer
have associated weights, which are iteratively adjusted by the training algorithm to minimize
error and provide accurate predictions. We set the conditions under which the network learns
and can finely control the training stopping rules and network architecture, or let the procedure
choose the architecture automatically.
There are two main types of neural network models: supervised neural networks, which
include the multilayer perceptron (MLP) and radial basis function (RBF) networks, and
unsupervised neural networks, which include Kohonen feature maps.
2.2.1 Neural Network Structure:
2.2.1.1 BIOLOGICAL NEURON:
A neuron's dendritic tree is connected to a thousand neighboring neurons. When one of
those neurons fires, a positive or negative charge is received by one of the dendrites. The strengths
of all the received charges are added together through the processes of spatial and temporal
summation. Spatial summation occurs when several weak signals are converted into a single
large one, while temporal summation converts a rapid series of weak pulses from one source into
one large signal. The aggregate input is then passed to the soma (cell body). The soma and the
enclosed nucleus don't play a significant role in the processing of incoming and outgoing data;
their primary function is to perform the continuous maintenance required to keep the neuron
functional.
The part of the soma that does concern itself with the signal is the axon hillock. If the
aggregate input is greater than the axon hillock's threshold value, then the neuron fires, and an
output signal is transmitted down the axon. The strength of the output is constant, regardless of
whether the input was just above the threshold or a hundred times as great. The output strength
is unaffected by the many divisions in the axon; it reaches each terminal button with the same
intensity it had at the axon hillock.
Figure 4: showing the signal transmission in neurons
2.2.1.2 ARTIFICIAL NEURAL NETWORK:
Although neural networks impose minimal demands on model structure and assumptions,
it is useful to understand the general network architecture. The multilayer perceptron (MLP) or
radial basis function (RBF) network is a function of predictors (also called inputs or independent
variables) that minimizes the prediction error of target variables (also called outputs).
In the neural network structure shown in the figure:
The input layer contains the predictors.
The hidden layer contains unobservable nodes, or units. The value of each hidden unit is
some function of the predictors; the exact form of the function depends in part upon the
network type and in part upon user-controllable specifications.
The output layer contains the responses. A categorical target with two categories is
recoded as two indicator variables. Each output unit is some function of the hidden
units. Again, the exact form of the function depends in part on the network type and in
part on user-controllable specifications.
The MLP network allows a second hidden layer; in that case, each unit of the second
hidden layer is a function of the units in the first hidden layer, and each response is a function of
the units in the second hidden layer.
Figure 5: showing the structure of neural network
Figure 6: Some artificial neural network connection structures
Figure 7: Some common activation functions
Multilayer Perceptron:
The MLP procedure fits a particular kind of neural network called a multilayer
perceptron. The multilayer perceptron is a supervised method using a feed-forward architecture.
It can have multiple hidden layers. One or more dependent variables may be specified, which may
be scale, categorical, or a combination. If a dependent variable has a scale measurement level, then
the neural network predicts continuous values that approximate the true value of some
continuous function of the input data. If a dependent variable is categorical, then the neural
network is used to classify cases into the best category based on the input predictors.
Training
Training the network means adjusting its weights so that it gives correct output. Training
data are what the neural network uses to learn how to predict the known output. It is rather easy
to train a network that has no hidden layers (called a perceptron). For each object in the training
set, the attribute values and class are known.
Type of Training
The training type determines how the network processes the records.
Batch. Updates the synaptic weights only after passing all training data records; that is,
batch training uses information from all records in the training dataset. Batch training is
often preferred because it directly minimizes the total error; however, batch training may
need to update the weights many times until one of the stopping rules is met and hence
may need many data passes. It is most useful for smaller datasets.
Online. Updates the synaptic weights after every single training data record; that is,
online training uses information from one record at a time. Online training continuously
gets a record and updates the weights until one of the stopping rules is met. If all the
records are used once and none of the stopping rules is met, then the process continues by
recycling the data records. Online training is superior to batch for larger datasets with
associated predictors; that is, if there are many records and many inputs, and their values
are not independent of each other, then online training can more quickly obtain a
reasonable answer than batch training.
Mini-batch. Divides the training data records into groups of approximately equal size,
then updates the synaptic weights after passing one group; that is, mini-batch training
uses information from a group of records. Then the process recycles the data group if
necessary. Mini-batch training offers a compromise between batch and online training,
and it may be best for medium-size datasets. The procedure can automatically
determine the number of training records per mini-batch, or you can specify an integer
greater than 1 and less than or equal to the maximum number of cases to store in
memory.
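The three update schedules above differ only in how many records are seen before each weight update. The following sketch shows this for a single linear unit trained by gradient descent on squared error; the data, learning rate, and epoch count are toy choices made for illustration:

```python
# Sketch: batch, online, and mini-batch updates for one linear unit
# y_hat = w * x, trained on squared error.
def gradient(w, batch):
    """d/dw of 0.5*(w*x - y)^2 averaged over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train(records, schedule_size, epochs=50, lr=0.1, w=0.0):
    """schedule_size = len(records) -> batch; 1 -> online;
    anything in between -> mini-batch."""
    for _ in range(epochs):
        for start in range(0, len(records), schedule_size):
            group = records[start:start + schedule_size]
            w -= lr * gradient(w, group)   # update after each group
    return w

records = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # generated by y = 2x
w_batch = train(records, schedule_size=len(records))
w_online = train(records, schedule_size=1)
w_mini = train(records, schedule_size=2)
```

All three schedules converge to the same weight here; they differ in how often the weights move and therefore in their behaviour on large datasets, as described above.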
Testing
Testing is used for validation. For each object in the testing set, the attribute values are
known, but the class is unknown.
Hidden Layers
The hidden layer contains unobservable network nodes (units). Each hidden unit is a
function of the weighted sum of the inputs. The function is the activation function, and the values
of the weights are determined by the estimation algorithm.
If the network contains a second hidden layer, each hidden unit in the second layer is a
function of the weighted sum of the units in the first hidden layer. The same activation function
is used in both layers.
Number of Hidden Layers. A multilayer perceptron can have one or two hidden layers.
Activation Function. The activation function "links" the weighted sums of units in a layer to the
values of units in the succeeding layer.
Hyperbolic tangent. This function has the form γ(c) = tanh(c) = (e^c - e^-c)/(e^c + e^-c). It
takes real-valued arguments and transforms them to the range (-1, 1). When automatic
architecture selection is used, this is the activation function for all units in the hidden
layers.
Sigmoid. This function has the form γ(c) = 1/(1 + e^-c). It takes real-valued arguments and
transforms them to the range (0, 1).
Number of Units. The number of units in each hidden layer can be specified explicitly or
determined automatically by the estimation algorithm.
Output Layer
The output layer contains the target (dependent) variables.
Activation Function. The activation function "links" the weighted sums of units in a layer to the
values of units in the succeeding layer.
Identity. This function has the form γ(c) = c. It takes real-valued arguments and returns
them unchanged. When automatic architecture selection is used, this is the activation
function for units in the output layer if there are any scale-dependent variables.
Softmax. This function has the form γ(c_k) = exp(c_k)/Σ_j exp(c_j). It takes a vector of real-
valued arguments and transforms it to a vector whose elements fall in the range (0, 1) and
sum to 1. Softmax is available only if all dependent variables are categorical. When
automatic architecture selection is used, this is the activation function for units in the
output layer if all dependent variables are categorical.
Hyperbolic tangent. This function has the form γ(c) = tanh(c) = (e^c - e^-c)/(e^c + e^-c). It
takes real-valued arguments and transforms them to the range (-1, 1).
Sigmoid. This function has the form γ(c) = 1/(1 + e^-c). It takes real-valued arguments and
transforms them to the range (0, 1).
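The four activation functions listed above can be written out directly; this is a plain illustration of the formulas, with function names chosen for this sketch:

```python
# Sketch: the activation functions described in the text.
import math

def tanh_act(c):      # hyperbolic tangent, range (-1, 1)
    return (math.exp(c) - math.exp(-c)) / (math.exp(c) + math.exp(-c))

def sigmoid(c):       # range (0, 1)
    return 1.0 / (1.0 + math.exp(-c))

def identity(c):      # for scale-dependent output units
    return c

def softmax(cs):      # for categorical outputs: positive, sums to 1
    exps = [math.exp(c) for c in cs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```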
Figure 8: showing the structure of Multilayer Perceptron
Radial Basis Function:
The RBF procedure fits a radial basis function neural network, which is a feed-forward,
supervised learning network with an input layer, a hidden layer called the radial basis
function layer, and an output layer. The hidden layer transforms the input vectors into radial
basis functions. Like the MLP procedure, the RBF procedure performs prediction and
classification. The RBF procedure trains the network in two stages:
1. The procedure determines the radial basis functions using clustering methods. The
center and width of each radial basis function are determined.
2. The procedure estimates the synaptic weights given the radial basis functions. The
sum-of-squares error function with an identity activation function for the output layer is used for
both prediction and classification. Ordinary least squares regression is used to minimize the
sum-of-squares error.
Due to this two-stage training approach, the RBF network is in general trained much
faster than the MLP.
Figure 9: showing the structure of Radial Basis Function
[Figure: a radial basis function neural network, in which the input X feeds hidden units
Z_1, ..., Z_j, each computing a radial basis function K(||X - μ_j||²/σ_j²), and the output Y_l is
the weighted sum of the Z_j with weights w_l1, ..., w_lj.]
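The two-stage RBF training described above can be sketched on a toy one-dimensional problem. Stage 1 is reduced here to a crude two-cluster averaging (real implementations use proper clustering), and stage 2 solves ordinary least squares by hand for just two basis functions; all data and parameters are invented for illustration:

```python
# Sketch: two-stage RBF training on toy 1-D data.
import math

def rbf(x, centre, width):
    return math.exp(-((x - centre) ** 2) / (2 * width ** 2))

xs = [0.0, 0.2, 0.8, 1.0]
ys = [0.0, 0.1, 0.9, 1.0]

# Stage 1: two clusters (low xs, high xs) give centres; width is fixed.
c1 = sum(xs[:2]) / 2; c2 = sum(xs[2:]) / 2; width = 0.5

# Stage 2: ordinary least squares for the output weights
# (2x2 normal equations, no bias term, for brevity).
Z = [(rbf(x, c1, width), rbf(x, c2, width)) for x in xs]
a = sum(z1 * z1 for z1, _ in Z); b = sum(z1 * z2 for z1, z2 in Z)
d = sum(z2 * z2 for _, z2 in Z)
p = sum(z1 * y for (z1, _), y in zip(Z, ys))
q = sum(z2 * y for (_, z2), y in zip(Z, ys))
det = a * d - b * b
w1 = (d * p - b * q) / det
w2 = (a * q - b * p) / det

def predict(x):
    return w1 * rbf(x, c1, width) + w2 * rbf(x, c2, width)
```

Because both stages are closed-form (clustering plus least squares), no iterative error back-propagation is needed, which is why RBF training is generally faster than MLP training.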
2.3 Support vector machines
Support vector machines (SVMs) are a set of related supervised learning methods used
for classification and regression. In simple words, given a set of training examples, each marked
as belonging to one of two categories, an SVM training algorithm builds a model that predicts
whether a new example falls into one category or the other. Intuitively, an SVM model is a
representation of the examples as points in space, mapped so that the examples of the separate
categories are divided by a clear gap that is as wide as possible. New examples are then mapped
into that same space and predicted to belong to a category based on which side of the gap they
fall on.
More formally, a support vector machine constructs a hyperplane or set of hyperplanes
in a high- or infinite-dimensional space, which can be used for classification, regression, or other
tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to
the nearest training data points of any class (the so-called functional margin), since in general the
larger the margin, the lower the generalization error of the classifier (David Meyer et al. 2003).
2.3.1 Motivation
Classifying data is a common task in machine learning. Suppose some given data points
each belong to one of two classes, and the goal is to decide which class a new data point will be
in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list
of p numbers), and we want to know whether we can separate such points with a (p - 1)-
dimensional hyperplane. This is called a linear classifier.
There are many hyperplanes that might classify the data. One reasonable choice as the
best hyperplane is the one that represents the largest separation, or margin, between the two
classes. So we choose the hyperplane so that the distance from it to the nearest data point on each
side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane,
and the linear classifier it defines is known as a maximum-margin classifier (Corinna Cortes et
al. 1995).
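The margin idea above can be sketched with a linear classifier trained on the hinge loss, the objective behind soft-margin SVMs, using plain sub-gradient descent. The data and hyper-parameters are toy choices; practical work would use an SVM library rather than this minimal implementation:

```python
# Sketch: linear classifier trained on the regularised hinge loss
# (the soft-margin SVM objective) by sub-gradient descent.
def train_svm(points, labels, lr=0.01, lam=0.01, epochs=200):
    """points: list of (x1, x2); labels: +1/-1. Returns (w, b)."""
    w = [0.0, 0.0]; b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:                      # inside margin: hinge active
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:                               # outside margin: only shrink w
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def classify(w, b, x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1

points = [(2, 2), (3, 3), (-2, -2), (-3, -1)]
labels = [1, 1, -1, -1]
w, b = train_svm(points, labels)
preds = [classify(w, b, x1, x2) for x1, x2 in points]
```

The regularisation term lam * w is what pushes the solution toward the widest margin; points with margin >= 1 no longer move the weights except through this shrinkage.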
Figure 10: showing the motivation: H3 (green) doesn't separate the two classes; H1 (blue) does,
with a small margin, and H2 (red) with the maximum margin.
2.3.2 Applications in Bioinformatics
Support vector machines are a natural match for the features of many bioinformatics
datasets, and they deliver state-of-the-art performance in several applications. For microarray
gene expression data, the SVM is becoming the system of choice. SVMs are currently among the
best performers for a number of classification tasks ranging from text to genomic data.
2.4 Genetic Algorithms (GAs):
The basic idea of a genetic algorithm (GA) is quite simple. A GA works not with one
solution iterated in time but with a whole population of solutions in each algorithm iteration.
The population contains many (ordinarily several hundred) individuals: bit strings representing
solutions. Evolutionary algorithms deal with similar strings of more general form, e.g.,
containing integer numbers or characters. The mechanism of a GA involves only elementary
operations like string copying, partial bit swapping, or bit value changing (Goldberg et al.
1989).
A GA starts with a population of strings and thereafter generates successive populations
using the following three basic operations: reproduction, crossover, and mutation. Reproduction
is the process by which individual strings are copied according to an objective function value
(fitness). Copying strings according to their fitness value means that strings with a higher
value have a higher probability of contributing one or more offspring to the next generation;
this is an artificial version of natural selection. Mutation is an occasional (with a small
probability) random alteration of a string position's value. It is needed since, in spite of
reproduction and crossover effectively searching and recombining the existing representations,
they occasionally become overzealous and lose some potentially useful genetic material. The
mutation operator prevents such an irrecoverable loss. The recombination mechanism allows
mixing of parental information while passing it to descendants, and mutation introduces
innovation into the population.
Figure 11: showing the Genetic Algorithm Based on Darwinian Paradigm
In spite of its simple principles, the design of a GA for successful practical use is
surprisingly complicated. A GA has many parameters that depend on the problem to be solved.
The first is the size of the population. Larger populations usually decrease the number of
iterations needed, but dramatically increase the computing time for each iteration. The factors
increasing demands on the size of the population are the complexity of the problem being solved
and the length of the individuals. Every individual contains one or more chromosomes containing
the value of a potential solution. Chromosomes consist of genes. The gene in our version of the
GA is a structure representing one bit of the solution value. It is usually advantageous to use some
redundancy in genes, and so the physical length of our genes is greater than one bit. This type of
redundancy was introduced by Ryan.
To prevent degeneration and deadlock in local extrema, a limited lifetime of each
individual can be used. This limited lifetime is realized by the death operator, which represents
something like a continual restart of the GA. This operator enables a decrease in population size
as well as an increase in the speed of fitness improvement. It is necessary to store the best solution
obtained separately; the corresponding individual need not always be present in the
population because of the limited lifetime.
Many GAs are implemented on a population consisting of haploid individuals (each
individual contains just one chromosome). However, in nature many living organisms have
more than one chromosome, and there are mechanisms used to determine dominant genes.
Sexual recombination generates an endless variety of genotype combinations that
increases the evolutionary potential of the population. Because it increases the variation among
the offspring produced by an individual, it improves the chance that some of them will be
successful in the varying and often unpredictable environments they will encounter. Using diploid
or multiploid individuals can often decrease demands on the population size. However, the use of
a multiploid GA with sexual reproduction brings some complications; the advantage of
multiploidy can often be substituted by the death operator and redundant gene coding.
New individuals are created by an operation called crossover. In the simplest case, crossover
means swapping two parts of two chromosomes split at a randomly selected point (so-called one-
point crossover). In our GA we use uniform crossover at the bit level. The strategy of selecting
individuals for crossover is very important, as it strongly determines the behavior of the GA.
Genetic algorithms commonly use heuristic and stochastic approaches. From the
theoretical viewpoint, the convergence of heuristic algorithms is not guaranteed for most
application cases. That is why the definition of the stopping rule of the GA brings a new
problem. It can be shown that, when using a proper version of the GA, the typical number of
iterations can be determined.
Details of the GA implementation are specified in the following way:
1. Generation of the initial population: At the beginning the whole population is generated
randomly, and the members are sorted by fitness (in descending order).
2. Mutation: The mutation is applied to each gene with the same probability. Mutation of
a gene means the inversion of one randomly selected bit in the gene.
3. Death: A classical GA uses two main operations, crossover and mutation (another
operation can be migration). In our GA a third operation, death, is also available. Every
individual carries an additional piece of information: its age. A simple counter that is
incremented in each GA iteration represents the age.
If the age of any member reaches the preset lifetime limit LT, the member dies and is
immediately replaced by a new randomly generated member. The age is neither mutated nor
crossed over. The age of new individuals (including individuals created by crossover) is set to
zero.
4. Sorting is realized by the fitness function values.
5. Crossover: Uniform crossover is used for all genes (each bit of the offspring gene is
selected separately from the corresponding bits of the genes of both parents).
6. When a stopping rule is not satisfied, go to step 2.
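The six steps above can be sketched as a minimal GA maximising the number of 1-bits in a short string (the classic OneMax toy problem). The population size, mutation rate, lifetime limit LT, and generation count are illustrative choices, not the thesis's own settings:

```python
# Sketch: the GA steps above on the OneMax problem (maximise 1-bits).
import random
random.seed(0)

BITS, POP, LT, GENERATIONS = 16, 20, 10, 60

def fitness(ind):
    return sum(ind["genes"])

def new_individual():
    return {"genes": [random.randint(0, 1) for _ in range(BITS)], "age": 0}

def mutate(ind, rate=0.05):
    for i in range(BITS):
        if random.random() < rate:
            ind["genes"][i] ^= 1            # step 2: invert one bit

def uniform_crossover(a, b):                # step 5: bit-wise uniform
    genes = [random.choice(pair) for pair in zip(a["genes"], b["genes"])]
    return {"genes": genes, "age": 0}

pop = sorted((new_individual() for _ in range(POP)),
             key=fitness, reverse=True)                 # step 1
best = list(pop[0]["genes"])                # best solution stored separately
for _ in range(GENERATIONS):                # step 6: stopping rule
    for ind in pop:
        mutate(ind)
        ind["age"] += 1
    pop = [ind if ind["age"] < LT else new_individual() # step 3: death
           for ind in pop]
    pop.sort(key=fitness, reverse=True)                 # step 4
    pop = pop[:POP // 2] + [
        uniform_crossover(*random.sample(pop[:POP // 2], 2))
        for _ in range(POP - POP // 2)]
    if fitness(pop[0]) > sum(best):
        best = list(pop[0]["genes"])
```

Note how the best solution is kept outside the population, exactly because the death operator may remove the corresponding individual at any time.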
Figure 12: showing the Genetic Algorithm as Conceptual Algorithm
2.4.1 Problems suited to Genetic Algorithms:
Table 5: displaying the problems suited to Genetic Algorithms
Domain Application Types
Control gas pipeline, pole balancing, missile evasion, pursuit
Design semiconductor layout, aircraft design, keyboard configuration, communication networks
Scheduling manufacturing, facility scheduling, resource allocation
Robotics trajectory planning
Machine Learning designing neural networks, improving classification algorithms, classifier systems
Signal Processing filter design
Game Playing poker, checkers, prisoner's dilemma
Combinatorial Optimization set covering, travelling salesman, routing, bin packing, graph colouring and partitioning
2.6 Fuzzy Logic
The concept of Fuzzy Logic (FL) was conceived by Lotfi Zadeh, a professor at the
University of California at Berkeley. It was presented not as a control methodology, but as a way
of processing data by allowing partial set membership rather than crisp set membership or non-
membership. Fuzzy Logic is a problem-solving control system methodology that lends itself to
implementation in systems ranging from simple, small, embedded micro-controllers to large,
networked, multi-channel PC or workstation-based data acquisition and control systems. It can
be implemented in hardware, software, or a combination of both. It provides a simple way to
arrive at a definite conclusion based upon vague, ambiguous, imprecise, noisy, or missing input
information. FL's approach to control problems mimics how a person would make decisions,
only much faster (Biacino et al. 2002).
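The notion of partial set membership can be made concrete with a triangular membership function, the simplest fuzzy-set building block. The "warm" set and its break-points below are invented for illustration:

```python
# Sketch: partial set membership via a triangular membership function.
def triangular(x, left, peak, right):
    """Membership degree in [0, 1] for a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def warm(temp):
    """Degree to which a temperature belongs to the fuzzy set 'warm'."""
    return triangular(temp, 15.0, 25.0, 35.0)

degrees = [warm(t) for t in (15.0, 20.0, 25.0, 30.0)]  # 0.0, 0.5, 1.0, 0.5
```

Unlike a crisp set, a temperature here can be "warm" to degree 0.5, which is exactly the partial membership Zadeh proposed.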
2.6.1 Fuzzy Logic in Bioinformatics
Fuzzy logic and fuzzy technology are now frequently used in bioinformatics. The following
are some examples:
1. To analyze gene expression data.
2. To unravel functional and ancestral relationships between proteins via fuzzy alignment
methods, or using a generalized radial basis function neural network architecture that
generates fuzzy classification rules.
3. To analyze the relationships between genes and decipher a genetic network.
4. To process complementary deoxyribonucleic acid (cDNA) microarray images; the
procedure should be automated due to the large number of spots, and this is achieved
using a fuzzy vector filtering framework.
5. To classify amino acid sequences into different superfamilies.
2.7 Hidden Markov Model (HMM)
Hidden Markov models are statistical tools for modeling complex stochastic phenomena,
and they have a long history. In 1913, the first studies on Markov chains [Markov, 1913] were
applied to the analysis of language and led A. A. Markov to the conception of Markov chain
theory. From 1948 to 1951, Shannon established information theory using Markov chains
[Shannon, 1948, Shannon, 1951]. In 1966, research by L. E. Baum and T. Petrie
[Baum and Petrie, 1966, Baum and Eagon, 1967, Baum and Sell, 1968, Petrie, 1969, Baum et al.,
1970, Baum, 1972] defined algorithms to train HMMs. In 1980, variable-duration HMMs were
defined [Furguson, 1980]. In the 1980s, neural networks were incorporated into HMMs [Bourland
and Wellekens, 1990], extending applications of HMMs to speech analysis [Bahl et al.,
1983, Rabiner et al., 1983, Juang and Rabiner, 1986, Rabiner and Levinson, 1985, Poritz and
Richter, 1986, Rosemberg and Colla, 1987, Euler and Wolf, 1988, Dours, 1989]. Nowadays,
HMMs are used for task scheduling [Soukhal et al., 2001b, Soukhal et al., 2001a] and
information technologies [Zaragoza and Gallinari, 1998, Amini, 2001, Serradura et al., 2001].
Many variants of the genuine HMM have been created to solve particular problems. Continuous-
density HMMs [Rabiner, 1989], hierarchical HMMs [Fine et al., 1998], multidimensional
HMMs with independent processes [Brouard, 1999], and symbol-substitution HMMs [Aupetit
et al., 2002, Aupetit, 2005] are some examples of those new models.
An HMM is a finite set of states, each of which is associated with a (generally
multidimensional) probability distribution. Transitions among the states are governed by a set of
probabilities called transition probabilities. The sum of all transition probabilities out of a given
state has to equal 1.0, as does the sum of all observation probabilities for a particular state. In a
regular Markov model, the state is directly visible to the observer, and therefore the state
transition probabilities are the only parameters. Here only the outcome, not the state, is directly
visible to an external observer, and therefore the states are "hidden" to the outside; hence the
name Hidden Markov Model. Each state has a probability distribution over the possible output
tokens. Therefore the sequence of tokens generated by an HMM gives some information about
the sequence of states.
Note that the adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model; even if the model parameters are known exactly, the model is still 'hidden'.
Hidden Markov models are especially known for their application in temporal pattern
recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical
score following, partial discharges and bioinformatics.
An HMM is characterized by the following elements:
The number of states of the model, N.
The number of observation symbols in the alphabet, M.
A set of state transition probabilities.
A set of observation probability distributions, one for each state.
An initial state distribution.
Figure 14: Showing the Hidden Markov model
Where x = states, y = possible observations, a = state transition probabilities, b = output probabilities
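The generative process described above can be sketched in a few lines of Python. The two-state model, its symbols and all of the probabilities below are hypothetical toy values chosen only to illustrate how hidden states emit visible tokens:

```python
import random

random.seed(0)

# Toy two-state HMM (hypothetical parameters, for illustration only).
states = ["H", "C"]
start = {"H": 0.6, "C": 0.4}               # initial state distribution
trans = {"H": {"H": 0.7, "C": 0.3},        # state transition probabilities (a)
         "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {"a": 0.9, "b": 0.1},         # output probabilities per state (b)
        "C": {"a": 0.2, "b": 0.8}}

def pick(dist):
    """Draw one key from a {key: probability} dictionary."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point rounding

def generate(n):
    """Generate n observations; the state path stays hidden in practice."""
    path, obs = [], []
    s = pick(start)
    for _ in range(n):
        path.append(s)
        obs.append(pick(emit[s]))
        s = pick(trans[s])
    return path, obs

path, obs = generate(10)
```

An external observer sees only `obs`; `path` is the hidden state sequence that the observation sequence only partially reveals.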
2.7.1 Applications of hidden Markov models
Cryptanalysis
Speech recognition
Machine translation
Partial discharge
Gene prediction
Alignment of bio-sequences
2.8 Markov chain Monte Carlo
A Markov chain is a mathematical model for stochastic systems whose states, discrete or continuous, are governed by a transition probability. In a first-order Markov chain, the current state depends only on the most recent previous state (Christophe Andrieu et al. 2003).
The Markovian property means locality in space or time, as in Markov random fields and Markov chains. Indeed, a discrete-time Markov chain can be viewed as a special case of a Markov random field (causal and one-dimensional).
MCMC is a general-purpose technique for generating fair samples from a probability distribution in a high-dimensional space, using random numbers drawn from a probability distribution over a certain range. A Markov chain is designed to have π(X) as its stationary (or invariant) probability. This is a non-trivial task when π(X) is very complicated in very high-dimensional spaces!
Usually it is not hard to construct a Markov chain with the desired properties. The more difficult problem is to determine how many steps are needed to converge to the stationary distribution within an acceptable error. A good chain will have rapid mixing: the stationary distribution is reached quickly starting from an arbitrary position (this is described further under Markov chain mixing time).
Typical use of MCMC sampling can only approximate the target distribution, as there is
always some residual effect of the starting position. More sophisticated MCMC-based algorithms
such as coupling from the past can produce exact samples, at the cost of additional computation
and an unbounded (though finite in expectation) running time.
The most common application of these algorithms is numerically calculating multi-
dimensional integrals. In these methods, an ensemble of "walkers" moves around randomly. At
each point where the walker steps, the integrand value at that point is counted towards the
integral. The walker then may make a number of tentative steps around the area, looking for a
place with reasonably high contribution to the integral to move into next.
Random walk methods are a kind of random simulation or Monte Carlo method.
However, whereas the random samples of the integrand used in a conventional Monte Carlo
integration are statistically independent, those used in MCMC are correlated. A Markov chain is
constructed in such a way as to have the integrand as its equilibrium distribution. Surprisingly,
this is often easy to do.
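The "walker" picture above corresponds to the random-walk Metropolis algorithm. The sketch below is a minimal illustration rather than a production sampler: it uses a standard normal target known only up to a constant, and every point the walker visits is counted toward the estimate of the integral E[x²] = 1.

```python
import math
import random

random.seed(1)

def log_target(x):
    # Unnormalized log-density of a standard normal; MCMC only needs
    # the target up to a constant factor.
    return -0.5 * x * x

def metropolis(n_samples, step=1.0, burn_in=1000):
    """Random-walk Metropolis: propose a tentative step around the
    current point and accept it with probability min(1, pi(x')/pi(x))."""
    x, samples = 0.0, []
    for i in range(n_samples + burn_in):
        proposal = x + random.uniform(-step, step)
        accept_prob = math.exp(min(0.0, log_target(proposal) - log_target(x)))
        if random.random() < accept_prob:
            x = proposal              # the walker moves
        if i >= burn_in:
            samples.append(x)         # this point counts toward the integral
    return samples

samples = metropolis(50_000)
# The samples are correlated, not independent, but their average still
# converges to the desired expectation; here E[x^2] = 1.
estimate = sum(s * s for s in samples) / len(samples)
```

Discarding a burn-in reduces the residual effect of the starting position mentioned above, though it never removes it exactly.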
Multi-dimensional integrals often arise in Bayesian statistics, computational physics,
computational biology and computational linguistics, so Markov chain Monte Carlo methods are
widely used in those fields.
2.8.1 Some Properties of Markov Chains
Irreducible chain: can get from any state to any other eventually (non-zero probability).
Periodic state: state i is periodic with period k if all returns to i must occur in multiples of k.
Ergodic chain: irreducible and has an aperiodic state. This implies all states are aperiodic, so the chain is aperiodic.
Finite state space: the chain can be represented as a matrix of transition probabilities; then ergodic = regular.
Regular chain: some power of the transition matrix has only positive elements.
Reversible chain: satisfies detailed balance.
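For a finite state space, these properties can be checked numerically on the transition matrix. The sketch below uses a hypothetical 2×2 regular chain (every entry of P is already positive) and plain Python lists to show that repeatedly applying the transition matrix drives any starting distribution to the stationary one.

```python
# Hypothetical 2x2 transition matrix of a regular chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def step(dist, P):
    """One step of the chain: new_j = sum_i dist_i * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0]        # arbitrary starting position
for _ in range(100):
    dist = step(dist, P)
# dist has now converged to the stationary distribution pi, with pi = pi P.
```

Solving π = πP directly gives π = (5/6, 1/6) for this matrix, which is exactly what the iteration converges to.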
2.8.2 MCMC algorithms
The following algorithms are commonly used in Markov chain Monte Carlo:
Metropolis-Hastings algorithm
Metropolis algorithm
Mixtures and blocks
Gibbs sampling and Sequential Monte Carlo & Particle Filters
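As one concrete instance from the list above, Gibbs sampling draws each variable in turn from its full conditional distribution. The target below, a bivariate normal with correlation rho = 0.8, is a hypothetical example chosen because its conditionals are simple one-dimensional normals.

```python
import random

random.seed(2)

# Hypothetical target: bivariate normal, zero means, unit variances,
# correlation rho. Its full conditionals are one-dimensional normals:
#   x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x.
rho = 0.8
cond_sd = (1.0 - rho * rho) ** 0.5

def gibbs(n_samples, burn_in=500):
    """Gibbs sampling: alternately refresh each coordinate from its
    full conditional given the current value of the other."""
    x = y = 0.0
    xs, ys = [], []
    for i in range(n_samples + burn_in):
        x = random.gauss(rho * y, cond_sd)   # draw x | y
        y = random.gauss(rho * x, cond_sd)   # draw y | x
        if i >= burn_in:
            xs.append(x)
            ys.append(y)
    return xs, ys

xs, ys = gibbs(20_000)
mean_x = sum(xs) / len(xs)                              # should be near 0
mean_xy = sum(a * b for a, b in zip(xs, ys)) / len(xs)  # should be near rho
```

No accept/reject step is needed here: each conditional draw is accepted with probability one, which is what distinguishes Gibbs sampling from the Metropolis-Hastings family.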
2.8.3 Applications
Computer vision
Object tracking demo [Blake & Isard]
Speech & audio enhancement
Web statistics estimation
Regression & classification
Global maximization of MLPs [Freitas et al.]
Bayesian networks
Details in the Gilks et al. book
Genetics & molecular biology
Robotics, etc.
Mycobacterium Tuberculosis
3. MYCOBACTERIUM TUBERCULOSIS
3.1 Mycobacterium tuberculosis
Tuberculosis (TB) is a disease that is spread from person to person through the air. TB
usually affects the lungs, but it can also affect other parts of the body, such as the brain, the
kidneys, or the spine. TB germs are put into the air when a person with TB disease of the lungs
or throat coughs or sneezes. When a person inhales air that contains TB germs, he or she may
become infected. People with TB infection do not feel sick and do not have any symptoms.
However, they may develop TB disease at some time in the future. The general symptoms of TB
disease include feeling sick or weak, weight loss, fever, and night sweats. The symptoms of TB
of the lungs include coughing, chest pain, and coughing up blood. Other symptoms depend on
the part of the body that is affected.
Mycobacterium tuberculosis (MTB) is a pathogenic bacterial species in the genus Mycobacterium and the causative agent of most cases of tuberculosis. First discovered in 1882 by Robert Koch, M. tuberculosis has an unusual, waxy coating on the cell surface (primarily mycolic acid), which makes the cells impervious to Gram staining; acid-fast techniques are used instead. The physiology of M. tuberculosis is highly aerobic, requiring high levels of oxygen. Primarily a pathogen of the mammalian respiratory system, MTB infects the lungs, causing tuberculosis (Ryan KJ et al. 2004).
The M. tuberculosis genome was sequenced in 1998 (Cole ST et al. 1998; Camus JC et al. 2002).
3.1.1 Strain variation
M. tuberculosis appears to be genetically diverse. This genetic diversity results in
significant phenotypic differences between clinical isolates. M. tuberculosis exhibits a
biogeographic population structure and different strain lineages are associated with different
geographic regions. Phenotypic studies suggest that this strain variation may have implications for the development of new diagnostics and vaccines. Micro-evolutionary variation affects the relative fitness and transmission dynamics of antibiotic-resistant strains (Gagneux S 2009).
3.1.2 Mycobacterium tuberculosis CDC1551
CDC1551 is a clinical strain that was originally thought to be highly virulent (142); it has more recently been shown that CDC1551 induces levels of cytokines, including TNF-α, that are higher than those induced by other M. tuberculosis strains in mice. However, it is not more virulent than the other strains, as defined by bacterial load and mortality.
The genome of M. tuberculosis strain CDC1551 was sequenced by the whole-genome random sequencing method as described in Fleischmann RD et al. (1995), Science 269:496-512. The M. tuberculosis genome is a circular chromosome of 4,403,765 base pairs with an average G + C content of 65.6%. There are a total of 4,033 predicted open reading frames (ORFs). Predicted biological roles were assigned to 1,734 ORFs (43%); 605 ORFs (15%) match hypothetical proteins from other species, and 1,694 ORFs (42%) have no database match and presumably represent novel genes.
Figure 15: Showing the Role Category Pie Chart of Mycobacterium tuberculosis CDC1551
3.1.3 Mycobacterium tuberculosis H37Rv
The M. tuberculosis H37Rv genome consists of 4.4 × 10^6 bp and contains approximately 4,000 genes (Fig. 1) (53). Annotation of the M. tuberculosis genome shows that this bacterium has some unique features. Until recently, a genome sequence for the H37Ra strain has been lacking. Historically, M. tuberculosis H37Ra is the avirulent counterpart of the virulent strain H37Rv, and both strains are derived from their virulent parent strain H37.
Figure 16: Showing the Role Category Pie Chart of Mycobacterium tuberculosis H37Rv (lab strain)
3.1.4 Comparison between H37Rv and H37Ra
The H37Ra genome is highly similar to that of H37Rv with respect to gene content and order but is 8,445 bp larger as a result of 53 insertions and 21 deletions in H37Ra relative to H37Rv. Variations in repetitive sequences such as IS6110 and PE/PPE/PE-PGRS family genes are responsible for most of the gross genetic changes. A total of 198 single nucleotide variations (SNVs) that differ between H37Ra and H37Rv were identified, yet 119 of them are identical between H37Ra and CDC1551 and 3 are due to H37Rv strain variation, leaving only 76 H37Ra-specific SNVs that affect only 32 genes.
Figure 17: Displaying a Proposed Phylogenetic Relationship of H37Ra, H37Rv, H37, and
CDC1551
Figure 18: Displaying Statistics for Mycobacterium tuberculosis CDC1551
Figure 19: Displaying Statistics for Mycobacterium tuberculosis H37Rv (lab strain)
Aim & Objectives
4. AIM & OBJECTIVES
To predict the secondary structure of proteins using a Genetic Algorithm.
To compare the performance of evolutionary Genetic Algorithms with a neural network in the prediction of protein secondary structure.
Materials & Methods
5. MATERIALS AND METHODS
5.1 Software and Databases:
Table 6: Displaying the name and utility of the software and databases used in the methodology
Name of the software / Database: Utility
JCVI: To retrieve the important proteins of Mycobacterium tuberculosis CDC1551 and Mycobacterium tuberculosis H37Rv
NCBI Protein database: To get the amino acid sequences of the desired proteins
SOPMA: To make the secondary structure prediction
SPSS 16.0: To fit the Multi Layer Perceptron (MLP) feed-forward neural network
MATLAB (Genetic Algorithm Toolbox): To perform the secondary structure prediction of the amino acid sequences using the genetic algorithm toolbox
5.2 FLOW CHART OF METHODOLOGY
Select the complete genome of the particular Mycobacterium strain from the JCVI database
Get the desired protein names from the whole list of proteins
Retrieve the protein sequences in FASTA format from the NCBI database
Run the protein sequences through the SOPMA secondary structure prediction tool and get the predicted secondary structures
Use a PERL script to create a file containing the nine features (input nodes) of each amino acid
Run the input file in the SPSS Neural Network (both Multilayer Perceptron and Radial Basis Function)
Run the input file in the MATLAB Genetic Algorithm
Compare the results obtained from the SPSS Neural Network and the MATLAB Genetic Algorithm
Figure 21: Displaying Genome information of Mycobacterium tuberculosis H37Rv
Figure 22: Displaying List of protein coding genes of Mycobacterium tuberculosis CDC1551
Figure 23: List of protein coding genes of Mycobacterium tuberculosis H37Rv
Table 7: Showing the list of proteins taken from Mycobacterium tuberculosis CDC1551 for secondary structure prediction
S.No PROTEIN NAME LENGTH OF SEQUENCE
1 conserved hypothetical protein 507
2 conserved hypothetical protein 402
3 conserved hypothetical protein 187
4 conserved hypothetical protein 686
5 conserved hypothetical protein 838
6 conserved hypothetical protein 304
7 conserved hypothetical protein 262
8 conserved hypothetical protein 232
9 conserved hypothetical protein 626
10 conserved hypothetical protein 511
11 conserved hypothetical protein 521
12 conserved hypothetical protein 87
13 ABC transporter, ATP-binding protein 330
14 Acyl carrier protein 87
15 Chromosomal replication initiator protein DnaA 507
16 DNA gyrase subunit A 838
17 DNA gyrase subunit B 686
18 DNA polymerase III, beta subunit 402
19 Glutamine amidotransferase, class I 232
20 Leucyl-tRNA synthetase 969
21 L-serine dehydratase 461
22 Penicillin-binding protein 820
23 Replicative DNA helicase, intein-containing 874
24 Ribosomal protein L9 152
25 Ribosomal protein S18 84
26 Ribosomal protein S6 96
27 Serine/threonine protein kinase 626
28 Transcriptional regulator, ArsR family 114
29 Transcriptional regulator, GntR family 244
30 Transcriptional regulator, MarR family 208
Table 8: Showing the list of proteins taken from Mycobacterium tuberculosis H37Rv for secondary structure prediction
S.No PROTEIN NAME LENGTH OF SEQUENCE
1 Hypothetical Protein Rv0257 124
2 Hypothetical Protein Rv0264c 210
3 Hypothetical Protein Rv0268c 169
4 Hypothetical Protein Rv0272c 377
5 Hypothetical Protein Rv0282 631
6 Hypothetical protein Rv0064 979
5.3.1 Retrieving amino acid sequence data from the NCBI protein database
The amino acid sequences of the desired proteins were retrieved from the NCBI protein database by giving an appropriate query.
5.3.2 NCBI PROTEIN DATABASE
In addition to protein sequences, other protein-related information is available via Entrez. Search the Structure database by choosing "Structure" from the Entrez pull-down menu, the Conserved Domains Database (CDD) by choosing "Domains", and 3D Domains by choosing the "3D Domains" option.
Figure 24: Displaying the home page of NCBI protein database
Figure 25: Displaying the retrieval of amino acid sequence data from the NCBI protein database
5.4 SECONDARY STRUCTURE PREDICTION USING SOPMA
Secondary structure prediction with the SOPMA tool is available on the Pôle Bio-Informatique Lyonnais (PBIL) server. The tool can be accessed at the URL http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_sopma.html
SOPMA
Recently, a new method called the self-optimized prediction method (SOPM) has been described to improve the success rate in the prediction of the secondary structure of proteins. In this method, improvements are brought about by predicting all the sequences of a set of aligned proteins belonging to the same family. This improved SOPM method (SOPMA) correctly predicts 69.5% of amino acids for a three-state description of the secondary structure (alpha-helix, beta-sheet and coil) in a whole database containing 126 chains of non-homologous (less than 25% identity) proteins. Joint prediction with SOPMA and a neural network method (PHD) correctly predicts 82.2% of residues for 74% of co-predicted amino acids.
Figure 26: Displaying the home page