
    SECONDARY STRUCTURE PREDICTION OF TUBERCULOSIS

    GENOMES USING MACHINE LEARNING ALGORITHMS

    Dissertation submitted

    to

    Sri Ramachandra University

    Porur, Chennai 600 116.

in partial fulfillment of the requirements for the Degree of Master of Science

    in

    Bioinformatics

    By

    RAJIV GANDHI.B

    Reg No: 3010815

    DEPARTMENT OF BIOINFORMATICS

    Sri Ramachandra College of Biomedical Sciences, Research & Technology

    Sri Ramachandra University

    Porur, Chennai - 600 116.

    JUNE 2010


    ACKNOWLEDGEMENT

I express my sincere gratitude to my guide Dr. P. Venkatesan, Deputy Director, Tuberculosis Research Centre, Indian Council of Medical Research, for his excellent guidance, knowledge, belief, innovative thoughts and motivation throughout. His continuous and constant support and encouragement have given strength and motivation to me and to my work. He has been the backbone of my work.

I am grateful to Dr. P. K. Ragunath, Head of the Department, Department of Bioinformatics, Sri Ramachandra University, for his kind encouragement at all times.

I express my profound thanks to the Dean of Faculties, the Vice Chancellor and the Management of Sri Ramachandra University for providing all the facilities to carry out this project work.

I would like to express my heartfelt thanks to all the staff members, Mrs. Arundathi, Mr. Dicky John Davis, Ms. C. R. Hemalatha, Ms. S. Kayalvizhi, Ms. M. Premavathi, Mr. Ramesh and Mr. Venkatesan, whose disciplined guidance helped me complete my project on time.

I express my special gratitude to my parents, my sister and my brother for their constant encouragement, support and blessings. Their inspiration has given me the immense courage to work efficiently.

    I express my heartfelt gratitude and obligation to my loving friends P.A.Abhinand,

    R.D.Nandhini, B.Chaitanya, Balaraman, Gajalakshmi, J.Andrews Paulraj, K.Deepa, Selva,

    T.Sridhar, and R.Sridhar for helping me throughout this project.

I would like to thank all those who have directly or indirectly helped me in my project.


DEDICATION

I would like to dedicate this project to my loving parents Mr. M. Balasubramanian & Mrs. B. Palaniammal Balasubramanian, and to my friends, especially P.A.Abhinand, R.D.Nandhini & B.Chaitanya, who have assisted and motivated me.


CONTENTS

Serial No.  Topic

1  Introduction

2  Machine Learning Approaches

3  Mycobacterium tuberculosis

4  Aim & Objectives

5  Materials and Methods

6  Results

7  Discussion

8  Summary

9  References


Introduction


    1. INTRODUCTION

    1.1 Bioinformatics

    The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of

    informatics processes in biotic systems. Its primary use since at least the late 1980s has been in

    genomics and genetics, particularly in those areas of genomics involving large-scale DNA

    sequencing.

    Bioinformatics represents a new, growing area of science that uses computational

    approaches to answer biological questions. Answering these questions requires that investigators

    take advantage of large, complex data sets (both public and private) in a rigorous fashion to

    reach valid, biological conclusions. The potential of such an approach is beginning to change the

    fundamental way in which basic science is done, helping to more efficiently guide experimental

    design in the laboratory. With the explosion of sequence and structural information available to

    researchers, the field of bioinformatics is playing an increasingly large role in the study of

fundamental biomedical problems. The challenge facing computational biologists will be to aid in gene discovery and in the design of molecular modeling, site-directed mutagenesis, and

    experiments of other types that can potentially reveal previously unknown relationships with

    respect to the structure and function of genes and proteins. This challenge becomes particularly

    daunting in light of the vast amount of data that has been produced by the Human Genome

    Project and other systematic sequencing efforts to date.

    The primary goal of bioinformatics is to increase the understanding of biological

processes. What sets it apart from other approaches, however, is its focus on developing and applying computationally intensive techniques (e.g., pattern recognition, data mining, machine

    learning algorithms, and visualization) to achieve this goal. Major research efforts in the field

    include sequence alignment, gene finding, genome assembly, drug design, drug discovery,

    protein structure alignment, protein structure prediction, prediction of gene expression and

    protein-protein interactions, genome-wide association studies and the modeling of evolution

    (Achuthsankar S Nair. 2007).


    1.2 Bioinformatics Algorithms

Bioinformatics is the field that solves biological problems through the application of information technology. On the technology side, algorithms must be designed to solve these problems; in short, the programs (software) used to solve biological problems are called bioinformatics algorithms. Some of the important bioinformatics algorithms are explained below:

    1.2.1 Greedy Algorithm

A greedy algorithm is defined as an algorithm which searches for a local optimum at each step and treats it as the global optimum (Cormen, et al. 1990).

    1.2.1.1 Types of Greedy Algorithm

There are three types of greedy algorithm:

Pure greedy algorithm

Orthogonal greedy algorithm

Smooth greedy algorithm

    Disadvantage of greedy algorithm

The disadvantage of the greedy algorithm is that, since it stops at a local optimum, the global optimum of the data may not be found (see the sketch below).
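A minimal Python sketch (not from the dissertation) of this disadvantage: with the illustrative coin denominations {1, 3, 4}, greedily taking the largest coin first misses the global optimum for amount 6.

def greedy_coin_count(amount, coins):
    """Repeatedly take the largest coin that fits (local optimization)."""
    count = 0
    for coin in sorted(coins, reverse=True):
        count += amount // coin
        amount %= coin
    return count

print(greedy_coin_count(6, [1, 3, 4]))  # 3 coins: 4 + 1 + 1
# The global optimum is 2 coins (3 + 3), which the greedy strategy never finds.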

    1.2.2 Dynamic programming

It is a general optimization method proposed by Richard Bellman of Princeton University in the 1950s. It is extensively used in sequence alignment and other computational problems, and was first applied to biological sequences by Needleman and Wunsch.

The original problem is broken into smaller subproblems which are then solved. Pieces of the larger problem have a sequential dependency: the fourth piece can be solved using the solution of the third piece, the third piece can be solved using the solution of the second piece, and so on. A dynamic programming algorithm first solves all the subproblems and stores each intermediate solution in a table along with a score, using an m*n matrix of scores, where m and n are the lengths of the sequences being aligned (Eddy, S. R. 2004).


Dynamic programming can be used for:

Local alignment (Smith-Waterman algorithm)

Global alignment (Needleman-Wunsch algorithm)

Figure 1: Processing steps in Dynamic Programming

Steps involved in dynamic programming:

1. Initialization

2. Matrix fill (scoring)

3. Traceback (alignment)

    1.2.2.1 Global Alignment: Needleman-Wunsch Algorithm

In global sequence alignment, an attempt is made to align the entirety of two different sequences, up to and including the ends of the sequences. Needleman and Wunsch (1970) were among the first to describe a dynamic programming algorithm for global sequence alignment. The Needleman-Wunsch algorithm performs a global alignment on two sequences (called A and B here). It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul Needleman and Christian Wunsch.

The Needleman-Wunsch algorithm is an example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison.


    1.2.2.2 Local alignment

The Smith-Waterman algorithm is the well-known algorithm for performing local alignment. Instead of looking at the total sequence, it compares segments of all possible lengths and optimizes the similarity measure.

1.2.2.3 Difference between Global and Local Alignment

The main difference from the Needleman-Wunsch algorithm is that negative scoring matrix cells are set to zero, which renders the local alignment visible. Backtracking starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment.

Table 1: Difference between Global and Local Alignment

Global Alignment (Needleman-Wunsch): a longer process, aligning the full length of both sequences. Given the substitution score s(x_i, y_j) and gap penalty d:

F(0, 0) = 0
F(i, j) = max{ F(i-1, j-1) + s(x_i, y_j), F(i-1, j) - d, F(i, j-1) - d }

Local Alignment (Smith-Waterman): a shorter process, aligning only the best-matching segments. Given s(x_i, y_j) and d:

F(0, 0) = 0
F(i, j) = max{ 0, F(i-1, j-1) + s(x_i, y_j), F(i-1, j) - d, F(i, j-1) - d }
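A minimal Python sketch (not the dissertation's implementation) of the three dynamic programming steps named above, i.e., initialization, matrix fill and traceback, using the global-alignment recurrence F(i, j); the scoring scheme (match +1, mismatch -1, gap -2) is an illustrative assumption.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    m, n = len(a), len(b)
    # 1. Initialization: first row/column hold cumulative gap penalties.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    # 2. Matrix fill: F(i, j) = max of diagonal, up, and left moves.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # 3. Traceback: walk from F(m, n) back to F(0, 0), rebuilding the alignment.
    out_a, out_b, i, j = [], [], m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b)), F[m][n]

print(needleman_wunsch("GATTACA", "GCATGCU"))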


    1.2.3 Divide and Conquer Algorithm

Divide and conquer is an algorithm which divides a large problem into two or more sub-problems of the same type, until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem (Thomas H. et al. 2000).
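A minimal Python sketch of the divide-and-conquer pattern just described (merge sort, an illustrative example not taken from the dissertation): split the problem in two, solve each half recursively, and combine the sub-solutions.

def merge_sort(items):
    if len(items) <= 1:                 # simple enough to solve directly
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])      # divide into sub-problems and solve
    right = merge_sort(items[mid:])
    merged, i, j = [], 0, 0             # combine the two sorted halves
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 8, 1, 9, 3]))   # [1, 2, 3, 5, 8, 9]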

    1.2.4 Randomized Algorithm

In a randomized algorithm, the object to be optimized is selected randomly, yet all the objects in the given dataset can be considered for optimization. This gives better optimization and saves time: as soon as one desired optimization is reached, the process stops (Cormen, et al. 1990).
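A hedged Python sketch of the idea described above: candidates are picked at random, the best one seen so far is kept, and the search stops as soon as a desired quality is reached. The function name, objective and threshold are illustrative assumptions.

import random

def randomized_search(objective, candidates, good_enough, max_tries=10000):
    best = None
    for _ in range(max_tries):
        x = random.choice(candidates)           # pick an object at random
        if best is None or objective(x) > objective(best):
            best = x                            # keep the best optimization so far
        if objective(best) >= good_enough:      # stop once the target is reached
            break
    return best

print(randomized_search(lambda x: -(x - 7) ** 2, list(range(100)), good_enough=0))  # finds 7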

    1.2.5 Branch and bound Algorithm

The branch and bound algorithm is usually used for finding the optimal solution, mainly in discrete and combinatorial optimization. The first step is branching or splitting: the given dataset N is split into smaller sets {N1, N2, N3} whose union covers N.

For each smaller set, the nodes present in it provide a lower and an upper bound value. Suppose the lower bound value of node A is greater than the upper bound value of node B; then node A is discarded. At the end, only the node with the best bound value is taken, and this provides the optimum solution (a sketch follows below).
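A minimal Python branch-and-bound sketch (an illustrative 0/1 knapsack, not from the dissertation; for this maximization problem the pruning test is mirrored): each node splits into "take item i" / "skip item i" sub-sets, and a branch whose optimistic upper bound cannot beat the best known value is discarded.

def branch_and_bound_knapsack(values, weights, capacity):
    best = [0]

    def upper_bound(i, value, room):
        # Optimistic bound: pretend the remaining items are divisible.
        for v, w in sorted(zip(values[i:], weights[i:]), key=lambda p: -p[0] / p[1]):
            if w <= room:
                value += v; room -= w
            else:
                return value + v * room / w
        return value

    def branch(i, value, room):
        best[0] = max(best[0], value)
        if i == len(values) or upper_bound(i, value, room) <= best[0]:
            return                          # prune: this branch cannot beat the best
        if weights[i] <= room:              # branch 1: take item i
            branch(i + 1, value + values[i], room - weights[i])
        branch(i + 1, value, room)          # branch 2: skip item i

    branch(0, 0, capacity)
    return best[0]

print(branch_and_bound_knapsack([60, 100, 120], [10, 20, 30], 50))  # 220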

    1.2.6 Brute Force Algorithm

Brute force algorithms bring the optimum results: if a number of data items are to be processed, all of them are taken into consideration, and each and every item is processed individually to get the desired output.
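A minimal Python sketch of the brute-force pattern (the task, finding the pair whose sum is closest to a target, is an illustrative assumption): every candidate is examined individually, so the optimum is guaranteed at the cost of exhaustive work.

from itertools import combinations

def best_pair_sum(data, target):
    best, best_gap = None, float("inf")
    for pair in combinations(data, 2):      # consider every possible pair
        gap = abs(sum(pair) - target)
        if gap < best_gap:
            best, best_gap = pair, gap
    return best

print(best_pair_sum([3, 9, 14, 20, 25], target=30))  # (9, 20)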


    1.2.7 Machine Learning Algorithm

The machine learning algorithm mainly deals with the following types of learning (Sergios Theodoridis et al. 2009):

    Supervised Learning

    Unsupervised Learning

    Reinforcement Learning

Supervised Learning

The steps involved in supervised learning are:

The sample input (with known outputs) is given.

Separate test and training data are supported.

When new data or a test input is given, the system takes the sample data as reference and processes it accordingly.

Unsupervised Learning

The steps involved in unsupervised learning are:

No labelled test sample is provided.

The input is given directly to the system, and processing is done by automatic learning (a combined sketch of both modes follows below).
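A hedged Python sketch (assuming scikit-learn is available; the toy data and labels are illustrative) contrasting the two modes: the supervised model is given labelled training samples, while the unsupervised model receives the inputs directly and groups them itself.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y_train = ["coil", "coil", "helix", "helix"]        # labels given: supervised

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(clf.predict([[0.85, 0.85]]))                  # uses the sample data as reference

km = KMeans(n_clusters=2, n_init=10).fit(X_train)   # no labels: unsupervised
print(km.labels_)                                   # groups found automatically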


    Reinforcement Learning

It is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning

    algorithms attempt to find a policy that maps states of the world to the actions the agent ought to

    take in those states. In economics and game theory, reinforcement learning is considered as a

    boundedly rational interpretation of how equilibrium may arise.

    Reinforcement learning is particularly well suited to problems which include a long-term

    versus short-term reward trade-off. It has been applied successfully to various problems,

    including robot control, elevator scheduling, telecommunications, etc.
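A hedged Python sketch of the idea above: an agent learns a policy mapping states to actions by updating long-term value estimates (here tabular Q-learning; the tiny chain world, reward and parameters are illustrative assumptions, not from the dissertation).

import random

n_states, actions = 5, [-1, +1]            # states 0..4; move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma = 0.5, 0.9                    # learning rate and discount factor

for _ in range(300):                       # episodes with a random behavior policy
    s = 0
    for _ in range(200):
        a = random.choice(actions)         # explore; Q-learning is off-policy
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0   # long-term reward only at the goal
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2
        if s == n_states - 1:
            break

# The learned policy maps each state to its highest-value action (move right).
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)})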

Thus, bioinformatics algorithms can be applied to the statistical analysis of any data.

    1.3 PROTEINS

Proteins are the most abundant organic molecules of the living system. They occur in every part of the cell and constitute about 50% of cellular dry weight. Proteins form the fundamental basis of the structure and function of life.

    1.3.1 Functions of proteins

    Proteins perform a great variety of specialized and essential functions in the living cells.

These functions may be broadly grouped as static (structural) and dynamic (U.Satyanarayana.

    2007).

Structural functions: Some proteins perform brick-and-mortar roles and are primarily responsible for the structure and strength of the body. These include collagen and elastin, found in bone matrix, the vascular system and other organs, and α-keratin, present in epidermal tissues.

    Dynamic functions: The dynamic functions of proteins are more diversified in nature. These

    include proteins acting as enzymes, hormones, blood clotting factors, immunoglobulins,


    membrane receptors, storage proteins, besides their function in genetic control, muscle

contraction, respiration, etc. Proteins performing dynamic functions are appropriately regarded as the workhorses of the cell.

    1.3.2 Levels of protein structure

Proteins are polymers of L-α-amino acids. The structure of proteins is rather complex and can be divided into four levels of organization (U.Satyanarayana. 2007).

    Primary structure

    The linear sequence of amino acids forms the backbone of proteins (polypeptides).

    Secondary structure

The spatial arrangement of amino acids formed by twisting of the polypeptide chain.

    Tertiary structure

    It represents the three dimensional structure of a functional protein.

    Quaternary structure

Some of the proteins are composed of two or more polypeptide chains referred to as subunits. The spatial arrangement of these subunits is known as quaternary structure.

    1.3.3 Primary structure of protein

    Each protein has a unique sequence of amino acids which is determined by the genes

    contained in DNA. The primary structure of a protein is largely responsible for its function

    (U.Satyanarayana. 2007).

    Peptide bond

The amino acids are held together in a protein by covalent peptide bonds or linkages. These bonds are rather strong and serve as the cementing material between the individual amino acids.

Formation of peptide bond: When the amino group of one amino acid combines with the carboxyl group of another amino acid, a peptide bond is formed. A dipeptide will have two amino acids and one peptide bond (not two). Peptides containing more than 10 amino acids (decapeptide) are referred to as polypeptides (U.Satyanarayana. 2007).


    Determination of primary structure

The primary structure comprises the identification of constituent amino acids with regard to their quality, quantity and sequence in a protein structure.

    A pure sample of a protein or a polypeptide is essential for the determination of primary

    structure which involves 3 stages (U.Satyanarayana. 2007).

    1. Determination of amino acid composition.

    2. Degradation of protein or polypeptide into smaller fragments.

    3. Determination of the amino acid sequence.

    1.3.4 Secondary structure of protein

The conformation of the polypeptide chain by twisting or folding is referred to as secondary

    structure. The amino acids are located close to each other in their sequence. The following

    secondary structures are mainly identified (U.Satyanarayana. 2007).

The Alpha Helix Is a Coiled Structure Stabilized by Intrachain Hydrogen Bonds

In evaluating potential structures, Pauling and Corey considered which conformations of peptides were sterically allowed and which most fully exploited the hydrogen-bonding capacity of the backbone NH and CO groups. The first of their proposed structures, the α helix, is a rodlike structure. A tightly coiled backbone forms the inner part of the rod and the side chains extend outward in a helical array. The α helix is stabilized by hydrogen bonds between the NH and CO groups of the main chain. In particular, the CO group of each amino acid forms a hydrogen bond with the NH group of the amino acid that is situated four residues ahead in the sequence. Thus, except for amino acids near the ends of an α helix, all the main-chain CO and NH groups are hydrogen bonded. Each residue is related to the next one by a rise of 1.5 Å along the helix axis and a rotation of 100 degrees, which gives 3.6 amino acid residues per turn of helix.


Thus, amino acids spaced three and four apart in the sequence are spatially quite close to one another in an α helix. In contrast, amino acids two apart in the sequence are situated on opposite sides of the helix and so are unlikely to make contact. The pitch of the α helix, which is equal to the product of the translation (1.5 Å) and the number of residues per turn (3.6), is 5.4 Å. The screw sense of a helix can be right-handed (clockwise) or left-handed (counterclockwise). The Ramachandran diagram reveals that both the right-handed and the left-handed helices are among the allowed conformations. However, right-handed helices are energetically more favorable because there is less steric clash between the side chains and the backbone. Essentially all α helices found in proteins are right-handed. In schematic diagrams of proteins, α helices are depicted as twisted ribbons or rods.

When Pauling and Corey predicted the structure of the α helix, it was actually seen in the X-ray reconstruction of the structure of myoglobin. The elucidation of the structure of the α helix is a landmark in biochemistry because it demonstrated that the conformation of a polypeptide chain can be predicted if the properties of its components are rigorously and precisely known.

The α-helical content of proteins ranges widely, from nearly none to almost 100%. For example, about 75% of the residues in ferritin, a protein that helps store iron, are in α helices. Single α helices are usually less than 45 Å long. However, two or more α helices can entwine to form a very stable structure, which can have a length of 1000 Å (100 nm, or 0.1 μm) or more. Such α-helical coiled coils are found in myosin and tropomyosin in muscle, in fibrin in blood clots, and in keratin in hair. The helical cables in these proteins serve a mechanical role in forming stiff bundles of fibers, as in porcupine quills. The cytoskeleton (internal scaffolding) of cells is rich in so-called intermediate filaments, which also are two-stranded α-helical coiled coils. Many proteins that span biological membranes also contain α helices (Jeremy M et al. 1975).


Figure 2: The structure of the α helix


Beta Sheets Are Stabilized by Hydrogen Bonding Between Polypeptide Strands

Pauling and Corey discovered another periodic structural motif, which they named the β pleated sheet (β because it was the second structure that they elucidated, the α helix having been the first). The β pleated sheet (or, more simply, the β sheet) differs markedly from the rodlike α helix. A polypeptide chain, called a β strand, in a β sheet is almost fully extended rather than being tightly coiled as in the α helix. A range of extended structures are sterically allowed.

The distance between adjacent amino acids along a β strand is approximately 3.5 Å, in contrast with a distance of 1.5 Å along an α helix. The side chains of adjacent amino acids point in opposite directions. A β sheet is formed by linking two or more β strands by hydrogen bonds. Adjacent chains in a β sheet can run in opposite directions (antiparallel β sheet) or in the same direction (parallel β sheet). In the antiparallel arrangement, the NH group and the CO group of each amino acid are respectively hydrogen bonded to the CO group and the NH group of a partner on the adjacent chain.

In the parallel arrangement, the hydrogen-bonding scheme is slightly more complicated. For each amino acid, the NH group is hydrogen bonded to the CO group of one amino acid on the adjacent strand, whereas the CO group is hydrogen bonded to the NH group on the amino acid two residues farther along the chain. Many strands, typically 4 or 5 but as many as 10 or more, can come together in β sheets. Such β sheets can be purely antiparallel, purely parallel, or mixed.

In schematic diagrams, β strands are usually depicted by broad arrows pointing in the direction of the carboxyl-terminal end to indicate the type of β sheet formed, parallel or antiparallel. More structurally diverse than α helices, β sheets can be relatively flat but most adopt a somewhat twisted shape.

The β sheet is an important structural element in many proteins. For example, fatty acid-binding proteins, important for lipid metabolism, are built almost entirely from β sheets (Jeremy M et al. 1975).


Figure 3: The structure of β strands. Two or more such strands may come together to form β sheets in parallel (a) or antiparallel (b) orientation.



Polypeptide Chains Can Change Direction by Making Reverse Turns and Loops

Most proteins have compact, globular shapes, requiring reversals in the direction of their polypeptide chains. Many of these reversals are accomplished by a common structural element called the reverse turn (also known as the β turn or hairpin bend). In many reverse turns, the CO group of residue i of a polypeptide is hydrogen bonded to the NH group of residue i + 3. This interaction stabilizes abrupt changes in direction of the polypeptide chain. In other cases, more elaborate structures are responsible for chain reversals. These structures are called loops or sometimes Ω loops (omega loops) to suggest their overall shape. Unlike α helices and β strands, loops do not have regular, periodic structures. Nonetheless, loop structures are often rigid and well defined. Turns and loops invariably lie on the surfaces of proteins and thus often participate in interactions between proteins and other molecules. The distribution of α helices, β strands, and turns along a protein chain is often referred to as its secondary structure (Jeremy M et al. 1975).

    1.3.5 Tertiary structure of protein

The three-dimensional arrangement of protein structure is referred to as tertiary structure. It is a compact structure with hydrophobic side chains held in the interior while the hydrophilic groups are on the surface of the protein molecule. This type of arrangement ensures the stability of the molecule (U.Satyanarayana 2007).

    Bonds of tertiary structure: Besides the hydrogen bonds, disulfide bonds (-S-S-), ionic

    interactions (electrostatic bonds) and hydrophobic interactions also contribute to the tertiary

    structure of proteins.

Domain: The term domain is used to represent the basic units of protein structure (tertiary) and function. A polypeptide with 200 amino acids normally consists of two or more domains.


    1.3.6 Quaternary structure of protein

    A great majority of the proteins are composed of single polypeptide chains. Some of the

    proteins, however, consist of two or more polypeptides which may be identical or unrelated.

    Such proteins are termed as oligomers and possess quaternary structure.

The individual polypeptide chains are known as monomers, protomers, or subunits. A dimer consists of two polypeptides while a tetramer has four (U.Satyanarayana. 2007).

Bonds in quaternary structure: The monomeric subunits are held together by non-covalent bonds, namely hydrogen bonds, hydrophobic interactions and ionic bonds.

    1.4 Protein Structure Prediction

Protein structure prediction is the prediction of the three-dimensional structure of a protein from its amino acid sequence; that is, the prediction of a protein's tertiary structure from its primary structure (structure prediction is fundamentally different from the inverse problem of protein design). Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry. It is of high importance in medicine (for example, in drug design) and biotechnology (for example, in the design of novel enzymes) (Mount DM 2004).

    Proteins are an important class of biological macromolecules present in all biological

    organisms, made up primarily from the elements carbon, hydrogen, nitrogen, oxygen, and

    sulphur. All proteins are polymers of amino acids. Classified by their physical size, proteins are

nanoparticles (definition: 1-100 nm). Each protein polymer, also known as a polypeptide, consists of a sequence formed from 20 different L-α-amino acids, also referred to as residues. For chains under 40 residues the term peptide is frequently used instead of protein.

    To be able to perform their biological function, proteins fold into one or more specific

    spatial conformations, driven by a number of non-covalent interactions such as hydrogen

bonding, ionic interactions, van der Waals forces, and hydrophobic packing. To understand the

    functions of proteins at a molecular level, it is often necessary to determine their three-

    dimensional structure. This is the topic of the scientific field of structural biology, which


    employs techniques such as X-ray crystallography, NMR spectroscopy, and dual polarization

    interferometry to determine the structure of proteins.

    1.5 Secondary structure prediction

    Secondary structure prediction is a set of techniques in bioinformatics that aim to

    predict the local secondary structures of proteins and RNA sequences based only on knowledge

    of their primary structure - amino acid or nucleotide sequence, respectively. For proteins, a

    prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta

strands (often noted as "extended" conformations), or turns. Globular protein domains are typically composed of the two basic secondary structure types, the α-helix and the β-strand, which are easily distinguishable because of their regular (periodic) character. Other types of secondary structures such as different turns, bends, bridges, and non-α helices are less frequent and more difficult to observe and classify for a non-expert. The non-α, non-β structures are often referred to as coil or loop, and the majority of secondary structure prediction methods are aimed at predicting only these three classes of local structure. Given the observed distribution of the three states in globular proteins (about 30% α-helix, 20% β-strand and 50% coil), random prediction should yield about 40% accuracy per residue (guessing each state in proportion to its background frequency succeeds with probability 0.3^2 + 0.2^2 + 0.5^2 = 0.38).

The accuracy of the secondary structure prediction methods devised earlier, such as Chou-Fasman (1974) or GOR (Garnier et al. 1978), is in the range of 50-55%. The best modern secondary structure prediction methods have reached a sustained level of 76% accuracy for the last 2 years, with α-helices predicted with ca. 10% higher accuracy than β-strands (Koh et al. 2003). Hence, it is quite surprising that the early mediocre methods are still used in good faith by many researchers; maybe even more surprising that they are sometimes recommended in contemporary reviews of bioinformatics software or built in as a default method in new versions of commercial software packages for protein sequence analysis and structure modeling. Modern secondary structure prediction methods typically perform analyses not for single target sequences, but rather utilize the evolutionary information derived from an MSA provided by the user or generated by an internal routine for database searches and alignment (Levin et al. 1993).

    The information from the MSA provides a better insight into the positional conservation of


    physico-chemical features such as hydrophobicity and hints at a position of loops in the regions

    of insertions and deletions (indels) corresponding to gaps in the alignment.

    It is also recommended to combine different methods for secondary structure prediction;

the ways of combining predictions may include the calculation of a simple consensus or more

    advanced approaches, including machine learning, such as voting, linear discrimination, neural

    networks and decision trees (King et al. 2000). JPRED (Cuff et al. 1998) is an example of a

    consensus meta-server that returns predictions from several secondary structure prediction

    methods (mostly third-party algorithms) and infers a consensus using a neural network, thereby

improving the average accuracy of prediction. In addition, JPRED predicts the relative solvent accessibility of each residue in the target sequence, which is very useful for identification of

    solvent-exposed and buried faces of amphipathic helices. In general, the most effective

    secondary structure prediction strategies follow these rules: (1) if an experimentally

    determined three-dimensional structure of a closely related protein is known, copy the secondary

structure assignment from the known structure rather than attempt to predict it de novo. (2) If no related structures are known, use multiple sequence information. If your target sequence shows similarity to only a few (or no) other proteins with sequence identity


    In our own hands, the application of these rules in a semi-automated manner (i.e. human

post-processing of predictions generated by various individual methods) led to a very high

    accuracy of 83 % per residue (better than any single server or any other human predictor)

    according to the recent evaluation within the CASP-5 experiment

    (http://predictioncenter.llnl.gov/casp5/) (I. Cymerman et al.).

    1.6 Features Which Are Used For the Protein Secondary Structure Prediction

A protein sequence contains characters from the 20-letter amino acid alphabet {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. An important issue in applying neural

    networks to protein sequence classification is how to encode protein sequences, i.e., how to

    represent the protein sequences as the input of the neural networks. Good input representations

    make it easier for the neural networks to recognize underlying regularities. Thus, good input

representations are crucial to the success of neural network learning (Hirsh H. et al.). Hence, feature extraction from the protein sequence is an essential step for the prediction of 2D or 3D structure based on the primary sequence.

The best high-level features should be relevant. By relevant we mean that there should be high mutual information between the features and the output of the neural networks, where the mutual information measures the average reduction in uncertainty about the output of the neural networks given the values of the features (Wang J. T. L. et al.).
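A minimal Python sketch of the encoding issue discussed above: a sequence over the 20-letter alphabet is turned into a fixed numeric representation (here simple one-hot vectors per residue, one common choice among many) before it is fed to a network. The function name is illustrative.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence):
    """Encode each residue as a 20-dimensional indicator vector."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    return [[1 if i == index[aa] else 0 for i in range(20)] for aa in sequence]

print(one_hot("ACD")[0])  # the vector for 'A': 1 in position 0, 0 elsewhere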

    Amino Acid Exchange Group

One particular feature which has been considered here is the 6-letter amino acid exchange group {e1, e2, e3, e4, e5, e6}, where e1 = {H, R, K}, e2 = {D, E, N, Q}, e3 = {C}, e4 = {S, T, P, A, G}, e5 = {M, I, L, V}, and e6 = {F, Y, W}. Exchange groups represent conservative replacements through evolution. These exchange groups are effectively equivalence classes of amino acids and are derived from PAM (C. H. Wu et al., Dayhoff M. O. et al.).


    Amino Acid Grouping

The 6-letter exchange group is taken to group the amino acids according to the amino acid properties:

    e1 = {H, R, K}, e2 = {D, E, N, Q}, e3 = {C}, e4 = {S, T, P, A, G}, e5 = {M, I, L, V}, e6 = {F,

    Y, W}

    Exchange groups represent conservative replacements through evolution. These exchange

    groups are effectively equivalence classes of amino acids and are derived from PAM.
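A minimal Python sketch mapping each residue to its exchange group, compressing the 20-letter alphabet into the six equivalence classes listed above; the dictionary names are illustrative.

EXCHANGE_GROUPS = {
    "e1": "HRK", "e2": "DENQ", "e3": "C",
    "e4": "STPAG", "e5": "MILV", "e6": "FYW",
}
RESIDUE_TO_GROUP = {aa: g for g, members in EXCHANGE_GROUPS.items() for aa in members}

print([RESIDUE_TO_GROUP[aa] for aa in "MKVLA"])  # ['e5', 'e1', 'e5', 'e5', 'e4']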

    Amino Acid Frequency

Another feature which has been considered here is the frequency of occurrence of particular amino acid residues in these secondary structures, as shown in the table below, which helps in determining whether a particular sequence in a protein forms an α helix, a β strand, or a turn. Residues such as alanine, glutamate, and leucine tend to be present in α helices, whereas valine and isoleucine tend to be present in β strands. Glycine, asparagine, and proline have a propensity for being in turns (Jeremy M. Berg et al.).

The results of studies of proteins and synthetic peptides have revealed some reasons for these preferences. The α helix can be regarded as the default conformation. Branching at the β-carbon atom, as in valine, threonine, and isoleucine, tends to destabilize α helices because of steric clashes. These residues are readily accommodated in β strands, in which their side chains project out of the plane containing the main chain. Serine, aspartate, and asparagine tend to disrupt α helices because their side chains contain hydrogen-bond donors or acceptors in close proximity to the main chain, where they compete for main-chain NH and CO groups. Proline tends to disrupt both α helices and β strands because it lacks an NH group and because its ring structure restricts its φ value to near -60 degrees. Glycine readily fits into all structures and for that reason does not favor helix formation in particular (Jeremy M. Berg et al.).


Frequency Values for Each Amino Acid in Secondary Structure

Table 2: Displaying amino acid frequency values (Creighton T. E. et al.). The amino acids are grouped according to their preference for α helices (top group), β sheets (second group), or turns (third group).

Propensity Values for Amino Acids in Secondary Structure

Amino acid   Amino_num   Prop_alpha   Prop_beta   Prop_other
A            1           142          83          66
C            2           70           119         119
D            3           101          54          146
E            4           151          37          74
F            5           113          138         60
G            6           57           75          156
H            7           100          87          95
I            8           108          160         47
K            9           114          74          101
L            10          121          130         59
M            11          145          105         60
N            12          67           89          156
P            13          57           55          152
Q            14          111          110         98
R            15          98           93          95
S            16          77           75          143
T            17          83           119         96
V            18          106          170         50
W            19          108          137         96
Y            20          69           147         114

Table 3: Displaying amino acid propensity values (Creighton T. E. et al.)
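A hedged Python sketch of how propensity values like those in Table 3 can be used in the spirit of Chou-Fasman-style scoring (this is illustrative, not the dissertation's method): average the helix propensity over a sliding window and flag helix-favoring regions. Only a few residues' values are reproduced.

PROP_ALPHA = {"A": 142, "E": 151, "L": 121, "G": 57, "P": 57, "V": 106}

def mean_alpha_propensity(seq, start, width=6):
    """Average Prop_alpha over a window; high values suggest helix."""
    window = seq[start:start + width]
    return sum(PROP_ALPHA[aa] for aa in window) / len(window)

seq = "AELAEL" + "GPGPGP"
print(mean_alpha_propensity(seq, 0))  # 138.0: helix-favoring window
print(mean_alpha_propensity(seq, 6))  # 57.0: helix-breaking window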

    Amino Acid Evolutionary Score

    Evolutionary score for each amino acid is considered as the next feature to be taken as

    neural network input. Knowledge of the amino acid composition of early proteomes can reveal

    which of the amino acids have increased and which have decreased in frequency with evolution.

    Such information could provide clues to the order in which amino acids were introduced into the

genetic code and thus into primitive proteins. The approach for inferring ancestral amino acid composition is based on the insight that the amino acid composition of conserved residues in present-day proteins, i.e., those residues which are unchanged between an ancestral sequence and any given descendant sequence, is determined by two factors: (1) the amino acid composition within ancestral sequences, and (2) the relative probability of conservation of each amino acid between an ancestral and an extant descendant sequence. Reversing this logic, given the amino acid composition of conserved residues and the relative probability of conservation of each amino


    acid, the amino acid composition within ancestral sequences can be inferred. (Dawn J. Brooks

    et al.)

Evolutionary Score Calculations

Evolutionary Score for Each Amino Acid

Amino acid   Amino_num   Evol_score
A            1           0.1109
C            2           0.003
D            3           0.0577
E            4           0.0785
F            5           0.0213
G            6           0.0781
H            7           0.0321
I            8           0.0758
K            9           0.062
L            10          0.0573
M            11          0.0266
N            12          0.0397
P            13          0.0435
Q            14          0.018
R            15          0.065
S            16          0.0523
T            17          0.0624
V            18          0.0976
W            19          0.0042
Y            20          0.0139

Table 4: Displaying the amino acids' evolutionary scores


Machine Learning Approaches


2. MACHINE LEARNING APPROACHES

    2.1 Machine Learning

    Machine learning can be best described as "the study of computer algorithms that

    improve automatically through experience" (Mitchell 1996). Machine learning is learning, like

    intelligence whose definition includes phrases such as to gain knowledge, or understanding of,

    or skill in, by study, instructions, or experience. It is a natural outgrowth of the intersection of

    computer science and statistics. It covers a broad range of learning tasks, such as how to design

    autonomous mobile robots that learns to navigate from its own experience. To be more precise,

    we say that a machine learns with respect to a particular task T, performance metric P, and type

    of experience E, if the system reliably improves its performance P at task T, following

    experience E [Tom M. Mitchell, 2006].

    2.2 Artificial Neural Networks

    Artificial neural networks are massively parallel adaptive networks of simple nonlinear

computing elements called neurons, which are intended to abstract and model some of the functionality of the human nervous system in an attempt to partially capture some of its

    computational strengths. It resembles the brain in two respects:

    Knowledge is acquired by the network through a learning process.

    Interneuron connection strengths known as synaptic weights are used to store the

    knowledge.

    A computational neural network is a set of non-linear data modeling tools consisting of input

    and output layers plus one or two hidden layers. The connections between neurons in each layer

    have associated weights, which are iteratively adjusted by the training algorithm to minimize

    error and provide accurate predictions. We set the conditions under which the network learns

    and can finely control the training stopping rules and network architecture, or let the procedure

    automatically choose the architecture for us.


There are two main types of neural network models: supervised neural networks, such as the multilayer perceptron (MLP) or radial basis function (RBF) network, and unsupervised neural networks, such as Kohonen feature maps.

    2.2.1 Neural Network Structure:

    2.2.1.1 BIOLOGICAL NEURON:

A neuron's dendritic tree is connected to a thousand neighboring neurons. When one of those neurons fires, a positive or negative charge is received by one of the dendrites. The strengths of all the received charges are added together through the processes of spatial and temporal summation. Spatial summation occurs when several weak signals are converted into a single large one, while temporal summation converts a rapid series of weak pulses from one source into one large signal. The aggregate input is then passed to the soma (cell body). The soma and the enclosed nucleus don't play a significant role in the processing of incoming and outgoing data. Their primary function is to perform the continuous maintenance required to keep the neuron functional.

The part of the soma that does concern itself with the signal is the axon hillock. If the aggregate input is greater than the axon hillock's threshold value, then the neuron fires, and an

    output signal is transmitted down the axon. The strength of the output is constant, regardless of

    whether the input was just above the threshold, or a hundred times as great. The output strength

    is unaffected by the many divisions in the axon; it reaches each terminal button with the same

    intensity it had at the axon hillock.

    Figure 4: showing the signal transmission in neurons


    2.2.1.2 ARTIFICIAL NEURAL NETWORK:

Although neural networks impose minimal demands on model structure and assumptions, it is useful to understand the general network architecture. The multilayer perceptron (MLP) or radial basis function (RBF) network is a function of predictors (also called inputs or independent variables) that minimizes the prediction error of target variables (also called outputs).

    The neural network Structure shown in the figure, in that:

    The input layer contains the predictors.

    The hidden layer contains unobservable nodes, or units. The value of each hidden unit is

    some function of the predictors; the exact form of the function depends in part upon the

    network type and in part upon the user-controllable specifications.

The output layer contains the responses. For example, a categorical response with two categories is recoded as two indicator variables. Each output unit is some function of the hidden units. Again, the exact form of the function depends in part on the network type and in part on user-controllable specifications.

    The MLP network allows a second hidden layer; in that case, each unit of the second

hidden layer is a function of the units in the first hidden layer, and each response is a function of the units in the second hidden layer.

    Figure 5: showing the structure of neural network


    Figure 6: Some artificial neural network connection structures

    Figure 7: Some common activation functions


    Multilayer Perceptron:

The MLP procedure fits a particular kind of neural network called a multilayer perceptron. The multilayer perceptron is a supervised method using a feed-forward architecture. It

    can have multiple hidden layers. One or more dependent variables may be specified, which may

    be scale, categorical, or a combination. If a dependent variable has scale measurement level, then

    the neural network predicts continuous values that approximate the true value of some

    continuous function of the input data. If a dependent variable is categorical, then the neural

    network is used to classify cases into the best category based on the input predictors.

    Training

Training the network means adjusting its weights so that it gives the correct output. Training data are what the neural network uses to learn how to predict the known output. It is rather easy to train a network that has no hidden layers (called a perceptron). For each object in the training set, the attribute values and class are known.

    Type of Training

    The training type determines how the network processes the records.

    Batch. Updates the synaptic weights only after passing all training data records; that is,

    batch training uses information from all records in the training dataset. Batch training is

    often preferred because it directly minimizes the total error; however, batch training may

    need to update the weights many times until one of the stopping rules is met and hence

    may need many data passes. It is most useful for smaller datasets.

    Online. Updates the synaptic weights after every single training data record; that is,

    online training uses information from one record at a time. Online training continuously

gets a record and updates the weights until one of the stopping rules is met. If all the records are used once and none of the stopping rules is met, then the process continues by

    recycling the data records. Online training is superior to batch for larger datasets with

    associated predictors; that is, if there are many records and many inputs, and their values

    are not independent of each other, then online training can more quickly obtain a

    reasonable answer than batch training.


    Mini-batch. Divides the training data records into groups of approximately equal size,

    then updates the synaptic weights after passing one group; that is, mini-batch training

    uses information from a group of records. Then the process recycles the data group if

    necessary. Mini-batch training offers a compromise between batch and online training,

    and it may be best for medium-size datasets. The procedure can automatically

    determine the number of training records per mini-batch, or you can specify an integer

    greater than 1 and less than or equal to the maximum number of cases to store in

    memory.
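A minimal Python sketch (illustrative pseudologic, not the SPSS procedure) of how the three training types above differ only in when the weight update is applied during one data pass.

def run_epoch(records, update, mode="batch", batch_size=2):
    if mode == "online":                       # update after every single record
        for r in records:
            update([r])
    elif mode == "mini-batch":                 # update after each group of records
        for i in range(0, len(records), batch_size):
            update(records[i:i + batch_size])
    else:                                      # batch: one update from all records
        update(records)

run_epoch([1, 2, 3, 4], update=lambda group: print("update from", group), mode="mini-batch")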

    Testing

Testing is used for validation. For each object in the testing set, the attribute values are known, but the class is unknown.

    Hidden Layers

    The hidden layer contains unobservable network nodes (units). Each hidden unit is a

    function of the weighted sum of the inputs. The function is the activation function, and the values

    of the weights are determined by the estimation algorithm.

    If the network contains a second hidden layer, each hidden unit in the second layer is a

    function of the weighted sum of the units in the first hidden layer. The same activation function

    is used in both layers.

    Number of Hidden Layers. A multilayer perceptron can have one or two hidden layers.

    Activation Function. The activation function "links" the weighted sums of units in a layer to the

    values of units in the succeeding layer.

Hyperbolic tangent. This function has the form: f(c) = tanh(c) = (e^c - e^-c)/(e^c + e^-c). It takes real-valued arguments and transforms them to the range (-1, 1). When automatic architecture selection is used, this is the activation function for all units in the hidden layers.

Sigmoid. This function has the form: f(c) = 1/(1 + e^-c). It takes real-valued arguments and transforms them to the range (0, 1).


    Number of Units. The number of units in each hidden layer can be specified explicitly or

    determined automatically by the estimation algorithm.

    Output Layer

    The output layer contains the target (dependent) variables.

    Activation Function. The activation function "links" the weighted sums of units in a layer to the

    values of units in the succeeding layer.

Identity. This function has the form: f(c) = c. It takes real-valued arguments and returns them unchanged. When automatic architecture selection is used, this is the activation function for units in the output layer if there are any scale-dependent variables.

Softmax. This function has the form: f(c_k) = exp(c_k) / Σ_j exp(c_j). It takes a vector of real-valued arguments and transforms it to a vector whose elements fall in the range (0, 1) and sum to 1. Softmax is available only if all dependent variables are categorical. When automatic architecture selection is used, this is the activation function for units in the output layer if all dependent variables are categorical.

Hyperbolic tangent. This function has the form: f(c) = tanh(c) = (e^c - e^-c)/(e^c + e^-c). It takes real-valued arguments and transforms them to the range (-1, 1).

Sigmoid. This function has the form: f(c) = 1/(1 + e^-c). It takes real-valued arguments and transforms them to the range (0, 1).
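A minimal NumPy sketch of the activation functions defined above; the max-shift in softmax is a standard numerical-stability detail, not part of the definition.

import numpy as np

def tanh(c):        # range (-1, 1)
    return (np.exp(c) - np.exp(-c)) / (np.exp(c) + np.exp(-c))

def sigmoid(c):     # range (0, 1)
    return 1.0 / (1.0 + np.exp(-c))

def softmax(c):     # vector output in (0, 1), summing to 1
    e = np.exp(c - np.max(c))   # shift for numerical stability
    return e / e.sum()

def identity(c):    # returned unchanged, for scale-dependent outputs
    return c

print(tanh(0.5), sigmoid(0.5), softmax(np.array([1.0, 2.0, 3.0])))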

Figure 8: showing the structure of a Multilayer Perceptron


    Radial Basis Function:

The RBF procedure fits a radial basis function neural network, which is a feed-forward, supervised learning network with an input layer, a hidden layer called the radial basis function layer, and an output layer. The hidden layer transforms the input vectors into radial basis functions. Like the MLP procedure, the RBF procedure performs prediction and classification. The RBF procedure trains the network in two stages:

    1. The procedure determines the radial basis functions using clustering methods. The

    center and width of each radial basis function are determined.

    2. The procedure estimates the synaptic weights given the radial basis functions. The

    sum-of-squares error function with identity activation function for the output layer is used for

    both prediction and classification. Ordinary Least Squares regression is used to minimize the

    sum-of-squares error.

    Due to this two-stage training approach, the RBF network is in general trained much

    faster than MLP.

Figure 9: showing the structure of a Radial Basis Function neural network
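A hedged Python sketch (assuming scikit-learn and NumPy are available) of the two-stage training described above: stage 1 finds the basis-function centers by clustering, stage 2 solves the output weights by ordinary least squares. The shared width taken as the mean distance is a simplifying assumption, and the data are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def train_rbf(X, y, n_centers=3):
    km = KMeans(n_clusters=n_centers, n_init=10).fit(X)            # stage 1: centers
    centers = km.cluster_centers_
    width = np.mean(np.linalg.norm(X[:, None] - centers, axis=2))  # shared width
    H = np.exp(-np.linalg.norm(X[:, None] - centers, axis=2) ** 2 / (2 * width ** 2))
    w, *_ = np.linalg.lstsq(H, y, rcond=None)                      # stage 2: least squares
    return centers, width, w

def predict_rbf(X, centers, width, w):
    H = np.exp(-np.linalg.norm(X[:, None] - centers, axis=2) ** 2 / (2 * width ** 2))
    return H @ w                                # identity activation at the output

X = np.random.rand(40, 2)
y = np.sin(X[:, 0] * 3) + X[:, 1]
params = train_rbf(X, y)
print(predict_rbf(X[:5], *params))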


    2.3 Support vector machines

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. In simple words, given a set of training examples, each marked

    as belonging to one of two categories, an SVM training algorithm builds a model that predicts

    whether a new example falls into one category or the other. Intuitively, an SVM model is a

    representation of the examples as points in space, mapped so that the examples of the separate

    categories are divided by a clear gap that is as wide as possible. New examples are then mapped

    into that same space and predicted to belong to a category based on which side of the gap they

    fall on.

More formally, a support vector machine constructs a hyperplane or set of hyperplanes

    in a high or infinite dimensional space, which can be used for classification, regression or other

    tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to

    the nearest training data points of any class (so-called functional margin), since in general the

    larger the margin the lower the generalization error of the classifier (David Meyer et al.2003).

    2.3.1 Motivation

    Classifying data is a common task in machine learning. Suppose some given data points

each belonging to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p - 1)-dimensional hyperplane. This is called a linear classifier.

There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two

    classes. So we choose the hyperplane so that the distance from it to the nearest data point on each

    side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane

    and the linear classifier it defines is known as a maximum margin classifier (Corinna Cortes et

    al. 1995).
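A hedged Python sketch (assuming scikit-learn is available; the data are illustrative) of the maximum-margin idea above: a linear SVM is fitted on two separable classes, and a new point is classified by the side of the learned hyperplane on which it falls.

from sklearn.svm import SVC

X = [[0, 0], [1, 1], [0, 1], [8, 8], [9, 9], [8, 9]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0).fit(X, y)     # maximum-margin hyperplane
print(clf.predict([[2, 2], [7, 8]]))            # sides of the gap: [0 1]
print(clf.support_vectors_)                     # the points that fix the margin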


Figure 10: showing the motivation. H3 (green) doesn't separate the two classes; H1 (blue) does, with a small margin, and H2 (red) with the maximum margin.

    2.3.2 Applications in Bioinformatics

    Support vector machines are a natural match for the features of many bioinformatics

    datasets and deliver state-of-the-art performance in several applications. For microarray gene

    expression data, the SVM is becoming the system of choice. SVMs are currently among the best

    performers for a number of classification tasks ranging from text to genomic data.
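    As an illustration of the ideas above, the following is a minimal sketch of training a maximum-margin classifier with the scikit-learn library; the toy data points, labels, and parameter values are invented for illustration and are not from this study.

    # Minimal maximum-margin classifier sketch (toy data, illustrative only).
    from sklearn import svm

    X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # training points (2-dimensional vectors)
    y = [0, 0, 1, 1]                        # class labels

    clf = svm.SVC(kernel='linear', C=1.0)   # linear kernel: a separating hyperplane
    clf.fit(X, y)

    # New points are assigned to a class by which side of the gap they fall on.
    print(clf.predict([[0.2, 0.4], [0.9, 0.7]]))
    print(clf.support_vectors_)             # the training points that define the margin

    The support vectors printed at the end are exactly the nearest points to the hyperplane; all other training points could be removed without changing the decision boundary.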

    2.4 Genetic Algorithms (GAs)

    The basic idea of a genetic algorithm (GA) is quite simple. A GA works not with a single

    solution iterated over time but with a whole population of solutions in each algorithm iteration.

    The population contains many (typically several hundred) individuals: bit strings representing

    solutions. Evolutionary algorithms deal with similar strings of more general form, e.g.,

    containing integer numbers or characters. The mechanism of a GA involves only elementary

    operations such as string copying, partial bit swapping, and bit value changing (Goldberg et al.

    1989).

    A GA starts with a population of strings and thereafter generates successive populations

    using three basic operations: reproduction, crossover, and mutation. Reproduction

    is the process by which individual strings are copied according to an objective function value

    (fitness).


    Copying strings according to their fitness means that strings with a higher value have a higher probability of contributing one or more offspring to the next generation. This

    is an artificial version of natural selection. Mutation is an occasional (small-probability)

    random alteration of a string position's value. It is needed because, although

    reproduction and crossover effectively search and recombine the existing representations,

    they occasionally become overzealous and lose some potentially useful genetic material; the

    mutation operator prevents such an irrecoverable loss. The recombination mechanism allows

    parental information to be mixed while passing it to descendants, and mutation introduces innovation into the population.

    Figure 11: The genetic algorithm, based on the Darwinian paradigm

    In spite of its simple principles, designing a GA for successful practical use is

    surprisingly complicated. A GA has many parameters that depend on the problem to be solved.

    The first is the population size. Larger populations usually decrease the number of iterations

    needed but dramatically increase the computing time for each iteration. The factors increasing

    demands on the population size are the complexity of the problem being solved and the length

    of the individuals. Every individual contains one or more chromosomes holding the value of a

    potential solution, and chromosomes consist of genes. The gene in our version of the GA is a structure

    representing one bit of the solution value. It is usually advantageous to use some redundancy in


    genes, so the physical length of our genes is greater than one bit. This type of redundancy

    was introduced by Ryan.

    To prevent degeneration and deadlock in a local extreme, a limited lifetime for each

    individual can be used. This limited lifetime is realized by the death operator, which acts as

    a continual partial restart of the GA. This operator enables the population size to be decreased as

    well as the speed of fitness improvement to be increased. It is necessary to store the best solution

    obtained separately, since the corresponding individual need not always be present in the

    population because of the limited lifetime.

    Many GAs are implemented on a population consisting of haploid individuals (each

    individual contains just one chromosome). However, in nature many living organisms have more than one chromosome, and there are mechanisms to determine dominant genes.

    Sexual recombination generates an endless variety of genotype combinations that

    increases the evolutionary potential of the population. Because it increases the variation among

    the offspring produced by an individual, it improves the chance that some of them will be

    successful in the varying and often unpredictable environments they will encounter. Using diploid or

    multiploid individuals can often decrease demands on the population size. However, since the use of a

    multiploid GA with sexual reproduction brings some complications, the advantage of

    multiploidy can often be substituted by the death operator and redundant gene coding.

    New individuals are created by an operation called crossover. In the simplest case, crossover

    means swapping two parts of two chromosomes split at a randomly selected point (so-called one-

    point crossover). In our GA we use uniform crossover at the bit level. The strategy for selecting

    individuals for crossover is very important; it strongly determines the behavior of the GA.

    Genetic algorithms commonly use heuristic and stochastic approaches. From the

    theoretical viewpoint, the convergence of heuristic algorithms is not guaranteed in most application cases, which is why defining the stopping rule of a GA poses a new

    problem. It can be shown that, with a proper version of the GA, a typical number of

    iterations can be determined.


    Details of the GA implementation are specified as follows:

    1. Generation of the initial population: At the beginning the whole population is generated

    randomly, and the members are sorted by fitness (in descending order).

    2. Mutation: The mutation is applied to each gene with the same probability. Mutation

    of a gene means the inversion of one randomly selected bit in the gene.

    3. Death: Classical GAs use two main operations, crossover and mutation (a further operation may be migration). In our GA a third operation, death, is also available. Every

    individual carries an additional piece of information, its age: a simple counter that is incremented in

    each GA iteration.

    If the age of any member reaches the preset lifetime limit LT, the member dies and is

    immediately replaced by a new randomly generated member. The age is neither mutated nor

    crossed over, and the age of new individuals (including individuals created by crossover) is set to

    zero.

    4. Sorting: realized by the fitness function values.

    5. Crossover: Uniform crossover is used for all genes (each bit of the offspring gene is

    selected separately from the corresponding bits of the genes of both parents).

    6. While the stopping rule is not satisfied, go to step 2.
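    To make steps 1 to 6 concrete, the following is a minimal Python sketch of this loop (the study itself used the MATLAB Genetic Algorithm Toolbox, described later); the bit-string length, population size, lifetime limit LT, mutation probability, iteration budget, and the counting-ones fitness function are all illustrative assumptions.

    # Minimal GA loop following steps 1-6 above (all parameters illustrative).
    import random

    BITS, POP, LT, P_MUT, GENS = 16, 50, 10, 0.02, 200

    def fitness(bits):                      # placeholder objective: count of ones
        return sum(bits)

    def new_individual():
        return {'genes': [random.randint(0, 1) for _ in range(BITS)], 'age': 0}

    # 1. Generate the initial population and sort by fitness (descending).
    pop = sorted((new_individual() for _ in range(POP)),
                 key=lambda i: fitness(i['genes']), reverse=True)
    best = dict(pop[0])                     # best solution is stored separately

    for gen in range(GENS):                 # 6. stopping rule: fixed iteration budget
        for ind in pop:
            # 2. Mutation: invert each bit with the same small probability.
            ind['genes'] = [b ^ 1 if random.random() < P_MUT else b
                            for b in ind['genes']]
            # 3. Death: members past the lifetime limit LT are replaced at once.
            ind['age'] += 1
            if ind['age'] >= LT:
                ind.update(new_individual())
        # 4. Sorting by the fitness function values.
        pop.sort(key=lambda i: fitness(i['genes']), reverse=True)
        # 5. Uniform crossover: each child bit is chosen from one of the parents.
        child = [random.choice(p) for p in zip(pop[0]['genes'], pop[1]['genes'])]
        pop[-1] = {'genes': child, 'age': 0}  # offspring age is set to zero
        if fitness(pop[0]['genes']) > fitness(best['genes']):
            best = {'genes': list(pop[0]['genes']), 'age': 0}

    print(fitness(best['genes']), best['genes'])

    Note that the best solution is kept outside the population, as required above, because the death operator may remove the corresponding individual at any iteration.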


    Figure 12: The genetic algorithm as a conceptual algorithm

    2.4.1 Problems suited to Genetic Algorithms

    Table 5: Problems suited to genetic algorithms

    Domain                       Application Types

    Control                      gas pipeline, pole balancing, missile evasion, pursuit

    Design                       semiconductor layout, aircraft design, keyboard configuration, communication networks

    Scheduling                   manufacturing, facility scheduling, resource allocation

    Robotics                     trajectory planning

    Machine Learning             designing neural networks, improving classification algorithms, classifier systems

    Signal Processing            filter design

    Game Playing                 poker, checkers, prisoner's dilemma

    Combinatorial Optimization   set covering, travelling salesman, routing, bin packing, graph colouring and partitioning


    2.6 Fuzzy Logic

    The concept of Fuzzy Logic (FL) was conceived by Lotfi Zadeh, a professor at the University of California at Berkeley. It was presented not as a control methodology but as a way

    of processing data by allowing partial set membership rather than crisp set membership or non-

    membership. Fuzzy Logic is a problem-solving control-system methodology that lends itself to

    implementation in systems ranging from simple, small, embedded micro-controllers to large,

    networked, multi-channel PC- or workstation-based data acquisition and control systems. It can

    be implemented in hardware, software, or a combination of both. It provides a simple way to

    arrive at a definite conclusion based upon vague, ambiguous, imprecise, noisy, or missing input

    information. FL's approach to control problems mimics how a person would make decisions,

    only much faster (Biacino et al. 2002).
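    As a small illustration of partial set membership, the following Python sketch defines a triangular membership function over an invented "medium expression level" fuzzy set; the breakpoints and the example values are purely illustrative.

    # Partial set membership via a triangular membership function (illustrative).
    def triangular(x, a, b, c):
        """Degree of membership: rises from a to b, falls from b to c."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    # "Medium expression" as a fuzzy set over normalised expression levels.
    for level in (0.2, 0.45, 0.5, 0.7):
        print(level, triangular(level, 0.3, 0.5, 0.7))
    # 0.45 belongs to the set to degree 0.75; crisp logic would force 0 or 1.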

    2.6.1 Fuzzy Logic in Bioinformatics

    Fuzzy logic and fuzzy technology are now frequently used in bioinformatics. The following

    are some examples of their use in bioinformatics:

    1. To analyze gene expression data.

    2. To unravel functional and ancestral relationships between proteins via fuzzy alignment

    methods, or using a generalized radial basis function neural network architecture that

    generates fuzzy classification rules.

    3. To analyze the relationships between genes and decipher a genetic network.

    4. To process complementary deoxyribonucleic acid (cDNA) microarray images; the

    procedure should be automated due to the large number of spots, and this is achieved using a

    fuzzy vector filtering framework.

    5. To classify amino acid sequences into different superfamilies.


    2.7 Hidden Markov Model (HMM)

    Hidden Markov models are statistical tools for modelling complex stochastic phenomena, and they have a long history. In 1913, the first studies [Markov, 1913] on

    Markov chains were used to analyze language and led A. A. Markov to the conception of

    Markov chain theory. From 1948 to 1951, Shannon established information theory using

    Markov chains [Shannon, 1948, Shannon, 1951]. In 1966, research by L. E. Baum and T. Petrie

    [Baum and Petrie, 1966, Baum and Eagon, 1967, Baum and Sell, 1968, Petrie, 1969, Baum et al.,

    1970, Baum, 1972] defined algorithms to train HMMs. In 1980, variable-duration HMMs were

    defined [Furguson, 1980]. In the 1980s, neural networks were incorporated into HMMs [Bourland and Wellekens, 1990], extending the applications of HMMs to speech analysis [Bahl et al.,

    1983, Rabiner et al., 1983, Juang and Rabiner, 1986, Rabiner and Levinson, 1985, Poritz and

    Richter, 1986, Rosemberg and Colla, 1987, Euler and Wolf, 1988, Dours, 1989]. Nowadays,

    HMMs are used for task scheduling [Soukhal et al., 2001b, Soukhal et al., 2001a] and

    information technologies [Zaragoza and Gallinari, 1998, Amini, 2001, Serradura et al., 2001].

    Many variants of the genuine HMM have been created to solve particular problems. Continuous-

    density HMMs [Rabiner, 1989], hierarchical HMMs [Fine et al., 1998], multidimensional

    HMMs with independent processes [Brouard, 1999] and symbol-substitution HMMs [Aupetit

    et al., 2002, Aupetit, 2005] are some examples of those new models.

    An HMM is a finite set of states, each of which is associated with a (generally

    multidimensional) probability distribution. Transitions among the states are governed by a set of

    probabilities called transition probabilities. The sum of all transition probabilities out of a given

    state has to equal 1.0, as does the sum of all observation probabilities for a particular state. In a

    regular Markov model, the state is directly visible to the observer, and therefore the state

    transition probabilities are the only parameters. In a hidden Markov model, only the outcome, not the state,

    is visible to an external observer, so the states are "hidden" to the outside; hence the name

    Hidden Markov Model. Each state has a probability distribution over the possible output tokens,

    so the sequence of tokens generated by an HMM gives some information about the

    sequence of states.


    Note that the adjective 'hidden' refers to the state sequence through which the model

    passes, not to the parameters of the model; even if the model parameters are known exactly, the

    model is still 'hidden'.

    Hidden Markov models are especially known for their application in temporal pattern

    recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical

    score following, partial discharges and bioinformatics.

    An HMM is characterized by:

    The number of states of the model, N.

    The number of observation symbols in the alphabet, M.

    A set of state transition probabilities.

    Figure 14: A hidden Markov model, where x = states, y = possible observations, a = state transition probabilities, and b = output probabilities.
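    To make the notation concrete, the following is a minimal Python sketch of the forward algorithm, which computes the likelihood of an observation sequence given the model; the two-state transition matrix A, output matrix B, and initial distribution below are invented for illustration.

    # Forward algorithm for an HMM with N=2 states and M=2 symbols (toy values).
    import numpy as np

    A  = np.array([[0.7, 0.3],    # a: state transition probabilities (rows sum to 1)
                   [0.4, 0.6]])
    B  = np.array([[0.9, 0.1],    # b: output probabilities per state
                   [0.2, 0.8]])
    pi = np.array([0.5, 0.5])     # initial state distribution

    def forward(obs):
        alpha = pi * B[:, obs[0]]            # initialise with the first symbol
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]    # sum over the hidden state paths
        return alpha.sum()                   # P(observation sequence | model)

    print(forward([0, 1, 1, 0]))  # likelihood of one token sequence

    Because the states are hidden, the algorithm sums over all state paths that could have produced the tokens, rather than following a single visible path.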

    2.7.1 Applications of hidden Markov models

    Cryptanalysis

    Speech recognition

    Machine translation

    Partial discharge

    Gene prediction

    Alignment of bio-sequences


    2.8 Markov Chain Monte Carlo

    A Markov chain is a mathematical model for stochastic systems whose states, discrete or

    continuous, are governed by a transition probability. The current state in a Markov chain depends only

    on the most recent previous state(s), e.g. on the single previous state for a first-order Markov chain (Christophe

    Andrieu et al. 2003).

    The Markovian property means locality in space or time, as in Markov random fields

    and Markov chains. Indeed, a discrete-time Markov chain can be viewed as a special case of a

    Markov random field (causal and one-dimensional).

    MCMC is a general-purpose technique for generating fair samples from a probability distribution π(X) in a high-dimensional space, using random numbers drawn from a probability distribution on a certain range. A Markov chain is designed to have π(X) as its stationary (or invariant) distribution.

    This is a non-trivial task when π(X) is very complicated in a very high-dimensional space!

    Usually it is not hard to construct a Markov chain with the desired properties. The more

    difficult problem is to determine how many steps are needed to converge to the stationary

    distribution within an acceptable error. A good chain will have rapid mixing (the stationary

    distribution is reached quickly starting from an arbitrary position), described further under

    Markov chain mixing time.


    Typical use of MCMC sampling can only approximate the target distribution, as there is

    always some residual effect of the starting position. More sophisticated MCMC-based algorithms

    such as coupling from the past can produce exact samples, at the cost of additional computation

    and an unbounded (though finite in expectation) running time.

    The most common application of these algorithms is numerically calculating multi-

    dimensional integrals. In these methods, an ensemble of "walkers" moves around randomly. At

    each point where the walker steps, the integrand value at that point is counted towards the

    integral. The walker then may make a number of tentative steps around the area, looking for a

    place with reasonably high contribution to the integral to move into next.

    Random walk methods are a kind of random simulation or Monte Carlo method.

    However, whereas the random samples of the integrand used in a conventional Monte Carlo

    integration are statistically independent, those used in MCMC are correlated. A Markov chain is

    constructed in such a way as to have the integrand as its equilibrium distribution. Surprisingly,

    this is often easy to do.

    Multi-dimensional integrals often arise in Bayesian statistics, computational physics,

    computational biology and computational linguistics, so Markov chain Monte Carlo methods are

    widely used in those fields.

    2.8.1 Some Properties of Markov Chains

    Irreducible chain: it is possible to get from any state to any other state eventually (with non-zero probability).

    Periodic state: state i is periodic with period k if all returns to i must occur in multiples of

    k.

    Ergodic chain: irreducible with an aperiodic state. This implies all states are aperiodic, so the

    chain is aperiodic.

    Finite state space: the chain can be represented as a matrix of transition probabilities; then ergodic

    = regular.


    Regular chain: some power of the transition matrix has only positive elements.

    Reversible chain: satisfies detailed balance (discussed later).

    2.8.2 MCMC algorithms

    The following algorithms fall under Markov chain Monte Carlo; a minimal sketch of the Metropolis algorithm is given after this list.

    Metropolis-Hastings algorithm

    Metropolis algorithm

    Mixtures and blocks

    Gibbs sampling, Sequential Monte Carlo & particle filters
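    The sketch below shows random-walk Metropolis sampling from an unnormalised target density π(x) in Python; the standard-normal target, proposal step size, and burn-in length are illustrative assumptions, not part of any specific method above.

    # Random-walk Metropolis sampler (illustrative target and parameters).
    import math, random

    def log_pi(x):                      # log of the (unnormalised) target density
        return -0.5 * x * x             # here: a standard normal

    def metropolis(n_samples, step=1.0, x0=0.0, burn_in=1000):
        x, samples = x0, []
        for i in range(n_samples + burn_in):
            proposal = x + random.gauss(0.0, step)     # symmetric random walk
            # Accept with probability min(1, pi(proposal) / pi(x)).
            if math.log(random.random()) < log_pi(proposal) - log_pi(x):
                x = proposal
            if i >= burn_in:            # discard the residual effect of the start
                samples.append(x)
        return samples

    s = metropolis(50_000)
    print(sum(s) / len(s))              # correlated samples; mean approx. 0

    Unlike conventional Monte Carlo, successive samples here are correlated, which is why the burn-in period and mixing speed matter in practice.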

    2.8.3 Applications

    Computer vision and object tracking [Blake & Isard]

    Speech and audio enhancement

    Web statistics estimation

    Regression and classification

    Global maximization of MLPs [Freitas et al.]

    Bayesian networks (details in the book by Gilks et al.)

    Genetics and molecular biology

    Robotics, etc.


    Mycobacterium tuberculosis


    3. MYCOBACTERIUM TUBERCULOSIS

    3.1 Mycobacterium tuberculosis

    Tuberculosis (TB) is a disease that is spread from person to person through the air. TB

    usually affects the lungs, but it can also affect other parts of the body, such as the brain, the

    kidneys, or the spine. TB germs are put into the air when a person with TB disease of the lungs

    or throat coughs or sneezes. When a person inhales air that contains TB germs, he or she may

    become infected. People with TB infection do not feel sick and do not have any symptoms.

    However, they may develop TB disease at some time in the future. The general symptoms of TB

    disease include feeling sick or weak, weight loss, fever, and night sweats. The symptoms of TB

    of the lungs include coughing, chest pain, and coughing up blood. Other symptoms depend on

    the part of the body that is affected.

    Mycobacterium tuberculosis (MTB) is a pathogenic bacterial species in the genus

    Mycobacterium and the causative agent of most cases of tuberculosis. First discovered in 1882

    by Robert Koch, M. tuberculosis has an unusual, waxy coating on the cell surface (primarily

    mycolic acid), which makes the cells impervious to Gram staining; acid-fast techniques are used

    instead. M. tuberculosis is highly aerobic and requires high levels of oxygen.

    Primarily a pathogen of the mammalian respiratory system, MTB infects the lungs, causing

    tuberculosis (Ryan KJ et al. 2004).

    The M. tuberculosis genome was sequenced in 1998 (Cole ST et al. 1998; Camus JC et al. 2002).


    3.1.1 Strain variation

    M. tuberculosis appears to be genetically diverse, and this genetic diversity results in

    significant phenotypic differences between clinical isolates. M. tuberculosis exhibits a

    biogeographic population structure, and different strain lineages are associated with different

    geographic regions. Phenotypic studies suggest that this strain variation may have implications

    for the development of new diagnostics and vaccines. Micro-evolutionary variation affects the

    relative fitness and transmission dynamics of antibiotic-resistant strains (Gagneux S 2009).

    3.1.2 Mycobacterium tuberculosis CDC1551

    CDC1551 is a clinical strain that was originally thought to be highly virulent (142); it has

    more recently been shown that CDC1551 induces levels of cytokines, including TNF-α, that are

    higher than those induced by other M. tuberculosis strains in mice. However, it is not more

    virulent than the other strains, as defined by bacterial load and mortality.

    The genome of M. tuberculosis strain CDC1551 was sequenced by the whole-genome

    random sequencing method as described in Fleischmann RD et al. (1995), Science 269:496-512.

    The M. tuberculosis genome is a circular chromosome of 4,403,765 base pairs with an average G

    + C content of 65.6%. There are a total of 4,033 predicted open reading frames (ORFs).

    Predicted biological roles were assigned to 1,734 ORFs (43%); 605 ORFs (15%) match

    hypothetical proteins from other species, and 1,694 ORFs (42%) have no database match and

    presumably represent novel genes.


    Figure 15: Role category pie chart of Mycobacterium tuberculosis CDC1551

    3.1.3 Mycobacterium tuberculosis H37Rv

    The M. tuberculosis H37Rv genome consists of 4.4 × 10^6 bp and contains approximately

    4,000 genes (53). Annotation of the M. tuberculosis genome shows that this bacterium

    has some unique features. Until recently, a genome sequence for the H37Ra strain has been lacking. Historically, M. tuberculosis H37Ra is

    the avirulent counterpart of the virulent strain H37Rv, and both strains are derived from their

    virulent parent strain H37.


    Figure 16: Role category pie chart of Mycobacterium tuberculosis H37Rv (lab strain)

    3.1.4 Comparison between H37Rv and H37Ra

    The H37Ra genome is highly similar to that of H37Rv with respect to gene content and

    order but is 8,445 bp larger as a result of 53 insertions and 21 deletions in H37Ra relative to

    H37Rv. Variations in repetitive sequences such as IS6110 and the PE/PPE/PE-PGRS family genes

    are responsible for most of the gross genetic changes. A total of 198 single nucleotide variations

    (SNVs) that differ between H37Ra and H37Rv were identified, yet 119 of them are

    identical between H37Ra and CDC1551 and 3 are due to H37Rv strain variation, leaving only 76

    H37Ra-specific SNVs, which affect only 32 genes.


    Figure 17: Proposed phylogenetic relationship of H37Ra, H37Rv, H37, and CDC1551


    Figure 18: Statistics for Mycobacterium tuberculosis CDC1551

    Figure 19: Statistics for Mycobacterium tuberculosis H37Rv (lab strain)


    Aim & Objectives


    4. AIM & OBJECTIVES

    To predict the secondary structure of proteins using a genetic algorithm.

    To compare the performance of evolutionary genetic algorithms with a neural network in

    the prediction of protein secondary structure.


    Materials & Methods


    5. MATERIALS AND METHODS

    5.1 Software and Databases

    Table 6: Name and utility of the software and databases used in the methodology

    Name of the software / database       Utility

    JCVI                                  To retrieve the important proteins of Mycobacterium tuberculosis CDC1551 and Mycobacterium tuberculosis H37Rv

    NCBI Protein database                 To get the amino acid sequences of the desired proteins

    SOPMA                                 To make the secondary structure prediction

    SPSS 16.0                             To fit the Multi-Layer Perceptron (MLP) feed-forward neural network

    MATLAB (Genetic Algorithm Toolbox)    To perform the secondary structure prediction of the amino acid sequences using the genetic algorithm toolbox


    5.2 FLOW CHART OF METHODOLOGY

    1. Select the complete genome of the particular Mycobacterium strain from the JCVI database.

    2. Get the desired protein names from the whole list of proteins.

    3. Retrieve the protein sequences in FASTA format from the NCBI database.

    4. Run the protein sequences through the SOPMA secondary structure prediction tool and obtain the secondary structure assignments.

    5. Use a Perl script to create a file consisting of the nine features (input nodes) of each amino acid.

    6. Take the inputs file and run it in the SPSS neural network (both Multilayer Perceptron and Radial Basis Function).

    7. Take the inputs file and run it in the MATLAB genetic algorithm toolbox.

    8. Compare the results obtained from the SPSS neural network and the MATLAB genetic algorithm.
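    Step 5 was performed with a Perl script that is not reproduced here; the following Python sketch illustrates one plausible interpretation, assuming the "nine features" are a sliding window of nine residues centred on each amino acid (a common encoding in secondary structure prediction) and a hypothetical output file name.

    # Hypothetical reconstruction of the feature-extraction step (assumptions:
    # nine-residue window per amino acid, '-' padding, invented file name).
    WINDOW = 9
    HALF = WINDOW // 2

    def window_features(sequence, pad='-'):
        """Yield one nine-residue window per amino acid in the sequence."""
        padded = pad * HALF + sequence + pad * HALF
        for i in range(len(sequence)):
            yield list(padded[i:i + WINDOW])   # centred on sequence[i]

    with open('inputs.txt', 'w') as out:       # illustrative output file
        for row in window_features('MTEQQWNF'):  # illustrative sequence
            out.write(','.join(row) + '\n')

    Each row would then serve as the nine input nodes for one residue when the file is loaded into the neural network or genetic algorithm.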


    Figure 21: Genome information of Mycobacterium tuberculosis H37Rv

    Figure 22: List of protein-coding genes of Mycobacterium tuberculosis CDC1551


    Figure 23: List of protein-coding genes of Mycobacterium tuberculosis H37Rv

    Table 7: List of proteins taken from Mycobacterium tuberculosis CDC1551 for secondary structure prediction

    S.No   Protein Name                        Length of Sequence

    1      conserved hypothetical protein      507

    2      conserved hypothetical protein      402

    3      conserved hypothetical protein      187

    4      conserved hypothetical protein      686

    5      conserved hypothetical protein      838

    6      conserved hypothetical protein      304

    7      conserved hypothetical protein      262

    8      conserved hypothetical protein      232

    9      conserved hypothetical protein      626

    10     conserved hypothetical protein      511

    11     conserved hypothetical protein      521

    12     conserved hypothetical protein      87


    13     ABC transporter, ATP-binding protein                330

    14     Acyl carrier protein                                87

    15     Chromosomal replication initiator protein DnaA      507

    16     DNA gyrase subunit A                                838

    17     DNA gyrase subunit B                                686

    18     DNA polymerase III, beta subunit                    402

    19     Glutamine amidotransferase, class I                 232

    20     Leucyl-tRNA synthetase                              969

    21     L-serine dehydratase                                461

    22     Penicillin-binding protein                          820

    23     Replicative DNA helicase, intein-containing         874

    24     Ribosomal protein L9                                152

    25     Ribosomal protein S18                               84

    26     Ribosomal protein S6                                96

    27     Serine/threonine protein kinase                     626

    28     Transcriptional regulator, ArsR family              114

    29     Transcriptional regulator, GntR family              244

    30     Transcriptional regulator, MarR family              208

    Table 8: List of proteins taken from Mycobacterium tuberculosis H37Rv for secondary structure prediction

    S.No   Protein Name                    Length of Sequence

    1      Hypothetical Protein Rv0257     124

    2      Hypothetical Protein Rv0264c    210

    3      Hypothetical Protein Rv0268c    169

    4      Hypothetical Protein Rv0272c    377

    5      Hypothetical Protein Rv0282     631

    6      Hypothetical protein Rv0064     979


    Retrieving amino acid sequence data from the NCBI protein database

    The amino acid sequences of the desired proteins were retrieved from the NCBI protein database

    by giving an appropriate query.

    5.3.2 NCBI PROTEIN DATABASE

    In addition to protein sequences, other protein-related information is available via Entrez:

    search the Structure database by choosing "Structure" from the Entrez pull-down menu,

    the Conserved Domains Database (CDD) by choosing "Domains", and 3D Domains by choosing

    the "3D Domains" option.

    Figure 24: Home page of the NCBI protein database


    Figure 25: Retrieving amino acid sequence data from the NCBI protein database

    5.4 SECONDARY STRUCTURE PREDICTION USING SOPMA

    Secondary structure prediction was carried out with the SOPMA tool, hosted on the Pôle Bio-Informatique Lyonnais server. The

    tool can be accessed at the URL http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_sopma.html

    SOPMA

    Recently a new method called the self-optimized prediction method (SOPM) has been

    described to improve the success rate in the prediction of the secondary structure of proteins.

    The improved method (SOPMA) builds on the gains obtained by predicting all the sequences of a set of aligned

    proteins belonging to the same family. SOPMA correctly

    predicts 69.5% of amino acids for a three-state description of the secondary structure (alpha-

    helix, beta-sheet and coil) in a whole database containing 126 chains of non-homologous (less

    than 25% identity) proteins. Joint prediction with SOPMA and a neural network method (PHD)

    correctly predicts 82.2% of residues for 74% of co-predicted amino acids.
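    The accuracy figures above are three-state (Q3) percentages, i.e. the fraction of residues whose predicted state (helix, sheet, or coil) matches the observed one. The following small Python sketch, with invented example strings, shows the computation.

    # Three-state (Q3) prediction accuracy (example strings are invented).
    def q3(predicted, observed):
        assert len(predicted) == len(observed)
        hits = sum(p == o for p, o in zip(predicted, observed))
        return 100.0 * hits / len(observed)

    print(q3('HHHHCCEEECC', 'HHHCCCEEHCC'))  # 9 of 11 residues correct: 81.8%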


    Figure 26: Home page of the SOPMA secondary structure prediction tool