
PREDICTING THE CELLULAR LOCALIZATION SITES OF PROTEINS

USING ARTIFICIAL NEURAL NETWORKS

Vaibhav Dhattarwal
Department of Computer Science and Engineering

Indian Institute of Technology Roorkee
[email protected]

Abstract - In this paper, I present a brief description of how a feed-forward artificial neural network was implemented in C++. In the introduction, I begin by explaining that the reason for implementing this artificial neural network was to predict the cellular localisation sites of proteins, specifically for a yeast data set. This is followed by a concise explanation of the design and implementation of a three-layer feed-forward neural network trained with the backpropagation algorithm. The attributes of the data set and the possible output locations in the protein are also explained, followed by a step-by-step breakdown of how I approached the project. The implementation of the network is described along with how the algorithm is executed within the code. Finally, the results are presented as the parameters associated with the implemented artificial neural network are varied.

Keywords - Prediction, Localization Sites, Proteins, Simulation, Neural Networks

I. INTRODUCTION

Let me start with a basic explanation of why this topic was chosen. The topic, prediction of the cellular localisation of proteins, is essentially the information represented by the data set I have chosen for this paper. I will be implementing an artificial neural network based on the backpropagation algorithm, and to evaluate the performance of the simulated network I needed a data set on which to train and test it. Let us take a look at the significance of the chosen data set. If one can deduce the subcellular location of a protein, one can interpret its function, its part in healthy processes and in the onset of disease, and its probable usage as a drug target. Experimental methods for ascertaining the subcellular location of a protein have advantages such as reliability and accuracy, along with disadvantages such as being slow and labour-intensive. Compared with such methods, high-throughput computational prediction tools enable me to deduce information that is otherwise difficult to attain. For example, for proteins whose composition is determined from a genomic sequence, computational methods are better suited, as such proteins may be hard to isolate, produce, or locate in an experiment.

The subcellular location of a protein can provide valuable information about the role it plays in cellular dynamics. There has been an unprecedented surge in the amount of sequenced genomic data available, which in turn makes a computerized, high-accuracy tool for predicting subcellular location increasingly important. There have been many efforts to predict protein subcellular location accurately. This paper aims to bring together artificial neural networks and bioinformatics to predict the location of proteins in the yeast genome. I introduce a new subcellular prediction method based on a backpropagation neural network.


The problem statement reads: “Prediction of cellular localization sites of proteins using artificial neural networks”. The task of this paper lies first in simulating a three-layer artificial neural network, with the backpropagation algorithm used to train it. First we explain the algorithm; then, in the implementation section, we show how the algorithm is realized in the code used to simulate the network. After this, we present the observations recorded by training the simulated network on the yeast data set, and use those observations to identify trends and evaluate performance.

II. PROPOSED METHODOLOGY

A. Simulate an artificial neural network corresponding to the attributes of the yeast data set.

To enlarge the function space that the neural network can represent, we implement a three-layer feed-forward network with one layer of hidden nodes. With a sufficiently large number of nodes in the middle layer, such a network can represent almost any continuous function to an acceptable level of accuracy.

Figure 1: Structure of a three-layer feed-forward neural network.

The definitions of the input and output nodes can be regarded as similar to those of the perceptron network discussed earlier. The major difference is that we have a single layer of hidden nodes between the input and output nodes.

Similarly, we again use the ratio of correctly classified examples in the training set as the threshold for the termination condition. The major difference from the algorithm for training a two-layer feed-forward neural network is that, when updating the weights of the hidden layer, we must back-propagate the error from the output layer to the hidden layer.
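To make the structure concrete, the sketch below shows one way the network's state could be organized in C++. This is an illustrative outline, not the paper's original source: the name ThreeLayerNetwork and the layout of the weight matrices are assumptions, with layer sizes chosen to match the yeast data set described later (8 input attributes, 10 output classes) and a tunable hidden-layer size.

#include <vector>

// A minimal sketch of the three-layer feed-forward network's state.
// Row 0 of each weight matrix is reserved for the bias term, matching
// the convention of the pseudocode in Section III.
struct ThreeLayerNetwork {
    int numInput;   // number of input nodes (8 attributes)
    int numHidden;  // number of hidden nodes (a free parameter)
    int numOutput;  // number of output nodes (10 localization classes)
    // wtInputHidden[i][j]: weight from input node i to hidden node j
    std::vector<std::vector<double>> wtInputHidden;
    // wtHiddenOutput[j][k]: weight from hidden node j to output node k
    std::vector<std::vector<double>> wtHiddenOutput;

    ThreeLayerNetwork(int nIn, int nHid, int nOut)
        : numInput(nIn), numHidden(nHid), numOutput(nOut),
          wtInputHidden(nIn + 1, std::vector<double>(nHid, 0.0)),
          wtHiddenOutput(nHid + 1, std::vector<double>(nOut, 0.0)) {}
};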

B. Implement the backpropagation algorithm on the simulated artificial neural network.

The algorithm for our three-layer network:

a. Initialize the weights of the network.
b. Perform the following operations for every example in the training set:
   1. Compute the output of the neural network for this example, denoted O (forward pass).
   2. Let T denote the teaching output for this example; the error is (T - O).
   3. Calculate ΔWHO for all weights between the hidden and output layers.
   4. Move backwards through the network (backward pass) and calculate ΔWIH for all weights between the input and hidden layers.
   5. Update the weights of the network using the calculated delta values.
c. Stop when the error criterion is met.
d. Return the trained network.
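Step (a) is not spelled out above. A common choice, shown here as an assumption rather than the paper's own method, is to initialize every weight to a small random value so that the hidden units start out different from one another; the helper name initializeWeights is illustrative.

#include <random>
#include <vector>

// Fill a weight matrix with small random values in [-0.5, 0.5],
// a conventional starting point for backpropagation training.
void initializeWeights(std::vector<std::vector<double>>& weights,
                       unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(-0.5, 0.5);
    for (auto& row : weights)
        for (double& w : row)
            w = dist(gen);
}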

The learning algorithm that we have chosen for our network is the Backpropagation Algorithm. It can be divided into two stages:

Stage One: Propagation Phase

This phase consists of the following operations:

1. First, we forward-propagate the training pattern's input data through the network to produce the output activations.
2. Second, we backward-propagate those output activations through the network, using the training pattern's desired target data, to compute the error deltas.

Stage Two: Weight Updating Phase

In this stage, for every connection possessing a weight, the following operations are carried out:

1. First, we multiply the output delta by the input activation to calculate the gradient of the weight.


2. Second, we subtract a fraction of the gradient (scaled by the learning rate) from the weight. This moves the weight in the direction opposite to the gradient.

We keep repeating stages one and two until the network performs with an acceptable success rate.
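In code, the two stages reduce to a couple of lines per connection. The sketch below is illustrative: delta is assumed to be (target - output) multiplied by the activation derivative, as in the Section III pseudocode, so adding the change moves the weight downhill on the error surface; beta is the learning rate, alpha a momentum coefficient, and the function name updateWeight is hypothetical.

// One stage-two update for a single connection.
double updateWeight(double weight, double delta, double inputActivation,
                    double beta, double alpha, double& previousChange) {
    double gradient = delta * inputActivation;  // stage two, step 1
    previousChange  = beta * gradient + alpha * previousChange;
    return weight + previousChange;             // stage two, step 2
}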

C. Train the network using the data set.

The yeast data set has eight attributes. These attributes were calculated from amino acid sequences.

1. erl: Represents the lumen of the endoplasmic reticulum in the cell. This attribute indicates whether an HDEL pattern, acting as a retention signal, is present.

2. vac: Indicates the amino acid content of vacuolar and extracellular proteins, obtained from a discriminant analysis.

3. mit: Gives the composition of the twenty-residue N-terminal region of mitochondrial and non-mitochondrial proteins, obtained from a discriminant analysis.

4. nuc: Indicates whether nuclear localization patterns are present, and also carries some information about the frequency of basic residues.

5. pox: Gives the composition of the protein sequence after a discriminant analysis, and also indicates the presence of a short sequence motif.

6. mcg: A parameter used in McGeoch's signal sequence detection method; a modified version of it is used here.

7. gvh: Represents a weight-matrix-based procedure used to detect cleavable signal sequences.

8. alm: This final feature comes from identifying membrane-spanning regions over the entire sequence.

For this data set, the output classes are summarized below; the localization site is represented by the output class. The various classes are listed here, and a sketch of a matching record type follows the list:

1. CYT (cytosolic or cytoskeletal)
2. NUC (nuclear)
3. MIT (mitochondrial)
4. ME3 (membrane protein, no N-terminal signal)
5. ME2 (membrane protein, uncleaved signal)
6. ME1 (membrane protein, cleaved signal)
7. EXC (extracellular)
8. VAC (vacuolar)
9. POX (peroxisomal)
10. ERL (endoplasmic reticulum lumen)
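For concreteness, one example of this data set can be modelled as below. This is a sketch under the assumption that the data follows the UCI repository's whitespace-separated yeast.data layout (sequence name, then the eight attributes in the file order mcg, gvh, alm, mit, erl, pox, vac, nuc, then the class label); the type and function names are illustrative.

#include <sstream>
#include <string>

// One example from the yeast data set: eight real-valued attributes
// and the localization-site class label (CYT, NUC, MIT, ...).
struct YeastExample {
    std::string name;                               // protein sequence name
    double mcg, gvh, alm, mit, erl, pox, vac, nuc;  // the eight attributes
    std::string site;                               // target class, e.g. "CYT"
};

// Parse one whitespace-separated line of yeast.data (assumed format).
// Returns false if the line does not contain all ten fields.
bool parseYeastLine(const std::string& line, YeastExample& ex) {
    std::istringstream in(line);
    return static_cast<bool>(in >> ex.name >> ex.mcg >> ex.gvh >> ex.alm
                                >> ex.mit >> ex.erl >> ex.pox >> ex.vac
                                >> ex.nuc >> ex.site);
}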

Figure 2: A yeast cell.

D. Obtain results and compare performance with other networks and techniques used for predicting the cellular localization of proteins.

● Results are evaluated after using the data set on the simulated artificial neural network.
● Varying the number of nodes in the hidden layer is used to evaluate performance.
● Comparison of accuracies of various algorithms
● Variation of success rate with number of iterations
● Variation of success rate with number of nodes in the hidden layer

III. IMPLEMENTATION

Figure 3: Design for calculating the output activation.


Er = 0.0
for all patterns E in the training set do   // accumulates the error Er over all training patterns
    for all elements j in the hidden layer [NumUnitHidden] do   // forward pass: input to hidden
        InputHidden[E][j] = WtInputHidden[0][j]   // bias term
        for all elements i in the input layer [NumUnitInput] do
            add OutputInput[E][i] * WtInputHidden[i][j] to InputHidden[E][j]
        end for
        OutputHidden[E][j] = sigmoid(InputHidden[E][j])
    end for
    for all elements k in the output layer [NumUnitOutput] do   // forward pass: hidden to output
        InputOutput[E][k] = WtHiddenOutput[0][k]   // bias term
        for all elements j in the hidden layer [NumUnitHidden] do
            add OutputHidden[E][j] * WtHiddenOutput[j][k] to InputOutput[E][k]
        end for
        Output[E][k] = sigmoid(InputOutput[E][k])
        add (1/2) * (Final[E][k] - Output[E][k]) * (Final[E][k] - Output[E][k]) to Er
        ΔOutput[k] = (Final[E][k] - Output[E][k]) * Output[E][k] * (1 - Output[E][k])   // derivative of the sigmoid
    end for
    for all elements j in the hidden layer [NumUnitHidden] do   // backpropagation of error to the hidden layer
        SumΔOutput[j] = 0.0
        for all elements k in the output layer [NumUnitOutput] do
            add WtHiddenOutput[j][k] * ΔOutput[k] to SumΔOutput[j]
        end for
        ΔH[j] = SumΔOutput[j] * OutputHidden[E][j] * (1.0 - OutputHidden[E][j])   // derivative of the sigmoid
    end for
    for all elements j in the hidden layer [NumUnitHidden] do   // update the input-to-hidden weights
        ΔWih[0][j] = β * ΔH[j] + α * ΔWih[0][j]
        add ΔWih[0][j] to WtInputHidden[0][j]
        for all elements i in the input layer [NumUnitInput] do
            ΔWih[i][j] = β * OutputInput[E][i] * ΔH[j] + α * ΔWih[i][j]
            add ΔWih[i][j] to WtInputHidden[i][j]
        end for
    end for
    for all elements k in the output layer [NumUnitOutput] do   // update the hidden-to-output weights
        ΔWho[0][k] = β * ΔOutput[k] + α * ΔWho[0][k]
        add ΔWho[0][k] to WtHiddenOutput[0][k]
        for all elements j in the hidden layer [NumUnitHidden] do
            ΔWho[j][k] = β * OutputHidden[E][j] * ΔOutput[k] + α * ΔWho[j][k]
            add ΔWho[j][k] to WtHiddenOutput[j][k]
        end for
    end for
end for
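The pseudocode above maps almost line for line onto C++. The following condensed sketch of one training epoch is illustrative rather than the paper's original source: the function name trainEpoch, the logistic sigmoid, and the beta (learning-rate) and alpha (momentum) parameters are assumptions chosen to mirror the pseudocode, and row 0 of each weight matrix holds the bias.

#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

static double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One epoch of backpropagation over a training set, mirroring the
// pseudocode above. inputs[p] holds the attributes of pattern p and
// targets[p] the desired output vector; dWIH and dWHO carry the
// previous weight changes for the momentum term.
double trainEpoch(const Matrix& inputs, const Matrix& targets,
                  Matrix& wIH, Matrix& wHO,    // weights (row 0 = bias)
                  Matrix& dWIH, Matrix& dWHO,  // previous weight changes
                  int nIn, int nHid, int nOut,
                  double beta, double alpha) {
    double er = 0.0;  // accumulated squared error
    std::vector<double> outH(nHid), outO(nOut);
    std::vector<double> deltaO(nOut), deltaH(nHid);

    for (std::size_t p = 0; p < inputs.size(); ++p) {
        // Forward pass: input -> hidden.
        for (int j = 0; j < nHid; ++j) {
            double net = wIH[0][j];  // bias
            for (int i = 0; i < nIn; ++i)
                net += inputs[p][i] * wIH[i + 1][j];
            outH[j] = sigmoid(net);
        }
        // Forward pass: hidden -> output; error and output deltas.
        for (int k = 0; k < nOut; ++k) {
            double net = wHO[0][k];  // bias
            for (int j = 0; j < nHid; ++j)
                net += outH[j] * wHO[j + 1][k];
            outO[k] = sigmoid(net);
            double diff = targets[p][k] - outO[k];
            er += 0.5 * diff * diff;
            deltaO[k] = diff * outO[k] * (1.0 - outO[k]);  // sigmoid'
        }
        // Backward pass: propagate error to the hidden layer.
        for (int j = 0; j < nHid; ++j) {
            double sum = 0.0;
            for (int k = 0; k < nOut; ++k)
                sum += wHO[j + 1][k] * deltaO[k];
            deltaH[j] = sum * outH[j] * (1.0 - outH[j]);   // sigmoid'
        }
        // Update input -> hidden weights, with momentum.
        for (int j = 0; j < nHid; ++j) {
            dWIH[0][j] = beta * deltaH[j] + alpha * dWIH[0][j];
            wIH[0][j] += dWIH[0][j];
            for (int i = 0; i < nIn; ++i) {
                dWIH[i + 1][j] = beta * inputs[p][i] * deltaH[j]
                               + alpha * dWIH[i + 1][j];
                wIH[i + 1][j] += dWIH[i + 1][j];
            }
        }
        // Update hidden -> output weights, with momentum.
        for (int k = 0; k < nOut; ++k) {
            dWHO[0][k] = beta * deltaO[k] + alpha * dWHO[0][k];
            wHO[0][k] += dWHO[0][k];
            for (int j = 0; j < nHid; ++j) {
                dWHO[j + 1][k] = beta * outH[j] * deltaO[k]
                               + alpha * dWHO[j + 1][k];
                wHO[j + 1][k] += dWHO[j + 1][k];
            }
        }
    }
    return er;  // caller repeats epochs until this meets the criterion
}

A caller would invoke trainEpoch repeatedly over the training set, stopping once the returned error, or the training-set success rate discussed in Section II, meets the chosen criterion.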

IV. RESULTS AND DISCUSSION

A. Comparison of Accuracies of Different Algorithms

In this section, we look at the accuracies offered by different algorithms. We consider four: the majority algorithm, the decision tree algorithm, the perceptron learning algorithm, and the three-layer neural network based on the backpropagation algorithm. Two data sets are considered, both studied in detail in earlier sections: the E. coli data set, for the E. coli cell, and the one chosen by us, the yeast data set. As the chart below shows, our algorithm achieves slightly higher accuracy than the other algorithms. Notably, considerable success is achieved on the yeast data set we chose to implement, with accuracy reaching 61%.

Figure 4: Plot of Accuracy of various algorithms for two data sets.

B. Variation of Success Rate with Number of Iterations

Let us consider the variation of success rate in our implementation. Success rate is defined as the number of successful predictions divided by the total number of cases handled. The overall success rate varies with the number of iterations used to train the neural network: as the number of iterations increases, the error is reduced, since the network learns with every training session. The chart below shows the expected variation, with the success rate rising with the number of iterations. Note, however, that after about 100 iterations the success rate remains more or less constant.
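Expressed in code, with the predicted class taken as the output node with the highest activation (a standard convention; the paper does not spell out its decision rule), the success rate could be computed as in this illustrative helper:

#include <algorithm>
#include <cstddef>
#include <vector>

// Success rate = successful predictions / total cases. Each prediction
// is the index of the most active output node, compared against the
// index of the true class.
double successRate(const std::vector<std::vector<double>>& outputs,
                   const std::vector<int>& trueClasses) {
    int correct = 0;
    for (std::size_t p = 0; p < outputs.size(); ++p) {
        int predicted = static_cast<int>(
            std::max_element(outputs[p].begin(), outputs[p].end())
            - outputs[p].begin());
        if (predicted == trueClasses[p]) ++correct;
    }
    return outputs.empty() ? 0.0
                           : static_cast<double>(correct) / outputs.size();
}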

Figure 5: Plot of Success Rate with number of iterations

C. Variation of Success Rate with Number of Processing Elements in the Hidden Layer

Let us now vary another important parameter of our neural network, again considering the success rate defined in the previous section. The number of processing elements in the hidden layer is under our control: since the chosen data set fixes the number of input attributes and possible outcomes, the input and output layers have a fixed number of processing elements, but we can vary the number of elements in the hidden layer and observe the effect on the success rate. Note that the success rate reaches a constant value after about 75 elements in the layer.


Figure 6: Plot of Success Rate with Number of Processing Elements in the Hidden Layer

V. CONCLUSIONS AND FUTURE WORK

A. Conclusion

In this paper, I implemented the machine learning algorithm of a three-layer feed-forward network and applied it to the problem of classifying proteins to their cellular localization sites based on their amino acid sequences. The yeast data set's accuracy was compared with the E. coli data set's accuracy, and it was tested whether the three-layer neural network with hidden nodes is able to separate the data sets. We also explored using a larger number of hidden nodes in the network, and implemented a three-layer feed-forward neural network to represent discontinuous functions. After obtaining results, we compared the performance with other networks and techniques used for predicting the cellular localization of proteins. The most important results can be summarized as:

● The classes CYT, NUC and MIT have the largest number of instances.

● The back propagation algorithm is able to achieve slightly higher accuracy than the rest of the algorithms.

● Considerable success is achieved on the yeast data set we chose to implement, with accuracy reaching 61%.

● After about 100 iterations, the success rate remains more or less constant.

● The success rate reaches a constant value after about 75 elements in the hidden layer.

● The accuracy rises until the success rate reaches its attainable limit.

B. Future Work

Since the prediction of proteins' cellular localization sites is a typical classification problem, many other techniques, such as probability models, Bayesian networks, and K-nearest neighbours, can be compared with our technique. Thus, one aspect of future work is to examine the performance of these techniques on this particular problem.

ACKNOWLEDGEMENT

I would like to acknowledge the contribution of Dr. Durga Toshniwal, Associate Professor, Department of Computer Science and Engineering, IIT Roorkee, whose guidance was indispensable throughout the course of this work.
