
Chaos, Solitons and Fractals 40 (2009) 69–77

www.elsevier.com/locate/chaos

The performance of the backpropagation algorithm with varying slope of the activation function

Yanping Bai a,b,*, Haixia Zhang a, Yilong Hao a

a National Key Laboratory of Micro/Nano Fabrication Technology, Institute of Microelectronics, Peking University, 100871, China
b Department of Applied Mathematics, North University of China, No. 3 Xueyuan Road, TaiYuan, ShanXi 030051, China

Accepted 10 July 2007

Abstract

Some adaptations are proposed to the basic BP algorithm in order to provide an efficient method for non-linear data learning and prediction. In this paper, an adapted BP algorithm with a varying slope of the activation function and different learning rates is put forward. The experimental results indicate that this algorithm achieves very good training performance. We also test the prediction performance of our adapted BP algorithm on 16 instances and compare the test results with those of the BP algorithm with gradient descent momentum and an adaptive learning rate. The results show that the adapted BP algorithm gives the better test performance on all 16 instances (100%), which suggests that it produces a smoothed reconstruction that generalizes better to new function values than the BP algorithm improved with momentum.
© 2009 Published by Elsevier Ltd.

1. Introduction

Feedforward neural networks (FNN) are widely used to solve complex problems in pattern classification, system modeling and identification, non-linear signal processing, and the analysis of non-linear multivariate data. One of the characteristics of the FNN is its learning (or training) ability. After training, the neural network can give correct answers not only for the learned examples, but also for models similar to the learned examples, showing strong associative and generalization abilities which make it suitable for solving large, nonlinear, and complex classification and function approximation problems. The classical method for training an FNN is the backpropagation (BP) algorithm [1], which is based on the gradient descent optimization technique. Despite the general success of this algorithm, it may converge to a local minimum of the mean squared-error objective function and requires a large number of learning iterations to adjust the weights of the FNN. Many attempts have been made to speed up the error BP algorithm. The most well-known algorithms of this type are the conjugate gradient training algorithm [2] and the Levenberg–Marquardt (LM) training algorithm [3]. The computational complexity of the conjugate gradient algorithm depends heavily on the line search method. The LM algorithm is faster than gradient descent and rarely becomes stuck in a local minimum, but it requires too much

0960-0779/$ - see front matter © 2009 Published by Elsevier Ltd.
doi:10.1016/j.chaos.2007.07.033

* Corresponding author. Address: National Key Laboratory of Micro/Nano Fabrication Technology, Institute of Microelectronics, Peking University, 100871, China.

E-mail addresses: [email protected] (Y. Bai), [email protected] (Y. Hao).


memory and computational time. Other methods have also been presented; the commonly known ones, such as momentum [4], variable learning rate [5,10] or stochastic algorithms [6,7], lead only to a slight improvement. In [8], the authors present an efficient method for learning an FNN that combines unsupervised training for the hidden neurons and supervised training for the output neurons. More precisely, when an input is presented to the network, the updating rule drags the weight vectors of the winner and its topological neighbors according to unsupervised training. The weights leaving the winning neuron and its neighbors are adjusted by the gradient descent method.

In this paper, we discuss the performance of the BP algorithm from a different point of view. In common practice, the functions to be learned are specified not by algorithms but by a table, or training set, T consisting of n argument–value pairs. We are given a d-dimensional argument x and an associated target value t, and t is to be approximated by a network output y. The function to be constructed is fitted to $T = \{(x_i, t_i) : i = 1, \ldots, n\}$. In most applications the training set T is considered to be noisy, and our goal is not to reproduce it exactly but rather to construct a network function $g(x, w)$ that produces a smoothed reconstruction and generalizes (learns) well to new function values. Therefore we improve the basic BP algorithm by varying the slope of the activation function and using different training rates, and we test the performance of the resulting algorithm on sixteen training sets whose data come from real-life problems. The results of these tests indicate that our adapted BP algorithm produces a smoothed reconstruction that generalizes better to new function values than the BP algorithm improved with momentum and an adaptive learning rate.

2. Description of the BP algorithm

2.1. Basic BP algorithm

A BP network with one hidden layer can approximate an arbitrary nonlinear function defined on a compact set of $R^n$ with arbitrary precision [4]. The BP algorithm is a supervised training algorithm whose training procedure is divided into two parts: a forward propagation of information and a backward propagation (BP) of error. The network's training procedure is described below.

Let the numbers of input and hidden nodes be N and M, respectively. In this paper, the number of output nodes is fixed at 1. Let the input example vectors be $\xi^l = (\xi_1^l, \xi_2^l, \ldots, \xi_N^l)$ $(1 \le l \le P)$. Denote by $w_{ij}$ $(1 \le i \le N, 1 \le j \le M)$ the weight connecting the ith input node and the jth hidden node, and by $W_j$ $(1 \le j \le M)$ the connection weight between the jth hidden node and the output node. g(x) and f(x) are the activation functions of the hidden layer and the output layer, respectively. When a training example $\xi^l$ is input to the network, the input and output values of the jth hidden node are denoted by $x_j^l$ and $y_j^l$ $(1 \le j \le M, 1 \le l \le P)$, respectively, while the input and output values of the output unit are denoted by $H^l$ and $O^l$ $(1 \le l \le P)$, respectively. In symbols we have

    x_j^l = \sum_{i=1}^{N} w_{ij} \xi_i^l,                        (2.1)

    y_j^l = g(x_j^l),                                             (2.2)

    H^l = \sum_{j=1}^{M} W_j y_j^l,                               (2.3)

    O^l = f(H^l).                                                 (2.4)

Let the desired output corresponding to the input example $\xi^l$ be $\zeta^l$ (according to the output layer's activation function, the $\zeta^l$ are taken to lie in (0, 1) in this paper). Then the squared error for this step of training is

    E_l = \frac{1}{2} (\zeta^l - O^l)^2.                          (2.5)

The overall squared error after all examples have been used is

    E = \frac{1}{2} \sum_{l=1}^{P} (\zeta^l - O^l)^2.             (2.6)
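
In code, the forward pass (2.1)–(2.4) and the per-example error (2.5) amount to two matrix–vector products and a subtraction. The following MATLAB fragment is only an illustration: the sizes, the slope value and the weights are made-up placeholders, not values from the paper.

% Forward pass and per-example error, Eqs. (2.1)-(2.5), for an N-M-1 network.
N = 1;  M = 4;                        % one input node, four hidden nodes (as used later)
w  = 0.1*[1 2 3 4]/4;                 % N x M input-to-hidden weights w_ij (made up)
W  = 0.1*ones(M, 1);                  % M x 1 hidden-to-output weights W_j (made up)
g  = @(x) 1 ./ (1 + exp(-x));         % hidden activation g
f  = @(u) 1 ./ (1 + exp(-2*u));       % output activation f with slope a = 2 (example value)

xi = 0.7;  zeta = 0.4;                % one training pair (xi^l, zeta^l), made up
x  = w' * xi;                         % Eq. (2.1): hidden inputs x_j^l
y  = g(x);                            % Eq. (2.2): hidden outputs y_j^l
H  = W' * y;                          % Eq. (2.3): input to the output unit, H^l
O  = f(H);                            % Eq. (2.4): network output O^l
E_l = 0.5 * (zeta - O)^2;             % Eq. (2.5): squared error for this example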

Let W denote the vector containing all the weights. The purpose of the BP algorithm is to choose W so as to minimize the error function by, say, the gradient descent method. So, the general expression of the iteration formula is

    W(t + 1) = W(t) + \Delta W(t),                                (2.7a)

where

    \Delta W = \eta \left( -\frac{\partial E}{\partial W} \right) \Big|_{W = W(t)}                (2.7b)

is the weight increment at time t, and the positive constant $\eta$ is the training rate, $0 < \eta < 1$.

A popular variation of the standard gradient method (2.7) is the so-called online gradient method (OGM for short). This means that the weight values are modified as soon as a training example is input to the network. Now we have

    \Delta W_j = \eta_1 \left( -\frac{\partial E}{\partial W_j} \right) = \eta_1 (\zeta^l - O^l) f'(H^l) y_j^l.                (2.8)

By the chain rule and (2.1)–(2.5), we have

    \Delta w_{ij} = \eta_2 \left( -\frac{\partial E}{\partial w_{ij}} \right) = \eta_2 (\zeta^l - O^l) f'(H^l) W_j g'(x_j^l) \xi_i^l,                (2.9)

where $\eta_1$ and $\eta_2$ are the training rates of the synaptic weights from the hidden layer to the output layer and from the input layer to the hidden layer, respectively. The training examples $\{\xi^l\}$ are usually supplied to the network in stochastic order.

Compared with the standard gradient method, OGM is more likely to escape from a local minimum of the error function, and it requires less memory space. Therefore, OGM is widely used in neural network training [9].
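
A minimal sketch of one such online update follows; it continues the forward-pass fragment given after Eq. (2.6), so xi, zeta, w, W, y and O are assumed to be in the workspace, and the slope and training-rate values here are examples of ours, not recommendations.

% One online (OGM) weight update, Eqs. (2.8)-(2.9), continuing the fragment after Eq. (2.6).
a = 2;  eta1 = 0.5;  eta2 = 0.05;               % slope and the two training rates (example values)

df = a * O * (1 - O);                           % f'(H^l) for the logistic output, see Eq. (3.7) below
dg = y .* (1 - y);                              % g'(x_j^l) for the unit-slope logistic hidden units
dW = eta1 * (zeta - O) * df .* y;               % Eq. (2.8): Delta W_j
dw = eta2 * (zeta - O) * df * xi * (W .* dg)';  % Eq. (2.9): Delta w_ij (an N x M matrix)
W  = W + dW;                                    % update hidden-to-output weights
w  = w + dw;                                    % update input-to-hidden weights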

2.2. BP algorithm with gradient descent momentum and an adaptive learning rate

The BP algorithm with momentum updating is one of the most popular modifications of the standard BP algorithm presented in Section 2.1. The idea of the algorithm is to update the weights in a direction which is a linear combination of the current gradient of the instantaneous error surface and the one obtained in the previous step of training. In practical applications, a momentum term is added to Formula (2.7a) to accelerate the convergence, resulting in

    \Delta W = \eta \left( -\frac{\partial E}{\partial W} \right) \Big|_{W = W(t)} + \alpha \left[ W(t) - W(t - 1) \right],                (2.10)

where the positive constant $\alpha$ is a momentum factor, $0 < \alpha < 1$.

In the BP algorithm with momentum updating, a batch-updating approach accumulates the weight corrections over one entire epoch before actually performing the update. Batch updating with a variable learning rate is a simple heuristic strategy to increase the convergence speed of the BP algorithm with momentum updating. The idea behind the approach is to increase the magnitude of the learning rate if the error function has decreased toward the goal. Conversely, if the error function has increased, the learning rate is decreased. If the error function has increased by less than a previously defined percentage, the learning rate remains unchanged.

Applying the variable learning rate to the BP algorithm with momentum updating can speed up convergence when the error function is smooth and decreases slowly. However, the algorithm can easily be trapped in a local minimum of the error surface. To avoid this, the learning rate is not allowed to fall below a certain value [11].
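
The heuristic described above can be sketched as a simple loop. In the sketch below the error-increase tolerance (4%) and the 1.05/0.7 factors mirror the TRAINGDX settings quoted in Section 5, the toy quadratic stands in for the network error surface, and W0 is an arbitrary starting point; none of these values are prescribed by the text.

% Sketch of batch updating with momentum, Eq. (2.10), plus the adaptive learning-rate heuristic.
sse    = @(Wv) sum(Wv.^2);        % toy error surface standing in for the batch error E(W)
grad_E = @(Wv) 2*Wv;              % its gradient
W0     = [1; -2; 3];              % arbitrary initial weight vector

eta = 0.01;  alpha = 0.9;         % learning rate and momentum factor
eta_min = 1e-6;                   % the learning rate is not allowed to fall below this value [11]
max_inc = 1.04;                   % allowed relative increase of the error

Wvec = W0;  dW_prev = zeros(size(W0));  E_prev = sse(Wvec);
for epoch = 1:2000
    dW    = -eta * grad_E(Wvec) + alpha * dW_prev;   % Eq. (2.10)
    Wnew  = Wvec + dW;
    E_new = sse(Wnew);
    if E_new > max_inc * E_prev
        eta = max(0.7 * eta, eta_min);               % error grew too much: shrink eta, reject the step
    else
        if E_new < E_prev
            eta = 1.05 * eta;                        % error decreased toward the goal: grow eta
        end
        Wvec = Wnew;  dW_prev = dW;  E_prev = E_new; % accept the step
    end
end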

3. The training performance of the BP algorithm with varying slope of the activation function and different learning rates

The activation function can be a linear or a non-linear function, and there are many different types. Here we present three of the most common.

The first type is the linear function, which is continuous-valued. It is written as

    y = f(u) = u.                                                 (3.1)

The second type of activation function is a hard limiter. This is a binary (or bipolar) function that hard-limits its input to either 0 or 1 for the binary type, and to -1 or 1 for the bipolar type. They are written as

    y = f(u) = \begin{cases} 0 & \text{if } u < 0, \\ 1 & \text{if } u \ge 0, \end{cases}                (3.2)

and

    y = f(u) = \begin{cases} -1 & \text{if } u < 0, \\ 0 & \text{if } u = 0, \\ 1 & \text{if } u > 0. \end{cases}                (3.3)

The third type of basic activation function is the logistic function (or the hyperbolic tangent function, which is a simple rescaling of the logistic). It is shown as formula (3.4) (or formula (3.5)):

    y = f(u) = \frac{1}{1 + e^{-au}},                             (3.4)

    y = f(u) = \tanh(au) = \frac{e^{au} - e^{-au}}{e^{au} + e^{-au}} = \frac{1 - e^{-2au}}{1 + e^{-2au}},                (3.5)

where a is the slope parameter of the sigmoid function. The logistic function is a bounded, monotonically increasing, analytic function. Elementary properties include

    \lim_{u \to -\infty} f(u) = 0, \qquad \lim_{u \to \infty} f(u) = 1,                (3.6)

    f'(u) = a f(u)(1 - f(u)), \qquad f''(u) = a^2 f(u)(1 - f(u))(1 - 2f(u)),                (3.7)

    |f'(u)| \le \frac{a}{4},                                      (3.8)

    f(u) \approx \frac{1}{2} + \frac{a}{4} u \quad \text{for small } u.                (3.9)

For small values of u, the logistic function is nearly linear. The logistic activation functions for three different values of the slope parameter are shown in Fig. 1. The percentage errors incurred in the linear approximation to the logistic function (for three different a values) are plotted in Fig. 2.

We can see from Fig. 2 that, for moderate values of u, the logistic function with a small a value is more nearly linear than the one with a large a value.
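
The properties (3.7)–(3.9) are easy to check numerically. The short MATLAB loop below verifies the derivative bound and measures the error of the linear approximation on a small interval; the three slope values are illustrative choices of ours, not necessarily the ones plotted in Figs. 1 and 2.

% Numerical check of Eqs. (3.7)-(3.9) for the logistic function with slope a.
for a = [0.5 1 2]                          % illustrative slope values
    f  = @(u) 1 ./ (1 + exp(-a*u));        % Eq. (3.4)
    fp = @(u) a .* f(u) .* (1 - f(u));     % Eq. (3.7): f'(u)
    u  = linspace(-0.5, 0.5, 101);
    lin = 0.5 + (a/4)*u;                   % Eq. (3.9): linear approximation near u = 0
    fprintf('a = %.1f: max |f''(u)| = %.4f (bound a/4 = %.4f), ', a, max(fp(u)), a/4);
    fprintf('max error of linear approx on [-0.5, 0.5] = %.2f%%\n', ...
            100 * max(abs((lin - f(u)) ./ f(u))));
end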

From the weight updates (2.7)–(2.9) we observe that the rate of the update is proportional to the derivative of the nonlinear activation function. During training of the network, the output of the linear combiner may fall into the saturation region of the activation function. In that region the derivative of the activation function is very small, so the rate of learning becomes extremely slow, and it may take many iterations before the output of the linear combiner moves out of the saturation region. A straightforward approach to prevent this saturation is to increase the size of the non-saturated part of the activation function by decreasing its slope. However, decreasing the slope makes the network behave more like a linear network, which in effect diminishes the advantages of having a multilayer network, because any number of layers with linear activation functions can be replaced by a single layer. Hence, there is an optimum value of the activation-function slope that balances the speed of network training against its mapping capability. This is also the aim of our adapted BP algorithm. The BP algorithm with varying slope of the activation function and different learning rates is summarized in the following section.

Fig. 1. Logistic activation functions for three different values of the slope parameter.

Fig. 2. Percentage errors in approximating logistic function by linear function.


4. Our adopted BP algorithm

Step 1. Initialize the weights of the network (the same initial weights are used in all our tests).
Step 2. Select the activation functions of the hidden layer and the output layer as the logistic functions $g(x) = 1/(1 + e^{-x})$ and $f(x) = 1/(1 + e^{-ax})$, respectively (we use a two-layer network in this paper), where a is the slope parameter of the activation function f(x).
Step 3. Present an input pattern from the set of training input/output pairs in random order, and calculate the network's response.
Step 4. Compare the desired network response with the actual output of the network, and compute the squared error according to Eq. (2.5).
Step 5. Update the weights of the network according to Formulas (2.8) and (2.9), where the training rates of the synaptic weights are different for the output layer and the hidden layer, namely $\eta_1$ and $\eta_2$, respectively.
Step 6. Stop if the network has converged or the maximum number of iterations is reached; else go to Step 3. A minimal sketch of this training loop is given below.
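
The sketch below implements Steps 1–6 for the 1–4–1 network used in the experiments. The stopping values (2000 iterations, target error 0.005) are those quoted in Section 5; the data series, the initial weights and the parameter values are placeholders of ours, not the paper's data.

% Sketch of the adapted BP algorithm (Steps 1-6) for a 1-M-1 network.
t    = (0:39)';                      % input: time indices (placeholder series)
zeta = 0.5 + 0.4*sin(0.3*t);         % target series scaled into (0, 1) (placeholder)
P = numel(t);  M = 4;                % P examples, M hidden units
a = 2;  eta1 = 0.5;  eta2 = 0.05;    % slope and learning rates (chosen as in this section)

g = @(x) 1 ./ (1 + exp(-x));         % hidden activation, unit slope
f = @(u) 1 ./ (1 + exp(-a*u));       % output activation with slope a

w = 0.1*(1:M)/M;                     % Step 1: fixed initial input-to-hidden weights (arbitrary)
W = 0.1*ones(M, 1);                  % Step 1: fixed initial hidden-to-output weights (arbitrary)

for epoch = 1:2000                               % Step 6: iteration limit
    E = 0;
    for l = randperm(P)                          % Step 3: random presentation order
        x = w' * t(l);   y = g(x);               % forward pass, Eqs. (2.1)-(2.2)
        O = f(W' * y);                           % Eqs. (2.3)-(2.4)
        e = zeta(l) - O;                         % Step 4: error, Eq. (2.5)
        E = E + 0.5*e^2;
        df = a*O*(1 - O);   dg = y.*(1 - y);     % derivatives, Eq. (3.7)
        dW = eta1 * e * df .* y;                 % Step 5: Eq. (2.8)
        dw = eta2 * e * df * t(l) * (W .* dg)';  % Step 5: Eq. (2.9)
        W = W + dW;   w = w + dw;
    end
    if E < 0.005, break, end                     % Step 6: target error reached
end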

We have three parameters: the two training rates $\eta_1$ and $\eta_2$, and the slope parameter a of the output activation function. The three parameters are selected as follows, performing 500 iterations for each trial.

First, we fix $\eta_1$ and $\eta_2$. The sum squared error is recorded for different a, and we pick the a value for which convergence is faster and the sum squared error is small.

Second, we keep the selected a, fix $\eta_1$, and check different values of $\eta_2$. The sum squared error is recorded for each $\eta_2$, and the training rate $\eta_2$ with the minimum error is chosen.

Last, we keep the selected a and $\eta_2$ and check different values of $\eta_1$; we finally determine the $\eta_1$ with the minimum sum squared error.

After the three parameters have been selected, we run the algorithm for at most 2000 iterations with a fixed target error. A sketch of this three-stage search is given below.
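
The three-stage coordinate search can be written as three small loops. In the sketch, train_sse is a placeholder that should run the adapted BP algorithm for 500 iterations and return the resulting sum squared error (here replaced by a toy function so the snippet runs), and the candidate grids are assumptions of ours.

% Sketch of the three-stage selection of (a, eta2, eta1) described above.
train_sse = @(a, eta1, eta2) (a - 3).^2 + (eta1 - 0.5).^2 + (eta2 - 0.05).^2;  % toy stand-in

a_grid    = 0.5:0.5:6;                     % candidate slope values (assumed grid)
eta1_grid = [0.1 0.17 0.25 0.5 1.5];       % candidate training rates (assumed grids)
eta2_grid = [0.002 0.01 0.05 0.1 0.5];
eta1 = 0.5;  eta2 = 0.05;                  % initially fixed training rates

[~, k] = min(arrayfun(@(av) train_sse(av, eta1, eta2), a_grid));    % stage 1: choose a
a = a_grid(k);
[~, k] = min(arrayfun(@(e2) train_sse(a, eta1, e2), eta2_grid));    % stage 2: choose eta2
eta2 = eta2_grid(k);
[~, k] = min(arrayfun(@(e1) train_sse(a, e1, eta2), eta1_grid));    % stage 3: choose eta1
eta1 = eta1_grid(k);
% The selected (a, eta1, eta2) are then used for a full run of at most 2000
% iterations with the target error given in Section 5.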

In this paper, the network consists of one input node, four hidden units and one output node. We test the performance of the BP algorithm with varying slope of the activation function and different learning rates on 16 instances of time series. The value of the input node is the time index of the series, time = 0, 1, 2, ..., and the target value of the output node is the corresponding datum of the series. Our adapted BP algorithm was employed to train the 16 instances. For each instance we obtained the sum squared errors for three different slope parameters, all with the same training rates (the middle slope parameter gives the minimum sum squared error). The experimental results are shown in Table 1, and partial simulation results for the instances are plotted in Fig. 3.

From Table 1 and Fig. 3 we see that, for the same training rates, the training performance obtained with different slopes differs greatly for most instances, while it is similar for a few instances. We conclude that the basic BP algorithm can achieve very good training performance by adjusting two different training rates and one slope parameter.

5. Prediction performance of our adopted BP algorithm and the BP algorithm with momentum

In this section, we test the prediction performance of our adapted BP algorithm with varying slope of the activation function and different training rates, and compare the test results with those of the BP algorithm with gradient descent momentum and an adaptive learning rate.

Table 1
The experimental results on 16 sets for the BP algorithm with varying slope of the activation function

Instance  Number of data  Training rates            Slope parameter (a) and sum squared error
1         65              η1 = 0.5,  η2 = 0.05      a = 3:   1.6006       a = 4:    0.0562       a = 5:   0.1470
2         65              η1 = 0.5,  η2 = 0.1       a = 4:   0.1197       a = 5:    0.1084       a = 6:   0.1143
3         65              η1 = 0.5,  η2 = 0.05      a = 4:   0.0030       a = 5:    0.0028       a = 6:   0.0051
4         65              η1 = 0.1,  η2 = 0.5       a = 0.5: 0.0042       a = 1:    0.0041       a = 1.5: 0.0049
5         65              η1 = 0.5,  η2 = 0.05      a = 2:   1.9002       a = 3:    0.5143       a = 4:   23.1957
6         65              η1 = 0.5,  η2 = 0.01      a = 2:   1.7868       a = 3:    0.1431       a = 4:   19.9188
7         40              η1 = 0.1,  η2 = 0.5       a = 1:   0.0148       a = 2:    0.0119       a = 3:   0.0125
8         40              η1 = 0.5,  η2 = 0.05      a = 2:   0.4826       a = 3:    0.0474       a = 4:   0.0559
9         40              η1 = 0.17, η2 = 0.05      a = 1:   1.3462       a = 2:    0.7558       a = 3:   1.1129
10        40              η1 = 0.17, η2 = 0.002     a = 0.5: 0.0429       a = 1:    0.0050       a = 2:   0.6188
11        40              η1 = 0.5,  η2 = 0.012     a = 2:   0.9160       a = 3:    0.0279       a = 4:   2.3028
12        40              η1 = 0.25, η2 = 0.5       a = 0.5: 5.6555e-004  a = 0.95: 3.2507e-004  a = 1.5: 7.6633e-004
13        40              η1 = 0.25, η2 = 0.025     a = 1:   1.6791       a = 2:    1.3651       a = 3:   1.4455
14        40              η1 = 0.17, η2 = 0.056     a = 1:   3.3927       a = 2:    0.9299       a = 3:   2.0429
15        40              η1 = 0.5,  η2 = 0.033     a = 2:   1.7911       a = 3:    0.0205       a = 4:   4.1184
16        40              η1 = 1.5,  η2 = 0.5       a = 4:   0.0089       a = 5:    0.0086       a = 6:   0.0089


In our adapted BP algorithm, the two training rates $\eta_1$ and $\eta_2$ were those listed in Table 1, and the slope parameter a was taken, for each instance, as the middle value in Table 1 (the one with the minimum sum squared error). The maximum number of iterations and the target value of the sum squared error were set as follows:

net.trainParam.epochs = 2000;
net.trainParam.goal = 0.005;

Fig. 3. Partial simulation results of the instances by the BP algorithm with varying slope of activation function.


TRAINGDX is a MATLAB network training function that updates weight and bias values according to gradient descent momentum and an adaptive learning rate. The function TRAINGDX was employed to train the sixteen instances. For the BP algorithm with momentum we select a network with five input nodes, four hidden units and one output unit, and the current value is predicted from the previous five values; the five input values are therefore obtained from the current value by delays of 1–5 time steps, and the target value of the output unit is the current value. The activation functions of the hidden layer and the output layer are the hyperbolic tangent sigmoid transfer function TANSIG and the linear transfer function PURELIN. The parameters of the function TRAINGDX are selected as follows:

net.performFcn = 'sse';
net.trainParam.epochs = 2000;
net.trainParam.lr = 0.01;
net.trainParam.lr_inc = 1.05;
net.trainParam.lr_dec = 0.7;
net.trainParam.goal = 0.005;

Table 2
Simulation results of prediction for our adapted BP algorithm and the BP algorithm with momentum updating and an adaptive learning rate on 16 instances

                              Our adapted BP algorithm                    BP algorithm with momentum updating
Instance  Number of data      SSE (training)     SSE (test)               SSE (training)     SSE (test)
1         65                  0.0607             1.0945e-004              0.0179             0.0104
2         65                  0.1061             0.0023                   0.0026             0.0450
3         65                  0.0038             5.9083e-005              0.0044             3.7632e-004
4         65                  0.0043             2.3597e-004              5.2604e-004        8.4070e-004
5         65                  0.3592             0.3344                   0.0590             2.5838
6         65                  0.7756             0.0437                   0.0169             0.9504
7         40                  0.0103             0.0039                   0.0094             0.0554
8         40                  0.0541             0.0015                   0.0264             0.0097
9         40                  0.8784             0.0081                   0.3284             2.2324
10        40                  0.0054             4.8038e-004              0.0074             0.5368
11        40                  0.0263             0.0046                   0.0064             0.2137
12        40                  4.7998e-004        1.0114e-004              4.2961e-004        2.5851e-004
13        40                  1.4882             0.0557                   0.7205             0.5695
14        40                  0.3011             0.0959                   0.0983             0.7239
15        40                  0.0157             0.0048                   0.0039             0.5754
16        40                  0.0107             1.9815e-004              0.0090             0.0019

(SSE = sum squared error of the training examples / test examples.)


net.trainParam.mc = 0.9;
net.trainParam.min_grad = 1e-10;

For instances 1 to 6 we used the first 52 data points to form the training examples and the last 13 to form the test examples; for instances 7 to 16 we used the first 30 data points for training and the last 10 for testing. The simulation results are shown in Table 2. Furthermore, Table 2 indicates that the test-example sum squared errors, averaged over the 16 instances, are 0.0348 for our adapted BP algorithm and 0.5319 for the BP algorithm with momentum updating. Our adapted BP algorithm gives the better test performance on all 16 instances (100%).
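
For reference, the comparison experiment on one series could be assembled roughly as follows using the classic (pre-R2010b) Neural Network Toolbox syntax. The delay embedding, the newff call and the train/sim evaluation are our reconstruction of the description above, and the series s is a placeholder, not data from the paper.

% Sketch of the TRAINGDX comparison on one 65-point series s, predicting the
% current value from the previous five values, with a 52/13 train/test split.
s = 0.5 + 0.4*sin(0.3*(0:64)');          % placeholder series standing in for an instance
X = [s(1:end-5) s(2:end-4) s(3:end-3) s(4:end-2) s(5:end-1)]';   % 5 x 60 delayed inputs
T = s(6:end)';                           % 1 x 60 targets (the current values)
ntr = 52 - 5;                            % targets within the first 52 points -> 47 training patterns
Ptr = X(:, 1:ntr);      Ttr = T(1:ntr);          % training examples
Pte = X(:, ntr+1:end);  Tte = T(ntr+1:end);      % test examples (last 13 points)

net = newff(minmax(Ptr), [4 1], {'tansig', 'purelin'}, 'traingdx');
net.performFcn          = 'sse';
net.trainParam.epochs   = 2000;
net.trainParam.goal     = 0.005;
net.trainParam.lr       = 0.01;
net.trainParam.lr_inc   = 1.05;
net.trainParam.lr_dec   = 0.7;
net.trainParam.mc       = 0.9;
net.trainParam.min_grad = 1e-10;

net = train(net, Ptr, Ttr);
sse_train = sum((sim(net, Ptr) - Ttr).^2);   % sum squared error of the training examples
sse_test  = sum((sim(net, Pte) - Tte).^2);   % sum squared error of the test examples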

6. Conclusions

Some adaptations were proposed to the basic BP algorithm in order to provide an efficient method for non-linear data learning and prediction. Our adapted BP algorithm with varying slope of the activation function and different learning rates was employed to train 16 instances. The experimental results indicated that, for the same training rates, the training performance obtained with different slope parameters differs greatly for most instances and is similar for a few instances. We conclude that the basic BP algorithm can achieve very good training performance by adjusting two different training rates and one slope parameter.

We also tested the prediction performance of our adapted BP algorithm on the sixteen instances and compared the test results with those of the BP algorithm with gradient descent momentum and an adaptive learning rate. The results indicated that the mean test-example sum squared errors for our adapted BP algorithm and for the BP algorithm with momentum updating are 0.0348 and 0.5319, respectively; our adapted BP algorithm gives the better test performance on all 16 instances (100%).

The contribution is that the basic BP algorithm can achieve very good training and test performance by adjusting two different learning rates and one slope parameter of the output activation function. The results of the tests on sixteen sets indicated that our adapted BP algorithm produces a smoothed reconstruction that generalizes better to new function values than the BP algorithm improved with momentum updating and an adaptive learning rate.

Acknowledgement

The authors are grateful for the support of this research by the National Postdoctoral Science Foundation of China (No. 20060400032).


References

[1] Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. In: Parallel distributed processing: explorations in the microstructures of cognition, vol. 1. Cambridge (MA): MIT Press; 1986. p. 318–62.
[2] Møller MF. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 1993;6:525–33.
[3] Hagan MT, Menhaj MB. Training feedforward networks with the Marquardt algorithm. IEEE Trans Neural Networks 1994;5:989–93.
[4] Rojas R. Neural networks: a systematic introduction. Berlin: Springer-Verlag; 1996.
[5] Jacobs RA. Increased rates of convergence through learning rate adaptation. Neural Networks 1988;1:295–307.
[6] Najim K, Chtourou M, Thibault J. Neural network synthesis using learning automata. J Systems Eng 1992;2(4):192–7.
[7] Poznyak AS, Najim K, Chtourou M. Use of recursive stochastic algorithm for neural network synthesis. Appl Math Modelling 1993;17:444–8.
[8] Ben Nasr M, Chtourou M. A hybrid training algorithm for feedforward neural networks. Neural Process Lett, published online 29 September 2006, doi:10.1007/s11063-006-9013-x.
[9] Haykin S. Neural networks: a comprehensive foundation. Beijing: Tsinghua University Press and Prentice-Hall; 2001.
[10] Bai Y, Jin Z. Prediction of SARS epidemic by BP neural networks with online prediction strategy. Chaos, Solitons & Fractals 2005;26(2):559–69.
[11] Vogl TP, Mangis JK, Rigler AK, Zink WT, Alkon DL. Accelerating the convergence of the backpropagation method. Biol Cybernet 1988;59:257–63.