


Recursive Sigmoidal Neurons for Adaptive Accuracy Neural Network Implementations

Koldo Basterretxea
Dept. of Electronic Technology
University of the Basque Country (UPV/EHU)
Bilbao, Basque Country, Spain

[email protected]

Abstract—This paper describes an accuracy-programmable sigmoidal neuron design and its hardware implementation. The “recursive neuron” can be adjusted to produce recursively more accurate and smoother piecewise-linear approximations to the sigmoidal neural squashing function. This adaptive-accuracy neuron, combined with a constructive training algorithm, can be used as the basic component for the implementation of self-adaptive neural processing systems able to optimize power consumption and processing speed when operating in applications with changing performance requirements and varying operational constraints.

Keywords—self-adaptive artificial neural network; sigmoidal function; recursive neuron; Centered Recursive Interpolation

I. INTRODUCTION

Dedicated hardware implementation is decisive if the inherent properties of Artificial Neural Networks (ANNs) are to be exploited. The application of this computing scheme to many real-world problems requires truly parallel architectures capable of satisfying real-time requirements, very often with reduced silicon area and low power consumption. ANNs are complex information processing schemes that commonly require the computation of smooth nonlinear functions (neural squashing or activation functions, such as the sigmoid functions). However, when facing the hardware design of ANNs, the accuracy in the computation of the basic nodal nonlinear functions is usually considered a second-order problem. Consequently, many hardware designs use very simple, roughly generated activation functions such as step or ramp functions instead of sigmoid-like ones, since computing accurately approximated and smooth nonlinear functions requires storing a great number of parameters or data in memory, or using complex circuits that occupy a large silicon area and may drastically reduce system operation speed.

The roughness in the computation of the nodal functions is usually justified, when it is justified at all, by referring to the great parallelism and the massive adjustable parameter set of ANNs, and to their ability to mask the errors resulting from low precision in the generation of the basic nodal functions. This reasoning rests on the universal approximation property of ANNs, demonstrated in various works [1], [2], even when simple nodal piecewise-linear (PWL) functions are used [3], [4]. Indeed, ANNs are capable of adjusting their parameters by means of a training algorithm and producing the desired output even with very simple nodal activation functions. Yet the aforementioned theorems are existence theorems, and attention must be paid to the quantitative aspects of approximation theory [5]; i.e., the size of the generated network (the number of layers and the total number of neurons). In other words, the accuracy and, often more importantly, the smoothness of the nodal functions affect the general properties and modeling capabilities of ANNs. As pointed out in [6], an ANN design generally has to solve a compromise between performance and the fulfillment of various constraints. Examples of such constraints are minimality (the smallest ANN solving a task) and smoothness (smooth behavior of the approximating model). Fewer neurons in a network topology imply a reduced computational load, faster training, less silicon area, lower power consumption, and faster information processing. They also imply a drastic reduction of the interconnections between neurons, which grow rapidly with the number of neurons, a critical issue for VLSI implementations.

In most applications, the use of smooth nonlinear neural activation functions produces better mappings and lower errors. For instance, when ANNs are applied to control problems, the smoothness of the produced mapping surfaces is usually much appreciated. There may, however, be other types of tasks where smooth mappings do not necessarily yield better performance. Such systems can benefit from simpler piecewise-linear activation functions, which are easier to compute and therefore allow faster processing of information.

In this paper, the use of adjustable-accuracy sigmoidal processing nodes, or neurons, in ANNs is presented and discussed. The computation of these sigmoidal approximators rests on a lattice algebra-based recursive algorithm that produces successively smoother PWL sigmoid-like functions with no additional logic or memory requirements. The use of these nodes in the hardware design of parallel processing ANNs introduces a new system parameter, the “recursion level” of the neurons, which can be exploited to obtain more efficient implementations by adaptively adjusting the network architecture to the performance requirements of each application.

II. THE RECURSIVE NEURON

The recursive neuron is based on a computational algorithm named CRI (Centered Recursive Interpolation), a mathematical method that exploits lattice algebra (an algebra based on maximum and minimum operators) to approximate nonlinear functions by means of efficient PWL function computation [7]. The CRI scheme produces PWL approximations whose number of segments grows exponentially with successive iterations. For this particular application, CRI has been optimized to approximate the sigmoid function [8] given by

f(x) = \frac{1}{1 + e^{-x}}          (1)

The initial PWL structure is obtained by generating three straight lines:

y_1(x) = 0; \quad y_2(x) = \tfrac{1}{2}\left(1 + \tfrac{x}{2}\right); \quad y_3(x) = 1          (2)

where y_2(x) is the tangent line to the reference function (1) at x = 0, so a null error is obtained for that input and the squashing property of the function is assured. Considering only the positive semi-axis (x ≥ 0), since the output for negative inputs can easily be computed as f(-x) = 1 - f(x), the CRI algorithm for the sigmoidal approximators is as follows:

g(x) = y_2(x);  h(x) = y_3(x);
for (i = 0; i < q; i++) {
    g'(x) = Min[ g(x), h(x) ];
    h(x)  = 1/2 [ g(x) + h(x) ] - Δ;
    g(x)  = g'(x);
    Δ     = Δ / 4;
}
g(x) = Min[ g(x), h(x) ];          (3)

where q is the interpolation level, Δ is the depth parameter, which must be stored in memory for each value of q, h(x) is the linear interpolation function, and g(x) is the resulting approximated function. g'(x) is only necessary to obtain a sequential algorithm, and becomes unnecessary when g(x) and h(x) are computed in parallel. For an optimal approximation, the value of Δ that produces the minimum value of the maximum error over all inputs for each interpolation level q (Δ_{q,opt}) has been obtained. Notice that the number of segments (nos) increases exponentially with each recursion, being nos = 2^{q+1} + 1. Fig. 1 shows the approximations for q = 0, 1, 2 and 3, from a bare ramp function to a smooth sigmoid approximation with 17 segments.
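To make the recursion concrete, the following short floating-point sketch (not taken from the paper) implements (3); the optimal depth values Δ_{q,opt} are not listed in the text, so the values in DELTAS_OPT are illustrative placeholders only, and the function name cri_sigmoid is ours.

```python
import numpy as np

# Placeholder depths: the paper's optimal Delta_q,opt values are not reproduced here.
DELTAS_OPT = [0.0, 0.03, 0.008, 0.002, 0.0005]   # hypothetical values, one per level q

def cri_sigmoid(x, q, deltas=DELTAS_OPT):
    """PWL approximation of f(x) = 1/(1+exp(-x)) at CRI interpolation level q."""
    x = np.asarray(x, dtype=float)
    neg = x < 0.0
    xp = np.abs(x)                       # work on the positive semi-axis only
    g = 0.5 * (1.0 + xp / 2.0)           # y2(x): tangent to the sigmoid at x = 0
    h = np.ones_like(xp)                 # y3(x) = 1: upper saturation line
    delta = deltas[q] if q < len(deltas) else deltas[-1]
    for _ in range(q):
        g_new = np.minimum(g, h)         # g'(x) = Min[g(x), h(x)]
        h = 0.5 * (g + h) - delta        # centered interpolation segment
        g = g_new
        delta /= 4.0                     # depth parameter shrinks by 4 per level
    out = np.minimum(g, h)               # final g(x) = Min[g(x), h(x)]
    return np.where(neg, 1.0 - out, out) # fold negative inputs: f(-x) = 1 - f(x)
```

With q = 0 the loop is skipped and the function reduces to the saturating ramp of Fig. 1; each additional level refines the positive semi-axis, in line with nos = 2^{q+1} + 1.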

Figure 1. CRI-optimized approximation of the sigmoidal function from interpolation level 0 (CRI0) to interpolation level 3 (CRI3).

III. RECURSIVE NEURON-BASED ANN WITH ADAPTIVE ARCHITECTURE

To evaluate the performance of the recursive neuron-based ANN, the universal approximation property of such structures is invoked. Precisely, let φ(x) be a monotonically non-decreasing, bounded, non-constant function (the sigmoid function (1) is one), let K be a compact subset of ℝ^n, and let f(x_1, ..., x_n) be a real continuous function on K. Then, for any ε > 0, there exist an integer N and real constants c_i, θ_i (i = 1...N) and ω_ij (i = 1...N, j = 1...n) such that

\tilde{f}(x_1, \ldots, x_n) = \sum_{i=1}^{N} c_i \, \varphi\!\left( \sum_{j=1}^{n} \omega_{ij} x_j + \theta_i \right)          (4)

satisfies

\max_{x \in K} \left| \tilde{f}(x_1, \ldots, x_n) - f(x_1, \ldots, x_n) \right| < \varepsilon          (5)

In fact, as pointed out above, this theorem is also satisfied by ANNs with PWL sigmoidal functions such as our CRI approximations [2]. But for a given f(x_1, ..., x_n), a given φ(x), and a given value of ε, the minimum value of N that guarantees (5) cannot be stated analytically.
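As an illustrative sketch only, the network output of (4) with CRI activations can be written as follows; it reuses the cri_sigmoid function sketched in Section II, and the names W, theta and c (weight matrix, biases, and output weights of a single-hidden-layer network) are ours.

```python
import numpy as np
# Assumes cri_sigmoid from the Section II sketch is in scope.

def ann_forward(x, W, theta, c, q):
    """Single-hidden-layer network of (4):
    f~(x) = sum_i c_i * phi(sum_j w_ij x_j + theta_i), with the CRI node as phi."""
    x = np.asarray(x, dtype=float)       # input vector, shape (n,)
    a = W @ x + theta                    # hidden pre-activations, shape (N,)
    return float(c @ cri_sigmoid(a, q))  # linear output node
```

Training then adjusts W, theta and c for a fixed recursion level q of the hidden nodes.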

In order to produce optimized (minimum-size) ANN architectures based on the adaptive neurons described above, a combination of a parametric training algorithm and a constructive, growing-network strategy has been programmed. The algorithm begins by initializing the network with the minimum number of nodes (one node) in the hidden layer and by applying the minimum interpolation level (q = 0) to the recursive neurons. If the ANN is unable to achieve the imposed performance goal (usually a minimum approximation error on the training set), the interpolation level q is increased. If the maximum interpolation level is reached for a given architecture, the constructive algorithm adds a new node to the hidden layer and resets q to zero. This scheme is repeated until the performance goal is satisfied.
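A schematic outline of this growing strategy is sketched below; since the paper does not specify the parametric training algorithm, the trainer and the error measure are passed in as placeholder callables, and all names are ours.

```python
def grow_adaptive_ann(train_fn, eval_fn, goal_mse=1e-4, q_max=4, max_nodes=100):
    """Constructive strategy of Section III (sketch).
    train_fn(hidden_nodes, q) -> trained net; eval_fn(net) -> training-set MSE.
    Both stand in for the paper's (unspecified) parametric training algorithm."""
    n_nodes, q = 1, 0
    while n_nodes <= max_nodes:
        net = train_fn(hidden_nodes=n_nodes, q=q)
        if eval_fn(net) <= goal_mse:
            return net, n_nodes, q       # performance goal met
        if q < q_max:
            q += 1                       # first try a more accurate, smoother node
        else:
            n_nodes += 1                 # then grow the hidden layer...
            q = 0                        # ...and restart from the coarsest level
    raise RuntimeError("performance goal not reached within the imposed limits")
```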




A. A Sinusoidal Function

Two simple function approximation problems, solved by training one-hidden-layer ANNs, have been studied. The first one is the approximation of a sinusoidal function given by

y(t) = 1 + \sin(\pi t)          (6)

This is a simple function, but it activates all the parameters of an ANN (weights and biases). The training data set was obtained from 41 equally spaced samples in time. The generalization error (test-set error) was evaluated at 161 samples, including those belonging to the training set. The imposed performance goal was MSE = 1e-04. The obtained results are shown in Table I. Each column shows the produced network size, the total number of parameters, and the obtained performance when a limit is imposed on the maximum interpolation level of the CRI nodes. It is clear from these figures that high interpolation levels produce smaller networks and better approximations. In this case, though, the objective performance is already reached for q = 3, and no benefit is obtained by extending the maximum recursion limit to q = 4. These results are not surprising considering that the objective function is a smooth nonlinear trigonometric function (see Fig. 2).

TABLE I

q limit       q=0        q=1        q=2        q=3        q=4
Nodes         21         9          7          5          5
Parameters    64         28         22         16         16
Achieved q    0          1          2          3          3
RMSE test     5.629e-02  1.341e-02  1.011e-02  0.922e-02  1.341e-02

B. A Sawtooth Function

It is clear, therefore, that using high interpolation levels in the recursive neuron can save many nodes and reduce network complexity when smooth mappings are required. But what if the ANN processor is applied to a task that does not require smooth mappings? In that case it is very probable that the optimum network architecture can be achieved with low interpolation levels in the sigmoidal generators, saving occupied area and accelerating the processing of information. To reveal more clearly what kind of networks a non-smooth mapping objective would produce, a sawtooth function was selected to generate a training data set and to evaluate the network generalization capability. As in the previous experiment, 41 equally spaced samples were used for training and the generalization capability was tested at 161 samples. The imposed performance goal was MSE = 1e-04. This time the goal was achieved with q = 0 and 4 neurons, no matter what limit had been imposed on this parameter. This suggests that the optimum network is produced with simple ramp-shaped nodes, which better fit the tooth-shaped mapping.

To verify that higher interpolation levels do not produce any benefit in this case, the interpolation level was forced to a previously set value (q = 1, 2, 3 and 4) before the net was trained. Table II summarizes the obtained results. As can be observed, no smaller networks are produced for higher interpolation levels, although slightly better test errors are obtained for q = 3 and q = 4. Fig. 3 graphically shows the produced mappings for the minimum and the maximum interpolation levels considered. Notice that, although the calculated test error was slightly smaller for CRI4 nodes than for CRI0 nodes, Fig. 3 shows that generalization is in fact not improved but poorer, as a consequence of an oversized network.

TABLE II

Int. level    q=0       q=1       q=2       q=3       q=4
Nodes         4         4         6         6         5
Parameters    13        13        19        19        16
RMSE test     0.19976   0.21216   0.20270   0.19407   0.19914

C. More Realistic Examples

The results obtained for the simple approximation problems were verified on two more complex tasks. First, the system was applied to a two-dimensional, highly nonlinear approximation problem that could resemble a complex control surface. The training and testing data sets were obtained by sampling the function

y = \sin(\pi x_1) \cos(\pi x_2)          (7)

The training data set comprised 730 samples, while the generalization error was tested at 6561 samples. The system was a 2-N-1 network with sigmoidal nodes in the hidden layer and a linear node at the output. Table III shows the results obtained after training. Here again, for a smooth mapping, the smallest and most accurate network is obtained for the highest interpolation levels. Fig. 4 shows graphic representations of the tested network outputs for the minimum and the maximum interpolation levels.

Figure 2. ANN mappings of the sine function with 21 CRI0 nodes in the hidden layer (top) and with 5 CRI3 nodes in the hidden layer (bottom).



TABLE III

q limit       q=0        q=1        q=2        q=3        q=4
Nodes         73         22         13         12         11
Parameters    293        89         53         49         45
Achieved q    0          1          2          3          4
RMSE test     1.151e-02  1.073e-02  0.987e-02  0.956e-02  0.853e-02

The second experiment is a well-known classification problem based on the Iris data. It involves the classification of three subspecies of the Iris flower, namely Iris setosa, Iris versicolor and Iris virginica, on the basis of four feature measurements of the flower: sepal length, sepal width, petal length and petal width [9]. There are 50 patterns for each of the three subspecies. Two subspecies, versicolor and virginica, substantially overlap, while Iris setosa is well separated from the other two. For this classification problem the ANN model is a 4-N-3 architecture plus hardlimit (step function) nodes connected to the output neurons to force 0/1 outputs. The objective was to achieve a system that produces no misclassifications. Table IV summarizes the obtained network architectures. It can be observed that for this task q = 2 is the maximum interpolation level generated by the training algorithm, even when higher interpolation levels were permitted, so there would be no reason to use nodes with higher values of q, which would require longer processing times.

TABLE IV

q limit       q=0   q=1   q=2   q=3   q=4
Nodes         7     4     3     3     3
Parameters    71    39    31    31    31
Achieved q    0     1     2     2     2
Misclass.     0%    0%    0%    0%    0%
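For illustration only (the threshold location, the 0/1 target encoding, and all names are assumptions, not taken from the paper), the hardlimit output stage and the misclassification count used above could be modelled as:

```python
import numpy as np

def hardlimit_outputs(net_outputs, threshold=0.0):
    """Force the three output nodes to 0/1 with a step function (threshold assumed)."""
    return (np.asarray(net_outputs) >= threshold).astype(int)

def misclassification_rate(outputs, targets):
    """Fraction of patterns whose thresholded 0/1 output vector differs from its target."""
    preds = np.array([hardlimit_outputs(o) for o in outputs])
    return float(np.mean(np.any(preds != np.asarray(targets), axis=1)))
```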

Figure 4. ANN mappings of the two-dimensional function defined in (7) with 73 CRI0 nodes in the hidden layer (top) and with 11 CRI4 nodes in the hidden layer (bottom).

IV. HARDWARE IMPLEMENTATION OF THE RECURSIVE NEURON

The experimental study shows that a network architecture based on adaptive-accuracy recursive neurons combined with a constructive algorithm can produce a hardware ANN that is optimized in terms of occupied area, consumed power, and processing speed. Such a system could perform adaptively by activating or disconnecting the nodes of the network and by selecting the best interpolation level at each network layer to optimally achieve the required performance figures. The basic module for the hardware implementation of the adaptive ANN is the recursive neuron. Fig. 5 shows the structure of a neuron that serially processes the input signals coming from the preceding layer by multiplying each input by its corresponding weight and accumulating the products. The inputs can also be processed in parallel by providing one multiplier for each input signal. However, considering an adaptive architecture that could give rise to variations in the number of neurons at each layer, a serial scheme may be more efficient in terms of hardware resource utilization, for example in an FPGA. The second stage of the neuron processing is the squashing function, produced by the adaptive sigmoidal approximator whose circuit design and implementation are described in what follows.

Figure 3. Minimum-size ANN mappings of the sawtooth function with 4 CRI0 nodes in the hidden layer (top) and with 5 CRI4 nodes in the hidden layer (bottom).

A digital circuit implementation based on algorithm (3) requires neither a multiplier nor a divider, as all divisions are by powers of two. Only an adder, a subtractor, and a min operator are required. Memory requirements are extremely low, as only the Δ_{q,opt} values must be stored, and these values can be shared by all the neurons in a network. A fixed-point fractional number system with 16-bit word lengths was employed for the circuit design (1 sign bit, 2 bits to the left and 13 bits to the right of the radix point) [10]. This precision is sufficient since the sigmoid output saturates rapidly to its maximum value of 1, so the domain of calculation can be restricted to what can be considered the active domain of the neuron. The widest active domain for the first four interpolation levels is x ∈ [0, 3.86116]. Besides, no overflow can occur in intermediate calculations when 2 bits are used for the integer part. At the input stage the sign and magnitude of the signal are checked, both to detect whether it lies outside the active domain, in which case the saturated output (out = g(x) = 1) is activated, and to detect negative inputs, which activate the output module that performs the operation g(-x) = 1 - g(x).
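A rough behavioural sketch of this input stage and number format is given below; the active-domain bound and the 2000h representation of 1.0 come from the text, while the quantizer and the names are ours, and the CRI evaluation reuses the cri_sigmoid sketch of Section II.

```python
FRAC_BITS = 13                      # Q2.13: 1 sign bit, 2 integer bits, 13 fractional bits
SCALE = 1 << FRAC_BITS              # 1.0 maps to 0x2000, matching reg_h's 2000h initial value
ACTIVE_DOMAIN = 3.86116             # widest active domain for q = 0..3 (from the text)

def to_q2_13(x):
    """Quantize a real value to the 16-bit fixed-point grid used in the design (sketch)."""
    return int(round(x * SCALE))

def squash(x, q):
    """Input stage + CRI generator behaviour: saturate out-of-domain inputs to 1,
    fold negative inputs through g(-x) = 1 - g(x), otherwise evaluate the
    positive-semi-axis CRI approximation (cri_sigmoid from the Section II sketch)."""
    mag, neg = abs(x), x < 0
    g = 1.0 if mag >= ACTIVE_DOMAIN else float(cri_sigmoid(mag, q))
    return 1.0 - g if neg else g
```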

A Xilinx System Generator scheme of the sigmoidal recursive approximator is shown in Fig. 6. The circuit is programmable by choosing the value of the input q between 0 (no interpolation) and 4 (fourth-order interpolation). The circuit latency is q+2 clock cycles, and it is controlled by a three-bit Gray-code up counter and a counter clearer driven by the interpolation-level parameter. When the count value is 000₂, the multiplexer selector (sel) is set to 0 to load into the reg_g, reg_h, and reg_Δ registers the initial values of g(x), that is y₂(x), of h(x), that is 2000₁₆, and of Δ_{q,opt}, respectively. The computation of y₂(x) = x/4 + 1/2 has been simplified by taking advantage of the regularities that occur among the bits (x̄1 denotes the complement of x1):

x        x2  x1  x0  .  x-1  x-2  x-3  x-4  x-5  x-6  x-7  x-8  x-9  x-10  x-11  x-12  x-13
y2(x)    0   x2  x1  .  x̄1   x0   x-1  x-2  x-3  x-4  x-5  x-6  x-7  x-8   x-9   x-10  x-11

Notice that the bit-shift operations are constant and consequently are directly hardwired, so this block has no hardware cost.
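As a quick software check of this wiring (self-contained sketch; the scaling matches the Q2.13 format above and the test values are arbitrary), y₂(x) is obtained with a 2-bit right shift plus the constant 1/2:

```python
SCALE = 1 << 13                      # Q2.13 scaling, as in the previous sketch
HALF = 1 << 12                       # 0.5 in Q2.13: the single bit at position -1

def y2_fixed(x_fx):
    """y2(x) = x/4 + 1/2 on the Q2.13 grid: a 2-bit right shift (free rewiring in
    hardware) plus the constant 1/2, matching the bit pattern shown above."""
    return (x_fx >> 2) + HALF

# Quick check against the floating-point definition inside the active domain
for x in (0.0, 0.5, 1.0, 2.0, 3.5):
    assert abs(y2_fixed(int(round(x * SCALE))) / SCALE - (0.5 + x / 4)) < 2 ** -13
```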

A complete recursive neuron, consisting of a 20×16 multiplier, an accumulator, and a CRI sigmoid generator (see Fig. 5), was synthesized with the Xilinx XST 13.4 synthesizer and implemented with a Xilinx Virtex-5 XC5VSX50T-2 as the target device. This chip contains 288 DSP48E blocks, which can be used to efficiently implement multiply-accumulate circuits. The placement and routing tool reported a total resource usage of 94 slice registers (1%), 160 slice LUTs (1%), and 1 DSP48E block. The latency of the neuron, if implemented with serialized inputs as shown in Fig. 5, is q+2 cycles (the latency of the CRI sigmoidal generator) plus as many cycles as there are neurons in the previous network layer. Consequently, reducing the number of neurons in a network layer also directly speeds up ANN processing.


Figure 5. Schematic of the complete serial-input recursive neuron.

Figure 6. System Generator scheme of the CRI sigmoidal approximator.



Finally, let us examine the implementation figures of a complete ANN based on recursive neurons for a given problem. For this purpose, the two-input function mapping defined in (7) was selected as the reference example. Two 2-N-1 ANNs were implemented, corresponding to the architectures obtained for the minimum and maximum neuron recursion levels, that is, a 2-73-1 CRI0 network and a 2-11-1 CRI4 network respectively (see Table III). The obtained resource usage figures and maximum path delay times are summarized in Table V. Notice that, for equal modeling performance (in fact the second network produces a slightly smaller approximation error and improved mapping smoothness), the CRI4-based neural network occupies much less area than the CRI0-based network (see Fig. 8). The maximum achievable clock frequencies are very similar for the two implementations, but the network latency is smaller for the CRI4-ANN as a consequence of the reduction in the number of hidden-layer nodes: 27 clock cycles for the CRI4-ANN versus 81 clock cycles for the CRI0-ANN.

TABLE V
DEVICE UTILIZATION SUMMARY

Logic                    CRI0-ANN         CRI4-ANN
Slice Registers          6,892  (21%)     1,133  (3%)
Slice LUTs               7,620  (23%)     1,271  (3%)
Occupied Slices          2,666  (32%)       429  (5%)
DSP48Es                     80  (27%)        17  (5%)
Block RAM                   54  (40%)        11  (8%)
Total Memory (KB)        1,332  (28%)       216  (4%)
Max. path delay (ns)     1.904            1.868
Latency (clk cycles)        81               27

Figure 8. FPGA device occupation layouts of the two implemented CRI-ANNs: 2-11-1 CRI4-ANN (left) and 2-73-1 CRI0-ANN (right).

Figure 7. Snapshots of the outputs generated by bit-true, cycle-accurate simulations of the circuit depicted in Fig. 6 for the first three levels of interpolation.

V. CONCLUDING REMARKS

A hardware-friendly recursive sigmoidal neuron computation module has been described. This adaptive-accuracy neuron, combined with a constructive training algorithm, is able to produce ANN architectures that are optimized in terms of complexity, modeling performance, and processing speed for the target application. A digital hardware implementation of the neuron has been described, implemented, and verified. This recursive neuron can be used as the basic component for the implementation of self-adaptive neural processing systems on reconfigurable hardware that would be able to optimize power consumption and processing speed when operating in applications with changing performance requirements and varying operational constraints.

ACKNOWLEDGMENT

This work was supported in part by the Spanish Ministry of Science and Innovation and European FEDER funds under Grant TEC2010-15388, and by the Basque Country Government under Grants S-PC10UN09 and SPC11UN012.

REFERENCES

[1] G. Cybenko, "Approximation by Superpositions of a Sigmoidal Function," Mathematics of Control, Signals, and Systems, vol. 2, pp. 303-314, 1989.
[2] K.-I. Funahashi, "On the Approximate Realization of Continuous Mappings by Neural Networks," Neural Networks, vol. 2, pp. 183-192, 1989.
[3] K. Hornik, M. Stinchcombe, and H. White, "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[4] M. Stinchcombe and H. White, "Universal Approximation Using Feedforward Networks with Nonsigmoid Hidden Layer Activation Functions," in Proc. IJCNN, Washington, D.C., 1989, pp. 161-166.
[5] V. Kurková, "Approximation of Functions by Perceptron Networks with Bounded Number of Hidden Units," Neural Networks, vol. 8, no. 5, pp. 745-750, 1995.
[6] C. Alippi, "Selecting Accurate, Robust, and Minimal Feedforward Neural Networks," IEEE Trans. Circuits and Systems I, vol. 49, no. 12, pp. 1799-1810, 2002.
[7] J. M. Tarela, K. Basterretxea, I. del Campo, M. V. Martínez, and E. Alonso, "Optimised PWL Recursive Approximation and its Application to Neuro-Fuzzy Systems," Mathematical and Computer Modelling, vol. 35, no. 7-8, pp. 867-883, 2002.
[8] K. Basterretxea, J. M. Tarela, and I. del Campo, "Digital Design of Sigmoid Approximator for Artificial Neural Networks," Electronics Letters, vol. 38, no. 1, pp. 35-37, 2002.
[9] R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, vol. 7, no. 2, pp. 179-188, 1936.
[10] J. L. Holt and T. E. Baker, "Back Propagation Simulations Using Limited Precision Calculations," in Proc. Int. Joint Conf. Neural Networks, Seattle, WA, Jul. 1991, vol. 2, pp. 121-126.
