
INTERNATIONAL JOURNAL OF CIRCUIT THEORY AND APPLICATIONS, VOL. 24,357-367 (1996)

A PROGRAMMABLE VLSI ARCHITECTURE BASED ON MULTILAYER CNN PARADIGMS FOR REAL-TIME VISUAL PROCESSING†

LUIGI RAFFO

Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d'Armi, I-09123 Cagliari, Italy

AND

SILVIO P. SABATINI AND GIACOMO M. BISIO

Department of Biophysical and Electronic Engineering, University of Genova, Via all'Opera Pia 11A, I-16145 Genova, Italy

SUMMARY

A new digital VLSI architecture is presented for the implementation of discrete-time multilayer CNNs. At the functional level, the architecture is organized as 12 layers of 64 x 64 cells which interact as specified by a set of 3D generalized templates. At the structural level, the application of cloning templates occurs in a set of processing units programmed by instruction masks generated on the basis of the algorithm to be emulated. It is demonstrated that this architecture is applicable to multilayer algorithms for visual processing as well as to standard CNNs, including those that use sequences of templates or that work in parallel. Simulations evidence the high efficiency of this implementation.

1. INTRODUCTION

Analogue CNNs have proved to be very effective in various image-processing tasks that can be related to local interactions among processing units arranged in a two-dimensional grid.1,2 However, machine vision processing and other related intelligent assignments require the combination of several different elementary tasks, defined by a series of templates to be used in sequence or concurrently.3 Hence the need for programmable architectures emerges strongly.

CNNs are especially tailored for analogue processing, since they can be directly mapped on a grid of simple analogue processors, and a number of papers have been devoted to the study of specific analogue building blocks for CNN implementation and to programmable solutions based on operational amplifiers. Nevertheless, several practical constraints on the efficacy of analogue CNN implementations are posed by area occupation, input-output interfacing, status memorization and flexibility. Moreover, these solutions offer limited programmability of system parameters, thus preventing a global reconfiguration of the structure of the elaboration (i.e. the network structure for CNNs). Hence architectural solutions able to map several CNN paradigms on the same computational substrate are strongly desirable to fully exploit the potentialities of CNNs in real applications.

In this paper, starting from a generalized reformulation of cell dynamics for multilayer CNNs, we present a reconfigurable digital VLSI architecture able to fulfil both these demands on programmability and the requirements of higher efficiency with respect to commercial DSPs or hardware accelerators. This architecture will be motivated in relation both to standard CNN templates and to a specific algorithm,4 based on a multilayer cortical-like computational model of preattentive visual processing.

† Part of this research has been reported in the Proceedings of the 1994 IEEE International Workshop on Cellular Neural Networks and Their Applications held in Rome.

CCC 0098-9886/96/030357-11 © 1996 by John Wiley & Sons, Ltd.

Received 23 January 1995; Revised 11 July 1995

358 L. RAFFO, S. P. SABATINI AND G. M. BISIO

2. FROM ALGORITHM TO ARCHITECTURE SPECIFICATION

In discrete time the dynamics of a CNN cell can be described by the recursive algorithm

x_{ij}(n+1) = \sum_{kl \in N_r(ij)} A_{ij,kl}\, y_{kl}(n) + \sum_{kl \in N_r(ij)} A'_{ij,kl}\, y_{kl}(n-\tau) + \sum_{kl \in N_r(ij)} B_{ij,kl}\, u_{kl}(n) + \sum_{kl \in N_r(ij)} B'_{ij,kl}\, u_{kl}(n-\tau) + I    (1)

y(n) = f(x(n)),    x(0) = x_0

where x, y, u and I denote cell state, output, input and bias respectively; N_r is the r-neighbourhood of the cell (i, j); A, A', B and B' are the cloning templates; τ is the memory duration time; and f is the non-linear output function. It is noteworthy that if the input u is constant during the iteration, the delayed B-template (B') is null.
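As a behavioural illustration, the recursion of equation (1) with null delayed templates can be sketched in a few lines of Python; the zero-padded border handling and the function names are our assumptions for the sketch, not part of the architecture.

```python
import math

def dtcnn_step(x, u, A, B, I, f=math.tanh):
    """One update of equation (1) with null delayed templates (A' = B' = 0).
    x, u : H x W lists (state and constant input); A, B : 3x3 templates; I : bias."""
    H, W = len(x), len(x[0])
    y = [[f(v) for v in row] for row in x]          # y(n) = f(x(n))
    x_next = [[I] * W for _ in range(H)]            # start from the bias term
    for i in range(H):
        for j in range(W):
            for di in (-1, 0, 1):                   # r = 1 neighbourhood N_1(i, j)
                for dj in (-1, 0, 1):
                    k, l = i + di, j + dj
                    if 0 <= k < H and 0 <= l < W:   # zero padding at the border
                        x_next[i][j] += (A[di + 1][dj + 1] * y[k][l]
                                         + B[di + 1][dj + 1] * u[k][l])
    return x_next
```

Iterating `dtcnn_step` until the state settles reproduces the discrete-time dynamics above for any given pair of 3 x 3 templates.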

N-dimensional generalizations (multilayer CNNs) of discrete-time CNNs can also be formulated on the basis of the multilayer generalization introduced by Chua and Yang.5 In multilayer CNNs each cell is characterized by several variables instead of the single state variable of the single-layer case. We can observe that, whatever its level of complexity, a discrete-time multilayer CNN can be computationally described by an ensemble of nodes locally interacting to reach the prescribed computation.

The set of L nodes associated with each cell can be viewed as the components of a column vector v_ij representing the whole set of inputs, outputs and delayed outputs related to the corresponding location (i, j) on the cell layer. Each set of nodes interacts only with neighbouring sets through a generalized 3D template D that spans the L layers:

v^a_{ij}(n+1) = \sum_{b=1}^{L} \sum_{kl \in N_r(ij)} D^{ab}_{ij,kl}\, g^b(v^b_{kl}(n))    (2)

where a and b index the components of the column vector (i.e. the layer) and (i, j) the elements of a layer. For each component of a vector v_ij one can recognize in the corresponding section of the 3D template the conventional control operators B and B' and the feedback operators A and A'. For example, the CNN described in equation (1) with null delayed templates can be implemented assuming L = 3: v^1 = I, v^2 = u, v^3 = y; D^{11} = D^{22} = D^{31} = 1 for (i, j) = (k, l) and null otherwise; D^{32} = B, D^{33} = A; D^{12} = D^{13} = D^{21} = D^{23} = 0; g^1(x) = x, g^2(x) = x, g^3(·) = f(·). If delayed templates are present, more components of v should be considered, one for each previous output (and/or input) present in the algorithm. For example, for τ = 1: v^1 = I, v^2 = u, v^3 = y, v^4 = y(n − 1); D^{11} = D^{22} = D^{31} = D^{43} = 1 for (i, j) = (k, l) and null otherwise; D^{32} = B, D^{33} = A, D^{34} = A'; D^{12} = D^{13} = D^{21} = D^{23} = D^{14} = D^{24} = D^{41} = D^{42} = 0; g^1(x) = x, g^2(x) = x, g^3(·) = f(·), g^4(x) = x.

It is worth noting that in this way D specifies not only the strength of connections among the cells of the CNN but also the interconnection structure of the CNN itself, thus allowing us to achieve the higher degree of programmability required. This is the form of computation to which we should refer for devising architectural solutions.
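The generalized update of equation (2) for space-invariant templates can likewise be sketched in software, storing the sparse 3D template D as a dictionary of 2D masks indexed by layer pairs; the dictionary encoding is a hypothetical choice made here for illustration. Constant layers such as v^1 = I stay fixed through identity sections, as in the examples above.

```python
def multilayer_step(v, D, g):
    """One update of equation (2) in the space-invariant case.
    v : list of L layers, each an H x W list (components v^a of the column vectors)
    D : sparse dict {(a, b): 3x3 mask} holding the non-null sections D^{ab}
    g : list of L per-layer output functions g^b."""
    L, H, W = len(v), len(v[0]), len(v[0][0])
    # apply the per-layer output functions once
    gv = [[[g[b](v[b][i][j]) for j in range(W)] for i in range(H)] for b in range(L)]
    v_next = [[[0.0] * W for _ in range(H)] for _ in range(L)]
    for (a, b), mask in D.items():          # only non-null template sections
        for i in range(H):
            for j in range(W):
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        k, l = i + di, j + dj
                        if 0 <= k < H and 0 <= l < W:
                            v_next[a][i][j] += mask[di + 1][dj + 1] * gv[b][k][l]
    return v_next
```

Because only the non-null sections of D are stored and visited, the sketch mirrors the projection of the sparse 3D template onto a small set of 2D masks discussed in Section 3.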

3. ARCHITECTURAL SPECIFICATION

3.1. Organization of architectural resources

Considering linear, space-invariant templates, equation (2) requires performing convolution operations with a proper set of masks. It is easy to verify that a direct implementation, i.e. a circuit composed of a single adder-multiplier performing these calculations over the whole cellular array, would result in an infeasible solution for real-time needs. In addition to the performance bottleneck, the need for accessing each single cell of the layers implies an inefficient memorization scheme. On the other hand, conceiving one such device for each cell would result in an excessive silicon area. A trade-off can be sought by considering


a high-level transformation of the original specification through unrolling of the innermost loop of elaboration, to extract the implicit parallelism contained in the original specification, and subsequent loop folding,6 to exploit this parallelism by means of pipelining. This leads to an architectural specification based on a limited set (one per layer) of processing units able to evaluate the new state of a cell through few iterations but no reload of already processed input data. Two main blocks characterize the architecture: the storage block, in which the vectors v are stored, and the processing block, which updates each vector according to the cloning templates D. In this respect each template can be viewed as a 3D array. Since many elements of a cloning template are null, in order to make the implementation more efficient, the 3D template can be projected onto a reduced number of 2D masks (see Figure 1).

3.2. Implementation

Limits on VLSI technology, power consumption and speed of computation pose some constraints on the number of layers and cells, on the dimension of the instruction masks and on the number of bits used to represent weights. The trade-off between performance and available resources depends on the target application domain, specified later in Section 4.2. On this basis we consider 12 layers of 64 x 64 cells, interactions among first neighbours only (i.e. 3 x 3 x 12 cloning templates) and weight magnitudes specified with 3 bits as powers of 2 (the successive non-linear block takes care of scaling). With this choice a compact memorization of weights is achieved and weight multiplications occur through arithmetic shifts.
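The power-of-2 weight scheme can be made concrete as follows; the sketch mirrors the invert/shift/increment sequence of the data processor (Figure 4), assuming the 3 bit magnitude code denotes a right arithmetic shift (i.e. magnitudes 2^0 down to 2^-7).

```python
def weight_multiply(datum, negative, shift):
    """Multiply a datum by a weight of magnitude 2**-shift, sign per `negative`,
    following the data-processor scheme: invert or buffer the datum according
    to the weight sign, arithmetically shift by the magnitude code, then
    increment by one to complete the two's complementation when negative."""
    v = ~datum if negative else datum   # invert (negative weight) or buffer
    v >>= shift                         # arithmetic shift = multiply by 2**-shift
    return v + 1 if negative else v
```

Python's `>>` on negative integers is an arithmetic shift, so the sequence reproduces the hardware behaviour for exact multiples of the shift amount; e.g. `weight_multiply(8, True, 1)` gives `-4`.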

The architectural schema of our system is illustrated in Figure 2. The storage block is based on a single-port RAM. The current/previous outputs of cellular neural network elements are stored in 64 x 64 locations of 96 bits, functionally subdivided into 12 groups of 8 bits to implement 12 layers (L1, ..., L12).
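The 96 bit word organization can be sketched as a simple pack/unpack pair; the byte ordering of the layers within a word (L1 in the least significant byte) is our assumption.

```python
def pack_word(layers):
    """Pack the 12 layer values of one cell (8 bits each) into a 96 bit RAM word."""
    assert len(layers) == 12
    word = 0
    for n, v in enumerate(layers):
        word |= (v & 0xFF) << (8 * n)   # layer n occupies byte n
    return word

def unpack_layer(word, n):
    """Extract layer n (0-based) from a 96 bit word, as the three-state buffers
    of a data processor select one 8 bit block from the 96 bit bus."""
    return (word >> (8 * n)) & 0xFF
```

The unpacking step corresponds to the byte selection performed in hardware by the three-state buffers described below (Figure 4).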

The processing block is composed of 12 processing units (one for each layer), 12 pairs of 16 bit row buffers and 12 instruction mask sets that play the role of cloning templates. The behaviour of each processing unit is controlled by its set of masks, whose elements determine the weight sign and magnitude and the number that identifies the layer in which to read the output. Specifically, each element of a mask is

Figure 1. A pictorial view of the generalized 3D CNN. On the left side, all the nodes contributing to the output of the marked cell in layer L3 are evidenced; positive weights are represented in the corresponding 3D cloning template, while below the related 2D projection masks are represented


Figure 2. Overall structure of the architecture; control signals are omitted

Figure 3. The first and last processing units are depicted. From the bus they receive the same datum referred to a complete column of cells. The data processor extracts and manipulates the data according to the actual mask element (see text). When a datum becomes available at the end of the adder cascade, it is stored in the buffer and then moved to the RAM


composed of two fields: the first specifies the weight (null flag, sign and magnitude); the second addresses the layer in which the mask has to act.

At each iteration the actual value of the vector v is moved from the RAM (scanned row by row) towards all the processing units (see Figure 3). In each processing unit: (i) each data processor (see Figure 4) extracts a portion of the datum according to the content of the element of its masks, then shifts and complements the result if requested; (ii) a cascade of adders operates to add it to the partial sum coming from the preceding rows stored in the buffer.

At each iteration we need to have available three rows of data (the preceding, the actual and the next). The data belonging to these rows are sent to the processing unit, row by row, element by element. When all the data of a row have been transferred to the processing unit, the convolution between the row considered and the first row of the mask is available in the buffer. The content of the buffer is the starting value for the convolution of the second row of the mask with the actual row, and so on for all the masks. When a row is completely processed, it cannot be moved directly to the RAM, because its values are still needed for the next row. Hence we need another buffer to store it while the next row is processed. When the processing of the next row is completed, the content of the second buffer is moved to the RAM through a non-linear block implemented by a clipping function with a programmable slope (Figure 5).
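Behaviourally, the row-by-row buffering scheme amounts to the following partial-sum accumulation: each incoming row contributes one mask row to each of three output rows, and an output row is complete once its three contributions have arrived. This is a functional sketch with zero padding at the borders, not a cycle-accurate model of the buffers.

```python
def row_pipelined_conv(image, mask):
    """3x3 convolution computed row by row with partial sums, mirroring the
    scheme in the text: rows arrive one at a time and each contributes its
    three mask rows to the three output rows that overlap it."""
    H, W = len(image), len(image[0])
    out = [[0.0] * W for _ in range(H)]
    for r in range(H):                    # rows arrive one at a time
        for m in range(3):                # mask row m of the 3x3 mask
            o = r + 1 - m                 # output row receiving this contribution
            if 0 <= o < H:
                for j in range(W):
                    for dj in (-1, 0, 1):
                        c = j + dj
                        if 0 <= c < W:    # zero padding at the border
                            out[o][j] += mask[m][dj + 1] * image[r][c]
    return out
```

After row r has been consumed, output row r − 1 is complete and can pass through the non-linear block, which is exactly why one extra row buffer suffices.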

Figure 4. A data processor is depicted. The datum from the 96 bit bus is subdivided into 12 blocks connected to an 8 bit bus through three-state buffers. The 8 bits are inverted or buffered (according to the sign of the weight) and arithmetically shifted. To complete the two's complementation, the resulting value is incremented by one if the weight is negative

Figure 5. Transfer functions of the non-linear programmable block


This schema limits the number of transfers to the RAM, allowing the storage of only the values useful for the next iteration and avoiding the physical duplication of the storage block.

Thanks to the horizontal pipeline schema and the parallelization of the fetch from memory with the mask computation, the number of clock periods needed for an iteration update is nine, since a convolution mask needs three buffer updates, each lasting three clock periods (for data processing and sum).

4. APPLICATIONS

4.1. Single-layer CNN algorithms

Most applications of CNNs could be performed by our architecture too. In its simplest discrete-time formulation a CNN can be implemented with two layers, i.e. one input layer and a status/output layer.

Figure 6. Possible utilization of the architecture for CNN algorithms: (a) different CNNs performing different computations on the same input; (b) several CNNs working in parallel on different inputs; (c) a delay-type CNN


Hence our architecture is able to implement both several CNNs working in parallel and delay-type CNNs, as sketched in Figure 6 and detailed in the following two examples.

A delay-template CNN can be implemented using a layer for each previous state of the elaboration we are interested in. This architecture can implement a CNN with τ ≤ 10.

Edge detection. Many cloning templates for edge detection have been presented.7 Figure 7 shows the result of the implementation of a 3 x 3 cloning template A with circular symmetry (2 in the middle, −0.25 for the neighbours). This operator is mapped on the architecture according to the example of Section 2.

Connected component detection. We present in Figure 8 the results of the simulation of the CNN proposed in Reference 8 for connected component detection (see caption).

4.2. A multilayer algorithm for preattentive visual tasks

Problem description. Many machine vision tasks are based on the recurrent application of simple and uniform operators on a large set of data representing the image. These applications usually require real-time performances that cannot be achieved by software implementations. In particular, solving visual tasks requires one (i) to extract elementary information from the image data (e.g. contrast, contrast differences, etc.) and (ii) to merge such information into a global unifying percept. Both operations resort to point and local interactions within restricted portions of the image. For an efficient hardware design it is important to have a structure based on simple modules locally connected to limit communication overhead. To this end, by studying biological solutions for vision processes, and especially those evolved in the visual cortex,9,10 one can derive the following set of computational paradigms.4

1. Local feature extraction. Each cell analyses the input image by performing a weighted sum over the portion of the image around the current pixel.

2. Topology preservation. Adjacent locations on the visual cortex (i.e. the output port) correspond to adjacent locations in the image, thus preserving the topographic organization of the image.

3. 3D mapping of local information. The 3D structure representing the cortex is composed of layers, organized hierarchically. Each cell in a layer gains its properties both through feedforward

Figure 7. (a) Test image. (b) Output of the edge detection CNN using the template of Reference 7

Figure 8. (a) Test image. (b) Output of a connected component detection CNN with the template of Reference 8. (c) Same as (b) using a delay-type template A' with τ = 3. A' is mapped on the architecture by considering three additional layers in which the outputs of the previous ones are copied at each step, realizing a memory of the last four output values


connections from cells in the previous layer and through horizontal and vertical, locally confined recurrent paths. These computations, together with topology preservation, ensure a direct correspondence between the morphology of connections and the detection of spatial relations among featural elements.

Algorithm specification. The fundamental module of the model is a 'column', i.e. an ensemble of orientation-selective cells present in simple, complex and hypercomplex layers at the same location (see Figure 9(a)). Each layer can be described as being composed of a number of (e.g. four) sublayers, each of which can be described as a 2D regular grid of cells selective to the same oriented featural element. The simple layer is the input layer and provides computational primitives to the complex layer to extract oriented featural elements: the excitation e_s(i, j, θ) reflects the dominant featural element among those detected by convolution with different kernels. The excitation of a neuron in the complex layer belonging to column (i, j), with orientation preference θ, is the result of four contributions: direct excitation z_s = g(e_s) from the corresponding position in the simple layer, where g(·) is a sigmoidal transfer function; feedforward inhibition from a set M(θ) of simple cells; recurrent inhibition from a set N_c(i, j, θ) of complex neurons; positive feedback z_h from the corresponding neuron in the hypercomplex layer. The excitation of neurons in the hypercomplex layer results from two contributions: the feedforward actions from a set L_c(i, j, θ) of neurons in the complex layer and the cross-orientation inhibition from a set N_h(i, j, θ) of neurons in the hypercomplex layer (see Figure 9(b)).

Summarizing, the algorithm can be described by the following system of equations:


Figure 9. (a) Artistic view of columns: the fundamental module of the neural computational model for visual processing (s = simple; c = complex; h = hypercomplex). The arrows evidence feedforward and recurrent interactions occurring among layers. (b) Feedforward, inhibitory and recurrent connection schemata among cells


where I(m, n) denotes the intensity of a pixel at point (m, n) in the image plane; w_p(m, n, i, j, θ) with p = 1, 2, 3, 4 are the kernels of different contrast selectivity that describe the receptive field profile of the neuron belonging to column (i, j); w_sc, w_cc, w_hc, w_hh and w_ch denote the weights of connection from simple to complex (feedforward), from complex to complex (intralayer), from hypercomplex to complex (feedback), from hypercomplex to hypercomplex and from complex to hypercomplex respectively; and k is the iteration index.

It is worth noting that the feedforward inhibition schema M(θ) does not depend on the position of the neuron in the layer; a complex neuron selective to θ is inhibited by the two simple neurons (with similar orientation preferences) belonging to the same column. N_c(i, j, θ) depends on the orientation preference θ of the target neuron; more precisely, a neuron selective to θ receives inhibitory inputs from two complex neurons (selective to θ + π/2) that belong to the two closest columns lying along an axis orthogonal to θ.

The set L_c(i, j, θ) depends on the orientation preference of the target neuron. More specifically, the connection schema can be defined as follows: if the target neuron is selective to θ, then the complex neurons that provide the input are selective to θ and belong to neighbouring columns that lie on an axis oriented along θ. Typical values for the number of columns involved in the interaction range from three to seven, but three is sufficient for most applications.
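These connection schemata can be made concrete as index computations over the column grid; the discretization of the four orientation axes into unit column offsets is our assumption, chosen only to illustrate the geometry (three-column case, no boundary handling).

```python
ORIENTS = (0, 45, 90, 135)
# assumed unit column offsets (di, dj) along each orientation axis
AXIS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def coop_set(i, j, theta):
    """Cooperative set L_c(i, j, theta): the two nearest columns lying along
    the axis oriented as theta, same orientation preference as the target."""
    di, dj = AXIS[theta]
    return [(i + di, j + dj, theta), (i - di, j - dj, theta)]

def inhib_set(i, j, theta):
    """Recurrent inhibition N_c(i, j, theta): the two closest columns on the
    axis orthogonal to theta, neurons selective to the orthogonal orientation."""
    ortho = (theta + 90) % 180
    di, dj = AXIS[ortho]
    return [(i + di, j + dj, ortho), (i - di, j - dj, ortho)]
```

For instance, a horizontally selective neuron (θ = 0°) cooperates with its left and right neighbours of the same preference and is inhibited by the vertically selective neurons above and below it.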

Architectural mapping. The functionality of complex and hypercomplex layers can be mapped on the architecture presented here, while the functionality of simple cells has to be implemented by a specific convolution block.11 This block performs convolutions with four pairs of orthogonal filters (oriented along the 0°, 45°, 90° and 135° directions), by four-pixel steps, and provides as output for each orientation the maximum of the absolute value of the convolution pairs (see equation (3)) on an array of 64 x 64 elements. In this way, with an input image of 256 x 256 pixels, the resulting convolution is an array of 64 x 64 elements for each orientation. It is noteworthy that the frequency selectivity of the masks of the convolution blocks will determine the capability of the whole network to be sensitive to particular textures.
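The convolution block's behaviour can be sketched as follows; the kernel pairs (e.g. even/odd filters per orientation) are assumed inputs, and the sketch keeps only the structure stated in the text: stride-4 convolution and the maximum absolute response of each orthogonal pair.

```python
def simple_layer(image, kernel_pairs, step=4):
    """Convolution-block sketch: for each orientation, convolve the image with
    a pair of kernels at `step`-pixel steps and keep the maximum of the
    absolute values of the pair, yielding one subsampled map per orientation
    (64 x 64 from a 256 x 256 image when step = 4)."""
    H, W = len(image), len(image[0])
    maps = []
    for ka, kb in kernel_pairs:            # one orthogonal pair per orientation
        kh, kw = len(ka), len(ka[0])
        out = []
        for i in range(0, H, step):        # four-pixel steps
            row = []
            for j in range(0, W, step):
                ra = rb = 0.0
                for di in range(kh):
                    for dj in range(kw):
                        y, x = i + di, j + dj
                        if y < H and x < W:
                            ra += ka[di][dj] * image[y][x]
                            rb += kb[di][dj] * image[y][x]
                row.append(max(abs(ra), abs(rb)))
            out.append(row)
        maps.append(out)
    return maps
```

With four kernel pairs this produces the four orientation maps that are then loaded into the simple-layer sublayers of the architecture.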

The outputs of the convolution stage z_s are stored as excitation inputs in the four sublayers of the simple layer. The statuses of complex and hypercomplex cells are stored in the corresponding quartets of layers. The values in complex and hypercomplex layers are updated according to the programmed rules and the values stored in all the layers. This occurs by setting the generalized template D of equation (2) according to the explicit algorithmic formulation of equations (4) and (5).

Simulation results, performance and implementation perspectives. We have tested this implementation on natural textured images.12 The simulations presented here concern texture segregation on natural images. In Figure 10 the test image and the content of the four hypercomplex layers of the architecture are presented at convergence. The image is subdivided into four square areas that represent the resulting images for the four types of orientation-selective cells along 0°, 45°, 90° and 135°. The luminous intensity of a pixel codes the activity of the corresponding neuron: if the pixel is light, the neuron is active; if the pixel is dark, the neuron is inhibited; if the pixel has an intermediate value, the corresponding neuron is silent (i.e. not selective to the stimulus present in its receptive field). Taking into account the number of elements per layer (64 x 64), the number of masks per layer (four) and the number of iterations (10), and assuming a clock frequency of 50 MHz, a complete texture segregation of 256 x 256 pixel images is estimated to be obtained in about 30 ms, allowing one to process images at a commercial camera frame rate (25 images/second). The VLSI design of this architecture is being pursued using a standard cell approach


Figure 10. (a) 256 x 256 pixel test image. (b) Outputs of the four hypercomplex layers for the four angles (0°, 45°, 90° and 135°), evidencing the presence of textural features of corresponding orientation

with an appropriate customized memory module generator. On the basis of a similar implementation13 it is estimated that 15 mm x 15 mm of silicon in a 0.5 µm technology will be necessary.
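The quoted ~30 ms figure can be cross-checked with a back-of-the-envelope calculation, assuming the nine-clock iteration update of Section 3.2 is paid once per cell per mask (our reading of the schedule, not stated explicitly in the text):

```python
# Back-of-the-envelope check of the texture segregation time estimate
cells = 64 * 64          # elements per layer
masks = 4                # masks per layer
iterations = 10          # iterations to convergence
clocks_per_update = 9    # clock periods per iteration update (Section 3.2)
f_clk = 50e6             # assumed 50 MHz clock

t = cells * masks * iterations * clocks_per_update / f_clk
print(round(t * 1000, 1))   # → 29.5 (ms), consistent with the quoted ~30 ms
```

At 29.5 ms per frame the architecture indeed keeps up with a 25 images/second camera, leaving a small margin for input/output transfers.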

5. CONCLUSIONS

We have considered a digital VLSI architecture for the implementation of multilayer CNNs. This architecture combines programmability with high efficiency. This has been achieved with the following strategy: (i) an elementary recursive algorithm has been defined as the building block of every multilayer CNN by introducing 3D generalized templates that fit well a direct VLSI mapping; (ii) such sparse 3D templates are projected onto a small set of 2D templates; (iii) the recursive operations of the whole algorithm are sequenced with high efficiency using programmable dedicated architectural resources.

In comparison with general-purpose CNN implementations such as the CNN universal machine,14 the following major differences can be evidenced: (i) the issue of programmability for this reconfigurable


digital architecture has been explored in a specific application context, though this architectural approach could be extended to other domains of application; (ii) a fully digital solution has been pursued.

ACKNOWLEDGEMENTS

This work was supported in part by CEC ESPRIT-BRA Project CORMORANT. The authors wish to thank Dr. Paolo Faraboschi and Dr. Giovanni Nateri for useful suggestions.

REFERENCES

1. J. Vandewalle and T. Roska, 'Guest editorial - special issue on cellular neural networks', Int. J. Cir. Theor. Appl., 20, 449-451 (1992).
2. Proc. CNNA-94, IEEE, New York, 1994.
3. K. Halonen, V. Porra and T. Roska, 'Programmable analogue VLSI CNN with local digital logic', Int. J. Cir. Theor. Appl., 20, 573-582 (1992).
4. G. Indiveri, L. Raffo, S. Sabatini and G. Bisio, 'A neuromorphic architecture for cortical multi-layer integration of early visual tasks', Machine Vision Appl., in press.
5. L. Chua and L. Yang, 'Cellular neural networks: theory', IEEE Trans. Circuits and Systems, CAS-35, 1257-1272 (1988).
6. G. Goossens, J. Rabaey, J. Vandewalle and H. De Man, 'Loop optimization in register transfer scheduling for DSP systems', Proc. 26th ACM/IEEE Design Automation Conf., IEEE, New York, 1989.
7. L. Chua and C. Wu, 'On the universe of stable cellular neural networks', Int. J. Cir. Theor. Appl., 20, 497-518 (1992).
8. T. Roska and L. Chua, 'Cellular neural networks with non-linear and delay-type template elements and non-uniform grids', Int. J. Cir. Theor. Appl., 20, 469-482 (1992).
9. D. Van Essen, C. Anderson and D. Felleman, 'Information processing in the primate visual system - an integrated systems perspective', Science, 255, 419-423 (1992).
10. S. Grossberg, E. Mingolla and D. Todorovic, 'A neural network architecture for preattentive vision', IEEE Trans. Biomed. Eng., BE-36, 65-83 (1989).
11. L. Raffo, S. Sabatini, G. Indiveri, G. Nateri and G. Bisio, 'A memory-based recurrent neural architecture for chips emulating cortical visual processing', IEICE Trans. Electron., E77-C (1994).
12. P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover, New York, 1966.
13. M. Valle, G. Nateri, D. Caviglia, G. Bisio and L. Bnozzo, 'An ASIC design for real time image processing in industrial applications', Proc. EDTC'95, 1995, pp. 385-390.
14. T. Roska and L. Chua, 'The CNN universal machine: an analogic array computer', IEEE Trans. Circuits and Systems II, CAS-40, 163-173 (1993).