

136 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 44, NO. 1, JANUARY 2009

A 125 GOPS 583 mW Network-on-Chip Based Parallel Processor With Bio-Inspired Visual Attention Engine

Kwanho Kim, Student Member, IEEE, Seungjin Lee, Student Member, IEEE, Joo-Young Kim, Student Member, IEEE, Minsu Kim, Student Member, IEEE, and Hoi-Jun Yoo, Fellow, IEEE

Abstract—A network-on-chip (NoC) based parallel processor is presented for bio-inspired real-time object recognition with a visual attention algorithm. It contains an ARM10-compatible 32-bit main processor, 8 single-instruction multiple-data (SIMD) clusters with 8 processing elements in each cluster, a cellular neural network based visual attention engine (VAE), a matching accelerator, and a DMA-like external interface. The VAE with a 2-D shift register array rapidly finds salient objects in the entire image. Then, the parallel processor performs further detailed image processing within only the pre-selected attention regions. The low-latency NoC employs dual channel, adaptive switching and packet-based power management, providing 76.8 GB/s aggregate bandwidth. The 36 mm² chip contains 1.9 M gates and 226 kB SRAM in a 0.13 µm 8-metal CMOS technology. The fabricated chip achieves a peak performance of 125 GOPS and 22 frames/s object recognition while dissipating 583 mW at 1.2 V.

Index Terms—Matching accelerator, network-on-chip (NoC), object recognition, parallel processor, processing element clusters, visual attention engine.

I. INTRODUCTION

RECENTLY, intelligent vision processing such as object recognition and video analysis has been an emerging research area for intelligent mobile robot vision systems, autonomous vehicle control, video surveillance and natural human-machine interfaces [1]–[4]. Such vision applications require huge computational power and real-time response under a low power constraint, especially for mobile devices [1], [2]. Programmability is also needed to cope with a wide variety of applications and recognition targets [2].

Object recognition involves complex image processing tasks which can be classified into several stages of processing with different computational characteristics. In low-level processing (e.g., image filtering, feature extraction), simple arithmetic operations are performed on a 2-D image array of pixels. In contrast, high-level processing is irregular and is performed

Manuscript received April 15, 2008; revised August 31, 2008. Current version published December 24, 2008. This work was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute for Information Technology Advancement) (IITA-2008-C1090-0801-0012).

The authors are with the Division of Electrical Engineering, Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 305-701, Korea (e-mail: [email protected]).

Digital Object Identifier 10.1109/JSSC.2008.2007157

on objects that are defined by groups of features extracted at the lower level. Since object recognition requires huge computation power at each stage, general-purpose architectures such as microprocessors and digital signal processors cannot achieve real-time processing due to their sequential pipelining. Many previously reported vision processors were based on a massively parallel SIMD architecture with a number of processing elements (PEs) for data-level parallelism [1]–[3]. However, these processors focus only on low-level image processing operations like image filtering, and they are not suitable for object-level parallelism, which is essential for higher level vision applications such as object recognition. A multiple-instruction multiple-data (MIMD) multi-processor with a Network-on-Chip (NoC) was presented to exploit task-level parallelism [4]. However, it cannot achieve real-time processing due to its limited computing power and complex data synchronization requirements.

In this work, to overcome the computational complexity of object recognition, a visual attention based object recognition algorithm is applied to the design of the pattern recognition processor [5]. The processor of this study combines three features, the parallel processor, the visual attention engine (VAE) and the NoC platform, and improves object recognition performance: a 58% reduction in power and a 38% improvement in recognition speed over the previous design [4]. Its SIMD/MIMD dual-mode parallel processor contains 8 SIMD linear array PE clusters with 8 PEs each, achieving a peak performance of 96 GOPS. The VAE is composed of an 80 x 60 digital cellular neural network (CNN) and rapidly selects salient object regions out of the image. The NoC supports 76.8 GB/s aggregate bandwidth with 2-clock-cycle latency as a communication platform. The chip is fabricated in 0.13 µm CMOS technology and shows 125 GOPS peak performance at a recognition speed of 22 frames/s while dissipating less than 583 mW.

This paper is organized as follows. In Section II, attention-based object recognition is briefly introduced. The system architecture with dual-mode configuration is described in Section III. Key building blocks such as the VAE, SIMD PE clusters, low-latency NoC, and matching accelerator are explained in Sections IV–VII. Packet-based power management employed in this chip is described in Section VIII. Implementation results and performance evaluations are given in Section IX. The conclusion of this work is made in Section X.

0018-9200/$25.00 © 2008 IEEE

Authorized licensed use limited to: Korea Advanced Institute of Science and Technology. Downloaded on January 13, 2009 at 10:12 from IEEE Xplore. Restrictions apply.


Fig. 1. Attention-based object recognition system.

II. ATTENTION-BASED OBJECT RECOGNITION

A. Algorithm Overview

The proposed attention-based object recognition algorithm consists of three steps (Fig. 1): visual attention, key-point extraction and matching. In contrast to conventional object recognition algorithms such as the Scale Invariant Feature Transform (SIFT) [6], visual attention is performed in advance. Visual attention is the ability of the human visual system to rapidly select the most salient part of an image; it plays an essential role in the visual cortex of the human brain [7]. Then, key-point extraction and feature descriptor generation are performed on the salient image regions pre-selected by the visual attention mechanism. Finally, we can recognize the object by matching individual features to a database of features using a nearest neighbor search algorithm [8].

By incorporating visual attention into the conventional object recognition algorithm, subsequent visual processing such as key-point extraction and matching can focus on only the pre-selected image regions, reducing the computation cost of object recognition. Visual attention confines the other image processing tasks to the image regions of interest. Therefore, the amount of image data to be processed in the higher-level visual processing stages is reduced and the computation cost goes down. The number of key-points extracted in the image is reduced, and only the key-points in the attended image region need to be matched to the object database, making it faster and easier to recognize the object. As a result, the VAE leads to a considerable speed-up that makes real-time object recognition possible. Moreover, numerous computer vision applications such as object tracking and image segmentation can benefit from the VAE as well.
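The attention-first flow described above can be sketched as follows. This is a minimal, illustrative Python model, not the chip's implementation: the helper names, the thresholding scheme, and the use of a single bounding box around all salient pixels are assumptions for clarity.

```python
def salient_bbox(saliency, threshold):
    """Bounding box (y0, y1, x0, x1) of all above-threshold saliency pixels."""
    coords = [(y, x) for y, row in enumerate(saliency)
              for x, v in enumerate(row) if v > threshold]
    if not coords:
        return None
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return min(ys), max(ys) + 1, min(xs), max(xs) + 1

def extract_keypoints(region):
    # Placeholder for the SIFT-like key-point stage (hypothetical helper).
    mean = sum(map(sum, region)) / (len(region) * len(region[0]))
    return [(y, x) for y, row in enumerate(region)
            for x, v in enumerate(row) if v > mean]

def attention_pipeline(image, saliency, threshold=0.5):
    """Run the expensive key-point stage only inside the attended region."""
    box = salient_bbox(saliency, threshold)
    if box is None:
        return []        # nothing salient: later stages are skipped entirely
    y0, y1, x0, x1 = box
    region = [row[x0:x1] for row in image[y0:y1]]
    return extract_keypoints(region)
```

The cost saving comes from the crop: every later stage touches only `region`, whose area is typically a small fraction of the full frame.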

B. Cellular Neural Network for VAE

The saliency-based model of visual attention has been widely used in various computer vision applications [9], [10]. According to [9], visual attention can be modeled by four steps: multi-scale image generation, low-level feature extraction, conspicuity map generation, and saliency map generation. Such a saliency-based visual attention process involves a series of 2-D image filtering operations such as difference-of-Gaussians filters and Gabor filters, which can be easily implemented by an algorithm with a CNN architecture [11]. The CNN is a 2-D array of locally connected cells, and the connection weights among neighboring cells, given as a template, define the CNN operation [12]. Because the 2-D structure of the CNN can be directly mapped onto an image, its inherent cell-level parallel processing can give high performance. In addition, uniform local connections make it suitable for VLSI implementation. Therefore, the VAE of this study is implemented using the CNN.
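The template operation of a discrete-time CNN, as described above, can be sketched in software as follows. This is a simplified behavioral model: zero-padded borders and the omission of the bias term and output nonlinearity are assumptions made for illustration.

```python
def cnn_step(state, template):
    """One discrete-time CNN update: each cell's new value is the weighted
    sum of its 3x3 neighborhood, with weights given by the template.
    Boundary cells use zero padding (simplified: no bias, no nonlinearity)."""
    h, w = len(state), len(state[0])

    def cell(y, x):
        return state[y][x] if 0 <= y < h and 0 <= x < w else 0

    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = sum(template[dy + 1][dx + 1] * cell(y + dy, x + dx)
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return out
```

With an identity template the array is unchanged; with a Laplacian-like template (as used for edge and contour extraction) uniform regions map to zero, which is the behavior the filtering steps above rely on.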

III. SYSTEM ARCHITECTURE

A. System Operation

Fig. 2 shows the overall architecture of the proposed NoC-based parallel processor. It consists of 12 IPs: a main processor, the VAE, a matching accelerator, 8 PE clusters (PECs) and an external interface. The ARM10-compatible 32-bit main processor controls the overall system operations. The VAE, an 80 x 60 digital cellular neural network, rapidly detects the salient image regions on the sub-sampled image (80 x 60 pixels) by contour and saliency map extraction. Although a low-resolution image is mapped onto the VAE, this does not cause any loss of recognition accuracy because the role of the VAE is just to make a rough selection of the salient image regions before the detailed processing. The 8 linearly connected PECs perform data-intensive image processing tasks such as image gradient and histogram calculations for more detailed analysis of the salient image parts (i.e., the objects) selected by the VAE. The matching accelerator boosts the nearest neighbor search to obtain the final recognition result in real-time. The DMA-like external interface automatically distributes the corresponding image data to each PEC to reduce system overhead. Initially, the 2-D image plane is equally divided among the 8 PECs according to the image size specified by the main processor. Each core is connected to the NoC


Fig. 2. System architecture.

Fig. 3. Dual-mode configuration: (a) SIMD mode and (b) MIMD mode.

via a network interface (NI). The on-chip PLL generates two independent clocks for the IPs and the NoC, and the clocks can be controlled by the host processor.

B. Dual-Mode Configuration

Attention-based computer vision applications such as object recognition and tracking require a wide range of parallelism: data-level parallelism for processing the entire image in the pre-attentive phase, and object-level parallelism for only the salient image regions selected by the VAE in the post-attentive phase. To incorporate both requirements into a single system, the proposed parallel processor has a dual-mode configuration. That is, by modifying its NoC configuration, the system can choose between SIMD and MIMD mode as shown in Fig. 3. With a circuit switching NoC, the main processor broadcasts instructions and data to all PE arrays. In this mode, the system exploits massively parallel SIMD operation for image pre-processing, and its peak performance is 96 GOPS at 200 MHz. In contrast, with a packet switching NoC, i.e., in the MIMD mode, the 8 PECs operate independently in parallel for object-parallel processing. In this case, each PEC is responsible for objects, each of which contains the image data around the extracted key-points.

It takes a few tens of cycles to change the NoC configuration, and the exact number of cycles depends on the network traffic status due to the circuit establishment and release time overhead for the

Fig. 4. Block diagram of the VAE.

circuit switching NoC. For the object recognition application, however, the operation mode conversion occurs only twice per recognized frame: a SIMD-to-MIMD conversion after the pre-processing stage including the VAE operation, and a MIMD-to-SIMD conversion after completing the recognition. Therefore, such a dual-mode architecture is suitable for a compact object recognition system, with negligible impact on the overall system performance.

IV. VISUAL ATTENTION ENGINE

A. Cellular Neural Network Based Architecture

The CNN is usually implemented using analog cells because biological neurons operate in the continuous time domain [13]. However, an analog CNN requires high-accuracy analog circuits to deal with complex algorithms like visual attention, and it is not suitable for integration into an SoC. To overcome the limitations of the analog CNN, a digital CNN, a discrete-time


Fig. 5. VAE cell schematic.

version of the CNN, has been studied [14]. The digital CNN can be more easily integrated into the parallel processor without analog-to-digital (A/D) or digital-to-analog (D/A) conversion overhead.

The VAE is an 80 x 60 digital CNN optimized for small area and energy efficiency. Fig. 4 shows the block diagram of the VAE, which is composed of 4 arrays of 20 x 60 cells, 120 visual PEs (VPEs) shared by the cell arrays, and a controller with 2 kB instruction memory. Previous implementations of digital CNNs [14] could integrate only a small number of cells due to the large size of the digital arithmetic blocks. In contrast, the VAE integrates 80 x 60 cells that each correspond to a pixel in an 80 x 60 resolution image. This is possible because the cells of the VAE only perform storage and inter-cell data transfer, minimizing area, while a smaller number of shared VPEs is responsible for processing the cells' data. An 80 x 60 shift register array, distributed among the cells, eliminates data communication overhead in convolution operations of arbitrary kernel size and shape, the most frequently used operation in the CNN. The VAE controller generates the control signals for sequencing the operation of the cells and the VPEs. Such a CNN-based architecture can accelerate visual attention algorithms like contour and saliency map extraction.

B. VAE Cell

Fig. 5 shows the schematic diagram of the VAE cell. It consists of two elements: an 8-bit 4-entry register file and a 4-directional shift register. Four 6T SRAM cell based registers store intermediate and result data of the CNN operation. The shift register's data is initially loaded from the register file and then shifted to neighboring cells. A shift operation on the entire cell array requires only 1 cycle to complete. Because all cells shift in the same direction, one bidirectional channel is used for 2-way communication between neighboring cells to save routing channels. Dynamic logic based on a MUX/DEMUX built only from NMOS pass-transistors is utilized to reduce the area of the 4-directional shift register. In this circuit, the voltage at dynamic node D is precharged to VDD and then evaluated through one of five possible paths selected by the control signals 'N_En', 'E_En', 'S_En', 'W_En', and 'load_En' before being captured by the pulsed latch. As a result, the full-custom designed cell occupies a compact area of 502 µm², a cell area reduction of 40% compared with a static MUX-based design.

C. VAE Operation

Fig. 6(a) shows the basic VAE operation. Each VPE, located in the middle, is shared by a group of 40 cells connected via 2 read buses and 1 write bus. The VPEs, operating in SIMD mode, are capable of 1-cycle MAC operation and employ a 3-stage pipeline consisting of read, execute, and write. The cell data stored in the shift register and the register file can be read through the 2 read buses. Execution results of the VPEs are written back to the register file of the cell through the write bus. The single-ended read bus is pre-charged to VDD, and the complementary write bus driven by the output of the VPE carries a full-swing signal to ensure reliable write operation. To facilitate 1 op/cycle throughput, read and write of cell data are sequentially executed within one cycle using a self-timed circuit. It takes 42 cycles for the VPEs to execute one instruction on the entire cell array. The resulting peak performance of the 120 VPEs is 24 GOPS at 200 MHz. Fig. 6(b) shows the measured waveforms of the cell control signals when the VAE operates at 200 MHz. Word line, read enable, and write enable signals are sequentially asserted for cell read and write operation within a single cycle. Thanks to the pipelined VAE operation, 1-cycle throughput and a peak performance of 24 GMACS are achieved.
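The figures above fit together arithmetically; the following check is an inference from the stated numbers (in particular, attributing the extra 2 cycles to pipeline fill is our assumption, not a statement from the text).

```python
# Back-of-the-envelope check of the VAE figures quoted above.
CELLS = 80 * 60        # one cell per image pixel
VPES = 120             # shared visual PEs
PIPE_STAGES = 3        # read / execute / write
CLOCK_HZ = 200e6

cells_per_vpe = CELLS // VPES                         # 40 cells time-share one VPE
cycles_per_instr = cells_per_vpe + (PIPE_STAGES - 1)  # assumed 2-cycle pipeline fill
peak_gmacs = VPES * CLOCK_HZ / 1e9                    # one MAC per VPE per cycle

assert cells_per_vpe == 40
assert cycles_per_instr == 42     # matches the 42 cycles stated above
assert peak_gmacs == 24.0         # matches the 24 GMACS figure
```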

The most time-consuming operation of the digital CNN is calculating the weighted sum of neighborhood cell values. Fig. 7 visualizes the method used to obtain the weighted sum. It involves a spiraling shift sequence that can be straightforwardly extended to neighborhoods larger than the 3 x 3 neighborhood of Fig. 7.


Fig. 6. (a) VAE operation and (b) measured waveforms of cell control signals.

The procedure shown in Fig. 7 takes 387 cycles (42 cycles per MAC operation and 1 cycle per shift operation) to complete the weighted sum operation on the VAE. Thanks to the efficient shift pattern and the single-cycle shift operation, data communication overhead is only 2.4% and 93% utilization of the VPE array is achieved. A complete iteration of a 3 x 3 CNN template requires 858 cycles, or 4.3 µs. As a result, the VAE takes only 2.4 ms to complete a saliency map extraction, about two orders of magnitude faster than an Intel Core 2 processor.

V. SIMD PE CLUSTER

The PEC is a SIMD processor array designed to accelerate image processing tasks. Fig. 8 shows the architecture of the PEC. It contains 8 linearly-connected PEs controlled by a cluster controller, a cluster processing unit (CLPU), 20 kB of local shared memory (LSM), an LSM controller, and a PE load/store unit. The 8 PEs operate in a SIMD fashion and process image operations in a column-parallel (or row-parallel) manner. The CLPU, which consists of an accumulator and an 8-input comparator, generates a single scalar result from the parallel output produced by the PE array. The LSM is used as on-chip frame memory or local memory for each PEC to store the input or processed image data and objects. A single-port 128-bit wide SRAM is used for the LSM to avoid area overhead. The LSM provides single-cycle access and is shared among the PE load/store unit, the LSM controller and the CLPU. Arbitration for the LSM is performed on a cycle-by-cycle basis to improve LSM utilization. The LSM controller is responsible for data transfer between external memory or other PECs and the LSM, while the PE load/store unit can access the LSM only for local data transfer. The LSM controller, an independent processing unit optimized for data transfer like a DMA engine, enables data transfers in parallel with PE execution to hide the external memory latency.

Fig. 9 shows the 5-stage pipeline architecture of the PEC. The cluster controller, the 3-stage pipelined PE array, and the CLPU are tightly coupled to maintain 1-cycle throughput for all operations. In particular, the tightly coupled PE array and CLPU architecture achieves single-cycle execution for statistical image processing tasks (e.g., histogram calculations) in which an input image is transformed into a scalar or vector, whereas massively parallel SIMD processors [1], [2] require sequential operations on a line-by-line basis to obtain the same result due to the absence of a CLPU-like processing unit. Such an architecture is suitable for object recognition because histogram calculation is the essential operation for key-point descriptor generation in the object recognition task [6]. In addition, due to the simple control circuit of the SIMD architecture, the cluster controller including 2 kB instruction


Fig. 7. Spiral shift sequence for CNN operation on the VAE.

Fig. 8. Block diagram of the PE cluster.

memory occupies only 6% of the total PEC area, which results in high computation efficiency.
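The CLPU's role as a per-cycle reduction unit can be modeled as follows. This is a behavioral sketch, not the RTL; the two reduction modes mirror the accumulator and 8-input comparator named above, and the function name is our own.

```python
def clpu_reduce(pe_outputs_per_cycle, op="sum"):
    """Behavioral model of the cluster processing unit: each cycle it folds
    the 8 parallel PE outputs into a running scalar, so a whole-image
    statistic needs no extra sequential pass over the PE array's results."""
    acc = 0 if op == "sum" else float("-inf")
    for outputs in pe_outputs_per_cycle:   # one list of 8 PE values per cycle
        if op == "sum":
            acc += sum(outputs)            # accumulator path
        else:
            acc = max(acc, max(outputs))   # 8-input comparator path
    return acc
```

Without such a unit, the 8 partial results produced each cycle would have to be written out and combined in additional instructions, which is the line-by-line overhead the text attributes to conventional SIMD arrays.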

Each PE utilizes a 4-way very long instruction word (VLIW) architecture to execute up to 4 instructions in a single cycle as shown in Fig. 10: three instructions for data processing and one instruction for data transfer. It consists of two 16-bit ALUs, a shifter, a multiplier and a 16-bit 10-port register file. All PE instructions have single-cycle execution except the 16-bit multiply-accumulate (MAC) operation, which has a two-cycle latency. The 16-bit datapath units of the PE can be configured to execute two 8-bit operations in parallel for gray-scale image processing. The left and right neighbor PE registers can be directly accessed in a single cycle through the linearly connected PE array for efficient inter-PE communication, one of the most frequently used operations in neighborhood image processing tasks such as image filtering.

Fig. 9. Tightly-coupled PEC pipeline architecture.

Fig. 10. Block diagram of 4-way VLIW PE.

Meanwhile, memory access patterns for such low-level image processing tasks are highly predictable because the data accesses are regular and pre-defined. The 4-way VLIW PEs allow PEC software to pre-fetch the needed data in advance without performance degradation by executing data transfer and processing instructions concurrently.


Fig. 11. NoC packet format.

VI. LOW-LATENCY NETWORK-ON-CHIP

For attention-based vision applications, image regions of interest are pre-selected by the VAE in SIMD mode. After the pre-attentive stage, each PEC handles the selected image regions on a per-object basis in MIMD mode. To facilitate object-level parallel processing, a large number of data transactions among the PECs is required to redistribute the image data of each object to the corresponding PEC. We apply the NoC to secure the huge communication bandwidth required for parallel computing.

A. NoC Protocol

Regular-topology NoCs (e.g., mesh, torus) have been widely used because of their better scalability and higher throughput [15]. However, most SoCs are heterogeneous, with each core having different communication requirements. Therefore, the NoC topology should be decided based on the traffic characteristics of the SoC to achieve high performance at low cost [16]. In this work, a tree-based NoC with 3 star-connected crossbar switches is used for lower latency and power than a 2-D mesh NoC. Fig. 11 shows the NoC packet format. Wormhole switching, in which each packet is divided into a few 32-bit FLITs (FLow control unITs) with 3 additional control bits, is employed to reduce buffer requirements. The header FLIT contains 4-bit burst length information for burst packet transactions of up to 256 bits (8 x 32-bit) of data, and 2-bit priority information for quality-of-service (QoS) control. A handshaking protocol is supported for reliable transmission by using an acknowledgement request (AC) bit in the packet header. The packet length is determined by the burst length, and a maximum of 10 FLITs is possible. A deterministic source routing scheme is used for simple hardware implementation. Circuit and packet switching are adaptively selected for the specific route path from the main processor to the PECs in order to support the dual-mode configuration. A 1-bit sideband back-pressure signal is used for flow control in the NoC. The back-pressure signal is asserted to stop packet transmission when buffer overflow occurs, or when the destination PE cannot provide the required service.
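A header FLIT carrying the fields listed above might be packed as follows. The bit positions and the width of the source-route field are hypothetical, since the exact layout of Fig. 11 is not reproduced in the text; only the field names and widths of the burst-length, priority, and AC fields come from the description above.

```python
def encode_header(burst_len, priority, ack_req, route):
    """Pack header fields into one 32-bit FLIT.
    Hypothetical layout: [31:28] burst length, [27:26] priority,
    [25] AC (acknowledgement request), [24:0] source route."""
    assert 1 <= burst_len <= 8 and 0 <= priority < 4 and 0 <= route < (1 << 25)
    return (burst_len << 28) | (priority << 26) | (int(ack_req) << 25) | route

def decode_header(flit):
    """Unpack the same hypothetical layout back into its fields."""
    return {"burst": (flit >> 28) & 0xF,
            "prio":  (flit >> 26) & 0x3,
            "ack":   bool((flit >> 25) & 0x1),
            "route": flit & 0x1FFFFFF}
```

For example, a maximum-length burst (8 data FLITs) at priority 2 with an acknowledgement request round-trips through encode and decode unchanged.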

B. Low-latency Crossbar Switch

Fig. 12 shows the block diagram of the proposed low-latency crossbar switch. The 7 x 7 crossbar switch is optimized for low latency and energy efficiency with two key features: adaptive switching and a dual mode channel. At port 0, the switch supports both circuit and packet switching adaptively according to the system operation mode. In circuit switching mode, a burst packet can be broadcast to all PECs by bypassing the 8-FLIT queuing buffers and the arbiter, resulting in reduced delay and energy dissipation. An input driver at port 0 dynamically controls its drive strength based on the output load associated with the switching mode for reliable packet transmission. At ports 1 through 6, a dual mode channel is adopted to reduce packet latency, especially for return packets transferred from slave IPs. The return packet latency seriously affects the overall system performance because a PEC with in-order execution stalls until the return packet arrives. Incoming return packets, detected by a pre-route unit, are ejected immediately after a 1-FLIT buffer through an additional express channel. This mechanism saves 2 pipeline stages of the switch by eliminating unnecessary packet queuing, arbitration, and crossbar fabric traversal.

Fig. 13(a) shows the 4-stage low-latency crossbar switch pipeline. Incoming return packets are ejected 2 cycles earlier than normal packets without any flow control. Because return packets are mostly burst packets, this scheme is especially effective. The crossbar switch with the dual-mode channel does not store the return packets in queuing buffers; they are directly injected into the network without any suppression by the back-pressure flow control, which leads to a significant performance improvement over a conventional crossbar switch [4]. Measured waveforms (Fig. 13(b)) show the low-latency return packet transmission by the crossbar switch with the dual-mode channel when the NoC operates at 200 MHz. As a result, a 26% latency reduction and a 33% energy reduction are obtained with only 6% area overhead compared to the conventional crossbar switch [4] while various image processing applications are running on the NoC-based system.

Authorized licensed use limited to: Korea Advanced Institute of Science and Technology. Downloaded on January 13, 2009 at 10:12 from IEEE Xplore. Restrictions apply.


Fig. 12. Proposed low-latency crossbar switch.

Fig. 13. (a) Crossbar switch pipeline and (b) measured waveforms.


Fig. 14. FLIT-level clock gating.

C. Synchronization

A first-in-first-out (FIFO) based synchronizer is designed to interface between the IPs and the NoC, which run with independent clock frequencies and phases. Without global synchronization, packet transmission is performed by a source-synchronous scheme in which a strobe signal is transmitted along with the packet data [16]. A 4-FLIT-deep FIFO captures the incoming FLIT using the delayed strobe signal. Detection of the full or empty status is accomplished using the FIFO write and read pointers to avoid FIFO overflow or underflow, respectively. The synchronizer is placed at the first stage of the crossbar switch pipeline.

D. FLIT-Level Clock Gating

FLIT-level fine-grained clock gating is used to reduce NoC power consumption, as shown in Fig. 14. Only the required packet routing path is activated on a per-port basis. A power control unit that monitors the incoming packet header is always turned on, and the output port number is encoded to control the NoC clock signals. The clock gating signals are generated at all pipeline stages of the crossbar switch in a pipelined manner. When a FLIT is transferred from input port N to output port M, only the queuing buffer at port N and the arbiter and crossbar fabric at port M are enabled. Since the queuing buffers, built from a large number of flip-flops, are the most power-consuming unit in the NoC, this FLIT-level power management reduces NoC power consumption by 32% without degradation of throughput or latency.

VII. MATCHING ACCELERATOR

In the nearest neighbor search algorithm, the most time-consuming part is the distance calculation between the input vector and the database vectors. In this work, the sum of absolute differences (SAD) is used as the distance metric. The proposed matching accelerator aims at accelerating the nearest neighbor search algorithm to obtain a final recognition result in real time. Fig. 15 shows the overall architecture of the proposed matching accelerator [8], which consists of a RISC core, a pre-fetch DMA, and two 8 kB 256-bit wide database vector (DB) memories. The RISC core manages the overall operation of the matching accelerator and performs the nearest neighbor search algorithm.

The pre-fetch DMA, initiated by the RISC core, transfers the external object database to the internal DB memory via the NI. Two-stage pipelined tree-structured SAD accumulation logic is merged into the DB memory in order to resolve the throughput bottleneck caused by bandwidth conversion between the 256-bit vector data and the 32-bit scalar RISC core. By accumulating four absolute differences per stage, 16 absolute difference results are accumulated into a 32-bit scalar value every cycle at 200 MHz. As a result, the SAD-merged memory logic performs the distance calculation between two 256-bit (16-bit × 16-dimension) vectors with 2-cycle latency and 1-cycle throughput at 200 MHz, which enables real-time nearest neighbor matching.
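The SAD computed by the merged memory logic can be written out as a short reference model. The 16-element vectors and the 4-way accumulation per tree stage come from the text; the function names and the exhaustive search wrapper are illustrative:

```python
def sad_tree(query, db_vec):
    """SAD between two 16-dimension vectors of 16-bit elements (256 bits each),
    reduced through a two-level adder tree as described in the text."""
    assert len(query) == len(db_vec) == 16
    diffs = [abs(a - b) for a, b in zip(query, db_vec)]
    # Stage 1: each adder accumulates four absolute differences.
    stage1 = [sum(diffs[i:i + 4]) for i in range(0, 16, 4)]
    # Stage 2: the four partial sums collapse into one 32-bit scalar.
    return sum(stage1)

def nearest_neighbor(query, database):
    """Exhaustive nearest-neighbor search with SAD as the distance metric."""
    return min(range(len(database)), key=lambda i: sad_tree(query, database[i]))
```

In hardware the two stages are pipelined, so a new 256-bit comparison starts every cycle (1-cycle throughput) while each result appears two cycles later (2-cycle latency); the model above only checks the arithmetic.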

VIII. PACKET-BASED POWER MANAGEMENT

The modular, point-to-point NoC approach makes it easy to manage the overall system by decoupling the computation of IPs from inter-IP communication, which enables more efficient power management techniques than in a bus-based system. For low power consumption, our chip performs packet-based power management at the IP level, as shown in Fig. 16. Each PE cluster can be individually enabled or disabled according to the framing signals of the packet, cutting the power consumed by inactive IPs. The valid signals generated by the network interface wake up the related blocks within an IP only when an incoming packet arrives. The 4 clock domains of the PE cluster are individually controlled based on the issued instruction types. During the image data transfer phase, for which only the LSM controller needs to be activated, the clock signals of the PE register files are gated off, and operand isolation in the PE datapath prevents unnecessary signal transitions to reduce power consumption. Since the PE datapath and register files account for about 62% of the total power consumption, a power reduction of up to 27% is achieved when the object recognition application is running.
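The gating policy can be summarized as a truth table over the packet-valid signal and the instruction type. The domain names and the specific partition below are assumptions for illustration; the paper only states that four clock domains are individually controlled and that the register-file clocks stay gated during pure data transfers:

```python
def domain_enables(packet_valid: bool, instr_type: str) -> dict[str, bool]:
    """Illustrative clock-enable policy for one PE cluster (not the actual RTL)."""
    compute = packet_valid and instr_type == "compute"
    return {
        "noc_if": packet_valid,    # wake on any incoming packet (valid signal)
        "lsm": packet_valid,       # LSM controller handles image data transfers
        "datapath": compute,       # gated off during transfer-only phases
        "regfile": compute,        # register-file clocks likewise gated off
    }
```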

IX. IMPLEMENTATION AND EXPERIMENTAL RESULTS

The proposed NoC-based parallel processor is fabricated in a 0.13 µm 1-poly 8-metal standard CMOS logic process, and its die area is 6 × 6 mm², containing 1.9 M gates and 228 kB of on-chip SRAM. The chip micrograph is shown in Fig. 17, and Table I summarizes the chip features. The operating frequency of the chip is 200 MHz for the IPs and 400 MHz for


Fig. 15. Block diagram of the matching accelerator.

Fig. 16. Packet-based power management.

Fig. 17. Chip micrograph.

the NoC. The power consumption is about 583 mW at a 1.2 V power supply while the object recognition application runs at 22 frames/sec. Table II shows the power breakdown of the chip. The NoC consumes 9% of the die area and 8% of the power consumption, which means that the NoC cost is amortized over the processing units.

Fig. 18 shows a comparison with previously reported parallel processors in terms of power efficiency [1], [2], [4], [17]. All data are scaled to a 0.13 µm technology. For a fair comparison, GOPS/W and nJ/pixel are adopted as normalized performance indices. As a result, the chip achieves

TABLE I PERFORMANCE SUMMARY

TABLE II POWER BREAKDOWN

up to 4.3 times higher GOPS/W in the case of 8-bit fixed-point operation, and an energy-per-pixel reduction of up to 42% is obtained for the object recognition task with the help of the VAE and the packet-based power management, compared with other parallel processors.

Fig. 19 shows the performance evaluation when attention-based object recognition is performed on the chip. In this example, the VAE extracts a CNN-based saliency map as the attention cue, and 50 objects are used as the database for pattern matching. The VAE takes only 2.4 ms to complete saliency map


Fig. 18. Power efficiency comparison.

Fig. 19. Performance evaluation.

extraction, which occupies only 3% of the total application execution time. With the help of the VAE, the number of extracted key-points is reduced by 65%, as shown in Fig. 19. Therefore, the processing time for subsequent vision tasks such as feature vector generation and matching is diminished in proportion to the reduced key-points. As a result, the chip achieves a 22 frames/sec recognition speed without degradation of the recognition rate, which is sufficient for real-time operation.

The implemented NoC-based parallel processor with the VAE is applied to an intelligent robot vision system and works successfully on a system evaluation board, as shown in Fig. 20. The implemented chip is used as a vision acceleration IP on a PXA processor based robot platform. A camera on the robot captures the image of a target object, and then object recognition software runs on the vision processor. The recognition result with the extracted key-points is displayed on the LCD screen.

X. CONCLUSION

A NoC-based parallel processor is designed and implemented for bio-inspired real-time vision applications. The proposed processor has three key features: a SIMD/MIMD dual-mode parallel processor, a cellular neural network based VAE, and a low-latency NoC. The combined architecture of the dual-mode parallel processor and the VAE on the low-latency NoC platform reduces the computation cost of object recognition while exploiting both the data-level and object-level parallelism required for attention-based vision applications. The chip, fabricated in a 0.13 µm CMOS process, occupies a die area of 36 mm² and provides 125 GOPS peak performance for 8-bit fixed-point operations at 200 MHz. With the help of the packet-based power management, the measured power consumption is 583 mW when

Fig. 20. Demonstration system on the intelligent robot.

the object recognition application is running at 22 frames/sec. The results show that the chip can serve as a high-performance, low-power vision system for an intelligent mobile robot performing real-time object recognition.

REFERENCES

[1] A. Abbo et al., “XETAL-II: A 107 GOPS, 600 mW massively-parallel processor for video scene analysis,” in IEEE ISSCC Dig. Tech. Papers, 2007, pp. 270–271.

[2] S. Kyo et al., “A 51.2-GOPS scalable video recognition processor for intelligent cruise control based on a linear array of 128 four-way VLIW processing elements,” IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 1992–2000, Nov. 2003.

[3] H. Noda et al., “The design and implementation of the massively parallel processor based on the matrix architecture,” IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 183–192, Jan. 2007.


[4] D. Kim et al., “An 81.6 GOPS object recognition processor based on NoC and visual image processing memory,” in Proc. CICC, 2007, pp. 443–446.

[5] K. Kim et al., “A 125 GOPS 583 mW network-on-chip based parallel processor with bio-inspired visual attention engine,” in IEEE ISSCC Dig. Tech. Papers, 2008, pp. 308–309.

[6] D. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.

[7] M. I. Posner and S. E. Petersen, “The attention system of the human brain,” Annual Rev. Neuroscience, vol. 13, pp. 25–42, 1990.

[8] J.-Y. Kim et al., “A 66 fps 38 mW nearest neighbor matching processor with hierarchical VQ algorithm for real-time object recognition,” in Proc. IEEE Asian Solid-State Circuits Conf., 2008, pp. 177–180.

[9] L. Itti et al., “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 11, Nov. 1998.

[10] N. Ouerhani and H. Hugli, “A model of dynamic visual attention for object tracking in natural image sequences,” in Int. Conf. Artificial and Natural Neural Networks, Lecture Notes in Computer Science, 2003, vol. 2686, pp. 702–709.

[11] B. E. Shi, “Gabor-type filtering in space and time with cellular neural networks,” IEEE Trans. Circuits Syst. I, Fundam. Theory Applicat., vol. 45, no. 2, pp. 121–132, Feb. 1998.

[12] L. Chua and L. Yang, “Cellular neural networks: Theory,” IEEE Trans. Circuits Syst. I, Fundam. Theory Applicat., vol. 35, no. 10, pp. 1257–1272, Oct. 1988.

[13] A. Rodriguez-Vazquez et al., “ACE16k: The third generation of mixed-signal SIMD-CNN ACE chips toward VSoCs,” IEEE Trans. Circuits Syst. I, Fundam. Theory Applicat., vol. 51, no. 5, pp. 851–863, 2004.

[14] P. Keresztes et al., “An emulated digital CNN implementation,” J. VLSI Signal Process., vol. 23, pp. 291–303, 1999.

[15] H.-J. Yoo, K. Lee, and J. K. Kim, Low-Power NoC for High-Performance SoC Design. Boca Raton, FL: CRC Press, 2008.

[16] K. Lee et al., “Low-power networks-on-chip for high-performance SoC design,” IEEE Trans. VLSI Syst., vol. 14, no. 2, pp. 148–160, Feb. 2006.

[17] B. Khailany et al., “A programmable 512 GOPS stream processor for signal, image, and video processing,” in IEEE ISSCC Dig. Tech. Papers, 2007, pp. 272–273.

Kwanho Kim (S’04) received the B.S. and M.S. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2004 and 2006, respectively. He is currently working toward the Ph.D. degree in electrical engineering and computer science at KAIST.

In 2004, he joined the Semiconductor System Laboratory (SSL) at KAIST as a Research Assistant. His research interests include VLSI design for object recognition and the architecture and implementation of NoC-based SoCs.

Seungjin Lee (S’06) received the B.S. and M.S. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2006 and 2008, respectively. He is currently working toward the Ph.D. degree in electrical engineering and computer science at KAIST.

His previous research interests include low-power digital signal processors for digital hearing aids and body area communication. Currently, he is investigating parallel architectures for computer vision processing.

Joo-Young Kim (S’05) received the B.S. and M.S. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2005 and 2007, respectively, and is currently working toward the Ph.D. degree in electrical engineering and computer science at KAIST.

Since 2006, he has been involved with the development of parallel processors for computer vision as a digital block designer. Currently, his research interests are parallel architecture and sub-block design for computer vision systems.

Minsu Kim (S’07) received the B.S. degree in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2007. He is currently working toward the M.S. degree in electrical engineering and computer science at KAIST.

His research interests include network-on-chip based SoC design and VLSI architecture for computer vision processing.

Hoi-Jun Yoo (M’95–SM’04–F’08) graduated from the Electronic Department of Seoul National University, Seoul, Korea, in 1983 and received the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, in 1985 and 1988, respectively. His Ph.D. work concerned the fabrication process for GaAs vertical optoelectronic integrated circuits.

From 1988 to 1990, he was with Bell Communications Research, Red Bank, NJ, where he invented the two-dimensional phase-locked VCSEL array, the front-surface-emitting laser, and the high-speed lateral HBT. In 1991, he became Manager of a DRAM design group at Hyundai Electronics and designed a family of DRAMs, from fast 1 M DRAMs to 256 M synchronous DRAMs. In 1998, he joined the faculty of the Department of Electrical Engineering at KAIST, where he is now a full professor. From 2001 to 2005, he was the Director of the System Integration and IP Authoring Research Center (SIPAC), funded by the Korean government to promote worldwide IP authoring and its SoC application. From 2003 to 2005, he was a full-time Advisor to the Minister of the Korea Ministry of Information and Communication and National Project Manager for SoC and Computer. In 2007, he founded SDIA (System Design Innovation & Application Research Center) at KAIST to research and develop SoCs for intelligent robots, wearable computers, and bio systems. His current interests are high-speed and low-power networks-on-chip, 3D graphics, body area networks, biomedical devices and circuits, and memory circuits and systems. He is the author of the books DRAM Design (Seoul, Korea: Hongleung, 1996; in Korean) and High Performance DRAM (Seoul, Korea: Sigma, 1999; in Korean), and of chapters of Networks on Chips (New York: Morgan Kaufmann, 2006).

Dr. Yoo received the Electronic Industrial Association of Korea Award for his contribution to DRAM technology in 1994, the Hynix Development Award in 1995, the Korea Semiconductor Industry Association Award in 2002, the Best Research of KAIST Award in 2007, a Design Award at the 2001 ASP-DAC, and Outstanding Design Awards at the 2005, 2006, and 2007 A-SSCC. He is a member of the executive committees of ISSCC, the Symposium on VLSI Circuits, and A-SSCC. He is the TPC chair of A-SSCC 2008.
