
Ultra-low-power Design and Implementation of Application-specific Instruction-set Processors for Ubiquitous Sensing and Computing

NING MA

Doctoral Thesis in Electronic and Computer Systems

Stockholm, Sweden 2015

TRITA-ICT 2015:11
ISSN 1653-6363
ISRN KTH/ICT-2015/11-SE
ISBN 978-91-7595-692-3

KTH School of Information and Communication Technology
SE-164 40 Kista, Stockholm

SWEDEN

Academic dissertation which, with the permission of Kungl Tekniska högskolan, is submitted for public examination for the degree of Doctor of Technology in Electronic and Computer Systems on Wednesday, 4 November 2015 at 10:00 in Sal B, Electrum, Kungl Tekniska högskolan, Kista 164 40, Stockholm.

© Ning Ma, September 2015

Printed by: Universitetsservice US AB

Abstract

The feature size of transistors keeps shrinking with the development of technology, which enables ubiquitous sensing and computing. However, with the breakdown of Dennard scaling caused by the difficulty of further lowering the supply voltage, the power density increases significantly. The consequence is that, for a given power budget, the energy efficiency of hardware resources must be improved to maximize the performance. Application-specific integrated circuits (ASICs) obtain high energy efficiency at the cost of low flexibility for various applications, while general-purpose processors (GPPs) gain generality at the expense of efficiency.

To provide both high energy efficiency and flexibility, this dissertation explores the ultra-low-power design of application-specific instruction-set processors (ASIPs) for ubiquitous sensing and computing. Two application scenarios, i.e. high-throughput compute-intensive processing for multimedia and low-throughput low-cost processing for the Internet of Things (IoT), are implemented in the proposed ASIPs.

Multimedia stream processing for human-computer interaction always features high data throughput. To design processors for networked multimedia streams, application-specific accelerators controlled by the embedded processor are customized. By abstracting the common features of multiple coding algorithms, video decoding accelerators are implemented for networked multi-standard multimedia stream processing. Fabricated in 0.13 µm CMOS technology, the processor running at 216 MHz is capable of decoding real-time high-definition video streams with a power consumption of 414 mW.

When even higher throughput is required, such as in multi-view video coding applications, multiple customized processors are connected with an on-chip network. Design problems are further studied for selecting the capability of single processors, the number of processors, the capacity of the communication network, as well as the task assignment schemes.

In the IoT scenario, low processing throughput but high energy efficiency and adaptability are demanded for a wide spectrum of devices. In this case, a tile processor including a multi-mode router and dual cores is proposed and implemented. The multi-mode router supports both circuit and wormhole switching to facilitate inter-silicon extension for providing on-demand performance. The control-centric dual-core architecture uses control words to directly manipulate all hardware resources. Such a mechanism avoids introducing complex control logic, and the hardware utilization is increased. Programmable control words enable reconfigurability of the processor for supporting general-purpose ISAs, application-specific instructions and dedicated implementations. The idea of reducing global data transfer also increases the energy efficiency. Finally, a single tile processor together with a network of bare dies and a network of packaged chips has been demonstrated as the result. The processor is implemented in 65 nm low-leakage CMOS technology and achieves an energy efficiency of 101.4 GOPS/W for each core.

To my family

Acknowledgments

First and foremost, I would like to express my deep gratitude to my advisors: Prof. Li-Rong Zheng, Assoc. Prof. Zhonghai Lu and Dr. Zhuo Zou. I am heartily thankful to Li-Rong for providing me the opportunity to be a doctoral student at KTH and for all the support during these years. His vast expertise and broad vision greatly inspire me in my research. I am sincerely grateful to Zhonghai for his professional guidance and constructive suggestions. I appreciate all the discussions with him, from which I have benefited a lot. I am really thankful to Zhuo for his valuable advice on everything from doing research and writing scientific papers to trivia. It is exciting and helpful to work with him.

I would also like to express my gratitude to Dr. Zhibo Pang for his assistance in the early phase of my research. I gained experience in both engineering and everyday life during that period. I am indebted to Stefan Blixt at Imsys AB for broadening and deepening my knowledge of processor architectures through plentiful technical discussions. His patience and enthusiasm greatly encourage me.

I would like to thank Dr. Qiang Chen and Dr. Geng Yang for sharing experiences and giving suggestions on both work and life. Many thanks to my previous and current colleagues at KTH and Imsys for all kinds of help that made the Ph.D. journey pleasurable.

I am grateful to Prof. Hannu Tenhunen for reviewing the draft of this dissertation. Many thanks to Dávid Juhász and Nicholas Rose for proofreading this dissertation and providing so many valuable comments. I would also like to express my sincere acknowledgment to Prof. Jari Nurmi from Tampere University of Technology for acting as my opponent, as well as to Prof. Peeter Ellervee from Tallinn University of Technology, Adj. Prof. Lauri Koskinen from University of Turku, and Assoc. Prof. Cristina Rusu from ACREO AB for serving as my committee members.

I also want to thank all the people I have met for making every day special!

Finally, I wish to express my gratitude to my wife Youjia for her understanding, patience and support, to my little boy Kevin for his innocent smiles, and to my parents for their unconditional love and constant support. This dissertation is dedicated to you.

Ning Ma
September 2015

Stockholm


Abbreviations

ALU    Arithmetic Logic Unit
ASIC   Application-specific Integrated Circuit
ASIP   Application-specific Instruction-set Processor
CIF    Common Intermediate Format
CISC   Complex Instruction Set Computer
CMOS   Complementary Metal Oxide Semiconductor
CMP    Chip Multiprocessor
CPU    Central Processing Unit
CS     Circuit Switching
CW     Control Word
DF     Deblocking Filter
DM     Data Memory
DMA    Direct Memory Access
DSP    Digital Signal Processor
FEC    Forward Error Correction
FIFO   First In First Out
FPGA   Field-Programmable Gate Array
FSM    Finite State Machine
FU     Function Unit
GOP    Group of Pictures
GPP    General-purpose Processor
HD     High Definition
HEVC   High Efficiency Video Coding
IM     Instruction Memory
IoT    Internet of Things
IQ     Inverse Quantization
ISA    Instruction Set Architecture
IT     Inverse Transformation
MAC    Multiplication and Accumulate unit
MB     Macro Block
MC     Motion Compensation
MPEG   Moving Picture Experts Group
MVC    Multi-view Video Coding
NoC    Network on Chip
NRE    Non-recurring Engineering
RAM    Random Access Memory
RISC   Reduced Instruction Set Computer
ROM    Read-only Memory
SDRAM  Synchronous Dynamic Random Access Memory
SIMD   Single Instruction Multiple Data
SMT    Sequence Mapping Table
SoC    System on Chip
UHD    Ultra High Definition
VC     Virtual Channel
VCEG   Video Coding Experts Group
VLD    Variable Length Decoding
VLIW   Very Long Instruction Word
VGA    Video Graphics Array
WS     Wormhole Switching

List of Publications

Papers included in the thesis:

1. Ning Ma, Zhonghai Lu, and Li-Rong Zheng. “System design of full HD MVC decoding on mesh-based multicore NoCs,” In Microprocessors and Microsystems (Elsevier), Volume 35, Issue 2, March 2011, Pages 217-229.

2. Ning Ma, Zhuo Zou, Zhonghai Lu, and Li-Rong Zheng. “Design and Implementation of Multi-mode Routers for Large-scale Inter-core networks,” Minor Revision (requires no re-evaluation), Integration, the VLSI Journal (Elsevier), 2015.

3. Ning Ma, Zhibo Pang, Jun Chen, Hannu Tenhunen and Li-Rong Zheng. “A 5Mgate/414mW networked media SoC in 0.13um CMOS with 720p multi-standard video decoding,” In Proceedings of IEEE Asian Solid-State Circuits Conference (A-SSCC), pp. 385-388, 16-18 Nov. 2009.

4. Ning Ma, Zhibo Pang, Hannu Tenhunen and Li-Rong Zheng. “An ASIC-design-based configurable SOC architecture for networked media,” In Proceedings of IEEE International Symposium on System-on-Chip, pp. 1-4, Nov. 2008.

5. Ning Ma, Zhonghai Lu, Zhibo Pang, and Li-Rong Zheng. “System-level exploration of mesh-based NoC architectures for multimedia applications,” In Proceedings of IEEE International SOC Conference (SOCC), pp. 99-104, 27-29 Sept. 2010.

6. Ning Ma, Zhuo Zou, Zhonghai Lu, and Li-Rong Zheng. “Implementing MVC Decoding on Homogeneous NoCs: Circuit Switching or Wormhole Switching,” In Proceedings of 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 387-391, 4-6 March 2015.

7. Ning Ma, Zhuo Zou, Zhonghai Lu, Stefan Blixt and Li-Rong Zheng. “A hierarchical reconfigurable micro-coded multi-core processor for IoT applications,” In Proceedings of IEEE International Symposium on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), pp. 1-4, 26-28 May 2014.

8. Ning Ma, Zhuo Zou, Zhonghai Lu, Yuxiang Huan, Stefan Blixt and Li-Rong Zheng. “A 2-mW Multi-mode Router Design with Dual-core Processor in 65 nm LL CMOS for Inter-silicon Communication,” Manuscript submitted to Microelectronics Journal (Elsevier), 2015.

9. Ning Ma, Zhuo Zou, Yuxiang Huan, Stefan Blixt, Zhonghai Lu and Li-Rong Zheng. “A 101.4 GOPS/W Reconfigurable and Scalable Control-centric Embedded Processor for Domain-specific Applications,” Manuscript submitted to IEEE International Symposium on Circuits and Systems (ISCAS), 2016.

Papers not included in the thesis:

10. Ning Ma, Zhuo Zou, Yuxiang Huan, Stefan Blixt, Zhonghai Lu and Li-Rong Zheng. “A Cacheless Control-centric Dual-core X3 Processor for Real-time Energy-efficient Processing,” Manuscript submitted to Special Issue on Sustainable Processor Architectures and Applications, Journal of Microprocessors and Microsystems (Elsevier), 2015.

11. Yuxiang Huan, Ning Ma, Stefan Blixt, Zhuo Zou, and Lirong Zheng. “A 61 µA/MHz Reconfigurable Application-Specific Processor and System-on-Chip for Internet-of-Things,” In Proceedings of IEEE International SOC Conference (SOCC), pp. 235-239, 9-11 Sept. 2015.

Contents

List of Figures

1 Introduction
    1.1 Background
    1.2 Motivation
    1.3 Thesis contributions and organization

2 Low-power application-specific instruction-set processors
    2.1 Low-power techniques
    2.2 Applications
    2.3 Architectures
    2.4 Approaches

3 Single-ASIP implementation for multi-standard multimedia processing
    3.1 Introduction
    3.2 Design methodology
    3.3 Video coding structure
    3.4 Accelerator-enriched architecture
    3.5 Design instance
    3.6 Implementation results
    3.7 Summary

4 Multi-processor architecture exploration for multi-view video decoding
    4.1 Background
    4.2 Design space exploration
    4.3 Communication network
    4.4 Summary

5 Customizable tile-ASIP implementation for Internet of Things
    5.1 Background
    5.2 Multi-mode router
    5.3 Control-centric processor
    5.4 Tile implementation
    5.5 Summary

6 Conclusions
    6.1 Thesis summary
    6.2 Future work

Bibliography

List of Figures

1.1 Time line of computers
1.2 Trends of video applications
1.3 One scenario of IoT
1.4 Applications and implementations

2.1 Comparison of ASIC, GPP and ASIP
2.2 Architectures of ASIPs
2.3 Design flow for extending a existing processor
2.4 Design flow for customizing a processor

3.1 Video coding standards
3.2 Design flow
3.3 HW/SW partition
3.4 General structure of video coding
3.5 Accelerator-enriched architecture
3.6 Variety of data transferring
3.7 Design instance for networked multimedia processing
3.8 Video decoding pipeline
3.9 Die photo and power consumption
3.10 Test system and application scenario

4.1 Typical prediction structure
4.2 An example of 4x4 2D mesh NoC
4.3 Mapping multimedia applications to NoC
4.4 Simplified System
4.5 Work flow for the system exploration
4.6 Hierarchical structure of MVC
4.7 Picture-level parallelism
4.8 A possible data flow for picture-level parallel processing
4.9 Design options for picture-level task assignment
4.10 A possible data flow for view-level parallel processing
4.11 Design options for view-level task assignment
4.12 Router structures

5.1 Proposed multi-mode router
5.2 Maximum throughput with only CS or WS activated
5.3 The comparison of throughput and latency
5.4 Areas for routers with different configurations
5.5 Overall throughput (OT) and power efficiency (PE)
5.6 Complete structure of the on-chip multi-mode router in one tile
5.7 Architecture overview
5.8 Structure of one control word
5.9 Mapping from applications to the processor
5.10 Comparison of instructions processing
5.11 Control word timing
5.12 Hierarchical implementation structure
5.13 Die photo together with gate count breakdown
5.14 Power consumption under different working modes and conditions
5.15 Network on a board and network in a package

Chapter 1

Introduction

1.1 Background

When the first electronic general-purpose computer ENIAC was announced in 1946, it used over 17000 vacuum tubes and covered 140 m² with a power consumption of 174 kW [1]. To program it, manual setting of switches and cables was necessary. The vacuum tubes were later substituted by discrete transistors, and further by integrated circuits. The miniaturization and integration of devices greatly decreased the size and power consumption as well as increased performance, which enabled the application of personal computers and portable computers. Figure 1.1 shows a brief history of computers.

With the utilization of the CMOS process and following Moore’s law, the monolithic integrated circuit can hold more and more transistors since the feature size scales down around every two years [2]. Table 1.1 gives the power scaling for fixed-sized chips. Assuming a scaling factor of S, if the size of chips remains the same as that of the previous generation, the quantity of transistors and the maximum working frequency scale with the factor of S² and S, respectively. The capability of chips with the same size will be improved by a factor of S³. To keep the power consumption constant, the supply voltage needs to be scaled down correspondingly, which is called Dennard scaling [3]. During this period, the performance of processors is increased by raising the working clock frequency with the support of more and more sophisticated architectures, e.g. deeper pipelines, speculation, hierarchical caches, etc. powered by augmented transistors. The powerful yet small processors boost the miniaturization of computers, and the mobile phones are even more capable than the computers in old generations.

Figure 1.1: Time line of computers

Table 1.1: Power scaling for a fixed-sized chip from [4]

The further miniaturization brings about the merging of computers into the ambient environment, and facilitates ubiquitous sensing and computing empowered by smart electronics. Application scenarios can be both high data throughput applications such as human-computer interaction and low data throughput applications such as the Internet of Things (IoT). Ultra-low-power processors are highly demanded for the miniaturized devices. On the other hand, the transistor size is continuously shrinking, while the supply voltage cannot be scaled down correspondingly due to leakage problems, which breaks down Dennard scaling [4]. As shown in Table 1.1, if all the transistors work at the maximum frequency, the power consumption will increase exponentially. For a given power budget, either the chip size or the activity of the transistors should be diminished.
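The scaling argument can be made concrete with the standard dynamic power relation P ∝ N·C·V²·f. The following is a minimal sketch and not a reproduction of Table 1.1: the 1/S scaling of per-transistor capacitance is the classical Dennard assumption and is not stated in the text, and the function names are hypothetical.

```python
def relative_chip_power(s, voltage_scales):
    """Relative dynamic power of a fixed-size chip after one scaling step.

    Assumes P ~ N * C * V^2 * f with every transistor switching at the
    maximum frequency. The 1/s capacitance scaling is an assumption taken
    from classical Dennard scaling; it is not given in the text.
    """
    transistor_count = s ** 2                     # N scales with S^2 at fixed area
    frequency = s                                 # f scales with S
    capacitance = 1.0 / s                         # C per transistor (assumed)
    voltage = 1.0 / s if voltage_scales else 1.0  # V stalls in the post-Dennard case
    return transistor_count * capacitance * voltage ** 2 * frequency


def active_fraction_for_constant_power(s, voltage_scales):
    """Fraction of full activity that keeps the chip within the old power budget."""
    return min(1.0, 1.0 / relative_chip_power(s, voltage_scales))


if __name__ == "__main__":
    s = 2.0
    print(relative_chip_power(s, voltage_scales=True))
    print(relative_chip_power(s, voltage_scales=False))
    print(active_fraction_for_constant_power(s, voltage_scales=False))
```

Under these assumptions and with S = 2, the relative power stays at 1 when the voltage scales and grows to S² = 4 when it does not, so only a quarter of the full-speed activity fits in the original power budget.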

The traditional low-power techniques such as clock gating and power gating, which either lower the working frequency or completely shut down the circuit, will decrease the power consumption while the performance is degraded at the same time. Dynamic voltage and frequency scaling (DVFS) is utilized in [5, 6, 7] to adapt the voltage supply and clock frequency for providing the proper performance for different application scenarios. When high performance is not needed, a sub-threshold supply voltage can even be applied to save energy at the cost of large delay and slow speed, as reported in [8, 9].
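The trade-off between power, energy and speed described here can be illustrated with the first-order CMOS dynamic power model. This is a hedged sketch only: leakage and the dependence of the maximum frequency on the supply voltage are ignored, and the capacitance, voltages, frequencies and cycle count are hypothetical values chosen for illustration, not figures from the cited works.

```python
def dynamic_power(c_eff, vdd, freq):
    """Dynamic power in watts: P = C_eff * Vdd^2 * f (leakage ignored)."""
    return c_eff * vdd ** 2 * freq


def task_energy(c_eff, vdd, cycles):
    """Dynamic energy in joules for a task that needs `cycles` clock cycles."""
    return c_eff * vdd ** 2 * cycles


def task_time(cycles, freq):
    """Execution time in seconds for a task of `cycles` clock cycles."""
    return cycles / freq


if __name__ == "__main__":
    C_EFF = 1e-9   # effective switched capacitance in farads (hypothetical)
    CYCLES = 1e6   # task length in clock cycles (hypothetical)
    # (label, supply voltage in volts, clock frequency in hertz)
    operating_points = [
        ("nominal", 1.2, 200e6),
        ("frequency lowered only", 1.2, 50e6),
        ("voltage and frequency lowered", 0.6, 50e6),
    ]
    for label, vdd, freq in operating_points:
        print(label,
              dynamic_power(C_EFF, vdd, freq),
              task_energy(C_EFF, vdd, CYCLES),
              task_time(CYCLES, freq))
```

In this model, lowering only the frequency reduces power but leaves the energy per task unchanged while the task takes longer, whereas lowering the voltage as well reduces the energy quadratically. This is why voltage scaling, down to sub-threshold operation, is attractive when high performance is not needed.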

In order to lower the power consumption while maximizing the performance, energy efficiency and hardware utilization must be effectively improved in the post-Dennard era. The augmented transistors from Moore’s law should be utilized efficiently instead of only increasing the working clock frequency of processors. Therefore, the architectural level optimization for high energy efficiency becomes critical.

Figure 1.2: Trends of video applications. (a) Picture Resolutions: CIF, VGA, HD (1920x1080), 4K UHD (3840x2160), 8K UHD (7680x4320). (b) Views: Single View, Stereo Views, Multiple Views.

This dissertation explores the design of ultra-low-power application-specific instruction-set processors (ASIPs) for ubiquitous sensing and computing in both high-throughput compute-intensive processing of multimedia and low-throughput low-cost processing of IoT scenarios. The low power consumption and high performance are realized by increasing the energy efficiency through customization and specialization at the architectural level for particular scenarios.

High-throughput applications

Multimedia processing in human-computer interactions always needs high performance processors for the extremely large data throughput. In the case of video processing, for example, the resolutions of videos have increased from CIF (352x288) and VGA (640x480) to HD (1920x1080), and even to 8K UHD (7680x4320) [10]. Digital Cinema Initiatives (DCI) have already adopted 2K (2048x1080) and 4K (4096x2160) images [11]. NHK is going to do experimental broadcasting for Super Hi-Vision (7680x4320) in 2020 [12]. From another point of view, the included views of video streams can vary from single view through stereo to multiple views as shown in Figure 1.2. 3D video is an important application and is having a great impact on our daily life [13, 14, 15]. The increase in resolution and views has resulted in a tens or even hundreds of times increase in data volume. This trend will continue since it is the instinct of people to pursue higher quality and a more vivid experience. Considering the varieties of similar applications and rapid updates, flexibility is also a merit for the design.

Figure 1.3: One scenario of IoT
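The growth in data volume for the resolutions quoted in the high-throughput discussion can be checked with simple arithmetic. The sketch below assumes the same frame rate and bits per pixel for every format, so that raw data volume is proportional to pixels per frame times the number of views; the helper names and the choice of CIF as the baseline are illustrative assumptions.

```python
# Picture resolutions quoted in the text, as (width, height) in pixels.
RESOLUTIONS = {
    "CIF": (352, 288),
    "VGA": (640, 480),
    "HD": (1920, 1080),
    "8K UHD": (7680, 4320),
}


def pixels_per_frame(name):
    """Number of pixels in one picture of the named format."""
    width, height = RESOLUTIONS[name]
    return width * height


def volume_ratio(name, views=1, baseline="CIF"):
    """Raw data volume relative to a single-view stream in the baseline format.

    Assumes identical frame rate and bits per pixel for all formats.
    """
    return views * pixels_per_frame(name) / pixels_per_frame(baseline)


if __name__ == "__main__":
    for name in RESOLUTIONS:
        print(name, pixels_per_frame(name), round(volume_ratio(name), 2))
    # A stereo (two-view) HD stream relative to single-view CIF.
    print(round(volume_ratio("HD", views=2), 2))
```

Under these assumptions a single HD view carries roughly 20 times the raw data of CIF and an 8K UHD view roughly 330 times, and each additional view multiplies the figure again, which is consistent with the increase of tens or even hundreds of times stated above.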

Low-throughput applications

As one of the enabling technologies for ubiquitous sensing and computing, the Internet of Things (IoT) will connect everything in the surroundings seamlessly [16]. There will be billions or trillions of devices embedded in those “things”. Figure 1.3 shows one of the scenarios. Lightweight architectures capable of sensing, processing and communication are highly demanded, at extremely low power and low cost. Low power consumption is one of the most important characteristics required for IoT devices due to their limited power supply. Furthermore, a certain computational capability is still necessary for signal processing and communication. Flexibility and scalability are also required so that the processors can adapt to the large number and variety of IoT devices.

Architecture alternatives

Many architectures might be used for the aforementioned application scenarios. General-purpose processors (GPPs) have good programmability, and can be easily applied in different applications. However, limited performance and high power consumption make them insufficient for high throughput multimedia processing, particularly for portable devices [17, 18]. The low energy efficiency of GPPs is also an obstacle for IoT applications. Digital signal processors (DSPs) enhance the computational capability for digital signal processing and keep good programmability as well [19]. But the performance is still not adequate for sophisticated multimedia applications, and DSPs are not suitable for IoT applications because their capability for general-purpose control is weak and they are not energy efficient enough. Application-specific integrated circuits (ASICs) are superior to other architectures from the perspective of both performance and energy efficiency [20, 21, 22, 23, 24]. The drawback is the low flexibility and scalability when trying to adapt to different applications, which results in high design and implementation cost as well as long time-to-market.

Application-specific instruction-set processors (ASIPs) can improve the energy efficiency and provide flexibility as well [25, 26]. The programmable processors empower the adaptability for similar applications, while the customized function units guarantee the necessary performance. Implementations of ASIPs for multimedia and IoT have been reported in [27, 28, 29, 30, 31].

1.2 Motivation

The two scenarios mentioned above represent two extremes of the applications. Different strategies are necessary for designing ultra-low-power customizable energy-efficient processors according to the implementation targets. This dissertation explores and verifies the design strategies by proposing and realizing the corresponding processors.

Networked multimedia processing is a performance-demanding application. Furthermore, there are many formats for multimedia streams. In order to handle networked multimedia streams, ASIPs with high performance for high-definition video coding and flexibility for multiple formats should be designed. In this work, video accelerators supporting high-definition multi-standard video decoding are designed and implemented. Those accelerators are controlled by specific instructions issued by the embedded processor, and both the number and the computational capability of the accelerators are adjustable to provide the needed performance.

As introduced in the last section, another trend for video applications is the utilization of multiple views of the same scene. The videos from different view directions are recorded simultaneously. Multi-view video coding (MVC) has been standardized as an amendment of the H.264/MPEG-4 AVC video coding standard [32], and can be used in 3D video, free-view video applications, etc. Compared with the architecture for single-view applications, processing video with multiple views requires the performance to be improved many times over. A reasonable solution is to employ multiple processors by exploiting parallelism in MVC. Considering the huge volume of data exchanged among different processors, an efficient communication architecture is required. Network-on-chip (NoC) based multiple processors are thus used in the proposed architecture for MVC applications in this dissertation.

Figure 1.4: Applications and implementations

For the applications discussed above, the design of processors is performance-driven, while from the IoT perspective, high energy efficiency under extremely low power conditions is more important. At the same time, to adapt to different scenarios, the ASIPs should also feature flexibility and scalability. Based upon the preceding considerations, a reconfigurable and scalable control-centric dual-core processor with an on-chip multi-mode router is proposed and implemented. The high energy efficiency is achieved by increasing the hardware utilization of functional units, decreasing non-functional units, and reducing excessive global data transfer. The reorganizable functional units enable reconfigurability of the processor, and the on-chip multi-mode router extends the scalability of the processor even in the post-fabrication stage. Figure 1.4 summarizes the applications, together with the required features and the proposed systems in this dissertation.

1.3 Thesis contributions and organization

This dissertation is organized as follows:


Chapter 2

This chapter reviews the related work on the design of low-power ASIPs. The possible application scenarios are shown in this chapter, including multimedia, communication, biomedical applications, etc., followed by the implementation architectures. The two approaches for designing ASIPs are given at the end of this chapter.

Chapter 3

The design and implementation of an ASIP for networked multimedia processing are described in this chapter. The design flow is firstly given for software/hardware co-design, followed by analysis of the application. The video processing accelerators are then customized and integrated with the embedded processor cores to construct the high performance and low power processor. The chip implementation and test system are demonstrated as the results.

• Contributions: The design strategy of utilizing customizable accelerators together with embedded processors is verified for the applications where high performance and energy efficiency are required. As a result, an accelerator-enriched dual-core architecture is proposed for performance-demanding applications. A design instance for high-definition multi-standard video decoding is then fabricated and tested with high performance and low power consumption.

The included papers are:

• Paper I. Ning Ma, Zhibo Pang, Hannu Tenhunen and Li-Rong Zheng. “An ASIC-design-based configurable SOC architecture for networked media,” In Proceedings of International Symposium on System-on-Chip, pp. 1-4, Nov. 2008.

• Paper II. Ning Ma, Zhibo Pang, Jun Chen, Hannu Tenhunen and Li-Rong Zheng. “A 5Mgate/414mW networked media SoC in 0.13um CMOS with 720p multi-standard video decoding,” In Proceedings of IEEE Asian Solid-State Circuits Conference (A-SSCC), pp. 385-388, 16-18 Nov. 2009.

Chapter 4

This chapter explores the design space of multi-processor architecture for multi-view video decoding. It begins with the theoretical analysis on the simplified system, from which the exploration work flow is extracted accordingly. To implement multi-view video decoding, both picture-level and view-level task assignment schemes are introduced and discussed for the homogeneous multi-processor architecture. Based on the exploration platform, the design options for each task assignment scheme are studied and summarized.

• Contributions: Multi-processor architecture is proposed and explored for multi-view video applications, which demand even higher performance. To the best of our knowledge, a scalable and flexible architecture for MVC decoding has not been studied before. It is also found that to achieve a specific performance target, the computation and communication subsystems should be balanced, so that all the resources can be utilized efficiently.

The included papers are:


• Paper III. Ning Ma, Zhonghai Lu, and Li-Rong Zheng. “System design of full HD MVC decoding on mesh-based multicore NoCs,” In Microprocessors and Microsystems (Elsevier), Volume 35, Issue 2, March 2011, Pages 217-229.

• Paper IV. Ning Ma, Zhonghai Lu, Zhibo Pang, and Li-Rong Zheng. “System-level exploration of mesh-based NoC architectures for multimedia applications,” In Proceedings of International SOC Conference (SOCC), pp. 99-104, 27-29 Sept. 2010.

• Paper V. Ning Ma, Zhuo Zou, Zhonghai Lu, and Li-Rong Zheng. “Implementing MVC Decoding on Homogeneous NoCs: Circuit Switching or Wormhole Switching,” In Proceedings of 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 387-391, 4-6 March 2015.

Chapter 5

The energy-efficient processor for IoT applications is proposed and implemented in this chapter. It contains two topics: designing multi-mode routers for modular extension and proposing a processor architecture for high energy efficiency. The necessity of supporting both circuit and wormhole switching in one router to compose the inter-silicon network is discussed at first, and the corresponding multi-mode router is then introduced. Aiming at increasing the energy efficiency by improving the hardware utilization, the control-centric processor architecture is proposed. The implementation results of the module chip integrating both the processor and router are demonstrated at the end of this chapter.

• Contributions: The quantitative comparison of circuit and wormhole switch-ing is presented to give the guidelines for choosing the proper switching modein a specific scenario. To adapt to various application scenarios, a multi-moderouter supporting both circuit and wormhole switching is proposed and im-plemented. The multi-mode router is further integrated with a reconfigurabledual-core control-centric processor to construct the processor tile. The tilecan either be used independently or compose a large scale multi-processornetwork to provide high performance, energy efficiency and flexibility.



• Paper VI. Ning Ma, Zhuo Zou, Zhonghai Lu, and Li-Rong Zheng. “Design and Implementation of Multi-mode Routers for Large-scale Inter-core networks,” Minor Revision (requires no re-evaluation), Integration, the VLSI Journal (Elsevier), 2015

• Paper VII. Ning Ma, Zhuo Zou, Zhonghai Lu, Yuxiang Huan, Stefan Blixt and Li-Rong Zheng. “A 2-mW Multi-mode Router Design with Dual-core Processor in 65 nm LL CMOS for Inter-silicon Communication,” Manuscript submitted to Microelectronics Journal (Elsevier), 2015

• Paper VIII. Ning Ma, Zhuo Zou, Zhonghai Lu, Stefan Blixt and Li-Rong Zheng. “A hierarchical reconfigurable micro-coded multi-core processor for IoT applications,” In Proceedings of International Symposium on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), pp. 1-4, 26-28 May 2014

• Paper IX. Ning Ma, Zhuo Zou, Yuxiang Huan, Stefan Blixt, Zhonghai Lu and Li-Rong Zheng. “A 101.4 GOPS/W Reconfigurable and Scalable Control-centric Embedded Processor for Domain-specific Applications,” Manuscript submitted to IEEE International Symposium on Circuits and Systems (ISCAS), 2016

Chapter 6

This chapter concludes the dissertation and outlines future work ranging from processor architecture and programming models to application mapping. Brain-inspired computing will be the target of that work, to further decrease the power consumption while achieving high performance.


Chapter 2

Low-power application-specific instruction-set processors

For specific applications, many architectures can be employed as alternative implementations. An application-specific integrated circuit (ASIC) utilizes specialized hardware units and determinate data paths to achieve parallel and dedicated processing. High performance and energy efficiency are the advantages. The weaknesses are poor generality, high design complexity and high non-recurring engineering (NRE) cost. Furthermore, as more transistors can be integrated in a single chip, the design complexity has greatly increased. Verifying and testing such huge-scale circuits pose tremendous challenges for ASIC implementations [33, 34].

On the other hand, general-purpose processors (GPPs) serialize the operations of the arithmetic logic unit (ALU) through instructions for various applications. Wide generality is the merit, and the implementation of a specific application is simplified to designing the sequence of instructions for the GPP. The time-to-market is greatly shortened compared with the ASIC implementation. However, GPP implementations suffer from low performance and poor energy efficiency because of the serial processing of instructions and the overhead of fetching and decoding the instructions.

Application-specific instruction-set processors (ASIPs) try to balance performance, efficiency and generality for specific applications. Dedicated hardware accelerators, data paths or processing flows are customized for the related applications to provide sufficient performance and efficiency. Meanwhile, the specialized functions alleviate the overhead of processing instructions in GPPs. The instruction-controlled processing or reconfigurable logic guarantees the generality for similar applications, and gains programmability and flexibility compared with ASICs. The comparison of ASIC, GPP and ASIP is shown in Figure 2.1.


Figure 2.1: Comparison of ASIC, GPP and ASIP

2.1 Low-power techniques

Decreasing the working frequency or supply voltage is the direct way to reduce power consumption, but it lowers performance as well. Clock gating [35, 36, 37] and power gating [38, 39, 40] are often employed to deactivate circuits when they are not used. Besides, dynamic voltage and frequency scaling (DVFS) tunes operating conditions at a finer grain so that processors provide proper performance in diverse application scenarios [41]. In [42], the processor can work at full speed with a supply voltage of 1.14 V and a frequency of 1 GHz, consuming 125.3 W. It can also work at a low power consumption of 24.7 W at 0.7 V and 125 MHz. Fine-grained adaptive power management is implemented in [43] by applying multiple DVFS configurations based on the real-time power and temperature data collected through on-die power and thermal meters. The purpose is to achieve the maximum performance while maintaining temperature and power under predefined constraints. In [44], an on-chip controller performs per-core DVFS by monitoring workload, temperature, voltage and current dynamically. The supply voltage in [45] can even be lowered to 210 mV for a JPEG encoder working at 5 MHz.
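To make the trade-off between power and energy concrete, the two operating points quoted from [42] can be compared per clock cycle. The following is a minimal sketch, not taken from [42]: the function names are illustrative, and the first-order model P_dyn ~ C·V²·f is a common textbook approximation that is assumed here rather than stated in the source.

```python
# Operating points quoted in the text from [42]:
# name -> (supply voltage in V, clock frequency in Hz, power in W)
OPERATING_POINTS = {
    "full speed": (1.14, 1.0e9, 125.3),
    "low power": (0.70, 125e6, 24.7),
}


def energy_per_cycle_nj(power_w, freq_hz):
    """Average energy spent per clock cycle, in nanojoules."""
    return power_w / freq_hz * 1e9


def dynamic_power_ratio(v_new, f_new, v_old, f_old):
    """Power ratio predicted by the assumed model P_dyn ~ C * V^2 * f.

    The switched capacitance C cancels out in the ratio.
    """
    return (v_new / v_old) ** 2 * (f_new / f_old)


if __name__ == "__main__":
    for name, (volt, freq, power) in OPERATING_POINTS.items():
        print(f"{name}: {energy_per_cycle_nj(power, freq):.1f} nJ per cycle")
    v_hi, f_hi, p_hi = OPERATING_POINTS["full speed"]
    v_lo, f_lo, p_lo = OPERATING_POINTS["low power"]
    print("predicted power ratio:", dynamic_power_ratio(v_lo, f_lo, v_hi, f_hi))
    print("reported power ratio: ", p_lo / p_hi)
```

With the quoted figures, the low-power point draws about one fifth of the full-speed power, yet spends about 197.6 nJ per cycle against 125.3 nJ at full speed. The assumed dynamic-power model alone would predict a ratio below 5 %, so a large share of the remaining power is not explained by switching activity; the source does not break this down.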

Heterogeneity is another way to lower the power consumption. ARM’s big.LITTLE heterogeneous multi-core processor [46] runs the big core for short durations at peak frequency and the LITTLE core for the majority of the time at moderate operating frequencies. Up to 75 % of the energy can be saved without decreasing the delivered performance. Similarly, in [47] the high-performance CPU1 runs the performance-intensive tasks with much higher power consumption, and general applications are mapped to the power-efficient CPU2. Two architecturally identical but physically different cores are utilized in [48]. One core is implemented using worst-case corners to achieve the target frequency, while the other one uses typical corners to gain energy efficiency. The resulting energy savings can be up to 20 %. Asynchronous design is proposed in [49] for neural signal processors. Handshaking protocol circuitry is added between function units for data exchange. Compared with the synchronous design, the asynchronous processor shows a 2.3x reduction in power.

Integrating customized application-specific function units can achieve both low power consumption and high performance. In [50], fine-grained parallelism is implemented by SIMD architectures to accelerate the classic sequence alignment procedures. Together with the multi-core structure, 25x less energy is used while achieving performance similar to that of a quad-core Intel Core i7 3820 processor. A data compression algorithm is mapped to a specific circuit in [51] to lower the communication data rate in artificial vision systems. The energy reduction can be up to 85 % compared with its base processor. Linear matrix and vector operations are enhanced for adaptive filters in neural coding in [52], where the proposed application-specific instruction-set processors provide high performance and flexibility at low power consumption compared with implementations on commercial CPUs.

2.2 Applications

ASIPs are implemented for applications such as multimedia processing, communication, health care, etc. High efficiency video coding (HEVC) is the new standard for video compression, and it achieves a very high compression ratio at the cost of high implementation complexity. An ASIP is proposed in [53] to accelerate motion compensation, which is the most compute-intensive task in the video coding process. Three different models with diverse hardware configurations are evaluated to show the trade-offs between implementation cost and performance. Dedicated hardware accelerators for inter prediction and entropy coding can be found in the ASIP designed in [54] for the H.264/AVC video coding standard. Hardware accelerators work in parallel with the main processor and are coordinated through the customized instructions. Image detection is another application scenario. The Histograms of Oriented Gradients algorithm is run in the Support Vector Machine implemented by an ASIP in [55]. SIMD instructions, wide memory interfaces and VLIW extensions are added to a RISC processor to achieve high performance, power efficiency and programmability. An audio ASIP is implemented in [56] to decode MPEG-2/4 AAC audio streams. Specific instructions for inverse DCT, Huffman decoding, inverse quantization, etc. are integrated to accelerate decoding.

Multi-standard multi-mode channel coding in wireless communication is another important application scenario for ASIPs. Two turbo decoders are presented in [57] with different design objectives. The TurbASIP supplies the maximum flexibility by supporting all of the different modes in existing wireless communication standards. Diverse parallelism techniques are exploited. The devised pipeline architecture together with dedicated instructions is properly selected for the target flexibility. The other turbo decoder, TDecASIP, aims to increase the efficiency by limiting the flexibility and supporting the turbo codes with related parameters specified in the 3GPP LTE, WiMAX and DVB-RCS standards. Both the instruction set architecture and memory organization are simplified to increase the execution efficiency. A multi-standard forward error correction ASIP is designed in [58], offering QC-LDPC decoding for WiMAX, turbo decoding for 3GPP-LTE and convolutional code decoding. By analyzing the supported decoding algorithms, a general-purpose algorithm that can be further tuned to the specific decoding algorithm is selected, and the common parts are abstracted and implemented by shared functional blocks. Optimized instructions manage the operations and data paths of the functional blocks to implement specific decoding algorithms. In [59], a baseband ASIP is designed for multi-mode MIMO detection by parameterizing the dedicated hardware block for specific algorithms.

ASIPs can also be found in promising biomedical applications. In [60], a dual-core system is proposed for wearable sensor devices, which employs an ASIP core optimized for processing ECG, EMG and EEG signals and a low-power embedded processor as a controller. The energy efficiency is greatly improved compared with the implementation in a general-purpose processor. An integrated SoC is implemented in [61] for digital hearing aids. The embedded ASIP extends a RISC instruction set by adding dedicated hardware accelerators and compacting certain instructions to provide both flexibility and power efficiency for digital audio signal processing. In [51], a RISC processor and run-length coding accelerators compose the ASIP for data compression in artificial vision systems. Special instructions are designed for data transfer and run-length counting. The energy consumption is greatly reduced for both compression and decompression compared with the implementation in the base RISC processor. DNA sequence alignment is the target application scenario in [62]. SIMD instructions for vector-vector and vector-scalar arithmetic, logic, load/store and control are implemented to exploit fine-grained parallelism in the related algorithms, and coarse-grained parallelism is achieved by utilizing a multi-processor architecture for querying multiple sequences.

Besides, ASIPs can properly meet the requirements of other applications that emphasize both flexibility and energy efficiency, such as compressed sensing [63], encryption [64] and packet classification [65].

2.3 Architectures

Figure 2.2 shows the architectures that are utilized for ASIPs. Dedicated hardware accelerators can serve as the execution units invoked by application-specific instructions as shown in Figure 2.2(a). In [66], a bit stream engine designed for variable length coding in multiple video formats is integrated into a conventional general-purpose processor, where the bitwise computation is the bottleneck. The coding/decoding instructions share a common data path with general-purpose instructions but use different execution paths. The processor gains a performance increase for variable length coding and maintains the flexibility as well. For the application of software defined radio in [67], a trigonometric math unit, a Viterbi complex unit and a floating-point control law accelerator are coupled to a general-purpose CPU. The ASIP can provide enough performance for the existing complex wireless communication algorithms and shows the possibility of implementing additional standards.

Figure 2.2: Architectures of ASIPs: (a) Accelerator, (b) SIMD, (c) VLIW, (d) Multiple cores

Single instruction multiple data (SIMD) architectures exploit data-level parallelism to accelerate compute-intensive tasks. The same operation is applied to multiple data elements on multiple execution units simultaneously under the control of one instruction. A vector function unit (VFU) is designed in [68] for the Fast Fourier Transform (FFT) used in various wireless communication standards. The VFU can perform sixteen 16-bit MACs or eight 32-bit MACs simultaneously. Memory bandwidth is enlarged to ease the data transfer between the vector register file and the main memory. In [69], arithmetic, logic and load/store SIMD instructions realize data-parallel processing on vectors of 8 or 16 elements, according to the size of the operated pixel block in one image, when implementing the Advanced Video Coding (AVS) standard. The tasks with a heavy computational load, such as motion compensation, deblocking, etc., are greatly accelerated by the SIMD instructions.
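The lane arithmetic behind such a vector unit can be illustrated with a small software model. This is a sketch under stated assumptions, not the design of [68]: the 256-bit vector width is inferred from the two quoted modes (16 x 16 bit = 8 x 32 bit), accumulator overflow is not modelled, and the names `lanes`, `simd_mac` and `dot_product` are hypothetical.

```python
VECTOR_BITS = 256  # assumption: inferred from 16 lanes x 16 bit = 8 lanes x 32 bit


def lanes(element_bits):
    """Number of elements processed by one SIMD instruction."""
    return VECTOR_BITS // element_bits


def simd_mac(acc, a, b):
    """One SIMD multiply-accumulate: acc[i] + a[i] * b[i] in every lane."""
    return [c + x * y for c, x, y in zip(acc, a, b)]


def dot_product(xs, ys, element_bits=16):
    """Dot product built from SIMD MACs.

    Returns (result, number of SIMD MAC instructions issued).
    """
    n = lanes(element_bits)
    acc = [0] * n
    issued = 0
    for i in range(0, len(xs), n):
        a = list(xs[i:i + n])
        b = list(ys[i:i + n])
        # Pad the last, partially filled vector with zeros.
        a += [0] * (n - len(a))
        b += [0] * (n - len(b))
        acc = simd_mac(acc, a, b)
        issued += 1
    # Final horizontal reduction across the lanes.
    return sum(acc), issued


if __name__ == "__main__":
    print(dot_product(list(range(32)), [1] * 32, element_bits=16))
    print(dot_product(list(range(32)), [1] * 32, element_bits=32))
```

In this model a 32-element dot product needs two SIMD MAC instructions with 16-bit elements and four with 32-bit elements, instead of 32 scalar MACs.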

Figure 2.2(c) presents the very long instruction word (VLIW) architecture, which provides instruction-level parallel processing to increase the execution throughput and hardware utilization. Multiple independent instructions are issued and executed concurrently. In [70], the VLIW architecture is optimized for stereoscopic video applications. The basic instruction processing unit, i.e. the vector unit, is replicated to utilize possible instruction parallelism. A VLIW processor is implemented on FPGA for image processing in [71]. The application is first realized in a high-level programming language. By analyzing the generated assembly code, the independent instructions that can be executed in parallel are bound into the very long instructions.
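The binding of independent instructions into very long instructions can be sketched as a greedy packing pass. The code below is an illustration only and is not the procedure of [71]: it assumes that every register is written once, so that only read-after-write dependencies matter, that every operation completes within one bundle, and the function name `bundle` is hypothetical.

```python
def bundle(instrs, width):
    """Greedily pack instructions into VLIW bundles.

    instrs: list of (dest_register, [source_registers]) in program order.
    width:  number of issue slots per bundle.
    Returns a list of bundles, each a list of instruction indices.
    Assumption: each register is written exactly once.
    """
    produced_in = {}  # register -> index of the bundle that writes it
    bundles = []
    for idx, (dest, sources) in enumerate(instrs):
        # An instruction must come after the bundles producing its operands.
        earliest = 0
        for src in sources:
            if src in produced_in:
                earliest = max(earliest, produced_in[src] + 1)
        # Take the first bundle from there on that still has a free slot.
        slot = earliest
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(idx)
        produced_in[dest] = slot
    return bundles


if __name__ == "__main__":
    program = [
        ("r1", ["a", "b"]),
        ("r2", ["c", "d"]),
        ("r3", ["r1", "r2"]),
        ("r4", ["e", "f"]),
        ("r5", ["r3", "r4"]),
    ]
    print(bundle(program, width=2))
    print(bundle(program, width=1))
```

For the five-instruction example, a two-slot machine needs three bundles, while a single-issue machine needs five, which is the throughput gain that the VLIW organization targets.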


To further increase the performance, task- or thread-level parallelism can be utilized by multi-core or multiprocessor architectures. In [72], multiple ASIP cores are connected by a butterfly NoC, and can be dynamically configured through a dedicated bus-based communication structure. By selecting the suitable number of active ASIP cores and the corresponding configuration parameters, multi-mode and multi-standard turbo decoding has been implemented. Turbo decoders are also implemented in [73] by connecting ASIPs and memories with networks to increase the throughput. The proposed system employs a ring-style topology and enables simultaneous data exchange between the ASIPs and memories. Multiple ASIPs are connected through the shared memory in [74] for H.264 video encoding. Each ASIP is responsible for coding the macroblocks (MBs) in one slice of a picture. The MBs of different slices are processed in different ASIPs in parallel. Furthermore, several video accelerators are integrated into each ASIP to increase its processing performance. According to the throughput requirement for different video resolutions, the platform supports different configurations of the ASIP cores. A multi-core platform for multi-format video decoding is proposed in [75], in which the macroblock-row level parallelism is investigated and utilized by multiple ASIP cores. Every single ASIP core features a dual-issue VLIW architecture and designated SIMD instructions for pixel-level acceleration.

2.4 Approaches

ASIPs are implemented by either extending an existing processor or customizing a processor. The former approach strengthens off-the-shelf processor cores with application-specific function units. Application-specific instructions together with the original general-purpose instruction set compose the final application-specific instruction set. The tool chain for software development is inherited from the existing processor. The design flow of this type of ASIP is shown in Figure 2.3. Based on the analysis of applications, the time-consuming tasks are accelerated by the designated instructions, while the controlling task is managed by the general-purpose instructions.

In [50], 56 SIMD instructions for biological sequence alignment are implemented on a 32-bit Harvard RISC processor to efficiently utilize the fine-grained parallelism in the corresponding algorithms. By adding the specialized instruction set in the back end, an already existing compiler is extended to support the new ASIP. In [76], 40 carefully crafted specialized instructions for slice header processing are integrated into the base ISA of a 32-bit RISC processor (ARP32) for video decoding. Each customized instruction is mapped to an intrinsic function by the ARP32 compiler. The base ISA handles general-purpose controlling and processing, while the customized instructions deal with the complex slice header processing functions. The AES encryption algorithm is mapped to the vector units in the ASIP designed in [77]. By integrating the AES vector units in a 16-bit default processor, the proposed ASIP can be used in wireless sensor node applications, and the security service is provided by the add-on application-specific instructions.

Figure 2.3: Design flow for extending an existing processor (flowchart steps: application analysis, SW/HW partition, application-specific instruction design, software implementation, FU implementation, test, co-verification, system integration, constraint and goal checks, ASIP delivered)

The other implementation approach is to customize a processor, which generates a new processor. The instruction set is tailored to support only the target applications. The control units, execution units, data paths and pipelines are all customized to maximize the performance for a cluster of applications. The software tool chain also needs to be adjusted to fit the new processor. The corresponding design flow is summarized in Figure 2.4. Compared with the previous design methodology, the totally new instruction set and hardware architecture gain high performance and energy efficiency for the target applications. However, the target applications should include few general-purpose controlling and processing tasks due to the limitation of the instruction set.


Figure 2.4: Design flow for customizing a processor (flowchart steps: application analysis, instruction set design, software implementation, hardware implementation of whole processor architecture, realization of software tool chain, test, co-verification, constraint and goal checks, ASIP delivered)

In [78], an ASIP designed for elliptic curve cryptography employs a finite-state machine (FSM) to control the execution of eight designated instructions implementing the functions of point addition and doubling. Different algorithms can be realized by programming in assembly code, and the final program is stored in the internal ROM. Multi-standard forward error correction (FEC) decoding is supported in the SIMD ASIP proposed in [79]. The computational units, data paths and memory system are all customized according to the common features of the FEC algorithms. In order to decode a specific format, assembly code is utilized to designate the different fields of the instructions for controlling the computation and data flow. As a result, the proposed ASIP can provide high-throughput decoding of turbo, QC-LDPC and CC codes. The ASIP in [80] implements motion estimation (ME) for video codecs. Three instructions, namely Sum Absolute Difference (SAD), Compare, and Mode Decision, together with an efficient hardware architecture, are designed to handle the real-time processing of ME for HD video.

This dissertation utilizes both approaches. Besides, the proposed customizable control-centric processor supports the two approaches simultaneously. Either existing general-purpose ISAs or completely customized ISAs can be mapped to the processor as needed.


Chapter 3

Single-ASIP implementation for multi-standard multimedia processing

3.1 Introduction

Large data volume is one of the most important features of multimedia applications. For high-definition (HD) video applications, if the resolution of frames is 1920x1080 and the frame rate is 30 fps, about 93 MB/s of raw data must be processed for the YUV 4:2:0 format. This is a great challenge for both transmitting and storing the data. Moreover, the requirement for higher resolution is growing, and ultra HD (UHD) has been proposed. For 4K UHD and 8K UHD, the data volume is 4 and 16 times that of HD video, respectively.
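The quoted data rate follows from simple arithmetic, sketched below. The calculation assumes 8-bit samples, so that YUV 4:2:0 carries 1.5 bytes per pixel, and 1 MB = 10^6 bytes; the UHD resolutions 3840x2160 and 7680x4320 are assumed here because they match the stated factors of 4 and 16. The function name is illustrative.

```python
def raw_rate_mb_per_s(width, height, fps, bytes_per_pixel=1.5):
    """Raw video data rate in MB/s (1 MB = 10**6 bytes).

    bytes_per_pixel=1.5 assumes 8-bit YUV 4:2:0: one luma byte per pixel
    plus two chroma planes at quarter resolution.
    """
    return width * height * bytes_per_pixel * fps / 1e6


if __name__ == "__main__":
    hd = raw_rate_mb_per_s(1920, 1080, 30)
    uhd_4k = raw_rate_mb_per_s(3840, 2160, 30)
    uhd_8k = raw_rate_mb_per_s(7680, 4320, 30)
    print(f"HD:     {hd:.1f} MB/s")
    print(f"4K UHD: {uhd_4k:.1f} MB/s ({uhd_4k / hd:.0f}x HD)")
    print(f"8K UHD: {uhd_8k:.1f} MB/s ({uhd_8k / hd:.0f}x HD)")
```

Under these assumptions HD at 30 fps yields 93.312 MB/s, in line with the figure of about 93 MB/s given above.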

To make the data fit into the capacity of networks and storage devices, diverse effective compression algorithms have been proposed and applied. They employ advanced algorithms to gain a higher compression ratio, which results in high implementation cost as well. There is always a trade-off between compression ratio and implementation complexity. However, since the computational capability of processors keeps increasing, more complex compression algorithms can be mapped to the processor. As a result, the data rate has decreased significantly for the same quality, or equivalently, the same data rate achieves much better quality.

There are mainly two organizations standardizing the video compression process: the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG), which developed the H.26x series and MPEG series standards, respectively. Figure 3.1 shows the performance of some of those standards; the compression ratios are from [81]. From H.261, which only supports CIF (352x288) and QCIF (176x144), to the recently published H.265, which supports up to 8K UHD, and from MPEG-1, which achieves a compression ratio of around 26:1, to MPEG-H Part 2 (H.265), whose compression ratio can be around 400:1 [81], many video coding standards can be selected for different application scenarios, taking both the required performance and implementation cost into consideration. There are also other popular video coding formats like RealVideo, WMV, the VC series, the VP series, and so on. Various video formats exist simultaneously on the Internet. To process the networked media, a flexible architecture supporting multi-standard multimedia processing is required. If ASICs are implemented for decoding multi-standard multimedia, the decoders for all standards must be integrated together [82, 83], which increases the implementation cost and hinders flexibility.

Figure 3.1: Video coding standards (timeline from 1990 to 2014 of H.262/MPEG-2 Part 2 (1995), H.263 (1996), MPEG-4 Part 2 (1999), H.264/MPEG-4 Part 10 (2003) and HEVC/H.265/MPEG-H Part 2 (2013); axes: year, supported picture resolution up to 1920x1080, 4096x2160 and 8192x4320, and relative compression ratio from 1 to 3.4)

In this chapter, a dual-core ASIP with dedicated video processing accelerators is proposed and implemented to support multi-standard video decoding for network applications. High performance is guaranteed by the dedicated video processing accelerators on the one hand, and the dual-core processor supplies the flexibility on the other hand. Working at full speed, the ASIP is capable of decoding real-time 1280x720@25fps MPEG-2/MPEG-4/RealVideo video streams at 216 MHz with a maximum power consumption of 414 mW, which can be lowered to 95 mW at a working frequency of 27 MHz for 352x288@25fps video decoding.

3.2 Design methodology

The design flow is shown in Figure 3.2, and the detailed description is as follows:


Figure 3.2: Design flow

• Application analysis: multimedia streams from the Internet need to be processed. There are many different formats for both audio and video. In this case, the MPEG-2/MPEG-4/RealVideo formats for HD video content are the targets.

• Function definition: the major functions for this application are network protocol processing, media data depacketization, multi-standard audio decoding and multi-standard video decoding.

• HW/SW partition: to make the architecture flexible, software is used to implement most of the required features, except the ones with extremely high performance demand. Hardware accelerators are needed for the variable length decoding, inverse quantization, inverse transformation, motion compensation and deblocking functions used for video decoding, as shown in Figure 3.3.


Figure 3.3: HW/SW partition (SW: network protocol processing, media data depacketization, multi-standard audio decoding, multi-standard video decoding; HW: variable length decoding, inverse quantization, inverse transformation, motion compensation, deblocking; SW and HW are linked by instructions)

• HW/SW interface design: dedicated instructions are designed for the embedded processors controlling the video accelerators.

• HW/SW implementation: the related software and hardware are implemented in parallel.

• HW/SW co-verification: during this phase, the whole system is verified for the application, and the maximum performance is tested. If the performance is not adequate for HD video decoding, the partition between software and hardware is adjusted to gain the required performance.

3.3 Video coding structure

To design suitable accelerators, video coding algorithms are studied first. Most of the video coding standards employ block-based algorithms, which utilize pixel blocks of a certain size as the basic coding units. For example, 16x16-pixel blocks are commonly used in most of the standards, except the latest HEVC/H.265 standard, where blocks are as large as 64x64 pixels. The coding structures of most of the standards are similar, as shown in Figure 3.4, and the differences lie in the detailed implementation algorithms.

In the general structure of video compression, prediction is the most efficient technique to decrease data volume. Instead of sending the raw data, only the residuals between the raw data and predictions are put into the final stream. The prediction can be performed on the neighboring pixels in the same frame, which is known as intra-frame prediction or spatial prediction. Considering similarities between the consecutive frames in a video sequence, the prediction can also be made by referring to previously coded frames, which is inter-frame prediction or temporal prediction. Furthermore, to get more accurate values for inter-frame prediction, motion estimation is calculated between the coded frame and the current frame to find motion vectors, which indicate how the blocks move from the previous frame to the current one. The motion vectors help to achieve accurate prediction from the coded frames. To ensure that the decoder gets the same data as the encoder, a decoder is included in the encoder structure, as shown with dashed lines in Figure 3.4, and only the decoded frames are used as the references for prediction.

Figure 3.4: General structure of video coding (blocks: raw pictures, transform, quantization, entropy coding, formatting, coded stream; embedded decoder: inverse quantization, inverse transform, deblocking, decoded pictures; prediction: motion estimation, inter-frame prediction, intra-frame prediction; signals: coefficients, motion vectors)
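Inter-frame prediction with motion estimation can be illustrated by a full-search block matcher. The sketch below is a generic textbook technique, not the algorithm of any particular standard or of the accelerators in this chapter: it uses the sum of absolute differences (SAD) as the matching cost, represents frames as lists of rows of integer samples, and all function names are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in the
    current frame and the block displaced by (dx, dy) in the reference frame."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
        for j in range(n)
        for i in range(n)
    )


def motion_search(cur, ref, bx, by, n, search):
    """Full search over displacements in [-search, search] in both directions.

    Returns (cost, dx, dy) of the best match that lies inside the reference.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + n > width or y + n > height:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


def residual(cur, ref, bx, by, dx, dy, n):
    """Difference between the current block and its motion-compensated prediction."""
    return [
        [cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i] for i in range(n)]
        for j in range(n)
    ]


if __name__ == "__main__":
    # Synthetic 8x8 reference frame and a current frame shifted right by one pixel.
    ref = [[7 * x + 13 * y for x in range(8)] for y in range(8)]
    cur = [[ref[y][max(x - 1, 0)] for x in range(8)] for y in range(8)]
    cost, dx, dy = motion_search(cur, ref, bx=2, by=2, n=4, search=2)
    print("motion vector:", (dx, dy), "cost:", cost)
    print("residual:", residual(cur, ref, 2, 2, dx, dy, 4))
```

When the content has simply shifted, the best match has zero cost and the residual block is all zeros, so only the motion vector needs to be coded for that block.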

The residuals after prediction are transformed from the spatial to the frequency domain to ease further data compression in the quantization phase. The high-frequency part is quantized with a large step size because human viewers are not sensitive to high-frequency data. Finally, the quantized coefficients together with the motion vectors are coded using entropy coding and then formatted to compose the compressed stream.
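The effect of a frequency-dependent step size can be shown on a small block of transform coefficients. The sketch below is purely illustrative: the linear step-size rule, its parameters and the sample coefficients are assumptions made for the example and do not correspond to the quantization matrices of MPEG-2, MPEG-4 or RealVideo.

```python
def step_size(u, v, base=4, slope=2):
    """Hypothetical step size that grows with horizontal (u) and vertical (v) frequency."""
    return base + slope * (u + v)


def quantize(coeffs):
    """Divide each coefficient by its step size and round to an integer level."""
    return [
        [round(c / step_size(u, v)) for u, c in enumerate(row)]
        for v, row in enumerate(coeffs)
    ]


def dequantize(levels):
    """Reconstruct approximate coefficients from the quantized levels."""
    return [
        [level * step_size(u, v) for u, level in enumerate(row)]
        for v, row in enumerate(levels)
    ]


if __name__ == "__main__":
    # Made-up 4x4 coefficient block: large low-frequency, small high-frequency values.
    coeffs = [
        [100, 20, 5, 2],
        [18, 6, 3, 1],
        [3, 3, 1, 1],
        [2, 1, 1, 0],
    ]
    levels = quantize(coeffs)
    for row in levels:
        print(row)
    print("non-zero levels:", sum(1 for row in levels for x in row if x != 0))
```

In this example only five of the sixteen levels remain non-zero, all in the low-frequency corner, which is what makes the subsequent entropy coding effective.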


Figure 3.5: Accelerator-enriched architecture (CPU group: CPU1 to CPUn with LM1 to LMn; enhanced data processing unit: DSP/CPU with LM and IM, function units FU1 to FUn with INF1 to INFn and DM1 to DMn, and the DMAC; buses: DBUS1, DBUS2, CBUS1, CBUS2)

Figure 3.6: Variety of data transferring (panels (a), (b) and (c), showing transfers among FUs, DMs and the DMAC)

3.4 Accelerator-enriched architecture

To implement applications that require high performance and energy efficiency, an accelerator-enriched architecture is proposed and shown in Figure 3.5. The CPU group is responsible for high-level task parsing and processing. The application-specific function units (FUs) together with the tightly coupled data memories (DMs) provide high performance and energy efficiency for low-level data processing. They are controlled by the embedded processor (DSP or CPU) through the application-specific instructions, and can work in parallel to maximize the utilization. The deployment of the instruction memory (IM) decouples the embedded processor from the FUs, and enables both parts to run independently.

To avoid global movement of data, neighboring FUs share a DM between them, and an FU can access both of its neighboring DMs, as shown in Figure 3.6(a). Furthermore, any two of the DMs can exchange local data under the management of the DMA controller (DMAC), as shown in Figure 3.6(b). If data transfer with external memories is required, DMA operations are issued by the DMAC for fast data access, as shown in Figure 3.6(c).

3.5 Design instance

Based on the proposed architecture and analysis of the application, a design instance for networked multimedia processing is shown in Figure 3.7.


Figure 3.7: Design instance for networked multimedia processing (blocks: HRISC, MRISC, NWI, LSI, CCM, Bridge, CRG, CSM, Timer, GPIO, CMD cache, Dispatcher and DMA; accelerators VLD, IQ, IT, MC and DF with the ES, IQ, IT, Res., Ref., DF and WB buffers)

One of the RISC processors, i.e. HRISC, is utilized to run the operating system, manage network devices, configure the system environment, process network protocols, depacketize multimedia data, etc. The other RISC processor, i.e. MRISC, mainly works on multimedia processing. Multi-standard audio decoding and low-definition multi-standard video decoding are implemented on the MRISC. For high-definition video decoding, the hardware video decoding accelerators are activated. In that case, MRISC analyzes the video stream, collects the necessary parameters and assigns tasks to the corresponding accelerators by writing customized instructions to the CMD cache.

The video enhancement unit (VEU) consists of the VLD, IQ, IT, MC and DF modules, which are derived from the general decoder model in Section 3.3 and are responsible for high-definition video decoding. The VEU is designed by adding the individual specific features of MPEG-2/MPEG-4/RealVideo to the common structure. The VEU parses instructions from the CMD cache and performs operations on the data stored in the corresponding buffers. These instructions differ from normal instructions, which are generally simple arithmetic, logic, or data-moving operations. For each module in the VEU, one instruction includes the required parameters and operations for a whole coding block. Once started, each module performs the corresponding operations for one coding block and writes the results to the proper memory space without involving the processors.
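The block-level command interface between MRISC and the VEU can be pictured with a small software model. Everything in this sketch is hypothetical: the source does not give the instruction format or the organization of the CMD cache, so the `BlockCommand` fields, the one-FIFO-per-module layout and the function names are assumptions chosen only to illustrate that one instruction carries all parameters for a whole coding block.

```python
from collections import deque
from dataclasses import dataclass, field

MODULES = ("VLD", "IQ", "IT", "MC", "DF")  # VEU modules named in the text


@dataclass
class BlockCommand:
    """Hypothetical block-level instruction: all a module needs for one coding block."""
    module: str        # target VEU module
    standard: str      # e.g. "MPEG-2", "MPEG-4" or "RealVideo"
    block_index: int   # which coding block to process
    params: dict = field(default_factory=dict)  # module-specific parameters


class CmdCache:
    """Hypothetical model of the CMD cache as one FIFO per VEU module."""

    def __init__(self):
        self._queues = {m: deque() for m in MODULES}

    def write(self, cmd):
        """Processor side: enqueue a command for its target module."""
        if cmd.module not in self._queues:
            raise ValueError(f"unknown VEU module: {cmd.module}")
        self._queues[cmd.module].append(cmd)

    def fetch(self, module):
        """Accelerator side: take the next command, or None if idle."""
        queue = self._queues[module]
        return queue.popleft() if queue else None


def issue_block(cache, standard, block_index, params_by_module):
    """Write one command per VEU module for a single coding block."""
    for module in MODULES:
        cache.write(
            BlockCommand(module, standard, block_index, params_by_module.get(module, {}))
        )


if __name__ == "__main__":
    cache = CmdCache()
    issue_block(cache, "MPEG-2", 0, {"IQ": {"qscale": 8}})
    print(cache.fetch("IQ"))
```

The point of the model is the granularity: after the commands for a block are written, each module can proceed on its own, and the processor is free to parse the next part of the stream.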

To make the proposed architecture capable of decoding high-definition videos with their huge data volume, an efficient data communication structure is necessary. In the proposed architecture, internal buffers are placed between the VEU modules to localize the video data and accelerate the computation. The data transfer between internal buffers and external memories is managed by the DMA module. Due to the deployment of internal buffers, the modules in the VEU are able to work in parallel in a pipelined way at the block level, as shown in Figure 3.8, and at most five blocks can be processed simultaneously by the VEU.


Figure 3.8: Video decoding pipeline (Block1 to Block5 each pass through VLD, IQ, IT, MC and DF, staggered along the time axis)
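The block-level pipelining of the five VEU modules can be reproduced with a short scheduling model. It is an idealized sketch: every stage is assumed to take one equal time slot per block and no stalls are modelled, which the source does not claim; the function names are illustrative.

```python
STAGES = ["VLD", "IQ", "IT", "MC", "DF"]  # VEU modules in pipeline order


def schedule(num_blocks):
    """Map each time slot to the block handled by every busy stage.

    Assumes each stage needs exactly one slot per block (idealized).
    Blocks are numbered from 1, as in Figure 3.8.
    """
    slots = {}
    for block in range(num_blocks):
        for offset, stage in enumerate(STAGES):
            slots.setdefault(block + offset, {})[stage] = block + 1
    return slots


def total_slots(num_blocks):
    """Slots needed to finish all blocks: pipeline fill plus one per block."""
    return num_blocks + len(STAGES) - 1 if num_blocks else 0


if __name__ == "__main__":
    for slot, busy in sorted(schedule(5).items()):
        print(slot, busy)
```

In slot 4 of this model all five modules are busy with five different blocks, matching the statement that at most five blocks are processed simultaneously; without pipelining, n blocks would need 5n slots instead of n + 4.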

Flexibility

The flexibility of the proposed architecture is three-fold:

• Dual-core processor: when high performance is not necessary, the VEU can be turned off, and the two embedded RISC processors are fully programmable for various multimedia applications.

• Activity of VEU modules: the number of active modules in the VEU can be adjusted to the specific performance requirement. The unnecessary modules can be turned off to save power.

• Working frequency of each module: the working frequency of the modules in the VEU and of the two embedded processors can be tuned according to the specific application to provide adequate performance while keeping power consumption at a low level.

3.6 Implementation results

The ASIP is implemented in 0.13 µm CMOS technology with a gate count of around 5 million. The die size is 6.4 mm x 6.4 mm; the die photo is shown in Figure 3.9 together with the power consumption. The maximum working frequency is 216 MHz, at which 1280x720 @ 25 fps MPEG-2/MPEG-4/RealVideo streams are processed with a power consumption of 414 mW. Only 13% of this is consumed by the VEU, which shows the efficiency of the hardware accelerators. For low-resolution videos, such as CIF-sized pictures with a resolution of 352x288, the working frequency can be lowered to 27 MHz, at which the power consumption is only 95 mW. The test system and application scenario are shown in Figure 3.10.
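A few derived figures follow directly from the numbers reported above. The sketch below is plain arithmetic on those numbers; the variable names are illustrative, and the energy per frame assumes the 414 mW figure applies to sustained 25 fps decoding.

```python
# Figures reported in the text for the 0.13 um implementation.
TOTAL_POWER_MW = 414.0      # 1280x720 @ 25 fps at 216 MHz
VEU_SHARE = 0.13            # fraction of the total power consumed by the VEU
FRAME_RATE_FPS = 25
LOW_RES_POWER_MW = 95.0     # CIF-sized video at 27 MHz

veu_power_mw = TOTAL_POWER_MW * VEU_SHARE               # about 53.8 mW
energy_per_frame_mj = TOTAL_POWER_MW / FRAME_RATE_FPS   # mW / fps = mJ per frame
frequency_ratio = 216 / 27                              # clock lowered 8x for CIF
power_ratio = TOTAL_POWER_MW / LOW_RES_POWER_MW         # power lowered about 4.4x

if __name__ == "__main__":
    print(f"VEU power:        {veu_power_mw:.1f} mW")
    print(f"Energy per frame: {energy_per_frame_mj:.2f} mJ")
    print(f"Frequency ratio:  {frequency_ratio:.1f}x, power ratio: {power_ratio:.2f}x")
```

The power drops less than the clock frequency does, which is consistent with part of the consumption not scaling with frequency; the text itself does not break this down.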

Figure 3.9: Die photo and power consumption. (a) Die photo (6.4 mm x 6.4 mm); (b) Power consumption

Figure 3.10: Test system and application scenario; labels: VEU (MB-level video decoding), MRISC32 (media stream parsing), NRISC32 (network, UI). (Adapted from [84])

3.7 Summary

To implement networked multi-standard multimedia processing applications, a dual-core ASIP with high performance and high flexibility is proposed and implemented. Highly specific accelerators provide the necessary performance. The generality of the accelerators enables adaptation to other similar applications. In addition, high flexibility is provided by the embedded processors. The balance between performance-to-power ratio and flexibility is achieved by software/hardware co-design.


Memory organization and data transmission structure are also key issues for applications with high throughput requirements. In this architecture, distributed internal buffers are deployed to increase the processing throughput, and DMA is implemented to speed up data transmission between internal buffers and external memories. Buses connect the memories and functional units, and the communication throughput is adequate for this application. For applications with heavy communication load, however, a more efficient communication infrastructure is explored in the next chapter.


Chapter 4

Multi-processor architecture exploration for multi-view video decoding

For applications with even higher performance requirements, multiple ASIPs can be connected to increase the performance while providing sufficient flexibility. This chapter explores multi-processor architectures for multi-view video processing, including design space exploration and implementation of the communication infrastructure.

4.1 Background

3D video applications are entering our daily life, such as 3DTV, 3D movies, 3D football matches, etc. To present 3D video, the pictures from two or more views must be recorded, coded and transmitted. Compared with single-view video, the data volume and throughput are multiplied many times. Because similarities exist between pictures from different views, many algorithms have been proposed to compress multi-view video efficiently by making use of the correlations between different views [85]. As a result, the performance of decoders needs to be improved many times compared with that required for single-view video decoding. This poses challenges not only for processing units but also for communication architectures. Some sophisticated ASICs have already been reported for decoding MVC video [21, 86, 87]. High-speed, high-throughput circuits can provide adequate performance for high-definition multi-view video decoding. However, the design complexity is considerable, and flexibility and scalability are limited.

Alternatively, the applications can be implemented with multiple processors by exploiting parallelism and employing multi-processor architectures. We use this approach, which has several advantages:


• Low design complexity: the design problem is reduced to choosing a single ASIP with proper performance, selecting the necessary number of ASIPs and connecting those processors with an efficient communication structure.

• High flexibility: the processors can be re-programmed to adapt to similar applications.

• High scalability: on-demand performance can be provided for a particular application by adjusting the number of processors.

Multi-view video coding (MVC)

As Annex H of H.264 [32], MVC inherits all the sophisticated coding algorithms in H.264 for single-view video coding. In addition, inter-view prediction is utilized as a very efficient technique to obtain a large compression ratio [88]. For single-view video coding, inter-frame prediction uses temporally neighboring frames as the reference to predict the data in the current frame, which greatly increases the coding efficiency compared with the techniques for still image encoding. In MVC, not only temporally neighboring frames but also spatially adjacent frames, i.e. frames from neighboring views, are utilized to predict the current frame. The merit is that coding efficiency is significantly increased. The drawback is the complicated coding structure, which is an obstacle for parallel processing.

A typical prediction structure in one group of pictures (GOP) is shown in Figure 4.1 for eight views; in total, 64 pictures exist in one GOP. In the base view, encoding or decoding of the pictures does not rely on pictures from other views. A single-view decoder can process only the data in the base view, which keeps the bit stream compatible with normal single-view decoders. For anchor pictures, temporal prediction is not applied, and only inter-view prediction can be utilized. The purpose is to prevent error propagation between GOPs and to support random access among GOPs. For the other pictures, both temporal inter-frame prediction and spatial inter-view prediction are employed to reduce data redundancy. As a result, MVC streams are characterized by a larger number of pictures to be processed and complex dependences among the pictures. To handle MVC streams, several times the data processing throughput of single-view streams is required.

Network on Chip (NoC)

To connect multiple ASIPs, an efficient data communication infrastructure should be designed to provide sufficient throughput. In [50], multiple ASIPs are connected via a bus structure together with a shared memory for data exchange, as well as a master core for managing the work flow and collecting the results. Burst mode and DMA operations are provided in the bus system for data transmission to reduce possible contention. However, for applications with heavy data load, the non-sharable and non-scalable bus structure becomes the bottleneck in a multi-ASIP system. By connecting the ASIPs with an on-chip network, the network-on-chip (NoC) [89] can supply high communication bandwidth, high efficiency and scalability.

Figure 4.1: Typical prediction structure; one square represents one picture, and arrows show the reference relations. (Horizontal axis: time; vertical axis: view; pictures are labeled with temporal index 0 to 8 and view index 0 to 7; annotations: base view, anchor pictures (blue squares), one GOP)

Figure 4.2: An example of 4x4 2D mesh NoC (ASIP 00 to ASIP 33, each attached to a router R)

Figure 4.3: Mapping multimedia applications to NoC. (a) Pipelining processing phases (IN, processing phase I, processing phase II, ..., OUT); (b) Tile processing (IN, task distribution, parallel tile processing, result converging, OUT)

NoC architectures decouple the communication subsystem from the computation subsystem. Data communication is performed in a dedicated on-chip network, and the computation part only needs to interact with the network interface for data exchange. Design problems related to the communication subsystem include the selection of topology, task assignment scheme, switching mode, flow control, routing algorithm, and so on [90]. In this work, we choose the 2D-mesh topology for its intrinsic modularity and scalability. An example of a 4x4 2D-mesh network is shown in Figure 4.2. In addition, to implement NoCs for specific applications, design decisions for the computation part, such as choosing the processing capability of each node and selecting a suitable number of nodes, also need to be taken into consideration. The computation capability and communication capacity should be balanced to achieve optimized overall performance for the whole system.
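The modularity of the 2D-mesh topology can be seen in a short sketch that builds the neighbor relation for a mesh of any size. The function names and the (row, column) node coordinates are illustrative choices, not taken from the thesis.

```python
def mesh_neighbors(rows, cols):
    """Map each node (r, c) of a rows x cols 2D mesh to its directly linked neighbors."""
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # Keep only the candidates that lie inside the mesh boundary.
            neighbors[(r, c)] = [
                (nr, nc) for nr, nc in candidates
                if 0 <= nr < rows and 0 <= nc < cols
            ]
    return neighbors


def link_count(rows, cols):
    """Number of bidirectional router-to-router links in the mesh."""
    return sum(len(adjacent) for adjacent in mesh_neighbors(rows, cols).values()) // 2


if __name__ == "__main__":
    print(link_count(4, 4))               # links in a 4x4 mesh
    print(mesh_neighbors(4, 4)[(0, 0)])   # a corner router has two neighbors
```

Growing the network only changes the two size parameters; every router keeps the same local structure with at most four neighbors plus its local node.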

4.2 Design space exploration

There are two strategies for implementing multimedia applications on NoC architectures. One is to partition the whole process into several phases, implement different phases on different NoC nodes and pipeline those phases to gain performance, as shown in Figure 4.3(a). The merit is the simple traffic pattern and deterministic data dependency between consecutive nodes in the network, while the drawback is the limited scalability constrained by the partitioned processing phases. Examples of this mapping strategy can be found in [91, 92]. The other implementation strategy is based on parallel tile processing, shown in Figure 4.3(b). Each tile is capable of handling the whole process, and the performance is increased by enabling concurrent tile processing. Scalability is extended at the cost of complex traffic patterns and a possibly large volume of data transferred among the tiles. Since this work emphasizes scalability and flexibility, the second strategy is used for the following exploration.

Figure 4.4: Simplified system: (a) the whole system, (b) one unit, (c) work schedule with inadequate link bandwidth, (d) work schedule with excessive link bandwidth; Sd is the size of tasks and td its dispatching time; Sc is the size of results and tc its collecting time; tp is the processing time. (Adapted from [93])

Exploration for a simplified system

A theoretical analysis is first performed to formulate the exploration work flow. To ease the analysis, a simplified system, shown in Figure 4.4(a), is studied to explore the design space. The data dependency between processing tiles is eliminated, which means that the processing tiles get all the necessary data from the task distribution node and send the results only to the converging node. Assuming that the task size (Sd) is larger than the result size (Sc), the relation among system processing throughput (Θsp), single-node processing throughput (θp), physical link bandwidth (BWlink) and number of processing nodes (Np) has been theoretically analyzed in [93]. For the case of Sd smaller than Sc, a similar analysis can be applied. Combined with simulation, for a given target system processing throughput (Θsp), optimized configurations of θp, BWlink and Np can be explored through the procedure shown in Figure 4.5.

Figure 4.5: Work flow for the system exploration (flowchart: given a target Θsp; for a certain Np, calculate the minimum value of θp; calculate the reference BWlink according to the theoretical model; sweep BWlink around the reference value and get the simulated Θsp under Np and θp; decision: does the simulated Θsp reach the target? If yes, one configuration of the system is found)

1. By performing a theoretical analysis like that in [93], the reference value of link bandwidth (BWlinkγ) is calculated from the given target system processing throughput.

2. For each network of a certain size (Np), based on BWlinkγ and the relation of BWlink versus θp and Np, the minimum required value of θp is obtained.

3. For each Np and θp, sweep BWlink around BWlinkγ and evaluate each value in the simulation platform.

4. If the simulated system processing throughput is larger than the given value, stop sweeping BWlink, change the size of the network and go to step 2 to find another group of BWlink, θp and Np.

Figure 4.6: Hierarchical structure of MVC: GOP level, view level, picture level (I, B, P) and macroblock level. (Adapted from [94])
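The four-step exploration procedure can be sketched as a small search loop. The analytical model of [93] and the simulation platform are not reproduced in this text, so they are passed in as functions; the toy stand-ins below (throughput limited by either the aggregate node throughput or a single dispatch link with 90% efficiency) are assumptions made only so that the sketch runs, and all names are hypothetical.

```python
def explore(target, task_size, node_counts, reference_bw, min_node_throughput,
            simulate, sweep_factors=(0.8, 1.0, 1.2, 1.5)):
    """Return one (Np, theta_p, BW_link) configuration per network size, if found."""
    bw_ref = reference_bw(target, task_size)                      # step 1
    found = []
    for n_p in node_counts:
        theta_p = min_node_throughput(target, n_p)                # step 2
        for factor in sweep_factors:                              # step 3
            bw = bw_ref * factor
            if simulate(n_p, theta_p, bw, task_size) >= target:   # step 4
                found.append((n_p, theta_p, bw))
                break                                             # next network size
    return found


# Toy stand-ins (assumptions, NOT the model of [93] or the real simulator).
def toy_reference_bw(target, task_size):
    return target * task_size          # bandwidth needed to dispatch all tasks


def toy_min_node_throughput(target, n_p):
    return target / n_p                # work shared evenly among the nodes


def toy_simulate(n_p, theta_p, bw, task_size, link_efficiency=0.9):
    return min(n_p * theta_p, link_efficiency * bw / task_size)


if __name__ == "__main__":
    configs = explore(240, 1.0, (8, 16), toy_reference_bw,
                      toy_min_node_throughput, toy_simulate)
    for n_p, theta_p, bw in configs:
        print(n_p, theta_p, bw)
```

In this toy setting the reference bandwidth alone is not sufficient because of the assumed link overhead, so the sweep settles on a value above the reference; this is the kind of deviation that the simulation step is meant to capture.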

Exploration for MVC implementation

Based on the aforementioned analysis, the NoC system for MVC decoding is explored accordingly. The target is to find optimized configurations of the system to implement full HD (1920x1080) MVC decoding at 30 fps (frames per second).

Figure 4.6 shows the hierarchical structure of MVC streams. The basic processing unit is the macroblock (MB), which consists of 16x16 pixels. Neighboring MBs or the MBs in neighboring frames may be used to generate the current MB. One picture is composed of a number of MBs. If only spatial prediction is employed, the picture is an I frame. If only forward pictures in the time domain can be used as the reference to predict the current picture, the picture is a P frame. If both forward and backward pictures in the time domain are selectable as reference pictures, the current picture is a B frame. Thus, at the picture level, decoding an I frame requires no data from other frames, while decoding P and B frames requires data from other frames. In MVC, pictures further constitute views, as introduced in the previous section. Decoding streams from one view may need to refer to data from other views. On top of views, GOPs are processed independently of each other and can be executed in parallel.
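To give a feeling for the workload behind the full HD target, the macroblock rate can be estimated from the numbers above. The sketch assumes that partial macroblock rows are rounded up to whole macroblocks and that eight views are decoded; the names are illustrative.

```python
import math

MB_SIZE = 16  # a macroblock covers 16x16 pixels


def macroblocks_per_picture(width, height):
    """Macroblocks needed to cover one picture (partial rows rounded up)."""
    return math.ceil(width / MB_SIZE) * math.ceil(height / MB_SIZE)


mbs_per_picture = macroblocks_per_picture(1920, 1080)   # 120 x 68 macroblocks
mb_rate_single_view = mbs_per_picture * 30               # macroblocks per second at 30 fps
mb_rate_eight_views = mb_rate_single_view * 8            # eight-view MVC stream

if __name__ == "__main__":
    print(mbs_per_picture, mb_rate_single_view, mb_rate_eight_views)
```

The eight-view stream therefore requires close to two million macroblocks per second, eight times the single-view rate.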

Task assignment at different levels of MVC generates diverse traffic patterns and results in different configurations of the system. If the task assignment is at the MB level, frequent data transfer leads to performance degradation. On the other hand, GOP-level task assignment needs large storage space in each node and causes a large processing delay as well. Therefore, picture-level and view-level task assignments are studied for implementing MVC in the following sections.

Figure 4.7: Picture-level parallelism for the typical MVC prediction structure; each picture is labeled with its temporal index, view index and set index. Number of pictures included in sets 1 to 8: 1, 2, 5, 10, 12, 14, 12, 8. (Adapted from [94])

Picture-level task assignment

Data dependences exist not only between pictures of the same view, but also between pictures in different views. However, by analyzing the dependences of pictures in one GOP, the pictures can be subdivided into several sets such that the pictures within each set are independent of each other [95]. Pictures with a larger set index may need pictures with a smaller set index as the reference, as shown in Figure 4.7. To process each set, the dispatching node sends all the necessary data to a processing node for single-picture decoding. Results are collected by the converging node. Only after all the pictures with a smaller set index have been processed can the tasks for pictures with a larger set index be issued. A shared memory is deployed between the dispatching and converging nodes. A possible data flow for picture-level task assignment is depicted in Figure 4.8.
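The grouping into sets can be checked with a short sketch. The table holds the set index of every picture in one GOP as read from the labels of Figure 4.7 (temporal indices 1 to 8, views 0 to 7); the function names are illustrative, and the issue order simply follows the rule that a set is issued only after all sets with a smaller index are finished.

```python
from collections import Counter

# Set index per view (views 0..7), as read from the labels of Figure 4.7.
ODD_ROW = [4, 6, 5, 7, 6, 8, 7, 8]     # temporal indices 1, 3, 5, 7
EVEN_ROW = [3, 5, 4, 6, 5, 7, 6, 7]    # temporal indices 2, 6
SET_INDEX = {
    1: ODD_ROW, 2: EVEN_ROW, 3: ODD_ROW,
    4: [2, 4, 3, 5, 4, 6, 5, 6],
    5: ODD_ROW, 6: EVEN_ROW, 7: ODD_ROW,
    8: [1, 3, 2, 4, 3, 5, 4, 5],
}


def pictures_per_set(table):
    """Number of pictures in each set, ordered by set index."""
    counts = Counter(s for row in table.values() for s in row)
    return [counts[s] for s in sorted(counts)]


def issue_order(table):
    """List of sets in issue order; each set is a list of (temporal, view) pictures."""
    sets = {}
    for temporal, row in table.items():
        for view, set_index in enumerate(row):
            sets.setdefault(set_index, []).append((temporal, view))
    return [sets[s] for s in sorted(sets)]


if __name__ == "__main__":
    print(pictures_per_set(SET_INDEX))
    print(issue_order(SET_INDEX)[0])   # the only picture that can be decoded first
```

The set sizes bound the picture-level parallelism: at most 14 pictures of one GOP can be decoded concurrently, and the first two sets keep only one or two nodes busy.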

Picture-level task assignment in this work issues only tasks of decoding pictures in the same set to processing nodes, and data transfer between processing nodes is avoided. In this case, the exploration algorithm proposed for the simplified system can be used directly for the analysis. After the exploration in [94], possible configurations of the system are summarized in Figure 4.9. Both 3x3 and 4x4 mesh NoCs can be selected depending on the physical link bandwidth and single-node performance. With 3x3 networks, the minimum performance requirement for a single node is the capability of processing a full HD single-view stream at 50 fps, corresponding to a link bandwidth of 78.6 Gbps. If the link bandwidth cannot reach 78.6 Gbps, the minimum requirement is 38.4 Gbps. In that case, the configuration can be either a small network (3x3) with high single-node performance (60 fps) or a large network (4x4) with relatively low single-node performance (50 fps). With 4x4 networks, the requirement for single-node performance can be lowered to 40 fps. This means that, if one processor was initially designed to implement single-view full HD decoding at 30 fps, only a 33% performance improvement is required to support eight-view MVC decoding.

Figure 4.8: A possible data flow of core (1,3) for the picture-level parallel processing: (a) data flow of task (task dispatching, picture decoding, result converging); (b) data flow on a 4x4 NoC (reference data input, decoded data output, task data input, idle message). (Adapted from [94])

Figure 4.9: Design options for picture-level task assignment (options 1 to 4 with 3x3 and 4x4 networks); Γ for network size, Φp for single-node performance in fps and BL for link bandwidth in Gbps. (Adapted from [94])

Figure 4.10: A possible data flow of core (1,3) for the view-level parallel processing: (a) data flow of task (task dispatching, decoding of views 0 to 3, result converging); (b) data flow on a 4x4 NoC (reference data input, decoded data output, task data input, idle message). (Adapted from [94])
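The picture-level design options discussed above can be written down as a small lookup. The sketch filters the options by what a given processor and link can deliver and reproduces the 33% figure. The option list follows the text, except that the link bandwidth paired with the 4x4, 40 fps option is inferred from the values in Figure 4.9 rather than stated in the text; the function names are hypothetical.

```python
# (network size, required single-node fps, required link bandwidth in Gbps)
OPTIONS = [
    ("3x3", 60, 38.4),
    ("3x3", 50, 78.6),
    ("4x4", 50, 38.4),
    ("4x4", 40, 78.6),  # bandwidth pairing inferred from Figure 4.9 (assumption)
]


def feasible_options(node_fps, link_gbps):
    """Options whose requirements are met by the given node and link capability."""
    return [opt for opt in OPTIONS if node_fps >= opt[1] and link_gbps >= opt[2]]


def required_improvement(baseline_fps, required_fps):
    """Relative speed-up a baseline processor needs to reach the required fps."""
    return (required_fps - baseline_fps) / baseline_fps


if __name__ == "__main__":
    print(feasible_options(node_fps=50, link_gbps=40.0))
    print(f"{required_improvement(30, 40):.0%}")
```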

View-level task assignment

The view-level task assignment scheme issues single-view decoding tasks to each processing node. Because of the dependences between views, data exchange among the processing nodes is necessary. Figure 4.10 shows an example of the data flow for decoding view 2 on node (1,3). The task dispatching node sends coded streams to node (1,3) for decoding; meanwhile, data from node (1,0) for view 0 are also sent to node (1,3) as the reference. Furthermore, decoded streams need to be sent to the converging node for result collection and to nodes (0,1) and (3,1) as the reference for view 1 and view 3.
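The extra traffic caused by inter-view references can be made concrete by counting hops for the flows named in this example. The sketch assumes minimal routing on the mesh, so that the hop count equals the Manhattan distance between two nodes; the routing algorithm itself is not specified in this section, and the names are illustrative.

```python
def hop_count(src, dst):
    """Minimal number of router-to-router hops between two mesh nodes."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


VIEW2_NODE = (1, 3)  # node decoding view 2 in the example

# Reference flows named in the text: (source node, destination node).
REFERENCE_FLOWS = {
    "view 0 -> view 2": ((1, 0), VIEW2_NODE),
    "view 2 -> view 1": (VIEW2_NODE, (0, 1)),
    "view 2 -> view 3": (VIEW2_NODE, (3, 1)),
}

HOPS = {name: hop_count(src, dst) for name, (src, dst) in REFERENCE_FLOWS.items()}

if __name__ == "__main__":
    for name, hops in HOPS.items():
        print(name, hops)
```

Each of these flows crosses three or four links, traffic that does not exist in the picture-level scheme, where processing nodes exchange data only with the dispatching and converging nodes.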

The view-level task assignment scheme complicates the traffic pattern in the network. However, the exploration algorithm for the simplified system can still be used as a preliminary analysis. After that, by tuning the single-node performance and physical link bandwidth, optimized configurations of the NoC system that meet the performance requirement can be obtained from the simulation platform.

Feasible configurations of the system using the view-level task assignment scheme are shown in Figure 4.11. For eight-view MVC decoding, if only one GOP can be processed at a time, at most eight processing nodes are needed, which is limited by the number of views. In this case, increasing the network size does not help to increase the system performance. This explains why options 1 and 3, as well as options 2 and 4, have the same requirements for single-node performance and link bandwidth even though the network size is different. To utilize more processing nodes, several GOPs should be processed in parallel. Option 7 shows the case of three GOPs being processed simultaneously with a 5x5 network, in which the requirement for single-node performance is only 20 fps for single-view stream processing and the link bandwidth is 9.6 Gbps. However, since introducing GOP-level parallelism increases the implementation cost and processing delay, the maximum number of GOPs processed in parallel should be constrained to a small value.

Figure 4.11: Design options for view-level task assignment (options 1 to 7 with 4x4 and 5x5 networks); Γ for network size, Φp for single-node performance in fps and BL for link bandwidth in Gbps; n-GOP means n GOPs can be processed at the same time. (Adapted from [94])

4.3 Communication network

Besides network topology, the selection of switching technique also has a large effect on both performance and implementation cost. Circuit switching (CS) and wormhole switching (WS) are two commonly used techniques for on-chip networks. Circuit switching reserves the whole end-to-end path before starting to send data, and the data transmission uses a dedicated link without sharing it with other transmissions. The advantages are the lightweight architecture and small transmission delay. The disadvantages are the large link setup delay and lower network utilization. In contrast, in wormhole switching a data transmission is divided into many flits and starts immediately once there is a free buffer in the downstream node. A header flit contains the routing information, and body flits only need to follow the header flit to reach the destination node. To increase utilization of the network, virtual channels (VCs) can be deployed to share the physical channels. Different data transmissions can use different virtual channels but share one physical channel. The merits are zero setup delay and higher network utilization. The drawbacks are the uncertain transmission delay and relatively high implementation cost.

Figure 4.12: Router structures of (a) circuit switching and (b) wormhole switching; both have North, South, West, East and Local input and output ports, a switch and control logic. (Adapted from [97])

Circuit switching is suitable for applications with large, infrequent data transmissions; otherwise, wormhole switching should be used [96]. For a specific application scenario, the choice of a suitable switching technique still needs to be made by evaluating performance and cost. In this case, the analysis is performed for MVC decoding in [97].

The simulation router models of CS and WS used for evaluating the performance and cost are shown in Figure 4.12. Five pairs of in/out ports connect four neighboring routers and the local processor. In the CS router, any output port can be linked to only one input port during one transmission. In the WS router, a number of VCs share one input port, and one output port can be associated with several VCs during data transfer.

To find the suitable switching technique for MVC decoding, the optimized configuration of WS in terms of link bandwidth, buffer depth and number of VCs is first explored. Then, the optimized WS structure is compared with CS in terms of decoding speed, network utilization and delay for decoding MVC streams with the same NoC topology. The results indicate that CS and WS achieve similar performance for MVC decoding. Considering its simpler architecture, which results in lower implementation cost, CS should be used when implementing MVC decoding with NoC architectures.


4.4 Summary

By connecting multiple ASIPs in an NoC manner, the system provides high performance together with scalability and flexibility. To implement MVC decoding with a single ASIP, performance must be improved by a factor determined by the number of views supported. With multi-processor NoC architectures, by contrast, the required performance improvement can be much smaller with both picture-level and view-level task assignment schemes. Meanwhile, to obtain the necessary performance for the whole system, design options including single-node performance, network size, link bandwidth and task assignment scheme are balanced. If the single-node performance is not enough, it can be compensated for by choosing a suitable task assignment scheme, network size and link bandwidth; the same applies when the network size or link bandwidth is inadequate. For the communication structure, circuit switching rather than wormhole switching is appropriate for MVC decoding, considering their similar performance and the lightweight architecture of CS.


Chapter 5

Customizable tile-ASIP implementation for Internet of Things

In this chapter, we focus on the implementation of a low-power customizable ASIP for the Internet of Things (IoT). In the vision of IoT, everything will be connected and able to talk to each other [98, 99]. Ubiquitous sensing, processing and communication should be performed seamlessly by the attached devices, which demands hardware architectures with characteristics such as small memory requirements, flexibility and adaptability to various devices with diverse performance demands, and low power consumption with sufficient computational performance for functions such as digital signal processing, communication protocol processing, encryption, etc.

5.1 Background

Modular design

In IoT applications, the performance demanded of processors varies greatly across devices, from sensor nodes to servers, and across the vast number of application scenarios. The straightforward solution is to design and implement a specific processor for each application scenario. However, the design, verification and mask costs are extremely high and become obstacles to this solution [100]. A modular design, instead, uses mature extensible modules to compose a specific system for the target performance and cost. A suitable number of modules and proper connections can be selected after analyzing the application. The implementation costs are greatly decreased. The challenge, as a result, is the implementation of such extensible and efficient modules. The hardware-level interconnection of the modules should be transparent to high-level programming, and the performance and power efficiency should be close to those of a single-processor implementation.


The concept of the “Lego Chip” is proposed in [100]. In this concept, a system including two or more chips behaves like a single SoC. Diverse but mature Lego Chips are seamlessly connected via highly efficient Lego Interconnects to achieve the target performance. The engineering cost can be dramatically decreased by using this technique. The XMOS architecture and its implementation are introduced in [101]. Multiple processors can be connected with point-to-point communication links, which allows designers to build the system on a multiprocessor chip, in packages, on boards or as a distributed system. A number of threads can be directly mapped to the hardware resources in each core and controlled by a hardware scheduler. The event-driven threads are able to reduce power when waiting for events. To construct a multi-core system, each core has its global address and can communicate with the others through control and data packets. Networks-in-package is implemented in [102] by connecting four NoC chips with off-chip links in one package to demonstrate the scalability and modularity of the system. In each NoC chip, heterogeneous function units are integrated together with the packet-switched communication infrastructure. If higher performance is required, a suitable number of NoC dies can be clustered to construct the networks-in-package. This alleviates the problems of very large scale circuits, such as low yield and high costs.

Processor architecture

When computers were first introduced, memory size was limited, and compilers and interpreters were very simple. Many instructions that could perform various complex functions were designed together with the necessary hardware units. Those instructions constitute the basic instruction set architecture (ISA), based on which applications could be implemented. To reduce the memory requirement and make programming easier, abundant instructions were included in the ISA for various functionalities. Because of the complex operations implemented in different instructions, multiple cycles are needed to execute one instruction and the execution time varies between instructions, which makes instruction execution difficult to pipeline.

Compared with the aforementioned design methodology, which is called complex instruction set computer (CISC), reduced instruction set computer (RISC) simplifies the functionality of instructions. Most instructions finish in a single cycle. The hardware design is simplified. Complex functions are implemented by combining simple instructions in a suitable way, which is done by a compiler. The memory requirement for storing executable code is thus larger and the code density is lower in RISC [103].

With the rapid development of semiconductor technology, the speed and number of transistors on a single chip have greatly increased. The memory space for storing code is no longer a problem. By pipelining the execution of instructions, RISCs can be very fast owing to their simple hardware architecture. Caches are deployed to bridge the gap between the extremely fast core logic and the relatively slow memories. Owing to their simple hardware architecture and fast execution speed, RISCs have become popular for embedded applications.

For the IoT applications in this case, lightweight architectures with low power consumption and high energy efficiency are highly demanded. Memories account for a large proportion of the power consumption [104]. Small memories should be used considering both area and power consumption. The utilization of hardware units should be exploited as much as possible. Meanwhile, owing to the variety of application scenarios in IoT, reconfigurability and scalability are also important for IoT processors.

In this work, a reconfigurable and scalable control-centric processor with an integrated multi-mode router is proposed and implemented as one tile for possible modular extension. The router supports circuit and wormhole switching simultaneously to fit the different traffic patterns of various application scenarios. The techniques that ease modular extension also apply to the multi-mode router. Aiming at high energy efficiency, the processor simplifies the control logic, and basic functional units are directly managed by the sequence mapping table (SMT) saved in the internal storage of the processor. Decoding hardware is thus avoided. By designing the corresponding SMT, different ISAs can be implemented on the processor. Code density is increased by optimizing the SMT for specific applications. Dedicated SMTs can even be mapped to the application directly when extremely high energy efficiency is demanded.

5.2 Multi-mode router

NoC routers that utilize both circuit and wormhole switching to improve communication capacity have been proposed in many studies [105, 106, 107, 108, 109, 110, 111, 112]. The main purpose of those studies is to combine the merits of the two switching techniques to increase the transmission speed and lower the latency, rather than to fit diverse traffic patterns by adaptively selecting the proper switching technique for particular application scenarios, which is the emphasis of this work for modular design.

The proposed router architecture is shown in Figure 5.1. Circuit switching (CS) and wormhole switching (WS) share the physical channels and control logic. The switching mode is determined by the header sent from the source node in the network. A header in CS sets up an end-to-end link between the source and destination nodes, while a header in WS allocates one available virtual channel (VC) for the following data packets. A maximum of four virtual channels can share one physical channel if it is not occupied by circuit-switched messages. A circuit-switched message preempts a physical channel that is being used for WS. As a result, each physical channel supports dynamic and free switching between CS and WS for different application scenarios.
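The sharing rule for one physical channel can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the router's actual logic: the class and method names are hypothetical, a second circuit request is assumed to wait while a circuit is held, and preempted WS messages are assumed to keep their virtual channels while they are stalled.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_VCS = 4  # the text states that at most four VCs share one physical channel


@dataclass
class PhysicalChannel:
    """Toy model of one physical channel shared by CS and WS traffic."""
    cs_owner: Optional[int] = None  # id of the message holding the circuit, if any
    vcs: List[Optional[int]] = field(default_factory=lambda: [None] * MAX_VCS)

    def on_header(self, mode: str, msg_id: int) -> bool:
        """Process a header; return True if the requested resource is granted."""
        if mode == "CS":
            if self.cs_owner is not None:
                return False        # assumption: a second circuit has to wait
            self.cs_owner = msg_id  # circuit reserved; WS traffic is preempted
            return True
        if mode == "WS":
            for index, owner in enumerate(self.vcs):
                if owner is None:
                    self.vcs[index] = msg_id  # allocate the first free VC
                    return True
            return False            # all four VCs are in use
        raise ValueError(f"unknown switching mode: {mode}")

    def ws_may_transmit(self) -> bool:
        """WS flits may use the link only while no circuit occupies it."""
        return self.cs_owner is None

    def release_circuit(self) -> None:
        """Tear down the circuit so that WS traffic can resume."""
        self.cs_owner = None
```

In this model a CS header never fails because of WS traffic, which mirrors the preemption described above, while a fifth WS header is refused until a virtual channel is freed.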

[Figure: router block diagram. Each input lane (north_in, local_in, ...) shows boundary buffers, a demux, four virtual channels (VC0 to VC3) and the RC, CR, SS and RR/VA blocks; the lanes feed crossbar switches, followed by output control and boundary buffers on the output side (north_out, local_out, ...).]

Legend: CR: registers for circuit switching; RC: routing computation; SS: switching mode selection; VA: virtual channel allocation; RR: round-robin.

Figure 5.1: Proposed multi-mode router (Adapted from [113])

Necessity

As described in the last chapter, CS is suitable for large, infrequent data transmissions, and WS is appropriate in other cases. However, “large” or “small” is relative to the hop count and the network status at the time of transmission. Data that are “large” for one transmission might be “small” for another.

Figure 5.2 gives the simulation results of the maximum throughput with only CS or only WS mode activated for a 10x10 mesh NoC when the maximum hop count of each transmission is restricted to 2, 4, and 6. When the hop count is increased, the performance degradation of CS is much larger than that of WS, which means that CS is more sensitive to the transmission distance. For the message size labeled “P1” in Figure 5.2, CS is suitable if the maximum hop count is 2. However, when the maximum hop count is increased to 4 or 6, WS is appropriate. This shows the importance of being able to adjust the switching mode dynamically to the actual traffic pattern.
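The dependence of the preferred mode on both message size and hop count can be sketched as a small lookup, shown here in Python. This is only an illustration of the idea: the function name, the table format and, in particular, the threshold values are hypothetical placeholders and are not read from Figure 5.2.

```python
from typing import Dict, Optional


def select_mode(message_bits: int, hops: int,
                crossover_bits: Dict[int, Optional[int]]) -> str:
    """Return "CS" or "WS" for one transmission.

    crossover_bits maps a maximum hop count to the message size at which CS
    starts to outperform WS (the role played by S1 and S2 in Figure 5.2).
    None means that CS is never preferred for that distance.
    """
    for max_hops in sorted(crossover_bits):
        if hops <= max_hops:
            threshold = crossover_bits[max_hops]
            if threshold is not None and message_bits >= threshold:
                return "CS"
            return "WS"
    return "WS"  # beyond the profiled distances, fall back to WS


# Placeholder thresholds for illustration only; NOT values from the thesis.
EXAMPLE_CROSSOVER = {2: 1000, 4: 2500, 6: None}
```

With these placeholder thresholds, a 1500-bit message is sent with CS when it travels at most 2 hops and with WS when it travels 4 or 6 hops, which is the qualitative behavior described for the message size “P1”.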

In some application scenarios, “large” and “small” data transmissions need to be handled at the same time. In that case, using only one switching technique restricts the performance of the network. The previous 10x10 mesh NoC is again used as an example, with a maximum hop count of 4. Assume that there are two types of data messages in one application at the same time, 10-flit messages (message size of 320 bits in Figure 5.2) as “small” data and 100-flit messages (message size of 3200 bits in Figure 5.2) as “large” data, and they

[Figure: plot of maximum throughput (bits/cycle) versus message size (bits) for the series CSM_H2, WSM_H2, CSM_H4, WSM_H4, CSM_H6 and WSM_H6; the message sizes S1, S2 and P1 are marked on the horizontal axis.]

Figure 5.2: The maximum throughput with only CS mode (CSM) or WS mode (WSM) activated; Hn means that the maximum hop count is n; S1 and S2 are the message sizes corresponding to the throughput cross points of the CSM and WSM for H2 and H4, respectively (Adapted from [113])

[Figure: four plots versus injection rate (bits/cycle), each with the series CSM, WSM and CSM/WSM: (a) Throughput for 10 flit messages; (b) Throughput for 100 flit messages; (c) Latency for 10 flit messages; (d) Latency for 100 flit messages. Throughput is given in bits/cycle and latency in cycles.]

Figure 5.3: The comparison of throughput and latency for 10-flit messages and 100-flit messages with only CSM, only WSM and mixed CSM/WSM (Adapted from [113])

[Figure: bar chart of router area (µm²) for link widths of 8, 16, 32 and 64 bits, grouped by buffer depth = 8, buffer depth = 16 and buffer depth = 32.]

Figure 5.4: Areas for routers with buffer depth of 8, 16, 32 and link width from 8 to 64 bits

are generated with the same probability. Three possible strategies can be used for this application: CS for all messages, WS for all messages, and mixed CS/WS selected adaptively according to message size. The simulation results for throughput and latency are shown in Figure 5.3. The mixed mode provides both higher throughput and lower latency.

The previous two situations show the necessity of supporting CS and WS simultaneously in one router when various traffic patterns may be present on the network.

Implementation

The proposed router is implemented with a parameterized VC buffer depth and physical-channel link width. The areas of routers with buffer depths of 8, 16, and 32 and link widths from 8 to 64 bits, synthesized in a 65 nm low-leakage CMOS process, are given in Figure 5.4. The area almost doubles when the buffer depth or the link width is doubled. The corresponding throughput and power efficiency are shown in Figure 5.5 for a 10x10 network, 40-flit messages, a maximum hop count of 4, and CS used for hop counts not larger than 2 and WS otherwise. Increasing the buffer depth improves performance but decreases power efficiency. For that reason, a buffer depth of 8 is selected in this work. In contrast, increasing the link width increases both performance and power efficiency.

As a proof of concept, one tile including processors and the proposed router, with a buffer depth of 8 and a link width of 8 bits, has been fabricated in a 65 nm low-leakage CMOS process. Figure 5.6 depicts the complete structure of the implemented

[Figure: chart of overall throughput (Gbps) and power efficiency (Gbps/W) versus flit width (8, 16, 32 and 64 bits) for the series 8 flit_OT, 16 flit_OT, 32 flit_OT, 8 flit_PE, 16 flit_PE and 32 flit_PE.]

Figure 5.5: Overall throughput (OT) and power efficiency (PE) for buffer depth of 8, 16, 32 flits with flit width from 8 to 64 bits (Adapted from [113])

[Figure: block diagram of the adaptable router in one tile. Labeled elements: link interfaces with StP and PtS converters; input lanes with REG, BUFFER, PARSING and ANALYZING blocks; CONTROL; output lanes; and a processor interface with message buffers, SENDER, RECEIVER and a bus adapter to/from the processor. CS paths and WS paths are marked.]

Figure 5.6: Complete structure of the on-chip multi-mode router in one tile (Adapted from [114])

router. The processor interface exchanges data messages between the processor and the multi-mode router. The link interface, which connects the external links to the internal router, also performs serial-to-parallel and parallel-to-serial conversion to reduce the number of pins required on one tile, thus decreasing the area and power consumption of the physical chip.
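The serial-to-parallel (StP) and parallel-to-serial (PtS) conversion in the link interface can be illustrated with a minimal Python sketch. It assumes an 8-bit word (matching the fabricated link width), a single data wire and most-significant-bit-first ordering; the bit order, the function names and the absence of framing or clocking are assumptions made here, not details given in the text.

```python
from typing import List


def parallel_to_serial(word: int, width: int = 8) -> List[int]:
    """Split a `width`-bit word into single bits, most significant bit first."""
    if not 0 <= word < (1 << width):
        raise ValueError("word does not fit in the link width")
    return [(word >> shift) & 1 for shift in range(width - 1, -1, -1)]


def serial_to_parallel(bits: List[int], width: int = 8) -> int:
    """Reassemble `width` serial bits (most significant first) into one word."""
    if len(bits) != width:
        raise ValueError("expected exactly one word of serial bits")
    word = 0
    for bit in bits:
        word = (word << 1) | (bit & 1)
    return word
```

Under these assumptions one 8-bit word crosses the tile boundary on one data pin in eight serial clock periods instead of occupying eight pins, which is the pin saving the paragraph refers to.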

5.3 Control-centric processor

The proposed architecture of the customizable ASIP is shown in Figure 5.7. A dual-core reconfigurable control-centric processor is integrated together with the

[Figure: architecture block diagram. CORE_L and CORE_R each contain an ALU (+, &, ...), a shifter and MAC (<<, >>, *), a memory access unit, a control unit and control word fetch. Blocks outside the cores: CROM, CRAM, AROM, ARAM, external memory interface, peripheral and debug, multi-mode router, clock generation, power management and configuration registers.]

Figure 5.7: Architecture overview

multi-mode router, internal memories, power management unit, clock generation unit, debug interfaces, and other peripherals.

Control-centric architecture

In each core (CORE_L or CORE_R), functional units and data paths are directly manipulated by a control word (CW) without internal interpretation. The result is simplified control logic and the capability of executing all functional units in parallel. The structure of one CW is shown in Figure 5.8. One CW is partitioned into several sections, each of which contains the control bits for the corresponding functional unit or data path. CWs defining specific functions constitute SMTs, which are saved either in CROM for the fixed configurations of the processor or in CRAM to provide reconfigurability.

SMTs can be customized to support general-purpose ISAs or application-specific instructions, and can even directly implement a particular application, as shown in Figure 5.9 for the application mapping schemes. A comparison of the energy efficiency and complexity of those schemes is also presented in that figure. Implementations with general-purpose ISAs have low design complexity, while dedicated SMTs for specific applications can achieve higher energy efficiency.
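The idea of a control word partitioned into per-unit sections can be made concrete with a minimal Python sketch. The section names follow the SMT columns shown in Figure 5.9 (ALU, DSP, MEM, IO, CTRL), but the field widths, the bit ordering and the function names are hypothetical; the actual CW format is the one defined in Figure 5.8.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (section name, width in bits), most significant first.
CW_LAYOUT: List[Tuple[str, int]] = [
    ("ALU", 8), ("DSP", 8), ("MEM", 6), ("IO", 4), ("CTRL", 6),
]


def pack_cw(fields: Dict[str, int]) -> int:
    """Concatenate the per-unit control bits into one control word."""
    word = 0
    for name, width in CW_LAYOUT:
        value = fields.get(name, 0)  # an omitted section is left at zero
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_cw(word: int) -> Dict[str, int]:
    """Split a control word back into the control bits of each section."""
    fields: Dict[str, int] = {}
    for name, width in reversed(CW_LAYOUT):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


# An SMT is then simply an ordered table of control words.
example_smt = [pack_cw({"MEM": 1}), pack_cw({"ALU": 3, "CTRL": 2})]
```

Because every section is present in every word, all functional units can be driven in the same step, which is the parallelism the text attributes to the CW format.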

When the processor is used to execute instructions, the fetching (IF), decoding (ID) and executing (EX) processes are all carried out by the SMT with the same

Figure 5.8: Structure of one control word

[Figure: mapping diagram. Applications, high-level languages, assembly codes, ISA and special instructions map onto the sequence mapping table (SMT), whose sections are ALU, DSP, MEM, IO and CTRL. The general-purpose, application-specific and dedicated mapping schemes are compared along complexity and energy efficiency axes.]

Figure 5.9: Mapping from applications to the processor (Adapted from [114])

[Figure: timelines of hardware function over time. For a data-centric processor, shown with and without a pipeline, separate IF, ID and EX hardware units handle each instruction; for the control-centric processor of this work, the same functional units (FU) perform IF, ID and EX & IF in successive steps.]

Figure 5.10: Instructions processing in the proposed processor compared with other data-centric processors

hardware units, whereas data-centric processors employ dedicated hardware units for instruction fetching (IF), decoding (ID) and executing (EX), respectively, as described in Figure 5.10. For data-centric processors, pipelines are normally utilized to increase the processing throughput. Extra hardware resources are required to support speculative execution of the instructions, since the next instruction starts to be processed before the processing of the current one is finished. In contrast, a control-centric processor changes the function of the same hardware units through different configurations to perform the corresponding operations. The hardware can thus be highly utilized to achieve high energy efficiency.

Low power considerations

The low power design considerations are listed below:

• Reconfigurable architecture: the reprogrammable SMT supports both general-purpose and application-specific instructions, so dense software programs can be generated. The results are a small memory requirement and a low memory access frequency, both of which lower the power consumption.

• On-chip memories: the on-chip ROM and RAM are used for storing both SMTs and application programs. For many IoT applications, the on-chip memories are large enough for the programs, and the power consumed on the pins connecting external memories can be saved.

• Power management: different working modes are designed to provide suitable performance in different scenarios. The activity and working frequency of each core and of the peripherals can be adjusted dynamically as the workload changes.

High performance

The two cores are equipped with the same kinds of computational units. Besides the general-purpose ALU, a multiply-accumulate (MAC) unit together with a shifter is integrated in each core to increase the performance of digital signal processing. Moreover, all the functional units can be activated simultaneously and work in parallel under the control of the SMT. By organizing the functional units in a suitable way, hardware resources are utilized efficiently and high performance can be achieved.

For each single core, to keep the critical path short and guarantee sufficient execution time for each control word, one execution cycle consists of two clock cycles. The address of the SMT memory is given in the first cycle, and the data is available in the next cycle, as shown in Figure 5.11(a). By connecting the two cores to the same memories and letting them access the memories alternately, the memories are utilized efficiently and the performance is doubled without the need for an arbiter between the two cores, as shown in Figure 5.11(b).
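The alternating memory access can be illustrated with a minimal Python sketch that lists the phase of each core in every clock cycle. It assumes that the second core is simply offset by one clock cycle from the first; the function names and the "Idle" start-up phase are hypothetical and only serve the illustration.

```python
from typing import List, Tuple

PHASES = ("Address", "Data")  # one execution cycle spans two clock cycles


def core_phase(clock: int, offset: int) -> str:
    """Phase of a core whose first execution cycle starts at clock `offset`."""
    if clock < offset:
        return "Idle"
    return PHASES[(clock - offset) % 2]


def dual_core_schedule(n_clocks: int) -> List[Tuple[int, str, str]]:
    """Return (clock, phase of core 1, phase of core 2) for each clock cycle."""
    return [(c, core_phase(c, 0), core_phase(c, 1)) for c in range(n_clocks)]


def address_port_conflict_free(schedule: List[Tuple[int, str, str]]) -> bool:
    """True if the two cores never present an address in the same clock cycle."""
    return all(not (p1 == "Address" and p2 == "Address")
               for _, p1, p2 in schedule)
```

With a fixed one-cycle offset, the address phase of one core always coincides with the data phase of the other, so the shared memory serves one control word per clock cycle and no arbitration decision is ever needed.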

[Figure: timing diagrams against the clock. (a) Single core timing: alternating Address and Data phases, with one executing cycle spanning an Address phase and a Data phase. (b) Dual core timing: Core1 and Core2 each alternate Address and Data phases in an interleaved manner.]

Figure 5.11: Control word timing

Figure 5.12: Hierarchical implementation structure

Flexibility

For applications to be implemented in the proposed architecture, the hierarchical flexible structure is shown in Figure 5.12. Application-layer flexibility is realized by providing a number of predefined common functions that support rapid mapping from applications to the processor. In addition, general-purpose programming is possible in layers 3 and 4 using high-level languages or assembly code. For some special applications with critical timing or power requirements, the reprogrammability of layer 2, the SMT coding layer, can be exploited. As for flexibility in layer 1, i.e. the hardware layer, the number of active cores is adjustable for different applications. If the performance of two cores is not sufficient, more cores can work together by being connected through the integrated multi-mode router to form a network.

[Figure: die photo with labeled blocks (CORE_L, CORE_R, router, peripherals, RAM for instructions and data, ROM for instructions, ROM for SMT, RAM for SMT), together with a chip summary and a gate count breakdown chart.]

Chip summary:
Technology: UMC 65 nm LL
Gate count: 1.8 million
Die size: 1.875 mm x 1.875 mm
Memories: 80 KB + 256 KB ROM; 40 KB + 160 KB SRAM
Frequency: 350 MHz at 1.2 V
Power consumption: 22 mW; 6.9 mW per core
Efficiency: 50.7 GOPS/W/core
Switching mode: Circuit/Wormhole
Pad number: 112

Gate count breakdown: Memory 92.59%; CORE_L 1.74%; CORE_R 0.77%; PERI 1.31%; Router 3.06%; Other 0.53% (7.41% in total outside the memories).

Figure 5.13: Die photo together with gate count breakdown

The memory structure is also flexible. For applications with small memory requirements, the internal memories are enough. If more memory is required, external SDRAM can be connected to the processor. The internal and external memories share a uniform memory space, which is transparent to higher-level programming. In addition, the pins of the SDRAM interface are shared with general-purpose I/O pins, and the functions of those pins are configurable depending on whether external memories are connected.

5.4 Tile implementation

The customizable tile-ASIP including dual cores and the multi-mode router has been fabricated in a 65 nm low-leakage CMOS process. The die photo and gate count breakdown are presented in Figure 5.13. Most of the gates are used by the memories. Adding the second core (CORE_R) contributes only 0.77% of the gates but doubles the performance of the processor. The maximum working frequency is 350 MHz, with a power consumption of 22 mW when all hardware resources are activated. If one core is turned off, the power consumption decreases to 15 mW. To keep the state of the processor when all clock sources are shut off, 0.74 mW is needed. The power consumption for different configurations is shown in Figure 5.14. Based on this figure, an appropriate working strategy can be selected to achieve the on-demand performance while keeping the power consumption as low as possible.

[Figure: two plots of power consumption (mW) versus clock frequency (MHz, 0 to 350). "Power consumption in 1.2 V" has the series Dual Core, CORE_L only, CORE_R only and Idle; "Power consumption for dual cores" has the series 1.2 V, 1.1 V and 1.0 V.]

Power consumption under different working modes (with clock frequency 350 MHz, core voltage 1.2 V):
Dual cores: 22 mW
Single core: 15 mW
Standby: 8.4 mW
Halt: 0.74 mW

Figure 5.14: Power consumption under different working modes and conditions
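The reported per-core efficiency can be cross-checked from the other reported numbers with a few lines of Python. The power values are those listed for Figures 5.13 and 5.14; treating the operation rate of one core as equal to the 350 MHz clock rate is an assumption made here to reproduce the number, not a statement from the text, and the variable and function names are hypothetical.

```python
# Reported power per working mode at 350 MHz and 1.2 V, in mW.
MODE_POWER_MW = {"dual": 22.0, "single": 15.0, "standby": 8.4, "halt": 0.74}
CLOCK_MHZ = 350.0        # maximum working frequency
PER_CORE_POWER_MW = 6.9  # reported power consumption per core


def gops_per_watt(ops_per_second: float, power_mw: float) -> float:
    """Energy efficiency in giga-operations per second per watt."""
    return (ops_per_second / 1e9) / (power_mw / 1e3)


# Assumption: one operation per clock cycle per core.
per_core_efficiency = gops_per_watt(CLOCK_MHZ * 1e6, PER_CORE_POWER_MW)

# Power added by switching on the second core.
second_core_cost_mw = MODE_POWER_MW["dual"] - MODE_POWER_MW["single"]
```

Under that assumption the result is about 50.7 GOPS/W, matching the reported efficiency per core, and the 7 mW difference between the dual-core and single-core modes is close to the reported 6.9 mW per core.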

With the proposed ASIPs, on-board networks of packaged chips and in-package networks of bare dies are feasible to further increase the performance of the whole system. Such configurations are demonstrated in Figure 5.15. A maximum of 256 tiles can be connected, with a single-link bandwidth of up to 1.4 Gbps.

5.5 Summary

This chapter discussed the design and implementation of a customizable ASIP for low-power applications that demand high energy efficiency. To achieve that, a multi-mode router and a control-centric processor are integrated to provide on-demand performance. By utilizing the multi-mode router, diverse inter-silicon networks can be constructed to meet the performance requirements of specific applications. With the control-centric processor, the control logic is simplified, and hardware resources are directly operated by the control words of the SMT inside the processor. By properly orchestrating the SMTs, both general-purpose ISAs and application-specific instructions can be mapped to the processor. Furthermore, a customized SMT can dedicate all resources of the processor to a particular application if required. Finally, the tile-ASIP is fabricated in a 65 nm low-leakage CMOS process, and different working strategies are proposed for various application scenarios. Meanwhile, both on-board and in-package networks are demonstrated to show the feasibility of performance extension.

[Figure: a network on board and a network in package built from tiles labeled RP. The inter-tile link signals are labeled serial clock, FDC (forward ctl.), FDD0 (forward data0) and BDC (backward ctl.), with examples of sending a header in circuit switching and sending data in wormhole switching.]

Figure 5.15: Scalable network on a board and network in a package for different application scenarios (Adapted from [114])


Chapter 6

Conclusions

6.1 Thesis summary

This dissertation explores the design and implementation of energy-efficient ASIPs for ubiquitous sensing and computing in both high-throughput compute-intensive processing of multimedia and low-throughput low-cost processing in IoT scenarios. For high-throughput processing of networked multi-standard multimedia, single-ASIPs integrating general-purpose processors and application-specific accelerators are proposed and implemented. Control-intensive tasks such as protocol processing, depacketizing, and stream parsing are better suited to the embedded processor, while the dedicated accelerators are efficient for the computation-intensive functions. The instruction-controlled accelerators and general-purpose processor also provide the flexibility to support both existing and future standards. The chip has been fabricated, and a test system has been built to handle networked multimedia streams.

Multiple ASIPs are connected with an on-chip network to gain even higher performance and flexibility for applications such as multi-view video decoding. The design problems, including the selection of the number of ASIPs, the required performance of a single ASIP, the link bandwidth of the network and the task assignment schemes, have been studied for high-definition eight-view MVC decoding. To use multi-ASIP architectures, parallel processing at the proper level has been explored. Both picture-level and view-level task assignment schemes are proposed, and the corresponding communication and computation subsystems are analyzed to obtain the design options for the whole system. The results show that a limitation in one factor can be compensated by enhancing the others, provided that the minimum requirement for each factor is met.

A tile-ASIP is designed and implemented for applications with emphasis on low power consumption, high energy efficiency, and high adaptability. The integrated multi-mode router supports both circuit and wormhole switching simultaneously, and the two schemes can be dynamically selected to adapt to the actual traffic pattern. The embedded processor core uses control words to operate the functional units directly and avoids complex control logic, thereby increasing the hardware utilization. The control words compose the sequence mapping table, which is stored in the on-chip memories. By creating the proper tables, general-purpose ISAs, application-specific instructions and mappings dedicated to particular applications can be implemented on the processor. Dual processor cores are deployed that share both the control and data storage, which doubles the performance and further improves energy efficiency. Thanks to the integration of the multi-mode router, a suitable number of tile-ASIPs can be connected through the inter-silicon network to provide on-demand performance for diverse application scenarios.

To conclude, different strategies for designing ASIPs are used for diverse application scenarios in ubiquitous sensing and computing. Employing application-specific accelerators effectively increases performance at low power consumption, but they incur high design and verification costs together with relatively low flexibility. They can be used in ASIPs to which only similar applications will be mapped; an example is multi-standard multimedia processing. When the applications demand even higher performance, connecting existing high-performance ASIPs to construct a multi-ASIP architecture is a feasible alternative to further increasing single-ASIP performance. In contrast, the tile-ASIP increases energy efficiency by improving hardware utilization. As a result of the reconfigurability of the processor and the scalability provided by the on-chip multi-mode router, the ability of the tile-ASIP to provide on-demand performance lowers the cost and yields high energy efficiency.

6.2 Future work

One revolutionary approach to achieving high performance at extremely low power consumption is brain-inspired computing [4, 115, 116]. The human brain runs at very low power consumption but performs many complex tasks that even current computers cannot handle. Specialization is one of the most important features for achieving high energy efficiency [117]. Different parts of the brain are trained and customized to perform diverse functions. For the tile-ASIP proposed in this dissertation, the reconfigurable and customizable processor cores together with the integrated router make it a competitive solution for constructing a brain-inspired computer. Future research can be performed in the following three aspects to facilitate the implementation of brain-inspired computing with the tile-ASIP.

Hardware improvement

In the current version of the tile-ASIP, fixed configurations of the ASIP are saved in the internal ROM. To enhance reconfigurability and customizability, reprogrammable non-volatile memories can be used to replace the ROM. Meanwhile, the SMTs and instructions will share the internal RAM to maximize flexibility instead of using independent storage as in the current version.

The integrated router is designed for mesh networks, and up to 256 nodes are supported. The reliability and adaptability can be increased by supporting irregular topologies, adding fault detection and correction algorithms, etc. Energy-efficient inter-router connections could also be developed. In addition, for both the router and the processor, event-driven working modes are necessary to further increase the energy efficiency by activating the resources only during the period in which incoming events are being recorded.

Programming model

The programming of a single tile-ASIP, including SMTs and application programs, is supported. To ease the integration of hundreds or even thousands of tile-ASIPs, programming models for configuring and scheduling multiple tile-ASIPs need to be studied and formulated. In addition, since SMTs can be reprogrammed in real time to change the processor behavior, in-system optimization for specific tasks is desirable, supported by a customized software tool chain.

Application mapping

Brain-inspired computing has drawn much attention due to the extremely high energy efficiency of the human brain compared with traditional computers. The brain works with high specialization, low processing duty cycles and low-voltage operation [4]. One million programmable spiking neurons and 256 million configurable synapses are implemented in a single chip in [115], and many chips can be further tiled to support even more complex neural networks. In [116], 18 ARM968 processing cores are integrated as a chip multiprocessor (CMP) for simulating large-scale spiking neural networks. The architecture can scale up to 65536 chips, enabling the modeling of up to 1 billion neurons.

Considering the customizable processor core, its tightly coupled internal memories, and the scalable communication structure of the proposed tile-ASIP, the mapping of brain-inspired computing can be studied as future work.


Bibliography

[1] C.D. Martin. ENIAC: press conference that shook the world. Technology and Society Magazine, IEEE, 14(4):3–10, Winter 1995.

[2] G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114–117, 1965.

[3] R.H. Dennard, V.L. Rideout, E. Bassous, and A.R. LeBlanc. Design of ion-implanted MOSFET’s with very small physical dimensions. Solid-State Circuits, IEEE Journal of, 9(5):256–268, Oct 1974.

[4] M.B. Taylor. A landscape of the new dark silicon design regime. Micro, IEEE, 33(5):8–19, Sept 2013.

[5] Z. Toprak-Deniz, M. Sperling, J. Bulzacchelli, G. Still, R. Kruse, Seong-won Kim, D. Boerstler, T. Gloekler, R. Robertazzi, K. Stawiasz, T. Diemoz, G. English, D. Hui, P. Muench, and J. Friedrich. Distributed system of digitally controlled microregulators enabling per-core DVFS for the POWER8™ microprocessor. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 98–99, Feb 2014.

[6] D. Jacquet, F. Hasbani, P. Flatresse, R. Wilson, F. Arnaud, G. Cesana, T. Di Gilio, C. Lecocq, T. Roy, A. Chhabra, C. Grover, O. Minez, J. Uginet, G. Durieu, C. Adobati, D. Casalotto, F. Nyer, P. Menut, A. Cathelin, I. Vongsavady, and P. Magarshack. A 3 GHz dual core processor ARM Cortex™-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization. IEEE Journal of Solid-State Circuits, 49(4):812–826, April 2014.

[7] Y. Akgul, D. Puschini, S. Lesecq, E. Beigne, I. Miro-Panades, P. Benoit, and L. Torres. Power management through DVFS and dynamic body biasing in FD-SOI circuits. In 51st ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–6, June 2014.

[8] K. Craig, Y. Shakhsheer, S. Arrabi, S. Khanna, J. Lach, and B.H. Calhoun. A 32 b 90 nm processor implementing panoptic DVS achieving energy efficient operation from sub-threshold to high performance. IEEE Journal of Solid-State Circuits, 49(2):545–552, Feb 2014.

[9] S.R. Sridhara, M. DiRenzo, S. Lingam, Seok-Jun Lee, R. Blazquez, J. Maxey, S. Ghanem, Yu-Hung Lee, R. Abdallah, P. Singh, and M. Goel. Microwatt embedded processor platform for medical system-on-chip applications. IEEE Journal of Solid-State Circuits, 46(4):721–730, April 2011.

[10] E. Nakasu. Super hi-vision on the horizon: A future TV system that conveys an enhanced sense of reality and presence. Consumer Electronics Magazine, IEEE, 1(2):36–42, 2012.

[11] DCI DCSS v12 with errata 2012-1010. http://www.dcimovies.com/, 2012.

[12] T. Ito. Future television - Super Hi-Vision and beyond. In IEEE Asian Solid State Circuits Conference (A-SSCC), pages 1–4, 2010.

[13] M. Ikeda, T. Onishi, T. Sano, A. Sagata, H. Iwasaki, Y. Nakajima, K. Nitta, Y. Takahashi, K. Yokohari, D. Kobayashi, K. Kamikura, and H. Jozawa. MVC real-time video encoder for full-HDTV 3D video. In IEEE International Conference on Consumer Electronics (ICCE), pages 166–167, 2012.

[14] G. Van Wallendael, S. Van Leuven, J. De Cock, F. Bruls, and R. Van De Walle. 3D video compression based on high efficiency video coding. IEEE Transactions on Consumer Electronics, 58(1):137–145, 2012.

[15] J. Dick, H. Almeida, L.D. Soares, and P. Nunes. 3D holoscopic video coding using MVC. In EUROCON - International Conference on Computer as a Tool (EUROCON), 2011 IEEE, pages 1–4, 2011.

[16] Luigi Atzori, Antonio Iera, and Giacomo Morabito. The internet of things: A survey. Computer Networks, 54(15):2787–2805, 2010.

[17] Guo-An Jian, Ting-Yu Huang, Jui-Chin Chu, and Jiun-In Guo. Optimization of VC-1/H.264/AVS video decoders on embedded processors. In Sixth International Conference on Information Technology: New Generations, pages 1313–1318, 2009.

[18] K. Nishihara, A. Hatabu, and T. Moriyoshi. Parallelization of H.264 video decoder for embedded multicore processor. In IEEE International Conference on Multimedia and Expo, pages 329–332, 2008.

[19] F. Pescador, M.J. Garrido, E. Juarez, and C. Sanz. On an implementation of HEVC video decoders with DSP technology. In IEEE International Conference on Consumer Electronics (ICCE), pages 121–122, 2013.


[20] Chao-Tsung Huang, M. Tikekar, C. Juvekar, V. Sze, and A. Chandrakasan.A 249Mpixel/s HEVC video-decoder chip for quad full HD applications. In2013 IEEE International Solid-State Circuits Conference Digest of TechnicalPapers (ISSCC), pages 162–163, 2013.

[21] Dajiang Zhou, Jinjia Zhou, Jiayi Zhu, Peilin Liu, and S. Goto. A2Gpixel/s H.264/AVC HP/MVC video decoder chip for Super Hi-Vision and3DTV/FTV applications. In IEEE International Solid-State Circuits Con-ference Digest of Technical Papers (ISSCC), pages 224–226, 2012.

[22] Tzu-Der Chuang, Pei-Kuei Tsung, Pin-Chih Lin, Lo-Mei Chang, Tsung-Chuan Ma, Yi-Hau Chen, and Liang-Gee Chen. A 59.5mW scalable/multi-view video decoder chip for Quad/3D Full HDTV and video streaming ap-plications. In IEEE International Solid-State Circuits Conference Digest ofTechnical Papers (ISSCC), pages 330–331, 2010.

[23] Chien-Chang Lin, Jia-Wei Chen, Hsiu-Cheng Chang, Yao-Chang Yang, Yi-Huan Ou Yang, Ming-Chih Tsai, Jiun-In Guo, and Jinn-Shyan Wang. A160k gates/4.5 KB SRAM H.264 video decoder for HDTV applications. IEEEJournal of Solid-State Circuits, 42(1):170–182, 2007.

[24] H. Martin, E. San Millan, P. Peris-Lopez, and J. Tapiador. Efficient ASIC im-plementation and analysis of two EPC-C1G2 RFID authentication protocols.IEEE Sensors Journal, PP(99):1–1, 2013.

[25] K. Keutzer, S. Malik, and A.R. Newton. From ASIC to ASIP: the next designdiscontinuity. In IEEE International Conference on Computer Design: VLSIin Computers and Processors, pages 84–90, 2002.

[26] M.K. Jain, M. Balakrishnan, and A. Kumar. ASIP design methodologies:survey and issues. In Fourteenth International Conference on VLSI Design,pages 76–81, 2001.

[27] Antoni Portero, G. Talavera, Marc Moreno, J. Carrabina, and F. Catthoor.Methodology for energy-flexibility space exploration and mapping of multi-media applications to single-processor platform styles. IEEE Transactions onCircuits and Systems for Video Technology, 21(8):1027–1039, 2011.

[28] M. Rashid, L. Apvrille, and R. Pacalet. Application specific processors formultimedia applications. In 11th IEEE International Conference on Compu-tational Science and Engineering, pages 109–116, 2008.

[29] SungDae Kim and MyungH. Sunwoo. ASIP approach for implementation ofH.264/AVC. Journal of Signal Processing Systems, 50(1):53–67, 2008.

[30] Zheng Shen, Hu He, Yanjun Zhang, and Yihe Sun. A video specific instruction set architecture for ASIP design. VLSI Design, 2007(2):4:1–4:7, April 2007.

[31] Erich Wenger and Johann Großschädl. An 8-bit AVR-based elliptic curve cryptographic RISC processor for the Internet of Things. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture Workshops, MICROW ’12, pages 39–46, Washington, DC, USA, 2012. IEEE Computer Society.

[32] ITU-T SERIES H: Audiovisual and multimedia systems infrastructure of audiovisual services - coding of moving video: Advanced video coding for generic audiovisual services. In Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-AF11, Nov. 2009.

[33] George Kornaros and Dionisios Pnevmatikatos. Dynamic power and thermal management of NoC-based heterogeneous MPSoCs. ACM Trans. Reconfigurable Technol. Syst., 7(1):1:1–1:26, February 2014.

[34] Brett H. Meyer, Adam S. Hartman, and Donald E. Thomas. Cost-effective lifetime and yield optimization for NoC-based MPSoCs. ACM Trans. Des. Autom. Electron. Syst., 19(2):12:1–12:33, March 2014.

[35] L. Sterpone, L. Carro, D. Matos, S. Wong, and F. Fakhar. A new reconfigurable clock-gating technique for low power SRAM-based FPGAs. In Design, Automation Test in Europe Conference Exhibition (DATE), 2011, pages 1–6, March 2011.

[36] Dingguo Wei, Chun Zhang, Yan Cui, Hong Chen, and Zhihua Wang. Design of a low-cost low-power baseband-processor for UHF RFID tag with asynchronous design technique. In Circuits and Systems (ISCAS), 2012 IEEE International Symposium on, pages 2789–2792, May 2012.

[37] L. Benini, P. Siegel, and G. De Micheli. Saving power by synthesizing gated clocks for sequential circuits. Design Test of Computers, IEEE, 11(4):32–41, Winter 1994.

[38] Kwangok Jeong, A.B. Kahng, Seokhyeong Kang, T.S. Rosing, and R. Strong. MAPG: Memory access power gating. In Design, Automation Test in Europe Conference Exhibition (DATE), 2012, pages 1054–1059, March 2012.

[39] Chiu-Kuo Chen, Yi-Yuan Wang, Zong-Han Hsieh, E. Chua, Wai-Chi Fang, and Tzyy-Ping Jung. A low power independent component analysis processor in 90nm CMOS technology for portable EEG signal processing systems. In Circuits and Systems (ISCAS), 2011 IEEE International Symposium on, pages 801–804, May 2011.

[40] H. Singh, K. Agarwal, D. Sylvester, and K.J. Nowka. Enhanced leakage reduction techniques using intermediate strength power gating. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 15(11):1215–1224, Nov 2007.

[41] T. Burd, T. Pering, A. Stratakos, and R. Brodersen. A dynamic voltage scaled microprocessor system. In Solid-State Circuits Conference, 2000. Digest of Technical Papers. ISSCC. 2000 IEEE International, pages 294–295, Feb 2000.

[42] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, and T. Mattson. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 108–109, Feb 2010.

[43] V. Krishnaswamy, J. Brooks, G. Konstadinidis, C. McAllister, H. Pham, S. Turullols, J.L. Shin, Yifan YangGong, and Haowei Zhang. Fine-grained adaptive power management of the SPARC M7 processor. In Solid-State Circuits Conference - (ISSCC), 2015 IEEE International, pages 1–3, Feb 2015.

[44] E.J. Fluhr, S. Baumgartner, D. Boerstler, J.F. Bulzacchelli, T. Diemoz, D. Dreps, G. English, J. Friedrich, A. Gattiker, T. Gloekler, C. Gonzalez, J.D. Hibbeler, K.A. Jenkins, Yong Kim, P. Muench, R. Nett, J. Paredes, J. Pille, D. Plass, P. Restle, R. Robertazzi, D. Shan, D. Siljenberg, M. Sperling, K. Stawiasz, G. Still, Z. Toprak-Deniz, J. Warnock, G. Wiedemeier, and V. Zyuban. The 12-core POWER8 processor with 7.6 Tb/s IO bandwidth, integrated voltage regulation, and resonant clocking. Solid-State Circuits, IEEE Journal of, 50(1):10–23, Jan 2015.

[45] N. Reynders and W. Dehaene. A 210mV 5MHz variation-resilient near-threshold JPEG encoder in 40nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, pages 456–457, Feb 2014.

[46] Advances in big.LITTLE technology for power and energy savings. http://www.thinkbiglittle.com/.

[47] Jungyul Pyo, Youngmin Shin, Hoi-Jin Lee, Sung il Bae, Min-Su Kim, Kwangil Kim, Ken Shin, Yohan Kwon, Heungchul Oh, Jaeyoung Lim, Dong wook Lee, Jongho Lee, Inpyo Hong, Kyungkuk Chae, Heon-Hee Lee, Sung-Wook Lee, Seongho Song, Chung-Hee Kim, Jin-Soo Park, Heesoo Kim, Sunghee Yun, Uk-Rae Cho, Jae Cheol Son, and Sungho Park. 20nm high-K metal-gate heterogeneous 64b quad-core CPUs and hexa-core GPU for high-performance and energy-efficient mobile application processor. In IEEE International Solid-State Circuits Conference - (ISSCC), pages 1–3, Feb 2015.

[48] A. Gomez, C. Pinto, A. Bartolini, D. Rossi, L. Benini, H. Fatemi, and J.P. de Gyvez. Reducing energy consumption in microcontroller-based platforms with low design margin co-processors. In Design, Automation Test in Europe Conference Exhibition (DATE), 2015, pages 269–272, March 2015.

[49] T.-T. Liu and J.M. Rabaey. A 0.25 V 460 nW asynchronous neural signal processor with inherent leakage suppression. IEEE Journal of Solid-State Circuits, 48(4):897–906, April 2013.

[50] N. Neves, N. Sebastiao, D. Matos, P. Tomas, P. Flores, and N. Roma. Multi-core SIMD ASIP for next-generation sequencing and alignment biochip platforms. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 23(7):1287–1300, July 2015.

[51] T. Sugiura, S. Nakatsuka, Jaehoon Yu, Y. Takeuchi, and M. Imai. An efficient data compression method for artificial vision systems and its low energy implementation using ASIP technology. In IEEE Biomedical Circuits and Systems Conference (BioCAS), pages 81–84, Oct 2014.

[52] Y. Xin, W.X.Y. Li, Z. Zhang, R.C.C. Cheung, D. Song, and T.W. Berger. An application specific instruction set processor (ASIP) for adaptive filters in neural prosthetics. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, PP(99):1–1, 2015.

[53] Lee Minkyu, Song Yong Ho, and Chung Ki-Seok. An ASIP approach for interpolation performance enhancement in HEVC decoder. In 4th IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC), pages 232–235, Sept 2014.

[54] Kwang Woo Lee, Sung Dae Kim, and M.H. Sunwoo. VSIP: Video Specific Instruction Set Processor for H.264/AVC. In IEEE Workshop on Signal Processing Systems Design and Implementation, pages 124–129, Oct 2006.

[55] A. Gupta and A. Pal. Accelerating SVM on Ultra Low Power ASIP for High Throughput Streaming Applications. In 28th International Conference on VLSI Design (VLSID), pages 517–522, Jan 2015.

[56] Suk Hyun Yoon, Myung Hoon Sunwoo, and Jong Ha Moon. Design of a high-quality audio-specific DSP core. In IEEE Workshop on Signal Processing Systems Design and Implementation, pages 509–513, Nov 2005.

[57] Purushotham Murugappa, Amer Baghdadi, and Michel Jezequel. ASIP design for multi-standard channel decoders. In Cyrille Chavet and Philippe Coussy, editors, Advanced Hardware Design for Error Correcting Codes, pages 151–175. Springer International Publishing, 2015.

[58] Zhenzhi Wu and D. Liu. Flexible multistandard FEC processor design with ASIP methodology. In IEEE 25th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 210–218, June 2014.

[59] Ubaid Ahmad, Min Li, Amir Amin, Meng Li, Liesbet Van der Perre, Rudy Lauwereins, and Sofie Pollin. An energy-efficient reconfigurable ASIP supporting multi-mode MIMO detection. Journal of Signal Processing Systems, pages 1–17, 2015.

[60] Frank Bouwens, Jos Huisken, Harmke De Groot, Martijn Bennebroek, Anteneh Abbo, Octavio Santana, Jef van Meerbergen, and Antoine Fraboulet. A dual-core system solution for wearable health monitors. In Proceedings of the 21st Edition of the Great Lakes Symposium on Great Lakes Symposium on VLSI, GLSVLSI ’11, pages 379–382, New York, NY, USA, 2011. ACM.

[61] Li-Ming Chen, Zeng-Hui Yu, Cheng-Ying Chen, Xiao-Yu Hu, Jun Fan, Jun Yang, and Yong Hei. A 1-V, 1.2-mA fully integrated SoC for digital hearing aids. Microelectronics Journal, 46(1):12–19, 2015.

[62] N. Neves, N. Sebastiao, A. Patricio, D. Matos, P. Tomas, P. Flores, and N. Roma. BioBlaze: Multi-core SIMD ASIP for DNA sequence alignment. In IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pages 241–244, June 2013.

[63] J. Constantin, A. Dogan, O. Andersson, P. Meinerzhagen, J.N. Rodrigues, D. Atienza, and A. Burg. TamaRISC-CS: An ultra-low-power application-specific processor for compressed sensing. In IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC), pages 159–164, Oct 2012.

[64] Yi Wang and Yajun Ha. A Performance and Area Efficient ASIP for Higher-Order DPA-Resistant AES. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 4(2):190–202, June 2014.

[65] O. Ahmed and S. Areibi. An efficient application-specific instruction-set processor for packet classification. In International Conference on Reconfigurable Computing and FPGAs (ReConFig), pages 1–6, Dec 2013.

[66] Seung-Hyun Choi, Neungsoo Park, YongHo Song, and Seong-Won Lee. ASiPEC: An application specific instruction-set processor for high performance entropy coding. In James J. (Jong Hyuk) Park, Yi Pan, Han-Chieh Chao, and Gangman Yi, editors, Ubiquitous Computing Application and Wireless Sensor, volume 331 of Lecture Notes in Electrical Engineering, pages 67–75. Springer Netherlands, 2015.

[67] Srinivas Lingam, Timothy Schmidl, and Anuj Batra. Software defined radio for smart utility networks. In Proceedings of the 2014 ACM Workshop on Software Radio Implementation Forum, SRIF ’14, pages 37–44, New York, NY, USA, 2014. ACM.

[68] Wenxiang Wang, Ling Li, Guangfei Zhang, Dong Liu, and Ji Qiu. An application specific instruction set processor optimized for FFT. In Circuits and Systems (MWSCAS), 2011 IEEE 54th International Midwest Symposium on, pages 1–4, Aug 2011.

[69] M. Koziri, D. Zacharis, I. Katsavounidis, and N. Bellas. Implementation of the AVS video decoder on a heterogeneous dual-core SIMD processor. Consumer Electronics, IEEE Transactions on, 57(2):673–681, May 2011.

[70] G. Payá-Vayá, J. Martin-Langerwerf, C. Banz, F. Giesemann, P. Pirsch, and H. Blume. VLIW architecture optimization for an efficient computation of stereoscopic video applications. In Green Circuits and Systems (ICGCS), 2010 International Conference on, pages 457–462, June 2010.

[71] Vincent Brost, Fan Yang, and Charles Meunier. Flexible VLIW processor based on FPGA for efficient embedded real-time image processing. Journal of Real-Time Image Processing, 9(1):47–59, 2014.

[72] V. Lapotre, P. Murugappa, G. Gogniat, A. Baghdadi, M. Hubner, and J.-P. Diguet. A dynamically reconfigurable multi-ASIP architecture for multistandard and multimode turbo decoding. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, PP(99):1–1, 2015.

[73] Q. Yang, X. Zhou, G.E. Sobelman, and X. Li. Network-on-chip for turbo decoders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, PP(99):1–1, 2015.

[74] Hong Chinh Doan, H. Javaid, and S. Parameswaran. Flexible and scalable implementation of H.264/AVC encoder for multiple resolutions using ASIPs. In Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014, pages 1–6, March 2014.

[75] Hyun-Ho Jo, Yong-Jo Ahn, Dae-Beom Kang, Bongil Ji, Dong-Gyu Sim, and Jae-Jin Lee. Flexible multi-core platform for a multiple-format video decoder. Journal of Signal Processing Systems, 80(2):163–179, 2015.

[76] Dipan Kumar Mandal, Mihir Mody, Mahesh Mehendale, Naresh Yadav, Ghone Chaitanya, Piyali Goswami, Hetul Sanghvi, and Niraj Nandan. Accelerating H.264/HEVC video slice processing using application specific instruction set processor. In Consumer Electronics (ICCE), 2015 IEEE International Conference on, pages 408–411, Jan 2015.

[77] I. Tsekoura, G. Selimis, J. Hulzink, F. Catthoor, J. Huisken, H. de Groot, and C. Goutis. Exploration of cryptographic ASIP designs for wireless sensor nodes. In 17th IEEE International Conference on Electronics, Circuits, and Systems (ICECS), pages 827–830, Dec 2010.

[78] H. Marzouqi, M. Al-Qutayri, K. Salah, D. Schinianakis, and T. Stouraitis. A high-speed FPGA implementation of an RSD-based ECC processor. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, PP(99):1–1, 2015.

[79] Z. Wu and D. Liu. High-throughput trellis processor for multistandard FEC decoding. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, PP(99):1–1, 2015.

[80] Hee Kwan Eun, Sung Jo Hwang, and Myung Hoon Sunwoo. Novel IME instructions and their hardware architecture for ME ASIP. In International SoC Design Conference (ISOCC), pages 139–142, Nov 2010.

[81] J. Ohm, G.J. Sullivan, H. Schwarz, Thiow Keng Tan, and T. Wiegand. Comparison of the coding efficiency of video coding standards-including high efficiency video coding (HEVC). Circuits and Systems for Video Technology, IEEE Transactions on, 22(12):1669–1684, 2012.

[82] Chih-Da Chien, Chien-Chang Lin, Yi-Hung Shih, He-Chun Chen, Chia-Jui Huang, Cheng-Yen Yu, Chih-Liang Chen, Ching-Hwa Cheng, and Jiun-In Guo. A 252kgate/71mW multi-standard multi-channel video decoder for high definition video applications. In IEEE International Solid-State Circuits Conference. ISSCC 2007. Digest of Technical Papers., pages 282–603, 2007.

[83] Tsu-Ming Liu, Ting-An Lin, Sheng-Zen Wang, Wen-Ping Lee, Jiun-Yan Yang, Kang-Cheng Hou, and Chen-Yi Lee. A 125 µW, fully scalable MPEG-2 and H.264/AVC video decoder for mobile applications. Solid-State Circuits, IEEE Journal of, 42(1):161–169, 2007.

[84] Ning Ma, Zhibo Pang, Jun Chen, H. Tenhunen, and Li-Rong Zheng. A 5Mgate/414mW networked media SoC in 0.13um CMOS with 720p multi-standard video decoding. In Solid-State Circuits Conference, 2009. A-SSCC 2009. IEEE Asian, pages 385–388, Nov 2009.

[85] A. Smolic, K. Mueller, N. Stefanoski, J. Ostermann, A. Gotchev, G.B. Akar, G. Triantafyllidis, and A. Koz. Coding algorithms for 3DTV: A survey. Circuits and Systems for Video Technology, IEEE Transactions on, 17(11):1606–1621, 2007.

[86] Chi-Cheng Ju, Tsu-Ming Liu, Yung-Chang Chang, Chih-Ming Wang, Chun-Chia Chen, Hue-Min Lin, Chia-Yun Cheng, Min-Hao Chiu, Sheng-Jen Wang, Ping Chao, MJ Hu, Hao-Wei Li, and Chung-Hung Tsai. A 775-uW/fps/view H.264/MVC decoder chip compliant with 3D Blu-ray specifications. In IEEE International Symposium on Circuits and Systems (ISCAS), pages 1440–1443, 2012.

[87] Pei-Kuei Tsung, Ping-Chih Lin, Kuan-Yu Chen, Tzu-Der Chuang, Hsin-Jung Yang, Shao-Yi Chien, Li-Fu Ding, Wei-Yin Chen, Chih-Chi Cheng, Tung-Chien Chen, and Liang-Gee Chen. A 216fps 4096*2160p 3DTV set-top box SoC for free-viewpoint 3DTV applications. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 124–126, 2011.

[88] A. Vetro, T. Wiegand, and G.J. Sullivan. Overview of the stereo and multiview video coding extensions of the H.264/MPEG-4 AVC standard. Proceedings of the IEEE, 99(4):626–642, 2011.

[89] Axel Jantsch and Hannu Tenhunen. Networks on Chip. Kluwer Academic Publishers, 2003.

[90] Radu Marculescu, Umit Y. Ogras, Li-Shiuan Peh, Natalie Enright Jerger, and Yatin Hoskote. Outstanding research problems in NoC design: System, microarchitecture, and circuit perspectives. In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(1), pages 3–21, Jan. 2009.

[91] G. Tuveri, S. Secchi, P. Meloni, L. Raffo, and E. Cannella. A runtime adaptive H.264 video-decoding MPSoC platform. In Design and Architectures for Signal and Image Processing (DASIP), 2013 Conference on, pages 149–156, Oct 2013.

[92] Comparative reliability analysis between AMBA and network-on-chip: An MPEG-2 case study. In SOC Conference, 2009. SOCC 2009. IEEE International, pages 247–250, Sept 2009.

[93] Ning Ma, Zhonghai Lu, Zhibo Pang, and Lirong Zheng. System-level exploration of mesh-based NoC architectures for multimedia applications. In IEEE International SOC Conference (SOCC), pages 99–104, Sept 2010.

[94] Ning Ma, Zhonghai Lu, and Lirong Zheng. System design of full HD MVC decoding on mesh-based multicore NoCs. Microprocess. Microsyst., 35(2):217–229, March 2011.

[95] Yi Pang, Lifeng Sun, Jiangtao (Gene) Wen, Fengyan Zhang, Weidong Hu, Wei Feng, and Shiqiang Yang. A framework for heuristic scheduling for parallel processing on multicore architecture: A case study with multiview video coding. In IEEE Transactions on Circuits and Systems for Video Technology, Volume 19, Issue 11, pages 1658–1666, Nov. 2009.

[96] J. Duato, S. Yalamanchili, and L. Ni. Interconnection Network - An Engineering Approach. IEEE Computer Society Press, 1997.

[97] Ning Ma, Zhuo Zou, Zhonghai Lu, and Lirong Zheng. Implementing MVC decoding on homogeneous NoCs: Circuit switching or wormhole switching. In 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 387–391, March 2015.

[98] Hakima Chaouchi. The internet of things: Connecting objects. 2010.

[99] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. Internet of Things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems, 29(7):1645–1660, 2013.

[100] S. Sutardja. The future of IC design innovation. In IEEE International Solid-State Circuits Conference - (ISSCC), pages 1–6, Feb 2015.

[101] D. May. The XMOS architecture and XS1 chips. Micro, IEEE, 32(6):28–37, Nov 2012.

[102] Kangmin Lee, Se-Joong Lee, Donghyun Kim, Kwanho Kim, Gawon Kim, Joungho Kim, and Hoi-Jun Yoo. Networks-on-chip and networks-in-package for high-performance SoC platforms. In Asian Solid-State Circuits Conference, pages 485–488, Nov 2005.

[103] T. Jamil. RISC versus CISC. IEEE Potentials, 14(3):13–16, 1995.

[104] N. Ickes, Y. Sinangil, F. Pappalardo, E. Guidetti, and A.P. Chandrakasan. A 10 pJ/cycle ultra-low-voltage 32-bit microprocessor system-on-chip. In 2011 Proceedings of ESSCIRC (ESSCIRC), pages 159–162, 2011.

[105] G. Jiang, Z. Li, F. Wang, and S. Wei. Mapping of embedded applications on hybrid networks-on-chip with multiple switching mechanisms. Embedded Systems Letters, IEEE, PP(99):1–1, 2015.

[106] Abbas Mazloumi and Mehdi Modarressi. A hybrid packet/circuit-switched router to accelerate memory access in NoC-based chip multiprocessors. In Design, Automation Test in Europe Conference Exhibition (DATE), pages 908–911, March 2015.

[107] G. Chen, M.A. Anders, H. Kaul, S.K. Satpathy, S.K. Mathew, S.K. Hsu, A. Agarwal, R.K. Krishnamurthy, V. De, and S. Borkar. A 340 mV-to-0.9 V 20.2 Tb/s source-synchronous hybrid packet/circuit-switched 16x16 network-on-chip in 22 nm tri-gate CMOS. Solid-State Circuits, IEEE Journal of, 50(1):59–67, Jan 2015.

[108] Mark A. Anders, Himanshu Kaul, Ram K. Krishnamurthy, and Shekhar Y. Borkar. Hybrid circuit/packet switched network for energy efficient on-chip interconnections. Low Power Networks-on-chip, pages 3–20, 2011.

[109] M. Modarressi, H. Sarbazi-Azad, and M. Arjomand. A hybrid packet-circuit switched on-chip network based on SDM. In Design, Automation Test in Europe Conference Exhibition (DATE ’09), pages 566–569, April 2009.

[110] Angelo Kuti Lusala and Jean-Didier Legat. Combining SDM-based circuit switching with packet switching in a NoC for real-time applications. In IEEE International Symposium on Circuits and Systems (ISCAS), pages 2505–2508, May 2011.

[111] M. Modarressi, A. Tavakkol, and H. Sarbazi-Azad. Virtual point-to-point connections for NoCs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(6):855–868, June 2010.

[112] N. Enright Jerger, Li-Shiuan Peh, and M. Lipasti. Circuit-switched coherence. In Second ACM/IEEE International Symposium on Networks-on-Chip (NoCS), pages 193–202, April 2008.

[113] Ning Ma, Zhuo Zou, Zhonghai Lu, and Lirong Zheng. Design and implementation of multi-mode routers for large-scale inter-core networks. Minor Revision (requires no re-evaluation), 2015.

[114] Ning Ma, Zhuo Zou, Yuxiang Huan, Stefan Blixt, Zhonghai Lu, and Lirong Zheng. A 101.4 GOPS/W reconfigurable and scalable control-centric embedded processor for domain-specific applications. Manuscript submitted for publication, 2015.

[115] Paul A. Merolla, John V. Arthur, Rodrigo Alvarez-Icaza, Andrew S. Cassidy, Jun Sawada, Filipp Akopyan, Bryan L. Jackson, Nabil Imam, Chen Guo, Yutaka Nakamura, Bernard Brezzo, Ivan Vo, Steven K. Esser, Rathinakumar Appuswamy, Brian Taba, Arnon Amir, Myron D. Flickner, William P. Risk, Rajit Manohar, and Dharmendra S. Modha. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.

[116] E. Painkras, L.A. Plana, J. Garside, S. Temple, F. Galluppi, C. Patterson, D.R. Lester, A.D. Brown, and S.B. Furber. SpiNNaker: A 1-W 18-core system-on-chip for massively-parallel neural network simulation. Solid-State Circuits, IEEE Journal of, 48(8):1943–1953, Aug 2013.

[117] Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao. Customizable Computing. Morgan & Claypool Publishers, 2015.
