

Design of Reconfigurable Hardware

Architectures for Real-time Applications

Modeling and Implementation

Thomas Lenart

Lund 2008


Department of Electrical and Information Technology
Lund University
Box 118, S-221 00 Lund, Sweden

This thesis is set in Computer Modern 10pt with the LaTeX documentation system, using Pontus Åström's thesis template.

Series of licentiate and doctoral theses, No. 5
ISSN 1654-790X

© Thomas Lenart 2008
Printed in Sweden by Tryckeriet E-huset, Lund.
May 2008


Abstract

This thesis discusses modeling and implementation of reconfigurable hardware architectures for real-time applications. The target application in this work is digital holographic imaging, where visible images are to be reconstructed based on holographic recordings. The reconstruction process is computationally demanding and requires hardware acceleration to achieve real-time performance. Thus, this work presents two design approaches, with different levels of reconfigurability, to accelerate the image reconstruction process and related computationally demanding applications.

The first approach is based on application-specific hardware accelerators, which are usually required in systems with high constraints on processing performance, physical size, or power consumption, and are tailored for a certain application to achieve high performance. Hence, an acceleration platform is proposed and designed to enable real-time image reconstruction in digital holographic imaging, constituting a set of hardware accelerators that are connected in a flexible and reconfigurable pipeline. The hardware accelerators are optimized for high computational performance and low memory requirements. The application-specific design has been integrated into an embedded system consisting of a microprocessor, a high-performance memory controller, a digital image sensor, and a video output device. The system has been prototyped on an FPGA platform and synthesized for a 0.13 µm standard cell library, achieving a reconstruction rate of 30 frames per second running at 400 MHz.

The second approach is based on a dynamically reconfigurable architecture to accelerate arbitrary applications, which presents a trade-off between versatility and hardware cost. The proposed reconfigurable architecture is constructed from processing and memory cells, which communicate using a combination of local interconnects and a global network. High-performance local interconnects provide a high communication bandwidth between neighboring cells, while the global network provides flexibility and access to external memory. The processing and memory cells are run-time reconfigurable to enable flexible application mapping. The proposed reconfigurable architectures are modeled and evaluated using Scenic, a system-level exploration environment developed in this work. A design with 16 cells is implemented and synthesized for a 0.13 µm standard cell library, resulting in low area overhead compared with application-specific solutions. It is shown that the proposed reconfigurable architecture achieves high computational performance compared to traditional DSP processors.


Science may set limits to knowledge,

but should not set limits to imagination

Bertrand Russell (1872–1970)


Contents

Abstract

Preface

Acknowledgment

List of Acronyms

List of Definitions and Mathematical Operators

1 Introduction
  1.1 Challenges in Digital Hardware Design
    1.1.1 Towards Parallel Architectures
    1.1.2 Application-Specific vs. Reconfigurable Architectures
  1.2 Contributions and Thesis Outline

2 Digital System Design
  2.1 Programmable Architectures
    2.1.1 General-Purpose Processors
    2.1.2 Configurable Instruction-Set Processors
    2.1.3 Special-Purpose Processors
  2.2 Reconfigurable Architectures
    2.2.1 Fine-Grained Architectures
    2.2.2 Coarse-Grained Architectures
  2.3 Application-Specific Architectures
  2.4 Memory and Storage
    2.4.1 Data Caching
    2.4.2 Data Streaming
  2.5 Hardware Design Flow
    2.5.1 Architectural Exploration
    2.5.2 Virtual Platforms
    2.5.3 Instruction Set Simulators

3 Digital Holographic Imaging
  3.1 Microscopy using Digital Holography
    3.1.1 Phase Information
    3.1.2 Software Autofocus
    3.1.3 Three-Dimensional Visualization
  3.2 Processing of Holographic Images
    3.2.1 Numerical Reconstruction
    3.2.2 Image Processing
  3.3 Acceleration of Image Reconstruction
    3.3.1 Digital Image Sensors
    3.3.2 Algorithmic Complexity
    3.3.3 Hardware Mapping

I A Hardware Acceleration Platform for Digital Holographic Imaging
  1 Introduction
    1.1 Reduced Complexity Image Reconstruction
    1.2 Algorithm Selection
  2 System Requirements
  3 Proposed Architecture
    3.1 Algorithm Mapping
    3.2 Image Combine Unit
    3.3 One-Dimensional Flexible FFT Core
    3.4 Two-Dimensional FFT
  4 System Design
    4.1 External Interface
    4.2 Software Design
  5 Simulation and Optimization
    5.1 Flexible Addressing Modes
    5.2 Memory Optimization
    5.3 Parameter Selection
  6 Results and Comparisons
  7 Conclusion

II A High-performance FFT Core for Digital Holographic Imaging
  1 Introduction
    1.1 Algorithmic Aspects
    1.2 Architectural Aspects
  2 Proposed Architectures
    2.1 Hybrid Floating-Point
    2.2 Block Floating-Point
    2.3 Co-Optimization
  3 Architectural Extensions
  4 Simulations
  5 VLSI Implementation
  6 Conclusion

III A Design Environment and Models for Reconfigurable Computing
  1 Introduction
  2 Related Work
  3 Scenic Exploration Environment
    3.1 Scripting Environment
    3.2 Simulation Construction
    3.3 Simulation Interaction
    3.4 Code Generation
  4 Scenic Model and Architectural Generators
    4.1 Processor Generator
    4.2 Memory Generator
    4.3 Architectural Generators
  5 Conclusion

IV A Run-time Reconfigurable Computing Platform
  1 Introduction
  2 Related Work
  3 Proposed Architecture
    3.1 Dynamically Reconfigurable Architecture
    3.2 Processing Cells
    3.3 Memory Cells
    3.4 Communication and Interconnects
  4 System-Level Integration
    4.1 Stream Memory Controller
    4.2 Run-Time Reconfiguration
  5 Exploration and Analysis
    5.1 Network Performance
    5.2 Memory Simulation
    5.3 Application Mapping
  6 Hardware Prototyping
  7 Comparison and Discussion
  8 Conclusion

Conclusion and Outlook

Bibliography

Appendix

A The Scenic Shell
  1 Launching Scenic
    1.1 Command Syntax
    1.2 Number Representations
    1.3 Environment Variables
    1.4 System Variables
  2 Simulation
    2.1 Starting a Simulation
    2.2 Running a Simulation
  3 Simulation Modules
    3.1 Module Library
    3.2 Module Hierarchy
    3.3 Module Commands
    3.4 Module Parameters
    3.5 Module Variables
    3.6 Module Debug
  4 System Specification
    4.1 Module Instantiation
    4.2 Module Binding
    4.3 Module Configuration

B DSP/MAC Processor Architecture


Preface

This thesis summarizes my academic work in the digital ASIC group at the Department of Electrical and Information Technology. The main contributions to the thesis are derived from the following publications:

Thomas Lenart, Henrik Svensson, and Viktor Owall, “Modeling and Exploration of a Reconfigurable Architecture for Digital Holographic Imaging,” in Proceedings of IEEE International Symposium on Circuits and Systems, Seattle, USA, May 2008.

Henrik Svensson, Thomas Lenart, and Viktor Owall, “Modelling and Exploration of a Reconfigurable Array using SystemC TLM,” in Proceedings of Reconfigurable Architectures Workshop, Miami, Florida, USA, April 2008.

Thomas Lenart, Henrik Svensson, and Viktor Owall, “A Hybrid Interconnect Network-on-Chip and a Transaction Level Modeling Approach for Reconfigurable Computing,” in Proceedings of IEEE International Symposium on Electronic Design, Test and Applications, Hong Kong, China, January 2008, pp. 398–404.

Thomas Lenart, Mats Gustafsson, and Viktor Owall, “A Hardware Acceleration Platform for Digital Holographic Imaging,” Springer Journal of Signal Processing Systems, DOI: 10.1007/s11265-008-0161-2, January 2008.

Thomas Lenart and Viktor Owall, “Architectures for Dynamic Data Scaling in 2/4/8K Pipeline FFT Cores,” IEEE Transactions on Very Large Scale Integration Systems, vol. 14, no. 11, November 2006, pp. 1286–1290.

Thomas Lenart and Viktor Owall, “XStream – A Hardware Accelerator for Digital Holographic Imaging,” in Proceedings of IEEE International Conference on Electronics, Circuits, and Systems, Gammarth, Tunisia, December 2005.

Mats Gustafsson, Mikael Sebesta, Bengt Bengtsson, Sven-Goran Pettersson, Peter Egelberg, and Thomas Lenart, “High Resolution Digital Transmission Microscopy – a Fourier Holography Approach,” in Optics and Lasers in Engineering, vol. 41, issue 3, March 2004, pp. 553–563.


Thomas Lenart, Viktor Owall, Mats Gustafsson, Mikael Sebesta, and Peter Egelberg, “Accelerating Signal Processing Algorithms in Digital Holography using an FPGA Platform,” in Proceedings of IEEE International Conference on Field-Programmable Technology, Tokyo, Japan, December 2003, pp. 387–390.

Thomas Lenart and Viktor Owall, “A 2048 Complex Point FFT Processor using a Novel Data Scaling Approach,” in Proceedings of IEEE International Symposium on Circuits and Systems, vol. 4, Bangkok, Thailand, May 2003, pp. 45–48.

Thomas Lenart and Viktor Owall, “A Pipelined FFT Processor using Data Scaling with Reduced Memory Requirements,” in Proceedings of Norchip, Copenhagen, Denmark, November 2002, pp. 74–79.

The following papers on education have also been published, but are not considered part of this thesis:

Hugo Hedberg, Thomas Lenart, and Henrik Svensson, “A Complete MP3 Decoder on a Chip,” in Proceedings of Microelectronic Systems Education, Anaheim, California, USA, June 2005, pp. 103–104.

Hugo Hedberg, Thomas Lenart, Henrik Svensson, Peter Nilsson, and Viktor Owall, “Teaching Digital HW-Design by Implementing a Complete MP3 Decoder,” in Proceedings of Microelectronic Systems Education, Anaheim, California, USA, June 2003, pp. 31–32.


Acknowledgment

I have many people to thank for a memorable time during my PhD studies. It has been the most challenging and fruitful experience of my life, and it has given me the opportunity to work with something that I really enjoy!

I would first of all like to extend my sincere gratitude to my supervisor, associate professor Viktor Owall. He has not only been guiding, reading, and commenting on my work over the years, but has also given me the opportunity and encouragement to work abroad during my PhD studies and gain experience from industrial research. I would also like to sincerely thank associate professor Mats Gustafsson, who has supervised parts of this work, for his profound theoretical knowledge and continuous support during this time.

I would like to thank my colleagues and friends at the Department of Electrical and Information Technology (formerly Electroscience) for an eventful and enjoyable time. It has been a pleasure to work with you all. I will always remember our social events and interesting discussions on any topic. My gratitude goes to Henrik Svensson, Hugo Hedberg, Fredrik Kristensen, Matthias Kamuf, Joachim Neves Rodrigues, and Hongtu Jiang for friendship and support over the years, and to my new colleagues Johan Lofgren and Deepak Dasalukunte for accompanying me during my last year of PhD studies. I would especially like to thank Henrik Svensson and Joachim Neves Rodrigues for reading and commenting on parts of the thesis, which was highly appreciated. The administrative and practical support at the department has been more than excellent, and naturally I would like to thank Pia Bruhn, Elsbieta Szybicka, Erik Jonsson, Stefan Molund, Bengt Bengtsson, and Lars Hedenstjerna for all their help.

I am so grateful that my family has always been there for me and encouraged my interest in electronics from an early age. At that time, my curiosity and research work consisted of dividing all kinds of electronic devices into smaller pieces, i.e. components, which unfortunately were never returned to their original state. This experience has been extremely valuable for me, though the focus has since shifted slightly from system separation to system integration.

My dear fiancée Yiran, I am so grateful for your love and patient support. During my PhD studies I visited many countries, and I am so glad and fortunate that I got the chance to meet you in the most unexpected place. Thank you for being who you are, for always giving me new perspectives and angles on life, and for inspiring me to take on new challenges.


This work was initiated as a multi-disciplinary project on digital holography, which later became Phase Holographic Imaging AB. I have had the opportunity to take part in this project for a long time, and it has been a valuable experience in many ways. From our weekly meetings in the department library to the first prototype of the holographic microscope – what an exciting time! I would like to thank all the people who have been involved in this work, especially Mikael Sebesta, for our valuable discussions since the very beginning, and for your enduring enthusiasm in taking this project forward.

In 2004 and 2006, I was invited for internships at Xilinx in San Jose, California, for 3 and 9 months, respectively. I gained a lot of experience and knowledge during that time, and I would like to thank Dr. Ivo Bolsens for giving me this opportunity. I would like to extend my gratitude to the colleagues and friends in the Xilinx Research Labs, especially my supervisors Dr. Adam Donlin and Dr. Jorn Janneck, for their encouragement, support, and valuable discussions. I have a lot of good memories from the visit, and I would like to thank my local and international friends for all the unforgettable adventures we experienced together in the US.

This work has been funded by the Competence Center for Circuit Design (CCCD) and sponsored with hardware equipment from the Xilinx University Program (XUP).

Lund, May 2008

Thomas Lenart


List of Acronyms

AHB Advanced High-performance Bus

AGU Address Generation Unit

ALU Arithmetic Logic Unit

AMBA Advanced Microcontroller Bus Architecture

APB Advanced Peripheral Bus

API Application Programming Interface

ASIC Application-Specific Integrated Circuit

ASIP Application-Specific Instruction Processor

CCD Charge-Coupled Device

CGRA Coarse-Grained Reconfigurable Architecture

CMOS Complementary Metal Oxide Semiconductor

CORDIC Coordinate Rotation Digital Computer

CPU Central Processing Unit

DAB Digital Audio Broadcasting

DAC Digital-to-Analog Converter

DCT Discrete Cosine Transform

DDR Double Data-Rate

DFT Discrete Fourier Transform

DIF Decimation-In-Frequency

DIT Decimation-In-Time

DRA Dynamically Reconfigurable Architecture

DRAM Dynamic Random-Access Memory

DSP Digital Signal Processor or Digital Signal Processing


DVB Digital Video Broadcasting

EDA Electronic Design Automation

EMIF External Memory Interface

ESL Electronic System Level

FFT Fast Fourier Transform

FIFO First In, First Out

FIR Finite Impulse Response

FPGA Field Programmable Gate Array

FSM Finite State Machine

GPP General-Purpose Processor

GPU Graphics Processing Unit

GUI Graphical User Interface

HDL Hardware Description Language

IFFT Inverse Fast Fourier Transform

IP Intellectual Property

ISS Instruction Set Simulator

LSB Least Significant Bit

LUT Lookup Table

MAC Multiply-Accumulate

MC Memory Cell

MPMC Multi-Port Memory Controller

MSB Most Significant Bit

NoC Network-on-Chip

OFDM Orthogonal Frequency Division Multiplexing

OSCI Open SystemC Initiative


PE Processing Element

PIM Processor-In-Memory

RAM Random-Access Memory

PC Processing Cell

RC Resource Cell

RGB Red-Green-Blue

ROM Read-Only Memory

RPA Reconfigurable Processing Array

RTL Register Transfer Level

SAR Synthetic Aperture Radar

SCENIC SystemC Environment with Interactive Control

SCV SystemC Verification (library)

SDF Single-path Delay Feedback

SDRAM Synchronous DRAM

SNR Signal-to-Noise Ratio

SOC System-On-Chip

SQNR Signal-to-Quantization-Noise Ratio

SRAM Static Random-Access Memory

SRF Stream Register File

TLM Transaction Level Modeling

VGA Video Graphics Array

VHDL Very high-speed integrated circuit HDL

VLIW Very Long Instruction Word

XML eXtensible Markup Language


List of Definitions and Mathematical Operators

Z+ Positive integer space

i, j Imaginary unit

∞ Infinity

∝ Proportional

≈ Approximately equal to

O Computational complexity

⌈x⌉ Ceiling function. Rounds x to the nearest upper integer value towards ∞

⌊x⌋ Floor function. Rounds x to the nearest lower integer value towards −∞

x ∈ A The element x belongs to the set A

x ∉ A The element x does not belong to the set A

F(x(t)) Fourier transform of x(t)

F−1(X(f)) Inverse Fourier transform of X(f)

x ∗ h The convolution of x and h

r Spatial vector

|r| Spatial vector length

x, y, z Cartesian unit vectors


Chapter 1

Introduction

This thesis discusses modeling and implementation of both application-specific and dynamically reconfigurable hardware architectures. Application-specific hardware architectures are usually found in systems with high constraints on processing performance, physical size, or power consumption, and are tailored for a certain application or algorithm to achieve high performance. However, an application-specific architecture provides limited flexibility and reuse. It is also likely that a system platform may require a large number of specific hardware accelerators, of which only a few units will operate concurrently. In contrast, a reconfigurable architecture trades performance for increased flexibility. It allows the hardware architecture to be changed during run-time in order to accelerate arbitrary algorithms, which extends the application domain and versatility of the device.

In this work, two approaches to reconfigurable hardware design are presented. In the first approach, a high-performance application-specific hardware accelerator is proposed, designed to enable real-time image reconstruction in digital holographic imaging [1]. The hardware accelerator provides reconfigurability on the block level, using a flexible pipeline architecture. In the second approach, a reconfigurable platform is designed and modeled by combining a conventional processor-based solution with a dynamically reconfigurable architecture. Hence, the platform can be reused to map arbitrary applications, which is a trade-off between versatility and hardware cost. A reconfigurable approach enables flexibility in the algorithm design work, as well as the possibility to dynamically reconfigure the accelerator for additional processing steps. An example is post-processing operations to enhance the quality and perception of reconstructed images.


1.1 Challenges in Digital Hardware Design

The hardware design space is multi-dimensional and constrained by many physical and practical boundaries. Designing hardware is therefore a trade-off between system performance, resource and development costs, power consumption, and other parameters related to the implementation. A design goal is to find an optimal balance that efficiently utilizes the available system resources and avoids performance bottlenecks.

However, performance bottlenecks do not disappear – they move around. Improving one part of the design in practice pushes the performance issues into another part of the design, or even into another part of the design space. For example, increasing processing performance traditionally requires a higher clock frequency f_clk, which increases the power consumption and therefore also the heat dissipation. Heat dissipation is physically constrained to avoid damage due to overheating of the device. One way to prevent this situation is to lower the supply voltage V_DD, since it has a quadratic effect on the dynamic power as

\[
P_{\mathrm{dyn}} = \alpha C_L V_{DD}^{2} f_{\mathrm{clk}},
\]

where α is the switching activity and C_L is the load capacitance [2]. However, the supply voltage also affects the propagation time t_p of logic gates as

\[
t_p \propto \frac{V_{DD}}{(V_{DD} - V_T)^{\beta}},
\]

where V_T is the threshold voltage and β is a technology-specific parameter. Hence, lowering the supply voltage makes the system slower, which counteracts the initial goal. This is the situation for most sequential processing units, e.g. microprocessors, where the processing power depends heavily on the system clock frequency.
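To make this trade-off concrete, the two relations above can be sketched in a few lines of Python. The numeric constants (α, C_L, V_T, β, and the delay scale k) are illustrative assumptions chosen for this sketch, not values from this work:

```python
# Sketch of the dynamic-power and propagation-delay relations above.
# All constants are arbitrary illustrative values, not from the thesis.

def dynamic_power(alpha, c_load, v_dd, f_clk):
    """Dynamic power: P_dyn = alpha * C_L * V_DD^2 * f_clk."""
    return alpha * c_load * v_dd ** 2 * f_clk

def propagation_delay(v_dd, v_t=0.4, beta=1.5, k=1e-9):
    """Gate delay: t_p proportional to V_DD / (V_DD - V_T)^beta (k sets the scale)."""
    return k * v_dd / (v_dd - v_t) ** beta

# Lowering V_DD from 1.2 V to 1.0 V cuts dynamic power quadratically...
p_12 = dynamic_power(alpha=0.1, c_load=1e-12, v_dd=1.2, f_clk=400e6)
p_10 = dynamic_power(alpha=0.1, c_load=1e-12, v_dd=1.0, f_clk=400e6)
assert abs(p_10 / p_12 - (1.0 / 1.2) ** 2) < 1e-9

# ...but increases the gate delay, i.e. lowers the maximum clock frequency.
assert propagation_delay(1.0) > propagation_delay(1.2)
```

The assertions simply confirm the opposing trends: quadratically lower power, but longer gate delay, at the reduced supply voltage.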

The current trend in the computer industry is to address this situation using multiple parallel processing units, which marks the beginning of a paradigm shift towards parallel hardware architectures [3]. However, conventional computer programs are described as sequentially executed instructions and cannot easily be adapted for a multi-processor environment. This limits the potential speed-up, due to the problem of finding enough parallelism in the software. The speedup S from using N parallel processing units is calculated using Amdahl's law as

\[
S = \frac{1}{(1 - p) + \frac{p}{N}},
\]


where p is the fraction of the sequential program that can be parallelized [4]. Assuming that 50% of the sequential code can be executed on parallel processors, the speedup is still limited to a modest factor of 2, no matter how many parallel processors are used. Parallel architectures have a promising future, but will require new design approaches and programming methodologies to enable high system utilization.
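The saturation effect described above can be checked with a minimal sketch of Amdahl's law (the helper name amdahl_speedup is ours, introduced only for illustration):

```python
# Minimal sketch of Amdahl's law as stated above: S = 1 / ((1 - p) + p / N),
# where p is the parallelizable fraction and n the number of processing units.

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# With p = 0.5, two units give only about 1.33x...
assert round(amdahl_speedup(0.5, 2), 2) == 1.33
# ...and even a million units stay below the 2x asymptote 1 / (1 - p).
assert amdahl_speedup(0.5, 1_000_000) < 2.0
```

The second assertion illustrates the point in the text: for p = 0.5 the speedup approaches, but never reaches, a factor of 2.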

1.1.1 Towards Parallel Architectures

In 1965, Intel's co-founder Gordon E. Moore observed that the number of transistors on an integrated circuit increases exponentially [5]. Since then, Moore's law has been used for more than four decades to describe trends in computer electronics and integrated circuits. Niklaus E. Wirth, the developer of the Pascal programming language, more recently stated another "law": that software gets slower faster than hardware gets faster. In part, this is an ironic observation that software developers rely on Moore's law when software complexity increases, expecting continuous exponential growth in performance as well as in complexity (number of transistors). However, there is a more serious aspect to this observation, related to the current trend towards multi-processor and highly parallel systems. In the past, software abstraction has been a way to hide complex behavior. Parallel systems, however, require more from the software developer in terms of knowledge about the underlying architecture to fully utilize the hardware. From a software perspective, this implies the use of programming languages that describe concurrent systems, exposing and utilizing the system hardware resources in order to efficiently program future parallel computer systems, or the use of programming interfaces for parallel computing [6, 7].

In addition, increasing hardware design complexity advocates tools for high-level modeling and virtual prototyping for hardware and software co-design. This is becoming an essential requirement when designing multi-processor based systems, in order to understand and analyze system-level behavior and performance.

1.1.2 Application-Specific vs. Reconfigurable Architectures

Despite the rapid development pace of modern computer systems, there will always be situations where a more customized solution is a better fit. Examples are mobile applications, where power is a crucial factor, and embedded systems that require real-time processing. These systems are usually composed of a generic processor-based hardware platform that includes one or more application-specific hardware accelerators. Application-specific architectures result in the highest performance with the lowest power consumption,


which may seem like an ideal situation. However, application-specific hardware accelerators provide limited flexibility and reusability compared to more general-purpose processing units. They are also a physical part of the device, which increases the manufacturing cost and the power consumption, even if the hardware accelerator is active for only a fraction of the time the device is in use.

In contrast, the design of reusable and reconfigurable architectures is an alternative that combines the flexibility of generic processing units with the processing power of application-specific architectures. Today, proposed reconfigurable architectures and processor arrays range from clusters of homogeneous full-scale multiprocessors to arrays of weakly programmable elements or configurable logic blocks, which communicate over a network of interconnects. Judging by current trends, the question for the future will probably not be if highly parallel and reconfigurable architectures will become mainstream, but rather how they can be efficiently explored, constructed, and programmed [8]. A likely scenario for the future is processing platforms that combine application-specific architectures, reconfigurable architectures, and generic processors to support a wide range of applications.

1.2 Contributions and Thesis Outline

The thesis is divided into four parts, which cover modeling, design, and implementation of application-specific and reconfigurable hardware architectures. Parts I and II present an application-specific hardware acceleration platform for digital holographic imaging. Part I proposes a system architecture, where a set of hardware accelerators are connected in a reconfigurable communication pipeline, to enable flexibility and real-time image reconstruction. Part II presents a high-performance and scalable FFT core, which is the main building block in the hardware acceleration platform in Part I. Furthermore, the proposed ideas on index scaling in pipeline FFT architectures, originally presented in [9], have been further developed by other research groups [10].

In Parts III and IV, the research is generalized towards dynamically reconfigurable architectures based on scalable processing arrays. Part III presents an exploration framework and simulation models, which are used for constructing and evaluating reconfigurable architectures. In Part IV, a coarse-grained reconfigurable architecture is proposed, which consists of an array of processing and memory cells for accelerating arbitrary applications. An intended target application for this work is again digital holographic imaging.

A general overview of hardware design is given in Chapter 2, discussing processing alternatives and memory concerns. This chapter also presents the design flow from high-level specification down to physical layout, with emphasis on system modeling and the use of virtual platforms for design exploration and hardware/software co-design. Chapter 3 presents an introduction to digital holographic imaging, which has been the driver application for the first part of this work and a motivation for the second part. Algorithmic requirements and hardware design trade-offs are also discussed.

The research on reconfigurable architectures is a collaboration between theauthor of this thesis and PhD student Henrik Svensson. The work is howeverbased on an individually developed set of reconfigurable models, using the pre-sented Scenic exploration environment as a common development platform.For published material, the first named author indicates which reconfigurablemodules that are evaluated. The Scenic core library has been mainly de-veloped by the author, while the environment has been ported to additionaloperating systems and integrated with ArchC [11], which is an academic archi-tectural description language for processors, by Henrik Svensson.

Some of the VHDL models presented in Part IV have been developed in collaboration with students in the IC project and verification course, according to the author's specifications. The author especially wants to acknowledge the work of Chenxin Zhang.

Part I: A Hardware Acceleration Platform for Digital Holographic Imaging

Many biomedical imaging systems have high computational demands but still require real-time performance. When desktop computers are not able to satisfy the real-time requirements, dedicated or application-specific solutions are necessary. In this part, a hardware acceleration platform for digital holographic imaging is presented, which transforms an interference pattern captured on a digital image sensor into visible images. A computationally and memory-efficient pipeline, referred to as XStream, is presented together with simulations and implementation results for the final design. The work shows significant reductions in memory requirements and high speedup due to improved memory organization and efficient processing. The accelerator has been embedded into an FPGA system platform, which is integrated into a prototype of the holographic system.

Publications:

Thomas Lenart, Mats Gustafsson, and Viktor Owall, “A Hardware Acceleration Platform for Digital Holographic Imaging,” Springer Journal of Signal Processing Systems, DOI: 10.1007/s11265-008-0161-2, January 2008.

Thomas Lenart and Viktor Owall, “XStream - A Hardware Accelerator for Digital Holographic Imaging,” in Proceedings of IEEE International Conference on Electronics, Circuits, and Systems, Gammarth, Tunisia, December 2005.

Thomas Lenart, Viktor Owall, Mats Gustafsson, Mikael Sebesta, and Peter Egelberg, “Accelerating Signal Processing Algorithms in Digital Holography using an FPGA Platform,” in Proceedings of IEEE International Conference on Field-Programmable Technology, Tokyo, Japan, December 2003, pp. 387–390.

Mats Gustafsson, Mikael Sebesta, Bengt Bengtsson, Sven-Goran Pettersson, Peter Egelberg, and Thomas Lenart, “High Resolution Digital Transmission Microscopy - a Fourier Holography Approach,” in Optics and Lasers in Engineering, vol. 41, issue 3, March 2004, pp. 553–563.

Part II: A High-performance FFT Core for Digital Holographic Imaging

The main building block in the holographic imaging platform is a large-size two-dimensional FFT core. A high-performance FFT core is presented, where dynamic data scaling is used to increase the accuracy in the computational blocks, and to improve the quality of the reconstructed holographic image. A hybrid floating-point scheme with a tailored exponent datapath is proposed together with a co-optimized architecture between hybrid floating-point and block floating-point. A pipeline FFT architecture using dynamic data scaling has been fabricated in a standard CMOS process, and is used as a building block in the FPGA prototype presented in Part I.
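As background to the data scaling mentioned above: in block floating-point, all values in a block share one exponent, chosen so that the largest magnitude fills the fixed word length. The sketch below is a simplified illustration with an assumed 16-bit word length, not the architecture presented in the thesis.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Normalize a block of 16-bit samples: left-shift every value by the
// largest amount that keeps the peak magnitude in range, and return
// that shift count, which defines the shared block exponent.
int block_normalize(std::vector<int16_t>& block) {
    int peak = 0;
    for (int16_t v : block)
        peak = std::max(peak, std::abs(static_cast<int>(v)));
    int shift = 0;
    while (peak != 0 && (peak << 1) <= INT16_MAX) {
        peak <<= 1;
        ++shift;
    }
    for (int16_t& v : block)
        v = static_cast<int16_t>(v * (1 << shift));
    return shift;
}
```

Storing one exponent per block, rather than one per value, preserves precision in a fixed-width datapath at a small storage cost, which is the motivation for block floating-point in FFT pipelines.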

Publications:

Thomas Lenart and Viktor Owall, “Architectures for Dynamic Data Scaling in 2/4/8K Pipeline FFT Cores,” IEEE Transactions on Very Large Scale Integration Systems, vol. 14, no. 11, November 2006, pp. 1286–1290.

Thomas Lenart and Viktor Owall, “A 2048 Complex Point FFT Processor using a Novel Data Scaling Approach,” in Proceedings of IEEE International Symposium on Circuits and Systems, vol. 4, Bangkok, Thailand, May 2003, pp. 45–48.

Thomas Lenart and Viktor Owall, “A Pipelined FFT Processor using Data Scaling with Reduced Memory Requirements,” in Proceedings of Norchip, Copenhagen, Denmark, November 2002, pp. 74–79.

Part III: A Design Environment and Models for Reconfigurable Computing

System-level simulation and exploration tools and models are required to rapidly evaluate system performance in the early design phase. The use of virtual platforms enables hardware modeling as well as early software development. The exploration tool Scenic is developed, which introduces functionality to access and extract performance-related information during simulation. A set of model generators and architectural generators are also proposed to create custom processor, memory, and system architectures that facilitate design exploration. The Scenic exploration tool is used in Part IV to evaluate dynamically reconfigurable architectures.

Publications:

Henrik Svensson, Thomas Lenart, and Viktor Owall, “Modelling and Exploration of a Reconfigurable Array using SystemC TLM,” in Proceedings of Reconfigurable Architectures Workshop, Miami, Florida, USA, April 2008.

Thomas Lenart, Henrik Svensson, and Viktor Owall, “A Hybrid Interconnect Network-on-Chip and a Transaction Level Modeling approach for Reconfigurable Computing,” in Proceedings of IEEE International Symposium on Electronic Design, Test and Applications, Hong Kong, China, January 2008, pp. 398–404.

Part IV: A Run-time Reconfigurable Computing Platform

Reconfigurable hardware architectures are emerging as a suitable approach to combine high performance with flexibility and programmability. While fine-grained architectures are capable of bit-level reconfiguration, more recent work focuses on coarse-grained architectures that enable higher performance using word-level data processing. A coarse-grained dynamically reconfigurable architecture is presented, constructed from an array of processing and memory cells. The cells communicate using local interconnects and a hierarchical network using routers. The design has been modeled using the Scenic exploration environment and simulation models, implemented in VHDL, and synthesized for a 0.13µm cell library.

Publications:

Thomas Lenart, Henrik Svensson, and Viktor Owall, “Modeling and Exploration of a Reconfigurable Architecture for Digital Holographic Imaging,” in Proceedings of IEEE International Symposium on Circuits and Systems, Seattle, USA, May 2008.

Henrik Svensson, Thomas Lenart, and Viktor Owall, “Modelling and Exploration of a Reconfigurable Array using SystemC TLM,” in Proceedings of Reconfigurable Architectures Workshop, Miami, Florida, USA, April 2008.

Thomas Lenart, Henrik Svensson, and Viktor Owall, “A Hybrid Interconnect Network-on-Chip and a Transaction Level Modeling approach for Reconfigurable Computing,” in Proceedings of IEEE International Symposium on Electronic Design, Test and Applications, Hong Kong, China, January 2008, pp. 398–404.


Chapter 2

Digital System Design

Digital systems range from desktop computers for general-purpose processing to embedded systems with application-specific functionality. A digital system platform comprises elements for data processing and storage, which are connected and integrated to form a design-specific architecture.

In this chapter, elements for data processing are further divided into three classes, namely programmable, reconfigurable, and application-specific architectures. Programmable architectures include, for example, the general-purpose processor (GPP), the application-specific instruction-set processor (ASIP), and the digital signal processor (DSP). Reconfigurable architectures range from bit-level configuration, such as the field-programmable gate array (FPGA), to word-level configuration in coarse-grained reconfigurable architectures (CGRA). Application-specific architectures are required when there are high constraints on power consumption or processing performance, such as in hand-held devices and real-time systems, where tailored hardware may be the only feasible solution [12]. Figure 2.1 illustrates how different architectures trade flexibility for higher efficiency. The term flexibility includes programmability and versatility, while the term efficiency relates to processing performance and energy efficiency. General-purpose devices provide a high level of flexibility and are located in one end of the design space, while specialized and optimized application-specific architectures are located in the other end [13].

Efficient memory and storage elements are required to supply the processing elements with data. Memory performance depends on how the memory system is organized, as well as on how data is stored. The memory organization includes the memory hierarchy and configuration, while the storage aspects relate to how well data is re-used through caching, and to the memory access pattern for non-cacheable data [14]. Hence, system performance is a balance between efficient processing and efficient memory management.

[Figure: architectures (GPP, ASIP, DSP, GPU, CGRA, ASIC) plotted along a flexibility versus efficiency/performance trade-off, grouped into general-purpose, special-purpose, reconfigurable, and application-specific domains; this work targets the CGRA region.]

Figure 2.1: Efficiency and performance as a function of flexibility for various architectures. Architectures from different domains cannot be directly related; hence, they are grouped into separate domains.

Constructing digital systems not only requires a set of hardware building blocks, but also a methodology and design flow to generate a functional circuit from a high-level specification. In addition, the current trend towards high-level modeling and system-level exploration requires more advanced design steps to enable software and hardware co-development from a common virtual platform [15]. For complex system design, exploration tools require the use of abstraction levels to trade modeling accuracy for simulation speed [16], which is further discussed in Part III.

2.1 Programmable Architectures

Programmable architectures are further divided into three groups: general-purpose processors (GPP), configurable instruction-set processors, and special-purpose processors. The GPP has a general instruction set that has not been optimized for any specific application domain. Configurable instruction-set processors offer the possibility to extend the instruction set and to tune the processor for a user-defined application [17]. Special-purpose processors include instructions and additional hardware resources for primitives commonly used in their field of operation, for example signal or graphics processing. The processor architectures are trade-offs between computational efficiency, flexibility, and hardware requirements.


2.1.1 General-Purpose Processors

The microprocessor is a generic computational unit, designed to support any algorithm that can be converted into a computer program. Microprocessors suffer from an obvious performance bottleneck: the sequential nature of program execution. Instructions are processed one after the other, with the upside being high instruction locality and the downside being high instruction dependency.

There are several ways to improve the performance of microprocessors, i.e. to increase the number of executed instructions per second. By dividing the instruction execution into a set of stages, several instructions can execute concurrently in an instruction pipeline. An extension to the pipelining concept is to identify independent instructions at run-time that can be sent to different instruction pipelines, commonly known as a superscalar architecture. However, evaluating instruction dependencies at run-time is a costly technique due to the increased hardware complexity. Another approach is to let software analyze the dependencies and provide a program where several instructions are issued in parallel, which is known as a very long instruction word (VLIW) processor. A comparison between different processor architectures is presented in [18].

In recent computer systems, multi-core processors are introduced to address the limitation of purely sequential processing. Independent applications, or parts of the same application, are parallelized over multiple processing units. This means that processing speed may potentially increase by a factor equal to the number of processing cores. In reality, the speed-up for a single thread is limited by the instruction-level parallelism (ILP) in the program code, which indicates to what extent a sequential program can be parallelized [19]. For a single-threaded application with limited ILP, multi-core will not provide any noticeable speed-up. However, if the programmer divides the application into multiple parallel threads of execution, a multi-core system can provide thread-level parallelism (TLP).
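The speed-up limit described above is commonly estimated with Amdahl's law. The sketch below is illustrative background rather than material from the thesis, and the parameter values in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: overall speed-up on n cores when a fraction p of the
// execution time is parallelizable and (1 - p) must run sequentially.
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, with p = 0.9 the speed-up approaches but never exceeds 10, no matter how many cores are added, which is why single-threaded applications with limited ILP see little benefit from multi-core.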

2.1.2 Configurable Instruction-Set Processors

One way to improve processing performance is to adapt and extend the processor's instruction set for a given application [20]. Profiling tools are first used to analyze the application data-flow graph, to find and extract frequently used computational kernels that constitute the main part of the execution time [21]. The kernels are groups of basic instructions that often occur in conjunction; hence it is possible to merge them into a specialized instruction [22]. By extending the processor with support for such operations, the overall performance of the system is significantly improved. However, from a system perspective, the performance improvement has to be weighed against the resource, area, and power requirements of introducing specialized instructions.
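As an illustration of merging basic instructions that occur in conjunction (the operations and names below are hypothetical, not taken from any cited framework): if profiling shows a multiply feeding an add in a hot kernel, the pair can be fused into one specialized instruction, halving the issue slots spent on that pattern.

```cpp
#include <cassert>
#include <cstdint>

// Two basic instructions, each occupying one issue slot.
int32_t op_mul(int32_t a, int32_t b) { return a * b; }
int32_t op_add(int32_t a, int32_t b) { return a + b; }

// Specialized instruction fusing the frequently paired mul+add:
// same result, one issue slot instead of two.
int32_t op_muladd(int32_t a, int32_t b, int32_t c) {
    return a * b + c;
}
```

The fused form also removes the intermediate register write-back between the two basic operations, which is part of the hardware-cost trade-off mentioned above.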


[Figure: (a) register files A and B, a common cache, and ALU/MAC units; (b) a stream register file (SRF), ALU clusters with local register files (LRF), an EMIF, and a stream controller.]

Figure 2.2: (a) A DSP architecture with two separate register files and multiple functional units on each datapath. (b) A stream processor with a high-performance register file connected to an array of ALU clusters. Each cluster contains multiple functional units and local register files. The stream controller handles data transfers to external memory.

In addition to extending the processor with new instructions, the compiler tools have to be modified to support the extended instruction set. Frameworks for building extendable processors can be found in industry as well as in academia [23]. Advanced tools also provide generation of processor models for simulation, and tools for system tuning and exploring architectural trade-offs.

2.1.3 Special-Purpose Processors

An example of a special-purpose device is the digital signal processor (DSP). A digital signal processor has much in common with a general-purpose microprocessor, but contains additional features and usually includes more than one computation unit and register bank, as illustrated in Figure 2.2(a). In signal processing algorithms, a set of frequently used operations can be isolated. An example is multiplication followed by accumulation, present in for instance FIR filters [24]. In a DSP, this operation is executed in a single clock cycle using dedicated multiply-accumulate (MAC) hardware. Another useful feature is multiple addressing modes, such as modulo, ring-buffer, and bit-reversed addressing. In signal processing, overflow causes serious problems when values exceed the maximum wordlength. In a conventional CPU, integer values wrap around on overflow and corrupt the result. DSPs often include saturation logic for integer arithmetic, which prevents the value from wrapping and causes less damage to the final result. Despite the fact that DSPs are referred to as special-purpose processors, their application domain is fairly wide.
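The multiply-accumulate and saturation behavior can be sketched in software as follows; the 16-bit operand and 32-bit accumulator widths are illustrative assumptions, not the datapath of any specific DSP.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// One DSP-style MAC step: acc += a * b, computed in a wider type and
// clamped at the 32-bit limits instead of wrapping like plain integers.
int32_t mac_saturate(int32_t acc, int16_t a, int16_t b) {
    int64_t wide = static_cast<int64_t>(acc) +
                   static_cast<int64_t>(a) * static_cast<int64_t>(b);
    wide = std::min<int64_t>(wide, INT32_MAX);
    wide = std::max<int64_t>(wide, INT32_MIN);
    return static_cast<int32_t>(wide);
}
```

On overflow the result sticks at INT32_MAX rather than wrapping to a large negative value, which does far less damage to, for example, a filter output.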


Stream processors combine a highly parallel processing architecture with a memory controller that supervises the data streams from external and local memories; an architectural overview is shown in Figure 2.2(b) [25]. A stream register file (SRF) manages the exchange of data between the computational clusters and the external memory. The clusters are based on parallel processing elements that are programmed using VLIW instructions. Examples of stream processors are Imagine and Merrimac from Stanford [26, 27].

Another special-purpose processor, with a narrower application domain, is the graphics processing unit (GPU), which resembles the stream processor. Traditionally, accelerators for graphics processing have been application-specific to be able to handle an extreme number of operations per second. As a result, GPUs provided no or limited programmability. However, the trend is changing towards general-purpose GPUs (GPGPU), which widens the application domain for this kind of special-purpose processor [28]. Due to the massively parallel architecture, the GPGPU is suitable for mapping parallel and streaming applications. The use of high-performance graphics cards for general processing is especially interesting for applications with either real-time requirements or long execution times.

A common characteristic of special-purpose processors is the ability to operate on sets of data. Execution in a generic processor is divided into which data to process and how to process that data, referred to as control-flow and data-flow, respectively. The control flow is expressed using control statements such as if, for, and while. When applying a single operation to a large data set or a data stream, the control statements will dominate the execution time. A solution is to use a single operation or kernel that is directly applied to a larger data set. This is referred to as a single-instruction multiple-data (SIMD) or multiple-instruction multiple-data (MIMD) processing unit, according to Flynn's taxonomy [29]. Hence, the control overhead is reduced and the processing throughput is increased.
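In scalar code the split between control-flow and data-flow is directly visible. The sketch below (an assumed example, not from the thesis) applies one kernel across a whole data set; this is the pattern a SIMD unit executes several elements at a time to amortize the loop control.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Applying a single kernel (multiply by gain, add offset) to a whole
// data set. A scalar processor pays one loop test per element; a SIMD
// unit applies the same kernel to W elements per step, cutting the
// control overhead by the vector width W.
void scale_offset(std::vector<float>& data, float gain, float offset) {
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = data[i] * gain + offset;  // the kernel
}
```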

2.2 Reconfigurable Architectures

In contrast to programmable architectures, reconfigurable architectures enable hardware programmability. This means that not only the software that runs on a platform can be modified, but also how the architecture operates and communicates [13, 30–32]. Hence, an application is accelerated by allocating a set of required processing, memory, and routing resources.

Many presented reconfigurable architectures consist of an array of processing and storage elements [33]. Architectures are either homogeneous, meaning that all elements implement the same functionality, or heterogeneous, where elements provide different functionality. Processing elements communicate using either statically defined interconnects or programmable switch boxes, which are dynamically configured during run-time. Physical communication channels for transferring data are reserved for a duration of time, referred to as circuit switching, or shared between data transfers using packet switching to dynamically route the traffic [34].

Table 2.1: A comparison between fine-grained and coarse-grained architectures. Development time and design specification refer to applications running on the platform.

  Properties                 Fine-grained        Coarse-grained
  Granularity                bit-level (LUT)     word-level (ALU)
  Versatility/Flexibility    high                medium/high
  Performance                medium              high
  Interconnect overhead      large               small
  Reconfiguration time       long (ms)           short (µs)
  Development time           long                medium
  Design specification       hardware            software
  Application domain         prototyping, HPC    RTR systems, HPC

The size of the reconfigurable elements is referred to as the granularity of the device. Fine-grained devices are usually based on small look-up tables (LUT) to enable bit-level manipulations. These devices are extremely versatile and can be used to map virtually any algorithm. However, fine-grained architectures are inefficient in terms of hardware utilization of logic and routing resources. In contrast, coarse-grained architectures use building blocks in a size ranging from arithmetic logic units (ALU) to full-scale processors. This yields higher performance when constructing standard datapaths, since the arithmetic units are constructed more efficiently, but the device becomes less versatile. The properties of fine-grained and coarse-grained architectures are summarized in Table 2.1, and are further discussed in the following sections.
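The granularity contrast can be made concrete: the fine-grained primitive is a k-input LUT, i.e. a 2^k-entry truth table that can realize any boolean function of its inputs, whereas a coarse-grained cell offers complete ALU operations. A minimal 4-input LUT evaluator, as an illustrative sketch:

```cpp
#include <cassert>
#include <cstdint>

// A 4-input LUT: 'config' is the 16-bit truth table and 'inputs' (0-15)
// selects one of its bits. Any boolean function of four variables is
// realized simply by choosing the configuration bits.
bool lut4(uint16_t config, unsigned inputs) {
    return (config >> (inputs & 0xFu)) & 1u;
}
```

For example, config 0x8000 realizes a 4-input AND (only index 15 returns true), and 0x6996 realizes the parity (XOR) of the four inputs.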

Reconfigurable architectures are used for multiple purposes and for a wide application domain, which includes:

Prototyping - Versatile fine-grained architectures, i.e. FPGAs, are commonly used for functional verification prior to ASIC manufacturing. This is of high importance when software simulations are not powerful enough to verify that the design operates correctly during exhaustive runs.

Reconfigurable computing - As an alternative to developing application-specific hardware for low-power and compute-intensive operation, coarse-grained reconfigurable architectures are used to combine efficiency with reconfigurability. This enables post-manufacturing programmability and reduces development time and risk. Application mapping becomes flexible, where all hardware resources are either used for a single task, or shared to handle parallel tasks simultaneously.

High-Performance Computing - A reconfigurable architecture can accelerate and off-load compute-intensive tasks in a conventional computer system. The reconfigurable architecture is either placed inside the processor datapath, or acts as an external accelerator that communicates over the system bus or directly on the processor front side bus (FSB). Both fine-grained and coarse-grained architectures are candidates for high-performance computing [35].

2.2.1 Fine-Grained Architectures

Fine-grained reconfigurable architectures have been available for almost three decades, and are a natural part of digital system design and verification. The fine-grained devices include programmable array logic (PAL), complex programmable logic devices (CPLD), and field-programmable gate arrays (FPGA). The devices are listed in order of increasing complexity, and also differ in architecture and in how the configuration data is stored. Smaller devices use internal non-volatile memory, while larger devices are volatile and programmed from an external source. Since FPGAs have the most versatile architecture, they will here be used as the reference for fine-grained reconfigurable architectures.

The traditional FPGA is a fine-grained logic device that can be reconfigured after manufacturing, i.e. it is field programmable. The FPGA architecture is illustrated in Figure 2.3(a), and is constructed from an array of configurable logic blocks (CLB) that emulate boolean logic and basic arithmetic using look-up tables (LUT). An interconnect network, which is configured using switchboxes, enables the logic blocks to communicate.

The realization of memories using logic blocks is an inefficient approach; hence FPGAs include small dedicated block RAMs to implement single- and dual-port memories. Block RAMs may be connected to emulate larger memories, and also enable data exchange between two clock domains on the device. Furthermore, a current industry trend is to include more special-purpose structures in the FPGA, referred to as macro-blocks. Available macro-blocks include multipliers and DSP units, which lead to significant performance improvements over implementations using logic blocks. Larger devices may include one or more GPPs, which are directly connected to the FPGA fabric. This enables rapid system design using a single FPGA platform.


[Figure: (a) FPGA with GPP macro-blocks, embedded memory, and a magnified fabric; (b) an array of ALUs connected through routers.]

Figure 2.3: (a) Overview of the FPGA architecture with embedded memory (in gray) and GPP macro-blocks. The FPGA fabric, containing logic blocks and switchboxes, is magnified. (b) An example of a coarse-grained reconfigurable architecture, with an array of processing elements (ALUs) and a routing network.

The FPGA design flow starts with a specification, either in a hardware description language (HDL) or in a more high-level description. The HDL code is synthesized to gate level and the functionality is mapped onto the logic blocks in the FPGA. The final step is to route the signals over the interconnect network and to generate a device-specific configuration file (bit-file). This file contains information on how to configure the logic blocks, the interconnect network, and the IO-pads inside the FPGA. The configuration file is static and cannot alter the functionality during run-time, but some FPGAs support run-time reconfiguration (RTR) and the possibility to download partial configuration files while the device is operating [36]. However, the reconfiguration time is in the range of milliseconds, since the FPGA is configured at bit-level.

2.2.2 Coarse-Grained Architectures

Coarse-grained reconfigurable architectures are arrays constructed from larger computational elements, usually the size of ALUs or smaller programmable kernels and state machines. The computational elements communicate using a routing network; an architectural overview is illustrated in Figure 2.3(b). In this way, the coarse-grained architecture requires less configuration data, which improves the reconfiguration time, while the routing resources incur a lower hardware overhead.

In contrast to a fine-grained FPGA, coarse-grained architectures are designed for partial and run-time reconfiguration. This is an important aspect in situations when hardware acceleration is required only for short time durations or only during the device initialization phase. Instead of developing an application-specific hardware accelerator for each individual situation, a reconfigurable architecture may be reused to accelerate arbitrary algorithms. Once the execution of one algorithm completes, the architecture is reconfigured for other tasks. The work on a dynamically reconfigurable architecture is further discussed in Part IV.

The possibility to support algorithmic scaling is also an important aspect. Algorithmic scaling means that an algorithm can be mapped to the reconfigurable array in multiple ways, enabling a trade-off between processing performance and area requirements. A library containing different algorithm mappings would enable the programmer or the mapping tool to select a suitable architecture for each situation. A low-complexity algorithm mapping may be suitable for non-critical processing, while a parallel mapping may be required for high-performance computing.

From an algorithm development perspective, coarse-grained architectures differ considerably from the design methodology used for FPGA development. While FPGAs use a hardware-centric methodology to map functionality into gates, the coarse-grained architectures enable a more software-centric and high-level approach. Hence, hardware accelerators can be developed on demand, and potentially in the same language used for software development. Unified programming environments enhance productivity by simplifying system integration and verification.

2.3 Application-Specific Architectures

In some cases it is not feasible to use programmable solutions, for reasons such as constraints on power consumption, area requirements, or real-time performance. Instead, an application-specific architecture that satisfies the constraints is tailored for the situation. Many digital signal processing (DSP) algorithms, such as filters and transforms, allow efficient mapping to an application-specific architecture due to their highly regular structure. DSP algorithms are also easy to scale, using folding and unfolding, to find a balance between throughput and area requirements.

Application-specific architectures are found in most embedded systems, often in the parts of the design that have hard real-time constraints. In cellular applications, the most time-critical parts of the digital transceiver require dedicated hardware to handle the hard real-time and throughput requirements. Application-specific hardware is also required when communicating with external components using fixed protocols, such as interfaces to memories and digital image sensors.


For embedded systems, application-specific hardware accelerators are commonly used to off-load computationally intensive software operations, in order to save power or to gain processing speed. First, profiling is used to reveal and identify the performance-critical sections that consume most of the processing power. Secondly, a hardware accelerator is designed and connected to the processing unit, either as a tightly-coupled co-processor or as a loosely-coupled accelerator directly connected to the system bus [37].

Since application-specific architectures are restricted only by user-defined constraints, the architectural mapping can take several forms. For designs constrained by area, the hardware can be time-multiplexed to re-use a single processing element. The idea is the same as for the microprocessor: to process data sequentially. The difference is that the program is replaced by a state-machine, which reduces the control-flow overhead and increases efficiency. In contrast to time-multiplexed implementations, pipelined and parallel architectures are designed to maximize throughput. Parallel architectures are often referred to as direct-mapped, which relates to the data-flow graph of the parallel algorithm.
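The two mappings can be contrasted in software. The sketch below is purely illustrative (the tap values and the four-sample window are invented): the direct-mapped version evaluates all taps at once, as a parallel datapath would in one cycle, while the time-multiplexed version sequences a single MAC unit with a state counter over four cycles.

```python
COEFFS = [1, 2, 3, 4]                 # illustrative filter taps

def fir_parallel(window):
    """Direct-mapped: one multiplier per tap, all products in one 'cycle'."""
    return sum(c * x for c, x in zip(COEFFS, window))

def fir_time_multiplexed(window):
    """One shared MAC unit; a state counter sequences the taps over 4 cycles."""
    acc = 0
    for state in range(len(COEFFS)):  # the state counter replaces a program
        acc += COEFFS[state] * window[state]
    return acc

samples = [5, 6, 7, 8]
assert fir_parallel(samples) == fir_time_multiplexed(samples) == 70
```

Both mappings compute the same result; they trade a 4x smaller datapath against a 4x longer computation time.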

Even application-specific architectures are often designed with flexibility in mind, but are still typically limited to fixed algorithms. In datapath designs, a flexible dataflow enables algorithmic changes on a structural level. Examples are algorithms with similar dataflow and small structural differences, such as the discrete cosine transform (DCT) and the fast Fourier transform (FFT). In control-based designs, programmable state-machines provide flexibility to change algorithmic parameters using the same structural datapath. Examples in filter design are variable filter length and the ability to update filter coefficients.

2.4 Memory and Storage

In addition to the computational elements, efficient memory and storage elements are required to store and supply the system with data. However, for decades the speed of processors has improved at a much higher rate than that of memories, causing a processor-memory performance gap [38]. The point where computational elements are limited by memory performance is referred to as the memory wall. For high-performance systems using dedicated hardware accelerators or highly parallel architectures, this problem becomes even more critical.

2.4.1 Data Caching

There are several techniques to hide the processor-memory gap, both in terms of latency and throughput. The most common approach is caching of data, which means that frequently accessed information is stored in a smaller but faster memory close to the processing unit. Caches use static random access memory (SRAM), which has the desirable property of fast uniform access time to all memory elements. Thus, random access to data has low latency, assuming that the data is available in the cache. In contrast, accessing data that is currently not available in the cache generates a cache miss. For state-of-the-art processors, a cache miss is extremely expensive in terms of processor clock cycles. It is not uncommon that a cache miss stalls the processor for hundreds of clock cycles until data becomes available. The reasons are the long round-trip latency and the difference in clock frequency between external memory and the processor.

Figure 2.4: Modern SDRAM is divided into banks, rows, and columns. Accessing consecutive elements yields a high memory bandwidth, which must be considered to achieve high performance.

Unfortunately, caching is not always helpful. In streaming applications, each data value is often referenced only once, resulting in continuous accesses to new data in external memory. In this situation, the memory access pattern is of high importance, as is hiding the round-trip latency. The use of burst transfers with consecutive data elements is then necessary to minimize the communication overhead and to enable streaming applications to utilize the memory more efficiently.
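The effect can be illustrated with a toy direct-mapped cache model (the geometry below is a hypothetical 64-line, 8-words-per-line cache, not any particular processor). A use-once stream still benefits from line fills thanks to spatial locality, but takes a compulsory miss on every new line, whereas a small reused working set misses only on its first pass:

```python
LINES, LINE_WORDS = 64, 8             # hypothetical cache geometry

def run(addresses):
    """Count hits/misses of a direct-mapped cache for a word-address trace."""
    tags = [None] * LINES
    hits = misses = 0
    for addr in addresses:
        line = addr // LINE_WORDS     # which memory line the word belongs to
        index, tag = line % LINES, line // LINES
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag         # fill the line (one burst from memory)
    return hits, misses

stream = list(range(100000))          # streaming: every word touched once
loop = list(range(256)) * 400         # small working set, reused 400 times

assert run(stream) == (87500, 12500)  # a compulsory miss for every new line
assert run(loop) == (102368, 32)      # only 32 misses in 102400 accesses
```

The streaming trace forces a new line fill every eight accesses regardless of cache size, which is why stream-oriented systems rely on burst transfers rather than caching.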

2.4.2 Data Streaming

Many signal processing algorithms are based on applying one or more kernel functions to one or more data sets. If a data set can be described as a sequential stream, the kernel function can be efficiently applied to all elements in the stream. From an external memory perspective, streaming data achieves a higher bandwidth than random access operations. This is mainly due to the high round-trip latency of random accesses, but also due to the physical organization of burst-oriented memories. While high-performance processors rely on caches to avoid or hide the external memory access latency, streaming applications often use each data value only once, so the data cannot be cached. In other words, the stream processor is optimized for streams, while the general-purpose microprocessor handles random access by assuming a high locality of reference.

Modern DRAMs are based on several individual memory banks, and the memory address is separated into rows and columns, as shown in Figure 2.4. The three-dimensional organization of the memory device results in non-uniform access time. The memory banks are accessed independently, but the two-dimensional memory array in each bank is more complicated. To access a memory element, the corresponding row needs to be selected. Data in the selected row is then transferred to the row buffer. From the row buffer, data is accessed at high speed and with uniform access time for any access pattern. When data from a different row is requested, the current row has to be closed, by pre-charging the bank, before the next row can be activated and transferred to the row buffer. Therefore, the bandwidth of burst-oriented memories highly depends on the access pattern. Accessing different banks or columns inside a row has a low latency, while accessing data in different rows has a high latency. When processing several memory streams, a centralized memory access scheduler can optimize the overall performance by reordering the memory access requests [39]. Latency for individual transfers may increase, but the goal is to minimize average latency and maximize overall throughput. Additional aspects of efficient memory access are discussed in Part I.
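The row-buffer behavior can be sketched with a simple timing model. The cycle counts and row size below are invented for illustration (real precharge/activate timings are device-specific); the point is the gap between an access pattern that stays inside open rows and one that forces a row change on every access:

```python
ROW_WORDS = 1024                      # hypothetical words per DRAM row
T_HIT, T_MISS = 1, 21                 # CAS only vs precharge + activate + CAS

def cycles(addresses):
    """Total access time under a simple open-row (row-buffer) policy."""
    open_row, total = None, 0
    for addr in addresses:
        row = addr // ROW_WORDS
        total += T_HIT if row == open_row else T_MISS
        open_row = row
    return total

seq = list(range(4096))               # sequential: stays inside each row
thrash = [i * ROW_WORDS for i in range(4096)]   # new row on every access

assert cycles(seq) == 4176            # 4 row activations, then row-buffer hits
assert cycles(thrash) == 86016        # every access pays the full row penalty
```

With these assumed timings, the row-thrashing pattern is roughly twenty times slower for the same number of accesses, which is what a memory access scheduler tries to avoid by reordering requests.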

2.5 Hardware Design Flow

Hardware developers follow a design flow to systematically handle the design steps from an abstract specification or model to a functional hardware platform. A simplified design flow is illustrated in Figure 2.5 and involves the following steps:

Specification - The initial design decisions are outlined during the specification phase. This is usually in the form of block diagrams and hierarchical schematics, design constraints, and design requirements.

Virtual prototype - A virtual hardware prototype is constructed to verify the specification and to allow hardware and software co-design [40]. It is usually based on a high-level description using sequential languages, such as C/C++ or Matlab, or a high-level hardware description language such as SystemC [41]. More accurate descriptions yield better estimates of system performance, but result in increased simulation time [42]. High-level simulation models from vendor-specific IP are also integrated if available.

Figure 2.5: Design flow for system hardware development, starting from behavioral modeling, followed by hardware implementation, and finally physical placement and routing.

Hardware description - Converting the high-level description into a hardware description can be done automatically or manually. Manual translation is still the most common approach, where the system is described at register-transfer level (RTL) using a hardware description language such as VHDL or Verilog.

Logic synthesis - The hardware description is compiled to gate level using logic synthesis. Gate primitives are provided in the form of a standard cell library, describing the functional and physical properties of each cell. The output from logic synthesis is a netlist containing standard cells, macro-cells, and interconnects. The netlist is verified against the RTL description.

Physical placement and routing - The netlist generated after logic synthesis contains logic cells, whose physical properties are extracted from the standard cell library. A floorplan is specified, which describes the physical placement of IO-cells, IP macro-cells, and standard cells, as well as power planning. The cells are then placed according to the floorplan, and finally the design is routed.


The design flow is an iterative process, where changes to one step may require the designer to go back to a previous step in the flow. The virtual hardware prototype is one such example, since it performs design exploration based on parameters given by the initial specification. Therefore, architectural problems discovered during virtual prototyping may require the specification to be revised. Furthermore, the virtual prototype is not only a reference design for the structural RTL generation, but also a simulation and verification platform for software development. Changing the RTL implementation requires that the virtual platform is updated correspondingly, to avoid inconsistency between the embedded software and the final hardware.

2.5.1 Architectural Exploration

For decades, RTL has been the natural abstraction level for hardware designers, which is completely different from how software abstractions have developed over the same time. The increasing demand for unified platforms for design exploration and early software development is pushing for higher abstractions in electronic design automation (EDA). To bridge the hardware-software gap, electronic system level (ESL) is a recent design methodology for modeling and implementing systems at higher abstraction levels [43]. The motivation is to raise the abstraction level, to increase productivity, and to manage increasing design complexity. Hence, efficient architectural exploration and system performance analysis require virtual prototyping combined with models described at appropriate abstraction levels. Virtual prototyping is the emulation of an existing or non-existing hardware platform in order to analyze the system performance and estimate hardware requirements. Hence, models for virtual prototyping need to be abstracted and simplified to achieve high-performance exploration. An exploration environment and models for virtual prototyping are presented in Part III.

Modeling Abstraction Levels

RTL simulation is over-detailed when it comes to functional verification, which results in long run-times, limited coverage of system behavior, and poor observability into how the system executes. In contrast, the ESL design methodology promotes the use of appropriate abstractions to increase the knowledge and understanding of how a complex system operates [44]. The required level of detail is hence defined by the current modeling objectives. In addition, large systems may require simulations described as a mix of different abstraction levels [45]. Common modeling abstractions are presented in Figure 2.6, which illustrates how simulation accuracy is traded for higher simulation speed, and can be divided into the following categories:


Figure 2.6: Abstraction levels in the ESL design space, ranging from abstract transaction-level modeling to pin- and timing-accurate modeling.

Algorithm - The algorithm design work is the high-level implementation of a system, usually developed as Matlab or C code. At this design level, communication is modeled using shared variables and processing is based on sequential execution.

Transaction level - Transaction-level modeling (TLM) is an abstraction level that separates functionality from communication [46]. Transactions, i.e. the communication primitives, are initiated using function calls, which results in a higher simulation speed than traditional pin-level modeling. TLM also enables simplified co-simulation between different systems, since communication is handled by more abstract mechanisms. In this case, adapters are used to translate between different TLM and pin-accurate interfaces.

Cycle callable - A cycle-callable (CC) model is also referred to as a cycle-approximate or loosely timed model. A cycle-callable model emulates time only on its interface, hence gaining speed by using untimed simulation internally. Simulations of embedded systems may use cycle-callable models for accurate modeling of bus transactions, while at the same time avoiding the simulation overhead from the internal behavior inside each bus model.

Cycle accurate - A cycle-accurate (CA) model uses the same modeling accuracy at clock cycle level as a conventional RTL implementation. Cycle-accurate models are usually also pin-accurate.

Timing accurate - A timing-accurate simulation is required to model delays in logic and wires. Timing can be annotated to the netlist produced after logic synthesis or after place and route, but is sometimes required during the RTL design phase, for example to simulate the complex timing constraints in memory models.

Figure 2.7: (a) The programmer's view of an embedded system, where the hardware is register-true, but not clock cycle accurate. (b) The hardware architect's view of an embedded system. The design is cycle- and pin-accurate.

During the implementation phase, models sometimes need to be refined by providing more details to an initial high-level description [42]. For example, at the algorithmic level there is no information about how a computational block receives data. This is the system view of the algorithmic designer. When the model is converted into a hardware description, the communication is initially modeled using transactions on a bus, which is the programmer's view of a system [47]. Software development only requires the register map and memory addressing to be accurate, and depends less on the actual hardware timing. In contrast, the hardware architect's view requires clock cycle accurate simulation to verify handshaking protocols and data synchronization. The software and hardware views are illustrated in Figure 2.7(a-b), respectively.
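The function-call style of transaction-level modeling described above can be sketched as follows. The class and method names are illustrative (the `b_transport` name loosely echoes SystemC TLM conventions, but this is not a SystemC implementation): a read or write is one routed function call, with no pins or clock edges simulated.

```python
class Memory:
    """TLM-style memory model: transactions are plain function calls."""
    def __init__(self, size):
        self.data = [0] * size

    def b_transport(self, write, addr, value=0):
        # One call models a complete bus transaction, with no pin activity.
        if write:
            self.data[addr] = value
            return None
        return self.data[addr]

class Bus:
    """Routes transactions to slaves by address decoding."""
    def __init__(self):
        self.slaves = []                      # (base, size, slave) decode table

    def attach(self, base, size, slave):
        self.slaves.append((base, size, slave))

    def transact(self, write, addr, value=0):
        for base, size, slave in self.slaves:
            if base <= addr < base + size:
                return slave.b_transport(write, addr - base, value)
        raise ValueError("bus decode error")

bus = Bus()
bus.attach(0x0000, 256, Memory(256))
bus.transact(True, 0x0010, 42)               # write: a single function call
assert bus.transact(False, 0x0010) == 42     # read back over the "bus"
```

This is the programmer's view: the register map and addressing are accurate, while handshaking and clock-cycle timing are deliberately absent.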

2.5.2 Virtual Platforms

A virtual platform is a software simulation environment that emulates a complete hardware system [48]. It is a complement to functional verification on programmable logic devices, i.e. FPGAs, but introduces additional features that cannot be addressed using hardware prototyping systems. Since a virtual platform is available before the actual hardware has been developed, it enables early software development and the possibility to locate system bottlenecks and other design issues early in the design process. Hence, exploration of the hardware architecture using a virtual platform has several advantages over hardware-based prototyping, including:


• Precise control - In contrast to hardware-based verification, software simulation makes it possible to freeze the execution in all simulation modules simultaneously, at clock cycle resolution.

• Full visibility - All parts of a virtual platform can be made completely visible and analyzed using unlimited traces, while hardware-based verification is limited to hardware probing and standardized debug interfaces.

• Deterministic execution - A software simulation is completely deterministic, where all errors and malfunctions are reproducible. A reproducible simulation is a requirement for efficient system development and debugging.

However, controllability and visibility come at the price of longer execution times. Hence, hardware-based prototyping using FPGA platforms is still an important part of ASIC verification, but is introduced later in the design flow.

2.5.3 Instruction Set Simulators

A majority of digital systems include one or more embedded processors, which require additional simulation models for the virtual platform. A processor simulation model is referred to as an instruction-set simulator (ISS), and is a functional model of a specific processor architecture. The ISS executes binary instructions in the same way as the target processor; hence all transactions on the bus correspond to the actual hardware implementation. An ISS-based virtual platform normally reaches a simulation performance in the range of 1-10 MIPS, and academic work targeting automatic ISS generation is presented in [49]. Faster simulation is possible using a compiled ISS, where the target processor instructions are converted into native processor instructions [50]. In contrast to an interpreted instruction-set simulator, a compiled ISS can typically reach a simulation performance of 10-100 MIPS.
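The core of an interpreted ISS is a fetch-decode-execute loop. The sketch below uses an invented three-instruction ISA (not any real processor) to show why interpretation is slow: every target instruction costs a dispatch through the decoder on every execution, which is exactly the overhead a compiled ISS removes.

```python
def run(program, steps):
    """Interpreted ISS sketch: fetch, decode, execute one instruction per step."""
    regs, pc = [0] * 4, 0
    for _ in range(steps):
        op, rd, arg = program[pc]        # fetch + decode
        if op == "LDI":                  # load immediate into register rd
            regs[rd] = arg
        elif op == "ADD":                # regs[rd] += regs[arg]
            regs[rd] += regs[arg]
        elif op == "JMP":                # absolute jump
            pc = arg
            continue
        pc += 1
    return regs

prog = [("LDI", 0, 5), ("LDI", 1, 7), ("ADD", 0, 1)]
assert run(prog, 3) == [12, 7, 0, 0]     # r0 = 5 + 7 after three instructions
```

A compiled ISS would instead translate `prog` once into native code, paying the decode cost a single time per instruction rather than per execution.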

Given this background, the virtual platform and ISS concepts are further discussed in Part III, which proposes a design environment and a set of scalable simulation models to evaluate reconfigurable architectures.


Chapter 3

Digital Holographic Imaging

In 1947, the Hungarian scientist Dennis Gabor developed a method and theory to photographically create a three-dimensional recording of a scene, commonly known as holography [51]. For his work on holographic methods, Dennis Gabor was awarded the Nobel Prize in physics in 1971. A holographic setup is shown in Figure 3.1, and is based on a coherent light source. The interference pattern between light from a reference wave and reflected light from an object illuminated with the same light source is captured on a photographic film (holographic plate). Interference between two wave fronts cancels or amplifies the light in each point on the holographic film, which is called destructive and constructive interference, respectively.

A recorded hologram has certain properties that distinguish it from a conventional photograph. In a normal camera, the intensity (amplitude) of the light is captured, and the developed photograph is directly visible. The photographic film in a holographic setup captures the interference, or phase difference, between two waves [52]. Hence, both amplitude and phase information are stored in the hologram. By illuminating the developed photographic film with the same reference light as used during the recording phase, the original image is reconstructed and appears three-dimensional.

Figure 3.1: The reflected light from the scene and the reference light creates an interference pattern on a photographic film.

The use of photographic film in conventional holography makes it a time-consuming, expensive, and inflexible process. In digital holography, the photographic film is replaced by a high-resolution digital image sensor to capture the interference pattern. The interference pattern is recorded in a similar way as in conventional holography, but the reconstruction process is completely different. Reconstruction requires the use of a signal processing algorithm to transform the digital recordings into visible images [1,53]. The advantage over conventional holography is that it only takes a fraction of a second to capture and digitally store the image, instead of developing a photographic film. The downside is that current sensor resolution is in the range of 300-500 pixels/mm, whereas the photographic film contains 3000 lines/mm.

A holographic setup with a digital sensor is illustrated in Figure 3.2. The light source is a 633 nm Helium-Neon laser, which is divided into separate reference and object beams using a beam splitter. The object beam passes through the object and creates an interference pattern with the reference beam on the digital image sensor. The interference pattern (hologram) is used to digitally reconstruct the original object image.

3.1 Microscopy using Digital Holography

Digital holography has a number of interesting properties that have proven useful in the field of microscopy. The study of transparent specimens, such as organic cells, conventionally requires either the use of non-quantitative measurement techniques, or the use of contrast-increasing staining methods that could be harmful to the cells. This issue can be addressed with a microscope based on digital holography, introducing a non-destructive and quantitative measurement technique to study living cells over time [54].

Figure 3.2: The experimental holographic setup. The reference, object, and interference patterns are captured on a high-resolution digital image sensor, instead of using a photographic film as in conventional holography. The images are captured by blocking the reference or object beam.

Most biological specimens, such as cells and organisms, are almost completely colorless and transparent [55]. In conventional optical microscopy, this results in low-contrast images of objects that are nearly invisible to the human eye, as illustrated in Figure 3.3(a). The reason is that the human eye only measures the light amplitude (energy), but the phase shift still carries important information. In the beginning of the 1930s, the Dutch physicist Frits Zernike invented a technique to enhance the image contrast in microscopy, known as phase contrast, for which he received the Nobel Prize in physics in 1953. The invention is a technique to convert the phase shift in a transparent specimen into amplitude changes, in other words making invisible images visible [56]. However, the generated image does not provide the true phase of the light, only improved contrast, as illustrated in Figure 3.3(b).

An alternative to phase-contrast microscopy is to instead increase the contrast of the cells. A common approach is to stain the cells using various dyes, such as Trypan blue or Eosin. However, staining is not only a cumbersome and time-consuming procedure, but can also be harmful to the cells. Therefore, experiments to study growth, viability, and other cell characteristics over time require multiple sets of cell samples. This prevents the study of individual cells over time, and instead relies on the assumption that each cell sample has similar growth rate and properties. In addition, the staining itself can generate artifacts that affect the result. In Figure 3.4(a), staining with Trypan blue has been used to identify dead cells.

In contrast, digital holography provides a non-invasive method and does not require any special cell preparation techniques [57]. The following sections discuss the main advantages of digital holography in the field of microscopy, namely the true phase shift, the software autofocus, and the possibility to generate three-dimensional images.


Figure 3.3: Images of transparent mouse fibroblast cells. (a) Conventional microscopy. (b) Phase-contrast microscopy. (c) Reconstructed phase information from the holographic microscope. (d) Unwrapped and background-processed holographic image with true phase. (e) 3D view from the holographic microscope. (f) Close-up from image (d).

Figure 3.4: To the left a dead cell and to the right a living cell. (a) Using Trypan blue staining in a phase-contrast microscope to identify dead cells. Dead cells absorb the dye from the surrounding fluid. (b) Using phase information in non-invasive digital holography to identify dead cells. (c) Phase shift when parallel light passes through a cell containing regions with different refractive index.


Figure 3.5: Reconstructed images of a USAF resolution test chart evaluated at distances from −200 µm to 50 µm in steps of 50 µm from the original position.

3.1.1 Phase Information

The true phase shift is of great importance for quantitative analysis. Since the phase reveals changes to the light path, it can be used to determine either the integral refractive index of a material with known thickness, or the thickness in a known medium. The phase shift is illustrated in Figure 3.4(c). In digital holography, the obtained interference pattern contains both amplitude and phase information. The amplitude is what can be seen in a conventional microscope, while the phase information can be used to generate high-contrast images and to enable quantitative analysis. The analysis can yield information about refractive index, object thickness, object volume, or volume distribution. Figure 3.4(a-b) shows images obtained with a phase-contrast microscope and a holographic microscope, respectively.
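As a worked example of the thickness measurement, a phase shift Δφ through a medium with refractive-index difference Δn corresponds to a thickness d = Δφ·λ/(2π·Δn). The Δn and Δφ values below are invented for illustration; only the 633 nm wavelength comes from the setup described earlier.

```python
import math

wavelength = 633e-9      # He-Ne laser wavelength from the holographic setup
delta_n = 0.04           # assumed refractive-index difference, cell vs medium
delta_phi = math.pi      # assumed measured phase shift in radians

# d = delta_phi * lambda / (2 * pi * delta_n)
thickness = delta_phi * wavelength / (2 * math.pi * delta_n)
print(f"object thickness: {thickness * 1e6:.2f} um")
```

With these assumed numbers the result is about 7.9 µm, a plausible scale for a biological cell, which illustrates how the true phase turns a contrast image into a quantitative measurement.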

3.1.2 Software Autofocus

In conventional microscopes, the focus position is manually controlled by a mechanical lens system. The location of the lens determines the focus position, and the characteristics of the lens determine the depth of field. The image is sharp only at a certain distance, limited by the properties of the lens, and the objects that are currently in focus are defined to be in the focal plane. Naturally, images captured in conventional microscopy only show a single focal plane, which may cause parts of the image to be out of focus.


In digital holography, individual focal planes are obtained from a single recording by changing a software parameter inside the reconstruction algorithm. Hence, no mechanical movements are required to change the focus position. As an example, Figure 3.5 shows the reconstruction of a resolution test chart in different focal planes, in this case either in focus or out of focus. Microscope recordings can thus be stored and transmitted as holographic images, which simplifies remote analysis with the ability to interactively change focal plane and magnification.

3.1.3 Three-Dimensional Visualization

Phase and amplitude information enable visualization of captured images in three dimensions. The phase information reveals the optical thickness of an object, and can be used directly to visualize the cell topology. To avoid visual artifacts, it is assumed that cells grow on a flat surface, and that there is minimal overlap between adjoining cells. Examples of topology images are shown in Figure 3.3(e-f).

Extraction of accurate three-dimensional information requires more complex image processing, and involves the analysis of a stack of reconstructed amplitude images obtained at different focal planes [58]. Each reconstructed image is processed to identify and extract the region that is currently in focus. The resulting stack of two-dimensional shapes is joined into a three-dimensional object. The same technique can be used to merge different focal planes to produce an image where all objects are in focus [59].

3.2 Processing of Holographic Images

This section presents the numerical reconstruction, phase unwrapping, and image analysis required to enable non-invasive biomedical imaging. Numerical reconstruction is the process of converting holographic recordings into visible images, and is the main focus of this work. Reconstruction is followed by phase unwrapping, which is a method to extract true phase information from images that contain natural discontinuities. Finally, image analysis is briefly discussed to illustrate the potential of the holographic technology, especially to study cell morphology [55].

3.2.1 Numerical Reconstruction

Three different images are required to reconstruct one visible image. In consecutive exposures, the hologram ψh, object ψo, and reference light ψr are captured on the digital image sensor. First, the hologram is captured by recording the interference between the object and reference light. Then, by blocking either the object or the reference beam, the reference and object light are captured, respectively. An example of captured holographic images is shown in Figure 3.6, together with a close-up of the recorded interference fringes in the hologram.

Figure 3.6: (a) Reference image ψr. (b) Object image ψo. (c) Hologram or interference image ψh. (d) Close-up of interference fringes in a holographic image.

The images captured on the digital image sensor need to be processed using a signal processing algorithm, which reconstructs the original and visible image. The reference and object light are first subtracted from the hologram, and the remaining component is referred to as the modified interference pattern ψ

ψ(ρ) = ψh(ρ) − ψo(ρ) − ψr(ρ), (3.1)

where ρ represents the (x, y) position on the sensor. A visible image of the object (the object plane) is reconstructed from ψ(ρ), which is the object field captured by the digital sensor (the sensor plane), as illustrated in Figure 3.7. The reconstruction algorithm, or inversion algorithm, can retrofocus the light captured on the sensor to an arbitrary object plane, which makes it possible to change the focus position in the object. This is equivalent to manually moving the object up and down in a conventional microscope.

An image is reconstructed by computing the Rayleigh-Sommerfeld diffraction integral as

Ψ(ρ′) = ∫_sensor ψ(ρ) e^(−ik|r−r_r|) e^(ik|r−r′|) dS_ρ, (3.2)


Figure 3.7: (a) Definition of vectors from the origin of coordinates to the sensor surface, reference point, and object plane. (b) Sensor and object planes. Modifying the z-position of the object plane changes the focus position.

where k = 2π/λ is the angular wave number for a wavelength λ, and the reference field is assumed to be spherical for simplicity [60]. By specifying a point of origin, r′ represents the vector to any point in the object plane and r represents the vector to any point in the sensor plane, as illustrated in Figure 3.7. The three-dimensional vectors can be divided into a z component and an orthogonal two-dimensional vector ρ representing the x and y positions, as r = ρ + zẑ. The distance z′ specifies the location of the image plane to be reconstructed, whereas z is the distance to the sensor. The integral in (3.2) can be expressed as a convolution

Ψ(ρ′) = ψ1 ∗ G, (3.3)

where

ψ1(ρ) = ψ(ρ) e^(−ik√(|z−zr|² + |ρ−ρr|²))   and   G(ρ) = e^(ik√(|z−z′|² + |ρ|²)).

The discrete version of the integral with an equidistant grid, equal to the sensor pixel size (∆x, ∆y), generates a discrete convolution of (3.3) that can be evaluated with the FFT [61] as

Ψ(ρ′) = F−1(F(ψ1) · F(G)). (3.4)

The size of the two-dimensional FFT needs to be at least the sum of the sensor size and the object size in each dimension. Higher resolution is achieved by shifting the coordinates a fraction of a pixel size and combining the partial results. Hence, an image with higher resolution requires several processing iterations, which increases the reconstruction time by a factor of N² for an interpolation of N times in each dimension. The image reconstruction requires three FFT calculations (3.4), while each additional interpolation only requires two FFT calculations, since only G(ρ) changes.

Figure 3.8: Approximations to reduce the computational requirements. (As the distance z between the object and sensor planes increases, the full vector solutions give way to the Rayleigh-Sommerfeld, Fresnel, and Fraunhofer (far-field) treatments; in the Fresnel region |r − r′| ≈ (z−z′) + ((x−x′)² + (y−y′)²)/(2(z−z′)), and in the far field, where z ≫ λ, |r − r′| ≈ r − r·r′/r.)
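The FFT-based evaluation of the convolution in (3.3)-(3.4) can be sketched with NumPy. This is an illustrative model, not the hardware implementation: the fields are random toy arrays, and the transform is zero-padded to a power of two as described above.

```python
import numpy as np

def fft_convolve2d(psi1, G):
    """Evaluate Psi = psi1 * G via Psi = F^-1(F(psi1) . F(G)), eq. (3.4).

    The FFT size is at least the sum of the two array sizes in each
    dimension (so the convolution is linear, not circular), rounded up
    to a power of two as in the text.
    """
    shape = [int(2 ** np.ceil(np.log2(a + b)))
             for a, b in zip(psi1.shape, G.shape)]
    Psi = np.fft.ifft2(np.fft.fft2(psi1, shape) * np.fft.fft2(G, shape))
    # Keep only the valid linear-convolution region
    return Psi[:psi1.shape[0] + G.shape[0] - 1,
               :psi1.shape[1] + G.shape[1] - 1]

# Tiny illustrative fields (sizes are arbitrary assumptions)
rng = np.random.default_rng(0)
psi1 = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
G = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
Psi = fft_convolve2d(psi1, G)
```

The same padded-FFT structure is what makes the 2048 × 2048 transform, rather than the pointwise multiplication, the dominant cost.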

Several approximations exist to simplify the reconstruction process, based on assumptions about the distance and angle between the image plane and the sensor plane. Figure 3.8 illustrates how different approximations can be applied as the distance between the object and sensor plane increases. The Fresnel approximation simplifies the relation between an arbitrary point in the image plane and an arbitrary point in the sensor plane [62]. In the far field, the Fraunhofer approximation is applicable, which further simplifies the distance equation [63]. The approximations are further discussed in Section 3.3.2, and used in Part I to reduce the computational requirements of an application-specific hardware accelerator.

3.2.2 Image Processing

The output data from the reconstruction algorithm is complex valued, represented as magnitude and phase. However, this means that the phase is constrained to an interval (−π, π], which is referred to as the wrapped phase. A wrapped phase image contains 2π discontinuities, and requires further processing to extract the unwrapped phase image. Figure 3.3(c) shows a wrapped phase image where changes in the cell thickness have resulted in discontinuities. Figure 3.3(d) shows the same image after phase unwrapping.
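The effect of wrapping and unwrapping can be illustrated along a single row of pixels with NumPy's one-dimensional np.unwrap; the two-dimensional algorithms discussed below are considerably more involved, so this is only a sketch of the underlying idea.

```python
import numpy as np

# A smoothly increasing "true" phase, e.g. along one row of a cell image
true_phase = np.linspace(0.0, 6.0 * np.pi, 200)

# The reconstruction only yields the wrapped phase, constrained to (-pi, pi]
wrapped = np.angle(np.exp(1j * true_phase))

# Itoh's method: integrate the wrapped phase differences along the path,
# adding or subtracting 2*pi whenever a jump exceeds pi
unwrapped = np.unwrap(wrapped)
```

This one-dimensional integration only works when adjacent samples differ by less than π; noise and undersampling are exactly what make the two-dimensional algorithms below nontrivial.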

Several algorithms exist to compute phase unwrapping, with varying complexity in terms of processing speed and accuracy [64]. Phase-unwrapping algorithms are mainly divided into two groups: path-following and minimum-norm. Path-following algorithms, such as Goldstein's and Flynn's algorithms [65], identify paths along which to integrate. These algorithms are less sensitive to noise


but their run-time is not deterministic. Minimum-norm algorithms, based on the discrete cosine transform (DCT) or discrete Fourier transform (DFT), are faster but more sensitive to noise. Algorithm selection is a trade-off between run-time and accuracy, where faster algorithms are useful for real-time visualization, while more accurate algorithms are suitable for quantitative analysis.

Image reconstruction followed by phase unwrapping produces amplitude and phase images from the obtained hologram. Hence, the images are ready to be further processed for more advanced analysis. The imaging operations commonly desired to study, quantify, classify, and track cells and other biological specimens are:

Cell visualization - Providing intuitive ways of visualizing cell information is desirable, and has previously been discussed in Section 3.1.3. Three-dimensional imaging provides depth information, and the use of two-dimensional image superposition of amplitude and phase may illustrate and highlight different cell properties.

Cell quantification - The possibility to automatically measure cells and cell samples is favorable over manual, time-consuming, and error-prone methods. Cell counting and viability studies are today highly manual procedures, but can be automated using holographic technology. Other desirable measurements are cell area, cell coverage, cell/total volume, and cell density.

Cell classification - Classification is a statistical procedure of placing individual cells into groups, based on quantitative information and feature extraction. The cell is matched against a database, containing training sets of previously classified cells, to find the most accurate classification. In biomedical imaging, this can be used to analyse the distribution of different cell types in a sample.

Cell morphology - Morphology refers to the change in a cell's appearance, including properties such as shape and structure. Morphology can be used to study the development of cells, and to determine a cell's condition and internal health [55]. Due to the non-invasive properties of a holographic microscope, it has been shown to be a suitable candidate for time-lapse studies, i.e. the study of cells over time.

Cell tracking - Morphology requires each cell to be studied individually, hence the location of each cell must be well-defined. However, it is assumed that cells will not only change their physical properties over time, but also their individual location. Therefore, tracking is required


to study cells over time. In biomedical imaging, this information can also be used to study cell trajectories.

3.3 Acceleration of Image Reconstruction

As previously discussed, the reconstruction process is computationally demanding, and the use of a standard computer is not a feasible solution to perform image reconstruction in real-time. In a microscope, a short visual response time is required to immediately reflect changes to the object position, illumination, or focus point. It is also useful to study real-time events, for example when studying a liquid flow. The human perception of real-time requires an update rate in the range of 20 to 60 frames per second (fps), depending on the application and situation. However, for a biological application a frame rate around frate = 25 fps would be more than sufficient.

The sensor frame rate is one of the factors that affect the required system performance. Other parameters are the sensor resolution and the image reconstruction time. These properties are discussed in the following sections to specify the requirements for the hardware accelerator.

3.3.1 Digital Image Sensors

Digital image sensors are available in two technologies, either as charge-coupled devices (CCD) or as complementary metal oxide semiconductors (CMOS). The CCD sensor has superior image quality, but the CMOS sensor is more robust and reliable. The CCD is based on a technique to move charges from the sensor array and then convert the charge into voltage. In contrast, CMOS sensors convert the charge to voltage directly at each pixel, and also integrate control circuitry inside the device. A comparison between image sensor technologies is presented in [66].

The term resolution is often used to indicate the number of pixels on the image sensor array, while the term pixel pitch refers to the size of each active element in the array. Over the past few years the resolution of digital sensors has continuously increased, whereas the pixel pitch consequently has decreased to comply with standardized optical formats.

The frame rate of digital image sensors highly depends on the sensor resolution. In a digital video camera the real-time constraints require a high frame rate for the images to appear as motion video. However, sensors with high resolution are often not able to produce real-time video. This limitation is addressed either by lowering the resolution by reading out fewer pixels from the sensor array, or by using binning functionality to cluster neighboring pixels.
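As a sketch of the binning idea, neighboring pixels can be clustered by summing small neighborhoods of the pixel array; the 2 × 2 factor and the toy 4 × 4 frame below are illustrative assumptions.

```python
import numpy as np

def bin_pixels(frame, factor=2):
    """Cluster factor x factor neighborhoods into single pixels by summing,
    trading spatial resolution for frame rate (and light sensitivity)."""
    h, w = frame.shape
    h, w = h - h % factor, w - w % factor  # crop to a multiple of factor
    return (frame[:h, :w]
            .reshape(h // factor, factor, w // factor, factor)
            .sum(axis=(1, 3)))

frame = np.arange(16, dtype=np.uint16).reshape(4, 4)
binned = bin_pixels(frame, 2)  # a 4x4 sensor read out as 2x2
```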

In digital holography, both high resolution and high frame rate are required.


Table 3.1: Algorithmic complexity (two-dimensional FFTs) and required throughput for a holographic system with a resolution of 2048 × 2048 pixels and a frame rate of 25 fps.

  Properties                          Rayleigh-Sommerfeld   Fresnel/Fraunhofer approx.
  Algorithmic complexity*             2N² + 1               1
  Software run-time** (s)             ≈ 82                  ≈ 1.1
  Required throughput (samples/s)     15.3G                 210M

  * Number of 2D-FFTs required. The original algorithm requires refinement with N = 6.
  ** Based on the FFTW 2D-FFT [67] on a Pentium-4 2.0 GHz.

For the current application, a 1.3-Mpixel CMOS sensor with a video frame rate of 15 fps has been selected. This is a trade-off to be able to produce high-quality images at close to video speed. However, it is likely that future image sensors will enable combinations of higher resolution and higher frame rate.

3.3.2 Algorithmic Complexity

Table 3.1 compares the Rayleigh-Sommerfeld diffraction integral and the approximations in terms of complexity, run-time, and required throughput for real-time performance. The two-dimensional FFT is assumed to dominate the reconstruction time, hence the comparison has been limited to a part of the complete reconstruction process. For the complexity analysis, the Rayleigh-Sommerfeld diffraction integral is evaluated using 3 FFT calculations, with an additional 2 FFTs required for each refinement step. In contrast, only a single FFT is required when applying the Fresnel and Fraunhofer approximations, as shown in Part I.

The two-dimensional FFT is evaluated on arrays that are larger than the actual sensor size. The FFT size must be at least the sum of the sensor size and the object size in each dimension. The digital image sensor used in this work has a resolution of 1312 × 1032 pixels. The object size can be at most the size of one quadrant of the sensor array, extending the processing dimensions to 1968 × 1548 pixels. However, the FFT size must be selected and zero-padded to a size of 2^n, where n is an integer. The closest option is n = 11, which corresponds to an FFT of size 2048 × 2048 points. This defines NFFT = 2048.
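The size selection above amounts to rounding the extended processing dimensions up to the next power of two, which can be sketched as:

```python
def next_pow2(n):
    """Smallest power of two >= n, used as the zero-padded FFT size."""
    return 1 << (n - 1).bit_length()

sensor = (1312, 1032)              # sensor resolution in pixels
quadrant = (656, 516)              # object at most one sensor quadrant
extended = tuple(s + q for s, q in zip(sensor, quadrant))   # (1968, 1548)
N_FFT = max(next_pow2(d) for d in extended)                 # square 2^n FFT
```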

The frame rate in this application is a soft constraint. Based on the current sensor device and the above definition of real-time video, the desired frame rate is in the range of 15-25 fps. However, for the algorithmic complexity analysis


Figure 3.9: Feasible design space points (grey) that comply with the system requirements. (a) Clock frequency fCLK (MHz) versus throughput Tcc (samples per clock cycle). (b) Area A (million equivalent gates) versus Tcc, ranging from more sequential (folded) to more parallel architectures. Area is estimated based on similar designs.

in Table 3.1, a real-time video frame rate of 25 fps is assumed. The table shows the run-time in software and the required throughput to meet the real-time constraints.

3.3.3 Hardware Mapping

For hardware mapping, the approximations are applied, and the complexity reduction is further explained in Part I. To estimate the hardware requirements, the throughput of the two-dimensional FFT is used. However, due to the large transform size, the implementation of the two-dimensional FFT is separated into one-dimensional FFTs, first applied over rows and then over columns [68]. Hence, 2 × 2048 transforms of size 2048 must be computed for each frame, resulting in a required throughput of

2N²FFT × frate ≈ 210 Msample/s,

where FFT size NFFT and frame rate frate are given by the specification. The choice of hardware architecture and the selected system clock frequency control the throughput, which results in the following relations

fclk × Tcc = 2N²FFT × frate,
A ∝ Tcc,

where fclk is the clock frequency, and Tcc is the scaled throughput in samples per clock cycle. It is assumed that the required area (complexity) A is proportional to the throughput. An attempt to double the throughput for the

Page 60: Design of Reconfigurable Hardware Architectures for Real-time Applications

40 CHAPTER 3. DIGITAL HOLOGRAPHIC IMAGING

FFT algorithm, for given operational conditions such as frequency and supplyvoltage, would result in twice the amount of computational resources.

The equations enable design space exploration based on the variable parameters. Samples per clock cycle relate to architectural selection, while frequency corresponds to operational conditions. Figure 3.9 shows points in the design space that comply with the specification. Both graphs use the horizontal axis to represent the architectural scaling, ranging from time-multiplexed (folded) to more parallel architectures. Figure 3.9(a) shows the clock frequency required for a certain architecture. Since the clock frequency has an upper feasible limit, it puts a lower bound on Tcc. Figure 3.9(b) shows the area requirement, which is assumed to grow linearly with the architectural complexity, and puts an upper bound on Tcc. Hence, Tcc is constrained by the system clock frequency and the hardware area requirement. The graphs can be used to select a suitable and feasible architecture, where Tcc = 1 seems to be a reasonable candidate. Based on this estimated analysis, the architectural decisions are further discussed in Part I.
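These design-space relations can be explored numerically. In the sketch below, the candidate Tcc values, the clock-frequency limit, and the area constants are illustrative assumptions chosen to mirror the bounds in Figure 3.9, not figures from the thesis.

```python
N_FFT, f_rate = 2048, 25
required = 2 * N_FFT**2 * f_rate          # samples/s, ~210 Msample/s

F_MAX = 400e6       # assumed upper bound on feasible clock frequency (Hz)
A_MAX = 3.0         # assumed upper bound on area (M eq. gates)
AREA_PER_TCC = 1.0  # assumed proportionality constant in A = k * Tcc

feasible = []
for t_cc in (0.25, 0.5, 1.0, 2.0, 4.0):   # samples per clock cycle
    f_clk = required / t_cc               # fclk * Tcc = 2 * N_FFT^2 * f_rate
    area = AREA_PER_TCC * t_cc            # A proportional to Tcc
    if f_clk <= F_MAX and area <= A_MAX:
        feasible.append((t_cc, f_clk, area))
```

Under these assumed bounds, Tcc = 1 lands comfortably inside the feasible region, consistent with the choice made in the text.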


Part I

A Hardware Acceleration Platform for Digital Holographic Imaging

Abstract

A hardware acceleration platform for image reconstruction in digital holographic imaging is proposed. The hardware accelerator executes a computationally demanding reconstruction algorithm which transforms an interference pattern captured on a digital image sensor into visible images. The focus of this work is to maximize computational efficiency, and to minimize the external memory transfer overhead as well as the required internal buffering. We present an efficient processing datapath with a fast transpose unit and an interleaved memory storage scheme. The proposed architecture results in a speedup by a factor of 3 compared with the traditional column/row approach for calculating the two-dimensional FFT. Memory sharing between the computational units reduces the on-chip memory requirements by over 50%. The custom hardware accelerator, extended with a microprocessor and a memory controller, has been implemented on a custom-designed FPGA platform and integrated in a holographic microscope to reconstruct images. The proposed architecture, targeting a 0.13µm CMOS standard cell library, achieves real-time image reconstruction with over 30 frames per second.

Based on: T. Lenart, M. Gustafsson, and V. Öwall, "A Hardware Acceleration Platform for Digital Holographic Imaging," Springer Journal of Signal Processing Systems, DOI: 10.1007/s11265-008-0161-2, Jan. 2008.


1 Introduction

In digital holography, the photographic film used in conventional holography is replaced by a digital image sensor to obtain a digital hologram, as shown in Figure 1(a). A reconstruction algorithm processes the images captured by the digital sensor to create a visible image, which in this work is implemented using a hardware acceleration platform, as illustrated in Figure 1(b).

A hardware accelerator for image reconstruction in digital holographic imaging is presented. The hardware accelerator, referred to as Xstream, executes a computationally demanding reconstruction algorithm based on a 2048 × 2048-point two-dimensional FFT, which requires a substantial amount of signal processing. It will be shown that a general-purpose computer is not capable of meeting the real-time constraints, hence a custom solution is presented.

The following section presents how to reduce the complexity of the reconstruction algorithm, as presented in Chapter 3.2.1, and how it can be adapted for hardware implementation. System requirements and hardware estimates are discussed in Section 2, and Section 3 presents the proposed hardware architecture and the internal functional units. In Section 4, the hardware accelerator is integrated into an embedded system, which is followed by simulation results and proposed optimizations in Section 5. Results and comparisons are presented in Section 6, and finally conclusions are drawn in Section 7.

1.1 Reduced Complexity Image Reconstruction

As discussed in Chapter 3.2.1, the Rayleigh-Sommerfeld diffraction integral is compute-intensive since it requires three two-dimensional FFTs to be evaluated. However, by observing that |r′| ≪ |r|, the Fraunhofer or far-field approximation [63] simplifies the relation between the sensor plane and object plane as

|r − r′| ≈ r − r · r′/r,

where r = |r|. This changes the diffraction integral in (3.2) to

Ψ(ρ′) = ∫_sensor ψ(ρ) e^(−ik|r−rr|) e^(ik|r−r′|) dSρ
      ≈ ∫_sensor ψ(ρ) e^(−ik(|r−rr|−r)) e^(−ik r·r′/r) dSρ
      = ∫_sensor ψ(ρ) e^(−ik(|r−rr|−r)) e^(−ikzz′/r) e^(−ikρ·ρ′/r) dSρ
      = ∫_sensor ψ2(ρ) e^(−ikρ·ρ′/r) dSρ. (1)


Figure 1: (a) Simplified holographic setup capturing the light from a laser source on a digital sensor. By blocking the reference or the object beam, a total of three images are captured, representing the reference (ψr), the object (ψo), and the hologram (ψh). (b) The three captured images are processed by the presented hardware accelerator to reconstruct visible images on a screen.

Assuming r ≈ z, which is a constant, the integral in (1) is the definition of the two-dimensional FFT, where

ψ2(ρ) = ψ(ρ) e^(−ik(|r−rr|−r)) e^(−ikzz′/r). (2)

The first exponential term in (2) can be removed since it only affects the object location in the reconstructed image. Instead, the coordinates of the object location can be modified after reconstruction. The image reconstruction algorithm with reduced complexity requires only a single FFT as

Ψ(ρ′) ≈ F(ψ(ρ) e^(−ikzz′/r)), (3)

where α(ρ) = kzz′/r is referred to as the phase factor.
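A minimal NumPy sketch of the reduced-complexity reconstruction (3) is given below; the wavelength, pixel pitch, distances, and the input pattern are all illustrative assumptions, not the thesis parameters.

```python
import numpy as np

def reconstruct_far_field(psi, wavelength, pitch, z, z_prime):
    """Reduced-complexity reconstruction, eq. (3):
    Psi ~ F( psi * exp(-i * alpha(rho)) ) with alpha(rho) = k*z*z'/r,
    where r = sqrt(z^2 + |rho|^2) is the distance to each sensor pixel."""
    k = 2.0 * np.pi / wavelength
    ny, nx = psi.shape
    y = (np.arange(ny) - ny // 2) * pitch
    x = (np.arange(nx) - nx // 2) * pitch
    xx, yy = np.meshgrid(x, y)
    r = np.sqrt(z**2 + xx**2 + yy**2)
    alpha = k * z * z_prime / r           # the phase factor alpha(rho)
    return np.fft.fftshift(np.fft.fft2(psi * np.exp(-1j * alpha)))

# Illustrative (assumed) parameters: 633 nm laser, 6 um pixel pitch
psi = np.ones((64, 64))                   # toy interference pattern
Psi = reconstruct_far_field(psi, 633e-9, 6e-6, z=0.01, z_prime=0.002)
```

The single FFT plus a pointwise complex rotation is exactly the structure that the Xstream pipeline in Section 3 maps to hardware.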

1.2 Algorithm Selection

The visible quality difference between using the convolution algorithm and applying the far-field approximation is shown in Figure 2(b-c). In this example, the convolution algorithm is interpolated N = 6 times in each dimension, which requires a total of 2N² + 1 = 73 FFTs to be evaluated, while the algorithm based on the far-field approximation requires only a single FFT. For the current application, the visual difference between the images is negligible. Hence, the algorithm with reduced complexity has been chosen due to the substantially lower computational effort. The possibility to evaluate convolutions is still


Figure 2: (a) Reconstructed images from the reference, object, and hologram captured by the digital sensor. (b) Close-up of the resulting image using the convolution reconstruction algorithm. (c) Close-up of the resulting image using the reduced-complexity reconstruction algorithm (far-field approximation). The difference in visual quality is negligible.

supported by our proposed architecture, and a special bidirectional pipeline FFT is presented in Section 3.3 for this purpose.

2 System Requirements

As discussed in Chapter 3.3, the performance of the two-dimensional FFT will limit the number of frames that can be reconstructed per second. The selected digital image sensor has a resolution of 1312 × 1032 pixels, with a pixel pitch of 6 × 6 µm, and a precision of 8 bits per pixel. The hardware accelerator is designed to satisfy the following specification:

frate = 25 fps
NFFT = 2048 points
Tcc = 1 sample/cc

Captured images and intermediate results during computation must be stored in external memory. Each captured image requires 1312 × 1032 × 8 bits ≈ 1.3 Mbytes of storage space, and a table of the same size is required to store pre-computed phase factors. Two intermediate memory areas, in 32-bit precision and with the same size as the two-dimensional FFT, are needed during processing. Hence, an estimate of the required amount of external memory is 4 · 1.3 + 2(4 · N²FFT) ≈ 37.2 Mbytes. Due to the large amount of external memory, the use of static RAM (SRAM) is not a feasible solution, and instead two parallel 512-Mbit Synchronous Dynamic RAM (SDRAM) devices have been used [69].

An estimate of the required processing speed can be found by analyzing the two-dimensional FFT, which is the most compute-intensive part of the


Figure 3: Dataflow divided into three processing steps, where grey boxes indicate memory transfers and white boxes indicate computation. (a) Combine, rotate, and one-dimensional FFT over rows. (b) One-dimensional FFT over columns and store peak amplitude. (c) Calculate magnitude or phase and scale result into pixels.

system. The FFT is a separable transform and can be calculated by consecutively applying one-dimensional transforms over rows and columns [68]. However, the three-dimensional organization of SDRAM devices into banks, rows, and columns results in non-uniform access times to the memory contents. Accessing different banks or consecutive elements inside a row has a low latency, while accessing data in different rows has a high latency. Therefore, the memory bandwidth is highly dependent on the access pattern, and data should be transferred in bursts accessing consecutive row elements [14]. Calculating the FFT over columns will consequently result in a longer latency for each memory access, resulting in a reduced throughput. An alternative, which is presented in Section 3.4, is to transpose the data as an intermediate step in the two-dimensional FFT. This improves the overall performance, but requires data to be transferred to and from memory three times to compute a single frame. With a dual memory system, supporting simultaneous reads and writes in burst mode, the system operating frequency must hence be in the range of 3 · N²FFT · 25 fps ≈ 300 MHz.

3 Proposed Architecture

Figure 3 shows the dataflow for the reconstruction algorithm and the additional processing to obtain the result image. Images recorded with the digital sensor are transferred and stored in external memory. The processing is divided into three sequential steps, each streaming data from external memory. Storing images or intermediate data in on-chip memory is not feasible due to the size of


Figure 4: Functional units in the Xstream accelerator. Grey boxes (M) indicate units that contain local memories. All processing units communicate using a handshake protocol, as illustrated between the DMA and the combine unit.

the two-dimensional FFT, which requires the processing flow to be sequential. Figure 3(a) shows the first processing step, where images are combined, each pixel vector is rotated with a phase factor according to (3), and one-dimensional FFTs are calculated over the rows. Figure 3(b) shows the second processing step, which calculates one-dimensional FFTs over the columns. The peak amplitude max(|Ψ|) is stored internally before streaming the data back to memory. In the final processing step, shown in Figure 3(c), the vector magnitude or phase is calculated to produce a visible image with 8-bit dynamic range.

3.1 Algorithm Mapping

The dataflow in Figure 3 has been mapped onto a pipeline architecture containing several 32-bit processing units, referred to as Xstream. The name Xstream denotes that the accelerator operates on data streams, with focus on maximizing the computational efficiency with concurrent execution, and optimizing global bandwidth using long burst transfers. Processing units communicate internally using a handshake protocol, with a valid signal from the producer and an acknowledge signal back from the receiver. Each of the units has local configuration registers that can be individually programmed to set up different operations, and a global register allows units to be enabled and disabled to bypass parts of the pipeline.

Figure 4 shows the Xstream pipeline, which streams data from external memories using DMA interfaces, where each interface can be connected to a separate bus or external memory to operate concurrently. In the first processing step, the recorded images are combined to compute the modified interference pattern. The combine unit is constructed from a reorder unit, to extract pixels from the same position in each image, and a subtract unit. The resulting


interference pattern is complex-valued with the imaginary part set to zero. This two-dimensional pixel vector is then rotated with the phase factor α(ρ), defined in Section 1.1, using a coordinate rotation digital computer (CORDIC) unit operating in rotation mode [70]. The CORDIC architecture is pipelined to produce one complex output value each clock cycle. Rotation is followed by a two-dimensional Fourier transformation, evaluated using a one-dimensional FFT separately applied over rows and columns. Considering the size of the two-dimensional FFT, data must be transferred to the main memory between row and column processing, which requires a second processing step. The buffer unit following the FFT is required to unscramble the FFT output, since it is produced in bit-reversed order, and to perform a spectrum shift operation to center the zero-frequency component. In the third processing step, the CORDIC unit is reused to calculate magnitude or phase images from the complex-valued result produced by the FFT. The output from the CORDIC unit is scaled using the complex multiplier (CMUL) in the pipeline, from floating-point format back to fixed-point numbers that represent an 8-bit grayscale image. The following sections present architectures and simulations of the functional units in detail. The simulation results are further analyzed in Section 5 to select design parameters.
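The rotation step can be illustrated with a floating-point CORDIC model in rotation mode. This is a behavioral sketch only: the hardware unit operates on fixed wordlengths in a pipelined structure, and the iteration count here is an assumed value.

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """CORDIC in rotation mode: rotate the vector (x, y) by 'angle'
    (radians, within the CORDIC convergence range of about +/-1.74)
    using only shift-like scalings and additions."""
    # Precompute the accumulated gain K = prod(1/sqrt(1 + 2^-2i))
    K = 1.0
    for i in range(iterations):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0              # rotation direction
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)            # angle still to rotate
    return K * x, K * y

# Rotating (1, 0) by pi/6 should approximate (cos(pi/6), sin(pi/6))
xr, yr = cordic_rotate(1.0, 0.0, math.pi / 6)
```

The same iteration, run in vectoring mode instead, yields magnitude and phase, which is why the unit can be reused in the third processing step.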

To efficiently supply the processing units with data, a high bandwidth to the external memories is required. The external SDRAM is burst-oriented, and the bandwidth depends on how data is accessed. Linear or row-wise access to the memory generates a high bandwidth, since the read/write latency for accessing sequential elements is low, while random and column-wise access significantly degrades the performance due to increased latency. Some algorithms, for example the two-dimensional FFT, require both row-wise and column-wise data access. Memory access problems are addressed in the following sections, where we propose modifications to the access pattern to optimize the memory throughput.

3.2 Image Combine Unit

The combine unit processes the three images captured by the digital sensor according to (3.1) in Chapter 3.2.1. An additional table (image) containing pre-calculated phase factors α(ρ) is stored in main memory, and is directly propagated through the combine unit to the CORDIC unit. Since all four images have to be accessed concurrently, the physical placement in memory is an important aspect. We first consider the images to be stored in separate memory regions, as shown in Figure 5(a). The read operation would then either be single reads accessing position (x, y) in each image, or four burst transfers from separate memory locations. In the former case, simply accessing


Figure 5: (a) The three captured images and the phase factor. (b) Example of how the captured images are stored interleaved in external memory. The buffer in the combine unit contains 4 groups holding 8 pixels each, requiring a total buffer size of 32 pixels. Grey indicates the amount of data copied to the input buffer in a single burst transfer.

the images pixel-by-pixel from a burst-oriented memory is inefficient, since it requires four single transfers for each pixel. For the latter approach, the burst transfers will require four separate DMA transfers, and each transfer will cause latency when accessing a new memory region. Accessing the data in one single burst transfer would be desirable, and therefore our approach is to combine images by reordering the physical placement in memory during image capture.

Instead of storing the images separately, they can be stored interleaved in memory directly when data arrive from the sensor, as shown in Figure 5(b). Hence, partial data from each image is fetched in a single burst transfer and stored in a local buffer. Thereafter, the data sequence is reordered using a modified n-bit address counter, from the linear address mode [a_{n−1} a_{n−2} ... a1 a0] to a reordered address mode [a1 a0 a_{n−1} ... a2] that extracts image information pixel-by-pixel as

in : {{ψh^(0...B−1)} {ψo^(0...B−1)} {ψr^(0...B−1)} {ψα^(0...B−1)}}
out : {{ψh^0 ψo^0 ψr^0 ψα^0} {...} {...} {ψh^(B−1) ψo^(B−1) ψr^(B−1) ψα^(B−1)}},

where B is the size of the block of pixels to read from each image. Hence, the internal buffer in the combine unit must be able to store 4B pixels. Using this scheme, both storing captured images and reading images from memory can be performed with single burst transfers. Figure 6 shows how the reorder operation depends on the burst size Nburst, a parameter that also defines the block size B and the required internal buffering. For a short burst length, it is more efficient to store the images separately, since the write operation can be


Figure 6: Average number of clock cycles for each produced pixel as a function of the read burst length Nburst (= buffer size), including both the write and the read operation to and from memory. Curves show separated images (read and write), interleaved images (read and write), and, as dashed lines, the separate write and read parts of the interleaved version. The simulation assumes that two memory banks are connected to the Xstream accelerator. (Memory device: [69], 2 × 16 bit, fmem = fbus = 133 MHz)

burst-oriented. When the burst length increases above 8, interleaving becomes a suitable alternative, but requires more buffer space. A good trade-off between speed and memory is where the curve that represents interleaving flattens out, which suggests a burst length between 32 and 64. Parameter selection is further discussed in Section 5.
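The modified address counter described above can be modeled in a few lines. In this sketch, n = 5 bits (so B = 8 pixels per image) is an illustrative choice; the two least significant bits of the linear counter become the image select, and the remaining bits select the pixel within a block.

```python
def reorder_addr(i, n):
    """Map a linear n-bit counter value [a_{n-1} ... a1 a0] to the
    reordered address [a1 a0 a_{n-1} ... a2]: the two LSBs select one
    of the four interleaved images, the rest select the pixel."""
    image = i & 0b11          # a1 a0 -> image select (h, o, r, alpha)
    pixel = i >> 2            # a_{n-1} ... a2 -> pixel index in the block
    return (image << (n - 2)) | pixel

# Local buffer holding one burst: B = 8 pixels from each of the 4 images
B = 8
buffer = ([f"h{p}" for p in range(B)] + [f"o{p}" for p in range(B)]
          + [f"r{p}" for p in range(B)] + [f"a{p}" for p in range(B)])

stream = [buffer[reorder_addr(i, n=5)] for i in range(4 * B)]
# stream reads pixel-by-pixel: h0, o0, r0, a0, h1, o1, r1, a1, ...
```

In hardware this is free: the counter bits are simply wired to the buffer address port in a different order.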

3.3 One-Dimensional Flexible FFT Core

The FFT is a separable transform, which means that a two-dimensional FFT can be calculated using a one-dimensional FFT applied consecutively over the x and y dimensions of a matrix. We present the implementation of a high-precision one-dimensional FFT using data scaling, which is used as the core component in the two-dimensional FFT presented in Section 3.4. The FFT implementation is based on a radix-2² single-path delay feedback architecture [71]. It supports data scaling to reduce the wordlength to only 10 bits, and supports input data in both linear and bit-reversed order.


[Figure 7 omitted: (a) pipeline of cascaded MR-2 units with delay feedback buffers, showing data I/O in linear order (0 1 2 3 ...) or bit-reversed order (0 8 4 12 ...); (b) internals of a modified radix-2 unit.]

Figure 7: (a) The FFT pipeline is constructed from modified radix-2 units (MR-2). For a 2048-point FFT, a total of five MR-2^2 units and one MR-2 unit are required. (b) A modified radix-2 unit containing a selection unit for switching between linear and bit-reversed mode, an alignment unit for hybrid floating-point, and a butterfly unit.

Scaling alternatives - Various data scaling alternatives have been evaluated in this work, and the approach selected for the current implementation is briefly presented below. A more detailed description and comparison is given in [72], with the focus on various types of data scaling; related work is presented in [73]. We propose a reduced-complexity floating-point representation for complex-valued numbers called hybrid floating-point, which uses a shared exponent for the real and imaginary parts. Besides reducing the complexity of the arithmetic units, especially the complex multiplication, the total wordlength for a complex number is reduced. Figure 7(a) shows the pipeline architecture with cascaded processing units. The radix-2 units are modified to support data scaling by adding an alignment unit before the butterfly operation, as shown in Figure 7(b). The output from the complex multipliers requires normalization, which is implemented using a barrel shifter. The proposed hybrid floating-point architecture has low memory requirements but still achieves a high SQNR of 45.3 dB.
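As a rough sketch of the shared-exponent idea (a float-based model, not the RTL implementation; the helper name and rounding details are illustrative only):

```python
import math

def quantize_hybrid(z, wordlength=10):
    """Quantize a complex sample with one exponent shared by the real and
    imaginary parts (hybrid floating-point). `wordlength` is the mantissa
    wordlength per component, including the sign bit."""
    m = max(abs(z.real), abs(z.imag))
    if m == 0.0:
        return 0j
    exp = math.frexp(m)[1]                      # smallest exp with m < 2**exp
    step = 2.0 ** exp / 2 ** (wordlength - 1)   # LSB weight of the mantissa
    return complex(round(z.real / step) * step,
                   round(z.imag / step) * step)
```

The smaller component is quantized with the step size dictated by the larger one, which is where a shared exponent loses accuracy relative to two independent exponents, in exchange for cheaper storage and alignment logic.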


Bidirectional pipeline - An architecture with a bidirectional pipeline has been implemented to support the convolution-based algorithm. A bidirectional dataflow can be supported if the wordlength is constant in the pipeline, which enables data to be propagated through the butterfly units in reverse order. Reversing the dataflow makes it possible to support both linear and bit-reversed address mode on the input port, as shown in Figure 7(a). Today, one-directional FFTs with an optimized wordlength in each butterfly unit are commonly used, increasing the wordlength through the pipeline to maintain accuracy. Implementations based on convergent block floating-point (CBFP) have also been proposed [74], but both of these architectures only work in one direction. Our bidirectional pipeline simplifies the memory access pattern when evaluating a one- or two-dimensional convolution using the FFT [75]. If the forward transform generates data in bit-reversed order, input to the reverse transform should also be bit-reversed. As a result, both the input and the output of the convolution are in normal bit order, and hence no reorder buffers are required. Another application for a bidirectional pipeline is in OFDM transceivers, to minimize the buffering required for inserting and removing the cyclic suffix, as proposed in [76].
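The ordering argument can be checked with a small software model (a behavioral sketch, not the pipeline hardware): a recursive DIF FFT that produces bit-reversed output, and an iterative DIT inverse that consumes bit-reversed input, so the pointwise product in between needs no reorder buffer.

```python
import cmath

def fft_dif(x):
    # decimation-in-frequency: natural-order input, bit-reversed output
    N = len(x)
    if N == 1:
        return list(x)
    h = N // 2
    s = [x[n] + x[n + h] for n in range(h)]
    d = [(x[n] - x[n + h]) * cmath.exp(-2j * cmath.pi * n / N) for n in range(h)]
    return fft_dif(s) + fft_dif(d)

def ifft_dit(X):
    # decimation-in-time: bit-reversed input, natural-order output
    a = list(X)
    N = len(a)
    size = 2
    while size <= N:
        h = size // 2
        w = cmath.exp(2j * cmath.pi / size)   # conjugated twiddle (inverse)
        for base in range(0, N, size):
            wn = 1.0 + 0j
            for k in range(h):
                u = a[base + k]
                v = a[base + k + h] * wn
                a[base + k] = u + v
                a[base + k + h] = u - v
                wn *= w
        size *= 2
    return [v / N for v in a]

# Circular convolution: both spectra are in bit-reversed order, so their
# pointwise product is too, and it feeds the inverse transform directly.
x = [1.0, 2.0, 0.0, -1.0]
g = [0.5, 0.25, 0.0, 0.0]
y = ifft_dit([a * b for a, b in zip(fft_dif(x), fft_dif(g))])
# y ≈ [0.25, 1.25, 0.5, -0.5], the circular convolution of x and g
```

No permutation step appears anywhere between the forward transforms and the inverse transform, which is exactly the property the bidirectional pipeline exploits.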

3.4 Two-Dimensional FFT

The two-dimensional FFT is calculated using the one-dimensional FFT core from Section 3.3. Calculating the two-dimensional FFT requires both row-wise and column-wise access to data, which causes a memory bottleneck as discussed in Section 2. The memory access pattern when processing columns causes a serious performance loss, since new rows are constantly accessed, preventing burst transfers. However, an equivalent procedure to the row/column approach is to transpose the data matrix between computations. This means that the FFT is applied twice over rows in the memory, separated by an intermediate transpose operation. If the transpose operation combined with FFTs over rows is faster than FFTs over columns, the total execution time is reduced. To evaluate this approach, a fast transpose unit is proposed.
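The equivalence of the two procedures can be verified with a direct software model (plain-Python DFT helpers, illustrative only; the hardware operates on external memory rather than Python lists):

```python
import cmath

def dft(seq):
    # direct 1-D DFT, adequate for a small correctness check
    N = len(seq)
    return [sum(seq[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def transpose(m):
    return [list(r) for r in zip(*m)]

def fft2_row_col(m):
    # 1-D transform over rows, then directly over columns (strided access)
    a = [dft(r) for r in m]
    for c in range(len(a[0])):
        col = dft([row[c] for row in a])
        for r, v in enumerate(col):
            a[r][c] = v
    return a

def fft2_row_transpose_row(m):
    # rows, transpose, rows again, transpose back: row bursts only
    a = transpose([dft(r) for r in m])
    return transpose([dft(r) for r in a])
```

Both functions compute the same two-dimensional transform; the second one replaces every column-wise (strided) access with a transpose between two row-wise passes.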

Transpose unit - Considering the large size of the data matrix, on-chip storage during processing is not realistic, and transfers to external memory are required. The access pattern for a transpose operation is normally reading rows and writing columns, or vice versa. To avoid column-wise memory access, a transpose operation can be broken down into a set of smaller transpose operations


[Figure 8 omitted: the Xstream transpose path, showing DMA interfaces, AGUs and the internal buffer relocating macro blocks from a source matrix A_xy to a destination matrix A^T_xy.]

Figure 8: The transpose logic uses the DMA controllers, the AGU units and the internal buffer to transpose large matrices. In the figure, a 32 × 32 matrix is transposed by individually transposing and relocating 4 × 4 macro blocks. Each macro block contains an 8 × 8 matrix.

combined with block relocation as

[ A11  A12 ; A21  A22 ]^T = [ A11^T  A21^T ; A12^T  A22^T ].

The matrix is divided recursively using a divide-and-conquer approach, which can be carried all the way down to single elements. Breaking the matrix down into single elements is not desirable, since all accesses would then result in single read and write operations. However, if the matrix is divided into smaller macro blocks that fit into an on-chip buffer of minimum size M × M elements, burst transfers of size Nburst ≤ M can be applied for both read and write operations. Hence, data is always transferred to and from the buffer as consecutive elements, while the actual transpose operation is performed in the buffer by writing rows and reading columns. The transpose operation is illustrated in Figure 8. The macro blocks are transferred to the internal buffer using direct memory access (DMA) units, and address generation units (AGU) provide a base address for each macro block as

ADDR_{x,y} = M(x + y·NFFT),   x, y = 0, 1, . . . , NFFT/M − 1.

Input and output AGUs generate base addresses in reverse order by exchanging the index counters x and y to relocate the macro blocks. Figure 9 shows a simulation of the row-column FFT and the row-transpose-row FFT. For the latter approach, the transpose operation is required, which is also shown separately in the graph. When the burst length is short, the transpose overhead dominates the computation time. When the burst length increases, the row-transpose-row FFT improves the overall performance. The graph representing row-transpose-row flattens out when the burst length is between 16 and 32 words, which is a good trade-off between speed and memory. Parameter selection is further discussed in Section 5.

[Figure 9 plot omitted: clock cycles per element versus burst length Nburst, with curves for the row/column two-dimensional FFT, the row/transpose/row two-dimensional FFT, and the transpose operation alone.]

Figure 9: Number of clock cycles per element for calculating a two-dimensional FFT using the two methods. The transpose operation is also shown separately. Simulation assumes that two memory banks are connected to the Xstream accelerator. (Memory device: [69], 2 × 16 bit, fmem = fbus = 133 MHz)
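A behavioral sketch of the blocked transpose (Python, with a flat row-major list standing in for external memory; function names are illustrative):

```python
def block_transpose(src, N, M):
    """Transpose an N x N matrix (flat, row-major) using M x M macro
    blocks: each block is read with M row bursts of length M, transposed
    in an on-chip-sized buffer, and written to the mirrored position."""
    dst = [0] * (N * N)
    for y in range(N // M):
        for x in range(N // M):
            base = M * (x + y * N)          # ADDR_{x,y} = M(x + y*N)
            # read the block as M bursts of consecutive words
            buf = [src[base + r * N:base + r * N + M] for r in range(M)]
            # transpose inside the buffer (write rows, read columns)
            buf = [list(t) for t in zip(*buf)]
            # relocate by exchanging the x and y index counters
            out = M * (y + x * N)
            for r in range(M):
                dst[out + r * N:out + r * N + M] = buf[r]
    return dst
```

Every external-memory access is a run of consecutive words, so both the read and the write side can use full bursts; only the small on-chip buffer is accessed column-wise.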

4 System Design

This section presents the integration of the Xstream accelerator into an embedded system and a prototype of a holographic microscope. The system captures, reconstructs, and presents holographic images. The architecture is based on the Xstream accelerator, extended with an embedded SPARC-compatible microprocessor [77] and a memory controller for connecting to external memory. Only a single memory interface is supported in the prototype, but the Xstream accelerator supports streaming from multiple memories. Two additional interface blocks provide functionality for capturing images from an external sensor device and for presenting reconstructed images on a monitor. The complete system is shown in Figure 10.


[Figure 10 omitted: block diagram with the LEON SPARC CPU, memory controller, sensor I/F, VGA controller and the Xstream DMA ports on the AHB, plus UART, I2C, IRQ and I/O on the APB behind a bridge.]

Figure 10: The system is based on a CPU (LEON), a hardware accelerator, a high-speed memory controller, an image sensor interface and a VGA controller.

4.1 External Interface

Xstream communicates using a standard bus interface to transfer data between the processing pipeline and an external memory, where data is transferred using DMA modules as shown in Figure 11. The custom-designed DMA modules connect to the on-chip bus and are constructed from three blocks: a bus interface, a buffer module, and a configuration interface. One side connects to the external bus while the other side connects to the internal dataflow protocol.

The DMA bus interface is compatible with the advanced high-performance bus (AHB) protocol, which is part of the AMBA specification [78]. Configuration data is transferred over the advanced peripheral bus (APB). The bus interface reads and writes data using transfers of any length, which allows fast access to burst-oriented memories. By specifying a base address (addr), the transfer size (hsize, vsize), and the space between individual rows (skip), the bus interface can access a two-dimensional matrix of any size inside a two-dimensional address space of any size. The DMA interface also supports an external address generation unit (AGU). This interface can be used to automatically restart the DMA transfer from a new base address when the previous transfer has completed. Hence, there is no latency between transfers. This is useful when processing or moving data inside a larger memory space, e.g. a matrix of blocks. An illustrative example is the transpose operation described in Section 3.4, which relocates blocks of data inside a matrix. The block corresponds to the actual transfer, and the matrix to the current address space.
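The address stream of such a 2-D transfer can be sketched as follows (assuming, as one plausible reading of Figure 11(b), that skip counts the words between the end of one row and the start of the next; the function name is illustrative):

```python
def dma_addresses(addr, hsize, vsize, skip):
    """Word addresses touched by a 2-D DMA transfer: hsize consecutive
    words per row, vsize rows, with `skip` words between rows."""
    out = []
    row = addr
    for _ in range(vsize):
        out.extend(range(row, row + hsize))   # one burst per row
        row += hsize + skip                   # advance to the next row
    return out
```

With skip = 0 the whole window collapses into a single run of consecutive addresses, i.e. one long burst.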

The DMA buffer contains a format transformation unit that allows splittingof words into sub-word transfers on the read side and combining sub-words into


[Figure 11 omitted: (a) DMA interface block diagram with AHB/APB bus interface, buffer with format/reorder logic, AGU and configuration ports; (b) an hsize × vsize window at base address addr inside an N × M address space, with row spacing skip.]

Figure 11: (a) The DMA interface connects the internal dataflow protocol to the external bus using a buffer and an AMBA bus interface. (b) The DMA interface can access a two-dimensional matrix inside a two-dimensional memory array, where skip is the distance to the next row.

words on the write side. An example is when calculating the vector magnitude of the resulting image and then rescaling the value into an 8-bit pixel. Pixels are processed individually, but combined into 32-bit words (groups of 4 pixels) in the buffer before being transferred to memory, to reduce the transfer size. The buffer also has the ability to reorder the output data in various ways, which is further explained in Section 5.1.

4.2 Software Design

The Xstream accelerator is configured by an on-chip processor. A program running on the embedded processor utilizes the hardware accelerator to capture, process, and present images. For development purposes, the software can also emulate all hardware resources in software, which means that the system can run on any computer with a C compiler. For evaluation purposes, the software also contains a full-precision floating-point model, which can be activated to evaluate the arithmetic accuracy of the hardware units.

On the embedded system, platform-specific drivers control the underlying hardware through configuration registers. Each driver has two modes, one for controlling the hardware and one for controlling the corresponding hardware emulator. Switching between hardware and emulated hardware is done transparently at compile time, based on the current platform. For example, capturing images from the sensor is implemented on the computer by reading a bitmap image from a file. Output to a monitor is emulated with a graphical window, as shown in Figure 12. From an application point of view, the behavior of the system is the same.

[Figure 12 omitted: screenshot of the emulated system.]

Figure 12: The platform-independent software running in Microsoft Visual Studio. The hardware accelerator, sensor and monitor are emulated in software.

5 Simulation and Optimization

Design parameter values that affect the performance and on-chip memory requirements, such as the burst length Nburst, are chosen based on the Matlab simulations presented in the previous sections. For an accurate simulation, memory timing parameters are extracted from the same memory device as used in the hardware prototype: two parallel 512-Mbit SDRAM devices from Micron Technology, with a combined wordlength of 32 bits [69].

5.1 Flexible Addressing Modes

Many of the functional units contain buffers to store and rearrange data, where each buffer requires a special addressing mode, as presented in Table 1. In the next section some of the buffers are merged to save storage space, and it is therefore important that each buffer supports a flexible addressing mode to maintain the original functionality after merging. Figure 13 illustrates how different addressing modes are used to rearrange input data. Data is always written to the buffer using linear addressing, and when reading from the buffer the address mode determines the order of the output sequence. Both bit-reverse and FFT spectrum shift depend on the transform size, i.e. the number of


Table 1: Required buffering inside each processing block for the original and optimized pipeline. The addressing mode for each buffer depends on the function that the block is evaluating.

Unit      Buffer           Original    Optimized             Addr mode
DMA       Input buffer     Nburst      Nburst                Linear
Combine   Reorder          Nburst      0                     Interleaved
FFT       Delay feedback   NFFT−1      NFFT−1                FIFO
Buffer    Transpose        Nburst^2    max(Nburst^2, NFFT)   Col-row swap
          Bit-reverse      NFFT        0                     Bit-reversed
          Spectrum shift   NFFT/2      0                     FFT shift
DMA       Output buffer    Nburst      0                     Linear

address bits, which requires the addressing modes to be flexible, with a dynamic address wordlength. Since bit-reverse and FFT spectrum shift are often used in conjunction, this addressing mode can be optimized. The spectrum shift inverts the MSB, as shown in Figure 13(d), and the location of the MSB depends on the transform size. However, in bit-reversed addressing the MSB is actually the LSB, and the LSB location is always constant. By reordering the operations, the cost of moving the spectrum is a single inverter.
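The single-inverter observation can be checked directly (Python sketch; helper names are illustrative):

```python
def bit_reverse(a, bits):
    """Reverse the `bits` low-order bits of address a."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (a & 1)
        a >>= 1
    return r

def fft_shift(a, bits):
    """FFT spectrum shift: invert the MSB, swapping the spectrum halves."""
    return a ^ (1 << (bits - 1))
```

Because bit-reversal maps the LSB onto the MSB, shifting the spectrum after bit-reversal equals inverting the LSB before it, and the LSB position never depends on the transform size; hence one inverter suffices regardless of the dynamic address wordlength.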

5.2 Memory Optimization

Table 1 presents the required buffering in each functional unit, where some of the units can merge buffers if they support the flexible addressing modes presented in Section 5.1. The table also presents an optimized pipeline with merged buffers. Units without a buffer after optimization read data directly from the buffer located in the previous unit. The following optimizations are performed:

• The reorder operation in the image combine block is moved to the DMA input buffer. The only modification required is that the DMA input buffer must support both linear and interleaved addressing modes.

• The transpose buffer for two-dimensional FFT transforms is reused for bit-reversal and FFT spectrum shift. These are in fact only two different addressing modes, which are supported by the transpose buffer. Merging these operations saves a large amount of memory, since bit-reverse and FFT shift require the complete sequence and half the sequence to be stored, respectively.


[Figure 13 omitted: four diagrams showing how the output address bits a5...a0 are derived from a linear input counter for each addressing mode (a)-(d).]

Figure 13: Buffers support multiple addressing modes to rearrange data. a_n represents the address bits, and the input data (top) is addressed with a linear counter. (a) Interleaving with 4 groups and 8 values in each group. (b) Transpose addressing for an 8 × 8 macro block matrix by swapping row and column address bits. (c) Bit-reversed addressing to unscramble FFT data. (d) FFT spectrum shift by inverting the MSB.

• The output DMA reads data in linear mode directly from the main buffer unit.

Figure 14 shows the memory requirements before and after optimization, and how they depend on Nburst. The delay feedback memory in the FFT cannot be shared and is not included in the memory simulation. For short burst lengths, the memory requirements are reduced from ≈ 3K words to ≈ 2K words. The memory requirement then increases with the burst length for the unoptimized design, but stays constant for the optimized design up to Nburst = 32. After this point, the memory requirements of both architectures increase rapidly.
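The buffer totals behind Figure 14 can be tabulated directly from the formulas in Table 1 (a sketch; the delay feedback memory is excluded, as in the figure):

```python
NFFT = 2048

def words_original(nb):
    # input buffer + reorder + transpose + bit-reverse + spectrum shift + output
    return nb + nb + nb * nb + NFFT + NFFT // 2 + nb

def words_optimized(nb):
    # input buffer + one shared buffer (transpose, bit-reverse and
    # FFT-shift merged into a single max-sized memory)
    return nb + max(nb * nb, NFFT)
```

The shared buffer stays at NFFT words as long as Nburst^2 ≤ NFFT, which is why the optimized curve is flat up to Nburst = 32 and then grows as Nburst^2.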

5.3 Parameter Selection

The shared buffer unit in the pipeline must be at least of size NFFT to support the bit-reverse operation. It is reused for the transpose operation, which requires Nburst^2 elements. The buffer size is hence the maximum of the two. Figure 14 shows how the memory requirements depend on the design parameter Nburst, which should be maximized under the condition that

Nburst^2 ≤ NFFT = 2048,


[Figure 14 plot omitted: memory words (0 to 14K) versus burst length Nburst (1 to 128) for the unoptimized and optimized pipelines.]

Figure 14: Memory requirements in 32-bit words for different values of Nburst when NFFT = 2048. The delay feedback memory is not included since it cannot be shared with other units.

and that Nburst is an integer power of 2. When Nburst^2 exceeds NFFT, the internal memory requirements increase rapidly, which leads to a high area cost according to Figure 14 and a relatively low performance improvement according to Figure 6 and Figure 9. Another condition is that the image read operation should generate one set of pixels per clock cycle, or at least close to this value, to supply the FFT with input data at full speed and balance the throughput. Selecting Nburst = 32 satisfies the conditions and results in:

• A total of 3.2 cc/element for calculating the two-dimensional FFT, compared with up to 10 clock cycles per element for the traditional row-column approach, as shown in Figure 9. This is a speed-up factor of approximately 3 for the two-dimensional FFT.

• The combine unit requires 2.8 cc/element to store and read the image data from external memory in interleaved mode. This is a speed-up factor of approximately 2 compared to storing images separately in memory, as shown in Figure 6.

• The combine unit is capable of constantly supplying the FFT unit with data. Hence, the total system speed-up factor is the same as for the two-dimensional FFT.


Table 2: Equivalent gates (NAND2) and memory cells (RAM and ROM) after synthesis to a 0.13µm standard cell library. Slice count and block RAMs are presented for the FPGA design.

Module       Eq. gates (logic only)   RAM (Kbit)   ROM (Kbit)   FPGA (Slices)   # Block RAMs
DMA Input    3123 (3%)                1.0          0            223 (3%)        1
Combine      2489 (2%)                0            0            198 (2%)        0
CORDIC       10016 (8%)               0            0            403 (5%)        0
CMUL         5161 (4%)                0            0            362 (5%)        0
FFT          83430 (70%)              49.0         47.6         5826 (73%)      25
Buffer       1287 (1%)                65.5         0            166 (2%)        16
DMA Output   2592 (2%)                0            0            102 (1%)        0
AGU (2x)     2148 (2%)                0            0            110 (1%)        0
Sensor I/F   4574 (4%)                1.0          0            280 (4%)        1
VGA I/F      4469 (4%)                1.0          0            315 (4%)        1
Total        119289                   117.5        47.6         7985            45

• The optimized pipeline requires less than 50% of the memory compared to the original pipeline, reduced from over 4K words down to 2K words, as shown in Figure 14.

6 Results and Comparisons

The system presented in Section 4 has been synthesized for FPGA, targeting a Xilinx Virtex-1000E device, and integrated into a microscope based on digital holography to reconstruct images captured with a sensor device, as shown in Figure 15(a). The microscope is shown to the right, and the screen in the middle is connected to the FPGA platform to display the resulting image from the reconstruction. The computer to the left runs a graphical user interface to set up the hardware accelerator and to download reconstructed images for further processing. The FPGA prototyping board is shown in Figure 15(b) and contains a Virtex-1000E device, 128 MB of SDRAM, a digital sensor interface, and a VGA monitor interface.

The design has also been synthesized to a high-speed 0.13µm standard cell library from Faraday. Synthesis results from both implementations can be found in Table 2. The table shows the number of equivalent gates (NAND2) and the memory requirements, including both RAM and ROM (twiddle factor tables in the FFT). For FPGA synthesis, the number of occupied slices and


Table 3: Comparison between related work and the proposed architecture. Performance is presented as fps/MHz, normalized to 1.0 for the proposed FPGA design.

Technology            Freq (MHz)   Mem I/F   Rate (fps)   FPGA Slices   Performance (relative)
Pentium-4 Dual Core   3000         Dual      0.8          -             0.0044
0.35µm [79]           133          Single    2.6          -             0.3
XCV2000E [80]         35           Quad      2.0          8480          0.9
Proposed XCV1000E     24           Dual      1.5          7985          1.0
Proposed 0.13µm       398          Dual      ≈ 31         -             1.27

required block RAMs are presented; the Xstream accelerator occupies approximately 65% of the FPGA resources. The largest part of the design is the 2048-point pipeline FFT, which makes up approximately 73% of the Xstream accelerator. The FPGA design runs at a clock frequency of 24 MHz, limited by the embedded processor, while the design synthesized for the 0.13µm cell library is capable of running at up to 398 MHz. A floorplan of the Xstream accelerator is shown in Figure 16, with a core area of 1500 × 1400 µm² containing 352K equivalent gates.

In related work on two-dimensional FFT implementation, the problem of memory organization is rarely addressed. Instead, a high bandwidth from memory with uniform access time is assumed (SRAM). However, for computing a large multi-dimensional FFT, the memory properties and data organization must be taken into account, as discussed in this work. Table 3 shows a comparison between the proposed architecture, a modern desktop computer, and related work. In [79], an ASIC design of a 512-point two-dimensional FFT connected to a single memory interface is presented. [80] presents an FPGA implementation with variable transform size, storing data in four separate memory banks. To compare the processing efficiency of different architectures, a performance metric is defined as fps/MHz and normalized to 1.0 for the proposed FPGA implementation. The frame rate is estimated for a transform size of 2048 × 2048 points. The table shows the proposed architecture to be highly efficient, resulting in real-time image reconstruction at over 30 fps for the proposed ASIC design. The reason for the increased efficiency when targeting ASIC is that DMA transfers with fixed bandwidth requirements, such as the sensor and VGA interfaces in Figure 10, have less impact on the total available bandwidth as the system clock frequency increases.


[Figure 15 photographs omitted: (a) the microscope prototype with monitor; (b) the FPGA board with the FPGA, SDRAM, sensor and VGA interfaces labeled.]

Figure 15: (a) Microscope prototype connected to an external monitor. (b) The hardware platform containing a Virtex-1000E device, 128 MB of SDRAM, a digital sensor interface, and a VGA monitor interface.

[Figure 16 floorplan omitted: placement of the FFT, delay feedback, buffer, RDMA/WDMA, COMB, CORDIC, CMUL and sensor/VGA blocks.]

Figure 16: Final layout of the Xstream accelerator using a 0.13µm cell library. The core size is 1500 × 1400 µm².


7 Conclusion

A hardware acceleration platform for image processing in digital holography has been presented. The hardware accelerator contains an efficient datapath for calculating the FFT and other required operations. A fast transpose unit is proposed that improves the computation time for a two-dimensional FFT by a factor of 3 compared with the traditional row/column approach. To cope with the increased bandwidth and to balance the throughput of the computational units, a fast reorder unit is proposed to store captured images and read data in an interleaved fashion. This results in a speed-up of 2 compared with accessing separately stored images in memory. It is also shown how to reduce the memory requirements of a pipelined design by over 50% by sharing buffers between modules. The design has been synthesized and integrated in an FPGA-based system for digital holography. The same architecture targeting a 0.13µm CMOS standard cell library achieves real-time image reconstruction at over 30 frames per second.


Part II

A High-performance FFT Core for Digital

Holographic Imaging

Abstract

Dynamic data scaling in pipeline FFTs is suitable when implementing large-size FFTs in applications such as DVB and digital holographic imaging. In a pipeline FFT, data is continuously streaming and must hence be scaled without stalling the dataflow. We propose a hybrid floating-point scheme with a tailored exponent datapath, and a co-optimized architecture combining hybrid floating-point and block floating-point to reduce memory requirements for two-dimensional signal processing. The presented co-optimization generates a higher SQNR and requires less memory than, for instance, convergent block floating-point. A 2048-point pipeline FFT has been fabricated in a standard CMOS process from AMI Semiconductor [9], and an FPGA prototype integrating a two-dimensional FFT core in a larger design shows that the architecture is suitable for image reconstruction in digital holographic imaging.

Based on: T. Lenart and V. Öwall, "Architectures for dynamic data scaling in 2/4/8K pipeline FFT cores," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 11, Nov. 2006. ISSN 1063-8210.

and: T. Lenart and V. Öwall, "A 2048 Complex Point FFT Processor using a Novel Data Scaling Approach," in Proceedings of IEEE International Symposium on Circuits and Systems, vol. 4, Bangkok, Thailand, May 2003, pp. 45-48.

65


1 Introduction

The discrete Fourier transform is a commonly used operation in digital signal processing, where typical applications are linear filtering, correlation, and spectrum analysis [68]. The Fourier transform is also found in modern communication systems using digital modulation techniques, including wireless network standards such as 802.11a [81] and 802.11g [82], as well as in audio and video broadcasting using DAB and DVB.

The DFT is defined as

X(k) = Σ_{n=0}^{N−1} x(n) W_N^{kn},   0 ≤ k < N,        (1)

where

W_N = e^{−i2π/N}.        (2)

Evaluating (1) requires N MAC operations for each transformed value in X, or N^2 operations for the complete DFT. Changing the transform size significantly affects computation time; e.g., calculating a 1024-point Fourier transform requires three orders of magnitude more work than a 32-point DFT.
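The operation counts behind this comparison can be written out directly (counting N MAC operations per output value for the direct evaluation of (1), and N/2 butterflies per stage for the radix-2 FFT introduced below):

```python
import math

def dft_ops(N):
    # direct evaluation of (1): N MAC operations for each of the N outputs
    return N * N

def fft_ops(N):
    # radix-2 FFT: log2(N) decomposition stages of N/2 butterflies each
    return (N // 2) * int(math.log2(N))
```

The ratio dft_ops(1024) / dft_ops(32) is (1024/32)^2 = 1024, i.e. roughly three orders of magnitude.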

A more efficient way to compute the DFT is to use the fast Fourier transform (FFT) [83]. The FFT is a decomposition of an N-point DFT into successively smaller DFT transforms. The concept of breaking down the original problem into smaller sub-problems is known as a divide-and-conquer approach. The original sequence can for example be divided into N = r1 · r2 · ... · rq, where each r is a prime. For practical reasons, the r values are often chosen equal, creating a more regular structure. As a result, the DFT size is restricted to N = r^q, where r is called the radix or decomposition factor. Most decompositions are based on a radix value of 2, 4 or even 8 [84]. Consider the following decomposition of (1), known as radix-2:

X(k) = Σ_{n=0}^{N−1} x(n) W_N^{kn}

     = Σ_{n=0}^{N/2−1} x(2n) W_N^{k(2n)} + Σ_{n=0}^{N/2−1} x(2n+1) W_N^{k(2n+1)}

     = Σ_{n=0}^{N/2−1} x_even(n) W_{N/2}^{kn} + W_N^k · Σ_{n=0}^{N/2−1} x_odd(n) W_{N/2}^{kn}

     = DFT_{N/2}(x_even) + W_N^k · DFT_{N/2}(x_odd).        (3)


The original N-point DFT has been divided into two N/2-point DFTs, a procedure that can be repeated again on the smaller transforms. The complexity is thus reduced from O(N^2) to O(N log2 N). The decomposition in (3) is called decimation-in-time (DIT), since the input x(n) is decimated by a factor of 2 when divided into an even and an odd sequence. Combining the results from the transforms requires a scaling and an add operation. Another common approach is known as decimation-in-frequency (DIF), splitting the input sequence into x1 = {x(0), x(1), ..., x(N/2 − 1)} and x2 = {x(N/2), x(N/2 + 1), ..., x(N − 1)}. The summation now yields

X(k) = Σ_{n=0}^{N/2−1} x(n) W_N^{kn} + Σ_{n=N/2}^{N−1} x(n) W_N^{kn}        (4)

     = Σ_{n=0}^{N/2−1} x1(n) W_N^{kn} + W_N^{kN/2} · Σ_{n=0}^{N/2−1} x2(n) W_N^{kn},

where W_N^{kN/2} can be extracted from the summation since it only depends on the value of k, and can be expressed as (−1)^k. This expression divides, or decimates, X(k) into two groups depending on whether (−1)^k is positive or negative. That is, one equation calculates the even values and one the odd values, as in

X(2k) = Σ_{n=0}^{N/2−1} (x1(n) + x2(n)) W_{N/2}^{kn} = DFT_{N/2}(x1(n) + x2(n))        (5)

and

X(2k + 1) = Σ_{n=0}^{N/2−1} [(x1(n) − x2(n)) W_N^n] W_{N/2}^{kn} = DFT_{N/2}((x1(n) − x2(n)) W_N^n).        (6)

(5) calculates the sum of two sequences, while (6) calculates the difference and then scales the result. This kind of operation, adding and subtracting the same two values, is commonly referred to as a butterfly due to its butterfly-like shape in the flow graph, shown in Figure 1(a). Sometimes, scaling is also considered to be a part of the butterfly operation. The flow graph in Figure 1(b) represents


[Figure 1 omitted: (a) flow-graph symbol for the butterfly, producing x1 + x2 and (x1 − x2)W_N; (b) 8-point radix-2 DIF flow graph with twiddle factors W_8^0 to W_8^3 feeding two N/2-point DFTs.]

Figure 1: (a) Butterfly operation and scaling. (b) The radix-2 decimation-in-frequency FFT algorithm divides an N-point DFT into two separate N/2-point DFTs.

the computations from (5) and (6), where each decomposition step requires N/2 butterfly operations.

In Figure 1(b), the output sequence from the FFT appears scrambled. The binary output index is bit-reversed, i.e. the most significant bits (MSB) have changed place with the least significant bits (LSB), e.g. 11001 becomes 10011. To unscramble the sequence, bit-reversed indexing is required.
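The recursion in (5) and (6) and the resulting scrambled output can be demonstrated with a short Python model (illustrative, not the hardware implementation):

```python
import cmath

def fft_dif(x):
    # one radix-2 DIF step, eqs. (5) and (6), applied recursively
    N = len(x)
    if N == 1:
        return list(x)
    h = N // 2
    even = [x[n] + x[n + h] for n in range(h)]                      # eq. (5)
    odd  = [(x[n] - x[n + h]) * cmath.exp(-2j * cmath.pi * n / N)   # eq. (6)
            for n in range(h)]
    return fft_dif(even) + fft_dif(odd)   # output lands in bit-reversed order

def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def unscramble(X):
    # place element i at the bit-reversed index to restore natural order
    bits = len(X).bit_length() - 1
    out = [0j] * len(X)
    for i, v in enumerate(X):
        out[bit_reverse(i, bits)] = v
    return out
```

After unscrambling, the result matches the direct evaluation of the DFT in (1).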

1.1 Algorithmic Aspects

From an algorithmic perspective, several possible decomposition algorithms exist [85]. The radix-2 algorithm can be used when the size of the transform is N = 2^q, where q ∈ Z^+. The algorithm requires q decomposition steps, each computing N/2 butterfly operations using a radix-2 (R-2) butterfly, shown in Figure 1(a).

However, when the size is N = 4^q, more hardware-efficient decomposition algorithms exist. One possible alternative is the radix-4 decomposition, which reduces the number of complex multiplications with the penalty of increasing the number of complex additions. The more complex radix-4 butterfly is shown in Figure 2(a). Another decomposition similar to radix-4 is the radix-2^2 algorithm, which simplifies the complex radix-4 butterfly into four radix-2 butterflies [71]. On a flow graph level, radix-4 and radix-2^2 require the same number of resources. The difference between the algorithms becomes evident when folding is applied, which is shown later. The radix-2^2 (R-2^2) butterfly


Figure 2: (a) Radix-4 butterfly. (b) Radix-2^2 butterfly.

is found in Figure 2(b). To calculate an FFT size not supported by radix-4 and radix-2^2, for example 2048, both radix-4 decompositions and a radix-2 decomposition are needed, since N = 2048 = 2^1 · 4^5.
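The required mix of stages follows directly from the factorization N = 2^q; the Python fragment below (illustrative only) pairs up exponent bits into radix-4 stages and appends one radix-2 stage when q is odd, as in the 2048-point example.

```python
def decomposition_stages(N):
    """Return the list of decomposition stages for an N-point FFT."""
    q = N.bit_length() - 1
    assert N == 2 ** q, "transform size must be a power of two"
    stages = ['radix-4'] * (q // 2)   # each radix-4 stage consumes two bits
    if q % 2:                         # an odd q needs one radix-2 stage
        stages.append('radix-2')
    return stages

# N = 2048 = 2^11 = 2 * 4^5: five radix-4 stages plus one radix-2 stage
assert decomposition_stages(2048) == ['radix-4'] * 5 + ['radix-2']
assert decomposition_stages(1024) == ['radix-4'] * 5
```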

1.2 Architectural Aspects

There are several ways of mapping an algorithm to hardware. Three approaches are discussed and evaluated in this section: direct-mapped hardware, a pipeline structure, and time-multiplexing using a single butterfly unit. Direct-mapped hardware basically means that each processing unit in the flow graph is implemented using a unique arithmetic unit. Normally, using this approach in a large and complex algorithm is not desirable due to the huge amount of hardware resources required. The alternative is to fold operations onto the same block of hardware, an approach that saves resources but at the cost of increased computation time.

Figure 3(a) shows a 4-point radix-2 FFT. Each stage consists of two butterfly operations, hence the direct-mapped hardware implementation requires 4 butterfly units and 2 complex multiplication units, numbers that will increase with the size of the transform. Folding the algorithm vertically, as shown in Figure 3(b), reduces hardware complexity by reusing computational units, a structure often referred to as a pipeline FFT [86]. A pipeline structure of the FFT is constructed from cascaded butterfly blocks. When the input is in sequential order, each butterfly operates on samples x_n and x_{n+N/2}, hence a delay buffer of size N/2 is required in the first stage. This is referred to as a single-path delay feedback (SDF). In the second stage, the transform size is N/2, hence the delay feedback memory is N/4. In total, this sums up to N − 1 words in the delay buffers.
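The resource accounting above can be tabulated directly; this small Python sketch (not from the thesis) lists the FIFO depth per SDF stage, confirms that the depths sum to N − 1 words, and counts butterfly units for the three mapping styles.

```python
from math import log2

def sdf_fifo_depths(N):
    """FIFO depth per stage of a radix-2 SDF pipeline: N/2, N/4, ..., 1."""
    return [N >> s for s in range(1, int(log2(N)) + 1)]

def butterfly_count(N, mapping):
    """Number of butterfly units for a given hardware mapping."""
    stages = int(log2(N))
    return {'direct': (N // 2) * stages,  # one unit per flow-graph node
            'pipeline': stages,           # one unit per cascaded stage
            'single': 1}[mapping]         # fully time-multiplexed

assert sdf_fifo_depths(4) == [2, 1]        # Figure 3(b): depths 2 and 1
assert sum(sdf_fifo_depths(1024)) == 1023  # N - 1 delay words in total
assert butterfly_count(4, 'direct') == 4   # 4-point direct mapping
assert butterfly_count(1024, 'pipeline') == 10
```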


Figure 3: (a) Flow graph of a 4-point radix-2 FFT. (b) Mapping of the 4-point FFT using two radix-2 butterfly units with delay feedback memories, where the number represents the FIFO depth.

Folding the pipeline architecture horizontally as well reduces the hardware to a single time-multiplexed butterfly and complex multiplier. This approach saves arithmetic resources, but still requires the same amount of storage space as a pipeline architecture. The penalty is further reduced calculation speed, since all decompositions are mapped onto a single computational unit.

To summarize, which architecture to use is closely linked to the required computational speed and the available hardware and memory resources. A higher computational speed normally requires more hardware resources, a trade-off that has to be decided before the actual implementation work begins.

Another important parameter associated with the architectural decisions is the wordlength, which affects the computational accuracy and the hardware requirements. An increased wordlength improves the computational quality, measured in signal-to-quantization-noise ratio (SQNR), but also increases the hardware cost and the latency in arithmetic units. The trade-off between low hardware cost and high SQNR is referred to as wordlength optimization. The SQNR is defined as

    SQNR_dB = 10 · log_10(P_x / P_q),    (7)

where P_q is the quantization energy and P_x is the energy in the input signal. A way to increase the SQNR for signals with large dynamic range is to use data scaling, which is discussed in the next section.

2 Proposed Architectures

Currently, the demands increase towards larger and multidimensional transforms for use in synthetic aperture radar (SAR) and scientific computing, including biomedical imaging, seismic analysis, and radio astronomy. Larger transforms require more processing on each data sample, which increases the total quantization noise. This can be avoided by gradually increasing the


Figure 4: The radix-2^2 (R-2^2) butterfly is constructed from two radix-2 (R-2) butterflies, separated with a trivial multiplication.

wordlength inside the pipeline, but this will also increase the memory requirements as well as the critical path in arithmetic components. For large size FFTs, dynamic scaling is therefore a suitable trade-off between arithmetic complexity and memory requirements. The following architectures have been evaluated and compared with related work:

(A) A hybrid floating-point pipeline with fixed-point input and tailored exponent datapath for one-dimensional (1D) FFT computation.

(B) A hybrid floating-point pipeline for two-dimensional (2D) FFT computation, which also requires the input format to be hybrid floating-point. Hence, the hardware cost is slightly higher than in (A).

(C) A co-optimized design based on a hybrid floating-point pipeline combined with block floating-point for 2D FFT computation. This architecture has the processing abilities of (B) with hardware requirements comparable to (A).

The primary application for the implemented FFT core is a microscope based on digital holography, where visible images are to be digitally reconstructed from a recorded interference pattern [87]. The pattern is recorded on a large digital image sensor with a resolution of 2048 × 2048 pixels and processed by a reconstruction algorithm based on a 2D Fourier transformation. Hence, the architectures outlined in (B) and (C) are suitable for this application. Another area of interest is in wireless communication systems based on orthogonal frequency division multiplexing (OFDM). The OFDM scheme is used in, for example, digital video broadcasting (DVB) [88], including DVB-T with 2/8K FFT modes and DVB-H with an additional 4K FFT mode. The architecture described in (A) is suitable for this field of application.

Fixed-point is a widely used format in real-time and low-power applications due to the simple implementation of arithmetic units. In fixed-point arithmetic, a result from a multiplication is usually rounded or truncated to avoid a significantly increased wordlength, hence generating a quantization error. The


Figure 5: Building blocks for a hybrid floating-point implementation. (a) Symbol of a modified butterfly containing an align unit on the input. (b) Symbol for a modified complex multiplier containing a normalization unit.

quantization energy caused by rounding is relatively constant due to the fixed location of the binary point, whereas the total energy depends on how the input signal utilizes the available dynamic range. Therefore, precision in the calculations depends on properties of the input signal, caused by uniform resolution over the total dynamic range. Fixed-point arithmetic usually requires an increased wordlength due to the trade-off between dynamic range and precision. By using floating-point and dynamically changing the quantization steps, the energy in the error signal will follow the energy in the input signal, and the resulting SQNR will remain relatively constant over a large dynamic range. This is desirable to generate a high signal quality, less dependent on the transform length. However, floating-point arithmetic is considerably more expensive in terms of chip area and power consumption, and alternatives are presented in the following subsections, followed by a comparison in Section 4.

2.1 Hybrid Floating-Point

Floating-point arithmetic increases the dynamic range by expressing numbers with a mantissa m and an exponent e, represented with M and E bits, respectively. A hybrid and simplified scheme for floating-point representation of complex numbers is to use a single exponent for the real and imaginary part. Besides reduced complexity in the arithmetic units, the total wordlength for a complex number is reduced from 2 × (M + E) to 2 × M + E bits. Supporting hybrid floating-point requires pre- and post-processing units in the arithmetic building blocks, and Figure 5 defines the symbols used for representing these units. The FFT twiddle factors are represented with T bits.
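As an illustration (not the thesis implementation), the Python functions below pack a complex value into such a hybrid format: one shared exponent chosen so that both M-bit signed mantissas fit, giving 2 × M + E bits in total.

```python
import math

def hybrid_encode(z, M=10, E=4):
    """Encode complex z as (m_re, m_im, e) with one shared exponent."""
    m_max = 2 ** (M - 1) - 1
    mag = max(abs(z.real), abs(z.imag))
    if mag == 0:
        return 0, 0, 0
    # smallest exponent that makes both mantissas fit in M signed bits
    e = max(0, math.ceil(math.log2(mag / m_max)))
    e = min(e, 2 ** E - 1)
    return round(z.real / 2 ** e), round(z.imag / 2 ** e), e

def hybrid_decode(m_re, m_im, e):
    """Reconstruct the complex value from the shared-exponent format."""
    return complex(m_re, m_im) * 2 ** e

z = complex(20000.0, -300.0)
m_re, m_im, e = hybrid_encode(z)
zq = hybrid_decode(m_re, m_im, e)
# rounding error is bounded by half a quantization step per component
assert abs(zq.real - z.real) <= 2 ** (e - 1)
assert abs(zq.imag - z.imag) <= 2 ** (e - 1)
```

Note how the small imaginary part is quantized with the coarse step dictated by the large real part; this is the price of sharing one exponent.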


Figure 6: An example of convergent block floating-point. The buffer after each complex multiplier selects a common exponent for a group of values, allowing fixed-point butterfly units. The first buffer is often omitted to save storage space, but this will have a negative impact on signal quality.

2.2 Block Floating-Point

Block floating-point (BFP) combines the advantages of simple fixed-point arithmetic with floating-point dynamic range. A single exponent is assigned to a group of values to reduce memory requirements and arithmetic complexity. However, output signal quality depends on the block size and the characteristics of the input signal [89]. Finding a common exponent requires processing of the complete block. This information is directly available in a parallel FFT architecture, but for pipeline FFT architectures scaling becomes more complicated since data is continuously streaming. A scheme known as convergent block floating-point (CBFP) has been proposed for pipeline architectures [73]. By placing buffers between intermediate stages, data can be rescaled using block floating-point, as shown in Figure 6. The block size will decrease as data propagates through the pipeline until each value has its own exponent. Intermediate buffering of data between each stage requires a large amount of memory, and in practical applications the first intermediate buffer is often omitted to save storage space. However, this leads to a reduced SQNR, as will be shown in Section 4; this variant is referred to as CBFPlow due to the lower memory requirements.
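The core idea of BFP — one exponent per block of values — can be sketched as follows (Python, illustrative only): each block is scaled so that its largest member fits the mantissa wordlength, and all members of the block share that scaling.

```python
import math

def bfp_quantize(values, block_size=4, M=10):
    """Quantize 'values' in blocks that each share one common exponent."""
    m_max = 2 ** (M - 1) - 1
    out, exponents = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        peak = max(abs(v) for v in block)
        # one exponent per block, chosen for the largest magnitude
        e = max(0, math.ceil(math.log2(peak / m_max))) if peak else 0
        out.extend(round(v / 2 ** e) * 2 ** e for v in block)
        exponents.append(e)
    return out, exponents

data = [100.0, -7000.0, 3.0, 42.0, 5.0, -2.0, 1.0, 0.0]
q, exps = bfp_quantize(data)
assert exps == [4, 0]      # the 7000 forces a coarse step for its block
assert q[0] == 96.0        # 100 quantized with step 2^4 = 16
assert q[2] == 0.0         # the small 3.0 is crushed by the shared step
assert q[4:] == data[4:]   # the all-small block keeps full precision
```

This also shows why signal quality depends on the block contents: one large value degrades every small value in the same block.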

2.3 Co-Optimization

In this section, a co-optimized architecture that combines hybrid floating-point and BFP is proposed. By extending the hybrid floating-point architecture with small intermediate buffers, the size of the delay feedback memory can be reduced. Figure 7(a-c) shows dynamic data scaling for hybrid floating-point, CBFP, and the proposed co-optimization architecture. Figure 7(c) is a combined architecture with an intermediate buffer to apply block scaling on D elements, which reduces the storage space for exponents in the delay feedback memory by a factor D.

Figure 7: (a) Hybrid floating-point. (b) Convergent block floating-point with D = 2N − 1 using a large buffer and fixed-point butterfly. (c) A small buffer reduces the exponent storage space in the delay feedback memory.

We will derive an expression to find optimum values for the block size in each butterfly stage i to minimize the memory requirements for supporting dynamic scaling. The equations can be used for all configurations in Figure 7(a-c) by specifying D = 1 for hybrid floating-point and D = 2N_i for CBFP. The length of the delay feedback memory, or FIFO, at stage i is

    N_i = 2^i,    0 ≤ i ≤ i_max = log_2(N_FFT) − 1,

and the number of exponent bits for the same stage is denoted E_i. The block size D spans from single elements to 2N_i, which can be expressed as

    D(α_i) = 2^{α_i},    0 ≤ α_i ≤ i + 1.

The total number of bits required for supporting dynamic scaling is the sum of the exponent bits in the delay feedback unit and the total size of the intermediate buffer. This can be expressed as

    Mem_i = E_i · ⌊γN_i / D(α_i)⌋ + L · (D(α_i) − 1),    (8)
            (delay feedback)       (buffer)

where

    γ = 1 for radix-2 and γ = 3/2 for radix-2^2,

and

    L = 2M + E_i for i = i_max,
    L = 2(M + T) for 0 ≤ i < i_max.


Figure 8: Memory requirements (bits) for supporting dynamic scaling as a function of the intermediate buffer length D for the initial butterfly in an N_FFT point FFT using data format 2 × 10 + 4, with curves for N_FFT = 2048 (i = 10), 4096 (i = 11), and 8192 (i = 12). D = 1 represents a hybrid floating-point architecture, whereas D → N_FFT approaches the CBFP architecture. An optimal value can be found in between these architectures.

For radix-2^2 butterflies, (8) is only defined for odd values of i. This is compensated by a scale factor γ = 3/2 to include both delay feedback units in the radix-2^2 butterfly, as shown in Figure 4. The buffer input wordlength L differs between initial and internal butterflies. For every butterfly stage, α_i is chosen to minimize (8). For example, an 8192-point FFT using a hybrid floating-point format of 2 × 10 + 4 bits requires 16 Kb of memory in the initial butterfly for storing exponents, as shown in Figure 8. The number of memory elements for supporting dynamic scaling can be reduced to only 1256 bits by selecting a block size of 32, hence removing over 90% of the storage space for exponents. The hardware overhead is a counter to keep track of when to update the block exponent in the delay feedback, similar to the exponent control logic required in CBFP implementations. Thus, the proposed co-optimization architecture supports hybrid floating-point on the input port at very low hardware cost.
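The optimization of (8) is easy to reproduce. The Python sketch below (illustrative, assuming γ = 1 for this stage and L = 2M + E = 24 bits) evaluates the memory cost of the initial butterfly of an 8192-point FFT with the 2 × 10 + 4 format and recovers both the 16 Kb figure at D = 1 and the 1256-bit optimum at D = 32.

```python
def scaling_memory(Ni, D, E=4, L=24, gamma=1):
    """Eq. (8): exponent bits in the delay feedback plus buffer bits."""
    return E * (gamma * Ni // D) + L * (D - 1)

Ni = 4096  # FIFO length of the initial butterfly in an 8192-point FFT
costs = {D: scaling_memory(Ni, D) for D in (2 ** a for a in range(14))}

assert costs[1] == 16384                    # hybrid floating-point: 16 Kb
best_D = min(costs, key=costs.get)
assert best_D == 32 and costs[32] == 1256   # over 90% of the storage removed
```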


Figure 9: A bidirectional 16-point FFT pipeline. The input/output to the left is in sequential order. The input/output to the right is in bit-reversed order.

Since the input and output format is the same, this architecture then becomes suitable for 2D FFT computation.

3 Architectural Extensions

The architectures described in this paper have been extended with support for bidirectional processing, which is important for the intended application and also in many general applications. A pipeline FFT can support a bidirectional dataflow if all internal butterfly stages have the same wordlength. The advantage of a bidirectional pipeline is that input data can be supplied either in linear or bit-reversed sample order by changing the dataflow direction. One application for the bidirectional pipeline is to exchange the FFT/IFFT structure using reordering buffers in an OFDM transceiver to minimize the required buffering for inserting and removing the cyclic prefix, as proposed in [76]. OFDM implementations based on CBFP have also been proposed in [74], but these solutions only operate in one direction since the input and output formats differ. Another application for a bidirectional pipeline is to evaluate 1D and 2D convolutions. Since the forward transform generates data in bit-reversed order, the architecture is more efficient if the inverse transform supports a bit-reversed input sequence, as shown in Figure 9. Both input and output from the convolution are in linear sample order, hence no reorder buffers are required. The hardware requirement for a bidirectional pipeline is limited to multiplexers on the inputs of each butterfly and on each complex multiplier. Each unit requires 26 two-input muxes for the internal 2 × 11 + 4 format, which is negligible compared to the size of an FFT stage.
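The claim that no reorder buffers are needed can be checked numerically. The Python sketch below (recursive models for illustration, not the streaming hardware) implements a DIF FFT that leaves its output in bit-reversed order and a DIT inverse FFT that accepts bit-reversed input; a circular convolution computed through the pair comes out in linear order without any reordering in between.

```python
import cmath

def fft_dif(x):
    """Forward DIF FFT: linear-order input, bit-reversed-order output."""
    N = len(x)
    if N == 1:
        return list(x)
    h = N // 2
    a = [x[n] + x[n + h] for n in range(h)]
    b = [(x[n] - x[n + h]) * cmath.exp(-2j * cmath.pi * n / N)
         for n in range(h)]
    return fft_dif(a) + fft_dif(b)

def ifft_dit(X):
    """Inverse DIT FFT: bit-reversed-order input, linear-order output."""
    def rec(X):
        N = len(X)
        if N == 1:
            return list(X)
        h = N // 2
        even, odd = rec(X[:h]), rec(X[h:])
        out = [0j] * N
        for n in range(h):
            t = odd[n] * cmath.exp(2j * cmath.pi * n / N)
            out[n], out[n + h] = even[n] + t, even[n] - t
        return out
    return [v / len(X) for v in rec(X)]

x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
g = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# both spectra are in the same bit-reversed order, so pointwise
# multiplication works without any intermediate reordering
y = ifft_dit([a * b for a, b in zip(fft_dif(x), fft_dif(g))])
ref = [sum(x[m] * g[(n - m) % 8] for m in range(8)) for n in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(y, ref))
```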

4 Simulations

A simulation tool has been designed to evaluate different FFT architectures in terms of precision, dynamic range, memory requirements, and estimated chip size based on architectural descriptions. The user can specify the number of bits for representing the mantissa M, exponents E, and twiddle factors T, the FFT size (N_FFT), the rounding type, and the simulation stimuli. To make a fair comparison with related work, all architectures have been described and simulated in the developed tool.

First, we compare the proposed architectures with CBFP in terms of memory requirements and signal quality. In addition to the lower memory requirements, we will show how the co-optimized architecture produces a higher SQNR than CBFP. Secondly, we will compare the fabricated design with related work in terms of chip size and data throughput.

Table 1 shows a comparison of the memory distribution between delay feedback units and intermediate buffers. The 1D architectures have fixed-point input, whereas the 2D architectures support hybrid floating-point input. The table shows that the intermediate buffers used in CBFP consume a large amount of memory, which favors the co-optimized architecture for 1D processing. For 2D processing, the co-optimized architecture also has lower memory requirements than hybrid floating-point due to the buffer optimization. Figures 10 and 11 present simulation results for the 1D architectures in Table 1. Figure 10 is a simulation to compare SQNR when changing the energy level in the input signal. In this case, the variations only affect CBFPlow, since scaling is applied later in the pipeline. Figure 11 shows the result when applying signals with a large crest factor, i.e. the ratio between the peak and mean value of the input. In this case, both CBFP implementations are strongly affected due to the large block size in the beginning of the pipeline. Signal statistics have minor impact on the hybrid floating-point architecture since every value is scaled individually. The SQNR for the co-optimized solution is located between hybrid floating-point and CBFP since it uses a relatively small block size.

Table 1: Memory requirements in Kbits for pipeline architectures, based on a 2048 point radix-2^2 with M = 10 and E = 4.

    Architecture             Delay feedback   Intermediate buffers   Total memory
    1D Co-optimization       45.8             1.6                    47.4
    1D Hybrid FP (A)         49.0             -                      49.0
    1D CBFPlow               45.7             14.7                   60.4
    1D CBFP                  45.7             60.0                   105.7
    2D Co-optimization (C)   50.0             0.4                    50.4
    2D Hybrid FP (B)         53.9             -                      53.9


Figure 10: SQNR (dB) as a function of input signal level (1 down to 1/32) for hybrid floating-point, co-optimization, CBFP, and CBFPlow. Decreasing the energy in a random value input signal affects only the architecture where scaling is not applied in the initial stage. Signal level = 1 means utilizing the full dynamic range.

Table 2 shows an extended comparison between the proposed architectures and related work. The table includes two pipeline architectures using hybrid floating-point: one for 1D signal processing (A), using a tailored datapath for the exponent bits E = 0 ... 4, and one for 2D signal processing (B), using a constant number of exponent bits E = 4. It also includes the proposed co-optimized architecture for 2D signal processing (C), with a reduced hardware cost more comparable to the 1D hybrid floating-point implementation. It uses block scaling in the initial butterfly unit and then hybrid floating-point in the internal butterflies, to show the low hardware cost of extending architecture (A) with support for 2D processing.

The parallel architecture proposed by Lin et al. [90] uses block floating-point with a block size of 64 elements. The large block size affects the signal quality, but with slightly lower memory requirements compared to pipeline architectures. A pipeline architecture proposed by Bidet et al. [73] uses convergent block floating-point with a multi-path delay commutator. The memory requirements are high due to the intermediate storage of data in the pipeline, which significantly affects the chip area. However, CBFP generates a higher SQNR than traditional BFP. The pipeline architecture proposed by Wang et al. [91]


Figure 11: SQNR (dB) as a function of crest factor (5 to 20) for hybrid floating-point, co-optimization, CBFP, and CBFPlow. Decreasing the energy in a random value input signal with peak values utilizing the full dynamic range affects all block scaling architectures, and the SQNR depends on the block size. The co-optimized architecture performs better than convergent block floating-point, since it has a smaller block size through the pipeline.

does not support scaling and is not directly comparable in terms of precision, since its SQNR depends on the input signal. The wordlength increases gradually in the pipeline to minimize the quantization noise, but this increases the memory requirements and, more importantly, the wordlength in arithmetic components and therefore also the chip area.

The proposed architectures have low hardware requirements and produce high SQNR using dynamic data scaling. They can easily be adapted to 2D signal processing, in contrast to architectures without data scaling or using CBFP. The pipeline implementation results in a high throughput by continuous data streaming, which is shown as peak performance of 1D transforms in Table 2.


5 VLSI Implementation

A 2048 complex point pipeline FFT core using hybrid floating-point and based on the radix-2^2 decimation-in-frequency algorithm [71] has been designed, fabricated, and verified. This section presents the internal building blocks and measurements on the fabricated ASIC prototype.

The butterfly units calculate the sum and the difference between the input sequence and the output sequence from the delay feedback. The output from the butterfly connects to the complex multiplier, and data is finally normalized and sent to the next FFT stage. The implementation of the delay feedbacks is a main consideration. For shorter delay sequences, serially connected flip-flops are used as delay elements. As the number of delay elements increases, this approach is no longer area- and power-efficient. One solution is to use SRAM, but to continuously supply the computational units with data, one read and one write operation have to be performed in every clock cycle. A dual-port memory allows simultaneous read and write operations, but is larger and consumes more energy per memory access than single-port memories. Instead, two single-port memories, alternating between read and write each clock cycle, could be used. This approach can be further simplified by using one single-port memory with double wordlength to hold two consecutive values in a single location, alternating between reading two values in one cycle and writing two values in the next cycle. The latter approach has been used for delay feedbacks exceeding a length of eight values. An area comparison can be found in [9].
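The double-wordlength trick can be modeled behaviorally in a few lines (Python sketch, not the thesis RTL): an N-sample delay line built from N/2 double-width locations, reading a stored pair on even cycles and writing the two most recent inputs on odd cycles, so the single-port memory is accessed only once per cycle.

```python
def delay_line(samples, N):
    """N-sample delay using one single-port memory of N/2 word pairs."""
    assert N % 2 == 0
    mem = [(0, 0)] * (N // 2)  # each location holds two consecutive samples
    addr, pair, out = 0, (0, 0), []
    for i, s in enumerate(samples):
        if i % 2 == 0:
            pair = mem[addr]    # read cycle: fetch two old samples at once
            out.append(pair[0])
        else:
            mem[addr] = (samples[i - 1], s)  # write cycle: store two new ones
            addr = (addr + 1) % (N // 2)
            out.append(pair[1])              # second old sample from the pair
    return out

data = list(range(1, 13))
out = delay_line(data, 4)
# after the initial zeros, out[i] equals data[i - 4]
assert out == [0, 0, 0, 0] + data[:8]
```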

A 2048-point FFT chip based on architecture (A) has been fabricated in a 0.35 µm 5ML CMOS process from AMI Semiconductor, and is shown in Figure 12. The core size is 2632 × 2881 µm², connected to 58 I/O pads and 26 power pads. The implementation requires 11 delay feedback buffers, one for each butterfly unit. Seven on-chip RAMs are used as delay buffers (approximately 49 Kbits), while the four smallest buffers are implemented using flip-flops. Twiddle factors are stored in three ROMs containing approximately 47 Kbits. The memories can be seen along the sides of the chip. The number of equivalent gates (2-input NAND) is 45900 for combinatorial area and 78300 for non-combinatorial area (including memories). The power consumption of the core was measured to 526 mW when running at 50 MHz with a supply voltage of 2.7 V. The pipeline architecture produces one output value each clock cycle, or 37K transforms per second running at maximum clock frequency. The 2D FFT architecture (B) has been implemented on an FPGA in [92].


Figure 12: Chip photo of the 2048 complex point FFT core fabricated in a 0.35 µm 5ML CMOS process. The core size is 2632 × 2881 µm².

6 Conclusion

Dynamic data scaling architectures for pipeline FFTs have been proposed for both 1D and 2D applications. Based on hybrid floating-point, a high-precision pipeline with low memory and arithmetic requirements has been constructed. A co-optimization between hybrid floating-point and block floating-point has been proposed, reducing the memory requirements further by adding small intermediate buffers. A 2048 complex point pipeline FFT core, based on the presented scaling architecture and with a throughput of 1 complex sample per clock cycle, has been implemented and fabricated in a 0.35 µm 5ML CMOS process. The bidirectional pipeline FFT core, intended for image reconstruction in digital holography, has also been integrated on a custom-designed FPGA platform to create a complete hardware accelerator for digital holographic imaging.


Table 2: Comparison between proposed architectures and related work. The values based on simulated data are highlighted by grey fields.

                          Proposed (A)    Proposed (B)    Proposed (C)    Lin [90]    Bidet [73]    Wang [91]
    Architecture          Pipeline (1D)   Pipeline (2D)   Pipeline (2D)   Parallel    Pipeline      Pipeline
    Dynamic scaling       Hybrid FP       Hybrid FP       Co-optimized    BFP D=64    CBFP          No
    Technology (µm)       0.35            Virtex-E        Virtex-E        0.18        0.5           0.35
    Max Freq. (MHz)       76              50              50              56          22            16
    Input wordlength      10              10+4            10+4            10          10            8
    Internal wordlength   11+(0...4)      2×11+4          11+4            11+4        12+4          (19...34)
    Transform size        2K (1/2/4/8K)   2K              2K              8K          8K            2/8K
    SQNR (dB)             45.3            44.0            45.3            44.3        41.2          42.4
    Memory (bits)         49K/196K        53.9K           50.4K           185K        350K          213K
    Norm. Area (mm²)¹     7.58            ≈16             ≈16             18.3        49            33.75
    1D Transforms/s       37109/9277      24414           24414           3905        2686          7812/1953

¹ Area normalized to 0.35 µm technology.


Part III

A Design Environment and Models for Reconfigurable Computing

Abstract

System-level simulation and exploration tools are required to rapidly evaluate system performance early in the design phase. The use of virtual platforms enables hardware modeling as well as early software development. An exploration framework (Scenic) is proposed, which is based on OSCI SystemC and consists of a design exploration environment and a set of customizable simulation models. The exploration environment extends the SystemC library with features to construct and configure simulation models using an XML description, and to control and extract performance data using run-time reflection. A set of generic simulation models has been developed, and these are annotated with performance monitors for interactive run-time access. The Scenic framework is developed to enable design exploration and performance analysis of reconfigurable architectures and embedded systems.

Based on: T. Lenart, H. Svensson, and V. Owall, "Modeling and Exploration of a Reconfigurable Architecture for Digital Holographic Imaging," in Proceedings of IEEE International Symposium on Circuits and Systems, Seattle, USA, May 2008.

and: T. Lenart, H. Svensson, and V. Owall, "A Hybrid Interconnect Network-on-Chip and a Transactional Level Modeling Approach for Reconfigurable Computing," in Proceedings of IEEE International Symposium on Electronic Design, Test and Applications, Hong Kong, China, Jan. 2008.



1 Introduction

Due to the continuous increase in design complexity, system-level exploration tools and methodologies are required to rapidly evaluate system behavior and performance. An important aspect of efficient design exploration is the design methodology, which involves the construction and configuration of the system to be simulated, and the controllability and observability of simulation models.

SystemC has shown to be a powerful system-level modeling language, mainly for exploring complex system architectures such as System-on-Chip (SOC), Network-on-Chip (NoC) [93], multi-processor systems [94], and run-time reconfigurable platforms [95]. The Open SystemC Initiative (OSCI) maintains an open-source simulator for SystemC, which is a C++ library containing routines and macros to simulate concurrent processes using HDL-like semantics [96]. Systems are constructed from SystemC modules, which are connected to form a design hierarchy. SystemC modules encapsulate processes, which describe behavior, and communicate through ports and channels with other SystemC modules. The advantages of SystemC, besides the well-known C++ syntax, include modeling at different abstraction levels, simplified hardware/software co-simulation, and a high simulation performance compared to traditional HDL. The abstraction levels range from cycle accurate (CA) to transaction level modeling (TLM), where abstract models trade modeling accuracy for a higher simulation speed [42].

SystemC supports observability by tracing signals and transactions using the SystemC verification library (SCV), but it only provides limited features using trace files. Logging to trace files is time-consuming and requires post-processing of extracted simulation data. Another drawback is that the OSCI simulation kernel does not support real-time control to allow users to start and stop the simulation interactively. Issues related to system construction, system configuration, and controllability are not addressed.

To cover the most important aspects of efficient design exploration, a SystemC Environment with Interactive Control (Scenic) is proposed. Scenic is based on OSCI SystemC 2.2 and extends the SystemC library with functionality to construct and configure simulations from the eXtensible Markup Language (XML), possibilities to interact with simulation models during run-time, and the ability to control the SystemC simulation kernel using micro-step simulation. Scenic extends OSCI SystemC without modifying the core library, hence proposing a non-intrusive exploration approach. A command shell is provided to handle user interaction and a connection to a graphical user interface. In addition, a library of customizable simulation models has been developed, which contains commonly used building blocks for modeling embedded systems. The models are used in Part IV to simulate and evaluate reconfigurable architectures. The Scenic exploration framework is illustrated in Figure 1, where the Scenic core implements SystemC extensions and the Scenic shell provides an interface for user interaction.

Figure 1: The Scenic environment extends OSCI SystemC with functionality for system exploration (core) and user interaction (shell). Model generators are basic building blocks for the architectural generators, which are used to construct complex simulations.

Section 2 presents related work on existing platforms for design exploration. This is followed by the proposed Scenic exploration environment in Section 3, which is divided into scripting environment, simulation construction, simulation interaction, and code generation. Section 4 proposes model generators and architectural generators to construct highly customizable processor and memory architectures. The use of the Scenic framework for system design is presented in Part IV, where a platform for reconfigurable computing is proposed. More detailed information on the Scenic exploration environment can be found in Appendix A.

2 Related Work

Performance analysis is an important part of design exploration, and is based on performance data extracted from a simulation. Extraction of performance data requires either the use of trace files, for post-processing of simulation data, or run-time access to performance data inside simulation models. The former approach is not suitable for interactive performance exploration due to the lack of observability during simulation, and also has a negative impact on simulation performance. The latter approach is a methodology referred to as data introspection. Data introspection is the ability to access run-time information using a reflection mechanism. The mechanism enables either structural reflection, to expose the design hierarchy, or run-time reflection, to extract performance data and statistics for performance analysis.

General frameworks to automatically reflect run-time information in software are presented in [97] and [47]. While this approach works well for standard data types, performance data is highly model-specific and thus cannot be automatically annotated and reflected. Dust is another approach to reflect structural and run-time information [98]. Dust is a Java visualization front-end, which captures run-time information about data transactions using SCV. However, performance data is not captured, and has to be evaluated and analyzed from recorded transactions. Design structure and recorded transactions are stored in XML format and visualized in a structural view and a message sequence chart, respectively. Frameworks have also been presented for modeling at the application level, where models are annotated with performance data and trace files are used for performance analysis [15]. However, due to the use of more abstract simulation models, these environments are more suitable for software evaluation than hardware exploration.

The use of structural reflection has been proposed in [99] and [100] to generate a visual representation of the simulated system. This can be useful to ensure structural correctness and provide greater understanding of the design hierarchy, but it does not provide additional design information to enable performance analysis. In contrast, it is argued that structural reflection should instead be used to translate the design hierarchy into a user-specified format to enable automatic code generation.

Related work covers only part of the functionality required for efficient design exploration. This work proposes an exploration environment, Scenic, that supports simulation construction using an XML format, simulation interaction using run-time reflection, and code generation using structural reflection.

3 Scenic Exploration Environment

Design exploration is an important part of the hardware design flow, and requires tools and models that allow rapid simulation construction, configuration, and evaluation. The Scenic exploration environment is proposed to address these important aspects. Scenic is a system exploration tool for hardware modeling, and uses simulation models annotated with user-defined performance monitors to capture and reflect relevant information during simulation-time.

Figure 2(a) illustrates the Scenic exploration flow from simulation construction, using an XML format, to an interactive SystemC simulation model. The Scenic environment is divided into two tightly-coupled parts: the Scenic core and the Scenic shell. The Scenic core handles run-time interaction with simulation models, and the Scenic shell handles the user interface and scripting environment. Figure 2(b) illustrates user interaction during run-time, to configure and observe the behavior of simulation models.


Figure 2: (a) A simulation is created from an XML design specification using simulation models from a module library. (b) The models interact with the Scenic shell using access variables, simulation events, and filtered debug messages.

Scenic is characterized by the following modeling and implementation properties:

• Scenic supports mixed application domains and abstraction levels, and is currently evaluated in the field of embedded system design and reconfigurable computing using transaction level modeling.

• Scenic presents a non-intrusive approach that does not require the OSCI SystemC library to be changed. Furthermore, port or signal types are not modified or extended. This enables interoperability and allows integration of any simulation module developed in SystemC.

Scenic interacts with simulation models during simulation-time, which is implemented by extending the OSCI SystemC module class with run-time reflection. To take advantage of the Scenic extensions, the only required change is to use sci_module, which encapsulates functionality to access the simulation models from the Scenic shell. Scenic can still simulate standard OSCI SystemC modules without any required modifications to the source code, but they will not be visible from the Scenic shell. Compatibility is an important aspect since it allows integration and co-simulation with existing SystemC models.

The class hierarchy and the extended functionality provided by sci_module are shown in Figure 3. sci_module is an extension of the sci_access and SystemC sc_module classes.

Figure 3: Class hierarchy for constructing simulation models. The access variables are stored in sci_access and exported to the Scenic shell. The custom_access function enables user-defined commands to be implemented.

sci_module supports automatic binding of simulation models and code generation in any structural language format. sci_access handles debug message filtering, run-time reflection, and custom access functions. Debug message filtering enables the user to set the debug level at which messages are produced from each simulation model, and custom access methods provide an interface to implement user-defined shell commands for each simulation model. The SystemC extensions provide means for more effective simulation and exploration, and are briefly described below:

• Scripting Environment - The Scenic shell provides a powerful scripting environment to control simulations and to interact with simulation models. The scripting environment supports features to define user scenarios by interactively configuring the simulation models. It can also be used to access and process performance data that is captured inside each model during simulation. The scripting environment is presented in Section 3.1.

• Simulation Construction - The simulation construction and static configuration are specified in XML format. Hence, simulation models are dynamically created, connected, and configured from a single design specification language. This enables rapid simulation construction since no compilation process is involved. Simulation construction is presented in Section 3.2.

• Simulation Interaction - An extension to access internal data structures inside simulation models is proposed. This allows simulation models to be dynamically controlled, in order to describe different simulation scenarios, and observed, in order to extract performance data during simulation. Simulation interaction is presented in Section 3.3.

Figure 4: Graphical user interface connecting to Scenic using a tcp/ip socket. It is used to navigate through the design hierarchy, modify access variables, and to view data captured over time. The Scenic shell is shown on top of the graphical user interface.

• Code Generation - Simulation models provide an interface to generate structural information in a user-specific language. This allows the design hierarchy to be visually represented [101], or user-developed hardware models to be constructed and configured in HDL syntax. Hence, it is possible to generate structural VHDL directly from the simulation models to support component instantiation and parameterization, port mapping, and signal binding. Code generation is presented in Section 3.4.

3.1 Scripting Environment

The Scenic environment is a user interface to construct, control, and interact with a simulation. It provides a command line interface (the Scenic shell) that facilitates advanced scripting, and supports an optional tcp/ip connection to a graphical user interface. The Scenic shell and graphical user interface are shown in Figure 4. The graphical user interface includes a tree view to navigate through the design hierarchy (structural reflection) and provides a list view with accessible variables and performance data inside each simulation module (run-time reflection). Variable tracing is used to capture trends over time, which are presented in a waveform window.


Built-in commands are executed from the Scenic shell command line interface by typing the command name followed by a parameter list. Built-in commands include, for example, setting environment variables, evaluating basic arithmetic or binary operations, and generating random numbers. If the name of a SystemC module is given instead of a Scenic shell command, the command is forwarded to and evaluated by that module's custom access method. The interface enables fast scripting to set up different scenarios in order to evaluate system architectures. The Scenic shell commands are presented in detail in Appendix A.

Simulation Control

OSCI SystemC currently lacks the possibility for the user to interactively control the SystemC simulation kernel; instead, simulation execution is described directly in the source code. Hence, changes to the simulation execution require the source code to be recompiled, which is an undesirable way to control the simulation kernel. In contrast, Scenic addresses the issue of simulation control by introducing micro-step simulation. Micro-step simulation allows the user to interactively pause the simulation, to modify or view internal state, and then resume execution. A micro-step is a user-defined duration of simulation time, during which the SystemC kernel simulates in blocking mode. The micro-steps are repeated until the total simulation time is completed. From a user perspective, the simulation appears to be non-blocking and fully interactive. The performance penalty for using micro-steps is evaluated to be negligible down to clock cycle resolution.

From the Scenic shell, simulations are controlled using either blocking or non-blocking run commands. Non-blocking commands are useful for interactive simulations, whereas blocking commands are useful for scripting. A stop command halts a non-blocking simulation at the beginning of the next micro-step.

3.2 Simulation Construction

SystemC provides a pre-defined C-function (sc_main) that is called during elaboration-time to instantiate and bind the top-level simulation models. A change to the simulation architecture requires the source files to be updated and re-compiled, which makes design exploration more difficult. In Scenic, the use of XML is proposed to construct and configure a simulation. XML is a textual language that is suitable for describing hierarchical designs, and has been proposed as an interchange format for ESL design by the Spirit Consortium [102]. The use of XML for design specification also enables the possibility to co-simulate multiple projects, since several XML files can be overloaded in Scenic to construct larger simulations. By using multiple design specifications, parts of the simulation may be substituted with abstract models to gain simulation performance during the exploration cycle. Once the individual parts of the system have been verified, a more detailed and accurate system simulation can be executed.

Figure 5: XML is used for design specification, to instantiate and configure simulation models from the Scenic module library. Automatic port binding enables models and designs to be connected.

Simulation models are instantiated from the Scenic module library, as previously illustrated in Figure 2(a). User-models that are based on sci_module are automatically registered during start-up and stored in the module library. A design specification in XML format specifies instantiations, bindings, and configuration of simulation models from the module library, as illustrated in Figure 5. All models based on sci_module use the same constructor arguments, which simplifies the instantiation and configuration process. Configuration parameters are transferred from the XML specification to the models using a data structure supporting arbitrary data types. Hence, models receive and extract XML parameters to obtain a user-defined static configuration.
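As an illustration only, a design specification in this style might look as follows. All element and attribute names below are hypothetical; the actual Scenic XML schema is described in Appendix A:

```xml
<!-- Hypothetical sketch of an XML design specification: instantiate
     two modules from the module library, configure them with static
     parameters, and bind their ports. -->
<design name="embedded_system">
  <module name="cpu" type="sci_processor">
    <param name="wordlength" value="32"/>
  </module>
  <module name="ram" type="sci_memory">
    <param name="size" value="65536"/>
  </module>
  <bind from="cpu.mem_if" to="ram.port0"/>
</design>
```

Because instantiation, configuration, and binding are all expressed in data rather than code, exchanging the memory model for a more abstract one only requires editing the type attribute, not recompiling the simulator.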

3.3 Simulation Interaction

Run-time reflection is required to dynamically interact with simulation models and to access internal information inside each model during simulation time. It is desirable to be able to access internal member variables inside each simulation model, and it is also valuable to trace such information over time. Hence, access variables are proposed, which make it possible to access member variables from the Scenic shell. Access variables may also be recorded during simulation, and the information can be retrieved from the Scenic shell to study trends over time. Finally, simulation events are proposed, which provide the functionality to notify the simulator when user-defined events occur. Simulation events are introduced to allow a more interactive simulation.


Access Variables

Member variables inside an sci_module are exported to the Scenic shell as configurable and observable access variables. This reflects the variable's data type, data size, and value during simulation time. Access variables enable dynamic configuration (controllability) and extraction of performance data and statistics during simulation (observability). Hence, the user can configure models for specific simulation scenarios, and then observe the corresponding simulation effects.

Performance data and statistics that are represented by native and trivial data types can be directly reflected using an access variable. An example is a memory model that contains a counter representing the total number of memory accesses. However, more complex operations are required when performance values are data or time dependent. This requires functional code to evaluate and calculate a performance value, which is supported in Scenic by using variable callbacks. When an access variable is requested, the associated callback function is evaluated to assign a new value to the member variable. This new value is then reflected in the Scenic shell. The callback functionality can also be used for the reverse operation of assigning values that affect the model functionality or configuration. An example is a variable that represents a memory size, which requires the memory array to be resized (reallocated) when the variable changes. The following code is a trivial example of how to export member variables and how to implement a custom callback function:

class ScenicModule : public sci_module {               /* extended module */
private:
  void*  m_mem;                                        /* pointer to data */
  short  m_size;                                       /* scalar data type */
  double m_average;                                    /* scalar data type */
  int    m_data[5];                                    /* static vector */

public:
  ScenicModule(sc_module_name nm, SCI_PARAMS& params) :
    sci_module(nm, params)
  {
    scenic_variable(m_data, "data");                   /* variable "data" */
    scenic_callback(m_size, "size");                   /* callback "size" */
    scenic_callback(m_average, "average");             /* callback "average" */
  }

  virtual SCI_RETURN callback(SCI_OP op, string name) {    /* callback function */
    if (name == "size")
      if (op == WRITE) m_mem = realloc(m_mem, m_size);     /* realloc "size" bytes */
    if (name == "average")
      if (op == READ) m_average = average(m_data);         /* return average value */
  }
};

Figure 6: A local variable is first registered, which creates a sci_variable object pointing to the variable. It contains features to create a history buffer and to register a periodic logging interval with a global scheduler to capture the value of a variable at constant intervals. The user interface can access a variable through the sci_access object, which contains a list of all registered sci_variable objects.

Modifying the value of the size variable triggers the callback function to be executed and the memory to be reallocated. The variable average is also evaluated on demand, while the data vector is constant and reflects the current simulation value. Registered variables are accessed from the Scenic shell as:

[0 ns]> ScenicModule set -var size -value 256
[0 ns]> ScenicModule set -var data -value "4 7 6 3 8"
[0 ns]> ScenicModule get
data    5x4 [0] { 4 7 6 3 8 }
average 1x8 [0] { 5.6 }
size    1x2 [0] { 256 }

Figure 6 shows how a user-defined simulation model exports a member variable named data ①, which is encapsulated by creating a new sci_variable ② that contains a reference to the actual member variable ③. sci_variable provides functionality to enable, for example, get and set operations on arbitrary data types using standard C++ iostreams. Hence, the stream operators (<< and >>) must be supported for all access variable types. The Scenic shell obtains information about variables by calling the access method in sci_access ④. The call is responded to by the sci_variable that is associated with the requested access variable ⑤.

Access Variable Tracing

It is desirable to be able to study trends over time and to trace the values of access variables. Hence, the value of an access variable may be periodically logged and stored into a history buffer during simulation. The periodic time interval and history buffer depth are configurable parameters from the Scenic shell, which is also used to request and reflect data from history buffers. However, periodic logging requires the variable class to be aware of SystemC simulation time. Normally, this would require each sci_variable to inherit from sc_module and to create a separate SystemC process. This would consume a substantial amount of simulation time due to increased context switching in the SystemC kernel.

Figure 7: The impact on simulation performance using sc_module to support sci_variable logging and when using a global scheduler to reduce context switching in the simulation kernel. The graph represents relative performance overhead.

To reduce context switching, an implementation based on a global scheduler is proposed, which handles logging for all access variables inside all simulation models. The global scheduler, shown in Figure 6, is configured with a periodic time interval from the Scenic shell through a sci_variable ⑥. A global scheduler improves the simulation performance since it is only invoked once for each time event that requires variable logging. Hence, multiple variables are logged in a single context switch.

To evaluate the performance improvement, a comparison between using a conventional sc_module and the proposed global scheduler is shown in Figure 7. In the former approach, there is a large performance penalty even if logging is disabled, due to the increased number of SystemC processes. For the latter approach, there is negligible simulation overhead when logging is turned off. When logging is enabled, two approaches are presented. The first simulation is based on a conventional approach using standard macros (mutex) to enable mutually exclusive data access. These macros prevent simultaneous access to history buffers to protect the thread communication between the simulation kernel and the Scenic shell. The second approach proposes a method based on atomic steps (micro-steps), discussed in Section 3.1, which prevents race conditions between the simulation thread and the Scenic shell. As shown, the proposed implementation based on a global scheduler and atomic simulation steps results in lower overhead compared to conventional approaches.

Figure 8: The function call flow to create a dynamic simulation event from the Scenic shell. A condition is created and registered with an access variable. When a simulation condition is satisfied, the associated event is triggered.

Simulation Events

During simulations, it is valuable to automatically receive information from simulation models regarding simulation status. This information could be informative messages on when to read out generated performance data, or warning messages that enable the user to trace simulation problems at an exact time instance. Instead of printing debug messages in the Scenic shell, it is desirable to be able to automatically halt the simulation once user-defined events occur.

An extension to the access variables provides the ability to combine data logging with conditional simulation events. Hence, dynamic simulation events are proposed, which can be configured to execute Scenic commands when triggered. In this way, the simulation models can notify the simulator of specific events or conditions on which to observe, reconfigure, or halt the simulation.

A simulation event is a sci_variable that is assigned a user-specified Scenic shell command. Simulation events are created dynamically during run-time from the Scenic shell to notify the simulator when a boolean condition associated with an access variable is satisfied. The condition is evaluated on data inside the history buffer, hence repeated assignments between clock cycles are effectively avoided.


The internal call graph for creating a simulation event is shown in Figure 8. An access variable is first configured with a periodic logging interval on which to evaluate a boolean condition. A boolean condition is created from the Scenic shell, which registers itself with the access variable and creates a new simulation event. This simulation event is configured by the user to evaluate a Scenic shell command when triggered. Every time the access variable is logged, the condition is evaluated against the assigned boolean expression. If the condition is true, the associated event is triggered. The event executes the user command, which in this case is set to halt the simulation. Hence, simulation events operate in a similar way to software assertions, but are dynamically created during simulation time.

3.4 Code Generation

Simulation models provide an interface to generate structural information in a user-specific language, where Scenic currently supports the DOT and VHDL formats. The code generators are executed from the Scenic shell by specifying the format and a top-level. A code generator uses multiple passes to traverse all hierarchical modules from the top-level and visit each simulation module. Modules can implement both pre visitors and post visitors, depending on the order in which the structural information should be produced. In addition, the code generator divides a structural description into separate files and sections, which are user-defined to represent different parts of the generated code. For example, a VHDL file is divided into sections representing entity, architecture, type declarations, processes, and concurrent statements.

A DOT generator is implemented to create a visual representation of a system design, and is used to verify the connectivity and design hierarchy of the complex array architectures presented in Part IV. DOT is a textual language describing nodes and edges, which is used to represent graphs and hierarchical relations [101]. A DOT specification is converted into a visible image using open-source tools.

A VHDL generator is implemented to generate structural information to instantiate and parameterize components, and to generate signal and port bindings. Hence, only the behavioral description of each simulation model has to be translated. The generator produces a VHDL netlist, where the modules are instantiated, parameterized, and connected according to a Scenic XML design specification and the dynamic configuration generated by the architectural generators, presented in Section 4.3. Hence, all manual design steps between system specification and code generation have been effectively removed.


Figure 9: (a) Constructing an embedded system using the processor and memory generators. The processor uses the external memory interface to communicate with a system bus, and an i/o register to communicate with an external hardware accelerator. (b) Constructing a processor array using the model and architectural generators. The array elements are connected using bi-directional i/o registers.

4 Scenic Model and Architectural Generators

The Scenic exploration environment allows efficient hardware modeling by combining rapid simulation construction with simulation interaction. Simulation interaction requires simulation models that are highly configurable and that export performance data and statistics to the Scenic shell. Hence, this section presents simulation primitives developed to enable design exploration.

A set of generic models is developed to provide commonly used building blocks in digital system design. The models enable design exploration by exporting configuration parameters and exposing performance metrics to the Scenic shell. Hence, models for design exploration are characterized by scalability (static parameterization), configurability (dynamic parameterization), and a high level of observability. Scalability enables model re-use in various parts of a system, while configurability allows system tuning. Finally, observability provides feedback about the system behavior and performance.

The module library contains highly configurable and reusable simulation models, referred to as model generators. Two model generators have been developed to simplify the modeling of digital systems, namely the processor generator and the memory generator. The processor generator constructs a custom processor model based on a user-defined architectural description, while the memory generator constructs a custom memory model based on user-defined timing parameters. The module library also contains several architectural generators. An architectural generator constructs more complex simulation objects, such as Network-on-Chip (NoC) or embedded system architectures.

Table 1: Highly customizable model generators for rapid design exploration.

Generator   Scalability         Configurability   Observability
----------------------------------------------------------------------
Processor   - Wordlength        - Program code    - Execution flow
            - Instruction set                     - Registers & Ports
            - Registers                           - Memory access
            - Memory i/f
Memory      - Wordlength        - Memory timing   - Bandwidth/Latency
            - Memory size                         - Access pattern
                                                  - Memory contents

The model and architectural generators are used in Part IV to construct a dynamically reconfigurable architecture, and are further described in the following sections. Figure 9(a) illustrates an embedded system, where the processor and memory generators have been used to construct the main simulation models. The models communicate with the system bus using adapters, which translate a model-specific interface to a user-specific interface. An array of processors and memories is shown in Figure 9(b), illustrating the architectures presented in Part IV.

4.1 Processor Generator

The Scenic module library includes a processor generator to create custom instruction-set simulators (ISS) from a user-defined architectural description. An architectural description specifies the processor data wordlength, instruction wordlength, internal registers, instruction set, and external memory interface, where the properties are summarized in Table 1. Furthermore, a generated processor automatically provides a code generator (assembler) that translates processor-specific assembly instructions into binary format.

Generated processors are annotated with access variables to extract performance data and statistics during simulation time. The performance data and statistics include processor utilization, the number of executed and stalled instructions, and internal register values. Logging of performance data allows designers to study processor execution over time, and captured statistics are valuable for high-accuracy software profiling.

Page 122: Design of Reconfigurable Hardware Architectures for Real-time Applications

102 PART III. A DESIGN ENVIRONMENT AND MODELS FOR RECONFIGURABLE...

Processor Registers

The register bank is user-defined; thus, each generated processor has a different register configuration. However, a register holding the program counter is automatically created to control the processor execution flow.

There are two types of processor registers: internal general-purpose registers and external i/o port registers. General-purpose registers hold data values for local processing, while i/o port registers are bi-directional interfaces for external communication. i/o port registers are useful for connecting external co-processors or hardware accelerators to a generated processor. Each uni-directional link uses flow control to prevent the processor from accessing an empty input link or a full output link. Similar port registers are found in many processor architectures, for example the IBM PowerPC [103].

Processor Instructions

The instruction set is also user-defined, to emulate arbitrary processor architectures. An instruction is constructed from the sci_instruction class, and is based on an instruction template that describes operand fields and bitfields. The operand fields specify the number of destination registers (D), source registers (S), and immediate values (I). The bitfields specify the number of bits used to represent the opcode (OP), the operand fields, and the instruction flags (F). Figure 10 illustrates how a generic instruction format is used to represent different instruction templates, based on {OP, D, S, I, F}. An instruction template has the following format:

class TypeB : public sci_instruction {             /* instruction template */
public:
    TypeB(string nm) : sci_instruction(nm,         /* constructor */
        "D=1,S=1,I=1",                             /* operand fields */
        "opcode=6,dst=5,src=5,imm=16,flags=0")     /* bitfields */
    { }
};

In this example, the opcode uses 6 bits, the source and destination registers use 5 bits each, the immediate value uses 16 bits, and the instruction flags are not used. Templates are reused to group similar instructions, i.e., those represented with the same memory layout. For example, arithmetic instructions between registers are based on one template, while arithmetic instructions involving immediate values are based on another template.

A user-defined instruction inherits properties from an instruction template, and extends it with a name and an implementation. The name is used to reference the instruction from the assembly source code, and the implementation specifies the instruction's functionality. An execute method describes the functionality by reading processor registers, performing a specific operation, and


4. SCENIC MODEL AND ARCHITECTURAL GENERATORS 103

Figure 10: Instruction templates. (a) Specification of operand fields to set the number of source, destination, and immediate values. (b) Specification of bitfields to describe the instruction's bit-accurate format. (c) General format used to derive instruction templates.

writing processor registers accordingly. An instruction can also control the program execution flow by modifying the program counter, and provides a method to specify the instruction delay for more accurate modeling. Instructions have the following format:

class subtract_immediate : public TypeB {              /* instruction class */
public:
    subtract_immediate() : TypeB("subi") {}            /* a TypeB template */
    int cycle_delay() { return 1; }                    /* instruction delay */
    void execute(sci_registers& regs, sci_ctrl& ctrl)  /* implementation */
    {
        dst(0)->write( src(0)->read() - imm(0) );      /* behavioral description */
    }
};

The execute method initiates a read from zero or more registers (i/o or general-purpose) and a write to zero or more registers (i/o or general-purpose). The instruction is only allowed to execute if all input values are available, and the instruction can only finish execution if all output values can be successfully written to result registers. Hence, an instruction performs blocking access to i/o port registers and to memory interfaces.

Memory Interfaces

The processor provides an interface to access external memory or to connect the processor to a system bus. External memory access is supported using virtual function calls, for which the user supplies a suitable adaptor implementation. For the processor to execute properly, functionality to load data from memory, store data in memory, and fetch instructions from memory is required. The implementation of the virtual memory access methods is illustrated in Figure 9(a),


Assembly program:

    .restart  addi %LACC,%R0,0
              addi %HACC,%R0,0
              addi %R1,$PIF,0
              addi $POF,$PID,0
              ilc  $C_FILTER_ORDER
    .repeat   dmov $POF,%R1,$PIF,$PIF
              mul{al} $PIC,%R1
              jmov $POD,%HACC,%LACC
              bri  .restart

Disassembly view:

    00: 0x844d0000  addi %LACC,%R0,0
    01: 0x846d0000  addi %HACC,%R0,0
    02: 0x85cb0000  addi %R1,%L6,0
    03: 0x85640000  addi %L6,%G,0
    04: 0xac000024  ilc 36
    05: 0x1d6e5ac0  dmov %L6,%R1,%L6,%L6
    06: 0x10003b83  mul{al} %L2,%R1
    07: 0x18e01880  jmov %L2,%HACC,%LACC
    08: 0xa400fff8  bri -8
    09: 0x00000000  nop
    0a: 0x00000000  nop

Instruction set (px):

    nop  [opcode|------------------------------]
    mul  [opcode|-----------| src | src |flags ]
    dmov [opcode| dst | dst | src | src |flags ]
    addi [opcode| dst | src |        imm       ]
    ilc  [opcode|-----------|        imm       ]
    ...

Figure 11: The processor provides a code generator to convert user-developed assembly code into processor-specific machine code. The machine code is stored in an external memory, and instructions are retrieved using the memory interface.

and is performed by the adaptor unit. The adaptor translates the function calls to bus accesses, and returns the requested data to the processor.

Programming and Code Generation

The architectural description provides the processor with information regarding the instructions' binary format, the instruction and register names, and the register coding. Based on this information, the processor automatically provides a code generator to convert user-developed assembly code into processor-specific machine code. Hence, when extending the instruction set with additional instructions, these can be used immediately in the assembly program without any manual development steps. The reverse process is also automatically supported, generating disassembled source code from the binary format. Figure 11 illustrates an embedded system, where an assembly program is downloaded to a processor. The processor translates the assembly program and stores the instructions in an external memory. The disassembly view of the external memory is also illustrated in Figure 11. Processors request instructions from external memory over the external memory interface connected to the bus.


Figure 12: The memory models are accessed using either the simulation ports or through the untimed interface. The untimed interface enables access to the memory contents from the Scenic shell. (The figure shows two memories mapped at addresses 0x00-0x07 and 0x10-0x17, with an untimed read referencing address 0x12.)

4.2 Memory Generator

Memories are frequently used in digital system designs, operating as external memories, cache memories, or register banks. The memory generator makes it possible to generate custom memory models for all these situations by allowing the memory wordlength, access latency, and timing parameters to be configured. Each generated memory module implements a generic interface for reading and writing the memory contents, which can also be accessed from the Scenic shell. As shown in Figure 9(a), adaptors translate user requests into memory accesses. Hence, the memory can be connected to standard bus formats, such as AMBA [78], by using appropriate adaptors.

The memory is not only active during simulation time, but also when downloading program code and when reading and writing application data from the Scenic environment. Therefore, an additional untimed interface is provided by the processors and memories, to be used even when the simulation is not running. The untimed interface is byte-oriented, and can hence be used for any memory regardless of the user-defined data type (abstraction). In addition, several memories can be grouped using a common memory map. A memory map is an address space that is shared by multiple memories, as shown in Figure 12. An assembly program can be translated by a processor and downloaded to any external memory through a global memory map.

The generated memories contain performance monitors to export statistics, and implement a custom access method to access the memory contents. The performance monitors provide information about the bandwidth utilization, the average data latency, and the amount of data transferred. The memory access pattern can be periodically logged to identify memory locations that are frequently requested. Frequent access to a data area may indicate that data caching is required in another part of the design.


Table 2: Architectural generators used to construct reconfigurable architectures.

    Generator   Scalability              Configurability
    ---------   ----------------------   ---------------------------
    Tile        - Dimension (W × H)      - Dynamic reconfiguration
                                         - Static template
    Topology                             - Topologies:
                                           Mesh, Torus, Ring, Custom
    Network     - Router fanout          - Router queue depth
                - Routing schemes        - Router latency
                - Network enhancements   - External connections

4.3 Architectural Generators

In addition to the model generators, three architectural generators are provided to construct more complex simulation platforms. The architectural generators construct customized arrays containing processor and memory models. Architectures are constructed in three steps using a tile generator, a topology generator, and a network generator. The tile generator creates arrays of simulation models, while the topology and network generators create and configure the local and global communication networks, respectively. Table 2 presents a summary of the scalability and configurability of the proposed architectural generators. The generators are used in Part IV to model and evaluate dynamically reconfigurable architectures, using both the presented model generators and the architectural generators. The following architectural generators for reconfigurable computing have been designed and are briefly described below:

• Tile Generator - The tile generator uses a tile template file to create a static array of resource cells, presented in Part IV, which are generic containers for any type of simulation model. When constructing larger arrays, the tile is used as the basic building block, as illustrated in Figure 13(a), and extended to arrays of arbitrary size. The resource cells are configured with a functional unit specified by the tile template, and can be either a processing cell or a memory cell.

• Topology Generator - The topology generator creates the local interconnects between resource cells, and supports mesh, torus, ring, and user-defined topologies. Figure 13(b) illustrates resource cells connected in a mesh topology. The local interconnects provide a high bandwidth between neighboring resource cells.


Figure 13: (a) Constructing the array from a tile description. (b) Building the local communication topology, which can be a mesh, ring, torus, or custom topology. (c) Building the global network with routers (in grey).

• Network Generator - The network generator inserts a global network, which connects to the resource cells using routers, as illustrated in Figure 13(c). The network generator is highly configurable, including the number of devices connected to a single router (fanout), the router queue depth and latency, and the possibility to create user-defined links to enhance the global communication network. The routing tables are automatically configured to enable global communication.

5 Conclusion

SystemC enables rapid system development and high simulation performance. The proposed exploration environment, Scenic, is a SystemC environment with interactive control that addresses the issues of controllability and observability of SystemC models. It extends the SystemC functionality to enable system construction and configuration from XML, and provides access to simulation and performance data using run-time reflection. Scenic provides advanced scripting capabilities, which allow rapid design exploration and performance analysis of complex designs. In addition, a library of model generators and architectural generators is proposed. Model generators are used to construct designs based on customized processing and memory elements, and architectural generators provide capabilities to construct complex systems, such as reconfigurable architectures and Network-on-Chip designs.


Part IV

A Run-time Reconfigurable Computing Platform

Abstract

Reconfigurable hardware architectures are emerging as a suitable and feasible approach to achieve high performance combined with flexibility and programmability. While conventional fine-grained architectures are capable of bit-level reconfiguration, recent work focuses on medium-grained and coarse-grained architectures that achieve higher performance using word-level data processing. In this work, a coarse-grained dynamically reconfigurable architecture is proposed. The system is constructed from an array of processing and memory cells, which communicate using local interconnects and a hierarchical routing network. Architectures are evaluated using the Scenic exploration environment and simulation models, and the implemented VHDL modules have been synthesized for a 0.13 µm cell library. A reconfigurable architecture of size 4 × 4 has a core area of 2.48 mm² and runs at up to 325 MHz. It is shown that mapping a 256-point FFT generates 18 times higher throughput than commercial embedded DSPs.

Based on: T. Lenart, H. Svensson, and V. Öwall, "Modeling and Exploration of a Reconfigurable Architecture for Digital Holographic Imaging," in Proceedings of IEEE International Symposium on Circuits and Systems, Seattle, USA, May 2008.

and: T. Lenart, H. Svensson, and V. Öwall, "A Hybrid Interconnect Network-on-Chip and a Transactional Level Modeling Approach for Reconfigurable Computing," in Proceedings of IEEE International Symposium on Electronic Design, Test and Applications, Hong Kong, China, Jan. 2008.


1 Introduction

Platforms based on reconfigurable architectures combine high-performance processing with flexibility and programmability [104, 105]. A reconfigurable architecture enables re-use in multiple design projects to allow rapid hardware development. This is an important aspect when developing consumer electronics, which are continuously required to include and support more functionality.

A dynamically reconfigurable architecture (DRA) can be reconfigured during run-time to adapt to the current operational and processing conditions. Using reconfigurable hardware platforms, radio transceivers dynamically adapt to the radio protocols used by surrounding networks [106], whereas digital cameras adapt to the currently selected image or video compression format. Reconfigurable architectures provide numerous additional advantages over traditional application-specific hardware accelerators, such as resource sharing to provide more functionality than there is physical hardware. Hence, currently inactivated functional units do not occupy any physical resources, but are instead dynamically configured during run-time. Another advantage is that a reconfigurable architecture may enable mapping of future functionality without additional hardware or manufacturing costs, which could also extend the lifetime of the platform.

In this part, a DRA is proposed and modeled using the Scenic exploration framework and simulation models presented in Part III. By evaluating the platform at system level, multiple design aspects are considered during performance analysis. The proposed design flow for constructing, modeling, and implementing a DRA is presented in Figure 1. System construction is based on the Scenic architectural generators, which use the model library to customize simulation components. Related work is discussed in Section 2, and the proposed architecture is presented in Section 3. In Section 4, the Scenic exploration environment is used for system-level integration, and the platform is modeled and evaluated for application mapping in Section 5. However, automated application mapping is not part of the presented work, but is a natural extension. In Section 6, the exploration models are translated to VHDL, synthesized, and compared against existing architectures.

2 Related Work

A range of reconfigurable architectures have been proposed for a variety of application domains [33]. The presented architectures differ in granularity, processing and memory organization, communication strategy, and programming methodology. For example, the GARP project presents a generic MIPS processor with reconfigurable co-processors [107]. PipeRench is a programmable datapath of virtualized hardware units that is programmed through self-managed


Figure 1: Design flow to construct a DRA platform. Modeling is based on the Scenic model and architectural generators, and the hardware implementation is described using VHDL.

configurations [108]. Other proposed architectures are the MIT RAW processor array [109], the REMARC array of 16-bit nano processors [110], the Chimaera reconfigurable functional unit [111], the RICA instruction cell [112], weakly programmable processor arrays (WPPA) [113, 114], and architectures optimized for multimedia applications [115]. A medium-grained architecture is presented in [116], using a multi-level interconnection network similar to this work.

Examples of commercial dynamically reconfigurable architectures include the field programmable object array (FPOA) from MathStar [117], which is an array of 16-bit objects that contain local program and data memory. The adaptive computing machine (ACM) from QuickSilver Technology [118] is a 32-bit array of nodes that each contain an algorithmic core supporting arithmetic, bit-manipulation, general-purpose computing, or external memory access. The XPP platform from PACT Technologies is constructed from 24-bit processing array elements (PAE), and communication is based on self-synchronizing data flows [119, 120]. The Montium tile processor is a programmable architecture where each tile contains five processing units, each with a reconfigurable instruction set, connected to ten parallel memories [106].

A desirable programming approach for the architectures above is software-centric, where applications are described using a high-level design language to ease development work and to increase productivity. The high-level description is then partitioned, scheduled, and mapped using automated design steps. Many high-level design languages for automated hardware generation exist [121], but few support application mapping onto arrays containing coarse-grained reconfigurable resources. For these architectures, the most widely used programming approach today is hardware-centric, where the designer manually develops, maps, and simulates applications. Processing elements are programmed using either a C-style syntax or directly in assembly code. Although a more complex design approach, this is the situation for most commercial reconfigurable platforms.

3 Proposed Architecture

Based on recently proposed architectures, it is evident that coarse-grained designs are becoming increasingly complex, with heterogeneous processing units comprising a range of application-tuned and general-purpose processors. As a consequence, efficient and high-performance memories for internal and external data streaming are required to supply the computational units with data.

However, increasing complexity is not a feasible approach for an embedded communication network. In fact, it has lately been argued that interconnect networks should only be sufficiently complex to fully utilize the computational power of the processing units [122]. For example, the mesh-based network-on-chip (NoC) structure has been widely researched in both industry and academia, but suffers from inefficient global communication due to multi-path routing and long network latencies. Another drawback is the large number of network routers required to construct a mesh. As a consequence, star-based and tree-based networks are being considered [123], as well as a range of hybrid network topologies [124].

This work proposes reconfigurable architectures based on the following statements:

• Coarse-grained architectures result in a better performance/area trade-off than fine-grained and medium-grained architectures for the current application domain. Flavors of coarse-grained processing elements are proposed to efficiently map different applications.

• Streaming applications require more than a traditional load/store architecture. Hence, a combination of a RISC architecture and a streaming architecture is proposed. Furthermore, the instruction set of each processor is customized to extend the application domain.

• Multi-level communication is required to combine high bandwidth with flexibility at a reasonable hardware cost [116]. A separate local and global communication network is proposed, which also simplifies the interface to external resources.

• Presented systems where a processor provides data to a tightly-coupled reconfigurable architecture are not a feasible solution for high-performance computing [110]. Hence, stream memory controllers are proposed to efficiently transfer data between the reconfigurable architecture and external memories.

3.1 Dynamically Reconfigurable Architecture

A hardware acceleration platform based on a dynamically reconfigurable architecture is proposed. The DRA is constructed from a tile of W × H resource cells and a communication network, as shown in Figure 2(a). Resource cell is the common name for all types of functional units, which are divided into processing cells (PC) and memory cells (MC). Processing cells implement the processing functionality to map applications to the DRA, while memory cells are used to store data tables and intermediate results during processing. The resource cells are dynamically reconfigurable to support run-time mapping of arbitrary applications.

An array of resource cells is constructed from a tile template. A tile template is user-defined and contains the pattern in which processing and memory cells are distributed over the array. For example, the architecture presented in Figure 2(a) is based on a tile template of size 2 × 2, with two processing cells and two memory cells. The template is replicated to construct an array of arbitrary size.

The resource cells communicate over local interconnects and a global hierarchical network. The local network of dedicated wires enables high-speed communication between neighboring resource cells, while the global network provides communication flexibility, allowing any two resource cells in the array to communicate [122].

3.2 Processing Cells

Processing cells are computational units on which applications are mapped. The processing cells contain a configurable number of local input and output ports (Lx) to communicate with neighboring processing or memory cells, and one global port (G) to communicate on the hierarchical network.

Three different processing cells are presented and implemented: a 32-bit DSP processor with radix-2 butterfly support, a 16/32-bit MAC processor with multiplication support, and a 16-bit CORDIC processor for advanced function evaluation. The DSP and MAC processors are based on a similar architecture,


Figure 2: (a) Proposed architecture with an array of processing and memory cells, connected using a local and a global (hierarchical) network, where W = H = 8. (b) Internal building blocks in a processing cell, which contains a programmable processing element.

with a customized instruction set and a program memory to hold processor instructions, as shown in Figure 2(b). Data is processed from a register bank that contains the general-purpose registers (R0-Rx) and the i/o port registers (L0-Lx, G). An instruction that accesses i/o port registers is automatically stalled until data becomes available. Hence, from a programming perspective there is no difference between accessing general-purpose registers and i/o port registers. The DSP and MAC processors have the following functionality and special properties to efficiently process data streams:

• i/o port registers - The port registers connect to surrounding processing and memory cells. Port registers are directly accessible in the same way as general-purpose processor registers, and can be used in arithmetic operations. Hence, no load/store operations are required to move data between registers and ports, which significantly increases the processing rate.

• Dual ALU - A conventional ALU takes two input operands and produces a single result value. In contrast, the DSP and MAC processors include two separate ALUs to produce two values in a single instruction. This is useful when computing a radix-2 butterfly or when moving two data values in parallel. The MAC processor uses a parallel move instruction


Table 1: Supported functionality by the configurable processing cells.

    Processor   ALU format   Dual ALU         Separable ALU       Special
                             (ALU1/ALU2)      (ALUhi_n/ALUlo_n)   functions
    ---------   ----------   --------------   -----------------   -----------------
    DSP         32-bit,      +/+ 2 × 32-bit   +/+ 2 × 16-bit      Radix-2 Butterfly
                2 × 16-bit   −/− 2 × 32-bit   −/− 2 × 16-bit
                             +/− 2 × 32-bit   +/− 2 × 16-bit
    MAC         16-bit       Split/Join:                          Multiply
                             2 GPR → i/o                          Multiply/Add
                             i/o → 2 GPR                          Multiply-Acc.
    CORDIC      2 × 16-bit   16-stage                             Multiply/Divide
                             pipeline                             Trigonometric
                                                                  Magnitude/Phase

to split and join 16-bit internal registers and 32-bit i/o registers.

• Separable ALU - Each 32-bit ALU data path can be separated into two independent 16-bit fields, where arithmetic operations are applied to both fields in parallel. This is useful when operating on complex-valued data, represented as a 2 × 16-bit value. Hence, complex values can be added and subtracted in a single instruction.

• Inner loop counter - A special set of registers is used to reduce control overhead in compute-intensive inner loops. The inner loop counter (ILC) register is loaded using a special instruction that stores the next program counter address. Each instruction contains a flag that indicates end-of-loop, which updates the ILC register and reloads the previously stored program counter.

The CORDIC processor is based on a 16-stage pipeline of adders and subtractors, with 2 input and 2 output values, capable of producing one set of output values each clock cycle. The CORDIC processor operates in either vectoring or rotation mode to support multiplication, division, and trigonometric functions, and to calculate the magnitude and phase of complex-valued data [70].

The supported functionality is summarized in Table 1. As shown, the DSP processor is designed for standard and complex-valued data processing, while the MAC processor supports multiply-accumulate and the CORDIC processor provides complex function evaluation. The instruction set and a more detailed architectural schematic of the DSP and MAC processors are presented in Appendix B.


Figure 3: A memory cell contains a descriptor table, a memory array, and a controller. The descriptor table indicates how data is to be transferred between the ports and which part of the memory is allocated for each descriptor. (The figure shows a 64-bit descriptor table with FIFO descriptors holding source/destination ports, base and high addresses, and read/write pointers, and MEM descriptors holding address/data ports and transfer parameters, connected through a switch to the local ports L0-Lx and the global port G.)

3.3 Memory Cells

Most algorithms require intermediate storage space to buffer, reorder, or delay data. Memories can be centralized in the system, internal to processing cells, or distributed over the array. However, to fully utilize the computational power of the processing cells, storage units are required close to the computational units. Naturally, an internal memory approach satisfies this requirement, but suffers from the drawback that it is difficult to share between processing cells [125]. In contrast, a distributed approach enables surrounding cells to share a single memory. Hence, shared memory cells are distributed in a user-defined pattern (tile template) over the array.

An overview of the memory cell architecture is shown in Figure 3; it is constructed from a descriptor table, a memory array, and a controller unit. The descriptor table contains the stream transfers to be processed, and is dynamically configured during run-time. The memory array is a shared dual-port memory bank that stores the data associated with each stream transfer. The controller unit manages and schedules the descriptor processing and initiates transfers between the memory array and the external ports.

Stream transfers are rules that define how the external ports communicate with neighboring cells, operating in either fifo or ram mode. A stream transfer in fifo mode is configured with a source port, a destination port, and an allocated memory area. The source and destination reference either local ports (L0-Lx) or the global port (G). The memory area is a part of the shared memory bank


that is reserved for each stream transfer, indicated by a base address and a high address. In fifo mode, the reserved memory area operates as a circular buffer, and the controller unit maintains address pointers to the current read and write locations. Input data is received from the source port and placed at the current write location inside the memory array. At the same time, the destination port receives the data stored in the memory array at the current read location.

In a similar fashion, a stream transfer in ram mode is configured with an address port, a data port, and an allocated memory area. The controller unit is triggered when the address port receives either a read or a write request, which specifies a memory address and a transfer length. If the address port receives a read request, data starts streaming from the memory array to the configured data port. Conversely, if the address port receives a write request, data is fetched from the configured data port and stored in the memory array.

3.4 Communication and Interconnects

The resource cells communicate over local and global interconnects, where data transactions are synchronized using a handshake protocol. The local interconnects handle communication between neighboring resource cells, and provide a high communication bandwidth. Hence, it is assumed that the main part of the total communication is between neighboring cells and flows through local interconnects. However, a global hierarchical network enables non-neighboring cells to communicate and provides an interface to external memories. Global communication is supported by routers that forward network packets over a global network. In the proposed architecture, routers use a static lookup table that is automatically generated at design-time. Since the routing network is static, there is only one valid path from each source to each destination. Static routing simplifies the hardware implementation, and enables each router instance to be optimized individually during logic synthesis. A drawback of static routing is network congestion, but this mainly concerns networks with a high degree of routing freedom, for example a traditional mesh topology.

An overview of the network router architecture is shown in Figure 4(a); it is constructed from a decision unit with a static routing table, a routing structure to direct traffic, and a packet queue for each output port. The decision unit monitors the status of the input ports and output queues, and configures the routing structure to transfer data accordingly. The complexity of the routing structure determines the routing capacity, as illustrated in Figure 4(b). A parallel routing structure is associated with a higher area cost and requires a more complex decision unit.


Figure 4: (a) Network router constructed from a decision unit, a routing structure, and packet queues for each output port. (b) Architectural options for routing structure implementation. (c) Local flit format (white) and extension for global routing (grey): a 2-bit nType ∈ {data, config, control}, a 2-bit pType ∈ {data, read, write}, src and dst IDs of ⌈log2(#IDs)⌉ bits each, and a 32-bit payload.

Network Packets

The routers forward network packets over the global interconnects. A network packet is a carrier of data and control information from a source to a destination cell, or between resource cells and an external memory. A data stream is a set of network packets, and for streaming data each individual packet is sent as a network flow control digit (flit). A flit is an atomic element that is transferred as a single word on the network, as illustrated in Figure 4(c). A flit consists of a 32-bit payload and a 2-bit payload type identifier to indicate whether the flit contains data, a read request, or a write request. For global routing, unique identification numbers are required, as discussed in the next section. An additional 2-bit network type identifier indicates whether the packet carries data, configuration, or control information. Data packets have the same size as a flit, and contain a payload to be either processed or stored in a resource cell. Configuration packets contain a functional description of how resource cells should be configured, and are further discussed in Section 4.2. Configurations are transferred with a header specifying the local target address and the payload size, to be able to handle partial configurations. Control packets are used to notify the host processor of the current processing status, and are reserved for exchanging flow control information between resource cells.
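The flit fields can be illustrated with a small packing and unpacking sketch. Only the field widths (2-bit type fields, ⌈log2(#IDs)⌉-bit IDs, 32-bit payload) follow Figure 4(c); the exact bit ordering is an assumption for illustration.

```python
import math

PTYPE = {"data": 0, "read": 1, "write": 2}      # 2-bit payload type
NTYPE = {"data": 0, "config": 1, "control": 2}  # 2-bit network type

def pack_flit(payload, ptype, ntype, src, dst, n_ids=17):
    """Pack one global flit; the field order here is an assumption."""
    idw = math.ceil(math.log2(n_ids))           # bits per network ID
    flit = payload & 0xFFFFFFFF
    flit |= PTYPE[ptype] << 32
    flit |= NTYPE[ntype] << 34
    flit |= (dst & ((1 << idw) - 1)) << 36
    flit |= (src & ((1 << idw) - 1)) << (36 + idw)
    return flit

def unpack_flit(flit, n_ids=17):
    idw = math.ceil(math.log2(n_ids))
    return {
        "payload": flit & 0xFFFFFFFF,
        "ptype":   (flit >> 32) & 0x3,
        "ntype":   (flit >> 34) & 0x3,
        "dst":     (flit >> 36) & ((1 << idw) - 1),
        "src":     (flit >> (36 + idw)) & ((1 << idw) - 1),
    }

f = pack_flit(0xDEADBEEF, "write", "data", src=3, dst=16)
assert unpack_flit(f) == {"payload": 0xDEADBEEF, "ptype": 2,
                          "ntype": 0, "dst": 16, "src": 3}
```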


Figure 5: (a) Recursive assignment of network IDs. (b) A range of consecutive network IDs is assigned to each router table. (c) Hierarchical router naming as Rindex,level.

Network Routing

Each resource cell allocates one or more network identifiers (IDs), which are integer numbers that uniquely identify the cell, as shown in Figure 5(a). The identification field is represented with ⌈log2(W × H + IDext)⌉ bits, where IDext is the number of network IDs allocated for external communication.

A static routing table is stored inside each router and used to direct traffic over the network. At design-time, network IDs and routing tables are recursively assigned by traversing the global network from the top router. Recursive assignment ensures that each entry in the routing table for a router Ri,l, where i is the router index number and l is the router's hierarchical level as defined in Figure 5(c), is a contiguous range of network IDs, as illustrated in Figure 5(b). Hence, network ID ranges are represented with a base address and a high address. The network connectivity C is defined by a function

C(Ri,l, Rm,n) = 1 if there is a link from router Ri,l to router Rm,n (Ri,l ≠ Rm,n), and C(Ri,l, Rm,n) = 0 otherwise. Based on the network connectivity, the routing table Λ for a router Ri,l is defined as

Λ(Ri,l) = {λ(Ri,l), κ(Ri,l)}, (1)

where λ(Ri,l) is the set of network IDs of routers reachable from Ri,l at lower hierarchical levels, and κ(Ri,l) is the set of network IDs of routers reachable from Ri,l on the same hierarchical level, as

λ(Ri,l) = {λ(Rj,l−1) : C(Ri,l, Rj,l−1) = 1, l > 0}


Figure 6: (a) Enhancing the router capacity when the hierarchical level increases. (b) Enhancing the network capacity by connecting routers at the same hierarchical level.

and κ(Ri,l) = {λ(Rj,l) : C(Ri,l, Rj,l) = 1}.

At the lowest hierarchical level, where l = 0, the reachable nodes in λ(Ri,0) are the network IDs reserved by the connected resource cells. At this level, λ is always represented as a contiguous range of network IDs.

A link from a router Ri,l to a router Rj,l+1 is referred to as an uplink. Any packet received by a router R is forwarded to the uplink router if the packet's network ID is not found in the routing table Λ(R). A router may only have a single uplink port; otherwise, the communication path could become non-deterministic.
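The lookup-plus-uplink rule can be sketched as follows; the function and table encoding are illustrative, with ranges matching the routing table of sub-router R0,0 listed below.

```python
# Sketch of the static routing decision: each table entry maps a
# contiguous network-ID range [base, high] to an output port; IDs not
# found in the table are forwarded to the uplink port.
def route(table, uplink_port, dst_id):
    for (base, high), port in table.items():
        if base <= dst_id <= high:
            return port
    return uplink_port

# Table of sub-router R_00 (ports 0-3 reach resource cells 0-3,
# port 4 is the uplink)
r00 = {(0, 0): 0, (1, 1): 1, (2, 2): 2, (3, 3): 3}
assert route(r00, uplink_port=4, dst_id=2) == 2    # local resource cell
assert route(r00, uplink_port=4, dst_id=12) == 4   # unknown ID -> uplink
```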

In the Scenic environment, routing tables can be extracted for any router. The tables below show the contents for the top router R0,1 and sub-router R0,0 from Figure 5(b):

    Routing table for R_01:            Routing table for R_00:
    ------------------------------     --------------------------------------
    Port 0 -> ID 0-3   [R_00]          Port 0 -> ID 0-0 [resource cell #0]
    Port 1 -> ID 4-7   [R_10]          Port 1 -> ID 1-1 [resource cell #1]
    Port 2 -> ID 8-11  [R_20]          Port 2 -> ID 2-2 [resource cell #2]
    Port 3 -> ID 12-15 [R_30]          Port 3 -> ID 3-3 [resource cell #3]
    Port 4 -> ID 16-16 [Memory]        Uplink is port 4

The top router connects to an external memory, which is further discussed inSection 4.1.


Figure 7: (a) Hierarchical view of the router interconnects and external interfaces. (b) External memories connected to the network routers. The memories are uniquely identified using the assigned network IDs.

Network Capacity

When the size of a DRA increases, the global communication network is likely to handle more traffic, which requires network enhancements. One solution is to increase the network capacity of the communication links, as shown in Figure 6(a). Since routers on a higher hierarchical level could become potential bottlenecks in the system, these routers and router links are candidates for network link capacity scaling. This means that a single link between two routers is replaced by parallel links to improve the network capacity. A drawback is increased complexity, since a more advanced router decision unit is required to avoid packet reordering. Otherwise, if packets from the same stream are divided onto different parallel links, individual packets might arrive out of order at the destination.

Another way to improve the communication bandwidth is to insert additional network paths to avoid routing congestion in higher-level routers, referred to as network balancing. Figure 6(b) shows an example where all Ri,1 routers are connected to lower the network traffic through the top router. Additional links may be inserted between routers as long as the routing table in each network router remains deterministic. When a network link is created between two routers, the destination router's reachable IDs (λ) are inserted into the routing table (Λ) of the source router. Links that do not satisfy the conditions above are not guaranteed to automatically generate deterministic routing tables, but could still represent a valid configuration.
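This table update can be sketched as follows, with a simple range-overlap check standing in for the full determinism condition; names and encoding are illustrative.

```python
# Sketch of network balancing: creating a link from a source router to a
# destination router inserts the destination's reachable IDs (lambda)
# into the source routing table, provided the new ID range does not
# overlap an existing entry (determinism check).
def add_link(routing_table, new_port, dst_lambda):
    base, high = dst_lambda
    for (b, h) in routing_table:
        if base <= h and b <= high:          # ranges overlap
            raise ValueError("non-deterministic: range already routed")
    routing_table[(base, high)] = new_port

r11 = {(4, 7): 0}                # router initially reaches IDs 4-7
add_link(r11, new_port=3, dst_lambda=(0, 3))   # link to a sibling router
assert r11 == {(4, 7): 0, (0, 3): 3}
```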


Figure 8: An embedded system containing a DRA, which is connected to a stream memory controller (SMC) to transfer streaming data between the DRA and external memory. Configurations are generated by the GPP and sent over the DRA bridge.

External Communication

The network routers are used for connecting external devices to the DRA, as illustrated in Figure 7(a). When binding an external device to a router, a unique network ID is assigned and the routing tables are automatically updated to support the new configuration. Examples of external devices are memories for data streaming, as illustrated in Figure 7(b), or an interface to receive configuration data from an embedded GPP. Hence, data streaming to and from external memories requires the use of the global network. Details about external communication and how to enable efficient memory streaming are further discussed in Section 4.1.

4 System-Level Integration

Additional system components are required to dynamically configure resource cells, and to transfer data between resource cells and external memories. Therefore, the proposed DRA is integrated into an embedded system containing a general-purpose processor (GPP), a multi-port memory controller (MPMC), and a proposed stream memory controller (SMC) to efficiently supply the DRA with data. The GPP and the MPMC are based on the Scenic processor and memory generators, while the SMC is custom designed. The system architecture is shown in Figure 8, where the top network router is connected to the SMC to receive streaming data from an external memory. The top network router is also connected to a bridge, which allows the GPP to transmit configuration data over the system bus.

The MPMC contains multiple ports to access a shared external memory,


Figure 9: A stream memory controller (SMC) that handles data transfers to and from a DRA. Data is transferred in blocks, and the SMC handles memory addressing and double-buffering. An active transfer is indicated by a star sign (∗).

where one port is connected to the system bus and one port to the SMC. For design exploration, the MPMC can be configured to emulate different memory types by changing the operating frequency, wordlength, data-rate, and internal parameters controlling the memory timing. This is useful when optimizing and tuning the memory system, which is presented in Section 5.2.

4.1 Stream Memory Controller

The SMC is connected to one of the memory ports on the MPMC, and manages data streams between resource cells and external memory. Hence, the DRA is only concerned with data processing, whereas the SMC provides the data streams to be processed.

The internal building blocks of the SMC are shown in Figure 9, and include a stream descriptor table, a control unit, and i/o stream buffers. The stream descriptor table is configured from the GPP, and each descriptor specifies how data is transferred [126]. The control unit manages the stream descriptors and initiates transfers between the external memory and the stream buffers. Data is transferred from the stream buffers to stream ports, which are connected to the DRA.

There is one stream descriptor for each allocated network ID, and each descriptor is individually associated with the corresponding resource cell. A stream descriptor consists of a memory address, a direction, a size, a shape, and additional control bits. The memory address is the allocated area in main memory that is associated


Figure 10: 32-bit configuration packets. (a) Configuration of a processing cell, including program memory and control register. (b) Configuration of a memory cell, including memory array and descriptor table.

with the stream transfer. The transfer direction is either read or write, and the number of words to transfer is indicated by the size. A shape describes how data is accessed inside the memory area, and consists of three parameters: stride, span, and skip [127]. Stride is the memory distance to the next stream element, and span is the number of elements to transfer. Skip indicates the distance to the next start address, which restarts the stride and span counters. The use of programmable memory shapes is a flexible and efficient method to avoid address generation overhead in functional units.
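The stride/span/skip addressing can be sketched as a small generator. One detail is an interpretation on our part: skip is assumed to be applied from the last element of a completed run, which is consistent with the [8 8 −55] shape used for the 8 × 8 transpose in Figure 9.

```python
# Sketch of a shaped address generator (illustrative): stride is the
# distance to the next element, span the number of elements per run, and
# skip the distance applied after a completed run, restarting the
# stride/span counters.
def shaped_addresses(base, size, stride, span, skip):
    addr, count = base, 0
    for _ in range(size):
        yield addr
        count += 1
        if count == span:    # run complete: apply skip, restart counters
            addr += skip
            count = 0
        else:
            addr += stride

# Column-wise access of a row-major 4 x 4 matrix: stride 4, span 4,
# skip -11 (from last element 12 back to 1), analogous to the 8 x 8
# transpose shape [8 8 -55] in Figure 9
assert list(shaped_addresses(0, 8, stride=4, span=4, skip=-11)) == \
       [0, 4, 8, 12, 1, 5, 9, 13]
```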

Additional control bits enable resource cells to share buffers in an external memory. The reference field (ref) contains a pointer to an inactivated stream descriptor that shares the same memory area. When the current transfer completes, the associated stream is activated and allowed to access the memory. Hence, the memory area is used by two stream transfers, but the access to the memory is mutually exclusive. Alternatively, the descriptors are associated with different memory regions, and the data pointers are automatically interchanged when both stream transfers complete, which enables double-buffering. An example is shown in Figure 9, where stream descriptors 0 and 2 are configured to perform an 8 × 8 matrix transpose operation. Data is written column-wise, and then read row-wise once the write transaction completes.

4.2 Run-Time Reconfiguration

The resource cells are reconfigured during run-time to emulate different functionality. Configuration data is sent over the global network, and a destination cell is specified using its network ID. Hence, there is no additional hardware cost for dedicated configuration interconnects. Furthermore, partial reconfiguration is supported, which means that only the currently required part of a


processing or memory cell is reconfigured. This may significantly improve the reconfiguration time.

A resource cell is reconfigured by sending configuration packets to the allocated network ID. Each configuration packet contains a header, which specifies the target address and the size of the payload data, as well as the payload itself. Several partial configuration packets may be sent consecutively to configure different regions inside a resource cell. The configuration packet format for a processing cell is illustrated in Figure 10(a). A processing cell has two configurable parts, a program memory and a control register. The program memory is indexed from address 1 to Mpc, where Mpc is the size in words of the memory array. Address 0 is reserved for the control register, which contains configuration bits to reset, start, stop, and single-step the processor. The configuration packet format for a memory cell is illustrated in Figure 10(b). A memory cell is also divided into two configurable parts, a memory array and a descriptor table. The memory array contains 32-bit data, while descriptors are 64 bits wide, hence requiring two separate configuration words for each descriptor.

The system configuration time depends on the required resources for an application mapping A, which includes the size of the application data and programs to be downloaded. The reconfiguration time tr is measured in clock cycles and is the sum of all the partial reconfigurations in A. For the array in Figure 2(a) with W = H = 8, Mpc = 64 words of program memory, and Mmc = 256 words in the memory array, the configuration time is in the range of tr = 32Mpc + 32Mmc ≈ 10K clock cycles for a complete system reconfiguration. At 300 MHz, this corresponds to a configuration delay of 34 µs. However, in most situations the partial reconfiguration time is only a fraction of the time required for a full system reconfiguration.
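The estimate above can be reproduced with simple arithmetic; per-packet header overhead is ignored, as in the approximation in the text.

```python
# Reproducing the configuration-time estimate: an 8 x 8 array with 32
# processing cells of Mpc = 64 program words each and 32 memory cells of
# Mmc = 256 words each (header overhead ignored).
Mpc, Mmc = 64, 256
t_r = 32 * Mpc + 32 * Mmc        # clock cycles for full reconfiguration
delay_us = t_r / 300e6 * 1e6     # configuration delay at 300 MHz

assert t_r == 10240              # ~10K clock cycles
assert round(delay_us, 1) == 34.1
```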

5 Exploration and Analysis

This section presents simulation results and performance analysis of the proposed reconfigurable platform. It is divided into network simulation, memory simulation, and application mapping. The Scenic exploration framework enables configuration of resource cells to define different scenarios, and supports extraction of performance metrics, such as throughput and latency, during simulation. For application mapping, Scenic is used to configure resource cells using an XML-based description, and to analyze systems for potential performance bottlenecks during simulation.

5.1 Network Performance

To evaluate the network performance, a DRA is configured as an 8 × 8 array consisting of traffic generators, which are resource cells that randomly send packets


Figure 11: Network performance in terms of accepted traffic T and network latency L, for the original network from Figure 2(a), the network with increased link capacity from Figure 6(a), and the network with balanced traffic load from Figure 6(b).


to neighboring cells and to the global network. Packets are annotated with statistical information to monitor when each packet was produced, injected, received, and consumed. Each packet also records the number of network hops from the source node to the destination node, while the traffic generators monitor the number of received packets and the transport latency. Since communication is both local and global, a localization factor α is defined as the ratio between the amount of local and global communication, where α = 1 corresponds to purely local communication and α = 0 corresponds to fully global communication.

The traffic generators inject packets into the network according to a Bernoulli process, which is a commonly used injection process to characterize networks [128]. The injection rate, r, is the number of packets injected into the local or global network per clock cycle and per resource cell. The accepted traffic is the network throughput T, which is measured in packets per clock cycle and per resource cell. Ideally, the accepted traffic should increase linearly with the injection rate. However, due to traffic congestion in the global network, the amount of accepted traffic saturates at a certain level. The average transport latency is defined as L = (1/N) Σi Li, where Li is the transport latency for packet i and N is the total number of consumed packets. Transport latency is measured as the number of clock cycles between a successful injection into the network and consumption at the final destination.
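A toy model of Bernoulli injection (not the Scenic simulator; the shared service capacity is an illustrative stand-in for the global network) reproduces the qualitative behavior: accepted traffic saturates and latency grows once the offered load exceeds the capacity.

```python
import random

# Toy Bernoulli-injection experiment: each of n_cells sources injects a
# packet with probability r per clock cycle, and a shared resource
# serves up to `capacity` packets per cycle.
def simulate(r, n_cells=64, capacity=16, cycles=2000, seed=1):
    random.seed(seed)
    queue, latencies = [], []
    for t in range(cycles):
        for _ in range(n_cells):
            if random.random() < r:          # Bernoulli injection per cell
                queue.append(t)              # remember injection time
        for _ in range(min(capacity, len(queue))):
            latencies.append(t - queue.pop(0) + 1)
    T = len(latencies) / (cycles * n_cells)  # accepted traffic per cell/cc
    L = sum(latencies) / len(latencies)      # average transport latency
    return T, L

T_low, L_low = simulate(r=0.1)   # below saturation: T tracks r
T_sat, L_sat = simulate(r=0.5)   # above saturation: T caps at 16/64 = 0.25
assert abs(T_sat - 0.25) < 0.02 and L_sat > L_low
```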

The network performance depends on both the local and the global network. As discussed in Section 3.4, the global network may be enhanced either by improving the link capacities or by inserting additional network links between routers. Figure 11 shows the simulation results for three network scenarios based on an 8 × 8 resource array. The first simulation is the original network configuration shown in Figure 2(a). The second and third scenarios are the enhanced routing networks shown in Figure 6(a-b). As shown, the network enhancements yield a higher communication bandwidth with reduced latency. Performance is measured as the combined local and global communication, i.e. the overall performance, while latency is measured for global communication. The routers used in this simulation can manage up to two data transactions per clock cycle, but only for transactions that do not access the same physical ports.

The simulations in Figure 11 indicate how much traffic can be injected into the global network before saturation. Assuming an application with 80% local communication, i.e. α = 0.8, the networks can manage an injection rate of around 0.3, 0.8, and 0.8, respectively. This illustrates the need for capacity enhancements in the global network. Since both enhancement techniques have similar performance, the network in Figure 6(b) is the better candidate, since it is less complex and easier to scale with existing router models. However,


Figure 12: The impact on network performance of different router output queue depths Q when α = 0. The curve starts to flatten out around Q = 8.

if a shared memory is connected to the top router, more bandwidth will berequired, which instead implies increased link capacity to the top router.

Another important system parameter is the router output queue depth, which affects the router complexity and the network performance. Network buffers are expensive, and avoiding large memory macros for each router is desirable. Figure 12 shows the global network performance when α = 0 as a function of the router queue depth Q. The final simulation point shows the performance with unlimited buffer space, to illustrate the difference between feasible implementations and the maximum theoretical performance. Around Q = 8, each curve starts to flatten out at 80-90% of its Tmax, hence a suitable trade-off between resource requirements and network performance. The hardware requirements for different router implementations are presented in Section 6.

5.2 Memory Simulation

Other important system aspects are the external memory configuration and the external memory interface. The functional units operate on streams of data, which are managed by the SMC. An input stream, i.e. from memory to the DRA, is fully controlled by the SMC unit and is transferred as a block of consecutive elements (assuming a trivial memory shape). Hence, the memory access pattern renders a high bandwidth to external memory. In contrast, an


output stream, i.e. from the DRA to memory, can be managed in different ways, either as a locally controlled or as a globally controlled memory transfer. Local control means that each resource cell transfers data autonomously, while global control implies stream management by the SMC.

To evaluate the memory stream interface, two architectural mappings and five memory scenarios are evaluated. The first mapping executes an application on a single processing cell, and the second divides the data processing onto two processing cells, as shown in Figure 13(a-b). This illustrates that throughput is not only a matter of parallelism, but a balance between shared system resources such as the global network and the memory interface.

A processing cell is configured with an application to post-process images in digital holography, as illustrated in Part I Figure 3(c). The post-processing step generates a superposed image of magnitude and phase data after image reconstruction, to enhance the visual perception. A configuration is downloaded to combine the RGB colors of two input streams, and to write the result to an output stream. Four clock cycles are required to process one pixel, resulting in a throughput of 1/4 samples per clock cycle. The DRA and the external memory are assumed to operate at the same clock frequency; hence 3/4 of the theoretical memory bandwidth is required for the three streams. The following Scenic script constructs a DRA platform, downloads two bitmap images through the MPMC, and inserts stream descriptors in the SMC to transfer data to and from the DRA:

    xml load -file "system_4x4.xml"                          % load XML design
    sim                                                      % create simulation platform
    xml config -name "post-process"                          % load XML configuration

    MPMC wr -addr 0x100000 -file "amplitude.bmp"             % download bitmaps to memory
    MPMC wr -addr 0x200000 -file "phase.bmp"
    SMC insert -id 1 -addr 0x100000 -rnw TRUE  -size $IMSIZE % insert stream descriptors
    SMC insert -id 3 -addr 0x200000 -rnw TRUE  -size $IMSIZE
    SMC insert -id 2 -addr 0x300000 -rnw FALSE -size $IMSIZE

    runb 4 ms                                                % run simulation

    MPMC rd -addr 0x300000 -size $IMSIZE -file "result.bmp"  % save result image
    set TRANSFERS [ra.cell* get -var transfers]              % extract transfers
    set SIM_CC [SMC get -var last_transfer]                  % extract clock cycles
    set UTILIZATION [eval [eval sum $TRANSFERS] / $SIM_CC]   % calculate utilization

The system is evaluated for different transfer lengths, where the transfer length is the size in words of a transfer between the external memory and the processing cell. The transfer length is important for a realistic case study, since external memory is burst oriented with a high initial latency to access the first word of a transfer. Consequently, the simulation in Figure 14 plot (a) shows improved throughput when the transfer length increases.


Figure 13: (a) Application mapping to a single processing cell, using two read streams and one write stream to external memory. (b) Dividing the computations onto two processing cells, which requires two sets of i/o streams and double the bandwidth to external memory.

The application mapping is then modified to execute on two parallel processing cells, sharing the computational workload to increase throughput. Initially, the performance curve follows the result from the mapping to a single processing cell, as shown in Figure 14 plot (b). However, at a certain burst length the performance suddenly decreases, yielding a worse result than the previous application mapping. Further analysis, by inspecting the MPMC statistics using Scenic, shows in Figure 15 plot (b) that the memory access pattern changes abruptly at this point, caused by interleaving of the result streams from the two processing cells. Interleaving results in shorter burst transfers to external memory, and therefore also a lower throughput.

The stream interleaving problem can be addressed by using globally controlled streaming, where the SMC operates as an arbitration unit and grants the resource cells access to external memory. To avoid transfer latency between two consecutive grants, two resource cells are granted access to the memory with an overlap in time, and the two streams are reordered inside the SMC to optimize the memory access pattern for burst-oriented memories.

The result with globally controlled and reordered streams is shown in Figure 14 plot (c). Reordering solves the burst transfer issue, but the throughput has not increased. By inspecting the processing cells using Scenic, it can be seen that the utilization is only 50% in each processing element. Hence, more memory bandwidth is required to supply the processing elements with data.

To address the bandwidth issue, the MPMC is reconfigured to evaluate new scenarios, first emulating a double data-rate (DDR) memory and then increasing the memory wordlength to 64 bits. Simulations with DDR and 64-bit


Figure 14: (a-e) Throughput relative to execution on a single processing cell, as a function of the transfer length in words: (a) single PC with 32-bit memory, (b) dual PC with 32-bit memory, (c) dual PC with 32-bit memory (reorder), (d) dual PC with 32-bit DDR memory (reorder), and (e) dual PC with 64-bit DDR memory (reorder).

Figure 15: Number of row address strobes to external memory for (a) a single PC with 32-bit memory and (b) dual PCs with 32-bit memory. A high number indicates an irregular access pattern, which has a negative impact on throughput.

memory are shown in Figure 14 plots (d-e), and present dramatically increased data throughput, as expected. This illustrates that system performance is a combination of many design parameters.


.restart  addi %LACC,%R0,0
          addi %HACC,%R0,0
          addi %R1,$PIF,0
          addi $POF,$PID,0
          ilc  $FIR_ORDER
.repeat   dmov $POF,%R1,$PIF,$PIF
          mul{al} $PIC,%R1
          jmov $POD,%HACC,%LACC
          bri  .restart

<SCENIC>
  <CONFIG name="FIR-filter">
    <MODULE name="pc2x2">
      <PARAM name="SRC" value="fir.asm">
      <PARAM name="ADDR" value="0x1">
      <PARAM name="PID" value="%G">
      <PARAM name="PIC" value="%L2">
      <PARAM name="PIF" value="%L3">
      <PARAM name="POF" value="%L3">
      <PARAM name="POD" value="%L1">
      <PARAM name="FIR_ORDER" value="36">
    </MODULE>
    ...


Figure 16: Application mapping (FIR filter) that allocates one processing cell and two memory cells (grey). (a) The XML configuration specifies a source file and parameters for code generation, and from which ports data are streaming. (b) A generic assembly program that uses the XML parameters as port references.

5.3 Application Mapping

Currently, applications are manually mapped to the DRA using configurations. Configurations are specified in XML format, to allow rapid design exploration, and can be translated to binary files and reloaded by the embedded processor during run-time. A configuration allocates a set of resource cells and contains parameters to control the application mapping, as shown in Figure 16(a). For memory cells, these parameters configure the memory descriptors and the initial memory contents. For processing cells, the parameters specify a program source file and port mappings to assist the code generator. The source file is described using processor-specific assembly code, as presented in Part III in Section 4.1, and an example is shown in Figure 16(b). Hence, algorithm source files only describe how to process data, while port mappings describe how to stream data.
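The code-generation step amounts to substituting the configured parameter values into the generic assembly source. A minimal sketch of that substitution, using the parameter values from the FIR configuration in Figure 16(a) (the parsing details are assumptions, not the actual Scenic code generator):

```python
# Substitute configured port/parameter mappings into a generic assembly
# program, mirroring how the XML in Figure 16(a) specializes the source
# in Figure 16(b). Simplified sketch, not the actual Scenic generator.
import re

params = {                      # from the FIR-filter configuration
    "PID": "%G", "PIC": "%L2", "PIF": "%L3",
    "POF": "%L3", "POD": "%L1", "FIR_ORDER": "36",
}

generic_asm = "ilc $FIR_ORDER\nmul{al} $PIC,%R1\njmov $POD,%HACC,%LACC"

def specialize(asm, params):
    """Replace every $NAME reference with its configured value."""
    return re.sub(r"\$(\w+)", lambda m: params[m.group(1)], asm)

print(specialize(generic_asm, params))
# ilc 36
# mul{al} %L2,%R1
# jmov %L1,%HACC,%LACC
```

This separation is what lets one assembly source be reused across mappings: the program describes how to process data, and the configuration binds it to concrete ports.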

The following applications have been evaluated for the proposed DRA; they are common digital signal processing algorithms and are also used in image reconstruction for digital holographic imaging:

• FIR Filter - The time-multiplexed and pipeline FIR filters are mapped to the DRA using MAC processors. The time-multiplexed design requires one MAC unit and two memory cells, one for coefficients and one operating as a circular buffer for data values. The inner loop counter is used to efficiently iterate over data values and coefficients, which are multiplied pair-wise and accumulated. When an iteration completes, the least recent value in the circular buffer is discarded and replaced with the value on the input port. At the same time, the result from the accumulation is written to the output port. The time-multiplexed FIR implementation is illustrated in Figure 17(a), and the corresponding mapping to resource cells is shown in Figure 17(b). In contrast, the pipeline design requires one MAC unit for each filter coefficient; these units are serially connected to form a pipeline. Each unit multiplies the input value with its coefficient, adds the partial sum from the preceding stage, and forwards the data value and result to the next stage.

• Radix-2 FFT - The time-multiplexed and pipeline FFT cores are mapped to the DRA using DSP and CORDIC processors. DSPs are used for the butterfly operations, while CORDIC units emulate complex multiplication using vector rotation. Delay feedback units, implemented using memory cells operating in FIFO mode, are connected to each DSP butterfly. An example of an FFT stage is illustrated in Figure 17(c), and the corresponding mapping to resource cells is shown in Figure 17(d). For the time-multiplexed design, data is streamed through the DSP and CORDIC units n times for a 2^n-point FFT, where the butterfly size changes for every iteration. The pipeline radix-2 design is constructed from n DSP and n − 1 CORDIC units, which significantly increases the throughput but also requires more hardware resources.

• Matrix Transpose - A matrix transpose operation is mapped to illustrate that the DSP processors may alternatively be used as address generation units (AGU). A fast transpose operation requires two DSP units to generate read and write addresses, and the data is double-buffered inside a memory cell. While one buffer is being filled linearly, the other is drained using an addressing mode that transposes the data. When both operations finish, the DSP units switch buffers to transpose the next block.
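The time-multiplexed FIR mapping in the first item above can be sketched behaviorally: one circular buffer (the data memory cell), one coefficient memory, and a single MAC loop per output sample. A behavioral model, not the processor code:

```python
# Behavioral model of the time-multiplexed FIR mapping: one MAC unit,
# one coefficient memory cell, and one memory cell used as a circular
# buffer. Each output sample is one full pass over the N taps.
from collections import deque

def fir_time_multiplexed(samples, coeffs):
    n = len(coeffs)
    buffer = deque([0.0] * n, maxlen=n)   # circular-buffer memory cell
    outputs = []
    for x in samples:
        buffer.appendleft(x)              # newest value evicts the oldest
        acc = 0.0
        for value, coeff in zip(buffer, coeffs):  # inner-loop-counter MACs
            acc += value * coeff
        outputs.append(acc)               # accumulated result to output port
    return outputs

# Small check with a 3-tap filter and exactly representable coefficients.
print(fir_time_multiplexed([4.0, 8.0, 12.0], [0.5, 0.25, 0.25]))
# [2.0, 5.0, 9.0]
```

The pipeline variant distributes the same inner loop in space instead of time: one MAC stage per coefficient, each forwarding its partial sum to the next stage.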

Table 2 presents the mapping results in terms of reconfiguration time, throughput, and resource requirements. It can be seen that the reconfiguration time is negligible for most practical applications, since a configuration is assumed to be active for a much longer time than it takes to partially reconfigure the resource cells. It is also shown that both the pipeline FFT and the fast matrix transpose unit generate a throughput of close to 1 sample per clock cycle; hence, these configurations are candidates for mapping the presented reconstruction algorithm in digital holographic imaging, as discussed in Part I.
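The fast transpose mapping described above reduces to two address sequences over the double buffer: a linear write sequence and a transposed read sequence. A sketch of the two AGU address streams (the buffer handling is simplified; names and block size are illustrative):

```python
# Address generation for the double-buffered matrix transpose: one DSP
# acts as the write AGU (linear addresses), the other as the read AGU
# (transposed addresses). Draining one buffer with the transposed
# sequence while the other fills linearly yields the transposed block.
N = 3  # block dimension (illustrative)

def write_addresses(n):
    return list(range(n * n))              # linear fill, row-major

def read_addresses(n):
    return [col * n + row for row in range(n) for col in range(n)]

block = list(range(10, 10 + N * N))        # a 3x3 block stored row-major
buffer = dict(zip(write_addresses(N), block))
transposed = [buffer[a] for a in read_addresses(N)]

print(transposed)  # rows and columns exchanged
```

Since writes and reads proceed concurrently on the two buffers, the steady-state throughput approaches one sample per cycle, matching the N/(N + 2) ≈ 1 figure reported in Table 2.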


(Figure 17 appears here, with panels (a)-(d) as described in the caption.)

Figure 17: Examples of application mapping: (a) Data flow in a time-multiplexed FIR filter unit. (b) Mapping of MAC unit, buffer, and coefficients to the DRA. (c) Data flow graph of a radix-2 butterfly and complex multiplier. (d) Mapping of butterfly and complex multiplier to the DRA. The input data stream from the SMC is received from the global network.

Table 2: Application mapping results in terms of reconfiguration time, throughput, and resource requirements.

Application/Algorithm (A)          Reconfiguration  Throughput      #PC    #PC    #PC       #MC
                                   time tr (cc)     (Samples/cc)    (DSP)  (MAC)  (CORDIC)
FIR (N taps), TM                   12 + [8 + N]     1/(7 + 2N)      0      1      0         2
FIR (N taps), pipeline             10N              1/2             0      N      0         0
Radix-2 FFT (2^n bins), TM         24 + [2^n/2]     ≈ 1/n           1      0      1         2
Radix-2 FFT (2^n bins), pipeline   9n + [6n + 2^n]  ≈ 1             n      0      n − 1     ≤ 2n − 1
Transpose (N×N block)              7 + [3]          ≈ 1/2           1      0      0         1
Transpose (N×N block), fast        14 + [3]         N/(N + 2) ≈ 1   2      0      0         1


6 Hardware Prototyping

The Scenic models have been translated to VHDL and verified using the same configuration as during design exploration and analysis. Currently available VHDL models are the DSP and MAC processors presented in Section 3.2, the memory cell presented in Section 3.3, and the router presented in Section 3.4. The VHDL implementations have been individually synthesized to explore different parameter settings, and integrated to construct arrays of size 4 × 4 and 8 × 8. The results from logic synthesis in a 0.13µm technology are presented in Table 3, where each design has been synthesized with the following configuration parameters:

R    Number of general purpose registers
L    Number of local ports
D    Number of descriptors in each memory cell
Mpc  Number of 32-bit words of program memory
Mmc  Number of 32-bit words in the memory array
Q    Router output queue depth

The table also presents the maximum frequency and the memory storage space inside each hardware unit. The router overhead illustrates how large a part of the system resources is spent on global routing. The synthesis results show how the configuration parameters affect area, frequency, and required storage space. The DSP and MAC units are in the same area range, but the memory cell is slightly larger. When constructing an array of cells, it is important to choose cells that have similar area requirements. Hence, a memory cell with Mmc = 256 is a trade-off between area and memory space, since it is comparable in size with the processing cells. For routers, it can be seen that the output queue depth is associated with a large hardware cost. To avoid overhead from routing resources, it is important to minimize the queue depth. Hence, a router with Q = 4 has been chosen as a trade-off between area and network performance. As an example, the floorplan and layout of a 4 × 4 array, with 8 processing cells and 8 memory cells, are shown in Figures 18 and 19, respectively. The floorplan size is 1660 × 1660µm2 (90% core utilization).
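The reported router overhead can be sanity-checked against the standalone router areas in Table 3. The 4 × 4 array uses one 4-port and four 5-port routers (Q = 4); summing their standalone areas and dividing by the array area approximates the reported 15.8% (the match is inexact because standalone and in-array synthesis differ):

```python
# Approximate the 4x4 array's router overhead from Table 3 areas:
# one 4-port router (Q = 4) plus four 5-port routers (Q = 4).
router_area = 0.066 + 4 * 0.087   # mm^2, standalone synthesis results
array_area = 2.48                 # mm^2, full 4x4 array from Table 3

overhead = 100 * router_area / array_area
print(f"{overhead:.1f}%")         # close to the reported 15.8%
```

The same accounting explains why minimizing Q matters: the queue depth dominates the router area, and the routers in turn set the global-routing overhead of the whole array.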

7 Comparison and Discussion

Table 4 presents a comparison between different platforms for computing a 256-point FFT, to relate the hardware cost and performance to existing architectures. The Xstream FFT core from Part II is presented as an application-specific alternative, with low area requirements and high performance. Hardware requirements for the time-multiplexed and pipelined FFT from Section 5.3


Table 3: Synthesis results for the processor, memory, router, and array. The results are based on a 0.13µm cell library. Mpc = 64 for all processing cells.

Architecture               Eq. Kgates  Area    fmax   Storage  Router
                           (nand2)     (mm2)   (MHz)  (bytes)  overhead
DSP (R = 4, L = 2)         17.3        0.089   325    256      -
DSP (R = 8, L = 4)         20.7        0.106   325    256      -
DSP (R = 16, L = 8)        27.6        0.141   325    256      -
MAC (R = 4, L = 2)         17.4        0.089   325    256      -
MAC (R = 8, L = 4)         20.5        0.105   325    256      -
MAC (R = 16, L = 8)        25.0        0.128   325    256      -
CORDIC (16-bit)            24.7*       0.126   1100   0        -
MC (D = 4, Mmc = 256)      30.2        0.155   460    1024     -
MC (D = 4, Mmc = 512)      41.3        0.211   460    2048     -
MC (D = 4, Mmc = 1024)     62.5        0.320   460    4096     -
4-port Router (Q = 4)      12.9        0.066   1000   4 × 16   -
4-port Router (Q = 8)      22.4        0.115   1000   4 × 32   -
4-port Router (Q = 16)     35.5        0.182   1000   4 × 64   -
5-port Router (Q = 4)      17.0        0.087   990    5 × 16   -
8-port Router (Q = 4)      28.4        0.145   890    8 × 16   -
4x4 Array (5 routers)      484         2.48    325    10 K     15.8%
8x8 Array (21 routers)     1790        9.18    325    40 K     17.7%

* CORDIC pipeline presented in Part I. Control overhead is not included.

are estimated based on how many resource cells are allocated for each application. Table 4 also includes a recently proposed medium-grained architecture for mapping a 256-point FFT [116]. Finally, commercial CPU and DSP processors are presented to compare with general-purpose and special-purpose architectures.

Compared with an application-specific solution, a time-multiplexed version of the FFT can be mapped to the proposed DRA at an even lower cost, but with the penalty of reduced performance. In contrast, the proposed pipeline FFT generates the same throughput as the application-specific solution, but at four times higher cost. Based on this application, this is the price of flexibility for a reconfigurable architecture.

The area requirement for the proposed DRA of size 4 × 4 is 2.48mm2, as shown in Table 3. As a comparison, the PowerPC 405F6 embedded processor has a core area of 4.53mm2 (with caches) in the same implementation technology [103], while the Pentium 4 processor requires 305mm2. Area numbers for the TMS320VC55 DSP are not available.


Table 4: Architectural comparison for computing a 256-point FFT. Latency is not considered. Required area is scaled to 0.13µm technology. The performance/area ratio is (samples per cc) / (required area).

Architecture              Required    fmax   Cycles  Performance/
                          Area (mm2)  (MHz)  (cc)    Area ratio
Xstream 256-point FFT     0.51        398    256     1.96
Proposed DRA (TM)         0.31        325    1536    0.54
Proposed DRA (pipeline)   2.04        325    256     0.49
Medium-Grain DRA [116]    32.08       267    256     0.03
Texas TMS320VC5502 [129]  -           300    4786    -
IBM PowerPC 405 F6 [103]  4.53        658    > 20K*  < 0.0028
Intel Pentium 4 [130]     305         3000   20K     0.00004

* Estimated to require more clock cycles than an Intel Pentium 4.
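The performance/area ratios in Table 4 follow directly from the cycle counts and areas: a 256-point FFT computed in C cycles delivers 256/C samples per clock cycle, which is then divided by the area in mm2. A quick recomputation from the table values:

```python
# Recompute the performance/area ratios reported in Table 4:
# ratio = (samples per clock cycle) / (area in mm^2), 256-point FFT.
entries = {  # name: (area_mm2, cycles, reported_ratio)
    "Xstream 256-point FFT":   (0.51,  256,  1.96),
    "Proposed DRA (TM)":       (0.31,  1536, 0.54),
    "Proposed DRA (pipeline)": (2.04,  256,  0.49),
    "Medium-Grain DRA":        (32.08, 256,  0.03),
}
for name, (area, cycles, reported) in entries.items():
    ratio = (256 / cycles) / area
    print(f"{name}: {ratio:.2f} (reported {reported})")
```

All four recomputed ratios round to the reported values, confirming that the metric is independent of clock frequency: it rewards architectures that need few cycles and little silicon, regardless of how fast they are clocked.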

Recent related work proposes a medium-grained architecture for mapping a 256-point FFT [116]. However, Table 4 shows that this architecture results in a poor performance/area ratio, due to the only 4-bit granularity of its processing elements. With the same throughput, the area requirement is 16 times higher than for the proposed DRA.

DSP architectures include MAC support, but still require more than 18 times the number of clock cycles compared to pipelined solutions [129]. General-purpose processors require even more processing, due to the lack of dedicated hardware for MAC operations [130]. General-purpose architectures gain performance through higher clock frequency, but this solution is not scalable. In contrast, the presented architectures result in high performance and low area requirements.

8 Conclusion

Modeling and implementation of a dynamically reconfigurable architecture has been presented. The reconfigurable architecture is based on an array of processing and memory cells, communicating using local interconnects and a hierarchical network. The Scenic exploration environment and models have been used to evaluate the architecture and to emulate application mapping. Various network, memory, and application scenarios have been evaluated using Scenic, to facilitate system tuning during the design phase. A 4 × 4 array of processing cells, memory cells, and routers has been implemented in VHDL and synthesized for a 0.13µm cell library. The design has a core size of 2.48mm2 and is capable of operating at up to 325MHz. It is shown that mapping of a 256-point FFT generates 18 times higher throughput than a traditional DSP solution.


(Figure 18 appears here: a 4 × 4 grid of alternating DSP, MAC, and MC cells, connected by one 4-port router and four 5-port routers.)

Figure 18: Floorplan of a 4 × 4 array of 8 processing cells (blue) and 8 memory cells (green) connected with 5 network routers. The internal memories in each resource cell are represented in a slightly darker color. For this design R = 8, L = 4, D = 4, Mpc = 64, Mmc = 256, and Q = 4.

Figure 19: Final layout of a 4 × 4 array of processing and memory cells connected with network routers. The core size is 1660 × 1660µm2.


Conclusion and Outlook

Flexibility and reconfigurability are becoming important aspects of hardware design, where the focus is the ability to dynamically adapt a system to a range of applications. There are many ways to achieve reconfigurability, and two approaches are evaluated in this work. The target application is digital holographic imaging, where visible images are to be reconstructed based on holographic recordings.

In the first part of the thesis, an application-specific hardware accelerator and system platform for digital holographic imaging is proposed. A prototype is designed and constructed to demonstrate the potential of hardware acceleration in this application field. The presented result is a hardware platform that achieves real-time image reconstruction when implemented in a modern technology. It is concluded that application-specific architectures are required for real-time performance, and that block-level reconfiguration provides sufficient flexibility for this application.

In the second part of the thesis, the work is generalized towards dynamically reconfigurable architectures. Modeling and implementation results indicate the processing potential, and the cost of flexibility, of such architectures. An exploration framework and simulation modules for evaluating reconfigurable platforms are designed to study architectural trade-offs, system performance, and application mapping. An array of run-time reconfigurable processing and memory cells is proposed, and it is shown that arrays of simplified programmable processing cells, with regular interconnects and memory structures, result in increased performance over traditional DSP solutions. It is concluded that reconfigurable architectures provide a feasible solution to accelerate arbitrary applications.

Current trends towards parallel and reconfigurable architectures indicate the beginning of a paradigm shift for hardware design. Within a few years, we are likely to find massively parallel architectures in desktop computers and inside embedded systems. This paradigm shift will certainly also require us to change our mindset and our view of traditional software design, to allow efficient mapping onto such architectures.


Bibliography

[1] U. Schnars and W. Jueptner, Digital Holography. Berlin, Heidelberg: Springer, 2005.

[2] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits. Upper Saddle River, New Jersey: Prentice-Hall, 2003.

[3] J. Shalf, “The New Landscape of Parallel Computer Architecture,” Journal of Physics: Conference Series 78, pp. 1–15, 2007.

[4] G. M. Amdahl, “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities,” in AFIPS Conference Proceedings, Vol. 30, Atlantic City, USA, Apr. 18-20 1967, pp. 483–485.

[5] G. E. Moore, “Cramming More Components onto Integrated Circuits,” Electronics, vol. 38, no. 8, pp. 114–117, 1965.

[6] C. A. R. Hoare, Communicating Sequential Processes. Upper Saddle River, New Jersey: Prentice-Hall, 1985.

[7] A. Marowka, “Parallel Computing on Any Desktop,” Communications of the ACM, vol. 50, no. 9, pp. 74–78, 2007.

[8] H. Nikolov, T. Stefanov, and E. Deprettere, “Systematic and Automated Multiprocessor System Design, Programming, and Implementation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 3, pp. 542–555, 2008.


[9] T. Lenart and V. Owall, “A 2048 complex point FFT processor using a novel data scaling approach,” in Proceedings of IEEE International Symposium on Circuits and Systems, Bangkok, Thailand, May 25-28 2003, pp. 45–48.

[10] Y. Chen, Y.-C. Tsao, Y.-W. Lin, C.-H. Lin, and C.-Y. Lee, “An Indexed-Scaling Pipelined FFT Processor for OFDM-Based WPAN Applications,” IEEE Transactions on Circuits and Systems—Part II: Analog and Digital Signal Processing, vol. 55, no. 2, pp. 146–150, Feb. 2008.

[11] Computer Systems Laboratory, University of Campinas, “The ArchC Architecture Description Language,” http://archc.sourceforge.net.

[12] M. J. Flynn, “Area - Time - Power and Design Effort: The Basic Trade-offs in Application Specific Systems,” in Proceedings of IEEE 16th International Conference on Application-specific Systems, Architectures and Processors, Samos, Greece, July 23-25 2005, pp. 3–6.

[13] R. Tessier and W. Burleson, “Reconfigurable Computing for Digital Signal Processing: A Survey,” The Journal of VLSI Signal Processing, vol. 28, no. 1-2, pp. 7–27, 2001.

[14] S. A. McKee et al., “Smarter Memory: Improving Bandwidth for Streamed References,” Computer, vol. 31, no. 7, pp. 54–63, July 1998.

[15] S. Mahadevan, M. Storgaard, J. Madsen, and K. Virk, “ARTS: A System-level Framework for Modeling MPSoC Components and Analysis of their Causality,” in Proceedings of 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, Sept. 27-29 2005.

[16] G. Beltrame, D. Sciuto, and C. Silvano, “Multi-Accuracy Power and Performance Transaction-Level Modeling,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 10, pp. 1830–1842, 2007.

[17] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, “A Scalable Synthesis Methodology for Application-Specific Processors,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 11, pp. 1175–1188, 2006.

[18] D. Talla, L. K. John, V. Lapinskii, and B. L. Evans, “Evaluating Signal Processing and Multimedia Applications on SIMD, VLIW and Superscalar Architectures,” in Proceedings of IEEE International Conference on Computer Design, Austin, Texas, USA, Sept. 17-20 2000, pp. 163–172.


[19] A. Douillet and G. R. Gao, “Software-Pipelining on Multi-Core Architectures,” in Proceedings of International Conference on Parallel Architecture and Compilation Techniques, Brasov, Romania, Sept. 15-19 2007, pp. 39–48.

[20] K. Keutzer, S. Malik, and R. Newton, “From ASIC to ASIP: The Next Design Discontinuity,” in Proceedings of IEEE International Conference on Computer Design, Freiburg, Germany, Sept. 16-18 2002, pp. 84–90.

[21] C. Wolinski and K. Kuchcinski, “Identification of Application Specific Instructions Based on Sub-Graph Isomorphism Constraints,” in Proceedings of IEEE 14th International Conference on Application-specific Systems, Architectures and Processors, Montreal, Canada, July 9-11 2007, pp. 328–333.

[22] A. C. Cheng, “A Software-to-Hardware Self-Mapping Technique to Enhance Program Throughput for Portable Multimedia Workloads,” in Proceedings of IEEE International Symposium on Electronic Design, Test and Applications, Hong Kong, China, Jan. 23-25 2008, pp. 356–361.

[23] Tensilica, “Configurable and Standard Processor Cores for SOC Design,” http://www.tensilica.com.

[24] K. K. Parhi, VLSI Digital Signal Processing Systems. 605 Third Avenue, New York 10158: John Wiley and Sons, 1999.

[25] U. Kapasi, S. Rixner, W. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. Owens, “Programmable Stream Processors,” Computer, vol. 36, no. 8, pp. 54–62, Aug. 2003.

[26] U. Kapasi, W. Dally, S. Rixner, J. Owens, and B. Khailany, “The Imagine Stream Processor,” in Proceedings of IEEE International Conference on Computer Design, Freiburg, Germany, Sept. 16-18 2002, pp. 282–288.

[27] W. J. Dally et al., “Merrimac: Supercomputing with Streams,” in Proceedings of ACM/IEEE Conference on Supercomputing, Phoenix, AZ, USA, Nov. 15-21 2003, pp. 35–42.

[28] D. Manocha, “General-Purpose Computations Using Graphics Processors,” Computer, vol. 38, no. 8, pp. 85–88, 2005.

[29] M. J. Flynn, “Some Computer Organizations and Their Effectiveness,” IEEE Transactions on Computers, vol. C-21, no. 9, pp. 948–960, 1972.


[30] M. Gokhale and P. S. Graham, Reconfigurable Computing. Berlin, Heidelberg: Springer, 2005.

[31] K. Compton and S. Hauck, “Reconfigurable Computing: A Survey of Systems and Software,” ACM Computing Surveys, vol. 34, no. 2, pp. 171–210, 2002.

[32] T. Todman, G. Constantinides, S. Wilton, O. Mencer, W. Luk, and P. Cheung, “Reconfigurable Computing: Architectures and Design Methods,” IEE Proceedings of Computers and Digital Techniques, vol. 152, no. 2, pp. 193–207, 2005.

[33] R. Hartenstein, “A Decade of Reconfigurable Computing: A Visionary Retrospective,” in Proceedings of IEEE Conference on Design, Automation and Test in Europe, Munich, Germany, Mar. 10-14 2001, pp. 642–649.

[34] C. Hilton and B. Nelson, “PNoC: A Flexible Circuit-switched NoC for FPGA-based Systems,” IEE Proceedings on Computers and Digital Techniques, vol. 153, pp. 181–188, May 2006.

[35] R. Baxter et al., “The FPGA High-Performance Computing Alliance Parallel Toolkit,” in Proceedings of NASA/ESA Conference on Adaptive Hardware and Systems, Scotland, United Kingdom, Aug. 5-8 2007, pp. 301–310.

[36] S. Raaijmakers and S. Wong, “Run-time Partial Reconfiguration for Removal, Placement and Routing on the Virtex-II Pro,” in Proceedings of IEEE International Conference on Field Programmable Logic and Applications, Amsterdam, Netherlands, Aug. 27-29 2007, pp. 679–683.

[37] S. Sirowy, Y. Wu, S. Lonardi, and F. Vahid, “Two-Level Microprocessor-Accelerator Partitioning,” in Proceedings of IEEE Conference on Design, Automation and Test in Europe, Acropolis, Nice, France, Apr. 16-20 2007, pp. 313–318.

[38] D. Burger, J. Goodman, and A. Kagi, “Limited Bandwidth to Affect Processor Design,” IEEE Micro, vol. 17, no. 6, pp. 55–62, Nov. 1997.

[39] S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens, “Memory Access Scheduling,” in Proceedings of the 27th International Symposium on Computer Architecture, Vancouver, BC, Canada, June 10-14 2000, pp. 128–138.


[40] M. Brandenburg, A. Schollhorn, S. Heinen, J. Eckmuller, and T. Eckart, “A Novel System Design Approach Based on Virtual Prototyping and its Consequences for Interdisciplinary System Design Teams,” in Proceedings of IEEE Conference on Design, Automation and Test in Europe, Acropolis, Nice, France, Apr. 16-20 2007, pp. 1–3.

[41] D. C. Black and J. Donovan, SystemC: From the Ground Up. Berlin, Heidelberg: Springer, 2005.

[42] A. Donlin, “Transaction Level Modeling: Flows and Use Models,” in Proceedings of IEEE International Conference on Hardware/Software Codesign and System Synthesis, Stockholm, Sweden, Sept. 8-10 2004, pp. 75–80.

[43] G. Martin, B. Bailey, and A. Piziali, ESL Design and Verification: A Prescription for Electronic System Level Methodology. San Francisco, USA: Morgan Kaufmann, 2007.

[44] T. Kogel and M. Braun, “Virtual Prototyping of Embedded Platforms for Wireless and Multimedia,” in Proceedings of IEEE Conference on Design, Automation and Test in Europe, Munich, Germany, Mar. 6-10 2006, pp. 1–6.

[45] T. Rissa, P. Y. Cheung, and W. Luk, “Mixed Abstraction Execution for the SoftSONIC Virtual Hardware Platform,” in Proceedings of Midwest Symposium on Circuits and Systems, Cincinnati, Ohio, USA, Aug. 7-10 2005, pp. 976–979.

[46] L. Cai and D. Gajski, “Transaction Level Modeling: An Overview,” in Proceedings of IEEE International Conference on Hardware/Software Codesign and System Synthesis, Newport Beach, CA, USA, Oct. 1-3 2003, pp. 19–24.

[47] T. Kempf et al., “A SW Performance Estimation Framework for early System-Level-Design using Fine-grained Instrumentation,” in Proceedings of IEEE Conference on Design, Automation and Test in Europe, Munich, Germany, Mar. 6-10 2006.

[48] M. Etelapera, J. V. Anttila, and J. P. Soininen, “Architecture Exploration of 3D Video Recorder Using Virtual Platform Models,” in 10th Euromicro Conference on Digital System Design, Lubeck, Germany, Aug. 29-31 2007.


[49] S. Rigo, G. Araujo, M. Bartholomeu, and R. Azevedo, “ArchC: A SystemC-based Architecture Description Language,” in Proceedings of the 16th Symposium on Computer Architecture and High Performance Computing, Foz do Iguacu, Brazil, Oct. 27-29 2004, pp. 163–172.

[50] M. Reshadi, P. Mishra, and N. Dutt, “Instruction Set Compiled Simulation: A Technique for Fast and Flexible Instruction Set Simulation,” in Proceedings of the 40th Design Automation Conference, Anaheim, CA, USA, June 2-6 2003, pp. 758–763.

[51] D. Gabor, “A New Microscopic Principle,” Nature, vol. 161, pp. 777–778,1948.

[52] W. E. Kock, Lasers and Holography. 180 Varick Street, New York 10014: Dover Publications Inc., 1981.

[53] W. Xu, M. H. Jericho, I. A. Meinertzhagen, and H. J. Kreuzer, “Digital in-line Holography for Biological Applications,” Cell Biology, vol. 98, pp. 11301–11305, Sept. 2001.

[54] C. M. Do and B. Javidi, “Digital Holographic Microscopy for Live Cell Applications and Technical Inspection,” Applied Optics, vol. 47, no. 4, pp. 52–61, 2008.

[55] B. Alberts et al., Molecular Biology Of The Cell. London, England: Taylor & Francis Inc, 2008.

[56] F. Zernike, “Phase Contrast, A new Method for the Microscopic Observation of Transparent Objects,” Physica, vol. 9, no. 7, pp. 686–698, 1942.

[57] P. Marquet, B. Rappaz, P. J. Magistretti, E. Cuche, Y. Emery, T. Colomb, and C. Depeursinge, “Digital Holographic Microscopy: A Noninvasive Contrast Imaging Technique allowing Quantitative Visualization of Living Cells with Subwavelength Axial Accuracy,” Optics Letters, vol. 30, no. 7, pp. 468–470, 2005.

[58] A. Asundi and V. R. Singh, “Sectioning of Amplitude Images in Digital Holography,” Measurement Science and Technology, vol. 17, pp. 75–78, 2006.

[59] C. M. Do and B. Javidi, “Multifocus Holographic 3-D Image Fusion Using Independent Component Analysis,” Journal of Display Technology, vol. 3, no. 3, pp. 326–332, 2007.


[60] M. Born and E. Wolf, Principles of Optics. Cambridge, U.K.: Cambridge University Press, 1999.

[61] T. H. Demetrakopoulos and R. Mittra, “Digital and Optical Reconstruction of Suboptical Diffraction Patterns,” Applied Optics, vol. 13, pp. 665–670, Mar. 1974.

[62] W. G. Rees, “The Validity of the Fresnel Approximation,” European Journal of Physics, vol. 8, pp. 44–48, 1987.

[63] S. Mezouari and A. R. Harvey, “Validity of Fresnel and Fraunhofer Approximations in Scalar Diffraction,” Journal of Optics A: Pure and Applied Optics, vol. 5, pp. 86–91, 2003.

[64] D. C. Ghiglia and M. D. Pritt, Two-Dimensional Phase Unwrapping. 605 Third Avenue, New York 10158: John Wiley and Sons, 1998.

[65] T. J. Flynn, “Two-dimensional Phase Unwrapping with Minimum Weighted Discontinuity,” Journal of the Optical Society of America A, vol. 14, no. 10, pp. 2692–2701, 1997.

[66] D. Litwiller, “CCD vs. CMOS: Facts and Fiction,” in Photonics Spectra, 2001.

[67] M. Frigo, “FFTW: An Adaptive Software Architecture for the FFT,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, USA, May 12-15 1998, pp. 1381–1384.

[68] E. O. Brigham, The Fast Fourier Transform and its Applications. Upper Saddle River, New Jersey: Prentice-Hall, 1988.

[69] Micron Technology, “MT48LC32M16A2-75 - SDR Synchronous DRAM device,” http://www.micron.com.

[70] B. Parhami, Computer Arithmetic. 198 Madison Avenue, New York 10016: Oxford University Press, 2000.

[71] S. He and M. Torkelson, “Designing Pipeline FFT Processor for OFDM (de)modulation,” in URSI International Symposium on Signals, Systems, and Electronics, Pisa, Italy, Sept. 29-Oct. 2 1998, pp. 257–262.

[72] T. Lenart and V. Owall, “Architectures for Dynamic Data Scaling in 2/4/8K Pipeline FFT Cores,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 11, pp. 1286–1290, Nov. 2006.


[73] E. Bidet, C. Joanblanq, and P. Senn, “A Fast Single-Chip Implementation of 8192 Complex Point FFT,” IEEE Journal of Solid-State Circuits, vol. 30, pp. 300–305, Mar. 1995.

[74] C. D. Toso et al., “0.5-µm CMOS Circuits for Demodulation and Decoding of an OFDM-Based Digital TV Signal Conforming to the European DVB-T Standard,” IEEE Journal of Solid-State Circuits, vol. 33, no. 11, pp. 1781–1792, Nov. 1998.

[75] M. Wosnitza, M. Cavadini, M. Thaler, and G. Troster, “A High Precision 1024-point FFT Processor for 2D Convolution,” in Proceedings of IEEE International Solid-State Circuits Conference, Digest of Technical Papers, San Francisco, CA, USA, Feb. 5-7 1998, pp. 118–119.

[76] F. Kristensen, P. Nilsson, and A. Olsson, “Reduced Transceiver-delay for OFDM Systems,” in Proceedings of Vehicular Technology Conference, vol. 3, Milan, Italy, May 17-19 2004, pp. 1242–1245.

[77] J. Gaisler, “A Portable and Fault-tolerant Microprocessor based on the SPARC v8 Architecture,” in Proceedings of Dependable Systems and Networks, Washington DC, USA, June 23-26 2002, pp. 409–415.

[78] ARM Ltd., “AMBA Specification - Advanced Microcontroller Bus Archi-tecture,” http://www.arm.com, 1999.

[79] N. Miyamoto, L. Karnan, K. Maruo, K. Kotani, and T. Ohmi1, “A Small-Area High-Performance 512-Point 2-Dimensional FFT Single-Chip Pro-cessor,” in Proceedings of European Solid-State Circuits, Estoril, Portu-gal, Sept. 16-18 2003, pp. 603–606.

[80] I. Uzun, A. Amira, and F. Bensaali, “A Reconfigurable Coprocessor forHigh-resolution Image Filtering in Real time,” in Proceedings of the 10thIEEE International Conference on Electronics, Circuits and Systems,Sharjah, United Arab Emirates, Dec. 14-17 2003, pp. 192–195.

[81] IEEE 802.11a, “High-speed Physical Layer in 5 GHz Band,”http://ieee802.org, 1999.

[82] IEEE 802.11g, “High-speed Physical Layer in 2.4 GHz Band,”http://ieee802.org, 2003.

[83] J. Cooley and J. Tukey, “An Algorithm for Machine Calculation of Com-plex Fourier Series,” IEEE Journal of Solid-State Circuits, vol. 19, pp.297–301, Apr. 1965.

Page 171: Design of Reconfigurable Hardware Architectures for Real-time Applications

BIBLIOGRAPHY 151

[84] K. Zhong, H. He, and G. Zhu, “An Ultra High-speed FFT Processor,” inInternational Symposium on Signals, Circuits and Systems, Lasi, Roma-nia, July 10-11 2003, pp. 37–40.

[85] J. G. Proakis and D. G. Manolakis, Digital Signal Processing - Principles,algorithms, and applications. Upper Saddle River, New Jersey: Prentice-Hall, 1996.

[86] S. He, “Concurrent VLSI Architectures for DFT Computing and Algo-rithms for Multi-output Logic Decomposition,” Ph.D. dissertation, Ph.D.dissertation, Lund University, Department of Applied Electronics, 1995.

[87] M. Gustafsson et al., “High resolution digital transmission microscopy -a Fourier holography approach,” Optics and Lasers in Engineering, vol.41(3), pp. 553–563, Mar. 2004.

[88] ETSI, “Digital Video Broadcasting (DVB); Framing Structure, ChannelCoding and Modulation for Digital Terrestrial Television,” ETSI EN 300744 v1.4.1, 2001.

[89] K. Kalliojrvi and J. Astola, “Roundoff Errors is Block-Floating-PointSystems,” IEEE Transactions on Signal Processing, vol. 44, no. 4, pp.783–790, Apr. 1996.

[90] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, “A Dynamic Scaling FFT Processorfor DVB-T Applications,” IEEE Journal of Solid-State Circuits, vol. 39,no. 11, pp. 2005–2013, Nov. 2004.

[91] C.-C. Wang, J.-M. Huang, and H.-C. Cheng, “A 2K/8K Mode Small-areaFFT Processor for OFDM Demodulation of DVB-T Receivers,” IEEETransactions on Consumer Electronics, vol. 51, no. 1, pp. 28–32, Feb.2005.

[92] T. Lenart et al., “Accelerating signal processing algorithms in digitalholography using an FPGA platform,” in Proceedings of IEEE Inter-national Conference on Field Programmable Technology, Tokyo, Japan,Dec. 15-17 2003, pp. 387–390.

[93] X. Ningyi et al., “A SystemC-based NoC Simulation Framework support-ing Heterogeneous Communicators,” in Proceedings IEEE InternationalConference on ASIC, Shanghai, China, Sept. 24-27 2005, pp. 1032–1035.

[94] L. Yu, S. Abdi, and D. D. Gajski, “Transaction Level Platform Modelingin SystemC for Multi-Processor Designs,” UC Irvine, Tech. Rep., Jan.2007. [Online]. Available: http://www.gigascale.org/pubs/988.html

Page 172: Design of Reconfigurable Hardware Architectures for Real-time Applications

152 BIBLIOGRAPHY

[95] A.V. Brito et al., “Modelling and Simulation of Dynamic and PartiallyReconfigurable Systems using SystemC,” in Proceedings of IEEE Com-puter Society Annual Symposium on VLSI, Porto Alegra, Brazil, Mar. 9-11 2007, pp. 35–40.

[96] Open SystemC Initiative (OSCI), “OSCI SystemC 2.2 Open-source Li-brary,” http://www.systemc.org.

[97] F. Doucet, S. Shukla, and R. Gupta, “Introspection in System-Level Lan-guage Frameworks: Meta-level vs. Integrated,” in Proceedings of IEEEConference on Design, Automation and Test in Europe, Munich, Ger-many, Mar. 3-7 2003.

[98] W. Klingauf and M. Geffken, “Design Structure Analysis and Transac-tion Recording in SystemC Designs: A Minimal-Intrusive Approach,” inProceedings of Forum on Specification and Design Languages, TU Darm-stadt, Germany, Sept. 19-22 2006.

[99] D. Berner et al., “Automated Extraction of Structural Information fromSystemC-based IP for Validation,” in Proceedings of IEEE InternationalWorkshop on Microprocessor Test and Verification, Nov. 3-4 2005.

[100] F. Rogin, C. Genz, R. Drechsler, and S. Rulke, “An Integrated SystemCDebugging Environment,” in Proceedings of Forum on Specification andDesign Languages, Sept. 18-20 2007.

[101] E. Gansner et al., “Dot user guide - Drawing graphs with dot,”http://www.graphviz.org/Documentation/dotguide.pdf.

[102] The Spirit Consortium, “Enabling Innovative IP Re-use and Design Au-tomation,” http://www.spiritconsortium.org.

[103] IBM Microelectronics, “PowerPC 405 CPU Core,” http://www.ibm.com.

[104] K. Compton and S. Hauck, “Reconfigurable computing: A survey ofsystems and software,” ACM Computing Surveys, vol. 34, no. 2, pp. 171–211, 2002.

[105] T. Todman, G. Constantinides, S. Wilton, O. Mencer, W. Luk, andP. Cheung, “Reconfigurable computing: architectures and design meth-ods,” IEE Proceedings - Computers and Digital Techniques, vol. 152,no. 2, pp. 193–207, 2005.

Page 173: Design of Reconfigurable Hardware Architectures for Real-time Applications

BIBLIOGRAPHY 153

[106] G. K. Rauwerda, P. M. Heysters, and G. J. Smit, “Towards SoftwareDefined Radios Using Coarse-Grained Reconfigurable Hardware,” IEEETransactions on Very Large Scale Integration (VLSI) Systems, vol. 16,no. 1, pp. 3–13, 2008.

[107] T. J. Callahan, J. R. Hauserand, and J. Wawrzynek, “The Garp architec-ture and C compiler,” IEEE Computer, vol. 33, no. 4, pp. 62–69, 2000.

[108] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. R.Taylor, “PipeRench: A Reconfigurable Architecture and Compiler,”IEEE Computer, vol. 33, no. 4, pp. 70–77, 2000.

[109] M. B. Taylor et al., “The Raw microprocessor: A Computational Fab-ric for Software Circuits and General-purpose Programs,” IEEE Micro,vol. 22, no. 2, pp. 25–35, 2002.

[110] T. Miyamori and K. Olukotun, “A Quantitative Analysis of Reconfig-urable Coprocessorsfor Multimedia Applications,” in Proceedings of IEEESymposium on FPGAs for Custom Computing Machines, Apr. 15-171998, pp. 2–11.

[111] S. Hauck, T. W. Fry, M. M. Hosler, and J. P. Kao, “The ChimaeraReconfigurable Functional Unit,” IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems, vol. 12, no. 2, pp. 206–217, 2004.

[112] S. Khawam, I. Nousias, M. Milward, Y. Ying, M. Muir, and T. Arslan,“The Reconfigurable Instruction Cell Array,” IEEE Transactions on VeryLarge Scale Integration (VLSI) Systems, vol. 16, no. 1, pp. 75–85, 2008.

[113] D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich, “A Highly Parame-terizable Parallel Processor Array Architecture,” in Proceedings of IEEEInternational Conference on Field Programmable Technology, Bangkok,Thailand, Dec. 13-15 2006, pp. 105–112.

[114] ——, “Hardware Cost Analysis for Weakly Programmable Processor Ar-rays,” in Proceedings of International Symposium on System-on-Chip,Tampere, Finland, Nov. 13-16 2006, pp. 1–4.

[115] G. Zhou and X. Shen, “A Coarse-Grained Dynamically ReconfigurableProcessing Array (RPA) for Multimedia Application,” in Proceedings ofthird International Conference on Natural Computation, 2007, pp. 157–161.

Page 174: Design of Reconfigurable Hardware Architectures for Real-time Applications

154 BIBLIOGRAPHY

[116] M. J. Myjak and J. G. Delgado-Frias, “A Medium-Grain ReconfigurableArchitecture for DSP: VLSI Design, Benchmark Mapping, and Perfor-mance,” IEEE Transactions on Very Large Scale Integration (VLSI) Sys-tems, vol. 16, no. 1, pp. 14–23, 2008.

[117] MathStar, “A High-performance Field Programmable Object Arrays,”http://www.mathstar.com.

[118] QuickSilver Technology, “Adaptive Computing Machine: Adapt2400,”http://www.qstech.com.

[119] PACT XPP Technologies, “XPP-III High Performance Media Processor,”http://www.pactxpp.com.

[120] J. Becker and M. Vorbach, “Architecture, Memory and Interface Technol-ogy Integration of an Industrial/Academic Configurable System-on-Chip(CSoC),” in Proceedings of IEEE Computer Society Annual Symposiumon VLSI, Tampa, Florida, Feb. 20-21 2003, pp. 107–112.

[121] D. Rosenband and Arvind, “Hardware Synthesis from Guarded AtomicActions with Performance Specifications,” in Proceedings of IEEE/ACMInternational Conference on Computer-Aided Design, San Jose, CA,USA, Nov. 6-10 2005, pp. 784–791.

[122] W. J. Dally and B. Towles, “Route Packets, Not Wires: On-chip In-terconnection Networks,” in Proceedings of the 38th Design AutomationConference, Las Vegas, Nevada, USA, June 18-22 2001, pp. 684–689.

[123] P. Pratim et al., “Performance Evaluation and Design Trade-Offs forNetwork-on-Chip Interconnect Architectures,” IEEE Transactions onComputers, vol. 54, no. 8, pp. 1025–1040, Aug. 2005.

[124] L. Bononi and N. Concer, “Simulation and Analysis of Network on ChipArchitectures: Ring, Spidergon and 2D Mesh,” in Proceedings of IEEEConference on Design, Automation and Test in Europe, Munich, Ger-many, Mar. 6-10 2006, pp. 154–159.

[125] M. Lanuzza, S. Perri, P. Corsonello, and M. Margala, “A New Reconfig-urable Coarse-Grain Architecture for Multimedia Applications,” in Pro-ceedings of NASA/ESA Conference on Adaptive Hardware and Systems,Scotland, United Kingdom, Aug. 5-8 2007, pp. 119–126.

[126] A. Lopez-Lagunas and S. M. Chai, “Compiler Manipulation of StreamDescriptors for Data Access Optimization,” in Proceedings of Interna-tional Conference on Parallel Processing Workshops, Columbus, Ohio,USA, Aug. 14-18 2006, pp. 337–344.

Page 175: Design of Reconfigurable Hardware Architectures for Real-time Applications

BIBLIOGRAPHY 155

[127] S. Ciricescu et al., “The Reconfigurable Streaming Vector Processor(RSVP),” in Proceedings of IEEE/ACM International Symposium on Mi-croarchitecture, San Diego, CA, USA, Dec. 3-5 2003, pp. 141–150.

[128] W. Dally and B. Towles, Principles and Practices of Interconnection Net-works. San Francisco, USA: Morgan Kaufmann, 2004.

[129] Texas Instruments, “TMS320C55x DSP Library Programmer’s Refer-ence,” http://www.ti.com.

[130] D. Eliemble, “Optimizing DSP and Media Benchmarks for Pentium 4:Hardware and Software Issues,” in Proceedings of IEEE InternationalConference on Multimedia and Expo, Lusanne, Switzerland, Aug. 26-292002, pp. 109–112.

Page 176: Design of Reconfigurable Hardware Architectures for Real-time Applications
Page 177: Design of Reconfigurable Hardware Architectures for Real-time Applications

Appendix


Page 178: Design of Reconfigurable Hardware Architectures for Real-time Applications

Appendix A

The Scenic Shell

Scenic is an extension to the OSCI SystemC library and enables rapid system prototyping and interactive simulations. This manual presents the built-in Scenic shell commands, how to start and run a simulation, and how to interact with simulation models and access information during the simulation phase. An overview of the Scenic functionality is presented in Figure 1. A module library contains simulation modules that can be instantiated, bound, and configured from an XML description. The Scenic shell provides access to internal variables during simulation and enables extraction of performance data from simulation models.

1 Launching Scenic

Scenic is started with the command scenic.exe from the Windows command prompt, or with the command ./scenic from a Cygwin or Linux shell.

scenic.exe <file> [-ip] [-port] [-exec] [-exit]

If a file named scenic.ini is present in the startup directory, it is executed before any other action is taken. Scenic can be launched with optional arguments to execute a script file or to set environment variables. The first argument, or the exec flag, specifies a script to execute during startup, and the exit flag indicates with true or false whether the program should terminate once the script has finished execution (for batch-mode processing). The ip and port flags open a socket through which the program can be controlled interactively, either from Matlab or from a custom application. All startup flags are registered as environment variables in the Scenic shell to simplify parameter passing.




Figure 1: (a) A simulation is created from an XML description format using a library of simulation modules. (b) The modules can communicate with the Scenic shell to access member variables, notify simulation events, and filter debug messages.

1.1 Command Syntax

The built-in Scenic commands are presented in Table 2. Commands are executed from the Scenic shell by typing a built-in command, or the hierarchical name of a simulation module. The command is followed by a list of parameters, i.e., single values, and flag/value pairs. The ordering of flag/value pairs does not matter, but parameters must be specified in the correct order.

command/module {parameter}* {-flag value}* {-> filename} {;}

Variable substitution replaces environment variables with their corresponding values. Substitution is evaluated on all parameters and on the value of a flag/value pair. A parameter or value containing space characters must be enclosed in single or double quotation marks to be treated as a single value. Variable substitution is performed on a value enclosed in double quotation marks, but single quotation marks prevent variable substitution. Command substitution is expressed using square brackets, where a command inside square brackets is evaluated before the parent command is executed.

A semicolon ";" at the end of a command prevents the return value from being printed on the screen. The return value can also be redirected to a file by ending the line with "-> filename" to create a new file, or "->> filename" to append to an existing file.

Comments are preceded by a ”#” or a ”%” token. The comment charactersonly have significance when they appear at the beginning of a new line.


[0 s]> set a 10                  % assign variable a = 10
[0 s]> random -min 0 -max $a     % generate a random number between 0 and 10
[0 s]> set b [eval $a - 2]       % calculate 10-2=8 and assign to b
[0 s]> set c "$a $b"             % assign c the concatenated string "10 8"
[0 s]> set d '$a $b'             % assign d the unmodified string "$a $b"
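The quoting and substitution rules above can be sketched as follows. This is an illustrative model only, not Scenic's actual parser; the function name and environment table are ours.

```python
# Sketch of Scenic-style quoting: double quotes allow $variable
# substitution, single quotes suppress it, bare tokens are substituted.
import re

env = {"a": "10", "b": "8"}

def substitute(token: str) -> str:
    """Expand $variables in one token according to the quoting rules."""
    if token.startswith("'") and token.endswith("'"):
        return token[1:-1]                       # single quotes: literal
    if token.startswith('"') and token.endswith('"'):
        token = token[1:-1]                      # double quotes: substitute
    return re.sub(r"\$(\w+)", lambda m: env.get(m.group(1), ""), token)

print(substitute('"$a $b"'))   # -> 10 8
print(substitute("'$a $b'"))   # -> $a $b
```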

1.2 Number Representations

Integer values may be specified in decimal, hexadecimal, or binary representation. The decimal value 100 is represented in hexadecimal as 0x64 and in binary as b1100100. Negative and floating-point numbers must be specified in decimal format. Boolean values are represented with true or 1 for true, and false or 0 for false.
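The literal formats above can be captured in a small parser. This is an illustrative sketch, not Scenic source code; the function name is ours.

```python
# Sketch: parse Scenic-style literals (decimal, 0x hexadecimal,
# b-prefixed binary, true/false booleans) into Python values.
def parse_value(text: str):
    t = text.strip()
    if t.lower() in ("true", "false"):
        return t.lower() == "true"
    if t.startswith("0x"):
        return int(t, 16)                        # hexadecimal, e.g. 0x64
    if len(t) > 1 and t[0] == "b" and set(t[1:]) <= {"0", "1"}:
        return int(t[1:], 2)                     # binary, e.g. b1100100
    return float(t) if "." in t else int(t)      # decimal int or float

# All three spellings of 100 agree:
assert parse_value("100") == parse_value("0x64") == parse_value("b1100100")
```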

1.3 Environment Variables

Environment variables are used to store global variables, intermediate results from calculations, and parameters during function calls. Variables are assigned using the set command and can be removed using unset. Environment variables are treated as string literals and can represent any of the standard data types such as integer, float, boolean, and string.

[0 s]> set I 42
[0 s]> set S "Hello World"

Environment variable values are accessed by using the $ operator followed bythe variable name.

[0 s]> set MSG "$S $I"
[0 s]> echo $MSG

Hello World 42

1.4 System Variables

System variables are special environment variables that are set by the system, including the simulation time, execution time, and other global variables related to the simulation. The system variables are useful for calculating simulation performance or execution time.

[0 s]> set start $_TIME
...
[1 ms]> set time [eval $_TIME - $start]
[1 ms]> echo "execution time: $time ms"

execution time: 5225 ms


Both environment and system variables can be listed using either the set command without any parameters or by typing list env.

[1 ms]> set

system variable(s) :
_DELTACOUNT : "276635"
_ELABDONE   : "TRUE"
_MICROSTEP  : "25"
_RESOLUTION : "1 ns"
_SIMTIME    : "1000000"
_TIME       : "5225"

environment variable(s) :
I   : "42"
S   : "Hello World"
MSG : "Hello World 42"

2 Simulation

Scenic modeling is divided into two phases: a system specification phase and a simulation phase. System specification means describing the hierarchical modules, the static module configuration, and the module interconnects. The system specification phase ends when the command sim is executed, which launches the SystemC simulator and constructs the simulation models (elaboration). After elaboration, simulation modules can no longer be instantiated, but dynamic configurations can be sent to the simulation modules.

2.1 Starting a Simulation

Simulation architectures are described in the XML language and are loaded before the simulation can begin. The architectural description format is further explained in Section 4. Modular systems can be described using multiple XML files, which are merged into a single simulation. If two modular systems share a common simulation module with the same instance name, Scenic automatically creates a single instance to which both subsystems can connect.

[0 s]> xml load -file demo_system.xml
[0 s]> sim

SystemC 2.2.0 --- Feb 8 2008 16:25:45
Copyright (c) 1996-2006 by all Contributors

ALL RIGHTS RESERVED

Simulation Started [1]

[0 s]>


After loading the XML system descriptions, the command sim launches the SystemC simulator, instantiates and configures the library modules, and binds the module ports. If the optional argument nostart is set to true, the simulator stays in the specification phase until the simulation is actually run.

2.2 Running a Simulation

A simulation runs in either blocking or non-blocking mode, using the commands runb and run, respectively. The former is useful when the execution time is known or when running the simulation from a script, while the latter is useful for user-interactive simulations. A non-blocking simulation can be aborted using the stop command, which halts the simulation at the start of the next micro-step.

[1 ms]> step 25 us
[1 ms]> runb 1 ms
[2 ms]> run 10 ms
[2 ms]> stop

Simulation halted at 3125 us
[3125 us]>

A non-blocking simulation executes in time chunks referred to as micro-steps. The size of a micro-step is configured using the step command with the preferred micro-step size as a parameter. Using the step command without parameters runs the simulation for a time duration equal to one micro-step. This can be used, for example, to single-step a processor simulation module, where the step size is set to the processor's clock period. A larger step size results in higher simulation performance, but may not be able to halt the simulation with clock-cycle accuracy.
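The micro-step trade-off can be sketched as a loop that only honours a stop request at micro-step boundaries. This is a model of the scheduling idea, not Scenic's implementation; the function and parameter names are ours.

```python
# Sketch: advance simulated time in micro-steps; a stop request is
# only honoured at the next micro-step boundary, so a larger step
# means faster simulation but coarser halting granularity.
def run_nonblocking(total_ns: int, step_ns: int, stop_time_ns=None) -> int:
    """Return the simulated time (ns) actually reached."""
    now = 0
    while now < total_ns:
        now += min(step_ns, total_ns - now)          # one micro-step
        if stop_time_ns is not None and now >= stop_time_ns:
            break           # halted at first boundary after the request
    return now

# A stop requested at 3.11 ms with a 25 us micro-step halts at 3125 us:
assert run_nonblocking(10_000_000, 25_000, stop_time_ns=3_110_000) == 3_125_000
```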

3 Simulation Modules

Simulation modules are described in SystemC and use Scenic macros to register the module in a module library, read configuration data during instantiation, export internal member variables, register simulation events, and generate debug messages and trace data.

3.1 Module Library

The module library handles both module registration and instantiation. When Scenic is launched, all simulation modules are automatically registered in the module library (using a Scenic macro). The system specification refers to simulation modules in the module library, which are instantiated with the dynamic module configuration specified in the XML file.


The simulation modules in the module library can be listed using the command list lib, which shows the module name and the number of currently instantiated components of each type.

[0 s]> list lib

library module(s) :

GenericProcessor_v1_00 [1 instance(s)]
mpmc_v1_00             [1 instance(s)]
spb2mpmc_v1_00         [1 instance(s)]
spb_v1_00              [1 instance(s)]

3.2 Module Hierarchy

During the simulation phase, the hierarchical view of the constructed system can be shown using the list mod command. The command shows the hierarchical name and type of each module.

[0 s]> list mod

instantiated module(s) :

Adapter     [spb2mpmc_v1_00]
BUS         [spb_v1_00]
CPU         [GenericProcessor_v1_00]
CPU.program [sci_memory]
MEM         [mpmc_v1_00]
_memory     [access object]
_scheduler  [sci_module]

In addition to the user simulation modules, Scenic creates a number of special simulation objects. The special simulation objects use an underscore at the beginning of the name to separate them from user simulation modules. The "_memory" module handles global memory and implements functions to list, read, and write memories that are registered as global. The "_scheduler" module handles logging of registered variables to improve simulation performance.

3.3 Module Commands

Each simulation module supports basic functionality, inherited from sci_access, to list parameters and variables, read and write variables, periodically log variables, create conditional simulation events, and trace data to file. In addition, a module based on sci_module also supports port binding. The help command lists the supported functions and shows which class implements the functionality.


[0 s]> CPU.program help
---------------- access object ----------------
.                                            : Show sub-modules
list <-display>                              : Show internal variables
param <name>                                 : Show / get parameters
get [-var] <-vmin> <-vmax>                   : Get variable value
set [-var] [-value] <-vmin> <-recursive>     : Set variable value
size [-var]                                  : Get the size of a variable
log [-var] <-depth> <-every>                 : Log to history buffer
push [-var]                                  : Push variable to history
read [-var] <-vmin/vmax/tmin/tmax> <-timed>  : Read history buffer
event [name] [-when] <-message>              : Create conditional event
trace <on,off> <-filename>                   : Trace to file
----------------- sci_memory ------------------
rd <-addr> <-size> <-bin> <-file>            : Read memory contents
wr <-addr> <-size> <-bin> <-file> <-data>    : Write memory contents

Scenic contains base classes for frequently used objects, such as sci_memory and sci_processor, which provide additional module-specific functionality. For example, the memory class implements functions to read and write memory contents, as shown below. In the same way, user modules can implement custom functions to respond to requests from the Scenic shell.

[0 s]> CPU.program rd -addr 0x0 -size 64

00000000h: 64 00 00 0c 00 00 20 0c 00 00 40 0c 00 10 01 08  d..... ...@.....
00000010h: 00 10 01 04 04 00 21 10 02 00 00 10 00 10 01 08  ......!.........
00000020h: fc ff 00 18 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

3.4 Module Parameters

The parameters used when instantiating a module are listed using the param command. The command shows the parameter name, the type index, and the parameter value.

[10 ns]> MEM param
C_BASE_ADDR   [typeID=6]  { 0 }
C_BYTE_SIZE   [typeID=6]  { 1048576 }
C_CAS_LATENCY [typeID=6]  { 3 }
C_DATA_WIDTH  [typeID=6]  { 32 }
C_ENABLE_DDR  [typeID=0]  { FALSE }
C_PERIOD      [typeID=13] { 10 ns }

Parameters are read-only and can be accessed individually by specifying the parameter name as a second argument.


3.5 Module Variables

Each module can export local member variables to make them accessible from the Scenic shell. This reflects the variable's type, size, and value during simulation time. Accessible variables enable dynamic configuration and extraction of performance data during simulation. All registered variables in a module can be viewed using the module's list command. It shows the name, vector size, data type size, history buffer size, and the variable value.

[1 ms]> CPU list
_debug_level            1x4  [0] { MESSAGE }
_trace_file             1x32 [0] { }
_trace_flag             1x1  [0] { 0 }
%PC                     1x1  [0] { 19 }
stat_exec_instructions  1x4  [0] { 98662 }
batch_size              1x4  [0] { 1 }
use_data_bus            1x1  [0] { TRUE }
use_instr_bus           1x1  [0] { TRUE }

Each registered variable can be accessed and modified using the get and set commands. The example below illustrates how the program counter (PC) in a processor is set to instruction 4. After simulating two clock cycles, the PC reaches the value 6. For vector variables, the value is specified as a string of values.

[10 ns]> CPU set -var "%PC" -value 4
[10 ns]> runb 20 ns
[30 ns]> CPU get -var "%PC"

[ 6 ]

Each variable can be periodically logged to study trends over time. The log command configures the history buffer depth and the logging time interval. To disable logging, the time interval is set to 0. The read command is used to access data in the history buffer, where time and vector ranges are optional parameters. Time/value pairs can also be acquired by setting the timed flag to true.

[0 us]> CPU log -var "%PC" -depth 10 -every "10 ns"
[0 us]> runb 1 us
[1 us]> CPU read -var "%PC"

[ 4 ; 5 ; 6 ; 7 ; 8 ; 4 ; 5 ; 6 ; 7 ; 8 ]
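The bounded history buffer that log configures can be sketched as follows. This is an illustrative model with names of our choosing, not Scenic's internal implementation.

```python
# Sketch: a history buffer holding the last -depth samples of a
# variable, sampled every -every interval; oldest samples fall out.
from collections import deque

class HistoryBuffer:
    def __init__(self, depth: int):
        self.samples = deque(maxlen=depth)   # bounded: drops oldest

    def push(self, time_ns: int, value):
        self.samples.append((time_ns, value))

    def read(self, timed: bool = False):
        """Return values, or (time, value) pairs when timed is True."""
        return list(self.samples) if timed else [v for _, v in self.samples]

# Log a toy program counter every 10 ns, keeping the last 10 samples:
log = HistoryBuffer(depth=10)
for t in range(0, 1000, 10):
    log.push(t, 4 + (t // 10) % 5)           # PC cycling 4..8
print(log.read())   # -> [4, 5, 6, 7, 8, 4, 5, 6, 7, 8]
```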

3.6 Module Debug

Modules use a set of macros to write debug information to the screen, shown in Table 1. In this way, the level of detail can be configured for each individual module. Each module contains a system variable named "_debug_level" that


can be changed either by the global command debug, which configures all simulation modules, or individually by changing the variable with the set command. The following command sets the debug level for all modules to warning, and then sets the debug level for the processor to message. During simulation, the processor reports the executed instructions using a message macro, which prints the information on the screen.

[0 s]> debug -level WARNING
[0 s]> CPU set -var _debug_level -value MESSAGE
[0 s]> runb 50 ns

[0 ns]  <CPU> In reset
[10 ns] <CPU> processing instruction @0
[20 ns] <CPU> processing instruction @1
[30 ns] <CPU> processing instruction @2
[40 ns] <CPU> processing instruction @3

[50 ns]>

Tracing is used to extract information from a simulation model to a file; the corresponding macros can be found in Table 1. If the simulation model generates trace data, the trace command can be used to start streaming it to file. Tracing can be interactively turned on and off during simulation, and the hierarchical name of the module is used if a filename is not given.

[0 us]> MEM trace on -file "memory_trace.txt"
[0 us]> runb 1 us
[1 us]> MEM trace off

4 System Specification

The system description is based on the eXtensible Markup Language (XML) and is divided into module instantiation, binding, and configuration. Instantiation and binding are handled during the specification phase using the command xml load, while configurations can be loaded during the simulation phase using xml config. Multiple XML files can be loaded and are merged into a single system during elaboration.

xml load -file system.xml              % load system description
xml load -file config.xml              % load configurations
sim                                    % build the simulation
runb 1 ns                              % start simulation phase
xml config -name "config_processor"    % configure module

4.1 Module Instantiation

Simulation models are instantiated from the module library based on a system specification. The specification contains the models to instantiate (type), the


hierarchical name of the module (name), and parameters to statically configure each module.

<INSTANTIATE>
  <MODULE type="Processor_v1_00" name="CPU">
    <PARAMETER name="C_PERIOD" type="TIME" value="5 ns"/>
    <PARAMETER name="C_USE_CACHE" type="BOOL" value="TRUE"/>
    <PARAMETER name="C_REGISTERS" type="UINT" cmd="$I"/>
  </MODULE>
  ...
</INSTANTIATE>

Parameters are given for each module to be instantiated and consist of a parameter name, type, and value. The parameter name is unique and is used by the simulation model (in the constructor) to read out the configuration. The type can be one of the following: string, bool, char, uchar, short, ushort, int, uint, long, ulong, float, double, time. string is assumed if no type name is given, and the value must match the type.
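Reading such typed parameters from a MODULE element can be sketched as below. This is an illustrative model of the format described above, not Scenic's loader; the function name and coercion table are ours, and the TIME type is kept as text for simplicity.

```python
# Sketch: coerce <PARAMETER> attributes to typed values following the
# type names listed above (string is assumed when the type is omitted).
import xml.etree.ElementTree as ET

COERCE = {
    "STRING": str, "BOOL": lambda v: v.upper() == "TRUE",
    "INT": int, "UINT": int, "FLOAT": float, "DOUBLE": float,
    "TIME": str,   # e.g. "5 ns": kept as text in this sketch
}

def read_parameters(module_desc: str) -> dict:
    params = {}
    for p in ET.fromstring(module_desc).iter("PARAMETER"):
        coerce = COERCE.get(p.get("type", "STRING").upper(), str)
        params[p.get("name")] = coerce(p.get("value"))
    return params

desc = """<MODULE type="Processor_v1_00" name="CPU">
  <PARAMETER name="C_PERIOD" type="TIME" value="5 ns"/>
  <PARAMETER name="C_USE_CACHE" type="BOOL" value="TRUE"/>
</MODULE>"""
print(read_parameters(desc))  # -> {'C_PERIOD': '5 ns', 'C_USE_CACHE': True}
```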

4.2 Module Binding

If a simulation module supports binding to another simulation module, then the modules can be bound from the system specification using bind tags. Each module implements functionality to bind to compatible modules, which can be called during the specification phase.

<CONNECT>
  <BIND src="CPU" dst="BUS" if="ICACHE"/>
  <BIND src="CPU" dst="BUS" if="DCACHE"/>
</CONNECT>

All parameters specified in the bind tag are sent to the module's bind function, for example the parameter if, while src and dst represent the modules to be bound.

4.3 Module Configuration

In some cases it is useful to dynamically reconfigure the simulation modules during the simulation phase. Therefore, it is possible to create different named configurations that can be loaded during simulation. A configuration is loaded using the xml config command by specifying the name of the configuration.

<CONFIG name="config_processor">
  <MODULE name="CPU">
    <PARAMETER name="C_PROGRAM" type="STRING" value="demo.asm"/>
    <PARAMETER name="C_ADDR" type="UINT" value="0"/>
  </MODULE>
</CONFIG>


The configuration contains information about how to reconfigure one or more simulation modules. It uses the module's hierarchical name, and parameters are given in the same way as for module instantiation.


Table 1: Description of Scenic macros and functions.

Scenic macros and functions                      Description
SCI_REGISTER_LIBRARY_MODULE(module)            : Register a simulation module in the module library (global context)
SCI_REGISTER_ARCHITECTURE(module)              : Register an architecture; an alternative to using XML
SCI_REGISTER_GLOBAL_MEMORY(mem, base, name)    : Register a memory in the global memory space
SCI_REGISTER_COMMAND(name, func_ptr, desc)     : Register a user-defined command in the Scenic shell
SCI_REGISTER_SYSTEM_VARIABLE(name, func_ptr)   : Register a user-defined system variable in the Scenic shell
SCI_REGISTER_CODE_GENERATOR(gen, format, desc) : Register a code generator for a specific file type
SCI_VERBOSE(message)                           : Print verbose message on debug level verbose
SCI_MESSAGE(message)                           : Print message on debug level message
SCI_WARNING(message)                           : Print warning on debug level warning
SCI_ERROR(message)                             : Print error on debug level error
SCI_TRACE(label, data)                         : Generate trace data if tracing is enabled
SCI_TRACE_V(var)                               : Trace a variable if tracing is enabled
SCI_TRACE_I(var, index)                        : Trace a variable [index] if tracing is enabled
SCI_SIM_EVENT(event, name, message)            : Register a simulation event to execute variable name
SCI_SIM_EVENT_NOTIFY(event)                    : Trigger the simulation event to execute the command specified in its variable
scenic_callback(var, name, size)               : Export callback variable var as name
scenic_variable(var, name)                     : Export variable var as name
scenic_variable(var, name, size)               : Export variable var with size elements as name
scenic_vector(var, name)                       : Export vector var as name
scenic_vector(var, name, size)                 : Export vector var with size elements as name


4. SYSTEM SPECIFICATION

Table 2: Description of Scenic user commands and global simulation modules.

Command                                          Description

codegen [-format] [-top] [-proj] [-dir]          Run the code generator specified by format on a hierarchical simulation module top.
debug [-level]                                   Set the global debug level.
echo [text]                                      Print a text message.
eval [A] [op] [B]                                Evaluate a mathematical or logical operation.
exec [-file] [-repeat] [-fork]                   Execute a script file with Scenic commands repeat times; fork executes the script in a new context.
exit                                             Exit the program.
foreach [-var] [-iterator] [-command]            Execute a command for every value in the set var.
function(parameters)                             Function declaration with parameter list.
list [mod,gen,lib,env,arch]                      List modules, generators, libraries, environment variables, or architectures.
random [-min] [-max] [-size] [-round] [-seed]    Generate a sequence of size random numbers in the range min to max.
return [value]                                   Return a value from a function call.
run [time] [unit]                                Run the simulation for time time units; valid time units are [fs,ps,ns,us,ms,s].
runb [time] [unit]                               Blocking version of run.
set [var] [value]                                Assign or list environment variables.
sim [-arch] [-nostart]                           Launch the SystemC simulator and create the system from an architectural description or from loaded XML.
step [time] [unit]                               Run a single micro-step / set the micro-step value.
stop                                             Halt a running simulation at the next micro-step.
system [command]                                 Execute a system shell command.
unset [var]                                      Remove the environment variable var.
xml [clear,load,config,view]                     Load an XML system description or configure simulation modules from XML.

Global modules                                   Description

memory [map,rd,wr]                               Scenic module that manages the global memory map
scheduler [active]                               Scenic module that manages variable access logging
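As an illustrative sketch only, a short shell session combining the commands above might look as follows; the architecture name `fft_system` and the chosen values are invented, and the exact option syntax may differ:

```
debug -level message
sim -arch fft_system
run 100 us
stop
exit
```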


Appendix B

DSP/MAC Processor Architecture

The DSP and MAC processors, presented in Part IV, are based on the same architectural description, but with minor differences in the instruction set. The DSP processor uses a 32-bit ALU and supports real- and complex-valued radix-2 butterflies. The MAC unit is based on a 16-bit ALU with multiplication support, and implements instructions to join and split data transactions between the external ports and the internal registers.
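As a software sketch (not the thesis RTL), the radix-2 butterfly supported by the DSP processor computes out0 := a + b and out1 := a - b; for complex data packed as 2 × 16 bits, the two halves are processed separately, mirroring the separated-ALU mode. The packing convention (real part in the high half) is an assumption for this example:

```cpp
#include <cstdint>

// Complex sample as two 16-bit components (2 x 16 bits per 32-bit word).
struct Complex16 { int16_t re, im; };

static inline Complex16 unpack(uint32_t w) {
    // assumed convention: real part in bits 31-16, imaginary in bits 15-0
    return { static_cast<int16_t>(w >> 16),
             static_cast<int16_t>(w & 0xFFFFu) };
}

static inline uint32_t pack(Complex16 c) {
    return (static_cast<uint32_t>(static_cast<uint16_t>(c.re)) << 16)
         | static_cast<uint16_t>(c.im);
}

// Radix-2 butterfly: out0 := a + b, out1 := a - b, applied element-wise
// to the two 16-bit halves (separated complex ALU).
static inline void butterfly(uint32_t a, uint32_t b,
                             uint32_t &out0, uint32_t &out1) {
    Complex16 ca = unpack(a), cb = unpack(b);
    out0 = pack({ static_cast<int16_t>(ca.re + cb.re),
                  static_cast<int16_t>(ca.im + cb.im) });
    out1 = pack({ static_cast<int16_t>(ca.re - cb.re),
                  static_cast<int16_t>(ca.im - cb.im) });
}
```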

Table 2 presents the instruction set for the VHDL implementation, while the Scenic models support additional and configurable functionality. Figure 1 presents a detailed description of the processor architecture. It is based on a three-stage pipeline for instruction decoding, execution, and write-back. The program is stored internally in the processing cell (PGM), and the main controller handles configuration management and the control/status register.
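In the branch semantics of Table 2, sxt(Imm) denotes sign extension of the 16-bit immediate to the 32-bit program counter width, so that negative offsets encode backward branches. A minimal C++ sketch of the operation (the function name follows the table's notation):

```cpp
#include <cstdint>

// sxt(Imm): sign-extend a 16-bit immediate field to 32 bits.
static inline int32_t sxt(uint16_t imm) {
    return static_cast<int32_t>(static_cast<int16_t>(imm));
}
```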

Table 1: Flags for instructions of type A.

Flag Bit Processor Description

l     0    DSP/MAC    End of inner loop
c     1    DSP        Separated ALU for complex data (2 × 16 bits)
a     1    MAC        Accumulate value to xACC
r     2    DSP/MAC    Local i/o port read request
w     3    DSP/MAC    Local i/o port write request
s     5    MAC        Barrel shifter enabled (uses bits 0-4)




Table 2: Instruction set for the 32-bit DSP processor and the 16-bit MAC processor.

Type A instructions (bit fields: 31-26 opcode, 25-21, 20-16, 15-11, 10-6 operands, 5-0 flags):

  ADD  D0,S0,S1      000001  D0 S0 S1    {srwcl}  D0 := S0 + S1
  SUB  D0,S0,S1      000010  D0 S0 S1    {srwcl}  D0 := S0 - S1
  DMOV D0,D1,S0,S1   000111  D0 D1 S0 S1 {srwl}   D0 := S0, D1 := S1

Type B instructions (bit fields: 31-26 opcode, 25-21, 20-16 operands, 15-0 immediate):

  ADDI D0,S0,Imm     100001  D0 S0 Imm   D0 := S0 + sxt(Imm)
  SUBI D0,S0,Imm     100010  D0 S0 Imm   D0 := S0 - sxt(Imm)
  BEQI S0,Imm        100011  S0 Imm      PC := PC + sxt(Imm) if S0 = 0
  BNEI S0,Imm        100100  S0 Imm      PC := PC + sxt(Imm) if S0 ≠ 0
  BLTI S0,Imm        100101  S0 Imm      PC := PC + sxt(Imm) if S0 < 0
  BLEI S0,Imm        100110  S0 Imm      PC := PC + sxt(Imm) if S0 ≤ 0
  BGTI S0,Imm        100111  S0 Imm      PC := PC + sxt(Imm) if S0 > 0
  BGEI S0,Imm        101000  S0 Imm      PC := PC + sxt(Imm) if S0 ≥ 0

Special instructions:

  NOP                000000              No operation
  BRI  Imm           101001  Imm         PC := PC + sxt(Imm)
  END  Imm           101010  Imm         End execution with code = Imm
  ILC  Imm           101011  Imm         ILP := PC + 1, ILC := Imm
  GID  Imm           101100  Imm         Global port destination ID = Imm

32-bit DSP specific:

  BTF  D0,D1,S0,S1   000011  D0 D1 S0 S1 {srwcl}  D0 := S0 + S1, D1 := S0 - S1

16-bit MAC specific:

  SMOV D0,D1,S0      000101  D0 D1 S0    {srwl}   D0 := HI(S0), D1 := LO(S0)
  JMOV D0,S0,S1      000110  D0 S0 S1    {srwl}   HI(D0) := S0, LO(D0) := S1
  MUL  S0,S1         000100  S0 S1       {srwal}  HACC := HI(S0*S1), LACC := LO(S0*S1)


[Figure 1 is not reproduced here. The original diagram details the datapath: the pipeline registers (IF/ID, ID/EXE, EXE/WB), the program memory (PGM) with its main control FSM and control/status registers, the register file ($0: zero register; $1: jump-and-link register, reserved; $2 and up: general purpose registers), the 16/32-bit ALU and CO-ALU, the 48/64-bit barrel shifter, the HACC (32-bit) and LACC (16-bit) accumulators, the forwarding logic, and the local ports L0-L7 (RX/TX) and global ports G0RX/G0TX.]

Figure 1: A detailed processor architecture for the 32-bit DSP processor and the 16-bit MAC processor. The processor instruction set, register setup, port configuration, and special functional units are configurable during design time.