research article hardware acceleration of beamforming in a...

12
Hindawi Publishing Corporation VLSI Design Volume 2013, Article ID 861691, 11 pages http://dx.doi.org/10.1155/2013/861691 Research Article Hardware Acceleration of Beamforming in a UWB Imaging Unit for Breast Cancer Detection Francesco Colonna, Mariagrazia Graziano, Mario R. Casu, Xiaolu Guo, and Maurizio Zamboni Department of Electronics and Telecommunications (DET), Politecnico di Torino, C.so Duca degli Abruzzi, 24 I-10129 Torino, Italy Correspondence should be addressed to Mario R. Casu; [email protected] Received 14 September 2012; Accepted 15 May 2013 Academic Editor: Lazhar Khriji Copyright © 2013 Francesco Colonna et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. e Ultrawideband (UWB) imaging technique for breast cancer detection is based on the fact that cancerous cells have different dielectric characteristics than healthy tissues. When a UWB pulse in the microwave range strikes a cancerous region, the reflected signal is more intense than the backscatter originating from the surrounding fat tissue. A UWB imaging system consists of transmitters, receivers, and antennas for the RF part, and of a digital back-end for processing the received signals. In this paper we focus on the imaging unit, which elaborates the acquired data and produces 2D or 3D maps of reflected energies. We show that one of the processing tasks, Beamforming, is the most timing critical and cannot be executed in soſtware by a standard microprocessor in a reasonable time. We thus propose a specialized hardware accelerator for it. We design the accelerator in VHDL and test it in an FPGA-based prototype. We also evaluate its performance when implemented on a CMOS 45nm ASIC technology. e speed-up with respect to a soſtware implementation is on the order of tens to hundreds, depending on the degree of parallelism permitted by the target technology. 1. Introduction Prescreening tests aimed at breast cancer diagnosis dra- matically reduce mortality. Mammography, the technique currently used for screening, is very effective but has a few significant shortcomings: its high cost prevents a widespread diffusion, thus limiting de facto the organization of pervasive screening campaigns; its rate of false positives in young patients is very high; its use of ionizing radiations does not allow a frequent use; the obtained images are not tridimen- sional. Other techniques, like ultrasound or magnetic reso- nance, partially solve these problems but raise other issues. None of these techniques has the characteristics required to promote a widespread diffusion and to permit frequent screening campaigns on large ensembles of individuals. One promising alternative is based on the radar principle and operates at microwave frequencies with Ultrawideband (UWB) pulses [13]. A set of antennas is placed around the patient’s breast, and UWB pulses are sent to the breast target. e breast tumor typically exhibits a large dielectric contrast with the surrounding fatty tissue and thus reflects more the incident signal. e detection of the tumor requires, however, a significant amount of processing of the reflected signals. Figure 1 is an overview of a full UWB system for breast cancer detection. e UWB technique does not use ionizing radiations and proved capable of detecting tumors as small as 2 mm [4]. So far, the technique has been demonstrated using bulky RF apparatus (e.g., a network analyzer connected to antennas placed around the patient’s breast) and standard general purpose processors for the elaboration of the UWB signals and the generation of 2D or 3D maps of the reflected energy. Our aim is to demonstrate the feasibility of an integrated and compact system, which we outline in Section 2. Our previous works were focused on the UWB probe system, particularly on the transmitter and the receiver [58]. Here, we focus instead on the processing part, which we refer to as Imaging Unit (see Figure 1). So far, soſtware-only versions of the complex processing algorithms have been proposed

Upload: others

Post on 04-Jul-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Research Article Hardware Acceleration of Beamforming in a ...downloads.hindawi.com/journals/vlsi/2013/861691.pdf · Hardware acceleration Microprocessor High speed link Backscattered

Hindawi Publishing CorporationVLSI DesignVolume 2013 Article ID 861691 11 pageshttpdxdoiorg1011552013861691

Research ArticleHardware Acceleration of Beamforming in a UWB ImagingUnit for Breast Cancer Detection

Francesco Colonna Mariagrazia Graziano Mario R Casu Xiaolu Guoand Maurizio ZamboniDepartment of Electronics and Telecommunications (DET) Politecnico di Torino Cso Duca degli Abruzzi 24 I-10129 Torino Italy

Correspondence should be addressed to Mario R Casu mariocasupolitoit

Received 14 September 2012 Accepted 15 May 2013

Academic Editor Lazhar Khriji

Copyright copy 2013 Francesco Colonna et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

The Ultrawideband (UWB) imaging technique for breast cancer detection is based on the fact that cancerous cells have differentdielectric characteristics than healthy tissues When a UWB pulse in the microwave range strikes a cancerous region the reflectedsignal is more intense than the backscatter originating from the surrounding fat tissue A UWB imaging system consists oftransmitters receivers and antennas for the RF part and of a digital back-end for processing the received signals In this paper wefocus on the imaging unit which elaborates the acquired data and produces 2D or 3Dmaps of reflected energies We show that oneof the processing tasks Beamforming is the most timing critical and cannot be executed in software by a standard microprocessorin a reasonable time We thus propose a specialized hardware accelerator for it We design the accelerator in VHDL and test it in anFPGA-based prototype We also evaluate its performance when implemented on a CMOS 45 nm ASIC technology The speed-upwith respect to a software implementation is on the order of tens to hundreds depending on the degree of parallelism permittedby the target technology

1 Introduction

Prescreening tests aimed at breast cancer diagnosis dra-matically reduce mortality Mammography the techniquecurrently used for screening is very effective but has a fewsignificant shortcomings its high cost prevents a widespreaddiffusion thus limiting de facto the organization of pervasivescreening campaigns its rate of false positives in youngpatients is very high its use of ionizing radiations does notallow a frequent use the obtained images are not tridimen-sional Other techniques like ultrasound or magnetic reso-nance partially solve these problems but raise other issuesNone of these techniques has the characteristics requiredto promote a widespread diffusion and to permit frequentscreening campaigns on large ensembles of individuals

One promising alternative is based on the radar principleand operates at microwave frequencies with Ultrawideband(UWB) pulses [1ndash3] A set of antennas is placed around thepatientrsquos breast and UWB pulses are sent to the breast targetThe breast tumor typically exhibits a large dielectric contrast

with the surrounding fatty tissue and thus reflects more theincident signalThe detection of the tumor requires howevera significant amount of processing of the reflected signalsFigure 1 is an overview of a full UWB system for breast cancerdetection

The UWB technique does not use ionizing radiationsand proved capable of detecting tumors as small as 2mm[4] So far the technique has been demonstrated usingbulky RF apparatus (eg a network analyzer connected toantennas placed around the patientrsquos breast) and standardgeneral purpose processors for the elaboration of the UWBsignals and the generation of 2D or 3D maps of the reflectedenergy

Our aim is to demonstrate the feasibility of an integratedand compact system which we outline in Section 2 Ourprevious works were focused on the UWB probe systemparticularly on the transmitter and the receiver [5ndash8] Herewe focus instead on the processing part which we refer toas Imaging Unit (see Figure 1) So far software-only versionsof the complex processing algorithms have been proposed

2 VLSI Design

Recon-structed3D image

Imagingunit

Anatomical sensorsarray of transmittersand receivers for scanof breast volume

Tumor

Safetissue

Front endinterface

Generated pulse

Backscattered

Hardwareacceleration MicroprocessorHigh speed

link

Backscattered signal S

signal T

contrast T S= 1 10

Figure 1 The UWB-based breast cancer imaging system The insetis a high-level view of the imaging unit which consists of a processorfor the execution of the less demanding tasks and a hardwareaccelerator for the most computationally-intensive processingtasks

[4 9] which get typically executed offline by general purposeprocessors For the system to be effectively and efficientlyexploited in screening the imaging unit should be ableto process UWB data almost in real-time that is with anegligible delay compared to the whole examination timeas perceived by the patient The imaging unit should beorganized as in the inset in Figure 1 A microprocessorexecutes the control tasks and the part of the processing thatdoes not require specialized hardware whereas a hardwareaccelerator connected to the processor through a high-speedbus performs all the timing-critical tasks

In this paper we report the results of our investigation onthe imaging unit These are our contributions

(i) We are the first to profile the various software tasks acomplete UWB imaging algorithm for breast cancerdetection consists of [10] Based on the profiling weare able to decide which part of the imaging softwareneeds to be accelerated by a specialized hardware Aswe illustrate in Section 3 Beamforming is the mostcritical task

(ii) We propose for the first time a complete implemen-tation of a hardware accelerator for the beamformingalgorithm in synthesizable VHDLThe architecture isdescribed in Section 4

(iii) We prove the functionality of the accelerator with aFPGA-based prototype as shown in Section 5

(iv) We evaluate performance and resource utilizationwhen the accelerator is implemented both in a XilinxVirtex-4 and in a standard-cell ASIC with a 45 nmCMOS technology These results are reported inSection 6

We end the paper in Section 7 with a recapitulation ofthe main results alongside a final discussion about ourachievements

2 The System

The overall UWB imaging system we aim at is sketched inFigure 1 An array of transceivers generates and receivesUWBsignals Figure 2(b) reports a typical UWB pulse The litera-ture suggests that the frequency range between 05GHz and10GHz should be covered to achieve both high contrast (lowfrequency range) and high resolution (high frequency range)A variety of signals can be used like Gaussian derivative ormodulated and modified Hermite polynomials (MMHP)

The contrast between the damaged (possibly a tumor)and healthy regions is due to an abrupt dielectric constantvariation and can be as high as 10 1

The entire breast volume is scanned by sequentiallyactivating the transceivers In the monostatic configuration[11] each antenna transmits and receives a scattered signalTransceivers activation and data collection are coordinatedby a control unit through a synchronous serial bus as shownin the block diagram in Figure 2(a) All sampled data arecollected and delivered to a memory

Samples in memory are processed by the imaging unitwhich in turn produces a 2D or 3D map of reflected energyvalues each related to a different breast volume location(voxel) The maps are finally sent to a personal computer anddisplayed

The total time necessary to execute a scan of the wholevolume is typically much less than one second [12] eventhough it may change depending on design choices likethe number of antennas 119873ANT the sampling frequency thenumber of samples acquired 119873

119904 the type of UWB signal

selected and the methods used to compensate noise Wethen assume 1 s as a reference value to which we comparethe time required by the imaging unit In particular weassess the need for hardware acceleration when the softwareprocessing largely exceeds this timing reference resulting inan unacceptable performance for a realistic clinical scenario

The imaging unit is then the focus of this work and itsdata-flowdiagram is shown in Figure 3 It performs the imagereconstruction in two steps according to the flow proposedin [9 11]Calibration (Skin Artifact Removal Algorithm (SKAR) Blockin Figure 3) This step removes unwanted scattering contri-butions determined by the large dielectric contrast betweenair and skin These artifacts are orders of magnitude greaterthan useful contribution and need to be removedBeamforming (Beamforming (BEAF) Block in Figure 3) Thisstep reconstructs a map of scattered energy in a 2D or 3Dvolume This is done by the microwave imaging via space-time (MIST) beamforming method which coherently shiftsin time the various received signals in order to focus theenergy analysis on a single voxel [4]

The following section details the Imaging Unit organiza-tion and provides a profiling of the time needed to executethe various subtasks of calibration and beamforming

VLSI Design 3

Memory

Bus

Control

RXTX

RXTX

RXTX

Imagingunit

Array oftrans-ceivers

interface

Front end

Front end

(a)

Am

plitu

de (V

)

1

05

0

minus05

minus1

Differentiated Gaussian pulse f = 6GHz 120591 = 110ps

Transmitted pulse

0 1 2 3 4 5Time (ns)

(b)

Figure 2 The system block diagram (b) an example of a UWB transmitted pulse central frequency 6 GHz total duration time 110 ps

Weights-BEAF

Weights for

Weights-SKAR

SKAR

Calibratedsamples

Energyvalues

BEAF

Access BEAF weights

Weights for

Samples

Calibrationskin

artifactremoval

Mistbeamforming

(Nvox times)

BEAF

SKAR

Figure 3 Data-flow diagram of the imaging unit computation tasks

3 The Imaging Unit

The imaging unit performs sequentially the following sub-tasks (see Figure 3)

Weights-SKAR The skin artifact removal algorithm (here-inafter SKAR) needs weight coefficients that are functions ofthe received signals and the position of the antennas Thesedata are then processed once for each scan of the breast togenerate the weights for SKAR

SKAR By removing the artifact of the large skin reflectionfrom the received signal samples as well as the contributionof the direct path between transmitter and receiver the SKARsub-task generates a set of calibrated samples

Weights-BEAF Beamforming (hereinafter BEAF) also needsweights coefficients but these only depend on the transmittedsignals and not on the received ones Therefore the weightsfor BEAF are computed once and for all at the beginningwhen the characteristics of the transmitted signal are selected

Access BEAF Weights In this phase BEAF weights are readfrom amemoryThis operation is repeated for each voxel We

separate it from the actual BEAF to permit a correct profilingof performanceBEAF This phase applies a space-time beamformer to thesignal samples received by each antenna a step that is donespecifically for each voxel To focus on a specific voxel thesignals are shifted in time filtered with the weights forBEAF and finally coherently added (see Section 4 for a moredetailed formulation of the algorithm)The weighting opera-tion enhances the contribution in each signal of the reflectiondetermined by the voxel under analysis and deemphasizesat the same time the scattering contributions of the othervoxels By iterating on the number of voxels BEAF producesan energy map for the whole volume

We first implemented all the subtasks in software usingOctave [13] We also built a finite-difference time-domain(FDTD) electromagnetic solver to simulate the transmissionof the signals and the reflections that occur in a realisticnumerical breast phantom [14]The output data of the FDTDsimulation are the received signals that we process with ourOctave routines according to the tasks outlined above

31 Profiling SKAR and BEAF have a very different impacton the overall execution timeThis difference depends on thedifference in complexity but mostly on the different numberof times they get executed SKAR is performed once andso is the computation of the SKAR weights whereas BEAFis repeated as many times as there are voxels In the largestmaps found in the breast phantom repository [14] the voxelscan be as many as 119873vox = 425 sdot 10

6 which clearly makesBEAF the bottleneck for the whole computationThe weightsrequired for the BEAF operation however do not depend onthe specific breast but only on some specific design choicessuch as the shape of the UWB pulse and the location oftransmitters and receivers The computation of these valuesis thus performed only once at the beginning of the breastexamination

4 VLSI Design

Table 1 Software runtime of SKAR and BEAF subtasks for a casewith119873vox = 425 sdot 10

6

Task RuntimeWeight-SKAR 164 sSKAR 025 s

Weights-BEAF 217msvoxel26 h

Access BEAF 732 120583svoxelweights 31 s

BEAF 488msvoxel58 h

We performed software simulations of the completesystem with an Intel CPU (Core i5-430M 3MB L3 226GHzclock frequency) 4GB DRAM and Linux Ubuntu 1110 withkernel 30 and instrumented our code to accurately determinethe CPU runtime The results are in Table 1 For the BEAFsub-tasks results are detailed for one voxel and for the wholebreast volume

For all sub-tasks with the exception of SKAR runtimeexceeds by far the 1 s bar However all tasks except weights-BEAF and BEAF are executed in less than one minute whichcan still be judged acceptable especially because a furtheroptimization of the code can significantly reduce this timeAs for weights-BEAF 26 h is certainly a long time but theweights could be computed offline once and for all andaccessed frommemory when needed As expected the BEAFalgorithm has the largest impact on timing performance dueto the large number of iterations Based on this analysis wedecided to focus on the hardware acceleration of the BEAFsub-task while all the other tasks are executed in softwareThe target is an imaging unit like the one sketched in Figure 1a processor is coupled with a dedicated hardware unit (ASIC)that accelerates the timing critical parts of the computation

4 A Hardware Implementation of BEAF

41 The MIST Beamforming (BEAF) Algorithm We report asuccinct description of the MIST beamforming and refer tothe literature for all the details [4 10]

Let us consider the backscatter contribution from a singlebreast location r

0 assuming that the signal received by the

119894th receiver contains only information from that source forsimplicity We denote the sampled version of that signal as119909119894[119899] and its Fourier transform as

119883119894(120596) = 119868 (120596) 119878

119894119894(r0 120596) 1 le 119894 le 119873ANT (1)

where 119868(120596) is the Fourier transform of the transmitted signal119894(119905)119873ANT is the number of antennas which is also the num-ber of transmitters and receivers 119878

119894119894(r0 120596) is an analytical

model of the monostatic frequency response associated withthe propagation from the 119894th antenna to r

0and back The

distance between each antenna and the point r0is not the

same for all the antennas and so the round-trip times of thesignals differWe then need to time align the signals by delay-ing themby an integer number of samples 119899

119894(r0) = 119899119886minus120591119894(r0)

In this equation 120591119894is the round-trip propagation delay in the

119894th channel for location r0and is obtained by dividing the

round-trip path length by the average propagation speed androunding to the nearest sample The term 119899

119886is the reference

time to which every signal is aligned and is chosen to be theworst-case delay over all channels and locations

119899119886ge round(max

119894r0120591119894(r0)) (2)

The signal is windowed to remove components before sample119899119886that are not important for energy estimationThe following

window function 119892[119899] is used

119892 [119899] = 1 119899 ge 119899

119886

0 otherwise(3)

The algorithm also allows to equalize path-length depen-dent dispersion and attenuation to interpolate any fractionaltime delays remaining after coarse time alignment and tobandpass-filter the signal These operations are done in timedomain using a bank of FIR filters each associated with avector of 119871 weights w

119894= [119908

1198940 1199081198941 119908

119894(119871minus1)] Increasing

the number of weights improves accuracy but increasescomplexity too and so must be carefully chosen

By summing the output of the filters we obtain

119911 [119899 r0] =

119873

sum

119894=1

119871

sum

119897=0

119908119894119897sdot 119892 [119899 minus 119897] sdot 119909

119894[119899 minus 119897 minus 119899

119894(r0)] (4)

where 119908119894119897is the 119897th weight of the 119894th filter and 119899

119894(r0) is the

delay value associated with the 119894th filterWe need to define another window

ℎ [119899 r0] =

1 119899ℎle 119899 le 119899

ℎ+ 119897ℎ

0 otherwise(5)

The interval between 119899ℎand 119899ℎ+119897ℎis thewindowof samples in

which we expect to find the received pulse Even if we knowthe shape of the transmitted signal the scattering from thetumor is frequency-dependent Therefore the backscatteredpulse is a distorted version of the transmitted pulse anddefining the exact timing window is not straightforwardThevalues of these parameters suggested in [4] are meant tomaximize the contrast for small lesions

The energy scattered by r0can be computed as the sum of

the square of the windowed values

119901 (r0) = sum

119899

1003816100381610038161003816119911 [119899] ℎ [119899 r0]1003816100381610038161003816

2 (6)

Many parameters in the previous equations have to beassigned a value before proceeding with the hardware designThe value of 119899

ℎcan be computed based on the time shift 119899

119886

which is easily computed as shown in (2) on the delay of theFIR filter 120591

0= (119871 minus 1)2 and on the first significant sample of

the transmitted pulse119873start119899ℎ= 119873Start + 119899119886 + 1205910 (7)

As for 119897ℎ an optimal value is suggested in [4] and is 119897

ℎ= 5

As for 119871 a reasonable value for a Gaussian pulse like the onein Figure 2 which we use for our experiments is 119871 gt 50 andwe chose 119871 = 55

VLSI Design 5

Table 2 Maximum difference of energy values with respect to idealcomputation for three approximation cases

Calibration Tumor Pos Max errorApprox-I Approx-II Approx-III

Low 2678 2328 025IDEAL Med 1464 494 007

Deep 1797 670 016Low 3360 2502 044

SKAR Med 1561 867 016Deep 3068 3446 048

42 Accuracy and Approximation One of the crucial designchoices is data representation We assume that receivedsignals are sampled and represented as 16-bit 2rsquos complementnumbers [12] These data are processed according to (4) and(6) in which multiplications and additions are the mostfrequent operations The number of bits for the intermediateand final results is one of the designerrsquos knobs who has tofind the best trade-off between precision and complexity Weexplored three possible approximation strategies which wediscuss in the followingApprox-IHere we keep data parallelism as low as possible Inthe first multiplication in (4) the 16-bit inputs are multipliedand a 32-bit output is produced Before the addition in thesame equation we truncate the multiplication results anddiscard the 16 least significant bits Then to avoid overflowthe additions are performed over 32 bits by extending theremaining 16 most significant bits This parallelism ensuresno overflow for up to 16 levels of addition Before the finalmultiplication in (6) the 16 least significant bits are discardedand the multiplication result is a 32-bit numberApprox-II In this case we do not discard the 16 leastsignificant bits in the square computation in (6) obtaining a64-bit resultThe 32 least significant bits of the result are thencut away This solution requires of course larger multipliersand so implies a larger complexityApprox-III In this case we keep the maximum level ofaccuracy and perform all the operations with 64 bits

We evaluated these three alternatives with a softwareimplementation of the algorithm in Octave before movingto the hardware design The algorithm performance dependson the relative position of any high-permittivity tissue andthe position of the antennas Therefore we compared theperformance considering an ideal breast model uniformlyfilled with fat tissue for three possible tumor positions in thebreast a small depth (low) an intermediate depth (med) anda considerable depth (deep)

We also evaluated the effect on the performance of theSKAR algorithm In particular we compared the perfor-mance of the actual SKAR with an ideal calibration in whichwe remove any artifact by artificially subtracting the responsewithout tumor from the response with tumor

The three maps in Figure 4 are obtained with SKARcalibration and with the three approximation strategies forthe case of deep position In the first case in particular the

low resolution of the internal computations creates a largeartificial clutter response and so a clear loss in accuracyNevertheless the tumor position is clearly and correctlyfound In the second case the removal of the least 32-bit ofthe energy result eliminates the clutter response but may endup in inaccuracies should a point of interest have a low energyand be therefore eliminated The third approximation is ofcourse the onewith the greatest accuracy In fact the results inthe thirdmap in Figure 4 are identical to those obtained froma simulation with double precision floating-point numbers

In Table 2 we report the maximum error of the energyvalues with respect to an ideal computation (with no approx-imation) The maximum error is calculated as follows

MaxError = max(10038161003816100381610038161003816100381610038161003816100381610038161003816

EMmax (EM)

minus

EMfp

max (EMfp)

10038161003816100381610038161003816100381610038161003816100381610038161003816

) (8)

EM represents the energy map without approximations(double precision) and EMfp is the energy map with approx-imation (fixed point) The error is determined by comparingenergy values normalized to the maximum values obtainedwith or without approximation respectively The maximumerror is consistent with the graphical results of the maps

To keep the maximum level of accuracy then we decideto maintain a 64-bit data representation This decision doesnot affect the design as much as one may think To store 64-bit energy values we need twice as much memory locationsthan we need for the 32-bit case However the memory sizeis mostly determined by the many weights we have to storeand both the number and the bit size of the weights do notdepend on the representation of the intermediate data

43 Architecture After having decided the parallelism forinternal data representation we moved on to the RTLhardware design We described our system in synthesizableVHDL Design-time parameters were used to keep the designflexible The block diagram in Figure 5 represents the BEAFarchitecture that we described in RTL In the following wedescribe the behavior of each componentWeights amp Delays Shift RegisterWhen the image reconstruc-tion process starts after the acquisition phase calibratedsamples weights for BEAF and delays values are availableeach in their own memories For each energy value an entirewindow of samples is processed in every FIR filter togetherwith the corresponding weights Before the filter computa-tion these weights are sequentially read from memory andloaded in a serial-in parallel-out shift register included in theWeights amp Delays shift register blockThe delay values are alsoread from memory at the same time They are sent to the 120591blocks to determine the correct time-shift amount necessaryto align the signals according to the beamforming algorithmdescribed in Section 41Sample Memory The sample values stored in this memoryare read one at a time for each channel and fed to the filterblocksFilters A bank of filters receives samples from the samplememory and delay values from the Weights ampDelays shift reg-ister In a fully parallel implementation the number of filters

6 VLSI Design

0

Beamforming approximated - SKAR calibration

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14

2e minus 08

4e minus 08

6e minus 08

8e minus 08

1e minus 07

i (cm)

j(c

m)

(a) First approximation type (Approx-I)

Beamforming approximated - SKAR calibration

0

5e minus 09

1e minus 08

15e minus 08

2e minus 08

25e minus 08

3e minus 08

35e minus 08

4e minus 08

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14i (cm)

j(c

m)

(b) Second approximation type (Approx-II)

Beamforming approximated - SKAR calibration

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14i (cm)

j(c

m)

0

1e minus 08

2e minus 08

3e minus 08

4e minus 08

5e minus 08

(c) Third approximation type (Approx-III)

Figure 4 Energy map of the breast after BEAF data approximation for ideal breast model and SKAR calibration A tumor is present in a deepposition

(119873FIR) corresponds to the number of transceiversantennasin the UWB probe As we discuss momentarily sequentialimplementations that use less filters than there are antennasare possible Each filter includes three main componentsThe120591 blocks are programmable delay lines which shift the inputsamples of the right time shift to ensure that all FIR inputs aretime alignedThe maximum delay is 119899

119886 TheWindow1 blocks

perform the 119892[119899] windowing functions they feed the FIRblocks only with samples that come after 119899

119886 and replace all

the samples that come before 119899119886with zero values In the FIR

blocks the samples are processed according to (4) Every FIRis a fully-parallel structure with 119871multipliers Eachmultiplieris fed with one of the weights which do not change duringthe whole voxel computation and with the samplesThe latterare sequentially introduced in the FIR blocks and then are

shifted from one multiplier to the next via an internal shiftregister The outputs of all the multipliers are then summedvia an adder tree producing one single FIR output Thiswhole computation block is pipelined with registers at theinput of every adder and multiplierSumTheoutputs of all the filters are summed together with aadder tree to obtain a single value which represents the resultof (4)Window2 The samples we are interested in are those inwhich the signal pulse occurs Window2 lets input data passunaltered if they correspond to the time interval specified by119899ℎand 119897ℎ Otherwise the output value is set to zero

Energy Computation This block computes the total voxelenergy according to (6) A multiplier and an accumulator are

VLSI Design 7

FIR

FIR

FIR

Filter

FSM

Energycomputation

Energymemory

Win

dow

2

Weightsand delays

shiftregister

Window1

Window1

Window1

120591

120591

120591

Sum shift register

SumSamplesmemory

Figure 5 RTL block diagram of the BEAF unit

used for this purpose Finally the energy of the current voxelgets written into the energy memoryFSM To coordinate the whole computation process a finitestate machine (FSM) is used A set of counters in it also keepstrack of the number of computation steps performed at anycontrol step

The filtering functions need not to be necessarily per-formed in parallel Two extreme solutions are possible Aparallel implementation in which all the data are processedconcurrently and a serial implementation in which onefilter is iteratively used In the serial implementation anaccumulator is inserted at the output of the filter since theFIR outputs must be summed to compute the energy value

Between the two extremes hybrid solutions are alsopossible We can instantiate a number of filters (and thus anumber of FIR blocks) 119873FIR that matches our timing andresource requirementsWe can use them iteratively to processthe data from119873ANT antennas In case of a hybrid solution ateach iteration the results of the partial sumof the FIR outputsare fed back to the sum shift register block in Figure 5 andadded coherently to the current value at a given iterationWhen all the antennas sets have been processed the finalresult is fed to the energy computation block The finitestate machine is designed in a parameterized way so as tocorrectly handle all possible solutions with different degreesof parallelism

It is worth mentioning that a choice of degree of par-allelism determines not just the number of FIR blocks butalso the number of 120591 Window1 and Window2 blocks In thefollowing whenwe refer to119873FIR as a parameter for the degreeof parallelism we mean that all these blocks are instantiated119873FIR times

We verified our parameterized VHDL code by means ofextensive logic-level simulations based on Mentor GraphicsrsquoModelsim [15] varying the degree of parallelismWe validatedthe output obtained with the RTL code against the resultsobtained with a software implementation and refined ourdesign until they perfectly matched

44 Timing Performance of BEAF In our architectureeach voxelrsquos energy is sequentially computed therefore the

beamforming execution time is

119879BEAF = 119873vox sdot 119879vox (9)

where 119873vox is the number of voxels and 119879vox is the timerequired to compute each voxel

The number of clock cycles necessary to execute thebeamforming step for each voxel depends on the number ofFIR filters in Figure 5 As we discussed above the spectrumof possible solutions spans from a fully sequential implemen-tation with a single FIR iteratively used for all the antennasto a fully parallel solution with as many filters as there areantennas The number of iterations depends on the ratiobetween number of antennas and number of filters

119873iter = lceil119873ANT119873FIR

rceil (10)

119879vox is the sum of three contributions

119879vox = 119879119908 + 119879comp + 119879119908119903 (11)

119879119908is the time necessary to write the FIR weights in the

memory once they are received from the link that connectsthe processor with the accelerator to read the weights fromthe memory and to store them in local registers within thefilters

119879119908= (2 (119871 + 1)119873ANT + 2119873iter) 119879CLK (12)

The first termwithin parenthesis in (12) is the time for writingand reading from the memory 119871 weights The second term isan overhead for enabling the memory and for asserting a fewcontrol signals 119879CLK is the clock period

119879comp is the actual computation time and is given by

119879comp = (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ lceillog2(119873FIR + 1)rceil )119873iter119879CLK

(13)

In (13) 119871wp is the time distance evaluated in time samplesthat corresponds to the worst path between two differentantennas and is necessary for the time alignment of thevarious signals Term lceillog

2(119871)rceil+119899

ℎ+119897ℎis the number of clock

cycles needed to load the samples in the filters with 119899ℎand 119897ℎ

representing the first sample in awindow of interest and 119897ℎthe

size of that window according to equation (5) Seven clockcycles are needed for window traversal and various registerwriting The term that depends on the logarithm of 119873FIR isrelated to the final addition in Figure 5

The final term in (11) 119879nrj is given by

119879nrj = 3119873iter (14)

and represents the time needed to write the voxelrsquos energyvalue back in the memory

We rewrite the final expression of 119879vox obtained bywrapping up all previous equations

119879vox = (2 (119871 + 1) sdot 119873ANT + 2 sdot 119873iter

+ (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ log2(119873FIR + 1)) sdot 119873iter + 3 sdot 119873iter) 119879CLK

(15)

8 VLSI Design

Figure 6 Experimental setup for the FPGA prototype the personalcomputer executes in software (Octave) part of the algorithm andallows energy maps visualization A USB link transfers data to andfrom the FPGA board where part of the beamforming algorithm isaccelerated

Equation (15) can be used to determine the overallexecution time of the beamforming algorithm in (9) giventhe total number of voxels and applies both to a 2D anda 3D map The accuracy of (15) was tested with VHDLsimulations usingMentor Graphicsrsquo Modelsim simulator andexperimentally verified with the FPGA prototype describedin the next section

5 Emulation with an FPGA-Based Prototype

We built an emulation prototype of the imaging unit Thisprototype relies on a general purpose microprocessor toperform the portions of processing that we decided to executein software while the hardware part of the BEAF algorithmwas deployed on an Xilinx Virtex-4 XC4VLX160-FF1513FPGA In particular wewere able to connect the Intel Core-i5of a laptop with the FPGA located in an AVNET Virtex-4 LXdevelopment boardTheCypress FX2USB 20 chip integratedin the development board was used to communicate with theFPGATheUSB chip behaves simply as a conduit between theUSB and the external data-processing logic The FX2 clockwas set to 48MHz the data parallelism to 8 bits and thepacket size to 512 bytes

The prototype has multiple purposes It serves as anaccelerator of the system-level simulation because it permitsa much faster execution of the part of the algorithm executedby the FPGA compared to a logic-level simulation Theoverall architecture with the software and the hardwarerunning concurrently is clearly a close approximation ofthe final system even though the processor and the digitalhardware will be different in the final target The RTL VHDLcode that gets synthesized on the FPGA is the same that wewill eventually implement in an ASIC therefore we can fullyverify it before fabrication Finally we can anticipate whatthe exact performance will be in terms of clock cycles Theonly difference in performance will be then determined bythe different clock frequency in FPGA and ASIC

Both the software and the hardware code are fullyinstrumented to permit a cycle accurate evaluation of timingperformance The USB communication time is also correctly

estimated and decoupled from the overall performance as amuch faster link will be used in a realistic setting like the onedepicted in Figure 1

The software running on the microprocessor is layeredand so the application part and the communication partare independent and portable The communication part isalso further partitioned in a set of low-level hardware-specific USB primitives (we use the libusb library availablein Linux) and a set of higher-level technology-independentAPIs for data organization and communication When thehardware target changes as it happens when moving fromthe prototype to the final target the developers replace onlythe low-level communication primitives and reuse both theapplication code and the communication APIs

Figure 6 is a photograph of the prototype and showsan energy map elaborated in the FPGA sent to the hostcomputer and displayed in the host screen

6 Experimental Results on FPGA and ASIC

We ran several experiments first with Xilinxrsquos ISE software[16] for the FPGA target of our prototype then with SynopsysDesign Compiler [17] for an ASIC target We were able todetermine the resource requirements and the actual timingperformance which depends on the actual clock frequency asfound by the FPGA and ASIC design CAD toolsThe variousbitstream files generated after FPGA synthesis mappingand place-and-route experiments were each tested in ourprototype They refer to a specific design case with nineantennas located around the numerical breast phantomnumber 012204 from the repository made available by theUniversity of Wisconsin [14] This phantom represents inthe database an average case in terms of type of tissue andnumber of voxelsWe have run finite-difference time-domain(FDTD) electromagnetic simulations to determine the signalreceived by the nine antennas and we fed the imagingunit with the received samples after 16-bit quantizationWe also extrapolated our performance results to a case offifty antennas which we assume it to be more or less thanmaximum number of antennas that can be placed around apatientrsquos breast

61 FPGA Experiments The FPGA resources the main partof which consists of look-up tables (LUT) and registers aregrouped in slices Depending on the complexity of the designand particularly on the number of filters 119873FIR a higher orlower amount of LUT and registers is used When mappingthe design the strategy of ISE does not consist in fillingthe slices one by one Instead to meet timing requirementsthe logic gates are spread over a high number of sliceswhich often results in a poor utilization Sometimes a slice isdeclared as occupied even if only one of its LUTs is usedThisfact limits the number of filters that we were able to put in ourVirtex-4 to a maximum of five FIRs The results of resourceutilization are reported in Table 3

We notice that the number of occupied slices increasesrapidly with the number of filters With six filters it exceedsthe number of available slices and we were not able to

VLSI Design 9

Table 3 Area allocation with different number of FIR filters

119873FIR Slice registers 4 input LUTs Occupied slices1 9081 (6) 8038 (5) 7188 (10)2 17347 (12) 20072 (14) 16115 (23)3 25551 (18) 44290 (32) 30899 (45)4 33851 (25) 67247 (49) 45139 (66)5 42055 (31) 91461 (67) 59936 (88)

Table 4 Hardware execution time over 14485 voxels with 9antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 128 s

119873FIR 119873ITER 119879exe [ms] Speed-up1 9 80639 15872 5 55336 23133 3 42581 30064 3 42650 30015 2 36249 35316 2 36249 35317 2 36249 35318 2 36295 35279 1 29871 4285

test that configuration in our prototype even though wecould estimate the performance in simulation The allocatedresources follow a trend that is almost linear with the numberof filters which is in accordance with our expectations

The timing performance depends proportionally on thenumber of voxels in the energy map This value is derivedfrom the 012204 phantom breast at 1mm resolution whichwe also used for the Octave simulations The 2D map gridconsists of 152 rows and 176 columns The Octave simulationcode was optimized to consider in the computation onlythose voxels which happened to be located in the ellipticalspace described by the antennas that is the area actuallyirradiated by the transmitters With such optimization thenumber of voxels is119873VOX = 14495

With such number of voxels the time required for thesoftware execution of the BEAF on the Intel i5 core of ourprototype (with Ubuntu 1110 kernel 30) was approximately119879SW = 128 sThe clock period for the FPGA implementationis 119879CLK = 16 ns which we use in our experiments with theprototype Table 4 reports the time required to execute theBEAF routine in hardware and the speed-up with respect to asoftware implementation Timing results with 119873FIR in range6ndash9 refer to logic simulations only because only five filterscould be placed in the FPGA However since all the resultsare consistent with (15) they are all fully accurate Table 5refers to a hypothetical case with fifty antennas and reportsonly simulated data which are also fully accurate comparedto a software execution that requires 119879SW = 71 s

In both Tables 4 and 5 we observe a similar trend Aswe expected the effect of changing the number of filtersmainly depends on the actual number of iterations 119873ITER =lceil119873ANT119873FIRrceil Increasing the number of filters boosts perfor-mance only when it allows to perform the computation with

Table 5 Hardware execution time over 14485 voxels with 50antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 71 s

119873FIR 119873ITER 119879exe [s] Speed-up1 50 448 15862 25 289 24533 17 238 29794 13 213 33315 10 194 36616 9 188 37867 8 181 39198 7 175 40599 6 168 421410ndash12 5 162 438113ndash16 4 156 456217ndash24 3 149 475625ndash49 2 143 497150 1 136 5205

less iterations For example in the case with nine antennaswe would not get a significant advantage from using eightfilters if we cannot use nine then we could as well use sixand still obtain basically the same performance we would getwith eight filters

The speed-up value grows rapidly for low values of119873FIRwhereas its marginal increase tends to diminish in the upperrangeThe reason is that the part of the algorithm that consistsof accessing the memory for reading samples and weightsremains sequential and cannot be made parallel

Considering the same phantom and iterating the per-formance evaluation in the 3D case with 1374953 voxels theresulting trends are similar To summarize in the worst caseof fifty antennas and 119879ITER = 50 the required time is 119879exe =42464 s (around 7 minutes) while in the best case of onlyone iteration we obtained119879exe = 12940 s (around 2minutes)With respect to a software implementation requiring 119879SW =

19 h the speedup is a factor 1611 and 5286 respectively inthe worst and best case

Referring to data discussed in Table 1 obtained for a worstcase breast phantom (in terms of number of voxels) theexecution time in the worst case (119879ITER = 50) would bearound 21 minutes (compared to the 58 h obtained in thesoftware case)

62 ASIC Experiments For this part of our experimentswe targeted a standard cell library in a 11 V 45 nm CMOStechnology We synthesized with Synopsys Design Compilervarious instances of our architecture varying the numberof filters and determined the relationship between speed-up(with respect to a software execution) and silicon area Thedesign when synthesized with the maximum optimizationeffort available can run at a clock period of 2 ns Thereforecompared to the FPGA results reported in the previoussection all the values of speed-up can be multiplied by afactor 8 (16 ns2 ns)

10 VLSI Design

0

1

2

3

4

5

6

7

8

9

10

5 10 15 20 25 30 35 40 45 50Number of FIR filters

Are

a (m

m2)

Figure 7 Postsynthesis ASIC area versus number of FIR filters fora 50 antennas system

100

150

200

250

300

350

400

450

0 1 2 3 4 5 6 7 8 9 10

Spee

d-up

ver

sus s

oftw

are

Area (mm2)

Figure 8 Speed-up versus area in an ASIC implementation of a 50antennas system

Figures 7 and 8 report silicon area as a function of thenumber of filters and speed-up as a function of the siliconarea respectively

The area curve in Figure 7 grows linearly with119873FIR witha slope of 01885mm2 which is in fact the area of a single FIRfilterThe area depends also on thememory sizeThememoryarea was obtained by considering the use of 100 memoryblocks (two for each antenna) each consisting of 256 bytesfor the signal samples (256 kB over 0457mm2) and a dual-portmemory blockwith 2048 bytes for theweights and delaysvalues (4 kB over 00153mm2) The memory area weighs onthe total area in a significant way only when a few filters areused because it does not depend on119873FIR

The speed-up that is the ratio of software execution timeand hardware execution time as a function of the silicon areashows a behavior similar to what we found for the FPGAcase but the acceleration factor is of course much biggerfrom around 50 to around 400 in the case of fifty antennasin the ASIC case This result would be notably improved incase a more scaled technology would be used [18] As we

0

2

4

6

8

10

12

14

16

5 10 15 20 25 30 35 40 45 50

Are

a and

spee

d-up

incr

ease

(rel

ativ

e uni

ts)

Number of FIR filters

Speed-up increaseArea increase

115

225

335

2 4 6 8 10

Zoom filter range 1 5

Figure 9 Trade-off between speed-up and area when changingthe number of filters in a 50 antennas ASIC implementation Cost-effective solutions are those with119873FIR le 6

utilize more and more area to make room for the filters thespeed-up first increases in a steep way then it bends becausethe sequential part of the algorithm starts dominating theexecution time

We report in Figure 9 two more curves the speed-upincrease and the area increase These two curves are simplythe normalized versions of the curves reported in Figures 7and 8 In practice we divided the area and the speed-up ofthe solution with119873FIR filters by the area and the speed-up ofthe solution with one filter respectivelyThe rationale behindthis choice is to evaluate the trade-off between performanceand cost In particular we judge that a parallel solutionis cost-effective if for a given percentage cost increase theperformance increases at least as much as the cost [19 20]From the inset in Figure 9 we are able to evaluate for whatvalue of 119873FIR the two curves cross each other The cost-effective solutions are those that use no more than six filters

7 Conclusions

Implementing in hardware the Mist-beamforming algorithmportion of the image reconstruction process determines aremarkable acceleration in terms of execution time From apractical point of view this acceleration implies a significantimprovement in the applicability of the system if we considerthe case of a 3D image reconstruction with a fifty-antennasystem we move from 19 hours of computation in thesoftware-only version to only about 7 minutes for an FPGAimplementation and less than one minute for an ASICimplementation even in the serial hardware configuration(ie with 119873FIR = 1) The FPGA clock frequency was set inour experiments to 625MHz which is the clock frequencyof the IP that communicates with the UWB external chipHowever our results show that the FPGA accelerator alonecan run up to 180MHz hence providing a further 288Xspeed-up with respect to the software version Therefore

VLSI Design 11

even the fully serial configuration gives us a noteworthyacceleration that justifies alone the validity of the hardwaredesign When using some degree of parallelism the timingperformance improves even more overcoming by far anypossible advantage of a software implementation We shouldconsider that Octave is an interpreted language and so thecode is executed by an interpreter program that interprets it atruntime Sometimes an interpreted code hasworse executiontime than an optimized compiled code written for instancein C We plan then to translate the Octave code in C asa next step Even though preliminary results suggest that a10X reduction of the CPU execution time can be achievedthis will not help reduce the BEAF part of the computationdown to an acceptable level (an acceleration of more than400X would be needed if an ASIC implementation with fiftyantennas is the reference point)Therefore the BEAF taskwillstill require a hardware accelerator An optimizedC codemayhelp instead to reduce the other tasks below the 1 s target

Thanks to the parameterized implementation our designcan be easily adapted to various configurations with differ-ent number of transmitters and receivers and for varioustechnology targets (FPGA or ASIC) with different availableresources This gives our design a high degree of flexibilityand scalability

Acknowledgment

This work was partially supported by the Italian Ministry ofEducation and Research under FIRB 2012 Project ldquoMICE-NEArdquo

References

[1] S C Hagness A Taflove and J E Bridges ldquoThree-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection design of an antenna-array elementrdquo IEEETransactions on Antennas and Propagation vol 47 no 5 pp783ndash791 1999

[2] X Li and S C Hagness ldquoA confocal microwave imagingalgorithm for breast cancer detectionrdquo IEEE Microwave andWireless Components Letters vol 11 no 3 pp 130ndash132 2001

[3] X Li S K Davis S C Hagness DW VanDerWeide and B DVan Veen ldquoMicrowave imaging via space-time beamformingexperimental investigation of tumor detection in multilayerbreast phantomsrdquo IEEE Transactions on Microwave Theory andTechniques vol 52 no 8 pp 1856ndash1865 2004

[4] E J Bond X Li S C Hagness and B D VanVeen ldquoMicrowaveimaging via space-time beamforming for early detection ofbreast cancerrdquo IEEE Transactions on Antennas and Propagationvol 51 no 8 pp 1690ndash1705 2003

[5] M R CasuM Crepaldi andMGraziano ldquoAVHDL-AMS sim-ulation environment for an UWB impulse radio transceiverrdquoIEEE Transactions on Circuits and Systems I vol 55 no 5 pp1368ndash1381 2008

[6] M Cutrupi M Crepaldi M R Casu and M Graziano ldquoAflexible UWB transmitter for breast cancer detection imagingsystemsrdquo in Proceedings of the Design Automation and Test inEurope Conference and Exhibition (DATE rsquo10) pp 1076ndash1081March 2010

[7] M R Casu M Graziano and M Zamboni ldquoA fully differentialdigital CMOS UWB pulse generatorrdquo Circuits Systems andSignal Processing vol 28 no 5 pp 649ndash664 2009

[8] M Crepaldi M R Casu M Graziano and M ZambonildquoA mixed-signal demodulator for a low-complexity IR-UWBreceiver methodology simulation and designrdquo Integration theVLSI Journal vol 42 no 1 pp 47ndash60 2009

[9] S C Hagness A Taflove and J E Bridges ldquoTwo-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection fixed-focus and antenna-array sensorsrdquo IEEETransactions onBiomedical Engineering vol 45 no 12 pp 1470ndash1474 1998

[10] S K Davis E J Bond X Li S C Hagness and B D Van VeenldquoMicrowave imaging via space-time beamforming for earlydetection of breast cancer beamformer design in the frequencydomainrdquo Journal of Electromagnetic Waves and Applicationsvol 17 no 2 pp 357ndash381 2003

[11] X Li E J Bond B D Van Veen and S C Hagness ldquoAnoverview of ultra-wideband microwave imaging via space-timebeamforming for early-stage breast-cancer detectionrdquo IEEEAntennas and Propagation Magazine vol 47 no 1 pp 19ndash342005

[12] P RosinganaDesign of aUWB Imaging System for Breast CancerDetection [MSc thesis] Politecnico di Torino

[13] J W Eaton Gnu Octave Manual NetworkTheory Ltd 2002[14] UWCEM ldquoNumerical Breast Phantom Repositoryrdquo http

uwcemecewisceduhomehtm[15] ModelSim Reference Manual v65e Mentor Graphics 2010[16] ISE Design Suite 13 Release Notes Guide Xilinx 2011[17] Design Compiler Reference Manual Optimization and Timing

Analysis Version D-201003 Synopsys 2010[18] A Pulimeno M Graziano and G Piccinini ldquoUdsm trends

comparison from technology roadmap to ultrasparc niagara2rdquoIEEE Transactions on Very Large Scale Integration (VLSI)Systems vol 20 no 7 pp 1341ndash1346 2012

[19] S V Tota M R Casu M R Roch L Rostagno and MZamboni ldquoMedea a hybrid shared-memorymessage-passingmultiprocessor NoC-based architecturerdquo in Proceedings of theDesign Automation and Test in Europe Conference and Exhibi-tion (DATE rsquo10) pp 45ndash50 March 2010

[20] M R Casu M R Roch S V Tota and M Zamboni ldquoANoC-based hybrid message-passingshared-memory approachto CMP designrdquo Microprocessors and Microsystems vol 35 no2 pp 261ndash273 2011

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 2: Research Article Hardware Acceleration of Beamforming in a ...downloads.hindawi.com/journals/vlsi/2013/861691.pdf · Hardware acceleration Microprocessor High speed link Backscattered

2 VLSI Design

Recon-structed3D image

Imagingunit

Anatomical sensorsarray of transmittersand receivers for scanof breast volume

Tumor

Safetissue

Front endinterface

Generated pulse

Backscattered

Hardwareacceleration MicroprocessorHigh speed

link

Backscattered signal S

signal T

contrast T S= 1 10

Figure 1 The UWB-based breast cancer imaging system The insetis a high-level view of the imaging unit which consists of a processorfor the execution of the less demanding tasks and a hardwareaccelerator for the most computationally-intensive processingtasks

[4 9] which get typically executed offline by general purposeprocessors For the system to be effectively and efficientlyexploited in screening the imaging unit should be ableto process UWB data almost in real-time that is with anegligible delay compared to the whole examination timeas perceived by the patient The imaging unit should beorganized as in the inset in Figure 1 A microprocessorexecutes the control tasks and the part of the processing thatdoes not require specialized hardware whereas a hardwareaccelerator connected to the processor through a high-speedbus performs all the timing-critical tasks

In this paper we report the results of our investigation onthe imaging unit These are our contributions

(i) We are the first to profile the various software tasks acomplete UWB imaging algorithm for breast cancerdetection consists of [10] Based on the profiling weare able to decide which part of the imaging softwareneeds to be accelerated by a specialized hardware Aswe illustrate in Section 3 Beamforming is the mostcritical task

(ii) We propose for the first time a complete implemen-tation of a hardware accelerator for the beamformingalgorithm in synthesizable VHDLThe architecture isdescribed in Section 4

(iii) We prove the functionality of the accelerator with aFPGA-based prototype as shown in Section 5

(iv) We evaluate performance and resource utilizationwhen the accelerator is implemented both in a XilinxVirtex-4 and in a standard-cell ASIC with a 45 nmCMOS technology These results are reported inSection 6

We end the paper in Section 7 with a recapitulation ofthe main results alongside a final discussion about ourachievements

2 The System

The overall UWB imaging system we aim at is sketched inFigure 1 An array of transceivers generates and receivesUWBsignals Figure 2(b) reports a typical UWB pulse The litera-ture suggests that the frequency range between 05GHz and10GHz should be covered to achieve both high contrast (lowfrequency range) and high resolution (high frequency range)A variety of signals can be used like Gaussian derivative ormodulated and modified Hermite polynomials (MMHP)

The contrast between the damaged (possibly a tumor)and healthy regions is due to an abrupt dielectric constantvariation and can be as high as 10 1

The entire breast volume is scanned by sequentiallyactivating the transceivers In the monostatic configuration[11] each antenna transmits and receives a scattered signalTransceivers activation and data collection are coordinatedby a control unit through a synchronous serial bus as shownin the block diagram in Figure 2(a) All sampled data arecollected and delivered to a memory

Samples in memory are processed by the imaging unitwhich in turn produces a 2D or 3D map of reflected energyvalues each related to a different breast volume location(voxel) The maps are finally sent to a personal computer anddisplayed

The total time necessary to execute a scan of the wholevolume is typically much less than one second [12] eventhough it may change depending on design choices likethe number of antennas 119873ANT the sampling frequency thenumber of samples acquired 119873

119904 the type of UWB signal

selected and the methods used to compensate noise Wethen assume 1 s as a reference value to which we comparethe time required by the imaging unit In particular weassess the need for hardware acceleration when the softwareprocessing largely exceeds this timing reference resulting inan unacceptable performance for a realistic clinical scenario

The imaging unit is then the focus of this work and itsdata-flowdiagram is shown in Figure 3 It performs the imagereconstruction in two steps according to the flow proposedin [9 11]Calibration (Skin Artifact Removal Algorithm (SKAR) Blockin Figure 3) This step removes unwanted scattering contri-butions determined by the large dielectric contrast betweenair and skin These artifacts are orders of magnitude greaterthan useful contribution and need to be removedBeamforming (Beamforming (BEAF) Block in Figure 3) Thisstep reconstructs a map of scattered energy in a 2D or 3Dvolume This is done by the microwave imaging via space-time (MIST) beamforming method which coherently shiftsin time the various received signals in order to focus theenergy analysis on a single voxel [4]

The following section details the Imaging Unit organiza-tion and provides a profiling of the time needed to executethe various subtasks of calibration and beamforming

VLSI Design 3

Memory

Bus

Control

RXTX

RXTX

RXTX

Imagingunit

Array oftrans-ceivers

interface

Front end

Front end

(a)

Am

plitu

de (V

)

1

05

0

minus05

minus1

Differentiated Gaussian pulse f = 6GHz 120591 = 110ps

Transmitted pulse

0 1 2 3 4 5Time (ns)

(b)

Figure 2 The system block diagram (b) an example of a UWB transmitted pulse central frequency 6 GHz total duration time 110 ps

Weights-BEAF

Weights for

Weights-SKAR

SKAR

Calibratedsamples

Energyvalues

BEAF

Access BEAF weights

Weights for

Samples

Calibrationskin

artifactremoval

Mistbeamforming

(Nvox times)

BEAF

SKAR

Figure 3 Data-flow diagram of the imaging unit computation tasks

3 The Imaging Unit

The imaging unit performs sequentially the following sub-tasks (see Figure 3)

Weights-SKAR The skin artifact removal algorithm (here-inafter SKAR) needs weight coefficients that are functions ofthe received signals and the position of the antennas Thesedata are then processed once for each scan of the breast togenerate the weights for SKAR

SKAR By removing the artifact of the large skin reflectionfrom the received signal samples as well as the contributionof the direct path between transmitter and receiver the SKARsub-task generates a set of calibrated samples

Weights-BEAF Beamforming (hereinafter BEAF) also needsweights coefficients but these only depend on the transmittedsignals and not on the received ones Therefore the weightsfor BEAF are computed once and for all at the beginningwhen the characteristics of the transmitted signal are selected

Access BEAF Weights In this phase BEAF weights are readfrom amemoryThis operation is repeated for each voxel We

separate it from the actual BEAF to permit a correct profilingof performanceBEAF This phase applies a space-time beamformer to thesignal samples received by each antenna a step that is donespecifically for each voxel To focus on a specific voxel thesignals are shifted in time filtered with the weights forBEAF and finally coherently added (see Section 4 for a moredetailed formulation of the algorithm)The weighting opera-tion enhances the contribution in each signal of the reflectiondetermined by the voxel under analysis and deemphasizesat the same time the scattering contributions of the othervoxels By iterating on the number of voxels BEAF producesan energy map for the whole volume

We first implemented all the subtasks in software usingOctave [13] We also built a finite-difference time-domain(FDTD) electromagnetic solver to simulate the transmissionof the signals and the reflections that occur in a realisticnumerical breast phantom [14]The output data of the FDTDsimulation are the received signals that we process with ourOctave routines according to the tasks outlined above

31 Profiling SKAR and BEAF have a very different impacton the overall execution timeThis difference depends on thedifference in complexity but mostly on the different numberof times they get executed SKAR is performed once andso is the computation of the SKAR weights whereas BEAFis repeated as many times as there are voxels In the largestmaps found in the breast phantom repository [14] the voxelscan be as many as 119873vox = 425 sdot 10

6 which clearly makesBEAF the bottleneck for the whole computationThe weightsrequired for the BEAF operation however do not depend onthe specific breast but only on some specific design choicessuch as the shape of the UWB pulse and the location oftransmitters and receivers The computation of these valuesis thus performed only once at the beginning of the breastexamination

4 VLSI Design

Table 1 Software runtime of SKAR and BEAF subtasks for a casewith119873vox = 425 sdot 10

6

Task RuntimeWeight-SKAR 164 sSKAR 025 s

Weights-BEAF 217msvoxel26 h

Access BEAF 732 120583svoxelweights 31 s

BEAF 488msvoxel58 h

We performed software simulations of the completesystem with an Intel CPU (Core i5-430M 3MB L3 226GHzclock frequency) 4GB DRAM and Linux Ubuntu 1110 withkernel 30 and instrumented our code to accurately determinethe CPU runtime The results are in Table 1 For the BEAFsub-tasks results are detailed for one voxel and for the wholebreast volume

For all sub-tasks with the exception of SKAR runtimeexceeds by far the 1 s bar However all tasks except weights-BEAF and BEAF are executed in less than one minute whichcan still be judged acceptable especially because a furtheroptimization of the code can significantly reduce this timeAs for weights-BEAF 26 h is certainly a long time but theweights could be computed offline once and for all andaccessed frommemory when needed As expected the BEAFalgorithm has the largest impact on timing performance dueto the large number of iterations Based on this analysis wedecided to focus on the hardware acceleration of the BEAFsub-task while all the other tasks are executed in softwareThe target is an imaging unit like the one sketched in Figure 1a processor is coupled with a dedicated hardware unit (ASIC)that accelerates the timing critical parts of the computation

4 A Hardware Implementation of BEAF

41 The MIST Beamforming (BEAF) Algorithm We report asuccinct description of the MIST beamforming and refer tothe literature for all the details [4 10]

Let us consider the backscatter contribution from a singlebreast location r

0 assuming that the signal received by the

119894th receiver contains only information from that source forsimplicity We denote the sampled version of that signal as119909119894[119899] and its Fourier transform as

119883119894(120596) = 119868 (120596) 119878

119894119894(r0 120596) 1 le 119894 le 119873ANT (1)

where 119868(120596) is the Fourier transform of the transmitted signal119894(119905)119873ANT is the number of antennas which is also the num-ber of transmitters and receivers 119878

119894119894(r0 120596) is an analytical

model of the monostatic frequency response associated withthe propagation from the 119894th antenna to r

0and back The

distance between each antenna and the point r0is not the

same for all the antennas and so the round-trip times of thesignals differWe then need to time align the signals by delay-ing themby an integer number of samples 119899

119894(r0) = 119899119886minus120591119894(r0)

In this equation 120591119894is the round-trip propagation delay in the

119894th channel for location r0and is obtained by dividing the

round-trip path length by the average propagation speed androunding to the nearest sample The term 119899

119886is the reference

time to which every signal is aligned and is chosen to be theworst-case delay over all channels and locations

119899119886ge round(max

119894r0120591119894(r0)) (2)

The signal is windowed to remove components before sample119899119886that are not important for energy estimationThe following

window function 119892[119899] is used

119892 [119899] = 1 119899 ge 119899

119886

0 otherwise(3)

The algorithm also allows to equalize path-length depen-dent dispersion and attenuation to interpolate any fractionaltime delays remaining after coarse time alignment and tobandpass-filter the signal These operations are done in timedomain using a bank of FIR filters each associated with avector of 119871 weights w

119894= [119908

1198940 1199081198941 119908

119894(119871minus1)] Increasing

the number of weights improves accuracy but increasescomplexity too and so must be carefully chosen

By summing the output of the filters we obtain

119911 [119899 r0] =

119873

sum

119894=1

119871

sum

119897=0

119908119894119897sdot 119892 [119899 minus 119897] sdot 119909

119894[119899 minus 119897 minus 119899

119894(r0)] (4)

where 119908119894119897is the 119897th weight of the 119894th filter and 119899

119894(r0) is the

delay value associated with the 119894th filterWe need to define another window

ℎ [119899 r0] =

1 119899ℎle 119899 le 119899

ℎ+ 119897ℎ

0 otherwise(5)

The interval between 119899ℎand 119899ℎ+119897ℎis thewindowof samples in

which we expect to find the received pulse Even if we knowthe shape of the transmitted signal the scattering from thetumor is frequency-dependent Therefore the backscatteredpulse is a distorted version of the transmitted pulse anddefining the exact timing window is not straightforwardThevalues of these parameters suggested in [4] are meant tomaximize the contrast for small lesions

The energy scattered by r0can be computed as the sum of

the square of the windowed values

119901 (r0) = sum

119899

1003816100381610038161003816119911 [119899] ℎ [119899 r0]1003816100381610038161003816

2 (6)

Many parameters in the previous equations have to beassigned a value before proceeding with the hardware designThe value of 119899

ℎcan be computed based on the time shift 119899

119886

which is easily computed as shown in (2) on the delay of theFIR filter 120591

0= (119871 minus 1)2 and on the first significant sample of

the transmitted pulse119873start119899ℎ= 119873Start + 119899119886 + 1205910 (7)

As for 119897ℎ an optimal value is suggested in [4] and is 119897

ℎ= 5

As for 119871 a reasonable value for a Gaussian pulse like the onein Figure 2 which we use for our experiments is 119871 gt 50 andwe chose 119871 = 55

VLSI Design 5

Table 2 Maximum difference of energy values with respect to idealcomputation for three approximation cases

Calibration Tumor Pos Max errorApprox-I Approx-II Approx-III

Low 2678 2328 025IDEAL Med 1464 494 007

Deep 1797 670 016Low 3360 2502 044

SKAR Med 1561 867 016Deep 3068 3446 048

42 Accuracy and Approximation One of the crucial designchoices is data representation We assume that receivedsignals are sampled and represented as 16-bit 2rsquos complementnumbers [12] These data are processed according to (4) and(6) in which multiplications and additions are the mostfrequent operations The number of bits for the intermediateand final results is one of the designerrsquos knobs who has tofind the best trade-off between precision and complexity Weexplored three possible approximation strategies which wediscuss in the followingApprox-IHere we keep data parallelism as low as possible Inthe first multiplication in (4) the 16-bit inputs are multipliedand a 32-bit output is produced Before the addition in thesame equation we truncate the multiplication results anddiscard the 16 least significant bits Then to avoid overflowthe additions are performed over 32 bits by extending theremaining 16 most significant bits This parallelism ensuresno overflow for up to 16 levels of addition Before the finalmultiplication in (6) the 16 least significant bits are discardedand the multiplication result is a 32-bit numberApprox-II In this case we do not discard the 16 leastsignificant bits in the square computation in (6) obtaining a64-bit resultThe 32 least significant bits of the result are thencut away This solution requires of course larger multipliersand so implies a larger complexityApprox-III In this case we keep the maximum level ofaccuracy and perform all the operations with 64 bits

We evaluated these three alternatives with a softwareimplementation of the algorithm in Octave before movingto the hardware design The algorithm performance dependson the relative position of any high-permittivity tissue andthe position of the antennas Therefore we compared theperformance considering an ideal breast model uniformlyfilled with fat tissue for three possible tumor positions in thebreast a small depth (low) an intermediate depth (med) anda considerable depth (deep)

We also evaluated the effect on the performance of theSKAR algorithm In particular we compared the perfor-mance of the actual SKAR with an ideal calibration in whichwe remove any artifact by artificially subtracting the responsewithout tumor from the response with tumor

The three maps in Figure 4 are obtained with SKARcalibration and with the three approximation strategies forthe case of deep position In the first case in particular the

low resolution of the internal computations creates a largeartificial clutter response and so a clear loss in accuracyNevertheless the tumor position is clearly and correctlyfound In the second case the removal of the least 32-bit ofthe energy result eliminates the clutter response but may endup in inaccuracies should a point of interest have a low energyand be therefore eliminated The third approximation is ofcourse the onewith the greatest accuracy In fact the results inthe thirdmap in Figure 4 are identical to those obtained froma simulation with double precision floating-point numbers

In Table 2 we report the maximum error of the energyvalues with respect to an ideal computation (with no approx-imation) The maximum error is calculated as follows

MaxError = max(10038161003816100381610038161003816100381610038161003816100381610038161003816

EMmax (EM)

minus

EMfp

max (EMfp)

10038161003816100381610038161003816100381610038161003816100381610038161003816

) (8)

EM represents the energy map without approximations(double precision) and EMfp is the energy map with approx-imation (fixed point) The error is determined by comparingenergy values normalized to the maximum values obtainedwith or without approximation respectively The maximumerror is consistent with the graphical results of the maps

To keep the maximum level of accuracy then we decideto maintain a 64-bit data representation This decision doesnot affect the design as much as one may think To store 64-bit energy values we need twice as much memory locationsthan we need for the 32-bit case However the memory sizeis mostly determined by the many weights we have to storeand both the number and the bit size of the weights do notdepend on the representation of the intermediate data

43 Architecture After having decided the parallelism forinternal data representation we moved on to the RTLhardware design We described our system in synthesizableVHDL Design-time parameters were used to keep the designflexible The block diagram in Figure 5 represents the BEAFarchitecture that we described in RTL In the following wedescribe the behavior of each componentWeights amp Delays Shift RegisterWhen the image reconstruc-tion process starts after the acquisition phase calibratedsamples weights for BEAF and delays values are availableeach in their own memories For each energy value an entirewindow of samples is processed in every FIR filter togetherwith the corresponding weights Before the filter computa-tion these weights are sequentially read from memory andloaded in a serial-in parallel-out shift register included in theWeights amp Delays shift register blockThe delay values are alsoread from memory at the same time They are sent to the 120591blocks to determine the correct time-shift amount necessaryto align the signals according to the beamforming algorithmdescribed in Section 41Sample Memory The sample values stored in this memoryare read one at a time for each channel and fed to the filterblocksFilters A bank of filters receives samples from the samplememory and delay values from the Weights ampDelays shift reg-ister In a fully parallel implementation the number of filters

6 VLSI Design

0

Beamforming approximated - SKAR calibration

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14

2e minus 08

4e minus 08

6e minus 08

8e minus 08

1e minus 07

i (cm)

j(c

m)

(a) First approximation type (Approx-I)

Beamforming approximated - SKAR calibration

0

5e minus 09

1e minus 08

15e minus 08

2e minus 08

25e minus 08

3e minus 08

35e minus 08

4e minus 08

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14i (cm)

j(c

m)

(b) Second approximation type (Approx-II)

Beamforming approximated - SKAR calibration

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14i (cm)

j(c

m)

0

1e minus 08

2e minus 08

3e minus 08

4e minus 08

5e minus 08

(c) Third approximation type (Approx-III)

Figure 4 Energy map of the breast after BEAF data approximation for ideal breast model and SKAR calibration A tumor is present in a deepposition

(119873FIR) corresponds to the number of transceiversantennasin the UWB probe As we discuss momentarily sequentialimplementations that use less filters than there are antennasare possible Each filter includes three main componentsThe120591 blocks are programmable delay lines which shift the inputsamples of the right time shift to ensure that all FIR inputs aretime alignedThe maximum delay is 119899

119886 TheWindow1 blocks

perform the 119892[119899] windowing functions they feed the FIRblocks only with samples that come after 119899

119886 and replace all

the samples that come before 119899119886with zero values In the FIR

blocks the samples are processed according to (4) Every FIRis a fully-parallel structure with 119871multipliers Eachmultiplieris fed with one of the weights which do not change duringthe whole voxel computation and with the samplesThe latterare sequentially introduced in the FIR blocks and then are

shifted from one multiplier to the next via an internal shiftregister The outputs of all the multipliers are then summedvia an adder tree producing one single FIR output Thiswhole computation block is pipelined with registers at theinput of every adder and multiplierSumTheoutputs of all the filters are summed together with aadder tree to obtain a single value which represents the resultof (4)Window2 The samples we are interested in are those inwhich the signal pulse occurs Window2 lets input data passunaltered if they correspond to the time interval specified by119899ℎand 119897ℎ Otherwise the output value is set to zero

Energy Computation This block computes the total voxelenergy according to (6) A multiplier and an accumulator are

VLSI Design 7

FIR

FIR

FIR

Filter

FSM

Energycomputation

Energymemory

Win

dow

2

Weightsand delays

shiftregister

Window1

Window1

Window1

120591

120591

120591

Sum shift register

SumSamplesmemory

Figure 5 RTL block diagram of the BEAF unit

used for this purpose Finally the energy of the current voxelgets written into the energy memoryFSM To coordinate the whole computation process a finitestate machine (FSM) is used A set of counters in it also keepstrack of the number of computation steps performed at anycontrol step

The filtering functions need not to be necessarily per-formed in parallel Two extreme solutions are possible Aparallel implementation in which all the data are processedconcurrently and a serial implementation in which onefilter is iteratively used In the serial implementation anaccumulator is inserted at the output of the filter since theFIR outputs must be summed to compute the energy value

Between the two extremes hybrid solutions are alsopossible We can instantiate a number of filters (and thus anumber of FIR blocks) 119873FIR that matches our timing andresource requirementsWe can use them iteratively to processthe data from119873ANT antennas In case of a hybrid solution ateach iteration the results of the partial sumof the FIR outputsare fed back to the sum shift register block in Figure 5 andadded coherently to the current value at a given iterationWhen all the antennas sets have been processed the finalresult is fed to the energy computation block The finitestate machine is designed in a parameterized way so as tocorrectly handle all possible solutions with different degreesof parallelism

It is worth mentioning that a choice of degree of par-allelism determines not just the number of FIR blocks butalso the number of 120591 Window1 and Window2 blocks In thefollowing whenwe refer to119873FIR as a parameter for the degreeof parallelism we mean that all these blocks are instantiated119873FIR times

We verified our parameterized VHDL code by means ofextensive logic-level simulations based on Mentor GraphicsrsquoModelsim [15] varying the degree of parallelismWe validatedthe output obtained with the RTL code against the resultsobtained with a software implementation and refined ourdesign until they perfectly matched

44 Timing Performance of BEAF In our architectureeach voxelrsquos energy is sequentially computed therefore the

beamforming execution time is

119879BEAF = 119873vox sdot 119879vox (9)

where 119873vox is the number of voxels and 119879vox is the timerequired to compute each voxel

The number of clock cycles necessary to execute thebeamforming step for each voxel depends on the number ofFIR filters in Figure 5 As we discussed above the spectrumof possible solutions spans from a fully sequential implemen-tation with a single FIR iteratively used for all the antennasto a fully parallel solution with as many filters as there areantennas The number of iterations depends on the ratiobetween number of antennas and number of filters

119873iter = lceil119873ANT119873FIR

rceil (10)

119879vox is the sum of three contributions

119879vox = 119879119908 + 119879comp + 119879119908119903 (11)

119879119908is the time necessary to write the FIR weights in the

memory once they are received from the link that connectsthe processor with the accelerator to read the weights fromthe memory and to store them in local registers within thefilters

119879119908= (2 (119871 + 1)119873ANT + 2119873iter) 119879CLK (12)

The first termwithin parenthesis in (12) is the time for writingand reading from the memory 119871 weights The second term isan overhead for enabling the memory and for asserting a fewcontrol signals 119879CLK is the clock period

119879comp is the actual computation time and is given by

119879comp = (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ lceillog2(119873FIR + 1)rceil )119873iter119879CLK

(13)

In (13) 119871wp is the time distance evaluated in time samplesthat corresponds to the worst path between two differentantennas and is necessary for the time alignment of thevarious signals Term lceillog

2(119871)rceil+119899

ℎ+119897ℎis the number of clock

cycles needed to load the samples in the filters with 119899ℎand 119897ℎ

representing the first sample in awindow of interest and 119897ℎthe

size of that window according to equation (5) Seven clockcycles are needed for window traversal and various registerwriting The term that depends on the logarithm of 119873FIR isrelated to the final addition in Figure 5

The final term in (11) 119879nrj is given by

119879nrj = 3119873iter (14)

and represents the time needed to write the voxelrsquos energyvalue back in the memory

We rewrite the final expression of 119879vox obtained bywrapping up all previous equations

119879vox = (2 (119871 + 1) sdot 119873ANT + 2 sdot 119873iter

+ (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ log2(119873FIR + 1)) sdot 119873iter + 3 sdot 119873iter) 119879CLK

(15)

8 VLSI Design

Figure 6 Experimental setup for the FPGA prototype the personalcomputer executes in software (Octave) part of the algorithm andallows energy maps visualization A USB link transfers data to andfrom the FPGA board where part of the beamforming algorithm isaccelerated

Equation (15) can be used to determine the overallexecution time of the beamforming algorithm in (9) giventhe total number of voxels and applies both to a 2D anda 3D map The accuracy of (15) was tested with VHDLsimulations usingMentor Graphicsrsquo Modelsim simulator andexperimentally verified with the FPGA prototype describedin the next section

5 Emulation with an FPGA-Based Prototype

We built an emulation prototype of the imaging unit Thisprototype relies on a general purpose microprocessor toperform the portions of processing that we decided to executein software while the hardware part of the BEAF algorithmwas deployed on an Xilinx Virtex-4 XC4VLX160-FF1513FPGA In particular wewere able to connect the Intel Core-i5of a laptop with the FPGA located in an AVNET Virtex-4 LXdevelopment boardTheCypress FX2USB 20 chip integratedin the development board was used to communicate with theFPGATheUSB chip behaves simply as a conduit between theUSB and the external data-processing logic The FX2 clockwas set to 48MHz the data parallelism to 8 bits and thepacket size to 512 bytes

The prototype has multiple purposes It serves as anaccelerator of the system-level simulation because it permitsa much faster execution of the part of the algorithm executedby the FPGA compared to a logic-level simulation Theoverall architecture with the software and the hardwarerunning concurrently is clearly a close approximation ofthe final system even though the processor and the digitalhardware will be different in the final target The RTL VHDLcode that gets synthesized on the FPGA is the same that wewill eventually implement in an ASIC therefore we can fullyverify it before fabrication Finally we can anticipate whatthe exact performance will be in terms of clock cycles Theonly difference in performance will be then determined bythe different clock frequency in FPGA and ASIC

Both the software and the hardware code are fullyinstrumented to permit a cycle accurate evaluation of timingperformance The USB communication time is also correctly

estimated and decoupled from the overall performance as amuch faster link will be used in a realistic setting like the onedepicted in Figure 1

The software running on the microprocessor is layeredand so the application part and the communication partare independent and portable The communication part isalso further partitioned in a set of low-level hardware-specific USB primitives (we use the libusb library availablein Linux) and a set of higher-level technology-independentAPIs for data organization and communication When thehardware target changes as it happens when moving fromthe prototype to the final target the developers replace onlythe low-level communication primitives and reuse both theapplication code and the communication APIs

Figure 6 is a photograph of the prototype and showsan energy map elaborated in the FPGA sent to the hostcomputer and displayed in the host screen

6 Experimental Results on FPGA and ASIC

We ran several experiments first with Xilinxrsquos ISE software[16] for the FPGA target of our prototype then with SynopsysDesign Compiler [17] for an ASIC target We were able todetermine the resource requirements and the actual timingperformance which depends on the actual clock frequency asfound by the FPGA and ASIC design CAD toolsThe variousbitstream files generated after FPGA synthesis mappingand place-and-route experiments were each tested in ourprototype They refer to a specific design case with nineantennas located around the numerical breast phantomnumber 012204 from the repository made available by theUniversity of Wisconsin [14] This phantom represents inthe database an average case in terms of type of tissue andnumber of voxelsWe have run finite-difference time-domain(FDTD) electromagnetic simulations to determine the signalreceived by the nine antennas and we fed the imagingunit with the received samples after 16-bit quantizationWe also extrapolated our performance results to a case offifty antennas which we assume it to be more or less thanmaximum number of antennas that can be placed around apatientrsquos breast

61 FPGA Experiments The FPGA resources the main partof which consists of look-up tables (LUT) and registers aregrouped in slices Depending on the complexity of the designand particularly on the number of filters 119873FIR a higher orlower amount of LUT and registers is used When mappingthe design the strategy of ISE does not consist in fillingthe slices one by one Instead to meet timing requirementsthe logic gates are spread over a high number of sliceswhich often results in a poor utilization Sometimes a slice isdeclared as occupied even if only one of its LUTs is usedThisfact limits the number of filters that we were able to put in ourVirtex-4 to a maximum of five FIRs The results of resourceutilization are reported in Table 3

We notice that the number of occupied slices increasesrapidly with the number of filters With six filters it exceedsthe number of available slices and we were not able to

VLSI Design 9

Table 3 Area allocation with different number of FIR filters

119873FIR Slice registers 4 input LUTs Occupied slices1 9081 (6) 8038 (5) 7188 (10)2 17347 (12) 20072 (14) 16115 (23)3 25551 (18) 44290 (32) 30899 (45)4 33851 (25) 67247 (49) 45139 (66)5 42055 (31) 91461 (67) 59936 (88)

Table 4 Hardware execution time over 14485 voxels with 9antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 128 s

119873FIR 119873ITER 119879exe [ms] Speed-up1 9 80639 15872 5 55336 23133 3 42581 30064 3 42650 30015 2 36249 35316 2 36249 35317 2 36249 35318 2 36295 35279 1 29871 4285

test that configuration in our prototype even though wecould estimate the performance in simulation The allocatedresources follow a trend that is almost linear with the numberof filters which is in accordance with our expectations

The timing performance depends proportionally on thenumber of voxels in the energy map This value is derivedfrom the 012204 phantom breast at 1mm resolution whichwe also used for the Octave simulations The 2D map gridconsists of 152 rows and 176 columns The Octave simulationcode was optimized to consider in the computation onlythose voxels which happened to be located in the ellipticalspace described by the antennas that is the area actuallyirradiated by the transmitters With such optimization thenumber of voxels is119873VOX = 14495

With such number of voxels the time required for thesoftware execution of the BEAF on the Intel i5 core of ourprototype (with Ubuntu 1110 kernel 30) was approximately119879SW = 128 sThe clock period for the FPGA implementationis 119879CLK = 16 ns which we use in our experiments with theprototype Table 4 reports the time required to execute theBEAF routine in hardware and the speed-up with respect to asoftware implementation Timing results with 119873FIR in range6ndash9 refer to logic simulations only because only five filterscould be placed in the FPGA However since all the resultsare consistent with (15) they are all fully accurate Table 5refers to a hypothetical case with fifty antennas and reportsonly simulated data which are also fully accurate comparedto a software execution that requires 119879SW = 71 s

In both Tables 4 and 5 we observe a similar trend Aswe expected the effect of changing the number of filtersmainly depends on the actual number of iterations 119873ITER =lceil119873ANT119873FIRrceil Increasing the number of filters boosts perfor-mance only when it allows to perform the computation with

Table 5 Hardware execution time over 14485 voxels with 50antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 71 s

119873FIR 119873ITER 119879exe [s] Speed-up1 50 448 15862 25 289 24533 17 238 29794 13 213 33315 10 194 36616 9 188 37867 8 181 39198 7 175 40599 6 168 421410ndash12 5 162 438113ndash16 4 156 456217ndash24 3 149 475625ndash49 2 143 497150 1 136 5205

less iterations For example in the case with nine antennaswe would not get a significant advantage from using eightfilters if we cannot use nine then we could as well use sixand still obtain basically the same performance we would getwith eight filters

The speed-up value grows rapidly for low values of119873FIRwhereas its marginal increase tends to diminish in the upperrangeThe reason is that the part of the algorithm that consistsof accessing the memory for reading samples and weightsremains sequential and cannot be made parallel

Considering the same phantom and iterating the per-formance evaluation in the 3D case with 1374953 voxels theresulting trends are similar To summarize in the worst caseof fifty antennas and 119879ITER = 50 the required time is 119879exe =42464 s (around 7 minutes) while in the best case of onlyone iteration we obtained119879exe = 12940 s (around 2minutes)With respect to a software implementation requiring 119879SW =

19 h the speedup is a factor 1611 and 5286 respectively inthe worst and best case

Referring to data discussed in Table 1 obtained for a worstcase breast phantom (in terms of number of voxels) theexecution time in the worst case (119879ITER = 50) would bearound 21 minutes (compared to the 58 h obtained in thesoftware case)

62 ASIC Experiments For this part of our experimentswe targeted a standard cell library in a 11 V 45 nm CMOStechnology We synthesized with Synopsys Design Compilervarious instances of our architecture varying the numberof filters and determined the relationship between speed-up(with respect to a software execution) and silicon area Thedesign when synthesized with the maximum optimizationeffort available can run at a clock period of 2 ns Thereforecompared to the FPGA results reported in the previoussection all the values of speed-up can be multiplied by afactor 8 (16 ns2 ns)

10 VLSI Design

0

1

2

3

4

5

6

7

8

9

10

5 10 15 20 25 30 35 40 45 50Number of FIR filters

Are

a (m

m2)

Figure 7 Postsynthesis ASIC area versus number of FIR filters fora 50 antennas system

100

150

200

250

300

350

400

450

0 1 2 3 4 5 6 7 8 9 10

Spee

d-up

ver

sus s

oftw

are

Area (mm2)

Figure 8 Speed-up versus area in an ASIC implementation of a 50antennas system

Figures 7 and 8 report silicon area as a function of thenumber of filters and speed-up as a function of the siliconarea respectively

The area curve in Figure 7 grows linearly with119873FIR witha slope of 01885mm2 which is in fact the area of a single FIRfilterThe area depends also on thememory sizeThememoryarea was obtained by considering the use of 100 memoryblocks (two for each antenna) each consisting of 256 bytesfor the signal samples (256 kB over 0457mm2) and a dual-portmemory blockwith 2048 bytes for theweights and delaysvalues (4 kB over 00153mm2) The memory area weighs onthe total area in a significant way only when a few filters areused because it does not depend on119873FIR

The speed-up that is the ratio of software execution timeand hardware execution time as a function of the silicon areashows a behavior similar to what we found for the FPGAcase but the acceleration factor is of course much biggerfrom around 50 to around 400 in the case of fifty antennasin the ASIC case This result would be notably improved incase a more scaled technology would be used [18] As we

0

2

4

6

8

10

12

14

16

5 10 15 20 25 30 35 40 45 50

Are

a and

spee

d-up

incr

ease

(rel

ativ

e uni

ts)

Number of FIR filters

Speed-up increaseArea increase

115

225

335

2 4 6 8 10

Zoom filter range 1 5

Figure 9 Trade-off between speed-up and area when changingthe number of filters in a 50 antennas ASIC implementation Cost-effective solutions are those with119873FIR le 6

utilize more and more area to make room for the filters thespeed-up first increases in a steep way then it bends becausethe sequential part of the algorithm starts dominating theexecution time

We report in Figure 9 two more curves the speed-upincrease and the area increase These two curves are simplythe normalized versions of the curves reported in Figures 7and 8 In practice we divided the area and the speed-up ofthe solution with119873FIR filters by the area and the speed-up ofthe solution with one filter respectivelyThe rationale behindthis choice is to evaluate the trade-off between performanceand cost In particular we judge that a parallel solutionis cost-effective if for a given percentage cost increase theperformance increases at least as much as the cost [19 20]From the inset in Figure 9 we are able to evaluate for whatvalue of 119873FIR the two curves cross each other The cost-effective solutions are those that use no more than six filters

7 Conclusions

Implementing in hardware the Mist-beamforming algorithmportion of the image reconstruction process determines aremarkable acceleration in terms of execution time From apractical point of view this acceleration implies a significantimprovement in the applicability of the system if we considerthe case of a 3D image reconstruction with a fifty-antennasystem we move from 19 hours of computation in thesoftware-only version to only about 7 minutes for an FPGAimplementation and less than one minute for an ASICimplementation even in the serial hardware configuration(ie with 119873FIR = 1) The FPGA clock frequency was set inour experiments to 625MHz which is the clock frequencyof the IP that communicates with the UWB external chipHowever our results show that the FPGA accelerator alonecan run up to 180MHz hence providing a further 288Xspeed-up with respect to the software version Therefore

VLSI Design 11

even the fully serial configuration gives us a noteworthyacceleration that justifies alone the validity of the hardwaredesign When using some degree of parallelism the timingperformance improves even more overcoming by far anypossible advantage of a software implementation We shouldconsider that Octave is an interpreted language and so thecode is executed by an interpreter program that interprets it atruntime Sometimes an interpreted code hasworse executiontime than an optimized compiled code written for instancein C We plan then to translate the Octave code in C asa next step Even though preliminary results suggest that a10X reduction of the CPU execution time can be achievedthis will not help reduce the BEAF part of the computationdown to an acceptable level (an acceleration of more than400X would be needed if an ASIC implementation with fiftyantennas is the reference point)Therefore the BEAF taskwillstill require a hardware accelerator An optimizedC codemayhelp instead to reduce the other tasks below the 1 s target

Thanks to the parameterized implementation our designcan be easily adapted to various configurations with differ-ent number of transmitters and receivers and for varioustechnology targets (FPGA or ASIC) with different availableresources This gives our design a high degree of flexibilityand scalability

Acknowledgment

This work was partially supported by the Italian Ministry ofEducation and Research under FIRB 2012 Project ldquoMICE-NEArdquo

References

[1] S C Hagness A Taflove and J E Bridges ldquoThree-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection design of an antenna-array elementrdquo IEEETransactions on Antennas and Propagation vol 47 no 5 pp783ndash791 1999

[2] X Li and S C Hagness ldquoA confocal microwave imagingalgorithm for breast cancer detectionrdquo IEEE Microwave andWireless Components Letters vol 11 no 3 pp 130ndash132 2001

[3] X Li S K Davis S C Hagness DW VanDerWeide and B DVan Veen ldquoMicrowave imaging via space-time beamformingexperimental investigation of tumor detection in multilayerbreast phantomsrdquo IEEE Transactions on Microwave Theory andTechniques vol 52 no 8 pp 1856ndash1865 2004

[4] E J Bond X Li S C Hagness and B D VanVeen ldquoMicrowaveimaging via space-time beamforming for early detection ofbreast cancerrdquo IEEE Transactions on Antennas and Propagationvol 51 no 8 pp 1690ndash1705 2003

[5] M R CasuM Crepaldi andMGraziano ldquoAVHDL-AMS sim-ulation environment for an UWB impulse radio transceiverrdquoIEEE Transactions on Circuits and Systems I vol 55 no 5 pp1368ndash1381 2008

[6] M Cutrupi M Crepaldi M R Casu and M Graziano ldquoAflexible UWB transmitter for breast cancer detection imagingsystemsrdquo in Proceedings of the Design Automation and Test inEurope Conference and Exhibition (DATE rsquo10) pp 1076ndash1081March 2010

[7] M R Casu M Graziano and M Zamboni ldquoA fully differentialdigital CMOS UWB pulse generatorrdquo Circuits Systems andSignal Processing vol 28 no 5 pp 649ndash664 2009

[8] M Crepaldi M R Casu M Graziano and M ZambonildquoA mixed-signal demodulator for a low-complexity IR-UWBreceiver methodology simulation and designrdquo Integration theVLSI Journal vol 42 no 1 pp 47ndash60 2009

[9] S C Hagness A Taflove and J E Bridges ldquoTwo-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection fixed-focus and antenna-array sensorsrdquo IEEETransactions onBiomedical Engineering vol 45 no 12 pp 1470ndash1474 1998

[10] S K Davis E J Bond X Li S C Hagness and B D Van VeenldquoMicrowave imaging via space-time beamforming for earlydetection of breast cancer beamformer design in the frequencydomainrdquo Journal of Electromagnetic Waves and Applicationsvol 17 no 2 pp 357ndash381 2003

[11] X Li E J Bond B D Van Veen and S C Hagness ldquoAnoverview of ultra-wideband microwave imaging via space-timebeamforming for early-stage breast-cancer detectionrdquo IEEEAntennas and Propagation Magazine vol 47 no 1 pp 19ndash342005

[12] P RosinganaDesign of aUWB Imaging System for Breast CancerDetection [MSc thesis] Politecnico di Torino

[13] J W Eaton Gnu Octave Manual NetworkTheory Ltd 2002[14] UWCEM ldquoNumerical Breast Phantom Repositoryrdquo http

uwcemecewisceduhomehtm[15] ModelSim Reference Manual v65e Mentor Graphics 2010[16] ISE Design Suite 13 Release Notes Guide Xilinx 2011[17] Design Compiler Reference Manual Optimization and Timing

Analysis Version D-201003 Synopsys 2010[18] A Pulimeno M Graziano and G Piccinini ldquoUdsm trends

comparison from technology roadmap to ultrasparc niagara2rdquoIEEE Transactions on Very Large Scale Integration (VLSI)Systems vol 20 no 7 pp 1341ndash1346 2012

[19] S V Tota M R Casu M R Roch L Rostagno and MZamboni ldquoMedea a hybrid shared-memorymessage-passingmultiprocessor NoC-based architecturerdquo in Proceedings of theDesign Automation and Test in Europe Conference and Exhibi-tion (DATE rsquo10) pp 45ndash50 March 2010

[20] M R Casu M R Roch S V Tota and M Zamboni ldquoANoC-based hybrid message-passingshared-memory approachto CMP designrdquo Microprocessors and Microsystems vol 35 no2 pp 261ndash273 2011

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 3: Research Article Hardware Acceleration of Beamforming in a ...downloads.hindawi.com/journals/vlsi/2013/861691.pdf · Hardware acceleration Microprocessor High speed link Backscattered

VLSI Design 3

Memory

Bus

Control

RXTX

RXTX

RXTX

Imagingunit

Array oftrans-ceivers

interface

Front end

Front end

(a)

Am

plitu

de (V

)

1

05

0

minus05

minus1

Differentiated Gaussian pulse f = 6GHz 120591 = 110ps

Transmitted pulse

0 1 2 3 4 5Time (ns)

(b)

Figure 2 The system block diagram (b) an example of a UWB transmitted pulse central frequency 6 GHz total duration time 110 ps

Weights-BEAF

Weights for

Weights-SKAR

SKAR

Calibratedsamples

Energyvalues

BEAF

Access BEAF weights

Weights for

Samples

Calibrationskin

artifactremoval

Mistbeamforming

(Nvox times)

BEAF

SKAR

Figure 3 Data-flow diagram of the imaging unit computation tasks

3 The Imaging Unit

The imaging unit performs sequentially the following sub-tasks (see Figure 3)

Weights-SKAR The skin artifact removal algorithm (here-inafter SKAR) needs weight coefficients that are functions ofthe received signals and the position of the antennas Thesedata are then processed once for each scan of the breast togenerate the weights for SKAR

SKAR By removing the artifact of the large skin reflectionfrom the received signal samples as well as the contributionof the direct path between transmitter and receiver the SKARsub-task generates a set of calibrated samples

Weights-BEAF Beamforming (hereinafter BEAF) also needsweights coefficients but these only depend on the transmittedsignals and not on the received ones Therefore the weightsfor BEAF are computed once and for all at the beginningwhen the characteristics of the transmitted signal are selected

Access BEAF Weights In this phase BEAF weights are readfrom amemoryThis operation is repeated for each voxel We

separate it from the actual BEAF to permit a correct profilingof performanceBEAF This phase applies a space-time beamformer to thesignal samples received by each antenna a step that is donespecifically for each voxel To focus on a specific voxel thesignals are shifted in time filtered with the weights forBEAF and finally coherently added (see Section 4 for a moredetailed formulation of the algorithm)The weighting opera-tion enhances the contribution in each signal of the reflectiondetermined by the voxel under analysis and deemphasizesat the same time the scattering contributions of the othervoxels By iterating on the number of voxels BEAF producesan energy map for the whole volume

We first implemented all the subtasks in software usingOctave [13] We also built a finite-difference time-domain(FDTD) electromagnetic solver to simulate the transmissionof the signals and the reflections that occur in a realisticnumerical breast phantom [14]The output data of the FDTDsimulation are the received signals that we process with ourOctave routines according to the tasks outlined above

31 Profiling SKAR and BEAF have a very different impacton the overall execution timeThis difference depends on thedifference in complexity but mostly on the different numberof times they get executed SKAR is performed once andso is the computation of the SKAR weights whereas BEAFis repeated as many times as there are voxels In the largestmaps found in the breast phantom repository [14] the voxelscan be as many as 119873vox = 425 sdot 10

6 which clearly makesBEAF the bottleneck for the whole computationThe weightsrequired for the BEAF operation however do not depend onthe specific breast but only on some specific design choicessuch as the shape of the UWB pulse and the location oftransmitters and receivers The computation of these valuesis thus performed only once at the beginning of the breastexamination

4 VLSI Design

Table 1 Software runtime of SKAR and BEAF subtasks for a casewith119873vox = 425 sdot 10

6

Task RuntimeWeight-SKAR 164 sSKAR 025 s

Weights-BEAF 217msvoxel26 h

Access BEAF 732 120583svoxelweights 31 s

BEAF 488msvoxel58 h

We performed software simulations of the completesystem with an Intel CPU (Core i5-430M 3MB L3 226GHzclock frequency) 4GB DRAM and Linux Ubuntu 1110 withkernel 30 and instrumented our code to accurately determinethe CPU runtime The results are in Table 1 For the BEAFsub-tasks results are detailed for one voxel and for the wholebreast volume

For all sub-tasks with the exception of SKAR runtimeexceeds by far the 1 s bar However all tasks except weights-BEAF and BEAF are executed in less than one minute whichcan still be judged acceptable especially because a furtheroptimization of the code can significantly reduce this timeAs for weights-BEAF 26 h is certainly a long time but theweights could be computed offline once and for all andaccessed frommemory when needed As expected the BEAFalgorithm has the largest impact on timing performance dueto the large number of iterations Based on this analysis wedecided to focus on the hardware acceleration of the BEAFsub-task while all the other tasks are executed in softwareThe target is an imaging unit like the one sketched in Figure 1a processor is coupled with a dedicated hardware unit (ASIC)that accelerates the timing critical parts of the computation

4 A Hardware Implementation of BEAF

41 The MIST Beamforming (BEAF) Algorithm We report asuccinct description of the MIST beamforming and refer tothe literature for all the details [4 10]

Let us consider the backscatter contribution from a singlebreast location r

0 assuming that the signal received by the

119894th receiver contains only information from that source forsimplicity We denote the sampled version of that signal as119909119894[119899] and its Fourier transform as

119883119894(120596) = 119868 (120596) 119878

119894119894(r0 120596) 1 le 119894 le 119873ANT (1)

where 119868(120596) is the Fourier transform of the transmitted signal119894(119905)119873ANT is the number of antennas which is also the num-ber of transmitters and receivers 119878

119894119894(r0 120596) is an analytical

model of the monostatic frequency response associated withthe propagation from the 119894th antenna to r

0and back The

distance between each antenna and the point r0is not the

same for all the antennas and so the round-trip times of thesignals differWe then need to time align the signals by delay-ing themby an integer number of samples 119899

119894(r0) = 119899119886minus120591119894(r0)

In this equation 120591119894is the round-trip propagation delay in the

119894th channel for location r0and is obtained by dividing the

round-trip path length by the average propagation speed androunding to the nearest sample The term 119899

119886is the reference

time to which every signal is aligned and is chosen to be theworst-case delay over all channels and locations

119899119886ge round(max

119894r0120591119894(r0)) (2)

The signal is windowed to remove components before sample119899119886that are not important for energy estimationThe following

window function 119892[119899] is used

119892 [119899] = 1 119899 ge 119899

119886

0 otherwise(3)

The algorithm also allows to equalize path-length depen-dent dispersion and attenuation to interpolate any fractionaltime delays remaining after coarse time alignment and tobandpass-filter the signal These operations are done in timedomain using a bank of FIR filters each associated with avector of 119871 weights w

119894= [119908

1198940 1199081198941 119908

119894(119871minus1)] Increasing

the number of weights improves accuracy but increasescomplexity too and so must be carefully chosen

By summing the output of the filters we obtain

119911 [119899 r0] =

119873

sum

119894=1

119871

sum

119897=0

119908119894119897sdot 119892 [119899 minus 119897] sdot 119909

119894[119899 minus 119897 minus 119899

119894(r0)] (4)

where 119908119894119897is the 119897th weight of the 119894th filter and 119899

119894(r0) is the

delay value associated with the 119894th filterWe need to define another window

ℎ [119899 r0] =

1 119899ℎle 119899 le 119899

ℎ+ 119897ℎ

0 otherwise(5)

The interval between 119899ℎand 119899ℎ+119897ℎis thewindowof samples in

which we expect to find the received pulse Even if we knowthe shape of the transmitted signal the scattering from thetumor is frequency-dependent Therefore the backscatteredpulse is a distorted version of the transmitted pulse anddefining the exact timing window is not straightforwardThevalues of these parameters suggested in [4] are meant tomaximize the contrast for small lesions

The energy scattered by r0can be computed as the sum of

the square of the windowed values

119901 (r0) = sum

119899

1003816100381610038161003816119911 [119899] ℎ [119899 r0]1003816100381610038161003816

2 (6)

Many parameters in the previous equations have to beassigned a value before proceeding with the hardware designThe value of 119899

ℎcan be computed based on the time shift 119899

119886

which is easily computed as shown in (2) on the delay of theFIR filter 120591

0= (119871 minus 1)2 and on the first significant sample of

the transmitted pulse119873start119899ℎ= 119873Start + 119899119886 + 1205910 (7)

As for 119897ℎ an optimal value is suggested in [4] and is 119897

ℎ= 5

As for 119871 a reasonable value for a Gaussian pulse like the onein Figure 2 which we use for our experiments is 119871 gt 50 andwe chose 119871 = 55

VLSI Design 5

Table 2 Maximum difference of energy values with respect to idealcomputation for three approximation cases

Calibration Tumor Pos Max errorApprox-I Approx-II Approx-III

Low 2678 2328 025IDEAL Med 1464 494 007

Deep 1797 670 016Low 3360 2502 044

SKAR Med 1561 867 016Deep 3068 3446 048

42 Accuracy and Approximation One of the crucial designchoices is data representation We assume that receivedsignals are sampled and represented as 16-bit 2rsquos complementnumbers [12] These data are processed according to (4) and(6) in which multiplications and additions are the mostfrequent operations The number of bits for the intermediateand final results is one of the designerrsquos knobs who has tofind the best trade-off between precision and complexity Weexplored three possible approximation strategies which wediscuss in the followingApprox-IHere we keep data parallelism as low as possible Inthe first multiplication in (4) the 16-bit inputs are multipliedand a 32-bit output is produced Before the addition in thesame equation we truncate the multiplication results anddiscard the 16 least significant bits Then to avoid overflowthe additions are performed over 32 bits by extending theremaining 16 most significant bits This parallelism ensuresno overflow for up to 16 levels of addition Before the finalmultiplication in (6) the 16 least significant bits are discardedand the multiplication result is a 32-bit numberApprox-II In this case we do not discard the 16 leastsignificant bits in the square computation in (6) obtaining a64-bit resultThe 32 least significant bits of the result are thencut away This solution requires of course larger multipliersand so implies a larger complexityApprox-III In this case we keep the maximum level ofaccuracy and perform all the operations with 64 bits

We evaluated these three alternatives with a softwareimplementation of the algorithm in Octave before movingto the hardware design The algorithm performance dependson the relative position of any high-permittivity tissue andthe position of the antennas Therefore we compared theperformance considering an ideal breast model uniformlyfilled with fat tissue for three possible tumor positions in thebreast a small depth (low) an intermediate depth (med) anda considerable depth (deep)

We also evaluated the effect on the performance of theSKAR algorithm In particular we compared the perfor-mance of the actual SKAR with an ideal calibration in whichwe remove any artifact by artificially subtracting the responsewithout tumor from the response with tumor

The three maps in Figure 4 are obtained with SKARcalibration and with the three approximation strategies forthe case of deep position In the first case in particular the

low resolution of the internal computations creates a largeartificial clutter response and so a clear loss in accuracyNevertheless the tumor position is clearly and correctlyfound In the second case the removal of the least 32-bit ofthe energy result eliminates the clutter response but may endup in inaccuracies should a point of interest have a low energyand be therefore eliminated The third approximation is ofcourse the onewith the greatest accuracy In fact the results inthe thirdmap in Figure 4 are identical to those obtained froma simulation with double precision floating-point numbers

In Table 2 we report the maximum error of the energyvalues with respect to an ideal computation (with no approx-imation) The maximum error is calculated as follows

MaxError = max(10038161003816100381610038161003816100381610038161003816100381610038161003816

EMmax (EM)

minus

EMfp

max (EMfp)

10038161003816100381610038161003816100381610038161003816100381610038161003816

) (8)

EM represents the energy map without approximations(double precision) and EMfp is the energy map with approx-imation (fixed point) The error is determined by comparingenergy values normalized to the maximum values obtainedwith or without approximation respectively The maximumerror is consistent with the graphical results of the maps

To keep the maximum level of accuracy then we decideto maintain a 64-bit data representation This decision doesnot affect the design as much as one may think To store 64-bit energy values we need twice as much memory locationsthan we need for the 32-bit case However the memory sizeis mostly determined by the many weights we have to storeand both the number and the bit size of the weights do notdepend on the representation of the intermediate data

43 Architecture After having decided the parallelism forinternal data representation we moved on to the RTLhardware design We described our system in synthesizableVHDL Design-time parameters were used to keep the designflexible The block diagram in Figure 5 represents the BEAFarchitecture that we described in RTL In the following wedescribe the behavior of each componentWeights amp Delays Shift RegisterWhen the image reconstruc-tion process starts after the acquisition phase calibratedsamples weights for BEAF and delays values are availableeach in their own memories For each energy value an entirewindow of samples is processed in every FIR filter togetherwith the corresponding weights Before the filter computa-tion these weights are sequentially read from memory andloaded in a serial-in parallel-out shift register included in theWeights amp Delays shift register blockThe delay values are alsoread from memory at the same time They are sent to the 120591blocks to determine the correct time-shift amount necessaryto align the signals according to the beamforming algorithmdescribed in Section 41Sample Memory The sample values stored in this memoryare read one at a time for each channel and fed to the filterblocksFilters A bank of filters receives samples from the samplememory and delay values from the Weights ampDelays shift reg-ister In a fully parallel implementation the number of filters

6 VLSI Design

0

Beamforming approximated - SKAR calibration

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14

2e minus 08

4e minus 08

6e minus 08

8e minus 08

1e minus 07

i (cm)

j(c

m)

(a) First approximation type (Approx-I)

Beamforming approximated - SKAR calibration

0

5e minus 09

1e minus 08

15e minus 08

2e minus 08

25e minus 08

3e minus 08

35e minus 08

4e minus 08

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14i (cm)

j(c

m)

(b) Second approximation type (Approx-II)

Beamforming approximated - SKAR calibration

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14i (cm)

j(c

m)

0

1e minus 08

2e minus 08

3e minus 08

4e minus 08

5e minus 08

(c) Third approximation type (Approx-III)

Figure 4 Energy map of the breast after BEAF data approximation for ideal breast model and SKAR calibration A tumor is present in a deepposition

(119873FIR) corresponds to the number of transceiversantennasin the UWB probe As we discuss momentarily sequentialimplementations that use less filters than there are antennasare possible Each filter includes three main componentsThe120591 blocks are programmable delay lines which shift the inputsamples of the right time shift to ensure that all FIR inputs aretime alignedThe maximum delay is 119899

119886 TheWindow1 blocks

perform the 119892[119899] windowing functions they feed the FIRblocks only with samples that come after 119899

119886 and replace all

the samples that come before 119899119886with zero values In the FIR

blocks the samples are processed according to (4) Every FIRis a fully-parallel structure with 119871multipliers Eachmultiplieris fed with one of the weights which do not change duringthe whole voxel computation and with the samplesThe latterare sequentially introduced in the FIR blocks and then are

shifted from one multiplier to the next via an internal shiftregister The outputs of all the multipliers are then summedvia an adder tree producing one single FIR output Thiswhole computation block is pipelined with registers at theinput of every adder and multiplierSumTheoutputs of all the filters are summed together with aadder tree to obtain a single value which represents the resultof (4)Window2 The samples we are interested in are those inwhich the signal pulse occurs Window2 lets input data passunaltered if they correspond to the time interval specified by119899ℎand 119897ℎ Otherwise the output value is set to zero

Energy Computation This block computes the total voxelenergy according to (6) A multiplier and an accumulator are

VLSI Design 7

FIR

FIR

FIR

Filter

FSM

Energycomputation

Energymemory

Win

dow

2

Weightsand delays

shiftregister

Window1

Window1

Window1

120591

120591

120591

Sum shift register

SumSamplesmemory

Figure 5 RTL block diagram of the BEAF unit

used for this purpose Finally the energy of the current voxelgets written into the energy memoryFSM To coordinate the whole computation process a finitestate machine (FSM) is used A set of counters in it also keepstrack of the number of computation steps performed at anycontrol step

The filtering functions need not to be necessarily per-formed in parallel Two extreme solutions are possible Aparallel implementation in which all the data are processedconcurrently and a serial implementation in which onefilter is iteratively used In the serial implementation anaccumulator is inserted at the output of the filter since theFIR outputs must be summed to compute the energy value

Between the two extremes hybrid solutions are alsopossible We can instantiate a number of filters (and thus anumber of FIR blocks) 119873FIR that matches our timing andresource requirementsWe can use them iteratively to processthe data from119873ANT antennas In case of a hybrid solution ateach iteration the results of the partial sumof the FIR outputsare fed back to the sum shift register block in Figure 5 andadded coherently to the current value at a given iterationWhen all the antennas sets have been processed the finalresult is fed to the energy computation block The finitestate machine is designed in a parameterized way so as tocorrectly handle all possible solutions with different degreesof parallelism

It is worth mentioning that a choice of degree of par-allelism determines not just the number of FIR blocks butalso the number of 120591 Window1 and Window2 blocks In thefollowing whenwe refer to119873FIR as a parameter for the degreeof parallelism we mean that all these blocks are instantiated119873FIR times

We verified our parameterized VHDL code by means ofextensive logic-level simulations based on Mentor GraphicsrsquoModelsim [15] varying the degree of parallelismWe validatedthe output obtained with the RTL code against the resultsobtained with a software implementation and refined ourdesign until they perfectly matched

44 Timing Performance of BEAF In our architectureeach voxelrsquos energy is sequentially computed therefore the

beamforming execution time is

119879BEAF = 119873vox sdot 119879vox (9)

where 119873vox is the number of voxels and 119879vox is the timerequired to compute each voxel

The number of clock cycles necessary to execute thebeamforming step for each voxel depends on the number ofFIR filters in Figure 5 As we discussed above the spectrumof possible solutions spans from a fully sequential implemen-tation with a single FIR iteratively used for all the antennasto a fully parallel solution with as many filters as there areantennas The number of iterations depends on the ratiobetween number of antennas and number of filters

119873iter = lceil119873ANT119873FIR

rceil (10)

119879vox is the sum of three contributions

119879vox = 119879119908 + 119879comp + 119879119908119903 (11)

119879119908is the time necessary to write the FIR weights in the

memory once they are received from the link that connectsthe processor with the accelerator to read the weights fromthe memory and to store them in local registers within thefilters

119879119908= (2 (119871 + 1)119873ANT + 2119873iter) 119879CLK (12)

The first termwithin parenthesis in (12) is the time for writingand reading from the memory 119871 weights The second term isan overhead for enabling the memory and for asserting a fewcontrol signals 119879CLK is the clock period

119879comp is the actual computation time and is given by

119879comp = (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ lceillog2(119873FIR + 1)rceil )119873iter119879CLK

(13)

In (13) 119871wp is the time distance evaluated in time samplesthat corresponds to the worst path between two differentantennas and is necessary for the time alignment of thevarious signals Term lceillog

2(119871)rceil+119899

ℎ+119897ℎis the number of clock

cycles needed to load the samples in the filters with 119899ℎand 119897ℎ

representing the first sample in awindow of interest and 119897ℎthe

size of that window according to equation (5) Seven clockcycles are needed for window traversal and various registerwriting The term that depends on the logarithm of 119873FIR isrelated to the final addition in Figure 5

The final term in (11) 119879nrj is given by

119879nrj = 3119873iter (14)

and represents the time needed to write the voxelrsquos energyvalue back in the memory

We rewrite the final expression of 119879vox obtained bywrapping up all previous equations

119879vox = (2 (119871 + 1) sdot 119873ANT + 2 sdot 119873iter

+ (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ log2(119873FIR + 1)) sdot 119873iter + 3 sdot 119873iter) 119879CLK

(15)

8 VLSI Design

Figure 6 Experimental setup for the FPGA prototype the personalcomputer executes in software (Octave) part of the algorithm andallows energy maps visualization A USB link transfers data to andfrom the FPGA board where part of the beamforming algorithm isaccelerated

Equation (15) can be used to determine the overallexecution time of the beamforming algorithm in (9) giventhe total number of voxels and applies both to a 2D anda 3D map The accuracy of (15) was tested with VHDLsimulations usingMentor Graphicsrsquo Modelsim simulator andexperimentally verified with the FPGA prototype describedin the next section

5 Emulation with an FPGA-Based Prototype

We built an emulation prototype of the imaging unit Thisprototype relies on a general purpose microprocessor toperform the portions of processing that we decided to executein software while the hardware part of the BEAF algorithmwas deployed on an Xilinx Virtex-4 XC4VLX160-FF1513FPGA In particular wewere able to connect the Intel Core-i5of a laptop with the FPGA located in an AVNET Virtex-4 LXdevelopment boardTheCypress FX2USB 20 chip integratedin the development board was used to communicate with theFPGATheUSB chip behaves simply as a conduit between theUSB and the external data-processing logic The FX2 clockwas set to 48MHz the data parallelism to 8 bits and thepacket size to 512 bytes

The prototype has multiple purposes It serves as anaccelerator of the system-level simulation because it permitsa much faster execution of the part of the algorithm executedby the FPGA compared to a logic-level simulation Theoverall architecture with the software and the hardwarerunning concurrently is clearly a close approximation ofthe final system even though the processor and the digitalhardware will be different in the final target The RTL VHDLcode that gets synthesized on the FPGA is the same that wewill eventually implement in an ASIC therefore we can fullyverify it before fabrication Finally we can anticipate whatthe exact performance will be in terms of clock cycles Theonly difference in performance will be then determined bythe different clock frequency in FPGA and ASIC

Both the software and the hardware code are fullyinstrumented to permit a cycle accurate evaluation of timingperformance The USB communication time is also correctly

estimated and decoupled from the overall performance as amuch faster link will be used in a realistic setting like the onedepicted in Figure 1

The software running on the microprocessor is layeredand so the application part and the communication partare independent and portable The communication part isalso further partitioned in a set of low-level hardware-specific USB primitives (we use the libusb library availablein Linux) and a set of higher-level technology-independentAPIs for data organization and communication When thehardware target changes as it happens when moving fromthe prototype to the final target the developers replace onlythe low-level communication primitives and reuse both theapplication code and the communication APIs

Figure 6 is a photograph of the prototype and showsan energy map elaborated in the FPGA sent to the hostcomputer and displayed in the host screen

6 Experimental Results on FPGA and ASIC

We ran several experiments first with Xilinxrsquos ISE software[16] for the FPGA target of our prototype then with SynopsysDesign Compiler [17] for an ASIC target We were able todetermine the resource requirements and the actual timingperformance which depends on the actual clock frequency asfound by the FPGA and ASIC design CAD toolsThe variousbitstream files generated after FPGA synthesis mappingand place-and-route experiments were each tested in ourprototype They refer to a specific design case with nineantennas located around the numerical breast phantomnumber 012204 from the repository made available by theUniversity of Wisconsin [14] This phantom represents inthe database an average case in terms of type of tissue andnumber of voxelsWe have run finite-difference time-domain(FDTD) electromagnetic simulations to determine the signalreceived by the nine antennas and we fed the imagingunit with the received samples after 16-bit quantizationWe also extrapolated our performance results to a case offifty antennas which we assume it to be more or less thanmaximum number of antennas that can be placed around apatientrsquos breast

61 FPGA Experiments The FPGA resources the main partof which consists of look-up tables (LUT) and registers aregrouped in slices Depending on the complexity of the designand particularly on the number of filters 119873FIR a higher orlower amount of LUT and registers is used When mappingthe design the strategy of ISE does not consist in fillingthe slices one by one Instead to meet timing requirementsthe logic gates are spread over a high number of sliceswhich often results in a poor utilization Sometimes a slice isdeclared as occupied even if only one of its LUTs is usedThisfact limits the number of filters that we were able to put in ourVirtex-4 to a maximum of five FIRs The results of resourceutilization are reported in Table 3

We notice that the number of occupied slices increasesrapidly with the number of filters With six filters it exceedsthe number of available slices and we were not able to

VLSI Design 9

Table 3 Area allocation with different number of FIR filters

119873FIR Slice registers 4 input LUTs Occupied slices1 9081 (6) 8038 (5) 7188 (10)2 17347 (12) 20072 (14) 16115 (23)3 25551 (18) 44290 (32) 30899 (45)4 33851 (25) 67247 (49) 45139 (66)5 42055 (31) 91461 (67) 59936 (88)

Table 4 Hardware execution time over 14485 voxels with 9antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 128 s

119873FIR 119873ITER 119879exe [ms] Speed-up1 9 80639 15872 5 55336 23133 3 42581 30064 3 42650 30015 2 36249 35316 2 36249 35317 2 36249 35318 2 36295 35279 1 29871 4285

test that configuration in our prototype even though wecould estimate the performance in simulation The allocatedresources follow a trend that is almost linear with the numberof filters which is in accordance with our expectations

The timing performance depends proportionally on thenumber of voxels in the energy map This value is derivedfrom the 012204 phantom breast at 1mm resolution whichwe also used for the Octave simulations The 2D map gridconsists of 152 rows and 176 columns The Octave simulationcode was optimized to consider in the computation onlythose voxels which happened to be located in the ellipticalspace described by the antennas that is the area actuallyirradiated by the transmitters With such optimization thenumber of voxels is119873VOX = 14495

With such number of voxels the time required for thesoftware execution of the BEAF on the Intel i5 core of ourprototype (with Ubuntu 1110 kernel 30) was approximately119879SW = 128 sThe clock period for the FPGA implementationis 119879CLK = 16 ns which we use in our experiments with theprototype Table 4 reports the time required to execute theBEAF routine in hardware and the speed-up with respect to asoftware implementation Timing results with 119873FIR in range6ndash9 refer to logic simulations only because only five filterscould be placed in the FPGA However since all the resultsare consistent with (15) they are all fully accurate Table 5refers to a hypothetical case with fifty antennas and reportsonly simulated data which are also fully accurate comparedto a software execution that requires 119879SW = 71 s

In both Tables 4 and 5 we observe a similar trend Aswe expected the effect of changing the number of filtersmainly depends on the actual number of iterations 119873ITER =lceil119873ANT119873FIRrceil Increasing the number of filters boosts perfor-mance only when it allows to perform the computation with

Table 5 Hardware execution time over 14485 voxels with 50antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 71 s

119873FIR 119873ITER 119879exe [s] Speed-up1 50 448 15862 25 289 24533 17 238 29794 13 213 33315 10 194 36616 9 188 37867 8 181 39198 7 175 40599 6 168 421410ndash12 5 162 438113ndash16 4 156 456217ndash24 3 149 475625ndash49 2 143 497150 1 136 5205

less iterations For example in the case with nine antennaswe would not get a significant advantage from using eightfilters if we cannot use nine then we could as well use sixand still obtain basically the same performance we would getwith eight filters

The speed-up value grows rapidly for low values of119873FIRwhereas its marginal increase tends to diminish in the upperrangeThe reason is that the part of the algorithm that consistsof accessing the memory for reading samples and weightsremains sequential and cannot be made parallel

Considering the same phantom and iterating the per-formance evaluation in the 3D case with 1374953 voxels theresulting trends are similar To summarize in the worst caseof fifty antennas and 119879ITER = 50 the required time is 119879exe =42464 s (around 7 minutes) while in the best case of onlyone iteration we obtained119879exe = 12940 s (around 2minutes)With respect to a software implementation requiring 119879SW =

19 h the speedup is a factor 1611 and 5286 respectively inthe worst and best case

Referring to data discussed in Table 1 obtained for a worstcase breast phantom (in terms of number of voxels) theexecution time in the worst case (119879ITER = 50) would bearound 21 minutes (compared to the 58 h obtained in thesoftware case)

62 ASIC Experiments For this part of our experimentswe targeted a standard cell library in a 11 V 45 nm CMOStechnology We synthesized with Synopsys Design Compilervarious instances of our architecture varying the numberof filters and determined the relationship between speed-up(with respect to a software execution) and silicon area Thedesign when synthesized with the maximum optimizationeffort available can run at a clock period of 2 ns Thereforecompared to the FPGA results reported in the previoussection all the values of speed-up can be multiplied by afactor 8 (16 ns2 ns)

10 VLSI Design

0

1

2

3

4

5

6

7

8

9

10

5 10 15 20 25 30 35 40 45 50Number of FIR filters

Are

a (m

m2)

Figure 7 Postsynthesis ASIC area versus number of FIR filters fora 50 antennas system

100

150

200

250

300

350

400

450

0 1 2 3 4 5 6 7 8 9 10

Spee

d-up

ver

sus s

oftw

are

Area (mm2)

Figure 8 Speed-up versus area in an ASIC implementation of a 50antennas system

Figures 7 and 8 report silicon area as a function of thenumber of filters and speed-up as a function of the siliconarea respectively

The area curve in Figure 7 grows linearly with119873FIR witha slope of 01885mm2 which is in fact the area of a single FIRfilterThe area depends also on thememory sizeThememoryarea was obtained by considering the use of 100 memoryblocks (two for each antenna) each consisting of 256 bytesfor the signal samples (256 kB over 0457mm2) and a dual-portmemory blockwith 2048 bytes for theweights and delaysvalues (4 kB over 00153mm2) The memory area weighs onthe total area in a significant way only when a few filters areused because it does not depend on119873FIR

The speed-up that is the ratio of software execution timeand hardware execution time as a function of the silicon areashows a behavior similar to what we found for the FPGAcase but the acceleration factor is of course much biggerfrom around 50 to around 400 in the case of fifty antennasin the ASIC case This result would be notably improved incase a more scaled technology would be used [18] As we

0

2

4

6

8

10

12

14

16

5 10 15 20 25 30 35 40 45 50

Are

a and

spee

d-up

incr

ease

(rel

ativ

e uni

ts)

Number of FIR filters

Speed-up increaseArea increase

115

225

335

2 4 6 8 10

Zoom filter range 1 5

Figure 9 Trade-off between speed-up and area when changingthe number of filters in a 50 antennas ASIC implementation Cost-effective solutions are those with119873FIR le 6

utilize more and more area to make room for the filters thespeed-up first increases in a steep way then it bends becausethe sequential part of the algorithm starts dominating theexecution time

We report in Figure 9 two more curves the speed-upincrease and the area increase These two curves are simplythe normalized versions of the curves reported in Figures 7and 8 In practice we divided the area and the speed-up ofthe solution with119873FIR filters by the area and the speed-up ofthe solution with one filter respectivelyThe rationale behindthis choice is to evaluate the trade-off between performanceand cost In particular we judge that a parallel solutionis cost-effective if for a given percentage cost increase theperformance increases at least as much as the cost [19 20]From the inset in Figure 9 we are able to evaluate for whatvalue of 119873FIR the two curves cross each other The cost-effective solutions are those that use no more than six filters

7 Conclusions

Implementing in hardware the Mist-beamforming algorithmportion of the image reconstruction process determines aremarkable acceleration in terms of execution time From apractical point of view this acceleration implies a significantimprovement in the applicability of the system if we considerthe case of a 3D image reconstruction with a fifty-antennasystem we move from 19 hours of computation in thesoftware-only version to only about 7 minutes for an FPGAimplementation and less than one minute for an ASICimplementation even in the serial hardware configuration(ie with 119873FIR = 1) The FPGA clock frequency was set inour experiments to 625MHz which is the clock frequencyof the IP that communicates with the UWB external chipHowever our results show that the FPGA accelerator alonecan run up to 180MHz hence providing a further 288Xspeed-up with respect to the software version Therefore

VLSI Design 11

even the fully serial configuration gives us a noteworthyacceleration that justifies alone the validity of the hardwaredesign When using some degree of parallelism the timingperformance improves even more overcoming by far anypossible advantage of a software implementation We shouldconsider that Octave is an interpreted language and so thecode is executed by an interpreter program that interprets it atruntime Sometimes an interpreted code hasworse executiontime than an optimized compiled code written for instancein C We plan then to translate the Octave code in C asa next step Even though preliminary results suggest that a10X reduction of the CPU execution time can be achievedthis will not help reduce the BEAF part of the computationdown to an acceptable level (an acceleration of more than400X would be needed if an ASIC implementation with fiftyantennas is the reference point)Therefore the BEAF taskwillstill require a hardware accelerator An optimizedC codemayhelp instead to reduce the other tasks below the 1 s target

Thanks to the parameterized implementation our designcan be easily adapted to various configurations with differ-ent number of transmitters and receivers and for varioustechnology targets (FPGA or ASIC) with different availableresources This gives our design a high degree of flexibilityand scalability

Acknowledgment

This work was partially supported by the Italian Ministry ofEducation and Research under FIRB 2012 Project ldquoMICE-NEArdquo

References

[1] S C Hagness A Taflove and J E Bridges ldquoThree-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection design of an antenna-array elementrdquo IEEETransactions on Antennas and Propagation vol 47 no 5 pp783ndash791 1999

[2] X Li and S C Hagness ldquoA confocal microwave imagingalgorithm for breast cancer detectionrdquo IEEE Microwave andWireless Components Letters vol 11 no 3 pp 130ndash132 2001

[3] X Li S K Davis S C Hagness DW VanDerWeide and B DVan Veen ldquoMicrowave imaging via space-time beamformingexperimental investigation of tumor detection in multilayerbreast phantomsrdquo IEEE Transactions on Microwave Theory andTechniques vol 52 no 8 pp 1856ndash1865 2004

[4] E J Bond X Li S C Hagness and B D VanVeen ldquoMicrowaveimaging via space-time beamforming for early detection ofbreast cancerrdquo IEEE Transactions on Antennas and Propagationvol 51 no 8 pp 1690ndash1705 2003

[5] M R CasuM Crepaldi andMGraziano ldquoAVHDL-AMS sim-ulation environment for an UWB impulse radio transceiverrdquoIEEE Transactions on Circuits and Systems I vol 55 no 5 pp1368ndash1381 2008

[6] M Cutrupi M Crepaldi M R Casu and M Graziano ldquoAflexible UWB transmitter for breast cancer detection imagingsystemsrdquo in Proceedings of the Design Automation and Test inEurope Conference and Exhibition (DATE rsquo10) pp 1076ndash1081March 2010

[7] M R Casu M Graziano and M Zamboni ldquoA fully differentialdigital CMOS UWB pulse generatorrdquo Circuits Systems andSignal Processing vol 28 no 5 pp 649ndash664 2009

[8] M Crepaldi M R Casu M Graziano and M ZambonildquoA mixed-signal demodulator for a low-complexity IR-UWBreceiver methodology simulation and designrdquo Integration theVLSI Journal vol 42 no 1 pp 47ndash60 2009

[9] S C Hagness A Taflove and J E Bridges ldquoTwo-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection fixed-focus and antenna-array sensorsrdquo IEEETransactions onBiomedical Engineering vol 45 no 12 pp 1470ndash1474 1998

[10] S K Davis E J Bond X Li S C Hagness and B D Van VeenldquoMicrowave imaging via space-time beamforming for earlydetection of breast cancer beamformer design in the frequencydomainrdquo Journal of Electromagnetic Waves and Applicationsvol 17 no 2 pp 357ndash381 2003

[11] X Li E J Bond B D Van Veen and S C Hagness ldquoAnoverview of ultra-wideband microwave imaging via space-timebeamforming for early-stage breast-cancer detectionrdquo IEEEAntennas and Propagation Magazine vol 47 no 1 pp 19ndash342005

[12] P RosinganaDesign of aUWB Imaging System for Breast CancerDetection [MSc thesis] Politecnico di Torino

[13] J W Eaton Gnu Octave Manual NetworkTheory Ltd 2002[14] UWCEM ldquoNumerical Breast Phantom Repositoryrdquo http

uwcemecewisceduhomehtm[15] ModelSim Reference Manual v65e Mentor Graphics 2010[16] ISE Design Suite 13 Release Notes Guide Xilinx 2011[17] Design Compiler Reference Manual Optimization and Timing

Analysis Version D-201003 Synopsys 2010[18] A Pulimeno M Graziano and G Piccinini ldquoUdsm trends

comparison from technology roadmap to ultrasparc niagara2rdquoIEEE Transactions on Very Large Scale Integration (VLSI)Systems vol 20 no 7 pp 1341ndash1346 2012

[19] S V Tota M R Casu M R Roch L Rostagno and MZamboni ldquoMedea a hybrid shared-memorymessage-passingmultiprocessor NoC-based architecturerdquo in Proceedings of theDesign Automation and Test in Europe Conference and Exhibi-tion (DATE rsquo10) pp 45ndash50 March 2010

[20] M R Casu M R Roch S V Tota and M Zamboni ldquoANoC-based hybrid message-passingshared-memory approachto CMP designrdquo Microprocessors and Microsystems vol 35 no2 pp 261ndash273 2011

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 4: Research Article Hardware Acceleration of Beamforming in a ...downloads.hindawi.com/journals/vlsi/2013/861691.pdf · Hardware acceleration Microprocessor High speed link Backscattered

4 VLSI Design

Table 1 Software runtime of SKAR and BEAF subtasks for a casewith119873vox = 425 sdot 10

6

Task RuntimeWeight-SKAR 164 sSKAR 025 s

Weights-BEAF 217msvoxel26 h

Access BEAF 732 120583svoxelweights 31 s

BEAF 488msvoxel58 h

We performed software simulations of the completesystem with an Intel CPU (Core i5-430M 3MB L3 226GHzclock frequency) 4GB DRAM and Linux Ubuntu 1110 withkernel 30 and instrumented our code to accurately determinethe CPU runtime The results are in Table 1 For the BEAFsub-tasks results are detailed for one voxel and for the wholebreast volume

For all sub-tasks with the exception of SKAR runtimeexceeds by far the 1 s bar However all tasks except weights-BEAF and BEAF are executed in less than one minute whichcan still be judged acceptable especially because a furtheroptimization of the code can significantly reduce this timeAs for weights-BEAF 26 h is certainly a long time but theweights could be computed offline once and for all andaccessed frommemory when needed As expected the BEAFalgorithm has the largest impact on timing performance dueto the large number of iterations Based on this analysis wedecided to focus on the hardware acceleration of the BEAFsub-task while all the other tasks are executed in softwareThe target is an imaging unit like the one sketched in Figure 1a processor is coupled with a dedicated hardware unit (ASIC)that accelerates the timing critical parts of the computation

4 A Hardware Implementation of BEAF

41 The MIST Beamforming (BEAF) Algorithm We report asuccinct description of the MIST beamforming and refer tothe literature for all the details [4 10]

Let us consider the backscatter contribution from a singlebreast location r

0 assuming that the signal received by the

119894th receiver contains only information from that source forsimplicity We denote the sampled version of that signal as119909119894[119899] and its Fourier transform as

119883119894(120596) = 119868 (120596) 119878

119894119894(r0 120596) 1 le 119894 le 119873ANT (1)

where 119868(120596) is the Fourier transform of the transmitted signal119894(119905)119873ANT is the number of antennas which is also the num-ber of transmitters and receivers 119878

119894119894(r0 120596) is an analytical

model of the monostatic frequency response associated withthe propagation from the 119894th antenna to r

0and back The

distance between each antenna and the point r0is not the

same for all the antennas and so the round-trip times of thesignals differWe then need to time align the signals by delay-ing themby an integer number of samples 119899

119894(r0) = 119899119886minus120591119894(r0)

In this equation 120591119894is the round-trip propagation delay in the

119894th channel for location r0and is obtained by dividing the

round-trip path length by the average propagation speed androunding to the nearest sample The term 119899

119886is the reference

time to which every signal is aligned and is chosen to be theworst-case delay over all channels and locations

119899119886ge round(max

119894r0120591119894(r0)) (2)

The signal is windowed to remove components before sample119899119886that are not important for energy estimationThe following

window function 119892[119899] is used

119892 [119899] = 1 119899 ge 119899

119886

0 otherwise(3)

The algorithm also allows to equalize path-length depen-dent dispersion and attenuation to interpolate any fractionaltime delays remaining after coarse time alignment and tobandpass-filter the signal These operations are done in timedomain using a bank of FIR filters each associated with avector of 119871 weights w

119894= [119908

1198940 1199081198941 119908

119894(119871minus1)] Increasing

the number of weights improves accuracy but increasescomplexity too and so must be carefully chosen

By summing the output of the filters we obtain

119911 [119899 r0] =

119873

sum

119894=1

119871

sum

119897=0

119908119894119897sdot 119892 [119899 minus 119897] sdot 119909

119894[119899 minus 119897 minus 119899

119894(r0)] (4)

where 119908119894119897is the 119897th weight of the 119894th filter and 119899

119894(r0) is the

delay value associated with the 119894th filterWe need to define another window

ℎ [119899 r0] =

1 119899ℎle 119899 le 119899

ℎ+ 119897ℎ

0 otherwise(5)

The interval between 119899ℎand 119899ℎ+119897ℎis thewindowof samples in

which we expect to find the received pulse Even if we knowthe shape of the transmitted signal the scattering from thetumor is frequency-dependent Therefore the backscatteredpulse is a distorted version of the transmitted pulse anddefining the exact timing window is not straightforwardThevalues of these parameters suggested in [4] are meant tomaximize the contrast for small lesions

The energy scattered by r0can be computed as the sum of

the square of the windowed values

119901 (r0) = sum

119899

1003816100381610038161003816119911 [119899] ℎ [119899 r0]1003816100381610038161003816

2 (6)

Many parameters in the previous equations have to beassigned a value before proceeding with the hardware designThe value of 119899

ℎcan be computed based on the time shift 119899

119886

which is easily computed as shown in (2) on the delay of theFIR filter 120591

0= (119871 minus 1)2 and on the first significant sample of

the transmitted pulse119873start119899ℎ= 119873Start + 119899119886 + 1205910 (7)

As for 119897ℎ an optimal value is suggested in [4] and is 119897

ℎ= 5

As for 119871 a reasonable value for a Gaussian pulse like the onein Figure 2 which we use for our experiments is 119871 gt 50 andwe chose 119871 = 55

VLSI Design 5

Table 2 Maximum difference of energy values with respect to idealcomputation for three approximation cases

Calibration Tumor Pos Max errorApprox-I Approx-II Approx-III

Low 2678 2328 025IDEAL Med 1464 494 007

Deep 1797 670 016Low 3360 2502 044

SKAR Med 1561 867 016Deep 3068 3446 048

42 Accuracy and Approximation One of the crucial designchoices is data representation We assume that receivedsignals are sampled and represented as 16-bit 2rsquos complementnumbers [12] These data are processed according to (4) and(6) in which multiplications and additions are the mostfrequent operations The number of bits for the intermediateand final results is one of the designerrsquos knobs who has tofind the best trade-off between precision and complexity Weexplored three possible approximation strategies which wediscuss in the followingApprox-IHere we keep data parallelism as low as possible Inthe first multiplication in (4) the 16-bit inputs are multipliedand a 32-bit output is produced Before the addition in thesame equation we truncate the multiplication results anddiscard the 16 least significant bits Then to avoid overflowthe additions are performed over 32 bits by extending theremaining 16 most significant bits This parallelism ensuresno overflow for up to 16 levels of addition Before the finalmultiplication in (6) the 16 least significant bits are discardedand the multiplication result is a 32-bit numberApprox-II In this case we do not discard the 16 leastsignificant bits in the square computation in (6) obtaining a64-bit resultThe 32 least significant bits of the result are thencut away This solution requires of course larger multipliersand so implies a larger complexityApprox-III In this case we keep the maximum level ofaccuracy and perform all the operations with 64 bits

We evaluated these three alternatives with a softwareimplementation of the algorithm in Octave before movingto the hardware design The algorithm performance dependson the relative position of any high-permittivity tissue andthe position of the antennas Therefore we compared theperformance considering an ideal breast model uniformlyfilled with fat tissue for three possible tumor positions in thebreast a small depth (low) an intermediate depth (med) anda considerable depth (deep)

We also evaluated the effect on the performance of theSKAR algorithm In particular we compared the perfor-mance of the actual SKAR with an ideal calibration in whichwe remove any artifact by artificially subtracting the responsewithout tumor from the response with tumor

The three maps in Figure 4 are obtained with SKARcalibration and with the three approximation strategies forthe case of deep position In the first case in particular the

low resolution of the internal computations creates a largeartificial clutter response and so a clear loss in accuracyNevertheless the tumor position is clearly and correctlyfound In the second case the removal of the least 32-bit ofthe energy result eliminates the clutter response but may endup in inaccuracies should a point of interest have a low energyand be therefore eliminated The third approximation is ofcourse the onewith the greatest accuracy In fact the results inthe thirdmap in Figure 4 are identical to those obtained froma simulation with double precision floating-point numbers

In Table 2 we report the maximum error of the energyvalues with respect to an ideal computation (with no approx-imation) The maximum error is calculated as follows

MaxError = max(10038161003816100381610038161003816100381610038161003816100381610038161003816

EMmax (EM)

minus

EMfp

max (EMfp)

10038161003816100381610038161003816100381610038161003816100381610038161003816

) (8)

EM represents the energy map without approximations(double precision) and EMfp is the energy map with approx-imation (fixed point) The error is determined by comparingenergy values normalized to the maximum values obtainedwith or without approximation respectively The maximumerror is consistent with the graphical results of the maps

To keep the maximum level of accuracy then we decideto maintain a 64-bit data representation This decision doesnot affect the design as much as one may think To store 64-bit energy values we need twice as much memory locationsthan we need for the 32-bit case However the memory sizeis mostly determined by the many weights we have to storeand both the number and the bit size of the weights do notdepend on the representation of the intermediate data

43 Architecture After having decided the parallelism forinternal data representation we moved on to the RTLhardware design We described our system in synthesizableVHDL Design-time parameters were used to keep the designflexible The block diagram in Figure 5 represents the BEAFarchitecture that we described in RTL In the following wedescribe the behavior of each componentWeights amp Delays Shift RegisterWhen the image reconstruc-tion process starts after the acquisition phase calibratedsamples weights for BEAF and delays values are availableeach in their own memories For each energy value an entirewindow of samples is processed in every FIR filter togetherwith the corresponding weights Before the filter computa-tion these weights are sequentially read from memory andloaded in a serial-in parallel-out shift register included in theWeights amp Delays shift register blockThe delay values are alsoread from memory at the same time They are sent to the 120591blocks to determine the correct time-shift amount necessaryto align the signals according to the beamforming algorithmdescribed in Section 41Sample Memory The sample values stored in this memoryare read one at a time for each channel and fed to the filterblocksFilters A bank of filters receives samples from the samplememory and delay values from the Weights ampDelays shift reg-ister In a fully parallel implementation the number of filters

6 VLSI Design

0

Beamforming approximated - SKAR calibration

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14

2e minus 08

4e minus 08

6e minus 08

8e minus 08

1e minus 07

i (cm)

j(c

m)

(a) First approximation type (Approx-I)

Beamforming approximated - SKAR calibration

0

5e minus 09

1e minus 08

15e minus 08

2e minus 08

25e minus 08

3e minus 08

35e minus 08

4e minus 08

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14i (cm)

j(c

m)

(b) Second approximation type (Approx-II)

Beamforming approximated - SKAR calibration

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14i (cm)

j(c

m)

0

1e minus 08

2e minus 08

3e minus 08

4e minus 08

5e minus 08

(c) Third approximation type (Approx-III)

Figure 4 Energy map of the breast after BEAF data approximation for ideal breast model and SKAR calibration A tumor is present in a deepposition

(119873FIR) corresponds to the number of transceiversantennasin the UWB probe As we discuss momentarily sequentialimplementations that use less filters than there are antennasare possible Each filter includes three main componentsThe120591 blocks are programmable delay lines which shift the inputsamples of the right time shift to ensure that all FIR inputs aretime alignedThe maximum delay is 119899

119886 TheWindow1 blocks

perform the 119892[119899] windowing functions they feed the FIRblocks only with samples that come after 119899

119886 and replace all

the samples that come before 119899119886with zero values In the FIR

blocks the samples are processed according to (4) Every FIRis a fully-parallel structure with 119871multipliers Eachmultiplieris fed with one of the weights which do not change duringthe whole voxel computation and with the samplesThe latterare sequentially introduced in the FIR blocks and then are

shifted from one multiplier to the next via an internal shiftregister The outputs of all the multipliers are then summedvia an adder tree producing one single FIR output Thiswhole computation block is pipelined with registers at theinput of every adder and multiplierSumTheoutputs of all the filters are summed together with aadder tree to obtain a single value which represents the resultof (4)Window2 The samples we are interested in are those inwhich the signal pulse occurs Window2 lets input data passunaltered if they correspond to the time interval specified by119899ℎand 119897ℎ Otherwise the output value is set to zero

Energy Computation This block computes the total voxelenergy according to (6) A multiplier and an accumulator are

VLSI Design 7

FIR

FIR

FIR

Filter

FSM

Energycomputation

Energymemory

Win

dow

2

Weightsand delays

shiftregister

Window1

Window1

Window1

120591

120591

120591

Sum shift register

SumSamplesmemory

Figure 5 RTL block diagram of the BEAF unit

used for this purpose Finally the energy of the current voxelgets written into the energy memoryFSM To coordinate the whole computation process a finitestate machine (FSM) is used A set of counters in it also keepstrack of the number of computation steps performed at anycontrol step

The filtering functions need not to be necessarily per-formed in parallel Two extreme solutions are possible Aparallel implementation in which all the data are processedconcurrently and a serial implementation in which onefilter is iteratively used In the serial implementation anaccumulator is inserted at the output of the filter since theFIR outputs must be summed to compute the energy value

Between the two extremes hybrid solutions are alsopossible We can instantiate a number of filters (and thus anumber of FIR blocks) 119873FIR that matches our timing andresource requirementsWe can use them iteratively to processthe data from119873ANT antennas In case of a hybrid solution ateach iteration the results of the partial sumof the FIR outputsare fed back to the sum shift register block in Figure 5 andadded coherently to the current value at a given iterationWhen all the antennas sets have been processed the finalresult is fed to the energy computation block The finitestate machine is designed in a parameterized way so as tocorrectly handle all possible solutions with different degreesof parallelism

It is worth mentioning that a choice of degree of par-allelism determines not just the number of FIR blocks butalso the number of 120591 Window1 and Window2 blocks In thefollowing whenwe refer to119873FIR as a parameter for the degreeof parallelism we mean that all these blocks are instantiated119873FIR times

We verified our parameterized VHDL code by means ofextensive logic-level simulations based on Mentor GraphicsrsquoModelsim [15] varying the degree of parallelismWe validatedthe output obtained with the RTL code against the resultsobtained with a software implementation and refined ourdesign until they perfectly matched

44 Timing Performance of BEAF In our architectureeach voxelrsquos energy is sequentially computed therefore the

beamforming execution time is

119879BEAF = 119873vox sdot 119879vox (9)

where 119873vox is the number of voxels and 119879vox is the timerequired to compute each voxel

The number of clock cycles necessary to execute thebeamforming step for each voxel depends on the number ofFIR filters in Figure 5 As we discussed above the spectrumof possible solutions spans from a fully sequential implemen-tation with a single FIR iteratively used for all the antennasto a fully parallel solution with as many filters as there areantennas The number of iterations depends on the ratiobetween number of antennas and number of filters

119873iter = lceil119873ANT119873FIR

rceil (10)

119879vox is the sum of three contributions

119879vox = 119879119908 + 119879comp + 119879119908119903 (11)

119879119908is the time necessary to write the FIR weights in the

memory once they are received from the link that connectsthe processor with the accelerator to read the weights fromthe memory and to store them in local registers within thefilters

119879119908= (2 (119871 + 1)119873ANT + 2119873iter) 119879CLK (12)

The first termwithin parenthesis in (12) is the time for writingand reading from the memory 119871 weights The second term isan overhead for enabling the memory and for asserting a fewcontrol signals 119879CLK is the clock period

119879comp is the actual computation time and is given by

119879comp = (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ lceillog2(119873FIR + 1)rceil )119873iter119879CLK

(13)

In (13) 119871wp is the time distance evaluated in time samplesthat corresponds to the worst path between two differentantennas and is necessary for the time alignment of thevarious signals Term lceillog

2(119871)rceil+119899

ℎ+119897ℎis the number of clock

cycles needed to load the samples in the filters with 119899ℎand 119897ℎ

representing the first sample in awindow of interest and 119897ℎthe

size of that window according to equation (5) Seven clockcycles are needed for window traversal and various registerwriting The term that depends on the logarithm of 119873FIR isrelated to the final addition in Figure 5

The final term in (11) 119879nrj is given by

119879nrj = 3119873iter (14)

and represents the time needed to write the voxelrsquos energyvalue back in the memory

We rewrite the final expression of 119879vox obtained bywrapping up all previous equations

119879vox = (2 (119871 + 1) sdot 119873ANT + 2 sdot 119873iter

+ (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ log2(119873FIR + 1)) sdot 119873iter + 3 sdot 119873iter) 119879CLK

(15)

8 VLSI Design

Figure 6 Experimental setup for the FPGA prototype the personalcomputer executes in software (Octave) part of the algorithm andallows energy maps visualization A USB link transfers data to andfrom the FPGA board where part of the beamforming algorithm isaccelerated

Equation (15) can be used to determine the overallexecution time of the beamforming algorithm in (9) giventhe total number of voxels and applies both to a 2D anda 3D map The accuracy of (15) was tested with VHDLsimulations usingMentor Graphicsrsquo Modelsim simulator andexperimentally verified with the FPGA prototype describedin the next section

5 Emulation with an FPGA-Based Prototype

We built an emulation prototype of the imaging unit Thisprototype relies on a general purpose microprocessor toperform the portions of processing that we decided to executein software while the hardware part of the BEAF algorithmwas deployed on an Xilinx Virtex-4 XC4VLX160-FF1513FPGA In particular wewere able to connect the Intel Core-i5of a laptop with the FPGA located in an AVNET Virtex-4 LXdevelopment boardTheCypress FX2USB 20 chip integratedin the development board was used to communicate with theFPGATheUSB chip behaves simply as a conduit between theUSB and the external data-processing logic The FX2 clockwas set to 48MHz the data parallelism to 8 bits and thepacket size to 512 bytes

The prototype has multiple purposes It serves as anaccelerator of the system-level simulation because it permitsa much faster execution of the part of the algorithm executedby the FPGA compared to a logic-level simulation Theoverall architecture with the software and the hardwarerunning concurrently is clearly a close approximation ofthe final system even though the processor and the digitalhardware will be different in the final target The RTL VHDLcode that gets synthesized on the FPGA is the same that wewill eventually implement in an ASIC therefore we can fullyverify it before fabrication Finally we can anticipate whatthe exact performance will be in terms of clock cycles Theonly difference in performance will be then determined bythe different clock frequency in FPGA and ASIC

Both the software and the hardware code are fullyinstrumented to permit a cycle accurate evaluation of timingperformance The USB communication time is also correctly

estimated and decoupled from the overall performance as amuch faster link will be used in a realistic setting like the onedepicted in Figure 1

The software running on the microprocessor is layeredand so the application part and the communication partare independent and portable The communication part isalso further partitioned in a set of low-level hardware-specific USB primitives (we use the libusb library availablein Linux) and a set of higher-level technology-independentAPIs for data organization and communication When thehardware target changes as it happens when moving fromthe prototype to the final target the developers replace onlythe low-level communication primitives and reuse both theapplication code and the communication APIs

Figure 6 is a photograph of the prototype and showsan energy map elaborated in the FPGA sent to the hostcomputer and displayed in the host screen

6 Experimental Results on FPGA and ASIC

We ran several experiments first with Xilinxrsquos ISE software[16] for the FPGA target of our prototype then with SynopsysDesign Compiler [17] for an ASIC target We were able todetermine the resource requirements and the actual timingperformance which depends on the actual clock frequency asfound by the FPGA and ASIC design CAD toolsThe variousbitstream files generated after FPGA synthesis mappingand place-and-route experiments were each tested in ourprototype They refer to a specific design case with nineantennas located around the numerical breast phantomnumber 012204 from the repository made available by theUniversity of Wisconsin [14] This phantom represents inthe database an average case in terms of type of tissue andnumber of voxelsWe have run finite-difference time-domain(FDTD) electromagnetic simulations to determine the signalreceived by the nine antennas and we fed the imagingunit with the received samples after 16-bit quantizationWe also extrapolated our performance results to a case offifty antennas which we assume it to be more or less thanmaximum number of antennas that can be placed around apatientrsquos breast

61 FPGA Experiments The FPGA resources the main partof which consists of look-up tables (LUT) and registers aregrouped in slices Depending on the complexity of the designand particularly on the number of filters 119873FIR a higher orlower amount of LUT and registers is used When mappingthe design the strategy of ISE does not consist in fillingthe slices one by one Instead to meet timing requirementsthe logic gates are spread over a high number of sliceswhich often results in a poor utilization Sometimes a slice isdeclared as occupied even if only one of its LUTs is usedThisfact limits the number of filters that we were able to put in ourVirtex-4 to a maximum of five FIRs The results of resourceutilization are reported in Table 3

We notice that the number of occupied slices increasesrapidly with the number of filters With six filters it exceedsthe number of available slices and we were not able to

VLSI Design 9

Table 3 Area allocation with different number of FIR filters

119873FIR Slice registers 4 input LUTs Occupied slices1 9081 (6) 8038 (5) 7188 (10)2 17347 (12) 20072 (14) 16115 (23)3 25551 (18) 44290 (32) 30899 (45)4 33851 (25) 67247 (49) 45139 (66)5 42055 (31) 91461 (67) 59936 (88)

Table 4 Hardware execution time over 14485 voxels with 9antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 128 s

119873FIR 119873ITER 119879exe [ms] Speed-up1 9 80639 15872 5 55336 23133 3 42581 30064 3 42650 30015 2 36249 35316 2 36249 35317 2 36249 35318 2 36295 35279 1 29871 4285

test that configuration in our prototype even though wecould estimate the performance in simulation The allocatedresources follow a trend that is almost linear with the numberof filters which is in accordance with our expectations

The timing performance depends proportionally on thenumber of voxels in the energy map This value is derivedfrom the 012204 phantom breast at 1mm resolution whichwe also used for the Octave simulations The 2D map gridconsists of 152 rows and 176 columns The Octave simulationcode was optimized to consider in the computation onlythose voxels which happened to be located in the ellipticalspace described by the antennas that is the area actuallyirradiated by the transmitters With such optimization thenumber of voxels is119873VOX = 14495

With such number of voxels the time required for thesoftware execution of the BEAF on the Intel i5 core of ourprototype (with Ubuntu 1110 kernel 30) was approximately119879SW = 128 sThe clock period for the FPGA implementationis 119879CLK = 16 ns which we use in our experiments with theprototype Table 4 reports the time required to execute theBEAF routine in hardware and the speed-up with respect to asoftware implementation Timing results with 119873FIR in range6ndash9 refer to logic simulations only because only five filterscould be placed in the FPGA However since all the resultsare consistent with (15) they are all fully accurate Table 5refers to a hypothetical case with fifty antennas and reportsonly simulated data which are also fully accurate comparedto a software execution that requires 119879SW = 71 s

In both Tables 4 and 5 we observe a similar trend Aswe expected the effect of changing the number of filtersmainly depends on the actual number of iterations 119873ITER =lceil119873ANT119873FIRrceil Increasing the number of filters boosts perfor-mance only when it allows to perform the computation with

Table 5 Hardware execution time over 14485 voxels with 50antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 71 s

119873FIR 119873ITER 119879exe [s] Speed-up1 50 448 15862 25 289 24533 17 238 29794 13 213 33315 10 194 36616 9 188 37867 8 181 39198 7 175 40599 6 168 421410ndash12 5 162 438113ndash16 4 156 456217ndash24 3 149 475625ndash49 2 143 497150 1 136 5205

less iterations For example in the case with nine antennaswe would not get a significant advantage from using eightfilters if we cannot use nine then we could as well use sixand still obtain basically the same performance we would getwith eight filters

The speed-up value grows rapidly for low values of119873FIRwhereas its marginal increase tends to diminish in the upperrangeThe reason is that the part of the algorithm that consistsof accessing the memory for reading samples and weightsremains sequential and cannot be made parallel

Considering the same phantom and iterating the per-formance evaluation in the 3D case with 1374953 voxels theresulting trends are similar To summarize in the worst caseof fifty antennas and 119879ITER = 50 the required time is 119879exe =42464 s (around 7 minutes) while in the best case of onlyone iteration we obtained119879exe = 12940 s (around 2minutes)With respect to a software implementation requiring 119879SW =

19 h the speedup is a factor 1611 and 5286 respectively inthe worst and best case

Referring to data discussed in Table 1 obtained for a worstcase breast phantom (in terms of number of voxels) theexecution time in the worst case (119879ITER = 50) would bearound 21 minutes (compared to the 58 h obtained in thesoftware case)

62 ASIC Experiments For this part of our experimentswe targeted a standard cell library in a 11 V 45 nm CMOStechnology We synthesized with Synopsys Design Compilervarious instances of our architecture varying the numberof filters and determined the relationship between speed-up(with respect to a software execution) and silicon area Thedesign when synthesized with the maximum optimizationeffort available can run at a clock period of 2 ns Thereforecompared to the FPGA results reported in the previoussection all the values of speed-up can be multiplied by afactor 8 (16 ns2 ns)

10 VLSI Design

0

1

2

3

4

5

6

7

8

9

10

5 10 15 20 25 30 35 40 45 50Number of FIR filters

Are

a (m

m2)

Figure 7 Postsynthesis ASIC area versus number of FIR filters fora 50 antennas system

100

150

200

250

300

350

400

450

0 1 2 3 4 5 6 7 8 9 10

Spee

d-up

ver

sus s

oftw

are

Area (mm2)

Figure 8 Speed-up versus area in an ASIC implementation of a 50antennas system

Figures 7 and 8 report silicon area as a function of thenumber of filters and speed-up as a function of the siliconarea respectively

The area curve in Figure 7 grows linearly with119873FIR witha slope of 01885mm2 which is in fact the area of a single FIRfilterThe area depends also on thememory sizeThememoryarea was obtained by considering the use of 100 memoryblocks (two for each antenna) each consisting of 256 bytesfor the signal samples (256 kB over 0457mm2) and a dual-portmemory blockwith 2048 bytes for theweights and delaysvalues (4 kB over 00153mm2) The memory area weighs onthe total area in a significant way only when a few filters areused because it does not depend on119873FIR

The speed-up that is the ratio of software execution timeand hardware execution time as a function of the silicon areashows a behavior similar to what we found for the FPGAcase but the acceleration factor is of course much biggerfrom around 50 to around 400 in the case of fifty antennasin the ASIC case This result would be notably improved incase a more scaled technology would be used [18] As we

0

2

4

6

8

10

12

14

16

5 10 15 20 25 30 35 40 45 50

Are

a and

spee

d-up

incr

ease

(rel

ativ

e uni

ts)

Number of FIR filters

Speed-up increaseArea increase

115

225

335

2 4 6 8 10

Zoom filter range 1 5

Figure 9 Trade-off between speed-up and area when changingthe number of filters in a 50 antennas ASIC implementation Cost-effective solutions are those with119873FIR le 6

utilize more and more area to make room for the filters thespeed-up first increases in a steep way then it bends becausethe sequential part of the algorithm starts dominating theexecution time

We report in Figure 9 two more curves the speed-upincrease and the area increase These two curves are simplythe normalized versions of the curves reported in Figures 7and 8 In practice we divided the area and the speed-up ofthe solution with119873FIR filters by the area and the speed-up ofthe solution with one filter respectivelyThe rationale behindthis choice is to evaluate the trade-off between performanceand cost In particular we judge that a parallel solutionis cost-effective if for a given percentage cost increase theperformance increases at least as much as the cost [19 20]From the inset in Figure 9 we are able to evaluate for whatvalue of 119873FIR the two curves cross each other The cost-effective solutions are those that use no more than six filters

7 Conclusions

Implementing in hardware the Mist-beamforming algorithmportion of the image reconstruction process determines aremarkable acceleration in terms of execution time From apractical point of view this acceleration implies a significantimprovement in the applicability of the system if we considerthe case of a 3D image reconstruction with a fifty-antennasystem we move from 19 hours of computation in thesoftware-only version to only about 7 minutes for an FPGAimplementation and less than one minute for an ASICimplementation even in the serial hardware configuration(ie with 119873FIR = 1) The FPGA clock frequency was set inour experiments to 625MHz which is the clock frequencyof the IP that communicates with the UWB external chipHowever our results show that the FPGA accelerator alonecan run up to 180MHz hence providing a further 288Xspeed-up with respect to the software version Therefore

VLSI Design 11

even the fully serial configuration gives us a noteworthyacceleration that justifies alone the validity of the hardwaredesign When using some degree of parallelism the timingperformance improves even more overcoming by far anypossible advantage of a software implementation We shouldconsider that Octave is an interpreted language and so thecode is executed by an interpreter program that interprets it atruntime Sometimes an interpreted code hasworse executiontime than an optimized compiled code written for instancein C We plan then to translate the Octave code in C asa next step Even though preliminary results suggest that a10X reduction of the CPU execution time can be achievedthis will not help reduce the BEAF part of the computationdown to an acceptable level (an acceleration of more than400X would be needed if an ASIC implementation with fiftyantennas is the reference point)Therefore the BEAF taskwillstill require a hardware accelerator An optimizedC codemayhelp instead to reduce the other tasks below the 1 s target

Thanks to the parameterized implementation our designcan be easily adapted to various configurations with differ-ent number of transmitters and receivers and for varioustechnology targets (FPGA or ASIC) with different availableresources This gives our design a high degree of flexibilityand scalability

Acknowledgment

This work was partially supported by the Italian Ministry ofEducation and Research under FIRB 2012 Project ldquoMICE-NEArdquo

References

[1] S C Hagness A Taflove and J E Bridges ldquoThree-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection design of an antenna-array elementrdquo IEEETransactions on Antennas and Propagation vol 47 no 5 pp783ndash791 1999

[2] X Li and S C Hagness ldquoA confocal microwave imagingalgorithm for breast cancer detectionrdquo IEEE Microwave andWireless Components Letters vol 11 no 3 pp 130ndash132 2001

[3] X Li S K Davis S C Hagness DW VanDerWeide and B DVan Veen ldquoMicrowave imaging via space-time beamformingexperimental investigation of tumor detection in multilayerbreast phantomsrdquo IEEE Transactions on Microwave Theory andTechniques vol 52 no 8 pp 1856ndash1865 2004

[4] E J Bond X Li S C Hagness and B D VanVeen ldquoMicrowaveimaging via space-time beamforming for early detection ofbreast cancerrdquo IEEE Transactions on Antennas and Propagationvol 51 no 8 pp 1690ndash1705 2003

[5] M R CasuM Crepaldi andMGraziano ldquoAVHDL-AMS sim-ulation environment for an UWB impulse radio transceiverrdquoIEEE Transactions on Circuits and Systems I vol 55 no 5 pp1368ndash1381 2008

[6] M Cutrupi M Crepaldi M R Casu and M Graziano ldquoAflexible UWB transmitter for breast cancer detection imagingsystemsrdquo in Proceedings of the Design Automation and Test inEurope Conference and Exhibition (DATE rsquo10) pp 1076ndash1081March 2010

[7] M R Casu M Graziano and M Zamboni ldquoA fully differentialdigital CMOS UWB pulse generatorrdquo Circuits Systems andSignal Processing vol 28 no 5 pp 649ndash664 2009

[8] M Crepaldi M R Casu M Graziano and M ZambonildquoA mixed-signal demodulator for a low-complexity IR-UWBreceiver methodology simulation and designrdquo Integration theVLSI Journal vol 42 no 1 pp 47ndash60 2009

[9] S C Hagness A Taflove and J E Bridges ldquoTwo-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection fixed-focus and antenna-array sensorsrdquo IEEETransactions onBiomedical Engineering vol 45 no 12 pp 1470ndash1474 1998

[10] S K Davis E J Bond X Li S C Hagness and B D Van VeenldquoMicrowave imaging via space-time beamforming for earlydetection of breast cancer beamformer design in the frequencydomainrdquo Journal of Electromagnetic Waves and Applicationsvol 17 no 2 pp 357ndash381 2003

[11] X Li E J Bond B D Van Veen and S C Hagness ldquoAnoverview of ultra-wideband microwave imaging via space-timebeamforming for early-stage breast-cancer detectionrdquo IEEEAntennas and Propagation Magazine vol 47 no 1 pp 19ndash342005

[12] P RosinganaDesign of aUWB Imaging System for Breast CancerDetection [MSc thesis] Politecnico di Torino

[13] J W Eaton Gnu Octave Manual NetworkTheory Ltd 2002[14] UWCEM ldquoNumerical Breast Phantom Repositoryrdquo http

uwcemecewisceduhomehtm[15] ModelSim Reference Manual v65e Mentor Graphics 2010[16] ISE Design Suite 13 Release Notes Guide Xilinx 2011[17] Design Compiler Reference Manual Optimization and Timing

Analysis Version D-201003 Synopsys 2010[18] A Pulimeno M Graziano and G Piccinini ldquoUdsm trends

comparison from technology roadmap to ultrasparc niagara2rdquoIEEE Transactions on Very Large Scale Integration (VLSI)Systems vol 20 no 7 pp 1341ndash1346 2012

[19] S V Tota M R Casu M R Roch L Rostagno and MZamboni ldquoMedea a hybrid shared-memorymessage-passingmultiprocessor NoC-based architecturerdquo in Proceedings of theDesign Automation and Test in Europe Conference and Exhibi-tion (DATE rsquo10) pp 45ndash50 March 2010

[20] M R Casu M R Roch S V Tota and M Zamboni ldquoANoC-based hybrid message-passingshared-memory approachto CMP designrdquo Microprocessors and Microsystems vol 35 no2 pp 261ndash273 2011

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 5: Research Article Hardware Acceleration of Beamforming in a ...downloads.hindawi.com/journals/vlsi/2013/861691.pdf · Hardware acceleration Microprocessor High speed link Backscattered

VLSI Design 5

Table 2 Maximum difference of energy values with respect to idealcomputation for three approximation cases

Calibration Tumor Pos Max errorApprox-I Approx-II Approx-III

Low 2678 2328 025IDEAL Med 1464 494 007

Deep 1797 670 016Low 3360 2502 044

SKAR Med 1561 867 016Deep 3068 3446 048

42 Accuracy and Approximation One of the crucial designchoices is data representation We assume that receivedsignals are sampled and represented as 16-bit 2rsquos complementnumbers [12] These data are processed according to (4) and(6) in which multiplications and additions are the mostfrequent operations The number of bits for the intermediateand final results is one of the designerrsquos knobs who has tofind the best trade-off between precision and complexity Weexplored three possible approximation strategies which wediscuss in the followingApprox-IHere we keep data parallelism as low as possible Inthe first multiplication in (4) the 16-bit inputs are multipliedand a 32-bit output is produced Before the addition in thesame equation we truncate the multiplication results anddiscard the 16 least significant bits Then to avoid overflowthe additions are performed over 32 bits by extending theremaining 16 most significant bits This parallelism ensuresno overflow for up to 16 levels of addition Before the finalmultiplication in (6) the 16 least significant bits are discardedand the multiplication result is a 32-bit numberApprox-II In this case we do not discard the 16 leastsignificant bits in the square computation in (6) obtaining a64-bit resultThe 32 least significant bits of the result are thencut away This solution requires of course larger multipliersand so implies a larger complexityApprox-III In this case we keep the maximum level ofaccuracy and perform all the operations with 64 bits

We evaluated these three alternatives with a softwareimplementation of the algorithm in Octave before movingto the hardware design The algorithm performance dependson the relative position of any high-permittivity tissue andthe position of the antennas Therefore we compared theperformance considering an ideal breast model uniformlyfilled with fat tissue for three possible tumor positions in thebreast a small depth (low) an intermediate depth (med) anda considerable depth (deep)

We also evaluated the effect on the performance of theSKAR algorithm In particular we compared the perfor-mance of the actual SKAR with an ideal calibration in whichwe remove any artifact by artificially subtracting the responsewithout tumor from the response with tumor

The three maps in Figure 4 are obtained with SKARcalibration and with the three approximation strategies forthe case of deep position In the first case in particular the

low resolution of the internal computations creates a largeartificial clutter response and so a clear loss in accuracyNevertheless the tumor position is clearly and correctlyfound In the second case the removal of the least 32-bit ofthe energy result eliminates the clutter response but may endup in inaccuracies should a point of interest have a low energyand be therefore eliminated The third approximation is ofcourse the onewith the greatest accuracy In fact the results inthe thirdmap in Figure 4 are identical to those obtained froma simulation with double precision floating-point numbers

In Table 2 we report the maximum error of the energyvalues with respect to an ideal computation (with no approx-imation) The maximum error is calculated as follows

MaxError = max(10038161003816100381610038161003816100381610038161003816100381610038161003816

EMmax (EM)

minus

EMfp

max (EMfp)

10038161003816100381610038161003816100381610038161003816100381610038161003816

) (8)

EM represents the energy map without approximations(double precision) and EMfp is the energy map with approx-imation (fixed point) The error is determined by comparingenergy values normalized to the maximum values obtainedwith or without approximation respectively The maximumerror is consistent with the graphical results of the maps

To keep the maximum level of accuracy then we decideto maintain a 64-bit data representation This decision doesnot affect the design as much as one may think To store 64-bit energy values we need twice as much memory locationsthan we need for the 32-bit case However the memory sizeis mostly determined by the many weights we have to storeand both the number and the bit size of the weights do notdepend on the representation of the intermediate data

43 Architecture After having decided the parallelism forinternal data representation we moved on to the RTLhardware design We described our system in synthesizableVHDL Design-time parameters were used to keep the designflexible The block diagram in Figure 5 represents the BEAFarchitecture that we described in RTL In the following wedescribe the behavior of each componentWeights amp Delays Shift RegisterWhen the image reconstruc-tion process starts after the acquisition phase calibratedsamples weights for BEAF and delays values are availableeach in their own memories For each energy value an entirewindow of samples is processed in every FIR filter togetherwith the corresponding weights Before the filter computa-tion these weights are sequentially read from memory andloaded in a serial-in parallel-out shift register included in theWeights amp Delays shift register blockThe delay values are alsoread from memory at the same time They are sent to the 120591blocks to determine the correct time-shift amount necessaryto align the signals according to the beamforming algorithmdescribed in Section 41Sample Memory The sample values stored in this memoryare read one at a time for each channel and fed to the filterblocksFilters A bank of filters receives samples from the samplememory and delay values from the Weights ampDelays shift reg-ister In a fully parallel implementation the number of filters

6 VLSI Design

0

Beamforming approximated - SKAR calibration

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14

2e minus 08

4e minus 08

6e minus 08

8e minus 08

1e minus 07

i (cm)

j(c

m)

(a) First approximation type (Approx-I)

Beamforming approximated - SKAR calibration

0

5e minus 09

1e minus 08

15e minus 08

2e minus 08

25e minus 08

3e minus 08

35e minus 08

4e minus 08

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14i (cm)

j(c

m)

(b) Second approximation type (Approx-II)

Beamforming approximated - SKAR calibration

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14i (cm)

j(c

m)

0

1e minus 08

2e minus 08

3e minus 08

4e minus 08

5e minus 08

(c) Third approximation type (Approx-III)

Figure 4 Energy map of the breast after BEAF data approximation for ideal breast model and SKAR calibration A tumor is present in a deepposition

(119873FIR) corresponds to the number of transceiversantennasin the UWB probe As we discuss momentarily sequentialimplementations that use less filters than there are antennasare possible Each filter includes three main componentsThe120591 blocks are programmable delay lines which shift the inputsamples of the right time shift to ensure that all FIR inputs aretime alignedThe maximum delay is 119899

119886 TheWindow1 blocks

perform the 119892[119899] windowing functions they feed the FIRblocks only with samples that come after 119899

119886 and replace all

the samples that come before 119899119886with zero values In the FIR

blocks the samples are processed according to (4) Every FIRis a fully-parallel structure with 119871multipliers Eachmultiplieris fed with one of the weights which do not change duringthe whole voxel computation and with the samplesThe latterare sequentially introduced in the FIR blocks and then are

shifted from one multiplier to the next via an internal shiftregister The outputs of all the multipliers are then summedvia an adder tree producing one single FIR output Thiswhole computation block is pipelined with registers at theinput of every adder and multiplierSumTheoutputs of all the filters are summed together with aadder tree to obtain a single value which represents the resultof (4)Window2 The samples we are interested in are those inwhich the signal pulse occurs Window2 lets input data passunaltered if they correspond to the time interval specified by119899ℎand 119897ℎ Otherwise the output value is set to zero

Energy Computation This block computes the total voxelenergy according to (6) A multiplier and an accumulator are

VLSI Design 7

FIR

FIR

FIR

Filter

FSM

Energycomputation

Energymemory

Win

dow

2

Weightsand delays

shiftregister

Window1

Window1

Window1

120591

120591

120591

Sum shift register

SumSamplesmemory

Figure 5 RTL block diagram of the BEAF unit

used for this purpose Finally the energy of the current voxelgets written into the energy memoryFSM To coordinate the whole computation process a finitestate machine (FSM) is used A set of counters in it also keepstrack of the number of computation steps performed at anycontrol step

The filtering functions need not to be necessarily per-formed in parallel Two extreme solutions are possible Aparallel implementation in which all the data are processedconcurrently and a serial implementation in which onefilter is iteratively used In the serial implementation anaccumulator is inserted at the output of the filter since theFIR outputs must be summed to compute the energy value

Between the two extremes hybrid solutions are alsopossible We can instantiate a number of filters (and thus anumber of FIR blocks) 119873FIR that matches our timing andresource requirementsWe can use them iteratively to processthe data from119873ANT antennas In case of a hybrid solution ateach iteration the results of the partial sumof the FIR outputsare fed back to the sum shift register block in Figure 5 andadded coherently to the current value at a given iterationWhen all the antennas sets have been processed the finalresult is fed to the energy computation block The finitestate machine is designed in a parameterized way so as tocorrectly handle all possible solutions with different degreesof parallelism

It is worth mentioning that a choice of degree of par-allelism determines not just the number of FIR blocks butalso the number of 120591 Window1 and Window2 blocks In thefollowing whenwe refer to119873FIR as a parameter for the degreeof parallelism we mean that all these blocks are instantiated119873FIR times

We verified our parameterized VHDL code by means ofextensive logic-level simulations based on Mentor GraphicsrsquoModelsim [15] varying the degree of parallelismWe validatedthe output obtained with the RTL code against the resultsobtained with a software implementation and refined ourdesign until they perfectly matched

44 Timing Performance of BEAF In our architectureeach voxelrsquos energy is sequentially computed therefore the

beamforming execution time is

119879BEAF = 119873vox sdot 119879vox (9)

where 119873vox is the number of voxels and 119879vox is the timerequired to compute each voxel

The number of clock cycles necessary to execute thebeamforming step for each voxel depends on the number ofFIR filters in Figure 5 As we discussed above the spectrumof possible solutions spans from a fully sequential implemen-tation with a single FIR iteratively used for all the antennasto a fully parallel solution with as many filters as there areantennas The number of iterations depends on the ratiobetween number of antennas and number of filters

119873iter = lceil119873ANT119873FIR

rceil (10)

119879vox is the sum of three contributions

119879vox = 119879119908 + 119879comp + 119879119908119903 (11)

119879119908is the time necessary to write the FIR weights in the

memory once they are received from the link that connectsthe processor with the accelerator to read the weights fromthe memory and to store them in local registers within thefilters

119879119908= (2 (119871 + 1)119873ANT + 2119873iter) 119879CLK (12)

The first termwithin parenthesis in (12) is the time for writingand reading from the memory 119871 weights The second term isan overhead for enabling the memory and for asserting a fewcontrol signals 119879CLK is the clock period

119879comp is the actual computation time and is given by

119879comp = (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ lceillog2(119873FIR + 1)rceil )119873iter119879CLK

(13)

In (13) 119871wp is the time distance evaluated in time samplesthat corresponds to the worst path between two differentantennas and is necessary for the time alignment of thevarious signals Term lceillog

2(119871)rceil+119899

ℎ+119897ℎis the number of clock

cycles needed to load the samples in the filters with 119899ℎand 119897ℎ

representing the first sample in awindow of interest and 119897ℎthe

size of that window according to equation (5) Seven clockcycles are needed for window traversal and various registerwriting The term that depends on the logarithm of 119873FIR isrelated to the final addition in Figure 5

The final term in (11) 119879nrj is given by

119879nrj = 3119873iter (14)

and represents the time needed to write the voxelrsquos energyvalue back in the memory

We rewrite the final expression of 119879vox obtained bywrapping up all previous equations

119879vox = (2 (119871 + 1) sdot 119873ANT + 2 sdot 119873iter

+ (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ log2(119873FIR + 1)) sdot 119873iter + 3 sdot 119873iter) 119879CLK

(15)

8 VLSI Design

Figure 6 Experimental setup for the FPGA prototype the personalcomputer executes in software (Octave) part of the algorithm andallows energy maps visualization A USB link transfers data to andfrom the FPGA board where part of the beamforming algorithm isaccelerated

Equation (15) can be used to determine the overallexecution time of the beamforming algorithm in (9) giventhe total number of voxels and applies both to a 2D anda 3D map The accuracy of (15) was tested with VHDLsimulations usingMentor Graphicsrsquo Modelsim simulator andexperimentally verified with the FPGA prototype describedin the next section

5 Emulation with an FPGA-Based Prototype

We built an emulation prototype of the imaging unit Thisprototype relies on a general purpose microprocessor toperform the portions of processing that we decided to executein software while the hardware part of the BEAF algorithmwas deployed on an Xilinx Virtex-4 XC4VLX160-FF1513FPGA In particular wewere able to connect the Intel Core-i5of a laptop with the FPGA located in an AVNET Virtex-4 LXdevelopment boardTheCypress FX2USB 20 chip integratedin the development board was used to communicate with theFPGATheUSB chip behaves simply as a conduit between theUSB and the external data-processing logic The FX2 clockwas set to 48MHz the data parallelism to 8 bits and thepacket size to 512 bytes

The prototype has multiple purposes It serves as anaccelerator of the system-level simulation because it permitsa much faster execution of the part of the algorithm executedby the FPGA compared to a logic-level simulation Theoverall architecture with the software and the hardwarerunning concurrently is clearly a close approximation ofthe final system even though the processor and the digitalhardware will be different in the final target The RTL VHDLcode that gets synthesized on the FPGA is the same that wewill eventually implement in an ASIC therefore we can fullyverify it before fabrication Finally we can anticipate whatthe exact performance will be in terms of clock cycles Theonly difference in performance will be then determined bythe different clock frequency in FPGA and ASIC

Both the software and the hardware code are fullyinstrumented to permit a cycle accurate evaluation of timingperformance The USB communication time is also correctly

estimated and decoupled from the overall performance as amuch faster link will be used in a realistic setting like the onedepicted in Figure 1

The software running on the microprocessor is layeredand so the application part and the communication partare independent and portable The communication part isalso further partitioned in a set of low-level hardware-specific USB primitives (we use the libusb library availablein Linux) and a set of higher-level technology-independentAPIs for data organization and communication When thehardware target changes as it happens when moving fromthe prototype to the final target the developers replace onlythe low-level communication primitives and reuse both theapplication code and the communication APIs

Figure 6 is a photograph of the prototype and showsan energy map elaborated in the FPGA sent to the hostcomputer and displayed in the host screen

6 Experimental Results on FPGA and ASIC

We ran several experiments first with Xilinxrsquos ISE software[16] for the FPGA target of our prototype then with SynopsysDesign Compiler [17] for an ASIC target We were able todetermine the resource requirements and the actual timingperformance which depends on the actual clock frequency asfound by the FPGA and ASIC design CAD toolsThe variousbitstream files generated after FPGA synthesis mappingand place-and-route experiments were each tested in ourprototype They refer to a specific design case with nineantennas located around the numerical breast phantomnumber 012204 from the repository made available by theUniversity of Wisconsin [14] This phantom represents inthe database an average case in terms of type of tissue andnumber of voxelsWe have run finite-difference time-domain(FDTD) electromagnetic simulations to determine the signalreceived by the nine antennas and we fed the imagingunit with the received samples after 16-bit quantizationWe also extrapolated our performance results to a case offifty antennas which we assume it to be more or less thanmaximum number of antennas that can be placed around apatientrsquos breast

61 FPGA Experiments The FPGA resources the main partof which consists of look-up tables (LUT) and registers aregrouped in slices Depending on the complexity of the designand particularly on the number of filters 119873FIR a higher orlower amount of LUT and registers is used When mappingthe design the strategy of ISE does not consist in fillingthe slices one by one Instead to meet timing requirementsthe logic gates are spread over a high number of sliceswhich often results in a poor utilization Sometimes a slice isdeclared as occupied even if only one of its LUTs is usedThisfact limits the number of filters that we were able to put in ourVirtex-4 to a maximum of five FIRs The results of resourceutilization are reported in Table 3

We notice that the number of occupied slices increasesrapidly with the number of filters With six filters it exceedsthe number of available slices and we were not able to

VLSI Design 9

Table 3 Area allocation with different number of FIR filters

119873FIR Slice registers 4 input LUTs Occupied slices1 9081 (6) 8038 (5) 7188 (10)2 17347 (12) 20072 (14) 16115 (23)3 25551 (18) 44290 (32) 30899 (45)4 33851 (25) 67247 (49) 45139 (66)5 42055 (31) 91461 (67) 59936 (88)

Table 4 Hardware execution time over 14485 voxels with 9antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 128 s

119873FIR 119873ITER 119879exe [ms] Speed-up1 9 80639 15872 5 55336 23133 3 42581 30064 3 42650 30015 2 36249 35316 2 36249 35317 2 36249 35318 2 36295 35279 1 29871 4285

test that configuration in our prototype even though wecould estimate the performance in simulation The allocatedresources follow a trend that is almost linear with the numberof filters which is in accordance with our expectations

The timing performance depends proportionally on thenumber of voxels in the energy map This value is derivedfrom the 012204 phantom breast at 1mm resolution whichwe also used for the Octave simulations The 2D map gridconsists of 152 rows and 176 columns The Octave simulationcode was optimized to consider in the computation onlythose voxels which happened to be located in the ellipticalspace described by the antennas that is the area actuallyirradiated by the transmitters With such optimization thenumber of voxels is119873VOX = 14495

With such number of voxels the time required for thesoftware execution of the BEAF on the Intel i5 core of ourprototype (with Ubuntu 1110 kernel 30) was approximately119879SW = 128 sThe clock period for the FPGA implementationis 119879CLK = 16 ns which we use in our experiments with theprototype Table 4 reports the time required to execute theBEAF routine in hardware and the speed-up with respect to asoftware implementation Timing results with 119873FIR in range6ndash9 refer to logic simulations only because only five filterscould be placed in the FPGA However since all the resultsare consistent with (15) they are all fully accurate Table 5refers to a hypothetical case with fifty antennas and reportsonly simulated data which are also fully accurate comparedto a software execution that requires 119879SW = 71 s

In both Tables 4 and 5 we observe a similar trend Aswe expected the effect of changing the number of filtersmainly depends on the actual number of iterations 119873ITER =lceil119873ANT119873FIRrceil Increasing the number of filters boosts perfor-mance only when it allows to perform the computation with

Table 5 Hardware execution time over 14485 voxels with 50antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 71 s

119873FIR 119873ITER 119879exe [s] Speed-up1 50 448 15862 25 289 24533 17 238 29794 13 213 33315 10 194 36616 9 188 37867 8 181 39198 7 175 40599 6 168 421410ndash12 5 162 438113ndash16 4 156 456217ndash24 3 149 475625ndash49 2 143 497150 1 136 5205

less iterations For example in the case with nine antennaswe would not get a significant advantage from using eightfilters if we cannot use nine then we could as well use sixand still obtain basically the same performance we would getwith eight filters

The speed-up value grows rapidly for low values of119873FIRwhereas its marginal increase tends to diminish in the upperrangeThe reason is that the part of the algorithm that consistsof accessing the memory for reading samples and weightsremains sequential and cannot be made parallel

Considering the same phantom and iterating the per-formance evaluation in the 3D case with 1374953 voxels theresulting trends are similar To summarize in the worst caseof fifty antennas and 119879ITER = 50 the required time is 119879exe =42464 s (around 7 minutes) while in the best case of onlyone iteration we obtained119879exe = 12940 s (around 2minutes)With respect to a software implementation requiring 119879SW =

19 h the speedup is a factor 1611 and 5286 respectively inthe worst and best case

Referring to data discussed in Table 1 obtained for a worstcase breast phantom (in terms of number of voxels) theexecution time in the worst case (119879ITER = 50) would bearound 21 minutes (compared to the 58 h obtained in thesoftware case)

62 ASIC Experiments For this part of our experimentswe targeted a standard cell library in a 11 V 45 nm CMOStechnology We synthesized with Synopsys Design Compilervarious instances of our architecture varying the numberof filters and determined the relationship between speed-up(with respect to a software execution) and silicon area Thedesign when synthesized with the maximum optimizationeffort available can run at a clock period of 2 ns Thereforecompared to the FPGA results reported in the previoussection all the values of speed-up can be multiplied by afactor 8 (16 ns2 ns)

10 VLSI Design

0

1

2

3

4

5

6

7

8

9

10

5 10 15 20 25 30 35 40 45 50Number of FIR filters

Are

a (m

m2)

Figure 7 Postsynthesis ASIC area versus number of FIR filters fora 50 antennas system

100

150

200

250

300

350

400

450

0 1 2 3 4 5 6 7 8 9 10

Spee

d-up

ver

sus s

oftw

are

Area (mm2)

Figure 8 Speed-up versus area in an ASIC implementation of a 50antennas system

Figures 7 and 8 report silicon area as a function of thenumber of filters and speed-up as a function of the siliconarea respectively

The area curve in Figure 7 grows linearly with119873FIR witha slope of 01885mm2 which is in fact the area of a single FIRfilterThe area depends also on thememory sizeThememoryarea was obtained by considering the use of 100 memoryblocks (two for each antenna) each consisting of 256 bytesfor the signal samples (256 kB over 0457mm2) and a dual-portmemory blockwith 2048 bytes for theweights and delaysvalues (4 kB over 00153mm2) The memory area weighs onthe total area in a significant way only when a few filters areused because it does not depend on119873FIR

The speed-up that is the ratio of software execution timeand hardware execution time as a function of the silicon areashows a behavior similar to what we found for the FPGAcase but the acceleration factor is of course much biggerfrom around 50 to around 400 in the case of fifty antennasin the ASIC case This result would be notably improved incase a more scaled technology would be used [18] As we

0

2

4

6

8

10

12

14

16

5 10 15 20 25 30 35 40 45 50

Are

a and

spee

d-up

incr

ease

(rel

ativ

e uni

ts)

Number of FIR filters

Speed-up increaseArea increase

115

225

335

2 4 6 8 10

Zoom filter range 1 5

Figure 9 Trade-off between speed-up and area when changingthe number of filters in a 50 antennas ASIC implementation Cost-effective solutions are those with119873FIR le 6

utilize more and more area to make room for the filters thespeed-up first increases in a steep way then it bends becausethe sequential part of the algorithm starts dominating theexecution time

We report in Figure 9 two more curves the speed-upincrease and the area increase These two curves are simplythe normalized versions of the curves reported in Figures 7and 8 In practice we divided the area and the speed-up ofthe solution with119873FIR filters by the area and the speed-up ofthe solution with one filter respectivelyThe rationale behindthis choice is to evaluate the trade-off between performanceand cost In particular we judge that a parallel solutionis cost-effective if for a given percentage cost increase theperformance increases at least as much as the cost [19 20]From the inset in Figure 9 we are able to evaluate for whatvalue of 119873FIR the two curves cross each other The cost-effective solutions are those that use no more than six filters

7 Conclusions

Implementing in hardware the Mist-beamforming algorithmportion of the image reconstruction process determines aremarkable acceleration in terms of execution time From apractical point of view this acceleration implies a significantimprovement in the applicability of the system if we considerthe case of a 3D image reconstruction with a fifty-antennasystem we move from 19 hours of computation in thesoftware-only version to only about 7 minutes for an FPGAimplementation and less than one minute for an ASICimplementation even in the serial hardware configuration(ie with 119873FIR = 1) The FPGA clock frequency was set inour experiments to 625MHz which is the clock frequencyof the IP that communicates with the UWB external chipHowever our results show that the FPGA accelerator alonecan run up to 180MHz hence providing a further 288Xspeed-up with respect to the software version Therefore

VLSI Design 11

even the fully serial configuration gives us a noteworthyacceleration that justifies alone the validity of the hardwaredesign When using some degree of parallelism the timingperformance improves even more overcoming by far anypossible advantage of a software implementation We shouldconsider that Octave is an interpreted language and so thecode is executed by an interpreter program that interprets it atruntime Sometimes an interpreted code hasworse executiontime than an optimized compiled code written for instancein C We plan then to translate the Octave code in C asa next step Even though preliminary results suggest that a10X reduction of the CPU execution time can be achievedthis will not help reduce the BEAF part of the computationdown to an acceptable level (an acceleration of more than400X would be needed if an ASIC implementation with fiftyantennas is the reference point)Therefore the BEAF taskwillstill require a hardware accelerator An optimizedC codemayhelp instead to reduce the other tasks below the 1 s target

Thanks to the parameterized implementation our designcan be easily adapted to various configurations with differ-ent number of transmitters and receivers and for varioustechnology targets (FPGA or ASIC) with different availableresources This gives our design a high degree of flexibilityand scalability

Acknowledgment

This work was partially supported by the Italian Ministry ofEducation and Research under FIRB 2012 Project ldquoMICE-NEArdquo

References

[1] S C Hagness A Taflove and J E Bridges ldquoThree-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection design of an antenna-array elementrdquo IEEETransactions on Antennas and Propagation vol 47 no 5 pp783ndash791 1999

[2] X Li and S C Hagness ldquoA confocal microwave imagingalgorithm for breast cancer detectionrdquo IEEE Microwave andWireless Components Letters vol 11 no 3 pp 130ndash132 2001

[3] X Li S K Davis S C Hagness DW VanDerWeide and B DVan Veen ldquoMicrowave imaging via space-time beamformingexperimental investigation of tumor detection in multilayerbreast phantomsrdquo IEEE Transactions on Microwave Theory andTechniques vol 52 no 8 pp 1856ndash1865 2004

[4] E J Bond X Li S C Hagness and B D VanVeen ldquoMicrowaveimaging via space-time beamforming for early detection ofbreast cancerrdquo IEEE Transactions on Antennas and Propagationvol 51 no 8 pp 1690ndash1705 2003

[5] M R CasuM Crepaldi andMGraziano ldquoAVHDL-AMS sim-ulation environment for an UWB impulse radio transceiverrdquoIEEE Transactions on Circuits and Systems I vol 55 no 5 pp1368ndash1381 2008

[6] M Cutrupi M Crepaldi M R Casu and M Graziano ldquoAflexible UWB transmitter for breast cancer detection imagingsystemsrdquo in Proceedings of the Design Automation and Test inEurope Conference and Exhibition (DATE rsquo10) pp 1076ndash1081March 2010

[7] M R Casu M Graziano and M Zamboni ldquoA fully differentialdigital CMOS UWB pulse generatorrdquo Circuits Systems andSignal Processing vol 28 no 5 pp 649ndash664 2009

[8] M Crepaldi M R Casu M Graziano and M ZambonildquoA mixed-signal demodulator for a low-complexity IR-UWBreceiver methodology simulation and designrdquo Integration theVLSI Journal vol 42 no 1 pp 47ndash60 2009

[9] S C Hagness A Taflove and J E Bridges ldquoTwo-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection fixed-focus and antenna-array sensorsrdquo IEEETransactions onBiomedical Engineering vol 45 no 12 pp 1470ndash1474 1998

[10] S K Davis E J Bond X Li S C Hagness and B D Van VeenldquoMicrowave imaging via space-time beamforming for earlydetection of breast cancer beamformer design in the frequencydomainrdquo Journal of Electromagnetic Waves and Applicationsvol 17 no 2 pp 357ndash381 2003

[11] X Li E J Bond B D Van Veen and S C Hagness ldquoAnoverview of ultra-wideband microwave imaging via space-timebeamforming for early-stage breast-cancer detectionrdquo IEEEAntennas and Propagation Magazine vol 47 no 1 pp 19ndash342005

[12] P RosinganaDesign of aUWB Imaging System for Breast CancerDetection [MSc thesis] Politecnico di Torino

[13] J W Eaton Gnu Octave Manual NetworkTheory Ltd 2002[14] UWCEM ldquoNumerical Breast Phantom Repositoryrdquo http

uwcemecewisceduhomehtm[15] ModelSim Reference Manual v65e Mentor Graphics 2010[16] ISE Design Suite 13 Release Notes Guide Xilinx 2011[17] Design Compiler Reference Manual Optimization and Timing

Analysis Version D-201003 Synopsys 2010[18] A Pulimeno M Graziano and G Piccinini ldquoUdsm trends

comparison from technology roadmap to ultrasparc niagara2rdquoIEEE Transactions on Very Large Scale Integration (VLSI)Systems vol 20 no 7 pp 1341ndash1346 2012

[19] S V Tota M R Casu M R Roch L Rostagno and MZamboni ldquoMedea a hybrid shared-memorymessage-passingmultiprocessor NoC-based architecturerdquo in Proceedings of theDesign Automation and Test in Europe Conference and Exhibi-tion (DATE rsquo10) pp 45ndash50 March 2010

[20] M R Casu M R Roch S V Tota and M Zamboni ldquoANoC-based hybrid message-passingshared-memory approachto CMP designrdquo Microprocessors and Microsystems vol 35 no2 pp 261ndash273 2011

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 6: Research Article Hardware Acceleration of Beamforming in a ...downloads.hindawi.com/journals/vlsi/2013/861691.pdf · Hardware acceleration Microprocessor High speed link Backscattered

6 VLSI Design

0

Beamforming approximated - SKAR calibration

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14

2e minus 08

4e minus 08

6e minus 08

8e minus 08

1e minus 07

i (cm)

j(c

m)

(a) First approximation type (Approx-I)

Beamforming approximated - SKAR calibration

0

5e minus 09

1e minus 08

15e minus 08

2e minus 08

25e minus 08

3e minus 08

35e minus 08

4e minus 08

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14i (cm)

j(c

m)

(b) Second approximation type (Approx-II)

Beamforming approximated - SKAR calibration

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14i (cm)

j(c

m)

0

1e minus 08

2e minus 08

3e minus 08

4e minus 08

5e minus 08

(c) Third approximation type (Approx-III)

Figure 4 Energy map of the breast after BEAF data approximation for ideal breast model and SKAR calibration A tumor is present in a deepposition

(119873FIR) corresponds to the number of transceiversantennasin the UWB probe As we discuss momentarily sequentialimplementations that use less filters than there are antennasare possible Each filter includes three main componentsThe120591 blocks are programmable delay lines which shift the inputsamples of the right time shift to ensure that all FIR inputs aretime alignedThe maximum delay is 119899

119886 TheWindow1 blocks

perform the 119892[119899] windowing functions they feed the FIRblocks only with samples that come after 119899

119886 and replace all

the samples that come before 119899119886with zero values In the FIR

blocks the samples are processed according to (4) Every FIRis a fully-parallel structure with 119871multipliers Eachmultiplieris fed with one of the weights which do not change duringthe whole voxel computation and with the samplesThe latterare sequentially introduced in the FIR blocks and then are

shifted from one multiplier to the next via an internal shiftregister The outputs of all the multipliers are then summedvia an adder tree producing one single FIR output Thiswhole computation block is pipelined with registers at theinput of every adder and multiplierSumTheoutputs of all the filters are summed together with aadder tree to obtain a single value which represents the resultof (4)Window2 The samples we are interested in are those inwhich the signal pulse occurs Window2 lets input data passunaltered if they correspond to the time interval specified by119899ℎand 119897ℎ Otherwise the output value is set to zero

Energy Computation This block computes the total voxelenergy according to (6) A multiplier and an accumulator are

VLSI Design 7

FIR

FIR

FIR

Filter

FSM

Energycomputation

Energymemory

Win

dow

2

Weightsand delays

shiftregister

Window1

Window1

Window1

120591

120591

120591

Sum shift register

SumSamplesmemory

Figure 5 RTL block diagram of the BEAF unit

used for this purpose Finally the energy of the current voxelgets written into the energy memoryFSM To coordinate the whole computation process a finitestate machine (FSM) is used A set of counters in it also keepstrack of the number of computation steps performed at anycontrol step

The filtering functions need not to be necessarily per-formed in parallel Two extreme solutions are possible Aparallel implementation in which all the data are processedconcurrently and a serial implementation in which onefilter is iteratively used In the serial implementation anaccumulator is inserted at the output of the filter since theFIR outputs must be summed to compute the energy value

Between the two extremes hybrid solutions are alsopossible We can instantiate a number of filters (and thus anumber of FIR blocks) 119873FIR that matches our timing andresource requirementsWe can use them iteratively to processthe data from119873ANT antennas In case of a hybrid solution ateach iteration the results of the partial sumof the FIR outputsare fed back to the sum shift register block in Figure 5 andadded coherently to the current value at a given iterationWhen all the antennas sets have been processed the finalresult is fed to the energy computation block The finitestate machine is designed in a parameterized way so as tocorrectly handle all possible solutions with different degreesof parallelism

It is worth mentioning that a choice of degree of par-allelism determines not just the number of FIR blocks butalso the number of 120591 Window1 and Window2 blocks In thefollowing whenwe refer to119873FIR as a parameter for the degreeof parallelism we mean that all these blocks are instantiated119873FIR times

We verified our parameterized VHDL code by means ofextensive logic-level simulations based on Mentor GraphicsrsquoModelsim [15] varying the degree of parallelismWe validatedthe output obtained with the RTL code against the resultsobtained with a software implementation and refined ourdesign until they perfectly matched

44 Timing Performance of BEAF In our architectureeach voxelrsquos energy is sequentially computed therefore the

beamforming execution time is

119879BEAF = 119873vox sdot 119879vox (9)

where 119873vox is the number of voxels and 119879vox is the timerequired to compute each voxel

The number of clock cycles necessary to execute thebeamforming step for each voxel depends on the number ofFIR filters in Figure 5 As we discussed above the spectrumof possible solutions spans from a fully sequential implemen-tation with a single FIR iteratively used for all the antennasto a fully parallel solution with as many filters as there areantennas The number of iterations depends on the ratiobetween number of antennas and number of filters

119873iter = lceil119873ANT119873FIR

rceil (10)

119879vox is the sum of three contributions

119879vox = 119879119908 + 119879comp + 119879119908119903 (11)

119879119908is the time necessary to write the FIR weights in the

memory once they are received from the link that connectsthe processor with the accelerator to read the weights fromthe memory and to store them in local registers within thefilters

119879119908= (2 (119871 + 1)119873ANT + 2119873iter) 119879CLK (12)

The first termwithin parenthesis in (12) is the time for writingand reading from the memory 119871 weights The second term isan overhead for enabling the memory and for asserting a fewcontrol signals 119879CLK is the clock period

119879comp is the actual computation time and is given by

119879comp = (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ lceillog2(119873FIR + 1)rceil )119873iter119879CLK

(13)

In (13) 119871wp is the time distance evaluated in time samplesthat corresponds to the worst path between two differentantennas and is necessary for the time alignment of thevarious signals Term lceillog

2(119871)rceil+119899

ℎ+119897ℎis the number of clock

cycles needed to load the samples in the filters with 119899ℎand 119897ℎ

representing the first sample in awindow of interest and 119897ℎthe

size of that window according to equation (5) Seven clockcycles are needed for window traversal and various registerwriting The term that depends on the logarithm of 119873FIR isrelated to the final addition in Figure 5

The final term in (11) 119879nrj is given by

119879nrj = 3119873iter (14)

and represents the time needed to write the voxelrsquos energyvalue back in the memory

We rewrite the final expression of 119879vox obtained bywrapping up all previous equations

119879vox = (2 (119871 + 1) sdot 119873ANT + 2 sdot 119873iter

+ (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ log2(119873FIR + 1)) sdot 119873iter + 3 sdot 119873iter) 119879CLK

(15)

8 VLSI Design

Figure 6 Experimental setup for the FPGA prototype the personalcomputer executes in software (Octave) part of the algorithm andallows energy maps visualization A USB link transfers data to andfrom the FPGA board where part of the beamforming algorithm isaccelerated

Equation (15) can be used to determine the overallexecution time of the beamforming algorithm in (9) giventhe total number of voxels and applies both to a 2D anda 3D map The accuracy of (15) was tested with VHDLsimulations usingMentor Graphicsrsquo Modelsim simulator andexperimentally verified with the FPGA prototype describedin the next section

5 Emulation with an FPGA-Based Prototype

We built an emulation prototype of the imaging unit Thisprototype relies on a general purpose microprocessor toperform the portions of processing that we decided to executein software while the hardware part of the BEAF algorithmwas deployed on an Xilinx Virtex-4 XC4VLX160-FF1513FPGA In particular wewere able to connect the Intel Core-i5of a laptop with the FPGA located in an AVNET Virtex-4 LXdevelopment boardTheCypress FX2USB 20 chip integratedin the development board was used to communicate with theFPGATheUSB chip behaves simply as a conduit between theUSB and the external data-processing logic The FX2 clockwas set to 48MHz the data parallelism to 8 bits and thepacket size to 512 bytes

The prototype has multiple purposes It serves as anaccelerator of the system-level simulation because it permitsa much faster execution of the part of the algorithm executedby the FPGA compared to a logic-level simulation Theoverall architecture with the software and the hardwarerunning concurrently is clearly a close approximation ofthe final system even though the processor and the digitalhardware will be different in the final target The RTL VHDLcode that gets synthesized on the FPGA is the same that wewill eventually implement in an ASIC therefore we can fullyverify it before fabrication Finally we can anticipate whatthe exact performance will be in terms of clock cycles Theonly difference in performance will be then determined bythe different clock frequency in FPGA and ASIC

Both the software and the hardware code are fullyinstrumented to permit a cycle accurate evaluation of timingperformance The USB communication time is also correctly

estimated and decoupled from the overall performance as amuch faster link will be used in a realistic setting like the onedepicted in Figure 1

The software running on the microprocessor is layeredand so the application part and the communication partare independent and portable The communication part isalso further partitioned in a set of low-level hardware-specific USB primitives (we use the libusb library availablein Linux) and a set of higher-level technology-independentAPIs for data organization and communication When thehardware target changes as it happens when moving fromthe prototype to the final target the developers replace onlythe low-level communication primitives and reuse both theapplication code and the communication APIs

Figure 6 is a photograph of the prototype and showsan energy map elaborated in the FPGA sent to the hostcomputer and displayed in the host screen

6 Experimental Results on FPGA and ASIC

We ran several experiments first with Xilinxrsquos ISE software[16] for the FPGA target of our prototype then with SynopsysDesign Compiler [17] for an ASIC target We were able todetermine the resource requirements and the actual timingperformance which depends on the actual clock frequency asfound by the FPGA and ASIC design CAD toolsThe variousbitstream files generated after FPGA synthesis mappingand place-and-route experiments were each tested in ourprototype They refer to a specific design case with nineantennas located around the numerical breast phantomnumber 012204 from the repository made available by theUniversity of Wisconsin [14] This phantom represents inthe database an average case in terms of type of tissue andnumber of voxelsWe have run finite-difference time-domain(FDTD) electromagnetic simulations to determine the signalreceived by the nine antennas and we fed the imagingunit with the received samples after 16-bit quantizationWe also extrapolated our performance results to a case offifty antennas which we assume it to be more or less thanmaximum number of antennas that can be placed around apatientrsquos breast

61 FPGA Experiments The FPGA resources the main partof which consists of look-up tables (LUT) and registers aregrouped in slices Depending on the complexity of the designand particularly on the number of filters 119873FIR a higher orlower amount of LUT and registers is used When mappingthe design the strategy of ISE does not consist in fillingthe slices one by one Instead to meet timing requirementsthe logic gates are spread over a high number of sliceswhich often results in a poor utilization Sometimes a slice isdeclared as occupied even if only one of its LUTs is usedThisfact limits the number of filters that we were able to put in ourVirtex-4 to a maximum of five FIRs The results of resourceutilization are reported in Table 3

We notice that the number of occupied slices increasesrapidly with the number of filters With six filters it exceedsthe number of available slices and we were not able to

VLSI Design 9

Table 3 Area allocation with different number of FIR filters

119873FIR Slice registers 4 input LUTs Occupied slices1 9081 (6) 8038 (5) 7188 (10)2 17347 (12) 20072 (14) 16115 (23)3 25551 (18) 44290 (32) 30899 (45)4 33851 (25) 67247 (49) 45139 (66)5 42055 (31) 91461 (67) 59936 (88)

Table 4 Hardware execution time over 14485 voxels with 9antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 128 s

119873FIR 119873ITER 119879exe [ms] Speed-up1 9 80639 15872 5 55336 23133 3 42581 30064 3 42650 30015 2 36249 35316 2 36249 35317 2 36249 35318 2 36295 35279 1 29871 4285

test that configuration in our prototype even though wecould estimate the performance in simulation The allocatedresources follow a trend that is almost linear with the numberof filters which is in accordance with our expectations

The timing performance depends proportionally on thenumber of voxels in the energy map This value is derivedfrom the 012204 phantom breast at 1mm resolution whichwe also used for the Octave simulations The 2D map gridconsists of 152 rows and 176 columns The Octave simulationcode was optimized to consider in the computation onlythose voxels which happened to be located in the ellipticalspace described by the antennas that is the area actuallyirradiated by the transmitters With such optimization thenumber of voxels is119873VOX = 14495

With such number of voxels the time required for thesoftware execution of the BEAF on the Intel i5 core of ourprototype (with Ubuntu 1110 kernel 30) was approximately119879SW = 128 sThe clock period for the FPGA implementationis 119879CLK = 16 ns which we use in our experiments with theprototype Table 4 reports the time required to execute theBEAF routine in hardware and the speed-up with respect to asoftware implementation Timing results with 119873FIR in range6ndash9 refer to logic simulations only because only five filterscould be placed in the FPGA However since all the resultsare consistent with (15) they are all fully accurate Table 5refers to a hypothetical case with fifty antennas and reportsonly simulated data which are also fully accurate comparedto a software execution that requires 119879SW = 71 s

In both Tables 4 and 5 we observe a similar trend Aswe expected the effect of changing the number of filtersmainly depends on the actual number of iterations 119873ITER =lceil119873ANT119873FIRrceil Increasing the number of filters boosts perfor-mance only when it allows to perform the computation with

Table 5 Hardware execution time over 14485 voxels with 50antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 71 s

119873FIR 119873ITER 119879exe [s] Speed-up1 50 448 15862 25 289 24533 17 238 29794 13 213 33315 10 194 36616 9 188 37867 8 181 39198 7 175 40599 6 168 421410ndash12 5 162 438113ndash16 4 156 456217ndash24 3 149 475625ndash49 2 143 497150 1 136 5205

less iterations For example in the case with nine antennaswe would not get a significant advantage from using eightfilters if we cannot use nine then we could as well use sixand still obtain basically the same performance we would getwith eight filters

The speed-up value grows rapidly for low values of119873FIRwhereas its marginal increase tends to diminish in the upperrangeThe reason is that the part of the algorithm that consistsof accessing the memory for reading samples and weightsremains sequential and cannot be made parallel

Considering the same phantom and iterating the per-formance evaluation in the 3D case with 1374953 voxels theresulting trends are similar To summarize in the worst caseof fifty antennas and 119879ITER = 50 the required time is 119879exe =42464 s (around 7 minutes) while in the best case of onlyone iteration we obtained119879exe = 12940 s (around 2minutes)With respect to a software implementation requiring 119879SW =

19 h the speedup is a factor 1611 and 5286 respectively inthe worst and best case

Referring to data discussed in Table 1 obtained for a worstcase breast phantom (in terms of number of voxels) theexecution time in the worst case (119879ITER = 50) would bearound 21 minutes (compared to the 58 h obtained in thesoftware case)

62 ASIC Experiments For this part of our experimentswe targeted a standard cell library in a 11 V 45 nm CMOStechnology We synthesized with Synopsys Design Compilervarious instances of our architecture varying the numberof filters and determined the relationship between speed-up(with respect to a software execution) and silicon area Thedesign when synthesized with the maximum optimizationeffort available can run at a clock period of 2 ns Thereforecompared to the FPGA results reported in the previoussection all the values of speed-up can be multiplied by afactor 8 (16 ns2 ns)

10 VLSI Design

0

1

2

3

4

5

6

7

8

9

10

5 10 15 20 25 30 35 40 45 50Number of FIR filters

Are

a (m

m2)

Figure 7 Postsynthesis ASIC area versus number of FIR filters fora 50 antennas system

100

150

200

250

300

350

400

450

0 1 2 3 4 5 6 7 8 9 10

Spee

d-up

ver

sus s

oftw

are

Area (mm2)

Figure 8 Speed-up versus area in an ASIC implementation of a 50antennas system

Figures 7 and 8 report silicon area as a function of thenumber of filters and speed-up as a function of the siliconarea respectively

The area curve in Figure 7 grows linearly with119873FIR witha slope of 01885mm2 which is in fact the area of a single FIRfilterThe area depends also on thememory sizeThememoryarea was obtained by considering the use of 100 memoryblocks (two for each antenna) each consisting of 256 bytesfor the signal samples (256 kB over 0457mm2) and a dual-portmemory blockwith 2048 bytes for theweights and delaysvalues (4 kB over 00153mm2) The memory area weighs onthe total area in a significant way only when a few filters areused because it does not depend on119873FIR

The speed-up that is the ratio of software execution timeand hardware execution time as a function of the silicon areashows a behavior similar to what we found for the FPGAcase but the acceleration factor is of course much biggerfrom around 50 to around 400 in the case of fifty antennasin the ASIC case This result would be notably improved incase a more scaled technology would be used [18] As we

0

2

4

6

8

10

12

14

16

5 10 15 20 25 30 35 40 45 50

Are

a and

spee

d-up

incr

ease

(rel

ativ

e uni

ts)

Number of FIR filters

Speed-up increaseArea increase

115

225

335

2 4 6 8 10

Zoom filter range 1 5

Figure 9 Trade-off between speed-up and area when changingthe number of filters in a 50 antennas ASIC implementation Cost-effective solutions are those with119873FIR le 6

utilize more and more area to make room for the filters thespeed-up first increases in a steep way then it bends becausethe sequential part of the algorithm starts dominating theexecution time

We report in Figure 9 two more curves the speed-upincrease and the area increase These two curves are simplythe normalized versions of the curves reported in Figures 7and 8 In practice we divided the area and the speed-up ofthe solution with119873FIR filters by the area and the speed-up ofthe solution with one filter respectivelyThe rationale behindthis choice is to evaluate the trade-off between performanceand cost In particular we judge that a parallel solutionis cost-effective if for a given percentage cost increase theperformance increases at least as much as the cost [19 20]From the inset in Figure 9 we are able to evaluate for whatvalue of 119873FIR the two curves cross each other The cost-effective solutions are those that use no more than six filters

7 Conclusions

Implementing in hardware the Mist-beamforming algorithmportion of the image reconstruction process determines aremarkable acceleration in terms of execution time From apractical point of view this acceleration implies a significantimprovement in the applicability of the system if we considerthe case of a 3D image reconstruction with a fifty-antennasystem we move from 19 hours of computation in thesoftware-only version to only about 7 minutes for an FPGAimplementation and less than one minute for an ASICimplementation even in the serial hardware configuration(ie with 119873FIR = 1) The FPGA clock frequency was set inour experiments to 625MHz which is the clock frequencyof the IP that communicates with the UWB external chipHowever our results show that the FPGA accelerator alonecan run up to 180MHz hence providing a further 288Xspeed-up with respect to the software version Therefore

VLSI Design 11

even the fully serial configuration gives us a noteworthyacceleration that justifies alone the validity of the hardwaredesign When using some degree of parallelism the timingperformance improves even more overcoming by far anypossible advantage of a software implementation We shouldconsider that Octave is an interpreted language and so thecode is executed by an interpreter program that interprets it atruntime Sometimes an interpreted code hasworse executiontime than an optimized compiled code written for instancein C We plan then to translate the Octave code in C asa next step Even though preliminary results suggest that a10X reduction of the CPU execution time can be achievedthis will not help reduce the BEAF part of the computationdown to an acceptable level (an acceleration of more than400X would be needed if an ASIC implementation with fiftyantennas is the reference point)Therefore the BEAF taskwillstill require a hardware accelerator An optimizedC codemayhelp instead to reduce the other tasks below the 1 s target

Thanks to the parameterized implementation our designcan be easily adapted to various configurations with differ-ent number of transmitters and receivers and for varioustechnology targets (FPGA or ASIC) with different availableresources This gives our design a high degree of flexibilityand scalability

Acknowledgment

This work was partially supported by the Italian Ministry ofEducation and Research under FIRB 2012 Project ldquoMICE-NEArdquo

References

[1] S C Hagness A Taflove and J E Bridges ldquoThree-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection design of an antenna-array elementrdquo IEEETransactions on Antennas and Propagation vol 47 no 5 pp783ndash791 1999

[2] X Li and S C Hagness ldquoA confocal microwave imagingalgorithm for breast cancer detectionrdquo IEEE Microwave andWireless Components Letters vol 11 no 3 pp 130ndash132 2001

[3] X Li S K Davis S C Hagness DW VanDerWeide and B DVan Veen ldquoMicrowave imaging via space-time beamformingexperimental investigation of tumor detection in multilayerbreast phantomsrdquo IEEE Transactions on Microwave Theory andTechniques vol 52 no 8 pp 1856ndash1865 2004

[4] E J Bond X Li S C Hagness and B D VanVeen ldquoMicrowaveimaging via space-time beamforming for early detection ofbreast cancerrdquo IEEE Transactions on Antennas and Propagationvol 51 no 8 pp 1690ndash1705 2003

[5] M R CasuM Crepaldi andMGraziano ldquoAVHDL-AMS sim-ulation environment for an UWB impulse radio transceiverrdquoIEEE Transactions on Circuits and Systems I vol 55 no 5 pp1368ndash1381 2008

[6] M Cutrupi M Crepaldi M R Casu and M Graziano ldquoAflexible UWB transmitter for breast cancer detection imagingsystemsrdquo in Proceedings of the Design Automation and Test inEurope Conference and Exhibition (DATE rsquo10) pp 1076ndash1081March 2010

[7] M R Casu M Graziano and M Zamboni ldquoA fully differentialdigital CMOS UWB pulse generatorrdquo Circuits Systems andSignal Processing vol 28 no 5 pp 649ndash664 2009

[8] M Crepaldi M R Casu M Graziano and M ZambonildquoA mixed-signal demodulator for a low-complexity IR-UWBreceiver methodology simulation and designrdquo Integration theVLSI Journal vol 42 no 1 pp 47ndash60 2009

[9] S C Hagness A Taflove and J E Bridges ldquoTwo-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection fixed-focus and antenna-array sensorsrdquo IEEETransactions onBiomedical Engineering vol 45 no 12 pp 1470ndash1474 1998

[10] S K Davis E J Bond X Li S C Hagness and B D Van VeenldquoMicrowave imaging via space-time beamforming for earlydetection of breast cancer beamformer design in the frequencydomainrdquo Journal of Electromagnetic Waves and Applicationsvol 17 no 2 pp 357ndash381 2003

[11] X Li E J Bond B D Van Veen and S C Hagness ldquoAnoverview of ultra-wideband microwave imaging via space-timebeamforming for early-stage breast-cancer detectionrdquo IEEEAntennas and Propagation Magazine vol 47 no 1 pp 19ndash342005

[12] P RosinganaDesign of aUWB Imaging System for Breast CancerDetection [MSc thesis] Politecnico di Torino

[13] J W Eaton Gnu Octave Manual NetworkTheory Ltd 2002[14] UWCEM ldquoNumerical Breast Phantom Repositoryrdquo http

uwcemecewisceduhomehtm[15] ModelSim Reference Manual v65e Mentor Graphics 2010[16] ISE Design Suite 13 Release Notes Guide Xilinx 2011[17] Design Compiler Reference Manual Optimization and Timing

Analysis Version D-201003 Synopsys 2010[18] A Pulimeno M Graziano and G Piccinini ldquoUdsm trends

comparison from technology roadmap to ultrasparc niagara2rdquoIEEE Transactions on Very Large Scale Integration (VLSI)Systems vol 20 no 7 pp 1341ndash1346 2012

[19] S V Tota M R Casu M R Roch L Rostagno and MZamboni ldquoMedea a hybrid shared-memorymessage-passingmultiprocessor NoC-based architecturerdquo in Proceedings of theDesign Automation and Test in Europe Conference and Exhibi-tion (DATE rsquo10) pp 45ndash50 March 2010

[20] M R Casu M R Roch S V Tota and M Zamboni ldquoANoC-based hybrid message-passingshared-memory approachto CMP designrdquo Microprocessors and Microsystems vol 35 no2 pp 261ndash273 2011

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 7: Research Article Hardware Acceleration of Beamforming in a ...downloads.hindawi.com/journals/vlsi/2013/861691.pdf · Hardware acceleration Microprocessor High speed link Backscattered

VLSI Design 7

FIR

FIR

FIR

Filter

FSM

Energycomputation

Energymemory

Win

dow

2

Weightsand delays

shiftregister

Window1

Window1

Window1

120591

120591

120591

Sum shift register

SumSamplesmemory

Figure 5 RTL block diagram of the BEAF unit

used for this purpose Finally the energy of the current voxelgets written into the energy memoryFSM To coordinate the whole computation process a finitestate machine (FSM) is used A set of counters in it also keepstrack of the number of computation steps performed at anycontrol step

The filtering functions need not to be necessarily per-formed in parallel Two extreme solutions are possible Aparallel implementation in which all the data are processedconcurrently and a serial implementation in which onefilter is iteratively used In the serial implementation anaccumulator is inserted at the output of the filter since theFIR outputs must be summed to compute the energy value

Between the two extremes hybrid solutions are alsopossible We can instantiate a number of filters (and thus anumber of FIR blocks) 119873FIR that matches our timing andresource requirementsWe can use them iteratively to processthe data from119873ANT antennas In case of a hybrid solution ateach iteration the results of the partial sumof the FIR outputsare fed back to the sum shift register block in Figure 5 andadded coherently to the current value at a given iterationWhen all the antennas sets have been processed the finalresult is fed to the energy computation block The finitestate machine is designed in a parameterized way so as tocorrectly handle all possible solutions with different degreesof parallelism

It is worth mentioning that a choice of degree of par-allelism determines not just the number of FIR blocks butalso the number of 120591 Window1 and Window2 blocks In thefollowing whenwe refer to119873FIR as a parameter for the degreeof parallelism we mean that all these blocks are instantiated119873FIR times

We verified our parameterized VHDL code by means ofextensive logic-level simulations based on Mentor GraphicsrsquoModelsim [15] varying the degree of parallelismWe validatedthe output obtained with the RTL code against the resultsobtained with a software implementation and refined ourdesign until they perfectly matched

44 Timing Performance of BEAF In our architectureeach voxelrsquos energy is sequentially computed therefore the

beamforming execution time is

119879BEAF = 119873vox sdot 119879vox (9)

where 119873vox is the number of voxels and 119879vox is the timerequired to compute each voxel

The number of clock cycles necessary to execute thebeamforming step for each voxel depends on the number ofFIR filters in Figure 5 As we discussed above the spectrumof possible solutions spans from a fully sequential implemen-tation with a single FIR iteratively used for all the antennasto a fully parallel solution with as many filters as there areantennas The number of iterations depends on the ratiobetween number of antennas and number of filters

119873iter = lceil119873ANT119873FIR

rceil (10)

119879vox is the sum of three contributions

119879vox = 119879119908 + 119879comp + 119879119908119903 (11)

119879119908is the time necessary to write the FIR weights in the

memory once they are received from the link that connectsthe processor with the accelerator to read the weights fromthe memory and to store them in local registers within thefilters

119879119908= (2 (119871 + 1)119873ANT + 2119873iter) 119879CLK (12)

The first termwithin parenthesis in (12) is the time for writingand reading from the memory 119871 weights The second term isan overhead for enabling the memory and for asserting a fewcontrol signals 119879CLK is the clock period

119879comp is the actual computation time and is given by

119879comp = (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ lceillog2(119873FIR + 1)rceil )119873iter119879CLK

(13)

In (13) 119871wp is the time distance evaluated in time samplesthat corresponds to the worst path between two differentantennas and is necessary for the time alignment of thevarious signals Term lceillog

2(119871)rceil+119899

ℎ+119897ℎis the number of clock

cycles needed to load the samples in the filters with 119899ℎand 119897ℎ

representing the first sample in awindow of interest and 119897ℎthe

size of that window according to equation (5) Seven clockcycles are needed for window traversal and various registerwriting The term that depends on the logarithm of 119873FIR isrelated to the final addition in Figure 5

The final term in (11) 119879nrj is given by

119879nrj = 3119873iter (14)

and represents the time needed to write the voxelrsquos energyvalue back in the memory

We rewrite the final expression of 119879vox obtained bywrapping up all previous equations

119879vox = (2 (119871 + 1) sdot 119873ANT + 2 sdot 119873iter

+ (119871wp + lceillog2 (119871)rceil + 119899ℎ + 119897ℎ + 7

+ log2(119873FIR + 1)) sdot 119873iter + 3 sdot 119873iter) 119879CLK

(15)

8 VLSI Design

Figure 6 Experimental setup for the FPGA prototype the personalcomputer executes in software (Octave) part of the algorithm andallows energy maps visualization A USB link transfers data to andfrom the FPGA board where part of the beamforming algorithm isaccelerated

Equation (15) can be used to determine the overallexecution time of the beamforming algorithm in (9) giventhe total number of voxels and applies both to a 2D anda 3D map The accuracy of (15) was tested with VHDLsimulations usingMentor Graphicsrsquo Modelsim simulator andexperimentally verified with the FPGA prototype describedin the next section

5 Emulation with an FPGA-Based Prototype

We built an emulation prototype of the imaging unit Thisprototype relies on a general purpose microprocessor toperform the portions of processing that we decided to executein software while the hardware part of the BEAF algorithmwas deployed on an Xilinx Virtex-4 XC4VLX160-FF1513FPGA In particular wewere able to connect the Intel Core-i5of a laptop with the FPGA located in an AVNET Virtex-4 LXdevelopment boardTheCypress FX2USB 20 chip integratedin the development board was used to communicate with theFPGATheUSB chip behaves simply as a conduit between theUSB and the external data-processing logic The FX2 clockwas set to 48MHz the data parallelism to 8 bits and thepacket size to 512 bytes

The prototype has multiple purposes It serves as anaccelerator of the system-level simulation because it permitsa much faster execution of the part of the algorithm executedby the FPGA compared to a logic-level simulation Theoverall architecture with the software and the hardwarerunning concurrently is clearly a close approximation ofthe final system even though the processor and the digitalhardware will be different in the final target The RTL VHDLcode that gets synthesized on the FPGA is the same that wewill eventually implement in an ASIC therefore we can fullyverify it before fabrication Finally we can anticipate whatthe exact performance will be in terms of clock cycles Theonly difference in performance will be then determined bythe different clock frequency in FPGA and ASIC

Both the software and the hardware code are fullyinstrumented to permit a cycle accurate evaluation of timingperformance The USB communication time is also correctly

estimated and decoupled from the overall performance as amuch faster link will be used in a realistic setting like the onedepicted in Figure 1

The software running on the microprocessor is layeredand so the application part and the communication partare independent and portable The communication part isalso further partitioned in a set of low-level hardware-specific USB primitives (we use the libusb library availablein Linux) and a set of higher-level technology-independentAPIs for data organization and communication When thehardware target changes as it happens when moving fromthe prototype to the final target the developers replace onlythe low-level communication primitives and reuse both theapplication code and the communication APIs

Figure 6 is a photograph of the prototype and showsan energy map elaborated in the FPGA sent to the hostcomputer and displayed in the host screen

6 Experimental Results on FPGA and ASIC

We ran several experiments first with Xilinxrsquos ISE software[16] for the FPGA target of our prototype then with SynopsysDesign Compiler [17] for an ASIC target We were able todetermine the resource requirements and the actual timingperformance which depends on the actual clock frequency asfound by the FPGA and ASIC design CAD toolsThe variousbitstream files generated after FPGA synthesis mappingand place-and-route experiments were each tested in ourprototype They refer to a specific design case with nineantennas located around the numerical breast phantomnumber 012204 from the repository made available by theUniversity of Wisconsin [14] This phantom represents inthe database an average case in terms of type of tissue andnumber of voxelsWe have run finite-difference time-domain(FDTD) electromagnetic simulations to determine the signalreceived by the nine antennas and we fed the imagingunit with the received samples after 16-bit quantizationWe also extrapolated our performance results to a case offifty antennas which we assume it to be more or less thanmaximum number of antennas that can be placed around apatientrsquos breast

61 FPGA Experiments The FPGA resources the main partof which consists of look-up tables (LUT) and registers aregrouped in slices Depending on the complexity of the designand particularly on the number of filters 119873FIR a higher orlower amount of LUT and registers is used When mappingthe design the strategy of ISE does not consist in fillingthe slices one by one Instead to meet timing requirementsthe logic gates are spread over a high number of sliceswhich often results in a poor utilization Sometimes a slice isdeclared as occupied even if only one of its LUTs is usedThisfact limits the number of filters that we were able to put in ourVirtex-4 to a maximum of five FIRs The results of resourceutilization are reported in Table 3

We notice that the number of occupied slices increasesrapidly with the number of filters With six filters it exceedsthe number of available slices and we were not able to

VLSI Design 9

Table 3 Area allocation with different number of FIR filters

119873FIR Slice registers 4 input LUTs Occupied slices1 9081 (6) 8038 (5) 7188 (10)2 17347 (12) 20072 (14) 16115 (23)3 25551 (18) 44290 (32) 30899 (45)4 33851 (25) 67247 (49) 45139 (66)5 42055 (31) 91461 (67) 59936 (88)

Table 4 Hardware execution time over 14485 voxels with 9antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 128 s

119873FIR 119873ITER 119879exe [ms] Speed-up1 9 80639 15872 5 55336 23133 3 42581 30064 3 42650 30015 2 36249 35316 2 36249 35317 2 36249 35318 2 36295 35279 1 29871 4285

test that configuration in our prototype even though wecould estimate the performance in simulation The allocatedresources follow a trend that is almost linear with the numberof filters which is in accordance with our expectations

The timing performance depends proportionally on thenumber of voxels in the energy map This value is derivedfrom the 012204 phantom breast at 1mm resolution whichwe also used for the Octave simulations The 2D map gridconsists of 152 rows and 176 columns The Octave simulationcode was optimized to consider in the computation onlythose voxels which happened to be located in the ellipticalspace described by the antennas that is the area actuallyirradiated by the transmitters With such optimization thenumber of voxels is119873VOX = 14495

With such number of voxels the time required for thesoftware execution of the BEAF on the Intel i5 core of ourprototype (with Ubuntu 1110 kernel 30) was approximately119879SW = 128 sThe clock period for the FPGA implementationis 119879CLK = 16 ns which we use in our experiments with theprototype Table 4 reports the time required to execute theBEAF routine in hardware and the speed-up with respect to asoftware implementation Timing results with 119873FIR in range6ndash9 refer to logic simulations only because only five filterscould be placed in the FPGA However since all the resultsare consistent with (15) they are all fully accurate Table 5refers to a hypothetical case with fifty antennas and reportsonly simulated data which are also fully accurate comparedto a software execution that requires 119879SW = 71 s

In both Tables 4 and 5 we observe a similar trend Aswe expected the effect of changing the number of filtersmainly depends on the actual number of iterations 119873ITER =lceil119873ANT119873FIRrceil Increasing the number of filters boosts perfor-mance only when it allows to perform the computation with

Table 5 Hardware execution time over 14485 voxels with 50antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 71 s

119873FIR 119873ITER 119879exe [s] Speed-up1 50 448 15862 25 289 24533 17 238 29794 13 213 33315 10 194 36616 9 188 37867 8 181 39198 7 175 40599 6 168 421410ndash12 5 162 438113ndash16 4 156 456217ndash24 3 149 475625ndash49 2 143 497150 1 136 5205

less iterations For example in the case with nine antennaswe would not get a significant advantage from using eightfilters if we cannot use nine then we could as well use sixand still obtain basically the same performance we would getwith eight filters

The speed-up value grows rapidly for low values of119873FIRwhereas its marginal increase tends to diminish in the upperrangeThe reason is that the part of the algorithm that consistsof accessing the memory for reading samples and weightsremains sequential and cannot be made parallel

Considering the same phantom and iterating the per-formance evaluation in the 3D case with 1374953 voxels theresulting trends are similar To summarize in the worst caseof fifty antennas and 119879ITER = 50 the required time is 119879exe =42464 s (around 7 minutes) while in the best case of onlyone iteration we obtained119879exe = 12940 s (around 2minutes)With respect to a software implementation requiring 119879SW =

19 h the speedup is a factor 1611 and 5286 respectively inthe worst and best case

Referring to data discussed in Table 1 obtained for a worstcase breast phantom (in terms of number of voxels) theexecution time in the worst case (119879ITER = 50) would bearound 21 minutes (compared to the 58 h obtained in thesoftware case)

62 ASIC Experiments For this part of our experimentswe targeted a standard cell library in a 11 V 45 nm CMOStechnology We synthesized with Synopsys Design Compilervarious instances of our architecture varying the numberof filters and determined the relationship between speed-up(with respect to a software execution) and silicon area Thedesign when synthesized with the maximum optimizationeffort available can run at a clock period of 2 ns Thereforecompared to the FPGA results reported in the previoussection all the values of speed-up can be multiplied by afactor 8 (16 ns2 ns)

10 VLSI Design

0

1

2

3

4

5

6

7

8

9

10

5 10 15 20 25 30 35 40 45 50Number of FIR filters

Are

a (m

m2)

Figure 7 Postsynthesis ASIC area versus number of FIR filters fora 50 antennas system

100

150

200

250

300

350

400

450

0 1 2 3 4 5 6 7 8 9 10

Spee

d-up

ver

sus s

oftw

are

Area (mm2)

Figure 8 Speed-up versus area in an ASIC implementation of a 50antennas system

Figures 7 and 8 report silicon area as a function of thenumber of filters and speed-up as a function of the siliconarea respectively

The area curve in Figure 7 grows linearly with119873FIR witha slope of 01885mm2 which is in fact the area of a single FIRfilterThe area depends also on thememory sizeThememoryarea was obtained by considering the use of 100 memoryblocks (two for each antenna) each consisting of 256 bytesfor the signal samples (256 kB over 0457mm2) and a dual-portmemory blockwith 2048 bytes for theweights and delaysvalues (4 kB over 00153mm2) The memory area weighs onthe total area in a significant way only when a few filters areused because it does not depend on119873FIR

The speed-up that is the ratio of software execution timeand hardware execution time as a function of the silicon areashows a behavior similar to what we found for the FPGAcase but the acceleration factor is of course much biggerfrom around 50 to around 400 in the case of fifty antennasin the ASIC case This result would be notably improved incase a more scaled technology would be used [18] As we

0

2

4

6

8

10

12

14

16

5 10 15 20 25 30 35 40 45 50

Are

a and

spee

d-up

incr

ease

(rel

ativ

e uni

ts)

Number of FIR filters

Speed-up increaseArea increase

115

225

335

2 4 6 8 10

Zoom filter range 1 5

Figure 9 Trade-off between speed-up and area when changingthe number of filters in a 50 antennas ASIC implementation Cost-effective solutions are those with119873FIR le 6

utilize more and more area to make room for the filters thespeed-up first increases in a steep way then it bends becausethe sequential part of the algorithm starts dominating theexecution time

We report in Figure 9 two more curves the speed-upincrease and the area increase These two curves are simplythe normalized versions of the curves reported in Figures 7and 8 In practice we divided the area and the speed-up ofthe solution with119873FIR filters by the area and the speed-up ofthe solution with one filter respectivelyThe rationale behindthis choice is to evaluate the trade-off between performanceand cost In particular we judge that a parallel solutionis cost-effective if for a given percentage cost increase theperformance increases at least as much as the cost [19 20]From the inset in Figure 9 we are able to evaluate for whatvalue of 119873FIR the two curves cross each other The cost-effective solutions are those that use no more than six filters

7 Conclusions

Implementing in hardware the Mist-beamforming algorithmportion of the image reconstruction process determines aremarkable acceleration in terms of execution time From apractical point of view this acceleration implies a significantimprovement in the applicability of the system if we considerthe case of a 3D image reconstruction with a fifty-antennasystem we move from 19 hours of computation in thesoftware-only version to only about 7 minutes for an FPGAimplementation and less than one minute for an ASICimplementation even in the serial hardware configuration(ie with 119873FIR = 1) The FPGA clock frequency was set inour experiments to 625MHz which is the clock frequencyof the IP that communicates with the UWB external chipHowever our results show that the FPGA accelerator alonecan run up to 180MHz hence providing a further 288Xspeed-up with respect to the software version Therefore

VLSI Design 11

even the fully serial configuration gives us a noteworthyacceleration that justifies alone the validity of the hardwaredesign When using some degree of parallelism the timingperformance improves even more overcoming by far anypossible advantage of a software implementation We shouldconsider that Octave is an interpreted language and so thecode is executed by an interpreter program that interprets it atruntime Sometimes an interpreted code hasworse executiontime than an optimized compiled code written for instancein C We plan then to translate the Octave code in C asa next step Even though preliminary results suggest that a10X reduction of the CPU execution time can be achievedthis will not help reduce the BEAF part of the computationdown to an acceptable level (an acceleration of more than400X would be needed if an ASIC implementation with fiftyantennas is the reference point)Therefore the BEAF taskwillstill require a hardware accelerator An optimizedC codemayhelp instead to reduce the other tasks below the 1 s target

Thanks to the parameterized implementation our designcan be easily adapted to various configurations with differ-ent number of transmitters and receivers and for varioustechnology targets (FPGA or ASIC) with different availableresources This gives our design a high degree of flexibilityand scalability

Acknowledgment

This work was partially supported by the Italian Ministry ofEducation and Research under FIRB 2012 Project ldquoMICE-NEArdquo

References

[1] S C Hagness A Taflove and J E Bridges ldquoThree-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection design of an antenna-array elementrdquo IEEETransactions on Antennas and Propagation vol 47 no 5 pp783ndash791 1999

[2] X Li and S C Hagness ldquoA confocal microwave imagingalgorithm for breast cancer detectionrdquo IEEE Microwave andWireless Components Letters vol 11 no 3 pp 130ndash132 2001

[3] X Li S K Davis S C Hagness DW VanDerWeide and B DVan Veen ldquoMicrowave imaging via space-time beamformingexperimental investigation of tumor detection in multilayerbreast phantomsrdquo IEEE Transactions on Microwave Theory andTechniques vol 52 no 8 pp 1856ndash1865 2004

[4] E J Bond X Li S C Hagness and B D VanVeen ldquoMicrowaveimaging via space-time beamforming for early detection ofbreast cancerrdquo IEEE Transactions on Antennas and Propagationvol 51 no 8 pp 1690ndash1705 2003

[5] M R CasuM Crepaldi andMGraziano ldquoAVHDL-AMS sim-ulation environment for an UWB impulse radio transceiverrdquoIEEE Transactions on Circuits and Systems I vol 55 no 5 pp1368ndash1381 2008

[6] M Cutrupi M Crepaldi M R Casu and M Graziano ldquoAflexible UWB transmitter for breast cancer detection imagingsystemsrdquo in Proceedings of the Design Automation and Test inEurope Conference and Exhibition (DATE rsquo10) pp 1076ndash1081March 2010

[7] M R Casu M Graziano and M Zamboni ldquoA fully differentialdigital CMOS UWB pulse generatorrdquo Circuits Systems andSignal Processing vol 28 no 5 pp 649ndash664 2009

[8] M Crepaldi M R Casu M Graziano and M ZambonildquoA mixed-signal demodulator for a low-complexity IR-UWBreceiver methodology simulation and designrdquo Integration theVLSI Journal vol 42 no 1 pp 47ndash60 2009

[9] S C Hagness A Taflove and J E Bridges ldquoTwo-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection fixed-focus and antenna-array sensorsrdquo IEEETransactions onBiomedical Engineering vol 45 no 12 pp 1470ndash1474 1998

[10] S K Davis E J Bond X Li S C Hagness and B D Van VeenldquoMicrowave imaging via space-time beamforming for earlydetection of breast cancer beamformer design in the frequencydomainrdquo Journal of Electromagnetic Waves and Applicationsvol 17 no 2 pp 357ndash381 2003

[11] X Li E J Bond B D Van Veen and S C Hagness ldquoAnoverview of ultra-wideband microwave imaging via space-timebeamforming for early-stage breast-cancer detectionrdquo IEEEAntennas and Propagation Magazine vol 47 no 1 pp 19ndash342005

[12] P RosinganaDesign of aUWB Imaging System for Breast CancerDetection [MSc thesis] Politecnico di Torino

[13] J W Eaton Gnu Octave Manual NetworkTheory Ltd 2002[14] UWCEM ldquoNumerical Breast Phantom Repositoryrdquo http

uwcemecewisceduhomehtm[15] ModelSim Reference Manual v65e Mentor Graphics 2010[16] ISE Design Suite 13 Release Notes Guide Xilinx 2011[17] Design Compiler Reference Manual Optimization and Timing

Analysis Version D-201003 Synopsys 2010[18] A Pulimeno M Graziano and G Piccinini ldquoUdsm trends

comparison from technology roadmap to ultrasparc niagara2rdquoIEEE Transactions on Very Large Scale Integration (VLSI)Systems vol 20 no 7 pp 1341ndash1346 2012

[19] S V Tota M R Casu M R Roch L Rostagno and MZamboni ldquoMedea a hybrid shared-memorymessage-passingmultiprocessor NoC-based architecturerdquo in Proceedings of theDesign Automation and Test in Europe Conference and Exhibi-tion (DATE rsquo10) pp 45ndash50 March 2010

[20] M R Casu M R Roch S V Tota and M Zamboni ldquoANoC-based hybrid message-passingshared-memory approachto CMP designrdquo Microprocessors and Microsystems vol 35 no2 pp 261ndash273 2011

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 8: Research Article Hardware Acceleration of Beamforming in a ...downloads.hindawi.com/journals/vlsi/2013/861691.pdf · Hardware acceleration Microprocessor High speed link Backscattered

8 VLSI Design

Figure 6 Experimental setup for the FPGA prototype the personalcomputer executes in software (Octave) part of the algorithm andallows energy maps visualization A USB link transfers data to andfrom the FPGA board where part of the beamforming algorithm isaccelerated

Equation (15) can be used to determine the overallexecution time of the beamforming algorithm in (9) giventhe total number of voxels and applies both to a 2D anda 3D map The accuracy of (15) was tested with VHDLsimulations usingMentor Graphicsrsquo Modelsim simulator andexperimentally verified with the FPGA prototype describedin the next section

5 Emulation with an FPGA-Based Prototype

We built an emulation prototype of the imaging unit Thisprototype relies on a general purpose microprocessor toperform the portions of processing that we decided to executein software while the hardware part of the BEAF algorithmwas deployed on an Xilinx Virtex-4 XC4VLX160-FF1513FPGA In particular wewere able to connect the Intel Core-i5of a laptop with the FPGA located in an AVNET Virtex-4 LXdevelopment boardTheCypress FX2USB 20 chip integratedin the development board was used to communicate with theFPGATheUSB chip behaves simply as a conduit between theUSB and the external data-processing logic The FX2 clockwas set to 48MHz the data parallelism to 8 bits and thepacket size to 512 bytes

The prototype has multiple purposes It serves as anaccelerator of the system-level simulation because it permitsa much faster execution of the part of the algorithm executedby the FPGA compared to a logic-level simulation Theoverall architecture with the software and the hardwarerunning concurrently is clearly a close approximation ofthe final system even though the processor and the digitalhardware will be different in the final target The RTL VHDLcode that gets synthesized on the FPGA is the same that wewill eventually implement in an ASIC therefore we can fullyverify it before fabrication Finally we can anticipate whatthe exact performance will be in terms of clock cycles Theonly difference in performance will be then determined bythe different clock frequency in FPGA and ASIC

Both the software and the hardware code are fullyinstrumented to permit a cycle accurate evaluation of timingperformance The USB communication time is also correctly

estimated and decoupled from the overall performance as amuch faster link will be used in a realistic setting like the onedepicted in Figure 1

The software running on the microprocessor is layeredand so the application part and the communication partare independent and portable The communication part isalso further partitioned in a set of low-level hardware-specific USB primitives (we use the libusb library availablein Linux) and a set of higher-level technology-independentAPIs for data organization and communication When thehardware target changes as it happens when moving fromthe prototype to the final target the developers replace onlythe low-level communication primitives and reuse both theapplication code and the communication APIs

Figure 6 is a photograph of the prototype and showsan energy map elaborated in the FPGA sent to the hostcomputer and displayed in the host screen

6 Experimental Results on FPGA and ASIC

We ran several experiments first with Xilinxrsquos ISE software[16] for the FPGA target of our prototype then with SynopsysDesign Compiler [17] for an ASIC target We were able todetermine the resource requirements and the actual timingperformance which depends on the actual clock frequency asfound by the FPGA and ASIC design CAD toolsThe variousbitstream files generated after FPGA synthesis mappingand place-and-route experiments were each tested in ourprototype They refer to a specific design case with nineantennas located around the numerical breast phantomnumber 012204 from the repository made available by theUniversity of Wisconsin [14] This phantom represents inthe database an average case in terms of type of tissue andnumber of voxelsWe have run finite-difference time-domain(FDTD) electromagnetic simulations to determine the signalreceived by the nine antennas and we fed the imagingunit with the received samples after 16-bit quantizationWe also extrapolated our performance results to a case offifty antennas which we assume it to be more or less thanmaximum number of antennas that can be placed around apatientrsquos breast

61 FPGA Experiments The FPGA resources the main partof which consists of look-up tables (LUT) and registers aregrouped in slices Depending on the complexity of the designand particularly on the number of filters 119873FIR a higher orlower amount of LUT and registers is used When mappingthe design the strategy of ISE does not consist in fillingthe slices one by one Instead to meet timing requirementsthe logic gates are spread over a high number of sliceswhich often results in a poor utilization Sometimes a slice isdeclared as occupied even if only one of its LUTs is usedThisfact limits the number of filters that we were able to put in ourVirtex-4 to a maximum of five FIRs The results of resourceutilization are reported in Table 3

We notice that the number of occupied slices increasesrapidly with the number of filters With six filters it exceedsthe number of available slices and we were not able to

VLSI Design 9

Table 3 Area allocation with different number of FIR filters

119873FIR Slice registers 4 input LUTs Occupied slices1 9081 (6) 8038 (5) 7188 (10)2 17347 (12) 20072 (14) 16115 (23)3 25551 (18) 44290 (32) 30899 (45)4 33851 (25) 67247 (49) 45139 (66)5 42055 (31) 91461 (67) 59936 (88)

Table 4 Hardware execution time over 14485 voxels with 9antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 128 s

119873FIR 119873ITER 119879exe [ms] Speed-up1 9 80639 15872 5 55336 23133 3 42581 30064 3 42650 30015 2 36249 35316 2 36249 35317 2 36249 35318 2 36295 35279 1 29871 4285

test that configuration in our prototype even though wecould estimate the performance in simulation The allocatedresources follow a trend that is almost linear with the numberof filters which is in accordance with our expectations

The timing performance depends proportionally on thenumber of voxels in the energy map This value is derivedfrom the 012204 phantom breast at 1mm resolution whichwe also used for the Octave simulations The 2D map gridconsists of 152 rows and 176 columns The Octave simulationcode was optimized to consider in the computation onlythose voxels which happened to be located in the ellipticalspace described by the antennas that is the area actuallyirradiated by the transmitters With such optimization thenumber of voxels is119873VOX = 14495

With such number of voxels the time required for thesoftware execution of the BEAF on the Intel i5 core of ourprototype (with Ubuntu 1110 kernel 30) was approximately119879SW = 128 sThe clock period for the FPGA implementationis 119879CLK = 16 ns which we use in our experiments with theprototype Table 4 reports the time required to execute theBEAF routine in hardware and the speed-up with respect to asoftware implementation Timing results with 119873FIR in range6ndash9 refer to logic simulations only because only five filterscould be placed in the FPGA However since all the resultsare consistent with (15) they are all fully accurate Table 5refers to a hypothetical case with fifty antennas and reportsonly simulated data which are also fully accurate comparedto a software execution that requires 119879SW = 71 s

In both Tables 4 and 5 we observe a similar trend Aswe expected the effect of changing the number of filtersmainly depends on the actual number of iterations 119873ITER =lceil119873ANT119873FIRrceil Increasing the number of filters boosts perfor-mance only when it allows to perform the computation with

Table 5 Hardware execution time over 14485 voxels with 50antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 71 s

119873FIR 119873ITER 119879exe [s] Speed-up1 50 448 15862 25 289 24533 17 238 29794 13 213 33315 10 194 36616 9 188 37867 8 181 39198 7 175 40599 6 168 421410ndash12 5 162 438113ndash16 4 156 456217ndash24 3 149 475625ndash49 2 143 497150 1 136 5205

less iterations For example in the case with nine antennaswe would not get a significant advantage from using eightfilters if we cannot use nine then we could as well use sixand still obtain basically the same performance we would getwith eight filters

The speed-up value grows rapidly for low values of119873FIRwhereas its marginal increase tends to diminish in the upperrangeThe reason is that the part of the algorithm that consistsof accessing the memory for reading samples and weightsremains sequential and cannot be made parallel

Considering the same phantom and iterating the per-formance evaluation in the 3D case with 1374953 voxels theresulting trends are similar To summarize in the worst caseof fifty antennas and 119879ITER = 50 the required time is 119879exe =42464 s (around 7 minutes) while in the best case of onlyone iteration we obtained119879exe = 12940 s (around 2minutes)With respect to a software implementation requiring 119879SW =

19 h the speedup is a factor 1611 and 5286 respectively inthe worst and best case

Referring to data discussed in Table 1 obtained for a worstcase breast phantom (in terms of number of voxels) theexecution time in the worst case (119879ITER = 50) would bearound 21 minutes (compared to the 58 h obtained in thesoftware case)

62 ASIC Experiments For this part of our experimentswe targeted a standard cell library in a 11 V 45 nm CMOStechnology We synthesized with Synopsys Design Compilervarious instances of our architecture varying the numberof filters and determined the relationship between speed-up(with respect to a software execution) and silicon area Thedesign when synthesized with the maximum optimizationeffort available can run at a clock period of 2 ns Thereforecompared to the FPGA results reported in the previoussection all the values of speed-up can be multiplied by afactor 8 (16 ns2 ns)

10 VLSI Design

0

1

2

3

4

5

6

7

8

9

10

5 10 15 20 25 30 35 40 45 50Number of FIR filters

Are

a (m

m2)

Figure 7 Postsynthesis ASIC area versus number of FIR filters fora 50 antennas system

100

150

200

250

300

350

400

450

0 1 2 3 4 5 6 7 8 9 10

Spee

d-up

ver

sus s

oftw

are

Area (mm2)

Figure 8 Speed-up versus area in an ASIC implementation of a 50antennas system

Figures 7 and 8 report silicon area as a function of thenumber of filters and speed-up as a function of the siliconarea respectively

The area curve in Figure 7 grows linearly with119873FIR witha slope of 01885mm2 which is in fact the area of a single FIRfilterThe area depends also on thememory sizeThememoryarea was obtained by considering the use of 100 memoryblocks (two for each antenna) each consisting of 256 bytesfor the signal samples (256 kB over 0457mm2) and a dual-portmemory blockwith 2048 bytes for theweights and delaysvalues (4 kB over 00153mm2) The memory area weighs onthe total area in a significant way only when a few filters areused because it does not depend on119873FIR

The speed-up that is the ratio of software execution timeand hardware execution time as a function of the silicon areashows a behavior similar to what we found for the FPGAcase but the acceleration factor is of course much biggerfrom around 50 to around 400 in the case of fifty antennasin the ASIC case This result would be notably improved incase a more scaled technology would be used [18] As we

0

2

4

6

8

10

12

14

16

5 10 15 20 25 30 35 40 45 50

Are

a and

spee

d-up

incr

ease

(rel

ativ

e uni

ts)

Number of FIR filters

Speed-up increaseArea increase

115

225

335

2 4 6 8 10

Zoom filter range 1 5

Figure 9 Trade-off between speed-up and area when changingthe number of filters in a 50 antennas ASIC implementation Cost-effective solutions are those with119873FIR le 6

utilize more and more area to make room for the filters thespeed-up first increases in a steep way then it bends becausethe sequential part of the algorithm starts dominating theexecution time

We report in Figure 9 two more curves the speed-upincrease and the area increase These two curves are simplythe normalized versions of the curves reported in Figures 7and 8 In practice we divided the area and the speed-up ofthe solution with119873FIR filters by the area and the speed-up ofthe solution with one filter respectivelyThe rationale behindthis choice is to evaluate the trade-off between performanceand cost In particular we judge that a parallel solutionis cost-effective if for a given percentage cost increase theperformance increases at least as much as the cost [19 20]From the inset in Figure 9 we are able to evaluate for whatvalue of 119873FIR the two curves cross each other The cost-effective solutions are those that use no more than six filters

7 Conclusions

Implementing in hardware the Mist-beamforming algorithmportion of the image reconstruction process determines aremarkable acceleration in terms of execution time From apractical point of view this acceleration implies a significantimprovement in the applicability of the system if we considerthe case of a 3D image reconstruction with a fifty-antennasystem we move from 19 hours of computation in thesoftware-only version to only about 7 minutes for an FPGAimplementation and less than one minute for an ASICimplementation even in the serial hardware configuration(ie with 119873FIR = 1) The FPGA clock frequency was set inour experiments to 625MHz which is the clock frequencyof the IP that communicates with the UWB external chipHowever our results show that the FPGA accelerator alonecan run up to 180MHz hence providing a further 288Xspeed-up with respect to the software version Therefore

VLSI Design 11

even the fully serial configuration gives us a noteworthyacceleration that justifies alone the validity of the hardwaredesign When using some degree of parallelism the timingperformance improves even more overcoming by far anypossible advantage of a software implementation We shouldconsider that Octave is an interpreted language and so thecode is executed by an interpreter program that interprets it atruntime Sometimes an interpreted code hasworse executiontime than an optimized compiled code written for instancein C We plan then to translate the Octave code in C asa next step Even though preliminary results suggest that a10X reduction of the CPU execution time can be achievedthis will not help reduce the BEAF part of the computationdown to an acceptable level (an acceleration of more than400X would be needed if an ASIC implementation with fiftyantennas is the reference point)Therefore the BEAF taskwillstill require a hardware accelerator An optimizedC codemayhelp instead to reduce the other tasks below the 1 s target

Thanks to the parameterized implementation our designcan be easily adapted to various configurations with differ-ent number of transmitters and receivers and for varioustechnology targets (FPGA or ASIC) with different availableresources This gives our design a high degree of flexibilityand scalability

Acknowledgment

This work was partially supported by the Italian Ministry ofEducation and Research under FIRB 2012 Project ldquoMICE-NEArdquo

References

[1] S C Hagness A Taflove and J E Bridges ldquoThree-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection design of an antenna-array elementrdquo IEEETransactions on Antennas and Propagation vol 47 no 5 pp783ndash791 1999

[2] X Li and S C Hagness ldquoA confocal microwave imagingalgorithm for breast cancer detectionrdquo IEEE Microwave andWireless Components Letters vol 11 no 3 pp 130ndash132 2001

[3] X Li S K Davis S C Hagness DW VanDerWeide and B DVan Veen ldquoMicrowave imaging via space-time beamformingexperimental investigation of tumor detection in multilayerbreast phantomsrdquo IEEE Transactions on Microwave Theory andTechniques vol 52 no 8 pp 1856ndash1865 2004

[4] E J Bond X Li S C Hagness and B D VanVeen ldquoMicrowaveimaging via space-time beamforming for early detection ofbreast cancerrdquo IEEE Transactions on Antennas and Propagationvol 51 no 8 pp 1690ndash1705 2003

[5] M R CasuM Crepaldi andMGraziano ldquoAVHDL-AMS sim-ulation environment for an UWB impulse radio transceiverrdquoIEEE Transactions on Circuits and Systems I vol 55 no 5 pp1368ndash1381 2008

[6] M Cutrupi M Crepaldi M R Casu and M Graziano ldquoAflexible UWB transmitter for breast cancer detection imagingsystemsrdquo in Proceedings of the Design Automation and Test inEurope Conference and Exhibition (DATE rsquo10) pp 1076ndash1081March 2010

[7] M R Casu M Graziano and M Zamboni ldquoA fully differentialdigital CMOS UWB pulse generatorrdquo Circuits Systems andSignal Processing vol 28 no 5 pp 649ndash664 2009

[8] M Crepaldi M R Casu M Graziano and M ZambonildquoA mixed-signal demodulator for a low-complexity IR-UWBreceiver methodology simulation and designrdquo Integration theVLSI Journal vol 42 no 1 pp 47ndash60 2009

[9] S C Hagness A Taflove and J E Bridges ldquoTwo-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection fixed-focus and antenna-array sensorsrdquo IEEETransactions onBiomedical Engineering vol 45 no 12 pp 1470ndash1474 1998

[10] S K Davis E J Bond X Li S C Hagness and B D Van VeenldquoMicrowave imaging via space-time beamforming for earlydetection of breast cancer beamformer design in the frequencydomainrdquo Journal of Electromagnetic Waves and Applicationsvol 17 no 2 pp 357ndash381 2003

[11] X Li E J Bond B D Van Veen and S C Hagness ldquoAnoverview of ultra-wideband microwave imaging via space-timebeamforming for early-stage breast-cancer detectionrdquo IEEEAntennas and Propagation Magazine vol 47 no 1 pp 19ndash342005

[12] P RosinganaDesign of aUWB Imaging System for Breast CancerDetection [MSc thesis] Politecnico di Torino

[13] J W Eaton Gnu Octave Manual NetworkTheory Ltd 2002[14] UWCEM ldquoNumerical Breast Phantom Repositoryrdquo http

uwcemecewisceduhomehtm[15] ModelSim Reference Manual v65e Mentor Graphics 2010[16] ISE Design Suite 13 Release Notes Guide Xilinx 2011[17] Design Compiler Reference Manual Optimization and Timing

Analysis Version D-201003 Synopsys 2010[18] A Pulimeno M Graziano and G Piccinini ldquoUdsm trends

comparison from technology roadmap to ultrasparc niagara2rdquoIEEE Transactions on Very Large Scale Integration (VLSI)Systems vol 20 no 7 pp 1341ndash1346 2012

[19] S V Tota M R Casu M R Roch L Rostagno and MZamboni ldquoMedea a hybrid shared-memorymessage-passingmultiprocessor NoC-based architecturerdquo in Proceedings of theDesign Automation and Test in Europe Conference and Exhibi-tion (DATE rsquo10) pp 45ndash50 March 2010

[20] M R Casu M R Roch S V Tota and M Zamboni ldquoANoC-based hybrid message-passingshared-memory approachto CMP designrdquo Microprocessors and Microsystems vol 35 no2 pp 261ndash273 2011

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 9: Research Article Hardware Acceleration of Beamforming in a ...downloads.hindawi.com/journals/vlsi/2013/861691.pdf · Hardware acceleration Microprocessor High speed link Backscattered

VLSI Design 9

Table 3 Area allocation with different number of FIR filters

119873FIR Slice registers 4 input LUTs Occupied slices1 9081 (6) 8038 (5) 7188 (10)2 17347 (12) 20072 (14) 16115 (23)3 25551 (18) 44290 (32) 30899 (45)4 33851 (25) 67247 (49) 45139 (66)5 42055 (31) 91461 (67) 59936 (88)

Table 4 Hardware execution time over 14485 voxels with 9antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 128 s

119873FIR 119873ITER 119879exe [ms] Speed-up1 9 80639 15872 5 55336 23133 3 42581 30064 3 42650 30015 2 36249 35316 2 36249 35317 2 36249 35318 2 36295 35279 1 29871 4285

test that configuration in our prototype even though wecould estimate the performance in simulation The allocatedresources follow a trend that is almost linear with the numberof filters which is in accordance with our expectations

The timing performance depends proportionally on thenumber of voxels in the energy map This value is derivedfrom the 012204 phantom breast at 1mm resolution whichwe also used for the Octave simulations The 2D map gridconsists of 152 rows and 176 columns The Octave simulationcode was optimized to consider in the computation onlythose voxels which happened to be located in the ellipticalspace described by the antennas that is the area actuallyirradiated by the transmitters With such optimization thenumber of voxels is119873VOX = 14495

With such number of voxels the time required for thesoftware execution of the BEAF on the Intel i5 core of ourprototype (with Ubuntu 1110 kernel 30) was approximately119879SW = 128 sThe clock period for the FPGA implementationis 119879CLK = 16 ns which we use in our experiments with theprototype Table 4 reports the time required to execute theBEAF routine in hardware and the speed-up with respect to asoftware implementation Timing results with 119873FIR in range6ndash9 refer to logic simulations only because only five filterscould be placed in the FPGA However since all the resultsare consistent with (15) they are all fully accurate Table 5refers to a hypothetical case with fifty antennas and reportsonly simulated data which are also fully accurate comparedto a software execution that requires 119879SW = 71 s

In both Tables 4 and 5 we observe a similar trend Aswe expected the effect of changing the number of filtersmainly depends on the actual number of iterations 119873ITER =lceil119873ANT119873FIRrceil Increasing the number of filters boosts perfor-mance only when it allows to perform the computation with

Table 5 Hardware execution time over 14485 voxels with 50antennas and 16 ns clock period and speed-up with respect to asoftware execution requiring 119879SW = 71 s

119873FIR 119873ITER 119879exe [s] Speed-up1 50 448 15862 25 289 24533 17 238 29794 13 213 33315 10 194 36616 9 188 37867 8 181 39198 7 175 40599 6 168 421410ndash12 5 162 438113ndash16 4 156 456217ndash24 3 149 475625ndash49 2 143 497150 1 136 5205

less iterations For example in the case with nine antennaswe would not get a significant advantage from using eightfilters if we cannot use nine then we could as well use sixand still obtain basically the same performance we would getwith eight filters

The speed-up value grows rapidly for low values of119873FIRwhereas its marginal increase tends to diminish in the upperrangeThe reason is that the part of the algorithm that consistsof accessing the memory for reading samples and weightsremains sequential and cannot be made parallel

Considering the same phantom and iterating the per-formance evaluation in the 3D case with 1374953 voxels theresulting trends are similar To summarize in the worst caseof fifty antennas and 119879ITER = 50 the required time is 119879exe =42464 s (around 7 minutes) while in the best case of onlyone iteration we obtained119879exe = 12940 s (around 2minutes)With respect to a software implementation requiring 119879SW =

19 h the speedup is a factor 1611 and 5286 respectively inthe worst and best case

Referring to data discussed in Table 1 obtained for a worstcase breast phantom (in terms of number of voxels) theexecution time in the worst case (119879ITER = 50) would bearound 21 minutes (compared to the 58 h obtained in thesoftware case)

62 ASIC Experiments For this part of our experimentswe targeted a standard cell library in a 11 V 45 nm CMOStechnology We synthesized with Synopsys Design Compilervarious instances of our architecture varying the numberof filters and determined the relationship between speed-up(with respect to a software execution) and silicon area Thedesign when synthesized with the maximum optimizationeffort available can run at a clock period of 2 ns Thereforecompared to the FPGA results reported in the previoussection all the values of speed-up can be multiplied by afactor 8 (16 ns2 ns)

10 VLSI Design

0

1

2

3

4

5

6

7

8

9

10

5 10 15 20 25 30 35 40 45 50Number of FIR filters

Are

a (m

m2)

Figure 7 Postsynthesis ASIC area versus number of FIR filters fora 50 antennas system

100

150

200

250

300

350

400

450

0 1 2 3 4 5 6 7 8 9 10

Spee

d-up

ver

sus s

oftw

are

Area (mm2)

Figure 8 Speed-up versus area in an ASIC implementation of a 50antennas system

Figures 7 and 8 report silicon area as a function of thenumber of filters and speed-up as a function of the siliconarea respectively

The area curve in Figure 7 grows linearly with119873FIR witha slope of 01885mm2 which is in fact the area of a single FIRfilterThe area depends also on thememory sizeThememoryarea was obtained by considering the use of 100 memoryblocks (two for each antenna) each consisting of 256 bytesfor the signal samples (256 kB over 0457mm2) and a dual-portmemory blockwith 2048 bytes for theweights and delaysvalues (4 kB over 00153mm2) The memory area weighs onthe total area in a significant way only when a few filters areused because it does not depend on119873FIR

The speed-up that is the ratio of software execution timeand hardware execution time as a function of the silicon areashows a behavior similar to what we found for the FPGAcase but the acceleration factor is of course much biggerfrom around 50 to around 400 in the case of fifty antennasin the ASIC case This result would be notably improved incase a more scaled technology would be used [18] As we

0

2

4

6

8

10

12

14

16

5 10 15 20 25 30 35 40 45 50

Are

a and

spee

d-up

incr

ease

(rel

ativ

e uni

ts)

Number of FIR filters

Speed-up increaseArea increase

115

225

335

2 4 6 8 10

Zoom filter range 1 5

Figure 9 Trade-off between speed-up and area when changingthe number of filters in a 50 antennas ASIC implementation Cost-effective solutions are those with119873FIR le 6

utilize more and more area to make room for the filters thespeed-up first increases in a steep way then it bends becausethe sequential part of the algorithm starts dominating theexecution time

We report in Figure 9 two more curves the speed-upincrease and the area increase These two curves are simplythe normalized versions of the curves reported in Figures 7and 8 In practice we divided the area and the speed-up ofthe solution with119873FIR filters by the area and the speed-up ofthe solution with one filter respectivelyThe rationale behindthis choice is to evaluate the trade-off between performanceand cost In particular we judge that a parallel solutionis cost-effective if for a given percentage cost increase theperformance increases at least as much as the cost [19 20]From the inset in Figure 9 we are able to evaluate for whatvalue of 119873FIR the two curves cross each other The cost-effective solutions are those that use no more than six filters

7 Conclusions

Implementing in hardware the Mist-beamforming algorithmportion of the image reconstruction process determines aremarkable acceleration in terms of execution time From apractical point of view this acceleration implies a significantimprovement in the applicability of the system if we considerthe case of a 3D image reconstruction with a fifty-antennasystem we move from 19 hours of computation in thesoftware-only version to only about 7 minutes for an FPGAimplementation and less than one minute for an ASICimplementation even in the serial hardware configuration(ie with 119873FIR = 1) The FPGA clock frequency was set inour experiments to 625MHz which is the clock frequencyof the IP that communicates with the UWB external chipHowever our results show that the FPGA accelerator alonecan run up to 180MHz hence providing a further 288Xspeed-up with respect to the software version Therefore

VLSI Design 11

even the fully serial configuration gives us a noteworthyacceleration that justifies alone the validity of the hardwaredesign When using some degree of parallelism the timingperformance improves even more overcoming by far anypossible advantage of a software implementation We shouldconsider that Octave is an interpreted language and so thecode is executed by an interpreter program that interprets it atruntime Sometimes an interpreted code hasworse executiontime than an optimized compiled code written for instancein C We plan then to translate the Octave code in C asa next step Even though preliminary results suggest that a10X reduction of the CPU execution time can be achievedthis will not help reduce the BEAF part of the computationdown to an acceptable level (an acceleration of more than400X would be needed if an ASIC implementation with fiftyantennas is the reference point)Therefore the BEAF taskwillstill require a hardware accelerator An optimizedC codemayhelp instead to reduce the other tasks below the 1 s target

Thanks to the parameterized implementation our designcan be easily adapted to various configurations with differ-ent number of transmitters and receivers and for varioustechnology targets (FPGA or ASIC) with different availableresources This gives our design a high degree of flexibilityand scalability

Acknowledgment

This work was partially supported by the Italian Ministry ofEducation and Research under FIRB 2012 Project ldquoMICE-NEArdquo

References

[1] S C Hagness A Taflove and J E Bridges ldquoThree-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection design of an antenna-array elementrdquo IEEETransactions on Antennas and Propagation vol 47 no 5 pp783ndash791 1999

[2] X Li and S C Hagness ldquoA confocal microwave imagingalgorithm for breast cancer detectionrdquo IEEE Microwave andWireless Components Letters vol 11 no 3 pp 130ndash132 2001

[3] X Li S K Davis S C Hagness DW VanDerWeide and B DVan Veen ldquoMicrowave imaging via space-time beamformingexperimental investigation of tumor detection in multilayerbreast phantomsrdquo IEEE Transactions on Microwave Theory andTechniques vol 52 no 8 pp 1856ndash1865 2004

[4] E J Bond X Li S C Hagness and B D VanVeen ldquoMicrowaveimaging via space-time beamforming for early detection ofbreast cancerrdquo IEEE Transactions on Antennas and Propagationvol 51 no 8 pp 1690ndash1705 2003

[5] M R CasuM Crepaldi andMGraziano ldquoAVHDL-AMS sim-ulation environment for an UWB impulse radio transceiverrdquoIEEE Transactions on Circuits and Systems I vol 55 no 5 pp1368ndash1381 2008

[6] M Cutrupi M Crepaldi M R Casu and M Graziano ldquoAflexible UWB transmitter for breast cancer detection imagingsystemsrdquo in Proceedings of the Design Automation and Test inEurope Conference and Exhibition (DATE rsquo10) pp 1076ndash1081March 2010

[7] M R Casu M Graziano and M Zamboni ldquoA fully differentialdigital CMOS UWB pulse generatorrdquo Circuits Systems andSignal Processing vol 28 no 5 pp 649ndash664 2009

[8] M Crepaldi M R Casu M Graziano and M ZambonildquoA mixed-signal demodulator for a low-complexity IR-UWBreceiver methodology simulation and designrdquo Integration theVLSI Journal vol 42 no 1 pp 47ndash60 2009

[9] S C Hagness A Taflove and J E Bridges ldquoTwo-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection fixed-focus and antenna-array sensorsrdquo IEEETransactions onBiomedical Engineering vol 45 no 12 pp 1470ndash1474 1998

[10] S K Davis E J Bond X Li S C Hagness and B D Van VeenldquoMicrowave imaging via space-time beamforming for earlydetection of breast cancer beamformer design in the frequencydomainrdquo Journal of Electromagnetic Waves and Applicationsvol 17 no 2 pp 357ndash381 2003

[11] X Li E J Bond B D Van Veen and S C Hagness ldquoAnoverview of ultra-wideband microwave imaging via space-timebeamforming for early-stage breast-cancer detectionrdquo IEEEAntennas and Propagation Magazine vol 47 no 1 pp 19ndash342005

[12] P RosinganaDesign of aUWB Imaging System for Breast CancerDetection [MSc thesis] Politecnico di Torino

[13] J W Eaton Gnu Octave Manual NetworkTheory Ltd 2002[14] UWCEM ldquoNumerical Breast Phantom Repositoryrdquo http

uwcemecewisceduhomehtm[15] ModelSim Reference Manual v65e Mentor Graphics 2010[16] ISE Design Suite 13 Release Notes Guide Xilinx 2011[17] Design Compiler Reference Manual Optimization and Timing

Analysis Version D-201003 Synopsys 2010[18] A Pulimeno M Graziano and G Piccinini ldquoUdsm trends

comparison from technology roadmap to ultrasparc niagara2rdquoIEEE Transactions on Very Large Scale Integration (VLSI)Systems vol 20 no 7 pp 1341ndash1346 2012

[19] S V Tota M R Casu M R Roch L Rostagno and MZamboni ldquoMedea a hybrid shared-memorymessage-passingmultiprocessor NoC-based architecturerdquo in Proceedings of theDesign Automation and Test in Europe Conference and Exhibi-tion (DATE rsquo10) pp 45ndash50 March 2010

[20] M R Casu M R Roch S V Tota and M Zamboni ldquoANoC-based hybrid message-passingshared-memory approachto CMP designrdquo Microprocessors and Microsystems vol 35 no2 pp 261ndash273 2011

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 10: Research Article Hardware Acceleration of Beamforming in a ...downloads.hindawi.com/journals/vlsi/2013/861691.pdf · Hardware acceleration Microprocessor High speed link Backscattered

10 VLSI Design

0

1

2

3

4

5

6

7

8

9

10

5 10 15 20 25 30 35 40 45 50Number of FIR filters

Are

a (m

m2)

Figure 7 Postsynthesis ASIC area versus number of FIR filters fora 50 antennas system

100

150

200

250

300

350

400

450

0 1 2 3 4 5 6 7 8 9 10

Spee

d-up

ver

sus s

oftw

are

Area (mm2)

Figure 8 Speed-up versus area in an ASIC implementation of a 50antennas system

Figures 7 and 8 report silicon area as a function of thenumber of filters and speed-up as a function of the siliconarea respectively

The area curve in Figure 7 grows linearly with119873FIR witha slope of 01885mm2 which is in fact the area of a single FIRfilterThe area depends also on thememory sizeThememoryarea was obtained by considering the use of 100 memoryblocks (two for each antenna) each consisting of 256 bytesfor the signal samples (256 kB over 0457mm2) and a dual-portmemory blockwith 2048 bytes for theweights and delaysvalues (4 kB over 00153mm2) The memory area weighs onthe total area in a significant way only when a few filters areused because it does not depend on119873FIR

The speed-up that is the ratio of software execution timeand hardware execution time as a function of the silicon areashows a behavior similar to what we found for the FPGAcase but the acceleration factor is of course much biggerfrom around 50 to around 400 in the case of fifty antennasin the ASIC case This result would be notably improved incase a more scaled technology would be used [18] As we

0

2

4

6

8

10

12

14

16

5 10 15 20 25 30 35 40 45 50

Are

a and

spee

d-up

incr

ease

(rel

ativ

e uni

ts)

Number of FIR filters

Speed-up increaseArea increase

115

225

335

2 4 6 8 10

Zoom filter range 1 5

Figure 9 Trade-off between speed-up and area when changingthe number of filters in a 50 antennas ASIC implementation Cost-effective solutions are those with119873FIR le 6

utilize more and more area to make room for the filters thespeed-up first increases in a steep way then it bends becausethe sequential part of the algorithm starts dominating theexecution time

We report in Figure 9 two more curves the speed-upincrease and the area increase These two curves are simplythe normalized versions of the curves reported in Figures 7and 8 In practice we divided the area and the speed-up ofthe solution with119873FIR filters by the area and the speed-up ofthe solution with one filter respectivelyThe rationale behindthis choice is to evaluate the trade-off between performanceand cost In particular we judge that a parallel solutionis cost-effective if for a given percentage cost increase theperformance increases at least as much as the cost [19 20]From the inset in Figure 9 we are able to evaluate for whatvalue of 119873FIR the two curves cross each other The cost-effective solutions are those that use no more than six filters

7 Conclusions

Implementing in hardware the Mist-beamforming algorithmportion of the image reconstruction process determines aremarkable acceleration in terms of execution time From apractical point of view this acceleration implies a significantimprovement in the applicability of the system if we considerthe case of a 3D image reconstruction with a fifty-antennasystem we move from 19 hours of computation in thesoftware-only version to only about 7 minutes for an FPGAimplementation and less than one minute for an ASICimplementation even in the serial hardware configuration(ie with 119873FIR = 1) The FPGA clock frequency was set inour experiments to 625MHz which is the clock frequencyof the IP that communicates with the UWB external chipHowever our results show that the FPGA accelerator alonecan run up to 180MHz hence providing a further 288Xspeed-up with respect to the software version Therefore

VLSI Design 11

even the fully serial configuration gives us a noteworthyacceleration that justifies alone the validity of the hardwaredesign When using some degree of parallelism the timingperformance improves even more overcoming by far anypossible advantage of a software implementation We shouldconsider that Octave is an interpreted language and so thecode is executed by an interpreter program that interprets it atruntime Sometimes an interpreted code hasworse executiontime than an optimized compiled code written for instancein C We plan then to translate the Octave code in C asa next step Even though preliminary results suggest that a10X reduction of the CPU execution time can be achievedthis will not help reduce the BEAF part of the computationdown to an acceptable level (an acceleration of more than400X would be needed if an ASIC implementation with fiftyantennas is the reference point)Therefore the BEAF taskwillstill require a hardware accelerator An optimizedC codemayhelp instead to reduce the other tasks below the 1 s target

Thanks to the parameterized implementation our designcan be easily adapted to various configurations with differ-ent number of transmitters and receivers and for varioustechnology targets (FPGA or ASIC) with different availableresources This gives our design a high degree of flexibilityand scalability

Acknowledgment

This work was partially supported by the Italian Ministry ofEducation and Research under FIRB 2012 Project ldquoMICE-NEArdquo

References

[1] S C Hagness A Taflove and J E Bridges ldquoThree-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection design of an antenna-array elementrdquo IEEETransactions on Antennas and Propagation vol 47 no 5 pp783ndash791 1999

[2] X Li and S C Hagness ldquoA confocal microwave imagingalgorithm for breast cancer detectionrdquo IEEE Microwave andWireless Components Letters vol 11 no 3 pp 130ndash132 2001

[3] X Li S K Davis S C Hagness DW VanDerWeide and B DVan Veen ldquoMicrowave imaging via space-time beamformingexperimental investigation of tumor detection in multilayerbreast phantomsrdquo IEEE Transactions on Microwave Theory andTechniques vol 52 no 8 pp 1856ndash1865 2004

[4] E J Bond X Li S C Hagness and B D VanVeen ldquoMicrowaveimaging via space-time beamforming for early detection ofbreast cancerrdquo IEEE Transactions on Antennas and Propagationvol 51 no 8 pp 1690ndash1705 2003

[5] M R CasuM Crepaldi andMGraziano ldquoAVHDL-AMS sim-ulation environment for an UWB impulse radio transceiverrdquoIEEE Transactions on Circuits and Systems I vol 55 no 5 pp1368ndash1381 2008

[6] M Cutrupi M Crepaldi M R Casu and M Graziano ldquoAflexible UWB transmitter for breast cancer detection imagingsystemsrdquo in Proceedings of the Design Automation and Test inEurope Conference and Exhibition (DATE rsquo10) pp 1076ndash1081March 2010

[7] M R Casu M Graziano and M Zamboni ldquoA fully differentialdigital CMOS UWB pulse generatorrdquo Circuits Systems andSignal Processing vol 28 no 5 pp 649ndash664 2009

[8] M Crepaldi M R Casu M Graziano and M ZambonildquoA mixed-signal demodulator for a low-complexity IR-UWBreceiver methodology simulation and designrdquo Integration theVLSI Journal vol 42 no 1 pp 47ndash60 2009

[9] S C Hagness A Taflove and J E Bridges ldquoTwo-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection fixed-focus and antenna-array sensorsrdquo IEEETransactions onBiomedical Engineering vol 45 no 12 pp 1470ndash1474 1998

[10] S K Davis E J Bond X Li S C Hagness and B D Van VeenldquoMicrowave imaging via space-time beamforming for earlydetection of breast cancer beamformer design in the frequencydomainrdquo Journal of Electromagnetic Waves and Applicationsvol 17 no 2 pp 357ndash381 2003

[11] X Li E J Bond B D Van Veen and S C Hagness ldquoAnoverview of ultra-wideband microwave imaging via space-timebeamforming for early-stage breast-cancer detectionrdquo IEEEAntennas and Propagation Magazine vol 47 no 1 pp 19ndash342005

[12] P RosinganaDesign of aUWB Imaging System for Breast CancerDetection [MSc thesis] Politecnico di Torino

[13] J W Eaton Gnu Octave Manual NetworkTheory Ltd 2002[14] UWCEM ldquoNumerical Breast Phantom Repositoryrdquo http

uwcemecewisceduhomehtm[15] ModelSim Reference Manual v65e Mentor Graphics 2010[16] ISE Design Suite 13 Release Notes Guide Xilinx 2011[17] Design Compiler Reference Manual Optimization and Timing

Analysis Version D-201003 Synopsys 2010[18] A Pulimeno M Graziano and G Piccinini ldquoUdsm trends

comparison from technology roadmap to ultrasparc niagara2rdquoIEEE Transactions on Very Large Scale Integration (VLSI)Systems vol 20 no 7 pp 1341ndash1346 2012

[19] S V Tota M R Casu M R Roch L Rostagno and MZamboni ldquoMedea a hybrid shared-memorymessage-passingmultiprocessor NoC-based architecturerdquo in Proceedings of theDesign Automation and Test in Europe Conference and Exhibi-tion (DATE rsquo10) pp 45ndash50 March 2010

[20] M R Casu M R Roch S V Tota and M Zamboni ldquoANoC-based hybrid message-passingshared-memory approachto CMP designrdquo Microprocessors and Microsystems vol 35 no2 pp 261ndash273 2011

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 11: Research Article Hardware Acceleration of Beamforming in a ...downloads.hindawi.com/journals/vlsi/2013/861691.pdf · Hardware acceleration Microprocessor High speed link Backscattered

VLSI Design 11

even the fully serial configuration gives us a noteworthyacceleration that justifies alone the validity of the hardwaredesign When using some degree of parallelism the timingperformance improves even more overcoming by far anypossible advantage of a software implementation We shouldconsider that Octave is an interpreted language and so thecode is executed by an interpreter program that interprets it atruntime Sometimes an interpreted code hasworse executiontime than an optimized compiled code written for instancein C We plan then to translate the Octave code in C asa next step Even though preliminary results suggest that a10X reduction of the CPU execution time can be achievedthis will not help reduce the BEAF part of the computationdown to an acceptable level (an acceleration of more than400X would be needed if an ASIC implementation with fiftyantennas is the reference point)Therefore the BEAF taskwillstill require a hardware accelerator An optimizedC codemayhelp instead to reduce the other tasks below the 1 s target

Thanks to the parameterized implementation our designcan be easily adapted to various configurations with differ-ent number of transmitters and receivers and for varioustechnology targets (FPGA or ASIC) with different availableresources This gives our design a high degree of flexibilityand scalability

Acknowledgment

This work was partially supported by the Italian Ministry ofEducation and Research under FIRB 2012 Project ldquoMICE-NEArdquo

References

[1] S C Hagness A Taflove and J E Bridges ldquoThree-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection design of an antenna-array elementrdquo IEEETransactions on Antennas and Propagation vol 47 no 5 pp783ndash791 1999

[2] X Li and S C Hagness ldquoA confocal microwave imagingalgorithm for breast cancer detectionrdquo IEEE Microwave andWireless Components Letters vol 11 no 3 pp 130ndash132 2001

[3] X Li S K Davis S C Hagness DW VanDerWeide and B DVan Veen ldquoMicrowave imaging via space-time beamformingexperimental investigation of tumor detection in multilayerbreast phantomsrdquo IEEE Transactions on Microwave Theory andTechniques vol 52 no 8 pp 1856ndash1865 2004

[4] E J Bond X Li S C Hagness and B D VanVeen ldquoMicrowaveimaging via space-time beamforming for early detection ofbreast cancerrdquo IEEE Transactions on Antennas and Propagationvol 51 no 8 pp 1690ndash1705 2003

[5] M R CasuM Crepaldi andMGraziano ldquoAVHDL-AMS sim-ulation environment for an UWB impulse radio transceiverrdquoIEEE Transactions on Circuits and Systems I vol 55 no 5 pp1368ndash1381 2008

[6] M Cutrupi M Crepaldi M R Casu and M Graziano ldquoAflexible UWB transmitter for breast cancer detection imagingsystemsrdquo in Proceedings of the Design Automation and Test inEurope Conference and Exhibition (DATE rsquo10) pp 1076ndash1081March 2010

[7] M R Casu M Graziano and M Zamboni ldquoA fully differentialdigital CMOS UWB pulse generatorrdquo Circuits Systems andSignal Processing vol 28 no 5 pp 649ndash664 2009

[8] M Crepaldi M R Casu M Graziano and M ZambonildquoA mixed-signal demodulator for a low-complexity IR-UWBreceiver methodology simulation and designrdquo Integration theVLSI Journal vol 42 no 1 pp 47ndash60 2009

[9] S C Hagness A Taflove and J E Bridges ldquoTwo-dimensionalFDTDanalysis of a pulsedmicrowave confocal system for breastcancer detection fixed-focus and antenna-array sensorsrdquo IEEETransactions onBiomedical Engineering vol 45 no 12 pp 1470ndash1474 1998

[10] S K Davis E J Bond X Li S C Hagness and B D Van VeenldquoMicrowave imaging via space-time beamforming for earlydetection of breast cancer beamformer design in the frequencydomainrdquo Journal of Electromagnetic Waves and Applicationsvol 17 no 2 pp 357ndash381 2003

[11] X Li E J Bond B D Van Veen and S C Hagness ldquoAnoverview of ultra-wideband microwave imaging via space-timebeamforming for early-stage breast-cancer detectionrdquo IEEEAntennas and Propagation Magazine vol 47 no 1 pp 19ndash342005

[12] P RosinganaDesign of aUWB Imaging System for Breast CancerDetection [MSc thesis] Politecnico di Torino

[13] J W Eaton Gnu Octave Manual NetworkTheory Ltd 2002[14] UWCEM ldquoNumerical Breast Phantom Repositoryrdquo http

uwcemecewisceduhomehtm[15] ModelSim Reference Manual v65e Mentor Graphics 2010[16] ISE Design Suite 13 Release Notes Guide Xilinx 2011[17] Design Compiler Reference Manual Optimization and Timing

Analysis Version D-201003 Synopsys 2010[18] A Pulimeno M Graziano and G Piccinini ldquoUdsm trends

comparison from technology roadmap to ultrasparc niagara2rdquoIEEE Transactions on Very Large Scale Integration (VLSI)Systems vol 20 no 7 pp 1341ndash1346 2012

[19] S V Tota M R Casu M R Roch L Rostagno and MZamboni ldquoMedea a hybrid shared-memorymessage-passingmultiprocessor NoC-based architecturerdquo in Proceedings of theDesign Automation and Test in Europe Conference and Exhibi-tion (DATE rsquo10) pp 45ndash50 March 2010

[20] M R Casu M R Roch S V Tota and M Zamboni ldquoANoC-based hybrid message-passingshared-memory approachto CMP designrdquo Microprocessors and Microsystems vol 35 no2 pp 261ndash273 2011

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 12: Research Article Hardware Acceleration of Beamforming in a ...downloads.hindawi.com/journals/vlsi/2013/861691.pdf · Hardware acceleration Microprocessor High speed link Backscattered

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of