sonic millip3de:an architecture for handheld …twenisch/papers/ieee-micro14.pdf · sonic...

.................................................................................................................................................................................................................

SONIC MILLIP3DE: AN ARCHITECTUREFOR HANDHELD 3D ULTRASOUND

.................................................................................................................................................................................................................

SONIC MILLIP3DE, A SYSTEM ARCHITECTURE AND ACCELERATOR FOR 3D ULTRASOUND

BEAMFORMING, HAS A THREE-LAYER DIE-STACKED DESIGN THAT COMBINES A

HARDWARE-FRIENDLY APPROACH TO THE ULTRASOUND IMAGING ALGORITHM WITH A

CUSTOM BEAMFORMING ACCELERATOR. THE SYSTEM ACHIEVES HIGH-QUALITY 3D

ULTRASOUND IMAGING WITHIN A FULL-SYSTEM POWER OF 15 W IN 45-NM

SEMICONDUCTOR TECHNOLOGY. SONIC MILLIP3DE IS PROJECTED TO ACHIEVE THE TARGET

5-W POWER BUDGET BY THE 16-NM TECHNOLOGY NODE.

......Much as every medical professio-nal listens beneath the skin with a stethoscopetoday, we foresee a time when handheld med-ical imaging will become as ubiquitous—“peering under the skin” using a handheldimaging device. Mobile medical imaging isadvancing rapidly to reduce the footprint ofbulky, often room-sized machines to compacthandheld devices. In the last five years,research has demonstrated that by combiningthe increasing capabilities of mobile process-ors with intelligent system design, portableand even handheld imaging devices are notonly possible, but commercially viable. Inparticular, ultrasound imaging has proven tobe an especially successful candidate for highportability due to its safety and low transmitpower, with commercial handheld 2D ultra-sound devices marketed and being used inhospitals today. Newly developed portableimaging devices have not only led to demon-strated improvements in patient health,1 theyhave also enabled new applications for hand-held ultrasound, such as disaster relief care2

and battlefield triage.3 However, despitethe increasing capabilities of handheld

ultrasound devices, these systems remainunable to produce the high-quality real-time3D images that are possible with their non-handheld counterparts.

In recent years, many hospitals have beentransitioning to 3D ultrasound imagingwhen mobility is not required because it pro-vides numerous benefits over 2D, includingincreased technician productivity, greatervolumetric measurement accuracy, and morereadily interpreted images. 3D imaging canalso enable advanced diagnostic capabilities,such as tissue sonoelastography through high-velocity 3D motion tracking and accurateblood-flow measurements via 3D Doppler.Creating a handheld 3D system could enablehospital-quality ultrasound imaging in nearlyany setting, greatly expanding the way ultra-sound is used today. However, 3D ultrasoundcomes with many challenges that are com-pounded when implementing a system in ahandheld form factor. The sheer amount ofdata that must be sensed, transferred, andcomputed is nearly 5,000 times more than ina 2D system. At the same time, the massivedata rate (as high as 6 terabits/second) of the

Richard Sampson

University of Michigan

Ming Yang

Siyuan Wei

Chaitali Chakrabarti

Arizona State University

Thomas F. Wenisch

University of Michigan

.......................................................

100 Published by the IEEE Computer Society 0272-1732/14/$31.00�c 2014 IEEE

received echo signals is so high that the datacannot easily be transferred off chip for imageformation; current 3D systems typically trans-fer data for only a fraction of receive channels,sacrificing image quality or aperture size. Inaddition to the extreme computationalrequirements, power is of the utmost impor-tance, not only to ensure adequate batterylife, but more importantly because the deviceis in direct contact with the patient’s skin,placing tight constraints on safe operatingtemperature.

For safe operation, a handheld ultrasoundsystem must operate within roughly a 5-Wpower budget. Implementing a handheld 3Dsystem with commercially available digitalsignal processor (DSP) or graphics-acceleratorchips using conventional beamforming algo-rithms designed for software is simply infeasi-ble. Our analysis indicates that it would take700 ultrasound DSP chips with a total powerbudget of 7.1 kW to meet typical 3D imagingcomputational demands at just 1 frame persecond (fps). To enable such demandingcomputation on such a low power budget, acomplete rethink of both the algorithm andarchitecture is required.

In this article, we introduce Sonic Milli-p3De,4,5 a hardware accelerator that combinesa new approach to the ultrasound imagingalgorithm better suited to hardware withmodern computer architecture techniques toachieve high-quality 3D ultrasound imagingwithin a full-system power of 15 W in 45-nmsemiconductor technology. Under anticipatedscaling trends, we project that Sonic Milli-p3De will achieve our target 5-W powerbudget by the 16-nm technology node.

We present this work both to make prog-ress on realizing the promise of handheldmedical imaging and as a case study for appli-cation-specific accelerator design. Our workalso illustrates the unique benefits of 3D diestacking in heterogeneous systems and moti-vates moving beyond the limitations of theconventional von Neumann architecture incertain applications.

Synthetic aperture ultrasoundSynthetic aperture ultrasound imaging is

performed by sending high-frequency pulses(typically 1 to 15 MHz) into a medium and

constructing an image from the reflected pulsesignals. The process comprises three stages:transmit, receive, and beamsum. Transmissionand reception are both done using an array oftransducers that are electrically stimulated toproduce the outgoing signal and generate cur-rent when they vibrate from the returningecho. After all echo data is received, the beam-sum process (the computation-intensive stage)combines the data into a partial image. Thepartial image corresponds to echoes from a sin-gle transmission. Several transmissions fromdifferent locations on the transducer array areneeded to produce high-quality images, so sev-eral iterations of transmit, receive, and beam-sum are typically necessary to construct asingle complete frame.

Each transmission is a pulsed signal con-ceptually originating from a single locationin the array, shown in Figure 1a. To improvesignal strength, multiple transducers can firetogether in a pattern to emulate a single vir-tual source located behind the transducerarray.6 The pulse expands into the mediumradially, and as it encounters interfacesbetween materials of differing density, thesignal is partially transmitted and partiallyreflected, as shown in Figure 1b. The return-ing echoes cause the transducers to vibrate,generating a current signal that is digitizedand stored in a memory array associated witheach transducer. Each position within thesememory arrays corresponds to a differentround-trip time from the emitting transducerto the receiving transducer. Because trans-ducers cannot distinguish the direction of anincoming echo, each array element containsthe superimposed echoes from all locationsin the imaging volume with equal round-triptimes (that is, an arc in the imaging volume).Because of the geometry of the problem, theround-trip arcs are different for each trans-ducer, resulting in different superpositions ateach receiver. The beamsum operation sumsthe echo intensity observed by all transducersfor the arcs intersecting a particular focalpoint (that is, a location in the imaging vol-ume), yielding a strong signal when the focalpoint lies on an echoic boundary. Combiningtransmissions from multiple source locationsallows further focusing.

A typical beamsum pipeline first trans-forms the raw signal received from each

.............................................................

MAY/JUNE 2014 101

transducer to enhance signal quality. The sig-nal is upsampled using an interpolation filterto generate additional data points betweenthe received samples. This process enhancesresolution without the power and storageoverheads of increasing the data samplingrate of the analog front end. Then, so-calledapodization scaling factors are applied to theinterpolated data to place greater weight onreceivers near the origin of the transmission,because these signals are more accurate owingto their lower angle of incidence.

Once the data has been preprocessed(“transformed”), the beamsum operation canbegin. In essence, this entails calculating theround-trip delay between the emitting trans-ducer and all receiving transducers througheach focal point, converting these delays intoindices in each transducer’s received signalarray, retrieving the corresponding data, andsumming these values. Figure 1c illustratesthis process. These partial images are then

summed over multiple transmissions. Finally,a demodulation operation is applied toremove the ultrasound carrier signal.

The delay calculation (identifying theright index within each receive array) is themost computationally intensive aspect ofbeamsum, because it must be completed forevery {focal point, transmit transducer, re-ceive transducer} trio. Because the transmitsignal propagates radially, the image space isdescribed by a grid of scanlines that radiate ata constant angular increment from the centerof the transducer array into the image vol-ume. Focal points are located at even spacingalong each scanline, in effect creating a spher-ical coordinate system. However, the trans-ducers are laid out in a grid-based Cartesiancoordinate system, requiring a fairly complexlaw of cosines calculation to compute round-trip distances via Equation 1:

dP ¼1

cRp þ

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiR2

p þ x2i � 2xiRp sin h

q� �

ð1ÞIn this equation, dp is the round-trip delay

from the center transducer to point P totransducer i, c is the speed of sound in tissue(1,540 m/s), Rp is the radial distance of pointP from the center of the transducer, h is theangular distance of point P from the line nor-mal to the center transducer, and xi is thedistance of transducer i from the center. Fig-ure 1d shows variables as they correspond tothe system geometry. This formula requiresextensive evaluation of both trigonometricfunctions and square roots; hence, many 2Dultrasound systems precalculate all delays andstore them in a lookup table (LUT).7 How-ever, a typical 3D system requires roughly250 billion unique delay values, making aLUT implementation impractical. Instead,delays are calculated as needed.8

Redesigning the ultrasound algorithm forhardware acceleration

A key innovation of Sonic Millip3De is tocodesign hardware with a new beamformingalgorithm better suited to hardware accelera-tion. Our main algorithmic insight is toreplace the expensive exact delay calculationof Equation 1 with an iterative, piecewisequadratic approximation, which can be

A

B

C

A

B

C

B

Point P

Image Space

RP

Xi

(a) (b)

(d)(c)

θ

Figure 1. Synthetic aperture ultrasound. Pulse leaving transmit transducer

(a). Echo pulses reflecting from points B and C. All transducers in array (or

subaperture) will receive the echo data, but at different times due to

different round trip distances (b). All of the reconstructed data for point B

from each of the transducers added together. By adding thousands of

“views” together, crisp points become focused (c). Variables used in

calculating round-trip distance, dp , for the ith transducer and point P in

Equation 1 (d).

..............................................................................................................................................................................................

TOP PICKS

............................................................

102 IEEE MICRO

computed efficiently using only add opera-tions. The algorithm’s iterative nature lendsitself to an efficient data streaming model,allowing the proposed hardware to exploitlocality and eliminate inefficient address cal-culation and memory-access operations thatare a bottleneck in conventional implementa-tions. Our early analysis shows that the deltafunction between adjacent focal-point delayson a scanline forms a smooth curve and indi-ces can be approximated accurately (witherror similar to that introduced by interpola-tion) over short intervals with quadraticapproximations. We replace these exact deltacurves with a per-transducer precomputedpiecewise quadratic approximation con-strained to allow an index error of at most 3(corresponding to at most a 30-lm errorbetween the estimated and exact focal point),thus resulting in negligible blur.

Using offline image quality analysis, wehave determined that, for a target imaging

depth of 8 cm, we can meet the error con-straints with only three piece-wise sections.Each section requires precalculating threecoefficients and a section cut-off, achieving a250-times storage reduction relative to anexhaustive lookup table. Through carefulpipelining of the beamforming process, theconstants can be efficiently streamed fromoff-chip memory, limiting storage require-ments within the beamforming accelerator.

Sonic Millip3DeThe Sonic Millip3De system hardware

(shown in Figure 2) is divided into three dis-tinct silicon die layers—transducers and ana-log components, analog-to-digital converters(ADCs) and storage, and beamforming com-putation—which are connected verticallyusing through-silicon vias (TSVs). The 3D-stacked chip connects to separate LPDDR2(low-power double data rate 2) memory. All

Layer 2: ADC/storageLayer 1: Transducers Layer 3: Beamforming

Sonic Millip3De

Transducerbank

12-bitADC

SRAMarray

Transformunit

Select unit(10 Subunits)

Reduceunit

Transducerbank

12-bitADC

SRAMarray

Transformunit


Reduceunit

Transducerbank

12-bitADC

SRAMarray

Transformunit


Reduceunit

To memory

From memory

Figure 2. Sonic Millip3De hardware overview. Layer 1 comprises 120� 88 transducers grouped into banks with one

transducer per bank in each subaperture. Analog transducer outputs from each bank are multiplexed and routed over through-

silicon vias (TSVs) to Layer 2, comprising 1,024 analog-to-digital converter (ADC) units operating at 40 MHz and static RAM

(SRAM) arrays to store incoming samples. The stored data is passed via face-to-face links to Layer 3 for processing in the

three stages of the 1,024-unit beamsum accelerator. The “transform” stage upsamples the signal to 160 MHz. The 10 units in

the “select” stage map signal data from the receive time domain to the image space domain in parallel for 10 scanlines. The

“reduce” stage combines previously stored data from memory with the incoming signal from all 1,024 beamsum nodes over

a unidirectional pipelined interconnect, and the resulting updated image is written back to memory.

.............................................................

MAY/JUNE 2014 103

of these components are integrated directlyinto the ultrasound scanhead, the wand-likedevice held against the patient’s skin to obtainultrasound images, allowing for a completehandheld device. These components com-prise an ultrasound system’s front end, capa-ble of generating raw, volumetric images. Aseparate back end for viewing and postpro-cessing might be implemented in a tabletor PC.

Using a 3D die-stacked design providesseveral architectural benefits. First, it is possi-ble to stack dies manufactured in differenttechnologies. Hence, the transducer layer canbe manufactured in a cost-effective processfor the analog circuitry, higher voltages, andlarge geometry of ultrasonic transducers,while the beamforming accelerator canexploit the latest digital logic process technol-ogy. Second, stacking allows far more TSVlinks between dies than conventional chippins, resolving the bandwidth bottleneck thatplagues existing 3D systems where the probeand computation units are connected viacable. Third, TSV connections replace longwires that would otherwise be required insuch a massively parallel system, reducinginterconnect power requirements. Finally,stacking provides the potential for designmodularity, where the same beamformingaccelerator die could be stacked with alterna-tive transducer arrays designed for differentimaging applications.

The top die layer comprises a 120 � 88grid of capacitive micromachined ultrasonictransducers (CMUTs) with k/2 spacing. Thearea between the transducers is used for addi-tional analog components and routing to theTSV interface. Our system uses a sliding“subaperture” technique where, for eachtransmit, only a 32 � 32 subgrid of trans-ducers receive. The full 120 � 88 aperture issampled over multiple transmissions, reduc-ing hardware (and power) requirements atthe cost of more transmissions per frame.The transducers are grouped into banks suchthat only one transducer per bank receivesdata in any subaperture. With this bankingdesign, only a single beamforming channel isnecessary for each of the 1,024 banks ratherthan each of 10,560 transducers.

The second layer comprises 1,024 (12-bit) ADCs and static RAM (SRAM) arrays,

which each correspond to the 1,024 trans-ducer banks of the analog transducer layer.The ADCs are sampled at 40 MHz, storingthe digital output into corresponding 6-KbyteSRAM arrays. The SRAMs are clocked at1 GHz and connect vertically to a corre-sponding computational unit on the beam-forming accelerator layer, requiring a totalof 24,000 face-to-face TSVs for data andaddress signals.

The final layer is the most complex of thethree, comprising the beamforming accelera-tor processing units, a unidirectional pipe-lined interconnect, and a control processor(M-class ARM core) that interfaces to theLPDDR2 off-chip memory.

Beamforming accelerator designThe beamforming accelerator comprises

1,024 independent channels, each dividedinto three conceptual stages: transform, select,and reduce. Each of these stages performs aseparate operation to convert the digitizedreceive samples into beamformed focal-pointintensities. Although the transform-select-reduce conceptual framework is particularlywell suited for ultrasound beamforming, thisdesign paradigm could also be applicable toother problems with similar dataflow.

TransformThe transform unit operates on all of the

receive data, performing a 4-times linearinterpolation on the raw receive signals. Afterupsampling, a constant apodization is ap-plied providing a weight based on transducerposition as previously described.

SelectThe select unit remaps data from the

receive time domain to the image spacedomain using the algorithm described previ-ously. The select unit is split into 10 subunitsthat concurrently operate on neighboringscanlines. These subunits each iterate overthe same incoming datastream from a corre-sponding second-layer SRAM array in asynchronized fashion, reducing the numberof times data must be read from the SRAMby a factor of 10. Figure 3 shows a blockdiagram of a single subunit.

Data is streamed simultaneously into theinput first-in, first-out (FIFO) buffer of each

..............................................................................................................................................................................................

TOP PICKS

............................................................

104 IEEE MICRO

select subunit. As the subunits each draintheir input buffers, new data is streamed infrom SRAM. On each clock cycle, a sampleis popped from the head of the input buffer,and the select logic determines whether thedata corresponds to the next focal point onthe scanline. If the sample is selected, it iscopied to the output buffer. Otherwise, thesample is discarded. Whenever the outputbuffer fills, its contents are sent to the reducestage.

The select logic implements the piecewisequadratic delay calculation described previ-ously. The logic calculates the delta—that is,the number of samples to discard—betweentwo consecutive focal points. The unit com-prises constant storage that is preloaded withthe piecewise quadratic constants required toprocess a particular set of scanlines, a decre-mentor that determines when to change sec-tions, a series of adders to generate the deltavalue, and finally a decrementor that countsdown in step with the input buffer and

determines which data should be selected.After initialization, the subunit generates thefirst delta value (n ¼ 0) to determine howmuch to advance the input. This delta value isthen loaded into the select decrementor.Once the select decrementor reaches 0, thecurrent data is “selected” and written to theoutput buffer. A new delta value (n ¼ 1) iscalculated and the process continues until theentire scanline has been generated. Because ofthe iterative nature of the calculation, deltascan be calculated efficiently using the adderchain shown in Figure 3. Using this designapproach, we can change what is typically anaddress-calculation and load-intensive soft-ware loop to a streaming design, greatlyimproving efficiency.

ReduceThe final stage is the reduce unit, which

ties the 1,024 channels together via a pipe-lined network. Each reduce unit correspondsto a single node on the network and adds the

Constantstorage

C

A+B

2A

++

Output bufferInput buffer

Select logic

From transformunit

Select subunit

To reduceunit

From reduceunit

+

Sectiondecrementor

Selectdecrementor

Figure 3. Select unit microarchitecture. Select units map upsampled echo data from the receive time domain to image focal

points. Sample data arrives from the transform unit at the input buffer, and each sample is either discarded or copied to

the output buffer as determined by our piecewise quadratic approximation algorithm. The constant storage holds the

precomputed constants and boundary for each approximation section. The adder chain calculates the next delta index value

to determine how far ahead the hardware needs to iterate to find the next focal point, with the final adder accumulating

fractional bits from previous additions. The select decrementor is initialized with the integer component of the adder chain.

In each cycle, the head of the input buffer is copied to the output if the decrementor is zero, or discarded if it is nonzero. The

section decrementor tracks when to advance to the next piece-wise section.

.............................................................

MAY/JUNE 2014 105

data received from the preceding node withthe data from the local select unit beforesending the summed result to the next nodeon the network.

Methodology and resultsWe evaluate our system in terms of both

image quality and system power. BecauseSonic Millip3De is intended for diagnosticmedical imaging, it is critical that it generateshigh-quality images, comparable to existingdevices. Hence, the goal of our image qualityevaluation is to confirm that the approxima-tion techniques used to reduce power do notunacceptably degrade image quality. We con-trast images generated according to ourmethod against an ideal system without powerconstraints or approximations.

In our image quality analysis, we simulatecysts in tissue using Field II,9,10 varying cystsize with depth and covering a range of 8 cm

(2- to 10-cm depth). Table 1 shows the rele-vant ultrasound system parameters. We gen-erate 3D images using both our system(iterative delay calculation and fixed-pointadders) as well as an ideal system (full delaycalculation and double-precision floating-point arithmetic). Figure 4 shows a 2D slicefor both images. We quantitatively compareimage quality using contrast-to-noise ratios(CNRs) for each cyst, shown in Table 2.Overall, Sonic Millip3De’s image quality isnearly indistinguishable from the ideal case,providing high image quality and validatingour algorithm design.

We analyze the full system power of SonicMillip3De using a register transfer level designtargeting a 45-nm standard cell library andSpice models of the global interconnect. Usingresults from synthesis (SRAM, beamformer,interconnect) and published values (trans-ducers, analog-to-digital converters, memoryinterface, DRAM), we determine that thedesign requires 14.6 W in current 45-nmtechnology, falling a bit short of the ambitious5-W power target. However, under current

Table 1. 3D ultrasound system parameters.

Parameter Value

Total transmits per frame 96

Total transducers 10,560

Receive transducers per subframe 1,024

SRAM size per receive transducer 4,096� 12 bits

Focal points per scanline 4,096

Image depth 10 cm

Image total angular width p/6

Sampling frequency 40 MHz

Interpolation factor 4�Interpolated sampling frequency 160 MHz

Speed of sound (tissue) 1540 m/s

Target frame rate 1 frame/s

20

30

40

50

60

70

80

90

100–10 0 10

x (mm)

z (m

m)

20

30

40

50

60

70

80

90

100–10 0 10

x (mm)

z (m

m)

(a) (b)

Figure 4. Image quality comparison. X to Z

(horizontal) slice through a series of cysts

from a 3D simulation using Field II,9,10

generated with double-precision floating-

point and exact delay index calculation (a).

The same slice generated via our delay

algorithm, fixed-point precision, and

dynamic focus (b). Table 2 gives the

contrast-to-noise ratios (CNRs) for both.

Table 2. CNR values for ideal system and Sonic Millip3De (SM3D).

Values correspond to cysts shown in Figure 4.

Left column of

cysts in Figure 4

Right column of

cysts in Figure 4

Ideal SM3D Ideal SM3D

3.59 3.58 1.93 1.85

3.18 3.21 1.51 1.41

2.68 2.67 1.94 1.85

1.61 1.62 2.10 2.01

1.10 1.18 2.39 2.30

0.33 0.39 2.43 2.34

..............................................................................................................................................................................................

TOP PICKS

............................................................

106 IEEE MICRO

scaling trends, we project that the design willmeet the target 5-W budget by the 16-nmtechnology node.

C omputer architecture is continuing tomove toward heterogeneous designs,

with accelerated processing units for multi-media and graphics already commonplacetoday. As the heterogeneous space grows, sowill the need for more advanced accelerators.With Sonic Millip3De, we have targeted aspecific application (handheld 3D ultra-sound) that has incredible potential impactand whose unique form of computation isnot well suited to existing designs. We believethat focusing on such problems will helpfuture accelerator design move forward.

Furthermore, a key take-away from ourprocess is the importance of codesigningalgorithms and hardware when high-efficiency gains are required. Sonic Milli-p3De achieves orders-of-magnitude greaterenergy efficiency over stock ultrasounddesigns, which would not have been possiblewith just a simple hardware solution. Thefundamental reworking of the algorithmitself enabled the streaming dataflow that liesat the heart of our efficiency gains, and was acritical component in our hardware design.However, because our algorithmic modifica-tions introduce approximations, they necessi-tated a domain-specific evaluation to ensurethat result quality was not compromised.Additionally, emerging architectural techni-ques such as 3D die stacking helped us createa design that simply could not have existedpreviously, and they show great potential fornew and unique heterogeneous systems. MICRO

AcknowledgmentsThis work was partially supported by

NSF CCF-0815457, CSR-0910699, andgrants from ARM Inc. The authors thankJ. Brian Fowlkes, Oliver Kripfgans, and PaulCarson for feedback and assistance withimage quality analysis and Ron Dreslinskifor assistance with Spice.

....................................................................References1. D. Weinreb and J. Stahl, The Introduction of

a Portable Head/Neck CT Scanner May Be

Associated with an 86% Increase in the

Predicted Percent of Acute Stroke Patients

Treatable with Thrombolytic Therapy, Radio-

logical Soc. of North America, 2008.

2. M. Shorter and D.J. Macias, “Portable

Handheld Ultrasound in Austere Environ-

ments: Use in the Haiti Disaster,” Prehospi-

tal Disaster and Medicine, vol. 27, no. 2,

2012, pp. 172-177.

3. J.A. Nations and R.F. Browning, “Battlefield

Applications for Handheld Ultrasound,”

Ultrasound Quarterly, vol. 27, no. 3, 2011,

pp. 171-176.

4. R. Sampson et al., “Sonic Millip3De: A Mas-

sively Parallel 3D Stacked Accelerator for

3D Ultrasound,” Proc. 19th IEEE Int’l Symp.

High-Performance Computer Architecture,

2013, pp. 318-329.

5. R. Sampson et al., “Sonic Millip3De with

Dynamic Receive Focusing and Apodization

Optimization,” Proc. IEEE Ultrasonics, Fer-

roelectrics, and Frequency Control Soc.

Symp., 2013, pp. 557-560.

6. C. Passmann and H. Eermert, “A 100-MHz

Ultrasound Imaging System for Dermato-

logic and Ophthalmologic Diagnostics,”

IEEE Trans. Ultrasonics, Ferroelectrics and

Frequency Control, vol. 43, no. 4, 1996,

pp. 545-552.

7. K. Karadayi, C. Lee, and Y. Kim, Software-

Based Ultrasound Beamforming on Multicore

DSPs, tech. report, Univ. of Washington,

March 2011.

8. F. Zhang et al., “Parallelization and Perform-

ance of 3D Ultrasound Imaging Beamform-

ing Algorithms on Modern Clusters,” Proc.

16th Int’l Conf. Supercomputing, 2002,

pp. 294-304.

9. J. Jensen, “Field: A Program for Simulating

Ultrasound Systems,” Proc. 10th Nordic-

Baltic Conf. Biomedical Imaging, 1996,

pp. 351-353.

10. J.A. Jensen and N.B. Svendsen, “Calculation

of Pressure Fields from Arbitrarily Shaped,

Apodized, and Excited Ultrasound Trans-

ducers,” IEEE Trans. Ultrasonics, Ferroelec-

trics and Frequency Control, vol. 39, no. 2,

1992, pp. 262-267.

Richard Sampson is a PhD candidate inthe Department of Electrical Engineeringand Computer Science at the University of

.............................................................

MAY/JUNE 2014 107

Michigan. His research interests includehardware-system and accelerator design forimaging and computer vision applications.Sampson has an MS in computer scienceand engineering from the University ofMichigan.

Ming Yang is a PhD student in the School ofElectrical, Computer and Energy Engineeringat Arizona State University. His researchfocuses on the development of algorithms andlow-power hardware for a handheld 3D ultra-sound imaging device. Yang has an MS in elec-trical engineering from Beijing University ofPosts and Telecommunications.

Siyuan Wei is a PhD student in the School ofElectrical, Computer and Energy Engineeringat Arizona State University. His researchfocuses on algorithm-architecture codesign ofultrasound imaging systems, especially thosebased on Doppler imaging. Wei has a BE inelectrical engineering from Huazhong Univer-sity of Technology and Science.

Chaitali Chakrabarti is a professor in theSchool of Electrical, Computer and Energy

Engineering at Arizona State University.Her research interests include low-powersystem design, VLSI algorithms, and archi-tectures for signal processing systems. Chak-rabarti has a PhD in electrical engineeringfrom the University of Maryland. She is afellow of IEEE.

Thomas F. Wenisch is an associate profes-sor in the Department of Electrical Engi-neering and Computer Science at the Uni-versity of Michigan. His research focuses oncomputer architecture, particularly multi-processor and multicore systems, multicoreprogrammability, smartphone architecture,data center architecture, and performanceevaluation methodology. Wenisch has aPhD in electrical and computer engineeringfrom Carnegie Mellon University. He is amember of IEEE and the ACM.

Direct questions and comments about thisarticle to Thomas F. Wenisch, ComputerScience & Engineering Department, 2260Hayward St., Ann Arbor, MI 48109;[email protected].

..............................................................................................................................................................................................

TOP PICKS

............................................................

108 IEEE MICRO

sonic millip3de:an architecture for handheld …twenisch/papers/ieee-micro14.pdf · sonic...

Documents