The Imagine Stream Processor: Flexibility with Performance

William J. Dally
Computer Systems Laboratory, Stanford University
billd@csl.stanford.edu

Convergence Workshop, March 30, 2001
Outline

• Motivation – we need low-power, programmable TeraOPS
• The problem is bandwidth
  – Growing gap between special-purpose and general-purpose hardware
  – It's easy to make ALUs, hard to keep them fed
• A stream processor gives programmable bandwidth
  – Streams expose locality and concurrency in the application
  – A bandwidth hierarchy exploits this
• Imagine is a 20 GFLOPS prototype stream processor
• Many opportunities to do better
  – Scaling up
  – Simplifying programming
Motivation

• Some things I'd like to do with a few TeraOPS
  – Have a realistic face-to-face meeting with someone in Boston without
    riding an airplane
    • 4-8 cameras, extract depth, fit model, compress, render to several screens
  – High-quality rendering at video rates
    • Ray tracing a 2K x 4K image with 10^5 objects at 60 frames/s
The good news – FLOPS are cheap, OPS are cheaper

• 32-bit FPU – 2 GFLOPS/mm2 – 400 GFLOPS/chip
• 16-bit add – 40 GOPS/mm2 – 8 TOPS/chip

[Figure: layout of an arithmetic cell, 460 µm x 146.7 µm, showing the
local RF and integer adder]
The bad news – General-purpose processors can't harness this

[Figure: log-scale plot of FLOPS (1e+8 to 1e+15) versus year (2001-2011),
comparing achievable FLOPS against GP-Peak and GP-Useful performance of
general-purpose processors]
Why do Special-Purpose Processors Perform Well?

• Lots (100s) of ALUs
• Fed by dedicated wires/memories
Care and Feeding of ALUs

[Figure: a conventional datapath – instruction pointer, instruction cache,
and IR supplying instruction bandwidth, and registers supplying data
bandwidth, to a single ALU. The 'feeding' structure dwarfs the ALU.]
The problem is bandwidth
• Can we solve this bandwidth problem without sacrificing programmability?
Streams expose locality and concurrency

[Figure: stereo depth extraction as a stream graph – Image 0 and Image 1
each flow through two convolve kernels into a SAD kernel that produces
the depth map]

• Operations within a kernel operate on local data
• Streams expose data parallelism
• Kernels can be partitioned across chips to exploit control parallelism
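The kernel/stream structure can be sketched in plain C. This is a sequential model of the streaming style, not Imagine's actual kernel code: the kernel reads one record per iteration from an input stream, keeps all reuse (the tap window and coefficients) in local state, and writes one record to an output stream.

```c
#include <stddef.h>
#include <stdint.h>

#define TAPS 7

/* Hypothetical 1-D convolve kernel in the streaming style: the only
 * global traffic is one stream read and one stream write per element;
 * the tap window lives in (what would be) the cluster's local RF. */
void convolve_kernel(const int16_t *in, size_t n,
                     const int16_t coef[TAPS], int32_t *out)
{
    int16_t window[TAPS] = {0};          /* local working set */
    for (size_t i = 0; i < n; i++) {
        for (int t = TAPS - 1; t > 0; t--)   /* shift: local traffic only */
            window[t] = window[t - 1];
        window[0] = in[i];               /* one stream read */
        int32_t acc = 0;
        for (int t = 0; t < TAPS; t++)
            acc += (int32_t)window[t] * coef[t];
        out[i] = acc;                    /* one stream write */
    }
}
```

Because each output depends only on the local window, the loop iterations can be spread across SIMD clusters (data parallelism), and independent kernels can run on different chips (control parallelism).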
A Bandwidth Hierarchy exploits locality and concurrency

• VLIW clusters with shared control
• 41.2 32-bit operations per word of memory bandwidth

[Figure: bandwidth hierarchy – four SDRAMs at 2 GB/s feed the Stream
Register File at 32 GB/s, which feeds the ALU clusters at 544 GB/s]
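The 41.2 figure can be sanity-checked with one division. The pairing assumed here (peak arithmetic rate over memory word rate) is my reading, not stated on the slide: 2 GB/s of memory bandwidth is 0.5 G 32-bit words/s, so 41.2 operations per word corresponds to a peak rate of about 20.6 GOPS, consistent with the ~20 GFLOPS quoted for Imagine.

```c
/* Back-of-the-envelope check: operations available per 32-bit word
 * fetched from memory = peak arithmetic rate / memory word rate.
 * With ~20.6 GOPS peak and 2 GB/s (0.5 Gwords/s), this gives 41.2. */
double ops_per_memory_word(double peak_gops, double mem_gb_per_s)
{
    double gwords_per_s = mem_gb_per_s / 4.0;  /* 4 bytes per 32-bit word */
    return peak_gops / gwords_per_s;
}
```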
Bandwidth Usage

Application         Memory BW   Global RF BW   Local RF BW
Depth Extractor     0.80 GB/s   18.45 GB/s     210.85 GB/s
MPEG Encoder        0.47 GB/s    2.46 GB/s     121.05 GB/s
Polygon Rendering   0.78 GB/s    4.06 GB/s     102.46 GB/s
QR Decomposition    0.46 GB/s    3.67 GB/s     234.57 GB/s

[Figure: the bandwidth hierarchy diagram again – SDRAM at 2 GB/s, Stream
Register File at 32 GB/s, ALU clusters at 544 GB/s]
The Imagine Stream Processor

[Figure: Imagine block diagram – a host processor and network connect
through the Network Interface and Stream Controller to the Stream
Register File, which serves eight ALU clusters (0-7) under a shared
microcontroller, with a streaming memory system built from four SDRAMs]
Arithmetic Clusters

[Figure: one arithmetic cluster – three adders, two multipliers, and a
divide unit, each fed from local register files through a cross-point
switch, with ports to and from the SRF and a communication unit (CU)
connecting to the intercluster network]
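The eight clusters execute in SIMD fashion: the shared microcontroller issues one VLIW instruction per cycle and every cluster applies it to its own record out of its own local register file. A toy model of that broadcast (the structure and names here are illustrative, not the real microarchitecture):

```c
#include <stdint.h>

#define NCLUSTERS 8

/* Each cluster has its own local register file. */
typedef struct { int32_t regs[16]; } cluster_t;

/* One broadcast "instruction": every cluster performs the same add,
 * but on its own registers, i.e. on its own slice of the stream. */
void broadcast_add(cluster_t c[NCLUSTERS], int dst, int a, int b)
{
    for (int k = 0; k < NCLUSTERS; k++)
        c[k].regs[dst] = c[k].regs[a] + c[k].regs[b];
}
```

Sharing one instruction sequencer across eight clusters is what keeps the instruction bandwidth (and its power) an eighth of what eight independent processors would need.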
Performance

depth      12.1 GOPS   (16-bit application)
mpeg       19.8 GOPS   (16-bit application)
qrd        11.0 GOPS   (floating-point application)
dct        23.9 GOPS   (16-bit kernel)
convolve   25.6 GOPS   (16-bit kernel)
fft         7.0 GOPS   (floating-point kernel)
Power

[Figure: power dissipation (up to ~3 W) for depth, mpeg, qrd, dct,
convolve, fft, and the average, broken down into Clock, Clusters, SRF,
Pins, Memory System, and Other; the chart's shares are 63%, 23%, 6%, 5%,
2%, and 1%]

GOPS/W:  depth 4.6   mpeg 10.7   qrd 4.1   dct 10.2   convolve 9.6
         fft 2.4   average 6.9
A Look Inside an Application: Stereo Depth Extraction

• 320x240 8-bit grayscale images
• 30-disparity search
• 220 frames/second
• 12.7 GOPS
• 5.7 GOPS/W

Convolution stage (per row): load original packed row -> unpack
(8-bit -> 16-bit) -> 7x7 convolve -> 3x3 convolve -> store convolved row

Disparity search stage: load convolved rows -> calculate block SADs at
different disparities -> store best disparity values

[Figure: execution timelines for both stages showing the clusters and the
two memory ports busy concurrently – kernels overlap with the loads and
stores that stage their streams]
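The disparity-search stage can be sketched as follows: for each pixel, try each candidate disparity, compute a sum of absolute differences (SAD) over a small window of the two convolved rows, and keep the best. The 30-disparity range comes from the slide; the window size and everything else is an illustrative simplification of the real kernel.

```c
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* For pixel x, return the disparity d in [0, ndisp) minimizing the SAD
 * between a (2*win+1)-wide window of the left row and the same window of
 * the right row shifted by d. Out-of-range taps are skipped. */
int best_disparity(const int16_t *left, const int16_t *right,
                   int x, int width, int ndisp, int win)
{
    int best_d = 0;
    long best_sad = LONG_MAX;
    for (int d = 0; d < ndisp; d++) {
        long sad = 0;
        for (int w = -win; w <= win; w++) {
            int xl = x + w, xr = x + w - d;
            if (xl < 0 || xl >= width || xr < 0 || xr >= width) continue;
            sad += labs((long)left[xl] - right[xr]);
        }
        if (sad < best_sad) { best_sad = sad; best_d = d; }
    }
    return best_d;
}
```

Every disparity for every pixel is independent, which is exactly the data parallelism the clusters exploit.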
7x7 Convolve Kernel

[Figure: the compiled VLIW schedule of the 7x7 convolve kernel – per-cycle
assignment of operations (IMULRND16, IADDS16, SHUFFLE, SELECT/NSELECT,
communication permutes, and stream input/output) to the cluster's
functional units (ADD0-2, MUL0-1, DIV0, input/output ports, scratchpad,
communication, and microcontroller ports), with the multipliers occupied
nearly every cycle of the loop]
Imagine gives high performance with low power and flexible programming

• Matches the capabilities of communication-limited technology to the
  demands of signal and image processing applications
• Performance
  – compound stream operations realize >10 GOPS on key applications
  – can be extended by partitioning an application across several
    Imagines (TFLOPS on a circuit board)
• Power
  – three-level register hierarchy gives 2-10 GOPS/W
• Flexibility
  – programmed in "C"
  – streaming model
  – conditional stream operations enable applications like sort
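A conditional stream operation lets each cluster append a record to an output stream only when a condition holds, producing a compacted stream of data-dependent length. That primitive is what lets a SIMD stream machine express irregular applications such as sort. The C below is a sequential model of the idea, not Imagine's hardware mechanism:

```c
#include <stddef.h>
#include <stdint.h>

/* Conditionally append each input record to the output stream; the
 * per-record predicate here (non-negative) is just an example.
 * Returns the length of the compacted output stream. */
size_t conditional_append(const int32_t *in, size_t n, int32_t *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (in[i] >= 0)
            out[m++] = in[i];
    return m;
}
```

On SIMD hardware the interesting part is that the append is collective: clusters whose condition is false produce nothing, and the surviving records are packed densely into the stream.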
A look forward

• Next steps
  – Build some Imagine prototypes
    • Dual-processor 40 GFLOPS systems, 64-processor TeraFLOPS systems
• Longer term
  – 'Industrial strength' Imagine – 100-200 GFLOPS/chip
    • Multiple sets of arithmetic clusters per chip, higher clock rate,
      on-chip cache, more off-chip bandwidth
  – Graphics extensions
    • Texture cache, raster unit – as SRF clients
  – A streaming supercomputer
    • 64-bit FP, high-bandwidth global memory, MIMD extensions
  – Simplified stream programming
    • Automate inter-cluster communication, partitioning into kernels,
      sub-word arithmetic, staging of data
Take home message

• VLSI technology enables us to put TeraOPS on a chip
• Conventional general-purpose architectures cannot exploit this
  – The problem is bandwidth
• Casting an application as kernels operating on streams exposes locality
  and concurrency
• A stream architecture exploits this locality and concurrency to achieve
  high arithmetic rates with limited bandwidth
  – Bandwidth hierarchy, compound stream operations
• Imagine is a prototype stream processor
  – One chip – 20 GFLOPS peak, 10 GFLOPS sustained, 4 W
  – Systems scale to TeraFLOPS and more