the imagine stream processor concurrent vlsi architecture group stanford university computer systems...

The Imagine Stream Processor

Concurrent VLSI Architecture Group

Stanford University Computer Systems Laboratory

Stanford, CA 94305

Scott Rixner

February 23, 2000

Scott Rixner The Imagine Stream Processor 2

Outline

The Stanford Concurrent VLSI Architecture Group Media processing Technology

– Distributed register files

– Stream organization The Imagine Stream Processor

– 20GFLOPS on a 0.7cm2 chip

– Conditional streams

– Memory access scheduling

– Programming tools

The Concurrent VLSI Architecture Group

Architecture and design technology for VLSI Routing chips

– Torus Routing Chip, Network Design Frame, Reliable Router

– Basis for Intel, Cray/SGI, Mercury, Avici network chips

Parallel computer systems

J-Machine (MDP) led to Cray T3D/T3E M-Machine (MAP)

– Fast messaging, scalable processing nodes, scalable memory architecture

MDP Chip J-Machine Cray T3D MAP Chip

Design technology

Off-chip I/O– Simultaneous bidirectional signaling, 1989

• now used by Intel and Hitachi

– High-speed signalling• 4Gb/s in 0.6m CMOS, Equalization, 1995

On-Chip Signalling– Low-voltage on-chip signalling

– Low-skew clock distribution Synchronization

– Mesochronous, Plesiochronous

– Self-Timed Design4Gb/s CMOS I/O

250ps/division

Forces Acting on Media Processor Architecture

Applications - streams of low-precision samples– video, graphics, audio, DSL modems, cellular base stations

Technology - becoming wire-limited– power/delay dominated by communication, not arithmetic

– global structures: register files, instruction issue don’t scale Microarchitecture - ILP has been mined out

– to the point of diminishing returns on squeezing performance from sequential code

– explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance

640x480 @ 30 fps Requirements

– 11 GOPS

Imagine stream processor– 12.1 GOPS, 4.6 GOPS/W

Stereo Depth Extraction

Left Camera Image Right Camera Image

Depth Map

Applications

Little locality of reference– read each pixel once

– often non-unit stride

– but there is producer-consumer locality Very high arithmetic intensity

– 100s of arithmetic operations per memory reference

Dominated by low-precision (16-bit) integer operations

Stream Processing

Kernel StreamInput Data

Output Data

Image 1 convolve convolve

Depth Map

Little data reuse (pixels never revisited) Highly data parallel (output pixels not dependent on other output pixels) Compute intensive (60 operations per memory reference)

Locality and Concurrency

Depth Map

Operations within a kernel operate on local data

Streams expose data parallelism

Kernels can be partitioned across chips to exploit control parallelism

These Applications Demand HighPerformance Density

GOPs to TOPs of sustained performance Space, weight, and power are limited Historically achieved using special-purpose hardware Building a special processor for each application is

– expensive

– time consuming

– inflexible

Performance of Special-Purpose Processors

Fed by dedicated wires/memoriesLots (100s) of ALUs

Care and Feeding of ALUs

DataBandwidth

Instruction Bandwidth

Instr.Cache

IP‘Feeding’ Structure Dwarfs ALU

Technology scaling makes communication the scarce resource

0.18m256Mb DRAM16 64b FP Proc

500MHz

0.07m4Gb DRAM

256 64b FP Proc2.5GHz

1999 2008

18mm30,000 tracks

1 clockrepeaters every 3mm

25mm120,000 tracks

16 clocksrepeaters every 0.5mm

What Does This Say About Architecture?

Tremendous opportunities– Media problems have lots of parallelism and locality

– VLSI technology enables 100s of ALUs/chip (1000s soon)• (in 0.18um 0.1mm2 per integer adder, 0.5mm2 per FP adder)

Challenging problems– Locality - global structures won’t work

– Explicit parallelism - ILP won’t keep 100 ALUs busy

– Memory - streaming applications don’t cache well Its time to try some new approaches

Example: Register File Organization

Register files serve two functions:– Short term storage for intermediate results

– Communication between multiple function units Global register files don’t scale well as N, number of

ALUs increases– Need more registers to hold more results (grows with N)

– Need more ports to connect all of the units (grows with N2)

Register Files Dwarf ALUs

N A rithm etic Units

32 ALUs

Size of RFto support32 ALUs

Size of1 ALU

Size of RFto support

1 ALU1 cm

4 ALUs 16 ALUs

Register File Area

Each cell requires:– 1 word line per port

– 1 bit line per port Each cell grows as p2

R registers in the file Area: p2R N3

Bit Lines

1 wiregrid

Register Bit Cell

SIMD and Distributed Register Files

N A rithm etic U n its N/CA rithm etic

U nits

C S IM D C lusters

N/CA rithm etic

U nits

C S IM D C lusters

N A rithm etic U n itsN/C A rithm etic

U nitsN/C A rithm etic

U nits

Scalar SIMD

Central

Hierarchical Organizations

N/CA rith . U n its

C S IM D C lusters

N/CA rith . U n its

C S IM D C lusters

N/C A rithm eticU nits

N A rithm etic U n its

Scalar SIMD

Central

Stream Organizations

N A rithm etic U n its N/CA rith . U n its

N/CA rith . U n its

C S IM D C lusters

N/C A rith . U n itsN A rithm etic U n its N/C A rith . U n its

Scalar SIMD

Central

Organizations

1 10 100 1000Number of Arithmetic Units

SIMDCentral

Stream/SIMD/DRF

Hier/SIMD/DRF

SIMD/DRF

1 10 100 1000Number of Arithmetic Units

Central

Hier/SIMD/DRF &Stream/SIMD/DRF

SIMD/DRF

48 ALUs (32-bit), 500 MHz Stream organization improves central organization by

Area: 195x, Delay: 20x, Power: 430x

Performance

CENTRAL SIMD SIMD/DRF HIER. STREAM

16% Performance Drop(8% with latency constraints)

CENTRAL S IMD S IMD/DRF HIER. S TREAM

Convolve DCT Transform Shader FIR FFT Mean

180x Improvement

Scheduling Distributed Register Files

Distributed register files means:– Not all functional units can access all data

– Each functional unit input/output no longer has a dedicated route from/to all register files

A D D 0 L/S A D D 1

can write toeither or

both busescan read

from eitherbus

Communication Scheduling

Schedule operation on instruction Assign operation to functional unit Schedule communication for

operation

load x

A D D 0 L/S A D D 1

... = x + ...

Functiona l U n it

... = x + ...

...load xy = ... z = ...

A D D 0 L/S A D D 1

Functiona l U n it

When to schedule route from operation A to operation B?– Can’t schedule route when A

is scheduled since B is not assigned to a FU

– Can’t wait to schedule route when B is scheduled since other routes may block

Stubs: Communication Placeholders

FU(Op 1)

FU(Op 2)

Stubs– ensure that other

communications will not block write of Op1’s result

Copy-connected architecture– ensures that any functional

unit that performs Op2 can read Op1’s result

Scheduling With Stubs

load x

A D D 0 L/S A D D 1

Functiona l U n it

load xy = ...

A D D 0 L/S A D D 1

Functiona l U n it

= ... x + ...

load xy = ...

A D D 0 L/S A D D 1

Functiona l U n it

Stream Register FileNetworkInterface

StreamController

Imagine Stream Processor

HostProcessor

SDRAMSDRAM SDRAMSDRAM

Streaming Memory SystemM

Arithmetic Clusters

From SRF

To SRF

+ + * * /

Cross Point

Local Register File

Bandwidth Hierarchy

VLIW clusters with shared control 41.2 32-bit operations per word of memory bandwidth

2GB/s 32GB/s

ALU Cluster

544GB/s

Imagine is a Stream Processor

Instructions are Load, Store, and Operate– operands are streams

– Send and Receive for multiple-imagine systems Operate performs a compound stream operation

– read elements from input streams

– perform a local computation

– append elements to output streams

– repeat until input stream is consumed

– (e.g., triangle transform)

Bandwidth Demands of Transformation

References (per ) Stream

Memory 5.5 117 (21.3x) 48 (8.7x)

Global RF 48 624 (13.0x) 261 (5.4x)

Local RF 372

Scalar Vector

N/A N/A

Bandwidth Usage

Memory BW Global RF BW Local RF BW

Depth Extractor 0.80 GB/s 18.45 GB/s 210.85 GB/s

MPEG Encoder 0.47 GB/s 2.46 GB/s 121.05 GB/s

Polygon Rendering 0.78 GB/s 4.06 GB/s 102.46 GB/s

QR Decomposition 0.46 GB/s 3.67 GB/s 234.57 GB/s

Data Parallelism is easier than ILP

Kernel 1 to 8 Cluster Speedup

FFT (1024) 6.4

DCT (8x8) 7.8

Blockwarp (8x8) 7.2

Transform () 8.0

Harmonic Mean 7.3

Conventional Approaches to Data-Dependent Conditional Execution

Data-DependentBranch

Whoops

SpeculativeLoss D x W~100s

y=(x>0)

ExponentiallyDecreasing Duty Factor

Zero-Cost Conditionals

Most Approaches to Conditional Operations are Costly– Branching control flow - dead issue slots on mispredicted branches

– Predication (SIMD select, masked vectors) - large fraction of execution ‘opportunities’ go idle.

Conditional Streams– append an element to an output stream depending on a case variable.

Value Stream

Case Stream {0,1}

Output Stream

Modern DRAM

Synchronous Three-dimensional

– Bank

– Row

– Column

Shared address lines Shared data lines Not truly random access

B ank N

B ank 1

Sense A m plifie rs(R ow B uffer)

C o lum n D ecoder

D a ta

A ddress

M em oryA rray

(B ank 0)

ConcurrentAccess

PipelinedAccess

DRAM Throughput and Latency

P: Bank PrechargeA: Row ActivationC: Column Access

(0 ,0 ,0 )

(1 ,1 ,2 )

(1 ,0 ,1 )

(1 ,1 ,1 )

(1 ,0 ,0 )

(0 ,1 ,3 )

(0 ,0 ,1 )

(0 ,1 ,0 )P A C

292827262524232221191817161514131210987654321 20 39383736353433323130 4443424140

P A CP A C

4948474645 50 565554535251

(0 ,0 ,0 )

(1 ,1 ,2 )

(1 ,0 ,1 )

(1 ,1 ,1 )

(1 ,0 ,0 )

(0 ,1 ,3 )

(0 ,0 ,1 )

(0 ,1 ,0 )P A C

191817161514131210987654321

Time (Cycles)

Without Access Scheduling (56 DRAM cycles)

With Access Scheduling (19 DRAM cycles)

Streaming Memory System

Stream load/store architecture

Memory access scheduling– Out-of-order access

– High stream throughput

Address Generators

Index Data Data

Reorder Buffer

Streaming Memory Performance

Bank Buffer Effectiveness

0.00.20.40.60.81.01.21.41.61.82.0

1 2 4 8 16 32 64 Infinite

Buffer Size

Performance

Application Trace Bandwidth

FFT Depth MPEG Tex WeightedMean

80-90% Improvement

Application Bandwidth

FFT Depth MPEG Tex WeightedMean

in order first ready col/open col/closed row/open row/closed

35% Improvement

Evaluation Tools

Simulators– Imagine RTL model (verilog)

– isim: cycle-accurate simulator (C++)

– idebug: stream-level simulator (C++) Compiler tools

– iscd: kernel scheduler (communication scheduling)

– istream: stream scheduler (SRF/memory management)

– schedviz: visualizer

Stream Programming

Stream-level– Host Processor

– C++

– istream

load (datin1, mem1); // Memory to SRF

load (datin2, mem2);

op (ADDSUB, datin1, datin2, datout1, datout2);

send (datout1, P2, chan0); // SRF to Network

send (datout2, P3, chan1);

Kernel-level– Clusters

– C-like Syntax

– iscd

loop(input_stream0) {

input_stream0 >> a;

input_stream1 >> b;

c = a + b;

d = a - b;

output_stream0 << c;

output_stream1 << d;

Stereo Depth Extractor

Clusters Mem_0 Mem_121300

CONV 3x3

STOREUNPACK LOADCONV 7x7

CONV 3x3

Clust Mem0 Mem1501300

501400

501500

501600

501700

501800

501900

502000

502100

502200

502300

502400

502500

502600

502700

502800

502900

503000

503100

503200

503300

BlockSAD

BlockSADLoad Load

BlockSAD

StoreBlockSAD

BlockSAD

BlockSADLoad Load

BlockSAD

StoreBlockSAD

Load originalpacked row

Unpack (8bit 16 bit)

7x7 Convolve

3x3 Convolve

Store convolved row

Load ConvolvedRows

CalculateBlockSADs atdifferent disparities

Store bestdisparity values

Convolutions Disparity Search

ADD0 ADD1 ADD2 MUL0 MUL1 DIV0 INP0 INP1 INP2 INP3 OUT0 OUT1 SP_0 SP_0 COM0 MC_0 J UK0 VAL0

G E N _ C I S T A T E

C O N D _ I N _ D

G E N _ C C E N D

S P C R E A D _ W T S P C W R I T E

C O M M U C D A T A

C H K _ A N Y

S E L E C T

S H I F T A 1 6

C O M M U C P E R M

S E L E C T

S E L E C T C O M M U C P E R M

I M U L R N D 1 6 S E L E C T

I M U L R N D 1 6

I M U L R N D 1 6I M U L R N D 1 6 N S E L E C T

I M U L R N D 1 6I M U L R N D 1 6

I M U L R N D 1 6I M U L R N D 1 6 P A S S

I M U L R N D 1 6 I M U L R N D 1 6 P A S S

I M U L R N D 1 6 I M U L R N D 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 N S E L E C T

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6 S E L E C T P A S S

I M U L R N D 1 6 I M U L R N D 1 6 P A S SI A D D S 1 6 N S E L E C TP A S S

I M U L R N D 1 6 I M U L R N D 1 6 P A S SI A D D S 1 6 I A D D S 1 6 N S E L E C T

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6I A D D S 1 6S H U F F L E

I M U L R N D 1 6 I M U L R N D 1 6S H U F F L E P A S S

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6 P A S SI A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6 P A S SS H U F F L E

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 S H U F F L E

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 S H U F F L ES H U F F L E

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6I A D D S 1 6I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6 P A S S P A S S

P A S SS H U F F L E

S H U F F L ES H U F F L E S H U F F L E

I A D D S 1 6 I A D D S 1 6 I A D D S 1 6 P A S S

I A D D S 1 6I A D D S 1 6I A D D S 1 6 P A S S P A S SP A S S

I A D D S 1 6 I A D D S 1 6I A D D S 1 6

I A D D S 1 6I A D D S 1 6 I A D D S 1 6

S H U F F L E S H U F F L ES H U F F L E D A T A _ I N

I A D D S 1 6I A D D S 1 6 S H U F F L E P A S SD A T A _ I N

I A D D S 1 6 I A D D S 1 6I A D D S 1 6 P A S SP A S S D A T A _ I N

S E L E C TI A D D S 1 6I A D D S 1 6 D A T A _ I N

I A D D S 1 6 S E L E C TI A D D S 1 6I A D D S 1 6 N S E L E C T D A T A _ O U T

I A D D S 1 6 S E L E C TI A D D S 1 6 I A D D S 1 6 D A T A _ I N D A T A _ O U T

I A D D S 1 6 N S E L E C T D A T A _ I N D A T A _ O U T

I A D D S 1 6 N S E L E C TN S E L E C T D A T A _ O U T

L O O PI A D D S 1 6 I A D D S 1 6 D A T A _ O U T

D A T A _ O U T D A T A _ O U T

ADD0 ADD1 ADD2 MUL0 MUL1 DIV0 INP0 INP1 INP2 INP3 OUT0 OUT1 SP_0 SP_0 COM0 MC_0 J UK0 VAL0

IMULRND16 IMULRND16 PASSIADDS16 NSELECTPASS

IMULRND16 IMULRND16 PASSIADDS16 IADDS16 NSELECTSHIFTA16

IMULRND16 IMULRND16IADDS16 IADDS16

IMULRND16 IMULRND16IADDS16IADDS16SHUFFLE

IMULRND16 IMULRND16SHUFFLE PASS

IMULRND16 IMULRND16IADDS16 IADDS16 PASSIADDS16

IMULRND16 IMULRND16IADDS16 IADDS16

IMULRND16 IMULRND16IADDS16 IADDS16 PASSSHUFFLE

IMULRND16 IMULRND16IADDS16 SHUFFLE

IMULRND16 IMULRND16IADDS16 SHUFFLESHUFFLE

IMULRND16 IMULRND16IADDS16 IADDS16IADDS16 COMMUCPERM

IMULRND16 IMULRND16IADDS16IADDS16IADDS16 COMMUCPERM

IMULRND16 IMULRND16IADDS16 IADDS16 PASS PASS

PASSSHUFFLE SELECT

SHUFFLESHUFFLE SHUFFLE SELECT COMMUCPERM

IADDS16 IADDS16 IADDS16 PASSIMULRND16 SELECT

IADDS16IADDS16IADDS16 PASS PASSPASS IMULRND16

IADDS16 IADDS16IADDS16 IMULRND16IMULRND16 NSELECT

IADDS16IADDS16 IADDS16 IMULRND16IMULRND16

IMULRND16IMULRND16 PASS

SHUFFLE SHUFFLESHUFFLE DATA_INIMULRND16 IMULRND16 PASS

IADDS16IADDS16 SHUFFLE PASSDATA_INIMULRND16 IMULRND16 PASS GEN_CISTATE

IADDS16 IADDS16IADDS16 PASSPASS DATA_INIMULRND16 IMULRND16 COND_IN_D

SELECTIADDS16IADDS16 DATA_INIMULRND16 IMULRND16 GEN_CCEND

IADDS16 SELECTIADDS16IADDS16 NSELECTIMULRND16 IMULRND16 SPCREAD_WT SPCWRITEDATA_OUT

IADDS16 SELECTIADDS16 IADDS16 DATA_INIMULRND16 IMULRND16 COMMUCDATADATA_OUT

IADDS16 NSELECT DATA_INIMULRND16 IMULRND16IADDS16 CHK_ANYDATA_OUT

IADDS16 NSELECTNSELECTIMULRND16 IMULRND16IADDS16IADDS16 DATA_OUT

LOOPIADDS16 IADDS16 IMULRND16 IMULRND16IADDS16 NSELECTSELECT DATA_OUT

IMULRND16 IMULRND16IADDS16 IADDS16 SELECT PASSDATA_OUT DATA_OUT

7x7 Convolution Kernel

ALUs Comm/SPStreams

Pipeline Stage 0

Pipeline Stage 1

Pipeline Stage 2

Performance

12.110.2 11.0

23.925.6

depth mpeg qrd dct convolve fft

16-bit kernels

floating-pointkernel

16-bitapplications

floating-pointapplication

GOPS/W: 4.6 6.9 4.1 10.2 9.6 2.4 6.3

depth mpeg qrd dct convolve fft average

OtherMem SysPinsSRF ClustClock

Implementation

Goals– < 0.7 cm2 chip in 0.18 m CMOS

– Operates at 500 MHz

Status– Cycle accurate C++ simulator

– Behavioral verilog

– Detailed floorplan with all critical signals planned

– Schematics and layout for key cells

– timing analysis of key units and communication paths

Stream Register FileNetworkInterface

StreamController

Imagine Stream Processor

HostProcessor

SDRAMSDRAM SDRAMSDRAM

Streaming Memory SystemM

Imagine Floorplan

SRF/SB CtrlAG

Stream

Cluster 2

Cluster 3

Cluster 0

Cluster 1

Cluster 6

Cluster 7

Cluster 4

Cluster 5

I Stream

rder B

uffers

CCCCCCCCC CCore

BitCoderNI

Routing

5mm2mm0.6mm

Imagine Pipeline

R X1 X2 X3

….Microcontroller

Stream buffersClusters

Communication paths

SRF/SB CtrlAG

Stream

treamb

uffers

BitCoderNI

Control Routing

Clusters

R X1 X2 X3….

Stream buffers

146.7 m

Adder Layout

Local RF

Integer Adder

Imagine Team

Professor William J. Dally

– Scott Rixner

– Ghazi Ben Amor

– Ujval Kapasi

– Brucek Khailany

– Peter Mattson

– Jinyung Namkoong

– John Owens

– Brian Towles

– Don Alpert (Intel)

– Chris Buehler (MIT)

– JP Grossman (MIT)

– Brad Johanson

– Dr. Abelardo Lopez-Lagunas

– Ben Mowery

– Manman Ren

Imagine Summary

Imagine operates on streams of records– simplifies programming

– exposes locality and concurrency

Compound stream operations– perform a subroutine on each stream element

– reduces global register bandwidth

Bandwidth hierarchy– use bandwidth where its inexpensive

– distributed and hierarchical register organization

Conditional stream operations– sort elements into homogeneous streams

– avoid predication or speculation

Imagine gives high performance with low power and flexible programming

Matches capabilities of communication-limited technology to demands of signal and image processing applications

Performance– compound stream operations realize >10GOPS on key applications

– can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)

Power– three-level register hierarchy gives 2-10GOPS/W

Flexibility– programmed in “C”

– streaming model

– conditional stream operations enable applications like sort

MPEG-2 Encoder

BlockSAD(future)

B lockSAD(past)

Pick BestMV

DCTQ RLE VLC

IQDCT Correlate

Save toMemory

Rate Control Feeback

Advantages of Implementing MPEG-2 Encode on Imagine

Block Matching well suited to Imagine– Clusters provide high arithmetic bandwidth

– SRF provides necessary data locality, bandwidth

– Good performance Programmable Solution

– Simplifies system design

– Allows tradeoff between complexity of block matcher vs. bitrate, picture quality

Real Time Power Efficient

MPEG-2 Challenges

VLC (Huffman Coding)– Difficult and inefficient to implement on SIMD clusters

– A separate module which provides LUT and shift register capabilities is used instead

Rate Control– Difficult because multiple macroblocks encoded in parallel

– Must perform on a coarser granularity (impact on picture quality?)

the imagine stream processor concurrent vlsi architecture group stanford university computer systems...

stream processorforces

stream processor20gflops

stream processorlocality

stream processorcare

data parallelismkernels

special processor

chip signallinglowvoltage

memory referencethe

Documents

article in press - stanford medicine · pdf filed program in...

department of anesthesia, stanford, ca 94305

andrea goldsmith - stanford...

epitope-specific monoclonal antibodies to fshβ increase...

leonard susskind - arxiv › pdf › 1604.02589v2.pdf ·...

vita - stanford university€¦ · web viewvita. for....

curriculum vitae - stanford universitycurriculum vitae...

eric h. huang, richard socher, christopher d. manning,...

tanya marie luhrmann - anthropology.stanford.edu · tanya...

david m. studdert · 2018-05-14 · david m. studdert...

300 pasteur drive stanford, ca 94305

atom interferometry group stanford center for position,...

measuring heart rate from video - stanford university ·...

neuroscience center - stanford medical center · date of...

i lnnflhlfl - stanford university · department of...

michael anthony mcfaul stanford ca, 94305 mcfaul@stanford...

use of hierarchical surface parameterization for...

555 salvatierra walk, stanford, ca 94305

vita for elizabeth b. bernhardt office: home...elizabeth b....

· web viewpamela susan karlan stanford law...