The Imagine Stream Processor: Flexibility with Performance

William J. Dally
Computer Systems Laboratory, Stanford University
billd@csl.stanford.edu

Convergence Workshop, March 30, 2001
Outline

• Motivation – we need low-power, programmable TeraOPS
• The problem is bandwidth
  – Growing gap between special-purpose and general-purpose hardware
  – It's easy to make ALUs, hard to keep them fed
• A stream processor gives programmable bandwidth
  – Streams expose locality and concurrency in the application
  – A bandwidth hierarchy exploits this
• Imagine is a 20 GFLOPS prototype stream processor
• Many opportunities to do better
  – Scaling up
  – Simplifying programming
Motivation

• Some things I'd like to do with a few TeraOPS
  – Have a realistic face-to-face meeting with someone in Boston without
    riding an airplane
    • 4-8 cameras, extract depth, fit model, compress, render to several screens
  – High-quality rendering at video rates
    • Ray tracing a 2K x 4K image with 10^5 objects at 60 frames/s
The good news – FLOPS are cheap, OPS are cheaper

• 32-bit FPU – 2 GFLOPS/mm2 – 400 GFLOPS/chip
• 16-bit add – 40 GOPS/mm2 – 8 TOPS/chip

[Figure: layout of an arithmetic cell, 460 µm x 146.7 µm, showing the
local RF and integer adder]
The bad news – General-purpose processors can't harness this

[Figure: log-scale plot of FLOPS (1e+8 to 1e+15) versus year (2001-2011),
comparing achievable FLOPS against GP-Peak and GP-Useful performance of
general-purpose processors]
Why do Special-Purpose Processors Perform Well?

• Lots (100s) of ALUs
• Fed by dedicated wires/memories
Care and Feeding of ALUs

[Figure: a conventional datapath – instruction pointer, instruction cache,
and IR supplying instruction bandwidth, and registers supplying data
bandwidth, to a single ALU. The 'feeding' structure dwarfs the ALU.]
The problem is bandwidth
• Can we solve this bandwidth problem without sacrificing programmability?
Streams expose locality and concurrency

[Figure: stereo depth extraction as a stream graph – Image 0 and Image 1
each flow through two convolve kernels into a SAD kernel that produces
the depth map]

• Operations within a kernel operate on local data
• Streams expose data parallelism
• Kernels can be partitioned across chips to exploit control parallelism
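The kernel/stream structure can be sketched in plain C. This is a sequential model of the streaming style, not Imagine's actual kernel code: the kernel reads one record per iteration from an input stream, keeps all reuse (the tap window and coefficients) in local state, and writes one record to an output stream.

```c
#include <stddef.h>
#include <stdint.h>

#define TAPS 7

/* Hypothetical 1-D convolve kernel in the streaming style: the only
 * global traffic is one stream read and one stream write per element;
 * the tap window lives in (what would be) the cluster's local RF. */
void convolve_kernel(const int16_t *in, size_t n,
                     const int16_t coef[TAPS], int32_t *out)
{
    int16_t window[TAPS] = {0};          /* local working set */
    for (size_t i = 0; i < n; i++) {
        for (int t = TAPS - 1; t > 0; t--)   /* shift: local traffic only */
            window[t] = window[t - 1];
        window[0] = in[i];               /* one stream read */
        int32_t acc = 0;
        for (int t = 0; t < TAPS; t++)
            acc += (int32_t)window[t] * coef[t];
        out[i] = acc;                    /* one stream write */
    }
}
```

Because each output depends only on the local window, the loop iterations can be spread across SIMD clusters (data parallelism), and independent kernels can run on different chips (control parallelism).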
A Bandwidth Hierarchy exploits locality and concurrency

• VLIW clusters with shared control
• 41.2 32-bit operations per word of memory bandwidth

[Figure: bandwidth hierarchy – four SDRAMs at 2 GB/s feed the Stream
Register File at 32 GB/s, which feeds the ALU clusters at 544 GB/s]
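The 41.2 figure can be sanity-checked with one division. The pairing assumed here (peak arithmetic rate over memory word rate) is my reading, not stated on the slide: 2 GB/s of memory bandwidth is 0.5 G 32-bit words/s, so 41.2 operations per word corresponds to a peak rate of about 20.6 GOPS, consistent with the ~20 GFLOPS quoted for Imagine.

```c
/* Back-of-the-envelope check: operations available per 32-bit word
 * fetched from memory = peak arithmetic rate / memory word rate.
 * With ~20.6 GOPS peak and 2 GB/s (0.5 Gwords/s), this gives 41.2. */
double ops_per_memory_word(double peak_gops, double mem_gb_per_s)
{
    double gwords_per_s = mem_gb_per_s / 4.0;  /* 4 bytes per 32-bit word */
    return peak_gops / gwords_per_s;
}
```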
Bandwidth Usage

Application         Memory BW   Global RF BW   Local RF BW
Depth Extractor     0.80 GB/s   18.45 GB/s     210.85 GB/s
MPEG Encoder        0.47 GB/s    2.46 GB/s     121.05 GB/s
Polygon Rendering   0.78 GB/s    4.06 GB/s     102.46 GB/s
QR Decomposition    0.46 GB/s    3.67 GB/s     234.57 GB/s

[Figure: the bandwidth hierarchy diagram again – SDRAM at 2 GB/s, Stream
Register File at 32 GB/s, ALU clusters at 544 GB/s]
The Imagine Stream Processor

[Figure: Imagine block diagram – a host processor and network connect
through the Network Interface and Stream Controller to the Stream
Register File, which serves eight ALU clusters (0-7) under a shared
microcontroller, with a streaming memory system built from four SDRAMs]
Arithmetic Clusters

[Figure: one arithmetic cluster – three adders, two multipliers, and a
divide unit, each fed from local register files through a cross-point
switch, with ports to and from the SRF and a communication unit (CU)
connecting to the intercluster network]
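The eight clusters execute in SIMD fashion: the shared microcontroller issues one VLIW instruction per cycle and every cluster applies it to its own record out of its own local register file. A toy model of that broadcast (the structure and names here are illustrative, not the real microarchitecture):

```c
#include <stdint.h>

#define NCLUSTERS 8

/* Each cluster has its own local register file. */
typedef struct { int32_t regs[16]; } cluster_t;

/* One broadcast "instruction": every cluster performs the same add,
 * but on its own registers, i.e. on its own slice of the stream. */
void broadcast_add(cluster_t c[NCLUSTERS], int dst, int a, int b)
{
    for (int k = 0; k < NCLUSTERS; k++)
        c[k].regs[dst] = c[k].regs[a] + c[k].regs[b];
}
```

Sharing one instruction sequencer across eight clusters is what keeps the instruction bandwidth (and its power) an eighth of what eight independent processors would need.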
Performance

depth      12.1 GOPS   (16-bit application)
mpeg       19.8 GOPS   (16-bit application)
qrd        11.0 GOPS   (floating-point application)
dct        23.9 GOPS   (16-bit kernel)
convolve   25.6 GOPS   (16-bit kernel)
fft         7.0 GOPS   (floating-point kernel)
Power

[Figure: power dissipation (up to ~3 W) for depth, mpeg, qrd, dct,
convolve, fft, and the average, broken down into Clock, Clusters, SRF,
Pins, Memory System, and Other; the chart's shares are 63%, 23%, 6%, 5%,
2%, and 1%]

GOPS/W:  depth 4.6   mpeg 10.7   qrd 4.1   dct 10.2   convolve 9.6
         fft 2.4   average 6.9
A Look Inside an Application: Stereo Depth Extraction

• 320x240 8-bit grayscale images
• 30-disparity search
• 220 frames/second
• 12.7 GOPS
• 5.7 GOPS/W

Convolution stage (per row): load original packed row -> unpack
(8-bit -> 16-bit) -> 7x7 convolve -> 3x3 convolve -> store convolved row

Disparity search stage: load convolved rows -> calculate block SADs at
different disparities -> store best disparity values

[Figure: execution timelines for both stages showing the clusters and the
two memory ports busy concurrently – kernels overlap with the loads and
stores that stage their streams]
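The disparity-search stage can be sketched as follows: for each pixel, try each candidate disparity, compute a sum of absolute differences (SAD) over a small window of the two convolved rows, and keep the best. The 30-disparity range comes from the slide; the window size and everything else is an illustrative simplification of the real kernel.

```c
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* For pixel x, return the disparity d in [0, ndisp) minimizing the SAD
 * between a (2*win+1)-wide window of the left row and the same window of
 * the right row shifted by d. Out-of-range taps are skipped. */
int best_disparity(const int16_t *left, const int16_t *right,
                   int x, int width, int ndisp, int win)
{
    int best_d = 0;
    long best_sad = LONG_MAX;
    for (int d = 0; d < ndisp; d++) {
        long sad = 0;
        for (int w = -win; w <= win; w++) {
            int xl = x + w, xr = x + w - d;
            if (xl < 0 || xl >= width || xr < 0 || xr >= width) continue;
            sad += labs((long)left[xl] - right[xr]);
        }
        if (sad < best_sad) { best_sad = sad; best_d = d; }
    }
    return best_d;
}
```

Every disparity for every pixel is independent, which is exactly the data parallelism the clusters exploit.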
7x7 Convolve Kernel

[Figure: the compiled VLIW schedule of the 7x7 convolve kernel – per-cycle
assignment of operations (IMULRND16, IADDS16, SHUFFLE, SELECT/NSELECT,
communication permutes, and stream input/output) to the cluster's
functional units (ADD0-2, MUL0-1, DIV0, input/output ports, scratchpad,
communication, and microcontroller ports), with the multipliers occupied
nearly every cycle of the loop]
Imagine gives high performance with low power and flexible programming

• Matches the capabilities of communication-limited technology to the
  demands of signal and image processing applications
• Performance
  – compound stream operations realize >10 GOPS on key applications
  – can be extended by partitioning an application across several
    Imagines (TFLOPS on a circuit board)
• Power
  – three-level register hierarchy gives 2-10 GOPS/W
• Flexibility
  – programmed in "C"
  – streaming model
  – conditional stream operations enable applications like sort
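A conditional stream operation lets each cluster append a record to an output stream only when a condition holds, producing a compacted stream of data-dependent length. That primitive is what lets a SIMD stream machine express irregular applications such as sort. The C below is a sequential model of the idea, not Imagine's hardware mechanism:

```c
#include <stddef.h>
#include <stdint.h>

/* Conditionally append each input record to the output stream; the
 * per-record predicate here (non-negative) is just an example.
 * Returns the length of the compacted output stream. */
size_t conditional_append(const int32_t *in, size_t n, int32_t *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (in[i] >= 0)
            out[m++] = in[i];
    return m;
}
```

On SIMD hardware the interesting part is that the append is collective: clusters whose condition is false produce nothing, and the surviving records are packed densely into the stream.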
A look forward

• Next steps
  – Build some Imagine prototypes
    • Dual-processor 40 GFLOPS systems, 64-processor TeraFLOPS systems
• Longer term
  – 'Industrial strength' Imagine – 100-200 GFLOPS/chip
    • Multiple sets of arithmetic clusters per chip, higher clock rate,
      on-chip cache, more off-chip bandwidth
  – Graphics extensions
    • Texture cache, raster unit – as SRF clients
  – A streaming supercomputer
    • 64-bit FP, high-bandwidth global memory, MIMD extensions
  – Simplified stream programming
    • Automate inter-cluster communication, partitioning into kernels,
      sub-word arithmetic, staging of data
Take home message

• VLSI technology enables us to put TeraOPS on a chip
• Conventional general-purpose architectures cannot exploit this
  – The problem is bandwidth
• Casting an application as kernels operating on streams exposes locality
  and concurrency
• A stream architecture exploits this locality and concurrency to achieve
  high arithmetic rates with limited bandwidth
  – Bandwidth hierarchy, compound stream operations
• Imagine is a prototype stream processor
  – One chip – 20 GFLOPS peak, 10 GFLOPS sustained, 4 W
  – Systems scale to TeraFLOPS and more