The Imagine Stream Processor
Concurrent VLSI Architecture Group
Stanford University Computer Systems Laboratory
Stanford, CA 94305
Scott Rixner
February 23, 2000
Scott Rixner The Imagine Stream Processor 2
Outline
The Stanford Concurrent VLSI Architecture Group
Media processing
Technology
– Distributed register files
– Stream organization
The Imagine Stream Processor
– 20 GFLOPS on a 0.7 cm² chip
– Conditional streams
– Memory access scheduling
– Programming tools
The Concurrent VLSI Architecture Group
Architecture and design technology for VLSI
Routing chips
– Torus Routing Chip, Network Design Frame, Reliable Router
– Basis for Intel, Cray/SGI, Mercury, Avici network chips
Parallel computer systems
J-Machine (MDP) led to Cray T3D/T3E
M-Machine (MAP)
– Fast messaging, scalable processing nodes, scalable memory architecture
[Figure: photos of the MDP chip, J-Machine, Cray T3D, and MAP chip]
Design technology
Off-chip I/O
– Simultaneous bidirectional signaling, 1989
• now used by Intel and Hitachi
– High-speed signalling
• 4Gb/s in 0.6 µm CMOS, Equalization, 1995
On-Chip Signalling
– Low-voltage on-chip signalling
– Low-skew clock distribution
Synchronization
– Mesochronous, Plesiochronous
– Self-Timed Design
[Figure: 4Gb/s CMOS I/O waveform, 250ps/division]
Forces Acting on Media Processor Architecture
Applications - streams of low-precision samples
– video, graphics, audio, DSL modems, cellular base stations
Technology - becoming wire-limited
– power/delay dominated by communication, not arithmetic
– global structures: register files, instruction issue don’t scale
Microarchitecture - ILP has been mined out
– to the point of diminishing returns on squeezing performance from sequential code
– explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance
Stereo Depth Extraction
640x480 @ 30 fps
Requirements
– 11 GOPS
Imagine stream processor
– 12.1 GOPS, 4.6 GOPS/W
[Figure: left camera image, right camera image, and the resulting depth map]
Applications
Little locality of reference
– read each pixel once
– often non-unit stride
– but there is producer-consumer locality
Very high arithmetic intensity
– 100s of arithmetic operations per memory reference
Dominated by low-precision (16-bit) integer operations
Stream Processing
[Figure: stream diagram. Image 0 and Image 1 each pass through two convolve kernels, then an SAD kernel produces the depth map. Legend: kernel, stream, input data, output data]
Little data reuse (pixels never revisited)
Highly data parallel (output pixels not dependent on other output pixels)
Compute intensive (60 operations per memory reference)
Locality and Concurrency
[Figure: stream diagram. Image 0 and Image 1 each pass through two convolve kernels, then an SAD kernel produces the depth map]
Operations within a kernel operate on local data
Streams expose data parallelism
Kernels can be partitioned across chips to exploit control parallelism
These Applications Demand High Performance Density
GOPs to TOPs of sustained performance
Space, weight, and power are limited
Historically achieved using special-purpose hardware
Building a special processor for each application is
– expensive
– time consuming
– inflexible
Performance of Special-Purpose Processors
Lots (100s) of ALUs
Fed by dedicated wires/memories
Care and Feeding of ALUs
[Figure: an ALU and the structures that feed it. Data bandwidth: registers. Instruction bandwidth: instruction cache, IR, IP]
‘Feeding’ Structure Dwarfs ALU
Technology scaling makes communication the scarce resource
                     1999               2008
Process              0.18 µm            0.07 µm
DRAM                 256Mb              4Gb
Processors           16 64b FP Proc     256 64b FP Proc
Clock                500MHz             2.5GHz
Chip edge            18mm               25mm
Wiring tracks        30,000             120,000
Cross-chip delay     1 clock            16 clocks
Repeaters            every 3mm          every 0.5mm
What Does This Say About Architecture?
Tremendous opportunities
– Media problems have lots of parallelism and locality
– VLSI technology enables 100s of ALUs/chip (1000s soon)
• (in 0.18 µm, 0.1 mm² per integer adder, 0.5 mm² per FP adder)
Challenging problems
– Locality - global structures won’t work
– Explicit parallelism - ILP won’t keep 100 ALUs busy
– Memory - streaming applications don’t cache well
It’s time to try some new approaches
Example: Register File Organization
Register files serve two functions:
– Short term storage for intermediate results
– Communication between multiple function units
Global register files don’t scale well as N, the number of ALUs, increases
– Need more registers to hold more results (grows with N)
– Need more ports to connect all of the units (grows with N²)
Register Files Dwarf ALUs
[Figure: register file area versus number of arithmetic units N, drawn against a 1 cm scale. Labels: size of 1 ALU, size of RF to support 1 ALU, 4 ALUs, 16 ALUs, 32 ALUs, size of RF to support 32 ALUs]
Register File Area
Each cell requires:
– 1 word line per port
– 1 bit line per port
Each cell grows as p²
R registers in the file
Area: p²R ∝ N³
[Figure: register bit cell on a 1-wire grid, crossed by p bit lines (width w) and p word lines (height h)]
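The scaling argument on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the slide's model (each bit cell is p wire tracks on a side, so area grows as p²R) and assumes a hypothetical sizing rule of 3 ports and 16 registers per ALU, which is not from the talk. Under any such linear rule, doubling N multiplies the area by 8.

```cpp
#include <cstdint>

// Register file area per bit of word width, in units of (wire pitch)^2,
// following the slide's model: each cell is p tracks wide and p tracks tall.
constexpr std::uint64_t register_file_area(std::uint64_t ports,
                                           std::uint64_t registers) {
    return ports * ports * registers;
}

// Hypothetical sizing rule (an assumption for illustration only):
// every ALU adds 3 ports and 16 registers to a single central file.
constexpr std::uint64_t central_rf_area(std::uint64_t n_alus) {
    return register_file_area(3 * n_alus, 16 * n_alus);
}

// Doubling the number of ALUs costs 2^3 = 8 times the area.
static_assert(central_rf_area(2) == 8 * central_rf_area(1),
              "area grows as N^3");
```

With this rule, going from 1 ALU to 32 ALUs multiplies the central register file area by 32³ = 32768, which is the effect the preceding figure illustrates.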
SIMD and Distributed Register Files
[Figure: four register file organizations for N arithmetic units, arranged by control (scalar versus SIMD, with C SIMD clusters of N/C arithmetic units each) and by register file structure (central versus DRF)]
Hierarchical Organizations
[Figure: hierarchical versions of the four organizations: N arithmetic units or C SIMD clusters of N/C arithmetic units, with scalar or SIMD control and central or DRF register files]
Stream Organizations
[Figure: stream versions of the four organizations: N arithmetic units or C SIMD clusters of N/C arithmetic units, with scalar or SIMD control and central or DRF register files]
Organizations
[Figure: two log-log plots (0.1 to 1000) versus number of arithmetic units (1 to 1000) for the Central, SIMD, SIMD/DRF, Hier/SIMD/DRF, and Stream/SIMD/DRF organizations, with the 48-unit point marked]
48 ALUs (32-bit), 500 MHz
Stream organization improves central organization by
– Area: 195x, Delay: 20x, Power: 430x
Performance
[Figure: speedup (0.0 to 1.2) for the CENTRAL, SIMD, SIMD/DRF, HIER., and STREAM organizations on Convolve, DCT, Transform, Shader, FIR, FFT, and their mean]
16% Performance Drop (8% with latency constraints)
[Figure: performance/area (0 to 250) for the same organizations and kernels]
180x Improvement
Scheduling Distributed Register Files
Distributed register files mean:
– Not all functional units can access all data
– Each functional unit input/output no longer has a dedicated route from/to all register files
[Figure: functional units ADD0, L/S, and ADD1 on shared buses. An output can write to either or both buses; an input can read from either bus]
Communication Scheduling
Schedule operation on instruction
Assign operation to functional unit
Schedule communication for operation
[Figure: schedule grids of instruction (rows) versus functional unit (columns ADD0, L/S, ADD1). "load x" is placed on L/S, the consumer "... = x + ..." is not yet placed, and other operations (y = ..., z = ...) compete for the routes between them]
When to schedule route from operation A to operation B?
– Can’t schedule route when A is scheduled since B is not assigned to a FU
– Can’t wait to schedule route when B is scheduled since other routes may block
Stubs: Communication Placeholders
[Figure: Op 1 on FU(Op 1) and Op 2 on FU(Op 2), each next to a register file. A write stub and a read stub together make up the data transfer from Op 1 to Op 2]
Stubs
– ensure that other communications will not block write of Op1’s result
Copy-connected architecture
– ensures that any functional unit that performs Op2 can read Op1’s result
Scheduling With Stubs
[Figure: three successive schedule grids of instruction versus functional unit (ADD0, L/S, ADD1), showing "load x", then "y = ...", then the consumer "= ... x + ..." being placed]
The Imagine Stream Processor
[Figure: Imagine block diagram. Components: host processor, stream controller, microcontroller, ALU clusters 0 to 7, stream register file, network interface (to the network), and streaming memory system with four SDRAMs]
Arithmetic Clusters
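[Figure: one arithmetic cluster. Three adders, two multipliers, and a divider, each with a local register file, connected by a cross point; a CU connects to the intercluster network; data arrives from the SRF and returns to the SRF]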
Bandwidth Hierarchy
VLIW clusters with shared control
41.2 32-bit operations per word of memory bandwidth
[Figure: bandwidth hierarchy. SDRAM: 2GB/s; stream register file: 32GB/s; ALU clusters: 544GB/s]
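The figures on this slide can be related with simple arithmetic. The sketch below assumes a 4-byte (32-bit) word and takes 544GB/s as the bandwidth inside the ALU clusters; both readings are assumptions. Under them, 41.2 operations per memory word at 2GB/s works out to about 20.6 GOPS, in line with the 20 GFLOPS peak quoted in the outline.

```cpp
// Bandwidth levels quoted on the slide, in GB/s.
constexpr double kMemoryGBps = 2.0;
constexpr double kStreamRegisterFileGBps = 32.0;
constexpr double kClusterGBps = 544.0;  // assumed to be the in-cluster level

// Assumption: one "word" is 4 bytes (32 bits).
constexpr double kBytesPerWord = 4.0;

// Arithmetic rate (GOPS) that keeps pace with memory when each memory word
// is used for the given number of operations.
constexpr double gops_at_memory_rate(double ops_per_word) {
    return kMemoryGBps / kBytesPerWord * ops_per_word;
}

// Bandwidth of one level of the hierarchy relative to memory.
constexpr double relative_to_memory(double level_gbps) {
    return level_gbps / kMemoryGBps;
}
```

Relative to memory, the three levels stand in the ratio 1 : 16 : 272.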
Imagine is a Stream Processor
Instructions are Load, Store, and Operate
– operands are streams
– Send and Receive for multiple-Imagine systems
Operate performs a compound stream operation
– read elements from input streams
– perform a local computation
– append elements to output streams
– repeat until input stream is consumed
– (e.g., triangle transform)
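The read, compute, append, repeat cycle listed above can be sketched in ordinary C++. This is an illustration, not Imagine's instruction set: a stream is modeled as a std::vector, and the operate template, the Vertex record, and the scale_and_shift kernel are hypothetical names introduced here.

```cpp
#include <vector>

// Applies a kernel to every element of an input stream and appends each
// result to an output stream, stopping when the input is consumed.
template <typename In, typename Out, typename Kernel>
std::vector<Out> operate(const std::vector<In>& input, Kernel kernel) {
    std::vector<Out> output;
    output.reserve(input.size());
    for (const In& element : input) {       // read an element from the input stream
        output.push_back(kernel(element));  // local computation, then append
    }
    return output;
}

// Hypothetical record type and kernel standing in for a vertex transform.
struct Vertex {
    float x;
    float y;
};

inline Vertex scale_and_shift(const Vertex& v) {
    return Vertex{2.0f * v.x + 1.0f, 2.0f * v.y - 1.0f};
}
```

A call such as `operate<Vertex, Vertex>(vertices, scale_and_shift)` touches each input record once; all intermediate values stay local to the kernel.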
Bandwidth Demands of Transformation
References (per )   Stream   Scalar        Vector
Memory              5.5      117 (21.3x)   48 (8.7x)
Global RF           48       624 (13.0x)   261 (5.4x)
Local RF            372      N/A           N/A
Bandwidth Usage
Memory BW Global RF BW Local RF BW
Depth Extractor 0.80 GB/s 18.45 GB/s 210.85 GB/s
MPEG Encoder 0.47 GB/s 2.46 GB/s 121.05 GB/s
Polygon Rendering 0.78 GB/s 4.06 GB/s 102.46 GB/s
QR Decomposition 0.46 GB/s 3.67 GB/s 234.57 GB/s
Data Parallelism is easier than ILP
Kernel 1 to 8 Cluster Speedup
FFT (1024) 6.4
DCT (8x8) 7.8
Blockwarp (8x8) 7.2
Transform () 8.0
Harmonic Mean 7.3
Conventional Approaches to Data-Dependent Conditional Execution
[Figure: three ways to execute a conditional on x>0 that follows A with either B, C or J, K.
Data-dependent branch: control takes the Y or N path.
Speculation: a wrong guess ("Whoops") costs a speculative loss of D x W, ~100s.
Predication: y=(x>0) guards both paths (if y / if ~y), giving an exponentially decreasing duty factor]
Zero-Cost Conditionals
Most Approaches to Conditional Operations are Costly
– Branching control flow - dead issue slots on mispredicted branches
– Predication (SIMD select, masked vectors) - large fraction of execution ‘opportunities’ go idle.
Conditional Streams
– append an element to an output stream depending on a case variable.
[Figure: a value stream and a case stream {0,1} combine into an output stream]
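A conditional stream as described on this slide can be sketched as a filtered append. The function below is a software illustration only, with a hypothetical name and std::vector standing in for streams; it does not model how the hardware routes elements between clusters.

```cpp
#include <cstddef>
#include <vector>

// Appends values[i] to the output stream only when cases[i] is nonzero.
// If the two streams differ in length, the extra elements are ignored.
template <typename T>
std::vector<T> conditional_append(const std::vector<T>& values,
                                  const std::vector<int>& cases) {
    std::vector<T> output;
    const std::size_t n =
        values.size() < cases.size() ? values.size() : cases.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (cases[i] != 0) {
            output.push_back(values[i]);
        }
    }
    return output;
}
```

Calling it once with the case stream and once with the inverted case stream splits the input into two homogeneous streams, each of which can then be processed without branches or predication.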
Modern DRAM
Synchronous
Three-dimensional
– Bank
– Row
– Column
Shared address lines
Shared data lines
Not truly random access
[Figure: DRAM organization. Banks 0 to N, each a memory array with a row decoder, sense amplifiers (row buffer), and a column decoder, sharing the address and data lines. Labels: concurrent access, pipelined access]
DRAM Throughput and Latency
P: Bank Precharge, A: Row Activation, C: Column Access
[Figure: two timing diagrams of eight references, given as (bank, row, column): (0,0,0), (1,1,2), (1,0,1), (1,1,1), (1,0,0), (0,1,3), (0,0,1), (0,1,0), plotted against time in cycles]
Without Access Scheduling (56 DRAM cycles)
With Access Scheduling (19 DRAM cycles)
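The 56-cycle figure can be reproduced with a small cost model. The sketch below is an assumption-laden illustration: the timings of 3 cycles for precharge, 3 for activation, and 1 for a column access are inferred from 8 x 7 = 56 and are not stated on the slide, the references are taken in the order listed, and no two commands overlap. Grouping references by row in this serial model gives 32 cycles; the slide's 19 cycles also overlaps precharge and activation in one bank with accesses in the other, which this sketch does not model.

```cpp
#include <algorithm>
#include <vector>

struct Reference {
    int bank;
    int row;
    int column;
};

// Assumed timings in DRAM cycles (inferred, not given on the slide).
constexpr int kPrecharge = 3;
constexpr int kActivate = 3;
constexpr int kColumn = 1;

// Serial cost: commands never overlap and each bank remembers its open row.
// A reference to any other row pays precharge + activate before its column access.
int serial_cycles(const std::vector<Reference>& refs, int num_banks) {
    std::vector<int> open_row(num_banks, -1);  // -1: requested row is not open
    int cycles = 0;
    for (const Reference& r : refs) {
        if (open_row[r.bank] != r.row) {
            cycles += kPrecharge + kActivate;
            open_row[r.bank] = r.row;
        }
        cycles += kColumn;
    }
    return cycles;
}

// Reorders the references so that those to the same (bank, row) are adjacent.
std::vector<Reference> group_by_row(std::vector<Reference> refs) {
    std::stable_sort(refs.begin(), refs.end(),
                     [](const Reference& a, const Reference& b) {
                         return a.bank != b.bank ? a.bank < b.bank : a.row < b.row;
                     });
    return refs;
}
```

In the listed order every reference misses the open row, so each pays the full 7 cycles. After grouping, each of the four (bank, row) pairs pays for one row miss and one row hit.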
Streaming Memory System
Stream load/store architecture
Memory access scheduling
– Out-of-order access
– High stream throughput
[Figure: streaming memory system. Address generators (index, data) feed a bank buffer in front of each DRAM bank, 0 to n, and a reorder buffer]
Streaming Memory Performance
[Figure: bank buffer effectiveness. Cycles/access (0.0 to 2.0) versus buffer size (1, 2, 4, 8, 16, 32, 64, infinite)]
Performance
[Figure: application trace bandwidth. Memory bandwidth (MB/s, 0 to 2000) for FFT, Depth, MPEG, Tex, and the weighted mean]
80-90% Improvement
[Figure: application bandwidth. Memory bandwidth (MB/s, 0 to 2000) for FFT, Depth, MPEG, Tex, and the weighted mean]
35% Improvement
Legend: in order, first ready, col/open, col/closed, row/open, row/closed
Evaluation Tools
Simulators
– Imagine RTL model (Verilog)
– isim: cycle-accurate simulator (C++)
– idebug: stream-level simulator (C++)
Compiler tools
– iscd: kernel scheduler (communication scheduling)
– istream: stream scheduler (SRF/memory management)
– schedviz: visualizer
Stream Programming
Stream-level
– Host Processor
– C++
– istream
load (datin1, mem1); // Memory to SRF
load (datin2, mem2);
op (ADDSUB, datin1, datin2, datout1, datout2);
send (datout1, P2, chan0); // SRF to Network
send (datout2, P3, chan1);
Kernel-level
– Clusters
– C-like Syntax
– iscd
loop(input_stream0) {
input_stream0 >> a;
input_stream1 >> b;
c = a + b;
d = a - b;
output_stream0 << c;
output_stream1 << d;
}
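For readers without the Imagine tools, the kernel above can be emulated in standard C++. This is a sketch under assumptions: streams are modeled as std::vector<int>, the addsub function is a hypothetical stand-in for the ADDSUB kernel, and the loop stops at the shorter input, whereas the slide's loop runs until input_stream0 is consumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Stream = std::vector<int>;

// Emulates the ADDSUB kernel: each iteration reads one element from each
// input stream and appends their sum and difference to the two outputs.
std::pair<Stream, Stream> addsub(const Stream& input0, const Stream& input1) {
    Stream sums;
    Stream differences;
    const std::size_t n = std::min(input0.size(), input1.size());
    sums.reserve(n);
    differences.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const int a = input0[i];
        const int b = input1[i];
        sums.push_back(a + b);         // output_stream0 << c
        differences.push_back(a - b);  // output_stream1 << d
    }
    return {sums, differences};
}
```

On Imagine the same loop body would run on all clusters at once, each cluster taking different elements of the streams.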
Stereo Depth Extractor
[Figure: two schedule visualizations with columns for the clusters and the two memory units, plotted against cycle count. Convolutions (cycles 21300 to 23600): repeating LOAD, UNPACK, CONV 7x7, CONV 3x3, STORE. Disparity search (cycles 501300 to 503300): repeating Load, BlockSAD, Store]
Convolutions
– Load original packed row
– Unpack (8-bit to 16-bit)
– 7x7 Convolve
– 3x3 Convolve
– Store convolved row
Disparity Search
– Load convolved rows
– Calculate BlockSADs at different disparities
– Store best disparity values
7x7 Convolution Kernel
[Figure: kernel schedule shown twice, with one row per cycle and one column per cluster resource (ADD0, ADD1, ADD2, MUL0, MUL1, DIV0, INP0 to INP3, OUT0, OUT1, SP_0, COM0, MC_0, JUK0, VAL0). Column groups: ALUs, Streams, Comm/SP. The second copy is marked with pipeline stages 0, 1, and 2. Operations shown include IMULRND16, IADDS16, SHUFFLE, SELECT, NSELECT, PASS, SHIFTA16, COMMUCPERM, COMMUCDATA, DATA_IN, DATA_OUT, and LOOP]
Performance
[Figure: bar chart of GOPS (0 to 30) for depth, mpeg, qrd, dct, convolve, and fft, grouped as 16-bit applications, floating-point application, 16-bit kernels, and floating-point kernel. Bar values as extracted: 12.1, 10.2, 11.0, 23.9, 25.6, 7.0]
Power
GOPS/W: depth 4.6, mpeg 6.9, qrd 4.1, dct 10.2, convolve 9.6, fft 2.4, average 6.3
[Figure: power in watts (0.0 to 3.0) for each benchmark, broken down into Other, Mem Sys, Pins, SRF, Clust, and Clock. Breakdown percentages as extracted: 24%, 63%, 5%, 5%, 1%, 2%]
Implementation
Goals
– < 0.7 cm² chip in 0.18 µm CMOS
– Operates at 500 MHz
Status
– Cycle accurate C++ simulator
– Behavioral Verilog
– Detailed floorplan with all critical signals planned
– Schematics and layout for key cells
– Timing analysis of key units and communication paths
The Imagine Stream Processor
[Figure: Imagine block diagram. Components: host processor, stream controller, microcontroller, ALU clusters 0 to 7, stream register file, network interface (to the network), and streaming memory system with four SDRAMs]
Imagine Floorplan
[Figure: Imagine floorplan. Labeled blocks: clusters 0 to 7, eight SRF blocks, stream buffers, SRF/SB Ctrl, AG, Clust/UC, MS/HI/NI stream/re-order buffers, HI, MB0 to MB3, BitCoder, NI, core, and routing. Dimensions shown: 5.9mm, 5mm, 2mm, 1mm, 0.6mm]
Imagine Pipeline
[Figure: Imagine pipeline. Units: MEM, SRF, stream buffers, microcontroller, clusters. Stage labels: F1, F2, D, SEL, R, X1, X2, X3, X4, W]
Communication paths
[Figure: communication paths on the floorplan. Labeled blocks: SRF/SB Ctrl, AG, stream buffers, SRF, clusters, HI, Mem Sys, BitCoder, NI, and control routing. Stage labels: MEM, F1, F2, D, SEL, R, X1, X2, X3, X4, W]
Adder Layout
[Figure: layout of an integer adder with its local RF, 460 µm by 146.7 µm]
Imagine Team
Professor William J. Dally
– Scott Rixner
– Ghazi Ben Amor
– Ujval Kapasi
– Brucek Khailany
– Peter Mattson
– Jinyung Namkoong
– John Owens
– Brian Towles
– Don Alpert (Intel)
– Chris Buehler (MIT)
– JP Grossman (MIT)
– Brad Johanson
– Dr. Abelardo Lopez-Lagunas
– Ben Mowery
– Manman Ren
Imagine Summary
Imagine operates on streams of records
– simplifies programming
– exposes locality and concurrency
Compound stream operations
– perform a subroutine on each stream element
– reduces global register bandwidth
Bandwidth hierarchy
– use bandwidth where it’s inexpensive
– distributed and hierarchical register organization
Conditional stream operations
– sort elements into homogeneous streams
– avoid predication or speculation
Imagine gives high performance with low power and flexible programming
Matches capabilities of communication-limited technology to demands of signal and image processing applications
Performance
– compound stream operations realize >10GOPS on key applications
– can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
Power
– three-level register hierarchy gives 2-10GOPS/W
Flexibility
– programmed in “C”
– streaming model
– conditional stream operations enable applications like sort
MPEG-2 Encoder
[Figure: MPEG-2 encoder stream diagram. Kernels: Color, BlockSAD (past), BlockSAD (future), Pick Best MV, DCTQ, RLE, VLC, IQDCT, Correlate, and Save to Memory, with paths labeled I, P, and ALL, and rate control feedback]
Advantages of Implementing MPEG-2 Encode on Imagine
Block Matching well suited to Imagine
– Clusters provide high arithmetic bandwidth
– SRF provides necessary data locality, bandwidth
– Good performance
Programmable Solution
– Simplifies system design
– Allows tradeoff between complexity of block matcher vs. bitrate, picture quality
Real Time
Power Efficient
MPEG-2 Challenges
VLC (Huffman Coding)
– Difficult and inefficient to implement on SIMD clusters
– A separate module which provides LUT and shift register capabilities is used instead
Rate Control
– Difficult because multiple macroblocks encoded in parallel
– Must perform on a coarser granularity (impact on picture quality?)