ASH: A Substrate for Scalable Architectures
Mihai Budiu
Seth Copen Goldstein
http://www.cs.cmu.edu/~phoenix
CALCM Seminar, March 19, 2002
Resources
CPU Problems
• Complexity
• Power
• Global Signals
• Limited issue window => limited ILP
We propose an architecture with none of these limits
Outline
• Scalability
• Reconfigurable hardware advantages
• A hybrid RH + CPU architecture
• CPU and RH as peers
• Application Specific Hardware
Computational Bandwidth
FU * clock freq
• CPU
• RH: Unbounded
[Figure: functional units *, +, / evaluating a = a + b; b = b + c]
Registers
• CPU: Fixed (eax, ebx, ecx, edx)
• RH: Unbounded
[Figure: variables i, j, k, l, m; spill to sp[0]]
Register Bandwidth
• CPU: Fixed (R1, R2, R3, W1, W2)
• RH: Unbounded
Out-of-Order Execution
CPU: Fetch, Decode, Dispatch, Execute, Commit
• In-order
• Limited by window
RH: Compiler’s window is unbounded
Outline
• Scalability
• Reconfigurable hardware advantages
• A hybrid RH + CPU architecture
• CPU and RH as peers
• Application Specific Hardware
Hybrid system: CPU+RH
• CPU: Low ILP + OS + VM (generic)
• RH: High ILP (application-specific)
• Tight coupling
[Figure: CPU and RH, both connected to Memory]
Problem
[Figure: HLL Program, Compiler, CPU + RH, Memory]
Our Solution
General: applicable to today’s software
Automatic: compiler-driven [RISC approach]
Scalable: with clock, hardware and program size
Parallelism: exploit application parallelism
• bit-level
• ILP
• pipeline
• loop-level
Outline
• Scalability
• Reconfigurable hardware advantages
• A hybrid RH + CPU architecture
• CPU and RH as peers
• Application Specific Hardware
Peering
Program:
a( ) { b( ); }
b( ) { c( ); }
c( ) { d( ); }
d( ) { }
[Figure: procedures a, b, c, d distributed between CPU and RH]
[Figure: CPU and RH with procedures a, b, c, d and stubs b’, c’, d’, connected by an “RPC”]
• software procedure call
• marshalling, control transfer
• hardware dependent
Stubs built automatically.
Stub Synthesis
[Figure: tool flow with Program, Partitioning, Procedures for CPU, Procedures for RH, Stubs, RH Compiler, Configuration, Linker, Executable]
Outline
• Scalability
• Reconfigurable hardware advantages
• A hybrid RH + CPU architecture
• CPU and RH as peers
• Application Specific Hardware
Application-Specific Hardware
[Figure: HLL program → Compiler → Circuit on Reconfigurable hardware]
[Figure: HLL Program, Compiler, CPU + RH, Memory]
CASH: Compiling for ASH
[Figure: C Program compiled for the RH: Circuits, Interconnection net, Memory partitioning]
Asynchronous Computation
[Figure: + operator with data, data ready, and ack signals]
Can extend to locally synchronous, globally asynchronous
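The data / data ready / ack handshake on this slide can be sketched in software. This is a minimal model, not the circuit from the talk: the `channel` struct and the `chan_send`, `chan_recv`, and `adder_fire` names are hypothetical, and it assumes one value in flight per wire.

```c
#include <stdbool.h>

/* One point-to-point wire bundle: a value plus the two handshake signals.
 * All names here are illustrative, not taken from the slides. */
typedef struct {
    int  data;
    bool ready;  /* producer asserts: data is valid          */
    bool ack;    /* consumer asserts: data has been consumed */
} channel;

/* Producer side: publish a value only if the previous one was consumed. */
static bool chan_send(channel *c, int v)
{
    if (c->ready)
        return false;          /* receiver has not taken the last value */
    c->data  = v;
    c->ready = true;
    c->ack   = false;
    return true;
}

/* Consumer side: take the value and acknowledge it. */
static bool chan_recv(channel *c, int *out)
{
    if (!c->ready)
        return false;          /* nothing to consume yet */
    *out     = c->data;
    c->ready = false;
    c->ack   = true;
    return true;
}

/* The "+" node fires only when both inputs are ready and its output is free;
 * no global clock or central scheduler decides when it runs. */
static bool adder_fire(channel *a, channel *b, channel *sum)
{
    int x, y;
    if (!a->ready || !b->ready || sum->ready)
        return false;
    chan_recv(a, &x);
    chan_recv(b, &y);
    return chan_send(sum, x + y);
}
```

With only one input present, `adder_fire` returns false and changes nothing; once both inputs arrive it consumes them, acknowledges each producer, and publishes the sum.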
Dataflow Graphs
int plus(int x, int y)
{
    return x + y;
}
From Control Flow to Data Flow
Conditionals = Speculation
int cond(int p, int x, int y)
{
    int z;
    if (p)
        z = x;
    else
        z = y;
    return z;
}
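One way to read the slide title is that both arms of the conditional are evaluated and a multiplexor picks the result. The sketch below rewrites cond( ) in that style; `mux` and `cond_dataflow` are hypothetical names, and the graph CASH actually builds is not shown in this transcript.

```c
/* Select one of two already-computed values based on a predicate. */
static int mux(int pred, int if_true, int if_false)
{
    return pred ? if_true : if_false;
}

/* Dataflow-style view of cond(): both arms produce a value up front
 * (speculatively), and the predicate only chooses which one is kept. */
int cond_dataflow(int p, int x, int y)
{
    int z_then = x;            /* value from the "then" arm */
    int z_else = y;            /* value from the "else" arm */
    return mux(p != 0, z_then, z_else);
}
```

This is only safe because neither arm has side effects; operations such as stores would need predication rather than plain speculation.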
Critical Paths
if (x > 0) y = -x;
else y = b*x;
[Figure: dataflow graph with inputs x, b, 0; operators *, -, >, !; output y]
Executing Lenient Operators
if (x > 0) y = -x;
else y = b*x;
[Figure: dataflow graph with inputs x, b, 0; operators *, -, >, !; output y]
Up to 40% performance improvement.
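The effect of a lenient operator can be illustrated with a small timing model of the example above. A strict multiplexor waits for every input; a lenient one emits its output as soon as the predicate and the selected input have arrived. The latencies (`LAT_CMP`, `LAT_NEG`, `LAT_MUL`) and all type and function names are assumptions made for illustration, not figures from the talk.

```c
/* A value together with the time at which it becomes available. */
typedef struct { int value; int at; } timed;

static int max2(int a, int b) { return a > b ? a : b; }

/* Strict mux: output is ready only after the predicate and BOTH data inputs. */
static timed mux_strict(timed p, timed t, timed f)
{
    timed out = { p.value ? t.value : f.value,
                  max2(p.at, max2(t.at, f.at)) };
    return out;
}

/* Lenient mux: output is ready after the predicate and the SELECTED input. */
static timed mux_lenient(timed p, timed t, timed f)
{
    timed sel = p.value ? t : f;
    timed out = { sel.value, max2(p.at, sel.at) };
    return out;
}

/* Hypothetical operator latencies in cycles; inputs x and b arrive at time 0. */
enum { LAT_CMP = 1, LAT_NEG = 1, LAT_MUL = 4 };

/* if (x > 0) y = -x; else y = b*x;  evaluated with either kind of mux. */
static timed example(int x, int b, int lenient)
{
    timed p   = { x > 0, LAT_CMP };
    timed neg = { -x,    LAT_NEG };
    timed mul = { b * x, LAT_MUL };
    return lenient ? mux_lenient(p, neg, mul) : mux_strict(p, neg, mul);
}
```

Under these assumed latencies, when x > 0 the strict version is held up by the multiplier (ready at cycle 4) although its result is discarded, while the lenient version is ready at cycle 1. When x <= 0 both wait for the multiply.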
Pipelining
Pipelined Cycles
N 903
Y 653
Loop Pipelining
Pipe FIFO Cycles
N 0 903
N 1 903
Y 0 653
Y 1 474
Y 2 408
Y 3 408
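The cycle counts in the table can be turned into speedups over the unpipelined run. The sketch below copies the numbers from the table; reading the FIFO column as a buffer depth is an assumption, and the transcript does not name the benchmark.

```c
/* One row of the table: pipelined (0 = N, 1 = Y), FIFO column, cycles. */
struct run { int pipelined; int fifo; int cycles; };

static const struct run runs[] = {
    {0, 0, 903}, {0, 1, 903}, {1, 0, 653},
    {1, 1, 474}, {1, 2, 408}, {1, 3, 408},
};

/* Speedup of row i relative to the first row (unpipelined, FIFO 0). */
static double speedup(int i)
{
    return (double)runs[0].cycles / (double)runs[i].cycles;
}
```

For the pipelined rows this gives about 1.38 (FIFO 0), 1.91 (FIFO 1), and 2.21 (FIFO 2 and 3); the FIFO column alone makes no difference without pipelining, and nothing is gained beyond FIFO 2.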
ASH Features
• What you code is what you get
  – no hidden control logic
  – really lean hardware (no CAM, decoders, multiported files, etc.)
• Compiler has complete control
• Dynamic scheduling => latency tolerant
• Naturally exploits ILP, even across loop iterations
Conclusions
• ASH = Compiler-synthesized hardware
• ASH matches program parallelism
• Dynamically scheduled RH
• ASH scales with
  – clock frequency
  – transistors
  – program size
Backup Slides
Reconfigurable Hardware
• Universal gates and/or storage elements
• Interconnection network
• Programmable switches
Main RH Ingredient: RAM Cell
• Switch controlled by a 1-bit RAM cell
• Universal gate = RAM
[Figure: RAM holding 0001, addressed by a0 and a1, with data output a1 & a2; switch with data in, control, data]
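The "universal gate = RAM" idea can be sketched as a 2-input lookup table: the stored bits are the truth table and the gate inputs form the read address. The function name `lut2`, the bit ordering, and the configuration constants are illustrative assumptions, not details from the slide.

```c
#include <stdint.h>

/* A 2-input lookup table. The low 4 bits of `config` stand in for four
 * 1-bit RAM cells; inputs a1 and a0 form the 2-bit read address.
 * Assumed convention: bit i of config is the output for address i. */
static inline int lut2(uint8_t config, int a1, int a0)
{
    unsigned addr = ((unsigned)(a1 & 1) << 1) | (unsigned)(a0 & 1);
    return (config >> addr) & 1;
}

/* Hypothetical configuration words under that convention. */
enum {
    LUT_AND = 0x8,   /* 1 only at address 3 (a1 = 1, a0 = 1) */
    LUT_OR  = 0xE,   /* 1 at addresses 1, 2, 3               */
    LUT_XOR = 0x6    /* 1 at addresses 1, 2                  */
};
```

Rewriting the stored bits changes the gate without touching the wiring, which is what makes the same cell usable for any 2-input function.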
Stubs
Program:
a( ) { r = b(b_args); }
b(b_args) { }
With b on the RH:
a( ) { r = b’(b_args); }
b’(b_args) {
    send_rh(b_args);
    invoke_rh(b);
    r = receive_rh( );
    return r;
}
Dispatcher Stubs
Program:
a( ) { r = b(b_args); }
b(b_args) { if (x) c( ); return r; }
c( ) { }
b’(b_args) {
    send_rh(b_args);
    invoke_rh(b);
    while (1) {                      /* independent of b */
        com = get_rh_command( );
        if (! com) break;
        (*com)( );                   /* c’s stub */
    }
    r = receive_rh( );
    return r;
}
C’s Stub
Program:
a( ) { r = b(b_args); }
b(b_args) { if (x) c( ); return r; }
c( ) { }
c’( ) {
    receive_rh(c_args);
    r = c(c_args);
    send_rh(r);
    invoke_rh(return_to_rh);         /* back */
}
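The dispatcher and callback stubs can be exercised together in a purely software mock. Everything on the RH side here is a stand-in: `rh_mailbox`, `rh_state`, and `invoke_rh_b` are hypothetical, `c` is given an arbitrary body, and the real send_rh / invoke_rh / receive_rh primitives are not specified in this transcript. The sketch assumes b runs on the RH and calls back to c on the CPU when its argument is nonzero.

```c
#include <stddef.h>

typedef void (*stub_fn)(void);

/* --- Hypothetical mock of the CPU <-> RH interface --- */
static int rh_mailbox;   /* single-word channel shared by both sides   */
static int rh_state;     /* 0 = idle, 1 = b wants to call c, 2 = done  */
static int c_calls;      /* counts how often c ran on the CPU          */

static void send_rh(int v)   { rh_mailbox = v; }
static int  receive_rh(void) { return rh_mailbox; }

/* Procedure that stays on the CPU (arbitrary body for the demo). */
static int c(int arg) { c_calls++; return arg * 2; }

/* c' : runs on the CPU when the RH asks for c. */
static void c_stub(void)
{
    int arg = receive_rh();  /* arguments marshalled by the RH          */
    int r   = c(arg);
    send_rh(r);              /* result goes back; the mock needs no     */
                             /* explicit invoke_rh(return_to_rh) step   */
}

/* Mock of b running on the RH: b(x) { if (x) c(x); return r; } */
static void invoke_rh_b(void)
{
    int x = rh_mailbox;
    if (x) { rh_mailbox = x; rh_state = 1; }  /* request a call to c  */
    else   { rh_mailbox = 0; rh_state = 2; }  /* finished immediately */
}

/* Returns the stub the RH wants executed, or NULL once b has finished. */
static stub_fn get_rh_command(void)
{
    if (rh_state == 1) { rh_state = 2; return c_stub; }
    return NULL;
}

/* b' : the dispatcher stub; its loop does not depend on what b calls. */
int b_stub(int x)
{
    send_rh(x);
    invoke_rh_b();
    for (;;) {
        stub_fn com = get_rh_command();
        if (!com)
            break;
        com();
    }
    return receive_rh();
}
```

Calling `b_stub(3)` triggers one callback to c and returns its result; `b_stub(0)` returns without any callback, so the dispatcher loop exits on its first iteration.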
Input to Output
int io(int x)
{
    return x;
}
Loops
int loop()
{
    int w = 10;
    while (w > 0)
        w--;
    return w;
}
Pointers and Arrays
int a[10];
void pointer(int *p)
{
    a[2] += a[4] + *p;
}
Pointers and Loops
int sum()
{
    int s = 0;
    int i;
    for (i = 0; i < 10; i++)
        s += a[i];
    return s;
}