COMPUTER SCIENCE & ENGINEERING
Compiled code acceleration on FPGAs
W. Najjar, B. Buyukkurt, Z. Guo, J. Villareal, J. Cortes, A. Mitra
Computer Science & Engineering
University of California Riverside
28 September 2007 Future of Computing - W. Najjar
Why?
Are FPGAs a New HPC Platform?
Comparison of a dual-core Opteron (2.5 GHz) to Virtex-4 and Virtex-5 FPGAs on double-precision floating point (dp fp)
Balanced allocation of adders, multipliers and registers
Use both DSP and logic for multipliers, run at lower speed
Logic & wires for I/O interfaces
(dp) Gflop/s   Opt    V-4    V-5
MAc            10     15.9   28.0
Mult            5     12.0   19.9
Add             5     23.9   55.3

Watts          Opt    V-4    V-5
               95     25     ~35

Source: David Strensky, FPGAs Floating-Point Performance -- a pencil and paper evaluation, in HPCwire.com
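The two tables above can be combined into a rough performance-per-watt figure. The sketch below is only arithmetic on the slide's numbers; the function and enum names are made up for illustration, and the Virtex-5 power is the slide's approximate "~35" value, so its ratios are estimates.

```c
/* Device and operation indices (hypothetical names, for illustration only). */
enum { OPTERON = 0, VIRTEX4 = 1, VIRTEX5 = 2, NUM_DEVICES = 3 };
enum { MAC = 0, MULT = 1, ADD = 2, NUM_OPS = 3 };

/* Double-precision Gflop/s, copied from the table above. */
static const double gflops[NUM_OPS][NUM_DEVICES] = {
    {10.0, 15.9, 28.0},   /* MAc  */
    { 5.0, 12.0, 19.9},   /* Mult */
    { 5.0, 23.9, 55.3},   /* Add  */
};

/* Watts, copied from the table above; 35.0 stands in for "~35". */
static const double watts[NUM_DEVICES] = {95.0, 25.0, 35.0};

/* Gflop/s delivered per watt for one operation on one device. */
double gflops_per_watt(int op, int device)
{
    return gflops[op][device] / watts[device];
}

/* How many times more Gflop/s per watt a device gets than the Opteron. */
double efficiency_gain_over_opteron(int op, int device)
{
    return gflops_per_watt(op, device) / gflops_per_watt(op, OPTERON);
}
```

With these inputs, the Virtex-4 works out to about 0.64 Gflop/s per watt on multiply-accumulate against about 0.11 for the Opteron, roughly a 6x gap.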
ROCCC
Riverside Optimizing Compiler for Configurable Computing
Code acceleration
  By mapping of circuits to FPGA
  Achieve same speed as hand-written VHDL codes
Improved productivity
  Allows design and algorithm space exploration
Keeps the user fully in control
  We automate only what is very well understood
Challenges
FPGA is an amorphous mass of logic
  Structure provided by the code being accelerated
  Repeatedly applied to a large data set: streams
Languages reflect the von Neumann execution model:
  Highly structured and sequential (control driven)
  Vast randomly accessible uniform memory

CPUs (& GPUs)          FPGAs
Temporal computing     Spatial computing
Sequential             Parallel
Centralized storage    Distributed storage
Control flow driven    Data flow driven
ROCCC Overview
Limitations on the code:
  No recursion
  No pointers
[Figure: ROCCC compilation flow. Inputs: C/C++, Java. Stages: high level transformations (procedure, loop and array optimizations) producing Hi-CIRRF; low level transformations (instruction scheduling, pipelining and storage optimizations) producing Lo-CIRRF; code generation. Outputs: SystemC, VHDL, binary. Targets: FPGA, CPU, GPU, DSP, custom unit.]
CIRRF: Compiler Intermediate Representation for Reconfigurable Fabrics
A Decoupled Execution Model
[Figure: input memory (on or off chip) -> mem fetch unit -> input buffer -> multiple loop bodies, unrolled and pipelined -> output buffer -> mem store unit -> output memory (on or off chip)]
Decoupled memory access from datapath
Parallel loop iterations
Pipelined datapath
Smart buffer (input) does data reuse
Memory fetch and store units, data path configured by compiler
Off chip accesses platform specific
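The data reuse done by the input smart buffer can be illustrated in software. The sketch below is not ROCCC's generated hardware; it assumes a 3-element sliding-window sum of products as the workload, and the `fetch_unit` struct and function names are hypothetical. It counts memory fetches with and without a window buffer that keeps already-fetched elements.

```c
#include <stddef.h>

#define TAPS 3

/* Hypothetical stand-in for the memory fetch unit: counts every read. */
typedef struct {
    const int *mem;
    size_t fetches;
} fetch_unit;

static int fetch(fetch_unit *fu, size_t addr)
{
    fu->fetches++;
    return fu->mem[addr];
}

/* No reuse: every iteration re-reads its whole window from memory.
   Returns the number of fetches issued. */
size_t fir_naive(const int *a, int *b, size_t n, const int t[TAPS])
{
    fetch_unit fu = { a, 0 };
    for (size_t i = 0; i + TAPS <= n; i++) {
        int acc = 0;
        for (size_t k = 0; k < TAPS; k++)
            acc += t[k] * fetch(&fu, i + k);
        b[i] = acc;
    }
    return fu.fetches;
}

/* With reuse: the window lives in a small buffer, and each iteration
   shifts it by one and fetches only the single new element. */
size_t fir_buffered(const int *a, int *b, size_t n, const int t[TAPS])
{
    fetch_unit fu = { a, 0 };
    int win[TAPS];
    if (n < TAPS)
        return 0;
    /* Preload TAPS-1 elements one slot to the right, so the first
       shift inside the loop lines the window up on a[0..TAPS-1]. */
    for (size_t k = 0; k + 1 < TAPS; k++)
        win[k + 1] = fetch(&fu, k);
    for (size_t i = 0; i + TAPS <= n; i++) {
        int acc = 0;
        for (size_t k = 0; k + 1 < TAPS; k++)
            win[k] = win[k + 1];
        win[TAPS - 1] = fetch(&fu, i + TAPS - 1);
        for (size_t k = 0; k < TAPS; k++)
            acc += t[k] * win[k];
        b[i] = acc;
    }
    return fu.fetches;
}
```

For n input elements the naive version issues 3(n-2) fetches and the buffered one issues n. The saving grows with the window size, which is why two-dimensional image windows benefit far more than this one-dimensional case.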
So far, working compiler with …
Extensive optimizations and transformations
  Traditional and FPGA specific
  Systolic array, pipelined unrolling, look-up tables
Compiler + hardware support for data reuse
  > 98% reduction in memory fetches on image codes
Efficient code generation and pipelining
  Within 10% of hand-optimized HDL codes
Import of existing IP cores
  Leverages huge wealth, integrated with C source code
Support for dynamic partial reconfiguration
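Example: 3-tap FIR

#define N 516
void begin_hw();   /* marks the start of the code mapped to hardware */
void end_hw();     /* marks the end of that code */
int main()
{
    int i;
    const int T[3] = {3, 5, 7};   /* coefficients */
    int A[N], B[N];

    begin_hw();
L1: for (i = 0; i <= (N-3); i = i+1) {
        /* indices of A[]: i, i+1, i+2 */
        B[i] = T[0]*A[i] + T[1]*A[i+1] + T[2]*A[i+2];
    }
    end_hw();
}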
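To give a feel for what "multiple loop bodies, unrolled" means for this loop, here is a plain-C sketch of the FIR loop unrolled by a factor of two, next to a reference version. This is a hand-written illustration, not ROCCC output; the function names and the unroll factor are assumptions.

```c
#include <stddef.h>

/* Reference: same computation as the FIR loop, one output per iteration. */
void fir3_reference(const int *a, int *b, size_t n, const int t[3])
{
    for (size_t i = 0; i + 2 < n; i++)
        b[i] = t[0]*a[i] + t[1]*a[i+1] + t[2]*a[i+2];
}

/* Unrolled by two: each iteration holds two independent loop bodies.
   In hardware the two bodies could be laid out side by side; here they
   simply share the loaded values a1 and a2. */
void fir3_unrolled2(const int *a, int *b, size_t n, const int t[3])
{
    size_t i = 0;
    for (; i + 3 < n; i += 2) {
        int a0 = a[i], a1 = a[i+1], a2 = a[i+2], a3 = a[i+3];
        b[i]     = t[0]*a0 + t[1]*a1 + t[2]*a2;   /* body for index i   */
        b[i + 1] = t[0]*a1 + t[1]*a2 + t[2]*a3;   /* body for index i+1 */
    }
    /* Remainder iteration when the number of outputs is odd. */
    for (; i + 2 < n; i++)
        b[i] = t[0]*a[i] + t[1]*a[i+1] + t[2]*a[i+2];
}
```

Both functions write n-2 outputs and produce the same values. The unrolled form exposes two multiply-add trees per iteration that do not depend on each other, which is the kind of parallelism a spatial datapath can exploit.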
RC Platform Models
[Figure: three reconfigurable-computing platform models, numbered 1, 2 and 3, built from CPUs, FPGAs, memory interfaces and SRAM; one model connects CPU, memory, FPGA and SRAM blocks through a fast network.]
What we have learned so far
Big speedups are possible
  10x to 1,000x on application codes, over Xeon and Itanium: molecular dynamics, bio-informatics, etc.
Works best with streaming data
New paradigms and tools
  For spatio-temporal concurrency
  Algorithms, languages, compilers, run-time systems, etc.
Future? Very wide use of FPGAs
Why? High throughput (> 10x) AND low power (< 25%)
How? Mostly in Models 2 and 3, initially
  Model 2: See Intel QuickAssist, Xtremedata & DRC
  Model 3: SGI, SRC & Cray
Contingency
  Market brings price of FPGAs down
  Availability of some software stack for savvy programmers, initially
Potential Multiple “killer apps” (to be discovered)