
Erik D’Hollander

Ghent University, Belgium

Outline

1. Super desktop GPU/FPGA architecture

2. Programming tool chain

3. FPGA vs. GPU strengths

4. Roofline performance model for FPGA

5. Tuning performance

6. Optimizing compute resources

7. Conclusion

Supercomputing 1969-2018

• 1969: MFlops

• 1985: GFlops

• 1997: TFlops

• 2008: PFlops

• 2018: EFlops?

[Chart: peak performance (10^3–10^18 flops) of supercomputers, 1969–2011: CDC 7600, CDC STAR, Cray X-MP, Cray-2, Fujitsu NWT, Hitachi SR2201, Intel ASCI, NEC Earth Simulator, IBM Blue Gene, IBM Roadrunner, Tianhe-I, K]

Trend: MFLOPS(y) = 1.72^(y-1969)

Trendlines

• Supercomputing FLOPS > Moore’s law

• Memory speed increase << Moore’s law

R² = 0.97

[Chart: log10 flops vs. year (1960–2020): Moore's law, MFlops trendline, and relative memory speed increase]


PC today

Super desktop with GP-GPU and FPGA

• Host = Supermicro PC

• Accelerators =

– GPGPU Tesla C2050: highly regular parallel apps.

– FPGA board Pico EX500 with 2x M501 Virtex-6: configurable, massively parallel apps., low power

“GUDI” Tetra project supported by IWT Flanders Belgium, EhB, VUB and UGent

Super desktop with GP-GPU and FPGA

Combining GPU and FPGA strengths

• Image processing + Bio-informatics

• Face recognition + Security

• Audio processing + HMM speech recognition

• Traffic analysis + Neural network control

Super desktop with GP-GPU and FPGA

• Internal architecture and interconnections

Super desktop with GP-GPU and FPGA

• Hybrid system: CPU, 2 FPGAs, GP-GPU

Super desktop with GP-GPU and FPGA

• Internal bandwidths: CPU ↔ memory: 19.2 GB/s; CPU ↔ accelerators: 25.6 GB/s (QPI)

Super desktop with GP-GPU and FPGA

• Internal bandwidths: CPU ↔ FPGAs: 8 GB/s; CPU ↔ GP-GPU: 8 GB/s

Super desktop with GP-GPU and FPGA

• Internal bandwidths: GPU: SMP ↔ global memory: 115.0 GB/s; SMP ↔ shared memory: 73.5 GB/s

Super desktop with GP-GPU and FPGA

• Internal bandwidths: FPGA: DSP/logic ↔ Block RAM: 386 GB/s; DSP/logic ↔ PCIe switch: 4 GB/s; DSP/logic ↔ DDR3 RAM: 3.2 GB/s

Super desktop with GP-GPU and FPGA

• Heterogeneous architecture:

– 3 computing architectures

– non-uniform memories

Programming tool chain

• Algorithm decomposed in GPU, Host and FPGA parts

Programming tool chain

• FPGA architecture generated with High Level Synthesis tools (C to VHDL compilers)

Programming tool chain

• Bitmap files = hardware procedure calls

Programming tool chain

• Code executed on combined platform

• Communication via PCIe

Heterogeneous computing

Data transfer

• GPU: AllDataToDev → calculate → AllResultToHost (*)

• FPGA: StreamToDev → calculate → StreamToHost

[Diagram: CPU ↔ GPU with local memory over PCIe; CPU ↔ FPGA via PCIe stream]

(*) unless explicit double buffering

Comparison axes

• Speed: computational power

• Communication: bandwidth/latency

• Programmability: IDE efficiency

[Diagram: speed / communication / programmability comparison triangle]

Programming environment

• Programming language: C

• GPU: CUDA, OpenCL

– C → PTX (Parallel Thread Execution)

• FPGA: HLS (High Level Synthesis)

– C → VHDL

– History:

• AutoESL (Xilinx) → Vivado HLS
• Catapult C from Mentor Graphics
• C-to-HDL tool from Politecnico di Milano (Italy)
• C-to-Verilog tool from www.c-to-verilog.com
• DIME-C from Nallatech
• Handel-C from Celoxica (defunct)
• HercuLeS (C/assembly-to-VHDL) tool
• Impulse C from Impulse Accelerated Technologies
• Nios II C-to-Hardware Acceleration Compiler from Altera
• ROCCC 2.0 (free and open-source C-to-HDL tool) from Jacquard Computing Inc.
• SPARK (a C-to-VHDL tool) from University of California, San Diego
• SystemC from Celoxica (defunct)

FPGA high level synthesis compilers

• ROCCC (Riverside Optimizing Compiler for Configurable Computing)

– target:
  • platform-dependent modules (IP cores) into a library
  • platform-independent systems use modules as functions; replicate, parallelize and pipeline

– optimizations:
  • low level: arithmetic balancing
  • high level: loop unrolling, fusion, wavefront, mul/div elimination, subexpression elimination
  • data optimizations: stream with smart buffer

– output:
  • VHDL design + testbench
  • PCore (Xilinx)

FPGA high level synthesis compilers

• AutoESL:

– target:
  • Xilinx FPGAs

– optimizations:
  • code: loop unroll, fusion, pipeline, inline
  • data: remap, partition, arrays, reshape, resource, stream
  • interface selection: handshake, fifo, bus, register, none, …

– output:
  • VHDL design
  • performance reports: timing, design and loop latency, utilization, area, power, interface
  • design viewer with timeline, registers and interfaces, with links back to source code

AutoESL programming example: Tuning design for performance

• Simple example: sum of array (N = 10^8): for (i = 0; i < N; i++) sum += A[i];

• No optimizations: AutoESL reports 2·N = 2·10^8 cycles

• AutoESL Designer view: 2 cycles/add

Unroll for parallelism

• Unroll 8 times → arithmetic balancing (4 // adds)

• AutoESL directive:

• Designer view: only 2 // adds?


Increase # memory ports

• Dual-port memory: only 2 loads at a time!

• I/O bottleneck, increase # mem ports

Partition data for // access

• Partition A over 4 memories (=8 ports, 256 bits)

• 8 loads, 4 // adds
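The unroll and partition steps above can be sketched in C (pragma spellings as in Vivado HLS, AutoESL's successor; a sketch, not the original source — the factors are the ones from these slides):

```c
#include <stdint.h>

/* Sum reduction prepared for HLS: unrolling exposes parallel adds,
 * cyclic array partitioning supplies the memory ports to feed them. */
int64_t sum_array(const int32_t A[], int n)
{
/* Spread A cyclically over 4 dual-port BRAMs -> 8 read ports */
#pragma HLS ARRAY_PARTITION variable=A cyclic factor=4
    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
/* Replicate 8 loop bodies; arithmetic balancing turns the chain of
 * dependent adds into a tree of parallel adds */
#pragma HLS UNROLL factor=8
        sum += A[i];
    }
    return sum;
}
```

A plain C compiler ignores the pragmas, so the same source is both the testbench and the hardware description.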

Balance unroll and partitioning

• Impact of unrolling and partitioning (N = 10^8)

• Best result: 64 unroll, 32 memory ports, speedup = 16

[Chart: number of cycles (0–3·10^8) vs. partition factor (1–1000) for unroll factors 1, 8, 64, 512; with 2 ports only, then partition = 2/4/8/16 giving 4/8/16/32 // streams (dual-port); small partitions are I/O-bound, large unroll factors resource-bound]

Programming Productivity

• Compare lines of C vs. lines of VHDL

• Order of magnitude speedup

• VHDL design is correct

Code         C    AutoESL bare  AutoESL opt  Ratio AutoESL/C
Sum Array    16   266           6,346        17 - 397
Erosion 3x3  31   195           1,067        6 - 34
Gaxpy        13   374           3,904        29 - 300

Performance evaluation Roofline Performance Model

• What is it?

• Why is it required?

• How is it able to compare both architectures?

Roofline model

• Peak Performance (PP) is limited by

– Compute power, CP GFlops/s

– I/O Bandwidth, BW GBytes/s

– Arithmetic Intensity, AI Flops/Byte

• Hardware limited PP = CP

• I/O limited PP = BW*AI

• PP = Min (CP, BW*AI)

[Chart: roofline — PP (GOps/s) vs. AI (Ops/Byte); the slanted part is PP = BW·AI, the flat roof is PP = CP]
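The roofline formula translates directly into code (a minimal sketch; the example numbers come from the FPGA slides that follow):

```c
/* Roofline: attainable peak performance is the lower of the compute
 * roof (CP) and the bandwidth-times-intensity slope (BW * AI). */
double roofline_gops(double cp_gops, double bw_gbs, double ai_ops_per_byte)
{
    double io_limit = bw_gbs * ai_ops_per_byte;
    return io_limit < cp_gops ? io_limit : cp_gops;
}
```

For instance, roofline_gops(958.5, 386, 1.0) = 386 Gops/s (BRAM-bound) while roofline_gops(958.5, 4, 1.0) = 4 Gops/s (a single PCIe stream).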

Roofline model

• Roofline model for FPGA. I/O limit?

BRAM: 386 GB/s → 386 Gops/s @ AI = 1 op/byte

Roofline model

• Roofline model for FPGA. I/O limit?

32 streams @ 4 GB/s → 128 Gops/s @ AI = 1 op/byte

(Pico Computing firmware allows 32 streams)

Roofline model

• Roofline model for FPGA. I/O limit?

1 stream @ 4 GB/s → 4 Gops/s @ AI = 1 op/byte

Roofline model

• Roofline model for FPGA. Computation limit?

• 32-bit addition on Virtex-6: resource consumption

Resource  AVAILABLE  ADD_DSP  ADD_Logic
LUT       98125      0        32
FF        201715     0        32
DSP       768        1        0

Total adders: 768 (DSP) + 3066 (logic) = 3834 @ 250 MHz → 958.5 Gops/s
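The compute roof follows from the adder counts in the table above; as a back-of-the-envelope check in C (figures as on the slide, one add per cycle per adder assumed):

```c
/* Peak 32-bit add rate on the Virtex-6: every DSP48E1 slice gives one
 * adder, and each LUT-built adder costs 32 LUTs. */
double add32_peak_gops(int luts_available, int dsps_available, double fclk_ghz)
{
    int logic_adders = luts_available / 32;           /* 98125/32 -> 3066 */
    int total_adders = dsps_available + logic_adders; /* 768 + 3066 = 3834 */
    return total_adders * fclk_ghz;                   /* 1 add/cycle each  */
}
```

At 250 MHz this yields the 958.5 Gops/s ceiling used in the roofline.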

Roofline model

• Roofline model for FPGA.

Roofline model

• Roofline model for GPU

Roofline model

• Roofline model for GPU and FPGA combined

Experimental Results: FPN

FPN (Fixed Pattern Noise Correction) algorithm Output pixel = f(input pixel, gain, offset, origin)

Requires 4 input bytes to generate 1 output byte

Computational intensity = 1/4 (output overlaps with input)

Pico stream = 16 bytes @ 250 MHz = 4 GB/s

One full-duplex stream fits 4 FPNs

Experimental Results: FPN

Max number of FPNs?: Logic Resources

FPGA logic resources allow 96 full-duplex streams

Peak performance = 96 * 4 Ops / 4ns = 96 Gops/s

Experimental Results: FPN

Max number of streams? : Available Bandwidth

AI = 1/4 (pipelined output overlaps with input)

I/O limited performance = BW*AI

32 Pico streams = 32*4GB/4 = 32 Gops/s

1 PCIe stream = 4 GB/4 = 1 Gops/s
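The per-pixel FPN kernel can be sketched in C. The slide only states output = f(input pixel, gain, offset, origin) without giving f, so this is one common form (offset subtraction, fixed-point gain, saturation) — the Q2.6 gain scaling is an assumption, not from the slides:

```c
#include <stdint.h>

/* Fixed Pattern Noise correction, one plausible per-pixel form:
 * subtract the per-pixel dark offset, scale by a per-pixel gain
 * (assumed Q2.6 fixed point), re-add the origin, saturate to 8 bits.
 * The 4 input bytes per output byte give the AI = 1/4 on the slide. */
uint8_t fpn_correct(uint8_t pixel, uint8_t gain, uint8_t offset, uint8_t origin)
{
    int v = ((pixel - offset) * gain) / 64 + origin;
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (uint8_t)v;
}
```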

Experimental Results: FPN

Max performance on the Pico board (32 PicoStreams)

Experimental Results: FPN

Max performance on the combined platform

Image Erosion 3x3

• Example: 3x3 erosion pixel(i,j) = Min(neighbor pixels) = 1 “operation”

• Handwritten VHDL: 9 cycles for 1 computational block (CB)

• Peak? Virtex 6 FPGA accommodates 1536 CBs @ 250 MHz clock rate → PP = 42.6 Gops/s
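For reference, the "1 operation" counted above is simply the 3x3 neighbourhood minimum; a plain-C software version (border pixels skipped for brevity, not part of the slide) looks like:

```c
#include <stdint.h>

/* Reference 3x3 grayscale erosion: each output pixel is the minimum of
 * its 3x3 neighbourhood. One output pixel = 1 "operation" in the
 * performance figures. */
void erosion3x3(const uint8_t *in, uint8_t *out, int w, int h)
{
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            uint8_t m = 255;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    uint8_t p = in[(y + dy) * w + (x + dx)];
                    if (p < m) m = p;
                }
            out[y * w + x] = m;
        }
    }
}
```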

Erosion3x3 on FPGA

Erosion3x3 operation requires 9 input bytes to generate 1 output byte

Computational intensity = 1 / 10

Handwritten VHDL code:

– 1 input byte per clock cycle

– 1 output byte every 9 clock cycles

Performance = 27.77 MPixelOperations/s

Erosion3x3 on FPGA

Handwritten VHDL code:

One full-duplex stream fits 16x parallel erosion operations = 1 erosion block:

Experimental Results: Erosion3x3

Max number of erosion blocks? : Logic Resources

FPGA logic resources allow 96 full-duplex streams

Peak performance = 96 * 16 Ops/36ns = 42.66 Gops/s

RESOURCE ESTIMATIONS

Logic Utilization                   128x[16x Erosion[128b]]       96x[16x Erosion[128b]]
                                    Used    Available  Util.      Used    Available  Util.
Number of Slice Registers           214874  301440     71%        174220  301440     58%
Number of Slice LUTs                109095  150720     72%        76423   150720     51%
Number of fully used LUT-FF pairs   49994   213650     23%        33806   248902     14%
Number of bonded IOBs               81      600        14%        81      600        14%
Number of Block RAM/FIFO            542     416        130%       414     416        100%
Number of BUFG/BUFGCTRLs            7       32         22%        7       32         22%
Number of DSP48E1s                  0       768        0%         0       768        0%

Experimental Results: Erosion3x3

I/O limited performance? : Available bandwidth

AI = 1 result per 9 bytes = 1/9

BRAM BW = 386 GB/s limit = 42.88 Gops/s

Pico streams BW = 32 GB/s limit = 3.55 Gops/s

PCIe stream BW = 4 GB/s limit = 0.44 Gops/s

Hardware peak = 42.66 Gops/s

I/O streams limit performance

Experimental Results: Erosion3x3

HandWritten VHDL code: Measurements

Experimental Results: Erosion3x3

ROCCC

Smart buffers reuse data: only 1 fetch and 1 store per pixel

Impact of the smart buffers on the computational intensity:

Improvement of about a factor of (k+1) for larger images

H = height of the image

W = width of the image

k² = dimension of the kernel or mask
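The slide's own formula did not survive extraction, so the following is a reconstructed model, offered only as one plausible reading: with u output pixels computed in parallel and full column reuse in the smart buffer, each shift of a k×k window fetches one new column of (u + k − 1) pixels and stores u results, giving CI(u) = u / (2u + k − 1). For k = 3 this yields CI = 0.25, 0.33, 0.40 at u = 1, 2, 4 — factors of roughly 2.25, 3.0 and 3.6 over the original CI ≈ 1/9, in line with the measured CI×2.25 / ×2.97 / ×3.60:

```c
/* Reconstructed smart-buffer computational-intensity model (assumption,
 * not the slide's own formula): u output pixels per window shift,
 * (u + k - 1) new bytes fetched, u bytes stored. */
double ci_smart_buffer(int u, int k)
{
    return (double)u / (double)(2 * u + k - 1);
}
```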

Experimental Results: Erosion3x3

ROCCC

Manual partial loop unrolling increases data reuse with smart buffers:

Experimental Results: Erosion3x3

ROCCC

Loop unrolling increases computational intensity

[Chart: computational intensity (0.00–0.45) vs. image size (32x32 … 1024x1024) for 1x, 2x and 4x pixels in parallel; CI original = 0.11, improved by x2.25, x2.97 and x3.60 respectively]

Experimental Results: Erosion3x3

ROCCC: Measurements

Experimental Results: Erosion3x3

AutoESL

First implementation:

Extremely similar to the Handwritten VHDL code.

Same Computational Intensity

Experimental Results: Erosion3x3

AutoESL

Partial Loop Unrolling x4:

Experimental Results: Erosion3x3

AutoESL

Partial Loop Unrolling x4:

Erosion 1

Experimental Results: Erosion3x3

AutoESL

Partial Loop Unrolling x4:

Erosion 2

Experimental Results: Erosion3x3

AutoESL

Partial Loop Unrolling x4:

Erosion 3

Experimental Results: Erosion3x3

AutoESL

Partial Loop Unrolling x4

Unrolled loops are pipelined and data reused CI increases (less bytes fetched per operation):

Erosion 4

Experimental Results: Erosion3x3

AutoESL: Measurements

Experimental Results: Erosion3x3

Handwritten VHDL code vs ROCCC vs AutoESL

Experimental Results: Erosion3x3

Internal Performance (32 PicoStreams)

Performance based on the maximum number of streams

[Chart: GPixelOperations/s and number of full-duplex streams (0–120) per version — Handwritten VHDL (Stream / BRAM version), ROCCC 4x parallel (Stream / BRAM, default + inline module), AutoESL Stream (Pipeline; Pipeline + PLU x2 / x4 / x16) — with the maximum-resource performance limit marked]

Highest performance: 96 handwritten CBs

Experimental Results: Erosion3x3

Internal Performance (32 PicoStreams)

Performance based on the 32 available streams

[Chart: GPixelOperations/s and number of full-duplex streams (0–120) for Handwritten VHDL (Stream / BRAM version), ROCCC 4x parallel (Stream / BRAM, default + inline module) and AutoESL Stream (Pipeline; Pipeline + PLU x2 / x4 / x16), with the maximum-resource and maximum-bandwidth performance limits marked; highest performance highlighted]

Conclusion

ROCCC presents the best performance per stream but is resource hungry.

AutoESL offers the best trade-off between performance and resource consumption.

• I/O stress: number of I/O streams limited; only 1 DDR3 memory (too slow); PCIe limited to 8 lanes

• FPGA needs more HPC tweaking

• HLS tools (AutoESL) are productive
